Learn how Windows NT manages disks

The evolution of storage management in Windows NT begins with DOS, Microsoft's first OS. As hard disks became larger, DOS needed to accommodate them. To do so, one of the first steps Microsoft took was to let DOS create multiple partitions, or logical disks, on a physical disk. DOS could format each partition with a different file-system type (i.e., FAT12 or FAT16) and assign each partition a different drive letter.

DOS 3 and 4 were severely limited in the size and number of partitions they could create, but in DOS 5 the partitioning scheme fully matured. DOS 5 was able to divide a disk into any number of partitions of any size. NT borrowed the partitioning scheme that evolved in DOS to provide disk compatibility with DOS and Windows 3.x, and to let the NT development team rely on proven tools for disk management. In NT, Microsoft extended the basic concepts of DOS disk partitioning to support storage-management features that an enterprise-class OS requires: disk spanning and fault tolerance. Starting with the first version of NT, systems administrators have been able to create volumes that comprise multiple partitions. This capability lets large volumes consist of partitions from multiple physical disks and allows implementation of fault tolerance through software-based data redundancy.

Although NT's partitioning support is flexible enough to support most storage-management tasks, this support suffers from several drawbacks. One drawback is that most disk-configuration changes require a reboot to take effect. In today's world of servers that must remain online for months or even years at a time, any reboot—even a planned reboot—is a major inconvenience. Another drawback is that the NT Registry stores advanced disk-configuration information. This arrangement means that moving configuration information is onerous when you move disks between systems, and losing configuration information is easy when you need to reinstall NT. Finally, NT's requirement that each volume have a unique drive letter in the A through Z range places an upper limit on the number of possible local and remote volumes that users can create.

Windows 2000 (Win2K) eases many NT 4.0 storage-management restrictions with a slew of new storage-management enhancements. From volume mount points that remove the limit on the number of possible volumes, to integrated support for file migration to offline storage, to disk management without reboots, Win2K takes NT storage management to a level on par with most advanced UNIX systems. This month, I begin a two-part series that looks at storage-management internals in NT and Win2K. I begin by describing NT 4.0 disk-management implementation, including partitioning, drive-letter assignment, the mount process, and details of NT's software-based fault-tolerant drives. I'll conclude the series by looking at Win2K storage-management changes such as the Logical Disk Manager (LDM) and mount points.

As I've stated, the foundation of NT 4.0 disk management is the partitioning scheme that NT inherited from DOS 5. Before I delve into partitioning, let me define the terminology I use. A disk is a physical storage device such as a hard disk, a 3.5" disk, or a CD-ROM. A disk's hardware divides the disk into sectors, addressable blocks of fixed size. All x86-processor hard-disk sectors are 512 bytes, whereas CD-ROM sectors are typically 2048 bytes. A volume is an object that represents sectors from the same or different partitions that a file system manages as one unit. A volume is typically associated with one partition, but if you create a spanned volume or a volume that has data redundancy, the volume consists of more than one partition, possibly spread across different disks. People use the word drive to refer sometimes to disks and sometimes to volumes. To avoid confusion, I won't use drive.

When you install NT on a computer, one of the first things the OS requires you to do is to create a partition on the system's primary physical disk. NT defines the system volume on this partition and stores the files that it invokes early in the boot process on the system volume. In addition, NT Setup requires you to create a partition to serve as the home for the boot volume, which is where NT Setup installs the NT system files and creates the NT system directory (\winnt). The system and boot volume can be the same volume. The nomenclature that Microsoft defines for system and boot volumes is a little confusing: the system volume is where NT places boot files, including NT Loader (NTLDR) and NTDETECT, and the boot volume is where NT stores OS files such as ntoskrnl.exe, the core kernel file.

The standard BIOS implementations that x86 hardware uses dictate one requirement of NT's partitioning format—that the first sector of the primary disk contain the Master Boot Record (MBR). When an x86 processor boots, the computer's BIOS reads the MBR and treats part of the sector's contents as executable code. The BIOS invokes the MBR code to initiate an OS boot process after the BIOS performs preliminary configuration of the computer's hardware. In the case of Microsoft OSs, including NT, the MBR also contains a partition table. A partition table consists of four entries that define the locations of as many as four primary partitions on a disk. Numerous predefined partition types exist, and a partition's type, which the partition table records, specifies which file system the partition includes. For example, partition types exist for FAT32 and NTFS. A special partition type, extended partition (or Extended Boot Record—EBR), contains another MBR with its own partition table. By using extended partitions, Microsoft's OSs overcome the apparent limit of four partitions per disk, as an MBR's partition table defines. In general, the recursion that extended partitions permit can continue indefinitely, which means that no upper limit exists to the number of possible partitions on a disk. Figure 1 shows an example disk-partitioning scenario.
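The partition table described above can be read with a few lines of Python. This is a minimal sketch, not NT's code: the byte offsets (table at byte 446, four 16-byte entries, a 0x55AA signature in the last two bytes) are the conventional MBR layout and are an assumption the article doesn't spell out, and the function names and the sample disk are hypothetical.

```python
import struct

SECTOR_SIZE = 512
TABLE_OFFSET = 446   # conventional location of the partition table (assumption)
ENTRY_SIZE = 16      # four 16-byte entries
# A few well-known partition-type bytes; the full list is much longer.
# Type 0x07 is shared by NTFS and HPFS; it is labeled NTFS here for brevity.
TYPE_NAMES = {0x01: "FAT12", 0x05: "Extended", 0x06: "FAT16",
              0x07: "NTFS", 0x0B: "FAT32"}


def parse_partition_table(sector):
    """Return the non-empty partition-table entries of an MBR sector."""
    if len(sector) != SECTOR_SIZE or sector[510:512] != b"\x55\xaa":
        raise ValueError("not a valid MBR sector")
    entries = []
    for index in range(4):
        start = TABLE_OFFSET + index * ENTRY_SIZE
        # Fields: boot flag, CHS start (skipped), type, CHS end (skipped),
        # first sector, sector count. All values are little-endian.
        flag, ptype, first, count = struct.unpack(
            "<B3xB3xII", sector[start:start + ENTRY_SIZE])
        if ptype == 0:          # type 0 marks an unused entry
            continue
        entries.append({
            "active": flag == 0x80,
            "type": TYPE_NAMES.get(ptype, hex(ptype)),
            "first_sector": first,
            "sector_count": count,
        })
    return entries


def make_sample_mbr():
    """Build an in-memory MBR with one active NTFS and one extended partition."""
    sector = bytearray(SECTOR_SIZE)
    sector[446:462] = struct.pack("<B3xB3xII", 0x80, 0x07, 63, 2048)
    sector[462:478] = struct.pack("<B3xB3xII", 0x00, 0x05, 2111, 4096)
    sector[510:512] = b"\x55\xaa"
    return bytes(sector)


for entry in parse_partition_table(make_sample_mbr()):
    print(entry)
```

Following an extended partition would mean reading the sector at its first_sector value and parsing that sector's table the same way, which is the recursion the text describes.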

The distinction between primary and extended partitions first becomes evident during NT's boot. The system must mark one primary partition of the primary disk as active. The NT code in the MBR loads the first sector of the active partition into memory, then transfers control to the code that the sector stores. The active partition is the NT system volume. NT designates the first sector of a partition the boot sector because, in the primary partition's case, the sector plays a role in the computer's boot process. However, every partition formatted with a file system has a boot sector that stores information about the structure of the file system on that partition. The second instance in which the distinction between primary and extended partitions becomes evident is drive-letter assignment, which I discuss shortly.

Storage Drivers and Device Objects
NTLDR is the NT OS file that conducts the first portion of the NT boot process. NTLDR resides on the system volume; the boot sector code on the system volume executes NTLDR. NTLDR reads the boot.ini file from the system volume and presents the computer's boot choices to the user. The partition names that boot.ini designates are in the form multi(0)disk(0)rdisk(0)partition(1). These names are Advanced RISC Computing (ARC) names because they're part of a standard partition-naming scheme that the firmware of Alpha and other RISC systems uses. NTLDR translates the name of the boot.ini boot entry that a user selects to the appropriate boot partition, then loads the NT system files (starting with the Registry, ntoskrnl.exe, and the boot drivers) into memory to continue the boot process.
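To make the ARC name format concrete, here is a small sketch that splits a name of the form shown above into its numeric fields. It handles only the multi() form that the text quotes; the function name and the returned dictionary keys are hypothetical, and this is not how NTLDR itself parses boot.ini.

```python
import re

# Matches names such as multi(0)disk(0)rdisk(0)partition(1).
ARC_NAME = re.compile(
    r"^multi\((\d+)\)disk\((\d+)\)rdisk\((\d+)\)partition\((\d+)\)$",
    re.IGNORECASE)


def parse_arc_name(name):
    """Split an ARC partition name into its four numeric components."""
    match = ARC_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized ARC name: {name!r}")
    multi, disk, rdisk, partition = (int(group) for group in match.groups())
    return {"multi": multi, "disk": disk, "rdisk": rdisk, "partition": partition}


print(parse_arc_name("multi(0)disk(0)rdisk(0)partition(1)"))
```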

During initialization, the NT kernel starts the hard disk storage drivers. Storage drivers in NT follow a class/port/miniport architecture, in which Microsoft supplies a class driver that implements functionality common to all storage devices and a port driver that implements functionality common to a particular bus (e.g., SCSI, IDE). OEMs supply miniport drivers that plug into the port driver to interface NT to a particular implementation. For example, Adaptec supplies SCSI miniport drivers for the company's various SCSI controllers. This division's benefits are twofold. First, a miniport developer writes to the miniport environment rather than to the more complex NT driver environment. Second, Microsoft provides class and port drivers for Windows 9x so that miniport drivers that developers write for NT run on Win9x, and vice versa.

As miniport drivers present to the class driver the disks that they identify early in the boot, the class driver invokes the I/O Manager's IoReadPartitionTable function for each disk. IoReadPartitionTable then invokes sector-level disk I/O, which the class, port, and miniport drivers provide, to read a disk's partition table and construct an internal representation of the disk's partitioning. The Disk class driver creates device objects to represent each primary partition (including primary partitions within extended partitions) that the driver obtains from IoReadPartitionTable. Device drivers use device objects to represent both physical and logical devices, including disks, keyboards, and driver management interfaces. The device drivers can name a device object so that other device drivers or applications can open the object and send I/O requests to the object. The Disk class driver creates a HarddiskX directory (in which X is the disk number NT assigns the disk) in the NT Object Manager namespace for each hard disk. The Disk class driver then places partition device objects in the HarddiskX directory of the disk on which the objects reside. For example, Screen 1 shows the contents of the \device\harddisk0 directory on a computer whose primary hard disk has four partitions, which the four numbered partition device objects in the right-hand pane represent. The Disk class driver gives the name partition0 to the device object that represents the entire physical disk.

Whenever a device driver or application sends an I/O request to a device object, the NT I/O Manager routes the request (which comes in an I/O request packet—IRP—a self-contained package) to the device driver that created the target device object. Thus, if an application wants to read the boot sector of the second partition on harddisk0, the application first opens the device object \device\harddisk0\partition2, then sends the object a request to read 512 bytes starting at offset zero on the device. The NT I/O Manager routes the application's request to the Disk class driver, notifying the class driver that the IRP is aimed at the partition2 device object. Because partitions are logical conveniences that NT uses to represent contiguous areas on a physical disk, the class driver must translate offsets that are relative to a partition to offsets that are relative to the beginning of a disk. If partition2 begins 4096 sectors into the disk, the class driver would adjust the IRP's parameters to designate an offset with that value before passing the request to the miniport driver. The miniport driver carries out physical disk I/O and reads requested data into an application buffer designated in the IRP.
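The offset translation in this example is simple arithmetic, sketched below. The function and parameter names are hypothetical, and the bounds check is an assumption added for illustration; the text describes only the adjustment itself.

```python
SECTOR_SIZE = 512  # bytes per hard-disk sector on x86 systems


def to_disk_offset(partition_start_sector, partition_sector_count, offset, length):
    """Translate a partition-relative byte range to a disk-relative byte offset."""
    partition_bytes = partition_sector_count * SECTOR_SIZE
    if offset < 0 or length < 0 or offset + length > partition_bytes:
        raise ValueError("request falls outside the partition")
    return partition_start_sector * SECTOR_SIZE + offset


# Reading the 512-byte boot sector (offset zero) of a partition that begins
# 4096 sectors into the disk; the partition size here is made up.
print(to_disk_offset(4096, 100000, 0, 512))
```

For the request in the text, the class driver would therefore ask for 512 bytes at byte 4096 x 512 = 2,097,152 from the start of the disk.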

The Win32 API is unaware of the NT Object Manager namespace. NT reserves a couple of different namespace subdirectories for Win32's use and names one of these subdirectories \??. In this subdirectory, NT makes available device objects that Win32 applications interact with—including COM and parallel ports—as well as disks. Because disk objects actually reside in other subdirectories, NT uses symbolic links to connect names under \?? with objects located elsewhere in the namespace. For each physical disk on a system, the I/O Manager creates a \??\PhysicalDriveX link that points to \device\harddiskX\partition0 (numbers beginning with 0 replace X). Win32 applications that directly interact with the sectors on a disk open the disk by specifying the name \\.\PhysicalDriveX (in which X is the disk number) to invoke the Win32 CreateFile API. The Win32 application layer converts the name to \??\PhysicalDriveX before handing the name to the NT Object Manager.
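The name conversion and the symbolic-link lookup described above can be modeled as follows. This is a toy model only: the link table stands in for the Object Manager namespace of a hypothetical system with two physical disks, and the function names are invented for the example.

```python
WIN32_DEVICE_PREFIX = "\\\\.\\"   # the four characters \\.\
NT_WIN32_DIRECTORY = "\\??\\"     # the four characters \??\

# Hypothetical symbolic links for a system with two physical disks.
SYMBOLIC_LINKS = {
    "\\??\\PhysicalDrive0": "\\device\\harddisk0\\partition0",
    "\\??\\PhysicalDrive1": "\\device\\harddisk1\\partition0",
}


def win32_to_nt_name(win32_name):
    """Convert a \\\\.\\Name device path to its \\??\\Name form."""
    if not win32_name.startswith(WIN32_DEVICE_PREFIX):
        raise ValueError("not a Win32 device name")
    return NT_WIN32_DIRECTORY + win32_name[len(WIN32_DEVICE_PREFIX):]


def resolve(win32_name):
    """Follow the symbolic link to the device object that backs the name."""
    return SYMBOLIC_LINKS[win32_to_nt_name(win32_name)]


print(resolve("\\\\.\\PhysicalDrive0"))
```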

Drive-Letter Assignments
After the I/O Manager initializes the disk storage drivers, it invokes the internal function IoAssignDriveLetters. This function creates a symbolic link under \?? in the form of a drive letter for each disk partition, as well as for CD-ROMs and 3.5" disks. The drive-letter symbolic links refer to associated partition device objects. The I/O Manager's drive-letter assignment follows a default formula, but you can override the formula by explicitly assigning drive letters in Disk Administrator. After you start Disk Administrator, the program scans the partitions on the system's hard disks and generates a random signature for each partition that Disk Administrator hasn't seen in previous executions. Disk Administrator stores a partition's signature in the partition's boot sector and also in the Registry value HKEY_LOCAL_MACHINE\SYSTEM\DISK\Information. The Information value includes a data structure for each disk partition that incorporates the Disk Administrator signature and the partition's drive letter, if you've assigned one. IoAssignDriveLetters reads the Information value and honors the drive letters you've specified before performing default assignments. The function reads partition signatures and matches them with the data that the Information value stores to correlate partitions with their assigned drive letters.

After IoAssignDriveLetters assigns explicitly specified drive letters, the function starts with the letter C (or the first unassigned letter higher than C) and assigns letters to the first active primary partition of each disk. If a disk has no active primary partition, IoAssignDriveLetters assigns a letter to the first primary partition. In the subsequent phase of assignment, IoAssignDriveLetters gives letters to each partition that is in each disk's extended partitions. Finally, IoAssignDriveLetters creates letters for the remaining unassigned primary partitions.
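The assignment order for hard disk partitions can be sketched as three passes over the disks. The partition records below are a hypothetical representation, not NT's data structures, and the sketch assumes that a partition that already holds an explicit letter is simply skipped in the default passes, a detail the description above leaves open.

```python
import string


def assign_drive_letters(disks):
    """Assign letters to hard disk partitions in the order the text describes.

    disks is a list of disks; each disk is a list of partition dicts with the
    keys 'name', 'kind' ('primary' or 'logical'), 'active', and an optional
    explicit 'letter'.
    """
    assigned = {}
    for disk in disks:                      # honor explicit assignments first
        for part in disk:
            if part.get("letter"):
                assigned[part["name"]] = part["letter"]
    used = set(assigned.values())
    free = [c for c in string.ascii_uppercase[2:] if c not in used]  # C to Z

    def give(part):
        if part["name"] not in assigned and free:
            assigned[part["name"]] = free.pop(0)

    # Pass 1: the first active primary partition of each disk, or the first
    # primary partition if the disk has no active one.
    for disk in disks:
        primaries = [p for p in disk if p["kind"] == "primary"]
        candidates = [p for p in primaries if p.get("active")] or primaries
        if candidates:
            give(candidates[0])
    # Pass 2: every partition inside each disk's extended partitions.
    for disk in disks:
        for part in disk:
            if part["kind"] == "logical":
                give(part)
    # Pass 3: the remaining primary partitions.
    for disk in disks:
        for part in disk:
            if part["kind"] == "primary":
                give(part)
    return assigned


example = [
    [{"name": "disk0-primary1", "kind": "primary", "active": True},
     {"name": "disk0-logical1", "kind": "logical", "active": False},
     {"name": "disk0-primary2", "kind": "primary", "active": False}],
    [{"name": "disk1-primary1", "kind": "primary", "active": False},
     {"name": "disk1-logical1", "kind": "logical", "active": False}],
]
for name, letter in sorted(assign_drive_letters(example).items(), key=lambda kv: kv[1]):
    print(letter, name)
```

In this example the second primary partition on the first disk receives its letter last, after the partitions in both disks' extended partitions.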

After IoAssignDriveLetters has created drive-letter symbolic links for hard disk partitions, the function gives letters to 3.5" disks and then to CD-ROMs. The first two 3.5" disks get the letters A and B, and any others receive the next available letter. You can assign letters to CD-ROMs in Disk Administrator, but rather than storing those assignments in the Information value, Disk Administrator stores the assignments in separate values that share the names of the device objects NT uses to represent the CD-ROMs. For example, a system with one CD-ROM that has an assigned drive letter will have a Registry value \device\cdrom0 beneath HKEY_LOCAL_MACHINE\SYSTEM\DISK that specifies the CD-ROM's assigned drive letter. Screen 2 shows the contents of a system's Object Manager \?? directory and highlights the C drive's symbolic link.

File-System Mounting
That NT assigns a drive letter to a partition doesn't mean that the partition contains data organized in a file-system format NT recognizes. The volume-recognition process consists of a file system claiming ownership of a partition; that process takes place the first time the kernel, a device driver, or an application accesses a file or directory on a partition. After a file-system driver signals its responsibility for a partition, the I/O Manager directs all IRPs aimed at the partition to the owning driver. Mount operations in NT 4.0 consist of three components: file-system driver registration, Volume Parameter Blocks (VPBs), and mount requests.

The I/O Manager oversees the mount process and is aware of available file-system drivers because all file-system drivers register with the I/O Manager when they initialize. The I/O Manager provides the IoRegisterFileSystem function to local disk (rather than network) file-system drivers for this registration. When a file-system driver registers, the I/O Manager stores a reference to the driver in a list that the I/O Manager uses during mount operations.

Every device object contains a VPB data structure, but the I/O Manager treats VPBs as meaningful only for partition device objects. A VPB serves as the link between a partition device object and the device object that a file-system driver creates to represent a mounted file-system instance for that partition. If a VPB's file-system reference is empty, then no file system has mounted the partition. The I/O Manager checks a partition device object's VPB whenever an open API call specifies a filename or directory name on a partition device object. For example, if the I/O Manager assigns drive letter D to the second partition on a system's first hard disk, IoAssignDriveLetters creates a \??\D: symbolic link that resolves to the device object \device\harddisk0\partition2. A Win32 application that attempts to open the \test file on the D drive specifies the name D:\test, which the Win32 subsystem converts to \??\D:\test before invoking NtCreateFile, the kernel's file-open routine. NtCreateFile uses the Object Manager to parse the name, and the Object Manager encounters the \device\harddisk0\partition2 device object with the path \test still unresolved. At that point, the I/O Manager checks to see whether \device\harddisk0\partition2's VPB references a file system. If not, the I/O Manager uses a mount request to ask each registered file-system driver whether the driver recognizes the format of the partition in question as the driver's own. If a file-system driver signals affirmatively, the I/O Manager fills in the VPB and passes the open request with the remaining path (i.e., \test) to the file-system driver. The file-system driver completes the request by using its file-system format to interpret the data that the partition stores. After a mount fills in a partition device object's VPB, the I/O Manager hands subsequent open requests aimed at the partition to the mounted file-system driver. If no file-system driver claims a partition, then RAW, a file-system driver built into NT, claims the partition and fails all requests to open files on the partition. Figure 2 shows a simplified example (i.e., the figure omits the file-system driver's interactions with the NT Cache Manager) of the path followed by I/O that is directed at a mounted partition.
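The mount sequence above (registration, an empty VPB, a mount request to each registered driver, and the RAW fallback) can be modeled in a few classes. Everything here is a toy model under stated assumptions: the class and method names are hypothetical, and recognizing a partition by a marker at the start of its boot sector is a stand-in for the real format checks that file-system drivers perform.

```python
class FileSystemDriver:
    """Toy file-system driver that recognizes a partition by a marker."""

    def __init__(self, name, marker):
        self.name = name
        self.marker = marker

    def recognizes(self, partition):
        # Stand-in for a real format check (assumption for illustration).
        return partition.boot_sector.startswith(self.marker)

    def open(self, partition, path):
        return f"{self.name} opened {path} on {partition.name}"


class RawDriver(FileSystemDriver):
    """Claims any partition that no other driver wants and fails file opens."""

    def __init__(self):
        super().__init__("RAW", b"")

    def recognizes(self, partition):
        return True

    def open(self, partition, path):
        raise OSError(f"no recognized file system on {partition.name}")


class PartitionDevice:
    def __init__(self, name, boot_sector):
        self.name = name
        self.boot_sector = boot_sector
        self.vpb = None   # empty file-system reference: nothing mounted yet


class IoManager:
    def __init__(self):
        self.file_systems = []      # filled in as drivers register
        self.raw = RawDriver()

    def register_file_system(self, driver):
        self.file_systems.append(driver)

    def open_file(self, partition, path):
        if partition.vpb is None:   # first access triggers the mount
            for driver in self.file_systems:
                if driver.recognizes(partition):
                    partition.vpb = driver
                    break
            else:
                partition.vpb = self.raw
        # Later opens go straight to the driver recorded in the VPB.
        return partition.vpb.open(partition, path)


io_manager = IoManager()
io_manager.register_file_system(FileSystemDriver("NTFS", b"NTFS"))
io_manager.register_file_system(FileSystemDriver("FAT", b"FAT"))
d_drive = PartitionDevice("\\device\\harddisk0\\partition2", b"NTFS" + bytes(508))
print(io_manager.open_file(d_drive, "\\test"))
```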

Aside from the boot volume, which a driver mounts while the kernel is initializing, file-system drivers mount most volumes when Chkdsk runs during the blue-screen portion of the boot sequence. Chkdsk accesses each drive letter to see whether the volume associated with the letter requires a consistency check. Mounting can occur more than once for the same disk if the disk uses removable media (e.g., a 3.5" disk device). CD-ROM File System (CDFS) and FAT, NT's two file-system drivers that support removable media, respond to media changes by querying the disk's volume identifier. If either driver sees the volume identifier change, the driver dismounts the disk and attempts to remount it.

Fault Tolerance
NT's I/O architecture permits a powerful feature: dynamic layering of device objects. A device driver can create a device object and attach it to a target device object. The I/O Manager routes requests directed at a target device object to the object's attached device object, if one exists. Device drivers use this mechanism to monitor or change the behavior of device objects that belong to other device drivers. A driver that relies on layering is a filter driver, and when a filter driver receives an IRP aimed at a target device, the filter has full control over the request. The filter can fail the request, create new subrequests, or pass the unmodified request to the target device. NT storage drivers commonly use layering in three places. At the highest level, file-system filter drivers attach to the target device objects that represent mounted partitions that file-system drivers create. A file-system filter driver therefore intercepts requests aimed at mounted volumes so that the driver can implement functionality such as monitoring, encryption, or on-access virus scanning.

If you've installed NT disk performance counters by executing the Diskperf -y command, then you've installed the DiskPerf filter driver. DiskPerf attaches to the device objects that represent physical disks (e.g., \device\harddisk0\partition0) so that DiskPerf can generate performance-related statistics for Performance Monitor to present. If you create a nonstandard volume (such as a volume set, mirrored drive, stripe set, or stripe set with parity) in Disk Administrator, you enable the FtDisk filter driver.
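The layering that filter drivers such as DiskPerf and FtDisk rely on can be sketched as a chain of device objects. This is an illustrative model, not the NT driver interfaces: the class names, the attach and send_request helpers, and the rule that the filter fails requests named "blocked" are all hypothetical.

```python
class DeviceObject:
    def __init__(self, name):
        self.name = name
        self.attached = None   # filter device layered on top, if any
        self.lower = None      # device this one is attached to, if any

    def handle(self, request, trail):
        trail.append(self.name)
        return f"{self.name} completed {request}"


class CountingFilter(DeviceObject):
    """Filter that counts requests, fails some, and passes the rest down."""

    def __init__(self, name):
        super().__init__(name)
        self.request_count = 0

    def handle(self, request, trail):
        trail.append(self.name)
        self.request_count += 1
        if request == "blocked":
            return f"{self.name} failed {request}"
        return self.lower.handle(request, trail)   # pass the request unmodified


def attach(filter_device, target):
    """Layer filter_device on top of whatever is already attached to target."""
    top = target
    while top.attached is not None:
        top = top.attached
    top.attached = filter_device
    filter_device.lower = top


def send_request(target, request):
    """Route a request to the topmost device attached to target."""
    top = target
    while top.attached is not None:
        top = top.attached
    trail = []
    return top.handle(request, trail), trail


disk = DeviceObject("\\device\\harddisk0\\partition0")
perf = CountingFilter("perf-filter")
attach(perf, disk)
print(send_request(disk, "read"))
```

The request is addressed to the disk's device object, but the filter sees it first and decides whether the disk ever does.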

A volume set is a volume that uses two or more partitions to create the image of one contiguous partition. A systems administrator can use partitions from different disks to create a volume set that is larger than any given physical disk on a computer. A mirror is a volume that maintains copies of its data on two partitions. In a mirror, all write operations take place on both partitions, but read operations take place only from one partition. Mirrors tolerate single-disk failures; operation continues on the surviving half of the mirror. A stripe set is a multipartition volume whose data is interleaved between partitions. NT uses a stripe unit of 64KB. The system stores the first 64KB of file-system data on the first partition of the stripe, stores the second 64KB on the second partition, and so on, wrapping back to the first partition after it reaches the last. Stripe sets can improve performance when the partitions are on different disks because I/O operations can proceed in parallel on different disks. Finally, stripe sets with parity are stripe sets that add an extra 64KB block for each row of 64KB stripe units spread across the set's partitions. The extra block stores parity information that NT can use to recover the data stored on one of the set's partitions if the disk on which the partition is located fails. Stripe sets with parity are also known as RAID 5 volumes.
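To show how a parity block makes recovery possible, the sketch below assumes the parity is a bytewise XOR of the data blocks, which is the usual RAID 5 approach; the text above doesn't spell out the calculation. The block contents are tiny made-up values rather than 64KB stripe units, and the helper name is hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks from one row of the stripe, plus their parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# If the disk holding the second block fails, XORing the surviving data
# blocks with the parity block reproduces the missing block.
recovered = xor_blocks([data[0], data[2], parity])
print(recovered == data[1])
```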

Disk Administrator stores advanced volume-configuration information in the HKEY_LOCAL_MACHINE\SYSTEM\DISK\Information value, with the partition drive-letter and signature information, and the FtDisk driver reads this information during the boot process. A data structure that FtDisk manages in the Information value associates partitions that belong to the same volume. Because file-system drivers expect a volume's contents to reside on one partition, without FtDisk, file-system drivers typically don't recognize a volume that consists of multiple partitions. FtDisk therefore attaches itself to every partition device object in a system to manipulate requests aimed at the device objects that constitute advanced volumes.

Some examples of FtDisk's operations will help clarify its role. If a striped volume consists of \device\harddisk0\partition2 and \device\harddisk1\partition3, as Figure 3 shows, and an administrator has assigned drive letter D to the stripe, then the I/O Manager defines the link \??\D: to reference \device\harddisk0\partition2. If FtDisk were not present, an application opening a file on the stripe would receive an error because no file system would understand or mount the partial volume that \device\harddisk0\partition2 represents. With FtDisk present, an FtDisk device object intercepts file-system disk I/O aimed at \device\harddisk0\partition2, and the FtDisk driver adjusts the request before passing it to the disk class driver. The adjustment FtDisk makes configures the request to refer to the correct offset of the request's target stripe on either \device\harddisk0\partition2 or \device\harddisk1\partition3.
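The adjustment can be illustrated with the arithmetic for a plain stripe set. This is a sketch of the general round-robin mapping with a 64KB stripe unit, not FtDisk's actual code; the function name is hypothetical, and a real request that crosses a stripe-unit boundary would also have to be split, which the sketch leaves out.

```python
STRIPE_UNIT = 64 * 1024  # 64KB stripe unit

MEMBERS = [
    "\\device\\harddisk0\\partition2",
    "\\device\\harddisk1\\partition3",
]


def map_stripe_offset(volume_offset, members=MEMBERS):
    """Map a byte offset in the striped volume to (member partition, offset)."""
    unit = volume_offset // STRIPE_UNIT          # which 64KB unit of the volume
    within = volume_offset % STRIPE_UNIT         # position inside that unit
    member = members[unit % len(members)]        # units rotate across members
    member_offset = (unit // len(members)) * STRIPE_UNIT + within
    return member, member_offset


for offset in (0, 64 * 1024, 128 * 1024 + 100):
    print(offset, map_stripe_offset(offset))
```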

In the case of writes to a mirrored volume, FtDisk splits each request so that each half of the mirror receives the full write operation. For mirrored reads, FtDisk performs a read from one half of the mirror, relying on the other half when a read operation fails.
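The mirror behavior can be modeled with two in-memory buffers. This is a toy model under stated assumptions: the class is hypothetical, a failed half is represented by a simple flag, and reads always try the first half before falling back to the second.

```python
class Mirror:
    """Toy mirrored volume backed by two in-memory halves."""

    def __init__(self, size):
        self.halves = [bytearray(size), bytearray(size)]
        self.failed = [False, False]   # set an entry to True to simulate a failure

    def write(self, offset, data):
        # Every healthy half receives the full write.
        for index, half in enumerate(self.halves):
            if not self.failed[index]:
                half[offset:offset + len(data)] = data

    def read(self, offset, length):
        # Read from one half; fall back to the other if the first has failed.
        for index, half in enumerate(self.halves):
            if not self.failed[index]:
                return bytes(half[offset:offset + length])
        raise OSError("both halves of the mirror have failed")


mirror = Mirror(1024)
mirror.write(0, b"boot data")
mirror.failed[0] = True          # the disk holding the first half fails
print(mirror.read(0, 9))
```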

The Dynamics of Win2K
Next month, I'll continue with a look inside the Win2K LDM. I'll also discuss reparse points.