Technologies people use on a regular basis follow a fairly predictable lifecycle—one that Gartner nicely captured in its Hype Cycle concept. First, a technology reaches a level of maturity that raises its visibility. Gartner calls this the "technology trigger." Then, as more people take advantage of the technology, its value is over-promised. This is the "peak of inflated expectations." Of course, as people realize the technology won't end world hunger or roll back global warming, it falls into the "trough of disillusionment." And finally, as the technology dilettantes move their attention to the next shiny thing, the technology is incorporated into products (the "slope of enlightenment") and IT infrastructures so that it becomes a part of everyday life (the "plateau of productivity").
Virtualization has followed this lifecycle. Internet usage has followed it as well. And data deduplication—a storage technology that reduces disk space requirements—has followed an abbreviated version of the hype cycle. Third-party vendors first introduced data deduplication as an add-on feature, but Microsoft added the capability to Server 2012, effectively making data deduplication a commodity technology that's available to everyone.
Data deduplication has become particularly important with the explosion of storage. A 2011 IDC report predicts that the world will consume 90 million terabytes (that's 90 exabytes, in case you were wondering) of data in 2014 and 125 million terabytes in 2015. That's a lot of Facebook posts, folks. Storing this data efficiently is critical, and data deduplication technology is a key piece of efficient storage.
Defining Data Deduplication
Data deduplication is a simple concept. Many of the data blocks on a volume are duplicate data. If you have multiple virtual hard disks, they might be quite similar to one another because they contain the same or similar OSs and similar applications. A software installation share has files with internal similarities because the compiled code shares many common libraries and so on. You could save a lot of disk space by eliminating the duplication. Microsoft's implementation analyzes data on the volume at a block level to find duplicates. It replaces duplicate blocks on a volume with a reparse point and metadata that points to the location of the original file data.
Everything that's required to access your data is located on the drive. This means you can move a drive from one server to another and still read its data correctly. There's an important caveat here: The server holding the deduplicated disk must be running Server 2012 or newer (such as Windows Server 2012 R2) and have the Data Deduplication feature installed, or else it can't interpret the deduplicated data. You can attach a deduplicated disk to a Server 2012 system that doesn't have the feature installed, but only files that haven't yet been deduplicated will be readable; the deduplicated files won't be. I certainly wouldn't call this a best practice!
By default, files aren't considered for deduplication until they remain unchanged for five days; this way, active files retain excellent performance. If a file is modified, it becomes "hot" again and isn't reprocessed until it has gone unchanged for another five days. You can easily change this setting (which I discuss in a later section); you also can configure the process to exclude certain folders or file types. The process is designed to run at low priority and with low memory demand, and therefore not interfere with the primary purpose of your server: serving data to users. Data deduplication stays out of the way, running when system utilization is low. Thus, after you enable deduplication, it might be a few days before you see substantial savings. Microsoft guidance states that the deduplication feature can process roughly 2TB of data per volume per day (about 100GB per hour) while running a throughput optimization job on a single volume.
You can determine the potential savings that data deduplication might net you on a volume by installing the feature, opening a command prompt, and running DDPEval.exe, which the installation places in the \Windows\System32 folder:
ddpeval.exe <path> [/v] [/o:<output file path>]
Figure 1 shows DDPEval output on a relatively small folder.
The path can be a volume, directory, or mapped network share. The /v switch ostensibly provides verbose output, but in my tests the utility didn't return any more information. You can use the /o switch to write output to a file. And be prepared to be patient; a block-by-block analysis of a large volume at a low priority takes some time.
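For example, to evaluate a data volume and save the report to a file, you'd run something like the following (the drive letter and output path here are purely illustrative):
ddpeval.exe E:\ /o:C:\Temp\dedup-eval.txt
The report estimates the space savings without changing any data on the volume, so it's safe to run against production storage, albeit slowly.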
File operations might or might not be affected on a deduplicated volume. For example, copying files from a deduplicated volume to a non-deduplicated volume might take longer, but because deduplication has its own cache, simultaneously copying several large files might be considerably faster. In other words, your mileage may vary, so you must determine data deduplication's effect in your own environment. I suggest you restore the contents of a typical production file server into a lab environment and use the free File Server Capacity Tool (FSCT) to gather baseline I/O throughput and user-capacity data. Then, enable data deduplication, let it complete its optimization, and measure the storage savings (Figure 8 shows an example). Finally, rerun FSCT to compare the storage-optimized version with the original.
Installing Data Deduplication
To configure data deduplication, follow these steps:
- From Server Manager, select Add Roles and Features, as Figure 2 shows.
- Expand File and Storage Services and select File and iSCSI Services. Then, select Data Deduplication, as Figure 3 shows.
- Complete the wizard, accepting the defaults.
You also can install data deduplication with Windows PowerShell as follows:
PS C:\> Import-Module ServerManager
PS C:\> Add-WindowsFeature -name FS-Data-Deduplication
PS C:\> Import-Module Deduplication
Enabling Data Deduplication
Use Server Manager to enable data deduplication for a specific volume. Select File and Storage Services in the navigation pane and then select Volumes. Figure 4 shows a 2TB data volume (highlighted) that is 88 percent full.
Hanshi is a Hyper-V server, so it contains 344GB of virtual machines (VMs) in a \VMs folder. In addition, more than 1TB of data (software, MP3s, videos, backups) is spread across roughly 305,000 files in 41,400 folders.
Right-click the volume and choose Configure Data Deduplication, as Figure 5 shows.
This launches the Deduplication Settings wizard, which you can see in Figure 6.
On this dialog box, configure how you want data deduplication to operate on the volume. In this example, I accept the default file age of five days. I'm not excluding any file types, but I do exclude the \VMs folder (which contains virtual machines, not the old Digital Equipment Corporation OS!), because I don't want to attempt deduplication of live VMs. Running VMs won't be deduplicated in any case, but if a VM in that folder is shut down for a week, its files will be deduplicated, and its subsequent startup will be slower.
Next, click Set Deduplication Schedule on the Deduplication Settings screen to see what it's all about. Figure 7 shows the Deduplication Schedule.
By default, Enable background optimization is selected. It allows the deduplication process to run in the background when system resource usage is low. If you want, you can also schedule optimization to run on a regular basis by selecting Enable throughput optimization; selecting the second check box, Create a second schedule for throughput optimization, lets you run it up to twice a day. In this example, accept the defaults and just cancel out of the dialog box. Selecting Finish enables data deduplication on the volume. You can also enable deduplication by using PowerShell:
PS C:\> Enable-DedupVolume E:
PS C:\> Set-DedupVolume E: -MinimumFileAgeDays 20
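You can inspect and manage the optimization schedules from PowerShell as well. Here's a sketch (the schedule name is my own invention) that lists the existing schedules and then creates a nightly throughput optimization window, much like the one the dialog box configures:
PS C:\> Get-DedupSchedule
PS C:\> New-DedupSchedule -Name "NightlyOptimization" -Type Optimization -Start 23:00 -DurationHours 6 -Days Monday,Wednesday,Friday
The -Type parameter also accepts GarbageCollection and Scrubbing, which correspond to the maintenance jobs that deduplication runs alongside optimization.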
After about 36 hours, data deduplication has optimized the volume, as you can see in Figure 8.
The deduplication process reclaimed 23 percent of the volume, resulting in a savings of 363GB. This certainly isn't as large a savings as Microsoft touted; however, 363GB of reclaimed disk space is nothing to sneeze at—especially for a "set it and forget it" process.
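If you'd rather not wait for the background process, you can kick off an optimization job manually and then check the results yourself. This is a sketch using the Deduplication module's cmdlets:
PS C:\> Start-DedupJob -Volume E: -Type Optimization
PS C:\> Get-DedupStatus E: | Format-List
Get-DedupStatus reports, among other things, the saved space and the number of optimized files on the volume, which is the same savings information that Server Manager displays.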
Exploring Data Deduplication Cans and Can'ts
Not every type of volume is a good candidate for data deduplication, and some can't be deduplicated at all. Here's a list of the cans and can'ts of deduplication.
- Data deduplication can be performed only on NTFS-formatted volumes and will work with either Master Boot Record (MBR) or GUID Partition Table (GPT) partitioning.
- Data deduplication works on shared storage, such as a Fibre Channel or Serial Attached SCSI (SAS) array or an iSCSI SAN; Windows Server Failover Clustering is fully supported.
- A volume can't be a system or boot volume, because deduplication isn't supported on OS volumes.
- If you convert a regular volume that has been enabled for deduplication to a Cluster Shared Volume, no further deduplication will take place.
- Data deduplication doesn't support the Microsoft Resilient File System (ReFS); it requires NTFS. Microsoft's guidance doesn't explain why, but a file-system restriction isn't something I'd ignore.
- Data deduplication can't be performed on removable drives.
- Data deduplication doesn't support remotely mapped drives.
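Before you enable deduplication, you can sanity-check a volume against these restrictions from PowerShell. This one-liner (my own, using the standard Storage cmdlets) shows the properties that matter:
PS C:\> Get-Volume E | Select-Object DriveLetter, FileSystem, DriveType
If FileSystem reports NTFS and DriveType reports Fixed, the volume passes the basic requirements; you still need to confirm it isn't the system or boot volume.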
Reaping the Savings By Upgrading
Data deduplication in Server 2012 is yet another of the OS's features that contributes to its must-have status. The savings that data deduplication will give you in storage costs will probably—by itself—justify the cost of upgrading Windows-based file servers. Your savings will vary from what Microsoft predicts, but whatever you gain is a benefit. Data deduplication has become an essential storage technology that should be included in every Server 2012 deployment and beyond.