All too often, my machine pops up a window with a message that says I don't have enough space available on my hard disk. Typically, my initial reaction is one of astonishment because I have a pretty large hard disk and there certainly should be some room left. Then anger sets in because I realize that it's quite possible that my disk is filled with junk, such as temporary files that Windows forgot to remove.
When I receive this message, I'm forced to digress from the task at hand and sift through the hard disk's vast directory tree. The goal is to find which directories hold the most data so I can clean them out. The process is time-consuming and tedious, which makes it a perfect job for a Perl script.
Checking Directory Sizes
As system components pop up windows indicating that drive space is running low, you can be left wondering where all your disk space has gone. With the huge capacities of modern-day hard disks, you can have files buried so deep in the disk's tree that you don't know where to start to look for oversized files and directories.
One easy way to examine a directory tree is to run the TREE.COM command in the command-shell window. This command displays the entire directory tree, which you can walk through to look for items to remove to free up space. However, system disks can contain numerous directories. For example, on my personal desktop PC, there are well over 10,000 directories on the boot drive. It would take an extremely long time to examine each of these directories looking for files to delete.
The Dir_Sizes.pl script walks through an entire directory tree and calculates how much space the files in each directory occupy. Using the output from this script, you can easily and quickly identify which directories consume the most disk space regardless of whether you have 10 or 10,000 directories.
How the Script Works
The theory behind how Dir_Sizes.pl works is quite simple. The script examines each directory and subdirectory starting from a path you specify. For each directory, the script calculates the total size of all the directory's files. This requires that the script examine each file in the tree to determine its size. This information is aggregated up the tree.
The actual implementation is a bit more complex than the theory. Dir_Sizes.pl implements a recursive technique in which a subroutine is called once for each directory found. For each directory, the script totals the sizes of the files it contains directly. To obtain the full size of a parent directory, the script adds the totals of its child directories to the total of the parent's own files. After determining all the directories' sizes, the script displays them onscreen in order from largest size to smallest size. This process can be slow depending on how many files and directories there are to examine. It's unfortunate that the script must examine each file, but it's necessary because NTFS doesn't aggregate directory sizes.
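The recursive idea can be sketched in a few lines. This is a minimal illustration rather than the code in Listing 1: the dir_total() name, the forward-slash path joining, and the choice to skip symbolic links are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper (not the article's ProcessPath()): returns the total
# size in bytes of everything under $path and records each directory's
# total in the hash that $sizes refers to.
sub dir_total {
    my ( $path, $sizes ) = @_;
    my $total = 0;

    # An unreadable directory contributes nothing instead of aborting the walk.
    opendir( my $dh, $path ) or return 0;
    my @entries = grep { $_ ne '.' && $_ ne '..' } readdir($dh);
    closedir($dh);

    for my $entry (@entries) {
        my $full = "$path/$entry";    # assumption: "/" as the delimiter
        next if -l $full;             # assumption: skip symlinks to avoid loops
        if ( -d $full ) {
            $total += dir_total( $full, $sizes );    # child directory total
        }
        elsif ( -f $full ) {
            $total += ( -s $full ) || 0;             # size of one file
        }
    }

    $sizes->{$path} = $total;
    return $total;
}
```

Calling dir_total( 'C:/Temp', \%sizes ) would fill %sizes with one entry per directory, each holding the combined size of that directory's own files and of everything beneath it.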
You can run Dir_Sizes.pl on any implementation of Win32 Perl. Run the script by passing in the name of a directory or a series of directories. For example, the command
perl Dir_Sizes.pl "C:\Program Files" C:\Windows
will process the entire directory tree for both C:\Program Files and C:\Windows.
Dir_Sizes.pl will run on all Windows OSs, with the possible exception of Windows CE. The only item in the script that makes it Win32 specific is the use of the standard DOS and Windows backslash (\) as a directory delimiter. If you change the directory delimiters, you should be able to run the script on non-Win32 platforms.
Examining the Script
Listing 1 shows Dir_Sizes.pl. At callout A, the script sets the constant variables (e.g., $KILOBYTE, $MEGABYTE). These variables are used later when displaying directory sizes. The script then cycles through each path specified on the command line. For each of these paths, trailing directory delimiters are stripped off to prevent using paths with double delimiters (e.g., C:\Windows\\System32). Dir_Sizes.pl actually strips off both forward slashes and backslashes for all the Linux users out there. However, the script leaves root paths untouched, because a root path always ends with a delimiter. The script then passes each path, along with a reference to the %PathSize global hash, to the ProcessPath() subroutine. A reference to the hash rather than the hash itself is passed in because the subroutine modifies the hash's contents.
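The delimiter handling described above might look something like the following sketch. The strip_trailing_delims() name and the regular expressions are assumptions for illustration, not the code from callout A.

```perl
use strict;
use warnings;

# Hypothetical helper: remove trailing slashes and backslashes unless the
# path is a root such as "C:\", "\" or "/".
sub strip_trailing_delims {
    my ($path) = @_;

    # A root path is an optional drive letter followed only by delimiters.
    return $path if $path =~ m{^(?:[A-Za-z]:)?[\\/]+$};

    $path =~ s{[\\/]+$}{};    # drop one or more trailing delimiters
    return $path;
}
```

With this helper, C:\Windows\ becomes C:\Windows, while C:\ and / come back unchanged.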
The script also passes a reference to the %PathSize hash to the Report() subroutine. Here the script passes a reference only because the hash might be huge, depending on how many directories are processed. A hash reference is small, whereas passing the hash itself would copy all its contents. I prefer to keep a Perl process's memory use low when possible, so passing a reference is the better choice.
The code in callout B relies on the $Path value that's passed into the ProcessPath() subroutine. On older versions of Win32 Perl, this can be a problem if $Path contains long filenames or filenames with embedded spaces. Windows veterans already know that FAT32 and NTFS filenames can contain spaces and be up to 255 characters long. However, older programs might still expect the old DOS standard of filenames no longer than 8 characters with an extension of up to 3 characters (the so-called 8.3 filename convention). Oddly enough, Win32 Perl has some problems handling long Windows paths. In older versions of Win32 Perl, glob() and other commonly used functions might experience problems when crunching paths with spaces. If you run into such problems, you can add the following code at the very beginning of callout B:
$Path = Win32::GetShortPathName( $Path );
This code calls the Win32::GetShortPathName() function, which converts any path into an 8.3 file path.
The code at callout B attempts to open the directory specified by $Path. If the call to open the directory is successful, the script examines each object within the directory. The script stores the files in the @FileList array and the directories in the @DirList array. Current directory (.) and parent directory (..) objects are ignored. Finally, the directory is closed.
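As a rough sketch of that directory-reading step, the following subroutine separates a directory's entries into files and subdirectories. The list_directory() name and its return values are assumptions for illustration, and the lowercase arrays only mirror the role of @FileList and @DirList.

```perl
use strict;
use warnings;

# Hypothetical helper: returns two array references, the first holding the
# file names and the second the subdirectory names found directly in $path.
sub list_directory {
    my ($path) = @_;
    my ( @files, @dirs );

    opendir( my $dh, $path ) or do {
        warn "Cannot open '$path': $!\n";
        return ( \@files, \@dirs );    # both lists empty on failure
    };

    for my $entry ( readdir $dh ) {
        next if $entry eq '.' || $entry eq '..';    # skip current and parent
        if ( -d "$path/$entry" ) {
            push @dirs, $entry;
        }
        else {
            push @files, $entry;
        }
    }
    closedir $dh;

    return ( \@files, \@dirs );
}
```

readdir() returns bare names rather than full paths, which is why each entry is joined to $path before the -d test.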
Next, the ProcessPath() subroutine processes each subdirectory, as callout C shows. First, the script recursively calls ProcessPath(), this time passing in the path of a subdirectory. The function call returns a reference to the $DirEntry hash. This hash contains the aggregate sum of all the files in the subdirectory and further down the tree. This information is added to values in a local copy of the %ThisDir hash, which represents the size of the directory currently being processed.
The code at callout D processes all the files that were discovered in the directory as opposed to the subdirectories. The code retrieves the size of each file and sums them. This information is added to the local copy of the %ThisDir hash. Finally, the code at callout D returns a reference to %ThisDir.
Eventually, the script calls the Report() subroutine (see the last line in callout A), which displays the results of the data it has collected. As callout E shows, the core of this subroutine is a foreach loop. The foreach loop processes each entry in the %PathSize hash, which contains keys representing the path to each directory that has been processed. The hash's values are the same %ThisDir hashes that were created and updated in the code at callout C and callout D.
The directories are processed in descending order of the total size listed in each directory's hash. To accommodate long paths on the display without wrapping to another line, the $ShortPath variable is modified when its length exceeds 64 characters. The modification results in a path that's shortened by replacing the middle of it with an ellipsis (...).
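The ordering and path shortening can be sketched as follows. The subroutine names, the size key in each directory's hash, and the even split of the characters kept on either side of the ellipsis are assumptions for illustration; the listing may divide the path differently.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the start and end of a long path and replace
# the middle with "..." so the result is at most $max characters.
# Assumes $max is greater than 3.
sub shorten_path {
    my ( $path, $max ) = @_;
    $max = 64 unless defined $max;
    return $path if length($path) <= $max;

    my $keep = $max - 3;             # characters left besides the ellipsis
    my $head = int( $keep / 2 );
    my $tail = $keep - $head;
    return substr( $path, 0, $head ) . '...' . substr( $path, -$tail );
}

# Hypothetical data layout: path => { size => bytes }.
# Returns the paths ordered from largest to smallest; call in list context.
sub by_size_desc {
    my ($sizes) = @_;
    return sort { $sizes->{$b}{size} <=> $sizes->{$a}{size} } keys %$sizes;
}
```

A report loop would then iterate over by_size_desc( \%PathSize ) and print shorten_path($path) next to each total.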
The two remaining subroutines in Dir_Sizes.pl, FormatNumber() and FormatNumberPretty(), are for formatting purposes. FormatNumber() adds commas to large numbers (e.g., 12345 becomes 12,345). FormatNumberPretty() adds memory size suffixes, such as M for megabytes.
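Both kinds of formatting are easy to approximate. The following sketch uses hypothetical names, add_commas() and pretty_size(), and assumes 1,024-based units with one decimal place; the subroutines in Listing 1 might round or label values differently.

```perl
use strict;
use warnings;

# Hypothetical stand-in for FormatNumber(): insert a comma before every
# group of three digits, working from the right.
sub add_commas {
    my ($number) = @_;
    1 while $number =~ s/^(-?\d+)(\d{3})/$1,$2/;
    return $number;
}

# Hypothetical stand-in for FormatNumberPretty(): scale a byte count to
# K, M or G (assumption: 1,024-based units, one decimal place).
sub pretty_size {
    my ($bytes) = @_;
    my @units = ( [ 1024**3, 'G' ], [ 1024**2, 'M' ], [ 1024, 'K' ] );
    for my $unit (@units) {
        my ( $factor, $suffix ) = @$unit;
        return sprintf( '%.1f%s', $bytes / $factor, $suffix )
            if $bytes >= $factor;
    }
    return "$bytes";    # smaller than 1K: show plain bytes
}
```

For example, add_commas(12345) returns 12,345 and pretty_size(1536) returns 1.5K.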
Scan Your Hard Disk
Dir_Sizes.pl has become a useful part of my personal toolkit. The script works well on any file-based drive, such as local drives, CD-ROMs, USB flash drives, and even network shares. Running the script occasionally is a great way to reacquaint yourself with your hard drive. You might be amazed at what you find lurking in its depths.
Dave Roth (email@example.com) is the author of several Win32 Perl extensions, including Win32::AdminMisc, Win32::ODBC, Win32::Daemon, and Win32::Perms. His most recent book is Win32 Perl Programming: The Standard Extensions, 2nd edition (New Riders Publishing/Macmillan Technical Publishing).