Use Perl scripts to reclaim lost disk space and improve user navigation

In your corporate environment, do you view an increase in disk-space consumption as a positive or negative occurrence? The general IT philosophy is to blindly restrict disk growth and slap the wrists of users or departments that use too much disk space. My company frequently uses disk quotas, for example, to limit disk utilization. Sometimes, we even use Microsoft Excel charts at management meetings to expose the identities of "offenders." Perhaps such negative reinforcement is misguided.

Of course, I'm not suggesting that you let space utilization run rampant or that you throw your disk quotas out the window. Although quotas can discourage healthy disk-utilization growth, they're essential to ensure that users don't accidentally—or even maliciously—fill up disks with garbage, thereby negatively affecting other departments and users. (For example, you wouldn't want a user to copy his or her entire C drive to centralized storage.)

If your company is experiencing increased disk-space requirements, you can be sure that your users trust your file-server disk resources. They have confidence that their files are safe, virus-free, backed up, and available when they need them. In problem environments, disk-space growth is seldom a challenge. If users sense dependability and availability problems, they simply store their work locally. Local storage results in frequent user requests for larger hard disks—driven by the fear of placing work on the file server that might not be available when users need it.

Making users feel guilty about using server storage can be dangerous. You wouldn't want your users and departments to circumvent established quotas by hosting rogue file servers or saving valuable data on desktop PCs that have no disk redundancy. Typically, these activities go undetected until disaster strikes and the secret storage area comes to light. The cost of purchasing additional storage is always less than the cost of data loss, data recovery, user downtime, and loss of productivity.

Your real enemy isn't necessarily increased disk-space utilization but rather file and folder clutter. Most disk-utilization growth occurs as a result of users uploading real business data that they access regularly. Obviously, we need to encourage this kind of healthy growth. Parallel to the accumulation of useful data, however, is the accretion of nonbusiness data and other file clutter.

File and folder clutter can make browsing file-share resources frustrating for all users. Additionally, detecting and deleting unproductive files is difficult. Although some file types clearly aren't business-related (e.g., personal music files), other personal files might not be so easy to identify. For example, locating and removing .jpeg files of a user's personal vacation is a tough task if those files are scattered among legitimate business-related .jpeg files. Here are five scripts that you can use to control file and folder clutter.

1. Search by Extension
Your first step is to root out any file types that are clearly inappropriate for server storage. A handful of user folders that each contains 500MB to 1GB worth of MP3 files, for example, can quickly eat up server storage space. MP3 files are easy to detect based on the file extension (.mp3), and as long as you have no business-oriented MP3s residing in storage, you can automate their deletion.

Table 1, page 53, lists file types that can potentially waste storage space. Of course, some of these file types can hold appropriate business-related content, so you need to carefully review your business policies and the needs of your user community before you start globally deleting files.

A quick note about user-circulated game and video files: After these files make their debut through email, download, or floppy disk, they can quickly spread throughout the office and eventually invade server storage. These kinds of files seem to take on a life of their own as users copy, rename, and circulate them. Regular searches for .exe files can help cut their life cycle short.

The file types that you search for will probably change as new extensions become available, so revisit your customized list periodically and add the common file types that appear in your environment.

You can use scripts to automate the detection and deletion of these file types. (For more information about using scripts, see the Web-exclusive sidebar "Getting Started with Scripting," InstantDoc ID 22035.) For a simple script that deletes MP3 files residing in your D:\test folder, see the single-extension search-and-delete Perl script, which Listing 1 shows. The code looks complicated because of its comments and logging functions, but it's quite simple: the Unlink line deletes the .mp3 files. However, I've commented out the Unlink line and added a Print line so that the script shows you the files instead of deleting them. You could use the Windows NT shell commands Del or Erase to accomplish the deletions, but Perl creates a cleaner log file and is very fast. Perl's performance is evident in its lower CPU utilization and its speed when dealing with large numbers of files. If you're unfamiliar with Perl and could use some tips, see the Web-exclusive sidebar "Script Pseudo-Coding," InstantDoc ID 22036, for a closer look at the script's Recurse routine.
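Listing 1 itself isn't reproduced here; as a rough illustration of the approach, here's a minimal dry-run sketch built on the standard File::Find module rather than the listing's own Recurse routine (the folder path and sub name are illustrative, not the article's actual code):

```perl
use strict;
use warnings;
use File::Find;

# find_by_ext($root, $ext): return every file under $root whose name
# ends in .$ext (case-insensitive). The unlink call stays commented
# out, so a run only reports what it would delete.
sub find_by_ext {
    my ($root, $ext) = @_;
    my @matches;
    find(sub {
        return unless -f $_ && /\.\Q$ext\E$/i;
        push @matches, $File::Find::name;
        # unlink $_ or warn "Can't delete $File::Find::name: $!";
    }, $root);
    return @matches;
}

my $target = 'D:/test';    # illustrative path
if (-d $target) {
    print "Would delete: $_\n" for find_by_ext($target, 'mp3');
}
```

Uncommenting the unlink line turns the dry run into an actual deletion pass.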

You can modify the code to look for multiple file types. The multiextension search-and-delete script, which Listing 2 shows, also comments out the Unlink command and includes a Print statement that shows a list of the files that match the search criteria. The modified script searches for any file with an .asf, .asx, .ra, .ram, or .rm extension. To add file types, you can simply chain additional OR code sections to the filename test.
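A filename test with chained OR (||) conditions might look like this (a self-contained sketch of the pattern, not Listing 2 itself; the sub name is hypothetical):

```perl
use strict;
use warnings;

# is_media($name): true when the filename carries one of the target
# streaming-media extensions. Chain further || alternatives to extend
# the list as new extensions appear.
sub is_media {
    my ($name) = @_;
    return $name =~ /\.asf$/i
        || $name =~ /\.asx$/i
        || $name =~ /\.ra$/i
        || $name =~ /\.ram$/i
        || $name =~ /\.rm$/i;
}
```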


Always test-run your file-deletion and folder-removal scripts with Unlink or Rmdir commented out so that you can observe the results before you attempt a production run. Schedule your scripts to run periodically, and review the logs of the deletion runs. (Weekly runs are probably sufficient.) Also, monitor new file types as they're introduced. Schedule your deletion runs to follow your backups so that you can restore files if necessary.

2. Search by Keyword
I once developed a script to determine how many .jpeg files resided on a client's main file server. The script located more than 20,000 JPEGs on one share point. Who knew how many personal image files were floating among the legitimate business-related images? Obviously, opening and viewing each image to root out the personal files was impractical. The answer was to locate personal files by performing keyword searches on filenames and folder names.

You can use Windows Explorer's Find utility (which you can also access by pressing F3) to search for keywords that might help you locate out-of-scope or obsolete files and folders. However, NT 4.0's Find utility has two primary disadvantages. First, NT 4.0's Find has a built-in limit of 10,000 items. In my large-scale search for .jpeg files, for example, I quickly exceeded that limit. Second, NT 4.0's Find is glacial if you're attempting to search for several keywords in the same instance. Although Windows 2000 addresses these limitations with its Search For Files or Folders feature, you can also use a script to overcome the limitations.

Figure 1 shows a keyword list that I use to locate personal and out-of-scope files. (You can add any keywords that are unique to your environment.) The multifolder keyword-search-and-log script that Listing 3 shows searches for the keywords personal, vacation, trash, garbage, junk, and bad. The purpose of this type of search is logging—not deletion or simple screen output. After you execute the script, you can use the log output to locate folders that contain out-of-scope files. Notice that the script opens, writes to, and closes the C:\foldertestlog.txt output file. The script searches for the test string anywhere in the folder name and ignores case. Therefore, folders named junk yard dog, junky stuff, and My Junk would all return a match.
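A sketch of the keyword test and the logging pass, in the spirit of Listing 3 (sub names and the log path are illustrative, not the listing's actual code):

```perl
use strict;
use warnings;
use File::Find;

my @keywords = qw(personal vacation trash garbage junk bad);

# matches_keyword($name): true when any keyword appears anywhere in
# the name, ignoring case, so "junk yard dog", "junky stuff", and
# "My Junk" all match.
sub matches_keyword {
    my ($name) = @_;
    for my $kw (@keywords) {
        return 1 if $name =~ /\Q$kw\E/i;
    }
    return 0;
}

# log_keyword_folders($root, $logfile): write the full path of every
# matching folder under $root to the log file; no deletion occurs.
sub log_keyword_folders {
    my ($root, $logfile) = @_;
    open my $log, '>', $logfile or die "Can't open $logfile: $!";
    find(sub {
        print {$log} "$File::Find::name\n" if -d $_ && matches_keyword($_);
    }, $root);
    close $log;
}
```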

3. Delete New Folders and Empty Folders
When you delete clutter, you don't reclaim a significant amount of space. However, such cleanup efforts can streamline your folder structure and improve navigation. One type of clutter that can quickly accumulate on your file server is the New Folder directories that users intentionally or inadvertently create and then never use. The first time I performed a search (using the Start menu's Find utility) for the default folder name New Folder, I was surprised to see several thousand hits. Some of the folders were populated with files, but the majority were empty. Executing a mass deletion is risky because Windows will obediently delete in-use folders, along with all their file contents, as well as inactive empty folders. A better alternative is to use the NT shell command Rmdir (or Rd) or Perl's Rmdir command. Unless you use the /s switch, these commands won't delete a folder that holds contents.

Listing 4 shows a script that deletes New Folder directories. Notice that the script uses the Rmdir command instead of Unlink and adds a counter: the hits variable tracks the script's success in locating instances of the target folder and removing them.
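A sketch of the approach, built on File::Find's finddepth rather than the listing's own recursion (the sub name is hypothetical):

```perl
use strict;
use warnings;
use File::Find;

# remove_new_folders($root): rmdir every directory named "New Folder"
# under $root. Rmdir fails harmlessly on a folder that still holds
# contents, so only the empty ones disappear. Returns the hit count.
sub remove_new_folders {
    my ($root) = @_;
    my $hits = 0;
    finddepth(sub {
        return unless -d $_ && /^New Folder$/i;
        $hits++ if rmdir $_;
    }, $root);
    return $hits;
}
```

Because finddepth visits children before parents, a nested tree of empty New Folder directories collapses in a single pass.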

Another type of folder clutter is folder structures that users create but never use and then abandon. You certainly don't need a bunch of empty folders hanging around. If you're experiencing this situation, you can determine a time threshold for folder deletions and schedule a script to run regularly to perform the deletions.

Listing 5, page 56, shows a script that tests folders' age and attempts to delete any empty folder older than a configurable date. The Rmdir command gracefully fails for any folder that contains data. Running this script on a scheduled basis will gradually minimize the number of empty folders on your server and simplify your folder structure. Note that the Rmdir command can delete an empty shared folder—this deletion effectively removes the folder, the share, and its permissions. If you have empty shared folders that you want to retain, simply place a file inside them to prevent their deletion. An alternative is to set the shared folder's read-only attribute; then, Rmdir will leave the share point intact unless a component of your script changes the attributes.
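A sketch of the age-and-emptiness test, using Perl's -M file-age operator for the threshold (names are illustrative, not Listing 5's actual code):

```perl
use strict;
use warnings;
use File::Find;

# remove_stale_empty($root, $min_age_days): attempt rmdir on every
# folder under $root whose modification age (-M, in days) exceeds the
# threshold. Rmdir fails gracefully on any folder that holds data,
# so only stale empty folders are removed. Returns the removal count.
sub remove_stale_empty {
    my ($root, $min_age_days) = @_;
    my $hits = 0;
    finddepth(sub {
        return if $File::Find::name eq $root;   # never remove the root
        return unless -d $_ && -M $_ > $min_age_days;
        $hits++ if rmdir $_;
    }, $root);
    return $hits;
}
```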

4. Delete Temporary Files
Temporary files occasionally end up in server storage when users run applications over the network or copy file structures that contain .tmp files from local desktop PCs. Microsoft Word and Microsoft PowerPoint create temporary files that the system typically cleans up when the user closes a document. However, if the save operation is interrupted or if the user doesn't close an application cleanly, .tmp files might remain.

These .tmp files, which approximately mirror the size of the Word (i.e., .doc) or PowerPoint (i.e., .ppt) file on which the .tmp files are based, can consume significant disk space—especially if the original files are documents with a lot of graphics or involved presentations. Generally, you can safely delete .tmp files as long as you check their age to be sure you're not deleting any in-use files.

The script that Listing 6 shows is similar to the earlier search-and-delete scripts, but notice the addition of the -C > 5 test, which ensures that 5 days have passed since the file's creation. The script also adds a size variable, $totsize, to track the size of the deleted files.
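A sketch of the age-guarded .tmp cleanup with a running size total (the age threshold is a parameter here for clarity; on Win32, -C reports days since file creation, whereas on Unix it reflects inode-change time):

```perl
use strict;
use warnings;
use File::Find;

# remove_old_tmp($root, $min_age_days): delete .tmp files whose -C
# age exceeds the threshold (default 5 days) and return the total
# bytes reclaimed in $totsize.
sub remove_old_tmp {
    my ($root, $min_age_days) = @_;
    $min_age_days = 5 unless defined $min_age_days;
    my $totsize = 0;
    find(sub {
        return unless -f $_ && /\.tmp$/i && -C $_ > $min_age_days;
        $totsize += -s $_;    # record the size before deleting
        unlink $_ or warn "Can't delete $File::Find::name: $!";
    }, $root);
    return $totsize;
}
```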

5. Delete Default-Named Office Documents
When a user starts to create a Microsoft Office document but never names it or adds content, default naming can occur. Common default-named files include New Microsoft Word Document.doc, New Microsoft Excel Worksheet.xls, and New Microsoft PowerPoint Presentation.ppt. In your user environment, other applications might create different default filenames. You can use a script to search for these filenames. Your script can also check file size because empty default-named documents are typically a consistent size. (Users sometimes add content to a file but neglect to change the default name, so checking the size is a good idea.) See Table 2 for a list of default filenames and their sizes.

The script that Listing 7 shows searches for default-named Word documents and for files that are 10,752 bytes in length (i.e., the size of a default-named Word document). The -s $_ variable indicates file size. If a file matches only one criterion, the script logs the document's location so that you can investigate further later. You can configure the script to automatically delete files that match both the default name and the default size. For more information, see the detailed notes in the complete script, which you can download from the Windows 2000 Magazine Web site.
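A sketch of the two-criterion test (the 10,752-byte figure comes from the article; the sub name is hypothetical):

```perl
use strict;
use warnings;

my $default_name = 'New Microsoft Word Document.doc';
my $default_size = 10_752;    # bytes in an empty default Word document

# classify($path): 'delete' when both the default name and the default
# size match, 'log' when only one criterion matches (worth a manual
# look), and '' when the file is out of scope.
sub classify {
    my ($path) = @_;
    my $name_ok = $path =~ /\Q$default_name\E$/i;
    my $size_ok = -f $path && -s $path == $default_size;
    return 'delete' if $name_ok && $size_ok;
    return 'log'    if $name_ok || $size_ok;
    return '';
}
```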

Continue Polishing Your Shares
If you want to clean up your file servers, you'll find this article's scripts to be a good starting point. Polishing your share points will reclaim lost disk space and improve user navigation. Remember that your true enemy isn't necessarily an increase in disk-space consumption but rather out-of-scope files and other clutter.