Divide and Conquer Mega-Sized Text and Log Files

Have split files rather than a splitting headache

Downloads
101218.zip

Several times a year, various department heads give me text files and ask me to perform data analyses and create summary reports. Often these files are massive directory dumps or application data dumps that can be as large as 700MB and contain more than 9.5 million lines of text. I'm also occasionally asked to extract contents from extremely large text-based cluster logs, Web logs, and event logs for technicians who need to send log samples to our security department or to analysts to diagnose problems. In addition, there are times when I need to look at the data in an enormous text file so that I know how to work with its contents in a script.

Sometimes these mega-sized text and log files are too large for Notepad to open. Other times, Notepad is sluggish when I try to scroll through their contents. Having smaller files not only makes it easier to do data analyses but also dramatically speeds up code development and testing.

After many months of working around this problem using a mixed bag of tactics, such as exporting data to Microsoft Access or trying to open the files with some other application, I finally decided to write the Log Splitter utility. This HTML Application (HTA) splits large text files into smaller files that I can easily open with Notepad and easily work with when writing a script.

The Log Splitter utility offers simple but adequate functionality. After you download the utility (click the Download the Code Here button at the top of the page) and copy it to your computer, double-click it. In the UI (see Figure 1), enter the pathname of the text or log file you want to split or use the Browse button to locate it.

You can split the file by number of lines or by number of files. To find out how many lines are in the file, click the GetLineCount button. Knowing the total number of lines can help you decide whether to split the file by line count or by number of files.
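For readers who want to see what a line count like this involves, here is a minimal Python sketch. It is not the utility's code (Log Splitter is an HTA, and its source is in the download); the function name count_lines and the chunk_size parameter are assumptions made for illustration. The file is read in fixed-size chunks so that even a file of several hundred megabytes never has to fit in memory.

```python
def count_lines(path, chunk_size=1024 * 1024):
    """Count the lines in a text file without loading it into memory.

    Hypothetical helper, not part of Log Splitter. The file is read in
    binary chunks and newline bytes are counted, which also handles
    Windows-style CRLF line endings.
    """
    newlines = 0
    last_byte = b""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            newlines += chunk.count(b"\n")
            last_byte = chunk[-1:]
    # A final line that lacks a trailing newline still counts as a line.
    if last_byte and last_byte != b"\n":
        newlines += 1
    return newlines
```

Calling count_lines on a log file returns the total you would weigh when choosing between the two split modes.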

If you want to split the large file into a specific number of smaller files, select the Split into number of files option, then specify that number in the Enter Line Count or Number of Files field. If you want to split the large file into smaller files that contain a certain number of lines, select the Split by Line count option, then specify the maximum number of lines you want in each smaller file. When splitting by line count, you must enter a value of 100,000 or higher. I found that lower values tend to produce too many files, particularly if you're splitting a file that's several hundred megabytes.
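The arithmetic behind the two options can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the even distribution of lines across files is an assumption (the article does not say how the utility divides leftover lines), and the 100,000 minimum is applied to line-count mode only, which is one reading of the rule above.

```python
MIN_LINES_PER_FILE = 100_000  # minimum the article gives for a line-count value


def lines_per_file(total_lines, number_of_files):
    """Split into number of files: lines each file needs (rounded up).

    Assumes lines are spread evenly; the utility may divide them differently.
    """
    if number_of_files < 1:
        raise ValueError("number_of_files must be at least 1")
    return (total_lines + number_of_files - 1) // number_of_files


def files_needed(total_lines, max_lines):
    """Split by line count: how many files a given maximum produces."""
    if max_lines < MIN_LINES_PER_FILE:
        raise ValueError("line count must be 100,000 or higher")
    return (total_lines + max_lines - 1) // max_lines
```

For a file of 9.5 million lines, lines_per_file(9_500_000, 5) gives 1,900,000 lines per file, and files_needed(9_500_000, 100_000) gives 95 files, which shows why smaller line counts quickly produce an unwieldy number of files.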

All that's left to do is to click the RunScript button and click OK. Before the utility starts splitting the large file, it checks for possible problems. If it finds a problem, it displays a message. For example, the utility checks to see whether the specified file is a text or log file. If you try to split another type of file, you'll receive the message This script only works with '.log' and '.txt' files.
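A pre-flight check of this kind might look like the following Python sketch. Only the extension rule and its message come from the article; the function name, the case-insensitive comparison, the file-existence check, and the "File not found" message are assumptions added for illustration.

```python
import os

ALLOWED_EXTENSIONS = {".log", ".txt"}


def check_input_file(path):
    """Return an error message for an unusable input file, or None if it passes.

    Hypothetical helper, not the utility's own code.
    """
    extension = os.path.splitext(path)[1].lower()  # case-insensitive: an assumption
    if extension not in ALLOWED_EXTENSIONS:
        # Message text quoted from the article.
        return "This script only works with '.log' and '.txt' files."
    if not os.path.isfile(path):
        # Assumed extra check with a made-up message.
        return "File not found: " + path
    return None
```

Running every check before any output is written means a bad input never leaves a half-finished set of split files behind.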

After splitting the large file, the utility saves and names each smaller file. For example, if you're splitting C:\data\massive.log into three smaller files, the smaller files will be named C:\data\massive~1.log, C:\data\massive~2.log, and C:\data\massive~3.log. If these smaller files already exist, they'll be overwritten.
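The naming scheme and the split itself can be sketched in Python as follows. This is a sketch under stated assumptions, not the HTA's implementation: part_name and split_by_line_count are hypothetical names, and the loop streams one line at a time so that the source file is never held in memory.

```python
import os


def part_name(source_path, index):
    """Build the name of a split file, e.g. massive.log -> massive~1.log."""
    root, extension = os.path.splitext(source_path)
    return f"{root}~{index}{extension}"


def split_by_line_count(source_path, max_lines):
    """Stream source_path into numbered parts of at most max_lines lines each.

    Returns the list of paths written. Opening each part in "wb" mode
    overwrites an existing file of the same name, matching the behavior
    the article describes.
    """
    if max_lines < 1:
        raise ValueError("max_lines must be at least 1")
    written = []
    out = None
    try:
        with open(source_path, "rb") as source:
            for line_number, line in enumerate(source):
                if line_number % max_lines == 0:
                    # Start the next numbered part.
                    if out is not None:
                        out.close()
                    path = part_name(source_path, len(written) + 1)
                    out = open(path, "wb")
                    written.append(path)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return written
```

Working in binary mode keeps each line's original line ending intact, so the parts joined back together are byte-for-byte identical to the source.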

Depending on the size of the file you're splitting, the process could take a long time to finish (e.g., about five minutes to split a 100MB file into five files), so the utility's UI is hidden while the process runs. The UI reappears when the process completes.

When running the Log Splitter utility, you might get the following Microsoft Internet Explorer (IE) message: "A script on this page is causing Internet Explorer to run slowly. If it continues to run, your computer may become unresponsive. Do you want to abort the script?" If so, abort and see the Microsoft article "How to set time-out period for script." This article tells you how to add a new registry entry named MaxScriptStatements to alleviate the problem. It's a relatively simple modification, but as with all registry changes, you need to use extreme caution. After I received this message, I set the MaxScriptStatements value to 100000000 (100 million). That value works well for me, but you could try a smaller value and see how it works on your computer.

 

Discuss this Article 4

Bill (not verified)
on Mar 10, 2009
Notepad is an awful application to use once the file size gets up to several megabytes. There are many free Notepad replacements out there, but most of them also fail miserably when confronted with a hundred-megabyte file. Fortunately there are several free text editors that aren't fazed by big files. When I started needing to handle these beasts I did extensive research and testing on the available freebies. I primarily wanted fast large file handling capability, but I also looked for several extras.... line numbers, multiple document handling, column mode, etc. My current favorites are SciTE (http://www.scintilla.org/SciTE.html), ConTEXT (http://www.contexteditor.org/), and Crimson Editor (http://www.crimsoneditor.com/). They're all capable of handling hundred megabyte files without choking and they have lots of features built in. Crimson Editor also includes a column mode, which is a lifesaver in some situations. I would also be remiss if I didn't mention UltraEdit (http://www.ultraedit.com/), a commercial offering that has almost every feature you'd want. Coming from a Big Iron/ISPF background, though, I have to say that almost none of today's text editors can do what IBM's ISPF Edit could. If you are familiar with ISPF Edit and Rexx edit macros, you'll understand what I'm saying - the power of that interface can make short work of complicated editing tasks. There have been several DOS-type clones of this environment, but they've been limited to an 80 by 24 screen and aren't available any more. Now here's the good news - Mizumaki-machi (sakachin2@yahoo.co.jp) has produced an ISPF Edit clone that takes advantage of Windows features like resizable screens! The program is called Hybrid Editor XE and it is FREE. There are two web sites for this program, http://hp.vector.co.jp/authors/VA010562 and http://www.geocities.jp/sakachin2/index.htm. You've got to try this out, just use the help since some commands have changed slightly from ISPF use.
KBemowski
on Mar 16, 2009
Hi x16wda, The information about the free text editors is helpful. Thanks! If you (or anyone reading this) would like to share information about the free tools they like to use, you can send me a short description of what it does, where to download it, and how to use its main features. We feature a "Tool Time" column in the "Reader to Reader" section of Windows IT Pro for such recommendations. If your write-up is selected for publication, it would get printed in the "Tool Time" column and you'd get $100. For an example of a "Tool Time" write-up, see "Tool Time: Test Connectivity to Remote Email Servers with TestMX" (http://windowsitpro.com/Windows/article/articleid/100732/100732.html) or "Tool Time: Copy Many Pathnames at Once With Path Copy" (http://windowsitpro.com/Windows/article/articleid/100962/100962.html). You can email the "Tool Time" write-up to kbemowski@windowsitpro.com or r2r@windowsitpro.com. Sincerely, Karen Bemowski, senior editor Windows IT Pro, SQL Server Magazine
