Dealing with (really) big log files

Recently we had to deal with some really big log files. One of our clients had an Apache access log several gigabytes in size, and needed its entries logically separated for a new analytics package. The easiest way, of course, is to write a simple script in Perl (or any other language, such as Ruby), iterate through the file line by line, and parse each line. The problem was speed and system resources: opening the file took several minutes on a very fast SCSI RAID box and used a huge amount of RAM. We realized there had to be a better way.

Probably the first, and most helpful, thing to do in a situation like this is to split the log file into smaller chunks. It reduces the time it takes to open each file, reduces the memory used while processing it, and just makes everything easier to work with. Using split -l let us break the file into chunks of a fixed number of lines (say 100k) while keeping every line intact (rather than splitting on byte counts, which can cut a line in half). The split command is also extremely fast, which is great for files of this size.
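Assuming the whole log lives in a single file called access.log, splitting it into 100,000-line chunks with a "file-" prefix (which is where the file-aa name used below comes from) looks something like this:

split -l 100000 access.log file-

That produces file-aa, file-ab, file-ac and so on, each 100,000 lines long apart from the last one.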

The next step was to write a Perl script to do our parsing. Pretty simple (here’s the template):
open(FILE, "file-aa");
foreach my $line (<file>) {
# parse using split/regex/etc
}
close(FILE);

Finally, to write each parsed line back out, we used the “echo” shell command, called from within the script. Executing a shell command for every line does add a little overhead, but it meant we didn’t have to deal with managing output file handles, and in practice the cost was small: the real bottleneck is the disk anyway, not the CPU.
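As a rough sketch of what that looks like (the chunk name, the $target file and the quoting are placeholders of our own here, not part of the original script):

open(my $in, "<", "file-aa") or die "Cannot open file-aa: $!";
while (my $line = <$in>) {
    chomp $line;                                 # echo will add its own newline
    my $target = "parsed-a.log";                 # placeholder: pick the output file per line
    my $cmd = sprintf('echo %s >> %s', quotemeta($line), quotemeta($target));
    system($cmd) == 0 or warn "echo failed for: $line\n";
}
close($in);

quotemeta backslash-escapes anything the shell might choke on; writing through a proper Perl filehandle would avoid the per-line shell call entirely, but for a one-off job this kept the script dead simple.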

One useful trick was to buffer lines rather than writing each one out immediately. We would read in, say, a thousand or so lines, then append them to the appropriate new file(s) as a single batch. Fewer, larger writes were something our disks could handle much faster than a stream of tiny ones.
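Here’s a rough sketch of the buffering idea (the chunk name, the routing rule and the thousand-line threshold are placeholders; for brevity this version flushes with Perl’s own append rather than the echo call described above):

my %buffer;                                      # lines queued up per output file
my $pending = 0;

sub flush_buffers {
    for my $target (keys %buffer) {
        open(my $out, ">>", $target) or die "Cannot append to $target: $!";
        print {$out} @{ $buffer{$target} };      # one big write per file
        close($out);
    }
    %buffer  = ();
    $pending = 0;
}

open(my $in, "<", "file-aa") or die "Cannot open file-aa: $!";
while (my $line = <$in>) {
    my $target = "parsed-a.log";                 # placeholder: decide where this line belongs
    push @{ $buffer{$target} }, $line;
    flush_buffers() if ++$pending >= 1000;       # flush roughly every thousand lines
}
flush_buffers();                                 # write out whatever is left
close($in);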

These tricks let us split this huge access log in under a day, where other approaches (especially before using split) were projected to take closer to a week. Between the splitting and the buffering we got some significant speed increases.

Have you ever dealt with huge log files before? What tricks have you come up with? Certainly the best trick is to avoid them altogether with appropriate log rolling and similar measures.
