Dominic J. Thoreau

Perl internal file-caching

This was originally written in early 1999 when I was working for Landcare Research. Presumably ownership, strictly speaking, belongs to them. Mind you, computing technique isn't really their focus.

In-Source file caching in perl

When processing significantly large text files in Perl, a moderate portion of the run-time is spent waiting while the file system waits for the disk to spin into position (known as rotational latency) before reading the next record.

Many operating systems attempt to avoid this by caching files, that is, by fetching more than is needed at any one time and storing it in a reserved portion of memory so it can be fetched faster.

However, the amount of memory used for this purpose is usually small. Windows 95 seems to have a dynamically resizable cache that will (supposedly) grow and shrink on demand. Even so, I have seen good speed gains from doing the caching myself.

Windows 95, for example, allocates a maximum of 64k for read-ahead caching: if you request a block of x kbytes, the OS fetches the next 64k bytes on the chance that you will continue to read from that portion of the disk (further investigation suggests this only applies to caching from the CD). Read-ahead caching generally works 90-95% of the time, but when processing text files larger than this, more time is spent waiting for the disk.

The most obvious way to avoid this is to load the whole file into memory. Thankfully, this is easy to write in perl.

Without caching:

open (IN,"afile.txt");
while ($in=<IN>) {   # until EOF, read a line and put it into $in
    chop $in;        # remove the trailing newline (chop removes the last character)
    # ..... process text
}
close (IN);

With caching:

open (IN,"afile.txt");
@myin=<IN>;   # slurp the whole file into the array
close (IN);

foreach $temp (@myin) {   # step through that array
    $in=$temp;   # work on a copy, not the loop variable itself:
                 # $temp is an alias into @myin, so changing it
                 # would put the change back into the array.
    chomp $in;
    # ..... process text
}
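
A variant on the same idea (not in the code above, so treat it as a sketch; the scalar name $whole is just for the example) is to slurp the file into a single scalar by temporarily clearing the input record separator $/, then split it into lines yourself:

{
    local $/;                # undef the record separator for this block only
    open (IN,"afile.txt") || die "Cannot open afile.txt";
    $whole=<IN>;             # one read returns the entire file
    close (IN);
}
foreach $in (split(/\n/,$whole)) {   # split strips the newlines for you
    # ..... process text
}
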
Note
A little later, it was realized that the first three lines of code could be moved off into a local utilities module, cutting down the amount of code used while centralising the error handling.
Now we use something like @myarr=&prepdatafile("afile.txt"); and the code in the module is something like:
sub prepdatafile {
    open (FH,$_[0]) || die "Cannot open datafile $_[0]";
    my @temparr=<FH>;     # slurp the file
    close (FH);
    return @temparr;      # hand the lines back to the caller
}
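
For completeness, here is a rough sketch of how that sub might sit in a module file and be pulled in from a script. The module name MyUtils is made up for the example:

# ---- MyUtils.pm (hypothetical module name) ----
package MyUtils;
require Exporter;
@ISA    = qw(Exporter);
@EXPORT = qw(prepdatafile);

sub prepdatafile {
    open (FH,$_[0]) || die "Cannot open datafile $_[0]";
    my @temparr=<FH>;
    close (FH);
    return @temparr;
}

1;    # a module must end by returning a true value

# ---- in the calling script ----
use MyUtils;
@myarr=prepdatafile("afile.txt");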

In addition, it should be noted that hard disk accesses consume CPU cycles too, and while slurping the file increases the demand for memory, the OS should manage to refrain from swapping it back out to disk for a little while...

While processing a 470kb text file this improved performance by ~30%! The code takes a little longer to start up (doing ~1Mb/sec file reads for the first 2 seconds or so) but the disk activity then drops away completely...
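
If you want to reproduce that kind of measurement on your own data, the standard Benchmark module gives a quick comparison. This is only a sketch: the file name and the do-nothing "processing" are placeholders, and bear in mind that after the first pass the OS cache already holds the file, which will mask some of the difference.

use Benchmark qw(timethese);

$file = "afile.txt";    # placeholder - substitute a suitably large text file

timethese(10, {
    'line_at_a_time' => sub {
        open (IN,$file) || die "Cannot open $file";
        while ($in=<IN>) { chomp $in; }    # stand-in for the real processing
        close (IN);
    },
    'slurp_then_loop' => sub {
        open (IN,$file) || die "Cannot open $file";
        @myin=<IN>;
        close (IN);
        foreach $temp (@myin) { $in=$temp; chomp $in; }
    },
});
# For I/O-bound code the interesting figure is the wallclock seconds that
# timethese prints in parentheses - the CPU-time numbers will not show the
# time spent waiting for the disk.
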

Further hack testing seems to indicate my system tops out at about 3 MB/sec, for both read and write. However, running multiple processes seriously slows things down (with 5 perl processes the file I/O is down to 2 MB/sec or less; the perl interpreter doesn't seem to do multithreaded programs).
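
One rough way to run that sort of hack test yourself (not the script I used, just a sketch) is to time how long a large file takes to read in big chunks; on a second run the OS cache can make the figure look much higher than the disk itself.

$file  = "bigfile.dat";    # placeholder - any file of a few hundred MB
$chunk = 65536;            # read in 64k blocks

$start = time;
open (IN,$file) || die "Cannot open $file";
binmode (IN);              # count raw bytes (matters on Windows)
$total = 0;
while ($n = read(IN,$buf,$chunk)) {
    $total += $n;
}
close (IN);
$secs = (time - $start) || 1;    # avoid dividing by zero on tiny files

printf "read %d bytes in %d secs (%.1f MB/sec)\n",
       $total, $secs, $total/$secs/1048576;
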

Note: This approach is most efficient when every record needs to be processed in a batch-type environment. In a cold-start interactive setting (like CGI) it just adds to the overhead and doesn't improve performance. If we had written a db server daemon it would naturally be a speed gain to hold the data in memory so as to avoid waiting for the disk... up to a point. Beyond that point it makes more sense to have lots of available memory, and that point is the size of the data. Sometimes you may only want to cache a subset (an index, or maybe the core data).
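
As a sketch of what "cache a subset (an index)" might look like (the filename, the tab-delimited layout and the key column are all invented for the example), you can hold only a hash of byte offsets in memory and seek back into the file when a record is actually wanted:

open (IN,"afile.txt") || die "Cannot open afile.txt";
$pos = tell(IN);               # byte offset of the next record
while (defined($in = <IN>)) {
    chomp $in;
    ($key) = split(/\t/,$in);  # assumed layout: tab-delimited, key in column 1
    $offset{$key} = $pos;      # remember where the record starts, not its data
    $pos = tell(IN);
}
# IN is deliberately left open so records can be fetched later

sub getrecord {
    my ($key) = @_;
    return undef unless exists $offset{$key};
    seek (IN,$offset{$key},0); # 0 = seek from the start of the file
    my $rec = <IN>;
    chomp $rec;
    return $rec;
}
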

© 2004 Dominic J. Thoreau - this is http://www.thoreau-online.net/perlcache.html
Updated and uploaded Fri Dec 29 11:45:07 2006