Here's my stab at it: WideFinder in Java 6. Unfortunately, blogger is a really crappy way to publish this sort of thing, but here goes...
The idea is to treat the log file as a random-access file and use Java's NIO API to memory-map one chunk of it per worker thread. The file is processed on a line-by-line basis, but the chunks are split up based roughly on the file size / number of workers. This means that each worker's "chunk" probably doesn't begin or end on a line break. These edge cases (literally, the edges of the buffer) are resolved after all the worker threads have completed.
The initial implementation is really straight-forward. My first approach used ByteBuffer.asCharBuffer to treat the memory mapped chunk as character data. The problem was that 2 ASCII characters were getting packed into each 16-bit Java char, which meant that all the file data was appearing as CJK characters. It's totally sensible that Java would do this, but I didn't see a quick way around it. I'm going to take a closer look, though, because I think that the current implementation handles the data more than it needs to.
- The source file is a 4,000,000 line (178Mb) file where each line matches the regex specified by Tim in the Wide Finder article (see the source for wf.DataGenerator).
- Times are given as "number of workers"
x"elapsed (wall) time in seconds"
- Java VM is 1.6.0
- VM arguments are -Xmx1024M -Xms1024M.
- Figure out how to use the NIO CharBuffer with US-ASCII encoding... or at least something that lets me deal with the characters as something other than CJK chars. I confess that I'm frightfully ignorant of Unicode and character encodings. About once a year I think, "I gotta learn that stuff..." so I read up on it, and I get to the point where I think I understand it, but I never seem to apply it in my day job, so it's quickly forgotten.
- Run this bad boy through DTrace and see where the hot spots are... just a guess, I reckon Pattern.matches is burning up most of the CPU time.
- Some low-hanging fruit is probably to do a plaintext match on the initial, constant portion of the regex... ideally take advantage of the 64-bit architecture and compare the first 8 characters of the line all at once!
- Oh, yeah, also I should get around to ensuring that the program is actually correct... it's definitely at the point where it's useful to gather performance data but I think there may be a couple corner cases that it'd puke on.
- Back-port to Java 5 so I can develop/test on my MacBook (low priority)