Building a Linux Media Network, one step at a time

Friday, November 09, 2007


Tim Bray has recently published a series of articles about a project he calls Wide Finder. It pretty much amounts to a multi-threaded text processor for large files. The idea is to implement the requirements in a variety of languages and determine the strengths and weaknesses of each implementation in a highly-parallelized environment.

Here's my stab at it: WideFinder in Java 6. Unfortunately, blogger is a really crappy way to publish this sort of thing, but here goes...

The idea is to treat the log file as a random-access file and use Java's NIO API to memory-map one chunk of it per worker thread. The file is processed on a line-by-line basis, but the chunks are split up based roughly on the file size / number of workers. This means that each worker's "chunk" probably doesn't begin or end on a line break. These edge cases (literally, the edges of the buffer) are resolved after all the worker threads have completed.

The initial implementation is really straight-forward. My first approach used ByteBuffer.asCharBuffer to treat the memory mapped chunk as character data. The problem was that 2 ASCII characters were getting packed into each 16-bit Java char, which meant that all the file data was appearing as CJK characters. It's totally sensible that Java would do this, but I didn't see a quick way around it. I'm going to take a closer look, though, because I think that the current implementation handles the data more than it needs to.

Timing Data
  • The source file is a 4,000,000 line (178Mb) file where each line matches the regex specified by Tim in the Wide Finder article (see the source for wf.DataGenerator).
  • Times are given as "number of workers"x"elapsed (wall) time in seconds"
  • Java VM is 1.6.0
  • VM arguments are -Xmx1024M -Xms1024M.

AMD Athlon 64 1x2.2Ghz, 2.5Gb RAM, IO is nothing special (SATA something): 1x13.2 2x12.3 4x12.6
Sun T2000 24x1Ghz 8Gb RAM, IO is nothing special (stock 80Gb): 1x142.9 2x53.0 4x28.2 24x9.7
Intel Xeon 4x2.8Ghz, 3Gb RAM, some kind of SCSI I/O: 1x14.6 2x8.67 4x8.49

My 2Ghz Core 2 Duo MacBook had to be excluded from this test because the code is currently dependent on Java 6... very frustrating.

Instructions to Run

Once the code has been compiled, the command is:

java -cp <output directory> wf.WideFinder <log file> <regex> [num-workers]

Where "output directory" is where the classes were compiled to, "log file" is the path to the log file, "regex" is the regex to search for.  In this case, regex=='GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+)'.  If "num-workers" is not specified, the value returned by Runtime.getRuntime().availableProcessors() is used.

Next on the to-do list
  • Figure out how to use the NIO CharBuffer with US-ASCII encoding... or at least something that lets me deal with the characters as something other than CJK chars. I confess that I'm frightfully ignorant of Unicode and character encodings. About once a year I think, "I gotta learn that stuff..." so I read up on it, and I get to the point where I think I understand it, but I never seem to apply it in my day job, so it's quickly forgotten.
  • Run this bad boy through DTrace and see where the hot spots are... just a guess, I reckon Pattern.matches is burning up most of the CPU time.
  • Some low-hanging fruit is probably to do a plaintext match on the initial, constant portion of the regex... ideally take advantage of the 64-bit architecture and compare the first 8 characters of the line all at once!
  • Oh, yeah, also I should get around to ensuring that the program is actually correct... it's definitely at the point where it's useful to gather performance data but I think there may be a couple corner cases that it'd puke on.
  • Back-port to Java 5 so I can develop/test on my MacBook (low priority)
... Updated 11/9 21:17, removed inline source, Blogger was having a hissy fit about it.

... Updated 11/9 21:41, just realized that to run this against actual log data, you'd probably have to change m.matches to m.find at line 76 of  That will skew the performance numbers, you may be able to optimize the regex by putting a caret at the beginning.


Post a Comment

<< Home