Building a Linux Media Network, one step at a time

Thursday, November 22, 2007

WideFinder in Java6: revision3

Revision 3 of my WideFinder implementation is available here. I see a small performance gain when I stagger the thread starts by 50ms. I've tweaked the processing of the memory mapped "chunk" boundaries. Now it's 99.9999998% correct :(

Tuesday, November 13, 2007

WideFinder/Java6 Rev2

... The sum total of the next 6 paragraphs is, "I tested a theory without collecting performance data and my changes seemed to have no impact on performance"... Skip to the DTrace Analysis section if that doesn't sound like a fun read...

In the initial revision of the WideFinder, I couldn't get away from the thought that I was handling the data one too many times. Look at this snippet:

byte[] seq = new byte[ seek - _mbb.position() ];
_mbb.get( seq );
nextLine = new String( seq, LOG_CHARSET );

We're grabbing bytes out of the buffer, sticking them into a fresh byte array, and creating a new String object based on those bytes, using a specific charset. This is because CharBuffer treats each pair of bytes as a single 16-bit char, which means that the bytes are interpreted as non-English characters.

Creating new byte[]s and new Strings like that, in a pretty tight loop, is just asking for performance trouble. I wanted to get to the point where the regex could be applied straight to the ByteBuffer. Looking at the implementation of Charset.encode, it seemed like it suffered from the same sort of space-inefficiency as my original approach.

The solution was to wrap the ByteBuffer in a trivial implementation of CharSequence that indexed directly into the current line of the buffer. The only gotcha was to use the mark() and reset() methods of the buffer in the implementation of subSequence.
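Such a wrapper could look something like the following; this is a minimal sketch under my own naming, assuming single-byte (ASCII) log data, and it uses absolute offsets rather than the mark()/reset() approach the actual wf source takes:

```java
import java.nio.ByteBuffer;

// Sketch of a CharSequence view over a ByteBuffer, assuming single-byte
// log data. Class and field names are mine, not from the wf source.
final class ByteBufferCharSequence implements CharSequence {
    private final ByteBuffer buf;   // the backing buffer
    private final int start;        // offset of the current line
    private final int end;          // one past the last byte of the line

    ByteBufferCharSequence(ByteBuffer buf, int start, int end) {
        this.buf = buf;
        this.start = start;
        this.end = end;
    }

    public int length() {
        return end - start;
    }

    public char charAt(int index) {
        // One byte per character; mask to avoid sign extension.
        return (char) (buf.get(start + index) & 0xFF);
    }

    public CharSequence subSequence(int from, int to) {
        return new ByteBufferCharSequence(buf, start + from, start + to);
    }

    public String toString() {
        StringBuilder sb = new StringBuilder(length());
        for (int i = 0; i < length(); i++) sb.append(charAt(i));
        return sb.toString();
    }
}
```

Since Matcher accepts any CharSequence, the regex can then be applied to this view directly, with no intermediate byte[] or String per line.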

Unfortunately, my efforts produced only a minimal increase in performance (~2%) on either my single-core Athlon or the T2K. So much for "low-hanging fruit".

DTrace Analysis

I created a new revision of the Java 6 WideFinder with some tweaks to the ByteBuffer implementation. It's available here: wf_02.tar.gz. It turned out to be a few hours' effort for a very minimal performance gain.

With that little misadventure under my belt I decided to run the WideFinder through DTrace to see where the bulk of the time was being spent. DTrace is Sun's dynamic tracing tool; it allows you to inspect a bunch of different aspects of a program's performance as it is running. If you've never used it, you should really give it a try. It's pretty amazing, the stuff you get visibility into. I gather it's also available on OS X as the "Instruments" app.

This is the output of running the method-timings.d script against the WideFinder on a 100-line log file (produced by the DataGenerator). Times given are in microseconds and represent the total time spent in methods in classes of the given package:
sun/net/www/protocol/file         103
sun/net/www                       266
sun/net/www/protocol/jar          375
java/nio/channels                 474
java/util/concurrent/atomic       791
sun/security/action              1065
java/util/jar                    1731
java/util/zip                    3146
java/net                         3654
java/security                    3807
java/lang/ref                    5319
java/util/concurrent             7419
java/util/concurrent/locks       8329
java/lang/reflect               10365
sun/misc                        20678
java/nio/channels/spi           25753
sun/nio/ch                      32353
java/nio/charset               160325
java/util                      197794
sun/nio/cs                     204016
java/io                        240720
sun/reflect                    293949
wf                             464158
java/lang                     1061836
java/nio                      1596102
java/util/regex               2738455

The absolute values are not as meaningful as the relative values. The act of observing the program's performance has skewed the measurements (where's that damned cat!), but we hope that it's skewed the measurements equally for each package. From these data, we can see that the "big spenders" are java.util.regex (~39% of total time), java.nio (~25% of total time, with subpackages), and java.lang (~15% of total time).

Here are the 10 most expensive methods in java.nio (not including subpackages), also measured in microseconds:
CharBuffer.arrayOffset       43204
CharBuffer.<init>            44219
ByteBuffer.arrayOffset       51038
HeapCharBuffer.<init>        62420
CharBuffer.wrap             122109
Buffer.position             280288
Buffer.limit                286871
Buffer.checkIndex           375539
DirectByteBuffer.ix         485376
DirectByteBuffer.get       1084831

And here are the 10 most expensive methods in java.util.regex (although there seem to be a lot of very expensive methods in this package):
Pattern.escape                   165501
Pattern$CharProperty.<init>      180954
Pattern.peek                     184470
Pattern$8.isSatisfiedBy          199079
Pattern.isSupplementary          238639
Pattern.atom                     240097
Pattern$Slice.match              261159
Matcher.getSubSequence           282353
Pattern$BmpCharProperty.<init>   291799
Pattern$BmpCharProperty.match   1080132

Looks like, if there's low-hanging fruit, it's in the regex processing. A little regex optimization may go a long way.

Data Processing and the Gravel Biz

My Dad's in the gravel business. Ever watch The Flintstones? He's like Mr. Slate. His job is to get as much sand and gravel as possible out of a mountain and onto barges. From there, it floats down a river to a depot where (presumably) people are willing to pay for it.

The trick about the gravel business, as with any other commodity industry I guess, is that you pretty much live or die based on 2 factors:
  • How often you touch the product.
  • How much it costs you each time you touch it.

In my Dad's case, they can't drive loaded trucks down the steep hill of their quarry to get to the river bank. The road is too narrow to allow the trucks to pass each other. So this is what they do:
  • A dumptruck pulls up to a big excavator, which is scraping away at the side of the mountain pretty much non-stop.
  • The excavator fills the dump truck. Takes somewhere around 5-6 scoops, I think.
  • The dump truck backs into position at the top of the cliff and waits for the all-clear to dump its load. You don't want to rush that task or you wind up with a lot of expensive metal at the bottom of a cliff.
  • While that dump truck is getting ready to dump its load, another dump truck (from the pool, see where I'm going with this?) pulls up to the excavator and begins to receive its load.

It's not a perfect setup: there is a finite amount of room on the plateau where the excavation takes place, so you can't fit an unlimited number of dumptrucks in there. Sometimes a dumptruck is forced to wait while the excavator fills up another truck. We would not call this problem "embarrassingly parallel", but there is definitely a producer-consumer pattern here.

But a similar pattern plays itself out at the bottom of the cliff: loaders scoop up the dumped gravel and deposit it on a conveyor belt, where it is fed into a crusher. From there it goes into another gravel pile, then onto a conveyor belt and a barge, and from there to the sales facility, where another loader unloads the barge. All told, I think they handle the product 4 times.

The bottleneck here, of course, is the road. At some point as the production capacity up on the plateau expands, the capital and operational expense of widening the road will become less than the cost of handling all that sand and gravel one extra time.

What I find interesting about this problem is that the cost of handling a single piece of gravel is infinitesimally small. But when you multiply that cost by several trillion, it adds up to real dollars and cents. And so it is with the Wide Finder. Handling a single byte's worth of data, or a line's worth of data, is so "cheap" we hardly ever think about it. But handling a Gb's worth of data, or 10 million lines' worth, now you're talking real money. Because processing time, especially in a batch environment like this, is money.

Meh, some neat parallels there, is all I'm saying.

Friday, November 09, 2007


Tim Bray has recently published a series of articles about a project he calls Wide Finder. It pretty much amounts to a multi-threaded text processor for large files. The idea is to implement the requirements in a variety of languages and determine the strengths and weaknesses of each implementation in a highly-parallelized environment.

Here's my stab at it: WideFinder in Java 6. Unfortunately, Blogger is a really crappy way to publish this sort of thing, but here goes...

The idea is to treat the log file as a random-access file and use Java's NIO API to memory-map one chunk of it per worker thread. The file is processed on a line-by-line basis, but the chunks are split up based roughly on the file size / number of workers. This means that each worker's "chunk" probably doesn't begin or end on a line break. These edge cases (literally, the edges of the buffer) are resolved after all the worker threads have completed.
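The chunking described above can be sketched roughly like this; the naming is mine, not the actual wf source, and the edge-reconciliation step is left out:

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Rough sketch of splitting a log file into one memory-mapped chunk per
// worker thread. Chunk edges will usually fall mid-line; those edge cases
// must be reconciled after the workers finish, as described above.
public class ChunkMapper {
    public static MappedByteBuffer[] mapChunks(String path, int workers) throws Exception {
        RandomAccessFile raf = new RandomAccessFile(path, "r");
        try {
            FileChannel ch = raf.getChannel();
            long size = ch.size();
            long chunk = size / workers;
            MappedByteBuffer[] chunks = new MappedByteBuffer[workers];
            for (int i = 0; i < workers; i++) {
                long start = i * chunk;
                // The last chunk absorbs any remainder from the division.
                long len = (i == workers - 1) ? size - start : chunk;
                chunks[i] = ch.map(FileChannel.MapMode.READ_ONLY, start, len);
            }
            return chunks;
        } finally {
            raf.close();
        }
    }
}
```

The mapped buffers stay valid after the channel is closed, so each worker can hold on to its chunk for the duration of the run.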

The initial implementation is really straightforward. My first approach used ByteBuffer.asCharBuffer to treat the memory-mapped chunk as character data. The problem was that 2 ASCII bytes were getting packed into each 16-bit Java char, which meant that all the file data was appearing as CJK characters. It's totally sensible that Java would do this, but I didn't see a quick way around it. I'm going to take a closer look, though, because I think that the current implementation handles the data more than it needs to.
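A tiny demonstration of that effect (my own demo code, not from wf): two ASCII bytes fuse into one 16-bit char, which happens to land in a CJK block.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

// Demonstrates the asCharBuffer problem: two ASCII bytes are combined
// into a single 16-bit char by the (big-endian by default) char view.
public class CharBufferDemo {
    public static char firstChar(byte[] ascii) {
        CharBuffer cb = ByteBuffer.wrap(ascii).asCharBuffer();
        return cb.get(0); // 'G' (0x47) and 'E' (0x45) fuse into '\u4745'
    }
}
```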

Timing Data
  • The source file is a 4,000,000 line (178Mb) file where each line matches the regex specified by Tim in the Wide Finder article (see the source for wf.DataGenerator).
  • Times are given as "number of workers"x"elapsed (wall) time in seconds"
  • Java VM is 1.6.0
  • VM arguments are -Xmx1024M -Xms1024M.

AMD Athlon 64 1x2.2Ghz, 2.5Gb RAM, IO is nothing special (SATA something): 1x13.2 2x12.3 4x12.6
Sun T2000 24x1Ghz 8Gb RAM, IO is nothing special (stock 80Gb): 1x142.9 2x53.0 4x28.2 24x9.7
Intel Xeon 4x2.8Ghz, 3Gb RAM, some kind of SCSI I/O: 1x14.6 2x8.67 4x8.49

My 2Ghz Core 2 Duo MacBook had to be excluded from this test because the code is currently dependent on Java 6... very frustrating.

Instructions to Run

Once the code has been compiled, the command is:

java -cp <output directory> wf.WideFinder <log file> <regex> [num-workers]

Where "output directory" is where the classes were compiled to, "log file" is the path to the log file, and "regex" is the regex to search for.  In this case, regex=='GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+)'.  If "num-workers" is not specified, the value returned by Runtime.getRuntime().availableProcessors() is used.

Next on the to-do list
  • Figure out how to use the NIO CharBuffer with US-ASCII encoding... or at least something that lets me deal with the characters as something other than CJK chars. I confess that I'm frightfully ignorant of Unicode and character encodings. About once a year I think, "I gotta learn that stuff..." so I read up on it, and I get to the point where I think I understand it, but I never seem to apply it in my day job, so it's quickly forgotten.
  • Run this bad boy through DTrace and see where the hot spots are... just a guess, I reckon Pattern.matches is burning up most of the CPU time.
  • Some low-hanging fruit is probably to do a plaintext match on the initial, constant portion of the regex... ideally take advantage of the 64-bit architecture and compare the first 8 characters of the line all at once!
  • Oh, yeah, also I should get around to ensuring that the program is actually correct... it's definitely at the point where it's useful to gather performance data but I think there may be a couple corner cases that it'd puke on.
  • Back-port to Java 5 so I can develop/test on my MacBook (low priority)
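The plaintext-prefix item on that list might look something like this; a sketch under my own naming, byte-by-byte rather than the 8-characters-at-once trick:

```java
import java.nio.ByteBuffer;

// Sketch of the "plaintext match first" idea: check the constant prefix
// of the pattern cheaply before paying for a full regex match. Names are
// mine, not from the wf source.
public class PrefixFilter {
    private static final byte[] PREFIX = "GET /ongoing/When/".getBytes();

    public static boolean startsWithPrefix(ByteBuffer buf, int lineStart) {
        if (buf.limit() - lineStart < PREFIX.length) return false;
        for (int i = 0; i < PREFIX.length; i++) {
            if (buf.get(lineStart + i) != PREFIX[i]) return false;
        }
        return true;
    }
}
```

Lines that fail this check never reach the Matcher at all, which could pay off given how much time java.util.regex is burning.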
... Updated 11/9 21:17, removed inline source, Blogger was having a hissy fit about it.

... Updated 11/9 21:41, just realized that to run this against actual log data, you'd probably have to change m.matches to m.find at line 76 of  That will skew the performance numbers; you may be able to optimize the regex by putting a caret at the beginning.