Thursday, November 22, 2007
Tuesday, November 13, 2007
In the initial revision of the WideFinder, I couldn't get away from the thought that I was handling the data one too many times. If you look at WideFinder.run (WideFinder.java:57):
    byte[] seq = new byte[ seek - _mbb.position() ];
    _mbb.get( seq );
    nextLine = new String( seq, LOG_CHARSET );

We're grabbing bytes out of the buffer, sticking them into a fresh byte array, and creating a new String object based on those bytes, using a specific charset. This is necessary because viewing the buffer as a CharBuffer packs each pair of bytes into a single 16-bit char, which means that the bytes are interpreted as non-English characters.
Creating new bytes and new Strings like that, in a pretty tight loop, is just asking for performance trouble. I wanted to get to the point where the regex could be applied straight to the ByteBuffer. Looking at the implementation of Charset.encode, it seemed like it suffered from the same sort of space-inefficiency as my original approach.
The solution was to wrap the ByteBuffer in a trivial implementation of CharSequence that indexed directly into the current line of the buffer. The only gotcha was to use the mark() and reset() methods of the buffer in the implementation of subSequence.
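A minimal sketch of that wrapper (the class and field names here are illustrative, not the actual wf_02 code; this version sidesteps the mark()/reset() gotcha by indexing into the buffer with absolute gets instead):

    import java.nio.ByteBuffer;

    // Expose a slice of a ByteBuffer as a CharSequence so a regex Matcher
    // can run directly against the mapped bytes -- no intermediate byte[]
    // copy and no new String per line.
    final class ByteBufferCharSequence implements CharSequence {
        private final ByteBuffer buf;
        private final int start, end;   // bounds of the current line

        ByteBufferCharSequence(ByteBuffer buf, int start, int end) {
            this.buf = buf;
            this.start = start;
            this.end = end;
        }

        public int length() { return end - start; }

        // Each byte is one ASCII character, so no decoding step is needed.
        public char charAt(int index) {
            return (char) (buf.get(start + index) & 0xFF);
        }

        public CharSequence subSequence(int from, int to) {
            return new ByteBufferCharSequence(buf, start + from, start + to);
        }

        public String toString() {
            StringBuilder sb = new StringBuilder(length());
            for (int i = 0; i < length(); i++) sb.append(charAt(i));
            return sb.toString();
        }
    }

Since Pattern.matcher takes any CharSequence, an instance of this class can be handed straight to the regex engine.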
Unfortunately, my efforts produced only a minimal increase in performance (~2%) on either my single-core Athlon or the T2K. So much for "low-hanging fruit".
I created a new revision of the Java 6 WideFinder with some tweaks to the ByteBuffer implementation. It's available here: wf_02.tar.gz. It turned out to be a few hours' effort for a very minimal performance gain.
With that little misadventure under my belt, I decided to run the WideFinder through DTrace to see where the bulk of the time was being spent. DTrace is Sun's dynamic tracing tool; it allows you to inspect a bunch of different aspects of a program's performance as it is running. If you've never used it, you should really give it a try. It's pretty amazing, the stuff you get visibility into. I gather it's also available on OS X as the "Instruments" app.
This is the output of running the method-timings.d script against the WideFinder on a 100-line log file (produced by the DataGenerator). Times given are in microseconds and represent the total time spent in methods in classes of the given package:
The absolute values are not as meaningful as the relative values. The act of observing the program's performance has skewed the measurements (where's that damned cat!), but we hope that it's skewed the measurements equally for each package. From these data, we can see that the "big spenders" are java.util.regex (~39% of total time), java.nio (~25% of total time, with subpackages), and java.lang (~15% of total time).
Here are the 10 most expensive methods in java.nio (not including subpackages), also measured in microseconds:
CharBuffer.arrayOffset    43204

And here are the 10 most expensive methods in java.util.regex (although there seem to be a lot of very expensive methods in this package):
Pattern.escape    165501

Looks like, if there's low-hanging fruit, it's in the regex processing. A little regex optimization may go a long way.
Data Processing and the Gravel Biz
The trick about the gravel business, as with any other commodity industry I guess, is that you pretty much live or die based on 2 factors:
- How often you touch the product.
- How much it costs you each time you touch it.
In my Dad's case, they can't drive loaded trucks down the steep hill of their quarry to get to the river bank. The road is too narrow to allow the trucks to pass each other. So this is what they do:
- A dump truck pulls up to a big excavator, which is scraping away at the side of the mountain pretty much non-stop.
- The excavator fills the dump truck. Takes somewhere around 5-6 scoops, I think.
- The dump truck backs into position at the top of the cliff and waits for the all-clear to dump its load. You don't want to rush that task or you wind up with a lot of expensive metal at the bottom of a cliff.
- While that dump truck is getting ready to dump its load, another dump truck (from the pool, see where I'm going with this?) pulls up to the excavator and begins to receive its load.
It's not a perfect setup: there is a finite amount of room on the plateau where the excavation takes place, so you can't fit an unlimited number of dump trucks in there. Sometimes a dump truck is forced to wait while the excavator fills up another truck. We would not call this problem "embarrassingly parallel", but there is definitely a producer-consumer pattern here.
But a similar pattern plays itself out at the bottom of the cliff: loaders scoop up the dumped gravel and deposit it on a conveyor belt, which feeds it into a crusher; from there it goes into another gravel pile, from there onto a conveyor belt and a barge, and from there to the sales facility, where another loader unloads the barge. All told, I think they handle the product 4 times.
The bottleneck here, of course, is the road. At some point as the production capacity up on the plateau expands, the capital and operational expense of widening the road will become less than the cost of handling all that sand and gravel one extra time.
What I find interesting about this problem is that the cost of handling a single piece of gravel is infinitesimally small. But when you multiply that cost by several trillion, it adds up to real dollars and cents. And so it is with the Wide Finder. Handling a single byte's worth of data, or a line's worth of data, is so "cheap" we hardly ever think about it. But handling a GB's worth of data, or 10 million lines' worth, now you're talking real money. Because processing time, especially in a batch environment like this, is money.
Meh, some neat parallels there, is all I'm saying.
Friday, November 09, 2007
Here's my stab at it: WideFinder in Java 6. Unfortunately, blogger is a really crappy way to publish this sort of thing, but here goes...
The idea is to treat the log file as a random-access file and use Java's NIO API to memory-map one chunk of it per worker thread. The file is processed on a line-by-line basis, but the chunks are split up based roughly on the file size / number of workers. This means that each worker's "chunk" probably doesn't begin or end on a line break. These edge cases (literally, the edges of the buffer) are resolved after all the worker threads have completed.
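The chunking scheme can be sketched roughly like this (a hypothetical simplification of the real code; class and method names are mine, and the edge-fragment stitching is omitted):

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    // Map roughly fileSize/workers bytes per worker thread. Chunk
    // boundaries generally fall mid-line; the partial lines at the edges
    // get resolved after the workers finish.
    public class ChunkMapper {
        public static MappedByteBuffer[] map(String path, int workers) throws Exception {
            RandomAccessFile raf = new RandomAccessFile(path, "r");
            FileChannel ch = raf.getChannel();
            long size = ch.size();
            long chunk = size / workers;
            MappedByteBuffer[] maps = new MappedByteBuffer[workers];
            for (int i = 0; i < workers; i++) {
                long start = i * chunk;
                // the last worker picks up the remainder
                long len = (i == workers - 1) ? size - start : chunk;
                maps[i] = ch.map(FileChannel.MapMode.READ_ONLY, start, len);
            }
            raf.close();  // the mappings remain valid after the channel closes
            return maps;
        }
    }

Each worker then scans its own MappedByteBuffer independently, which is what makes the approach parallelize cleanly.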
The initial implementation is really straightforward. My first approach used ByteBuffer.asCharBuffer to treat the memory-mapped chunk as character data. The problem was that 2 ASCII characters were getting packed into each 16-bit Java char, which meant that all the file data was appearing as CJK characters. It's totally sensible that Java would do this, but I didn't see a quick way around it. I'm going to take a closer look, though, because I think that the current implementation handles the data more than it needs to.
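A tiny demo of the effect (the sample text is made up; any ASCII bytes show the same behavior):

    import java.nio.ByteBuffer;
    import java.nio.charset.Charset;

    public class CharsetDemo {
        public static void main(String[] args) {
            Charset ascii = Charset.forName("US-ASCII");
            ByteBuffer buf = ByteBuffer.wrap("GET /ongoing".getBytes(ascii));
            // asCharBuffer() pairs up bytes into 16-bit chars: 'G' (0x47)
            // and 'E' (0x45) fuse into the char \u4745, a CJK-range character.
            System.out.println(buf.asCharBuffer().charAt(0));
            // Decoding with an explicit charset yields the expected text.
            System.out.println(ascii.decode(buf));
        }
    }

So the fix is to decode (or index byte-by-byte) with a known single-byte charset rather than reinterpreting the raw buffer as UTF-16.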
- The source file is a 4,000,000 line (178Mb) file where each line matches the regex specified by Tim in the Wide Finder article (see the source for wf.DataGenerator).
- Times are given as "number of workers" x "elapsed (wall) time in seconds"
- Java VM is 1.6.0
- VM arguments are -Xmx1024M -Xms1024M.
- Figure out how to use the NIO CharBuffer with US-ASCII encoding... or at least something that lets me deal with the characters as something other than CJK chars. I confess that I'm frightfully ignorant of Unicode and character encodings. About once a year I think, "I gotta learn that stuff..." so I read up on it, and I get to the point where I think I understand it, but I never seem to apply it in my day job, so it's quickly forgotten.
- Run this bad boy through DTrace and see where the hot spots are... just a guess, I reckon Pattern.matches is burning up most of the CPU time.
- Some low-hanging fruit is probably to do a plaintext match on the initial, constant portion of the regex... ideally take advantage of the 64-bit architecture and compare the first 8 characters of the line all at once!
- Oh, yeah, also I should get around to ensuring that the program is actually correct... it's definitely at the point where it's useful to gather performance data but I think there may be a couple corner cases that it'd puke on.
- Back-port to Java 5 so I can develop/test on my MacBook (low priority)
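The 8-characters-at-once idea from the list above might look something like this (a sketch only; it assumes the pattern's constant prefix starts with "GET /ong", lines are at least 8 bytes long, and the input is ASCII):

    import java.nio.ByteBuffer;

    public class PrefixCheck {
        // Pack 8 bytes into a long, big-endian, to match ByteBuffer.getLong.
        private static long pack(byte[] b) {
            long v = 0;
            for (int i = 0; i < 8; i++) v = (v << 8) | (b[i] & 0xFF);
            return v;
        }

        // First 8 bytes of the pattern's constant prefix (assumed here).
        static final long PREFIX = pack("GET /ong".getBytes());

        // One 64-bit compare instead of eight per-character compares.
        static boolean startsWithPrefix(ByteBuffer buf, int pos) {
            return buf.getLong(pos) == PREFIX;
        }
    }

Only lines that pass this cheap check would be handed to the full regex, which should cut Pattern/Matcher work dramatically if most lines don't match.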