Building a Linux Media Network, one step at a time

Friday, July 27, 2007

Comparing Java Performance on Multi-Core CPUs

The tests in this article measure integer (non-floating-point) Java performance on a variety of CPU architectures. In summary: Java can take advantage of multiple cores to reduce CPU contention, but in some cases not as well as you'd expect.
I tested on 4 hardware configurations:

  • Athlon 64 3500+: 1 CPU, 1 core, 2.2 GHz. Running Linux kernel 2.6.13, Java 1.6.0_02. This was used as the baseline.

  • MacBook Core 2 Duo: 1 CPU, 2 cores, 2.2 GHz. Running OS X kernel version 8.10.1, Java 1.6.0 (b88).

  • Dual Opteron 248: 2 CPUs, 1 core each, 2.2 GHz per core. Running Linux kernel 2.6.13, Java 1.6.0_02.

  • Sun T2000: 1 CPU, 6 cores, hardware support for 4 threads per core (24 hardware threads). Running SunOS kernel 5.10 (Solaris 10).
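What bounds parallelism in this test is the number of hardware threads the JVM can schedule on, not the number of sockets. A minimal sketch for recording that number alongside each run, using only `Runtime.availableProcessors()` and standard system properties; the class name `CpuInfo` is hypothetical. For the machines above one would expect 1 (Athlon), 2 (MacBook, Dual Opteron) and 24 (T2000), but those are expectations, not measurements.

```java
// Hypothetical helper: reports what the JVM sees on the current machine.
public class CpuInfo
{
    // Number of processors the JVM may use; on multithreaded cores
    // this counts hardware threads, not physical cores.
    static int hardwareThreads()
    {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main( String[] args )
    {
        System.out.println( "availableProcessors: " + hardwareThreads() );
        System.out.println( "os.name:      " + System.getProperty( "os.name" ) );
        System.out.println( "os.version:   " + System.getProperty( "os.version" ) );
        System.out.println( "os.arch:      " + System.getProperty( "os.arch" ) );
        System.out.println( "java.version: " + System.getProperty( "java.version" ) );
    }
}
```

Running `java CpuInfo` before each test series documents the configuration that the timings belong to.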


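This is the most trivial test class I could come up with, and it still turned out more complex than I expected. It starts one or more threads, holds them at a CyclicBarrier so they all begin at the same moment, and has each one independently compute the 4,000 probable primes that follow Long.MAX_VALUE. A second barrier records the time at which the last thread finishes. Because every thread does the same fixed amount of work, total work grows with the number of threads. The task is designed to produce high CPU contention and low I/O and memory contention.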

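```java
package ThreadTester;

import java.math.BigInteger;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

/**
 * Starts N threads that each compute NUM_PRIMES consecutive probable primes
 * above Long.MAX_VALUE. Prints "<threads> <startMillis> <finishMillis>".
 */
public class ThreadTester
{
    private static final long START = Long.MAX_VALUE;
    private static final int NUM_PRIMES = 4000;

    private final CyclicBarrier _start;   // all threads begin together
    private final CyclicBarrier _finish;  // finish time taken after the slowest thread
    private final int _numThreads;

    public ThreadTester( int numThreads )
    {
        _numThreads = numThreads;
        // The barrier action runs once, in the last thread to arrive: record start time.
        _start = new CyclicBarrier( _numThreads, new Runnable()
        {
            public void run()
            {
                System.out.print( _numThreads + " " + System.currentTimeMillis() + " " );
            }
        });
        // Runs once every thread has finished its work: record finish time.
        _finish = new CyclicBarrier( _numThreads, new Runnable()
        {
            public void run()
            {
                System.out.println( System.currentTimeMillis() );
            }
        });
    }

    public static void main( String[] args )
    {
        if ( args.length != 1 )
        {
            System.err.println( "usage: java ThreadTester.ThreadTester <numThreads>" );
            System.exit( 1 );
        }
        int numThreads = Integer.parseInt( args[ 0 ] );
        new ThreadTester( numThreads ).go();
    }

    private void go()
    {
        for ( int i = 0; i < _numThreads; i++ )
        {
            new Thread( new PrimeFinder() ).start();
        }
    }

    private class PrimeFinder implements Runnable
    {
        // Each thread owns its BigInteger, so there is no shared mutable state.
        private BigInteger _bigNum = BigInteger.valueOf( START );

        public void run()
        {
            try
            {
                _start.await();
                // Pure CPU work: primality testing on 64-bit-sized integers.
                for ( int i = 0; i < NUM_PRIMES; i++ )
                {
                    _bigNum = _bigNum.nextProbablePrime();
                }
            }
            catch ( InterruptedException e )
            {
                Thread.currentThread().interrupt();
            }
            catch ( BrokenBarrierException e )
            {
                // Another thread failed at the start barrier; still report at finish.
            }
            finally
            {
                try
                {
                    _finish.await();
                }
                catch ( InterruptedException e )
                {
                    Thread.currentThread().interrupt();
                }
                catch ( BrokenBarrierException e )
                {
                    // The timing for this run is lost; nothing else to do.
                }
            }
        }
    }
}
```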
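The same workload can also be driven from a `java.util.concurrent` `ExecutorService` and timed in-process with `System.nanoTime()`, which avoids launching a new JVM for every thread count. This is a hypothetical variant, not the code that produced the numbers in this post; `PoolTimer`, `findPrimes`, and `timeRun` are illustrative names.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolTimer
{
    // Same work as PrimeFinder: `count` consecutive probable primes above `start`.
    static BigInteger findPrimes( long start, int count )
    {
        BigInteger n = BigInteger.valueOf( start );
        for ( int i = 0; i < count; i++ )
        {
            n = n.nextProbablePrime();
        }
        return n;   // returned so the JIT cannot treat the loop as dead code
    }

    // Runs numThreads copies of the workload at once and returns the wall-clock
    // seconds from the common start until the last copy finishes.
    static double timeRun( int numThreads, final long start, final int count )
            throws InterruptedException, ExecutionException
    {
        ExecutorService pool = Executors.newFixedThreadPool( numThreads );
        final CountDownLatch ready = new CountDownLatch( numThreads );
        final CountDownLatch go = new CountDownLatch( 1 );
        try
        {
            List<Future<BigInteger>> results = new ArrayList<Future<BigInteger>>();
            for ( int i = 0; i < numThreads; i++ )
            {
                results.add( pool.submit( new Callable<BigInteger>()
                {
                    public BigInteger call() throws InterruptedException
                    {
                        ready.countDown();   // this pool thread is up
                        go.await();          // block until the clock starts
                        return findPrimes( start, count );
                    }
                }));
            }
            ready.await();                   // every task is parked on `go`
            long t0 = System.nanoTime();
            go.countDown();                  // release all tasks together
            for ( Future<BigInteger> f : results )
            {
                f.get();                     // waits, and rethrows any task failure
            }
            return ( System.nanoTime() - t0 ) / 1e9;
        }
        finally
        {
            pool.shutdown();
        }
    }

    public static void main( String[] args ) throws Exception
    {
        for ( int n = 1; n <= 10; n++ )
        {
            System.out.printf( "%d %.2f%n", n, timeRun( n, Long.MAX_VALUE, 4000 ) );
        }
    }
}
```

Two latches replace the two barriers here so that the main thread, not a worker, owns both timestamps; thread creation is excluded from the measurement, as in the barrier version.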

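The raw data follows. Absolute times from different architectures should not be compared to each other, since clock speed, OS, and JVM all differ; what matters is the shape of each series. Each thread does the same fixed amount of work, so while free execution units remain, the elapsed time stays roughly flat as threads are added. Once the elapsed time (in seconds) starts growing linearly with the number of threads, the threads are contending for the CPU.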
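To make "grows linearly" concrete, each row's elapsed time can be divided by the 1-thread time and compared with an idealized model. The model is an assumption, not something the test measures: with c hardware threads and n equal, independent CPU-bound jobs, the run should take ceil(n / c) times as long as a single job. `Scaling`, `elapsedSeconds`, and `idealSlowdown` are hypothetical names; the input rows are the first two MacBook rows from the raw data.

```java
public class Scaling
{
    // Elapsed seconds between two System.currentTimeMillis() stamps.
    static double elapsedSeconds( long startMillis, long endMillis )
    {
        return ( endMillis - startMillis ) / 1000.0;
    }

    // Assumed ideal: n equal jobs on c hardware threads take ceil(n / c) rounds.
    static int idealSlowdown( int n, int c )
    {
        return ( n + c - 1 ) / c;
    }

    public static void main( String[] args )
    {
        // MacBook rows: threads, start millis, end millis.
        long[][] rows = {
            { 1, 1185596894768L, 1185596901822L },
            { 2, 1185596902131L, 1185596913471L },
        };
        int cores = 2;
        double t1 = elapsedSeconds( rows[ 0 ][ 1 ], rows[ 0 ][ 2 ] );
        for ( long[] r : rows )
        {
            int n = (int) r[ 0 ];
            double t = elapsedSeconds( r[ 1 ], r[ 2 ] );
            System.out.printf( "%d threads: %.2f s, observed %.2fx, ideal %dx%n",
                               n, t, t / t1, idealSlowdown( n, cores ) );
        }
    }
}
```

For the MacBook, two threads on two cores give an observed ratio of about 1.61 where the model predicts 1; on the single-core Athlon the model predicts n, which the data tracks closely.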
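The Sun T2000 server clearly exhibits the best thread utilization: going from 1 to 10 threads raised its elapsed time only from 31.20 s to 40.06 s (about 28%), even though its single-thread time is more than four times the Athlon's. This is unsurprising given the 24 hardware threads available on the Niagara processor. The two-core machines are where the results fall short of expectations: with 2 threads on 2 cores the ideal elapsed time equals the 1-thread time, yet the MacBook took 11.34 s versus 7.05 s (about 61% longer) and the Dual Opteron took 8.32 s versus 6.16 s (about 35% longer). Note that had this test involved floating-point math, contention for the Niagara's single FPU among its 24 hardware threads would be intense.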
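The floating-point remark is a prediction, not something this test measured. One hypothetical way to check it would be to replace the `nextProbablePrime()` loop in `PrimeFinder.run()` with a double-precision loop like the one below; `FloatWork`, `floatWork`, and the iteration count are illustrative choices, and how heavily the JIT-compiled loop would load the T2000's shared FPU is an assumption.

```java
// Hypothetical floating-point workload for comparison with the integer test.
public class FloatWork
{
    // Double-precision sqrt and divide in a tight loop. The sum is returned
    // so the JIT cannot discard the loop as dead code.
    static double floatWork( int iterations )
    {
        double sum = 0.0;
        for ( int i = 1; i <= iterations; i++ )
        {
            sum += Math.sqrt( i ) / ( i + 0.5 );
        }
        return sum;
    }

    public static void main( String[] args )
    {
        long t0 = System.nanoTime();
        double r = floatWork( 50000000 );
        long t1 = System.nanoTime();
        System.out.printf( "result %.6f in %.2f s%n", r, ( t1 - t0 ) / 1e9 );
    }
}
```

If the prediction holds, the T2000's curve for this workload should start climbing at far fewer threads than in the integer test.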

Athlon 64 3500+
Threads  Time (s)  Start (epoch ms)  End (epoch ms)
1        6.81      1185600041009     1185600047814
2        13.58     1185600047914     1185600061492
3        20.30     1185600061558     1185600081853
4        26.90     1185600081925     1185600108820
5        33.34     1185600108906     1185600142247
6        40.13     1185600142336     1185600182463
7        46.84     1185600182546     1185600229390
8        54.85     1185600229477     1185600284327
9        60.45     1185600284394     1185600344845
10       65.59     1185600344954     1185600410545

MacBook Core 2 Duo
Threads  Time (s)  Start (epoch ms)  End (epoch ms)
1        7.05      1185596894768     1185596901822
2        11.34     1185596902131     1185596913471
3        17.89     1185596914281     1185596932169
4        25.19     1185596932657     1185596957846
5        31.64     1185596958512     1185596990148
6        37.19     1185596990764     1185597027957
7        42.94     1185597028423     1185597071366
8        48.03     1185597071971     1185597120005
9        54.92     1185597120659     1185597175582
10       59.95     1185597175905     1185597235859

Dual Opteron 248
Threads  Time (s)  Start (epoch ms)  End (epoch ms)
1        6.16      1185598190089     1185598196249
2        8.32      1185598196398     1185598204713
3        15.57     1185598204849     1185598220423
4        21.08     1185598220564     1185598241640
5        25.96     1185598241766     1185598267726
6        30.86     1185598267868     1185598298726
7        35.15     1185598298870     1185598334015
8        41.11     1185598334145     1185598375253
9        45.52     1185598375373     1185598420891
10       53.08     1185598421038     1185598474117

Sun T2000
Threads  Time (s)  Start (epoch ms)  End (epoch ms)
1        31.20     1185597917415     1185597948610
2        31.47     1185597949296     1185597980763
3        32.47     1185597981545     1185598014017
4        34.00     1185598014867     1185598048869
5        32.84     1185598049745     1185598082581
6        34.11     1185598083555     1185598117668
7        35.38     1185598118647     1185598154031
8        37.43     1185598154995     1185598192425
9        38.97     1185598193408     1185598232373
10       40.06     1185598233337     1185598273396