Performance
In order to illustrate the potential advantages of jbarrier, we present some performance results measured using a simple test program. The test program ("BarrierExample") is shipped with the library, so you can run your own tests for comparison or improve the program if you like. The results shown were obtained with JDK 6 Update 21 on a Dell PowerEdge R905 equipped with 4 Quad Core AMD Opteron 2.2GHz CPUs and running Red Hat Enterprise Linux Server edition.
Comparison of jbarrier to CyclicBarrier
First of all, we compare the jbarrier algorithms to the standard CyclicBarrier in two test scenarios: one with only a very small workload for each thread in each iteration of the barrier, and one with more expensive computations in each iteration. The following command starts the first test, where N has to be replaced by the number of threads to use.
java edu.bonn.cs.net.jbarrier.examples.BarrierExample b ALL 100 16 1000000 N acst
Briefly put, this test compares the different barriers by running 100 experiments for each barrier and measuring their execution times. One experiment consists of running the threads for 1000000 iterations and synchronizing them at the barrier at the end of each iteration. The parameter 16 specifies the "workload", which is very small. For the other parameters, please consult the documentation of the test program. In the following, each graph shows, for every barrier, the average execution time over the 100 experiments of the respective test.
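To make the structure of one experiment concrete, the following sketch shows the per-thread loop that this description corresponds to. It uses the standard java.util.concurrent.CyclicBarrier (the baseline of this comparison); the jbarrier barriers are used analogously through their own await methods. The thread count, iteration count, and dummy workload are illustrative stand-ins for the command-line parameters, not code taken from the test program itself.

import java.util.concurrent.CyclicBarrier;

// Simplified sketch of the measured pattern: each thread performs a small
// workload per iteration and then waits at the barrier. Shown here with the
// standard CyclicBarrier; the jbarrier barriers are used analogously.
public class BarrierLoopSketch {

    public static void main(String[] args) throws InterruptedException {
        final int numThreads = 4;         // "N" in the command line
        final int iterations = 1_000_000; // number of barrier episodes
        final int workload = 16;          // dummy work per iteration

        final CyclicBarrier barrier = new CyclicBarrier(numThreads);
        Thread[] threads = new Thread[numThreads];

        for (int t = 0; t < numThreads; t++) {
            threads[t] = new Thread(() -> {
                double dummy = 0.0;
                for (int i = 0; i < iterations; i++) {
                    for (int w = 0; w < workload; w++) {
                        dummy += Math.sqrt(w + 1); // stand-in workload
                    }
                    try {
                        barrier.await(); // synchronize at the end of each iteration
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
    }
}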
The graph shows that the CyclicBarrier fails to achieve any speedup over the sequential (i.e., single-threaded without any barrier synchronization) execution of the same workload. In fact, the synchronization overhead slows down the parallel computation so much that it takes much longer than the sequential execution. In contrast, the jbarrier algorithms achieve some speedup over sequential execution, with the tree-based barriers scaling better than the central barrier.
Many applications generate more computational workload than this first example; otherwise, parallelization wouldn't be an attractive option at all. In our second test, the workload was therefore increased from 16 to 16000 (and the number of iterations was reduced to 100000, simply to obtain results more quickly):
java edu.bonn.cs.net.jbarrier.examples.BarrierExample b ALL 100 16000 100000 N acst
As is to be expected, all barrier algorithms now achieve a much higher speedup. However, the CyclicBarrier does not scale well even in such a favourable scenario, reaching a speedup of almost 5 for 8 threads but then dropping to only 2.5 for 16 threads. For the jbarrier algorithms, speedup is almost linear, reaching 7.7 for 8 threads and 15.4 for 16 threads.
Comparison of the jbarrier reduction barriers
The jbarrier algorithms also support global reductions, in two different ways. First of all, specialized variants of the barriers exist for several popular reductions (currently: minimum, maximum, and sum) and the four primitive types int, long, float, and double. We compare the performance of the reduction barriers (more specifically, when they perform a minimum reduction of floats) by running
java edu.bonn.cs.net.jbarrier.examples.BarrierExample r ALL 100 16 1000000 N ast
Note that the workload has again been reduced to the initial value of 16 in order to provide a stress test for the barriers.
While there is no noticeable difference between the barrier algorithms for up to 8 threads, the results for 16 threads show that the central barrier does not scale as well as the others. The reason is simple: with the tree-based barriers, parts of the reduction are performed in parallel (in different branches of the tree), whereas with the central barrier the reduction is performed sequentially by the last thread to reach the barrier while all the other threads are waiting.
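To illustrate this difference, here is a deliberately simplified sketch (not the library's actual implementation) of a central barrier that also computes a float minimum: the last thread to arrive performs the entire reduction in a sequential loop while all other threads spin, whereas a tree-based barrier can combine partial results pairwise and concurrently in different subtrees.

import java.util.concurrent.atomic.AtomicInteger;

// Conceptual sketch (not the library's code) of a central barrier combined
// with a float-minimum reduction: the last thread to arrive reduces all
// partial values sequentially while every other thread busy-waits.
public class CentralMinBarrierSketch {
    private final int parties;
    private final float[] partial;                 // one partial value per thread
    private final AtomicInteger arrived = new AtomicInteger(0);
    private volatile float result;
    private volatile int epoch = 0;

    public CentralMinBarrierSketch(int parties) {
        this.parties = parties;
        this.partial = new float[parties];
    }

    /** Each thread passes its partial value and receives the global minimum. */
    public float await(int threadId, float value) {
        partial[threadId] = value;
        int myEpoch = epoch;
        if (arrived.incrementAndGet() == parties) {
            float min = partial[0];
            for (int i = 1; i < parties; i++) {    // sequential O(n) reduction
                min = Math.min(min, partial[i]);
            }
            result = min;
            arrived.set(0);
            epoch = myEpoch + 1;                   // release the waiting threads
        } else {
            while (epoch == myEpoch) {
                // busy-wait until the last thread flips the epoch
            }
        }
        return result;
    }
}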
The jbarrier algorithms also support arbitrary reductions via a construct called GenericReductor. We recommend using a GenericReductor only for reductions that are not already available as primitive type classes: compared to the primitive type reduction classes, a GenericReductor adds more overhead, as the following comparison shows. The command line for the experiment is
java edu.bonn.cs.net.jbarrier.examples.BarrierExample b ALL 100 16 1000000 N arst
When used in conjunction with the tournament barrier or the static tree barrier, the GenericReductor performs about as well as the corresponding primitive type reductions. However, the GenericReductor scales much worse when used with the butterfly barrier, the dissemination barrier, and above all the central barrier.
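The exact GenericReductor interface is documented in the library's Javadoc; the following is only a hypothetical sketch of what a custom float-minimum reduction might look like, assuming the barrier hands the reductor pairs of thread indices whose partial values should be combined. The interface name and callback signature shown here are illustrative assumptions, not the library's API.

// Hypothetical sketch of a custom reduction; the actual GenericReductor
// interface and the way the barriers invoke it are documented in the
// jbarrier Javadoc and may differ from what is shown here.
public class MinFloatReductorSketch {

    /** Assumed callback shape: combine thread 'other's partial value into 'mine'. */
    interface PairwiseReductor {
        void reduce(int mine, int other);
    }

    static class FloatMinReductor implements PairwiseReductor {
        private final float[] partialValues; // one partial value per thread

        FloatMinReductor(float[] partialValues) {
            this.partialValues = partialValues;
        }

        @Override
        public void reduce(int mine, int other) {
            // Invoked whenever two partial results meet, e.g. at a node of
            // the tournament or static tree barrier.
            if (partialValues[other] < partialValues[mine]) {
                partialValues[mine] = partialValues[other];
            }
        }
    }
}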
As stated initially, these results are by no means conclusive. Still, they clearly show that parallel application developers in need of barrier synchronization (with or without reductions) should consider using jbarrier (and, in particular, one of the tree-based barrier algorithms).