Wednesday, 11 June 2014

Advice for the concurrently confused: AtomicLong JDK7/8 vs. LongAdder

Almost a year ago I posted some thoughts on scalable performance counters and compared AtomicLong, a LongAdder backport to JDK7, Cliff Clicks ConcurrentAtomicTable and my own humble ThreadLocalCounter. With JDK8 now released I thought I'd take an opportunity to refresh the numbers and have a look at the delivered LongAdder and the now changed AtomicLong. This is also an opportunity to refresh the JMH usage for this little exercise.

JMH Refresh

If you never heard of JMH here are a few links to catch up on:
In short, JMH is a micro-benchmarking harness written by the Oracle performance engineering team to help in constructing performance experiments while side stepping (or providing means to side step) the many pitfalls of Java related benchmarks.

Experiment Refresh

Here's a brief overview of the steps I took to update the original benchmark to the current:
  • At the time I wrote the original post JMH was pretty new but already had support for thread groups. Alas there was no way to tweak their size from the command line. Now there is so I could drop the boilerplate for different thread counts.
  • Hide counter types behind an interface/factory and use single benchmark (was possible before, just tidying up my own mess there).
  • Switched to using @Param to select the benchmarked counter type. With JMH 0.9 the framework will pick up an enum and run through it's values for me, yay!

The revised benchmark is pretty concise:
The JMH infrastructure takes care of launching requested numbers of incrementing/getting threads and sums up all the results neatly for us. The @Param defaults will run all variant if we don't pick a particular implementation from the command line. All together a more pleasant experience than the rough and tumble of version 0.1. Code repository is here.

AtomicLong: CAS vs LOCK XADD

With JDK8 a change has been made to the AtomicLong class to replace the CAS loop:
With a single intrinsic:
getAndAddLong() (which corresponds to fetch-and-add) translates into LOCK XADD on x86 CPUs which atomically returns the current value and increments it. This translates into better performance under contention as we leave the hardware to negotiate the atomic increment.

Incrementing only

Running the benchmark with incrementing threads only on a nice (but slightly old) dual socket Xeon X5650@2.67GHz (HyperThreading is off, 2 CPUs, 6 cores each, JDK7u51, JDK8u5) shows the improvement:

  • Left hand column is number of threads, all threads increment
  • All numbers are in nanoseconds measuring the average cost per op
  • Values are averaged across threads, so the cost per op is presented from a single threaded method call perspective.




JDK7-AL .inc (err) JDK8-AL .inc (err)
1 9.8 +/- 0.0 6.2 +/- 0.0
2 40.4 +/- 2.1 34.7 +/- 0.8
3 69.4 +/- 0.3 54.1 +/- 5.6
4 90.2 +/- 2.8 85.4 +/- 1.3
5 123.3 +/- 3.7 102.7 +/- 1.9
6 144.2 +/- 0.4 120.4 +/- 2.5
7 398.0 +/- 29.6 276.6 +/- 33.9
8 417.3 +/- 32.6 307.8 +/- 39.1
9 493.8 +/- 46.0 257.5 +/- 16.4
10 457.7 +/- 17.9 409.8 +/- 34.8
11 515.3 +/- 13.7 308.5 +/- 10.1
12 537.9 +/- 7.9 351.3 +/- 7.6

Notes on the how benchmarks were run:
  • I used taskset to pin the benchmark process to a set of cores
  • The number of cores allocated matches the number of threads required by the benchmark
  • Each socket has 6 cores and I taskset the cores from a single socket from 1 to 6, then spilt over to utilizing the other socket. In the particular layout of cores on this machine this turned out to translate into taskset -c 0-<number-of-threads - 1>
  • There are no get() threads in use, only inc() threads. This is controlled from the command line by setting the -tg option. E.g. -tg 0,6 will spin no get() threads and 6 inc() threads
 Observations:
  • JDK 8 AtomicLong is consistently faster as expected.
  • LOCK XADD does NOT cure the scalability issue in this case. This echoes the rule of thumb by D. Yukov here, which is that shared data writes are a scalability bottleneck (he is comparing private writes to a LOCK XADD on shared). The chart in his post demonstrates throughput rather than cost per op and the benchmark he uses is somewhat different. Importantly his chart demonstrates LOCK XADD hits a sustained throughput which remains fixed as threads are added. The large noise in the dual socket measurements renders the data less than conclusive in my measurements here but the throughput does converge. 
  • When we cross the socket boundary (>6 threads) the ratio between JDK7 and JDK8 increases.
  • Crossing the socket boundary also increases results variability (as expressed by the error). 
This benchmark demonstrates cost under contention. In a real application with many threads doing a variety of tasks you are unlikely to experience this kind of contention, but when you hit it you will pay.

LongAdder JDK7 backport vs. JDK8 LongAdder

Is a very boring comparison as they turn out to scale and cost pretty much the same (minor win for the native LongAdder implementation). It is perhaps comforting to anyone who needs to support a JDK7 client base that the backport should work fine on both and that no further work is required for now. Below are the results, LA7 is the LongAdder backport, LA8 is the JDK8 implementation:


JDK7-LA7 .inc (err) JDK8-LA7 .inc (err) JDK8-LA8 .inc (err)
1 9.8 +/- 0.0 10.0 +/- 0.8 9.8 +/- 0.0
2 12.1 +/- 0.6 10.8 +/- 0.1 10.0 +/- 0.1
3 11.5 +/- 0.4 11.7 +/- 0.3 10.3 +/- 0.0
4 12.4 +/- 0.6 11.1 +/- 0.1 10.3 +/- 0.0
5 12.4 +/- 0.8 11.5 +/- 0.6 10.3 +/- 0.0
6 11.8 +/- 0.3 11.1 +/- 0.3 10.3 +/- 0.0
7 11.6 +/- 0.4 12.9 +/- 1.3 10.6 +/- 0.4
8 11.8 +/- 0.6 11.8 +/- 0.7 10.8 +/- 0.5
9 12.9 +/- 0.9 12.0 +/- 0.7 10.5 +/- 0.3
10 12.6 +/- 0.4 12.1 +/- 0.6 11.0 +/- 0.5
11 11.5 +/- 0.2 12.3 +/- 0.6 10.7 +/- 0.3
12 11.7 +/- 0.4 11.5 +/- 0.3 10.4 +/- 0.1

JDK8: AtomicLong vs LongAdder

Similar results were demonstrated has been discussed in the prev. post, but here's the JDK8 versions side by side:



JDK8-AL .inc (err) JDK8-LA8 .inc (err)
1 6.2 +/- 0.0 9.8 +/- 0.0
2 34.7 +/- 0.8 10.0 +/- 0.1
3 54.1 +/- 5.6 10.3 +/- 0.0
4 85.4 +/- 1.3 10.3 +/- 0.0
5 102.7 +/- 1.9 10.3 +/- 0.0
6 120.4 +/- 2.5 10.3 +/- 0.0
7 276.6 +/- 33.9 10.6 +/- 0.4
8 307.8 +/- 39.1 10.8 +/- 0.5
9 257.5 +/- 16.4 10.5 +/- 0.3
10 409.8 +/- 34.8 11.0 +/- 0.5
11 308.5 +/- 10.1 10.7 +/- 0.3
12 351.3 +/- 7.6 10.4 +/- 0.1

Should I use AtomicLong or LongAdder?

Firstly this question is only relevant if you are not using AtomicLong as a unique sequence generator. LongAdder does not claim to, nor makes any attempt to give you that guarantee. So LongAdder is definitely NOT a drop in replacement for AtomicLong, they have very different semantics.
From the LongAdder JavaDoc:
This class is usually preferable to AtomicLong when multiple threads update a common sum that is used for purposes such as collecting statistics, not for fine-grained synchronization
control.  Under low update contention, the two classes have similar characteristics. But under high contention, expected throughput of this class is significantly higher, at the expense of higher space consumption.
Assuming you were using AtomicLong as a counter you will need to consider a few tradeoffs:
  • When NO contention is present, AtomicLong performs slightly better than LongAdder.
  • To avoid contention LongAdder will allocate Cells (see previous post for implementation discussion) each Cell will consume at least 256 bytes (current implementation of @Contended) and you may have as many Cells as CPUs. If you are on a tight memory budget and have allot of counters this is perhaps not the tool for the job.
  • If you prefer get() performance to inc() performance than you should definitely stick with AtomicLong.
  • When you prefer inc() performance and expect contention, and when you have some memory to spare then LongAdder is indeed a great choice.

Bonus material: How the Observer effects the Experiment

What is the impact of reading a value that is being rapidly mutated by another thread? On the observing thread side we expect to pay a read-miss, but as discussed prev. here there is a price to pay on the mutator side as well. I run the same benchmark with an equal number of inc()/get() threads. The process is pinned as before but as the roles are not uniform I have no control on which socket the readers/writers end up on, so we should expect more noise as we cross the socket line (LA - LongAdder, AL - AtomicLong, both on JDK8, same type ing/get are within the same run, left column is number of inc()/get() threads):


LA .get (err) LA .inc (err) AL .get (err) AL .inc (err)
1,1 4.7 +/- 0.2 68.6 +/- 5.0 10.2 +/- 1.0 39.1 +/- 7.6
2,2 24.9 +/- 2.7 69.7 +/- 4.0 41.5 +/- 26.4 87.6 +/- 11.0
3,3 139.9 +/- 24.3 69.0 +/- 8.2 55.1 +/- 13.3 157.4 +/- 28.2
4,4 332.9 +/- 10.3 80.4 +/- 5.1 56.8 +/- 7.6 198.7 +/- 21.1
5,5 479.3 +/- 13.2 84.1 +/- 4.4 71.9 +/- 7.3 233.1 +/- 20.1
6,6 600.6 +/- 11.2 89.8 +/- 4.6 152.1 +/- 41.7 343.2 +/- 41.0

Now that is a very different picture... 2 factors come into play:
  1. The reads disturb the writes by generating cache coherency noise as discussed here. A writer must have the cache line in an Exclusive/Mutated state, but the read will cause it to shift into Shared.
  2. The get() measurement does not differentiate between new values and old values.
This second point is important as we compare different means of mutating values being read. If the value being read is slowly mutating we will succeed in reading the same value many times before a change is visible. This will make our average operation time look great as most reads will be L1 hitting. If we have a fast incrementing implementation we will cause more cache misses for the reader making it look bad. On the other hand a slow reader will cause less of a disturbance to the incrementing threads as it produces less coherency noise, thus making less of a dent in the writer performance. Martin Thompson has previously hit on this issue from a different angle here (Note his posts discuss Nahelem and Sandybridge, I'm benchmarking on Westmere here).
In this light I'm not sure we can read much into the effect of readers in this particular use case. The 'model' represented by a hot reading thread does not sit well with the use case I have in mind for these counters which is normally as performance indicators to be sampled at some set frequency (once a second or millisecond). A different experiment is more appropriate, perhaps utilising Blackhole.consumeCPU to set a gap between get() (see a great insight into this method here).
With this explanation in mind we can see perhaps more sense in the following comparison between AtomicLong on JDK7 and JDK8:


AL7 .get (err) AL7 .inc (err) AL8 .get (err) AL8 .inc (err)
1,1 4.0 +/- 0.2 61.1 +/- 7.8 10.2 +/- 1.0 39.1 +/- 7.6
2,2 7.0 +/- 0.3 133.1 +/- 2.9 41.5 +/- 26.4 87.6 +/- 11.0
3,3 21.8 +/- 3.4 278.5 +/- 47.4 55.1 +/- 13.3 157.4 +/- 28.2
4,4 27.2 +/- 3.3 324.1 +/- 34.6 56.8 +/- 7.6 198.7 +/- 21.1
5,5 31.2 +/- 1.8 378.5 +/- 28.1 71.9 +/- 7.3 233.1 +/- 20.1
6,6 57.3 +/- 12.5 481.8 +/- 39.4 152.1 +/- 41.7 343.2 +/- 41.0

Now consider that the implementation of get() has not changed. The reason for the change in cost for get is down to the increase in visible values and thus cache misses caused by the change to the increment method.

No comments:

Post a Comment