Psychosomatic, Lobotomy, Saw: June 2014

Thursday 26 June 2014

Notes on False Sharing

{This post is part of a long running series on lock free queues, checkout the full index to get more context here}
I've recently given a talk on queues and related optimisations, but due to the limitations of time, the Universe and Everything else had to cut short the deep dive into false sharing. Here however I'm not limited by such worldly concerns, so here is more detail for those who can stomach it.

What mean False Sharing?

Described before on this blog and in many other places (Intel, usenix, Mr. T), but assuming you are not already familiar with the issue:

"false sharing is a performance-degrading usage pattern that can arise in systems with distributed, coherent caches at the size of the smallest resource block managed by the caching mechanism. When a system participant attempts to periodically access data that will never be altered by another party, but that data shares a cache block with data that is altered, the caching protocol may force the first participant to reload the whole unit despite a lack of logical necessity. The caching system is unaware of activity within this block and forces the first participant to bear the caching system overhead required by true shared access of a resource." - From Wikipedia

Makes sense, right?
I tried my hand previously at explaining the phenomena in terms of the underlying coherency protocol which boils down to the same issue. These explanations often leave people still confused and I spent some time trying to reduce the explanation so that the idea can get across in a presentation... I'm pretty sure it still left people confused.
Let's try again.
The simplest explanation I've come up with to date is:

Memory is cached at a granularity of a cache line (assumed 64 bytes, which is typical, in the following examples but may be a smaller or larger power of 2, determined by the hardware you run on)
Both reading and writing to a memory location from a particular CPU require the presence of a copy of the memory location in the reading/writing CPU cache. The required location is 'viewed' or cached at the granularity of a cache line.
When a line is not in a threads cache we experience a cache miss as we go searching for a valid copy of it.
Once a line is in our cache it may get evicted (when more memory is required) or invalidated (because a value in that line was changed by another CPU). Otherwise it just stays there.
A write to a cache line invalidates ALL cached copies of that line for ALL CPUs except the writer.

I think there is a mistaken intuition most of us have when it comes to memory access, which is that we think of it as direct. In reality however you access memory through the CPU cache hierarchy and it is in the attempt to make this system coherent and performant that the hardware undoes our intuition. Rather than thinking of all CPUs accessing a global memory pool we should have a mental model that is closer to a distributed version control system as illustrated in this excellent post on memory barriers.
Given the above setting of the stage we can say that false sharing occurs when Thread 1 invalidates (i.e writes to) a cache line required by Thread2 (for reading/writing) even though Thread2 and Thread1 are not accessing the same location.
The cost of false sharing is therefore observable as cache misses, and the cause of false sharing is the write to the cache line (coupled with the granularity of cache coherency). Ergo, the symptom of false sharing can be observed by looking at hardware events for your process or thread:

If you are on linux you can use perf. (Checkout the awesome perf integration in JMH!)
On linux/windows/mac Intel machines there is PCM.
From Java you can use Overseer
... look for hardware counters on google.

Now assuming you have established that the number of cache misses you experience is excessive, the challenge remains in finding the sources of these cache misses. Sadly for Java there is no free utility that I'm aware of which can easily point you to the source of your false sharing problem, C/C++ developers are spoilt for choice by comparison. Some usual suspects to consider however are any shared data constructs, for example queues.

Active False Sharing: Busy Counters

Lets start with a queue with 2 fields which share the same cache line where each field is being updated by a separate thread. In this example we have an SPSC queue with the following structure:
Ignoring the actual correct implementation and edge cases to make this a working queue (you can read more about how that part works here), we can see that offer will be called by one thread, and poll by another as this is an SPSC queue for delivering messages between 2 threads. Each thread will update a distinct field, but as the fields are all jammed tight together we know False Sharing is bound to follow.
The solution at the time of writing seems to be the introduction of padding by means of class inheritance which is covered in detail here. This will add up to an unattractive class as follows:
Yipppeee! the counters are no longer false sharing! That little Pad in the middle will make them feel all fresh and secure.

Passive False Sharing: Hot Neighbours

But the journey is far from over (stay strong), because in False Sharing you don't have to write to a shared cache line to suffer. In the above example both the producer and the consumer are reading the reference to buffer which is sharing a cache line with naughty Mr. consumerIndex. Every time we update consumerIndex an angel loses her wings! As if the wings business wasn't enough, the write also invalidates the cache line for the producer which is trying to read the buffer so that it can write to the damn thing. The solution?
MORE PADDING!!
Surely now everyone will just get along?

The Object Next Door

So now we have the consumer and the producer feverishly updating their respective indexes and we know the indexes are not interfering with each other or with buffer, but what about the objects allocated before/after our queue object? Imagine we allocate 2 instances of the queue above and they end up next to each other in memory, we'd be practically back to where we started with the consumerIndex on the tail end of one object false sharing with the producer index at the head end of the other. In fact, the conclusion we are approaching here is that rapidly mutated fields make very poor neighbours altogether, be it to other fields in the same class or to neighbouring classes. The solution?

EVEN MORE PADDING!!!

Seems a tad extreme don't it? But hell, at least now we're safe, right?

Buffers are people too

So we have our queue padded like a new born on it's first trip out the house in winter, but what about Miss buffer? Arrays in Java are objects and as such the buffer field merely point to another object and is not inlined into the queue (as it can be in C/C++ for instance). This means the buffer is also subject to potentially causing or suffering from false sharing. This is particularly true for the length field on the array which will get invalidated by writes into the first few elements of the queue, but equally true for any object allocated before or after the array. For circular array queues we know the producer will go around writing to all the elements and come back to the beginning, so the middle elements will be naturally padded. If the array is small enough and the message passing rate is high enough this can have the same effect as any hot field. Alternatively we might experience an uneven behaviour for the queue as the elements around the edges of the array suffer false sharing while the middle ones don't.

Since we cannot extend the array object and pad it to our needs we can over-allocate the buffer and never use the initial/last slots:

We cannot prevent false sharing between neighbours and the array length, what we can do is hoist it into our own class field and use our copy instead of the buffer field. This can also result in better locality if the hoisted length value is hosted in a field alongside the array itself.
Note that this is costing us some computational overhead to compute the offset index instead of the natural one, but in practice we can implement queues such that the we achieve the same effect with no overhead at all.

Protection against the Elements

While on this topic (I do go on about it... hmmm) we can observe that the producer and the consumer threads can still experience false sharing on the cache line holding adjacent elements in the buffer array. This is something we can perhaps handle more effectively if we had arrays of structs in Java (see the value types JEP and structured arrays proposal), and if the queue was to be a queue of such structs. Reality being the terrible place that it is, this is not the case yet. If we could prove our application indeed suffered from this issue, could we solve it?
Yes, but at a questionable price...
If we allocate extra elements for each reference we mean to store in the queue we can use empty elements to pad the used elements. This will reduce the density of each cache line and as a result the probability of false sharing as less and less elements are in a cache line. This comes a high cost however as we multiply the size of the buffer to make room for the empty slots and actively sabotage the memory throughput as we have less data per read cache line. This is an optimization I am reluctant to straight out recommend as a result, but what the hell? sometimes it might helps.

Playing with Dirty Cards

This last one is a nugget and a half. Usually when people say: "It's a JVM/compiler/OS issue" it turns out that it's a code issue written by those same people. But sometimes it is indeed the machine.

In certain cases you might not be the cause of False Sharing. In some cases you might be experiencing False Sharing induced by card marking. I won't spoil it for you, follow the link.

Summary

So there you have it, 6 different cases of False Sharing encountered in the process of optimizing/proofing the piddly SPSC queue. Each one had an impact, some more easily measured than others. The take away here is not "Pad every object and field to death" as that will be detrimental in most cases, much like making every field volatile. But this is an issue worth keeping in mind when writing shared data structures, and in particular when considering highly contended ones.
We've discussed 2 types of False Sharing:

Active: where both threads update distinct locations on the same cache line. This will severely impact both threads as they both experience cache misses and cause repeated invalidation of the cache line.
Passive: where one thread writes to and another reads from distinct locations on the same cache line. This will have a major impact on the reading thread with many reads resulting in a cache miss. This will also have an impact on the writer as the cache line sharing overhead is experienced.

These were identified and mitigated in several forms:

Neighbouring hot/hot fields (Active)
Neighbouring hot/cold fields (Passive)
Neighbouring objects (potential for both Active and Passive)
Objects neighbouring an array and the array elements (same as 3, but worth highlighting as a subtlety that arrays are separate objects)
Array elements and array length field (Passive as the length is never mutated)
Distinct elements sharing the same cache line (Passive in above example, but normally Active as consumer nulls out the reference on poll)
-XX:+UseCondCardMark - see Dave Dice article

What should you do?

Consider fields concurrently accessed by different threads, in particular frequently written to fields and their neighbours.
Consider Active AND Passive false sharing.
Consider (and be considerate to) your neighbours.
All padding comes at the cost of using up valuable cache space for isolation. If over used (like in the elements padding) performance can suffer.

And finally: love thy neighbour, just a little bit and with adequate protection and appropriate padding, but still love'em.

Belgium! Belgium! Belgium! Belgium! Belgium! Belgium! Belgium! Belgium! Belgium! Belgium! Belgium!

Wednesday 11 June 2014

Advice for the concurrently confused: AtomicLong JDK7/8 vs. LongAdder

{UPDATE 03/09/14: If you come here looking for JMH related content start at the new and improved JMH Resources Page and branch out from there!}
Almost a year ago I posted some thoughts on scalable performance counters and compared AtomicLong, a LongAdder backport to JDK7, Cliff Clicks ConcurrentAtomicTable and my own humble ThreadLocalCounter. With JDK8 now released I thought I'd take an opportunity to refresh the numbers and have a look at the delivered LongAdder and the now changed AtomicLong. This is also an opportunity to refresh the JMH usage for this little exercise.

JMH Refresh

If you never heard of JMH here are a few links to catch up on:

The JMH project
My own introductory post on the topic (in need of some update as JMH evolved in the past year)
A few other posts demonstrating JMH usage:

Multithreaded Benchmarks
Benchmarking Queue implementations latency

Some masterful posts from Shipilev who is part of the team behind JMH:

Examining exception throwing as control flow
A presentation on benchmarking and JMH

Use google to find more...

In short, JMH is a micro-benchmarking harness written by the Oracle performance engineering team to help in constructing performance experiments while side stepping (or providing means to side step) the many pitfalls of Java related benchmarks.

Experiment Refresh

Here's a brief overview of the steps I took to update the original benchmark to the current:

At the time I wrote the original post JMH was pretty new but already had support for thread groups. Alas there was no way to tweak their size from the command line. Now there is so I could drop the boilerplate for different thread counts.
Hide counter types behind an interface/factory and use single benchmark (was possible before, just tidying up my own mess there).
Switched to using @Param to select the benchmarked counter type. With JMH 0.9 the framework will pick up an enum and run through it's values for me, yay!

The revised benchmark is pretty concise:
The JMH infrastructure takes care of launching requested numbers of incrementing/getting threads and sums up all the results neatly for us. The @Param defaults will run all variant if we don't pick a particular implementation from the command line. All together a more pleasant experience than the rough and tumble of version 0.1. Code repository is here.

AtomicLong: CAS vs LOCK XADD

With JDK8 a change has been made to the AtomicLong class to replace the CAS loop:
With a single intrinsic:
getAndAddLong() (which corresponds to fetch-and-add) translates into LOCK XADD on x86 CPUs which atomically returns the current value and increments it. This translates into better performance under contention as we leave the hardware to negotiate the atomic increment.

Incrementing only

Running the benchmark with incrementing threads only on a nice (but slightly old) dual socket Xeon X5650@2.67GHz (HyperThreading is off, 2 CPUs, 6 cores each, JDK7u51, JDK8u5) shows the improvement:

Left hand column is number of threads, all threads increment
All numbers are in nanoseconds measuring the average cost per op
Values are averaged across threads, so the cost per op is presented from a single threaded method call perspective.

	JDK7-AL	.inc	(err)	JDK8-AL	.inc	(err)
1	9.8	+/-	0.0	6.2	+/-	0.0
2	40.4	+/-	2.1	34.7	+/-	0.8
3	69.4	+/-	0.3	54.1	+/-	5.6
4	90.2	+/-	2.8	85.4	+/-	1.3
5	123.3	+/-	3.7	102.7	+/-	1.9
6	144.2	+/-	0.4	120.4	+/-	2.5
7	398.0	+/-	29.6	276.6	+/-	33.9
8	417.3	+/-	32.6	307.8	+/-	39.1
9	493.8	+/-	46.0	257.5	+/-	16.4
10	457.7	+/-	17.9	409.8	+/-	34.8
11	515.3	+/-	13.7	308.5	+/-	10.1
12	537.9	+/-	7.9	351.3	+/-	7.6

Notes on the how benchmarks were run:

I used taskset to pin the benchmark process to a set of cores
The number of cores allocated matches the number of threads required by the benchmark
Each socket has 6 cores and I taskset the cores from a single socket from 1 to 6, then spilt over to utilizing the other socket. In the particular layout of cores on this machine this turned out to translate into taskset -c 0-<number-of-threads - 1>
There are no get() threads in use, only inc() threads. This is controlled from the command line by setting the -tg option. E.g. -tg 0,6 will spin no get() threads and 6 inc() threads

Observations:

JDK 8 AtomicLong is consistently faster as expected.
LOCK XADD does NOT cure the scalability issue in this case. This echoes the rule of thumb by D. Yukov here, which is that shared data writes are a scalability bottleneck (he is comparing private writes to a LOCK XADD on shared). The chart in his post demonstrates throughput rather than cost per op and the benchmark he uses is somewhat different. Importantly his chart demonstrates LOCK XADD hits a sustained throughput which remains fixed as threads are added. The large noise in the dual socket measurements renders the data less than conclusive in my measurements here but the throughput does converge.
When we cross the socket boundary (>6 threads) the ratio between JDK7 and JDK8 increases.
Crossing the socket boundary also increases results variability (as expressed by the error).

This benchmark demonstrates cost under contention. In a real application with many threads doing a variety of tasks you are unlikely to experience this kind of contention, but when you hit it you will pay.

LongAdder JDK7 backport vs. JDK8 LongAdder

Is a very boring comparison as they turn out to scale and cost pretty much the same (minor win for the native LongAdder implementation). It is perhaps comforting to anyone who needs to support a JDK7 client base that the backport should work fine on both and that no further work is required for now. Below are the results, LA7 is the LongAdder backport, LA8 is the JDK8 implementation:

	JDK7-LA7	.inc	(err)	JDK8-LA7	.inc	(err)	JDK8-LA8	.inc	(err)
1	9.8	+/-	0.0	10.0	+/-	0.8	9.8	+/-	0.0
2	12.1	+/-	0.6	10.8	+/-	0.1	10.0	+/-	0.1
3	11.5	+/-	0.4	11.7	+/-	0.3	10.3	+/-	0.0
4	12.4	+/-	0.6	11.1	+/-	0.1	10.3	+/-	0.0
5	12.4	+/-	0.8	11.5	+/-	0.6	10.3	+/-	0.0
6	11.8	+/-	0.3	11.1	+/-	0.3	10.3	+/-	0.0
7	11.6	+/-	0.4	12.9	+/-	1.3	10.6	+/-	0.4
8	11.8	+/-	0.6	11.8	+/-	0.7	10.8	+/-	0.5
9	12.9	+/-	0.9	12.0	+/-	0.7	10.5	+/-	0.3
10	12.6	+/-	0.4	12.1	+/-	0.6	11.0	+/-	0.5
11	11.5	+/-	0.2	12.3	+/-	0.6	10.7	+/-	0.3
12	11.7	+/-	0.4	11.5	+/-	0.3	10.4	+/-	0.1

JDK8: AtomicLong vs LongAdder

Similar results were demonstrated has been discussed in the prev. post, but here's the JDK8 versions side by side:

	JDK8-AL	.inc	(err)	JDK8-LA8	.inc	(err)
1	6.2	+/-	0.0	9.8	+/-	0.0
2	34.7	+/-	0.8	10.0	+/-	0.1
3	54.1	+/-	5.6	10.3	+/-	0.0
4	85.4	+/-	1.3	10.3	+/-	0.0
5	102.7	+/-	1.9	10.3	+/-	0.0
6	120.4	+/-	2.5	10.3	+/-	0.0
7	276.6	+/-	33.9	10.6	+/-	0.4
8	307.8	+/-	39.1	10.8	+/-	0.5
9	257.5	+/-	16.4	10.5	+/-	0.3
10	409.8	+/-	34.8	11.0	+/-	0.5
11	308.5	+/-	10.1	10.7	+/-	0.3
12	351.3	+/-	7.6	10.4	+/-	0.1

Should I use AtomicLong or LongAdder?

Firstly this question is only relevant if you are not using AtomicLong as a unique sequence generator. LongAdder does not claim to, nor makes any attempt to give you that guarantee. So LongAdder is definitely NOT a drop in replacement for AtomicLong, they have very different semantics.
From the LongAdder JavaDoc:

This class is usually preferable to AtomicLong when multiple threads update a common sum that is used for purposes such as collecting statistics, not for fine-grained synchronization

control. Under low update contention, the two classes have similar characteristics. But under high contention, expected throughput of this class is significantly higher, at the expense of higher space consumption.

Assuming you were using AtomicLong as a counter you will need to consider a few tradeoffs:

When NO contention is present, AtomicLong performs slightly better than LongAdder.
To avoid contention LongAdder will allocate Cells (see previous post for implementation discussion) each Cell will consume at least 256 bytes (current implementation of @Contended) and you may have as many Cells as CPUs. If you are on a tight memory budget and have allot of counters this is perhaps not the tool for the job.
If you prefer get() performance to inc() performance than you should definitely stick with AtomicLong.
When you prefer inc() performance and expect contention, and when you have some memory to spare then LongAdder is indeed a great choice.

Bonus material: How the Observer effects the Experiment

What is the impact of reading a value that is being rapidly mutated by another thread? On the observing thread side we expect to pay a read-miss, but as discussed prev. here there is a price to pay on the mutator side as well. I run the same benchmark with an equal number of inc()/get() threads. The process is pinned as before but as the roles are not uniform I have no control on which socket the readers/writers end up on, so we should expect more noise as we cross the socket line (LA - LongAdder, AL - AtomicLong, both on JDK8, same type ing/get are within the same run, left column is number of inc()/get() threads):

	LA	.get	(err)	LA	.inc	(err)	AL	.get	(err)	AL	.inc	(err)
1,1	4.7	+/-	0.2	68.6	+/-	5.0	10.2	+/-	1.0	39.1	+/-	7.6
2,2	24.9	+/-	2.7	69.7	+/-	4.0	41.5	+/-	26.4	87.6	+/-	11.0
3,3	139.9	+/-	24.3	69.0	+/-	8.2	55.1	+/-	13.3	157.4	+/-	28.2
4,4	332.9	+/-	10.3	80.4	+/-	5.1	56.8	+/-	7.6	198.7	+/-	21.1
5,5	479.3	+/-	13.2	84.1	+/-	4.4	71.9	+/-	7.3	233.1	+/-	20.1
6,6	600.6	+/-	11.2	89.8	+/-	4.6	152.1	+/-	41.7	343.2	+/-	41.0

Now that is a very different picture... 2 factors come into play:

The reads disturb the writes by generating cache coherency noise as discussed here. A writer must have the cache line in an Exclusive/Mutated state, but the read will cause it to shift into Shared.
The get() measurement does not differentiate between new values and old values.

This second point is important as we compare different means of mutating values being read. If the value being read is slowly mutating we will succeed in reading the same value many times before a change is visible. This will make our average operation time look great as most reads will be L1 hitting. If we have a fast incrementing implementation we will cause more cache misses for the reader making it look bad. On the other hand a slow reader will cause less of a disturbance to the incrementing threads as it produces less coherency noise, thus making less of a dent in the writer performance. Martin Thompson has previously hit on this issue from a different angle here (Note his posts discuss Nahelem and Sandybridge, I'm benchmarking on Westmere here).
In this light I'm not sure we can read much into the effect of readers in this particular use case. The 'model' represented by a hot reading thread does not sit well with the use case I have in mind for these counters which is normally as performance indicators to be sampled at some set frequency (once a second or millisecond). A different experiment is more appropriate, perhaps utilising Blackhole.consumeCPU to set a gap between get() (see a great insight into this method here).
With this explanation in mind we can see perhaps more sense in the following comparison between AtomicLong on JDK7 and JDK8:

	AL7	.get	(err)	AL7	.inc	(err)	AL8	.get	(err)	AL8	.inc	(err)
1,1	4.0	+/-	0.2	61.1	+/-	7.8	10.2	+/-	1.0	39.1	+/-	7.6
2,2	7.0	+/-	0.3	133.1	+/-	2.9	41.5	+/-	26.4	87.6	+/-	11.0
3,3	21.8	+/-	3.4	278.5	+/-	47.4	55.1	+/-	13.3	157.4	+/-	28.2
4,4	27.2	+/-	3.3	324.1	+/-	34.6	56.8	+/-	7.6	198.7	+/-	21.1
5,5	31.2	+/-	1.8	378.5	+/-	28.1	71.9	+/-	7.3	233.1	+/-	20.1
6,6	57.3	+/-	12.5	481.8	+/-	39.4	152.1	+/-	41.7	343.2	+/-	41.0

Now consider that the implementation of get() has not changed. The reason for the change in cost for get is down to the increase in visible values and thus cache misses caused by the change to the increment method.