Tuesday 21 May 2013

Know Thy Java Object Memory Layout

Sometimes we want to know how an object is laid out in memory. Here are 2 use cases and 2 solutions that remove the need to hypothesise.
A few months back I was satisfying my OCD by reading up on Java object memory layout. Now Java, as we all know and love, is all about taking care of such pesky details as memory layout for you. You just leave it to the JVM son, and don't lose sleep over it.
Sometimes though... sometimes you do care. And when you do, here's how to find out.

In theory, theory and practice are the same

Here's an excellent article from a few years back which tells you all about how Java should lay out your objects. To summarise:
  • Objects are 8 bytes aligned in memory (address A is K aligned if A % K == 0)
  • All fields are type aligned (long/double is 8 aligned, integer/float 4, short/char 2)
  • Fields are packed in the order of their size, except for references which are last
  • Fields belonging to different classes are never mixed, so if B extends A, an object of class B will be laid out in memory with A's fields first, then B's
  • Subclass fields start at a 4 byte aligned offset
  • If the first field of a class is a long/double and the class starting point (after the header, or after the super class fields) is not 8 aligned, then a smaller field may be swapped up to fill the 4 byte gap.
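To make the rules concrete, here's a hypothetical class and the layout the rules predict for it. The offsets below assume a 64-bit HotSpot JVM with compressed oops and a 12 byte object header; they are what the rules imply, so verify on your own JVM (with the tools below):

    public class Example {
        byte flag;  // lands at offset 24, after the larger fields
        int count;  // swapped up to offset 12 to fill the gap after the 12 byte header
        long total; // offset 16, keeping the long 8 aligned
        Object ref; // offset 28, references go last (offsets 25-27 are padding)
    }               // expected instance size: 32 bytes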
The reasons why the JVM doesn't just plonk your fields one after the other in the order you tell it are also discussed in the article, to summarise:
  • Unaligned access is bad, so JVM saves you from bad layout (unaligned access to memory can cause all sorts of ill side effects, including crashing your process on some architectures)
  • Naive layout of your fields would waste memory, so the JVM reorders fields to reduce the overall size of your object
  • JVM implementation requires types to have consistent layout, thus requiring the sub class rules
So... nice clear rules, what could possibly go wrong?

False False Sharing Protection

For one thing, the rules are not part of the JLS, they are just implementation details. If you read Martin Thompson's article about false sharing you'll notice Mr. T had a solution to false sharing which worked on JDK 6 but no longer worked on JDK 7. Here are the 2 versions:
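The embedded snippets are lost from this page, so here's a sketch from memory of the two approaches (see Martin's article for the real code):

    import java.util.concurrent.atomic.AtomicLong;

    // JDK 6 era: rely on the padding fields staying next to the hot field.
    // The JDK 7 field reordering moves the padding away, breaking this.
    class VolatileLongJdk6 {
        public volatile long value = 0L;
        public long p1, p2, p3, p4, p5, p6; // padding
    }

    // JDK 7 era: super class fields always precede sub class fields, so
    // pushing the padding into a subclass of AtomicLong keeps it after
    // the hot value field regardless of reordering within the class.
    class PaddedAtomicLong extends AtomicLong {
        public volatile long p1, p2, p3, p4, p5, p6 = 7L;

        public long sumPaddingToPreventOptimisation() {
            return p1 + p2 + p3 + p4 + p5 + p6;
        }
    }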

It turns out the JVM changed the way it orders fields between 6 and 7, and that was enough to break the spell. In fairness, no rule specified above requires the field order to correlate with the order in which the fields were defined, but... it's a lot to worry about and it can trip you up.
While the above rules were still fresh in my mind, LMAX (who kindly open sourced the Disruptor) released the Coalescing Ring Buffer. I read through the code and came across the following:
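The original embed is gone too; the gist of what caught my eye was field grouping along these lines (an illustrative sketch, not the actual CoalescingRingBuffer source):

    import java.util.concurrent.atomic.AtomicReferenceArray;

    // Producer fields are declared at one end, consumer fields at the
    // other, hoping the bulk state in between keeps them apart in memory.
    final class FieldGroupingSketch<V> {
        // producer side
        private volatile long nextWrite = 1;
        private long lastCleaned = 0;
        // bulk state between the two groups
        private final AtomicReferenceArray<V> values = new AtomicReferenceArray<>(1024);
        // consumer side
        private volatile long lastRead = 0;
    }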

I approached Nick Zeeb on the blog post which introduced the CoalescingRingBuffer and raised my concern that the fields accessed by the producer/consumer might be suffering from false sharing. Nick's reply:
I’ve tried to order the fields such that the risk of false-sharing is minimized. I am aware that Java 7 can re-order fields however. I’ve run the performance test using Martin Thompson’s PaddedAtomicLong instead but got no performance increase on Java 7. Perhaps I’ve missed something so feel free to try it yourself.
Now Nick is a savvy dude, and I'm not quoting him here to criticise him. I'm quoting him to show that this is confusing stuff (so in a way, I quote him to comfort myself in the company of other equally confused professionals). How can we know? Here's one way I thought of after talking to Nick:
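Here's a minimal sketch of the check, assuming sun.misc.Unsafe is obtainable via reflection and a 64 byte cache line:

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    public class FalseSharingCheck {
        private static final long CACHE_LINE_BYTES = 64; // assumed
        private static final Unsafe UNSAFE = grabUnsafe();

        private static Unsafe grabUnsafe() {
            try {
                Field f = Unsafe.class.getDeclaredField("theUnsafe");
                f.setAccessible(true);
                return (Unsafe) f.get(null);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        // Two instance fields may false share if their offsets are less
        // than a cache line apart (final placement in memory still matters).
        public static boolean mayFalseShare(Class<?> clazz, String field1, String field2)
                throws NoSuchFieldException {
            long o1 = UNSAFE.objectFieldOffset(clazz.getDeclaredField(field1));
            long o2 = UNSAFE.objectFieldOffset(clazz.getDeclaredField(field2));
            return Math.abs(o1 - o2) < CACHE_LINE_BYTES;
        }
    }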

Using Unsafe I can get the field offset from the object reference; if 2 fields are less than a cache line apart they can suffer from false sharing (depending on their final location in memory). Sure, it's a bit of a hackish way to verify things, but it can become part of your build, so when the JVM version changes you won't get caught out.
Note that it's the runtime which will determine the layout, not the compiler, so if you build on version X and run on Y it won't help you much.
Enough of that false sharing thing... so negative... why would you care about memory layout apart from false sharing? Here's another example.

The Hot Bunch

Through the blessings of the gods, at about the same time LMAX released the CoalescingRingBuffer, Gil Tene (CTO of Azul) released HdrHistogram. Now Gil is seriously, awesomely, bright and knows more about JVMs than most mortals (here's his InfoQ talk, watch it), so I was keen to look into his code. And what do you know, a bunch of hot fields:
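The embed is missing here, but the shape of the class was roughly this (an illustrative sketch, not the verbatim HdrHistogram source):

    public class Histogram {
        // cold configuration/reporting fields
        long highestTrackableValue;
        int numberOfSignificantValueDigits;
        Object histogramData; // stands in for the HistogramData reference
        // the hot bunch: everything touched on each recorded value,
        // declared together in the hope of landing on one cache line
        int subBucketHalfCountMagnitude;
        int subBucketHalfCount;
        long subBucketMask;
        long[] counts;
        long totalCount;
    }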

What Gil is doing here is good stuff: he's trying to get relevant fields to huddle together in memory, which will improve the likelihood of them ending up on the same cache line, saving the CPU a potential cache miss. Sadly the JVM has other plans...
So here is another tool to add to your tool belt to help make sense of your memory layout: Java Object Layout. I bumped into it by accident, not while obsessing about memory layout at all. Here's the output for Histogram:
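A quick way to reproduce that output (using the JOL API as it exists after the move to OpenJDK mentioned in the update below; requires the org.openjdk.jol:jol-core dependency):

    import org.openjdk.jol.info.ClassLayout;

    public class PrintLayout {
        public static void main(String[] args) {
            // prints each field's offset, size and any padding/gaps
            System.out.println(ClassLayout.parseClass(Histogram.class).toPrintable());
        }
    }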

Note how histogramData jumps to the bottom and subBucketMask is moved to the top, breaking up our hot bunch. The solution is ugly but effective: move all fields but the hot bunch to an otherwise pointless parent class:
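A sketch of the approach (not the exact committed code), continuing the illustrative field names used above:

    // An otherwise pointless parent class holding the cold fields; the
    // sub class layout rules guarantee they precede the hot bunch.
    abstract class AbstractHistogramBase {
        long highestTrackableValue;
        int numberOfSignificantValueDigits;
        Object histogramData;
    }

    public class Histogram extends AbstractHistogramBase {
        // only the hot bunch remains at this level, so it stays together
        int subBucketHalfCountMagnitude;
        int subBucketHalfCount;
        long subBucketMask;
        long[] counts;
        long totalCount;
    }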

And the new layout:

Joy! I shall be sending Mr. Tene a pull request shortly :-)

UPDATE 16/01/2014: The excellent JOL has now been released under OpenJDK here. It's even better than before and supports many a funky feature (worthy of a separate post). I've updated the links to point to the new project. Also check out Shipilev's blog post on heap dumps demonstrating the use of this tool.

Wednesday 15 May 2013

Using JMH to Benchmark Multi-Threaded Code

{UPDATE 03/09/14: If you come here looking for JMH related content start at the new and improved JMH Resources Page and branch out from there!}
Exploring some of the multi-threading related functionality in JMH via a benchmark of false shared counters vs. uncontended counters.
As discussed in the previous post, the JMH framework introduced by the good folk of the OpenJDK offers a wealth of out of the box functionality to micro benchmark writers. So much in fact that a single post is not enough! In this post I'll explore some of the functionality on offer for people writing and measuring multi-threaded code.

Sharing != Caring?

The phenomenon I'm putting under the microscope here is the infamous False Sharing. There are many explanations of what False Sharing is and why it's bad to be found elsewhere (Intel, usenix, Mr. T), but to save you from having to chase links, here's mine:
In order to maintain a consistent view of memory across caches, modern CPUs must observe which CPU is modifying which location in memory. 2 CPUs modifying the same location are actually sharing it and must take turns in doing so, leading to contention and much back and forth between the CPUs as they invalidate each other's view of the memory in question. Sadly, the protocol used by the CPUs is limited in granularity to cache lines. So when CPU 1 modifies a memory location, the whole cache line (the memory locations to the left and right of the actual location) is considered modified. False sharing occurs when CPU 1 and CPU 2 are modifying 2 distinct memory locations on the same cache line. They shouldn't interfere with each other, as there is no actual contention, but due to the way cache coherency works they end up uselessly chattering to each other. The image to the right is taken from the Intel link provided above.
I will be using JMH to demonstrate the impact of false sharing by allocating an array of counters and accessing a counter per thread. I am comparing the difference between using all the counters bunched together and spacing them out such that no false sharing happens. I've repeated the same experiment for plain access, lazySet and volatile access to highlight that false sharing is not limited to memory explicitly involved in multi-threading building blocks.

Multi-Threaded experiments with JMH

JMH offers support for running multi-threaded benchmarks, which is a new feature for such frameworks (to the best of my knowledge). I'll cover some of the available annotations and command line options I used for this benchmark. First up, let's look at the code:
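The original embed is lost; here's a sketch of the plain access version written against the current JMH annotations (the 2013 original used @GenerateMicroBenchmark, since renamed @Benchmark; names like LONGS_PER_LINE are mine):

    import java.util.concurrent.atomic.AtomicInteger;
    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;

    public class FalseSharingBenchmark {
        static final int LONGS_PER_LINE = 8; // assumes a 64 byte cache line
        static final int MAX_THREADS = 16;

        @State(Scope.Benchmark)
        public static class SharedCounters {
            final long[] counters = new long[MAX_THREADS * LONGS_PER_LINE];
        }

        @State(Scope.Thread)
        public static class ThreadIndex {
            static final AtomicInteger THREAD_ID = new AtomicInteger();
            final int id = THREAD_ID.getAndIncrement();
            final int packedIndex = id;                  // counters adjacent: false sharing
            final int paddedIndex = id * LONGS_PER_LINE; // a cache line apart: no sharing
        }

        @Benchmark
        public void shared(SharedCounters s, ThreadIndex t) {
            s.counters[t.packedIndex]++;
        }

        @Benchmark
        public void unshared(SharedCounters s, ThreadIndex t) {
            s.counters[t.paddedIndex]++;
        }
    }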

The lazySet and volatile access versions are very similar, using an AtomicLongArray instead of the primitive array used above.
In the previous post I briefly described the State annotation and made the most basic use of it. Here's the full description from the javadoc (slightly edited to read as one paragraph here):
State: Marks the state object. State objects naturally encapsulate the state on which benchmark is working on. The Scope of state object defines to which extent it is shared among the worker threads. Three Scope types are supported: 
  • Benchmark: The objects of this scope are always shared between all the threads, and all the identifiers. Note: the state objects of different types are naturally distinct. Setup and TearDown methods on this state object would be performed by one of the worker threads. No other threads would ever touch the state object. 
  • Group: The objects of this scope are shared within the execution group, across all the identifiers of the same type. Setup and TearDown methods on this state object would be performed by one of the group threads. No other threads would ever touch the state object.
  • Thread: The objects of this scope are always unshared, even with multiple identifiers within the same worker thread. Setup and TearDown methods on this state object would be performed by single worker thread exclusively. No other threads would ever touch the state object.
In the above example I'm exploring the use of scoped state further by utilising both a Benchmark scope state to store the shared array and a Thread scope state to hold the thread counter index for both the false shared and no sharing cases.
The state classes I define are instantiated by JMH rather than directly by my benchmark class. The framework then passes them as parameters to the methods being benchmarked, the only difference between the methods being the use of a different index for accessing the array of counters. As I add more threads, more instances of the Thread state will materialize, but only a single Benchmark state object will be used throughout.

Great Expectations

Before we start looking at what JMH offers us let's consider our expectations of the behaviour under test:
  1. We expect shared and unshared cases to behave the same when there is only a single benchmarking thread.
  2. When there is more than 1 benchmarking thread, we expect the unshared case to outperform the shared case.
  3. We expect the unshared case to scale linearly, where the shared case performance should scale badly.
We can put our expectations/theory to the test by running the benchmark and scaling the number of threads participating. JMH offers us several ways to control the number of threads allocated to a benchmark:
  • -t 4 - will launch the benchmark with 4 benchmark threads hitting the code.
  • -tc 1,2,4,8 - will replace both the count of iterations and the number of threads by the thread count pattern described. This example will result in the benchmark being run for 4 iterations, iteration 1 will have 1 benchmarking thread, the second will have 2, the third 4 and the last will have 8 threads.
  • -s - will scale the number of threads per iteration from 1 to the maximum set by -t
I ended up either setting the number of threads directly or using the 'pattern' option. Variety is the spice of life though :-) and it's nice to have the choice.
False sharing as a phenomenon is a bit random (in Java, where memory alignment is hard to control) as it relies on the memory layout falling a particular way on top of the cache lines. Consider 2 adjacent memory locations being written to by 2 different threads: should they happen to be placed on different cache lines in a particular run, then that run will not suffer false sharing. This means we'll want statistics from a fair number of runs to make sure we hit the layout variations we expect. JMH supports running multiple forks of your benchmark, with the -f option determining the number of forks.
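Putting these together, the runs behind the charts below would be launched with something like this (the jar name is an assumption, use whatever your JMH build produces; the class name matches the sketch above and the flags are the ones described):

    java -jar microbenchmarks.jar ".*FalseSharing.*" -t 12 -f 20 -i 5 -r 1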

And the results are?

Left as an exercise to the reader :-). I implore you to go have a play, come on.
Here's a high level review on my experimentation process for this benchmark:

  • As a sanity check, during development I ran the benchmark locally. My laptop is a dual core MBA and is really not a suitable environment for benchmarking; still, the results matched the theory, albeit with massive measurement error:
    • Even at this point it was clear that false sharing is a significant issue. Results were consistent, if unstable, in their suggestion of how bad it is.
    • Also as expected, volatile writes are slower than lazy/ordered writes which are slower than plain writes.
  • Once I was happy with the code I deployed it on a more suitable machine. A dual socket Intel Xeon CPU E5-2630 @ 2.30GHz, using OpenJDK 7u9 on CentOS 6.3(OS and JDK are 64bit). This machine has 24 logical cores and is tuned for benchmarking (what that means is a post onto itself). First up I thought I'd go wild and use all the cores, just for the hell of it. Results were interesting if somewhat flaky. 
    • Using all the available CPU power is not a good idea if you want stable results ;-)
  • I decided to pin the process to a single socket and use all its cores. This leaves plenty of head room for the OS and gives me data on scaling up to 12 threads. Nice. I could now have a meaningful run with 1, 2, 4, 8 and 12 threads to see the way false sharing hurts scaling for the different write methods. At this point things started to get more interesting:
    • Volatile writes suffering from false sharing degrade in overall performance as more threads are added; 12 threads suffering from false sharing achieve less than a single thread does.
    • Once false sharing is resolved, volatile writes scale near linearly until 8 threads, and the same is true for lazy and plain writes. Scaling from 8 to 12 threads is not quite so clean.
    • Lazy/plain writes do not suffer quite as much from false sharing, but the more threads are at work, the larger the difference between the false shared and unshared cases becomes (up to 2.5x).
    (Chart: scaling from 1 to 12 threads)
  • I focused on the 12 thread case, running longer/more iterations and observing the output. It soon became clear that the results for the false shared access suffer from high run-to-run variance. The reason for that, I assumed, was the different memory layout from run to run resulting in different cache line populations. In the case of 12 threads we have 12 adjacent longs; given the cache line is 8 longs wide, they can be distributed over 2 or 3 lines in a number of ways (2-8-2, 1-8-3, 8-4, 7-5, 6-6). Going ever deeper into the rabbit hole, I thought I could print out the thread specific stats (choose -otss, JMH delivers again :-) ). The thread stats did deliver a wealth of information, but at this point I hit slight disappointment... there was no (easy) way for me to correlate thread numbers from the run to thread numbers in the output, which would have made my life slightly easier in characterising the different distributions.
  • I had enough and played with the data until I was happy with the results. And I finally found a use for a stacked bar chart :-). Each thread's score is a different colour; the sum of all thread scores is the overall score.
Here are the results for plain/volatile shared/unshared running on 12 cores (20 runs, 5 iterations per run: -f 20 -i 5 -r 1):
(Chart: plain writes suffering from false sharing. Note the different threads score differently and the overall result falls into several distinct 'categories'.)
(Chart: plain writes with no false sharing. Thread scores are fairly uniform, and the overall score is stable.)
(Chart: volatile writes suffering from false sharing. The score varies by less than it did in the plain write case, but similar categories emerge.)
(Chart: volatile writes with no false sharing, with similarly uniform behaviour for all threads.)
Hope you enjoyed the pictures :-)

UPDATE (1/08/2013): If you are looking for more JMH related info, see Shipilev's slides from JVMLS.
UPDATE (1/07/2014): The samples repository has been updated to reflect JMH progress; sample code may have minor differences from the code presented above and command line options may differ. I will follow up at some point with a revamp of the JMH related posts, but for now you can always reproduce the above results by reviving the older code from history.