Monday 11 November 2013

SPSC IV - A look at BQueue

{This post is part of a long running series on lock free queues, check out the full index to get more context here}
I thought I was done with SPSC queues, but it turns out they were not done with me...
A kind reader, Rajiv Kurian, referred me to this interesting paper on an alternative SPSC implementation called BQueue. The paper compares the BQueue to a naive Lamport queue, the FastFlow queue and the MCRingBuffer queue and claims great performance improvements both in synthetic benchmarks and when utilised in a more real world application. I had a read, ported it to Java and compared it with my growing collection; here are my notes for your shared joy and bewilderment.

Quick Summary of the Paper's Intro, Up to the Implementation

I enjoyed reading the paper; it's not a hard read if you have the time. If not, here are my brief notes (but read the original, really):
  • Some chit chat goes into why cross core queues are important - agreed.
  • Paper claims many SPSC queues perform very differently in a test-bed vs the real world, in particular as real world applications tend to introduce cache traffic beyond the trivial traffic experienced in naive benchmarking - I agree with the sentiment, but I think the "Real World" is a challenging thing to bring into your lab in any case. I would argue that a larger set of synthetic experiments would serve us all better in reviewing queue performance. I also admit my benchmarking thus far has not been anywhere near rich enough to cover all I want it to cover. The application presented by the authors is a single sample of the rich ocean of applications out there; who's to say they have found the one shoe which fits all? Having said that, it still strikes me as a good way to benchmark and a valid sample to add to the general "your mileage may vary, measure with your own application" rule.
  • Existing queue implementations require configuration or deadlock prevention mechanisms and are therefore harder to use in a real application - While this argument is partially true for the solutions discussed thus far on this blog, I fear that for best results some tuning has to be done in any case. In particular the solution presented by the paper still requires some configuration.
  • Examines Lamport and points out cache line thrashing is caused by cross checking the head/tail variables - Code for the Lamport variation can be found here. The variant put forward by Martin Thompson and previously discussed here (with code here) solves this issue by employing producer/consumer local caches of these values (a minimal sketch of this caching follows this list). This enhancement, and the cache coherency traffic involved, is discussed in more depth here.
  • Examines the FastForward queue, which is superior to Lamport but - "cache thrashing still occurs if two buffer slots indexed by head and tail are located in the same cache line". To avoid this issue FF employs Temporal Slipping, which is another way of saying a sleep in the consumer if the head/tail are too close. This technique leads to cache thrashing again (measuring how close they are means reading both counters), and the definition of too close is the configuration mentioned and resented above. - Right, I was not aware of temporal slipping, but I love the term! Next time I'm late anywhere I'll be sure to claim a temporal slippage. More seriously: this strikes me as a streaming-biased solution, intentionally introducing latency as the queue becomes nearly empty/full. From the way it is presented in the article it is also unclear when this method is to be used, as it is obviously not desirable to inflict it on every call.
  • Examines MCRingBuffer, which is a mix of 2 techniques: head/tail caching (like the one employed by Mr. T) and producer batching. The discussion focuses on the issues with producer batching (the producer waits for a local buffer to fill before actually writing anything to the queue), in particular the risk of deadlock between 2 threads communicating via such queues - I would have liked the discussion to explore the 2 optimisations separately. The head/tail cache is a very valid optimisation; the producer batching is very problematic due to the deadlock issue described, but also as it introduces arbitrary latency into the pipeline. Some variations on the queue interface support batching by allowing the publisher to 'flush' their writes to the queue, which is an interesting feature in this context (a second sketch after this list illustrates the batching/flush idea).
  • Some discussion of deadlock prevention techniques follows with the conclusion that they don't really work.
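To make the head/tail caching idea concrete, here is a minimal sketch of the approach (a hypothetical class of my own, not the code linked above: real implementations use power-of-two capacities with a mask, pad the fields against false sharing and use Unsafe for the ordered writes). The point to note is that each side only re-reads the other side's index when its cached copy suggests the queue is full/empty:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Illustrative sketch only: Lamport-style SPSC queue with cached indices.
class CachedCounterSpscQueue<E> {
    private final int capacity;
    private final AtomicReferenceArray<E> buffer;
    private final AtomicLong tail = new AtomicLong(); // written by producer only
    private final AtomicLong head = new AtomicLong(); // written by consumer only
    private long headCache; // producer-local copy of head
    private long tailCache; // consumer-local copy of tail

    CachedCounterSpscQueue(int capacity) {
        this.capacity = capacity;
        this.buffer = new AtomicReferenceArray<E>(capacity);
    }

    public boolean offer(E e) {
        final long t = tail.get();
        final long wrapPoint = t - capacity;
        if (headCache <= wrapPoint) {
            headCache = head.get();        // refresh only when we look full
            if (headCache <= wrapPoint) {
                return false;              // genuinely full
            }
        }
        buffer.lazySet((int) (t % capacity), e);
        tail.lazySet(t + 1);               // publish the element
        return true;
    }

    public E poll() {
        final long h = head.get();
        if (h >= tailCache) {
            tailCache = tail.get();        // refresh only when we look empty
            if (h >= tailCache) {
                return null;               // genuinely empty
            }
        }
        final int offset = (int) (h % capacity);
        final E e = buffer.get(offset);
        buffer.lazySet(offset, null);      // release the slot
        head.lazySet(h + 1);
        return e;
    }
}
```

When the queue is neither nearly full nor nearly empty, neither thread touches the other's counter at all, which is what removes the cross-checking traffic described above.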
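To illustrate the producer batching concern, here is a deliberately simplified sketch (again my own illustration, not MCRingBuffer's code) of a batching wrapper with a flush method. Elements sit invisibly in the producer-local array until the batch fills up or flush() is called; this is where the added latency comes from, and two threads talking to each other through such queues can each end up waiting for data the other side has buffered but not yet flushed:

```java
import java.util.Queue;

// Illustrative sketch only: producer-side batching over an SPSC queue.
class BatchingProducer<E> {
    private final Queue<E> queue;   // the underlying SPSC queue
    private final Object[] batch;   // producer-local staging buffer
    private int count;

    BatchingProducer(Queue<E> queue, int batchSize) {
        this.queue = queue;
        this.batch = new Object[batchSize];
    }

    public void send(E e) {
        batch[count++] = e;
        if (count == batch.length) {
            flush();                // only now does the consumer see anything
        }
    }

    @SuppressWarnings("unchecked")
    public void flush() {
        for (int i = 0; i < count; i++) {
            while (!queue.offer((E) batch[i])) {
                Thread.yield();     // back off until the queue has room
            }
            batch[i] = null;
        }
        count = 0;
    }
}
```

Exposing flush() on the queue interface at least puts the latency trade-off in the publisher's hands rather than making it an unconditional property of every write.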

Looking at the Implementation

So I've been playing with SPSC queues for a while and recently wrote a post covering both the FastFlow queue and the use of sparse data to eliminate the near empty thrashing issue described above. I was curious to see a new algorithm, so I ported the algorithm from the paper to Java. Note the scary Unsafe use is in order to introduce the required memory barriers (not NOOPs!). Here is what the new algo brings to the table when written on top of my already existing SPSC queues code (code here):

Changes I made to original (beyond porting to Java):
  • Renamed tail/head to be in line with my previous implementations.
  • Renamed the batch constants according to their use.
  • Left the modulo on capacity to the offset calculation and let the counters grow, as per my previous implementations (this helped me eliminate a bug in the original code).
In a way it's a mix of the FastFlow (thread local use of counters, null check the buffer to verify offer/poll availability) and the MC/Mr.T solutions (use a cached look-ahead index value we know to be available, allowing writes/reads up to that value with no further checks), which I like (yay cross pollination/mongrels... if you are into any variation on racial purity try having children with your cousins and let me know how it went). The big difference lies in the way the near empty/full conditions are handled. Here are a few points on the algorithm:
  • The cached batchHead/batchTail values are used to eliminate some of the null comparing that offer/poll must do in FFBuffer. Instead of reading the tail/head every time, we probe ahead on the queue. This counts on the null/not-null regions of the queue being contiguous (a simplified sketch of both probes follows this list):
    • For the producer, if the entry N steps ahead of TAIL (as long as N is less than capacity) is null we can assume all entries up to that entry are also null (lines 4-11). 
    • Similarly we probe ahead as we approach the near empty state, but here we are looking for a not-null entry. Given that the not-null region is contiguous, we probe into the queue ahead of the HEAD; if the entry is null then we have overshot the TAIL. If it is not null we can safely read up to that point with no further checks.
  • The BQueue offers us a variation on the above mentioned temporal slip that involves the consumer doing a binary search for a successful probe until it hits the next batch size or declares the queue empty (lines 40 - 49). This is triggered every time the consumer needs to re-calculate the batchHead value having exhausted the known available slots.
  • Every poll probe failure is followed by a spinning wait (lines 55-59).
  • Using the buffer elements as the indicator of availability allows the BQueue to avoid reverting to the cache thrashing caused by comparing and sharing the head/tail counters. 
  • The slowdown induced by the algorithm and the spinning wait allows the producer to build up further slack in the queue.  
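To make the probing logic easier to follow, here is a simplified sketch of the offer/poll paths as I read them. This is not the ported code (which uses Unsafe ordered writes, padded fields and power-of-two capacities, and adapts the poll batch based on history); the class, the constant values and the parkNanos stand-in for the spinning wait are all illustrative only:

```java
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.concurrent.locks.LockSupport;

// Illustrative sketch only: BQueue-style probing, not the actual port.
class BQueueSketch<E> {
    private static final int OFFER_BATCH_SIZE = 4096;    // placeholder value
    private static final int POLL_MAX_BATCH_SIZE = 4096; // placeholder value
    private static final long TICKS = 200;               // placeholder spin period (ns)

    private final int capacity;
    private final AtomicReferenceArray<E> buffer;
    // Single producer/single consumer: each pair of fields is only touched by its own thread.
    private long tail, batchTail; // producer side
    private long head, batchHead; // consumer side

    BQueueSketch(int capacity) {
        this.capacity = capacity;
        this.buffer = new AtomicReferenceArray<E>(capacity);
    }

    private int offset(long index) {
        return (int) (index % capacity);
    }

    public boolean offer(E e) {
        if (tail == batchTail) {
            // Probe OFFER_BATCH_SIZE ahead: if that slot is null, every slot between
            // the tail and it is null too (the null region is contiguous).
            if (buffer.get(offset(tail + OFFER_BATCH_SIZE)) != null) {
                return false;                  // not enough room ahead of the tail
            }
            batchTail = tail + OFFER_BATCH_SIZE;
        }
        buffer.lazySet(offset(tail), e);       // ordered store publishes the element
        tail++;
        return true;
    }

    public E poll() {
        if (head == batchHead) {
            // Probe ahead for a non-null slot, halving the probe distance on failure
            // and waiting in between, until data is found or the queue is declared empty.
            int batch = POLL_MAX_BATCH_SIZE;
            while (buffer.get(offset(head + batch - 1)) == null) {
                if (batch == 1) {
                    return null;               // nothing to read
                }
                batch = batch / 2;
                LockSupport.parkNanos(TICKS);  // stand-in for the original's spinning wait
            }
            batchHead = head + batch;          // safe to read up to (but not including) this index
        }
        final int offset = offset(head);
        final E e = buffer.get(offset);
        buffer.lazySet(offset, null);          // consuming a slot nulls it out
        head++;
        return e;
    }
}
```

Note that neither side ever reads the other side's counter: availability is inferred entirely from null/not-null slots, which is what keeps the head/tail indices out of the coherency traffic.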
The temporal slip alternative is the main contribution celebrated in the paper, so it is worth comparing the code snippets offered for all variations.

A Word On Benchmarking

I ran the same benchmark I've been using all along to evaluate SPSC queues. It's synthetic and imperfect. The numbers you get in this benchmark are not the numbers you get in the real world. Your mileage is almost certain to vary. I've been getting some worried responses/tweets/comments about this... so why do I use it?
  • If you have been reading this series of posts you will see that the benchmark has been instrumental in identifying and measuring the differences between implementations; it's been good enough to provide evidence of progress or regression.
  • As a test harness it's been easy to use and examine via other tools. Printing out assembly, analysing hardware counter data, tweaking to a particular format: all easily done with a simple test harness. 
It's imperfect, but practical. I'm increasingly uncomfortable with it as a benchmark because I don't feel the harness scales well to multi-producer examination, so I'm in the process of implementing the benchmarks using JMH. As that work is not complete, I use what I have.
In the interest of completeness/open for review/honesty/general info, here is how I run benchmarks at the moment:
  1. I'm using my laptop, a Lenovo y510p. It's an i7-4700MQ.
  2. When running benchmarks I pin all running processes to core 0, then run my benchmarks on cores 1-7 (I avoid core 1 if I can). It's not as good as isolcpus, but good enough for a quiet run IMHO.
  3. For this current batch of runs I also disabled the turbo-boost (set the frequency to 2.4GHz) to eliminate related variance. In the past I've left it on (careless, but it was not an issue at the time); also, summer is coming to South Africa and my machine overheats with it on.
  4. I'm running Ubuntu 13.10 64 bit, and using JDK7u45 also 64 bit.
  5. All the benchmarks were run with "-server -XX:+UseCondCardMark -XX:CompileThreshold=100000". I use these parameters for consistency with previous testing. In particular UseCondCardMark is important in this context. See the previous post for a more in-depth comparison of the flags' impact on performance.
  6. For this set of results I only examined the cross core behaviour, so the runs were pinned to cores 3 and 7.
  7. I do 30 runs of each benchmark to get an idea of run to run variance.
I include the above as an open invitation for correction and to enable others to reproduce the results should they wish. If anyone has the time and inclination to run the same benchmarks on their own setup and wants to share the data I'd be very interested. If anyone has a suite of queue benchmarks they want to share, that would be great too.
As for the validity of the results as an indication of real world performance... the only indication of real world performance is performance in the real world. Having said that, some problems can be classified as performance bugs and I would argue that false sharing and bad choice of instructions fall under that category. All other things being equal I'd expect the same algorithm to perform better when implemented such that its performance is optimal.

Taking BQueue for a Spin  

And how does it perform? It depends... on the configuration (what? but they said?! what?). There are 3 parameters to tune in this algorithm:
  1. TICKS - how long to spin for when poll probing fails.
  2. POLL_MAX_BATCH_SIZE - the limit, and initial value for the poll probe.
  3. OFFER_BATCH_SIZE - how far to probe for offer; also note that OFFER_BATCH_SIZE elements will remain unused of the queue capacity (i.e. the queue can only fit [capacity - OFFER_BATCH_SIZE] elements before offer fails). See the short example after this list.
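To make the capacity trade-off of that last parameter concrete, here is a tiny, made-up example (the values are placeholders, not recommendations):

```java
// Illustrative only: the three tuning knobs and the capacity cost of the offer probe.
public class BQueueTuningExample {
    static final long TICKS = 80;              // spin period per failed poll probe
    static final int POLL_MAX_BATCH_SIZE = 32; // limit/initial value for the poll probe
    static final int OFFER_BATCH_SIZE = 8192;  // how far ahead the offer probes

    public static void main(String[] args) {
        int capacity = 32 * 1024;
        // The offer probe requires OFFER_BATCH_SIZE null slots ahead of the tail,
        // so that many slots can never be filled:
        System.out.println("usable capacity = " + (capacity - OFFER_BATCH_SIZE)); // 24576
    }
}
```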
I played around with the parameters to find a good spot; here are some variations (see the legend for parameters, X is run number, Y is ops/sec):
  1. There's no point in setting the batch sizes lower than 32 (that's 2 cache lines of compressed OOPS references: 32 × 4 bytes = 128 bytes) as we are trying to avoid contention. Also, it'll just end up being a confusing branch for the CPU to hit on, and the whole point of batching is to get some amortized cost out of it. Even with batch sizes set as low as 32 for both, the queue performs well (median is around 261M ops/sec) but with significant variance.
  2. Upper left chart shows the results as the offer batch is extended (poll batch remains at 32). The larger offer batch offers an improvement. Note that extending the offer batch only gives opportunity for the producer to overtake the consumer as the consumer hits the poll probing logic every 32 steps. A large slack in the queue allows the best throughput (assuming consumer is mostly catching up). The variance also decreases as the offer batch is increased.
  3. Upper right chart shows the results for increasing the poll batch (offer batch remains at 32). As we can see this actually results in worse variance.
  4. Increasing both the offer and the poll batch size in step (lower left chart) ends up with better overall results, but still with quite significant variance.
  5. Keeping the offer/poll batch at 8k, I varied the spin period, which again results in a different profile, but no cure for the variance.
For context here is the BQ result (I picked the relatively stable 80,32,8k run) compared with the FFBuffer port and my inlined counters variation on Thompson/Lamport with and without sparse data (Y8 = inlined counters; the square brackets are parameters, for FF/Y8 it is the sparse shift, so 2 means use every 4th reference):
[NOTE: previous benchmarks for same queues were run with turbo boost on, leading to better but less stable results, please keep in mind when considering previous posts]
As you can see the BQueue certainly kicks ass when compared to the non-sparse data variations, but is less impressive when sparse data is used. Note that BQueue manages to achieve this result with far less memory as it still packs the references densely (some memory still goes wasted as mentioned above, but not as much as in the sparse data variations). What I read into the above results:
  1. This algorithm tackles the nearly empty/full queue in an appealing manner. Probing the queue to discover availability is also a means of touching the queue ahead and bringing some required data into the cache.
  2. The main reason for the speed up is the slowing down of the consumer when approaching empty. This serves as a neat solution for a queue that is under constant pressure.
  3. The spinning wait between probes presents a potentially unacceptable behaviour. Consider for instance a consumer who is sampling this queue for data, but needs to get on with other stuff should it find it empty. Alternatively consider a low latency application with bursty traffic such that queues are nearly empty most of the time. 
  4. I'll be posting a further post on latency benchmarking the queues, but currently the results I see (across cores, implementing ping pong with an in/out queue) suggest the FF queue offers the best latency (200ns RTT), followed by Y8 (300ns) and the BQueue coming in last (750ns). I expect the results to be worse with bursty traffic preventing the batch history from correctly predicting a poll batch size.

Summary

This is an interesting queue/paper to me, importantly because it highlights 2 very valid issues:
  1. Near empty/full queue contention is a significant concern in the design of queues and solving it can bring large performance gains to throughput.
  2. Real application benefits may well differ from synthetic benchmark benefits. To support better decision making a wider variety of tests and contexts needs to be looked at.
I think the end solution is appropriate for applications which are willing to accept the latency penalty incurred by the consumer when hitting a nearly empty queue.  The near full queue guard implemented for this queue can benefit other queue implementations and has no downside that I can see beyond a moderate amount of wasted space. 
Thanks again to Rajiv Kurian for the pointer to this queue, and Norman M. for his help in reviewing this post.