Lock-free Data Structures: Atomicity and Atomic Primitives (kukuruku.co)
72 points by adamnemecek on May 12, 2014 | 10 comments


> It’s quite difficult, if possible at all, to contrive an algorithm of a lock-free container which uses read/write only (I don’t know such data structures for an arbitrary number of threads).

I don't know either, but Peterson's Algorithm is a mind-melting and elegant way to achieve mutual exclusion for two threads, using reads and writes (and in reality probably some barriers), generalizable to N threads.

  http://en.wikipedia.org/wiki/Peterson%27s_algorithm
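For reference, here is a minimal sketch of the two-thread version in Java. The class name is made up, and AtomicIntegerArray / volatile are used purely to get the sequentially consistent ordering (the "barriers" mentioned above) that plain Java fields would not guarantee:

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Sketch of Peterson's two-thread mutual exclusion: at the algorithmic
// level it uses only reads and writes, no CAS or test-and-set.
class Peterson {
    // flag.get(i) == 1 means thread i wants to enter the critical section.
    private final AtomicIntegerArray flag = new AtomicIntegerArray(2);
    private volatile int turn; // whose turn it is to wait

    void lock(int me) {
        int other = 1 - me;
        flag.set(me, 1);   // announce intent to enter
        turn = other;      // politely yield priority to the other thread
        // Busy wait: spin while the other thread wants in and has priority.
        while (flag.get(other) == 1 && turn == other) {
            // spinning -- this loop is what makes the algorithm blocking
        }
    }

    void unlock(int me) {
        flag.set(me, 0);   // no longer interested
    }
}
```

Note that the spin loop in `lock` is exactly the "busy wait" the sibling comment points at: a thread parked in that loop makes no progress until the lock holder acts, which is why Peterson's algorithm counts as blocking rather than lock-free.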


Peterson's Algorithm isn't lock-free, it's a blocking one.


Hmm, are you sure that's correct? I believe the "progress" and "bounded waiting" properties here

http://en.wikipedia.org/wiki/Peterson%27s_algorithm#Progress

are enough to guarantee system-wide progress. Can you elaborate?

Edit: Nevermind. So of course we'd need the critical section and thread-coordination mechanism to be a single atomic operation for this to be a lock-free algorithm. I should RTFA.


... if you look at the pseudocode, the "busy wait" comment (with its busy while loop...) and the "critical section" comment should also be very large flags that this is a blocking operation.


Wow, that was a much better article than I was expecting. I'm eager to read the others in the series.

> Thus, for an Intel Xeon 3.06 GHz (2005 model), an atomic increment costs about 400 nops, and CAS 850–1000 nops. For the IBM Power4 1.45 GHz processor: 180 nops for an atomic increment and 250 nops for CAS. The measurements are quite old, and processor architecture has made several steps forward since then, but the order of the figures is the same, I guess.

Does anyone know if his guess is correct? What are the numbers like for modern processors?


It's a bad idea to treat these operations as constants. They get slower as traffic and contention increase. Different cache topologies will have different timings, and different thread scheduling arrangements will also have different timings.

I find it helpful to think in terms of cache lines and cache coherence as that is the biggest factor in performance on cache coherent CPUs when you want to move a lot of data between cores.

A CAS is no worse than a store in that scenario because they both generate the same coherence traffic. One key difference is that the store can drain in the background while the CAS will pause processing.

Once you start thinking of the CPU as an asynchronous instruction processor it starts to get more interesting. For instance, in Java, if you use volatile, it will stall the CPU until the volatile write completes, despite the fact that in most cases you don't actually care. But you can use Atomic* with lazySet, or Unsafe, to make the store non-blocking while still getting the necessary protection against reordering.
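A small sketch of that set vs. lazySet distinction (the class and method names here are made up for illustration):

```java
import java.util.concurrent.atomic.AtomicLong;

// set() has full volatile-store semantics: the CPU can't retire past it
// until the ordering is enforced. lazySet() only forbids reordering with
// *earlier* writes, so the store can drain from the store buffer in the
// background without stalling the writing thread.
class Publisher {
    private final AtomicLong value = new AtomicLong();

    void publishStrict(long v) {
        value.set(v);      // full volatile store: stalls the writer
    }

    void publishLazy(long v) {
        value.lazySet(v);  // ordered store: cheaper, still no reordering with prior writes
    }

    long read() {
        return value.get();
    }
}
```

The trade-off is that a lazySet store may become visible to other threads slightly later; for single-writer hand-off patterns (e.g. a producer publishing into a ring buffer) that delay is usually irrelevant, which is why the non-blocking form wins.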

What I am getting at is that thinking about a CPU in terms of instructions is the wrong way to go about it. Instructions have to actually execute somehow, and that is where the rubber meets the road. A single core can simultaneously be performing several arithmetic operations, draining stores, fetching and pre-fetching cache lines, and several other things I probably haven't learned about. It's causing stalls in this process, or introducing dependencies, that prevents things from moving along quickly.


I agree with what you are saying, but am still interested in how the numbers have changed over time. This paper seems like a reasonable update, as well as a good introduction to the variables involved:

http://sigops.org/sosp/sosp13/papers/p33-david.pdf

Only having skimmed it, I think that Table 2 is making the same point that you did: that on x64, apart from instruction dependency, there is very little difference in latency between a CAS and a store.


I don't know why lock-free concurrency control is so fashionable these days. I realize that deadlock and livelock are hard to debug, but STM and software-implemented CAS are CPU hogs that get worse as load increases.


Doesn't it depend on the overlap and granularity of work? The concurrency mechanism should be treated as a failsafe on conflict, reducing the conflict in the algorithm should be the primary goal.


This is a translation from Russian. The original article is here: http://habrahabr.ru/company/ifree/blog/195948/



