The Art of Multiprocessor Programming

Category: Programming
Author: Maurice Herlihy, Nir Shavit
All Stack Overflow 18
This Year Stack Overflow 2
This Month Stack Overflow 7


by anonymous   2019-07-21
If you can definitely prove that a method has no linearization points, does it necessarily 
mean that that method is not linearizable? 

Firstly, linearizability is not a property of a method; it is a property of an execution sequence.

How can you prove that a method has no linearization points?

Whether we can find a linearization point for a method depends on the execution sequence.

For example, consider the following sequence for thread A on a FIFO queue, where t1, t2, t3 are time intervals:

A.enq(1)   A.enq(2)   A.deq(1)
   t1         t2         t3

We can choose the linearization point (lp) of the first two enq calls as any point in time intervals t1 and t2 respectively, and the lp of deq as any point in t3. The points that we choose are the lps for these methods.

Now, consider a faulty implementation

A.enq(1)   A.enq(2)   A.deq(2)
   t1         t2         t3

Linearizability requires the lps to respect the real-time ordering of the methods. Therefore, the lps should follow the time ordering, i.e. t1 < t2 < t3. However, since our implementation is faulty, we clearly cannot do this: no sequential FIFO queue dequeues 2 before 1. Hence we cannot find a linearization point for the method A.deq(2), and in turn our sequence is not linearizable.

Hope this helps. If you need to know more, you can read this book:

by anonymous   2019-07-21

There are some major problems in your thinking.

  1. Does your PC have 30 cores, so that every core can take exactly one task? I don't think so.
  2. Starting a separate thread also takes some time.
  3. Every additional concurrent thread generates more overhead.
  4. Can your problem be solved faster by starting more threads? That is only the case when the threads do different kinds of work, like reading from disk, querying a database, computing something, etc. Ten threads that all do "high-performance" CPU-bound work won't give a boost; quite the contrary, because every thread has to save its state before it can hand the CPU to the next thread, and that one has to do the same.
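To make the points above concrete, here is a minimal Java sketch (Java rather than .NET, purely for illustration; class and method names are my own) of the usual remedy: size a fixed thread pool to the number of cores, so the thread-startup and context-switch overhead from points 2 and 3 is paid once per core rather than once per task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolDemo {
    // Run `tasks` CPU-bound jobs on a pool sized to the machine's cores,
    // reusing threads instead of starting one per task.
    static int sumOfSquares(int tasks) {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                final int n = i;
                results.add(pool.submit(() -> n * n)); // CPU-bound work
            }
            int sum = 0;
            for (Future<Integer> f : results)
                sum += f.get(); // wait for each result
            return sum;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(30)); // squares of 0..29 sum to 8555
    }
}
```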

Check this link out

You can use the TPL

It tries to get the maximum effect from parallel threads. I also recommend this book

If you really want to solve your problem in under 2 seconds, buy more CPU power ;)

by anonymous   2019-07-21

I have also been looking for such a book; they are very hard to come by. This one will be released in May, if that's any help:

I purchased this book:

It's very good. It's in Java, but most of the principles apply to C/C++ anyway.

by anonymous   2019-07-21

There seem to be misconceptions by even guru software developers about what constitutes a lock.

One has to make a distinction between atomic operations and locks. Atomic operations like compare-and-swap perform an operation (which would otherwise require two or more instructions) as a single uninterruptible instruction. Locks are built from atomic operations; however, they can result in threads busy-waiting or sleeping until the lock is unlocked.
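As a concrete illustration of the distinction, here is a minimal lock-free counter in Java built directly on compare-and-swap (my own sketch, not from the book): no thread ever blocks; a thread that loses the race simply retries.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    // Lock-free increment: retry the CAS until no other thread
    // changed the value between our read and our write.
    public int increment() {
        while (true) {
            int current = value.get();   // read
            int next = current + 1;      // compute
            if (value.compareAndSet(current, next)) // single atomic step
                return next;
            // another thread interfered; loop and retry
        }
    }

    public int get() { return value.get(); }
}
```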

In most cases, if you manage to implement a parallel algorithm with atomic operations, without resorting to locking, you will find that it is orders of magnitude faster. This is why there is so much interest in wait-free and lock-free algorithms.

There has been a ton of research done on implementing various wait-free data structures. While the code tends to be short, it can be notoriously hard to prove that these structures really work, due to the subtle race conditions that arise. Debugging is also a nightmare. However, a lot of work has been done, and you can find wait-free/lock-free hashmaps, queues (Michael and Scott's lock-free queue), stacks, lists, trees; the list goes on. If you're lucky you'll also find some open-source implementations.

Just google 'lock-free my-data-structure' and see what you get.

For further reading on this interesting subject start from The Art of Multiprocessor Programming by Maurice Herlihy.

by anonymous   2019-07-21

To build upon the answers above, a method can indeed be described as linearizable. As referenced in the book that djoker mentioned:

on page 69, exercise 32, we see such an example.

It should be noted that enq() is indeed a method that is described as possibly being linearizable or not linearizable.

Proving that a point is a valid linearization point comes down to checking whether any example execution can break linearizability. If you assume that a particular read/write operation in a method is its linearization point, and then show by contradiction that this assumption leads to a non-linearizable execution, you can declare that the operation is not a valid linearization point.

Take, for example, the following enq()/deq() methods, assuming they are part of a standard queue implementation with head/tail pointers and a backing array "arr":

public class TerribleQueue<T> {
  T[] arr;
  int head, tail;

  public TerribleQueue() {
    arr = (T[]) new Object[10];  // "new T[10]" is not legal Java
    tail = 0;
    head = 0;
  }

  void enq(T x) {
    int slot = tail;
    arr[slot] = x;
    tail = tail + 1;
  }

  T deq() {
    if (head == tail) throw new EmptyQueueException();
    T temp = arr[head];
    head = head + 1;
    return temp;
  }
}
In this terrible implementation, we can easily prove, for example, that the first line of enq() is not a valid linearization point: assume that it is a linearization point, and then find an example showing otherwise, as seen here:

Take two threads, A and B, and the following example history:

A: enq( 1 )
A: slot = 0
B: enq( 2 )
B: slot = 0

(A and B are now past their linearization points, therefore we are not allowed to re-order them to fit our history)

A: arr[0] = 1
B: arr[0] = 2
A: tail = 1
B: tail = 2

C: deq()
C: temp = arr[0] = 2
C: head = 1
C: return 2

Now we see that, because of our choice of linearization point (which fixes the order of A and B), this execution is impossible to linearize: we cannot make C's deq() return 1, no matter where we place it.
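By contrast, a sketch (hypothetical, for illustration only; a complete lock-free queue needs more care, e.g. for slots reserved but not yet written) of how making the slot reservation a single atomic step rules out the bad history above:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Reserving the slot with an atomic getAndIncrement makes the reservation
// itself indivisible, so two enqueuers can no longer claim the same slot
// the way A and B both claimed slot 0 in the history above.
class SlightlyLessTerribleQueue<T> {
    final Object[] arr = new Object[10];
    final AtomicInteger tail = new AtomicInteger(0);

    void enq(T x) {
        int slot = tail.getAndIncrement(); // one indivisible reserve-and-advance
        arr[slot] = x;
    }
}
```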

Kind of a long-winded answer, but I hope this helps.

by anonymous   2019-01-13

An atomic reference should be used in a setting where you need to perform a simple atomic (i.e. thread-safe, non-trivial) operation on a reference, for which monitor-based synchronization is not appropriate. Suppose you want to install a new version of a cached object, but only if the cache still holds the value you last read:

AtomicReference<Object> cache = new AtomicReference<Object>();

//... time passes ...
Object cachedValueToUpdate = cache.get();
//... do some work to transform cachedValueToUpdate into a new version
Object newValue = someFunctionOfOld(cachedValueToUpdate);
// succeeds only if no other thread replaced the cached value in the meantime
boolean success = cache.compareAndSet(cachedValueToUpdate, newValue);

Because of the atomic reference semantics, you can do this even if the cache object is shared amongst threads, without using synchronized. In general, though, you're better off using synchronizers or the java.util.concurrent framework rather than the bare Atomic* classes unless you know what you're doing.
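For completeness, the usual retry idiom around compareAndSet looks like this (a sketch; since Java 8, AtomicReference.updateAndGet wraps exactly this loop, so you rarely need to write it by hand):

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

class CasLoop {
    // If the CAS fails because another thread updated the reference
    // between our get() and compareAndSet(), read again and retry.
    static <T> T update(AtomicReference<T> ref, UnaryOperator<T> f) {
        while (true) {
            T old = ref.get();
            T next = f.apply(old);
            if (ref.compareAndSet(old, next))
                return next;
        }
    }
}
```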

Two excellent dead-tree references which will introduce you to this topic:

Note that (I don't know if this has always been true) reference assignment (i.e. =) is itself atomic (updating primitive 64-bit types like long or double may not be atomic, but updating a reference is always atomic, even if it's 64-bit) without explicitly using an Atomic*.
See the Java Language Specification, 3rd ed., Section 17.7.

by anonymous   2017-08-20

Lock-free skip lists are described in the book The Art of Multiprocessor Programming, and the technical report Practical lock-freedom, which is based on a PhD thesis on the subject. The skip list discussion begins on page 53. An example implementation, based on these sources, is included in this google code project.

There are related discussions, links to literature and implementations (not necessarily lock-free) in the SO questions Skip List vs. Binary Tree, and Skip Lists - ever used them?.

by anonymous   2017-08-20

My own question, with answer, on Stack Overflow's sister site:

I will copy the answer to avoid the need for click-through:

Quote Boris:

Parallel Programming with Microsoft .NET: Design Patterns for Decomposition and Coordination on Multicore Architectures

This is a book I recommend wholeheartedly.

It is:

New - published last year, so you are not reading somewhat outdated practices.

Short - about 200+ pages, dense with information. These days there is too much to read and too little time to read 1000+ page books.

Easy to read - not only is it very well written, but it introduces hard-to-grasp concepts in a really simple way.

Intended to teach - each chapter gives exercises to do. I know it is always beneficial to do these, but rarely do them. This book gives very compelling and interesting tasks, and surprisingly I did most of them and enjoyed doing them.

Additionally, if you wish to learn more of the low-level details, this is the best resource I have found: "The Art of Multiprocessor Programming". It uses Java for its code samples, which plays nicely with my C# background.

PS: I have about 5 years of "hard core" parallel programming experience (albeit using C#), so I hope you can trust me when I say that "The Art of Multiprocessor Programming" rocks.

by anonymous   2017-08-20

Reads and writes to a shared memory location take a finite period of time, so they may either overlap, or be completely distinct.


Thread 1:      wwwww     wwwww
Thread 2:   rrrrr              rrrrr
Thread 3:   rrrrr rrrrr

The first read from thread 2 overlaps with the first write from thread 1, whilst the second read and second write do not overlap. In thread 3, both reads overlap the first write.

A safe register gives guarantees only for reads that do not overlap writes. If a read does not overlap any write, it must return the value written by the most recent write; otherwise it may return any value that the register may hold. So, in thread 2, the second read must return the value written by the second write, but the first read can return any valid value.

A regular register adds the additional guarantee that if a read overlaps with a write then it will either read the old value or the new one, but multiple reads that overlap the write do not have to agree on which, and the value may appear to "flicker" back and forth. This means that two reads from the same thread (such as in thread 3 above) that both overlap the write may appear "out of order": the earlier read returning the new value, and the later returning the old value.

An atomic register guarantees that reads and writes appear to happen at a single point in time. Reads that take effect before that point all read the old value, and reads that take effect after it all read the new value. In particular, if two reads from the same thread overlap a write, then the later read cannot return the old value if the earlier read returned the new one. Atomic registers are linearizable.
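As a rough Java illustration (my own sketch, not from the book): a volatile int field behaves like an atomic register for single reads and writes, since volatile accesses are atomic and totally ordered across threads.

```java
// A single-word atomic register: each read() and write() takes effect
// at one indivisible point, so overlapping reads can never appear
// "out of order" the way regular-register reads can.
class AtomicRegister {
    private volatile int value;

    int read()        { return value; }
    void write(int v) { value = v; }
}
```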

The Art of Multiprocessor Programming by Maurice Herlihy and Nir Shavit gives a good description, along with examples and use cases.

by anonymous   2017-08-20

In general, you have two options: thread-based parallelism or task-based parallelism. The first is the more traditional approach, and pthreads and OpenMP are good examples of it. In the second alternative, you have one more level of abstraction, where you see your parallel program as a set of tasks that are mapped onto threads. A good reference for learning the model of computation is chapter 27 of Cormen's Introduction to Algorithms, and some tools for programming in this style are Cilk Plus, Threading Building Blocks, OpenMP tasks, and Microsoft's Task Parallel Library.
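As a small illustration of the task-based style in Java (a sketch using the JDK's fork/join framework, the closest standard-library analogue of the task libraries above): the work is expressed as recursively split tasks, and the pool decides how to map them onto threads.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class SumTask extends RecursiveTask<Long> {
    private final long[] data;
    private final int lo, hi;

    SumTask(long[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= 1000) {              // small enough: sum sequentially
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) >>> 1;          // otherwise split into two subtasks
        SumTask left = new SumTask(data, lo, mid);
        left.fork();                        // schedule left half asynchronously
        long right = new SumTask(data, mid, hi).compute();
        return right + left.join();         // combine the two halves
    }

    static long parallelSum(long[] data) {
        return new ForkJoinPool().invoke(new SumTask(data, 0, data.length));
    }
}
```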

Finally, you can read The Art of Multiprocessor Programming.

by anonymous   2017-08-20

The following is taken directly from The Art of Multiprocessor Programming, which is a good book to learn about this stuff. There are actually two implementations presented: a simple version and a fair version. I'll go ahead and reproduce the fair version.

One of the requirements for this implementation is that you have a condition variable primitive. I'll try to figure out a way to remove it but that might take me a little while. Until then, this should still be better than nothing. Note that it's also possible to implement this primitive using only locks.

import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class FifoReadWriteLock {
    int readAcquires = 0, readReleases = 0;
    boolean writer = false;
    ReentrantLock lock = new ReentrantLock();
    Condition condition = lock.newCondition(); // This is the condition variable.

    void readLock() throws InterruptedException {
        lock.lock();
        try {
            while (writer)          // wait while a writer holds or wants the lock
                condition.await();
            readAcquires++;
        } finally {
            lock.unlock();
        }
    }

    void readUnlock() {
        lock.lock();
        try {
            readReleases++;
            if (readAcquires == readReleases)
                condition.signalAll(); // last reader out: wake any waiting writer
        } finally {
            lock.unlock();
        }
    }

    void writeLock() throws InterruptedException {
        lock.lock();
        try {
            while (writer)          // wait for any other writer to finish
                condition.await();

            writer = true;          // claim the lock: no new readers may enter

            while (readAcquires != readReleases)
                condition.await();  // wait for the readers already inside
        } finally {
            lock.unlock();
        }
    }

    void writeUnlock() {
        lock.lock();
        try {
            writer = false;
            condition.signalAll();  // wake waiting readers and writers
        } finally {
            lock.unlock();
        }
    }
}
First off, I simplified the code a little but the algorithm remains the same. There also happens to be an error in the book for this algorithm which is corrected in the errata. If you plan on reading the book, keep the errata close by or you'll end up being very confused (like me a few minutes ago when I was trying to re-understand the algorithm). Note that on the bright side, this is a good thing since it keeps you on your toes and that's a requirement when you're dealing with concurrency.

Next, while this may be a Java implementation, only use it as pseudo code. When doing the actual implementation you'll have to be careful about the memory model of the language, or you'll definitely end up with a headache. As an example, I think that the readAcquires, readReleases and writer variables would all have to be declared volatile in Java, or the compiler is free to optimize them out of the loops. This is because in a strictly sequential program there's no point in continuously looping on a variable that is never changed inside the loop. Note that my Java is a little rusty, so I might be wrong. There's also another issue with integer overflow of the readReleases and readAcquires variables, which is ignored in the algorithm.

One last note before I explain the algorithm. The condition variable is initialized using the lock. That means that when a thread calls condition.await(), it gives up its ownership of the lock. Once it's woken up by a call to condition.signalAll() the thread will resume once it has reacquired the lock.

Finally, here's how and why it works. The readAcquires and readReleases variables keep track of the number of threads that have acquired and released the read lock. When these are equal, no thread has the read lock. The writer variable indicates that a thread is trying to acquire the write lock or already has it.

The read lock part of the algorithm is fairly simple. When trying to lock, it first checks to see if a writer is holding the lock or is trying to acquire it. If so, it waits until the writer is done and then claims the lock for the readers by incrementing the readAcquires variable. When unlocking, a thread increases the readReleases variable and if there's no more readers, it notifies any writers that may be waiting.

The write lock part of the algorithm isn't much more complicated. To lock, a thread must first check whether any other writer is active. If they are, it has to wait until the other writer is done. It then indicates that it wants the lock by setting writer to true (note that it doesn't hold it yet). It then waits until there's no more readers before continuing. To unlock, it simply sets the variable writer to false and notifies any other threads that might be waiting.

This algorithm is fair because the readers can't block a writer indefinitely. Once a writer indicates that it wants to acquire the lock, no more readers can acquire the lock. After that the writer simply waits for the last remaining readers to finish up before continuing. Note that there's still the possibility of a writer indefinitely blocking another writer. That's a fairly rare case but the algorithm could be improved to take that into account.

So I re-read your question and realised that I partly (badly) answered it with the algorithm presented above. So here's my second attempt.

The algorithm you described is fairly similar to the simple version presented in the book I mentioned. The only problem is that (a) it's not fair and (b) I'm not sure how you would implement "wait until recursive mutex has lock count zero". For (a), see above; for (b), the book uses a single int to keep track of the readers and a condition variable to do the signalling.

by kmavm   2017-08-19
If you are into wait-free and lock-free data structures, you owe it to yourself to get Herlihy and Shavit's "Art of Multiprocessor Programming."

Not only does it provide a well-debugged, well-written collection of wait-free data structures; it teaches you to think about concurrency in a structured way. The mutual exclusion chapter doesn't teach you which pthread routines to use, but instead tells you how to implement mutual exclusion, and the trade-offs inherent in, e.g., spin locks vs. queueing locks. A significant bonus is Maurice Herlihy's distinctive prose voice: correct, concise, and funny in that order. It is the best computer book I've read recently, and the one I recommend to all colleagues who are trying to take their concurrency chops up to the next level.

(I took a course from Herlihy in '99, but am otherwise unaffiliated.)

by gtani   2017-08-19
Since then, Real World Haskell (freely available), Hutton's book, and maybe another have come out.

also: Herlihy & Shavit, The Art of Multiprocessor Programming

by strlen   2017-08-19
These are rather basic (producer/consumer queues) and, as some have pointed out, buggy. I'd highly suggest looking at Boost's threading library (for an example of an object-oriented approach to threading that takes advantage of RAII; much of it is now standard in C++11), Intel's Threading Building Blocks, Java's built-in concurrency utilities (the java.util.concurrent package), and Doug Lea's Fork/Join framework. A great book on the subject is Maurice Herlihy's The Art of Multiprocessor Programming:

The book is in Java, but C++11 has the required primitives (a cross-platform compare-and-swap for integral types, a defined memory model), so you could follow along in C++11.