The last book on the JVM itself (e.g. stack-based bytecode, the class file format) I saw was from nearly 20 years ago; I cannot think of anything more modern. It's a niche that used to be popular mostly in FinTech circles. As a result, valuable info is scattered across random blogs and the JVM spec itself.
"Garbage Collection: Algorithms for Automatic Dynamic Memory Management" (https://www.amazon.com/Garbage-Collection-Algorithms-Automat...) seems to cover GC algorithms up to the CMS.
"The Art of Multiprocessor Programming" (https://www.amazon.com/Art-Multiprocessor-Programming-Mauric...) is for those bored by JCiP. But it's not about threads per se. I imagine nobody cares about the original green threads and I wouldn't expect quality literature on the subject until project Loom goes to prod.
Try the book Garbage Collection: Algorithms for Automatic Dynamic Memory Management. It won't have the more recent stuff in it, but it'll get you on your way.
A bit of terminology:
Let me give some names so that the explanations are clearer.
A variable is any slot for data, which may contain a pointer and may change over time. This includes global variables, local variables, CPU registers and fields in allocated objects.
In a tricolor incremental or concurrent GC, there are three types of variables:
the true roots, which are always accessible (CPU registers, global variables);
the fast variables, which are scanned in a stop-the-world fashion;
the slow variables, which are handled with the colors. Slow variables are fields in colored objects.
The "true roots" and "fast variables" will be hereafter collectively called roots.
The application threads are called the mutators because they change the contents of variables.
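To make the colors concrete, here is a minimal sketch in Java. The names (Color, GcObject) are illustrative, not from any real VM; production collectors typically pack colors into object-header bits or mark bitmaps rather than a field.

    // Every heap object carries a color; its reference fields are the
    // "slow variables" described above.
    enum Color { WHITE, GREY, BLACK }

    final class GcObject {
        Color color = Color.WHITE;     // new objects start white (outside a sweep)
        final GcObject[] fields;       // slow variables: fields of colored objects

        GcObject(int nFields) { fields = new GcObject[nFields]; }
    }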
With an incremental or concurrent GC, GC pauses occur regularly. The world is stopped (the mutators are paused), and the roots are scanned. This scan reveals a number of references to colored objects, and object colors are adjusted accordingly (white objects referenced from the roots are made grey).
When the GC is incremental, some object scanning activity takes place: some grey objects are scanned (and painted black), greying referenced white objects. This activity (the "marking") is maintained for some time, but not necessarily as long as there are grey objects. At some point, the marking stops and the world is awakened. The GC is called "incremental" because the GC cycle is performed in small increments, interleaved with mutator activity.
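A sketch of one marking increment, building on the GcObject sketch above. The budget cap is what makes it "incremental"; the names are illustrative.

    import java.util.ArrayDeque;
    import java.util.Deque;

    final class IncrementalMarker {
        final Deque<GcObject> greySet = new ArrayDeque<>();

        // One marking increment: blacken up to "budget" grey objects,
        // greying any white objects they reference.
        void markIncrement(int budget) {
            while (budget-- > 0 && !greySet.isEmpty()) {
                GcObject obj = greySet.pop();
                obj.color = Color.BLACK;           // scanned: paint it black
                for (GcObject child : obj.fields) {
                    if (child != null && child.color == Color.WHITE) {
                        child.color = Color.GREY;  // reachable: queue for scanning
                        greySet.push(child);
                    }
                }
            }
            // Budget exhausted (or no grey objects left): the world is
            // awakened and the mutators run until the next increment.
        }
    }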
In a concurrent GC, scanning of grey objects occurs concurrently with mutator activity. The world is then awakened as soon as the roots have been scanned. With a concurrent GC, access barriers are a tad complex to implement because they must handle concurrent access from the GC thread; but at a conceptual level, this is not very different from an incremental GC. A concurrent GC can be viewed as an optimization over incremental GC, which takes advantage of the presence of multiple CPU cores (a concurrent GC has little advantage over an incremental GC when there is only one core).
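One classic way to keep the mutators honest while marking runs is a Dijkstra-style insertion barrier: every pointer store greys a white target before a black object can come to reference it. A sketch, continuing the IncrementalMarker above; a real concurrent GC would need atomic color updates and a thread-safe grey set.

    // Added to the IncrementalMarker sketch above: all mutator pointer
    // stores into object fields go through this barrier.
    void writeBarrier(GcObject holder, int fieldIndex, GcObject newValue) {
        if (newValue != null && newValue.color == Color.WHITE) {
            newValue.color = Color.GREY;   // shade before the store is visible
            greySet.push(newValue);        // (needs atomics when truly concurrent)
        }
        holder.fields[fieldIndex] = newValue;
    }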
Roots need not be protected by an access barrier, since they are scanned with the world stopped. The GC mark phase ends when the following conditions are simultaneously met:
the roots have just been scanned;
all objects are either black or white, but not grey.
Since the roots are scanned only while the world is stopped, this situation can occur only during a pause. At that point, the sweep phase begins, during which white objects are released. The sweep can be done incrementally or concurrently; objects created during the sweep are immediately painted black. When the sweep is finished, a new GC mark phase can take place: objects (which are all black at that point) are all repainted white (this is done atomically by simply changing the way the color bits are interpreted).
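The "repaint by reinterpretation" trick can be sketched with a boolean mark bit per object and a global flag saying which bit value currently means black. The names here are illustrative.

    final class FlipHeap {
        static final class Obj { boolean markBit; }

        final java.util.List<Obj> allObjects = new java.util.ArrayList<>();
        boolean blackValue = true;            // markBit == blackValue means "black"

        boolean isBlack(Obj o) { return o.markBit == blackValue; }

        // Sweep: free the whites, then flip the interpretation of the
        // mark bit so every surviving (black) object is instantly white,
        // without touching any object.
        void sweepAndFlip() {
            allObjects.removeIf(o -> !isBlack(o));
            blackValue = !blackValue;
        }
    }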
Variable Classification:
With that being said, I can now answer your question. With the description above, the question becomes: what are the roots? This is actually up to the implementation; there are several possibilities, with trade-offs.
True roots must always be scanned; true roots are the CPU register contents and the global variables. Note that the stacks are not true roots; only the current stack frame pointer is.
Since fast variables are accessed without barriers, it is customary to make stack frames fast variables (i.e. roots as well). This is because while write accesses are rare system-wide, they are quite common in the local variables. It has been measured (on some Lisp programs) that about 99% of writes (of a pointer value) have a local variable as target.
Fast variables are often extended even further in the case of a generational GC: the "young generation" consists of a special allocation area for new objects, limited in size, and scanned as fast variables. The bright side of fast variables is fast access (hence the name); the downside is that all these fast variables may be scanned only during a pause (with the world stopped). There is a trade-off on the total size of the fast variables, which often translates to a limit on the young generation size. A larger young generation promotes average performance (by reducing the number of access barriers) at the cost of longer pauses.
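The usual mechanism that makes a young generation affordable is a remembered-set write barrier: stores of a young reference into an old object are logged, so a young-generation scan never has to trace the whole old heap. A sketch with illustrative names:

    import java.util.HashSet;
    import java.util.Set;

    final class GenHeap {
        static final class Obj {
            boolean young = true;              // allocated in the young generation
            final Obj[] fields;
            Obj(int n) { fields = new Obj[n]; }
        }

        // Old objects that may hold pointers into the young generation.
        final Set<Obj> rememberedSet = new HashSet<>();

        void write(Obj holder, int i, Obj value) {
            if (!holder.young && value != null && value.young) {
                rememberedSet.add(holder);     // log the old-to-young pointer
            }
            holder.fields[i] = value;
        }
        // A young-gen scan then starts from the true roots, the stacks,
        // and the remembered set, never from the whole old generation.
    }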
At the other extreme, you may have no fast variables at all, and no roots but the true roots. The stack frames are then handled as objects, each with their own color. Pauses are then minimal (a mere snapshot of a dozen registers) but barriers must be used even for access to local variables. This is expensive, but has some advantages:
Hard guarantees on pause times can be made. This is difficult if stack frames are roots, because each new thread has its own stack, so the total root size may grow arbitrarily as new threads are launched. If the only roots are the CPU registers and global variables (no more than a few dozen in typical cases, their number known at compile time), then pauses can be kept very short.
This allows for dynamic allocation of stack frames in the heap. This is needed if you play with co-routines and continuations, as with Scheme's call/cc primitive. In such a case, frames are no longer handled as a pure "stack". Proper handling of continuations in a GC-aware language mostly requires that function frames be allocated dynamically, as in the sketch after this list.
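A sketch of heap-allocated frames (the names Frame and caller are illustrative): each frame is an ordinary traced object, so a captured continuation is just a live reference.

    final class Frame {
        final Frame caller;      // dynamic chain; no contiguous stack required
        final Object[] locals;   // local variables live in the heap

        Frame(Frame caller, int nLocals) {
            this.caller = caller;
            this.locals = new Object[nLocals];
        }
    }
    // Capturing a continuation amounts to keeping a Frame reference alive;
    // a frame dies when the GC finds it unreachable, not when a call returns.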
It is possible to make stack frames non-root while keeping a young generation as root. Guarantees on pause times can still be made (depending on the young generation size, which is fixed) and some trickery can be applied to make sure that stack frames are in the young generation when their function is active. This can ensure barrier-free access to local variables. None of this is really free, but it can be made efficient enough for most purposes.
Another Conceptual View:
Another way to view root-handling is the following: roots are the variables for which the tricolor rule (no black-to-white pointer) is not maintained at all times; these variables are allowed to be mutated without constraint. But they must be brought back in line regularly, by stopping the world and scanning them.
In practice, the mutators are racing with the GC. Mutators create new objects and point to them, so each pause reveals a fresh batch of grey objects. In a concurrent or incremental GC, if you let the mutators play with the roots for too long, each pause may produce a big batch of new grey objects; in the worst case, the GC cannot scan objects fast enough to keep up with the rate of grey-object creation. This is an issue because white objects can be released only during the sweep phase, which is reached only if the GC can, at some point, complete its marking. A usual implementation strategy for an incremental GC is to scan grey objects, during each pause, for a total size proportional to the total size of the roots. Thus, pause time remains bounded by the total root size, and if the proportionality factor is well balanced, it can be guaranteed that the GC will ultimately finish its marking phase and enter the sweeping phase.
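The pacing rule can be sketched by combining the stop-the-world root scan with a marking budget proportional to the root size, reusing the IncrementalMarker above. K is a tuning knob; all names are illustrative.

    final class Pacer {
        static final int K = 4;   // proportionality factor: a tuning knob

        // One GC pause: grey the objects referenced from the roots, then
        // do marking work proportional to the root size, so pause time
        // stays bounded and marking eventually outruns the mutators.
        static void pause(IncrementalMarker marker, java.util.List<GcObject> roots) {
            for (GcObject root : roots) {
                if (root != null && root.color == Color.WHITE) {
                    root.color = Color.GREY;
                    marker.greySet.push(root);
                }
            }
            marker.markIncrement(K * roots.size());
        }
    }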
In a concurrent GC, things are a bit more complex, because the mutators roam freely in the wild. A possible implementation would make a little bit of incremental marking while the world is still stopped.
Bibliography:
Garbage Collection: Algorithms for Automatic Dynamic Memory Management: a must-read book on garbage collection.
The answer is that this depends on the garbage collection algorithm used. In some cases, you are correct that all threads are stopped during GC. In other cases, you are incorrect, in that garbage collection proceeds while normal threads are running. To understand how GCs achieve that, you need a detailed understanding of the theory and terminology of garbage collectors, combined with an understanding of the specific collector. It is simply not amenable to a simple explanation.
Oh yes, and it is worth pointing out that many modern collectors don't have a compaction phase per se. Rather, they work by copying live objects to a new "space" and zeroing the old "space" when they are done.
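The copying idea is classically realized by Cheney's algorithm: a breadth-first evacuation that leaves forwarding pointers behind and never needs a separate compaction pass. A simulation sketch, with object lists standing in for memory spaces and illustrative names:

    import java.util.ArrayList;
    import java.util.List;

    final class CopyingGc {
        static final class Obj {
            Obj[] fields;
            Obj forward;                       // forwarding pointer once evacuated
            Obj(int n) { fields = new Obj[n]; }
        }

        // Evacuate everything reachable from the roots into a fresh
        // to-space; from-space is then abandoned wholesale.
        List<Obj> collect(List<Obj> roots) {
            List<Obj> toSpace = new ArrayList<>();
            for (int i = 0; i < roots.size(); i++)
                if (roots.get(i) != null) roots.set(i, copy(roots.get(i), toSpace));
            for (int scan = 0; scan < toSpace.size(); scan++) {  // Cheney's scan pointer
                Obj o = toSpace.get(scan);
                for (int i = 0; i < o.fields.length; i++)
                    if (o.fields[i] != null) o.fields[i] = copy(o.fields[i], toSpace);
            }
            return toSpace;   // dead objects were simply never copied
        }

        private Obj copy(Obj o, List<Obj> toSpace) {
            if (o.forward != null) return o.forward;             // already moved
            Obj c = new Obj(o.fields.length);
            System.arraycopy(o.fields, 0, c.fields, 0, o.fields.length);
            o.forward = c;
            toSpace.add(c);
            return c;
        }
    }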
The OP wrote: "If I am incorrect, my question would be answered by a simple explanation of the strategy used to minimise blocking."
If you really want to understand how garbage collectors work, I recommend:
... and beware that finding accurate, detailed, public descriptions of the internals of production garbage collectors is not easy. (Though in the case of the HotSpot GCs, you can look at the source code ...)
EDIT: in response to the OP's comment ...
"It seems it is as I thought -- there is no getting around the "stop the world" part."
It depends. In the case of the Java 6 Concurrent Collector, there are two pauses during the marking of the roots (including stacks), and then marking / copying of other objects proceeds in parallel. For other kinds of concurrent collector, read or write barriers are used while the collector is running to trap situations where the collector and application threads would otherwise interfere with each other. I don't have my copy of [Jones] here right now, but I also recall that it is possible to make the "stop the world" interval negligible ... at the cost of more expensive pointer operations and/or not collecting all garbage.
How compilers and runtime environments handle memory management has grown up over a long period of time. The stack vs. heap allocation decision had a lot to do with what could be known at compile time versus what could only be known at runtime. This was before managed runtimes.
In general, the compiler has very good control of what's on the stack; it gets to decide what is cleaned up, and when, based on calling conventions. The heap, on the other hand, was more like the wild west: the compiler did not have good control of when things came and went. By placing function arguments on the stack, the compiler is able to create a scope, and that scope can be controlled over the lifetime of the call. This is a natural place to put value types, because they are easy to control, as opposed to reference types, which can hand out memory locations (pointers) to just about anyone they want.
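A tiny illustration of that split (Java syntax here, but the same story holds for .NET):

    final class StackVsHeap {
        // The int lives in the current frame and vanishes when the call
        // returns; the array lives on the managed heap and survives as
        // long as something references it, which is exactly what the
        // collector has to figure out at runtime.
        static int[] escape() {
            int n = 42;                   // value type: scope is this call
            int[] heapData = new int[8];  // heap: lifetime unknown at compile time
            heapData[0] = n;
            return heapData;              // the reference escapes the frame
        }
    }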
Modern memory management changes a lot of this. The .NET runtime can take control of reference types and the managed heap through complex garbage collection and memory management algorithms. This is also a very, very deep subject.
I recommend you check out some texts on compilers -- I grew up on Aho, so I recommend that. You can also learn a lot about the subject by reading Gosling.
"Garbage Collection: Algorithms for Automatic Dynamic Memory Management" (https://www.amazon.com/Garbage-Collection-Algorithms-Automat...) seems to cover GC algorithms up to the CMS.
"The Art of Multiprocessor Programming" (https://www.amazon.com/Art-Multiprocessor-Programming-Mauric...) is for those bored by JCiP. But it's not about threads per se. I imagine nobody cares about the original green threads and I wouldn't expect quality literature on the subject until project Loom goes to prod.
Try the book Garbage Collection: Algorithms for Automatic Dynamic Memory Management. It wont have the more recent stuff in it, but it'll get you on your way.
A bit of terminology:
Let me give some names so that explanations are clearer.
A variable is any slot for data, which may contain a pointer and may change over time. This includes global variables, local variables, CPU registers and fields in allocated objects.
In a tricolor incremental or concurrent GC, there are three types of variables:
The "true roots" and "fast variables" will be hereafter collectively called roots.
The application threads are called the mutators because they change the contents of variables.
With an incremental or concurrent GC, GC pauses occur regularly. The world is stopped (mutators are paused), and the roots are scanned. This scan reveals a number of references to colored objects. Object colors are adjusted accordingly (such white objects are made grey).
When the GC is incremental, some object scanning activity takes place: some grey objects are scanned (and painted black), greying referenced white objects. This activity (the "marking") is maintained for some time, but not necessarily as long as there are grey objects. At some point, the marking stops and the world is awakened. The GC is called "incremental" because the GC cycle is performed in small increments, interleaved with mutator activity.
In a concurrent GC, scanning of grey objects occurs concurrently with mutator activity. The world is then awakened as soon as the roots have been scanned. With a concurrent GC, access barriers are a tad complex to implement because they must handle concurrent access from the GC thread; but at a conceptual level, this is not very different from an incremental GC. A concurrent GC can be viewed as an optimization over incremental GC, which takes advantage of the presence of multiple CPU cores (a concurrent GC has little advantage over an incremental GC when there is only one core).
Roots need not be protected by an access barrier, since they are scanned with the world stopped. The GC mark phase ends when the following conditions are simultaneously met:
so this situation can occur only during a pause. At that point, the sweep phase begins, during which white objects are released. The sweep can be done incrementally or concurrently; objects created during the sweep are immediately painted black. When the sweep is finished, a new GC mark phase can take place: objects (which are all black at that point) are all repainted white (this is done atomically by simply changing the way color bits are interpreted).
Variable Classification:
With that being said, I can now answer to your question. With the description above, the question becomes: what are the roots ? This is actually up to the implementation; there are several possibilities, and trade-offs.
True roots must always be scanned; true roots are the CPU register contents and the global variables. Note that the stacks are not true roots; only the current stack frame pointer is.
Since fast variables are accessed without barriers, it is customary to make stack frames fast variables (i.e. roots as well). This is because while write accesses are rare system-wide, they are quite common in the local variables. It has been measured (on some Lisp programs) that about 99% of writes (of a pointer value) have a local variable as target.
Fast variables are often extended even further, in the case of a generational GC: the "young generation" consists in a special allocation area for new objects, limited in length, and scanned as fast variables. The bright side of fast variables is fast access (hence the name); the downside is that all these fast variables may be scanned only during a pause (the world is stopped). There is a trade-off on the size of the fast variables, which often translates to a limit on the young generation size. A larger young generation promotes average performance (by reducing the number of access barriers) at the cost of longer pauses.
At the other extreme, you may have no fast variable at all, and no root but the true roots. The stack frames are then handled as objects, each with their own color. Pauses are then minimal (a mere snapshot of a dozen register) but barriers must be used even for access to local variables. This is expensive, but has some advantages:
call/cc
primitive. In such a case, frames are no longer handled as a pure "stack". Proper handling of continuations in a GC-aware language mostly requires that function frames be allocated dynamically.It is possible to make stack frames non-root while keeping a young generation as root. Guarantees on pause times can still be made (depending on the young generation size, which is fixed) and some trickery can be applied to make sure that stack frames are in the young generation when their function is active. This can ensure barrier-free access to local variables. None of this is really free, but it can be made efficient enough for most purposes.
Another Conceptual View:
Another way to view root-handling is the following: roots are the variables for which the tricolor rule (no black-to-white pointer) is not maintained at all times; these variables are allowed to be mutated without constraint. But they must be brought back in line regularly, by stopping the world and scanning them.
In practice, the mutators are racing with the GC. Mutators create new objects, and point to them; each pause creates new grey objects. In a concurrent or incremental GC, if you let the mutators play with roots for too long, then each pause may create a big batch of new grey objects. In the worst case, the GC cannot scan objects fast enough to keep up with the rate of grey object creation. This is an issue because white objects can be released only during the sweep phase, which is reached only if at some point the GC may complete its marking. A usual implementation strategy for an incremental GC is to scan grey objects, during each pause, for a total size which is proportional to the total size of roots. Thus, pause time remains bounded by the roots total size, and if the proportionality factor is well balanced then it can be guaranteed that the GC will ultimately terminate is marking phase and enter the sweeping phase.
In a concurrent GC, things are a bit more complex, because the mutators roam freely in the wild. A possible implementation would make a little bit of incremental marking while the world is still stopped.
Bibliography:
Garbage Collection: Algorithms for Automatic Dynamic Memory Management: a must-read book on garbage collection.
The answer is that this depends on the garbage collection algorithms used. In some cases, you are correct that all threads are stopped during GC. In other cases, you are incorrect in that garbage collection proceeds while normal threads are running. To understand how GC's achieve that, you need a detailed understanding of the theory and terminology of garbage collectors, combined with an understanding of the specific collector. It is simply not amenable to a simple explanation.
Oh yes, and it is worth pointing out that many modern collectors don't have a compaction phase per-se. Rather they work by copying live objects to a new "space" and zeroing the old "space" when they are done.
If you really want to understand how garbage collectors work, I recommend:
... and beware that finding accurate, detailed, public descriptions of the internals of production garbage collectors is not easy. (Though in the case of the Hotspot GC's, you can look at the source code ...)
EDIT: in response to the OP's comment ...
It depends. In the case of the Java 6 Concurrent Collector, there are two pauses during the marking of the roots (including stacks), and then marking / copying of other objects proceeds in parallel. For other kinds of concurrent collector, read or write barriers are used while the collector is running to trap situations where the collector and application threads would otherwise interfere with each other. I don't have my copy of [Jones] here right now, but I also recall that it is possible to make the "stop the world" interval negligible ... at the cost of more expensive pointer operations and/or not collecting all garbage.
Only a very naive implementation would have a problem with circular references. Wikipedia has a good article on the different GC algorithms. If you really want to learn more, try Garbage Collection: Algorithms for Automatic Dynamic Memory Management (Amazon). Java has had a good garbage collector since 1.2 and an exceptionally good one in 1.5 and Java 6.
The hard part of improving GC is reducing pauses and overhead, not basic things like circular references.
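For the curious, here is the cycle problem in miniature: with naive reference counting, the mutual pointers keep both counts above zero forever, while a tracing collector reclaims the pair because nothing reaches it from the roots. A sketch only, not how the JVM works internally:

    final class RcNode {
        int refCount;
        RcNode next;
    }

    final class CycleDemo {
        public static void main(String[] args) {
            RcNode a = new RcNode(), b = new RcNode();
            a.refCount++; b.refCount++;   // one reference each from the locals
            a.next = b; b.refCount++;     // a -> b
            b.next = a; a.refCount++;     // b -> a: a cycle; both counts now 2
            a = null; b = null;           // dropping the locals would decrement each
                                          // count from 2 to 1, never to 0, so a naive
                                          // refcounter never frees the pair. A tracing
                                          // collector starts from the (now empty) roots
                                          // and reclaims both objects, cycle or not.
        }
    }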