I became interested in compilers while in graduate school in the 70's and I spent many many hours studying Aho and Ulman's two volumes work on compilers that preceded their Dragon book [1]. The first volume was all about parsing, mostly LR (bottom-up) parsing and it's variations. These books resembled math books more than CS books. A few years before, Knuth had invented LR parsing [2] and I think that CS departments were still enthralled by the fascinating formal theory discovered around parsing. Aho and Ulman's Dragon Books on compiling are much more balanced.
I was fascinated by the formal methods that could be used to specify a language's grammar and then the automated generation of a parser from the grammar. I even went so far as to write a LR parser generator in Fortran IV back then.
Once I got into industry and was working in a group doing real-world compiler development I realized that there is a lot more than just lexical scanning and parsing going on in a compiler, it's tool chain, and runtime.
Teaching compilers backwards sounds like a really good approach for students learning compilers.
On a slightly broader but related topic, many programmers have never been exposed to assembly language, parameter passing mechanisms or how a program turns into a process. Anyone interested in system level programming in the real-world could benefit from Bryant and O'Hallaron's Computer Systems: A Programmer's Perspective [3]. This is not an easy book, but it is excellent and suitable for undergraduate CS students in their 3rd or 4th year.
[1] Aho and Ulman, Compiling (Theory of Parsing, Translation and Compiling), Vol 1 (1972) & Vol 2 (1973), Prentice Hall.
The reason why performance improves drastically when the data is sorted is that the branch prediction penalty is removed, as explained beautifully in Mysticial's answer.
Now, if we look at the code
if (data[c] >= 128)
sum += data[c];
we can find that the meaning of this particular if... else... branch is to add something when a condition is satisfied. This type of branch can be easily transformed into a conditional move statement, which would be compiled into a conditional move instruction: cmovl, in an x86 system. The branch and thus the potential branch prediction penalty is removed.
In C, thus C++, the statement, which would compile directly (without any optimization) into the conditional move instruction in x86, is the ternary operator ... ? ... : .... So we rewrite the above statement into an equivalent one:
sum += data[c] >=128 ? data[c] : 0;
While maintaining readability, we can check the speedup factor.
On an Intel Core i7-2600K @ 3.4 GHz and Visual Studio 2010 Release Mode, the benchmark is (format copied from Mysticial):
The result is robust in multiple tests. We get a great speedup when the branch result is unpredictable, but we suffer a little bit when it is predictable. In fact, when using a conditional move, the performance is the same regardless of the data pattern.
Now let's look more closely by investigating the x86 assembly they generate. For simplicity, we use two functions max1 and max2.
max1 uses the conditional branch if... else ...:
int max1(int a, int b) {
if (a > b)
return a;
else
return b;
}
max2 uses the ternary operator ... ? ... : ...:
int max2(int a, int b) {
return a > b ? a : b;
}
On a x86-64 machine, GCC -S generates the assembly below.
max2 uses much less code due to the usage of instruction cmovge. But the real gain is that max2 does not involve branch jumps, jmp, which would have a significant performance penalty if the predicted result is not right.
So why does a conditional move perform better?
In a typical x86 processor, the execution of an instruction is divided into several stages. Roughly, we have different hardware to deal with different stages. So we do not have to wait for one instruction to finish to start a new one. This is called pipelining.
In a branch case, the following instruction is determined by the preceding one, so we cannot do pipelining. We have to either wait or predict.
In a conditional move case, the execution conditional move instruction is divided into several stages, but the earlier stages like Fetch and Decode does not depend on the result of the previous instruction; only latter stages need the result. Thus, we wait a fraction of one instruction's execution time. This is why the conditional move version is slower than the branch when prediction is easy.
The book Computer Systems: A Programmer's Perspective, second edition explains this in detail. You can check Section 3.6.6 for Conditional Move Instructions, entire Chapter 4 for Processor Architecture, and Section 5.11.2 for a special treatment for Branch Prediction and Misprediction Penalties.
Sometimes, some modern compilers can optimize our code to assembly with better performance, sometimes some compilers can't (the code in question is using Visual Studio's native compiler). Knowing the performance difference between branch and conditional move when unpredictable can help us write code with better performance when the scenario gets so complex that the compiler can not optimize them automatically.
Along with the usual classics, I highly recommend Computer Systems: A Programmer's Perspective by Randal E. Bryant and David R. O'Hallaron of Carnegie Mellon. The authors wrote it after teaching a class on the subject. It's extremely readable and gives you an excellent introduction of machine level code, processor architecture and memory as well as a solid foundation of higher level concepts including networking and concurrency. If you're considering programming as a career, I'd say this book (or something similar, probably spread across multiple books) is a must-read. It's used by CMU, Stanford, Caltech, UIUC, Harvard and dozens of other schools.
OK, if you don't have any real experience in low-level embedded coding (relevant to device drivers), RTOS or OS design in general, file systems, data structures, algorithms, interfaces, etc. And, if you have "hobby level" experience with Assembler, C and C++. And, if your intent is to write a desktop OS, from the ground up, without making use of existing technologies, drivers, file systems, memory management, POSIX, etc. Here's a list of books that could be considered required reading before you can really start to write specifications and code. Pick twenty of these and that might be a good start.
Of course, the equivalent knowledge can be obtained by trial-and-error, which would take longer and might result in costly errors and imperfect design. The greater danger here is that a sole developer, without the feedback and interaction of even a small group of capable and experienced programmers could simply burn a lot of time repeating the mistakes made by those who have already trenched that territory.
If the goal is to write a small RTOS on a small but nicely-featured microcontroller, then the C books and the uC/OS book might be a good shove in the right direction. Things start getting complicated if you need to write such things as a full USB stack, PCIe subsystem, graphics drivers, etc.
I became interested in compilers while in graduate school in the 70's and I spent many many hours studying Aho and Ulman's two volumes work on compilers that preceded their Dragon book [1]. The first volume was all about parsing, mostly LR (bottom-up) parsing and it's variations. These books resembled math books more than CS books. A few years before, Knuth had invented LR parsing [2] and I think that CS departments were still enthralled by the fascinating formal theory discovered around parsing. Aho and Ulman's Dragon Books on compiling are much more balanced.
I was fascinated by the formal methods that could be used to specify a language's grammar and then the automated generation of a parser from the grammar. I even went so far as to write a LR parser generator in Fortran IV back then.
Once I got into industry and was working in a group doing real-world compiler development I realized that there is a lot more than just lexical scanning and parsing going on in a compiler, it's tool chain, and runtime.
Teaching compilers backwards sounds like a really good approach for students learning compilers.
On a slightly broader but related topic, many programmers have never been exposed to assembly language, parameter passing mechanisms or how a program turns into a process. Anyone interested in system level programming in the real-world could benefit from Bryant and O'Hallaron's Computer Systems: A Programmer's Perspective [3]. This is not an easy book, but it is excellent and suitable for undergraduate CS students in their 3rd or 4th year.
[1] Aho and Ulman, Compiling (Theory of Parsing, Translation and Compiling), Vol 1 (1972) & Vol 2 (1973), Prentice Hall.
[2] https://www.amazon.com/Computer-Systems-Programmers-Perspect...
The reason why performance improves drastically when the data is sorted is that the branch prediction penalty is removed, as explained beautifully in Mysticial's answer.
Now, if we look at the code
we can find that the meaning of this particular
if... else...
branch is to add something when a condition is satisfied. This type of branch can be easily transformed into a conditional move statement, which would be compiled into a conditional move instruction:cmovl
, in anx86
system. The branch and thus the potential branch prediction penalty is removed.In
C
, thusC++
, the statement, which would compile directly (without any optimization) into the conditional move instruction inx86
, is the ternary operator... ? ... : ...
. So we rewrite the above statement into an equivalent one:While maintaining readability, we can check the speedup factor.
On an Intel Core i7-2600K @ 3.4 GHz and Visual Studio 2010 Release Mode, the benchmark is (format copied from Mysticial):
x86
x64
The result is robust in multiple tests. We get a great speedup when the branch result is unpredictable, but we suffer a little bit when it is predictable. In fact, when using a conditional move, the performance is the same regardless of the data pattern.
Now let's look more closely by investigating the
x86
assembly they generate. For simplicity, we use two functionsmax1
andmax2
.max1
uses the conditional branchif... else ...
:max2
uses the ternary operator... ? ... : ...
:On a x86-64 machine,
GCC -S
generates the assembly below.max2
uses much less code due to the usage of instructioncmovge
. But the real gain is thatmax2
does not involve branch jumps,jmp
, which would have a significant performance penalty if the predicted result is not right.So why does a conditional move perform better?
In a typical
x86
processor, the execution of an instruction is divided into several stages. Roughly, we have different hardware to deal with different stages. So we do not have to wait for one instruction to finish to start a new one. This is called pipelining.In a branch case, the following instruction is determined by the preceding one, so we cannot do pipelining. We have to either wait or predict.
In a conditional move case, the execution conditional move instruction is divided into several stages, but the earlier stages like
Fetch
andDecode
does not depend on the result of the previous instruction; only latter stages need the result. Thus, we wait a fraction of one instruction's execution time. This is why the conditional move version is slower than the branch when prediction is easy.The book Computer Systems: A Programmer's Perspective, second edition explains this in detail. You can check Section 3.6.6 for Conditional Move Instructions, entire Chapter 4 for Processor Architecture, and Section 5.11.2 for a special treatment for Branch Prediction and Misprediction Penalties.
Sometimes, some modern compilers can optimize our code to assembly with better performance, sometimes some compilers can't (the code in question is using Visual Studio's native compiler). Knowing the performance difference between branch and conditional move when unpredictable can help us write code with better performance when the scenario gets so complex that the compiler can not optimize them automatically.
is superior to the widely mandated
for software engineers and programmers. The former has less hardcoded numbers than the latter and more timeless principles.http://csapp.cs.cmu.edu/ http://www.amazon.com/Computer-Systems-Programmers-Perspecti...
In no particular order:
1- http://www.amazon.com/C-Programming-Language-2nd-Edition/dp/...
2- http://www.amazon.com/The-Answer-Book-Solutions-Programming/...
3- http://www.amazon.com/The-Standard-Library-P-J-Plauger/dp/01...
4- http://www.amazon.com/C-Traps-Pitfalls-Andrew-Koenig/dp/0201...
5- http://www.amazon.com/Expert-Programming-Peter-van-Linden/dp...
6- http://www.amazon.com/Data-Structures-In-Noel-Kalicharan/dp/...
7- http://www.amazon.com/Data-Structures-Using-Aaron-Tenenbaum/...
8- http://www.amazon.com/Mastering-Algorithms-C-Kyle-Loudon/dp/...
9- http://www.amazon.com/Code-Complete-Practical-Handbook-Const...
10- http://www.amazon.com/Design-Patterns-Elements-Reusable-Obje...
11- http://www.amazon.com/The-Mythical-Man-Month-Engineering-Ann...
12- http://www.amazon.com/The-Programming-Language-4th-Edition/d...
13- http://www.amazon.com/The-Standard-Library-Tutorial-Referenc...
14- http://www.amazon.com/API-Design-C-Martin-Reddy/dp/012385003...
15- http://www.amazon.com/The-Linux-Programming-Interface-Handbo...
16- http://www.amazon.com/Computer-Systems-Programmers-Perspecti...
17- http://www.amazon.com/System-Programming-Unix-Adam-Hoover/dp...
18- http://www.amazon.com/Memory-Programming-Concept-Frantisek-F...
19- http://www.amazon.com/Memory-Management-Implementations-Prog...
20- http://www.amazon.com/UNIX-Filesystems-Evolution-Design-Impl...
21- http://www.amazon.com/PCI-System-Architecture-4th-Edition/dp...
22- http://www.amazon.com/Universal-Serial-System-Architecture-E...
23- http://www.amazon.com/Introduction-PCI-Express-Hardware-Deve...
24- http://www.amazon.com/Serial-Storage-Architecture-Applicatio...
25- http://www.amazon.com/SATA-Storage-Technology-Serial-ATA/dp/...
26- http://www.amazon.com/Beyond-BIOS-Developing-Extensible-Inte...
27- http://www.amazon.com/Professional-Assembly-Language-Program...
28- http://www.amazon.com/Linux-Kernel-Development-3rd-Edition/d...
29- http://www.amazon.com/Version-Control-Git-collaborative-deve...
30- http://www.amazon.com/Embedded-Software-Primer-David-Simon/d...
31- http://www.amazon.com/Programming-Embedded-Systems-C/dp/1565...
32- http://www.amazon.com/Making-Embedded-Systems-Patterns-Softw...
33- http://www.amazon.com/Operating-System-Concepts-Abraham-Silb...
34- http://www.amazon.com/Performance-Preemptive-Multitasking-Mi...
35- http://www.amazon.com/Design-Operating-System-Prentice-Hall-...
36- http://www.amazon.com/Unix-Network-Programming-Sockets-Netwo...
37- http://www.amazon.com/TCP-Illustrated-Volume-Addison-Wesley-...
38- http://www.amazon.com/TCP-IP-Illustrated-Vol-Implementation/...
39- http://www.amazon.com/TCP-Illustrated-Vol-Transactions-Proto...
40- http://www.amazon.com/User-Interface-Design-Programmers-Spol...
41- http://www.amazon.com/Designing-Interfaces-Jenifer-Tidwell/d...
42- http://www.amazon.com/Designing-Interfaces-Jenifer-Tidwell/d...
43- http://www.amazon.com/Programming-POSIX-Threads-David-Butenh...
44- http://www.intel.com/p/en_US/embedded/hwsw/software/hd-gma#d...
45- http://www.intel.com/content/www/us/en/processors/architectu...
46- http://www.intel.com/p/en_US/embedded/hwsw/hardware/core-b75...
47- http://www.hdmi.org/index.aspx
48- http://en.wikipedia.org/wiki/Digital_Visual_Interface
49- http://www.amazon.com/Essential-Device-Drivers-Sreekrishnan-...
50- http://www.amazon.com/Making-Embedded-Systems-Patterns-Softw...
51- http://www.amazon.com/Python-Programming-Introduction-Comput...
52- http://www.amazon.com/Practical-System-Design-Dominic-Giampa...
53- http://www.amazon.com/File-Systems-Structures-Thomas-Harbron...
54- ...well, I'll stop here.
Of course, the equivalent knowledge can be obtained by trial-and-error, which would take longer and might result in costly errors and imperfect design. The greater danger here is that a sole developer, without the feedback and interaction of even a small group of capable and experienced programmers could simply burn a lot of time repeating the mistakes made by those who have already trenched that territory.
If the goal is to write a small RTOS on a small but nicely-featured microcontroller, then the C books and the uC/OS book might be a good shove in the right direction. Things start getting complicated if you need to write such things as a full USB stack, PCIe subsystem, graphics drivers, etc.