Understanding Software Dynamics (Addison-Wesley Professional Computing Series) 1st Edition
by Richard L. Sites (https://www.amazon.com/Understanding-Software-Addison-Wesley...)
This is a recent book and I have not completed reading it but I know already is it destined to become a classic. If you are serious about learning how to understand and improve code, this is the book for you. (Full disclosure: Dick is a long-standing friend and one of the brightest and insightful guys around.)
Writing Efficient Programs (Prentice-Hall Software Series) 1st Edition by Jon Louis Bentley (https://www.amazon.com/Writing-Efficient-Programs-Prentice-H...)
This book is out of print but I believe it's a better read that Jon's Programming Pearls books. C is a great (and permissive) programming language. Jon shows how the language can be exploited for gain performance. Evil but effective.
Hacker's Delight 2nd Edition by Henry Warren (https://www.amazon.com/Hackers-Delight-2nd-Henry-Warren/dp/0...)
The Hacker's Delight is exactly what Hank Warren intended--a collection of tricks and facts that programs can exploit. It's informative on many levels. It uses deep knowledge of how numbers (and other mathematical objects) work to compute useful information. And it shows how seemingly arcane information can be useful. For a programmer excited by the fabric of programs, it is a continual delight.
The first edition has a clarity and compactness that I find appealing. The second edition has additional material. When you get through all of the Delights you can read Knuth Volume 4 which embeds more arcane and useful knowledge.
This book will tell you how to efficiently generate condition flag values such as carry and overflow. It's easy when when you have extra bits available (such as emulating an 8 or 16 bit processor on a 32-bit processor), but less obvious when you are emulating a 32-bit processor on a 32-bit processor from a high level language.
For some of the "clever tricks" at higher levels, Hacker's Delight is a great treasure trove - it's not x86-64 specific, as a lot of the tricks are applicable to many CPUs, but it helps understand some of the seemingly bizarre tricks.
It didn't "pop-up" in a genius' head. Right shifting binary numbers would divide a number by 2 and left shifting the numbers would multiply it by 2. This is because 10 is 2 in binary. Multiplying a number by 10(be it binary or decimal or hexadecimal) appends a 0 to the number(which is effectively left shifting). Similarly, dividing by 10(or 2) removes a binary digit from the number(effectively right shifting). This is how the logic really works.
There are plenty of such bit-twiddlery(a word I invented a minute ago) in computer world.
http://graphics.stanford.edu/~seander/bithacks.html Here is for the starters.
This is my favorite book: http://www.amazon.com/Hackers-Delight-Edition-Henry-Warren/dp/0321842685/ref=dp_ob_image_bk on bit-twiddlery.
If bit manipulation is something you enjoy, there is an excellent book called Hacker's Delight (2e) by Henry Warren that you will find to be delightful. https://www.amazon.com/dp/0321842685
You can just use any of the standard bit twiddling hacks to reverse bytes and then shuffle the bytes (see e.g. [Hacker's Delight](https://www.amazon.com/Hackers-Delight-2nd-Henry-Warren/dp/0321842685) and [this list](http://graphics.stanford.edu/%7Eseander/bithacks.html#BitReverseObvious)).
The Hacker's Delight website only contains the source code from the book for the 8x8 and 32x32 cases, but the latter generalizes trivially to your 64x64 case:
#include <stdint.h>
void
transpose64(uint64_t a[64]) {
int j, k;
uint64_t m, t;
for (j = 32, m = 0x00000000FFFFFFFF; j; j >>= 1, m ^= m << j) {
for (k = 0; k < 64; k = ((k | j) + 1) & ~j) {
t = (a[k] ^ (a[k | j] >> j)) & m;
a[k] ^= t;
a[k | j] ^= (t << j);
}
}
}
The way that this works is that the function swaps successively smaller blocks of bits, starting with 32x32 blocks (without transposing the bit within those blocks), after that within those 32x32 blocks it swaps the appropriate 16x16 blocks, etc. The variable that holds the block size is j. Therefore, the outer loop has j succcessively take the values 32, 16, 8, 4, 2 and 1, which means that the outer loop runs six times. The inner loop runs over half the lines of your of bits, the lines where a given bit in the variable k is equal to zero. When j is 32 those are the lines 0-31, when j is 16 those are the lines 0-15 and 32-47, etc. Together the inner part of the loop runs 6*32 = 192 times. What happens inside this inner part is that the mask m determines what are the bits that should be swapped, in t the xor or those bits are calculated, and that xor-ed lists of bits is used to update the bits in both places appropriately.
The book (and the website) also has a version of this code in which these loops have both been unrolled, and where the mask m is not calculated, but just assigned. I guess it depends on things like the number of registers and the size of your instruction cache whether that is an improvement?
To test that this works, suppose we define some bit pattern, say:
This is actually not really what you asked for, as you asked for a non-destructive version of this code. You can get this by having the first swap of the 32x32 blocks go from x to y. For instance, you might do something like:
void
non_destructive_transpose64(uint64_t x[64], uint64_t y[64]) {
int j, k;
uint64_t m, t;
for (k = 0; k < 64; k += 2) {
((uint32_t *) y)[k] = ((uint32_t *) x)[k ^ 64 + 1];
((uint32_t *) y)[k + 1] = ((uint32_t *) x)[k + 1];
}
for (; k < 128; k += 2) {
((uint32_t *) y)[k] = ((uint32_t *) x)[k];
((uint32_t *) y)[k + 1] = ((uint32_t *) x)[k ^ 64];
}
for (j = 16, m = 0x0000FFFF0000FFFF; j; j >>= 1, m ^= m << j) {
for (k = 0; k < 64; k = ((k | j) + 1) & ~j) {
t = (y[k] ^ (y[k | j] >> j)) & m;
y[k] ^= t;
y[k | j] ^= (t << j);
}
}
}
Unlike the other version of the code this does not work regardless of endianness of the architecture. Also, I know that the C standard does not allow you to access an array of uint64_t as an array of uint32_t. However, I like it that no shifts or xors are needed for the first iteration of the move-the-blocks-around loop when you do it like this.
This is a recent book and I have not completed reading it but I know already is it destined to become a classic. If you are serious about learning how to understand and improve code, this is the book for you. (Full disclosure: Dick is a long-standing friend and one of the brightest and insightful guys around.)
Writing Efficient Programs (Prentice-Hall Software Series) 1st Edition by Jon Louis Bentley (https://www.amazon.com/Writing-Efficient-Programs-Prentice-H...)
This book is out of print but I believe it's a better read that Jon's Programming Pearls books. C is a great (and permissive) programming language. Jon shows how the language can be exploited for gain performance. Evil but effective.
Hacker's Delight 2nd Edition by Henry Warren (https://www.amazon.com/Hackers-Delight-2nd-Henry-Warren/dp/0...)
The Hacker's Delight is exactly what Hank Warren intended--a collection of tricks and facts that programs can exploit. It's informative on many levels. It uses deep knowledge of how numbers (and other mathematical objects) work to compute useful information. And it shows how seemingly arcane information can be useful. For a programmer excited by the fabric of programs, it is a continual delight.
The first edition has a clarity and compactness that I find appealing. The second edition has additional material. When you get through all of the Delights you can read Knuth Volume 4 which embeds more arcane and useful knowledge.
https://www.amazon.com/Hackers-Delight-2nd-Henry-Warren/dp/0...
This book will tell you how to efficiently generate condition flag values such as carry and overflow. It's easy when when you have extra bits available (such as emulating an 8 or 16 bit processor on a 32-bit processor), but less obvious when you are emulating a 32-bit processor on a 32-bit processor from a high level language.
https://www.amazon.com/Hackers-Delight-2nd-Henry-Warren/dp/0...
It didn't "pop-up" in a genius' head. Right shifting binary numbers would divide a number by 2 and left shifting the numbers would multiply it by 2. This is because
10
is 2 in binary. Multiplying a number by10
(be it binary or decimal or hexadecimal) appends a0
to the number(which is effectively left shifting). Similarly, dividing by10
(or 2) removes a binary digit from the number(effectively right shifting). This is how the logic really works.There are plenty of such
bit-twiddlery
(a word I invented a minute ago) in computer world.http://graphics.stanford.edu/~seander/bithacks.html Here is for the starters.
This is my favorite book: http://www.amazon.com/Hackers-Delight-Edition-Henry-Warren/dp/0321842685/ref=dp_ob_image_bk on bit-twiddlery.
Author's website: https://www.amazon.com/Hackers-Delight-2nd-Henry-Warren/dp/0...
This is mostly to address another answer here, but here's a faster way to get the bit count:
And the tests:
I'm sure there's a clever way to remove the repetition, but that can also be left as an exercise to the reader.
Edit: I copied and modified the code from hacker's delight.
This seems a generalization of the question Bitwise transpose of 8 bytes. That question was just about 8x8 transposition, so what you are asking is a bit different. But your question is answered just as well in section 7.3 of the book Hacker's Delight (you might be able to see the relevant pages on Google books). The code that is presented there apparently originates with Guy Steele.
The Hacker's Delight website only contains the source code from the book for the 8x8 and 32x32 cases, but the latter generalizes trivially to your 64x64 case:
The way that this works is that the function swaps successively smaller blocks of bits, starting with 32x32 blocks (without transposing the bit within those blocks), after that within those 32x32 blocks it swaps the appropriate 16x16 blocks, etc. The variable that holds the block size is
j
. Therefore, the outer loop hasj
succcessively take the values 32, 16, 8, 4, 2 and 1, which means that the outer loop runs six times. The inner loop runs over half the lines of your of bits, the lines where a given bit in the variablek
is equal to zero. Whenj
is 32 those are the lines 0-31, whenj
is 16 those are the lines 0-15 and 32-47, etc. Together the inner part of the loop runs 6*32 = 192 times. What happens inside this inner part is that the maskm
determines what are the bits that should be swapped, int
the xor or those bits are calculated, and that xor-ed lists of bits is used to update the bits in both places appropriately.The book (and the website) also has a version of this code in which these loops have both been unrolled, and where the mask
m
is not calculated, but just assigned. I guess it depends on things like the number of registers and the size of your instruction cache whether that is an improvement?To test that this works, suppose we define some bit pattern, say:
We then call the
transpose32
function and print the resulting bit pattern:And this then gives as output:
Which is nicely flipped, as we hoped for.
Edit:
This is actually not really what you asked for, as you asked for a non-destructive version of this code. You can get this by having the first swap of the 32x32 blocks go from
x
toy
. For instance, you might do something like:Unlike the other version of the code this does not work regardless of endianness of the architecture. Also, I know that the C standard does not allow you to access an array of
uint64_t
as an array ofuint32_t
. However, I like it that no shifts or xors are needed for the first iteration of the move-the-blocks-around loop when you do it like this.EDIT: .. and a good coffeemaker. :)
http://www.amazon.com/Hackers-Delight-2nd-Henry-Warren/dp/03...