Zen of Assembly Language: Knowledge (Scott Foresman Assembly Language Programming Series)

Author: Michael Abrash


by anonymous   2019-07-21

Everything can take a different number of clocks on different CPUs. Timing is very CPU specific. You also need to account for wait states, flash and RAM, etc.

A speedy ARM microcontroller eval board with a floating point unit and I and D caches can be had for $20: search for stm32f4 discovery (not to be confused with the stm32 value line discovery nor any of the stm8 boards).

I have examples of how to use the chip/board at http://github.com/dwelch67/stm32f4d

Many AVR based Arduinos are available (http://sparkfun.com) for $30 to $50+ once you add in the serial interface. There is a LilyPad kit and a Pro Mini kit that have everything plus the USB to serial.

Be aware that AVRs are going to be significantly slower than ARMs from a performance perspective. If you want to avoid getting into embedded programming, the Arduino sandbox removes a lot if not all of the embedded magic, leaving you with APIs to call as with an operating system. ST has libraries as well, but there is more work needed on your end. It is hard to compete with the AVR, specifically the Arduino family, when it comes to making your life easy (at the cost of some performance and other resources).

The mbed (http://mbed.org) is perhaps the better of the approaches to competing with the Arduino sandbox while getting ARM performance. The Maple (http://leaflabs.com/devices/maple/) attempts, if not succeeds (I wouldn't know), to be a drop in replacement for the AVR based Arduino with ARM performance.

Why mention all of these boards? Because you should buy a couple, try your algorithm, and find out what you really can and can't do from a performance perspective. Many processors these days do not publish a table of clocks per instruction because the newer processors tend to execute in one clock for every instruction. You still have to know how to count cycles. I HIGHLY recommend Michael Abrash's book Zen of Assembly Language; used copies are affordable.

It does focus on the 8088/86, whose performance problems were fixed/moved in the next processors, whose performance problems were in turn fixed/moved, and now we are mostly in a situation where the processor is not the problem, the I/O is. Microcontrollers are similar to the old days: for size and cost reasons the processor is still a bottleneck, esp. if you program in C or something other than assembler. The Zen book will get you into the mindset of understanding the question you are asking.

The instruction set reference may only speak of one clock per instruction execution, but remember there is a clock or more needed to fetch that instruction, and if that instruction performs a memory cycle there is a clock or more for that memory cycle, etc. How many of those cycles there are and how they can be optimized is not only CPU/processor specific but, with microcontrollers, chip specific: one chip may perform quite differently than another from the same company with the same core processor.

Back to referencing Michael Abrash: basically, no matter how much you think you know or have figured out about how the hardware works, you still need to run and time your code (ACCURATELY! You can make many mistakes trying to time the code and form conclusions based on bad testing/timing).
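The fetch/execute/memory accounting described above can be sketched as simple arithmetic. This is only a rough model: the clock counts and the two "chips" are made up for illustration and do not come from any datasheet, and it assumes nothing overlaps (no pipelining, no prefetch), so it gives an upper bound rather than a prediction.

```python
from dataclasses import dataclass


@dataclass
class MemoryTiming:
    """Hypothetical per-access costs in CPU clocks (not from any datasheet)."""
    fetch_clocks: int  # clocks to fetch one instruction, e.g. flash with wait states
    data_clocks: int   # clocks for one data read or write


def instruction_clocks(execute_clocks, data_accesses, mem):
    """Clocks for one instruction with nothing overlapped: fetch + execute + data cycles."""
    return mem.fetch_clocks + execute_clocks + data_accesses * mem.data_clocks


# Two made-up chips with the same core but different memory systems.
fast_chip = MemoryTiming(fetch_clocks=1, data_clocks=1)  # zero wait states
slow_chip = MemoryTiming(fetch_clocks=4, data_clocks=2)  # flash wait states, slower RAM

# A load that the reference manual lists as "1 clock" and that does one data read.
load_fast = instruction_clocks(1, 1, fast_chip)
load_slow = instruction_clocks(1, 1, slow_chip)
print(load_fast, load_slow)  # 3 7
```

The same "one clock" instruction costs more than twice as much on the second chip in this model, which is why the numbers have to be measured on the actual part.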

by anonymous   2019-07-21

The difference between the two addressing modes is the source of the address. For direct addressing mode the address of the item to be accessed is an immediate encoded in the instruction, so the instruction is larger, in some cases much larger, and it requires more clock cycles to fetch. Ideally those bytes are already in the cache, as they immediately follow the opcode and fetching the opcode normally causes at least the cache line containing it to be fetched. With anything but the oldest x86 platforms I don't see how you would get to executing the instruction without the rest of the instruction and the next few/many instructions already fetched and in the pipe. Even old x86 processors had a prefetch queue of some size.

Register addressing means the address for the item being accessed is in a register. Assuming the address was already there, this is faster because you don't incur the larger instruction, the extra cycles, and more of the cache line burned for the instruction. Where you have to be careful with this argument is when, for example, the instruction just before is loading the immediate address into the register.

mov ax,[1000h]   ; direct: the address 1000h is encoded in the instruction

mov ax,[bx]      ; register: the address is taken from bx

Here the second one is faster than the first (for things that can be compared at this level), because of the instruction size, the additional cache burned, and the cycles taken.


mov ax,[1000h]   ; direct: one instruction

mov bx,1000h     ; register: the address must first be loaded into bx
mov ax,[bx]

Here the direct addressing is faster because overall it takes fewer cycles to fetch and execute (for things that can be compared).
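To put numbers on the size argument, here is a small sketch that tabulates the 16-bit real-mode encodings of the three instructions above and adds up the bytes for each case. The `total_bytes` helper is just for illustration; an assembler listing will show you the same sizes.

```python
# 16-bit real-mode encodings of the three instructions used above.
ENCODINGS = {
    "mov ax,[1000h]": bytes([0xA1, 0x00, 0x10]),  # opcode + 16-bit address
    "mov ax,[bx]":    bytes([0x8B, 0x07]),        # opcode + ModRM byte, no address bytes
    "mov bx,1000h":   bytes([0xBB, 0x00, 0x10]),  # opcode + 16-bit immediate
}


def total_bytes(*mnemonics):
    """Sum of the encoded sizes of the given instructions."""
    return sum(len(ENCODINGS[m]) for m in mnemonics)


direct = total_bytes("mov ax,[1000h]")
register_only = total_bytes("mov ax,[bx]")
register_with_setup = total_bytes("mov bx,1000h", "mov ax,[bx]")
print(direct, register_only, register_with_setup)  # 3 2 5
```

Instruction for instruction the register form is one byte shorter, but once the load of bx has to be counted the register version is the larger of the two.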

What do I mean by for things that can be compared? The addressing mode has to do with where the address comes FROM. Once you start to EXECUTE that instruction and perform a memory cycle the two are equal: it is an address on a bus, and to be comparing the two instructions the data size has to be the same. It may very well be the case that direct addressing is faster for some test program simply because for that test program the data is always in the data cache, where for the register addressing version it is not or sometimes is not.

So the things that can be compared between the two instructions are the size of the instruction and, following from it, the cycles and cache line space it burns. One cache line can hold many register based instructions but only a few direct/immediate based instructions, so by using direct/immediate you have an opportunity cost and incur more memory cycles overall when executing the program. YES, many of these cycles are in parallel on anything remotely modern.
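The cache line opportunity cost is easy to estimate. The sketch below assumes a 64-byte line (typical of modern x86, but check your CPU) and uses 32-bit protected-mode sizes, where `mov eax,[ebx]` encodes in 2 bytes and `mov eax,[12345678h]` in 5. The helper names are made up for this example.

```python
CACHE_LINE_BYTES = 64  # assumption: typical of modern x86, check your CPU


def instructions_per_line(instruction_bytes, line_bytes=CACHE_LINE_BYTES):
    """Whole instructions of one size that fit in a single cache line."""
    return line_bytes // instruction_bytes


def lines_needed(count, instruction_bytes, line_bytes=CACHE_LINE_BYTES):
    """Cache lines touched by `count` back-to-back instructions starting on a line boundary."""
    total = count * instruction_bytes
    return -(-total // line_bytes)  # ceiling division


REGISTER_FORM = 2  # mov eax,[ebx]
DIRECT_FORM = 5    # mov eax,[12345678h]

print(instructions_per_line(REGISTER_FORM), instructions_per_line(DIRECT_FORM))  # 32 12
print(lines_needed(100, REGISTER_FORM), lines_needed(100, DIRECT_FORM))          # 4 8
```

A hundred direct-addressed loads touch twice as many instruction cache lines as a hundred register-addressed ones in this estimate. Whether that shows up as time depends on everything else going on in the caches.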

So these types of questions have to do with whether or not you understand the instruction set, and depending on how much detail you return, whether or not you understand beyond that what the actual costs are. Likewise without experience, simply trying an experiment will likely fail or show no difference as you have to craft the experiment around the caches.
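As a very rough illustration of what crafting the experiment means, a timing harness usually separates warm-up runs (which pull code and data into the caches) from the timed runs, and repeats the measurement instead of trusting one sample. The sketch below only shows that structure: `best_time_ns` and `work` are made-up names, interpreter overhead swamps anything instruction-level in Python, and on real hardware you would read a cycle counter or hardware timer around the assembly under test instead.

```python
import time


def best_time_ns(func, warmups=3, repeats=20):
    """Run func untimed `warmups` times, then return the fastest of `repeats` timed runs in ns."""
    for _ in range(warmups):
        func()  # warm-up: results discarded, caches get populated
    best = None
    for _ in range(repeats):
        start = time.perf_counter_ns()
        func()
        elapsed = time.perf_counter_ns() - start
        if best is None or elapsed < best:
            best = elapsed
    return best


def work():
    """Stand-in for the code under test."""
    return sum(range(10_000))


print(best_time_ns(work))
```

Taking the minimum measures the warm-cache case. If the question is about cold-cache behavior, the experiment has to be built the other way around, evicting the code and data before each timed run.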

I highly recommend the book The Zen of Assembly Language by Michael Abrash, NOT the free one that comes with the big black graphics programming book; that one is incomplete. You can get a used copy in good shape (I bought a second one and it was in better shape than my original, which I bought at the store and which has lived on a bookshelf). The details about the 8088 and 8086 were outdated when the book went to print. That is not the importance of the book; the importance is to understand how to attack the problem, how to think about the problem, and to get an elementary insight into what is going on behind the scenes. It is significantly more complicated today, still understandable, but I recommend starting with a foundation like this before jumping into what you see today, esp. with x86 (I highly recommend learning something, anything, other than x86 first when you start looking at busses and caching, etc).

I have cleaned up and made the amber processor (arm2 clone) available using open source tools so that you can see things running inside the processor (http://github.com/dwelch67/amber_samples). One version of the amber has a cache. Again, a stepping stone; adding MMUs and multiple cores, etc. just adds to the complexity.

Super short answer: direct addressing encodes as a longer instruction and so takes more cycles than register addressing when only the two instructions are compared to each other. Memory side effects, caching, etc. can confuse or neutralize the differences.

by markedathome   2019-07-12
Zen of Code Optimization 1994 ( https://www.amazon.com/Zen-Code-Optimization-Ultimate-Softwa...) is the follow up to Zen of Assembly Language 1990 (https://www.amazon.com/Zen-Assembly-Language-Knowledge-Progr...)