A couple of weeks ago I'd never heard of Peter Cordes. Now the linked article is the third time I've seen his work. He's doing a fine job of fixing Stackoverflow's low-level optimization knowledge. Not so long ago all I seemed to find there was people saying something like, "well, you shouldn't optimize that anyway", or, "modern computers are very complex, don't even try to understand what's happening".
> "modern computers are very complex, don't even try to understand what's happening"
One thing to keep in mind is that the CPU is no longer the CPU you think it is. While the x86 instruction set used to directly control the processor, today it's just a compatibility shim. Under the covers, the CPU is converting your assembly into its own microcode and it's that microcode which is doing the real work. For example, look at some of the block diagrams in section 2 of this Intel manual:
Combine that with all kinds of other voodoo they've added--fancy caches, branch prediction, and so forth--and the "don't even try" maxim starts to make sense. I'd amend that to say "don't even try unless you're just curious."
Personally, I think assembly is a great learning tool, e.g. for understanding how pointers work, how the stack works vs. the heap, how function calls work, and so-forth. I was exposed to Motorola 68K assembly in college and it completely changed my world--lots of C++ things I considered voodoo suddenly made total sense. But I've never used assembly in production code and probably never will.
Another thing to keep in mind is that the compiler is still pretty much the same compiler you think it is. This is the case because it is very difficult to codify all of the information in the Intel Optimization Manual [1]. It's a harder thing to use that information and then a harder thing still to codify, use and use correctly all that information for multiple targets and architectures.
LLVM Tablegen is an approach for representation but it still requires the compiler writer to write everything down and then use it effectively and correctly. This is really hard and it doesn't get done. Instead someone ports a backend and then kinda sorta improves things until it seems better than it was.
Intel has a fine compiler which goes further. It sure isn't open source and it costs a few bucks. They're also contributing LLVM now.
The reason don't even try makes sense both for compiler writer and assembly writer is that this stuff is really hard. Also CPUs are damn fast and Intel (and ARM and ...) work really hard to solve problems dynamically (branch prediction, speculation, register renaming, caching, speculation, ...). In fact, you shouldn't even try this unless you know a lot. Agner Fog and Peter Cordes should be familiar names. But if you do know a lot, beating compilers is not difficult at all. Finding the right place to beat them? That's tough.
Even though CPUs have got a lot more complicated, it is t that difficult to make intelligent decisions about optomizations at a high level of abstractions. There are situations where it makes sense to delve deeper as well. Not only because you are writing assembly, but also if you want to handhold the compiler. There are absolutely situations where it is not worthwhile to worry about optimizations. The problem is that the internet peanut gallery is rarely in a postition to give good advice about this.
> I was exposed to Motorola 68K assembly in college and it completely changed my world--lots of C++ things I considered voodoo suddenly made total sense
Heh. I had this the other way around - picked up 68k asm for fun long before C/C++, and couldn't understand why people had such difficulty with the concepts in those.
I've never learned x86 - partly because I worry that I'd enjoy it too much and waste oodles of time "optimizing" things that don't need it. Probably useful to be able to at least read it, though.
The x86-64 bit isn't bad. You get 16 registers (half named, half numbered---historical baggage) and opcodes galore (more than even the VAX architecture). The x86-32 bit is a mixed bag (8 registers, the specialized opcodes may or may not be worth using depending upon which chip you use), but it certainly helps to know the history (the 8086 and 80286) to understand why it's the way it is.
A few weeks ago I tried my hand at optimizing some code using the vector instructions. While I was able to beat clang quite easily using assembly, I just could not beat GCC. That was sobering. There's probably some 1000 page tome I have to read to understand where I messed up in optimizing.
The CPU is no longer the CPU that the CPU designers (often) think it is. A few years back someone who worked on CPUs at Intel related how his group pulling their hair out trying to characterize how a small section of code in a loop could have wildly different performance characteristics depending on the preceding instructions.
Oh very much so, 68K was great. Even better, at my first job we had a custom 68K-based embedded device and no source-level debugger, so you could only debug in assembly. That reinforced everything I had learned just a few years before. Our next device was MIPS-based, and MIPS was a great RISC system to learn on. I count myself very blessed.
"Don't even try to understand" _never_ makes sense. In this context it's the mantra of idiots who want to put dog-slow abstractions into C++ (among others). You have tried and you understand micro-ops exists, that branch prediction exists etc. This information is useful when optimising.
If you never need to optimise C or C++ code why are you using these languages? Seriously? You've almost certainly got the wrong tool for the job. (Ok sometimes someone has made a legacy decision and you're stuck with it for bad reasons. Meh).
If you don't try to understand it, you won't know things like "Never use a linked list if you care about performance at all." Even Bjarne has twigged to that one (at last!). C and C++ are languages you use when you need performance, understanding the CPU as best you are capable is crucial for performance. DO try. In fact it's insane not to work on your understanding.
Don't even try to justify a statement of "don't even try to understand". It's wrong thinking from top to bottom. Here, as everywhere. No really.
As someone who has used assembly in production code, I am much more than merely curious. I would like to assemble / generate microcode directly. I would like to know exactly what the CPUs are doing, and not rely on voodoo.
I have the same instinct, but I think in a few years we'd be in the same position, where what we write doesn't directly correspond to the architecture of the CPU. The CPU people are always going to work to make existing code faster, and since it's impossible for many generations of CPUs to "be just like the last generation, only faster", that's inevitably going to involve some kind of translation layer between your code and the CPU's "real architecture."
Won't the translation layer reduce some of that extra speed? And if so, why is it done? so that people can keep programming to the same instruction set, even though the underlying (micro)code and hardware architecture has changed?
Here's my informal and possibly screwy understanding of why translating to microcode doesn't really slow things down: When you've got a bunch of tasks to do, breaking the task-execution process into a pipeline of steps may potentially increase the latency (how long does it take to do a single task), because, well, you've now got to go through several steps do do that task. But (assuming your pipeline is designed well), it will also increase the throughput (how many tasks can you do in a fixed amount of time).
In software, it's likely that latency is not going to be a problem as long as you can get more throughput (you probably don't care if the time between when the program hits your opcode and the CPU actually executes your instruction is 1 nanosecond later than it conceivably could be as long as the number of instructions executed per millisecond is greater).
In any case, yes, you need to expose the same machine code interface, not just so people can write new software using the same code, but so that old code continues to work. It would be a nightmare if every few years there were incompatible changes to the instruction set.
You're basically asking why Intel doesn't expose their internal uops as an ISA that you can program in. See my answer on a SO question asking exactly that.
https://stackoverflow.com/a/32866797/224132
The other answers on that question make some good points, too.
Most instructions decode to a single internal uop already, so x86 machine code is just a compact way of programming in uops.
Things like variable-count shifts are multi-uop because of x86's legacy CISC flag-handling, but BMI2 fixes that by giving you no-flags shifts that do decode to a single uop.
You're usually not missing out on much.
Also, there's plenty of voodoo besides just translation to uops. Programming in uops directly wouldn't help you figure out why there seems to be some kind of limit on register reads per clock (http://agner.org/optimize/blog/read.php?i=415#852) for example, since the uops you'd program with would still have register renaming done to them as they're issued into the out-of-order core.
If I ever see Knuth's quote "premature optimization is the root of all evil" in response to a question again, I think I'll puke. Not only is it hard for outsiders to know what's premature and what isn't, but sometimes it's nice to make a habit of doing things the faster way when you have two choices that are otherwise indistinguishable. For example I try to use ++x instead of x++, even though 99% of the time it makes no difference.
In my opinion, any optimization done before taking the naive approach is usually premature optimization.
That might sound a little extreme, but in the past 5 years I've run into exactly 1 problem that was solved by busting out the profiler and optimizing. In that same time, I can't count on all my digits the number of features that didn't ship, estimates that were overshot, deadlines that were slipped, etc etc. I've even been part of a team that ran out of runway while popping open jsPerf to choose between !! and Boolean(). Our app was fast as hell -- too bad no one will ever get to use it.
If you're expending cycles choosing between ++x and x++ and you're not ahead of schedule, please stop.
That was my point, I'm not expending cycles choosing between ++x and x++. I've just chosen a different default than most of the code I've seen, and you still need to realize when the default doesn't do what you want - but that's usually obvious.
Sorry to hear about your unsuccessful projects, that's a bummer. I hope that premature optimization wasn't a major part of the blame for any of them.
This vastly differs depending on what software you write. I have done lots of necessary performance optimizations. Of course, there is a balance. To say never first optimize, is as bad as optimizing everything, in my opinion. It should require a case by case "study" of whether it is worth it. This kind of judgement comes with experience.
Writing performant code by using the correct idioms, data structures, algorithms, etc, from the start is just common sense rather than 'premature optimisation'.
Writing unreadable, micro-optimised code in the name of performance without even attempting to profile it first is another matter.
My personal rule (as a not-very-good hobbyist) is that if I have to refactor everything in order to accommodate an optimisation in a path that's already 'fast enough', or introduce some arcane hackery that only makes sense with comments into otherwise clean code, then it must be backed up with realistic benchmarks (and show a significant improvement).
I was with you until you got to the ++x/x++ example. It is a bad example because they generate the same machine code except in situations where they are semantically different.
For simple types that's true, and that's part of the 99% I was talking about. Where it makes a difference is class types, C++ iterators being a prime example. Maybe the optimizer can make the two cases generate identical code, but why take the chance?
I chose that example because it didn't need a full paragraph to explain, but you're right that there are probably better ones. Edit: maybe a Python example? ''.join([a, b, c]) is faster than a + b + c, but again 99% of the time you won't be using it in performance critical code. But it's a useful idiom to know.
From what I remember, for such small examples, a + b + c is significantly faster. (The CPython folks might have done some optimization work on this a few years back?)
It’s only if you plan to concatenate a whole lot of strings that join is a clear winner.
> My rule of thumb is if you haven't profiled it, it's premature to try and optimize it.
The context was specifically about questions on StackOverflow. You don't know if the person asking the question has profiled it or not, and the assumption is often that they haven't. Probably true more often than not, but very condescending and unhelpful to the person who has.
especially considering that people usually do not quote the following part of the quote : "Yet we should not pass up our opportunities in that critical 3%".
I remember seeing a blog post from Andrei Alexandrescu that I can't seem to dig up, but this SO post seems to be a nice summary [1]. In short, in 99.9999999% of usages post increment is probably better.
Thanks. I guess I'm applying this rule only when the result of the increment isn't being used directly, so there is no dependency on the operation. When the result is used, the semantic difference between ++x and x++ obviously determines the choice.
Just look at the liveness of the expression x+1. post-increment means expression has to be alive before whatever uses '++x' whereas pre-increment can delay until next usage of x.
I rather take "correctness and safety before performance", and only if it doesn't meet the performance criteria of an user story, then with the help of a profiler analyse what are the next steps.
99% of the time I don't need to worry about the profiler.
Anything one line is prone to being misinterpreted. IMO, people who're still learning shouldn't be bombarded with a million different things, so one liners work for them. People who're experienced, realize the intent of that one liner and will easily know enough to know when it applies.
That hasn't really been a thing since the PDP-11, which had an auto-increment option on the register used for pointer offsets. That's why that feature is in C. It really mattered in 1978.
Interesting, so the C creators used a feature in their language, that was hardware-specific. I thought (from having read K&R) that one goal of C was to be hardware-independent. Maybe this was an exception, or maybe that auto-increment was common to many computer designs at that time.
Wow, B. A blast from the past, sort of. I had read a good book about BCPL (the ancestor to B) many years ago. IIRC, it was by Martin Richards, inventor of BCPL. Pretty interesting book and language. BCPL and B were both typeless languages, or languages with just one type, the machine word (16 or 32 bits, don't remember). Still I found that many algorithms and programs were expressed rather compactly in BCPL - or so it seemed to me at the time. Was quite junior then, and without exposure to more advanced programming languages - only knew BASIC and Pascal, probably; even C, I only learned a bit later.
Also just saw some other interesting stuff from the BCPL article above:
[
BCPL is the language in which the original hello world program was written.[3] The first MUD was also written in BCPL (MUD1).
Several operating systems were written partially or wholly in BCPL (for example, TRIPOS and the earliest versions of AmigaDOS).
BCPL was also the initial language used in the seminal Xerox PARC Alto project, the first modern personal computer; among other projects, the Bravo document preparation system was written in BCPL.
]
Interestingly the code you quoted is not called at all. I guess eventually linking phase removes it as dead code.
That considered, both pre- and post-increment generate identical code, even with VS2017.
This matches my previous experience about pretty much any compiler in last 15 years or so -- there's no difference between ++i and i++, unless, of course, it's in a statement and changes the actual meaning of code.
"it++" case. Note that iterator function is not called.
foo PROC
mov rdx, QWORD PTR [rcx]
xor eax, eax
mov rcx, QWORD PTR [rcx+8]
cmp rdx, rcx
je SHORT $LN70@foo
mov r8, rcx
sub r8, rdx
sar r8, 5
npad 8
$LL4@foo:
add rax, r8
add rdx, 32 ; 00000020H
cmp rdx, rcx
jne SHORT $LL4@foo
$LN70@foo:
ret 0
foo ENDP
Here's the code generated for "++it" case. Iterator function is not called here either.
foo PROC
mov rdx, QWORD PTR [rcx]
xor eax, eax
mov rcx, QWORD PTR [rcx+8]
cmp rdx, rcx
je SHORT $LN68@foo
mov r8, rcx
sub r8, rdx
sar r8, 5
npad 8
$LL4@foo:
add rax, r8
add rdx, 32 ; 00000020H
cmp rdx, rcx
jne SHORT $LL4@foo
$LN68@foo:
ret 0
foo ENDP
Yes, but if used as their own statement and not used as some bigger expression (parameter to function, etc.), any decent compiler will compile them the same way.
Yes, and thus a perfect example premature optimization; semantics should define default usage, not some possible micro optimization. This is exactly the kind of case Knuth is talking about, going around doing ++x because you think it's faster when the standard idiom is x++ is premature optimization.
> Not so long ago all I seemed to find there was people saying something like, "well, you shouldn't optimize that anyway
I find this "Why would you even do that?" attitude in StackOverflow to be very irritating. And it's not just in low-level optimization questions. StackOverflow should be a place for questions and answers, not a collection of lessons on corporate best practices where curiosity and experimentation is discouraged.
I've had the same annoyance in IRC channels - getting an unhelpful lecture on the right way to do things from people who don't fully understand the use case or that "best practices" might not always be worth it or matter, especially in a small script/application
As someone who used to hang out in a few language channels on freenode, I completely understand why the lectures exist. You may have a good reason for asking the question you're asking, but the overwhelming majority of the time when a nick I don't recognize asks an esoteric question about low level optimization, it's because they are new to the language (and often programming in general), and have gone far down a rabbit hole they don't need to go down before asking for help.
When I tried to answer these kinds of questions, I always tried to start by asking questions to make sure I did fully understand the use case and the application. In my experience, a large percentage of people, regardless of how valid their question is, refused to give enough detail to allow me to actually understand their specific use case.
After repeating that cycle several times a week for a few years, it becomes very temping to just shut down the weird esoteric crap with a canned response that might get newbies pointed in the right direction. A small handful of people will stick around to explain why they really do understand their problem and need an answer to the question they are asking. A small handful of newbies will take the advice of the canned response and learn from it. The rest were largely not going to lead to an interesting conversation no matter what.
I agree. Programming forums are full of this crap. I hate asking a pretty straightforward question online and getting that patronizing "But what are you really trying to do?" response.
What I'm really trying to do is get an answer to this specific question. I'm not asking you to re-think my problem statement. Thanks for not helping.
At the same time we're getting a lot of people with XY-problems. They think they know what the problem is and they think they know what they about have to do, while the real problem is somewhere else. If someone's intend with an asked question isn't clear, then "But what are you really trying to do?" is a valid question.
It's impossible for anyone to just know what you're intentions are and whether you understand the deeper semantics of a problem. More often than not, people try to do things for the wrong reasons.
I dunno, I've been on both sides of this. I agree it's really annoying when someone waltzes around without answering my question. But I've also seen people asking questions which sounded goofy to me, and when you finally get them to tell you want they are actually trying to do, there's a much easier and more direct approach.
I think it's great for people on the "answering" side to indicate that they're willing to go beyond the literal question and offer more general support.
But as soon the asker has turned down an offer of that sort once, I really really wish people who aren't going to answer the question being asked would stay out of the conversation.