Bytes before FLOPS: your algorithm is (mostly) fine, your data isn't
A small, but deep dive into performance and data-oriented optimization.
Foreword
So, earlier this year I did a presentation at a cool (get it?) little Norwegian company called Airthings.
It was a two-parter. The first part was about how the compilation pipeline works - from ASCII to assembly, greatly simplified of course; it was from an old presentation I did many years ago. The second part is the interesting one if you ask me. It's about performance and some of the low-hanging fruit you can pick to speed things up as much as possible.
Now, that part actually takes a while, but in short it's basically a distillation of my experience working on compilers, assembly and writing vectorized code. The gist of it is - data.
It occurred to me that a small summary would make a good blog post. There's plenty of material on data-oriented design and performance, but this might be a nice overview.
I'll style it as a written representation of how I approach optimization.
When I approach an optimization effort, the first thing I do is try to visualize the data: where each bit is coming from and going to, what operations are being done on it, what the difference is between the initial and final state - can we get there with fewer or simpler steps?
If the data is coming in in a suboptimal form, transform it.
There is no algorithm in the world that will save you from inefficiently coded information.
Usually the steps, in my experience, go like this:
1. Specialize the algorithm
This step is not directly related to data, but changing the data often requires changing the algorithm. Not only do we need to change access patterns and the shape of the data, but we often need a different algorithm, one more suited to the new form of data we're processing.
Regardless, this is a crucial part, as this is where you identify one of the most important parts of optimization - "do less". Usually when we implement algorithms, we just try to make them work, a.k.a. the "make it work -> make it fast -> make it beautiful" paradigm. In the process of development we change the data and the algorithm, the code goes through many iterations, and a lot is left on the table.
Specializing an algorithm means taking away the genericity and leaving in only the parts that are relevant to the data being processed. If that means specializing the data, that's fair game as well - garbage in/garbage out. As part of development we tend to use existing algorithms, whether from a standard library, an external library, or even just a different module in the program.
These existing algorithms are by definition not bespoke and don't take into account your data or the manipulation done on it. In many cases you can rewrite them to suit your use case; in others you can simply use a more specific algorithm, e.g. radix sort instead of quicksort.
People are often afraid of peeking into the libraries, thinking it's magic. It's not, just look at the code, learn what's being used and adapt it to your own case.
The specific example given in the presentation was specializing a generic image downscaling algorithm to specifically handle integer-based, 32-bit ARGB images, which was used to great effect: it sped up the implementation and allowed for further optimizations.
A small and trivial example would be something like this:
// we're trying to parse a validated, 2-digit, positive integer
// the generic way (atoi)
// overhead: function call, loop setup, branch prediction, locale checks.
int val = atoi(ptr);
// the specialized way
// overhead: none really, just a few instructions that the CPU can blaze through
int val = (ptr[0] - '0') * 10 + (ptr[1] - '0');
2. Make it cache friendly
This is where the meat of the matter is. Think about the data writes and reads, on both the temporal and spatial axes. Every load and store needs to be accounted for and studied. Is it needed, where does it write, when should it write, are there other writes/reads targeting nearby locations? Similarly for reads: consider not just the location in the code, but the memory location it concerns.
To put it plainly, try to access data that's either close to data you've already accessed or data that you've recently accessed.
If your algorithm is only accessing a part of your data - e.g. you have an array of structs and you're only accessing the first 8 bytes of a 128-byte struct - you're wasting a lot of CPU. Separate out the relevant data and process it independently. The relevant data stays hot in cache and you reap the benefits.
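As a sketch of the idea (the struct and field names here are made up for illustration), splitting the hot field out of a wide struct might look like this:
// Hypothetical layout: a wide "entity" record where the hot loop only
// ever reads one 8-byte field.
struct entity {
    double position;   // the only field the hot loop needs
    double velocity;
    char   name[64];
    char   misc[48];   // ...other cold fields, 128 bytes in total
};

// Hot/cold split: the hot field gets its own densely packed array,
// the rest moves to a parallel "cold" array indexed the same way.
struct entity_cold {
    double velocity;
    char   name[64];
    char   misc[48];
};

double positions[1024];
struct entity_cold cold[1024];

// Now every cache line fetched by this loop is 100% useful data,
// instead of 8 useful bytes out of every 128.
double sum_positions(int count) {
    double sum = 0.0;
    for (int i = 0; i < count; ++i)
        sum += positions[i];
    return sum;
}
The two arrays are indexed in parallel, so the cold fields are still there when you need them, but the hot loop never drags them through the cache.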
One nice trick: if you're writing to a location that you know you won't access again soon, you can use streaming non-temporal writes/stores (the _mm_stream_XX intrinsics on x86).
Using them lets the CPU bypass the data caches and speeds up the whole process. It's useful when memory bandwidth is constrained as well.
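A minimal sketch of a non-temporal fill on x86, assuming AVX support and a 32-byte-aligned destination buffer (the function name is mine):
#include <immintrin.h>
#include <stddef.h>

// Fill a large buffer without pulling its cache lines into the data caches.
// The streaming stores hint that we won't touch this data again soon.
void fill_streaming(float* dst, float value, size_t count) {
    __m256 v = _mm256_set1_ps(value);
    size_t i = 0;
    for (; i + 8 <= count; i += 8)
        _mm256_stream_ps(dst + i, v);  // dst must be 32-byte aligned
    _mm_sfence();                      // order/flush the non-temporal stores
    for (; i < count; ++i)             // scalar tail for the leftovers
        dst[i] = value;
}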
Once you've done all that (or ideally before - premature optimization is not a thing on this blog; improperly granulated optimization, on the other hand, is), you can see what the actual values for each piece of data will be and compress the types if possible. It saves space, you can keep more data in cache, and the whole thing speeds up.
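As a toy illustration (the types and value ranges are invented for the example), shrinking field types once you know the real ranges multiplies how many records fit in a cache line:
#include <stdint.h>

// Before: 16 bytes per sample - 4 samples per 64-byte cache line.
struct sample_wide {
    int64_t sensor_id;  // in practice never exceeds a few thousand
    int64_t reading;    // known to fit in a signed 16-bit range
};

// After: 4 bytes per sample - 16 samples per cache line.
struct sample_narrow {
    uint16_t sensor_id;
    int16_t  reading;
};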
Structure padding can also be a significant factor. Take a look at this example, generated by a very useful tool called pahole.
struct bad_order {
    char c;      /*  0  1 */
    /* XXX 7 bytes hole, try to pack */
    double d;    /*  8  8 */
    int i;       /* 16  4 */

    /* size: 24, cachelines: 1, members: 3 */
    /* sum members: 13, holes: 1, sum holes: 7 */
    /* padding: 4 */
    /* last cacheline: 24 bytes */
};

struct good_order {
    double d;    /*  0  8 */
    int i;       /*  8  4 */
    char c;      /* 12  1 */

    /* size: 16, cachelines: 1, members: 3 */
    /* padding: 3 */
    /* last cacheline: 16 bytes */
};
As seen above, even though bad_order has the same members as good_order, it is 8 bytes larger simply due to the ordering of members.
In practice, on x86-64, this means that you can fit 4 good_order structs in a 64-byte cache line, whereas only 2 bad_order structs fit (with a third spilling over).
Which is a bit bonkers if you think about it: simply reordering a few lines in an editor can take a good chunk off the runtime (depending on the algorithm, of course).
You can use __attribute__((packed)) instead of manually reordering members, but I'd only use it for protocol/storage use as unaligned members can cause issues on some architectures, so just one more thing to keep in mind.
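For completeness, a tiny sketch of what that looks like for a wire/storage format (the field names are hypothetical):
#include <stdint.h>

// Packing removes the holes entirely, at the cost of unaligned members.
struct __attribute__((packed)) wire_record {
    uint8_t  type;     // offset 0
    uint32_t payload;  // offset 1 - unaligned; fine on x86, risky elsewhere
    uint16_t crc;      // offset 5
};                     // sizeof(wire_record) == 7, no padding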
Finally, if you're in a loop where you know exactly which addresses you're going to access next (provided it's not a simple strided memory access pattern), you can use software prefetching (the __builtin_prefetch builtin in GCC/Clang, or _mm_prefetch on x86). But again, unless the access pattern is highly unpredictable (for the pretty good hardware prefetchers of modern CPUs), you should just trust the CPU.
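A rough sketch of what that can look like for an indexed gather, where the indices make the pattern invisible to the hardware prefetcher (the prefetch distance of 16 is a made-up starting point you'd tune by measuring):
// Gather through an index array; the next addresses are known in advance
// but too irregular for the hardware prefetcher to guess.
float gather_sum(const float* values, const int* indices, int count) {
    float sum = 0.0f;
    for (int i = 0; i < count; ++i) {
        if (i + 16 < count)
            // rw = 0 (read), locality = 1 (low temporal locality)
            __builtin_prefetch(&values[indices[i + 16]], 0, 1);
        sum += values[indices[i]];
    }
    return sum;
}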
3. Make it SIMD
Vectorizing an algorithm can, in some simpler cases, come as a result of the step above, since the result is usually a serial processing of data, which is highly amenable to vectorization (and parallelization). In that case, simply grouping the data into 8/16/32-wide groups and using vector instructions to process them can be enough.
In those cases, autovectorization can also happen, where the compiler will happily recognize that a loop is vectorizable and do the whole process for you.
In other cases, you have to take a good hard look at what your algorithm is doing, separate out independent streams of data processing, rearrange the data to fit the vectorized version so it's laid out nicely and contiguously for the SIMD instructions to speed through, and try to vectorize the whole loop(s). If you have some complex data movement going on, it can be a challenge to shuffle and permute all the bytes around, but a fun challenge nonetheless.
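As a small sketch of the manual route, here is an AVX version of the trivial "scale everything by 0.5" loop that the autovectorization example further down also uses (assuming AVX is available - see the next paragraph):
#include <immintrin.h>

void halve_avx(float* data, int count) {
    const __m256 half = _mm256_set1_ps(0.5f);
    int i = 0;
    for (; i + 8 <= count; i += 8) {
        __m256 v = _mm256_loadu_ps(data + i);               // load 8 floats
        _mm256_storeu_ps(data + i, _mm256_mul_ps(v, half)); // multiply, store back
    }
    for (; i < count; ++i)  // scalar tail
        data[i] *= 0.5f;
}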
One thing to be aware of is that not all instruction sets are available on all CPU microarchitectures (looking at you, Intel, with AVX512), and thus code that you write can simply crash on some machines with an illegal-instruction fault. To avoid that, branch at runtime based on the CPU feature set, with a common fallback.
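One way to do that with GCC/Clang on x86 is __builtin_cpu_supports; a sketch, reusing the hypothetical halve_avx from above:
void halve_avx(float* data, int count);  // the AVX sketch from above

void halve_scalar(float* data, int count) {
    for (int i = 0; i < count; ++i)
        data[i] *= 0.5f;
}

// Pick the implementation at runtime so the same binary runs everywhere.
void halve(float* data, int count) {
    if (__builtin_cpu_supports("avx"))
        halve_avx(data, count);
    else
        halve_scalar(data, count);
}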
For autovectorization, you can use function multiversioning, which automatically does all that for you. It makes multiple copies of a function, one per target, and dispatches to the right one at runtime - very nifty.
Example:
__attribute__((target_clones("avx512f", "avx2", "default")))
void process_data(float* data, int count) {
    for (int i = 0; i < count; ++i) {
        data[i] = data[i] * 0.5f;
    }
}
For more information on this (and data-oriented design in general), research SoA (structures of arrays) and AoSoA (arrays of structures of arrays). This could honestly be a whole series of blog posts, so for the sake of terseness, take a look at the materials section at the bottom for sources.
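To make the terms concrete, here is the shape difference for a hypothetical particle system:
// AoS: the fields of each particle are interleaved. A loop that only reads
// `x` strides through memory and wastes 3/4 of every cache line.
struct particle { float x, y, z, w; };
struct particle aos[1024];

// SoA: each field is its own contiguous array. A loop over `x` reads a
// dense stream that maps directly onto 8-wide (AVX) vector loads.
struct particles_soa {
    float x[1024];
    float y[1024];
    float z[1024];
    float w[1024];
};
struct particles_soa soa;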
4. Make it parallel
So now, parallelization at this stage should come at almost no cost: you've already separated the data processing into independent streams, so parallelizing should be just a matter of using OpenMP in the case of C/C++, or rayon in the case of Rust. Combined with the above steps, this should easily net you two orders of magnitude worth of speedup.
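A minimal sketch of that in C/C++ with OpenMP (compile with -fopenmp); the loop body is the same toy example as above:
// Each iteration is independent, so one pragma is enough to split the
// range across threads.
void halve_parallel(float* data, int count) {
    #pragma omp parallel for
    for (int i = 0; i < count; ++i)
        data[i] *= 0.5f;
}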
One thing you should be aware of however, is false sharing.
In short, if you have multiple threads writing to different parts of the same cache line, the cache coherence protocol keeps invalidating that line in the other cores' caches, forcing stalls and expensive reloads.
Ironically, it can be fixed by adding padding, which makes less data fit in the cache, but lets the parallelization do its thing. Thus a net positive.
Small example:
#include <atomic>
#include <cstdint>

struct packed {
    std::atomic<int64_t> a;
    std::atomic<int64_t> b;
};

struct padded {
    std::atomic<int64_t> a;
    // force "b" onto a new cache line
    alignas(64) std::atomic<int64_t> b;
};
In the packed struct, both members fit in a single cache line and will therefore exhibit false sharing in a multithreaded scenario.
Conclusion
That is more or less it. Note that this list is not applicable in every case (how on earth would you parallelise an interpreter) nor does it have every optimization worth doing (rewriting a subroutine in assembly for example).
It should cover most of it though since, with programming, all we're doing is shuffling data from one end to the other with some transformation in between.
How do GC/scripting languages fit in?
Short answer - they don't.
Long answer - it depends. It depends on the language and its support for value types and the ability to control the data layout.
- Java has Project Valhalla, which is an unreleased, experimental project to add value types to Java.
- Go is solid all-around, but writing vectorized code often (ideally) requires writing Go assembly.
- Python defers any perf-related code to C/C++.
- Lua similarly should not be used to implement perf-sensitive parts of a program, which its embeddable nature supports quite naturally.
- C# is in an awesome situation here with its support for value types (ref structs), slices (spans), stack allocation, and SIMD intrinsics (including AVX512!). You can even go bare-metal and GC-free with bflat.
Despite the outliers, if you truly have non-IO-bottlenecked, data-intensive code that needs to run as fast as possible, systems languages like C/C++/Rust are your best bet.
Profiling
Before doing anything - profile.
Sometimes you can intuit where the bottleneck is, but most of the time you'd be surprised at where the algorithm actually falls apart. Sometimes it won't be a single bottleneck; it might be spread out over multiple places (the worst-case scenario being a flat profile, where program time is roughly evenly distributed).
A lot of it depends on not only understanding low-level programming but the domain that you're working with.
There's only so much you can optimize without enough information on what that specific algorithm/code/subsystem is supposed to be doing, rather than what it is actually (slowly) doing.
Also, once the lowest-hanging fruit has been picked, we're not talking just function-level but instruction-level profiling as well, where, depending on the machine, you can be frontend-bound, backend-bound, memory-bound... a whole new world opens up.
See VTune metrics and analysis guides, lots of useful stuff in there.
Useful tools & material
- hyperfine - benchmarking
- perf - profiling (ETW for Windows)
- VTune - in-depth profiling
- DTrace - profiling, debugging, tracing
- Cachegrind - cache tracing
- Perfetto - tracing visualization
- godbolt.org - a fantastic website where you can see in detail how your code gets compiled down to assembly (plus analysis and various tools)
The now legendary intro to data-oriented design: CppCon 2014: Mike Acton "Data-Oriented Design and C++"
Much more about data-oriented design: www.dataorienteddesign.com/dodbook
An in-depth overview of performance analysis and optimization: Performance Analysis and Tuning on Modern CPUs by Denis Bakhvalov