Data alignment across architectures: the good, the bad and the ugly

Although a computer’s memory map looks quite smooth and very much byte-addressable at first glance, the same memory is a lot bumpier at the hardware level. An essential term a developer may come across in this context is data alignment, which refers to how the hardware accesses the system’s random access memory (RAM). This and other properties follow from how the system’s RAM and memory bus are implemented, with various implications for software developers.

For a 32-bit memory bus, the optimal access for a piece of data is four bytes wide and aligned exactly on a four-byte boundary in memory. What happens when unaligned access is attempted (such as reading a four-byte value that starts halfway into a word) is implementation-defined. Some hardware platforms support unaligned access in hardware, while others throw an exception that the operating system (OS) can catch and handle by falling back to an unaligned-access routine in software. Yet other platforms will typically throw a bus error (SIGBUS in POSIX) if you attempt unaligned access.
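To make the notion of a four-byte boundary concrete, here is a minimal sketch of an alignment check. The helper names `is_aligned` and `is_pointer_aligned` are made up for this illustration, and the check assumes the alignment is a power of two, which holds for the usual bus and type widths.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if `address` is a multiple of `alignment`.
// `alignment` is assumed to be a power of two (1, 2, 4, 8, ...).
bool is_aligned(std::uintptr_t address, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    // For a power of two, the low bits of an aligned address are all zero.
    return (address & (alignment - 1)) == 0;
}

// Same check for a pointer instead of a raw address.
bool is_pointer_aligned(const void* ptr, std::size_t alignment) {
    return is_aligned(reinterpret_cast<std::uintptr_t>(ptr), alignment);
}
```

With this, address 0x1000 counts as aligned for a four-byte access, while 0x1002 (halfway into a word) is only aligned for two-byte access.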

Yet even if unaligned memory access is allowed, what is the actual performance impact?

A hardware view

Basic DRAM Topology (Credit: AnandTech)

As nebulous as system memory may seem, its implementation in the form of synchronous dynamic random-access memory (SDRAM) is very much bound by physical limitations. A great primer on how SDRAM works can be found in Rajinder Gill’s 2010 AnandTech article. The essential take-away from it is how SDRAM modules are addressed.

Each read and write request must select the target bank on a DIMM (BAn), followed by commands that specify the target row (RAS) and column (CAS). Each row consists of thousands of cells, with each SDRAM chip on a DIMM contributing eight bits to the 64-bit data bus found on DDR3 and DDR4 DIMMs.

The result of this physical configuration is that all access to RAM and the intermediate cache(s) is aligned along these physically defined boundaries. When a memory controller is tasked with retrieving the data for some variable, it is incredibly helpful when this data can be fetched from RAM in a single read operation, so that it can be read straight into a CPU register.

What happens if this data is not aligned? It is possible to first read the initial section of the data, then perform a second read to get the final section, and then merge the two parts. Naturally, this is an operation that must either be supported directly by the memory controller, or be managed by the operating system.
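The read-twice-and-merge procedure can be sketched in software. This is a toy model and not how any particular memory controller is built: memory is represented as an array of 32-bit words that can only be read whole, the function name `read_u32_via_aligned_words` is invented for the example, and little-endian byte order is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy model: `words` stands in for memory behind a 32-bit bus, so only whole,
// aligned words can be read. Little-endian layout is assumed: byte 0 of a
// word is its least significant byte.
std::uint32_t read_u32_via_aligned_words(const std::uint32_t* words,
                                         std::size_t byte_address) {
    assert(words != nullptr);
    const std::size_t index = byte_address / 4;  // word holding the first byte
    const unsigned shift = static_cast<unsigned>(byte_address % 4) * 8;

    const std::uint32_t low = words[index];      // first aligned read
    if (shift == 0) {
        return low;                              // aligned: one read suffices
    }
    const std::uint32_t high = words[index + 1]; // second aligned read
    // Merge the upper bytes of the first word with the lower bytes of the second.
    return (low >> shift) | (high << (32 - shift));
}
```

An aligned address costs a single read here, while every other address costs two reads plus the shifting and merging, which is the extra work the hardware or OS has to absorb.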

The memory controller will generate a bus error when it is asked to access an invalid address, stumbles upon a paging error, or is asked to perform unaligned access when this is unsupported. Platforms like x86 and its derivatives do support unaligned memory access.

When things explode

As mentioned, x86 and also x86_64 are basically fine with you accessing system RAM with whatever alignment you choose or happen to end up with. Where things get messy is with other platforms, such as ARM, where the ARMv7 documentation lists how the platform behaves for unaligned data access. Basically, in quite a few cases you will get an alignment fault from the hardware.

This 2005 IBM article covers how the Motorola m68k, MIPS and PowerPC CPUs of that era handled unaligned access. The interesting thing to note here is that up to the 68020, unaligned access would always throw a bus error. MIPS CPUs did not bother with unaligned access in the name of speed, and PowerPC adopted a hybrid approach that allowed 32-bit unaligned access, while 64-bit (floating point) unaligned access resulted in a bus error.

When it comes to replicating the SIGBUS alignment error, this is easily done by dereferencing a pointer:

uint8_t* data = binary_blob;
uint32_t val = *((uint32_t*) data);

Here binary_blob is assumed to be a collection of variable-sized values, not just 32-bit integers.

Although this code will run fine on any x86 platform, on an ARM-based platform like the Raspberry Pi, dereferencing in this manner can earn you a SIGBUS fault and a very dead process. The problem is that when you access the uint8_t pointer as a 32-bit integer, nothing guarantees that the address is properly aligned for a uint32_t, and inside a blob of variable-sized values it usually is not.

So what to do in this case?

Stay in line

There are many arguments for using aligned memory access. The most important ones are atomicity and portability. Atomicity here refers to an integer read or write that can be performed in a single read or write operation. In the case of unaligned access, this atomicity no longer applies, because the access has to cross a boundary. Some code may rely on such atomic reads and writes, which can lead to interesting and sporadic bugs and crashes when unaligned access is not taken into account.

The very obvious elephant in the room, however, is portability. As we saw in the previous section, it is very easy to write code that works great on one platform, but dies miserably on another. However, there is a way to write code that is fully portable, and which the C specification defines as a well-behaved way to copy data without running into alignment problems: memcpy.

If we rewrite the previous piece of code using memcpy, we end up with the following code:

#include <cstdint>
#include <cstring>
uint8_t* data = binary_blob;
uint32_t val;
// Byte-wise copy: valid regardless of how data is aligned.
std::memcpy(&val, data, sizeof(val));

This code is fully portable, with the memcpy implementation handling any alignment issues. If we execute such code on a Raspberry Pi system, no SIGBUS fault will be generated, and the process will live to see another CPU cycle.
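When a blob holds several values at different offsets, the memcpy approach is convenient to wrap in a small helper. The following is a sketch under assumptions: `load_unaligned` is a hypothetical name, not part of any library, and the value is returned in the byte order of the machine running the code, so a blob with a fixed wire byte order would still need a conversion step.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper: copy a value of type T out of a byte buffer at any
// offset, aligned or not. No byte-order conversion is performed.
template <typename T>
T load_unaligned(const std::uint8_t* buffer, std::size_t offset) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy is only valid for trivially copyable types");
    assert(buffer != nullptr);
    T value;
    std::memcpy(&value, buffer + offset, sizeof(T));
    return value;
}
```

A call such as `load_unaligned<std::uint32_t>(blob, 1)` then reads a 32-bit value that starts at an odd offset without ever forming a misaligned `uint32_t` pointer. The caller remains responsible for keeping `offset + sizeof(T)` within the buffer.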

Data structures, struct in C, are groupings of related data values. Since these are meant to be placed in consecutive RAM, this would obviously create alignment problems unless padding is applied. Compilers add this padding where needed by default, which ensures that every data member of the structure is aligned in memory. Obviously this ‘wastes’ some memory and increases the size of the structure, but it ensures that each access to a data member occurs fully aligned.
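The effect of padding is easy to inspect with sizeof and offsetof. In this sketch the struct names and the `padding_bytes` helper are made up for the example, and the byte counts in the comments are what common 32-bit and 64-bit ABIs produce rather than something the language guarantees.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Padded {
    std::uint8_t  flag;   // 1 byte, typically followed by 3 padding bytes
    std::uint32_t count;  // must start on a 4-byte boundary
    std::uint8_t  mode;   // 1 byte, typically followed by 3 tail padding bytes
};

struct Reordered {
    std::uint32_t count;  // largest member first
    std::uint8_t  flag;
    std::uint8_t  mode;   // typically followed by 2 tail padding bytes
};

// Bytes of padding in T, given the summed size of its members.
template <typename T>
std::size_t padding_bytes(std::size_t member_bytes) {
    assert(sizeof(T) >= member_bytes);
    return sizeof(T) - member_bytes;
}

// The compiler always places each member at an offset that suits its type.
static_assert(offsetof(Padded, count) % alignof(std::uint32_t) == 0,
              "count is aligned inside Padded");
```

Both structs hold the same six bytes of data. On typical platforms `Padded` occupies 12 bytes and `Reordered` 8, which shows that ordering members from largest to smallest is a portable way to cut down on padding.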

In common cases where structs are used, such as memory-mapped I/O on MCUs and peripheral hardware devices, this is usually not a concern, as these use only 32-bit or 64-bit registers that are always aligned when the first data member is. Manually tweaking structure padding for performance or size reasons is often possible with compiler toolchains, but should only be done with extreme care.
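As an illustration of such manual tweaking, the sketch below removes padding with `#pragma pack`, which is a compiler extension (supported by GCC, Clang and MSVC) rather than standard C++. The `PackedHeader` layout and the `header_length` function are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#pragma pack(push, 1)      // compiler extension: no padding between members
struct PackedHeader {
    std::uint8_t  type;
    std::uint32_t length;  // lands at offset 1, so it is not 4-byte aligned
};
#pragma pack(pop)

// Copy the raw bytes into a packed struct, then read the member by name.
// Because the compiler knows the struct is packed, it emits code for
// `header.length` that is safe for the unaligned offset.
std::uint32_t header_length(const std::uint8_t* wire_bytes) {
    assert(wire_bytes != nullptr);
    PackedHeader header;
    std::memcpy(&header, wire_bytes, sizeof(header));
    return header.length;
}
```

The care is needed as soon as a pointer to `length` escapes: code that receives a plain `uint32_t*` no longer knows the value is misaligned, which brings back the fault described earlier.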

Performance effects

But, one might ask, what is the performance impact of ignoring data alignment and simply letting the hardware or the OS cover up the complications? As we saw in the exploration of the physical implementation of system RAM, unaligned access is possible at the expense of additional read or write cycles. Obviously, this would at least double the number of such cycles. If this were to happen across banks, the performance impact could be major.
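The doubling can be shown with a bit of arithmetic. This sketch counts how many aligned bus-width words a single access has to touch. It is a simplified model that ignores caches, bursts and bank layout, and the function name `bus_words_touched` is made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of aligned words of `bus_bytes` bytes that an access of `size` bytes
// starting at `address` overlaps. Simplified model: one bus transfer per word.
std::size_t bus_words_touched(std::uintptr_t address, std::size_t size,
                              std::size_t bus_bytes) {
    assert(size > 0 && bus_bytes > 0);
    const std::uintptr_t first = address / bus_bytes;
    const std::uintptr_t last = (address + size - 1) / bus_bytes;
    return static_cast<std::size_t>(last - first + 1);
}
```

On a four-byte bus, a four-byte read at 0x1000 touches one word, while the same read at 0x1002 touches two. A two-byte read at 0x1001 still fits in one word, so not every unaligned access pays the penalty in this model.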

Single- vs. double- vs. quad-byte access (Credit: Jonathan Rentsch, IBM)

A number of benchmark results using single, double, four and eight-byte access patterns are provided in the previously mentioned 2005 IBM article by Jonathan Rentsch. Despite running on a rather sluggish 800 MHz PowerBook G4, the effects of unaligned access were quite noticeable, with two-byte unaligned access being 27% slower than aligned access. Four-byte unaligned access was slower than aligned two-byte access, which made the switch to the larger data size pointless.

When switching to eight-byte aligned access, it was 10% faster than four-byte aligned access. Yet unaligned eight-byte access took a staggering 1.8 seconds for the entire buffer, 4.6 times slower than aligned access.

The performance impact concerns not only standard CPU ALU operations, but also SIMD (vector) extensions, as detailed by Mesa et al. (2007). Additionally, during development of the x264 codec it was found that using cacheline alignment (aligned 16-byte transfers) made one of the most-used functions in x264 69% faster. This indicates that data alignment reaches well beyond system RAM, and also applies to caches and other components of a computer system.
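In C++ such a boundary can be requested with the standard alignas specifier. The sketch below is only an illustration of the idea and is not taken from x264: the `PixelBlock` type and the `starts_on_boundary` helper are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 16-byte block that the compiler must place on a 16-byte
// boundary, so one aligned 16-byte transfer can fetch it.
struct alignas(16) PixelBlock {
    std::uint8_t pixels[16];
};

// True if `ptr` is a multiple of `boundary` bytes.
bool starts_on_boundary(const void* ptr, std::size_t boundary) {
    assert(boundary != 0);
    return reinterpret_cast<std::uintptr_t>(ptr) % boundary == 0;
}

static_assert(alignof(PixelBlock) == 16, "requested alignment is applied");
static_assert(sizeof(PixelBlock) == 16, "no padding is needed for this layout");
```

Because the size is a multiple of the alignment, every element of a `PixelBlock` array also starts on a 16-byte boundary, so no element straddles two such boundaries.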

Much of this comes down to the doubling of most (common) operations and the resulting hit to overall performance.

Wrapping up

In some ways the x86 architecture is rather comfortable in that it shields you from the ugly parts of reality, such as unaligned memory access, but reality has a way of sneaking up on you when you least expect it. One such incident happened to me a few months ago while doing some profiling and optimization on a remote procedure call library.

After Valgrind’s Cachegrind profiling tool showed me an excessive amount of unnecessary copying in the internals, the challenge was not only to implement a zero-copy version of the library, but also in-place parsing of the binary data. This led to some of the aforementioned effects of unaligned memory access while dereferencing the (packed) binary data.

Although the problem was easily solved using the memcpy-based solution shown above, it gave me an interesting look at SIGBUS faults on ARM-based systems, where the same code had worked without a hitch on x86_64 systems. As for the performance impact? Benchmarks before and after the changes to the RPC library showed a significant increase in performance, which may be due in part to the switch to aligned access.

While there are some people around who insist that the performance hit from unaligned access is not significant enough to worry about today, the very real impact on portability and atomicity is something that should give anyone pause. Beyond that, it is absolutely worthwhile to run the code through a profiler to get a sense of what the memory access patterns are and what can be improved or optimized.
