Daniel Strecker's Blog - daniel-strecker.com
published: 2020-02-03
started: 2020-01-26
last change: 2020-02-04

Hitting the Memory Wall: Once Again

Abstract

This article describes the problem called memory wall, what kinds of algorithms experience it, and how to observe it in Java. An idea of how mitigation strategies are implemented is described at the end of the article.

Contents

1. Phenomenon
1.1. History
1.2. Occurrence
2. Experiment
2.1. Procedure
2.2. Observation
2.3. Interpretation
3. Optimizations
3.1. Fixed Array Sizes and Loop Condition
3.2. Splitting Arrays to Make Them Shorter
3.3. Avoid Cache Size Exhaustion
4. Appendix
4.1. Cache-Friendliness
4.2. Benchmarks Are Tricky
5. References

1. Phenomenon

1.1. History

The memory wall is the phenomenon that, when processing large amounts of data, the processor often has to wait for data coming from the comparatively slow RAM. Therefore, data-intensive algorithms run with fewer calculations per second than the processor could theoretically perform. This is a persistent problem, and it has been getting worse for decades because memory speed advances inherently more slowly than processor speed. It was first described scientifically in the paper Hitting the Memory Wall: Implications of the Obvious [1] in 1995.

1.2. Occurrence

The phenomenon can be observed when processing any kind of data, provided the data size is big enough. It can occur in any kind of today's computers, e.g. supercomputers, desktop computers, laptops, smartphones, or tablets. Because this is a hardware-bound problem, changing the programming language has no effect on it.
On small amounts of data which need to be processed repeatedly, the performance of processors can be fully utilized, because the data can be kept completely in the processor's registers or in the fast cache memory on the processor chip. When stored there, no loading from the slower RAM is necessary. But when the amount of data grows and no longer fits into the registers or the processor cache, data needs to be loaded from RAM. That is slow compared to registers and processor cache, and it slows down computation.

2. Experiment

2.1. Procedure

To observe the memory wall, we use the Java micro benchmark in Snippet 1:

public class Benchmark {

    public long profile(float[] a) {
        long t = System.nanoTime();

        // start of code to profile
	
        for (int i = 0; i < a.length; i++) {
            a[i] = a[i] * a[i];
        }
        // copy-paste this loop 9 further times here
        // for more accurate per-loop time measurement

        // end of code to profile

        t = System.nanoTime() - t;

        return t;
    }
}
Snippet 1: Micro Benchmark

The method profile(...) in Snippet 1 measures the time it takes to execute the code to profile and returns this duration in nanoseconds.
Because it's Java, we have to run the same method many times in order to observe the best execution times. The repeated runs have two effects. First, they tell the Java JIT compiler that the method is performance-critical and should therefore be compiled to fully optimized machine code. Second, we get not just one but many measurements, so we can get rid of measurement outliers by taking the minimum of the measured times over many invocations. For this purpose, we add the main(...) method in Snippet 2 to the Benchmark class of Snippet 1:

    public static void main(String[] args) {
        Benchmark benchmark = new Benchmark();
        long tMin = Long.MAX_VALUE;
        float[] a = new float[128];

        for (int i = 0; i < 10 * 1000 * 1000; i++) {
            long t = benchmark.profile(a);
            tMin = Math.min(t, tMin);
        }

        double performance = 10.0D * a.length / tMin; // 10 loop copies * a.length ops, divided by tMin ns => Gop/s

        System.out.printf("Gop/s: %.3f\n", performance);
    }
Snippet 2: Running the Micro Benchmark

You can then compile the Benchmark class and run it. It prints the performance of the code in profile(...) for a specific size of the array a.
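
As a concrete example with purely hypothetical numbers: if the array holds 1,024 floats and the minimum measured time for profile(...) is 500 ns, then the 10 copies of the loop performed 10 * 1,024 = 10,240 operations in 500 ns, i.e. 10,240 / 500 = 20.48 operations per nanosecond, which is the same as 20.48 Gop/s.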

The above benchmark can easily be automated to emit the performance for various array sizes; a rough sketch follows below. Plotting the performance over the length of the array a results in the diagram in Fig. 1.
Performance results highly depend on the CPU of the machine it runs on, the Java version, and the operating system. The diagram will look different on other machines, but similar traits can always be found.
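
Such an automation could, for example, replace the main(...) method of Snippet 2 with a loop over array lengths. The following is only a rough sketch with illustrative constants (lengths from 4 to 16 Mi floats, one million invocations per length), not the exact driver used for Fig. 1:

    public static void main(String[] args) {
        Benchmark benchmark = new Benchmark();

        for (int len = 4; len <= 16 * 1024 * 1024; len *= 2) {
            float[] a = new float[len];
            long tMin = Long.MAX_VALUE;

            for (int i = 0; i < 1000 * 1000; i++) {
                tMin = Math.min(benchmark.profile(a), tMin);
            }

            // 10 loop copies in profile(...): 10 * len ops in tMin ns => Gop/s
            double performance = 10.0D * len / tMin;
            System.out.printf("%d;%.3f%n", len, performance);
        }
    }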

I used a system with an Intel i7-4700MQ CPU, which has a maximum clock speed of 3.40 GHz, Ubuntu 18.04.3 LTS, and OpenJDK 11.0.6, 64-Bit Server VM. Time measurement on Windows is not as precise as on Linux, so if you use Windows and get weird results, it's probably because of inaccurate time measurement.

Fig. 1: Performance for Single Precision (float)

2.2. Observation

The graph in Fig. 1 has multiple different areas, each of them with a specific flavor. The first data point marks the performance for a = new float[4]; (16 bytes), which is at 1.18 Gop/s.
From there, with increasing array size, the performance leaps up and down, but increases overall up to about 22.50 Gop/s at the array size of 3 kB.
It stays about there up to an array size of a bit below 32 kB. This is the top performance area in the diagram.
The performance decreases slightly right before and at 32 kB, and it drops sharply to about 7.5 Gop/s after this mark. Note that the x-axis is in log scale, so although the diagram looks like there is a drastic change right after the 32 kB mark, it actually only tells us that something changes between 32 kB and 45 kB.
After that, the performance doesn't change much for almost an order of magnitude, up to 256 kB. From there until 512 kB the performance decreases again, this time moderately, down to a plateau of 5.7 Gop/s.
This performance plateau lasts until about an array size of 4 MB, from where it drops again in the array size range of 4 MB to 12 MB to a performance of about 2.5 Gop/s.

In summary, looking at the whole diagram, we see a back and forth jumping increase, then plateau no. 1, and then 3 more or less stepwise decreases to 3 further plateaus.

2.3. Interpretation

The top performance of 22.50 Gop/s is a stunning result for a processor running at 3.40 GHz, because it is 6.6 times the maximum CPU clock speed. Read more about how this is possible in the article Auto Vectorization in Java.

The increasing slope at the beginning of the graph is a result of the overhead needed for handling a variable array size. Because of the required capability to handle various array sizes, the code in the profile(...) method cannot be optimized as much as it could be if the array size were known at compile time. With increasing array size, the share of this overhead in the complete processing load gets smaller and smaller. For this reason, an increasing array size makes this code more efficient.
Naturally, the quality of the optimizations that Java's JIT compiler produces can have a tremendous effect on the efficiency. These optimizations are often sloppy, not using the full capability of the CPU's instruction set. This, together with the impossibility of producing the most optimized code for a dynamic array size, causes the jumps and setbacks in the increase.
Above a data size of about 3 kB (~800 floats), well-optimized parts of the code are used for processing, and the overhead of the dynamic array size checking becomes negligible. This allows for top performance on the side of the algorithm and machine code instructions, which opens the high performance plateau above 3 kB. This plateau ends abruptly above 32 kB, caused by the exhaustion of the L1 (level 1) cache of the CPU, which is exactly 32 kB in size. The L1 cache is the fastest of the CPU's caches, and the smallest as well.
The Intel i7-4700MQ processor has a total of 3 caches with the sizes 32 kB, 256 kB, and 6.0 MB. Further plateaus end where the sizes of the L2 and L3 caches of the CPU are exhausted. As effects of variable array sizes and differing optimizations are negligible over 32 kB, all following plateaus are more steady than plateau no. 1.
The last plateau, which begins around the size of the L3 cache, reflects the speed of the RAM memory of the system. So in total, there are 4 plateaus which reflect the speed of the L1, L2, and L3 cache of the CPU, and the speed of the RAM memory, respectively.

The following diagram in Fig. 2 shows the same data as Fig. 1, just with vertical lines marking the cache sizes of the i7-4700MQ CPU:

Fig. 2: Performance for Single Precision (float) and Cache Sizes of the i7-4700MQ CPU
The stepwise performance decreases line up nicely with the cache sizes of the i7-4700MQ processor.

3. Optimizations

3.1. Fixed Array Sizes and Loop Condition

TL;DR: Fixed array sizes and other variations of the loop limit have only a small effect.

Java doesn't support fixed array sizes at the language level. Therefore, when speaking about fixed array sizes in Java, I simply refer to a compile-time fixed limit in the for loop, as shown in Snippet 3:

public class Benchmark {

    public long profile(float[] a) {
        long t = System.nanoTime();

        // start of code to profile
	
        for (int i = 0; i < 128; i++) {

// ...
Snippet 3: for loop for fixed array size, known at compile time

In theory, when using fixed loop limits, the compiler knows more accurately at compile time how the data will be processed. With this information, it can produce better optimizations. This is true for a JIT compiler like Java's, as well as for AOT compilers in other programming languages. E.g. loop unrolling is an optimization which can be taken further when the loop limit is fixed, be it by a language-level fixed array size or just by a fixed loop limit as in Java.
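
For illustration, a manually unrolled variant of the loop from Snippet 3 could look like the following sketch. Because the fixed limit of 128 is divisible by 4, four elements can be processed per iteration without a remainder loop; the JIT compiler performs this kind of unrolling internally at the machine code level, so this Java version is only meant to visualize the idea:

        for (int i = 0; i < 128; i += 4) {
            a[i]     = a[i]     * a[i];
            a[i + 1] = a[i + 1] * a[i + 1];
            a[i + 2] = a[i + 2] * a[i + 2];
            a[i + 3] = a[i + 3] * a[i + 3];
        }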

Other optimizations that compilers can do, like omitting array bounds checks, are possible only if the programming language supports fixed array sizes, which Java currently doesn't. For a fixed loop limit in Java, neither the source-to-bytecode AOT compiler nor the JIT compiler can always ascertain that the loop limit matches the array size, because the code could unintentionally be invoked with an array which is shorter than the loop limit. To be prepared for this case, the compiler has to produce code for bounds checking, which throws an ArrayIndexOutOfBoundsException at runtime when needed. This is the case even when the method is, in practice, never invoked with an array that doesn't fit the loop limit.
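
A minimal illustration of this case, using the fixed-limit profile(...) from Snippet 3 with a hypothetical call site:

        float[] tooShort = new float[64];   // shorter than the fixed loop limit of 128
        new Benchmark().profile(tooShort);  // bounds check fires: ArrayIndexOutOfBoundsException at i == 64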

In major Java implementations like the OpenJDK, the JIT compiler does not always use all the available information for its optimizations, which can sometimes result in code for dynamic array sizes exhibiting better performance than the code for the same fixed array size.
On the other hand, with the dynamic array size and i < a.length in the loop condition, the occurrence of an ArrayIndexOutOfBoundsException can be ruled out completely.
When using an additional integer parameter for specifying the loop limit, like in Snippet 4, the code often runs slightly slower than with either the array length as the loop limit or a fixed loop limit. This makes sense because an additional int parameter for the loop limit allows neither the optimization that a fixed loop limit allows, nor the ruling out of ArrayIndexOutOfBoundsExceptions.

public class Benchmark {

    public long profile(float[] a, int int_len) {
        long t = System.nanoTime();

        // start of code to profile
	
        for (int i = 0; i < int_len; i++) {

// ...
Snippet 4: separate int as array size

Fig. 3 compares the performance of various ways of doing floating point operations on arrays in Java. There are 6 graphs in the diagram, 3 for single precision floating point values (Java floats) and 3 for double precision floating point values (Java doubles). For each of these two data types, 3 kinds of loop limits were profiled, namely a compile-time fixed limit (i<fixed), the array size (i<a.length), and an additional integer parameter (i<int_len).

Fig. 3: Performance for Single Precision (float) and Double Precision (double)

These results highly depend on the Java implementation used. The way optimizations are done can change with any version change, because they are not relevant for the outcome of the calculations, just for the performance. So with the next release or with another Java implementation, Fig. 3 is likely to look different, even though the general properties will remain, like decreasing performance for array sizes which exceed cache limits. It would be a mistake, though, to read too much into small details of the graphs. E.g. it is wrong to conclude that one should use the int_len loop limit for processing float[16] arrays just because it's faster for this particular length and this particular version of the Java implementation. The fact that the int_len loop limit has an advantage here is likely to be different with another version, another implementation, or the same Java version on another CPU.

In general we can only conclude that changing the type of loop limit has only a small effect. These measurements mainly showed that using int_len loop limits almost always decreases the performance, or doesn't change it, and thus it is a bad idea. In contrast, using fixed loop limits usually increases the performance.

3.2. Splitting Arrays to Make Them Shorter

TL;DR - Getting shorter array sizes simply by splitting arrays is a fallacy. Don't do this.

In the graph we see that using shorter arrays, particularly arrays of sizes between 3 kB and 32 kB, boosts performance. But it's a fallacy to believe that using only arrays of this size is advantageous. This is only the case for this particular benchmark, which uses just one array to simulate an algorithm that uses this amount of data. Another algorithm which uses the same total amount of data spread over several different arrays will face the same problem. It doesn't matter whether this amount of data is stored in one array, in multiple arrays, or in other chunks of data. The CPU doesn't know about array sizes. It only knows memory addresses and the contents of its caches. Whether 64 kB of data is kept in one 64 kB array or in two separate 32 kB arrays, a cache of size 32 kB will be exhausted when the complete data is used in an algorithm. The cache is per CPU or per CPU core, not per array.
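
A small sketch of the fallacy, with purely illustrative sizes: whether one pass touches a single 64 kB array or two 32 kB halves, the working set is 64 kB either way, so a 32 kB L1 cache is exhausted in both variants:

        float[] one = new float[16 * 1024];     // 64 kB in one array
        float[] partA = new float[8 * 1024];    // 32 kB ...
        float[] partB = new float[8 * 1024];    // ... plus 32 kB, same total

        // variant 1: one pass over the single array
        for (int i = 0; i < one.length; i++) {
            one[i] = one[i] * one[i];
        }

        // variant 2: one pass over each half - same 64 kB working set, same cache misses
        for (int i = 0; i < partA.length; i++) {
            partA[i] = partA[i] * partA[i];
        }
        for (int i = 0; i < partB.length; i++) {
            partB[i] = partB[i] * partB[i];
        }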

3.3. Avoid Cache Size Exhaustion

The only way to mitigate the memory wall problem is to avoid exhausting the cache size frequently. When the cache size is exhausted during a calculation, a value which gets loaded is not contained in the cache. This is called a cache miss. In this case, the value needs to be loaded from the next higher cache level, or from RAM, which slows down execution. Minimizing the number of cache misses during the whole execution of an algorithm is the key strategy for mitigating the memory wall problem.
This is not an easy task. The basic idea for minimizing the number of cache misses is to restructure an algorithm so that it does as much processing as possible on a small chunk of data before moving on to the next small chunk. How difficult that is depends on the algorithm. Some algorithms are easy to transform into a version which works chunkwise, and for other algorithms it's impossible. Others seem impossible to transform in this way at first glance, but can be adjusted with a trick which is difficult to see.
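
As a rough sketch of the idea, assume two operations have to be applied to a large array. Instead of making two full passes over the whole array, both operations can be applied chunk by chunk, so that each chunk stays in the L1 cache between the two steps. The chunk size of 4,096 floats (16 kB) is only an illustrative choice for a 32 kB L1 cache, and the two operations are placeholders:

    static final int CHUNK = 4 * 1024; // 4,096 floats = 16 kB, fits in a 32 kB L1 cache

    public static void process(float[] a) {
        for (int start = 0; start < a.length; start += CHUNK) {
            int end = Math.min(start + CHUNK, a.length);

            for (int i = start; i < end; i++) {  // step 1 on this chunk
                a[i] = a[i] * a[i];
            }
            for (int i = start; i < end; i++) {  // step 2 on the same, still cached chunk
                a[i] = a[i] + 1.0f;
            }
        }
    }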

4. Appendix

4.1. Cache-Friendliness

Cacheline alignment is another crucial and closely related kind of optimization which also needs to be addressed when making algorithms cache-friendly. A cache can hold data from different locations in RAM, e.g. a part from the beginning of the RAM and a part from the end of it, or from many different locations. The smallest contiguous RAM area a cache can hold is called a cacheline. Typical cacheline sizes are 32 bytes or 64 bytes, but they can be any number of bytes; this depends on the CPU, just like any other property of the cache. Aligning data accesses to cachelines can result in performance advantages, but that's beyond the scope of this article.

4.2. Benchmarks Are Tricky

The results of benchmarks are often inconsistent because it's hard to establish the same environmental conditions every time you run the benchmark. E.g. the clock speed of CPUs is usually adjusted by the operating system, which can take the CPU temperature into account when determining the optimal clock speed. This ultimately makes the performance dependent on the room temperature around the test machine. On Linux, you can use the cpupower command to keep the CPU speed at a fixed level. Use lscpu to check that it worked, as in Snippet 5:

me@hostname:~$ sudo cpupower frequency-set --min 2.4GHz --max 2.4GHz
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
Setting cpu: 3
Setting cpu: 4
Setting cpu: 5
Setting cpu: 6
Setting cpu: 7
me@hostname:~$ lscpu | grep Hz
Model name:          Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz
CPU MHz:             2394.617
CPU max MHz:         3400,0000
CPU min MHz:         800,0000
me@hostname:~$
Snippet 5: Setting and Checking CPU Frequency

5. References

[1] William A. Wulf and Sally A. McKee, "Hitting the Memory Wall: Implications of the Obvious", SIGARCH Comput. Archit. News, vol. 23, no. 1, pp. 20-24, Mar 1995, http://doi.acm.org/10.1145/216585.216588