Lab 11 — Cache and Memory Management

Lab goals:


Pre-lab


Modern processors rely on multi-level caches (L1, L2, L3) to bridge the large latency gap between CPU and main memory. Efficient programs exploit spatial locality (accessing adjacent data) and temporal locality (reusing recently used data). In this lab you will run controlled experiments to visualize these effects with real hardware counters.
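
As a quick illustration (a sketch with hypothetical function and array names, not part of the lab code), the two loops below touch memory with good spatial and good temporal locality, respectively:

#include <stddef.h>

/* Spatial locality: consecutive elements of a[] share cache lines, so after
 * the first miss on a line the following accesses hit in L1. */
long sum_sequential(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Temporal locality: the small table t[] is reused on every iteration, so it
 * stays resident in the cache for the duration of the loop. */
long sum_lookups(const unsigned char *keys, size_t n, const int t[256]) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += t[keys[i]];
    return s;
}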

Before lab:


In-lab


Exercise 1 – Changing Array Traversal Order

In this exercise you’ll use matrix_sum.c to observe how access order affects cache performance. The program allocates and fills a 3-D array and computes the total sum multiple times:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 127

void fill_array(float array[N][N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                array[i][j][k] = (float)(i + k) / (float)(k + 1);
}

double sum_array(float array[N][N][N], double sum) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                sum += array[i][j][k]; // Change the order in this line

    return sum;
}

int main() {
    float array[N][N][N];
    
    fill_array(array);

    clock_t start = clock();
    double total = sum_array(array, 0.0);
    total += sum_array(array, total);
    total += sum_array(array, total);
    total += sum_array(array, total);
    total += sum_array(array, total);
    clock_t end = clock();

    printf("Sum: %f\n", total);
    printf("Time: %lf seconds\n", (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}

Currently, the innermost loop iterates over k, which indexes the last (fastest-varying) dimension of the array. Edit the marked line to change the order of the subscripts (or, equivalently, reorder the loop nest so that j or i becomes the innermost loop) and rebuild:

make matrix_sum
./matrix_sum
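
For instance, one possible change (shown below as a sketch, not the required answer) keeps the loop nest exactly as written but reverses the subscripts. Because C stores arrays in row-major order, array[i][j][k] lives at offset ((i*N + j)*N + k) floats from the start of the array, so indexing as [k][j][i] makes consecutive iterations of the innermost loop jump N*N floats apart in memory:

/* Sketch of one alternative ordering, using the same N and array type as
 * matrix_sum.c: the loop nest is unchanged, but the subscripts are reversed,
 * so the innermost loop variable k now selects the first (slowest-varying)
 * dimension and spatial locality is poor. */
double sum_array_reversed(float array[N][N][N], double sum) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                sum += array[k][j][i];

    return sum;
}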

Note the reported time. Try several loop orderings and compare the differences. Record these numbers in cache_perf.txt and answer the reflection prompts provided in the file. Which ordering(s) take the least time to run, and why? Which take the most, and why? Are any orderings similar in performance, and why?


Exercise 2 – Varying Memory Stride

Now you will explore stride-based access patterns using array_stride.c. This program sums elements in a large array while skipping a configurable number of elements per access.

#include <stdio.h>
#include <stdlib.h>

void access_pattern(char *array, size_t size, size_t stride) {
    size_t count = 0;
    
    for (size_t i = 0; i < stride; i++) {
        for (size_t j = i; j < size; j += stride) {
            array[j] += 1;
            count++;
        }
    }
    printf("Accessed %zu elements\n", count);
}

int main() {
    size_t size = 1024 * 1024 * 1024; // 1GB

    /* Update stride to any power of 2 up to 4096,
     * which is the page size on the clear servers.
     * Run `getconf PAGESIZE` to find out. */
    size_t stride = 1;

    char *array = malloc(size);
    if (!array) {
        perror("malloc");
        return 1;
    }

    access_pattern(array, size, stride);

    free(array);

    return 0;
}

Modify the stride value to experiment with 1 (no skipping), 2, 4, 8, 16 … 4096 (the page size on the clear servers). Recompile after each change and run the program under perf stat to measure how performance changes:

make array_stride
perf stat ./array_stride

As stride increases, spatial locality decreases, and you should observe longer runtimes. Reflect on the reasons behind this in stride_perf.txt by looking at the perf stat metrics. What differences do you notice in the total execution time? What differences do you see in the other relevant metrics you are familiar with? Do the values match what you would expect? Why do you observe the values that you observe, and what are the underlying trends?
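
On some machines the default perf stat output does not include cache counters. If that is the case, you can request specific events explicitly; the exact event names vary by hardware, so treat the line below as a sketch and check perf list to see what your machine supports:

perf stat -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses,dTLB-loads,dTLB-load-misses ./array_stride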


Post-lab


Push the following files before the deadline:

Deadline: Commit and push your final version by 11:55 PM on Sunday, 11/16.