Lab 7 — Coding with SIMD
Lab goals:
- Understand what SSE (128-bit) and AVX (256-bit) SIMD instructions are and how they operate on vectors of floats.
- Learn when and why SIMD helps: data-parallel loops, contiguous memory access, and the impact of vector width.
- Practice using SIMD via C intrinsics and verify correctness against scalar baselines.
- Observe performance trade-offs (add vs. multiply vs. divide) across input sizes.
Pre-lab
Before lab, skim the following so you’re ready to code with intrinsics:
- Registers and widths: SSE uses __m128 (xmm), AVX uses __m256 (ymm).
- Core intrinsics you’ll use: _mm_loadu_ps, _mm_storeu_ps, _mm_add_ps, _mm_mul_ps, _mm_div_ps, and their AVX forms (_mm256_*).
- Compilation: gcc -O0 -msse -mavx -Wall -Wextra -Werror (provided in the Makefile).
- Element-wise vs. reductions: every SIMD op here is element-wise (no dot products/reductions).
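If intrinsics are new to you, the short sketch below shows the listed operations in isolation. It is only an illustration (the array and variable names are made up here, not taken from the starter code), but it should compile with the flags above:

    #include <immintrin.h>   // SSE/AVX intrinsic types and functions
    #include <stdio.h>

    int main(void) {
        float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
        float r4[4], r8[8];

        // SSE: a 128-bit __m128 holds 4 floats (one xmm register).
        __m128 x4 = _mm_loadu_ps(a);               // unaligned load of a[0..3]
        __m128 y4 = _mm_loadu_ps(b);               // unaligned load of b[0..3]
        _mm_storeu_ps(r4, _mm_add_ps(x4, y4));     // r4[i] = a[i] + b[i]

        // AVX: a 256-bit __m256 holds 8 floats (one ymm register).
        // _mm_mul_ps and _mm_div_ps have _mm256_mul_ps / _mm256_div_ps counterparts.
        __m256 x8 = _mm256_loadu_ps(a);            // unaligned load of a[0..7]
        __m256 y8 = _mm256_loadu_ps(b);            // unaligned load of b[0..7]
        _mm256_storeu_ps(r8, _mm256_mul_ps(x8, y8)); // r8[i] = a[i] * b[i]

        printf("%f %f\n", r4[0], r8[0]);           // prints 9.0 and 8.0
        return 0;
    }

The unaligned load/store forms (loadu/storeu) work on any float pointer, which is why they are the ones used throughout this lab.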
GitHub Repository for This Lab
To obtain your private repo for this lab, please point your browser to the starter code for the lab at:
https://classroom.github.com/a/07hNXtVv
Follow the same steps as for previous labs and assignments to create your repository on GitHub and to then clone it onto CLEAR. The directory for your repository for this lab will be coding-with-simd-name, where name is your GitHub userid.
Files provided: add.c, multiply.c, divide.c (each has FILL THIS lines), plot_figure.py, Makefile, three CSVs: add_results.csv, multiply_results.csv, divide_results.csv, and a reflection file: reflection.txt.
In-lab
Important: The Makefile uses -Wall -Wextra -Werror and will fail to build until you replace all FILL THIS lines in the C files. Fix the code first, then build.
What the provided programs do & what they output
Each program (add, multiply, divide):
- Takes a single command-line argument p (power), validates 0 ≤ p ≤ 25, and sets the problem size to N = 2^p.
- Initializes two float arrays a and b, then computes element-wise results three ways:
  - Scalar loop
  - SSE (4-wide) loop with a scalar tail
  - AVX (8-wide) loop with a scalar tail
- Times each version and prints human-readable timings and a single CSV line you must copy into the appropriate results file.
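For orientation, turning the power argument into a problem size is just a left shift. The sketch below shows one way the argument handling could look; it is an assumption about the structure, not a copy of the starter code:

    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <power>\n", argv[0]);
            return 1;
        }
        int p = atoi(argv[1]);
        if (p < 0 || p > 25) {              // validate 0 <= p <= 25
            fprintf(stderr, "power must be in [0, 25]\n");
            return 1;
        }
        long n = 1L << p;                   // problem size N = 2^p
        printf("N = %ld\n", n);
        return 0;
    }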
CSV header (keep exactly once at the top of each CSV file):
Power,Scalar Time,SSE Time,AVX Time
CSV output line format (copy only the numeric line):
12,0.0000504990,0.0000317730,0.0000257870
Why you’re doing this: you will run each program across a range of sizes (powers 0–25) to identify performance trends and understand when, why, and how SSE/AVX are useful.
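The human-readable timings and the CSV line come from ordinary wall-clock measurements taken around each loop. The sketch below shows one plausible way to take such a measurement; the starter code may use a different clock or helper (the now_seconds name here is made up):

    #include <stdio.h>
    #include <time.h>

    // Hypothetical helper: current time in seconds, as a double.
    static double now_seconds(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
    }

    int main(void) {
        double t0 = now_seconds();
        volatile double s = 0.0;
        for (int i = 0; i < 1000000; i++) s += i;   // stand-in for one of the loops
        double elapsed = now_seconds() - t0;
        printf("elapsed: %.10f s\n", elapsed);
        return 0;
    }

In the lab programs the same pattern would wrap each of the three loops, and the three elapsed times plus the power p are what end up on the CSV line.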
Run plan (do these in order)
- ADD
Go through add.c and understand the code. Look at the SSE code lines:

    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   // load 4 floats from a[i..i+3]
        __m128 vb = _mm_loadu_ps(b + i);   // load 4 floats from b[i..i+3]
        __m128 vs = _mm_add_ps(va, vb);    // add them element-wise
        _mm_storeu_ps(r + i, vs);          // store the 4 results into r[i..i+3]
    }
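For comparison, the AVX path follows the same pattern but processes 8 floats per iteration with the _mm256_* forms, and a scalar loop then finishes any leftover elements. A sketch of what that looks like (illustrative; the starter code's exact names may differ):

    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   // load 8 floats from a[i..i+7]
        __m256 vb = _mm256_loadu_ps(b + i);   // load 8 floats from b[i..i+7]
        __m256 vs = _mm256_add_ps(va, vb);    // add them element-wise
        _mm256_storeu_ps(r + i, vs);          // store the 8 results into r[i..i+7]
    }
    for (; i < n; i++) {                      // scalar tail: at most 7 leftover elements
        r[i] = a[i] + b[i];
    }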
Update all the FILL THIS lines. Then run:

    make add

Run the program for every power p = 0…25:

    ./add 0
    ./add 1
    ...
    ./add 25

Each run should print the timing results and checksum outputs for all three implementations. An example for ./add 12 is provided below:

    Vector add with SIZE = 2^12 = 4096 elements
    Scalar add: 0.0000504990 s
    SSE add: 0.0000317730 s
    AVX add: 0.0000257870 s
    CSV Line: 12,0.0000504990,0.0000317730,0.0000257870
    Checksums (sum over results):
    scalar: 6289920.0000000000
    sse   : 6289920.0000000000
    avx   : 6289920.0000000000
The checksum output helps you verify that all three ways of computing the operation produce the same result.
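A checksum like this is typically just the sum over the result array, accumulated in a double so the three versions can be compared directly; a minimal sketch (the starter code's actual helper may differ):

    // Sum all n elements of r; identical results give identical checksums.
    static double checksum(const float *r, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++) {
            s += r[i];
        }
        return s;
    }

If a FILL THIS line is wrong (for example, a bad index or the wrong intrinsic), the sse or avx checksum will usually differ from the scalar one.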
Performance variability note: shared servers can have noise (other users’ jobs, frequency scaling, cache effects). If feasible, run each power multiple times and record the minimum. If you paste multiple entries for the same power, ensure the last one in the file is the minimum (the plotting script uses the latest entry per power).
Generate the figure:

    make add_plot    # produces add_figure.svg
- MULTIPLY
Go through multiply.c, understand the code, and update all the FILL THIS lines. Then run:

    make multiply

Run the program for every power p = 0…25 and append the numeric CSV line to multiply_results.csv. Use the same “minimum of multiple runs” guideline as above, ensuring the last line per power is your minimum. Generate the figure:

    make multiply_plot    # produces multiply_figure.svg
- DIVIDE
Go through divide.c, understand the code, and update all the FILL THIS lines. Then run:

    make divide

Run the program for every power p = 0…25 and append the numeric CSV line to divide_results.csv. The same variability/minimum guidance applies. Generate the figure:

    make divide_plot    # produces divide_figure.svg
Reflection
Open reflection.txt (already provided) and write your responses directly next to each prompt (copied below), after Ans:.
- How does the performance of SIMD instructions compare for add, multiply, and divide?
- What sizes of arrays give the largest performance boost with SIMD instructions?
- Why do you think you see the trends that you see in the figures?
- When is it useful to use SSE and AVX instructions?
- What is your main takeaway?
Post-lab
What to submit
- add.c, multiply.c, divide.c with all FILL THIS lines completed and compiling cleanly.
- add_results.csv, multiply_results.csv, divide_results.csv with your appended numeric lines.
- Your reflection answers in reflection.txt.
Due: Push code, CSVs, figures, and your reflection by 11:55 PM on Sunday (10/12).