Lab 7 — Coding with SIMD
Lab goals:
- Understand what SSE (128-bit) and AVX (256-bit) SIMD instructions are and how they operate on vectors of floats.
- Learn when and why SIMD helps: data-parallel loops, contiguous memory access, and the impact of vector width.
- Practice using SIMD via C intrinsics and verify correctness against scalar baselines.
- Observe performance trade-offs (add vs. multiply vs. divide) across input sizes.
Pre-lab
Before lab, skim the following so you’re ready to code with intrinsics:
- Registers and widths: SSE uses __m128 (xmm), AVX uses __m256 (ymm).
- Core intrinsics you’ll use: _mm_loadu_ps, _mm_storeu_ps, _mm_add_ps, _mm_mul_ps, _mm_div_ps, and their AVX forms (_mm256_*).
- Compilation: gcc -O0 -msse -mavx -Wall -Wextra -Werror (provided in the Makefile).
- Element-wise vs. reductions: every SIMD op here is element-wise (no dot products/reductions).
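If intrinsics are new to you, the short sketch below shows the listed operations in isolation. It is only an illustration (the array and variable names are made up here, not taken from the starter code), but it should compile with the flags above:

    #include <immintrin.h>   // SSE/AVX intrinsic types and functions
    #include <stdio.h>

    int main(void) {
        float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
        float r4[4], r8[8];

        // SSE: a 128-bit __m128 holds 4 floats (one xmm register).
        __m128 x4 = _mm_loadu_ps(a);               // unaligned load of a[0..3]
        __m128 y4 = _mm_loadu_ps(b);               // unaligned load of b[0..3]
        _mm_storeu_ps(r4, _mm_add_ps(x4, y4));     // r4[i] = a[i] + b[i]

        // AVX: a 256-bit __m256 holds 8 floats (one ymm register).
        // _mm_mul_ps and _mm_div_ps have _mm256_mul_ps / _mm256_div_ps counterparts.
        __m256 x8 = _mm256_loadu_ps(a);            // unaligned load of a[0..7]
        __m256 y8 = _mm256_loadu_ps(b);            // unaligned load of b[0..7]
        _mm256_storeu_ps(r8, _mm256_mul_ps(x8, y8)); // r8[i] = a[i] * b[i]

        printf("%f %f\n", r4[0], r8[0]);           // prints 9.0 and 8.0
        return 0;
    }

The unaligned load/store forms (loadu/storeu) work on any float pointer, which is why they are the ones used throughout this lab.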
GitHub Repository for This Lab
To obtain your private repo for this lab, please point your browser to the starter code for the lab at:
https://classroom.github.com/a/07hNXtVv
Follow the same steps as for previous labs and assignments to create your repository on GitHub and to then clone it onto CLEAR. The directory for your repository for this lab will be coding-with-simd-name, where name is your GitHub userid.
Files provided: add.c, multiply.c, divide.c (each has FILL THIS lines), plot_figure.py, Makefile, three CSVs: add_results.csv, multiply_results.csv, divide_results.csv, and a reflection file: reflection.txt.
In-lab
Important: The Makefile uses -Wall -Wextra -Werror and will fail to build until you replace all FILL THIS lines in the C files. Fix the code first, then build.
What the provided programs do & what they output
Each program (add, multiply, divide):
- Takes a single command-line argument p (power), validates 0 ≤ p ≤ 25, and sets the problem size to N = 2^p.
- Initializes two float arrays a and b, then computes element-wise results three ways:
  - Scalar loop
  - SSE (4-wide) loop with a scalar tail
  - AVX (8-wide) loop with a scalar tail
- Times each version and prints human-readable timings and a single CSV line you must copy into the appropriate results file.
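For orientation, turning the power argument into a problem size is just a left shift. The sketch below shows one way the argument handling could look; it is an assumption about the structure, not a copy of the starter code:

    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <power>\n", argv[0]);
            return 1;
        }
        int p = atoi(argv[1]);
        if (p < 0 || p > 25) {              // validate 0 <= p <= 25
            fprintf(stderr, "power must be in [0, 25]\n");
            return 1;
        }
        long n = 1L << p;                   // problem size N = 2^p
        printf("N = %ld\n", n);
        return 0;
    }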
CSV header (keep exactly once at the top of each CSV file):
Power,Scalar Time,SSE Time,AVX Time
CSV output line format (copy only the numeric line):
12,0.0000504990,0.0000317730,0.0000257870
Why you’re doing this: you will run each program across a range of sizes (powers 0–25) to identify performance trends and understand when, why, and how SSE/AVX are useful.
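The human-readable timings and the CSV line come from ordinary wall-clock measurements taken around each loop. The sketch below shows one plausible way to take such a measurement; the starter code may use a different clock or helper (the now_seconds name here is made up):

    #include <stdio.h>
    #include <time.h>

    // Hypothetical helper: current time in seconds, as a double.
    static double now_seconds(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
    }

    int main(void) {
        double t0 = now_seconds();
        volatile double s = 0.0;
        for (int i = 0; i < 1000000; i++) s += i;   // stand-in for one of the loops
        double elapsed = now_seconds() - t0;
        printf("elapsed: %.10f s\n", elapsed);
        return 0;
    }

In the lab programs the same pattern would wrap each of the three loops, and the three elapsed times plus the power p are what end up on the CSV line.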
Run plan (do these in order)
- ADD
Go through add.c and understand the code. Look at the SSE code lines:

    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   // load 4 floats from a[i..i+3]
        __m128 vb = _mm_loadu_ps(b + i);   // load 4 floats from b[i..i+3]
        __m128 vs = _mm_add_ps(va, vb);    // add them element-wise
        _mm_storeu_ps(r + i, vs);          // store the 4 results into r[i..i+3]
    }
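For comparison, the AVX path follows the same pattern but processes 8 floats per iteration with the _mm256_* forms, and a scalar loop then finishes any leftover elements. A sketch of what that looks like (illustrative; the starter code's exact names may differ):

    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   // load 8 floats from a[i..i+7]
        __m256 vb = _mm256_loadu_ps(b + i);   // load 8 floats from b[i..i+7]
        __m256 vs = _mm256_add_ps(va, vb);    // add them element-wise
        _mm256_storeu_ps(r + i, vs);          // store the 8 results into r[i..i+7]
    }
    for (; i < n; i++) {                      // scalar tail: at most 7 leftover elements
        r[i] = a[i] + b[i];
    }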
Update all the FILL THIS lines. Then run:

    make add

Run the program for every power p = 0…25:

    ./add 0
    ./add 1
    ...
    ./add 25

Each run should print the timing results and checksum outputs for all three implementations. An example for ./add 12 is provided below:

    Vector add with SIZE = 2^12 = 4096 elements
    Scalar add: 0.0000504990 s
    SSE add: 0.0000317730 s
    AVX add: 0.0000257870 s
    CSV Line: 12,0.0000504990,0.0000317730,0.0000257870
    Checksums (sum over results):
    scalar: 6289920.0000000000
    sse   : 6289920.0000000000
    avx   : 6289920.0000000000
The checksum output helps you verify that all three ways of computing the operation produce the same result.
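A checksum like this is typically just the sum over the result array, accumulated in a double so the three versions can be compared directly; a minimal sketch (the starter code's actual helper may differ):

    // Sum all n elements of r; identical results give identical checksums.
    static double checksum(const float *r, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++) {
            s += r[i];
        }
        return s;
    }

If a FILL THIS line is wrong (for example, a bad index or the wrong intrinsic), the sse or avx checksum will usually differ from the scalar one.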
Performance variability note: shared servers can have noise (other users’ jobs, frequency scaling, cache effects). If feasible, run each power multiple times and record the minimum. If you paste multiple entries for the same power, ensure the last one in the file is the minimum (the plotting script uses the latest entry per power).
Generate the figure:

    make add_plot    # produces add_figure.svg
- MULTIPLY
Go through multiply.c, understand the code, and update all the FILL THIS lines. Then run:

    make multiply

Run the program for every power p = 0…25 and append the numeric CSV line to multiply_results.csv. Use the same “minimum of multiple runs” guideline as above, ensuring the last line per power is your minimum. Generate the figure:

    make multiply_plot    # produces multiply_figure.svg
- DIVIDE
Go through divide.c, understand the code, and update all the FILL THIS lines. Then run:

    make divide

Run the program for every power p = 0…25 and append the numeric CSV line to divide_results.csv. The same variability/minimum guidance applies. Generate the figure:

    make divide_plot    # produces divide_figure.svg
Reflection
Open reflection.txt (already provided) and write your responses directly next to each prompt (copied below), after Ans:.
- How does the performance of SIMD instructions compare for add, multiply, and divide?
- What sizes of arrays give the largest performance boost with SIMD instructions?
- Why do you think you see the trends that you see in the figures?
- When is it useful to use SSE and AVX instructions?
- What is your main takeaway?
Post-lab
What to submit
- add.c, multiply.c, divide.c with all FILL THIS lines completed and compiling cleanly.
- add_results.csv, multiply_results.csv, divide_results.csv with your appended numeric lines.
- Your reflection answers in reflection.txt.
Due: Push code, CSVs, figures, and your reflection by 11:55 PM on Sunday (10/12).