Deadline: End of lab Thursday, August 1st
Pull the Lab09 files from the lab starter repository with
git pull staff master
NOTE THAT ALL CODE USING SSE INSTRUCTIONS IS GUARANTEED TO WORK ON THE HIVE MACHINES AND IT MAY NOT WORK ELSEWHERE
Many newer processors support SSE intrinsics, so it is certainly possible that your machine will be sufficient, but you may not see accurate speedups. Ideally, you should ssh
into one of the hive machines to run this lab. Additionally, many of the performance characteristics asked about later on this lab are more likely to show up on the Hive.
Given the large number of available SIMD intrinsics we want you to learn how to find the ones that you’ll need in your application.
For this mini-exercise, please take a look at the Intel Intrinsics Guide. Open this page and once there, click the checkboxes for everything that begins with “SSE”.
Look through the possible instructions and syntax structures, then try to find the 128-bit intrinsics for the following operations:
Hint: Things that say “epi” or “pi” deal with integers, and those that say “ps” or “pd” deal with s ingle p recision and d ouble p recision floats.
Additionally, you can visualize how the vectors and the different functions work together by inputting your code into the code environment at this link!
For this part of the lab, we will be working in the simd
directory
The following are bugs that the staff have noticed were preventing students from passing the tests (bold text is what you should not do):
long long int
array. Use an int array. Side note: why?? The return value of this function is indeed a long long int
, but that’s because an int
isn’t big enough to hold the sum of all the values across all iterations of the outer loop. However, it is big enough to hold the sum of all the value across a single iteration of the outer loop. This means you’ll want to store your sum vector into an int
array after every iteration of the outer loop and add the total sum to the final result result
.storeu
before adding stuff is okay though.We’ve got one header file common.h
that has some code to sum the elements of a really big array. It’s a minor detail that it randomly does this 1 << 16
times… but you don’t need to worry about that. We also pincer the execution of the code between two timestamps (that’s what the clock()
function does) to measure how fast it runs! The file simd.c
is the one which will have a main
function to run the sum
functions.
We ask you to vectorize/SIMDize the following code in common.h
to speed up the naive implementation of sum()
.
You only need to vectorize the inner loop with SIMD! You will also need to use the following intrinsics:
__m128i _mm_setzero_si128()
- returns a 128-bit zero vector__m128i _mm_loadu_si128(__m128i *p)
- returns 128-bit vector stored at pointer p__m128i _mm_add_epi32(__m128i a, __m128i b)
- returns vector (a_0 + b_0, a_1 + b_1, a_2 + b_2, a_3 + b_3)void _mm_storeu_si128(__m128i *p, __m128i a)
- stores 128-bit vector a into pointer p__m128i _mm_cmpgt_epi32(__m128i a, __m128i b)
- returns the vector (a_i > b_i ? 0xffffffff : 0x0
for i
from 0 to 3). AKA a 32-bit all-1s mask if a_i > b_i and a 32-bit all-0s mask otherwise__m128i _mm_and_si128(__m128i a, __m128i b)
- returns vector (a_0 & b_0, a_1 & b_1, a_2 & b_2, a_3 & b_3), where & represents the bit-wise and operatorStart with the code in sum()
and use SSE intrinsics to implement the sum_simd()
function.
How do I do this?
Recall that the SSE intrinsics are basically functions which perform operations on multiple pieces of data in a vector in parallel. This turns out to be faster than running through a for loop and applying the operation once for each element in the vector.
In our sum function, we’ve got a basic structure of iterating through an array. On every iteration, we add an array element to a running sum. To vectorize, you should add a few array elements to a sum vector in parallel and then consolidate the individual values of the sum vector into our desired sum at the end.
__m128i
is the data type for Intel’s special 128-bit vector. We’ll be using them to encode 4 (four) 32-bit ints._127
which contains four copies of the number 127. You should use this to compare with some stuff when you implement the condition within the sum loop._mm_storeu_si128
) until after completing the inner loop! It turns out that storing is very costly and performing a store in every iteration will actually cause your code to slow down. However, if you wait until after the outer loop completes you may have overflow issues.__m128i
vector like they are arrays. You should store them into arrays first with the storeu
function, and then access the integers elementwise by indexing into the array.__m128i*
type arguments. You can just cast an int array to a __m128i
pointer. Alternatively, you could skip the typecast at the cost of a bunch of compiler warnings.To compile and run your code, run the following commands:
make simd
./simd
Sanity check: The naive version runs at about 25 seconds on the hive machines, and your SIMDized version should run in about 5-6 seconds. The naive function may take a long time to run! Feel free to comment it out while you are implementing sum_simd()
, but make sure to uncomment it out later, since we rely on that result for comparing against a reference; sometimes code can take a long time to run and this is one of those cases.
sum
to sum_simd
Concept Time! Another tactic used to increase performance is to unroll our for
loops! By performing more operations per iteration of the for loop, we have to loop fewer times and not have to waste as many cycles (think about why we would have to waste some cycles?). Fewer loops also means fewer branch instructions to execute (expensive both computationally and time-wise) when translated to machine code. Theoretically, code would be faster if we didn’t create loops and just copy pasted the loop n
times, but that’s not a very pretty function.
For example, consider this very simple example that adds together the first n elements of an array arr
:
int total = 0;
for (int i = 0; i < n; i++) {
total += arr[i];
}
The corresponding assembly code might look something like this:
add t0, x0, x0
add t1, x0, x0 // Initialize loop counter
loop: beq t0, a1, end // Assume register a1 contains the size n of the array
slli t2, t1, 2
add t2, t1, a0 // Assume register a0 contains a pointer to the beginning of the array
lw t3, 0(t2) // Load arr[i] into t3
add t0, t3, t0 // total += arr[i]
addi t1, t1, 1 // Increment the loop counter
jal x0, loop
end: ...
If we unroll the loop 4 times, this would be our equivalent code, with a tail case for the situations where n is not a multiple of 4:
int total = 0;
for (int i = 0; i < n / 4 * 4; i+=4) {
total += arr[i];
total += arr[i + 1];
total += arr[i + 2];
total += arr[i + 3];
}
for (i = n / 4 * 4; i < n; i++) {
total += arr[i];
}
For the unrolled code, the corresponding assembly code might look something like this:
add t0, x0, x0
add t1, a1, x0 // Assume register a1 contains the size n of the array
srli t1, t1, 2
slli t1, t1, 2 // Find largest of multiple 4 <= n
add t2, x0, x0 // Initialize loop counter
loop: beq t2, t1, tail
slli t3, t2, 2
add t3, t3, a0 // Assume register a0 contains a pointer to the beginning of the array
lw t4, 0(t3) // Load arr[i] into t3
add t0, t4, t0 // total += arr[i]
lw t4, 4(t3) // Load arr[i + 1] into t3
add t0, t4, t0
lw t4, 8(t3), t0 // Load arr[i + 2] into t3
add t0, t4, t0
lw t4, 12(t3), // Load arr[i + 3] into t3
add t0, t4, t0
addi t2, t2, 4 // Increment the loop counter
jal x0, loop
tail: beq t2, a1, end
slli t3, t2, 2
lw t4, 0(t3)
add t0, t4, t0
addi t2, t2, 1
end: ...
To obtain even more performance improvement, carefully unroll the SIMD vector sum code that you created in the previous exercise to create sum_simd_unrolled()
. This should get you a little more increase in performance from sum_simd
(a few fractions of a second). As an example of loop unrolling, consider the supplied function sum_unrolled()
Within common.h
, copy your sum_simd()
code into sum_simd_unrolled()
and unroll it 4 (four) times. Don’t forget about your tail case!
To compile and run your code, run the following commands:
make simd
./simd
sum_simd()
to sum_simd_unrolled()
.How much of this work could the compiler have just done for you?
For our prior tests, we were running with the compiler flag -O0
(which is to
say no optimizations at all). We can compile the same code you’ve already
written to multiple optimization levels and compare the results.
To compile multiple optimization levels run:
make opt
./opt0_simd
./opt1_simd
./opt2_simd
./opt3_simd
Where optN_simd
corresponds to level N optimizations. An explanation of the
different GCC optimization levels and what they do can be found in the
GCC documentation.
Note: These optimizations work best on simple code, like demonstrated in
this lab. When we start Project 4 (which is already compiled with -O3
, you’ll
find that for complex code, a human who understands the goals of the program
may be able to perform significantly better optimizations than GCC can alone.
Run each of the optimization levels and compare their run times to answer the following questions:
At which optimization level does the compiler use SIMD intrinsic instructions?
Whose code was faster: the default C code with maximum compiler optimizations or your SIMD code without optimizations?
Whose code was faster with maximum optimizations: your SIMD code or the default C code? How much faster? Would it be worthwhile to always write your code using SIMD intrinsics?
Exercise 1:
Exercise 2:
sum
to sum_simd
.Exercise 3:
sum_simd()
to sum_simd_unrolled()
.Exercise 4: