# Lab 09

Deadline: End of lab Thursday, August 1st

## Objectives:

• The student will learn about and use various SIMD functions to perform data level parallelism
• The student will write code to SIMD-ize certain functions
• The student will learn about loop-unrolling and why it works
• The student will learn to use and interpret the SIMD visualizer, linked here

## Setup

Pull the Lab09 files from the lab starter repository with

git pull staff master


## Disclaimer

NOTE THAT ALL CODE USING SSE INSTRUCTIONS IS GUARANTEED TO WORK ON THE HIVE MACHINES AND IT MAY NOT WORK ELSEWHERE

Many newer processors support SSE intrinsics, so it is certainly possible that your machine will be sufficient, but you may not see accurate speedups. Ideally, you should ssh into one of the hive machines to run this lab. Additionally, many of the performance characteristics asked about later on this lab are more likely to show up on the Hive.

## Exercise 1 - Familiarize Yourself with the SIMD Functions

Given the large number of available SIMD intrinsics we want you to learn how to find the ones that you’ll need in your application.

For this mini-exercise, please take a look at the Intel Intrinsics Guide. Open this page and once there, click the checkboxes for everything that begins with “SSE”.

Look through the possible instructions and syntax structures, then try to find the 128-bit intrinsics for the following operations:

• Four floating point divisions in single precision (i.e. float)
• Sixteen max operations over signed 8-bit integers (i.e. char)
• Arithmetic shift right of eight signed 16-bit integers (i.e. short)

Hint: Things that say “epi” or “pi” deal with integers, and those that say “ps” or “pd” deal with s ingle p recision and d ouble p recision floats.

Additionally, you can visualize how the vectors and the different functions work together by inputting your code into the code environment at this link!

## Exercise 2 - Writing SIMD Code

For this part of the lab, we will be working in the simd directory

### Common Little Mistakes

The following are bugs that the staff have noticed were preventing students from passing the tests (bold text is what you should not do):

• Trying to store your sum vector into a long long int array. Use an int array. Side note: why?? The return value of this function is indeed a long long int, but that’s because an int isn’t big enough to hold the sum of all the values across all iterations of the outer loop. However, it is big enough to hold the sum of all the value across a single iteration of the outer loop. This means you’ll want to store your sum vector into an int array after every iteration of the outer loop and add the total sum to the final result result.
• Re-initializing your sum vector. Make sure when you add to your running sum vector, you aren’t declaring a new sum vector!!
• Forgetting the CONDITIONAL in the tail case!. If the list or array’s length is not evenly divisible by the step size, make sure to take care of the remaining values in the tail case.
• Adding to an UNINITIALIZED array! If you add stuff to your result array without initializing it, you are adding stuff to garbage, which makes the array still garbage! Using storeu before adding stuff is okay though.

We’ve got one header file common.h that has some code to sum the elements of a really big array. It’s a minor detail that it randomly does this 1 << 16 times… but you don’t need to worry about that. We also pincer the execution of the code between two timestamps (that’s what the clock() function does) to measure how fast it runs! The file simd.c is the one which will have a main function to run the sum functions.

We ask you to vectorize/SIMDize the following code in common.h to speed up the naive implementation of sum().

You only need to vectorize the inner loop with SIMD! You will also need to use the following intrinsics:

• __m128i _mm_setzero_si128() - returns a 128-bit zero vector
• __m128i _mm_loadu_si128(__m128i *p) - returns 128-bit vector stored at pointer p
• __m128i _mm_add_epi32(__m128i a, __m128i b) - returns vector (a_0 + b_0, a_1 + b_1, a_2 + b_2, a_3 + b_3)
• void _mm_storeu_si128(__m128i *p, __m128i a) - stores 128-bit vector a into pointer p
• __m128i _mm_cmpgt_epi32(__m128i a, __m128i b) - returns the vector (a_i > b_i ? 0xffffffff : 0x0 for i from 0 to 3). AKA a 32-bit all-1s mask if a_i > b_i and a 32-bit all-0s mask otherwise
• __m128i _mm_and_si128(__m128i a, __m128i b) - returns vector (a_0 & b_0, a_1 & b_1, a_2 & b_2, a_3 & b_3), where & represents the bit-wise and operator

Start with the code in sum() and use SSE intrinsics to implement the sum_simd() function.

How do I do this?

Recall that the SSE intrinsics are basically functions which perform operations on multiple pieces of data in a vector in parallel. This turns out to be faster than running through a for loop and applying the operation once for each element in the vector.

In our sum function, we’ve got a basic structure of iterating through an array. On every iteration, we add an array element to a running sum. To vectorize, you should add a few array elements to a sum vector in parallel and then consolidate the individual values of the sum vector into our desired sum at the end.

• Hint 1: __m128i is the data type for Intel’s special 128-bit vector. We’ll be using them to encode 4 (four) 32-bit ints.
• Hint 2: We’ve left you a vector called _127 which contains four copies of the number 127. You should use this to compare with some stuff when you implement the condition within the sum loop.
• Hint 3: DON’T use the store function (_mm_storeu_si128) until after completing the inner loop! It turns out that storing is very costly and performing a store in every iteration will actually cause your code to slow down. However, if you wait until after the outer loop completes you may have overflow issues.
• Hint 4: It’s bad practice to index into the __m128i vector like they are arrays. You should store them into arrays first with the storeu function, and then access the integers elementwise by indexing into the array.
• Hint 5: READ the function declarations in the above table carefully! You’ll notice that the loadu and storeu take __m128i* type arguments. You can just cast an int array to a __m128i pointer. Alternatively, you could skip the typecast at the cost of a bunch of compiler warnings.

To compile and run your code, run the following commands:

make simd
./simd


Sanity check: The naive version runs at about 25 seconds on the hive machines, and your SIMDized version should run in about 5-6 seconds. The naive function may take a long time to run! Feel free to comment it out while you are implementing sum_simd(), but make sure to uncomment it out later, since we rely on that result for comparing against a reference; sometimes code can take a long time to run and this is one of those cases.

### Checkpoint

• Show your TA your working code and performance improvement from sum to sum_simd

## Exercise 3 - Loop Unrolling

Concept Time! Another tactic used to increase performance is to unroll our for loops! By performing more operations per iteration of the for loop, we have to loop fewer times and not have to waste as many cycles (think about why we would have to waste some cycles?). Fewer loops also means fewer branch instructions to execute (expensive both computationally and time-wise) when translated to machine code. Theoretically, code would be faster if we didn’t create loops and just copy pasted the loop n times, but that’s not a very pretty function.

For example, consider this very simple example that adds together the first n elements of an array arr:

int total = 0;
for (int i = 0; i < n; i++) {
total += arr[i];
}


The corresponding assembly code might look something like this:

		add t0, x0, x0
add t1, x0, x0 // Initialize loop counter
loop: 	beq t0, a1, end // Assume register a1 contains the size n of the array
slli t2, t1, 2
add t2, t1, a0 // Assume register a0 contains a pointer to the beginning of the array
lw t3, 0(t2) // Load arr[i] into t3
add t0, t3, t0 // total += arr[i]
addi t1, t1, 1 // Increment the loop counter
jal x0, loop
end: 	...


If we unroll the loop 4 times, this would be our equivalent code, with a tail case for the situations where n is not a multiple of 4:

int total = 0;
for (int i = 0; i < n / 4 * 4; i+=4) {
total += arr[i];
total += arr[i + 1];
total += arr[i + 2];
total += arr[i + 3];
}

for (i = n / 4 * 4; i < n; i++) {
total += arr[i];
}


For the unrolled code, the corresponding assembly code might look something like this:

      add t0, x0, x0
add t1, a1, x0 // Assume register a1 contains the size n of the array
srli t1, t1, 2
slli t1, t1, 2 // Find largest of multiple 4 <= n
add t2, x0, x0 // Initialize loop counter
loop: beq t2, t1, tail
slli t3, t2, 2
add t3, t3, a0 // Assume register a0 contains a pointer to the beginning of the array
lw t4, 0(t3) // Load arr[i] into t3
add t0, t4, t0 // total += arr[i]
lw t4, 4(t3) // Load arr[i + 1] into t3
add t0, t4, t0
lw t4, 8(t3), t0 // Load arr[i + 2] into t3
add t0, t4, t0
lw t4, 12(t3), // Load arr[i + 3] into t3
add t0, t4, t0
addi t2, t2, 4 // Increment the loop counter
jal x0, loop
tail: beq t2, a1, end
slli t3, t2, 2
lw t4, 0(t3)
add t0, t4, t0
addi t2, t2, 1
end: ...


To obtain even more performance improvement, carefully unroll the SIMD vector sum code that you created in the previous exercise to create sum_simd_unrolled(). This should get you a little more increase in performance from sum_simd (a few fractions of a second). As an example of loop unrolling, consider the supplied function sum_unrolled()

Within common.h, copy your sum_simd() code into sum_simd_unrolled() and unroll it 4 (four) times. Don’t forget about your tail case!

To compile and run your code, run the following commands:

make simd
./simd


### Checkpoint

• Show your TA your working code and performance improvement from sum_simd() to sum_simd_unrolled().
• Explain why looping less, but doing more per loop still allows us to perform our code faster.

## Exercise 4 - Compiler Optimizations

How much of this work could the compiler have just done for you?

For our prior tests, we were running with the compiler flag -O0 (which is to say no optimizations at all). We can compile the same code you’ve already written to multiple optimization levels and compare the results.

To compile multiple optimization levels run:

make opt
./opt0_simd
./opt1_simd
./opt2_simd
./opt3_simd


Where optN_simd corresponds to level N optimizations. An explanation of the different GCC optimization levels and what they do can be found in the GCC documentation.

Note: These optimizations work best on simple code, like demonstrated in this lab. When we start Project 4 (which is already compiled with -O3, you’ll find that for complex code, a human who understands the goals of the program may be able to perform significantly better optimizations than GCC can alone.

Run each of the optimization levels and compare their run times to answer the following questions:

1. At which optimization level does the compiler use SIMD intrinsic instructions?

2. Whose code was faster: the default C code with maximum compiler optimizations or your SIMD code without optimizations?

3. Whose code was faster with maximum optimizations: your SIMD code or the default C code? How much faster? Would it be worthwhile to always write your code using SIMD intrinsics?

## Checkoffs

Exercise 1:

• Record the 3 (three) intrinsics we specified above in a text file(or just know what they are) to show your TA.

Exercise 2:

• Show your TA your working code and performance improvement from sum to sum_simd.

Exercise 3:

• Show your TA your working code and performance improvement from sum_simd() to sum_simd_unrolled().
• Explain why looping less, but doing more per loop still allows us to perform our code faster.

Exercise 4:

• Explain the answers to the questions to your TA.