CS 61C Caches Review, SIMD Fall 2024 Discussion 10

# 1 Cache Review — Code Analysis

Given the follow chunk of code, analyze the hit rate given that we have a byteaddressed computer with a total memory of 1 MiB. It also features a 16 KiB Direct-Mapped cache with 1 KiB blocks. Assume that your cache begins cold.

```
#define NUM_INTS 8192 // 2ˆ13
int A[NUM_INTS]; // A lives at 0x10000
int i, total = 0;
for (i = 0; i < NUM_INTS; i + = 128) {
   A[i] = i; // Line 1
}
for (i = 0; i < NUM_INTS; i + = 128) {
   total += A[i]; // Line 2
}
```
1.1 How many bits make up a memory address on this computer?

1.2 What is the T:I:O breakdown?

1.3 Calculate the cache hit rate for the line marked Line 1:

1.4 Calculate the cache hit rate for the line marked Line 2:

## 2 Cache Review — Cache Performance

Recall that AMAT stands for Average Memory Access Time. The main formula for it is:  $AMAT = Hit Time + Miss Rate * Miss Penalty$ 

In a multi-level cache structure, we can separate miss rates into two types that we consider for each level.

- Global: Calculated as the number of accesses that missed at that level divided by the total number of accesses to the cache system.
- Local: Calculated as the number of accesses that missed at that level divided by the total number of accesses to that cache level.

2.1 In a 2-level cache system, after 100 total accesses to the cache system, we find that the L2\$ (L2 cache) ended up missing 20 times. What is the global miss rate of L2\$?

2.2 Given the system from the previous subpart, if L1\$ had a local miss rate of 50%, what is the local miss rate of L2\$?

Suppose your system consists of:

- 1. An L1\$ that has a hit time of 2 cycles and has a local miss rate of 20%
- 2. An L2\$ that has a hit time of 15 cycles and has a global miss rate of 5%
- 3. Main memory where accesses take 100 cycles

2.3 What is the local miss rate of L2\$?

2.4 What is the AMAT of the system?

2.5 Suppose we want to reduce the AMAT of the system to 8 cycles or lower by adding in a L3\$. If the L3\$ has a local miss rate of 30%, what is the largest hit time that the L3\$ can have?

### 3 DLP Precheck

This section is designed as a conceptual check for you to determine if you conceptually understand and have any misconceptions about this topic. Please answer true/false to the following questions, and include an explanation:

3.1 SIMD is ideal for flow-control heavy tasks (i.e. tasks with many branches/if statements).

3.2 Intel's SIMD intrinsic instructions invoke large registers available on the architecture in order to perform one operation on multiple values at once.

#### 4 Data-Level Parallelism

The idea central to data level parallelism is vectorized calculation: applying operations to multiple items (which are part of a single vector) at the same time.



Some machines with x86 architectures have special, wider registers, that can hold 128, 256, or even 512 bits. Intel intrinsics (Intel proprietary technology) allow us to use these wider registers to harness the power of DLP in C code.

Below is a small selection of the available Intel intrinsic instructions. All of them perform operations using  $128$ -bit registers. The type  $\text{\_m128i}$  is used when these registers hold 4 ints, 8 shorts or 16 chars;  $\text{\_}m128d$  is used for 2 double precision floats, and  $\text{\_}ml28$  is used for 4 single precision floats. Where you see "epiXX", epi stands for extended packed integer, and XX is the number of bits in the integer. "epi32" for example indicates that we are treating the 128-bit register as a pack of 4 32-bit integers.

• \_\_m128i \_mm\_set1\_epi32(**int** i):

Set the four signed 32-bit integers within the vector to i.

 $\bullet$  \_\_m128i \_mm\_loadu\_si128( \_\_m128i \*p):

Load the 4 successive ints pointed to by p into a 128-bit vector.

- \_\_m128i \_mm\_mullo\_epi32(\_\_m128i a, \_\_m128i b): Return vector  $(a_0 \cdot b_0, a_1 \cdot b_1, a_2 \cdot b_2, a_3 \cdot b_3).$
- \_\_m128i \_mm\_add\_epi32(\_\_m128i a, \_\_m128i b): Return vector  $(a_0 + b_0, a_1 + b_1, a_2 + b_2, a_3 + b_3)$
- **void** \_mm\_storeu\_si128( \_\_m128i \*p, \_\_m128i a): Store 128-bit vector a at pointer p.

#### 4 Caches Review, SIMD

- \_\_m128i \_mm\_and\_si128(\_\_m128i a, \_\_m128i b): Perform a bitwise AND of 128 bits in a and b, and return the result.
- $\bullet$   $\_m128i$   $\_mm$  $\_cmpeq$  $\_epi32($  $\_m128i$  a,  $\_m128i$  b): The ith element of the return vector will be set to 0xFFFFFFFF if the ith elements of a and b are equal, otherwise it'll be set to 0.

4.1 SIMD-ize the following function, which returns the product of all of the elements in an array.

```
static int product_naive(int n, int *a) {
    int product = 1;
    for (int i = 0; i < n; i++) {
        product *= a[i];}
    return product;
}
```
Things to think about: When iterating through a loop and grabbing elements  $\lambda$  at a time, how should we update our index for the next iteration? What if our array has a length that isn't a multiple of 4? What can we do to handle this tail case?

```
static int product_vectorized(int n, int *a) {
    int result[4];
    __m128i prod_v = ________________________________________;
    for (int i = 0; i < _____; i += ___) { // Vectorized loop
        prod_v = ________________________________________________________________________;
    }
    __mm_storeu_si128(__________________________, __________________________);
    for (int i = ______________; i < ____________; i++) { // Handle tail case
        result[0] *= ________________________;
    }
    return _____________________________________________________;
}
```