1 Precheck

This section is designed as a conceptual check for you to determine if you conceptually understand and have any misconceptions about this topic. Please answer true/false to the following questions, and include an explanation:

1.1 We cannot use a 1KB cache in a 32-bit system because it’s too small and cannot contain all possible addresses.

False. The purpose of the cache is not to hold every possible piece of memory at the same time, but rather to hold some parts of it only, so a 1KB cache is not "too small".

1.2 If a piece of data is both in the cache and in memory, reading it from cache is faster than reading from memory.

True. The cache is smaller and faster than memory.

1.3 Caches see an immediate improvement in memory access time at program execution.

False. A cache starts off ‘cold’, and required loading in values in blocks at first directly from memory, forcing compulsory misses. This can be somewhat alleviated by the use of a hardware prefetcher, that uses the current pattern of misses to predict and prefetch data that may be accessed later on.

1.4 Increasing cache size by adding more blocks always improves (increases) hit rate for all programs.

False. Whether this improves the hit rate for a given program depends on the characteristics of the program. As an example, it is possible for a program that only consists of a loop that runs through an array once to have each access be separated by more than one block (say, the block size is 8B, but we have an integer array and accessing every fourth element, so our access are separated by 16B). This makes every miss a compulsory miss, and there is no way for us to reduce the number of compulsory misses just by adding more blocks to our cache.

1.5 Decreasing block size to increase the number of blocks held by the cache improves the program speed for all programs.

False. This question is similar to the one above, in that the answer to it depends on the program that is running. If we have a program with a for loop that loops through continuous memory (like an array), having a bigger blocks size and fewer blocks might be helpful, as the single blocks will holds more continuous data. For example, lets say cache A has 10 lines and a block size of 8 bytes, while cache B has 20 lines with a a block size of 4 bytes and the array we loop through has 80
characters. Cache A in this case will have 10 cache misses and 70 hits, while Cache B will have 20 misses and 60 hits.

2 Cache Associativity

When working with caches, we have to be able to break down the memory addresses we work with to understand where they fit into our caches. There are three fields:

- **Tag**: Used to distinguish different blocks that use the same index. Number of bits: (\# of bits in memory address) - Index Bits - Offset Bits
- **Index**: The set that this piece of memory will be placed in. Number of bits: \( \log_2(\# \text{ of indices}) \)
- **Offset**: The location of the byte in the block. Number of bits: \( \log_2(\text{size of block}) \)

In order to evaluate cache performance and hit rate, especially with determining how effective our current cache structure is, it is useful to analyze the misses that do occur, and adjust accordingly. Below, we categorize cache misses into three types:

- **Compulsory**: A miss that must occur when you bring in a certain block for the first time, hence "compulsory". Reduce compulsory misses by having longer cache lines (bigger blocks), which bring in the surrounding addresses along with our requested data. Can also pre-fetch blocks beforehand using a hardware prefetcher (a special circuit that tries to guess the next few blocks that you will want).

- **Conflict**: Occurs if the block was fetched before, but evicted while the cache was not full. Increasing the associativity or improving the replacement policy would remove the miss. Note that Conflict misses tend to occur with inefficient cache usage, compared to Capacity misses.

- **Capacity**: Occurs if the block was fetched before, but evicted while the cache was full. Capacity misses are independent of the associativity of your cache. The only way to remove the miss is to increase the cache capacity.

Last discussion, we focused on Direct-Mapped caches, in which blocks map to specifically one slot in our cache. This is good for quick replacement and finding out block, but not good for spatial efficiency!

This is where we bring associativity into the matter. We define associativity as the number of slots a block can potentially map to in our cache. Thus, a Fully-Associative cache has the most associativity, meaning every block can go anywhere in the cache. Our Direct-Mapped cache, on the other hand, has the least, being only 1-way set associative.

For an \( N \)-way associative cache, the following are true:

\[
N \times \text{\# sets} = \text{\# blocks}, \quad \text{Index bits} = \log_2(\text{\# sets})
\]
Here’s some practice involving a 2-way set associative cache. This time we have an 8-bit address space, 8 B blocks, and a cache size of 32 B. Classify each of the following accesses as a cache hit (H), cache miss (M) or cache miss with replacement (R). For any misses, list out which type of miss it is (Compulsory, Conflict, or Capacity). Assume that we have an LRU replacement policy (in general, this is not always the case).

<table>
<thead>
<tr>
<th>Address</th>
<th>T/I/O</th>
<th>Hit, Miss, Replace</th>
</tr>
</thead>
<tbody>
<tr>
<td>0b0000 0100</td>
<td></td>
<td></td>
</tr>
<tr>
<td>0b0000 0101</td>
<td></td>
<td></td>
</tr>
<tr>
<td>0b0110 1000</td>
<td></td>
<td></td>
</tr>
<tr>
<td>0b1100 1000</td>
<td></td>
<td></td>
</tr>
<tr>
<td>0b0110 1000</td>
<td></td>
<td></td>
</tr>
<tr>
<td>0b1101 1101</td>
<td></td>
<td></td>
</tr>
<tr>
<td>0b0100 0101</td>
<td></td>
<td></td>
</tr>
<tr>
<td>0b0000 0100</td>
<td></td>
<td></td>
</tr>
<tr>
<td>0b0011 0000</td>
<td></td>
<td></td>
</tr>
<tr>
<td>0b1100 1011</td>
<td></td>
<td></td>
</tr>
<tr>
<td>0b0100 0010</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Since our cache is 2-way set associative, there are 2 blocks in a set. Given the cache size and the block size, we have 32 / 8 = 4 blocks. Thus, there are 4 / 2 = 2 sets in our cache. We need \( \log_2(2) = 1 \) bit to differentiate the 2 sets, so we have 1 index bit. Our block size of 8 B means we have \( \log_2(8) = 3 \) offset bits, and that the rest of our bits are our tag bits. Therefore, our TIO breakdown means bits 0, 1, and 2 are our offset bits, the only index bit is bit 3, and bits 4-7 being the tag bits.

<table>
<thead>
<tr>
<th>Address</th>
<th>T/I/O</th>
<th>Hit, Miss, Replace</th>
</tr>
</thead>
<tbody>
<tr>
<td>0b0000 0100</td>
<td></td>
<td>Tag 0000, Index 0, Offset 100 - M, Compulsory</td>
</tr>
<tr>
<td>0b0000 0101</td>
<td></td>
<td>Tag 0000, Index 0, Offset 101 - H</td>
</tr>
<tr>
<td>0b0110 1000</td>
<td></td>
<td>Tag 0110, Index 1, Offset 000 - M, Compulsory</td>
</tr>
<tr>
<td>0b1100 1000</td>
<td></td>
<td>Tag 1100, Index 1, Offset 000 - M, Compulsory</td>
</tr>
<tr>
<td>0b0110 1000</td>
<td></td>
<td>Tag 0110, Index 1, Offset 000 - H</td>
</tr>
<tr>
<td>0b1101 1101</td>
<td></td>
<td>Tag 1101, Index 1, Offset 101 - R, Compulsory</td>
</tr>
<tr>
<td>0b0100 0101</td>
<td></td>
<td>Tag 0100, Index 0, Offset 101 - M, Compulsory</td>
</tr>
<tr>
<td>0b0000 0100</td>
<td></td>
<td>Tag 0000, Index 0, Offset 100 - H</td>
</tr>
<tr>
<td>0b0011 0000</td>
<td></td>
<td>Tag 0011, Index 0, Offset 000 - R, Compulsory</td>
</tr>
<tr>
<td>0b1100 1011</td>
<td></td>
<td>Tag 1100, Index 1, Offset 011 - R, Conflict</td>
</tr>
<tr>
<td>0b0100 0010</td>
<td></td>
<td>Tag 0100, Index 0, Offset 010 - R, Capacity</td>
</tr>
</tbody>
</table>

What is the hit rate of our above accesses?

\[
\frac{3 \text{ hits}}{11 \text{ accesses}} \approx 27.3\% \text{ hit rate}
\]

3 AMAT

Recall that AMAT stands for Average Memory Access Time. The main formula for it is:

\[
\text{AMAT} = \text{Hit Time} + \text{Miss Rate} \times \text{Miss Penalty}
\]
Caches

In a multi-level cache structure, we can separate miss rates into two types that we consider for each level.

- **Global:** Calculated as the number of accesses that missed at that level divided by the total number of accesses to the cache system.
- **Local:** Calculated as the number of accesses that missed at that level divided by the total number of accesses to that cache level.

### 3.1
In a 2-level cache system, after 100 total accesses to the cache system, we find that the L2$ (L2$ cache) ended up missing 20 times. What is the global miss rate of L2$?

\[
\frac{20}{100} = 20\%
\]

### 3.2
Given the system from the previous subpart, if L1$ had a local miss rate of 50%, what is the local miss rate of L2$?

\[
\frac{20}{50\% + 100} = \frac{20}{50} = 40\%
\]

We know that L2$ is accessed when L1$ misses, so if L1$ misses 50% of the time, that means we access L2$ 50 times, of which we ended up having 20 misses in L2$.

Suppose your system consists of:

1. An L1$ that has a hit time of 2 cycles and has a local miss rate of 20%
2. An L2$ that has a hit time of 15 cycles and has a global miss rate of 5%
3. Main memory where accesses take 100 cycles

### 3.3
What is the local miss rate of L2$?

The number of accesses to the L2$ is the number of misses in L1$, so we divide the global miss rate of L2$ with the miss rate of L1$.

\[
\text{L2$ Local miss rate} = \frac{\text{Misses in L2$}}{\text{Accesses in L2$}} = \frac{\text{Misses in L2$}}{\text{Total Accesses}} / \frac{\text{Misses in L1$}}{\text{Total Accesses}} = \frac{5\%}{20\%} = 0.25 = 25\%
\]

### 3.4
What is the AMAT of the system?

\[
\text{AMAT} = 2 + 20\% \times (15 + 25\% \times 100) = 10 \text{ cycles}, \text{as the Miss Penalty of the L1$ is the 'local' AMAT of the L2$}.
\]

Using global rates of each level, alternatively, \(\text{AMAT} = 2 + 20\% \times 15 + 5\% \times 100 = 10 \text{ cycles (using global miss rates)}\)

### 3.5
Suppose we want to reduce the AMAT of the system to 8 cycles or lower by adding in a L3$. If the L3$ has a local miss rate of 30%, what is the largest hit time that the L3$ can have?

Let \(H\) = hit time of the cache. Extending the AMAT equation so that the Miss Penalty of the L2$ is the 'local' AMAT of the L3$, we can write:

\[
2 + 20\% \times (15 + 25\% \times (H + 30\% \times 100)) \leq 8
\]

Solving for \(H\), we find that \(H \leq 30\). So the largest hit time is 30 cycles.
4 Analysis

Given the following chunk of code, analyze the hit rate given that we have a byte-addressed computer with a total memory of 1 MiB. It also features a 16 KiB Direct-Mapped cache with 1 KiB blocks. Assume that your cache begins cold.

```c
#define NUM_INTS 8192 // 2^13
int A[NUM_INTS]; // A lives at 0x10000
int i, total = 0;
for (i = 0; i < NUM_INTS; i += 128) {
    A[i] = i; // Line 1
}
for (i = 0; i < NUM_INTS; i += 128) {
    total += A[i]; // Line 2
}
```

4.1 How many bits make up a memory address on this computer?

We take \( \log_2(1 \text{ MiB}) = \log_2(2^{20}) = 20 \).

4.2 What is the T:I:O breakdown?

Offset = \( \log_2(1 \text{ KiB}) = \log_2(2^{10}) = 10 \)
Index = \( \log_2(\frac{16 \text{ KiB}}{1 \text{ KiB}}) = \log_2(16) = 4 \)
Tag = 20 − 4 − 10 = 6

4.3 Calculate the cache hit rate for the line marked Line 1:

The integer accesses are 4 * 128 = 512 bytes apart, which means there are 2 accesses per block. The first accesses in each block is a compulsory cache miss, but the second is a hit because \( A[i] \) and \( A[i+128] \) are in the same cache block. Thus, we end up with a hit rate of 50%.

4.4 Calculate the cache hit rate for the line marked Line 2:

The size of \( A \) is 8192 * 4 = 2^{15} \text{ bytes}. This is exactly twice the size of our cache. At the end of Line 1, we have the second half of \( A \) inside our cache, but Line 2 starts with the first half of \( A \). Thus, we cannot reuse any of the cache data brought in from Line 1 and must start from the beginning. Thus our hit rate is the same as Line 1 since we access memory in the same exact way as Line 1. We don’t have to consider cache hits for total, as the compiler will most likely store it in a register. Thus, we end up with a hit rate of 50%.