(Diagrams: a fully associative cache and a set associative cache.)
Each cache line contains a valid bit, \(t\) tag bits, and a data block of \(B\) bytes. The valid bit indicates whether the cache line holds meaningful data. The tag bits come from part of the memory address being stored, and they uniquely identify the line within its set. The data block stores the actual bytes that we want cached.
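To make this concrete, below is a minimal C++ sketch of a single cache line (my own illustration, assuming 64-byte blocks and a tag narrow enough to fit in a 32-bit field, matching the example that follows):

#include <cstdint>

// A conceptual sketch of one cache line, assuming B = 64-byte data blocks
// and a tag that fits in a 32-bit field.
struct CacheLine {
    bool     valid;     // does this line currently hold real data?
    uint32_t tag;       // identifies which memory block is stored here
    uint8_t  data[64];  // the cached block itself (B = 64 bytes)
};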
When we want to access a memory address in the cache, we break the address down into three parts: the tag, the set index, and the byte offset. With these three pieces of information, we can identify the entry in the cache, if it exists: the set index locates the set, the tag locates the line within the set, and the byte offset locates the data within the block.
For example, the Intel Pentium 4 processor has a 4-way set associative cache of 8 KiB total, and each cache line is 64 bytes. So there are a total of 8 KiB / 64 = 128 cache lines, divided into 128 / 4 = 32 total sets. With a 32-bit address space, we need 5 bits to identify the set, 6 bits to identify the byte in the 64-byte block, leaving 32-(5+6) = 21 tag bits to select the line in the set.
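To make the decomposition concrete, here is a short sketch (the split_address helper is hypothetical, but the geometry is the Pentium 4 one above: 6 offset bits, 5 set-index bits, 21 tag bits):

#include <cstdint>
#include <cstdio>

struct CacheAddress {
    uint32_t tag;     // upper 21 bits: identifies the line within a set
    uint32_t set;     // middle 5 bits: selects one of the 32 sets
    uint32_t offset;  // lower 6 bits: selects a byte in the 64-byte block
};

CacheAddress split_address(uint32_t addr) {
    return CacheAddress{
        addr >> 11,          // drop the 5 set bits and 6 offset bits
        (addr >> 6) & 0x1f,  // keep the 5 set-index bits
        addr & 0x3f          // keep the 6 byte-offset bits
    };
}

int main() {
    CacheAddress a = split_address(0xe2abee6au);
    printf("tag=0x%x set=%u offset=%u\n",
           (unsigned)a.tag, (unsigned)a.set, (unsigned)a.offset);
}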
To see this in action, suppose we have a 100-element array of floats, and we want to read the element at index 60, i.e. a[60], which happens to be at memory address 0xe2abee6a. Below is a diagram showing this memory load.
To read the array element, we use the set index to locate the set, compare the tag against each line in that set, and, on a hit, find a[60] within the block at byte 40; if it exists, we just load the 40th byte!

Conflict misses arise when we repeatedly access memory addresses that map to the same set, which causes repeated cache evictions and thus fetches from lower memory levels, increasing latency. The most extreme example of this is a direct-mapped cache and a dot product of two vectors that happen to be spaced number of sets * line size bytes apart. We call this distance the critical stride: in general, critical stride = cache size / associativity = number of sets * line size. Consider the following dot product:
#include <cstddef>
#include <vector>

// Computes the dot product of two equal-length vectors.
float dot_product(const std::vector<float> &x, const std::vector<float> &y) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
This means each element in x and the corresponding element in y will be mapped to the same set, so the cache line gets replaced over and over again in a hot loop.
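As a quick sanity check, here is a sketch under assumed numbers (a direct-mapped 8 KiB cache with 64-byte lines, so the critical stride is 8 KiB, plus made-up base addresses): every block of x collides with the corresponding block of y.

#include <cstdio>

int main() {
    const unsigned kLineSize = 64;
    const unsigned kNumSets  = 8 * 1024 / kLineSize;  // 128 sets, direct-mapped
    unsigned x_base = 0x20000000u;         // hypothetical address of x[0]
    unsigned y_base = x_base + 8 * 1024;   // y starts one critical stride later
    for (unsigned block = 0; block < 4; ++block) {
        unsigned xi = x_base + block * kLineSize;  // one address per block of x
        unsigned yi = y_base + block * kLineSize;  // ...and the matching block of y
        printf("block %u: set(x)=%u set(y)=%u\n", block,
               (xi / kLineSize) % kNumSets, (yi / kLineSize) % kNumSets);
    }
}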
This is also why the set index comes from the middle bits rather than the high bits: when scanning an array sequentially, consecutive blocks map to consecutive sets in a round-robin fashion rather than all mapping to the same set, which would cause thrashing.
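Here is a small sketch of that effect (same assumed 32-set, 64-byte-block geometry as above, with a made-up base address): consecutive blocks spread across consecutive sets, even though their high bits are identical.

#include <cstdio>

int main() {
    unsigned base = 0x10000000u;  // hypothetical array base address
    for (unsigned block = 0; block < 4; ++block) {
        unsigned addr = base + block * 64;   // start of each consecutive 64-byte block
        unsigned set  = (addr >> 6) & 0x1f;  // middle bits: sets 0, 1, 2, 3, ...
        unsigned high = addr >> 27;          // top 5 bits: the same for every block
        printf("block %u -> set %u (top 5 bits: %u)\n", block, set, high);
    }
}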
Typically, not all caches are shared. For example, each processor usually has its own L1 cache. Cache coherency refers to keeping these individual caches in sync with each other. Note that this is not like a database, where we can sometimes tolerate stale reads (i.e., weaker-than-strong consistency): each cache must contain the latest data or else mark its copy invalid, in which case we fetch from main memory.
The most commonly used protocol for cache coherency is known as the MESI protocol. MESI allows each cache line to be in one of four states:

Modified: the line is present only in this cache and has been modified, so it differs from main memory (it is "dirty").
Exclusive: the line is present only in this cache and matches main memory (it is "clean").
Shared: the line is clean but may also be present in other caches.
Invalid: the line is unused or stale and must not be read.
When a core issues read and write instructions, the other cores are notified through signals on a shared bus indicating how the states of their own cache lines should change. There are two types of signals: processor signals and bus signals. Processor signals are sent by a processor to its own cache. Bus signals are sent by a cache to the other cache(s) on the shared bus, and are intercepted by what are known as "snoopers," which watch for bus transactions.
Processor signals include the following:

PrRd: the processor requests to read a cache block.
PrWr: the processor requests to write to a cache block.
Bus signals include the following:

BusRd: another cache requests to read a block it does not have.
BusRdX: another cache requests to write to a block it does not have.
BusUpgr: another cache requests to write to a block it already holds in the Shared state.
Flush: a cache places a block's data on the bus (e.g., writing a Modified line back).
Below is what happens in each state when these signals are received:

If the cache line is Modified:
PrRd / PrWr: serviced locally; stay in Modified.
BusRd: flush the data onto the bus (supplying it to the requester) and transition to Shared.
BusRdX: flush the data onto the bus and transition to Invalid.

If the cache line is Exclusive:
PrRd: serviced locally; stay in Exclusive.
PrWr: transition to Modified; no bus signal is needed since no other cache holds the line.
BusRd: supply the data and transition to Shared.
BusRdX: supply the data and transition to Invalid.

If the cache line is Shared:
PrRd: serviced locally; stay in Shared.
PrWr: issue BusUpgr to invalidate the other copies and transition to Modified.
BusRd: stay in Shared (a sharer may supply the data).
BusRdX / BusUpgr: transition to Invalid.

If the cache line is Invalid:
PrRd: issue BusRd; transition to Exclusive if no other cache holds the line, otherwise Shared.
PrWr: issue BusRdX and transition to Modified.
Bus signals: ignored, since the line holds no valid data.
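As an illustrative sketch (my own, not from the text), the snooping side of these transitions fits in a few lines of C++; the state and signal names mirror the lists above, while on_bus_signal is a hypothetical helper:

enum class State { Modified, Exclusive, Shared, Invalid };
enum class BusSignal { BusRd, BusRdX, BusUpgr };

// Next state of a local line when a bus signal from another cache is snooped.
// (A real cache would also flush Modified data onto the bus before sharing.)
State on_bus_signal(State s, BusSignal sig) {
    if (s == State::Invalid) return s;  // nothing to invalidate or supply
    switch (sig) {
        case BusSignal::BusRd:
            return State::Shared;   // another core reads: downgrade M/E/S -> S
        case BusSignal::BusRdX:
        case BusSignal::BusUpgr:
            return State::Invalid;  // another core writes: our copy is stale
    }
    return s;
}

int main() {
    State s = State::Exclusive;
    s = on_bus_signal(s, BusSignal::BusRd);   // another core reads: E -> S
    s = on_bus_signal(s, BusSignal::BusUpgr); // another core writes: S -> I
    return s == State::Invalid ? 0 : 1;
}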
Below is a walkthrough of what happens when three cores try to read from/write to the same cache line. The sequence of operations is: 1. processor 2 reads, 2. processor 2 writes, 3. processor 1 reads, etc. (R2 denotes a read by processor 2, W1 a write by processor 1, and so on.)
step  local request  P1  P2  P3  generated bus request  data supplier
0     initial        -   -   -   -                      -
1     R2             -   E   -   BusRd                  main memory
2     W2             -   M   -   -                      -
3     R1             S   S   -   BusRd                  P2
4     W1             M   I   -   BusUpgr                -
5     R2             S   S   -   BusRd                  P1
6     R1             S   S   -   -                      -
7     R3             S   S   S   BusRd                  P1/P2
Below are some very general tips for optimizing the performance of your code.