Cache Basics

Memory Hierarchy Latency

  • External I/O (thumb drive) -> ( >50 million ns ) # obj O with my age, 32

    • 1 TB SSD/HDD -> (5 million ns) # not used

      • 4 GB RAM -> (80 - 250 ns) # add O.age = 32; O.name

        • 4 MB L3 (+3.75 MB over L2) -> (12 - 25 ns) # add O.age = 32

          • 256 KB L2 (+192 KB over L1) -> (5 - 9 ns) # add O.age = 32; O.name

            • 64 KB L1 -> (.5 - 3 ns) # O.name

              • registers -> (.25 - .5 ns) # O.age += 1; // now 33

Cache Miss Types

  • Compulsory (also called a cold miss) - Occurs when a system starts or on the first access to an object (assuming it isn't spatially or temporally local to a previously accessed object).

  • Capacity - The cache isn't large enough to hold all of the data the program needs.

  • Conflict - Multiple addresses map to the same set and the set is full, which leads to evictions.

Example

Taking an address mod 4 maps it to one of the following sets: {0, 1, 2, 3}. For example:

  • 0xe5f % 4 = 3

Address Partitioning

The data address is partitioned into a tag, an index (omitted in fully associative caches), and an offset.

How are Partitions Used?

Index Bits

We use the index portion to determine which set an address maps to. Conceptually this is a modulus: the block address modulo the number of sets, which is exactly what keeping the low index bits computes. The number of index bits partitioned for an architecture determines the number of sets in a cache: k index bits give 2^k sets.
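
As a minimal sketch in C (the bit widths here are illustrative, not taken from any particular architecture), keeping the low index bits is exactly a modulus by the number of sets:

#include <stdint.h>
#include <stdio.h>
#define OFFSET_BITS 6                   // illustrative: 64-byte cache lines
#define INDEX_BITS  6                   // illustrative: 2^6 = 64 sets
#define NUM_SETS    (1u << INDEX_BITS)
static uint32_t set_index(uint64_t addr) {
    // Drop the offset bits, then keep the low INDEX_BITS bits.
    // Equivalent to (addr >> OFFSET_BITS) % NUM_SETS.
    return (addr >> OFFSET_BITS) & (NUM_SETS - 1);
}
int main(void) {
    printf("0xe5f maps to set %u\n", set_index(0xe5f));
    return 0;
}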

Tag Bits

Tag bits uniquely identify the memory block stored in a cache line. When data is stored, the index determines the set, and the tag uniquely identifies the block within that set. On a lookup, the index bits are referenced to determine the set, and the valid bit is checked to see if the cache line has any data. A zero value indicates no data is present, and the lookup can stop. If the valid bit is set, the line is checked for a matching tag: the data is present only if the valid bit is set and the tag bits match the tag stored with the block.
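
The lookup order just described (index, then valid bit, then tag compare) might be sketched like this; the line layout is hypothetical and continues the illustrative geometry from the sketch above:

#include <stdbool.h>
#include <stdint.h>
#define OFFSET_BITS 6
#define INDEX_BITS  6
#define NUM_SETS    (1u << INDEX_BITS)
struct cache_line {
    bool     valid;                     // 0 means no data present
    uint64_t tag;                       // identifies which block occupies the line
    uint8_t  data[1 << OFFSET_BITS];    // one cache line of data
};
static struct cache_line cache[NUM_SETS];  // one line per set (direct-mapped)
// Returns a pointer to the byte on a hit, NULL on a miss.
static uint8_t *lookup(uint64_t addr) {
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint64_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
    struct cache_line *line = &cache[index];
    if (!line->valid)                   // no data in this line; stop
        return NULL;
    if (line->tag != tag)               // occupied by a different block
        return NULL;
    return &line->data[addr & ((1u << OFFSET_BITS) - 1)];  // hit
}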

Why use a valid Bit?

There are two reasons for a valid bit:

  1. The primary reason for a valid bit is that a tag holds some value even before a cache line is ever loaded. Whatever that initial value is, there is a possibility that it will match the tag computed from a real address. For example, if the tag is initialized to 0000000005 when a system starts, it is possible that some address in the virtual address space, when partitioned to extract its tag bits, will also compute to 0000000005. (A short initialization sketch follows below.)

    • Consider the scenario where an address's computed tag matches the initialized tag and no valid bit is used. The cache controller would misinterpret the matching tag as indicating valid data is present and attempt to fetch it; however, the data is garbage. A valid bit initialized to zero, or "invalid", prevents this from happening.

  2. Another crucially important use of the valid bit is cache coherency. Multicore architectures must maintain coherency by synchronizing data after a write. A core that writes shared memory data invalidates the corresponding cached data of its peer cores, marking it invalid. This forces a peer core to synchronize its cache by updating the local copy with the data advertised on the shared bus. This is commonly referred to as write-through invalidate.

Alternatively, write-update can be used by multicore architectures. See below for more on Cache Coherency.
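
To make the first point concrete, here is a minimal initialization sketch, reusing the hypothetical cache structure from the lookup example above. Clearing every valid bit guarantees that whatever garbage the tags hold at power-on can never produce a false hit:

// At power-on the tag contents are effectively garbage; clearing the
// valid bits means a lookup stops before ever comparing a stale tag.
static void cache_reset(void) {
    for (uint32_t i = 0; i < NUM_SETS; i++)
        cache[i].valid = false;  // "invalid": no data present
}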

Cache Coherency

Write Invalidate - Maintains cache coherency on a multicore architecture where multiple CPUs share memory. After making a local change, CPU1 invalidates the cache entry of its peer, CPU2: CPU2's valid bit is set to 0, while CPU1 (the CPU that just wrote) keeps its valid bit set to 1 and clears its shared bit to 0. This results in a cache miss the next time CPU2 reads. After the miss, CPU2 sends a request on the bus to check memory; although CPU1's copy is marked dirty and has not been synced to memory, CPU1 notices the request on the bus, sends the data, and sets its own shared bit to 1. CPU2's cache entry then becomes valid, shared, and contains the updated data.

Write update directly updates CPU2's cache, which avoids a cache miss when CPU2 reads after CPU1 writes. The disadvantage of write update is that if CPU1 makes multiple successive writes, it must update CPU2 each time. Contrast this with write invalidate: if CPU1 makes several writes, it can keep updating the value locally and only has to invalidate CPU2's entry once. When CPU2 next reads, it takes a cache miss and requests the value over the bus, and CPU1 provides only the most recent value. So there are tradeoffs between write update and write invalidate.

Finally, back invalidation allows a CPU to honor a cache inclusivity policy. If L2 and/or L3 are inclusive, they may back-invalidate an entry from a lower-tier cache. For example, suppose L2 is inclusive of L1, and some address_1 maps to set 6 in L2 but set 4 in L1 (the two levels may have different associativity). Now say address_2 maps to set 2 in L1 but set 6 in L2, and L2's set 6 is full. If L2 evicts address_1 from set 6 because of an LRU or FIFO policy, it must also back-invalidate address_1 from L1 to maintain inclusivity, as sketched below.
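
A rough sketch of back invalidation under an inclusive policy; the geometry and the way L1's set is recomputed are simplified stand-ins for real controller logic:

#include <stdbool.h>
#include <stdint.h>
#define L1_SETS 8                             // toy direct-mapped L1
#define L2_SETS 16                            // toy 4-way L2
#define L2_WAYS 4
struct line { bool valid; uint64_t block; };  // block = which memory block is cached
static struct line l1[L1_SETS];
static struct line l2[L2_SETS][L2_WAYS];
// Before L2 drops a victim, remove any copy from L1 so L1 stays a
// strict subset of L2 (the inclusive property).
static void l2_evict_inclusive(uint32_t set, uint32_t way) {
    struct line *victim = &l2[set][way];
    if (victim->valid) {
        // Recompute where the block lives in L1; the two levels may
        // index the same address into different sets.
        uint32_t l1_set = (uint32_t)(victim->block % L1_SETS);
        if (l1[l1_set].valid && l1[l1_set].block == victim->block)
            l1[l1_set].valid = false;         // back invalidation of L1's copy
    }
    victim->valid = false;                    // the L2 eviction itself
}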

Cache Coherency Example

Assume the value stored at address 0xe5f is 10.

Write Invalidate

  1. CPU1 & CPU2 read 0xe5f and have cache entries for 0xe5f with the data value 10 stored.

  2. CPU1 write(0xe5f, 15); // write to cache (not memory), marks the dirty bit -> CPU1 invalidates CPU2's entry -> CPU1 marks its local entry as not-shared (the other CPU does not have the update)

  3. CPU2 read(0xe5f); -> Cache miss because its valid bit is 0 -> CPU2 issues a read request for 0xe5f on the bus to memory -> remember that RAM does not have the update, because it lives in CPU1's cache and is marked dirty.

  4. CPU1 hears CPU2's broadcast to RAM for the value at 0xe5f -> CPU1 responds by sending CPU2 its local copy of 0xe5f -> CPU1 marks its local copy as shared (the other CPU has the update) -> CPU1's local copy still has the dirty bit set (main memory does not have the update)

  5. CPU2 sets the valid bit -> CPU2 sets the shared bit (synchronized/coherent with the other CPU) -> CPU2 sets the dirty bit (main memory is still stale)

  6. CPU1 & CPU2 continue processing.

WriteInvalidate() {
    CPU1 write(0xe5f, 15); CPU1 write(0xe5f, 20); CPU1 write(0xe5f, 25); CPU1 write(0xe5f, 30); // local writes; CPU2 invalidated once
    CPU2 read(0xe5f); -> Cache Miss -> read(RAM, 0xe5f); // request the value on the bus
    CPU1 send(CPU2, 0xe5f, 30); -> marks its local copy as shared (coherent)
    CPU2 -> local copy valid -> local copy shared -> local copy dirty
}
Pros & Cons

  • Pro: CPUn only updates its cache entry with the most recent value (fewer CPU cycles).

  • Pro: No data broadcast on the shared bus for every write.

  • Con: CPUn will always take a cache miss after CPU1 writes.

Write Update

  1. CPU1 & CPU2 read 0xe5f and have cache entries for 0xe5f with the data value 10 stored.

  2. CPU1 write(0xe5f, 15); // write to cache (not memory), marks the dirty bit -> CPU1 updates CPU2's entry -> CPU1 marks its local entry as shared (the caches are coherent)

  3. CPU2 read(0xe5f); -> Cache HIT because the valid bit is set and the tag matches -> The value read is correct.

WriteUpdate() {
    CPU1 write(0xe5f, 15); -> CPU1 updates CPU2's entry
    CPU1 write(0xe5f, 20); -> CPU1 updates CPU2's entry
    CPU1 write(0xe5f, 25); -> CPU1 updates CPU2's entry
    CPU1 write(0xe5f, 30); -> CPU1 updates CPU2's entry
    CPU2 read(0xe5f); -> Cache hit
}
Pros & Cons

  • Pro: CPU2 avoids a cache miss after CPU1 writes.

  • Con: CPU1 sends a broadcast on the bus for every write.

  • Con: CPUn's cache entry is continually updated after each write.
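
A toy message count (invented for illustration, not any real protocol) shows the tradeoff directly: invalidate costs a constant two bus messages no matter how many writes precede the read, while update costs one broadcast per write:

#include <stdio.h>
// One writer (CPU1) performs n writes, then the reader (CPU2) reads once.
static int invalidate_msgs(int n) {
    (void)n;        // writes after the first stay local
    return 1 /* one invalidate */ + 1 /* CPU2's miss + data transfer */;
}
static int update_msgs(int n) {
    return n;       // every write broadcasts the new value; the read then hits
}
int main(void) {
    for (int n = 1; n <= 4; n++)
        printf("writes=%d  invalidate=%d msgs  update=%d msgs\n",
               n, invalidate_msgs(n), update_msgs(n));
    return 0;
}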

Cache Types

Direct-Mapped

Direct-Mapped = Synonymous with a one-way associative cache, whereby every address maps to a single block. Suppose I have 8 cache blocks and memory addresses starting at 0x000. Each hex address is applied to a hash or index function to determine the proper index. Each index has only one block, meaning once the index is located, only the valid and tag bits must be checked.

This creates a one-to-one mapping when using a direct-mapped cache. For example, take the following 70-byte main memory starting at 0x000. With an 8-block cache, the mappings would look like this:

0x000, 0x008, 0x010, 0x018, 0x020 will map to index 0
0x001, 0x009, 0x011, 0x019, 0x021 will map to index 1
0x002, 0x00A, 0x012, 0x01A, 0x022 will map to index 2
0x003, 0x00B, 0x013, 0x01B, 0x023 will map to index 3
......
0x007, 0x00F, 0x017, 0x01F, 0x027 will map to index 7

Notice how each address can only map to exactly one spot. As I'll show below, this differs from set-associative caches, where an address maps to exactly one set but the data may be placed in any block within that set.

We know that each block can hold one address, which means the next address that maps to the same index, 8 bytes away, will cause an eviction. If a subsequent request for the recently evicted object is made, it results in a Conflict Cache Miss. For example, given the address 0x028 [ 40 ], it maps to index zero; however, block zero is already full with the last address from above that also maps to index 0, 0x020. In this case, the cache controller evicts 0x020 and stores 0x028. This is the one-to-one mapping. If this were set-associative, index 0 could hold multiple blocks, so multiple addresses mapping to the same set would not automatically cause an eviction, although it might: if there are no empty blocks in the set, an eviction occurs, and a subsequent lookup for the evicted data is a Conflict Cache Miss.

The offset informs the CPU of the data's byte position in the cache line, allowing the CPU to gather the data at the proper offset.
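
The mapping above is a single modulus. A tiny sketch, assuming one-byte blocks as in the example:

#include <stdio.h>
#define NUM_BLOCKS 8   // 8-block direct-mapped cache, 1-byte blocks
int main(void) {
    printf("0x020 -> index %d\n", 0x020 % NUM_BLOCKS);  // 32 mod 8 = 0
    printf("0x028 -> index %d\n", 0x028 % NUM_BLOCKS);  // 40 mod 8 = 0: loading 0x028 evicts 0x020
    return 0;
}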

Fully Associative

Data can be stored in any available cache block and is not limited to an assigned location as with direct-mapped and set-associative caches. As such, fully associative implementations do not include an index field in the data address; the entire address above the offset is used as the tag. This increases the overall size of the cache, because space is needed for the wider tags. Another drawback is that the data could be in any block, so the tag of every cache block must be checked for a match; unlike set-associative and direct-mapped caches, no index bits are extracted from the lower k bits of the address to narrow the search.

The offset informs the CPU of the data's byte position in the cache line, allowing the CPU to gather the data at the proper offset.
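
Because there is no index to narrow the search, a lookup must compare the tag in every block. A minimal sequential sketch (real hardware performs these comparisons in parallel):

#include <stdbool.h>
#include <stdint.h>
#define OFFSET_BITS 6
#define NUM_BLOCKS  8   // illustrative cache size
struct fa_line { bool valid; uint64_t tag; uint8_t data[1 << OFFSET_BITS]; };
static struct fa_line fa_cache[NUM_BLOCKS];
// Fully associative: everything above the offset is the tag, and
// every block must be checked for a match.
static uint8_t *fa_lookup(uint64_t addr) {
    uint64_t tag = addr >> OFFSET_BITS;
    for (int i = 0; i < NUM_BLOCKS; i++) {
        if (fa_cache[i].valid && fa_cache[i].tag == tag)
            return &fa_cache[i].data[addr & ((1u << OFFSET_BITS) - 1)];
    }
    return NULL;  // miss: on a fill, the data may go in any free block
}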

N-way Set Associative

The cache is divided into groups known as sets; an address maps to exactly one set. However, within that set there can be several blocks, and the data can be placed in any of the blocks within the set. There are several ways to perform the set-associative groupings. If each set has 2^x blocks, then the cache is 2^x-way associative. For example, given an 8-block cache, the possible groupings are:

  • (8 sets, 1 block each); identical to direct-mapped with 8 sets.
  • (4 sets, 2 blocks each); 2-way associative. Within a set, there are two ways (locations) to store the data.
  • (2 sets, 4 blocks each); 4-way associative. Within a set, there are four ways (locations) to store the data.

The offset informs the CPU of the byte position within the block (cache line) where the data begins, letting the CPU fetch data at the correct offset. Finding the data is very similar to direct-mapped: the index bits select the set, then the set is searched for the tag to find the block where the data is stored. Unlike direct-mapped, where a set holds only one block and so only one tag is compared with the address tag, set-associative lookup must compare every tag in the set, either sequentially or in parallel. Associative memory, or CAM (Content Addressable Memory), allows parallel comparison of the tag values in a set, letting the cache controller find the matching block quickly.
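
A sketch of the 2-way grouping from the list above (4 sets of 2 blocks); the index picks the set, and only that set's two ways are searched, sequentially here where CAM hardware would compare in parallel:

#include <stdbool.h>
#include <stdint.h>
#define OFFSET_BITS 6
#define INDEX_BITS  2                    // 2^2 = 4 sets
#define NUM_SETS    (1u << INDEX_BITS)
#define NUM_WAYS    2                    // 8 blocks total: 4 sets x 2 ways
struct sa_line { bool valid; uint64_t tag; uint8_t data[1 << OFFSET_BITS]; };
static struct sa_line sa_cache[NUM_SETS][NUM_WAYS];
static uint8_t *sa_lookup(uint64_t addr) {
    uint32_t set = (addr >> OFFSET_BITS) & (NUM_SETS - 1);  // index bits pick the set
    uint64_t tag = addr >> (OFFSET_BITS + INDEX_BITS);
    for (int way = 0; way < NUM_WAYS; way++) {              // search only this set's ways
        struct sa_line *line = &sa_cache[set][way];
        if (line->valid && line->tag == tag)
            return &line->data[addr & ((1u << OFFSET_BITS) - 1)];
    }
    return NULL;  // miss within the set; any way may receive the fill
}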
