Cache Memory

Cache Structure Review

What is the structure of a Cache?

Please see my other notes for a more detailed explanation.

  • Direct-mapped cache: each memory block can be mapped onto exactly one cache line

  • Fully associative: a block can be placed in any cache line

  • N-way set associative: a block maps to exactly one set and can be placed in any of that set's n lines

The majority of processor caches today are direct-mapped, 2-way, or 4-way set associative

How are blocks found in the Cache?

Memory addresses are partitioned into several fields; the address fields used for locating a block in the cache are the index and tag bits (plus the block offset, which selects a byte within the block). The valid bit is not part of the address; it is metadata stored with each cache line. Note that fully associative caches do not use index bits because only one set exists. Other per-line metadata bits exist as well:

  • The reference (use) bits are used for LRU eviction

  • The dirty bit is used when a write-back policy is employed; after a write() updates data, the cached block is marked dirty and written back to the next lower level of the memory hierarchy when it is evicted.

Simply put, the index bits are used to determine the cache set that the data belongs to. Each cache line in a set has a valid bit; if the cache line is marked valid (usually indicated with the valid bit set to 1), the data residing in the line (block) is valid and can be used. The address tag is then compared to the tag in the cache line. Matching tags indicate a cache hit; anything else is a cache miss.

  • In modern CPUs, these operations are typically done in parallel

If the total cache size is kept the same, increasing associativity increases the number of blocks per set, which decreases the number of index bits (and enlarges the tag accordingly). Does that make sense? In an n-way set-associative cache, the n refers to the number of ways (cache lines) in each set. For example, assuming the total size of the cache remains constant, going from 2-way to 4-way doubles the number of cache lines per set. If the number of blocks per set increases, the number of sets required decreases; thus, fewer index bits are needed.

  • Index bits: which set does the data belong to?

  • Valid bit: is the data in this cache line valid?

  • Tag bits: does this cache line hold the requested block? (A small sketch after this list shows how these fields are pulled out of an address.)
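
A minimal C sketch of this lookup, assuming an invented geometry (64-byte blocks, 64 sets, direct mapped); the valid bit and tag live in per-line metadata rather than in the address itself, and all names here are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed geometry: 64-byte blocks, 64 sets, direct mapped. */
#define OFFSET_BITS 6                    /* log2(64-byte block) */
#define INDEX_BITS  6                    /* log2(64 sets)       */
#define NUM_SETS    (1u << INDEX_BITS)

struct line {                            /* per-line metadata, not part of the address */
    bool     valid;
    uint32_t tag;
};

static struct line cache[NUM_SETS];      /* one line per set (direct mapped) */

/* Split an address into the fields used to locate a block in the cache. */
static void decompose(uint32_t addr, uint32_t *tag, uint32_t *index)
{
    *index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    *tag   = addr >> (OFFSET_BITS + INDEX_BITS);
}

/* A lookup hits only if the selected line is valid and the tags match. */
static bool lookup(uint32_t addr)
{
    uint32_t tag, index;
    decompose(addr, &tag, &index);
    return cache[index].valid && cache[index].tag == tag;
}

int main(void)
{
    uint32_t addr = 0x12345678, tag, index;

    decompose(addr, &tag, &index);
    cache[index] = (struct line){ true, tag };          /* pretend the block was filled */

    printf("hit on 0x%08x: %d\n", addr, lookup(addr));                /* 1 */
    printf("hit on 0x%08x: %d\n", addr + 64, lookup(addr + 64));      /* 0: next block, different (empty) set */
    return 0;
}
```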

Cache miss behavior

When an L1 cache miss occurs, the cache controller must fetch the block from the next level of the memory hierarchy (L2, L3, or main memory) and install it in the cache according to the placement and replacement policy.

  • This is VERY easy with direct-mapped

    • only one cache line is checked for a hit, and only that line can be replaced, because every block in main memory maps to exactly one cache line within the cache.

  • Strategies for fully associative and n-way associative

    • Random: spreads allocation uniformly by selecting victim blocks at random (pseudorandom selection gives reproducible behavior, which can assist with debugging hardware)

    • LRU: Least Recently Used. Accesses to blocks are recorded to avoid throwing out data that will be used again shortly.

      • The block that has been unused for the longest period will be removed

      • This is a corollary (a proposition that follows from one already established) of locality: caching keeps recently used blocks (and their neighbors) on the premise that they will be used again soon, so the least recently used block is also the one least likely to be needed again shortly.

    • FIFO: First in First Out

      • approximates LRU by removing the oldest cache blocks

      • LRU is difficult to implement

  • A virtue of random replacement is that it is simple to build in hardware. As the number of blocks to keep track of increases, LRU becomes increasingly expensive and is usually only approximated.

  • A common approximation (often called pseudoLRU) has a set of bits for each set in the cache, with each bit corresponding to a single way in the cache.

    • (a way is a bank in a set-associative cache; there are four ways in a four-way set associative cache)

    • When a set is accessed, the bit corresponding to the way containing the desired block is turned on; if all the bits associated with a set are turned on, they are reset except the most recently turned-on bit.

      • When a block must be replaced, the processor chooses a block from a way whose bit is turned off, often picking randomly among those ways

      • This approximates LRU because the block that is replaced will not have been accessed since the last time the use bits were reset (a minimal sketch of this scheme appears after this list)
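
A minimal sketch of the bit-per-way pseudoLRU scheme described above, for a single set of an assumed 4-way cache; the random choice among unset ways follows the note above, and all names are invented for illustration:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define WAYS 4                  /* assumed 4-way set associative cache */

/* One use bit per way for a single set, as in the bit-per-way scheme. */
static uint8_t use_bits[WAYS];

/* Record an access to `way`; if every bit is now set, clear all of them
 * except the one for the most recently used way.                        */
static void touch(int way)
{
    use_bits[way] = 1;

    int all_set = 1;
    for (int w = 0; w < WAYS; w++)
        if (!use_bits[w]) { all_set = 0; break; }

    if (all_set) {
        for (int w = 0; w < WAYS; w++)
            use_bits[w] = 0;
        use_bits[way] = 1;      /* keep only the latest access marked */
    }
}

/* Pick a victim: any way whose bit is off, chosen at random among them. */
static int choose_victim(void)
{
    int candidates[WAYS], n = 0;
    for (int w = 0; w < WAYS; w++)
        if (!use_bits[w])
            candidates[n++] = w;
    return candidates[rand() % n];   /* n >= 1 because touch() resets bits */
}

int main(void)
{
    touch(0); touch(2); touch(1);                 /* ways 0, 2, 1 recently used   */
    printf("victim = way %d\n", choose_victim()); /* prints 3: the only unset bit */
    return 0;
}
```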

What happens on a Cache Hit

  • When the CPU tries to read from memory, the address is sent to the cache controller.

  • The lowest k bits of the address index a block in the cache.

  • If the block is valid and the tag matches the upper (m - k) bits of the m-bit address, that data is sent to the CPU.

  • Caches are designed to optimize reads: the block can be read from the cache at the same time that the tag is read and compared.

  • Writes do not afford the same optimization: a write cannot happen until the tag is read and a hit is determined.

  • Write policies are often what distinguishes cache designs

    • The processor determines the size of the write

      • typically b/t 1 - 8 bytes

    • Two general types of cache write policies

      • write-through: information is written to the cache block and lower-level memory

        • one advantage is that it provides data coherency

          • especially important with multiprocessors (each with its own cache) to ensure data consistency

      • write-back: information is written only to the cache block

        • the modified block is only written to main memory when the cache block is removed (LRU, FIFO, or random)

        • ***** Very Important *****

        • A status or dirty bit is used to limit the frequency of write-backs to main memory when a cache block is replaced

          • the status bit is set when a copy/write to main memory is needed (dirty) or not set when one is not needed (clean)

            • if clean, the block is not written back on a miss (because no changes were made)

        • requires fewer writes to memory and thus uses less memory bandwidth

          • favorable for multiprocessors

    • When a processor waits for a write-through to complete, this is called a write stall

      • write buffers are used to optimize and reduce write stalls

      • Write allocate: the block is allocated (brought into the cache) on a write miss, followed by the write-hit actions above. In this natural option, write misses act like read misses.

      • No-write allocate: in this apparently unusual alternative, write misses do not affect the cache; the block is modified only in the lower-level memory. (A sketch of write-back with write allocate appears after this list.)
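
The dirty-bit bookkeeping of write-back, combined with write allocate, can be sketched as below. This is an assumed, one-line toy cache (names like cache_write and writes_to_memory are invented), not a real controller; it only shows when a write-back to lower-level memory actually happens:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* A single cache line; enough state to show the write-back bookkeeping.
 * (Hypothetical one-line "cache": a real cache has many sets and ways.)  */
struct line {
    bool     valid;
    bool     dirty;   /* set on a write; checked on eviction */
    uint32_t tag;
};

static struct line cache_line;   /* starts out invalid and clean          */
static int writes_to_memory;     /* counts write-backs actually performed */

/* Evict the current block: write it back only if it was modified (dirty). */
static void evict(void)
{
    if (cache_line.valid && cache_line.dirty)
        writes_to_memory++;            /* dirty: must update lower level  */
    cache_line.valid = false;          /* clean blocks are simply dropped */
    cache_line.dirty = false;
}

/* Write-back + write allocate: a write miss first brings the block in,
 * then performs the write; either way the line ends up dirty.            */
static void cache_write(uint32_t tag)
{
    if (!(cache_line.valid && cache_line.tag == tag)) {   /* write miss   */
        evict();                        /* may trigger one write-back     */
        cache_line.valid = true;        /* allocate the missed block      */
        cache_line.tag   = tag;
    }
    cache_line.dirty = true;            /* cached copy is now newer       */
}

int main(void)
{
    cache_write(1);     /* miss: allocate block 1, mark it dirty          */
    cache_write(1);     /* hit: still dirty, nothing goes to memory       */
    cache_write(2);     /* miss: block 1 is dirty, so one write-back      */
    printf("write-backs to memory: %d\n", writes_to_memory);  /* prints 1 */
    return 0;
}
```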

Hence, we organize six cache optimizations into three categories:

- Reducing the miss rate: larger block size, larger cache size, and higher associativity
- Reducing the miss penalty: multilevel caches and giving reads priority over writes
- Reducing the time to hit in the cache: avoiding address translation when indexing the cache

Cache Misses

- Misses can be sorted into three categories
    - Compulsory: the first access to a block cannot be in the cache, so the block must be brought into the cache. These are also called cold-start misses or first-reference misses.
    - Capacity: if the cache cannot contain all the blocks needed during program execution, capacity misses (in addition to compulsory misses) will occur because blocks are discarded and later retrieved.
    - Conflict: if the block placement strategy is set associative or direct mapped, conflict misses will occur because a block may be discarded and later retrieved if too many blocks map to its set.
        - Also known as collision misses (similar to hash collisions using modulus)
        - Hits in a fully associative cache that become misses in an n-way set-associative cache are due to more than n requests on some popular sets.
        - *** It may be difficult to distinguish between capacity and conflict misses; one way is to remember that the modulus mapping can place several blocks in the same set, filling that set while other sets remain less full.
            - This is not a matter of the whole cache being full, but of too many blocks mapping to the same set.

Optimizing cache misses

- 1: Increase cache block size
    - Larger blocks lead to fewer misses because of locality
        - reduces compulsory misses
    - May increase the miss penalty; it is more expensive to retrieve the larger block from main memory
    - Larger blocks are favored with high-latency/high-bandwidth memory
        - more bytes are grabbed per miss
    - Smaller blocks are favored with low-latency/low-bandwidth memory
- 2: Larger cache
- 3: Higher Associativity
    - The 2:1 cache rule of thumb is that a direct-mapped cache of size N has about the same miss rate as a two-way set associative cache of size N/2.
- 4: Multilevel Caches
    - Local miss rate:
        - number of misses divided by the total number of memory accesses to this level of the cache
        - For the first-level cache this is Miss Rate(L1), and for the second level it is Miss Rate(L2)
    - Global miss rate:
        - number of misses in the cache divided by the total number of memory accesses generated by the processor
            - unlike the local miss rate, the denominator includes accesses that hit in the levels above and never reach this cache
        - This is still Miss Rate(L1) for the first-level cache. However, for the second-level cache this is Miss Rate(L1) * Miss Rate(L2)
        - (a worked sketch of these definitions appears after this list)

    - Write-through will send all writes to the second-level cache, not just the misses.  Additionally, a write buffer can be used

    - Multilevel inclusion
        - Excellent Explanation:
            - https://en.wikipedia.org/wiki/Cache_inclusion_policy
        - Everything in L1 is also in L2
        - Consistency between I/O and cache can easily be checked by looking at L2 cache
        - The drawback is that measurements can suggest a smaller block size for the smaller first-level cache and a larger block size for the larger second-level cache
            - If this happens, maintaining inclusion is more difficult
                - Second-level cache must invalidate first-level blocks that map onto second-level blocks being replaced
                    - this leads to a higher first-level miss rate
            - ** most designers keep block sizes b/t caches the same
    - Multilevel Exclusion
        - This is a solid option for cache policy if the designer cannot afford to make the L2 significantly larger than the L1.
            In this case, L2 will be exclusive of L1, and a miss to L1 will incur a smaller penalty because data can likely be retrieved from L2.
                - LRU evictions from L1 can also be quickly retrieved from L2

- 5. Giving priority to Read misses over Write Misses
    - Read Misses are handled before a write completes
    - Write-through caches make use of a write buffer
        - adds complexity
        - write buffers may hold updated values needed by a read miss
            - Two options:
                - Wait for the write buffer to be empty
                - Read the buffer to check it for conflicts and proceed only if there are none (a minimal sketch of this check appears after this list)
                    - most commonly used by laptops and desktops
            - Write-backs can also use a buffer to reduce the miss penalty
                - Read miss that replaces a dirty block:
                    - 1. Copy the dirty block to the buffer
                    - 2. Read the missed block from memory
                    - 3. Write the buffered dirty block to memory
                - * This is faster than writing the dirty block to memory first and only then starting the read.
- 6. Avoiding Address translation during indexing of the cache to reduce Hit Time
    - virtual cache
        - cache using virtual addresses
    - physical cache
        - traditional cache using physical address
    - It is important to distinguish two tasks: indexing the cache and comparing tags
        - Is a virtual or physical address being used to index the cache?
        - Is a virtual or physical address being used in the tag comparison?
    - Why not exclusively use virtually-addressed caches?
        - Page-level protection is checked as part of the virtual-to-physical address translation
            - must be enforced
            - Could copy the protection information from the TLB on a miss, add a field to hold it, and check it on every access to the virtually addressed cache
        - Every time the process is switched, the virtual addresses refer to different physical addresses
            - requires the cache to be flushed, which leads to misses
            - Can be avoided by widening the tag field and adding a PID tag.
                - OS only needs to flush a PID that is recycled
        - Virtual caches also introduce synonyms (aliases): two different virtual addresses, for example one used by a user program and one by the OS, that refer to the same physical address
            - this can lead to unsynchronized duplicate copies of the data in the cache
                - if the OS changes its copy, the user program may not see the update
            - Hardware can use antialiasing to guarantee that every block in the cache has a unique physical address
            - Example:
                - AMD Opteron
                    - 64 KiB instruction cache, 4 KiB page, and two-way set associativity
                    - hardware must handle 3 virtual address bits in the set index
                        - on a miss, the hardware checks all 8 (2^3) possible locations to make sure no block with the same physical address is already in the cache
                            - any match is invalidated, guaranteeing each block in the cache a unique physical address (the arithmetic behind the 3 bits and 8 locations is worked out in the sketch after this list)
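
The local vs. global miss-rate definitions under optimization 4 above are easy to confuse, so here is a worked sketch with made-up counts (the numbers are assumptions, not measurements):

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical counts for one program run (made-up numbers). */
    double cpu_accesses = 1000.0;  /* memory accesses issued by the processor */
    double l1_misses    = 40.0;    /* misses in the first-level cache         */
    double l2_misses    = 20.0;    /* misses in the second-level cache        */

    /* Local miss rate: misses divided by accesses reaching that level.       */
    double l1_local = l1_misses / cpu_accesses;     /* L1 sees every access   */
    double l2_local = l2_misses / l1_misses;        /* L2 sees only L1 misses */

    /* Global miss rate: misses divided by all processor accesses.            */
    double l1_global = l1_local;                    /* same thing for L1      */
    double l2_global = l1_local * l2_local;         /* = l2_misses / cpu_accesses */

    printf("L1 local/global: %.3f / %.3f\n", l1_local, l1_global);  /* 0.040 / 0.040 */
    printf("L2 local/global: %.3f / %.3f\n", l2_local, l2_global);  /* 0.500 / 0.020 */
    return 0;
}
```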
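
Optimization 5's check of the write buffer on a read miss can be sketched as follows; the buffer layout, 64-byte block size, and all names are assumptions for illustration:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BUF_ENTRIES 4
#define BLOCK_MASK  (~(uint32_t)63)   /* 64-byte blocks assumed */

/* A pending write sitting in the write buffer (hypothetical layout). */
struct wb_entry {
    bool     valid;
    uint32_t addr;
};

static struct wb_entry write_buffer[BUF_ENTRIES];

/* On a read miss, scan the buffer: if a pending write touches the same
 * block, the read must either wait or be served from the buffer.        */
static bool read_conflicts_with_buffer(uint32_t read_addr)
{
    for (int i = 0; i < BUF_ENTRIES; i++)
        if (write_buffer[i].valid &&
            (write_buffer[i].addr & BLOCK_MASK) == (read_addr & BLOCK_MASK))
            return true;
    return false;
}

int main(void)
{
    write_buffer[0] = (struct wb_entry){ true, 0x1000 };  /* one pending write */

    printf("read 0x1008 conflicts: %d\n", read_conflicts_with_buffer(0x1008)); /* 1: same block */
    printf("read 0x2000 conflicts: %d\n", read_conflicts_with_buffer(0x2000)); /* 0: safe to proceed */
    return 0;
}
```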
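
The "3 virtual address bits" and "8 locations" in the Opteron example follow from simple arithmetic on the figures quoted above; the sketch below just works it out:

```c
#include <stdio.h>

int main(void)
{
    /* Figures from the Opteron example above.                           */
    unsigned cache_bytes = 64 * 1024;  /* 64 KiB instruction cache       */
    unsigned ways        = 2;          /* two-way set associative        */
    unsigned page_bytes  = 4 * 1024;   /* 4 KiB page                     */

    /* Index + offset span one "way" of the cache; only the low page_bytes
     * of an address are untranslated, so the remaining index bits come
     * from virtual bits that may differ between synonyms.               */
    unsigned way_bytes = cache_bytes / ways;       /* 32 KiB              */
    unsigned locations = way_bytes / page_bytes;   /* 8 possible places   */

    unsigned bits = 0;
    for (unsigned n = locations; n > 1; n >>= 1)   /* log2(8) = 3         */
        bits++;

    printf("%u virtual index bits -> check %u locations on a miss\n",
           bits, locations);
    return 0;
}
```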
