Cache Memory
Cache Structure Review
What is the structure of a Cache?
Please see my other notes for a more detailed explanation.
Direct-mapped cache: each memory block can be mapped onto exactly one cache line
Fully associative: a block can be placed in any cache line
N-way set associative: a block maps to one set and can be placed in any of the n lines (ways) of that set
The majority of processor caches today are direct-mapped, 2-way, or 4-way set associative
How are blocks found in the Cache?
Memory addresses are partitioned into several fields; the address bits used for locating a block in the cache are the tag, index, and block-offset bits (the valid bit is stored with each cache line rather than being part of the address). Note that fully associative caches do not use index bits because only one set exists. Each cache line also stores other status bits:
The reference (use) bits are used for LRU eviction
The dirty bit is used when a write-back policy is employed; after a write updates the data, the cached block is marked dirty and synchronized with the higher memory tiers when it is evicted.
Simply put, the index bits are used to determine the cache set that the data belongs to. Each cache line in a set has a valid bit; if the cache line is marked valid (usually indicated by the valid bit being set to 1), the data residing in the line (block) is valid and can be used. The address tag is then compared to the tag stored in the cache line. Matching tags indicate a cache hit; anything else is a cache miss. (A short code sketch of this lookup appears after the list below.)
In modern CPUs, these operations are typically done in parallel
If the total cache size is kept the same, increasing associativity increases the number of blocks per set, which decreases the number of index bits and increases the number of tag bits. Does that make sense? In an n-way set-associative cache, the n refers to the number of ways (cache lines) in each set. For example, assuming the total size of the cache remains constant, going from 2-way to 4-way doubles the number of cache lines per set. If the number of blocks per set increases, the number of sets required decreases; thus, fewer index bits are needed.
Index bits: which set does the data belong to?
Valid bit: is the data in the cache line valid?
Tag bits: is the block stored in this line the one the address refers to?
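To make the tag/index/offset split concrete, here is a minimal C sketch of the lookup described above. All of the parameters (32 KiB total size, 4-way set associative, 64-byte blocks, 32-bit addresses) and every name in it are assumptions chosen for the example, not a description of any particular processor.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed geometry: 32 KiB, 4-way set associative, 64-byte blocks. */
#define CACHE_BYTES  (32 * 1024)
#define WAYS         4
#define BLOCK_BYTES  64
#define NUM_SETS     (CACHE_BYTES / (WAYS * BLOCK_BYTES))  /* 128 sets  */
#define OFFSET_BITS  6                                     /* log2(64)  */
#define INDEX_BITS   7                                     /* log2(128) */

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[BLOCK_BYTES];
} line_t;

static line_t cache[NUM_SETS][WAYS];

/* Split a 32-bit address into block offset, set index, and tag. */
static void split(uint32_t addr, uint32_t *off, uint32_t *idx, uint32_t *tag)
{
    *off = addr & (BLOCK_BYTES - 1);
    *idx = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    *tag = addr >> (OFFSET_BITS + INDEX_BITS);
}

/* A hit requires a valid line in the indexed set whose tag matches. */
static bool lookup(uint32_t addr)
{
    uint32_t off, idx, tag;
    split(addr, &off, &idx, &tag);
    for (int w = 0; w < WAYS; w++)
        if (cache[idx][w].valid && cache[idx][w].tag == tag)
            return true;   /* cache hit */
    return false;          /* cache miss */
}
```

Note how the trade-off described above shows up here: with the same 32 KiB total but only 2 ways, NUM_SETS doubles to 256, so the index grows by one bit and the tag shrinks by one bit.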
Cache miss behavior
When an L1 cache miss occurs, the cache controller must fetch the data from the next level of the hierarchy (a larger cache or main memory) and fill the cache according to the cache policy.
This is VERY easy with direct-mapped
only one block is checked for cache hit, and only one can be replaced because every block in the main memory maps to exactly one cache line within the cache.
Strategies for fully associative and n-way associative
Random: spreads allocation uniformly by selecting candidate blocks at random (pseudorandom selection makes behavior reproducible, which can assist with debugging hardware)
LRU: Least Recently Used. Accesses to blocks are recorded to avoid throwing out data that will be used again shortly.
The block that has been unused for the longest period will be removed
This is a corollary of locality (a corollary is a proposition that follows from one already established): locality says that recently used blocks, and their neighbors, are likely to be used again soon, so the least recently used block is the one least likely to be needed in the near future.
FIFO: First in First Out
approximates LRU by removing the oldest cache blocks
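As a rough illustration of what "recording accesses" means for exact LRU, here is a small C sketch (the 4-way, 128-set geometry and all names are assumed for the example): each line keeps a timestamp of its last access, and the victim is the line whose timestamp is oldest.

```c
#include <stdint.h>

#define WAYS      4     /* assumed 4-way set-associative cache */
#define NUM_SETS  128   /* assumed number of sets              */

/* Last-access time for every line; a global counter acts as the clock. */
static uint64_t last_used[NUM_SETS][WAYS];
static uint64_t now;

/* Exact LRU must update bookkeeping on every hit, not just on misses. */
static void record_access(int set, int way)
{
    last_used[set][way] = ++now;
}

/* The victim is the way whose last access lies furthest in the past. */
static int lru_victim(int set)
{
    int oldest = 0;
    for (int w = 1; w < WAYS; w++)
        if (last_used[set][w] < last_used[set][oldest])
            oldest = w;
    return oldest;
}
```

The per-access update and the comparison across all ways are what make exact LRU expensive in hardware, which motivates the approximation described next.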
LRU is difficult to implement
A virtue of random replacement is that it is simple to build in hardware. As the number of blocks to keep track of increases, LRU becomes increasingly expensive and is usually only approximated.
A common approximation (often called pseudoLRU) has a set of bits for each set in the cache, with each bit corresponding to a single way in the cache.
(a way is a bank in a set-associative cache; there are four ways in a four-way set-associative cache)
When a set is accessed, the bit corresponding to the way containing the desired block is turned on; if all the bits associated with a set are turned on, they are reset except the most recently turned-on bit.
When a block must be replaced, the processor chooses a block from the way whose bit is turned off, oftentimes randomly
This approximates LRU because the block that is replaced will not have been accessed since the last time all of the set's bits were reset.
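A minimal sketch of the pseudo-LRU bits just described, again with assumed geometry and invented names: one use bit per way per set, turned on when the way is accessed, cleared (except the newest) when all are on, with the victim drawn from the ways whose bit is off.

```c
#include <stdint.h>
#include <stdlib.h>

#define WAYS      4     /* assumed 4-way set-associative cache */
#define NUM_SETS  128   /* assumed number of sets              */

/* One use bit per way, packed into the low WAYS bits of a byte per set. */
static uint8_t use_bits[NUM_SETS];

/* On each access, turn on the bit for the touched way; if that leaves
 * every bit on, reset them all except the most recently turned-on bit. */
static void touch(int set, int way)
{
    use_bits[set] |= (uint8_t)(1u << way);
    if (use_bits[set] == (1u << WAYS) - 1)
        use_bits[set] = (uint8_t)(1u << way);
}

/* Pick a victim from the ways whose bit is off (randomly among them). */
static int plru_victim(int set)
{
    int candidates[WAYS], n = 0;
    for (int w = 0; w < WAYS; w++)
        if (!(use_bits[set] & (1u << w)))
            candidates[n++] = w;
    return n ? candidates[rand() % n] : 0;  /* the fallback should not occur */
}
```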
What happens on a Cache Hit
- When the CPU tries to read from memory, the address is sent to the cache controller.
  - The lowest k bits of the address index a block in the cache.
  - If the block is valid and the tag matches the upper (m - k) bits of the m-bit address, that data is sent to the CPU.
- Caches are designed to optimize reads
  - The block can be read from the cache at the same time that the tag is read and compared
- Writes do not afford the same optimizations
  - A write cannot happen until the tag is read and a hit is determined
Write policies are often what distinguishes cache designs
The processor determines the size of the write
typically between 1 and 8 bytes
Two general types of cache write policies
write-through: information is written to the cache block and lower-level memory
one advantage is that it provides data coherency
especially important with multiprocessors (each with its own cache) to ensure data consistency
write-back: information is written only to the cache block
the modified block is only written to main memory when the cache block is removed (LRU, FIFO, or random)
***** Very Important *****
A status or dirty bit is used to limit the frequency of write-backs to main memory when a cache block is replaced
the status bit is set when a copy/write to main memory is needed (dirty) or not set when one is not needed (clean)
if clean, the block is not written back on a miss (because no changes were made)
requires fewer writes to memory and thus uses less memory bandwidth
favorable for multiprocessors
When a processor waits for a write-through to complete, this is called a write stall
write buffers are used to optimize and reduce write stalls
Write allocate: the block is allocated on a write miss, followed by the write-hit actions above. In this natural option, write misses act like read misses.
No-write allocate: in this apparently unusual alternative, write misses do not affect the cache. Instead, the block is modified only in the lower-level memory.
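The following C sketch pulls the write policies above together in one place. It is purely illustrative: the line layout and the lower_level_write, write_buffer_enqueue, and fetch_block stand-ins are invented for the example, not real hardware or library interfaces.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 64

typedef struct {
    bool     valid;
    bool     dirty;            /* set on a write under write-back */
    uint32_t tag;
    uint8_t  data[BLOCK_BYTES];
} line_t;

/* Stand-ins for the lower-level memory, the write buffer, and a block fill. */
static void lower_level_write(uint32_t a, const void *s, int n) { (void)a; (void)s; (void)n; }
static void write_buffer_enqueue(uint32_t a, const void *s, int n) { (void)a; (void)s; (void)n; }
static line_t *fetch_block(uint32_t a) { static line_t l; (void)a; l.valid = true; return &l; }

/* Write hit, write-through: update the line AND send the write down,
 * through a write buffer so the processor does not stall. */
static void write_hit_through(line_t *l, uint32_t addr, int off, const void *src, int n)
{
    memcpy(&l->data[off], src, n);
    write_buffer_enqueue(addr, src, n);
}

/* Write hit, write-back: update only the line and mark it dirty;
 * lower-level memory sees the data only when the line is evicted. */
static void write_hit_back(line_t *l, int off, const void *src, int n)
{
    memcpy(&l->data[off], src, n);
    l->dirty = true;
}

/* Eviction under write-back: only dirty (modified) lines are copied down. */
static void evict(line_t *l, uint32_t block_addr)
{
    if (l->valid && l->dirty)
        lower_level_write(block_addr, l->data, BLOCK_BYTES);
    l->valid = l->dirty = false;
}

/* Write miss: write allocate fetches the block and then performs the
 * write-hit actions; no-write allocate bypasses the cache entirely. */
static void write_miss(uint32_t addr, const void *src, int n, bool write_allocate)
{
    if (write_allocate)
        write_hit_back(fetch_block(addr), (int)(addr % BLOCK_BYTES), src, n);
    else
        lower_level_write(addr, src, n);
}
```

Write-back caches commonly pair with write allocate, and write-through caches with no-write allocate, although either combination is possible.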
Hence, we can organize six cache optimizations into three categories:
- Reducing the miss rate: larger block size, larger cache size, and higher associativity
- Reducing the miss penalty: multilevel caches and giving reads priority over writes
- Reducing the time to hit in the cache: avoiding address translation when indexing the cache
Cache Misses
- Misses can be sorted into three categories:
  - Compulsory: the first access to a block cannot be in the cache, so the block must be brought into the cache. These are also called cold-start misses or first-reference misses.
  - Capacity: if the cache cannot contain all the blocks needed during program execution, capacity misses (in addition to compulsory misses) will occur because blocks are discarded and later retrieved.
  - Conflict: if the block placement strategy is set associative or direct mapped, conflict misses will occur because a block may be discarded and later retrieved if too many blocks map to its set.
    - Also known as collision misses (similar to hash collisions when using a modulus)
    - Hits in a fully associative cache that become misses in an n-way set-associative cache are due to more than n requests on some popular sets.
- It may be difficult to distinguish capacity misses from conflict misses; one way is to remember that the modulus placement can map several blocks to the same set, filling that set while other sets remain less full (a small sketch follows below).
  - This is not an issue of the whole cache running out of room, but of too many blocks mapping to the same set.
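To make conflict misses concrete, here is a tiny sketch using an assumed direct-mapped geometry (128 sets of one line each, 64-byte blocks): two addresses exactly one cache size apart land in the same set, so alternating between them misses every time even though the rest of the cache sits empty. A fully associative cache of the same size would hold both.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_SETS    128   /* assumed direct-mapped cache: one line per set */
#define BLOCK_BYTES 64

/* The modulus placement rule mentioned above. */
static uint32_t set_of(uint32_t addr)
{
    return (addr / BLOCK_BYTES) % NUM_SETS;
}

int main(void)
{
    uint32_t a = 0x0000;
    uint32_t b = a + NUM_SETS * BLOCK_BYTES;   /* 8 KiB later: same set, different tag */

    printf("set(a) = %u, set(b) = %u\n", set_of(a), set_of(b));  /* both print 0 */
    /* Alternating accesses a, b, a, b, ... evict each other from the single
     * line in set 0: conflict misses, while the other 127 sets stay empty. */
    return 0;
}
```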
Optimizing cache misses
- 1: Increase cache block size
  - Larger blocks lead to fewer misses because of locality
  - Reduces compulsory misses
  - May increase the miss penalty; it is more expensive to retrieve the larger block from main memory
  - Larger blocks are favored with high-latency/high-bandwidth memory: more bytes are grabbed per miss
  - Smaller blocks are favored with low-latency/low-bandwidth memory
- 2: Larger cache
- 3: Higher associativity
  - The 2:1 cache rule of thumb is that a direct-mapped cache of size N has about the same miss rate as a two-way set-associative cache of size N/2.
- 4: Multilevel caches
  - Local miss rate: the number of misses divided by the total number of memory accesses to this cache level
    - For the first-level cache this equals Miss Rate(L1), and for the second-level cache it equals Miss Rate(L2)
  - Global miss rate: the number of misses in the cache divided by the total number of memory accesses generated by the processor, i.e., the fraction of all processor accesses that miss at this level
    - This is still Miss Rate(L1) for the first-level cache, but for the second-level cache it is Miss Rate(L1) * Miss Rate(L2) (a worked example follows below)
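A quick worked example of the local vs. global distinction; the counts are made up purely for illustration: suppose 1000 processor memory accesses produce 40 misses in L1, of which 20 also miss in L2.

```c
#include <stdio.h>

int main(void)
{
    /* Assumed counts: 1000 CPU accesses, 40 L1 misses, 20 L2 misses. */
    double accesses = 1000, l1_misses = 40, l2_misses = 20;

    double l1_rate   = l1_misses / accesses;    /* 4%: L1 local = L1 global      */
    double l2_local  = l2_misses / l1_misses;   /* 50%: L2 only sees L1's misses */
    double l2_global = l2_misses / accesses;    /* 2% = 4% * 50%                 */

    printf("L1 miss rate (local = global): %.1f%%\n", l1_rate * 100);
    printf("L2 local: %.1f%%, L2 global: %.1f%%\n", l2_local * 100, l2_global * 100);
    return 0;
}
```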