Introduction

Major computing advancements over the last 70 years can primarily be attributed to two things:

- Compilers that let the same source code be compiled automatically for heterogeneous platforms
- Vendor-independent operating systems such as Unix and Linux, which lower the cost of bringing out a new architecture

SPEC benchmark performance improved by roughly 50% per year during the 1990s. These gains were enabled largely by organizational and architectural improvements and had a lot to do with the emergence of reduced instruction set computer (RISC) architectures.

Robert H. Dennard observed that the power requirements of a given silicon area stayed roughly constant even as the number of transistors increased, because the transistors themselves were smaller (Dennard scaling). This scaling ended in 2003 and forced manufacturers to consider multiple cores.

In 2015, Moore’s law ended, making it more difficult to scale performance with technological advancements alone. Consequently, recent performance improvements are only about 3.5% per year.

Moore originally predicted that the number of transistors we could fit in a given area would double every year; in 1975, he amended this to doubling every two years.

Classes of Computers:

| Class | System Price | Microprocessor Price | Critical System Design Issues |
| --- | --- | --- | --- |
| Personal Mobile Device | $100 - $1000 | $10 - $100 | Cost, energy, media performance, responsiveness |
| Desktop | $300 - $2500 | $50 - $500 | Price-performance, energy, graphics performance |
| Server | $5000 - $10,000,000 | $200 - $2000 | Throughput, availability, scalability, energy |
| Cluster/Warehouse-Scale Computer | $100,000 - $200,000,000 | $50 - $250 | Price-performance, throughput, energy proportionality |
| Internet of Things (IoT)/Embedded | $10 - $100,000 | $0.01 - $100 | Price, energy, application-specific performance |

Parallelism is the driving force of design across all computer classes: computing resources are used to perform multiple tasks simultaneously. There are two types of parallelism in applications: data-level parallelism arises because many data items can be operated on at once, and task-level parallelism arises because different work tasks can operate independently.

Hardware can exploit these two kinds of application parallelism in four major ways:
    - Instruction-level parallelism exploits data-level parallelism at modest levels using ideas like pipelining and out-of-order execution.
    - Vector architectures and devices like GPUs exploit data-level parallelism by applying a single instruction to a collection of data in parallel (SIMD intrinsics; see the sketch after this list).
    - Thread-level parallelism exploits data-level and task-level parallelism by executing parallel threads in a common hardware model.
    - Request-level parallelism exploits data-level and task-level parallelism using largely decoupled tasks specified by the application or operating system.
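
As a concrete, minimal sketch of the SIMD idea (assuming an x86 compiler with SSE support, such as gcc or clang; the data values are arbitrary), a single `_mm_add_ps` instruction below adds four floats at once:

```c
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);    /* load four packed single-precision floats */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb); /* one ADDPS instruction operates on all four lanes */
    _mm_storeu_ps(c, vc);

    for (int i = 0; i < 4; i++)
        printf("%g ", c[i]);        /* prints: 11 22 33 44 */
    printf("\n");
    return 0;
}
```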

Flynn's Taxonomy
    - SISD (Single Instruction, Single Data) - Standard sequential computer with ILP (Instruction Level parallelism)

    - SIMD (Single Instruction, Multiple Data) - the same instruction is executed by multiple processors using different data streams. SIMD exploits data-level parallelism (DLP) by applying the same operations to multiple data items in parallel; it is found in vector architectures and GPUs.

    - MISD (Multiple Instruction, Single Data) - not implemented by any commercial processor, but it rounds out the hierarchy.

    - MIMD (Multiple Instruction, Multiple Data) - each processor fetches its own instructions and operates on its own data. MIMD can target task-level parallelism in both tightly coupled and loosely coupled architectures. An example of a tightly coupled MIMD architecture might be a multicore processor, while a loosely coupled architecture would be a cluster machine.
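
As a rough sketch of the tightly coupled MIMD case (thread-level parallelism), each POSIX thread below fetches its own instructions and works on its own slice of data; the thread count and the partial-sum work are arbitrary choices made here for illustration. Compile with `-pthread` on a POSIX system.

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

/* Each thread runs its own instruction stream on its own data slice (MIMD). */
static void *partial_sum(void *arg) {
    long id = (long)arg;
    long sum = 0;
    for (long i = id * 1000; i < (id + 1) * 1000; i++)
        sum += i;
    printf("thread %ld: partial sum = %ld\n", id, sum);
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, partial_sum, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);
    return 0;
}
```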

Computer Architecture
- ISA (Instruction Set Architecture)
    - The boundary between compiler software and hardware. Each instruction has its own opcode and task (or set of tasks/microinstructions) that it performs.
    - Several ISAs are in use today: x86, Advanced/Acorn RISC Machine (ARM), and RISC-V.
    - Most ISAs in use today are RISC; although x86 began as a CISC architecture, its implementation in modern machines is as a RISC.
    - ISAs typically differ in a few domains: the number of registers, addressing modes, instruction widths, etc.
- Organization
    - Memory, bus structure, CPU, etc. Sometimes referred to as the microarchitecture.
- Hardware
    - Our definition of architecture also includes the hardware: the computer's detailed logic design and packaging technology. Many machines differ in hardware and organization but not in ISA, which allows the same binaries to run on different versions of processors that support the same ISA.

Early on, the term architecture referred to instruction set design, but the definition has been extended as people recognized that these other design aspects are very important for meeting our computational goals.

Bandwidth (how much work is done in a given amount of time): processor throughput has improved by roughly 32,000 to 40,000 times since the 1980s. Latency (the time between the start and completion of an event) is more difficult to optimize; gains are only about 50 to 90 times for processors and networks, and 8 to 9 times for memory and disks.

  • Memory has made the least improvement of all architecture components in both latency and bandwidth

Power & Energy
- Energy consumption is often the biggest design challenge facing modern computers.
- Power has to be brought into the system and distributed to the different parts of the chip. Power is dissipated from the chip as heat, and that heat has to be removed.

- Maximum Power
    - We are interested in the maximum power a system might need. If the processor tries to draw more power than the power supply can provide, we might experience a failure.
- Thermal Design Power
    - Additionally, we need to know the system's sustained power consumption. This metric is called the thermal design power (TDP) and is used to determine how much cooling the system requires. TDP is not the peak power consumption; it is closer to the highest power consumption you would expect when running real applications. If the system gets too hot and is about to overheat, it may take steps such as reducing the processor clock rate to control power consumption.
- Energy Efficiency
    - The third factor is energy efficiency. Recall that power is just energy over time. Energy is often a better metric than power for comparing two different processors because we can compute the amount of energy that is consumed for a particular task we care about. For instance, we might have a processor that requires 20% less power but takes twice as long to execute a particular task. In that case, the slower processor will actually require more energy to complete the task.
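    - Working through that example: energy = power * time, so the "lower-power" processor consumes (0.8 * P) * (2 * T) = 1.6 * P * T, about 60% more energy than the faster processor for the same task.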

- Dynamic Energy
    - Energy to switch a transistor from 0 -> 1 or 1 -> 0:
    - Energy[dynamic] = 1/2 * Capacitive Load * (Voltage^2)
- Dynamic Power
    - Power[dynamic] = 1/2 * Capacitive Load * (Voltage^2) * Frequency_switched
    - ***Reducing the clock rate reduces power, not energy***
The primary energy consumption of complementary metal-oxide-semiconductor (CMOS) chips has traditionally come from switching transistors.

- The energy required to flip a transistor from 0 to 1 or 1 to 0 depends on the capacitive load and the square of the voltage applied to the transistor, as shown in the dynamic energy formula above. This is also called dynamic energy.

The dynamic power consumption is then just this per-switch energy times the frequency at which the transistors are flipped.

Notice here that reducing the clock rate will reduce the frequency with which transistors get flipped, reducing the power, but it has no effect on the overall energy consumed. Voltage is a factor in both power and energy, and so, many systems have reduced their overall voltage over time to reduce energy consumption.
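
A small back-of-the-envelope calculation makes this concrete; the capacitance, voltage, and frequency values below are made-up, illustrative numbers rather than measurements of any real chip:

```c
#include <stdio.h>

/* Dynamic energy per 0->1 or 1->0 transition: 1/2 * C * V^2
 * Dynamic power: 1/2 * C * V^2 * f (energy per switch times switching rate) */
static double dyn_energy(double cap, double volt) {
    return 0.5 * cap * volt * volt;
}

static double dyn_power(double cap, double volt, double freq) {
    return dyn_energy(cap, volt) * freq;
}

int main(void) {
    double C = 1e-9;   /* illustrative switched capacitance: 1 nF */
    double V = 1.0;    /* supply voltage: 1 V */
    double f = 2e9;    /* clock rate: 2 GHz */

    printf("baseline:      E = %.2e J/switch, P = %.2f W\n",
           dyn_energy(C, V), dyn_power(C, V, f));

    /* Halving the clock rate halves power, but the energy per switch
     * (and total energy for a fixed amount of work) is unchanged. */
    printf("half clock:    E = %.2e J/switch, P = %.2f W\n",
           dyn_energy(C, V), dyn_power(C, V, f / 2));

    /* Lowering the voltage reduces both energy and power quadratically. */
    printf("0.8 V supply:  E = %.2e J/switch, P = %.2f W\n",
           dyn_energy(C, 0.8), dyn_power(C, 0.8, f));
    return 0;
}
```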

- More transistors mean more bit flips, resulting in higher power consumption. Power dissipates as heat, which has to be removed, and we have reached the ceiling of air cooling's ability to remove that heat.

- Methods for more efficient energy design
    - Turn off the clocks of inactive modules
    - Dynamic Voltage Frequency Scaling (DVFS)
        - Scaling down the clock rate and voltage during periods of low activity (idle processors/modules)
    - Low-Power states
        - Idle computers allow us to use low-power states for devices like DRAM and disk
    - Overclocking, or so-called turbo mode
        - Devices normally running at lower clock rates can temporarily increase their clock rate when the load is very high
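
As a small illustration of DVFS in action (assuming a Linux machine that exposes the standard cpufreq sysfs interface; the path below may be absent on some systems), the current operating frequency of core 0 can be read directly:

```c
#include <stdio.h>

/* Reads core 0's current DVFS frequency (in kHz) from the Linux cpufreq
 * sysfs interface. Requires a kernel with cpufreq support. */
int main(void) {
    const char *path = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq";
    FILE *f = fopen(path, "r");
    if (!f) {
        perror("cpufreq interface not available");
        return 1;
    }
    unsigned long khz;
    if (fscanf(f, "%lu", &khz) == 1)
        printf("cpu0 is currently running at %.2f GHz\n", khz / 1e6);
    fclose(f);
    return 0;
}
```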
- Static Power
    - You can think of static power as the constant power needed to keep the system on.
    - It is proportional to the number of devices: as the transistor count increases, so does the static power requirement.
    - It can be as much as 25-50% of total power consumption.
    - The only way to reduce static power is to turn off the power supply to a device, known as "power gating." The tradeoff is the overhead required when powering the device back on.
    - Power[static] = Current[static] * Voltage
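    - For instance (illustrative numbers): 10 A of leakage current at a 0.9 V supply gives Power[static] = 10 * 0.9 = 9 W, even when no transistors are switching.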

**Refer to the figure showing the energy comparison for various operations** (values in picojoules, pJ):
- An 8-bit add requires ~0.03 pJ
- A 32-bit DRAM read requires ~640 pJ (~21,000x the add)
- A 32-bit SRAM read requires ~5 pJ (~170x the add)

- This shows that SRAM accesses are much more energy efficient than DRAM accesses, which is part of why caches are built from SRAM.
    - SRAM is significantly more expensive per bit, which is why main memory uses DRAM while caches use SRAM.
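
A quick calculation with the figure's numbers (using a made-up instruction mix purely for illustration) shows how memory accesses come to dominate energy:

```c
#include <stdio.h>

/* Per-operation energies from the figure above (picojoules). */
#define E_ADD8   0.03   /* 8-bit add                 */
#define E_SRAM32 5.0    /* 32-bit SRAM (cache) read  */
#define E_DRAM32 640.0  /* 32-bit DRAM read          */

int main(void) {
    /* Hypothetical mix: 1,000,000 adds, 100,000 cache reads, 10,000 DRAM reads. */
    double adds = 1e6, sram_reads = 1e5, dram_reads = 1e4;

    double e_add  = adds * E_ADD8;
    double e_sram = sram_reads * E_SRAM32;
    double e_dram = dram_reads * E_DRAM32;
    double total  = e_add + e_sram + e_dram;

    printf("adds:  %6.1f%% of energy\n", 100 * e_add / total);
    printf("SRAM:  %6.1f%% of energy\n", 100 * e_sram / total);
    printf("DRAM:  %6.1f%% of energy\n", 100 * e_dram / total);
    return 0;
}
```

Even though adds outnumber DRAM reads 100 to 1 in this mix, the DRAM reads account for over 90% of the total energy.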

Performance
- Typically measured as response time (latency) or throughput

There are two main types of locality in computer programs:

- Temporal locality states that recently accessed items are likely to be accessed again in the near future. Examples include loops, or data at the top of the stack being reused because of its last in, first out (LIFO) organization.

- Spatial locality says that items whose addresses are near one another tend to be referenced close together in time. An example of spatial locality might be sequential execution of instructions or sequential access of array elements.
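
A small sketch showing both kinds of locality (the array dimensions are arbitrary): the accumulator `sum` exhibits temporal locality because it is reused on every iteration, and the row-major traversal exhibits spatial locality because it touches consecutive addresses:

```c
#include <stdio.h>

#define ROWS 1024
#define COLS 1024

static double grid[ROWS][COLS];

int main(void) {
    double sum = 0.0;              /* temporal locality: reused on every iteration */

    /* Spatial locality: row-major traversal touches consecutive addresses,
     * so each cache line fetched is fully used before moving on. */
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += grid[i][j];

    /* A column-major traversal (outer loop over j, inner over i) would jump
     * COLS * sizeof(double) bytes between accesses and make far poorer use
     * of each cache line. */

    printf("sum = %f\n", sum);
    return 0;
}
```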
