Digital Computers
Back to UWaterloo
Summary
Assembly and Data
- We learned ARM, and general RISC and CISC programming
- Big endian describes a way to order bytes of a word, where the the word reads left to right, and is the correct value of the 32-bit integer representation
- Little endian reverses the byte order, but bits in each byte remain the same
- there are many addressing modes describing ways to manipulate instruction operands
- immediate, absolute, register, indirect, index, base with index, pc-relative, pre-index, pre-index with writeback, post-index
- memory mapped I/O maps inputs and outputs to addresses
- I/O devices on a bus each have an interface, with status bits like KIN, DOUT for signaling readiness
- for interupt during instruction
i
, finish i
, set IE bit to 0 in Processor Status, PC <- start of ISR, run ISR, PC <- i + 4, set IE = 1, resume program
- Vectored interrupts allow for fine-grained control of device interrupt priority by have Interrupt Enable and Interrupt Request (IRQ) signal for each device
- A bus can have multiple lines, system bus is an inter-connection network connecting I/O devices, processor and memory
- for data lines on a bus, you need to allow for setup and hold time of signals on the line
- address decoder for I/O devices define an address range for each device using ECE124 logic
- busses obey a protocol defined by the system, to support bus transfers with precision
- system bus has control lines, address line and data lines for placing requests or placing responses for interactions between I/O and processor
- bus skew is the time required to account for varying bus propagation delay between lines (needed in order to ensure all lines hold correct values when, for example, a master-ready signal reaches a slave)
- Synchronous Bus Read is a read protocol so that the system has checks for correctness
- processor drives bus, places address and read command on lines, and asserts master ready after skew delay, slaves decode address
- cycle 1: requested slave is done decoding, and fetches requested data (this can take multiple clock cycles, n), master waits for slave-ready signal
- cycle 2 + n: When the data is fetched, data is placed on data line (incurs setup time), and held as slave-ready signal is sent
- cycle 3 + n: master receives slave ready, reads the data and then de-asserts master-ready.
- cycle 4 + n: slave deasserts slave-ready and clears data line
- Asynchronous Bus Read follows same protocol as synchronous but without the clock
- writes are opposite, master places data on data line, master-ready signal indicates valid data and address values
- there is serial transmission of data (1 bit at a time) or parallel transmission
- PCI bus (Peripheral Component Interconnection) allows for plug-and-play connectivity, the PCI bridge is the interface for processor, memory to the PCI bus
- a bridge interconnects two systems of different architecture (think different clock speeds, voltages, data-width 32vs64-bit)
- USB (Universal Serial Bus) supports different transfer-rates, designed for plug-and-play, and can form a tree of hubs connected to the computer via the root hub, where the leaves of the trees are the I/O devices
- by eliminating the need to arbitrate between USB devices, using only polling, USB hubs can be easily manufactured and chained
- the PCI bus has SATA, Ethernet, USB hub
Memory System
- memory is stored in arrays of cells that hold an on off value (1 bit, stored in a transistor or capacitor)
- asynchronous DRAM chip uses row and column addresses to access bytes, asserted by the RAS/CAS signals from the external memory controller
- fast page mode enables quick sequential burst reads (removes address stablization overhead)
- synchronous DRAM chips use clock, and includes a refresh signal for refreshing capacitors
- buffers allow next steps to occur while the data is being transferred to/from output registers
- synchronous is good for burst reads since clock cycles dictate sequential reads, where async requires reasserting CAS signals
- both chips use RAS and CAS signals to enable latches for holding column/row addresses
- complex SDRAMs need an interface, which converts regular bus lines to the necessary signals (RAS, CAS, Chip Select)
- Chip select signal is important for selecting specific DRAM chips within a larger array of DRAM chips
Caches
- caches sit in front of main memory and only take one clock cycle to respond with data
- addresses are broken down, word is rightmost always and determines what byte from a block of memory it refers to
- the block portion (not used in fully associative) decides which block to fill
- tag bits need to exist so information is not lost about the uniqueness of each address
- direct mapping maps addresses deterministically using modulos; address: [Tag, Block (portion used for modulos) = log2(number of blocks), Word = log2(block size in bytes)]
- fully associative cache can map any memory block anywhere in the cache, so no
Block
bits; address: [Tag, Word = log2(block size in bytes)]
- set associative cache maps memory blocks to sets which are fully associative (tag consumes the extra address bits compared to direct mapping); address: [Tag, Block (portion used to determine set) = log2(number of sets), Word]
- replacement algorithms decide what to evict when the cache is full, or when the set is full for set associative
- LRU, Random, Ordered
- valid bit is added to each block to denote that is holds a value that is relevant to the current executing program
- dirty bit is added to each cache block to denote that a write back must occur upon eviction
- it is added so that we do not have to write-through to memory for each write
- write allocate (rather than no write allocate) says that we write-through to memory without caching the value
Virtual Memory
- Virtual memory is an abstraction for a 32 bit address space, where we don't have enough physical memory to fill the address space
- think of main memory (implemented by RAM) as a cache in front of disk drives
- The Memory Management Unit (MMU) holds information for handling the conversion of virtual addresses to physical ones
- The Direct Memory Access (DMU) is a component that handles frequent/sequential I/O operations so the processor doesn't have to
- Pages are fixed-length units of data, (around 4K bytes)
- Page table converts virtual addresses to physical addresses, by finding the page offset (assuming the page is in main memory)
- like in caches, we parse the address, upper portion for page number and lower for offset (the specific division depends on the total number pages we have, and log2(n) bits to represent their number)
- page table translation replaces the upper bits of the virtual address with the actual frame bits in main memory, maintaining the lower bits of the address for specific address within the page
- Virtual memory read takes two main memory reads (one for table and one for actual page access)
- Translation Lookaside Buffer (TLB) acts as a cache in front of the page table to reduce the number of page table reads, and uses LRU for cache algorithm
Processors
- we use a 5 step processor: fetch, decode, execute, memory, writeback
- the hardware depends on 6 components: register file, ALU, processor-memory interface, instruction address generator (with PC), Instruction Register, Control Circuitry
- processor-memory interface uses MFC (memory finished) to signal to the processor that the memory operation is complete
- pipelining allows multiple instructions to be processed at once, utilizing all components of the processor for different instructions, while maintaining instruction ordering as needed
- data hazards exist between instructions, like if there is a read after write (RAW), in which case old values can be used
- memory stalls (freezing subsequent instructions in the pipeline) occur when accessing memory and the cache misses (so the operation takes more than one clock cycle)
- branching causes a change in PC, which is different than the assumed [PC] + 4
- branching computes the new PC in the execute stage by default, which causes two wasted cycles and requires the processor to squash the subsequent incorrect instructions
- branch penalty is delayed by adding an adder in the decode stage so that the ALU need not compute the new address given the offset, meaning there is only 1 cycle for branch penalty
- for conditional branching, the branch penalty is only incurred when the condition is met and we need to branch to the newly computed PC
- the place where the instruction after the branch instruction is placed is called the branch delay slot
- if we load something useful into the branch delay slot, then there is no branch penalty
- this can be done by reordering instructions assuming no data dependencies
- sometimes this can't occur, in which case a NoOPeration instruction is loaded
- branch prediction allows us to optimize further by guessing what instruction to place in the branch delay slot
- static branch prediction chooses to always take or always reject a branch
- dynamic branch prediction uses previous decision to influence its guess, whose history is stored in a finite state machine
- branch predition still requires the new PC address, if it wants to predict that the branch is followed, and does not have the newly computed address until after the decode hardware and computed it
- we can a Branch Target Buffer (BTB) that acts as a cache for final address calculations so that branch prediction can occur before the decode stage is complete, meaning there is no branch delay
- superscaler architecture allows for multiple instructions to be processed at each stage
- interleaving allows for 2 instructions/cycle throughput, assuming no branches and no data dependencies
- you have to avoid deadlock where two instructions depend on each other
- for the ALU, multiplication is computed using series of addition operations
- booth encoding can reduce the number of addition operations, especially for numbers with lots of sequential 1s (small numbers in signed encoding)
- conversion uses the following table: at each position
i
in a binary number, 00 = 0, 01 = +1, 10 = -1, 11 = 0
Notes
Part 3: ISA & Addressing Modes
Intro to Addressing
- binary prefixes, 2^10 kibi, 2^20 mebi, 2^30 gibi ...
- memory systems do not address single bits, they have an address space, range of addresses
- byte addressable systems refer to bytes or a word for instructions
- word addressable memory systems are rare, instructions don't have access to individual bytes, only words
- commonly, word length is 32 bits can be encoded for types of information (integers, ASCII or Unicode)
- byte ordering defines how addresses are assigned to bytes of a word
-
big-endian byte ordering assigns the lower byte address to more significant bytes within a word (more sequential)
-
little-endian is opposite (like how numbers are read)
- word alignment structures words 4 addresses apart
Instruction Sets
- computer must have instructions capable of performing data transfers, arithmetic, sequence control, I/O
- an Instruction-set architecture (ISA) defines set of opcodes and native commands by a processor (ARM, x86, MIPS, SPARC, RISC-V)
- RISC ISA, each instruction fits in a word and memory operands are accessed only using load and store instructions
- all operands for arithmetic and logical operations must be in processor registers
- CISC ISA has variable length for instructions, many instructions can access memory direction using memory addresses
Representative Notations
-
Register Transfer Notation is a way of expressing semantics of instructions as data transfers
- memory locations are symbolic names like
LOC1
- Registers are identified by register names
R0
- contents of memory locations are denoted by placing square brackets
[LOC1]
- control signals are symbolic names
T1
- assignment statements transfer/manipulate data
R1 <- [LOC1]
-
Assembly Language Notation represents machine instructions and programs, to express operations, source and destination operands
-
Load R1, LOC1
load into R1 from LOC1
-
Add R2, R3, R4
add R3 and R4 and save into R2
-
Store R1, LOC1
is BACKWARDS, store contents of R1 into LOC1
-
Subtract R2, R2, #1
= R2 <- [R2]-1
-
Branch_if_[R2]>0 LOOP
= If(T1>0) where T <- [R2] then PC<-LOOP
used for looping until conditional
-
Clear
= R2 <- 0
Addressing Modes
- addressing modes are different ways to specify the location of instruction's operand
-
effective address is the location of an operand, which differs based on the addressing mode
- Immediate
#Value
, Operand = Value
- Register
Ri
, EA = Ri where i is any integer and R is a keyword
- Absolute/Memory Addressing
LOC
, EA = LOC
is exact address
- Register Indirect Addressing
(Ri)
, EA = [Ri]
evaluates the value in the register (or mem location)
- Index addressing
X(Ri)
, EA = [Ri] + X
is used when the operand address is a known constant offest from a known mem location, useful for arrays
- Base with index addressing
(Ri, Rj)
, EA = [Ri] + [Rj]
, useful when index offset is non constant- PC-relative addressing is a special case of index addressing, where Ri = PC, useful for branching
- Auto increment addressing simplifies successive address loading:
(Ri)+
, EA = [Ri]
; increment Ri
- increment is dependent on addressing system (byte vs word), and post-increments
-
Load R2, (R3)+
= R2 <- [R3]; R3 <- [R3] + #4
- decrementing is similar, but it decrements prior to the register being read
- both are useful for dealing with stacks a queues
Stacks
- resides in main memory, top of stack is the lowest memory address, pointed to by the special CPU register called the stack pointer
- Push is
Subtract SP, SP, #4; Store R2, (SP)
, Pop is Load R1, (SP); Add SP, SP, #4;
- Subroutines are blocks of instructions that are executed repeatedly for modularization
- invoked by executing a
Call
instruction, executes, returns to the main program by Return
instruction
- implemented by subroutine linkage method, stored on the stack or in a special register called the link register
- a call is a special branch instruction that invokes a subroutine by storing contents of PC to the link register and branching to the new target address, then returns the PC to the value in the link register
- subroutines can be nested, so we need a way to store multiple link registers, which is done so on the stack (raises issues with stack overflows if full or errors if the stack is not properly maintained)
- subroutines can take parameters, passed through registers (by value), memory locations (by reference) or passed on the stack (global namespace)
- locations at the top of the processor stack is the stack frame, private namespace by subroutines
-
frame pointer (FP) register enables access ot private work space
- stack frame consists of all items pushed onto the stack before or during the subroutine, including parameters and return address before and frame pointer, local variables and register values modified during the subroutine
- the frame pointer indicates start of new stack frame where parameters are stored at
n(FP)
and local vars at -n(FP)
, where n is a multiple of 4
Part 4: ARM
ARM ISA Characteristics
- Advanced RISC Machine is most common
- RISC has fixed length instructions, load and store instructions access memory, arithmetic/logic instructions operate only on registers
- requires less transistors than CISC processors in personal computers
- has 16 registers, 15 general purpose and 1 program counter at R15
- CPSR (Current program status register) holds N-Negative, Z-Zero, C-Carry, V-Overflow, (conditional codes, all are set when event occurs else cleared) interupt disable, 5 status registers
Addressing Modes
- all addressing modes are derivatives of indexing addressing mode, where effective address is the sum of [Rn, #offset]
- PC-relative:
LDR Rd, Item
is semantically Rd <- [[PC] + Item]
- for ARM, processors are pipelined, by the time the load is executed, the next instruction has already been fetched so
[PC] = PC+8
pointing two instructions ahead of load
- Pre-index:
LDR Rd, [Rn, #offset]
is semantically Rd <- [[Rn] + #offset]
- Pre-indexing with writeback:
LDR Rd, [Rn, #offset]!
is semantically Rd <- [[Rn] + #offset]; Rn <- [Rn] + #offset
- can be used for easier iteration through sequential memory
- post-index:
LDR Rd, [Rn], #offset
is semantically Rd <- [[Rn]]; Rn <- [Rn] + offset
; increments index after loading
Memory Operations
- when loading with fewer than 32 bits, zero extensions are used to fill the rest of the word with 0s, signed extension fills with the most significant bit loaded into remaining bits
- Regular: LDR/STR for words, H suffix (LDRH) for half words (zero extended), B suffix for bytes (zero), SB/SH for sign extended versions
- Multiple word load/store:
LDMIA R10!, {R0, R1, R6, R7)
, IA is opcode for increment after, R10 holds the memory location to start loading from, and loads words sequentially into specified registers. IA makes R10 hold [R10] + 4*n for n registers
Instructions
All instructions can take registers or literals
-
ADD R0, R2, R4
: R0 <- [R2] + [R4]
-
SUB R0, R3, #17
: R0 <- [R3] - 17
- shifts and rotates can be done on registers: (logical) LSL, LSR, (arithmetic) ASR, (rotate) ROR
- arithmetic shifts preserve sign, logical shifts clear vacated bits
-
ADD R0, R1, R2, LSL #4
: R0 <- [R1] + 2^4*[R2]
-
MOV R0, R1
: R0 <- [R1]
-
TST R0, R1
: bitwise AND of [R0] and [R1] and sets condition code Z, good for checking status bits
-
TEQ R0, R1
: Z = [R0] == [R1] ? 1 : 0
-
CMP R0, R1
: [R0] - [R1] and update condition code accordingly
-
B{Condition Suffix} LOC
for branching; see list of condition suffixes
- conditional suffixes can be attached to other instructions too, executing based on CPSR flags (N,Z,V,C)
-
CMP R0, R1; MOVLT R0, R1
: R0 <- R1 if R0 < R1
Assembler
Assembler converts source program to an object program in machine language
label operation operand(s) comment
- operations can either be mnemonics, labels and names are in a symbol table
- assembler directives does stuff during translation:
- define constants
- reserve storage locations for data values
- initialize data values stored in memory
- indicate where in memory a program should be located
- assembler directives depend on the specific assembler
- AREA specifies start of CODE/DATA
- ENTRY specifies start of program execution
- DCD label and initialize data operands
- EQU declares symbolic names for constants
- RN assign COUNTER RN 3
- pseudo-instructions are accepted by assembler and are synthesized to actual instructions
-
LDR R0, =ADDRESS
: LDR R0, #0x127, for ADDRESS=0x127
- example: push/pop for stacks
- LDMIA => POP, STMDB => PUSH
- Subroutine Linkage with branch-and-link:
BL TARGET
- value of PC is store in R14, which is the link register (LR) for returning after a subroutine
- branch to TARGET
- multiple passes are necessary for assembly because of forward references
- two-pass assembly first generates symbol table, then substitute values in labels
- linkers and loaders allow multiple modules to assemble into one binary
- each module is assembled relative to location 0, then linker adds appropriate offset once code location is determined for each module
- relative addressing and some instructions are position independent, while referencing external labels are position dependent
Conditional
ARM - conditionals
- ROR ROL by four
- must use STR and LDR for memory
- LDMIA SP! or STMDB SP! for stacks
- 15 PC, 14 LR, 13 SP
- assemble each module as if starts from 0
- linker adds appropriate offsets in memory
Part 5: Basic IO
A system bus is an example of an interconnection network, joining I/O devices, memory devices and CPU
- memory mapped I/O treats I/O devices as part of the address space
- allows us to apply memory instructions for I/O
- I/O device interface is a circuit between a device and the interconnection network
- provides means for data transfer (DATA register) and status (STATUS REGISTER) and control (CONTROL register) information
- program controlled I/O and interrupt-driven I/O use memory mapped I/O
Program Controlled I/O
- program-controlled I/O, I/O interaction is realized by a program, through an interface listening for input
- signal protocol is needed for proper timing of inputs
- input device, Keyboard, sends 'ready' signal, processor Loads from DATA register, output Display, signals 'ready' to receive, processor Stores to Display's DATA register
- 'ready' signal is a status flag (bit) in the status register
- program-controlled I/O polls status flags, wait looping to see if it should load/store for DATA register
- polling is taxing on processor, and locks without using interrupts
- instead let I/O alert the processor when it's ready, from hardware
- interrupts cause processor to deviate from the normal sequence of program instructions to allow the processor to respond to high priority events
Interrupt Service Routine
- special subroutine, called interrupt service routine (ISR) is responsible for handling services requested by the interrupt and ensuring consistent state after the ISR
- difference between this and subroutine is that the ISR has to copy many registers to the stack to preserve their values
- interrupt request signals from devices call the ISR, which responds with an acknowledgement
- processor finishes current instruction
- state of processor is saved
- ISR is called, and services the device
- state is restored and next instruction resumes
- the processor only restores the state of registers, not memory
- saving state can introduce delay prior to servicing a device, called the interupt latency
- if a code fragment requires atomicicity, the Interrupt Enable (IE) bit in the processor status register is set to 0
- multiple device interrupts can be supported; IRQ bit in I/O device's status register indicates that it raised an interrupt
- one approach is to poll all I/O devices, servicing the first deice to have IRQ set
- a shared interrupt request /IRQ signal is created using one bus driver (where each device is connected with an open switch by default)
- value of /IRQ becomes 0 if any device connected to the signal wishes to interrupt the processor
- instead of polling all devices, the device requesting service uses another set of signals to inform the processor that it is the one requesting service
- the INTA signal is daisy chained through all devices, stopping at the one requesting service
- alternatively, vectored interrupts allow for reduced interrupt latency, without the need for polling
- interrupt vector table holds addresses to Interrupt Service Routines (also called the interrupt vector)
- I/O device provides interrupt code through the interconnected network, and asserts the interrupt request signal
- saves state of processor, interrupt enable bit disabled
- interrupt code is used as a pointer into the interrupt vector table for the right ISR
- PC loaded with address of ISR
- ISR executes
- restore state of processors, IE enabled again
- PC returns to the interrupted program
- multiple I/O devices may use shared interrupt request signals for all devices or dedicated interrupt request signals, with lower latency and allowing the ISR to be immediately called by processor
- requires more wires (1 request signal and 1 acknowledge signal per device), and the processor can only support a fixed number of devices
Interrupt Nesting
Sometimes ISRs take large latencies for servicing higher priority devices
- disabling all interrupts during an ISR may result in unintended behaviour (order of execution)
- instead, allow higher priority interrupts to be enabled during ISR execution
- interrupt nesting is similar to subroutine nesting
- for simultaneous interrupt requests, arbitration or priority resolution is required
- for polling, service order is same as polling order
- for vectored interrupts, hardware must select only one device to identify itself
- disabling IE bit prevents all I/O devices from interrupting, doesn't give very fine control over disabling I/O
- for more fine control of behaviour, keep IE bit for each I/O device in I/O register (Keyboard IE and Display IE)
Processor Control Registers
in addition to processor status (PS) register, other control registers are present
- IPS register, where PS is automatically saved when an interrupt request is recognized
- IENABLE has one bit per device to control if request from source is recognized
- IPENDING has one bit per device to indicated if interrupt request has not been serviced
-
MoveControl R2, PS
is a special instruction used to update the processor control register (knows about keywords IPS, IENABLE, IPENDING, PS and is for moving values)
-
example: read a line of characters from keyboard
- use Keyboard ISR, with polling within ISR to display
- unfortunately only one interrupt occurs
- assumptions: ILOC points to the memory location with the first instruction for ISR
- LINE is the starting address of where to store the line in memory
- PNTR points to a memory location for the characters from the ISR
- EOL indicates that the line has been read
mov r2, ps
Part 6: IO Organization
Basic Hardware
- hardware propagation delay: when a state change occurs at the input, the delay encountered before the corresponding change in the output waveform is the propagation delay
- sampling during the transition time can lead to incorrect data
- important to assert over a small time period
- setup (prior) and hold (following) time are necessary for correct sampling of data
- interconnection network is the circuit that transfers info between components (ex: system bus connecting address, data, control)
I/O Interface
- bus contains signals, one for each component of the I/O interface
-
address decoder deterines when a device should respond to a request from a processor
- produces a device enable signal for input addresses belonging to the device range
- given a memory map, you can determine which address signals identify a particular device
- each has a boolean expression that represents address ranges
- also use memory map to determine the address signals used by the device after its enabled
- also determines number of unique addresses supported by each device
Bus Operation
- a bus protocol is set of rules governing how the bus is used
- bus control lines specify whether read or write, the size of the operation, timing info
- broadly, timing data transfers over bus synchronously with a global clock or asynchronously
- master is device that initiates data transfer
- slave is device addressed by master
- X places Y on bus (X asserts signals amount to value Y) on the bus signals
- bus signals could be address, command, or data
- Skew occurs when two signals transmitted simultaneously from one source arrive at dest at different times
- happens because different lines have different propagation speeds
Part 7: Memory Systems
Array of simple memory cells, each storing a single bit of information
- size is the unique addressable memory locations
- for k address bits, 2^k words of memory are addressable
- address decoder drives word line (bus), selecting a row of cells
- sense/write circuit connected to two bit lines, during read, read current bit values. during write, take input and store in cells
- control line for read and write specifies which operation
- CS is chip select, choses given chip
- 1Kib x 1 means 1Ki bits of memory with 1 bit of input/output
- implemented with 32x32 cells, with a 5 bit decoder (2^5 = 32 addresses) for column and row
- read/write one cell at a time
- Static RAM stored in latches, expensive
- two inverters form a latch, where X=1, Y=0 represents 1
- on read, the word lines are activated, by closing switches connecting the inverter to the circuit
- on write, place values on b and b' and connect the word lines via latches
- Dynamic Ram store temporarily using capacitors, requires refreshes to recharge capacitor, but cheaper
- capacitor holds charge, when the charge is above a threshold, it holds 1
- during reads, assert word line and switch T is closed, connecting bit line to capacitor for reading
- during write, assert word line, close swithc T and drive bit line to alter amount of charge
- Organization of Asynchronous 32Mib x 8 DRAM chip:
- 16Kib x 16Kib, 16Ki rows of 16Kib cells, each row has 16Kib/8 = 2048 bytes
- log2(16Ki) = 14 bits to address row
- log2(2048) = 11 bits to address specific byte in row
- row and column are multiplexed on 14 pins
- row address first placed then assert RAS
DRAM
- 1Kib x 1 means 1Ki bits of memory with 1 bit of input/output
- implemented as 5 x 5 decoders to access specific memory cells at a row and column (2^5 = 32)
- asynchronous chip uses 14 x 11 decoder to access a row of cells from addresses of 16mib x (16mib/8)
- this architecture is good for fast page mode, once the row is chosen, you can iterate through columns for sequential r/w
synchronous chip is driven by clock, similar design with a buffered I/O byte, one for reads and one for writes so that they can happen simultaneously while the other operation is finishing up
-
memory latency is the time before tranfer of memory is complete
-
bandwidth is the amount of data transfered per unit time
Double Data Rate SDRAM switches between two banks of data for one r/w on every clock edge
- memory controller is the interface for connecting chips to the system bus
chips can be combined into large arrays of chips, with a bunch of decoders
DMA
Direct Memory Access is a hardware unit that handles memory access operations for the processor. Program controlled I/O is taxing on the processor with lots of overhead to access just one byte. Processor delagates to the DMA, which responds with an ISR when finished
- DMAs sit in front of things like DISK drives and Ethernet access, separate from main memory
Caches & Memory Hierarchy
Hierarchy of memory components goes from fastests to slowest, with speed cost trade off
- CPU -> M1 -> M2 -> M3 -> M4
- automatically move data up and down the hiearchy (frequent at the top) to optimize memory access time
- locality of memory references is an intrinsic property of real programs
- temporal locality is data that you recently used and are likely to use soon (local vars)
- spatial locality is data near the recently used data that are likely to be used again soon (elements in an array)
- caches assist with locality of memory references
- cache block is size that indicates multiple words of data/instructions transferred between cache and main memory
- cache hit if memory block is found
- cahce miss if not
- cache eviction removing block of data/instructions from the cache
- cache dirty bit used to identify when contents of a cache block change
- processor connects to the cache
- cache connects to the main memory
Caches are faster than memory because of their hardware implementation. The cache is on the CPU, with transistors and it doesn't have to go through the external bus. Additionally, the cache is smaller so less to check.
Cache Writing Protocol for Store R2, (R3):
- if cache hit 2 approaches:
- write-through protocol updates the cache block and the main memory simultaniously
- write-back protocol updates the cache block and simply flags that the cache block has been modified by setting a dirty bit
if cache miss, 2 approaches:
- write-through protocol updates main memory directly and cache
- write-back protocol fetches the main memory block into the appropriate cache block, updates the cache block and sets the dirty bit
- then dirty bit is used to decide whether to write-back data on eviction or replacement
- if cache miss, write-allocate says fill cache with line and perform write, while no-write-allocate ignores filling cache and writes directly to main memory
- Writeback caches delay the writeback to memory, using the dirty bit. We only write to memory after the cache block is evicted. This is more efficient when you update values in the cache before they're evicted
Mapping Functions determine the location in the cache for each memory address
- for any cache, the size of the block must be the same as the number of addressable bits for the main memory
- otherwise we don't have enough information to trace back to the original memory block
- each cache mapping just dictates how we choose to format this information
-
direct mapping uses a fixed mapping mem block j => cache block (j mod 128)
- address is 3 parts; word (offset) selects specific word in a block
- block (index) determines location in cache
- tag ensure block found is correct
-
fully-associative allows a block from main memory to map to any location in the cache
- block field is empty, and consumed by the tag field
- when accessing, each block tag is checked, which corresponds to a block of memory in the main memory
- then the word bits in the cache block decode to the respective word within the main memory block
-
set associative mapping uses direct mapping to map memory blocks to a set of cache blocks, think combination of direct and fully associative mapping
Cache Structure
- to fetch data from the cache, we use indexing based on the main memory address that the cache intercepts.
- The first 5 bits of the address will be interpreted as the word bits (n bits to choose from the 2^n words in the block),
- next 10 bits will be the block bits (b bits for 2^b blocks of cache capacity).
- The remaining bits are called tag bits, used to determine correctness of cache hits (for direct caches, multiple memory blocks map to the same cache block)
- each cache block holds data, along with tag bits and a valid bit, tag bits for comparison to the tag in the address, while the valid bit state whether the data stored in the cache is valid data (V = 1)
Replacement Algorithms
For set-associative and fully-associative caches, a cache miss needs to evict some undetermined block, of which is determined using a replacement algorithm.
- LRU leverages temporal locality by tracking age of cache components, using log2(n) bits for each block (for caches with sets of size n)
- age is defined as the number of elements who have been accessed more recently than it
- FIFO acts as a sort of round robin system
- Random
For C
time required to access a block in cache, M
time when accessing memory (including miss penalty) and h
hit percentage, cache performance is intuitively tavg = (hC + (1-h)M)
.
If we're given M' instead, being the latency for accessing Memory, then the time is tavg = C + (1-h)M'
since the time for a miss is equal to the time for a hit, then no matter what, we incur the time taken to access a block from cache, and add extra latency if it's a miss.
For two caches, intuitively tavg = h1•C1 + (1-h1)(h2•C2 + (1-h2)M)
Solving Cache Problems
Given a
-bit addresses, 2^a
(bytes or words; depending on addressing system) main memory, c
byte/word cache size, b
bytes/words per cache block. Cache block capacity = c/b
. Word bits are first log2(b)
bits (rightmost), block bits are next log2(c/b)
bits, tag bits are remaining.
For final cache content questions, do the timeline table method, visually updated each cache block with respect to a timeline. Then you can figure out quickly which block is the least recently used.
Virtual Memory
Programs are written using the full address space 2^32, but physical memory capacity is often less than this (2GB for example). Virtual memory populates the rest, storing it on disk
Lab4: ISR
- generate Random delay 5-25s in R6
- display on LED
- Delay of 1s then decrement your ## in R6m then display
- if R4 ≤ 0, fkasg LEDs on and off for 1 sec each until you press the button (ISR)
Part 8: Basic Processing
CPU has control unit, ALU and registers. Fundamentally, there are 5 steps:
- fetch
- decode
- read memory
-
execute: perform operation
- store data by write back
Processor has hardware to implement every instruction in the ISA.
The datapath of a processor hold a bunch of components that implement the ISA.
- Register file is a set of registers
- ALU
- processor-memory interface for memory and write-back operations
- instruction address generator updates PC
- instruction register
- control circuitry, decides what happens with mux selectors, and regulates flow of data
Design of Register file:
- you want to be able to select two registers for a clock cycle, and write to at most one register. So we need two corresponding output data. The width of i/o is 32 bits (same as register width) and addresses are log2(32) = 5
- Fetch gets whatevers at PC address.
- Decode
- in ADD R1, R2, R3, Rsrc1 = R2, Rsrc2 = R3
- Execute
Part 9: Pipelining
Pipelining allows for concurrent instructions per cycle, with different parts (stages) of the processor
- with 5 stage datapath, only one stage is active in any cycle
- pipeline diagram shows at each clock cycle what stage finished for each instruction is being run
- to implement this we need interstage buffers to carry forward information for EACH instruction
- RA, RB, RM, RY, RZ, IR, PC
- also pipelining the control signals
- this allows a new instruction to enter and exit the pipeline every cycle
Issues
- Data hazard is a result of a data dependency. Two successive instructions,
ADD R1, R2; ADD R3, R1
, the second one will decode the wrong value of R1, since the first hasn't writtenback to R1 yet. This can be resolved by stalling the pipeline
- using interstage buffers, check IR for source registers of a new command (before decoding) and check whether there is an older instruction who has the same destination as a source in the current instruction
- somethings wrong if there are two of the same stage in the same column (clock cycle) of the pipeline diagram
For pipelining problems, be aware of how the register files are being updated. Forwarding is important to understand. Forwarding can be done by adding a mux with the output of one stage, linked to an input of a previous stage to share values. An intermediate register may have a wrong value in it. By adding a mux, we can select the actual correct value from the output of a previous command. In general, we don't want to add a mux in the decode stage since the mux is expensive with time. Adding it to other stages is negligable since they involve memory operations.
A change made by some stage at a given clock cycle is not noticed until the next cycle.
instr PC R4 RA RM RZ RY
1F 37c00 1000 - - - -
2D 37C04 "
3X
4M
5W
Memory Stalls
Branches alter sequential execution, whose effect is unknown until the execution stage. The pipeline needs to squas the instructions fetched down the wrong path. The time taken is the penalty