Transcript for:
Computer Architecture Basics

# Computer Architecture
Computer Engineering, 3rd year, Semester 1 — Week 02

## What do we mean by architecture?
In general, architecture means a style, structure, or composition of buildings. Computer architecture covers:
- The basic design and design concepts of computers
- Hardware specifications together with system software such as operating systems and compilers
- The architecture of the computer + the behavior of the computer

Overall view of a computer: instruction set, operating system, operative unit, memory unit, control unit, input and output units.

## What is a CPU?
Equivalent terms: CPU, MPU, processor. The CPU is an IC that includes the arithmetic, control, and input/output units of a computer.

Companies that design/manufacture CPUs:
- Mainstream: Intel, AMD, ARM
- Embedded and industrial: Zilog, Atmel, Fujitsu, etc.

Architecture examples:
- x86 (Intel/AMD)
- ARM (mobile devices)
- MIPS, SPARC, PowerPC (once common in academia/industry)
- PIC, AVR, 68xxx, Z80 (microcontrollers)

## Computer Processing
The flow of data and control between the major parts of a computer:
1. CPU (Central Processing Unit)
2. Input devices
3. Memory (RAM / main memory)
4. Output devices

## Three questions about computer processing
Input data → processing, computing and storage of information → output data → information display.
- Q1: How do computers represent information? In what format? → Binary format
- Q2: How is the acquired data calculated? → Algorithms
- Q3: What procedures are used to process (calculate) the "information"? → ALU switching elements

Group work: let's talk about Q1, Q2 and Q3.
- Q1: Text and images are represented in binary format.
- Q2: 1. Identify and gather the data; 2. Transform it into a usable format; 3. Perform the calculation through the CPU (ALU).
- Q3: Validation, sorting, summarization, aggregation, analysis and classification.

## Programming languages
- High-level language (easy for humans to understand): describes instructions to a computer in a human-readable form. High-level languages are converted into machine-language programs by compilers and interpreters. Examples: C, C++, Java, Python.
- Assembly language (low-level language): uses mnemonics (e.g., MOV, ADD, B) to represent machine instructions. Closer to hardware than high-level languages. Translated into machine language by an assembler.
- Machine language: binary code converted from assembly language; the only form a computer can directly understand, and practically impossible for humans to read. A sketch of the full path from C to machine code follows below.
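As a minimal sketch of these three levels, here is one line of C hand-translated to MIPS assembly and then to machine code, following the R-format encoding covered in a later week. The register assignments are assumptions for illustration:

```
# C source (high-level language):   a = b + c;
# Assuming a -> $s0, b -> $s1, c -> $s2

add $s0, $s1, $s2   # assembly language: mnemonic plus register operands

# Machine language (the 32-bit R-format word the assembler emits):
#   op     rs    rt    rd    shamt funct
#   000000 10001 10010 10000 00000 100000   = 0x02328020
```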
## Von Neumann Architecture (von Neumann-type computers)
Many computers are designed as von Neumann-type calculators. The architecture was proposed by John von Neumann in 1945.

## The Basic Structure of a von Neumann Computer
- Stored-program computer: instructions (programs) and data are placed indistinguishably in main memory; the distinction between instructions and data is made by the program.
- Sequential processing: a program is a list of instructions; instructions are fetched one by one from main memory and executed in a determined order. The memory position of the next instruction to execute is stored in a register called the program counter (PC).
- Linear address: each cell of main memory is numbered sequentially; this number is called the address. The address is used to indicate the location of instructions and data.

[Figure: main memory as a linear array of word addresses (0, 4, 8, 12, ...) holding MIPS instructions such as add $s1, $zero, $zero and data such as variables x, y and array elements; the PC points at the next instruction, illustrating sequential processing and the stored-program concept.]

The central processing unit (CPU) is the part of the computer that contains the arithmetic and control units.
- Operating unit: the arithmetic unit performs arithmetic and logical operations and temporarily stores the operands and results.
- Control unit: controls the operation of the computer.
- Memory unit: stores the program and the data.
  - Main memory unit: the part of the storage to which the CPU has direct access.
  - Auxiliary memory unit: the part of the memory that cannot be directly accessed by the CPU.
- Input/output unit(s): performs the data transfers between the computer and its environment.

## The detailed structure of the CPU: the operative unit
1. General-Purpose Registers (GPR): registers that temporarily store the operands and results of the computer's calculations. Compared to the memory units they are faster, smaller, and have no addresses.
2. Arithmetic and Logic Unit (ALU): the device that performs the arithmetic and logic operations.
3. Flag Register (FR): stores the status of the previous operation. The flag register is 4 or more bits wide; its role is conditional branching (if and for statements).
   - CF: Carry Flag, the output carry of the operation
   - OF: Overflow Flag, the overflow of signed calculations
   - ZF: Zero Flag, indicates whether the result was 0
   - SF: Sign Flag, indicates whether the result was negative
   For example, computing 5 - 5 sets ZF, while 3 - 5 clears ZF and sets SF.

## The detailed structure of the CPU: the control unit
1. Program Counter (PC): the register that stores the address of the next instruction to be executed.
2. Instruction Register (IR): the register that stores the instruction being executed.
3. Decoder (DE): a circuit that decodes the value of the instruction stored in the instruction register; its result drives a set of control signals for the computer.
4. Sequencer: the circuit that generates the control signals related to the clock and the status of the computer.
## The Basic Behavior of a von Neumann Computer
The execution flow (instruction cycle) is the following infinite loop: Fetch (FE) → Decode (DE) → Execute (EX).
1. Fetch (FE): the step at which the instruction indicated by the PC is read from memory and stored in the IR.
2. Decode (DE): the step at which the instruction stored in the IR is decoded.
3. Execute (EX): the step at which the instruction is carried out, depending on the results of the decoding stage.

## The von Neumann Bottleneck
What is a bottleneck? A bottleneck is a situation that limits the overall performance of a given product, like the neck of a bottle.

# Computer Architecture
Computer Engineering, 3rd year, Semester 1 — Week 08

## LAST WEEK

### Register Operands
Arithmetic instructions use register operands. MIPS has a register file of 32 32-bit registers:
- Used for frequently accessed data
- Numbered 0 to 31
- 32-bit data is called a word

Assembler names, e.g.:
- $t0, $t1, ..., $t9 for temporary values
- $s0, $s1, ..., $s7 for saved variables (e.g., initial inputs)

Design principle: smaller is faster.
- Main memory has millions of locations → slower
- Keep frequently used data in registers → fast access
- Minimize memory accesses

### MIPS: Temporary vs. Saved Registers
Use $t when you don't care about keeping the value after a function call; use $s when the value must survive across calls. Example:

    # Using $t0 for temporary math
    add $t0, $a0, $a1
    # Using $s0 to store a persistent value
    add $s0, $a2, $a3

If this function calls another function, it must save $s0 (using sw) before the call and restore it (using lw) afterward. But $t0 doesn't need to be preserved; it might be overwritten. A sketch of this save/restore pattern follows below.
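A minimal sketch of the save/restore convention just described. The function body, the `helper` callee, and the stack layout are illustrative assumptions; sw, lw and jal are the instructions the text names (note that jal also overwrites $ra, so a non-leaf function saves that too):

```
compute:
    addi $sp, $sp, -8      # make room for two words on the stack
    sw   $ra, 4($sp)       # save return address (jal below overwrites $ra)
    sw   $s0, 0($sp)       # save $s0: its value must survive the call
    add  $s0, $a2, $a3     # persistent value kept in $s0
    add  $t0, $a0, $a1     # scratch value in $t0 (helper may clobber it)
    jal  helper            # call another function
    add  $v0, $s0, $zero   # $s0 is still valid after the call
    lw   $s0, 0($sp)       # restore caller's $s0
    lw   $ra, 4($sp)       # restore return address
    addi $sp, $sp, 8       # pop the stack frame
    jr   $ra
```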
### MIPS Register Numbering and Assembler Names
MIPS has 32 general-purpose registers, labeled $0 to $31. These are hardware register numbers, but in assembly programming we use symbolic names to make code easier to read and understand.

| Common use | Assembler name | Register number |
|---|---|---|
| Always 0 (read-only) | $zero | $0 |
| Reserved for assembler | $at | $1 |
| Function return values | $v0–$v1 | $2–$3 |
| Function arguments | $a0–$a3 | $4–$7 |
| Temporary registers | $t0–$t7 | $8–$15 |
| Saved registers | $s0–$s7 | $16–$23 |
| More temporary registers | $t8–$t9 | $24–$25 |
| Reserved for kernel | $k0–$k1 | $26–$27 |
| Global pointer | $gp | $28 |
| Stack pointer | $sp | $29 |
| Frame pointer | $fp | $30 |
| Return address | $ra | $31 |

### Memory Operands
Main memory is used for composite data: arrays, structures, dynamic data. To apply arithmetic operations, MIPS does not operate on memory directly:
- Load values from memory into registers
- Store results from registers back to memory

Memory is byte-addressed: each address identifies an 8-bit byte (byte 1 at address 0, byte 2 at address 1, byte 3 at address 2, byte 4 at address 3). MIPS loads/stores 32-bit words (4 bytes), and words are aligned in memory: a word address must be a multiple of 4.

### Memory Operand Example
C code: `g = h + A[8];` where g is in $s1, h is in $s2, and the base address of array A (that is, the address of A[0]) is in $s3.

Compiled MIPS code; index 8 requires an offset of 32 (4 bytes per word):

    lw  $t0, 32($s3)    # load word: $t0 = A[8]  (offset 32, base register $s3)
    add $s1, $s2, $t0   # g = h + A[8]

[Figure: A[0] at byte address 1000, A[1] at 1004, ..., A[8] at 1032; one word = 4 bytes (32 bits).]
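As a further illustration of the offset rule above (a sketch with the same register assignments assumed): to compute `A[5] = A[3] + h`, the word offsets are 3 × 4 = 12 and 5 × 4 = 20:

```
lw  $t0, 12($s3)    # $t0 = A[3]   (offset 3 * 4 = 12)
add $t0, $s2, $t0   # $t0 = h + A[3]
sw  $t0, 20($s3)    # A[5] = $t0   (offset 5 * 4 = 20)
```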
### The MIPS Instruction Set
MIPS is used as the example in this class. The Stanford MIPS design was commercialized by MIPS Technologies. It is typical of many modern ISAs (Instruction Set Architectures), and similar ISAs have a large share of the embedded core market, with applications in consumer electronics, network/storage equipment, cameras, printers, etc. It is known for its simplicity and efficiency.

MIPS instructions covered here fall into four major categories:
1. Arithmetic instructions
2. Logical operations
3. Memory access instructions
4. Control flow instructions

### Arithmetic Instructions
The add operation has three operands: two sources and one destination.

    add a, b, c   # a gets b + c

All arithmetic operations have this form (sub, and, or, slt). Design principle: simplicity favors regularity. Regularity makes implementation simpler, and simplicity enables higher performance at lower cost.

| Instruction | MIPS example | C equivalent | Assumed register mapping | Description |
|---|---|---|---|---|
| add | add $t0, $t1, $t2 | x = a + b | $t0 = x, $t1 = a, $t2 = b | Addition |
| sub | sub $t0, $t1, $t2 | x = a - b | $t0 = x, $t1 = a, $t2 = b | Subtraction |
| addi | addi $t0, $t1, 5 | x = a + 5 | $t0 = x, $t1 = a | Add immediate (constant) |
| slt | slt $t0, $t1, $t2 | x = (a < b) ? 1 : 0 | $t0 = x, $t1 = a, $t2 = b | Set if less than (signed) |
| sltu | sltu $t0, $t1, $t2 | x = (a < b) ? 1 : 0 | $t0 = x, $t1 = a, $t2 = b | Set if less than (unsigned) |

### Logical Instructions
| Instruction | MIPS example | C equivalent | Assumed register mapping | Description |
|---|---|---|---|---|
| and | and $t0, $t1, $t2 | x = a & b | $t0 = x, $t1 = a, $t2 = b | Bitwise AND |
| or | or $t0, $t1, $t2 | x = a \| b | $t0 = x, $t1 = a, $t2 = b | Bitwise OR |
| xor | xor $t0, $t1, $t2 | x = a ^ b | $t0 = x, $t1 = a, $t2 = b | Bitwise XOR |
| nor | nor $t0, $t1, $t2 | x = ~(a \| b) | $t0 = x, $t1 = a, $t2 = b | Bitwise NOR |
| andi | andi $t0, $t1, 0xF0 | x = a & 0xF0 | $t0 = x, $t1 = a | AND with constant |

### Memory Access Instructions
| Instruction | MIPS example | C equivalent | Assumed register mapping | Description |
|---|---|---|---|---|
| lw | lw $t0, 8($s1) | x = A[i] | $t0 = x, $s1 = base address | Load word from memory |
| sw | sw $t0, 8($s1) | A[i] = x | $t0 = x, $s1 = base address | Store word to memory |
| lb | lb $t0, 4($s1) | x = byte[i] | $t0 = x, $s1 = base address | Load byte from memory |
| sb | sb $t0, 4($s1) | byte[i] = x | $t0 = x, $s1 = base address | Store byte to memory |

### Control Flow Instructions
| Instruction | MIPS example | C equivalent | Assumed register mapping | Description |
|---|---|---|---|---|
| beq | beq $t1, $t2, LABEL | if (a == b) | $t1 = a, $t2 = b | Branch if equal |
| bne | bne $t1, $t2, LABEL | if (a != b) | $t1 = a, $t2 = b | Branch if not equal |
| j | j LABEL | goto LABEL | – | Unconditional jump |
| jal | jal FUNC | call FUNC() | return address placed in $ra | Jump and link (call) |
| jr | jr $ra | return | $ra = return address | Jump to register (return) |

### Quiz 1: Arithmetic Example
Suppose we place the sum of four variables b, c, d and e into variable a: `a = b + c + d + e;` What is the compiled MIPS code?

    add a, b, c   # sum of b and c is placed in a
    add a, a, d   # sum of b, c and d is now in a
    add a, a, e   # sum of b, c, d and e is now in a

### Quiz 2: Arithmetic Example
Suppose we compute `a = b + c; d = a - e;` What is the compiled MIPS code?

    add a, b, c
    sub d, a, e

### Quiz 4: Register Operands
C code: `f = (g + h) - (i + j);` The variables f, g, ..., j are assigned to the registers $s0, $s1, ..., $s4. What is the compiled MIPS code?

    add $t0, $s1, $s2   # register $t0 contains g + h
    add $t1, $s3, $s4   # register $t1 contains i + j
    sub $s0, $t0, $t1   # f gets (g + h) - (i + j)

### Quiz 5: Memory Operands
C code: `A[12] = h + A[8];` with h in $s2 and the base address of A in $s3. What is the compiled code?

    lw  $t0, 32($s3)   # load word: $t0 = A[8]
    add $t0, $s2, $t0  # $t0 = h + A[8]
    sw  $t0, 48($s3)   # store word: A[12] = $t0
### Convert C code to MIPS assembly
Exercises (assume a in $t0, b in $t1, c in $t2):
1. Find the absolute value (abs): `if (a < 0) abs = -a; else abs = a;`
2. Find the maximum (max): `if (a > b) max = a; else max = b;`
3. Check whether a is between b and c: `if (b < a && a < c) result = 1; else result = 0;`

Find the maximum (max in $t2). Hint: use slt, beq, bne, add, sub.

        slt $t3, $t1, $t0       # if a > b (i.e. b < a) then $t3 = 1
        beq $t3, $zero, ELSE    # if false, go to ELSE
        add $t2, $t0, $zero     # max = a
        j END
    ELSE: add $t2, $t1, $zero   # max = b
    END:

Find the absolute value (abs in $t1). Hint: use slt, beq/bne, sub, add.

        slt $t2, $t0, $zero     # if a < 0 then $t2 = 1
        beq $t2, $zero, ELSE    # if a >= 0, go to ELSE
        sub $t1, $zero, $t0     # abs = -a  (0 - a)
        j END
    ELSE: add $t1, $t0, $zero   # abs = a
    END:

Check whether a is between b and c (result in $t3).

        slt $t4, $t1, $t0       # b < a  ->  $t4 = 1
        slt $t5, $t0, $t2       # a < c  ->  $t5 = 1
        add $t6, $t4, $t5       # $t6 = $t4 + $t5
        beq $t6, 2, YES         # if both are true, result = 1
        add $t3, $zero, $zero   # result = 0
        j END
    YES: add $t3, $zero, 1      # result = 1
    END:

(Note: `beq $t6, 2, YES` and `add $t3, $zero, 1` use immediate operands that the real beq and add instructions do not accept; assemblers such as SPIM/MARS expand these pseudo-instructions into real ones, e.g., using addi and a temporary register.)

## TODAY
Today's topics: radix conversion (binary, base-2; decimal, base-10; hexadecimal, base-16), 2's complement, and representing instructions.

### Counting in decimal and hexadecimal
We use decimal numbers in our daily lives. Decimal (base-10) uses the 10 digits 0 to 9; when a count reaches 10, a carry occurs. Hexadecimal (base-16) must represent values of 10 to 15 in a single digit, so it uses the letters A, B, C, D, E and F; a carry occurs at 16.

[Figure: counting tables for decimal (..., 9, 10, 11, ..., 99, 100, ...) and hexadecimal (..., 9, A, B, ..., F, 10, 11, ..., FF, 100, ...).]

### Conversion from base-n to decimal
$$365_{(10)} = 3 \times 10^2 + 6 \times 10^1 + 5 \times 10^0 = 365_{(10)}$$
$$101_{(2)} = 1 \times 2^2 + 0 \times 2^1 + 1 \times 2^0 = 5_{(10)}$$
$$125_{(8)} = 1 \times 8^2 + 2 \times 8^1 + 5 \times 8^0 = 85_{(10)}$$
$$2AD_{(16)} = 2 \times 16^2 + A \times 16^1 + D \times 16^0 = 2 \times 16^2 + 10 \times 16^1 + 13 \times 16^0 = 685_{(10)}$$

### Conversion from decimal to base-n
Repeatedly divide by the target base and collect the remainders, last remainder first.

$85_{(10)}$ to base 8: $85 \div 8 = 10$ r $5$; $10 \div 8 = 1$ r $2$; $1 \div 8 = 0$ r $1$. Answer: $125_{(8)}$.

$685_{(10)}$ to base 16: $685 \div 16 = 42$ r $13$ (D); $42 \div 16 = 2$ r $10$ (A); $2 \div 16 = 0$ r $2$. Answer: $2AD_{(16)}$.

### Hexadecimal
Base 16 is a compact representation of bit strings: 4 bits per hex digit.

| Hex | Binary | Hex | Binary | Hex | Binary | Hex | Binary |
|---|---|---|---|---|---|---|---|
| 0 | 0000 | 4 | 0100 | 8 | 1000 | c | 1100 |
| 1 | 0001 | 5 | 0101 | 9 | 1001 | d | 1101 |
| 2 | 0010 | 6 | 0110 | a | 1010 | e | 1110 |
| 3 | 0011 | 7 | 0111 | b | 1011 | f | 1111 |

Example: eca8 6420 = 1110 1100 1010 1000 0110 0100 0010 0000

### Unsigned Binary Integers
Given an n-bit number $x = x_{n-1}x_{n-2}\cdots x_1x_0$, its value is
$$x = x_{n-1} \cdot 2^{n-1} + x_{n-2} \cdot 2^{n-2} + \cdots + x_1 \cdot 2^1 + x_0 \cdot 2^0$$

Example: what is the decimal value of the following 32-bit binary code?
$$0000\ 0000\ 0000\ 0000\ 0000\ 0000\ 0000\ 1011_2 = 0 + \cdots + 1 \cdot 2^3 + 0 \cdot 2^2 + 1 \cdot 2^1 + 1 \cdot 2^0 = 8 + 0 + 2 + 1 = 11_{10}$$

Range: 0 to $2^n - 1$. Using 32 bits: 0 to +4,294,967,295.
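As a cross-check connecting this example with the repeated-division method from the radix-conversion section above (an added worked step, not from the slides), converting $11_{(10)}$ back to binary:

$$11 \div 2 = 5 \ \text{r}\ 1,\quad 5 \div 2 = 2 \ \text{r}\ 1,\quad 2 \div 2 = 1 \ \text{r}\ 0,\quad 1 \div 2 = 0 \ \text{r}\ 1 \;\Rightarrow\; 11_{(10)} = 1011_{(2)}$$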
### 2's-Complement Signed Integers
Given an n-bit number $x = x_{n-1}x_{n-2}\cdots x_1x_0$, its value is
$$x = -x_{n-1} \cdot 2^{n-1} + x_{n-2} \cdot 2^{n-2} + \cdots + x_1 \cdot 2^1 + x_0 \cdot 2^0$$

Example: what is the decimal value of the following 2's-complement binary code?
$$1111\ 1111\ 1111\ 1111\ 1111\ 1111\ 1111\ 1100_2 = -1 \cdot 2^{31} + 1 \cdot 2^{30} + \cdots + 1 \cdot 2^2 + 0 \cdot 2^1 + 0 \cdot 2^0$$
$$= -2{,}147{,}483{,}648 + 2{,}147{,}483{,}644 = -4_{10}$$

Range: $-2^{n-1}$ to $+2^{n-1} - 1$. Using 32 bits: −2,147,483,648 to +2,147,483,647.

Bit 31 is the sign bit: 1 for negative numbers, 0 for non-negative numbers. Non-negative numbers have the same unsigned and 2's-complement representation. Some specific numbers:
- 0: 0000 0000 ... 0000
- −1: 1111 1111 ... 1111
- Most negative: 1000 0000 ... 0000 = −2,147,483,648
- Most positive: 0111 1111 ... 1111 = +2,147,483,647

Negation: complement the bits and add 1 (complement means 1 → 0, 0 → 1; e.g., X = 0000 1111 gives $\bar{X}$ = 1111 0000). This works because $x + \bar{x} = 1111\ldots111_2 = -1$, so $\bar{x} + 1 = -x$. Example: negate +2.
- +2 = 0000 0000 ... 0010₂
- −2 = 1111 1111 ... 1101₂ + 1 = 1111 1111 ... 1110₂

### Sign Extension
Representing a number using more bits while preserving its numeric value. Why it is important: instructions that handle smaller values must widen them.
- addi extends its immediate value
- lb, lh extend the loaded byte/halfword

Rule: replicate the sign bit to the left (cf. unsigned values, which are extended with 0s). Examples, 8-bit to 16-bit:
- +2: 0000 0010 → 0000 0000 0000 0010
- −2: 1111 1110 → 1111 1111 1111 1110

In the MIPS instruction set:
- addi extends its immediate value. For example, `addi $t0, $t1, -13` computes $t0 = $t1 + (−13); −13 is 1111 0011 in 8 bits and 1111 1111 1111 0011 in 16 bits, and this value is sign-extended to 1111 1111 1111 1111 1111 1111 1111 0011 in 32 bits before the addition.
- lb (load byte) and lh (load halfword) load 8-bit and 16-bit data from memory into registers, sign-extending to 32 bits. A short sketch follows below.
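A minimal sketch of sign extension during a byte load. The data layout and the `la` pseudo-instruction follow SPIM/MARS assembler conventions and are assumptions; lb is from the slides, and lbu (load byte unsigned, which zero-extends instead) is added here for contrast:

```
.data
val:    .byte 0xF3          # 1111 0011 = -13 as a signed byte

.text
la  $s1, val                # base address ("la" is an assembler pseudo-instruction)
lb  $t0, 0($s1)             # sign-extends:  $t0 = 0xFFFFFFF3 = -13
lbu $t1, 0($s1)             # zero-extends:  $t1 = 0x000000F3 = 243
```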
### Representing Instructions
Instructions are encoded in binary, called machine code. MIPS instructions are encoded as 32-bit instruction words containing an operation code (opcode), register numbers, etc. Two formats are covered here:
- MIPS R-format instructions: used for arithmetic and logical operations
- MIPS I-format instructions: used for operations involving constants (immediate values), load/store instructions, and branch instructions

Register numbers:
- $t0–$t7 are registers number 8–15
- $t8–$t9 are registers number 24–25
- $s0–$s7 are registers number 16–23
- $t registers hold temporary values; $s registers hold saved variables in a program

### MIPS R-format Instructions

| op | rs | rt | rd | shamt | funct |
|---|---|---|---|---|---|
| 6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits |

Instruction fields:
- op: operation code (opcode)
- rs: first source register number
- rt: second source register number
- rd: destination register number
- shamt: shift amount (00000 for now)
- funct: function code (extends the opcode)

R-format example (add operation): add $t0, $s1, $s2 ($t0 = $s1 + $s2)

| | op | rs | rt | rd | shamt | funct |
|---|---|---|---|---|---|---|
| Field | special | $s1 | $s2 | $t0 | 0 | add |
| Decimal | 0 | 17 | 18 | 8 | 0 | 32 |
| Binary | 000000 | 10001 | 10010 | 01000 | 00000 | 100000 |

= 0000 0010 0011 0010 0100 0000 0010 0000₂ = 02324020₁₆

R-format example (sub operation): sub $t0, $s1, $s2 ($t0 = $s1 − $s2)

| | op | rs | rt | rd | shamt | funct |
|---|---|---|---|---|---|---|
| Field | special | $s1 | $s2 | $t0 | 0 | sub |
| Decimal | 0 | 17 | 18 | 8 | 0 | 34 |
| Binary | 000000 | 10001 | 10010 | 01000 | 00000 | 100010 |

= 0000 0010 0011 0010 0100 0000 0010 0010₂ = 02324022₁₆

R-format opcode and funct values:

| Instruction | Opcode (bin) | Funct (dec) | Funct (bin) |
|---|---|---|---|
| add | 000000 | 32 | 100000 |
| sub | 000000 | 34 | 100010 |
| and | 000000 | 36 | 100100 |
| or | 000000 | 37 | 100101 |
| xor | 000000 | 38 | 100110 |
| nor | 000000 | 39 | 100111 |
| slt | 000000 | 42 | 101010 |
| sll | 000000 | 0 | 000000 |
| srl | 000000 | 2 | 000010 |
| jr | 000000 | 8 | 001000 |

### MIPS I-format Instructions

| op | rs | rt | constant or address |
|---|---|---|---|
| 6 bits | 5 bits | 5 bits | 16 bits |

Used for immediate arithmetic and load/store instructions:
- rt: destination or source register number
- constant: $-2^{15}$ to $+2^{15} - 1$
- address: offset added to the base address in rs

Design principle: good design demands good compromises. Different formats complicate decoding, but allow all instructions to be 32 bits uniformly; keep the formats as similar as possible.

I-format example (load operation): lw $t0, 32($s3) — A[8] is loaded into $t0

| | op | rs | rt | constant or address |
|---|---|---|---|---|
| Field | load word | $s3 | $t0 | 32 |
| Decimal | 35 | 19 | 8 | 32 |
| Binary | 100011 | 10011 | 01000 | 0000 0000 0010 0000 |

= 1000 1110 0110 1000 0000 0000 0010 0000₂ = 8E680020₁₆

I-format opcodes:

| Instruction | Opcode (dec) | Opcode (bin) |
|---|---|---|
| addi | 8 | 001000 |
| andi | 12 | 001100 |
| ori | 13 | 001101 |
| xori | 14 | 001110 |
| lw | 35 | 100011 |
| sw | 43 | 101011 |
| lb | 32 | 100000 |
| sb | 40 | 101000 |
| lh | 33 | 100001 |
| sh | 41 | 101001 |
| beq | 4 | 000100 |
| bne | 5 | 000101 |
| slti | 10 | 001010 |

### Logical Operations
Instructions for bitwise manipulation, useful for extracting and inserting groups of bits in a word.

| Operation | C | Java | MIPS |
|---|---|---|---|
| Shift left | << | << | sll |
| Shift right | >> | >>> | srl |
| Bitwise AND | & | & | and, andi |
| Bitwise OR | \| | \| | or, ori |
| Bitwise NOT | ~ | ~ | nor |

### Shift Operations
Shifts are MIPS R-format instructions; the shamt field says how many positions to shift.
- Shift left logical (sll): shift left and fill with 0 bits; sll by i bits multiplies by $2^i$.
- Shift right logical (srl): shift right and fill with 0 bits; srl by i bits divides by $2^i$ (unsigned only). A short sketch follows below.
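A small sketch of the multiply/divide-by-powers-of-two trick above; the register contents are assumed for illustration:

```
addi $t1, $zero, 5     # $t1 = 5
sll  $t0, $t1, 2       # $t0 = 5 * 2^2 = 20   (00101 -> 10100)
srl  $t2, $t0, 1       # $t2 = 20 / 2^1 = 10  (unsigned divide)
```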
### AND Operations
Useful to mask bits in a word: select some bits, clear others to 0.

    and $t0, $t1, $t2

    $t2 = 0000 0000 0000 0000 0000 1101 1100 0000
    $t1 = 0000 0000 0000 0000 0011 1100 0000 0000
    $t0 = 0000 0000 0000 0000 0000 1100 0000 0000

### OR Operations
Useful to include bits in a word: set some bits to 1, leave others unchanged.

    or $t0, $t1, $t2

    $t2 = 0000 0000 0000 0000 0000 1101 1100 0000
    $t1 = 0000 0000 0000 0000 0011 1100 0000 0000
    $t0 = 0000 0000 0000 0000 0011 1101 1100 0000

### NOT Operations
Useful to invert bits in a word: change 0 to 1, and 1 to 0. MIPS has the 3-operand NOR instruction: a NOR b == NOT(a OR b). Since register 0 ($zero) always reads as zero:

    nor $t0, $t1, $zero

    $t1 = 0000 0000 0000 0000 0011 1100 0000 0000
    $t0 = 1111 1111 1111 1111 1100 0011 1111 1111

# Computer Architecture
Computer Engineering, 3rd year, Semester 1 — Week 07

## LAST WEEK

### Power saving
1. Parallelism: multi-core processors (more cores in a single processor chip); simultaneous multithreading (SMT); GPUs and accelerators (graphics processing units and other accelerators, such as TPUs for machine learning).
2. Memory optimization: prefetching (predict and load data into the cache before the CPU needs it, to reduce wait times); memory bandwidth and latency improvements (faster memory technologies such as DDR5 or HBM, and optimized memory controller designs).
3. Software and algorithm optimization: efficient algorithms (lower computational complexity reduces the number of instructions required to complete a task); concurrency (design software to effectively use multiple cores and threads).
4. Specialized hardware: develop custom hardware tailored to specific tasks.
5. Energy-efficient computing: turn off portions of the processor when they are not in use.
6. Heterogeneous computing: integrate different types of processors on the same chip to handle diverse workloads.

### Execution of programs
A program written in a high-level language (C, Java, Verilog-HDL, etc.) is CPU-independent. A compiler translates it, through the instruction set, into assembly language and then into machine language (binary code) for a particular CPU. Machine language is CPU-dependent, so each CPU (CPU-A, CPU-B, CPU-C) needs its own compiler targeting its own instruction set. The instruction set is the interface between programs and CPUs.

[Figure: C source such as "if (y == 1) c = a + b; else c = a - b;" compiled to assembly (lw $4, 0($1) / lw $5, 4($1) / add $2, $4, $5 / ...) and assembled into binary machine code for the CPU.]
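To make the figure concrete, here is a hedged hand-compilation of its C fragment into MIPS assembly. The assignments (y in $t0, a at 0($1), b at 4($1), c in $2) extend the figure's three-instruction fragment and are assumptions for illustration:

```
        addi $t1, $zero, 1      # $t1 = 1
        bne  $t0, $t1, ELSE     # if (y != 1) take the else branch
        lw   $4, 0($1)          # $4 = a
        lw   $5, 4($1)          # $5 = b
        add  $2, $4, $5         # c = a + b   (the figure's fragment)
        j    END
ELSE:   lw   $4, 0($1)
        lw   $5, 4($1)
        sub  $2, $4, $5         # c = a - b
END:
```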
### Operations of a Processor
[Figure: processor (PC, decoder, registers, ALU) connected to main memory by address and data buses. Registers: small capacity, fast access speed. Main memory: big capacity, slow access.]

- The ALU (Arithmetic-Logic Unit) executes arithmetic and logic operations on register values.
- Registers store data computed by the ALU; data is read from main memory and written back to main memory.
- Main memory stores programs, instructions and data.

Load/store architecture: the only way to access main memory is through load (read) and store (write) instructions; the ALU executes operations on register values.

When you want to sum two data items in main memory and write the result back to memory:
1. Read the data from memory into registers (load instructions)
2. Perform the ALU calculation using the register values
3. Write the value of the result register back to memory (store instruction)

A sketch of this sequence follows below.
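A minimal MIPS sketch of the three-step sequence above, in the style of the compiler figure; the base register and offsets are assumptions:

```
lw  $t0, 0($s0)     # step 1: load first operand A from memory
lw  $t1, 4($s0)     # step 1: load second operand B from memory
add $t2, $t0, $t1   # step 2: ALU computes C = A + B on register values
sw  $t2, 8($s0)     # step 3: store result C back to memory
```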
## TODAY
Register operands, temporary vs. saved registers, register numbering, memory operands, and the MIPS instruction set tables, as covered in the Week 08 recap above, plus:

### Registers vs. Memory
- Registers are faster to access than memory.
- Operating on memory data requires loads and stores: more instructions to be executed.
- The compiler must use registers for variables as much as possible, spilling to memory only for less frequently used variables. Register optimization is important!
# Computer Architecture
Computer Engineering, 3rd year, Semester 1 — Week 06

## LAST WEEK

### Performance and Execution Time
Performance is defined as
$$\text{Performance} = \frac{1}{\text{Execution Time}}$$
For two computers X and Y, if the performance of X is greater than that of Y, then
$$\text{Performance}_X > \text{Performance}_Y \iff \text{Execution Time}_X < \text{Execution Time}_Y$$
That is, the execution time on Y is longer if X is faster.

### Relative Performance
X is n times faster than Y when
$$\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution Time}_Y}{\text{Execution Time}_X} = n$$
Example: a program takes 10 s on computer A and 15 s on computer B. Execution Time B / Execution Time A = 15 s / 10 s = 1.5, so A is 1.5 times faster than B.

### CPU Clocking
The operation of digital hardware is governed by a constant-rate clock: within each clock cycle, data transfer and computation take place, then state is updated.
- Clock period: the duration of a clock cycle, e.g., 250 ps = 0.25 ns = 250 × 10⁻¹² s
- Clock frequency (rate): cycles per second, in Hz, e.g., 4.0 GHz = 4000 MHz = 4.0 × 10⁹ Hz

### CPU Time
CPU time is the actual time the CPU spends processing instructions for a program; it is a key measure of how efficiently a program runs on a processor. When executing a program:
$$\text{CPU Time} = \text{Clock Cycles} \times \text{Clock Cycle Time} = \frac{\text{Clock Cycles}}{\text{Clock Rate}}$$
(Units: clock cycles in cycles; clock cycle time in seconds/cycle; clock rate in cycles/second, i.e., Hz; CPU time in seconds.)

### Instruction Count and CPI
Instruction count = the total number of instructions a CPU must execute to complete a program. It depends on the program itself (what it is doing), the compiler (how it translates code), and the instruction set architecture (ISA).

CPI (cycles per instruction) = the average number of clock cycles the CPU takes to execute one instruction. It depends on the CPU hardware design, the type of instruction (some are faster/slower), and how well the CPU handles memory and branches.

When executing a program:
$$\text{Clock Cycles} = \text{Instruction Count} \times \text{CPI}$$
$$\text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}$$

### CPI in More Detail
When a program runs, not all instructions are equal: some take more clock cycles than others. So instead of a single CPI value, we calculate an average CPI based on instruction classes. Example:

| Instruction type | % of instructions | Cycles per instruction |
|---|---|---|
| ALU (add/sub) | 50% | 1 cycle |
| Load/store | 30% | 2 cycles |
| Branches (if/jump) | 20% | 3 cycles |
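Working the example table through (a computed step added for clarity):
$$\text{CPI} = 0.5 \times 1 + 0.3 \times 2 + 0.2 \times 3 = 0.5 + 0.6 + 0.6 = 1.7 \ \text{cycles/instruction}$$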
In general, if different instruction classes take different numbers of cycles, the weighted average CPI is
$$\text{Clock Cycles} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \text{Instruction Count}_i \right)$$
$$\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}} \right)$$
where the fraction is the relative frequency of class i.

### Power and Energy
Important for embedded systems and computers: embedded systems want to save battery usage as much as possible and reduce the heat generated. Dynamic power is
$$\text{Power} = \frac{1}{2} \times \text{Capacitive Load} \times \text{Voltage}^2 \times \text{Frequency}$$
where C = capacitance (how much charge moves in each switch), V = voltage, and f = clock frequency (how often switching happens).

Power increases with the square of the voltage, so lowering the voltage is very effective for saving energy. Power is proportional to frequency: higher clock speeds mean more energy burned.

### CPU Time Quiz I
Computer A has a 2 GHz clock and takes 10 s of CPU time to run a program. Design computer B to run the program in 6 s of CPU time; B can use a faster clock, but this causes it to need 1.2 times as many clock cycles as A. How fast must computer B's clock be?
$$\text{Clock Cycles}_A = \text{CPU Time}_A \times \text{Clock Rate}_A = 10\,\text{s} \times 2\,\text{GHz} = 20 \times 10^9 \ \text{cycles}$$
$$\text{Clock Rate}_B = \frac{1.2 \times \text{Clock Cycles}_A}{\text{CPU Time}_B} = \frac{1.2 \times 20 \times 10^9}{6\,\text{s}} = \frac{24 \times 10^9}{6\,\text{s}} = 4\,\text{GHz}$$

### CPI Quiz II
Computer A has a clock cycle time of 250 ps and CPI = 2.0 for some program; computer B has a clock cycle time of 500 ps and CPI = 1.2 for the same program. Which is faster, and by how much?
$$\text{CPU Time}_A = I \times 2.0 \times 250\,\text{ps} = 500 \times I\ \text{ps}$$
$$\text{CPU Time}_B = I \times 1.2 \times 500\,\text{ps} = 600 \times I\ \text{ps}$$
$$\frac{\text{CPU Time}_B}{\text{CPU Time}_A} = \frac{600 \times I}{500 \times I} = 1.2$$
A is faster, by a factor of 1.2.

### CPI Quiz III
Alternative compiled code sequences use instructions in classes A, B and C:

| Class | A | B | C |
|---|---|---|---|
| CPI for class | 1 | 2 | 3 |
| Instructions in sequence 1 | 2 | 1 | 2 |
| Instructions in sequence 2 | 4 | 1 | 1 |

1. Which sequence executes the most instructions?
2. Which will be faster? How about CPU clock cycles?
Answers:
1. Sequence 1 executes 2 + 1 + 2 = 5 instructions; sequence 2 executes 4 + 1 + 1 = 6 instructions. Sequence 1 executes fewer instructions.
2. CPU clock cycles of sequence 1 = 2×1 + 1×2 + 2×3 = 10 cycles; CPU clock cycles of sequence 2 = 4×1 + 1×2 + 1×3 = 9 cycles. Sequence 2 is faster even though it executes one more instruction.

### Quiz IV
Suppose you reduce the voltage by 17%, and this also reduces the switching frequency by 17%. By how much is the dynamic power reduced?
$$P_{\text{new}} = \frac{1}{2} \times \text{CapacitiveLoad} \times (0.83\,V)^2 \times (0.83\,f) = 0.83^3 \times P_{\text{old}} \approx 0.6 \times P_{\text{old}}$$

### Group Work: Reducing Power (the power wall)
- We can't reduce voltage much more: lower voltage saves energy, but too low causes instability.
- We can't remove more heat efficiently: smaller chips = more transistors = more heat in less space.
So simply increasing clock speed or transistor count no longer works without overheating or wasting power. Presentation goal: explain how computer architects overcome the power wall and improve speed, efficiency, and energy use.

## TODAY: POWER SAVING PRESENTATION

## INSTRUCTION SET

### Instruction and Instruction Set
- Instruction: the words of a computer's language.
- Instruction set: a set of instructions; it depends on each computer (e.g., computer A: ADD, SUB, Logic, Branch; computer B: ADD, Branch; computer C: ADD, SUB, AND, Branch).

Different computers have different instruction sets, but with many aspects in common; most instruction sets have similarities. Early computers had very simple instruction sets, which simplified implementation, and many modern computers also have simple instruction sets.

### The MIPS Instruction Set
In this class we use the MIPS instruction set: the Stanford MIPS design, commercialized by MIPS Technologies, typical of many modern ISAs, with a large share of the embedded core market, and known for its simplicity and efficiency (see the recap above).

### A part of the MIPS instruction set

| Type | Instruction | Example | Meaning |
|---|---|---|---|
| Arithmetic | add | add $s1, $s2, $s3 | $s1 = $s2 + $s3 |
| Arithmetic | subtract | sub $s1, $s2, $s3 | $s1 = $s2 − $s3 |
| Data transfer | load word | lw $s1, 100($s2) | Load from memory address [$s2 + 100] into register $s1 |
| Data transfer | store word | sw $s1, 100($s2) | Store from register $s1 to memory address [$s2 + 100] |
| Branch | branch on equal | beq $s1, $s2, L | If $s1 == $s2 then branch to L |
| Branch | branch on not equal | bne $s1, $s2, L | If $s1 != $s2 then branch to L |
| Branch | set on less than | slt $s1, $s2, $s3 | If $s2 < $s3 then $s1 = 1, otherwise $s1 = 0 |
| Jump | jump | j L | Jump to L |
| Jump | jump register | jr $s1 | Jump to the address in $s1 |

### Execution of programs
As described in the Week 07 recap above: high-level source is CPU-independent, and compilers translate it through the instruction set into CPU-dependent machine language.
### Operations of a processor
Load/store operation of the processor, registers, ALU and main memory, as described in the Week 07 recap above.

# Computer Architecture
Computer Engineering, 3rd year, Semester 1 — Week 04

## LAST WEEK

### Computer Classification
- Microcontrollers (microcomputer / microcontroller). Usage: embedded systems. Features: miniaturization, energy saving, low power consumption; CPU, memory and peripheral circuits are contained in the same IC. Currently also known as an SoC (System on a Chip). This is a computer!
- Personal computers. Usage: office work, internet, gaming. Features: low price; versatile and scalable; single architecture; large variety of peripherals, e.g., keyboard, mouse, display, sound card, printer. Currently also used in research and industry.
- Workstations. Usage: large calculations, e.g., research, industry. Features: like PCs but with higher performance; high price. They faded with the success of personal computers but revived with artificial intelligence.
- Mainframes. Usage: internet companies, banks, industry, education, offices, etc. Features: very high parallelism; one computer, many users (access via terminals).
- Supercomputers. Usage: industry and research with huge computations. Features: huge performance, huge parallelism, huge area, huge power consumption; one computer, few users.
### Computer Evolution
#### Mechanical Era
Timeframe: 1620s to the early 20th century. Key developments:
- Mechanical calculators, which could perform basic arithmetic operations.
- Charles Babbage's designs: the Difference Engine and the Analytical Engine, introducing concepts like programmability and mechanical computation.
- Punched card systems, for tabulating and processing data.

#### Electro-Mechanical Era (Transition)
Timeframe: roughly from the late 19th century to the mid-20th century. Key developments:
- Integration of electromechanical components such as relays and switches, e.g., the Harvard Mark I, which used electromechanical relays for computation.
- Punched card systems continued to be widely used for data input and processing, particularly in industries requiring large-scale data manipulation.

#### Electronic Era
Timeframe: starting in the 1940s until now. Key developments:
- ENIAC: the Electronic Numerical Integrator and Computer, completed in 1946, marked the beginning of the electronic era.
- Transistors: in 1947, the invention of the transistor at Bell Labs revolutionized electronics. Transistors replaced bulky vacuum tubes, leading to smaller, faster, and more reliable computers.

#### Post-PC Era (Contemporary)
Timeframe: from the late 20th century to the present. Key characteristics: proliferation of mobile devices (smartphones and tablets have become popular), touch interfaces and intuitive user experiences, cloud computing, the Internet of Things (IoT), and AI and voice assistants.

### Technology Trends (Now)
Advanced ICs (AI, 3D, RISC-V, thermodynamic):
- AI chips: TPU (Tensor Processing Unit) from Google; NPU (Neural Processing Unit) from Apple and Huawei
- Chiplets: modular chip building blocks used in AMD and Intel architectures
- 3D-stacked ICs: vertical integration to boost speed and density
- RISC-V: open-source CPU architecture with a growing ecosystem
- Thermodynamic computing (experimental): uses physics to power ultra-low-energy AI tasks

### Semiconductor Technology
The foundation of all modern electronics, including smartphones, computers, cars, TVs, and even smart fridges. A semiconductor is a material (like silicon) that can act as both a conductor and an insulator depending on how it is used; adding materials transforms its properties:
- Conductor: lets electricity flow
- Insulator: blocks electricity
- Switch: controls the flow (on/off)
This makes it perfect for building transistors, the basic "on/off switches" in all digital devices.

### Defining Performance
Question: which airplane has the best performance?

| Airplane | Passenger capacity | Cruising range (miles) | Cruising speed (mile/h) |
|---|---|---|---|
| Boeing 737 | 240 | 3,000 | 564 |
| BAC/Sud Concorde | 132 | 4,000 | 1,350 |
| Boeing 777 | 301 | 9,395 | 554 |
| Airbus A380-800 | 853 | 8,477 | 587 |

By passenger capacity: the Airbus. By cruising range: the Boeing 777. By cruising speed: the Concorde. The best performance depends on the definition of performance.

Group work: let's discuss the definition of performance.
1. Propose a definition for performance.
2. Under that definition, which airplane has the best performance?
If we define performance as less time required to transport people, i.e., passenger throughput = passenger capacity × cruising speed (mile/h):

| Airplane | Passenger throughput (passenger-miles/h) |
|---|---|
| Boeing 737 | 135,360 |
| BAC/Sud Concorde | 178,200 |
| Boeing 777 | 166,761 |
| Airbus A380-800 | 500,711 |

The Airbus has the best performance (passenger throughput).

### Response Time and Throughput
- Response time (execution time): the total time required for a computer to complete a task, including disk access, memory access, I/O activities, etc.
- Throughput: the total amount of work done per unit time, e.g., the number of tasks/transactions per hour.

How are response time and throughput affected by replacing the processor with a faster version, or by adding more processors?

Question A: if we replace the processor in a computer with a faster version, do we (1) increase throughput, (2) decrease response time, or (3) both of the above? Answer: 3. A faster processor decreases response time (program A runs in 5 s instead of 10 s) and increases throughput (program A runs twice in 10 s).

Question B: if we add additional processors to a system that uses multiple processors for separate tasks, do we (1) increase throughput, (2) decrease response time, or (3) both of the above? Answer: 1. Throughput increases (programs A and B run in 10 s in parallel rather than 20 s in sequence), but the response time of an individual program does not decrease.

## TODAY: PERFORMANCE OF COMPUTER

Performance and execution time, relative performance, CPU clocking, and CPU time are defined as in the Week 06 recap above.

### Measuring Execution Time
- Elapsed time: total response time, including all aspects (processing, I/O, OS overhead, idle/waiting time); it determines system performance.
- CPU time: time spent processing a given job, discounting I/O time and other jobs' shares.
  - User CPU time: time spent executing user code (e.g., a program's logic).
  - System CPU time: time spent on system-level operations (e.g., file I/O, memory allocation by the OS).
CPU time, instruction count, CPI, and the weighted-average CPI follow the formulas in the Week 06 recap above.

### Performance Summary
$$\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$
Performance depends on:
- Algorithm: affects instruction count (IC), possibly CPI
- Programming language: affects IC, CPI
- Compiler: affects IC, CPI
- Instruction set architecture: affects IC, CPI, and the cycle time T_c
Fewer instructions mean potentially faster performance; lower CPI means better performance; shorter cycles mean faster execution; a higher clock rate means more instructions per second.

### MIPS (the rate metric)
MIPS measures how many millions of instructions a CPU can execute per second, i.e., the execution speed of a computer's processor: MIPS = Instruction Count / (Execution Time × 10⁶). Comparative analysis: it allows easy comparison between different processors and systems; a higher MIPS value generally indicates a faster CPU.

### Power and Energy
Dynamic power is $P = \frac{1}{2} C V^2 f$: power increases with the square of the voltage and proportionally to the frequency, as in the Week 06 recap above.

### The von Neumann Bottleneck
In von Neumann computers there is a single memory: both program and data are stored in the same memory and are transferred over the same bus. Instructions and data cannot be retrieved from memory at the same time; this is the von Neumann bottleneck.
### Summary
- Computer architecture: the overall design of a computer system, including hardware specifications, software (OS, compilers), and the behavior and structure of the computer.
- Three key questions about computer processing: how is information represented? how is data calculated? what processes are used?
- Von Neumann architecture: the stored-program concept, sequential processing, and linear memory addressing.
- The instruction cycle: fetch → decode → execute.

# Computer Architecture
Computer Engineering, 3rd year, Semester 1 — Week 03

## LAST WEEK
The meaning of computer architecture; the overall view of a computer; the three questions about computer processing; the von Neumann architecture and the basic structure of a von Neumann computer (stored-program, sequential processing, linear addressing); the CPU's operative unit (GPR, ALU, flag register) and control unit (PC, IR, decoder, sequencer); the instruction cycle (fetch, decode, execute); and the von Neumann bottleneck, all as presented in the Week 02 section above.
# TODAY

Topics:
## Computer Classification
## Computer Evolution
## Semiconductor
## Computer Performance

Computer Classification

Five classes: microcontrollers, personal computers, workstations, mainframes, and supercomputers.

Microcontrollers
> Also called microcomputers.
> Usage: embedded systems.
> Features: miniaturization, energy saving, low power consumption. The CPU, memory, and peripheral circuits are contained in the same IC.
> Currently: also known as an SoC (System on a Chip). This is a computer!

Personal computers
> Usage: office work, internet, gaming.
> Features: low price; versatile and scalable; single architecture; large variety of peripherals, e.g., keyboard, mouse, display, sound card, printer.
> Currently: also used in research and industry.

Workstations
> Usage: large calculations, e.g., research, industry.
> Features: like PCs but with higher performance; high price. Nearly gone with the success of personal computers, but revived with artificial intelligence.

Mainframes
> Usage: internet companies, banks, industry, education, offices, etc.
> Features: very high parallelism; one computer, many users (access via terminals).

Supercomputers
> Usage: huge computations in industry and research.
> Features: huge performance, huge parallelism, huge area, huge power consumption; one computer, few users.

Computer Evolution

The evolution of computers can be broadly categorized into four main eras:
1. Mechanical Era
2. Electro-Mechanical Era (Transition)
3. Electronic Era
4. Post-PC Era (Contemporary)

Mechanical Era
Timeframe: 1620s to the early 20th century.
Key developments:
> Mechanical calculators, which could perform basic arithmetic operations.
> Charles Babbage's designs: the Difference Engine and the Analytical Engine, introducing concepts like programmability and mechanical computation.
> Punched card systems: tabulating and processing data.

Electro-Mechanical Era (Transition)
Timeframe: roughly from the late 19th century to the mid-20th century.
Key developments:
> Integration of electromechanical components such as relays and switches, e.g., the Harvard Mark I, which used electromechanical relays for computation.
> Punched card systems continued: punched card systems continued to be widely used for data input and processing, particularly in industries requiring large-scale data manipulation.

Electronic Era
Timeframe: starting in the 1940s until now.
Key developments:
> ENIAC: the Electronic Numerical Integrator and Computer, completed in 1946, marked the beginning of the electronic era.
> Transistors: in 1947, the invention of the transistor at Bell Labs revolutionized electronics. Transistors replaced bulky vacuum tubes, leading to smaller, faster, and more reliable computers.

Post-PC Era (Contemporary)
Timeframe: from the late 20th century to the present.
Key characteristics:
> Proliferation of mobile devices: smartphones and tablets have become popular.
> Touch interfaces and intuitive user experiences
> Cloud computing
> Internet of Things (IoT)
> AI and voice assistants

The Post-PC Era

Personal Mobile Devices (PMD)
> Battery operated; have become popular, offering computing capabilities on the go.
> Connect to the Internet.
> Smartphones, tablets, electronic glasses.

Cloud computing
> Software as a Service (SaaS).
> Services and applications increasingly rely on cloud infrastructure for storage, processing, and collaboration, reducing the dependence on local computing resources.
> e.g., Amazon and Google.

Internet of Things (IoT)
> Connected devices, from smart thermostats to fitness trackers, contribute to the expanding ecosystem of computing beyond traditional PCs.

AI and voice assistants
> Advances in artificial intelligence and natural language processing have led to the integration of voice-controlled assistants, like Siri, Alexa, and Google Assistant, into various devices, changing how users interact with technology.
Touchscreen
> A Post-PC device feature, simplifying interaction with technology; supersedes the keyboard and mouse.
> Resistive and capacitive types.
> Most tablets and smartphones use capacitive.
> Capacitive allows multiple simultaneous touches.

Technology Trends

Electronics technology continues to evolve:
> Increased capacity and performance
> Reduced cost

| Year | Technology | Relative performance/cost |
| --- | --- | --- |
| 1951 | Vacuum tube | 1 |
| 1965 | Transistor | 35 |
| 1975 | Integrated circuit (IC) | 900 |
| 1995 | Very large-scale IC (VLSI) | 2,400,000 |
| 2013 | Ultra large-scale IC | 250,000,000,000 |

[Chart: DRAM capacity growth over time]

Now?

Technology Trends (Now)
Advanced ICs (AI, 3D, RISC-V, thermodynamic):
> AI chips: TPU (Tensor Processing Unit) from Google; NPU (Neural Processing Unit) from Apple, Huawei.
> Chiplets: modular chip building blocks used in AMD and Intel architectures.
> 3D-stacked ICs: vertical integration to boost speed and density.
> RISC-V: open-source CPU architecture with a growing ecosystem.
> Thermodynamic computing (experimental): uses physics to power ultra-low-energy AI tasks.

Semiconductor Technology

The foundation of all modern electronics, including smartphones, computers, cars, TVs, and even smart fridges! A semiconductor is a material (like silicon) that can act as both a conductor and an insulator depending on how it is used. This makes it perfect for building transistors, the basic "on/off switches" in all digital devices. Adding materials (doping) transforms its properties:
> Conductor (lets electricity flow)
> Insulator (blocks electricity)
> Switch (controls flow on/off)

Manufacturing ICs

Yield: the proportion of working dies per wafer.
Example: Intel Core 10th Gen: a 300 mm wafer, 506 chips, 10 nm technology; each chip is 11.4 x 10.7 mm.
Example: Intel Core Ultra: https://www.thailand.intel.com/content/www/th/th/content-details/842532/intel-core-ultra-processors-200hx-series-processors-quick-reference-guide-pdf.html

Integrated Circuit Cost

Cost has a nonlinear relation to area and defect rate:
> Wafer cost and die area are fixed.
> Defect rate is determined by the manufacturing process.
> Die area is determined by architecture and circuit design.

$$\text{Cost per die} = \frac{\text{Cost per wafer}}{\text{Dies per wafer} \times \text{Yield}}$$

$$\text{Dies per wafer} \approx \frac{\text{Wafer area}}{\text{Die area}}$$

$$\text{Yield} = \frac{1}{\left(1 + \text{Defects per area} \times \text{Die area}/2\right)^2}$$

Wafer cost is fixed (e.g., $5,000 per 300 mm wafer). Dies per wafer depends on chip size. Yield is how many chips work (e.g., 90%). This shows how defect rate and chip size affect the yield: if die area increases, yield drops quickly (nonlinearly); if the defect rate is high, fewer good chips are produced. (A small sketch of this cost model appears at the end of this section.)

Response Time and Throughput

Response time (execution time)
> The total time required for a computer to complete a task, including disk access, memory access, and I/O activities.
Throughput
> The total amount of work done per unit time, e.g., the number of tasks/transactions per hour.
How are response time and throughput affected by:
> Replacing the processor with a faster version?
> Adding more processors?
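As a rough illustration of the integrated-circuit cost model above, here is a small Python sketch. The wafer cost ($5,000), die count (506), and die size (11.4 x 10.7 mm) come from the examples in this section; the defect density of 0.1 defects/cm² is an assumption chosen for the illustration.

```python
# Sketch of the IC cost model: yield and cost per die.
# Wafer cost, die count, and die size are from the slide examples;
# the defect density is an assumed value.

wafer_cost = 5_000       # dollars per 300 mm wafer (slide example)
dies_per_wafer = 506     # slide example: Intel Core 10th Gen
die_area_cm2 = (11.4 / 10) * (10.7 / 10)   # 11.4 x 10.7 mm -> ~1.22 cm^2
defects_per_cm2 = 0.1    # assumed manufacturing defect density

# Yield = 1 / (1 + Defects per area * Die area / 2)^2
yield_rate = 1 / (1 + defects_per_cm2 * die_area_cm2 / 2) ** 2

# Cost per die = Cost per wafer / (Dies per wafer * Yield)
cost_per_die = wafer_cost / (dies_per_wafer * yield_rate)

print(f"Yield        : {yield_rate:.1%}")      # ~88.8%
print(f"Cost per die : ${cost_per_die:.2f}")   # ~$11.12
# The yield term is squared, so yield (and therefore cost per die)
# worsens faster than linearly as die area or defect rate grows.
```

Plugging in a larger die area or a higher defect density makes the nonlinearity easy to see: the cost per good die climbs much faster than the die area itself.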