Transcript for:
Computer Architecture Basics

# Computer Architecture
Computer Engineering, 3rd year, Semester 1 — Week 02

## What do we mean by architecture?
In general, architecture means a style, structure, or composition of buildings. Computer architecture covers:
- The basic design and design concepts of computers
- Hardware specifications together with system software such as operating systems and compilers
- The architecture of the computer + the behavior of the computer

Overall view of a computer: instruction set, operating system, operative unit, memory unit, control unit, input and output units.

## What is a CPU?
Equivalent terms: CPU, MPU, processor. The CPU is an IC that includes the arithmetic, control, and input/output units of a computer.

Companies that design/manufacture CPUs:
- Mainstream: Intel, AMD, ARM
- Embedded and industrial: Zilog, Atmel, Fujitsu, etc.

Architecture examples:
- x86 (Intel/AMD)
- ARM (mobile devices)
- MIPS, SPARC, PowerPC (once common in academia/industry)
- PIC, AVR, 68xxx, Z80 (microcontrollers)

## Computer Processing
The flow of data and control between the major parts of a computer:
1. CPU (Central Processing Unit)
2. Input devices
3. Memory (RAM / main memory)
4. Output devices

## Three questions about computer processing
Input data → processing, computing and storage of information → output data → information display.
- Q1: How do computers represent information? In what format? → Binary format
- Q2: How is the acquired data calculated? → Algorithms
- Q3: What procedures are used to process (calculate) the "information"? → ALU switching elements

Group work: let's talk about Q1, Q2 and Q3.
- Q1: Text and images are represented in binary format.
- Q2: 1. Identify and gather the data; 2. Transform it into a usable format; 3. Perform the calculation through the CPU (ALU).
- Q3: Validation, sorting, summarization, aggregation, analysis and classification.

## Programming languages
- High-level language (easy for humans to understand): describes instructions to a computer in a human-readable form. High-level languages are converted into machine-language programs by compilers and interpreters. Examples: C, C++, Java, Python.
- Assembly language (low-level language): uses mnemonics (e.g., MOV, ADD, B) to represent machine instructions. Closer to hardware than high-level languages. Translated into machine language by an assembler.
- Machine language: binary code converted from assembly language; the only form a computer can directly understand, and practically impossible for humans to read. A sketch of the full path from C to machine code follows below.
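As a minimal sketch of these three levels, here is one line of C hand-translated to MIPS assembly and then to machine code, following the R-format encoding covered in a later week. The register assignments are assumptions for illustration:

```
# C source (high-level language):   a = b + c;
# Assuming a -> $s0, b -> $s1, c -> $s2

add $s0, $s1, $s2   # assembly language: mnemonic plus register operands

# Machine language (the 32-bit R-format word the assembler emits):
#   op     rs    rt    rd    shamt funct
#   000000 10001 10010 10000 00000 100000   = 0x02328020
```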
## Von Neumann Architecture (von Neumann-type computers)
Many computers are designed as von Neumann-type calculators. The architecture was proposed by John von Neumann in 1945.

## The Basic Structure of a von Neumann Computer
- Stored-program computer: instructions (programs) and data are placed indistinguishably in main memory; the distinction between instructions and data is made by the program.
- Sequential processing: a program is a list of instructions; instructions are fetched one by one from main memory and executed in a determined order. The memory position of the next instruction to execute is stored in a register called the program counter (PC).
- Linear address: each cell of main memory is numbered sequentially; this number is called the address. The address is used to indicate the location of instructions and data.

[Figure: main memory as a linear array of word addresses (0, 4, 8, 12, ...) holding MIPS instructions such as add $s1, $zero, $zero and data such as variables x, y and array elements; the PC points at the next instruction, illustrating sequential processing and the stored-program concept.]

The central processing unit (CPU) is the part of the computer that contains the arithmetic and control units.
- Operating unit: the arithmetic unit performs arithmetic and logical operations and temporarily stores the operands and results.
- Control unit: controls the operation of the computer.
- Memory unit: stores the program and the data.
  - Main memory unit: the part of the storage to which the CPU has direct access.
  - Auxiliary memory unit: the part of the memory that cannot be directly accessed by the CPU.
- Input/output unit(s): performs the data transfers between the computer and its environment.

## The detailed structure of the CPU: the operative unit
1. General-Purpose Registers (GPR): registers that temporarily store the operands and results of the computer's calculations. Compared to the memory units they are faster, smaller, and have no addresses.
2. Arithmetic and Logic Unit (ALU): the device that performs the arithmetic and logic operations.
3. Flag Register (FR): stores the status of the previous operation. The flag register is 4 or more bits wide; its role is conditional branching (if and for statements).
   - CF: Carry Flag, the output carry of the operation
   - OF: Overflow Flag, the overflow of signed calculations
   - ZF: Zero Flag, indicates whether the result was 0
   - SF: Sign Flag, indicates whether the result was negative
   For example, computing 5 - 5 sets ZF, while 3 - 5 clears ZF and sets SF.

## The detailed structure of the CPU: the control unit
1. Program Counter (PC): the register that stores the address of the next instruction to be executed.
2. Instruction Register (IR): the register that stores the instruction being executed.
3. Decoder (DE): a circuit that decodes the value of the instruction stored in the instruction register; its result drives a set of control signals for the computer.
4. Sequencer: the circuit that generates the control signals related to the clock and the status of the computer.
## The Basic Behavior of a von Neumann Computer
The execution flow (instruction cycle) is the following infinite loop: Fetch (FE) → Decode (DE) → Execute (EX).
1. Fetch (FE): the step at which the instruction indicated by the PC is read from memory and stored in the IR.
2. Decode (DE): the step at which the instruction stored in the IR is decoded.
3. Execute (EX): the step at which the instruction is carried out, depending on the results of the decoding stage.

## The von Neumann Bottleneck
What is a bottleneck? A bottleneck is a situation that limits the overall performance of a given product, like the neck of a bottle.

# Computer Architecture
Computer Engineering, 3rd year, Semester 1 — Week 08

## LAST WEEK

### Register Operands
Arithmetic instructions use register operands. MIPS has a register file of 32 32-bit registers:
- Used for frequently accessed data
- Numbered 0 to 31
- 32-bit data is called a word

Assembler names, e.g.:
- $t0, $t1, ..., $t9 for temporary values
- $s0, $s1, ..., $s7 for saved variables (e.g., initial inputs)

Design principle: smaller is faster.
- Main memory has millions of locations → slower
- Keep frequently used data in registers → fast access
- Minimize memory accesses

### MIPS: Temporary vs. Saved Registers
Use $t when you don't care about keeping the value after a function call; use $s when the value must survive across calls. Example:

    # Using $t0 for temporary math
    add $t0, $a0, $a1
    # Using $s0 to store a persistent value
    add $s0, $a2, $a3

If this function calls another function, it must save $s0 (using sw) before the call and restore it (using lw) afterward. But $t0 doesn't need to be preserved; it might be overwritten. A sketch of this save/restore pattern follows below.
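A minimal sketch of the save/restore convention just described. The function body, the `helper` callee, and the stack layout are illustrative assumptions; sw, lw and jal are the instructions the text names (note that jal also overwrites $ra, so a non-leaf function saves that too):

```
compute:
    addi $sp, $sp, -8      # make room for two words on the stack
    sw   $ra, 4($sp)       # save return address (jal below overwrites $ra)
    sw   $s0, 0($sp)       # save $s0: its value must survive the call
    add  $s0, $a2, $a3     # persistent value kept in $s0
    add  $t0, $a0, $a1     # scratch value in $t0 (helper may clobber it)
    jal  helper            # call another function
    add  $v0, $s0, $zero   # $s0 is still valid after the call
    lw   $s0, 0($sp)       # restore caller's $s0
    lw   $ra, 4($sp)       # restore return address
    addi $sp, $sp, 8       # pop the stack frame
    jr   $ra
```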
### MIPS Register Numbering and Assembler Names
MIPS has 32 general-purpose registers, labeled $0 to $31. These are hardware register numbers, but in assembly programming we use symbolic names to make code easier to read and understand.

| Common use | Assembler name | Register number |
|---|---|---|
| Always 0 (read-only) | $zero | $0 |
| Reserved for assembler | $at | $1 |
| Function return values | $v0–$v1 | $2–$3 |
| Function arguments | $a0–$a3 | $4–$7 |
| Temporary registers | $t0–$t7 | $8–$15 |
| Saved registers | $s0–$s7 | $16–$23 |
| More temporary registers | $t8–$t9 | $24–$25 |
| Reserved for kernel | $k0–$k1 | $26–$27 |
| Global pointer | $gp | $28 |
| Stack pointer | $sp | $29 |
| Frame pointer | $fp | $30 |
| Return address | $ra | $31 |

### Memory Operands
Main memory is used for composite data: arrays, structures, dynamic data. To apply arithmetic operations, MIPS does not operate on memory directly:
- Load values from memory into registers
- Store results from registers back to memory

Memory is byte-addressed: each address identifies an 8-bit byte (byte 1 at address 0, byte 2 at address 1, byte 3 at address 2, byte 4 at address 3). MIPS loads/stores 32-bit words (4 bytes), and words are aligned in memory: a word address must be a multiple of 4.

### Memory Operand Example
C code: `g = h + A[8];` where g is in $s1, h is in $s2, and the base address of array A (that is, the address of A[0]) is in $s3.

Compiled MIPS code; index 8 requires an offset of 32 (4 bytes per word):

    lw  $t0, 32($s3)    # load word: $t0 = A[8]  (offset 32, base register $s3)
    add $s1, $s2, $t0   # g = h + A[8]

[Figure: A[0] at byte address 1000, A[1] at 1004, ..., A[8] at 1032; one word = 4 bytes (32 bits).]
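As a further illustration of the offset rule above (a sketch with the same register assignments assumed): to compute `A[5] = A[3] + h`, the word offsets are 3 × 4 = 12 and 5 × 4 = 20:

```
lw  $t0, 12($s3)    # $t0 = A[3]   (offset 3 * 4 = 12)
add $t0, $s2, $t0   # $t0 = h + A[3]
sw  $t0, 20($s3)    # A[5] = $t0   (offset 5 * 4 = 20)
```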
### The MIPS Instruction Set
MIPS is used as the example in this class. The Stanford MIPS design was commercialized by MIPS Technologies. It is typical of many modern ISAs (Instruction Set Architectures), and similar ISAs have a large share of the embedded core market, with applications in consumer electronics, network/storage equipment, cameras, printers, etc. It is known for its simplicity and efficiency.

MIPS instructions covered here fall into four major categories:
1. Arithmetic instructions
2. Logical operations
3. Memory access instructions
4. Control flow instructions

### Arithmetic Instructions
The add operation has three operands: two sources and one destination.

    add a, b, c   # a gets b + c

All arithmetic operations have this form (sub, and, or, slt). Design principle: simplicity favors regularity. Regularity makes implementation simpler, and simplicity enables higher performance at lower cost.

| Instruction | MIPS example | C equivalent | Assumed register mapping | Description |
|---|---|---|---|---|
| add | add $t0, $t1, $t2 | x = a + b | $t0 = x, $t1 = a, $t2 = b | Addition |
| sub | sub $t0, $t1, $t2 | x = a - b | $t0 = x, $t1 = a, $t2 = b | Subtraction |
| addi | addi $t0, $t1, 5 | x = a + 5 | $t0 = x, $t1 = a | Add immediate (constant) |
| slt | slt $t0, $t1, $t2 | x = (a < b) ? 1 : 0 | $t0 = x, $t1 = a, $t2 = b | Set if less than (signed) |
| sltu | sltu $t0, $t1, $t2 | x = (a < b) ? 1 : 0 | $t0 = x, $t1 = a, $t2 = b | Set if less than (unsigned) |

### Logical Instructions
| Instruction | MIPS example | C equivalent | Assumed register mapping | Description |
|---|---|---|---|---|
| and | and $t0, $t1, $t2 | x = a & b | $t0 = x, $t1 = a, $t2 = b | Bitwise AND |
| or | or $t0, $t1, $t2 | x = a \| b | $t0 = x, $t1 = a, $t2 = b | Bitwise OR |
| xor | xor $t0, $t1, $t2 | x = a ^ b | $t0 = x, $t1 = a, $t2 = b | Bitwise XOR |
| nor | nor $t0, $t1, $t2 | x = ~(a \| b) | $t0 = x, $t1 = a, $t2 = b | Bitwise NOR |
| andi | andi $t0, $t1, 0xF0 | x = a & 0xF0 | $t0 = x, $t1 = a | AND with constant |

### Memory Access Instructions
| Instruction | MIPS example | C equivalent | Assumed register mapping | Description |
|---|---|---|---|---|
| lw | lw $t0, 8($s1) | x = A[i] | $t0 = x, $s1 = base address | Load word from memory |
| sw | sw $t0, 8($s1) | A[i] = x | $t0 = x, $s1 = base address | Store word to memory |
| lb | lb $t0, 4($s1) | x = byte[i] | $t0 = x, $s1 = base address | Load byte from memory |
| sb | sb $t0, 4($s1) | byte[i] = x | $t0 = x, $s1 = base address | Store byte to memory |

### Control Flow Instructions
| Instruction | MIPS example | C equivalent | Assumed register mapping | Description |
|---|---|---|---|---|
| beq | beq $t1, $t2, LABEL | if (a == b) | $t1 = a, $t2 = b | Branch if equal |
| bne | bne $t1, $t2, LABEL | if (a != b) | $t1 = a, $t2 = b | Branch if not equal |
| j | j LABEL | goto LABEL | – | Unconditional jump |
| jal | jal FUNC | call FUNC() | return address placed in $ra | Jump and link (call) |
| jr | jr $ra | return | $ra = return address | Jump to register (return) |

### Quiz 1: Arithmetic Example
Suppose we place the sum of four variables b, c, d and e into variable a: `a = b + c + d + e;` What is the compiled MIPS code?

    add a, b, c   # sum of b and c is placed in a
    add a, a, d   # sum of b, c and d is now in a
    add a, a, e   # sum of b, c, d and e is now in a

### Quiz 2: Arithmetic Example
Suppose we compute `a = b + c; d = a - e;` What is the compiled MIPS code?

    add a, b, c
    sub d, a, e

### Quiz 4: Register Operands
C code: `f = (g + h) - (i + j);` The variables f, g, ..., j are assigned to the registers $s0, $s1, ..., $s4. What is the compiled MIPS code?

    add $t0, $s1, $s2   # register $t0 contains g + h
    add $t1, $s3, $s4   # register $t1 contains i + j
    sub $s0, $t0, $t1   # f gets (g + h) - (i + j)

### Quiz 5: Memory Operands
C code: `A[12] = h + A[8];` with h in $s2 and the base address of A in $s3. What is the compiled code?

    lw  $t0, 32($s3)   # load word: $t0 = A[8]
    add $t0, $s2, $t0  # $t0 = h + A[8]
    sw  $t0, 48($s3)   # store word: A[12] = $t0
### Convert C code to MIPS assembly
Exercises (assume a in $t0, b in $t1, c in $t2):
1. Find the absolute value (abs): `if (a < 0) abs = -a; else abs = a;`
2. Find the maximum (max): `if (a > b) max = a; else max = b;`
3. Check whether a is between b and c: `if (b < a && a < c) result = 1; else result = 0;`

Find the maximum (max in $t2). Hint: use slt, beq, bne, add, sub.

        slt $t3, $t1, $t0       # if a > b (i.e. b < a) then $t3 = 1
        beq $t3, $zero, ELSE    # if false, go to ELSE
        add $t2, $t0, $zero     # max = a
        j END
    ELSE: add $t2, $t1, $zero   # max = b
    END:

Find the absolute value (abs in $t1). Hint: use slt, beq/bne, sub, add.

        slt $t2, $t0, $zero     # if a < 0 then $t2 = 1
        beq $t2, $zero, ELSE    # if a >= 0, go to ELSE
        sub $t1, $zero, $t0     # abs = -a  (0 - a)
        j END
    ELSE: add $t1, $t0, $zero   # abs = a
    END:

Check whether a is between b and c (result in $t3).

        slt $t4, $t1, $t0       # b < a  ->  $t4 = 1
        slt $t5, $t0, $t2       # a < c  ->  $t5 = 1
        add $t6, $t4, $t5       # $t6 = $t4 + $t5
        beq $t6, 2, YES         # if both are true, result = 1
        add $t3, $zero, $zero   # result = 0
        j END
    YES: add $t3, $zero, 1      # result = 1
    END:

(Note: `beq $t6, 2, YES` and `add $t3, $zero, 1` use immediate operands that the real beq and add instructions do not accept; assemblers such as SPIM/MARS expand these pseudo-instructions into real ones, e.g., using addi and a temporary register.)

## TODAY
Today's topics: radix conversion (binary, base-2; decimal, base-10; hexadecimal, base-16), 2's complement, and representing instructions.

### Counting in decimal and hexadecimal
We use decimal numbers in our daily lives. Decimal (base-10) uses the 10 digits 0 to 9; when a count reaches 10, a carry occurs. Hexadecimal (base-16) must represent values of 10 to 15 in a single digit, so it uses the letters A, B, C, D, E and F; a carry occurs at 16.

[Figure: counting tables for decimal (..., 9, 10, 11, ..., 99, 100, ...) and hexadecimal (..., 9, A, B, ..., F, 10, 11, ..., FF, 100, ...).]

### Conversion from base-n to decimal
$$365_{(10)} = 3 \times 10^2 + 6 \times 10^1 + 5 \times 10^0 = 365_{(10)}$$
$$101_{(2)} = 1 \times 2^2 + 0 \times 2^1 + 1 \times 2^0 = 5_{(10)}$$
$$125_{(8)} = 1 \times 8^2 + 2 \times 8^1 + 5 \times 8^0 = 85_{(10)}$$
$$2AD_{(16)} = 2 \times 16^2 + A \times 16^1 + D \times 16^0 = 2 \times 16^2 + 10 \times 16^1 + 13 \times 16^0 = 685_{(10)}$$

### Conversion from decimal to base-n
Repeatedly divide by the target base and collect the remainders, last remainder first.

$85_{(10)}$ to base 8: $85 \div 8 = 10$ r $5$; $10 \div 8 = 1$ r $2$; $1 \div 8 = 0$ r $1$. Answer: $125_{(8)}$.

$685_{(10)}$ to base 16: $685 \div 16 = 42$ r $13$ (D); $42 \div 16 = 2$ r $10$ (A); $2 \div 16 = 0$ r $2$. Answer: $2AD_{(16)}$.

### Hexadecimal
Base 16 is a compact representation of bit strings: 4 bits per hex digit.

| Hex | Binary | Hex | Binary | Hex | Binary | Hex | Binary |
|---|---|---|---|---|---|---|---|
| 0 | 0000 | 4 | 0100 | 8 | 1000 | c | 1100 |
| 1 | 0001 | 5 | 0101 | 9 | 1001 | d | 1101 |
| 2 | 0010 | 6 | 0110 | a | 1010 | e | 1110 |
| 3 | 0011 | 7 | 0111 | b | 1011 | f | 1111 |

Example: eca8 6420 = 1110 1100 1010 1000 0110 0100 0010 0000

### Unsigned Binary Integers
Given an n-bit number $x = x_{n-1}x_{n-2}\cdots x_1x_0$, its value is
$$x = x_{n-1} \cdot 2^{n-1} + x_{n-2} \cdot 2^{n-2} + \cdots + x_1 \cdot 2^1 + x_0 \cdot 2^0$$

Example: what is the decimal value of the following 32-bit binary code?
$$0000\ 0000\ 0000\ 0000\ 0000\ 0000\ 0000\ 1011_2 = 0 + \cdots + 1 \cdot 2^3 + 0 \cdot 2^2 + 1 \cdot 2^1 + 1 \cdot 2^0 = 8 + 0 + 2 + 1 = 11_{10}$$

Range: 0 to $2^n - 1$. Using 32 bits: 0 to +4,294,967,295.
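As a cross-check connecting this example with the repeated-division method from the radix-conversion section above (an added worked step, not from the slides), converting $11_{(10)}$ back to binary:

$$11 \div 2 = 5 \ \text{r}\ 1,\quad 5 \div 2 = 2 \ \text{r}\ 1,\quad 2 \div 2 = 1 \ \text{r}\ 0,\quad 1 \div 2 = 0 \ \text{r}\ 1 \;\Rightarrow\; 11_{(10)} = 1011_{(2)}$$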
### 2's-Complement Signed Integers
Given an n-bit number $x = x_{n-1}x_{n-2}\cdots x_1x_0$, its value is
$$x = -x_{n-1} \cdot 2^{n-1} + x_{n-2} \cdot 2^{n-2} + \cdots + x_1 \cdot 2^1 + x_0 \cdot 2^0$$

Example: what is the decimal value of the following 2's-complement binary code?
$$1111\ 1111\ 1111\ 1111\ 1111\ 1111\ 1111\ 1100_2 = -1 \cdot 2^{31} + 1 \cdot 2^{30} + \cdots + 1 \cdot 2^2 + 0 \cdot 2^1 + 0 \cdot 2^0$$
$$= -2{,}147{,}483{,}648 + 2{,}147{,}483{,}644 = -4_{10}$$

Range: $-2^{n-1}$ to $+2^{n-1} - 1$. Using 32 bits: −2,147,483,648 to +2,147,483,647.

Bit 31 is the sign bit: 1 for negative numbers, 0 for non-negative numbers. Non-negative numbers have the same unsigned and 2's-complement representation. Some specific numbers:
- 0: 0000 0000 ... 0000
- −1: 1111 1111 ... 1111
- Most negative: 1000 0000 ... 0000 = −2,147,483,648
- Most positive: 0111 1111 ... 1111 = +2,147,483,647

Negation: complement the bits and add 1 (complement means 1 → 0, 0 → 1; e.g., X = 0000 1111 gives $\bar{X}$ = 1111 0000). This works because $x + \bar{x} = 1111\ldots111_2 = -1$, so $\bar{x} + 1 = -x$. Example: negate +2.
- +2 = 0000 0000 ... 0010₂
- −2 = 1111 1111 ... 1101₂ + 1 = 1111 1111 ... 1110₂

### Sign Extension
Representing a number using more bits while preserving its numeric value. Why it is important: instructions that handle smaller values must widen them.
- addi extends its immediate value
- lb, lh extend the loaded byte/halfword

Rule: replicate the sign bit to the left (cf. unsigned values, which are extended with 0s). Examples, 8-bit to 16-bit:
- +2: 0000 0010 → 0000 0000 0000 0010
- −2: 1111 1110 → 1111 1111 1111 1110

In the MIPS instruction set:
- addi extends its immediate value. For example, `addi $t0, $t1, -13` computes $t0 = $t1 + (−13); −13 is 1111 0011 in 8 bits and 1111 1111 1111 0011 in 16 bits, and this value is sign-extended to 1111 1111 1111 1111 1111 1111 1111 0011 in 32 bits before the addition.
- lb (load byte) and lh (load halfword) load 8-bit and 16-bit data from memory into registers, sign-extending to 32 bits. A short sketch follows below.
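A minimal sketch of sign extension during a byte load. The data layout and the `la` pseudo-instruction follow SPIM/MARS assembler conventions and are assumptions; lb is from the slides, and lbu (load byte unsigned, which zero-extends instead) is added here for contrast:

```
.data
val:    .byte 0xF3          # 1111 0011 = -13 as a signed byte

.text
la  $s1, val                # base address ("la" is an assembler pseudo-instruction)
lb  $t0, 0($s1)             # sign-extends:  $t0 = 0xFFFFFFF3 = -13
lbu $t1, 0($s1)             # zero-extends:  $t1 = 0x000000F3 = 243
```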
### Representing Instructions
Instructions are encoded in binary, called machine code. MIPS instructions are encoded as 32-bit instruction words containing an operation code (opcode), register numbers, etc. Two formats are covered here:
- MIPS R-format instructions: used for arithmetic and logical operations
- MIPS I-format instructions: used for operations involving constants (immediate values), load/store instructions, and branch instructions

Register numbers:
- $t0–$t7 are registers number 8–15
- $t8–$t9 are registers number 24–25
- $s0–$s7 are registers number 16–23
- $t registers hold temporary values; $s registers hold saved variables in a program

### MIPS R-format Instructions

| op | rs | rt | rd | shamt | funct |
|---|---|---|---|---|---|
| 6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits |

Instruction fields:
- op: operation code (opcode)
- rs: first source register number
- rt: second source register number
- rd: destination register number
- shamt: shift amount (00000 for now)
- funct: function code (extends the opcode)

R-format example (add operation): add $t0, $s1, $s2 ($t0 = $s1 + $s2)

| | op | rs | rt | rd | shamt | funct |
|---|---|---|---|---|---|---|
| Field | special | $s1 | $s2 | $t0 | 0 | add |
| Decimal | 0 | 17 | 18 | 8 | 0 | 32 |
| Binary | 000000 | 10001 | 10010 | 01000 | 00000 | 100000 |

= 0000 0010 0011 0010 0100 0000 0010 0000₂ = 02324020₁₆

R-format example (sub operation): sub $t0, $s1, $s2 ($t0 = $s1 − $s2)

| | op | rs | rt | rd | shamt | funct |
|---|---|---|---|---|---|---|
| Field | special | $s1 | $s2 | $t0 | 0 | sub |
| Decimal | 0 | 17 | 18 | 8 | 0 | 34 |
| Binary | 000000 | 10001 | 10010 | 01000 | 00000 | 100010 |

= 0000 0010 0011 0010 0100 0000 0010 0010₂ = 02324022₁₆

R-format opcode and funct values:

| Instruction | Opcode (bin) | Funct (dec) | Funct (bin) |
|---|---|---|---|
| add | 000000 | 32 | 100000 |
| sub | 000000 | 34 | 100010 |
| and | 000000 | 36 | 100100 |
| or | 000000 | 37 | 100101 |
| xor | 000000 | 38 | 100110 |
| nor | 000000 | 39 | 100111 |
| slt | 000000 | 42 | 101010 |
| sll | 000000 | 0 | 000000 |
| srl | 000000 | 2 | 000010 |
| jr | 000000 | 8 | 001000 |

### MIPS I-format Instructions

| op | rs | rt | constant or address |
|---|---|---|---|
| 6 bits | 5 bits | 5 bits | 16 bits |

Used for immediate arithmetic and load/store instructions:
- rt: destination or source register number
- constant: $-2^{15}$ to $+2^{15} - 1$
- address: offset added to the base address in rs

Design principle: good design demands good compromises. Different formats complicate decoding, but allow all instructions to be 32 bits uniformly; keep the formats as similar as possible.

I-format example (load operation): lw $t0, 32($s3) — A[8] is loaded into $t0

| | op | rs | rt | constant or address |
|---|---|---|---|---|
| Field | load word | $s3 | $t0 | 32 |
| Decimal | 35 | 19 | 8 | 32 |
| Binary | 100011 | 10011 | 01000 | 0000 0000 0010 0000 |

= 1000 1110 0110 1000 0000 0000 0010 0000₂ = 8E680020₁₆

I-format opcodes:

| Instruction | Opcode (dec) | Opcode (bin) |
|---|---|---|
| addi | 8 | 001000 |
| andi | 12 | 001100 |
| ori | 13 | 001101 |
| xori | 14 | 001110 |
| lw | 35 | 100011 |
| sw | 43 | 101011 |
| lb | 32 | 100000 |
| sb | 40 | 101000 |
| lh | 33 | 100001 |
| sh | 41 | 101001 |
| beq | 4 | 000100 |
| bne | 5 | 000101 |
| slti | 10 | 001010 |

### Logical Operations
Instructions for bitwise manipulation, useful for extracting and inserting groups of bits in a word.

| Operation | C | Java | MIPS |
|---|---|---|---|
| Shift left | << | << | sll |
| Shift right | >> | >>> | srl |
| Bitwise AND | & | & | and, andi |
| Bitwise OR | \| | \| | or, ori |
| Bitwise NOT | ~ | ~ | nor |

### Shift Operations
Shifts are MIPS R-format instructions; the shamt field says how many positions to shift.
- Shift left logical (sll): shift left and fill with 0 bits; sll by i bits multiplies by $2^i$.
- Shift right logical (srl): shift right and fill with 0 bits; srl by i bits divides by $2^i$ (unsigned only). A short sketch follows below.
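A small sketch of the multiply/divide-by-powers-of-two trick above; the register contents are assumed for illustration:

```
addi $t1, $zero, 5     # $t1 = 5
sll  $t0, $t1, 2       # $t0 = 5 * 2^2 = 20   (00101 -> 10100)
srl  $t2, $t0, 1       # $t2 = 20 / 2^1 = 10  (unsigned divide)
```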
### AND Operations
Useful to mask bits in a word: select some bits, clear others to 0.

    and $t0, $t1, $t2

    $t2 = 0000 0000 0000 0000 0000 1101 1100 0000
    $t1 = 0000 0000 0000 0000 0011 1100 0000 0000
    $t0 = 0000 0000 0000 0000 0000 1100 0000 0000

### OR Operations
Useful to include bits in a word: set some bits to 1, leave others unchanged.

    or $t0, $t1, $t2

    $t2 = 0000 0000 0000 0000 0000 1101 1100 0000
    $t1 = 0000 0000 0000 0000 0011 1100 0000 0000
    $t0 = 0000 0000 0000 0000 0011 1101 1100 0000

### NOT Operations
Useful to invert bits in a word: change 0 to 1, and 1 to 0. MIPS has the 3-operand NOR instruction: a NOR b == NOT(a OR b). Since register 0 ($zero) always reads as zero:

    nor $t0, $t1, $zero

    $t1 = 0000 0000 0000 0000 0011 1100 0000 0000
    $t0 = 1111 1111 1111 1111 1100 0011 1111 1111

# Computer Architecture
Computer Engineering, 3rd year, Semester 1 — Week 07

## LAST WEEK

### Power saving
1. Parallelism: multi-core processors (more cores in a single processor chip); simultaneous multithreading (SMT); GPUs and accelerators (graphics processing units and other accelerators, such as TPUs for machine learning).
2. Memory optimization: prefetching (predict and load data into the cache before the CPU needs it, to reduce wait times); memory bandwidth and latency improvements (faster memory technologies such as DDR5 or HBM, and optimized memory controller designs).
3. Software and algorithm optimization: efficient algorithms (lower computational complexity reduces the number of instructions required to complete a task); concurrency (design software to effectively use multiple cores and threads).
4. Specialized hardware: develop custom hardware tailored to specific tasks.
5. Energy-efficient computing: turn off portions of the processor when they are not in use.
6. Heterogeneous computing: integrate different types of processors on the same chip to handle diverse workloads.

### Execution of programs
A program written in a high-level language (C, Java, Verilog-HDL, etc.) is CPU-independent. A compiler translates it, through the instruction set, into assembly language and then into machine language (binary code) for a particular CPU. Machine language is CPU-dependent, so each CPU (CPU-A, CPU-B, CPU-C) needs its own compiler targeting its own instruction set. The instruction set is the interface between programs and CPUs.

[Figure: C source such as "if (y == 1) c = a + b; else c = a - b;" compiled to assembly (lw $4, 0($1) / lw $5, 4($1) / add $2, $4, $5 / ...) and assembled into binary machine code for the CPU.]
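To make the figure concrete, here is a hedged hand-compilation of its C fragment into MIPS assembly. The assignments (y in $t0, a at 0($1), b at 4($1), c in $2) extend the figure's three-instruction fragment and are assumptions for illustration:

```
        addi $t1, $zero, 1      # $t1 = 1
        bne  $t0, $t1, ELSE     # if (y != 1) take the else branch
        lw   $4, 0($1)          # $4 = a
        lw   $5, 4($1)          # $5 = b
        add  $2, $4, $5         # c = a + b   (the figure's fragment)
        j    END
ELSE:   lw   $4, 0($1)
        lw   $5, 4($1)
        sub  $2, $4, $5         # c = a - b
END:
```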
### Operations of a Processor
[Figure: processor (PC, decoder, registers, ALU) connected to main memory by address and data buses. Registers: small capacity, fast access speed. Main memory: big capacity, slow access.]

- The ALU (Arithmetic-Logic Unit) executes arithmetic and logic operations on register values.
- Registers store data computed by the ALU; data is read from main memory and written back to main memory.
- Main memory stores programs, instructions and data.

Load/store architecture: the only way to access main memory is through load (read) and store (write) instructions; the ALU executes operations on register values.

When you want to sum two data items in main memory and write the result back to memory:
1. Read the data from memory into registers (load instructions)
2. Perform the ALU calculation using the register values
3. Write the value of the result register back to memory (store instruction)

A sketch of this sequence follows below.
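A minimal MIPS sketch of the three-step sequence above, in the style of the compiler figure; the base register and offsets are assumptions:

```
lw  $t0, 0($s0)     # step 1: load first operand A from memory
lw  $t1, 4($s0)     # step 1: load second operand B from memory
add $t2, $t0, $t1   # step 2: ALU computes C = A + B on register values
sw  $t2, 8($s0)     # step 3: store result C back to memory
```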
## TODAY
Register operands, temporary vs. saved registers, register numbering, memory operands, and the MIPS instruction set tables, as covered in the Week 08 recap above, plus:

### Registers vs. Memory
- Registers are faster to access than memory.
- Operating on memory data requires loads and stores: more instructions to be executed.
- The compiler must use registers for variables as much as possible, spilling to memory only for less frequently used variables. Register optimization is important!
# Computer Architecture
Computer Engineering, 3rd year, Semester 1 — Week 06

## LAST WEEK

### Performance and Execution Time
Performance is defined as
$$\text{Performance} = \frac{1}{\text{Execution Time}}$$
For two computers X and Y, if the performance of X is greater than that of Y, then
$$\text{Performance}_X > \text{Performance}_Y \iff \text{Execution Time}_X < \text{Execution Time}_Y$$
That is, the execution time on Y is longer if X is faster.

### Relative Performance
X is n times faster than Y when
$$\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution Time}_Y}{\text{Execution Time}_X} = n$$
Example: a program takes 10 s on computer A and 15 s on computer B. Execution Time B / Execution Time A = 15 s / 10 s = 1.5, so A is 1.5 times faster than B.

### CPU Clocking
The operation of digital hardware is governed by a constant-rate clock: within each clock cycle, data transfer and computation take place, then state is updated.
- Clock period: the duration of a clock cycle, e.g., 250 ps = 0.25 ns = 250 × 10⁻¹² s
- Clock frequency (rate): cycles per second, in Hz, e.g., 4.0 GHz = 4000 MHz = 4.0 × 10⁹ Hz

### CPU Time
CPU time is the actual time the CPU spends processing instructions for a program; it is a key measure of how efficiently a program runs on a processor. When executing a program:
$$\text{CPU Time} = \text{Clock Cycles} \times \text{Clock Cycle Time} = \frac{\text{Clock Cycles}}{\text{Clock Rate}}$$
(Units: clock cycles in cycles; clock cycle time in seconds/cycle; clock rate in cycles/second, i.e., Hz; CPU time in seconds.)

### Instruction Count and CPI
Instruction count = the total number of instructions a CPU must execute to complete a program. It depends on the program itself (what it is doing), the compiler (how it translates code), and the instruction set architecture (ISA).

CPI (cycles per instruction) = the average number of clock cycles the CPU takes to execute one instruction. It depends on the CPU hardware design, the type of instruction (some are faster/slower), and how well the CPU handles memory and branches.

When executing a program:
$$\text{Clock Cycles} = \text{Instruction Count} \times \text{CPI}$$
$$\text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}$$

### CPI in More Detail
When a program runs, not all instructions are equal: some take more clock cycles than others. So instead of a single CPI value, we calculate an average CPI based on instruction classes. Example:

| Instruction type | % of instructions | Cycles per instruction |
|---|---|---|
| ALU (add/sub) | 50% | 1 cycle |
| Load/store | 30% | 2 cycles |
| Branches (if/jump) | 20% | 3 cycles |
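Working the example table through (a computed step added for clarity):
$$\text{CPI} = 0.5 \times 1 + 0.3 \times 2 + 0.2 \times 3 = 0.5 + 0.6 + 0.6 = 1.7 \ \text{cycles/instruction}$$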
In general, if different instruction classes take different numbers of cycles, the weighted average CPI is
$$\text{Clock Cycles} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \text{Instruction Count}_i \right)$$
$$\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}} \right)$$
where the fraction is the relative frequency of class i.

### Power and Energy
Important for embedded systems and computers: embedded systems want to save battery usage as much as possible and reduce the heat generated. Dynamic power is
$$\text{Power} = \frac{1}{2} \times \text{Capacitive Load} \times \text{Voltage}^2 \times \text{Frequency}$$
where C = capacitance (how much charge moves in each switch), V = voltage, and f = clock frequency (how often switching happens).

Power increases with the square of the voltage, so lowering the voltage is very effective for saving energy. Power is proportional to frequency: higher clock speeds mean more energy burned.

### CPU Time Quiz I
Computer A has a 2 GHz clock and takes 10 s of CPU time to run a program. Design computer B to run the program in 6 s of CPU time; B can use a faster clock, but this causes it to need 1.2 times as many clock cycles as A. How fast must computer B's clock be?
$$\text{Clock Cycles}_A = \text{CPU Time}_A \times \text{Clock Rate}_A = 10\,\text{s} \times 2\,\text{GHz} = 20 \times 10^9 \ \text{cycles}$$
$$\text{Clock Rate}_B = \frac{1.2 \times \text{Clock Cycles}_A}{\text{CPU Time}_B} = \frac{1.2 \times 20 \times 10^9}{6\,\text{s}} = \frac{24 \times 10^9}{6\,\text{s}} = 4\,\text{GHz}$$

### CPI Quiz II
Computer A has a clock cycle time of 250 ps and CPI = 2.0 for some program; computer B has a clock cycle time of 500 ps and CPI = 1.2 for the same program. Which is faster, and by how much?
$$\text{CPU Time}_A = I \times 2.0 \times 250\,\text{ps} = 500 \times I\ \text{ps}$$
$$\text{CPU Time}_B = I \times 1.2 \times 500\,\text{ps} = 600 \times I\ \text{ps}$$
$$\frac{\text{CPU Time}_B}{\text{CPU Time}_A} = \frac{600 \times I}{500 \times I} = 1.2$$
A is faster, by a factor of 1.2.

### CPI Quiz III
Alternative compiled code sequences use instructions in classes A, B and C:

| Class | A | B | C |
|---|---|---|---|
| CPI for class | 1 | 2 | 3 |
| Instructions in sequence 1 | 2 | 1 | 2 |
| Instructions in sequence 2 | 4 | 1 | 1 |

1. Which sequence executes the most instructions?
2. Which will be faster? How about CPU clock cycles?
Answers:
1. Sequence 1 executes 2 + 1 + 2 = 5 instructions; sequence 2 executes 4 + 1 + 1 = 6 instructions. Sequence 1 executes fewer instructions.
2. CPU clock cycles of sequence 1 = 2×1 + 1×2 + 2×3 = 10 cycles; CPU clock cycles of sequence 2 = 4×1 + 1×2 + 1×3 = 9 cycles. Sequence 2 is faster even though it executes one more instruction.

### Quiz IV
Suppose you reduce the voltage by 17%, and this also reduces the switching frequency by 17%. By how much is the dynamic power reduced?
$$P_{\text{new}} = \frac{1}{2} \times \text{CapacitiveLoad} \times (0.83\,V)^2 \times (0.83\,f) = 0.83^3 \times P_{\text{old}} \approx 0.6 \times P_{\text{old}}$$

### Group Work: Reducing Power (the power wall)
- We can't reduce voltage much more: lower voltage saves energy, but too low causes instability.
- We can't remove more heat efficiently: smaller chips = more transistors = more heat in less space.
So simply increasing clock speed or transistor count no longer works without overheating or wasting power. Presentation goal: explain how computer architects overcome the power wall and improve speed, efficiency, and energy use.

## TODAY: POWER SAVING PRESENTATION

## INSTRUCTION SET

### Instruction and Instruction Set
- Instruction: the words of a computer's language.
- Instruction set: a set of instructions; it depends on each computer (e.g., computer A: ADD, SUB, Logic, Branch; computer B: ADD, Branch; computer C: ADD, SUB, AND, Branch).

Different computers have different instruction sets, but with many aspects in common; most instruction sets have similarities. Early computers had very simple instruction sets, which simplified implementation, and many modern computers also have simple instruction sets.

### The MIPS Instruction Set
In this class we use the MIPS instruction set: the Stanford MIPS design, commercialized by MIPS Technologies, typical of many modern ISAs, with a large share of the embedded core market, and known for its simplicity and efficiency (see the recap above).

### A part of the MIPS instruction set

| Type | Instruction | Example | Meaning |
|---|---|---|---|
| Arithmetic | add | add $s1, $s2, $s3 | $s1 = $s2 + $s3 |
| Arithmetic | subtract | sub $s1, $s2, $s3 | $s1 = $s2 − $s3 |
| Data transfer | load word | lw $s1, 100($s2) | Load from memory address [$s2 + 100] into register $s1 |
| Data transfer | store word | sw $s1, 100($s2) | Store from register $s1 to memory address [$s2 + 100] |
| Branch | branch on equal | beq $s1, $s2, L | If $s1 == $s2 then branch to L |
| Branch | branch on not equal | bne $s1, $s2, L | If $s1 != $s2 then branch to L |
| Branch | set on less than | slt $s1, $s2, $s3 | If $s2 < $s3 then $s1 = 1, otherwise $s1 = 0 |
| Jump | jump | j L | Jump to L |
| Jump | jump register | jr $s1 | Jump to the address in $s1 |

### Execution of programs
As described in the Week 07 recap above: high-level source is CPU-independent, and compilers translate it through the instruction set into CPU-dependent machine language.
### Operations of a processor
Load/store operation of the processor, registers, ALU and main memory, as described in the Week 07 recap above.

# Computer Architecture
Computer Engineering, 3rd year, Semester 1 — Week 04

## LAST WEEK

### Computer Classification
- Microcontrollers (microcomputer / microcontroller). Usage: embedded systems. Features: miniaturization, energy saving, low power consumption; CPU, memory and peripheral circuits are contained in the same IC. Currently also known as an SoC (System on a Chip). This is a computer!
- Personal computers. Usage: office work, internet, gaming. Features: low price; versatile and scalable; single architecture; large variety of peripherals, e.g., keyboard, mouse, display, sound card, printer. Currently also used in research and industry.
- Workstations. Usage: large calculations, e.g., research, industry. Features: like PCs but with higher performance; high price. They faded with the success of personal computers but revived with artificial intelligence.
- Mainframes. Usage: internet companies, banks, industry, education, offices, etc. Features: very high parallelism; one computer, many users (access via terminals).
- Supercomputers. Usage: industry and research with huge computations. Features: huge performance, huge parallelism, huge area, huge power consumption; one computer, few users.
### Computer Evolution
#### Mechanical Era
Timeframe: 1620s to the early 20th century. Key developments:
- Mechanical calculators, which could perform basic arithmetic operations.
- Charles Babbage's designs: the Difference Engine and the Analytical Engine, introducing concepts like programmability and mechanical computation.
- Punched card systems, for tabulating and processing data.

#### Electro-Mechanical Era (Transition)
Timeframe: roughly from the late 19th century to the mid-20th century. Key developments:
- Integration of electromechanical components such as relays and switches, e.g., the Harvard Mark I, which used electromechanical relays for computation.
- Punched card systems continued to be widely used for data input and processing, particularly in industries requiring large-scale data manipulation.

#### Electronic Era
Timeframe: starting in the 1940s until now. Key developments:
- ENIAC: the Electronic Numerical Integrator and Computer, completed in 1946, marked the beginning of the electronic era.
- Transistors: in 1947, the invention of the transistor at Bell Labs revolutionized electronics. Transistors replaced bulky vacuum tubes, leading to smaller, faster, and more reliable computers.

#### Post-PC Era (Contemporary)
Timeframe: from the late 20th century to the present. Key characteristics: proliferation of mobile devices (smartphones and tablets have become popular), touch interfaces and intuitive user experiences, cloud computing, the Internet of Things (IoT), and AI and voice assistants.

### Technology Trends (Now)
Advanced ICs (AI, 3D, RISC-V, thermodynamic):
- AI chips: TPU (Tensor Processing Unit) from Google; NPU (Neural Processing Unit) from Apple and Huawei
- Chiplets: modular chip building blocks used in AMD and Intel architectures
- 3D-stacked ICs: vertical integration to boost speed and density
- RISC-V: open-source CPU architecture with a growing ecosystem
- Thermodynamic computing (experimental): uses physics to power ultra-low-energy AI tasks

### Semiconductor Technology
The foundation of all modern electronics, including smartphones, computers, cars, TVs, and even smart fridges. A semiconductor is a material (like silicon) that can act as both a conductor and an insulator depending on how it is used; adding materials transforms its properties:
- Conductor: lets electricity flow
- Insulator: blocks electricity
- Switch: controls the flow (on/off)
This makes it perfect for building transistors, the basic "on/off switches" in all digital devices.

### Defining Performance
Question: which airplane has the best performance?

| Airplane | Passenger capacity | Cruising range (miles) | Cruising speed (mile/h) |
|---|---|---|---|
| Boeing 737 | 240 | 3,000 | 564 |
| BAC/Sud Concorde | 132 | 4,000 | 1,350 |
| Boeing 777 | 301 | 9,395 | 554 |
| Airbus A380-800 | 853 | 8,477 | 587 |

By passenger capacity: the Airbus. By cruising range: the Boeing 777. By cruising speed: the Concorde. The best performance depends on the definition of performance.

Group work: let's discuss the definition of performance.
1. Propose a definition for performance.
2. Under that definition, which airplane has the best performance?
If we define performance as less time required to transport people, i.e., passenger throughput = passenger capacity × cruising speed (mile/h):

| Airplane | Passenger throughput (passenger-miles/h) |
|---|---|
| Boeing 737 | 135,360 |
| BAC/Sud Concorde | 178,200 |
| Boeing 777 | 166,761 |
| Airbus A380-800 | 500,711 |

The Airbus has the best performance (passenger throughput).

### Response Time and Throughput
- Response time (execution time): the total time required for a computer to complete a task, including disk access, memory access, I/O activities, etc.
- Throughput: the total amount of work done per unit time, e.g., the number of tasks/transactions per hour.

How are response time and throughput affected by replacing the processor with a faster version, or by adding more processors?

Question A: if we replace the processor in a computer with a faster version, do we (1) increase throughput, (2) decrease response time, or (3) both of the above? Answer: 3. A faster processor decreases response time (program A runs in 5 s instead of 10 s) and increases throughput (program A runs twice in 10 s).

Question B: if we add additional processors to a system that uses multiple processors for separate tasks, do we (1) increase throughput, (2) decrease response time, or (3) both of the above? Answer: 1. Throughput increases (programs A and B run in 10 s in parallel rather than 20 s in sequence), but the response time of an individual program does not decrease.

## TODAY: PERFORMANCE OF COMPUTER

Performance and execution time, relative performance, CPU clocking, and CPU time are defined as in the Week 06 recap above.

### Measuring Execution Time
- Elapsed time: total response time, including all aspects (processing, I/O, OS overhead, idle/waiting time); it determines system performance.
- CPU time: time spent processing a given job, discounting I/O time and other jobs' shares.
  - User CPU time: time spent executing user code (e.g., a program's logic).
  - System CPU time: time spent on system-level operations (e.g., file I/O, memory allocation by the OS).
CPU time, instruction count, CPI, and the weighted-average CPI follow the formulas in the Week 06 recap above.

### Performance Summary
$$\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$
Performance depends on:
- Algorithm: affects instruction count (IC), possibly CPI
- Programming language: affects IC, CPI
- Compiler: affects IC, CPI
- Instruction set architecture: affects IC, CPI, and the cycle time T_c
Fewer instructions mean potentially faster performance; lower CPI means better performance; shorter cycles mean faster execution; a higher clock rate means more instructions per second.

### MIPS (the rate metric)
MIPS measures how many millions of instructions a CPU can execute per second, i.e., the execution speed of a computer's processor: MIPS = Instruction Count / (Execution Time × 10⁶). Comparative analysis: it allows easy comparison between different processors and systems; a higher MIPS value generally indicates a faster CPU.

### Power and Energy
Dynamic power is $P = \frac{1}{2} C V^2 f$: power increases with the square of the voltage and proportionally to the frequency, as in the Week 06 recap above.

### The von Neumann Bottleneck
In von Neumann computers there is a single memory: both program and data are stored in the same memory and are transferred over the same bus. Instructions and data cannot be retrieved from memory at the same time; this is the von Neumann bottleneck.
### Summary
- Computer architecture: the overall design of a computer system, including hardware specifications, software (OS, compilers), and the behavior and structure of the computer.
- Three key questions about computer processing: how is information represented? how is data calculated? what processes are used?
- Von Neumann architecture: the stored-program concept, sequential processing, and linear memory addressing.
- The instruction cycle: fetch → decode → execute.

# Computer Architecture
Computer Engineering, 3rd year, Semester 1 — Week 03

## LAST WEEK
The meaning of computer architecture; the overall view of a computer; the three questions about computer processing; the von Neumann architecture and the basic structure of a von Neumann computer (stored-program, sequential processing, linear addressing); the CPU's operative unit (GPR, ALU, flag register) and control unit (PC, IR, decoder, sequencer); the instruction cycle (fetch, decode, execute); and the von Neumann bottleneck, all as presented in the Week 02 section above.
# TODAY

Topics:
## Computer Classification
## Computer Evolution
## Semiconductor
## Computer Performance

Computer Classification

Five classes: microcontrollers, personal computers, workstations, mainframes, and supercomputers.

Microcontrollers
> Also called microcomputers.
> Usage: embedded systems.
> Features: miniaturization, energy saving, low power consumption. The CPU, memory, and peripheral circuits are contained in the same IC.
> Currently: also known as an SoC (System on a Chip). This is a computer!

Personal computers
> Usage: office work, internet, gaming.
> Features: low price; versatile and scalable; single architecture; large variety of peripherals, e.g., keyboard, mouse, display, sound card, printer.
> Currently: also used in research and industry.

Workstations
> Usage: large calculations, e.g., research, industry.
> Features: like PCs but with higher performance; high price. Nearly gone with the success of personal computers, but revived with artificial intelligence.

Mainframes
> Usage: internet companies, banks, industry, education, offices, etc.
> Features: very high parallelism; one computer, many users (access via terminals).

Supercomputers
> Usage: huge computations in industry and research.
> Features: huge performance, huge parallelism, huge area, huge power consumption; one computer, few users.

Computer Evolution

The evolution of computers can be broadly categorized into four main eras:
1. Mechanical Era
2. Electro-Mechanical Era (Transition)
3. Electronic Era
4. Post-PC Era (Contemporary)

Mechanical Era
Timeframe: 1620s to the early 20th century.
Key developments:
> Mechanical calculators, which could perform basic arithmetic operations.
> Charles Babbage's designs: the Difference Engine and the Analytical Engine, introducing concepts like programmability and mechanical computation.
> Punched card systems: tabulating and processing data.

Electro-Mechanical Era (Transition)
Timeframe: roughly from the late 19th century to the mid-20th century.
Key developments:
> Integration of electromechanical components such as relays and switches, e.g., the Harvard Mark I, which used electromechanical relays for computation.
> Punched card systems continued: punched card systems continued to be widely used for data input and processing, particularly in industries requiring large-scale data manipulation.

Electronic Era
Timeframe: starting in the 1940s until now.
Key developments:
> ENIAC: the Electronic Numerical Integrator and Computer, completed in 1946, marked the beginning of the electronic era.
> Transistors: in 1947, the invention of the transistor at Bell Labs revolutionized electronics. Transistors replaced bulky vacuum tubes, leading to smaller, faster, and more reliable computers.

Post-PC Era (Contemporary)
Timeframe: from the late 20th century to the present.
Key characteristics:
> Proliferation of mobile devices: smartphones and tablets have become popular.
> Touch interfaces and intuitive user experiences
> Cloud computing
> Internet of Things (IoT)
> AI and voice assistants

The Post-PC Era

Personal Mobile Devices (PMD)
> Battery operated; have become popular, offering computing capabilities on the go.
> Connect to the Internet.
> Smartphones, tablets, electronic glasses.

Cloud computing
> Software as a Service (SaaS).
> Services and applications increasingly rely on cloud infrastructure for storage, processing, and collaboration, reducing the dependence on local computing resources.
> e.g., Amazon and Google.

Internet of Things (IoT)
> Connected devices, from smart thermostats to fitness trackers, contribute to the expanding ecosystem of computing beyond traditional PCs.

AI and voice assistants
> Advances in artificial intelligence and natural language processing have led to the integration of voice-controlled assistants, like Siri, Alexa, and Google Assistant, into various devices, changing how users interact with technology.
Touchscreen
> A Post-PC device feature, simplifying interaction with technology; supersedes the keyboard and mouse.
> Resistive and capacitive types.
> Most tablets and smartphones use capacitive.
> Capacitive allows multiple simultaneous touches.

Technology Trends

Electronics technology continues to evolve:
> Increased capacity and performance
> Reduced cost

| Year | Technology | Relative performance/cost |
| --- | --- | --- |
| 1951 | Vacuum tube | 1 |
| 1965 | Transistor | 35 |
| 1975 | Integrated circuit (IC) | 900 |
| 1995 | Very large-scale IC (VLSI) | 2,400,000 |
| 2013 | Ultra large-scale IC | 250,000,000,000 |

[Chart: DRAM capacity growth over time]

Now?

Technology Trends (Now)
Advanced ICs (AI, 3D, RISC-V, thermodynamic):
> AI chips: TPU (Tensor Processing Unit) from Google; NPU (Neural Processing Unit) from Apple, Huawei.
> Chiplets: modular chip building blocks used in AMD and Intel architectures.
> 3D-stacked ICs: vertical integration to boost speed and density.
> RISC-V: open-source CPU architecture with a growing ecosystem.
> Thermodynamic computing (experimental): uses physics to power ultra-low-energy AI tasks.

Semiconductor Technology

The foundation of all modern electronics, including smartphones, computers, cars, TVs, and even smart fridges! A semiconductor is a material (like silicon) that can act as both a conductor and an insulator depending on how it is used. This makes it perfect for building transistors, the basic "on/off switches" in all digital devices. Adding materials (doping) transforms its properties:
> Conductor (lets electricity flow)
> Insulator (blocks electricity)
> Switch (controls flow on/off)

Manufacturing ICs

Yield: the proportion of working dies per wafer.
Example: Intel Core 10th Gen: a 300 mm wafer, 506 chips, 10 nm technology; each chip is 11.4 x 10.7 mm.
Example: Intel Core Ultra: https://www.thailand.intel.com/content/www/th/th/content-details/842532/intel-core-ultra-processors-200hx-series-processors-quick-reference-guide-pdf.html

Integrated Circuit Cost

Cost has a nonlinear relation to area and defect rate:
> Wafer cost and die area are fixed.
> Defect rate is determined by the manufacturing process.
> Die area is determined by architecture and circuit design.

$$\text{Cost per die} = \frac{\text{Cost per wafer}}{\text{Dies per wafer} \times \text{Yield}}$$

$$\text{Dies per wafer} \approx \frac{\text{Wafer area}}{\text{Die area}}$$

$$\text{Yield} = \frac{1}{\left(1 + \text{Defects per area} \times \text{Die area}/2\right)^2}$$

Wafer cost is fixed (e.g., $5,000 per 300 mm wafer). Dies per wafer depends on chip size. Yield is how many chips work (e.g., 90%). This shows how defect rate and chip size affect the yield: if die area increases, yield drops quickly (nonlinearly); if the defect rate is high, fewer good chips are produced. (A small sketch of this cost model appears at the end of this section.)

Response Time and Throughput

Response time (execution time)
> The total time required for a computer to complete a task, including disk access, memory access, and I/O activities.
Throughput
> The total amount of work done per unit time, e.g., the number of tasks/transactions per hour.
How are response time and throughput affected by:
> Replacing the processor with a faster version?
> Adding more processors?
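As a rough illustration of the integrated-circuit cost model above, here is a small Python sketch. The wafer cost ($5,000), die count (506), and die size (11.4 x 10.7 mm) come from the examples in this section; the defect density of 0.1 defects/cm² is an assumption chosen for the illustration.

```python
# Sketch of the IC cost model: yield and cost per die.
# Wafer cost, die count, and die size are from the slide examples;
# the defect density is an assumed value.

wafer_cost = 5_000       # dollars per 300 mm wafer (slide example)
dies_per_wafer = 506     # slide example: Intel Core 10th Gen
die_area_cm2 = (11.4 / 10) * (10.7 / 10)   # 11.4 x 10.7 mm -> ~1.22 cm^2
defects_per_cm2 = 0.1    # assumed manufacturing defect density

# Yield = 1 / (1 + Defects per area * Die area / 2)^2
yield_rate = 1 / (1 + defects_per_cm2 * die_area_cm2 / 2) ** 2

# Cost per die = Cost per wafer / (Dies per wafer * Yield)
cost_per_die = wafer_cost / (dies_per_wafer * yield_rate)

print(f"Yield        : {yield_rate:.1%}")      # ~88.8%
print(f"Cost per die : ${cost_per_die:.2f}")   # ~$11.12
# The yield term is squared, so yield (and therefore cost per die)
# worsens faster than linearly as die area or defect rate grows.
```

Plugging in a larger die area or a higher defect density makes the nonlinearity easy to see: the cost per good die climbs much faster than the die area itself.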