Department of Computer Science
National Tsing Hua University
EECS403000 Computer Architecture
Spring 2024, Homework 1 Solution (Revised) and Rubric
Due date:

1. (40 points) Installing and using AndeSight™ for RISC-V program development.

(1) See AndeSight_STD_v5.3_Installation_Guide.pdf for the installation guide.
(2) Create a new Andes C Project:
    (i) Click on File → New → Project → C/C++, and select “Andes C Project”.
    (ii) Create a project with the project name “fast_power_recur”.
    (iii) From Chip Profile, select chip profile “AE350” → “ADP-AE350-NX25F”.
    (iv) From Project Type, select project type “Andes Executable” → “Hello World ANSI C Project”.
    (v) From Toolchains, select “nds64le-elf-mculib-v5d”.
    (vi) Leave the other configurations at their defaults.
(3) Replace fast_power_recur.c in the project with the one we provided.
(4) To build the project, click on the expanding arrow (a small triangle) beside “Build” in the toolbar → “1 Debug” for project “fast_power_recur”.
(5) To execute the program, press “Debug” in the toolbar → “1 Application Program”, then press “Resume” in the debug window.

You can follow the same steps for the other programs.
To select (or check) the optimization setting, follow the figure below.
To inspect the assembly code of the program, follow the figure below.

In this exercise, we will experiment with the naïve and fast power computation in two different implementations (iterative and recursive) with AndeSight™. There are four source files, namely naive_power_iter.c, naive_power_recur.c, fast_power_iter.c, and fast_power_recur.c. The optimization level -Og is used by default in the following questions unless stated otherwise.

(a) (10 points) Effects of algorithms on performance
Press “Profile” in the toolbar and select Profile as “Application Program”. Press the “Resume” button in the debug window, record the CycC and InsC for the four functions listed in the table below, and complete the table.
Based on the characteristics of the programs, briefly compare and explain the differences between the naïve and fast power algorithms in their profiles.

| Function | Source | CycC | InsC |
|---|---|---|---|
| naive_power_iter() | naive_power_iter.c | 83 | 50 |
| naive_power_recur() | naive_power_recur.c | 194 | 135 |
| fast_power_iter() | fast_power_iter.c | 42 | 27 |
| fast_power_recur() | fast_power_recur.c | 118 | 73 |

[Screenshots: profiler output for naive_power_iter(), naive_power_recur(), fast_power_iter(), and fast_power_recur()]

- InsC and CycC of the iterative naïve power algorithm > InsC and CycC of the iterative fast power algorithm.
- InsC and CycC of the recursive naïve power algorithm > InsC and CycC of the recursive fast power algorithm.
- The number of multiplication operations in the naïve implementation is directly proportional to the given power. In this case, both the iterative and recursive versions involve 11 multiplication operations.
- On the other hand, the number of multiplication operations in the fast power algorithm is at most $2 \times \lceil \log_2(\text{power}) \rceil$. In this case, both the iterative and recursive versions involve 7 multiplication operations.
- Therefore, the fast power algorithm naturally has lower CycC and InsC than its naïve counterpart.

Grading policy:
- Each CycC and InsC value is worth 1 point.
- Comparison and justification are worth 2 points.

(b) (10 points) Effects of programming on performance
From the table above and based on the characteristics of the programs, briefly compare and explain the differences between the iterative and recursive implementations of the fast power algorithm in their profiles. Suppose that they are executed on a processor with a clock rate of 3 GHz. What are the average CPI and CPU execution time for the fast_power_iter() and fast_power_recur() functions?

| Function | Average CPI | Average Execution Time |
|---|---|---|
| fast_power_iter() | $\text{CPI} = \frac{\text{CycC}}{\text{InsC}} = \frac{42}{27} \approx 1.5556$ | $\text{ET} = \frac{\text{CycC}}{\text{Clock Rate}} = \frac{42}{3 \times 10^{9}} = 14\ \text{ns}$ |
| fast_power_recur() | $\text{CPI} = \frac{\text{CycC}}{\text{InsC}} = \frac{118}{73} \approx 1.61644$ | $\text{ET} = \frac{\text{CycC}}{\text{Clock Rate}} = \frac{118}{3 \times 10^{9}} \approx 39.3\ \text{ns}$ |

- CycC and InsC of the recursive fast power algorithm > CycC and InsC of the iterative fast power algorithm.
- The recursive implementation requires more push and pop (memory access) instructions to maintain the call stack, which leads to a higher instruction count and cycle count.

Grading policy:
- Each CPI and Execution Time value is worth 2 points.
- Comparison and justification are worth 2 points.

(c) (10 points) Effects of the compiler on performance
Compile fast_power_iter.c and fast_power_recur.c with two optimization levels, -O0 and -O1. Record the CycC and InsC and compute the corresponding CPI for the two optimization levels. Furthermore, briefly compare and explain the differences in their profiles.

| Function | -O0 CycC | -O0 InsC | -O0 CPI | -O1 CycC | -O1 InsC | -O1 CPI |
|---|---|---|---|---|---|---|
| fast_power_iter() | 179 or 176 | 83 | $\frac{179}{83} \approx 2.15663$ or $\frac{176}{83} \approx 2.12048$ | 42 | 27 | $\frac{42}{27} \approx 1.5556$ |
| fast_power_recur() | 271 | 166 | $\frac{271}{166} \approx 1.6325$ | 118 | 73 | $\frac{118}{73} \approx 1.6164$ |

[Screenshots: profiler output for fast_power_iter() and fast_power_recur() with optimization flag -O0, and for fast_power_iter() and fast_power_recur() with optimization flag -O1]

- CycC and InsC of the iterative version with -O0 > CycC and InsC of the iterative version with -O1.
- CycC and InsC of the recursive version with -O0 > CycC and InsC of the recursive version with -O1.
- The -O1 optimization flag allows the compiler to optimize for speed.
- On the other hand, the -O0 optimization flag does not optimize the program.
- Therefore, code compiled with -O1 naturally has lower CycC and InsC than code compiled with -O0.

Note: the screenshot of fast_power_iter() with optimization flag -O0 with CycC 176 is unavailable.

Grading policy:
- Each CycC and InsC value is worth 0.5 points.
- Each CPI value is worth 1 point.
- Comparison and justification are worth 2 points.
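The CPI and execution-time entries in (b) and (c) follow directly from the recorded CycC and InsC values, so they can be cross-checked with a few lines of Python. This is only a sketch for verifying the arithmetic: the helper names `cpi` and `execution_time_ns` and the `PROFILES` dictionary are illustrative and are not part of the homework or of AndeSight™.

```python
CLOCK_RATE_HZ = 3e9  # 3 GHz clock rate given in (b)


def cpi(cycles, instructions):
    """Average cycles per instruction: CycC / InsC."""
    return cycles / instructions


def execution_time_ns(cycles, clock_rate_hz=CLOCK_RATE_HZ):
    """Execution time in nanoseconds: CycC / clock rate."""
    return cycles / clock_rate_hz * 1e9


# (CycC, InsC) pairs copied from the tables in (a) and (c).
PROFILES = {
    "fast_power_iter  -O0": (179, 83),
    "fast_power_recur -O0": (271, 166),
    "fast_power_iter  -O1": (42, 27),
    "fast_power_recur -O1": (118, 73),
}

if __name__ == "__main__":
    for name, (cyc, ins) in PROFILES.items():
        print(f"{name}: CPI = {cpi(cyc, ins):.5f}, "
              f"time = {execution_time_ns(cyc):.2f} ns")
```

Keeping the raw counts in one table and deriving CPI and time from them avoids rounding an intermediate value and then reusing it.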
(d) (10 points) Compilers versus hardware implementations
If we want to run the -O0 codes compiled in (c) on a faster processor to achieve the same speedup as running the -O1 codes on the original processor with a clock rate of 3 GHz in fast_power_iter() and fast_power_recur(), what will the clock rates of the faster processor be for fast_power_iter.c and fast_power_recur.c, respectively?

| Function | Clock Rate |
|---|---|
| fast_power_iter() | $\frac{179}{42} \times 3 \times 10^{9} = 12.78571\ \text{GHz}$ or $\frac{176}{42} \times 3 \times 10^{9} = 12.57143\ \text{GHz}$ |
| fast_power_recur() | $\frac{271}{118} \times 3 \times 10^{9} = 6.88983\ \text{GHz}$ |

Derivation:

$$\text{Execution Time}_{\text{-O0, new processor}} = \text{Execution Time}_{\text{-O1, old processor}}$$

$$\frac{\text{Clock Cycles}_{\text{-O0}}}{\text{Clock Rate}_{\text{new processor}}} = \frac{\text{Clock Cycles}_{\text{-O1}}}{\text{Clock Rate}_{\text{old processor}}}$$

$$\text{Clock Rate}_{\text{new processor}} = \frac{\text{Clock Cycles}_{\text{-O0}}}{\text{Clock Cycles}_{\text{-O1}}} \times \text{Clock Rate}_{\text{old processor}}$$

Grading policy:
- Each Clock Rate value is worth 4 points.
- Correct derivation of the formula is worth 2 points.

2. (25 points) Benchmarking
Below is a comparison between three mobile phones and their processors.

| Product | Samsung Galaxy S24 Ultra | Apple iPhone 15 Pro Max | Google Pixel 8 Pro |
|---|---|---|---|
| SoC | Snapdragon 8 Gen 3 | Apple A17 Pro | Google Tensor G3 |
| Cores | 8 (1+3+2+2) | 6 (2+4) | 9 (1+4+4) |
| PDF renderer | 227.4 Mpixels/sec | 178.5 Mpixels/sec | 153 Mpixels/sec |
| HDR | 238.2 Mpixels/sec | 232.4 Mpixels/sec | 136.9 Mpixels/sec |
| Background blur | 26.7 images/sec | 27.9 images/sec | 15 images/sec |
| Photo processing | 64.9 images/sec | 79.1 images/sec | 47 images/sec |
| Ray tracing | 7.38 Mpixels/sec | 7.58 Mpixels/sec | 4.52 Mpixels/sec |

The information above is provided by https://nanoreview.net/en/soc-list/rating.

(a) (5 points) Follow the link https://nanoreview.net/en/soc-list/rating and fill in the table below.
| SoC | Cores | Core name | Peak frequency (MHz) |
|---|---|---|---|
| Snapdragon 8 Gen 3 | One core | Cortex-X4 | 3300 |
| | Three cores | Cortex-A720 | 3150 |
| | Two cores | Cortex-A720 | 2960 |
| | Two cores | Cortex-A520 | 2260 |
| Apple A17 Pro | Two cores | Everest | 3780 |
| | Four cores | Sawtooth | 2110 |
| Google Tensor G3 | One core | Cortex-X3 | 2910 |
| | Four cores | Cortex-A715 | 2370 |
| | Four cores | Cortex-A510 | 1700 |

Note: 1 GHz = 1000 MHz
References: Snapdragon 8 Gen 3, Apple A17 Pro, and Google Tensor G3.

Grading policy:
- Each incorrect value is -1 point.
- A minimum of zero points is given.

(b) (10 points) Suppose that we run three computer graphics and multimedia programs on all three smartphones:
Program A: Renders 114,000,000 pixels when viewing HW1.pdf.
Program B: Blurs the background of 2,000 images in the image gallery.
Program C: Processes 4,000 images in the image gallery.
For simplicity, we assume that each program only runs on a single core. The Samsung Galaxy S24 Ultra uses Cortex-X4, the Apple iPhone 15 Pro Max uses Everest, and the Google Pixel 8 Pro uses Cortex-X3. Furthermore, there is no other overhead. We are interested in the execution time (in seconds) and the clock cycles (in millions) on each smartphone. Use the information provided in the table and your answer in (a) to complete the table below.

| Smartphone | Program A: Seconds | Program A: Clock Cycles (millions) | Program B: Seconds | Program B: Clock Cycles (millions) | Program C: Seconds | Program C: Clock Cycles (millions) |
|---|---|---|---|---|---|---|
| Samsung (Cortex-X4) | $\frac{114 \times 10^{6}}{227.4 \times 10^{6}} \approx 0.501$ | $\frac{114}{227.4} \times 3300 \approx 1654.354$ | $\frac{2000}{26.7} \approx 74.906$ | $\frac{2000}{26.7} \times 3300 \approx 247191.011$ | $\frac{4000}{64.9} \approx 61.633$ | $\frac{4000}{64.9} \times 3300 \approx 203389.831$ |
| Apple (Everest) | $\frac{114 \times 10^{6}}{178.5 \times 10^{6}} \approx 0.639$ | $\frac{114}{178.5} \times 3780 \approx 2414.118$ | $\frac{2000}{27.9} \approx 71.685$ | $\frac{2000}{27.9} \times 3780 \approx 270967.742$ | $\frac{4000}{79.1} \approx 50.569$ | $\frac{4000}{79.1} \times 3780 \approx 191150.442$ |
| Google (Cortex-X3) | $\frac{114 \times 10^{6}}{153 \times 10^{6}} \approx 0.745$ | $\frac{114}{153} \times 2910 \approx 2168.235$ | $\frac{2000}{15} \approx 133.333$ | $\frac{2000}{15} \times 2910 = 388000$ | $\frac{4000}{47} \approx 85.106$ | $\frac{4000}{47} \times 2910 \approx 247659.574$ |

$$\text{Execution Time} = \frac{\text{Objects}}{\text{Processing Speed}}$$

$$\text{Clock Cycles} = \text{Execution Time} \times \text{Clock Rate} = \frac{\text{Objects}}{\text{Processing Speed}} \times \text{Clock Rate}$$

Because the clock rate is expressed in MHz, the product of seconds and clock rate gives the clock cycles in millions.

Note: Some precision errors may exist if the rounded results of seconds are directly multiplied by the peak frequency of the cores.

Grading policy:
- Each second and clock cycle value is worth 0.5 points.
- Correct derivation of the formula is worth 1 point.

(c) (10 points) We are interested in comparing the performance of the three smartphones. Calculate the relative performance of the three smartphones, with each phone as the reference for comparison. Use your answer in (b) to complete the table below and summarize the performance results by calculating the geometric mean of the performance ratio of the three benchmark programs (Program A, Program B, and Program C). Hint: you might only need to compute some of the six values from scratch.

Each row is the smartphone being evaluated and each column is the reference smartphone.

| Performance Ratio | Reference: Samsung Galaxy S24 Ultra | Reference: Apple iPhone 15 Pro Max | Reference: Google Pixel 8 Pro |
|---|---|---|---|
| Samsung Galaxy S24 Ultra | 1 | $\frac{1}{0.99990} = 1.00010$ | $\frac{1}{0.64930} = 1.54012$ |
| Apple iPhone 15 Pro Max | $\sqrt[3]{\frac{178.5}{227.4} \times \frac{27.9}{26.7} \times \frac{79.1}{64.9}} = 0.99990$ | 1 | $\frac{1}{0.64936} = 1.53997$ |
| Google Pixel 8 Pro | $\sqrt[3]{\frac{153}{227.4} \times \frac{15}{26.7} \times \frac{47}{64.9}} = 0.64930$ | $\sqrt[3]{\frac{153}{178.5} \times \frac{15}{27.9} \times \frac{47}{79.1}} = 0.64936$ | 1 |

Here $\text{Geometric Mean}_{A,B}$, the entry in row A and column B, is the geometric mean of the performance ratio of machine A with machine B as reference.

Hint explanation: Observe that the geometric mean of A with reference B is the reciprocal of the geometric mean of B with reference A. Thus, we only need to compute three values and take their reciprocals.

Note: Some precision errors may exist if the rounded results of seconds are used to compute the geometric means of the performance ratio, but the ranking of the three smartphones should remain the same.

Grading policy:
- Overall ranking is worth 5 points.
- Each incorrect geometric mean value is -1 point.
- A minimum of zero points is given.

3. (10 points) Performance and Speedup
Assume that a program requires the execution of $100 \times 10^{6}$ FP instructions, $140 \times 10^{6}$ INT instructions, $110 \times 10^{6}$ L/S instructions, and $55 \times 10^{6}$ branch instructions. The CPI for each type of instruction is 3, 2, 5, and 3, respectively. Assume that the processor has a 5 GHz clock rate.

(a) (5 points) By how much must we improve the CPI of INT instructions if we want the program to run two times faster? Please show the calculation procedure.

$$\text{Clock Cycles} = \text{CPI}_{\text{FP}} \times \#\text{FP} + \text{CPI}_{\text{INT}} \times \#\text{INT} + \text{CPI}_{\text{L/S}} \times \#\text{L/S} + \text{CPI}_{\text{Branch}} \times \#\text{Branch}$$

$$= (3 \times 100 + 2 \times 140 + 5 \times 110 + 3 \times 55) \times 10^{6} = 1295 \times 10^{6}$$

To achieve a two times speedup, we must have:

$$\frac{\text{Execution Time}_{\text{new}}}{\text{Execution Time}_{\text{old}}} = \frac{\text{Clock Cycles}_{\text{new}}}{\text{Clock Cycles}_{\text{old}}} = \frac{1}{2}$$

To halve the number of clock cycles by improving only the CPI of INT instructions:

$$\frac{\text{Clock Cycles}_{\text{new}}}{\text{Clock Cycles}_{\text{old}}} = \frac{\text{CPI}_{\text{FP}} \times \#\text{FP} + \text{CPI}_{\text{INT}}^{\text{new}} \times \#\text{INT} + \text{CPI}_{\text{L/S}} \times \#\text{L/S} + \text{CPI}_{\text{Branch}} \times \#\text{Branch}}{\text{Clock Cycles}_{\text{old}}} = \frac{1}{2}$$

$$\text{CPI}_{\text{INT}}^{\text{new}} = \frac{\frac{\text{Clock Cycles}_{\text{old}}}{2} - \left(\text{CPI}_{\text{FP}} \times \#\text{FP} + \text{CPI}_{\text{L/S}} \times \#\text{L/S} + \text{CPI}_{\text{Branch}} \times \#\text{Branch}\right)}{\#\text{INT}} = \frac{647.5 - 1015}{140} < 0$$

The required CPI is negative. Therefore, it is impossible to make the program run two times faster by improving only the CPI of INT instructions.

Grading policy:
- Correct clock cycles is worth 2 points.
- Correct CPI is worth 2 points.
- Correct answer is worth 1 point.

(b) (5 points) By how much is the execution time of the program improved if the CPI of FP instructions is reduced by 28%, the CPI of INT instructions is reduced by 32%, the CPI of L/S instructions is reduced by 61%, and the CPI of branch instructions is reduced by 64%? Please show the calculation procedure.

$$\frac{\text{Execution Time}_{\text{new}}}{\text{Execution Time}_{\text{old}}} = \frac{\text{Clock Cycles}_{\text{new}}}{\text{Clock Cycles}_{\text{old}}} = \frac{3 \times 100 \times 0.72 + 2 \times 140 \times 0.68 + 5 \times 110 \times 0.39 + 3 \times 55 \times 0.36}{3 \times 100 + 2 \times 140 + 5 \times 110 + 3 \times 55} = \frac{680.3}{1295} \approx 0.53$$

The execution time of the program is reduced by about 47% (a fraction of 0.47 of the original time).

Grading policy:
- Correct formula is worth 2 points.
- Correct calculation is worth 1 point.
- Correct answer is worth 2 points.

4. (10 points) Amdahl’s Law and the Eight Great Ideas of Computer Architecture
One of the great ideas of computer architecture is parallelization. Amdahl’s Law can be used to calculate the overall speedup of parallel executions. Amdahl’s Law is defined as follows:

$$S_{\text{latency}} = \frac{1}{(1 - p) + \frac{p}{s}}$$

where
- $S_{\text{latency}}$: the theoretical speedup of the execution of the whole task,
- $s$: the speedup of the part of the task that benefits from improved system resources,
- $p$: the proportion of the execution time that the part benefiting from improved resources originally occupied.

The ideal speedup of a parallelized program is the number of processors used. However, the theoretical speedup is limited by the fraction of the application that cannot be parallelized, which includes the communication costs. The problem is that the communication costs are not fixed but often vary with the number of processors used. In the following, let us consider the communication costs separately from the non-parallelizable execution of the program.

(a) (5 points) Suppose we have a method to parallelize the fast_power_iter() function in Question 1 using an arbitrary number of processors. Moreover, the execution time $T$ on one processor is the result obtained in Question 1(b). Compute the parallel execution time of fast_power_iter() on 2, 4, and 8 processors, assuming 75% of the function is parallelizable and there is no communication cost.

| Number of Processors $s$ | Parallel Execution Time |
|---|---|
| 2 | $\left((1 - 0.75) + \frac{0.75}{2}\right) \times 14 \times 10^{-9} = 8.75\ \text{ns}$ |
| 4 | $\left((1 - 0.75) + \frac{0.75}{4}\right) \times 14 \times 10^{-9} = 6.125\ \text{ns}$ |
| 8 | $\left((1 - 0.75) + \frac{0.75}{8}\right) \times 14 \times 10^{-9} = 4.8125\ \text{ns}$ |

$$S_{\text{latency}} = \frac{\text{Execution Time}_{\text{uniprocessor}}}{\text{Execution Time}_{\text{parallel}}(s)} = \frac{1}{(1 - p) + \frac{p}{s}}$$

$$\text{Execution Time}_{\text{parallel}} = \left((1 - p) + \frac{p}{s}\right) \times \text{Execution Time}_{\text{uniprocessor}}$$

Grading policy:
- Each Parallel Execution Time value is worth 1.5 points.
- The partially correct derivation of the formula is worth 0.5 points.
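The three table entries in (a) can be reproduced with a small Python sketch of Amdahl's Law. The function names `parallel_time` and `speedup` are illustrative, and the `overhead` parameter is an assumption added here (it defaults to zero and is not part of the formula above); it only models an extra cost expressed as a fraction of the uniprocessor time.

```python
def parallel_time(t_serial, p, s, overhead=0.0):
    """Parallel execution time under Amdahl's Law.

    t_serial: execution time on one processor
    p:        fraction of the execution time that is parallelizable
    s:        number of processors (speedup of the parallel part)
    overhead: hypothetical extra cost as a fraction of t_serial
              (an assumption of this sketch; 0 reproduces part (a))
    """
    return t_serial * ((1.0 - p) + overhead + p / s)


def speedup(p, s, overhead=0.0):
    """Overall speedup: uniprocessor time divided by parallel time."""
    return 1.0 / ((1.0 - p) + overhead + p / s)


if __name__ == "__main__":
    T_SERIAL_NS = 14.0  # fast_power_iter() time from Question 1(b)
    for n in (2, 4, 8):
        # Prints 8.75, 6.125 and 4.8125 ns for 2, 4 and 8 processors.
        print(n, parallel_time(T_SERIAL_NS, 0.75, n))
```

Note that even with 8 processors the time only drops from 14 ns to 4.8125 ns, because the 25% serial part (3.5 ns) is a floor that no number of processors can remove.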
(b) (5 points) Assuming the communication costs are 4% of the original execution time regardless of the number of cores, what is the speedup with 8 cores when 75% of the program is parallelizable?

Since the communication costs are now considered, the communication cost has to be added to the denominator of Amdahl’s Law. The communication cost is 0.04.

$$S_{\text{latency}} = \frac{1}{(1 - 0.75) + 0.04 + \frac{0.75}{8}} \approx 2.606$$

The speedup with 8 cores when 75% of the program is parallelizable with a 4% communication cost is 2.606.

Grading policy:
- Correct answer: +5
- Minor error: -1
- Wrong answer, but only the execution time is calculated (and the calculation is correct): +1
- Wrong answer, but the formula is partially correct: +1

(c) (5 points) Assuming the communication costs increase by 2% of the original execution time every time the number of processors is doubled, what is the speedup with n cores when 75% of the program is parallelizable? Furthermore, what is the specific speedup value when n = 8 in this scenario?

The speedup with n cores when 75% of the program is parallelizable with a 2% communication cost for every doubling of the cores is given by

$$S_{\text{latency}} = \frac{1}{(1 - 0.75) + 0.02 \times \log_2(n) + \frac{0.75}{n}}$$

Moreover,

$$S_{\text{latency}} = \frac{1}{(1 - 0.75) + 0.02 \times \log_2(8) + \frac{0.75}{8}} \approx 2.477$$

The speedup with 8 cores when 75% of the program is parallelizable with a 2% communication cost for every doubling of the cores is 2.477.

Grading policy:
- Correct formula for speedup and correct calculation when n = 8: +5
- Minor error: -1
- Wrong formula, but the formula and calculation for execution time are correct: +1

5. (10 points) Integrated Circuit Cost and Manufacturing
Assume that a 50 mm diameter wafer has a cost of $9 and contains 95 dies. The yield for this wafer is 90%.

(a) (4 points) Find the defects per area for this wafer using the equation on Page 28 of the textbook (or Page 45 of the slides of Computer Abstractions and Technology).

$$\text{Die Area} = \frac{\text{Wafer Area}}{\text{Dies per Wafer}} = \frac{\pi \times 25^{2}}{95} = 20.67\ \text{mm}^2$$

$$\text{Defects per Area} = 2 \times \frac{\frac{1}{\sqrt{\text{Yield}}} - 1}{\text{Die Area}} = \frac{\frac{1}{\sqrt{0.9}} - 1}{10.335} = 0.005\ \text{defects/mm}^2$$

The defects per area for this wafer is 0.005 defects/mm².

Grading policy:
- Each correct formula is worth 1 point.
- Each correct answer is worth 1 point.
- No unit is -0.5 point.

(b) (2 points) Find the cost per die for this wafer.

$$\text{Cost per Die} = \frac{\text{Cost per Wafer}}{\text{Dies per Wafer} \times \text{Yield}} = \frac{\$9}{95 \times 0.9} = \$0.1053$$

The cost per die for this wafer is $0.1053.

Grading policy:
- Correct formula is worth 1 point.
- Correct answer is worth 1 point.

(c) (4 points) If the number of dies per wafer is increased by 10% and the defects per area unit increases by 25%, find the new die area and new yield.

$$\text{Die Area} = \frac{\text{Wafer Area}}{\text{Dies per Wafer}} = \frac{\pi \times 25^{2}}{95 \times 1.1} = 18.79\ \text{mm}^2$$

$$\text{Yield} = \frac{1}{\left(1 + \text{Defects per Area} \times \frac{\text{Die Area}}{2}\right)^{2}} = \frac{1}{\left(1 + 0.005 \times 1.25 \times \frac{18.79}{2}\right)^{2}} = 0.8922$$

The new die area is 18.79 mm² and the new yield is 0.8922.

Grading policy:
- Each correct formula is worth 1 point.
- Each correct answer is worth 1 point.
- No unit is -0.5 point.
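The die area, defect density, cost, and yield in Question 5 can be re-derived with the short Python sketch below. The function names are illustrative, and the yield model is the one used in the solution, Yield = 1 / (1 + Defects per Area × Die Area / 2)². As in the worked solution, part (c) starts from the rounded value of 0.005 defects/mm²; using the unrounded defect density would give a slightly different yield.

```python
import math


def die_area(wafer_diameter_mm, dies_per_wafer):
    """Die area in mm^2: wafer area divided by the number of dies."""
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    return wafer_area / dies_per_wafer


def defects_per_area(yield_fraction, area_mm2):
    """Invert the yield model to get defects per mm^2."""
    return 2 * (1 / math.sqrt(yield_fraction) - 1) / area_mm2


def die_yield(defects_per_mm2, area_mm2):
    """Yield = 1 / (1 + defects per area * die area / 2)^2."""
    return 1 / (1 + defects_per_mm2 * area_mm2 / 2) ** 2


def cost_per_die(wafer_cost, dies_per_wafer, yield_fraction):
    """Wafer cost spread over the dies that actually work."""
    return wafer_cost / (dies_per_wafer * yield_fraction)


if __name__ == "__main__":
    area = die_area(50, 95)                      # part (a)
    print(f"die area      = {area:.2f} mm^2")
    print(f"defects/area  = {defects_per_area(0.9, area):.3f} /mm^2")
    print(f"cost per die  = ${cost_per_die(9, 95, 0.9):.4f}")  # part (b)

    new_area = die_area(50, 95 * 1.1)            # part (c)
    new_defects = 0.005 * 1.25                   # rounded value, as in the solution
    print(f"new die area  = {new_area:.2f} mm^2")
    print(f"new yield     = {die_yield(new_defects, new_area):.4f}")
```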