COMP2006 Computer Organization
Assignment I Sample Solutions

1. “Throughput is commonly used to measure the performance of a system. Consider a computer that has a hard disk drive (HDD) and an SSD drive. The throughput of the HDD is about 100MB/s, and the throughput of the SSD drive is about 500MB/s. If you want to copy a file of size 4GB from the HDD to the SSD, please estimate how long it takes, and justify your answer. Assume 1GB = 1000MB.” (10 marks)

Ans: 40 seconds.

Copying a file from the slower HDD to the faster SSD involves two types of operations: (1) read operations, i.e., copying data from the HDD to main memory; and (2) write operations, i.e., copying data from main memory to the SSD. Since main memory is much faster than both the HDD and the SSD, the throughput of the read operations (from HDD to main memory) is limited by the HDD, i.e., about 100MB/s, while the throughput of the write operations (from main memory to SSD) is limited by the SSD, i.e., about 500MB/s. In total, the read operations take 4 x 1000 / 100 = 40 seconds, while the write operations take 4 x 1000 / 500 = 8 seconds.

In practice, the read and write operations can be overlapped (i.e., after a small segment of data has been read from the HDD, it can be written to the SSD while the next segment is being read). Therefore, we only need to count the read time of 40 seconds, while the 8 seconds of write time can be largely hidden. The figure below shows the detailed process:

[Figure: timeline of overlapped read and write operations]

Marking scheme: If your answer is around 40 seconds and your justification is reasonable, you get 10 marks. If your answer is around 48 seconds (i.e., you add the 40 seconds of read and the 8 seconds of write), you get 8 marks. For other unreasonable answers, you get 0 marks.

This study source was downloaded by 100000851797096 from CourseHero.com on 09-25-2022 21:24:24 GMT -05:00 https://www.coursehero.com/file/83631654/Assignment-I-Sol-2020-21pdf/

2. Consider a CPU that can perform 1 million additions per second. You have two data arrays A and B in main memory, each with 1 million numbers.
The CPU can load 1000 numbers from main memory to its registers in each second. The CPU can write 1000 numbers to main memory in each second. But the memory read and memory write operations cannot happen at the same time. Now we want to calculate the sum of the two arrays, i.e., C_i = A_i + B_i, where 0 ≤ i ≤ 999,999. C is the data array that stores the 1 million results. Please estimate how long it takes to have all results in array C. Hint: to perform C_i = A_i + B_i, it takes several steps: (1) load A_i to a register; (2) load B_i to a register; (3) perform the addition by the CPU; (4) store the result to C_i. (10 marks)

Ans: 3001 (or 3000) seconds.

[The question tests your understanding of the CPU + memory structure.]

Since we have three large arrays, A, B, and C, they can only be stored in main memory. To perform A + B, we need to load data from main memory (array A and array B) to the CPU registers piece by piece, then perform the addition, then write the result to main memory (array C). The data movement between the CPU and main memory is very slow: 1000 numbers per second, as compared to the CPU speed of 1,000,000 additions per second.

Given an array size of 1,000,000, we need to move 2,000,000 numbers (i.e., A and B) from main memory to the CPU registers, which takes about 2,000,000/1000 = 2,000 seconds. It also takes 1,000,000/1000 = 1,000 seconds to move the results from the CPU registers to array C. The problem assumes that memory read and memory write cannot happen at the same time, so the total memory access time is 2,000 + 1,000 = 3,000 seconds. The addition operations take only 1,000,000/1,000,000 = 1 second of CPU time. The final answer is therefore 3,001 seconds. If you assume that the CPU calculations can overlap with the data movement operations, then the answer is 3,000 seconds. Both answers are accepted.

3. “Apple has recently announced its new generation of processor A14 Bionic, to be used in an iPad Air.
(1) Please describe the major components of the A14 processor and their functionalities. (2) Compare the A14 processor with the A13 processor released in 2019. What are the improvements? (3) Numerical Wind Tunnel was the top-1 supercomputer in the world during 1993-1995, with a cost over $100 million. Do you think the iPad Air equipped with the A14 processor will be faster than it or not? Justify your answer.” Hint: You need to search the Internet to answer this question. (10 marks)

Ans:

[This is an open-ended question without a standard answer. The purpose is to let you learn new concepts (such as GPUs and neural engines) from daily-life applications through Internet search. Below is my sample answer for reference.]

(1) 3 marks

The A14 Bionic processor is a 64-bit ARM-based system on a chip (SoC), designed by Apple Inc. It includes the following major components:

COMPONENTS | FUNCTIONALITIES
6-core CPU (2 high-performance cores and 4 power-efficient cores) | To support the iOS operating system and general mobile applications
4-core GPU | GPU means “graphics processing unit”. It is used to accelerate the creation and processing of images, graphics, and videos to be shown on the screen.
Neural Engine | To support AI applications such as Face ID

(2) 3 marks

Feature | A13 | A14
Manufacturing | 7nm | 5nm
Number of transistors | 8.5 billion | 11.8 billion
CPU | 2+4 cores, up to 20% faster than A12’s CPU | 2+4 cores, up to 40% faster than A12’s CPU
GPU | 4 cores, up to 20% faster than A12’s GPU | 4 cores, up to 30% faster than A12’s GPU
Neural Engine | 8 cores, up to 6 trillion operations per second | 16 cores, up to 11 trillion operations per second

(3) 4 marks

The performance of a supercomputer is usually measured by two metrics: the theoretical peak performance and the measured performance when running a specific benchmark software (called LINPACK).
The theoretical peak performance assumes that the CPU is 100% busy, so it is usually higher than the measured performance. The commonly used unit is GFlop/s, which means 1 billion floating-point operations per second.

According to https://en.wikipedia.org/wiki/Numerical_Wind_Tunnel, the peak performance of Numerical Wind Tunnel is 235.8 GFlop/s, and the measured performance is 124.0 GFlop/s.

We cannot find the GFlop/s performance of the A14, since it is so new. But we can find the GFlop/s performance of the A12 and A13. E.g., according to https://gadgetversus.com/processor/intel-core-i5-9500-vs-applea13-bionic/, the performance of the A13 CPU is 302.1 GFlop/s, and the A13 GPU can also achieve more than 400 GFlop/s. So what we can say is that the A13 is at least comparable to Numerical Wind Tunnel. Considering that the A14 further improves on the performance of the A13, in my opinion the iPad Air equipped with the A14 processor is faster than the supercomputer Numerical Wind Tunnel.

Remark: your conclusion could be different from mine. As long as you found some reasonable data on the Internet and your judgment is based on your data, you will receive the 4 marks.

4. “Explain the main differences between high-level programming languages and machine language, and how we can translate a high-level language into machine language.” (10 marks)

Ans:

High-level programming languages are closer to natural language. They allow programmers to focus more on the problem domain. They can provide higher productivity and portability. (2 marks)

Machine language is a binary representation of machine instructions, i.e., the commands that computer hardware understands and obeys.
(2 marks)

In order to translate a high-level language into machine language, a compiler first translates the high-level language program into assembly language, which is a symbolic representation of machine instructions. Then an assembler translates the assembly language into machine language. In practice, these two steps may be combined into a single step. (6 marks)

5. “Assume there are two numbers x and y stored on a disk drive as a txt file. In order to calculate x + y, which components of the computer will be involved? Please describe the whole process in as much detail as you can.” Hint: Which computer components are not mentioned in Question 2? (10 marks)

Ans:

The data of x and y will first be loaded from the disk drive to main memory. (2 marks)

Then the CPU will fetch the values of x and y from main memory into two registers. In this step, modern CPUs also use cache memory between main memory and the registers. That is to say, the data will be copied from main memory to cache memory (if it is not in the cache yet), and then copied from cache memory to the registers upon the request of the CPU. (4 marks)

Then the CPU can calculate the value of x + y using the ALU and the registers. (4 marks)

6. “Consider a processor X that can perform one multiplication operation in 1µs (i.e., 10^-6 sec). What is the response time and throughput of the multiplication operation on X, respectively? Now we construct a multiprocessor Y that consists of 4 processors of X.
What is the response time and throughput of the multiplication operation on Y, respectively?” (10 marks)

Ans:
Response time of X: 1 µs or 0.000001 second (2.5 marks)
Throughput of X: 1/0.000001 = 1,000,000 multiplications per second (2.5 marks)
Response time of Y: 1 µs (because using 4 processors does not reduce the time of a single multiplication operation) (2.5 marks)
Throughput of Y: 4/0.000001 = 4,000,000 multiplications per second (2.5 marks)

Remark: the unit of throughput should be “multiplications per second”. If the unit is wrong, one mark will be deducted.

7. “A dual-core processor A works at 2.4GHz, and each core has a throughput of 4 multiplications per CPU cycle. A quad-core processor B works at 3.2GHz, and each core has a throughput of 8 multiplications per CPU cycle. Please compare the performance of A and B in terms of multiplication throughput.” (10 marks)

Ans:
A works at 2.4GHz, and in each cycle each of its two cores can finish 4 multiplications, so A can perform 2.4 x 10^9 x 4 x 2 = 1.92 x 10^10 multiplications per second. (4 marks)
B works at 3.2GHz, and in each cycle each of its four cores can finish 8 multiplications, so B can perform 3.2 x 10^9 x 8 x 4 = 1.024 x 10^11 multiplications per second. (4 marks)
The performance of A is 18.75% of that of B in terms of multiplication throughput. Or, we can say the performance of B is 5.33 times that of A. (2 marks)

8. “Assume a CPU supports three types of instructions A, B, C. The CPI (Cycles Per Instruction) of A, B, C are 1, 2, 10, respectively. Given a program written in a high-level programming language, we need to choose a compiler to generate the machine code.”
(1) “Compiler 1 generates 10,000 type-A instructions, 20,000 type-B instructions, and 2,000 type-C instructions. What is the CPI of the machine code generated by Compiler 1?
” (2) “Compiler 2 generates 8,000 type-A instructions, 15,000 type-B instructions, and 4,000 type-C instructions. What is the CPI of the machine code generated by Compiler 2?” (10 marks)

Ans:
(1) The CPI is (10000x1 + 20000x2 + 2000x10) / (10000 + 20000 + 2000) = 2.1875 (or 2.19) (5 marks)
(2) The CPI is (8000x1 + 15000x2 + 4000x10) / (8000 + 15000 + 4000) = 2.8889 (or 2.89) (5 marks)

9. “Consider the same CPU as in Question 8 with a working frequency of 2.5 GHz. A program’s machine code includes 1 billion type-A instructions, 0.5 billion type-B instructions, and 0.2 billion type-C instructions. What is the CPU time of this program? Assume there is no parallelism.” (10 marks)

Ans:
CPU cycles (in billions) = 1 x 1 + 0.5 x 2 + 0.2 x 10 = 4 (billion) (5 marks)
CPU time = 4 billion cycles / 2.5 GHz = 1.6 seconds (5 marks)

10. “Consider a program with three sequential procedures: X, Y, and Z, where Y can only be started after X is finished, and Z can only be started after Y is finished. Procedure X takes 2 seconds on a single CPU core and cannot be parallelized. Procedure Y takes 20 seconds on a single CPU core but can be perfectly parallelized on multiple CPU cores, i.e., its instructions can be equally distributed to multiple CPU cores without extra overhead. Procedure Z takes 16 seconds on a single CPU core and can be parallelized on multiple CPU cores with 80% utilization. Please estimate the best CPU time of this program on a quad-core CPU.” (10 marks)

Ans:
Procedure X takes 2 seconds. (2 marks)
Procedure Y takes 20/4 = 5 seconds. (3 marks)
If procedure Z could be parallelized perfectly, it would take 16/4 = 4 seconds. However, due to the extra overhead, the CPU utilization is only 80%. Hence procedure Z takes 4 / 80% = 5 seconds.
(3 marks)
Since X, Y and Z have to execute sequentially, the best CPU time of this program on a quad-core CPU is 2 + 5 + 5 = 12 seconds. (2 marks)
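
The arithmetic behind the answers to Questions 1, 2, 7, 8, 9 and 10 can be re-derived with a short script. The sketch below is not part of the original solutions: it is written in Python, all function names (copy_time, array_sum_time, throughput, average_cpi, cpu_time, parallel_time) are hypothetical helpers introduced here for illustration, and the models are the simplified ones assumed in the answers (reads and writes fully overlapped in Question 1, memory reads and writes never overlapped in Question 2, utilization applied as a simple divisor in Question 10).

```python
"""Re-derive the numeric answers of Questions 1, 2, 7, 8, 9 and 10.

All function names are hypothetical helpers, not part of the assignment.
"""


def copy_time(size_mb, read_mb_s, write_mb_s, overlapped=True):
    """Q1: copy time in seconds. With overlap, the slower device dominates."""
    read_s = size_mb / read_mb_s
    write_s = size_mb / write_mb_s
    return max(read_s, write_s) if overlapped else read_s + write_s


def array_sum_time(n, adds_per_s, loads_per_s, stores_per_s, overlap_cpu=False):
    """Q2: load A and B, add, store C. Loads and stores never overlap each
    other; the additions may optionally overlap the memory traffic."""
    memory_s = 2 * n / loads_per_s + n / stores_per_s
    cpu_s = n / adds_per_s
    return max(memory_s, cpu_s) if overlap_cpu else memory_s + cpu_s


def throughput(cores, clock_hz, ops_per_cycle):
    """Q7: operations per second of a multi-core processor."""
    return cores * clock_hz * ops_per_cycle


def total_cycles(counts, cpis):
    """Total clock cycles for the given instruction counts and per-type CPIs."""
    return sum(count * cpi for count, cpi in zip(counts, cpis))


def average_cpi(counts, cpis):
    """Q8: average CPI = total cycles / total instruction count."""
    return total_cycles(counts, cpis) / sum(counts)


def cpu_time(counts, cpis, clock_hz):
    """Q9: CPU time in seconds = total cycles / clock rate."""
    return total_cycles(counts, cpis) / clock_hz


def parallel_time(single_core_s, cores=1, utilization=1.0):
    """Q10: time of one procedure on `cores` cores at the given utilization."""
    return single_core_s / (cores * utilization)


if __name__ == "__main__":
    cpis = [1, 2, 10]  # CPI of instruction types A, B, C (Question 8)
    print("Q1 seconds:", copy_time(4 * 1000, 100, 500))
    print("Q2 seconds:", array_sum_time(1_000_000, 1_000_000, 1000, 1000))
    a = throughput(2, 2_400_000_000, 4)
    b = throughput(4, 3_200_000_000, 8)
    print("Q7 A, B, A/B:", a, b, a / b)
    print("Q8 CPI (1):", average_cpi([10_000, 20_000, 2_000], cpis))
    print("Q8 CPI (2):", average_cpi([8_000, 15_000, 4_000], cpis))
    counts = [1_000_000_000, 500_000_000, 200_000_000]
    print("Q9 seconds:", cpu_time(counts, cpis, 2_500_000_000))
    total = parallel_time(2) + parallel_time(20, 4) + parallel_time(16, 4, 0.8)
    print("Q10 seconds:", total)
```

Passing overlapped=False to copy_time reproduces the 48-second answer that earns 8 marks in Question 1, and overlap_cpu=True in array_sum_time reproduces the alternative 3,000-second answer accepted in Question 2.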