COMP2006 Computer Organization
Assignment I
Sample Solutions
1. “Throughput is commonly used to measure the performance of a system. Consider a
computer that has a hard disk drive (HDD) and an SSD drive. The throughput of HDD is
about 100MB/s, and the throughput of the SSD drive is about 500MB/s. If you want to
copy a file of size 4GB from HDD to SSD, please estimate how long it takes, and justify
your answer. Assume 1GB = 1000MB.”
(10 marks)
Ans: 40 seconds.
When we copy a file from the slower HDD to the faster SSD, it involves two types of operations:
(1) read operations, i.e., copying data from HDD to main memory; and (2) write operations, i.e.,
copying data from main memory to SSD. Since the main memory is much faster than both HDD
and SSD, the throughput of the read operations (from HDD to main memory) is limited by the HDD, i.e.,
about 100MB/s, while the throughput of the write operations (from main memory to SSD) is limited
by the SSD, i.e., about 500MB/s. In total, the read operations take 4x1000/100 = 40 seconds, while
the write operations take 4x1000/500 = 8 seconds. In practice, the read and write operations
can be overlapped (i.e., after a small segment of data is read from HDD, it can be written
to SSD while the next segment is being read). Therefore, we only need to count the read time of
40 seconds, while the 8 seconds of writing time can be largely hidden. The figure below shows
the detailed process:
Marking scheme:


If your answer is around 40 seconds and your justification is reasonable, you get 10
marks.
If your answer is around 48 seconds (i.e., you add the 40 seconds of read and 8 seconds
of write), you get 8 marks.
This study source was downloaded by 100000851797096 from CourseHero.com on 09-25-2022 21:24:24 GMT -05:00
https://www.coursehero.com/file/83631654/Assignment-I-Sol-2020-21pdf/

For other unreasonable answers, you get 0 marks.
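The difference between the 40-second and 48-second answers to Question 1 can be checked with a small simulation. This is only a sketch under stated assumptions: the function name `copy_time`, the 100MB chunk size, and the simple two-stage pipeline (one chunk being read while the previous one is written) are illustrative choices and not part of the original question.

```python
def copy_time(size_mb, read_mbps, write_mbps, chunk_mb=100, overlap=True):
    """Estimate the time (seconds) to copy a file chunk by chunk.

    Assumption: size_mb is a multiple of chunk_mb.
    overlap=True  -> a chunk is written while the next chunk is being read.
    overlap=False -> every read is followed by its write before the next read.
    """
    n_chunks = size_mb // chunk_mb
    read_t = chunk_mb / read_mbps    # time to read one chunk from the HDD
    write_t = chunk_mb / write_mbps  # time to write one chunk to the SSD

    if not overlap:
        return n_chunks * (read_t + write_t)

    read_done = 0.0   # time at which the current chunk has been read
    write_done = 0.0  # time at which the current chunk has been written
    for _ in range(n_chunks):
        read_done += read_t
        # A write can start only after its chunk is read and the
        # previous write has finished.
        write_done = max(write_done, read_done) + write_t
    return write_done


overlapped = copy_time(4000, 100, 500)                   # about 40.2 seconds
sequential = copy_time(4000, 100, 500, overlap=False)    # about 48 seconds
```

With overlapping, only the write of the last chunk is not hidden, so the total stays close to the 40 seconds of read time; a smaller chunk size brings it even closer to 40.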
2. Consider a CPU that can perform 1 million additions per second. You have two data arrays
A and B in main memory, each with 1 million numbers. The CPU can load 1000 numbers
from main memory to its registers in each second. The CPU can write 1000 numbers to
main memory in each second. But the memory read and memory write operations cannot
happen at the same time. Now we want to calculate the summation of the two arrays,
i.e., Ci = Ai + Bi, where 0 ≤ i ≤ 999,999. C is the data array that stores the 1 million results.
Please estimate how long it takes to have all results in array C.
Hint: to perform Ci = Ai + Bi, it takes several steps: (1) load Ai to register; (2) load Bi to
register; (3) perform the addition by the CPU; (4) store the result to Ci.
(10 marks)
Ans: 3001 (or 3000) seconds.
[The question tests your understanding of the CPU + memory structure.]
Since we have three large arrays, A, B, C, they can only be stored in main memory. To perform
A+B, we need to load data from main memory (array A and array B) to CPU registers piece by
piece, then perform the addition, then write the result to main memory (array C). The data
movement between CPU and main memory is very slow: 1000 numbers per second, as
compared to the CPU speed which is 1,000,000 additions per second. Given array size of
1,000,000, we need to move 2,000,000 numbers (i.e., A and B) from main memory to CPU
registers, which takes about 2,000,000/1000 = 2,000 seconds. But the addition operations only
take 1,000,000/1,000,000 = 1 second. It also takes 1,000,000/1000 = 1,000 seconds to move
data from CPU registers to array C. The problem assumes that memory read and memory write
cannot happen at the same time, so the total memory access time is 2,000+1,000 = 3,000
seconds. The CPU calculation time is only 1 second.
The final answer will be 3,001 seconds. If you assume that the CPU calculations can overlap
with data movement operations, then the answer will be 3,000 seconds. Both answers are
accepted.
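The arithmetic for Question 2 can be written as a short calculation. This is a minimal sketch; the function name `array_sum_time` and the `overlap_compute` flag are hypothetical names introduced here to cover both accepted answers.

```python
def array_sum_time(n, load_rate, store_rate, add_rate, overlap_compute=False):
    """Estimate the time (seconds) to compute C[i] = A[i] + B[i] for n elements.

    load_rate, store_rate: numbers moved per second between memory and registers.
    add_rate: additions per second performed by the CPU.
    """
    load_time = 2 * n / load_rate   # load both A[i] and B[i]
    store_time = n / store_rate     # store C[i]
    add_time = n / add_rate         # perform the additions
    # Memory reads and writes cannot happen at the same time, so they add up.
    memory_time = load_time + store_time
    if overlap_compute:
        # Assumption: additions are hidden behind the (much slower) data movement.
        return max(memory_time, add_time)
    return memory_time + add_time


no_overlap = array_sum_time(1_000_000, 1000, 1000, 1_000_000)          # 3001 seconds
with_overlap = array_sum_time(1_000_000, 1000, 1000, 1_000_000, True)  # 3000 seconds
```

In both cases the data movement dominates: the CPU itself is busy for only 1 second out of about 3,000.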
3. “Apple has recently announced its new generation of processor A14 Bionic, to be used in
an iPad Air. (1) Please describe the major components of the A14 processor and their
functionalities. (2) Compare the A14 processor with the A13 processor released in 2019.
What are the improvements? (3) Numerical Wind Tunnel was the top-1 supercomputer in
the world during 1993-1995, with a cost over $100 million. Do you think the iPad Air
equipped with the A14 processor will be faster than it or not? Justify your answer.”
Hint: You need to search the Internet to answer this question.
(10 marks)
Ans:
[This is an open-ended question without a standard answer. The purpose is to let you learn
new concepts (such as GPUs and neural engines) from daily life applications through
Internet search. Below is my sample answer for reference.]
(1) 3 marks The A14 Bionic processor is a 64-bit ARM-based system on a chip (SoC), designed
by Apple Inc. It includes the following major components:
| Components | Functionalities |
| --- | --- |
| 6-core CPU (2 high-performance cores and 4 power-efficient cores) | To support the iOS operating system and general mobile applications |
| 4-core GPU | GPU means “graphics processing unit”. It is used to accelerate the creation and processing of images, graphics, and videos to be shown on the screen. |
| Neural Engine | To support AI applications such as Face ID |
(2) 3 marks
| | A13 | A14 |
| --- | --- | --- |
| Manufacturing | 7nm | 5nm |
| Number of transistors | 8.5 billion | 11.8 billion |
| CPU | 2+4 cores, up to 20% faster than A12’s CPU | 2+4 cores, up to 40% faster than A12’s CPU |
| GPU | 4 cores, up to 20% faster than A12’s GPU | 4 cores, up to 30% faster than A12’s GPU |
| Neural Engine | 8 cores, up to 6 trillion operations per second | 16 cores, up to 11 trillion operations per second |
(3) 4 marks
The performance of a supercomputer is usually measured by two metrics: the theoretical peak
performance and the measured performance when running a specific software (called
LINPACK). The theoretical peak performance assumes that the CPU is 100% busy, so it is usually
higher than the measured performance. The commonly used unit is GFlop/s, which means 1
billion floating-point operations per second. According to
https://en.wikipedia.org/wiki/Numerical_Wind_Tunnel, the peak performance of Numerical Wind
Tunnel is 235.8 GFlop/s, and the measured performance is 124.0 GFlop/s.
We cannot find the GFlop/s performance of A14, since it is so new. But we can find the GFlop/s performance
of A12 and A13. E.g., according to https://gadgetversus.com/processor/intel-core-i5-9500-vs-applea13-bionic/, the performance of the A13 CPU is 302.1 GFlop/s, and the A13 GPU can also achieve more than
400 GFlop/s. So what we can say is that A13 is at least comparable to Numerical Wind Tunnel.
Considering that A14 further improves the performance over A13, in my opinion the iPad Air
equipped with the A14 processor is faster than the supercomputer Numerical Wind Tunnel.
Remark: your conclusion could be different from mine. As long as you found some reasonable data from
the Internet and your judgment is based on your data, you will receive the 4 marks.
4. “Explain the main differences between high-level programming languages and machine
language, and how we can transfer a high-level language into machine language.”
(10 marks)
Ans:
High-level programming languages are closer to natural language. They allow the programmers
to focus more on the problem domain. They can provide higher productivity and portability. (2
marks)
Machine language is a binary representation of machine instructions, i.e., the commands that
computer hardware understands and obeys. (2 marks)
In order to transfer a high-level language into machine language, a compiler software first
translates the high-level language into the assembly language, which is a symbolic
representation of machine instructions. Then the assembler software translates the assembly
language into machine language. In practice, these two steps may be combined as a single step.
(6 marks)
5. “Assume there are two numbers x and y stored in a disk drive as a txt file. In order to
calculate x + y, which components of the computer will be involved? Please describe the
whole process in as much detail as you can.”
Hint: Which computer components are not mentioned in Question 2?
(10 marks)
Ans:
The data of x and y will first be loaded from the disk drive to main memory. (2 marks)
Then the CPU will fetch the values of x and y from main memory to two registers. In this step,
modern CPUs will also introduce cache memory between main memory and registers. That is
to say, the data will be copied from main memory to cache memory (if they were not in cache
yet), and then be copied from cache memory to registers upon the request of the CPU. (4 marks)
And then the CPU can calculate the value of x + y using ALU and registers. (4 marks)
6. “Consider a processor X that can perform one multiplication operation in 1µs (i.e., 10^-6
sec). What is the response time and throughput of multiplication operation on X,
respectively? Now we construct a multiprocessor Y that consists of 4 processors of X.
What is the response time and throughput of multiplication operation on Y, respectively?”
(10 marks)
Ans:
Response time of X: 1 µs or 0.000001 second (2.5 marks)
Throughput of X: 1/0.000001 = 1,000,000 multiplications per second (2.5 marks)
Response time of Y: 1 µs (because using 4 devices cannot help in reducing the time of a single
multiplication operation) (2.5 marks)
Throughput of Y: 4/0.000001 = 4,000,000 multiplications per second (2.5 marks)
Remark: the unit of throughput should be “multiplications per second”. If the unit is wrong,
one mark will be deducted.
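The distinction between response time and throughput in Question 6 can be expressed in a few lines. This is a sketch only; the function name `response_time_and_throughput` is hypothetical, and it assumes that the processors work on independent multiplications and that a single multiplication cannot be split across processors.

```python
def response_time_and_throughput(op_time_s, n_processors=1):
    """Return (response time in seconds, throughput in multiplications per second)."""
    # A single multiplication still runs on one processor, so adding
    # processors does not shorten it.
    response_time = op_time_s
    # Each processor completes 1 / op_time_s multiplications per second.
    throughput = n_processors / op_time_s
    return response_time, throughput


x_response, x_throughput = response_time_and_throughput(1e-6)      # processor X
y_response, y_throughput = response_time_and_throughput(1e-6, 4)   # multiprocessor Y
```

For Y, the response time stays at 1 µs while the throughput is four times that of X.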
7. “A dual-core processor A works at 2.4GHz, and each core has a throughput of 4
multiplications per CPU cycle. A quad-core processor B works at 3.2GHz, and each core
has a throughput of 8 multiplications per CPU cycle. Please compare the performance of
A and B in terms of multiplication throughput.”
(10 marks)
Ans:
A works at 2.4GHz, and in each cycle each of its two cores can finish 4 multiplications, so A can
perform 2.4x10^9 x 4 x 2 = 1.92x10^10 multiplications per second. (4 marks)
B works at 3.2GHz, and in each cycle each of its four cores can finish 8 multiplications, so B can
perform 3.2x10^9 x 8 x 4 = 1.024x10^11 multiplications per second. (4 marks)
The performance of A is 18.75% of that of B in terms of multiplication throughput. Or, we can say the
performance of B is 5.33 times that of A. (2 marks)
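The comparison in Question 7 follows from one formula: throughput = clock frequency x number of cores x multiplications per cycle per core. A minimal sketch, where the function name `mult_throughput` is a hypothetical helper introduced for illustration:

```python
def mult_throughput(freq_hz, cores, mults_per_cycle):
    """Multiplications per second for a multi-core processor."""
    return freq_hz * cores * mults_per_cycle


a = mult_throughput(2.4e9, cores=2, mults_per_cycle=4)  # processor A
b = mult_throughput(3.2e9, cores=4, mults_per_cycle=8)  # processor B

a_over_b = a / b  # fraction of B's throughput achieved by A (about 0.1875)
b_over_a = b / a  # how many times faster B is than A (about 5.33)
```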
8. “Assume a CPU supports three types of instructions A, B, C. The CPI (Cycles Per Instruction)
of A, B, C are 1, 2, 10, respectively. Given a program written in a high-level programming
language, we need to choose a compiler to generate the machine code.”
(1) “Compiler 1 generates 10,000 type-A instructions, 20,000 type-B instructions, and
2,000 type-C instructions. What is the CPI of the machine code generated by Compiler 1?”
(2) “Compiler 2 generates 8,000 type-A instructions, 15,000 type-B instructions, and 4,000
type-C instructions. What is the CPI of the machine code generated by Compiler 2?”
(10 marks)
Ans:
(1) The CPI is (10000x1 + 20000x2 + 2000x10) / (10000 + 20000 + 2000) = 2.1875 (or 2.19)
(5 marks)
(2) The CPI is (8000x1 + 15000x2 + 4000x10) / (8000 + 15000 + 4000) = 2.8889 (or 2.89)
(5 marks)
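The CPI of a piece of machine code is the average of the per-type CPIs weighted by instruction counts. The calculation for Question 8 can be sketched as follows; the names `CPI_BY_TYPE` and `average_cpi` are hypothetical and used only for illustration.

```python
# Cycles per instruction for each instruction type, as given in Question 8.
CPI_BY_TYPE = {"A": 1, "B": 2, "C": 10}


def average_cpi(counts, cpi_by_type=CPI_BY_TYPE):
    """Average CPI = total cycles / total instructions."""
    total_cycles = sum(n * cpi_by_type[t] for t, n in counts.items())
    total_instructions = sum(counts.values())
    return total_cycles / total_instructions


cpi_1 = average_cpi({"A": 10_000, "B": 20_000, "C": 2_000})  # Compiler 1
cpi_2 = average_cpi({"A": 8_000, "B": 15_000, "C": 4_000})   # Compiler 2
```

Compiler 2 generates fewer instructions (27,000 versus 32,000) but needs more cycles in total (78,000 versus 70,000), because it uses twice as many of the expensive type-C instructions.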
9. “Consider the same CPU in Question 8 with working frequency of 2.5 GHz. A program’s
machine code includes 1 billion type-A instructions, 0.5 billion type-B instructions, and 0.2
billion type-C instructions. What is the CPU time of this program? Assume there is no
parallelism.”
(10 marks)
Ans:
CPU cycles (in billion) = 1 x 1 + 0.5 x 2 + 0.2 x 10 = 4 (billion) (5 marks)
CPU time = 4 billion / 2.5GHz = 1.6 seconds (5 marks)
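The two steps of Question 9 (count the cycles, then divide by the clock rate) can be sketched as below. The function name `cpu_time` and the dictionary layout are assumptions made for this illustration.

```python
def cpu_time(counts, cpi_by_type, clock_hz):
    """CPU time in seconds = total CPU cycles / clock frequency (no parallelism)."""
    total_cycles = sum(n * cpi_by_type[t] for t, n in counts.items())
    return total_cycles / clock_hz


cpi_by_type = {"A": 1, "B": 2, "C": 10}              # same CPU as in Question 8
instruction_counts = {"A": 1e9, "B": 0.5e9, "C": 0.2e9}

seconds = cpu_time(instruction_counts, cpi_by_type, clock_hz=2.5e9)  # about 1.6
```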
10. “Consider a program with three sequential procedures: X, Y, and Z, where Y can only be
started after X is finished, and Z can only be started after Y is finished. Procedure X takes
2 seconds on a single CPU core and cannot be parallelized. Procedure Y takes 20 seconds
on a single CPU core but can be perfectly parallelized on multiple CPU cores, i.e., its
instructions can be equally distributed to multiple CPU cores without extra overhead.
Procedure Z takes 16 seconds on a single CPU core and can be parallelized on multiple CPU
cores with 80% utilization. Please estimate the best CPU time of this program on a
quad-core CPU.”
(10 marks)
Ans:
Procedure X takes 2 seconds. (2 marks)
Procedure Y takes 20/4 = 5 seconds. (3 marks)
If procedure Z could be parallelized perfectly, it would take 16/4 = 4 seconds. However, due to
the extra overhead, the CPU utilization is only 80%. Hence procedure Z takes 4 / 80% = 5
seconds. (3 marks)
Since X, Y and Z have to execute sequentially, the best CPU time of this program on a quad-core
CPU is 2 + 5 + 5 = 12 seconds. (2 marks)
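The estimate for Question 10 can be reproduced with a small helper. This is a sketch under the same model used in the answer above, namely that a parallelized procedure takes (single-core time) / (number of cores x utilization); the function name `parallel_time` is hypothetical.

```python
def parallel_time(single_core_s, cores, utilization=1.0):
    """Time (seconds) of a parallelized procedure on `cores` cores.

    utilization=1.0 means perfect parallelization without extra overhead.
    """
    return single_core_s / (cores * utilization)


CORES = 4
time_x = 2.0                                         # cannot be parallelized
time_y = parallel_time(20, CORES)                    # perfectly parallel
time_z = parallel_time(16, CORES, utilization=0.8)   # 80% utilization

# X, Y and Z run one after another, so their times add up.
total = time_x + time_y + time_z                     # about 12 seconds
```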