Disclaimer—This paper partially fulfills a writing requirement for first year (freshman) engineering students at the University
of Pittsburgh Swanson School of Engineering. This paper is a student, not a professional, paper. This paper is based on publicly
available information and may not provide complete analyses of all relevant data. If this paper is used for any purpose other
than these authors’ partial fulfillment of a writing requirement for first year (freshman) engineering students at the University
of Pittsburgh Swanson School of Engineering, the user does so at his or her own risk.
SIMULTANEOUS MULTITHREADING AND THREAD-INTENSIVE
PROCESSING
Andrew Dant, agd25@pitt.edu, Bursic 2pm, John Redding, john.redding@pitt.edu, Mena 6pm
Abstract— Since increasing clock speeds can no longer provide sufficient improvements to processor speeds, many researchers and developers have turned to parallelism as a means to improve processing power and efficiency. Simultaneous multithreading (SMT) is a form of parallelism that allows a computer to process multiple instructions at the same time. Processors normally run through threads of instructions one at a time, sequentially. Multi-core processors can have a different thread being processed by each core, but each core is still restricted to a single thread at a time. If a thread is waiting for further input, it may be paused, preventing the program from moving further. This can often leave cycles unused, which reduces efficiency. Simultaneous multithreading allows unused cycles to be utilized by other threads. This means that multiple threads can be processed in the same time that a traditional processor would take to process only one. This becomes important in thread-intensive processing, where many threads are sent to the processor at the same time. SMT also prevents multiple threads that are sent at the same time from overfilling the processor's memory cache. This paper aims to review the effectiveness of SMT, as well as which types of processes can be most improved by it.
Key Words—Cycle, Multithreading, Parallelism, Processor, Simultaneous, Thread
PROCESSORS AND SIMULTANEOUS
MULTITHREADING
Computers have become a major part of daily life for
much of society. With usage always increasing, more and
more programs are being created, and they are expected to run
faster and faster as computers improve and become more
complex over time. This can cause problems when the
programs being created are more resource-intensive than a
computer processor can handle. Researchers overcame this
hurdle by increasing the clock speed of processors. Raising clock speeds further, however, has become impractical, as it draws ever more power and produces too much heat; this has led to efforts to improve the processor in other ways [1].
HOW DO COMPUTERS WORK
In order to understand multithreading, it’s important to
know how computers operate. The most important parts of a
computer are the motherboard, processor, primary storage,
and secondary storage. The motherboard is the piece that
connects all of the individual parts of a computer. The Central
Processing Unit (CPU), or processor, is the part of the
computer that executes instructions. Modern CPUs are
generally made up of multiple cores, which consist of
multiple execution units each. Primary storage is essentially
the working memory of the computer, and is commonly
known as Random Access Memory or RAM. Secondary
storage is the long-term storage, generally using hard drives
[2].
When a program runs, data is streamed from the
secondary storage to the primary storage. This data is then
sent to the CPU, which performs any actions required.
Information is then sent back to the RAM, and any permanent
changes are sent back to the secondary storage. All of these
steps are completed by the execution units inside the CPU [2]. This entire process, known as an instruction cycle, repeats for as long as there are instructions to be completed. The rate at which these cycles occur is governed by the clock speed of the processor. Figure
1, below, shows the loop from main memory, to the
CPU/execution units, and back to the main memory.
FIGURE 1[3]
This figure shows a basic instruction cycle
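To make the cycle concrete, the loop in Figure 1 can be mimicked in a few lines of C++. This is only an illustrative sketch, not how any real CPU is built: the opcode names, the single accumulator register, and the one-word memory are all invented for this example.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // A toy "CPU" with one register (acc) and a program counter (pc).
    enum Op { LOAD, ADD, STORE, HALT };
    struct Instr { Op op; int operand; };

    int main() {
        std::vector<Instr> program = {              // the "primary storage"
            {LOAD, 5}, {ADD, 7}, {STORE, 0}, {HALT, 0}};
        int acc = 0, memory[1] = {0};
        for (std::size_t pc = 0; pc < program.size(); ) {
            Instr i = program[pc++];                // fetch, then advance pc
            switch (i.op) {                         // decode
                case LOAD:  acc = i.operand;          break;  // execute
                case ADD:   acc += i.operand;         break;
                case STORE: memory[i.operand] = acc;  break;  // write back
                case HALT:  std::printf("memory[0] = %d\n", memory[0]);
                            return 0;
            }
        }
    }

Each pass through the loop plays the role of one instruction cycle: an instruction is fetched, decoded, executed, and its result written back, which is exactly the pattern Figure 1 describes.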
There are two major kinds of processors. One type is scalar, meaning there is one execution unit per core. This limits speed and efficiency, as at most one instruction can be completed per cycle. Because of this, if one instruction gets hung up due to a data dependency, the entire program stops until the data dependency is cleared. When this happens, the CPU is said to be performing at a sub-scalar level [4].
This sub-scalar performance led to the development of
superscalar processors. This means that there are multiple
execution units within each core. This allows the CPU to
process multiple instructions at the same time, which prevents
hung-up instructions from halting the entire workflow. Most,
if not all, modern processors are superscalar. This becomes
very important when implementing different types of
parallelism [4].
Execution Units

Computer processors have groups of components, called execution units (also known as functional units), which are responsible for performing the calculations and operations sent to the processor [5]. An individual execution unit completes at most one instruction per instruction cycle. Modern processors implement what is called a superscalar architecture, meaning that there are multiple identical execution units in each core, allowing multiple instructions to be completed at once by the same processing core. This is a form of instruction-level parallelism, which is covered in more depth later in this paper. Since each execution unit can execute up to one instruction per cycle, a processor with X execution units can be thought of as having X units available per cycle. The frequency of cycles is defined by the clock speed: every processor has a specific clock speed, which refers to the number of clock cycles that occur per second. Modern processors run billions of clock cycles per second, but this does not mean that billions of instructions are completed per second. Many operations take multiple cycles to complete, and there are often instructions within a thread of instructions which can only be completed in a certain order. This leads to a significant loss in efficiency, since individual operations can not only occupy an execution unit for several cycles, but can also prevent other operations in the thread from occurring during those cycles [5]. Traditional superscalar processors, despite being able to execute multiple instructions at once, can still only draw new instructions from a single thread at a time. If a thread has fewer new instructions available than the number of available execution units, the extra execution units are left completely idle. This shortcoming in traditional superscalar architecture can be greatly improved upon via a method called multithreading [5]. However, in order to understand multithreading, one first needs an understanding of both threads and parallelism.

WHAT IS A THREAD AND WHY DO PROGRAMS USE THEM

A thread, short for thread of execution, is the basic unit of CPU operations. A thread is the smallest sequence of instructions that is independently handled by an operating system's scheduler for execution on a CPU [6]. Many computer processes can be broken up into smaller parts which, while related to one another, can be performed independently of one another by the processor. Because threads are the most basic entities scheduled for execution, all processes are converted into threads before being sent to the CPU. On a traditional, single-threaded computer, there is a sequential order for all threads related to any specific process, and separate processes must progress one at a time. Almost all modern computers and software utilize some form of multithreading, which is a form of parallelism [6]. In computing, parallelism is a broad term for completing multiple tasks at the same time, or finding ways to overlap the execution of operations that do not need to occur sequentially. Because all threads are very small and share a similar design, there are several forms of multithreading and parallelism involving threads. Due to their complexity and importance, multithreading and parallelism are discussed in more detail later in this paper.

Each thread sent to the processor consists of a set of operations or instructions which it will complete, as well as a number of resources which may be used in the execution of the intended operations. Every thread has a thread ID, a program counter, a set of registers, and a stack [6]. The thread ID is a unique identifier assigned to each thread so that other threads can interact with it. The program counter specifies which instruction is to be executed next. In a single-threaded processor, the processor will execute one instruction, from one thread, from one process at a time, sequentially, even if multiple threads from multiple processes need to be completed. Registers are small amounts of physical memory set apart for short-term storage of things such as temporary variables, and each thread is assigned a register or registers to use while being processed. Registers are an extremely important aspect of a processor's physical architecture, as all threads need to store temporary information in registers in order to complete their intended operations. A stack is a very commonly used data structure which follows a last in, first out (LIFO) method of adding and removing items: the most recently added item on the stack must be removed first, before any others, so all items on a stack are removed in the reverse of the order in which they were added [6].

While the elements of a thread listed above are all generally thread specific, there are also resources that can be shared between threads. When an individual process is broken into multiple threads that can be executed independently of one another, that process is said to be multithreaded. All of the threads within a multithreaded process share access to certain pertinent resources, such as code sections, data, and open files [6]. The ability to break an individual process into multiple independently operating threads can be extremely
useful, and is the basis for the concept of thread-level
parallelism. One example of the power of multithreaded
programs, which are extremely common in modern
computing, is a word processor, such as Microsoft Word or
Google Docs. Such a program may have one thread for displaying graphics, another for logging user input, and another still running a spelling and grammar checker on the words that the user types [6].
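A minimal sketch of this idea in C++, using the standard std::thread facility (the three task functions are invented placeholders for the word-processor subtasks described above):

    #include <iostream>
    #include <thread>

    // Each function stands in for one independent subtask of the process.
    void render_display() { std::cout << "drawing the screen\n"; }
    void log_input()      { std::cout << "recording keystrokes\n"; }
    void check_spelling() { std::cout << "running spell check\n"; }

    int main() {
        // The operating system's scheduler is free to run these three
        // threads in any order, or in parallel on separate cores.
        std::thread t1(render_display);
        std::thread t2(log_input);
        std::thread t3(check_spelling);
        t1.join();  // wait for each thread to finish before exiting
        t2.join();
        t3.join();
    }

Because the scheduler decides when each thread runs, the three messages may print in any order; that independence is precisely what makes the process multithreaded.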
WHAT IS PARALLELISM
Parallelism is a term used very broadly when relating to
programs. Basically, parallelism is when multiple things are
happening within the same timeframe. These things do not
need to be happening within the same instant, and in many
cases are unable to happen within the same instant. In most
cases, one calculation is started, and then put on hold while
another is started. Parallelism is generally broken into two
main types: instruction-level (ILP) and thread-level (TLP) [7].
Instruction Level
Instruction-level parallelism is one way of running multiple parts of a program together. Transfer of data from the RAM to the processor is not instantaneous, which means there are times when the processor is doing nothing but waiting for input. This decreases overall efficiency.
One method of parallelism is pipelining. This method
consists of starting many instructions before finishing any. A
good way to think about this is an assembly line of envelope
stuffers. The first person starts with a piece of paper. They
fold the piece of paper, and pass it to the next person, who
puts the paper into an envelope. The next person receives this
stuffed envelope, and sticks a shipping label on it. This is then
passed to a final person who puts a stamp on it and seals it,
putting it in the ‘done’ pile.
This is how pipelining works in a program as well.
Multiple instructions line up in sequence, and begin executing
one piece at a time. All of the data is loaded, then all of the
calculations are made, then all of the data is written. This
leads to fewer wasted cycles due to hung-up instructions [8].
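The throughput gain is easy to estimate with a back-of-envelope model. Assuming a four-stage pipeline (the stage count is chosen only for illustration) and one instruction entering per cycle once the pipeline is full:

    #include <cstdio>

    int main() {
        const long stages = 4;          // e.g. fetch, decode, execute, write
        const long n = 1000;            // instructions to run
        // Unpipelined: each instruction occupies the processor for all of
        // its stages before the next instruction may start.
        long unpipelined = n * stages;
        // Ideal pipeline: after the first instruction fills the pipeline,
        // one instruction finishes every cycle.
        long pipelined = stages + (n - 1);
        std::printf("unpipelined: %ld cycles\n", unpipelined);  // 4000
        std::printf("pipelined:   %ld cycles\n", pipelined);    // 1003
    }

Real pipelines fall short of this ideal whenever an instruction stalls, which is exactly the wasted-cycle problem described above.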
Another method of parallelism is superscalar processing.
Having multiple execution units is very important, as it allows
a CPU to complete multiple instructions truly in parallel.
When using pipelining, the CPU can complete multiple
instructions during the same timeframe, but not in the same
instant. Multiple execution units allow multiple instructions to be executed in the same clock cycle [9].
One way to imagine this is two buttons, given the problem ‘4*3+12/4-3+9’. Only one operation can be completed per button per cycle, and the two operations must be independent. For example, because 4*3 and 12/4 do not interact directly on the first level, both can be completed at the same time. When both buttons are used, the problem is reduced to ‘12+3-3+9’. This is the end of the clock cycle, and the buttons reset. Next, because 12+3 and 9-3 do not directly interact at this level, they can both be calculated at the same time, reducing the problem to ‘15+6’. Only one button is required for this, giving a final answer of ‘21’. In this example, a five-operation problem was completed in three cycles. This is the benefit of superscalar processors.
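Written out as code, the dependency levels of this example become explicit. The grouping below is one valid schedule; on a real machine the hardware scheduler finds it automatically:

    #include <cstdio>

    int main() {
        // Cycle 1: 4*3 and 12/4 are independent ("one per button").
        int a = 4 * 3;    // 12
        int b = 12 / 4;   // 3
        // Cycle 2: a+b and 9-3 are independent of each other.
        int c = a + b;    // 15
        int d = 9 - 3;    // 6   (the "-3+9" tail, safely regrouped)
        // Cycle 3: only one operation remains.
        std::printf("%d\n", c + d);   // 21: five operations, three cycles
    }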
Both of these methods are very common. So common, in fact, that they are often used together: superscalar processors now use pipelining to improve efficiency even further. Instructions are lined up and sent into multiple execution units in a row, building up as they go along [7]. This creates a kind of expansion effect in the middle, where many instructions are being executed at once. This can be achieved through either software or hardware. Software changes would include coding a program to run in parallel and having the computer mark the instructions to do so. Hardware changes would include more execution units and more registers [7].

FIGURE 2 [7]
Shows the improved efficiency of combining pipelining and superscalar processing

Figure 2, above, shows the timetable a program would follow if pipelining and superscalar processing were combined.
Thread Level
Thread-level parallelism (TLP) is a concept similar to instruction-level parallelism (ILP), but on a larger scale. Whereas ILP sorts instructions within a thread, TLP switches between entire threads. If executing a different thread is determined to be more efficient, the next clock cycle is spent on that thread instead. It is important to note that two threads cannot be run within the same cycle using this method [9].
There are two major kinds of thread-level parallelism. The first kind is coarse-grained multithreading (CMT) [3]. This type of parallelism works on a thread for a set number of cycles before switching to another. At that time, all of the information used by the current thread is saved to its registers, and the other thread is immediately started. This removes the wasted cycles that were previously spent transitioning between threads when one was completed.

The method of setting the number of cycles to run varies. Some programs may continue one thread until the processor hits a cache miss, while others use a predetermined number of cycles. Below is a diagram of coarse-grained TLP.
FIGURE 3.1 [7]
Compares super-scalar (left) to CMT (right)

Figure 3.1, above, shows the workflow of a standard superscalar processor on the left side, and the workflow of a superscalar processor using CMT on the right. CMT removes the cycles used on switching between threads.

The other kind of thread-level parallelism is fine-grained multithreading (FMT) [7]. This method switches rapidly between threads. Whereas coarse-grained TLP allows a thread to run for several cycles before switching, fine-grained TLP switches threads every single cycle. This makes each individual thread take longer to complete, but the overall completion time for the set of threads is greatly reduced [7].

FIGURE 3.2 [7]
Compares super-scalar (right) to FMT (left)

Figure 3.2, above, shows the workflow of a standard superscalar processor on the right side, and the workflow of a superscalar processor using FMT on the left. FMT changes threads more rapidly than CMT does. This leads to each thread taking more cycles to fully complete, but less time overall for all of the threads to complete.
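The difference between the two grains can be imitated in software with a toy round-robin loop. This is only a sketch: real FMT switching happens in hardware on every processor cycle, and the "instructions" here are just labels.

    #include <cstdio>
    #include <string>
    #include <vector>

    int main() {
        // Two "threads", each a list of instruction labels.
        std::vector<std::string> a = {"A1", "A2", "A3", "A4"};
        std::vector<std::string> b = {"B1", "B2"};
        std::size_t ia = 0, ib = 0;
        // Fine-grained switching: issue from a different thread each
        // cycle, falling back to whichever thread still has work.
        for (int cycle = 1; ia < a.size() || ib < b.size(); ++cycle) {
            bool turn_a = (cycle % 2 == 1);
            if (turn_a && ia < a.size())
                std::printf("cycle %d: %s\n", cycle, a[ia++].c_str());
            else if (ib < b.size())
                std::printf("cycle %d: %s\n", cycle, b[ib++].c_str());
            else
                std::printf("cycle %d: %s\n", cycle, a[ia++].c_str());
        }
    }

A coarse-grained version would instead stay on thread A for a fixed number of cycles, or until a cache miss, before switching to B.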
WHAT IS SIMULTANEOUS
MULTITHREADING
Simultaneous multithreading is the full integration of TLP and superscalar processing. Thread-level parallelism can greatly reduce the vertical waste of resources (entirely wasted cycles), but does nothing for the execution units wasted within each cycle. Likewise, superscalar processing reduces horizontal waste, but does very little for wasted clock cycles. SMT combines these two technologies in order to minimize both horizontal and vertical waste [7]. This means that performance is greatly increased, with the majority of every cycle being used to complete necessary operations within the threads being processed.
FIGURE 3.3 [7]
Compares CMT (left) with FMT (middle) and SMT (right)

In Figure 3.3, three kinds of parallelism are shown. Column c (left) shows coarse-grained thread-level parallelism, column d (middle) shows fine-grained TLP, and column e (right) shows SMT. The figure makes it easy to compare all three types of parallelism, and shows that SMT leaves the fewest execution units unused per cycle.
PROGRESSION FROM SINGLE-THREAD PROCESSORS TO SIMULTANEOUS MULTITHREADING
Thread parallelism existed as an idea long before it could be implemented effectively. The first multithreading processor was created in the early 1950s and was capable of handling two threads at once. This technology was quickly improved, and by the late 1950s processors existed that could handle up to 33 threads [10].
In the 1960s, superscalar processing was beginning to be
developed, with registers and instructions being tagged in
order to make them more easily recognizable by execution
units [10].
Progress was slowly made over the next 20 years, with
four-way multithreading being possible by the late 1980s. In
the 1990s, the idea of simultaneous multithreading was just
being developed and expanded upon. However, there was no
processor truly capable of SMT until the 2000s [10].
IMPLEMENTATION OF
SIMULTANEOUS MULTITHREADING
Changes to Physical Architecture

All processors with simultaneous multithreading implement a form of superscalar architecture, along with certain architectural modifications. Most of the components necessary for simultaneous multithreading are components already necessary for conventional superscalar designs. There are, however, certain modifications necessary to accommodate the rapid access of multiple threads and their respective resources. Because threads often share many resources, SMT can cause significant interference between threads within shared physical structures. This interference is most notable in the form of increased short-term memory requirements. Each active thread has to make memory references to its own set of variables and stored information every time it runs. Since more threads are in use, SMT processors have significantly increased memory requirements [11]. This issue can be overcome by building a larger number of registers into the architecture of SMT processors. SMT processors also benefit from additional pipelining stages for accessing the registers [11].

Changes to Software

Though the necessary changes to the physical architecture of SMT processors are relatively minor, there are numerous ways in which the software being processed can be adjusted to improve the effectiveness of SMT. While simultaneous multithreading can improve efficiency in traditionally coded software to a degree, some changes to how software is programmed are necessary in order to take full advantage of the capabilities of SMT. Many software structuring and behavioral changes have been made over time to accommodate the instruction- and thread-level parallelism employed by current processors, so it is not abnormal for proper implementation of SMT to require some changes on the part of programmers and software designers. Software support for SMT borrows many mechanisms from software support for previously implemented forms of multithreading, but also requires some new methods in order to take full advantage of its hardware capabilities [12]. As discussed in the section on physical architecture, SMT requires a significant increase in register use. An increase in the number of physical registers in the processor is one potential method to combat this shortcoming, but it is also possible to reduce the need for additional registers via simple changes to the software being processed. One study conducted at the University of Washington found that multiple kinds of compiler optimizations for traditional processors can be applied as a means of significantly improving SMT efficiency with only minor adjustments [11]. Another University of Washington study found that, of several methods tried, the most effective way to improve SMT efficiency and memory usage was a simple change to the software that controls short-term memory allocation in order to eliminate inherent contention [12]. The study also found that once this simple change was implemented, several other beneficial changes became available, which the authors believe to be worthy of further research [12]. As a final example, a team of researchers, also at the University of Washington, developed and partially tested a concept for a brand-new method of implementing SMT called mini-threads [13]. The basic concept is to break threads into smaller ‘mini-threads’ which are capable of sharing architectural registers [13]. The researchers believe that implementing this method could significantly improve SMT processing speeds while simultaneously decreasing the number of necessary registers [13]. While the concept of mini-threads is not yet implemented on any commercially available processors, the researchers state that they plan to continue to explore the concept as future work [13]. The fact that there are already so many potentially viable options being considered for improving SMT provides strong support for the idea that this already impressive technology can become even more beneficial with future research.
What Kinds of Processes Are Most Affected
Not only are many individual programs affected by SMT; entire workflows and activities can be impacted as well. Among these are workflows relying on multitasking, and activities such as video editing and 3D rendering.
When editing a video, data is being encoded in tiny
sections, frame by frame. These sections are all completely
independent of each other – as long as they end up in the right
order at the end, it doesn’t matter how they are processed.
Because of this, as long as the data transfer can be handled,
SMT gives a huge benefit.
The same is true of rendering a 3D model. Calculations
are made per point of the model, resulting in thousands of
calculations. This is greatly improved by SMT [14].
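A small sketch of why this kind of workload parallelizes so well, using one task per frame (encode_frame is an invented stand-in for real encoding work):

    #include <cstdio>
    #include <future>
    #include <vector>

    // Placeholder for encoding one video frame. Frames are independent,
    // so any frame can be processed at any time, on any thread.
    int encode_frame(int frame) { return (frame * frame) % 251; }

    int main() {
        const int frames = 8;
        std::vector<std::future<int>> jobs;
        for (int f = 0; f < frames; ++f)    // launch one task per frame
            jobs.push_back(std::async(std::launch::async, encode_frame, f));
        for (int f = 0; f < frames; ++f)    // collect results in order
            std::printf("frame %d -> %d\n", f, jobs[f].get());
    }

The frames may be encoded out of order and in parallel, but collecting the futures in order guarantees the results are assembled correctly, which is all the final video requires.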
MULTICORE PROCESSORS AND THEIR
RELATION TO SIMULTANEOUS
MULTITHREADING
Multicore processors are CPUs that have multiple
cores of execution. Each core is made up of multiple
execution units, which means one multicore processor is
capable of doing the work of multiple superscalar processors.
This opens up another method of parallelism, in which threads
are run simultaneously, and completely independently. Below
is a diagram showing how a multicore processor compares to
a regular superscalar processor [7].
FIGURE 3.4 [7]
Compares a standard superscalar processor with a multicore processor

Figure 3.4, above, shows the difference between a standard superscalar processor and a multicore processor. Multicore processors are essentially multiple processors combined into one. This allows each core to execute instructions completely independently, without switching between threads.

IS THIS THE BEST WAY TO IMPROVE PROCESSING POWER

Alternative Methods of Improving Processor Power
Alternative methods of improving CPU performance
include increasing clock speed and increasing the number of
cores of execution. These two methods both have progressed
greatly over the last few decades.
Increasing clock speed has the side effect of higher temperatures, and can also cause a greater error rate. As a result, clock speeds have recently plateaued around 4 GHz, with most processors tending to stay around 3 GHz. Without further improvements to cooling, increasing clock speed is no longer a viable way of improving performance.
Increasing the number of cores of execution in a processor
is a method of improving performance that is still being
explored. Every generation of processors has a greater
number of cores, both logical and physical. Some high end
server processors have up to 15 cores of execution [15].
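The split between physical and logical cores is visible from ordinary code. C++ exposes the count of hardware threads, which on an SMT-enabled processor is typically twice the number of physical cores:

    #include <iostream>
    #include <thread>

    int main() {
        // Number of concurrent hardware threads (logical cores).
        // May return 0 if the value cannot be determined.
        std::cout << std::thread::hardware_concurrency()
                  << " hardware threads\n";
    }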
One alternative to SMT is different in kind from all the others: a completely new CPU design. Currently, CPUs function in what is essentially two-dimensional space. Within the past couple of years, a large amount of funding has gone toward researching three-dimensional (stacked) processors. This could increase the efficiency of CPUs by up to a thousand times, not only in terms of performance, but also power consumption [16].
Advantages of Simultaneous Multithreading
Simultaneous multithreading gives an efficiency
boost that is hard to match with other methods of
improvement. While clock speeds have been very helpful up
to this point, it is no longer reasonable to expect large
returns for the effort and resources required to increase the
clock speed further.
The same can be said of multicore processors. With
more cores, transistors must decrease in size further.
Eventually we will be unable to fit any more on a chip. This
means that we must find another way to improve
performance.
Should a three-dimensional processor be developed and manufactured reasonably cheaply, its benefits would likely far outweigh those of SMT. However, it is unlikely that such a powerful processor will be developed in the near future, whereas SMT is already being implemented.
Shortcomings of Simultaneous Multithreading
Nothing exists without a downside. The same is true of
SMT. Two major issues that currently exist with simultaneous
multithreading are bottlenecks occurring when resources are
shared, and security flaws.
Because simultaneous multithreading executes many
threads at one time, data is being pulled in large amounts from
RAM and caches. While this is not an issue with smaller
processors, when the thread count starts increasing, it is very
easy to hit a wall in terms of data transfer. This can lead to
threads taking even longer to complete than they usually
would.
One way to imagine this is the fire alarm in a crowded
building being pulled, with only 2 exits. When you have a
small number of people, it is very easy to make two exits
handle the output required. However, when the building is
filled with hundreds of people, a mob forms by the doors, with
everybody fighting to get out at once. This leads to a decrease
in output, which may have been avoidable by forming a line,
and having everybody exit in order. The same can be said for SMT: in some cases, when many threads will be using one resource, it is better to run them sequentially than to have too many threads trying to access the same resource at once.
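The effect is easy to reproduce. In the sketch below, eight threads funnel through a single lock to touch one shared counter, so they mostly wait in line, like the crowd at the exits. The thread and iteration counts are arbitrary, and the protected "work" is deliberately tiny to exaggerate the contention.

    #include <iostream>
    #include <mutex>
    #include <thread>
    #include <vector>

    int main() {
        long counter = 0;
        std::mutex gate;                  // the single shared "exit"
        std::vector<std::thread> pool;
        for (int t = 0; t < 8; ++t)
            pool.emplace_back([&] {
                for (int i = 0; i < 100000; ++i) {
                    std::lock_guard<std::mutex> hold(gate);  // serializes
                    ++counter;            // tiny work vs. the locking cost
                }
            });
        for (auto& th : pool) th.join();
        std::cout << counter << "\n";     // correct total, little speedup
    }

All eight threads finish with the correct total, but because every increment serializes on the same mutex, the parallel version gains little over a sequential loop and can even run slower.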
Another major drawback of SMT is the security flaws it opens up. When multiple threads share one execution unit, they do not have to be from the same program; two different programs can access the same execution unit at any one time. Normally this is a good thing, as it decreases the time a thread spends waiting to be executed.

This sharing of execution units, however, also means that the programs are sharing the same registers and caches. This means that data being used by one program can potentially be observed by another program, which could lead to programs gaining access to information that should have been encrypted [17].
Encryption keys exist to access encrypted data. A program carries this key, which means that it must be saved to a cache at some point. Generally this is not a problem. However, when another program uses this cache and can gain access to the encryption key, that program also has access to any data the intended program does. This flaw generally does not affect personal computers, but it is a serious concern for large-scale servers [18].

WHERE DO WE GO FROM HERE?

Simultaneous multithreading is currently one of the most viable ways to improve the efficiency and performance of future processors. After years of progress in parallelism, at both the instruction and thread level, simultaneous multithreading technology is just now becoming available in consumer PCs. SMT is an innovative and versatile technology, and its future in commercial computing is clear. The benefits of this technology far outweigh the drawbacks, and it can continue to be improved through multiple kinds of simple software changes. Simultaneous multithreading will continue to be researched, improved, and implemented in our processors for years to come.
REFERENCES

[1] P. Persson Mattsson (2014, Nov. 13). "Why Haven't CPU Clock Speeds Increased in the Last Few Years?" COMSOL (online article). https://www.comsol.com/blogs/havent-cpu-clock-speeds-increased-last-years/

[2] "The Instruction Cycle" (website). http://www.cs.uwm.edu/classes/cs315/Bacon/Lecture/HTML/ch05s06.html

[3] R. S. Singh (2014, Apr. 23). "What is clock cycle, machine cycle, instruction cycle in a microprocessor?" Quora (online article). https://www.quora.com/What-is-clock-cycle-machine-cycle-instruction-cycle-in-a-microprocessor

[4] J. Smith, G. Sohi (1995). "The Microarchitecture of Superscalar Processors" (online article). ftp://ftp.cs.wisc.edu/sohi/papers/1995/ieeeproc.superscalar.pdf

[5] G. Torres (2006). "Execution Units" Hardware Secrets (website). http://www.hardwaresecrets.com/inside-intel-core-microarchitecture/4/

[6] A. Silberschatz, P. Galvin, G. Gagne (2013). "Operating System Concepts, Ninth Edition." Kendallville, Indiana: Courier (print book), Chapters 2-4.

[7] P. Mazzucco (2001, June 15). "Multithreading" SLCentral (website). http://www.slcentral.com/articles/01/6/multithreading/print.php

[8] Hawkes (2000). "Enhancing Performance with Pipelining" FSU (website). http://www.cs.fsu.edu/~hawkes/cda3101lects/chap6

[9] E. Karch (2011, Apr. 1). "CPU Parallelism: Techniques of Processor Optimization" MSDN Blogs (online article). http://blogs.msdn.com/b/karchworld_identity/archive/2011/04/01/cpu-parallelism-techinques-of-processor-optimization.aspx

[10] M. Smotherman (2005, April). "History of Multithreading" (website). http://people.cs.clemson.edu/~mark/multithreading.html

[11] J. Lo, S. Eggers, et al. "Tuning Compiler Optimizations for Simultaneous Multithreading." Dept. of Computer Science and Engineering, University of Washington (online article). http://www.cs.washington.edu/research/smt/papers/smtcompiler.pdf

[12] L. K. McDowell, S. J. Eggers, S. D. Gribble (2003). "Improving Server Software Support for Simultaneous Multithreaded Processors." University of Washington (online article). http://www.cs.washington.edu/research/smt/papers/serverSupport.pdf

[13] J. Redstone, S. Eggers, H. Levy (2003). "Mini-threads: Increasing TLP on Small-Scale SMT Processors." University of Washington (online article). http://www.cs.washington.edu/research/smt/papers/minithreads.pdf

[14] P. Manadhata, V. Sekar (2003). "Simultaneous Multithreading" (online presentation). http://www.cs.cmu.edu/afs/cs/academic/class/15740-f03/www/lectures/smt_slides.pdf

[15] "Xeon Processor" Intel (website). http://ark.intel.com/products/75251/Intel-Xeon-Processor-E7-4890-v2-37_5M-Cache-2_80-GHz

[16] T. Puiu. "3D stacked computer chips could make computers 1,000 times faster" ZME Science (online article). http://www.zmescience.com/research/technology/3d-stacked-computer-chips-43243/

[17] A. Fog (2009). "How Good is Hyperthreading?" Agner's CPU Blog (online article). http://www.agner.org/optimize/blog/read.php?i=6&v=t

[18] C. Percival (2005). "Hyper-Threading Considered Harmful" (online article). http://www.daemonology.net/hyperthreading-considered-harmful/
ACKNOWLEDGMENTS
We would like to acknowledge our writing instructor,
Keely Bowers, our co-chair, Kyler Madara, and our chair,
Sovay McGalliard for their help in the writing of this paper.