This Project Thesis for Masters of Engineering Degree by

DESIGN OF A PARAMETRIC OUTLIER
DETECTION SYSTEM
By
Ronald J. Erickson
B.S., Colorado Technical University, 2001
A project thesis submitted to the Graduate Faculty of the
University of Colorado at Colorado Springs
In partial fulfillment for the degree
Masters of Engineering
Department of Computer Science
2011
©Copyright By Ronald J. Erickson 2011
All Rights Reserved
ii
This Project Thesis for Masters of Engineering Degree by
Ronald J. Erickson
has been approved for the
Department of Computer Science
by
Dr. C.H. Edward Chow
Dr. Xiaobo Zhou
Dr. Chuan Yue
______________________________
Date
iii
Erickson, Ronald J. (M.E., Software Engineering)
Design of a Parametric Outlier Detection System
Project Thesis Directed by Professor C.H. Edward Chow
Integrated circuit defects are only discovered during the electrical test of
the integrated circuit. Electrical testing can be performed at the wafer level,
package level or both. This type of testing produces a data-log that contains
device specific electrical characteristic data known as parametric data. This
testing of the integrated circuit has become increasing expensive in an ultra
competitive market. The traditional method of setting upper and lower limits to
discover non-compliant circuit defects; has become only a portion of the criteria
required to discover potential defects, as the geometries of these integrated
circuits continue to decrease.
Integrated circuit manufacturers are now increasingly employing additional
methods of failure detection by adding test structures on the wafer. Other
techniques employ high percentage failure mode screening on these circuits for
device marginality. However, marginality testing can only be accomplished after
extensive characterization of the circuit on numerous wafer lots over standard
process corners to establish device trends. Another technique is to include a
design for manufacturing structure that detects a single parametric measurement
that is potentially out of family when compared to the device simulation. All of
these standard types of failure detection work well with an environment that
builds
many
wafer
lots
and
thousands
iv
of
a
single
circuit.
.
Commercial circuit manufacturers will typically lower the FIT “failures in time” rate
requirements of their specific integrated circuit offering if the potential returned
merchandise costs are lower than the costs of more rigorous electrical testing.
Currently there is no universal industry adopted way to detect parametrics within
rigorous FIT requirement, that have an end use mission critical environment, and
where the product quantities are small. This paper and accompanying project is
an attempt to address the opportunities and challenges of creating a more
rigorous parametric outlier detection system on embedded circuits with very high
FIT requirements, very small sample sizes, used in mission critical devices.
v
For my family, Crystal, Chelsea, and Isabel
(and to my parents Gerry and Janice for always compelling me to continue)
vi
Acknowledgements
Anyone who has ever gone back to a university to obtain a high level
degree knows the effort that is involved in the creation of a project of this
magnitude. This project thesis was no exception. While I have many people to
thank for helping me throughout this project, I would like to pay special mention
to a number of individuals.
First, I would like to sincerely thank Professor C.H. Edward Chow.
Dr. Chow you are an excellent instructor and I really enjoyed the time you have
taken mentoring me throughout the tedious graduate school process. Your time
is very valuable, and I thank you for sharing a portion of that time over the last
couple semesters. This project thesis would not have been successful without
your direct guidance.
To the members of my project thesis committee, Dr. Xiaobo Zhou and Dr.
Chuan Yue, thank you for your feedback, support, and of course your
encouragement throughout this process.
And finally to Mr. Troy Noble: Thank you for being a great motivator and
intuitive in asking and answering my questions. In addition, thank you for the
personal mentoring in electrical engineering and software development.
vii
TABLE OF CONTENTS
Chapter
I.
INTRODUCTION....………………..……………………………………1
Background…………...…………………………………………2
General Circuit Manufacturing Techniques …………………7
Design for Manufacturing Architectures..……………………8
Motivation……………...…………….…………………………10
Approach Overview ………………………….……………….12
II.
SOFTWARE ENGINEERING.……………………………………….13
Software Engineering Basic Stages………………….……..14
Capability Maturity Model Integration………………………..16
Software Engineering Advanced Stages…………………...20
III.
STATISTICAL ANALYSIS……………………………………………24
Goodness of Fit...……………………………………….……..25
The Null Hypothesis…...….…………………………………..28
Anderson Darling……………………………………….……..29
IV.
SYSTEM DESIGN & ARCHITECTURE......………………………..31
Database………………………………………………………..32
Automatic Data Extraction…………………………………….33
Automatic Statistical Modeling ...…………………………….34
Testing and Evaluation of Prototype......…….……………...37
Analysis of Results………………........…….………………...39
viii
V.
CONCLUSIONS.............................................…….………………..45
BIBLOGRAPHY..........................................…………….…...……………..48
APPENDICIES
A. Data Set Parametric Outlier Application Dataset 1………..…..51
Normal Distribution Test Case
B. Data Set Parametric Outlier Application Dataset 2………..…..52
Normal Distribution Test Case
C. Data Set Parametric Outlier Application Dataset 3………..…..53
Normal Distribution Test Case
D. Data Set Parametric Outlier Application Dataset 4………..…..54
Bimodal Distribution Test Case
E. Data Set Parametric Outlier Application Dataset 5………..…..55
Bimodal Distribution Test Case
F. Data Set Parametric Outlier Application Dataset 6………..…..56
Normal Distribution with Negatives Test Case
G. Data Set Parametric Outlier Application Dataset 7………..…..57
Extreme Outlier with Multiple Ties Test Case
H. Data Set Parametric Outlier Application Dataset 8………..…..58
Multiple Ties Normal Distribution Test Case
I. Data Set Performance Graded Binder Sample 195……………59
Large Data-Set Non-Normal Distribution Test Case
J. Data Set Performance Graded Binder Sample 196……………60
Large Data-Set Non-Normal Distribution Test Case
ix
LIST OF TABLES
Table
1. Critical Level Table……………..….……..………………………………..…29
LIST OF FIGURES
Figure
1. Wafer Dice with Street Intact …….……..……………………..……………...3
2. Wafer Dice with Street Sawn…….……..……………………………………...4
3. Astronaut Repairing the Hubble Telescope…………………………………10
4. Software Engineering Basic Stages………………. …….…….…………...13
5. Software Engineering Advanced Stages………... …..….…….…………...21
6. Normal Distribution Curve………………………………….…….…………...35
7. Bimodal Distribution Curve………..……………………….…….…………...37
8. Cluster Algorithm Flowchart……....……………………….…….…………...42
xi
CHAPTER 1
INTRODUCTION
Integrated circuits are designed and simulated by designers and then sent
off to a wafer fabrication to be built on wafers. Unfortunately wafer fabrication
manufacturing and design defects are only discovered during the electrical test of
the integrated circuit. This testing of the circuit electrically can be performed at
the wafer level, package level, or both. Electrical testing produces a data-log that
contains specific device characteristic data known as parametric data. Testing of
the integrated circuit has become increasing expensive in an ultra competitive
market and circuit manufactures are exploring new methods of discovering
failures.
Traditionally manufactures would use a method that would set an upper
and lower limit on a parametric measurement to detect circuit failures. This
methodology has become only a portion of the criteria that is now required to
discover potential defects of small geometry integrated circuits. Integrated circuit
manufacturers are now employing additional methods of failure detection
including wafer test structures and specialized BIST (built in self test) circuitry.
After extensive characterization of the circuit on many wafer high percentage
failure mode trending on these circuits can be used to detect lower parametric
marginality. Design for manufacturing simulations that employ only detecting a
limited parametric measurement to the original design simulation.
2
All of these failure detection schemes work well with environments that
produce thousands and thousands of the single integrated products. Non-mission
critical integrated circuit companies will typically lower the FIT rate requirements
of their specific integrated circuit offering because the end user of the product is
within a commercial market. Another reason commercial manufactures can lower
the FIT requirements is because the potential returned merchandise costs will be
usually lower than the costs of rigorous testing. There currently is a need to
create a more rigorous automated parametric outlier detection system on types
of embedded circuits with very high FIT requirements, on very small sample
sizes that are on mission critical devices.
Background
As global markets become increasingly competitive some microelectronics
companies are designing the integrated circuit in their facility and then utilizing a
fab-less wafer manufacturing system. These companies design, assemble, and
test the high reliability standard and custom integrated circuits for the aerospace,
high-altitude avionics, medical, networking, and other standard consumer
electronics in their facility, but pay to have the wafers manufactured at an outside
facility. In this type of business model the company does not own a wafer
fabrication facility. Therefore the company must use many wafer manufacturing
facilities throughout the world, depending on the design geometries of the
transistor structures being manufactured within the integrated circuit design.
3
Wafer Dice with Street Intact
To mitigate a portion of the risk using differing wafer fabs, most companies
have begun inserting standardized transistor cell structures in the streets of the
wafer. (Marinissen, E. Vermeulen, B., Madge, R. Kessler, M., and Muller, M,
2003) The wafers streets are the areas between the die and is illustrated above.
However to save silicon the streets are the basis of where the wafer saw cuts the
wafer to get individual die. These test structures are only available for testing
while the wafer is intact at wafer probe. After wafer saw these test structures are
destroyed by the saw process, see illustration below.
4
Wafer Dice with Street Sawn
High reliability products are electrically tested multiple times prior to
shipment to ensure compliance to the device specification. In some cases this
testing is tailored to the specific industry for which they are designed (Marinissen,
E., Vermeulen, B., Madge, R. Kessler, M., and Muller, M, 2003). Companies
must consistently monitor large amounts of test data at pre-defined electrical test
steps during the manufacturing process to ensure compliance (Perez, R.,
Lilkendey, J., and Koh, S.1994). Contained within this test data are device
specific electrical parametric readings that are composed of DC and AC electrical
parameters pertaining to the individual device performance.
5
To maintain a high reliability product as the geometries continue to
decrease, monitoring device parametric data has become an increasingly
important tool in discovering discrete product defects (Moran, D., Dooling, D.,
Wilkins, T., Williams, R., and Ditlow, G 1999).
Naturally the business model of the fab-less microelectronic company
introduces even more risk to the development of wafers than a company that
owns their own fab. The wafer fabrication facility will require a contract on less
rigorous requirements on wafer parametric acceptance. A less restrictive wafer
acceptance is geared to increase fab starts and deliverables on a statistical basis
that is positive to manufacturing fab, not the fab-less companies. This philosophy
leads to the potential of large deviations in material parametrics and must be
screened for out of family measurements. Some consumer products companies
have resorted to only electrical testing at the wafer level. (Bhattachatya, S., and
Chatterjee, A 2005). Only because it cheaper to replace the device than it is to
electrically test the device at package level. Manufacturers of commercial circuits
have been willing to trade reliability for better overall performance, but designers
of high reliability systems cannot accept this trade-off. Mission critical systems
require higher reliability because replacement of faulty parts is difficult or
impossible after deployment and because component failures may compromise
national security. Furthermore, typical service life tends to be longer for military
systems and aerospace applications, so high reliability and mission critical
devices must be tested at higher level at package to ensure compliance.
6
Wafer fab minor deviations within the doping of the transistor can create
very different wafer structure parametric data (Singh, A., Mani, M., and
Orshansky, M 2005) (Veelenturf, K 2000). These types of deviations can be
easily normalized and thus predicted if there are many parametric wafer
samples, they are much harder to predict on small lot quantities that can have
months in between lot.
Radiation tolerant circuits typically are only built in small quantities and
sold after a rigorous qualification process. Some companies have developed a
manufacturing process to produce a range of radiation-hard electronic products.
These processes are somewhat different from the ones used in commercial
foundries because they include a modified process that produces a circuit with
greater radiation tolerance. In addition, these processes are jointly developed
between the manufacturing fab and the circuit design facility in the case of a fabless company. These commercial foundries typically provide the starting material
for all electronic components manufactured in their processing facilities.
However, radiation tolerant circuits require nonstandard starting materials,
incorporating epitaxial layers, insulating substrates; doped "trenches" in the
silicon that may enhance radiation tolerance. The use of non-standard materials
makes the use of standard transistor out of family detection much more difficult
This is where a better system to predict yield, and detect outliers without many
samples on smaller geometries from many manufactures on non-standard
processes is driving the requirement to build a better outlier detection tool.
7
General Circuit Manufacturing Techniques
All integrated circuit defects are discovered through the electrical test of the
integrated circuit at the wafer level. High reliability devices are then re-tested at
the packaged device level to detect assembly, environmental and accelerated
life-time infant mortality failures.
A portion of these defects that affect integrated circuits throughout the
process of manufacture at a wafer fabrication facility, through the assembly,
environmental aspects of tests are listed below

Wafer fabrication defects
o transistor doping, oxide breakdown, interconnect and poly stringers
o opens, shorts, resistive metals, bridging materials
o metallization stability issues, dissimilar metals like AL & CU.

Potential device assembly defects
o wafer back-grind, wafer saw, die preparation
o device package defects including trace length, & integrity, package
leads
o solder ball integrity, die attach, wire bond, hermetic lid seal

Environmental defects on the package and die are caused by simulating
o temperature cycle, temperature shock, mechanical vibration
o centrifuge, high humidity, salt environment
o accelerated and dynamic electrical burn-in causing infant mortality
8
o

total ionizing dose (TID), gamma radiation effects on gate oxides
As a minimum the digital circuit DC parametric tests performed at
electrical test are
o Continuity, power shorts, input leakage, output tri-state leakage
o quiescent power, active power, input and output thresholds
o minimum voltage operation

However, these digital circuit measurements usually include the AC
electrical parameters
o data propagation, input set-up, output/input hold
o input rise, input fall, clock duty cycle
Design for Manufacturing Architectures
All high reliability IC manufacturers must electrically test their product to
ensure compliance to the product specification and save package and assembly
costs. Design for manufacturing techniques (DfM) imply that device parametric
data can potentially discover the mismatches between the manufacturing
process of the circuit and the circuit designer intent within test structures in street
locations of a die reticle (Gattiker, A 2008) Other parametric defect techniques
only detect limited parameters of quiescent current device sensitivity and are
limited to a very specific device function (Wang, W., and Orshansky, M 2006).
9
MARYL (Manufacturability Assessment and Rapid Yield Learning)
minimizes the risk of starting high volume manufacturing in a waferfab. Engineers
consistently study the relationships between the process parameters within the
wafer fabrication facility and the electrical parameters of the integrated circuit and
predict a final yield of the process (Veelenturf, K 2000).
Some tools utilize the circuit from a many building blocks approach. This
tool takes the knowledge of smaller building block circuits in an attempt to solve a
part of a larger problem. However it is only a simulation of a real world integrated
circuit based upon the performance or execution time of a particular product
within a product line in a specific wafer fab (Moran, D., Dooling, D., Wilkins, T.,
Williams, R., and Ditlow, G 1999).
Standard DfM techniques work well within the commercial market because
these designs do not contain radiation tolerance manufacturing requirements.
Essentially a mature process wafer test structure will deterministically behave as
predicted without radiation tolerance requirements. However within a small fabless business model the wafer manufacturer will assume a no-liability stance to
implanted wafers due to the complexity and variance from one wafer lot to
another and the fact that the process of manufacturing these wafers is a nonstandard process.
10
Motivation
A small fab-less company needs a tool that can process relatively small
amounts of individual device data based on small lots of the electrical
characteristic data. But this small amount of data can be skewed by the implied
marginality effects of Commercial Radiation Hardened (CRH) implants within the
transistor structure on a wafer to wafer variability. These specialized circuits
require nonstandard starting materials, incorporating epitaxial layers, insulating
substrates; doped "trenches" in the silicon that enhance radiation tolerance, but
introduce marginality into the circuits. To compound the issue, this data being
analyzed within these small lots must be used to guarantee a very high FIT rate
on products that provide critical services within a mission critical sub system.
Astronaut Repairing the Hubble Telescope
11
To complicate matters if these devices are placed in service on systems
that is in high altitude or a space application. It is becomes unbelievably
expensive to send an astronaut to replace a small faulty integrated circuit or
subsystem in space. For instance the 2009 repair of the Hubble space telescope
was valued at 1.1 billion dollars. It is highly visible when a space system fails, the
publicity involved in a failure of this magnitude, and the potential government
fines can easily drive a small company into bankruptcy.
Typically device lot quantities are usually small and the product offering
are very broad to cover simple logic devices, to memories of DRAM, Flash, and
SRAM. Additional devices types could be protocols of spacewire, European
CAN, multiple clock generator chips, or single and multiple processors as an
example. All of these products could be built of differing technologies of 0.600µ to
90n meters throughout different wafer fabrication facilities throughout the world.
Therefore there is a real need to provide a tool would allow a small fab-less
company to make quick educated decisions into the validity of the parametric
data being examined in a effort to continue to provide their customers with the
highest quality products for at least the next 20 years while maintaining a high
FIT requirement on mission critical circuits.
12
Approach Overview
This automatic parametric outlier detection system prototype will utilize an
existing algorithm named Anderson-Darling to test for normal distribution on the
dataset. This test is a general normality test designed to detect all departures
from normality within a dataset, and is believed as the most powerful test for
normality. The importance of checking the data-set for normality is very important
or the consequential methodology for choosing outliers may produce incorrect
results. If a dataset is deemed a normal Gaussian distribution by the automated
Anderson Darling test The prototype will perform an inner 75% quartile outlier
detection techniques on the dataset and present the user with potential outliers
within the data-set in a histogram format based on a normal distribution.
If a dataset is deemed non-normal by Anderson Darling testing, the dataset
will be presented in a histogram format requiring user intervention. The user will
be required to visually inspect the data and set the upper and lower fences for a
normal distribution. Once the user has placed the upper and lower fences on the
data distribution the remaining steps for normal and non-normal user intervention
datasets will be the same. The prototype will perform inner 75% quartile outlier
detection techniques on the dataset and present the user with potential outliers
within the data-set in a histogram format based on a normal distribution.
CHAPTER 2
SOFTWARE ENGINEERING
Software engineering is dedicated to the design, implementation, and
maintenance of software code, software applications, and software systems.
Software engineering integrates significant mathematics, computer science and
general best engineering practices. High quality, more affordable, easy to
maintain software is generated with a methodic approach to the key stages of the
software.
Software Engineering Basic Stages
14
Software Engineering Basic Stages
The various stages within a software engineering life-cycle are
requirements, design, construction, testing, and maintenance.
A software requirement are defined as a property that must be exhibited
In the software product being created that will solve a real-world. Requirements
documents identify the project problem that the software will solve. Through the
elicitation of stakeholders, documenting requirements, and analyzing
requirements to the people and organizations that are affected by the software
that will be created. When an agreement and priorities are set on each of the
requirements within the requirements document the project can move to the next
stage; design.
Software design is the process of defining the software architecture, the
user and system interfaces, and other characteristics of a system or component
that affect the end software product to the requirements document The use of
UML unified modeling language tools like class diagrams that describe the static
design of the objects, and the relationships these objects have with to each other
in the software application are generally used to describe a software design
stage.
15
Software construction refers to the methodical detailed creation of
operating, meaningful software through the combination of writing and debugging
the code. Then verification that ensures the software operates correctly, unit
testing that ensures individual units of code are fit for use in the system, and.
integration testing that will verify all the source code with the final system that it
will operate.
Software testing consists of the dynamic verification of the requirements to
the final software application while operating the software in final system
environment. Generally the quality of the software under test performs the
process of executing the program or application with the intent of finding software
defects or errors
Software maintenance is the most expensive stage in the software life
cycle as it covers the operational lifetime through the eventual retirement of the
software. When the software is delivered or placed in operation and errors are
discovered a software engineer must fix the issues for continued use by the end
users. But errors are not the only activity associated with software maintenance.
In addition operating environments can change, or new requirements may be
necessary? (Abrain, A. 2004)
16
Capability Maturity Model Integration (CMMI)
Capability Maturity Model Integration a measure of software process maturity
framework within an organization. The CMMI all assess the process maturity of
at one of 5 different levels to measure a company’s ability to produce quality
software and is defined by the 5 CMMI levels are described below:
Maturity Level 1 Initial
Maturity Level 2 Managed
Maturity Level 3 Defined
Maturity Level 4 Quantitatively Managed
Maturity Level 5 Optimizing
Maturity Level 1: Initial level and is characterized by processes and projects
that are usually unplanned from project management perspective, very informal,
project impromptu changes, and improvised requirements leading to project
scope issues. Companies that fall into this maturity level are usually small startup companies that tend, abandon processes when schedules are due, over
commit and over promise, and cannot repeat their past successes on future
projects.
17
Maturity levels above level 1 have specific goals and requirements for the
software engineering process. These goals and requirements must be met for a
company to be certified at a higher level. As these levels progress higher from 1
to 5 this implies the higher level is comprised of the standards on its level plus
the standards on the corresponding lower levels.
Maturity Level 2: Managed level as requirements are documented and
managed within the project group. All processes have planning stages and
documentation measured at major milestones, and controlled. The status of
these processes and the software products are reviewed at major milestones and
at the completion of major tasks of the project. Requirements management
identifies the project real-world problem that it will solve. Through the elicitation of
stakeholders, documenting requirements, and analyzing requirements given by
these members. And finally agreeing and setting priority on this document
controlled requirements through content management. Project planning identifies
the scope of the project; estimates the work involved, and creates a project
schedule based on the scope and work estimates. Project monitoring and control
is handled through the effective use of a program manager that will hold regular
intervals of meeting with developers. This program manager will capture minutes
of the meetings, update the schedule to current work completed and
communicate these results to management and customers. Supplier agreement
management handles and manages all project subsystems, hardware, software,
manuals, documentation, parts and materials that will be used on the project.
18
Measurement and analysis involves gathering quantitative data about the
software project and analyzing that data to influence project plans and effort.
Process and product quality assurance involve formal audits by a separate
quality assurance individual or group. Technical peer reviews at different levels
within the project, and in-depth reviews of all work performed. Configuration
management outlines how the project team will control and monitor changes to
the project code.
Maturity Level 3: Defined processes are understood and characterized these
processes are described in documented standards, procedures, tools. Processes
performed across the organization are consistent except for the differences that
are documented with a process tailoring document which specifies the variance
to the standard guidelines. Additional requirements for level three from the
previous level include additional requirements analysis. Technical solutions that
will design, develop, and implement solutions to requirement document.
Project verification that ensures the software operates correctly and validation
that ensures that the correct product was built per the requirements
documentation. The organizational process focuses on initiating process
improvements on processes and process assets throughout the organization.
Organizational process definition establishes and maintains a usable set of
process assets and work environment organizational standards, and then
completes organizational training on these standards
19
Integrated project management addresses the activities related to
maintaining the project plan, monitoring plan progress, managing commitments,
and taking corrective actions. Risk management will assign risk to the project
before and during the project and will also work to mitigate or remove risk
altogether. Integrated Teaming involves the teaming of software engineering to
system engineering. The integrated supplier management within the project is
related to management reporting and management. Decision analysis and
resolution are comprised of decisions that may move the project timeline, could
have a major impact on system performance, and could have legal implications.
Organizational environment for integration will integrate product and process
development
Product integration will assemble the product and ensure that the product
functions properly when integrated in the system.
Maturity Level 4: Quantitatively managed objectives of quality and process
performance. These objectives are then used as the criteria in managing
processes. Quality and process performance is controlled using statistical and
other quantitative techniques, and is quantitatively predictable based on these
techniques. The Organizational process performance views how well the
processes are performing, these performance matrices can be effort, cycle time,
and defect rate. Quantitative project management is the project’s defined process
and the ability to statistically achieve the project’s established quality and
process-performance objectives.
20
Maturity Level 5: Optimizing processes by continually improving on
quantitative understanding of processes variation causes. Through both
incremental and innovative technological improvements level 5 constantly
improving process performance. Organizational innovation and deployment
select and deploy innovative and incremental improvements that improve the
processes and technologies of the company. Causal analysis and resolution
traces defects to the root cause and modifies the processes to reduce or
eliminate those defects. (CMMI, 2010)
Software Engineering Advanced Stages
Software configuration management identifies the configuration of
software at distinct points during the software engineering lifecycle. It has only
one purpose and that is to control and track changes to the software
configuration systematically and methodically through the use of version control.
Configuration management will maintain the integrity and traceability of the
software configuration for the life cycle.
Software engineering management is the project management and
measurement of software at various stages with the software lifecycle by
monitoring plan progress, managing commitments, and taking corrective actions.
21
Software Engineering Advanced Stages
Software engineering process defines the implementation, assessment,
measurement, management, modification, and eventual improvement of all the
processes used in software engineering.
Software engineering tools and methods are the use of tools that provide
the assistance to the software developer in the development of software project.
The purpose of the software engineering tools and methods is to decrease the
software development time, increase the quality, decrease errors and software
defects during the construction and maintenance stages of a software lifecycle.
22
Software quality identifies software quality considerations which occur
during the software life cycle. Through formal audits by quality assurance group
of the process and product these audits provide a quality measure of how well
the software performs and conforms to the original design and requirements.
Knowledge Areas of Related Disciplines include computer engineering
that is a merge of electrical engineering and computer science. Project
management plans, organizes, secures, and manages all project resources in an
effort to successfully complete a software project. Computer science is the study
and implementation of techniques to perform automated actions on computer
and server systems. Quality management seeks to improve product quality and
performance that will meet or exceed requirements documentation for the
software product. Management software ergonomics deals with the safety and
compliance of the software systems to rules and regulations. Mathematics is the
learning or studying of patterns of numbers, counting, measurement, and
calculations of quantity, space, and change. Systems engineering is effort
required to integrate the software to the system platform that it will run.
(Abrain, A. 2004)
CHAPTER 3
STATISTICAL ANALYSIS
An outlier is defined as an observation point that is inconsistent with other
observations in the data set. A data point within the data-set, an outlier has a
lower probability that it originated from the same statistical distribution as the
other observations in the data set. An extreme outlier is an observation that will
have an even low probability of occurrence than the rest of the data. Outliers can
provide useful information about the process. An outlier can be created by a shift
in the mean of the data-set or the variability of parameters in the process.
Though an observation in a particular sample might be a candidate as an
outlier, the process might have shifted. Sometimes, the spurious result is a gross
recording error or an un-calibrated measurement error. Measurement systems
should be shown to be capable for the process they measure and calibrated for
accuracy. Outliers also come from incorrect specifications that are based on the
wrong distributional assumptions at the time the specifications are generated.
Incorrect specifications can also occur if the sample size used was not large
enough and from diverse samples of the process at differing times of
manufacture when the specification limit is generated.
24
Once an observation is identified by means of mathematical or graphical
inspection for potential outliers, root cause analysis should begin to determine
whether an assignable cause can be found for the outlier result.
If the root cause cannot be determined, and a retest can be justified, the potential
outlier should be recorded for future reference and quality control standards.
Outliers processing of a data-set can potentially introduce risk in the
evaluation if the assumption of the data-set comes from a normal distribution;
when actually the data-set is from a non-normal distribution. To ensure that
errors are not introduced into the outlier analysis, the importance of testing the
distribution for normality before performing outlier analysis is critical.
Outlier calculations using standard deviation Sigma, quartile, or CPK
analysis only show the variation of the data-set from the mean by usually
ignoring the tails of the data-set. Small data-sets present the unique problem that
they essentially can be easily viewed by the user quickly as appearing to be a
normal distribution. Only after additional robust analysis of the distribution can
these data-sets be proven to be from or not from a normal distribution. In
statistics there are a number of mathematical equations that provide a statistical
evaluation of the data by checking outliers as well as providing confidence into
the normality of the data distribution.
25
This project utilizes the Anderson-Darling test for normality; and was
developed by two mathematicians with backgrounds is statistics, Theodore
Anderson and Donald Darling in 1952. When used to test if a normal distribution
adequately describes a set of data, it is one of the more powerful statistical tools
for detecting most non normality distributions. The equation can also be used as
an estimation of the parameter for the basis of a form of minimum distance
calculation. Statistically it is an evidence test performed on the data set to prove
it did not arise from a given probability distribution. The given probability
distribution then identifies the probability for each value of a random discrete
variable within a given data set.
Goodness of Fit
The most commonly used quantitative goodness-of-fit techniques for
normality within a distribution are the Shapiro-Wilks, and the Anderson-Darling
tests. The Anderson-Darling is commonly the most popular and widely used
because it the accuracy of the test. The Anderson Darling test is merely a
modification of the Kolmogorov-Smirnov test but gives more weight to the tails.
The Kolmogorov-Smirnov test is distribution free because the critical values do
not depend on the specific distribution being tested; this is the difference with the
Anderson Darling test which makes use of the specific distribution of the data set
in calculating the critical values.
27
Because of these facts the Anderson Darling creates a more sensitive
normality test but the critical values must be calculated for each distribution.
(Romeu J 2003)
Both Shapiro-Wilks and the Anderson-Darling tests use a confidence level
to evaluate the data set for normality, this value is referred to as a p-value or it is
also known as an alpha level. A lower p-value limit correlates to a higher value of
confidence and proves greater evidence that the data did not come from the
selected distribution and therefore is not a normalized distribution. This double
negative of not rejecting the null hypothesis is somewhat counter intuitive. Simply
put the Anderson Darling test only allows the data reviewer to have a greater
level of confidence as the p-value increases that no significant departure from
normality was found. A professional engineer is credited with a unique
perspective of the null hypothesis explaining an interpretation of the null
hypothesis.
Just as a dry sidewalk is evidence that it didn't rain, a wet sidewalk might
be caused by rain or by the sprinkler system. So a wet sidewalk can't
prove that it rained, while a not-wet one is evidence that it did not rain.1
1Annis,C.:P.E.StatisticalEngineering
2009
Charles.Annis@StatisticalEngineering.com,
28
The Null Hypothesis
The definition of the null hypotheses is: H0: the data-set follows a specified
distribution. This hypothesis regarding the distributional form of the data-set is
rejected at the chosen (α) significance level and if the Anderson-Darling A2 test
statistic is greater than the critical value obtained from the critical value table
below. The fixed values of α (0.01, 0.05, 00.25, 0.10.) are generally only used to
evaluate the null hypothesis (H0) at various significance levels. The (α)
significance value or sometimes called the p-value of 0.05 is typically used for
most applications, however, in some critical instances; a lower (α) value may be
applied. This thesis project used the (α) significance value limit of 0.05 as the
limit of p-value comparison to the application output when checking a data-set for
normal distribution.
The critical values of the Anderson-Darling test statistic depend on the
specific distribution being tested like a normal or log-normal distribution. The
tables of critical values for many distributions are not normally published, the
exception are the most widely used ones like normal, log-normal, and Weibull
distributions. The Anderson-Darling quantitative goodness-of-fit techniques for
normality tests are typically used to test if the data is from a normal, log-normal,
Weibull, exponential, or from logistic distributions.
29
Due to the vast differing data distributions that can be analyzed by an
Anderson-Darling test, it was chosen as the basis of test for normality for this
thesis project. The confidence levels are calculated from the p-value as 100*(1 p-value).
Confidence Level
α-level
Critical value A2
99.0%
0.01
0.631
95.0%
0.05
0.752
92.5%
0.025
0.873
90.0%
0.10
1.035
Critical Level Table
Anderson–Darling
Because the Anderson-Darling statistic belongs to the class of quadratic
empirical distribution based on the empirical distribution function statistical model
which is essentially a step function that jumps for 1/n at each of the n data points,
The Anderson Darling test statistic A, and the A2 equations are shown below
these for the normal and lognormal distributions
30
A  n 
2i  1
ln F Yi   ln 1  F Yn1i 
n
n is the sample size of a continuous probability distribution. Then
modifying this equation for a smaller sample data set size, the resulting formula
for the adjusted statistic becomes;
 0.75 2.25 
A 2  A 2 1 +
+ 2 
n
n 

The A2 value calculated is then compared to an appropriate critical value
from the table showing confidence level, α-value, and critical value of the null
hypothesis from the critical value table shown above. In summary for simplicity of
this thesis project the Anderson-Darling test for a normal distribution proves the
data is from a normal distribution if the p –value > α-value and if A2 -value <
Critical-value from the critical value table shown above. In the real world the need
to make decisions and then get on to other more urgent decisions is incredibly
important. The normal distribution function is relatively easily to implement and
will produce quick estimates that are useful for such decisions, only if the sample
size is relatively large and histogram curve fit appears to be a normal distribution.
However, if the data-set is small it can still be assumed a normal distribution if
the histogram curve looks non-normal, providing the p-value of the goodness-offit statistic shows the distribution is a normal Gaussian bell curve p > 0.05.
CHAPTER 3
SYSTEM DESIGN & ARCHITECTURE
The requirements of this design and project was to take a distribution of data,
analyze the data for a normal Gaussian bell curve distribution, perform outlier
analysis by some means and then indentify potential outliers within that data set.
The data will then be presented to user as a histogram and outliers are and the
bell curve are shown. If the data-set did not come from a normal distribution then
present the data-set in a histogram format so the user can set the lower and
upper fences of the data-set before applying quartile analysis.
This project investigated the prototype of an automatic parametric outlier
detection system design that will utilize multiple existing algorithms like an
Anderson-Darling with Shapiro-Wilk tests in combination with affordable tools
and a database into a user friendly automated system. The goal of this design
was to combine various existing outlier detection techniques and make
necessary improvements so that the small company outlier detection system
used on small data-sets would operate similar to other elaborate expensive
architectures in making better manufacturing decisions. Specific software
requirements included the platform base code that would be C# used to as the
prototype test vehicle. This requirement was changed in the project due to
unforeseen small company infrastructure issues after an agreement with
stakeholders was completed. This requirement change added a significant
amount of coding had to be converted to the C# from the original base code.
32
Database
This project did will not focus on the database design, however the
automated system inputs and future capability learning criteria was and original
requirement that accessed a database. SQL server a widely-used and supported
database can be utilized in this post project design relatively easily because of
the hooks built-in to the .NET framework and the C# code. The basis of
connection and implementation is enabled through the use of the SqlConnection
class in the C# program to initiate a connection to the SQL database containing
parametric data. The SqlConnection class is used in a resource acquisition
statement in the C# language and must have the Open method called; this allows
the query of the database with SqlCommand’s. The usage of this SqlConnection
is a using type resource acquisition statement in the C# language for use in Sql.
The SqlConnection has a constructor that requires a string reference pointing to
the connection string character data. This connection string is often autogenerated by the dialogs in Visual Studio the automated system must include the
SqlConnection code before the use of the SqlCommand object to perform a
database query. By using the using statement construct in the C# programming
language it will lead to a more reliable, stabile C# database project. This
application could not gain access to the database in a remote laptop atmosphere
due to additional small company regulations. Therefore, for demonstration
purposes the project data-sets were hard coded into varying arrays and tested
for functionality against test cases documented in a cited paper.
33
Automatic Data Extraction
The automatic data extraction tool can overlap any parametric data-set to
complete the viewing of any data from more than one viewpoint. The parametric
historical data based on wafer manufactures like ON Semiconductor wafer fab
0.6u products or TSMC Semiconductor wafer fab 0.25u products are easily
implemented within the database selections that are called and analyzed for
outlier detection. For instance, the tool can also view parametric historical data
based on the PIC product identification code. 0.25u products like SRAM
synchronous random access memory that statically does not need to be
periodically refreshed. And DRAM dynamic random access memory that stores
each bit of data in a separate capacitor within an integrated circuit. Both of these
memories store the data in a distinctly different way and therefore have some
differing data based on the architecture of the integrated circuit. However, some
will parametric values will correlate between both memory products built at the
same wafer foundry that will be used to detect fabrication trending within the
wafer foundry. This trending could be as simple as lower resistivity metal lines
that exhibit faster access timing between both devices. These manufacturing
trends between product lines can be used to validate transistor design models.
These transistor design models then help designers match the designs to real
world silicon. The automated parametric outlier detection system will provide an
important statistical data loopback to the original design models for verification.
34
Automatic Statistical Modeling
Automated data modeling and goodness-of-fit tests are contained within a
normalcy test of Anderson-Darling then modified for smaller data-sets. The
Shapiro-Wilk test is a slight modification of the Anderson Darling test, and the
Skewness-Kurtosis tests has better predicting capabilities to larger data-sets and
therefore were not chose over the functionality of the Anderson Darling. The only
test implemented in this project was Anderson-Darling test for the null
hypothesis. Through this test the user can gain a greater confidence that the
data-set being analyzed is from a normal distribution and will provide greater
confidence in to the outlier analysis being performed will be correct.
This project implemented a quartile analysis of the normalized data identified
by the Anderson-Darling test described throughout this document as the basic
default operation. The implementation followed a screening for the bottom 12.5%
of the data or visually the left side of the bell curve. Then the tool screens for
outliers on the upper 12.5% beginning at 87.5% of the data or visually the right
side of the bell curve. The histogram is then configured to show the distribution
visually with the outlier data-set values removed The toolset has other statistical
modeling abilities including standard deviation, inter quartile analysis as options
for the user to modify if needed but the default automated operation is inner 75%
percentile analysis.
35
.
Normal Distribution Curve
If the data-set is determined by Anderson-Darling analysis to be bimodal or a
non-normal distribution, it will be shown in a histogram format requiring user
intervention. Bimodal distributions are a continuous probability distribution with
two different modes; these modes appear as distinct peaks in the distribution
diagram. The user is required to pick the upper and lower fence of the data to be
analyzed for quartile analysis, in the case of a bimodal distribution the user will
have to pick one of the distinct peaks. This is the only non-automated portion of
the project that requires user intervention by setting distribution of the data-set.
Once the toolset knows what data in the data-set needs be analyzed for outliers
the toolset will perform the modeling on the data-set to locate potential outliers
36
Future enhancements to the outlier detection system for non-normal data sets
will utilize machine learning. All user defined changes to data-sets will be
captured into the database. These user defined changes will specifically capture
the user, the data type, data-set, lower, and upper fence chose. Then based on
the decisions made by engineers on the automated non-normal distributions, a
learning algorithm can start to be developed. However this project and tool-set is
being launched as a beta version prototype model for evaluation. Based on the
evaluation of the beta version additional requirements may be realized and
required in the released version. The small company requirements for this toolset
deem the immediate release of a beta version.
The effect of this software development methodology will be realized in the
maintenance stage of this design. A well documented fact that the maintenance
stage is the longest and the most expensive to the software project. Every
attempt will have to be made to understand the costs involved with future
enhancements and requirements. Because releasing an incomplete version of
the software for user evaluation, the need for future requirements and
enhancements will certainly be a possibility as the known problem domain may
still be unknown?
37
Bimodal Distribution Curve
Testing and Prototype Evaluation
The outlier detection system design implementation utilized small data-sets
as the test cases used on the automated system implemented in C#. With the
exception of two data-sets described in the Asphalt Study cases 195 and 196.(
Holsinger, R., Fisher, A., and Spellerberg, P 2005) These data-sets were listed in
the document cited within paper as Precision Estimates for AASHTO Test
Method T308 and the Test Methods for Performance-Graded Asphalt Binder in
AASHTO Specification M320. For specific testing of the Anderson-Darling
implementation, there are extensively documented test cases in the appendices
of this document. The use of a non-automated Excel spread sheet with
Anderson-Darling test statistics A and A2 calculations for the Anderson-Darling
Normality Test Calculator. (Otto, K.2005)
38
Identified data-sets are random representations of real data-sets but still are
limited to utilize only a specific single device parameter within these randomly
created data-sets, with the exception of the asphalt study case 195 and 196.
These data-sets were not pre-loaded into a database, but instead hard coded
into the application for ease of use and testing criterion for initial prototype
correct functionality. Outlier analysis is then performed on a default inner 75%
percentile analysis. The inner 75% percentile involves indentifying data values
that fall with the first 12.5% of the data and the last 12.5% of the data from the
mean as outliers with the normal distribution. The application can also filter for
detecting outliers and extreme values based on interquartile ranges.
These equations are in the form Q1 = 25% quartile, Q3 = 75% quartile,
IQR = Interquartile Range difference between Q1 and Q3
OF = Outlier Factor
EVF = Extreme Value Factor
Mild value detection:
Outliers: Q3 + OF*IQR < x <= Q3 + EVF*IQR
Outliers Q1 - EVF*IQR <= x < Q1 - OF*IQR
Extreme value detection:
x > Q3 + EVF*IQR
and
x < Q1 - EVF*IQR
39
Also presented in a histogram format of the standard sigma deviation from the
mean, the user can pick the number sigma or standard deviations from the mean
data of the histogram. The application also presents the data in an all data within
the data-set pass histogram format. This mode is used if the user needs to see a
histogram format of all data points within the data-set. The all data format
histogram is presented to the user with the ability to manually set the fences for
upper and lower limits. This functionality allows the user to manually pick the
outliers based on an upper and lower fence choice. This functionality can
manipulate any data-set outlier to the user requirements and will be tracked by
user and data-set and saved in the data-base to become part of a machine
learning version at a later time.
Analysis of Results
The Anderson-Darling test for normality is one of three general normality
tests designed to detect all departures from normality, and is sometimes touted
as the most powerful test available. The Anderson-Darling test makes use of the
specific distribution in calculating critical values, because the critical values do
depend on the specific distribution being tested. This has the advantage of
allowing a more sensitive test and the disadvantage that critical values must be
calculated for each distribution or data-set that it is tested against.
40
The Anderson-Darling test while having excellent theoretical properties
does appear to have limitations when applied to real world data. The AndersonDarling mathematical computation is severely affected by ties in the data due to
poor precision. Ties are data points or values that match exactly other data
points within the data-set. When a significant number of ties exist, the AndersonDarling will frequently reject the data as non-normal, regardless of how well the
data fits the normal distribution. The prototype application will still present the
data in a histogram format for the user to custom set the upper and lower limits of
the data-set. After a short analysis the user will notice the data-set is normally
distributed due to the histogram being presented by the application. The user can
still perform the 75% inner percentile analysis, or any other available in the
application. While potentially creating a need for the user to evaluate more datasets that have multiple ties of the same data, it by no means detracts from the
functionality of the application, validation to the original requirements were still
met. Additional conditional statements were hardcoded into the application that
dealt with the mathematical calculation of Anderson-Darling A and A2 test statistic
by checking for a unique divide by zero calculation. When computing these
values the standard deviation tested to be greater than zero or the application will
set the value to positive infinity. The calculations for the value of AndersonDarling A and A2 test statistic by dividing by the standard deviation the resulting
calculation produces a application crash. Similar to the multiple ties limitation of
the equation more serious mathematical errors occur when all the data values
are the same resulting in a standard deviation of zero.
41
Adjustments were made to the base Anderson-Darling equation and
results of the check for normality are documented within formal papers that
adjust the base equation for a specific fit. These papers take the equation and
modify the integral then verify the functionality by comparison by the use of some
other normality test, like of Chi-Squares goodness of fit tests for instance. After
equation modification the original limitation is still left intact with the inability to
deal with multiple ties within the data-set, but the equation was better suited for
small data-set analysis. (Rahman, M., Pearson, M., Heien, C.2006)
While not completed in this project the added functionality to cluster the
data by using a hybrid format with the standard deviation to overcome the
limitation of Anderson-Darling on data ties.( Jain, A., Murty, M., and Flynn, P
1999) The K-Means Clustering algorithm based on data clustering methodology
was developed by J. MacQueen in 1967 and then again by J.A.Hartigan and
M.A. Wong in 1975. It is an algorithm to classify or to group objects based on
attributes or features into a variable. The number of the group, K is a positive
integer number and the grouping is done by minimizing the sum of squares of
distances between data and the corresponding cluster called a centroid. The Kmean clustering only purpose is to classify the data into groups or “clusters” of
data. (Technomo, K 2006)
42
Clustering Algorithm Flowchart
Then by using these clusters created in the K means clustering groups by
passing in a hybrid format to verify against the standard deviation of the normal
distribution. The data can then be reviewed to verify if the clusters of data are
within the standard deviation of the distribution. If these clusters of data reside
within the standard deviation set by the application usually a value of three
standard deviations. Then the assumption can be made that the data within the
distribution is from a normal distribution.
43
The data must have had too many ties within the data-set and thus will not
need user intervention to perform the task automatically to perform a 75% inner
percentile analysis for outliers. The K means algorithm can be used as a
automated function when the Anderson-Darling p-value test fails when performed
on a data-set. If the K-means clustering test also fails then the data-set is still
presented to the user in histogram format requiring user intervention.
Another possibility is the k-sample Anderson-Darling test for removing
large number non-automated data-set histograms from this prototype. While the
potential for this test being a hybrid solution of the k-means clustering described
previously with the standard Anderson-Darling standard normalcy test. The
details and the equation of the test are described below and are not trivial by any
means. This test for normal distributions is a nonparametric statistical procedure
that tests a hypothesis. The hypothesis is that the populations from two or more
groups of data were analyzed are identical. In this equation the test data can be
either presented in groups or ungrouped because the Anderson-Darling k-sample
test will verify if a grouped data-set can modeled as ungrouped data-set, grouped
data-sets will look like bimodal distributions and this fail this test.
The k-sample Anderson-Darling test is shown below, note that variable L
is less than n if there are tied data-point observations
44
hj = the number of data values in the combined data-set sample equal to zj
Hj = the number of data-values in the combined data-set sample less than zj plus
one half the number of values in the combined data-set sample equal to zj
Fij = the number of values in the i-th group which are less than zj plus one half
the number of values in this group which are equal to zj where k is the number of
groups, ni is the number of hits in group i, xij is the j-th hits in the ith group
z1, z2 ..., zL are the distinct values in the combined data set ordered from
smallest to largest Implemented in pseudo code for k-means Anderson-Darling to
appreciate the complexity of this algorithm
ADK = (n-1)/(n**2*(k-1))*SUM[i=1][k][(1/n(i))*SUM1]
SUM1 = SUM]j=1][L}[h(j)*NUM/DEN]
NUM = (n*F(ij)-n(i)*H(j))**2
DEM = H(j)(n-H(j))-n*h(j)/4
This prototype application validated all requirements in a loosely written
requirements document. However after data analysis the chance that the
Anderson-Darling test for normalcy would fail normally distributed data-sets
became an unforeseen issue while meeting application requirements. The only
issue was the additional user intervention required to process the data-sets. If
additional time were available at the time of this write-up, a more elegant
algorithm would be implemented and tested for a more automated version of the
prototype. This elegant algorithm envisioned would certainly have to implement
some portion of a clustering methodology like k-means, or a hybrid of k-means
clustering.
45
CHAPTER 4
CONCLUSIONS
The Anderson-Darling test for normality is one of three general normality
tests designed to detect all departures from normality, and is sometimes touted
as the most powerful test available. The Anderson-Darling test, while having
excellent theoretical properties, does have limitations when applied to real world
data when a data-point in the set ties or matches another data-point .When this
happens the Anderson-Darling mathematical computation is severely affected by
due to poor precision analysis of the data-set tails. When a significant number of
ties exist, the Anderson-Darling will frequently reject the data as non-normal,
regardless of how well the data fits the normal distribution.
Validation to the original requirements was completed at project end to the
original requirements. Although the requirements phase of this project evidenced
a lack of appreciation for the true complexity of the project.. As certain aspects of
the functionality of the Anderson-Darling test for normality within the system
automation features from the application to be less than optimal and require
more user intervention than anticipated. This lack of appreciation for the
complexity of evaluating a normal Gaussian distribution on every data-set
presented to the prototype system is simple in theory, but became a large part of
this project.
46
The problem domain was documented as a set of requirements by
managerial stakeholders to produce a robust outlier detection system. The
additional requirement of detection within small data-sets was identified in the
requirements
documentation.
This
requirement
evaluated
small
dataset
distribution mathematical equations for normalcy through significant hours of
research. The requirements included evaluation of the dataset into a histogram
format on non normal distributions. Then after user intervention a 75% percentile
analysis was always a requirement, but as the project entered the testing phase
this default situation starting occurring more often than what I felt was desired in
an automated sub-system although it still met validation requirements.
During the design phase a significant amount of additional methods that
became more complex to implement as important algorithms of the system. This
became especially true when passing data back and forth from complex
mathematical equations within numerous arrays’s of the data as it was calculated
and maintaining these values. The use of case diagrams would have helped in
the design phase as the design was mainly. This of course equated to additional
time being required within the test phase because of the complexity of the
Anderson-Darling mathematical equation to check for normal distributions. These
deterministic equations easily produced divide by zero errors that were not taken
into account during the design phase, and the number of ties within data-sets
cause the tool to have false normalcy failures.
47
Post prototype validation analysis of the requirements revealed minor
departures of the software system within certain data-sets, this analysis revealed
potential non-automated limitations of the software system.
Therefore additional research was started in the final testing stage of the
parametric outlier design to solve these non-automated issues. Clustering
methodologies created in the K means clustering groups could be passed in a
hybrid format to verify against the standard deviation of the normal distribution.
This data can then be reviewed to verify if the clusters of data are within the
standard deviation of the distribution. If these clusters of data reside within the
standard deviation set by the application usually a value of three standard
deviations. Then the assumption can be made that the data within the distribution
is from a normal distribution.
The other promising methodology would be the use of the k-sample
Anderson-Darling test a nonparametric statistical procedure that tests the
hypothesis that the data from a data-set of two or more groups of data is identical
and therefore from a normalized distribution. The k-means Anderson-Darling
appears to be the most promising solution, however the mathematics required to
implement, verify correct operation, and validate to requirements would be a
significant task.
48
BIBLOGRAPHY
Abrain, A.: SWEBOK, Guide to Software Engineering Body of Knowledge, IEEE
Computer Society, 2004.
Bhattachatya, S., and Chatterjee, A. Optimized Wafer-Probe and Assembled
Package Test Design for Analog Circuits. ACM: Transactions on Design
Automation of Electronic Systems (TODAES), Volume 10 Issue 2, April
2005.
CMMI - Capability Maturity Model Integration Version 1.3, Nov 2010
Dy, J, and Brodley,C.. Feature Selection for Unsupervised Learning. JMLR.org:
The Journal of Machine Learning Research, Volume 5, December 2004.
Gama,J., Rodrigues,P., and Sebastiao,R.: Evaluating Algorithms that Learn from
Data Streams. ACM: SAC '09: Proceedings of the 2009 ACM symposium on
Applied Computing, March 2009.
Gattiker, A.: Using Test Data to Improve IC Quality and Yield. IEEE Press:
ICCAD '08: IEEE/ACM International Conference on Computer-Aided
Design, November, 2008
Jain, A., Murty, M., and Flynn, P. Data clustering: a review. ACM Computing
Surveys: Volume 31, Issue 3, Pages: 264 – 323, September 1999.
Jebara, T. Wang, J., and Chang, S.: Graph Construction and b-Matching for
semi-Supervised Learning. ACM: Proceedings of the 17th ACM international
conference on Multimedia, October 2009.
Holsinger, R., Fisher, A., and Spellerberg, P., Precision Estimates for AASHTO
Test Method T308 and the Test Methods for Performance-Graded Asphalt
Binder in AASHTO Specification M320. National Cooperative Highway
Research Program, AASHTO Materials Reference Laboratory,
Gaithersburg, Maryland, February, 2005
49
Marinissen, E. Vermeulen, B., Madge, R. Kessler, M., and Muller, M.: Creating
Value through Test: DATE '03: Proceedings of the conference on Design,
Automation and Test in Europe - Volume 1, March 2003.
Moran, D., Dooling, D., Wilkins, T., Williams, R., and Ditlow, G.:Integrated
Manufacturing and Development (IMaD). ACM: Supercomputing '99:
Proceedings of the 1999 ACM/IEEE conference on Supercomputing, Jan
1999.
Otto, K.: Anderson-Darling Normality Test Calculator,
http://www.kevinotto.com/RSS/templates/Anderson-Darling Normality Test
Calculator.xls, Version 6.0, 2005.
Perez, R., Lilkendey, J., and Koh, S, Machine Learning for a Dynamic
manufacturing Environment. ACM: SIGICE Bulletin, Volume 19, Issue 3 ,
February 1994.
Pyzdek, T.: Histograms: When to Use Your Eyeballs, When Not, Looks can be
deceiving. Quality Digest Magazine, Volume 31, Issue 3, March 2011.
Rahman, M., Pearson, L., and Heien, H.: A Modified Anderson-Darling Test for
Uniformity. BULLETIN of the Malaysian Mathematical Sciences Society (2)
29(1) (2006), 11–16
Romeu J.: Selected Topics in Reliability and Assurance Related Technologies,
Reliability Analysis Center, Volume 10, Number 5, May 2005
Saab, K., Ben-Hamida, N., and Kaminska, B.: Parametric Fault Simulation and
Test Vector Generation. ACM: Proceedings of the conference on Design,
automation and test in Europe, January 2000.
Singh, A., Mani, M., and Orshansky, M.: Statistical Technology Mapping for
Parametric Yield. IEEE Computer Society: ICCAD '05: Proceedings of the
2005 IEEE/ACM International conference on Computer-aided design, May
2005.
50
Technomo, K.: K-Means Clustering Tutorial, pdf file 2006 with reference to
website, http://people.revoledu.com/kardi/tutorial/kMean/index.html, 2006.
Veelenturf, K.: The Road to better Reliability and Yield Embedded DfM tools.
ACM: Proceedings of the conference on Design, automation and test in
Europe, January 2000.
Wang, W., and Orshansky, M.: Robust Estimation of Parametric Yield under
Limited Descriptions of Uncertainty. ACM: ICCAD '06: Proceedings of the
2006 IEEE/ACM international conference on Computer-aided design,
November 2006.
Walfish, S.: A Review of Statistical Outlier Methods: PharmTech - A Review of
Statistical Outlier Methods, Pharmaceutical Technology, August 2006
51
APPENDIX A
Data Set Parametric Outlier Dataset 1
52
APPENDIX B
Data Set Parametric Outlier Dataset 2
53
APPENDIX C
Data Set Parametric Outlier Dataset 3
54
APPENDIX D
Data Set Parametric Outlier Dataset 4
55
APPENDIX E
Data Set Parametric Outlier Dataset 5
56
APPENDIX F
Data Set Parametric Outlier Dataset 6
57
APPENDIX G
Data Set Parametric Outlier Dataset 7
58
APPENDIX H
Data Set Parametric Outlier Dataset 8
59
APPENDIX I
Data Set Performance Graded Binder Sample 195
60
APPENDIX J
Data Set Performance Graded Binder Sample 196