Improving Process Capability Data Access for Design
by
James A. Hanson
B.S. Mechanical Engineering, Cum Laude
University of Maryland College Park, 1999
Submitted to the Department of Mechanical Engineering
in Partial Fulfillment of the Requirements for the Degree of
MASTER OF SCIENCE IN MECHANICAL ENGINEERING
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
May, 2001
© Massachusetts Institute of Technology, 2001.
All rights reserved.
Signature of Author
Department of Mechanical Engineering
May 22, 2001
Certified by
Anna Thornton, Thesis Supervisor
Assistant Professor of Mechanical Engineering
Accepted by
Ain Sonin, Thesis Reader, Chairman of the Graduate Committee
Department of Mechanical Engineering
Improving Process Capability Data Access for Design
by
James A. Hanson
Submitted to the Department of Mechanical Engineering
on May 22, 2001 in partial fulfillment of the requirements for the
Degree of Master of Science in Mechanical Engineering
Abstract
Process capability databases are used to store data that characterizes manufacturing operations.
The manufacturing function uses process capability data (PCD) to monitor production and
identify out-of-control processes. The design function uses PCD to allocate tolerances, evaluate
manufacturability and assess product robustness. Academic literature predominantly assumes
design and manufacturing enjoy ready access to PCD. However, the design function actually
faces major barriers to PCD access, including lack of design-focused indexing and inconsistent
structures between PCDBs.
A survey was circulated to industrial enterprises, and the results identified two problems related to poor PCDB population. First, there are no formalized approaches for managing the problem of poor database population. Second, poor database population is most harmful to design when new designs are investigated, not when old designs are revisited. The distinction is important because PCD can provide important information when new designs are created.
This thesis addresses the indexing and multiple-database barriers by presenting a hybrid data re-indexing system. The system can employ design-driven attributes to index data, making the data
intuitively accessible to design. Additionally, the system represents data from different data set
structures in a single, attribute-based Euclidean space. This feature allows a user to query
multiple data sets from a single interface.
With data represented in Euclidean space, poor database population is managed by using multi-axis regression techniques to generate estimates for unknown data values. This thesis presents a
unique interactive multi-axis regression method. This method creates estimates for unknown
data values by minimizing sum-of-square error along each Euclidean axis and across multiple
axes.
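The estimation idea described above, per-axis least-squares fits whose predictions are then pooled across axes, can be sketched as follows. This is a minimal illustrative sketch, not the thesis implementation: the function names, the single linear fit per axis, and the inverse sum-of-square-error weighting are assumptions introduced for illustration.

```python
import numpy as np

def axis_estimates(points, values, target):
    """For each Euclidean axis, fit a least-squares line through the
    populated points that match the target index on every other axis,
    then predict the target's value.  Returns (estimate, sse) pairs."""
    points = np.asarray(points, dtype=float)
    values = np.asarray(values, dtype=float)
    target = np.asarray(target, dtype=float)
    results = []
    for axis in range(points.shape[1]):
        others = [a for a in range(points.shape[1]) if a != axis]
        mask = np.all(points[:, others] == target[others], axis=1)
        x, y = points[mask, axis], values[mask]
        if len(x) < 2:
            continue  # too few populated points along this axis to fit
        slope, intercept = np.polyfit(x, y, 1)
        sse = float(np.sum((slope * x + intercept - y) ** 2))
        results.append((slope * target[axis] + intercept, sse))
    return results

def combine(results, eps=1e-9):
    """Pool the per-axis estimates, weighting each by inverse SSE so
    that axes with lower regression error dominate the combination."""
    w = np.array([1.0 / (sse + eps) for _, sse in results])
    e = np.array([est for est, _ in results])
    return float(np.sum(w * e) / np.sum(w))
```

For example, with populated values 0, 2, and 4 at coordinates 0, 1, and 2 along one axis, the combined estimate at coordinate 3 is 6.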
Thesis Supervisor: Anna Thornton
Assistant Professor, Mechanical Engineering
Acknowledgements
This thesis project has allowed me to explore a rewarding subject, and investigate new strategies
for solving a challenging set of problems. I feel very fortunate to have had this opportunity, and
I owe a debt of gratitude to several individuals who helped make this project a success.
First, I want to thank my thesis advisor, Professor Anna Thornton, for her academic and professional guidance during my time at MIT. Professor Thornton's insistence upon unquestionably professional character and performance was essential to my success. Her eye for detail and compassionate encouragement led me to greater levels of confidence and comprehension throughout this project.
I want to thank my family for their support and patience - not only while I toiled in the lab, but
also when I tried in vain to describe my project in 200 words or less. My parents John and Judy
and my brother Joe were consistently supportive and interested in my progress, and I am very
grateful for their enthusiasm. Accolades should not be limited to my immediate family; my
extended family is also very special to me, but unfortunately they are too numerous to list by
name on this page.
Professional associates have been essential to this project. In particular, kudos to Chris Hull of
Boeing for his enthusiasm during and after the 2000 KC Symposium. I want to acknowledge the
generous financial support of the National Science Foundation, MIT's Center for Innovation in
Product Development, and Analytics Operations Engineering, Inc.
My friends, in Boston and elsewhere, always provide generous emotional support when I need it
most. In particular, I offer my sincere appreciation to Lori and Christy. And I am thrilled to be
able to call my labmates friends.
Finally, I give very special thanks to Melissa, who has given me more support than I can ever
hope to express in words.
Table of Contents
ABSTRACT .......................................................... 3
ACKNOWLEDGEMENTS .................................................. 5
LIST OF FIGURES .................................................. 11
ACRONYMS ......................................................... 15
GLOSSARY ......................................................... 17
1 INTRODUCTION ................................................... 23
  1.1 BACKGROUND ................................................. 23
  1.2 MOTIVATION ................................................. 23
  1.3 LITERATURE REVIEW .......................................... 25
    1.3.1 Current state of PCDBs ................................. 25
    1.3.2 Human capability for managing uncertainty .............. 26
    1.3.3 Data analysis and knowledge discovery .................. 26
    1.3.4 Database-specific user issues .......................... 27
  1.4 THESIS OBJECTIVES .......................................... 28
    1.4.1 Support for design functions ........................... 28
    1.4.2 Strategies for unpopulated data indices ................ 29
  1.5 THESIS DATA ................................................ 29
    1.5.1 Similarities ........................................... 30
    1.5.2 Differences ............................................ 31
  1.6 OUTLINE .................................................... 31
2 INDUSTRY SURVEY ................................................ 33
  2.1 SURVEY QUESTIONS AND RESPONSES ............................. 33
    2.1.1 PCDB usage ............................................. 33
    2.1.2 Causes and effects of unpopulated PCDB indices ......... 34
    2.1.3 Indexing strategies .................................... 38
    2.1.4 PCDB population ........................................ 39
  2.2 SURVEY ANALYSIS ............................................ 40
    2.2.1 Causes and effects of unpopulated PCD .................. 41
    2.2.2 Indexing strategies .................................... 43
    2.2.3 PCDB population levels ................................. 43
    2.2.4 Remarks ................................................ 44
  2.3 RECOMMENDED ACTION ......................................... 44
    2.3.1 Indexing scheme improvements ........................... 45
    2.3.2 Improved management of unpopulated indices ............. 45
  2.4 CONCLUSION ................................................. 47
3 CURRENT STATE OF DATABASE IMPLEMENTATIONS ...................... 49
  3.1 PCDB-SPECIFIC DBMS NEEDS ................................... 49
  3.2 DBMS ARCHITECTURES FOR PCDBS ............................... 51
    3.2.1 Hierarchy structures ................................... 51
    3.2.2 Relational structures .................................. 56
    3.2.3 Object-oriented structures ............................. 60
    3.2.4 Hybrid structures ...................................... 63
  3.3 COMMON PROBLEMS ACROSS DBMS ARCHITECTURES .................. 63
    3.3.1 Inherent dissimilarity of DBMSs ........................ 63
    3.3.2 Error management and infeasible indices ................ 64
    3.3.3 Schema conversion ...................................... 65
  3.4 THE MISSING DATA PROBLEM ................................... 65
  3.5 CONCLUSION: RELEVANCE TO PCDB STRUCTURES ................... 66
4 ATTRIBUTE-BASED INDEXING ....................................... 67
  4.1 BASIC CONCEPTS ............................................. 67
    4.1.1 Attribute coordinate hyperspace representation ......... 67
    4.1.2 Representation of items with multiple values for an attribute ... 69
    4.1.3 Representation of items with no values for an attribute ... 70
    4.1.4 Rationale for representation ........................... 72
  4.2 ELEMENTS OF THE INDEXING SYSTEM ............................ 72
    4.2.1 Components ............................................. 74
    4.2.2 Processes .............................................. 77
  4.3 ATTRIBUTE LEARNING ......................................... 82
    4.3.1 Motivation ............................................. 82
    4.3.2 Human-machine learning ................................. 83
    4.3.3 Precise roles for human intervention in the system ..... 86
  4.4 CONCLUSION ................................................. 87
5 STRATEGIES FOR MISSING DATA .................................... 89
  5.1 JUSTIFICATION OF METHODS ................................... 89
    5.1.1 Index-based similarity ................................. 90
    5.1.2 Data-based similarity .................................. 91
    5.1.3 Regression as a prediction system ...................... 92
    5.1.4 Processes for combining regression results ............. 95
  5.2 PREPARATION FOR REGRESSION ................................. 96
  5.3 REGRESSION PROCESS ......................................... 97
  5.4 REGRESSION COMBINATION ..................................... 99
  5.5 CONCLUSION ................................................ 100
6 DEMONSTRATION OF THE TECHNOLOGY ............................... 101
  6.1 SYSTEM ARCHITECTURE ....................................... 101
    6.1.1 Client-server interaction ............................. 101
    6.1.2 Intra-server processes ................................ 102
  6.2 SERVER SOFTWARE ........................................... 103
    6.2.1 Connection process and session management ............. 103
    6.2.2 Screens ............................................... 104
    6.2.3 Comments on user experience ........................... 113
  6.3 ADMINISTRATIVE SOFTWARE ................................... 113
    6.3.1 Search function ....................................... 114
    6.3.2 Attribute learning and index expansion functions ...... 115
  6.4 SOFTWARE ENHANCEMENTS ..................................... 118
  6.5 CONCLUSION ................................................ 121
7 CONCLUSION .................................................... 123
  7.1 CONTRIBUTIONS ............................................. 123
  7.2 FURTHER RESEARCH .......................................... 124
REFERENCES ...................................................... 127
APPENDIX A: PCDB SURVEY ......................................... 131
APPENDIX B: SURVEY RESPONSES .................................... 137
List of Figures
Figure 1.1: Some attributes of baseball card auction data, found in auction title ... 30
Figure 2.1: PCDB implementation among Symposium attendees ... 34
Figure 2.2: Design tasks resulting in unpopulated PCDB indices ... 34
Figure 2.3: Strong influences upon PCDB index population ... 35
Figure 2.4: Methods for managing lack of PCD at organizations with PCDBs ... 36
Figure 2.5: Methods for managing lack of PCD at organizations without PCDBs ... 37
Figure 2.6: Designers' reactions to unpopulated PCDB indices ... 38
Figure 2.7: Index scheme parameters in use ... 38
Figure 2.8: Index population levels in PCDBs ... 39
Figure 2.9: Distribution of unpopulated data in PCDBs ... 40
Figure 2.10: Percentage of queries returning unpopulated indices ... 40
Figure 3.1: Generic hierarchical structure with indices ... 51
Figure 3.2: Determining child nodes of node 1.1 ... 52
Figure 3.3: Identical attributes in multiple locations and levels of hierarchy structure ... 53
Figure 3.4: Mining hierarchy sections for "Green" and "Orange" ... 53
Figure 3.5: Identically named attributes may not represent identical concepts ... 54
Figure 3.6: No knowledge of nearest-neighbor attribute values ... 54
Figure 3.7: Removal of attributes from hierarchical index ... 55
Figure 3.8: Addition of attributes to hierarchical DBMS ... 56
Figure 3.9: Relational representation of items ... 56
Figure 3.10: Adding new attribute to relational table ... 57
Figure 3.11: Process of creating new table using pointers to two tables ... 58
Figure 3.12: Adding an attribute to a relational table: pointers automatically update ... 58
Figure 3.13: Relational difficulties with attribute value similarities ... 60
Figure 3.14: Object-oriented DBMS supporting attribute and method inheritance ... 61
Figure 3.15: Polymorphism of methods: "+" adds numbers, concatenates strings ... 61
Figure 3.16: Polymorphism of attributes and associated ambiguities ... 62
Figure 3.17: Failure of declarative operations due to encapsulation ... 63
Figure 3.18: Hybrid database structure requiring one choice each from A, B and C ... 64
Figure 4.1: Index space in three dimensions ... 68
Figure 4.2: Attribute table assigning values to axis coordinates ... 69
Figure 4.3: Representing items with multiple values for a single attribute ... 70
Figure 4.4: Different attribute axes for non-candy items and candy items, respectively ... 71
Figure 4.5: Attributes translated into index code ... 73
Figure 4.6: SVC operator in substring ... 75
Figure 4.7: MVC operator in substring ... 75
Figure 4.8: A rules file, with a sample invalid index combination ... 77
Figure 4.9: Real-time specification: Selection of "disco ball" eliminates "plastic" ... 77
Figure 4.10: Query structure and sample returned index codes ... 78
Figure 4.11: Use of wildcard for data characterization ... 79
Figure 4.12: Result of wildcard search ... 79
Figure 4.13: Query codes with wildcard (left) and omission (right) for attribute 2 ... 80
Figure 4.14: Alteration of attribute table ... 81
Figure 4.15: Update of dataset using altered attribute table ... 81
Figure 4.16: Portion of an ignore file, listed alphabetically ... 84
Figure 4.17: Recognition of words unknown to system ... 84
Figure 4.18: Presentation of new words to user for assistance in categorization ... 85
Figure 4.19: Illustration of learning process ... 85
Figure 4.20: Theoretically ambiguous text descriptor and nature of the ambiguity ... 87
Figure 5.1: Z value calculation for two index codes ... 90
Figure 5.2: Failure of data-based similarity measures when data is missing ... 92
Figure 5.3: Linear regression analysis in one dimension ... 93
Figure 5.4: Illustration of lower and higher regression errors ... 95
Figure 5.5: Translation of coordinate axes to unpopulated data point ... 96
Figure 5.6: Location of populated data points on translated x' axis ... 97
Figure 5.7: Axis for which ns = 4 ... 99
Figure 6.1: Client/server system configuration ... 101
Figure 6.2: Data flow across client/server system ... 102
Figure 6.3: Data flow within the server location ... 103
Figure 6.4: Title page viewed on client machine ... 105
Figure 6.5: Menu-driven search screen ... 106
Figure 6.6: Error screen ... 108
Figure 6.7: HTML table of card price frequencies ... 109
Figure 6.8: Selection screen for estimation function ... 111
Figure 6.9: Estimate results screen ... 112
Figure 6.10: Default window for administrative software ... 115
Figure 6.11: Attribute learning option window ... 116
Figure 6.12: Rules file before and after adding wildcards ... 118
Figure 6.13: HTML table bar chart ... 120
Figure 6.14: Graphical bar chart ... 120
Acronyms
DBMS = Database Management System
GIF = Graphics Interchange Format
HTML = Hypertext Markup Language
JPEG = Joint Photographic Experts Group
KC = Key Characteristic
MVC = Multiple Value Concatenation
PCD = Process Capability Data
PCDB = Process Capability Database
PSA = Professional Sports Authenticator
SPC = Statistical Process Control
SQL = Structured Query Language
SVC = Single Value Concatenation
URL = Uniform Resource Locator
VRM = Variation Risk Management
Glossary
* Aggregate data = "data provided when ... the details of one or more ... parameters are not known." (Tata, 1999)
* Attribute = characteristic of a datum, such as material or color.
* Automated system = system capable of accomplishing operations and goals without human guidance or other assistance.
* Basic table = list of all attributes and corresponding values in the indexing system.
* Child node = the subordinate among two connected nodes on different levels of a hierarchical tree. "Every node has a finite set (possibly empty) of nodes which are called immediate successors or children of that node." (Alagic, 1986)
* Class = a classification of similar objects. "A class characterizes one or more objects that have common methods, variables, and relationships"; "A class can be thought of as the 'rubber stamp' from which individual objects are created." (Burleson, 1999)
* Client = computer submitting requests to, and receiving data from, a server.
* Concatenation = act of placing two or more data strings in sequence as a single string.
* Confidence interval* = "an interval of plausible values for the parameter being estimated" (Devore, 1987)
* Conversion = process of exporting the information in an indexing scheme to a differently structured indexing scheme.
* Cookie = small, persistent data file on a client computer that identifies user information to a server.
* Curse of dimensionality = colloquial term for the observation that a linear increase in data dimensionality can exponentially increase the complexity of data analysis.
* Data independence = data model in which "data and process are deliberately independent"; "'ad-hoc' data access" (Burleson, 1999)
* Data mining = "the confluence of machine learning and the performance emphasis of database technology"; "discovery of rules embedded in massive data." (Agrawal et al., 1993)
* Database Management System (DBMS) = the structure, actions, and constraints imposed on data storage and access. "A number of models, each of which has a collection of conceptual objects, actions on those objects, and constraints under which those actions are performed." (Alagic, 1986)
* Domain knowledge = knowledge of a specific field of inquiry that goes beyond the information in a data set. Domain knowledge aids in discovery of rules and relations from a data set.
* Encapsulation = process that "gathers the data and methods of an object and puts them into a package, creating a well-defined boundary around the object." (Burleson, 1999)
* Engine = core processing element of a software package.
* Euclidean = characterized by a set of mutually orthogonal coordinate axes.
* Expansion = process of adding new criteria to an indexing scheme.
* Field = column of a relational data table.
* Goodness of fit = extent to which an equation matches the data it describes.
* Hierarchy = "a method of organizing data into descending one-to-many relationships, with each level having a higher precedence than those below it." (Burleson, 1999)
* Hyperspace/hypervolume = Euclidean space characterized by a large number of dimensions, typically more than four.
* Ignore file = table of words or other data properties that are not used for indexing.
* Index = "set of choices for each parameter detailing data desired. The index is the label for PCD in the PCDB." (Tata, 1999)
* Indexing scheme/system = model by which data is described and stored.
* Inheritance = assumption of methods and attributes belonging to parent elements.
* Interface = means by which a user interacts with a system.
* Invalid = a description of data indices that do not correspond with any possible real entity or process.
* Key Characteristic (KC)* = label "used to indicate where excess variation will most significantly affect product quality and what product features and tolerances require special attention from manufacturing" (Lee and Thornton, 1996)
* Machine learning = using computerized data analysis techniques to obtain new, useful rules and knowledge from a dataset.
* Method = data access or manipulation procedure; "behavior of the data" (Burleson, 1999)
* Nearest-neighbor = the data index located the shortest distance from a reference index, with distance measures dictated by the data structure.
* Node = element of a hierarchical tree structure.
* Null = taking no value or containing no information.
* Object = self-contained data package containing private data values, private procedures, and a public interface.
* Object-oriented = compatible with object-structured data and methods.
* Observation = sample; the outcome of a stochastic process.
* Parent node = the dominant node among two connected nodes on different levels of a hierarchical tree. "Every node, except one, has a unique node which is called its immediate predecessor or parent node." (Alagic, 1986)
* Pointer = data link between tables in a relational database.
* Polymorphism = "the ability of different objects to receive the same message and respond in different ways." (Burleson, 1999)
* Process capability* = "Process capability is a product process's ability to produce products within the desired expectations of customers." (Gaither, p. 713)
* Process Capability Data (PCD) = "the expected and obtained standard deviations and mean shifts for a feature produced by a particular process and made of a particular material" (Tata, 1999)
* Process Capability Database (PCDB) = "includes target and actual tolerances for particular process, material, and feature combinations" (Tata, 1999)
* Prediction = assigning a speculative value to some entity when there is insufficient information to determine the value with complete confidence.
* Professional Sports Authenticator (PSA) = pay service specializing in the objective "quality" grading of collectibles.
* Projection = selection of a subset of attributes from a data object or table.
* Proximity = distance between two data indices, with distance measures dictated by the data structure.
* Query = request to access, modify or delete data in a data set.
* Re-indexing = reevaluation of the characteristics of a datum, followed by overwriting the datum's index with an updated one.
* Real-time = description of a process that executes on an immediate or nearly-immediate time scale relative to human interface activity.
* Record = row of a table-type database structure, describing one data instance.
* Regression = creation of a mathematical model to fit data approximately.
* Relational = database structure comprised of tables connected by time-dependent relations.
* Rules file = table of (in)valid index combinations in a data set.
* Semi-structured = possessing no formal data indexing structure.
* Server = computer or computer network that receives requests from clients and returns processed responses.
* Similarity = measure of resemblance between at least two data indices.
* Statistical Process Control (SPC)* = "The use of control charts" (Gaither, p. 740) "...used to ensure the ongoing quality of the manufacturing process." (Batchelor et al., 1996)
* Structured Query Language (SQL) = an expression vocabulary for submitting data queries.
* Substitution = representation of a datum's characteristics by reference to those of another datum.
* Surrogate = data that is similar to the data for an unpopulated index (Tata, 1999)
* Table = a set of tuples characterized by the same attributes.
* Thread = isolated instance of server variables and processes, created for each connected client.
* Threshold = maximum regression error value for an axis, above which the regression model is discarded as inaccurate.
* Tolerance = "[The] maximum value that [a] dimension can deviate from the specified value on [a] drawing." (Tata, 1999)
* Training = iterative process of refining a data model's parameters for best possible fit of data values.
* Translation = see Conversion.
* Trend = describable data pattern.
* Tuple = a relational data record. "In the relational model of data an entity is represented by a tuple of values of its attributes." (Alagic, 1986)
* Uncertainty = "Unsureness about the exact value." "There are a variety of uncertainties in PCDBs including surrogate data, multiple data sets, aggregate data and small data sets." (Tata, 1999)
* Uniform Resource Locator (URL) = pointer to data of almost arbitrary composition, typically allowing remote access.
* Unpopulated = description of a data index that contains no data values.
* Value = the state of an attribute; e.g., the value "blue" represents the state of the "color" attribute.
* Weight = assignment of mathematical influence over the outcome of an operation.
* Wildcard = character that represents no selection of values for an attribute.
1 Introduction
This chapter provides background information about Process Capability Databases and
motivations for improving the current state of the art. The results of a literature search are
presented, followed by the thesis objectives and information about the nature of the data used in
this project. The chapter concludes with an outline of Chapters 2-7.
1.1 Background
Design and manufacturing enterprises rely on accurate and timely information to create products.
Designers can use knowledge such as customer needs, target costs, and material properties to
make a product design successful. (Ulrich and Eppinger, 2000)
Manufacturers can use
knowledge such as machine availability, employee skills, and delivery schedules to make
products according to requirements. (DeGarmo et al., 1997) But the information needs of design
and manufacturing are not entirely independent. Some types of knowledge concern both
manufacturing and design. Materials and scheduling needs are two examples. (Ulrich and
Eppinger, 2000)
Process Capability Data (PCD) is one type of knowledge that both design and manufacturing
use. PCD is manufacturing data collected for process monitoring, identification of out-of-control
manufacturing process elements, and evaluation of candidate processes for manufacturing
operations.
Designers can use PCD for such tasks as predicting manufacturing variation,
creating criteria for robust designs, and analyzing cost sensitivities. (Thornton and Tata, 1999)
Enterprises frequently use Process Capability Databases (PCDBs) to store PCD. PCDBs can
index and store Statistical Process Control (SPC) data, PCD and other information such as
date/time stamps and machine identification codes.
Because PCD supports design and
manufacturing functions, PCDBs should be efficiently accessible to both functions.
1.2 Motivation
Designers seek access to PCD for information support in creating new designs, relating known
processes to new products, and reviewing current processes and designs. Recent academic study
has revealed several shortcomings common to current PCDBs. These shortcomings hinder the
design function's ability to use PCD as an information source for design tasks.
Common barriers to PCDB use by the design function include (Tata, 1999):
- Lack of PCDB commonality across enterprises
Large entities, and those who use component suppliers, frequently rely upon multiple PCDB
sources. This complicates the information retrieval process and reduces the usefulness of the
information for design. The ideal of querying multiple data sources with a single simple
query is generally not achieved, discouraging PCD use.
- Poor PCDB indexing schemes
Widely disparate schemes exist for indexing data in PCDBs. Frequently used parameters
include part number, Key Characteristic (KC) number, feature number, manufacturing
process, feature type, machine number or name, tooling, supplier, team, product, and
material.
There exists no standard for indexing PCD, so the data indexing typically is
different between different manufacturers. This complicates the commonality problem, in
which multiple databases must be queried for PCD.
PCDBs also frequently use hierarchical structures. These structures complicate feature-based
analysis by separating features into multiple locations in the hierarchy. If a user desires to
retrieve all data representing a particular data feature, the entire hierarchical structure often
must be mined to find all the data.
- Poor population of supplier PCDBs
Suppliers often do not provide data to external groups (including customers), or they only
provide data for specifically ordered parts. This reduces the amount of data available to
designers and makes the design task more difficult.
These common PCDB characteristics are significant barriers to successful use of PCD for design.
There is an evident need for tools and methods to manage or lower these barriers, so PCD can
provide a better level of feedback from manufacturing to design.
1.3 Literature review
This section discusses the results of a literature review, outlining findings regarding PCDBs,
human capabilities for managing uncertainty, knowledge discovery and user issues.
1.3.1 Current state of PCDBs
PCD provides a crucial link between design functions and manufacturing capabilities. Clausing
(1998) discusses the wasted work that results from a lack of information during product
development.
PCD is one example of information that can be in short supply during
development and design.
To improve data availability, enterprises use repositories such as
PCDBs to systematically store PCD. Campbell and Bernie (1996) outline a PCDB structure to
catalog geometric tolerances for rapid-prototyping processes.
Designers access the data by
specifying feature types, which are design-oriented process characteristics.
Design-oriented data access makes a PCDB system easier to use and more useful for designers.
Thornton and Tata (1999) find that the literature assumes PCD is available to designers, but this
assumption is inaccurate. Some causes of poor PCDB availability include:
- Data indexing schemes do not allow designers to find the data they need.
- Databases within an enterprise are often incompatible with each other. This fact can make it impossible to access multiple PCDBs with a single query.
- PCDBs are often poorly populated. Low population levels reduce the probability of finding pertinent PCD.
Among other needs, design requires consistent, design-friendly indexing schemes and methods
for managing unpopulated data. The current lack of these elements has created barriers to PCD
access.
1.3.2 Human capability for managing uncertainty
PCDBs represent a type of knowledge database in which the information is seldom
comprehensive.
A survey by Deleryd (1998) finds most enterprises' PCDBs are not fully
populated with data.
This gives rise to a complex problem.
When faced with incomplete
knowledge, the human tendency to reason intuitively is both an asset and a liability.
Piattelli-Palmarini (1991) discusses human "cognitive illusions." Humans often form incorrect
hypotheses based on incomplete information. Even when faced with contradicting information,
humans often cling to these outdated or incorrect hypotheses. Human hindsight is also far less
reliable than its bearers believe, as is the human capacity for risk assessment. Because of these
shortcomings, human inference regarding missing database knowledge can be unreliable. This is
true even among experts.
Automated systems can assist the human data user by performing impartial analysis tasks. Baral
et al. (1998) discuss the use of logical methods to query incomplete knowledge databases. These
methods apply inference techniques that are external to the database, but make use of the data
within it.
1.3.3 Data analysis and knowledge discovery
Mathematical and logical methods can make inferences about missing data. The knowledge
discovery field offers insight into the problem of missing PCD. Walker (1987) discusses the
successful use of automated systems to make inferences in the areas of mass spectrometry,
pharmacology, mathematics, and geology.
These systems typically make use of domain
knowledge - fundamental, topic-specific knowledge that surpasses the information in the
database - to make non-obvious conclusions about data interactions and relationships.
Automated systems have very limited abilities, both in model-driven and data-driven tasks.
The possible approaches to model-driven and data-driven data analysis tasks are myriad. Zhang
and Korfhage (1999) note the existence of "more than 60 different similarity measures" for
comparing data numerically. Distance-based and angle-based measures in Euclidean data space
are noted to be the most popular. Zhang and Korfhage (1999) present the concept of hybrid
data-driven analysis tools and note hybrid similarity measures are not frequently studied. All
data-oriented similarity measures, even hybrids, have inherent weaknesses and are not well suited for
all analysis tasks.
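As an illustration (not code from this thesis), the two most popular kinds of measures can be sketched in Python, treating each data index as a numeric vector in Euclidean space; the example vectors are hypothetical:

```python
import math

def euclidean_distance(a, b):
    """Distance-based measure: straight-line distance between two indices."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Angle-based measure: cosine of the angle between two index vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two hypothetical numeric indices (e.g., encoded feature/material/size)
idx1 = (1.0, 2.0, 3.0)
idx2 = (2.0, 4.0, 6.0)
print(euclidean_distance(idx1, idx2))  # nonzero: the points differ
print(cosine_similarity(idx1, idx2))   # 1.0: identical direction
```

The example shows the inherent weakness noted above: the two indices are "identical" by the angle-based measure yet clearly separated by the distance-based one, so neither measure suits every analysis task.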
Model-driven data analysis and knowledge discovery methods see frequent use in feature
recognition (Turk and Pentland 1991; Pavlovic et al. 2000) and speech recognition (Rabiner
1998; Pereira and Riley 1996; Riccardi et al. 1996; Bangalore and Riccardi 2000). Common
model-based methods include neural networks (Bishop 1996), Bayesian networks (Cowell 1999),
and support vector machines (Burges 1998). One common benefit of model-driven methods is
the ability to discard individual data points once the model is built. These methods generally
require copious training to build accurate models.
Like data-driven methods, model-driven
methods are critically dependent on the presence of a large volume of accurate data.
1.3.4 Database-specific user issues
Meng et al. (1995) discuss translation of database queries between relational and hierarchical
schema. Query translation can allow a user to query multiple databases of differing structure
from a single interface. Liu and Pu (1997) present various meta operations for heterogeneous
data sources, including query aggregation. Meta operations can be part of an effort to integrate
different data sources for responses to a single data request.
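To make the meta-operation idea concrete, here is a minimal hypothetical sketch of fanning one request out across heterogeneous sources and merging the responses; the adapter names and record layout are invented for illustration:

```python
def query_all(sources, predicate):
    """Fan a single query out to several data sources and merge the results.

    Each source is any callable returning a list of records; the predicate
    filters the merged list so the caller sees one uniform response,
    regardless of how each underlying database is structured.
    """
    merged = []
    for source in sources:
        merged.extend(source())
    return [rec for rec in merged if predicate(rec)]

# Two hypothetical PCDB adapters wrapping differently structured databases
def relational_pcdb():
    return [{"feature": "hole", "cp": 1.4}, {"feature": "slot", "cp": 1.6}]

def hierarchical_pcdb():
    return [{"feature": "hole", "cp": 1.2}]

holes = query_all([relational_pcdb, hierarchical_pcdb],
                  lambda r: r["feature"] == "hole")
print(len(holes))  # 2: one matching record from each source
```

In a real system each adapter would translate the single query into the native query language of its database; the sketch only shows the aggregation layer.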
Chandra et al. (1992) note that query and inference methods should be clear to the user. A
database user interface should provide basic information about the processes that return
information to the user. Without information about query and inference methods, a user may not
be confident that the methods are sound.
The literature documents the applications of PCD, but has only recently begun to address
designers' need for better access to PCD. Human inconsistencies in managing uncertainty are
well documented, as are many methods for recognizing data features and managing missing data.
Concepts for query translation and user interface design are also present in the literature. There
is little integration of these concepts, however. Industry needs integrated solutions for missing
data, multiple databases and improved data indexing; this need is not yet met.
1.4 Thesis objectives
This thesis advances the hypothesis that technological changes to PCDBs can reduce access
barriers and improve PCD availability to design. To prove this hypothesis, prototype software is
created to demonstrate improvements. Specifically, these improvements are:
- Database support of design functions:
  - design-focused methods for accessing data
  - error checking on queries
  - support for searching multiple PCDBs with a single query
- Strategies for unpopulated data indices:
  - data aggregation and support for wildcard indices
  - prediction methods for unknown data values
A data re-indexing component and a data analysis component form the basis for the
improvements listed above. Each is discussed below.
1.4.1 Support for design functions
In industry, PCDB indexing schemes often reflect the manufacturing function's need for data
access by features like drawing and part numbers. Design typically needs PCD organized by
design-driven attributes like features, materials and processes. The two functions' differing
needs can result in reduced value of PCDBs to design tasks.
When designers have to cross-reference features with part or drawing numbers, their ability to
efficiently retrieve PCD is reduced. To address this access imbalance, this thesis presents a
method of re-indexing data by attribute, in order to lower the barrier to PCDB usage by design.
This re-indexing process does not compromise manufacturers' PCD access. Rather, it makes
access more universal, allowing design and manufacturing to use the same interface for queries.
1.4.2 Strategies for unpopulated data indices
A PCDB can only contain information collected from previously completed manufacturing
operations. However, designers frequently are charged with the task of designing components
with unique sizes, exotic materials, obscure features or other characteristics that have not
previously been catalogued in a PCDB.
Searches for information regarding a new process
typically result in a null set of returned data from the PCDB. In these cases, process capability
must be estimated by supplementary methods.
Currently, several strategies exist for managing the problem of unpopulated data indices. Few of
these methods include continued use of the PCDB, which may contain PCD for substantially
similar processes. To maximize the value of PCDBs to design and manufacturing, this thesis
presents a mathematical framework for extracting surrogate data from a PCDB when a requested
index is unpopulated.
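As a simplified stand-in for the framework presented in Chapter 5, a nearest-neighbor sketch (with hypothetical attribute encodings) shows the core idea: when the requested index is unpopulated, the closest populated index supplies a surrogate value:

```python
def nearest_surrogate(query, populated):
    """Return the populated index closest to an unpopulated query index.

    `populated` maps numeric index tuples to stored PCD values. Plain
    Euclidean distance is used here for simplicity; the framework in
    Chapter 5 additionally weights attributes and applies thresholds.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    best = min(populated, key=lambda idx: dist(idx, query))
    return best, populated[best]

# Hypothetical encoded indices (feature code, material code, size) -> Cp
populated = {(1.0, 2.0, 5.0): 1.33, (1.0, 3.0, 5.0): 1.10, (4.0, 2.0, 9.0): 1.55}
idx, cp = nearest_surrogate((1.0, 2.2, 5.1), populated)
print(idx, cp)  # (1.0, 2.0, 5.0) 1.33
```

The returned value is an estimate drawn from a substantially similar process, not a measurement of the requested one, so it must be flagged as surrogate data.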
1.5 Thesis data
The methods described in this thesis are suitable for wider application than PCDB systems. Any
structured or semi-structured data set can be arranged and analyzed by the methods described
here. Regardless of the type of data, the resulting structures will be remarkably similar. It is
appropriate, then, to demonstrate how the process works on some arbitrary semi-structured data
set. This thesis does just that: the data that demonstrates the technology is collected from an
application sharing nothing with process capability studies. Discussion of this data as well as
PCD illustrates the versatile nature of the re-indexing process and data analysis techniques.
Several examples in this thesis make use of baseball card references to illustrate various
principles. The data used for this project was gathered from a semi-structured data source on the
World Wide Web. This data documents the dates, times, and selling prices of approximately
250,000 baseball cards auctioned online. Its source was a large set of HTML tables in which
auction records appeared as they occurred, by way of an automated recording system external to
MIT. This sales data represents a semi-structured data set, where there is no formal indexing
system available to classify the data. PCD, by contrast, is typically stored by marking records
with a pre-defined index that best describes them. In their original forms, the organization of
PCD is therefore more highly structured than the baseball card sales data.
1.5.1 Similarities
PCD and baseball card sales data do not initially seem to have much in common, but they are
both describable in a very similar way. Both types of data use a single record to characterize a
single occurrence. Each record is a set of attributes: materials/sizes/features for manufacturing
operations and names/manufacturers/years/qualities for baseball card sales. We can describe any
type of data, as long as we can figure out what the attributes are and build the attribute sets.
Figure 1.1 illustrates how a system can mine semi-structured data and determine data attributes.
[figure omitted: the auction title "1997 BOWMAN JOSE CRUZ JR ROOKIE CARD MINT" annotated with Year, Manufacturer, Name, and Quality attribute labels]
Figure 1.1: Some attributes of baseball card auction data, found in auction title
The meaning of the stored data differs by data type. In the case of PCD, we want to store the
dimensional capabilities of a manufacturing operation. In the case of baseball cards, we want to
find the price statistics for a baseball card. But with data for different applications stored in a
similar way, we can use a consistent set of tools to manage all the data and address data
problems that may arise.
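The attribute-mining idea of Figure 1.1 can be sketched as simple pattern matching (a hypothetical toy parser, not the system built for this thesis; the vocabulary lists are invented):

```python
import re

MANUFACTURERS = {"BOWMAN", "TOPPS", "FLEER", "UPPER DECK"}
QUALITIES = {"NEAR MINT", "MINT", "EXCELLENT"}

def extract_attributes(title):
    """Pull year, manufacturer, and quality attributes from an auction title."""
    attrs = {}
    year = re.search(r"\b(19|20)\d{2}\b", title)
    if year:
        attrs["year"] = year.group()
    upper = title.upper()
    for m in MANUFACTURERS:
        if m in upper:
            attrs["manufacturer"] = m
            break
    for q in sorted(QUALITIES, key=len, reverse=True):  # match longest first
        if q in upper:
            attrs["quality"] = q
            break
    return attrs

print(extract_attributes("1997 BOWMAN JOSE CRUZ JR ROOKIE CARD MINT"))
# {'year': '1997', 'manufacturer': 'BOWMAN', 'quality': 'MINT'}
```

Whatever is left after the known attributes are matched (here, the player name) can be treated as a remaining free-text attribute, which is how a semi-structured record becomes an attribute set.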
1.5.2 Differences
Baseball card data is different from PCD in one very significant way. As noted earlier, no
indexing scheme has initially catalogued the baseball card data. It typically exists without an
indexing structure. As with much of the information available on the Internet and elsewhere,
there is no predetermined way to categorize and manage the baseball card data. PCD, on the
other hand, is typically stored in a rigidly structured PCDB.
This thesis manages the differences between structured and unstructured data by presenting an
indexing system that accommodates either type.
The system presented here can re-index
structured data, while creating ground-up indices for unstructured data. This capability creates
advantages that become apparent in Chapters 4 and 5.
1.6 Outline
This thesis contains seven chapters. Chapter 1 defines PCDBs, reviews their use in industry, and
states the motivations for further work. The chapter reviews current literature and summarizes
the objective of the research contained in this thesis. Finally, Chapter 1 describes the data used
for this project and outlines the structure of this thesis.
Chapter 2 presents the results of an industry survey at the 2000 Key Characteristics Symposium.
Survey responses reflect the current state of PCDB implementation, and identify areas for PCDB
improvements.
Chapter 3 discusses database management structures (DBMSs), identifying strengths and
weaknesses of each major DBMS category. Weaknesses relevant to PCDB use receive particular
attention. Common problems across DBMS structures are indicated, and the chapter concludes
with a discussion of pertinent PCDB-related indexing issues.
Chapter 4 describes an attribute-based re-indexing scheme. This scheme can represent data from
various sources with a single set of Euclidean indices, subject to constraints.
The chapter
discusses details of the indexing code and support for improved functions such as assistance with
user queries, expansion of the indexing scheme, and assisted machine learning of new attributes.
Chapter 5 discusses a strategy for managing missing PCD by generating estimates for missing
data values. The chapter presents details and equations in support of the method. Chapter 6
outlines features of the software prototype that integrates the re-indexing scheme of Chapter 4
with the missing data strategies of Chapter 5. Chapter 6 also provides concepts for software
enhancements. Finally, Chapter 7 offers conclusions and suggested directions for future work.
2 Industry Survey
To determine how the design function uses PCDBs, a survey was distributed to attendees of the Fourth Annual
Variation Risk Management/Key Characteristic (VRM/KC) Symposium in January 2000.
Attendees of this Symposium were interested in the use of VRM/KC tools, from industrial and
academic viewpoints. Twenty-four attendees, representing American and international industrial
and academic institutions, completed the survey.
The results of this survey suggest many institutions have not developed tools to systematically
manage unpopulated PCDB indices. Additionally, unpopulated database indices seem to have
adverse consequences upon the actions of designers.
This chapter presents the survey and draws conclusions from the results. After offering
conclusions, the chapter discusses recommended actions. The full survey can be found in
Appendix A.
2.1 Survey questions and responses
This section presents each survey question and the responses received.
2.1.1 PCDB usage
Among the twenty-four respondents to the survey, eleven indicated having a PCDB system in use.
Five additional respondents indicated PCDB implementation was in the planning stages at their
organizations (Figure 2.1). Unless specifically noted otherwise, all remaining questions in this
chapter were asked of those respondents who answered "yes" to this question.
[bar chart omitted; categories: Yes; No, No Plans; Planned; Being Created]
Figure 2.1: PCDB implementation among Symposium attendees
2.1.2 Causes and effects of unpopulated PCDB indices
Respondents were asked to specify the design tasks that result in queries of unpopulated indices,
choosing as many as apply to their organization. Ten of the eleven respondents identified the
creation of new designs as a cause of queries of unpopulated indices. Four respondents indicated
redesign tasks frequently lead to the return of null PCDB data sets. Four responses indicated
investigative queries (what-if scenarios and other speculative search tasks) frequently lead to
unpopulated indices. See Figure 2.2.
[bar chart omitted; categories: New Designs, Investigative Queries, Redesign, Other]
Figure 2.2: Design tasks resulting in unpopulated PCDB indices
Respondents were asked to note the data characteristics (i.e., attributes) that exhibit the most
significant influence upon the probability that a query returns a null data set. Results are
illustrated in Figure 2.3.
[bar chart omitted; categories: Complexity, N/A, Feature, Material, Size, Date]
Figure 2.3: Strong influences upon PCDB index population
Complexity was the most frequently chosen response, indicating queries for more complex
features or operations are less likely to return useful data.
Four of the eleven respondents
indicated they had noted no strong relationships between data characteristics and PCD
population levels. Three respondents noted strong influences of feature selection upon index
population levels. Two respondents indicated variations in material choice are influential, and
two noted that feature size has a significant impact upon the potential for PCDB index
population.
Reaction to the unpopulated index condition was gauged by asking respondents two questions
about responses to unpopulated data indices. One question asked for the most popular short-term
solutions to a lack of data in the PCDB, while the other question asked about long-term strategies
for addressing the same problem.
[bar chart omitted]
Figure 2.4: Methods for managing lack of PCD at organizations with PCDBs
Responses to the short-term question are plotted in Figures 2.4 and 2.5. Respondents were asked
to choose as many options as were applicable. Figure 2.4 illustrates results from respondents
who do use PCDBs at their organizations. Eight respondents indicated information is requested
from an enterprise's manufacturing center(s) when PCDB queries result in null responses. Seven
respondents indicated contacting suppliers for the missing information, while five suggested
intuition is used to estimate the information needed. Four respondents noted the use of an
external expert, while one indicated the use of a software algorithm and one indicated design
changes.
The responses above were compared with the responses of individuals who do not currently use
PCDBs at their organizations. These individuals were asked what strategies they use, in the
absence of a PCDB, to get PCD when it is not already known. The results appear in Figure 2.5.
[bar chart omitted]
Figure 2.5: Methods for managing lack of PCD at organizations without PCDBs
Eight of thirteen respondents indicated using an internal expert (perhaps someone in
manufacturing) to provide some estimated measure of PCD. Six indicated supplier contact, and
six indicated estimates are generated (presumably not via PCDB-supported algorithms) for the
unknown values.
Five indicated the consideration of design changes.
Three respondents
indicated reliance upon an external expert to provide the necessary information.
When asked about long-term trends within the design function in response to missing PCD,
survey respondents provided the answers in Figure 2.6. The most popular reply (6 respondents)
was "No Trends Noted." Four respondents indicated PCDB queries were performed only upon
"popular" indices - that is, upon well-known sets of parameters that have a greater chance of
containing data. One respondent indicated queries are performed only upon indices known to be
populated with PCD. No respondents indicated decreased PCDB use, distrust of the data, use
only for revision of existing designs, or other reactions. This question was asked only of
respondents who indicated PCDB use at their organizations.
[bar chart omitted]
Figure 2.6: Designers' reactions to unpopulated PCDB indices
2.1.3 Indexing strategies
Respondents were asked to specify how their PCDBs were indexed, selecting all items that
applied. The results are illustrated in Figure 2.7.
[bar chart omitted]
Figure 2.7: Index scheme parameters in use
The most popular indexing scheme parameter was material, with ten of eleven respondents
indicating its use. This was followed in popularity by the feature parameter, with nine responses,
and then by the size and machine parameters, which received eight responses apiece. The part
identification parameter received seven responses, followed by the operation parameter with six
and the KC parameter with five.
2.1.4 PCDB population
Overwhelmingly, respondents indicated low levels of PCDB index population, with nine of
eleven indicating their databases were populated to levels of 25% or less. As illustrated in
Figure 2.8, one respondent indicated 25-50% population, and one respondent indicated 75-99%
population.1
[bar chart omitted; categories: 0-25%, 25-50%, 50-75%, 75-99%, 100%]
Figure 2.8: Index population levels in PCDBs
Five respondents indicated unpopulated indices are distributed in concentrated, localized regions,
and also in lower concentrations throughout the database.
Three respondents described their
PCDBs as "Mostly Unpopulated," and the remaining three respondents were distributed evenly
between "Mostly Populated," "Concentrated Unpopulated Regions" and "Dispersed Unpopulated
Data." Thus, eight of the eleven respondents suggested significant amounts of unpopulated data
throughout their PCDBs, as well as concentrations of unpopulated indices in certain PCDB
regions (Figure 2.9).
1 Respondents were not asked if their specified percentages took infeasible indices into account.
[bar chart omitted]
Figure 2.9: Distribution of unpopulated data in PCDBs
Respondents were asked what percentage of PCDB searches fail to return data. More than half
of respondents indicated a null response more than half the time the database is used (Figure
2.10).
[bar chart omitted; categories: <10%, 10-25%, 25-50%, >50%]
Figure 2.10: Percentage of queries returning unpopulated indices
2.2 Survey analysis
Survey results yielded several key findings, which supported and guided the work described in
this thesis. These conclusions appear below.
2.2.1 Causes and effects of unpopulated PCD
Several conclusions were evident regarding the unpopulated data index problem. Each topic is
addressed individually.
Influence of task on index population. It appears from survey results that new designs are
the most significant contributors to queries returning unpopulated indices. Redesign tasks and
investigative queries also contribute significantly to the problem of null responses. The survey
results suggest the most frequently unsuccessful queries support processes about which designers
know the least. This was anticipated, as these tasks frequently look for information that lies
outside the set of documented processes.
New designs were also anticipated to be the most frequent causes of null responses. Redesign
and investigative efforts may be based on previous design decisions, but a new design might not
share any elements with previous designs.
As a result of this difference, a new design can
prompt database queries that explore entirely empty regions of a PCDB. In contrast, redesign
and investigative queries often retain some characteristics of a catalogued design. This fact may
increase the likelihood that investigative and redesign queries either return data in response to a
query, or query an unpopulated database index that is "close to" a populated database index.
Influence of attributes on index population. Among parameters of a query that increase
the probability that a PCDB index is unpopulated, respondents identified query complexity as the
most significant contributor.
Complexity was chosen over more concrete parameters of the
process being investigated, such as feature, material and size.
This result suggested the
aggregation strategy described in Section 2.3.2.
Strategies for missing PCD. When PCD cannot be found by direct database query,
respondents who use PCDBs indicated heavy reliance upon data sources external to the PCDB.
Survey results suggest respondents use at least six different contingency strategies with some
frequency when PCDB indices are found to be unpopulated. This suggests database indices are
found unpopulated with sufficient frequency to necessitate numerous contingency strategies.
Some of these strategies are not advisable, however. In particular, using human intuition for
generation of surrogate data is not generally advisable (Piattelli-Palmarini, 1991). Additionally,
the popularity of simple intuition to estimate process capability is not particularly different from
using intuition entirely in place of the PCDB, particularly in light of the fact that such a high
percentage of database queries are unsuccessful.
The popularity of requests to manufacturers is interesting, as one motivator for PCDB creation is
the reduction of reliance upon manufacturing centers for statistical information. Yet the clear
majority of organizations represented here have not reached the full potential of this reduction:
on average, more than 50% of queries return unpopulated data indices, and the most popular
method for dealing with this condition seems to be a return to the manufacturing center for
information.
Among respondents who do not have access to PCDBs, there was a significant increase in the
popularity of design change considerations when PCD is not available.
Other strategies for
managing missing data were similar in popularity between respondents with PCDBs and
respondents without PCDBs.
This comparison does not allow a direct item-to-item contrast between the actions of designers
with PCDBs and those without, but it does suggest organizations with PCDBs rely upon
information sources similar to those used by organizations without PCDBs. In the case of
organizations with PCDBs, respondents undertake these strategies for addressing missing data
after a PCDB query returns null results, so there is still a significant need for data estimates in
spite of PCDB implementation.
Long-term implications of missing PCD. Survey respondents indicated loyalty to the
PCDB when unpopulated indices are returned. Although they noted some reduction in the scope
of queries submitted, nobody indicated decreased PCDB use, distrust of the data, or use only to
revise existing designs.
This might be a reflection of many factors.
The continued use of
PCDBs, even in light of frequent query failures, may reflect a fundamental trust in the PCDB
system that is not shaken by a lack of data. Alternately, it may reflect enterprise requirements
that PCDBs be used for every design task when possible, forcing PCDB use patterns to remain
constant at some level. This portion of the survey did not provide as much insight into the
attitudes of designers as was hoped, although the results do establish that unpopulated PCDB
indices can cause some reduction in the scope of tasks for which the PCDB is queried.
2.2.2 Indexing strategies
Design tasks focus on parameters such as feature, material, and size (Tata, 1999). It is
encouraging to see these three parameters listed among the top parameters used for indexing,
because they show the potential for making PCDBs more accessible to the design function. Even
if a PCDB's structure does not currently allow for attribute-based data analysis, many
respondents indicated the necessary information is contained in the database. This information
might be used to re-index or restructure the PCDB for attribute-based analysis.
2.2.3 PCDB population levels
With regard to population levels, the survey addressed two topics: data topography within a
PCDB, and respondents' long-term reactions to missing PCD. Conclusions for each topic appear
below.
Population levels and distributions. Respondents generally indicated low population
levels in their PCDBs.
Many current indexing systems create millions of possible index
permutations, of which many or most are infeasible. (Tata, 1999) For this reason, respondents'
answers to this question were not surprising. These results suggest a strong possibility that a
requested PCDB index will be infeasible or unpopulated. Population levels less than 25%, which
most respondents suggested, indicate PCDB implementations frequently suffer from null query
results.
Within respondents' databases, some regions seem to be more densely populated with PCD than
other regions. This result suggests certain types of processes may be better represented than other
types. This is consistent with intuition, as some processes are more frequently used than others
in a facility.
Effects of population levels on query responses. It is clear that current PCDB
population levels frequently prevent enterprise functions from obtaining the PCD they seek. The
majority of survey respondents suggested most of their PCDB queries yield null responses.
Respondents additionally indicated the unpopulated data index problem is not localized to
particular areas of a PCDB.
Unpopulated indices may also create a constant, low-intensity
"background noise" across the entire database. This serves to make unpopulated data indices a
potential problem anywhere in the PCDB.
2.2.4 Remarks
The survey results underscore the need for solutions to three PCDB problems. First, it is clear
that null responses occur with high frequency. This is probably a result of the low population
levels of industry PCDBs.
Second, query complexity is a strong contributor to null query
responses. Third, the contingency strategies for missing PCD are myriad, and similar between
organizations with PCDBs and those without PCDBs.
These findings underscore the importance of strategies for managing null responses. With so
many unfulfilled PCD requests, it is clear that industry needs strategies for preventing or
managing null PCDB responses.
Any solutions to these problems should also consider the
findings of (Tata, 1999). Specifically:
- PCDBs are not generally accessible to designers with design-focused interfaces
- Indexing systems allow specification of infeasible indices
- Databases are frequently incompatible with each other
2.3 Recommended action
This section lists strategies for addressing two types of problems. The first problem is improving
the indexing scheme of PCDBs for design access. The second problem is managing unpopulated
PCDB indices.
2.3.1 Indexing scheme improvements
To improve design's PCD access, PCDBs should offer a design-focused interface for generating
queries. This interface ideally should not be limited to one function. Rather, the interface should
be flexible enough to allow either design or manufacturing to access PCD by the appropriate
attributes. The interface should also support the possibility of using a single query to search
multiple databases.
To respond to design-driven queries, PCDBs must be indexed using design-driven attributes.
The ideal state would use each data attribute as a potential indexer, so each datum would be
accessible from design-driven queries and manufacturing-driven queries. If each PCDB within
an enterprise is to be queried from a single interface, then each PCDB should share a common
indexing scheme. The data in each PCDB may remain in place, but a universal indexing code
should accompany each individual datum.
The improved indexing scheme should allow for implementation of existing data analysis tools.
Specifically, the scheme should arrange data in a structure that enables strategies for managing
unpopulated indices.
Because this management task relies on finding data indices that are
similar to an unpopulated data index (see Section 2.3.2), the data structure should minimize the
effort needed to determine similarity.
Chapter 4 of this thesis outlines a database indexer that accomplishes these objectives by
arranging PCD in a Euclidean, attribute-based data structure.
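The notion of a Euclidean, attribute-based arrangement can be illustrated with a minimal sketch. The encoding below is hypothetical: each data index is assumed to have been mapped onto a vector of ordinal attribute codes, which is not necessarily how Chapter 4's indexer represents it.

```python
import math

# Hypothetical encoding: each PCD index is a vector of numeric attribute
# codes, e.g. (material, process, feature) on ordinal scales.
def distance(index_a, index_b):
    """Euclidean distance between two attribute-coded indices."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(index_a, index_b)))

def most_similar(target, populated):
    """Return the populated index closest to the (unpopulated) target."""
    return min(populated, key=lambda idx: distance(target, idx))

populated = [(1, 2, 1), (1, 2, 3), (4, 1, 1)]
print(most_similar((1, 2, 2), populated))  # -> (1, 2, 1)
```

Because similar indices sit close together in this space, locating surrogate data reduces to a nearest-neighbor search, which is exactly the operation Section 2.3.2 relies on.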
2.3.2 Improved management of unpopulated indices
First, industry needs a tool that prevents users from specifying infeasible index combinations.
This tool would save PCDB users time by intercepting index specification errors before a user
commits to a search.
This error-checking method would be a valuable addition to PCDB
implementations.
Second, industry needs strategies for addressing missing data. One solution to this problem
would be to more fully populate the database. This is an expensive and time-consuming
proposition: populating the entire database could prove impractical from a time and money
standpoint, as there are many thousands of valid indices in a typical PCDB. This problem
requires a more cost-efficient and quickly implemented solution. Until PCDB population levels
are high enough to return data for nearly every query, enterprises should employ statistically
correct procedures for quickly generating estimates from available and properly organized
information.
If a queried PCDB index is found to be unpopulated, one reliable strategy for finding surrogate
data may involve reducing the complexity of the query to seek a more general process
description, or selecting a different feature, material or size.
The industry survey indicated complexity is a primary cause of null responses, so specifying a
simpler set of criteria may improve a user's chances of getting useful data from the PCDB.
An improved indexing
structure should allow a user to make simple changes to some query parameters without having
to reset all parameters. This functionality should include the ability to specify a partial list of
search criteria for aggregate searches.
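The query-relaxation strategy described above can be sketched as follows. The query interface here is invented for illustration: a query is a dictionary of attribute criteria, and criteria are dropped in a user-supplied order of importance until data is found.

```python
# Hypothetical query interface: a query is a dict of attribute criteria;
# lookup() returns matching records from a toy in-memory database.
def lookup(db, query):
    """Return records matching every criterion in the query."""
    return [rec for rec in db if all(rec.get(k) == v for k, v in query.items())]

def generalize(db, query, drop_order):
    """Relax the query one criterion at a time until results appear."""
    query = dict(query)
    results = lookup(db, query)
    for attr in drop_order:
        if results:
            break
        query.pop(attr, None)       # drop the next-least-critical criterion
        results = lookup(db, query)
    return results, query

db = [{"process": "drill", "material": "steel", "feature": "hole", "cpk": 1.3}]
results, used = generalize(
    db,
    {"process": "drill", "material": "aluminum", "feature": "hole"},
    drop_order=["material", "feature"],
)
print(used)  # the simplified criteria that finally produced data
```

Note that only the failing criterion is dropped; the remaining parameters stay in place, matching the requirement that a user not reset the entire query.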
Because of their similarity to previously catalogued processes, redesign and investigative queries
are promising candidates for mathematical algorithms that extrapolate or interpolate to
unpopulated indices, based upon the data contained in local populated indices.
The data structure suggested in Section 2.3.1 should place similar data indices in close
proximity to each other, so the distance between indices can serve as a measure of similarity.
Regression techniques are well suited for situations in which ample data is available in close
proximity to unknown values, providing quickly implemented and statistically correct tools for
generating surrogate data. Chapter 5 presents a set of regression operations that can generate
estimates for unknown PCD values.
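A minimal sketch of such a regression estimate appears below. The data values are invented for demonstration; Chapter 5's actual regression operations may differ. Here an unpopulated index on a one-dimensional attribute axis (hole diameter) is estimated from a least-squares fit through its populated neighbors.

```python
# Illustrative only: fit a line through populated neighbors along a 1-D
# index axis and predict process standard deviation at an unpopulated index.
def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

# Populated neighbors: (hole diameter in mm, process std dev in mm)
diameters = [4.0, 5.0, 7.0, 8.0]
sigmas = [0.010, 0.012, 0.016, 0.018]

slope, intercept = linear_fit(diameters, sigmas)
estimate = slope * 6.0 + intercept   # surrogate sigma for unpopulated 6.0 mm
print(round(estimate, 3))  # -> 0.014
```

Because the neighbors bracket the unpopulated index, this is interpolation rather than extrapolation, which is the more statistically defensible of the two cases.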
2.4 Conclusion
It is not the intent of this thesis to propose a solution to all unpopulated PCDB index problems,
but rather to manage the types of queries that most frequently contribute to the return of
unpopulated PCDB indices.
As indicated by the survey, these queries may frequently be
expected to involve unpopulated indices in close proximity to populated indices, allowing for the
use of neighboring indices' data in predicting values for the unpopulated indices. This finding is
key to the strategies employed in later chapters for managing unpopulated indices.
Chapters 4 and 5 of this thesis offer solutions that directly parallel the findings of this survey, as
well as the survey reviewed in (Tata, 1999). It is clear from these surveys that the needs of
PCDB users are currently not being met. As outlined in Chapter 3, basic database architectures
can contribute to the PCDB problems that these surveys have uncovered.
3 Current State of Database Implementations
Storing large amounts of data often requires a dedicated system. A PCDB is an example of such
a system. Depending on their function, modern databases vary in size, structure and operations.
But all databases must have methods for entering, storing, and returning data upon request.
Database programmers use the term Database Management System (DBMS) to describe an
organization structure that enables these functions. (Alagic, 1986) Multiple DBMS architectures
exist, each uniquely adapted for specific applications.
One way to address these problems is to seek out an alternate DBMS from among those that are
popular in other applications, with the intent of finding a DBMS that offers advantages over
currently used PCDB DBMSs.
In fact, there exist several DBMS architectures that are particularly widespread in
implementation. Unfortunately these structures, despite their popularity, all feature
weaknesses that can hamper the data storage, management and retrieval processes required of
PCDBs, as well as higher-level analysis functions. It would be impossible to describe the
complete state of database technology here. Instead, some popular DBMS architectures are
described below in detail.
This chapter begins by laying out specific needs for a PCDB management system. Next, it
reviews three major types of database structures: hierarchical, relational, and object-oriented.
During this review, the chapter identifies the strengths and weaknesses of each system. The
chapter concludes with comments about how some DBMS weaknesses relate to PCDBs.
3.1 PCDB-specific DBMS needs
To manage PCDBs, a DBMS must satisfy several needs. This section describes these needs in
detail.
Efficient data storage. To minimize the resources needed to store and search a PCDB, the
DBMS should use an efficient structure that conserves storage space and processor load. The
DBMS should not create structural elements that then go unused, because these elements can
waste physical resources. To maximize flexibility of data entry and searching, the DBMS should
store data that has not been assigned a full set of attributes.
This support for incompletely
specified data may be useful when some information about a set of PCD entries is unknown.
Ability to locate similar indices. The DBMS should support efficient methods for
determining which indices are most similar to a reference index. One way to accomplish this is
to place similar data indices close to each other in the data structure. The system should not
require a human to tell it which indices are similar to each other, because this information may
change when the indexing scheme changes.
Index structure expandability. When a user adds new data attributes or attribute values to
the indexing structure, the update should take minimal time. Ideally, this process should be as
easy as editing an attribute list; the user can then tell the DBMS to update the data with the
new list.
When a user makes data index updates, the DBMS should restructure the data arrangement as
efficiently as possible. Index changes should have a minimal impact on the volume and structure
of the data, although some changes are inevitable.
Support for multiple volumes. Enterprises often have multiple PCDBs available, including
legacy systems, purchased data and inherited archives. Whenever possible, DBMS structures of
each PCDB should be compatible with others.
This makes the user experience consistent
between PCDBs, and allows a user to query multiple PCDBs simultaneously from a single user
interface. Even if each PCDB has a different structure, a user should still be able to easily search
multiple PCDBs with a single query.
Support for simple aggregation. PCDB users have indicated a desire to return multiple data
indices by using aggregated queries.
The DBMS should have strong support for index
aggregation when a user specifies a query or enters data. The DBMS should make the process of
gathering aggregate data as efficient as possible, to minimize query response times.
Support for error checking. When a PCDB user enters new data or submits a query for
PCD, the DBMS should check the submitted data indices for validity. The DBMS should only
allow a user to specify valid index combinations.
3.2 DBMS architectures for PCDBs
This section reviews the three most popular types of DBMS architectures. These systems are
hierarchical, relational, and object-oriented DBMSs.
The section indicates the unique
advantages and drawbacks of each system. Hybrid systems are also briefly discussed.
3.2.1 Hierarchy structures
This subsection describes the hierarchy DBMS, and its disadvantages.
Background. Hierarchical structures often serve to illustrate other database management
systems, but the hierarchy is also a separate DBMS. A hierarchical DBMS looks like an inverted
tree, and each node in the structure represents an index that can contain data. A node may have a
maximum of one "parent" node higher up in the structure, and may have multiple "child" nodes
below it. See Figure 3.1.
Figure 3.1: Generic hierarchical structure with indices
Data indices in a hierarchical structure identify their parent nodes. All parents' indices are
embedded in those of the children. For example, the parent of index 2.2.1 is index 2.2. A method
can mine child nodes by looking for extensions of the current node's index code. See Figure 3.2.
Figure 3.2: Determining child nodes of node 1
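The prefix-based index codes of Figures 3.1 and 3.2 can be mined programmatically. The helper functions below are illustrative sketches, not part of any particular DBMS implementation:

```python
# Sketch of the index-code scheme in Figures 3.1-3.2: every node's index
# embeds its parent's index, so relatives are found by prefix matching.
def parent(index):
    """'2.2.1' -> '2.2'; top-level nodes have no parent."""
    return index.rsplit(".", 1)[0] if "." in index else None

def children(index, all_indices):
    """Direct children are exactly one dotted segment longer than the index."""
    prefix = index + "."
    return [i for i in all_indices
            if i.startswith(prefix) and "." not in i[len(prefix):]]

indices = ["1", "2", "1.1", "1.2", "2.1", "2.2", "2.1.1", "2.1.2", "2.2.1"]
print(parent("2.2.1"))          # -> 2.2
print(children("1", indices))   # -> ['1.1', '1.2']
```

Note that finding children requires scanning the full index list; this linear scan is the inefficiency that the disadvantages below elaborate on.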
Disadvantages. The hierarchical model offers simple, efficient access to data. But it can also
hinder data access and knowledge discovery. Problems with the hierarchical DBMS appear
below.
Scattering of similar indices. First, data containing similar attributes may appear in different
locations throughout the structure. See Figure 3.3. The hierarchical DBMS often must repeat
identical attribute names across a level of the hierarchy, and occasionally across multiple levels.
To find all occurrences of an attribute, a search must mine the entire hierarchy structure. This
does not make an attribute-based search impossible, but it does make the search inefficient. See
Figure 3.4.
Figure 3.3: Identical attributes in multiple locations and levels of hierarchy structure
Figure 3.4: Mining hierarchy sections for "Green" and "Orange"
Lack of attribute information. The second problem with the hierarchical DBMS is the lack of
information regarding a datum's attributes. Information about data attributes is contained by the
hierarchical structure itself. Mining the structure for all attributes (the first problem from above)
may require human assistance, because some attributes or attribute values with identical names
may not actually represent identical properties. This problem is known as polymorphism. See
Figure 3.5.
Figure 3.5: Identically named attributes may not represent identical concepts
Complicated determination of index similarity. Third, the hierarchical DBMS has no means
to identify which data indices are most similar to a given index. This is important for managing
unpopulated data indices. See Figure 3.6. Substitution or estimation with hierarchical data can
become complex and difficult when similar indices are not efficiently found.
Figure 3.6: No knowledge of nearest-neighbor attribute values
Difficult alteration of the hierarchical structure. A fourth difficulty with hierarchical
databases involves altering the DBMS structure. If a user deletes a data attribute or attribute
value from the indexing scheme, the DBMS has to restructure all nodes containing that attribute.
Deleting an attribute also deletes all of the data in children of nodes containing the deleted
attribute. The DBMS has to move data out of areas beneath deleted nodes, and then find an
appropriate location for each data record. See Figure 3.7. A similar problem occurs when a user
adds an attribute. The attribute must appear in multiple locations in the hierarchy, requiring an
examination of the entire structure to identify all necessary insertion points.
Figure 3.7: Removal of attributes from hierarchical index
When a user adds an attribute or attribute value to the indexing scheme, the DBMS may need to
add large structural sections to reflect the change. The system has to duplicate all nodes that will
appear under the added attribute. See Figure 3.8. The DBMS then has to reanalyze and relocate
all affected data within the new hierarchy. Also, a user must reexamine all human-generated
information such as data attributes and nearest-neighbor attribute values. This reexamination
must occur to make sure the human-generated information is still legitimate under the new
indexing scheme.
Figure 3.8: Addition of attributes to hierarchical DBMS
3.2.2 Relational structures
This subsection discusses the relational DBMS, and the problems that may be encountered when
using it.
Background. The relational model uses tables to store data. A table is a collection of data
instances that share the same attributes, even if the values for one or more attributes are different
between instances. Each row in a table is a single instance, or tuple. Columns (fields) represent
the attributes that describe data in the table. See Figure 3.9.
Figure 3.9: Relational representation of items
Relational databases can be accessed with very simple query processing, via Structured Query
Language, or SQL.
Relational DBMSs traditionally comply with the principle of data
independence. This principle says structural changes should not affect data already in place, and
data operations should treat each datum exactly the same way. Relational databases allow any
SQL command to operate on any available piece of information in any table. As a result, query
definition is simple.
This is an improvement upon older DBMS systems like hierarchical
structures.
Expandability of relational tables is superior to hierarchical or object-oriented structures
(described later in this section), again due to data independence. Adding an attribute to a
single relational table is simple. However, a newly created attribute might not contain any
values until the user or a database algorithm adds them. See Figure 3.10.
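This behavior can be demonstrated with SQLite, used here purely as a stand-in for any relational DBMS; the table name and values are invented. Adding the attribute is a single statement, and existing rows simply carry NULL in the new column until values are supplied:

```python
import sqlite3

# Toy relational table, after the style of Figure 3.10.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE yoyo (key INTEGER PRIMARY KEY, color TEXT)")
con.executemany("INSERT INTO yoyo (color) VALUES (?)",
                [("Green",), ("Green",), ("Orange",)])

# Adding an attribute to a single table is one statement; prior rows
# receive NULL for the new column until populated.
con.execute("ALTER TABLE yoyo ADD COLUMN size TEXT")
rows = con.execute("SELECT color, size FROM yoyo").fetchall()
print(rows)  # -> [('Green', None), ('Green', None), ('Orange', None)]
```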
Figure 3.10: Adding new attribute to relational table
An additional feature of relational databases is the pointer, which reduces the need for the
duplication of data (a problem outlined above for hierarchical DBMSs). Pointers are simple
references from a table to an attribute in another table. A pointer retrieves the data from the
proper field in a remote table, and returns that data to a field in the presently active table. See
Figure 3.11. This data is not persistent. It will be lost when the table is closed. But the data can
always be retrieved again from the remote table as needed, so while its existence in a particular
table is transient, its availability from another table is continuous. The pointer system is an
improvement upon hierarchical systems, because the DBMS only has to update one location
when an attribute value changes. Figure 3.12 illustrates adding a Size attribute to the Ball table.
In this case, the linked Toy table in Figure 3.11 will automatically update, incorporating the new
data values.
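The aggregate-table idea of Figure 3.11 can be sketched with SQLite; the tables and contents below are invented for illustration. An aggregate "toy" result is assembled at query time from two tables, and because the Ball table has no Size attribute, those fields come back NULL, mirroring the blank cells in the figure:

```python
import sqlite3

# Two toy tables: ball lacks the size attribute that truck carries.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ball (key INTEGER, color TEXT)")
con.execute("CREATE TABLE truck (key INTEGER, color TEXT, size TEXT)")
con.execute("INSERT INTO ball VALUES (1, 'Green'), (2, 'Orange')")
con.execute("INSERT INTO truck VALUES (1, 'Green', 'Large')")

# Aggregate view assembled at query time: the missing attribute is
# padded with NULL, just like the blank cells in Figure 3.11.
toys = con.execute("""
    SELECT color, NULL AS size FROM ball
    UNION ALL
    SELECT color, size FROM truck
""").fetchall()
print(toys)  # -> [('Green', None), ('Orange', None), ('Green', 'Large')]
```

The result set is transient, as the text notes: it disappears when the connection closes, but can always be regenerated from the stored tables.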
Figure 3.11: Process of creating new table using pointers to two tables
Figure 3.12: Adding an attribute to a relational table: pointers automatically update
Disadvantages. Comparing relational systems with PCDB needs reveals several
incompatibilities. These incompatibilities appear below.
No true aggregate objects. First, relational DBMS systems classify objects only by atomic
attributes. These systems cannot truly "create" aggregate objects by combining attributes of
individual objects. An aggregate object (Figure 3.11) is just a new table that contains attributes
from other tables. The aggregate object does not truly "exist" in database storage. It is only a
projection (combination of columns) of one or more other tables, and it disappears when the
database is closed.
At run-time, a relational DBMS must create all aggregate objects with a projection of one or
more relational tables. If an aggregate table uses fields that aren't shared by all of its constituent
tables, many attribute fields in the aggregate table will be blank. These empty cells appear in
Figure 3.11, and waste database resources by allocating space that is never used.
Difficult alteration of the relational structure. Second, adding attribute fields (including their
attribute values) to existing relational data is complicated when several tables are involved,
requiring the DBMS to locate all insertion points for new fields. The added fields may contain
actual data or simple pointers, but the system still must identify and update each location. This
does not include the steps necessary to populate new field(s) with attribute values if the DBMS
does not use pointers. The system must determine attribute values for every tuple via some
external means.
Difficult determination of nearest-neighbor indices. Third, the relational DBMS has
problems with nearest-neighbor identification, which parallel hierarchical DBMS problems.
Within a particular attribute, the relational table has no knowledge of the nearest-neighbor order
of attribute values.
The ordering of some attributes, specifically those that are numerically
specified (such as length in millimeters), may follow numerical order. But there is no guarantee
that every numerical attribute will have a nearest-neighbor order that coincides with numerical
order. Text attributes are even more troubling, because they are not likely to follow any
nearest-neighbor algorithm that a computer can figure out by reading text values. See Figure
3.13. The DBMS must order these attribute values by some other means, such as human
intervention, in order to use any nearest-neighbor processes.
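One form such human intervention might take is an explicit ordinal scale over the text values; the scale below is invented for illustration. The DBMS cannot infer from the strings themselves that "Mega" is nearer to "Colossal" than to "Small", but with a human-assigned rank the comparison becomes arithmetic:

```python
# Hypothetical human-supplied ordering of text attribute values, after
# the size names in Figure 3.13. The ranks are illustrative.
SIZE_RANK = {"Small": 0, "Large": 1, "Mammoth": 2, "Mega": 3,
             "Colossal": 4, "Tremendous": 5}

def nearest(value, candidates):
    """Candidate whose human-assigned rank is closest to the given value."""
    return min(candidates, key=lambda c: abs(SIZE_RANK[c] - SIZE_RANK[value]))

print(nearest("Mega", ["Small", "Colossal", "Tremendous"]))  # -> Colossal
```

The drawback the text identifies remains: whenever the attribute list changes, a human must revisit the ranking, since the DBMS cannot maintain it automatically.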
Figure 3.13: Relational difficulties with attribute value similarities
3.2.3 Object-oriented structures
This subsection discusses the unique architecture of the object-oriented DBMS, and the
difficulties associated with using it.
Background. Object-oriented DBMS systems can store data properties, support property and
method inheritance, and support true aggregate objects. These functions are not possible with
the pure relational and hierarchical models addressed above. Within an object database, a data
object is an encapsulated private module, belonging to a "class" that contains other objects with
identical attributes. In addition to relevant data, the object contains a collection of procedures
(methods). These methods, and only these methods, can be executed against the data in the
object. The only exception to this rule is the set of methods of an object's parent classes. The
object inherits these methods, so they are also valid. The data object also inherits attributes from
its parent classes. See Figure 3.14.
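The inheritance relationship of Figure 3.14 maps naturally onto object-oriented code; the sketch below uses the figure's class names but invents the member details:

```python
# Sketch of Figure 3.14: a child class inherits both attributes and
# methods from its parent class.
class Toy:
    def __init__(self, color):
        self.color = color          # attribute inherited by all subclasses

    def select(self):               # method inherited by all subclasses
        return f"selected {type(self).__name__}"

class Ball(Toy):
    pass                            # inherits color and select() unchanged

class Truck(Toy):
    def __init__(self, color, size):
        super().__init__(color)
        self.size = size            # attribute added at the child level

t = Truck("Green", "Large")
print(t.select(), t.color, t.size)  # -> selected Truck Green Large
```

Only `select` and any methods a class defines or inherits can operate on its data, which is the encapsulation property discussed under the disadvantages below.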
Figure 3.14: Object-oriented DBMS supporting attribute and method inheritance
Disadvantages. The primary disadvantages of object-oriented DBMSs concern structural
alteration, query definition, and attribute value ordering.
Difficult alteration of the object-oriented structure. Changing an object-oriented indexing
scheme requires locating and updating all appropriate data locations. Adding an attribute or
method can require significant time and effort, partially due to polymorphism. As described in
Section 3.2.1, polymorphism describes the possibility that an identically named property or
method represents different concepts for different objects. Figure 3.15 illustrates polymorphism
for the "+" method.
Numerical data types: 4 + 5 = 9; string data types: "Tru" + "ck" = "Truck"
Figure 3.15: Polymorphism of methods: "+" adds numbers, concatenates strings
Polymorphism is powerful for an end user, but it complicates data index expansion. This is
especially true when attributes or methods have been defined at the child object level instead of
higher up in the structure.
When a structural change occurs, the DBMS must examine the
"flavor" of a property or method within each affected object. See Figure 3.16.
This adds
complexity relative to hierarchical and relational tables, which typically do not support
polymorphism (although they may contain attribute values in different locations that represent
different properties).
Figure 3.16: Polymorphism of attributes and associated ambiguities
The complexity of changing an object-oriented system's indexing scheme encourages very
careful initial planning of a database, and discourages structural improvement of a poorly
planned object DBMS. It is very difficult to design an object-oriented DBMS system for easy
maintenance, if the system is likely to experience change many times over its lifetime.
Complicated query definition. Query definition is complicated within object-oriented DBMSs,
due to a conflict between declarative languages (such as SQL) and object encapsulation.
SQL
assumes all data is equally accessible to any given command. This assumption is not true in an
object environment.
Only methods contained within an object or inherited from a parent can
operate upon an object's data. Thus, an "uncooperative" data object may compromise the simple
authoritarian nature of SQL. See Figure 3.17. A user must often use alternate methods to query
an object database, reducing the simplicity of data management and analysis tasks.
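The conflict can be made concrete with a small sketch in the style of Figure 3.17; the class and method names follow the figure, but the behavior shown is an illustration rather than any particular DBMS's semantics. A declarative "delete everything" command assumes every object supports deletion, but encapsulation lets a class omit the method:

```python
# Sketch of Figure 3.17: a uniform declarative operation fails where an
# object neither defines nor inherits the required method.
class Toy:
    def select(self):
        return "selected"

class Ball(Toy):
    pass                       # no delete method: only select is available

class Truck(Toy):
    def delete(self):
        return "deleted"

def delete_all(objects):
    """Attempt the same operation on every object, SQL-style."""
    results = []
    for obj in objects:
        try:
            results.append(obj.delete())
        except AttributeError:
            results.append(f"{type(obj).__name__}: no delete method")
    return results

print(delete_all([Ball(), Truck()]))
# -> ['Ball: no delete method', 'deleted']
```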
Figure 3.17: Failure of declarative operations due to encapsulation
3.2.4 Hybrid Structures
There are several hybrid forms of databases that combine features of the above systems. One
popular example is the relational-object model, which combines relational table architecture with
various features of object-oriented data. The wide variety of hybrid systems prohibits detailed
explanation here. Hybrid systems are promising solutions for specific data needs, although they
generally inherit some or all of the problems associated with their constituent forms.
3.3 Common problems across DBMS architectures
In addition to the index expandability problems outlined above, database management systems
share other common shortcomings. These problems are outlined below.
3.3.1 Inherent dissimilarity of DBMSs
One common problem is simply the existence of dissimilar DBMS structures. This problem is
not symptomatic of one type of DBMS, but always causes problems when more than one DBMS
is in use. PCDB users are inconvenienced when they have to submit a different query to each
desired database.
3.3.2 Error management and infeasible indices
Error management is another difficulty with current systems. Many DBMSs have no means for
identifying and correcting data errors during data entry, storage or retrieval. In Figure 3.18 a
hybrid PCDB structure uses a material (A), a process (B) and a feature (C) to characterize a
manufacturing operation.2
The user specifies one hierarchical index for each of the three
structures (A, B, C). The database system will allow an index choice of the form {A2, B2, C2},
specifying a paper material, drilling operation and slot feature.
The DBMS accepts this
combination even though a manufacturer would not attempt to drill a slot in a paper material.
The index {A2, B2, C2} is infeasible for practical use, but the DBMS allows a user to specify it
anyway.
Figure 3.18: Hybrid database structure requiring one choice each from A, B and C
If a user accidentally submits the query {A2, B2, C2} to an error-free database, an error message
will result when the requested index is found to contain no data. But if the user commits the
specification error during data entry, data will be stored in the wrong index and no error message
will result. Continuing the example from Figure 3.18, an operator may want to enter data from a
manufacturing operation that drills a hole in steel. The index for this operation is {A1, B2, C1}.
The operator may accidentally type in the infeasible index {A2, B2, C2} instead, causing the
data to be stored under the wrong index. After this error occurs, a query of the valid index
{A1, B2, C1} will not return the data that should be stored there, because the data will have
been stored in the infeasible index {A2, B2, C2}. Such an error causes inaccurate data
characterization, reducing the effectiveness of the database.
2 This is a simplified representation of some PCDBs currently in use.
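A minimal feasibility check for the {material, process, feature} indices of Figure 3.18 might look like the sketch below. The feasible set is invented for illustration; in practice it would be maintained by manufacturing engineers:

```python
# Hypothetical whitelist of feasible {material, process, feature} indices,
# after the A/B/C structures of Figure 3.18.
FEASIBLE = {
    ("steel", "cut", "slot"),
    ("steel", "drill", "hole"),
    ("paper", "cut", "slot"),
}

def validate(material, process, feature):
    """Reject infeasible index combinations before data entry or query."""
    if (material, process, feature) not in FEASIBLE:
        raise ValueError(f"infeasible index: {material}/{process}/{feature}")
    return True

print(validate("steel", "drill", "hole"))   # -> True
# validate("paper", "drill", "slot") would raise ValueError
```

Applying the same check at both entry and query time intercepts the {A2, B2, C2} error described above before any data is misfiled or any null search is run.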
3.3.3 Schema conversion
Converting a database schema from one type to another is often imperfect. For instance, a data object typically contains information about which operations a DBMS can perform on it. But the data in a relational or hierarchical schema typically lacks such by-object resolution. When converting data from relational or hierarchical schemas to object schemas, there may not be any way to automatically create the additional information stored in a data object.
However, the DBMS may not need certain contents of a data object. In this case, the data
objects may be incomplete without affecting the operation of the database.
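The gap can be illustrated with a small Python sketch: a converted relational row yields an object whose operation set must be filled in with defaults, since the source schema never recorded per-object permissions. The class name and the default operation set are illustrative assumptions, not features of any particular DBMS.

```python
# Illustrative sketch of the relational-to-object conversion gap: an object
# bundles its data with the operations permitted on it, while a relational
# row carries data alone, so the operation set cannot be recreated automatically.

class PCDataObject:
    """Object-scheme datum: a capability value plus its permitted operations."""
    def __init__(self, cpk, operations=None):
        self.cpk = cpk
        # Converted relational rows receive only a default operation set;
        # the original per-object permissions were never stored.
        self.operations = operations or {"read"}

relational_row = {"cpk": 1.33}          # no per-object operation information
converted = PCDataObject(relational_row["cpk"])
print(converted.operations)             # a default set, possibly incomplete
```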
3.4 The missing data problem
An additional major concern in PCDB implementation is the problem of missing data. Some potential causes of unpopulated data indices are:
- The enterprise has never undertaken the requested manufacturing process
- The enterprise has undertaken the requested manufacturing process, but has not catalogued it
- The enterprise has catalogued the process, but committed an error in recording its index
- The requestor committed an error in specifying the search, returning the wrong index
- A supplier has not undertaken or catalogued the requested process
- A supplier does not allow access to the requested data
These problems can be summarized with the following three categories:
1. Lack of access to, or existence of, data
2. Erroneous recording of data
3. Erroneous request for data
Problem categories (2) and (3) above can be addressed by providing improved tools for managing data entry and user queries, to reduce errors in index specification. Problem category (1) groups a lack of data access with a lack of data existence because in both cases the user is left with no means to obtain the data desired. In this case, statistical prediction methods can generate surrogate data from the existing data set.
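One simple statistical strategy of this kind, sketched below in Python, estimates an empty index from its nearest populated neighbors in the index space. The Euclidean metric, the choice of two neighbors, and the sample capability values are illustrative assumptions, not methods prescribed by this thesis.

```python
# A minimal surrogate-data sketch: when a queried index is empty, estimate its
# value from the nearest populated indices. Metric, k and data are illustrative.

def surrogate(query, populated, k=2):
    """Average the k populated indices closest (squared Euclidean) to the query."""
    ranked = sorted(populated,
                    key=lambda idx: sum((a - b) ** 2 for a, b in zip(idx, query)))
    nearest = ranked[:k]
    return sum(populated[idx] for idx in nearest) / len(nearest)

# Hypothetical capability values keyed by (material, process, feature) coordinates.
data = {(1, 2, 1): 1.30, (1, 2, 2): 1.10, (1, 1, 1): 1.65}
print(surrogate((1, 2, 3), data))  # 1.2 -- estimate for an unpopulated index
```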
3.5 Conclusion: relevance to PCDB structures
The MIT VRM group has seen PCDB implementations that use hybrid hierarchical DBMSs. These DBMSs suffer from many of the problems noted in section 3.2.1. Some of these problems parallel those discovered in previous VRM group surveys (Tata, 1999). Specifically, lack of database commonality across enterprises, flawed indexing schemes, large numbers of infeasible index combinations, and poor data population levels were previously found to plague PCDB implementation.
Switching from one commonly used DBMS architecture to another will not eliminate problems that affect PCDB use. Each DBMS has its own problems and shares problems with other DBMSs. Popular DBMS systems share basic problems that prevent efficient management of unpopulated data indices and determination of "similar" attributes/values.

Better error management tools can improve storage and retrieval accuracy in PCDBs. New tools and data environments that enable these improvements are described in Chapters 4 and 5.
4 Attribute-Based Indexing
Instead of indexing with one of the DBMS systems from Chapter 3, this thesis proposes a
method for rearranging the organization of data. The method uses only the characteristics of the
data to create a new indexing system, instead of relying upon the branched-tree structures that
are frequently used in PCDBs. This system can be created in one of two ways. If a data set is
already indexed, exhaustive "mining" of the data index for attributes can provide the information
necessary for re-indexing in a new form. If a data set is not already organized by an indexing
system, the data's characteristics - such as textual contents, responses to mathematical
algorithms, or any other observable properties - can provide the means for reorganization.
In either case, the resulting indexing system offers advantages over more traditional DBMSs.
Specifically, it simplifies data organization and makes the PCDB more versatile. To simplify
organization, the system maps different types of data into a single set of indices. This data may
come from separate, inconsistently structured volumes. The re-indexer creates one searchable
set of indices that includes all volumes, cataloguing all of the desired data. One search can then
be used to find data from any of the original volumes.
This chapter describes the data indexing structure developed for this thesis. First, basic concepts
relevant to the indexing scheme are discussed. The chapter then describes the components of the
attribute-based indexer, and concludes with a discussion of the indexer's learning methods.
4.1 Basic concepts
This section outlines the index hyperspace, two special cases in representation, and a rationale
for using this system.
4.1.1 Attribute coordinate hyperspace representation
To make data-related tasks easier, the re-indexing process organizes data into a large
multidimensional space, with each dimension of that space corresponding to one attribute of the
data. The result is illustrated in Figure 4.1, for a case in which three attributes (axes) are used to
describe the data. In the figure, these attributes are the item's name, its size, and the material of which it is made. The space is expandable to an arbitrary number of dimensions, but three dimensions are used here for visualization purposes. This representation depicts the indexing scheme, not the data each point contains. Each point in the hyperspace can contain data describing items represented by the point. This data can exist in essentially any form, including numerical data and formatted text. This thesis concentrates on numerical data forms.
[Figure: index space in three dimensions (ITEM, SIZE and MATERIAL axes), with the point [2,2,2] marked]
Figure 4.1: Index space in three dimensions
Each axis contains a series of coordinate number labels, and each number on an axis corresponds
with a value the relevant attribute can take. For example, the "item" attribute may take the
values "globe," "disco ball," or "lamp." Coordinate numbers on the ITEM axis represent each of
these potential values. See Figure 4.2.
Attribute Table:
1. ITEM
   1. Globe
   2. Disco Ball
   3. Lamp
2. SIZE
   1. 10"
   2. 12"
   3. 16"
3. MATERIAL
   1. Plastic
   2. Glass
Figure 4.2: Attribute table assigning values to axis coordinates
In this case, the number 1 on the ITEM axis may represent a globe, 2 may represent a disco ball, and 3 may represent a lamp. Higher numbers on the ITEM axis may represent other item names the system needs to represent. The SIZE and MATERIAL axes are similarly set up, with numbers on each axis representing values for each respective attribute. The 3-dimensional coordinate depicted in Figure 4.1, [2,2,2], represents a choice of 2 ("disco ball") for the item name, 2 for the size and 2 for the material. If a SIZE choice of 2 represents a 12-inch diameter, and a MATERIAL choice of 2 represents glass, the coordinate uniquely identifies the item as a 12-inch diameter glass disco ball.
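The mapping from coordinates to item descriptions can be sketched in Python using the attribute table of Figure 4.2. The table contents come from the text; the dictionary layout and function name are illustrative choices, not part of the thesis system.

```python
# Sketch of the attribute table (Figure 4.2) and decoding of the coordinate
# [2,2,2] from Figure 4.1 into attribute/value labels.

ATTRIBUTES = {
    1: ("ITEM",     {1: "Globe", 2: "Disco Ball", 3: "Lamp"}),
    2: ("SIZE",     {1: '10"', 2: '12"', 3: '16"'}),
    3: ("MATERIAL", {1: "Plastic", 2: "Glass"}),
}

def decode(coordinate):
    """Translate an index-space coordinate into attribute/value labels."""
    return {ATTRIBUTES[axis][0]: ATTRIBUTES[axis][1][value]
            for axis, value in enumerate(coordinate, start=1)}

print(decode([2, 2, 2]))  # {'ITEM': 'Disco Ball', 'SIZE': '12"', 'MATERIAL': 'Glass'}
```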
4.1.2 Representation of items with multiple values for an attribute
An item may have two valid values for one of its attributes. For instance, a glass disco ball may double as a hanging globe during non-party hours, making it a disco ball and a globe. Ideally, the indexing system should allocate a separate coordinate on the ITEM axis, perhaps labeled "disco ball and globe," to represent this duality. However, this is not always practical, particularly in a case where such combinations are frequent and unpredictable. In these situations, multiple data points can represent a single item. For the example described above, the item can have two data points in the hyperspace, in which case its data is represented by a combination of the "pure" data for each of the individual points. This is illustrated in Figure 4.3.
Data for the item then exists in both locations, and a query of the database for the desired object
will return data from both data points, with each data set labeled by source. The storage and
querying process is described in greater detail later in this chapter.
[Figure: index space with two data points, [2,2,1] and [2,2,2], representing a single item]
Figure 4.3: Representing items with multiple values for a single attribute
4.1.3 Representation of items with no values for an attribute
Alternately, an object may not have any values at all for a particular attribute. This may occur because the object is not describable by that attribute, or because the value for the attribute is unknown. In this case, the attribute simply does not appear in the description of the object. This feature is very important for cataloguing different types of objects within a single hyperspace. Each object is described using only the attributes (axes) that are relevant to it. For example, if a fourth axis called FLAVOR existed in the hyperspace in Figure 4.1, the new axis would not be relevant to any of the objects (the globe, disco ball or lamp). The FLAVOR attribute does not typically describe any of these items, so that attribute (and its axis) would be ignored in the descriptions of the items.
If candy items existed in the list of objects to be described, FLAVOR would then become a relevant axis. But candy description might not require the SIZE axis. Then we would have some objects described only by the ITEM, SIZE, and MATERIAL axes, while other objects would use only the ITEM, MATERIAL, and FLAVOR axes. Figure 4.4 illustrates this possibility in a sequence of two figures, illustrating the three relevant dimensions for each case.
[Figure: two three-dimensional subspaces of the index hyperspace: ITEM, SIZE and MATERIAL axes for non-candy items; ITEM, MATERIAL and FLAVOR axes for candy items]
Figure 4.4: Different attribute axes for non-candy items and candy items, respectively
Each of the representations in Figure 4.4 is really just a dimensional subset of the entire hyperspace. In this example, we have four dimensions (axes) in the hyperspace, and each object is represented in Figure 4.4 by three of those four dimensions. (To avoid confusion, this thesis does not present visual examples for hyperspaces of more than three dimensions.) If an object possessed name, size, material and flavor properties, it would have coordinates on all four of the axes, and would exist in the complete four-dimensional hyperspace. In fact, any number and combination of the axes can describe an object, depending on how many relevant attributes the object has.
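This "relevant axes only" convention can be sketched as follows. The axis numbering follows the four-attribute table discussed above (4 = FLAVOR), an attribute may hold several values as in section 4.1.2, and the specific items and their coordinate values are illustrative assumptions.

```python
# Sketch of sparse item representation: each item carries coordinates only for
# the axes that describe it. Items and coordinate values are illustrative.

# ITEM/SIZE/MATERIAL describe the disco ball; ITEM/MATERIAL/FLAVOR describe candy.
disco_ball_globe = {1: [1, 2], 2: [2], 3: [2]}  # two ITEM values: globe and disco ball
candy = {1: [4], 3: [3], 4: [2]}                # no SIZE axis at all; ITEM value 4
                                                # is a hypothetical candy name

def relevant_axes(item):
    """Axes on which the item has coordinates; all other axes are simply absent."""
    return sorted(item)

print(relevant_axes(disco_ball_globe))  # [1, 2, 3]
print(relevant_axes(candy))             # [1, 3, 4]
```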
4.1.4 Rationale for representation
This particular representation is highly valuable for data storage and analysis because it enables functions that are difficult to implement in other structures. With each attribute represented by an orthogonal (independent) axis, an object becomes a set of coordinates, enabling several spatially oriented strategies for data manipulation.

The attribute-based structure allows for distance-based analysis of similar data indices, as they are located close together in the index hyperspace. Examples of this type of analysis include retrieval of data from indices similar to a chosen index, estimation of (or substitution for) data in empty indices, improved data index expansion, and some forms of error management. Attribute-based re-indexing also improves learning algorithms that use human input to discover new data features. These analysis tasks are valuable to industry because they assist in solving difficult problems. Later sections of this chapter outline these tasks in greater detail.
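A minimal sketch of the distance-based analysis, assuming items are stored as coordinate tuples, appears below. The Euclidean metric is one possible choice; the thesis does not fix a particular metric, and treating coordinate numbers on nominal axes as ordered quantities is itself an assumption.

```python
# Sketch of distance-based similarity in the index hyperspace: "similar"
# indices are simply nearby coordinate points. Metric and data are illustrative.

import math

def distance(p, q):
    """Euclidean distance between two index-space coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def most_similar(target, indices):
    """Return the populated index closest to the target index."""
    return min(indices, key=lambda idx: distance(target, idx))

populated = [(1, 1, 2), (2, 2, 2), (3, 3, 1)]
print(most_similar((2, 2, 1), populated))  # (2, 2, 2)
```

In practice a metric suited to each axis (for example, one that ignores nominal attributes such as item name) would likely give more meaningful similarity rankings.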
4.2 Elements of the indexing system
This section discusses system elements in detail. The section begins with a review of needs the indexer must fulfill. Then components of the code and the rules file are discussed, followed by the search process. The section concludes by detailing the attribute expansion process.
To index data in the attribute-based structure, each datum must have a code that identifies its
contents. This code maps the datum to the index space and is the only link between the user and
the data, so it must meet several needs. These needs are listed below.
- Represent each relevant attribute axis
- Suppress axes that are not relevant to a particular object
- Represent each appropriate value for an axis
- Discern between axes that can take only one value and those that can take multiple values
- Accommodate wildcard characters as axis values
- Serve as data descriptors and search criteria
- Enable searching by a subset of attribute axes
- Consume minimal system resources for generation, storage and manipulation
- Enable expansion of the index with minimal human effort
- Represent other indexing structures, such as hierarchical trees
- Enable checking of queries and new data indices for validity
- Place data with similar characteristics in close proximity
To satisfy these needs, the system uses a simple Polish notation string to characterize data
attributes and represent queries. Unlike many other indexing structures, this notation can satisfy
the needs listed above. This string is custom-generated for each datum in a data set and then
searched against during query operations. See Figure 4.5.
Attribute Table:
1. ITEM
   1. Globe
   2. Disco Ball
   3. Lamp
2. SIZE
   1. 10"
   2. 12"
   3. 16"
3. MATERIAL
   1. Plastic
   2. Glass
4. FLAVOR
   1. Sweet
   2. Sour
   3. Bitter

Attributes for this item:
ITEM = Disco Ball (2)
SIZE = 12" (2)
MATERIAL = Glass (2)

Code: 1,2,-,+,2,2,-,+,3,2,-,+,+
Figure 4.5: Attributes translated into index code
Only the identified attributes appear numerically in the code. For instance, the item name "disco ball" (value 2 for axis 1 in the attribute table) appears at the beginning of the code via the substring "1,2,-,+." The 12-inch size (value 2 for axis 2) appears next, via the substring "2,2,-,+," and the "glass" material property (value 2 for axis 3) appears via the substring "3,2,-,+." Since the flavor attribute is irrelevant to a disco ball, it does not appear anywhere in the code. The operators "+" and "-" are described below.

4.2.1 Components
This subsection describes the index code and rules file components of the indexing system.
Index code. The text string code in Figure 4.5 contains numbers identifying attributes (axes) and values (coordinates), but other symbolic components are necessary. These other components take two general forms. The first form is the separator, which identifies the end of one part of the code and the beginning of another. The second form is the operator, which represents groupings of code symbols into substrings. These substrings identify the properties of the coded items.
Two types of separators are used in the code. The first type, the symbol separator, appears in
this notation as a comma (,). This separator is a boundary between all other symbols, with the
sole task of marking the end of one symbol and the beginning of another. This is important if
multiple text characters represent a single symbol. For example, the number "xyz" is very
different from the two numbers "xy" and "z," so the comma is used to discern between "xyz"
and "xy,z" in the code.
Any non-numeric symbol can serve this purpose, such as a space
between the symbols, but this thesis uses the comma to make the distinctions deliberately. The
second type of separator, the attribute separator, appears as the plus symbol (+). The attribute
separator indicates the end of one attribute-value substring and the beginning of another. These
separators are illustrated in Figures 4.6 and 4.7.
[Figure: substring "1,2,-,+" labeled with attribute number (1), value number (2), SVC operator (-) and attribute separator (+)]
Figure 4.6: SVC operator in substring
[Figure: substring "1,2,3,6,&,+" labeled with attribute number (1), value numbers (2, 3, 6), MVC operator (&) and attribute separator (+)]
Figure 4.7: MVC operator in substring
Operators are more complex than separators, and associate values with the attributes they describe. An operator symbol may be one of three types. The first form is the single-value concatenation (SVC), appearing here as the minus symbol (-). The SVC operator pairs an attribute axis number with one value number appearing directly after it. See Figure 4.6. This operator is used when an attribute can only have a single value, and the (-) operator can optionally verify that the user has selected only one value for the attribute. The second operator type is the multiple-value concatenation (MVC), appearing here as the ampersand symbol (&). The MVC allows an attribute to take several values, and pairs an attribute axis number with multiple value numbers. See Figure 4.7.
In Figure 4.6, the single value number 2 modifies the attribute number 1, meaning the hyperspace axis labeled "1" will have coordinate "2" for the item being described. The "-" operator explicitly assigns the value number to the attribute number, and the "+" operator indicates the end of the substring (and the end of assignments for axis 1). In Figure 4.7, the value numbers 2, 3, and 6 modify the attribute number 1, meaning the axis labeled "1" will have all three coordinates. The "&" operator performs this association, and the "+" operator ends the substring of assignments to the attribute. Concatenating substrings like those in Figures 4.6 and 4.7 will create a full code for the description of an item. Each attribute number appears in the code a maximum of one time. A final "+" operator follows the entire code, to specify the end of the code. The code in Figure 4.5 illustrates a full code for three attributes.
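The generation and parsing of such codes can be sketched in Python. The substring grammar follows Figures 4.5 through 4.7; the function names and the round-trip design are illustrative assumptions, not the thesis implementation.

```python
# Sketch of building and parsing the notation described above: each attribute
# substring is "axis,value,-,+" (SVC) or "axis,v1,v2,...,&,+" (MVC), and a
# final "+" terminates the whole code.

def encode(item):
    """Build a code string from {axis: [values]}; '-' for one value, '&' for several."""
    parts = []
    for axis in sorted(item):
        values = item[axis]
        op = "-" if len(values) == 1 else "&"
        parts.append(",".join([str(axis)] + [str(v) for v in values] + [op]))
    # "+" ends each attribute substring; one more final "+" ends the code.
    return ",+,".join(parts) + ",+,+"

def parse(code):
    """Recover {axis: [values]} from a code string."""
    item = {}
    # rstrip removes the trailing separator characters; each substring ends
    # in an operator symbol, so values are never stripped.
    for sub in code.rstrip(",+").split(",+,"):
        symbols = sub.split(",")
        item[int(symbols[0])] = [int(s) for s in symbols[1:-1]]
    return item

code = encode({1: [2], 2: [2], 3: [2]})
print(code)         # 1,2,-,+,2,2,-,+,3,2,-,+,+  (matches the Figure 4.5 code)
print(parse(code))  # {1: [2], 2: [2], 3: [2]}
```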
Rules file. There are many impossible combinations of attribute values in a typical process
capability database. A manufacturer may not make plastic disco balls, for example, in which
case a database pertaining to that manufacturer will not contain any information about them.
However, a user might specify this combination of attributes while constructing a search or
storing new data in a database. This might happen accidentally by way of a user error, or the
user might purposefully specify the combination of attribute values because s/he does not know
the combination is invalid.
This error can occur during data entry and data retrieval. If the user commits the error when
specifying attribute criteria, a system error or an empty set of returned data will result. If the
error occurs during data entry, the data will be stored incorrectly, making it difficult or
impossible to find later. Both of these results can reduce the effectiveness of the database and
frustrate database users.
A rules file eliminates the possibility of specifying impossible attribute value combinations,
immediately letting the user know when specified values are incompatible with each other. Such
a file is illustrated in Figure 4.8. The file contains all "illegal" combinations of index values. An
automated system can consult the rules file to identify invalid combinations, and then make
errors or invalid combinations evident to the user. Consultation can occur in real-time, with the
user's selections of attribute values automatically narrowed to represent only allowable
combinations. This is illustrated sequentially in Figure 4.9. Alternately, the system may wait for
the user to completely specify the code, and then consult the file once to verify that the
combination of attribute values is valid. The real-time consultation is more useful as an aid to specifying data entry and query codes, because it provides the user with more information during the process.
ITEM = "disco ball"
SIZE = [any]
MATERIAL = "plastic"
Code: 1,2,-,+,2,*,-,+,3,1,-,+,+
Figure 4.8: A rules file, with a sample invalid index combination
[Figure: two menu states side by side. Left: all options listed under ITEM (Globe, Disco Ball, Lamp), SIZE (10", 12", 16") and MAT'L (Plastic, Glass). Right: after "Disco Ball" is selected under ITEM, "Plastic" is removed from the MAT'L menu]
Figure 4.9: Real-time specification: Selection of "disco ball" eliminates "plastic"
4.2.2 Processes
This subsection discusses the system's search and attribute expansion processes.
Search function. To search for a particular item in a code, the user submits a query with a
format very similar to the code used to describe items. The system compares this query directly
with each item in a database, and returns all items satisfying the requested criteria to the user.
Like item codes, the query code is a string using attribute numbers, value numbers, separators
and operators. Each attribute substring (see Figures 4.6 and 4.7) indicates the desired values for
each attribute. An attribute substring also indicates whether some or all values need to appear in
order to satisfy the search. Only attributes that the user specifies are used as search criteria. If a user does not specify a requested value for an attribute, the system will not use that attribute as a search criterion. This is illustrated in Figure 4.8.
There are differences between operators in item coding and query coding. Object codes use the ampersand (&) operator as a multiple-value concatenation (MVC), when a single attribute takes multiple values to describe an item. Object codes also use the minus (-) operator as a single-value concatenation (SVC), to indicate when an attribute can only take one value. A query uses these two operators differently. The (&) operator acts as a logical AND, and requires all requested values for an attribute to appear in an item's code. If only some of the requested values appear in an item's code, the system will not return the item to the user. The query code uses the (-) operator as a logical OR, requiring only one of the requested values for an attribute to appear in the item's code. If at least one requested value appears for an attribute in the item's code, the item is not rejected. An item's index code must satisfy the requested criteria for all attributes in the query, or it is rejected and will not be returned to the user.
[Figure: query code 1,2,3,6,&,+,2,4,5,-,+,3,1,2,&,+,+ (axis 1: values 2, 3 and 6; axis 2: value 4 or 5; axis 3: values 1 and 2) and a set of returned data index codes satisfying it]
Figure 4.10: Query structure and sample returned index codes
In Figure 4.10, the function of the (-) and (&) query operators is evident from the returned data index codes. The attribute codes for axis 1 in the returned data indices contain all three requested values (2, 3, and 6). This is a requirement, because the query uses the ampersand "AND" operator (&) to specify the query for the first attribute. The same applies for axis 3. The query specifies the minus "OR" operator (-) for axis 2, so only one of the requested values for axis 2 has to appear in a data index. Finally, no query requests are made for axis numbers higher than 3, so any values can appear for higher-numbered axes. These values will not disqualify a data index from being returned to a user in response to a query.
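The matching semantics described above can be sketched in Python, with items and queries in an {axis: values} form rather than raw code strings; that representation and the function name are illustrative assumptions.

```python
# Sketch of query matching: in a query, "&" is a logical AND over an
# attribute's requested values and "-" is a logical OR; every queried
# attribute must be satisfied, and unqueried axes are ignored.

def matches(item, query):
    """True if the item's index satisfies every attribute criterion in the query."""
    for axis, (op, wanted) in query.items():
        have = set(item.get(axis, []))
        if op == "&" and not set(wanted) <= have:
            return False            # AND: all requested values must appear
        if op == "-" and not set(wanted) & have:
            return False            # OR: at least one requested value must appear
    return True

# Query from Figure 4.10: axis 1 = 2, 3 AND 6; axis 2 = 4 OR 5; axis 3 = 1 AND 2.
query = {1: ("&", [2, 3, 6]), 2: ("-", [4, 5]), 3: ("&", [1, 2])}
print(matches({1: [2, 3, 6], 2: [5], 3: [1, 2]}, query))  # True
print(matches({1: [2, 3], 2: [4], 3: [1, 2]}, query))     # False: axis 1 lacks 6
```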
A final difference between item index codes and query codes is the use of the wildcard character. Here, the asterisk (*) denotes a wildcard. In a query, a wildcard signifies that any value of the corresponding attribute will satisfy the search criteria. A search code wildcard will then return an item with any value for that attribute. When a user specifies a wildcard for an attribute in a search, the attribute becomes effectively irrelevant to the search. This allows a user to selectively eliminate certain attributes from becoming search criteria, enabling the user to submit queries based upon only a subset of the data attributes. See Figures 4.11 and 4.12.
[Figure: code substring with the wildcard (*) as an attribute value]
Figure 4.11: Use of wildcard for data characterization
Specified search criteria: 1,2,&,+,2,*,-,+,3,2,-,+,+
Intent: ITEM = Disco Ball; SIZE = Any; MATERIAL = Glass
Result: Glass Disco Ball of any size is returned
Figure 4.12: Result of wildcard search
Following the item description concept from the introduction of this section, queries can be constructed without using wildcards, by specifying only those attributes selected by the user, and removing any attributes without specified values from the query code. The difference between the two strategies appears in Figure 4.13. However, this method eliminates one useful function of the system. A user might want to specify that only data items with certain attributes should be returned, regardless of the values actually specified for the attributes. For example, a user might only be interested in lamps that have values associated with the "size" attribute. In this case, the user would want to eliminate all instances of lamps with no size specified. It is not efficient to build a query specifying every possible value for size. It is more efficient to simply specify the wildcard character as the desired size value. In this case, the system returns all lamps with any specified sizes, as illustrated in Figure 4.12, and the search rejects all lamps with no size(s) specified.
Specified search criteria (wildcard): 1,2,&,+,2,*,-,+,3,2,-,+,+
Specified search criteria (omission): 1,2,&,+,3,2,-,+,+
Figure 4.13: Query codes with wildcard (left) and omission (right) for attribute 2
A user can implement wildcard search criteria and non-wildcard search criteria concurrently,
depending upon the user's wishes. If a user does not care about a lamp's "material" attribute,
s/he may simply drop the "material" attribute from the query code. If, in the same search, a user
wants to make sure that some size is specified for the lamp, s/he can use the wildcard character
for the "size" attribute. The query will then only return lamps with specified sizes, ignoring
lamps with no size specified. The search will entirely ignore each lamp's material. The system
will return lamps with the "material" attribute specified, and lamps with no specification at all
for the "material" attribute.
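The distinction drawn above between a wildcard criterion and an omitted attribute can be sketched as follows; the criteria encoding, function name and example items are illustrative assumptions.

```python
# Sketch of wildcard semantics: "*" requires the attribute to be present with
# some value, while omitting the attribute from the criteria ignores it entirely.

def satisfies(item, criteria):
    """criteria maps axis -> list of acceptable values, or '*' for any value."""
    for axis, wanted in criteria.items():
        if axis not in item:
            return False                      # a queried axis must exist on the item
        if wanted != "*" and not set(wanted) & set(item[axis]):
            return False
    return True

sized_lamp = {1: [3], 2: [2]}   # a lamp (ITEM value 3) carrying a SIZE axis
unsized_lamp = {1: [3]}         # a lamp with no SIZE axis at all

print(satisfies(sized_lamp,   {1: [3], 2: "*"}))  # True: some size is specified
print(satisfies(unsized_lamp, {1: [3], 2: "*"}))  # False: wildcard still demands SIZE
print(satisfies(unsized_lamp, {1: [3]}))          # True: SIZE omitted, so ignored
```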
Attribute expansion function. A database may grow in many ways. The most frequent
growth mode is the addition of new data through the indexing system, but a database may also
grow by the addition of new indices. Index expansion involves adding new attributes and/or new
values for an attribute. This changes the indexing scheme, and the potential for negative
consequences is evident from Chapter 3. Database indices should be designed to accommodate
all foreseeable types of data when the database is first created, but there is no guarantee that this
will happen. Furthermore, databases may grow beyond the original design vision, making index
expansion a requirement.
The attribute-based indexing scheme described in this chapter allows for simplified expansion of indexing. The attribute table, introduced earlier in this chapter, contains an indexed entry for each attribute type, and the values corresponding to each type. When a human operator wants to change the indexing scheme, only the attribute tables require the human's attention. Upon update of attributes and/or attribute values, the entire system may be re-indexed in a fully automated fashion, analyzing and indexing each datum according to the updated indexing criteria. Only data affected by the indexing change must be updated, but the system can update the entire data set as an added measure of error reduction. See Figures 4.14 and 4.15.
Attribute Table (before alteration):
1. ITEM (1. Globe, 2. Disco Ball, 3. Lamp)
2. SIZE (1. 10", 2. 12", 3. 16")
3. MATERIAL (1. Plastic, 2. Glass)

Attribute Table (after alteration, adding a FLAVOR attribute):
1. ITEM (1. Globe, 2. Disco Ball, 3. Lamp)
2. SIZE (1. 10", 2. 12", 3. 16")
3. MATERIAL (1. Plastic, 2. Glass)
4. FLAVOR (1. Sweet, 2. Sour, 3. Bitter)
Figure 4.14: Alteration of attribute table
[Figure: the old dataset is re-indexed against the altered attribute table to produce a new dataset]
Figure 4.15: Update of dataset using altered attribute table
This index update procedure reduces the degree of effort necessary for re-indexing, by reducing
the user's work to simple manipulation of values in a list or table. The remaining data analysis
and re-indexing work is suitable for implementation in fully automated fashion, just as the
original indexing process can be fully automated once the attributes have been established.
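The expansion workflow of Figures 4.14 and 4.15 can be sketched in Python: the operator edits only the attribute table, and re-indexing then runs automatically against it. The raw text items and the word-lookup indexing step are illustrative assumptions about how data might be analyzed.

```python
# Sketch of table-driven re-indexing: the attribute table maps labels to value
# numbers, and every item's index is rebuilt from the current table.

table = {
    1: ("ITEM",     {"Globe": 1, "Disco Ball": 2, "Lamp": 3}),
    2: ("SIZE",     {'10"': 1, '12"': 2, '16"': 3}),
    3: ("MATERIAL", {"Plastic": 1, "Glass": 2}),
}

def reindex(raw_items, table):
    """Rebuild every item's {axis: value} index from the current table."""
    indexed = []
    for words in raw_items:
        coord = {}
        for axis, (_, values) in table.items():
            for word in words:
                if word in values:
                    coord[axis] = values[word]
        indexed.append(coord)
    return indexed

raw = [['12"', "Ceramic", "Globe"]]
print(reindex(raw, table))   # [{1: 1, 2: 2}] -- "Ceramic" not yet in the table

table[3][1]["Ceramic"] = 3   # the operator alters only the attribute table
print(reindex(raw, table))   # [{1: 1, 2: 2, 3: 3}] -- fully automated re-index
```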
4.3 Attribute learning
This section outlines the motivation for using a learning algorithm. It then discusses the possibility for human-machine participation, and the specific role of humans in the learning process described in this thesis.
4.3.1 Motivation
When data is added to a database, the database increases in size. With the addition of thousands
or millions of data points over time, a database can become very large and require significant
computer resources to maintain. The number of data types and attributes in the database can also
increase, requiring changes to the indexing structure. A small database containing as many as a
few thousand entries might be managed by a human operator, but the difficulty of managing
databases increases with the database's size.
Managing a large data set requires actions beyond the abilities of a human. These actions include checking for errors or new data attributes, expanding database indices to accommodate new data types, and re-indexing data to represent changes to the database. A human working alone will find these tasks impossible to achieve for a large database, because they require individual attention to thousands or millions of data points. The human might require hundreds or thousands of hours to accomplish even one of these maintenance tasks. To thoroughly accomplish them, the human requires computer assistance.
Computers are very good at quickly evaluating and comparing numbers, so the process of
recognizing new attributes can be left to a machine. In the case of text databases, a computer can
try to recognize new attributes by scanning the text and looking for new words and phrases. To
do this, the computer must parse each datum and look for unfamiliar combinations of letters and words. This is a numerical process, so a computer can recognize and store unfamiliar words and phrases very quickly.
Computers are not as good at learning without human assistance, however.
This problem
prevents a computer from scanning data, recognizing new attributes and accurately modifying
the database's indexing scheme to accommodate them. Only humans can understand the context
surrounding a potential new attribute, and then decide whether it is important or not. The
learning strategy described in the next section makes use of the computer's ability to compare
and recognize new attributes quickly, while leaving the decision-making process to the human.
4.3.2 Human-machine learning
Large text databases can contain millions of words, some of which are important for
categorization. Other words, such as "the" and "a," are typically not important because they do
not add any information about the classification of a text datum. During index expansion, the
discovery of "new" words in a database can offer insight to a human user. New words may
provide hints about the new values an attribute might take, or the new attributes that might be
added to the indexing scheme to improve data classification. Scouring a large database for these
words can be time-consuming, but the job is easier when the system recalls the words that have
already been analyzed.
The indexing system presented in this thesis enables simple learning when indexing a semi-structured data set. An example of learning is found in the parsing and indexing of textual data entries. On a periodic basis, the system may scan the database for any words that are not recognized by the system. The goal of the scan is to identify words that are not search terms, and words that are new search terms and should be added to the indexing scheme -- i.e., to the basic table.
An ignore file, analogous to the rules file for invalid database index combinations, contains all
words that are identified as irrelevant to searching. See Figure 4.16. When the data set scan
identifies a word (or more universally, a potential attribute value from a generic data type) that is neither in the ignore file nor in the current attribute table, the system makes note of the word.
See Figure 4.17. Upon completion of the scan, the system then makes a "best guess" at the
relevance of the word (or potential attribute value), and the attribute type to which the relevant
value might belong. The system then presents a list of unknown words to a human user, who can
verify or change the system's guesses. See Figure 4.18.
Ignore File:
a
about
an
approximate
at
Figure 4.16: Portion of an ignore file, listed alphabetically
[Figure 4.17: Recognition of words unknown to system -- scanning the new data items "Approx. 14" Glass Disco Ball" and "12" Ceramic Globe" recognizes the attributes Glass and Disco Ball, ignores no words, and flags Approx., 14", Ceramic, 12", and Globe as unrecognized]
[Figure 4.18: Presentation of new words to user for assistance in categorization -- e.g., "Approx." looks like "Approximate" in the ignore file (ignore or keep?); "14"" looks like "12"" in the SIZE attribute (ignore, keep in SIZE, or keep elsewhere?); "Ceramic" is ambiguous (ignore, or keep under a chosen location: ITEM, SIZE, or MATERIAL)]
All words the user deems relevant and then categorizes are placed in the appropriate location of
the attribute table, while the ignore file stores irrelevant words so they will be recognized again
later. The database can then be re-indexed according to the expanded database index, and the
system will appropriately categorize all data containing the newly identified attributes and/or
attribute values. See Figure 4.19. This process may also search for combinations of words that
are frequently repeated, as a data attribute value may consist of a series of words in succession.
[Figure 4.19: Illustration of learning process -- before learning, the attribute table holds ITEM (Globe, Disco Ball, Lamp), SIZE (10", 12", 16"), and MATERIAL (Plastic, Glass); during learning, 14" is assigned to SIZE, Ceramic to MATERIAL, and Approx. to the ignore file; after learning, the table additionally contains 14" under SIZE and Ceramic under MATERIAL]
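The scan-and-guess step described above can be sketched in a few lines. This is a minimal illustration, not the thesis prototype: a simple string-similarity heuristic (Python's difflib) stands in for the unspecified "best guess" logic, and the attribute table and ignore file are the hypothetical hobby-shop examples from the figures. Single words only; multi-word values such as "Disco Ball" would additionally need the phrase detection mentioned above.

```python
import difflib

def scan_for_unknown_words(entries, attribute_table, ignore_words):
    """Collect words that are neither known attribute values nor ignored,
    with a best guess at where each might belong."""
    known = {v.lower(): attr for attr, vals in attribute_table.items()
             for v in vals}
    ignored = [w.lower() for w in ignore_words]
    guesses = {}
    for entry in entries:
        for word in entry.split():
            w = word.lower().strip('.,')
            if w in known or w in ignored or w in guesses:
                continue
            # Best guess: does the word resemble an ignored word
            # or a known attribute value?
            near_ignore = difflib.get_close_matches(w, ignored, n=1, cutoff=0.6)
            near_known = difflib.get_close_matches(w, known, n=1, cutoff=0.6)
            if near_ignore:
                guesses[w] = ('ignore?', near_ignore[0])
            elif near_known:
                guesses[w] = (known[near_known[0]] + '?', near_known[0])
            else:
                guesses[w] = ('unknown', None)
    return guesses  # presented to a human user for confirmation

table = {'ITEM': ['Globe', 'Disco Ball'], 'SIZE': ['10"', '12"'],
         'MATERIAL': ['Plastic', 'Glass']}
ignore = ['a', 'about', 'an', 'approximate', 'at']
print(scan_for_unknown_words(['Approx. 14" Glass Disco Ball'], table, ignore))
```

The output mirrors Figure 4.18: "Approx." is offered as a likely ignore-file entry, 14" as a likely SIZE value, and anything with no close match is left for the human to categorize.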
4.3.3 Precise roles for human intervention in the system
Much of an indexing or learning/re-indexing process can be automated, but some portions of the process require human assistance for best implementation. These portions require delicate decisions, or considerations that are too complex or "fuzzy" for reliable computer modeling. Previous sections of this chapter mention this fact, but the precise role of humans in the system deserves more attention. Some portions of the process that require human assistance appear below.
- Creating the basic framework for a new or converted indexing scheme
- Recognizing new words and phrases when simple or context-sensitive recognition fails
- Decision-making in cases where an existing indexing system is ambiguous or difficult to re-index
When generating a new index for semi-structured data, the attribute table is first constructed by hand. A human assigns unique numbers to attribute types, and to the possible values for each attribute. Since a human generates the attribute table by hand, only the attribute types and values known ahead of time will be included in the basic table.
A learning algorithm, as outlined above, expands this basic table. Human intervention is necessary during this process because the system will frequently be uncertain about certain data features.
As an example, a hobby shop may implement descriptive text strings to characterize trading card
sales data, like the auction titles described earlier. For re-indexing purposes, a parser scours the
text string for recognizable words and phrases.
Despite sophisticated context-sensitive
operations, an automated system may not readily classify some words or phrases as descriptive
of only one attribute. An example, illustrated in Figure 4.20, is the name "Donruss," which
could be the name of either an athlete or a trading card manufacturer. If the system does not
already know the name, and can not determine what it describes, a human expert must step in
and finish the job.
[Figure 4.20: Theoretically ambiguous text descriptor and nature of the ambiguity -- in the sale descriptor "1994 Donruss Puckett", the table identifies "1994" as YEAR and "Puckett" as NAME1, but "Donruss" is ambiguous: MFR1 or NAME2?]
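The ambiguity test itself is straightforward to sketch: a token that appears as a value under more than one attribute cannot be indexed automatically and must be deferred to a human. The attribute table below is hypothetical, mirroring the Figure 4.20 example:

```python
def classify_token(token, attribute_table):
    """Return the attribute a token describes, a list of candidates when
    the token is ambiguous (deferred to a human), or None if unknown."""
    matches = [attr for attr, values in attribute_table.items()
               if token in values]
    if len(matches) == 1:
        return matches[0]      # unambiguous: index automatically
    return matches or None     # ambiguous candidate list, or None if unknown

table = {'YEAR': ['1994'], 'NAME1': ['Puckett'],
         'NAME2': ['Donruss'], 'MFR1': ['Donruss', 'Fleer']}
for tok in '1994 Donruss Puckett'.split():
    print(tok, '->', classify_token(tok, table))
```

Here "1994" and "Puckett" classify automatically, while "Donruss" returns two candidate attributes and would be posed to the human operator as a question.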
A third role for human process involvement may occur during re-indexing of a structured data set. While identifying characteristics in the data set's original structure, the automated system may locate attributes that appear similar, but do not satisfy all criteria for inclusion as a single attribute. In this case, the system will leave the structure unaltered, and then ask for human assistance after the rest of the re-indexing process is complete. The system may pose all unknown relations to the human operator as a series of individual questions.
Machine learning is a rich research field, and this thesis does not propose to overtake it for the
purposes of simplifying database expansion. Unfortunately, the human knowledge and decision
process is extremely complex and not fully understood.
There is no current model that fully
emulates the human thought process, and until such a system can be created, human assistance
will be necessary for complex and "fuzzy" decision making processes. (Duda and Hart, 1973)
For this reason, the re-indexing algorithms in this project do not try to expand the envelope of
machine learning technology. Further work in the realm of cognitive machines should supply
projects like this one with better tools for unsupervised decisions.
4.4 Conclusion
The attribute-based system described in this chapter serves several purposes. They are summarized here.
- The system can generate indices for semi-structured data sets from arbitrary sources, as long as the data set offers some means for extracting feature information.
- The system can convert a hierarchical tree, popular in PCDB applications, into an attribute-based hyperspace, by replacing branched structures with the attribute-based index code and a rules file.
- The system indexes a datum only by attributes the datum is seen to contain. This reduces complexity of the indexing codes, and improves search efficiency.
- Index coding uses a symbolic lexicon to represent attributes, their values, limitations on these values, and other features to a machine.
- Submitted codes are checked against a rules file for validity during searching and data entry tasks.
- The system optionally creates queries using only those attributes specified by a user, improving search efficiency by ignoring attributes irrelevant to the search. This is not possible in a hierarchical system.
- The indexing structure can be expanded and re-indexed in a semi-automated fashion with human assistance.
- The hyperspace representation places similar data points in close proximity, enabling the use of better data analysis tools and strategies for missing data.
The benefits of these features become clearer in Chapter 5. Representing data with this index structure allows the user to perform analysis tasks in the proper data environment. Without the benefits afforded by this representation, the tasks described in Chapter 5 would be extremely difficult, if not impossible, to achieve.
5 Strategies for Missing Data
When a user queries a PCDB, there is no guarantee that data will be returned. A query might point to data indices that contain no data, resulting in a null response. Depending on the indexing structure used, different strategies are available for managing null responses.
The industry survey in Chapter 2 revealed several methods to find surrogate data when a query
returns no data. None of the popular methods from the survey involve further use of the PCDB,
despite the fact that a PCDB can provide further information when a query fails to return data. It
is clear from this result that industry is not using the data to its full potential.
This chapter builds prediction functionality onto the data indexing structure from Chapter 4. The
subject of this chapter is a formal data-driven process for making predictions when a query
results in a null response. The prediction process takes three basic steps.
1) The procedure looks for populated data points near the unpopulated data point.
2) Linear regressions parallel to each hyperspace axis make predictions for the unknown
value(s).
3) A weighting system assigns values to each axis based upon its regression error, and a
weighted sum provides the final prediction.
5.1 Justification of methods
This thesis uses regression analysis to generate surrogate data for unpopulated data indices.
When a queried data index is unpopulated, each chosen attribute axis is analyzed with regression
techniques. For each attribute axis, the method predicts a value for the unknown data. The
results from each axis contribute to a weighted average that predicts a "best-guess" value for the
unknown data. The remainder of this chapter describes the details of this procedure.
This section begins by describing a previously proposed index-based similarity measure, and noting its shortcomings. The section then outlines the motivations for using a data-driven similarity measure. The last two topics of this section discuss a hybrid regression algorithm that can predict surrogate data values for unpopulated database indices.
5.1.1 Index-based similarity
A similarity measure based on indexing structure was proposed by (Tata, 1999). This measure determines similarity by performing a Z value test on data indices, rather than by comparing the data stored under the indices. For two data indices, the Z value calculation follows:

Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{m} + \dfrac{\sigma_2^2}{n}}}    (1)

Here, x̄1 and x̄2 represent the mean of each index code's digits, σ1 and σ2 represent the standard deviation of each index code's digits, and m and n represent the number of digits in each of the two index codes. More similar data indices will have a value of Z closer to zero; less similar indices will have a value of Z farther from zero. This value, which can be positive or negative, identifies the data point(s) with indices most similar to the unpopulated index. By comparing Z values of the data indices instead of the data, this method avoids the missing data problem. Confidence intervals for the method are available in statistical distribution charts. For example, a Z value with an absolute value less than 1.96 will provide a 95% confidence interval. See Figure 5.1.
[Figure 5.1: Z value calculation for two index codes -- Code 1: 1, 4, 3, 6, 4 (mean 3.6, sigma 1.8166); Code 2: 1, 2, 3, 5, 4 (mean 3, sigma 1.5811); Z = 0.557]
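The Figure 5.1 calculation can be reproduced in a few lines of Python. This is a sketch of the test as defined by Equation (1); sample standard deviations are assumed, which matches the figure's values:

```python
from statistics import mean, stdev
from math import sqrt

def index_z_value(code1, code2):
    """Z value test on two index codes (Tata, 1999): compares the codes'
    digit statistics, not the data stored under the indices."""
    m, n = len(code1), len(code2)
    x1, x2 = mean(code1), mean(code2)
    s1, s2 = stdev(code1), stdev(code2)   # sample standard deviations
    return (x1 - x2) / sqrt(s1**2 / m + s2**2 / n)

# The two index codes from Figure 5.1
z = index_z_value([1, 4, 3, 6, 4], [1, 2, 3, 5, 4])
print(round(z, 3))  # 0.557
```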
This method is problematic. It gives equal weight to each digit in the index, which does not
necessarily reflect the true nature of the data. Z value testing treats an index code's digits as a
series of observations of identical importance. But data values may vary more (or less) strongly
when a particular index digit is changed than when some other index digit is changed.
Each digit in a data index represents a choice from among several options.
Some of these
choices may have greater or less impact upon the resulting data values than other choices. For
example, the choice between 6061 aluminum and 5052 aluminum materials may have greater or
less impact upon a process tolerance than the choice between cutting and finishing processes.
Yet each of these choices may appear as a single digit in a data index, and will be treated as
equally important by the Z value test. As a result, the determination of similarity is biased in
favor of index digits of less importance to the resulting data value.
Furthermore, the Z value method assumes the most similar point within an acceptable confidence
interval will be a good surrogate for the missing data. This is not a safe assumption, despite
confidence testing. The confidence interval calculation is based upon similarity in the indices,
not similarity in the data. One may expect similar indices to contain more similar data than
dissimilar indices, but there is no indication that two single data points with similar indices will
contain equally similar data.
5.1.2 Data-based similarity
To address missing data, one option involves finding data that seems to be a good surrogate for
an unknown value. This data would be used in place of the missing data, as a best-guess value.
This process involves finding the most similar data, and then determining whether it is
acceptably similar.
Similarity measures can compare multiple data records or sets, and return a measure of similarity
between them. Unfortunately, the missing data problem prevents this strategy from working
properly.
These methods require knowledge of the data before similarity can be measured.
Without the data, these methods do not function properly.
Figure 5.2 illustrates an example of how data-driven similarity measures can fail. A user has
queried data point X, only to find it contains no data. Points Y and Z are close to X, and each
may be similar enough to X to serve as surrogate data. A similarity method may try to compare
the means and standard deviations of the three data points, to determine which point is most
similar to X. But a lack of data at point X disrupts the system. Without knowledge of the mean
and standard deviation for data point X, the method can not determine similarity between X and
Y or X and Z. More generally, when only partial data is available, calculations requiring all data
will fail. To address the missing data problem, a different approach is necessary.
[Figure 5.2: Failure of data-based similarity measures when data is missing -- Strategy: compare data in X with data in Y and Z, to determine which point is most similar to X. Problem: there is no data in X, so no comparison can be made. The method fails.]
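The failure mode in Figure 5.2 is easy to sketch: any comparison built on the target point's own statistics has nothing to work with when the target index is empty. The candidate statistics below are hypothetical:

```python
def most_similar(target_stats, candidates):
    """Pick the candidate whose (mean, std) pair is closest to the
    target's -- impossible when the target point holds no data."""
    if target_stats is None:
        return None   # the missing data problem: nothing to compare against
    t_mean, t_std = target_stats
    return min(candidates, key=lambda c: abs(candidates[c][0] - t_mean)
                                       + abs(candidates[c][1] - t_std))

candidates = {'Y': (10.2, 1.1), 'Z': (9.5, 0.9)}
print(most_similar(None, candidates))         # point X is unpopulated
print(most_similar((10.0, 1.0), candidates))  # works only when data exists
```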
5.1.3 Regression as a prediction system
The missing data problem causes considerable difficulties in finding surrogate data. Data-based
similarity measures can not operate when some data is missing. Index-based similarity measures
can not account for some important data properties, because those properties are not evident
from the indices alone. The failures of data-based measures and index-based measures do not
overlap, however. Data-based measures fail for lack of data, while index-based measures fail for
inadequately considering the data. This leaves opportunity for a hybrid approach to the problem,
using indices as well as available data.
Least-squares regression analysis uses data indices and existing data to make predictions about
unknown data values. This satisfies the need for a system that does not need a full set of data,
while reducing the assumptions made about the data. The regression process fits data with a
mathematical equation describing data trends. In the simple case of two dimensions, this fit is a
line or curve.
In higher dimensions, the fit becomes a surface, volume or hypervolume.
Regression is also different from the similarity measures described above because it does not
attempt to substitute a value from another index for the unknown data.
Instead, regression
attempts to estimate a new value for the unknown data, by continuing trends found in nearby
data. The regression function manages estimation error by using the trendline that minimizes
sum-of-squares error. (Hogg and Ledolter, 1992) See Figure 5.3.
[Figure 5.3: Linear regression analysis in one dimension -- a regression line fit through the populated points along an axis yields an estimate at the unpopulated point]
One difficulty with regression is known as the "curse of dimensionality." Each added dimension
of data requires an increase in the amount of data to be analyzed and the required processor time
to understand data features.
Simultaneous multidimensional regression procedures do exist.
(Hogg and Ledolter, 1992) But calculations can become unwieldy to manage when analysis
procedures require thousands of points. This difficulty is addressed by reducing regression to a
combination of single-dimension regressions requiring fewer points. Regression occurs along
each attribute dimension, and these results combine to yield a final data estimate.
Linear regressions are used to simplify the calculations, and because the number of available
points for regression may be too low to rely upon higher-dimensional fits. Since this project
only uses linear regressions, the system assumes local linearity in the region of the unpopulated
point. This is not guaranteed to be accurate, because data trends can take nonlinear forms that a
polynomial regression would fit more accurately. For this reason, additional work with higher-order regressions would be beneficial to this project. The intelligent use of higher-dimensional
regressions is a subject left for further investigation.
An additional consideration is the fact that not all attributes readily conform to representation as
coordinates on an axis. Some attributes, such as baseball player names, consist of text strings
with no numerical representation. Attribute values of this type must simply be assigned integer
values as "placeholders" on an axis.
When describing baseball cards, the name "Mark
McGwire" might be assigned to location 23 on an axis. But there is no significance to this
assignment. The association between name and number only serves as an indexing method to
give each attribute value a unique location on its axis.
Some attribute axes that use coordinates as "placeholders" will not demonstrate strong trends,
because the ordering of values on these axes does not create any strong data patterns. To remedy
this, one might order player names on an axis from lowest average price to highest. But these
price values change frequently, so the indexing system would be in constant change.
An attribute with no strong data trends is a strong contrast to other attributes, such as card
quality. Card quality increases will always strongly correlate with increases in card prices,
making it obvious that attribute values should be ordered from lowest to highest, or vice versa.
An axis with high correlation will exhibit a strong, predictable data trend that leads to good
predictions of unknown values. But when no trends are evident, as in the case of player names,
the axis will not be of much use to the system.
Fortunately, the system can recognize this fact and respond by ignoring or downplaying the
results of regression on a trend-poor axis. The weighting system described in Section 5.1.4
selectively assigns more influence to stronger trends, while reducing the influence of weaker
trends.
5.1.4 Processes for combining regression results
Some attributes are likely to be more sensitive data drivers than others. This sensitivity appears
as the slope of a regression line. Sensitivity is not the only important feature of regression,
however. The goodness of regression fit to the data is also a very important feature. When
several regressions are combined into a single estimate, the better-fit regressions should have
more influence over the combination than less well-fit regressions. See Figure 5.4. The result
should then be a more accurate estimate than if all regressions had equal influence over the
combination.
[Figure 5.4: Illustration of lower and higher regression errors]
There is no direct way to determine the goodness of fit of a regression line relative to an
unknown data value.
As a substitute for this, the system uses the goodness of fit of the
regression line to all of the known data points. This process assumes that the goodness of fit to
known data points will provide information about the goodness of fit to the unknown data point.
In other words, regressions that do a better job of fitting known data points are likely to do a
better job of fitting unknown data points nearby. The system uses the mean squared error of a
regression line to represent its influence over the final result. The lower the mean squared error
of the regression, the greater its influence will be.
Regression activities are well documented. However, the author does not know of any previous
application in which regression error on an axis is used as a surrogate for estimation accuracy.
Furthermore, the author does not know of an application that generates a single estimated value
in multidimensional space by combining many single-axis regressions in inverse proportion to
their respective regression error. This procedure is presented as an original contribution for
managing unpopulated data indices. Implementation details appear in subsequent sections of this
chapter.
5.2 Preparation for regression
Data preparation involves looking for data points close to an unpopulated data point, setting the stage for regression analysis. The preparation stage finds data points on a translated coordinate system centered at the data point. See Figure 5.5.
[Figure 5.5: Translation of coordinate axes to unpopulated data point]
The process only looks for data points on the translated coordinate system axes. This reduces the
number of regressions performed on an unpopulated data point. Using on-axis points allows the
process to see the effects of varying one attribute at a time, making it easier to compute a
weighted average later. This does eliminate the off-axis points from contributing to the
regression. Including these points in the linear regression analysis would be difficult, requiring
more processing time and an unmanageable number of regressions to account for all off-axis
points. The system must return an estimate quickly, so the process trades off analysis of every
local point in favor of speed.
The data point search process finds points on one data axis at a time, by varying one digit in the
unpopulated data point's index code.
The process finds and stores the mean and standard
deviation of each discovered point. When all populated points on an axis are found and their
values are stored, the axis is ready for regression analysis. See Figure 5.6.
[Figure 5.6: Location of populated data points on translated x' axis -- populated points at (-3,0,0), (-1,0,0), (1,0,0), and (2,0,0) surround the unpopulated point at the origin (0,0,0)]
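The on-axis search can be sketched as follows, assuming a hypothetical layout in which the database maps integer index codes to (mean, standard deviation) pairs. A point lies on a translated axis exactly when its index differs from the query index in a single digit:

```python
def on_axis_points(index_code, database):
    """Find populated points that differ from the query index in exactly
    one digit -- i.e., points lying on the translated coordinate axes."""
    neighbors = {}   # axis number -> list of (offset, mean, std)
    for other_code, stats in database.items():
        diffs = [axis for axis, (a, b) in
                 enumerate(zip(index_code, other_code)) if a != b]
        if len(diffs) == 1:                       # on exactly one axis
            axis = diffs[0]
            offset = other_code[axis] - index_code[axis]
            neighbors.setdefault(axis, []).append((offset,) + stats)
    return neighbors

# Hypothetical database: index code -> (mean, std) of the stored data
db = {(1, 0, 0): (5.2, 0.4), (2, 0, 0): (6.1, 0.5),
      (0, 1, 0): (4.8, 0.3), (1, 1, 0): (7.0, 0.6)}  # last one is off-axis
print(on_axis_points((0, 0, 0), db))
```

The off-axis point (1, 1, 0) is excluded, matching the trade-off described above: only single-attribute variations feed the per-axis regressions.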
5.3 Regression process
The final prediction for the unknown data value has the form

\hat{Y} = \sum_{s=1}^{S} W_s \hat{Y}_s    (2)

Ŷ is the final estimation of the unknown value, and Ŷs is an estimation of the unknown value based upon a single-axis linear regression along axis s. Each single-axis prediction is weighted with the corresponding weight Ws.
The linear regression for each attribute axis has the final form

\hat{Y}_{s,i} = \alpha_s + \beta_s x_{s,i}    (3)

Y_{s,i} = \hat{Y}_{s,i} + e_{s,i}    (4)

Ŷs,i is the predicted value of Y at location i on axis s, and Ys,i is the actual value at that point. The value of βs is the slope of a line that minimizes the sum-of-squares error between the line and the data points used to create it. αs represents the y-intercept of the line. When the axis location of the unpopulated data point is substituted for x, Ŷs,i = Ŷs,0 = Ŷs is the desired prediction for the unknown value. This prediction is based only upon the axis used to generate the regression, so a different prediction Ŷs will result for each value of s (one for each axis used in the regression). At any data point, the error term es,i is the difference between the actual value Ys,i and the predicted value Ŷs,i at that point. Its mean is zero, and its value is different for each value of x (each i) along each axis s.
Values of αs and βs are calculated following:

\bar{Y}_s = \frac{1}{n_s} \sum_{i=1}^{n_s} Y_{s,i}    (5)

\bar{x}_s = \frac{1}{n_s} \sum_{i=1}^{n_s} x_{s,i}    (6)

\beta_s = \frac{\sum_{i=1}^{n_s} (x_{s,i} - \bar{x}_s)(Y_{s,i} - \bar{Y}_s)}{\sum_{i=1}^{n_s} (x_{s,i} - \bar{x}_s)^2}    (7)

\alpha_s = \bar{Y}_s - \beta_s \bar{x}_s    (8)

The value ns is the number of data points used for regression along axis s. See Figure 5.7.
[Figure 5.7: Axis for which n_s = 4 -- populated points at (-3,0,0), (-1,0,0), (1,0,0), and (2,0,0), with the unpopulated point at (0,0,0)]
Sum-of-squares error Es for axis s is calculated by finding the sum of the squares of each error value es,i along the axis.

e_{s,i} = Y_{s,i} - \hat{Y}_{s,i}    (9)

E_s = \sum_{i=1}^{n_s} e_{s,i}^2    (10)
5.4 Regression combination
The weight Ws for each axis is inversely proportional to Ēs, the mean sum-of-squares error over all points on the axis. An axis with lower mean sum-of-squares error is assigned a higher weight due to its more consistent, lower-error trend. An axis with a very weak trend in data values will not contribute much to the final estimate, minimizing its impact upon the result.
The mean sum-of-squares error calculation follows:

\bar{E}_s = \frac{1}{n_s} \sum_{i=1}^{n_s} e_{s,i}^2    (11)

Then, the weight calculations follow:

W_s = \frac{\bar{E}_s^{-1}}{\sum_{k=1}^{S} \bar{E}_k^{-1}}    (12)

The larger the sum-of-squares error is for an axis, the smaller the resulting weight for that axis will be. The calculation satisfies:

\sum_{k=1}^{S} W_k = 1    (13)

Without further need for normalization, each weight Ws multiplies each prediction Ŷs, and the sum of these products is the final prediction Ŷ (see Equation 2).
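The procedure of Equations (2) through (13) can be collected into one short routine. This is a sketch under the stated assumptions (translated coordinates, so the unpopulated point sits at x = 0 on every axis), with hypothetical on-axis sample data standing in for the stored means at each populated point:

```python
def predict_missing(axes_data, x0=0.0):
    """Combine single-axis least-squares regressions into one weighted
    estimate at the unpopulated point, following Equations (2)-(13)."""
    predictions, mean_errors = [], []
    for points in axes_data:              # one list of (x, y) per axis
        n = len(points)
        x_bar = sum(x for x, _ in points) / n                  # Eq. (6)
        y_bar = sum(y for _, y in points) / n                  # Eq. (5)
        beta = (sum((x - x_bar) * (y - y_bar) for x, y in points) /
                sum((x - x_bar) ** 2 for x, _ in points))      # Eq. (7)
        alpha = y_bar - beta * x_bar                           # Eq. (8)
        predictions.append(alpha + beta * x0)                  # Eq. (3)
        mean_errors.append(sum((y - (alpha + beta * x)) ** 2
                               for x, y in points) / n)        # Eq. (11)
    # Inverse-error weights, Eq. (12); a perfectly fit (zero-error) axis
    # would need special handling before inverting.
    inv = [1.0 / e for e in mean_errors]
    weights = [v / sum(inv) for v in inv]
    return sum(w * p for w, p in zip(weights, predictions))    # Eq. (2)

# Hypothetical on-axis samples: a nearly clean trend on one axis,
# a noisier trend on the other.
axis1 = [(-2, 3.1), (-1, 3.9), (1, 6.0), (2, 7.0)]
axis2 = [(-1, 4.2), (1, 6.5), (2, 5.1)]
print(predict_missing([axis1, axis2]))   # close to 5.0
```

Because axis1 fits its regression line far better than axis2, its weight dominates, and the combined estimate stays close to axis1's own prediction.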
5.5 Conclusion
This chapter addresses the need for an alternate surrogate data strategy. The shortcomings of data-based and index-based similarity measures were discussed. This chapter disclosed a method for data estimation, which creates several one-dimensional linear regressions and then combines them in a weighted average for a final estimate of an unknown data value.
6 Demonstration of the Technology
This chapter highlights important features of the prototype software. These highlights include
details of the Internet-based architecture and interface, information flow through the system, data
output, and surrogate data generation.
This chapter also discusses prototype database
administration software, which runs locally on the server.
6.1 System architecture
This section describes information flow between client and server, and then details the information processes within the server.
6.1.1 Client-server interaction
The system operates on a World Wide Web platform, so its accessibility is as universal as
possible. To access the prototype software, a user only needs a browser that supports HTML
tables and basic forms. The user can connect to the server with any type of computer in any
Internet-enabled location, as long as the computer supports graphical display of HTML and
images. The client machine does not need any additional software or special configuration to
access the system. See Figure 6.1. By sending an HTTP request to the database server, the user
becomes a Web client of the server machine.
[Figure 6.1: Client/server system configuration -- a client machine at the client location connects to the server at the server location, where the server communicates with a database engine and the database(s)]
Because the client computer receives no specialized scripts or applets to execute, all
programming details remain confidential on the server machine. Concealment is important
because it keeps the data and data access procedures hidden behind the server's security
measures. User contact with the server takes place only through the browser's forms, improving
data security and simplifying information transfer between the client and server. Figure 6.2
illustrates data flow across the distributed system.
[Figure 6.2: Data flow across client/server system -- HTTP request (URL and form data) from client to server; HTTP reply (HTML and graphics) from server to client]
6.1.2 Intra-server processes
When the server receives a request from a client, its software component parses the request to
determine how it should respond. Response tasks include analyzing form data from the client,
accessing database records, generating graphics and HTML, and sending the graphics and
HTML back to the client. All data access, calculation and output formatting tasks occur in the
server's local environment.
When tasks require database connections, the software component sends Structured Query
Language (SQL) requests through a database engine. The engine analyzes the request and
returns the appropriate data to the prototype software. This data consists of database records and
summary statistics like mean and standard deviation. The software reduces the returned data by
creating summary statistics and chart graphics. This information is returned to the client. See
Figure 6.3.
[Figure 6.3: Data flow within the server location -- the server software sends SQL requests for records and statistics to the database engine; the engine issues data retrieval instructions to the database(s) and returns records and statistics]
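The response task can be sketched end-to-end with Python's built-in sqlite3 standing in for the database engine. The sales table, its columns, and the query criteria below are hypothetical, not the thesis prototype's actual schema:

```python
import sqlite3
import statistics

def handle_request(conn, criteria):
    """Sketch of the server's response task: query matching records
    through the database engine, then reduce them to summary statistics."""
    # Column names come from a fixed menu, not free text; values are
    # passed as SQL parameters.
    where = ' AND '.join(f"{col} = ?" for col in criteria)
    sql = f"SELECT price FROM sales WHERE {where}"
    rows = [r[0] for r in conn.execute(sql, list(criteria.values()))]
    if not rows:
        return None                       # null response (see Chapter 5)
    return {'count': len(rows),
            'mean': statistics.mean(rows),
            'std': statistics.stdev(rows) if len(rows) > 1 else 0.0}

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE sales (name TEXT, year INT, price REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [('Puckett', 1994, 12.0), ('Puckett', 1994, 14.0),
                  ('Ripken', 1995, 30.0)])
print(handle_request(conn, {'name': 'Puckett', 'year': 1994}))
```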
6.2 Server software
This section begins by describing the connection process between client and server. Each screen is discussed in detail, and the section concludes with general comments on the user experience.
6.2.1 Connection process and session management
When a user first sends an HTTP request to the server, his/her machine receives a special session
thread, or memory space, on the server. This thread isolates the user's actions from the actions
of other connected clients. Within this space, the user's actions and data remain in memory
between data requests, instead of being lost (as they would in a simple web server).
The system manages data this way because it is an attractive alternative to other options, like
passing data through the browser's URL field or using cookies. Passing information through the
URL field would restrict the system to only a few kilobytes of transferred data, and not all web
browsers support cookies.
The session thread method allows large amounts of data to be
retained for the user with maximum flexibility.
Since the data is persistent between data requests, the system can perform several analyses
without refreshing data between each user request. This feature improves system performance
and allows the system to perform more complex analyses by keeping all data on hand. When a
user wants to reset his/her search data, s/he can submit a new search. This action destroys the
old session thread and creates a new one, resetting the search data. If a user stops requesting data
for several minutes, the server will automatically destroy the session thread, assuming the user
has disconnected. At no time can one client machine access another client's session thread; each
thread is accessible only to the machine that originally requested the connection.
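The session policy described above can be sketched with a simple in-memory store. This illustrates the behavior (per-client isolation, persistence between requests, idle-timeout expiry, reset on new search), not the prototype's actual threading mechanism:

```python
import time

class SessionStore:
    """Per-client session management: each client keeps its own search
    data between requests, and idle sessions expire automatically."""
    def __init__(self, timeout_seconds=600):
        self.timeout = timeout_seconds
        self.sessions = {}     # client id -> (last access time, search data)

    def get(self, client_id):
        entry = self.sessions.get(client_id)
        if entry and time.time() - entry[0] < self.timeout:
            self.sessions[client_id] = (time.time(), entry[1])
            return entry[1]    # data persists between requests
        self.sessions.pop(client_id, None)   # expired or unknown: destroy
        return None

    def new_search(self, client_id, data):
        # A new search replaces the old session entirely
        self.sessions[client_id] = (time.time(), data)

store = SessionStore(timeout_seconds=600)
store.new_search('client-1', {'name': 'Mark McGwire', 'year': 1999})
print(store.get('client-1'))   # the same client sees its own data
print(store.get('client-2'))   # other clients cannot: None
```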
6.2.2 Screens
This subsection describes the various screens a user encounters during a search session.
Title screen. The first data a user receives is an HTML page containing instructional materials
and other information. This screen contains basic usage instructions for novice users. At an
administrator's discretion, the page may also contain important information regarding system
changes or updates.
This title page guarantees that each user has the opportunity to review
important information before accessing the search functions. See Figure 6.4.
[Figure 6.4: Title page viewed on client machine]
Search screen. When the user navigates past the title page, s/he sees the search screen next. This screen contains a set of pull-down selection menus, with each menu corresponding to an attribute of the data. The user specifies desired values for some or all attributes by making selections from the menus. See Figure 6.5.
[Figure 6.5: Menu-driven search screen -- pull-down menus for Name (e.g., Cal Ripken, Ken Griffey, Jr., Sammy Sosa), Year (1993-2000), Manufacturer (e.g., Donruss, Fleer, Upper Deck, Bowman), Type (Card, Pack, Box, Set), PSA Rating (PSA 7.5 through PSA 10), Quality (e.g., Mint, Near Mint, Very Good, Good, Fair), and Other Descriptor (e.g., Rookie, Error, Traded, Sealed), with on-screen instructions for making multiple or ranged selections using the CONTROL and SHIFT keys]
106
The user can make as many or as few selections as s/he likes from each menu. By default, each
menu contains a wildcard selection. If the user does not make a selection for a particular menu,
the wildcard will remain selected for that menu, and the corresponding attribute will not be a
search criterion. Figure 6.5 illustrates the search screen where a user has specified the Name
Mark McGwire, the Year 1999, the Manufacturer Topps, and a PSA Rating of 9. The user has
left the Quality, Type, and Other Descriptor menus in their default (wildcard) state, so the system
will not use them as search criteria. The user is asking for records with any value(s) for the
Quality, Type and Other Descriptor criteria, as long as the returned records satisfy the Name,
Year, Manufacturer and PSA Rating criteria.
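The wildcard behavior described above can be sketched in a few lines. This is an illustrative reconstruction, not the prototype's actual ASP implementation: the record layout, attribute names, and the matches helper are assumptions chosen to mirror the Figure 6.5 example.

```python
# Illustrative sketch of wildcard handling: menus left in their default state
# contribute no constraint, while explicit selections restrict the results.
# The record layout and helper below are assumptions, not the prototype's code.

WILDCARD = "*"

def matches(record, criteria):
    """True if the record satisfies every non-wildcard criterion."""
    for attribute, values in criteria.items():
        if values == WILDCARD:
            continue  # default (wildcard) menu: attribute is not a criterion
        if record.get(attribute) not in values:
            return False
    return True

records = [
    {"Name": "Mark McGwire", "Year": 1999, "Manufacturer": "Topps", "PSA Rating": 9},
    {"Name": "Mark McGwire", "Year": 1998, "Manufacturer": "Fleer", "PSA Rating": 8},
]

# Mirrors Figure 6.5: four explicit selections, three menus left wildcarded.
criteria = {
    "Name": ["Mark McGwire"],
    "Year": [1999],
    "Manufacturer": ["Topps"],
    "PSA Rating": [9],
    "Quality": WILDCARD,
    "Type": WILDCARD,
    "Other Descriptor": WILDCARD,
}

hits = [r for r in records if matches(r, criteria)]
```

Only the first record survives the filter; the wildcarded Quality, Type and Other Descriptor menus never reject anything.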
Error screen. After the user submits search criteria, the prototype software consults the rules
file to find out if the criteria will point to a valid index. The prototype software returns an error
screen if the rules file suggests the criteria are invalid or point only to unpopulated data indices.
The user is given the option to specify a new search or ask the system for price estimates for the
specified criteria. Figure 6.6 illustrates this error screen.
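The validity check might look like the following sketch, in which each rule is a tuple of attribute values and "*" matches anything; criteria are rejected (producing the error screen) only when no rule covers them. The rule contents, tuple ordering, and function names are invented for illustration; Chapter 4 describes the actual rules-file format.

```python
# Hedged sketch of the rules-file check: criteria are valid only if some rule
# covers them. Rule tuples are (Name, Year, Manufacturer) and are invented.

def rule_covers(rule, criteria):
    """A rule covers the criteria if every position matches or is a wildcard."""
    return all(r == "*" or r == c for r, c in zip(rule, criteria))

rules = [
    ("Mark McGwire", "*", "Topps"),
    ("Cal Ripken", "1995", "Fleer"),
]

def criteria_valid(criteria, rules):
    return any(rule_covers(rule, criteria) for rule in rules)

print(criteria_valid(("Mark McGwire", "1999", "Topps"), rules))        # True
print(criteria_valid(("Glenn Hubbard", "1951", "O-Pee-Chee"), rules))  # False
```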
Figure 6.6: Error screen
Results screen. If the rules file indicates the search criteria are valid (Section 4.2.1), the
system passes a query to the database engine and retrieves the matching records. The system analyzes these records for mean and
standard deviation, and returns charts summarizing the data. In addition to mean and standard
deviation, the system optionally returns the title of each card sale for the user's review. By
default, this option is turned off because it returns very large volumes of text data to the user
when many records satisfy a search.
The system can implement two types of charts. The first type is the standard graphic chart found
in applications like Microsoft Excel. The system generates this chart and returns it to the client's
browser as a standard graphic file. This chart can be complicated to implement and takes several
seconds to download. The second type of chart is coded directly into the HTML page, and uses
HTML tables to generate the various rows and columns of the chart. Figure 6.7 illustrates this
type of chart. HTML table charts require less time to download and are simple to implement, but
they offer little visual flexibility because HTML tables constrain the system to bar charts.
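The HTML table charting technique can be approximated in a few lines: each frequency bin becomes a table row whose "bar" is a colored cell with a width proportional to the count. The bin labels, counts, and scaling factor below are invented sample data, not the prototype's output.

```python
# Sketch of an HTML-table bar chart: one <tr> per bin, with a colored cell
# whose width encodes the count. Sample bins are invented for illustration.

def html_bar_chart(bins, scale=20):
    rows = []
    for label, count in bins:
        bar = '<td bgcolor="navy" width="{}"></td>'.format(count * scale)
        rows.append("<tr><td>{}</td>{}<td>{}</td></tr>".format(label, bar, count))
    return "<table>\n" + "\n".join(rows) + "\n</table>"

bins = [("$0 - $5.00", 2), ("$5.00 - $10.00", 5), ("$10.00 - $15.00", 3)]
print(html_bar_chart(bins))
```

Because the bars are just table cells, such a chart downloads quickly and needs no image generation, but it is limited to the bar-chart form, as noted above.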
Figure 6.7: HTML table of card price frequencies
The prototype software uses only HTML charts to display search data. The Software
Enhancements section of this chapter discusses prospects for using more complex charting
methods.
Price estimation input screen. Regardless of search results, the prototype software always
gives a user the option to estimate price information for his/her specified search criteria.
Estimating a price involves three basic steps. First, the software notes which data indices the
user requested. These indices may or may not contain price data. Then, the software tries to find
data indices that seem similar to the requested indices. These similar indices must contain price
data; any indices containing no price data are ignored. Using these populated similar indices, the
software attempts to estimate values for the price data in the originally requested indices.
Locating similar indices and estimating data values are discussed in detail in Chapter 5.
If the user decides to generate an estimate, the system presents a set of radio buttons
corresponding to the axes used for the search. The prototype software relies on user input to
decide which axes will be used to create data estimates. With the radio buttons, the user can
indicate which axes s/he wants to use to generate the price estimate. Figure 6.8 illustrates this
screen.
The user typically cannot choose among all possible attributes in preparation for price
estimation. The software presents only the attributes for which the user made non-wildcard
selections during the search. This limitation occurs because the regression operations
described in Chapter 5 rely on the user's selection of values for an attribute axis. An axis taking
a wildcard value represents non-selection, aborting the estimation process for that axis.
Figure 6.8 illustrates the estimation-input screen where the user originally specified values for
the Name, Year, Manufacturer, and PSA Rating attributes. Among these attributes, the user has
chosen to use the Year, Manufacturer, and PSA Rating attributes to generate pricing estimates.
When the user submits this information, the software automatically performs regression
operations on the three selected axes and returns a weighted average of the results.
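The final combination step can be sketched as a simple weighted average. The per-axis estimates and weights below are invented numbers; in the prototype the actual values would come from the per-axis regressions of Chapter 5.

```python
# Sketch of combining per-axis regression estimates into one prediction.
# Estimates and weights are invented; real values would come from the fits.

def weighted_average(estimates, weights):
    total = sum(weights[axis] for axis in estimates)
    return sum(estimates[axis] * weights[axis] for axis in estimates) / total

estimates = {"Year": 21.50, "Manufacturer": 18.00, "PSA Rating": 19.75}
weights = {"Year": 1.0, "Manufacturer": 2.0, "PSA Rating": 1.0}

print(round(weighted_average(estimates, weights), 2))  # 19.31
```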
Figure 6.8: Selection screen for estimation function
Price estimation results screen. The price estimation procedure generates a lot of data. To
present this data clearly, the prototype software generates an HTML page that presents final
estimates for mean and standard deviation, followed by axis-specific information. Figure 6.9
illustrates this screen. If an attribute provides too few data points to support regression activities,
the user is notified that the attribute could not be used for estimation. In Figure 6.9, the system
has discarded the Year and PSA Rating attributes for this reason.
Figure 6.9: Estimate results screen
6.2.3 Comments on user experience
The server software performs several disjoint functions, including the data retrieval, analysis
and output tasks described above. The prototype server software represents an attempt to
integrate these functions without presenting a disjointed user experience. The simple style and
arrangement of the HTML input and output pages are intended to reduce the details of a
complicated operation to a level that does not overwhelm users with information. The server
software balances this need for simplicity with the requirement that all useful data be returned to
the user. Section 6.4 discusses some additional methods for achieving the information/simplicity
balance for different types of users.
6.3
Administrative software
The previous section discusses the operation of the software when a user submits queries. But
user-focused operations are not the only tasks the software must perform. The software's ability
to process and serve data to a user depends on proper maintenance.
There are many kinds of changes that force database administration. Over time, users' needs can
change and new users may begin to make different demands of the database system. Time-sensitive
data requires regular updates, and the types of stored data can change. New data
attributes may be added to a database's indexing system, requiring changes in the way the
indexing scheme organizes data. When any of these changes occurs, the system must allow an
administrator to make fundamental changes to the database and indexing scheme. Otherwise, the
system will never be able to grow and evolve with its users and their data needs. The prototype
administrative software fulfills this need for adaptation by providing functions to keep the
system "healthy."
Administrators must occasionally perform various tasks to keep the database up to date. These
tasks include:
- Attribute learning, the discovery of new data features for data indexing (Section 4.3)
- Index expansion, assigning these new features to the indexing scheme (Section 4.2.2)
- Rules file updates, adding new valid (or invalid) index combinations (Section 4.2.1)
With the exception of some human assistance in attribute learning, the tasks can run on a
schedule in the absence of human supervision. The prototype software performs these tasks in
the machine's local environment. There is no way to access these tasks remotely. Restriction to
the local environment is intentional, as it keeps all sensitive data-related tasks safely confined
within the server environment.
This section describes features of the locally executed
administrative software.
6.3.1
Search function
The first function of the prototype administrative software is a database search, very similar to searches the
server software performs. An administrator may want to run a local search while running the
administrative software.
The search function can verify changes to database records, the
indexing scheme, and the rules file, so it is an important component of the administrative
software. The administrator's search function is limited in comparison with search functions
available for remote clients, because the local administrator should only use it to verify
administrative tasks.
The administrative search environment is similar to the remote client's search environment. The
major differences are listed below.
- No similarity search functionality
- All records listed individually, to assist with administrative tasks
- Environment is based on forms, not a browser
Figure 6.10 illustrates the default window for administrative software. The administrator selects
search criteria from menus at the top of the screen and then submits the search. The software
returns individual records and allows the administrator to scroll through the records, fifteen at a
time.
Figure 6.10: Default window for administrative software
6.3.2 Attribute learning and index expansion functions
The administrative software provides a step-by-step set of tools for finding new data features and
re-indexing the database. This tool set is divided into several functions so the administrator can
verify the completion of each action before the next one takes place, and repeat an action in the
event of an error. The tool set functions are located at the bottom of the default window in
Figure 6.10.
Following the methods of Section 4.2.2, the tool set offers a learning algorithm that scours the
database for unrecognized words. See Figures 4.17 and 4.18. The software does not have the
sophistication to make accurate guesses about the meanings of unknown words, so each new
word appears individually in an option window.
The human administrator then has the
opportunity to review each new word, and either ignore it or add it to the indexing scheme. This
option window appears in Figure 6.11.
Figure 6.11: Attribute learning option window
Each new word from the database appears on a separate line at the left side of the form. The user
can select the word's fate using a pull-down menu. Each word either becomes a value for an
existing attribute (such as Name or Manufacturer), moves into the ignore file, or remains in its
unknown state for later review. If a word moves into the ignore file, the software will never
present that word to the user again. If a word becomes a new attribute value, the change
becomes a permanent part of the indexing system when data re-indexing occurs later.
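The learning scan described above can be sketched as a vocabulary filter: any word in a record's text that is neither a known attribute value nor in the ignore file is queued for the administrator's review. The sample vocabulary and record titles below are invented for illustration.

```python
# Sketch of the attribute-learning scan: collect words that are neither known
# attribute values nor ignored, for later human classification. Sample data
# below is invented for illustration.

known_values = {"topps", "fleer", "rookie", "1999"}
ignore_file = {"the", "card"}

def unrecognized_words(titles):
    found = set()
    for title in titles:
        for word in title.lower().split():
            if word not in known_values and word not in ignore_file:
                found.add(word)  # queue for the administrator's review
    return found

titles = ["1999 Topps Rookie Card psa", "1991 Fleer the bowman"]
print(sorted(unrecognized_words(titles)))  # ['1991', 'bowman', 'psa']
```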
Once the learning process is complete and the indexing scheme has received new attribute
values, the system needs to re-index the database and update the rules file to reflect the changes.
Several tools, executed in series, accomplish this goal. These tools appear as command buttons
at the bottom of the form in Figure 6.10. First the system re-indexes the data records with the
new data index. Then it creates a new rules file to reflect the changes. Rules file creation
follows this procedure:
1) Copy all unique indices from the database into the rules file.
This action documents every valid data index from the database, recording each index
only once.
2) Within the rules file, split aggregated indices into a series of unique indices.
Some data indices may specify more than one value for a data attribute. The software
splits each of these aggregated indices into a set of non-aggregated indices, each having
only one value for each attribute. This process may create new duplicate indices, so once
again the software searches for and deletes any duplicates.
3) Combine unique indices into aggregate indices using wildcards (where possible).
Figure 6.12 illustrates a situation where index simplification with wildcards is possible.
Simplification reduces the size of the rules file. If situations similar to Figure 6.12 are
discovered among the data indices, a single index with a wildcard can take the place of
several similar indices.
4) Repeat step three until no more aggregation with wildcards can occur.
Even among indices that already contain wildcards, further simplification can occur. The
software continues simplifying indices until no more simplification is possible.
The result of this four-step procedure is a set of wildcard-enabled rules governing the search
process. The new rules file can then operate as outlined in Chapter 4.
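Steps 3 and 4 of the procedure can be sketched as follows. Each index is a tuple of attribute values; when every value an attribute can take appears among otherwise-identical indices, those indices collapse into one wildcard index. The attribute domains and sample indices follow the Figure 6.12 example, but the code itself is an illustrative assumption, not the prototype's implementation.

```python
# Sketch of wildcard aggregation (steps 3-4): collapse otherwise-identical
# indices into one wildcard index when every value of an attribute's domain
# is present. Domains and indices follow the Figure 6.12 example.

def aggregate_once(indices, domains):
    """Perform one wildcard collapse if possible; otherwise return unchanged."""
    for pos, domain in enumerate(domains):
        if len(domain) < 2:
            continue  # a one-value attribute offers nothing to collapse
        groups = {}
        for idx in indices:
            key = idx[:pos] + ("*",) + idx[pos + 1:]
            groups.setdefault(key, set()).add(idx[pos])
        for key, seen in groups.items():
            if seen == set(domain):  # every possible value is present
                collapsed = {i for i in indices
                             if i[:pos] + ("*",) + i[pos + 1:] == key}
                return (indices - collapsed) | {key}
    return indices

indices = {("1", "1", "1"), ("1", "1", "2"), ("1", "1", "3")}
domains = [("1",), ("1",), ("1", "2", "3")]  # Attribute C takes values 1, 2, 3

while True:  # step 4: repeat until no further aggregation occurs
    reduced = aggregate_once(indices, domains)
    if reduced == indices:
        break
    indices = reduced

print(indices)  # {('1', '1', '*')}
```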
Rules file before adding wildcards:
  Attribute A   Attribute B   Attribute C
  1             1             1
  1             1             2
  1             1             3

Rules file after adding wildcards:
  Attribute A   Attribute B   Attribute C
  1             1             *

Attribute C can take values [1, 2, 3]
Figure 6.12: Rules file before and after adding wildcards
6.4
Software enhancements
There are several software features this project did not fully implement. Although they are not
essential to the software's basic operation, further software development efforts should include
these features. The improvements in this section will create a more efficient user experience,
providing clients with data more quickly and in more appropriate forms.
Differentiation between invalid and unpopulated indices. The rules file currently tracks
all unpopulated data indices, but it does not know why any particular index is unpopulated. An
unpopulated index may represent a valid combination of attributes that nobody has ever
documented with data. Alternately, the index may be unpopulated because it represents an
invalid combination of attributes. The software should allow a user to perform estimation
functions on valid, unpopulated indices. But it should not allow a user to perform estimation
functions on invalid combinations of attributes, because the results will not represent any real
data. To decide whether estimation functions are allowed, the software should be able to discern
between valid and invalid indices. It may be possible to automate this distinction when
converting hierarchical structures. But when creating a new indexing scheme for a semi-structured
data set, humans will have to set up the distinction between valid and invalid indices
by hand.
Real-time narrowing of query menus. There are two ways to use the rules file to ensure
search criteria are valid. The prototype software waits for a user to make and submit all of
his/her selections before consulting the rules file. The other strategy uses the rules file to change
the user's available selections each time s/he alters a menu selection.
This is a real-time
narrowing strategy, because it immediately guides the user to make valid index selections. See
Figure 4.9. Real-time narrowing is preferable because it only allows the user to make valid
selections of indexing combinations. The prototype software does not use real-time narrowing
because of the limitations of basic HTML transactions over the World Wide Web. Without
using a special browser standard, script or applet, real-time narrowing is prohibitive for Web use.
Further investigation of different methods for Web information transfer may reveal a good
compromise between universal user access and real-time query verification.
Generation of better and more varied charts. To maximize the rate of information
transfer between server and client, the prototype software currently uses only HTML table charts
to summarize data. The system could create more attractive and versatile charts in GIF or JPEG
format. This graphical charting technique can create scatter plots, time-series charts, pie charts
and many other varieties. In particular, time-series charts are important for representing time-dependent data. For this reason alone, charting upgrades would be a major improvement in the
software.
Figure 6.13 illustrates an HTML table bar chart in its most detailed form, and Figure 6.14
illustrates a fairly standard graphical bar chart. The plotted data is identical for both charts, but
the presentation of the graphical chart in Figure 6.14 is cleaner and better suited for presentation.
Figure 6.13: HTML table bar chart
Figure 6.14: Graphical bar chart
Automatic determination of best axes for estimation. The prototype software asks a
user to specify which axes it should use for regression functions. The software accepts the user's
judgment and uses only those axes to estimate unknown data values. If the user does not
exercise good judgment, the resulting estimation function will achieve lower accuracy than it
otherwise could have. To ensure the best results, the software should offer an "auto-detect" option
to determine which axes are best suited for the estimation process. The software might generate
(or be given) a set of threshold error values for each axis. If, during regression analysis, average
error per axis value falls above the threshold, the axis can be discarded from the estimation
process. The axes that remain under the error threshold can then be treated as they are under the
current software prototype.
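The proposed auto-detect option could be as simple as a threshold filter over per-axis regression errors. The error values and thresholds below are invented; the text leaves open how they would actually be generated.

```python
# Sketch of the proposed "auto-detect" option: discard any axis whose average
# regression error exceeds its threshold. All numbers are invented examples.

def select_axes(avg_errors, thresholds):
    """Keep only the axes whose average error stays at or under the threshold."""
    return [axis for axis, err in avg_errors.items() if err <= thresholds[axis]]

avg_errors = {"Year": 4.2, "Manufacturer": 1.1, "PSA Rating": 9.8}
thresholds = {"Year": 5.0, "Manufacturer": 2.0, "PSA Rating": 5.0}

print(select_axes(avg_errors, thresholds))  # ['Year', 'Manufacturer']
```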
Provision of novice and expert user modes. The features discussed in this section
balance simplicity of use with the system's trust in a user. Some users will always be more
knowledgeable or experienced than others, so the system's faith in a user's decision should
be variable. The most direct way to address this problem is the creation of different user modes.
A novice user mode might verify index selections in real-time, present only basic pricing charts,
and automatically select the axes to use for regression analysis. An expert user mode might offer
the option for one-time index verification, present multiple types of data charts, list all returned
records and let the user select which axes to use for regression. By offering a novice mode and
an expert mode, the system can meet the needs of new and experienced users simply by changing
a few system settings.
The administrative software may also benefit from multiple-mode
operation, by varying the level of user involvement in administrative task scheduling and
execution.
Speed improvement. The prototype software leaves significant room for speed
improvements. These improvements are necessary if the software is to handle several clients
simultaneously. The current system uses state-of-the-art hardware as of October 2000. The
software components, however, do not create an efficient processing environment. They were
selected for their ease of use, and because they were available to the Variation Risk Management
workgroup.
A superior environment might use a more efficient operating system, data
repository, network protocol, data retrieval protocol and operating language. Additionally, the
system may operate as a distributed server network instead of a single machine. These changes
could decrease response times by an order of magnitude or more, relative to current performance.
6.5
Conclusion
This chapter has outlined the major functions of the prototype software. The prototype server
software demonstrates functions for search specification, criteria checking, data retrieval, data
analysis, and output formatting.
Additionally, the server software demonstrates estimation
methods for creating surrogate data.
The locally run administration software demonstrates
methods for assisted machine learning, index expansion, and database re-indexing. Subject to
the enhancements suggested in this chapter, these functions demonstrate the practical feasibility
of the concepts in this thesis.
7
Conclusion
This chapter reviews the nature of design-focused PCDB problems and the solutions presented in
this thesis, and offers remarks on the strengths and weaknesses of the system.
7.1
Contributions
Design currently suffers from a lack of access to PCD. Some of the access barriers are products
of problematic PCDB system architectures and data management strategies. This thesis presents
an alternate PCDB architecture and improved methods for managing unpopulated database
indices.
First, this thesis reviews major barriers to design's use of PCD. These barriers include:
- Poor indexing schemes
- Lack of design-focused PCD access
- PCDB dissimilarity within and across enterprises
- Absence of formalized methods for managing unpopulated data indices
These problems are evident from a review of previous work, discussions with industry during the
completion of this thesis, and a recent industry survey. Chapters 1 and 2 establish the need to
manage the noted problems through design-driven PCDB solutions.
The thesis then describes the current state of popular DBMS architectures in Chapter 3, pointing
out the advantages and disadvantages of each system. In particular, problems specific to PCDB
usage appear prominently. Chapter 4 presents a hybrid attribute-based indexing structure that
avoids many of the problems associated with well-known indexing systems. This hybrid system
also lays the groundwork for solutions to many of the problems design faces when using PCD.
Specifically, the hybrid system indexes data in a fully attribute-based way.
It also allows
designers to seek PCD by design-focused attributes, and uses a single interface to query many
databases of differing structure.
Chapter 5 presents a method to predict values for unpopulated data indices. This method follows
directly from the data structures in Chapter 4, and allows PCDB users to use formalized
procedures to manage unpopulated data indices.
Chapter 6 reviews a prototype server and
administrative software package that demonstrates all of the technology described in the thesis.
7.2
Further research
There are several areas of design-focused PCDB implementation that require future work. This
section describes some possibilities for future contributions.
Multiple estimation methods. In different situations, different strategies may be appropriate
for creating surrogate data.
This thesis describes an estimation routine based on regression
analysis, but other strategies may also yield useful information. Alternate methods may use a
different estimation routine, or alternately may use a substitution routine that uses data from a
different index as a direct surrogate for an unpopulated index.
These methods should be
explored further.
Estimation and substitution routines for unknown data have been studied at length, and there are
myriad approaches to missing data problems. A system could use multiple strategies to create
surrogate data, and then compare or combine the results to create a best estimate.
Improved statistical validity tests. The system proposed in this thesis generates a
minimum-error estimate for missing data. But the algorithm does not compare this minimum
error value with an acceptable standard.
An improved system should create a standard for
minimum allowable error. This standard could be an amortized error value per data point, or a
value that varies by axis. Alternately, the system could compare the final estimate with domain
knowledge, to assess validity.
Software enhancements. Chapter 6 describes several software enhancements that would
improve the user experience. These improvements include varying user modes and improved
graphical output.
Most of the noted software enhancements offer greater flexibility and
customization of the user experience.
The technology presented in this thesis holds great promise for improving designers' access to
PCD. Industry has expressed interest in technologies that improve PCDB implementations, even
while the literature has overwhelmingly assumed that designers already have suitable
access to the data. With the technology demonstrated in this thesis, industry can begin closing
the gap between the current state of PCDB implementations and the powerful ideal of design-focused
PCD applications.
References
Agrawal, Rakesh, Tomasz Imielinski and Arun Swami (1993) "Database Mining: A Performance
Perspective." IEEE Transactions on Knowledge and Data Engineering, 5(6), pp. 914-925.
Alagic, Suad (1986) Relational Database Technology. Springer-Verlag, New York.
Bangalore, Srinivas and Giuseppe Riccardi (2000) "Stochastic Finite-State Models for Spoken
Language Machine Translation." Workshop on Embedded Machine Translation Systems, Seattle,
WA.
Baral, Chitta, Michael Gelfond, and Olga Kosheleva (1998) "Expanding Queries to Incomplete
Databases by Interpolating General Logic Programs." Journal of Logic Programming 35, pp.
195-230.
Batchelor, R. and K.G. Swift (1996) "Conformability Analysis Support of Design for Quality."
IMechE Journal of Materials Processing Technology 61(1-2), pp. 163-167.
Bishop, Christopher M. (1996) Neural Networks for Pattern Recognition. Oxford University
Press, Oxford, England.
Burges, Christopher J.C. (1998) "A Tutorial on Support Vector Machines for Pattern
Recognition." In Proceedings of Data Mining and Knowledge Discovery (Usama Fayyad ed.), 2,
pp. 1-43. Kluwer Academic Publishers, Boston.
Burleson, Donald K. (1999) Inside the Database Object Model. CRC Press, Boston.
Campbell, R.I. and M.R.N. Bernie (1996) "Creating a Database of Rapid Prototyping System
Capabilities." Journal of Material Processing Technology 61, pp. 163-167.
Chandra, S., D.I. Blockley and N.J. Woodman (1993) "Qualitative Querying of Physical Process
Simulations." Civil Engineering Systems 10, pp. 225-242.
Clausing, D. "Reusability in Product Development." Engineering Design Conference 1998.
Uxbridge, England.
Cowell, Robert (1999) "Introduction to Inference for Bayesian Networks." In Learning in
Graphical Models, (M. I. Jordan ed.), pp. 9-26. Kluwer Academic Publishers, Boston.
DeGarmo, E. Paul, J.T. Black and Ronald A. Kohser (1997) Materials and Processes in
Manufacturing. Prentice Hall, Upper Saddle River, NJ.
Deleryd, Mats (1999) "A Pragmatic View on Process Capability Studies." International Journal
of Production Economics 58, pp. 319-330.
Devore, Jay L. (1987) Probability and Statistics for Engineering and the Sciences. Brooks/Cole
Publishing Company, Monterey, CA.
Duda, Richard O. and Peter E. Hart (1973) Pattern Classification and Scene Analysis. John
Wiley & Sons, New York.
Gaither, Norman (1994) Production and Operations Management. Harcourt Brace Company,
Orlando.
Hogg, Robert V. and Johannes Ledolter (1992) Applied Statistics for Engineers and Physical
Scientists. Macmillan Publishing Company, New York.
Lee, D.J. and A.C. Thornton (1996) "The Identification and Use of Key Characteristics in the
Product Development Process". ASME Design and Theory Methodology Conference. Irvine, CA.
Liu, Ling and Calton Pu (1997) "An Adaptive Object-Oriented Approach to Integration and
Access of Heterogeneous Information Sources." Distributed and Parallel Databases. 5(2), pp.
167-205.
Meng, Weiyi, Clement Yu, and Won Kim (1995) "A Theory of Translation from Relational
Queries to Hierarchical Queries." IEEE Transactions on Knowledge and Data Engineering. 7(2),
pp. 228-245.
Pavlovic, Vladimir, James M. Rehg, Tat-Jen Cham, and Kevin Murphy (2000) "A Dynamic
Bayesian Network Approach to Figure Tracking Using Learned Dynamic Models." Hybrid
Systems: Computation and Control 1790, pp. 366-380.
Pereira, Fernando C.N. and Michael D. Riley (1997) "Speech recognition by composition of
weighted finite automata." In Finite State Language Processing (E. Roche and Y. Schabes, eds.),
pp. 431-453. MIT Press, Cambridge.
Piattelli-Palmarini, Massimo (1991) "Probability: Neither Rational nor Capricious." Bostonia,
March/April 1991, pp. 28-35.
Rabiner, Lawrence R. (1989) "A Tutorial on Hidden Markov Models and Selected Applications
in Speech Recognition." Proceedings of the IEEE 77(2), pp. 257-286.
Riccardi, Giuseppe, Roberto Pieraccini, and Enrico Bocchier (1996) "Stochastic Automata for
Language Modeling." Computer Speech and Language 10, pp. 265-293.
Tata, Melissa (1999) The Effective Use of Process Capability Databases for Design. MSc thesis, MIT,
Cambridge.
Thornton, Anna C. and Melissa Tata (1999) "Process Capability Database Usage in Industry:
Myth vs. Reality." Design for Manufacturing Conference, ASME Design Technical Conferences,
Las Vegas, NV.
Turk, Matthew and Alex Pentland (1991) "Eigenfaces for Recognition." Journal of Cognitive
Neuroscience 3(1), pp. 71-86.
Ulrich, Karl T. and Steven D. Eppinger (2000) Product Design and Development. McGraw-Hill,
Boston.
Walker, Michael G. (1987) "How Feasible is Automated Discovery?" IEEE Expert, Spring 1987,
pp. 69-82
Zhang, Jin and Robert R. Korfhage (1999) "A Distance and Angle Similarity Method." Journal
of the American Society for Information Science 50(9), pp. 772-778.
129
130
Appendix A: PCDB Survey
Process Capability Database Survey

Thank you for providing us with information about your process capability database by answering the questions below.

Your answers will be kept completely confidential, and will assist us in:
*  Identifying key areas for discussion at this year's KC Symposium
*  Assessing the current state of process capability database utilization

We will aggregate the data and report it back to the working group at the start of the meeting.

When answering questions, please note the following terminology:

Organization    Company, division, office, or other group, if any, that you represent at the Symposium.
PCDB            Process capability database
PCD             Process capability data
KC              Key characteristic
Index           The full code used to identify a specific location in the PCDB
                (A single index may contain several data points)
Unpopulated     Descriptor for an index containing no data
Query           Request for data points in at least one PCDB index

Again, thank you for your assistance and we look forward to seeing you at the Symposium.

Your name: ________________

Your organization: ________________
1) Among the following PCD-related topics, which do you consider most important for discussion at this year's KC Symposium? Please check all that apply.

__ Planning and implementation of PCDBs
__ Development of PCDB structure
__ PCD query development
__ Interpreting returned PCD
__ Designing indices
__ Populating PCDBs: timetables, economics, resource allocation
__ Managing supplier PCDBs and other external PCD sources
__ Other

2) Does your organization currently use a PCDB?

__ Yes
__ No, but a PCDB is currently being created
__ No, but a PCDB is currently in planning stages
__ We have no PCDB, nor specific plans to implement a PCDB

If you answered "no" to question 2, please skip to question 12 on page 3.
Questions 3-11 (page 2) are for individuals representing organizations that have a PCDB.
Questions 12 and 13 (page 3) are for individuals representing organizations that do not have a PCDB.
3) What percentage of your PCDB is currently populated?

__ 0-25%
__ 25-50%
__ 50-75%
__ 75-99%
__ 100%

4) Which of the following statements, if any, most accurately represent(s) the dispersion of unpopulated indices through your PCDB?

__ Most or all indices are populated; unpopulated PCD is not an evident problem.
__ The majority of the PCDB is populated, with unpopulated indices interspersed throughout.
__ Unpopulated indices tend to exist in concentrated PCDB regions; other areas are thoroughly populated.
__ Unpopulated indices abound in small concentrated regions of the PCDB, but are also interspersed moderately throughout the rest of the PCDB.
__ The majority of the PCDB is unpopulated.

5) How frequently is a requested PCDB index found to be unpopulated?

__ Less than 10% of queries
__ 10-25% of queries
__ 25-50% of queries
__ More than 50% of queries

6) What characteristics are specified in order to access a particular data index (set of data points)? Please check all that apply.

__ Part number or name
__ Material
__ Feature
__ Operation
__ Size
__ KC number or name
__ Machine

7) What strategies does your organization use to address unpopulated data? Check all that apply.

__ Seek out alternate values from within PCDB via software procedure
__ Seek out alternate values from within PCDB based on user intuition/expertise
__ Consult expert within organization
__ Consult expert outside of organization
__ Seek information from manufacturing
__ Contact supplier if unpopulated index is suspected to exist in supplier database
__ Investigate design changes that enable processes with known capabilities

8) Which characteristics seem to have the greatest influence over the likelihood that an index is unpopulated?

__ Material, i.e. certain materials have a much greater likelihood of index population than others
__ Date, i.e. older hierarchy branches are more/less likely to be populated than newer ones
__ Feature, i.e. certain features have a greater likelihood of index population than others
__ Size, i.e. certain feature sizes are considerably more likely to be populated than others
__ Complexity, i.e. indices for simpler features/operations are more/less likely to be populated
__ No evident trends, or only weak trends noted

9) Who is typically asked for PCD? Check all that apply.

__ Designers
__ Manufacturing
__ Suppliers
__ Data specialists or database query personnel
__ No request system is in place
__ An individual is expected to find the data him/herself, through databases or otherwise.

10) What types of design tasks most frequently require PCD that is found to be unpopulated? Check all that apply.

__ Redesign of existing parts or processes
__ Design of new parts or processes
__ Investigative queries of alternate features/feature sizes, processes or materials
__ Other

11) What trends, if any, are characteristic of design's reaction to not being able to find the right data?

__ Decreased use of PCD
__ Distrust of populated PCD indices
__ Use of PCD only for revision of existing designs
__ Use of PCD only for processes, materials, parts, KCs, etc. that are known to have populated indices
__ Use of PCD only for queries regarding "popular"/well-known processes, materials, parts, KCs, etc.
__ Other
__ No trends noted

Question 11 is the last question for individuals representing organizations that have a PCDB. Thank you!
12) What are the primary reasons for the lack of a PCDB at your organization? Please check all that apply.

__ Simple lack of need for PCDB
__ Need for PCDB exists, but is not recognized by decision-makers
__ Need is known, but implementation has simply been slow
__ Lack of funding or other resources to create PCDB
__ Lack of organizational knowledge for PCDB creation
__ Other
__ No known reasons

13) Which of the following strategies are used by design when desired process capability information is unavailable?

__ Contact supplier or manufacturer
__ Consult expert within organization
__ Consult expert outside of organization
__ Generate estimates for unavailable information, using resources within organization
__ Investigate design changes that enable processes with known capabilities

Question 13 is the last question for individuals representing organizations that do not have a PCDB. Thank you!
Appendix B: Survey Responses
1. Among the following PCD-related topics, which do you consider most important for discussion at this year's KC Symposium?

    Planning and implementation of PCDBs                       65%
    Development of PCDB structure                              65%
    PCD query development                                      26%
    Interpreting returned PCD                                  22%
    Designing indices                                          13%
    Populating PCDBs: timetables, economics, resources         43%
    Managing supplier PCDBs/external sources                   48%
    Other                                                       4%

No. respondents = 23

[Bar chart of response counts omitted]
2. Does your organization currently use a PCDB?

    Yes                                                        46%
    No, but a PCDB is currently being created                  21%
    No, but a PCDB is currently in planning stages             21%
    We have no PCDB, nor specific plans to implement a PCDB     0%

No. respondents = 24

[Bar chart of response counts omitted]
3. What percentage of your PCDB is currently populated?

    0-25%      82%
    25-50%      9%
    50-75%      0%
    75-99%      9%
    100%        0%

No. respondents = 11

[Bar chart of response counts omitted]
4. Which of the following statements, if any, most accurately represent(s) the dispersion of unpopulated indices through your PCDB?

    Most or all indices are populated; unpopulated PCD is not an
    evident problem.                                                    9%
    The majority of the PCDB is populated, with unpopulated indices
    interspersed throughout.                                            9%
    Unpopulated indices tend to exist in concentrated PCDB regions;
    other areas are thoroughly populated.                               9%
    Unpopulated indices abound in small concentrated regions of the
    PCDB, but are also interspersed moderately throughout the rest
    of the PCDB.                                                       46%
    The majority of the PCDB is unpopulated.                           27%

No. respondents = 11

[Bar chart of response counts omitted]
5. How frequently is a requested PCDB index found to be unpopulated?

    Less than 10% of queries      22%
    10-25% of queries             22%
    25-50% of queries             33%
    More than 50% of queries      22%

No. respondents = 9

[Bar chart of response counts omitted]
6. What characteristics are specified in order to access a particular data index (set of data points)?

    Part number or name      64%
    Material                 91%
    Feature                  82%
    Operation                55%
    Size                     73%
    KC number or name        45%
    Machine                  73%

No. respondents = 11

[Bar chart of response counts omitted]
7. What strategies does your organization use to address unpopulated data?

    Seek out alternate values from within PCDB via software procedure        9%
    Seek out alternate values from within PCDB based on user
    intuition/expertise                                                     45%
    Consult expert within organization                                      73%
    Consult expert outside of organization                                  36%
    Seek information from manufacturing                                     73%
    Contact supplier if unpopulated index is suspected to exist in
    supplier database                                                       64%
    Investigate design changes that enable processes with known
    capabilities                                                             9%

No. respondents = 11

[Bar chart of response counts omitted]
8. Which characteristics seem to have the greatest influence over the likelihood that an index is unpopulated?

    Material, i.e. certain materials have a much greater likelihood of
    index population than others                                            18%
    Date, i.e. older hierarchy branches are more/less likely to be
    populated than newer ones                                                0%
    Feature, i.e. certain features have a greater likelihood of index
    population than others                                                  27%
    Size, i.e. certain feature sizes are considerably more likely to be
    populated than others                                                   18%
    Complexity, i.e. indices for simpler features/operations are
    more/less likely to be populated                                        45%
    No evident trends, or only weak trends noted                            36%

No. respondents = 11

[Bar chart of response counts omitted]
9. Who is typically asked for PCD?

    Designers                                                               45%
    Manufacturing                                                           36%
    Suppliers                                                                9%
    Data specialists or database query personnel                            36%
    No request system is in place                                           45%
    An individual is expected to find the data him/herself, through
    databases or otherwise.                                                  0%

No. respondents = 11

[Bar chart of response counts omitted]
10. What types of design tasks most frequently require PCD that is found to be unpopulated?

    Redesign                  36%
    New designs               91%
    Investigative queries     36%
    Other                      9%

No. respondents = 11

[Bar chart of response counts omitted]
11. What trends, if any, are characteristic of design's reaction to not being able to find the right data?

    Decreased use of PCD                                                     0%
    Distrust of populated PCD indices                                        0%
    Use of PCD only for revision of existing designs                         0%
    Use of PCD only for processes, materials, parts, KCs, etc. that
    are known to have populated indices                                      9%
    Use of PCD only for queries regarding "popular"/well-known
    processes, materials, parts, KCs, etc.                                  36%
    Other                                                                    0%
    No trends noted                                                         55%

No. respondents = 11

[Bar chart of response counts omitted]
12. What are the primary reasons for the lack of a PCDB at your organization?

    Simple lack of need for PCDB                                            15%
    Need for PCDB exists, but is not recognized by decision-makers          38%
    Need is known, but implementation has simply been slow                  23%
    Lack of funding or other resources to create PCDB                       31%
    Lack of organizational knowledge for PCDB creation                      31%
    Other                                                                   23%
    No known reasons                                                         0%

No. respondents = 13

[Bar chart of response counts omitted]
13. Which of the following strategies are used by design when desired process capability information is unavailable?

    Contact supplier or manufacturer                                        50%
    Consult expert within organization                                      67%
    Consult expert outside of organization                                  25%
    Generate estimates for unavailable information, using resources
    within organization                                                     50%
    Investigate design changes that enable processes with known
    capabilities                                                            42%

No. respondents = 12

[Bar chart of response counts omitted]