Improving Process Capability Data Access for Design

by

James A. Hanson

B.S. Mechanical Engineering, Cum Laude
University of Maryland College Park, 1999

Submitted to the Department of Mechanical Engineering in Partial Fulfillment of the Requirements for the Degree of Master of Science in Mechanical Engineering at the Massachusetts Institute of Technology

May 2001

© Massachusetts Institute of Technology, 2001. All rights reserved.

Signature of Author: Department of Mechanical Engineering, May 22, 2001

Certified by: Anna Thornton, Thesis Supervisor, Assistant Professor of Mechanical Engineering

Accepted by: Ain Sonin, Thesis Reader, Chairman of the Graduate Committee, Department of Mechanical Engineering

Improving Process Capability Data Access for Design

by James A. Hanson

Submitted to the Department of Mechanical Engineering on May 22, 2001 in partial fulfillment of the requirements for the Degree of Master of Science in Mechanical Engineering

Abstract

Process capability databases (PCDBs) are used to store data that characterizes manufacturing operations. The manufacturing function uses process capability data (PCD) to monitor production and identify out-of-control processes. The design function uses PCD to allocate tolerances, evaluate manufacturability, and assess product robustness. The academic literature predominantly assumes that design and manufacturing enjoy ready access to PCD. In practice, however, the design function faces major barriers to PCD access, including a lack of design-focused indexing and inconsistent structures between PCDBs. A survey was circulated to industrial enterprises, and the results identified two problems associated with poor PCD population. First, there are no formalized approaches for managing the problem of poor database population. Second, poor database population is most harmful to design when new designs are investigated, not when old designs are revisited.
The distinction is important because PCD can provide valuable information when new designs are created. This thesis addresses the indexing and multiple-volume barriers by presenting a hybrid data reindexing system. The system can employ design-driven attributes to index data, making the data intuitively accessible to design. Additionally, the system represents data from different data set structures in a single, attribute-based Euclidean space. This feature allows a user to query multiple data sets from a single interface. With data represented in Euclidean space, poor database population is managed by using multi-axis regression techniques to generate estimates for unknown data values. This thesis presents a unique interactive multi-axis regression method, which creates estimates for unknown data values by minimizing sum-of-squares error along each Euclidean axis and across multiple axes.

Thesis Supervisor: Anna Thornton
Assistant Professor, Mechanical Engineering

Acknowledgements

This thesis project has allowed me to explore a rewarding subject and investigate new strategies for solving a challenging set of problems. I feel very fortunate to have had this opportunity, and I owe a debt of gratitude to several individuals who helped make this project a success. First I want to thank my thesis advisor, Professor Anna Thornton, for her academic and professional guidance during my time at MIT. Professor Thornton's insistence upon unquestionably professional character and performance was essential to my success. Her eye for detail and compassionate encouragement led me to greater levels of confidence and comprehension throughout this project. I want to thank my family for their support and patience - not only while I toiled in the lab, but also when I tried in vain to describe my project in 200 words or less. My parents John and Judy and my brother Joe were consistently supportive and interested in my progress, and I am very grateful for their enthusiasm.
Accolades should not be limited to my immediate family; my extended family is also very special to me, but unfortunately they are too numerous to list by name on this page. Professional associates have been essential to this project. In particular, kudos to Chris Hull of Boeing for his enthusiasm during and after the 2000 KC Symposium. I want to acknowledge the generous financial support of the National Science Foundation, MIT's Center for Innovation in Product Development, and Analytics Operations Engineering, Inc. My friends, in Boston and elsewhere, always provide generous emotional support when I need it most. In particular, I offer my sincere appreciation to Lori and Christy. And I am thrilled to be able to call my labmates friends. Finally, I give very special thanks to Melissa, who has given me more support than I can ever hope to express in words.

Table of Contents

Abstract
Acknowledgements
List of Figures
Acronyms
Glossary

1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Literature Review
    1.3.1 Current state of PCDBs
    1.3.2 Human capability for managing uncertainty
    1.3.3 Data analysis and knowledge discovery
    1.3.4 Database-specific user issues
  1.4 Thesis Objectives
    1.4.1 Support for design functions
    1.4.2 Strategies for unpopulated data indices
  1.5 Thesis Data
    1.5.1 Similarities
    1.5.2 Differences
  1.6 Outline

2 Industry Survey
  2.1 Survey Questions and Responses
    2.1.1 PCDB usage
    2.1.2 Causes and effects of unpopulated PCDB indices
    2.1.3 Indexing strategies
    2.1.4 PCDB population
  2.2 Survey Analysis
    2.2.1 Causes and effects of unpopulated PCD
    2.2.2 Indexing strategies
    2.2.3 PCDB population levels
    2.2.4 Remarks
  2.3 Recommended Action
    2.3.1 Indexing scheme improvements
    2.3.2 Improved management of unpopulated indices
  2.4 Conclusion

3 Current State of Database Implementations
  3.1 PCDB-Specific DBMS Needs
  3.2 DBMS Architectures for PCDBs
    3.2.1 Hierarchy structures
    3.2.2 Relational structures
    3.2.3 Object-oriented structures
    3.2.4 Hybrid structures
  3.3 Common Problems Across DBMS Architectures
    3.3.1 Inherent dissimilarity of DBMSs
    3.3.2 Error management and infeasible indices
    3.3.3 Schema conversion
  3.4 The Missing Data Problem
  3.5 Conclusion: Relevance to PCDB Structures

4 Attribute-Based Indexing
  4.1 Basic Concepts
    4.1.1 Attribute coordinate hyperspace representation
    4.1.2 Representation of items with multiple values for an attribute
    4.1.3 Representation of items with no values for an attribute
    4.1.4 Rationale for representation
  4.2 Elements of the Indexing System
    4.2.1 Components
    4.2.2 Processes
  4.3 Attribute Learning
    4.3.1 Motivation
    4.3.2 Human-machine learning
    4.3.3 Precise roles for human intervention in the system
  4.4 Conclusion

5 Strategies for Missing Data
  5.1 Justification of Methods
    5.1.1 Index-based similarity
    5.1.2 Data-based similarity
    5.1.3 Regression as a prediction system
    5.1.4 Processes for combining regression results
  5.2 Preparation for Regression
  5.3 Regression Process
  5.4 Regression Combination
  5.5 Conclusion

6 Demonstration of the Technology
  6.1 System Architecture
    6.1.1 Client-server interaction
    6.1.2 Intra-server processes
  6.2 Server Software
    6.2.1 Connection process and session management
    6.2.2 Screens
    6.2.3 Comments on user experience
  6.3 Administrative Software
    6.3.1 Search function
    6.3.2 Attribute learning and index expansion functions
  6.4 Software Enhancements
  6.5 Conclusion

7 Conclusion
  7.1 Contributions
  7.2 Further Research
References
Appendix A: PCDB Survey
Appendix B: Survey Responses

List of Figures

Figure 1.1: Some attributes of baseball card auction data, found in auction title
Figure 2.1: PCDB implementation among Symposium attendees
Figure 2.2: Design tasks resulting in unpopulated PCDB indices
Figure 2.3: Strong influences upon PCDB index population
Figure 2.4: Methods for managing lack of PCD at organizations with PCDBs
Figure 2.5: Methods for managing lack of PCD at organizations without PCDBs
Figure 2.6: Designers' reactions to unpopulated PCDB indices
Figure 2.7: Index scheme parameters in use
Figure 2.8: Index population levels in PCDBs
Figure 2.9: Distribution of unpopulated data in PCDBs
Figure 2.10: Percentage of queries returning unpopulated indices
Figure 3.1: Generic hierarchical structure with indices
Figure 3.2: Determining child nodes of node 1.1
Figure 3.3: Identical attributes in multiple locations and levels of hierarchy structure
Figure 3.4: Mining hierarchy sections for "Green" and "Orange"
Figure 3.5: Identically named attributes may not represent identical concepts
Figure 3.6: No knowledge of nearest-neighbor attribute values
Figure 3.7: Removal of attributes from hierarchical index
Figure 3.8: Addition of attributes to hierarchical DBMS
Figure 3.9: Relational representation of items
Figure 3.10: Adding new attribute to relational table
Figure 3.11: Process of creating new table using pointers to two tables
Figure 3.12: Adding an attribute to a relational table: pointers automatically update
Figure 3.13: Relational difficulties with attribute value similarities
Figure 3.14: Object-oriented DBMS supporting attribute and method inheritance
Figure 3.15: Polymorphism of methods: "+" adds numbers, concatenates strings
Figure 3.16: Polymorphism of attributes and associated ambiguities
Figure 3.17: Failure of declarative operations due to encapsulation
Figure 3.18: Hybrid database structure requiring one choice each from A, B and C
Figure 4.1: Index space in three dimensions
Figure 4.2: Attribute table assigning values to axis coordinates
Figure 4.3: Representing items with multiple values for a single attribute
Figure 4.4: Different attribute axes for non-candy items and candy items, respectively
Figure 4.5: Attributes translated into index code
Figure 4.6: SVC operator in substring
Figure 4.7: MVC operator in substring
Figure 4.8: A rules file, with a sample invalid index combination
Figure 4.9: Real-time specification: Selection of "disco ball" eliminates "plastic"
Figure 4.10: Query structure and sample returned index codes
Figure 4.11: Use of wildcard for data characterization
Figure 4.12: Result of wildcard search
Figure 4.13: Query codes with wildcard (left) and omission (right) for attribute 2
Figure 4.14: Alteration of attribute table
Figure 4.15: Update of dataset using altered attribute table
Figure 4.16: Portion of an ignore file, listed alphabetically
Figure 4.17: Recognition of words unknown to system
Figure 4.18: Presentation of new words to user for assistance in categorization
Figure 4.19: Illustration of learning process
Figure 4.20: Theoretically ambiguous text descriptor and nature of the ambiguity
Figure 5.1: Z value calculation for two index codes
Figure 5.2: Failure of data-based similarity measures when data is missing
Figure 5.3: Linear regression analysis in one dimension
Figure 5.4: Illustration of lower and higher regression errors
Figure 5.5: Translation of coordinate axes to unpopulated data point
Figure 5.6: Location of populated data points on translated x' axis
Figure 5.7: Axis for which ns = 4
Figure 6.1: Client/server system configuration
Figure 6.2: Data flow across client/server system
Figure 6.3: Data flow within the server location
Figure 6.4: Title page viewed on client machine
Figure 6.5: Menu-driven search screen
Figure 6.6: Error screen
Figure 6.7: HTML table of card price frequencies
Figure 6.8: Selection screen for estimation function
Figure 6.9: Estimate results screen
Figure 6.10: Default window for administrative software
Figure 6.11: Attribute learning option window
Figure 6.12: Rules file before and after adding wildcards
Figure 6.13: HTML table bar chart
Figure 6.14: Graphical bar chart

Acronyms

DBMS = Database Management System
GIF = Graphics Interchange Format
HTML = Hypertext Markup Language
JPEG = Joint Photographic Experts Group
KC = Key Characteristic
MVC = Multiple Value Concatenation
PCD = Process Capability Data
PCDB = Process Capability Database
PSA = Professional Sports Authenticator
SPC = Statistical Process Control
SQL = Structured Query Language
SVC = Single Value Concatenation
URL = Uniform Resource Locator
VRM = Variation Risk Management

Glossary

* Aggregate data = "data provided when ... the details of one or more ... parameters are not known." (Tata, 1999)
* Attribute = characteristic of a datum, such as material or color.
* Automated system = system capable of accomplishing operations and goals without human guidance or other assistance.
* Basic table = list of all attributes and corresponding values in the indexing system.
* Child node = the subordinate among two connected nodes on different levels of a hierarchical tree. "Every node has a finite set (possibly empty) of nodes which are called immediate successors or children of that node." (Alagic, 1986)
* Class = a classification of similar objects. "A class characterizes one or more objects that have common methods, variables, and relationships"; "A class can be thought of as the 'rubber stamp' from which individual objects are created." (Burleson, 1999)
* Client = computer submitting requests to, and receiving data from, a server.
* Concatenation = act of placing two or more data strings in sequence as a single string.
* Confidence interval* = "an interval of plausible values for the parameter being estimated" (Devore, 1987)
* Conversion = process of exporting the information in an indexing scheme to a differently structured indexing scheme.
* Cookie = small, persistent data file on a client computer that identifies user information to a server.
* Curse of dimensionality = colloquial term for the observation that a linear increase in data dimensionality can exponentially increase the complexity of data analysis.
* Data independence = data model in which "data and process are deliberately independent"; "'ad-hoc' data access" (Burleson, 1999)
* Data mining = "the confluence of machine learning and the performance emphasis of database technology"; "discovery of rules embedded in massive data." (Agrawal et al., 1993)
* Database Management System (DBMS) = the structure, actions, and constraints imposed on data storage and access. "A number of models, each of which has a collection of conceptual objects, actions on those objects, and constraints under which those actions are performed." (Alagic, 1986)
* Domain knowledge = knowledge of a specific field of inquiry that goes beyond the information in a data set. Domain knowledge aids in discovery of rules and relations from a data set.
* Encapsulation = process that "gathers the data and methods of an object and puts them into a package, creating a well-defined boundary around the object." (Burleson, 1999)
* Engine = core processing element of a software package.
* Euclidean = characterized by a set of mutually orthogonal coordinate axes.
* Expansion = process of adding new criteria to an indexing scheme.
* Field = column of a relational data table.
* Goodness of fit = extent to which an equation matches the data it describes.
* Hierarchy = "a method of organizing data into descending one-to-many relationships, with each level having a higher precedence than those below it." (Burleson, 1999)
* Hyperspace/hypervolume = Euclidean space characterized by a large number of dimensions, typically more than four.
* Ignore file = table of words or other data properties that are not used for indexing.
* Index = "set of choices for each parameter detailing data desired. The index is the label for PCD in the PCDB." (Tata, 1999)
* Indexing scheme/system = model by which data is described and stored.
* Inheritance = assumption of methods and attributes belonging to parent elements.
* Interface = means by which a user interacts with a system.
* Invalid = a description of data indices that do not correspond with any possible real entity or process.
* Key Characteristic (KC)* = label "used to indicate where excess variation will most significantly affect product quality and what product features and tolerances require special attention from manufacturing" (Lee and Thornton, 1996)
* Machine learning = using computerized data analysis techniques to obtain new, useful rules and knowledge from a dataset.
* Method = data access or manipulation procedure; "behavior of the data" (Burleson, 1999)
* Nearest-neighbor = the data index located the shortest distance from a reference index, with distance measures dictated by the data structure.
* Node = element of a hierarchical tree structure.
* Null = taking no value or containing no information.
* Object = self-contained data package containing private data values, private procedures, and a public interface.
* Object-oriented = compatible with object-structured data and methods.
* Observation = sample; the outcome of a stochastic process.
* Parent node = the dominant node among two connected nodes on different levels of a hierarchical tree. "Every node, except one, has a unique node which is called its immediate predecessor or parent node." (Alagic, 1986)
* Pointer = data link between tables in a relational database.
* Polymorphism = "the ability of different objects to receive the same message and respond in different ways." (Burleson, 1999)
* Process capability* = "a product process's ability to produce products within the desired expectations of customers." (Gaither, p. 713)
* Process Capability Data (PCD) = "the expected and obtained standard deviations and mean shifts for a feature produced by a particular process and made of a particular material" (Tata, 1999)
* Process Capability Database (PCDB) = "includes target and actual tolerances for particular process, material, and feature combinations" (Tata, 1999)
* Prediction = assigning a speculative value to some entity when there is insufficient information to determine the value with complete confidence.
* Professional Sports Authenticator (PSA) = pay service specializing in the objective "quality" grading of collectibles.
* Projection = selection of a subset of attributes from a data object or table.
* Proximity = distance between two data indices, with distance measures dictated by the data structure.
* Query = request to access, modify or delete data in a data set.
* Re-indexing = reevaluation of the characteristics of a datum, followed by overwriting the datum's index with an updated one.
* Real-time = description of a process that executes on an immediate or nearly-immediate time scale relative to human interface activity.
* Record = row of a table-type database structure, describing one data instance.
* Regression = creation of a mathematical model to fit data approximately.
* Relational = database structure comprised of tables connected by time-dependent relations.
* Rules file = table of (in)valid index combinations in a data set.
* Semi-structured = possessing no formal data indexing structure.
* Server = computer or computer network that receives requests from clients and returns processed responses.
* Similarity = measure of resemblance between at least two data indices.
* Statistical Process Control (SPC)* = "The use of control charts" (Gaither, p. 740) "... used to ensure the ongoing quality of the manufacturing process." (Batchelor et al., 1996)
* Structured Query Language (SQL) = an expression vocabulary for submitting data queries.
* Substitution = representation of a datum's characteristics by reference to those of another datum.
* Surrogate = data that is similar to the data for an unpopulated index (Tata, 1999)
* Table = a set of tuples characterized by the same attributes.
* Thread = isolated instance of server variables and processes, created for each connected client.
* Threshold = maximum regression error value for an axis, above which the regression model is discarded as inaccurate.
* Tolerance = "[The] maximum value that [a] dimension can deviate from the specified value on [a] drawing." (Tata, 1999)
* Training = iterative process of refining a data model's parameters for best possible fit of data values.
* Translation = see Conversion.
* Trend = describable data pattern.
* Tuple = a relational data record. "In the relational model of data an entity is represented by a tuple of values of its attributes." (Alagic, 1986)
* Uncertainty = "Unsureness about the exact value." "There are a variety of uncertainties in PCDBs including surrogate data, multiple data sets, aggregate data and small data sets." (Tata, 1999)
* Uniform Resource Locator (URL) = pointer to data of almost arbitrary composition, typically allowing remote access.
* Unpopulated = description of a data index that contains no data values.
* Value = the state of an attribute; e.g., the value "blue" represents the state of the "color" attribute.
* Weight = assignment of mathematical influence over the outcome of an operation.
* Wildcard = character that represents no selection of values for an attribute.

1 Introduction

This chapter provides background information about Process Capability Databases and motivations for improving the current state of the art. The results of a literature search are presented, followed by the thesis objectives and information about the nature of the data used in this project. The chapter concludes with an outline of Chapters 2-7.
1.1 Background

Design and manufacturing enterprises rely on accurate and timely information to create products. Designers can use knowledge such as customer needs, target costs, and material properties to make a product design successful (Ulrich and Eppinger, 2000). Manufacturers can use knowledge such as machine availability, employee skills, and delivery schedules to make products according to requirements (DeGarmo et al., 1997). But the information needs of design and manufacturing are not entirely independent. Some types of knowledge concern both manufacturing and design; materials and scheduling needs are two examples (Ulrich and Eppinger, 2000).

Process Capability Data (PCD) is one type of knowledge that both design and manufacturing use. PCD is manufacturing data collected for process monitoring, identification of out-of-control manufacturing process elements, and evaluation of candidate processes for manufacturing operations. Designers can use PCD for such tasks as predicting manufacturing variation, creating criteria for robust designs, and analyzing cost sensitivities (Thornton and Tata, 1999).

Enterprises frequently use Process Capability Databases (PCDBs) to store PCD. PCDBs can index and store Statistical Process Control (SPC) data, PCD and other information such as date/time stamps and machine identification codes. Because PCD supports design and manufacturing functions, PCDBs should be efficiently accessible to both functions.

1.2 Motivation

Designers seek access to PCD for information support in creating new designs, relating known processes to new products, and reviewing current processes and designs. Recent academic study has revealed several shortcomings common to current PCDBs. These shortcomings hinder the design function's ability to use PCD as an information source for design tasks.
Common barriers to PCDB use by the design function include (Tata, 1999):

- Lack of PCDB commonality across enterprises. Large entities, and those who use component suppliers, frequently rely upon multiple PCDB sources. This complicates the information retrieval process and reduces the usefulness of the information for design. The ideal of querying multiple data sources with a single simple query is generally not achieved, discouraging PCD use.

- Poor PCDB indexing schemes. Widely disparate schemes exist for indexing data in PCDBs. Frequently used parameters include part number, Key Characteristic (KC) number, feature number, manufacturing process, feature type, machine number or name, tooling, supplier, team, product, and material. There is no standard for indexing PCD, so data indexing typically differs between manufacturers. This compounds the commonality problem, in which multiple databases must be queried for PCD. PCDBs also frequently use hierarchical structures. These structures complicate feature-based analysis by separating features into multiple locations in the hierarchy. If a user wants to retrieve all data representing a particular feature, the entire hierarchical structure often must be mined to find all the data.

- Poor population of supplier PCDBs. Suppliers often do not provide data to external groups (including customers), or they provide data only for specifically ordered parts. This reduces the amount of data available to designers and makes the design task more difficult.

These common PCDB characteristics are significant barriers to successful use of PCD for design. There is an evident need for tools and methods to manage or lower these barriers, so PCD can provide better feedback from manufacturing to design.
1.3 Literature review

This section discusses the results of a literature review, outlining findings regarding PCDBs, human capabilities for managing uncertainty, knowledge discovery, and user issues.

1.3.1 Current state of PCDBs

PCD provides a crucial link between design functions and manufacturing capabilities. Clausing (1998) discusses the wasted work that results from a lack of information during product development. PCD is one example of information that can be in short supply during development and design. To improve data availability, enterprises use repositories such as PCDBs to systematically store PCD. Campbell and Bernie (1996) outline a PCDB structure to catalog geometric tolerances for rapid-prototyping processes. Designers access the data by specifying feature types, which are design-oriented process characteristics. Design-oriented data access makes a PCDB system easier to use and more useful for designers.

Thornton and Tata (1999) find that the literature assumes PCD is available to designers, but this assumption is inaccurate. Some causes of poor PCD availability include:

- Data indexing schemes do not allow designers to find the data they need.
- Databases within an enterprise are often incompatible with each other, which can make it impossible to access multiple PCDBs with a single query.
- PCDBs are often poorly populated. Low population levels reduce the probability of finding pertinent PCD.

Among other needs, design requires consistent, design-friendly indexing schemes and methods for managing unpopulated data. The current lack of these elements has created barriers to PCD access.

1.3.2 Human capability for managing uncertainty

PCDBs represent a type of knowledge database in which the information is seldom comprehensive. A survey by Deleryd (1998) finds most enterprises' PCDBs are not fully populated with data. This gives rise to a complex problem.
When faced with incomplete knowledge, the human tendency to reason intuitively is both an asset and a liability. Piattelli-Palmarini (1991) discusses human "cognitive illusions": humans often form incorrect hypotheses based on incomplete information, and even when faced with contradicting information, they often cling to these outdated or incorrect hypotheses. Human hindsight is also far less reliable than its bearers believe, as is the human capacity for risk assessment. Because of these shortcomings, human inference regarding missing database knowledge can be unreliable, even among experts.

Automated systems can assist the human data user by performing impartial analysis tasks. Baral et al. (1998) discuss the use of logical methods to query incomplete knowledge databases. These methods apply inference techniques that are external to the database but make use of the data within it.

1.3.3 Data analysis and knowledge discovery

Mathematical and logical methods can make inferences about missing data. The knowledge discovery field offers insight into the problem of missing PCD. Walker (1987) discusses the successful use of automated systems to make inferences in the areas of mass spectrometry, pharmacology, mathematics, and geology. These systems typically make use of domain knowledge - fundamental, topic-specific knowledge that surpasses the information in the database - to make non-obvious conclusions about data interactions and relationships. Nevertheless, automated systems have very limited abilities, in both model-driven and data-driven tasks.

The possible approaches to model-driven and data-driven data analysis tasks are myriad. Zhang and Korfhage (1999) note the existence of "more than 60 different similarity measures" for comparing data numerically. Distance-based and angle-based measures in Euclidean data space are noted to be the most popular.
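The distance-based and angle-based measures noted above can be sketched briefly. The following Python fragment (the index names and attribute values are invented for illustration) computes both measures for attribute vectors in Euclidean space, and uses the distance measure to find the nearest populated neighbor of an unpopulated query index - the surrogate-data idea this thesis pursues.

```python
import math

# Hypothetical attribute vectors: each index is a point in Euclidean
# attribute space (e.g., encoded feature, material, and size values).
populated = {
    "hole/aluminum/small": [1.0, 2.0, 0.5],
    "hole/steel/small":    [1.0, 3.0, 0.5],
    "slot/aluminum/large": [2.0, 2.0, 2.0],
}

def euclidean_distance(a, b):
    """Distance-based measure: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Angle-based measure: values near 1.0 mean more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_populated(query):
    """Return the populated index closest to an unpopulated query index."""
    return min(populated, key=lambda k: euclidean_distance(populated[k], query))

# A query for an unpopulated index falls back to its nearest neighbor.
print(nearest_populated([1.0, 2.1, 0.6]))  # hole/aluminum/small
```

Either measure could drive the neighbor search; the distance-based form is used here because it respects absolute attribute magnitudes, which matters when size is one of the axes.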
Zhang and Korfhage (1999) present the concept of hybrid data-driven analysis tools and note that hybrid similarity measures are not frequently studied. All data-oriented similarity measures, even hybrids, have inherent weaknesses and are not well suited for all analysis tasks.

Model-driven data analysis and knowledge discovery methods see frequent use in feature recognition (Turk and Pentland, 1991; Pavlovic et al., 2000) and speech recognition (Rabiner, 1998; Pereira and Riley, 1996; Riccardi et al., 1996; Bangalore and Riccardi, 2000). Common model-based methods include neural networks (Bishop, 1996), Bayesian networks (Cowell, 1999), and support vector machines (Burges, 1998). One common benefit of model-driven methods is the ability to discard individual data points once the model is built. These methods generally require copious training to build accurate models. Like data-driven methods, model-driven methods are critically dependent on the presence of a large volume of accurate data.

1.3.4 Database-specific user issues

Meng et al. (1995) discuss translation of database queries between relational and hierarchical schemas. Query translation can allow a user to query multiple databases of differing structure from a single interface. Liu and Pu (1997) present various meta operations for heterogeneous data sources, including query aggregation. Meta operations can be part of an effort to integrate different data sources for responses to a single data request.

Chandra et al. (1992) note that query and inference methods should be clear to the user. A database user interface should provide basic information about the processes that return information to the user. Without information about query and inference methods, a user may not be confident that the methods are sound.

The literature documents the applications of PCD, but has only recently begun to address designers' need for better access to PCD.
Human inconsistencies in managing uncertainty are well documented, as are many methods for recognizing data features and managing missing data. Concepts for query translation and user interface design are also present in the literature. There is little integration of these concepts, however. Industry needs integrated solutions for missing data, multiple databases and improved data indexing; this need is not yet met.

1.4 Thesis objectives

This thesis advances the hypothesis that technological changes to PCDBs can reduce access barriers and improve PCD availability to design. To prove this hypothesis, prototype software is created to demonstrate improvements. Specifically, these improvements are:

- Database support of design functions:
  - design-focused methods for accessing data
  - error checking on queries
  - support for searching multiple PCDBs with a single query
- Strategies for unpopulated data indices:
  - data aggregation and support for wildcard indices
  - prediction methods for unknown data values

A data re-indexing component and a data analysis component form the basis for the improvements listed above. Each is discussed below.

1.4.1 Support for design functions

In industry, PCDB indexing schemes often reflect the manufacturing function's need for data access by features like drawing and part numbers. Design typically needs PCD organized by design-driven attributes like features, materials and processes. The two functions' differing needs can result in reduced value of PCDBs to design tasks. When designers have to cross-reference features with part or drawing numbers, their ability to efficiently retrieve PCD is reduced.

To address this access imbalance, this thesis presents a method of re-indexing data by attribute, in order to lower the barrier to PCDB usage by design. This re-indexing process does not compromise manufacturers' PCD access. Rather, it makes access more universal, allowing design and manufacturing to use the same interface for queries.
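As an illustration of the re-indexing idea (the field names and records below are invented for this sketch, not taken from any actual PCDB), the fragment maps records from two differently structured databases onto one common attribute tuple, so a single design-driven query reaches both sources while the original records stay in place:

```python
# Two PCDBs with different native schemas.
pcdb_a = [  # indexed by part number in its native form
    {"part_no": "P-1041", "feat": "hole", "mat": "Al 6061", "proc": "drill"},
]
pcdb_b = [  # indexed by KC number in its native form
    {"kc": "KC-77", "feature_type": "hole", "material": "Al 6061",
     "process": "drill"},
]

def reindex_a(rec):
    """Map a pcdb_a record to the common (feature, material, process) tuple."""
    return (rec["feat"], rec["mat"], rec["proc"])

def reindex_b(rec):
    """Map a pcdb_b record to the same common tuple."""
    return (rec["feature_type"], rec["material"], rec["process"])

# Build one unified index over both databases; records remain in place.
unified = {}
for rec in pcdb_a:
    unified.setdefault(reindex_a(rec), []).append(rec)
for rec in pcdb_b:
    unified.setdefault(reindex_b(rec), []).append(rec)

# One design-driven query now reaches records from both sources.
hits = unified[("hole", "Al 6061", "drill")]
print(len(hits))  # 2
```

The manufacturing-driven indices (part number, KC number) survive inside the records, so manufacturing access is not compromised by the overlay.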
1.4.2 Strategies for unpopulated data indices

A PCDB can only contain information collected from previously completed manufacturing operations. However, designers frequently are charged with designing components with unique sizes, exotic materials, obscure features or other characteristics that have not previously been catalogued in a PCDB. Searches for information regarding a new process typically return a null data set from the PCDB. In these cases, process capability must be estimated by supplementary methods.

Currently, several strategies exist for managing the problem of unpopulated data indices. Few of these methods include continued use of the PCDB, which may contain PCD for substantially similar processes. To maximize the value of PCDBs to design and manufacturing, this thesis presents a mathematical framework for extracting surrogate data from a PCDB when a requested index is unpopulated.

1.5 Thesis data

The methods described in this thesis are suitable for wider application than PCDB systems. Any structured or semi-structured data set can be arranged and analyzed by the methods described here. Regardless of the type of data, the resulting structures will be remarkably similar. It is appropriate, then, to demonstrate how the process works on an arbitrary semi-structured data set. This thesis does just that: the data that demonstrates the technology is collected from an application sharing nothing with process capability studies. Discussion of this data as well as PCD illustrates the versatile nature of the re-indexing process and data analysis techniques.

Several examples in this thesis use baseball card references to illustrate various principles. The data used for this project was gathered from a semi-structured data source on the World Wide Web. This data documents the dates, times, and selling prices of approximately 250,000 baseball cards auctioned online.
Its source was a large set of HTML tables in which auction records appeared as they occurred, by way of an automated recording system external to MIT. This sales data represents a semi-structured data set, in which no formal indexing system is available to classify the data. PCD, by contrast, is typically stored by marking records with a pre-defined index that best describes them. In their original forms, the organization of PCD is therefore more highly structured than the baseball card sales data.

1.5.1 Similarities

PCD and baseball card sales data do not initially seem to have much in common, but they are both describable in a very similar way. Both types of data use a single record to characterize a single occurrence. Each record is a set of attributes: materials/sizes/features for manufacturing operations and names/manufacturers/years/qualities for baseball card sales. We can describe any type of data, as long as we can determine what the attributes are and build the attribute sets. Figure 1.1 illustrates how a system can mine semi-structured data and determine data attributes.

[Figure 1.1: Some attributes of baseball card auction data, found in the auction title. The title "1997 BOWMAN JOSE CRUZ JR ROOKIE CARD MINT" yields year, manufacturer, name, and quality attributes.]

The meaning of the stored data differs by data type. In the case of PCD, we want to store the dimensional capabilities of a manufacturing operation. In the case of baseball cards, we want to find the price statistics for a baseball card. But with data for different applications stored in a similar way, we can use a consistent set of tools to manage all the data and address data problems that may arise.

1.5.2 Differences

Baseball card data is different from PCD in one very significant way. As noted earlier, no indexing scheme has initially catalogued the baseball card data. It typically exists without an indexing structure.
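Because the raw auction data carries no index, attributes must be mined from free text such as the title shown in Figure 1.1. A minimal Python sketch of that mining step follows; the vocabulary lists are hypothetical stand-ins, not the prototype's actual configuration.

```python
import re

# Hypothetical controlled vocabularies; a real system would be
# configured with (or learn) far larger lists.
MANUFACTURERS = {"BOWMAN", "TOPPS", "FLEER", "UPPER DECK"}
QUALITIES = {"MINT", "NEAR MINT", "EXCELLENT"}

def mine_attributes(title):
    """Extract year, manufacturer, and quality attributes from a title."""
    attrs = {}
    year = re.search(r"\b(19|20)\d{2}\b", title)
    if year:
        attrs["year"] = year.group()
    for m in MANUFACTURERS:
        if m in title:
            attrs["manufacturer"] = m
            break
    # Check longer quality phrases first so "NEAR MINT" beats "MINT".
    for q in sorted(QUALITIES, key=len, reverse=True):
        if title.endswith(q):
            attrs["quality"] = q
            break
    return attrs

attrs = mine_attributes("1997 BOWMAN JOSE CRUZ JR ROOKIE CARD MINT")
print(attrs)
```

Each mined attribute becomes one axis of the record's index, which is how an unindexed auction record ends up in the same attribute-based structure as a pre-indexed PCD record.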
As with much of the information available on the Internet and elsewhere, there is no predetermined way to categorize and manage the baseball card data. PCD, on the other hand, is typically stored in a rigidly structured PCDB. This thesis manages the differences between structured and unstructured data by presenting an indexing system that accommodates either type. The system presented here can re-index structured data, while creating ground-up indices for unstructured data. This capability creates advantages that become apparent in Chapters 4 and 5.

1.6 Outline

This thesis contains seven chapters. Chapter 1 defines PCDBs, reviews their use in industry, and states the motivations for further work. The chapter reviews current literature and summarizes the objective of the research contained in this thesis. Finally, Chapter 1 describes the data used for this project and outlines the structure of this thesis.

Chapter 2 presents the results of an industry survey at the 2000 Key Characteristics Symposium. Survey responses reflect the current state of PCDB implementation, and identify areas for PCDB improvements.

Chapter 3 discusses database management systems (DBMSs), identifying strengths and weaknesses of each major DBMS category. Weaknesses relevant to PCDB use receive particular attention. Common problems across DBMS structures are indicated, and the chapter concludes with a discussion of pertinent PCDB-related indexing issues.

Chapter 4 describes an attribute-based re-indexing scheme. This scheme can represent data from various sources with a single set of Euclidean indices, subject to constraints. The chapter discusses details of the indexing code and support for improved functions such as assistance with user queries, expansion of the indexing scheme, and assisted machine learning of new attributes.

Chapter 5 discusses a strategy for managing missing PCD by generating estimates for missing data values.
The chapter presents details and equations in support of the method. Chapter 6 outlines features of the software prototype that integrates the re-indexing scheme of Chapter 4 with the missing data strategies of Chapter 5. Chapter 6 also provides concepts for software enhancements. Finally, Chapter 7 offers conclusions and suggested directions for future work.

2 Industry Survey

To determine how design uses PCDBs, a survey was distributed to attendees of the Fourth Annual Variation Risk Management/Key Characteristic (VRM/KC) Symposium in January 2000. Attendees of this Symposium were interested in the use of VRM/KC tools, from industrial and academic viewpoints. Twenty-four attendees, representing American and international industrial and academic institutions, completed the survey.

The results of this survey suggest many institutions have not developed tools to systematically manage unpopulated PCDB indices. Additionally, unpopulated database indices seem to have adverse consequences upon the actions of designers. This chapter presents the survey and draws conclusions from the results. After offering conclusions, the chapter discusses recommended actions. The full survey can be found in Appendix A.

2.1 Survey questions and responses

This section presents each survey question and the responses received.

2.1.1 PCDB usage

Among the twenty-four respondents to the survey, eleven indicated having a PCDB system in use. Five additional respondents indicated PCDB implementation was in the planning stages at their organizations (Figure 2.1). Unless specifically noted otherwise, all remaining questions in this chapter were asked of those respondents who answered "yes" to this question.
[Figure 2.1: PCDB implementation among Symposium attendees]

2.1.2 Causes and effects of unpopulated PCDB indices

Respondents were asked to specify the design tasks that result in queries of unpopulated indices, choosing as many as apply to their organization. Ten of the eleven respondents identified the creation of new designs as a cause of queries of unpopulated indices. Four respondents indicated redesign tasks frequently lead to the return of null PCDB data sets. Four responses indicated investigative queries (what-if scenarios and other speculative search tasks) frequently lead to unpopulated indices. See Figure 2.2.

[Figure 2.2: Design tasks resulting in unpopulated PCDB indices]

Respondents were asked to note the data characteristics (i.e., attributes) that exhibit the most significant influence upon the probability that a query returns a null data set. Results are illustrated in Figure 2.3.

[Figure 2.3: Strong influences upon PCDB index population]

Complexity was the most frequently chosen response, indicating queries for more complex features or operations are less likely to return useful data. Four of the eleven respondents indicated they had noted no strong relationships between data characteristics and PCD population levels. Three respondents noted strong influences of feature selection upon index population levels. Two respondents indicated variations in material choice are influential, and two noted that feature size has a significant impact upon the potential for PCDB index population.

Reaction to the unpopulated index condition was gauged by asking respondents two questions about responses to unpopulated data indices.
One question asked for the most popular short-term solutions to a lack of data in the PCDB, while the other question asked about long-term strategies for addressing the same problem.

Responses to the short-term question are plotted in Figures 2.4 and 2.5. Respondents were asked to choose as many options as were applicable. Figure 2.4 illustrates results from respondents who do use PCDBs at their organizations.

[Figure 2.4: Methods for managing lack of PCD at organizations with PCDBs]

Eight respondents indicated information is requested from an enterprise's manufacturing center(s) when PCDB queries result in null responses. Seven respondents indicated contacting suppliers for the missing information, while five suggested intuition is used to estimate the information needed. Four respondents noted the use of an external expert, while one indicated the use of a software algorithm and one indicated design changes.

The responses above were compared with the responses of individuals who do not currently use PCDBs at their organizations. These individuals were asked what strategies they use, in the absence of a PCDB, to get PCD when it is not already known. The results appear in Figure 2.5.

[Figure 2.5: Methods for managing lack of PCD at organizations without PCDBs]

Eight of thirteen respondents indicated using an internal expert (perhaps someone in manufacturing) to provide some estimated measure of PCD. Six indicated supplier contact, and six indicated estimates are generated (presumably not via PCDB-supported algorithms) for the unknown values. Five indicated the consideration of design changes.
Three respondents indicated reliance upon an external expert to provide the necessary information.

When asked about long-term trends within the design function in response to missing PCD, survey respondents provided the answers in Figure 2.6. The most popular reply (6 respondents) was "No Trends Noted." Four respondents indicated PCDB queries were performed only upon "popular" indices - that is, upon well-known sets of parameters that have a greater chance of containing data. One respondent indicated queries are performed only upon indices known to be populated with PCD. No respondents indicated decreased PCDB use, distrust of the data, use only for revision of existing designs, or other reactions. This question was asked only of respondents who indicated PCDB use at their organizations.

[Figure 2.6: Designers' reactions to unpopulated PCDB indices]

2.1.3 Indexing strategies

Respondents were asked to specify how their PCDBs were indexed, selecting all items that applied. The results are illustrated in Figure 2.7.

[Figure 2.7: Index scheme parameters in use]

The most popular indexing scheme parameter was material, with ten of eleven respondents indicating its use. This was followed in popularity by the feature parameter, with nine responses, and then by the size and machine parameters, which received eight responses apiece. The part identification parameter received seven responses, followed by the operation parameter with six and the KC parameter with five.

2.1.4 PCDB population

Overwhelmingly, respondents indicated low levels of PCDB index population, with nine of eleven indicating their databases were populated to levels of 25% or less.
As illustrated in Figure 2.8, one respondent indicated 25-50% population, and one respondent indicated 75-99% population. (Respondents were not asked if their specified percentages took infeasible indices into account.)

[Figure 2.8: Index population levels in PCDBs]

Five respondents indicated unpopulated indices are distributed in concentrated, localized regions, and also in lower concentrations throughout the database. Three respondents described their PCDBs as "Mostly Unpopulated," and the remaining three respondents were distributed evenly between "Mostly Populated," "Concentrated Unpopulated Regions" and "Dispersed Unpopulated Data." Thus, eight of the eleven respondents suggested significant amounts of unpopulated data throughout their PCDBs, as well as concentrations of unpopulated indices in certain PCDB regions (Figure 2.9).

[Figure 2.9: Distribution of unpopulated data in PCDBs]

Respondents were asked what percentage of PCDB searches fail to return data. More than half of respondents indicated a null response more than half the time the database is used (Figure 2.10).

[Figure 2.10: Percentage of queries returning unpopulated indices]

2.2 Survey analysis

Survey results yielded several key findings, which supported and guided the work described in this thesis. These conclusions appear below.

2.2.1 Causes and effects of unpopulated PCD

Several conclusions were evident regarding the unpopulated data index problem. Each topic is addressed individually.

Influence of task on index population. It appears from survey results that new designs are the most significant contributors to queries returning unpopulated indices. Redesign tasks and investigative queries also contribute significantly to the problem of null responses.
The survey results suggest the most frequently unsuccessful queries support processes about which designers know the least. This was anticipated, as these tasks frequently look for information that lies outside the set of documented processes. New designs were also anticipated to be the most frequent causes of null responses. Redesign and investigative efforts may be based on previous design decisions, but a new design might not share any elements with previous designs. As a result of this difference, a new design can prompt database queries that explore entirely empty regions of a PCDB. In contrast, redesign and investigative queries often retain some characteristics of a catalogued design. This fact may increase the likelihood that investigative and redesign queries either return data in response to a query, or query an unpopulated database index that is "close to" a populated one.

Influence of attributes on index population. Among parameters of a query that increase the probability that a PCDB index is unpopulated, respondents identified query complexity as the most significant contributor. Complexity was chosen over more concrete parameters of the process being investigated, such as feature, material and size. This result suggested the aggregation strategy described in Section 2.3.2.

Strategies for missing PCD. When PCD cannot be found by direct database query, respondents who use PCDBs indicated heavy reliance upon data sources external to the PCDB. Survey results suggest respondents use at least six different contingency strategies with some frequency when PCDB indices are found to be unpopulated. This suggests database indices are found unpopulated with sufficient frequency to necessitate numerous contingency strategies. Some of these strategies are not advisable, however. In particular, using human intuition for generation of surrogate data is not generally advisable (Piattelli-Palmarini, 1991).
Additionally, relying on simple intuition to estimate process capability differs little from using intuition entirely in place of the PCDB, especially given that such a high percentage of database queries are unsuccessful. The popularity of requests to manufacturers is interesting, as one motivator for PCDB creation is the reduction of reliance upon manufacturing centers for statistical information. Yet the clear majority of organizations represented here have not reached the full potential of this reduction: on average, more than 50% of queries return unpopulated data indices, and the most popular method for dealing with this condition seems to be a return to the manufacturing center for information.

Among respondents who do not have access to PCDBs, there was a significant increase in the popularity of design change considerations when PCD is not available. Other strategies for managing missing data were similar in popularity between respondents with PCDBs and respondents without PCDBs. This comparison does not allow a direct item-to-item contrast between the actions of designers with PCDBs and those without, but it does suggest organizations with PCDBs rely upon information sources similar to those used by organizations without PCDBs. In the case of organizations with PCDBs, respondents undertake these strategies for addressing missing data after a PCDB query returns null results, so there is still a significant need for data estimates in spite of PCDB implementation.

Long-term implications of missing PCD. Survey respondents indicated loyalty to the PCDB when unpopulated indices are returned. Although they noted some reduction in the scope of queries submitted, no respondents indicated decreased PCDB use, distrust of the data, or use only to revise existing designs. This might be a reflection of many factors.
The continued use of PCDBs, even in light of frequent query failures, may reflect a fundamental trust in the PCDB system that is not shaken by a lack of data. Alternately, it may reflect enterprise requirements that PCDBs be used for every design task when possible, forcing PCDB use patterns to remain constant at some level. This portion of the survey did not provide as much insight into the attitudes of designers as was hoped, although the results do establish that unpopulated PCDB indices can cause some reduction in the scope of tasks for which the PCDB is queried.

2.2.2 Indexing strategies

Design tasks focus on parameters such as features, materials, sizes and process parameters (Tata, 1999). It is encouraging to see features, materials and sizes among the top parameters used for indexing, because they show the potential for making PCDBs more accessible to the design function. Even if a PCDB's structure does not currently allow for attribute-based data analysis, many respondents indicated the necessary information is contained in the database. This information might be used to re-index or restructure the PCDB for attribute-based analysis.

2.2.3 PCDB population levels

With regard to population levels, the survey addressed two topics: data topography within a PCDB, and respondents' long-term reactions to missing PCD. Conclusions for each topic appear below.

Population levels and distributions. Respondents generally indicated low population levels in their PCDBs. Many current indexing systems create millions of possible index permutations, of which many or most are infeasible (Tata, 1999). For this reason, respondents' answers to this question were not surprising. These results suggest a strong possibility that a requested PCDB index will be infeasible or unpopulated. Population levels less than 25%, which most respondents reported, indicate PCDB implementations frequently suffer from null query results.
Within respondents' databases, some regions seem to be more densely populated with PCD than other regions. This result suggests certain types of processes may be better represented than other types. This is consistent with intuition, as some processes are more frequently used than others in a facility.

Effects of population levels on query responses. It is clear that current PCDB population levels frequently prevent enterprise functions from obtaining the PCD they seek. The majority of survey respondents suggested most of their PCDB queries yield null responses. Respondents additionally indicated the unpopulated data index problem is not localized to particular areas of a PCDB. Instead, unpopulated indices create a constant, low-intensity "background noise" across the entire database, making unpopulated data indices a potential problem anywhere in the PCDB.

2.2.4 Remarks

The survey results underscore the need for solutions to three PCDB problems. First, it is clear that null responses occur with high frequency. This is probably a result of the low population levels of industry PCDBs. Second, query complexity is a strong contributor to null query responses. Third, the contingency strategies for missing PCD are myriad, and similar between organizations with PCDBs and those without PCDBs. These findings underscore the importance of strategies for managing null responses. With so many unfulfilled PCD requests, it is clear that industry needs strategies for preventing or managing null PCDB responses. Any solutions to these problems should also consider the findings of (Tata, 1999). Specifically:

- PCDBs are not generally accessible to designers with design-focused interfaces
- Indexing systems allow specification of infeasible indices
- Databases are frequently incompatible with each other

2.3 Recommended action

This section lists strategies for addressing two types of problems.
The first problem is improving the indexing scheme of PCDBs for design access. The second problem is managing unpopulated PCDB indices.

2.3.1 Indexing scheme improvements

To improve design's PCD access, PCDBs should offer a design-focused interface for generating queries. This interface ideally should not be limited to one function. Rather, the interface should be flexible enough to allow either design or manufacturing to access PCD by the appropriate attributes. The interface should also support the possibility of using a single query to search multiple databases. To respond to design-driven queries, PCDBs must be indexed using design-driven attributes. The ideal state would use each data attribute as a potential indexer, so each datum would be accessible from design-driven queries and manufacturing-driven queries. If each PCDB within an enterprise is to be queried from a single interface, then each PCDB should share a common indexing scheme. The data in each PCDB may remain in place, but a universal indexing code should accompany each individual datum.

The improved indexing scheme should allow for implementation of existing data analysis tools. Specifically, the scheme should arrange data in a structure that enables strategies for managing unpopulated indices. Because this management task relies on finding data indices that are similar to an unpopulated data index (see Section 2.3.2), the data structure should minimize the effort needed to determine similarity. Chapter 4 of this thesis outlines a database indexer that accomplishes these objectives by arranging PCD in a Euclidean, attribute-based data structure.

2.3.2 Improved management of unpopulated indices

First, industry needs a tool that prevents users from specifying infeasible index combinations. This tool would save PCDB users time by intercepting index specification errors before a user commits to a search. This error-checking method would be a valuable addition to PCDB implementations.
Second, industry needs strategies for addressing missing data. One solution to this problem would be to more fully populate the database. This is an expensive and time-consuming proposition. Populating the entire database could prove impractical from a time and money standpoint, as there are many thousands of valid indices in a typical PCDB. This problem requires a more cost-efficient and quickly implemented solution. Until PCDB population levels are high enough to return data for nearly every query, enterprises should employ statistically correct procedures for quickly generating estimates from available and properly organized information.

If a queried PCDB index is found to be unpopulated, one reliable strategy for finding surrogate data may involve reducing the complexity of the query to seek a more general process description, or selecting a different feature, material or size. The industry survey indicated complexity is a primary cause of null responses, so specifying a simpler set of criteria may improve a user's chances of getting useful data from the PCDB. An improved indexing structure should allow a user to make simple changes to some query parameters without having to reset all parameters. This functionality should include the ability to specify a partial list of search criteria for aggregate searches.

Because of their similarity to previously catalogued processes, redesign and investigative queries are promising candidates for mathematical algorithms that extrapolate or interpolate to unpopulated indices, based upon the data contained in local populated indices. The data structure suggested in Section 2.3.1 should place similar data indices in close proximity to each other, so distance can serve as a measure of similarity.
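As a minimal sketch of this interpolate-from-neighbors idea (not the multi-axis regression method developed later in this thesis), consider estimating a value for an unpopulated index from its nearest populated neighbors along a single attribute axis. The index tuples, axis layout and capability values below are invented for illustration.

```python
# Hypothetical sketch: estimate a value at an unpopulated index by linear
# interpolation along one attribute axis. Indices and values are invented.

def interpolate_axis(data, index, axis):
    """Estimate data[index] from the nearest populated neighbors on
    either side of `index` along the given axis."""
    below = above = None
    for coord, value in data.items():
        # Candidate neighbors differ from `index` only on `axis`.
        if all(c == i for k, (c, i) in enumerate(zip(coord, index)) if k != axis):
            if coord[axis] < index[axis] and (below is None or coord[axis] > below[0]):
                below = (coord[axis], value)
            if coord[axis] > index[axis] and (above is None or coord[axis] < above[0]):
                above = (coord[axis], value)
    if below is None or above is None:
        return None  # not enough local data to interpolate
    x0, y0 = below
    x1, y1 = above
    return y0 + (y1 - y0) * (index[axis] - x0) / (x1 - x0)

# Populated indices: attribute coordinates -> process capability value
pcd = {(1, 2, 1): 1.10, (1, 2, 3): 1.30}

# Query (1, 2, 2) is unpopulated; interpolate along axis 2.
print(interpolate_axis(pcd, (1, 2, 2), axis=2))  # midpoint estimate, about 1.2
```

A full method would combine estimates across several axes rather than relying on one; this single-axis form only shows why proximity in the index structure makes such estimates cheap to compute.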
Regression techniques are well suited for situations in which ample data is available in close proximity to unknown values, providing quickly implemented and statistically correct tools for generating surrogate data. Chapter 5 presents a set of regression operations that can generate estimates for unknown PCD values.

2.4 Conclusion

It is not the intent of this thesis to propose a solution to all unpopulated PCDB index problems, but rather to manage the types of queries that most frequently contribute to the return of unpopulated PCDB indices. As indicated by the survey, these queries may frequently be expected to involve unpopulated indices in close proximity to populated indices, allowing for the use of neighboring indices' data in predicting values for the unpopulated indices. This finding is key to the strategies employed in later chapters for managing unpopulated indices. Chapters 4 and 5 of this thesis offer solutions that directly parallel the findings of this survey, as well as the survey reviewed in (Tata, 1999). It is clear from these surveys that the needs of PCDB users are currently not being met. As outlined in Chapter 3, basic database architectures can contribute to the PCDB problems that these surveys have uncovered.

3 Current State of Database Implementations

Storing large amounts of data often requires a dedicated system. A PCDB is an example of such a system. Depending on their function, modern databases vary in size, structure and operations. But all databases must have methods for entering, storing, and returning data upon request. Database programmers use the term Database Management System (DBMS) to describe an organization structure that enables these functions. (Alagic, 1986) Multiple DBMS architectures exist, each uniquely adapted for specific applications.
One way to address these problems is to seek out an alternate DBMS from among those that are popular in other applications, with the intent of finding a DBMS that offers advantages over currently used PCDB DBMSs. In fact, there exist several DBMS architectures that are particularly widespread in implementation. Unfortunately these structures, despite their popularity, all feature weaknesses that can hamper the data storage, management and retrieval processes required of PCDBs, as well as higher level analysis functions. It would be impossible to describe the complete state of database technology here. Instead, some popular DBMS architectures are described below in detail.

This chapter begins by laying out specific needs for a PCDB management system. Next, it reviews three major types of database structures: hierarchical, relational, and object-oriented. During this review, the chapter identifies the strengths and weaknesses of each system. The chapter concludes with comments about how some DBMS weaknesses relate to PCDBs.

3.1 PCDB-specific DBMS needs

To manage PCDBs, a DBMS must satisfy several needs. This section describes these needs in detail.

Efficient data storage. To minimize the resources needed to store and search a PCDB, the DBMS should use an efficient structure that conserves storage space and processor load. The DBMS should not create structural elements that then go unused, because these elements can waste physical resources. To maximize flexibility of data entry and searching, the DBMS should store data that has not been assigned a full set of attributes. This support for incompletely specified data may be useful when some information about a set of PCD entries is unknown.

Ability to locate similar indices. The DBMS should naturally support easy methods for determining which indices are most similar to a reference index. One way to accomplish this is to place similar data indices close to each other in the data structure.
The system should not require a human to tell it which indices are similar to each other, because this information may change when the indexing scheme changes.

Index structure expandability. When a user adds new data attributes or attribute values to the indexing structure, s/he should be able to do so in a very short amount of time. Ideally, this process should be as easy as editing an attribute list. The user can then tell the DBMS to update the data with this new list. When a user makes data index updates, the DBMS should restructure the data arrangement as efficiently as possible. Index changes should have a minimal impact on the volume and structure of the data, although some changes are inevitable.

Support for multiple volumes. Enterprises often have multiple PCDBs available, including legacy systems, purchased data and inherited archives. Whenever possible, DBMS structures of each PCDB should be compatible with others. This makes the user experience consistent between PCDBs, and allows a user to query multiple PCDBs simultaneously from a single user interface. Even if each PCDB has a different structure, a user should still be able to easily search multiple PCDBs with a single query.

Support for simple aggregation. PCDB users have indicated a desire to return multiple data indices by using aggregated queries. The DBMS should have strong support for index aggregation when a user specifies a query or enters data. The DBMS should make the process of gathering aggregate data as efficient as possible, to minimize query response times.

Support for error checking. When a PCDB user enters new data or submits a query for PCD, the DBMS should check the submitted data indices for validity. The DBMS should only allow a user to specify valid index combinations.

3.2 DBMS architectures for PCDBs

This section reviews the three most popular types of DBMS architectures. These systems are hierarchical, relational, and object-oriented DBMSs.
The section indicates the unique advantages and drawbacks of each system. Hybrid systems are also briefly discussed.

3.2.1 Hierarchy structures

This subsection describes the hierarchy DBMS, and its disadvantages.

Background. Hierarchical structures often serve to illustrate other database management systems, but the hierarchy is also a separate DBMS. A hierarchical DBMS looks like an inverted tree, and each node in the structure represents an index that can contain data. A node may have a maximum of one "parent" node higher up in the structure, and may have multiple "child" nodes below it. See Figure 3.1.

Figure 3.1: Generic hierarchical structure with indices

Data indices in a hierarchical structure identify their parent nodes. All parents' indices are embedded in those of the children. For example, the parent of index 2.2.1 is index 2.2. A method can mine child nodes by looking for extensions of the current node's index code. See Figure 3.2.

Figure 3.2: Determining child nodes of node 1.1

Disadvantages. The hierarchical model offers simple, efficient access to data. But it can also hinder data access and knowledge discovery. Problems with the hierarchical DBMS appear below.

Scattering of similar indices. First, data containing similar attributes may appear in different locations throughout the structure. See Figure 3.3. The hierarchical DBMS often must repeat identical attribute names across a level of the hierarchy, and occasionally across multiple levels. To find all occurrences of an attribute, a search must mine the entire hierarchy structure. This does not make an attribute-based search impossible, but it does make the search inefficient. See Figure 3.4.
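The prefix-extension idea behind Figure 3.2 can be sketched in a few lines. The dotted index codes below are hypothetical, but the mechanism is the one described above: a node's children are exactly the indices that extend its code by one dotted level.

```python
# Sketch of child-node mining in a hierarchical index: children of a node
# are the indices that extend the parent's dotted code by one level.
# The index set is hypothetical.

def children(indices, parent):
    """Return immediate children of `parent` (one extra dotted level)."""
    prefix = parent + "."
    return sorted(i for i in indices
                  if i.startswith(prefix) and "." not in i[len(prefix):])

index_set = {"1", "1.1", "1.2", "2", "2.1", "2.1.1", "2.1.2", "2.2"}

print(children(index_set, "1"))    # ['1.1', '1.2']
print(children(index_set, "2.1"))  # ['2.1.1', '2.1.2']
```

Note that this mechanism only walks downward from a known node; it offers no comparable shortcut for finding all occurrences of an attribute scattered across the tree, which is the inefficiency discussed next.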
Figure 3.3: Identical attributes in multiple locations and levels of hierarchy structure

Figure 3.4: Mining hierarchy sections for "Green" and "Orange"

Lack of attribute information. The second problem with the hierarchical DBMS is the lack of information regarding a datum's attributes. Information about data attributes is contained by the hierarchical structure itself. Mining the structure for all attributes (the first problem from above) may require human assistance, because some attributes or attribute values with identical names may not actually represent identical properties. This problem is known as polymorphism. See Figure 3.5.

Figure 3.5: Identically named attributes may not represent identical concepts

Complicated determination of index similarity. Third, the hierarchical DBMS has no means to identify which data indices are most similar to a given index. This is important for managing unpopulated data indices. See Figure 3.6. Substitution or estimation with hierarchical data can become complex and difficult when similar indices are not efficiently found.

Figure 3.6: No knowledge of nearest-neighbor attribute values

Difficult alteration of the hierarchical structure. A fourth difficulty with hierarchical databases involves altering the DBMS structure. If a user deletes a data attribute or attribute value from the indexing scheme, the DBMS has to restructure all nodes containing that attribute. Deleting an attribute also deletes all of the data in children of nodes containing the deleted attribute.
The DBMS has to move data out of areas beneath deleted nodes, and then find an appropriate location for each data record. See Figure 3.7. A similar problem occurs when a user adds an attribute. The attribute must appear in multiple locations in the hierarchy, requiring an examination of the entire structure to identify all necessary insertion points.

Figure 3.7: Removal of attributes from hierarchical index

When a user adds an attribute or attribute value to the indexing scheme, the DBMS may need to add large structural sections to reflect the change. The system has to duplicate all nodes that will appear under the added attribute. See Figure 3.8. The DBMS then has to reanalyze and relocate all affected data within the new hierarchy. Also, a user must reexamine all human-generated information such as data attributes and nearest-neighbor attribute values. This reexamination must occur to make sure the human-generated information is still legitimate under the new indexing scheme.

Figure 3.8: Addition of attributes to hierarchical DBMS

3.2.2 Relational structures

This subsection discusses the relational DBMS, and the problems that may be encountered when using it.

Background. The relational model uses tables to store data. A table is a collection of data instances that share the same attributes, even if the values for one or more attributes are different between instances. Each row in a table is a single instance, or tuple. Columns (fields) represent the attributes that describe data in the table. See Figure 3.9.
Figure 3.9: Relational representation of items

Relational databases can be accessed with very simple query processing, via Structured Query Language, or SQL. Relational DBMSs traditionally comply with the principle of data independence. This principle says structural changes should not affect data already in place, and data operations should treat each datum exactly the same way. Relational databases allow any SQL command to operate on any available piece of information in any table. As a result, query definition is simple. This is an improvement upon older DBMS systems like hierarchical structures. Expandability of relational tables is superior to hierarchical or object-oriented structures (described later in this section), again due to data independence. Adding an attribute to a single relational table is simple. However, a newly created attribute might not contain any values until the user or a database algorithm adds them. See Figure 3.10.

Figure 3.10: Adding new attribute to relational table

An additional feature of relational databases is the pointer, which reduces the need for the duplication of data (a problem outlined above for hierarchical DBMSs). Pointers are simple references from a table to an attribute in another table. A pointer retrieves the data from the proper field in a remote table, and returns that data to a field in the presently active table. See Figure 3.11. This data is not persistent. It will be lost when the table is closed.
But the data can always be retrieved again from the remote table as needed, so while its existence in a particular table is transient, its availability from another table is continuous. The pointer system is an improvement upon hierarchical systems, because the DBMS only has to update one location when an attribute value changes. Figure 3.12 illustrates adding a Size attribute to the Ball table. In this case, the linked Toy table in Figure 3.11 will automatically update, incorporating the new data values.

Figure 3.11: Process of creating new table using pointers to two tables

Figure 3.12: Adding an attribute to a relational table: pointers automatically update

Disadvantages. Comparing relational systems with PCDB needs reveals several incompatibilities. These incompatibilities appear below.

No true aggregate objects. First, relational DBMS systems classify objects only by atomic attributes. These systems cannot truly "create" aggregate objects by combining attributes of individual objects. An aggregate object (Figure 3.11) is just a new table that contains attributes from other tables. The aggregate object does not truly "exist" in database storage.
It is only a projection (combination of columns) of one or more other tables, and it disappears when the database is closed. At run-time, a relational DBMS must create all aggregate objects with a projection of one or more relational tables. If an aggregate table uses fields that are not shared by all of its constituent tables, many attribute fields in the aggregate table will be blank. These empty cells appear in Figure 3.11, and waste database resources by allocating space that is never used.

Difficult alteration of the relational structure. Second, adding attribute fields (including their attribute values) to existing relational data is complicated when several tables are involved, requiring the DBMS to locate all insertion points for new fields. The added fields may contain actual data or simple pointers, but the system still must identify and update each location. This does not include the steps necessary to populate new field(s) with attribute values if the DBMS does not use pointers. The system must determine attribute values for every tuple via some external means.

Difficult determination of nearest-neighbor indices. Third, the relational DBMS has problems with nearest-neighbor identification, which parallel hierarchical DBMS problems. Within a particular attribute, the relational table has no knowledge of the nearest-neighbor order of attribute values. The ordering of some attributes, specifically those that are numerically specified (such as length in millimeters), may follow numerical order. But there is no guarantee that every numerical attribute will have a nearest-neighbor order that coincides with numerical order. Text attributes are even more troubling, because they are not likely to follow any nearest-neighbor algorithm that a computer can infer by reading text values. See Figure 3.13. The DBMS must order these attribute values by some other means, such as human intervention, in order to use any nearest-neighbor processes.
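The transient, projection-only nature of relational aggregates can be demonstrated with Python's standard-library sqlite3 module. The table names and rows below are invented, loosely following the Ball/Truck example; the point is that the combined "toy" result exists only as a query-time projection, with blank (NULL) cells where a constituent table lacks a field.

```python
# Sketch of an aggregate "object" in a relational DBMS: it is only a
# query-time projection, and BALL rows get NULL in the Size field.
# Table names and rows are invented for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ball (key INTEGER, color TEXT)")
con.execute("CREATE TABLE truck (key INTEGER, color TEXT, size TEXT)")
con.executemany("INSERT INTO ball VALUES (?, ?)",
                [(1, "Green"), (2, "Orange")])
con.executemany("INSERT INTO truck VALUES (?, ?, ?)",
                [(1, "Green", "Large"), (2, "Orange", "Small")])

# The aggregate exists only while this query runs; it is never stored.
rows = con.execute("""
    SELECT color, NULL AS size FROM ball
    UNION ALL
    SELECT color, size FROM truck
""").fetchall()
print(rows)  # BALL rows carry size None; TRUCK rows carry their sizes
```

Closing the connection discards the aggregate entirely; only the constituent tables persist, which is the storage behavior criticized above.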
Figure 3.13: Relational difficulties with attribute value similarities

3.2.3 Object-oriented structures

This subsection discusses the unique architecture of the object-oriented DBMS, and the difficulties associated with using it.

Background. Object-oriented DBMS systems can store data properties, support property and method inheritance, and support true aggregate objects. These functions are not possible with the pure relational and hierarchical models addressed above. Within an object database, a data object is an encapsulated private module, belonging to a "class" that contains other objects with identical attributes. In addition to relevant data, the object contains a collection of procedures (methods). These methods, and only these methods, can be executed against the data in the object. The only exception to this rule is the set of methods of an object's parent classes. The object inherits these methods, so they are also valid. The data object also inherits attributes from its parent classes. See Figure 3.14.

Figure 3.14: Object-oriented DBMS supporting attribute and method inheritance

Disadvantages. The primary disadvantages of object-oriented DBMSs are structural alteration, query definition, and attribute value ordering.

Difficult alteration of the object-oriented structure. Changing an object-oriented indexing scheme requires locating and updating all appropriate data locations. Adding an attribute or method can require significant time and effort, partially due to polymorphism.
As described in Section 3.2.1, polymorphism describes the possibility that an identically named property or method represents different concepts for different objects. Figure 3.15 illustrates polymorphism for the "+" method.

Numerical data types: 4 + 5 = 9
String data types: "Tru" + "ck" = "Truck"

Figure 3.15: Polymorphism of methods: "+" adds numbers, concatenates strings

Polymorphism is powerful for an end user, but it complicates data index expansion. This is especially true when attributes or methods have been defined at the child object level instead of higher up in the structure. When a structural change occurs, the DBMS must examine the "flavor" of a property or method within each affected object. See Figure 3.16. This adds complexity relative to hierarchical and relational tables, which typically do not support polymorphism (although they may contain attribute values in different locations that represent different properties).

Figure 3.16: Polymorphism of attributes and associated ambiguities

The complexity of changing an object-oriented system's indexing scheme encourages very careful initial planning of a database, and discourages structural improvement of a poorly planned object DBMS. It is very difficult to design an object-oriented DBMS system for easy maintenance if the system is likely to experience change many times over its lifetime.

Complicated query definition. Query definition is complicated within object-oriented DBMSs, due to a conflict between declarative languages (such as SQL) and object encapsulation. SQL assumes all data is equally accessible to any given command. This assumption is not true in an object environment.
Only methods contained within an object or inherited from a parent can operate upon an object's data. Thus, an "uncooperative" data object may compromise the simple authoritarian nature of SQL. See Figure 3.17. A user must often use alternate methods to query an object database, reducing the simplicity of data management and analysis tasks.

SQL intent: Delete all objects and associated classes having parent class "Toy"
Result: Operation failure: "Ball" object class does not contain or inherit "Delete" method

Figure 3.17: Failure of declarative operations due to encapsulation

3.2.4 Hybrid Structures

There are several hybrid forms of databases that combine features of the above systems. One popular example is the relational-object model, which combines relational table architecture with various features of object-oriented data. The wide variety of hybrid systems prohibits detailed explanation here. Hybrid systems are promising solutions for specific data needs, although they generally inherit some or all of the problems associated with their constituent forms.

3.3 Common problems across DBMS architectures

In addition to the index expandability problems outlined above, database management systems share other common shortcomings. These problems are outlined below.

3.3.1 Inherent dissimilarity of DBMSs

One common problem is simply the existence of dissimilar DBMS structures. This problem is not symptomatic of one type of DBMS, but always causes problems when more than one DBMS is in use. PCDB users are inconvenienced when they have to submit a different query to each desired database.

3.3.2 Error management and infeasible indices

Error management is another difficulty with current systems. Many DBMSs have no means for identifying and correcting data errors during data entry, storage or retrieval.
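A hedged sketch of the kind of error checking that is missing: before a query or data entry reaches the DBMS, the submitted index combination could be checked against a whitelist of feasible combinations. The material/process/feature combinations below are invented for illustration.

```python
# Sketch of index error checking: a whitelist of feasible
# (material, process, feature) combinations is consulted before a query
# or data entry is accepted. The combinations are invented.

FEASIBLE = {
    ("steel", "drill", "hole"),
    ("steel", "cut", "slot"),
    ("paper", "cut", "slot"),
}

def check_index(material, process, feature):
    """Reject infeasible index combinations before they reach the DBMS."""
    combo = (material, process, feature)
    if combo not in FEASIBLE:
        raise ValueError(f"infeasible index: {combo}")
    return combo

print(check_index("steel", "drill", "hole"))  # accepted
# check_index("paper", "drill", "slot") would raise ValueError,
# intercepting the error before bad data is stored or queried.
```

Intercepting the error at entry time prevents both failure modes discussed below: futile queries of empty indices, and data silently stored under an index no one will ever query.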
In Figure 3.18 a hybrid PCDB structure uses a material (A), a process (B) and a feature (C) to characterize a manufacturing operation.2 The user specifies one hierarchical index for each of the three structures (A, B, C).

A. Material: 1. Steel, 2. Paper
B. Process: 1. Cut, 2. Drill
C. Feature: 1. Hole, 2. Slot

Figure 3.18: Hybrid database structure requiring one choice each from A, B and C

The database system will allow an index choice of the form {A2, B2, C2}, specifying a paper material, drilling operation and slot feature. The DBMS accepts this combination even though a manufacturer would not attempt to drill a slot in a paper material. The index {A2, B2, C2} is infeasible for practical use, but the DBMS allows a user to specify it anyway.

If a user accidentally submits the query {A2, B2, C2} to an error-free database, an error message will result when the requested index is found to contain no data. But if the user commits the specification error during data entry, data will be stored in the wrong index and no error message will result. Continuing the example from Figure 3.18, an operator may want to enter data from a manufacturing operation that drills a hole in steel. The index for this operation is {A1, B2, C1}. The operator may accidentally type in the infeasible index {A2, B2, C2} instead, causing the data to be stored under the wrong index. After this error occurs, a query of the valid index {A1, B2, C1} will not return the data that should be stored there, because the data will have been stored in the infeasible index {A2, B2, C2}. Such an error causes inaccurate data characterization, reducing the effectiveness of the database.

2 This is a simplified representation of some PCDBs currently in use.

3.3.3 Schema conversion

Converting a database scheme from one type to another is often imperfect. For instance, a data object typically contains information about which operations a DBMS can perform on it.
But the data in a relational or hierarchical scheme typically lacks such by-object resolution. When converting data from relational or hierarchical schemes to object schemes, there may not be any way to automatically create the additional information stored in a data object. However, the DBMS may not need certain contents of a data object. In this case, the data objects may be incomplete without affecting the operation of the database.

3.4 The missing data problem

An additional major concern in PCDB implementation is the problem of missing data. Some potential causes of unpopulated data indices are:

- The enterprise has never undertaken the requested manufacturing process
- The enterprise has undertaken the requested manufacturing process, but has not catalogued it
- The enterprise has catalogued the process, but committed an error in recording its index
- The requestor committed an error in specifying the search, returning the wrong index
- A supplier has not undertaken or catalogued the requested process
- A supplier does not allow access to the requested data

These problems can be summarized with the following three categories:

1. Lack of access to, or existence of, data
2. Erroneous recording of data
3. Erroneous request for data

Problem categories (2) and (3) above can be addressed by providing improved tools for managing data entry and user queries, to reduce errors in index specification. Problem category (1) groups a lack of data access with a lack of data existence because in both cases the user is left with no means to obtain the data desired. In this case, statistical prediction methods can generate surrogate data from the existing data set.

3.5 Conclusion: relevance to PCDB structures

The MIT VRM group has seen PCDB implementations that use hybrid hierarchical DBMSs. These DBMSs suffer from many of the problems noted in Section 3.2.1. Some of these problems parallel those discovered in previous VRM group surveys (Tata, 1999).
Specifically, lack of database commonality across enterprises, flawed indexing schemes, large numbers of infeasible index combinations, and poor data population levels were previously found to plague PCDB implementation. Switching from one commonly used DBMS architecture to another will not eliminate problems that affect PCDB use. Each DBMS has its own problems and shares problems with other DBMSs. Popular DBMS systems share basic problems that prevent efficient management of unpopulated data indices and determination of "similar" attributes/values. Better error management tools can improve storage and retrieval accuracy in PCDBs. New tools and data environments that enable these improvements are described in Chapters 4 and 5.

4 Attribute-Based Indexing

Instead of indexing with one of the DBMS systems from Chapter 3, this thesis proposes a method for rearranging the organization of data. The method uses only the characteristics of the data to create a new indexing system, instead of relying upon the branched-tree structures that are frequently used in PCDBs. This system can be created in one of two ways. If a data set is already indexed, exhaustive "mining" of the data index for attributes can provide the information necessary for re-indexing in a new form. If a data set is not already organized by an indexing system, the data's characteristics - such as textual contents, responses to mathematical algorithms, or any other observable properties - can provide the means for reorganization. In either case, the resulting indexing system offers advantages over more traditional DBMSs. Specifically, it simplifies data organization and makes the PCDB more versatile. To simplify organization, the system maps different types of data into a single set of indices. This data may come from separate, inconsistently structured volumes. The re-indexer creates one searchable set of indices that includes all volumes, cataloguing all of the desired data.
One search can then be used to find data from any of the original volumes. This chapter describes the data indexing structure developed for this thesis. First, basic concepts relevant to the indexing scheme are discussed. The chapter then describes the components of the attribute-based indexer, and concludes with a discussion of the indexer's learning methods.

4.1 Basic concepts

This section outlines the index hyperspace, two special cases in representation, and a rationale for using this system.

4.1.1 Attribute coordinate hyperspace representation

To make data-related tasks easier, the re-indexing process organizes data into a large multidimensional space, with each dimension of that space corresponding to one attribute of the data. The result is illustrated in Figure 4.1, for a case in which three attributes (axes) are used to describe the data. In the figure, these attributes are the item's name, its size, and the material of which it is made. The space is expandable to an arbitrary number of dimensions, but three dimensions are used here for visualization purposes. This representation depicts the indexing scheme, not the data each point contains. Each point in the hyperspace can contain data describing items represented by the point. This data can exist in essentially any form, including numerical data and formatted text. This thesis concentrates on numerical data forms.

[Figure 4.1: Index space in three dimensions, with axes ITEM, SIZE and MATERIAL and the sample point [2,2,2]]

Each axis contains a series of coordinate number labels, and each number on an axis corresponds with a value the relevant attribute can take. For example, the "item" attribute may take the values "globe," "disco ball," or "lamp." Coordinate numbers on the ITEM axis represent each of these potential values. See Figure 4.2.

Attribute Table:
1. ITEM: 1. Globe, 2. Disco Ball, 3. Lamp
2. SIZE: 1. 10", 2. 12", 3. 16"
3. MATERIAL: 1. Plastic, 2. Glass

[Figure 4.2: Attribute table assigning values to axis coordinates]

In this case, the number 1 on the ITEM axis may represent a globe, 2 may represent a disco ball, and 3 may represent a lamp. Higher numbers on the ITEM axis may represent other item names the system needs to represent. The SIZE and MATERIAL axes are similarly set up, with numbers on each axis representing values for each respective attribute. The 3-dimensional coordinate depicted in Figure 4.1, [2,2,2], represents a choice of 2 ("disco ball") for the item name, 2 for the size and 2 for the material. If a SIZE choice of 2 represents a 12-inch diameter, and a MATERIAL choice of 2 represents glass, the coordinate uniquely identifies the item as a 12-inch diameter glass disco ball.

4.1.2 Representation of items with multiple values for an attribute

An item may have two valid values for one of its attributes. For instance, a glass disco ball may double as a hanging globe during non-party hours, making it a disco ball and a globe. Ideally, the indexing system should allocate a separate coordinate on the ITEM axis, perhaps labeled "disco ball and globe," to represent this duality. However, this is not always practical, particularly in a case where such combinations are frequent and unpredictable. In these situations, multiple data points can represent a single item. For the example described above, the item can have two data points in the hyperspace, in which case its data is represented by a combination of the "pure" data for each of the individual points. This is illustrated in Figure 4.3. Data for the item then exists in both locations, and a query of the database for the desired object will return data from both data points, with each data set labeled by source. The storage and querying process is described in greater detail later in this chapter.

[Figure 4.3: Representing items with multiple values for a single attribute, as the two points [2,2,1] and [2,2,2]]

4.1.3 Representation of items with no values for an attribute

Alternately, an object may not have any values at all for a particular attribute. This may occur because the object is not describable by that attribute, or because the value for the attribute is unknown. In this case, the attribute simply does not appear in the description of the object. This feature is very important for cataloguing different types of objects within a single hyperspace. Each object is described using only the attributes (axes) that are relevant to it. For example, if a fourth axis called FLAVOR existed in the hyperspace in Figure 4.1, the new axis would not be relevant to any of the objects (the globe, disco ball or lamp). The FLAVOR attribute does not typically describe any of these items, so that attribute (and its axis) would be ignored in the descriptions of the items. If candy items existed in the list of objects to be described, FLAVOR would then become a relevant axis. But candy description might not require the SIZE axis. Then we would have some objects described only by the ITEM, SIZE, and MATERIAL axes, while other objects would use only the ITEM, MATERIAL, and FLAVOR axes. Figure 4.4 illustrates this possibility in a sequence of two figures, illustrating the three relevant dimensions for each case.

[Figure 4.4: Different attribute axes for non-candy items (ITEM, SIZE, MATERIAL) and candy items (ITEM, MATERIAL, FLAVOR), respectively]

Each of the representations in Figure 4.4 is really just a dimensional subset of the entire hyperspace. In this example, we have four dimensions (axes) in the hyperspace, and each object is represented in Figure 4.4 by three of those four dimensions.
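The mapping from attribute values to axis coordinates can be sketched in code. This is a minimal illustration of the concept in Figures 4.1-4.4, not the thesis's implementation; the function name `coordinates` and the dictionary representation are assumptions chosen for clarity.

```python
# Attribute table from Figure 4.2: axis name -> {coordinate: value}.
ATTRIBUTE_TABLE = {
    "ITEM":     {1: "Globe", 2: "Disco Ball", 3: "Lamp"},
    "SIZE":     {1: '10"', 2: '12"', 3: '16"'},
    "MATERIAL": {1: "Plastic", 2: "Glass"},
}

def coordinates(item):
    """Translate an item's attribute values into axis coordinates,
    skipping any axis the item does not use (section 4.1.3)."""
    coords = {}
    for axis, value in item.items():
        # Invert the table for this axis: value -> coordinate number.
        lookup = {v: k for k, v in ATTRIBUTE_TABLE[axis].items()}
        coords[axis] = lookup[value]
    return coords

# The 12-inch glass disco ball maps to the point [2,2,2] of Figure 4.1.
disco_ball = {"ITEM": "Disco Ball", "SIZE": '12"', "MATERIAL": "Glass"}
```

An item with no value for an axis (say, a lamp of unknown size) simply omits that axis from its coordinate set, as described in section 4.1.3.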
(To avoid confusion, this thesis does not present visual examples for hyperspaces of more than three dimensions.) If an object possessed name, size, material and flavor properties, it would have coordinates on all four of the axes, and would exist in the complete four-dimensional hyperspace. In fact, any number and combination of the axes can describe an object, depending on how many relevant attributes the object has.

4.1.4 Rationale for representation

This particular representation is highly valuable for data storage and analysis because it enables functions that are difficult to implement in other structures. With each attribute represented by an orthogonal (independent) axis, an object becomes a set of coordinates, enabling several spatially oriented strategies for data manipulation. The attribute-based structure allows for distance-based analysis of similar data indices, as they are located close together in the index hyperspace. Examples of this type of analysis include retrieval of data from indices similar to a chosen index, estimation of (or substitution for) data in empty indices, improved data index expansion, and some forms of error management. Attribute-based re-indexing also improves learning algorithms that use human input to discover new data features. These analysis tasks are valuable to industry because they assist in solving difficult problems. Later sections of this chapter outline these tasks in greater detail.

4.2 Elements of the indexing system

This section discusses system elements in detail. The section begins with a review of needs the indexer must fulfill. Then the components of the code and the rules file are discussed, followed by the search and attribute expansion processes. To index data in the attribute-based structure, each datum must have a code that identifies its contents.
This code maps the datum to the index space and is the only link between the user and the data, so it must meet several needs. These needs are listed below.

- Represent each relevant attribute axis
- Suppress axes that are not relevant to a particular object
- Represent each appropriate value for an axis
- Discern between axes that can take only one value and those that can take multiple values
- Accommodate wildcard characters as axis values
- Serve as data descriptors and search criteria
- Enable searching by a subset of attribute axes
- Consume minimal system resources for generation, storage and manipulation
- Enable expansion of the index with minimal human effort
- Represent other indexing structures, such as hierarchical trees
- Enable checking of queries and new data indices for validity
- Place data with similar characteristics in close proximity

To satisfy these needs, the system uses a simple Polish notation string to characterize data attributes and represent queries. Unlike many other indexing structures, this notation can satisfy the needs listed above. This string is custom-generated for each datum in a data set and then searched against during query operations. See Figure 4.5.

[Figure 4.5: Attributes translated into index code. For an item with ITEM = Disco Ball (2), SIZE = 12" (2) and MATERIAL = Glass (2), the code is 1,2,-,+,2,2,-,+,3,2,-,+,+]

Each identified attribute, and only identified attributes, appear numerically in the code. For instance, the item name "disco ball" (value 2 for axis 1 in the attribute table) appears at the beginning of the code via the substring "1,2,-,+." The 12-inch size (value 2 for axis 2) appears next, via the substring "2,2,-,+," and the "glass" material property (value 2 for axis 3) appears via the substring "3,2,-,+."
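Generating such a code can be sketched as follows. This is a hedged illustration of the notation described above, assuming the substring layout of Figure 4.5 (axis number, value numbers, operator, "+"); the function name `encode` and its input format are assumptions, not the thesis's code.

```python
def encode(attributes):
    """Build a Polish-notation index code from {axis number: [value numbers]}.
    "-" is the single-value operator, "&" the multiple-value operator,
    "+" ends each attribute substring, and a final "+" ends the code."""
    parts = []
    for axis in sorted(attributes):
        values = attributes[axis]
        op = "-" if len(values) == 1 else "&"
        parts.append(",".join(str(n) for n in [axis] + list(values)) + f",{op},+")
    return ",".join(parts) + ",+"

# The 12-inch glass disco ball of Figure 4.5: value 2 on each of axes 1-3.
code = encode({1: [2], 2: [2], 3: [2]})   # "1,2,-,+,2,2,-,+,3,2,-,+,+"
```

An attribute irrelevant to the item (such as FLAVOR for a disco ball) is simply absent from the input dictionary, so it never appears in the code.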
Since the flavor attribute is irrelevant to a disco ball, it does not appear anywhere in the code. The "+" and "-" operators are described below.

4.2.1 Components

This subsection describes the index code and rules file components of the indexing system.

Index code. The text string code in Figure 4.5 contains numbers identifying attributes (axes) and values (coordinates), but other symbolic components are necessary. These other components take two general forms. The first form is the separator, which identifies the end of one part of the code and the beginning of another. The second form is the operator, which represents groupings of code symbols into substrings. These substrings identify the properties of the coded items. Two types of separators are used in the code. The first type, the symbol separator, appears in this notation as a comma (,). This separator is a boundary between all other symbols, with the sole task of marking the end of one symbol and the beginning of another. This is important if multiple text characters represent a single symbol. For example, the number "xyz" is very different from the two numbers "xy" and "z," so the comma is used to discern between "xyz" and "xy,z" in the code. Any non-numeric symbol can serve this purpose, such as a space between the symbols, but this thesis uses the comma to make the distinctions deliberately. The second type of separator, the attribute separator, appears as the plus symbol (+). The attribute separator indicates the end of one attribute-value substring and the beginning of another. These separators are illustrated in Figures 4.6 and 4.7.

[Figure 4.6: SVC operator in substring: an attribute number, one value number, the SVC operator (-) and the attribute separator (+)]

[Figure 4.7: MVC operator in substring: an attribute number, several value numbers, the MVC operator (&) and the attribute separator (+)]

Operators are more complex than separators, and associate values with the attributes they describe. An operator symbol may be one of three types.
The first form is the single-value concatenation (SVC), appearing here as the minus symbol (-). The SVC operator pairs an attribute axis number with one value number appearing directly after it. See Figure 4.6. This operator is used when an attribute can only have a single value, and the (-) operator can optionally verify that the user has selected only one value for the attribute. The second operator type is the multiple-value concatenation (MVC), appearing here as the ampersand symbol (&). The MVC allows an attribute to take several values, and pairs an attribute axis number with multiple value numbers. See Figure 4.7. In Figure 4.6, the single value number 2 modifies the attribute number 1, meaning the hyperspace axis labeled "1" will have coordinate "2" for the item being described. The "-" operator explicitly assigns the value number to the attribute number, and the "+" operator indicates the end of the substring (and the end of assignments for axis 1). In Figure 4.7, the value numbers 2, 3, and 6 modify the attribute number 1, meaning the axis labeled "1" will have all three coordinates. The "&" operator performs this association, and the "+" operator ends the substring of assignments to the attribute. Concatenating substrings like those in Figures 4.6 and 4.7 will create a full code for the description of an item. Each attribute number appears in the code a maximum of one time. A final "+" operator follows the entire code, to specify the end of the code. The code in Figure 4.5 illustrates a full code for three attributes.

Rules file. There are many impossible combinations of attribute values in a typical process capability database. A manufacturer may not make plastic disco balls, for example, in which case a database pertaining to that manufacturer will not contain any information about them. However, a user might specify this combination of attributes while constructing a search or storing new data in a database.
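The separator and operator roles described above can also be seen in a small parser. This is a sketch under the stated conventions (comma as symbol separator, "+" as substring terminator, "-" and "&" as SVC/MVC operators); the function name `decode` and the output format are assumptions for illustration.

```python
def decode(code):
    """Parse an index code back into {axis number: [value numbers]}.
    Each attribute substring ends with its operator ("-" or "&")
    followed by "+" (Figures 4.6-4.7); a final lone "+" ends the code."""
    result, current = {}, []
    for symbol in code.split(","):
        if symbol == "+":
            if current:                       # final lone "+" leaves current empty
                axis, *rest = current
                op, values = rest[-1], rest[:-1]   # op is "-" (SVC) or "&" (MVC)
                result[int(axis)] = [v if v == "*" else int(v) for v in values]
                current = []
        else:
            current.append(symbol)
    return result
```

The comma split is what makes "xy,z" distinguishable from "xyz," as the symbol-separator discussion above requires.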
This might happen accidentally by way of a user error, or the user might purposefully specify the combination of attribute values because s/he does not know the combination is invalid. This error can occur during data entry and data retrieval. If the user commits the error when specifying attribute criteria, a system error or an empty set of returned data will result. If the error occurs during data entry, the data will be stored incorrectly, making it difficult or impossible to find later. Both of these results can reduce the effectiveness of the database and frustrate database users. A rules file eliminates the possibility of specifying impossible attribute value combinations, immediately letting the user know when specified values are incompatible with each other. Such a file is illustrated in Figure 4.8. The file contains all "illegal" combinations of index values. An automated system can consult the rules file to identify invalid combinations, and then make errors or invalid combinations evident to the user. Consultation can occur in real-time, with the user's selections of attribute values automatically narrowed to represent only allowable combinations. This is illustrated sequentially in Figure 4.9. Alternately, the system may wait for the user to completely specify the code, and then consult the file once to verify that the combination of attribute values is valid. The real-time consultation is more useful as an aid to specifying data entry and query codes, because it provides the user with more information during the process.

[Figure 4.8: A rules file, with a sample invalid index combination: ITEM = "disco ball" with MATERIAL = "plastic" is disallowed for any SIZE]

[Figure 4.9: Real-time specification: Selection of "disco ball" eliminates "plastic" from the MATERIAL menu]

4.2.2 Processes

This subsection discusses the system's search and attribute expansion processes.
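Rules-file checking can be sketched as follows. The thesis's on-disk rule format (Figure 4.8) is only partially legible here, so this sketch assumes each rule is simply a set of attribute values that may not occur together; the names `RULES` and `is_valid` are hypothetical.

```python
# Each rule lists attribute values that are mutually incompatible.
# A candidate index is invalid if it matches every pair in some rule.
RULES = [
    {"ITEM": "Disco Ball", "MATERIAL": "Plastic"},   # no plastic disco balls
]

def is_valid(index):
    """Return False if the index hits any forbidden combination."""
    return not any(all(index.get(a) == v for a, v in rule.items())
                   for rule in RULES)
```

A real-time interface, as in Figure 4.9, would call a check like this after each menu selection and grey out any value that would complete a forbidden combination.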
Search function. To search for a particular item in a code, the user submits a query with a format very similar to the code used to describe items. The system compares this query directly with each item in a database, and returns all items satisfying the requested criteria to the user. Like item codes, the query code is a string using attribute numbers, value numbers, separators and operators. Each attribute substring (see Figures 4.6 and 4.7) indicates the desired values for each attribute. An attribute substring also indicates whether some or all values need to appear in order to satisfy the search. Only attributes that the user specifies are used as search criteria. If a user does not specify a requested value for an attribute, the system will not use that attribute as a search criterion. There are differences between operators in item coding and query coding. Object codes use the ampersand (&) operator as a multiple-value concatenation (MVC), when a single attribute takes multiple values to describe an item. Object codes also use the minus (-) operator as a single-value concatenation (SVC), to indicate when an attribute can only take one value. A query uses these two operators differently. The (&) operator acts as a logical AND, and requires all requested values for an attribute to appear in an item's code. If only some of the requested values appear in an item's code, the system will not return the item to the user. The query code uses the (-) operator as a logical OR, requiring only one of the requested values for an attribute to appear in the item's code. If at least one requested value appears for an attribute in the item's code, the item is not rejected. An item's index code must satisfy the requested criteria for all attributes in the query, or it is rejected and will not be returned to the user.
[Figure 4.10: Query structure and sample returned index codes. The query 1,2,3,6,&,+,2,4,5,-,+,3,1,2,&,+,+ requests values 2, 3 and 6 on axis 1, value 4 or 5 on axis 2, and values 1 and 2 on axis 3.]

In Figure 4.10, the function of the (-) and (&) query operators is evident from the returned data index codes. The attribute codes for axis 1 in the returned data indices contain all three requested values (2, 3, and 6). This is a requirement, because the query uses the ampersand "AND" operator (&) to specify the query for the first attribute. The same applies for axis 3. The query specifies the minus "OR" operator (-) for axis 2, so only one of the requested values for axis 2 has to appear in a data index. Finally, no query requests are made for axis numbers higher than 3, so any values can appear for higher-numbered axes. These values will not disqualify a data index from being returned to a user in response to a query.

A final difference between item index codes and query codes is the use of the wildcard character. Here, the asterisk (*) denotes a wildcard. In a query, a wildcard signifies that any value of the corresponding attribute will satisfy the search criteria. A search code wildcard will then return an item with any value for that attribute. When a user specifies a wildcard for an attribute in a search, the attribute becomes effectively irrelevant to the search. This allows a user to selectively eliminate certain attributes from becoming search criteria, enabling the user to submit queries based upon only a subset of the data attributes. See Figures 4.11 and 4.12.
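The AND/OR query semantics just described can be sketched directly. This is an illustrative matcher, not the thesis's implementation; for readability it operates on decoded attribute sets rather than raw code strings, and the names `matches` and `query` are assumptions.

```python
def matches(item, query):
    """item: {axis: set of values}. query: {axis: (op, values)} where
    op "&" requires ALL listed values to be present (logical AND) and
    op "-" requires AT LEAST ONE (logical OR). Axes absent from the
    query are unconstrained, as in Figure 4.10."""
    for axis, (op, values) in query.items():
        have = item.get(axis, set())
        if op == "&" and not set(values) <= have:
            return False
        if op == "-" and not set(values) & have:
            return False
    return True

# The Figure 4.10 query: axis 1 needs 2, 3 AND 6; axis 2 needs 4 OR 5;
# axis 3 needs 1 AND 2.
query = {1: ("&", [2, 3, 6]), 2: ("-", [4, 5]), 3: ("&", [1, 2])}
```

An item carrying only values {2, 3} on axis 1 fails the "&" criterion, while an item with {4} on axis 2 passes the "-" criterion, mirroring the returned codes in Figure 4.10.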
[Figure 4.11: Use of the wildcard character (*) as an attribute value]

[Figure 4.12: Result of wildcard search. The query 1,2,&,+,2,*,-,+,3,2,-,+,+ specifies ITEM = Disco Ball, SIZE = any, MATERIAL = Glass; a glass disco ball of any size is returned.]

Following the item description concept from the introduction of this section, queries can be constructed without using wildcards, by specifying only those attributes selected by the user, and removing any attributes without specified values from the query code. The difference between the two strategies appears in Figure 4.13. However, this method eliminates one useful function of the system. A user might want to specify that only data items with certain attributes should be returned, regardless of the values actually specified for the attributes. For example, a user might only be interested in lamps that have values associated with the "size" attribute. In this case, the user would want to eliminate all instances of lamps with no size specified. It is not efficient to build a query specifying every possible value for size. It is more efficient to simply specify the wildcard character as the desired size value. In this case, the system returns all matching items with a specified size, as illustrated in Figure 4.12, and the search rejects all items with no size specified.

[Figure 4.13: Query codes with wildcard (1,2,&,+,2,*,-,+,3,2,-,+,+) and omission (1,2,&,+,3,2,-,+,+) for attribute 2]

A user can implement wildcard search criteria and non-wildcard search criteria concurrently, depending upon the user's wishes. If a user does not care about a lamp's "material" attribute, s/he may simply drop the "material" attribute from the query code. If, in the same search, a user wants to make sure that some size is specified for the lamp, s/he can use the wildcard character for the "size" attribute.
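The distinction between a wildcard and an omitted attribute can be made concrete in code. This sketch assumes the semantics described above (a wildcard requires the attribute to exist with some value; omission ignores it entirely); the matcher and its input format are illustrative, not the thesis's code.

```python
def matches(item, query):
    """item and query map axis -> set of values. A query value of "*"
    requires the item to have SOME value for that axis; omitting an
    axis from the query ignores it completely (Figure 4.13)."""
    for axis, values in query.items():
        have = item.get(axis, set())
        if "*" in values:
            if not have:              # wildcard: any value, but one must exist
                return False
        elif not set(values) & have:
            return False
    return True

sized_lamp   = {1: {3}, 2: {2}}       # lamp (ITEM=3) with a size
unsized_lamp = {1: {3}}               # lamp, size unspecified
```

With the wildcard query `{1: {3}, 2: {"*"}}` only the sized lamp is returned, while the omission query `{1: {3}}` returns both, matching the behavior described above.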
The query will then only return lamps with specified sizes, ignoring lamps with no size specified. The search will entirely ignore each lamp's material. The system will return lamps with the "material" attribute specified, and lamps with no specification at all for the "material" attribute.

Attribute expansion function. A database may grow in many ways. The most frequent growth mode is the addition of new data through the indexing system, but a database may also grow by the addition of new indices. Index expansion involves adding new attributes and/or new values for an attribute. This changes the indexing scheme, and the potential for negative consequences is evident from Chapter 3. Database indices should be designed to accommodate all foreseeable types of data when the database is first created, but there is no guarantee that this will happen. Furthermore, databases may grow beyond the original design vision, making index expansion a requirement. The attribute-based indexing scheme described in this chapter allows for simplified expansion of indexing. The attribute table, introduced earlier in this chapter, contains an indexed entry for each attribute type, and the values corresponding to each type. When a human operator wants to change the indexing scheme, only the attribute tables require the human's attention. Upon update of attributes and/or attribute values, the entire system may be re-indexed in a fully automated fashion, analyzing and indexing each datum according to the updated indexing criteria. Only data affected by the indexing change must be updated, but the system can update the entire data set as an added measure of error reduction. See Figures 4.14 and 4.15.

[Figure 4.14: Alteration of attribute table: a fourth attribute, FLAVOR (1. Sweet, 2. Sour, 3. Bitter), is added to the original three-attribute table]

[Figure 4.15: Update of dataset using altered attribute table]

This index update procedure reduces the degree of effort necessary for re-indexing, by reducing the user's work to simple manipulation of values in a list or table. The remaining data analysis and re-indexing work is suitable for implementation in fully automated fashion, just as the original indexing process can be fully automated once the attributes have been established.

4.3 Attribute learning

This section outlines the motivation for using a learning algorithm. It then discusses the possibility for human-machine participation, and the specific role of humans in the learning process described in this thesis.

4.3.1 Motivation

When data is added to a database, the database increases in size. With the addition of thousands or millions of data points over time, a database can become very large and require significant computer resources to maintain. The number of data types and attributes in the database can also increase, requiring changes to the indexing structure. A small database containing as many as a few thousand entries might be managed by a human operator, but the difficulty of managing databases increases with the database's size. Managing a large data set requires actions beyond the abilities of a human. These actions include checking for errors or new data attributes, expanding database indices to accommodate new data types, and re-indexing data to represent changes to the database. A human working alone will find these tasks impossible to achieve for a large database, because they require individual attention to thousands or millions of data points. The human might require hundreds or thousands of hours to accomplish even one of these maintenance tasks. To thoroughly accomplish them, the human requires computer assistance.
Computers are very good at quickly evaluating and comparing numbers, so the process of recognizing new attributes can be left to a machine. In the case of text databases, a computer can try to recognize new attributes by scanning the text and looking for new words and phrases. To do this, the computer must parse each datum and look for unfamiliar combinations of letters and words. This is a numerical process, so a computer can recognize and store unfamiliar words and phrases very quickly. Computers are not as good at learning without human assistance, however. This problem prevents a computer from scanning data, recognizing new attributes and accurately modifying the database's indexing scheme to accommodate them. Only humans can understand the context surrounding a potential new attribute, and then decide whether it is important or not. The learning strategy described in the next section makes use of the computer's ability to compare and recognize new attributes quickly, while leaving the decision-making process to the human.

4.3.2 Human-machine learning

Large text databases can contain millions of words, some of which are important for categorization. Other words, such as "the" and "a," are typically not important because they do not add any information about the classification of a text datum. During index expansion, the discovery of "new" words in a database can offer insight to a human user. New words may provide hints about the new values an attribute might take, or the new attributes that might be added to the indexing scheme to improve data classification. Scouring a large database for these words can be time-consuming, but the job is easier when the system recalls the words that have already been analyzed. The indexing system presented in this thesis enables simple learning when indexing a semistructured data set. An example of learning is found in the parsing and indexing of textual data entries.
On a periodic basis, the system may scan the database for any words that are not recognized by the system. The goal of the scan is to identify words that are not search terms, and words that are new search terms and should be added to the indexing scheme -- i.e., to the basic table. An ignore file, analogous to the rules file for invalid database index combinations, contains all words that are identified as irrelevant to searching. See Figure 4.16. When the data set scan identifies a word (or more universally, a potential attribute value from a generic data type) that is neither in the ignore file nor in the current attribute table, the system makes note of the word. See Figure 4.17. Upon completion of the scan, the system then makes a "best guess" at the relevance of the word (or potential attribute value), and the attribute type to which the relevant value might belong. The system then presents a list of unknown words to a human user, who can verify or change the system's guesses. See Figure 4.18.

[Figure 4.16: Portion of an ignore file, listed alphabetically: a, about, an, approximate, at]

[Figure 4.17: Recognition of words unknown to system when new data items (e.g., "Approx. 14" Glass Disco Ball" and "12" Ceramic Globe") are scanned]

[Figure 4.18: Presentation of new words to user for assistance in categorization]

All words the user deems relevant and then categorizes are placed in the appropriate location of the attribute table, while the ignore file stores irrelevant words so they will be recognized again later.
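The scanning step can be sketched as a simple word classifier. This is an illustrative sketch of the Figure 4.16-4.18 process, not the thesis's implementation; the word lists and the name `scan` are assumptions, and real multi-word values (like "Disco Ball") would need the phrase matching mentioned at the end of this section.

```python
# Assumed ignore file (Figure 4.16) and flattened attribute-table values.
IGNORE_FILE = {"a", "about", "an", "approximate", "at"}
ATTRIBUTE_VALUES = {"Globe", "Lamp", '10"', '12"', '16"', "Plastic", "Glass"}

def scan(description):
    """Sort each word of a new textual datum into recognized, ignored,
    or unrecognized. Unrecognized words are the ones presented to a
    human for a keep/ignore decision (Figure 4.18)."""
    recognized, ignored, unrecognized = [], [], []
    for word in description.split():
        if word in ATTRIBUTE_VALUES:
            recognized.append(word)
        elif word.lower() in IGNORE_FILE:
            ignored.append(word)
        else:
            unrecognized.append(word)
    return recognized, ignored, unrecognized
```

For the datum 14" Ceramic Globe, "Globe" is recognized while 14" and "Ceramic" are flagged for the human, who might then add them to SIZE and MATERIAL respectively, as Figure 4.19 illustrates.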
The database can then be re-indexed according to the expanded database index, and the system will appropriately categorize all data containing the newly identified attributes and/or attribute values. See Figure 4.19. This process may also search for combinations of words that are frequently repeated, as a data attribute value may consist of a series of words in succession.

[Figure 4.19: Illustration of learning process: new terms are assigned to attributes (14" to SIZE, Ceramic to MATERIAL) or ignored (Approx.), and the attribute table is expanded accordingly]

4.3.3 Precise roles for human intervention in the system

Much of an indexing or learning/re-indexing process can be automated, but some portions of the process require human assistance for best implementation. These portions require delicate decisions, or considerations that are too complex or "fuzzy" for reliable computer modeling. Previous sections of this chapter mention this fact, but the precise role of humans in the system deserves more attention. Some portions of the process that require human assistance appear below.

- Creating the basic framework for a new or converted indexing scheme
- Recognizing new words and phrases when simple or context-sensitive recognition fails
- Decision-making in cases where an existing indexing system is ambiguous or difficult to re-index

When generating a new index for semi-structured data, the attribute table is first constructed by hand. A human assigns unique numbers to attribute types, and to the possible values for each attribute. Since a human generates the attribute table by hand, only the attribute types and values known ahead of time will be included in the basic table.
A learning algorithm, as outlined above, expands this basic table. Human intervention is necessary during this process because the system will frequently be uncertain about certain data features. As an example, a hobby shop may use descriptive text strings to characterize trading card sales data, like the auction titles described earlier. For re-indexing purposes, a parser scours the text string for recognizable words and phrases. Despite sophisticated context-sensitive operations, an automated system may not readily classify some words or phrases as descriptive of only one attribute. An example, illustrated in Figure 4.20, is the name "Donruss," which could be the name of either an athlete or a trading card manufacturer. If the system does not already know the name, and cannot determine what it describes, a human expert must step in and finish the job.

Figure 4.20: Theoretically ambiguous text descriptor and the nature of the ambiguity. For the sale descriptor "1994 Donruss Puckett," the table identifies "1994" as YEAR and "Puckett" as NAME1, but "Donruss" is ambiguous: it could belong to MFR.1 or to NAME2.

A third role for human involvement may occur during re-indexing of a structured data set. While identifying characteristics in the data set's original structure, the automated system may locate attributes that appear similar, but do not satisfy all criteria for inclusion as a single attribute. In this case, the system leaves the structure unaltered, and asks for human assistance after the rest of the re-indexing process is complete. The system may pose all unknown relations to the human operator as a series of individual questions.

Machine learning is a rich research field, and this thesis does not propose to overtake it for the purposes of simplifying database expansion. The human knowledge and decision process is extremely complex and not fully understood.
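The ambiguity check of Figure 4.20 can be sketched as follows. The attribute tables and names here are hypothetical; the point is only that a token matching more than one attribute type is deferred to a human rather than guessed.

```python
# Hypothetical attribute tables; "donruss" deliberately appears under two types.
ATTRIBUTES = {
    "YEAR": {"1993", "1994", "1995"},
    "NAME": {"puckett", "donruss"},   # as an athlete's name
    "MFR": {"topps", "donruss"},      # as a card manufacturer
}

def classify(token):
    """Return every attribute type a token could belong to."""
    return [attr for attr, values in ATTRIBUTES.items() if token.lower() in values]

def parse_descriptor(text):
    """Resolve unambiguous tokens; queue multi-match tokens for a human expert."""
    resolved, ambiguous = {}, []
    for token in text.split():
        matches = classify(token)
        if len(matches) == 1:
            resolved[token] = matches[0]
        elif len(matches) > 1:
            ambiguous.append((token, matches))  # defer to human review
    return resolved, ambiguous
```

On the Figure 4.20 descriptor, "1994" and "Puckett" resolve uniquely, while "Donruss" is flagged with both candidate attribute types.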
There is no current model that fully emulates the human thought process, and until such a model can be created, human assistance will be necessary for complex and "fuzzy" decision-making processes (Duda and Hart, 1973). For this reason, the re-indexing algorithms in this project do not try to expand the envelope of machine learning technology. Further work in the realm of cognitive machines should supply projects like this one with better tools for unsupervised decisions.

4.4 Conclusion

The attribute-based system described in this chapter serves several purposes, summarized here.

- The system can generate indices for semi-structured data sets from arbitrary sources, as long as the data set offers some means for extracting feature information.
- The system can convert a hierarchical tree, popular in PCDB applications, into an attribute-based hyperspace by replacing branched structures with the attribute-based index code and a rules file.
- The system indexes a datum only by attributes the datum is seen to contain. This reduces the complexity of the indexing codes and improves search efficiency.
- Index coding uses a symbolic lexicon to represent attributes, their values, limitations on these values, and other features to a machine.
- Submitted codes are checked against a rules file for validity during searching and data entry tasks.
- The system optionally creates queries using only those attributes specified by a user, improving search efficiency by ignoring attributes irrelevant to the search. This is not possible in a hierarchical system.
- The indexing structure can be expanded and re-indexed in a semi-automated fashion with human assistance.
- The hyperspace representation places similar data points in close proximity, enabling the use of better data analysis tools and strategies for missing data.

The benefits of these features become clearer in Chapter 5.
Representing data with this index structure allows the user to perform analysis tasks in the proper data environment. Without the benefits afforded by this representation, the tasks described in Chapter 5 would be extremely difficult, if not impossible, to achieve.

5 Strategies for Missing Data

When a user queries a PCDB, there is no guarantee that data will be returned. A query might point to data indices that contain no data, resulting in a null response. Depending on the indexing structure used, different strategies are available for managing null responses. The industry survey in Chapter 2 revealed several methods for finding surrogate data when a query returns no data. None of the popular methods from the survey involve further use of the PCDB, despite the fact that a PCDB can provide further information when a query fails to return data. It is clear from this result that industry is not using the data to its full potential.

This chapter builds prediction functionality onto the data indexing structure from Chapter 4. The subject of this chapter is a formal, data-driven process for making predictions when a query results in a null response. The prediction process takes three basic steps:

1) The procedure looks for populated data points near the unpopulated data point.
2) Linear regressions parallel to each hyperspace axis make predictions for the unknown value(s).
3) A weighting system assigns a weight to each axis based upon its regression error, and a weighted sum provides the final prediction.

5.1 Justification of methods

This thesis uses regression analysis to generate surrogate data for unpopulated data indices. When a queried data index is unpopulated, each chosen attribute axis is analyzed with regression techniques. For each attribute axis, the method predicts a value for the unknown data. The results from each axis contribute to a weighted average that predicts a "best-guess" value for the unknown data.
The remainder of this chapter describes the details of this procedure. This section begins by describing a previously proposed index-based similarity measure and noting its shortcomings. The section then outlines the motivations for using a data-driven similarity measure. The last two topics of this section discuss a hybrid regression algorithm that can predict surrogate data values for unpopulated database indices.

5.1.1 Index-based similarity

A similarity measure based on indexing structure was proposed by Tata (1999). This measure determines similarity by performing a Z value test on data indices, rather than by comparing the data stored under the indices. For two data indices, the Z value calculation is:

Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\sigma_1^2/m + \sigma_2^2/n}}    (1)

Here, \bar{x}_1 and \bar{x}_2 represent the mean of each index code's digits, \sigma_1 and \sigma_2 represent the standard deviation of each index code's digits, and m and n represent the number of digits in each of the two index codes. More similar data indices will have a value of Z closer to zero; less similar indices will have a value of Z farther from zero. This value, which can be positive or negative, identifies the data point(s) with indices most similar to the unpopulated index. By comparing Z values of the data indices instead of the data, this method avoids the missing data problem. Confidence intervals for the method are available in statistical distribution charts. For example, a Z value with an absolute value less than 1.96 corresponds to a 95% confidence interval. See Figure 5.1.

Figure 5.1: Z value calculation for two index codes. For Code 1 = (1, 4, 3, 6, 4) and Code 2 = (1, 2, 3, 5, 4): \bar{x}_1 = 3.6, \sigma_1 = 1.8166, \bar{x}_2 = 3, \sigma_2 = 1.5811, and Z = 0.557.

This method is problematic. It gives equal weight to each digit in the index, which does not necessarily reflect the true nature of the data. Z value testing treats an index code's digits as a series of observations of identical importance.
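As a concrete check, Equation (1) and the Figure 5.1 example can be reproduced directly. This is a sketch using Python's sample standard deviation; the function name is chosen here, not taken from the thesis.

```python
from statistics import mean, stdev

def z_value(code1, code2):
    """Z value test on two index codes (Equation 1), using sample std. deviations."""
    m, n = len(code1), len(code2)
    return (mean(code1) - mean(code2)) / (
        (stdev(code1) ** 2 / m + stdev(code2) ** 2 / n) ** 0.5
    )

# The two index codes of Figure 5.1:
z = z_value([1, 4, 3, 6, 4], [1, 2, 3, 5, 4])
# z ≈ 0.557, matching the figure
```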
But data values may vary more (or less) strongly when a particular index digit is changed than when some other index digit is changed. Each digit in a data index represents a choice from among several options. Some of these choices may have greater or lesser impact upon the resulting data values than other choices. For example, the choice between 6061 aluminum and 5052 aluminum may have greater or lesser impact upon a process tolerance than the choice between cutting and finishing processes. Yet each of these choices may appear as a single digit in a data index, and will be treated as equally important by the Z value test. As a result, the determination of similarity is biased in favor of index digits of less importance to the resulting data value.

Furthermore, the Z value method assumes the most similar point within an acceptable confidence interval will be a good surrogate for the missing data. This is not a safe assumption, despite confidence testing. The confidence interval calculation is based upon similarity in the indices, not similarity in the data. One may expect similar indices to contain more similar data than dissimilar indices, but there is no guarantee that two individual data points with similar indices will contain correspondingly similar data.

5.1.2 Data-based similarity

To address missing data, one option involves finding data that appears to be a good surrogate for an unknown value. This data would be used in place of the missing data as a best-guess value. The process involves finding the most similar data, and then determining whether it is acceptably similar. Similarity measures can compare multiple data records or sets, and return a measure of similarity between them. Unfortunately, the missing data problem prevents this strategy from working properly: these methods require knowledge of the data before similarity can be measured. Without the data, they cannot function.
Figure 5.2 illustrates an example of how data-driven similarity measures can fail. A user has queried data point X, only to find it contains no data. Points Y and Z are close to X, and each may be similar enough to X to serve as surrogate data. A similarity method may try to compare the means and standard deviations of the three data points to determine which point is most similar to X. But the lack of data at point X disrupts the method. Without knowledge of the mean and standard deviation at data point X, the method cannot determine the similarity between X and Y or between X and Z. More generally, when only partial data is available, calculations requiring all data will fail. To address the missing data problem, a different approach is necessary.

Figure 5.2: Failure of data-based similarity measures when data is missing. The strategy is to compare the data in X with the data in Y and Z to determine which point is most similar to X; but there is no data in X, so no comparison can be made, and the method fails.

5.1.3 Regression as a prediction system

The missing data problem causes considerable difficulties in finding surrogate data. Data-based similarity measures cannot operate when some data is missing. Index-based similarity measures cannot account for some important data properties, because those properties are not evident from the indices alone. The failures of data-based measures and index-based measures do not overlap, however: data-based measures fail for lack of data, while index-based measures fail by inadequately considering the data. This leaves an opportunity for a hybrid approach that uses indices as well as available data.

Least-squares regression analysis uses data indices and existing data to make predictions about unknown data values. This satisfies the need for a system that does not require a full set of data, while reducing the assumptions made about the data. The regression process fits data with a mathematical equation describing data trends.
In the simple case of two dimensions, this fit is a line or curve. In higher dimensions, the fit becomes a surface, volume or hypervolume. Regression also differs from the similarity measures described above because it does not attempt to substitute a value from another index for the unknown data. Instead, regression estimates a new value for the unknown data by continuing trends found in nearby data. The regression function manages estimation error by using the trendline that minimizes sum-of-squares error (Hogg and Ledolter, 1992). See Figure 5.3.

Figure 5.3: Linear regression analysis in one dimension. A regression line fit to the populated points along an axis yields an estimate at the unpopulated point.

One difficulty with regression is known as the "curse of dimensionality." Each added dimension of data increases the amount of data to be analyzed and the processor time required to understand data features. Simultaneous multidimensional regression procedures do exist (Hogg and Ledolter, 1992), but the calculations can become unwieldy when analysis procedures require thousands of points. This difficulty is addressed by reducing the regression to a combination of single-dimension regressions requiring fewer points. Regression occurs along each attribute dimension, and these results combine to yield a final data estimate.

Linear regressions are used to simplify the calculations, and because the number of available points for regression may be too low to rely upon higher-dimensional fits. Since this project only uses linear regressions, the system assumes local linearity in the region of the unpopulated point. This is not guaranteed to be accurate, because data trends can take nonlinear forms that a polynomial regression would fit more accurately. For this reason, additional work with higher-order regressions would be beneficial to this project. The intelligent use of higher-dimensional regressions is a subject left for further investigation.
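The single-axis fit-and-estimate step can be sketched as follows, assuming the translated coordinates of Section 5.2 so that the unpopulated point sits at x = 0. The data values here are invented for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for points (xs, ys)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    beta = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
           sum((x - xbar) ** 2 for x in xs)
    alpha = ybar - beta * xbar
    return alpha, beta

# Populated points along one translated axis; the unpopulated point is at x = 0.
xs, ys = [-3, -1, 1, 2], [2.0, 6.0, 10.0, 12.0]
alpha, beta = fit_line(xs, ys)
estimate = alpha + beta * 0   # prediction at the unpopulated point
```

For these (perfectly linear) illustrative values the fit is y = 8 + 2x, so the estimate at the unpopulated point is 8.0; real data would carry residual error, which Section 5.4 uses for weighting.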
An additional consideration is that not all attributes readily conform to representation as coordinates on an axis. Some attributes, such as baseball player names, consist of text strings with no numerical representation. Attribute values of this type must simply be assigned integer values as "placeholders" on an axis. When describing baseball cards, the name "Mark McGwire" might be assigned to location 23 on an axis, but there is no significance to this assignment. The association between name and number only serves as an indexing method that gives each attribute value a unique location on its axis.

Some attribute axes that use coordinates as placeholders will not demonstrate strong trends, because the ordering of values on these axes does not create any strong data patterns. As a remedy, one might order player names on an axis from lowest average price to highest. But these price values change frequently, so the indexing system would be in constant flux. An attribute with no strong data trends stands in sharp contrast to other attributes, such as card quality. Increases in card quality will always strongly correlate with increases in card prices, making it obvious that attribute values should be ordered from lowest to highest, or vice versa. An axis with high correlation will exhibit a strong, predictable data trend that leads to good predictions of unknown values. But when no trends are evident, as in the case of player names, the axis will not be of much use to the system. Fortunately, the system can recognize this fact and respond by ignoring or downplaying the results of regression on a trend-poor axis. The weighting system described in Section 5.1.4 selectively assigns more influence to stronger trends, while reducing the influence of weaker trends.

5.1.4 Processes for combining regression results

Some attributes are likely to be more sensitive data drivers than others. This sensitivity appears as the slope of a regression line.
Sensitivity is not the only important feature of regression, however. The goodness of the regression's fit to the data is also very important. When several regressions are combined into a single estimate, the better-fit regressions should have more influence over the combination than the less well-fit regressions. See Figure 5.4. The result should then be a more accurate estimate than if all regressions had equal influence over the combination.

Figure 5.4: Illustration of lower and higher regression errors

There is no direct way to determine the goodness of fit of a regression line relative to an unknown data value. As a substitute, the system uses the goodness of fit of the regression line to all of the known data points. This approach assumes that the goodness of fit to known data points provides information about the goodness of fit to the unknown data point. In other words, regressions that do a better job of fitting known data points are likely to do a better job of fitting nearby unknown data points. The system uses the mean squared error of a regression line to determine its influence over the final result: the lower the mean squared error of the regression, the greater its influence.

Regression techniques are well documented. However, the author does not know of any previous application in which regression error on an axis is used as a surrogate for estimation accuracy. Furthermore, the author does not know of an application that generates a single estimated value in multidimensional space by combining many single-axis regressions in inverse proportion to their respective regression errors. This procedure is presented as an original contribution for managing unpopulated data indices. Implementation details appear in subsequent sections of this chapter.
5.2 Preparation for regression

Data preparation involves looking for data points close to an unpopulated data point, setting the stage for regression analysis. The preparation stage finds data points on a translated coordinate system centered at the unpopulated data point. See Figure 5.5.

Figure 5.5: Translation of coordinate axes to the unpopulated data point

The process only looks for data points on the translated coordinate system axes. This reduces the number of regressions performed for an unpopulated data point. Using on-axis points allows the process to see the effects of varying one attribute at a time, making it easier to compute a weighted average later. This does eliminate the off-axis points from contributing to the regression. Including these points in the linear regression analysis would be difficult, requiring more processing time and an unmanageable number of regressions to account for all off-axis points. The system must return an estimate quickly, so the process trades off analysis of every local point in favor of speed.

The data point search process finds points on one data axis at a time, by varying one digit in the unpopulated data point's index code. The process finds and stores the mean and standard deviation of each discovered point. When all populated points on an axis are found and their values are stored, the axis is ready for regression analysis. See Figure 5.6.

Figure 5.6: Location of populated data points on the translated x' axis, e.g. at (-3,0,0), (-1,0,0), (1,0,0) and (2,0,0) around the unpopulated point (0,0,0)

5.3 Regression process

The final prediction for the unknown data value has the form

Y = \sum_{k=1}^{S} W_k \hat{Y}_k    (2)

where Y is the final estimate of the unknown value, \hat{Y}_k is an estimate of the unknown value based upon a single-axis linear regression along axis k, and S is the number of axes used. Each single-axis prediction is weighted with the corresponding weight W_k.
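The on-axis search of Section 5.2 -- varying one digit of the index code at a time -- can be sketched as below, before turning to the per-axis fit. The dictionary-style database and the function name are hypothetical stand-ins for the prototype's storage.

```python
# Sketch: locate populated points on one translated axis by varying a single
# digit of the unpopulated point's index code. The database here is a
# hypothetical dict mapping index tuples to stored summary values.

def on_axis_points(database, index, axis, digit_range):
    """Return (translated offset, value) pairs for populated points on one axis."""
    points = []
    for v in digit_range:
        if v == index[axis]:
            continue                      # skip the unpopulated point itself
        probe = index[:axis] + (v,) + index[axis + 1:]
        if probe in database:
            # Store the coordinate relative to the unpopulated point (Figure 5.6).
            points.append((v - index[axis], database[probe]))
    return points

db = {(1, 2, 3): 5.0, (4, 2, 3): 8.0, (2, 7, 3): 6.5}
# Vary digit 0 of the unpopulated index (2, 2, 3):
points = on_axis_points(db, (2, 2, 3), 0, range(0, 10))
```

Only the first two database entries lie on the varied axis, so the search returns them at translated offsets -1 and +2; the off-axis entry (2, 7, 3) is ignored, as Section 5.2 requires.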
The linear regression for each attribute axis has the final form

\hat{Y}_{s,i} = \alpha_s + \beta_s x_i    (3)

Y_{s,i} = \hat{Y}_{s,i} + e_{s,i}    (4)

\hat{Y}_{s,i} is the predicted value of Y at location i on axis s, and Y_{s,i} is the actual value at that point. The value \beta_s is the slope of the line that minimizes the sum-of-squares error between the line and the data points used to create it, and \alpha_s is the y-intercept of the line. When the axis location of the unpopulated data point (the origin of the translated axes, x = 0) is substituted for x_i, \hat{Y}_{s,i} = \hat{Y}_{s,0} = \hat{Y}_s is the desired prediction for the unknown value. This prediction is based only upon the axis used to generate the regression, so a different prediction \hat{Y}_s results for each value of s (one for each axis used in the regression). At any data point, the error term e_{s,i} is the difference between the predicted value \hat{Y}_{s,i} and the actual value Y_{s,i} at that point. Its mean is zero, and its value is different for each value of x_i (each i) along each axis s. The values of \alpha_s and \beta_s are calculated as follows:

\bar{Y}_s = \frac{1}{n_s} \sum_{i=1}^{n_s} Y_{s,i}    (5)

\bar{x}_s = \frac{1}{n_s} \sum_{i=1}^{n_s} x_{s,i}    (6)

\beta_s = \frac{\sum_{i=1}^{n_s} (x_{s,i} - \bar{x}_s)(Y_{s,i} - \bar{Y}_s)}{\sum_{i=1}^{n_s} (x_{s,i} - \bar{x}_s)^2}    (7)

\alpha_s = \bar{Y}_s - \beta_s \bar{x}_s    (8)

The value n_s is the number of data points used for regression along axis s. See Figure 5.7.

Figure 5.7: Axis for which n_s = 4, with populated points at (-3,0,0), (-1,0,0), (1,0,0) and (2,0,0) around the unpopulated point (0,0,0)

The sum-of-squares error E_s for axis s is calculated by summing the squares of each error value e_{s,i} along the axis:

e_{s,i} = Y_{s,i} - \hat{Y}_{s,i}    (9)

E_s = \sum_{i=1}^{n_s} e_{s,i}^2    (10)

5.4 Regression combination

The weight W_s for each axis is inversely proportional to \bar{E}_s, the mean sum-of-squares error over all points on the axis. An axis with lower mean sum-of-squares error is assigned a higher weight due to its more consistent, lower-error trend. An axis with a very weak trend in data values will not contribute much to the final estimate, minimizing its impact upon the result.
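The complete estimate -- per-axis fits combined by weights inversely proportional to mean regression error, as detailed in Section 5.4 -- can be sketched as follows. The data values and the small epsilon guard (protecting against division by zero on a perfectly fit axis) are illustrative assumptions, not part of the thesis procedure.

```python
def axis_estimate(xs, ys, x0=0.0):
    """Fit one axis by least squares; return the prediction at x0 and the mean SSE."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    beta = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
           sum((x - xbar) ** 2 for x in xs)
    alpha = ybar - beta * xbar
    errors = [y - (alpha + beta * x) for x, y in zip(xs, ys)]
    mean_sse = sum(e * e for e in errors) / n
    return alpha + beta * x0, mean_sse

def combined_estimate(axes):
    """Weighted sum of per-axis predictions; weights ~ 1 / mean SSE, summing to 1."""
    eps = 1e-12                                  # guard against a perfect fit
    results = [axis_estimate(xs, ys) for xs, ys in axes]
    inv = [1.0 / (mse + eps) for _, mse in results]
    total = sum(inv)
    return sum((w / total) * y_hat for (y_hat, _), w in zip(results, inv))

# Two axes through the same unpopulated point at the translated origin:
axes = [([-3, -1, 1, 2], [2.0, 6.0, 10.0, 12.0]),   # clean trend, low error
        ([-2, -1, 1, 3], [9.0, 4.0, 12.0, 8.0])]    # noisy trend, high error
estimate = combined_estimate(axes)
```

With these illustrative values, the clean axis fits with essentially zero error and therefore dominates the weighted average, so the combined estimate lands very close to that axis's prediction of 8.0, which is exactly the behavior Section 5.4 intends.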
The mean sum-of-squares error calculation is

\bar{E}_s = \frac{1}{n_s} \sum_{i=1}^{n_s} e_{s,i}^2    (11)

The weight calculation is then

W_s = \frac{1/\bar{E}_s}{\sum_{k=1}^{S} 1/\bar{E}_k}    (12)

The larger the mean sum-of-squares error is for an axis, the smaller the resulting weight for that axis will be. The calculation satisfies

\sum_{k=1}^{S} W_k = 1    (13)

Without further need for normalization, each weight W_s multiplies each prediction \hat{Y}_s, and the sum of these products is the final prediction Y (see Equation 2).

5.5 Conclusion

This chapter addresses the need for an alternate surrogate data strategy. The shortcomings of data-based and index-based similarity measures were discussed. This chapter presented a method for data estimation, which creates several one-dimensional linear regressions and then combines them in a weighted average for a final estimate of an unknown data value.

6 Demonstration of the Technology

This chapter highlights important features of the prototype software. These highlights include details of the Internet-based architecture and interface, information flow through the system, data output, and surrogate data generation. This chapter also discusses prototype database administration software, which runs locally on the server.

6.1 System architecture

This section describes information flow between client and server, and then details the information processes within the server.

6.1.1 Client-server interaction

The system operates on a World Wide Web platform, so its accessibility is as universal as possible. To access the prototype software, a user only needs a browser that supports HTML tables and basic forms. The user can connect to the server with any type of computer in any Internet-enabled location, as long as the computer supports graphical display of HTML and images. The client machine does not need any additional software or special configuration to access the system. See Figure 6.1. By sending an HTTP request to the database server, the user becomes a Web client of the server machine.
Figure 6.1: Client/server system configuration. The client at the client location connects to the server at the server location, which communicates with the database engine and database(s).

Because the client computer receives no specialized scripts or applets to execute, all programming details remain confidential on the server machine. Concealment is important because it keeps the data and data access procedures hidden behind the server's security measures. User contact with the server takes place only through the browser's forms, improving data security and simplifying information transfer between the client and server. Figure 6.2 illustrates data flow across the distributed system.

Figure 6.2: Data flow across the client/server system. The client sends an HTTP request (URL and form data) to the server; the server returns an HTTP reply (HTML and graphics).

6.1.2 Intra-server processes

When the server receives a request from a client, its software component parses the request to determine how it should respond. Response tasks include analyzing form data from the client, accessing database records, generating graphics and HTML, and sending the graphics and HTML back to the client. All data access, calculation and output formatting tasks occur in the server's local environment. When tasks require database connections, the software component sends Structured Query Language (SQL) requests through a database engine. The engine analyzes the request and returns the appropriate data to the prototype software. This data consists of database records and summary statistics like mean and standard deviation. The software reduces the returned data by creating summary statistics and chart graphics, and this information is returned to the client. See Figure 6.3.
Figure 6.3: Data flow within the server location. The server sends SQL requests for records and statistics to the database engine; the engine issues data retrieval instructions to the database(s) and replies with records and statistics.

6.2 Server software

This section begins by describing the connection process between client and server. Each screen is then discussed in detail, and the section concludes with general comments on the user experience.

6.2.1 Connection process and session management

When a user first sends an HTTP request to the server, his/her machine receives a special session thread, or memory space, on the server. This thread isolates the user's actions from the actions of other connected clients. Within this space, the user's actions and data remain in memory between data requests, instead of being lost (as they would be in a simple web server). The system manages data this way because it is an attractive alternative to other options, like passing data through the browser's URL field or using cookies. Passing information through the URL field would restrict the system to only a few kilobytes of transferred data, and not all web browsers support cookies. The session thread method allows large amounts of data to be retained for the user with maximum flexibility.

Since the data is persistent between data requests, the system can perform several analyses without refreshing data between each user request. This feature improves system performance and allows the system to perform more complex analyses by keeping all data on hand. When a user wants to reset his/her search data, s/he can submit a new search. This action destroys the old session thread and creates a new one, resetting the search data. If a user stops requesting data for several minutes, the server will automatically destroy the session thread, assuming the user has disconnected.
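The session-thread lifecycle described above can be sketched as a per-client keyed store with an idle timeout. This is a schematic sketch only; the prototype's actual server-side mechanism may differ, and the class and constant names are invented here.

```python
import time

TIMEOUT_SECONDS = 300   # destroy a session after several idle minutes

class SessionStore:
    """Illustrative per-client session data with persistence, reset, and expiry."""

    def __init__(self):
        self._sessions = {}          # client id -> (search data, last access time)

    def get(self, client_id):
        """Return the client's session data, creating a fresh session if needed."""
        self._expire_idle()
        data, _ = self._sessions.get(client_id, ({}, None))
        self._sessions[client_id] = (data, time.time())
        return data

    def reset(self, client_id):
        """A new search destroys the old session thread and creates a new one."""
        self._sessions[client_id] = ({}, time.time())
        return self._sessions[client_id][0]

    def _expire_idle(self):
        """Destroy sessions whose owners have stopped requesting data."""
        now = time.time()
        for cid, (_, last) in list(self._sessions.items()):
            if last is not None and now - last > TIMEOUT_SECONDS:
                del self._sessions[cid]

store = SessionStore()
session = store.get("client-a")
session["query"] = {"name": "Mark McGwire"}
again = store.get("client-a")    # data persists between requests...
fresh = store.reset("client-a")  # ...but a new search resets it
```

Keying the store by client id also illustrates the isolation property described next: one client's data is never handed to another client's requests.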
At no time can one client machine access another client's session thread; each thread is accessible only to the machine that originally requested the connection.

6.2.2 Screens

This subsection describes the various screens a user encounters during a search session.

Title screen. The first data a user receives is an HTML page containing instructional materials and other information. This screen contains basic usage instructions for novice users. At an administrator's discretion, the page may also contain important information regarding system changes or updates. This title page guarantees that each user has the opportunity to review important information before accessing the search functions. See Figure 6.4.

Figure 6.4: Title page viewed on the client machine

Search screen. When the user navigates past the title page, s/he sees the search screen next. This screen contains a set of pull-down selection menus, with each menu corresponding to an attribute of the data. The user specifies desired values for some or all attributes by making selections from the menus. See Figure 6.5. The screen reminds the user that multiple items can be selected from a single list by holding down the CONTROL key while clicking each desired item, and that a range of items can be selected by holding down the SHIFT key while clicking the first and last items.
Figure 6.5: Menu-driven search screen, with pull-down menus for Name, Year, Manufacturer, Type, PSA Rating, Quality and Other Descriptor

The user can make as many or as few selections as s/he likes from each menu. By default, each menu contains a wildcard selection. If the user does not make a selection for a particular menu, the wildcard will remain selected for that menu, and the corresponding attribute will not be a search criterion.

Figure 6.5 illustrates the search screen where a user has specified the Name Mark McGwire, the Year 1999, the Manufacturer Topps, and a PSA Rating of 9. The user has left the Quality, Type, and Other Descriptor menus in their default (wildcard) state, so the system will not use them as search criteria. The user is asking for records with any value(s) for the Quality, Type and Other Descriptor criteria, as long as the returned records satisfy the Name, Year, Manufacturer and PSA Rating criteria.

Error screen. After the user submits search criteria, the prototype software consults the rules file to find out whether the criteria point to a valid index. The prototype software returns an error screen if the rules file suggests the criteria are invalid or point only to unpopulated data indices. The user is given the option to specify a new search or to ask the system for price estimates for the specified criteria. Figure 6.6 illustrates this error screen.
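The wildcard behavior can be sketched as query construction that includes only the non-wildcard menus in the WHERE clause. This is a sketch under stated assumptions: the table and column names are hypothetical, and the prototype's actual SQL is not shown in this chapter.

```python
# Sketch: build a parameterized SQL query from only the non-wildcard
# menu selections (hypothetical table and column names).

WILDCARD = "*"

def build_query(criteria):
    """criteria maps each attribute to a list of selected values, or WILDCARD."""
    clauses, params = [], []
    for attribute, selected in criteria.items():
        if selected == WILDCARD:
            continue                      # wildcard: not a search criterion
        placeholders = ", ".join("?" for _ in selected)
        clauses.append(f"{attribute} IN ({placeholders})")
        params.extend(selected)
    sql = "SELECT price FROM card_sales"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params

sql, params = build_query({
    "name": ["Mark McGwire"],
    "year": [1999],
    "manufacturer": ["Topps"],
    "psa_rating": [9],
    "quality": WILDCARD,                  # menu left in its default state
})
```

The wildcard menus simply never appear in the generated WHERE clause, which matches the screen's behavior: unselected attributes place no constraint on the returned records.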
Figure 6.6: Error screen. The screen restates the user's search criteria (e.g. Name: Glenn Hubbard; Year: 1951; Manufacturer: O-Pee-Chee; PSA Rating: PSA 8; remaining attributes unspecified), explains that the search specified invalid index combinations or pointed only to unpopulated database regions, and offers buttons to return to the search page or to estimate the price using similar cards.

Results screen. If the rules file indicates the search criteria are valid (Section 4.2.1), the system passes a query to the database engine. The system analyzes the returned records for mean and standard deviation, and returns charts summarizing the data. In addition to mean and standard deviation, the system optionally returns the title of each card sale for the user's review. By default, this option is turned off because it returns very large volumes of text data to the user when many records satisfy a search.

The system can implement two types of charts. The first type is the standard graphic chart found in applications like Microsoft Excel. The system generates this chart and returns it to the client's browser as a standard graphic file. This chart can be complicated to implement and takes several seconds to download. The second type of chart is coded directly into the HTML page, and uses HTML tables to generate the various rows and columns of the chart. Figure 6.7 illustrates this type of chart. HTML table charts require less time to download and are simple to implement, but they offer little visual flexibility because HTML tables constrain the system to bar charts.
Figure 6.7: HTML table of card price frequencies (for the example search: mean price $19.05, standard deviation $20.23, with 18 records binned in $5.00 price increments)

The prototype software uses only HTML charts to display search data. The Software Enhancements section of this chapter discusses prospects for using more complex charting methods.

Price estimation input screen. Regardless of search results, the prototype software always gives a user the option to estimate price information for his/her specified search criteria. Estimating a price involves three basic steps. First, the software notes which data indices the user requested. These indices may or may not contain price data. Then, the software tries to find data indices that seem similar to the requested indices. These similar indices must contain price data; any indices containing no price data are ignored. Using these populated similar indices, the software attempts to estimate values for the price data in the originally requested indices. Locating similar indices and estimating data values are discussed in detail in Chapter 5.

If the user decides to generate an estimate, the system presents a set of radio buttons corresponding to the axes used for the search. The prototype software relies on user input to decide which axes will be used to create data estimates.
With the radio buttons, the user can indicate which axes s/he wants to use to generate the price estimate. Figure 6.8 illustrates this screen. The user typically cannot choose among all possible attributes in preparation for price estimation. The software only presents the attributes for which the user made non-wildcard selections during the search. This limitation must occur because the regression operations described in Chapter 5 rely on the user's selection of values for an attribute axis. An axis taking a wildcard value represents non-selection, aborting the estimation process for that axis. Figure 6.8 illustrates the estimation-input screen where the user originally specified values for the Name, Year, Manufacturer, and PSA Rating attributes. Among these attributes, the user has chosen to use the Year, Manufacturer, and PSA Rating attributes to generate pricing estimates. When the user submits this information, the software automatically performs regression operations on the three selected axes and returns a weighted average of the results.
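The final combination step can be sketched as follows. This Python sketch is illustrative only: the per-axis estimates are invented example values, and the uniform weighting is an assumption — the actual regression and weighting procedures are those defined in Chapter 5.

```python
# Sketch of combining per-axis regression estimates into one value.
# Uniform weights are an assumption for illustration; the prototype's
# weighting follows the methods of Chapter 5.
def combine_estimates(axis_estimates, weights=None):
    axes = list(axis_estimates)
    if weights is None:
        weights = {a: 1.0 for a in axes}   # assumed equal weighting
    total = sum(weights[a] for a in axes)
    return sum(axis_estimates[a] * weights[a] for a in axes) / total

# Hypothetical per-axis estimates for the Figure 6.8 example:
estimate = combine_estimates({"Year": 18.0,
                              "Manufacturer": 22.0,
                              "PSA Rating": 20.0})
```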
Figure 6.8: Selection screen for estimation function (for each attribute the user searched on — Name, Year, Manufacturer, and PSA Rating — a pair of radio buttons offers "Don't change it" or "Allow it to change")

Price estimation results screen. The price estimation procedure generates a lot of data. To present this data clearly, the prototype software generates an HTML page that presents final estimates for mean and standard deviation, followed by axis-specific information. Figure 6.9 illustrates this screen. If an attribute provides too few data points to support regression activities, the user is notified that the attribute could not be used for estimation. In Figure 6.9, the system has discarded the Year and PSA Rating attributes for this reason.

Figure 6.9: Estimate results screen (predicted mean price $359.62 and standard deviation $185.15, based entirely on the Manufacturer category; the Year and PSA categories were not used because too few populated value bins were found)

6.2.3 Comments on user experience

The server software performs several disjoint functions. These functions include the data retrieval, analysis and output tasks described above.
The prototype server software represents an attempt to integrate these functions without presenting a disjointed user experience. The simple style and arrangement of the HTML input and output pages are intended to reduce the details of a complicated operation to a level that does not overwhelm users with information. The server software balances this need for simplicity with the requirement that all useful data be returned to the user. Section 6.5 discusses some additional methods for achieving the information/simplicity balance for different types of users.

6.3 Administrative software

The previous section discusses the operation of the software when a user submits queries. But user-focused operations are not the only tasks the software must perform. The software's ability to process and serve data to a user depends on proper maintenance. There are many kinds of changes that force database administration. Over time, users' needs can change and new users may begin to make different demands of the database system. Time-sensitive data requires regular updates, and the types of stored data can change. New data attributes may be added to a database's indexing system, requiring changes in the way the indexing scheme organizes data. When any of these changes occurs, the system must allow an administrator to make fundamental changes to the database and indexing scheme. Otherwise, the system will never be able to grow and evolve with its users and their data needs. The prototype administrative software fulfills this need for adaptation by providing functions to keep the system "healthy." Administrators must occasionally perform various tasks to keep the database up to date.
These tasks include:

- Attribute learning, the discovery of new data features for data indexing (Section 4.3)
- Index expansion, assigning these new features to the indexing scheme (Section 4.2.2)
- Rules file updates, adding new valid (or invalid) index combinations (Section 4.2.1)

With the exception of some human assistance in attribute learning, the tasks can run on a schedule in the absence of human supervision. The prototype software performs these tasks in the machine's local environment. There is no way to access these tasks remotely. Restriction to the local environment is intentional, as it keeps all sensitive data-related tasks safely confined within the server environment. This section describes features of the locally executed administrative software.

6.3.1 Search function

The first function of the prototype software is a database search, very similar to searches the server software performs. An administrator may want to run a local search while running the administrative software. The search function can verify changes to database records, the indexing scheme, and the rules file, so it is an important component of the administrative software. The administrator's search function is limited in comparison with search functions available for remote clients, because the local administrator should only use it to verify administrative tasks. The administrative search environment is similar to the remote client's search environment. The major differences are listed below.

- No similarity search functionality
- All records listed individually, to assist with administrative tasks
- Environment is based on forms, not browser

Figure 6.10 illustrates the default window for administrative software. The administrator selects search criteria from menus at the top of the screen and then submits the search. The software returns individual records and allows the administrator to scroll through the records, fifteen at a time.
Figure 6.10: Default window for administrative software

6.3.2 Attribute learning and index expansion functions

The administrative software provides a step-by-step set of tools for finding new data features and re-indexing the database. This tool set is divided into several functions so the administrator can verify the completion of each action before the next one takes place, and repeat an action in the event of an error. The tool set functions are located at the bottom of the default window in Figure 6.10.

Following the methods of Section 4.2.2, the tool set offers a learning algorithm that scours the database for unrecognized words. See Figures 4.17 and 4.18. The software does not have the sophistication to make accurate guesses about the meanings of unknown words, so each new word appears individually in an option window. The human administrator then has the opportunity to review each new word, and either ignore it or add it to the indexing scheme. This option window appears in Figure 6.11.

Figure 6.11: Attribute learning option window (each unknown word can be assigned to an existing attribute, added to the ignore file, or retained as unknown for later review)

Each new word from the database appears on a separate line at the left side of the form. The user can select the word's fate using a pull-down menu. Each word either becomes a value for an existing attribute (such as Name or Manufacturer), moves into the ignore file, or remains in its unknown state for later review. If a word moves into the ignore file, the software will never present that word to the user again.
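The scanning half of this process can be sketched as follows. This Python sketch is illustrative only — the titles, known values, and ignore entries are invented examples — and the human classification step it feeds is the one described above.

```python
# Sketch of the attribute-learning scan (Section 4.3): collect words from
# record titles that appear neither among known attribute values nor in
# the ignore file, so an administrator can classify them by hand.
def find_unknown_words(titles, known_values, ignore_file):
    unknown = set()
    for title in titles:
        for word in title.lower().split():
            if word not in known_values and word not in ignore_file:
                unknown.add(word)
    return sorted(unknown)

# Hypothetical example data:
unknown = find_unknown_words(
    ["1999 topps refractor", "1989 donruss rookie"],
    known_values={"topps", "donruss", "rookie", "1999", "1989"},
    ignore_file={"the"},
)
```

Each word returned here would appear on its own line of the Figure 6.11 option window for the administrator to classify.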
If a word becomes a new attribute value, the change becomes a permanent part of the indexing system when data re-indexing occurs later.

Once the learning process is complete and the indexing scheme has received new attribute values, the system needs to re-index the database and update the rules file to reflect the changes. Several tools, executed in series, accomplish this goal. These tools appear as command buttons at the bottom of the form in Figure 6.10. First the system re-indexes the data records with the new data index. Then it creates a new rules file to reflect the changes. Rules file creation follows this procedure:

1) Copy all unique indices from the database into the rules file. This action documents every valid data index from the database, recording each index only once.

2) Within the rules file, split aggregated indices into a series of unique indices. Some data indices may specify more than one value for a data attribute. The software splits each of these aggregated indices into a set of non-aggregated indices, each having only one value for each attribute. This process may create new duplicate indices, so once again the software searches for and deletes any duplicates.

3) Combine unique indices into aggregate indices using wildcards (where possible). Figure 6.12 illustrates a situation where index simplification with wildcards is possible. Simplification reduces the size of the rules file. If situations similar to Figure 6.12 are discovered among the data indices, a single index with a wildcard can take the place of several similar indices.

4) Repeat step three until no more aggregation with wildcards can occur. Even among indices that already contain wildcards, further simplification can occur. The software continues simplifying indices until no more simplification is possible.

The result of this four-step procedure is a set of wildcard-enabled rules governing the search process. The new rules file can then operate as outlined in Chapter 4.
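The wildcard-aggregation step of this procedure can be sketched as follows. This is an illustrative Python sketch, not the prototype's code; it performs one aggregation pass over one attribute position, under the assumption that the attribute's full value domain is known.

```python
# Sketch of one wildcard-aggregation pass: if the rules file contains one
# index for every possible value of an attribute (all other attributes
# equal), that group collapses into a single wildcard index, as in
# Figure 6.12.
WILDCARD = "*"

def aggregate_once(indices, attr_pos, domain):
    indices = set(indices)
    for idx in list(indices):
        # The group of sibling indices differing only at attr_pos.
        group = {idx[:attr_pos] + (v,) + idx[attr_pos + 1:] for v in domain}
        if group <= indices:           # every domain value is present
            indices -= group
            indices.add(idx[:attr_pos] + (WILDCARD,) + idx[attr_pos + 1:])
    return indices

# Attribute C can take the values 1, 2, and 3 (cf. Figure 6.12):
rules = {(1, 1, 1), (1, 1, 2), (1, 1, 3)}
rules = aggregate_once(rules, attr_pos=2, domain=[1, 2, 3])
```

Repeating such passes over every attribute position, until no pass changes the set, corresponds to step four of the procedure.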
Figure 6.12: Rules file before and after adding wildcards (Attribute C can take the values 1, 2, and 3, so the three indices sharing values for Attributes A and B collapse into a single index with a wildcard for Attribute C)

6.4 Software enhancements

There are several software features this project did not fully implement. Although they are not essential to the software's basic operation, further software development efforts should include these features. The improvements in this section will create a more efficient user experience, providing clients with data more quickly and in more appropriate forms.

Differentiation between invalid and unpopulated indices. The rules file currently tracks all unpopulated data indices, but it does not know why any particular index is unpopulated. An unpopulated index may represent a valid combination of attributes that nobody has ever documented with data. Alternately, the index may be unpopulated because it represents an invalid combination of attributes. The software should allow a user to perform estimation functions on valid, unpopulated indices. But it should not allow a user to perform estimation functions on invalid combinations of attributes, because the results will not represent any real data. To decide whether estimation functions are allowed, the software should be able to discern between valid and invalid indices. It may be possible to automate this distinction when converting hierarchical structures. But when creating a new indexing scheme for a semi-structured data set, humans will have to set up the distinction between valid and invalid indices by hand.

Real-time narrowing of query menus. There are two ways to use the rules file to ensure search criteria are valid. The prototype software waits for a user to make and submit all of his/her selections before consulting the rules file.
The other strategy uses the rules file to change the user's available selections each time s/he alters a menu selection. This is a real-time narrowing strategy, because it immediately guides the user to make valid index selections. See Figure 4.9. Real-time narrowing is preferable because it only allows the user to make valid selections of indexing combinations. The prototype software does not use real-time narrowing because of the limitations of basic HTML transactions over the World Wide Web. Without using a special browser standard, script or applet, real-time narrowing is prohibitive for Web use. Further investigation of different methods for Web information transfer may reveal a good compromise between universal user access and real-time query verification.

Generation of better and more varied charts. To maximize the rate of information transfer between server and client, the prototype software currently uses only HTML table charts to summarize data. The system could create more attractive and versatile charts in GIF or JPEG format. This graphical charting technique can create scatter plots, time-series charts, pie charts and many other varieties. In particular, time-series charts are important for representing time-dependent data. For this reason alone, charting upgrades would be a major improvement in the software. Figure 6.13 illustrates an HTML table bar chart in its most detailed form, and Figure 6.14 illustrates a fairly standard graphical bar chart. The plotted data is identical for both charts, but the presentation of the graphical chart in Figure 6.14 is cleaner and better suited for presentation.
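The HTML-table charting technique can be sketched as follows. This Python sketch is illustrative only — the bin labels, counts, and styling are invented — but it shows the idea: each price bin becomes a table row whose bar is a cell sized in proportion to the bin's frequency, so no graphic file ever needs to be generated or downloaded.

```python
# Sketch of an HTML table bar chart: one row per price bin, with a bar
# whose pixel width is proportional to the bin's frequency.
def html_bar_chart(bins, px_per_count=20):
    rows = []
    for label, count in bins:
        bar = ('<td><div style="width:{}px;background:navy">'
               '&nbsp;</div></td>').format(count * px_per_count)
        rows.append("<tr><td>{}</td>{}</tr>".format(label, bar))
    return "<table>\n{}\n</table>".format("\n".join(rows))

chart = html_bar_chart([("$0 - $5.00", 2), ("$5.00 - $10.00", 5)])
```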
Figure 6.13: HTML table bar chart

Figure 6.14: Graphical bar chart (mean price by manufacturer, with the predicted value of $359.62 plotted alongside the individual manufacturers)

Automatic determination of best axes for estimation. The prototype software asks a user to specify which axes it should use for regression functions. The software accepts the user's judgment and uses only those axes to estimate unknown data values. If the user does not exercise good judgment, the resulting estimation function will achieve lower accuracy than it otherwise could have. To ensure best results, the software should offer an "auto-detect" option to determine which axes are best suited for the estimation process. The software might generate (or be given) a set of threshold error values for each axis. If, during regression analysis, average error per axis value falls above the threshold, the axis can be discarded from the estimation process. The axes that remain under the error threshold can then be treated as they are under the current software prototype.

Provision of novice and expert user modes. The features discussed in this section balance simplicity of use with the system's trust in a user. Some users will always be more knowledgeable or experienced than others will, so the system's faith in a user's decision should be variable. The most direct way to address this problem is the creation of different user modes. A novice user mode might verify index selections in real-time, present only basic pricing charts, and automatically select the axes to use for regression analysis. An expert user mode might offer the option for one-time index verification, present multiple types of data charts, list all returned records and let the user select which axes to use for regression.
By offering a novice mode and an expert mode, the system can meet the needs of new and experienced users simply by changing a few system settings. The administrative software may also benefit from multiple-mode operation, by varying the level of user involvement in administrative task scheduling and execution.

Speed improvement. The prototype software leaves significant room for speed improvements. These improvements are necessary if the software is to handle several clients simultaneously. The current system uses state-of-the-art hardware as of October 2000. The software components, however, do not create an efficient processing environment. They were selected for their ease of use, and because they were available to the Variation Risk Management workgroup. A superior environment might use a more efficient operating system, data repository, network protocol, data retrieval protocol and operating language. Additionally, the system may operate as a distributed server network instead of a single machine. These changes could decrease response times by an order of magnitude or more, relative to current performance.

6.5 Conclusion

This chapter has outlined the major functions of the prototype software. The prototype server software demonstrates functions for search specification, criteria checking, data retrieval, data analysis, and output formatting. Additionally, the server software demonstrates estimation methods for creating surrogate data. The locally run administration software demonstrates methods for assisted machine learning, index expansion, and database re-indexing. Subject to the enhancements suggested in this chapter, these functions demonstrate the practical feasibility of the concepts in this thesis.

7 Conclusion

This chapter reviews the nature of design-focused PCDB problems and the solutions presented in this thesis, and offers remarks about the strengths and weaknesses of the system.
7.1 Contributions

Design currently suffers from a lack of access to PCD. Some of the access barriers are products of problematic PCDB system architectures and data management strategies. This thesis presents an alternate PCDB architecture and improved methods for managing unpopulated database indices.

First, this thesis reviews major barriers to design's use of PCD. These barriers include:

- Poor indexing schemes
- Lack of design-focused PCD access
- PCDB dissimilarity within and across enterprises
- Absence of formalized methods for managing unpopulated data indices

These problems are evident from a review of previous work, discussions with industry during the completion of this thesis, and a recent industry survey. Chapters 1 and 2 establish the need to manage the noted problems through design-driven PCDB solutions. The thesis then describes the current state of popular DBMS architectures in Chapter 3, pointing out the advantages and disadvantages of each system. In particular, problems specific to PCDB usage appear prominently. Chapter 4 presents a hybrid attribute-based indexing structure that avoids many of the problems associated with well-known indexing systems. This hybrid system also lays the groundwork for solutions to many of the problems design faces when using PCD. Specifically, the hybrid system indexes data in a fully attribute-based way. It also allows designers to seek PCD by design-focused attributes, and uses a single interface to query many databases of differing structure.

Chapter 5 presents a method to predict values for unpopulated data indices. This method follows directly from the data structures in Chapter 4, and allows PCDB users to use formalized procedures to manage unpopulated data indices. Chapter 6 reviews a prototype server and administrative software package that demonstrates all of the technology described in the thesis.
7.2 Further research

There are several areas of design-focused PCDB implementation that require future work. This section describes some possibilities for future contributions.

Multiple estimation methods. In different situations, different strategies may be appropriate for creating surrogate data. This thesis describes an estimation routine based on regression analysis, but other strategies may also yield useful information. Alternate methods may use a different estimation routine, or may use a substitution routine that uses data from a different index as a direct surrogate for an unpopulated index. These methods should be explored further. Estimation and substitution routines for unknown data have been studied at length, and there are myriad approaches to missing data problems. A system could use multiple strategies to create surrogate data, and then compare or combine the results to create a best estimate.

Improved statistical validity tests. The system proposed in this thesis generates a minimum-error estimate for missing data. But the algorithm does not compare this minimum error value with an acceptable standard. An improved system should create a standard for minimum allowable error. This standard could be an amortized error value per data point, or a value that varies by axis. Alternately, the system could compare the final estimate with domain knowledge to assess validity.

Software enhancements. Chapter 6 describes several software enhancements that would improve the user experience. These improvements include varying user modes and improved graphical output. Most of the noted software enhancements offer greater flexibility and customization of the user experience.

The technology presented in this thesis holds great promise for improving designers' access to PCD.
Industry has expressed interest in technologies that improve PCDB implementations, even while the literature has overwhelmingly assumed that designers already have suitable access to the data. With the technology demonstrated in this thesis, industry can begin closing the gap between the current state of PCDB implementations and the powerful ideal of design-focused PCD applications.
Appendix A: PCDB Survey

Process Capability Database Survey

Thank you for providing us with information about your process capability database by answering the questions below. Your answers will be kept completely confidential, and will assist us in:

- Identifying key areas for discussion at this year's KC Symposium
- Assessing the current state of process capability database utilization

We will aggregate the data and report it back to the working group at the start of the meeting. When answering questions, please note the following terminology:

Organization: Company, division, office, or other group, if any, that you represent at the Symposium
PCDB: Process capability database
PCD: Process capability data
KC: Key characteristic
Index: The full code used to identify a specific location in the PCDB (a single index may contain several data points)
Unpopulated: Descriptor for an index containing no data
Query: Request for data points in at least one PCDB index

Again, thank you for your assistance and we look forward to seeing you at the Symposium.

Your name: _____ Your organization: _____

1) Among the following PCD-related topics, which do you consider most important for discussion at this year's KC Symposium? Please check all that apply.
- Planning and implementation of PCDBs
- Development of PCDB structure
- PCD query development
- Interpreting returned PCD
- Designing indices
- Populating PCDBs: timetables, economics, resource allocation
- Managing supplier PCDBs and other external PCD sources
- Other

2) Does your organization currently use a PCDB?
- Yes
- No, but a PCDB is currently being created
- No, but a PCDB is currently in planning stages
- We have no PCDB, nor specific plans to implement a PCDB

If you answered "no" to question 2, please skip to question 12. Questions 3-11 are for individuals representing organizations that have a PCDB. Questions 12 and 13 are for individuals representing organizations that do not have a PCDB.

3) What percentage of your PCDB is currently populated?
- 0-25%
- 25-50%
- 50-75%
- 75-99%
- 100%

4) Which of the following statements, if any, most accurately represent(s) the dispersion of unpopulated indices through your PCDB?
- Most or all indices are populated; unpopulated PCD is not an evident problem.
- The majority of the PCDB is populated, with unpopulated indices interspersed throughout.
- Unpopulated indices tend to exist in concentrated PCDB regions; other areas are thoroughly populated.
- Unpopulated indices abound in small concentrated regions of the PCDB, but are also interspersed moderately throughout the rest of the PCDB.
- The majority of the PCDB is unpopulated.

5) How frequently is a requested PCDB index found to be unpopulated?
- Less than 10% of queries
- 10-25% of queries
- 25-50% of queries
- More than 50% of queries

6) What characteristics are specified in order to access a particular data index (set of data points)? Please check all that apply.
- Part number or name
- Material
- Feature
- Operation
- Size
- KC number or name
- Machine

7) What strategies does your organization use to address unpopulated data? Check all that apply.
- Seek out alternate values from within PCDB via software procedure
- Seek out alternate values from within PCDB based on user intuition/expertise
- Consult expert within organization
- Consult expert outside of organization
- Seek information from manufacturing
- Contact supplier if unpopulated index is suspected to exist in supplier database
- Investigate design changes that enable processes with known capabilities

8) Which characteristics seem to have the greatest influence over the likelihood that an index is unpopulated?
- Material, i.e. certain materials have a much greater likelihood of index population than others
- Date, i.e. older hierarchy branches are more/less likely to be populated than newer ones
- Feature, i.e. certain features have a greater likelihood of index population than others
- Size, i.e. certain feature sizes are considerably more likely to be populated than others
- Complexity, i.e. indices for simpler features/operations are more/less likely to be populated
- No evident trends, or only weak trends noted

9) Who is typically asked for PCD? Check all that apply.
- Designers
- Manufacturing
- Suppliers
- Data specialists or database query personnel
- No request system is in place
- An individual is expected to find the data him/herself, through databases or otherwise

10) What types of design tasks most frequently require PCD that is found to be unpopulated? Check all that apply.
- Redesign of existing parts or processes
- Design of new parts or processes
- Investigative queries of alternate features/feature sizes, processes or materials
- Other

11) What trends, if any, are characteristic of design's reaction to not being able to find the right data?
- Decreased use of PCD
- Distrust of populated PCD indices
- Use of PCD only for revision of existing designs
- Use of PCD only for processes, materials, parts, KCs, etc. that are known to have populated indices
- Use of PCD only for queries regarding "popular"/well-known processes, materials, parts, KCs, etc.
- Other
- No trends noted

Question 11 is the last question for individuals representing organizations that have a PCDB. Thank you!

12) What are the primary reasons for the lack of a PCDB at your organization? Please check all that apply.
- Simple lack of need for PCDB
- Need for PCDB exists, but is not recognized by decision-makers
- Need is known, but implementation has simply been slow
- Lack of funding or other resources to create PCDB
- Lack of organizational knowledge for PCDB creation
- Other
- No known reasons

13) Which of the following strategies are used by design when desired process capability information is unavailable?
- Contact supplier or manufacturer
- Consult expert within organization
- Consult expert outside of organization
- Generate estimates for unavailable information, using resources within organization
- Investigate design changes that enable processes with known capabilities

Question 13 is the last question for individuals representing organizations that do not have a PCDB. Thank you!

Appendix B: Survey Responses

1. Among the following PCD-related topics, which do you consider most important for discussion at this year's KC Symposium?
Planning and implementation of PCDBs 65%
Development of PCDB structure 65%
PCD query development 26%
Interpreting returned PCD 22%
Designing indices 13%
Populating PCDBs: timetables, economics, resource allocation 43%
Managing supplier PCDBs and other external PCD sources 48%
Other 4%
No. respondents = 23

2. Does your organization currently use a PCDB?
Yes 46%
No, but a PCDB is currently being created 21%
No, but a PCDB is currently in planning stages 21%
We have no PCDB, nor specific plans to implement a PCDB 0%
No. respondents = 24

3. What percentage of your PCDB is currently populated?
0-25% 82%
25-50% 9%
50-75% 0%
75-99% 9%
100% 0%
No. respondents = 11

4. Which of the following statements, if any, most accurately represent(s) the dispersion of unpopulated indices through your PCDB?
Most or all indices are populated; unpopulated PCD is not an evident problem. 9%
The majority of the PCDB is populated, with unpopulated indices interspersed throughout. 9%
Unpopulated indices tend to exist in concentrated PCDB regions; other areas are thoroughly populated. 9%
Unpopulated indices abound in small concentrated regions of the PCDB, but are also interspersed moderately throughout the rest of the PCDB. 46%
The majority of the PCDB is unpopulated. 27%
No. respondents = 11

5. How frequently is a requested PCDB index found to be unpopulated?
Less than 10% of queries 22%
10-25% of queries 22%
25-50% of queries 33%
More than 50% of queries 22%
No. respondents = 9

6. What characteristics are specified in order to access a particular data index (set of data points)?
Part number or name 64%
Material 91%
Feature 82%
Operation 55%
Size 73%
KC number or name 45%
Machine 73%
No. respondents = 11

7. What strategies does your organization use to address unpopulated data?
Seek out alternate values from within PCDB via software procedure 9%
Seek out alternate values from within PCDB based on user intuition/expertise 45%
Consult expert within organization 73%
Consult expert outside of organization 36%
Seek information from manufacturing 73%
Contact supplier if unpopulated index is suspected to exist in supplier database 64%
Investigate design changes that enable processes with known capabilities 9%
No. respondents = 11

8. Which characteristics seem to have the greatest influence over the likelihood that an index is unpopulated?
Material, i.e. certain materials have a much greater likelihood of index population than others 18%
Date, i.e. older hierarchy branches are more/less likely to be populated than newer ones 0%
Feature, i.e. certain features have a greater likelihood of index population than others 27%
Size, i.e. certain feature sizes are considerably more likely to be populated than others 18%
Complexity, i.e. indices for simpler features/operations are more/less likely to be populated 45%
No evident trends, or only weak trends noted 36%
No. respondents = 11

9. Who is typically asked for PCD?
Designers 45%
Manufacturing 36%
Suppliers 9%
Data specialists or database query personnel 36%
No request system is in place 45%
An individual is expected to find the data him/herself, through databases or otherwise 0%
No. respondents = 11

10. What types of design tasks most frequently require PCD that is found to be unpopulated?
Redesign of existing parts or processes 36%
Design of new parts or processes 91%
Investigative queries of alternate features/feature sizes, processes or materials 36%
Other 9%
No. respondents = 11

11. What trends, if any, are characteristic of design's reaction to not being able to find the right data?
Decreased use of PCD 0%
Distrust of populated PCD indices 0%
Use of PCD only for revision of existing designs 0%
Use of PCD only for processes, materials, parts, KCs, etc. that are known to have populated indices 36%
Use of PCD only for queries regarding "popular"/well-known processes, materials, parts, KCs, etc. 0%
Other 9%
No trends noted 55%
No. respondents = 11

12. What are the primary reasons for the lack of a PCDB at your organization?
Simple lack of need for PCDB 15%
Need for PCDB exists, but is not recognized by decision-makers 38%
Need is known, but implementation has simply been slow 23%
Lack of funding or other resources to create PCDB 31%
Lack of organizational knowledge for PCDB creation 31%
Other 23%
No known reasons 0%
No. respondents = 13

13. Which of the following strategies are used by design when desired process capability information is unavailable?
Contact supplier or manufacturer 50%
Consult expert within organization 67%
Consult expert outside of organization 25%
Generate estimates for unavailable information, using resources within organization 50%
Investigate design changes that enable processes with known capabilities 42%
No. respondents = 12