Measuring Complexity
John David Kendrick
Business Process Management, Inc.
April 21, 2010

My Purpose
• To provide a framework for measuring the complexity of a hardware or software product
• To inform you about Cluster Analysis and Principal Component Analysis
• To present examples where this approach was applied to estimating the complexity of a hardware product and a software product

Agenda
• Introduction - John David Kendrick
• Overview of the Problem
• Background
  – Cluster Analysis
  – Principal Component Analysis
• Presentation of the Technique and Examples

My Education
• Master of Engineering, Simulation and Modeling, Arizona State University
• Master of Applied Statistics, Penn State University
• MBA, Financial Economics, University of Pittsburgh
• BA Economics, University of Pittsburgh
• BS Math and Computer Science, University of Pittsburgh
• BS Physics, Purdue University

My Professional Certifications
• Certified Six Sigma Master Black Belt, SigmaPro
• ASQ Certifications:
  – Six Sigma Black Belt
  – Reliability Engineer
  – Software Quality Engineer
  – Quality Manager / Operational Excellence
• Lean Certifications:
  – Lean Enterprise, Arizona State University
  – Lean Manufacturing Management, IDL Systems
• Certified Reliability Estimation, SigmaPro

Professional Experience
• Naval Air Warfare Center - Aircraft Division
• US Army (CIO-G6, TRADOC, MNRA/CIO-G1)
• Motorola (Networks Division)
• Qwest Communications (US West)
• Freddie Mac (Single Family Housing, Program Management Office, Center of Excellence)
• AT&T (Global Network Operations)
• Ygomi, LLC

My Publications & Quality Promotions
• John Kendrick and Daniel Saaty, "Analytic Hierarchy Process (AHP) for Six Sigma Project Selection and Portfolio Optimization", Six Sigma Forum Magazine, American Society for Quality, 8/2007
• John Kendrick, "Data Stratification for Champions", Six Sigma Forum Magazine, American Society for Quality, 8/2008
• John Kendrick, "The Importance of a Proper SPC Subgroup Sampling Technique", Quality Digest, 11/2009
• John Kendrick, "Poor Technique", Six Sigma Forum Magazine, American Society for Quality, 3/2010
• Speaker, ASQ Section 618, March 2008
• Speaker, ASQ Section 702, April 2010
• Webinars: John Kendrick, "Using Discrete Event Simulation to Improve IT Help Desk Operations for ITIL Problems and Incidents", American Society for Quality, Service Division, 8/2008 (Registered Audience: 250)

What is Complexity?
Complexity - The level of difficulty in solving mathematically posed problems, as measured by the time, number of steps or arithmetic operations, or memory space required (called time complexity, computational complexity, and space complexity, respectively).
SOURCE: www.dictionary.com

Common Questions
• How long will it take to make a …?
• How many defects can we expect when we are making a …?
• How many problems can we expect from our customers?
• How many hours (in FTE) will it take to make a …?
The Approach
• Group “similar” objects using Cluster Analysis
• Use Principal Component Analysis to mathematically describe the objects and the group

Advantages Over Alternative Methods
• No guessing, “gut feel,” or reliance on instinct (which breaks down with more than three dimensions)
• More natural: Discriminant Analysis establishes the groups before the analysis, whereas Cluster Analysis derives them from the objects under examination
• Greater flexibility
  – Many types of measurements can be applied to the objects
  – Many types of cluster analysis algorithms are available
• In my experience, this approach has always worked!

What is Cluster Analysis?
• “Cluster Analysis is the art of finding groups in data.” (1, p. 1)
• Popular in the 1980s
• Considered a branch of pattern recognition and artificial intelligence
• Consider these two groups:
  – {a, b, c} and {A, B, C}
  – What makes the elements similar?
We all agree that there are two groups, but how did we establish these groups? What rules did we use? What associations are there between the elements? How did we determine the criteria for the measurements, or which attributes to use for the groups?

How to Proceed?
• Like many statistical approaches, this is as much an art as a science
• Understand the question to be answered
• Determine the significant attributes
• Establish and validate a measurement system for each attribute
• Devise a rule (or algorithm) to assign objects to groups based on instances of the attributes
• VERIFY AND VALIDATE THE MODEL
“All models are wrong, some models are useful.” – Dr. George Box and Dr. John Fowler

Data Preparation
• Types of Data
  – Variables (Continuous) Data
  – Attribute (Discrete) Data
  – Nominal Data
  – Ordinal Data
  – Ratio Data
  – Interval Data
• Distribution of the data
  – Linear
  – Exponential
• Units selected
• Data centered?
• Transformation needed?
The groupings will change depending on the data type, the units used, how the data is dispersed, and the clustering algorithm selected.

Ways to Measure Associations Between Objects
Mean Absolute Deviation of attribute f over n objects:
$$ \mathrm{MAD}_f = \frac{1}{n} \sum_{i=1}^{n} \lvert x_{i,f} - m_f \rvert $$
where the mean is
$$ m_f = \frac{1}{n} \sum_{i=1}^{n} x_{i,f} $$
[Figure: plot of x and y points; callouts: “Distance”, “Contribution from each element”, “minimize”]

Ways to Measure Dissimilarities Between Variables
• Correlation
• Superset of Euclidean Distance
• Accounts for covariance
$$ D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}, \qquad x = (x_1, x_2, \ldots, x_n)^T, \quad \mu = (\mu_1, \mu_2, \ldots, \mu_n)^T $$
where S is the covariance matrix.
The Mahalanobis distance is a measure of divergence or distance between groups. It is used to measure the similarity of an unknown set of objects to a known set of objects. It accounts for correlations between variables and is independent of scale.

Which Algorithm to Select?
Partitioning Methods
  – Partitions of the Space
  – Partitions around Medoids
  – Fuzzy Analysis
[Figure: object X positioned among clusters A, B, and C; points labeled by cluster number (1, 2, 3)]
“X is 90% associated with A, 5% with B, and 5% with C”

Which Algorithm to Select?
Hierarchical Methods
  – Agglomerative
  – Divisive
Example: a taxonomy that breaks down types of species.
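
The measures of association described in this section (the mean absolute deviation and the Mahalanobis distance) can be made concrete with a short sketch. This is a minimal illustration in Python with NumPy, not the implementation behind the slides: the function names `mad_standardize` and `mahalanobis` and the sample measurements are hypothetical, and the data are assumed to contain no constant column (which would make the mean absolute deviation zero). The last lines illustrate the claim that the Mahalanobis distance is independent of scale.

```python
import numpy as np


def mad_standardize(X):
    """Center each column on its mean m_f and divide by its mean absolute deviation."""
    X = np.asarray(X, dtype=float)
    m = X.mean(axis=0)                  # m_f: mean of attribute f
    mad = np.abs(X - m).mean(axis=0)    # MAD_f: mean absolute deviation of attribute f
    return (X - m) / mad                # assumes no column is constant (MAD_f > 0)


def mahalanobis(x, mu, S):
    """Return sqrt((x - mu)^T S^-1 (x - mu)) for a covariance matrix S."""
    d = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    # Solve S z = d instead of forming the inverse explicitly.
    return float(np.sqrt(d @ np.linalg.solve(S, d)))


# Hypothetical measurements: rows are objects, columns are two attributes.
X = np.array([[2.0, 10.0], [4.0, 14.0], [6.0, 13.0], [8.0, 19.0]])
d_original = mahalanobis(X[0], X.mean(axis=0), np.cov(X, rowvar=False))

# Change the units of the second attribute by a factor of 1000.
X_scaled = X * np.array([1.0, 1000.0])
d_scaled = mahalanobis(
    X_scaled[0], X_scaled.mean(axis=0), np.cov(X_scaled, rowvar=False)
)

print(d_original, d_scaled)  # the two distances agree up to rounding
```

With an identity covariance matrix, `mahalanobis` reduces to the ordinary Euclidean distance, which is the sense in which it is a superset of that measure.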
Principal Component Analysis
• Principal component analysis is a linear transformation from the original coordinate system in p-space to a new orthogonal coordinate system
• The axes of the new coordinate system point in the directions of maximum variation of the data in the original coordinate system
• The variables in the new coordinate system are uncorrelated

Principal Component Analysis
• Variable-directed technique
• Will the first few components account for most of the variation?
• The goal is to describe the data with a smaller number of variables; hence it is a data reduction method
• The coordinates of the transformation are constructed from the correlation matrix
• No assumptions are made about the distribution of the original data
• Interpretation of the data in the transformed space is an art…

Principal Component Analysis
[Figure: original data in C3 (axes x1, x2, x3) with principal axes p1 and p2; callout: “Can examine in C1”]

Example 1: Software Development
The principal component index “PC1Effort” is associated with the clusters in the following way:
If -3021 < PC1Effort < -2219, then the Level of Effort is associated with Cluster 1 – Low Level of Effort
If -7085 < PC1Effort < -6063, then the Level of Effort is associated with Cluster 2 – Medium Level of Effort
If -11155 < PC1Effort < -9940, then the Level of Effort is associated with Cluster 3 – High Level of Effort
Verify and Validate the Model!
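
The PCA description and the PC1Effort rules above can be sketched in a few lines of Python with NumPy. This is a hedged illustration under stated assumptions, not the model from the slides: `pca_correlation`, `effort_cluster`, and the sample matrix `X` are hypothetical names and data. The loadings behind PC1Effort are not given, so the PCA part does not reproduce those values; only the three intervals are taken from the text, and values that fall between intervals are assumed to be unclassified.

```python
import numpy as np


def pca_correlation(X):
    """PCA from the correlation matrix: returns (eigenvalues, eigenvectors, scores)."""
    X = np.asarray(X, dtype=float)
    # Standardize each variable so the covariance of Z equals the correlation matrix of X.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(X, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(R)   # eigh returns ascending order
    order = np.argsort(eigenvalues)[::-1]           # largest variation first
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    scores = Z @ eigenvectors                       # coordinates in the new system
    return eigenvalues, eigenvectors, scores


# PC1Effort intervals as stated above: (lower, upper, cluster, level of effort).
EFFORT_RULES = [
    (-3021, -2219, 1, "Low"),
    (-7085, -6063, 2, "Medium"),
    (-11155, -9940, 3, "High"),
]


def effort_cluster(pc1_effort):
    """Return (cluster, level) for a PC1Effort value, or None if no interval matches."""
    for lower, upper, cluster, level in EFFORT_RULES:
        if lower < pc1_effort < upper:
            return cluster, level
    return None


# Hypothetical data: 6 objects measured on 3 correlated variables.
X = np.array([
    [10.0, 2.0, 30.0],
    [12.0, 3.0, 33.0],
    [20.0, 5.0, 61.0],
    [22.0, 6.0, 64.0],
    [35.0, 9.0, 98.0],
    [37.0, 9.5, 110.0],
])
eigenvalues, eigenvectors, scores = pca_correlation(X)
print("share of variation per component:", eigenvalues / eigenvalues.sum())
print("score covariance (off-diagonal entries near 0):")
print(np.cov(scores, rowvar=False).round(6))
print(effort_cluster(-6500))  # a hypothetical PC1Effort value
```

The near-zero off-diagonal entries of the score covariance show that the new variables are uncorrelated, and the shares of variation answer the question of whether the first few components account for most of it.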
Example 1: Software Development
[Figure: clusters labeled High, Med, and Low plotted against pc1]
High Complexity = (400, 475) hours
Medium Complexity = (160, 190) hours
Low Complexity = (80, 90) hours
Verify and Validate the Model!

Example 2: Electronic Product Defects
[Figure: clusters labeled Expected Defects = High, Expected Defects = Med, and Expected Defects = Low]

Example 2: Electronic Product Defects
PC1defects = -2.75*components - 2.72*levels - 2.75*solder_joints
PC2defects = 0.007*components + 0.202*levels + 0.156*solder_joints
Cluster 1: PC1defects = (-723, -571) and PC2defects = (25, 34)
Cluster 2: PC1defects = (-1185, -1108) and PC2defects = (49, 53)
Cluster 3: PC1defects = (-1424, -1377) and PC2defects = (63, 65)
Verify and Validate the Model!

Example 2: Electronic Product Defects
Expected Defects = High = (252, 317)
Expected Defects = Med = (180, 210)
Expected Defects = Low = (128, 150)

Recap of Agenda
• Introduction - John David Kendrick
• Overview of the Problem
• Background
  – Cluster Analysis
  – Principal Component Analysis
• Presentation of the Technique and Examples

Questions?
John David Kendrick
Business Process Management, Inc.
(480) 307-0541
jkendri274@earthlink.net

Thank You!
John David Kendrick

References:
1. Kaufman, Leonard and Rousseeuw, Peter J., Finding Groups in Data, ©1990, John Wiley & Sons, New York, New York, Chapters 1-5.
2. Johnson, Richard and Wichern, Dean, Applied Multivariate Statistical Analysis, ©1988, Prentice Hall, Englewood Cliffs, New Jersey, Chapter 8.
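
As a supplementary sketch for Example 2, the two principal component equations and the cluster ranges can be evaluated directly. The coefficients and ranges below are copied from the Example 2 slide; the function names `pc_scores` and `defect_cluster`, the sample inputs, and the treatment of each range as an open interval (by analogy with Example 1) are assumptions. The slides do not state which cluster corresponds to which level of expected defects, so the sketch returns only the cluster number.

```python
def pc_scores(components, levels, solder_joints):
    """Evaluate the two principal component equations from Example 2."""
    pc1 = -2.75 * components - 2.72 * levels - 2.75 * solder_joints
    pc2 = 0.007 * components + 0.202 * levels + 0.156 * solder_joints
    return pc1, pc2


# Ranges from Example 2: cluster -> ((PC1 low, PC1 high), (PC2 low, PC2 high)).
CLUSTER_RANGES = {
    1: ((-723, -571), (25, 34)),
    2: ((-1185, -1108), (49, 53)),
    3: ((-1424, -1377), (63, 65)),
}


def defect_cluster(components, levels, solder_joints):
    """Return the cluster whose PC1 and PC2 ranges both contain the scores, else None."""
    pc1, pc2 = pc_scores(components, levels, solder_joints)
    for cluster, ((lo1, hi1), (lo2, hi2)) in CLUSTER_RANGES.items():
        # Assumption: ranges are open intervals, as in the Example 1 rules.
        if lo1 < pc1 < hi1 and lo2 < pc2 < hi2:
            return cluster
    return None


# Hypothetical product: 50 components, 10 levels, 180 solder joints.
print(pc_scores(50, 10, 180))
print(defect_cluster(50, 10, 180))
```

A product whose scores fall outside every range returns `None`, which is a prompt to verify and validate the model rather than to force an assignment.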