CHAPTER 14: INTRODUCTION TO DATA ANALYSIS

14.1 INTRODUCTION
There are many situations in business where data is collected and analysed. The key ideas of data analysis are important in the modern business environment. Data analysis involves summarising and understanding the main features of the variables contained within the data, and investigating the nature of any linkages that may exist between the variables.

14.2 WHAT IS DATA
Example 1
Population: the set of all people/objects of interest in the study being undertaken.
– Very large
– Enumerated precisely
– Cannot be enumerated physically
Population member
The information for each member of the population:
– Age
– Gender
– Parish
– Will you vote in the by-election?
– Will you vote for me?
Variables: each variable is one piece of information
– Five variables
To investigate the connection between the following pairs of variables:
– 'Will you vote for me' and 'Age'
– 'Will you vote for me' and 'Gender'
– 'Will you vote for me' and 'Parish'
When population data is used, the outcomes of the analysis are precise, 'perfect information' results.
Example 2
Population: the set of all customers
A sensible initial set of questions is:
– Do you understand exactly what each variable is measuring/recording?
– Do you understand the problem under investigation, and are the objectives of the investigation clear?

14.3 DESCRIBING VARIABLES
Classification of variable types
– Attribute variables
– Measured variables
Attribute Variables:
– An attribute variable has its outcomes described in terms of its characteristics or attributes.
– Example 1 'By-Election Data':
– Example 2 'Credit Data'
  • Does the customer own their own house?
    – 0 = Yes, 1 = No
  • The Region in which the customer is resident?
    – 1 = South West
    – 2 = South East
    – 3 = London
    – 4 = Midlands
    – 5 = North
  • The usual way of handling attribute data is to give each attribute a numerical code 0, 1, 2, …
Measured Variable
– A measured variable is a variable whose outcomes are measured; the resulting outcome is expressed in numerical terms.
– Two types of measured variables
  • Continuous variable: measured on a continuous scale (e.g. a person's weight)
  • Discrete variable: takes countable values (e.g. the number of passengers on a flight)
– Example 1 'By-Election Data':
  • The measured variable in this data set is 'Age'
– Example 2 'Credit Data'
  • Measured variables as follows

14.4 THE CONCEPT OF A STATISTICAL DISTRIBUTION
Attribute Variable
– Gender of constituents (Example 1)
  [Figure: DISTRIBUTION OF GENDER IN THE CONSTITUENCY]
– Region (Example 2)
  [Figure: DISTRIBUTION OF REGION IN WHICH CUSTOMER IS RESIDENT]
Measured Variable
– Customer's Age (Example 2)
  [Figure: DISTRIBUTION OF AGE OF CUSTOMER]
– Household Income (Example 2)
  [Figure: DISTRIBUTION OF HOUSEHOLD INCOME]
What does the distribution show?
– The area under the curve from one income value to another measures the relative proportion of the population having household incomes in that range.
– A household income lower than £10,000 is relatively rare.
– A large proportion of the population has a household income between £20,000 and £50,000.
The Descriptive Statistics for the Distribution of a Measured Variable
– Distribution of the height of adults in Great Britain.
– The height of children under 11 years of age
  [Figure: children's heights and adults' heights]
– Heights in two different countries, country A and country B
  [Figure: DISTRIBUTION OF HEIGHTS COUNTRY A & B]
A statistical distribution for a measured variable can be summarised using three key descriptions:
– Centre of the distribution
– Width of the distribution
– Symmetry of the distribution
Measuring the Centre of a Distribution
– The Mean
  • Average value = $\sum X / n$
  • Example: Average Household Income
  • Symbol for the population mean: $\mu$
  • Formally, the population mean of a variable is defined to be:
    – $\mu = \sum X / n$
– The Median
  • The median value of the variable is defined to be the particular value of the variable such that half the data values are less than the median value and half are greater.
  • Sort all the data in ascending order; the median is then the middle value in this list.
Measuring the Width of a Distribution
– The Standard Deviation
  • The standard deviation is the square root of the average squared deviation from the mean.
  • Symbol for the standard deviation: $\sigma$
  • $\sigma$ is usually defined in terms of the variance $\sigma^2$ as:
    – $\sigma^2 = \sum (X - \mu)^2 / n$
  • The standard deviation is the square root of the variance: $\sigma = \sqrt{\sigma^2}$
  • Example: calculating the standard deviation for the variable 'Household Income'
  • The standard deviation is a relative measure of spread (width): the larger the standard deviation, the wider the distribution.
– Inter-quartile Range
  • The inter-quartile range is the range over which the middle 50% of the data values vary.
  • To define the quartiles:
    – Q1: the value of the variable that divides the distribution 25% to the left and 75% to the right.
    – Q2: the value of the variable that divides the distribution 50% to the left and 50% to the right (the median).
    – Q3: the value of the variable that divides the distribution 75% to the left and 25% to the right.
  • The inter-quartile range is the value Q3 - Q1.
  • Example: calculating Q1, Q2 and Q3 for the variable 'Household Income'
  • Conventionally, the mean and standard deviation are used as one pair of measures of location and spread, and the median and inter-quartile range as another pair.
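
The measures of centre and width described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the income figures are made up (they are not the course data), the values are treated as a whole population (divisor n), and the quartiles use the standard library's "inclusive" convention, which may differ slightly from the convention used by other statistics packages.

```python
import statistics

# Hypothetical household incomes (£ per annum); NOT taken from the course data.
incomes = [12000, 18000, 24000, 27000, 31000, 36000, 42000, 55000, 90000]


def describe(values):
    """Return centre and width measures, treating values as a whole population."""
    n = len(values)
    mean = sum(values) / n                                # mean = sum(X) / n
    variance = sum((x - mean) ** 2 for x in values) / n   # variance = sum((X - mean)^2) / n
    std_dev = variance ** 0.5                             # square root of the variance
    median = statistics.median(values)                    # middle value of the sorted data
    # Quartile conventions vary between packages; "inclusive" is one common choice.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": mean,
        "median": median,
        "variance": variance,
        "std_dev": std_dev,
        "q1": q1,
        "q2": q2,
        "q3": q3,
        "iqr": q3 - q1,                                   # inter-quartile range
    }


summary = describe(incomes)
for name, value in summary.items():
    print(f"{name:>8}: {value:,.2f}")
```

In this made-up data the single large income pulls the mean above the median, which is one reason the median and inter-quartile range are often reported alongside the mean and standard deviation.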
Measuring the Symmetry (Skewness) of a Distribution
– Pearson's Coefficient of Skewness
  • Pearson's coefficient of skewness = 3(mean - median)/standard deviation
– Quartile Measure of Skewness
  • Quartile measure of skewness = [(Q3 - Q2) - (Q2 - Q1)]/(Q3 - Q1)

14.5 SUMMARY
What is Data
Variables
Two types of variable:
– an attribute variable
– a measured variable
The concept of a Statistical Distribution:
– As applied to an attribute variable
– As applied to a measured variable
Descriptive Statistics for a measured variable:
– Measures of Centre
  • Mean
  • Median
– Measures of Width
  • Standard Deviation
  • Inter-Quartile Range
– Measures of Symmetry (Skewness)
  • Pearson's coefficient of Skewness
  • Quartile Measure of Skewness

14.6 THE NATURE OF A SAMPLE
POPULATION:
– Perfect Information
– In practice it is often impossible to enumerate the whole population.
– A sample is drawn from the population and used to make judgements (inferences) about the population.
SAMPLE
– Imperfect Information
– Random sample
  • Each item in the population has an equal chance of being included in the sample.
– The KEY PROBLEM is to use the sample data to draw valid conclusions about the population, taking into account the 'error due to sampling'.
– Unrepresentative sample
  • How to Lie with Statistics
A Credit Scenario
– Population: the set of all customers who used the credit facilities between 1st January 2000 and 31st December 2001.
– Sample Size: 654 customers
– Data file: BDMCREDIT.MTW

14.8 DESCRIBING SAMPLE DATA
Attribute variable: the number of occurrences of each attribute is obtained.
Measured variable: sample descriptive statistics describing the centre, width and symmetry of the distribution are calculated.
Attribute Data
– C5 Does the customer own their own house? Coded: 0 = Yes, 1 = No
– C6 The Region in which the customer is resident?
  Coded: 1 = South West, 2 = South East, 3 = London, 4 = Midlands, 5 = North
– Command: STAT-TABLE-TALLY
– Summary Statistics for Discrete Variables
  • Counts (OWN-OCC)
  • Percent (OWN-OCC)
  • Distribution graph (OWN-OCC)
  [Figure: Do you own your own house?]
– Summary Statistics for Discrete Variables
  • Count (REGION)
– The same information in words:
  • 74 or 11.31% of the respondents are from the South West
  • 132 or 20.18% of the respondents are from the South East
  • 165 or 25.23% of the respondents are from the London area
  • 161 or 24.62% of the respondents are from the Midlands
  • 122 or 18.65% of the respondents are from the North
Measured Variables
– For the 'Credit Data':
  • C2 Customer's Age (AGE)
  • C3 Household Income (£ per annum) (SALARY)
  • C4 Estimated monthly outgoing on mortgage/rent/rates/utilities/credit card payments etc. (PAYOUT)
  • C7 The Amount borrowed on credit (CREDIT)
– HISTOGRAM
– BOXPLOT
  • The BOXPLOT will prove to be the more useful picture of a sample distribution when later chapters discuss the data analysis used to examine the connection between two sample variables.

14.7 DATA ANALYSIS USING SAMPLE DATA
Before attempting to analyse any data, the analyst should ensure that:
– The problem under investigation is clearly understood and the objectives of the investigation have been clearly specified. Keep asking questions until satisfactory answers have been obtained.
– The individual variables making up the data set are clearly understood.
– Descriptive Statistics
  • Measures of Centre
    – Mean
      • Sample Mean: $\bar{X}$ = 945.2
    – Median
  • Measures of Width
    – Standard Deviation
      • Sample Standard Deviation: $S$
      • Sample Variance: $S^2$
    – Inter-Quartile Range (IQR)
  • Symmetry
Symmetry (Skewness)
– A distribution is skewed if one tail extends farther than the other.
– A skewness value close to 0 indicates symmetric data.
– Negative values indicate negative/left skew.
– Positive values indicate positive/right skew.
– Example of a negative or left-skewed distribution (skewness = -1.44096)
  [Figure: Summary for marks (histogram, boxplot and 95% confidence intervals for mean and median)]

  Summary for marks
    Anderson-Darling Normality Test
      A-Squared                              2.37
      P-Value                              < 0.005
    Mean                                   73.540
    StDev                                  12.670
    Variance                              160.534
    Skewness                               -1.44096
    Kurtosis                                2.92033
    N                                     100
    Minimum                                26.000
    1st Quartile                           67.000
    Median                                 76.000
    3rd Quartile                           83.000
    Maximum                                92.000
    95% Confidence Interval for Mean       71.026 to 76.054
    95% Confidence Interval for Median     73.000 to 79.000
    95% Confidence Interval for StDev      11.125 to 14.719

– The Relationship between the descriptive statistics and the Boxplot
  • The asterisks on the right-hand side of the median indicate sample values that are in some sense extreme.

14.9 INVESTIGATING RELATIONSHIPS BETWEEN VARIABLES
To investigate the relationship between variables, two roles are distinguished:
– Response variable
  • a variable that measures, either directly or indirectly, the objectives of the analysis
– Explanatory variable
  • a variable that may influence the response variable
Example 1
– A university wishes to investigate the salary of its graduates five years after graduating.
– The questionnaire
  • 'Current Salary'
  • 'Starting Salary'
  • 'Class of Degree' Coded: 1 = First, 2 = Upper Second, 3 = Lower Second, 4 = Third, 5 = Pass
  • 'Graduate's Gender' Coded: 1 = Male, 2 = Female
– Response variable
  • Current Salary (measured variable)
– Explanatory variables
  • Starting Salary (measured variable)
  • Class of Degree (attribute variable)
  • Graduate's Gender (attribute variable)
Example 2: CREDIT scenario
– Objectives of the analysis
  • To investigate the nature of credit transactions
  • The response variable is 'The Amount borrowed on credit'
  • The problem is to investigate the relationship between 'The Amount borrowed on credit' and the other variables.
– Summary: Combinations of Response Variable and Explanatory Variable
The method for investigating the connection between a response variable and an explanatory variable depends on the type of each variable.
– Investigating the connection between a measured response variable and a measured explanatory variable
– Investigating the connection between a measured response variable and an attribute explanatory variable

Homework
Find or collect some data from your life or business practice, and complete the following tasks:
– Draw the statistical distribution of the data
– Calculate the Mean and Standard Deviation
– Calculate the Median and Inter-Quartile Range
– Calculate Pearson's Coefficient of Skewness and the Quartile Measure of Skewness
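
As a starting point for the skewness part of the homework, the two skewness measures can be sketched in Python as below. This is a hedged illustration, not a prescribed method: the function names are hypothetical, the quartile measure is taken here as [(Q3 - Q2) - (Q2 - Q1)]/(Q3 - Q1), and the inputs are the summary values reported for the 'marks' data earlier in the chapter (mean 73.540, median 76.000, StDev 12.670, quartiles 67.000 and 83.000).

```python
def pearson_skewness(mean, median, std_dev):
    """Pearson's coefficient of skewness: 3(mean - median) / standard deviation."""
    return 3 * (mean - median) / std_dev


def quartile_skewness(q1, q2, q3):
    """Quartile measure of skewness: [(Q3 - Q2) - (Q2 - Q1)] / (Q3 - Q1)."""
    return ((q3 - q2) - (q2 - q1)) / (q3 - q1)


# Summary values for the 'marks' data as reported in the chapter.
marks_mean, marks_median, marks_std = 73.540, 76.000, 12.670
marks_q1, marks_q3 = 67.000, 83.000

pearson = pearson_skewness(marks_mean, marks_median, marks_std)
quartile = quartile_skewness(marks_q1, marks_median, marks_q3)

print(f"Pearson's coefficient of skewness: {pearson:.3f}")
print(f"Quartile measure of skewness:      {quartile:.3f}")

# Both measures are negative when the mean lies below the median and the
# lower quartile is further from the median than the upper quartile.
if pearson < 0 and quartile < 0:
    print("Both measures point to a negative (left) skew.")
```

Both measures come out negative for these values, which agrees in sign with the skewness of -1.44096 reported for the marks data, although the three measures are scaled differently and should not be expected to match in size.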