Slide 1: Data Analysis with MS® Excel™ 2007 (Special Session)
A short course by: Stanley T. Schuyler, D.Sc.
Math and Computer Science Dept., Edinboro University of PA
S.T.Schuyler, D.Sc. 01/08/2010 Analyzing Data Using Excel (tm)

Slide 2: Data Analysis Orientation - 1
• The theme for today is data analysis using Excel™
  – To understand “what does the data mean?” for some business purpose.
  – Doing this requires manipulating the data set:
    • To reduce the “mass” of data to a summarized set of “useful” characterizations
    • To identify patterns in the data
    • To identify relationships or predictors in the data

Slide 3: Data Analysis Orientation - 2
• We will begin with a model of a “marketing-like” data set
  – It contains survey participant information represented in two forms:
    • coded values (e.g. Likert scales about “something”)
    • real values (e.g. a participant’s “age” in years)
  – It also contains missing and erroneous values.
  – It contains mostly codes which represent something else.
  – The next slide displays the first 30 or so rows of the “monster” data set
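To make the distinction between coded, real, missing, and erroneous values concrete, here is a minimal Python sketch. It is not part of the course workbook. The Income code meanings are the ones given in the course's variable definitions; the four records are hypothetical and exist only to show one blank cell and one undefined code.

```python
from collections import Counter

# Income codes and meanings as given in the course's variable definitions.
INCOME_MEANING = {
    1: "<$10,000",
    2: "$10,000-$24,999",
    3: "$25,000-$50,000",
    4: "> $50,000",
}

# Hypothetical records (made up for illustration): Age is a real value,
# Income is a coded value, and None stands for a blank cell.
records = [
    {"Age": 21, "Income": 2},
    {"Age": 25, "Income": None},  # missing: the cell was left blank
    {"Age": 39, "Income": 7},     # erroneous: 7 is not a defined code
    {"Age": 48, "Income": 4},
]


def describe_income(code):
    """Return the meaning of an Income code, or flag it as missing/erroneous."""
    if code is None:
        return "missing"
    return INCOME_MEANING.get(code, "erroneous")


labels = [describe_income(r["Income"]) for r in records]
summary = Counter(labels)  # reduces the list to counts per description

print(labels)
print(summary)
```

The `Counter` step is a small instance of reducing the mass of data to a summarized set of characterizations: it reports how many records fall under each description, including how many are missing or erroneous.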
Slide 4: “Model” Data Set Situation
[Table: the first 34 rows of the “RawData” worksheet. Columns: Location, Age, Gender, Fam-Size, Pres-Add, Own-Rent, Income, educ, Employ, Cleanliness, Hours, Prices, Service, Overall-Imp, CCDebt, p(purchase). Most values are numeric codes and several cells are blank. First row: 3, 21, 1, 4, 1, 1, 2, 3, 1, 2.00, 5.00, 5.00, 5.00, 8, 8369, 0.57.]
The rows of data go on and on …

Slide 5: The Data Analysis and Mining Process
1. Data Set input and information mapping setup
2. Data Cleaning
3. Qualitative Analysis
   A. Reinterpretation: mapping codes to meanings
   B. Descriptive Transformations
   C. Descriptive Summarization
4. Quantitative Analysis
   A. Planning and designing the needed results and views
   B. Using descriptive statistics to explore data properties
   C. Using tables and graphics to explore relationships
   D. Using Statistical Inference (within Excel™ limitations)
5. Synthesizing Qualitative and Quantitative Results

Slide 6: 1. Data Set input and information mapping setup
• Using IE or Windows Explorer go to this website: http://users.edinboro.edu/sschuyler
• Scroll down until you see an entry for: Data Analysis with Microsoft Excel 2007 (Special Session)
• Locate two files:
  – MrktDataBase.xls
  – VariableDefinitionsDataSet.doc
• Download these two files (left click, “save as”) to a folder you will be working in today.
• Open the 97-2003 compatible workbook and re-save it as a 2007 workbook (*.xlsx)
• Open the definitions document
• We will begin here

Slide 7: Class Procedure Note
• From this point on, the slides state our objectives; they are not step-by-step instructions for “how to do it.”
• During class, the parts of the Excel™ user interface needed to perform the lesson operations will be pointed out, demonstrated, and discussed as needed.
• The details of “how to” can be looked up in the following highly recommended text:
  Grauer, Robert T. & Mulbery, Keith & Scheeren, Judy. (2009). Microsoft® Office Excel 2007 Comprehensive, 2nd Edition. Pearson Prentice Hall, Upper Saddle River, NJ 07458.

Slide 8: 2. Data Set Cleaning Operations
• Most Data Sets have errors and/or omissions
• Therefore we need to:
  – Encode and map variable codes to descriptive definitions
  – Identify missing and erroneous values
  – Determine what counts as “bad data” and choose strategies for managing it
• Tools to be used:
  – Conditional Formatting
  – Find and replace
  – Using LOOKUP functions with mapping tables to view codes descriptively

Slide 9: Impact of Initial Conditional Formatting (seeing the missing)
[Table: the rows from slide 4 shown again with conditional formatting applied, so that the missing values stand out.]

Slide 10: Transform Raw Data to avoid Errors
• The raw data is not the “Truth” (see the worksheet named “RawData” or slide 4 above)
• For example: Income is coded 1, 2, 3 or 4 but represents four categorical range descriptions
  Annual household income: 1=Less than $10,000, 2=$10,000-$24,999, 3=$25,000-$50,000, 4=Over $50,000

  Income Code   Meaning
  1             <$10,000
  2             $10,000-$24,999
  3             $25,000-$50,000
  4             > $50,000

Slide 11: More “Truthful” Raw Data
[Table: the same rows with codes replaced by descriptions. Columns: Gender, Fam-Size, Pres-Add, Own-Rent, Income, educ, Employ, Cleanliness, Hours, Prices, Service, Overall-Imp, CCDebt, p(purchase). First row: male, 4, 0-1 Years, owns, $10,000-$24,999, Technical or trade school, 1, 2.00, 5.00, 5.00, 5.00, 8, 8369, 0.57. Cells without a usable code appear blank or as “unknown”.]

Slide 12: How do we translate the RawData
• Link to Supplied Variable Definitions Document
• Mapping to Excel Lookup Tables (example)

  Variable   Definition
  Location   Store identifier (1,2,3)
  Age        Customer Age (in years)
  Gender     Customer Gender
  Fam-Size   Family size (number of persons in household)
  Pres-Add   How long has the customer resided at the present address: 1=0-1 years, 2=2-5 years, 3=6-10 years, 4=11-20 years, 5=more than 20 years
  Own-Rent   Residence ownership status

  For HLOOKUP(…)
  1          2           3
  Oil City   Meadville   Erie

  1      2
  male   female

  For VLOOKUP(…)
  Year Code   Meaning
  1           0-1 Years
  2           2-5 years
  3           6-10 years
  4           11-20 years
  5           >20 years

  1   owns
  2   rents

Slide 13: Using VLOOKUP and HLOOKUP to tell the Truth!
• To Produce the Real Data
  – Copy the relevant text from the document into a blank sheet and build translation tables
    • Restructure to form translation tables
    • Put codes in the first column (vertical lookup) or first row (horizontal lookup)
    • Put translations for each code in the second column (vertical) or second row (horizontal)

  Horizontal Code Translation Table
  Store identifier (1,2,3)
  1          2           3
  Oil City   Meadville   Erie

  Vertical Translate Table
  Rent Code   Meaning
  1           owns
  2           rents

Slide 14: Using VLOOKUP and HLOOKUP to tell the Truth!
  – Make a copy of the encoded Data sheet
  – In the copy, replace the codes in the cells with formulas that locate the text description for the code represented.
  – For each cell in a coded category:
    • We use “IF” to bypass blank cells or cells whose codes mean that there is “no data” or that it is missing.
    • We use HLOOKUP or VLOOKUP to look up the code from the corresponding cell of the original data sheet in the translate table for the cell’s category.
    • The lookup function returns the text representing the code (as illustrated on slide 11 above).

Slide 15: HLOOKUP and LOOKUP Basics
• Syntax: HLOOKUP(<value to lookup>,<lookup table>,<row #>)
  – <value to lookup>: usually a relative cell reference
  – <lookup table>: a horizontally defined table with two or more rows, where the first row contains the set of values to match on. With the default (approximate) matching these values must be sorted in ascending alphanumeric order; passing FALSE as an optional fourth argument requests an exact match, and then no sorting is needed. The value to look up is compared to the values in the first row; if a match is found, the relative column number it is found in is noted.
  – <row #>: a row number counted from the table’s first row (the first row is 1); the return value is taken from this row, in the column where the match was found.

Slide 16: VLOOKUP and LOOKUP Basics
• Syntax: VLOOKUP(<value to lookup>,<lookup table>,<column #>)
  – <value to lookup>: usually a relative cell reference
  – <lookup table>: a vertically defined table with two or more columns, where the first column contains the set of values to match on. With the default (approximate) matching these values must be sorted in ascending alphanumeric order; passing FALSE as an optional fourth argument requests an exact match, and then no sorting is needed. The value to look up is compared to the values in the first column; if a match is found, the relative row number it is found in is noted.
  – <column #>: a column number counted from the table’s first column (the first column is 1); the return value is taken from this column, in the row where the match was found.

Slide 17: Avoiding or Detecting Function Failures
• Special conditional functions we will employ with “IF” statements
  – ISBLANK(…)
  – ISERR(…)
  – ISNUMBER(…)
  – ISNA(…)
  – AND(…)
  – OR(…)
  – NOT(…)

Slide 18: Consulting Discussion
• Before proceeding with a pre-planned program I want to open a discussion:
  – What is the nature of the data you typically work with? How does it differ from the course model data set?
  – What are you typically trying to learn from the data you get?
    • What is the nature of the outputs you are producing?
    • Why do you think you can get more leverage out of Excel™, given that you already use it?
  – What do your stakeholders want that is different from what you already produce?
  – Do you already know what you need help learning to do with Excel™? Let’s itemize these needs.

Slide 19: 3. Qualitative Analysis
• Descriptive Transformations (using Tables)
  – Converting either or both of the “RawData” and “DescriptiveData” sheets to data tables
  – Using Filters to explore data tables
• Using IF conditionals and formulas
  – To manage anomalies
  – To produce derived variables

Slide 20: Filtering out the “unknown” and “blank”

Slide 21: Part of the Filtered View
[Table: part of the filtered descriptive data. Columns shown: Pres-Add, Own-Rent, Income, educ, Employ, Cleanliness, Hours, Prices. First row: 0-1 Years, owns, $10,000-$24,999, Technical or trade school, 1, 2.00, 5.00, 5.00. The table’s Total row shows a count of 699.]

Slide 22: 4. Quantitative Analysis
• Planning and designing the required results and views
  – What questions are you trying to answer?
  – What views do your stakeholders need?
• Descriptive Summarization (on Data Tables)
  – Built-in summary and statistical functions
  – Built-in conditional statistical functions
• Exploring relationships using charting tools
• Descriptive Summarization (using Pivot Tables)

Slide 23: Exploring Graphically: Hunches, Ideas
• The next few slides depict plots from a much reduced data set (all unknowns and blanks removed)
  – Histogram of Education levels of participants
  – Scatter plot of Income vs. CC Debt
  – Scatter plot of Education level vs. CC Debt
  – Scatter plot of Age vs. CC Debt

Slide 24: Histogram from Data Analysis tools
Education Levels in reduced Sample

  Bin    Frequency
  1      33
  2      238
  3      76
  4      116
  5      80
  6      61
  7      56
  More   0

[Figure: “Histogram”; x-axis Bin (1 to 7, More), y-axis Frequency (0 to 250).]
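The reduced-sample histogram can be reproduced outside Excel as a sanity check. The sketch below is an illustration only, not the course's method: it rebuilds a list of education codes from the frequency table above, adds three blank entries (an assumption, standing in for rows the filter would drop), removes them, and counts the rest into bins. The binning rule used here, where a value falls in the first bin whose limit is greater than or equal to it and anything larger goes to “More”, is my understanding of how the Histogram tool counts.

```python
from bisect import bisect_left


def excel_style_histogram(values, bins):
    """Count values into sorted bins: each value goes to the first bin whose
    upper limit is >= the value; anything larger goes to 'More'."""
    counts = [0] * (len(bins) + 1)  # the last slot is the 'More' bucket
    for v in values:
        counts[bisect_left(bins, v)] += 1
    return counts


# Frequencies copied from the Bin/Frequency table (education code -> count).
reported = {1: 33, 2: 238, 3: 76, 4: 116, 5: 80, 6: 61, 7: 56}
educ_codes = [code for code, n in reported.items() for _ in range(n)]

# Assumption for illustration: three blank cells that filtering should remove.
educ_with_blanks = educ_codes + [None, None, None]
reduced = [v for v in educ_with_blanks if v is not None]

bins = [1, 2, 3, 4, 5, 6, 7]
counts = excel_style_histogram(reduced, bins)

for label, n in zip(bins + ["More"], counts):
    print(label, n)
print("Total", sum(counts))
```

The total of the reported frequencies is 660, which agrees with the number of observations reported by the regression output later in the deck.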
Slide 25: Scatter Plot of Income vs. Credit Card Debt
[Figure: scatter plot “Income and Credit Card debt (CCDebt)”; x-axis Income ($0 to $60,000), y-axis CCDebt ($0 to $25,000).]

Slide 26: More Exploring – Education and CCDebt
[Figure: scatter plot; x-axis Education Level Reported (1 to 8), y-axis CCDebt ($0 to $25,000).]

Slide 27: Exploring Age and CCDebt
[Figure: scatter plot “Age vs. CCDebt”; x-axis Age in Years (0 to 100), y-axis CCDebt ($0 to $25,000).]

Slide 28: Pivot Table Notes
• Pivot tables are used to:
  – Identify patterns in data
  – Produce alternate views of data
  – Summarize data within categories
  – Reorganize (“pivot”) data summaries
  – Expand or collapse views
  – Run queries over the categories

Slide 29: Pivot Table Notes
• Input requirements
  – Raw data table must have column headings
  – At least one column must have duplicate text values
    • e.g. cities, states, products, departments
    • These become the categories in the pivot tables
  – At least one column must have numeric values
  – No blank rows

Slide 30: Pivot Table Notes
• Problem Solving Forethoughts:
  – You need to have the target design of the pivot table you want sketched out!
    • What are the headings of the target pivot table?
    • What are the row headings?
    • What numeric summaries do you want (sums, averages, max, min, etc.)?
  – You need to anticipate the data transformations you will need.
    • Such as: Likert scales or numeric codes that really represent categorical descriptive information
    • e.g. when 1, 2, 3 correspond to income levels “under $10K”, “$10K to $30K”, and “greater than $30K”
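The input requirements and forethoughts above amount to a group-and-summarize operation: one column supplies the categories, another supplies the numbers. A minimal Python sketch of that idea follows; it is not how Excel implements pivot tables. The Location names come from the course's translation table, and the eight (Location, CCDebt) pairs are taken from the first rows of the sample data, assuming the flattened columns line up row by row.

```python
from collections import defaultdict

# Store identifier translation, as in the course's lookup table.
LOCATION_NAME = {1: "Oil City", 2: "Meadville", 3: "Erie"}

# (Location code, CCDebt) for the first eight rows of the sample data,
# assuming the Location and CCDebt columns line up row by row.
rows = [
    (3, 8369), (3, 5831), (3, 5831), (3, 5831),
    (1, 15468), (3, 15468), (3, 15468), (1, 13230),
]


def pivot_average(rows):
    """Group rows by translated category and average the numeric column."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for code, value in rows:
        # Translate the code first, so categories are descriptive labels.
        category = LOCATION_NAME.get(code, "unknown")
        totals[category] += value
        counts[category] += 1
    return {cat: totals[cat] / counts[cat] for cat in totals}


pivot = pivot_average(rows)
for category in sorted(pivot):
    print(f"{category}: {pivot[category]:,.0f}")
```

Translating codes before grouping mirrors the advice to anticipate data transformations: the row headings of the result are then store names rather than 1, 2, 3.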
Slide 31: Creating and Manipulating Pivot Tables
• Two approaches
  – Quick trial with the raw data
  – Pivot tables from translated data
• Operations to cover
  – Data selection
  – Selecting fields
  – Selecting Areas (values, rows, columns, filters)
  – Selecting variable summary calculation functions
  – Sorting and ordering fields and values in tables

Slide 32: Simple Pivot Table Example

  Location by Income      Average of CCDebt   Average of p(purchase)
  Erie                    $ 5,108             11.2%
    unknown               $ -                 5.5%
    <$10,000              $ 1,140             12.8%
    $10,000-$24,999       $ 2,142             11.5%
    $25,000-$50,000       $ 6,068             11.6%
    > $50,000             $ 8,729             11.2%
  Meadville               $ 5,247             9.7%
    unknown               $ -                 3.9%
    <$10,000              $ 1,181             12.6%
    $10,000-$24,999       $ 2,044             9.4%
    $25,000-$50,000       $ 6,189             10.7%
    > $50,000             $ 8,417             9.4%
  Oil City                $ 4,798             10.1%
    unknown               $ -                 4.8%
    <$10,000              $ 1,082             11.1%
    $10,000-$24,999       $ 2,104             10.6%
    $25,000-$50,000       $ 6,034             11.1%
    > $50,000             $ 8,166             10.0%
  Grand Total             $ 5,061             10.6%

Slide 33: A Second Pivot Table to Examine Probability of Purchase: is it related to “Overall Impression”?
Comparing Overall Impression with Probability of Purchase (using codes)
[Table: pivot table of Average of p(purchase). Rows: Location by Income (Erie, Meadville, Oil City, each broken down into unknown, <$10,000, $10,000-$24,999, $25,000-$50,000, > $50,000) plus Grand Total. Column Labels: Overall-Imp codes 1 to 10 plus Grand Total. Some cells are empty. The Grand Total column repeats the Average of p(purchase) values from slide 32 and ends at 10.6%.]

Slide 34: Same comparison using Descriptives from VLOOKUP
Store Impression compared with Income on p(purchase)
[Table: the same pivot table of Average of p(purchase), with the Overall-Imp codes 1 to 10 replaced by descriptive column labels running from “very very very poor” through “neutral” to the positive ratings. Rows: Location by Income plus Grand Total. Some cells are empty. The overall Grand Total is 10.4%.]

Slide 35: Data Mining and Statistical Inference
Problems using inferential statistics with survey data!
• Direct Approach: What question is being addressed?
  – Assumes you have a hunch or hypothesis
  – Do one or more variables (the independents) predict another (the dependent)?
  – Using Regression Analysis
• Indirect Approach: Looking for covariates!
  – You’re just fishing!
  – You might just catch a “bottom fish!”
• Limitations using Excel™ with large data sets

Slide 36: Correlation: Exploring Age, Income, Education and Credit Card Debt
Input: Age, Income, Education, CCDebt
Function: Excel™ Data Analysis – Correlation
Output: Correlation

           Age     Income   educ   CCDebt
  Age      1
  Income   -0.33   1
  educ     -0.16   0.27     1
  CCDebt   -0.71   0.75     0.20   1

Slide 37: A Regression Example: Income and CCDebt
Input: Income (independent variable), CCDebt (dependent variable)
Function: Excel™ Data Analysis – Regression (the tool takes a single dependent variable; its X range may hold one or more independent variables, and this example uses one)
Output:

  Regression Statistics
  Multiple R          0.75033731
  R Square            0.563006078
  Adjusted R Square   0.562341954
  Standard Error      2306.625707
  Observations        660

  ANOVA
               df    SS         MS         F          Significance F
  Regression   1     4.51E+09   4.51E+09   847.7418   2.1543E-120
  Residual     658   3.5E+09    5320522
  Total        659   8.01E+09

Slide 38: Plot of Actual vs. Predicted CCDebt using Income
[Figure: “X Variable 1 Line Fit Plot”; x-axis Income (category labels $10,000, $17,500, $37,500, $50,000), y-axis from $0 to $25,000; series Y (actual CCDebt) and Predicted Y.]

Slide 39: 5. Synthesizing Qualitative and Quantitative Results (Time dependent)
• Methods for bringing tables, charts and graphs into MS Word™ documents
  – Paste special
  – Paste Link
• Course Wrap-up
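The figures in the Income and CCDebt regression output are tied together by standard identities, so they can be cross-checked with a few lines of arithmetic. The sketch below does that in Python using only numbers printed in that output. The identities are the usual ones for least-squares regression with one predictor; the variable names are mine, and the mean square for the residual is used as printed, which appears to be rounded.

```python
import math

# Values copied from the regression output (Income predicting CCDebt).
multiple_r = 0.75033731
r_square = 0.563006078
adj_r_square = 0.562341954
std_error = 2306.625707
f_statistic = 847.7418
n_obs = 660
k = 1                  # number of independent variables
ms_residual = 5320522  # as printed in the ANOVA table (apparently rounded)

# Residual degrees of freedom: n - k - 1.
df_residual = n_obs - k - 1

# R Square is the square of Multiple R.
r2_from_r = multiple_r ** 2

# Adjusted R Square penalizes R Square for the number of predictors.
adj_from_r2 = 1 - (1 - r_square) * (n_obs - 1) / df_residual

# Standard Error is the square root of the residual mean square.
se_from_ms = math.sqrt(ms_residual)

# F can be recovered from R Square and the degrees of freedom.
f_from_r2 = (r_square / k) / ((1 - r_square) / df_residual)

print(df_residual, round(r2_from_r, 6), round(adj_from_r2, 6))
print(round(se_from_ms, 2), round(f_from_r2, 2))
```

Multiple R also matches the Income and CCDebt entry (0.75) in the correlation matrix, as expected when there is a single predictor, and R Square says that Income accounts for about 56% of the variance in CCDebt in this reduced sample.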