• Once the data starts to flow, our attention turns to data analysis
– Data preparation – includes editing, coding and data entry
– Exploring, displaying and examining data – the search for meaningful patterns
– Data mining – used to extract patterns and predictive trends from databases
• Checking entries for correctness, consistency
• Coding – assigning numbers
• Data entry – spreadsheet, data editor of a statistical program or database
• You could move directly into the statistical analysis …
• When the study’s purpose is not the production of causal inferences, confirmatory data analysis is not required
• When it is, you should discover as much as possible about the data before selecting the appropriate means of confirmation
• Set of techniques
• The flexibility to respond to the patterns revealed by successive iterations in the discovery process is an important attribute
• EDA can be compared to the role of the police detectives and other investigators
• Confirmatory analysis can be compared to the role of the judge
• The former are involved in the search for clues the latter are preoccupied with evaluating the strength
• Free to take many paths in revealing mysteries in the data
• Emphasizes visual representations and graphical techniques over summary statistics
• Summary statistics , may obscure, conceal the underlying structure of the data
• When numerical summaries are used exclusively and accepted without visual inspection, the selection of confirmatory modes may be based on flawed assumptions and may produce erroneous conclusions
• Frequency Tables
• Bar Charts
• Pie Charts
• Information
– Displays the data from the lowest value to the highest
– Columns for percent
– Percent adjusted for missing values
– Cumulative percent
Value Label Value Frequency % Valid % Cum. %
Chemicals 1 10 10.0 10.0 10.0
Consumer Products 2 8 8.0 8.0 18.0
Durables 3 7 7.0 7.0 25.0
Energy 4 13 13.0 13.0 38.0
Financial 5 24 24.0 24.0 62.0
Health 6 4 4.0 4.0 66.0
High-Tech 7 11 11.0 11.0 77.0
Insurance 8 6 6.0 6.0 83.0
Retailing 9 7 7.0 7.0 90.0
Other 10 10 10.0 10.0 100.0
Total 100 100.0 100.0
Valid Cases 100 Missing Cases 0
30
25
20
15
10
5
0
Chemicals Consumer Durables Energy Financial Health High-tech Insurance Retailing Other
Series1
Chemicals
Consumer
Durables
Energy
Financial
Health
High-tech
Insurance
Retailing
Other
• The values and percentages are more readily understood in graphic format.
• The relative sizes of the sectors can be visualized with the bar and pie
Row Value Freq. % Cum.% Row Value Freq. % Cum.% Row Value Freq. % Cum.%
1 54.9 1 2 2
2 55.4 1 2 4 14 62.6 1 2 30
3 55.6 1 2 6
4 56.4 1 2 8
15 64.8 1 2 32
16 66.0 2 4 36
5 56.8 1 2 10
6 56.9 1 2 12
7 57.8 1 2 14
8 58.1 1 2 16
17 66.3 1 2 38
18 67.6 1 2 40
19 69.1 1 2 42
20 69.2 1 2 44
9 58.2 1 2 18 21 70.5 1 2 46
10 58.3 1 2 20 22 72.7 1 2 48
11 58.5 1 2 22
12 59.2 2 4 26
23 72.9 1 2 50
24 73.5 1 2 52
• The last chart was not informative
• Primary contribution was an ordered list of values
• If converted to a bar chart, it would have 48 bars of equal length and two bars with two occurrences
• A pie chart would also be pointless
• Notice that when the variable of interest is measured on an interval-ration scale and is one of many potential values, these techniques are not particularly informative
• Conventional solution for display of interval-ratio data
• Group the variable’s values into intervals
• Useful
– Displaying all intervals in a distribution even those without observed values
– Examining the shape of the distribution for skewness, kurtosis and the modal pattern
• Questions to ask
– Is there a single hump?
– Are subgroups identifiable when multiple modes are present?
– Are straggling data values detached from the central concentration?
15
10
25
20
5
0
1 2 3 4 5 6 7 8 9
Series1
• Intervals with 0 counts show gaps in the data and alert the analyst to look for problems with spread
• There are two extreme values
• Along with the peaked midpoint and reduced number of observations in the upper tail, this histogram warns us of irregularities in the data.
• Closely related to the histogram
• Shares features but offers unique advantages
• Easy to construct by hand for small samples
• In contrast to histograms which lose information by grouping values into intervals, actual data can be inspected directly
• Range of data is apparent at a glance
• Also shape and spread impressions immediate
• To develop, the first digit of each data item are arranged to the left of a vertical line.
• Each row is referred to as a stem and each piece of information leaf
5 0 2 2 3 5 6 7 8
6 4 5 5 6 6 6 7 8 8 8 8 9 9
7 0 2 2 6 8
8
9 2 4
10 0 1 8
11 3
12 1
13 1
14 0 6
15 3
16 3 6
17
18 3
19
20 6
21 8
• Another technique for exploratory data analysis
• Boxplot reduces the detail of the stem-and-leaf display and provides a different visual image of the distribution’s location, spread, shape, tail length, and outliers
• Summary consists of the median, upper and lower quartiles, and the largest and smallest observations.
• The median and quartiles are used because they are particularly resistant statistics.
• Example: data set = [5,6,6,7,7,7,8,8,9]
• The mean is 7 and the standard deviation 1.23
• Replace the 9 with 90 and the mean becomes 16 and the standard deviation 27.78.
• Changing only one of the nine values has disturbed the location and spread summaries to the point where they no longer represent the other eight values. Both mean and standard deviation are considered nonresistant statistics
• The median remained at 7 and the lower and upper quartiles stayed at 6 and 8, respectively.
• Rectangular plot that encompasses 50 percent of the data values
• A center line ( or other notation) marking the median and going through the width of the box
• The edges of the box are called hinges
• The whiskers that extend from the right and left hinges to the largest and smallest values
Outside
Value
Or outlier
Smallest observed value within
1.5 IQR of lower hinge
Whiskers
Median
Largest observed value within
1.5 IQR of upper hinge
Outside
Value
Or outlier
Extreme
Or far
Outside value
1.5IQR
IQR
Outer fence
Lower hinge
Minus 3(IQR)
Inner fence
Lower hinge
Minus 1.5(IQR)
50% of observed
Values are within the box
1.5IQR
Inner fence
1.5(IQR) plus
Upper hinge
Outer fence
3(IQR) plus
Upper hinge
Minimum = 54.9
Lower hinge = 60.3
Median = 74.55
Upper hinge = 111.52
Maximum = 218.2
IQR = 111.52 – 60.3 = 51.22
.5 (IQR) = 25.61
Inner fence lower hinge = 60.3 – (51.22+25.61) = -16.53
Inner fence upper hinge = 111.52 + (51.22+25.61) = 188.35
The smallest and largest values from the distribution within the fences are used to determine the whisker length
• In preliminary analysis, it is important to separate legitimate outliers from errors in measurement, editing, coding and data entry
• Outliers that are mistakes should be corrected or removed
Symmetric
Right Skewed
Left Skewed
Small Spread
• Gain insight into the data
• More common ways of summarizing location, spread, and shape
• Used resistant statistics
• From these we could make decisions on test selection and whether the data should be transformed or reexpressed before further analysis