Statistics - New York University

Statistics and Data Analysis
Professor William Greene
Stern School of Business
IOMS Department
Department of Economics
1. Data Presentation
Statistics and Data Analysis
Part 1 – Data Presentation
Telling the story statistically
Samples are surprisingly small
> 1010 Observations
> Telephone sample
> Sampling error
What Does it Mean?
Slightly more than one-third of Americans have a favorable opinion of
the Democratic-led Congress, a poll said Wednesday.
The Pew Research Center for the People & the Press said the 37%
expressing a positive opinion represents a decline of 13 points since
April.
The favorable percentage is one of the lowest in more than two decades
of Pew surveys – if not the lowest, the poll said. The previous low was
40% in January, but the result is not statistically significant because of
the margin of error.
(USA Today)
We will develop the idea of the “margin of error” and how it is computed.
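As a preview of that computation, here is a minimal sketch. It assumes simple random sampling and the usual 95% normal approximation, neither of which is stated in the poll excerpt. The sample size of 1010 is borrowed from the earlier "Samples are surprisingly small" slide and may not be this poll's actual sample size. The function name `margin_of_error` is illustrative.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate margin of error for a sample proportion.

    Assumes simple random sampling and a normal approximation;
    z = 1.96 is the usual multiplier for 95% confidence.
    """
    return z * math.sqrt(p * (1.0 - p) / n)

# Assumed sample size (taken from the earlier slide, not from the poll excerpt).
n = 1010
moe_37 = margin_of_error(0.37, n)
moe_worst = margin_of_error(0.50, n)  # p = 0.5 gives the largest margin

print(f"margin at 37%: +/- {100 * moe_37:.1f} points")
print(f"worst case:    +/- {100 * moe_worst:.1f} points")
```

Under these assumptions each estimate carries a margin of roughly 3 percentage points, so the gap between 37% and the previous low of 40% is about the size that sampling error alone could produce.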
Really?
The following was taken from
http://www.msnbc.msn.com/id/27339545/
An msnbc.com guide to presidential polls
Why results, samples and methodology vary from survey to survey
WASHINGTON - A poll is a small sample of some larger number, an
estimate of something about that larger number. For instance, what
percentage of people reports that they will cast their ballots for a particular
candidate in an election? A sample reflects the larger number from
which it is drawn. Let’s say you had a perfectly mixed barrel of 1,000
tennis balls, of which 700 are white and 300 orange. You do your sample
by scooping up just 50 of those tennis balls. If your barrel was perfectly
mixed, you wouldn’t need to count all 1,000 tennis balls — your
sample would tell you that 30 percent of the balls were orange.
Correction: your sample might tell you that approximately 30 percent of the
balls were orange.
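The word "approximately" can be checked by simulation. The sketch below repeats the 50-ball scoop many times from a barrel of 700 white and 300 orange balls. The number of repetitions, the fixed seed, and the helper name `count_orange` are illustrative choices, not part of the original example.

```python
import random

def count_orange(rng, sample_size=50, n_white=700, n_orange=300):
    """Scoop balls without replacement and count the orange ones."""
    barrel = ["white"] * n_white + ["orange"] * n_orange
    scoop = rng.sample(barrel, sample_size)
    return scoop.count("orange")

rng = random.Random(1)  # fixed seed so the run is repeatable
n_scoops = 2000         # illustrative number of repeated scoops
counts = [count_orange(rng) for _ in range(n_scoops)]

mean_share = sum(counts) / (50 * n_scoops)
exact_share = counts.count(15) / n_scoops  # 15 of 50 is exactly 30%
lowest, highest = min(counts) / 50, max(counts) / 50

print(f"average orange share over {n_scoops} scoops: {mean_share:.3f}")
print(f"lowest and highest share seen: {lowest:.2f}, {highest:.2f}")
print(f"fraction of scoops giving exactly 30%: {exact_share:.2f}")
```

The shares average out close to 30 percent, but individual scoops scatter around that value and only a minority land on it exactly.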
The Visual Data Do Tell the Story:
Napoleon’s March to and from Moscow
Informative Data Table
Life Expectancy: Highest 15 Countries, 2010
Disability Adjusted Life Expectancy



A Dynamic Picture
Bar Charts vs. Data Tables
Probability of Survival to Age 50, Female at Birth
U.S. and 20 Other Wealthy Countries
It is possible to be misled by a presentation such as this one. Note the vertical axis.
What does this graph tell you? What do the probabilities mean? Are the differences meaningful?
Does living longer make people happier? Or do people
live longer because they are happier?
Does the Picture Tell the Story?
This is the only graphic in the article. The article compares default rates on
VA vs. FHA mortgages. Is there anything wrong with this picture? The very
technical-looking graph/table is unrelated to the article.
New York Times, Page RE1, July 24, 2014
Data Presentation Agenda
• Data Types: Cross Section and Time Series
• Summarizing Data Graphically
  • Pie chart, bar chart
  • Box plot, histogram
• Summarizing Data with Descriptive Statistics
  • Central tendency
  • Spread
  • Distribution (shape)
Data = A Set of Facts
A picture of some aspect of the world
Pizza Sales by Type
What do the data tell you?
How can you use the information?
What additional information would make these data (more) informative?
Data Types and Measurement

Quantitative
• Discrete = count: Number of car accidents by city by time
• Continuous = quantitative measurement: Housing prices

Qualitative
• Categorical: Shopping mall, car brand, trip mode
• Ordinal: Survey data on attitudes; “How do you feel about…?”
  Strongly disagree → Disagree → Neutral → Agree → Strongly agree
  Moody’s bond ratings: Aaa, Aa, A, Baa, Ba, B, and so on.

Frameworks
• Cross section
• Time series
Discrete, Count Data, Time Series
Continuous Quantitative Data
Housing Prices and Incomes
Unordered Qualitative Data
Travel Mode Between Sydney and
Melbourne by 210 Travelers
Ordered Qualitative Data
German Health Satisfaction Survey; 27,326 individuals. On a
scale from 0 to 10, how do you feel about your health?
Aggregated Data May Be Easier to Understand
Bad (0-3)
Fair (4-6)
Good (7-8)
Excellent (9-10)
Ordered Qualitative Outcomes
Bond Ratings
Movie Ratings
Arithmetic Mean may not be meaningful.
(a) Ordinal measure – rankings
(b) Look at that distribution!
A Problem with Ordered Survey Response Data
61 Stern Students’ Ranking of Subway Safety (1994)*
Safety  Count  Percent  Cum Pct  Label
1       17     27.87    27.87    Very Unsatisfactory
2       15     24.59    52.46    Unsatisfactory
3       17     27.87    80.33    OK
4       10     16.39    96.72    Satisfactory
5        2      3.28   100.00    Very Satisfactory
There is no objective meaning to “3” on some standard scale.
Does everyone’s “1” or “2” or “3” … mean the same thing?
* Jeff Simonoff: Data Presentation and Summary, pp. 3-4
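The percent and cumulative percent columns follow directly from the counts. The sketch below rebuilds them and reads off the median rating, a summary that, unlike the arithmetic mean, uses only the ordering of the categories. The variable names are illustrative.

```python
labels = ["Very Unsatisfactory", "Unsatisfactory", "OK",
          "Satisfactory", "Very Satisfactory"]
counts = [17, 15, 17, 10, 2]  # from the slide, ratings 1 through 5
total = sum(counts)

rows = []
running = 0
for rating, (label, count) in enumerate(zip(labels, counts), start=1):
    running += count
    # (rating, count, percent, cumulative percent, label)
    rows.append((rating, count, 100 * count / total, 100 * running / total, label))

print(f"{'Safety':>6} {'Count':>5} {'Percent':>8} {'Cum Pct':>8}  Label")
for rating, count, pct, cum, label in rows:
    print(f"{rating:>6} {count:>5} {pct:>8.2f} {cum:>8.2f}  {label}")

# Median rating: the first rating whose cumulative percent reaches 50.
median_rating = next(r[0] for r in rows if r[3] >= 50)
print("median rating:", median_rating)
```

The median falls in category 2 (Unsatisfactory). That statement depends only on the order of the responses, not on treating the labels 1 to 5 as equally spaced numbers.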
Cross Section Data
Housing Prices and Incomes
Time Series Data: Oil Price
Graph is much more useful and informative than a table for time series data.
Representing Data
• In raw form
• Transformed to a visual form
• Summarized graphically
• Summarized statistically
Pie Chart vs. Frequency Table
Pizza Pies Sold, by Type
[Figure: Pie Chart of Percent vs Type]
Category              Percent
Pepperoni             21.8%
Plain                 32.5%
Mushroom              16.2%
Sausage                5.8%
Pepper and Onion       7.3%
Mushroom and Onion     9.2%
Garlic                 2.3%
Meatball               5.0%
Same Information. Which is more useful for your audience?
Data Representation:
Bar Chart vs. Pie Chart
[Figure, BAR CHART: Chart of Number vs Type; vertical axis: Number (0 to 4000); horizontal axis: Type]
[Figure, PIE CHART: Pie Chart of Percent vs Type; same eight pizza types]
Same data. Which is easier to understand?
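One way to compare the two displays is to put the shares in ranked order, which a bar chart makes easy and a pie chart does not. The sketch below prints a ranked text bar chart from the percentages shown in the pie chart. The Meatball and Garlic labels overlap in the extracted figure, so assigning 5.0% to Meatball and 2.3% to Garlic is an assumption.

```python
# Percent of pizzas sold by type, as read from the pie chart labels.
# Assumption: Meatball = 5.0 and Garlic = 2.3 (the labels overlap in the source).
shares = {
    "Pepperoni": 21.8,
    "Plain": 32.5,
    "Mushroom": 16.2,
    "Sausage": 5.8,
    "Pepper and Onion": 7.3,
    "Mushroom and Onion": 9.2,
    "Garlic": 2.3,
    "Meatball": 5.0,
}

# Rank from largest to smallest share.
ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)

for name, pct in ranked:
    bar = "#" * round(pct)  # one character per percentage point
    print(f"{name:<20}{pct:5.1f}%  {bar}")

# The printed percentages are rounded, so they need not sum to exactly 100.
print(f"total: {sum(shares.values()):.1f}%")
```

In the ranked bars the ordering of the small categories is immediate, whereas in a pie chart slices of 5.0%, 5.8% and 7.3% are hard to tell apart by eye.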
Table vs. Bar Chart (or both)
2013 data. Source: Bloomberg
2013 Valuation of U.S. Sports Teams
These figures reveal a league strategy.
Football
Baseball
A Box Plot Describes the Distribution of Values in a Set of Data
[Figure: Box and Whisker Plot for House Price Listings. Average House Listing Price by State; vertical axis: Listing (100000 to 900000); Hawaii is labeled near the top]
Raw Data on Housing Prices and Incomes
Making a Box Plot for Per Capita Income
Maximum = 31136
3rd Quartile = 24933
Median = 22610
1st Quartile = 21677
Minimum = 17043
Interquartile Range = IQR = 24933 - 21677 = 3256
Box and Whisker Plot
What is an outlier? Why do we believe a particular point is an outlier?
Diagram labels, from top to bottom:
• Outliers = extreme observations
• Upper whisker: smaller of (Maximum, 75th Percentile + 1.5 IQR)
• 75th Percentile
• Interquartile range = IQR
• Median
• 25th Percentile
• Lower whisker: larger of (Minimum, 25th Percentile – 1.5 IQR)
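The whisker rule can be worked through with the per capita income summary from the previous slide. This is a sketch under one stated assumption: it uses the common (Tukey) convention, which measures the 1.5 IQR distance from the quartiles. The variable names are illustrative.

```python
# Five-number summary for per capita income, as given on the previous slide.
minimum, q1, median, q3, maximum = 17043, 21677, 22610, 24933, 31136

iqr = q3 - q1

# Assumed convention (Tukey): fences lie 1.5 IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whisker ends: capped at the fence or at the data extreme, whichever is closer in.
# Plotting software usually stops the whisker at the most extreme observation
# inside the fence; without the raw data, the fence itself is used here.
upper_whisker = min(maximum, upper_fence)
lower_whisker = max(minimum, lower_fence)

has_high_outlier = maximum > upper_fence
has_low_outlier = minimum < lower_fence

print("IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("whisker ends:", lower_whisker, upper_whisker)
print("outlier above?", has_high_outlier, " outlier below?", has_low_outlier)
```

With these numbers the maximum lies beyond the upper fence, so it would be drawn as a separate point, while the minimum lies inside the lower fence and simply ends the lower whisker.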
Histogram for House Price Listings
[Figure: Histogram of Listing; vertical axis: Frequency (0 to 14); horizontal axis: Listing (200000 to 900000)]
A histogram describes the sample data and suggests the nature of the
underlying data generating process. Note the “skewness” of the distribution
of listings.
Distribution of House Price Listings
[Figures: Histogram of Listing, shown beside the box plot of Average House Listing Price by State]
Asymmetry (skewness) in the histogram of listing prices shows up in the box
and whisker plot. Note the long whisker at the top of the figure.
House Price Listings and Per Capita Incomes. States.
Regression and Correlation. Are these two variables correlated?
r = .48
How to describe/summarize them.
How to explain the variation across states.
How to determine if there is any correlation between the two variables.
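For reference, the correlation coefficient r can be computed as in the sketch below. The state data behind r = .48 are not reproduced in these slides, so the numbers in the example are hypothetical and serve only to show the calculation. The function name `pearson_r` is illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values for illustration only; NOT the state data from the slide.
income = [17.0, 21.5, 22.5, 25.0, 31.0]        # per capita income, $000
listing = [150.0, 210.0, 180.0, 320.0, 260.0]  # listing price, $000

r = pearson_r(income, listing)
print(f"r = {r:.2f}")
```

A value of r near +1 or -1 indicates a nearly linear relationship, and a value near 0 indicates little linear association. Whether a moderate value such as .48 reflects a real relationship is the question the slide raises.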
Big Data: Netflix Cinematch Rating/Recommendation System
Summary


What story does the data presentation tell?
• Data in raw form tell no story.
• Visual representation of data tells something about the data.
• The representation of the data may reveal something about the underlying process that the data measure.
What tool is most informative?
• Reduction to a small number of features
• Visual displays of data
• Data Table – Organizing the data is often a good start.
• Pie chart
• Box and whisker plots
• Bar charts
• Histograms
• Time series plots
“There are lies, damned lies and statistics.” (Benjamin Disraeli)