The Nature of Statistics

The Nature of Statistics:
The art of learning about and understanding our
world through data.
Essentials: The Nature of Statistics
(a.k.a: The bare minimum I should take along from this topic.)
• Definitions and relationships as presented on the Anatomy of
the Basics: Statistical Terms and Relationships sheet
• Identification of variables and their characteristics
• Careful review of data and their presentation
• Providing a context for the data
• Why use percentages rather than numeric counts when
making comparisons
69.2 80 35,000
• What do you know about these numbers?
• What do they mean to you?
• What is missing?
Okay, so What is Statistics?
(or is that What ARE Statistics?)
Statistics is the study of how to collect, organize, analyze,
interpret and report numerical information in order to
make decisions.
Statistics are the numeric data we use to better understand
our world. They may take the form of frequencies, means,
percentages, variances, etc.
What is a Study?
• 3 Types:
• Observational – observe and measure; can
identify association, not causation.
• Experimentation – impose treatment and
observe characteristics; can help establish causation.
• Simulation – using computers to simulate
situations that are not practical to do in real time.
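A simulation study can be sketched in a few lines of Python. The example below estimates the chance that two fair dice sum to 7 by simulating many rolls and compares it with the exact value of 1/6. The dice scenario, the function name estimate_sum_seven, and the seed are illustrative assumptions, not part of the original slides.

```python
import random

def estimate_sum_seven(n_rolls: int, seed: int = 1) -> float:
    """Estimate P(sum of two fair dice == 7) by simulation."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(n_rolls):
        if rng.randint(1, 6) + rng.randint(1, 6) == 7:
            hits += 1
    return hits / n_rolls

estimate = estimate_sum_seven(100_000)
print(f"Simulated: {estimate:.3f}  Exact: {1/6:.3f}")
```

More simulated rolls generally bring the estimate closer to the exact value.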
Basic Terminology
• DATA: Numbers with a context, i.e. numbers with
meaning.
– Examples: not 48.2, but 48.2 kg; not 5.23, but 5.23 inches.
• VARIABLE: A characteristic or property of an individual
population unit that varies from one person or thing to
another.
– Examples: age, square footage, and assessed value represent
three variables associated with homes in Oneonta.
– Variables have Values. Example: The variable hair color has the
values of brown, blonde, red, etc.
• UNIT (Element): Any individual member of the defined
population.
– Examples: Each bottle of soda in a production run is a unit; each
penny in a roll of pennies is a unit; each person enrolled in a class
is a unit.
Data: One variable (here unidentified, i.e. no context), multiple values
(Figure: “Raw” data (N=160) beside “organized” raw data (N=160); each number is one unit, and there are 73 “different” numbers.)
Time Period Otsego Lake was Frozen (days): Raw Data and Grouped Data
(Figure: histogram “Otsego Lake: Days Frozen, 1849-50 to 2009-10”; x-axis: Days, in classes 0-24, 25-49, 50-74, 75-99, 100-124, 125-149; y-axis: Frequency, 0 to 70.)
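Grouping raw data into classes of width 25, as in the histogram's 0-24, 25-49, ... classes, might be sketched as follows. The days_frozen values are hypothetical placeholders, not the actual Otsego Lake record, and the function name group_into_classes is an assumption made for illustration.

```python
from collections import Counter

def group_into_classes(values, width=25):
    """Count how many values fall in each class: 0-24, 25-49, ..."""
    counts = Counter((v // width) * width for v in values)
    return {f"{low}-{low + width - 1}": counts[low] for low in sorted(counts)}

# Hypothetical days-frozen values, NOT the actual Otsego Lake record.
days_frozen = [12, 48, 67, 73, 88, 91, 95, 102, 110, 131]
grouped = group_into_classes(days_frozen)
for label, freq in grouped.items():
    print(f"{label:>8}: {freq}")
```

Each printed frequency corresponds to the height of one histogram bar.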
Data: Two Variables: year and days; multiple values
(Figure: Time Period Otsego Lake was Frozen: Mean Days/Decade)
So is the Greenhouse Effect at work here?
To be studied through further statistical analysis, such as the use of ANOVA…
Anatomy of the Basics: Statistical Terms and Relationships
Statistics is the study of how to collect, organize, analyze, interpret and report
numerical information.
• Descriptive Statistics: methods for organizing and summarizing information.
E.g. Number of students in this class by major, baseball standings, housing sales by month.
• Inferential Statistics: methods for drawing conclusions and measuring the reliability
of those conclusions using sample results. E.g. Political views of all 4-year college students.
Population vs. Sample
• Population: all individuals, items, or objects whose characteristics are being studied.
– Parameter: numerical characteristic of a population.
– Census: data collected from ALL members of the population.
• Sample: a portion of the population selected for study.
– Statistic: numerical characteristic of a sample.
Variable: a characteristic or property of an individual unit. Variables have values.
• Qualitative: a variable that cannot be measured numerically. E.g. Gender, eye color.
• Quantitative: a variable that can be measured numerically.
E.g. Income, height, number of siblings one has.
– Discrete: a variable whose values are countable. It can only assume certain values,
with no intermediate values. E.g. Number of auto accidents in Oneonta in 1998.
– Continuous: a variable that can assume any numerical value over an interval or
intervals. E.g. Time.
Scaling of Variables (Measurement Levels)
• No Arithmetic Operations: individual observations can only be categorized or ranked.
– Nominal: grouping individual observations into qualitative categories or classes.
E.g. Grouping individuals by whether they are left-handed or right-handed.
– Ordinal: individual observations are assigned a number or “ranking.” There is a sense
of “more than,” but you cannot say “how much” more than. E.g. Military ranks.
• Arithmetic Operations: individual observations have meaningful numeric values.
– Interval: variables have no true zero point. Can say how much more, but not how many
times more. E.g. Temperature (°F or °C), IQ scores.
– Ratio: variables have a true zero point. Can say how many times more. E.g. Weight, height.
Population Basic Terminology
• POPULATION:
– Complete collection of all elements or units (usually people,
objects, transactions, or events) that we are interested in
studying.
– In terms of data, a population is the collection of all outcomes,
responses, measurements, or counts that are of interest.
– CENSUS: A complete enumeration (or accounting) of the
population (i.e. collecting data from every element (or unit) in the
population).
– PARAMETER: A numeric value associated with a population.
Example: the average height of ALL students in this class, given that the
class has been defined as a population.
Sample Basic Terminology
• SAMPLE: A subset of a population from which information is collected.
– Example: 25 cans of corn (sample) randomly obtained from a full day’s production
(population)
• STATISTIC: A numeric value associated with a sample.
– Example: the average height of 10 individuals randomly selected from the class (defined
population).
• INFERENCE: An estimate, prediction, or some other generalization
about a population based on information contained in a sample.
– Example: Based upon a randomly selected sample of 25 flights at JFK International
Airport (the sample; individual flights are units) taken from all flights on Dec. 24,
2009 (defined population), we can state with a degree of confidence that the mean
delay for the population of the day’s flights was 35 minutes (sample statistic in
context being inferred to the population).
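The parameter/statistic distinction can be illustrated with a small Python sketch. The class of 40 students and their heights are hypothetical, randomly generated values, and the seed is an assumption chosen only to make the run repeatable.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed for a repeatable illustration

# Hypothetical population: heights (inches) of all 40 students in a class.
population = [round(rng.gauss(67, 3), 1) for _ in range(40)]

# Parameter: computed from every unit in the population (a census).
parameter = statistics.mean(population)

# Statistic: computed from a random sample of 10 units.
sample = rng.sample(population, 10)
statistic = statistics.mean(sample)

print(f"Population mean (parameter): {parameter:.2f}")
print(f"Sample mean (statistic):     {statistic:.2f}")
```

The statistic will usually differ somewhat from the parameter; inferential statistics is about quantifying how far off it is likely to be.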
In Summary
To include ALL units, you are looking at:
• POPULATION
• CENSUS
• PARAMETERS
To work with a subset of all units, you are looking at:
• SAMPLE
• STATISTICS
• INFERENCES to a population
(Diagram: a Parameter describes the Population; a Statistic describes the Sample.)
Example: Identifying Data Sets
In a recent survey, 1708 adults in the United States were asked if
they think global warming is a problem that requires immediate
government action. Nine hundred thirty-nine of the adults said
yes.
Describe the data set. Identify:
The population:
The sample:
A variable being studied:
Values of the Variable:
Source: Adapted from Pew Research Center; Larson/Farber 4th ed.
Examples: Populations & Samples
• Smoking: Identify the population and the sample.
– In a survey, 250 college students at Union College were asked if they
smoked cigarettes regularly. Thirty-five of the students said yes.
• Student Income: Decide whether the numerical value describes a
population parameter or a sample statistic.
– A survey of 450 Cornell University students reported that their average weekly income
from part-time employment was $325.
• For both of the above studies:
– What are the units of the population/sample?
– Identify a variable being studied.
– Identify values of the variable.
Descriptive Statistics:
• DESCRIPTIVE STATISTICS: Organize and summarize
information using numerical and graphical methods.
– Examples:
• Summarizing the age of cars driven by students in a frequency table.
• Graphing the ages of students.
• Identifying the mean speed of cars driving in a 30 mph zone.
• A descriptive statement describes some aspect of the data.
(Select a statistical measure and put it into sentence format.)
– Examples:
• Thirty-eight percent of the orange trees suffered damage due to the cold
temperatures.
• The average weight for the 23 cars studied was 2,738 lb.
• The mean number of days Otsego Lake was frozen per winter was 88.69 days.
Descriptive Statistics at Work:
SUNY Oneonta Car Registrations
Numeric tables, pictures (graphs & charts), and text are three methods used to present data.
During the 2006 year there were 1,346 cars registered at SUNY Oneonta. Car registrations contain
many variables, such as car type, car color, year of car, and license plate number. Noted below are
ways descriptive statistics are used to convey information about the selected variables: a frequency
table of Registrant Type (i.e. who registered the car); a graphic presentation of Vehicle Age; and
text (written descriptive statement) presenting the mean Vehicle Age, of the registered cars.
Frequency Table: Registrant Type

Registrant Type   Frequency   Percent   Valid Percent   Cumulative Percent
Commuter                512      38.0            38.0                 38.0
Faculty                 223      16.6            16.6                 54.6
Management               13       1.0             1.0                 55.6
Other                    58       4.3             4.3                 59.9
Resident                287      21.3            21.3                 81.2
Staff                   253      18.8            18.8                100.0
Total                  1346     100.0           100.0

Graphic presentation (here a Histogram): Vehicle Age (figure not reproduced)
Mean & Median: The Mean age of cars driven by
students was 7.45 years (vs. 6.19 yrs. for employees).
The Median age of registered vehicles for students
was 7.0 years (5.0 years for employees).
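The Percent and Cumulative Percent columns of the Registrant Type table can be recomputed from the frequencies alone. The sketch below is one way to do so in Python; the variable names are assumptions made for illustration, and the counts are copied from the slide.

```python
# Registrant counts copied from the Registrant Type frequency table.
counts = {
    "Commuter": 512, "Faculty": 223, "Management": 13,
    "Other": 58, "Resident": 287, "Staff": 253,
}
total = sum(counts.values())

rows = []
running = 0
for category, freq in counts.items():
    running += freq  # running total of frequencies so far
    rows.append((category, freq, 100 * freq / total, 100 * running / total))

for category, freq, pct, cum_pct in rows:
    print(f"{category:<12}{freq:>6}{pct:>8.1f}{cum_pct:>8.1f}")
print(f"{'Total':<12}{total:>6}{100.0:>8.1f}")
```

Computing the cumulative column from the running count, rather than by adding rounded percents, avoids accumulating rounding error.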
Inferential Statistics:
• INFERENTIAL STATISTICS uses sample data to make
estimates, decisions, predictions, or other generalizations
about the population.
– The aim of inferential statistics is to make an inference about a
population, based on a sample (as opposed to a population
census), AND to provide a measure of precision for the method
used to make the inference.
• An inferential statement uses data from a sample and
applies it to a population.
Examples of Inferential Statistics:
• A Gallup Poll found that 57% of dating teens had been out
with somebody of another race or ethnic group (+/- 4.5%;
95% CI).
– Interpretation: We are 95% confident that between 52.5% and 61.5%
(57% +/- 4.5%) of dating teens have been out with someone of a
different race/ethnicity.
• A Gallup Poll found that 40% of Americans would quit their
job if they won the lottery (+/- 4%; 95% CI).
– Interpretation: We are 95% confident that the true population
proportion of Americans who would quit their job if they were to win
a lottery lies between 36% and 44%.
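The interval in each interpretation is simply the reported percentage plus or minus its margin of error. A minimal Python sketch of that arithmetic follows; the function name interval_from_margin is an assumption, and the code does not derive the margin itself, which would require the polls' sample sizes (not given here).

```python
def interval_from_margin(estimate_pct: float, margin_pct: float):
    """Return (lower, upper) bounds: estimate plus or minus the margin."""
    return estimate_pct - margin_pct, estimate_pct + margin_pct

# Estimates and margins of error as reported on the slide.
for label, est, moe in [("Dating teens", 57.0, 4.5), ("Quit job", 40.0, 4.0)]:
    low, high = interval_from_margin(est, moe)
    print(f"{label}: {low:.1f}% to {high:.1f}%")
```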
Example: Descriptive and Inferential Statistics
Decide which part of the study represents the
descriptive branch of statistics. What conclusions might
be drawn from the study using inferential statistics?
A large sample of men, aged 48,
was studied for 18 years. For
unmarried men, approximately
70% were alive at age 65. For
married men, 90% were alive at
age 65.
Source: (The Journal of Family Issues)
Larson/Farber 4th ed.
Characteristics of Data
Before conducting any data analysis the characteristics of the variable under study
must be identified. This will result in utilizing appropriate tables, graphs and
statistical analysis.
Two Types of Data
• Qualitative Data can be separated into different
categories (values) that are distinguished by some
nonnumeric characteristic. Qualitative data are also
referred to as categorical or attribute data.
– Examples include gender, eye color, and car brands
– Note that the values of this type of variable are
differentiated by words rather than numeric values.
Example: Eye Color values include blue, brown, hazel,
etc.
• Quantitative Data are “number-based” and
represent counts or measurements. This type of
data may be subdivided into two categories...
• Discrete Data - result when the number of possible
values is either a finite or a countably infinite number.
– Examples: Siblings, Cars, and Coins in a jar (think of whole
number counts here; even if you cannot count them all).
• Continuous Data - result from infinitely many possible
values corresponding to some continuous scale that covers
a range of values without gaps, interruptions, or jumps.
Continuous data can assume any value, including
fractional parts.
– Examples: Height, Weight, Time
N.B.: Qualitative data cannot be classified as discrete or continuous.
Example: Classifying Data by Type
The base prices of several vehicles are shown in the
table. Which data are qualitative data and which are
quantitative data? (Source Ford Motor Company)
Source: Larson/Farber 4th ed.
4 Levels of Measurement
The level of measurement determines which statistical
calculations are meaningful. The four levels of measurement
are: nominal, ordinal, interval, and ratio.
Levels of measurement, lowest to highest: Nominal, Ordinal, Interval, Ratio
Levels of Measurement (cont.)
• Nominal – characterized by data that consist of names,
labels, or categories only. The data cannot be arranged in an
ordering scheme. Qualitative data.
– Examples: Gender, Yes/No, Political Party affiliation,
names of students.
• Ordinal – characterized by data that can be arranged in
some order, but the differences between data values either
cannot be determined or are meaningless. These variables
may be either qualitative (categorical) data or quantitative
(numerical) data.
– Examples: Military Rank, Position in a race, Attitude scales.
Levels of Measurement (cont.)
• Interval – like the ordinal level, with the additional
property that the difference between any two data values is
meaningful. However, there is no natural zero starting point.
Quantitative data.
– Examples: Temperature (F or C); longitude; Calendar
Years.
• Ratio – is the interval level modified to include the natural
zero starting point. At this level, differences and ratios are
both meaningful. Quantitative data.
– Examples: Height, Weight, Time, Age.
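A short worked example may help separate the interval and ratio levels. Celsius temperature has no natural zero, so differences are meaningful but ratios are not; the Kelvin scale (not mentioned on the slides, used here as an assumption for contrast) does have a true zero. The two temperatures are arbitrary illustrative values.

```python
def celsius_to_kelvin(c: float) -> float:
    """Convert a Celsius temperature to Kelvin."""
    return c + 273.15

a_c, b_c = 10.0, 20.0
# Differences are meaningful on an interval scale.
print(f"Difference: {b_c - a_c:.0f} degrees")
# A ratio of Celsius values is not: 20 C is not "twice as hot" as 10 C.
print(f"Naive Celsius ratio: {b_c / a_c:.2f}")
# Kelvin has a true zero, so its ratio is meaningful.
kelvin_ratio = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)
print(f"Kelvin ratio: {kelvin_ratio:.3f}")
```

The Celsius ratio of 2 is an artifact of where the zero point happens to sit; on the Kelvin scale the warmer temperature is only about 3.5% higher.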
Summary of Levels of Measurement

Level of       Put data in   Arrange data   Subtract      Determine if one data value
measurement    categories    in order       data values   is a multiple of another
Nominal        Yes           No             No            No
Ordinal        Yes           Yes            No            No
Interval       Yes           Yes            Yes           No
Ratio          Yes           Yes            Yes           Yes
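The summary table is cumulative: each level allows everything the level below it allows, plus one more operation. That structure can be captured in a small Python lookup. The short operation names and the function allowed_operations are illustrative assumptions, not terminology from the slides.

```python
# Operations in cumulative order, matching the summary table's columns.
OPERATIONS = ["categorize", "order", "subtract", "ratio"]
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

def allowed_operations(level: str) -> list[str]:
    """Each level permits its own operation plus all lower-level ones."""
    return OPERATIONS[: LEVELS.index(level) + 1]

for level in LEVELS:
    print(f"{level:<9} {', '.join(allowed_operations(level))}")
```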
Example: Classifying Data by Level
Two data sets are shown. Which data set consists of
data at the nominal level? Which data set consists of
data at the ordinal level? (Source: Nielsen Media Research)
Source: Larson/Farber 4th ed.
Example: Classifying Data by Level
Two data sets are shown. Which data set consists of
data at the interval level? Which data set consists of
data at the ratio level? (Source: Major League Baseball)
Source: Larson/Farber 4th ed.
Misuse of Statistics
ah yes… the old torture the data long enough and they will confess to anything routine...
• Precise Numbers
Tonight’s paid attendance was 56,423
• Guesstimates
It was estimated that one million spectators lined
the road to L’Alpe d’Huez for the 16th stage of
the 2004 Tour de France race.
• Distorted Percentages
New and improved with 50% more ... – 50% might
not be a meaningful amount.
• Partial Pictures
Ford truck ads
• Loaded Questions
Line item veto
• Misleading Graphs
Visual distortions of data
• Pictographs
The crescive cow.
• Pollster Pressure
Public bathrooms.
• Small/Bad Samples
67% suspended
• Self-Selected Surveys
CNN phone-in surveys
Pictograph: “This year my business
profits doubled!”
Visual Presentations of Data – Beware
Source: http://findarticles.com
Data Considerations
• Anecdotal Evidence – basing our conclusions
on a few individual cases. e.g. We remember
the airplane crash that kills several hundred
people and fail to notice that data for all flights
show that flying is much safer than driving.
• Lurking Variables – almost all relationships
between two variables are influenced by other
variables lurking in the background.
Airline Flights: Alaska Airlines vs. America West
Which would you choose to fly?

                   On Time         Delayed
Alaska Airlines    3274 (86.7%)    501 (13.3%)
America West       6438 (89.1%)    787 (10.9%)
Alaska Airlines vs. America West
A Closer Look

Departure        Alaska Air             America West
Location         On Time   Delayed      On Time   Delayed
Los Angeles          497        62          694       117
Phoenix              221        12         4840       415
San Diego            212        20          383        65
San Francisco        503       102          320       129
Seattle             1841       305          201        61
TOTAL               3274       501         6438       787
Overall, America West has the better “On Time” record, yet Alaska
Airlines has the better “On Time” record at every airport. How can that be?

Departure        Alaska Air                      America West
Location         On Time         Delayed         On Time         Delayed
Los Angeles       497 (88.9%)     62 (11.1%)      694 (85.6%)    117 (14.4%)
Phoenix           221 (94.8%)     12 (5.2%)      4840 (92.1%)    415 (7.9%)
San Diego         212 (91.4%)     20 (8.6%)       383 (85.5%)     65 (14.5%)
San Francisco     503 (83.1%)    102 (16.9%)      320 (71.3%)    129 (28.7%)
Seattle          1841 (85.8%)    305 (14.2%)      201 (76.7%)     61 (23.3%)
TOTAL            3274 (86.7%)    501 (13.3%)     6438 (89.1%)    787 (10.9%)
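The reversal can be checked directly from the counts. The Python sketch below recomputes the on-time percentage overall and per departure city; the data structure and function name are assumptions made for illustration, and the counts are copied from the slides. This pattern is commonly called Simpson's paradox, and departure city plays the role of the lurking variable: most America West flights leave from Phoenix, where delays are rare for both airlines, while most Alaska flights leave from Seattle, where delays are more common.

```python
# (on_time, delayed) counts copied from the slide tables.
flights = {
    "Los Angeles":   {"Alaska": (497, 62),   "America West": (694, 117)},
    "Phoenix":       {"Alaska": (221, 12),   "America West": (4840, 415)},
    "San Diego":     {"Alaska": (212, 20),   "America West": (383, 65)},
    "San Francisco": {"Alaska": (503, 102),  "America West": (320, 129)},
    "Seattle":       {"Alaska": (1841, 305), "America West": (201, 61)},
}

def on_time_pct(on_time: int, delayed: int) -> float:
    """Percentage of flights that were on time."""
    return 100 * on_time / (on_time + delayed)

overall = {}
for airline in ("Alaska", "America West"):
    on_time = sum(city[airline][0] for city in flights.values())
    delayed = sum(city[airline][1] for city in flights.values())
    overall[airline] = on_time_pct(on_time, delayed)
    flown = on_time + delayed
    print(f"{airline}: {overall[airline]:.1f}% on time overall ({flown} flights)")

for city, data in flights.items():
    a, w = on_time_pct(*data["Alaska"]), on_time_pct(*data["America West"])
    print(f"{city:<14} Alaska {a:.1f}%  America West {w:.1f}%")
```

Comparing percentages within each city, rather than the pooled totals, is what exposes the effect of the uneven flight mix.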
End of Slides