Statistics 13 Elementary Statistics

Summer Session I 2012
Lecture Notes 1: An Introduction to Statistics1
Definition 1.1 Statistics is the science of data. This involves collecting, classifying,
summarizing, organizing, analyzing, and interpreting numerical information.
And more views from others...
• ‘There are three kinds of lies: lies, damned lies, and statistics.’ popularized by Mark Twain, who attributed it to Benjamin Disraeli.
• ‘The average human has one breast and one testicle.’ by Des McHale.
• ‘Statistics may be defined as “a body of methods for making wise decisions in the
face of uncertainty.”’ by W.A. Wallis.
• ‘I keep saying that the sexy job in the next 10 years will be statisticians, and I’m
not kidding.’ by Hal Varian, chief economist at Google.2
For curious students: ‘How to Lie with Statistics’ by Darrell Huff.3
Definition 1.2 Descriptive statistics utilizes numerical and graphical methods to
look for patterns in a data set, to summarize the information revealed in a data set, and
to present that information in a convenient form.
Definition 1.3 Inferential statistics utilizes sample data to make estimates, decisions, predictions, or other generalizations about a larger set of data.
1 Last update: June 25, 2012
2 New York Times: http://www.nytimes.com/2009/08/06/technology/06stats.html
3 It won’t be tested, but it is probably fun to read! Don’t read it when you cannot get to sleep; it may
make things worse...
Inferences make statements about populations (the data set about which information
is desired).
Definition 1.4
An experimental unit is an object about which we collect data.
• A measurement results when a variable is actually measured on an experimental
unit.
• A set of measurements is called data.
Definition 1.5
A population is a set of units that we are interested in studying.
The definition of population depends on the particular study.
• If you are interested in studying the height of US citizens, the population is all the
heights of US citizens.
• If you are interested in studying the height of Brazilian citizens, the population is
all the heights of Brazilian citizens.
Definition 1.6 A variable is a characteristic or property of an individual population
unit.
How many variables have you measured in your study?
• Univariate data: One variable is measured on a single experimental unit.
• Bivariate data: Two variables are measured on a single experimental unit.
• Multivariate data: More than two variables are measured on a single experimental
unit.
What is the difference between populations and variables?
• A population is a set of existing units such as people, objects, transactions, or
events.
• A variable is a characteristic or property of an individual population unit such as
height of a person, time of a reflex, amount of a transaction, etc.
Definition 1.7
A sample is a subset of the units of a population.
The four major methods of collecting data:
1. A published source: These data have already been collected by someone else and
are available in a published source.
2. A designed experiment: These data are collected by a researcher who exerts
strict control over the experimental units in a study. These data are measured
directly from the experimental units.
3. A survey: These data are collected by a researcher asking a group of people one
or more questions. Again, these data are collected directly from the experimental
units or people.
4. An observational study: These data are collected directly from experimental
units by simply observing the experimental units in their natural environment and
recording the values of the desired characteristics.
Definition 1.8 A statistical inference is an estimate, prediction, or some other generalization about a population based on information contained in a sample.
What is the difference between descriptive and inferential statistics?
• Descriptive statistics utilizes numerical and graphical methods to look for patterns,
to summarize, and to present the information in a set of data.
• Inferential statistics utilizes sample data to make estimates, decisions, predictions,
or other generalizations about a larger set of data, such as a population.
Definition 1.9 A measure of reliability is a statement (usually quantitative) about
the degree of uncertainty associated with a statistical inference.
Example 1 Gallup poll of teenagers. A Gallup Youth Poll was conducted to determine the
topics that teenagers most want to discuss with their parents. The findings show that 46% would
like more discussion about the family’s financial situation, 37% would like to talk about school,
and 30% would like to talk about religion. The survey was based on a national sampling of 505
teenagers, selected at random from all U.S. teenagers.
• Sample: the set of 505 teenagers selected at random from all U.S. teenagers.
• Population: the set of all teenagers in the U.S.
• Variable of interest: the topics that teenagers most want to discuss with their parents.
1. How is the inference expressed?
The inference is expressed as the percentage of teenagers in the population who want to discuss
each particular topic with their parents.
2. Newspaper accounts of most polls usually give a margin of error (e.g., plus or minus
3%) for the survey result. What is the purpose of the margin of error and what is its
interpretation?
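The “margin of error” is the measure of reliability. It quantifies the uncertainty of the
inference: a margin of plus or minus 3% means the true population percentage is likely to lie
within 3 percentage points of the reported sample percentage.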
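To make the idea concrete, here is a minimal sketch of how a margin of error might be computed for the percentages in Example 1. It assumes the usual normal approximation for a sample proportion and a 95% confidence level (z = 1.96). Neither assumption comes from the Gallup report, and the function name margin_of_error is only illustrative.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error for a sample proportion.

    Uses the normal approximation z * sqrt(p_hat * (1 - p_hat) / n).
    z = 1.96 corresponds to roughly 95% confidence (an assumption here).
    """
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)

n = 505  # sample size reported in Example 1
# Sample percentages reported in Example 1
for topic, p_hat in [("finances", 0.46), ("school", 0.37), ("religion", 0.30)]:
    moe = margin_of_error(p_hat, n)
    print(f"{topic}: {p_hat:.0%} plus or minus {moe:.1%}")
```

Under these assumptions each margin comes out at roughly 4 percentage points. A larger sample would shrink it.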
Definition 1.10 Quantitative data are measurements that are recorded on a naturally occurring numerical scale.
Quantitative discrete:
• Number of wins by a baseball team in a season (0, 1, 2, 3, etc)
• Number of fire incidents on campus in a year (0, 1, 2, etc)
• Number of customers of a store in a day
Quantitative continuous:
• Time it takes for me to commute between my home and school (between 5 min and
15 min)
• Time to complete a survey
Definition 1.11 Qualitative data are measurements that cannot be measured on a
natural numerical scale; they can only be classified into one of a group of categories.4
Example 2 National Bridge Inventory. All highway bridges in the United States are
inspected periodically for structural deficiency by the Federal Highway Administration (FHWA).
Data from the FHWA inspections are compiled into the National Bridge Inventory (NBI). Several
of the nearly 100 variables maintained by the NBI are listed next. Classify each variable as
quantitative or qualitative.
Qualitative:
• Toll bridge (yes or no)
• Condition of deck (good, fair, or poor)
• Type of route (interstate, U.S., state, county, or city)
Quantitative:
• Length of maximum span (feet)
• Number of vehicle lanes
• Average daily traffic
• Bypass or detour length (miles)
Definition 1.12 A representative sample exhibits characteristics typical of those
possessed by the target population.
Definition 1.13 A random sample of n experimental units is a sample selected from
the population in such a way that every different sample of size n has an equal chance of
selection.
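Definition 1.13 can be illustrated with a short sketch. It assumes a hypothetical population of 1,000 units labeled 1 to 1000 and uses Python's standard random.sample, which draws without replacement so that every subset of size n is equally likely. The helper name simple_random_sample and the seed value are illustrative choices only.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct units so that every size-n subset is equally likely."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(population, n)

# Hypothetical population: unit labels 1..1000 (not from the notes)
population = list(range(1, 1001))
sample = simple_random_sample(population, 10, seed=13)
print(sorted(sample))
```

Running it again with a different seed gives a different sample, and each possible sample of 10 units has the same chance of being chosen.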
Example 3 Example 1, continued. Refer to Example 1.
3. Is the sample representative of the population?
Yes! Since the sample was a random sample, it should be representative of the population.
Definition 1.14 Statistical thinking involves applying rational thought and the science of statistics to critically assess data and inferences.
4 Often, we assign arbitrary numerical values to qualitative data for ease of computer entry and analysis.
But these assigned numerical values are simply codes: they cannot be meaningfully added, subtracted,
multiplied, or divided.
Example 4 Success/failure of software reuse. The PROMISE Software Engineering
Repository, hosted by the University of Ottawa, is a collection of publicly available data sets
to serve researchers in building prediction software models. A PROMISE data set on software
reuse, saved in the SWREUSE file, provides information on the success or failure of reusing
previously developed software for each project in a sample of 24 new software development projects.
(Data source: IEEE Transactions on Software Engineering, Vol. 28, 2002.) Of the 24 projects,
9 were judged failures and 15 were successfully implemented.
• Experimental units: the 24 new software development projects.
• Population: the set of all new software development projects.
• Variable of interest: the outcome of reusing previously developed software for the new
software development projects. Since the outcomes could either be successes or failures,
the variable is qualitative.
1. Critically evaluate the statement “Since 15/24=0.625, it follows that 62.5% of all new
software development projects will be successfully implemented.” Is it correct?
No! In the sample, 15 of the 24 projects (62.5%) were judged as successfully implemented.
This is the success rate of the sample. This would be a good estimate of the population
percentage of successfully implemented projects, but it is only an estimate. If we took
another sample of size 24, the percentage of successful projects would not necessarily be
62.5%.
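The point that another sample would not necessarily give 62.5% can be checked with a small simulation. The sketch below assumes, purely for illustration, that the true population success rate is exactly 0.625 and that projects succeed independently. Neither assumption comes from the PROMISE data, and the function name is hypothetical.

```python
import random

def sample_success_rate(true_rate, n, rng):
    """Simulate n independent projects and return the fraction that succeed."""
    successes = sum(1 for _ in range(n) if rng.random() < true_rate)
    return successes / n

rng = random.Random(2012)  # fixed seed so the run is reproducible
TRUE_RATE = 0.625          # assumed population success rate (illustration only)

# Draw five independent samples of 24 projects each
rates = [sample_success_rate(TRUE_RATE, 24, rng) for _ in range(5)]
for i, rate in enumerate(rates, start=1):
    print(f"sample {i}: {rate:.1%} successful")
```

Even with the population rate held fixed, the sample percentages typically vary from one sample to the next, which is why 62.5% is an estimate rather than a fact about all projects.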
Definition 1.15 Selection bias results when a subset of the experimental units in
the population is excluded so that these units have no chance of being selected in that
sample.
Example 5 Alf Landon vs. Franklin D. Roosevelt. During the 1936 presidential campaign, a classic instance of faulty sampling occurred. The Literary Digest, a respected magazine,
polled about 10 million Americans (the largest number ever) and got responses from about 2.4
million. For convenience, the individuals surveyed were selected from registered automobile owners, registered telephone subscribers, and the magazine’s own readers.
The poll showed that Landon would likely be the overwhelming winner and FDR would get only
43% of the votes. However, FDR won the election with 62% of the votes (one of the most lopsided
results in history in terms of electoral votes). The magazine was completely discredited because of the
poll, and was soon discontinued.
Why? The erroneous prediction occurred because the voters used in the sample were not representative of the general voting population. In 1936, telephones and automobiles were unaffordable to
the average voter, and “John Q. Public” didn’t read Literary Digest either. Thus, the majority
of voters didn’t really care what Literary Digest had to say, and didn’t even bother submitting the
survey.
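A rough back-of-the-envelope sketch shows how large the gap between the sampled group and everyone else could have been. It uses the 43% and 62% figures from Example 5. The share of voters covered by the sampling lists (25%) is a made-up number for illustration only, and the sketch assumes the poll result reflects the covered group exactly, ignoring nonresponse.

```python
def support_outside_frame(overall, frame_share, frame_support):
    """Solve overall = frame_share * frame_support + (1 - frame_share) * x for x."""
    return (overall - frame_share * frame_support) / (1.0 - frame_share)

OVERALL_FDR = 0.62  # election result quoted in Example 5
POLL_FDR = 0.43     # poll estimate quoted in Example 5
FRAME_SHARE = 0.25  # HYPOTHETICAL share of voters on the sampling lists

excluded_support = support_outside_frame(OVERALL_FDR, FRAME_SHARE, POLL_FDR)
print(f"Implied FDR support among excluded voters: {excluded_support:.1%}")
```

Under these assumptions the excluded voters would have supported FDR at about 68%, far above the 43% seen in the poll. Surveying more people from the same lists could not have closed that gap.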
Definition 1.16 Nonresponse bias results when the researchers conducting a survey
or study are unable to obtain data on all experimental units selected for the sample.
Definition 1.17 Measurement error refers to inaccuracies in the values of the data
recorded.5 In surveys, this kind of error may be due to ambiguous or leading questions
and the interviewer’s effect on the respondent.
Example 6 A biologist studying the reproduction of a particular strain of bacterium might
encounter random errors due to slight variation of temperature or light in the room.
For your interest: Random error is one type of measurement error. It is random in nature
and very difficult to predict. Another type of measurement error is systematic error. Random
error can be estimated by comparing multiple measurements, and reduced by averaging multiple
measurements. In contrast, systematic error cannot be discovered this way because it always
pushes the results in the same direction. If the cause of a systematic error can be identified, then
it can usually be eliminated.
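The contrast between the two kinds of error can be seen in a small simulation. The true value, noise level, and bias below are hypothetical numbers chosen for illustration, and random error is assumed to follow a normal distribution.

```python
import random
import statistics

def simulate_measurements(true_value, n, noise_sd, bias, rng):
    """Each reading = true value + constant systematic bias + random noise."""
    return [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]

rng = random.Random(5)  # fixed seed for reproducibility
TRUE_VALUE = 10.0       # hypothetical quantity being measured

readings = simulate_measurements(TRUE_VALUE, 10000, noise_sd=0.5, bias=0.3, rng=rng)
mean_reading = statistics.mean(readings)
spread = statistics.stdev(readings)

# The spread across repeated readings estimates the size of the random error.
print(f"spread of readings: {spread:.2f}")
# Averaging removes most of the random error, but the bias is still there.
print(f"mean reading minus true value: {mean_reading - TRUE_VALUE:.2f}")
```

The spread of the readings recovers the random noise level, while the average still sits about 0.3 away from the true value. Repeated measurement alone cannot reveal the systematic error.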
5 Measurement error is not the same as a ‘mistake’!