Uploaded by hannehmeh2001

Statistics
- a science of conducting studies to collect, organize,
summarize, analyze, present, interpret, and draw
conclusions from data
- used to analyze the results of surveys and as a tool in
scientific research to make decisions based on controlled
experiments. Other uses of statistics include operations
research, quality control, estimation, and prediction
Variable - a characteristic or attribute under study that can
assume different values
Random Variables - variables whose values are determined by chance
Data - values (observations or measurements) that the
variables can assume
Data Set - a collection of observations (data values) on one
or more variables
Population - consists of all subjects (human or otherwise) that
are being studied
Sample - a group of subjects selected from a population
2 Main Areas of Statistics
1. Descriptive statistics - the collection, organization,
summarization, and presentation of data. Tables, charts or
graphs are used to organize and present data. Descriptive
values such as the average score are used to summarize
data.
2. Inferential statistics - generalizing from samples to
populations, performing estimations and hypothesis tests,
determining relationships among variables, and making
predictions.
Hypothesis testing - a decision-making process for
evaluating claims about a population, based on
information obtained from samples
Classifications of Variables
1. Quantitative - numerical and can be ordered or ranked
(age, heights, weights, body temperatures)
a) Discrete - assume values that can be counted
b) Continuous - assume an infinite number of values
between any two specific values; obtained by
measuring and often include fractions and decimals
2. Qualitative - variables that can be placed into distinct
categories, according to some characteristic or attribute
(gender, religion, geographic locations)
Measuring Variables - to establish relationships between
variables, observe the variables and measure/record the
observations.
Scale of measurement - a set of categories and a process that
classifies each individual into one category
4 Types of Measurement Scales
1. Nominal - classifies data into mutually exclusive
(non-overlapping), exhaustive categories in which no order or
ranking can be imposed on the data (gender, zip code, eye
color, nationality, religion)
2. Ordinal - classifies data into categories that can be ranked,
but precise differences between the ranks do not exist (judging:
1st, 2nd; rating scale: poor, excellent)
3. Interval - ranks data, and precise differences between units
of measure exist; there is no meaningful zero (IQ, temperature)
4. Ratio - possesses all the characteristics of interval
measurement, and a true zero exists. True ratios exist when the
same variable is measured on two different members of the
population (height, weight, time, salary, age)
Data Collection Methods
1. Surveys
a) Telephone - less costly; respondents may be more candid
because the survey is not face-to-face. Disadvantages: some
people don’t have phones or will not answer; unlisted numbers
b) Mailed Questionnaires - less expensive to conduct,
respondents can remain anonymous. Disadvantages: low
number of responses, inappropriate answers to questions;
some people may have difficulty reading or understanding
the questions
c) Personal Interview - obtains in-depth responses.
Disadvantages: interviewers must be trained in asking
questions and recording responses; interviewer may be
biased
2. Surveying records
3. Direct observation of situations
Reasons for Using Samples
1. Saves time and money
2. Enables the researcher to get information that he or she
might not be able to obtain otherwise
3. Enables the researcher to get more detailed information
about a particular subject
4 Basic Sampling Techniques
1. Random sampling - subjects are selected by random
numbers from calculators, computers, or tables; for a sample of
size n, all possible samples of this size have an equal chance of
being selected from the population.
Limitation: if the population is extremely large, it is time
consuming to number and select the sample elements
Methods for Random Sampling
a) Fish bowl - number each element of the population, place
the numbers on cards in a hat or fishbowl, mix them, and
select the sample by drawing the cards
b) Random numbers - number the elements of the population
sequentially and then select each element by using random
numbers
2. Systematic random sampling - selecting every kth subject
after the first subject is selected at random from 1 through k.
The advantage of systematic sampling is the ease of selecting
the sample elements.
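The random-number method and systematic sampling described above can be sketched in a few lines of Python. This is only an illustration: the population of 100 numbered subjects, the function names, and the seed values are assumptions made for the example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct subjects; every sample of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, k, seed=None):
    """Pick a random start among the first k subjects, then take every kth one."""
    rng = random.Random(seed)
    start = rng.randrange(k)      # index 0..k-1, i.e. subject 1 through k
    return population[start::k]

# Hypothetical population: subjects numbered 1 to 100
population = list(range(1, 101))
srs = simple_random_sample(population, 10, seed=1)
sys_sample = systematic_sample(population, 10, seed=1)
print("Simple random sample:", srs)
print("Systematic sample:   ", sys_sample)
```

With k = 10 and 100 subjects, the systematic sample always contains 10 subjects spaced exactly 10 apart; only the starting subject is random.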
3. Stratified random sampling - dividing the population into
subgroups, called strata, and randomly selecting subjects
within each group; ensures representation of all population
subgroups that are important to the study. Disadvantage:
when there are many variables of interest, dividing a large
population into representative subgroups requires a great
deal of effort.
4. Cluster sampling - subjects are selected by using an
intact group (cluster) that is representative of the
population.
Advantages: a cluster sample can reduce costs, it can
simplify fieldwork, and it is convenient.
Disadvantage: the elements within a cluster may be
homogeneous
Frequency Distribution and Graphs
Constructing a frequency distribution - most convenient
method of organizing data
Frequency distribution - organization of raw data in table
form, using classes and frequencies; way of presenting a
summary of the data that shows
a) possibility of seeing patterns or relationships in data
b) how many times each data point (observation/outcome)
occurs in a data set
Class - quantitative/qualitative category that each raw data
value is placed into
Tally - data recorded in the sequence in which they are
collected, before they are processed/ranked
Frequency - number of data values contained in a specific
class
Reasons for constructing a frequency distribution
1. To organize the data in a meaningful, intelligible way.
2. To enable the reader to determine the nature or shape of
the distribution.
3. To facilitate computational procedures for measures of
average and spread
4. To enable the researcher to draw charts and graphs to
present data
5. To enable the reader to compare different data sets
Components of frequency distribution table
1. Qualitative variable (ordinal/nominal data)
a) Class, tally, frequency, percent
2. Quantitative variable (numerical data)
a) Class limit, class boundaries - numbers used to
separate the classes so there are no gaps in the
frequency distribution; tally, frequency
Types of Frequency Distribution
1. Categorical Frequency Distribution - used for data that can
be placed in specific categories, such as nominal/ordinal level
data.
2. Grouped Frequency Distribution - used when the range of
the data is large; the data must be grouped into classes that are
more than one unit in width.
3. Ungrouped Frequency Distribution - used when the range
of the data values is relatively small; a frequency distribution
can be constructed using single data values for each class
4. Cumulative Frequency Distribution - gives total # of values
that fall below the upper boundary of each class. Values are
found by adding the frequencies of classes less than or equal to
upper class boundary of a specific class (ascending cumulative
frequency)
[Sample of Frequency Distribution Table]
Basic Rules: Constructing “Class” in the Frequency
Distribution
1. There should be 5-20 classes
2. Class limits should have the same decimal place value
as the data
a) Class boundaries should have one additional
place value and end in a 5. For whole-number data:
Lower limit - 0.5 = lower boundary
Upper limit + 0.5 = upper boundary
3. Classes must be equal in width - the width is found by
subtracting the lower (or upper) class limit of one class from
the lower (or upper) class limit of the next class or, if
boundaries are given, by subtracting the lower boundary from
the upper boundary of a class. To choose a class width, divide
the range by the number of classes and round up
* do not subtract the limits of a single class; this gives an
incorrect answer
* the researcher decides how many classes to use and the
width of each class
4. Classes must be mutually exclusive - non-overlapping class
limits so that data cannot be placed into two classes
5. Classes must be continuous - no gaps in frequency
distribution
6. Classes must be exhaustive - enough to accommodate all
the data
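The class rules above can be tried out on a small data set. The following is a minimal sketch, assuming whole-number data; the 20 data values and the function names are hypothetical, and the number of classes is taken from Sturges' Rule, k = 1 + 3.322 log10(n), although the researcher may choose a different number.

```python
import math

def sturges_classes(n):
    """Number of classes from Sturges' Rule, k = 1 + 3.322 * log10(n), rounded up."""
    return math.ceil(1 + 3.322 * math.log10(n))

def grouped_frequency(data, num_classes):
    """Build (lower limit, upper limit, frequency) rows for whole-number data."""
    low, high = min(data), max(data)
    # Range divided by the number of classes, pushed up to the next whole
    # number so that the classes are exhaustive.
    width = (high - low) // num_classes + 1
    rows = []
    for i in range(num_classes):
        lower = low + i * width
        upper = lower + width - 1      # equal widths, no overlap, no gaps
        frequency = sum(lower <= x <= upper for x in data)
        rows.append((lower, upper, frequency))
    return rows

# Hypothetical raw data (n = 20)
data = [12, 15, 17, 21, 22, 22, 25, 28, 30, 31,
        33, 35, 36, 38, 40, 41, 44, 45, 47, 50]
k = sturges_classes(len(data))
table = grouped_frequency(data, k)
for lower, upper, f in table:
    # Class boundaries: lower limit - 0.5 and upper limit + 0.5
    print(f"{lower}-{upper}  boundaries {lower - 0.5}-{upper + 0.5}  f = {f}")
```

The class width comes from the difference between consecutive lower limits (19 - 12 = 7), not from subtracting the limits of a single class (18 - 12 = 6).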
Sturges’ Rule - determining number of classes to use in a
histogram or frequency distribution table
k = 1 + 3.322 log10(n)
k = number of classes
n = size of the data
Constructing statistical charts and graphs - most useful
method of presenting the data
Uses of graphs in statistics
1. Convey data to viewers in pictorial form
2. Useful in getting the audience’s attention in a presentation
3. Describe/analyze data set
4. Discuss an issue, reinforce a critical point, summarize data
set
5. Discover trends/patterns in a situation
Frequency Distribution Graphs
• X axis - score categories (X values)
• Y axis - frequencies
• Histogram or a polygon - used when the score categories
have numerical scores from an interval or ratio scale
Commonly Used Graphs
1. Histogram - contiguous vertical bars of various heights
(frequencies)
2. Frequency polygon - using lines that connect points
plotted for the frequencies
3. Ogive or Cumulative Frequency Graph - represents the
cumulative frequencies; visually represents how many
values are below a certain upper class boundary
Constructing Statistical Graphs
1. Draw and label x and y axes
2. Choose a suitable scale and label it on the y axis
3. Represent the class boundaries on the x axis
4. Plot the points and draw the bars or lines
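Before drawing, it helps to compute the points each graph needs. The sketch below uses a hypothetical grouped distribution. It assumes, as is the usual convention, that the frequency polygon is plotted at the class midpoints and the ogive at the upper class boundaries; the notes above do not state the midpoint convention explicitly.

```python
from itertools import accumulate

# Hypothetical grouped distribution: (lower boundary, upper boundary, frequency)
classes = [(11.5, 18.5, 3), (18.5, 25.5, 4), (25.5, 32.5, 3),
           (32.5, 39.5, 4), (39.5, 46.5, 4), (46.5, 53.5, 2)]

frequencies = [f for _, _, f in classes]

# Histogram: one bar per class, spanning its boundaries, with height = frequency
histogram_bars = [(lo, hi, f) for lo, hi, f in classes]

# Frequency polygon: (class midpoint, frequency) points joined by lines
midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]
polygon_points = list(zip(midpoints, frequencies))

# Ogive: (upper class boundary, cumulative frequency) points
cumulative = list(accumulate(frequencies))
ogive_points = list(zip([hi for _, hi, _ in classes], cumulative))

print("Polygon points:", polygon_points)
print("Ogive points:  ", ogive_points)
```

The last ogive point always reaches the total number of data values, here 20.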
Relative Frequency Graphs - used when the proportion of
data values is more important than the actual number of
data values
To convert a frequency into a proportion or relative
frequency, divide the frequency for each class by the total
of the frequencies. The sum of the relative frequencies will
always be 1
Other Types of Graph
1. Bar graph - vertical or horizontal bars whose heights or
lengths represent the frequencies of the data
2. Pareto chart - frequency distribution for a categorical
variable; frequencies are displayed by vertical bars,
arranged in order from highest to lowest
3. Time series graph - represents data that occur over a
specific period of time; look for trends/patterns
4. Pie graph - circle divided into sections or wedges
according to the percentage of frequencies; used for
nominal/categorical data
Data Distribution
Measures of Central Tendency
Central tendency - descriptive statistical measure that
determines a single value that best describes the center
and represents the entire distribution; condenses a large
set of data into a single value
- goal is to identify the single value that is the best
representative for the entire set of data
Statistic - a characteristic or measure obtained by using
the data values from a sample
Parameter - a characteristic or measure obtained by using
all the data values from a specific population
1. Mean - most commonly used measure of central
tendency; balance point of the distribution; sum of the
values divided by the total number of values
2. Median - midpoint of the list where scores in a
distribution are listed from smallest to largest; a more
appropriate measure of central tendency than the mean
when the distribution is skewed or contains extreme values;
divides the scores so that 50% of the scores in the
distribution have values that are equal to or less than the
median
3. Mode - most frequently occurring category or score in
the distribution or in the data set; peak or high point of
the distribution; reported along with the mean or the median
Modal class - the mode for grouped data; the class with the
largest frequency
1. Unimodal - a data set that has only one value that occurs
with the greatest frequency
2. Bimodal - a data set that has two values that occur with the
same greatest frequency; both values are considered to be the
mode
3. Multimodal - a data set that has more than two values that
occur with the same greatest frequency; each value is used as
the mode
Central Tendency and the Shape of the Distribution
1. Symmetrical (Normal) Distribution - the data values are
evenly distributed on both sides of the mean. When the
distribution is unimodal, the mean, median, and mode are the
same and are at the center of the distribution
2. Positively Skewed or Right-skewed Distribution - majority
of the data values fall to the left of the mean and cluster at the
lower end of the distribution; the “tail” is to the right. The
mean is to the right of the median, and the mode is to the left
of the median
3. Negatively Skewed or Left-skewed Distribution - majority of
the data values fall to the right of the mean and cluster at the
upper end of the distribution, with the tail to the left. The mean
is to the left of the median, and the mode is to the right of the
median
*When a distribution is extremely skewed, the value of the
mean will be pulled toward the tail
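A small right-skewed data set shows the mean being pulled toward the tail. The scores below are made up for illustration, and the sketch uses Python's standard `statistics` module.

```python
import statistics

# Hypothetical right-skewed scores: most values are low, with a long tail to the right
scores = [1, 2, 2, 2, 3, 3, 4, 5, 9, 19]

mean = statistics.mean(scores)        # sum of the values / number of values
median = statistics.median(scores)    # midpoint of the ordered list
modes = statistics.multimode(scores)  # most frequent value(s); one value means unimodal

print("mean   =", mean)
print("median =", median)
print("mode   =", modes)
```

Here the mode (2) is to the left of the median (3) and the mean (5) is to the right of it, as described for a positively skewed distribution.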
Central Tendency and Variability - two primary values that are
used to describe a distribution of scores
Central tendency - the central point of the distribution
Variability - descriptive statistic that describes how the
scores are scattered around that central point; determined
by measuring distance
- inferential statistic that describes how accurately any
individual score or sample represents the entire
population
Measures of Variation
1. Range - total distance covered by the distribution, from
the highest score to the lowest score
R = highest value - lowest value
2. Variance (σ² or s²) - average of the squares of the
distance each value is from the mean
Population variance: $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
X = individual value
μ = population mean
N = population size
Sample variance: $s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$
Q2 is the same as the 50th percentile, or the median
Q3 corresponds to the 75th percentile
4. Interquartile Range (IQR) - difference between Q3 and Q1
(IQR = Q3 - Q1); it is the range of the middle 50% of the data;
used to identify outliers, and as a measure of variability in
exploratory data analysis (EDA)
5. Deciles - divide the distribution into 10 groups,
denoted by D1, D2, etc. Deciles can be found by using the
formulas given for percentiles
Relationships Among Percentiles, Deciles, and Quartiles
• Deciles are denoted by D1, D2, D3, ..., D9 and they
correspond to P10, P20, P30, ..., P90
• Quartiles are denoted by Q1, Q2, Q3 and they correspond to
P25, P50, P75
• The median is the same as P50 or Q2 or D5
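These relationships can be checked numerically. The sketch below relies on `statistics.quantiles` with its "inclusive" interpolation method, which is only one of several conventions for locating quartiles, so hand calculations with a textbook method may differ slightly. The data values and the 1.5 x IQR outlier fences are assumptions made for the example; the notes say only that the IQR is used to identify outliers.

```python
import statistics

# Hypothetical data set with one unusually large value
data = [5, 6, 12, 13, 15, 18, 22, 50]

q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

deciles = statistics.quantiles(data, n=10, method="inclusive")       # D1..D9
percentiles = statistics.quantiles(data, n=100, method="inclusive")  # P1..P99

# Common convention (an assumption here): values beyond 1.5 * IQR
# from the quartiles are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1, Q2, Q3 =", q1, q2, q3)
print("IQR =", iqr)
print("D5 =", deciles[4], " P50 =", percentiles[49])
print("outliers =", outliers)
```

The median, Q2, D5, and P50 all come out to the same value, and Q1 and Q3 match P25 and P75.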
X̄ = sample mean
n = sample size
3. Standard Deviation (σ or s) - standard distance
between a score and the mean; square root of the
variance
Uses of Variance and Standard Deviation
1. To determine the spread of the data.
2. To determine the consistency of a variable
3. To determine the number of data values that fall within
a specified interval in a distribution
4. Used quite often in inferential statistics.
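The variance and standard deviation formulas can be written out directly. The sketch below uses made-up values and hypothetical function names; it divides by N for a population and by n - 1 for a sample, then counts the values lying within one standard deviation of the mean.

```python
import math
import statistics

def population_variance(values):
    """Average squared distance from the mean, dividing by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Sum of squared distances from the sample mean, dividing by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

# Hypothetical data values
values = [10, 60, 50, 30, 40, 20]

s2 = sample_variance(values)
s = math.sqrt(s2)                 # standard deviation = square root of the variance
x_bar = sum(values) / len(values)

# Data values that fall within one standard deviation of the mean
within_one_sd = [x for x in values if x_bar - s <= x <= x_bar + s]

print("population variance =", population_variance(values))
print("sample variance     =", s2)
print("sample std. dev.    =", round(s, 2))
print("within one s of the mean:", within_one_sd)

# Cross-check against the standard library
assert math.isclose(s2, statistics.variance(values))
assert math.isclose(population_variance(values), statistics.pvariance(values))
```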
Coefficient of Variation (CVar) - statistic that allows
standard deviations to be compared when the units are
different; the standard deviation divided by the mean, with
the result expressed as a percentage
For samples: $CVar = \frac{s}{\bar{X}} \cdot 100\%$
For population: $CVar = \frac{\sigma}{\mu} \cdot 100\%$
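As a quick illustration of comparing spread across different units, the sketch below applies the CVar formula to two made-up variables. The means and standard deviations are hypothetical.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation divided by the mean, expressed as a percentage."""
    return std_dev / mean * 100

# Hypothetical summary statistics for the same group of people
cv_height = coefficient_of_variation(std_dev=8.5, mean=170)  # centimeters
cv_weight = coefficient_of_variation(std_dev=9, mean=60)     # kilograms

print(f"CVar (height) = {cv_height:.1f}%")
print(f"CVar (weight) = {cv_weight:.1f}%")
```

Because the larger CVar belongs to weight, the weights are relatively more variable than the heights, even though the two standard deviations are in different units and cannot be compared directly.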
Measures of Positions - used to locate the relative
position of a data value in the data set
1. Standard score (z-score) - tells how many standard
deviations a data value is above or below the mean for a
specific distribution of values
a) If a z score is 0, the data value is the same as the mean
b) if the z score is (+), the score is above the mean
Exploratory (Descriptive) Data Analysis, EDA - to examine data
to find out what information can be discovered about the data
such as the center and the spread
Stem-and-Leaf Plot - data plot that uses part of the data value
as the stem and part of the data value as the leaf to form
groups or classes. Leading digit (stem), trailing digit (leaf),
frequency
Boxplot (Box and Whisker Plot) - graph of a data set obtained
by drawing: the lowest value of the data set (minimum), Q1,
the median, Q3, the highest value of the data set (maximum)
Comparing Boxplots for Two or More Data Sets - use the
location of the medians. To compare the variability, use the
interquartile range or the length of the boxes.
Probability and Counting Rules
Probability - the chance of an event occurring
Basic Concepts of Probability
1. Probability Experiments - a chance process that generates a
set of data or well-defined results called outcomes
2. Outcome - the result of a single trial of a probability
experiment
3. Sample space (S) - set of all possible outcomes of a
statistical experiment
For the standard score (z-score):
c) If the z score is (-), the score is below the mean
When all data for a variable are transformed into z scores,
the resulting distribution will have a mean of 0 and a
standard deviation of 1
$z = \frac{\text{value} - \text{mean}}{\text{sd}}$
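The z-score formula and the mean-0, standard-deviation-1 property can be verified on a small data set. The values below are made up, and the sketch assumes the population standard deviation is used as the "sd" in the formula; the notes do not say which form is intended.

```python
import statistics

def z_scores(values):
    """Transform each value into z = (value - mean) / sd."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)   # assumption: population standard deviation
    return [(x - mean) / sd for x in values]

# Hypothetical scores with mean 5 and population standard deviation 2
scores = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_scores(scores)

print("z scores:", z)
print("mean of z scores =", statistics.mean(z))
print("sd of z scores   =", statistics.pstdev(z))
```

A score of 9 has z = 2.0 (two standard deviations above the mean), a score of 5 has z = 0 (equal to the mean), and a score of 2 has z = -1.5 (below the mean).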
2. Percentile - divide the data set into 100 equal groups
$\text{percentile} = \frac{(\text{\# of values below } X) + 0.5}{\text{total \# of values}} \times 100\%$
3. Quartiles - divide the distribution into four groups,
separated by Q1, Q2, Q3
Q1 is the same as the 25th percentile
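The percentile formula given above translates directly into code. The test scores in this sketch are hypothetical, and the function name is chosen for the example.

```python
def percentile_rank(values, x):
    """Percentile of x: ((# of values below x) + 0.5) / (total # of values) * 100."""
    below = sum(1 for v in values if v < x)
    return (below + 0.5) * 100 / len(values)

# Hypothetical test scores
scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]

rank = percentile_rank(scores, 12)
print(f"A score of 12 is at the {rank:.0f}th percentile")
```

Six of the ten scores are below 12, so the result is (6 + 0.5) / 10 x 100% = 65%.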
Tree Diagram - used to determine all possible outcomes of a
probability experiment
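What a tree diagram does by hand, branching once per stage of the experiment, can be reproduced with `itertools.product`. The three-coin experiment and the event chosen below are only an example.

```python
from itertools import product

# Each toss branches into H or T; three tosses give 2 * 2 * 2 paths
coin = ["H", "T"]
sample_space = list(product(coin, repeat=3))

# An event is a set of outcomes, e.g. "exactly two heads"
exactly_two_heads = [outcome for outcome in sample_space if outcome.count("H") == 2]

for outcome in sample_space:
    print("".join(outcome))
print("Outcomes in the event:", ["".join(o) for o in exactly_two_heads])
```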
Event (E) - consists of a set of outcomes of a probability
experiment
Classifications of Events
1. Independent - the first event does not affect the
probability of the next event occurring; the probability of
both occurring is P(A and B) = P(A) x P(B)
2. Dependent - the probability of the second event
occurring depends on the first event (conditional
probability P(B|A)); the probability of both occurring is
P(A and B) = P(A) x P(B|A)
3. Complementary event (Ē) - set of outcomes in the
sample space that are not included in the outcomes of
event E; E and Ē are mutually exclusive
$P(\bar{E}) = 1 - P(E)$
$P(E) + P(\bar{E}) = 1$
Conditional Probability
The probability that event B occurs given that event A has
already occurred:
$P(B|A) = \frac{P(A \text{ and } B)}{P(A)}$
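These formulas can be checked on the sample space of two dice. The sketch assumes fair dice, so that all 36 outcomes are equally likely and a probability is the number of outcomes in the event divided by the number of outcomes in the sample space. The events A, B, and C are chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space for rolling two dice: 36 equally likely outcomes (assumption)
sample_space = set(product(range(1, 7), repeat=2))

def prob(event):
    """Number of outcomes in the event divided by the number in the sample space."""
    return Fraction(len(event), len(sample_space))

A = {o for o in sample_space if o[0] % 2 == 0}   # first die is even
B = {o for o in sample_space if sum(o) == 8}     # the sum is 8
C = {o for o in sample_space if o[1] == 6}       # second die is 6

# Conditional probability: P(B|A) = P(A and B) / P(A)
p_b_given_a = prob(A & B) / prob(A)

# Complementary event: P(not B) = 1 - P(B)
p_not_b = 1 - prob(B)

# A and C are independent: P(A and C) = P(A) x P(C)
a_c_independent = prob(A & C) == prob(A) * prob(C)

# A and B are dependent: P(A and B) = P(A) x P(B|A), not P(A) x P(B)
a_b_independent = prob(A & B) == prob(A) * prob(B)

print("P(B|A)   =", p_b_given_a)
print("P(not B) =", p_not_b)
print("A, C independent?", a_c_independent)
print("A, B independent?", a_b_independent)
```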
Three Basic Interpretations of Probability
1. Classical Probability - relies on the sample space;
assumes all outcomes are equally likely to occur; actual
performance of experiment is not necessary; outcomes
are obtained by observation and tree diagram
$P(E) = \frac{\text{\# of outcomes in } E}{\text{total \# of outcomes}} = \frac{n(E)}{n(S)}$
2. Empirical Probability - uses frequency distribution;
outcomes are based on the frequency distribution and
observation
$P(E) = \frac{\text{frequency for class}}{\text{total frequencies}} = \frac{f}{n}$
3. Subjective Probability - researcher makes an educated
guess about the chance of an event occurring; experiment
performance not needed; based on educated personal
judgment/estimate, opinions and inexact information
Determination of the Number of Outcomes of Events
1. Fundamental Counting Rule - multiply the number of ways
each event in a sequence can occur (k1 · k2 · k3 · ... · kn)
2. Permutation - arrangement of n objects in a specific order
Permutation Rule - # of permutations of n objects taking r
objects at a time; order is important
${}_nP_r = \frac{n!}{(n - r)!}$
where n! = n factorial
3. Combination - selection of distinct objects without order
Combination Rule - # of combinations of r objects selected
from n objects; order is not important
${}_nC_r = \frac{n!}{(n - r)!\,r!}$
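The three counting rules map onto functions in Python's `math` module (`math.prod`, `math.perm`, and `math.comb`, available from Python 3.8). The numbers of choices in this sketch are made up for illustration.

```python
import math

# Fundamental Counting Rule: 3 shirts, 4 pants, 2 pairs of shoes (hypothetical)
outfits = math.prod([3, 4, 2])

n, r = 5, 3

# Permutation Rule: nPr = n! / (n - r)!, order is important
permutations = math.factorial(n) // math.factorial(n - r)

# Combination Rule: nCr = n! / ((n - r)! r!), order is not important
combinations = math.factorial(n) // (math.factorial(n - r) * math.factorial(r))

# The library functions give the same counts
assert permutations == math.perm(n, r)
assert combinations == math.comb(n, r)

print("outfits =", outfits)
print("5P3 =", permutations)
print("5C3 =", combinations)
```

Each combination of 3 objects can be arranged in 3! = 6 orders, which is why 5P3 = 60 is six times 5C3 = 10.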
Four Basic Probability Rules
Probability Rule 1 - probability of any event is a number
(fraction/decimal) between and including 0 and 1
$0 \le P(E) \le 1$
Probability Rule 2 - if event E can’t occur, probability is 0
Probability Rule 3 - if event E is certain, probability is 1
Probability Rule 4 - sum of the probabilities of all
outcomes in the sample space is 1
*Probability values range from 0 to 1
*When probability is near 0, occurrence is highly unlikely
*When probability is near 0.5, there is about a 50-50 chance
that the event will occur
*When probability is near 1, event is likely to occur
*When probability of an event/complement is known, the
other can be found by subtracting the probability from 1
Rules in Solving Probability of Compound Events (2 or
more)
1. Addition Rule
a) Mutually Exclusive Events - when two events A
and B are mutually exclusive, P(A or B) = P(A) + P(B)
b) Non-mutually Exclusive Events - if A and B are not
mutually exclusive, P(A or B) = P(A) + P(B) - P(A and B)
2. Multiplication Rule and Conditional Probability -
P(A and B) = P(A) x P(B) for independent events;
P(A and B) = P(A) x P(B|A) for dependent events
Probability Distribution - a relative frequency distribution of all
possible outcomes of an experiment
Different Types of Probability Distribution
1. Probability Distribution of Discrete Variables - binomial,
Poisson distribution
2. Probability Distribution of Continuous Variables - uniform,
normal distribution
Random Variables - characteristic that varies from one
component of a population to another; its values vary randomly
or by chance
1. Discrete Random Variables - has a finite or countable
number of values (0, 1, 2…)
2. Continuous Random Variables - has infinitely many values
associated with measurements on a continuous scale where
there are no gaps or interruptions (5, 5.1, 6.2…)
Discrete Probability Distribution - table, graph, or
mathematical expression that specifies all possible values
(outcomes) of a random variable with their probabilities. It
should satisfy the criteria:
1. $\sum P(x) = 1$
2. $0 \le P(x) \le 1$ for every value of x
where x is a discrete variable and
P(x) is the probability of x
Mean of a Probability Distribution - expected value; typical
value that represents the central location of a probability
distribution
$\mu = \sum x P(x)$
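A short check of the two criteria, together with the expected value, might look like the sketch below. It also computes the variance, using σ² = Σ[(x - μ)² P(x)]. The distribution (the number of heads in two tosses of a fair coin) is an assumed example, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Hypothetical distribution: x = number of heads in two tosses of a fair coin
distribution = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

# Criterion 1: the probabilities sum to 1
sums_to_one = sum(distribution.values()) == 1
# Criterion 2: every probability lies between 0 and 1
all_in_range = all(0 <= p <= 1 for p in distribution.values())

# Mean (expected value): sum of x * P(x)
mu = sum(x * p for x, p in distribution.items())
# Variance: sum of (x - mu)^2 * P(x)
variance = sum((x - mu) ** 2 * p for x, p in distribution.items())

print("valid distribution?", sums_to_one and all_in_range)
print("mean =", mu)
print("variance =", variance)
```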
Variance and Standard Deviation of a Probability
Distribution - measures the amount of spread in a
distribution
$\sigma^2 = \sum [(x - \mu)^2 P(x)]$
Binomial Distribution - with parameters n and p, is the
discrete probability distribution of the # of successes in a
sequence of n independent experiments
4 Properties of Binomial Distribution
1. Fixed Number of Trials (n)
2. Two outcomes in a trial, success or failure
3. Trials are independent
4. Probability of success p remains constant
General Formula
X ~ B(n, p)
$P(X = r) = {}_nC_r \, p^r q^{n-r}$
X = random variable
n = # of trials
r = # of successes
p = probability of success
q = probability of failure (q = 1 - p)
Mean and Variance
X ~ B(n, p)
mean: $\mu = E(X) = np$
variance: $\sigma^2 = Var(X) = npq$
where q = 1 - p
Mode - of a binomial B(n, p) distribution
$\lfloor (n+1)p \rfloor$ if (n+1)p is 0 or a noninteger
(n+1)p and (n+1)p - 1 if $(n+1)p \in \{1, \ldots, n\}$
n if (n+1)p = n + 1
Median - there is no single formula to find the median for a
binomial distribution
Multinomial Distribution - used to compute probabilities
in situations that have more than 2 possible outcomes
1. Statistical experiment with k outcomes
2. Repeated independently n times
$P = \frac{n!}{(n_1!)(n_2!)\cdots(n_k!)} \, p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}$
where P = probability
n = total # of events (trials)
n1 = # of times outcome 1 occurs
n2 = # of times outcome 2 occurs
nk = # of times outcome k occurs
p1 = probability of outcome 1
p2 = probability of outcome 2
pk = probability of outcome k
Hypergeometric Distribution - discrete probability
distribution that describes the probability of k successes in
n draws, without replacement, from a population of size N
that contains exactly K success objects, wherein each draw is
either a success or a failure
Conditions Characterizing Hypergeometric Distribution
1. The result of each draw can be classified into one of two
mutually exclusive categories (Pass/Fail, True/False)
2. The probability of a success changes on each draw, as
each draw decreases the population (sampling without
replacement from a finite population)
Hypergeometric Random Variable - the number X of successes
of a hypergeometric experiment
Probability mass function (pmf)
$P(X = k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$
where N = population size
K = # of success states in the population
n = # of draws
k = # of observed successes
$\binom{a}{b}$ = a binomial coefficient
pmf is (+) when $\max(0, n + K - N) \le k \le \min(K, n)$
pmf satisfies a recurrence relation with initial value
$P(X = 0) = \frac{\binom{N-K}{n}}{\binom{N}{n}}$
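The binomial, multinomial, and hypergeometric formulas above can each be written as a short function. This is a sketch under stated assumptions: the function names and the example parameters are hypothetical, and exact fractions are used so that the results can be compared with hand calculations.

```python
import math
from fractions import Fraction

def binomial_pmf(r, n, p):
    """P(X = r) = nCr * p^r * q^(n - r), with q = 1 - p."""
    q = 1 - p
    return math.comb(n, r) * p ** r * q ** (n - r)

def multinomial_pmf(counts, probs):
    """P = n! / (n1! n2! ... nk!) * p1^n1 * p2^n2 * ... * pk^nk."""
    n = sum(counts)
    coefficient = math.factorial(n) // math.prod(math.factorial(c) for c in counts)
    return coefficient * math.prod(p ** c for p, c in zip(probs, counts))

def hypergeometric_pmf(k, N, K, n):
    """P(X = k) = C(K, k) * C(N - K, n - k) / C(N, n)."""
    return Fraction(math.comb(K, k) * math.comb(N - K, n - k), math.comb(N, n))

# Binomial: 4 tosses of a fair coin (hypothetical), X = number of heads
n_trials, p = 4, Fraction(1, 2)
binom = [binomial_pmf(r, n_trials, p) for r in range(n_trials + 1)]
binom_mean = sum(r * pr for r, pr in enumerate(binom))
binom_var = sum((r - binom_mean) ** 2 * pr for r, pr in enumerate(binom))
print("P(X = 2) =", binom[2], " mean =", binom_mean, " variance =", binom_var)

# Multinomial: 4 trials, outcome counts (2, 1, 1), probabilities (1/2, 1/4, 1/4)
multi = multinomial_pmf([2, 1, 1], [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)])
print("multinomial P =", multi)

# Hypergeometric: N = 10 objects, K = 4 successes, n = 3 draws without replacement
N, K, n = 10, 4, 3
support = range(max(0, n + K - N), min(K, n) + 1)
hyper = {k: hypergeometric_pmf(k, N, K, n) for k in support}
print("P(X = 2) =", hyper[2], " P(X = 0) =", hyper[0])
```

For the binomial example, the mean and variance computed from the pmf agree with np = 2 and npq = 1, and the most likely value is 2, the whole-number part of (n + 1)p = 2.5.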