Uploaded by Bukwase Mgidlana

ADM cheat sheet 1

NOMINAL:
Observations of a qualitative variable can only be classified and counted.
There is no particular order to the labels.
e.g. eye color, gender, religion.
INTERVAL:
Data classifications are ordered according to the amount of the characteristic they possess.
Equal differences in the characteristic are represented by equal differences in the measurements.
e.g. temperature, dress size
DESCRIPTIVE STATS:
• Measures of location (both central and non-central location)
• Measures of spread (or dispersion) about the central location value
• Measures of shape (skewness)
ORDINAL:
Data classifications are represented by sets of labels or names (high, medium, low) that have relative values.
The data classified can be ranked or ordered, but the differences between data values cannot be determined or are meaningless.
e.g. your rank in class; juice tasting: Sprite is 1, Coke is 2, Fanta is 3, Pepsi is 4.
RATIO:
Interval level with an inherent zero starting point. Differences and ratios are meaningful.
Data classifications are ordered according to the amount of the characteristic they possess. Equal differences in the characteristic are represented by equal differences in the numbers assigned to the classifications.
The zero point is the absence of the characteristic and the ratio between two numbers is meaningful.
e.g. distance traveled; salary
Central Location Measures:
1. mean
2. mode
3. median
Non Central Location Measures:
1. Quartiles: Divides an ordered data set into 4 parts
2. Percentiles
3. Geometric Mean
MEDIAN:
1. There is a unique median for each data set.
2. It is not affected by extremely large or small values and is therefore a valuable measure of central tendency when such values occur.
3. It can be computed for ratio-level, interval-level, and ordinal-level data.
4. It can be computed for an open-ended frequency distribution if the median does not lie in an open-ended class.
MODE:
1. The value of the observation that appears most frequently.
2. Especially useful in summarizing nominal-level data.
3. Advantage: Not affected by outliers.
4. Disadvantages: For many sets of data, there is no mode because no value appears more than once. OR there is more than one mode.
MEAN:
1. The most widely used measure of location.
2. Features: all values are used; it is unique; the sum of the deviations from the mean is 0; calculated by summing the values and dividing by the number of values.
3. Weakness: can be distorted by outliers.
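The three central location measures can be compared on one small data set. This is a minimal sketch using Python's standard `statistics` module; the salary figures are hypothetical and chosen only to include one extreme value.

```python
from statistics import mean, median, multimode

# Hypothetical salaries (in thousands) with one extreme value
salaries = [20, 22, 22, 25, 27, 30, 200]

avg = mean(salaries)         # uses every value, so the outlier pulls it up
mid = median(salaries)       # middle value of the ordered data
modes = multimode(salaries)  # list of the most frequent value(s)

# The deviations from the mean sum to zero (up to floating-point rounding)
deviation_sum = sum(x - avg for x in salaries)

print(avg, mid, modes, round(deviation_sum, 9))
```

Here the mean (about 49.4) lies well above the median (25) because of the single large salary, which illustrates why the median is less affected by extreme values.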
GEOMETRIC MEAN:
1. Used where data represents percentage changes
2. The data must be represented as decimal values: a 4% increase is 1.04; a 4% decrease is 0.96.
3. Geometric mean = nth root of the product of the data points (where n = sample size)
4. GM = (X1 × X2 × X3 × … × Xn)^(1/n)
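The steps above can be sketched in a few lines of Python. The percentage changes are hypothetical, and the variable names are illustrative only.

```python
import math

# Hypothetical yearly changes: +4%, -4%, +10%
percent_changes = [4.0, -4.0, 10.0]

# Express each change as a decimal growth factor (about 1.04, 0.96, 1.10)
factors = [1 + p / 100 for p in percent_changes]

# nth root of the product of the factors
n = len(factors)
gm = math.prod(factors) ** (1 / n)

# Convert the average growth factor back into an average percentage change
average_change_pct = (gm - 1) * 100
print(round(gm, 4), round(average_change_pct, 2))
```

The standard library also provides `statistics.geometric_mean`, which should give the same result for positive inputs.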
Z-score:
Formulae: z = (x - µ)/σ
Solve for x: x = µ + zσ
The standard normal distribution:
A normal distribution with a mean of 0 and a standard deviation of 1.
It is also called the z distribution.
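A short sketch of the two z-score formulae, assuming a hypothetical population with mean 50 and standard deviation 10 (the function names are illustrative):

```python
def z_score(x, mu, sigma):
    """Standardise x: how many standard deviations x lies from the mean."""
    return (x - mu) / sigma

def x_from_z(z, mu, sigma):
    """Invert the z-score formula: x = mu + z * sigma."""
    return mu + z * sigma

# Hypothetical population: mean 50, standard deviation 10
mu, sigma = 50.0, 10.0
z = z_score(65.0, mu, sigma)   # 1.5 standard deviations above the mean
x = x_from_z(z, mu, sigma)     # recovers the original value, 65.0
print(z, x)
```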
Measure of Dispersion:
1. Range
2. Standard deviation
3. Variance
4. Coefficient of variation
VARIANCE AND STANDARD DEVIATION:
1. Variance is the arithmetic mean of the squared deviations from the mean.
2. The standard deviation (the square root of the variance) is the most common and useful measure of dispersion because it reflects the typical distance of each observation from the mean.
3. Commonly used as a measure to compare the spread in two or more sets of observations.
4. Advantages: uses all the values of a data set; the standard deviation is expressed in the same units as the observations.
5. The variance and standard deviations are nonnegative and are zero only if all
observations are the same.
6. For populations whose values are near the mean, the variance and standard
deviation will be small.
7. For populations whose values are dispersed from the mean, the population
variance and standard deviation will be large.
8. The variance overcomes the weakness of the range by using all the values in the
population.
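The points above can be checked numerically with Python's `statistics` module. The data are hypothetical; the sample versions (dividing by n - 1) are shown for comparison only and are not discussed in this sheet.

```python
from statistics import pvariance, pstdev, variance, stdev

data = [4, 8, 6, 5, 7]  # hypothetical observations, mean = 6

pop_var = pvariance(data)  # mean of the squared deviations (divides by N)
pop_sd = pstdev(data)      # square root of the population variance
samp_var = variance(data)  # sample version (divides by n - 1)
samp_sd = stdev(data)      # square root of the sample variance

# Identical observations have no spread, so the variance is zero
constant_var = pvariance([3, 3, 3])
print(pop_var, pop_sd, samp_var, samp_sd, constant_var)
```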
RANGE:
1. The simplest measure of dispersion
2. Computed by subtracting the lowest value of a data set from the highest value in the set.
3. Not a reliable measure of dispersion, since it only uses two values from
the data set.
4. Extreme values can distort the range to be very large while most of the
elements may actually be very close together.
5. Widely used in statistical process control (SPC) applications.
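A small illustration of point 4, using two hypothetical data sets that differ in a single extreme value:

```python
def value_range(values):
    """Range = highest value minus lowest value."""
    return max(values) - min(values)

tight = [48, 49, 50, 51, 52]          # hypothetical, closely packed values
with_outlier = [48, 49, 50, 51, 152]  # same data with one extreme value

print(value_range(tight), value_range(with_outlier))
```

Four of the five values are unchanged, yet the range grows from 4 to 104.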
NORMAL DISTRIBUTION AND STANDARD DEVIATIONS OF THE MEAN:
• 68.3% of all data values lie within 1 SD of the mean (between the lower limit of [mean – SD] and the upper limit of [mean + SD]).
• 95.5% of all data values lie within 2 SD of the mean (between the lower limit of [mean – 2SD] and the upper limit of [mean + 2SD]).
• 99.7% of all data values lie within 3 SD of the mean (between the lower limit of [mean – 3SD] and the upper limit of [mean + 3SD]).
Measure of Skewness:
Pearson's coefficient of skewness (SK)
Shapes commonly observed:
• SK = 0, distribution is symmetrical. Hence mean = median = mode.
• SK > 0, distribution is positively skewed. Hence mean > median.
• SK < 0, distribution is negatively skewed. Hence mean < median.
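The formula for Pearson's coefficient is not reproduced in this sheet. The sketch below assumes the common median-based form, SK = 3 × (mean - median) / SD, and uses the sample standard deviation; both choices are assumptions, and the data sets are hypothetical.

```python
from statistics import mean, median, stdev

def pearson_skewness(values):
    """Median-based Pearson coefficient (assumed form): 3 * (mean - median) / SD."""
    return 3 * (mean(values) - median(values)) / stdev(values)

symmetric = [1, 2, 3, 4, 5]     # mean equals median
right_tail = [1, 2, 3, 4, 20]   # hypothetical data with a large outlier
left_tail = [-14, 2, 3, 4, 5]   # hypothetical data with a small outlier

for name, values in [("symmetric", symmetric),
                     ("right tail", right_tail),
                     ("left tail", left_tail)]:
    print(name, round(pearson_skewness(values), 3))
```

The sign of the result follows the rules above: zero when mean = median, positive when mean > median, negative when mean < median.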
POSITIVELY SKEWED:
• Median is preferred.
• Mean will be most influenced (inflated and distorted) by large outliers and hence will lie furthest to the RHS of the mode and median.
• Distribution will have a long 'tail' to the RHS.
NEGATIVELY SKEWED:
• Median is preferred.
• Mean will be most influenced (deflated and distorted) by small outliers and hence will lie furthest to the LHS of the mode and median.
• Distribution will have a long 'tail' to the LHS.
Coefficient of Variation: (expressed as %)
• CV = (standard deviation / mean) × 100%
• CV is a measure of relative variability.
• It is therefore possible to compare the variability of data across different samples, especially if the
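A minimal sketch of the coefficient of variation, computed for two hypothetical samples that have the same standard deviation but different means (the population standard deviation is used here as an assumption):

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """CV as a percentage: standard deviation / mean * 100."""
    return pstdev(values) / mean(values) * 100

heights_cm = [150, 160, 170, 180, 190]  # hypothetical sample
weights_kg = [50, 60, 70, 80, 90]       # hypothetical sample

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)
print(round(cv_heights, 1), round(cv_weights, 1))
```

Both samples have the same standard deviation, but the weights vary more relative to their mean, so their CV is larger.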
Outliers:
• A data value (x) that has a z-score either below -3 or above +3.
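The outlier rule can be sketched as a filter on z-scores. The readings are hypothetical, and the population standard deviation is assumed when standardising.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=3.0):
    """Return the values whose z-score lies below -threshold or above +threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    return [x for x in values if abs((x - mu) / sigma) > threshold]

# Hypothetical readings: nineteen values of 10 and one extreme value
readings = [10] * 19 + [100]
print(find_outliers(readings))
```

In very small data sets no value can reach a z-score of 3, so the rule is only informative with enough observations.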
BINOMIAL
BINOM.DIST (x; n; p; cumulative?)
Descriptive stats:
1. Mean: µ = np
2. Standard deviation: σ = √(np(1-p))
n = sample size;
p = probability of a success outcome on a single independent object
POISSON
POISSON.DIST (x; mean; cumulative?)
Descriptive stats:
1. Mean: µ = λ
2. Standard deviation: σ = √λ
λ = the mean number of occurrences of a given outcome of the random variable for a predetermined time, space or volume interval.
NORM.DIST (x; mean; SD; cumulative?)
Sample Mean:
• It is normally distributed
• It has a mean equal to the population mean, µ
• It has a standard deviation, called the standard error, σx̄, equal to σ/√n
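For readers without a spreadsheet, the binomial and Poisson probabilities and their descriptive statistics can be sketched with Python's `math` module. The function names and parameter values below are illustrative assumptions; the cumulative versions correspond to setting the "cumulative?" argument to true.

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) for n independent trials with success probability p."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def binomial_cdf(x, n, p):
    """P(X <= x): the cumulative binomial probability."""
    return sum(binomial_pmf(k, n, p) for k in range(x + 1))

def poisson_pmf(x, lam):
    """P(X = x) when the mean number of occurrences per interval is lam."""
    return math.exp(-lam) * lam**x / math.factorial(x)

def poisson_cdf(x, lam):
    """P(X <= x): the cumulative Poisson probability."""
    return sum(poisson_pmf(k, lam) for k in range(x + 1))

# Hypothetical binomial example: n = 10 trials, p = 0.3
n, p = 10, 0.3
binom_mean = n * p                     # mean = np
binom_sd = math.sqrt(n * p * (1 - p))  # SD = sqrt(np(1-p))

# Hypothetical Poisson example: 4 occurrences per interval on average
lam = 4.0
poisson_sd = math.sqrt(lam)            # SD = sqrt(lambda)

print(round(binomial_pmf(3, n, p), 4), round(binomial_cdf(3, n, p), 4))
print(round(poisson_pmf(2, lam), 4), round(poisson_cdf(2, lam), 4))
print(binom_mean, round(binom_sd, 4), poisson_sd)
```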
Convenience sampling:
• The sample is drawn to suit the convenience of the researcher.
• E.g. select motorists from only one petrol station; select items for inspection from only one shift instead of a number of shifts.
Snowball sampling:
• Used when it is not easy to identify the members of the target population for reasons of sensitivity or confidentiality (e.g. in studies related to HIV, gangster activity, sexuality, illegal immigrants). If one member can be identified, then this person is asked to identify other members of the same target population. This selection of sampling units is non-random and potentially biased.
Non-probability sampling methods:
1. Convenience sampling
2. Judgment sampling
3. Quota sampling
4. Snowball sampling
Disadvantages of non-probability sampling:
• The samples are likely to be unrepresentative of their target population. This will introduce bias into the statistical findings, because significant sections of the population are likely to have been omitted from the selection process.
• It is not possible to measure the sampling error from data based on a non-probability sample. Sampling error is the difference between the actual population parameter value and its sample statistic. As a result, it is not valid to draw statistical inferences from non-probability sample data.
• However, non-probability samples can be useful in exploratory research situations or in less-scientific surveys to provide initial insights into and profiles of random variables under study.
Stratified Random Sampling:
• Used when the population is assumed to be heterogeneous with respect to the random variable under study. The population is divided into segments (strata), where the population members within each stratum are relatively homogeneous. Thereafter, simple random samples are drawn from each stratum.
• If the random samples are drawn in proportion to the relative size of each stratum, then this method of sampling is called proportional stratified random sampling.
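Drawing from each stratum in proportion to its size can be sketched as follows. The strata, their sizes, and the function name are hypothetical; note that rounding each allocation means the total drawn may differ slightly from the requested sample size in general.

```python
import random

def proportional_stratified_sample(strata, sample_size, seed=None):
    """strata maps a stratum name to the list of its population members."""
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Allocate in proportion to the stratum's share of the population
        k = round(sample_size * len(members) / total)
        # Simple random sample (without replacement) within the stratum
        sample[name] = rng.sample(members, k)
    return sample

# Hypothetical population of 1000 firms split into three strata
strata = {
    "small": [f"S{i}" for i in range(100)],
    "medium": [f"M{i}" for i in range(300)],
    "large": [f"L{i}" for i in range(600)],
}
picked = proportional_stratified_sample(strata, sample_size=50, seed=1)
print({name: len(units) for name, units in picked.items()})
```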
Quota Sampling:
• Setting of quotas of sampling units to interview from specific subgroups of a population. When the quota for any one subgroup is met, no more sampling units are selected from that subgroup for interview. This introduces selection bias into the sampling process. The main feature of quota sampling is the non-random selection of sampling units to fulfill the quota limits.
• E.g. a researcher may set a quota to interview 40 males and 70 females from the 25- to 40-year age group on savings practices. When the quota of interviews for any one subgroup is reached (either male or female), no further eligible sampling units from that subgroup are selected for interview purposes.
Judgment Sampling:
• Researchers use their judgment to select the best sampling units to include in the sample.
• E.g. only professional rugby players are interviewed on the need for rule changes in the sport; only labour union leaders are selected (instead of general workers) to respond to a study regarding working conditions in the mining industry.
Systematic Random Sampling:
• It is used when a sampling frame (i.e. an address list or database of population members) exists. Sampling begins by randomly selecting the first sampling unit. Thereafter subsequent sampling units are selected at a uniform interval relative to the first sampling unit. Since only the first sampling unit is randomly selected, some randomness is sacrificed.
• To draw a systematic random sample, first divide the sampling frame by the sample size to determine the size of a sampling block. Randomly choose the first sample member from within the first sampling block. Then choose subsequent sample members by selecting one member from each sampling block at a constant interval from the previously sampled member.
Simple Random Sampling:
• It is assumed that the population is homogeneous with respect to the random variable under study (i.e. the sampling units share similar views on the research questions; or the objects in a population are influenced by the same background factors).
• One way to draw a simple random sample is to assign a number to every element of the population and then effectively 'draw numbers from a hat'. If a database of names exists, then a random number generator can be used to draw a simple random sample.
• E.g. the population of Cape Town motorists is to be surveyed for their views on toll roads. A simple random sample of Cape Town motorists is assumed to be representative of this population as their views are unlikely to differ significantly across gender, age, car type driven.
• E.g. in a production process, parts that come off the same production line can be selected using simple random sampling to check the quality of the entire batch produced.
Advantages of random sampling methods:
• Random sampling reduces selection bias, meaning that the sample statistics are likely to be 'better' (unbiased) estimates of their population parameters.
• The error in sampling (sampling error) can be calculated from data that is recorded using random sampling methods. This makes the findings of inferential analysis valid.
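The simple random and systematic procedures described above can be sketched with Python's `random` module. The sampling frame is a hypothetical numbered list, and the sketch assumes the frame size is a multiple of the sample size.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """'Draw numbers from a hat': every member is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

def systematic_sample(frame, n, seed=None):
    """Random start in the first block, then one member per block at a constant interval."""
    rng = random.Random(seed)
    block = len(frame) // n       # size of each sampling block
    start = rng.randrange(block)  # random position within the first block
    return [frame[start + i * block] for i in range(n)]

# Hypothetical sampling frame: population members numbered 1 to 100
frame = list(range(1, 101))
srs = simple_random_sample(frame, 10, seed=42)
sys_sample = systematic_sample(frame, 10, seed=42)
print(sorted(srs))
print(sys_sample)
```

In the systematic sample only the starting position is random; every later member is fixed by the interval, which is the loss of randomness noted above.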
Cluster Random Sampling:
• Certain target populations form natural clusters, which make for easier sampling.
• E.g. labour forces cluster within factories; students cluster within educational institutions; outputs from different production runs (e.g. margarine tubs) are batched and labelled separately, forming clusters.
• Sampling units within the sampled clusters may themselves be randomly selected to provide a representative sample from the population. For this reason it is called two-stage cluster sampling (e.g. select schools (stage 1) as clusters, then pupils within schools (stage 2)).
• Tends to be used when the population is large and geographically dispersed.
• Advantage: reduces the per unit cost of sampling.
• Disadvantage: tends to produce larger sampling errors than those resulting from simple random sampling.
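The two-stage schools example can be sketched as follows. The school and pupil names, cluster sizes, and function name are all hypothetical.

```python
import random

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: randomly pick clusters. Stage 2: randomly pick units within each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # stage 1
    return {name: rng.sample(clusters[name], n_per_cluster)  # stage 2
            for name in chosen}

# Hypothetical population: 8 schools with 30 pupils each
schools = {
    f"school_{s}": [f"school_{s}_pupil_{i}" for i in range(30)]
    for s in range(8)
}
sample = two_stage_cluster_sample(schools, n_clusters=3, n_per_cluster=5, seed=7)
for school, pupils in sample.items():
    print(school, pupils)
```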
Level of confidence (1-α)    α/2     z
90%                          5%      1.645
95%                          2.5%    1.96
98%                          1%      2.33
99%                          0.5%    2.575
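The z-values in the table can be reproduced from the standard normal distribution with `statistics.NormalDist`. The second function shows one common use, a confidence interval for a mean with known σ built from the standard error σ/√n; that interval formula is not stated in this sheet, so treat it as an assumption. All numbers are hypothetical.

```python
import math
from statistics import NormalDist

def z_for_confidence(level):
    """z-value that leaves alpha/2 in each tail, where alpha = 1 - level."""
    alpha = 1 - level
    return NormalDist().inv_cdf(1 - alpha / 2)

def confidence_interval(sample_mean, sigma, n, level):
    """Assumed form: sample mean +/- z * (sigma / sqrt(n)), with sigma known."""
    margin = z_for_confidence(level) * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

for level in (0.90, 0.95, 0.98, 0.99):
    print(level, round(z_for_confidence(level), 3))

# Hypothetical sample: mean 100, known sigma 15, n = 36, 95% confidence
low, high = confidence_interval(100.0, 15.0, 36, 0.95)
print(round(low, 2), round(high, 2))
```

The computed values agree with the table to within rounding (the table rounds 2.326 to 2.33 and 2.576 to 2.575).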