Uploaded by rubyw411

Lecture2-1 - introduction to statistics

Audit analytics with statistics (1)
Introduction to statistics
1
Recommended Books (optional)
• Statistics in Plain English by Timothy C. Urdan (2010)
• Statistics Without Tears: A Primer for Non-Mathematicians by Derek
Rowntree (2003)
• Applied Linear Statistical Models by Michael Kutner, Christopher
Nachtsheim, John Neter and William Li (2004)
2
What is statistics?
• Statistics is the study of how to collect, organize, analyze, and
interpret numerical information from data.
Statistics is a discipline which is concerned with:
• designing experiments and other data collection,
• summarizing information to aid understanding,
• drawing conclusions from data, and
• estimating the present or predicting the future.
3
Descriptive and inferential statistics
Descriptive statistics involves methods of organizing, picturing and
summarizing information from data.
Inferential statistics involves methods of using information from a
sample to draw conclusions about the population.
4
Populations and samples
• A population is the collection of all observations that represent the cases or situations of interest.
E.g.,
All company transactions for a selected period of time
All people who were born on January 23, 2000
• A sample is a selection (subset) from the population.
E.g.,
A random collection of company transactions
30 people born on January 23, 2000
In many cases it is impossible to study the whole population, so a sample is used instead.
Statistics is often used to generalize from a sample to the population while quantifying the probability of error.
5
Sampling
• Sampling is a procedure for selecting sample elements from a population.
• Simple random sampling – every observation from the population
has an equal chance of being selected:
1. The population consists of N objects.
2. The sample consists of n objects.
3. All possible samples of n objects are equally likely to occur.
• Simple random sampling reduces sampling bias.
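The three conditions above can be sketched in Python. This is only an illustration: the population of 1,000 transaction IDs, the sample size of 30, and the fixed seed are all assumptions made for the example.

```python
import random

# Hypothetical population: N = 1000 transaction IDs (assumed for illustration).
population = list(range(1, 1001))
n = 30  # assumed sample size

# A fixed seed is used only so that the sketch is reproducible.
rng = random.Random(42)

# random.sample draws n distinct elements without replacement,
# so every possible sample of n objects is equally likely.
sample = rng.sample(population, n)

print(sorted(sample))
```

Because the draw is without replacement, no transaction can appear twice in the sample.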
6
Sampling (cont’d)
• Representative (stratified) sampling
1. Observations that match the population on a specific characteristic
are selected.
2. This ensures that the sample is similar to the population in a particular way.
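One common way to do this is proportional stratified sampling: split the population by the characteristic and draw the same fraction at random from each group. The sketch below assumes a hypothetical population of 1,000 transactions labelled "domestic" or "foreign"; the helper name `stratified_sample` and the 10% fraction are illustrative choices, not part of the lecture.

```python
import random
from collections import defaultdict

# Hypothetical population: (transaction_id, region) pairs,
# 70% "domestic" and 30% "foreign" (assumed for illustration).
population = [(i, "domestic") for i in range(700)] + \
             [(i, "foreign") for i in range(700, 1000)]


def stratified_sample(items, key, fraction, rng):
    """Draw the same fraction at random from each stratum defined by key(item)."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample


rng = random.Random(0)  # fixed seed only for reproducibility
sample = stratified_sample(population, key=lambda t: t[1], fraction=0.1, rng=rng)

counts = {region: sum(1 for _, r in sample if r == region)
          for region in ("domestic", "foreign")}
print(counts)  # {'domestic': 70, 'foreign': 30}
```

The sample keeps the 70/30 split of the population, which is the sense in which it is similar to the population in a particular way.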
7
Variables and measurement
• A variable is a characteristic of an observation
In statistics, a variable has two defining characteristics:
1. A variable is an attribute that describes a person, place, thing, or idea.
2. The value of the variable can "vary" from one entity to another.
e.g.,
transaction: value, date, buyer, seller, etc.
person: name, age, gender, income, etc.
• Variables taxonomy
8
Variables and measurement-examples
9
Distributions
A statistical distribution is an arrangement of the values of a variable
showing their observed or theoretical frequency (probability) of
occurrence.
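An observed distribution can be built by counting how often each value occurs. The sketch below uses made-up data (the number of line items on 12 hypothetical invoices) purely for illustration.

```python
from collections import Counter

# Hypothetical data: number of line items on each of 12 invoices.
line_items = [1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 2, 3]

counts = Counter(line_items)   # observed frequency of each value
total = len(line_items)
relative = {value: count / total for value, count in counts.items()}

for value in sorted(counts):
    print(f"{value}: count={counts[value]}, relative frequency={relative[value]:.2f}")
```

The relative frequencies sum to 1, which is what lets them be read as observed probabilities.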
10
Popular distributions
• normal distribution
• F distribution
• uniform distribution
• chi-square distribution
• t distribution
11
Measures of central tendency
• Central tendency is the tendency of observations to center (cluster)
around a particular value
• There are three main measures of central tendency:
1. Mean
2. Median
3. Mode
12
Mean
• The mean is the arithmetic average of the values in a sample or a population. It is
computed by adding all of the observations and dividing by the number of observations.
• Example:
Suppose we draw a sample of five women and measure their weights. They weigh 100 pounds, 100
pounds, 130 pounds, 140 pounds, and 150 pounds.
The mean weight equals (100 + 100 + 130 + 140 + 150)/5 = 620/5 = 124 pounds. In the general
case, the mean can be calculated using one of the following equations:
Population mean: $\mu = \frac{\sum X}{N}$ OR Sample mean: $\bar{x} = \frac{\sum x}{n}$
where $\sum X$ is the sum of all the population observations, $N$ is the number of population observations,
$\sum x$ is the sum of all the sample observations, and $n$ is the number of sample observations.
When statisticians talk about the mean of a population, they use the Greek letter $\mu$ to refer to the
mean score. When they talk about the mean of a sample, statisticians use the symbol $\bar{x}$ (x-bar) to refer to
the mean score.
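The calculation for the five weights can be checked with a few lines of Python. This is a minimal sketch; the variable names are illustrative.

```python
from statistics import mean

weights = [100, 100, 130, 140, 150]  # pounds, from the example above

# Sample mean: sum of the observations divided by their number.
manual_mean = sum(weights) / len(weights)
print(manual_mean)  # 124.0

# The standard library gives the same value.
library_mean = mean(weights)
print(library_mean)
```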
13
Median
• The median is the middle value of a sample. To find the median, we rank
the observations in order from smallest to largest value. If the
number of observations is odd, it is the middle value. If the number of
observations is even, it is the average of the two middle values.
• Returning to the example of the five women, the median value would
be 130 pounds, since 130 pounds is the middle weight.
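The odd and even cases can be written out as a short function. The sketch below is a plain implementation of the rule as described; the four-value list is an extra illustration that is not in the slides.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


weights = [100, 100, 130, 140, 150]   # the five women from the example
print(median(weights))                # 130

even_weights = [100, 100, 130, 140]   # illustrative even-sized sample
print(median(even_weights))           # 115.0
```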
14
Mode
• The mode is the most frequent value in a sample (it may not be unique).
• Example:
The ages of 10 employees are as follows:
32, 35, 38, 38, 38, 41, 46, 47, 55, 58
The mode is equal to 38, since 38 is the most frequent age in the sample
(it appears 3 times out of 10).
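Since the mode may not be unique, a sketch that finds it should return every value tied for the highest count. The following Python does that for the employee ages; the variable names are illustrative.

```python
from collections import Counter

ages = [32, 35, 38, 38, 38, 41, 46, 47, 55, 58]  # from the example above

counts = Counter(ages)
highest = max(counts.values())

# Keep every value that reaches the highest count (there may be more than one).
modes = [value for value, count in counts.items() if count == highest]

print(modes, highest)  # [38] 3
```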
15
Measures of dispersion (variability)
• Measures of dispersion are used to describe the spread of the data,
or its variation around a central value.
• Three measures of dispersion:
1. Range
2. Variance
3. Standard deviation
16
Range
• The range is the difference between the largest and the smallest value in a
sample.
Example: Suppose two machines produce nails which are on average 10
inches long. A sample of 11 nails is selected from each machine.
Machine A: 6, 8, 8, 10, 10, 10, 10, 10, 12, 12, 14
Machine B: 6, 6, 6, 8, 8, 10, 12, 12, 14, 14, 14
The range is the same for both data sets, namely 14 - 6 = 8, even though the values from
Machine B are spread more widely around 10. The range, while useful, is too crude a measure of variability.
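A quick check of the two ranges in Python, as a minimal sketch with illustrative variable names:

```python
machine_a = [6, 8, 8, 10, 10, 10, 10, 10, 12, 12, 14]
machine_b = [6, 6, 6, 8, 8, 10, 12, 12, 14, 14, 14]

# Range: largest value minus smallest value.
range_a = max(machine_a) - min(machine_a)
range_b = max(machine_b) - min(machine_b)

print(range_a, range_b)  # 8 8
```

Only the two extreme values enter the calculation, so the range cannot tell these two samples apart.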
17
Variance
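• Variance is the average of the squared differences from the mean.
• To measure spread, find the difference between each data point and the mean, and
average these differences.
• The differences should count regardless of their sign (positive
or negative); otherwise they cancel each other out.
• A square function is chosen for this: each difference is squared before averaging.
• For a sample, the sum is divided by n – 1 rather than n to compute the average.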
18
Variance (cont’d)
Hence, we will use this formula to compute the data spread, or variance, of a sample:
Variance = add up the squares of (data point - mean), then divide that sum by (n - 1)
There are two symbols for the variance, just as for the mean:
• $\sigma^2$ is the variance for a population
• $s^2$ is the variance for a sample
In other words, the variance is computed according to the formulas:
• $\sigma^2 = \frac{\sum (X - \mu)^2}{N}$ (for the population variance)
• $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$ (for the sample variance)
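Both formulas can be sketched directly in Python. The example below applies them to the nail lengths from the range slide; the function names are illustrative, and the data are treated as samples.

```python
def population_variance(values):
    """Sum of squared differences from the mean, divided by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared differences from the mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


machine_a = [6, 8, 8, 10, 10, 10, 10, 10, 12, 12, 14]
machine_b = [6, 6, 6, 8, 8, 10, 12, 12, 14, 14, 14]

var_a = sample_variance(machine_a)
var_b = sample_variance(machine_b)
print(round(var_a, 2), round(var_b, 2))  # 4.8 11.2
```

Both samples have mean 10 and range 8, but the sample variance of Machine B is more than twice that of Machine A, which is the difference the range could not show.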
19
Standard deviation
• Standard deviation is defined as the square root of the variance, and shows roughly how far observations
typically deviate from the mean.
• As with the mean, there are two symbols each for the variance and the standard deviation:
• $\sigma^2$ is the variance for a population and $\sigma$ is the population standard deviation
• $s^2$ is the variance for a sample and $s$ is the sample standard deviation
Example: Consider the sample data 6, 7, 5, 3, 4. Compute the standard deviation for that data.
To compute the standard deviation, we must first compute the mean, then the variance, and finally we can
take the square root to obtain the standard deviation.
Computing the mean: $\bar{x} = \frac{6 + 7 + 5 + 3 + 4}{5} = \frac{25}{5} = 5$
Computing the variance: $s^2 = \frac{(6-5)^2 + (7-5)^2 + (5-5)^2 + (3-5)^2 + (4-5)^2}{5 - 1} = \frac{10}{4} = 2.5$
Standard deviation: $s = \sqrt{2.5} \approx 1.58$
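The three steps of the example (mean, then variance, then square root) can be followed in Python. This is a minimal sketch using the n - 1 divisor because the data are described as a sample; the standard library's `statistics.stdev` is used only as a cross-check.

```python
import math
from statistics import stdev

data = [6, 7, 5, 3, 4]  # sample data from the example

# Step 1: the mean.
x_bar = sum(data) / len(data)

# Step 2: the sample variance (divide by n - 1).
s_squared = sum((x - x_bar) ** 2 for x in data) / (len(data) - 1)

# Step 3: the standard deviation is the square root of the variance.
s = math.sqrt(s_squared)

print(x_bar, s_squared, round(s, 2))  # 5.0 2.5 1.58
print(round(stdev(data), 2))          # 1.58
```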
20