Audit analytics with statistics (1)
Introduction to statistics

Recommended Books (optional)

• Statistics in Plain English by Timothy C. Urdan (2010)
• Statistics Without Tears: A Primer for Non-Mathematicians by Derek Rowntree (2003)
• Applied Linear Statistical Models by Michael Kutner, Christopher Nachtsheim, John Neter and William Li (Aug 10, 2004)

What is statistics?

• Statistics is the study of how to collect, organize, analyze, and interpret numerical information from data.

Statistics is a discipline concerned with:
• designing experiments and other data collection,
• summarizing information to aid understanding,
• drawing conclusions from data, and
• estimating the present or predicting the future.

Descriptive and inferential statistics

Descriptive statistics involves methods of organizing, picturing and summarizing information from data.
Inferential statistics involves methods of using information from a sample to draw conclusions about the population.

Populations and samples

• A population is a collection of observations that represents the cases and situations of interest.
  E.g., all company transactions for a selected period of time; all people who were born on January 23, 2000.
• A sample is a selection (subset) from the population.
  E.g., a random collection of company transactions; 30 people born on January 23, 2000.

In many cases it is impossible to study the whole population, and a sample is used instead.
Statistics is often used to generalize from a sample to the population while quantifying the probability of error.

Sampling

• Sampling is a procedure for selecting sample elements from a population.
• Simple random sampling: every observation from the population has an equal chance of being selected.
  1. The population consists of N objects.
  2. The sample consists of n objects.
  3. All possible samples of n objects are equally likely to occur.
• Simple random sampling reduces sampling bias.

Sampling (cont’d)

• Representative (stratified) sampling
  1. Observations that match the population on a specific characteristic are selected.
  2. This ensures that the sample is similar to the population in a particular way.

Variables and measurement

• A variable is a characteristic of an observation.

In statistics, a variable has two defining characteristics:
  1. A variable is an attribute that describes a person, place, thing, or idea.
  2. The value of the variable can "vary" from one entity to another.
E.g., transaction: value, date, buyer, seller, etc.; person: name, age, gender, income, etc.

• Variables taxonomy

Variables and measurement: examples

Distributions

A statistical distribution is an arrangement of the values of a variable showing their observed or theoretical frequency (probability) of occurrence.

Popular distributions

• normal distribution
• uniform distribution
• t distribution
• F distribution
• chi-square distribution

Measures of central tendency

• Central tendency is the tendency of the observations to center (cluster) around a particular value.
• There are three main measures of central tendency:
  1. Mean
  2. Median
  3. Mode

Mean

• Mean is the arithmetic average of the sample values. The mean of a sample or a population is computed by adding all of the observations and dividing by the number of observations.
• Example: Suppose we draw a sample of five women and measure their weights. They weigh 100 pounds, 100 pounds, 130 pounds, 140 pounds, and 150 pounds. The mean weight equals (100 + 100 + 130 + 140 + 150)/5 = 620/5 = 124 pounds.

In the general case, the mean can be calculated using one of the following equations:

  Population mean: μ = ΣX / N
  Sample mean: x̄ = Σx / n

where ΣX is the sum of all the population observations, N is the number of population observations, Σx is the sum of all the sample observations, and n is the number of sample observations.

When statisticians talk about the mean of a population, they use the Greek letter μ to refer to the mean score.
When they talk about the mean of a sample, statisticians use the symbol x̄ to refer to the mean score.

Median

• Median is the middle value of a sample. To find the median, we rank the observations in order from smallest to largest value. If the number of observations is odd, the median is the middle value. If the number of observations is even, it is the average of the two middle values.
• Returning to the example of the five women, the median is 130 pounds, since 130 pounds is the middle weight.

Mode

• Mode is the most frequent value in a sample (it may not be unique).
• Example: The ages of 10 employees are distributed as follows:
  32, 35, 38, 38, 38, 41, 46, 47, 55, 58
  The mode is 38, since 38 is the most frequent age in the sample (it appears 3 times out of 10).

Measures of dispersion (variability)

• Measures of dispersion are used to describe the spread of the data, or its variation around a central value.
• Three measures of dispersion:
  1. Range
  2. Variance
  3. Standard deviation

Range

• Range is the difference between the largest and the smallest value in a sample.

Example: Suppose two machines produce nails which are on average 10 inches long. A sample of 11 nails is selected from each machine.

  Machine A: 6, 8, 8, 10, 10, 10, 10, 10, 12, 12, 14
  Machine B: 6, 6, 6, 8, 8, 10, 12, 12, 14, 14, 14

The range is the same for both data sets, namely 14 - 6 = 8, even though the values from Machine B are spread more widely around 10. The range, while useful, is too crude a measure of variability.

Variance

• Variance is the average of the squared differences from the mean.
• Find the difference between each data point and the mean, and average these differences.
• Measure the differences from the mean regardless of sign (positive or negative difference).
• To remove the sign, apply the square function to each difference.
• For a sample, divide by n - 1 rather than n to compute the average.

Variance (cont’d)

Hence, we will use this formula to compute the data spread, or variance:

  Variance = add up the squares of (data point - mean), then divide that sum by (n - 1)

There are two symbols for the variance, just as for the mean:
• σ² is the variance for a population
• s² is the variance for a sample

In other words, the variance is computed according to the formulas:
• σ² = Σ(X - μ)² / N (for the population variance)
• s² = Σ(x - x̄)² / (n - 1) (for the sample variance)

Standard deviation

• Standard deviation is defined as the square root of the variance, and indicates the typical deviation of the observations from the mean.
• As with the mean, there are two letters for variance and standard deviation:
  • σ² is the variance for a population and σ is the population standard deviation
  • s² is the variance for a sample and s is the sample standard deviation

Example: Consider the sample data 6, 7, 5, 3, 4. Compute the standard deviation for that data.

To compute the standard deviation, we must first compute the mean, then the variance, and finally take the square root to obtain the standard deviation.

Computing the mean:
  x̄ = (6 + 7 + 5 + 3 + 4) / 5 = 25 / 5 = 5

Computing the variance:
  s² = [(6 - 5)² + (7 - 5)² + (5 - 5)² + (3 - 5)² + (4 - 5)²] / (5 - 1) = (1 + 4 + 0 + 4 + 1) / 4 = 10 / 4 = 2.5

Standard deviation:
  s = √2.5 ≈ 1.58
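
The measures introduced on these slides can be checked with a short script. The following is a minimal sketch in Python, which the slides themselves do not use; the data come from the slide examples, while the helper names `value_range`, `sample_variance` and `summarize` are hypothetical and chosen only for illustration. The variance helper assumes sample data and therefore divides by n - 1.

```python
import math
import statistics

# Data taken from the examples on the slides.
weights = [100, 100, 130, 140, 150]                    # five women, in pounds
ages = [32, 35, 38, 38, 38, 41, 46, 47, 55, 58]        # ten employees
machine_a = [6, 8, 8, 10, 10, 10, 10, 10, 12, 12, 14]  # nail lengths
machine_b = [6, 6, 6, 8, 8, 10, 12, 12, 14, 14, 14]
sample = [6, 7, 5, 3, 4]                               # standard deviation example


def value_range(data):
    """Range: largest value minus smallest value."""
    return max(data) - min(data)


def sample_variance(data):
    """Sum of squared differences from the mean, divided by n - 1."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / (len(data) - 1)


def summarize(data):
    """Collect the measures discussed on the slides (hypothetical helper)."""
    variance = sample_variance(data)
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "modes": statistics.multimode(data),  # a list, since the mode may not be unique
        "range": value_range(data),
        "variance": variance,
        "std_dev": math.sqrt(variance),       # square root of the variance
    }


if __name__ == "__main__":
    datasets = {
        "weights": weights,
        "ages": ages,
        "machine A": machine_a,
        "machine B": machine_b,
        "sample": sample,
    }
    for name, data in datasets.items():
        print(name, summarize(data))
```

Run on the two nail samples, the script reports the same range (8) for both machines but different sample variances (4.8 for Machine A and 11.2 for Machine B), which is one way to see why the range alone is a crude measure of variability. The standard library function `statistics.variance` also divides by n - 1 and should agree with `sample_variance`.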