Last Update: 23rd February 2011
SESSION 5 & 6
Introduction to Statistics

Lecturer: Florian Boehlandt
University: University of Stellenbosch Business School
Domain: http://www.hedge-fundanalysis.net/pages/vega.php

Learning Objectives Part 1
1. Sampling (Random Sampling)
2. Sampling Error
3. Nonsampling Error

Sampling
• Why? Cost!
• The sample proportions are used as an estimate for the population proportions
• Examples:
  – Nielsen ratings (1,000 television viewers)
  – Quality Management (destroy items?)

Terminology
• Target Population: the population about which statisticians want to draw inferences
• Sampled Population: the actual population from which the sample is taken
• The sample statistic is a good estimator of the population parameter if target population = sampled population
• Self-selected samples are almost always biased, because individuals who participate are more keenly interested in the issue than nonparticipants (SLOP = self-selected opinion poll)

Sampling Plan
• A simple random sample is a sample selected in such a way that every possible sample with the same number of observations is equally likely to be chosen.
• A stratified random sample is obtained by separating the population into mutually exclusive sets (strata), and then drawing simple random samples from each stratum.
• A cluster sample is a simple random sample of groups or clusters of elements.

Simple Random Sampling
• Concept: as in a raffle, each element of the chosen population is assigned a unique number and then ‘drawn from a hat’
  + Social security numbers
  + Student numbers
  – Telephone numbers
• A random number table / random number generator (Excel: RAND) can be used to select sample numbers.
• Example Tax Returns (Keller 2006: p. 148)

Stratified Random Sampling
• Concept: Increase the amount of information about the population
• Examples of criteria separating the population into strata:
  – Gender
  – Age
  – Occupation
  – Household Income
• Example Proposed Tax Increase:
  1. Draw random samples from four income groups according to their proportions in the population
  2. Make adjustments before making inferences about the entire population

  Stratum   Income (‘000s)   Population %   Sample
  1         Under 25         25%            250
  2         25-49            40%            400
  3         50-75            30%            300
  4         Over 75          5%             50
  Total Sample                              1,000

Systematic Sampling
• Concept: sample members are chosen in a regular manner, working progressively through the list
• Example Vega students: to select 500 students from Vega’s 8,500 enrolled students, compute 8,500 / 500 = 17. Thus, every 17th student would be selected.

Cluster Sampling
• Concept: Useful when it is difficult or costly to develop a complete list of population members (i.e. making it difficult to draw a simple random sample) or when the population elements are widely dispersed (geographically)
• Example: Each block within a city represents a cluster. A sample of clusters could then be selected and every household within these clusters is questioned (sampling error? sample size)

Sampling Error
• Sampling error refers to the differences between the sample and the population that exist because of the observations that happened to be selected for the sample.
  The value of the sample mean will deviate from the population mean simply by chance.
• The difference between the true (unknown) value of the population mean μ and its estimate (the sample mean x-bar) is the sampling error
• The only way to reduce the expected size of the sampling error is to increase the sample size n

Nonsampling Error
• Nonsampling errors are due to mistakes made in the acquisition of data or due to the sample observations being selected improperly
• Nonsampling errors are more serious than sampling errors, because taking a larger sample won’t diminish the size, or possibility of occurrence, of this error

Types of Nonsampling Error
• Errors in data acquisition: incorrect measurements/responses, inaccurate recording
• Nonresponse error: bias introduced when responses are not obtained from some members of the sample, so that the responses are not representative of the target population (e.g. self-administered surveys)
• Selection bias: some members of the target population cannot possibly be included in the sample (e.g. members have no phone)

Learning Objectives Part 2
4. Frequency Tables
5. Histograms
6. Class Intervals and Width

Frequency Tables – Data Types
Interval Data → Class Intervals (count the number of observations that fall into each of a series of intervals) → Frequency Distribution → Histogram
Nominal Data / Ordinal Data → Categories (count the number of times each category of the variable occurs) → Frequency Distribution → Bar Chart

• There are times when a data set contains a large number of values (even when the data type is nominal) that would result in a table with too many rows to be convenient. We can overcome this problem by grouping the data into fewer categories or classes and then compiling a grouped frequency distribution.
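As a minimal sketch of this grouping idea, the Python snippet below tallies a small list of class marks first as an ungrouped frequency table (one row per distinct value) and then as a grouped one (one row per class). The marks, the class width of 10, the lower limit of 40 and the helper name `grouped_frequencies` are all illustrative assumptions, not taken from the lecture.

```python
from collections import Counter


def grouped_frequencies(values, lower, width, n_classes):
    """Count the values falling into each class [lower + k*width, lower + (k+1)*width)."""
    counts = [0] * n_classes
    for v in values:
        k = int((v - lower) // width)  # index of the class that contains v
        if 0 <= k < n_classes:
            counts[k] += 1
    return counts


# Hypothetical class marks out of 100 (made up for illustration).
marks = [52, 67, 71, 48, 85, 67, 59, 73, 91, 64, 67, 78]

# Ungrouped: one category per distinct mark.
ungrouped = Counter(marks)

# Grouped: six classes of width 10, starting at 40 (assumed values).
grouped = grouped_frequencies(marks, lower=40, width=10, n_classes=6)

for k, count in enumerate(grouped):
    lo = 40 + 10 * k
    print(f"{lo}-{lo + 9}: {count}")
```

With these assumed marks the ungrouped table needs one row per distinct value, while the grouped table needs only six rows, which is the convenience the slide refers to.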
Frequency Tables – Data Types
Grouped Data → Class Intervals (count the number of observations that fall into each of a series of intervals) → Frequency Distribution → Histogram
Ungrouped Data → Categories (count the number of times each category of the variable occurs) → Frequency Distribution → Bar Chart

• Example 1: Coffee refills
  Data type nominal; data ungrouped → Categories
• Example 2: Class marks out of 100
  Data type nominal; BUT: data may be grouped → Class intervals (approximately interval)
• Example 3: Waiting times at supermarket cashiers
  Data type interval → Class intervals

Number of Categories
Nominal / not grouped:
1. Determine maximum and minimum observation
2. Define categories including all distinct (integer) observations in between
Example tossing two dice:
Min: 2
Max: 12
Other possible outcomes: 3 4 5 6 7 8 9 10 11 (all outcomes accounted for)

Number of Class Intervals
Interval or grouped data: the more observations there are, the larger the number of class intervals required.
Sturges’ Formula
Number of class intervals = 1 + 3.3 log10(n)
OR
Number of class intervals = 1 + 1.4 ln(n)
Example n = 50:
Number of class intervals = 1 + 3.3 log10(50) = 1 + 3.3 * 1.70 = 6.61 ≈ 7
Number of class intervals = 1 + 1.4 ln(50) = 1 + 1.4 * 3.91 = 6.48 ≈ 6

Excursion Logarithms
• The logarithm of a number to a given base is the exponent to which the base must be raised in order to produce that number. (Example: 10^1.70 = 50)
• The natural logarithm is the logarithm to the base e, where e is an irrational constant approximately equal to 2.718. The natural logarithm of a number x (written as ln(x)) is the power to which e would have to be raised to equal x. (Example: e^3.91 = 50)
• The mathematical constant e (Euler’s number) is the unique real number such that the value of the derivative d/dx (slope of the tangent line) of the function f(x) = e^x at the point x = 0 is equal to 1. The function f(x) = e^x is called the exponential function.

Class Interval Width
Class width:
1. Subtract the smallest observation from the largest observation
2. Divide by the number of classes (Sturges)
3. Round the class width to a convenient value
4. Select a lower limit so that the first class interval contains the smallest observation. Determine all other intervals consecutively by adding (multiples of) the class width

Definitions
• Class Mark or Class Midpoint: the sum of the lower class limit and the upper class limit, divided by two (used for the frequency polygon)
• Width of a class interval or class length: the difference between the upper class limit and the lower class limit. Usually, all classes are of equal width / length (Sturges)
• Class Boundaries: The class limits are stated in such a way that there is no overlap between classes. Limits are stated in this manner so that there cannot be any doubt as to which class a certain value (observation) is to be allocated. Since data is often rounded, the true class limits are not the same as the stated class limits.
  Example: Weights recorded to the nearest kilogram
  Stated class interval: 60 – 62
  True class interval or class boundaries: 59.5 – 62.5
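
The class-width steps and the weights example above can be sketched in Python as follows. This is only an illustration under stated assumptions: the sample size of 50 and the smallest and largest weights (60 kg and 80 kg) are made up, rounding the raw width up to the next whole kilogram is one possible choice of "convenient value", and the width is measured between the true class boundaries.

```python
import math

# Hypothetical inputs (assumptions for illustration only).
n = 50                      # number of observations
smallest, largest = 60, 80  # weights in kg, recorded to the nearest kilogram

# Sturges' formula, rounded to the nearest whole number of classes.
k = round(1 + 3.3 * math.log10(n))

# Steps 1-3: range divided by the number of classes, rounded up to a whole kilogram.
raw_width = (largest - smallest) / k
width = math.ceil(raw_width)

# Step 4: start at the smallest observation and add the class width repeatedly.
classes = []
lower = smallest
for _ in range(k):
    upper = lower + width - 1  # stated upper limit for whole-kilogram data
    classes.append({
        "stated": (lower, upper),                   # stated class limits
        "boundaries": (lower - 0.5, upper + 0.5),   # true class limits
        "midpoint": (lower + upper) / 2,            # class mark
    })
    lower += width

for c in classes:
    print(c["stated"], c["boundaries"], c["midpoint"])
```

Under these assumptions the first class has stated limits 60 to 62 and boundaries 59.5 to 62.5, matching the weights example, and the last class still contains the largest observation.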