Lesson 1
Chapter 1: Basic Statistical Concepts

Michael Akritas
Department of Statistics
The Pennsylvania State University

Outline

1. Populations, Samples, and Census
2. Some Sampling Concepts
3. Random Variables and Statistical Populations
4. Basic Graphics for Data Visualization
5. Proportions, Averages, Variances and Percentiles
   - Proportions: Population and Sample
   - Averages: Population and Sample
   - Variance: Population and Sample
   - Sample Percentiles and the Box Plot

Introduction to R

R is a GNU project. The GNU project (a recursive acronym for "GNU's Not Unix"), sponsored by the Free Software Foundation, was launched in 1984 to develop a complete Unix-like operating system which is free software.

To find out about R go to http://www.R-project.org/. See also the NY Times article http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?pagewanted=all

To download R go to http://cran.r-project.org/.
You can start using R as a calculator:
2*4; 2**3; sqrt(16); sin(pi); cos(2*pi); log(exp(1)); log(10,base=10)

- Try some simple commands: 1:10, seq(1,10), seq(1,10,1), seq(2,10,2). Also rep(1,5), rep("a",5), rep(seq(1,4),2) or rep(1:4,2), c(rep(0,5),rep(1,7)).
- You can store the numbers in "objects": x=c(rep(0,5),rep(1,7)). x=seq(2,10,2); sum(x); mean(x). Try also x/2; x**2; sqrt(x).
- Define functions: f=function(x){x**2}. Try f(2); f(c(2,3)).
- Integrate a function: integrate(f, 0, 3). Try also g=function(x){x**(-2)}; integrate(g, 1, Inf).

Why Statistics?

Example (Examples of Engineering/Scientific Studies)
- Comparing the compressive strength of two or more cement mixtures.
- Comparing the effectiveness of three cleaning products in removing four different types of stains.
- Predicting failure time on the basis of stress applied.
- Assessing the effectiveness of a new traffic regulatory measure in reducing the weekly rate of accidents.
- Testing a manufacturer's claim regarding a product's quality.
- Studying the relation between salary increases and employee productivity in a large corporation.
These studies require Statistics due to the intrinsic variability:
- The compressive strength of different preparations of the same cement mixture will differ. The figure in http://sites.stat.psu.edu/~mga/401/fig/HistComprStrCement.pdf shows 32 compressive strength measurements (in megapascals) of test cylinders (6 in. diameter, 12 in. high), made with a water/cement ratio of 0.4 and measured on the 28th day after they were made.
- Under the same stress, two beams fail at different times.
- The proportion of defective items of a certain product will differ from batch to batch.

Intrinsic variability renders the objectives of the case studies, as stated, ambiguous.

The objectives of the case studies can be made precise if stated in terms of averages or means:
- Comparing the average hardness of two different cement mixtures.
- Predicting the average failure time on the basis of stress applied.
- Estimation of the average coefficient of thermal expansion.
- Estimation of the average proportion of defective items.

Moreover, because of variability, the words "average" and "mean" have a technical meaning, which can be made clear through the concepts of population and sample.
Definition
A population is a well-defined collection of objects or subjects, of relevance to a particular study, which are exposed to the same treatment or method. Population members are called units.

The objective of a study is to investigate certain characteristic(s) of the units of the population(s) of interest.

Example (Populations and Unit Characteristics)
- All water samples taken from a lake. Characteristics: Mercury concentration; Concentration of other pollutants.
- All items of a certain manufactured product (that have been, or will be, produced). Characteristic: Proportion of defective items.
- All students enrolled in Big Ten universities during the 2013-14 academic year. Characteristics: Favorite type of music; Political affiliation.
- Two types of cleaning products. Characteristic: Cleaning effectiveness.

Populations that consist of the same type of units but differ in the treatment, or method, applied to them are called treatment populations.

Example (Treatment Populations)
- The concentration of pollutants in water samples is analyzed by two different labs. Water samples sent to Lab 1 constitute population 1, and those sent to Lab 2 constitute population 2.
- The time to failure of beams is studied under different stress conditions. The beams subjected to each stress condition constitute a different population.

Full (i.e. population-level) understanding of a characteristic requires the examination of all population units, i.e. a census. For example, full understanding of the relation between salary and productivity of a corporation's employees requires obtaining these two characteristics from all employees.

However, taking a census can be time-consuming and expensive: the 2000 U.S. Census cost $6.5 billion, while the 2010 Census cost $13 billion. Moreover, a census is not feasible if the population is hypothetical or conceptual, i.e. not all members are available for examination.

Because of the above, we typically settle for examining all units in a sample, which is a subset of the population.

Due to the intrinsic variability, the sample properties/attributes of the characteristic of interest will differ from those of the population. For example:
- The average mercury concentration in 25 water samples will differ from the overall mercury concentration in the lake.
- The proportion in a sample of 100 PSU students who favor the use of solar energy will differ from the corresponding proportion of all PSU students.
- The relation between a bear's chest girth and weight in a sample of 10 bears will differ from the corresponding relation in the entire population of 50 bears in a forested region.
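The gap between a sample average and the population average can be seen in a small simulation. This is only an illustrative sketch: the population of 5000 mercury concentrations is simulated (made-up values, not lake data), and the names population and smpl are assumptions introduced here.

```r
# Hypothetical population: 5000 simulated mercury concentrations (not real data)
set.seed(1)                                   # makes the illustration reproducible
population = rgamma(5000, shape = 4, rate = 2)

# Examine only 25 units, drawn at random without replacement
smpl = sample(population, size = 25)

pop_mean    = mean(population)                # average over all population units
sample_mean = mean(smpl)                      # average over the 25 sampled units

# The two averages are typically close but not equal
print(c(population = pop_mean, sample = sample_mean))
```

Changing the seed gives a different sample and hence a different sample average, while the population average stays fixed.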
The GOOD NEWS is that, if the sample is suitably drawn, then sample properties approximate the population properties.

[Figure: scatterplot with axes Chest Girth and Weight]

Sampling Variability

Sample properties of the characteristic of interest also differ from sample to sample. For example:
1. The number of US citizens, in a sample of size 20, who favor expanding solar energy will (most likely) be different from the corresponding number in a different sample of 20 US citizens.
2. The average mercury concentration in two sets of 25 water samples drawn from a lake will differ.

The term sampling variability is used to describe such differences in the characteristic of interest from sample to sample.

[Figure: Illustration of Sampling Variability. Scatterplot with axes Chest Girth and Weight.]

Population-level properties/attributes of the characteristic(s) of interest are called (population) parameters.
Examples of parameters include averages, proportions, percentiles, and correlation coefficients.

The corresponding sample properties/attributes of characteristics are called statistics. Sample statistics approximate the corresponding population parameters but are not equal to them.

Statistical inference deals with the uncertainty issues that arise in approximating parameters by statistics. The tools of statistical inference include point and interval estimation, hypothesis testing, and prediction.

Example (Examples of Estimation, Hypothesis Testing and Prediction)
- Estimation (point and interval) would be used in the task of estimating the coefficient of thermal expansion of a metal, or the air pollution level.
- Hypothesis testing would be used for deciding whether to take corrective action to bring the air pollution level down, or whether a manufacturer's claim regarding the quality of a product is false.
- Prediction arises in cases where we would like to predict the failure time on the basis of the stress applied, or the age of a tree on the basis of its trunk diameter.

For valid statistical inference the sample must be representative of the population. For example, a sample of PSU basketball players is not representative of PSU students if the characteristic of interest is height.

Typically it is hard to tell whether a sample is representative of the population. So, we define a sample to be representative if ... (a circular definition!)
it allows for valid statistical inference.

The only guarantee for that comes from the method used to select the sample (the sampling method). The good news is that there are several sampling methods that guarantee representativeness.

Definition
A sample of size n is a simple random sample (s.r.s.) if the selection process ensures that every sample of size n has an equal chance of being selected.

In simple random sampling every member of the population has the same chance of being included in the sample. The converse, however, is not true.

Example
To select a sample of 2 students from a population of 20 male and 20 female students, one selects at random one male and one female student. Is this a s.r.s.? (Does every student have the same chance of being included in the sample?)

Another sampling method for obtaining a representative sample is called stratified sampling.

Definition
A stratified sample consists of simple random samples from each of a number of groups (which are non-overlapping and make up the entire population) called strata.

Examples of strata include ethnic groups, age groups, and production facilities.

If the units in the different strata differ in terms of the characteristic under study, stratified sampling is preferable to s.r.s. For example, if different production facilities differ in terms of the proportion of defective products, a stratified sample is preferable.
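A stratified sample can be sketched in R by drawing a simple random sample within each stratum. The data below are hypothetical (three made-up production facilities of different sizes), and taking 5 units per stratum is an arbitrary choice for illustration; the names items and per_stratum are assumptions.

```r
# Hypothetical population: 300 items produced at three facilities (the strata)
set.seed(2)
items = data.frame(
  id       = 1:300,
  facility = rep(c("A", "B", "C"), times = c(150, 100, 50))
)

# Draw a simple random sample of 5 item ids within each stratum
per_stratum = 5
chosen_ids = unlist(lapply(split(items$id, items$facility),
                           function(ids) sample(ids, per_stratum)))

# Keep only the selected items; each facility contributes exactly 5 of them
stratified_sample = items[items$id %in% chosen_ids, ]
table(stratified_sample$facility)
```

A plain s.r.s. of 15 items from the same population could, by chance, contain very few items from the small facility C; sampling within strata rules that out.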
How do we select a s.r.s. of size n from a population of N units?
STEP 1: Assign to each unit a number from 1 to N.
STEP 2: Write each number on a slip of paper, place the N slips of paper in an urn, and shuffle them.
STEP 3: Select n slips of paper at random, one at a time.

Alternatively, the entire process can be performed in software like R. We will see this in the next lab session.

Sampling without replacement simply means that a population unit can be included in a sample at most once. For example, a simple random sample is obtained by sampling without replacement: once a unit's slip of paper is drawn, it is not placed back into the urn.

Sampling with replacement means that after a unit's slip of paper is chosen, it is put back in the urn. Thus a population unit could be included in the sample anywhere between 0 and n times. Rolling a die can be thought of as sampling with replacement from the numbers 1, 2, ..., 6.

Though conceptually undesirable, sampling with replacement is easier to work with from a mathematical point of view. When a population is very large, sampling with and without replacement are practically equivalent.
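As a preview, the urn procedure and the two kinds of sampling can be mimicked with base R's sample() function. This is a minimal sketch; the values N = 50 and n = 10 are arbitrary choices for illustration.

```r
set.seed(3)
N = 50    # population size (arbitrary, for illustration)
n = 10    # sample size

# Simple random sample: draw n of the labels 1..N without replacement
srs = sample(1:N, size = n)
any(duplicated(srs))           # FALSE: no unit is included more than once

# Sampling with replacement: ten rolls of a die
rolls = sample(1:6, size = 10, replace = TRUE)
table(rolls)                   # a face may appear several times or not at all
```

By default sample() draws without replacement, which matches STEP 3 of the urn procedure; replace = TRUE puts each slip back before the next draw.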
Non-representative samples arise whenever the sampling plan is such that a part, or parts, of the population of interest are either excluded from, or systematically under-represented in, the sample. This is called selection bias.

Two examples of non-representative samples are self-selected and convenience samples.

A self-selected sample often occurs when people are asked to send in their opinions in surveys or questionnaires. For example, in a political survey, often those who feel that things are running smoothly or who support an incumbent will (apathetically) not respond, whereas those activists who strongly desire change will voice their opinions.

Convenience samples are made up of the most easily accessed units. For example, randomly selecting students from your classes will not result in a sample that is representative of all PSU students, since your classes consist mostly of students with the same major as you.

Example (The Literary Digest poll of 1936)
The magazine had been extremely successful in predicting the results of US presidential elections, but in 1936 it predicted a 3-to-2 victory for Republican Alf Landon over the Democratic incumbent Franklin Delano Roosevelt. Worth noting is that this prediction was based on 2.3 million responses (out of 10 million questionnaires sent). On the other hand, Gallup correctly predicted the outcome of that election by surveying only 50,000 people.
Variable = a Numerical Characteristic

If the characteristic of interest can be expressed as a number, e.g. thermal expansion of a metal, hardness of cement, mercury concentration, or number of accidents, it is called quantitative.

Examples of non-quantitative characteristics are gender, make of car, eye color, strength category, and political affiliation. Such characteristics are called categorical or qualitative.

Because statistical procedures are applied to numerical data sets, the categories of a categorical characteristic are labeled with arbitrarily chosen numbers (e.g. 'male' = −1, 'female' = +1).

A characteristic expressed as a number is called a variable.

Types of Variables

Qualitative variables are a particular kind of discrete variable.

Quantitative variables can also be discrete. All variables expressing counts, such as the number of earthquakes, the number of fish caught, etc., are discrete.

Quantitative variables expressing measurements on a continuous scale are examples of continuous variables. Measurements of length, strength, weight, or time to failure are examples of continuous variables.

When two or more characteristics are measured on each population unit, we have bivariate or multivariate variables.
- Example of bivariate: Salary increase and productivity.
- Example of multivariate: Age, income, education level.
Random Variables

When a unit is randomly sampled from a population, the value of its variable will be denoted by X (or Y, or Z, etc.).

Because of the intrinsic variability, X is not known a priori, and thus it is called a random variable (r.v.).

The population from which a random variable is drawn is called the underlying population of the r.v.

The collection of the variable values of all population units is called the statistical population.

The statistical population of a r.v. is NOT the same as the set of values a variable can take.

Example
1. A list of the weight of every PSU student is the statistical population of the r.v. weight.
2. A list of 1s and 0s representing every student's opinion on whether solar energy should be expanded is the statistical population of the r.v. expressing opinion on solar energy.

Sampling from the Statistical Population

It should be intuitively clear that taking a sample of n units from some population and recording the variable of each sampled unit is equivalent to taking a sample of n values from the statistical population of the random variable.
Henceforth, the word sample will also be used to denote a sample from the statistical population. Such a sample:
1. Consists of units of the statistical population, i.e. numbers.
2. The numbers are not known a priori, so they are r.v.'s.
3. A sample of size n will be denoted by $X_1, X_2, \ldots, X_n$.

Histograms and Stem and Leaf Plots

In histograms the range of the data is divided into bins, and a box is constructed above each bin. The height of each box is the bin's frequency. Alternatively, the heights can be adjusted so that the histogram's area is one.

R will automatically choose the number of bins, but it also allows user-specified intervals. Moreover, R offers the option of constructing a smooth histogram.

In stem and leaf plots each observation is split into its stem, which is the beginning digit(s), and its leaf, which is the first of the remaining digits. Stem and leaf plots retain more information about the original data but do not offer as much flexibility in selecting the bins.
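These displays can be tried directly in R. The sketch below uses R's built-in faithful data set (eruption durations of the Old Faithful geyser); the bin boundaries and axis labels are arbitrary choices made here for illustration.

```r
x = faithful$eruptions       # eruption durations in minutes (built into R)

# Histogram scaled so that its total area is one
hist(x, freq = FALSE, main = "Eruption durations", xlab = "Duration (min)")

# Superimpose a smooth histogram (kernel density estimate)
lines(density(x))

# Histogram with user-specified bin boundaries (half-minute bins)
hist(x, breaks = seq(1.5, 5.5, by = 0.5), main = "Half-minute bins",
     xlab = "Duration (min)")

# Stem and leaf plot of the same data, printed to the console
stem(x)
```

With freq = TRUE (the default for equally spaced bins) the bar heights are the bin frequencies instead.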
The R data set faithful

The histogram, with superimposed smooth histogram, for a sample of 272 eruption durations from the Old Faithful geyser is shown in http://stat.psu.edu/~mga/401/fig/HistOldFaith1.pdf

The stem and leaf plot for the same data set is shown in http://stat.psu.edu/~mga/401/fig/StemLeaf.pdf

Scatterplots

The basic scatterplot is useful for exploring the relationship between two variables. An enhanced version identifies subclasses of data. See http://stat.psu.edu/~mga/401/fig/BearsChG_W_by_S.pdf

A scatterplot matrix is a matrix of scatterplots for all pairs of variables in a data set. See http://stat.psu.edu/~mga/401/fig/BearMeas_by_S.pdf. It helps identify the best single predictor of weight.

Scatterplots Continued

A scatterplot with marginal histograms shows histograms of the two variables in the margins of the scatterplot. See http://stat.psu.edu/~mga/401/fig/BearMeas_with_MarginalHist.pdf

3D scatterplots are useful for exploring the relationship between three variables. For example, http://stat.psu.edu/~mga/401/fig/TempProdElect2.pdf gives a three-dimensional view of the joint effect of temperature and production volume on electricity consumed.
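A basic scatterplot and a scatterplot matrix can be produced with base R. The bear measurements behind the linked figures are not reproduced here, so this sketch substitutes R's built-in trees data set (girth, height and volume of 31 black cherry trees); the axis labels are choices made for illustration.

```r
# Built-in data: girth, height and volume of 31 black cherry trees
data(trees)

# Basic scatterplot of two variables
plot(trees$Girth, trees$Volume,
     xlab = "Girth (in)", ylab = "Volume (cubic ft)")

# Scatterplot matrix: one scatterplot for every pair of variables
pairs(trees)
```

In the scatterplot matrix, comparing the panels in the Volume row gives a visual impression of which single variable is most closely related to volume.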
Pie Charts and Bar Graphs

Pie charts and bar graphs are used with count data to display the proportion of each category in a sample.

The pie chart is popular in the mass media and is one of the most widely used statistical charts in the business world. It is a circular chart, where the circle is divided into sections whose areas represent proportions.

The pie chart in http://www.stat.psu.edu/~mga/401/fig/LvMsPie.pdf displays information on the November 2011 light vehicle market share of car companies (source: http://wardsauto.com/keydata/USSalesSummary0702.xls).

According to Stevens' power law, bar lengths are better than section areas for comparing the different proportions.

Bar graphs resemble histograms, with the heights of the bars equal to the proportion of each category. The bar graph display of the November 2011 light vehicle market share data is shown in http://stat.psu.edu/~mga/401/fig/LvMsBar2.pdf.

Remark: When the bars are arranged in decreasing order of height, the bar graph is also called a Pareto chart. The Pareto chart is one of the key tools used in quality control, where it is often used to represent the most common sources of defects in a manufacturing process, or the most frequent reasons for customer complaints, etc.
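The three displays can be sketched with base R's pie() and barplot(). The counts below are made up purely for illustration (they are not the market share data), and the category names are hypothetical.

```r
# Hypothetical counts of defect sources (made-up numbers, for illustration only)
defects = c(Scratch = 12, Dent = 30, Misalignment = 7, Discoloration = 18)

props = defects / sum(defects)          # proportion of each category

pie(props, main = "Defect sources")     # section areas represent proportions
barplot(props, ylab = "Proportion")     # bar heights represent proportions

# Pareto chart: the same bars arranged in decreasing order of height
barplot(sort(props, decreasing = TRUE), ylab = "Proportion")
```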
[Google Pareto principle]

The Most Common Parameters

For a univariate statistical population these are:
- The proportion. For example, the proportion of Honda Accords that will require warranty repair work in 36,000 miles.
- The average. For example, the average failure time at a given stress level.
- The variance and standard deviation. These parameters quantify the intrinsic variability.
- The median and other percentiles. These can be used to quantify both location and variability.

Proportions: Population and Sample
Proportions are relevant whenever the variable of interest is categorical, or has been categorized.

Definition
1. If the population has $N$ units, and $N_i$ units are in category $i$, then the population proportion for category $i$ is
   $$p_i = \frac{\#\{\text{population units of category } i\}}{\#\{\text{population units}\}} = \frac{N_i}{N}.$$
2. If a sample of size $n$ is taken, and $n_i$ sample units are in category $i$, then the sample proportion for category $i$ is
   $$\hat{p}_i = \frac{\#\{\text{sample units of category } i\}}{\#\{\text{sample units}\}} = \frac{n_i}{n}.$$

Example
1. In a sample of 1000 adults, 72% favor tougher penalties for drunk driving. Is the correct notation for 0.72 $p$ or $\hat{p}$?
2. In a population of 80 engineering majors taking a required statistics class, 40 are enthusiastic about having computer labs. In a s.r. sample of 20 of these students, 8 are enthusiastic. What is the correct notation for 40/80 = 0.5 and for 8/20 = 2/5?

Always remember that, under s.r. sampling, $\hat{p}$ approximates, but in general is different from, $p$.
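The difference between the two proportions can be checked numerically. The sketch below encodes the 80 engineering majors as 1 (enthusiastic) and 0 (not), then draws one simple random sample of 20; the names population, smpl, p and p_hat are assumptions introduced here, and the sample obtained depends on the seed.

```r
set.seed(4)
# Population of 80 students: 1 = enthusiastic, 0 = not (40 of each)
population = rep(c(1, 0), times = c(40, 40))
p = sum(population == 1) / length(population)   # population proportion, 40/80

# One simple random sample of 20 students
smpl  = sample(population, size = 20)
p_hat = sum(smpl == 1) / length(smpl)           # sample proportion; varies by sample

c(p = p, p_hat = p_hat)
```

Re-running the last three lines with a different seed changes p_hat but never p.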
Averages: Population and Sample

Consider a population of $N$ units, and let $v_1, v_2, \ldots, v_N$ denote the statistical population corresponding to some variable. Then the population average or population mean, denoted by $\mu$, is the arithmetic average of all values in the statistical population. Thus,
$$\mu = \frac{1}{N}\sum_{i=1}^{N} v_i.$$

If the random variable $X$ denotes the value of the variable of a randomly selected population unit, then a synonymous terminology for the population mean is expected value of $X$, or mean value of $X$, and it is denoted by $\mu_X$ or $E(X)$.
Example
In a population of 500 tin plates, the numbers of plates with 0, 1 and 2 scratches are $N_0 = 190$, $N_1 = 160$ and $N_2 = 150$. Thus, in the statistical population $v_1, \ldots, v_{500}$, 190 of the $v_i$ equal 0, 160 equal 1, and 150 equal 2. The population mean is
\[ \mu = \frac{1}{500} \sum_{i=1}^{500} v_i = \frac{0 \times N_0}{500} + \frac{1 \times N_1}{500} + \frac{2 \times N_2}{500} = 0.92. \]
If a tin plate is selected at random and $X$ is the rv denoting the number of scratches, the mean value of $X$ is 0.92 and we write $\mu_X = 0.92$, or $E(X) = 0.92$.

If a sample of size $n$ is taken, and $x_1, x_2, \ldots, x_n$ denote the variable values of the sample units, then the sample average or sample mean, denoted by $\bar{x}$, is
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i. \]
Under s.r. sampling, a sample mean approximates, but in general is different from, the population mean.

Example
If a s.r. sample of $n = 100$ is taken from the 500 tin plates, it could be that there are $n_0 = 40$, $n_1 = 34$ and $n_2 = 26$ plates with 0, 1 and 2 scratches. In this case, $\bar{x} = 0.86$.
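The two tin plate calculations can be checked in R by building the statistical population and the hypothetical sample with rep(), as in the introductory R slides. The object names v and x are chosen here to match the notation $v_i$ and $x_i$; they are not prescribed by the lesson.

```r
# Statistical population of N = 500 tin plates:
# 190 plates with 0 scratches, 160 with 1, 150 with 2
v <- c(rep(0, 190), rep(1, 160), rep(2, 150))
mu <- mean(v)        # population mean, (1/N) * sum of the v_i

# The sample of n = 100 described in the example:
# 40 plates with 0 scratches, 34 with 1, 26 with 2
x <- c(rep(0, 40), rep(1, 34), rep(2, 26))
xbar <- mean(x)      # sample mean, (1/n) * sum of the x_i

mu     # 0.92
xbar   # 0.86
```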
Proportions are Averages!
A proportion is a special case of a mean. To see this: Consider the example with the tin plates, where $N_1 = 160$ out of $N = 500$ have one scratch, and let the variable $X$ take the value 1 if a tin plate has one scratch and the value 0 otherwise. Note that for the statistical population, $v_1, \ldots, v_{500}$, of this variable, 160 of the $v_i$ are equal to 1 and 340 are equal to 0. Thus,
\[ \mu_X = \frac{160}{500} = 0.32, \quad \text{which equals } p = \frac{N_1}{N}. \]

Variance: Population- and Sample

Let $v_1, v_2, \ldots, v_N$ be a statistical population with mean $\mu$.

Definition
The population variance, $\sigma^2$, is defined as
\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (v_i - \mu)^2. \]
The standard deviation is the positive square root of the variance: $\sigma = \sqrt{\sigma^2}$.
If the rv $X$ denotes a randomly selected value from the statistical population, then a synonymous term for the population variance is variance of $X$, denoted by $\sigma_X^2$ or $\mathrm{Var}(X)$. The standard deviation of $X$ is $\sigma_X = \sqrt{\sigma_X^2}$.

A simpler computational formula for the variance is
\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} v_i^2 - \mu^2. \]

Example
Consider the tin plate example, so the statistical population $v_1, \ldots, v_{500}$ has 190 of the $v_i$ equal to 0, 160 equal to 1, 150 equal to 2, and $\mu = 0.92$. Then,
\[ \sigma^2 = \frac{0 \times 190}{500} + \frac{1 \times 160}{500} + \frac{4 \times 150}{500} - 0.92^2 = 0.6736. \]

If $x_1, x_2, \ldots, x_n$ denotes a sample from the statistical population, the sample variance and its computational formula are:
\[ S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]. \]
The sample standard deviation is $S = \sqrt{S^2}$. Under s.r. sampling, $S^2$ approximates, but in general is different from, $\sigma^2$.

Example
Consider the s.r. sample of $n = 100$ tin plates, which has 40, 34 and 26 plates with 0, 1 and 2 scratches.
Then,
\[ S^2 = \frac{1}{99} \left[ 138 - 73.96 \right] = 0.647. \]

Why Divide by n − 1?
Because this ensures that the average of the sample variances resulting from all possible samples is equal to the population variance.

Example
The variance of the population {0, 1}, which corresponds to tossing a fair coin, is 0.25 (why?). The possible samples of size two, taken with replacement, are {0, 0}, {0, 1}, {1, 0}, {1, 1}. Verify that the four sample variances average to 0.25.

Sample Percentiles and the Box Plot

Roughly speaking, the $(1 - \alpha)100$th sample percentile separates the part of the data having the $(1 - \alpha)100\%$ smaller values from the part having the $\alpha 100\%$ larger values. Thus:
The 90th sample percentile separates the largest 10% from the lower 90% of the values in the data set.
The 50th sample percentile is also called the sample median.
The 25th, the 50th and the 75th sample percentiles are also called sample quartiles.
The 25th and 75th percentiles are the lower quartile and upper quartile, respectively.
The distance between the lower and upper quartiles is called the interquartile range or IQR.

Order Statistics as Sample Percentiles
Let $X_1, \ldots, X_n$ be a s.r. sample from a continuous distribution. The ordered sample values are denoted $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$. Thus, $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$.
$X_{(i)}$, the $i$th smallest sample value, is defined to be the $100\,\frac{i - 0.5}{n}$-th sample percentile.

Example
A s.r.s. of 10 black bears' weights is: 154 158 356 446 40 154 90 94 150 142. Give the order statistics, and state the population percentiles they estimate.
Solution: The R command
    sort(c(154, 158, 356, 446, 40, 154, 90, 94, 150, 142))
returns the order statistics: 40, 90, 94, 142, 150, 154, 154, 158, 356, 446.
These order statistics estimate the 5th, 15th, 25th, 35th, 45th, 55th, 65th, 75th, 85th and 95th population percentiles, respectively. For example, $X_{(3)} = 94$ is the $100(3 - 0.5)/10 = 25$th sample percentile and estimates the corresponding population percentile. [In R the percentiles are obtained with: 100*(1:10 - 0.5)/10.]

In the above example none of the order statistics corresponds to the median or the 90th percentile. In general, if $n$ is even, none of the order statistics corresponds to the median. For example,
If $n = 5$ then $X_{(3)}$, the 3rd smallest value, is the $100\,\frac{2.5}{5} = 50$th sample percentile or median.
If $n = 4$ then $X_{(2)}$ is the $100\,\frac{1.5}{4} = 37.5$th sample percentile, while $X_{(3)}$ is the $100\,\frac{2.5}{4} = 62.5$th sample percentile. Thus, none of the ordered values is the median.
Depending on $n$, the above definition may not identify other percentiles of interest. In such cases, we use interpolation.

Percentiles in R
R uses a different interpolation algorithm for evaluating sample percentiles from a given data set.
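As an illustration of that difference, the sketch below uses the black bear weights to compare the order-statistic definition with quantile(). The object names are chosen here for illustration. The type argument of quantile() selects among R's documented interpolation rules; type = 5 is the one whose plotting positions are $(i - 0.5)/n$, which is why it is picked here to mirror the definition above. The lesson itself does not mention this argument.

```r
w <- c(154, 158, 356, 446, 40, 154, 90, 94, 150, 142)   # black bear weights
n <- length(w)

ord <- sort(w)                   # order statistics X(1), ..., X(n)
pct <- 100 * (1:n - 0.5) / n     # percentile assigned to each X(i) by the definition
data.frame(order_stat = ord, percentile = pct)

# 25th sample percentile under two interpolation rules
q_default <- quantile(w, 0.25, names = FALSE)             # R's default rule
q_type5   <- quantile(w, 0.25, type = 5, names = FALSE)   # rule based on (i - 0.5)/n

c(X3 = ord[3], type5 = q_type5, default = q_default)
```

Under the definition, the 25th percentile is $X_{(3)} = 94$, and type = 5 agrees with it, while the default rule interpolates between $X_{(3)}$ and $X_{(4)}$ and returns a larger value.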
With the data set in the object x, the commands
    median(x)
    quantile(x, 0.25)
    quantile(x, c(0.3, 0.7, 0.9))
    summary(x)
give, respectively, the median, the 25th percentile, the 30th, 70th and 90th percentiles, and a five number summary of the data consisting of $x_{(1)}$, $q_1$, $\tilde{x}$, $q_3$, and $x_{(n)}$ (summary(x) also reports the sample mean).

Example
Using the previous sample of 10 black bear weights, estimate the population median, 70th, 80th and 90th percentiles.
Solution: With the sample values in the object w, i.e.
    w = c(154, 158, 356, 446, 40, 154, 90, 94, 150, 142)
the R command
    quantile(w, c(0.5, 0.7, 0.8, 0.9))
returns 152.0, 155.2, 197.6, 365.0 for the sample median, 70th, 80th and 90th percentiles, respectively.

The five number summary of the data given by the summary(x) command in R is the basis for the boxplot.
A boxplot displays the central 50% of the data with a box whose lower and upper edges are at $q_1$ and $q_3$, respectively; a line inside the box represents the median.
The lower 25% and upper 25% of the data are represented by lines (or whiskers) which extend from each edge of the box.
The lower (upper) whisker extends from $q_1$ ($q_3$) to the smallest (largest) observation within 1.5 interquartile ranges of $q_1$ ($q_3$).
Observations further from the box than the whisker ends (i.e. smaller than $q_1 - 1.5 \times \mathrm{IQR}$ or larger than $q_3 + 1.5 \times \mathrm{IQR}$) are called outliers, and are plotted individually.
See http://sites.stat.psu.edu/~mga/401/fig/BoxplotOzoneR.pdf

Example
Scientists have been monitoring the ozone hole since 1980. See the images shown in http://ozonewatch.gsfc.nasa.gov/
The 14 ozone measurements (Dobson units) given in http://stat.psu.edu/~mga/401/Data/OzoneData.txt were taken in 2002 from the lower stratosphere, between 9 and 12 miles altitude. Give the five number summary of this data and construct the box plot.
Solution: Read the data into the R object oz using
    oz = read.table("http://stat.psu.edu/~mga/401/Data/OzoneData.txt", header = T)
Then, use the command summary(oz) (or summary(oz$OzoneData)) to get the five number summary of this data. For the boxplot use boxplot(oz, col = "grey"), or boxplot(oz$OzoneData, col = "grey").
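Since the ozone file may not be reachable, the whisker and outlier rule can be sketched on the black bear weights from the earlier example instead. The object names below are illustrative. One caveat: the sketch takes $q_1$ and $q_3$ from quantile(), whereas boxplot() computes its box edges slightly differently (from hinges), so the drawn box can differ a little from these values for some data sets.

```r
w <- c(154, 158, 356, 446, 40, 154, 90, 94, 150, 142)   # black bear weights

q <- quantile(w, c(0.25, 0.75), names = FALSE)   # q1 and q3, default interpolation
iqr <- q[2] - q[1]                               # interquartile range

lower_fence <- q[1] - 1.5 * iqr                  # q1 - 1.5 * IQR
upper_fence <- q[2] + 1.5 * iqr                  # q3 + 1.5 * IQR

inside   <- w[w >= lower_fence & w <= upper_fence]
whiskers <- range(inside)                        # where the two whiskers end
outliers <- w[w < lower_fence | w > upper_fence] # points plotted individually

whiskers
outliers
# boxplot(w, col = "grey")   # uncomment to draw the corresponding box plot
```

For these data the two heaviest bears fall beyond the upper fence, so they are flagged as outliers and the upper whisker stops at the largest remaining weight.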
Hand Calculation of Sample Median

Definition
Let $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ denote the ordered sample values in a sample of size $n$. The sample median is defined as
\[ \tilde{X} = \begin{cases} X_{\left(\frac{n+1}{2}\right)}, & \text{if } n \text{ is odd} \\[1ex] \dfrac{X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}}{2}, & \text{if } n \text{ is even} \end{cases} \]

Example (Relation Between $\tilde{X}$ and $\bar{X}$)
Find the sample median of $X_1 = 2.3$, $X_2 = 3.2$, $X_3 = 1.8$, $X_4 = 2.5$, $X_5 = 2.7$.
Solution. We first order the values from smallest to largest: $X_{(1)} = 1.8$, $X_{(2)} = 2.3$, $X_{(3)} = 2.5$, $X_{(4)} = 2.7$, $X_{(5)} = 3.2$.
Since the sample size is odd, $\tilde{X} = X_{\left(\frac{n+1}{2}\right)} = X_{(3)} = 2.5$.
For this data, $\bar{X} = \tilde{X} = 2.5$.
If $X_{(5)}$ is changed to 4.2, then $\bar{X} = 2.7$ but $\tilde{X} = 2.5$. Thus $\bar{X}$ is affected by outliers, whereas $\tilde{X}$ is not.
In general, if the histogram of the data is positively skewed then $\bar{X} > \tilde{X}$, and if it is negatively skewed then $\bar{X} < \tilde{X}$.

Hand Calculation of Sample Quartiles and Sample IQR

Definition
The sample lower quartile or SLQ is defined as
the median of the smallest $n/2$ values, if $n$ is even
the median of the smallest $(n + 1)/2$ values, if $n$ is odd
The sample upper quartile or SUQ is defined as
the median of the largest $n/2$ values, if $n$ is even
the median of the largest $(n + 1)/2$ values, if $n$ is odd
The sample interquartile range, or SIQR, is defined as SIQR = SUQ − SLQ.

Example
Find the lower and upper quartiles of the $n = 9$ observations 9.39, 7.04, 7.17, 13.28, 9.00, 7.46, 21.06, 15.19, 7.50.
Solution. Since $n$ is odd, the SLQ is the median of the smallest $5\ (= (n + 1)/2)$ values: 7.04, 7.17, 7.46, 7.50, 9.00,
and the SUQ is the median of the largest $5\ (= (n + 1)/2)$ values: 9.00, 9.39, 13.28, 15.19, 21.06.
Thus SLQ = 7.46, and SUQ = 13.28.
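The hand-calculation rule for the quartiles can be written as a short R function. This is a sketch: the function name hand_quartiles and the object obs are hypothetical names introduced here, and the function simply follows the SLQ/SUQ definition above rather than R's quantile() interpolation.

```r
# Hand-calculation rule for sample quartiles (hypothetical helper, not part of R)
hand_quartiles <- function(x) {
  xs <- sort(x)
  n  <- length(xs)
  # Number of values in each "half": n/2 if n is even, (n + 1)/2 if n is odd
  k  <- if (n %% 2 == 0) n / 2 else (n + 1) / 2
  slq <- median(xs[1:k])              # median of the k smallest values
  suq <- median(xs[(n - k + 1):n])    # median of the k largest values
  c(SLQ = slq, SUQ = suq, SIQR = suq - slq)
}

obs <- c(9.39, 7.04, 7.17, 13.28, 9.00, 7.46, 21.06, 15.19, 7.50)
hand_quartiles(obs)
```

For the nine observations in the example this reproduces SLQ = 7.46 and SUQ = 13.28, giving SIQR = 5.82. For other data sets the result can differ from quantile(x, c(0.25, 0.75)), because R's default interpolation follows a different rule.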
Go to next lesson: http://stat.psu.edu/~mga/401/course.info/lesson2.pdf
Go to the Stat 401 home page: http://stat.psu.edu/~mga/401/course.info/