Describing Statistical Data

Statistics
o Statistics is the science of data
o A body of methods and theory that is applied to numerical evidence when making inferences in the face of uncertainty

Descriptive Statistics
The use of statistical methods for summary calculations (summarizing or describing data) and graphic displays that facilitate interpretation and subsequent analysis.
Descriptive Statistics, http://en.wikipedia.org/wiki/Descriptive_statistics
o Used to describe the main features of a collection of data quantitatively.
o Deals with the summarization and description of collections of data, including the concepts of central tendency (mean, median, mode), dispersion, association, histograms, etc.
o For example, we can report the average reading test score for the students in each classroom in a school, to give a descriptive sense of the typical scores and their variation. If we perform a formal hypothesis test on the scores, we are doing inductive (inferential) rather than descriptive analysis.

Statistics Terms
Counting
o Frequency of a value: the number of times that value occurs in a given data set
o Relative frequency of a value: the proportion of all observations in the data set with that value
o Cumulative frequency: obtained by adding the relative frequencies one value at a time, in increasing order of value
Measurement of the Center of the Data
o Mean (also referred to as the average): the sum of the values of the observations divided by the number of observations
o Median: the midpoint of the observations when arranged in order. Half of the observations in a data set lie below the median and half lie above it.
o Mode: the mode is the most frequent value.
It is the value that occurs most commonly in the data set.
o Standard Deviation
o Skewness: the degree to which a distribution is asymmetric

Measuring Variability in the Data
o Measures of center are an excellent starting point in summarizing data, but they usually do not tell the full story and can be misleading if there is no information about the variability or spread of the data.
o An adequate summary of a set of data requires both a measure of center and a measure of variability.
o Each measure of variability is most often associated with one of the measures of center.

Median
When the median is used to describe the center, the variability and general shape of the data distribution are usually described with quartiles.
Quartile, http://en.wikipedia.org/wiki/Quartile
o A quartile is any of the three values that divide the sorted data set into four equal parts, so that each part represents ¼ of the sampled population. It is one type of quantile.
o Q1 = first quartile = lower quartile = cuts off the lowest 25% of the data = 25th percentile
o Q2 = second quartile = median = cuts the data set in half = 50th percentile
o Q3 = third quartile = upper quartile = cuts off the highest 25% of the data, or the lowest 75% = 75th percentile

Arithmetic Mean or Sample Mean [1]
If the n observations in a sample are denoted by $x_1, x_2, \ldots, x_n$, the sample mean is
    $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$

Mean of a Probability Distribution, µ
If we think of a probability distribution as a model for the population, one way to think of the mean is as the average of all the measurements in the population.
For a finite population with N equally likely values, the probability mass function is $f(x_i) = 1/N$ and the mean is
    $\mu = \sum_{i=1}^{N} x_i f(x_i) = \frac{\sum_{i=1}^{N} x_i}{N}$

Sample Variance
If $x_1, x_2, \ldots, x_n$ is a sample of n observations, the sample variance is
    $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

The Sample Standard Deviation
o The sample standard deviation s is the square root of the sample variance ($s^2$)
o It is a widely used measure of variability or dispersion
o It shows how much variation there is from the "average" (mean, or expected/budgeted value)

The Standard Deviation
o The standard deviation σ of a random variable is the square root of its variance
o A low σ indicates that the data points tend to be very close to the mean; a high σ indicates that the data are spread over a wide range of values
o Let X be a random variable with mean value μ. The standard deviation of X is
    $\sigma = \sqrt{E[(X - \mu)^2]}$
o If X takes random values from a finite data set $x_1, x_2, \ldots, x_n$, with each value having the same probability, the standard deviation is
    $\sigma = \sqrt{\frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_n - \mu)^2}{n}}$
  or, using summation notation,
    $\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$

Sampling and Graphics Data Representation
Numerical summary
o Mean, standard deviation, range
o Each provides information about only one feature of the data
Stem-and-leaf plots (diagrams) – provide a general visual impression of a data set
o Useful for displaying small samples (about 20 observations)
o Divide each number into two parts: a stem, consisting of one or more of the leading digits, and a leaf, consisting of the remaining digit
o List the stem values in a vertical column
o Record the leaf for each observation beside its stem
o Write the units for stems and leaves on the display
Frequency distribution (frequency table) – provides a general visual impression of a data set
o A more compact summary of data than a stem-and-leaf diagram
o Divide the range of the data into intervals, which are usually called class intervals, cells, or bins
Histogram – provides a general visual impression of a data set
o A visual display of the frequency
distribution
o Label the bin (class interval) boundaries on a horizontal scale
o Mark and label the vertical scale with the frequencies or relative frequencies
o Above each bin, draw a rectangle whose height is equal to the frequency (or relative frequency) corresponding to that bin
Boxplot
o Simultaneously displays the three quartiles, the minimum, and the maximum of the data on a rectangular box
o Whiskers (extending to the smallest and largest data points that are not outliers)
o Interquartile range (IQR)
o Outliers
o Extreme outliers
Time sequence plot
o A time series or time sequence graph (plot)
o Shows: trends, cycles
o Horizontal axis – time (minutes, hours, days, years, etc.)
Probability plots
o How do we know if a particular probability distribution is a reasonable model for the data?
o Verifying assumptions
o Can provide insight into the underlying physical mechanism generating the data
o Reliability engineering: verifying that time-to-failure data come from an exponential distribution identifies the failure mechanism, in the sense that the failure rate is constant with respect to time
o A probability plot
    A graphical method for determining whether sample data conform to a hypothesized distribution, based on a subjective visual examination of the data
    Normal probability plots have 100(j – 0.5)/n on the left vertical scale and (sometimes) 100[1 – (j – 0.5)/n] on the right vertical scale
    The variable value is plotted on the horizontal scale
    Standardized normal scale $z_j$
    Assess the "closeness" of the points to the straight line
o Example 6-7 Battery Life (portable personal computer)
o Normal distribution or Gaussian distribution (Section 4-6, pp.
118-125 of the textbook)
    Probabilities associated with a normal distribution (Figure 4-12, page 120 of the textbook)
    68% within the interval (µ – σ, µ + σ)
    95% within the interval (µ – 2σ, µ + 2σ)
    99.73% within the interval (µ – 3σ, µ + 3σ)
o Standard Normal Random Variable
    A normal random variable with µ = 0 and σ² = 1 is called a standard normal random variable and is denoted as Z. The cumulative distribution function of a standard normal random variable is denoted as
    $\Phi(z) = P(Z \le z)$
o Standardizing a Normal Random Variable (page 122 of the textbook)
    If X is a normal random variable with E(X) = µ and V(X) = σ², the random variable
    $Z = \frac{X - \mu}{\sigma}$
    is a normal random variable with E(Z) = 0 and V(Z) = 1. That is, Z is a standard normal random variable.
o Example 4-13 Normally Distributed Current (page 122)
    The current measurements in a strip of wire are assumed to follow a normal distribution with a mean (µ) of 10 milliamperes and a variance of 4 (milliamperes)². What is the probability that a measurement will exceed 13 milliamperes?
    With σ = 2, the standardized value is z = (13 – 10)/2 = 1.5, so
    P(X > 13) = P(Z > 1.5) = 1 – P(Z ≤ 1.5) = 1 – 0.93319 = 0.06681 (Appendix Table III)

Minitab v16, Tutorials: Data for Display Descriptive Statistics
You must have numerical data. For example:
o Number of defective parts found per shift during one month
o Individual test scores for students in a class
o Cost of machine repairs from a service provider
The optional grouping column (also called a By column) can be numeric, text, or date/time. Minitab displays separate descriptive statistics for each value in this By variable. For example:
o Survey responses grouped by gender
o Customer service calls grouped by shift
o Customer satisfaction ratings grouped by branch location
Consider how much data you need to make the test meaningful. Although you can display descriptive statistics for only one or two data values, if you have more data, your results are more likely to be informative.
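The grouped display described above can be approximated outside Minitab. The following is a minimal Python sketch, not Minitab's own procedure: it computes a few descriptive statistics separately for each value of a grouping (By) column. The data (customer service calls per day, grouped by shift) and the helper name `describe_by` are hypothetical and chosen only for illustration.

```python
import statistics
from collections import defaultdict


def describe_by(values, groups):
    """Return descriptive statistics of `values` for each label in `groups`."""
    # Collect the observations that belong to each group label.
    buckets = defaultdict(list)
    for value, label in zip(values, groups):
        buckets[label].append(value)

    summary = {}
    for label, data in buckets.items():
        summary[label] = {
            "n": len(data),
            "mean": statistics.mean(data),
            "median": statistics.median(data),
            # Sample standard deviation (divides by n - 1); needs n >= 2.
            "stdev": statistics.stdev(data) if len(data) > 1 else None,
            "min": min(data),
            "max": max(data),
        }
    return summary


# Hypothetical data: customer service calls per day, grouped by shift.
calls = [12, 15, 11, 14, 20, 22, 19, 23]
shift = ["day", "day", "day", "day", "night", "night", "night", "night"]

result = describe_by(calls, shift)
for label, stats in result.items():
    print(label, stats)
```

With only four observations per group the summaries are computable but not very informative, which matches the caution above about small data sets.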
Statistical Inference
Statistical methods are used to make decisions and draw conclusions about populations.
Divided into two major areas
o Parameter estimation
o Hypothesis testing

An Example of a Parameter Estimation Problem
o An engineer is analyzing the tensile strength of a component used in an automobile chassis
o Variability is due to
    Raw material batches
    Manufacturing processes
    Measurement procedures
o The engineer wants to estimate the mean strength of the population of components
o The engineer uses sample data to compute a number that is in some sense a reasonable value of the true population mean: a point estimate of the parameter that has good statistical properties and the needed precision

An Example of Hypothesis Testing
o Two different reaction temperatures, t1 and t2, can be used in a chemical process.
o The engineer conjectures that t1 will result in higher yields than t2.
o Statistical hypothesis testing is the framework for solving problems of this type
    Formulate hypotheses that allow the claim "the mean yield using t1 is higher than the mean yield using t2" to be tested
    The focus is on drawing conclusions about a hypothesis that is relevant to the engineering decision

Point Estimation
A point estimate of some population parameter θ is a single numerical value $\hat{\theta}$ of a statistic $\hat{\Theta}$. The statistic $\hat{\Theta}$ is called the point estimator.
An example (unknown mean µ)
o Assume that the random variable X is normally distributed with an unknown mean µ.
o The sample mean $\bar{X}$ is a point estimator of the unknown population mean µ. That is, $\hat{\mu} = \bar{X}$.
o After the sample has been selected, the numerical value $\bar{x}$ is the point estimate of µ.
o Thus, if $x_1 = 25$, $x_2 = 30$, $x_3 = 29$, and $x_4 = 31$, the point estimate of µ is
    $\bar{x} = \frac{x_1 + x_2 + x_3 + x_4}{4} = \frac{25 + 30 + 29 + 31}{4} = 28.75$

Inferential Statistics
The study of how inferences are made from numerical data
o Deductive statistics
    The use of probability to determine the chance of obtaining a particular kind of sample result
o Inductive statistics
    Drawing general conclusions from the specific
    Inferences about populations are drawn from samples
    The sample is all that is known; we must determine uncertain characteristics of the population from the incomplete information available
o Statistical error
    Because statistics concerns uncertainty, statistical procedures are available both to control and to measure the risks of making erroneous conclusions.

The Population and the Sample
A statistical population
o is the collection of all possible observations of a specific characteristic of interest
o consists of observations of some characteristic of interest associated with the individuals concerned, not the individual items or persons themselves
Population (universe)
The sample contains only some of the observations
An observation
o is the basic element of statistics, a single data point
o may be a physical measurement (weight, height, etc.)
o may be an answer to a question (yes, no)
o may be a classification (defective or nondefective)
Sampling
o Biased: the sample favors specific groups (for example, persons who have similar tastes, education, and social experience)
o Not biased: random selection from the public at large, in which everyone has an equal chance of representation
Elementary units (concepts, constructs)
The Frame
o Helps define the population of interest – the target population
Predict the Outcome of an Election (the opinion survey)
o A sample of voter preferences toward candidates is taken to predict the outcome of an election
o The sample should represent the population of votes to be cast for a specific office
o The population of interest – the target population – is the roll of registered voters.
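The point-estimate calculation and the idea of an unbiased random selection from a frame can both be sketched in Python. This is a minimal illustration under stated assumptions: the frame of 1,000 numbered voters, the sample size of 50, and the fixed seed are hypothetical; only the four observations 25, 30, 29, 31 come from the notes.

```python
import random
import statistics

# Point estimate of the population mean from the four observations in the notes.
observations = [25, 30, 29, 31]
x_bar = statistics.mean(observations)
print(x_bar)  # 28.75

# Hypothetical frame: ID numbers for a roll of 1,000 registered voters.
frame = list(range(1, 1001))

# Simple random sample without replacement: every unit in the frame
# has the same chance of being selected, so no group is favored.
rng = random.Random(42)  # fixed seed only so that the sketch is repeatable
sample_ids = rng.sample(frame, k=50)
print(len(sample_ids), len(set(sample_ids)))
```

Sampling without replacement guarantees that no voter appears twice, which is why the two printed counts agree.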
The Working Population
o The votes to be cast by registered voters constitute the working population
o The sample must be drawn from this population

Two Kinds of Populations – Quantitative & Qualitative
o Quantitative population: the characteristic can be expressed numerically, such as height, weight, cost, income, etc.
o Qualitative population: the characteristic is non-numerical, such as sex, race, marital status, occupation, or college major

Table 1 – Possible observations for various quantitative populations
Elementary Unit    Characteristic of Interest    Unit of Measurement    Possible Value
Person             Age                           Years                  32.5 yrs
Micro-circuit      Defective solder joints       Number                 10
Tire               Remaining tread               Millimeters            10 mm
Account balance    Amount                        Dollars                $1,000.50
Employer           Female employees              Percent                45%
Common stock       EPS (earnings per share)      Dollars                $3.50
Keypuncher         Errors                        Proportion             0.01
Light bulb         Lifetime                      Hours                  500 hrs
Can of food        Weight of contents            Ounces                 16 oz

Table 2 – Possible observations for various qualitative populations
Elementary Unit    Characteristic of Interest    Possible Attributes
Person             Sex                           Male, female
Security           Type                          Bond, common stock, preferred stock
Building           Exterior materials            Brick, wood, aluminum
Employee           Experience                    Applicable, not applicable
Television         Quality                       Defective, non-defective
Firm               Legal status                  Corporation, partnership, proprietorship
Patient            Condition                     Satisfactory, critical
Student            Residence                     On-campus, off-campus

The Frequency Distribution
Finding a meaningful pattern in the data
Ages of a Sample of 100 Statistics Students (raw data)
o Class interval: grouping with a width of 2 years: 18.0-under 20.0, 20.0-under 22.0, and so on
o Count the number of ages in each class interval: Age Interval, Tally, Number of Persons (class frequency)
o Lower class limit, upper class limit
Graphical Display:
o The histogram – a visual display of the frequency distribution
    Vertical axis: frequency
    Horizontal axis: age, divided into class intervals
Constructing an equal-bin-width histogram
    Label the bin (class interval) boundaries on the horizontal scale
    Mark and label the vertical scale with frequencies or relative frequencies
    Above each bin, draw a rectangle whose height is equal to the frequency corresponding to that bin
Graphical Display: Frequency Polygon and Curves

Descriptive Analysis
The frequency distribution tells us
o How the observations cluster around a central value
o The degree of dispersion, or difference between observations
Descriptive analysis of the student-age sample
o No student is younger than 18, and ages below 28 are most typical
o Of those, the most common age is somewhere between 22 and 24
o The students in the sample are generally older. It is possible that the population is made up of night students and that the older persons work on their degrees on a part-time basis while holding full-time jobs

Constructing a Frequency Distribution
A more compact summary of data
Width and Number of Class Intervals
o Data are arranged into intervals called class intervals, cells, or bins
o Between 5 and 20 bins is satisfactory in most cases
o Class interval width = (Largest value – Smallest value) / Number of class intervals
Example: Fuel Consumption (miles per gallon) Achieved by 100 Medium-Size Cars (raw data)
o Width = (23.9 – 14.1)/5 = 1.96 mpg (rounded to 2.0)
o Frequency distribution
    Class interval (mpg)    Frequency
    14.0-under 16.0              9
    16.0-under 18.0             13
    18.0-under 20.0             24
    20.0-under 22.0             38
    22.0-under 24.0             16
    Total                      100

Relative & Cumulative Distributions
Relative frequency
o The observed frequency in each bin divided by the total number of observations, or
o The ratio of the number of observations in a particular category to the total number of observations.
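As a worked illustration of these definitions, the following Python sketch takes the class frequencies from the fuel-consumption example and derives the class width, the relative frequencies, and the cumulative frequencies. The variable names are illustrative; only the counts and the extreme values 14.1 and 23.9 come from the notes.

```python
from itertools import accumulate

# Class intervals and frequencies from the fuel-consumption example (mpg).
lower_limits = [14.0, 16.0, 18.0, 20.0, 22.0]
frequencies = [9, 13, 24, 38, 16]

# Suggested class width: (largest - smallest) / number of class intervals.
width = (23.9 - 14.1) / 5  # about 1.96, rounded to 2.0 in the notes

total = sum(frequencies)  # number of observations
relative = [f / total for f in frequencies]
cumulative = list(accumulate(frequencies))
cumulative_relative = [c / total for c in cumulative]

for low, f, r, c in zip(lower_limits, frequencies, relative, cumulative):
    print(f"{low:4.1f}-under {low + 2.0:4.1f}  freq={f:3d}  rel={r:.2f}  cum={c:3d}")
```

The relative frequencies sum to 1.00 and the last cumulative frequency equals the total number of observations, which gives a quick check on the tally.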
o Expressed as a decimal fraction (the relative frequencies total 1.00) or as a percentage (they total 100%)
Applications
o Comparison of two qualitative populations (there is less competition in …)
Relative Frequency Distribution
o A histogram or chart that shows the distribution of data units
Cumulative Frequency Distribution
o A cumulative frequency is the sum of the frequencies for successively higher class intervals; it applies only when the observations are numerical
o Provides useful descriptions of a population
o Example: cumulative frequency vs. SAT test score

Common Forms of the Frequency Distribution
o Normal distribution (bell-shaped); skewed to the left, skewed to the right, 2-tails
o Exponential distribution
o Uniform distribution

Exponential distribution
This frequency curve approximates a great many populations in which the observations involve items that exhibit changes in status over time
Applications
o Characterizing equipment lifetime until failure
o Time between successive arrivals of cars at a toll booth
o Time between successive arrivals of emergency hospital patients
o Analysis of waiting-line or queuing situations
Formula and Plots [3] – the exponential model, with only one unknown parameter, is the simplest of all life distribution models
o The key equations (failure rate λ > 0, time t ≥ 0)
    Cumulative Distribution Function (CDF): $F(t) = 1 - e^{-\lambda t}$
    Probability Density Function (PDF): $f(t) = \lambda e^{-\lambda t}$

Summary Descriptive Measures
Summary of Measures
Central Tendency or Location – a value around which observations tend to cluster and which typifies their magnitude
o Arithmetic mean
    Sample mean
    Sample mean using grouped data
o Median
o Mode
Dispersion or Variability (among observation values – shows how observed values differ from each other)
o Range – the difference between the largest and smallest observations
Median vs.
Mean
Skewed Distribution
Measuring Variability
o Average Deviation
o The Variance
o The Standard Deviation
    The Meaning
o Sample Variance
o Sample Standard Deviation
Grouped Data Calculation
Chebyshev's Theorem
Proportion – in qualitative data, indicates how frequently a particular attribute is observed
o Sample Proportion

The Statistical Sampling Study
The Need for Samples
o The economic advantage of using samples
o The time factor
o The very large population
o Partly inaccessible populations
o The destructive nature of the observation
o Accuracy and sampling
Designing and Conducting a Sampling Study (Major Stages)
Planning
o Identify the population
o Choose the observation procedure
o Choose the sample type
o Decide on the statistical procedure
o Find the necessary sample size
o (Deductive statistics used here)
Data collection
o Select sample units
o Make observations
o (Major goal: avoid bias)
Data analysis and conclusion
o Calculate sample statistics
o Estimate the values of population parameters
o Test hypotheses regarding populations
o (Inductive or inferential statistics used here)

References
[1] Chapter 6, Descriptive Statistics, Applied Statistics and Probability for Engineers, 5th Edition, by Douglas C. Montgomery and George C. Runger, John Wiley & Sons, Inc., 2011
[2] Chapter 7, Sampling Distributions and Point Estimation of Parameters, Applied Statistics and Probability for Engineers, 5th Edition, by Douglas C. Montgomery and George C.
Runger, John Wiley & Sons, Inc., 2011
[3] Engineering Statistics Handbook, http://www.itl.nist.gov/div898/handbook/apr/section1/apr161.htm
[4] Cumulative Distribution Functions (CDF), Mathworks, http://www.mathworks.com/help/toolbox/stats/cdf.html
[5] Probability Density Function, http://en.wikipedia.org/wiki/Probability_density_function
[6] Normal Distribution, http://en.wikipedia.org/wiki/Normal_distribution
[7] Excel 2010 Statistical Functions (new algorithms for improved accuracy), http://office.microsoft.com/en-us/excel-help/statistical-functions-HP005203066.aspx
[8] Microsoft Excel 2010 Function Improvements, http://blogs.msdn.com/b/excel/archive/2009/09/10/function-improvements-in-excel-2010.aspx
[9] Excel Tutorial, MS&E 121 Introduction to Stochastic Modeling, http://www.stanford.edu/class/msande121/Materials/ExcelTut.pdf
[10] Excel Formulas & Functions, http://lca.lehman.cuny.edu/lehman/itr/html/library/Excel-Formulas-manual.pdf
[11] Excel: Function and Data Analysis Tools, KU Library, http://www.techdocs.ku.edu/docs/excel_2003_functions.pdf
[12] Excel Probability Distribution Functions, http://me368.engr.wisc.edu/supplemental_material/excel_statistics_functions.doc
[13] Problems with Excel for Statistical Analysis, http://pages.stern.nyu.edu/~jsimonof/classes/1305/pdf/excelreg.pdf
[14] Percentile and Cumulative Probabilities (Sales Forecast Example), http://www.vertex42.com/ExcelArticles/mc/PercentileRank.html