Review of Basic Concepts Applied Statistics Analysis GENG 605 - L01 Fall 2023 Review of Basic Concepts Dr. Khalifa alkhalifa Research Decisions Discussed 1. Selection of a sample from a population 2. Evaluating whether a sample is representative of a population 3. Descriptive versus inferential applications of statistics 4. Levels of measurement and types of variables 5. Selection of a statistical analysis that is appropriate for the type of data 6. Experimental design versus nonexperimental design 1 Population vs. Sample Data are the numeric observations of a phenomenon of interest. The totality of all observations is a population. A portion used for analysis is a random sample Population a b Sample cd b ef gh i jk l m n o p q rs t u v w x y z c gi o n r u y 2 Population vs. Sample Actual studies often differ from ideal research situations described in research methods and statistics textbooks: How sample is obtained Level of measurement of variables Analysis strategy 3 Why Sample? Less time consuming than a census Less costly to administer than a census It is possible to obtain statistical results of a sufficiently high precision based on samples Sample statistics (known) Population parameters Inference (unknown, but can be estimated from sample evidence) Sample Population 4 Selection of sample from a population Samples are often described as randomly selected from a population whose members can be identified People often use accidental or convenience samples (whoever they can easily get) Advantage of convenience samples – low cost; disadvantage, limited generalizability. 5 How to evaluate whether a sample is representative If representative of a population, a sample should have similar demographics (e.g. same proportions male/ female, same mean age, and so forth) and similar scores on all important variables In practice we often rely on “proximal similarity”, we assume our results should generalize to hypothetical populations that happen to be similar to the people in our samples 6 Discussion How are samples in your research area usually obtained? What types of selection bias may arise from use of convenience samples? What populations (if any) can you make inferences about, based on these samples? How is the ability to generalize research findings limited by these types of samples? 7 Sample Mean If the n observations in a sample are denoted by x1 , x2 , , xn , … the sample mean is For a finite population with N equally likely values, the probability mass function is f(xi)= 1/N and the mean is 8 Example 1: Engine connectors Let’s consider the eight observations on pull-off force collected from the prototype engine. The eight observations are x1 = 12.6, x2 = 12.9, x3 = 13.4, x4 = 12.3, x5 = 13.6, x6 = 13.5, x7 = 12.6, and x8 = 13.1. The sample mean is 8 x average xi i 1 8 12.6 12.9 ... 13.1 8 104 13.0 pounds 8 The sample mean is the balance point i 1 2 3 4 5 6 7 8 xi 12.6 12.9 13.4 12.3 13.6 13.5 12.6 13.1 13.00 = AVERAGE($B2:$B9) 9 Sample Variance If x1 ,x2 , ,xn… is a sample of n observations, the sample variance is The sample standard deviation, s, is the positive square root of the sample variance 10 Why n-1 The true variance is based on data deviations from the true mean, μ The sample calculation is based on the data deviations from 𝑋 , not μ. 𝑋 is an estimator of μ; close but not the same. So, the n-1 divisor is used to compensate for the error in the mean estimation. “In statistics, the term "n-1" often appears in the context of sample variance and sample standard deviation calculations. It's used as a divisor when calculating these statistics. This adjustment is made to account for the fact that we are working with a sample from a larger population rather than the entire population.” 11 Degrees of Freedom The sample variance is calculated with the quantity n-1 This quantity is called the “degrees of freedom” Origin of the term: There are n-1 deviations from 𝑋, in the sample The sum of the deviations is zero n-1 of the observations can be freely determined, but the nth observation is fixed to maintain the zero sum 12 Sample Standard Deviation The standard deviation is the square root of the variance σ is the population standard deviation symbol s is the sample standard deviation symbol 13 Example 2 The quantities needed for calculating the sample variance and sample standard deviation for the pull-off force data is in the table below. Calculate the sample variance and sample standard deviation i 1 2 3 4 5 6 7 8 sums = xbar = xi xi - xbar 12.6 -0.4 12.9 -0.1 13.4 0.4 12.3 -0.7 13.6 0.6 13.5 0.5 12.6 -0.4 13.1 0.1 104.00 0.0 Divide by 8 13.00 variance = standard deviation = 2 (xi - xbar) 0.16 0.01 0.16 0.49 0.36 0.25 0.16 0.01 1.60 Divide by 7 0.2286 0.48 xi is pounds, Mean is pounds, Variance is pounds2, Standard deviation is pounds 14 Computation of s2 The prior calculation is definitional and tedious. A shortcut is derived here and involves just 2 sums 15 Population Variance Vs Sample Variance When the population is finite and consists of N values, we may define the population variance as The sample variance is a reasonable estimate of the population variance 16 Variance by Shortcut x xi i 1 s 2 i 1 n 1 n n 2 i i 1 2 3 4 5 6 7 8 Sums = xi 12.6 12.9 13.4 12.3 13.6 13.5 12.6 13.1 104.0 2 xi 158.76 166.41 179.56 151.29 184.96 182.25 158.76 171.61 1,353.60 2 n 1,353.60 104.0 8 7 2 1.60 0.2286 pounds 2 7 s 0.2286 0.48 pounds 17 Sample Range If the n observations in a sample are denoted by x1, x2, …, xn, the sample range is: r = max(xi) – min(xi) It is the largest observation in the sample minus the smallest observation From Example 1: r = 13.6 – 12.3 = 1.30 Note: population range ≥ sample range 18 Mean, Median, Mode Mean Median Mode Midpoint of ranked values Most frequently observed value n x x i1 i n Arithmetic average 19 Mean, Median, Mode The most common measure of central tendency Mean = sum of values divided by the number of values Affected by extreme values (outliers) 0 1 2 3 4 5 6 7 8 9 10 Mean = 3 1 2 3 4 5 15 3 5 5 0 1 2 3 4 5 6 7 8 9 10 Mean = 4 1 2 3 4 10 20 4 5 5 20 Mean, Median, Mode In an ordered list, the median is the “middle” number (50% above, 50% below) 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 Median = 3 Median = 3 Not affected by extreme values 21 Finding the Median The location of the median: n 1 Median position position in the ordered data 2 If the number of values is odd, the median is the middle number If the number of values is even, the median is the average of the two middle numbers n 1 Note that is not the value of the median, only the 2 position of the median in the ranked data 22 Finding the Mode A measure of central tendency Value that occurs most often Not affected by extreme values Used for either numerical or categorical data There may be no mode There may be several modes 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Mode = 9 0 1 2 3 4 5 6 No Mode 23 Range Vs Interquartile Range The range is the simplest measure of variation Difference between the largest and the smallest observations Ignores the way in which data are distributed (AI) Percentiles and the Interquartile Range (IQR) are statistical concepts used to describe and understand the distribution of data in a dataset, particularly when dealing with ordered or ranked data. They help summarize the spread and central tendency of the data. 24 Range Vs Interquartile Range 1.Percentiles: •Definition: A percentile is a measure that indicates the relative standing of a particular value within a dataset when the data is sorted in ascending order. It tells you what percentage of the data falls below or is equal to a specific value. •Calculation: To find the pth percentile, you arrange the data in ascending order, and then find the data point that separates the lowest p% of the data from the highest (100-p)% of the data. This data point is the pth percentile. •Example: The 25th percentile (also known as the first quartile, Q1) is the value below which 25% of the data falls. •Use: Percentiles are commonly used in various fields, such as education (e.g., standardized test scores), healthcare (e.g., growth charts for children), and finance (e.g., income percentiles). 25 Range Vs Interquartile Range 2.Interquartile Range (IQR): •Definition: The IQR is a measure of statistical dispersion or spread. It represents the range between the 25th percentile (Q1) and the 75th percentile (Q3) of a dataset. •Calculation: To calculate the IQR, you first find the values of Q1 and Q3 (the 25th and 75th percentiles), and then subtract Q1 from Q3. •Formula for IQR: IQR = Q3 - Q1 •Use: The IQR is often used to identify and understand the spread or variability of data. It is less sensitive to extreme values (outliers) compared to the range, making it a robust measure of dispersion. 26 Range Vs Interquartile Range Here's a simple step-by-step process for finding percentiles and the IQR: 1.Sort the data in ascending order. 2.Find the Median (50th percentile), which is the middle value if there's an odd number of data points or the average of the two middle values if the number of data points is even. 3.Find Q1 (the 25th percentile) by locating the median of the lower half of the data (all values below the median). 4.Find Q3 (the 75th percentile) by locating the median of the upper half of the data (all values above the median). 5.Calculate the IQR by subtracting Q1 from Q3. Interquartile range = 3rd quartile – 1st quartile IQR = Q3 – Q1 27 Interquartile Range Find a quartile by determining the value in the appropriate position in the ranked data, where First quartile position: Q1 = 0.25(n+1) Second quartile position: Q2 = 0.50(n+1), (the median position) Third quartile position: Q3 = 0.75(n+1) where n is the number of observed values 28 Example : Interquartile Range Data # 1 2 q1 3 4 5 q2 6 7 8 q3 9 10 11 3 74 1 99 12 73 40 23 20 13 27 Sort data A--->Z 1 3 12 13 20 23 27 40 73 74 99 First Quartile (q1), (n+1)/4 = (11+1)/4=3 q1 =X(3)=12 Second Quartile (q2) = median (n+1)/4 = (11+1)/4=6 q2 =X(6)=23 Third Quartile (q3), 3(n+1)/4 = 3(11+1)/4=9 q3 =X(9)=73 29 Example : Interquartile Range Data # 1 2 3 4 5 6 7 8 9 10 80 95 20 67 93 11 15 55 75 96 Sort data A--->Z 11 15 20 55 67 75 80 93 95 96 First Quartile (q1) (n+1)/4 = (10+1)/4=2.75 q1 is between X(2) and X(3), by interpolation: q1=X2+0.75*(X3-X2)=15+0.75(20-15)=18.75 Second Quartile (q2) = median (n+1)/2 = (10+1)/2=5.5 q2 is between X(5) and X(6). q2=(X5+X6)/2=(67+75)/2=71 Third Quartile (q3) 3(n+1)/4 = 3(10+1)/4=8.25 q3 is between X(8) and X(9), by interpolation: q3=X8+0.25*(X9-X8)=93+0.25*(95-93)=93.5 IQR=q3-q1= 93.5-18.75 =74.75 30 Frequency Distributions and Histograms A frequency distribution is a more compact summary of data than a stem-and-leaf diagram. To construct a frequency distribution, we must divide the range of the data into intervals, which are usually called class intervals, cells, or bins Number of bins is usually between 5 and 20 The number of classes, multiplied by the class interval, should exceed the range of the data. The square root of the sample size is a guide Sec 6-3 Frequency Distributions and Histograms 31 Frequency Distributions and Histograms The boundaries of the class intervals should be convenient values, as should the class width A histogram is a visual display of a frequency distribution, similar to a bar chart or a stem-and-leaf diagram Steps to construct a histogram with equal bin widths: 1) Label the bin boundaries on the horizontal scale 2) Mark & label the vertical scale with the frequencies or relative frequencies. 3) Above each bin, draw a rectangle whose height is equal to the frequency corresponding to that bin 32 Example Considerations: Range = 245 – 76 = 169 Class Frequency 70 ≤ x < 90 2 90 ≤ x < 110 3 Trial class width = 18.9 110 ≤ x < 130 6 130 ≤ x < 150 14 Decisions: Number of classes = 9 150 ≤ x < 170 22 170 ≤ x < 190 17 Class width = 20 190 ≤ x < 210 10 210 ≤ x < 230 4 Range of classes = 20 * 9 = 180 230 ≤ x < 250 2 80 Starting point = 70 Sqrt(80) = 8.9 Relative Frequency 0.0250 0.0375 0.0750 0.1750 0.2750 0.2125 0.1250 0.0500 0.0250 1.0000 Cumulative Relative Frequency 0.0250 0.0625 0.1375 0.3125 0.5875 0.8000 0.9250 0.9750 1.0000 33 Example 6 Histogram of compressive strength of 80 aluminum-lithium alloy specimens. Note these features – (1) horizontal scale bin boundaries & labels with units, (2) vertical scale measurements and labels, (3) histogram title at top or in legend. 34 Histograms with Unequal Bin Widths If the data is tightly clustered in some regions and scattered in others, it is visually helpful to use narrow class widths in the clustered region and wide class widths in the scattered areas In this approach, the rectangle area, not the height, must be proportional to the class frequency bin frequency Rectangle height = bin width 35 Poor Choices in Drawing Histograms Histogram of compressive strength of 80 aluminum-lithium alloy specimens. Errors: too many bins (17) create jagged shape, horizontal scale not at class boundaries, horizontal axis label does not include units. 36 Cumulative Frequency Plot Cumulative histogram of compressive strength of 80 aluminum-lithium alloy specimens. Comment: Easy to see cumulative probabilities, hard to see distribution shape 37 Shape of a Frequency Distribution Histograms of symmetric and skewed distributions (b) Symmetric distribution has identical mean, median and mode measures. (a & c) Skewed distributions are positive or negative, depending on the direction of the long tail. Their measures occur in alphabetical order as the distribution is approached from the long tail. 38 Histograms for Categorical Data Categorical data is of two types: Ordinal: categories have a natural order, e.g., year in college, military rank. Nominal: Categories are simply different, e.g., gender, colors Histogram bars are for each category, are of equal width, and have a height equal to the category’s frequency or relative frequency A Pareto chart is a histogram in which the categories are sequenced in decreasing order. This approach emphasizes the most and least important categories. 39 Example 6: Categorical Data Histogram Airplane production in 1985. (Source: Boeing Company) Comment: Illustrates nominal data in spite of the numerical names, categories are shown at the bin’s midpoint, a Pareto chart since the categories are in decreasing order 40 Box Plot or Box-and-Whisker Chart A box plot is a graphical display showing center, spread, shape, and outliers (SOCS). It displays the 5-number summary: min, q1, median, q3, and max. Description of a box plot Sec 6-4 Box Plots 41 Box Plot or Box-and-Whisker Chart 42 Example 7 Box plot of compressive strength of 80 aluminum-lithium alloy specimens. Comment: Box plot may be shown vertically or horizontally, data reveals three outliers * and no extreme outliers. Lower outlier limit is: 143.5 – 1.5*(181.0-143.5) = 87.25 43 Box Plot or Box-and-Whisker Chart Symmetric Positive Skew Negative Skew 44 Time Sequence Plots A time series plot shows the data value, or statistic, on the vertical axis with time on the horizontal axis A time series plot reveals trends, cycles or other time-oriented behavior that could not be seen in the data Upward trend Cycle Company sales by year (a). By quarter (b) 45 Time Sequence Plots A time series plot is a graph in which the vertical axis denotes the observed value of the variable (say x) and the horizontal axis denotes the time (which could be minutes, days, years, etc.) When measurements are plotted as a time series, we often see • trends, • cycles, or • other broad features of the data 46 Digidot Plot Combining a time series plot with some of the other graphical displays that we have considered previously will be very helpful sometimes. The stem-and-leaf plot combined with a time series Plot forms a digidot plot 47 Probability Plots Probability plotting is a graphical method for determining whether sample data conform to a hypothesized distribution based on a subjective visual examination of the data. Probability plotting typically uses special graph paper, known as probability paper, that has been designed for the hypothesized distribution. 48 Example 7: Probability Plots The effective service life (Xj in minutes) of batteries used in a laptop are given in the table. We hypothesize that battery life is adequately modeled by a normal distribution. Ten observations 176, 191, 214, 220, 205,192, 201, 190, 183, 185. It is assumed that battery life is modeled by a normal distribution Use the probability plotting to investigate this assumption 49 Constructing a Probability Plot To construct a probability plot: 1. Sort the data observations in ascending order: x(1), x(2),…, x(n). 2. The observed value x(j) is plotted against the observed cumulative frequency (j – 0.5)/n 3. The paired numbers are plotted on the probability paper of the proposed distribution If the paired numbers form a straight line, then the hypothesized distribution adequately describes the data 50 Example 7: Probability Plots First, arrange the observations in ascending order and calculate their cumulative frequency (j-0.5)/10 as shown below Calculations for Constructing a Normal Probability Plot j 1 2 3 4 5 6 7 8 9 10 x(j) 176 183 185 190 191 192 201 205 214 220 (j-0.5)/10 0.05 0.15 0.25 0.35 0.45 0.55 0.65 0.75 0.85 0.95 On ordinary paper 100(j-0.5)/10 5 15 25 35 45 55 65 75 85 95 Normal probability plot for battery life 51 Probability Plot on Standardized Normal Scores A normal probability plot can be plotted on ordinary axes using zvalues. The normal probability scale is not used. Calculations for Constructing a Normal Probability Plot j 1 2 3 4 5 6 7 8 9 10 Normal Probability plot obtained from standardized normal scores. This is equivalent to previous figure x(j) 176 183 185 190 191 192 201 205 214 220 (j-0.5)/10 0.05 0.15 0.25 0.35 0.45 0.55 0.65 0.75 0.85 0.95 zj -1.64 -1.04 -0.67 -0.39 -0.13 0.13 0.39 0.67 1.04 1.64 52 Probability Plot on Standardized Normal Scores Normal probability plots indicating a non-normal distribution (a) Heavy tailed distribution (b) Light tailed distribution (c) Right skewed distribution 53 Descriptive versus Inferential Statistics Descriptive Statistics: Statistics such as sample mean used only to describe the sample. Example: mean exam score for a class of students. Inferential Statistics: Results for sample used to make inferences about means or correlations in a broader population. Example: In most published research, author suggests that results can be generalized to a broader population, beyond the sample in study. 54 Types of Variables In this textbook, a simpler distinction is made between two types of variables: Categorical Quantitative 55 Type of variable (categorical versus quantitative) determines What type of descriptive statistics and graphs are used to describe pattern of scores on just one variable (e.g. bar chart versus histogram) What types of statistics are used to describe statistical relationship between two variables (e.g. a categorical independent variable such as drug/no drug and a quantitative outcome such as anxiety score or heart rate t test) 56 Type of Variable Influences Type of Graph and Choice of Analysis Categorical Variable • Frequency table • Bar Chart • Number and percent of cases in each group • Mode Quantitative Variable • Frequency table • Histogram • Mean, Median • Variance, Standard Deviation • Min, Max, Range 57 Normal Distribution 58 Research validity Internal validity: the degree to which research results can be used to infer causality. A well designed experiment meets the conditions for causal inference and has strong internal validity. External validity: the degree to which the research situation resembles the real life situations of interest; a study that uses a highly artificial situation that is not similar to any situation in the real world has weak external validity. 59 Ideally, research should be strong on both internal and external validity Experiments are typically stronger on internal validity; however, if the situation is very contrived, an experiment may be weak on external validity. Non experimental studies conducted in field settings examining naturally occurring behaviors are typically stronger on external validity; however, they are usually very weak on internal validity because rival explanatory variables are not controlled. 60 Experiments are sometimes not possible For some research questions (e.g. the effect of long term smoking on health outcomes in humans), it is not possible to do an experiment, for two reasons: Not possible to randomly assign people to smoke / not smoke for a long period of time. Even if it were possible, it would be unethical. 61 parametric & nonparametric analysis Parametric and Nonparametric analyses are two different approaches used in statistics to analyze data. They are distinguished by the assumptions they make about the underlying data distribution and the types of data they are best suited for. Here's an overview of each and their differences: Parametric Analysis: Assumptions: Parametric methods assume that the data comes from a specific probability distribution, typically the normal (Gaussian) distribution. This means that you assume a certain shape for the data. Examples: Parametric tests include t-tests, analysis of variance (ANOVA), regression analysis, and many more. These tests make assumptions about the population parameters, such as means and variances. Use Cases: Parametric tests are generally used when you have data that can be reasonably assumed to follow a specific distribution, and when you want to make inferences about population parameters. For example, you might use a t-test to compare means of two groups. Efficiency: Parametric tests are often more powerful (i.e., they can detect smaller differences or effects) when their assumptions are met. They can also provide more precise estimates of parameters. This book primarily covers parametric analyses. 62 Parametric & nonparametric analysis Nonparametric Analysis: Assumptions: Nonparametric methods make fewer assumptions about the underlying data distribution. They do not assume that the data follows a specific probability distribution. Examples: Nonparametric tests include the Wilcoxon signed-rank test, Mann-Whitney U test, Kruskal-Wallis test, and the chi-squared test. These tests are used when data does not meet the assumptions of parametric tests or when you want to avoid making strong distributional assumptions. Use Cases: Nonparametric tests are used when data is categorical, ordinal, or when the assumptions of parametric tests are violated. They are also used when you need to compare medians or test for associations between variables in a distribution-free manner. Robustness: Nonparametric tests are considered more robust because they don't rely on specific distributional assumptions. They can be used in a wider range of situations, especially when the data is skewed or has outliers. 63 Differences between Parametric and Nonparametric Analysis Assumptions: The main difference is in their assumptions. Parametric tests assume a specific data distribution (usually normal), while nonparametric tests make fewer or no distributional assumptions. Data Types: Parametric tests are typically used with interval or ratio data (continuous), while nonparametric tests are more flexible and can be used with nominal, ordinal, interval, or ratio data. Power: Parametric tests tend to be more powerful when their assumptions are met. Nonparametric tests may have lower power but are more robust in situations where data doesn't conform to parametric assumptions. Sample Size: Parametric tests may require larger sample sizes to perform well, especially when dealing with non-normal data. Nonparametric tests can often be used with smaller sample sizes. In practice, the choice between parametric and nonparametric analysis depends on your data and the assumptions you can reasonably make. If your data meets the assumptions of parametric tests, they are generally preferred due to their higher statistical power. However, if you're unsure about data distribution or your data doesn't meet the assumptions, nonparametric tests provide a robust alternative. 64 Summary Each student should be able to apply all these concepts to the research question that he/she is most interested in. Given a research question, student should be able to develop a plan for obtaining a sample, describe how variables will be manipulated and/or measured, identify whether variables are categorical or quantitative, state whether the research is experimental or non experimental. 65 References Applied Statistics: From Bivariate Through Multivariate Techniques by Montgomery, Rebecca M. Warner, 2th edition., SAGE Publishing, 2012. ISBN: 9781412991346. Applied Statistics and Probability for Engineers by Montgomery, D. C. and Runger, G. C., 6th edition., Wiley, 2011. ISBN-13: 978-0470-50578-6. 66