Uploaded by Sager AlKuwari

Week 1 - Review of Basic Concepts

advertisement
Review of Basic Concepts
Applied Statistics Analysis
GENG 605 - L01
Fall 2023
Review of Basic Concepts
Dr. Khalifa alkhalifa
Research Decisions Discussed
1.
Selection of a sample from a population
2.
Evaluating whether a sample is representative of a population
3.
Descriptive versus inferential applications of statistics
4.
Levels of measurement and types of variables
5.
Selection of a statistical analysis that is appropriate for the type
of data
6.
Experimental design versus nonexperimental design
1
Population vs. Sample
 Data are the numeric observations of a phenomenon of interest.
The totality of all observations is a population. A portion used for
analysis is a random sample
Population
a b
Sample
cd
b
ef gh i jk l m n
o p q rs t u v w
x y
z
c
gi
o
n
r
u
y
2
Population vs. Sample
 Actual studies often differ from ideal research situations
described in research methods and statistics textbooks:
How sample is obtained
Level of measurement of variables
Analysis strategy
3
Why Sample?
 Less time consuming than a census
 Less costly to administer than a census
 It is possible to obtain statistical results of a sufficiently high
precision based on samples
Sample statistics
(known)
Population parameters
Inference
(unknown, but can be estimated
from sample evidence)
Sample
Population
4
Selection of sample from a population
 Samples are often described as randomly selected from a
population whose members can be identified
 People often use accidental or convenience samples (whoever
they can easily get)
 Advantage of convenience samples – low cost; disadvantage,
limited generalizability.
5
How to evaluate whether a sample is
representative
 If representative of a population, a sample should
have similar demographics (e.g. same proportions
male/ female, same mean age, and so forth) and
similar scores on all important variables
 In practice we often rely on “proximal similarity”, we
assume our results should generalize to hypothetical
populations that happen to be similar to the people in
our samples
6
Discussion
 How are samples in your research area usually
obtained?
 What types of selection bias may arise from use of
convenience samples?
 What populations (if any) can you make inferences
about, based on these samples?
 How is the ability to generalize research findings
limited by these types of samples?
7
Sample Mean
 If the n observations in a sample are denoted by x1 , x2 , , xn , …
the sample mean is
 For a finite population with N equally likely values, the probability
mass function is f(xi)= 1/N and the mean is
8
Example 1: Engine connectors
Let’s consider the eight observations on pull-off force collected from
the prototype engine. The eight observations are x1 = 12.6, x2 = 12.9,
x3 = 13.4, x4 = 12.3, x5 = 13.6, x6 = 13.5, x7 = 12.6, and x8 = 13.1.
The sample mean is
8
x  average 

 xi
i 1
8
12.6  12.9  ...  13.1

8
104
 13.0 pounds
8
The sample mean is the balance point
i
1
2
3
4
5
6
7
8
xi
12.6
12.9
13.4
12.3
13.6
13.5
12.6
13.1
13.00
= AVERAGE($B2:$B9)
9
Sample Variance
 If x1 ,x2 , ,xn… is a sample of n observations, the sample variance is
 The sample standard deviation, s, is the positive square root of the
sample variance
10
Why n-1
 The true variance is based on data deviations from the true mean, μ
 The sample calculation is based on the data deviations from 𝑋 , not
μ. 𝑋 is an estimator of μ; close but not the same. So, the n-1 divisor
is used to compensate for the error in the mean estimation.
 “In statistics, the term "n-1" often appears in the context of sample
variance and sample standard deviation calculations. It's used as a
divisor when calculating these statistics. This adjustment is made to
account for the fact that we are working with a sample from a larger
population rather than the entire population.”
11
Degrees of Freedom
 The sample variance is calculated with the quantity n-1
 This quantity is called the “degrees of freedom”
 Origin of the term:
There are n-1 deviations from 𝑋, in the sample
The sum of the deviations is zero
n-1 of the observations can be freely determined, but the nth
observation is fixed to maintain the zero sum
12
Sample Standard Deviation
 The standard deviation is the square root of the variance
 σ is the population standard deviation symbol
 s is the sample standard deviation symbol
13
Example 2
The quantities needed for calculating the sample variance and
sample standard deviation for the pull-off force data is in the table
below. Calculate the sample variance and sample standard deviation
i
1
2
3
4
5
6
7
8
sums =
xbar =
xi
xi - xbar
12.6
-0.4
12.9
-0.1
13.4
0.4
12.3
-0.7
13.6
0.6
13.5
0.5
12.6
-0.4
13.1
0.1
104.00
0.0
Divide by 8
13.00
variance =
standard deviation =
2
(xi - xbar)
0.16
0.01
0.16
0.49
0.36
0.25
0.16
0.01
1.60
Divide by 7
0.2286
0.48
xi is pounds, Mean is pounds, Variance is pounds2, Standard deviation is pounds
14
Computation of s2
The prior calculation is definitional and tedious. A shortcut is
derived here and involves just 2 sums
15
Population Variance Vs Sample Variance
 When the population is finite and consists of N values, we may
define the population variance as
 The sample variance is a reasonable estimate of the population
variance
16
Variance by Shortcut


x    xi 

 i 1 
s 2  i 1
n 1
n
n
2
i
i
1
2
3
4
5
6
7
8
Sums =
xi
12.6
12.9
13.4
12.3
13.6
13.5
12.6
13.1
104.0
2
xi
158.76
166.41
179.56
151.29
184.96
182.25
158.76
171.61
1,353.60
2
n
1,353.60  104.0  8

7
2
1.60

 0.2286 pounds 2
7
s  0.2286  0.48 pounds
17
Sample Range
If the n observations in a sample are denoted by x1, x2, …, xn, the
sample range is:
r = max(xi) – min(xi)
It is the largest observation in the sample minus the smallest
observation
From Example 1:
r = 13.6 – 12.3 = 1.30
Note: population range ≥ sample range
18
Mean, Median, Mode
Mean
Median
Mode
Midpoint of
ranked values
Most frequently
observed value
n
x
x
i1
i
n
Arithmetic
average
19
Mean, Median, Mode
 The most common measure of central tendency
 Mean = sum of values divided by the number of values
 Affected by extreme values (outliers)
0 1 2 3 4 5 6 7 8 9 10
Mean = 3
1  2  3  4  5 15

3
5
5
0 1 2 3 4 5 6 7 8 9 10
Mean = 4
1  2  3  4  10 20

4
5
5
20
Mean, Median, Mode
 In an ordered list, the median is the “middle” number (50% above,
50% below)
0 1 2 3 4 5 6 7 8 9 10
0 1 2 3 4 5 6 7 8 9 10
Median = 3
Median = 3
Not affected by extreme values
21
Finding the Median
 The location of the median:
n 1
Median position 
position in the ordered data
2
 If the number of values is odd, the median is the middle number
 If the number of values is even, the median is the average of the two
middle numbers
n 1
 Note that
is not the value of the median, only the
2
position of the median in the ranked data
22
Finding the Mode
 A measure of central tendency
 Value that occurs most often
 Not affected by extreme values
 Used for either numerical or categorical data
 There may be no mode
 There may be several modes
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Mode = 9
0 1 2 3 4 5 6
No Mode
23
Range Vs Interquartile Range
 The range is the simplest measure of variation
 Difference between the largest and the smallest observations
 Ignores the way in which data are distributed
 (AI) Percentiles and the Interquartile Range (IQR) are statistical
concepts used to describe and understand the distribution of data
in a dataset, particularly when dealing with ordered or ranked data.
They help summarize the spread and central tendency of the data.
24
Range Vs Interquartile Range
1.Percentiles:
•Definition: A percentile is a measure that indicates the relative
standing of a particular value within a dataset when the data is sorted in
ascending order. It tells you what percentage of the data falls below or is
equal to a specific value.
•Calculation: To find the pth percentile, you arrange the data in
ascending order, and then find the data point that separates the lowest
p% of the data from the highest (100-p)% of the data. This data point is
the pth percentile.
•Example: The 25th percentile (also known as the first quartile, Q1) is
the value below which 25% of the data falls.
•Use: Percentiles are commonly used in various fields, such as
education (e.g., standardized test scores), healthcare (e.g., growth
charts for children), and finance (e.g., income percentiles).
25
Range Vs Interquartile Range
2.Interquartile Range (IQR):
•Definition: The IQR is a measure of statistical
dispersion or spread. It represents the range between
the 25th percentile (Q1) and the 75th percentile (Q3) of
a dataset.
•Calculation: To calculate the IQR, you first find the
values of Q1 and Q3 (the 25th and 75th percentiles),
and then subtract Q1 from Q3.
•Formula for IQR: IQR = Q3 - Q1
•Use: The IQR is often used to identify and understand
the spread or variability of data. It is less sensitive to
extreme values (outliers) compared to the range,
making it a robust measure of dispersion.
26
Range Vs Interquartile Range
Here's a simple step-by-step process for finding percentiles and
the IQR:
1.Sort the data in ascending order.
2.Find the Median (50th percentile), which is the middle value
if there's an odd number of data points or the average of the
two middle values if the number of data points is even.
3.Find Q1 (the 25th percentile) by locating the median of the
lower half of the data (all values below the median).
4.Find Q3 (the 75th percentile) by locating the median of the
upper half of the data (all values above the median).
5.Calculate the IQR by subtracting Q1 from Q3.
 Interquartile range = 3rd quartile – 1st quartile
IQR = Q3 – Q1
27
Interquartile Range
 Find a quartile by determining the value in the appropriate position
in the ranked data, where
 First quartile position:
Q1 = 0.25(n+1)
 Second quartile position: Q2 = 0.50(n+1), (the median position)
 Third quartile position:
Q3 = 0.75(n+1)
where n is the number of observed values
28
Example : Interquartile Range
Data
#
1
2
q1 3
4
5
q2 6
7
8
q3 9
10
11
3
74
1
99
12
73
40
23
20
13
27
Sort data
A--->Z
1
3
12
13
20
23
27
40
73
74
99
First Quartile (q1), (n+1)/4 = (11+1)/4=3
 q1 =X(3)=12
Second Quartile (q2) = median
(n+1)/4 = (11+1)/4=6  q2 =X(6)=23
Third Quartile (q3), 3(n+1)/4 = 3(11+1)/4=9
 q3 =X(9)=73
29
Example : Interquartile Range
Data
#
1
2
3
4
5
6
7
8
9
10
80
95
20
67
93
11
15
55
75
96
Sort data
A--->Z
11
15
20
55
67
75
80
93
95
96
First Quartile (q1)
(n+1)/4 = (10+1)/4=2.75
q1 is between X(2) and X(3), by interpolation:
q1=X2+0.75*(X3-X2)=15+0.75(20-15)=18.75
Second Quartile (q2) = median
(n+1)/2 = (10+1)/2=5.5  q2 is between
X(5) and X(6).
q2=(X5+X6)/2=(67+75)/2=71
Third Quartile (q3)
3(n+1)/4 = 3(10+1)/4=8.25
q3 is between X(8) and X(9), by
interpolation:
q3=X8+0.25*(X9-X8)=93+0.25*(95-93)=93.5
IQR=q3-q1= 93.5-18.75 =74.75
30
Frequency Distributions and Histograms
 A frequency distribution is a more compact summary of data
than a stem-and-leaf diagram.
 To construct a frequency distribution, we must divide the
range of the data into intervals, which are usually called class
intervals, cells, or bins
 Number of bins is usually between 5 and 20
 The number of classes, multiplied by the class interval,
should exceed the range of the data. The square root of the
sample size is a guide
Sec 6-3 Frequency Distributions and Histograms
31
Frequency Distributions and Histograms
 The boundaries of the class intervals should be convenient values,
as should the class width
 A histogram is a visual display of a frequency distribution,
similar to a bar chart or a stem-and-leaf diagram
 Steps to construct a histogram with equal bin widths:
1) Label the bin boundaries on the horizontal scale
2) Mark & label the vertical scale with the frequencies or
relative frequencies.
3) Above each bin, draw a rectangle whose height is equal
to the frequency corresponding to that bin
32
Example
Considerations:
Range = 245 – 76 = 169
Class
Frequency
70 ≤ x < 90
2
90 ≤ x < 110
3
Trial class width = 18.9
110 ≤ x < 130
6
130 ≤ x < 150
14
Decisions:
Number of classes = 9
150 ≤ x < 170
22
170 ≤ x < 190
17
Class width = 20
190 ≤ x < 210
10
210 ≤ x < 230
4
Range of classes = 20 * 9 = 180 230 ≤ x < 250
2
80
Starting point = 70
Sqrt(80) = 8.9
Relative
Frequency
0.0250
0.0375
0.0750
0.1750
0.2750
0.2125
0.1250
0.0500
0.0250
1.0000
Cumulative
Relative
Frequency
0.0250
0.0625
0.1375
0.3125
0.5875
0.8000
0.9250
0.9750
1.0000
33
Example 6
Histogram of compressive strength of 80 aluminum-lithium alloy specimens. Note these
features – (1) horizontal scale bin boundaries & labels with units, (2) vertical scale
measurements and labels, (3) histogram title at top or in legend.
34
Histograms with Unequal Bin Widths
 If the data is tightly clustered in some regions and scattered in
others, it is visually helpful to use narrow class widths in the
clustered region and wide class widths in the scattered areas
 In this approach, the rectangle area, not the height, must be
proportional to the class frequency
bin frequency
Rectangle height =
bin width
35
Poor Choices in Drawing Histograms
Histogram of compressive strength of 80 aluminum-lithium alloy
specimens. Errors: too many bins (17) create jagged shape, horizontal
scale not at class boundaries, horizontal axis label does not include units.
36
Cumulative Frequency Plot
Cumulative histogram of compressive strength of 80 aluminum-lithium alloy
specimens. Comment: Easy to see cumulative probabilities, hard to see
distribution shape
37
Shape of a Frequency Distribution
Histograms of symmetric and skewed distributions
 (b) Symmetric distribution has identical mean, median and mode measures.
 (a & c) Skewed distributions are positive or negative, depending on the direction
of the long tail. Their measures occur in alphabetical order as the distribution is
approached from the long tail.
38
Histograms for Categorical Data
 Categorical data is of two types:
 Ordinal: categories have a natural order, e.g., year in college, military
rank.
 Nominal: Categories are simply different, e.g., gender, colors
 Histogram bars are for each category, are of equal width, and have
a height equal to the category’s frequency or relative frequency
 A Pareto chart is a histogram in which the categories are
sequenced in decreasing order. This approach emphasizes the
most and least important categories.
39
Example 6: Categorical Data Histogram
Airplane production in 1985. (Source: Boeing Company) Comment: Illustrates
nominal data in spite of the numerical names, categories are shown at the bin’s
midpoint, a Pareto chart since the categories are in decreasing order
40
Box Plot or Box-and-Whisker Chart
 A box plot is a graphical display showing center, spread,
shape, and outliers (SOCS).
 It displays the 5-number summary: min, q1, median, q3, and
max.
Description of a box plot
Sec 6-4 Box Plots
41
Box Plot or Box-and-Whisker Chart
42
Example 7
Box plot of compressive strength of 80 aluminum-lithium alloy specimens.
Comment: Box plot may be shown vertically or horizontally, data reveals three
outliers * and no extreme outliers. Lower outlier limit is: 143.5 – 1.5*(181.0-143.5)
= 87.25
43
Box Plot or Box-and-Whisker Chart
Symmetric
Positive Skew
Negative Skew
44
Time Sequence Plots
 A time series plot shows the data value, or statistic, on the vertical axis
with time on the horizontal axis
 A time series plot reveals trends, cycles or other time-oriented behavior
that could not be seen in the data
Upward trend
Cycle
Company sales by year (a). By quarter (b)
45
Time Sequence Plots
 A time series plot is a graph in which the vertical axis denotes the
observed value of the variable (say x) and the horizontal axis
denotes the time (which could be minutes, days, years, etc.)
 When measurements are plotted as a time series, we
often see
• trends,
• cycles, or
• other broad features of the data
46
Digidot Plot
 Combining a time series plot with some of the other graphical
displays that we have considered previously will be very helpful
sometimes. The stem-and-leaf plot combined with a time series
Plot forms a digidot plot
47
Probability Plots
 Probability plotting is a graphical method for determining whether
sample data conform to a hypothesized distribution based on a
subjective visual examination of the data.
 Probability plotting typically uses special graph paper, known as
probability paper, that has been designed for the hypothesized
distribution.
48
Example 7: Probability Plots
 The effective service life (Xj in minutes) of batteries used in a laptop
are given in the table. We hypothesize that battery life is
adequately modeled by a normal distribution.
 Ten observations
176, 191, 214, 220, 205,192, 201, 190, 183, 185.
 It is assumed that battery life is modeled by a normal distribution
 Use the probability plotting to investigate this assumption
49
Constructing a Probability Plot
 To construct a probability plot:
1. Sort the data observations in ascending order: x(1), x(2),…, x(n).
2. The observed value x(j) is plotted against the observed
cumulative frequency (j – 0.5)/n
3. The paired numbers are plotted on the probability paper of
the proposed distribution
 If the paired numbers form a straight line, then the hypothesized
distribution adequately describes the data
50
Example 7: Probability Plots
 First, arrange the observations in ascending order and calculate
their cumulative frequency (j-0.5)/10 as shown below
Calculations for Constructing a Normal
Probability Plot
j
1
2
3
4
5
6
7
8
9
10
x(j)
176
183
185
190
191
192
201
205
214
220
(j-0.5)/10
0.05
0.15
0.25
0.35
0.45
0.55
0.65
0.75
0.85
0.95
On ordinary paper
100(j-0.5)/10
5
15
25
35
45
55
65
75
85
95
Normal probability plot for battery life
51
Probability Plot on Standardized Normal Scores
 A normal probability plot can be plotted on ordinary axes using zvalues. The normal probability scale is not used.
Calculations for Constructing a
Normal Probability Plot
j
1
2
3
4
5
6
7
8
9
10
Normal Probability plot obtained from
standardized normal scores. This is
equivalent to previous figure
x(j)
176
183
185
190
191
192
201
205
214
220
(j-0.5)/10
0.05
0.15
0.25
0.35
0.45
0.55
0.65
0.75
0.85
0.95
zj
-1.64
-1.04
-0.67
-0.39
-0.13
0.13
0.39
0.67
1.04
1.64
52
Probability Plot on Standardized Normal Scores
 Normal probability plots indicating a non-normal distribution
(a) Heavy tailed distribution
(b) Light tailed distribution
(c) Right skewed distribution
53
Descriptive versus Inferential Statistics
Descriptive Statistics: Statistics such as sample
mean used only to describe the sample.
Example: mean exam score for a class of
students.
Inferential Statistics: Results for sample used to
make inferences about means or correlations in a
broader population. Example: In most published
research, author suggests that results can be
generalized to a broader population, beyond the
sample in study.
54
Types of Variables
In this textbook, a simpler distinction is made between
two types of variables:
 Categorical
 Quantitative
55
Type of variable (categorical versus
quantitative) determines
 What type of descriptive statistics and graphs are used to describe
pattern of scores on just one variable (e.g. bar chart versus
histogram)
 What types of statistics are used to describe statistical relationship
between two variables (e.g. a categorical independent variable
such as drug/no drug and a quantitative outcome such as anxiety
score or heart rate  t test)
56
Type of Variable Influences Type of Graph and
Choice of Analysis
Categorical Variable
• Frequency table
• Bar Chart
• Number and percent of
cases in each group
• Mode
Quantitative Variable
• Frequency table
• Histogram
• Mean, Median
• Variance, Standard
Deviation
• Min, Max, Range
57
Normal Distribution
58
Research validity
Internal validity: the degree to which research
results can be used to infer causality. A well
designed experiment meets the conditions for causal
inference and has strong internal validity.
External validity: the degree to which the research
situation resembles the real life situations of interest;
a study that uses a highly artificial situation that is
not similar to any situation in the real world has weak
external validity.
59
Ideally, research should be strong on both
internal and external validity
Experiments are typically stronger on internal
validity; however, if the situation is very contrived, an
experiment may be weak on external validity.
Non experimental studies conducted in field settings
examining naturally occurring behaviors are typically
stronger on external validity; however, they are
usually very weak on internal validity because rival
explanatory variables are not controlled.
60
Experiments are sometimes not possible
For some research questions (e.g. the effect
of long term smoking on health outcomes in
humans), it is not possible to do an
experiment, for two reasons:
Not possible to randomly assign people to
smoke / not smoke for a long period of
time.
Even if it were possible, it would be
unethical.
61
parametric & nonparametric analysis
Parametric and Nonparametric analyses are two different approaches used
in statistics to analyze data. They are distinguished by the assumptions they
make about the underlying data distribution and the types of data they are
best suited for. Here's an overview of each and their differences:
 Parametric Analysis:
 Assumptions: Parametric methods assume that the data comes from a
specific probability distribution, typically the normal (Gaussian)
distribution. This means that you assume a certain shape for the data.
 Examples: Parametric tests include t-tests, analysis of variance (ANOVA),
regression analysis, and many more. These tests make assumptions
about the population parameters, such as means and variances.
 Use Cases: Parametric tests are generally used when you have data that
can be reasonably assumed to follow a specific distribution, and when you
want to make inferences about population parameters. For example, you
might use a t-test to compare means of two groups.
 Efficiency: Parametric tests are often more powerful (i.e., they can detect
smaller differences or effects) when their assumptions are met. They can
also provide more precise estimates of parameters.
 This book primarily covers parametric analyses.
62
Parametric & nonparametric analysis
 Nonparametric Analysis:
 Assumptions: Nonparametric methods make fewer assumptions about
the underlying data distribution. They do not assume that the data follows
a specific probability distribution.
 Examples: Nonparametric tests include the Wilcoxon signed-rank test,
Mann-Whitney U test, Kruskal-Wallis test, and the chi-squared test. These
tests are used when data does not meet the assumptions of parametric
tests or when you want to avoid making strong distributional assumptions.
 Use Cases: Nonparametric tests are used when data is categorical,
ordinal, or when the assumptions of parametric tests are violated. They
are also used when you need to compare medians or test for associations
between variables in a distribution-free manner.
 Robustness: Nonparametric tests are considered more robust because
they don't rely on specific distributional assumptions. They can be used in
a wider range of situations, especially when the data is skewed or has
outliers.
63
Differences between Parametric and
Nonparametric Analysis
 Assumptions: The main difference is in their assumptions. Parametric
tests assume a specific data distribution (usually normal), while
nonparametric tests make fewer or no distributional assumptions.
 Data Types: Parametric tests are typically used with interval or ratio data
(continuous), while nonparametric tests are more flexible and can be used
with nominal, ordinal, interval, or ratio data.
 Power: Parametric tests tend to be more powerful when their assumptions
are met. Nonparametric tests may have lower power but are more robust
in situations where data doesn't conform to parametric assumptions.
 Sample Size: Parametric tests may require larger sample sizes to perform
well, especially when dealing with non-normal data. Nonparametric tests
can often be used with smaller sample sizes.
 In practice, the choice between parametric and nonparametric analysis
depends on your data and the assumptions you can reasonably make. If
your data meets the assumptions of parametric tests, they are generally
preferred due to their higher statistical power. However, if you're unsure
about data distribution or your data doesn't meet the assumptions,
nonparametric tests provide a robust alternative.
64
Summary
 Each student should be able to apply all these
concepts to the research question that he/she is
most interested in.
 Given a research question, student should be able to
develop a plan for obtaining a sample, describe how
variables will be manipulated and/or measured,
identify whether variables are categorical or
quantitative, state whether the research is
experimental or non experimental.
65
References
 Applied Statistics: From Bivariate Through Multivariate Techniques by
Montgomery, Rebecca M. Warner, 2th edition., SAGE Publishing, 2012.
ISBN: 9781412991346.
 Applied Statistics and Probability for Engineers by Montgomery, D. C. and
Runger, G. C., 6th edition., Wiley, 2011. ISBN-13: 978-0470-50578-6.
66
Download