Uploaded by Shirly Ting

Module 1 Fundamental of Statistics - updated

LEARNING OUTCOMES
▪ Understand the concepts of statistics
▪ Differentiate between population and sample
▪ Know how to represent, visualize and display data
▪ Use Microsoft Excel or Scilab/Matlab to analyze, summarize and display data
QUESTIONS
▪ What are the differences between population and
sample?
▪ Identify the graphical representations of data.
▪ Explain the differences between a bar chart and a histogram.
▪ What is the difference between Median and
Average?
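POPULATION VS SAMPLE
▪ Population: the entire group of individuals about which we want to gather information
▪ Sample: a part of the population that we actually examine in order to obtain information
▪ Size: $N$ (population); $n$ (sample)
▪ Mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$ (population); $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample)
Design observation
▪ Variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$ (population); $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$ (sample)
▪ Or standard deviation: $\sigma = \sqrt{\sigma^2}$ (population); $s = \sqrt{s^2}$ (sample)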
IN CLASS ACTIVITY
▪ Collect BMI data of all students in this class and update it in the Google Sheet
▪ Based on the collected data, categorize the data into the following levels: underweight (<18.5), normal (18.5-22.9), pre-obese (23.0-27.4), obese I (27.5-34.9), obese II (35.0-39.9) and obese III (>=40).
▪ Plot appropriate representations of the data.
▪ Summarize the data as follows:
Mean   Std. dev   n   Median   Range   Min   Max   Q1   Q2
▪ Take a sample of size 30 and compute its statistics.
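The categorization step of this activity could be sketched in Python as follows. The cut-offs are the ones listed in the activity; the function name `bmi_level` and the BMI values in `class_bmi` are hypothetical placeholders, since the real class data lives in the Google Sheet.

```python
from collections import Counter

# Cut-offs from the activity, stored as (exclusive upper bound, label).
BMI_LEVELS = [
    (18.5, "underweight"),
    (23.0, "normal"),
    (27.5, "pre-obese"),
    (35.0, "obese I"),
    (40.0, "obese II"),
]


def bmi_level(bmi: float) -> str:
    """Return the level whose range contains the given BMI value."""
    for upper_bound, label in BMI_LEVELS:
        if bmi < upper_bound:
            return label
    return "obese III"  # anything at or above 40


# Hypothetical BMI values, used only to illustrate the counting step.
class_bmi = [17.9, 21.4, 22.9, 24.8, 28.3, 36.0, 41.2, 19.5]
level_counts = Counter(bmi_level(b) for b in class_bmi)

for label, count in level_counts.items():
    print(f"{label}: {count}")
```

The resulting counts per level are categorical data, so a bar chart or pie chart would be a natural way to plot them.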
CONCEPT OF STATISTICS
▪ Most real-world problems require statistics to draw conclusions.
▪ Steps involved in statistical analysis:
❑ Data Collection
▪ What is the size of the population (N)?
▪ How many samples?
▪ What is the size of the sample (n)?
▪ What type of data (discrete or continuous)?
❑ Data Organization/Representation
▪ How do you keep your data?
▪ How do you visualize the data?
❑ Data Analysis: methods and techniques via descriptive or inferential statistics
▪ What statistical methods/techniques do you use?
❑ Data Interpretation
▪ What can you draw from the analysis result?
WHAT IS DATA?
▪ Data is a collection of facts, such as numbers, words,
measurements, observations or even just descriptions
of things.
▪ Types:
❑ Qualitative data is descriptive information (it describes something)
❑ Quantitative data is numerical information (numbers)
DATA COLLECTION
▪ Data can be collected in many ways.
▪ The simplest way is direct observation.
❑ Example: Counting Cars
▪ You want to find how many cars pass by a certain point
on a road in a 10-minute interval.
▪ So: stand near that road, and count the cars that pass
by in 10 minutes.
▪ You might want to count many 10-minute intervals at
different times during the day, and on different days
too!
▪ Experimental data collection
▪ You can also gather data through a survey
❑ Example:
▪ You can survey people (through questionnaires, opinion polls, etc.) or things (like pollution levels in a river, or traffic flow)
HOW DO YOU REPRESENT DATA?
▪ Suppose you are to present the following data on sales for the month of February. What method would you choose?
I can use graphs or charts or plots …
• Do you know there are around 30 different choices of graphs?
• Yet, not all graphs are appropriate for presentation purposes.
• Thus, you need to choose the most suitable graph for your data.
CONT.
▪ Bar Graphs
▪ Pie Charts
▪ Dot Plots
▪ Line Graphs
▪ Scatter (x,y) Plots
▪ Pictographs
▪ Histograms
▪ Frequency Distribution and Grouped Frequency Distribution
▪ Stem and Leaf Plots
▪ Cumulative Tables and Graphs
▪ Graph Paper Maker
▪ a lot more can be added to this list
SOMETHING TO PONDER
▪ So far, we have learned several techniques to
describe data using graphs / charts.
▪ Is it effective?
❑ Graphs / Charts are effective at giving the
overall view of a situation
▪ HOWEVER
❑ Graphs / Charts cannot give precise information
for inferential purposes (note: infer == to make
conclusions)
▪ THUS – you need to add numerical representations
BASIC NUMERICAL REPRESENTATIONS
BASIC NUMERICAL REPRESENTATION (CONT.)
Draw the frequency histogram. Calculate the mean, median
and mode for the number of quarts of milk purchased by the
following 25 households:
0 0 1 1 1
1 1 2 2 2
2 2 2 2 2
2 3 3 3 3
3 4 4 4 5
Mean?
Median?
Mode?
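One way to check your answers for the milk data is a short Python sketch using only the standard library. The variable names are illustrative, and the text histogram is just a stand-in for the frequency histogram you would draw in Excel or Scilab/Matlab.

```python
import statistics
from collections import Counter

# Quarts of milk purchased by the 25 households.
quarts = [
    0, 0, 1, 1, 1,
    1, 1, 2, 2, 2,
    2, 2, 2, 2, 2,
    2, 3, 3, 3, 3,
    3, 4, 4, 4, 5,
]

# Frequency of each value: the heights of the histogram bars.
frequency = Counter(quarts)
for value in sorted(frequency):
    print(f"{value} | {'#' * frequency[value]}")

mean = statistics.mean(quarts)      # sum of the values divided by 25
median = statistics.median(quarts)  # the 13th value in sorted order
mode = statistics.mode(quarts)      # the most frequent value

print("mean =", mean, "median =", median, "mode =", mode)
```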
MEASURES OF CENTER
▪ A measure along the horizontal axis of the data
distribution that locates the center of the
distribution.
▪ What do you use as a measure of center?
(a) Mean?
(b) Median?
(c) Mode?
• Not all three are suitable for describing a distribution in ALL cases
EXTREME VALUES
▪ The mean is more easily affected by extremely large
or small values than the median.
▪ The median is often used as a measure of center
when the distribution is skewed.
SKEWNESS
▪ A measure of asymmetry in a statistical distribution
▪ A skewness of 0 indicates perfect symmetry
▪ Negative skewness indicates more values lie above the mean (long left tail)
▪ Positive skewness indicates more values lie below the mean (long right tail)
KURTOSIS
▪ A measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution
▪ The normal distribution has a kurtosis of 3; excess kurtosis is the kurtosis minus 3
▪ Positive excess kurtosis (kurtosis > 3) indicates that the distribution has heavier tails than the normal distribution
▪ Negative excess kurtosis (kurtosis < 3) indicates that the distribution has lighter tails than the normal distribution
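As a rough illustration, skewness and excess kurtosis can be computed from central moments. The sketch below uses the plain moment-based definitions; spreadsheet and statistics packages may apply small-sample corrections, so their values can differ slightly. The function names and the two small data sets are hypothetical.

```python
def central_moment(values, k):
    """Average of (x - mean)**k over all values."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)


def skewness(values):
    """Third central moment scaled by the variance to the power 3/2."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth central moment scaled by the squared variance, minus 3."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5]      # hypothetical, perfectly symmetric
right_tailed = [1, 1, 1, 2, 10]  # hypothetical, one large value

print(skewness(symmetric))         # zero for a symmetric set
print(skewness(right_tailed))      # positive: long right tail
print(excess_kurtosis(symmetric))  # negative: lighter tails than normal
```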
SKEWED RIGHT (POSITIVELY SKEWED)
▪ Skewed Right – long tail to the right
▪ A few high numbers pull the mean above the median
The set:
Num.   Frequency
1      3
2      5
3      3
4      1
[Figure: graph of the set]
Mean = [1(3) + 2(5) + 3(3) + 4(1)] / 12 = 2.17
Median = 2
Mean > Median
SKEWED LEFT (NEGATIVELY SKEWED)
▪ Skewed Left – long tail to the left
▪ A few low numbers pull the mean below the median
The set:
Num.   Frequency
1      1
2      3
3      5
4      3
[Figure: graph of the set]
Mean = [1(1) + 2(3) + 3(5) + 4(3)] / 12 = 2.83
Median = 3
Mean < Median
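The mean and median of the two frequency tables (skewed right and skewed left) can be checked with a small Python sketch. The helper `expand` is a hypothetical name; it simply repeats each number as many times as its frequency.

```python
import statistics


def expand(frequency_table):
    """Turn {number: frequency} into the full list of observations."""
    return [
        number
        for number, count in frequency_table.items()
        for _ in range(count)
    ]


skewed_right = expand({1: 3, 2: 5, 3: 3, 4: 1})
skewed_left = expand({1: 1, 2: 3, 3: 5, 4: 3})

right_mean = statistics.mean(skewed_right)
right_median = statistics.median(skewed_right)
left_mean = statistics.mean(skewed_left)
left_median = statistics.median(skewed_left)

print(round(right_mean, 2), right_median)  # mean above the median
print(round(left_mean, 2), left_median)    # mean below the median
```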
QUARTILES
▪ Quartiles are the values that divide an ordered list of numbers into quarters
▪ 25% of the measurements in the dataset lie at or below the first quartile, Q1
▪ Q2 = Median
▪ Interquartile range = Q3 – Q1
▪ Calculate all quartiles for the following numbers:
10, 2, 4, 7, 8, 5, 11, 3, 12
Note: the formula doesn't give you the value of the quartile; it gives you its place in the ordered list.
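The slide's position formula is not reproduced in the text, so the sketch below assumes one common convention: the place of quartile q in the ordered list is q(n + 1)/4, with linear interpolation when the place falls between two values. Other conventions exist and can give slightly different quartiles. The function name `quartile_by_position` is hypothetical; `statistics.quantiles` with `method="exclusive"` follows the same convention and is used as a cross-check.

```python
import statistics


def quartile_by_position(sorted_values, q):
    """Quartile q (1, 2 or 3) from the place q*(n+1)/4 in the ordered list."""
    n = len(sorted_values)
    position = q * (n + 1) / 4   # 1-based place, may be fractional
    lower = int(position)        # whole part of the place
    fraction = position - lower  # how far to move towards the next value
    if lower < 1:
        return sorted_values[0]
    if lower >= n:
        return sorted_values[-1]
    below = sorted_values[lower - 1]
    above = sorted_values[lower]
    return below + fraction * (above - below)


numbers = sorted([10, 2, 4, 7, 8, 5, 11, 3, 12])
q1, q2, q3 = (quartile_by_position(numbers, q) for q in (1, 2, 3))
iqr = q3 - q1

# Cross-check against the standard library (same (n + 1) convention).
library_quartiles = statistics.quantiles(numbers, n=4, method="exclusive")

print(q1, q2, q3, iqr)
```

Under this assumption the places are 2.5, 5 and 7.5, which is why Q1 and Q3 fall halfway between two of the ordered values.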
BOX AND WHISKER PLOT
▪ a convenient way of visually displaying the data
distribution through their quartiles
https://support.microsoft.com/en-us/office/create-a-boxplot-10204530-8cdf-40fe-a711-2eb9785e510f
MEASURES OF CENTRE VS. VARIABILITY
I was told that the average height of plants here is only 1 foot.
But this tree is 10 feet high!!! !#$&*^(&**
Often, a measure of centre does not give the true picture. We also need to know the variability around the centre.
MEASURES OF VARIABILITY
▪ A measure along the horizontal axis of the data
distribution that describes the spread of the
distribution from the center.
THE RANGE
▪ The range, R, is the difference between the largest and smallest measurements.
▪ Example: A botanist records the number of petals on
5 flowers:
5, 12, 6, 8, 14
▪ The range is
R = 14 – 5 = 9
THE VARIANCE
▪ The variance is a measure of variability that uses all the measurements (as opposed to the range R, which uses only 2 measurements, the maximum and the minimum).
▪ It measures the average squared deviation of the measurements from their mean.
▪ Flower petals:
5, 12, 6, 8, 14
Step 1: Find the mean.
Step 2: For each data point, find the square of its distance to the mean.
Step 3: Sum the values from Step 2.
Step 4: Divide by the number of data points.
[Figure: the petal counts plotted on an axis from 4 to 14]
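The four steps can be traced for the flower petal data with a few lines of Python. Dividing by the number of data points in Step 4 gives the population variance; the last line shows the sample version, which divides by one less. Variable names are illustrative.

```python
petals = [5, 12, 6, 8, 14]

# Range: largest minus smallest measurement.
petal_range = max(petals) - min(petals)

# Step 1: find the mean.
mean = sum(petals) / len(petals)

# Step 2: square of each data point's distance to the mean.
squared_distances = [(x - mean) ** 2 for x in petals]

# Step 3: sum the values from Step 2.
total = sum(squared_distances)

# Step 4: divide by the number of data points (population variance).
population_variance = total / len(petals)

# Sample variance: divide by (n - 1) instead.
sample_variance = total / (len(petals) - 1)

print(mean, squared_distances, total)
print(population_variance, sample_variance)
```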
THE VARIANCE (CONT.)
▪ The variance of a population of N measurements is the average of the squared deviations of the measurements about their mean $\mu$.
▪ The variance of a sample of n measurements is the
sum of the squared deviations of the measurements
about their mean, divided by (n – 1).
THE STANDARD DEVIATION
▪ In calculating the variance, we squared all of the
deviations, and in doing so changed the scale of the
measurements.
▪ To return this measure of variability to the original
units of measure, we calculate the standard
deviation, the positive square root of the variance.
2 WAYS TO CALCULATE THE SAMPLE VARIANCE
Use the Definition Formula:
       $x_i$   $x_i - \bar{x}$   $(x_i - \bar{x})^2$
       5       -4                16
       12      3                 9
       6       -3                9
       8       -1                1
       14      5                 25
Sum    45      0                 60
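Using the definition of the sample variance (the sum of the squared deviations about the mean, divided by n – 1) with the totals for the five petal counts, so n = 5 and the sum of squared deviations is 60, the calculation works out as:

```latex
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{60}{5 - 1} = 15
```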
CONT.
Use the Calculation Formula:
       $x_i$   $x_i^2$
       5       25
       12      144
       6       36
       8       64
       14      196
Sum    45      465
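The calculation formula itself is not reproduced in the text. The sketch below assumes the usual shortcut form, s² = [Σx² – (Σx)²/n] / (n – 1), which matches the way the variance is worked out in Exercise 2. It checks that the shortcut, the definition and Python's `statistics.variance` agree for the petal data.

```python
import statistics

petals = [5, 12, 6, 8, 14]
n = len(petals)

# Column totals for the calculation formula.
sum_x = sum(petals)                         # 45
sum_x_squared = sum(x * x for x in petals)  # 465

# Assumed shortcut form: [sum(x^2) - (sum(x))^2 / n] / (n - 1)
shortcut_variance = (sum_x_squared - sum_x ** 2 / n) / (n - 1)

# Definition form: sum of squared deviations / (n - 1)
mean = sum_x / n
definition_variance = sum((x - mean) ** 2 for x in petals) / (n - 1)

# Library cross-check (sample variance, divides by n - 1).
library_variance = statistics.variance(petals)

print(shortcut_variance, definition_variance, library_variance)
```

The shortcut avoids computing each deviation, which is convenient by hand, but both routes give the same value.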
SOME NOTES
• The value of s is never negative (it is zero only when all the measurements are equal).
• The larger the value of $s^2$ or s, the larger the variability of the data set.
• Why divide by n – 1?
• The sample standard deviation s is often used to estimate the population standard deviation $\sigma$. Dividing by n – 1 gives us a better estimate of $\sigma$.
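A small enumeration can make the n – 1 point concrete. The sketch below treats the five petal counts as if they were an entire population (an assumption made purely for illustration), lists every possible sample of size 2 drawn with replacement, and averages the two candidate variance estimates over all of those samples.

```python
from itertools import product

# Assumption for illustration: these five values ARE the whole population.
population = [5, 12, 6, 8, 14]
N = len(population)
mu = sum(population) / N
sigma_squared = sum((x - mu) ** 2 for x in population) / N


def sum_of_squares(sample):
    """Sum of squared deviations of a sample about its own mean."""
    sample_mean = sum(sample) / len(sample)
    return sum((x - sample_mean) ** 2 for x in sample)


n = 2
# Every ordered sample of size n drawn with replacement (25 samples).
samples = list(product(population, repeat=n))

avg_divide_by_n = sum(sum_of_squares(s) / n for s in samples) / len(samples)
avg_divide_by_n_minus_1 = (
    sum(sum_of_squares(s) / (n - 1) for s in samples) / len(samples)
)

print("population variance:", sigma_squared)
print("average estimate, dividing by n:", avg_divide_by_n)
print("average estimate, dividing by n - 1:", avg_divide_by_n_minus_1)
```

In this setup, dividing by n underestimates the population variance on average, while dividing by n – 1 matches it on average.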
EXERCISE 1
1. Question: Find the mean, median and mode of:
5, 7, 3, 5, 6, 8, 5, 6, 4, 6, 25
Solution: First, arrange the data in order:
3, 4, 5, 5, 5, 6, 6, 6, 7, 8, 25
median = 6; mean = 80/11 = 7.27; modes = 5 and 6
2. Question: Eliminate the last observation x = 25 and then find the mean, median and mode. How do these values compare with those found using the full data set?
Solution: median = 5.5; mean = 55/10 = 5.5; modes = 5 and 6. The mean is noticeably smaller (7.27 to 5.5), the median drops only slightly (6 to 5.5), and the modes are unchanged.
3. Question: How do possible outliers (such as 25) affect these values?
Solution: The mean is strongly affected by the outlier, while the median and mode are affected little or not at all.
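The solutions to Exercise 1 can be checked with Python's `statistics` module. The helper `summarize` is a hypothetical name; `statistics.multimode` returns every value tied for most frequent, which suits data with two modes.

```python
import statistics

full = [5, 7, 3, 5, 6, 8, 5, 6, 4, 6, 25]
without_outlier = [x for x in full if x != 25]


def summarize(values):
    """Mean, median and all modes of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "modes": statistics.multimode(values),
    }


full_summary = summarize(full)
trimmed_summary = summarize(without_outlier)

print(full_summary)
print(trimmed_summary)
```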
EXERCISE 2
Given the observations
7, 9, 10, 6, 8, 7, 8, 9, 8
calculate:
1. the range
Solution : R = 10 – 6 = 4
2. the mean
Solution : Mean = 72 / 9 = 8
3. the variance
Solution : Variance = [588 – (72²/9)] / 8 = 12 / 8 = 1.5
4. the standard deviation
Solution : Standard Deviation = √1.5 = 1.225
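As a quick check of the Exercise 2 solutions, the same quantities can be computed in Python. The variance here is the sample variance (dividing by n – 1 = 8), as in the solution above; variable names are illustrative.

```python
import statistics

observations = [7, 9, 10, 6, 8, 7, 8, 9, 8]

data_range = max(observations) - min(observations)
mean = statistics.mean(observations)
sum_of_squares = sum(x * x for x in observations)  # the 588 in the solution
variance = statistics.variance(observations)       # sample variance
std_dev = statistics.stdev(observations)           # its square root

print(data_range, mean, sum_of_squares, variance, round(std_dev, 3))
```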
Activity
▪ Kindly provide your full name and student ID, and upload your 30-second introductory video. The video should include an introduction about yourself, where you live, what your hobbies are, etc., and what you expect from this course.
▪ https://www.csusm.edu/qc/facultydocuments/biofolder/bio353.pdf