Fundamentals of Statistical Analysis

DR. SUREJ P JOHN
Definition of Variables
A variable is an attribute of a person or an object that varies.
Measurement refers to the rules for assigning numbers to objects to represent quantities of attributes.
Definitions
Datum is one observation about the variable being measured.
Data are a collection of observations.
A population consists of all subjects about whom the study is being
conducted.
A sample is a sub-group of the population being examined.
What Is Statistics?
Statistics is the science of describing or making inferences about the
world from a sample of data.
Descriptive statistics are numerical estimates that organize, summarize, and present the data.
Inferential statistics is the process of inferring from a sample to the
population.
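To make the distinction concrete, here is a minimal Python sketch (not part of the original notes; the height values are made up): the sample mean and SD describe the sample itself, while the confidence interval is an inference about the wider population.

```python
# A minimal sketch contrasting descriptive and inferential statistics
# on a small, hypothetical sample of heights (cm).
import numpy as np
from scipy import stats

sample = np.array([162.0, 158.5, 171.2, 165.4, 168.9, 173.1, 160.7, 166.3])

# Descriptive statistics: organize and summarize the sample itself.
print("sample mean:", sample.mean())
print("sample SD:  ", sample.std(ddof=1))

# Inferential statistics: infer something about the population,
# e.g. a 95% confidence interval for the population mean.
ci = stats.t.interval(0.95, df=len(sample) - 1,
                      loc=sample.mean(), scale=stats.sem(sample))
print("95% CI for population mean:", ci)
```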
Five Types of Statistical Analysis
1. Descriptive analysis – data distribution
2. Inferential analysis – hypothesis testing
3. Differences analysis – hypothesis testing
4. Association analysis – correlation
5. Predictive analysis – regression
Descriptive vs. Inferential Statistics
A Hypothesis:
A statement relating to an observation that may be true but for which a
proof (or disproof) has not been found
The results of a well-designed experiment or data collection may lead to
the proof or disproof of a hypothesis
Inferential Statistics
[Diagram: samples and sub-samples drawn from a population, with inferences running from the sample back to the population]
For example, heights of males vs. females at age 25.
Our observation: male height > female height; this may be linked to genetics, diet, exercise, etc.
Is it true that male H > female H?
i.e. Null hypothesis: male H ≤ female H
Scenario I: Randomly select 1 person from each sex.
Male: 170 cm
Female: 175 cm
Then, is female H > male H?
Scenario II: Randomly select 3 persons from each sex.
Male: 171, 163, 168 cm
Female: 160, 172, 173 cm
What is your conclusion then? Which scenario is better?
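As a rough illustration only, the sketch below (Python, assuming the scenario values are in cm) runs a two-sample t-test, a method introduced later in these notes, on the Scenario II data; with only three people per group the test has almost no power, which is the point of message (1) below.

```python
# A minimal sketch comparing the two scenarios with the heights quoted above (cm).
from scipy import stats

# Scenario I: one person per sex - no spread within groups, so no test is possible.
male_1, female_1 = [170], [175]

# Scenario II: three persons per sex.
male_3 = [171, 163, 168]
female_3 = [160, 172, 173]

t, p = stats.ttest_ind(male_3, female_3, equal_var=False)  # Welch's t-test
print("Scenario II: t = %.2f, p = %.3f" % (t, p))
# A large p-value means these data give no evidence of a difference between
# the groups - three people per group is simply too few.
```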
Important messages here:
(1) Sample size is very important and will affect your conclusion.
(2) Measurement results vary among samples (or subjects) – that is "variation" or "uncertainty".
(3) Variation can be due to measurement errors (random or systematic errors) and inherent within-sample variation. For example, at age 20, female height varies from 158 to 189 cm. Why?
(4) Therefore, in statistics, we always deal with distributions of data rather than a single point of measurement or event.
[Figure: probability density of height (cm), a bell-shaped distribution spanning roughly 140 to 190 cm]
Moments of a Normal Distribution
Each moment measures a different dimension of the distribution.
1. Mean (1st moment)
2. Standard deviation (from the 2nd central moment, the variance)
3. Skewness (3rd moment)
4. Kurtosis (4th moment)
Mean
[Figure: measurement scale from 0 to 4.0 mm with the mean larval length marked]
The mean (µ) is the sum of the observations divided by the number of observations (the sample size):
Mean = (sum of values) / n = ΣXᵢ / n
e.g. length of 8 fish larvae at day 3 after hatching:
0.6, 0.7, 1.2, 1.5, 1.7, 2.0, 2.2, 2.5 mm
mean length = (0.6+0.7+1.2+1.5+1.7+2.0+2.2+2.5)/8
= 1.55 mm
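A quick way to verify this arithmetic is Python's built-in statistics module:

```python
# Check of the worked example above.
import statistics

lengths = [0.6, 0.7, 1.2, 1.5, 1.7, 2.0, 2.2, 2.5]  # mm, day-3 fish larvae
print(statistics.mean(lengths))  # 1.55
```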
Standard deviation
The standard deviation (SD) (represented by the Greek letter sigma, σ)
shows how much variation or dispersion from the average exists.
A low standard deviation indicates that the data points tend to be very close
to the mean (also called expected value); a high standard deviation
indicates that the data points are spread out over a large range of values.
The formula is simple: the SD is the square root of the variance, and the variance is the average of the squared differences from the mean. (For a sample, the sum of squared differences is usually divided by n − 1 rather than n.)
Exercise: calculate the SD of the fish-larvae lengths above.
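A minimal sketch of that calculation, using the same eight larval lengths; note that the sample formula (dividing by n − 1) and the population formula (dividing by n) give slightly different answers:

```python
# SD of the larval lengths (mm) from the example above.
import statistics

lengths = [0.6, 0.7, 1.2, 1.5, 1.7, 2.0, 2.2, 2.5]

print("sample SD (n - 1):    ", round(statistics.stdev(lengths), 3))
print("population SD (n):    ", round(statistics.pstdev(lengths), 3))
```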
Skewness
In probability theory and statistics, skewness is a measure of the asymmetry of the probability
distribution of a real-valued random variable about its mean. The skewness value can be positive
or negative, or even undefined.
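As an illustration with simulated (made-up) data, scipy.stats.skew reports a value near zero for a symmetric sample and a positive value for a right-skewed one:

```python
# Skewness of a symmetric vs. a right-skewed sample (hypothetical data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
symmetric = rng.normal(size=10_000)          # roughly symmetric -> skew near 0
right_skewed = rng.exponential(size=10_000)  # long right tail   -> positive skew

print("normal sample skewness:     ", round(stats.skew(symmetric), 2))
print("exponential sample skewness:", round(stats.skew(right_skewed), 2))
```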
Kurtosis
The coefficient of kurtosis measures the degree of peakedness or flatness of a variable's distribution:
Kurtosis < 0 (flatter than a normal distribution), Kurtosis = 0 (normal), Kurtosis > 0 (more peaked than normal).
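A small illustration with simulated data; note that scipy.stats.kurtosis returns excess kurtosis by default, so a normal distribution scores near 0, matching the convention used above:

```python
# Excess kurtosis of three hypothetical samples.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal = rng.normal(size=10_000)    # kurtosis ~ 0
uniform = rng.uniform(size=10_000)  # flatter than normal        -> kurtosis < 0
laplace = rng.laplace(size=10_000)  # more peaked, heavier tails -> kurtosis > 0

for name, x in [("normal", normal), ("uniform", uniform), ("laplace", laplace)]:
    print(name, round(stats.kurtosis(x), 2))
```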
Frequency Distribution
In statistics, a frequency distribution is an arrangement of the values that one or more variables
take in a sample. Each entry in the table contains the frequency or count of the occurrences of
values within a particular group or interval, and in this way, the table summarizes
the distribution of values in the sample.
Frequency distribution tables can be used for both categorical and numeric variables.
Table 1. Frequency table for the number of cars registered in each household

Number of cars (x)    Frequency (f)
0                     4
1                     6
2                     5
3                     3
4                     2
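A minimal sketch of building such a table in code; the raw list of car counts below is reconstructed to match Table 1 and is otherwise hypothetical:

```python
# Build a frequency table of household car counts.
from collections import Counter

cars_per_household = [0]*4 + [1]*6 + [2]*5 + [3]*3 + [4]*2  # matches Table 1

freq = Counter(cars_per_household)
print("x  f")
for x in sorted(freq):
    print(x, " ", freq[x])
```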
Cross Tabulation
A cross-tabulation (or cross-tab for short) is a display of data that
shows how many cases in each category of one variable are divided
among the categories of one or more additional variables.
In a cross-tab, a cell is a combination of two or more characteristics,
one from each variable.
If one variable has two categories and the second variable has four categories, for instance, the cross-tab will have 8 cells, each with a count specific to that combination of categories.
Sample #   Gender   Handedness
1          Female   Right-handed
2          Male     Left-handed
3          Female   Right-handed
4          Male     Right-handed
5          Male     Left-handed
6          Male     Right-handed
7          Female   Right-handed
8          Female   Left-handed
9          Male     Right-handed
10         Female   Right-handed
           Left-handed   Right-handed   Total
Males      2             3              5
Females    1             4              5
Total      3             7              10
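The same cross-tab can be produced with pandas.crosstab from the ten raw samples listed above; this is just one possible sketch, not part of the original notes:

```python
# Reproduce the gender-by-handedness cross-tab with pandas.
import pandas as pd

df = pd.DataFrame({
    "Gender":     ["Female", "Male", "Female", "Male", "Male",
                   "Male", "Female", "Female", "Male", "Female"],
    "Handedness": ["Right-handed", "Left-handed", "Right-handed", "Right-handed",
                   "Left-handed", "Right-handed", "Right-handed", "Left-handed",
                   "Right-handed", "Right-handed"],
})

# margins=True adds the row and column totals shown above.
print(pd.crosstab(df["Gender"], df["Handedness"], margins=True))
```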
Comparing Means
In inferential statistics, we often need to compare the means of groups.
T-tests and ANOVA (Analysis of Variance) are the methods commonly used for comparing means.
Independent T-tests
Independent t-tests are used for testing the difference between the means of two independent groups. For an independent t-test, there should be only one independent variable, but it can have two levels, and there should be only one dependent variable.
Ex: gender (male and female)
How do male and female students differ in academic performance?
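A minimal sketch of such a test with scipy; the exam scores are made up for illustration, with gender as the independent variable and academic performance as the dependent variable:

```python
# Independent two-sample t-test on hypothetical exam scores.
from scipy import stats

male_scores = [62, 71, 58, 65, 70, 68]
female_scores = [74, 69, 77, 72, 66, 79]

t, p = stats.ttest_ind(male_scores, female_scores)
print("t = %.2f, p = %.3f" % (t, p))
# If p < 0.05, the difference in mean scores is conventionally
# called statistically significant.
```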
ANOVA (Analysis of Variance)
ANOVA is used as an extension of the independent t-test.
It is used when the researcher is interested in whether the means from several (>2) independent groups differ.
For ANOVA, only one dependent variable should be present, and there should be only ONE independent variable (but it can have many levels, unlike in independent t-tests).
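A minimal sketch of a one-way ANOVA with scipy.stats.f_oneway; the independent variable here ("study method", three levels) and the scores are hypothetical:

```python
# One-way ANOVA across three independent groups (hypothetical exam scores).
from scipy import stats

lectures   = [65, 70, 62, 68, 72]
self_study = [71, 75, 69, 78, 73]
group_work = [66, 64, 70, 67, 63]

F, p = stats.f_oneway(lectures, self_study, group_work)
print("F = %.2f, p = %.3f" % (F, p))
# A small p-value suggests at least one group mean differs from the others.
```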
Statistical Errors in Hypothesis Testing
Consider court judgments, where the accused is presumed innocent until proved guilty beyond reasonable doubt (i.e. Ho = innocent).
                             If the accused is innocent   If the accused is guilty
                             (Ho is true)                 (Ho is false)
Court's decision: Guilty     Wrong judgement              OK
Court's decision: Innocent   OK                           Wrong judgement
Similar to court judgments, in testing a null hypothesis in statistics, we can make the same kinds of errors:
                    If Ho is true   If Ho is false
If Ho is rejected   Type I error    No error
If Ho is accepted   No error        Type II error
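As an optional illustration (not from the original notes), the simulation below repeatedly tests a true null hypothesis; rejecting at the 5% level produces a Type I error in roughly 5% of the repeated experiments:

```python
# Simulate the Type I error rate when Ho is true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
rejections = 0
n_experiments = 2000

for _ in range(n_experiments):
    a = rng.normal(loc=170, scale=7, size=30)  # same population...
    b = rng.normal(loc=170, scale=7, size=30)  # ...so Ho is true
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        rejections += 1  # wrongly rejecting a true Ho: a Type I error

print("observed Type I error rate:", rejections / n_experiments)  # ~0.05
```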