Class 6 Note: STATISTICAL DATA ANALYSIS

Prepared by: Dr. Surej P John
Objective: Understanding of the basic statistical analysis methods
Variables:
- A variable is an attribute of a person or an object that varies.
- Measurements are rules for assigning numbers to objects to represent quantities of attributes.
- A datum is one observation about the variable being measured.
- Data are a collection of observations.
- A population consists of all subjects about whom the study is being conducted.
- A sample is a sub-group of the population being examined.
Statistics:
Statistics is the science of describing or making inferences about the world from a sample of
data. There are five types of statistical analysis. They are
1. Descriptive analysis – data distribution
2. Inferential analysis – hypothesis testing
3. Differences analysis – hypothesis testing
4. Association analysis – correlation
5. Predictive analysis – regression
Descriptive vs. Inferential statistics
- Descriptive statistics are numerical estimates that organize, summarize, or present the data.
- Inferential statistics is the process of inferring from a sample to the population.
For example, consider the heights of males vs. females at age 25.
Our observation: male H > female H; it may be linked to genetics, diet, exercise, etc.
- Is it true that male H > female H?
i.e., Null hypothesis: male H ≤ female H
 Scenario I: Randomly select 1 person from each sex.
Male: 170
Female: 175
Then, is female H > male H?
 Scenario II: Randomly select 3 persons from each sex.
Male: 171, 163, 168
Female: 160, 172, 173
What is your conclusion then? Which scenario is better?
Please note down the following in inferential statistical analysis:
(1) Sample size is very important and will affect your conclusion
(2) Measurement results vary among samples (or subjects) – that is “variation” or
“uncertainty”.
(3) Variation can be due to measurement errors (random or systematic errors) and to inherent within-sample variation. For example, at age 20, female height varies from 158 to 189 cm. Why?
(4) Therefore, in Statistics, we always deal with distributions of data rather than a single
point of measurement or event.
Normal Distribution of data:
The normal (or Gaussian) distribution is a very commonly occurring continuous probability
distribution—a function that tells the probability that an observation in some context will fall
between any two real numbers. For example, the distribution of grades on a test administered
to many people is normally distributed. Normal distributions are extremely important
in statistics and are often used in the natural and social sciences for real-valued random
variables whose distributions are not known.
The normal distribution is immensely useful because of the central limit theorem, which states
that, under mild conditions, the mean of many random variables independently drawn from the
same distribution is distributed approximately normally, irrespective of the form of the original
distribution: physical quantities that are expected to be the sum of many independent processes
(such as measurement errors) often have a distribution very close to the normal.
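The central limit theorem described above can be sketched with a small simulation. The code below (a sketch, using made-up parameters: sums of 30 uniform draws, repeated 10,000 times) shows that sums of a decidedly non-normal distribution still cluster around their expected value in a roughly normal pattern, with about 68% of the sums falling within one standard deviation of the mean, as a normal distribution predicts.

```python
import random
import statistics

# Sketch of the central limit theorem: sums of many independent
# Uniform(0, 1) draws (a non-normal distribution) are approximately
# normally distributed.
random.seed(42)

def sum_of_uniforms(n):
    """Sum of n independent Uniform(0, 1) draws; expected value n/2."""
    return sum(random.random() for _ in range(n))

sums = [sum_of_uniforms(30) for _ in range(10_000)]

mean = statistics.mean(sums)   # theory: 30 * 0.5 = 15
sd = statistics.stdev(sums)    # theory: sqrt(30 / 12) ≈ 1.58

# Roughly 68% of a normal distribution lies within one SD of the mean.
within_1sd = sum(mean - sd < s < mean + sd for s in sums) / len(sums)
print(mean, sd, within_1sd)
```

Increasing the number of draws per sum makes the approximation tighter, which is why measurement errors summed from many small independent sources tend to look normal.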
Moments of a Normal Distribution
Each moment measures a different dimension of the distribution.
1. Mean (1st moment)
2. Variance, usually reported as the standard deviation (2nd central moment)
3. Skewness (3rd moment)
4. Kurtosis (4th moment)
Mean:
The mean (µ) is equal to the sum of the n observations divided by the number of observations (sample size):
Mean = ΣXi / n
Standard Deviation
In statistics and probability theory, the standard deviation (SD) (represented by the Greek
letter sigma, σ) shows how much variation or dispersion from the average exists. A low
standard deviation indicates that the data points tend to be very close to the mean (also called
expected value); a high standard deviation indicates that the data points are spread out over a
large range of values.
The standard deviation is a measure of how spread out numbers are. Its symbol is σ (the Greek letter sigma). The formula is simple: it is the square root of the variance, where the variance is defined as the average of the squared differences from the mean.
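The mean and standard deviation formulas above can be written out step by step. The sample values below are made up for illustration; the variance here divides by n, matching the "average of the squared differences" definition given above.

```python
import math

# Mean, variance, and standard deviation computed from first principles
# for a small made-up sample.
values = [4, 8, 6, 5, 3, 7]
n = len(values)

mean = sum(values) / n                               # ΣXi / n
variance = sum((x - mean) ** 2 for x in values) / n  # average squared difference from the mean
sd = math.sqrt(variance)                             # square root of the variance

print(mean, variance, sd)
```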
Frequency distribution
In statistics, a frequency distribution is an arrangement of the values that one or more
variables take in a sample. Each entry in the table contains the frequency or count of the
occurrences of values within a particular group or interval, and in this way, the table
summarizes the distribution of values in the sample.
The frequency (f) of a particular observation is the number of times the observation occurs in
the data. The distribution of a variable is the pattern of frequencies of the observation.
Frequency distributions are portrayed as frequency tables, histograms, or polygons.
Frequency distributions can show either the actual number of observations falling in each range
or the percentage of observations. In the latter instance, the distribution is called a relative
frequency distribution.
Frequency distribution tables can be used for both categorical and numeric variables. Continuous variables, however, should first be grouped into class intervals.
Table 1. Frequency table for the number of cars
registered in each household
Number of cars (x)   Frequency (f)
0                    4
1                    6
2                    5
3                    3
4                    2
By looking at this frequency distribution table quickly, we can see that out of 20 households
surveyed, 4 households had no cars, 6 households had 1 car, etc.
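A frequency table like Table 1 can be built directly from raw observations. The 20 survey responses below are reconstructed to match the frequencies in the table (they are not the original data), and `collections.Counter` does the tallying.

```python
from collections import Counter

# Raw responses reconstructed to match Table 1:
# 4 households with 0 cars, 6 with 1, 5 with 2, 3 with 3, 2 with 4.
cars = [0] * 4 + [1] * 6 + [2] * 5 + [3] * 3 + [4] * 2

freq = Counter(cars)          # maps each value x to its frequency f
total = sum(freq.values())    # 20 households surveyed

print("x  f  relative f")
for x in sorted(freq):
    print(f"{x}  {freq[x]}  {freq[x] / total:.2f}")
```

Dividing each count by the total, as in the last column, gives the relative frequency distribution mentioned above.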
Cross Tabulation
A cross-tabulation (or cross-tab for short) is a display of data that shows how many cases in
each category of one variable are divided among the categories of one or more additional
variables. In a cross-tab, a cell is a combination of two or more characteristics, one from each
variable. If one variable has two categories and the second variable has four categories, for instance, the cross-tab will have 8 cells (2 × 4), each with a number specific to that category combination.
Cross tabulation is a statistical process that summarizes categorical data to create a contingency table. Cross-tabs are heavily used in survey research, business intelligence, engineering, and scientific research. They provide a basic picture of the interrelation between two variables and can help find interactions between them.
Look at the following example:
Sample #   Gender   Handedness
1          Female   Right-handed
2          Male     Left-handed
3          Female   Right-handed
4          Male     Right-handed
5          Male     Left-handed
6          Male     Right-handed
7          Female   Right-handed
8          Female   Left-handed
9          Male     Right-handed
10         Female   Right-handed
Cross-tabulation leads to the following contingency table:
           Left-handed   Right-handed   Total
Males      2             3              5
Females    1             4              5
Total      3             7              10
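The contingency table can be reproduced from the ten (gender, handedness) observations in the example, again using `collections.Counter` to count each cell.

```python
from collections import Counter

# The ten observations from the example, as (gender, handedness) pairs.
observations = [
    ("Female", "Right"), ("Male", "Left"),   ("Female", "Right"),
    ("Male", "Right"),   ("Male", "Left"),   ("Male", "Right"),
    ("Female", "Right"), ("Female", "Left"), ("Male", "Right"),
    ("Female", "Right"),
]

counts = Counter(observations)   # cell counts, keyed by (gender, handedness)
genders = ["Male", "Female"]
hands = ["Left", "Right"]

print("         Left  Right  Total")
for g in genders:
    row = [counts[(g, h)] for h in hands]
    print(f"{g:9}{row[0]:5}{row[1]:7}{sum(row):7}")
col_totals = [sum(counts[(g, h)] for g in genders) for h in hands]
print(f"{'Total':9}{col_totals[0]:5}{col_totals[1]:7}{sum(col_totals):7}")
```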
Comparing Means:
T-tests and ANOVA (Analysis of Variance) are the methods commonly used for comparing
means.
Independent t-tests
Independent t-tests are used for testing the difference between the means of two independent groups. For an independent t-test, there should be only one independent variable, and it must have exactly two levels. There should be only one dependent variable.
Example: gender (male and female)
How do male and female students differ in academic performance?
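The t statistic behind this test can be computed by hand. The sketch below uses made-up exam scores for the two groups and the pooled (equal-variance) form of the test; in practice the p-value would come from the t distribution with n1 + n2 − 2 degrees of freedom, via a statistics package.

```python
import math
import statistics

# Made-up exam scores for two independent groups.
male = [62.0, 71.0, 68.0, 75.0, 66.0]
female = [70.0, 74.0, 69.0, 78.0, 73.0]

n1, n2 = len(male), len(female)
m1, m2 = statistics.mean(male), statistics.mean(female)
v1, v2 = statistics.variance(male), statistics.variance(female)  # sample variances (n - 1)

# Pooled variance, then the t statistic for the difference of means.
pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t = (m1 - m2) / math.sqrt(pooled * (1 / n1 + 1 / n2))
df = n1 + n2 - 2
print(t, df)
```

A large |t| relative to the t distribution with df degrees of freedom is evidence that the two group means differ.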
ANOVA
ANOVA (Analysis of Variance) is used as an extension of independent t-tests. It is used when the researcher is interested in whether the means from several (>2) independent groups differ. For ANOVA, only one dependent variable should be present. There should be only ONE independent variable present (but it can have many levels, unlike in independent t-tests).
For example,
Is there any difference among various nationalities in customer satisfaction with flying THAI Airways?
Here Customer satisfaction is the dependent variable and Nationality is the independent
variable with various levels.
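The one-way ANOVA F statistic for an example like this can be computed from first principles: variation between the group means is compared with variation within the groups. The satisfaction scores below are made up (three hypothetical nationality groups rating satisfaction on a 1-10 scale).

```python
import statistics

# Made-up satisfaction ratings for three nationality groups (the levels
# of the single independent variable).
groups = {
    "Thai":     [8.0, 7.0, 9.0, 8.0],
    "Japanese": [6.0, 7.0, 5.0, 6.0],
    "German":   [7.0, 8.0, 6.0, 7.0],
}

all_values = [x for g in groups.values() for x in g]
grand_mean = statistics.mean(all_values)
k = len(groups)        # number of groups (levels)
n = len(all_values)    # total observations

# Between-group and within-group sums of squares.
ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                 for g in groups.values())
ss_within = sum((x - statistics.mean(g)) ** 2
                for g in groups.values() for x in g)

# F = mean square between / mean square within.
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
print(f_stat)
```

A large F indicates that the group means differ by more than within-group variation alone would explain; the p-value comes from the F distribution with (k − 1, n − k) degrees of freedom.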
Statistical Errors in hypothesis testing:
There are two major errors in hypothesis testing, known as Type I error and Type II error respectively.
Rejecting the null hypothesis when it is in fact true is called a Type I error.
Not rejecting the null hypothesis when in fact the alternate hypothesis is true is called a Type
II error.
Example: Let us consider the example of a defendant in a trial.
The null hypothesis is "defendant is not guilty;" the alternate is "defendant is guilty."
A Type I error would correspond to convicting an innocent person;
A Type II error would correspond to setting a guilty person free.
Therefore, we can summarize the two forms of errors in hypothesis testing as in the table given below:

                    If Ho is true    If Ho is false
If Ho is rejected   Type I error     No error
If Ho is accepted   No error         Type II error