Fundamentals of Data Analysis
Basics of statistics
Program for today
Basic terms and definitions
Discrete distributions
Continuous distributions
Normal distribution
Topics for discussion
What is statistics?
Definition of statistics:
1. A collection of quantitative data pertaining to a subject or group, for example blood pressure statistics.
2. The science that deals with the collection, tabulation, analysis, interpretation, and presentation of quantitative data.
What is statistics?
Two phases of statistics:
Descriptive statistics:
o Describes the characteristics of a product or process using information collected on it.
Inferential (inductive) statistics:
o Draws conclusions on unknown process parameters based on information contained in a sample.
o Uses probability.
Probability
When we cannot rely on the assumption that all sample points are equally likely, we have to determine the probability of an event experimentally. We perform a large number N of experiments and count how often each sample point is obtained. The ratio of the number of occurrences of a certain sample point to the total number of experiments is called the relative frequency.
Probability
The probability is then assigned the relative frequency of the occurrence of a sample point in this long series of repetitions of the experiment.
This is based on the "law of large numbers", which says that the relative frequency approaches the true (theoretical) probability of the outcome as the experiment is repeated over and over again. How important is drawing conclusions based on statistical analysis?
Probability
$$P(E) = \lim_{N \to \infty} \frac{n(E)}{N}$$
where n(E) is the number of times the event E took place out of a total of N experiments. From this definition we can see that the probability is a number between 0 and 1. When the probability is 1, we know that a particular outcome is certain.
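The relative-frequency definition can be tried out with a short simulation. The sketch below is only an illustration: the fair six-sided die, the event "a six is rolled", the seed, and the helper name `relative_frequency` are assumptions chosen for the example, not part of the lecture.

```python
import random

def relative_frequency(event, trials, rng):
    """Estimate P(E) as n(E) / N by rolling a simulated die N times."""
    hits = sum(1 for _ in range(trials) if event(rng.randint(1, 6)))
    return hits / trials

def is_six(roll):
    """The event E: the die shows a six."""
    return roll == 6

rng = random.Random(42)  # fixed seed so the run is repeatable

# The estimate should settle near 1/6 as N grows.
for n in (100, 10_000, 100_000):
    print(n, relative_frequency(is_six, n, rng))
```

For small N the estimate can be far from 1/6; only long series of repetitions bring it close to the theoretical value.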
Probability
For a discrete random variable, the definition of probability is intuitive:
$$P(x) = \frac{n(x)}{N}$$
where n(x) is the number of occurrences of the desired value of the random variable x (successes) in N samples ($N \to \infty$).
Probability
For a continuous random variable, this definition requires the identification of a small range of variation Δx ($\Delta x \to 0$), for which the probability is determined:
$$P(x_0 \le x < x_0 + \Delta x) = \frac{n(x_0 \le x < x_0 + \Delta x)}{N}$$
For a continuous random variable it is preferable to use the probability density function:
$$f(x_0) = \lim_{\Delta x \to 0} \frac{P(x_0 \le x < x_0 + \Delta x)}{\Delta x}$$
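The density can be approximated from data by counting the samples that fall into a small interval of width Δx and dividing by N·Δx. The sketch below assumes samples drawn uniformly from [0, 1), where the true density is 1 everywhere; the function name `density_estimate`, the seed, and the chosen x0 and Δx are illustrative.

```python
import random

def density_estimate(samples, x0, dx):
    """Approximate f(x0) as n(x0 <= x < x0 + dx) / (N * dx)."""
    n = sum(1 for x in samples if x0 <= x < x0 + dx)
    return n / (len(samples) * dx)

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [rng.random() for _ in range(100_000)]  # uniform on [0, 1)

# The true density of this uniform distribution is 1.0 at every point.
estimate = density_estimate(samples, x0=0.3, dx=0.05)
print(estimate)
```

A smaller Δx brings the estimate closer to the definition but leaves fewer samples in the interval, so the estimate becomes noisier unless N grows as well.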
Histogram
The histogram is the most important graphical tool for exploring the shape of data distributions and a good way to visualize trends in population data. The more often a particular value occurs, the taller the corresponding bar on the histogram.
Histogram
Constructing a histogram
Step 1: Find the range of the distribution (largest value minus smallest value)
Step 2: Choose the number of classes, 5 to 20
Step 3: Determine the width of the classes, using one decimal place more than the data; class width = range / number of classes
Step 4: Determine the class boundaries
Step 5: Draw the frequency histogram
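The first four steps can be sketched in a few lines of code. This is a simplified illustration: the function name `build_histogram` and the sample data are made up, the class width is not rounded to an extra decimal place, and the data must contain at least two distinct values.

```python
def build_histogram(data, num_classes):
    """Return (boundaries, counts) for equal-width classes."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_classes          # class width = range / classes
    counts = [0] * num_classes
    for x in data:
        # The largest value would land one class too far, so clamp it.
        k = min(int((x - lo) / width), num_classes - 1)
        counts[k] += 1
    boundaries = [lo + i * width for i in range(num_classes + 1)]
    return boundaries, counts

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
boundaries, counts = build_histogram(data, num_classes=4)

# Step 5 as a crude text histogram: one '#' per observation.
for i, c in enumerate(counts):
    print(f"{boundaries[i]:4.1f} to {boundaries[i + 1]:4.1f} | {'#' * c}")
```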
Histogram
Number of groups or cells
Fewer than 100 observations – 5 to 9 cells
100 to 500 observations – 8 to 17 cells
More than 500 observations – 15 to 20 cells
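The rule of thumb above can be written as a small lookup. The function name `suggested_cell_range` is hypothetical, and treating exactly 100 and exactly 500 observations as part of the middle group is an assumption about how the boundaries are meant.

```python
def suggested_cell_range(num_observations):
    """Return the (min, max) number of cells suggested by the rule of thumb."""
    if num_observations < 100:
        return (5, 9)
    if num_observations <= 500:
        return (8, 17)
    return (15, 20)

print(suggested_cell_range(320))
```

For a data set of 320 observations the rule suggests 8 to 17 cells.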
Analysis of histogram
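Calculating the average for ungrouped data:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$
and for grouped data:
$$\bar{X} = \frac{\sum_{i=1}^{h} f_i X_i}{n} = \frac{f_1 X_1 + f_2 X_2 + \dots + f_h X_h}{f_1 + f_2 + \dots + f_h}$$
where $f_i$ is the frequency of cell i, $X_i$ is its midpoint, h is the number of cells, and $n = f_1 + f_2 + \dots + f_h$.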
Analysis of histogram
Boundaries    Midpoint (X_i)   Frequency (f_i)   Computation (f_i X_i)
23.6-26.5         25.0               4                  100
26.6-29.5         28.0              36                 1008
29.6-32.5         31.0              51                 1581
32.6-35.5         34.0              63                 2142
35.6-38.5         37.0              58                 2146
38.6-41.5         40.0              52                 2080
41.6-44.5         43.0              34                 1462
44.6-47.5         46.0              16                  736
47.6-50.5         49.0               6                  294
Total                              320                11549
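The grouped-data average can be checked directly from the table. In the sketch below the midpoints are the table values, and each frequency is paired with the midpoint for which frequency × midpoint reproduces the Computation column; the variable names are illustrative.

```python
midpoints = [25.0, 28.0, 31.0, 34.0, 37.0, 40.0, 43.0, 46.0, 49.0]
frequencies = [4, 36, 51, 63, 58, 52, 34, 16, 6]

n = sum(frequencies)                                       # total observations
total = sum(f * x for f, x in zip(frequencies, midpoints)) # sum of f_i * X_i
grouped_mean = total / n

print(n, total, round(grouped_mean, 2))
```

The totals agree with the last row of the table (320 and 11549), which gives a grouped mean of about 36.09.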
Measures of dispersion
Range
Standard deviation
Variance
Measures of dispersion
The range is the simplest and easiest to calculate of the measures of dispersion.
$$R = X_{\max} - X_{\min}$$
Measures of dispersion
Standard deviation of the sample:
$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
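A short sketch of the range and the sample standard deviation (with n - 1 in the denominator). The data values and the function names are made up for illustration; the result is compared with `statistics.stdev` from the Python standard library, which uses the same n - 1 convention.

```python
import math
import statistics

def sample_range(data):
    """R = largest value minus smallest value."""
    return max(data) - min(data)

def sample_std(data):
    """Sample standard deviation with n - 1 in the denominator."""
    n = len(data)
    mean = sum(data) / n
    return math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_range(data))
print(sample_std(data), statistics.stdev(data))
```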
Measures of dispersion
For a discrete random variable, the variance is defined as:
$$V(x) = \sum_{i} [x_i - E(x)]^2 \, P(x_i)$$
and for a continuous random variable as:
$$V(x) = \int_a^b [x - E(x)]^2 f(x) \, dx$$
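As a worked example of the discrete formula, the sketch below computes the variance of a fair six-sided die. The die is an assumed example, not from the lecture; exact fractions are used so that no rounding hides the result.

```python
from fractions import Fraction

def expected_value(values, probs):
    """E(x) = sum of x_i * P(x_i)."""
    return sum(x * p for x, p in zip(values, probs))

def variance(values, probs):
    """V(x) = sum of (x_i - E(x))^2 * P(x_i)."""
    mu = expected_value(values, probs)
    return sum((x - mu) ** 2 * p for x, p in zip(values, probs))

values = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6      # fair die: every face equally likely

print(expected_value(values, probs))  # 7/2
print(variance(values, probs))        # 35/12
```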
Parameters of a distribution
A parameter is a characteristic of a population; in other words, it describes a population.
A statistic is a characteristic of a sample, used to make inferences about the population parameters, which are typically unknown. A statistic used this way is called an estimator.
Parameters of a distribution
Expected value (EV) of a discrete random variable:
$$E(x) = \sum_{i=1}^{k} x_i \, P(x_i)$$
and of a continuous random variable:
$$E(x) = \int_a^b x \, f(x) \, dx$$
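The continuous expected value can be approximated numerically when the integral is not solved by hand. The sketch below uses a simple midpoint rule; the function name, the number of steps, and the two example densities (uniform on [2, 6] and f(x) = 2x on [0, 1]) are assumptions made for illustration.

```python
def expected_value_continuous(pdf, a, b, steps=10_000):
    """Approximate the integral of x * f(x) over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    total = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * h     # midpoint of the i-th sub-interval
        total += x * pdf(x) * h
    return total

def uniform_pdf(x):
    """Density of a uniform distribution on [2, 6]."""
    return 1.0 / (6.0 - 2.0)

def triangular_pdf(x):
    """Density f(x) = 2x on [0, 1]."""
    return 2.0 * x

print(expected_value_continuous(uniform_pdf, 2.0, 6.0))     # close to 4
print(expected_value_continuous(triangular_pdf, 0.0, 1.0))  # close to 2/3
```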
Random numbers
Row
 1   1534 6128 6047 0806 9915 2882 9213 8410 9974 3402 8188 3825 9801
 2   7106 8993 8566 5201 8274 7158 1223 9836 2362 8162 6569 7020 8788
 3   2836 4102 8644 5705 4525 4341 4388 3899 2103 8226 1492 1124 6338
 4   7873 2551 9343 7355 5695 3463 9760 3883 4326 0782 2139 7483 5899
 5   5574 0330 9297 1448 5752 1178 6691 1253 3825 3364 8823 9155 3309
 6   7545 2358 6751 9562 9630 5789 6861 1683 9079 7871 6878 4919 0807
 7   7590 6427 3500 7514 7172 1173 8214 6988 6187 4500 0613 3209 0968
 8   5574 7067 8754 9205 6988 0670 8813 9978 2721 5598 7161 5959 0539
 9   1202 9325 2913 0402 0227 0820 0611 8026 1489 9424 0241 2364 4205
10   7712 2454 1258 2427 4264 5067 3131 6751 4216 3816 3834 2555 8257
Normal distribution
Characteristics of the normal curve:
It is symmetrical -- Half the cases are to one side of the center; the other half is on the other side.
The distribution is single peaked, not bimodal or multimodal
Also known as the Gaussian distribution
Normal distribution
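Probability density function:
$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
N(μ, σ) - normal distribution with a mean of μ and a standard deviation of σ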
N(0,1) - standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1
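The normal density is easy to evaluate directly. The sketch below uses the standard formula f(x) = exp(-(x - μ)² / (2σ²)) / (σ√(2π)), with N(0,1) as the default; the function name `normal_pdf` is illustrative.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma), where sigma is the standard deviation."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Standard normal: the peak is at the mean and the curve is symmetric.
print(normal_pdf(0.0))
print(normal_pdf(-1.0), normal_pdf(1.0))
```

The two values printed on the last line are equal, which reflects the symmetry of the curve about its mean.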
Exponential distribution
$$f(x) = \lambda e^{-\lambda x}$$ for $x \ge 0$, where $\lambda > 0$ is the rate parameter
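The cumulative distribution function is given by:
$$F(x) = P(-\infty, x)$$
that is, the probability that the random variable takes a value in the interval from $-\infty$ to x.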
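The link between a density and its cumulative distribution function can be illustrated with the exponential distribution. The sketch below assumes the common rate parametrization, f(x) = λ·exp(-λx) for x ≥ 0, which gives F(x) = 1 - exp(-λx); the function names and the midpoint-rule check are illustrative, not taken from the lecture.

```python
import math

def exponential_pdf(x, lam):
    """Assumed density: lam * exp(-lam * x) for x >= 0, otherwise 0."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def exponential_cdf(x, lam):
    """Closed form of F(x) = P(-inf, x) for the density above."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

def cdf_by_integration(x, lam, steps=10_000):
    """Accumulate the density from 0 to x with the midpoint rule."""
    if x <= 0:
        return 0.0
    h = x / steps
    return sum(exponential_pdf((i + 0.5) * h, lam) * h for i in range(steps))

print(exponential_cdf(1.0, 2.0))
print(cdf_by_integration(1.0, 2.0))
```

Both numbers agree to several decimal places, since F(x) is the area under the density to the left of x.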
Thank you for your attention!