Lecture 3

Fundamentals of Data Analysis

Basics of statistics

Program for today

• Basic terms and definitions

• Discrete distributions

• Continuous distributions

• Normal distribution

Topics for discussion

What are the applications of statistics in modern physics?

How important is it to draw conclusions based on statistical analysis?

What is statistics?

Definition of Statistics:

1. A collection of quantitative data pertaining to a subject or group, for example blood pressure statistics.

2. The science that deals with the collection, tabulation, analysis, interpretation, and presentation of quantitative data.

Two phases of statistics:

• Descriptive Statistics:
  o Describes the characteristics of a product or process using information collected on it.

• Inferential Statistics (Inductive):
  o Draws conclusions on unknown process parameters based on information contained in a sample.
  o Uses probability.

Probability

When we cannot rely on the assumption that all sample points are equally likely, we have to determine the probability of an event experimentally. We perform a large number of experiments N and count how often each of the sample points is obtained. The ratio of the number of occurrences of a certain sample point to the total number of experiments is called the relative frequency.

The probability is then assigned the relative frequency of the occurrence of a sample point in this long series of repetitions of the experiment.

This is justified by the "law of large numbers", which says that the relative frequency approaches the true (theoretical) probability of the outcome if the experiment is repeated over and over again.

In symbols:

$$P(E) = \frac{n(E)}{N}$$

where n(E) is the number of times the event E took place out of a total of N experiments. From this definition we can see that the probability is a number between 0 and 1. When the probability is 1, we know that a particular outcome is certain.
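The relative-frequency definition can be tried out numerically. The sketch below is only an illustration: the die-rolling experiment, the function names, and the fixed seed are assumptions chosen for the example, not part of the lecture. It counts n(E) over N simulated experiments and returns n(E)/N.

```python
import random


def relative_frequency(event, trial, n_experiments, seed=0):
    """Estimate P(E) as n(E) / N over N repetitions of `trial`."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    occurrences = sum(1 for _ in range(n_experiments) if event(trial(rng)))
    return occurrences / n_experiments


def roll_die(rng):
    """One experiment (hypothetical example): roll a fair six-sided die."""
    return rng.randint(1, 6)


def is_six(outcome):
    """The event E: the die shows a six."""
    return outcome == 6


if __name__ == "__main__":
    # The estimate should settle near 1/6 as N grows.
    for n in (100, 10_000, 200_000):
        print(n, relative_frequency(is_six, roll_die, n))
```

For small N the estimate fluctuates noticeably; for large N it stays close to the theoretical value 1/6, which is the behaviour the law of large numbers describes.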

For a discrete random variable the definition of probability is intuitive:

$$P(x) = \frac{n(x)}{N}$$

where n(x) is the number of occurrences of the desired value of the random variable x (successes) in N samples ($N \to \infty$).

For a continuous random variable, this definition requires choosing a small interval of width $\Delta x$ ($\Delta x \to 0$) for which the probability is determined:

$$P(x_0 \le x \le x_0 + \Delta x) = \frac{n(x_0 \le x \le x_0 + \Delta x)}{N}$$

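For a continuous random variable it is preferable to use the probability density function, the probability per unit width of the interval:

$$f(x_0) = \lim_{\Delta x \to 0} \frac{P(x_0 \le x \le x_0 + \Delta x)}{\Delta x}$$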
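A rough numerical illustration of the density definition: count the samples that fall in a small interval starting at x0, divide by N to get the probability of the interval, then divide by the interval width. The normally distributed samples, the interval width of 0.1, and all names below are assumptions made for this sketch.

```python
import math
import random


def estimate_density(samples, x0, dx):
    """Estimate f(x0) as P(x0 <= x <= x0 + dx) / dx from observed samples."""
    n_in_range = sum(1 for x in samples if x0 <= x <= x0 + dx)
    probability = n_in_range / len(samples)  # relative frequency of the interval
    return probability / dx


def standard_normal_pdf(x):
    """Reference density, used here only to check the estimate."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


rng = random.Random(1)  # fixed seed for reproducibility
samples = [rng.gauss(0.0, 1.0) for _ in range(200_000)]

# Compare the estimate with the exact density at the middle of the interval.
estimate = estimate_density(samples, x0=0.0, dx=0.1)
print(estimate, standard_normal_pdf(0.05))
```

A smaller interval gives a more local estimate but needs more samples, because fewer observations fall inside it.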

Histogram

The histogram is the most important graphical tool for exploring the shape of data distributions and a good way to visualize trends in population data. The more often a particular value occurs, the taller the corresponding bar of the histogram.

Constructing a histogram

Step 1: Find the range of the distribution: largest value minus smallest value

Step 2: Choose the number of classes, 5 to 20

Step 3: Determine the class width, stated to one decimal place more than the data: class width = range / number of classes

Step 4: Determine class boundaries

Step 5: Draw frequency histogram
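The five steps can be sketched in a few lines of code. This is a minimal illustration under stated assumptions: the data values are invented for the example, classes are equal-width and start at the smallest value, and the largest value is counted in the last class. The function names are hypothetical.

```python
def build_histogram(data, num_classes):
    """Return (boundaries, frequencies) for equal-width classes."""
    smallest, largest = min(data), max(data)
    data_range = largest - smallest                      # Step 1: range
    if data_range == 0:
        raise ValueError("all values are identical; no class width can be chosen")
    width = data_range / num_classes                     # Steps 2 and 3: class width
    boundaries = [smallest + i * width for i in range(num_classes + 1)]  # Step 4
    frequencies = [0] * num_classes
    for value in data:
        # The largest value lies on the last boundary; keep it in the last class.
        index = min(int((value - smallest) / width), num_classes - 1)
        frequencies[index] += 1
    return boundaries, frequencies


def draw_histogram(boundaries, frequencies):
    """Step 5: print a text histogram, one row per class."""
    for low, high, count in zip(boundaries, boundaries[1:], frequencies):
        print(f"{low:6.2f} to {high:6.2f} | {'#' * count} ({count})")


# Invented sample data for illustration only.
data = [0, 1, 1.5, 2.5, 3, 3.5, 4.5, 5, 6.5, 7, 9, 10]
boundaries, frequencies = build_histogram(data, num_classes=5)
draw_histogram(boundaries, frequencies)
```

In practice a plotting library would draw the bars; the text output here only keeps the sketch free of dependencies.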

Number of groups or cells

• Fewer than 100 observations: 5 to 9 cells

• Between 100 and 500 observations: 8 to 17 cells

• More than 500 observations: 15 to 20 cells

Analysis of histogram

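Calculating the average for ungrouped data:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

and for grouped data:

$$\bar{X} = \frac{\sum_{i=1}^{h} f_i X_i}{n} = \frac{f_1 X_1 + f_2 X_2 + \dots + f_h X_h}{f_1 + f_2 + \dots + f_h}$$

where $h$ is the number of cells, $f_i$ is the frequency of cell $i$, $X_i$ is its midpoint, and $n = f_1 + f_2 + \dots + f_h$.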

Boundaries    Midpoint    Frequency    Computation
23.6-26.5     25.0          4            100
26.6-29.5     28.0         36           1008
29.6-32.5     31.0         51           1581
32.6-35.5     34.0         63           2142
35.6-38.5     37.0         58           2146
38.6-41.5     40.0         52           2080
41.6-44.5     43.0         34           1462
44.6-47.5     46.0         16            736
47.6-50.5     49.0          6            294
Total                     320          11549
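The grouped-data average can be checked against the table's own numbers. The sketch below uses the midpoints and frequencies from the table; the "Computation" column is each frequency times its midpoint, and the mean is the column total divided by the total frequency. The function and variable names are illustrative.

```python
# Midpoints and frequencies taken from the table above.
midpoints = [25.0, 28.0, 31.0, 34.0, 37.0, 40.0, 43.0, 46.0, 49.0]
frequencies = [4, 36, 51, 63, 58, 52, 34, 16, 6]


def grouped_mean(midpoints, frequencies):
    """Mean of grouped data: sum of f_i * X_i divided by n = sum of f_i."""
    n = sum(frequencies)
    weighted_sum = sum(f * x for f, x in zip(frequencies, midpoints))
    return weighted_sum / n


n = sum(frequencies)                                                # total frequency
weighted_sum = sum(f * x for f, x in zip(frequencies, midpoints))  # "Computation" total
mean = grouped_mean(midpoints, frequencies)

print(n, weighted_sum, round(mean, 2))
```

The totals reproduce the last row of the table (320 and 11549), giving a grouped mean of about 36.09.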

Measures of dispersion

Range

Standard deviation

Variance

The range is the simplest measure of dispersion and the easiest to calculate:

$$R = X_{\max} - X_{\min}$$

Sample standard deviation:

$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
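Both measures are short to compute by hand or in code. The sketch below assumes the usual sample form of the standard deviation with n - 1 in the denominator; the data values are invented for illustration, and the result is compared with `statistics.stdev` from the Python standard library.

```python
import math
import statistics


def sample_range(values):
    """Range: largest value minus smallest value."""
    return max(values) - min(values)


def sample_std(values):
    """Sample standard deviation with n - 1 in the denominator."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))


# Invented measurements for illustration only.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(sample_range(values))
print(sample_std(values), statistics.stdev(values))
```

The range depends only on the two extreme values, while the standard deviation uses every observation, which is why the two can disagree about how spread out a sample is.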

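For a discrete random variable the variance is defined as follows:

$$V(x) = \sum_{i} \left(x_i - E(x)\right)^2 P(x_i)$$

while for a continuous random variable it is:

$$V(x) = \int_a^b \left(x - E(x)\right)^2 f(x)\,dx$$

where $E(x)$ is the expected value of $x$.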

Parameters of a distribution

A parameter is a characteristic of a population; in other words, it describes the population.

A statistic is a characteristic of a sample. It is used to make inferences about population parameters, which are typically unknown; a statistic used this way is called an estimator.

Population - Set of all items that possess a characteristic of interest

Sample - Subset of a population

Expected value (EV) of a discrete random variable:

$$E(x) = \sum_{i=1}^{k} x_i P(x_i)$$

and of a continuous random variable:

$$E(x) = \int_a^b x f(x)\,dx$$
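The discrete and continuous definitions of the expected value, together with the corresponding variance, can be evaluated directly. The sketch below is an illustration under assumptions: a fair die stands in for the discrete case, a uniform density on [0, 1] for the continuous case, and the integral is approximated with a simple midpoint rule. All names are hypothetical.

```python
from fractions import Fraction


def expected_value(values, probabilities):
    """Discrete case: sum of x_i * P(x_i)."""
    return sum(x * p for x, p in zip(values, probabilities))


def variance(values, probabilities):
    """Discrete case: sum of (x_i - E)^2 * P(x_i)."""
    mean = expected_value(values, probabilities)
    return sum((x - mean) ** 2 * p for x, p in zip(values, probabilities))


def integrate(g, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))


# Discrete example (assumed): a fair six-sided die, exact arithmetic.
faces = [1, 2, 3, 4, 5, 6]
probabilities = [Fraction(1, 6)] * 6
die_mean = expected_value(faces, probabilities)
die_variance = variance(faces, probabilities)

# Continuous example (assumed): uniform density on [a, b] = [0, 1].
a, b = 0.0, 1.0


def uniform_density(x):
    return 1.0 / (b - a)


uniform_mean = integrate(lambda x: x * uniform_density(x), a, b)
uniform_variance = integrate(
    lambda x: (x - uniform_mean) ** 2 * uniform_density(x), a, b
)

print(die_mean, die_variance, uniform_mean, uniform_variance)
```

Using `Fraction` keeps the die results exact (7/2 and 35/12); the continuous results are numerical approximations of 1/2 and 1/12.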

Random numbers

 1    1534 6128 6047 0806 9915 2882 9213 8410 9974 3402 8188 3825 9801
 2    7106 8993 8566 5201 8274 7158 1223 9836 2362 8162 6569 7020 8788
 3    2836 4102 8644 5705 4525 4341 4388 3899 2103 8226 1492 1124 6338
 4    7873 2551 9343 7355 5695 3463 9760 3883 4326 0782 2139 7483 5899
 5    5574 0330 9297 1448 5752 1178 6691 1253 3825 3364 8823 9155 3309
 6    7545 2358 6751 9562 9630 5789 6861 1683 9079 7871 6878 4919 0807
 7    7590 6427 3500 7514 7172 1173 8214 6988 6187 4500 0613 3209 0968
 8    5574 7067 8754 9205 6988 0670 8813 9978 2721 5598 7161 5959 0539
 9    1202 9325 2913 0402 0227 0820 0611 8026 1489 9424 0241 2364 4205
10    7712 2454 1258 2427 4264 5067 3131 6751 4216 3816 3834 2555 8257

Normal distribution

Characteristics of the normal curve:

• It is symmetrical: half the cases are on one side of the center, the other half on the other side.

• The distribution is single-peaked, not bimodal or multimodal.

• It is also known as the Gaussian distribution.

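• Probability density function:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

• Notation: $N(\mu, \sigma)$, where $\mu$ is the mean and $\sigma$ is the standard deviation

• $N(0, 1)$: the standard normal distribution, a normal distribution with a mean of 0 and a standard deviation of 1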
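The density of N(μ, σ) is easy to evaluate numerically. The sketch below uses the standard Gaussian density formula and shows how any normal variable can be reduced to the standard normal N(0, 1) by subtracting the mean and dividing by the standard deviation. The function names and the example values of μ and σ are assumptions for illustration.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma) at x; sigma is the standard deviation."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def standardize(x, mu, sigma):
    """Map a value from N(mu, sigma) to the standard normal scale."""
    return (x - mu) / sigma


# Hypothetical example: mean 10, standard deviation 2.
mu, sigma = 10.0, 2.0
x = 12.0
z = standardize(x, mu, sigma)

# The curve is symmetric about mu, and the peak height scales as 1 / sigma.
print(z, normal_pdf(x, mu, sigma), normal_pdf(z) / sigma)
print(normal_pdf(mu - 1.0, mu, sigma), normal_pdf(mu + 1.0, mu, sigma))
```

The two values on the last line agree, which is the symmetry property listed among the characteristics of the normal curve.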

Exponential distribution

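Probability density function:

$$f(x) = \lambda e^{-\lambda x} \quad \text{for } x \ge 0$$

with rate parameter $\lambda > 0$, and $f(x) = 0$ for $x < 0$.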

Cumulative distribution function

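The cumulative distribution function is given by:

$$F(x) = P(-\infty < X \le x)$$

where $X$ denotes the random variable. For a continuous random variable with density $f$, this is $F(x) = \int_{-\infty}^{x} f(t)\,dt$.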
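As a rough illustration, the cumulative distribution function can be written in closed form for the two continuous distributions mentioned in this lecture. The sketch assumes the normal CDF expressed through the error function `math.erf` and an exponential distribution parameterized by a rate; the parameter names and helper functions are assumptions for the example.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """F(x) for N(mu, sigma), written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def exponential_cdf(x, rate):
    """F(x) for an exponential distribution with the given rate (assumed form)."""
    return 1.0 - math.exp(-rate * x) if x >= 0.0 else 0.0


def interval_probability(cdf, low, high):
    """P(low < X <= high) as a difference of two CDF values."""
    return cdf(high) - cdf(low)


# Probability that a standard normal variable lies within one standard deviation.
within_one_sigma = interval_probability(normal_cdf, -1.0, 1.0)
print(normal_cdf(0.0), within_one_sigma, exponential_cdf(math.log(2.0), rate=1.0))
```

Because F(x) accumulates probability from minus infinity up to x, the probability of any interval is just the difference of F at its two ends.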

Thank you for your attention!