Document

advertisement

4. Basic distributions with R

CA200

(based on the book by Prof. Jane M. Horgan)

1

Discrete distributions: Binomial distribution

Def:

Conditions:

1.

An experiment consists of n repeated trials

2. Each trial has two possible outcomes a) a success with probability p is constant from trial to trial b) a failure with probability q = 1 − p

3. Repeated trials are independent

X = number of successes in n trials. X is a BINOMIAL RANDOM VARIABLE.

The probability of getting exactly x successes in n trials is:

In R use function: prefix “binom” with prefix “d” for pdf, “p” for cdf. Function are dbinom(x, n, p) pbinom(x, n, p)

2

Binomial distribution – cont.

Example 1. Terminals on an on-line computer system are attached to a communication line to the central computer system. The probability that any terminal is ready to transmit is 0.95. p = 0.95

Let X = number of ready terminals

Example 2. A fair coin is tossed 10 times. Success and failure are “heads” and “tails”, respectively, each with probability, 0.5.

Lets denote:

X = number of heads (successes) obtained

Example 3. It is known that 20% of integrated circuit chips on a production line are defective. To maintain and monitor the quality of the chips, a sample of 20 chips is selected at regular intervals for inspection.

Lets denote:

X = number of defectives found in the sample

p = 0.20

n = 20

P(X=0) = P(0 defective in a sample of size 20);

P(X=1) = P(1 defectives in a sample of size 20)

P(X=2) = P(2 defectives in a sample of size 20)

3

Binomial distribution – cont.

Example 4. It is known that 1% of bits transmitted through a digital transmission are received in error. One hundred bits are transmitted each day.

Lets denote:

X = number of bits found in error each day p = 0.01

n = 100

P(X=0) = P(0 errors in the 100 transmission)

P(X=1) = P(1 error in the 100 transmission)

P(X=2) = P(2 errors in the 100 transmission)

4

Binomial distribution – cont.

Calculating binomial pdfs in R:

Use prefix “binom” with “d” for pdf. Function is dbinom with arguments n and p.

Example 1:

For all cumulative probabilities: x <- 0:5 dbinom(x, size = 5, prob = 0.95)

[1] 0.0000003125 0.0000296875 0.0011281250 0.0214343750 0.2036265625 0.7737809375

The probability of getting exactly 3 ready terminals from five: dbinom(x = 3, size =5, prob = .95)

[1] 0.02143438

Example 2: with round function for 4 decimals!

x <- 0:10 round(dbinom(x, 10, 0.5), 4)

[1] 0.0010 0.0098 0.0439 0.1172 0.2051 0.2461 0.2051 0.1172 0.0439 0.0098 0.0010

Example 3: dbinom(x, 20, 0.2)

Example 4: dbinom(x, 100, 0.01)

5

Binomial distribution – cont.

Plotting binomial pdfs in R:

par(mfrow = c(2, 2)) x <- 0:5

# multiframe

#Example 1 plot(x, dbinom(x, size = 5, prob = 0.95), xlab = "Number of Ready Terminals", ylab = "P(X = x)", type = "h", main = "Ready Terminals, n = 5, p = 0.95") x <- 0:10 #Example 2 plot(x, dbinom(x, size = 10, prob = 0.5), xlab = "Number of Heads", ylab = "P(X = x)", type = "h", main = "Heads, n = 10, p = 0.5") x <- 0:20 #Example 3 plot(x, dbinom(x, size = 20, prob = 0.2), xlab = "Number of Defectives", ylab = "P(X = x)", type = "h", main = "Defectives, n = 20, p = 0.2") x <- 0:20 #Example 4. No need for full range here plot(x, dbinom(x, size = 100, prob = 0.01), xlab = "Number of Bits in Error", ylab = "P(X = x)", type = "h", main = "Bits in Error, n = 100, p = 0.01")

6

Binomial distribution – cont.

Binomial pdfs

7

Binomial distribution – cont.

Calculating binomial cdfs in R:

Use prefix “binom” with “p” for cdf. Function is pbinom with arguments n and p.

Example 1: n = 5, p = 0.95

P(X ≤ 3): pbinom(3, 5, 0.95) #[1] 0.0225925

P(X > 3):

1 - pbinom(3, 5, 0.95)

Example 3: n = 20, p = 0.02

P(X ≤ 4):

#[1] 0. 9774075 pbinom (4, size = 20, prob = 0.2) #[1] 0.6296483

P(X > 4) = 1 - P(X ≤ 4 ):

1- pbinom(4, size = 20, prob = 0.2) #[1] 0.3703517

For all cumulative probabilities x <- 0:20 prob <- pbinom(x, size = 20, prob= 0.2) round to 4 decimal places: round(prob, 4) # [1] 0.0115 0.0692 0.2061 0.4114 0.6296

8

Binomial distribution – cont.

Plotting binomial cdfs in R:

par(mfrow = c(2,2)) #multiframe x<-0:5 plot(x, pbinom(x, size = 5, prob = 0.95), xlab = "Number of Ready Terminals", ylab = "P(X <= x)", ylim = c(0, 1), type = "s", main = "n = 5, p = 0.95") x<-0:10 plot(x, pbinom(x, size = 10, prob = 0.5), xlab = "Number of Heads", ylab = "P(X< = x)", ylim = c(0, 1), type = "s", main= "n = 10, p = 0.5") x<-0:20 plot(x, pbinom(x, size = 20, prob = 0.2), xlab = "Number of Defectives", ylab = "P(X <= x)", ylim = c(0, 1), type = "s", main = "n = 20, p = 0.2") x<-0:100 plot(x, pbinom(x, size = 100, prob = 0.01), xlab = "Number of Error Bits", ylab = "P(X <= x)", ylim = c(0, 1), type = "s", main = "n = 100, p = 0.01")

9

Binomial distribution – cont.

Binomial cdf

10

Normal distribution – continuous distribution

• in practice, measures tend to cluster around a central values

• the largest proportion of measurement occurring near central value and decreasing as the measurements extend at either side of this central value

bell-shaped curve symmetrical about the central value is called “normal distribution”

17

Normal distribution – cont.

Example 5. Heights of females and males. We can accept that:

on average, males are taller than females but maybe spread around their respective averages could be the same

most of female and male heights cluster around the same average

the heights of females and males might be modelled by normal distribution:

Height of females and males:

X ~ N(1.65, 0.25)

X ~ N(1.85, 0.25)

 bell-shaped curves

centred at different points

variation at either side of central points is the same

Two normal distributions with different mean and the same spread

18

Normal distribution – cont.

Example 6. Arises in product manufacturing. Components manufactured by two machines are set to a diameter 10 cm.

One machine produces components with error 1cm:

X ~ N(10, 1)

The second machine has an error of 2cm:

X ~ N(10, √2)

 bell-shaped curves

centred at the same point, 10cm

variation at either side of central points is different

one high and narrow, the other low and wide

Two normal distributions with the same mean and different standard deviations 19

Normal distribution – cont.

The Probability Density Function for Normal distribution

Def:

The PDF of a normal (Gaussian) distribution, f(x), is a function defined as:

𝒇 𝑿 =

𝟏 𝝈 𝟐𝝅 𝒆

(𝑿−µ)

𝟐𝝈

𝟐

It can be shown that

E(X) = µ

Var(X) = σ 2

− ∞ < 𝑿 < ∞

X is said to be normally distributed with a mean of μ and a variance of σ 2 i.e.

X ~ N (μ, σ 2 )

Increasing μ moves the distribution along the x-axis (example 1), and increasing σ 2 widens the distribution (example 2)

20

Normal distribution – cont.

Generating Normal Curves

To plot a normal curve use the function curve with dnorm as the argument: curve (dnorm(x, 2, 0.25), from = 1, to = 3)

# plots a normal curve with a mean of 2 and a standard deviation of 0.25 in the range 1 to 3 curve (dnorm(x, 2, 0.25), from = 1, to = 3, labels = FALSE, tick = FALSE)

# to suppress the labels and axis ticks

21

Normal distribution – cont.

Generating Normal Curves

To generate normal curves on same axis use (example 1, example 2) cbind which combines R objects with columns and matplot which plot columns as matrices x <- seq(0.5, 3, len = 101) #101 equidistant values between 0.5 and 3 y <- cbind(dnorm(x, 1.65, 0.25), dnorm(x, 1.85, 0.25)) #two normal curves matplot(x, y, type = "l", xlab = "height(metres)", ylab = "f(x)") text(1.15, 0.5, "Females")

#multiple plotting text(2.35, 0.5, "Males")

Females Males

0.5

1.0

1.5

2.0

height(metres)

2.5

3.0

Normal distribution – cont.

The Cumulative Distribution Function

Def:

The CDF of normal distribution in R is obtained with pnorm function:

P(X ≤ x) = pnorm(x, μ, σ)

Example 7. From previous records it is known that examination results are normally distributed with mean μ = 45 and the standard deviation σ = 4:

X ~ N(45, 16)

What percentage of students obtain a mark

• less than 45;

• larger than 45;

• larger than 50;

• between 40 and 50;

• greater than 37?

23

Using R to obtain normal probabilities

24

Conditional Probabilities in Normal distribution

Example 8. An analogue signal received at a detector (measured in microvolts), is normally distributed with a mean of 100 and a variance of 256. What is the probability that the signal will be less than 120 microvolts, given that it is larger than 110 microvolts?

Solution 8.

X ~ N(100, 16)

Recall Conditional probability of B given A: 𝑃 𝐵 𝐴 =

𝑃(𝐴∩𝐵)

𝑃(𝐴)

𝑃 𝑋 < 120 𝑋 > 110 =

𝑃( 𝑋 > 110 ∩ 𝑋 < 120 )

=

𝑃(𝑋 > 110)

𝑃(110 < 𝑋 < 120)

𝑃(𝑋 > 110)

For

P(110 < X < 120) pnorm(120, 100, 16) - pnorm(110, 100, 16)

[1] 0.1603358

P(X > 110)

1 - pnorm(110, 100, 16)

[1] 0.2659855

𝑃 𝑋 < 120 𝑋 > 110 =

0.1603358

0.2659855

= 0.6027988

25

Example 9. From previous records scores on an aptitude are normally distributed with mean μ = 500 and standard deviation σ = 100. What is the probability that an individual’s score exceeds 650, given that it exceeds 500?

Solution 9.

X ~ N(500, 100)

𝑃 𝑋 > 650 𝑋 > 500 =

𝑃( 𝑋 > 500 ∩ 𝑋 > 650 )

𝑃(𝑋 > 500)

=

𝑃(𝑋 > 650)

𝑃(𝑋 > 500)

For

P(X > 650)

1 - pnorm(650, 500, 100)

[1] 0.0668072

P(X > 110)

1 - pnorm(500, 500, 100)

[1] 0.5

P X > 650 X > 500 =

0.0668072

0.5

= 0.1336144

26

Example 10. Summary for basic statistics with R:

(Based on the Excel file “CA200_Questionnaire_Summary”which has been previously circulated) a) Read the survey answers into a suitable data frame in R. [It was necessary to save the file, from within Excel, in comma separated (csv) format. But this has already been done for you that is the circulated file is of .csv

type]

To read the file in R use “read.csv”. For example, if the file is on the C drive then results<-read.csv(“C:/CA200_Questionnaire_Summary.csv”, header=T) this will read the data from the file into the data frame called “results”.

As a check that the data have been read correctly, one can see that the command: results$Q1 gives results from column Q1 only:

[1] 1 1 1 1 1 2 1 1 2 1 1 2 1 2 2 1 1 1 1 1 1 1 2 2 1 2 2 NA 1 2 1 NA 1 2 2 1 1 2 2 1 1 1 1 1 1 2 b) Use R to find out how many males and females there are (check the questionnaire for the questions!) For number of females and males use summary function. This will give you all important information:

- Hint: First convert categorical variable Q1 into a factor using: results$Q1<-factor(results$Q1)

Then use summary(results) to find out how many females/males are out there c) Use R to and to separate data for females and for males into two different data sets:

Use subset function: m_results<-subset(results, results$Q1==1)

To check if this is properly entered use function to print results in the work space: m_results #This will filter only data for males

repeat same for separating results for females by using: f_results<-subset(results, results$Q1==2) #This will filter only data for females

27

Example 10. Summary - cont.

d) Identify any other categorical variables (questions) and similarly specify these as factors. Then, repeat the summary(results) command.

e) Calculate mean and standard deviation of the salary for summer job for females Q13b:

> f_results$Q13b

[1] 200 150 NA 240 NA 150 NA 300 200 200 300 NA NA NA 330

> summary(f_results$Q13b)

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's

150 200 200 230 300 330 6

> mean(f_results$Q13b) #don’t forget to remove all NAs

[1] NA

> mean(f_results$Q13b, na.rm=T)

[1] 230

> sd(f_results$Q13b, na.rm=T)

[1] 66.52067

f) For question e) use R to calculate skewness and kurtosis:

Note: you will have to install additional package which contains these functions. The name of a package is

“e1071”. For installation use:

> install.packages(“e1071”)

> library("e1071")

Loading required package: class

Warning message: package 'e1071' was built under R version 2.13.2

> skewness(f_results$Q13b, na.rm=T)

[1] 0.2196901

#what can you conclude from this number?

> kurtosis(f_results$Q13b, na.rm=T)

[1] -1.68135

#what can you conclude from this number?

28

Example 10. Summary - cont.

Note:

Categorical data are generally non-numerical data which are placed into exclusive categories and then counted rather than measured. In fact, there are two types of categorical data, namely

Nominal: Simple names or labels which cannot be ordered (e.g. car registration number or a PIN number)

Ordinal: Can be placed in a meaningful order (though without any measurements). For example,

“small”, “medium” or “large” vests.

A factor is a vector object used to specify a discrete classification (grouping) of the components.

of other vectors of the same length.

g) Plot the data from e) in a histogram and compare to Normal curve by e.g. superposition:

> m <-mean(f_results$Q13b, na.rm=T)

> s <-sd(f_results$Q13b, na.rm=T)

> mn <-min(f_results$Q13b, na.rm=T)

> mx <-max(f_results$Q13b, na.rm=T)

> c<-curve(dnorm(x, m, s), from=mn, to=mx)

> hist(f_results$Q13b, freq=FALSE, xlab=“put adequate name”, ylab=“density”, main=“put adequate name”)

> lines(c)

29

Download