Department of Civil Engineering
Statistics for Construction
Random Variables and Probability Distributions
Prepared by: Bahram Abedinianagerabi

Outline
• What is a random variable?
• What is a distribution?
• Where do ‘commonly-used’ distributions come from?
• What distribution does my data come from?
• Do I have to specify a distribution to analyse my data?

What is a random variable?
• A random variable is a number associated with the outcome of a stochastic process
  - Waiting time for hauling trucks
  - Average number of major floods in Arlington
  - Productivity rate of construction labourers
• In statistics, we want to take observations of random variables and use them to make statements about the underlying stochastic process
  - Are the productivity rates of two different projects the same?
  - What is the probability that placing a slab on grade takes more than 3 days?
• Parametric models provide much power in the analysis of variation (parameter estimation, hypothesis testing, model choice, prediction)
  - Statistical models of the random variables
  - Models of the underlying stochastic process

What is a distribution?
• A distribution characterises the probability (mass) associated with each possible outcome of a stochastic process
• Distributions of discrete data are characterised by probability mass functions (pmf)
$$P(X = x) \ge 0, \qquad \sum_x P(X = x) = 1$$
• Distributions of continuous data are characterised by probability density functions (pdf)
$$f(x) \ge 0, \qquad \int_x f(x)\,dx = 1$$

Expectations and variances
• Suppose we took a large sample from a particular distribution; we might want to summarise what observations look like ‘on average’ and how much variability there is.
• The expectation of a distribution is the average value of a random variable over a large number of samples.
$$E(X) = \sum_x x\,P(X = x) \quad \text{or} \quad \int_x x\,f(x)\,dx$$
• The variance of a distribution is the average squared difference between randomly sampled observations and the expected value.
$$\mathrm{Var}(X) = \sum_x \bigl(x - E(X)\bigr)^2 P(X = x) \quad \text{or} \quad \int_x \bigl(x - E(X)\bigr)^2 f(x)\,dx$$

Random variable assumptions
• In most cases, we assume that the random variables we observe are independent and identically distributed (i.i.d.): each random variable has the same probability distribution as the others and all are mutually independent.
• This assumption allows us to make all sorts of statements both about what we expect to see and how much variation to expect.
• Suppose X, Y and Z are i.i.d. random variables and a and b are constants.
$$E(X + Y + Z) = E(X) + E(Y) + E(Z) = 3E(X)$$
$$\mathrm{Var}(X + Y + Z) = \mathrm{Var}(X) + \mathrm{Var}(Y) + \mathrm{Var}(Z) = 3\,\mathrm{Var}(X)$$
$$E(aX + b) = aE(X) + b$$
$$\mathrm{Var}(aX + b) = a^2\,\mathrm{Var}(X)$$
$$\mathrm{Var}\!\left(\frac{1}{n}\sum_i X_i\right) = \frac{1}{n}\,\mathrm{Var}(X)$$

‘Commonly-used’ distributions
• At the core of much statistical theory and methodology lies a series of key distributions (e.g. Normal, Binomial, Uniform, etc.)
• These distributions are closely related to each other and can be ‘derived’ as the limit of simple stochastic processes when the random variable can be counted or measured
• In many settings, more complex distributions are constructed from these ‘simple’ distributions
  - Ratios: e.g. Beta, Cauchy
  - Compound: e.g. Geometric, Beta
  - Mixture models

Bernoulli random variable
• A Bernoulli random variable has two possible outcomes: 0 or 1.
• A binomial random variable is the sum of independent and identically distributed Bernoulli random variables.
• For example, say I have a coin and, when tossed, the probability that it lands heads is p. The number of tosses x needed to obtain the first head then follows a geometric distribution:
$$P(x) = (1 - p)^{x - 1}\,p, \qquad \mu = \frac{1}{p}, \qquad \sigma^2 = \frac{1 - p}{p^2}$$

Example 1
• A division of a construction company has over 200 employees; 48% of its employees are male.
The company is going to randomly select 10 of these employees to attend a conference on new technologies for tunnelling.
A) Let Z equal the number of male employees chosen. Is Z a binomial variable? Why or why not?
Solution: Yes. Each trial has two outcomes (male or not); the results of the trials can be considered independent since we are sampling less than 10% of the population; there is a fixed number of trials (10); and the probability of success is the same for each trial (48%).
• Technically, since we are sampling without replacement, the selections are not independent and the probability changes slightly as we sample. But the 10% condition says that we can still use a binomial distribution, since we are sampling less than 10% of the population.
• When our sample size is small in comparison to the population, this assumption of independence doesn't change our results too much.

Example 2
• Historical data indicates that over the last 100 years there have been 4 major floods at a river.
A) Find the probability of a 10-year flood.

Example 3
• The productivity of an employee is 65%. A construction employee is being observed.
A) What is the probability that the first time the employee is not productive is at his/her 7th observation?

Binomial Distribution
• Often, we don’t care about the exact order in which successes occurred. We might therefore want to ask about the probability of k successes in n trials. This is given by the binomial distribution.
• For example, the probability of exactly 3 heads in 4 coin tosses = P(HHHT) + P(HHTH) + P(HTHH) + P(THHH)
  - Each order has the same Bernoulli probability = $(1/2)^4$
  - There are 4 choose 3 = 4 orders
• Generally, if the probability of success is q, the probability of k successes in n trials is
$$P(k \mid n, q) = \binom{n}{k}\,q^k (1 - q)^{n - k}$$
• The expected number of successes is $nq$ and the variance is $nq(1 - q)$.

Example 4
• Historical data indicates that 3% of concrete slabs fail the strength test. Twenty tests are performed.
A) What is the probability that exactly 17 slabs have enough strength?
B) What is the probability that at least two slabs do not meet the strength requirement?

Normal Distribution
• The normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean: data near the mean occur more frequently than data far from the mean.
• In graph form, the normal distribution appears as a bell curve.

Normal Distribution (Cont’d)
• The general formula for the normal distribution is
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
where
• σ is the population standard deviation;
• μ is the population mean;
• x is a value or test statistic;
• e is a mathematical constant of roughly 2.72;
• π is a mathematical constant of roughly 3.14.
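
The binomial probability of k successes in n trials and the normal density can both be evaluated with a few lines of Python. The sketch below is illustrative only: the function names `binomial_pmf` and `normal_pdf` are hypothetical helpers, not part of the slides. It assumes the 20 strength tests of Example 4 are independent, treats a failed test as the "success" event with q = 0.03, and uses the standard normal density with mean μ and standard deviation σ.

```python
import math


def binomial_pmf(k: int, n: int, q: float) -> float:
    """Probability of exactly k successes in n independent trials,
    each with success probability q."""
    return math.comb(n, k) * q**k * (1 - q) ** (n - k)


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


# Example 4 setup (assumption: tests are independent).
# "Success" here means a slab FAILS the strength test.
n_tests, q_fail = 20, 0.03

# A) Exactly 17 slabs pass  <=>  exactly 3 slabs fail.
p_exactly_17_pass = binomial_pmf(3, n_tests, q_fail)

# B) At least 2 slabs fail = 1 - P(0 fail) - P(1 fail).
p_at_least_2_fail = (
    1 - binomial_pmf(0, n_tests, q_fail) - binomial_pmf(1, n_tests, q_fail)
)

print(f"P(exactly 17 pass)  = {p_exactly_17_pass:.4f}")
print(f"P(at least 2 fail)  = {p_at_least_2_fail:.4f}")

# Peak of the standard normal bell curve (mu = 0, sigma = 1).
print(f"Standard normal density at the mean = {normal_pdf(0.0, 0.0, 1.0):.4f}")
```

Part B uses the complement rule because summing two terms (0 and 1 failures) is simpler than summing the nineteen terms for 2 to 20 failures.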