Uploaded by jennifer li

Descriptive Statistics Lecture Notes

WEEK 1: Descriptive Statistics
Types of data (concepts)
• A variable is a characteristic of a population or of a sample from a population. A data set contains observations on
variables, which can be classified as:
- Discrete (whole numbers that can’t be broken down, such as number of items) or continuous (numbers that can be
broken down, e.g. height and weight) → Continuous variables can be interval (differences between values are
meaningful but there is no true zero, e.g. clock time) or ratio (differences are meaningful and there is a true zero,
so ratios can also be compared, e.g. height or weight)
- Quantitative or qualitative. Qualitative variables are nominal (simply labelled) or ordinal (labelled and also in a
specific order, e.g. poor, good, very good)
To apply statistical analyses directly to qualitative data, it must be converted to quantitative data.
Types of observations
The type of observation made by the statistician can also be used to classify data:
• Time series data consists of measurements of the same concept at different points in time (e.g. Sydney-area births
per day, for each day in a year)
• Cross-sectional data consists of measurements of one or more concepts at a single point in time (e.g. age, gender,
marital status of a sample of UNSW staff in a particular year)
• Longitudinal/panel data features both characteristics: the same units are measured at several points in time
The type of data you have influences what type of analysis is appropriate:
• Examining monthly or seasonal patterns in the number of births would be sensible.
• Suppose marital status is coded as Single = 1, Married = 2, Divorced = 3, Widowed = 4. It would not make
sense to calculate the ‘average marital status’ of the UNSW staff, because the codes are only labels
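The point about coded categories can be illustrated with a minimal sketch. The sample of codes below is made up for illustration and is not from the lecture data.

```python
from collections import Counter

# Hypothetical sample of coded marital statuses (made-up values).
codes = [1, 2, 2, 3, 2, 1, 4, 2]
labels = {1: "Single", 2: "Married", 3: "Divorced", 4: "Widowed"}

# The mean of the codes can be computed, but it has no interpretation:
# it would change if the (arbitrary) coding scheme changed.
mean_code = sum(codes) / len(codes)

# Counts per category are a meaningful summary for nominal data.
counts = Counter(labels[c] for c in codes)
most_common_status = counts.most_common(1)[0][0]

print(mean_code)            # a number with no real-world meaning
print(counts)               # frequency of each status
print(most_common_status)   # the modal category
```

For nominal data, counts and the modal category are the sensible summaries; the mean of the codes is only an artefact of the labelling.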
1. Summaries for categorical variables
Frequency distribution (graphical summary)
Frequency distributions are summaries of categorical data using counts. Categories need to be mutually exclusive and
exhaustive.
Relative frequency = frequency of a category / total number of observations
Pie charts and bar charts are graphical representations of frequency distributions:
• Pie charts show relative frequencies more explicitly
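A frequency distribution and its relative frequencies can be sketched as follows. The responses are made-up data and the variable names are illustrative only.

```python
from collections import Counter

# Hypothetical survey responses (made-up data).
transport = ["bus", "train", "car", "bus", "bus",
             "train", "walk", "bus", "car", "train"]

freq = Counter(transport)          # frequency (count) of each category
total = sum(freq.values())         # total number of observations
rel_freq = {cat: n / total for cat, n in freq.items()}

for cat in freq:
    print(f"{cat:6s} {freq[cat]:3d} {rel_freq[cat]:.2f}")
```

Because the categories are mutually exclusive and exhaustive, the relative frequencies sum to 1.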
Histograms (graphical summary)
Assuming data is ordinal (whether discrete or continuous), the obvious categories for the data values may not exist,
but we can create categories by defining lower and upper category limits. These categories need to be mutually
exclusive and exhaustive.
Categories are called bins in Excel.
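As a rough sketch of how bins can be built from lower and upper limits, the helper below counts values in equal-width bins. The function name, the half-open convention [lower, upper) and the data are assumptions made for this example; spreadsheet software may treat bin edges differently.

```python
def bin_counts(values, lower, width, n_bins):
    """Count values in n_bins equal-width bins [lower + k*width, lower + (k+1)*width).

    The largest bin also includes its upper limit, so that the bins are
    exhaustive over [lower, lower + n_bins*width].
    """
    counts = [0] * n_bins
    top = lower + width * n_bins
    for v in values:
        k = int((v - lower) // width)
        if v == top:
            k = n_bins - 1          # put the maximum possible value in the last bin
        if not 0 <= k < n_bins:
            raise ValueError(f"{v} falls outside the bins")
        counts[k] += 1              # each value lands in exactly one bin
    return counts

# Made-up heights in cm, binned as 150-160, 160-170, 170-180, 180-190.
heights = [151, 158, 160, 163, 165, 167, 170, 172, 178, 190]
counts = bin_counts(heights, lower=150, width=10, n_bins=4)
print(counts)
```

Each observation falls in exactly one bin (mutually exclusive) and every observation is counted (exhaustive), so the counts add up to the number of observations.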
Describing histograms
• Symmetry (or lack of it) describes whether the left half of the histogram is a mirror image of the right half
• Skewness – a feature of an asymmetric histogram
- Long tail to the right = positively skewed (mean > median)
- Long tail to the left = negatively skewed (mean < median)
- A skewed distribution may also be associated with outliers
• Clusters
• Number of modal classes/bins
- The modal class is the class with highest frequency. If a histogram has a single modal class, it is unimodal.
Histograms may be unimodal or multimodal
1.2 Describing bivariate relationships
How can relationships between multiple variables be characterised?
A contingency table (‘cross-tabulation’ or ‘cross-tab’ table) captures the relationship between two qualitative
variables, e.g. mode of transport and gender
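A contingency table is simply a table of joint counts. The sketch below builds one from made-up (gender, mode of transport) pairs; the data and names are illustrative assumptions.

```python
from collections import Counter

# Hypothetical (gender, mode of transport) observations (made-up data).
obs = [("F", "bus"), ("M", "car"), ("F", "train"), ("M", "bus"),
       ("F", "bus"), ("M", "car"), ("F", "car"), ("M", "train")]

cell = Counter(obs)                        # joint count for each pair
rows = sorted({g for g, _ in obs})         # row categories
cols = sorted({t for _, t in obs})         # column categories

# Nested dict: table[row][column] = count (0 if the pair never occurs).
table = {g: {t: cell[(g, t)] for t in cols} for g in rows}

print("  ", cols)
for g in rows:
    print(g, [table[g][t] for t in cols])
```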
Scatterplots capture the relationship between two quantitative variables. If one of these variables is ‘time’ then
we get a time series plot
The purpose of these plots is to understand the data and to help you solve a problem. Thus, always ask yourself whether
the data is appropriate and, if not, what else you need to answer the question and solve the problem.
There are three categories of numerical summaries: measures of location, measures of variability and measures of
association.
Measures of central tendency (location)
A parameter describes a key feature of a population
A statistic describes a key feature of a sample
A natural measure of ‘location’ or ‘central tendency’ (a key feature) is the arithmetic mean (other variants include
weighted mean/WAM and geometric mean)
The median is the middle value of ordered observations:
- When n is odd the median will be a particular value
- When n is even, the median is the average of the middle two values
The mode is the most frequently occurring value(s) – ‘the modal class’ was previously defined in the context of
unimodal histograms.
The mean, median and mode each provide different notions of ‘representative’ or ‘typical’ central values.
• For symmetric distributions, mean = median
• For positively (negatively) skewed data: mean > (<) median. The median is preferred when the data contains
outliers, because it is not pulled towards extreme values
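A small sketch of how the three measures behave when one extreme value is present. The numbers are made up for illustration; `mean`, `median` and `multimode` come from Python's standard `statistics` module.

```python
from statistics import mean, median, multimode

# Made-up values with one extreme observation (a long right tail).
incomes = [30, 32, 35, 35, 38, 40, 250]

print(mean(incomes))       # pulled upwards by the extreme value
print(median(incomes))     # middle value of the ordered data
print(multimode(incomes))  # most frequent value(s), returned as a list

# Dropping the extreme value brings the mean back in line with the median.
trimmed = incomes[:-1]
print(mean(trimmed), median(trimmed))
```

With the extreme value included, mean > median, which matches the pattern for positively skewed data.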
Measures of relative location
Sometimes we wish to measure variation relative to location:
• Case 1: Observations all measured in millions, and standard deviation is 20 – relatively little variability
• Case 2: Observations all positive but less than 100 → s = 20 may indicate a lot of variability
(Sample) coefficient of variation provides a measure of relative variability → CV = s/x̄
Percentiles – measures of relative location
The median relies on a ranking of observations to measure location. This idea generalizes to percentiles – the Pth
percentile is the value for which P percent of observations are less than that value.
• The median is the 50th (Q2) percentile
• The 25th (Q1) and 75th (Q3) percentiles are called, respectively, the lower and upper quartiles
• The difference between the upper and lower quartiles is called the interquartile range – another measure of
variability
E.g. for 8 observations ordered from smallest (x1) to largest (x8):
- Q1 = (x2 + x3)/2
- Q2 = (x4 + x5)/2
- Q3 = (x6 + x7)/2
- IQR = Q3 – Q1
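The quartile formulas above correspond to taking the median of the lower half and of the upper half of the ordered data. A minimal sketch of that convention follows; the function name and data are made up, and the handling of an odd number of observations (dropping the middle value from both halves) is an assumption. Other software may use interpolation rules that give slightly different quartiles.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the 'median of each half' convention."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                                  # smallest half
    upper = xs[half:] if n % 2 == 0 else xs[half + 1:] # largest half
    return median(lower), median(xs), median(upper)

# Eight made-up observations, so Q1 = (x2 + x3)/2, Q2 = (x4 + x5)/2, Q3 = (x6 + x7)/2.
data = [2, 4, 5, 7, 8, 10, 13, 20]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```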
Measures of variability
Variance and standard deviation are two basic measures of variability.
Range is a simple measure of variability. Range = maximum – minimum
Variance is the most common measure of variability. It measures average squared
distance from the mean
- Division by n-1 for sample variance relates to properties of estimators
The standard deviation is the spread measured in the original units of the data (not squared)
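The difference between dividing by n (population) and by n-1 (sample) can be seen in a short sketch using Python's standard `statistics` module. The data are made up for illustration.

```python
from statistics import pvariance, variance, pstdev, stdev

x = [2, 4, 4, 4, 5, 5, 7, 9]      # made-up data, mean = 5

data_range = max(x) - min(x)      # range = maximum - minimum
pop_var = pvariance(x)            # sum of squared deviations / n
samp_var = variance(x)            # sum of squared deviations / (n - 1)
pop_sd = pstdev(x)                # square root of pop_var, in original units
samp_sd = stdev(x)                # square root of samp_var

print(data_range, pop_var, samp_var, pop_sd, samp_sd)
```

The sample variance is always slightly larger than the population formula applied to the same numbers, by the factor n/(n-1).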
We can create a transformed variable with zero mean and ‘unit’ variance (i.e. variance of 1) from any original
quantitative variable. This transformed variable is free of units of measurement. The transformation is called
calculating z-scores (one z-score per observation).
Calculate (observation – mean) and then divide this difference by the standard deviation: Z = [xi − μ]/σ
E.g. suppose for mutual fund A, with mean return 10.95% and standard deviation 21.89%, the maximum return is 63%.
This point has a z-score of (63 − 10.95)/21.89 = 2.38, which implies that 63% is 2.38 standard deviations above the
mean return
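The z-score calculation can be checked with a few lines of code. The first part reuses the mutual fund numbers from the example; the second part uses a made-up data set to show that the standardised values have mean 0 and standard deviation 1. The function name is illustrative.

```python
from statistics import mean, pstdev

def z_score(x, mu, sigma):
    """(observation - mean) / standard deviation."""
    return (x - mu) / sigma

# Mutual fund example: return of 63, mean 10.95, standard deviation 21.89.
z_max = z_score(63, 10.95, 21.89)
print(round(z_max, 2))

# Standardising a whole (made-up) variable, treating it as a population.
data = [2, 4, 4, 4, 5, 5, 7, 9]
mu, sigma = mean(data), pstdev(data)
zs = [z_score(v, mu, sigma) for v in data]
print(mean(zs), pstdev(zs))       # approximately 0 and 1
```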
Coefficient of variation
Whether a variance or standard deviation is large or small depends on the scale of the units. The coefficient of
variation is a measure of relative variability and allows us to compare the variability across different variables and
datasets
(Sample) coefficient of variation: s/x̄
Population coefficient of variation: σ/μ
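A brief sketch of why the same standard deviation can mean very different things. Two made-up data sets have identical spread, but one is shifted up by a million; the function name is an illustrative choice.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample coefficient of variation: s / x-bar."""
    return stdev(values) / mean(values)

small = [10, 30, 50, 70]                    # values below 100
large = [v + 1_000_000 for v in small]      # same spread, measured in millions

print(stdev(small), stdev(large))           # identical standard deviations
cv_small = coefficient_of_variation(small)
cv_large = coefficient_of_variation(large)
print(cv_small, cv_large)                   # very different relative variability
```

The coefficient of variation is only meaningful for variables with a positive mean, which is assumed here.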
Measures of association
To examine the relation of two variables, there are two main measures which are covariance and correlation
Introduction to linear regression
Covariance is a numerical measure:
• Positive (negative) covariance → Positive (negative) linear association
• Zero covariance → No linear association
In the population covariance formula, σxy = Σ(xi − μx)(yi − μy)/N, xi and yi are the values of observation i, and μx
and μy represent the means of x and y. Thus [xi − μx] represents the x-axis distance of the observation to the mean of
x and tells you the direction (the same applies to the y value).
However covariance is not scale free. The correlation coefficient is a standardized, unit-free measure of association: It
ranges between 1 (perfect positive linear relationship) and -1 (perfect negative linear relationship).
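The scale dependence of covariance, and the scale-free nature of correlation, can be illustrated with a small sketch. The data and helper names are made up for this example, and the sample (n-1) versions of the formulas are used.

```python
from math import sqrt

def sample_cov(x, y):
    """Sample covariance: sum of (xi - x_bar)(yi - y_bar) divided by n - 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    return sample_cov(x, y) / sqrt(sample_cov(x, x) * sample_cov(y, y))

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
x_scaled = [100 * a for a in x]    # e.g. metres converted to centimetres

print(sample_cov(x, y), sample_cov(x_scaled, y))      # covariance changes with units
print(correlation(x, y), correlation(x_scaled, y))    # correlation does not
```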
Linear regression
Covariance and correlation are the two basic measures of association. This relationship can be further characterised by
running a simple linear regression. To determine the linear relationship between x and y, and to choose the intercept and
slope values that give the best fit, we minimise the residual sum of squares (the least squares method, a basis of
regression analysis):
Residual sum of squares = Σ(yi − b0 − b1·xi)², where b0 and b1 are chosen to make this sum as small as possible
The ‘solution’ to this minimization problem consists of one intercept estimate and one slope estimate that together
minimize the residual sum of squares
The point that consists of the mean of x and y will lie on the line of best fit
b1 will have the same sign as the covariance (correlation) between y and x
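A minimal sketch of the least squares solution for one x variable, using the standard closed-form result that the slope is the covariance of x and y divided by the variance of x. The data and function name are made up for illustration.

```python
def fit_line(x, y):
    """Return (b0, b1) that minimise the residual sum of squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))   # co-variation of x and y
    sxx = sum((a - mx) ** 2 for a in x)                    # variation of x
    b1 = sxy / sxx            # slope: same sign as the covariance
    b0 = my - b1 * mx         # intercept: forces the line through (x_bar, y_bar)
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_line(x, y)
print(b0, b1)

# The point of means lies on the fitted line.
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
print(b0 + b1 * x_bar, y_bar)
```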
The fit of the ‘model’ (the fitted line) to the actual data (= the y-values observed at each x value) is described by the
R-squared statistic. A fit statistic compares how much variation is in the residuals (the variation in the y variable that
is not explained by the model) with how much variation is in the whole y variable. The ‘larger’ this unexplained
variation, the less the model ‘fits’
In a simple bivariate regression → R² = [correlation of x and y]² → R² is also known as the coefficient of determination
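The link between R-squared and the correlation can be checked numerically. The sketch below computes R-squared as 1 minus (unexplained variation / total variation) and compares it with the squared correlation; the data and variable names are made up for this example.

```python
from math import sqrt

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)       # total variation in y

b1 = sxy / sxx                            # least squares slope
b0 = my - b1 * mx                         # least squares intercept

# Residual sum of squares: variation in y not explained by the fitted line.
rss = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))

r_squared = 1 - rss / syy
r = sxy / sqrt(sxx * syy)                 # correlation of x and y

print(r_squared, r ** 2)
```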
b1 = sample covariance of x and y / sample variance of x = 15.296/9.095 = 1.682
b0 = mean of y – b1 × mean of x = 10 – 1.682*12.677 = -11.323
• Excel assumes the data is the whole population when calculating the covariance, so to recover the sample
covariance, we have to multiply by n/(n-1) → 14.267*(15/14) = 15.296 is actually the sample covariance
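The n/(n-1) adjustment can be verified on a small made-up data set. This sketch does not use the lecture's data; it only shows that rescaling a population covariance gives the sample covariance.

```python
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Sum of cross-products of deviations from the means.
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))

pop_cov = sxy / n            # treats the data as the whole population
samp_cov = sxy / (n - 1)     # treats the data as a sample

# Rescaling the population covariance recovers the sample covariance.
recovered = pop_cov * n / (n - 1)
print(pop_cov, samp_cov, recovered)
```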
WEEK 2: Probability, discrete random variables, and the binomial distribution
WEEK 3: Continuous random variables and the normal distribution
WEEK 4: Estimators, the sampling distribution and estimating the population
WEEK 5: Estimating the population mean
WEEK 6: Errors with hypothesis testing and the Chi-squared test
WEEK 7: Simple linear regression, assumptions and OLS inference
WEEK 8: The multiple linear regression model and model building
WEEK 9: Addendum: Addressing causality, experiments and regressions