9/03/2015
Stat 5002: An Introduction to Statistics with Applications in Computing
https://elearning.sydney.edu.au/webapps/
©Sydney University

Lecture 1: Introduction to Statistical Thinking
- Samples and populations
- Sample statistics and population parameters
- Graphical summaries of data
- Numerical summaries of data

Objectives of STAT5002
To introduce students to:
- basic statistical concepts and methods for further studies
- methodologies related to statistical data analysis and data mining
- a number of useful statistical models
- computer-oriented estimation procedures
- smoothing and nonparametric concepts
- analysis of large data sets
- the R computing language, used for all computational aspects of the course

Objective of Statistics
Samples → Populations
We use information from a SAMPLE to answer questions or discover features about a target POPULATION.

Populations (ALL)
Define the target population: the population to which we want to generalise our findings.
Specify characteristics that identify the members of the population. Who/What? Where? When?
Example: Characteristics such as age, income, education, gender and marital status are typically used in studies concerning people.
A sampling frame is a list or rule defining the population.
This is usually unachievable, and we often need to restrict our studies to the population to which we can gain access.

Samples (Some of All)
A random sample is one where each member of the population has the same chance of being selected.
It is often difficult, or even impossible, to obtain a random sample.
- Individual observations should be selected independently!
- Samples need to be representative of the population (not biased).
- Sample size needs to be large enough!
Independent observations ∴ random sample: representative of the population.
[Figure: a representative sample drawn from the population.]

Samples need to be
- Samples need to be representative of the target population.
- Observations within samples must be independent of each other.
- Samples must not be biased!

Bias
Bias may be defined as any systematic error (i.e. not occurring randomly) which results in incorrect conclusions about the target population.
Some types of bias include:
- selection bias
- measurement bias
- response bias
- confounding

Types of Bias
Selection bias: refers to any systematic differences occurring in the way that subjects are selected for a study.
Measurement bias: refers to systematic differences in the measurement of variables.
Response bias: can occur when the response rate to a survey is too low.
Confounding: a confounder is a variable that distorts (increases or decreases) the apparent effect of one variable (determinant) on another variable (outcome).

Two Schools of Thought
Frequentist: the population is fixed; samples vary (somewhat).
Bayesian: the population varies; the sample is fixed.

Scope of Statistics/Data Mining Study
[Figure: the steps of a study, shown side by side for Statistics and Data Mining.]
1. Understand the problem!!
2. Design the study
3. Collect a sample / obtain data
4. Organise the data
5. Data analysis / exploratory data analysis
6. Interpretation of results
7. Report results

Where do data come from?

Types of Statistical Studies
Observational studies: an observational study is one in which there is no intervention by the investigator, nor is there any treatment imposed.
Experimental studies: an experimental study is one in which the investigator has some control over the determinant.
Data mining from databases.

Obtaining Data
Statistical studies: observational studies; experimental studies (example: the Stanford prison experiment).

Experimental Studies
[Figure: design of an experimental study. Population → sampling → sample → randomisation into an experimental group and a control group, which are then compared as follows.]
First data collection (before) → treatment (experimental group) or no treatment (control group) → second data collection (after) → compare!

Obtaining Data (links)
http://archive.bio.ed.ac.uk/jdeacon/statistics/tress2.html
http://www.med.uottawa.ca/sim/data/Study_Designs_e.htm
Data mining from databases: http://www.inf.ed.ac.uk/teaching/courses/dme/html/datasets04 05.html

Data Mining

CRISP Data Mining
Cross Industry Standard Process for Data Mining.
Aim: to develop an industry-, tool- and application-neutral process for conducting Knowledge Discovery (KD).

Data: Evidence from Samples

Variables
Measurements taken on subjects in a study vary amongst subjects.
These measurements (data) are usually organised in a spreadsheet consisting of rows and columns.
The rows contain information about individual subjects or records.
The columns contain the values of the measurements that vary: the variables.

Variables usually take on specific roles:
- Determinants (predictors, explanatory variable/s, inputs, independent variable/s) influence outcomes.
- Outcomes (response variable/s, outputs, dependent variable/s).

A Spreadsheet
Source: http://www.bom.gov.au/climate/data/

BOM station  Month  Day  Max_1913  Min_1913  Max_2013  Min_2013  >34_1913  >34_2013  <9_1913  <9_2013  Diff Max  Diff Min
66062        1      1    25        19.1      26.2      20.2      0         0         0        0        7.1       1.1
66062        1      2    27.1      17.1      22.9      20.3      0         0         0        0        5.8       3.2
66062        1      3    32.6      20.7      24.8      18.4      0         0         0        0        4.1       -2.3
66062        1      4    21.9      17.5      26.6      18.3      0         0         0        0        9.1       0.8
66062        1      5    23.1      15.8      28.3      20.9      0         0         0        0        12.5      5.1
66062        1      6    24.6      15.4      28        21.6      0         0         0        0        12.6      6.2
66062        1      7    23.9      18.9      27.5      21.4      0         0         0        0        8.6       2.5
66062        1      8    23.8      18.6      42.3      20.9      0         1         0        0        23.7      2.3
66062        1      9    23.9      16.8      25        21.1      0         0         0        0        8.2       4.3
66062        1      10   25.2      16        25.4      20.2      0         0         0        0        9.4       4.2
66062        1      11   26.3      19.8      29.6      21.2      0         0         0        0        9.8       1.4
66062        1      12   26.9      20.1      31.2      23.5      0         0         0        0        11.1      3.4
66062        1      13   31.3      19.8      23.8      20.7      0         0         0        0        4         0.9
66062        1      14   25.2      20        23.7      17.1      0         0         0        0        3.7       -2.9
66062        1      15   25.9      20.1      24.9      16.8      0         0         0        0        4.8       -3.3
66062        1      16   27.1      20.9      27.2      19.1      0         0         0        0        6.3       -1.8
66062        1      17   27.8      20.4      29        21.4      0         0         0        0        8.6       1
66062        1      18   30.6      19.4      45.8      21.7      0         1         0        0        26.4      2.3
66062        1      19   27.7      20.2      24.8      21.5      0         0         0        0        4.6       1.3
66062        1      20   21.7      19.6      24.3      20.2      0         0         0        0        4.7       0.6
66062        1      21   22.2      16.9      26.6      20.7      0         0         0        0        9.7       3.8
66062        1      22   24.6      15        29.6      20.9      0         0         0        0        14.6      5.9

Types of Data

Variable Types
[Figure: tree of variable types. Categorical/Group: nominal, ordinal. Numerical/Quantitative: discrete, continuous.]

Categorical variables
Categorical variables are variables where each observation falls into one of a finite number of groups.
Nominal variables: named variables with no implicit order. Examples: type of cancer, personality type.
Ordinal variables: grouped variables with implicit order. Examples: level of education, grade.
If there are two groups the variable is often referred to as being binary or dichotomous (having two possible values).
Binary variables can be either nominal, such as sex, or ordinal, such as age group, e.g. < 20 years, ≥ 20 years.
[Figure: objects classified by colour (nominal): White, Green, Purple; and by size (ordinal): Small, Medium, Large.]

Numerical / Quantitative Variables
Numerical variables are measured variables and can be either discrete or continuous.
Discrete variables are variables that take discrete values, e.g. number of children, number of people in a store.
Continuous variables are those that can assume many values within a certain range or interval, e.g. height, weight, pulse rate.
Numerical variables are also referred to as interval or scale variables.
'Numerical' data cover a range of values, usually measured with an instrument or along some scale, or counted (when the counts cover a large range).
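As a rough illustration of how these variable types are commonly represented in R, the sketch below uses small made-up vectors (the names `colour`, `size`, `children` and `height` are hypothetical and not part of the lecture data). Nominal variables are stored as factors, ordinal variables as ordered factors, and numerical variables as integer or double vectors.

```r
# Hypothetical toy data, invented for illustration only.
colour <- factor(c("White", "Green", "Purple", "Green"))      # nominal: no order
size <- factor(c("Small", "Large", "Medium", "Small"),
               levels = c("Small", "Medium", "Large"),
               ordered = TRUE)                                # ordinal: ordered levels
children <- c(0L, 2L, 1L, 3L)                                 # numerical, discrete
height <- c(162.5, 180.1, 175.0, 168.3)                       # numerical, continuous

# Order comparisons only make sense for ordinal (ordered) factors.
smaller_than_large <- size < "Large"
smaller_than_large

# Counts per category are the natural summary for a categorical variable.
table(colour)
```

Declaring the level order explicitly matters: without `levels = ...`, R would sort the size labels alphabetically rather than from small to large.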
Discrete: can only take some values. For example:
- marks in a test (half-mark accuracy at most)
- number of steps walked in a day (whole numbers)
- university entrance scores
Continuous: can take any value. Example: distance walked in a day (2.6 km, 2.67 km, 2.675 km, etc.).

The Variables in the Spreadsheet

Variable            Description                                 Data Type
BOM station number  Bureau of Meteorology station number        Categorical, nominal
Month               Month of year (1:12)                        Categorical, ordinal
Day                 Day of month (1:31)                         Numeric, discrete
Max_1913            1913 daily max temp (°C)                    Numeric, continuous
Min_1913            1913 daily min temp (°C)                    Numeric, continuous
Max_2013            2013 daily max temp (°C)                    Numeric, continuous
Min_2013            2013 daily min temp (°C)                    Numeric, continuous
Very Hot?_1913      1: 1913 max temp > 34; 0: otherwise         Categorical, ordinal
Very Hot?_2013      1: 2013 max temp > 34; 0: otherwise         Categorical, ordinal
Very Cold?_1913     1: 1913 min temp < 9; 0: otherwise          Categorical, ordinal
Very Cold?_2013     1: 2013 min temp < 9; 0: otherwise          Categorical, ordinal
Diff Max            Max_2013 - Max_1913                         Numeric, continuous
Diff Min            Min_2013 - Min_1913                         Numeric, continuous

Note: the Diff Max values tabulated in the spreadsheet equal Max_2013 - Min_1913 (e.g. 26.2 - 19.1 = 7.1 for day 1).

Summarising Data: Graphical Methods

All graphs need
- a title
- clearly labelled axes
- appropriate comments
- to have clarity
- to be aesthetically satisfying

Displaying Categorical Data: One Variable
Bar Chart
[Figure: bar chart, "Number of Very Hot Days in 1913"; categories: Not so Hot, > 34 C.]
[Figure: bar chart, "Number of Very Cold Days 2013"; categories: Not so Cold, < 9C.]

Contingency Table: Showing counts for Two Categorical Variables

        Temperature
Year    < 9C   Not so extreme   > 34C   Total
1913    65     295              5       365
2013    33     326              6       365
Total   98     621              11      730

Numerical Summary: Categorical Data
For categorical data we simply tabulate the counts and/or proportions of data (denoted p in a sample, or π in a population) in the categories of interest.
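The tabulation described above can be sketched in R. The counts are typed in from the contingency table; the object names (`counts`, `row_pct`) are illustrative only. `prop.table()` with `margin = 1` divides each count by its row (year) total.

```r
# Counts of days by year and temperature category, entered by hand
# from the contingency table.
counts <- as.table(matrix(c(65, 295, 5,
                            33, 326, 6),
                          nrow = 2, byrow = TRUE,
                          dimnames = list(Year = c("1913", "2013"),
                                          Temperature = c("< 9C", "Not so extreme", "> 34C"))))

# Add row and column totals.
addmargins(counts)

# Percentage of days in each category within each year (rows sum to 100).
row_pct <- round(100 * prop.table(counts, margin = 1), 2)
row_pct
```

Row percentages are used here because the comparison of interest is between years, each of which has 365 days.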
Counts of Days in each Year

Year    < 9C   Not so extreme   > 34C
1913    65     295              5
2013    33     326              6

Percentages of Days in each Year

Year    < 9C     Not so extreme   > 34C
1913    17.81%   80.82%           1.37%
2013    9.04%    89.32%           1.64%

Clustered bar chart
A clustered bar chart is a visual display showing associations between two categorical variables.
[Figure: clustered bar chart, "Numbers of Very Hot, and Very Cold Days in 1913 and 2013"; x-axis: Temperatures (< 9C, Not so extreme, > 34C); y-axis: Number of Days; one bar per year (1913, 2013).]
It appears that
- the daily temperatures were not so extreme in both 1913 and 2013
- there was a larger proportion of extremely cold days in 1913 than in 2013
- the proportion of very hot days was low in both years

Displaying Numerical Data

Histogram
A histogram is a simple and effective display, useful for displaying the distribution of numerical data.
A histogram shows the number of observations that fall into each of several non-overlapping groups or bins.
The bins of a histogram adjoin each other, so there are no gaps between bins unless a bin is empty.
[Figure: histogram, "Daily Minimum Temperatures in 2013".]

Structure of a Box Plot
[Figure: a box plot with its parts labelled: whiskers, lower quartile, median, upper quartile, outliers.]

Median and quartiles
A boxplot displays a five-number summary of a numerical set of data. These numbers are:
- Minimum: the smallest value
- Lower quartile: separates the lower 25% of values from the rest
- Median: the half-way point of the data
- Upper quartile: separates the upper 25% of values from the rest
- Maximum: the largest value
A boxplot also identifies any unusually large or small values in a dataset, called outliers.

Comparative Box Plots
Box plots enable the comparison of several samples of data simultaneously.
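The five-number summary behind a box plot can be computed directly in R. The sketch below uses a small made-up vector of temperatures (`temps` is hypothetical, not the lecture data). Note that `fivenum()` returns Tukey's hinges, which can differ slightly from the quartiles given by `quantile()`, and that `boxplot.stats()` flags outliers using R's default rule of 1.5 times the box length; that rule is R's convention, not something stated in the slides.

```r
# Hypothetical temperatures (degrees C), invented for illustration.
temps <- c(7, 11, 12, 14, 15, 15, 16, 18, 19, 24, 35)

# Minimum, lower hinge, median, upper hinge, maximum.
five <- fivenum(temps)
names(five) <- c("min", "lower quartile", "median", "upper quartile", "max")
five

# Values beyond the whiskers (more than 1.5 box lengths from the box).
outliers <- boxplot.stats(temps)$out
outliers

# Draw the box plot itself.
boxplot(temps, horizontal = TRUE, xlab = "Temperature (deg C)")
```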
[Figure: comparative box plots, "Daily Minimum and Maximum Temperatures, 1913 and 2013", for 1913_Min, 2013_Min, 1913_Max and 2013_Max; x-axis: Temperature (°C), 0 to 50.]
It appears that both minimum and maximum daily temperatures in 2013 were slightly higher than those in 1913.

Comparing box plots
When making comparisons using box plots:
- compare centres
- compare spreads
- mention unusual observations
See: http://freedom.indiemaps.com/

Scatter Plot
A scatter plot shows the relation between two numerical variables.
The two variables, X and Y, are referred to as the predictor and response variable respectively, although they do have other names:
- X: predictor, determinant, independent
- Y: response, outcome, dependent
If X increases and Y increases then a POSITIVE relation exists.
If X increases and Y decreases then a NEGATIVE relation exists.

Construction of scatter plot
- Draw X and Y axes to cover the range of the two variables.
- Plot one point for each observation, i.e. (x, y).
- Label the axes and mark the scale.
- Comment on the plot.

Scatterplots of Temperatures
[Figure: two scatter plots. "Maximum Temperatures: 1913 and 2013" (x-axis: Maximum Temperatures 1913; y-axis: Maximum Temperatures 2013) and "Minimum Temperatures: 1913 and 2013" (x-axis: Minimum Temperatures 1913; y-axis: Minimum Temperatures 2013), each with a diagonal line.]
The points on the diagonal lines represent days where the minimum (or maximum) temperature in 1913 was the same as in 2013.
Is there a sensible message here??

Displaying Data

Data Type           Categorical              Numerical
Categorical         Clustered bar chart      Comparative box plots
Numerical           Comparative box plots    Scatter plots
One variable only   Bar chart                Histograms
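The scatter plot construction steps described earlier can be sketched in R. The data here are simulated purely for illustration (they are not the Bureau of Meteorology values), and the variable names are hypothetical. The dashed diagonal marks days with the same temperature in both years, so points above it were warmer in the second year.

```r
# Simulated temperatures, invented for illustration; not the lecture's data.
set.seed(1)
max_1913 <- round(rnorm(50, mean = 22, sd = 5), 1)
max_2013 <- round(max_1913 + rnorm(50, mean = 2, sd = 4), 1)

# One point per observation, with a title and clearly labelled axes.
plot(max_1913, max_2013,
     main = "Maximum Temperatures: 1913 and 2013 (simulated)",
     xlab = "Maximum temperature 1913 (deg C)",
     ylab = "Maximum temperature 2013 (deg C)")

# Diagonal reference line y = x: same temperature in both years.
abline(a = 0, b = 1, lty = 2)

# Number of simulated days lying above the diagonal.
warmer_2013 <- sum(max_2013 > max_1913)
warmer_2013
```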
http://www.gapminder.org/videos/the-joy-of-stats/
http://www.gapminder.org/world
Wordle
http://www.oceancalendars.com.au

Data summaries: Numerical Data

Measures of Centre
Mode: the most frequently occurring value in the dataset. The data may be nominal, ordinal or numeric.
Median: the middle value when all the data are placed in order. The data must be ordinal or numerical. For an even number of values the median is the average of the two middle values.
Mean: the arithmetic average. The data must be either discrete or continuous. The mean is calculated by dividing the 'sum of the values' by the 'number of the values'.
http://www.youtube.com/watch?v=oNdVynH6hcY

The Mean
The mean is calculated by dividing the 'sum of the values' by the 'number of the values':
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
where
- x_i are the values of the data (i = 1, ..., n)
- \bar{x} is the average or the 'mean' of the x values
- \sum (sigma) means 'the sum of'.
The mean is the centre of gravity (point of balance) of the data.

Mean versus median
The median cuts the data into two sections with the same number of observations in each (50% below, 50% above).
For symmetric data, Mean = Median.
The mean is affected by outliers, the median is not.
Medians and Means: http://www.ruf.rice.edu/~lane/stat_sim/descriptive/index.html

Mean: Centre of balance
Data: 1 3 6 10
Mean = (1 + 3 + 6 + 10)/4 = 5
[Figure: the four data values on a number line from 0 to 10, balancing at the mean.]

Samples → Populations
We use Sample Statistics to estimate Population Parameters.

Sample statistic   estimates   Population parameter
Mean \bar{x}                   \mu
Median \tilde{x}               \tilde{\mu}

Measures of spread
Numeric data are often described, or summarised, using two statistics: a measure of centrality, or location, and a measure of spread, or dispersion.
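As a quick check of the measures of centre, the sketch below computes the mean and median of the slide's data (1, 3, 6, 10) in R. The second vector, with 10 replaced by 100, is a made-up variation added to show how an outlier moves the mean but not the median. R has no built-in function for the statistical mode, so a common idiom based on `table()` is shown as an assumption about how one might compute it.

```r
x <- c(1, 3, 6, 10)
mean_x <- mean(x)        # (1 + 3 + 6 + 10) / 4
median_x <- median(x)    # average of the two middle values, 3 and 6

# Hypothetical variation: replace the largest value with an outlier.
x_out <- c(1, 3, 6, 100)
mean_out <- mean(x_out)      # pulled towards the outlier
median_out <- median(x_out)  # unchanged

c(mean = mean_x, median = median_x)
c(mean = mean_out, median = median_out)

# Mode of a categorical variable: the most frequent category.
grades <- c("B", "A", "B", "C", "B")   # invented example
mode_grade <- names(which.max(table(grades)))
mode_grade
```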
[Figure: box plots, "Daily Minimum and Maximum Temperatures, 2013", for Minimum and Maximum; x-axis: Temperature (°C), 0 to 50. A measure of variability is important.]

The inter-quartile range
The inter-quartile range (IQR) is the difference between the upper and lower quartiles in an ordered set of numerical data.
IQR = UQ - LQ
The IQR gives the range of the middle 50% of a set of data, so is sometimes called the midspread.
The inter-quartile range is rarely influenced by outliers in the data.
[Figure: box plots, "Daily Minimum and Maximum Temperatures, 2013"; x-axis: Temperature (°C), 0 to 50.]
For the minimum temperatures in 2013: IQR ≈ 18 - 11 = 7
For the maximum temperatures in 2013: IQR ≈ 26.5 - 21 = 5.5

The range
The range is the difference between the maximum value and the minimum value in an ordered set of numerical data.
Range = max - min
The range will be influenced by outliers in the data.
[Figure: box plots, "Daily Min and Max Temps, 2013"; x-axis: Temperature (°C), 0 to 50.]
For the minimum temperatures in 2013: Range ≈ 24 - 7 = 17
For the maximum temperatures in 2013: Range ≈ 46 - 13 = 33

The Standard Deviation
The standard deviation is a measure of how closely the data are grouped about the mean.
The larger the standard deviation, the greater the spread.
It is defined in terms of the deviations of the data from the mean (called residuals).
The sample standard deviation, s, is the square root of the average (sort of) squared residual:
\[ s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]
Residual = x_i - \bar{x}, i.e. observed value - sample mean.
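The formula above can be followed step by step in R and compared with the built-in `sd()`. This is a minimal sketch on a toy data set (1, 3, 5, 7, 9); the intermediate names (`residuals_x`, `s_manual`) are illustrative. The range and inter-quartile range are computed alongside for comparison; `IQR()` uses R's default quantile method, which may give slightly different quartiles from those read off a box plot.

```r
x <- c(1, 3, 5, 7, 9)
n <- length(x)

# Residuals: observed value minus sample mean.
residuals_x <- x - mean(x)

# Sample standard deviation: square root of the sum of squared residuals
# divided by n - 1.
s_manual <- sqrt(sum(residuals_x^2) / (n - 1))

s_manual          # by hand
sd(x)             # built-in, should agree
var(x)            # the variance is the square of the standard deviation

# Other measures of spread.
range_x <- max(x) - min(x)
iqr_x <- IQR(x)
c(range = range_x, IQR = iqr_x)
```

Dividing by n - 1 rather than n is what the slide means by "the average (sort of)".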
Standard deviation (s)
A measure of how much the data are spread around the mean.
[Figure: deviations of points from the mean.]

Data               Mean   sd
all values 5       5      0
1, 3, 5, 7, 9      5      3.16
0, 5, 15, 34, 86   28     34.94

Sample Statistics estimate Population Parameters

Sample statistic    Population parameter
Mean \bar{x}        \mu
Median \tilde{x}    \tilde{\mu}
Std. dev. s         \sigma
Variance s^2        \sigma^2

The variance, \sigma^2, is the square of the standard deviation and is estimated by s^2.

Doing it with R!

The data in Excel
The data we have been using this week are stored in an Excel workbook named Daily MaxMin Temp_1859-2013.xlsx.
The data we will be using are stored in the spreadsheet called 1913; 2013.
Use File…Save as… and save the data in Text (Tab-delimited) (*.txt) format, named Daily MaxMin Temp_1859-2013.txt.

Reading in the Data into R
From the File drop-down menu in R select Change dir… and change the working directory in R to the folder where your data are stored.
The first row of the dataset contains the names of each variable.
Read in the data, type
temp.dat <- read.table("Daily MaxMin Temp_1859-2013.txt", header = TRUE, sep = "\t")
To look at the first 10 rows of data, type
temp.dat[1:10, ]
To edit the data, type
fix(temp.dat)
(Make changes directly on the spreadsheet.)

Renaming variables
You can rename variables programmatically or interactively.
# rename interactively
fix(mydata) # results are saved on close
# rename programmatically
# you can re-enter all the variable names in order,
# changing the ones you need to change.
# the limitation is that you need to enter all of them!
names(mydata) <- c("x1", "age", "y", "ses")

# Recoding a continuous variable into a categorical variable
# Mark days whose maximum temperature is > 34 as "VeryHot", and those with <= 34 as "NotVeryHot":
temp.dat$VHot2013[temp.dat$Max_2013 > 34] <- "VeryHot"
temp.dat$VHot2013[temp.dat$Max_2013 <= 34] <- "NotVeryHot"
# Convert the column to a factor!!!
temp.dat$VHot2013 <- factor(temp.dat$VHot2013)

Graphing data in R

Some Graphics commands

R command   Outcome
plot()      2-D scatterplot
barplot()   Bar graph
hist()      Histogram
lines()     Adds lines to a plot
points()    Adds points to a plot
legend()    Adds a legend to the plot
axis()      Adds an axis to the plot

Setting the Graphing Parameters
The par() function defines the settings for subsequent commands.
Arguments within other graphics functions can also be used.
http://www.statmethods.net/advgraphs/parameters.html
http://research.stowers-institute.org/efg/R/Graphics/Basics/maroma/index.htm?utm_source=twitterfeed&utm_medium=twitter
Example:
par(mfrow=c(1,1), mar=c(3.0,3.0,3.0,3.0), mgp=c(1.1,0.1,0), oma=c(0,2,1.4,0), las=1, tcl=0.2, cex=0.8)

Bar Charts in R
To construct a bar chart of the categorical variable VHot2013, type
counts <- table(temp.dat$VHot2013)
barplot(counts)
## Detail:
# factor levels are in alphabetical order: "NotVeryHot" first, then "VeryHot"
barplot(counts, main="Number of Very Hot Days in 2013", names.arg=c("34C or less", "More than 34C"), xlab="Maximum Temperature", ylab="Counts", col="darkred")
[Figure: bar chart, "Number of Very Hot Days in 2013"; x-axis: Maximum Temperature; y-axis: Counts.]

Presentation of Numerical data
Present numerical summaries of data in neatly organised tables, with column and row headings.
Easy to read!!!
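One possible way to assemble such a table in R is sketched below. The data frame `temps` and its values are made up for illustration (a stand-in for the temperature columns); with the real data the same idea would apply to the relevant columns of the data frame read in earlier. `sapply()` applies the summary function to each column and `t()` puts the variables in rows.

```r
# Hypothetical stand-in for two temperature columns; values invented.
temps <- data.frame(Min_2013 = c(14.2, 15.8, 13.1, 16.4, 15.0),
                    Max_2013 = c(23.5, 25.1, 21.9, 26.3, 24.2))

# One row per variable, one column per summary statistic.
summary_table <- t(sapply(temps, function(v) {
  c(n = length(v), median = median(v), mean = mean(v), std.dev = sd(v))
}))

# Round for presentation.
round(summary_table, 2)
```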
Numerical summaries in R

           n     median   mean    std.dev
Min_1913   365   13.9     13.73   4.35
Max_1913   365   21.3     21.52   5.13
Min_2013   365   14.9     15.03   4.2
Max_2013   365   23.6     23.71   4.36

Tables
# 2-Way Frequency Table
attach(mydata)
mytable <- table(A, B) # A will be rows, B will be columns
mytable # print table
margin.table(mytable, 1) # A frequencies (summed over B)
margin.table(mytable, 2) # B frequencies (summed over A)
prop.table(mytable) # cell proportions
prop.table(mytable, 1) # row proportions
prop.table(mytable, 2) # column proportions
More examples in the Tutorial!

References
Introductory Statistics Lecture Notes, Macquarie University
Susan Imberman: notes on Data Mining vs. Statistics
Wasserman: Chapter 1
R:
http://www.statmethods.net/
http://www.statmethods.net/graphs/
http://addictedtor.free.fr/graphiques/
http://www.rseek.org
http://www.cookbook-r.com/Graphs/Shapes_and_line_types/
http://rprogramming.net/
http://it-ebooks.info/book/537/
http://www.ats.ucla.edu/stat/r/