Statistics Definition

Statistics is a collection of methods for planning experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions based on that data. It is the science of understanding data and of making decisions in the face of variability and uncertainty.

Descriptive Statistics uses both numerical and graphical methods to summarize and/or describe the characteristics of a known set of data.
Ex) The average age of this class is 19.6 years.

Inferential Statistics goes beyond description. It uses sample data to make inferences about the larger set of data from which the sample was chosen.
Ex) Infer or conclude that the average age of all StFX students is 19.6 years.

Data are observations or measurements (scores, counts, measurements, names) that you or someone else has recorded, gathered in order to draw conclusions (inferences).
- Data do not have to be numbers.
- Data are more than just a list of numbers or names.
- Data have a story, or a context, that describes the observations.

Raw Data are data that have not been sorted or changed in any way.
Ex) {15.0, 18.5, 18.9, 24.0}
- Data with context have meaning; raw data on their own have none.

Variation is the difference or change in observations or measurements.
Ex) The total monthly sales (in $000) for four randomly selected months last year were {15.0, 18.5, 18.9, 24.0}.
- The average monthly sales are 19.1, or $19,100 (sum / number of observations = 76.4 / 4 = 19.1).
  - We might conclude that the average monthly sales are $19,100.
  - Or that the yearly sales are 12 × 19.1 = 229.2, or $229,200.
- Note how the sales in each sampled month are different.
  - We have variation in the data.
  - How accurate or reliable will our conclusion (inference) be, given the variation in this sample data?
  - To answer this, we need a better understanding of variation.

Reliability is a measure of how good our inference (based on the sample) really is.
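The arithmetic in the monthly sales example can be checked with a short Python sketch. The sales figures come from the example; the two measures of variation shown (the range and the sample standard deviation) are not defined in these notes at this point and are included only as common illustrations. All variable names are assumptions made for this sketch.

```python
import statistics

# Monthly sales in $000 for the four sampled months (values from the example).
monthly_sales = [15.0, 18.5, 18.9, 24.0]

# Average: sum of the observations divided by the number of observations.
mean_sales = sum(monthly_sales) / len(monthly_sales)

# Naive yearly estimate: 12 months at the average monthly level.
yearly_estimate = 12 * mean_sales

# Two simple ways to quantify variation (chosen here for illustration only).
sales_range = max(monthly_sales) - min(monthly_sales)
sample_std_dev = statistics.stdev(monthly_sales)

print(f"mean monthly sales ($000): {mean_sales:.1f}")
print(f"yearly estimate ($000):    {yearly_estimate:.1f}")
print(f"range ($000):              {sales_range:.1f}")
print(f"sample std. dev. ($000):   {sample_std_dev:.2f}")
```

The spread between months is large relative to the mean, which is why a yearly figure extrapolated from only four months should be treated with caution.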
Goals of Statistics
- Enable an investigator to plan research so as to take variability into account (manage or reduce variability).
- Extract the maximum amount of reliable information and quantify any variability in the data.

Population is the complete collection of elements (scores, people, measurements) to be studied.

Sample is a sub-collection of elements drawn from the population.

Experimental unit (or just unit or element) is an object (person, thing, event) upon which we collect data.

Parameter is a numerical measurement describing some characteristic of a population.

Statistic is a numerical measurement describing some characteristic of a sample.

Statistical inference is an estimate or prediction about a population and its parameters, based on information obtained from a sample and its sample statistics.

Variable is a characteristic observed on each unit in the sample that can vary from unit to unit.
Ex) Consider the class as a sample of StFX students. What are some characteristics that can be observed on each student here?
- Hair colour, degree, height, wake-up time, shoe size, number of siblings

Classification of Variables

Categorical (Qualitative) Variables are classified as belonging to groups or categories.
E.g., Hair colour, degree program, breakfast (yes/no)

Numerical (Quantitative) Variables are measured using a numerical scale.
- Discrete variables can take only a finite or countable set of values.
  E.g., Shoe size, number of siblings
- Continuous variables can take any value within an interval.
  E.g., Height, wake-up time

Classification of Variables (Alternative): Levels of Measurement
- Nominal: data consist of names, labels, or categories, with no ordering scheme.
- Ordinal: data can be arranged in some order, but differences between values cannot be determined or are meaningless.
- Interval: data can be arranged in some order, with meaningful differences between values. There is no natural starting point (zero).
- Ratio: same as interval, but with a natural starting point (zero).
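The difference between a parameter and a statistic can be illustrated with a minimal Python sketch. The ages below are made up purely for illustration and do not come from the notes; the variable names and the fixed seed are likewise assumptions of this sketch.

```python
import random
import statistics

# Hypothetical population: ages of every student in a small program.
# These values are invented for illustration only.
population_ages = [18, 19, 19, 20, 20, 20, 21, 21, 22, 23, 19, 18]

# Parameter: a numerical measurement describing the whole population.
population_mean = statistics.mean(population_ages)

# Sample: a sub-collection of elements drawn from the population.
rng = random.Random(42)  # fixed seed so the sketch is repeatable
sample_ages = rng.sample(population_ages, k=5)

# Statistic: the same measurement, computed from the sample only.
sample_mean = statistics.mean(sample_ages)

print("population mean (parameter):", population_mean)
print("sample:", sample_ages)
print("sample mean (statistic):", sample_mean)
```

In practice the population mean is usually unknown, and the sample mean serves as the estimate of it; that estimate is the statistical inference.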
Observational Study: observations and measurements are made on subjects in their natural setting, without modifying the subjects studied.

Designed Experiment: we apply some treatment to the subjects (modify the subjects) and observe its effects on them.
- Usually there are two or more groups to be studied: one is the control group and the others are treatment groups. The treatment is applied only to the treatment groups.

Survey: asks people questions and records their responses.
- May be done by phone, by mail, or face to face.

Census: surveys every member of the population.
- The results are very reliable, since the sample is the same as the population.

Published Source: one uses results that have already been collected and published.

Selection Bias: occurs when part of the population is excluded from ever being in the sample. It may be intentional or unintentional.

Non-Response Bias: occurs when data cannot be obtained from every unit in the sample.
Ex) Not everyone responds to a survey, even though they were selected as part of the sample.

Representative Sample: has characteristics that are essentially the same as those of the population from which it was drawn.

Random Sampling: members of the population are selected in such a way that each member has an equal chance of being selected. (Appendix A)

Confounding Variable: a variable that offers an alternative explanation for the differences between the groups, one that cannot be ruled out.

Placebo: a neutral treatment that has no "real" effect on the dependent variable.

Unstacked Data: data values are stored in two columns. Each column holds a variable from a different group, so this format can only store data for two variables.

Stacked Data: data values are stored in a spreadsheet format, where each row contains the data for a single individual. This format can store many variables, and most statistical software uses it.
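The two layouts described above can be sketched in plain Python. The group names and measurement values are hypothetical, and the helper `stack` and its column names are assumptions made for this illustration rather than part of any particular statistical package.

```python
# Hypothetical unstacked data: one column per group (values are made up).
unstacked = {
    "control": [5.1, 4.8, 5.5],
    "treatment": [6.2, 5.9, 6.4],
}


def stack(columns, group_name="group", value_name="value"):
    """Convert unstacked columns into stacked rows, one row per individual."""
    rows = []
    for group, values in columns.items():
        for value in values:
            rows.append({group_name: group, value_name: value})
    return rows


stacked = stack(unstacked)

# Each stacked row describes a single individual: its group and its value.
for row in stacked:
    print(row)
```

Once the data are stacked, further variables (for example age or sex) can be added as extra keys on each row, which is not possible in the two-column layout.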