Chapter 1: Introduction to Statistics

1.1 An Overview of Statistics
1.2 Data Classification
1.3 Experimental Design

1.1 An Overview of Statistics

What is statistics?
• Statistics is the science of data.
• Data are numbers with context.
• It can be broken down into three branches: data analysis, probability, and statistical inference.

A Definition of Statistics
Data
• A collection of facts.
• Consists of information coming from observations, counts, measurements, or responses.
Statistics
• Uses data to gain insight and draw conclusions.
• It is the science of collecting, organizing, analyzing, and interpreting data in order to make decisions.

Data sets
• Population: the collection of all outcomes, responses, measurements, or counts that are of interest.
• Sample: a subset of the population.
Example
• Population: all students taking Statistics classes at NSCC
• Sample: all students in Math109 section 05
• Parameter: a numerical description of a population characteristic.
• Statistic: a numerical description of a sample characteristic.

Branches of Statistics
• Descriptive statistics: involves the organization, analysis, summarization, and display of data.
• Probability theory: the branch of statistics that deals with chance or random phenomena, i.e. it tries to quantify how likely events are to occur.
• Inferential statistics: the branch of statistics that involves using a sample to draw conclusions about a population. A basic tool in the study of inferential statistics is probability.

1.2 Data Classification

How do we classify data?

Types of data
• Qualitative data: data that cannot be measured on a numerical scale. It consists of attributes or labels (like gender or nationality). It can be binary (yes or no) or categorical.
• Quantitative data: data that can be measured or identified on a numerical scale, i.e. it consists of numerical measurements or counts.
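To make the population/sample and parameter/statistic distinction from Section 1.1 concrete, here is a minimal Python sketch. The scores are made-up illustrative data (not real NSCC figures), and the seed, group sizes, and variable names are assumptions chosen only for this example.

```python
import random
import statistics

# Hypothetical population: exam scores for all 200 students (made-up data).
rng = random.Random(109)  # fixed seed so the run is repeatable
population = [rng.randint(40, 100) for _ in range(200)]

# Parameter: a numerical description of the whole population.
population_mean = statistics.mean(population)

# Sample: a subset of the population (here, 25 scores chosen at random).
sample = rng.sample(population, 25)

# Statistic: the same description, computed from the sample only.
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The two printed values will usually be close but not identical: the statistic is computed from only part of the data, so it estimates the parameter rather than reproducing it.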
Levels of measurement
• Nominal: data at this level are qualitative only.
• Ordinal: data at this level are qualitative or quantitative. They can be ranked or ordered, but differences between measurements are not meaningful.
• Interval: data at this level can be ordered, and meaningful differences can be calculated. A zero entry simply marks a position on a scale; it is not an inherent zero **.
• Ratio: data at this level are similar to those at the interval level, with the added property that a zero entry is an inherent zero. A ratio of two data values can be formed, so one data value can be meaningfully expressed as a multiple of another.
** An inherent zero is a zero that implies ‘none’.

1.3 Experimental Design

What is an experimental study?
An experiment deliberately imposes a treatment on a group of objects or subjects in the interest of observing the response. It is wise to take the time and effort to organize the experiment properly, so that the right type of data, and enough of it, is available to answer the questions of interest as clearly and efficiently as possible.

Design of a statistical study
Guidelines for designing a statistical study:
• Identify the variable(s) of interest and the population of the study.
• Design the data collection process. If you use a sample, make sure the sample is representative of the population.
• Collect the data.
• Summarize the data, using descriptive statistics techniques.
• Interpret the data and make decisions about the population using inferential statistics.
• Identify any possible errors.

Data Collection Methods:
• Observational study: you observe ‘what is’. The researcher simply observes behavior in a systematic manner without influencing or interfering with it.
• Perform an experiment: a treatment is applied to part of the population and responses are observed. Another part of the population may be used as a control group, in which no treatment is applied.
The results of the treatment and the control group are studied and compared.
• Simulation: the use of a mathematical or physical model to reproduce the conditions of a situation or process. Simulations allow you to study situations that are impractical or even dangerous to create in real life.
• Survey: an investigation of one or more characteristics of a population.

Experimental Design
An experiment differs from an observational study, which involves collecting and analyzing data without changing existing conditions. Because the validity of an experiment is directly affected by its construction and execution, attention to experimental design is extremely important.
Three key principles of experimental design are:
• Control of the effects of lurking variables on the response, most simply by comparing several treatments.
• Randomization, the use of chance to assign experimental units to treatments.
• Replication of the experiment on many units to reduce chance variation in the results.

Control:
An experiment involves a dependent variable and one or more independent variables. One usually conducts the experiment to see the impact of the latter on the former. Factors other than the independent variable of interest are very likely to affect the results of the experiment. To maintain the integrity of the experiment, it is therefore important to control these influential factors. Some of these factors are:
• Confounding variable: an extraneous variable in an experiment that correlates with both the dependent and the independent variable.
• Placebo effect: occurs when a subject shows a favorable reaction to a placebo, i.e. when he or she is administered a placebo in place of the actual treatment. To control or minimize this effect, a blinding technique is used.
• Single blind: the subject does not know whether he or she is receiving the treatment or a placebo.
• Double blind: neither the researcher nor the subject knows whether the subject is receiving the treatment or a placebo.

Randomization:
• It is the process of randomly assigning experimental units to different treatment groups.
• In a completely randomized design, experimental units or subjects are assigned to different treatment groups through random selection.
• In some cases the experimenter is aware of differences among groups of the experimental units or subjects. In such cases, the subjects/units are first divided into blocks, which are groups with similar characteristics, and are then randomly assigned to treatment groups within each block. This setup is known as a randomized block design.

Replication:
To strengthen the results of an experiment, replication, the repetition of the experiment on a large group of subjects, is required. Replication reduces variability in experimental results, increasing their significance and the confidence level with which a researcher can draw conclusions about an experimental factor.

Sampling techniques

What is a census?
A census is a count or measure of an entire population. Although it provides complete information, it is costly, cumbersome, and time consuming.

What is sampling?
Sampling is the process of selecting units (e.g., people, organizations) from a population of interest so that, by studying the sample, we may fairly generalize our results back to the population from which they were chosen. To collect unbiased data, a researcher must ensure that the sample is representative of the population.

What is a sampling error?
A sampling error is the difference between the results of a sample and those of the population.

• Random sample: one in which every member of the population has an equal chance of being selected.
• Simple random sample: a sample in which every possible sample of the same size has the same chance of being selected.

When you choose the members of a sample, you should decide whether it is acceptable to have the same population member selected more than once:
• If it is acceptable, the sampling process is said to be with replacement.
• If it is not acceptable, the sampling process is said to be without replacement.

• Stratified random sample: formed when the researcher first divides the population into groups that share similar characteristics, called strata, and then selects a simple random sample from each stratum.
• Cluster sample: formed by dividing the population into naturally occurring subgroups, called clusters, and selecting all the members in one or more clusters.
• Systematic sample: one in which members of the population are ordered in some way, a starting number is randomly selected, and then sample members are selected at regular intervals from the starting number.
• Convenience sample: consists only of available members of the population. This type of sample often leads to biased studies.

Homework
Section 1.1: 1-4, 11, 13, 14, 17, 19, 21, 27 (assume U.S.), 29, 30, 32, 33, 36, 40, 41
Section 1.2: 7-10, 15, 16
Section 1.3: 1, 2, 4-10, 15-21, 29, 30, 31, 33, 43 (random, stratified and clustering only)
Read Chapter 2

What are the odds??? :P
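As a recap of the sampling techniques from Section 1.3 (and possibly useful for the homework), the sketch below shows one way each technique could be written in Python. The population of 60 students, the field names `year` and `section`, the helper function names, and the seed are all hypothetical choices made for illustration; they do not come from the textbook.

```python
import random

rng = random.Random(1)  # fixed seed, for repeatability only

# Hypothetical population: 60 students, each tagged with a class year
# (used as the stratum) and a section number (used as the cluster).
population = [
    {"id": i, "year": ["first", "second", "third"][i % 3], "section": i // 10}
    for i in range(60)
]

def simple_random_sample(units, n):
    """Every possible sample of size n is equally likely (without replacement)."""
    return rng.sample(units, n)

def stratified_sample(units, key, n_per_stratum):
    """Split into strata by `key`, then take a simple random sample from each."""
    strata = {}
    for unit in units:
        strata.setdefault(unit[key], []).append(unit)
    picked = []
    for members in strata.values():
        picked.extend(rng.sample(members, n_per_stratum))
    return picked

def cluster_sample(units, key, n_clusters):
    """Pick whole clusters at random and keep every member of each one."""
    clusters = sorted({unit[key] for unit in units})
    chosen = set(rng.sample(clusters, n_clusters))
    return [unit for unit in units if unit[key] in chosen]

def systematic_sample(units, step):
    """Random starting position, then every `step`-th member of the ordered list."""
    start = rng.randrange(step)
    return units[start::step]

srs = simple_random_sample(population, 12)
strat = stratified_sample(population, "year", 4)
clus = cluster_sample(population, "section", 2)
syst = systematic_sample(population, 5)

for name, chosen in [("simple random", srs), ("stratified", strat),
                     ("cluster", clus), ("systematic", syst)]:
    print(name, sorted(unit["id"] for unit in chosen))
```

All four helpers sample without replacement. Sampling with replacement could be sketched with `rng.choices(population, k=12)` instead, which allows the same member to be selected more than once.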