BES 220 – Theme 1
Introduction to Data

Terms and Definitions

Summary Statistic – A single number summarizing a large amount of data (e.g. a proportion).
Associated/Dependent Variables – Variables for which the values of one relate in some way to the values of the other.
Census – A study that aims to observe every member of a population.
Exploratory Analysis – An approach to data analysis that emphasizes informal graphical procedures not based on prior assumptions about the structure of the data or on formal models for the data.
Inference – The process of drawing conclusions about a population on the basis of measurements or observations made on a sample of units from the population.
Observational Study – A study in which the objective is to uncover cause-and-effect relationships but in which controlled experimentation is not feasible, in the sense that the researcher cannot impose the procedures or treatments whose effects are of interest, or assign subjects at random to different procedures.
Placebo – A treatment designed to appear exactly like a comparison treatment, but which is devoid of the active component.
Blinding – A procedure used in clinical trials to avoid the possible bias that might be introduced if the patient and/or doctor knew which treatment the patient is receiving. If neither the patient nor the doctor is aware of which treatment has been given, the trial is termed double-blind. If only one of the patient or doctor is aware, the trial is called single-blind.
Treatment Group – Patients who receive the experimental medical treatment and receive medical management after the procedure.
Control Group – Patients who do NOT receive the experimental medical treatment, but who receive medical management after the procedure.

WhatsApp 071 385 7167 for more Mechanical Engineering notes

LO 1: Types of Variables

Example: Students in an introductory statistics course were asked the following questions as part of a class survey:
1. What is your gender, male or female?
2. Are you introverted or extraverted?
3. On average, how much sleep do you get per night?
4. What is your bedtime: 8pm-10pm, 10pm-12am, 12am-2am, later than 2am?
5. How many countries have you visited?
6. On a scale of 1 (very little) – 5 (a lot), how much do you dread this semester?

The data matrix below shows the results. Columns represent variables and rows represent cases.

Student   Gender   Intro/Extra   Sleep   Bedtime   Countries   Dread
1         Male     Extravert     9       10-12     18          3
2         Female   Introvert     8       8-10      7           5
3         Female   Extravert     7       12-2      2           2
…         …        …             …       …         …           …

Variable types:
• Gender – Categorical (Regular)
• Intro/Extra – Categorical (Regular)
• Sleep – Numerical (Continuous)
• Bedtime – Categorical (Ordinal)
• Countries – Numerical (Discrete)
• Dread – Categorical (Ordinal)

LO 2: Associated Variables
• Associated variables – Variables that show some relationship with one another. Also known as dependent variables.
• Positive association – As one variable increases, the other tends to increase.
• Negative association – As one variable increases, the other tends to decrease.

LO 3: Independent Variables
• Independent Variables – Variables that are not associated (no evident relationship) are known as independent.

LO 4: Explanatory and Response Variables
• Explanatory Variable – The variable within a pair of variables which is suspected of affecting the other.
• Response Variable – The variable which is suspected of being affected by the explanatory variable.
• e.g. explanatory variable = poverty -> response variable = federal spending
Note: Labelling variables as explanatory and response does not guarantee that the relationship between the two is actually causal, even if an association is identified between the two variables.
• Confounding Variable – A variable (which cannot always be measured or examined) that is correlated with both the explanatory and response variables. In observational studies, causal conclusions can only be attempted after exhausting the search for confounding variables.
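The positive and negative associations described under LO 2 can be checked numerically. The following is a minimal sketch, not a prescribed method: it uses the Pearson correlation coefficient as one possible measure of linear association, the data values are hypothetical, and the cut-off of 0.3 is an arbitrary assumption chosen only for illustration. A coefficient near zero rules out only a linear relationship, so it does not by itself prove that two variables are independent.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def describe_association(xs, ys, threshold=0.3):
    """Label the direction of a linear association.

    The threshold is an illustrative assumption, not a standard value.
    """
    r = pearson_r(xs, ys)
    if r >= threshold:
        return "positive"
    if r <= -threshold:
        return "negative"
    return "no clear linear association"


# Hypothetical data for five students (not taken from the class survey).
hours_studied = [1, 2, 3, 4, 5]
test_score = [52, 60, 61, 70, 78]      # rises with hours studied
hours_gaming = [5, 4, 3, 2, 1]         # falls as hours studied rises
shoe_size = [3, 1, 4, 1, 3]            # unrelated to hours studied

print(describe_association(hours_studied, test_score))
print(describe_association(hours_studied, hours_gaming))
print(describe_association(hours_studied, shoe_size))
```

Even when the first call reports a positive association, the note under LO 4 still applies: the label says nothing about whether studying causes the higher score.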
LO 5: Classification of a Study
• Observational Study – Provides evidence of naturally occurring associations between variables, but cannot by itself show a causal connection (data are collected without directly interfering with how the data arise).
   o Prospective Study – Identifies individuals and collects information as events unfold.
   o Retrospective Study – Collects data after events have taken place.
• Experimental Study – A study which tries to establish a causal connection. There is generally an explanatory and a response variable within the experiment, used to test a hypothesis.
   o Randomized Experiment – A study which uses randomized assignment, which is crucial for drawing causal connections between variables.
Note: Experimental studies allow causal conclusions to be made, but observational studies are only sufficient to show associations.

LO 5: Random Sampling vs Random Assignment
Random Sampling – Occurs when subjects are being selected for a study.
• If subjects are selected randomly from the population, then each subject in the population is equally likely to be selected, and the resulting sample is likely to be representative of the population.
• The study’s results are generalizable to the population at large.
Random Assignment – Occurs only in experimental settings, where subjects are being assigned to various treatments.
• If subjects are assigned randomly to treatments, then any observed effect can be attributed to the treatment, and hence we can make causal conclusions based on the study.
• It helps ensure that the only systematic difference between the various treatment groups is what you are studying.

LO 5: Correlation vs Causation
Causation – One variable causes something to happen to another variable.
• A causes B.
• To infer causation, a true experiment must be performed where subjects are randomly assigned to different conditions.
• In data analysis, association does not imply causation, and causation can only be inferred from a randomized experiment.
Correlation – Variables share some kind of relationship.
• A and B seem to be happening at the same time.
• An association can be reported, but a causal conclusion cannot be drawn from the data alone.

LO 6: Sources of Bias
Anecdotal Evidence – Data which represent only one or two cases, where it is unclear whether these cases are representative of the population (typically composed of unusual cases that are remembered due to their striking characteristics).
Sampling Bias:
• Non-Response – Not everyone selected to answer, for example, a survey actually answers it, so it is uncertain whether the results represent the entire population.
• Voluntary Response
• Convenience Sample – For example, you cannot gather information only from people living in Centurion if you want the data to represent all of Pretoria.

LO 7: Sampling Schemes
A population is the entire group affected by a scenario; the sample is a selected portion of the population which will be studied to draw conclusions.
• Simple Random Sampling – Each subject in the population is equally likely to be selected (like a raffle).
• Stratified Sampling – First divide the population into homogeneous strata (subjects within each stratum are similar, subjects across strata are different), then randomly sample from within each stratum (divide and conquer).
   o Advantage – useful when cases in each stratum are similar with respect to the outcome of interest.
   o Disadvantage – analysing data from this method is more complicated.
• Cluster Sampling – First divide the population into clusters (subjects within each cluster are non-homogeneous, but clusters are similar to each other), then randomly sample a few clusters, and then sample all cases within those clusters.
   o Advantage – can be a more economical technique; helpful when there is a lot of case-to-case variability within a cluster, but the clusters themselves do not look very different from one another.
   o Disadvantage – more advanced analysis techniques are typically required.
• Multistage Sampling – First divide the population into clusters, then randomly sample a few clusters, and then randomly sample from within each cluster.

LO 8: 4 Principles of Randomized Experimental Design
1. Controlling
• Researchers assign treatments to cases and do their best to control any differences within a group.
2. Randomization
• Researchers randomize patients into treatment groups to account for variables that cannot be controlled. This helps prevent accidental bias from entering a study.
3. Replication
• The more cases researchers observe, the more accurately they can estimate the effect of the explanatory variable on the response.
4. Blocking
• Researchers sometimes know or suspect that variables other than the treatment influence the response. Under these circumstances, they may first group individuals into blocks based on this variable and then randomize cases within each block to treatment groups.

LO 9: Identify Single/Double-Blinding in a Study
Blinding – A procedure used in clinical trials to avoid the possible bias that might be introduced if the patient and/or doctor knew which treatment the patient is receiving. If neither the patient nor the doctor is aware of which treatment has been given, the trial is termed double-blind. If only one of the patient or doctor is aware, the trial is called single-blind.

Quick Test
1. Describe when a study’s results can be generalized to the population at large and when causation can be inferred.
2. Explain why random sampling allows for generalizability of results.
3. Explain why random assignment allows for making causal conclusions.
4. Explain how blinding can help eliminate the placebo effect and other biases.
5. Understand random assignment vs random sampling.
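The difference between random sampling (who gets into the study) and random assignment (which treatment each subject gets), together with the sampling schemes of LO 7, can be sketched in a few lines. This is an illustrative sketch under stated assumptions only: the population of 100 numbered students, the two year-group strata, the ten residence clusters, the function names and the seed are all hypothetical.

```python
import random


def simple_random_sample(population, k, rng):
    """Each subject is equally likely to be selected (like a raffle)."""
    return rng.sample(population, k)


def stratified_sample(strata, k_per_stratum, rng):
    """Randomly sample k subjects from within every stratum."""
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, k_per_stratum))
    return sample


def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick a few clusters, then keep ALL cases in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [subject for name in chosen for subject in clusters[name]]


def multistage_sample(clusters, n_clusters, k_per_cluster, rng):
    """Randomly pick a few clusters, then randomly sample within each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        sample.extend(rng.sample(clusters[name], k_per_cluster))
    return sample


def random_assignment(subjects, rng):
    """Shuffle the subjects and split them into treatment and control."""
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical population: students numbered 1 to 100.
rng = random.Random(220)  # fixed seed so the sketch is repeatable
population = list(range(1, 101))
strata = {"first_year": population[:50], "second_year": population[50:]}
clusters = {f"residence_{i}": population[i * 10:(i + 1) * 10] for i in range(10)}

srs = simple_random_sample(population, 20, rng)
strat = stratified_sample(strata, 10, rng)
clust = cluster_sample(clusters, 2, rng)
multi = multistage_sample(clusters, 3, 4, rng)

# Random assignment happens AFTER sampling, on the subjects already chosen.
treatment, control = random_assignment(srs, rng)
print(len(srs), len(strat), len(clust), len(multi))
print(len(treatment), len(control))
```

Random sampling supports generalizing to the population, while random assignment supports causal conclusions; a study can have one, both, or neither.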
LO 10: Scatterplots
Use scatterplots to describe the relationship between two numerical variables, making sure to note the direction (positive/negative), form (linear/non-linear) and strength of the relationship, as well as any unusual observations that stand out.

LO 11: Description of Numerical Variable Distribution
Mention its shape, centre and spread, as well as any unusual observations.

LO 12: Commonly Used Measures of Centre and Spread
Centre:
1. Mean (arithmetic average)
2. Median (midpoint)
3. Mode (most frequent observation)
Spread:
1. Standard deviation (variability around the mean)
2. Range (max – min)
3. Interquartile range (middle 50% of the distribution)

LO 13: Distribution Shapes
• Symmetric
• Right Skewed
• Left Skewed
• Unimodal
• Bimodal
• Multimodal
• Uniform

LO 14: Visualization of Numerical Distributions
Use histograms and boxplots to visualise the shape, centre and spread of numerical distributions.
Use intensity maps to visualise the spatial distribution of the data.

LO 15: Robust Statistic
Robust Statistic (e.g. median, IQR) – A measure that is not heavily affected by skewness and extreme outliers. Determine when robust statistics are more appropriate measures of centre and spread than other similar statistics.

LO 16: Transformations
Recognise when transformations can make the distribution of data more symmetric, and hence easier to model.
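The measures in LO 12, the robustness claim in LO 15 and the effect of a transformation in LO 16 can be illustrated with Python's standard `statistics` module. This is a sketch with hypothetical data, not a prescribed procedure. Two assumptions to note: `statistics.stdev` computes the sample standard deviation, and `statistics.quantiles` uses one of several quartile conventions, so an IQR computed by hand or by other software may differ slightly.

```python
import math
import statistics


def iqr(data):
    """Interquartile range: spread of the middle 50% of the data."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return q3 - q1


def summarize(data):
    """Measures of centre and spread from LO 12."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "stdev": statistics.stdev(data),   # sample standard deviation
        "range": max(data) - min(data),
        "iqr": iqr(data),
    }


# Hypothetical data: the second list adds one extreme outlier.
clean = [2, 3, 4, 5, 6, 7, 8]
with_outlier = clean + [100]

before = summarize(clean)
after = summarize(with_outlier)
for name in before:
    # Mean, stdev and range move a lot; median and IQR barely move.
    print(name, before[name], after[name])

# LO 16: a log transformation of right-skewed data (hypothetical values).
skewed = [1, 2, 4, 8, 16, 32, 64]
logged = [math.log2(x) for x in skewed]
print(statistics.mean(skewed), statistics.median(skewed))   # far apart
print(statistics.mean(logged), statistics.median(logged))   # close together
```

A large gap between mean and median is one informal sign of skewness; after the log transformation the two agree for this data set, which is what "more symmetric" means in practice.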
LO 17: Description of One Categorical Variable Distribution
Use frequency tables and bar plots to describe the distribution of one categorical variable.

LO 18: Different Modality of Distributions
Give the picture

LO 19: Assessment of the Relationship between 2 Categorical Variables
Use contingency tables and segmented bar plots or mosaic plots to assess the relationship between two categorical variables.

LO 20: Recognise when

Exercise 1.1
Exercise 1.6
Exercise 1.9
Exercise 1.27
Exercise 1.30
Exercise 1.35
Exercise 1.40