Uploaded by Jacobus Greeff

Intro to Data: Definitions, Variables, & Study Design

BES 220 – Theme 1
Introduction to Data
Terms and Definitions
Summary Statistics – A single number summarizing a large amount of data (e.g. proportions).
Associated/Dependent Variables – The values of one variable relate in some way to the values of the other.
Census – A study that aims to observe every member of a population.
Exploratory Analysis – An approach to data analysis that emphasizes the use of informal graphical procedures not
based on prior assumptions about the structure of the data or on formal models for the data.
Inference – The process of drawing conclusions about a population on the basis of measurements or observations
made on a sample of units from the population.
Observational Study – A study in which the objective is to uncover cause-and-effect relationships but in which it is not
feasible to use controlled experimentation, in the sense of being able to impose the procedure or treatments whose
effects it is desired to discover, or to assign subjects at random to different procedures.
Placebo – A treatment designed to appear exactly like a comparison treatment, but which is devoid of the active
component.
Blinding – A procedure used in clinical trials to avoid the possible bias that might be introduced if the patient and/or
doctor knew which treatment the patient is receiving. If neither the patient nor the doctor is aware of which treatment
has been given, the trial is termed double-blind. If only one of the patient or doctor is aware, the trial is called single-blind.
Treatment Group – Patients receive the experimental medical treatment and receive medical management after
the procedure.
Control Group – Patients do NOT receive the experimental medical treatment, but they receive medical
management after the procedure.
LO 1: Types of Variables
WhatsApp 071 385 7167 for more Mechanical Engineering notes
Example:
Students in an introductory statistics course were asked the following questions as part of a class survey:
1. What is your gender, male or female?
2. Are you introverted or extraverted?
3. On average, how much sleep do you get per night?
4. What is your bedtime: 8pm-10pm, 10pm-12am, 12am-2am, later than 2am?
5. How many countries have you visited?
6. On a scale of 1 (very little) – 5 (a lot), how much do you dread this semester?
The data matrix below shows the results. Columns represent variables and rows represent cases.

Student        Gender       Intro/Extra  Sleep         Bedtime      Countries   Dread
1              Male         Extravert    9             10-12        18          3
2              Female       Introvert    8             8-10         7           5
3              Female       Extravert    7             12-2         2           2
…              …            …            …             …            …           …
Variable Type  Categorical  Categorical  Numerical     Categorical  Numerical   Categorical
               (Regular)    (Regular)    (Continuous)  (Ordinal)    (Discrete)  (Ordinal)
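As a rough illustration, the data matrix above could be held in code as a list of cases (rows), with a separate lookup for the variable types. This is only a sketch: the names `data_matrix`, `variable_types` and `numerical_variables` are made up for this example, and only the three rows shown in the table are included.

```python
# Each dictionary is one case (row); each key is one variable (column).
# Only the three students shown in the notes are encoded here.
data_matrix = [
    {"gender": "Male", "intro_extra": "Extravert", "sleep": 9,
     "bedtime": "10-12", "countries": 18, "dread": 3},
    {"gender": "Female", "intro_extra": "Introvert", "sleep": 8,
     "bedtime": "8-10", "countries": 7, "dread": 5},
    {"gender": "Female", "intro_extra": "Extravert", "sleep": 7,
     "bedtime": "12-2", "countries": 2, "dread": 2},
]

# Variable types exactly as classified in the table above.
variable_types = {
    "gender": "categorical (regular)",
    "intro_extra": "categorical (regular)",
    "sleep": "numerical (continuous)",
    "bedtime": "categorical (ordinal)",
    "countries": "numerical (discrete)",
    "dread": "categorical (ordinal)",
}


def numerical_variables(types):
    """Return the names of the variables classified as numerical."""
    return [name for name, kind in types.items() if kind.startswith("numerical")]


print(numerical_variables(variable_types))  # ['sleep', 'countries']
```

Note that Dread is stored as a number but is still treated as an ordinal categorical variable, because the values 1 to 5 are labels for ordered levels rather than measured quantities.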
LO 2: Associated Variables
• Associated variables – Variables that show some relationship with one another. Also known as dependent
  variables.
• Positive association – As one variable increases, the other tends to increase as well.
• Negative association – As one variable increases, the other tends to decrease.
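One common way to check the direction of an association between two numerical variables is the sign of the Pearson correlation coefficient. The sketch below is an illustration only: the three small data lists are invented for this example (they do not come from the course), and Pearson's r only captures linear association, so a value near zero does not by itself show that two variables are independent.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def direction(r):
    """Label the sign of a correlation coefficient."""
    if r > 0:
        return "positive"
    if r < 0:
        return "negative"
    return "none (linear)"


# Invented illustration data, not real measurements.
hours_studied = [1, 2, 3, 4, 5]
test_mark = [52, 58, 61, 70, 74]   # rises as hours_studied rises
hours_gaming = [9, 7, 6, 4, 2]     # falls as hours_studied rises

print(direction(pearson_r(hours_studied, test_mark)))     # positive
print(direction(pearson_r(hours_studied, hours_gaming)))  # negative
```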
LO 3: Independent Variables
• Independent Variables – Variables that are not associated (no evident relationship) are said to be independent.
LO 4: Explanatory and Response Variables
• Explanatory Variable – The variable within a pair of variables which is suspected of affecting the other.
• Response Variable – The variable which is suspected of being affected by the explanatory variable.
• e.g. explanatory variable = poverty -> response variable = federal spending
Note: labelling variables as explanatory and response does not guarantee that the relationship between the two is actually causal, even if
there is an association identified between the two variables.
• Confounding Variable – A variable (which cannot always be measured or examined) that is correlated with both the
  explanatory and response variables. In observational studies, causal conclusions can only be attempted after
  exhausting the search for confounding variables.
LO 5: Classification of a Study
• Observational Study – Studies that provide evidence of naturally occurring associations between variables, but
  cannot by themselves show a causal connection (data are collected without directly interfering with how the
  data arise).
  o Prospective Study – Identifies individuals and collects information as events unfold.
  o Retrospective Study – Collects data after events have taken place.
• Experimental Study – Studies which try to establish a causal connection. There is generally an explanatory and
  response variable within the experiment, to test a hypothesis.
  o Randomized Experiment – Studies which use randomized assignment, which is crucial for drawing
    causal connections between variables.
Note: Experimental studies allow for causal conclusions to be made, but observational studies are only sufficient to show associations.
LO 5: Random Sampling vs Random Assignment
Random Sampling – Occurs when subjects are being selected for a study.
• If subjects are selected randomly from the population, then each subject in the population is equally likely to
  be selected, and the resulting sample is likely representative of the population.
• The study’s results are then generalizable to the population at large.
Random Assignment – Occurs only in experimental settings, where subjects are being assigned to various
treatments.
• If subjects are assigned randomly to treatments, then any observed effect can be attributed to the
  treatment, and hence we can make causal conclusions based on the study.
• Random assignment helps ensure that the only systematic difference between the various treatment groups is
  what you are studying.
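The two ideas can be kept apart in code: sampling decides who is in the study, assignment decides which treatment each selected subject gets. The sketch below uses Python's standard `random` module; the function names, the population of 100 numbered subjects and the seeds are all assumptions made for illustration.

```python
import random


def random_sample(population, n, seed=None):
    """Random sampling: choose n subjects, each equally likely to be picked."""
    rng = random.Random(seed)
    return rng.sample(population, n)


def random_assignment(subjects, groups=("treatment", "control"), seed=None):
    """Random assignment: shuffle the subjects, then deal them out to groups."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    assignment = {group: [] for group in groups}
    for i, subject in enumerate(shuffled):
        assignment[groups[i % len(groups)]].append(subject)
    return assignment


# Hypothetical population of 100 subjects identified by number.
population = list(range(1, 101))

sample = random_sample(population, 20, seed=1)   # supports generalizability
groups = random_assignment(sample, seed=2)       # supports causal conclusions

print(len(groups["treatment"]), len(groups["control"]))  # 10 10
```

A study can use either step without the other: random sampling alone gives generalizable but not causal results, while random assignment alone gives causal results that may not generalize.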
LO 5: Correlation vs Causation
Causation – One variable causes something to happen to another variable.
• A causes B.
• In order to infer causation, a true experiment must be performed where subjects are randomly assigned to
  different conditions.
• In data analysis, association does not imply causation, and causation can only be inferred from a randomized
  experiment.
Correlated – Variables share some kind of relationship.
• A and B seem to be happening at the same time.
• An association can be described, but a causal link cannot be inferred from the data alone.
LO 6: Sources of Bias
Anecdotal Evidence – Data based on only one or two cases, where it is unclear whether these cases are representative of the
population (typically composed of unusual cases that are remembered due to their striking characteristics).
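Sampling Bias:
• Non-Response – Not everyone selected for a study (for example, a survey) responds, so it is uncertain whether
  the results represent the entire population.
• Voluntary Response – The sample consists of people who choose to respond, typically because they hold strong
  opinions on the issue.
• Convenience Sample – Individuals who are easily accessible are more likely to be included. For example, you
  cannot gather information only from people living in Centurion if you want the data to represent the whole
  of Pretoria.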
LO 7: Sampling Schemes
A population represents the entire group affected by a scenario; the sample is a selected portion of the
population which will be studied to draw conclusions.
• Simple Random Sampling – Each subject in the population is equally likely to be selected (like a raffle).
• Stratified Sampling – First divide the population into homogeneous strata (subjects within each stratum are
  similar, across strata are different), then randomly sample from within each stratum (divide and conquer).
  o Advantage – useful when cases in each stratum are similar with respect to the outcome of interest.
  o Disadvantage – analysing data from this method is more complicated.
• Cluster Sampling – First divide the population into clusters (subjects within each cluster are non-homogeneous,
  but clusters are similar to each other), then randomly sample a few clusters, and then sample all cases within
  those clusters.
  o Advantage – can be a more economical technique; helpful when there is a lot of case-to-case variability
    within a cluster, but clusters themselves don’t look very different from one another.
  o Disadvantage – more advanced analysis techniques are typically required.
• Multistage Sampling – First divide the population into clusters, then randomly sample a few clusters, and then
  randomly sample from within each cluster.
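The four schemes differ only in where the random draw happens. The sketch below shows one possible way to write each of them with the standard `random` module. The population (five groups of ten labelled subjects), the function names and the sample sizes are all hypothetical; the same grouping is reused as "strata" and as "clusters" purely to keep the example short.

```python
import random


def simple_random_sample(population, n, rng):
    """Every subject is equally likely to be selected."""
    return rng.sample(population, n)


def stratified_sample(strata, n_per_stratum, rng):
    """Randomly sample from within EVERY stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}


def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick a few clusters, then take ALL cases in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}


def multistage_sample(clusters, n_clusters, n_per_cluster, rng):
    """Randomly pick a few clusters, then randomly sample WITHIN each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in chosen}


# Hypothetical population: 5 groups of 10 subjects each.
groups = {f"group_{i}": [f"g{i}_s{j}" for j in range(10)] for i in range(5)}
everyone = [subject for members in groups.values() for subject in members]

rng = random.Random(42)  # fixed seed only so the sketch is repeatable
srs = simple_random_sample(everyone, 5, rng)
strat = stratified_sample(groups, 2, rng)    # 2 subjects from each of 5 strata
clus = cluster_sample(groups, 2, rng)        # all 10 subjects from 2 clusters
multi = multistage_sample(groups, 2, 3, rng)  # 3 subjects from each of 2 clusters
```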
LO 8: 4 Principles of Randomized Experimental Design
1. Controlling
• Researchers assign treatments to cases and do their best to control any differences within a group.
2. Randomization
• Researchers randomize patients into treatment groups to account for variables that cannot be
controlled. This helps prevent accidental bias from entering a study.
3. Replication
• The more cases researchers observe, the more accurately they can estimate the effect of the
explanatory variable on the response.
4. Blocking
• Researchers sometimes know or suspect that variables, other than the treatment, influence the
response. Under these circumstances, they may first group individuals based on this variable and then
randomize cases within each block to treatment groups.
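Blocking followed by randomization can be sketched as follows: group the subjects by the blocking variable, then randomize separately inside each block so that every block contributes to both groups. The subject list, the "risk" blocking variable and the function name are assumptions made for this illustration.

```python
import random
from collections import defaultdict


def blocked_assignment(subjects, block_of, rng):
    """Randomize to treatment/control separately within each block."""
    blocks = defaultdict(list)
    for subject in subjects:
        blocks[block_of(subject)].append(subject)

    assignment = {}
    for label in sorted(blocks):
        members = blocks[label]
        rng.shuffle(members)          # randomization happens inside the block
        half = len(members) // 2
        for subject in members[:half]:
            assignment[subject] = "treatment"
        for subject in members[half:]:
            assignment[subject] = "control"
    return assignment


# Hypothetical subjects: (identifier, risk level); risk is the blocking variable.
subjects = [("p1", "low"), ("p2", "low"), ("p3", "low"), ("p4", "low"),
            ("p5", "high"), ("p6", "high"), ("p7", "high"), ("p8", "high")]

result = blocked_assignment(subjects, block_of=lambda s: s[1],
                            rng=random.Random(7))

for risk in ("low", "high"):
    treated = [s for s, g in result.items() if s[1] == risk and g == "treatment"]
    print(risk, len(treated))  # each block has 2 treated subjects
```

Because each block is split evenly, the treatment and control groups end up balanced on the blocking variable, which plain randomization does not guarantee in a small study.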
LO 9: Identify Single/Double-Blinding in a Study
Blinding – A procedure used in clinical trials to avoid the possible bias that might be introduced if the patient and/or
doctor knew which treatment the patient is receiving. If neither the patient nor the doctor is aware of which treatment
has been given, the trial is termed double-blind. If only one of the patient or doctor is aware, the trial is called single-blind.
Quick Test
1. Describe when a study’s results can be generalized to the population at large and when causation can be
inferred.
2. Explain why random sampling allows for generalizability of results.
3. Explain why random assignment allows for making causal conclusions.
4. Explain how blinding can help eliminate the placebo effect and other biases.
5. Understand random assignment vs random sampling.
LO 10: Scatterplots
Use scatterplots for describing the relationship between two numerical variables making sure to note the direction
(positive/negative), form (linear/non-linear) and the strength of the relationship as well as any unusual observations
that stand out.
LO 11: Description of Numerical Variable Distribution
Mention its shape, centre and spread as well as any unusual observations.
LO 12: Commonly Used Measures of Centre and Spread
Centre:
1. Mean (arithmetic average)
2. Median (midpoint)
3. Mode (most frequent observation)
Spread:
1. Standard deviation (variability around the mean)
2. Range (max – min)
3. Interquartile range (middle 50% of the distribution)
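All six measures can be computed with Python's standard `statistics` module. The sketch below is only an illustration: the list of sleep hours is invented, `summarize` is a made-up helper name, and the quartiles use the module's "inclusive" method, which is one of several conventions and may differ slightly from the one used in the course textbook.

```python
import statistics


def summarize(values):
    """Return common measures of centre and spread for a list of numbers."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),   # sample standard deviation (n - 1)
        "range": max(values) - min(values),
        "iqr": q3 - q1,
    }


# Invented illustration data: hours of sleep for five students.
sleep_hours = [6, 7, 7, 8, 9]
summary = summarize(sleep_hours)

for name, value in summary.items():
    print(f"{name:<7}{value:.2f}")
```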
LO 13: Distribution Shapes
Symmetric:
Right Skewed
Left Skewed
Unimodal
Bimodal
Multimodal
Uniform
LO 14: Visualization of Numerical Distributions
Use histograms and boxplots to visualise the shape, centre and spread of numerical distributions
Use intensity maps for visualizing the spatial distribution of the data
LO 15: Robust Statistic
Robust Statistic (e.g. median, IQR) – Measures that are not heavily affected by skewness and extreme outliers.
Determine when they are more appropriate measures of centre and spread than other similar statistics.
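A quick way to see robustness is to replace one value with an extreme outlier and compare how much the mean and the median move. The two small data lists below are invented for this illustration.

```python
import statistics

# Invented illustration data: the second list replaces 6 with the outlier 60.
original = [2, 3, 4, 5, 6]
with_outlier = [2, 3, 4, 5, 60]

mean_shift = statistics.mean(with_outlier) - statistics.mean(original)
median_shift = statistics.median(with_outlier) - statistics.median(original)

print("shift in mean:  ", mean_shift)    # the mean is pulled towards the outlier
print("shift in median:", median_shift)  # the median does not move
```

Here the mean moves from 4 to 14.8 while the median stays at 4, which is why the median (and the IQR) are preferred for skewed data or data with extreme outliers.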
LO 16: Transformations
Recognise when transformations can make the distribution of data more symmetric, and hence easier to model.
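A log transformation is a common choice for strongly right-skewed, positive data. The sketch below measures skewness before and after taking base-10 logarithms. The data values are invented (chosen as powers of ten so the effect is obvious), and `skewness` is a simple moment-based coefficient written for this example rather than a course formula.

```python
import math


def skewness(values):
    """Moment-based skewness: about 0 if symmetric, > 0 if right skewed."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / sd) ** 3 for v in values) / n


# Invented illustration data with a long right tail.
raw = [1, 10, 100, 1000, 10000]
logged = [math.log10(v) for v in raw]   # becomes 0, 1, 2, 3, 4

print(f"skewness before: {skewness(raw):.2f}")     # clearly positive
print(f"skewness after:  {skewness(logged):.2f}")  # essentially zero
```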
LO 17: Description of One Categorical Variable Distribution
Use frequency tables and bar plots to describe the distribution of one categorical variable.
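A frequency table is just a count of each category, optionally with proportions. The sketch below builds one for the Gender column of the three students shown in the LO 1 data matrix; `frequency_table` is a helper name made up for this example.

```python
from collections import Counter


def frequency_table(values):
    """Map each category to (count, proportion of all cases)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: (count, count / total)
            for category, count in counts.items()}


# Gender column for the three students listed in the LO 1 data matrix.
gender = ["Male", "Female", "Female"]
table = frequency_table(gender)

for category, (count, proportion) in table.items():
    print(f"{category:<8}{count:>3}{proportion:>8.2f}")
```

The counts are the heights of the bars in the corresponding bar plot.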
LO 18: Different Modality of Distributions
Give the picture
LO 19: Assessment of the Relationship between 2 Categorical Variables
Use contingency tables and segmented bar plots or mosaic plots to assess the relationship between two categorical
variables.
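A contingency table counts the cases for every combination of two categorical variables, and row proportions are what a segmented bar plot displays. The sketch below uses the Gender and Intro/Extra columns of the three students in the LO 1 data matrix; the function names are made up for this example, and three cases are far too few to judge a real relationship.

```python
from collections import Counter


def contingency_table(rows, cols):
    """Count cases for each (row level, column level) combination."""
    counts = Counter(zip(rows, cols))
    row_levels = sorted(set(rows))
    col_levels = sorted(set(cols))
    return {r: {c: counts[(r, c)] for c in col_levels} for r in row_levels}


def row_proportions(table):
    """Convert each row of counts to proportions of that row's total."""
    proportions = {}
    for row, cells in table.items():
        total = sum(cells.values())
        proportions[row] = {col: n / total for col, n in cells.items()}
    return proportions


# Columns taken from the three students in the LO 1 data matrix.
gender = ["Male", "Female", "Female"]
personality = ["Extravert", "Introvert", "Extravert"]

table = contingency_table(gender, personality)
props = row_proportions(table)

print(table)
# {'Female': {'Extravert': 1, 'Introvert': 1}, 'Male': {'Extravert': 1, 'Introvert': 0}}
```

If the row proportions are similar across rows, the two variables look independent; if they differ noticeably, the variables appear associated.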
LO 20:
Recognise when
Exercise 1.1
Exercise 1.6
Exercise 1.9
Exercise 1.27
Exercise 1.30
Exercise 1.35
Exercise 1.40