Experimentation in Computer Science – Part 3
Experimentation in Software Engineering

Outline
- Empirical Strategies
- Measurement
- Experiment Process (continued)

Experiment Process: Phases
[Figure: experiment process flow: Experiment Idea → Experiment Definition → Experiment Planning → Experiment Operation → Analysis & Interpretation → Presentation & Package → Conclusions]

Experiment Planning: Overview
[Figure: planning steps between Experiment Definition and Experiment Operation: Context Selection, Hypothesis Formulation, Variables Selection, Selection of Subjects, Experiment Design, Instrumentation, Validity Evaluation]

Experiment Planning: Instrumentation
Instrumentation types:
- Objects (e.g., specifications, code)
- Guidelines (e.g., process descriptions, checklists, tutorial documents)
- Measurement instruments (surveys, forms, automated data collection tools)
Overall goal of instrumentation: facilitate performing the experiment without affecting its control (instrumentation must not affect the outcome).

Experiment Planning: Validity Evaluation
- Threats to external validity concern the ability to generalize results beyond the experimental setting
- Threats to internal validity concern the ability to conclude that a causal relationship exists between the independent and dependent variables
- Threats to construct validity concern the extent to which variables and measures accurately reflect the constructs under study
- Threats to conclusion validity concern issues that affect our ability to draw correct statistical conclusions

Experiment Planning: Process and Threats
[Figure: at the theory level, a cause construct is related to an effect construct by the hypothesized cause-effect construct; at the observation level, a treatment (independent variable) is related to an outcome (dependent variable) by the treatment-outcome relationship. Construct validity concerns the mapping between constructs and treatment/outcome; internal and conclusion validity concern the treatment-outcome link; external validity concerns generalizing observations back to the theory.]

Experiment Planning: Threats to External Validity
- Population: the subject population is not representative of the population we wish to generalize to
- Place: the experimental setting or materials are not representative of the setting we wish to generalize to
- Time: the experiment is conducted at a time that affects results
Reduce external validity threats within a given experiment by making the environment as realistic as possible; however, reality is not homogeneous, so it is important to report the characteristics of the environment. Reduce external validity threats in the long term through replication.

Experiment Planning: Threats to Internal Validity
- Instrumentation: measurement tools report inaccurately or affect results
- Selection: the groups selected are not equivalent
- Learning: subjects learn over the course of the experiment, altering later results
- Mortality: subjects drop out of the experiment
- Social effects: e.g., the control group resents the treatment group (demoralization or rivalry)
Reduce internal validity threats through careful experiment design.

Experiment Planning: Threats to Construct Validity
- Inadequate preoperational explication of constructs: the theory is not clear enough (e.g., what is "better"?)
- Mono-operation or mono-method bias: using a single independent variable, case, subject, treatment, or measure may under-represent the constructs
- Levels of constructs: using incorrect levels of constructs may confound the presence of a construct with its level
- Interaction of testing and treatment: testing itself makes subjects sensitive to the treatment; the test becomes part of the treatment
- Social effects: experimenter expectancy, evaluation apprehension, hypothesis guessing
Reduce construct validity threats through careful design and through replication.

Experiment Planning: Threats to Conclusion Validity
- Low statistical power: increases the risk of being unable to reject a false null hypothesis (illustrated in the sketch below)
- Violated assumptions of statistical tests: some tests make assumptions, e.g., normally distributed and independent samples
- Fishing: searching for a specific result means the analyses are no longer independent, and researchers may influence results by seeking particular outcomes
- Reliability of measures: if measuring the same result twice does not yield equal outcomes, the measures are not reliable
Reduce conclusion validity threats through careful design, and perhaps through consultation with statistical experts.
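To make the low-statistical-power threat concrete, here is a minimal power-analysis sketch. It assumes Python with the third-party statsmodels package; the effect size, α, and sample sizes are illustrative choices, not values from these notes.

```python
# A minimal power-analysis sketch (assumes statsmodels is installed).
# Effect size, alpha, and power targets below are illustrative assumptions.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Subjects needed per group to detect a medium effect (Cohen's d = 0.5)
# with significance level alpha = 0.05 and power 0.8:
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"Subjects needed per group: {n_per_group:.0f}")  # roughly 64

# Conversely, the power achieved with only 15 subjects per group:
power = analysis.solve_power(effect_size=0.5, nobs1=15, alpha=0.05)
print(f"Power with 15 subjects/group: {power:.2f}")  # roughly 0.26,
# i.e., a high risk of failing to reject a false null hypothesis
```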
Experiment Planning: Priorities Among Validity Threats
- Decreasing some types of threats may cause others to increase. (E.g., using CS students as subjects increases group size and reduces heterogeneity, which aids conclusion validity, but reduces external validity.)
- Tradeoffs must be weighed against the type of study:
  - Theory testing is more interested in internal and construct validity than in external validity
  - Applied experimentation is more interested in external and possibly conclusion validity

Experiment Process: Phases
[Figure: experiment process flow: Experiment Idea → Experiment Definition → Experiment Planning → Experiment Operation → Analysis & Interpretation → Presentation & Package → Conclusions]

Experiment Operation: Overview
Experiment operation: carrying out the actual experiment and collecting data.
Three phases:
- Preparation
- Execution
- Data validation

Experiment Operation: Preparation
- Locate participants
- Offer inducements to obtain participants
- Obtain participant consent, and possibly IRB approval
- Consider confidentiality (maintain it, and inform participants about it)
- Avoid deception where it affects participants; if deception is used, reveal it afterwards and discuss why it was necessary (beware of validity tradeoffs: providing information is good, but may affect results)
- Prepare instrumentation: objects, guidelines, tools, forms
- Use pilot studies and walkthroughs to reduce threats

Experiment Operation: Execution
- Execution might take place on a small set of specified occasions, or across a long time span
- Data collection takes place: subjects or interviewers fill out forms, tools collect metrics
- Consider interaction between the experiment and its environment; e.g., if the experiment is performed in vivo, watch for confounding effects (the experiment process itself altering behavior)

Experiment Operation: Data Validation
- Verify that the data has been collected correctly
- Verify that the data is reasonable
- Consider whether outliers exist and should be removed (removal must be justified by good reasons)
- Verify that the experiment was conducted as intended; post-experiment questionnaires can assess whether subjects understood the instructions

Experiment Process: Phases
[Figure: experiment process flow: Experiment Idea → Experiment Definition → Experiment Planning → Experiment Operation → Analysis & Interpretation → Presentation & Package → Conclusions]

Analysis and Interpretation: Overview
Quantitative interpretation can include:
- Descriptive statistics: describe and graphically present the data set; used before hypothesis testing to better understand the data and identify outliers
- Data set reduction: locate and possibly remove anomalous data points
- Hypothesis testing: apply statistical tests to determine whether the null hypothesis can be rejected

Analysis and Interpretation: Visualizing Data Sets
Graphs are effective ways to provide an overview of a data set. Basic graph types for use in visualization:
- Scatter plots
- Box plots
- Line plots
- Bar charts
- Cumulative bar charts
- Pie charts

Analysis and Interpretation: Data Set Reduction
- Hypothesis testing techniques depend on the quality of the data set; data set reduction improves quality by removing anomalous data points (outliers)
- Outliers may be removed, but only for good reasons, such as that they represent rare events not likely to occur again (see the screening sketch below)
- Scatter plots can help find outliers
- Statistical tests can determine the probability that a point is an outlier
- Sometimes highly redundant data is not easily analyzed; factor analysis and principal components analysis can identify orthogonal factors with which to replace redundant ones
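As one concrete way to screen for outliers before deciding on removal, here is a minimal sketch using Tukey's 1.5×IQR rule; the choice of rule and the data values are illustrative assumptions, not from these notes.

```python
# Minimal outlier-screening sketch using Tukey's 1.5*IQR rule.
# The data values are fabricated for illustration only.
import numpy as np

# e.g., task completion times (minutes) collected in an experiment
times = np.array([12.1, 14.3, 13.8, 15.0, 12.9, 14.6, 41.5, 13.2])

q1, q3 = np.percentile(times, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = times[(times < lower) | (times > upper)]
print(f"Fences: [{lower:.1f}, {upper:.1f}], flagged: {outliers}")
# Flagged points are candidates for investigation, not automatic removal:
# remove them only if they reflect rare events unlikely to occur again.
```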
Analysis and Interpretation: Hypothesis Testing
- Hypothesis testing: can we reject H0?
- If the statistical tests say we cannot, we draw no conclusions
- If the tests say we can, we reject H0 at a given significance level α, where α = P(type-I error) = P(reject H0 | H0 is true)
- We also calculate the p-value: the lowest significance level at which H0 can be rejected
- Typically α is 0.05; to claim significance, the p-value must be less than α

Analysis and Interpretation: Statistical Tests per Design

  Design                                              Parametric      Non-parametric
  One factor, one treatment                           -               Chi-2, Binomial test
  One factor, two treatments, completely randomized   t-test, F-test  Mann-Whitney, Chi-2
  One factor, two treatments, paired comparison       Paired t-test   Wilcoxon, Sign test
  One factor, more than two treatments                ANOVA           Kruskal-Wallis, Chi-2
  More than one factor                                ANOVA           -

Analysis and Interpretation: Statistical Tests
Important to choose the right test:
- the type of data must be appropriate for the test
- are data items paired or not?
- is the data normally distributed or not?
- are the data sets completely independent or not?
Take a statistics course, see texts such as Montgomery, consult with statisticians, and use statistical packages. (A minimal worked example combining these choices appears at the end of these notes.)

Analysis and Interpretation: Statistical vs. Practical Significance
- Statistical significance does not imply practical importance. E.g., if T1 is shown with statistical significance to be 1% more effective than T2, it must still be decided whether 1% matters.
- Lack of statistical significance does not imply lack of practical importance. The fact that H0 cannot be rejected at level α does not mean that H0 is true, and results of high practical importance may justify accepting a lower confidence level (a larger α).

Experiment Process: Phases
[Figure: experiment process flow: Experiment Idea → Experiment Definition → Experiment Planning → Experiment Operation → Analysis & Interpretation → Presentation & Package → Conclusions]

Presentation: An Outline for an Experiment Report
1. Introduction, Motivation
2. Background, Prior Work
3. Empirical Study
   3.0 Research Questions
   3.1 Objects of Analysis
       3.1.1 Participants
       3.1.2 Objects
   3.2 Variables and Measures
       3.2.1 Independent variables
       3.2.2 Dependent variables
       3.2.3 Other factors
   3.3 Experiment Setup
       3.3.1 Setup details
       3.3.2 Operational details
   3.4 Analysis Strategy
   3.5 Threats to Validity
   3.6 Data and Analysis
4. Interpretation
5. Conclusions

Presentation Issues
- Supporting replicability
- What to say, and what not to say?
- How much to say?
- Describing design decisions
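Tying the hypothesis-testing slide to the test-selection table above, here is a minimal sketch of the one-factor, two-treatment, completely randomized case: check the normality assumption, pick a parametric or non-parametric test accordingly, and compare the p-value with α. It assumes Python with a reasonably recent scipy; the sample data are fabricated for illustration.

```python
# Minimal sketch: one factor, two treatments, completely randomized.
# Choose a parametric or non-parametric test based on a normality check,
# then compare the p-value with alpha. Data values are fabricated.
from scipy import stats

alpha = 0.05
group_t1 = [68, 72, 75, 70, 74, 69, 73, 71, 76, 70]  # e.g., scores with T1
group_t2 = [64, 66, 70, 63, 68, 65, 67, 69, 62, 66]  # e.g., scores with T2

# Shapiro-Wilk as a (rough, small-sample) check of the normality assumption
normal = (stats.shapiro(group_t1).pvalue > alpha and
          stats.shapiro(group_t2).pvalue > alpha)

if normal:
    result = stats.ttest_ind(group_t1, group_t2)          # parametric
else:
    result = stats.mannwhitneyu(group_t1, group_t2,
                                alternative="two-sided")  # non-parametric

print(f"p-value = {result.pvalue:.4f}")
if result.pvalue < alpha:
    print("Reject H0 at alpha =", alpha)  # statistically significant
else:
    print("Cannot reject H0: draw no conclusions")
# Per the practical-significance slide: a significant p-value still says
# nothing about whether the observed difference matters in practice.
```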