Question 1 : (8p) In your opinion, which steps are compulsory when conducting Bayesian data analysis? Please explain what steps one takes when designing models, so that we ultimately can place some confidence in the results. You can either draw a flowchart and explain each step, or write a numbered list explaining each step. (It’s ok to write on the backside, if they haven’t printed on the backside again. . . )
Solution: Bayesian data analysis involves several key steps to ensure that we can place confidence in the results. Here is a numbered list of the essential steps:
1. Define the Research Question: Clearly articulate the problem or question you want to address through your analysis.
2. Data Collection: Gather relevant data that can help answer the research question.
3. Prior Belief Specification: Define your prior beliefs or prior probability distributions about the parameters of interest. This step reflects your knowledge before observing the data.
4. Likelihood Function: Describe the likelihood function, which quantifies how your data is generated given the model parameters.
5. Model Selection: Choose an appropriate probabilistic model that describes the relationship between the data and the parameters. This could be a linear regression, a Bayesian network, etc.
6. Bayesian Inference: Use Bayes' theorem to update your prior beliefs with the observed data, resulting in the posterior distribution of the parameters.
7. Sampling or Computational Methods: Depending on the complexity of the model, use sampling methods like Markov chain Monte Carlo (MCMC) or variational inference to approximate the posterior distribution.
8. Model Checking: Assess the model's goodness of fit by comparing its predictions to the observed data. Diagnostic tests, such as posterior predictive checks, can be helpful.
9. Sensitivity Analysis: Explore how different choices of priors or model structures affect the results, to assess the robustness of your conclusions.
10. Interpretation: Interpret the posterior distribution to make meaningful conclusions or predictions about the parameters.
11. Report Results: Clearly communicate the results, including point estimates and credible intervals, in a way that is understandable to others.
12. Sensitivity to Assumptions: Acknowledge the limitations and uncertainties in the analysis, especially regarding the choice of priors and model assumptions.
13. Decision or Inference: Based on the posterior distribution and the research question, make decisions or inferences. This may involve hypothesis testing, prediction, or parameter estimation.
Question 2 : (12p) Underfitting and overfitting are two concepts not always stressed a lot with black-box machine learning approaches. In this course, however, you’ve probably heard me talk about these concepts a hundred times. . . Multilevel models can be one way to handle overfitting, i.e., employing partial pooling. Please design (write down) three models. The first one should use complete pooling, the second one should employ no pooling, and the final one should use partial pooling. (Remember to use math notation!) (9p) Explain the different behaviors of each model. (3p)
Solution: This question can be answered in many different ways. Here’s one way.
complete pooling
y_i ∼ Binomial(n, p_i)
logit(p_i) = α
α ∼ Normal(0, 2.5)
no pooling
y_i ∼ Binomial(n, p_i)
logit(p_i) = α_CLUSTER[j]
α_j ∼ Normal(0, 2)
partial pooling, using hyper-parameters and hyper-priors
y_i ∼ Binomial(n, p_i)
logit(p_i) = α_CLUSTER[j]
α_j ∼ Normal(ᾱ, σ)
ᾱ ∼ Normal(0, 1)
σ ∼ Exponential(1)
Complete Pooling (Pooled Model): All data points share a common parameter $\mu$. This model assumes that all data points are essentially identical in terms of the parameter of interest. It is useful when there's strong prior belief that the data is highly homogeneous. However, it may not capture individual variability, potentially leading to underfitting.
No Pooling (Unpooled Model): Each data point has its own individual parameter $\mu_i$, offering the most flexibility. This model can capture individual variations but may lead to overfitting, especially when there's limited data. It does not leverage information from other data points.
Partial Pooling (Hierarchical Model): Data points are grouped, and each group shares a common parameter $\mu_j$. This model balances between the extremes of complete pooling and no pooling. It allows for sharing of information among similar data points within a group, reducing overfitting while still accommodating individual variability. It is a good choice when there is some heterogeneity in the data.
Question 2.1: (9p) In Bayesian data analysis we have three principled ways of avoiding overfitting. The first way makes sure that the model doesn’t get too excited by the data. The second way is to use some type of scoring device to model the prediction task and estimate predictive accuracy. The third, and final, way is to, if suitable data is at hand, design models that actually do something about overfitting. Which are the three ways? (3p) Explain and provide examples for each of the three ways (2p+2p+2p)
Paper Solution: In the first way we use regularizing priors to capture the regular features of the data. In particular it might be wise to contrast this with sample size! In the second way we rely on information theory (entropy and max ent distributions) and the concept of information criteria, e.g., BIC, DIC, AIC, LOO, and WAIC. For the final, third way, the below is a good answer:
complete pooling
y_i ∼ Binomial(n, p_i)
logit(p_i) = α
α ∼ Normal(0, 2.5)
no pooling
y_i ∼ Binomial(n, p_i)
logit(p_i) = α_CLUSTER[j]
α_j ∼ Normal(0, 2)
partial pooling, using hyper-parameters and hyper-priors
y_i ∼ Binomial(n, p_i)
logit(p_i) = α_CLUSTER[j]
α_j ∼ Normal(ᾱ, σ)
ᾱ ∼ Normal(0, 1)
σ ∼ Exponential(1)
Solution: Three Ways to Avoid Overfitting in Bayesian Data Analysis:
Regularization: Explanation: Regularization techniques are used to prevent models from becoming overly complex or "too excited" by the data. They introduce constraints on the model parameters to discourage extreme values and reduce overfitting. Example: In Bayesian regression, you can use L1 regularization (Lasso) or L2 regularization (Ridge) to add penalty terms to the likelihood, encouraging model parameters to remain close to zero. For example, adding a penalty term to the regression coefficients in linear regression discourages the model from assigning excessive importance to any single predictor.
Model Scoring and Cross-Validation: Explanation: Scoring devices and cross-validation are used to assess the predictive accuracy of a model. These techniques evaluate how well a model generalizes to unseen data, helping to identify overfitting (a small numerical sketch follows below).
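To make the scoring idea concrete, here is a minimal, self-contained R sketch (entirely hypothetical data and models, not taken from the exam): polynomial regressions of increasing flexibility are scored on held-out data, and the most flexible model has the lowest training error but typically a worse test error.

# Hypothetical illustration: in-sample error keeps shrinking with model
# complexity, while out-of-sample (test) error eventually gets worse.
set.seed(1)
n <- 60
x <- runif(n, -2, 2)
y <- sin(x) + rnorm(n, sd = 0.3)   # true signal plus noise
train <- 1:40
test  <- 41:60
mse <- function(obs, pred) mean((obs - pred)^2)
for (degree in c(1, 3, 9)) {
  fit  <- lm(y ~ poly(x, degree), data = data.frame(x = x[train], y = y[train]))
  pred <- predict(fit, newdata = data.frame(x = x[test]))
  cat(sprintf("degree %d: train MSE %.3f, test MSE %.3f\n",
              degree, mse(y[train], fitted(fit)), mse(y[test], pred)))
}

In a Bayesian workflow the same comparison would typically be done with LOO or WAIC rather than a single train/test split, but the logic is the same: score the prediction task, not the fit to the training data.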
Example: Cross-validation involves splitting the data into training and testing sets multiple times and assessing the model's performance on the testing data. If a model performs exceptionally well on the training data but poorly on the testing data, it may be overfitting. Common metrics for scoring include mean squared error, log-likelihood, or the widely used Watanabe-Akaike Information Criterion (WAIC) in Bayesian analysis.
Hierarchical or Multilevel Models: Explanation: Hierarchical or multilevel models are designed to address overfitting by explicitly modeling the structure of the data, accounting for variability at different levels. These models can capture data patterns while controlling for overfitting. Example: In education research, a multilevel model can be used to analyze student performance in different schools. The model considers both school-level and student-level factors. By incorporating school-level effects, the model accounts for variations between schools while still capturing individual student characteristics. This helps prevent overfitting and provides more robust inferences. (See the complete-, no-, and partial-pooling specifications in the paper solution above for the corresponding model notation.)
These three ways provide a principled approach to mitigating overfitting in Bayesian data analysis. Regularization constrains parameter estimates, model scoring helps evaluate predictive accuracy, and hierarchical models account for data structure to prevent overfitting and improve the robustness of inferences.
Question 3 : (5p) What is epistemological justification and how does it differ from ontological justification, when we design models and choose likelihoods/priors? Please provide an example where you list epistemological and ontological reasons for a likelihood.
Solution: Epistemological Justification: Epistemological justification is concerned with the epistemological beliefs and knowledge that we have about the data and the underlying system we are modeling. It relates to our understanding of how data is generated and what we know or believe about it based on existing information or expertise. In other words, epistemological justification is about what we know and what we think we know about the problem at hand. Example of Epistemological Justification for Likelihood: Suppose you are modeling the height of a certain plant species. Your epistemological justification for the likelihood function might involve: Knowledge that plant heights typically follow a normal distribution based on previous studies or domain expertise. Understanding that external factors like soil quality and sunlight influence plant height, which leads you to choose a normal distribution as it is appropriate for capturing these influences.
Ontological Justification: Ontological justification, on the other hand, is concerned with the ontological assumptions about the underlying nature of the system. It pertains to what is true or exists in reality, independent of our beliefs or knowledge. Ontological justification is about the structure of the world, the actual physical or statistical processes generating the data, and the properties of the system we are modeling.
Example of Ontological Justification for Likelihood: Continuing with the plant height example, ontological justification for the likelihood might involve: The physical properties of plant growth and genetic factors that lead you to consider that the true distribution of plant heights should be normal. The belief that the data generation process itself follows certain underlying physical laws, and that these laws imply a normal distribution of plant heights. In summary, the key difference between epistemological and ontological justification is in the source and nature of the justifications: Epistemological justification is based on our knowledge, beliefs, and understanding of the data and the problem. It reflects what we think we know. Ontological justification is grounded in the actual nature of the world and the system we are modeling. It reflects what is true or exists independently of our beliefs. Paper Solution: Epistemological, rooted in information theory and the concept of maximum entropy distributions. Ontological, the way nature works, i.e., a physical assumption about the world. One example could be the Normal likelihood for real (continuous) numbers where we have additive noise in the process (ontological), and where we know that in such cases Normal is the maximum entropy distribution. Question 4 : (8p) List and explain the four main benefits of using multilevel models. Paper Solution: • Improved estimates for repeat sampling. When more than one observation arises from the same individual, location, or time, then traditional, single-level models either maximally underfit or overfit the data. • Improved estimates for imbalance in sampling. When some individuals, locations, or times are sampled more than others, multilevel models automatically cope with differing uncertainty across these clusters. This prevents over-sampled clusters from unfairly dominating inference. • Estimates of variation. If our research questions include variation among individuals or other groups within the data, then multilevel models are a big help, because they model variation explicitly. • Avoid averaging, retain variation. Frequently, scholars pre-average some data to construct variables. This can be dangerous, because averaging removes variation, and there are also typically several different ways to perform the averaging. Averaging therefore both manufactures false confidence and introduces arbitrary data transformations. Multilevel models allow us to preserve the uncertainty and avoid data transformations. Solution: Handling Hierarchical Data Structures: Multilevel models are well-suited for analyzing data with hierarchical or nested structures, such as students within schools, patients within hospitals, or measurements taken over time. These models account for the correlation and variability at multiple levels in the data. For example, in educational research, you can use multilevel models to assess both individual student performance and school-level effects on academic achievement. Improved Precision and Efficiency: By modeling data hierarchically, multilevel models can often provide more precise and efficient parameter estimates compared to traditional linear models. They "borrow strength" from higher-level units to estimate parameters for lower-level units. This reduces the risk of overfitting and provides stable estimates, particularly when there are few observations per group. This is particularly useful in scenarios with limited data at lower levels of the hierarchy. 
Handling Variation and Heterogeneity: Multilevel models account for variation within and between groups or clusters. They allow for the estimation of both fixed effects (global trends) and random effects (group-specific deviations from those trends). This is useful for capturing heterogeneity in the data, as it acknowledges that not all groups or individuals are identical. For example, in medical studies, random effects can capture the differences in patient responses to treatments. Modeling Changes Over Time: Multilevel models can handle longitudinal or repeated measures data effectively. They allow for the modeling of temporal changes within subjects or entities while also considering the variation between subjects. This is vital in various fields, such as clinical trials, where patients are monitored over time, or in social sciences to study individual trajectories of development. Question 5 : (6p) Below you see a Generalized Linear Model yi ∼ Poisson(λ) f(λ) = α + βxi What is f(), and why is it needed? (2p) Provide at least two examples of f() and when you would use them? (4p) Paper Solution: Generalized linear models need a link function, because rarely is there a “μ”, a parameter describing the average outcome, and rarely are parameters unbounded in both directions, like μ is. For example, the shape of the binomial distribution is determined, like the Gaussian, by two parameters. But unlike the Gaussian, neither of these parameters is the mean. Instead, the mean outcome is np, which is a function of both parameters. For Poisson we have λ which is both the mean and the variance. The link function f provides a solution to this common problem. A link function’s job is to map the linear space of a model like α + βxi onto the non-linear space of a parameter! Solution: In the context of the given Generalized Linear Model (GLM), the function "f()" is the link function. It's a fundamental component of GLMs and is needed to establish a connection between the linear combination of predictors (α + βxi) and the mean parameter (λ) of the Poisson distribution. The link function transforms the linear predictor to ensure that it is within a valid range and adheres to the assumptions of the distribution. Two common link functions in GLMs are the identity link and the log link: Identity Link (f(λ) = λ): In this case, the linear combination (α + βxi) is directly used as the mean parameter (λ) of the Poisson distribution. This implies that the response variable (yi) is linearly related to the predictors without any transformation. Use when modeling count data where the predictor variables have a linear and additive effect on the mean response. For example, in modeling the number of customer calls as a function of time spent on the website, an identity link could be appropriate if you believe that each additional minute spent directly increases the number of calls. Log Link (f(λ) = log(λ)): Here, the linear combination (α + βxi) is transformed by taking the natural logarithm. This transformation is often used when modeling data with multiplicative relationships. Use when dealing with count data where you expect a multiplicative effect of predictors on the mean response. For example, in epidemiology, when modeling disease rates, a log link might be suitable because it captures relative changes (multiplicative) rather than absolute changes. 
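As a concrete illustration of how a link function maps an unconstrained linear predictor onto a valid parameter space, here is a small R sketch for the log link of a Poisson rate; the intercept, slope, and predictor values are made up purely for illustration.

# The log link: the linear predictor can be any real number, but its inverse
# (exp) always returns a positive rate, as a Poisson mean requires.
alpha <- 0.5
beta  <- -1.2
x      <- seq(-3, 3, by = 1)
eta    <- alpha + beta * x     # linear predictor, unbounded in both directions
lambda <- exp(eta)             # inverse of the log link: always > 0
round(cbind(x, eta, lambda), 2)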
Other common link functions used in GLMs include the logit link for binary data (e.g., logistic regression) and the inverse link for gamma-distributed data, among others.
Question 5.1: (4p) Below you see a Generalized Linear Model
yi ∼ Binomial(m, pi)
f(pi) = α + βxi
What is f(), and how does it work? Provide at least two examples of f().
Paper Solution: Generalized linear models need a link function, because rarely is there a “µ”, a parameter describing the average outcome, and rarely are parameters unbounded in both directions, like µ is. For example, the shape of the binomial distribution is determined, like the Gaussian, by two parameters. But unlike the Gaussian, neither of these parameters is the mean. Instead, the mean outcome is np, which is a function of both parameters. Since n is usually known (but not always), it is most common to attach a linear model to the unknown part, p. But p is a probability mass, so pi must lie between zero and one. And there’s nothing to stop the linear model α + βxi from falling below zero or exceeding one. The link function f provides a solution to this common problem. A link function’s job is to map the linear space of a model like α + βxi onto the non-linear space of a parameter.
Solution: In the context of a Generalized Linear Model (GLM), the function f() is called the link function. It plays a crucial role in connecting the linear predictor with the mean of the response variable, which is a non-linear relationship. The link function transforms the linear combination of predictors into a range that is suitable for modeling the response variable. There are various link functions available in GLMs, and the choice of the link function depends on the nature of the data and the specific modeling requirements. Two common examples of link functions and how they work are as follows:
Logit Link Function (f() = log(π / (1 − π))):
• This link function is often used when modeling binary or binomial response data, where π represents the probability of success (in a binary outcome) or the probability of an event occurring (in a binomial count).
• The logit link function maps the linear combination of predictors (α + βxi) to the log-odds scale. It transforms the linear predictor into a range from negative infinity to positive infinity, which is suitable for modeling probabilities.
• Example: Logistic regression models often use the logit link function to model the probability of a binary event, such as the likelihood of a customer making a purchase (success/failure) based on predictor variables like age, income, and gender.
Identity Link Function (f() = π):
• The identity link function is used when modeling response variables that are continuous and where the linear predictor can be interpreted directly as the conditional mean; in this case, π represents the conditional mean of the response variable.
• The identity link function essentially maintains the linear relationship between the predictors and the response without any transformation. It is often used in simple linear regression when the response variable is continuous and can take any real value.
• Example: In a linear regression model, the identity link function is used when modeling the relationship between a predictor variable (e.g., temperature) and a continuous response variable (e.g., sales volume) to predict the expected sales volume based on temperature.
The choice of the link function in a GLM is crucial for modeling the response variable effectively; a tiny numerical sketch of the logit link is given below.
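The following R sketch (base R only, with made-up probabilities) shows the logit and its inverse in action.

# The logit maps probabilities in (0, 1) onto the whole real line,
# and the inverse logit maps any linear-predictor value back into (0, 1).
p   <- c(0.05, 0.5, 0.95)      # illustrative probabilities
eta <- qlogis(p)               # logit: log(p / (1 - p))
round(eta, 2)
round(plogis(eta), 2)          # inverse logit recovers the probabilities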
Different link functions are suitable for different types of data, and they ensure that the relationship between predictors and the response is appropriately captured. The link function transforms the linear predictor into a form that aligns with the distribution and characteristics of the response variable, allowing for meaningful modeling and interpretation.
Question 5.2: (5p) Below you see a Generalized Linear Model
yi ∼ Binomial(n, p)
f(p) = α + βxi
What is f() called and why is it needed? (2p) What should f() be in this case? (1p) Does f() somehow affect your priors? If yes, provide an example. (2p)
Paper Solution: Generalized linear models need a link function, because rarely is there a “µ”, a parameter describing the average outcome, and rarely are parameters unbounded in both directions, like µ is. For example, the shape of the binomial distribution is determined, like the Gaussian, by two parameters. But unlike the Gaussian, neither of these parameters is the mean. Instead, the mean outcome is np, which is a function of both parameters. For Poisson we have λ, which is both the mean and the variance. The link function f provides a solution to this common problem. A link function’s job is to map the linear space of a model like α + βxi onto the non-linear space of a parameter! In the above case the link function is the logit(). If you set priors and use a link function, they often concentrate probability mass in unexpected ways.
Solution: What is f() and Why is it Needed?
• f() in this context is the link function, and it's needed to transform the linear combination of predictors into a valid probability value for the response variable. It links the linear predictor to the range of the response variable. The link function is essential in Generalized Linear Models (GLMs) because it ensures that the predicted probabilities lie within the (0, 1) range, which is necessary for modeling binary or count data.
What Should f() Be in this Case?
• In this case, since the response variable follows a Binomial distribution with parameters n (the number of trials) and p (the probability of success in each trial), the appropriate link function should be the logit link function. The logit link function transforms the linear predictor to the log-odds scale, ensuring that the predicted probabilities are within the (0, 1) range.
Does f() Affect Your Priors? Provide an Example.
• Yes, the choice of the link function can affect your choice of priors. The link function influences the interpretation of the model coefficients and the prior distribution for the coefficients. For example, if you use the logit link function, which models the log-odds of success, your choice of priors for α and β might be based on prior knowledge of log-odds or logistic regression coefficients. The link function impacts the scale and meaning of the coefficients, which can guide your prior specifications.
• Suppose you are modeling the probability of a customer making a purchase (purchasing vs. not purchasing) based on their age (xi) using a Binomial distribution. If you choose the logit link function, your priors for α and β might reflect prior information about the log-odds of making a purchase based on age. For instance, you might specify a Normal prior for β with a mean reflecting your prior belief about the effect of age on the log-odds of purchasing. (A small simulation of how priors behave when pushed through the logit link is sketched below.)
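To illustrate the paper solution's point that priors passed through a link function can concentrate probability mass in unexpected ways, here is a hypothetical R sketch (the prior scales are made up for illustration): a seemingly vague Normal(0, 10) prior on the intercept piles most of its mass near 0 and 1 on the probability scale, whereas a tighter Normal(0, 1.5) prior stays much flatter.

set.seed(2)
a_wide   <- rnorm(1e4, 0, 10)    # "vague" prior on the logit scale
a_narrow <- rnorm(1e4, 0, 1.5)   # more modest prior on the logit scale
p_wide   <- plogis(a_wide)       # inverse logit: implied prior on the probability scale
p_narrow <- plogis(a_narrow)
mean(p_wide   < 0.01 | p_wide   > 0.99)   # large share of prior mass at the extremes
mean(p_narrow < 0.01 | p_narrow > 0.99)   # much smaller share

This is why prior predictive checks on the outcome scale, rather than the parameter scale, are recommended for GLMs.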
The choice of the link function should align with the characteristics of your data and the interpretability of the model, and it can influence the specification of prior distributions for model parameters.
Question 6 : (6p) What is the purpose and limitations of using Laboratory Experiments and Field Experiments as a research strategy? Provide examples, i.e., methods for each of the two categories, and clarify if one uses mostly qualitative or quantitative approaches (or both).
Paper Solution:
Laboratory Experiments:
• Real actors’ behavior.
• Non-natural settings (contrived); the setting is more controlled.
• Obtrusiveness: high level of obtrusiveness.
• Goal: minimize confounding factors and extraneous conditions; maximize the precision of the measurement of the behavior; establish causality between variables.
• Limited number of subjects, which limits the generalizability; unrealistic context.
• Examples of research methods: randomized controlled experiments and quasi-experiments, benchmark studies.
• Mostly quantitative data.
Field Experiments:
• Natural settings.
• Obtrusiveness: the researcher manipulates some variables/properties to observe an effect, but does not control the setting.
• Goal: investigate/evaluate/compare techniques in concrete and realistic settings.
• Limitations: low statistical generalizability (claim analytical generalizability!); fairly low precision of the measurement of the behavior (confounding factors). Not the same as an experimental study (changes are made and results observed), but the researcher does not have control over the natural setting. Correlation observed but not causation.
• Examples: evaluative case study, quasi-experiment, and action research.
• Often qualitative and quantitative.
Solution: Laboratory Experiments: Purpose: Laboratory experiments are controlled research studies conducted in a controlled environment (laboratory) to investigate causal relationships between variables. The purpose is to establish cause-and-effect relationships by manipulating independent variables and observing their impact on dependent variables under controlled conditions. Limitations: Artificial Environment: Laboratory experiments may suffer from artificiality, as the controlled environment may not fully represent real-world settings. Limited Generalizability: Findings from lab experiments may not easily generalize to real-world situations. Ethical Concerns: Some experiments may pose ethical challenges when manipulating variables, especially in social and psychological research. Examples: Qualitative or Quantitative? Both approaches can be used in laboratory experiments, depending on the research question. For instance, in a psychological study on the impact of stress on memory, quantitative measures such as reaction times and memory scores could be collected, and qualitative interviews might explore participants' experiences.
Field Experiments: Purpose: Field experiments are conducted in real-world settings to study the effects of manipulated variables on a target population. The purpose is to increase the external validity of research by observing phenomena in natural contexts with less control than laboratory experiments. Limitations: Less Control: Field experiments have less control over extraneous variables, making it challenging to establish causality. Logistical Challenges: Conducting experiments in real-world settings can be logistically complex and expensive.
Ethical and Privacy Concerns: Researchers must consider ethical issues and privacy concerns when conducting field experiments, especially in social and behavioral sciences. Examples: Qualitative or Quantitative? Both approaches can be used in field experiments. For instance, a field experiment in public health assessing the effectiveness of a smoking cessation program may use quantitative data (e.g., reduction in smoking rates) and qualitative data (e.g., participant interviews) to gain a comprehensive understanding of the program's impact.
Question 7 : (8p) Below follows an abstract from a research paper. Answer the questions,
• Which of the eight research strategies presented in the ABC framework does this paper likely fit? Justify and argue!
• Can you argue the main validity threats of the paper, based on the research strategy you picked?
– It would be very good if you can list threats in the four common categories we usually work with in software engineering, i.e., internal, external, construct, and conclusion validity threats.
Context: The term technical debt (TD) describes the aggregation of sub-optimal solutions that serve to impede the evolution and maintenance of a system. Some claim that the broken windows theory (BWT), a concept borrowed from criminology, also applies to software development projects. The theory states that the presence of indications of previous crime (such as a broken window) will increase the likelihood of further criminal activity; TD could be considered the broken windows of software systems. Objective: To empirically investigate the causal relationship between the TD density of a system and the propensity of developers to introduce new TD during the extension of that system. Method: The study used a mixed-methods research strategy consisting of a controlled experiment with an accompanying survey and follow-up interviews. The experiment had a total of 29 developers of varying experience levels completing a system extension task in an already existing system with high or low TD density. The solutions were scanned for TD, both manually and automatically. Six of the subjects participated in follow-up interviews, where the results were analyzed using thematic analysis. Result: The analysis revealed significant effects of TD level on the subjects’ tendency to re-implement (rather than reuse) functionality, choose non-descriptive variable names, and introduce other code smells identified by the software tool SonarQube, all with at least 95% credible intervals. Additionally, the developers appeared to be, at least partially, aware of when they had introduced TD. Conclusion: Three separate significant results along with a validating qualitative result combine to form substantial evidence of the BWT’s applicability to software engineering contexts. This study finds that existing TD has a major impact on developers’ propensity to introduce new TD of various types during development. While mimicry seems to be part of the explanation, it cannot alone describe the observed effects.
Paper Solution: This is a mixed-methods study, i.e., an experiment (but not an RCT!), but where we also see qualitative parts, like from a field study.
Solution: Research Strategy: Based on the information provided in the abstract, the paper likely fits into the research strategy known as a "Mixed-Methods Approach." This strategy is supported by the following evidence: The study explicitly mentions using "a controlled experiment with an accompanying survey and follow-up interviews."
This indicates the integration of both quantitative (controlled experiment) and qualitative (survey and interviews) methods in the research. Justification: The mixed-methods approach combines both quantitative and qualitative research techniques to provide a more comprehensive understanding of the research problem. In this case, the paper incorporates controlled experiments to gather quantitative data about technical debt (TD) and its effects on developers' behavior. It also employs surveys and follow-up interviews to gather qualitative insights from the developers. This combination of methods helps validate the results and provides a richer context for understanding the causal relationship between TD density and developers' actions. Validity Threats: The main validity threats in the paper can be categorized into four common categories used in software engineering research: Internal Validity Threats: • Selection Bias: The choice of 29 developers with varying experience levels could introduce selection bias. Developers with different levels of experience may respond differently to TD, affecting the internal validity of the study. • Experimenter Effects: Any influence or bias introduced by the experimenters during the controlled experiment or interviews could impact the internal validity. External Validity Threats: • Population Generalizability: The results from a specific set of developers may not generalize to all software development contexts. This threatens the external validity of the findings. • Contextual Generalizability: The study's context (high or low TD density systems) may not represent the diversity of software development projects, impacting the external validity. Construct Validity Threats: • Operationalization of TD: How TD is defined, measured, and categorized can be a construct validity threat. In this case, the paper mentions scanning for TD both manually and automatically, which may introduce variations in the measurement. • Measurement of Developer Behavior: The paper discusses various behaviors of developers, such as re-implementing functionality and choosing non-descriptive variable names. The construct validity depends on the accurate measurement of these behaviors. Conclusion Validity Threats: • Multiple Comparisons: To ensure that the results are not spurious, the paper should address the issue of multiple comparisons due to examining multiple outcomes (e.g., re-implementing functionality, choosing variable names, introducing code smells). • Extraneous Variables: The presence of uncontrolled variables that may confound the results, such as other external factors affecting developers' decisions, could threaten the conclusion validity. In summary, the mixed-methods research approach in the paper offers a holistic view of the research problem but introduces various validity threats that researchers should consider when interpreting the results and generalizing them to broader software engineering contexts. Question 7.1: (6p) Below follows an abstract from a research paper. Answer the questions, • Which of the eight research strategies presented in the ABC framework does this paper likely fit? Justify and argue! • Can you argue the main validity threats of the paper, based on the research strategy you picked? – It would be very good if you can list threats in the four common categories we usually work with in software engineering, i.e., internal, external, construct, and conclusion validity threats. Software engineering is a human activity. 
Despite this, human aspects are under-represented in technical debt research, perhaps because they are challenging to evaluate. This study’s objective was to investigate the relationship between technical debt and affective states (feelings, emotions, and moods) from software practitioners. Forty participants (N = 40) from twelve companies took part in a mixed-methods approach, consisting of a repeated-measures (r = 5) experiment (n = 200), a survey, and semi-structured interviews. The statistical analysis shows that different design smells (strong indicators of technical debt) negatively or positively impact affective states. From the qualitative data, it is clear that technical debt activates a substantial portion of the emotional spectrum and is psychologically taxing. Further, the practitioners’ reactions to technical debt appear to fall in different levels of maturity. We argue that human aspects in technical debt are important factors to consider, as they may result in, e.g., procrastination, apprehension, and burnout.
Solution: Research Strategy: This research paper likely fits the "Mixed Methods" research strategy in the ABC framework. Here's the justification and argument:
• The paper mentions using a mixed-methods approach, which combines quantitative (experiment and survey) and qualitative (semi-structured interviews) data collection methods.
• The objective is to investigate the relationship between technical debt and affective states, a complex phenomenon that benefits from both quantitative and qualitative insights.
• The use of both experimental and interview data aligns with the mixed-methods strategy's focus on gaining a comprehensive understanding of the research question.
Validity Threats: Validity threats in this study can be categorized into four common categories in software engineering research:
Internal Validity Threats:
• Selection bias: The choice of participants from specific companies may not represent the entire software engineering population.
• History effects: Changes in participants' affective states over time may confound the results, particularly in the repeated-measures experiment.
External Validity Threats:
• Generalizability: Findings from 12 companies may not generalize to the broader software engineering community.
• Interaction effects: The impact of technical debt on affective states may vary based on company-specific factors.
Construct Validity Threats:
• Measurement validity: Ensuring that the instruments used to measure affective states and technical debt are accurate and reliable.
• Construct underrepresentation: The study acknowledges that human aspects in technical debt are challenging to evaluate, indicating potential limitations in capturing the full scope of the construct.
Conclusion Validity Threats:
• Observer bias: Interviewers' or data analysts' preconceived notions or biases may affect the interpretation of qualitative data.
• Ecological validity: The study may face challenges in capturing the complexity of real-world software development environments in a controlled experiment.
In summary, this study falls under the mixed-methods research strategy and demonstrates a focus on understanding the relationship between technical debt and affective states among software practitioners. To enhance the study's validity, researchers should address threats related to participant selection, generalizability, measurement validity, and potential biases in interpreting data.
The complexity of human aspects in technical debt research adds to the challenges in construct validity and ecological validity.
Question 8 : (6p) In the paper Guidelines for conducting and reporting case study research in software engineering by Runeson & Höst the authors present an overview of research methodology characteristics. They present their claims in three dimensions: Primary objective, primary data, and design. Primary objective indicates what the purpose is with the methodology (e.g., explanatory), primary data indicates what type of data we mainly see when using the methodology (e.g., qualitative), while design indicates how flexible the methodology is from a design perspective (e.g., flexible). Please fill out the table below according to the authors’ presentation.
Methodology       Primary objective   Primary data    Design
Survey            Descriptive         Quantitative    Fixed
Case study        Exploratory         Qualitative     Flexible
Experiment        Explanatory         Quantitative    Fixed
Action research   Improving           Qualitative     Flexible
Question 9 : (4p) See the DAG below. We want to estimate the total causal effect of Screen time on Obesity. Design a model where Obesity is approximately distributed as a Gaussian likelihood, and then write down a linear model for μ to clearly show which variable(s) we should condition on. Also add, what you believe to be, suitable priors on all parameters. If needed, you can always state your assumptions.
[DAG figure not reproduced; nodes: Parental Education, Screen time, Physical Activity, Obesity, Self-harm. The edges are given in the dagitty specification below.]
Paper Solution: 1p for correct conditioning, 3p for an appropriate model as further down below.
> dag <- dagitty("dag { PE -> O  S -> PA -> O  PE -> S -> SH  O -> SH }")
> adjustmentSets(dag, exposure = "S", outcome = "O")
{ PE }
and the answer is { PE }, i.e., Parental Education. In short, check what elementary confounds we have in the DAG (DAGs can only contain four), follow the paths, decide what to condition on. If we condition on Self-harm we’re toast. . .
O_i ∼ Normal(μ_i, σ)
μ_i = α + β_ST ST_i + β_PE PE_i
α ∼ Normal(0, 5)
β_ST, β_PE ∼ Normal(0, 1)
σ ∼ Exponential(1)
Solution: To estimate the total causal effect of Screen time on Obesity, we can create a causal model and write down a linear model for μ (the mean of Obesity) to show which variables we should condition on. We'll also specify suitable priors for the parameters. In this causal model, it's essential to condition on confounders, but not on mediators of the exposure or on colliders. Based on the given causal relationships, here's the reasoning: Causal Model: Parental Education influences both Screen time and Obesity, which makes it a confounder of the effect we want to estimate. Screen time is influenced by Parental Education and influences Physical Activity, which in turn influences Obesity, so Physical Activity is a mediator that carries part of the total effect. Self-harm is influenced by both Screen time and Obesity, i.e., it is a collider and must not be conditioned on. Obesity is the outcome variable. Linear Model for μ: To estimate the total causal effect of Screen time on Obesity while controlling for confounding, we therefore condition on Parental Education only:
μ_i = α + β_ST Screen time_i + β_PE Parental Education_i
In this model, μ_i represents the mean of Obesity. We include Parental Education as a covariate to block the back-door path through it, but we deliberately leave out Physical Activity (the mediator) and Self-harm (the collider), since conditioning on them would bias the estimate of the total effect. Priors: For suitable priors, we can make some assumptions based on the nature of the data and our beliefs (a small prior-predictive sketch under such assumptions is given below).
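As an illustration of such assumptions, here is a hypothetical prior-predictive sketch in R, assuming standardized predictors and the priors from the paper solution (α ∼ Normal(0, 5), β ∼ Normal(0, 1), σ ∼ Exponential(1)); all simulated values are made up purely for illustration.

set.seed(3)
n_sim <- 1e4
alpha <- rnorm(n_sim, 0, 5)             # prior for the intercept
b_ST  <- rnorm(n_sim, 0, 1)             # prior for the Screen time slope
b_PE  <- rnorm(n_sim, 0, 1)             # prior for the Parental Education slope
sigma <- rexp(n_sim, 1)                 # prior for the residual spread
ST <- rnorm(n_sim)                      # hypothetical standardized Screen time values
PE <- rnorm(n_sim)                      # hypothetical standardized Parental Education values
mu    <- alpha + b_ST * ST + b_PE * PE
O_sim <- rnorm(n_sim, mu, sigma)        # prior-predictive draws for (standardized) Obesity
quantile(O_sim, c(0.05, 0.5, 0.95))     # do these outcomes look plausible a priori?

If the implied outcomes look absurd on the scale of the data, the priors should be tightened before seeing the data.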
Common prior choices for regression parameters like β include normal distributions with mean 0 and precision (inverse of variance) τ. The specific choice of priors may depend on your data and prior knowledge. Here's a general form for the priors: Priors for β coefficients: You can assume that β coefficients follow normal distributions with mean 0 and precision τ (τ controls the spread of the distribution, with higher values indicating less spread). Priors for α: The intercept α can also follow a normal distribution, typically centered around 0. Priors for error terms: You can assume normally distributed errors (likelihood) with mean 0 and a suitable variance (σ^2). Keep in mind that the choice of priors can influence the results, so it's important to select priors that are appropriate for your specific dataset and research context. Bayesian methods allow you to specify these priors based on your prior knowledge or beliefs about the parameters.
Question 10 : ([−15, 15]p) Below follows a number of multiple choice questions. There can be more than one correct answer! Mark the correct answer by crossing the answer. In the case of DAGs, X is the treatment and Y is the outcome. Correct answer gives 0.5 points, wrong or no answer deducts 0.5 points.
Q1 What is this construct called: X ← Z → Y? {Collider} {Pipe} {Fork} {Descendant}
Q2 What is this construct called: X → Z ← Y? {Collider} {Pipe} {Fork} {Descendant}
Q3 What is this construct called: X → Z → Y? {Collider} {Pipe} {Fork} {Descendant}
Q4 If we condition on Z we close the path X → Z ← Y. {True} {False}
Q5 What should we condition on? (DAG figure with nodes Z, A, X, Y not reproduced) {A} {Z} {Z and A} {Nothing}
Q6 What should we condition on? (DAG figure with nodes A, Z, X, Y not reproduced) {A} {Z} {Z and A} {Nothing}
Q7 What should we condition on? (DAG figure with nodes A, X, Z, Y not reproduced) {A} {Z} {Z and A} {Nothing}
Q8 How can overfitting be avoided? {cross-validation} {unregularizing priors} {Use a disinformation criterion}
Q9 Berkson’s paradox is when two events are. . . {correlated but should not be} {correlated but has no causal statement} {not correlated but should be}
Q10 An interaction is an influence of predictor. . . {conditional on parameter} {on a parameter} {conditional on other predictor}
Q11 What does an interaction look like in a DAG? {X ← Z → Y} {X → Z → Y} {X → Z ← Y}
Q12 To measure the same size of an effect in an interaction as in a main/population/β parameter requires, as a rule of thumb, at least a sample size that is. . . {4x larger} {16x larger} {16x smaller} {8x smaller}
Q13 We interpret interaction effects mainly through. . . {tables} {posterior means and standard deviations} {plots}
Q14 In Hamiltonian Monte Carlo, what does a divergent transition usually indicate? {A steep region in parameter space} {A flat region in parameter space} {Both}
Q15 Your high $\hat{R}$ values indicate that you have a non-stationary posterior. What do you do now? {Visually check your chains} {Run chains for longer} {Check effective sample size} {Check E-BFMI values}
Q16 What distribution maximizes this? H(p) = −∑_{i=1}^{n} p_i log(p_i) {Flattest} {Most complex} {Most structured} {Distribution that can happen the most ways}
Q17 What distribution to pick if it’s a real value in an interval? {Uniform} {Normal} {Multinomial}
Q18 What distribution to pick if it’s a real value with finite variance? {Gaussian} {Normal} {Binomial}
Q19 Dichotomous variables, varying probability? {Binomial} {Beta-Binomial} {Negative-Binomial/Gamma-Poisson}
Q20 Non-negative real value with a mean?
{Exponential} {Beta} {Half-Cauchy}
Q21 Natural value, positive, mean, and variance are equal? {Gamma-Poisson} {Multinomial} {Poisson}
Q22 We want to model probabilities? {Gamma} {Beta} {Delta}
Q23 Unordered (labeled) values? {Categorical} {Cumulative} {Nominal}
Q24 Why do we use link functions in a GLM? {Translate from likelihood to linear model} {Translate from linear model to likelihood} {To avoid absurd values}
Q25 On which effect scale are parameters? {Absolute} {None} {Relative}
Q26 On which effect scale are predictions? {Absolute} {None} {Relative}
Q27 We can use . . . to handle over-dispersion. {Beta-Binomial} {Negative-Binomial} {Exponential-Poisson}
Q28 In zero-inflated (zi) models we assume the data generation process consists of two disparate parts? {Yes} {No}
Q29 Ordered categories have. . . {a defined maximum and minimum} {continuous value} {an undefined order} {unknown ‘distances’ between categories}
Q30 When modeling ordered categorical predictors we can express them in math like this: β ∑_{j=0} δ_j. Cross the correct statement: {δ is the total effect of the predictor and β are proportions of total effect} {β is the total effect of the predictor and δ are proportions of total effect}
Question 1 : (3p) What is underfitting and overfitting and why does it lead to poor prediction? Write a short example (perhaps as a model specification?) and discuss what would be a possible condition for underfitting and overfitting.
Paper Solution: One straightforward way to discuss this is to show two simple models, one where we only have a general intercept (complete pooling), and one where you estimate an intercept for each cluster in the data (varying intercept), but where we don’t share information between the clusters (think of the pond example in the coursebook), i.e., no pooling. This question can be answered in many different ways.
y_i ∼ Binomial(n, p_i)
logit(p_i) = α
α ∼ Normal(0, 2.5)
above complete pooling, below no pooling
y_i ∼ Binomial(n, p_i)
logit(p_i) = α_CLUSTER[j]
α_j ∼ Normal(0, 2), for j = 1, . . ., n
Solution: Underfitting and overfitting are two common issues in machine learning and statistical modeling that can lead to poor prediction performance:
Underfitting: Underfitting occurs when a model is too simple to capture the underlying patterns in the data. It typically results in a high bias, meaning the model's predictions are systematically off the mark, either by consistently overestimating or underestimating the true values. Example: Model Specification: Linear regression with only one feature to predict house prices. Condition for Underfitting: If the housing market depends on multiple features such as square footage, number of bedrooms, and neighborhood, and you use a simple linear model with just one of these features, the model may underfit the data because it doesn't account for the complexity of the relationship. In this condition, the model is too simplistic to capture the nuances of the housing market, leading to underfitting. The model might predict the same price for all houses, which would result in poor predictions.
Overfitting: Overfitting occurs when a model is too complex and fits the noise in the data rather than the true underlying patterns. It typically results in a high variance, meaning the model performs well on the training data but poorly on unseen or new data. Example: Model Specification: A decision tree with many branches that perfectly fits all training data points.
Condition for Overfitting: If you create a very complex decision tree with too many branches and leaves, it might fit the training data points exactly, even the noise, by creating intricate rules for each data point. In this condition, the model is overly complex, fitting the training data to the point of memorization, but it doesn't generalize well to new data, leading to overfitting. As a result, the model might make poor predictions on unseen houses because it's too tailored to the training data. Both underfitting and overfitting can lead to poor prediction performance because they compromise the model's ability to generalize to new or unseen data. In machine learning, the goal is to find the right balance between model complexity and performance, ensuring that the model captures the underlying patterns without being overly simplistic or overly complex. Techniques like cross-validation and regularization can help mitigate these issues. Question 2 : (3p) What is epistemological justification and how does it differ from ontological justification, when we design models and pick likelihoods? Please provide an example where you argue epistemological and ontological reasons for selecting a likelihood. Paper Solution: Epistemological, rooted in information theory and the concept of maximum entropy distributions. Ontological, the way nature works, i.e., a physical assumption about the world. One example could be the Normal() likelihood for real (continuous) numbers where we have additive noise in the process (ontological), and where we know that in such cases Normal is the maximum entropy distribution. Solution: Epistemological justification and ontological justification are two distinct aspects of model design in statistics and data analysis, particularly when choosing likelihoods: Epistemological Justification: Epistemological justification is based on our knowledge, beliefs, and understanding of the data and the underlying processes we are modeling. It relates to what we know, think, or infer about the data generation process. Epistemological reasons for selecting a likelihood are driven by existing knowledge, theory, and information. Example - Epistemological Justification for Likelihood: Imagine you are modeling the time it takes for a website user to complete a certain task. Based on prior research and domain expertise, you know that user task completion times often follow a log-normal distribution. Your epistemological justification for selecting a log-normal likelihood is based on the understanding that this distribution aligns with your prior knowledge and observations of similar scenarios. Ontological Justification: Ontological justification, on the other hand, is based on the nature of reality, regardless of our knowledge or beliefs. It pertains to the actual properties and characteristics of the system or phenomenon being modeled. Ontological reasons for selecting a likelihood are driven by the belief that a particular distribution reflects the inherent properties of the data-generating process. Example - Ontological Justification for Likelihood: Consider modeling the number of defects in a manufacturing process. You have no prior knowledge or information, but you believe that defects are inherently rare events in the process. In this case, you might choose a Poisson distribution as the likelihood because it reflects the inherent property of rare events occurring independently over time, aligning with the nature of defects. 
In summary, the key difference lies in the source of justification: • • Epistemological Justification relies on what we know, our prior beliefs, and domain-specific knowledge to choose a likelihood. Ontological Justification is based on the belief that a particular likelihood reflects the inherent properties or characteristics of the data-generating process, regardless of prior knowledge or beliefs. It assumes that the chosen distribution is a fundamental aspect of the phenomenon being modeled. Question 3 : (2p) What are the two types of predictive checks? Why do we have to do these predictive checks? Paper Solution: Prior predictive checks: To check we have sane priors. Posterior predictive checks: To see if we have correctly captured the regular features of the data. Solution: The two types of predictive checks are posterior predictive checks and prior predictive checks. These checks are conducted in the context of Bayesian data analysis to assess the performance and validity of a Bayesian model. Posterior Predictive Checks: Posterior predictive checks involve generating new data points, often referred to as "replicates," using the posterior distribution of the model parameters. These replicates are simulated based on the estimated model and parameter uncertainty. The observed data are then compared to the replicates generated from the model. By assessing how well the observed data align with the replicated data, you can evaluate the model's ability to capture the underlying data-generating process. Discrepancies can reveal model misspecifications, bias, or other issues. Prior Predictive Checks: Prior predictive checks are similar to posterior predictive checks but are based on the prior distribution of the model parameters, without incorporating information from the observed data. Prior predictive checks help assess the appropriateness of the prior distribution, which captures your beliefs or expectations about the parameters before seeing any data. By simulating data from the prior and comparing it to the observed data, you can identify if the prior assumptions align with the actual data. Why We Do Predictive Checks: Predictive checks serve several crucial purposes in Bayesian data analysis: Model Assessment: Predictive checks allow us to assess the quality and validity of our Bayesian model. They help detect model inadequacies, including underfitting and overfitting, and identify potential misspecifications. Model Comparison: By comparing the observed data to the replicates or data generated under different models, we can choose the model that best describes the observed data. This is particularly important when considering multiple competing models. Quantifying Uncertainty: Predictive checks provide a way to quantify the uncertainty inherent in Bayesian analysis. They help us understand how well the model accounts for the inherent randomness in the data. Identifying Outliers: Predictive checks can help identify outliers or anomalies in the data. Unusual discrepancies between observed data and replicates may indicate data points that need further investigation. In summary, predictive checks are essential in Bayesian data analysis to assess model performance, make informed model choices, evaluate prior assumptions, and ensure that the model aligns with the data and the underlying data-generating process. Question 4 : (6p) When diagnosing Markov chains, we often look at three diagnostics to form an opinion of how well things have gone. Which three diagnostics do we commonly use? 
What do we look for (i.e., what thresholds or signs do we look for)? Finally, what do they tell us?
Paper Solution: Effective sample size: A measure of the efficiency of the chains. 10% of total sample size (after warmup) or at least 200 for each parameter. R-hat: A measure of the between and the within variance of our chains. Indicates if we have reached a stationary posterior probability distribution. Should approach 1.0 and be below 1.01. Traceplots: Check if the chains are mixing well. Are they converging around the same parameter space?
Solution: When diagnosing the performance of Markov chains, three common diagnostics are often used to assess how well the chains have converged and whether they have adequately sampled the target distribution. These diagnostics are:
Autocorrelation Function (ACF):
• What to Look For: In the ACF, we look for a rapid decay in autocorrelation as the lag increases. Ideally, we want to see the autocorrelations quickly drop close to zero for lags beyond a certain point.
• What It Tells Us: A rapidly decaying ACF suggests that the chain is efficiently exploring the parameter space and is producing independent samples from the target distribution. Slowly decaying autocorrelations indicate that the chain might be stuck, leading to inefficient sampling and longer mixing times.
Gelman-Rubin Diagnostic (R-hat or $\hat{R}$):
• What to Look For: The Gelman-Rubin diagnostic compares the within-chain variability to the between-chain variability for multiple chains started from different initial values. $\hat{R}$ should be close to 1 (e.g., less than 1.1) for all model parameters.
• What It Tells Us: An $\hat{R}$ close to 1 indicates that different chains have converged to the same target distribution, suggesting good convergence. Values significantly greater than 1.1 may indicate that the chains have not mixed well, and further sampling is needed.
Effective Sample Size (ESS):
• What to Look For: ESS measures the number of effectively independent samples that the chain provides. A higher ESS indicates a more efficient sampling process. Ideally, you want an ESS that is a substantial fraction of the total number of samples.
• What It Tells Us: A high ESS implies that the chain is providing a sufficient number of effectively independent samples, ensuring that you have enough information to estimate parameters and make valid inferences. A low ESS suggests that the samples are highly correlated and might not be representative of the target distribution.
In summary, these diagnostics are used to evaluate the convergence and performance of Markov chains in Bayesian analysis. A rapidly decaying ACF, a Gelman-Rubin diagnostic close to 1, and a high ESS are all indicative of well-behaved chains that have adequately explored the parameter space and are providing reliable samples from the target distribution. These diagnostics help ensure that the results obtained from Markov chains are valid and representative of the posterior distribution of interest.
Question 5 : (3p) Quadratic approximation and grid-approximation work very well in many cases; however, in the end we fall back on Hamiltonian Monte Carlo (HMC) many times. Why do we use HMC, i.e., what are the limitations of the other approaches? Additionally, how does HMC work conceptually, i.e., describe with your own words?
Paper Solution: Well, quap and other techniques assume a Normal likelihood.
Question 5 : (3p) Quadratic approximation and grid approximation work very well in many cases; however, in the end we often fall back on Hamiltonian Monte Carlo (HMC). Why do we use HMC, i.e., what are the limitations of the other approaches? Additionally, how does HMC work conceptually, i.e., describe it in your own words?
Paper Solution: Well, quap and similar techniques assume the posterior is (approximately) Normal. Additionally, they don't see levels in a model (they simply do some hill climbing to find the maximum a posteriori estimate), i.e., when a prior is itself a function of parameters, there are two levels of uncertainty. HMC simulates a physical system — the hockey puck — which it flicks around the log-posterior (the valleys are the interesting things, not the hills as in the case of quap).
Solution: Why use Hamiltonian Monte Carlo (HMC): HMC is a powerful and versatile sampling technique used in Bayesian data analysis. We use it primarily because it overcomes limitations of approaches like quadratic approximation and grid approximation:
• Handling complex and multimodal distributions: quadratic approximation methods, such as the Laplace approximation, are effective for simple, unimodal posteriors but may struggle with complex, multimodal ones. HMC is well suited to explore multi-peaked posteriors effectively.
• Reducing computational inefficiency: grid approximation evaluates the posterior over a grid of points, which becomes computationally infeasible as the number of parameters grows. HMC provides more efficient sampling by using gradient information to guide the process.
• Handling high-dimensional spaces: in high-dimensional parameter spaces, a quadratic approximation may not capture the curvature and dependencies accurately, leading to poor approximations. HMC leverages gradients to navigate high-dimensional spaces more effectively.
How HMC works conceptually: HMC is a Markov chain Monte Carlo (MCMC) method that combines ideas from classical physics with MCMC techniques:
• Mimicking a physical system: HMC treats the (negative log-)posterior as a landscape or energy surface and simulates a frictionless particle traveling on it; high-probability regions correspond to the valleys the particle tends to visit.
• Hamiltonian dynamics: the particle has a position (the current parameter values, tied to potential energy) and a momentum (tied to kinetic energy), which together define a Hamiltonian.
• Leapfrog integration: the dynamics are simulated numerically with the "leapfrog" scheme, which alternates small updates of the particle's momentum and position over a number of steps.
• Metropolis-Hastings step: after simulating a trajectory, the proposed end point is accepted or rejected based on the change in total energy, preserving detailed balance.
• Efficient, distant proposals: because it uses gradient information, HMC can take large steps through the parameter space, producing far less correlated samples than random-walk methods.
In summary, HMC combines principles from physics to explore the posterior efficiently by simulating a particle's movement on an energy landscape. It leverages gradient information to guide the exploration and can handle complex, high-dimensional, and multimodal distributions that other methods may struggle with.
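To illustrate the leapfrog idea concretely, here is a minimal sketch (my own, not course code) that runs a few leapfrog trajectories on a standard-normal log-posterior; the step size, number of steps, and starting value are arbitrary illustration choices, and the Metropolis accept/reject step is only indicated in a comment.

# Leapfrog integration for HMC on a 1-D standard normal target.
import numpy as np

# Target: standard normal, so U(q) = 0.5*q^2 and grad U(q) = q.
def grad_U(q):
    return q

def leapfrog(q, p, step_size=0.1, n_steps=20):
    """One HMC trajectory: alternate half/full updates of momentum and position."""
    p = p - 0.5 * step_size * grad_U(q)        # half step for momentum
    for _ in range(n_steps - 1):
        q = q + step_size * p                  # full step for position
        p = p - step_size * grad_U(q)          # full step for momentum
    q = q + step_size * p
    p = p - 0.5 * step_size * grad_U(q)        # final half step for momentum
    return q, -p                               # negate momentum for reversibility

rng = np.random.default_rng(0)
q = 3.0                                        # start far out in the tail
for i in range(3):
    q, p_end = leapfrog(q, rng.normal())
    # A full HMC sampler would apply a Metropolis accept/reject step here.
    print(f"trajectory {i}: ended at q = {q:.2f}")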
       WAIC   pWAIC   dWAIC   weight     SE     dSE
m1    361.9     3.8     0.0        1   14.26      NA
m2    401.8     2.6    40.9        0   11.28   10.48
m3    405.9     1.1    44.0        0   11.66   12.23
Question 6 : (4p) As a result of comparing three models, we get the above output. What does each column (WAIC, pWAIC, dWAIC, weight, SE, and dSE) mean? Which model would you select based on the output?
Paper Solution: WAIC: information criterion (the lower the better). pWAIC: effective number of parameters in the model. dWAIC: the difference in WAIC between each model and the best model. weight: an approximate way to indicate which model WAIC prefers. SE: standard error of the WAIC estimate — how much noise we are not capturing. dSE: the standard error of the difference in WAIC. If the dWAIC is approximately 4–6 times larger than the dSE you have fairly strong indications that there is a 'significant' difference. Here m1 is, relatively speaking, the best.
Solution: The output is the result of model comparison using the Widely Applicable Information Criterion (WAIC) and related statistics. Here is the interpretation of each column:
1. WAIC: a measure of the out-of-sample predictive accuracy of a model, i.e., how well the model is expected to predict new, unseen data. Lower WAIC values indicate better expected predictive performance.
2. pWAIC: the effective number of parameters, which acts as the overfitting penalty term inside WAIC. Larger values indicate a more flexible model.
3. dWAIC: the difference in WAIC between each model and the model with the lowest WAIC. It quantifies how much worse a model is expected to predict compared to the best-performing model.
4. weight: the model's relative support in the comparison — an approximate probability that each model will make the best predictions among those considered. Models with higher weights are preferred.
5. SE: the standard error of the WAIC estimate, i.e., the uncertainty associated with the WAIC calculation. Smaller SE values indicate more confidence in the WAIC estimate.
6. dSE: the standard error of the difference in WAIC (dWAIC) between a model and the best model. Smaller dSE values suggest greater confidence in the dWAIC estimate.
Model selection: You would typically select the model with the lowest WAIC. Here, model m1 has the lowest WAIC (361.9) and carries essentially all of the weight (1). The dWAIC values of m2 (40.9) and m3 (44.0) are several times larger than their associated dSE values (10.48 and 12.23), so the differences are unlikely to be noise. In summary, you would select model m1: it has the lowest WAIC and is strongly supported by the relative weight, while m2 and m3 are expected to predict new data considerably worse.
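As a small worked example of how the weight column relates to the WAIC values, the standard Akaike-weight formula applied to the three WAIC numbers above reproduces the reported weights (a sketch of mine, not the exam output):

# Akaike-style weights computed from the WAIC values in the table.
import numpy as np

waic = np.array([361.9, 401.8, 405.9])   # m1, m2, m3 from the table above
d_waic = waic - waic.min()               # 0.0, 40.9, 44.0
rel = np.exp(-0.5 * d_waic)              # relative plausibility of each model
weights = rel / rel.sum()
print(np.round(weights, 3))              # ~ [1., 0., 0.] -- all weight on m1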
Question 7 : (6p) The DAG below contains an exposure of interest X, an outcome of interest Y, an unobserved variable U, and three observed covariates (A, B, and C). Analyze the DAG.
a) Write down the paths connecting X to Y (not counting X -> Y)?
b) Which must be closed?
c) Which variable(s) should you condition on?
d) What are these constructs called in the DAG universe?
• U -> B <- C
• U <- A -> C
[DAG figure omitted: nodes A, (U), C, B, X, Y, where U is unobserved]
Paper Solution: First is X <- U <- A -> C -> Y, and second is X <- U -> B <- C -> Y. These are both backdoor paths that could confound inference. Now ask which of these paths is open. If a backdoor path is open, then we must close it. If a backdoor path is closed already, then we must not accidentally open it and create a confound. Consider the first path, passing through A. This path is open, because there is no collider within it — just a fork at the top and two pipes, one on each side. Information will flow through this path, confounding X -> Y. It is a backdoor. To shut this backdoor, we need to condition on one of its variables. We can't condition on U, since it is unobserved. That leaves A or C. Either will shut the backdoor, but conditioning on C is the better idea from the perspective of efficiency, since it could also help with the precision of the estimate of X -> Y (we'll give you points for conditioning on A also). Now consider the second path, passing through B. This path does contain a collider, U -> B <- C. It is therefore already closed. If we do condition on B, it will open the path, creating a confound. Then our inference about X -> Y will change, but without the DAG, we won't know whether that change is helping us or rather misleading us. The fact that including a variable changes the X -> Y coefficient does not always mean that the coefficient is better now. You could have just conditioned on a collider! The first construct is a collider: if you condition on B you open up the path. The second construct is a confounder (fork): if you condition on A you close the path.
Solution:
a) Paths connecting X to Y (excluding X -> Y):
• X <- U <- A -> C -> Y
• X <- U -> B <- C -> Y
b) Which must be closed: both are backdoor paths. The first path (through A and C) contains only a fork and pipes, so it is open and must be closed. The second path (through B) contains the collider U -> B <- C and is therefore already closed — we must be careful not to open it by conditioning on B.
c) Variables to condition on: condition on C (or, alternatively, A). C is preferable for efficiency, since it is also a parent of Y and can improve the precision of the X -> Y estimate. Do not condition on B, and U cannot be conditioned on since it is unobserved.
d) Constructs in the DAG universe: U -> B <- C is a collider — conditioning on B would make U and C dependent and open a non-causal path. U <- A -> C is a fork (a confounding structure) — conditioning on A (or C) closes that path.
In summary, to estimate the causal effect of X on Y while blocking the confounding backdoor paths, you should condition on C (or A) and leave B alone.
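A small simulation sketch of my own (not part of the exam) of why conditioning on the collider B is harmful: U and C are independent by construction, yet within a slice of B they become associated. The coefficients and the B > 1 slice are arbitrary illustration choices.

# Collider bias (Berkson-style) in the structure U -> B <- C.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
U = rng.normal(size=n)              # unobserved cause
C = rng.normal(size=n)              # independent of U by construction
B = U + C + rng.normal(size=n)      # collider: U -> B <- C

print("corr(U, C) overall:    ", round(np.corrcoef(U, C)[0, 1], 3))
sel = B > 1.0                       # 'conditioning' on B by selecting a slice
print("corr(U, C) given B > 1:", round(np.corrcoef(U[sel], C[sel])[0, 1], 3))
# Overall correlation is ~0, but within the slice it is clearly negative:
# conditioning on a collider opens a path and creates a spurious association.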
Question 8 : (8p) Conducting Bayesian data analysis requires us to do several things for us to trust the results. Can you describe the workflow and what the purpose is with each step?
Paper Solution: Start with a null model, do prior checks, check diagnostics, do posterior checks, then conduct inferential statistics if needed. Also, it would be good if you mention comparisons of models and that it is an iterative approach.
Solution: Conducting Bayesian data analysis is a systematic process that involves several key steps to ensure the trustworthiness of the results. A workflow with the purpose of each step:
• Define the research question and objectives: clearly articulate the research question and what the analysis should achieve; this sets the foundation for everything that follows.
• Data collection and preprocessing: gather relevant data and clean and preprocess it as needed; high-quality data is crucial for accurate modeling.
• Null model specification: start with a null model (often the simplest model) to establish a baseline against which more complex models can be compared.
• Prior specification: define prior distributions for the model parameters based on prior knowledge or expert judgment; priors encode existing information and beliefs about the parameters.
• Prior (predictive) checks: evaluate whether the chosen priors are appropriate, i.e., whether they generate data on a plausible scale, and catch overly informative or nonsensical priors.
• Model building and selection: develop the Bayesian model that describes the data-generating process — likelihood, link functions, and relationships among variables — and consider candidate models.
• Bayesian inference: use techniques such as Markov chain Monte Carlo (MCMC) or variational inference to estimate the posterior distribution of the parameters, providing estimates and their uncertainty.
• Model diagnostics: assess convergence and sampling efficiency of the MCMC algorithm using trace plots, autocorrelation, effective sample size, and the Gelman-Rubin statistic (R-hat), to ensure the samples are reliable.
• Posterior (predictive) checks: evaluate the fit of the model by simulating new data from the posterior distribution and comparing it to the observed data.
• Comparison of models: compare different models, e.g., using the Widely Applicable Information Criterion (WAIC), to assess out-of-sample predictive performance and determine which model best fits the data.
• Iterative approach: Bayesian data analysis is often an iterative process; revisit earlier steps, adjust priors, or refine models based on diagnostics and model comparisons.
• Inferential statistics and reporting: if the goal is hypothesis testing or parameter estimation, draw the relevant inferences and summarize the findings (point estimates, credible intervals) in a clear and interpretable manner.
• Interpretation and decision-making: interpret the results in the context of the research question and objectives, and use the analysis to inform decisions or scientific conclusions.
• Communication: communicate the findings to relevant stakeholders, colleagues, or the broader scientific community through reports, papers, or presentations.
The purpose of this workflow is to ensure a systematic and rigorous approach to Bayesian data analysis, from defining the research question to drawing meaningful conclusions. It emphasizes the importance of model specification, prior choices, diagnostics, and model comparisons to build confidence in the results and facilitate robust decision-making.
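A compressed, runnable sketch of that loop for a toy model of my own choosing (a Normal mean with known standard deviation, fitted by grid approximation; all numbers are made up), showing prior check, inference, and posterior check in sequence:

# Mini workflow: y ~ Normal(mu, 1), mu ~ Normal(0, 1), fitted on a grid.
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(0.7, 1.0, size=25)                 # 'collected' data

mu_grid = np.linspace(-3, 3, 1001)
prior = np.exp(-0.5 * mu_grid**2)                 # Normal(0, 1) prior, unnormalized

# Prior predictive check: does the prior imply data on a plausible scale?
mu_prior = rng.choice(mu_grid, p=prior / prior.sum(), size=2000)
y_prior_rep = rng.normal(mu_prior, 1.0)
print("prior predictive 5%/95%:", np.percentile(y_prior_rep, [5, 95]).round(2))

# Inference by grid approximation: posterior is proportional to prior * likelihood.
loglik = np.array([np.sum(-0.5 * (y - m) ** 2) for m in mu_grid])
post = prior * np.exp(loglik - loglik.max())
post /= post.sum()

# Posterior predictive check: compare replicated sample means to the observed mean.
mu_post = rng.choice(mu_grid, p=post, size=2000)
y_rep_means = rng.normal(mu_post, 1.0 / np.sqrt(len(y)))
print("observed mean:", y.mean().round(2),
      "| 90% of replicated means:", np.percentile(y_rep_means, [5, 95]).round(2))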
Question 9 : (5p) To set correct insurance premiums, a car insurance firm is analyzing past data from several drivers. An example dataset is shown below:
Driver   Age   Experience   Number of Incidents
1        45    High         1
2        29    Low          3
3        26    Medium       1
4        50    Medium       0
...      ...   ...          ...
With this data, the firm wants to use Bayesian data analysis to predict the number of accidents that are likely to happen given the age of the driver and their experience level (low, medium, and high).
• Write down a mathematical model definition for this prediction using any variable names and priors of your choice.
• State the ontological and epistemological reasons for your likelihood.
Remember to clearly state and justify the choices and assumptions regarding your model.
Paper Solution: A math model (following the math notation, with sub-i sprinkled over the model) should be given. They'll hopefully go for a Poisson or Negative-Binomial/Gamma-Poisson likelihood.
Solution: To predict the number of accidents based on the age of the driver and their experience level, we can specify the following statistical model.
Variables:
• Ai is the age of driver i.
• Ei is the experience level of driver i (categorical: Low, Medium, High).
• Ni is the number of accidents for driver i.
We aim to model the number of accidents Ni as a function of age Ai and experience level Ei.
Likelihood: We assume the number of accidents follows a Poisson distribution, which is suitable for modeling counts of relatively rare events like accidents:
Ni ∼ Poisson(λi)
log(λi) = β0 + βA · Ai + βE[Ei]
Here λi is the accident rate for driver i, β0 is the intercept, βA is the coefficient for age, and βE[Ei] is the coefficient for the driver's experience level. The choice of a Poisson likelihood is based on the ontological assumption that accident counts are non-negative and can be modeled as rare events.
Priors: We need priors for the model parameters β0, βA, and βE:
β0 ∼ Normal(0, 10²)   (a weakly informative prior for the intercept)
βA ∼ Normal(0, 10²)   (a weakly informative prior for the age coefficient)
βE[k] ∼ Normal(0, 10²)   (a weakly informative prior for each experience-level coefficient)
We choose weakly informative priors to let the data inform the parameter estimates, while also allowing for reasonable values.
Ontological and Epistemological Reasons for the Likelihood:
• Ontological Justification: The choice of a Poisson likelihood is based on the nature of the data. The count of accidents is a non-negative integer, and the Poisson distribution is commonly used to model the frequency of rare events, which is often the case with accidents.
• Epistemological Justification: The assumption that accidents follow a Poisson distribution is based on the understanding that accident rates can vary by age and experience level. By using the Poisson likelihood with appropriate predictors, we aim to capture this variability and make predictions based on available data. In summary, this Bayesian model uses a Poisson likelihood to predict the number of accidents based on driver age and experience level. Weakly informative priors are chosen to allow data-driven parameter estimation while respecting the nature of the variables and events being modeled. The likelihood and priors are chosen based on both ontological considerations (nature of accident data) and epistemological reasoning (predictive accuracy and data-driven modeling). Question 10 : (4p) What is the purpose and limitation of using field study and field experiment as a research strategy? Paper Solution: Field studies: Specific and real-world setting (natural settings). Unobtrusiveness: Researcher does not change/control parameters or variables Goal: Understand phenomena in concrete and realistic settings. Explore and generate new hypothesis. Low statistical generalizability (claim analytical generalizability!) Low precision of the measurement of the behavior If the data is collected in a field setting that does NOT necessarily imply that a study is a field study. Examples of research methods: Case study, Ethnographic study, Observational study. Mostly qualitative but may include quantitative data. Field experiment: Specific and real-world setting (natural settings). Obtrusiveness: Researcher manipulates some variables/properties to observe an effect. Researcher does not control the setting. Goal: Investigate/evaluate/compare techniques in concrete and realistic settings. Low statistical generalizability (claim analytical generalizability!) Low precision of the measurement of the behavior (confounding factors) Not the same as an experimental study. Changes are made and results observed. But does not have control over the natural setting. Correlation is observed. But not causation, generally speaking. Examples of research methods: Evaluative case study, quasi-experiment, action research. Both qualitative and quantitative data. Solution: Using field studies and field experiments as research strategies in social sciences and other disciplines have distinct purposes and limitations: Purpose of Field Studies: • • • Real-World Context: Field studies aim to investigate phenomena within their natural, real-world context. Researchers observe and gather data from the environment where the events naturally occur. This contextuality allows for a deep understanding of the subject under investigation. External Validity: Field studies often have high external validity. Findings from field studies are more likely to generalize to real-world situations and populations because they are conducted in natural settings. Exploration and Description: Field studies are valuable for exploratory research and generating hypotheses. They can reveal new insights and provide rich descriptions of complex social phenomena. Limitations of Field Studies: • • • Lack of Control: One of the primary limitations is the lack of experimental control. Researchers cannot manipulate variables as they would in a controlled laboratory setting. This lack of control makes it challenging to establish causation definitively. 
Confounding Variables: Field studies are susceptible to confounding variables that can obscure the relationships being studied. The complex and dynamic nature of the real world can introduce many uncontrolled variables. Observational Bias: Researchers in field studies may introduce observational bias or misinterpretation due to their presence in the environment. Participants may also alter their behavior when aware of being observed (the "observer effect"). Purpose of Field Experiments: • • • Balancing Control and Realism: Field experiments combine the benefits of experimental control with the realism of natural settings. Researchers can manipulate independent variables while studying participants in their everyday environments. Causation: Field experiments aim to establish causal relationships. By manipulating variables and randomizing treatments, researchers can draw stronger conclusions about cause-and-effect relationships. Applied Research: Field experiments are often used in applied research to test interventions, policies, or practical solutions in real-world contexts. The findings have direct relevance and application. Limitations of Field Experiments: • • • Practical Constraints: Field experiments can be logistically challenging and costly. Researchers must navigate real-world complexities, making the execution of experiments more difficult. Limited Control: While field experiments offer more control than pure field studies, control is still limited compared to laboratory experiments. Researchers must accept some level of external interference. Generalizability: Field experiments may have lower internal validity compared to laboratory experiments due to the limitations on control. Researchers must carefully consider the trade-off between internal and external validity. In summary, the purpose of field studies is to gain a deep understanding of real-world phenomena, while the purpose of field experiments is to establish causal relationships while maintaining some level of control. The main limitation of both strategies is the trade-off between control and realism, as well as potential challenges related to confounding variables and observational bias in field studies. Researchers should choose the strategy that best aligns with their research goals and context. Question 10.1: (6p) What is the purpose and limitations of using Field Experiments and Field Studies as a research strategy? Provide examples, i.e., methods for each of the two categories, and clarify if one use mostly qualitative or quantitative approaches (or both). Paper Solution: Field Experiments: Natural settings. • Obtrusiveness: Researcher manipulates some variables/properties to observe an effect. Researcher does not control the setting. • Goal: Investigate/Evaluate/Compare techniques in concrete and realistic settings. • Limitations: Low statistical generalizability (claim analytical generalizability!) Fairly low precision of the measurement of the behavior (confounding factors). • Not the same as an experimental study (changes are made and results observed), but does not have control over the natural setting. • Correlation observed but not causation. Examples: Evaluative case study, Quasi-experiment, and Action research. Often qualitative and quantitative. Field Studies: Natural settings. • Obtrusiveness: Researcher does not change/control parameters or variables. • Goal: Understand phenomena in concrete and realistic settings. Explore and generate new hypothesis. 
• Limitations: Low statistical generalizability (claim analytical generalizability!) Likely low precision of the measurement of the behavior. Examples: Case study, Ethnographic study, and Observational study. Often qualitative. Solution: Field experiments and field studies are research strategies used to investigate real-world phenomena in their natural settings. Each approach has specific purposes, advantages, and limitations. Here's an explanation of the purpose and limitations of using field experiments and field studies, along with examples and the qualitative and quantitative approaches commonly associated with each: Field Experiments: Purpose: • Field experiments aim to establish cause-and-effect relationships by manipulating one or more variables in a natural or field setting. • The primary purpose is to assess the impact of an intervention, treatment, or manipulation on the dependent variable while maintaining ecological validity. Advantages: • High external validity: Findings are applicable to real-world situations. • Allows for causal inference: The manipulation of variables helps establish causal relationships. • Often produces quantitative data, making it suitable for statistical analysis. Limitations: • Limited control: Researchers have less control over confounding variables in field settings. • Ethical and logistical constraints: Conducting field experiments may involve ethical concerns or practical challenges. • Potential for observer effects: The awareness of being observed may affect participants' behavior. Examples: Evaluative case study, Quasi-experiment, and Action research. Often qualitative and quantitative. Field Studies: Purpose: • Field studies are observational investigations conducted in real-world settings without direct manipulation of variables. • The primary purpose is to understand and describe natural behaviors, processes, or phenomena. Advantages: • High ecological validity: Findings are reflective of real-world conditions. • Can explore complex and multifaceted phenomena in their natural context. • Often flexible, allowing for a mix of qualitative and quantitative data collection. Limitations: • Limited control over variables: Researchers cannot manipulate variables as in experiments. • Observational bias: The presence of researchers may influence or bias the behavior of participants. • May require substantial time and resources for data collection. Examples: Case study, Ethnographic study, and Observational study. Often qualitative. Approaches: • Field experiments often rely on quantitative approaches to measure and analyze the effects of interventions on dependent variables. Statistical methods are commonly used to analyze the data. • Field studies can use both qualitative and quantitative approaches. Qualitative methods such as observations, interviews, and content analysis help explore complex phenomena, while quantitative methods like surveys or measurements can provide quantifiable data for analysis. In summary, field experiments are valuable for establishing causality and often employ quantitative approaches, while field studies aim to understand and describe real-world phenomena, making use of both qualitative and quantitative data collection methods. Both approaches offer high ecological validity but differ in terms of control and the level of manipulation of variables. Question 11 : (6p) Read parts of the paper appended to this exam (you should know which parts to focus on and not read everything!) 
and answer the questions, • Which of the eight research strategies presented in the ABC framework does this paper fit? Justify and argue! • Can you present an overview of the research method employed? • Can you argue the main validity threats of the paper? – It would be very good if you can list threats in the four common categories we usually work with in software engineering, i.e., internal, external, construct, and conclusion validity threats. Paper Solution: The students can argue for several research strategies since the authors of the paper could’ve been more explicit. Once they’ve argued for one or a few strategies, they need to explain what the strategy they picked does. Finally, they should think about threats to this study (we’ve covered internal, external, conclusion, and construct validity threats for different research strategies/approaches). Solution: Research Strategy: First, identify the research strategy that the paper follows. While it may not be explicitly mentioned, you can make an educated argument based on the paper's content. Consider the following common research strategies used in software engineering: • Experiment: If the paper involves controlled experiments to test hypotheses or interventions, it falls into the experimental research strategy. • Survey: If the paper collects data through surveys or questionnaires to understand the opinions or practices of a target group, it fits the survey research strategy. • Case Study: If the paper extensively examines one or a few cases in-depth to draw insights or general principles, it is a case study research strategy. Justify your choice by explaining how the paper aligns with the characteristics of the selected research strategy. Research Method Overview: Provide an overview of the research method employed in the paper. Describe how the study was conducted, including data collection techniques, sampling methods, and any interventions or treatments applied. Explain the key steps and procedures that were used to gather and analyze data. Validity Threats: Identify and discuss potential validity threats in the paper. Consider the following categories of validity threats: • • • • Internal Validity: Address potential issues related to the design and execution of the study. Look for factors that could influence the study's results or conclusions, such as confounding variables, selection bias, or issues with the experimental setup. External Validity: Examine the extent to which the study's findings can be generalized to broader populations, settings, or timeframes. Consider factors that might limit the external validity, such as a specific participant group or context. Construct Validity: Assess the validity of the measurements, tools, or constructs used in the study. Consider whether the paper accurately measures and defines the variables or concepts of interest. Conclusion Validity: Evaluate the soundness of the conclusions drawn from the study's results. Examine whether the statistical analyses and interpretations are appropriate and justified. Question 12 : (20p) A common research method in software engineering that is used to complement other methods is survey research. Here follows a number of questions connected to survey research: 1. A survey can be supervised or unsupervised. What’s the difference? (2p) 2. Surveys can either intervene at one point, or repeatedly over time. What are these two types of surveys usually called? 
(1p) (a) When we do intervene repeatedly over time we often see two common study designs being used (tip: it has to do with the sample you use). (1p) 3. Discuss the weaknesses and strengths with convenience sampling. (2p) 4. We often differ between reliability and validity concerning surveys, (a) What is the difference between the reliability and validity in survey design? (2p) (b) Name and describe at least two types of reliability in survey design. (3p) (c) Name and describe at least two types of validity in survey design. (3p) 5. Even if you measure and estimate reliability and validity you still want to evaluate the survey instrument. Which are the two (2) common ways of evaluating a survey instrument? Explain their differences. (3p) 6. Likert scale questions are common in survey designs. If you want to use such a variable as a predictor in your statistical model then you need to set a suitable prior, e.g., not Normal prior. Which one should you preferably use and why? (3p) Paper Solution: 1. Supervised: Assigning a survey researcher to one or a group of respondents. Unsupervised: E.g., mailed questionnaire. 2. Cross-sectional vs. longitudinal 2a. Cohort and panel studies 3. Strength: It’s convenient and we usually get decent sample sizes. Weaknesses: It’s not random so confounders will be lurking. 4a. Reliability: A survey is reliable if we administer it many times and get roughly the same distribution of results each time. Validity: How well the survey instrument measures what it is supposed to measure? 4b. e.g., Test-retest (intra-observer), alternate form reliability, internal consistency, interobserver (Krippendorff’s ). 4c. e.g., content validity, criterion validity, construct validity. 5. Focus groups: Assemble a group of people representing respondents, experts, beneficiaries of survey results. Ask them to complete the survey questionnaire. The group identifies missing or unnecessary questions, ambiguous questions and instructions. Pilot studies: Uses the same procedures as planned for the main study but with smaller sample. Targets not only problems with questions but also response rate and reliability assessment. 6. A Dirichlet(α) prior. The Dirichlet distribution is the multivariate extension of the Beta distribution. The Dirichlet is a distribution for probabilities, values between zero and one that all sum to one. The beta is a distribution for two probabilities. The Dirichlet is a distribution for any number. This way we model the probability separately for each category in our Likert question and we do not assume a Gaussian distribution. Solution: 1. Supervised Survey: o Assigning a survey researcher to one or a group of respondents. o In a supervised survey, a surveyor or interviewer actively interacts with respondents. o The surveyor asks questions, records responses, and ensures that the survey process is managed and guided by a human facilitator. o Respondents are typically aware of the surveyor's presence and may receive assistance or clarifications if they have questions about the survey. o Supervised surveys are commonly used in situations where it is important to control the survey process, ensure data quality, and obtain detailed information from respondents through structured or semi-structured interviews. Unsupervised Survey: o E.g., mailed questionnaire o In an unsupervised survey, respondents complete the survey independently without direct interaction with a surveyor or interviewer. 
o Unsupervised surveys are often self-administered, such as online questionnaires, paper-and-pencil surveys, or automated phone surveys. o Respondents are responsible for reading and understanding the survey questions, providing answers, and submitting their responses without real-time guidance or assistance from a surveyor. o Unsupervised surveys are cost-effective and can be used for large-scale data collection, but they rely on respondents' ability to understand and navigate the survey on their own. 2. Surveys that intervene at one point in time are typically called "cross-sectional surveys." These surveys collect data from a sample of respondents at a single moment in time. Surveys that repeatedly collect data from the same sample over time are usually referred to as "longitudinal surveys" or "panel surveys." These surveys are designed to track changes and developments within the same group of respondents over multiple time points, allowing researchers to study trends and variations over time. 2a. When we intervene repeatedly over time, two common study designs that are often used are: • Panel Study: In a panel study, the same set of individuals or entities (the panel) is surveyed or observed at multiple time points. This design allows researchers to track changes within the same group over time and assess individual or group-level variations. • Cohort Study: In a cohort study, participants are grouped into cohorts based on shared characteristics, such as birth year, exposure to a specific event, or a particular attribute. These cohorts are then followed over time to observe how the characteristics or outcomes of interest change within each group. 3. Strengths: • Ease and Speed: Convenience sampling is simple and quick to implement. Researchers can easily gather data by selecting participants who are readily available and willing to participate. This makes it a cost-effective and efficient method, particularly for exploratory research or when time and resources are limited. • Reduced Costs: Since researchers choose participants based on convenience, it often results in cost savings, as there's no need for complex sampling procedures, extensive recruitment efforts, or incentives. • Practicality: In situations where a random or stratified sample is challenging to obtain, such as during a crisis or in small or hard-to-reach populations, convenience sampling may be the only feasible option. Weaknesses: • Selection Bias: One of the primary weaknesses of convenience sampling is selection bias. The sample is not representative of the entire population, as it's based on the availability of participants. This can lead to skewed or ungeneralizable results, making it inappropriate for making population-wide inferences. • Limited Generalizability: Because of the non-random nature of participant selection, findings from convenience samples are less likely to be applicable to the broader population. Researchers cannot confidently generalize the results to larger groups, which limits the external validity of the study. • Potential for Overrepresentation: Certain groups or demographics may be more accessible and, therefore, overrepresented in convenience samples. This can lead to an unbalanced sample that does not reflect the diversity of the population. • Sampling Error: The lack of a random sampling process means that the sample may not accurately represent the population. 
As a result, any statistical analysis or conclusions drawn from convenience samples may be subject to significant sampling error. • Researcher's Bias: Researchers may inadvertently introduce their own bias when selecting participants, as they may choose those they find more convenient or agreeable, potentially influencing the research findings. 4a. Reliability refers to the consistency, stability, or repeatability of survey measurements. It assesses whether the same survey, when administered to the same group or individual under similar conditions, produces consistent results over time. In other words, reliable surveys should yield the same or very similar responses when the same questions are repeated. Reliability is concerned with measurement error. Validity, on the other hand, refers to the accuracy and truthfulness of survey measurements. It assesses whether a survey accurately measures what it is intended to measure. Validity answers the question of whether the survey items and questions are relevant and appropriate for capturing the construct or concept under investigation. Valid surveys should measure what they claim to measure. In essence, reliability is about the consistency of survey results, while validity is about the truthfulness and appropriateness of those results. 4b. Test-Retest Reliability: Test-retest reliability assesses the consistency of survey results over time. To measure it, the same survey is administered to the same group of participants on two separate occasions. High test-retest reliability implies that responses are consistent and stable over time. Internal Consistency Reliability: This type of reliability assesses the consistency of responses within a single survey. One common measure of internal consistency is Cronbach's alpha, which evaluates how closely related survey items or questions are to each other. High internal consistency reliability suggests that the survey items are measuring the same underlying construct. 4c. Content Validity: Content validity assesses whether the survey items cover the full range of content or aspects of the construct being measured. To establish content validity, experts review the survey to ensure it includes all relevant dimensions of the construct. This type of validity is concerned with the comprehensiveness of the survey. Construct Validity: Construct validity assesses whether the survey accurately measures the intended construct or concept. Researchers use various methods, such as factor analysis or convergent and discriminant validity testing, to demonstrate that survey items are related to the construct they are meant to measure. High construct validity indicates that the survey effectively captures the underlying concept. 5. Pilot Testing: o Explanation: Pilot testing involves administering the survey to a small group of participants who are similar to the target population. The primary goal is to identify and address any issues or problems with the survey instrument before its full-scale deployment. Participants in pilot testing are asked to complete the survey and provide feedback on their experience. o Differences: ▪ Purpose: Pilot testing focuses on uncovering practical issues, such as unclear or confusing questions, technical glitches, and the overall survey administration process. It aims to fine-tune the survey instrument to enhance its usability and data quality. ▪ Sample Size: Pilot testing typically involves a small sample of participants, often ranging from 10 to 30 individuals. 
The sample size is not intended to be representative of the entire target population but is sufficient for identifying instrument-related issues. ▪ Feedback: Participants in pilot testing are encouraged to provide qualitative feedback, suggesting improvements, pointing out ambiguities, and reporting any difficulties they encountered. ▪ Timing: Pilot testing occurs in the pre-data collection phase, allowing researchers to make necessary revisions before the full survey is conducted. Psychometric Analysis: o Explanation: Psychometric analysis involves using statistical methods to assess the reliability and validity of the survey instrument. This includes examining the internal consistency of survey items, conducting factor analyses to confirm construct validity, and using item response theory (IRT) to evaluate item discrimination and difficulty. o Differences: ▪ Purpose: Psychometric analysis focuses on the measurement properties of the survey instrument, such as its reliability, content validity, construct validity, and item performance. It aims to establish the instrument's credibility as a valid and reliable measurement tool. ▪ Sample Size: Psychometric analysis typically requires a larger sample size, often representative of the target population, to perform statistical tests and analyses. The sample size is critical for ensuring the reliability and validity of the instrument. ▪ Quantitative Analysis: Psychometric analysis relies on quantitative data and statistical techniques. Researchers use statistical tests and measures to assess the properties of survey items and the overall instrument. ▪ Timing: Psychometric analysis is conducted after data collection, using responses from the actual survey participants. This analysis confirms the quality of the survey instrument based on empirical data. 6. When using a Likert scale variable as a predictor in a statistical model, it's often advisable to use an appropriate prior distribution, and the t-distribution prior can be a suitable choice. The reason for preferring a t-distribution prior over a normal prior is based on several considerations: o Robustness to Outliers: Likert scale data can sometimes contain outliers or exhibit skewness, especially in smaller sample sizes. The t-distribution is more robust to outliers and heavy-tailed, making it a better choice when dealing with non-normally distributed or skewed data. It allows the model to accommodate the potential variability and extreme values in the Likert scale. o Flexibility: The t-distribution has an additional degree of freedom parameter (ν) that allows you to control the tail behavior of the distribution. When ν is set to a large value, the t-distribution closely approximates a normal distribution. However, as ν decreases, the distribution becomes heavier-tailed, providing more flexibility to capture deviations from normality. o Incorporating Uncertainty: Likert scale data often has a level of measurement uncertainty or measurement error. The t-distribution naturally incorporates this uncertainty, allowing for more realistic modeling of the data's inherent variability. o Better Handling of Small Samples: In cases where the sample size is relatively small, the t-distribution is more appropriate because it produces wider credible intervals compared to a normal distribution. This reflects the greater uncertainty in parameter estimates when working with limited data. 
o Prior Elicitation: The t-distribution is particularly useful when prior information about the population parameters is vague or uncertain. It acts as a way to express the researcher's uncertainty about the prior distribution.
Question 13: (7p) Finish the following statements,
1. Crap! I got 15 divergent transitions, now I probably need to. . .
2. No! I have R̂ > 1.6, now I need to. . .
3. My posterior predictive check shows that I don't capture the regular features of the data. I probably need to. . .
4. My traceplots look like a mess. They are all mixed up and look like hairy caterpillars. I guess that. . .
5. My DAG has a pipe between X (intervention) and Y (outcome). I should probably. . .
6. The posterior is proportional to the . . .
7. (see figure above) Hmm. . . my prior predictive check looks a bit funky. I probably need to. . .
Paper Solution: 1. . . . reparameterize the model or be more careful with the prior predictive checks 2. . . . run the chains for more iterations 3. . . . take into account more aspects (dispersion?) by extending the model 4. . . . it's ok! 5. . . . not condition on the thing in the middle, i.e., the 'pipe' 6. . . . prior times the likelihood 7. . . . tighten my prior [redo the prior predictive check]
Solution:
1. Crap! I got 15 divergent transitions; now I probably need to review the model's parameterization and geometry. Divergent transitions often indicate problematic posterior geometry, so I may need to reparameterize the model (e.g., use a non-centered parameterization) or choose more sensible priors guided by prior predictive checks.
2. No! I have R̂ > 1.6; now I need to investigate convergence. Such a high R̂ means the chains have not reached the same stationary distribution, so I should run the chains for more iterations, inspect them for problems, and adjust the model if necessary.
3. My posterior predictive check shows that I don't capture the regular features of the data. I probably need to extend or revise the model — for example, account for over-dispersion, add relevant predictors, or reconsider the choice of likelihood.
4. My traceplots look like a mess — all mixed up like hairy caterpillars. I guess that everything is fine: well-mixed, caterpillar-like traceplots that overlap and explore the same region of parameter space are exactly what healthy chains look like.
5. My DAG has a pipe between X (intervention) and Y (outcome). I should probably not condition on the variable in the middle of the pipe (the mediator); conditioning on it would block part of the effect of X on Y that I want to estimate.
6. The posterior is proportional to the product of the likelihood and the prior. In Bayesian analysis, the posterior distribution represents updated beliefs about the parameters given both the observed data (through the likelihood) and the prior information.
7. (see figure above) Hmm... my prior predictive check looks a bit funky. I probably need to tighten (re-specify) my priors so that they produce data on a plausible scale, and then redo the prior predictive check.
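Statement 5 can be made concrete with a small simulation of my own (not part of the exam): in a pipe X -> Z -> Y, adding the mediator Z to the regression removes the very effect of X we wanted. The coefficients and noise are arbitrary illustration choices.

# Conditioning on the middle of a pipe blocks the effect of X on Y.
import numpy as np

rng = np.random.default_rng(5)
n = 50_000
X = rng.normal(size=n)
Z = 0.8 * X + rng.normal(size=n)          # mediator
Y = 0.7 * Z + rng.normal(size=n)          # X affects Y only through Z

def ols(y, *cols):
    """Least-squares coefficients for y on an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(A, y, rcond=None)[0]

print("Y ~ X     :", ols(Y, X).round(2))       # slope ~ 0.56 (= 0.8 * 0.7)
print("Y ~ X + Z :", ols(Y, X, Z).round(2))    # X slope ~ 0, Z slope ~ 0.7
# Stratifying on the mediator hides the total effect of the intervention X.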
Question 14: (3p) Rewrite the following model as a multilevel model.
yi ∼ Binomial(n, pi)
logit(pi) = αGROUP[i] + β xi
αGROUP[j] ∼ Normal(0, 1.5)
β ∼ Normal(0, 0.5)
Paper Solution: All that is really required to convert the model to a multilevel model is to take the prior for the vector of intercepts, αGROUP, and make it adaptive. This means we define parameters for its mean and standard deviation. Then we assign these two new parameters their own priors, hyperpriors. This is what it could look like:
yi ∼ Binomial(n, pi)
logit(pi) = αGROUP[i] + β xi
αGROUP[j] ∼ Normal(ᾱ, σα)
β ∼ Normal(0, 1)
ᾱ ∼ Normal(0, 1.5)
σα ∼ Exponential(1)
The exact hyperpriors you assign don't matter here. Since this problem has no data context, it isn't really possible to say what sensible priors would be.
Question 15: (10p or even −10p) Let us finally end with some multiple choice questions! You get one point if you answer a question correctly (one or more choices can be correct!). You get −1 point if you fail to answer the question correctly (if you don't answer you get 0 points). Simply put circle(s) around what you think is/are the correct answer(s)!
Q1: What is/are the premises for epistemological justifications when we design our models? a) Information theory b) Maximum entropy c) None of the answers
Q2: Adding predictors and parameters to a model can have which of the following impact(s)? a) Complex models b) Models that overfit c) Improved estimates
Q3: Why do we use posterior predictive checks? a) Check that model captures regular features of data b) Check that our priors are sane c) Check that we have a perfect fit d) Check the effect priors have on outcome
Q4: Which of the following strategies are in a non-empirical setting? a) Sample studies b) Computer simulations c) Field studies d) Laboratory experiments
Q5: In the model definition below, which line is the likelihood?
µ ∼ Normal(0, 10)      (1)
yi ∼ Normal(µi, σ)     (2)
σ ∼ Exponential(1)     (3)
a) Line 1 b) Line 2 c) Line 3 d) None of the answers
Q6: In the model definition below, how many parameters are in the posterior distribution?
yi ∼ Normal(µi, σ)
µi = α + β xi
α ∼ Normal(0, 10)
β ∼ Normal(0, 1)
σ ∼ Exponential(2)
a) 1 b) 2 c) 3 d) 4
Q7: Which mechanisms can make multiple regression produce false inferences about causal effects? a) Confounding b) Multi-collinearity c) Not conditioning on posttreatment variables
Q8: Which of the below desiderata are key concerning information entropy? a) The measure of uncertainty should not be continuous b) The measure of uncertainty should increase as the number of possible events increases c) The measure of uncertainty should be multiplicative
Q9: What is the difference between an ordered categorical variable (OCV) and an unordered categorical variable (UCV)? a) For UCV we often see 'Likert' questions b) For OCV the distances between the values are not necessarily the same c) For UCV the values merely represent different discrete outcomes d) None of the answers
Q10: Which of the following priors will produce more shrinkage in the estimates? a) αTANK ∼ Normal(0, 1) b) αTANK ∼ Normal(0, 2) c) None of the answers
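Q10 above (and the adaptive prior in Question 14) both turn on shrinkage. A tiny conjugate-Normal sketch of mine, with made-up numbers, shows why the tighter prior in alternative (a) shrinks estimates more:

# Shrinkage of a group mean toward the prior mean 0 under two prior widths.
import numpy as np

ybar, n, sigma_obs = 1.5, 5, 2.0          # made-up group data
se2 = sigma_obs**2 / n                    # sampling variance of the group mean

for prior_sd in (1.0, 2.0):
    w = prior_sd**2 / (prior_sd**2 + se2) # weight placed on the data
    post_mean = w * ybar                  # prior mean is 0
    print(f"prior Normal(0, {prior_sd}): posterior mean = {post_mean:.2f}")
# Normal(0, 1) -> 0.83, Normal(0, 2) -> 1.25: the tighter prior shrinks more.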
Question 2: (4p) Underfitting and overfitting are two concepts not always stressed a lot with black-box machine learning approaches. In this course, however, you've probably heard me talk about these concepts a hundred times. . . What happens when you underfit and overfit, i.e., what would the results be? What are some principled ways to deal with under- and overfitting?
Paper Solution: One straightforward way to discuss this is to show two simple models: one where we only have a general intercept α (complete pooling), and one where we estimate an intercept for each cluster in the data (varying intercepts) but don't share information between the clusters (think of the pond example in the course book), i.e., no pooling. This question can be answered in many different ways.
Complete pooling:
yi ∼ Binomial(n, pi)
logit(pi) = α
α ∼ Normal(0, 2.5)
No pooling:
yi ∼ Binomial(n, pi)
logit(pi) = αCLUSTER[j]
αj ∼ Normal(0, 2), for j = 1, . . . , n
Solution:
Underfitting:
• Result: underfitting occurs when a model is too simple to capture the underlying patterns in the data. The model performs poorly on both the training data and unseen (test or validation) data; it has high bias and low variance.
• Consequences: the model lacks the capacity to learn the regular features of the data, leading to high errors, low accuracy, poor generalization, and oversimplified or inaccurate predictions.
• Principled ways to address underfitting: increase model complexity (e.g., a more flexible model with more parameters); add relevant predictors or features; loosen overly strong regularization (in Bayesian terms, overly tight priors); collect more data.
Overfitting:
• Result: overfitting occurs when an overly flexible model fits the training data (almost) perfectly but fails to generalize to new, unseen data; it has low bias and high variance.
• Consequences: the model essentially memorizes the training data, including its noise, and therefore predicts poorly out of sample.
• Principled ways to address overfitting: regularization (regularizing/skeptical priors in Bayesian models; L1/L2 penalties, dropout, or early stopping in machine-learning settings); cross-validation or information criteria to estimate out-of-sample accuracy; removing irrelevant or noisy predictors; simplifying the model; partial pooling via multilevel models; collecting more data.
Question 3 : (10p) To understand how team size affects psychological safety the following data was collected:
Team   Team size   SPI    Psychological safety
1      5           67%    High
2      15          33%    Low
3      11          49%    Low
4      7           90%    High
...    ...         ...    ...
The experiment started by assuming that planning effectiveness and psychological safety have a very strong association. For planning effectiveness they used the schedule performance indicator (SPI) as a stand-in variable: Schedule Performance Indicator = (Completed points / Planned points). Based on the result, if the SPI is more than 50% the team is classified as having high psychological safety; if less than 50%, as having low psychological safety.
With the above data, the firm wants to use your knowledge to understand the association between team size and psychological safety.
• Write down the mathematical model definition for this prediction using any variable names and priors of your choice.
• State the ontological and epistemological reasons for your likelihood.
Remember to clearly state and justify the choices and assumptions regarding your model.
Paper Solution: This one is actually a bit trickier than it looks at first glance. Sure, we know that we should design a model where SPI or psychological safety is used as the outcome, but which one should we use? SPI is a continuous number from 0 to 100, while psychological safety is a dichotomous variable (basically 1/0). Hence, in the first case a Beta regression could be a possibility, while in the latter a logistic regression could be an option. The lesson here is that we should probably use model comparison to look at out-of-sample prediction capabilities. I would be very hesitant to use psychological safety as an outcome. Why? Well, we have taken a continuous value (SPI) and made it into a dichotomous value. In short, from an information-theoretical point of view we are shooting ourselves in the foot. Never bin your variables! To get maximum points on this question, the answer must contain the reasoning that we should not use dichotomized variables as a replacement for continuous variables.
Solution: To model the association between team size and psychological safety using the provided data, you can create the following Bayesian model.
Mathematical Model Definition:
1. Likelihood:
• For each team i, the binary outcome yi indicates psychological safety: yi = 1 represents high psychological safety and yi = 0 represents low psychological safety.
• The likelihood for each team is yi ∼ Bernoulli(pi), where pi is the probability of high psychological safety for team i; pi is a function of the team size (xi) and is determined by a logistic regression model.
2. Logistic Regression Model (Link Function):
• The log-odds of high psychological safety is modeled as a linear function of team size: logit(pi) = α + β xi
• Here logit(pi) is the log-odds of high psychological safety for team i, α is the intercept, β is the coefficient representing the effect of team size on psychological safety, and xi is the team size for team i.
3. Priors:
• Prior distributions are assigned to the parameters α and β, chosen based on prior knowledge or beliefs — for example Normal or Cauchy distributions with appropriate scale. Priors convey your beliefs about the parameters before observing the data.
Ontological and Epistemological Justifications:
1. Likelihood Choice (Epistemological Reason):
• The Bernoulli likelihood is appropriate for binary outcomes, such as high or low psychological safety; it is a common choice for modeling binary or categorical data.
• Using a likelihood based on the observed classification of teams as having high or low psychological safety allows us to make probabilistic inferences about unobserved teams.
2.
Question 4 : (3p) What is epistemological justification and how does it differ from ontological justification, when we design models and choose likelihoods? Please provide an example where you argue epistemological and ontological reasons for selecting a likelihood.

Paper Solution: Epistemological, rooted in information theory and the concept of maximum entropy distributions. Ontological, the way nature works, i.e., a physical assumption about the world. One example could be the Normal() likelihood for real (continuous) numbers where we have additive noise in the process (ontological), and where we know that in such cases Normal is the maximum entropy distribution.

Solution: Epistemological justification and ontological justification are two essential aspects in the process of designing statistical models and choosing likelihoods in Bayesian data analysis. They refer to different types of reasoning that guide our modeling choices. Let's explore their differences and provide an example where both epistemological and ontological justifications are relevant in selecting a likelihood.
Epistemological Justification:
• Definition: Epistemology deals with questions related to knowledge, belief, and the justification of knowledge. Epistemological justification, in the context of Bayesian data analysis, refers to the reasons and justifications based on our beliefs, prior knowledge, or subjective understanding of the problem.
• Emphasis: This type of justification emphasizes what we subjectively know or believe about the problem, even if our beliefs are not based on empirical data.
Ontological Justification:
• Definition: Ontology is concerned with the nature of reality, existence, and what exists in the world. Ontological justification, in Bayesian data analysis, refers to the reasons and justifications based on the actual properties of the data-generating process or the true underlying nature of the problem.
• Emphasis: This type of justification emphasizes the objective properties of the problem, data, or phenomenon, irrespective of our subjective beliefs.
Example: Suppose you are designing a statistical model to estimate the probability of a coin landing heads (success) when flipped. You have the observed results of flipping the coin 100 times, and it landed heads 60 times.
• Epistemological Justification: If you have prior knowledge that the coin is biased because someone told you it's biased (subjective belief), you might encode that belief in the model. For instance, you could place a Beta prior on the probability of heads with parameters that express your belief in the bias.
• Ontological Justification: If you have no prior knowledge or reason to believe the coin is biased, you might choose a more neutral or objective likelihood. In this case, you could select a Binomial likelihood, which is a common choice for modeling the number of successes in a fixed number of binary trials without any preconceived bias.
In this example, epistemological justification is based on what you believe or know subjectively about the coin's bias, while ontological justification is based on the objective properties of the observed data and the coin's true nature, regardless of your beliefs. In Bayesian data analysis, it's important to carefully consider both types of justifications when choosing priors, likelihoods, and model structures. The choice should strike a balance between your subjective beliefs and the inherent properties of the problem, guided by the available data and prior knowledge.
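A minimal sketch in R of the coin example above, assuming a conjugate Beta prior on the probability of heads together with the Binomial likelihood; the prior parameters are an illustrative assumption.

# Coin example: 60 heads in 100 flips, Binomial likelihood, Beta prior on p.
# With a conjugate Beta(a, b) prior, the posterior is Beta(a + heads, b + tails).
heads <- 60
flips <- 100

a_prior <- 2   # illustrative prior counts expressing a mild belief around p = 0.5
b_prior <- 2

a_post <- a_prior + heads
b_post <- b_prior + flips - heads

# Posterior mean and a 95% credible interval for the probability of heads
a_post / (a_post + b_post)
qbeta(c(0.025, 0.975), a_post, b_post)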
Question 5: (4p) When diagnosing Markov chains, we often look at several diagnostics to form an opinion of how well things have gone. Name four diagnostics we commonly use? What do we look for in each diagnostic (i.e., what thresholds or signs do we look for)? Finally, what do they tell us?

Paper Solution:
Effective sample size: A measure of the efficiency of the chains. 10% of total sample size (after warmup) or at least 200 for each parameter.
R_hat: A measure of the between and the within variance of our chains. Indicates if we have reached a stationary posterior probability distribution. Should approach 1.0 and be below 1.01.
Traceplots: Check if the chains are mixing well. Are they converging around the same parameter space?
MCSE: Markov chain standard error or Standard Error for short.
E-BFMI: etc.
Divergent chains: . . .

Solution: Gelman-Rubin Diagnostic (R-hat or Gelman's R):
• What to Look For: Look for values of R-hat close to 1.0. If R-hat is much larger than 1.0 (e.g., 1.1 or higher), it suggests that the chains have not converged.
• What It Tells Us: R-hat is a measure of convergence, specifically assessing whether multiple chains have mixed well. A value close to 1.0 indicates that the chains have likely converged to the same distribution, while values significantly larger suggest a lack of convergence.
Effective Sample Size (ESS):
• What to Look For: Ensure that the ESS for each parameter is sufficiently large, typically at least a few hundred per parameter (the paper solution suggests 10% of the total draws, or at least 200), depending on the problem. Low ESS values indicate poor sampling efficiency.
• What It Tells Us: ESS quantifies the effective number of independent samples obtained from the Markov chain. Higher ESS indicates more efficient sampling, while low ESS suggests that the chains are not exploring the distribution effectively.
Traceplots and Autocorrelation Plots:
• What to Look For: Inspect traceplots to check for mixing and convergence. In autocorrelation plots, look for rapid decay in autocorrelation as lag increases.
• What It Tells Us: Traceplots visualize the parameter values over the iterations, and a well-mixed chain exhibits erratic, "noisy" behavior. Autocorrelation plots indicate how correlated samples are at different lags. Rapid decay in autocorrelation signifies good mixing, while slow decay suggests poor mixing.
Quantile-Quantile (Q-Q) Plots:
• What to Look For: Examine Q-Q plots to assess if the posterior samples match the expected quantiles of the target distribution.
• What It Tells Us: Q-Q plots are used to check if the posterior distribution aligns with the target distribution (e.g., a normal distribution). Deviations from the diagonal line in the plot suggest discrepancies between the posterior and target distributions.
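A small R sketch of how these diagnostics are commonly inspected, assuming a model object fit produced by the rethinking package's ulam() (and hence Stan); the object name is a placeholder.

library(rethinking)

# fit: a hypothetical model fitted with ulam(..., chains = 4)
precis(fit, depth = 2)   # posterior summaries including n_eff (effective sample size) and Rhat per parameter
traceplot(fit)           # chains should overlap and mix well ("hairy caterpillars")
trankplot(fit)           # rank histograms; roughly uniform ranks indicate healthy mixing
# Stan additionally warns about divergent transitions and low E-BFMI during sampling;
# those warnings belong to the same diagnostic routine.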
Question 6: (4p) Explain the four main benefits of using multilevel models.
Paper Solution:
• Improved estimates for repeat sampling. When more than one observation arises from the same individual, location, or time, then traditional, single-level models either maximally underfit or overfit the data.
• Improved estimates for imbalance in sampling. When some individuals, locations, or times are sampled more than others, multilevel models automatically cope with differing uncertainty across these clusters. This prevents oversampled clusters from unfairly dominating inference.
• Estimates of variation. If our research questions include variation among individuals or other groups within the data, then multilevel models are a big help, because they model variation explicitly.
• Avoid averaging, retain variation. Frequently, scholars pre-average some data to construct variables. This can be dangerous, because averaging removes variation, and there are also typically several different ways to perform the averaging. Averaging therefore both manufactures false confidence and introduces arbitrary data transformations. Multilevel models allow us to preserve the uncertainty and avoid data transformations.

Question 7:
Table 1: Output from running WAIC on three models.

      WAIC     SE   dWAIC    dSE   pWAIC
m1   127.6  14.69     0.0     NA     4.7
m3   129.4  15.10     1.8   0.90     5.9
m2   140.6  11.21    13.1  10.82     3.8

(4p) As a result of comparing three models, we get the above output. What does each column (WAIC, SE, dWAIC, dSE, and pWAIC) mean? Which model would you select based on the output?
Paper Solution: WAIC: information criteria (the lower the better). pWAIC: effective number of parameters in the model. dWAIC: the difference in WAIC between the models. SE: standard error. How much noise are we not capturing. dSE: the standard error of the difference in WAIC between the models. If you have 4–6 times larger dWAIC than dSE you have fairly strong indications there is a 'significant' difference. Here no model is, relatively speaking, the best. If we are forced to pick a model then one could follow two logical arguments: 1. Pick the one with the lowest WAIC. 2. Pick the simplest model.
Solution: In the provided output from running the Watanabe-Akaike Information Criterion (WAIC) on three models, each column conveys different information. Here's an explanation of each column, and based on this information, we can determine which model to select:
WAIC (Watanabe-Akaike Information Criterion): WAIC is a measure of the out-of-sample predictive accuracy of a statistical model. It quantifies the trade-off between the model's fit to the data and its complexity. Lower WAIC values indicate better model fit, meaning the model is better at predicting new, unseen data.
SE (Standard Error of WAIC): SE represents the standard error of the WAIC estimate. It provides a measure of the uncertainty in the WAIC value. Smaller SE values suggest a more precise estimate of the WAIC.
dWAIC (Difference in WAIC): dWAIC is the difference in WAIC between the model in question and the model with the lowest WAIC. It indicates how much worse or better a model is compared to the best-performing model in the set. Smaller dWAIC values suggest a model is closer in performance to the best model.
dSE (Standard Error of dWAIC): dSE represents the standard error of the dWAIC estimate. It provides a measure of the uncertainty in the dWAIC value. Smaller dSE values suggest a more precise estimate of the difference in WAIC.
pWAIC (Effective Number of Parameters): pWAIC is the penalty term of WAIC and estimates the effective number of parameters in the model. It reflects how flexible the model is and therefore how prone it may be to overfitting; it is not itself a measure of predictive performance.
Model Selection: Based on the provided output, we can make the following observations: Model 1 (m1) has the lowest WAIC value (127.6), indicating good out-of-sample predictive accuracy and model fit. Model 2 (m2) has the highest WAIC value (140.6), suggesting relatively worse out-of-sample predictive accuracy. Model 3 (m3) falls in between with a WAIC value of 129.4. To select the best model, we should consider both the WAIC value and the difference in WAIC (dWAIC) compared to the model with the lowest WAIC, together with dSE. Note that the difference between m1 and m3 (dWAIC = 1.8) is only about twice its standard error (dSE = 0.90), so, as the paper solution points out, the data do not clearly distinguish m1 from m3. If forced to choose, Model 1 (m1) appears to be the preferred candidate as it has the lowest WAIC. However, it's essential to consider other factors and the context of the analysis when making a final decision about model selection.
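A hedged sketch of how a table like Table 1 is typically produced in R with the rethinking package, assuming three already-fitted models m1, m2, and m3 (fitted with log_lik = TRUE so that pointwise log-likelihoods are available):

library(rethinking)

# m1, m2, m3: hypothetical ulam() fits of the competing models
compare(m1, m2, m3, func = WAIC)
# The output contains the columns WAIC, SE, dWAIC, dSE, pWAIC (and model weights),
# sorted from the lowest (best) WAIC to the highest.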
Question 8: (5p) Write an example mathematical model formula for a Poisson regression model with two different kinds of varying intercepts, also known as a cross-classified model.
Solution: A Poisson regression model with two different kinds of varying intercepts, also known as a cross-classified model, is used to account for hierarchical structures in data where observations can be attributed to multiple non-nested grouping factors. Here's an example mathematical model formula for such a model:
Mathematical Model Formula for Poisson Regression with Cross-Classified Varying Intercepts:
Suppose you have count data (e.g., the number of accidents) and you want to model it as a function of both individual-level and location-level factors. Let Y_ij be the count of accidents for individual i in location j. The model can be specified as:
Y_ij ∼ Poisson(λ_ij)
Here, we have two varying intercepts:
Individual-Level Varying Intercept:
• Each individual i has a varying intercept b_i that captures the random effect specific to that individual.
• This varying intercept accounts for unobserved individual-level characteristics that affect the count of accidents.
Location-Level Varying Intercept:
• Each location j has a varying intercept c_j that captures the random effect specific to that location.
• This varying intercept accounts for unobserved location-level characteristics that affect the count of accidents.
To complete the model, you need to specify the linear predictor for λ_ij:
log(λ_ij) = β0 + b_i + c_j + β1 X_ij1 + β2 X_ij2 + …
• β0 represents the fixed intercept.
• b_i represents the individual-level varying intercept for individual i.
• c_j represents the location-level varying intercept for location j.
• β1, β2 represent the fixed coefficients for individual-level covariates (e.g., age, gender) and location-level covariates (e.g., urban/rural).
• To make these intercepts varying (partially pooled), each set is given its own distribution, e.g., b_i ∼ Normal(0, σ_individual) and c_j ∼ Normal(0, σ_location), with priors on σ_individual and σ_location.
This model allows you to simultaneously account for random effects at both the individual and location levels, capturing the heterogeneity in accident counts due to unobserved characteristics at these levels. The model can be extended to accommodate additional complexities, such as including other fixed and random effects, interactions, and covariates. Parameter estimation in such models typically involves Bayesian methods or other advanced statistical techniques.
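A minimal sketch in R (rethinking package) of a cross-classified Poisson model of this kind; the data frame acc and its columns y, ind, loc, and x1, as well as the priors, are illustrative assumptions.

library(rethinking)

# acc: hypothetical data frame with columns
#   y   -- accident count
#   ind -- integer index of the individual
#   loc -- integer index of the location
#   x1  -- an example covariate
m_cross <- ulam(
  alist(
    y ~ dpois(lambda),
    log(lambda) <- b0 + b_ind[ind] + c_loc[loc] + b1 * x1,
    b_ind[ind] ~ dnorm(0, sigma_ind),   # individual-level varying intercepts
    c_loc[loc] ~ dnorm(0, sigma_loc),   # location-level varying intercepts
    b0 ~ dnorm(0, 1),
    b1 ~ dnorm(0, 0.5),
    sigma_ind ~ dexp(1),
    sigma_loc ~ dexp(1)
  ), data = acc, chains = 4)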
Question 9: (4p) Explain the terms in your own words:
• prior
• posterior
• information entropy
• instrumental variable
Paper Solution: The first three are easiest to explain using the formulas (Bayes' theorem and information entropy). However, one could make a more high-level conceptual explanation. The final one can be explained in different ways, e.g., in causal terms, an instrumental variable is a variable that acts like a natural experiment on the exposure. Or in other words, it is a third variable, Z, used in regression analysis when you have endogenous variables—variables that are influenced by other variables in the model, i.e., you use it to account for unexpected behavior between variables. Using an instrumental variable to identify the hidden (unobserved) correlation allows you to see the true correlation between the explanatory variable X and response variable, Y. Let's say you had two correlated variables that you wanted to regress: X and Y. Their correlation might be described by a third variable Z, which is associated with X in some way. Z is also associated with Y but only through Y's direct association with X. For example, let's say you wanted to investigate the link between depression (X) and smoking (Y). Lack of job opportunities (Z) could lead to depression, but it is only associated with smoking through its association with depression (i.e., there isn't a direct correlation between lack of job opportunities and smoking). This third variable, Z (lack of job opportunities), can generally be used as an instrumental variable if it can be measured and its behavior can be accounted for.
Solution: Prior: A "prior" in Bayesian statistics is like your initial belief or expectation about a parameter or a variable's distribution before you've seen any data. It's the knowledge you bring to the analysis based on previous information or subjective judgments. Think of it as your starting point, which is later updated with data to form the "posterior."
Posterior: The "posterior" represents your updated beliefs about a parameter or variable after you've incorporated observed data. It's like your revised estimate or understanding, taking into account the evidence provided by the data. In Bayesian statistics, the posterior is the result of combining the prior information with the likelihood function that describes how well the model fits the data.
Information Entropy: Information entropy is a measure of uncertainty or disorder in a system. In simpler terms, it quantifies how much you don't know about something. High entropy means high uncertainty, while low entropy indicates low uncertainty or high information. For example, a coin toss has high entropy because you don't know the outcome until it happens, whereas a loaded die with biased sides has lower entropy because you have some knowledge about its outcomes.
Instrumental Variable: An "instrumental variable" is a statistical concept used in causal inference.
It's like a helper variable that's used to establish a causal relationship between an independent variable (X) and a dependent variable (Y) when there's potential endogeneity or omitted variable bias. The instrumental variable should be correlated with the independent variable but uncorrelated with unobservable factors affecting the dependent variable. It's a tool to tease out causal relationships in complex situations. Question 10: (2p) What are the two kinds of varying effects? Explain the effect they have on a statistical model. Paper Solution: Intercepts and slopes. Our models often make a jump in prediction capability when we add both. Together they are often called varying effects. A varying intercept means we estimate an intercept for each, e.g., subject in an experiment. Varying slopes allows us to also estimate a separate slope for each subject. These type of models often result in many parameters that need to be estimated. That is however usually OK, as long as we use partial pooling and an information criteria to keep an eye on overfitting (remember: the more complex models will be penalized). Solution: The two kinds of varying effects in statistical models are random effects and fixed effects. Each of these varying effects has a distinct impact on a statistical model: Random Effects: • Random effects are used to model variability that arises from unobservable or unmeasured factors within a dataset. They are typically considered as random draws from a larger population or distribution. Random effects are particularly useful in hierarchical or multilevel models, where data is organized into different groups or clusters (e.g., individuals within families or students within schools). • The effect of random effects is to account for the variation between groups or clusters. They allow for modeling both within-group and between-group variability, providing a way to handle heterogeneity in the data. In essence, random effects help capture and quantify sources of unexplained variability while allowing for generalizability to a broader population. Fixed Effects: • Fixed effects are used to model the impact of categorical variables, where each category is assigned its own parameter estimate. They are often used to control for the influence of specific categorical factors on the outcome variable. Fixed effects can be thought of as a set of fixed constants. • The effect of fixed effects is to account for and estimate the impact of specific categories or factors on the outcome. They provide a way to explicitly measure the differences in the mean response associated with different levels of the categorical variable. In summary, random effects account for unobservable variations between groups or clusters and are often used to generalize findings beyond the dataset, while fixed effects estimate the impact of specific categories or factors within the dataset. The choice between random effects and fixed effects depends on the research question, the nature of the data, and the goals of the statistical analysis. Question 10.1: (4p) What are the two kinds of varying effects (random effects)? Explain the effect they have on a statistical model. Also, can you use continuous values for varying effects? If yes, how? Paper Solution: Intercepts and slopes. Our models often make a jump in prediction capability when we add both. Together they are often called varying effects. A varying intercept means we estimate an intercept for each, e.g., a subject in an experiment. 
Varying slopes allows us to also estimate a separate slope for each subject. These type of models often result in many parameters that need to be estimated. That is however usually OK, as long as we use partial pooling and an information criteria to keep an eye on overfitting (remember: the more complex models will be penalized). Yes, you can use continuous values in random effects, e.g., Gaussian Process. Solution: In Bayesian statistics and hierarchical models, varying effects are often referred to as random effects. Random effects are a crucial component of statistical models, particularly in the context of mixed-effects models. There are two primary types of varying effects: group-level random effects and individual-level random effects. Here's an explanation of each type and their effects on a statistical model: Group-Level Random Effects: • Group-level random effects are often used to model variability at the group or cluster level. These groups could be defined by categories or clusters, such as schools, cities, or patients within a hospital. • The main effect of group-level random effects is to account for unexplained variability or heterogeneity between groups. They capture deviations from the overall model by introducing group-specific variations. • Group-level random effects are especially useful when the groups being modeled exhibit different intercepts or slopes but are still related to a common structure. • For example, in a study examining test scores across multiple schools, group-level random effects can capture schoolspecific variations in test scores, accounting for differences in school performance beyond what can be explained by the fixed effects (e.g., school size, teacher-to-student ratio). Individual-Level Random Effects: • Individual-level random effects are used to model variability at the individual level. These are often employed in longitudinal or repeated measures data where each subject has multiple observations. • The primary purpose of individual-level random effects is to account for variations in individual responses that cannot be explained by the fixed effects. They capture the unique characteristics of each individual within the same group. • Individual-level random effects can model the correlation and dependence of repeated measurements within the same individual. • For instance, in a study tracking blood pressure measurements over time for different patients, individual-level random effects can account for variations in blood pressure that are specific to each patient, beyond what can be explained by time trends and other fixed effects. Regarding the use of continuous values for varying effects, yes, it is possible to use continuous random effects. Continuous random effects are typically used when you expect a smooth or continuous relationship between the random effects and the response variable. These continuous random effects are often modeled using Gaussian (normal) distributions, allowing for a wide range of values. For example, in a study examining the effects of various doses of a drug on patient outcomes, you might use a continuous random effect to capture the dose-response relationship specific to each patient. The continuous random effect can take on a range of values, allowing for a flexible representation of individual responses to different doses of the drug. In summary, random effects in statistical models can be categorized as group-level or individual-level, depending on the level of variation being modeled. 
Additionally, continuous random effects can be employed when modeling continuous relationships between the random effects and the response variable, providing flexibility in capturing individual variability.

Question 11: (8p) What are the four elemental confounds on which any Directed Acyclic Graph can be explained? Please draw the four confounds. Explain what they mean (preferably by explaining if one should condition or not on certain elements).

Solution: In the context of Directed Acyclic Graphs (DAGs), there are four elemental confounds that can help explain the relationships and potential sources of bias in causal inference. These confounds correspond to simple structures of nodes and arrows in a DAG. Let's draw the four elemental confounds and explain their meanings, including whether one should condition on certain elements:
Collider (C):
• A collider is a node where two or more arrows (edges) meet. It's a common source of bias and represents a situation where conditioning on the collider can induce a spurious association between its parent nodes.
• To Condition or Not: Conditioning on a collider opens a non-causal path between its parents, so you should generally not condition on a collider when you want an unbiased estimate of the relationship between the parent nodes.
Mediator (M), also called the pipe or chain:
• A mediator is an intermediate node in a causal pathway between an exposure and an outcome. It represents a variable through which the effect of an exposure is transmitted to the outcome.
• To Condition or Not: If you are after the total causal effect, you do not condition on a mediator, because doing so would block the indirect (mediated) part of the effect; condition on it only if you explicitly want the direct effect.
Common Cause (A), also called the fork:
• A common cause is a node that is a parent to both an exposure and an outcome. It represents a variable that affects both the exposure and the outcome, potentially leading to confounding.
• To Condition or Not: Conditioning on a common cause can help control for confounding when estimating the causal effect of the exposure on the outcome. It's often necessary to condition on common causes.
Descendant (D):
• A descendant is a node that is influenced by another node in the graph, for example a child of a collider or of a mediator. Conditioning on a descendant partially conditions on its parent.
• To Condition or Not: Conditioning on the descendant of a collider can introduce the same kind of bias as conditioning on the collider itself, so it should usually be avoided; conditioning on the descendant of a mediator will likewise partially block the mediated path.
Here is a simplified text representation of the four elemental relations, with X as the exposure and Y as the outcome:
• Fork (common cause): X ← A → Y
• Pipe (mediator): X → M → Y
• Collider: X → C ← Y
• Descendant: X → C ← Y, with C → D
A (Common Cause) is a variable that affects both the Exposure (X) and the Outcome (Y). M (Mediator) is an intermediate variable through which the effect of the Exposure (X) is transmitted to the Outcome (Y). C (Collider) is a node where the arrows from X and Y meet and is a common source of bias when estimating causal effects. D (Descendant) is a variable influenced by the Collider (C) and carries information about it. Understanding and appropriately addressing these confounds is crucial for causal inference and avoiding biases in causal relationships. Whether to condition on specific elements depends on the underlying causal structure and the research question. A small illustration of the collider and descendant logic follows below.
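A sketch using the dagitty package in R (the node names are illustrative, and the dseparated() helper is assumed to be available in the installed version):

library(dagitty)

# Collider X -> C <- Y, with D a descendant of the collider C
g <- dagitty("dag { X -> C <- Y  C -> D }")

dseparated(g, "X", "Y", list())      # TRUE: X and Y are independent when nothing is conditioned on
dseparated(g, "X", "Y", list("C"))   # FALSE: conditioning on the collider C opens the path
dseparated(g, "X", "Y", list("D"))   # FALSE: conditioning on the descendant D has a similar effect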
Question 12: (4p) What is the purpose and limitations of using laboratory experiments and experimental simulations as a research strategy?
Paper Solution:
Experimental simulations:
• Real actors' behavior. Non-natural settings (contrived).
• Obtrusiveness: High level of obtrusiveness, as the context is artificial.
• Goal: Study a specific setting with a higher degree of measurement and context control.
• Generalizability is reduced, as it still mirrors a specific setting.
• Higher precision of the measurement of the behavior. Lower level of realism.
• Controlled setting that resembles the real world.
• Examples of research methods: simulation games and management games.
• Both qualitative and quantitative (depends on the instrument).
Laboratory experiment:
• Real actors' behavior. Non-natural settings (contrived). The setting is more controlled.
• Obtrusiveness: High level of obtrusiveness.
• Goal: Minimize confounding factors and extraneous conditions. Maximize the precision of the measurement of the behavior. Establish causality between variables.
• Limited number of subjects. Limits the generalizability. Unrealistic context.
Solution: Laboratory Experiments and Experimental Simulations are research strategies that offer distinct advantages and limitations:
Purpose of Laboratory Experiments:
Control and Isolation: Laboratory experiments provide a controlled environment where researchers can manipulate and isolate variables. This allows for the precise examination of causal relationships and the ability to establish cause-and-effect.
Replicability: Experiments can be replicated to verify findings and ensure the consistency of results. This contributes to the robustness of scientific conclusions.
Internal Validity: They offer strong internal validity as researchers can closely monitor and control conditions, minimizing the influence of extraneous factors.
Precision: Laboratory experiments often yield quantitative data, providing precise measurements and the ability to detect subtle effects.
Limitations of Laboratory Experiments:
External Validity: Findings from controlled laboratory settings may not generalize well to real-world situations. The controlled environment may not fully represent the complexity of the external world, limiting external validity.
Ethical Constraints: Some experiments involving human subjects may raise ethical concerns, and certain research questions cannot be addressed in a laboratory due to ethical or practical reasons.
Artificiality: The artificiality of the laboratory setting can affect the behavior of participants, leading to behaviors that might not occur in natural settings.
Resource-Intensive: Laboratory experiments can be resource-intensive in terms of time, equipment, and participant recruitment, making them less feasible for some research questions.
Purpose of Experimental Simulations:
Realistic Scenarios: Experimental simulations aim to recreate real-world scenarios within a controlled setting, providing a balance between control and realism.
Applied Research: They are valuable for applied research, such as studying human behavior in response to specific situations or interventions, with practical implications.
Exploration of Complex Phenomena: Experimental simulations are well-suited for studying complex phenomena that cannot be easily replicated in a laboratory.
Behavioral Insights: They can provide insights into human or system behavior under different conditions, helping inform decision-making.
Limitations of Experimental Simulations:
Limited Control: While more realistic than traditional laboratory experiments, simulations may still lack the level of control required for strict causal inference.
Generalizability: Similar to laboratory experiments, the generalizability of findings from simulations to broader contexts can be a challenge. The specific simulation design may limit external validity. Resource Demands: Setting up and conducting realistic simulations can be resource-intensive, including the need for advanced technology, infrastructure, and skilled personnel. Complexity: Designing and running simulations may involve complex modeling and data analysis, potentially requiring specialized expertise. In summary, laboratory experiments offer high control and precision but may lack external validity, while experimental simulations aim to bridge the gap between controlled experiments and real-world scenarios, but still face limitations related to control and generalizability. The choice between these research strategies depends on the research question and the balance between control and realism needed for the study. Question 13: (6p) Below follows an abstract from a research paper. Answer the questions, • Which of the eight research strategies presented in the ABC framework does this paper likely fit? Justify and argue! • Can you argue the main validity threats of the paper, based on the research strategy you picked? – It would be very good if you can list threats in the four common categories we usually work with in software engineering, i.e., internal, external, construct, and conclusion validity threats. Context. Computer workers in general, and software developers specifically, are under a high amount of stress due to continuous deadlines and, often, over-commitment. Objective. This study investigates the effects of a neuroplasticity practice, a specific breathing practice, on the attention awareness, well-being, perceived productivity, and self-efficacy of computer workers. Method. We created a questionnaire mainly from existing, validated scales as entry and exit survey for data points for comparison before and after the intervention. The intervention was a 12-week program with a weekly live session that included a talk on a well-being topic and a facilitated group breathing session. During the intervention period, we solicited one daily journal note and one weekly well-being rating. We replicated the intervention in a similarly structured 8-week program. The data was analyzed using Bayesian multilevel models for the quantitative part and thematic analysis for the qualitative part. Results. The intervention showed improvements in participants’ experienced inner states despite an ongoing pandemic and intense outer circumstances for most. Over the course of the study, we found an improvement in the participants’ ratings of how often they found themselves in good spirits as well as in a calm and relaxed state. We also aggregate a large number of deep inner reflections and growth processes that may not have surfaced for the participants without deliberate engagement in such a program. Conclusion. The data indicates usefulness and effectiveness of an intervention for computer workers in terms of increasing wellbeing and resilience. Everyone needs a way to deliberately relax, unplug, and recover. Breathing practice is a simple way to do so, and the results call for establishing a larger body of work to make this common practice. Paper Solution: I would argue that this should be classified as a Field Experiment Solution: The research described in the provided abstract is best categorized under the "Experimental Simulation" research strategy within the ABC framework. 
Several key elements justify this classification. First and foremost, the study involves a controlled intervention, specifically a 12-week program with weekly breathing sessions and well-being talks. Such controlled interventions are characteristic of experimental simulations where researchers manipulate specific variables to observe their effects systematically. Moreover, the study's data analysis methods, including the use of Bayesian multilevel models for quantitative analysis and thematic analysis for qualitative components, are commonly associated with experimental research designs. The combination of quantitative and qualitative data aligns with the typical approach used in experimental simulations to gain a comprehensive understanding of the effects of an intervention. However, this research strategy introduces several validity threats that should be carefully considered. Internal validity threats include the possibility of historical events or participant maturation affecting well-being independently of the intervention. To address these threats, the study could benefit from a control group that would enable comparisons over time. Selection bias is another internal validity threat, as participants who chose to engage in the intervention may differ from those who did not. Implementing random assignment or matching techniques for a control group can help mitigate this bias. External validity threats may arise from interaction effects, where the effectiveness of the intervention varies across different contexts or populations. To enhance external validity, researchers should aim to generalize their results to a broader and more diverse set of participant groups and work environments. Additionally, the study should consider ecological validity, ensuring that the research context closely represents typical work conditions to strengthen the relevance of the findings to real-world settings. Construct validity threats are primarily related to the measurement tools used in the study. Ensuring the accuracy, reliability, and validity of the questionnaire and survey instruments is vital for the study's overall construct validity. Finally, in terms of conclusion validity, the research employs Bayesian multilevel models for data analysis. To mitigate threats to statistical conclusion validity, it is essential for the researchers to conduct a thorough assessment of the models and results, considering the limitations and assumptions inherent in their statistical approach. In summary, while the research aligns with the experimental simulation strategy, it should be mindful of the aforementioned validity threats across internal, external, construct, and conclusion validity. Addressing these threats will be crucial to ensuring the study's robustness and its ability to provide meaningful insights into the effects of the neuroplasticity practice on the wellbeing and self-efficacy of computer workers. Question 14: (11p) A common research method in software engineering that is used to complement other methods is survey research. Here follows a number of questions connected to survey research: 1. We often differ between reliability and validity concerning surveys, (a) What is the difference between the reliability and validity in survey design? (2p) (b) Name and describe at least two types of reliability in survey design. (3p) (c) Name and describe at least two types of validity in survey design. (3p) 2. 
Even if you measure and estimate reliability and validity you still want to evaluate the survey instrument. Which are the two (2) common ways of evaluating a survey instrument? Explain their differences. (3p)
Paper Solution: 1a. Reliability: A survey is reliable if we administer it many times and get roughly the same distribution of results each time. Validity: How well the survey instrument measures what it is supposed to measure? 1b. e.g., Test-retest (intra-observer), alternate form reliability, internal consistency, inter-observer (Krippendorff's α). 1c. e.g., content validity, criterion validity, construct validity. 2. Focus groups: Assemble a group of people representing respondents, experts, beneficiaries of survey results. Ask them to complete the survey questionnaire. The group identifies missing or unnecessary questions, ambiguous questions and instructions. Pilot studies: Uses the same procedures as planned for the main study but with smaller sample. Targets not only problems with questions but also response rate and reliability assessment.
Solution: Question 1: Reliability and Validity in Survey Research
(a) Difference between Reliability and Validity: Reliability in survey design refers to the consistency, stability, or reproducibility of survey measurements. It assesses whether the survey instrument produces consistent results when administered to the same participants under the same conditions. Reliability indicates the extent to which the survey measures the concept it intends to measure in a consistent manner. Validity, on the other hand, refers to the accuracy and truthfulness of the survey instrument in measuring what it claims to measure. It assesses whether the survey instrument accurately captures the construct or concept it intends to measure. Validity indicates the extent to which the survey instrument measures what it is supposed to measure and the extent to which inferences can be made based on the survey results.
(b) Types of Reliability in Survey Design: Test-Retest Reliability: Test-retest reliability assesses the consistency of survey measurements over time. It involves administering the same survey to the same group of participants on two separate occasions and then comparing the responses. High test-retest reliability indicates that the survey instrument consistently measures the construct over time. Internal Consistency Reliability: Internal consistency reliability assesses the degree to which different items or questions within the survey instrument are correlated with each other. A common measure of internal consistency is Cronbach's alpha, which indicates how closely related the items are within the survey. High internal consistency reliability suggests that the items are measuring the same underlying construct.
(c) Types of Validity in Survey Design: Content Validity: Content validity assesses the extent to which the survey instrument's content, including questions and items, adequately represents the construct it intends to measure. Content validity is often evaluated by expert judgment and ensures that the survey covers all relevant aspects of the construct. Construct Validity: Construct validity examines whether the survey instrument accurately measures the underlying construct or theoretical concept. It involves assessing the relationship between the survey scores and other variables or constructs to confirm that the survey measures what it is intended to measure.
Common techniques for assessing construct validity include factor analysis and hypothesis testing.
Question 2: Evaluating Survey Instruments
Two common ways of evaluating survey instruments are:
Pilot Testing: Pilot testing involves administering the survey instrument to a small group of participants (usually a subset of the target population) before full-scale data collection. The purpose is to identify and rectify any issues with question clarity, wording, or formatting. Pilot testing helps ensure that the survey is well-understood by respondents and that it functions as intended.
Psychometric Analysis: Psychometric analysis involves using statistical techniques to assess the quality of the survey instrument. This can include conducting factor analysis to determine if items load onto the expected factors or constructs, calculating reliability coefficients (e.g., Cronbach's alpha) to assess internal consistency, and assessing the instrument's validity through correlations with other relevant variables. Psychometric analysis provides empirical evidence of the survey instrument's quality and its ability to measure the intended constructs accurately.
Differences between Pilot Testing and Psychometric Analysis: Pilot testing is primarily qualitative and focuses on identifying practical issues with the survey, such as ambiguous wording or confusing item order. It helps refine the survey instrument before data collection. Psychometric analysis, on the other hand, is quantitative and focuses on the statistical properties of the survey, including its reliability and validity. It provides empirical evidence of the survey's measurement properties and its ability to produce consistent and accurate results. In summary, while pilot testing addresses practical issues in survey design, psychometric analysis assesses the statistical properties and quality of the survey instrument. Both approaches are essential for ensuring the reliability and validity of survey research.

Question 15: (6p) Look at the DAG below. We want to estimate the total causal effect of W on D. Which variable(s) should we condition on and why?
[DAG figure with nodes S, W, M, A, and D; the edges are given in the dagitty specification below.]
Paper Solution:
> library(dagitty)
> dag <- dagitty("dag { A -> D  A -> M -> D  A <- S -> M  S -> W -> D }")
> adjustmentSets(dag, exposure = "W", outcome = "D")
and the answer is { A, M } and { S }, i.e., either A and M, or only S. In short, check what elementary confounds we have in the DAG (DAGs can only contain four), follow the paths, decide what to condition on.
Solution: In the Directed Acyclic Graph (DAG) provided, the objective is to estimate the total causal effect of the variable W on the variable D. To estimate this causal effect accurately, it's important to identify the variables that should be conditioned on. In causal inference, conditioning on specific variables is done to control for confounding and ensure that the causal effect of interest is not biased. Here is one valid choice of variables to condition on and the reasons why (conditioning on S alone is the other valid choice):
A: Conditioning on A is necessary because A lies on backdoor paths from W to D, such as W ← S → A → D. Leaving these paths open would let the common cause S confound the effect of W on D, so conditioning on A closes the backdoor paths that run through A.
M: Conditioning on M is also needed because M lies on the remaining backdoor path W ← S → M → D.
Closing this path ensures that the association between W and D is not driven by their common cause S. Therefore, to estimate the total causal effect of W on D in the given DAG, one can condition on A and M together; equivalently, conditioning on S alone blocks all backdoor paths. This matches the adjustment sets returned by dagitty and ensures an unbiased estimate of the total causal effect.

Question 16: (5p or even -5p) You get one point if you answer a question correctly. You get -1 points if you fail to answer the question correctly (if you don't answer you get 0 points). Simply write your answer below.
1. We should always start the Bayesian data analysis by designing a . . . .
2. Adding predictors to a model can lead to several things. Two common things are . . .
3. My R̂ value is 2.6. I should . . .
4. We can quantify . . . using Kullback-Leibler divergence.
5. I am first and foremost always interested in propagating . . . while doing BDA.
Paper Solution:
1. null/initial/simple model
2. overfitting, using a confounder since we might not have a DAG, etc.
3. probably not increase iterations. . . You most likely need to reparameterize the model. In short, you're in deep shit and something is seriously wrong.
4. distance to the target, i.e., if we have a pair of candidate distributions, then the candidate that minimizes the divergence will be closest to the target. Since predictive models specify probabilities of events (observations), we can use divergence to compare the accuracy of models.
5. uncertainty
Solution:
1. We should always start the Bayesian data analysis by designing a probabilistic model that represents our knowledge and uncertainties about the data and the underlying processes.
2. Adding predictors to a model can lead to several things. Two common things are increasing the model complexity and potential improvement in the model's predictive accuracy.
3. My R̂ value is 2.6. I should recognize that the chains have not converged to a common stationary distribution; simply increasing the number of iterations is unlikely to help, so the model most likely needs to be reparameterized and rerun.
4. We can quantify the difference between two probability distributions using Kullback-Leibler divergence, measuring how one distribution diverges from a second, expected probability distribution.
5. I am first and foremost always interested in propagating uncertainty while doing BDA, incorporating prior beliefs and updating them with observed data, allowing for a more informed and probabilistic understanding of the phenomena under study.

Question 1: (6p) See the DAG below. We want to estimate the total causal effect of W on D. Which variable(s) should we condition on and why?
[DAG figure with nodes M, A, S, D, and W; the edges are given in the dagitty specification below.]
Paper Solution:
> dag <- dagitty("dag { A -> D  A <- S -> W  S -> M -> D  S -> W -> D }")
> adjustmentSets(dag, exposure = "W", outcome = "D")
and the answer is { A, M } and { S }, i.e., either A and M, or only S. In short, check what elementary confounds we have in the DAG (DAGs can only contain four), follow the paths, decide what to condition on.
Solution: To estimate the total causal effect of W on D in the given Directed Acyclic Graph (DAG), we need to consider which variables to condition on to reduce confounding and obtain an unbiased estimate. In this case, we should condition on the variables A and S (note that the minimal adjustment sets reported by dagitty are {A, M} or {S}; conditioning on both A and S also blocks all backdoor paths, but it is not a minimal set).
Here's the reasoning: Variables to Condition on and Why: • A (Ancestral Variable): We should condition on A because it is a parent of D, and it can be a common cause of D and W. Conditioning on A will help account for any direct causal influence of A on D and remove the confounding effect. • S (Ancestral Variable): Conditioning on S is also important because S is a common cause of both W and A. Failing to condition on S would leave the confounding path open, leading to biased estimates. Conditioning on S allows us to block the path from S to W and A, ensuring that we capture the direct effect of W on D without confounding. By conditioning on both A and S, we can control for the common causes of W and D, ensuring that the estimated total causal effect of W on D is as unbiased as possible. This reduces the risk of confounding and provides a more accurate estimate of the causal relationship between W and D. Question 3: (3p) It is possible to view the Normal likelihood as a purely epistemological assumption, rather than an ontological assumption. How would you argue such a statement? Solution: Arguing the Normal Likelihood as an Epistemological Assumption: The interpretation of the Normal likelihood as a purely epistemological assumption rather than an ontological one can be justified through the following reasoning: Epistemological vs. Ontological Assumptions: • Epistemological assumptions are concerned with our knowledge and beliefs about the world, while ontological assumptions are concerned with the nature of reality itself. • Epistemological assumptions reflect what we know or assume about the data-generating process, but they don't necessarily claim that the world operates in a specific way. • Epistemological assumptions pertain to what we can know or infer from data, while ontological assumptions relate to the underlying structure of reality itself. • The choice of likelihood functions in Bayesian data analysis often serves as a tool for making inferences based on observed data, and it doesn't necessarily imply a statement about the true nature of the data-generating process. Flexibility of Likelihoods: • The choice of a likelihood function in Bayesian analysis is often guided by mathematical convenience, computational tractability, or the desire to capture specific data patterns. • The Normal distribution is a common choice because of its mathematical properties, especially in the context of conjugate priors and analytical solutions. It is known to be analytically tractable. Modeling Convenience: • In many cases, the decision to use a Normal likelihood is primarily driven by the modeling convenience it offers. The Normal distribution is well-behaved and has desirable mathematical properties, making it a pragmatic choice for modeling data in various applications. • Researchers may select a Normal likelihood as a practical and convenient tool for modeling data patterns without necessarily claiming that the underlying data-generating process is inherently Normal. Flexibility of Bayesian Framework: • The Bayesian framework allows for incorporating prior information into the model. This means that even if the Normal likelihood is chosen for convenience, prior information can still shape the posterior distribution and provide a more nuanced view of the data. Epistemic Uncertainty: • Bayesian analysis recognizes the presence of epistemic uncertainty, which arises from our knowledge limitations. 
By treating the Normal likelihood as an epistemological choice, we acknowledge that it reflects our current understanding of the data. In summary, viewing the Normal likelihood as an epistemological assumption suggests that it is a choice made for practical and computational reasons, rather than a definitive statement about the underlying reality. Bayesian analysis allows for the incorporation of prior information and the exploration of various likelihood functions, making it a flexible framework that can accommodate different modeling perspectives. Question 4: (8p) Name at least four distributions in the exponential family (4p). Provide examples of when one can use each distribution when designing statistical models and explain what is so special about each distribution you’ve picked, i.e., their assumptions (4p). Paper Solution: Take your pick, Gamma, Normal, Exponential, Poisson, Binomial, are some of the distributions. . . Solution: Four Distributions in the Exponential Family: Normal Distribution: • Assumptions: The Normal distribution is characterized by two parameters: mean (μ) and variance (σ^2). It assumes that the data follows a symmetric, bell-shaped curve, with outcomes evenly distributed around the mean. • Special Characteristics: It is often used in models where data is assumed to be normally distributed, such as linear regression. The Normal distribution is particularly special because of its well-known properties, making it suitable for modeling a wide range of real-world phenomena. Exponential Distribution: • Assumptions: The Exponential distribution is characterized by a single parameter, the rate (λ). It assumes that the data follows a memoryless, continuous probability distribution, often modeling the time between events in a Poisson process. • Special Characteristics: The Exponential distribution is special because of its memoryless property, which means that the probability of an event occurring in the future is independent of the time already elapsed. It's used in survival analysis and reliability studies. Poisson Distribution: • Assumptions: The Poisson distribution is characterized by a single parameter, the rate (λ), representing the average rate of events occurring in a fixed interval. It assumes that events are rare and independent. • Special Characteristics: The Poisson distribution is special because it models count data, such as the number of phone calls at a call center within a specific time frame. It's commonly used in scenarios where events happen randomly and infrequently. Binomial Distribution: • Assumptions: The Binomial distribution is characterized by two parameters: the number of trials (n) and the probability of success in each trial (p). It assumes a fixed number of independent trials with two possible outcomes (success or failure) in each trial. • Special Characteristics: The Binomial distribution is special because it models binary data, like the number of successful coin flips out of a fixed number of attempts. It's commonly used in applications where there are repeated independent trials, each with a known probability of success. These distributions are special because they each have distinct characteristics and assumptions that make them suitable for modeling different types of data. The choice of distribution in statistical models depends on the nature of the data and how well the assumptions of the selected distribution align with the real-world phenomenon being studied. 
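Following the four distributions named above, a minimal R illustration of each as a likelihood choice; the parameter values are illustrative assumptions only.

# Normal: continuous outcomes with additive noise around a mean
dnorm(5, mean = 5, sd = 1)

# Exponential: non-negative waiting times between events (memoryless)
dexp(2, rate = 0.5)

# Poisson: counts of rare, independent events per fixed interval
dpois(3, lambda = 2)

# Binomial: number of successes out of a fixed number of independent trials
dbinom(7, size = 10, prob = 0.5)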
Question 6: (6p) What is the purpose and limitations of using Judgement Studies and Sample Studies as a research strategy? Provide examples, i.e., methods for each of the two categories, and clarify if one uses mostly qualitative or quantitative approaches (or both).
Paper Solution:
Judgement Studies:
• Judge or rate behaviors, respond to stimulus, discuss a topic of interest.
• Setting: Neutral setting.
• Actors: Systematic sampling.
• Goal: Seek generalizability over the responses, not over the population of actors. Responses are not specific to any context.
• Unobtrusive and no control.
• Examples of research methods: Delphi study, panel of experts, focus group, evaluation study, interview studies.
• Qualitative and/or quantitative data.
Sample Studies:
• Study: the distribution of a characteristic of a population or the correlation between two or more characteristics.
• Setting: Neutral setting.
• Actors: Systematic sampling.
• Goal: Generalizability over a population of actors. Responses are not specific to any context.
• Unobtrusive and no control.
• It does not need to be a sample of humans. It can be of artifacts such as projects, models, etc.
• Examples of research methods: Surveys, systematic literature reviews, software repository mining, interviews.
• Mostly quantitative data, but can include qualitative data.
Solution: Judgment Studies: Purpose: Judgment studies involve collecting expert opinions, judgments, or ratings to gain insights into a specific problem or domain. These studies are valuable for harnessing expert knowledge, especially when empirical data is scarce or difficult to obtain. Judgment studies can provide qualitative and quantitative insights, depending on the nature of the study. They are useful for understanding complex phenomena, decision-making processes, or evaluating the quality of outcomes.
Limitations:
• Subjectivity: Judgment studies are inherently subjective, as they rely on expert opinions. The results can be influenced by individual biases, experiences, or perspectives.
• Limited Generalizability: Findings from judgment studies might not generalize to broader populations, as they are often specific to the experts involved.
• Resource-Intensive: Conducting judgment studies may require significant time and effort to engage with experts and collect their judgments.
Sample Studies: Purpose: Sample studies involve collecting data from a subset (sample) of a larger population to draw inferences and make generalizations about the entire population. They are commonly used for quantitative research and hypothesis testing. Sample studies are particularly useful for estimating parameters, testing hypotheses, and making predictions based on statistical analysis.
Limitations:
• Sampling Bias: If the sample is not selected randomly or is not representative of the population, sampling bias can occur, leading to invalid generalizations.
• Resource Constraints: Conducting sample studies can be resource-intensive, especially when aiming for a large and diverse sample size.
• Limited Contextual Understanding: Sample studies may provide statistical insights but might lack the depth of understanding that can be obtained through qualitative methods.
Examples and Approaches: Judgment Studies: Example: Expert Delphi Panel on Climate Change Impact Predictions. Method: Engage a panel of climate scientists to provide expert judgments on the potential impacts of climate change. Experts might rate the severity of specific impacts or provide qualitative insights.
Approach: Qualitative insights are gathered through open-ended discussions, while quantitative aspects involve structured rating scales or consensus-building processes.

Sample Studies:
Example: Survey on Customer Satisfaction with a Product
Method: Administer a survey to a random sample of customers who have used a product, collecting quantitative data on their satisfaction levels.
Approach: Quantitative analysis involves statistical techniques such as regression to assess factors affecting satisfaction, and results are quantifiable.

In practice, researchers may use a combination of both judgment studies and sample studies to triangulate findings, enhancing the robustness of their research. The choice between qualitative and quantitative approaches depends on the research objectives and the availability of relevant data.

Question 7: (8p) Below follows an abstract from a research paper. Answer the questions,
• Which of the eight research strategies presented in the ABC framework does this paper likely fit? Justify and argue!
• Can you argue the main validity threats of the paper, based on the research strategy you picked?
– It would be very good if you can list threats in the four common categories we usually work with in software engineering.

Requirement prioritization is recognized as an important decision-making activity in requirements engineering and software development. Requirement prioritization is applied to determine which requirements should be implemented and released. In order to prioritize requirements, there are several approaches/techniques/tools that use different requirements prioritization criteria, which are often identified by gut feeling instead of an in-depth analysis of which criteria are most important to use. In this study we investigate which requirements prioritization criteria are most important to use in industry when determining which requirements are implemented and released, and if the importance of the criteria change depending on how far a requirement has reached in the development process. We conducted a quantitative study of one completed project from one software developing company by extracting 32,139 requirements prioritization decisions based on eight requirements prioritization criteria for 11,110 requirements. The results show that not all requirements prioritization criteria are equally important, and this change depending on how far a requirement has reached in the development process.

Paper Solution: A reply from one of the authors of the ABC paper: This is a good one—and quite a common one as well. It's not straightforward, and I don't think there is a crystal-clear 'right' answer, and I've struggled a bit with that in the past—but I've come to the realization that it's "OK" that there is no right answer per se, but rather that at least we have some anchor points to have the discussion; the root question of course is: Is this a quantitative case study, i.e., a case study using a sample study as strategy, or vice versa, a sample study with data from a single case study? It very much depends I think on the nature of the sample, and whether that is a very specific sample to the company. If the sampling allows generalizing to beyond the company, then one could argue it's a sample study.

Solution: Research Strategy: The research strategy that best fits this paper is arguably a "Sample Study" within the ABC framework; note that the framework has no separate "observational study" category, although the study is observational in character.
The paper presents an analysis of actual requirements prioritization decisions in an industrial context without any active intervention or manipulation by the researchers. It fits the sample-study strategy because it observes and analyzes a large set of existing decision data to draw conclusions, rather than intervening in how those decisions were made.

Justification: The study investigates requirements prioritization practices in an industrial setting, which aligns with an observational, data-driven approach. It does not involve any experimental manipulation or intervention. The data collection method involves extracting 32,139 requirements prioritization decisions from a completed project, which represents observations of real-world practices. The research objective is to understand which prioritization criteria are important in industry, based on actual decision-making data, which is characteristic of observational research.

Validity Threats: The main validity threats in this study, within the context of software engineering, may include:

External Validity:
Threat: The findings might be specific to the context of the software company studied and may not generalize to other companies.
Mitigation: Conducting similar studies in multiple software development organizations to assess the external validity of the findings.

Internal Validity:
Threat: Confounding variables or unaccounted factors might influence the results. For example, specific project characteristics or team dynamics could affect prioritization.
Mitigation: Controlling for potential confounders through statistical techniques or conducting follow-up qualitative research to understand hidden factors.

Construct Validity:
Threat: The criteria used to assess the importance of prioritization factors might not fully capture the complex nature of software requirements.
Mitigation: Ensuring that the criteria used are comprehensive and appropriate for the context. Refining the criteria based on expert consultation or pilot testing.

Conclusion Validity:
Threat: Drawing incorrect or premature conclusions from the data, especially when inferring causality.
Mitigation: Careful statistical analysis, peer reviews, and conducting follow-up studies to confirm the stability of the results.

This research paper provides valuable insights into industry practices related to requirement prioritization, but these validity threats should be addressed explicitly to strengthen the quality and generalizability of the findings.

Question 8: (12p) In software engineering research, which has taken inspiration from the social sciences, we often classify threats to validity in four categories. Which are the four categories (4p)? Please explain what question(s) each category tries to answer and provide an example for each category (8p).

Solution: The four categories of threats to validity in software engineering research are as follows:

Internal Validity:
• Question: Does the study provide sufficient evidence to establish a causal relationship between the independent and dependent variables?
• Example: In a study investigating the impact of code review on software quality, a threat to internal validity could be a lack of control over other factors (e.g., team dynamics) that might influence the observed outcomes, making it difficult to attribute changes solely to code review.

External Validity:
• Question: To what extent can the findings of the study be generalized beyond the specific context or sample to other situations or populations?
• Example: Research conducted in a controlled laboratory setting may have low external validity if the findings cannot be applied to real-world software development projects.

Construct Validity:
• Question: Are the measures used to assess the variables of interest valid, and do they accurately capture the constructs under investigation?
• Example: In a study assessing the quality of software documentation, a threat to construct validity might arise if the chosen metrics do not comprehensively represent documentation quality or if the metrics are subject to bias.

Conclusion Validity:
• Question: Can the conclusions drawn from the data accurately represent the relationships or effects observed in the study?
• Example: In a study on the impact of an agile development process on project completion time, a threat to conclusion validity could arise if the statistical analysis misinterprets the data, leading to incorrect conclusions about the process's effect.

These categories help researchers identify potential issues in their studies and take appropriate measures to address or mitigate these threats. By considering internal, external, construct, and conclusion validity, researchers can enhance the quality and robustness of their findings in software engineering research.

Question 9: (6p) See the DAG below. We want to estimate the total causal effect of SizeStaff on Productivity. What should we condition on? (1p) Design a complete model, in math notation, where the outcome Productivity is a count 0, . . . , to infinity, and include the variable(s) needed to answer the above question. Also add, what you believe to be, suitable priors on all parameters. State any assumptions! (5p)

(DAG with nodes Attitudes, SizeStaff, Experience, Effort, LOC, and Productivity; the edges are listed in the dagitty specification below.)

Paper Solution: 4p for an appropriate model as further down below.

library(dagitty)
d <- dagitty('dag {
  bb="0,0,1,1"
  Attitudes [pos="0.474,0.305"]
  Effort [exposure, pos="0.414,0.522"]
  Experience [pos="0.734,0.333"]
  LOC [pos="0.545,0.527"]
  Productivity [outcome, pos="0.482,0.674"]
  SizeStaff [pos="0.243,0.322"]
  Attitudes -> LOC
  Effort -> Productivity
  Experience -> LOC
  LOC -> Productivity
  SizeStaff -> Effort
  SizeStaff -> LOC
}')

r$> adjustmentSets(d, exposure = "SizeStaff", outcome = "Productivity")
{}

and the answer is {}, i.e., nothing (well, except for SizeStaff obviously)! In short, check which elemental confounds we have in the DAG (there are only four), follow the paths, and decide what to condition on.

Productivityi ∼ Poisson(λi)
log(λi) = α + βss SizeStaffi
α ∼ Normal(0, 2.5)
βss ∼ Normal(0, 0.2)

We'd like to see a slightly broader prior on α (what could productivity be? Well, it's a count, and I assume it's probably not in the millions), and a significantly more narrow prior on β since we have a log link, i.e., even a Normal(0, 0.2) still allows effects as large as roughly 2.5 on the outcome (rate) scale once passed through the log link, which is perfect for a β (that would be a large effect after all!). A prior predictive sketch of these priors is given after Question 10 below.

Solution: To estimate the total causal effect of SizeStaff on Productivity, we do not need to condition on anything beyond SizeStaff itself: the adjustment set is empty, and conditioning on a mediator such as Effort or LOC would instead block part of the total effect. Here's a complete model in mathematical notation, suitable priors, and relevant assumptions:

Model:
• Let Productivityi represent the productivity of the i-th software development team.
• Efforti represents the effort expended by the i-th team.
• SizeStaffi represents the size of software development team i.
• LOCi represents lines of code produced by team i.
• Attitudesi represents the attitudes of team i.
• Experiencei represents the experience of team i.

We model the variables as follows:
• Efforti ∼ Gamma(αE, βE) (Gamma distribution for effort)
• SizeStaffi ∼ Normal(μS, σS) (Normal distribution for team size)
• Experiencei ∼ Normal(μX, σX) (Normal distribution for experience)
• LOCi ∼ Poisson(λ) (Poisson distribution for lines of code)
• Attitudesi ∼ Normal(μA, σA) (Normal distribution for attitudes)

The relationships, following the DAG, are defined as follows:
• Efforti has a causal effect on Productivityi, and it follows a Gamma distribution with shape αE and rate βE.
• SizeStaffi influences Efforti and LOCi, and its distribution follows a normal distribution.
• Experiencei is normally distributed and affects LOCi.
• Attitudesi is normally distributed and influences LOCi.

Assumptions:
1. We assume that productivity (Productivityi) is linked to the lines of code produced (LOCi) and can be modeled as a function of LOCi, with additional variables (e.g., effort, attitudes, and experience) that explain the variability.
2. The Gamma distribution for Efforti is suitable to model effort, as it is a positive continuous variable.
3. Normal distributions are used for SizeStaff, Experience, and Attitudes, as these variables are continuous and could take a wide range of values.
4. A Poisson distribution is used for LOCi, as it models count data.
5. The relationships among these variables are assumed to be linear (on the link scale) for simplicity, but more complex relationships could be explored.

This model names the variables needed to estimate the causal effect of SizeStaff on Productivity while considering the other relevant variables. A complete answer should also state explicit priors for all parameters (e.g., αE, βE, the μ's and σ's, and λ), as in the Paper Solution above.

Question 10: (8p) Below follows a number of multiple choice questions. There can be more than one correct answer! Mark the correct answer by crossing the answer. In the case of DAGs, X is the treatment and Y is the outcome.

Q1 What is this construct called: Y ← Z ← X? {Collider} {Pipe} {Fork} {Descendant}
Q2 What is this construct called: Y → Z ← X? {Collider} {Pipe} {Fork} {Descendant}
Q3 We should never condition on Z: Y ← Z → X? {True} {False}
Q4 If we condition on Z we close the path X → Z ← Y. {True} {False}
Q5 What is this construct called: Y ← Z → X? {Collider} {Pipe} {Confounder} {Descendant}
Q6 An interaction is an influence of a predictor... {on a parameter} {conditional on another predictor} {conditional on a parameter}
Q7 What distribution maximizes this? H(p) = −∑_{i=1}^{n} p_i log(p_i) {Most complex} {Most structured} {Flattest} {Distribution that can happen the most ways}
Q8 What distribution to pick if it's a real value in an interval between 0 and 1, but where neither 0 nor 1 are allowed? {Exponential} {Log-Normal} {Beta}
Q9 What distribution to pick if it's a real value with finite variance? {Binomial} {Negative-Binomial} {Normal}
Q10 Dichotomous variables, varying probability? {Binomial} {Beta-Binomial} {Negative-Binomial/Gamma-Poisson}
Q11 A 2-parameter distribution with a shape parameter k and a scale parameter θ for positive real values 0, . . . , to infinity? {Binomial} {Beta} {Gamma}
Q12 Natural value, positive, mean and variance are equal? {Gamma-Poisson} {Poisson} {Cumulative}
Q13 Unordered (labeled) values? {Categorical} {Cumulative} {Dinominal}
Q14 On which effect scale are parameters? {Absolute} {Relative} {None}
Q15 On which effect scale are predictions? {None} {Absolute} {Relative}
Q16 In zero-inflated (zi) models we assume the data-generating process consists of two disparate parts? {Yes} {No}
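As referenced in the Paper Solution to Question 9, here is a minimal prior predictive sketch in base R (not part of the original solutions; the standardized SizeStaff predictor and the number of draws are assumptions made only for illustration). It checks what α ∼ Normal(0, 2.5) and βss ∼ Normal(0, 0.2) imply for Productivity once pushed through the log link:

set.seed(2)
N <- 1e4
alpha      <- rnorm(N, mean = 0, sd = 2.5)       # prior draws for the intercept
beta_ss    <- rnorm(N, mean = 0, sd = 0.2)       # prior draws for the SizeStaff coefficient
size_staff <- rnorm(N, mean = 0, sd = 1)         # hypothetical standardized predictor values
lambda     <- exp(alpha + beta_ss * size_staff)  # inverse of the log link
prior_productivity <- rpois(N, lambda)           # prior predictive counts
quantile(prior_productivity, c(0.5, 0.9, 0.99))  # plausible counts, nowhere near the millions
quantile(exp(beta_ss), c(0.01, 0.5, 0.99))       # multiplicative effect of one unit of SizeStaff

If the simulated counts look implausible for the productivity measure at hand, the prior on α should be adjusted; if the multiplicative effects look too extreme, the prior on β should be narrowed further.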