MDA using R

A. Step-by-step guidelines to download and install R and RStudio on Windows

Install R
R is distributed through the Comprehensive R Archive Network (CRAN).
1. Go to the CRAN website: https://cran.r-project.org
2. Click on "Download R for Windows".
3. Click on "base" (for the base R installation).
4. Click on the link "Download R 4.x.x for Windows" (86 megabytes, 64 bit).
5. Run the downloaded file (.exe) to start the installation:
   o Choose a language.
   o Follow the on-screen instructions. Accept the default options unless you have specific needs.
   o Click Next until the installation finishes.

Install RStudio
1. Go to the RStudio download page: https://posit.co/download/rstudio-desktop/
2. Scroll down to the "Install RStudio" section.
3. Click "Download RStudio Desktop for Windows".
4. Run the downloaded installer and follow the setup steps:
   o Accept the license agreement.
   o Choose the installation folder.
   o Click Install and then Finish.

Open RStudio
• Search for RStudio in your Windows Start menu.
• Open it. R runs in the background through RStudio.

Tips
• Always install R before RStudio (RStudio needs R to work).
• You can install packages in RStudio using:
    install.packages("package_name")

B. R Packages
R packages are collections of functions, data, and documentation bundled together to extend the functionality of R. They allow users to perform specialized tasks without writing all the code from scratch.
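To make the idea of a package as a bundle of functions, data, and documentation concrete, the sketch below inspects two packages that ship with every R installation (stats and datasets), so nothing has to be installed first. The object names stats_functions and dataset_index are illustrative choices, not part of any package.

```r
# Functions exported by the 'stats' package (attached by default)
stats_functions <- ls("package:stats")
head(stats_functions)
length(stats_functions)

# Datasets bundled with the 'datasets' package
dataset_index <- data(package = "datasets")$results
head(dataset_index[, c("Item", "Title")])

# Documentation index for a package (opens the help viewer when run interactively)
# help(package = "stats")
```

The same three calls work for any installed package once its name is substituted.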
Key Points About R Packages:
• Installation: You can install packages from CRAN using:
    install.packages("packageName")
• Loading: After installation, load a package into your R session with:
    library(packageName)
• Popular R Packages by Category:
   o Data Manipulation: dplyr, data.table, tidyr
   o Data Visualization: ggplot2, plotly, lattice
   o Statistical Modeling: lmtest, caret, glmnet
   o Machine Learning: randomForest, xgboost, mlr3
   o Time Series Analysis: forecast, xts, zoo
   o Text Mining: tm, tidytext, text2vec
   o Web Scraping: rvest, httr
   o Reproducible Reports: knitr, rmarkdown
Creating Your Own Package: Use devtools::create("myPackage") and follow standard directory structures and documentation practices.

C. R Packages for Specific Tasks

C1. Descriptive Statistics (Univariate & Multivariate) + Plots
Recommended Packages:
• psych
• summarytools
• skimr
• GGally (for multivariate visualization)
• ggplot2
Plot Functions:
• Univariate (e.g., histograms, boxplots):
    ggplot(data, aes(x = variable)) + geom_histogram()
    ggplot(data, aes(y = variable)) + geom_boxplot()
• Multivariate (pair plots, correlation matrices):
    GGally::ggpairs(data)
    corrplot::corrplot(cor(data), method = "circle")

C2. Linear Regression + Plots
Recommended Packages:
• stats
• car
• broom
• ggplot2
Plot Functions:
• Base plot diagnostics:
    model <- lm(y ~ x, data = df)
    plot(model)  # Residuals, leverage, Q-Q
• Using ggplot2:
    ggplot(df, aes(x = x, y = y)) +
      geom_point() +
      geom_smooth(method = "lm", se = TRUE)
• car package:
    car::avPlots(model)  # Added-variable plots

C3. Logistic Regression + Plots
Recommended Packages:
• stats
• pROC
• ResourceSelection
• ggplot2
Plot Functions:
• ROC curve:
    library(pROC)
    roc_obj <- roc(df$actual, predict(model, type = "response"))
    plot(roc_obj)
• Predicted probabilities plot:
    df$prob <- predict(model, type = "response")
    ggplot(df, aes(x = predictor, y = prob)) + geom_line()

C4. Exploratory Factor Analysis (EFA) + Plots
Recommended Packages:
• psych
• GPArotation
• nFactors
• corrplot
Plot Functions:
• Scree plot:
    fa.parallel(df)  # Built-in scree and parallel analysis
• Factor loadings heatmap:
    efa_result <- fa(df, nfactors = 3)
    heatmap(efa_result$loadings)
• Correlation plot:
    corrplot(cor(df), method = "circle")

C5. Confirmatory Factor Analysis (CFA) + Plots
Recommended Packages:
• lavaan
• semPlot
Plot Functions (here cfa_model stands for a fitted lavaan model object):
• Path diagram:
    library(semPlot)
    semPaths(cfa_model, whatLabels = "std", layout = "tree")
• Model fit summary:
    summary(cfa_model, fit.measures = TRUE, standardized = TRUE)

C6. Structural Equation Modeling (SEM) + Plots
Recommended Packages:
• lavaan
• semPlot
• OpenMx (for advanced models)
Plot Functions:
• SEM path diagram:
    semPaths(sem_model, whatLabels = "std", layout = "circle")
• Latent vs observed variable plots: Included via semPlot; customize node shapes/colors.

D. R Code for multivariate data analysis

Step 1: Learn the Basics of R
• Install R and RStudio (if you haven't already)
• Learn basic R syntax: variables, data types, operators
• Master vectors, lists, matrices, data frames, and factors
• Use basic functions and write your own
• Practice with indexing and subsetting
Recommended: R for Data Science by Hadley Wickham & Garrett Grolemund (Chapters 1–8)

Step 2: Data Manipulation and Visualization
• Use tidyverse packages:
   o dplyr for data manipulation (filter, select, mutate, group_by, summarize)
   o ggplot2 for visualization
   o readr and tibble for data import and handling
• Learn piping (%>%) and functional programming basics
Practice: Clean and summarize datasets like mtcars, iris, or any CSV

Step 3: Foundations of Multivariate Statistics
Learn the theory and practical implementation of these methods:
• Principal Component Analysis (PCA): prcomp(), FactoMineR, factoextra
• Factor Analysis: psych::fa(), factanal()
• Cluster Analysis: kmeans(), hclust(), NbClust
• Multidimensional Scaling (MDS): cmdscale(), vegan
• Discriminant Analysis (LDA/QDA): MASS::lda(), klaR
• MANOVA: manova() function
• Canonical Correlation Analysis: CCA, yacca
Dataset ideas: USArrests, mtcars, iris, wine, healthcare datasets

Step 4: Advanced Techniques and Visualization
• Use GGally, corrplot, and heatmaply for correlation and relationships
• Interactive visuals: plotly, shiny for dashboards
• Explore caret for multivariate predictive modeling

Step 5: Projects & Practice
• Try projects like:
   o Customer segmentation
   o Gene expression data PCA
   o Credit risk clustering
   o Sports performance or sensor data analysis

D1. Descriptive statistics in R: univariate first, then multivariate

Step 1: Descriptive Statistics in R: Univariate Analysis
We'll use the built-in mtcars dataset to demonstrate.

Summary of a Single Variable
    # Load dataset
    data(mtcars)
    # Summary of 'mpg'
    summary(mtcars$mpg)

Basic Stats
    mean(mtcars$mpg)
    median(mtcars$mpg)
    sd(mtcars$mpg)
    var(mtcars$mpg)
    range(mtcars$mpg)
    IQR(mtcars$mpg)

Histogram and Boxplot
    hist(mtcars$mpg, main = "Histogram of MPG", col = "lightblue")
    boxplot(mtcars$mpg, main = "Boxplot of MPG")

Using psych Package
    install.packages("psych")
    library(psych)
    describe(mtcars$mpg)

Step 2: Descriptive Statistics: Multivariate Analysis
Now let's look at multiple variables together.

Summary Statistics for Whole Dataset
    summary(mtcars)

Correlation Matrix
    cor(mtcars)

Pairwise Plots
    pairs(mtcars[, 1:4], main = "Scatterplot Matrix")

Using psych::describe() for All Variables
    describe(mtcars)

Using GGally for Detailed Plotting
    install.packages("GGally")
    library(GGally)
    ggpairs(mtcars[, 1:5])

Your Practice Task
Try the above code blocks in R and answer:
• Which variable has the highest mean and standard deviation?
• What relationships do you see in the scatterplot matrix?
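One possible way to approach the first practice question is to compute the mean and standard deviation of every column at once rather than one variable at a time. The sketch below uses only base R; the names means, sds, and stats_table are illustrative.

```r
# Reload the built-in dataset so every column is numeric
data(mtcars)

# Mean and standard deviation of each variable
means <- sapply(mtcars, mean)
sds <- sapply(mtcars, sd)

# Combine into one table, sorted by decreasing standard deviation
stats_table <- data.frame(mean = means, sd = sds)
stats_table[order(-stats_table$sd), ]

# Names of the variables with the largest mean and the largest sd
names(which.max(means))
names(which.max(sds))
```

Because the variables are measured in different units, a large standard deviation here partly reflects scale; comparing sd / mean is one option when a unit-free measure is needed.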
Comprehensive R script for Descriptive Statistics
Full R Script: Descriptive Statistics (Univariate & Multivariate)
    # ------------------------------------------------------------
    # DESCRIPTIVE STATISTICS: Univariate and Multivariate + Plots
    # Dataset: mtcars (built-in)
    # ------------------------------------------------------------
    # Load dataset
    data(mtcars)
    head(mtcars)

    # Optional: Load required packages
    if (!require(psych)) install.packages("psych")
    if (!require(GGally)) install.packages("GGally")
    if (!require(corrplot)) install.packages("corrplot")
    if (!require(ggplot2)) install.packages("ggplot2")
    library(psych)
    library(GGally)
    library(corrplot)
    library(ggplot2)

    # --------------------------
    # 1. UNIVARIATE ANALYSIS
    # --------------------------
    # Summary of a single variable (e.g., mpg)
    summary(mtcars$mpg)

    # Basic statistics
    mean(mtcars$mpg)
    median(mtcars$mpg)
    sd(mtcars$mpg)
    var(mtcars$mpg)
    range(mtcars$mpg)
    IQR(mtcars$mpg)

    # Detailed summary for all variables
    describe(mtcars)

    # Histogram
    hist(mtcars$mpg, col = "skyblue", main = "Histogram of MPG",
         xlab = "Miles per Gallon")

    # Boxplot
    boxplot(mtcars$mpg, main = "Boxplot of MPG", col = "lightgreen")

    # Density plot
    plot(density(mtcars$mpg), main = "Density Plot of MPG")
    polygon(density(mtcars$mpg), col = "lightblue", border = "darkblue")

    # --------------------------
    # 2. MULTIVARIATE ANALYSIS
    # --------------------------
    # Summary statistics for entire dataset
    summary(mtcars)

    # Correlation matrix
    cor_matrix <- cor(mtcars)
    print(cor_matrix)

    # Correlation plot
    corrplot(cor_matrix, method = "circle", type = "upper", tl.cex = 0.8)

    # Scatterplot matrix (first 5 variables)
    pairs(mtcars[, 1:5], main = "Scatterplot Matrix")

    # GGally pairwise plot (first 5 variables)
    ggpairs(mtcars[, 1:5])

    # Multivariate Boxplot (using ggplot2 for grouped variable)
    mtcars$cyl <- as.factor(mtcars$cyl)
    ggplot(mtcars, aes(x = cyl, y = mpg, fill = cyl)) +
      geom_boxplot() +
      labs(title = "MPG by Cylinder Group", x = "Cylinders",
           y = "Miles per Gallon")

    # --------------------------
    # 3. Save Descriptive Summary as Table
    # --------------------------
    desc_stats <- describe(mtcars)
    write.csv(desc_stats, "descriptive_statistics.csv", row.names = TRUE)

    # --------------------------
    # 4. OPTIONAL: Plot All Boxplots Together
    # --------------------------
    boxplot(mtcars, main = "Boxplots of All Variables", las = 2,
            col = rainbow(ncol(mtcars)))

    # --------------------------
    # 5. Practice Ideas
    # --------------------------
    # Q1: Which variable has the highest variability?
    # Q2: What pairs of variables are highly correlated?
    # Q3: Are there any outliers in mpg or hp?
    # Q4: How does mpg differ by number of cylinders?

D2. Linear regression in R, covering simple and multiple regression
1. Model building
2. Parameter estimation
3. Assumption testing
4. Goodness of fit
5. Residual analysis
6. Tutorial questions
We'll use the mtcars dataset for illustration.

Example Dataset: mtcars
Let's predict Miles Per Gallon (mpg) from other car features.
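Before fitting any model, it can help to see how strongly each candidate feature is linearly related to mpg. The following is a minimal base R sketch; predictors and mpg_cor are illustrative names, and a correlation screen like this is only a first look, not a substitute for the model diagnostics later in this section.

```r
# Reload the built-in dataset so every column is numeric
data(mtcars)

# All columns except the response
predictors <- setdiff(names(mtcars), "mpg")

# Pearson correlation of mpg with each candidate predictor
mpg_cor <- sapply(mtcars[predictors], function(x) cor(mtcars$mpg, x))

# Sorted from most negative to most positive
round(sort(mpg_cor), 2)
```

Predictors near either end of the sorted vector are the natural first candidates for the simple regression that follows.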
Simple Linear Regression
Model: mpg ~ wt (miles per gallon by car weight)
    # Fit model
    model_simple <- lm(mpg ~ wt, data = mtcars)
    # View summary
    summary(model_simple)
Interpretation
• Estimate: Coefficients
• Pr(>|t|): Significance
• R-squared: Goodness of fit

Multiple Linear Regression
Model: mpg ~ wt + hp + cyl
    model_multiple <- lm(mpg ~ wt + hp + cyl, data = mtcars)
    summary(model_multiple)

Parameter Estimation
From summary():
• Intercept and slope coefficients
• Standard error, t-values, and p-values
Also:
    confint(model_multiple)  # Confidence intervals for coefficients

Assumption Testing
• Linearity
    plot(model_multiple$fitted.values, mtcars$mpg,
         xlab = "Fitted", ylab = "Actual", main = "Linearity Check")
    abline(0, 1)
• Independence (Durbin-Watson Test)
    install.packages("car")
    library(car)
    durbinWatsonTest(model_multiple)
• Homoscedasticity (constant variance)
    plot(model_multiple$fitted.values, resid(model_multiple),
         main = "Residuals vs Fitted")
    abline(h = 0)
• Normality of Residuals
    hist(resid(model_multiple), main = "Histogram of Residuals")
    qqnorm(resid(model_multiple)); qqline(resid(model_multiple))
    shapiro.test(resid(model_multiple))  # Formal test

Goodness of Fit
Check:
• R-squared and Adjusted R-squared
• AIC(model_multiple): lower is better
• anova(model_simple, model_multiple): compare models

Residual Analysis
    par(mfrow = c(2, 2))
    plot(model_multiple)  # 4-panel diagnostic plots

Tutorial Questions for Practice
1. Fit a simple linear regression model with hp predicting mpg.
2. Add wt and disp as predictors and compare with the previous model.
3. Which predictor contributes most to mpg prediction?
4. Are the residuals normally distributed?
5. Test multicollinearity using VIF.
    vif(model_multiple)  # From 'car' package

Full R script file (with all code and comments)
R Script: Linear Regression Analysis
    # ------------------------------------------------------------
    # LINEAR REGRESSION IN R: Simple and Multiple Regression
    # Dataset: mtcars
    # ------------------------------------------------------------
    # Load built-in dataset
    data(mtcars)

    # --------------------------
    # 1. Simple Linear Regression
    # --------------------------
    # Model: mpg ~ wt
    model_simple <- lm(mpg ~ wt, data = mtcars)

    # Summary of the model
    summary(model_simple)

    # Plot actual vs fitted
    plot(model_simple$fitted.values, mtcars$mpg,
         xlab = "Fitted", ylab = "Actual", main = "Linearity Check")
    abline(0, 1, col = "red")

    # --------------------------
    # 2. Multiple Linear Regression
    # --------------------------
    # Model: mpg ~ wt + hp + cyl
    model_multiple <- lm(mpg ~ wt + hp + cyl, data = mtcars)
    summary(model_multiple)

    # Confidence intervals for coefficients
    confint(model_multiple)

    # --------------------------
    # 3. Assumption Testing
    # --------------------------
    # Install required packages
    if (!require(car)) install.packages("car")
    library(car)

    # Independence: Durbin-Watson Test
    durbinWatsonTest(model_multiple)

    # Homoscedasticity: Residuals vs Fitted
    plot(model_multiple$fitted.values, resid(model_multiple),
         main = "Residuals vs Fitted",
         xlab = "Fitted values", ylab = "Residuals")
    abline(h = 0, col = "red")

    # Normality: Histogram and QQ plot
    hist(resid(model_multiple), main = "Histogram of Residuals")
    qqnorm(resid(model_multiple)); qqline(resid(model_multiple), col = "blue")

    # Normality test
    shapiro.test(resid(model_multiple))

    # --------------------------
    # 4. Goodness of Fit
    # --------------------------
    # R-squared, Adjusted R-squared available in summary()
    # Compare models
    anova(model_simple, model_multiple)

    # AIC for model comparison
    AIC(model_simple)
    AIC(model_multiple)

    # --------------------------
    # 5. Residual Diagnostics
    # --------------------------
    # Diagnostic Plots
    par(mfrow = c(2, 2))
    plot(model_multiple)

    # Multicollinearity: VIF
    vif(model_multiple)

    # --------------------------
    # 6. Tutorial Questions
    # --------------------------
    # Q1: Simple regression with hp
    model_hp <- lm(mpg ~ hp, data = mtcars)
    summary(model_hp)

    # Q2: Add wt and disp, compare with previous model
    model_expanded <- lm(mpg ~ hp + wt + disp, data = mtcars)
    summary(model_expanded)
    anova(model_hp, model_expanded)

    # Q3: Which predictor contributes most? Check p-values
    # Q4: Are residuals normal? See QQ plot and Shapiro-Wilk test above
    # Q5: Check multicollinearity (already done with vif())

    # Reset plotting layout
    par(mfrow = c(1, 1))

D3. ANOVA
A complete R script for ANOVA (one-way and two-way), including model fitting, post-hoc tests, assumptions testing, visualizations, and practice questions using the mtcars dataset.

R Script: ANOVA in R
    # ------------------------------------------------------------
    # ANOVA in R: One-way and Two-way with Assumptions and Post-hoc
    # Dataset: mtcars
    # ------------------------------------------------------------
    # Load dataset
    data(mtcars)

    # Convert relevant variables to factors
    mtcars$cyl <- as.factor(mtcars$cyl)
    mtcars$gear <- as.factor(mtcars$gear)
    mtcars$am <- as.factor(mtcars$am)

    # --------------------------
    # 1. One-way ANOVA: mpg ~ cyl
    # --------------------------
    # Fit ANOVA model
    model_aov1 <- aov(mpg ~ cyl, data = mtcars)

    # Summary of the model
    summary(model_aov1)

    # --------------------------
    # 2. Post-hoc Test: Tukey HSD
    # --------------------------
    TukeyHSD(model_aov1)

    # --------------------------
    # 3. Boxplot of Groups
    # --------------------------
    boxplot(mpg ~ cyl, data = mtcars,
            main = "MPG by Number of Cylinders",
            xlab = "Cylinders", ylab = "Miles Per Gallon",
            col = "lightblue")

    # --------------------------
    # 4. Assumption Testing
    # --------------------------
    # Normality of residuals
    shapiro.test(residuals(model_aov1))  # Should be p > 0.05

    # QQ Plot
    qqnorm(residuals(model_aov1)); qqline(residuals(model_aov1), col = "red")

    # Homogeneity of variance: Levene’s Test
    if (!require(car)) install.packages("car")
    library(car)
    leveneTest(mpg ~ cyl, data = mtcars)

    # --------------------------
    # 5. Two-way ANOVA: mpg ~ gear * am
    # --------------------------
    model_aov2 <- aov(mpg ~ gear * am, data = mtcars)
    summary(model_aov2)

    # --------------------------
    # 6. Two-way ANOVA (no interaction): mpg ~ gear + am
    # --------------------------
    model_aov3 <- aov(mpg ~ gear + am, data = mtcars)
    summary(model_aov3)

    # --------------------------
    # 7. Practice Questions
    # --------------------------
    # Q1: Is there a significant effect of gear on mpg?
    summary(aov(mpg ~ gear, data = mtcars))

    # Q2: Is there a significant effect of transmission (am) on mpg?
    summary(aov(mpg ~ am, data = mtcars))

    # Q3: Is the interaction gear:am significant?
    summary(aov(mpg ~ gear * am, data = mtcars))

    # Q4: Post-hoc for gear
    TukeyHSD(aov(mpg ~ gear, data = mtcars))

    # Q5: Plot residuals of model_aov2
    par(mfrow = c(2, 2))
    plot(model_aov2)
    par(mfrow = c(1, 1))

D4. Logistic regression

Overview of Logistic Regression
• Used when the dependent variable is binary (0/1, Yes/No).
• Models the log-odds (logit) of the outcome as a linear combination of predictors.

Dataset Example: mtcars
Let’s model the transmission type (am: 0 = automatic, 1 = manual) as a function of other variables like mpg, wt, hp.

1. Data Preparation
    data(mtcars)
    mtcars$am <- factor(mtcars$am, levels = c(0, 1),
                        labels = c("Automatic", "Manual"))

2. Simple Logistic Regression
Model: am ~ mpg
    # Fit model
    model_log_simple <- glm(am ~ mpg, data = mtcars, family = binomial)
    # Summary
    summary(model_log_simple)
    # Odds ratio
    exp(coef(model_log_simple))
    # Confidence intervals for odds ratios
    exp(confint(model_log_simple))

3. Multiple Logistic Regression
Model: am ~ mpg + wt + hp
    model_log_multiple <- glm(am ~ mpg + wt + hp, data = mtcars,
                              family = binomial)
    summary(model_log_multiple)
    exp(coef(model_log_multiple))     # Odds ratios
    exp(confint(model_log_multiple))  # Confidence intervals

4. Goodness of Fit
    # Model fit statistics
    logLik(model_log_multiple)  # Log-likelihood
    AIC(model_log_multiple)     # AIC value
    # Pseudo R-squared
    install.packages("pscl")
    library(pscl)
    pR2(model_log_multiple)

5. Predict and Evaluate
    # Predict probabilities
    pred_probs <- predict(model_log_multiple, type = "response")
    # Convert to class (threshold = 0.5)
    pred_class <- ifelse(pred_probs > 0.5, "Manual", "Automatic")
    # Confusion matrix
    table(Predicted = pred_class, Actual = mtcars$am)
    # Accuracy
    mean(pred_class == mtcars$am)

6. ROC Curve and AUC
    install.packages("pROC")
    library(pROC)
    roc_obj <- roc(mtcars$am, pred_probs)
    plot(roc_obj, main = "ROC Curve")
    auc(roc_obj)

Tutorial Questions for Practice
1. Is mpg a significant predictor of transmission type?
2. Fit a multiple logistic regression with hp, wt, and qsec, and interpret the odds ratios.
3. Compute the predicted probability of a manual transmission for a car with mpg = 25, wt = 2.5, hp = 100.
4. Evaluate classification accuracy and AUC.

R script file for logistic regression
R Script: Logistic Regression in R
    # ------------------------------------------------------------
    # LOGISTIC REGRESSION: Simple and Multiple Logistic Regression
    # Dataset: mtcars
    # ------------------------------------------------------------
    # Load dataset
    data(mtcars)

    # Convert response variable to a factor (0 = Automatic, 1 = Manual)
    mtcars$am <- factor(mtcars$am, levels = c(0, 1),
                        labels = c("Automatic", "Manual"))

    # --------------------------
    # 1. Simple Logistic Regression: am ~ mpg
    # --------------------------
    # Fit logistic regression model
    model_log_simple <- glm(am ~ mpg, data = mtcars, family = binomial)

    # Model summary
    summary(model_log_simple)

    # Odds ratio
    exp(coef(model_log_simple))

    # Confidence interval for odds ratio
    exp(confint(model_log_simple))

    # --------------------------
    # 2. Multiple Logistic Regression: am ~ mpg + wt + hp
    # --------------------------
    model_log_multiple <- glm(am ~ mpg + wt + hp, data = mtcars,
                              family = binomial)

    # Model summary
    summary(model_log_multiple)

    # Odds ratios
    exp(coef(model_log_multiple))

    # Confidence intervals for odds ratios
    exp(confint(model_log_multiple))

    # --------------------------
    # 3. Goodness of Fit
    # --------------------------
    # Log-likelihood and AIC
    logLik(model_log_multiple)
    AIC(model_log_multiple)

    # Pseudo R-squared
    if (!require(pscl)) install.packages("pscl")
    library(pscl)
    pR2(model_log_multiple)

    # --------------------------
    # 4. Prediction and Classification
    # --------------------------
    # Predict probabilities
    pred_probs <- predict(model_log_multiple, type = "response")

    # Convert to class labels using 0.5 threshold
    pred_class <- ifelse(pred_probs > 0.5, "Manual", "Automatic")

    # Confusion matrix
    table(Predicted = pred_class, Actual = mtcars$am)

    # Classification accuracy
    mean(pred_class == mtcars$am)

    # --------------------------
    # 5. ROC Curve and AUC
    # --------------------------
    if (!require(pROC)) install.packages("pROC")
    library(pROC)

    # ROC and AUC
    roc_obj <- roc(mtcars$am, pred_probs)
    plot(roc_obj, main = "ROC Curve")
    auc(roc_obj)

    # --------------------------
    # 6. Practice Questions
    # --------------------------
    # Q1: Is mpg a significant predictor?
    # → Check p-value in model_log_simple

    # Q2: Try another model: am ~ hp + wt + qsec
    model_alt <- glm(am ~ hp + wt + qsec, data = mtcars, family = binomial)
    summary(model_alt)
    exp(coef(model_alt))

    # Q3: Predict for new data
    new_data <- data.frame(mpg = 25, wt = 2.5, hp = 100)
    # Predicted probability
    predict(model_log_multiple, newdata = new_data, type = "response")

    # Q4: Evaluate classification accuracy and AUC (already done)

D5. Ordinal logistic regression
The response variable has ordered categories (e.g., "Low", "Medium", "High").

Overview of Ordinal Logistic Regression
• Used when the dependent variable is categorical and ordered
• Models the cumulative log odds of the response variable
• R function: MASS::polr() (Proportional Odds Logistic Regression)

Example Dataset: winequality (or we simulate one)
Let’s simulate a small example for clarity:
    # Simulate an ordinal response
    set.seed(123)
    n <- 100
    data_ordinal <- data.frame(
      quality = factor(
        sample(c("Low", "Medium", "High"), n, replace = TRUE,
               prob = c(0.3, 0.5, 0.2)),
        ordered = TRUE, levels = c("Low", "Medium", "High")
      ),
      alcohol = rnorm(n, mean = 10, sd = 1.5),
      acidity = rnorm(n, mean = 3.3, sd = 0.5)
    )

Fit Ordinal Logistic Regression Model
    install.packages("MASS")
    library(MASS)
    # Fit model using polr (Proportional Odds Logistic Regression)
    model_ordinal <- polr(quality ~ alcohol + acidity, data = data_ordinal,
                          Hess = TRUE)
    # Summary
    summary(model_ordinal)

Interpret Coefficients
    # Compute p-values
    ctable <- coef(summary(model_ordinal))
    p_vals <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
    ctable <- cbind(ctable, "p value" = round(p_vals, 4))
    ctable
• Positive coefficient: Predictor increases odds of being in a higher category
• Negative coefficient: Predictor increases odds of being in a lower category

Odds Ratios and Confidence Intervals
    # Odds ratios
    exp(coef(model_ordinal))
    # Confidence intervals
    confint(model_ordinal)       # CI on log-odds scale
    exp(confint(model_ordinal))  # CI on odds ratio scale

Predict and Evaluate
    # Predict probabilities for each class
    predict(model_ordinal, type = "probs")[1:5, ]
    # Predict class
    predict(model_ordinal, type = "class")[1:5]

Tutorial Questions
1. Which variable significantly affects wine quality?
2. Interpret the sign of each coefficient.
3. What are the odds of being rated “High” vs. “Low” with higher alcohol?
4. Predict class for alcohol = 11, acidity = 3.2.
    predict(model_ordinal, newdata = data.frame(alcohol = 11, acidity = 3.2),
            type = "class")

Full R script file for ordinal logistic regression
R Script: Ordinal Logistic Regression in R
    # ------------------------------------------------------------
    # ORDINAL LOGISTIC REGRESSION using polr() from MASS package
    # ------------------------------------------------------------
    # Load necessary package
    if (!require(MASS)) install.packages("MASS")
    library(MASS)

    # --------------------------
    # 1. Simulate Ordinal Data
    # --------------------------
    set.seed(123)
    n <- 100
    data_ordinal <- data.frame(
      quality = factor(
        sample(c("Low", "Medium", "High"), n, replace = TRUE,
               prob = c(0.3, 0.5, 0.2)),
        ordered = TRUE,
        levels = c("Low", "Medium", "High")
      ),
      alcohol = rnorm(n, mean = 10, sd = 1.5),
      acidity = rnorm(n, mean = 3.3, sd = 0.5)
    )

    # View first few rows
    head(data_ordinal)

    # --------------------------
    # 2. Fit Ordinal Logistic Regression
    # --------------------------
    # Fit proportional odds logistic regression model
    model_ordinal <- polr(quality ~ alcohol + acidity, data = data_ordinal,
                          Hess = TRUE)

    # Summary
    summary(model_ordinal)

    # --------------------------
    # 3. Coefficient Interpretation with p-values
    # --------------------------
    # Coefficient table with p-values
    ctable <- coef(summary(model_ordinal))
    p_vals <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
    ctable <- cbind(ctable, "p value" = round(p_vals, 4))
    ctable

    # --------------------------
    # 4. Odds Ratios and Confidence Intervals
    # --------------------------
    # Odds ratios
    exp(coef(model_ordinal))

    # Confidence intervals (on log-odds scale)
    confint(model_ordinal)

    # Odds ratio confidence intervals
    exp(confint(model_ordinal))

    # --------------------------
    # 5. Prediction and Probabilities
    # --------------------------
    # Predict class
    predict(model_ordinal, type = "class")[1:5]

    # Predict class probabilities
    predict(model_ordinal, type = "probs")[1:5, ]

    # Predict for new data
    new_data <- data.frame(alcohol = 11, acidity = 3.2)
    predict(model_ordinal, newdata = new_data, type = "class")
    predict(model_ordinal, newdata = new_data, type = "probs")

    # --------------------------
    # 6. Tutorial Questions (for practice)
    # --------------------------
    # Q1: Which variable significantly affects wine quality?
    #     → Check p-values in the coefficient table
    # Q2: Interpret direction and impact of alcohol/acidity
    # Q3: Compute and interpret odds ratios (already done)
    # Q4: Predict outcome for new case:
    #     alcohol = 11, acidity = 3.2 → already done above

D6. Exploratory Factor Analysis (EFA)

What is Exploratory Factor Analysis (EFA)?
• Purpose: Identify latent constructs from observed variables.
• Use case: Psychometrics, survey analysis, behavioral data, etc.
• Assumes: Interval/ratio data, adequate sample size, linear relationships

Key R Packages for EFA
• psych: core factor analysis functions
• GPArotation: rotation of factors
• nFactors: determining the number of factors

Example Dataset: bfi from psych package
    install.packages("psych")
    library(psych)
    data(bfi)
    head(bfi[, 1:10])  # We'll use the first 10 personality items

1. Data Preparation
Check for Missing Values
    bfi_clean <- na.omit(bfi[, 1:10])  # Use only first 10 items, remove NAs
Check Sampling Adequacy (KMO) and Bartlett’s Test
    KMO(bfi_clean)               # Should be > 0.6
    cortest.bartlett(bfi_clean)  # p < 0.05 indicates suitability

2. Determine Number of Factors
    fa.parallel(bfi_clean, fa = "fa")  # Scree plot and parallel analysis
Choose the number of factors based on the scree plot and eigenvalues > 1.

3. Perform Factor Analysis
Extract 3 factors (example), with rotation:
    efa_model <- fa(bfi_clean, nfactors = 3, rotate = "varimax", fm = "ml")
    print(efa_model)
• nfactors: number of latent factors
• rotate: use "varimax" (orthogonal) or "oblimin" (oblique)
• fm: factor extraction method: "ml" (maximum likelihood), "pa" (principal axis)

4. Visualize Factor Loadings
    fa.diagram(efa_model)  # Factor structure

5. Interpret Output
• Factor loadings: Correlations between items and factors
• Communalities (h2): Proportion of each item’s variance explained
• SS loadings: Sum of squared loadings per factor
• Rotation: Aids interpretability

Tutorial Questions
1. How many factors are suggested by parallel analysis?
2. Which items load most strongly on each factor?
3. Are there any cross-loading items?
4. Compare varimax vs oblimin rotation.

Full R script file for Exploratory Factor Analysis (EFA)
R Script: Exploratory Factor Analysis (EFA) in R
    # ------------------------------------------------------------
    # EXPLORATORY FACTOR ANALYSIS (EFA)
    # Dataset: bfi (Big Five Inventory - psych package)
    # ------------------------------------------------------------
    # Install required packages
    if (!require(psych)) install.packages("psych")
    if (!require(GPArotation)) install.packages("GPArotation")
    library(psych)
    library(GPArotation)

    # --------------------------
    # 1. Load and Prepare Data
    # --------------------------
    # Load bfi dataset from psych package
    data(bfi)

    # Use first 10 items (6-point response scale personality questions)
    bfi_data <- bfi[, 1:10]

    # Remove missing values
    bfi_clean <- na.omit(bfi_data)

    # --------------------------
    # 2. Check Assumptions
    # --------------------------
    # KMO Measure of Sampling Adequacy (should be > 0.6)
    KMO(bfi_clean)

    # Bartlett’s Test of Sphericity (p < 0.05 indicates suitability)
    cortest.bartlett(bfi_clean)

    # --------------------------
    # 3. Determine Number of Factors
    # --------------------------
    # Scree plot and parallel analysis
    fa.parallel(bfi_clean, fa = "fa", n.iter = 100, main = "Parallel Analysis")

    # --------------------------
    # 4. Factor Extraction
    # --------------------------
    # Example: Extract 3 factors using maximum likelihood and varimax rotation
    efa_model <- fa(bfi_clean, nfactors = 3, rotate = "varimax", fm = "ml")

    # Print full results
    print(efa_model)

    # View only factor loadings
    efa_model$loadings

    # --------------------------
    # 5. Visualize Factor Structure
    # --------------------------
    # Diagram of factor loading structure
    fa.diagram(efa_model)

    # --------------------------
    # 6. Optional: Try Oblique Rotation
    # --------------------------
    efa_oblique <- fa(bfi_clean, nfactors = 3, rotate = "oblimin", fm = "ml")
    print(efa_oblique)

    # --------------------------
    # 7. Interpretation Guide
    # --------------------------
    # - Factor loadings > 0.4 considered meaningful
    # - Look for items strongly associated with each factor
    # - Avoid cross-loading items (> 0.3 on multiple factors)
    # - Communalities (h2): proportion of variance explained per item
    # - SS loadings: strength of each factor (total variance explained)

D7. Confirmatory Factor Analysis (CFA)

What is Confirmatory Factor Analysis (CFA)?
• Used to test a pre-specified factor structure (based on theory or prior EFA)
• Unlike EFA, CFA requires:
   o The number of factors
   o Which items load on which factors

Packages Used for CFA in R
• lavaan: powerful package for specifying and fitting structural models
• semPlot: for visualizing factor structure

Example CFA Model (based on bfi data from EFA)
Let’s suppose we derived a 3-factor model from EFA:
• Factor1: A1, A2, A3
• Factor2: C1, C2, C3
• Factor3: N1, N2, N3

R Script: Confirmatory Factor Analysis (CFA)
    # ------------------------------------------------------------
    # CONFIRMATORY FACTOR ANALYSIS (CFA)
    # ------------------------------------------------------------
    # Install necessary packages
    if (!require(lavaan)) install.packages("lavaan")
    if (!require(semPlot)) install.packages("semPlot")
    library(lavaan)
    library(semPlot)

    # Load dataset
    data(bfi, package = "psych")

    # Use subset of 9 variables from 3 theoretical factors (example)
    cfa_data <- bfi[, c("A1", "A2", "A3", "C1", "C2", "C3", "N1", "N2", "N3")]
    cfa_data <- na.omit(cfa_data)

    # --------------------------
    # 1. Specify CFA Model
    # --------------------------
    cfa_model <- '
      Agreeableness     =~ A1 + A2 + A3
      Conscientiousness =~ C1 + C2 + C3
      Neuroticism       =~ N1 + N2 + N3
    '

    # --------------------------
    # 2. Fit CFA Model
    # --------------------------
    fit_cfa <- cfa(cfa_model, data = cfa_data, std.lv = TRUE)
    summary(fit_cfa, fit.measures = TRUE, standardized = TRUE)

    # --------------------------
    # 3. Fit Indices to Report
    # --------------------------
    # Common indices
    # - CFI > 0.90 (good), > 0.95 (excellent)
    # - RMSEA < 0.08 (acceptable), < 0.05 (good)
    # - SRMR < 0.08 (good)
    # - χ²/df < 3 desirable
    fitMeasures(fit_cfa, c("chisq", "df", "pvalue", "cfi", "tli", "rmsea", "srmr"))

    # --------------------------
    # 4. Visualize CFA Model
    # --------------------------
    semPaths(fit_cfa, whatLabels = "std", layout = "tree", edge.label.cex = 1.2)

Tutorial Questions
1. Is the CFA model a good fit (based on CFI, RMSEA, SRMR)?
2. Which items have the highest standardized loadings on their factor?
3. Try a model with cross-loadings. Does it improve the fit?
4. Compare CFA with EFA results. Are they consistent?

D8. Structural Equation Modeling (SEM)

SEM is a generalization of multiple regression and factor analysis that lets you:
• Model relationships between latent variables
• Include both measurement and structural components
• Estimate direct, indirect, and total effects

Key Package: lavaan
lavaan allows you to specify CFA + path models easily.

Example SEM Scenario (based on bfi data):
Suppose we hypothesize:
• Conscientiousness (latent) is measured by: C1, C2, C3
• Neuroticism (latent) is measured by: N1, N2, N3
• JobPerformance is influenced by both Conscientiousness and Neuroticism
The bfi data contain no job performance measure, so we'll simulate JobPerformance as an observed outcome.

📄 R Script: Structural Equation Modeling (SEM)

# ------------------------------------------------------------
# STRUCTURAL EQUATION MODELING (SEM) with lavaan
# ------------------------------------------------------------

# Install packages
if (!require(lavaan)) install.packages("lavaan")
if (!require(semPlot)) install.packages("semPlot")
library(lavaan)
library(semPlot)

# Load data
data(bfi, package = "psych")

# Use items from two latent factors: Conscientiousness (C) and Neuroticism (N)
sem_data <- bfi[, c("C1", "C2", "C3", "N1", "N2", "N3")]
sem_data <- na.omit(sem_data)

# Simulate a dependent variable: JobPerformance
# scale() returns a one-column matrix; as.numeric() stores it as a plain numeric column
set.seed(123)
sem_data$JobPerformance <- as.numeric(scale(0.5 * sem_data$C1 - 0.3 * sem_data$N1 +
                                            rnorm(nrow(sem_data), sd = 0.5)))

# --------------------------
# 1. Specify SEM Model
# --------------------------
sem_model <- '
  # Measurement models
  Conscientiousness =~ C1 + C2 + C3
  Neuroticism       =~ N1 + N2 + N3

  # Structural model
  JobPerformance ~ Conscientiousness + Neuroticism
'

# --------------------------
# 2. Fit SEM Model
# --------------------------
fit_sem <- sem(sem_model, data = sem_data, std.lv = TRUE)

# --------------------------
# 3. Model Summary and Fit Indices
# --------------------------
summary(fit_sem, fit.measures = TRUE, standardized = TRUE)

# Key fit indices
fitMeasures(fit_sem, c("chisq", "df", "pvalue", "cfi", "tli", "rmsea", "srmr"))

# --------------------------
# 4. Visualize SEM Model
# --------------------------
semPaths(fit_sem, what = "std", layout = "tree", edge.label.cex = 1.2,
         sizeMan = 6, sizeLat = 8)

Tutorial Questions
1. Do Conscientiousness and Neuroticism significantly predict JobPerformance?
2. Is the model a good fit based on CFI, RMSEA, and SRMR?
3. What are the standardized effects of each latent variable?
4. sem() already estimates the covariance between the two exogenous latent variables by default. Try fixing it to zero (Conscientiousness ~~ 0*Neuroticism). Does fit get worse?

Modify this model to include mediation or moderation

Mediation and moderation are two fundamental concepts in structural modeling.

MEDIATION vs MODERATION

Concept      Meaning
Mediation    An intermediate variable transmits the effect from predictor to outcome.
Moderation   A third variable changes the strength or direction of a predictor-outcome relationship.

Scenario Using bfi Dataset
Let's simulate this case:

Variables:
• Conscientiousness (C) → predictor (latent)
• Stress (S) → mediator (observed or latent)
• JobPerformance (J) → outcome (observed)
• Neuroticism (N) → moderator (latent)

1. Mediation Model
Path: Conscientiousness → Stress → JobPerformance
Conscientiousness also has a direct effect on JobPerformance.
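Before fitting the latent-variable version, it can help to see how a mediation path splits into direct, indirect, and total effects. The following is a minimal sketch in base R using ordinary regression on simulated data. The variable names x, m, y and all coefficients are hypothetical and are not part of the bfi scenario.

```r
# Illustrative simulation: x = predictor, m = mediator, y = outcome (all hypothetical)
set.seed(42)
n <- 500
x <- rnorm(n)
m <- 0.5 * x + rnorm(n)             # path a: x -> m
y <- 0.4 * m + 0.3 * x + rnorm(n)   # path b: m -> y, path c': x -> y

fit_m     <- lm(m ~ x)       # estimates a
fit_y     <- lm(y ~ m + x)   # estimates b and c'
fit_total <- lm(y ~ x)       # estimates the total effect c

a       <- unname(coef(fit_m)["x"])
b       <- unname(coef(fit_y)["m"])
c_prime <- unname(coef(fit_y)["x"])
c_total <- unname(coef(fit_total)["x"])

indirect <- a * b            # effect transmitted through the mediator
print(round(c(indirect = indirect, direct = c_prime, total = c_total), 3))

# With OLS on the same sample, total = direct + indirect holds exactly
print(all.equal(c_total, c_prime + indirect))
```

The same decomposition carries over to lavaan: the indirect effect is the product of two labeled path coefficients, not a product of variables.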
📄 R Script: Mediation SEM Model

# Load libraries
library(lavaan)
library(semPlot)
data(bfi, package = "psych")

# Prepare variables
sem_data <- na.omit(bfi[, c("C1", "C2", "C3", "N1", "N2", "N3")])
set.seed(123)

# Simulate stress (mediator) and job performance (outcome)
# as.numeric() converts the one-column matrix returned by scale() to a plain vector
sem_data$Stress <- as.numeric(scale(0.3 * sem_data$C1 + 0.4 * sem_data$N1 +
                                    rnorm(nrow(sem_data), sd = 0.5)))
sem_data$JobPerformance <- as.numeric(scale(0.5 * sem_data$C1 - 0.3 * sem_data$Stress +
                                            rnorm(nrow(sem_data), sd = 0.5)))

# Model specification with mediation
# Paths are labeled (a, b, cp) because := can only combine labeled parameters,
# not variable names
mediation_model <- '
  # Measurement models
  Conscientiousness =~ C1 + C2 + C3
  Neuroticism       =~ N1 + N2 + N3

  # Structural model (mediation)
  Stress         ~ a*Conscientiousness + Neuroticism
  JobPerformance ~ b*Stress + cp*Conscientiousness

  # Indirect and total effects of Conscientiousness
  ind_effect   := a*b
  total_effect := cp + a*b
'

# Fit the model
fit_mediation <- sem(mediation_model, data = sem_data, std.lv = TRUE)

# Summary
summary(fit_mediation, standardized = TRUE, fit.measures = TRUE)

# Visualize
semPaths(fit_mediation, what = "std", layout = "tree", edge.label.cex = 1.2,
         sizeMan = 6, sizeLat = 8)

2. Moderation Model
Path: Neuroticism moderates the effect of Conscientiousness on JobPerformance
This requires creating an interaction term.
Note: lavaan does not estimate latent × latent interactions directly. They need additional tooling, for example product indicators built with semTools (XWITH, which is sometimes mentioned in this context, is Mplus syntax rather than an R function). A simpler alternative is to model moderation with observed variables (or factor scores).
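When the interaction term is built from observed scores, a common choice is to mean-center the two predictors before multiplying them. The sketch below uses simulated, hypothetical scale means (c_avg, n_avg) rather than the bfi data. It suggests that centering mainly reduces the overlap between the product term and its components, while the interaction coefficient itself is unchanged.

```r
# Hypothetical scale means on a roughly 1-6 metric (not the bfi data)
set.seed(1)
n <- 1000
c_avg <- rnorm(n, mean = 4, sd = 1)
n_avg <- rnorm(n, mean = 3, sd = 1)
y     <- 0.5 * c_avg - 0.2 * n_avg - 0.3 * c_avg * n_avg + rnorm(n)

# Mean-centered versions of the predictors
c_ctr <- c_avg - mean(c_avg)
n_ctr <- n_avg - mean(n_avg)

# Correlation between a predictor and the product term, raw vs centered
cor_raw <- cor(c_avg, c_avg * n_avg)
cor_ctr <- cor(c_ctr, c_ctr * n_ctr)
print(round(c(raw = cor_raw, centered = cor_ctr), 3))

# The interaction coefficient is identical under both parameterizations
fit_raw <- lm(y ~ c_avg * n_avg)
fit_ctr <- lm(y ~ c_ctr * n_ctr)
int_raw <- unname(coef(fit_raw)["c_avg:n_avg"])
int_ctr <- unname(coef(fit_ctr)["c_ctr:n_ctr"])
print(all.equal(int_raw, int_ctr))
```

What changes with centering is the meaning of the lower-order coefficients: after centering, each one is the effect of that predictor at the mean of the other.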
📄 Moderation with Observed Predictors

# Use observed proxies (item means) for the two constructs
sem_data$C_avg <- rowMeans(sem_data[, c("C1", "C2", "C3")])
sem_data$N_avg <- rowMeans(sem_data[, c("N1", "N2", "N3")])

# Create interaction term (as.numeric() drops the matrix attributes from scale())
sem_data$CxN <- as.numeric(scale(sem_data$C_avg * sem_data$N_avg))

# Model with moderation (JobPerformance ~ C + N + C*N)
mod_model <- '
  JobPerformance ~ C_avg + N_avg + CxN
'
fit_mod <- sem(mod_model, data = sem_data, fixed.x = FALSE)

# This regression is saturated (df = 0), so the fit indices are not informative;
# focus on the CxN coefficient
summary(fit_mod, standardized = TRUE, fit.measures = TRUE)

Interpretation Tips
• Mediation: Check whether the indirect path from Conscientiousness through Stress to JobPerformance is significant.
• Moderation: If CxN is significant, the effect of Conscientiousness on performance depends on the level of Neuroticism.
• Report indirect, direct, and total effects for the mediation analysis by labeling the paths and defining the effects with the := operator; parameterEstimates() then lists them with standard errors.

Complete R script that includes both mediation and moderation

The script below combines the mediation and moderation models using lavaan, with simulated variables added to the bfi dataset.

📄 Full R Script: SEM with Mediation and Moderation

# ------------------------------------------------------------
# STRUCTURAL EQUATION MODELING (SEM): Mediation & Moderation
# Dataset: bfi (from psych package)
# ------------------------------------------------------------

# Install required packages if not already installed
if (!require(lavaan)) install.packages("lavaan")
if (!require(semPlot)) install.packages("semPlot")
library(lavaan)
library(semPlot)

# --------------------------
# 1. Load and Prepare Data
# --------------------------
# Load dataset
data(bfi, package = "psych")

# Select relevant variables (Conscientiousness: C1-C3, Neuroticism: N1-N3)
sem_data <- na.omit(bfi[, c("C1", "C2", "C3", "N1", "N2", "N3")])

# Simulate mediator (Stress) and outcome (JobPerformance)
# as.numeric() converts the one-column matrix returned by scale() to a plain vector
set.seed(123)
sem_data$Stress <- as.numeric(scale(0.3 * sem_data$C1 + 0.4 * sem_data$N1 +
                                    rnorm(nrow(sem_data), sd = 0.5)))
sem_data$JobPerformance <- as.numeric(scale(0.5 * sem_data$C1 - 0.3 * sem_data$Stress +
                                            rnorm(nrow(sem_data), sd = 0.5)))

# --------------------------
# 2. Mediation Model (Latent)
# --------------------------
mediation_model <- '
  # Measurement model
  Conscientiousness =~ C1 + C2 + C3
  Neuroticism       =~ N1 + N2 + N3

  # Structural model with mediation (paths labeled a, b, cp)
  Stress         ~ a*Conscientiousness + Neuroticism
  JobPerformance ~ b*Stress + cp*Conscientiousness

  # Indirect effect, defined from the labeled paths
  ind_effect := a*b
'

# Fit the mediation SEM
fit_mediation <- sem(mediation_model, data = sem_data, std.lv = TRUE)

# Show summary with fit indices and standardized coefficients
summary(fit_mediation, standardized = TRUE, fit.measures = TRUE)

# Visualize mediation model
semPaths(fit_mediation, what = "std", layout = "tree", edge.label.cex = 1.2,
         sizeMan = 6, sizeLat = 8)

# --------------------------
# 3. Moderation Model (Observed Variables)
# --------------------------
# Compute mean scores for C and N (as observed variables)
sem_data$C_avg <- rowMeans(sem_data[, c("C1", "C2", "C3")])
sem_data$N_avg <- rowMeans(sem_data[, c("N1", "N2", "N3")])

# Create interaction term (product of predictor and moderator)
sem_data$CxN <- as.numeric(scale(sem_data$C_avg * sem_data$N_avg))

# SEM with moderation effect (observed predictors)
mod_model <- '
  JobPerformance ~ C_avg + N_avg + CxN
'

# Fit the moderation model
fit_mod <- sem(mod_model, data = sem_data, fixed.x = FALSE)

# Show summary (the model is saturated, so read the coefficients, not the fit indices)
summary(fit_mod, standardized = TRUE, fit.measures = TRUE)

# --------------------------
# 4. Interpretation Guide
# --------------------------
# In Mediation:
# - Stress mediates the effect of Conscientiousness on JobPerformance
# - A significant indirect effect (ind_effect = a*b) is evidence of mediation
# In Moderation:
# - The CxN coefficient tells whether N moderates C's effect on JobPerformance
# - A significant CxN coefficient is evidence of moderation
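A significant interaction is usually followed up by looking at the predictor's slope at a few levels of the moderator (simple slopes). The sketch below is one way to do this in base R, assuming simulated and already centered scores. The names C, N, y and the helper simple_slope() are hypothetical and are not taken from the scripts above.

```r
# Hypothetical centered scores: C = predictor, N = moderator, y = outcome
set.seed(7)
n <- 800
C <- rnorm(n)
N <- rnorm(n)
y <- 0.5 * C - 0.2 * N - 0.3 * C * N + rnorm(n)

fit <- lm(y ~ C * N)   # expands to C + N + C:N
b   <- coef(fit)

# Slope of C on y at a given value of the moderator: b_C + b_CxN * N
simple_slope <- function(n_value) unname(b["C"] + b["C:N"] * n_value)

# Probe at the mean of N and one SD below and above it
n_levels <- c(low = mean(N) - sd(N), mean = mean(N), high = mean(N) + sd(N))
slopes   <- sapply(n_levels, simple_slope)
print(round(slopes, 3))
```

Because the simulated interaction is negative, the slope of C shrinks as N increases. The same arithmetic applies to the C_avg, N_avg, and CxN coefficients from a lavaan fit, provided the interaction term is on the same scale as the product of the two predictors.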