Statistical Validation for the Training Evaluation Instrument

Final Report

Prepared by: Matthew J. Grawitch, Ph.D.
Saint Louis University

Executive Summary

This report summarizes the data analytic results regarding the internal validity of the VA's revised training evaluation instrument. The process employed two separate analysis sets.

In the first set of analyses, a large multi-item Likert-scale questionnaire was used to assess participant reactions to training. The instrument included 10 dimensions of trainee satisfaction. Data from 155 educational activity participants within the VA system were then used to:

1. Identify the best set of items to measure each of the training evaluation dimensions; and
2. Identify the best possible model that would account for the relationships between the evaluation dimensions.

The results suggested that within the VA system, participant self-evaluations of learning and their intention to apply what they learned to their jobs (planned action) were very highly correlated. The other dimensions of evaluation (content, materials, delivery methods, instructor/facilitator, instructional activities, program time or length, and training environment) all helped to define a factor representing overall evaluations of the educational activity. Finally, it was found that overall evaluations contributed to participant evaluations of learning and planned action. Part I of this report focuses on these results.

Based on these results, several changes were made to the instrument, including:

- Eliminating the instructor/facilitator factor;
- Adding items that were not represented;
- Eliminating some items that did not meet the specific needs of the VA; and
- Refining some items to eliminate possible confusion.

A refined version of the instrument was then used to evaluate educational activities within the VA system. Data from 179 educational activity participants were then used to replicate the results obtained for the first version of the instrument.

Overall, the results confirmed the internal validity of the refined evaluation instrument. The factor structure of the refined instrument demonstrated convergence with the factor structure of the initial instrument. Again, overall evaluation was an important factor that resulted from the combination of the other dimensions of evaluation. As in the previous analysis set, overall evaluations contributed to participant evaluations of learning and planned action. Part II focuses specifically on these results.

In Part III, the results from the first analysis and the results from the second analysis are compared. In general, the models demonstrated substantial convergence. There were a few differences, all of which related to the delivery methods factor. However, some differences would be expected to occur and should not be interpreted as problematic.

Given the extraordinary similarity between the two sets of results, the VA may consider its instrument to be an internally valid tool for evaluating educational activities within the VA system. In practical terms, the instrument would be able to provide three specific types of data:

1. An average score for each factor, so that trainers can assess different aspects of their educational activities;
2. An overall evaluation score (the mean of all six factor scores that represent Overall Evaluations), so that educational activities can be evaluated easily and efficiently; and
3. A learning and planned action score to help trainers evaluate participant perceptions regarding knowledge and skill gain and its applicability to the job.
Part I: Initial SEM Results

The Likert data were examined using structural equation modeling (SEM) techniques. A total of 58 items comprised the initial questionnaire, allowing for the use of SEM to help identify the best set of items to measure each factor. The questionnaire was used to evaluate several face-to-face conferences and other programs that took place within the VA system. A total of 155 completed questionnaires were analyzed in the SEM analysis.

Using SEM techniques, measurement models were first developed for each of 10 factors: Content, Materials, Delivery Methods, Instructor/Facilitator, Instructional Activities, Program Time or Length, Training Environment, Overall Satisfaction, Learning, and Planned Action. Measurement models provide a way to determine how well individual items load onto a single factor; each represents a miniature version of a factor analysis. The purpose of examining measurement models first was to eliminate items that were very poor contributors to each of the factors.

Once measurement models had been created for each factor, the actual SEM analyses were performed. The SEM analyses integrated all of the individual measurement model analyses into one large model. The purpose was to "fit" the factors into a coherent model. An SEM analysis occurs in several stages:

(1) The analyst screens the data to make sure it is ready for the analysis.
(2) The analyst specifies a hypothesized model to test. This includes specifying the factors (the boxes and ovals) to be tested and the specific paths within the model (the arrows).
(3) The data are then tested against the hypothesized model to determine how well the data "fit" that model.
(4) If model fit is poor, alterations are made to the model using conceptual input (theory) and empirical input (modification diagnostics).
(5) Once a model with acceptable fit is identified, the analyst interprets the paths within the model and determines whether additional alterations are necessary to improve model fit or to eliminate factors within the model that are unacceptable (for example, factor loadings less than .70 would be considered poor, and a beta less than .60 would be considered unacceptable).

The SEM techniques were useful in this context for two primary reasons. First, SEM would help to determine how the 10 factors (represented by the dark blue ovals in Figure 1) fit together. Second, SEM would help to determine the "best" items (represented by the light blue boxes in Figure 1) from an empirical perspective for defining each of the factors.

After several attempts, it was noted that the Learning and Planned Action factors were very highly correlated (r = .99) and that they represented essentially the same factor. Therefore, a new factor, Learning & Planned Action, was developed as a means of integrating the two factors into one. Because of this aggregation, nine factors were then used for subsequent model testing: Content, Materials, Delivery Methods, Instructor/Facilitator, Instructional Activities, Program Time or Length, Training Environment, Overall Satisfaction, and Learning & Planned Action.
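As noted above, the Learning and Planned Action factors correlated at r = .99 and were merged. The sketch below shows a simple scale-level version of that check: correlate the mean scores of the two item sets and treat a correlation near 1.0 as evidence that the scales are redundant. This is an illustration only; the report's r = .99 was estimated at the factor level within the SEM, and the file name and item column names here are hypothetical.

```python
# Minimal sketch (not the original analysis code): checking whether two scales
# are empirically redundant by correlating their scale means. The CSV file and
# the column names (learn_1 ... plan_4) are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("training_evaluations.csv")

learning_items = ["learn_1", "learn_2", "learn_3"]        # hypothetical Learning items
planned_items = ["plan_1", "plan_2", "plan_3", "plan_4"]  # hypothetical Planned Action items

learning_score = df[learning_items].mean(axis=1)  # mean of the Learning items per respondent
planned_score = df[planned_items].mean(axis=1)    # mean of the Planned Action items per respondent

r = learning_score.corr(planned_score)  # Pearson correlation between the two scale scores
print(f"r(Learning, Planned Action) = {r:.2f}")
# A value near 1.0 (the report found r = .99 at the factor level) indicates the
# two scales carry essentially the same information, supporting the merge into
# a single Learning & Planned Action factor.
```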
A series of models was then evaluated. Most of these models resulted in poor fit, with the Root Mean Square Error of Approximation (RMSEA) consistently greater than .10 (.00 to .05 = excellent fit; .05 to .10 = acceptable fit; .10 or greater = poor fit). These failed tests, however, helped to identify important insights that led to a model with acceptable fit.

1. To obtain a model that fit the data well, items that did not load well onto their designated factor in the larger model were eliminated.
2. The content, materials, delivery methods, instructor/facilitator, instructional activities, time, and training environment factors all loaded onto one factor, called overall training evaluations (the green oval in Figure 1). The overall training evaluations factor was not directly measured by the training evaluation instrument; instead, it represented what is known as a "second-order" latent variable.
3. The overall evaluation latent factor eliminated the need for the overall satisfaction factor used in the instrument itself.

The elimination of the overall satisfaction factor from the instrument left a total of eight factors assessed by the instrument (i.e., content, materials, delivery methods, instructor/facilitator, instructional activities, program time, training environment, and learning & planned action) and one second-order latent factor (i.e., overall evaluations).

Based on these insights, a final model was developed for the instrument (Figure 1). This model identified a total of 20 items that best contributed to the model: 2 items apiece as acceptable indicators of the Content, Materials, Delivery Methods, Instructional Activities, Time, and Training Environment factors; 3 items as acceptable indicators of the Instructor/Facilitator factor; and 5 items as acceptable indicators of the Learning & Planned Action factor. Additionally, Content, Materials, Delivery Methods, Instructional Activities, Time, and Training Environment were all identified as acceptable indicators of the second-order factor, Overall Evaluations. In other words, the overall evaluations of participants were based on the combination of the six factors specified above. Finally, the model suggested that Overall Evaluations served as an indicator of Learning & Planned Action. In other words, overall evaluations of the training program were influenced by how much participants learned and the extent to which they planned to apply the knowledge and skills learned in training to their jobs.

Overall, the model presented in Figure 1 demonstrated poor fit based on the Chi-Squared analysis, χ2(156) = 288.60, p < .001, meaning that the specified model and the data were significantly different. However, because the degrees of freedom were so large (156), owing to the number of "potential" paths, the Chi-Squared test was not a good indicator of model fit. The Root Mean Square Error of Approximation (RMSEA), the Relative Fit Index (RFI), and the Standardized Root Mean Square Residual (SRMR) all suggested acceptable or good fit: RMSEA = .074 (.05 to .10 equals acceptable fit); RFI = .95 (.95 or greater suggests good fit); and SRMR = .044 (.05 or less suggests good fit). In this instance, these indices would be considered more appropriate measures of model fit.

Additionally, as can be seen in Table 1, the factors that served as indicators of overall evaluations demonstrated a moderate degree of empirical overlap. This more than likely explains why a second-order latent factor was needed to specify a model with acceptable fit. In other words, the factors vary in their degree of overlap, suggesting that some factors are more related than others. This leads to the conclusion that the factors are not completely independent; rather, it is the additive evaluation of all seven factors that results in participants' overall evaluations of training programs.
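The RMSEA values reported in this section can be recovered directly from the chi-squared statistic, its degrees of freedom, and the sample size using the standard formula RMSEA = sqrt(max(χ2 − df, 0) / (df × (N − 1))). The sketch below is illustrative only (the report's values came from the SEM software), but it reproduces RMSEA = .074 for the Part I model, where χ2(156) = 288.60 and N = 155.

```python
# Illustrative sketch: computing RMSEA from a model chi-square, its degrees of
# freedom, and the sample size. Not the original analysis code; it simply
# reproduces the fit value reported for the Part I model.
from math import sqrt


def rmsea(chi2: float, df: int, n: int) -> float:
    """RMSEA = sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    return sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


# Part I model: chi2(156) = 288.60 with N = 155 completed questionnaires
print(f"RMSEA = {rmsea(288.60, 156, 155):.3f}")  # -> 0.074, matching the report
```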
Based on the final item set (Table 2), the VA EES made six decisions in moving forward:

1. Rephrase item 16 so that it would be in a positive direction (all other items are phrased in a positive direction);
2. Eliminate the Instructor/Facilitator factor because it is already evaluated in a different instrument used by the VA;
3. Add a third item to the Content factor that was not identified as an acceptable indicator in this analysis (an item relating the content to participants' job-related needs);
4. Edit and refine some of the items on the Learning & Planned Action factor;
5. Collect additional data using only the refined instrument; and
6. Attempt to replicate the results of this analysis.

Part II describes the results from this replication.

Figure 1. Final Model from Initial Results. [Path diagram: the retained items load onto the Content, Materials, Delivery Methods, Instructor, Activities, Time, Training Environment, and Learning & Planned Action factors; the first seven factors serve as indicators of the second-order Overall Evaluation factor, which is linked to Learning & Planned Action.]

Table 1. Correlation Matrix for Indicators of Overall Evaluations

                             1     2     3     4     5     6
1. Content/Objectives
2. Materials                .80
3. Instructor/Facilitator   .80   .78
4. Activities               .73   .78   .80
5. Delivery Methods         .41   .52   .62   .58
6. Time                     .65   .71   .73   .73   .49
7. Training Environment     .49   .63   .65   .65   .44   .59

Table 2. Final Item Set from Part I

Content
1) The objectives clearly defined the content and/or skills to be addressed.
2) The content adequately addressed the stated objectives.

Materials
3) Material enhanced content delivery.
4) Materials were used effectively during the educational activity.

Delivery Methods
5) The use of visual aids enhanced content presentation.
6) The quality of the visual aids (e.g., layout, readability) enhanced learning.

Instructor
7) The faculty were able to effectively present content.
8) The faculty were knowledgeable about the topic.
9) The faculty engaged participants effectively.

Activities
10) Instructional activities kept my interest.
11) Instructional activities were effective in helping me to learn the content.

Time
12) The length of the educational activity was sufficient for me to understand the content.
13) The length of the educational activity offered learners adequate time to provide feedback.

Training Environment
14) The training environment was conducive to learning.
15) Logistics regarding the educational activity were clear.

Learning & Planned Action
16) Overall, I failed to gain new knowledge or skills as a result of my participation in this educational activity.
17) I have learned the content required to attain the objectives of the educational activity.
18) I am not motivated to use the knowledge or skills I learned in this educational activity.
19) Content from this educational activity will be useful in improving my job performance.
20) I learned useful ideas and tools for my own work responsibilities.
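The final Part I model can be written compactly in lavaan-style model syntax. The sketch below is a hypothetical illustration, not the original analysis: the report does not say which SEM program was used, so the open-source semopy package, the data file, and the item column names q1-q20 (keyed to the Table 2 numbering) are all assumptions. The direction of the structural path follows the Executive Summary's description of overall evaluations contributing to learning and planned action.

```python
# Hypothetical sketch of the Figure 1 model using the semopy package.
# Package choice, file name, and column names q1..q20 (Table 2 numbering) are
# assumptions of this illustration, not details taken from the report.
import pandas as pd
import semopy

MODEL_DESC = """
Content     =~ q1 + q2
Materials   =~ q3 + q4
Delivery    =~ q5 + q6
Instructor  =~ q7 + q8 + q9
Activities  =~ q10 + q11
Time        =~ q12 + q13
Environment =~ q14 + q15
LearnAct    =~ q16 + q17 + q18 + q19 + q20
OverallEval =~ Content + Materials + Delivery + Instructor + Activities + Time + Environment
LearnAct ~ OverallEval
"""

data = pd.read_csv("training_evaluations.csv")  # hypothetical item-level responses
model = semopy.Model(MODEL_DESC)
model.fit(data)

print(model.inspect())            # factor loadings and the structural path
print(semopy.calc_stats(model))   # chi-square, RMSEA, and related fit indices
```

The second-order Overall Evaluation factor is specified by listing the first-order factors on the right-hand side of the =~ operator, mirroring the structure described in Part I.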
Part II: SEM Results from Refined Instrument

The refined set of items (Table 3) was then used to develop a refined instrument to assess educational activity effectiveness. The responses of a total of 179 participants within the VA system were included in the SEM analysis. The participants had attended training within the VA system, mostly in the form of conferences. The model was specified in nearly the same way as before, except that the Instructor/Facilitator factor had been eliminated and some of the items had been deleted or altered (see Figure 2).

As in the previous set of analyses, the Chi-Squared analysis suggested poor fit, χ2(126) = 293.35, p < .001, meaning that the specified model and the data were significantly different. However, because the degrees of freedom were so large (126), owing to the number of "potential" paths, the Chi-Squared test was not a good indicator of model fit. The Root Mean Square Error of Approximation (RMSEA), the Relative Fit Index (RFI), and the Standardized Root Mean Square Residual (SRMR) all suggested acceptable or good fit: RMSEA = .086 (.05 to .10 equals acceptable fit); RFI = .95 (.95 or greater suggests good fit); and SRMR = .05 (.05 or less suggests good fit). Therefore, the model tested for the refined instrument demonstrated an acceptable fit with the data.

Additionally, Item 3, the new content item (which had not been identified as an acceptable indicator in the previous analysis), did demonstrate an acceptable contribution to the model (β = .80). On the other hand, Item 12, a training environment item, demonstrated a poor contribution to the model (β = .57), since a beta weight of less than .70 is considered poor in SEM analyses. Overall, though, the results successfully replicated the model tested in the previous analysis.

As can be seen in Table 4, the factors that served as indicators of overall evaluations demonstrated a moderate degree of empirical overlap, as they did in the previous analysis, suggesting again that the overall evaluation factor was relevant. In other words, the factors vary in their degree of overlap, suggesting that some factors are more related than others. Therefore, overall evaluations do not result from the addition of completely independent factors.

Part III details the similarities and differences between the two models and offers some conclusions and implications.

Table 3. Refined Item Set Used for Part II

Content
1) The objectives clearly defined the content and/or skills to be addressed.
2) The content adequately addressed the stated objectives.
3) The content was relevant to my job-related needs.

Materials
4) Material enhanced content delivery.
5) Materials were used effectively during the educational activity.

Delivery Methods
6) The use of visual aids enhanced content presentation.
7) The quality of the visual aids (e.g., layout, readability) enhanced learning.

Activities
8) Instructional activities kept my interest.
9) Instructional activities were effective in helping me to learn the content.

Time
10) The length of the educational activity was sufficient for me to understand the content.
11) The length of the educational activity offered learners adequate time to provide feedback.

Training Environment
12) The training environment was conducive to learning.
13) Logistics regarding the educational activity were clear.

Learning & Planned Action
14) I gained new knowledge or skills as a result of my participation in this educational activity.
15) I have learned the content required to attain the objectives of the educational activity.
16) I am not motivated to use the knowledge or skills I learned in this educational activity.
17) I plan to use the content from this educational activity in my job.
18) My work environment allows application of this content to the job.
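Using the refined item set above, the three types of scores described in the Executive Summary (factor scores, an overall evaluation score, and a learning and planned action score) could be computed as sketched below. The column names q1-q18 (keyed to the Table 3 numbering), the 1-5 response scale, and the reverse-scoring of the negatively worded Item 16 are assumptions of the illustration; the report does not prescribe these details.

```python
# Sketch of scoring the refined instrument (Table 3). Assumptions: responses
# sit in columns q1..q18 matching the Table 3 item numbers, a 1-5 Likert scale
# is used, and the negatively worded Item 16 is reverse-scored before
# averaging. These details are illustrative, not taken from the report.
import pandas as pd

FACTORS = {
    "Content":              ["q1", "q2", "q3"],
    "Materials":            ["q4", "q5"],
    "Delivery Methods":     ["q6", "q7"],
    "Activities":           ["q8", "q9"],
    "Time":                 ["q10", "q11"],
    "Training Environment": ["q12", "q13"],
}
LEARN_ACT_ITEMS = ["q14", "q15", "q16", "q17", "q18"]
SCALE_MAX = 5  # assumed 1-5 response scale


def score(responses: pd.DataFrame) -> pd.DataFrame:
    df = responses.copy()
    df["q16"] = (SCALE_MAX + 1) - df["q16"]  # reverse-score the negatively worded item

    scores = pd.DataFrame(index=df.index)
    for factor, items in FACTORS.items():
        scores[factor] = df[items].mean(axis=1)  # (1) average score for each factor

    # (2) overall evaluation score: mean of the six factor scores
    scores["Overall Evaluation"] = scores[list(FACTORS)].mean(axis=1)
    # (3) learning & planned action score
    scores["Learning & Planned Action"] = df[LEARN_ACT_ITEMS].mean(axis=1)
    return scores


# Example usage: score(pd.read_csv("refined_instrument_responses.csv"))
```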
Figure 2. Replication of Likert-Scale Model. [Path diagram: the Table 3 items load onto the Content, Materials, Delivery Methods, Activities, Time, Training Environment, and Learning & Planned Action factors; the first six factors serve as indicators of the second-order Overall Evaluation factor, which is linked to Learning & Planned Action.]

Table 4. Correlation Matrix for Indicators of Overall Evaluations

                             1     2     3     4     5
1. Content/Objectives
2. Materials                .79
3. Activities               .82   .81
4. Delivery Methods         .60   .78   .78
5. Time                     .69   .68   .66   .66
6. Training Environment     .62   .61   .59   .59   .52

Part III: Conclusions

In examining the models presented in Figures 1 and 2 and the correlation matrices presented in Tables 1 and 4, some specific comparisons can be made between the two models. Three tests were employed:

- A test of whether the overall models were statistically different in terms of their fit with the data;
- Tests of the major paths to determine whether the factor loadings were similar across the models; and
- Tests of the correlations to determine whether the interrelationships among the factors were consistent from the first analysis to the second analysis.

The results of these tests were used to form an overall picture of how well the second model replicated the results of the first model.

First, model fit statistics were compared. Although the RMSEA values differed for the first and second models (.074 for Model 1 as compared to .086 for Model 2), the two models did not differ significantly in fit, χ2(30) = 4.75, p = 1.00, suggesting that the two models were statistically equivalent in terms of their fit with the data. Therefore, evidence for consistency between the two models was found.

Second, beta weights from Models 1 and 2 were compared. Because the items that defined several of the factors differed across the two versions of the instrument, the item-level beta weights were not examined, as any conclusions drawn from them would be suspect. The paths from overall evaluations to each of the factors and from learning to overall evaluations were examined. Only the path from Overall Evaluations to the Delivery Methods factor was significantly different (Δβ = .27, p < .01). All other paths were found to be statistically the same. This difference may have resulted because the Instructor/Facilitator factor was excluded from the refined instrument. In other words, the interrelationships among the factors changed when one factor was eliminated, which may have produced the significantly different beta weight. However, given the overwhelming similarities between the beta weights from Models 1 and 2 (six of seven were statistically the same), evidence for consistency was obtained from the beta weights.

Finally, an analysis of the correlation matrices was conducted. The correlations in Tables 1 and 4 were compared (see Table 5). A significance level of .01 was selected because of the large number of correlations being compared. As can be seen in Table 5, although correlations between the two sets of data varied somewhat, only two of the bivariate correlations changed significantly from the first analysis to the second analysis, and both correlations again involved the Delivery Methods factor. In the second analysis, Delivery Methods was significantly more correlated with the Materials and Activities factors than it was in the first analysis. There are several possible reasons for this. However, given the large number of revisions made to the instrument from the first to the second administration (in terms of specific items and the elimination of the Instructor/Facilitator factor), explaining this difference with any certainty would be extraordinarily difficult. That difficulty aside, there was substantial consistency in the correlations when comparing the first analysis to the second analysis (13 of the 15 correlations were found to be statistically the same).
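The two quantitative comparisons described above can be illustrated numerically. The chi-squared difference test below reproduces the reported comparison of overall fit (293.35 − 288.60 = 4.75 on 156 − 126 = 30 degrees of freedom, p ≈ 1.00). For the correlation comparison, Fisher's r-to-z transformation is a standard way to compare correlations from two independent samples; the report does not name the exact procedure it used, so that choice, like the code itself, is an assumption of the illustration.

```python
# Illustrative sketch of the two Part III comparisons (not the original code).
from math import atanh, sqrt
from scipy import stats

# (1) Chi-square difference test for overall model fit:
#     Model 1: chi2(156) = 288.60;  Model 2: chi2(126) = 293.35
delta_chi2 = 293.35 - 288.60  # 4.75
delta_df = 156 - 126          # 30
p_fit = stats.chi2.sf(delta_chi2, delta_df)
print(f"delta chi2({delta_df}) = {delta_chi2:.2f}, p = {p_fit:.2f}")  # p ~ 1.00

# (2) Comparing a correlation across the two independent samples with Fisher's
#     r-to-z (an assumed, standard procedure). Example: Delivery Methods with
#     Materials, r = .52 (N = 155) in Part I vs. r = .78 (N = 179) in Part II.
def compare_correlations(r1, n1, r2, n2):
    z1, z2 = atanh(r1), atanh(r2)
    se = sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    return z, 2 * stats.norm.sf(abs(z))  # two-tailed p-value

z, p = compare_correlations(0.52, 155, 0.78, 179)
print(f"z = {z:.2f}, p = {p:.4f}")  # significant at the .01 level, as in Table 5
```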
Therefore, based on tests of the model in Figure 2, it can be concluded that:

1. The data collected using the revised instrument demonstrated a good fit with the specified model;
2. The Overall Evaluations of participants in educational activities result from the additive influence of their evaluations of Content, Materials, Delivery Methods, Activities, Program Time, and Training Environment; and
3. The amount of learning and planned action that occurs influences overall evaluations of the educational activity.

Finally, based on the comparisons between the initial and final SEM analyses, it can be concluded that:

1. The results were generally consistent from the first to the second analysis; and
2. Although there were a few discrepancies, the results of the second analysis provide validation for the refined instrument.

Table 5. Differences in Correlations between Tables 1 and 4

                             1     2     3     4     5
1. Content/Objectives
2. Materials               -.01
3. Activities               .09  -.03
4. Delivery Methods        -.19  -.26*  -.20*
5. Time                    -.04  -.03   .07  -.17
6. Training Environment    -.13   .02   .06  -.15   .07

Note: Positive values indicate that Table 1 correlations were higher; * significant at the .01 level.

Given these conclusions, the VA may consider its instrument to be a valid tool for evaluating educational activities within the VA system. In practical terms, the instrument would be able to provide three specific types of data:

- An average score for each factor, so that trainers can assess different aspects of their educational activities;
- An overall evaluation score (the mean of all six factor scores that represent Overall Evaluations), so that educational activities can be evaluated easily and efficiently; and
- A learning and planned action score to help trainers evaluate participant perceptions regarding knowledge and skill gain and its applicability to the job.