Webinar PowerPoint Slides - Center on Response to Intervention

Iowa’s Application of Rubrics to
Evaluate Screening and Progress
Tools
John L. Hosp, PhD
University of Iowa
Overview of this Webinar
• Share rubrics for evaluating screening and progress monitoring tools
• Describe the process the Iowa Department of Education used to apply the rubrics
Purpose of the Review
• Survey universal screening and progress monitoring tools currently being used by LEAs in Iowa
• Review these tools for technical adequacy
• Incorporate one tool into the new state data system
• Provide access to tools for all LEAs in the state
Collaborative Effort
The National Center on
Response to Intervention
Structure of the Review Process
• Core Group: IDE staff responsible for administration and coordination of the effort
• Vetting Group: Other IDE staff as well as stakeholders from LEAs, AEAs, and IHEs from across the state
• Work Group: IDE and AEA staff who conducted the actual reviews
Overview of the Review Process
• The work group was divided into three groups:
▫ Group A – Key elements of tools: name, what it measures, grades it is used with, how it is administered, cost, time to administer
▫ Group B – Technical features: reliability, validity, classification accuracy, relevance of the criterion measure
▫ Group C – Application features: alignment with the Iowa CORE, training time, computer system feasibility, turnaround time for data, sample, disaggregated data
• Within each group, members worked in pairs
Overview of the Review Process
• Each pair:
▫ had a copy of the materials needed to conduct the
review
▫ reviewed and scored their parts together and then
swapped with the other pair in their group
• Pairs within each group met only if there were
discrepancies in scoring
▫ A lead person from one of the other groups
participated to mediate reconciliation
• This allowed each tool to be reviewed by every work
group member
Overview of the Review Process
• All reviews will be completed and brought to a
full work group meeting
• Results will be compiled and shared
• Final determinations across groups for each tool
will be shared with the vetting group two weeks
later
• The vetting group will have one month to review
the information and provide feedback to the
work group
Structure and Rationale of Rubrics
• Separate rubrics for universal screening and
progress monitoring
▫ Many tools reviewed for both
▫ Different considerations
• Common header and descriptive information
• Different criteria for each group (A, B, C)
Universal Screening Rubric
Header on cover page
Iowa Department of Education
Universal Screening Rubric for Reading (Revised 10/24/11)
What is a Universal Screening Tool in Reading:
It is a tool that is administered at school with ALL students to identify which students are at risk for reading failure on an outcome measure. It
is NOT a placement screener and would not be used with just one group of students (e.g., a language screening test).
Why use a Universal Screening Tool:
It tells you which students are at-risk for not performing at the proficient level on an end of year outcome measure. These students need
something more and/or different to increase their chances of becoming a proficient reader.
What feature is most critical:
Classification Accuracy because it provides a demonstration of how well a tool predicts who may and may not need something more. It is
critical that Universal Screening Tools identify the correct students with the greatest degree of accuracy so that resources are allocated
appropriately and students who need additional assistance get it.
Group A

Name of Screening Tool:
Skill/Area Assessed with Screener:
Grades: (circle all that apply) K 1 2 3 4 5 6 Above 6
How Screener Administered: (circle one) Group or Individual
Information Relied on to make determinations: (circle all that apply, minimum of two)
 Manual from publisher
 NCRtI Tool Chart
 Buros/Mental Measurement Yearbook
 On-Line publisher Info.
 Outside Resource other than Publisher or Researcher of Tool

Criteria: Cost (minus administrative fees like printing)
Justification: Tools need to be economically viable, meaning the cost would be considered "reasonable" for the state or a district to use. Funds that are currently available can be used and can be sustained. One-time funding to purchase something would not be considered sustainable.
Score 3: Free
Score 2: $.01 to $1.00 per student
Score 1: $1.01 to $2.00 per student
Score 0: $2.01 to $2.99 per student
Kicked out if: $3.00 or more per student

Criteria: Student time spent engaged with tool
Justification: The amount of student time required to obtain the data. This does not include set-up and scoring time.
Score 3: ≤ 5 minutes per student
Score 2: 6 to 10 minutes per student
Score 1: 11 to 15 minutes per student
Kicked out if: > 15 minutes per student
Group B

Criteria: Criterion Measure used for Classification Accuracy (Sheet for Judging Criterion Measure)
Justification: The measure that is being used as a comparison must be determined to be appropriate as the criterion. In order to make this determination, several features of the criterion measure must be examined.
Score 3: 15-12 points on criterion measure form
Score 2: 11-8 points on criterion measure form
Score 1: 7-4 points on criterion measure form
Score 0: 3-0 points on criterion measure form
Kicked out if: Same test but uses a different subtest or composite, OR same test given at a different time

Criteria: Classification Accuracy (Sheet for Judging Classification Accuracy for Screening Tool)
Justification: Tools need to demonstrate they can accurately determine which students are in need of assistance based on current performance and predicted performance on a meaningful outcome measure. This is evaluated with Area Under the Curve (AUC), Specificity, and Sensitivity.
Score 3: 9-7 points on classification accuracy form
Score 2: 6-4 points on classification accuracy form
Score 1: 3-1 points on classification accuracy form
Score 0: 0 points on classification accuracy form
Kicked out if: No data provided

Criteria: Criterion Measure used for Universal Screening Tool (Sheet for Judging Criterion Measure)
Justification: The measure that is being used as a comparison must be determined to be appropriate as the criterion. In order to make this determination, several features of the criterion measure must be examined.
Score 3: 15-12 points on criterion measure form
Score 2: 11-8 points on criterion measure form
Score 1: 7-4 points on criterion measure form
Score 0: 3-0 points on criterion measure form
Kicked out if: Same test but uses a different subtest or composite, OR same test given at a different time
Judging Criterion Measure
Additional Sheet for Judging the External Criterion Measure (Revised 10/24/11)
Name of Criterion Measure: Gates
How Criterion Administered: (circle one) Group or Individual
Used for: (circle all that apply)
 Screening: Classification Accuracy
 Screening: Criterion Validity
 Progress Monitoring: Criterion Validity
Information Relied on to make determinations: (circle all that apply)
 Manual from publisher
 NCRtI Tool Chart
 Buros/Mental Measurement Yearbook
 On-Line publisher info.
 Outside Resource other than Publisher or Researcher of Measure
1. An appropriate Criterion Measure is:
a) External to the screening or progress monitoring tool
b) A broad skill rather than a specific skill
c) Technically adequate for reliability
d) Technically adequate for validity
e) Validated on a broad sample that would also represent Iowa's population
Judging Criterion Measure (cont)

Feature: a) External to the Screening or Progress Monitoring Tool
Justification: The criterion measure should be separate from and not related to the screening or progress monitoring tool, meaning the outside measure should be by a different author/publisher and use a different sample (e.g., NWF can't predict ORF by the same publisher).
Higher score: External with no/little overlap; different author/publisher and standardization group
Lower score: External with some/a lot of overlap; same author/publisher and standardization group
Kicked Out: Internal (same test using a different subtest or composite, OR same test given at a different time)

Feature: b) A broad skill rather than a specific skill
Justification: We are interested in generalizing to a larger domain; therefore, the criterion measure should assess a broad area rather than splinter skills.
Score 3: Broad reading skills are measured (e.g., total reading score on ITBS)
Score 2: Broad reading skills are measured but in one area (e.g., comprehension made up of two subtests)
Score 1: Specific skills measured in two areas (e.g., comprehension and decoding)
Score 0: Specific skill measured in one area (e.g., PA, decoding, vocabulary, spelling)
Judging Criterion Measure (cont)

Feature: c) Technically adequate for Reliability
Justification: Student performance needs to be consistently measured. This is typically demonstrated with reliability across different items (alternate form, split half, coefficient alpha).
Score 3: Some form of reliability above .80
Score 2: Some form of reliability between .70 and .80
Score 1: Some form of reliability between .60 and .70
Score 0: All forms of reliability below .50

Feature: d) Technically adequate for Validity
Justification: The tool measures what it purports to measure. We focused on criterion-related validity to make this determination: the extent to which this criterion measure relates to another external measure that is determined to be good. (A sketch of computing such a coefficient follows this sheet.)
Score 3: Criterion ≥ .70
Score 2: Criterion .50-.69
Score 1: Criterion .30-.49
Score 0: Criterion .10-.29

Feature: e) A broad sample is used
Justification: The sample used in determining the technical adequacy of a tool should represent a broad audience. While a representative sample by grade is desirable, it is often not reported; therefore, taken as a whole, does the population used represent all students, or is it specific to a region or state?
Score 3: National sample
Score 2: Several states (3 or more) across more than one region
Score 1: States (3, 2, or 1) in one region
Score 0: Sample of convenience; does not represent a state
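A minimal sketch of how a criterion-related validity (or alternate-form reliability) coefficient like the ones judged above can be computed: it is simply the Pearson correlation between scores on the tool under review and scores on the external criterion measure. The scores below are made up for illustration.

```python
# A minimal sketch (made-up scores): a criterion-related validity coefficient is
# the Pearson correlation between scores on the tool being reviewed and scores
# on the external criterion measure.
import numpy as np

screening_scores = [14, 22, 9, 31, 18, 27, 12, 25, 20, 16]   # hypothetical screener scores
criterion_scores = [180, 205, 160, 240, 195, 220, 170, 215, 200, 185]  # hypothetical criterion scores

r = np.corrcoef(screening_scores, criterion_scores)[0, 1]
print(f"Criterion coefficient r = {r:.2f}")  # compare against the >= .70 cutoff above
```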
Judging Classification Accuracy
Additional Sheet for Judging Classification Accuracy for Screening Tool (Revised 10/24/11)
Assessment: (include name and grade)
Complete the Additional Sheet for Judging the Criterion Measure. If the criterion measure is not kicked out, complete the review for:
1) Area Under the Curve (AUC)
2) Specificity/Sensitivity
3) Lag time between when the assessments are given

1) Area Under the Curve (AUC)
Feature: Technical Adequacy is Demonstrated for Area Under the Curve
Justification: Area Under the Curve is one way to gauge how accurately a tool identifies students in need of assistance. It is derived from Receiver Operating Characteristic (ROC) curves and is presented as a number to 2 decimal places. One AUC is reported for each comparison: each grade level, each subgroup, each outcome tool, etc. (See the sketch below.)
Score 3: AUC ≥ .90
Score 2: AUC ≥ .80
Score 1: AUC ≥ .70
Score 0: AUC < .70
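A minimal sketch, with made-up numbers, of how an AUC value like those judged above can be computed, here using scikit-learn's roc_auc_score. The outcome labels and screening scores are hypothetical.

```python
# A minimal sketch (hypothetical numbers): computing AUC for a screening tool
# against a yes/no outcome (1 = at risk on the criterion outcome measure).
from sklearn.metrics import roc_auc_score

outcome_at_risk = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]           # criterion result per student
screening_score = [12, 9, 24, 8, 30, 27, 22, 35, 26, 31]   # lower screener score = more risk

# roc_auc_score expects higher scores to indicate the positive class, so the
# screening scores are negated because lower scores signal greater risk here.
auc = roc_auc_score(outcome_at_risk, [-s for s in screening_score])
print(f"AUC = {auc:.2f}")   # about 0.96 for these made-up numbers; report to 2 decimals
```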
Judging Classification Accuracy (cont)

2) Specificity or Sensitivity
Feature: Technical Adequacy is Demonstrated for Specificity or Sensitivity (see below)
Justification: Specificity/Sensitivity is another way to gauge how accurately a tool identifies students in need of assistance. Specificity and Sensitivity can give the same information depending on how the developer reported the comparisons. Sensitivity is often reported as accuracy of positive prediction (yes on both tools). Therefore, if the developer predicted positive/proficient performance, Sensitivity will express how well the screening tool identifies students who are proficient. If the developer predicted at-risk or non-proficient performance, that is what Sensitivity shows. It is important to verify what the developer is predicting so that consistent comparisons across tools can be made (see below).
Score 3: Sensitivity or Specificity ≥ .90
Score 2: Sensitivity or Specificity ≥ .85
Score 1: Sensitivity or Specificity ≥ .80
Score 0: Sensitivity or Specificity < .80

3) Lag time between when the assessments are given
Feature: Lag time (length of time between when the criterion and screening assessments are given)
Justification: The time between when the assessments are given should be short to eliminate effects associated with differential instruction.
Score 3: Under two weeks
Score 2: Between two weeks and 1 month
Score 1: Between 1 month and 6 months
Score 0: Over 6 months
Sensitivity and Specificity Considerations and Explanations
Explanations:
True means “in agreement between screening and outcome”. So true can be
negative to negative in terms of student performance (i.e., negative
meaning at-risk or nonproficient). This could be considered either positive
or negative prediction depending on which the developer intends the tool
to predict. As an example, a tool that has a primary purpose of identifying
students at-risk for future failure would probably use ‘true positives’ to
mean ‘those students who were accurately predicted to fail the outcome
test’.
Sensitivity = true positives / (true positives + false negatives)
Specificity = true negatives / (true negatives + false positives)
Key
+ = proficiency/mastery
- = nonproficiency/at-risk
0 = unknown
Shaded cells in the figures indicate which cells enter the Sensitivity and Specificity calculations.
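A minimal worked example, with made-up counts, of the formulas above. It also illustrates the point made in the considerations that follow: relabeling which result counts as "positive" swaps the roles of Sensitivity and Specificity.

```python
# A minimal worked example (made-up counts) of the Sensitivity and Specificity
# formulas, showing that flipping which outcome is treated as "positive" turns
# Sensitivity into Specificity and vice versa.

def sensitivity(tp, fn):
    # true positives / (true positives + false negatives)
    return tp / (tp + fn)

def specificity(tn, fp):
    # true negatives / (true negatives + false positives)
    return tn / (tn + fp)

# Hypothetical 2x2 counts when "positive" means proficient on both screener and outcome:
tp, fn, fp, tn = 60, 10, 15, 15

print("Positive = proficient: Sensitivity =", round(sensitivity(tp, fn), 2),
      " Specificity =", round(specificity(tn, fp), 2))

# Relabel so "positive" means at risk: the true positives become true negatives.
print("Positive = at risk:    Sensitivity =", round(sensitivity(tn, fp), 2),
      " Specificity =", round(specificity(tp, fn), 2))
```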
Consideration 1:
Determine whether developer is predicting a positive outcome (i.e., proficiency, success, mastery, at or
above a criterion or cut score) from a positive performance on the screening tool (i.e., at or above
benchmark or a criterion or cut score) or a negative outcome (i.e., failure, nonproficiency, below a
criterion or cut score) from negative performance on the screening tool (i.e., below a benchmark,
criterion, or cut score). Prediction is almost always positive to positive or negative to negative;
however in rare cases it might be positive to negative or negative to positive.
Figure 1a (2x2 table: Screening [+, -] by Outcome [+, -]): This is an example of positive to positive prediction. In this case, Sensitivity is positive performance on the screening tool predicting a positive outcome.
Figure 1b (the same 2x2 table with the +/- order reversed): This is the opposite prediction, negative to negative as the main focus. In this case, Sensitivity is negative (or at-risk) performance on the screening tool predicting a negative outcome.
Using the same information in these two tables, Sensitivity in the top table will equal Specificity in the second table. Because our purpose is to predict proficiency, in this instance we would use Specificity as the metric for judging.
Consideration 2:
Some developers may include a third category—unknown prediction. If this is the case, it is still
important to determine whether they are predicting a positive or negative outcome because Sensitivity
and Specificity are still calculated the same way.
Figure 2a (3x3 table: Screening [+, 0, -] by Outcome [+, 0, -]): This is an example of positive to positive prediction. In this case, Sensitivity is positive performance on the screening tool predicting a positive outcome. It represents a similar comparison to that in Figure 1a.
Figure 2b (the same 3x3 table with the order reversed): This is the opposite prediction, negative to negative as the main focus. In this case, Sensitivity is negative (or at-risk) performance on the screening tool predicting a negative outcome. It represents a similar comparison to that in Figure 1b.
Using the same information in these two tables, Sensitivity in the top table will equal Specificity in the second table. Because our purpose is to predict proficiency, in this instance we would use Specificity as the metric for judging.
Consideration 3:
In (hopefully) rare cases, the developer will set up the tables in opposite directions (reversing
screening and outcome or using a different direction for the positive/negative for one or both). This
illustrates why it is important to consider which column or row is positive and negative for both the
screening and outcome tools.
Example table (Screening across the columns [+, 0, -], Outcome down the rows [+, 0, -]): Notice that the Screening and Outcome tools are transposed. This makes Sensitivity and Specificity align within rows rather than columns.
Group B (cont)

Criteria: Criterion Validity for Universal Screening Tool (from technical manual)
Justification: Tools need to demonstrate that they actually measure what they purport to measure (i.e., validity). We focused on criterion-related validity because it is a determination of the relation between the screening tool and a meaningful outcome measure.
Score 3: Criterion ≥ .70
Score 2: Criterion .50-.69
Score 1: Criterion .30-.49
Score 0: Criterion .10-.29
Kicked out if: Criterion < .10 or no information provided

Criteria: Reliability for Universal Screening Tool
Justification: Tools need to demonstrate that the test scores are stable across items and/or forms. We focused on alternate form, split half, and coefficient alpha. (See the coefficient alpha sketch below.)
Score 3: Alternate form > .80; Split-half > .80; Coefficient alpha > .80
Score 2: Alternate form > .70; Split-half > .70; Coefficient alpha > .70
Score 1: Alternate form > .60; Split-half > .60; Coefficient alpha > .60
Score 0: Alternate form > .50; Split-half > .50; Coefficient alpha > .50
Kicked out if: There is no evidence of reliability

Criteria: Reliability across raters for Universal Screening Tool
Justification: How reliable scores are across raters is critical to the utility of the tool. If the tool is complicated to administer and score, it can be difficult to train people to use it, leading to different scores from person to person.
Score 3: Rater ≥ .90
Score 2: Rater .89-.85
Score 1: Rater .84-.80
Score 0: Rater ≤ .75
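A minimal sketch of one of the internal-consistency estimates named above, coefficient (Cronbach's) alpha. The item scores are simulated solely for illustration.

```python
# A minimal sketch (simulated item scores): Cronbach's coefficient alpha, one of
# the reliability estimates named in the rubric row above.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: rows = students, columns = items (0/1 or point scores)."""
    k = items.shape[1]
    sum_item_var = items.var(axis=0, ddof=1).sum()   # sum of the item variances
    total_var = items.sum(axis=1).var(ddof=1)        # variance of students' total scores
    return (k / (k - 1)) * (1 - sum_item_var / total_var)

# Simulate 200 students answering 10 related 0/1 items (purely illustrative numbers).
rng = np.random.default_rng(1)
ability = rng.normal(0, 1, 200)
items = np.column_stack(
    [(ability + rng.normal(0, 1, 200) > 0).astype(float) for _ in range(10)]
)

print(f"coefficient alpha = {cronbach_alpha(items):.2f}")  # compare with the > .80 cutoff
```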
Group C

Criteria: Alignment with Iowa CORE / Demonstrated Content Validity
Justification: It is critical that tools assess skills identified in the Iowa Core.
Literature & Informational:
 Key Ideas & Details
 Craft & Structure
 Integration of Knowledge & Ideas
 Range of Reading & Level of Text Complexity
Foundational (K-1):
 Print Concepts
 Phonological Awareness
 Phonics and Word Recognition
 Fluency
Foundational (2-5):
 Phonics and Word Recognition
 Fluency
Score 3: Has a direct alignment with the Iowa CORE (provide Broad Area and Specific Skill)
Score 2: Has alignment with the Iowa CORE (provide Broad Area)
Kicked out if: Has no alignment with the Iowa CORE
Group C (cont)

Criteria: Training Required
Justification: The amount of time needed for training is one consideration related to the utility of the tool. Tools that can be learned in a matter of hours and not days would be considered appropriate.
Score 3: Less than 5 hours of training (1 day)
Score 2: 5.5 to 10 hours of training (2 days)
Score 1: 10.5 to 15 hours of training (3 days)
Score 0: Over 15.5 hours of training (4+ days)

Criteria: Computer Application (tool and data system)
Justification: Many tools are given on a computer, which can be helpful if schools have computers, the computers are compatible with the software, and the data reporting can be separated from the tool itself. It is also a viable option if hard copies of the tools can be used when computers are not available.
Score 3: Computer or hard copy of tool available; data reporting is separate
Score 2: Computer application only; data reporting is separate
Score 1: Computer or hard copy of tool available; data reporting is part of the system
Score 0: Computer application only; data reporting is part of the system

Criteria: Data Administration and Data Scoring
Justification: The number of people needed to administer and score the data speaks to the efficiency of how data are collected and the reliability of scoring.
Score 3: Student takes the assessment on a computer and it is automatically scored by the computer at the end of the test
Score 2: Adult administers the assessment to the student and enters the student's responses (in real time) into a computer, and it is automatically scored by the computer at the end of the test
Score 1: Adult administers the assessment to the student and then calculates a score at the end of the test by conducting multiple steps
Score 0: Adult administers the assessment to the student and then calculates a score at the end of the test by conducting multiple steps AND referencing additional materials to get a score (having to look up information in additional tables)
Group C (cont)

Criteria: Data Retrieval (time for data to be usable)
Justification: The data need to be available in a timely manner in order to use the information to make decisions about students.
Score 3: Data can be used instantly
Score 2: Data can be used the same day
Score 1: Data can be used the next day
Score 0: Data are not available until 2-5 days later
Kicked out if: Takes 5+ days to use data (have to send data out to be scored)

Criteria: A broad sample is used
Justification: The sample used in determining the technical adequacy of a tool should represent a broad audience. While a representative sample by grade is desirable, it is often not reported; therefore, taken as a whole, does the population used represent all students, or is it specific to a region or state?
Score 3: National sample
Score 2: Several states (3 or more) across more than one region
Score 1: States (3, 2, or 1) in one region
Score 0: Sample of convenience; does not represent a state

Criteria: Disaggregated Data
Justification: Viewing disaggregated data by subgroups (i.e., race, English language learners, economic status, special education status) helps determine how the tool works with each group. This information is often not reported, but it should be considered if it is available.
Score 3: Race, economic status, and special education status are reported separately
Score 2: At least two disaggregated groups are listed
Score 1: One disaggregated group is listed
Score 0: No information on disaggregated groups
Progress Monitoring Rubric
Header on cover page
Iowa Department of Education
Progress Monitoring Rubric (Revised 10/24/11)
Why use Progress Monitoring Tools:
They quickly and efficiently provide an indication of a student’s response to instruction. Progress monitoring tools are sensitive to student growth
(i.e., skills) over time, allowing for more frequent changes in instruction. They allow teachers to better meet the needs of their students and
determine how best to allocate resources.
What feature is most critical:
Sufficient number of equivalent forms so that student skills can be measured over time. In order to determine if students are responding positively
to instruction, they need to be assessed frequently to evaluate their performance and the rate at which they are learning.
Descriptive info on each work group's section
Name of Progress Monitoring Tool:
Name of Criterion Measure:
Skill/Area Assessed with Progress Monitoring Tool:
Grades: (circle all that apply) K 1 2 3 4 5 6 Above 6
How Progress Monitoring Administered: (circle one) Group or Individual
How Criterion Administered: (circle one) Group or Individual
Information Relied on to make determinations: (circle all that apply, minimum of two)
 Manual from publisher
 NCRtI Tool Chart
 Buros/Mental Measurement Yearbook
 On-Line publisher Info.
 Outside Resource other than Publisher or Researcher of Tool
Group A

Criteria: Number of equivalent forms
Justification: Progress monitoring requires frequently assessing a student's performance and making determinations based on their growth (i.e., rate of progress). In order to assess students' learning frequently, progress monitoring is typically conducted once a week. Therefore, most progress monitoring tools have 20 to 30 alternate forms.
Score 3: 20 or more alternate forms
Score 2: 15-19 alternate forms
Score 1: 10-14 alternate forms
Score 0: 9 alternate forms
Kicked out if: Fewer than 9 alternate forms

Criteria: Cost (minus administrative fees like printing)
Justification: Tools need to be economically viable, meaning the cost would be considered "reasonable" for the state or a district to use. Funds that are currently available can be used and can be sustained. One-time funding to purchase something would not be considered sustainable.
Score 3: Free
Score 2: $.01 to $1.00 per student
Score 1: $1.01 to $2.00 per student
Score 0: $2.01 to $2.99 per student
Kicked out if: $3.00 or more per student

Criteria: Student time spent engaged with tool
Justification: The amount of student time required to obtain the data. This does not include set-up and scoring time. Tools need to be efficient to use. This is especially true of measures that teachers would be using on a more frequent basis.
Score 3: ≤ 5 minutes per student
Score 2: 6 to 10 minutes per student
Score 1: 11 to 15 minutes per student
Kicked out if: > 15 minutes per student
Group B

Criteria: Forms are of Equivalent Difficulty (need to provide detail of what these are when the review is published)
Justification: Alternate forms need to be of equivalent difficulty to be useful as a progress monitoring tool. Having many forms of equivalent difficulty allows a teacher to determine how the student is responding to instruction, because the change in score can be attributed to student skill rather than a change in the measure. Approaches include readability formulae (e.g., Flesch-Kincaid, Spache, Lexile, FORCAST), Euclidean distance, equipercentiles, and stratified item sampling. (A readability sketch follows this table.)
Score 3: Addressed equating in multiple ways
Score 2: Addressed equating in 1 way that is reasonable
Score 1/0: Addressed equating in a way that is NOT reasonable
Kicked out if: Does not provide any indication of equating forms

Criteria: Judgment of Criterion Measure (see separate sheet for judging criterion measure)
Justification: The measure that is being used as a comparison must be determined to be appropriate as the criterion. In order to make this determination, several features of the criterion measure must be examined.
Score 3: 15-12 points on criterion measure form
Score 2: 11-8 points on criterion measure form
Score 1: 7-4 points on criterion measure form
Score 0: 3-0 points on criterion measure form

Criteria: Technical Adequacy is Demonstrated for Validity of Performance Score (sometimes called Level)
Justification: Performance score is a student's performance at a given point in time rather than a measure of his/her performance over time (i.e., rate of progress). We focused on criterion-related validity to make this determination because it is a determination of the relation between the progress monitoring tool and a meaningful outcome.
Score 3: Criterion ≥ .70
Score 2: Criterion .50-.69
Score 1: Criterion .30-.49
Score 0: Criterion .10-.29
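A minimal sketch of one of the readability approaches named above, the standard Flesch-Kincaid grade-level formula, applied to two hypothetical passages. The syllable counter is a crude vowel-group heuristic used only for illustration; operational equating would use a published implementation.

```python
# A minimal sketch of the Flesch-Kincaid grade-level formula, one way to check
# that alternate passages are of similar difficulty. Syllables are estimated
# with a naive vowel-group heuristic (an assumption for illustration only).
import re

def count_syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

passage_a = "The cat sat on the mat. It was a warm day. The cat slept in the sun."
passage_b = "Migration patterns among shorebirds depend on seasonal temperature changes."

print(f"Passage A grade level: {flesch_kincaid_grade(passage_a):.1f}")
print(f"Passage B grade level: {flesch_kincaid_grade(passage_b):.1f}")
```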
Group B (cont)

Criteria: Technical Adequacy is Demonstrated for Reliability of Performance Score
Justification: Tools need to demonstrate that the test scores are stable across item samples/forms, raters, and time. Across item samples/forms: coefficient alpha, split half, KR-20, alternate forms. Across raters: interrater (i.e., interscorer, interobserver). Across time: test-retest.
Score 3: Item samples/forms ≥ .80; Rater ≥ .90; Time ≥ .80
Score 2: Item samples/forms .79-.70; Rater .89-.85; Time .79-.70
Score 1: Item samples/forms .69-.60; Rater .84-.80; Time .69-.60
Score 0: Item samples/forms ≤ .59; Rater ≤ .75; Time ≤ .59
Kicked out if: Fewer than 2 of the 3 areas are reported (must report 2 of 3), OR there is a score of 0 in 2 or more areas. (No tool would be kicked out due to lack of any one area.)

Criteria: Technical Adequacy is Demonstrated for Reliability of Slope
Justification: The reliability of the slope tells us how well the slope represents a student's rate of improvement. Two criteria are used: the number of observations (that is, student data points needed to calculate the slope) and the coefficient (that is, reliability of the slope). The coefficient should be reported via HLM (also called LMM or MLM) results; if calculated via OLS, the coefficients are likely to be lower.*
Score 3: 10 or more observations/data points; Coefficient > .80
Score 2: 9-7 observations/data points; Coefficient > .70
Score 1: 6-4 observations/data points; Coefficient > .60
Score 0: 3 or fewer observations/data points; Coefficient < .59
Group B (cont)
* HLM=Hierarchical Linear Modeling
LMM=Linear Mixed Modeling
MLM=Multilevel Modeling
OLS=Ordinary Least Squares
HLM, LMM, and MLM are three different ways to describe a similar approach to analysis. Reliability
of the slope should be reported as a proportion of variance accounted for by the repeated
measurement over time. These methods take into account that the data points are actually related
to one another because they come from the same individual. OLS does not take this into account
and as such, would ascribe the extra variation to error in measurement rather than the relation
among data points.
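A minimal sketch, using simulated weekly scores, of the contrast described above: per-student OLS slopes treat a student's repeated scores as independent, whereas a mixed-effects (HLM-style) growth model, here fit with statsmodels' mixedlm, models the dependence among a student's data points and yields the variance components that HLM-based reliability-of-slope estimates are built from. The data, variable names, and sample sizes are assumptions for illustration.

```python
# A minimal sketch (simulated data): per-student OLS slopes versus a mixed-effects
# growth model that accounts for repeated measures nested within students.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Simulate 30 students monitored weekly for 10 weeks (all values are made up).
records = []
for student in range(30):
    true_slope = rng.normal(1.5, 0.5)              # hypothetical growth per week
    for week in range(10):
        score = 20 + true_slope * week + rng.normal(0, 4)
        records.append({"student": student, "week": week, "score": score})
df = pd.DataFrame(records)

# OLS: a separate slope for each student, ignoring the nesting of scores in students.
ols_slopes = df.groupby("student").apply(
    lambda g: np.polyfit(g["week"], g["score"], 1)[0]
)
print("Variance of per-student OLS slopes:", round(ols_slopes.var(), 3))

# Mixed model (HLM-style): random intercepts and slopes across students.
hlm = smf.mixedlm("score ~ week", df, groups=df["student"], re_formula="~week").fit()
print("Fixed effect (average growth per week):", round(hlm.fe_params["week"], 2))
print("Random-effects covariance (intercept/slope):")
print(hlm.cov_re)   # the slope variance component feeds HLM-based reliability of slope
```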
Group C

Criteria: Alignment with Iowa CORE / Demonstrated Content Validity
Justification: It is critical that tools assess skills identified in the Iowa Core.
Literature & Informational:
 Key Ideas & Details
 Craft & Structure
 Integration of Knowledge & Ideas
 Range of Reading & Level of Text Complexity
Foundational (K-1):
 Print Concepts
 Phonological Awareness
 Phonics and Word Recognition
 Fluency
Foundational (2-5):
 Phonics and Word Recognition
 Fluency
Score 3: Has a direct alignment with the Iowa CORE (provide Broad Area and Specific Skill)
Score 2: Has alignment with the Iowa CORE (provide Broad Area)
Kicked out if: Has no alignment with the Iowa CORE

Criteria: Training Required
Justification: The amount of time needed for training is one consideration related to the utility of the tool. Tools that can be learned in a matter of hours and not days would be considered appropriate.
Score 3: Less than 5 hours of training (1 day)
Score 2: 5.5 to 10 hours of training (2 days)
Score 1: 10.5 to 15 hours of training (3 days)
Score 0: Over 15.5 hours of training (4+ days)

Criteria: Computer Application (tool and data system)
Justification: Many tools are given on a computer, which can be helpful if schools have computers, the computers are compatible with the software, and the data reporting can be separated from the tool itself. It is also a viable option if hard copies of the tools can be used when computers are not available.
Score 3: Computer or hard copy of tool available; data reporting is separate
Score 2: Computer application only; data reporting is separate
Score 1: Computer or hard copy of tool available; data reporting is part of the system
Score 0: Computer application only; data reporting is part of the system
Group C (cont)

Criteria: Data Administration and Data Scoring
Justification: The number of people needed to administer and score the data speaks to the efficiency of how data are collected and the reliability of scoring.
Score 3: Student takes the assessment on a computer and it is automatically scored by the computer at the end of the test
Score 2: Adult administers the assessment to the student and enters the student's responses (in real time) into a computer, and it is automatically scored by the computer at the end of the test
Score 1: Adult administers the assessment to the student and then calculates a score at the end of the test by conducting multiple steps (adding together scores across many assessments, subtracting errors to get a total score)
Score 0: Adult administers the assessment to the student and then calculates a score at the end of the test by conducting multiple steps AND referencing additional materials to get a score (having to look up information in additional tables)

Criteria: Data Retrieval (time for data to be usable)
Justification: The data need to be available in a timely manner in order to use the information to make decisions about students.
Score 3: Data can be used instantly
Score 2: Data can be used the same day
Score 1: Data can be used the next day
Score 0: Data are not available until 2-5 days later
Kicked out if: Takes 5+ days to use data (have to send data out to be scored)
Group C (cont)

Criteria: A broad sample is used
Justification: The sample used in determining the technical adequacy of a tool should represent a broad audience. While a representative sample by grade is desirable, it is often not reported; therefore, taken as a whole, does the population used represent all students, or is it specific to a region or state?
Score 3: National sample
Score 2: Several states (3 or more) across more than one region
Score 1: States (3, 2, or 1) in one region
Score 0: Sample of convenience; does not represent a state

Criteria: Disaggregated Data
Justification: Viewing disaggregated data by subgroups (i.e., race, English language learners, economic status, special education status) helps determine how the tool works with each group. This information is often not reported, but it should be considered if it is available.
Score 3: Race, economic status, and special education status are reported separately
Score 2: At least two disaggregated groups are listed
Score 1: One disaggregated group is listed
Score 0: No information on disaggregated groups
Findings
• Many of the tools reported are not sufficient (or
appropriate) for universal screening or progress
monitoring
• Some tools are appropriate for both
• No tool (so far) is “perfect”
• There are alternatives from which to choose
Live Chat
• Thursday April 26, 2012
• 2:00-3:00 EDT
• Go to rti4success.org for more details