Treatment of dyslexia with the BrightStar method: statistical analysis of the effects of the treatment

BrightStar Treatment

BrightStar is a technology that stimulates transient neural circuits via a novel visual display in order to diminish reading difficulties (in struggling and reluctant readers) and alleviate the symptoms of dyslexics. The theoretical underpinnings of this technology lie squarely within a generalized magnocellular dysfunction framework and, in particular, in how dorsal and ventral neural networks can efficiently interact with each other to enable the phenomenon of “reading”. Additionally, BrightStar effectively allocates visual covert attention towards transient peripheral stimuli (novel information) through rapid pre-attentive discrimination and processing, mainly through dorsal magnocellular projections to the cerebellum, thus effectively facilitating eye movements’ proper temporal aspects (e.g. steady gaze, saccadic inhibition). Oculomotor control is now accepted by many mainstream researchers as being an influential factor in dyslexia.

BrightStar, a non-language based software-training program, is uniquely designed to work at the early low level of visual-sensory-motor neural processes and to take advantage of the brain's plasticity to promote sensory-motor automatism. By retraining these neural circuits, BrightStar helps struggling and/or reluctant readers as well as dyslexics to rapidly improve their literacy skills.

Program Training Protocol: Clients received two 20-30 minute BrightStar sessions per week and one weekly 45-minute remedial teaching session outside of school, over a period of 6 weeks. 209 clients aged 7 to 70 participated in our research.

The two basic research questions were:
1. What are the effects of the BrightStar treatment on the reading and ‘spelling’ capabilities of the clients?
2.
Is there a relationship between the effects of the BrightStar treatment and other factors, such as age, gender, TIQ, left- or right-handedness, and comorbidity with other behavioural problems?

Description of the dataset

During the years 2007 – 2010, 209 clients were treated with the BrightStar method, together with an extensive initial assessment and final assessment of their reading and ‘spelling’ capabilities. Using the results of the initial and final assessments, the observed changes were analysed and the statistical significance of the effects was determined in a statistical study performed by the Centre for Quantitative Methods (CQM), Eindhoven, The Netherlands.

The following specific tests are part of the initial assessment and final assessment:

PI Spelling Test
The PI spelling test is a dictation at word level (PI research, Swets & Zeitlinger Publishers). The test consists of 135 words divided into 9 blocks of 15 words each, of increasing difficulty. The first block is the easiest and the last the most challenging. The score is the number of words written correctly, translated into percentiles, didactical age equivalent scores and learning effectiveness measures.

Word Reading Fluency (DMT)
Drie-Minuten-Test [Three-Minute-Test] (Verhoeven, Cito group, Arnhem 1995). The DMT is a standardized Dutch test to measure fluency of reading single-syllable and multisyllabic words. The task is to read as many words as possible, correctly, from each card within 1 min. The score is the number of words read minus the number of words read incorrectly. The results of the specific tests are expressed as raw scores, CITO scores, didactical age equivalent scores and learning effectiveness measures.

Non-word decoding Test - Klepel
The Klepel reading test (Van den Bos, Lutje Spelberg, Scheepstra, & De Vries, 1994) provides a standardized measure of non-word decoding. For 2 minutes the client has to read as many words as possible from a list of non-words.
With each item the complexity of the words increases. The outcomes of the test are standardized scores in the range from 1 to 19.

The AVI (KPC group ‘s Hertogenbosch 1997) is a Dutch test to determine technical reading ability at text level. The test consists of eighteen cards. The cards correspond to nine levels of reading ability. Each card has its own error and speed limits. The scores of this test are translated into didactical age equivalent scores and learning effectiveness measures.

SBWL, Rapid Naming and Reading Words (K.P. van den Bos, University Groningen)
Rapid naming tests the ability to connect visual and verbal information by giving the appropriate names to common objects, colours, letters and digits. The vast majority of children and adults with reading disabilities have pronounced difficulties when asked to name the most familiar visual symbols and stimuli in the language. The client is shown different cards and is asked to name them all out loud. The examiner records the number of seconds the client needs to accomplish the task. Errors are ignored for scoring purposes. The raw scores of the specific parts are translated into standardised scores.

Another part of the SBWL is the Word Reading Fluency test (EMTb). The Een-Minuut-Test [One Minute Test] (Brus & Voeten, 1973) is a standardized Dutch test measuring general word reading fluency with 116 words of increasing difficulty. The test participant is asked to read aloud as many words as possible in 1 min. Accuracy and speed are of importance. The test score is the number of words read correctly in 60 seconds and the number of seconds at the 50th item (EMTb50).

Phoneme awareness
The Phoneme awareness test is a test of phonemic synthesis and analysis on non-words, part of the FIK2 (1993, SBD Kop Noord-Holland). For the younger children (up to group 6) the maximum score is 16, and for the older children the maximum score is 20.

Cito Reading Technique and Tempo (Cito group, Arnhem 2004)
The Cito reading test measures silent reading skills and reading speed as a condition for reading comprehension and word recognition. It is a timed test at 14 different levels. The scores of the tests are translated into Cito scores, didactical age equivalent scores and learning effectiveness measures.

Definitions:
Raw Score: The number scored correct on an assessment.
Standardized Score: Conversion of the raw score to a normalized score, based on how others of their age perform.
The didactical age (DL) is a normalised measure of the number of months that a person has followed regular education.
The didactical age equivalent score (DLE) for a specific test gives the didactical age at which the realised test results are usually reached.
The learning effectiveness measure is the ratio between the didactical age equivalent score of the treated person and his or her real didactical age:

Learning effectiveness = DLE / DL

For a person with average learning capabilities the didactical age equivalent improves at the same rate as his or her didactical age, and therefore for a typical person the learning effectiveness is about 1.
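As a minimal illustration of the ratio defined above, the following Python sketch computes the learning effectiveness from a DLE and a DL. The function name and the example values are hypothetical and are not taken from the study data.

```python
def learning_effectiveness(dle_months: float, dl_months: float) -> float:
    """Return DLE / DL, the learning effectiveness ratio defined above.

    Both arguments are expressed in months of regular education.
    """
    if dl_months <= 0:
        raise ValueError("didactical age (DL) must be positive")
    return dle_months / dl_months


# Hypothetical client: 50 months of education, test results typical of 28 months.
print(round(learning_effectiveness(28, 50), 2))
```

A value of 1 corresponds to results in line with the months of education received; values below 1 indicate a lag.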
For BrightStar starters, the typical learning effectiveness measures are about 0.56, which means that their improvement rate is about half of that of an average person at the beginning of their treatment.

Results for general reading and ‘spelling’ capabilities

As an example we show the analysis for the PI Woorddictee outcomes. In the scatter plot the results of the Final Assessment (FA) are plotted against the results of the Initial Assessment (IA). On average the final assessment is done 57 days, i.e. approximately 2 months, after the initial assessment. We would expect an average improvement of about 2 DLE for typical non-dyslexic clients. In the scatter plot we see that a very large part of the points lie above the line FA = IA, which means that in most cases the clients have indeed improved. In the range between 30 DLE and 50 DLE (for the initial assessment) we see that after the BrightStar treatment clients realise the maximum score of 60 DLE.

[Figure: scatter plot of PI Woorddictee DLE, FA (vertical axis) against IA (horizontal axis), both axes from 0 to 60]

The following histogram shows the differences in DLE outcomes of the test (FA – IA).

[Figure: histogram (density) of PI Woorddictee DLE: FA - IA, horizontal axis from -10 to 30]

For this specific test the average improvement is 7.1 DLE. A 95% confidence interval for the mean improvement is 6.1 DLE to 8.0 DLE, which is far better than the 2 DLE that can be expected for non-dyslexic people.
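The kind of interval quoted above can be reproduced in outline as follows. This is a sketch only: the paired scores are invented for illustration, the helper name `mean_change_ci` is hypothetical, and a normal approximation is used for the critical value (reasonable for about 200 observations, much less so for the tiny sample here). The normal approximation is an assumption, since the text does not state how the interval was computed.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_change_ci(ia, fa, confidence=0.95):
    """Mean of FA - IA with a normal-approximation confidence interval."""
    diffs = [f - i for i, f in zip(ia, fa)]
    m = mean(diffs)
    se = stdev(diffs) / sqrt(len(diffs))  # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return m, m - z * se, m + z * se


# Invented DLE scores for six clients (not study data).
ia = [20, 25, 31, 18, 40, 22]
fa = [27, 30, 41, 22, 49, 28]
m, low, high = mean_change_ci(ia, fa)
expected = 2.0  # improvement expected for non-dyslexic clients in about 2 months
print(f"mean change {m:.1f} DLE, 95% CI {low:.1f} to {high:.1f}")
print("lower bound above expected improvement:", low > expected)
```

If the lower bound of the interval lies above the expected improvement, the observed gain exceeds what ordinary education over the same period would explain.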
An interesting statistic is the Learning Effectiveness during the treatment, which is calculated as follows:

Learning Effectiveness during treatment = (DLE FA – DLE IA) / (time gap between IA and FA)

The time gap is measured in months. For the clients the average Learning Effectiveness at the moment of the Initial Assessment was 56%. The Learning Effectiveness at the moment of the Final Assessment was 71%. This improvement in a period of about 2 months indicates that during these 2 months the Learning Effectiveness has been 391%, i.e. almost four times as high as for typical non-dyslexic people.

The following table summarises the outcomes of all tests which were part of the assessments. The number of observations can differ between tests, because (1) in some cases the outcomes of the tests could not be determined (because of too low scores) and (2) not all tests are relevant for all clients. For all tests the statistical significance is formally tested. The statistical approach and tests used by CQM are described in Appendix A.

Test                            n     Mean change FA – IA     LE IA   LE FA   LE during
                                      (95% conf. interval)                    treatment
PI Woorddictee (DLE)            206   7.1 (6.1 – 8.0)         0.56    0.71    3.91
DMT1 (DLE)                      205   5.6 (4.6 – 6.5)         0.49    0.59    3.02
DMT2 (DLE)                      204   4.6 (3.8 – 5.4)         0.49    0.58    2.56
DMT3 (DLE)                      197   4.1 (3.4 – 4.8)         0.49    0.57    2.27
Klepel (Std Score)              208   1.5 (1.3 – 1.7)         -       -       -
Klepel (DLE)                    135   6.9 (5.4 – 8.4)         0.51    0.64    3.68
AVI (DLE)                       182   3.2 (2.6 – 3.9)         0.49    0.55    1.84
SBWL Colour (Std Score)         209   1.9 (1.6 – 2.3)         -       -       -
SBWL Numbers (Std Score)        209   2.1 (1.7 – 2.4)         -       -       -
SBWL Pictures (Std Score)       209   2.2 (1.8 – 2.5)         -       -       -
SBWL Letters (Std Score)        209   2.0 (1.6 – 2.3)         -       -       -
SBWL Syl (Std Score)            207   1.7 (1.4 – 2.0)         -       -       -
SBWL EMTb50 (Std Score)         122   1.2 (0.9 – 1.5)         -       -       -
SBWL EMTb (Std Score)           201   1.5 (1.3 – 1.8)         -       -       -
SBWL EMTb (DLE)                 172   5.8 (4.8 – 6.8)         0.52    0.61    3.15
Phonemes                        194   3.5 (3.0 – 3.9)         -       -       -
CITO Reading Technique (DLE)    165   6.6 (5.4 – 7.8)         0.50    0.61    3.85

For all tests the improvements are found to be statistically significant, as can also be seen from the confidence intervals shown in the table. Note that standard scores are standardized for age, and therefore usually remain constant, i.e. standard scores are not expected to change because of two months of education.

Results of the analysis of interaction with other factors

In the study it has also been analysed whether there is a relationship between the effects of the treatment and the following factors:
- Gender
- Preferred hand (left-handed versus right-handed clients)
- Comorbidity with behavioural factors like ADD, ADHD or PDD-NOS
- The age of the client
- The total IQ of the client
- The availability of an official dyslexia certificate

In Appendix B the statistical tests used by CQM for the possible differences between the defined subgroups are described. A number of interesting relationships have been found:

Gender

The gender of the client seems to have a relationship with the improvement on the SBWL tests. In all cases a larger improvement on the standard scores is observed for female clients. This is summarised in the following table.

Test                         Mean change FA – IA       Mean change FA – IA
                             Female clients            Male clients
                             (95% conf. int.)          (95% conf. int.)
                             (# observations)          (# observations)
SBWL Colour (Std Score)      2.4 (1.8 – 2.9) (79)      1.6 (1.2 – 2.1) (130)
SBWL Numbers (Std Score)     2.4 (1.8 – 3.1) (79)      1.8 (1.4 – 2.2) (130)
SBWL Pictures (Std Score)    2.6 (2.0 – 3.1) (79)      1.9 (1.5 – 2.4) (130)
SBWL Letters (Std Score)     2.5 (2.0 – 3.0) (79)      1.6 (1.2 – 2.1) (130)

For the SBWL Letters test the following pictures illustrate the observations:

[Figure: scatter plot of SBWL Letters std score, FA against IA, by gender (Geslacht: Man = male, Vrouw = female)]
[Figure: box plots of SBWL Letters std score: FA - IA, by gender (Geslacht: m, v)]

In the scatter plot the dark and round-shaped points are observations for male clients, and the light and diamond shaped points are observations for female clients. The other picture shows box-plots of the observations for male clients (on the left) and female clients (on the right).

Preferred hand

For two tests an interesting relationship with the preferred hand of the client is observed. Left-handed clients show a larger improvement than right-handed clients on SBWL Pictures and SBWL EMTb (both Standard Scores), although the effect is not statistically significant at a significance level of 5%.

Test                         Mean change FA – IA       Mean change FA – IA
                             Left-handed clients       Right-handed clients
                             (95% conf. int.)          (95% conf. int.)
                             (# observations)          (# observations)
SBWL Pictures (Std Score)    3.1 (2.0 – 4.1) (30)      2.1 (1.7 – 2.5) (157)
SBWL EMTb (Std Score)        2.1 (1.2 – 2.9) (29)      1.4 (1.2 – 1.7) (151)

The following scatter plot and box-plots show the observations graphically. In the scatter plot the dark and round-shaped points are observations for right-handed (R) clients, and the light and diamond shaped points are observations for left-handed (L) clients.

[Figure: scatter plot of SBWL Plaatjes (Pictures) std score, FA against IA, by preferred hand (Voorkeurshand: L, R)]
[Figure: box plots of SBWL Plaatjes (Pictures) std score: FA - IA, by preferred hand (Voorkeurshand: L, R)]

Comorbidity

The analysis shows a very strong relationship between the improvements on the tests DMT1 (DLE) and SBWL Letters (Std Score) on the one hand, and the existence of comorbidity on the other hand:

Test                         Mean change FA – IA            Mean change FA – IA
                             clients with comorbidity       clients without comorbidity
                             (95% conf. int.)               (95% conf. int.)
                             (# observations)               (# observations)
DMT1 (DLE)                   1.3 (-0.2 – 2.8) (23)          6.1 (5.1 – 7.1) (182)
SBWL Letters (Std Score)     0.7 (0.0 – 1.5) (23)           2.1 (1.8 – 2.5) (186)

The following two pictures show the box plots for both tests. On the left side the clients with comorbidity and on the right side clients with no or unknown comorbidity are presented.
[Figure: box plots of DMT1 DLE: FA - IA, by comorbidity (Comorbide: Ja = yes, Nee/Onbekend = no/unknown)]
[Figure: box plots of SBWL Letters std score: FA - IA, by comorbidity (Comorbide: Ja = yes, Nee/Onbekend = no/unknown)]

It is interesting to see that for clients with comorbidity the observed improvements are less spread.

Age

From the analysis we observe that the largest improvements are made by clients in the age category 12 – 18 years for all DLE measures, except for AVI DLE and Klepel DLE. For PI Woorddictee DLE, DMT1 DLE, EMTb DLE and CITO Reading Technique DLE the differences are statistically very convincing. For AVI DLE and Klepel DLE no statistically significant differences among the age categories are observed.

Table with outcomes for age categories:

Test                            Mean change FA – IA       Mean change FA – IA       Mean change FA – IA
                                < 12 years                12 – 18 years             > 18 years
                                (95% conf. int.)          (95% conf. int.)          (95% conf. int.)
                                (# observations)          (# observations)          (# observations)
PI Woorddictee (DLE)            6.3 (5.2 - 7.4) (122)     10.0 (7.8 - 12.2) (50)    4.8 (2.2 - 7.4) (30)
DMT1 (DLE)                      4.8 (3.7 - 5.9) (121)     8.4 (6.0 - 10.8) (50)     4.3 (1.9 - 6.7) (30)
DMT2 (DLE)                      4.2 (3.2 - 5.2) (120)     6.3 (4.2 - 8.4) (50)      3.8 (2.0 - 5.6) (30)
DMT3 (DLE)                      3.7 (2.9 - 4.5) (113)     5.6 (3.9 - 7.3) (50)      2.8 (1.4 - 4.2) (30)
Klepel (DLE)                    8.2 (6.0 - 10.4) (60)     7.0 (4.4 - 9.6) (46)      4.0 (0.4 - 7.6) (26)
AVI (DLE)                       3.6 (2.8 - 4.4) (100)     2.9 (0.8 - 5.0) (49)      3.1 (1.4 - 4.8) (29)
SBWL Colour (Std Score)         1.9 (-0.1 - 3.9) (125)    2.2 (1.3 - 3.1) (50)      1.4 (0.7 - 2.1) (30)
SBWL Numbers (Std Score)        1.9 (1.5 - 2.3) (125)     2.6 (1.8 - 3.4) (50)      2.1 (1.3 - 2.9) (30)
SBWL Pictures (Std Score)       2.0 (1.6 - 2.4) (125)     2.6 (1.8 - 3.4) (50)      2.6 (1.8 - 3.4) (30)
SBWL Letters (Std Score)        1.9 (1.5 - 2.3) (125)     2.2 (1.5 - 2.9) (50)      2.2 (1.4 - 3.0) (30)
SBWL Syl (Std Score)            1.4 (1.0 - 1.8) (123)     2.4 (1.7 - 3.1) (50)      1.5 (0.5 - 2.5) (30)
SBWL EMTb (Std Score)           1.5 (1.2 - 1.8) (117)     1.4 (1.0 - 1.8) (50)      1.8 (0.9 - 2.7) (30)
SBWL EMTb (DLE)                 5.3 (4.0 - 6.6) (88)      8.6 (6.3 - 10.9) (50)     2.9 (0.9 - 4.9) (30)
Phonemes (Std Score)            3.8 (3.2 - 4.4) (112)     3.5 (2.6 - 4.4) (49)      2.5 (1.3 - 3.7) (29)
CITO Reading Technique (DLE)    4.9 (3.4 - 6.4) (84)      9.3 (7.2 - 11.4) (48)     6.2 (3.1 - 9.3) (29)

To illustrate the differences the scatter plot and box plots for the outcomes for the CITO reading technique are shown below.

[Figure: scatter plot of CITO Leestechniek (Reading Technique) DLE, FA against IA, by age category (Leeftijdscategorie): < 12 years, 12-18 years, > 18 years]
In the scatter plot the dark round shaped points are observations for the age category < 12 years, the light diamond shaped points are observations for the age category 12-18 years, and the red, triangle shaped points are observations for the age category > 18 years.

[Figure: box plots of CITO Leestechniek (Reading Technique) DLE: FA - IA, by age category (Leeftijdscategorie): < 12 years, 12-18 years, > 18 years]

In the analysis no statistically significant relationships are observed between the effects of the treatment and either the intelligence of the client (TIQ) or the existence of an official dyslexia certificate.

Conclusion

For clients that have been treated with the BrightStar method, convincing improvements are observed on all aspects of the tests. Improvements can be reached with traditional remedial teaching methods; however, experience has shown that these improvements are smaller and can only be reached with very hard work and discipline over a much longer period of time.

Appendix A: Statistical tests on improvement during treatment

The improvement during the treatment is defined as the score of a subject obtained during the final assessment minus the score of the same subject at the initial assessment. We distinguish between DLE and non-DLE tests.

For the DLE tests, subjects that scored a zero at either the initial or the final assessment are left out of the analysis. This is done because a zero means “cannot be determined”. Possible observed changes in these cases would then not be correct and not comparable to other observed changes. The average time between the initial and the final assessment is 1.87 months. Hence, to assess the significance for the DLE tests, we compare the observed improvement in DLE score with the expected improvement (i.e., for a non-dyslexic person) of 1.87. This leads to a null hypothesis H0: mean (DLE FA – DLE IA) = 1.87 and an alternative, one-sided, hypothesis HA: mean (DLE FA – DLE IA) > 1.87.
The alternative hypothesis is one-sided, since the interest is only in improvements that are larger than expected. The hypotheses were tested using a standard paired t-test with a 95% confidence level, since the differences DLE FA – DLE IA, although slightly skewed to the right, do not deviate too far from the normal distribution.

For the non-DLE tests, all subjects that performed both the initial and final assessment are included in the analysis. In this case, we tested whether the improvement in score was significantly larger than zero. More formally, we assessed the null hypothesis H0: mean (score FA – score IA) = 0 versus the alternative, one-sided, hypothesis HA: mean (score FA – score IA) > 0. Again, these hypotheses were tested via a paired t-test with a 95% confidence level.

Appendix B: Statistical tests on differences in improvements between subgroups

For the statistical testing of possible differences in improvements between subgroups, the improvement during the treatment is again defined as the score of a subject obtained during the final assessment minus the score of the same subject at the initial assessment. Again, subjects that scored a zero at either the final or initial assessment are left out of the analysis.

We test whether the improvement differs significantly between two or more subgroups. Here, we distinguish between explanatory variables that have 2 underlying subgroups (e.g., gender: male and female) and variables that have 3 or more underlying subgroups (e.g., age: <12, 12-18, >18 years).

For the 2-subgroup variables, we test whether the difference in improvement between the two subgroups is significant. This gives a null hypothesis H0: mean (score subgroup 1) – mean (score subgroup 2) = 0 and an alternative (two-sided) hypothesis HA: mean (score subgroup 1) – mean (score subgroup 2) ≠ 0. The hypotheses were tested using a two-sample t-test with a confidence level of 95%.
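The two-subgroup comparison described above can be sketched as follows. The text does not say whether a pooled-variance or an unequal-variance test was used; this sketch assumes the unequal-variance (Welch) form. The function name `welch_t` and the data are hypothetical and are not study data.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_1, sample_2):
    """Welch two-sample t statistic and approximate degrees of freedom."""
    n1, n2 = len(sample_1), len(sample_2)
    # Squared standard errors of the two subgroup means.
    v1, v2 = variance(sample_1) / n1, variance(sample_2) / n2
    t = (mean(sample_1) - mean(sample_2)) / sqrt(v1 + v2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Invented improvements (FA - IA) for two subgroups, not study data.
group_1 = [3, 2, 4, 1, 3, 2]
group_2 = [1, 2, 0, 2, 1, 1, 3]
t, df = welch_t(group_1, group_2)
print(f"t = {t:.2f}, df = {df:.1f}")
```

For the two-sided test, H0 is rejected at the 5% level when the absolute value of t exceeds the 97.5% quantile of the t distribution with df degrees of freedom, which a statistics package can supply.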
For the other variables, we apply the well-known Bonferroni method for multiple comparisons. This means that we test the same hypotheses as above but now for all pairs of subgroups and each with an increased confidence level of 1-0.05/r, where r is the number of underlying subgroups for the explanatory variable. This method ensures that the simultaneous confidence level remains at least 95%.
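As a small worked illustration of the adjustment described above, the following Python sketch lists the pairwise comparisons for the three age categories and the per-comparison confidence level 1 - 0.05/r. The function name is hypothetical. Note that for three subgroups the number of pairs equals r; with more subgroups the number of pairs exceeds r, so dividing by the number of pairs would then be the conservative choice.

```python
from itertools import combinations


def bonferroni_level(overall_alpha, r):
    """Per-comparison confidence level 1 - overall_alpha / r."""
    return 1 - overall_alpha / r


age_groups = ["<12", "12-18", ">18"]
pairs = list(combinations(age_groups, 2))  # all pairwise comparisons
level = bonferroni_level(0.05, len(age_groups))
print(pairs)
print(f"per-pair confidence level: {level:.4f}")
```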