Stabilized Music Aptitude: Onset, Transition, and Relative Constancy
in Upper Elementary Students
Roberta L. Yee
March 26, 2021
A dissertation submitted to the
faculty of the Graduate School of
the University at Buffalo, The State University of New York
in partial fulfillment of the requirements for the
degree of
Doctor of Philosophy
Department of Learning and Instruction
Copyright by
Roberta L. Yee
All Rights Reserved
Acknowledgements
I am indebted to Dr. Maria Runfola, my advisor and committee chair, for her inspiration,
guidance, and exacting editorial skills. I offer my sincere thanks to Dr. Elisabeth Etopio and Dr.
Sunha Kim for their feedback, suggestions, and support as members of my dissertation
committee. Additional thanks to:
My daughters, Eliza and Sylvie, for their patience and understanding during this long
PhD process: I’ve been distracted for seven years, but am looking forward to
reconnecting with you, hopefully on an overseas trip.
My sister Denise, out-laws Dirk and Staci, and nieces Alyssa and Sophia for their love
and encouragement: I am deeply appreciative.
My brother Wendell for his expertise in designing a data collection spreadsheet: you
stepped up before I knew I was in over my head and created a thing of beauty.
My niece Julia for her many hours of data input: I am grateful for your accuracy and
attention to detail.
The students of the Halifax Area School District for providing a window into their
musical thinking.
My parents Bob and Jo Yee for their unconditional love, support, and belief in me: this is
for you.
Abstract
The effects of chronological age and instruction on music aptitude, as well as the transition
between the developmental and stabilized music aptitude stages, were examined to further
establish the rationale for selecting the most appropriate music aptitude test for students in
Grades 3, 4, and 5. Archived scores of the Intermediate Measures of Music Audiation (IMMA)
were analyzed using paired t-tests, Wilcoxon Signed Rank tests, and repeated measures ANOVA. No
effect of chronological age or instruction was found, and a period of transition could not be
substantiated definitively. It was conjectured that tonal aptitude and rhythm aptitude stabilize
independently of one another. However, because the type of instruction may have had a deleterious
effect on the findings for effect of instruction and substantiation of a transition period, further
research is recommended.
Keywords: stages of music aptitude, IMMA, chronological age, instruction, transition,
developmental music aptitude, stabilized music aptitude
Table of Contents
Acknowledgements
Abstract
List of Tables
List of Figures
Chapter 1: Introduction
Theoretical Framework
Background of the Study
Need and Significance of the Study
Purpose of the Study
Research Questions
Scope and Delimitations
Definition of Terms
Chapter 2: Literature Review
Introduction
Supplementary Features of Music Aptitude
Stages of Music Aptitude
Brief History of Music Aptitude Testing
Recent Music Measures
Critique of Previous Music Aptitude Measures
Critique of Gordon’s Music Aptitude Measures
Music Aptitude Measures Developed by Gordon
Features of Stabilized Music Aptitude
Chapter 3: Methodology
Research Questions and Research Hypotheses
Participants
Missing Values
Instrument
Procedure
Chapter 4: Presentation and Interpretation of Data
Pattern Analysis of Missing Data
Imputation of Missing Values
Overview of Statistical Analyses
Research Question 1
Research Question 2
Research Question 3
Summary
Chapter 5: Discussion, Recommendation, and Conclusions
Purpose of the Study
Methodology
Results
Discussion
Limitations of the Study
Recommendations
Adaptations to the Current Study
Extensions to the Current Study
Conclusions
References
List of Tables
Table 1. Statistical tests by grade level and academic year
Table 2. Three-year longitudinal examination of IMMA scores
Table 3. Variable summary
Table 4. Descriptive statistics of complete and excluded case samples (pooled)
Table 5. Correlation results of complete and excluded case samples (pooled)
Table 6. Paired samples t-test results – complete case sample (pooled)
Table 7. Paired samples t-test results – excluded case sample (pooled)
Table 8. 3ST-4FT Descriptive statistics and correlation coefficient
Table 9. 3ST-4FT Paired t-test results
Table 10. 4ST-5FT Descriptive statistics and correlation coefficient
Table 11. 4ST-5FT Paired t-test results
Table 12. 3SR-4FR Descriptive statistics and correlation coefficient
Table 13. 3SR-4FR Paired t-test results
Table 14. 4SR-5FR Descriptive statistics and correlation coefficient
Table 15. 4SR-5FR Paired t-test results
Table 16. 3SC-4FC Descriptive statistics and correlation coefficient
Table 17. 3SC-4FC Paired t-test results
Table 18. 4SC-5FC Descriptive statistics and correlation coefficient
Table 19. 4SC-5FC Paired t-test results
Table 20. Wilcoxon Signed Rank test results (tonal)
Table 21. Wilcoxon Signed Rank test results (rhythm)
Table 22. Wilcoxon Signed Rank test results (composite)
Table 23. 2007-2008 Grade 3 Descriptive Statistics (pooled)
Table 24. 2007-2008 Grade 3 Correlation matrix (pooled)
Table 25. 2007-2008 Grade 3 Shapiro-Wilk Test of Normality results
Table 26. 2007-2008 Grade 3 Wilcoxon Signed Rank test results (pooled)
Table 27. 2007-2008 Grade 3 Wilcoxon Signed Rank test statistics (pooled)
Table 28. 2008-2009 Grade 3 Descriptive statistics (pooled)
Table 29. 2008-2009 Grade 3 Correlation matrix (pooled)
Table 30. 2008-2009 Grade 3 Shapiro-Wilk Test of Normality results
Table 31. 2008-2009 Grade 3 Wilcoxon Signed Rank test results (pooled)
Table 32. 2008-2009 Grade 3 Wilcoxon Signed Rank test statistics (pooled)
Table 33. 2009-2010 Grade 3 Descriptive statistics (pooled)
Table 34. 2009-2010 Grade 3 Correlation matrix (pooled)
Table 35. 2009-2010 Grade 3 Shapiro-Wilk Test of Normality results
Table 36. 2009-2010 Grade 3 Wilcoxon Signed Rank test results (pooled)
Table 37. 2009-2010 Grade 3 Wilcoxon Signed Rank test statistics (pooled)
Table 38. 2010-2011 Grade 3 Descriptive statistics (pooled)
Table 39. 2010-2011 Grade 3 Correlation matrix (pooled)
Table 40. 2010-2011 Grade 3 Shapiro-Wilk Test of Normality results
Table 41. 2010-2011 Grade 3 Wilcoxon Signed Rank test results (pooled)
Table 42. 2010-2011 Grade 3 Wilcoxon Signed Rank test statistics (pooled)
Table 43. 2011-2012 Grade 3 Descriptive statistics (pooled)
Table 44. 2011-2012 Grade 3 Correlation matrix (pooled)
Table 45. 2011-2012 Grade 3 Shapiro-Wilk Test of Normality results
Table 46. 2011-2012 Grade 3 Wilcoxon Signed Rank test results (pooled)
Table 47. 2011-2012 Grade 3 Wilcoxon Signed Rank test statistics (pooled)
Table 48. 2012-2013 Grade 3 Descriptive statistics (pooled)
Table 49. 2012-2013 Grade 3 Correlation matrix (pooled)
Table 50. 2012-2013 Grade 3 Shapiro-Wilk Test of Normality results
Table 51. 2012-2013 Grade 3 Wilcoxon Signed Rank test results (pooled)
Table 52. 2012-2013 Grade 3 Wilcoxon Signed Rank test statistics (pooled)
Table 53. 2013-2014 Grade 3 Descriptive statistics (pooled)
Table 54. 2013-2014 Grade 3 Correlation matrix (pooled)
Table 55. 2013-2014 Grade 3 Shapiro-Wilk Test of Normality results
Table 56. 2013-2014 Grade 3 Wilcoxon Signed Rank test results (pooled)
Table 57. 2013-2014 Grade 3 Wilcoxon Signed Rank test statistics (pooled)
Table 58. 2014-2015 Grade 3 Descriptive statistics (pooled)
Table 59. 2014-2015 Grade 3 Correlation matrix (pooled)
Table 60. 2014-2015 Grade 3 Shapiro-Wilk Test of Normality results
Table 61. 2014-2015 Grade 3 Wilcoxon Signed Rank test results (pooled)
Table 62. 2014-2015 Grade 3 Wilcoxon Signed Rank test statistics (pooled)
Table 63. 2015-2016 Grade 3 Descriptive statistics (pooled)
Table 64. 2015-2016 Grade 3 Correlation matrix (pooled)
Table 65. 2015-2016 Grade 3 Shapiro-Wilk Test of Normality results
Table 66. 2015-2016 Grade 3 Wilcoxon Signed Rank test results (pooled)
Table 67. 2015-2016 Grade 3 Wilcoxon Signed Rank test statistics (pooled)
Table 68. 2016-2017 Grade 3 Descriptive statistics (pooled)
Table 69. 2016-2017 Grade 3 Correlation matrix (pooled)
Table 70. 2016-2017 Grade 3 Shapiro-Wilk Test of Normality results
Table 71. 2016-2017 Grade 3 Wilcoxon Signed Rank test results (pooled)
Table 72. 2016-2017 Grade 3 Wilcoxon Signed Rank test statistics (pooled)
Table 73. 2017-2018 Grade 3 Descriptive statistics (pooled)
Table 74. 2017-2018 Grade 3 Correlation matrix (pooled)
Table 75. 2017-2018 Grade 3 Shapiro-Wilk Test of Normality results
Table 76. 2017-2018 Grade 3 Paired t-test results (pooled)
Table 77. 2018-2019 Grade 3 Descriptive statistics (pooled)
Table 78. 2018-2019 Grade 3 Correlation matrix (pooled)
Table 79. 2018-2019 Grade 3 Shapiro-Wilk Test of Normality results
Table 80. 2018-2019 Grade 3 Wilcoxon Signed Rank test results (pooled)
Table 81. 2018-2019 Grade 3 Wilcoxon Signed Rank test statistics (pooled)
Table 82. 2019-2020 Grade 3 Descriptive statistics (pooled)
Table 83. 2019-2020 Grade 3 Correlation matrix (pooled)
Table 84. 2019-2020 Grade 3 Shapiro-Wilk Test of Normality results
Table 85. 2019-2020 Grade 3 Wilcoxon Signed Rank test results (pooled)
Table 86. 2019-2020 Grade 3 Wilcoxon Signed Rank test statistics (pooled)
Table 87. Repeated Measures ANOVA combined results (tonal)
Table 88. Repeated Measures ANOVA combined results (rhythm)
Table 89. Repeated Measures ANOVA combined results (composite)
Table 90. Group A: Descriptive statistics pooled results (tonal)
Table 91. Group A: Mauchly’s Test of Sphericity results (tonal)
Table 92. Group A: Tests of Within-Subjects Effects results (tonal)
Table 93. Group A: Multivariate test results (tonal)
Table 94. Group A: Pairwise Comparisons pooled results (tonal)
Table 95. Group A: Descriptive Statistics pooled results (rhythm)
Table 96. Group A: Mauchly’s Test of Sphericity results (rhythm)
Table 97. Group A: Tests of Within-Subjects Effects results (rhythm)
Table 98. Group A: Multivariate test results (rhythm)
Table 99. Group A: Descriptive Statistics pooled results (composite)
Table 100. Group A: Mauchly’s Test of Sphericity results (composite)
Table 101. Group A: Tests of Within-Subjects Effects results (composite)
Table 102. Group A: Multivariate test results (composite)
Table 103. Group A: Pairwise Comparisons pooled results (composite)
Table 104. Group B: Descriptive Statistics pooled results (tonal)
Table 105. Group B: Mauchly’s Test of Sphericity results (tonal)
Table 106. Group B: Tests of Within-Subjects Effects results (tonal)
Table 107. Group B: Multivariate test results (tonal)
Table 108. Group B: Pairwise Comparisons pooled results (tonal)
Table 109. Group B: Descriptive Statistics pooled results (rhythm)
Table 110. Group B: Mauchly’s Test of Sphericity results (rhythm)
Table 111. Group B: Tests of Within-Subjects Effects results (rhythm)
Table 112. Group B: Multivariate test results (rhythm)
Table 113. Group B: Pairwise Comparison pooled results (rhythm)
Table 114. Group B: Descriptive Statistics pooled results (composite)
Table 115. Group B: Mauchly’s Test of Sphericity results (composite)
Table 116. Group B: Tests of Within-Subjects Effects results (composite)
Table 117. Group B: Multivariate test results (composite)
Table 118. Group B: Pairwise Comparisons pooled results (composite)
Table 119. Group C: Descriptive Statistics pooled results (tonal)
Table 120. Group C: Mauchly’s Test of Sphericity results (tonal)
Table 121. Group C: Tests of Within-Subjects Effects results (tonal)
Table 122. Group C: Multivariate test results (tonal)
Table 123. Group C: Pairwise Comparisons pooled results (tonal)
Table 124. Group C: Descriptive Statistics pooled results (rhythm)
Table 125. Group C: Mauchly’s Test of Sphericity results (rhythm)
Table 126. Group C: Tests of Within-Subjects Effects results (rhythm)
Table 127. Group C: Multivariate test results (rhythm)
Table 128. Group C: Descriptive Statistics pooled results (composite)
Table 129. Group C: Mauchly’s Test of Sphericity results (composite)
Table 130. Group C: Tests of Within-Subjects Effects results (composite)
Table 131. Group C: Multivariate test results (composite)
Table 132. Group C: Pairwise Comparisons pooled results (composite)
Table 133. Group D: Descriptive Statistics pooled results (tonal)
Table 134. Group D: Mauchly’s Test of Sphericity results (tonal)
Table 135. Group D: Tests of Within-Subjects Effects results (tonal)
Table 136. Group D: Multivariate test results (tonal)
Table 137. Group D: Pairwise Comparisons pooled results (tonal)
Table 138. Group D: Descriptive Statistics pooled results (rhythm)
Table 139. Group D: Mauchly’s Test of Sphericity results (rhythm)
Table 140. Group D: Tests of Within-Subjects Effects results (rhythm)
Table 141. Group D: Multivariate test results (rhythm)
Table 142. Group D: Descriptive Statistics pooled results (composite)
Table 143. Group D: Mauchly’s Test of Sphericity results (composite)
Table 144. Group D: Tests of Within-Subjects Effects results (composite)
Table 145. Group D: Multivariate test results (composite)
Table 146. Group D: Pairwise Comparisons pooled results (composite)
Table 147. Group E: Descriptive Statistics pooled results (tonal)
Table 148. Group E: Mauchly’s Test of Sphericity results (tonal)
Table 149. Group E: Tests of Within-Subjects Effects results (tonal)
Table 150. Group E: Multivariate test results (tonal)
Table 151. Group E: Descriptive Statistics pooled results (rhythm)
Table 152. Group E: Mauchly’s Test of Sphericity results (rhythm)
Table 153. Group E: Tests of Within-Subjects Effects results (rhythm)
Table 154. Group E: Multivariate test results (rhythm)
Table 155. Group E: Pairwise Comparison pooled results (rhythm)
Table 156. Group E: Descriptive Statistics pooled results (composite)
Table 157. Group E: Mauchly’s Test of Sphericity results (composite)
Table 158. Group E: Tests of Within-Subjects Effects results (composite)
Table 159. Group E: Multivariate test results (composite)
Table 160. Group E: Pairwise Comparisons pooled results (composite)
Table 161. Group F: Descriptive Statistics pooled results (tonal)
Table 162. Group F: Mauchly’s Test of Sphericity results (tonal)
Table 163. Group F: Tests of Within-Subjects Effects results (tonal)
Table 164. Group F: Multivariate test results (tonal)
Table 165. Group F: Descriptive Statistics pooled results (rhythm)
Table 166. Group F: Mauchly’s Test of Sphericity results (rhythm)
Table 167. Group F: Tests of Within-Subjects Effects results (rhythm)
Table 168. Group F: Multivariate test results (rhythm)
Table 169. Group F: Descriptive Statistics pooled results (composite)
Table 170. Group F: Mauchly’s Test of Sphericity results (composite)
Table 171. Group F: Tests of Within-Subjects Effects results (composite)
Table 172. Group F: Multivariate test results (composite)
List of Figures
Figure 1. PMMA/IMMA test answer sheet design
Figure 2. Research procedure
Figure 3. Overall summary of missing values
Figure 4. Missing value patterns
Figure 5. Missing value patterns bar graph
Figure 6. Paired t-test Spring-Fall results (tonal)
Figure 7. Paired t-test Spring-Fall results (rhythm)
Figure 8. Paired t-test Spring-Fall results (composite)
Figure 9. Wilcoxon Signed Rank test results (tonal)
Figure 10. Wilcoxon Signed Rank test results (rhythm)
Figure 11. Wilcoxon Signed Rank test results (composite)
Figure 12. Repeated measures ANOVA results (tonal)
Figure 13. Repeated measures ANOVA results (rhythm)
Figure 14. Repeated measures ANOVA results (composite)
Figure 15. Adaptations and Extensions to the current study
Chapter 1: Introduction
Adapting instruction to support individual differences is a hallmark of good teaching, yet
planning and implementing differentiated instruction is often an afterthought for music
educators. The general consensus among educators is that individualized instruction benefits all
students (Heathers, 1977). Similarly, individualized instruction in the music classroom is favorable,
as it can enable students to achieve mastery if the task is appropriate. Salvador (2011) asserted
that students of varying levels of music aptitude may benefit from differentiated instruction. Scores
from a valid music aptitude test can help diagnose each student’s musical strengths and
weaknesses (Gordon, 2006), thus providing the necessary data by which teachers can determine
the appropriateness of tasks for each student.
A variety of music aptitude tests were developed in the early- to mid-1900s, differing in
their content and intent. Test authors in the gestalt camp such as Wing and Drake (Gordon, 1987)
believed each student possessed a global music aptitude, best expressed as a single composite
test score combining all dimensions. Others, exemplified by Seashore, Kwalwasser, and
Dykema, favored an atomistic approach to music aptitude (Gordon, 1987) in which separate
aptitude scores from several subtests were reported. Differences in production of the recorded
test prompts (natural sound production versus sound produced with musical instruments), the
musical context of the test prompts (isolated pitches versus tonal patterns), the task required of
the test taker (pitch discrimination versus counting of pitches), and the focus of the subtests to
measure sensitivity to musical expression (preference versus non-preference) exemplified
disparity in the function of available music aptitude tests. In addition, terms such as “ability”,
“talent”, “achievement”, “intelligence”, and “aptitude” were used interchangeably and
indiscriminately (Boyle, 1992), thus confounding the construct of music aptitude that test
developers sought to measure.
Theoretical Framework
The theoretical framework of this study was based on the extensive research of Edwin E.
Gordon on the construct and measurement of music aptitude, and data were interpreted in this
study through the lens of music aptitude. From the results of numerous studies conducted by
Gordon, his doctoral students, and other music educators, Gordon (1987) theorized an omnibus
conceptualization of music aptitude as innate but not inherent, multidimensional, and
developmental (influenced by the music environment) until approximately age 9 (p. 9). Gordon
designed five music aptitude tests to measure developmental and stabilized music aptitude for a
range of student ages; norms were reported for students from preschool through college. Gordon
devised the music aptitude tests Audie and the Primary Measures of Music Audiation (PMMA)
to measure developmental music aptitude and the Advanced Measures of Music Audiation
(AMMA) and the Musical Aptitude Profile (MAP) to measure stabilized music aptitude. By
reviewing test scores, teachers were able not only to diagnose students’ musical strengths and
weaknesses but to use their knowledge of students’ musical strengths and weaknesses to
individualize instruction. Moreover, Gordon (2006) asserted that the Intermediate Measures of Music
Audiation (IMMA) could be used to measure developmental music aptitude for students in
Grades 1–3 and stabilized music aptitude in students in Grades 4–6. Norms were published for
students in Grades K–3 (PMMA; Gordon, 1986c), Grades 1–6 (IMMA; Gordon, 1986c), Grades
4–12 (MAP; Gordon, 1995), and students in junior high, high school, and college (AMMA;
Gordon, 1989b). The overlap of normed grade levels between music aptitude tests and IMMA’s
ability to measure both developmental and stabilized music aptitude, depending on students’
chronological age, might result in uncertainty on the part of music teachers about selection of the
music aptitude test most appropriate and efficacious for use with their students.
Background of the Study
Investigations Using Music Aptitude Measures
Research on the design and effectiveness of music aptitude testing has been conducted
since the early 1900s. Numerous researchers incorporated Gordon’s music aptitude measures in
their studies designed to investigate varied topics such as improvisation (Amchin, 1995; Azzara,
1992; Bash, 1983; Briscuso, 1972; Ciorba, 2006; Della Pietra, 1997; Josuweit, 1991; Karas,
2005; Kołodziejski, 2019; Rowlyk, 2008; Stringham, 2010; Westervelt, 2001), composition
(Auh, 1995; Crawford, 2016; Guderian, 2008; Henry, 1995; Menard, 2009; Smith, 2004;
Stoltzfus, 2005), vocal music achievement (Conkling, 1994; Guerrini, 2002; Kimble, 1983;
McDowell, 1974; Miceli, 1998; Pereira et al., 2017; Rutkowski, 2015), instrumental music
achievement (Arms Gilbert, 1997; Baer, 1987; Belczyk, 1992; Bergonzi, 1991; Bernhard, 2003;
Brokaw, 1983; Choi, 1996; Cribari, 2014; Dell, 2003; Edmund, 2009; Frierson-Campbell, 2001;
Gouzouasis, 1990; Kendall, 1986; Klinedinst, 1989; Lee, 2007; Linklater, 1994; Liperote, 2004;
Milford, 2002; O’Leary, 2010; Ruthsatz, 2000), music reading (Bluestine, 2007; Ciepluch, 1988;
Jarvis, 1981; Karas, 2005; Kluth, 1986; McDonald, 2010; Milford, 2002; Palmer, 1974; Parks,
2005; Reifinger, 2018), and types of instruction for students of elementary through high school
age (Carroll, 1983; Cary, 1981; Clark, 2005; Davis, 1981; Etzel, 1979; Froseth, 1968; Gamble,
1989; Green, 2003; Groeling, 1975; Grutzmacher, 1985; Hansen, 1991; Haston, 2004; Hasty,
1992; Morgan, 1995; Ortner, 1990; Pursell, 2005; Smith, 2006). Thus, empirical evidence exists
to support Gordon’s conceptualization of music aptitude and audiation (Culp, 2017) and its use
in research.
Measurement of Music Aptitude
Gordon characterized developmental music aptitude as subject to the influence of musical
environment and prone to fluctuation until approximately age 9 (Gordon, 1986c, pp. 8–9). In
contrast, stabilized music aptitude was characterized as relatively immune to the effects of
training and instruction, ceasing to demonstrate marked change after age 9 (Gordon, 1987, p. 9).
The research of DeYarman (1972), Harrington (1969), Schleuter and DeYarman (1977),
Seashore (1919), Stevens (1987), and Wing (1939/1961) focused on measurement of music
aptitude for elementary-aged children from kindergarten through age 9, and sought the
approximate age after which influence of training and instruction no longer affected student
scores. The cessation of environmental influence thus indicated a shift from developmental
music aptitude to stabilized music aptitude; this interpretation shaped the design of research
investigating the construct and measurement of music aptitude. Reports of questionable test
reliability aside (Geissel, 1985, pp. 1–2), it would be imprudent to draw conclusions based on the
findings of these studies, as the nature and description of music aptitude measured in each
(gestalt, atomistic, or omnibus) was dissimilar, as previously noted.
However, two studies featured a unified approach to the construct of music aptitude in
their examination of music aptitude testing. Geissel (1985) conducted an investigation of the
comparative validities of PMMA, IMMA, and MAP with fourth grade students and concluded
PMMA was too easy for fourth grade students, but IMMA and MAP were valid measures of
stabilized music aptitude for students who possess high music aptitude (p. 32). Gordon (1986a)
later published a factor analysis of the same three measures and inferred from the results the
existence of developmental and stabilized music aptitude. Additional researchers inferred
attainment of the stabilized music aptitude stage from findings of no significant
difference between pre- and post-training music aptitude test scores; however, these studies generally
had other research foci, and their conclusions about stabilized music aptitude onset were
tangential (DeYarman, 1972; Gordon, 1989c; Schleuter & DeYarman, 1977). Thus, few
researchers have focused their work directly on identification of the age and nature of onset of
the stabilized music aptitude stage and the appropriateness of Gordon’s music aptitude measures
for students in upper elementary grades.
The identification of the age of onset of stabilized music aptitude is further complicated
by the overlap of recommended grade levels in Gordon’s music aptitude tests. Geissel (1985)
noted selection of MAP or IMMA test batteries for students aged 9 is obscured by overlapping
norms for Grade 4 students (p. 3). Onset of stabilized music aptitude was commonly postulated
as occurring at age 9 or 10 (Deutsch, 1982; Gordon, 1971, 2005; Mang, 2013; Phillips et al.,
2002; Stevens, 1987) and had been characterized as “resistant to instruction” (DeYarman, 1975;
Gordon, 1986a, 1989b; Haroutounian, 2002; Mang, 2013; Moore, 1987). Gordon (1986c)
asserted the same test might be used to measure developmental music aptitude and recently
stabilized music aptitude of students in Grade 4 if the design and content conformed to research
specifications (p. 27), and concluded from a correlational study of IMMA and MAP that IMMA
also might function as a test of stabilized music aptitude (Bolton, 1995). Walters (1991)
concurred, but noted the superior diagnostic capabilities of MAP due to inclusion of sensitivity
constructs, as compared to IMMA. However, Geissel’s (1985) findings indicated IMMA and
PMMA (tests of developmental music aptitude) had more in common with each other than either
had with MAP (a test of stabilized music aptitude) (p. 31).
Schleuter and DeYarman (1977) asserted from their findings that formal music
instruction did not affect the music aptitude levels of students in kindergarten through fourth
grade, and thus concluded music aptitude stabilized before age five or six. These results were
similar to those found by DeYarman (1972) in a previous study. Nevertheless, there appeared to
be little empirical evidence to demarcate the shift from developmental to stabilized music
aptitude stages. The rationale for selection of music aptitude tests for upper elementary students
was therefore obscured, as IMMA and MAP both served as tests of stabilized music aptitude for
upper elementary students.
IMMA was considered a more advanced and discriminating version of PMMA due to
item difficulty of content (Walters, 1991). Bolton (1995) asserted IMMA was more suitable for
older children who were likely to have been acculturated to music in many different forms and to
have formed more sophisticated means of categorizing tonal and rhythm sounds (p. 31). PMMA
or IMMA can be administered to students in Grades 1, 2, and 3 because both measure
developmental music aptitude; Gordon (1986c) recommended administration of IMMA for
classes in which at least half the students score above the 80th percentile on some or all of
PMMA (p. 27). Phillips and Aitchison (1997) adhered to Gordon’s recommendation and
administered PMMA to their sample of third grade students because Gordon’s published
criterion for use of IMMA had not been met. A teacher therefore must estimate or calculate
students’ PMMA test scores in order to determine if the administration of IMMA would be more
appropriate, a Catch-22. Gordon (2001a) noted evaluation of developmental music aptitude
change through comparison of PMMA and IMMA scores or percentile ranks was ill-advised, and
instead proposed comparison of scores from administrations of the same test.
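
The selection criterion described above can be expressed as a simple decision rule. The following is a minimal sketch only, not a procedure published by Gordon: the function name, its parameters, and the sample percentile ranks are hypothetical, and the sketch assumes the teacher has already chosen which PMMA score (Tonal, Rhythm, or Composite) to evaluate, because the source criterion refers to "some or all" of PMMA.

```python
from typing import Sequence


def recommend_developmental_test(
    pmma_percentiles: Sequence[float],
    cutoff: float = 80.0,      # 80th percentile, per the criterion cited above
    min_share: float = 0.5,    # "at least half the students"
) -> str:
    """Return "IMMA" when at least `min_share` of the class scored above
    `cutoff` on the chosen PMMA score; otherwise return "PMMA"."""
    if not pmma_percentiles:
        raise ValueError("at least one PMMA percentile rank is required")
    above = sum(1 for p in pmma_percentiles if p > cutoff)
    return "IMMA" if above / len(pmma_percentiles) >= min_share else "PMMA"


# Hypothetical percentile ranks for two classes (illustrative values only).
class_a = [85, 90, 72, 88, 95, 60]  # 4 of 6 students above the 80th percentile
class_b = [55, 81, 40, 79, 62, 90]  # 2 of 6 students above the 80th percentile

print(recommend_developmental_test(class_a))
print(recommend_developmental_test(class_b))
```

The sketch also makes the circularity noted above concrete: the rule cannot be evaluated until PMMA percentile ranks are already in hand.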
Transition Between Music Aptitude Stages
Gordon (1989b) later addressed a transitional stage of music aptitude, and promoted
IMMA administration for students transitioning from the developmental to stabilized music
aptitude stage (ages 6–9) or those who have attained the stabilized music aptitude stage (ages
10–11). Moreover, Gordon (2006) speculated that middle school might serve as
the period of a pronounced borderline between developmental and stabilized music
aptitude stages, and MAP is more appropriate for students just entering the stabilized
stage and AMMA for students who have gone beyond middle-ground and already settled
into that stage (p. 234).
In a related examination of developmental music aptitude, Gordon (1980b) found an uneven
pattern of growth of tonal and rhythm developmental music aptitudes of Grade 7 students from a
large city. This information did not correspond with what was previously understood about the
consistency of stabilized music aptitude levels in culturally homogeneous students. Gordon
recommended further analysis of the nature of the transition from developmental to stabilized
music aptitude. Thus, an investigation of the transitional stage between developmental and
stabilized music aptitude in the current study might reveal information helpful to teachers in
determining the appropriate measure of music aptitude to use with their students.
Longitudinal Constancy of Music Aptitude
Constancy of stabilized music aptitude over time has been expressed as a function of
relative standing. The effect of training on IMMA scores was an ancillary focus of Gordon’s
(1989c) examination of predictive validity of the Instrument Timbre Preference Test (ITPT) and
IMMA. After one year of instruction, IMMA scores of the experimental and control groups were
compared to their pre-instruction IMMA scores and a lack of acute difference was found. In his
longitudinal predictive validity study of MAP, Gordon (2001c) examined the correlation
coefficients of MAP scores of successive years to determine students’ relative standing and
concluded that, in spite of an extended 3-year period of instrumental instruction, student MAP
scores displayed only typical increases and students maintained their relative standing on the test
in relation to published MAP norms. Similar results emerged from Gordon’s (1970) study of
music aptitude differences in beginning instrumental students: students maintained their relative
standing on MAP after two years of instruction. Many studies have been conducted on music
aptitude (Froseth, 1971; Fullen, 1993; Gordon, 1970, 1999; Gromko & Walters, 1999; Guerrini,
2002; Hornbach & Taggart, 2005; Jaffurs, 2000; Mota, 1997; Phillips et al., 2002; Rutkowski,
1986, 1996, 2015; Rutkowski & Miller, 2003a; Schleuter, 1978), academic achievement
(Gordon, 2001c; Hufstader, 1974; Johnson, 2000; Klinedinst, 1991; McCarthy, 1974; Mitchum,
1969; Moore, 1987), and general intelligence (Gordon, 2006; Norton, 1980) as predictors of
music achievement, as well as studies on music achievement, academic achievement, and
general intelligence as predictors of music aptitude (Carson, 1998; Hobbs, 1985; Johnson, 2000;
Kuhlman, 2005; Simmons, 1981; Webb, 1984; Young, 1971). Further insight into the effect of
chronological age and instruction on music aptitude, as well as the shift between the
developmental and stabilized music aptitude stages, may help establish the rationale for selection
of the most appropriate and efficacious test for a particular group of students.
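
The notion of relative standing discussed earlier in this section can be illustrated with a correlation between scores from successive years. The following is a minimal sketch under stated assumptions: the scores are hypothetical, the helper `pearson_r` is written for illustration, and a Pearson coefficient is assumed here because the source does not specify which coefficient was used in the studies cited.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary for the correlation to be defined")
    return sxy / sqrt(sxx * syy)


# Hypothetical raw scores for the same five students in successive years.
grade4 = [52, 60, 47, 71, 65]
grade5_same_order = [55, 63, 50, 74, 68]  # every student gains 3 points
grade5_reordered = [74, 50, 68, 55, 63]   # same scores, different students

r_stable = pearson_r(grade4, grade5_same_order)
r_shifted = pearson_r(grade4, grade5_reordered)
print(round(r_stable, 3), round(r_shifted, 3))
```

In the first comparison every score rises, yet the coefficient stays at its maximum because each student keeps the same position relative to peers; in the second, the same set of scores yields a lower coefficient because students have changed places. A high year-to-year coefficient is therefore compatible with typical score increases.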
Need and Significance of the Study
Gordon examined previous music aptitude tests by Seashore, Wing, and others, and
concluded they differed in content and intent. Because there was no consensus on the definition
of the construct of music aptitude or the terms used, the results of the tests created to measure
music aptitude yielded differing interpretations. Thus, the need for this study is established by its
unique contribution: an examination of the onset of and transition to the stabilized music aptitude
stage based on a unified approach to music aptitude.
Practical applications of the current study’s findings will inform music educators, music
education researchers, and the field of music education in general. Teachers need detailed and
objective data from which to identify students’ learning needs (Salvador, 2011). Music aptitude
scores can fulfill that need: if the music aptitude measure selected is appropriate for the specific
group of students, scores should represent each student’s current level of tonal and rhythm
audiation and by extension, each student’s musical strengths and weaknesses. Music educators
may then individualize and adapt instruction based on this data to maximize learning of students
of all age levels and in all areas of concentration.
The significance of this study is established through its contribution to the knowledge
base of the nature and measurement of stabilized music aptitude. Not every research question may be
answered definitively; instead, additional questions may be posed. Such an outcome would not exemplify a
failed study, but rather an opportunity to extend current understandings by focusing future
research questions. Regardless of the specific findings, it is hoped music education researchers
and, by extension, the field of music education, may benefit from insight on when and how
music aptitude stabilizes that may be gained from the current study.
Purpose of the Study
The purpose of this study was to investigate the onset of, transition to, and longitudinal
constancy of stabilized music aptitude in upper elementary school students. Several published
music aptitude tests were available for administration to children from preschool age through
college and beyond. These tests had been researched thoroughly and found reliable and valid for
their intended use (Gordon, 1984a, 1986a, 1986b, 1989b, 1990c, 2001c). However, there was
overlap between the grade levels recommended for certain music aptitude tests, which could be
perplexing for music educators in their selection of the appropriate test for their students. IMMA
scores of a stable sample of upper elementary students were explored longitudinally in this study
in order to identify the appropriate music aptitude test for students in the intermediate grades,
based on the stage of music aptitude likely occurring.
Research Questions
The research questions used to guide this study are:
(1) At what grade level does chronological age cease to affect student music aptitude?
(2) At what grade level does instruction cease to affect student music aptitude?
(3) Is there evidence to substantiate the transition between the developmental music
aptitude stage and stabilized music aptitude stage at approximately age 9/Grade 4?
Scope and Delimitations
The scope of this study was restricted to an examination of stabilized music aptitude: its
progression from the developmental music aptitude stage, onset, and longitudinal constancy.
Therefore, a delimitation of this study was the review of relevant research directly related to the
construct and measurement of stabilized music aptitude. Students older than age 9 had been
observed in the stabilized music aptitude stage (Gordon, 1989b, 1995); however, the existence of
or length of transition period from the developmental to stabilized music aptitude stage was
unclear. Therefore, in this study, the music aptitude test scores of students in elementary school,
specifically in Grades 3–5, were examined. Students in these grades had been taught by the
researcher and thus were administered IMMA routinely as a means of tracking music aptitude
levels for individualizing instruction; these students’ past scores comprised a convenience
sample. This sample of students was taken from a district population with very little transiency:
the great majority of students who graduated from this school district were elementary music
students of the researcher (approximately 90%) and all IMMA scores from Grades 3–5 had been
preserved, thus allowing a longitudinal view of Grade 3 IMMA scores over a 13-year period. In
addition, students’ scores over the 3-year period from Grades 3–5 were examined for an
additional longitudinal perspective.
The focus of this study was limited to elementary students in Grades 3–5, as findings of
extant literature had established the onset of stabilized music aptitude to have occurred around
age 9; students of that age are often in Grade 4. Although the researcher was a full-time
elementary general music teacher who administered PMMA and IMMA to all students in Grades
1–5, only IMMA scores of students in Grades 3–5 from the elementary population of a small rural
public elementary school were considered in this study. Scores of students with autism who
received support in a self-contained classroom were included in this study. However, due to
limits of students’ disabilities which may have affected their ability to comprehend the
directions, make timely decisions, and select and mark their answers, music aptitude tests were
administered to these students individually, the interval between test prompts was extended, and
paraprofessionals may have scribed student answers to more accurately document intended
student responses.
The choice to frame this study within Gordon’s construct of music aptitude was an
additional delimitation, made because evidence of IMMA’s use as a valid measure of both
developmental music aptitude and stabilized music aptitude for students in the intermediate
grades had been found in extant literature. It would be inappropriate to compare scores of
different music aptitude tests, even when those tests were similar in design, such as PMMA and
IMMA (Gordon, 1986c, pp. 66–67). Therefore, the use of scores of a single test (IMMA) to
examine the onset of stabilized music aptitude and the transition between stages of music
aptitude was advantageous.
Definition of Terms
Audiation: “the ability to hear and to give meaning to music when the sound is not physically
present or may never have been physically present” (Gordon, 2005, p. 11, emphasis in original);
the potential to achieve in music.
Compensatory growth: as a result of appropriate instruction, students’ musical needs are
mitigated: students receive a higher PMMA or IMMA raw score in the same grade or a higher
percentile rank in the next grade (Gordon, 1986c, p. 68).
Complementary instruction: as a result of appropriate instruction, students’ musical needs are
met: students’ percentile rank remains essentially the same after re-administration of PMMA or
IMMA (Gordon, 1986c, p. 76).
Developmental music aptitude: music aptitude that fluctuates due to influence of
environmental factors. Gordon (1987, p. 9) contended children remain in the developmental
music aptitude stage until approximately age 9.
Music achievement: “a measure of what a student has already learned in music” (Gordon, 1998,
p. 2) and “based primarily in the brain” (Gordon, 2010, p. 211).
Music aptitude: “a measure of a student’s potential to learn music” (Gordon, 1998, p. 5), best
understood through examination of scores from a valid test of music aptitude (Gordon, 1987, p.
2). More than 20 different music aptitudes have been identified through factor analysis (Gordon,
1998, p. 11).
Stabilized music aptitude: music aptitude that is unaffected by musical environment, training,
or practice, realized as maintenance of relative standing on music aptitude tests (Gordon, 1980b,
p. 25); students progress to the stabilized music aptitude stage at approximately age 9. “Musical
expression is indicative of stabilized music aptitude” (Gordon, 1980b, p. 26).
Chapter 2
Literature Review
The purpose of this chapter was to review the literature relevant to the purpose and
questions of this study. An integrative literature review (Cooper, 1989) was conducted to present
and draw conclusions from the various relevant research studies examined (Moustakas, 1994).
As such, the constructs of music achievement, audiation, and music aptitude were described, and
the recent history of music aptitude testing, principal researchers, and features of the music
aptitude measures designed by these researchers were presented.
Music Achievement
Music achievement, a measure of previous music learning (Gordon, 1998, p. 5), was
typically assessed through rating scales of performance skills, fluency in reading and writing of
music notation, and measurement of music theory and music history knowledge, and could be
considered the skill level resulting from the aggregate of music aptitude level and accumulated
music experiences (Taggart, 1989, p. 46). Therefore, music achievement was acquired, and could
be augmented with continued access to a rich musical environment. Gordon (1989b) asserted
students’ music achievement levels would never exceed their stabilized music aptitude levels,
and observed the basis of true music achievement was the ability to generalize or infer (Gordon,
1991). Although Gordon (2006) noted a superior association between music achievement and
academic intelligence, Bixler (1968), Carson (1998), Gordon (1986b, 2006), Kuhlman (2005),
and Swaminathan et al. (2017) concluded there was no significant relationship between academic ability or
general intelligence and music aptitude. Thus, music achievement seemed to be “based primarily
in the brain” (Gordon, 2010, p. 211).
Audiation
Music educators have struggled to agree on an appropriate term to describe how students
comprehend music. Terms with dissimilar definitions such as “aural perception”, “aural
imagery”, and “inner hearing” have been used by researchers and practitioners (e.g., Gromko &
Russell, 2002; Gromko & Walters, 1999; Karma, 1994; Kopiez & Lee, 2006, 2008; Shuter-Dyson,
1999; Wöllner et al., 2003; Young, 1971) as labels, yet none seemed to adequately define
what occurs when students learn music. Edwin Gordon (2015) coined the term “audiation” in
1975 (p. 9) to characterize “the ability to hear and to give meaning to music when the sound is
not physically present or may never have been physically present” (p. 11, emphasis in original).
Karma (1994) concurred, noting sound need not be physically present for musical thinking to
occur. Gordon (1999) further described audiation, noting, “Music is the result of the need to
communicate. Performance is how this communication takes place. Audiation is what is
communicated” (p. 42).
Audiation requires assimilation and comprehension, whereas musical imagery requires
neither. Aural perception does not require comprehension and is a reaction to immediate sound
events (Gordon, 2015). Memory and recognition are components of the audiation process, yet
neither can stand alone; imitation is a product of audiation. In addition, audiation requires musical
context: the abilities to sing a tune, perform a tune in a different key, tonality, or meter, play with
alternate fingerings, play a variation, or move to melodic phrases are indicative of audiation.
A temporal feature of audiation distinguishes it from aural perception and similar
constructs, as assimilation of past or anticipated musical events occurs in audiation (Gordon,
1998, p. 12). Specifically, one audiates what has been heard previously to give meaning to what
is being heard presently and to predict what will be heard in the future (Gordon, 1981).
Audiation occurs in performance, improvisation, and composition in addition to imitation, due to
this temporal component. Karma’s theory of auditory structuring also had a temporal factor and
was discussed more thoroughly later in this chapter.
Geake (1999) theorized audiational abilities were the result of three phases of information
processing: successive synthesis, simultaneous synthesis, and executive synthesis. Geake
conjectured how the task of listening to music might be sequenced:
A sequence may be recognized (simultaneous), and then a sequence of segments
(successive) may be recoded as a musical phrase (simultaneous), and so on. The
importance of this cyclic arrangement for the assessment of individual differences is that
low ability on either of successive or simultaneous synthesis would cause a ‘bottleneck’
for coding to proceed to higher levels. Such an analysis may explain individual
differences in audiation (p. 11).
Geake asserted executive synthesis was manifested as selective attention, an attribute that
affected learning efficiency, and further speculated increasing levels of executive synthesis were
required as one moved through the hierarchy of audiation stages.
Music Aptitude
One’s potential for audiation is known as music aptitude. Expressed another way, music
aptitude is the potential to achieve in music. Thus, “audiation is fundamental to music aptitude
and consequently to music achievement” (Gordon, 2011, p. 9). As Walters (1991) summarized,
the extent to which the ability of a person without instruction can hear, understand, and give
meaning to specific sounds is, according to Gordon, music aptitude. Boyle and Radocy (1987)
noted terms such as audiation, talent, aptitude, musicality, musical intelligence, and music ability
reflected constructs used to differentiate between those who demonstrate differing levels of
performance on musical tasks. There was no apparent consensus on the definition of these terms,
which were often used to reflect functions (e.g., assessment of potential) rather than musical
behaviors and were applied as labels based on informal criteria which appeared to lack a base of
systematic observation. Although Gordon initially used the term “musical aptitude”, he began to
use the collective noun “music aptitude” instead after the 1977 publication of “Revised Learning
Sequence and Patterns in Music”; the term music aptitude was used in the current study as well.
Gordon (2006) described the ability to generalize sound as a key feature of music aptitude, as
exemplified through audiation. Although this ability was also an attribute of general intelligence
(Culp, 2017), it was not bound by previous instruction or experience. Haroutounian (2002) noted
that, to a music psychologist, inherent sound discrimination and perception are at the core of
musical talent and music aptitude is a measure of potential musical talent (p. 19). Mang (2013)
concluded music aptitude was predictive in its function as a measurement of potential to achieve
in music.
A discussion of additional features of music aptitude such as the influence of
environment on innate music aptitude levels and the construct of music aptitude as a unified
entity or a compilation of disparate dimensions follows. In addition, associations of music
aptitude with brain function, academic achievement, and general intelligence were considered.
Supplementary Features of Music Aptitude
Nature versus Nurture
Whether music aptitude was a function of nature or nurture had been the source of
ongoing debate in professional circles (Gordon, 1998, p. 5). Those who promoted music aptitude
as an innate and unchangeable trait (nature) disputed those who viewed music aptitude as
fluctuating and influenced by environment, practice, or training (nurture). Proponents of the
nature theory conceived one is born with music aptitude or not, a “questionable but tacit
assumption” (Gordon, 1998, p. 6). Gordon aptly noted the case for the nature theory implied
students with low music aptitude might not benefit from a music education, whereas music
education for all was supported by the nurture theory.
Atomistic versus Gestalt
Whether music aptitude was composed of discrete abilities (atomistic) or was holistic in
nature (gestalt) was another disputed premise (Grashel, 2008). Measures of atomistic music
aptitude yielded multiple subtest scores, as each dimension of music aptitude was considered in
isolation. Pure tones, isolated pitches, pitch counting, and non-preference features were typically
included in atomistic music aptitude measures (Boyle, 1992). Proponents of the atomistic view, such as
Seashore, tended to be Americans. English and European counterparts, such as James Mursell
and Herbert D. Wing, designed gestalt tests which yielded one composite score representing the
general factor approach to music aptitude (Degé et al., 2017); thus, music aptitude was viewed as
one-dimensional. Gestalt music aptitude measures typically combined tonal and rhythmic
dimensions within the same test, used musical instruments as the sound source and musical
phrases as the content of test items, and might include preference measures (Gordon, 1986a).
Brain Function
Studies of brain function and music have yielded interesting discoveries in relation to
music aptitude. Music training had been found to induce auditory plasticity due to aural skill
development, as indicated by structural and functional differences found in musicians’ brains
(Bugos et al., 2014). In addition, Zentner and Gingras (2019) noted a possible link between
interindividual variation (genotypic level) and musical behaviors (phenotypic level), reinforced
by varied levels of music aptitude in the general population. The extremes of range of music
aptitude were discussed by Zentner and Gingras (2019), who labeled nonmusicians with high
levels of music aptitude and undeveloped skills as musical sleepers and those with ample
musical training and declining skills as sleeping musicians. Moore (1990) described Gordon’s
reasoning that if preschool training could influence the development and level of general
intelligence, as asserted by Montessori (1917), Piaget (1953), and Bruner (1960), perhaps
preschool training could also influence musical intelligence, manifested as music aptitude.
Gordon noted neurologists’ hypothesis of a possible association between myelination of
great cerebral commissures and the development of the brain’s frontal lobes and stabilization of
music aptitude, and related the ability to make judgments, draw conclusions, anticipate coming
events, generalize, and make inferences associated with frontal lobe activity to the musical
predictions necessary for audiation (Gordon, 2006, 2013). Thus, it appeared the patterns of
typical brain maturation and childhood development mimicked the growth of developmental
music aptitude, and brain activity was more reflective of stabilized music aptitude function.
Academic Achievement and General Intelligence
The relationship of music aptitude to academic achievement or IQ was subject to dispute.
Citing the findings of Gordon (1986b) and Sergeant and Thatcher (1974), Kuhlman (2005)
asserted academic ability and music aptitude were not significantly related. Although academic
achievement had been found to successfully predict student success in music, this was most often
the case when student success was defined as reading music notation (Kuhlman, 2005)—an
undertaking perhaps more similar to school-related tasks than to the musical process of
audiation. Numerous researchers (Allen, 1981; Bailey, 1975; Brown, 1969; Klinedinst, 1991;
Mawbey, 1973; Pruitt, 1966) have asserted academic ability could be a positive factor in student
retention (Kuhlman, 2005). However, a firm association between music aptitude and IQ has not
been established. Carson (1998) noted individuals may exhibit high levels of music achievement
despite cognitive and academic deficiencies, and cited Gardner’s (1983) observation that music
aptitude was distinct from other types of intelligence, as evidenced by brain research in which
cerebral lesions have destroyed musical abilities without affecting other forms of intelligence.
Bixler (1968) concluded there was a minimal association between measures of music aptitude and those of
memory and auditory abilities and relative independence of music aptitude from intelligence and
academic achievement. Gordon (2006) concurred, asserting only a 5–10 percent relationship
between music aptitude and intelligence test scores. Swaminathan et al. (2017) reported no
association between music training and intelligence after controlling for music aptitude: it
appeared students with higher levels of intelligence and music aptitude chose to participate in
music training, rather than music training influencing intelligence. Due to limited evidence of a
definitive association among music aptitude, academic achievement, and general intelligence
for elementary students, a more thorough review of the topic was not deemed relevant to the
purpose of the current study.
Stages of Music Aptitude
Gordon (2001a) asserted two stages of music aptitude: the developmental music aptitude
stage, in which constant fluctuation of a child’s music aptitude level occurred due to the
influence of instruction and musical environment, and the stabilized music aptitude stage, which
occurred after age 9 and was defined by its lifetime constancy, regardless of environment. It was
evident no distinction between stages of music aptitude was made by early researchers, as they
implied music aptitude was crystallized or stabilized at birth (Gordon, 1998, p. 18). Gordon
(1981) asserted the stabilized music aptitude stage was preceded by a developmental stage in
which music aptitude fluctuated in accordance with a child’s innate potential and the quality of
the musical environment. Walters (1991) described the defining feature of developmental music
aptitude as the instability of young children’s music aptitude scores, attributed to the varying effect of
musical influence, which resulted in constant adjustment of their relative music aptitude
positions as nature and nurture interacted. The developmental music aptitude stage was noted for
its volatility of test scores: the effect of instruction, training, and experience was manifested as
the instability of scores of students in the developmental music aptitude stage. Nevertheless,
Gordon (1986c) noted children’s ability to respond intuitively and to audiate immediate
impressions was an indication of the level at which their music aptitude would stabilize (p. 9).
Thus, it was posited appropriate formal instruction during the early school years could positively
affect the level of developmental music aptitude, and the younger children were, the more they
would benefit from a quality music environment (Gordon, 2001a). Music instruction received
during the developmental music aptitude stage created a lasting influence, as music aptitude
stabilized around age 9 and would remain the same throughout life (Gordon, 2012, p. 46).
Gordon (1981) found both developmental and stabilized music aptitude test scores were
normally distributed. Nevertheless, the number and content of the dimensions (e.g., tonal,
rhythm, expression) differed. Although measures for seven dimensions of stabilized music
aptitude had been developed, only measures of tonal and rhythm dimensions of developmental
music aptitude were found adequate. Gordon surmised this could be due to an absence of
additional developmental aptitude dimensions or the lack of an appropriate music aptitude test
for very young children. Nevertheless, Gordon (1995) attributed approximately 55 percent of the
reason or reasons for student success in school music to MAP scores, noting the remaining 45
percent was associated with extra-musical factors (p. 8).
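To illustrate the arithmetic behind such a percentage, the following minimal sketch converts a proportion of explained variance to a correlation coefficient. It assumes the 55 percent figure is read as shared variance (the squared correlation between MAP scores and a criterion of success), an interpretation not stated in the passage above; the function name is hypothetical.

```python
import math


def shared_variance_to_r(percent):
    """Convert a percentage of shared variance (r squared x 100) to |r|."""
    return math.sqrt(percent / 100.0)


# The 55 percent figure comes from the text; treating it as r squared
# is an assumption made only for this illustration.
map_percent = 55
print(f"{map_percent}% shared variance -> r = {shared_variance_to_r(map_percent):.2f}")
```

Under that assumption, 55 percent of shared variance corresponds to a correlation of approximately .74.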
The need to establish musical context for accurate measurement was a defining attribute
of stabilized music aptitude testing. Gordon (2005) noted the necessity to audiate content within
a musical context, in order that sound would be perceived as musical. A key criticism of
atomistic measures of music aptitude, and Seashore’s Measures of Musical Talent in particular,
was the lack of musical material as test content, which, critics such as Mursell (1937) contended,
resulted in the measurement of acoustical rather than musical abilities (Haroutounian, 2002, p.
14). Consequently, tonal test items presented within a rhythmic structure and rhythm test items
within a tonal context were promoted by Gordon as offering the greatest validity. Tests of
stabilized music aptitude such as AMMA and MAP included recorded performances of melodic
test prompts, from which test takers must answer questions about either the tonal or rhythm
aspects, while ignoring the other (Gordon, 2005).
Brief History of Music Aptitude Testing
Gordon (1987) contended an understanding of music aptitude could be gained through an
examination of the content of a valid music aptitude test (p. 2). With such a test, students most
likely to benefit from instruction might be identified, and musical strengths and weaknesses might be
diagnosed in order that instruction could be individualized appropriately (Gordon, 1990c). He
further reasoned young students with minimal background in music instruction who attained high
scores were an indication a test measured music aptitude and not music achievement (Gordon,
2002). South Korean and American teachers in Reynolds and Hyun’s (2004) qualitative study of
teachers’ understanding of music aptitude acknowledged they had attended to non-musical
behaviors such as attitude, participation, compliance, and academic achievement in their
assessment of their students’ music aptitude levels. In his study of diagnostic validity of MAP
with a sample of fourth- and fifth-grade beginning instrumental students, Gordon (1998) noted
teachers were not fully aware of student music aptitude levels, despite instructing the control
groups for two years (p. 107). Correlations of these teachers’ estimates of student music aptitude
and students’ MAP scores were moderate: .29 (tonal imagery and rhythm imagery), .34 (musical
sensitivity), and .43 (composite) (Gordon, 1998, p. 107). Hatfield (1967) reported a similar
finding in his study of MAP’s diagnostic validity with instrumental music majors (Gordon, 1998,
p. 107). On the other hand, Gordon (1970) found instrumental music achievement was greater
when music teachers used their knowledge of student music aptitude scores to diagnose students’
strengths and weaknesses for instructional purposes than when teachers did not have and use
music aptitude scores.
In the process of developing his music aptitude measures, Gordon (2005) studied the
design, content, and findings of all existing music aptitude tests, selected their most efficacious
attributes, and adapted these components for inclusion in his own music aptitude tests. As the
current study focused on the features of Gordon’s music aptitude measures, a description of his
contribution to music aptitude testing preceded descriptions of the early music aptitude
researchers and the music aptitude tests developed by Seashore, Drake, Wing, Kwalwasser and
Dykema, Bentley, Gaston, and Karma.
Gordon (2015) coined the term audiation to “explain how music is given meaning by
persons of all ages” (p. 9). Geake (1999) cited numerous researchers such as Gordon (1989d),
McPherson (1995), and Schleuter (1984) and acknowledged evidence of audiation as
fundamental to musical ability and success in music. Gordon described music aptitude as the
process by which we audiate, and posited two stages of music aptitude: developmental and
stabilized. Gordon concluded music aptitude consists of several aspects that are interrelated but
independent (Degé et al., 2017); thus, multifactorial (atomistic) and general factor (gestalt)
approaches were combined in the aptitude tests he developed. Gordon (2013) maintained a
child’s aptitude level was established at birth; however, he disagreed that music aptitude was an
inherited trait, noting lack of evidence to support the role of heredity in determining music
aptitude (p. 12). To clarify, although music aptitude could be transmitted through the genes, it
was not predictable based on ancestry (Gordon, 1998, p. 7). Unlike Seashore, Wing, and other
researchers who were psychologists with an interest in music, Gordon (1998), a musician with an
interest in the psychology of learning, concluded nature and nurture both contributed to music
aptitude, noting the nurture theory indicated children would become musical as a result of having
musical parents and the nature theory suggested musical parents could only bear musical
children. His logical conclusion was that music aptitude must be a product of both nature and
nurture (p. 7). Moore (1990) cited the work of other music psychologists (Deutsch, 1982;
Gordon, 1971; Lundin, 1967; Shuter, 1968), who concurred with Gordon’s assertion that innate
potential and musical exposure were both components of music aptitude.
Gordon (1981, 2002) noted the ability to identify tonal difference and rhythm difference
was associated with music aptitude, and asserted students who were able to identify difference
typically exhibited higher music aptitude than those who could not: because most students could
recognize sameness, its relationship to music aptitude was negated (Gordon, 2004). The primacy
of sameness also had been noted by researchers in disciplines other than music (Gordon, 1981).
Nevertheless, how children attended to sameness and difference in music became progressively
dissimilar as students aged. Gordon (1981) noted children’s increasing attention to the relation of
sameness to difference as chronological age increased; any consideration of sameness and
difference as discrete entities emphasized difference. The results of factor analyses of four years
of PMMA scores (Gordon, 1981) allowed for the conclusion that changes occurred in audiation
from age 5 to 8: test items were audiated differently as the subjects aged. Gordon (1981)
conjectured music aptitude was multidimensional: more than thirty different music aptitudes were
revealed in the pre-standardization research for the Musical Aptitude Profile and Gordon (1998)
contended more levels of music aptitude could be identified with appropriately-designed tests (p.
11). However, Gordon asserted tonal and rhythm dimensions of music aptitude had the greatest
import on music learning (Geake, 1999).
Gordon distinguished stabilized music aptitude as resistant to music instruction, and
asserted developmental music aptitude was subject to fluctuation due to environmental
influences. Gordon appropriated the terms “acculturation”, “imitation”, and “assimilation” to
describe the types of preparatory audiation in which young students absorbed and engaged the
musical environment with increasing consciousness. In Gordon’s (2012) view, students
continued to expand their audiation abilities throughout the developmental music aptitude phase,
“although the effect of environment on a child’s music aptitude decreases substantially with age”
(Gordon, 1981) and gain scores decrease until approximately age 9 (Gordon, 1986a). Because
developmental music aptitude scores might increase due to the influence of instruction, Gordon’s
tests of developmental music aptitude had been denounced as tests of music achievement
(Gordon, 2005). Gordon (2005) revealed the error in this perception: maintenance of relative
position in score distributions was a feature of the stabilized music aptitude stage, but not of the
developmental music aptitude stage. In fact, the median correlation of scores on the same
subtests administered years apart approached .80 for stabilized music aptitude tests, but only
approximated .30 for developmental music aptitude tests. Gordon (1998) stated definitively that
music aptitude crystallized or stabilized after approximately age 9 (p. 10). Once stabilized, one’s
music aptitude level became resistant to environment, even when that environment was
musically rich (Gordon, 2013, p. 14). Walters (1991) noted Gordon’s extensive contribution to
the field of music aptitude measurement; Carson (1998) concurred, touting the superior technical
documentation of Gordon’s music aptitude tests.
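The contrast between test-retest correlations near .80 and near .30 can be made concrete with simulated data. The sketch below generates two administrations of a hypothetical test whose population correlation is set to each value and then computes the Pearson correlation between them. The sample size, random seed, and standardized score scale are assumptions chosen only for illustration; the simulation does not reproduce Gordon's data.

```python
import random


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def simulate_retest(target_r, n=5000, seed=1):
    """Return two simulated score lists whose population correlation is target_r."""
    rng = random.Random(seed)
    first = [rng.gauss(0, 1) for _ in range(n)]
    noise_weight = (1 - target_r ** 2) ** 0.5
    # The second administration mixes the first score with independent noise.
    second = [target_r * s + noise_weight * rng.gauss(0, 1) for s in first]
    return first, second


# Correlation values taken from the text; everything else is hypothetical.
for stage, r in [("stabilized", 0.80), ("developmental", 0.30)]:
    first, second = simulate_retest(r)
    print(stage, round(pearson_r(first, second), 2))
```

A higher correlation indicates students largely kept their relative positions in the score distribution across administrations, whereas a lower correlation indicates those positions shifted considerably.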
Numerous pre-publication studies were conducted in order to minimize the effect of
music achievement and maximize MAP’s focus on music aptitude (Gordon, 1965): Gordon
(2002) surmised that when students with minimal background in music instruction attained high
music aptitude scores, the music aptitude test truly measured music aptitude rather than music
achievement. A low correlation between music aptitude scores and factors such as intelligence
scores was considered evidence the test measured what it was designed to measure (Gordon,
1998, p. 88). In addition, a well-designed music aptitude test should report high reliability of
each subtest, comparatively low intercorrelations with other subtests, and a high correlation with
the battery’s total score (Gordon, 1998, p. 89). Scores from music aptitude tests have been
significantly correlated with intonation (Gordon, 1970) and performance achievement (Brokaw,
1983; Froseth, 1971; Gordon, 1984a, 1986b, 1989d, 1990c, 2001c; Pereira et al., 2017;
Schleuter, 1978). Although Guerrini (2004) concluded singing achievement was affected by
tonal music aptitude, many others (Atterbury & Silcox, 1993; Hornbach & Taggart, 2005; Mota,
1997; Phillips et al., 2002; Rutkowski, 1986, 1996; Rutkowski & Miller, 2003a), as cited by
Hornbach and Taggart (2005), concluded a weak relationship between singing voice use and
developmental tonal aptitude (Rutkowski, 1996). Additionally, the relationship between music
aptitude and composition was not supported in the research (Henry, 2002). Thus, the general
influence of music aptitude as a predictor of student success in music had not been thoroughly
substantiated, despite Gordon’s (2005) conclusion that certain types of music aptitude (e.g.,
meter or tempo aptitude) were more robust than others for predicting success in school music.
Gordon (1987) asserted an understanding of music aptitude could be gained through
examination of the content of a valid music aptitude test (p. 2). Culp (2017) concurred, noting
music aptitude was not reliably observed, but could be reliably measured. Two purposes of
music aptitude testing were identification of students most likely to benefit from music
instruction and identification of musical strengths and weaknesses for appropriate
individualization of instruction (Gordon, 1990c). Scores from music aptitude tests were intended
to provide teachers and parents with objective data, for use in individualizing instruction and
guiding children to achieve their musical potential (Gordon, 2006). Gordon concluded a weak
association of developmental and stabilized music aptitude from factor analytic results of his
1986a study, and the results of his 2002 examination of PMMA and non-preference MAP
subtests supported the need for different types of music aptitude tests to measure developmental
and stabilized music aptitude. To that end, Gordon (2006) developed measures of two
developmental music aptitudes (tonal and rhythm), two stabilized tonal aptitudes (melody and
harmony), two stabilized rhythm aptitudes (meter and tempo), and three stabilized preference
aptitudes (phrasing, balance, and style). In developing these music aptitude tests, Gordon (2005)
thoroughly examined existing music aptitude tests, to be described in the following section of
this chapter. He sought to determine the subjective and objective validities of the better-known
existing tests, in order to use new knowledge and techniques to develop different types of new
tests, from which the most feasible would be compared with existing tests. Gordon (1986c) noted
the cyclical nature of music aptitude testing and instruction based on test results—particularly for
those students in the developmental music aptitude stage, whose level of music aptitude
fluctuated based on interaction with the musical environment—and asserted a comparison of
scores of two administrations of PMMA or IMMA a semester or year apart would suggest the
need for changes in instruction (p. 9). Therefore, Gordon encouraged the use of music aptitude
test results to adapt instruction to suit each student’s individual learning needs, in addition to
their typical use for identification of students with high music aptitude for recruitment into music
programs (Gordon, 1995, p. 9).
Carl Seashore, a psychologist with an interest in music (Gordon, 2005), believed music
aptitude was inborn, inherited, and could be predicted accurately (Gordon, 1981). Seashore
developed the first standardized recorded battery of aptitude tests designed for students age 9 and
older (Gordon, 1986a) to identify and educate the musically talented (Gordon, 1981), and based
its design on atomistic principles. Thus, the Seashore Measures of Musical Talents (1919)
consisted of subtests intended to measure discrete aptitudes; no composite score was calculated.
Seashore’s belief that aptitude was multidimensional was exhibited in the multiple subtest scores
yielded in his 1919 test battery: each score represented a unique music aptitude.
The content of the Seashore Measures of Musical Talents included isolated tones
produced without musical instruments (e.g., tuning forks and beat-frequency oscillators) and
without musical characteristics; syntactical relations of pitches were avoided, as Seashore
believed such a relationship would more likely measure music achievement than music aptitude
(Gordon, 1998, p. 24). In addition, Seashore attempted to include series of pitches that would be
culture-free (Gordon, 1998, p. 24). Seashore believed training or practice should not affect test
scores (Gordon, 1986a), as he was of the opinion that nature was the source of music aptitude.
Seashore claimed reliability coefficients in the high nineties if his aptitude test were administered
under ideal laboratory conditions (Stevens, 1987, p. 19) and rejected attempts to use external
criteria to validate his tests, as such criteria would fall outside of the atomistic construct he
favored (Stevens, 1987, p. 22).
The Drake Musical Aptitude Tests were published in 1954 by psychologist Raleigh M.
Drake. Norms were provided for students as young as age seven (Gordon, 1998, p. 50). Unlike
Seashore’s test battery, melodic phrases performed on piano were used as the test stimuli.
Respondents were tasked with listening to a test phrase and indicating the nature of its
successive rendition. Changes to the test phrase could be related to pitch, rhythm, or key. It was
of interest that Drake provided two forms of each test, differing in difficulty, as well as norms
for non-music and music students, defined as those having five or more years of musical
training (Gordon, 1998, p. 42). Gordon (1998) noted the likelihood that Drake considered music
achievement to be a factor of music aptitude, since test takers must have had some formal music
instruction in order to be familiar with the concepts of modulation, notes, and time required for
success on these tests (p. 41). In addition, Gordon observed Drake’s inclusion of tonal and
rhythm responses in the same test might be sufficient for older students, but was less
appropriate for young children in the developmental music aptitude stage.
Psychologist Wing developed the Standardized Tests of Musical Intelligence (1939,
revised in 1946) to identify musically intelligent children entering secondary school (Wing,
1962) so they might take advantage of instrumental training. Wing’s test battery exemplified the
gestalt theory of music aptitude, in which a general or omnibus factor of musical ability was
sought; tonal and rhythm dimensions were included within the same test, test prompts were
produced with musical instruments in musical contexts, and a composite score was yielded
(Gordon, 1986a). Wing favored nurture as the source of music aptitude, and attempted to
measure musical understanding (Haroutounian, 2002, p. 14); thus, Wing’s test battery was
designed to measure musical acuity and sensitivity (Wing, 1962). The listener was asked to
decide if rhythm, accent, intensity, or phrasing alterations to familiar melodies played on the
piano were the same or different, and if different, which presentation was preferred (Gordon,
1998, p. 44). Wing believed music preference was a component of music aptitude (Gordon,
1998, p. 45); thus, four preference subtests were included in the battery. Strong criterion validity
with teachers’ ratings (.64–.90) and solid reliability (.91 for whole test) were reported, and norms
for students aged 8 and older were included (Haroutounian, 2002, p. 292). Gordon (2005)
critiqued the premise of Wing’s test battery, noting knowledge of number of pitches or similarity
of chords without musical context was not relevant to the practice of music. Nevertheless,
Gordon (1998) considered Wing’s music aptitude tests superior to Seashore’s, as Wing
recognized Stage 3 audiation, in which objective or subjective tonality and meter were
established, was integral to stabilized music aptitude (p. 49).
Scores on the Kwalwasser-Dykema Music Tests, published in 1930, were intended to
inform teachers of student ability in order that instruction could be individualized (Stevens,
1987, pp. 13–14). This test battery, authored by two music education professors, was based on
the atomistic view. Students as young as first grade could be administered the 10 subtests, which
included recorded performances of orchestral instruments (Stevens, 1987, p. 14). Combined
norms were provided for students 8 years old through professional musicians aged 40 (Gordon,
1998, p. 35). Gordon (1998) noted Kwalwasser seemed to concur with Seashore on stabilization
of music aptitude at age eight or nine or believed younger children incapable of understanding
music aptitude test directions (p. 35). Test prompts included a melody played twice; the listener
was tasked with deciding if the two renditions were the same or exhibited a change in pitch or
rhythm. Two preference subtests measured “musical feeling and appreciation” (Stevens, 1987, p.
14): Melodic Taste, in which listeners indicated which of two endings was best, and Tonal
Movement, which measured the ability to judge the tendency of the final tone to proceed to a
point of rest (Gordon, 1998, p. 34). Because the test battery contained sections that measured
music achievement, content validity was limited (Stevens, 1987, p. 29). Reliability and
intercorrelations among subtests were not reported by Kwalwasser (Gordon, 1998, p. 37).
Although Bentley was a proponent of the gestalt theory of music aptitude, his Measures
of Musical Abilities (1966) seemed to combine gestalt and atomistic principles without an
obvious rationale (Gordon, 1998, p. 50). Bentley’s test battery consisted of four subtests, and
was the first intended to measure music aptitude in children as young as age 7 (Haroutounian,
2002, pp. 14–15), with the goal of examining only those abilities that were basic and essential to
the performance of music (Young, 1973). Students were tasked with identifying the relationship
of two pitches as same, higher, or lower (pitch discrimination subtest), counting the number of
pitches (tonal memory subtest), determining the number of pitches in each chord (chord analysis
subtest), and identifying the relationship of two rhythm patterns as same or noting the number of
the altered beat if changed (Rhythmic Memory subtest). Bentley seemed to believe a similar
description of music aptitude was appropriate for younger and older children and only attempted
to clarify Wing’s test directions (Gordon, 1998, p. 50). He reported significant correlations
between teachers’ estimates of musical ability and student test scores and moderate-to-high
reliability ratings of .84 for the total battery (Haroutounian, 2002, p. 293), as well as criterion-related validity findings of college-level and professional performances, despite the test’s focus
on very young children (Gordon, 1998, p. 50), but neglected to provide percentile rank norms or
reliability or validity coefficients for the subtests.
Gaston’s 1957 Test of Musicality was intended for administration to students aged 10 and
up to measure musical ability and interest (Stevens, 1987, p. 17). This brief battery included
three subtests, and reported split-half reliability coefficients of .88 for upper elementary and
middle school students and .84–.90 for high school students (Stevens, 1987, p. 39). A significant
relationship between teachers’ ratings and scores on the Test of Musicality also was reported
(Haroutounian, 2002, p. 292).
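For readers unfamiliar with split-half reliability, the following minimal sketch shows one common way such a coefficient is computed: an odd-even split of the items followed by the Spearman-Brown correction. Whether the coefficients reported for the Test of Musicality used this particular split or correction is not stated in the sources cited above, and the response data and function names are hypothetical.

```python
def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_brown(half_test_r):
    """Step up a half-test correlation to an estimate of full-length reliability."""
    return 2 * half_test_r / (1 + half_test_r)


def split_half_reliability(item_scores):
    """item_scores holds one list of item scores (1 = correct) per student."""
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd_totals, even_totals))


# Hypothetical responses of six students to eight items.
responses = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 0, 1],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0],
]
print(round(split_half_reliability(responses), 2))
```

The correction is applied because each half contains only half as many items as the full test, so the uncorrected correlation between halves would tend to underestimate the reliability of the complete battery.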
In contrast to Gordon’s definition of music aptitude as the realization of audiation, Karma
(1994), a Finnish psychologist, defined music aptitude as “the ability to hear or perceive sound
structures” (p. 20) whose distinct characteristic was temporal; Karma (2007) claimed music
aptitude was the ability to listen to music “musically” and efficiently. Karma’s (1994) music
aptitude definition was purposefully culture-free and separate from emotions or personality
traits; his construct of music aptitude delineated auditory structuring as separate from sensory
capacities, and he claimed a sufficient ability to hear differences in pitch, length, intensity, and
timbre was a necessary condition for perception of pattern structuring, but did not belong to the
construct of music aptitude (Karma, 2007). The temporal feature of sound patterns formed the
basis of Karma’s music aptitude measure and he considered the temporal aspect an essential
property of music cognition.
Karma’s measure of music aptitude was designed to avoid the effects of culture and
training, even while presenting test items within a musical context. A sample test item might
require the student to listen three times without pause to a brief pattern of sounds and form an
image of one iteration of the pattern. A fourth pattern was then played, after which the listener
must determine if the final pattern was the same as or different from the image (Karma, 1994).
Boyle (1992) noted Karma’s contention that structuring strategies were superior to traditional
testing for assessing music aptitude (p. 250) and assertion that discrimination tasks in
measurement of music aptitude and test validation based on correlations with achievement tests
were inappropriate (p. 249). Carson (1998) noted Karma’s conceptualization of music aptitude as
an indicator of broad and fundamental cognitive abilities was not musical per se. Karma
conceded, concluding from the results of his 1994 replication study that successive synthesis, a
non-musical process based on the time-order of stimuli, formed the fundamental basis of
audiation (Geake, 1999). The results of Karma’s 1990 auditory structuring test were compared to
results of a parallel visual version administered to students with significant hearing loss (Karma,
1994), from which Karma concluded the presence of sound was not a necessary condition for
music perception. This was an important departure from the more conventional
conceptualizations of music aptitude of previous researchers as described by Boyle (1992).
Gordon asserted sound would not need to be physically present in order for audiation to occur.
Karma (1994) concurred, but asserted the defining characteristic of music was not sound, but the
thought processes triggered by sound. Karma claimed the findings of his 1994 study supported
the assumption that his auditory structuring test for subjects with no hearing impairment and the
visual parallel test for subjects with congenital hearing loss measured the same process of
temporal structuring, which was theorized as a measurement of music aptitude. This rationale
offered a construct of music aptitude markedly different from Gordon’s description of audiation.
Indeed, Karma (1984) seemed to imply the ability to conceive musical structure preceded what
Gordon would define as audiation.
Recent Music Measures
The conversation surrounding music aptitude, music ability, and music perception has
continued within the fields of music education and music psychology. Although some
researchers seem to distinguish between music ability, music skills, and music competence and
have developed measures accordingly (Law & Zentner, 2012; Müllensiefen et al., 2014;
Wallentin et al., 2010; Wolf et al., 2018), advocates of Gordon’s work would label all as tests of
music achievement. A lack of consensus on what should be measured (real-world skills or
technical knowledge) and who should be administered measures (amateurs or professionals,
musicians or nonmusicians) persists. Other researchers endorse the use of culturally responsive
performance-based assessment to mitigate racial inequity (Hood, 1998). Although other tests of
music perception have been published more recently, their target population was adults who
would have achieved the stabilized music aptitude stage previously. The Musical Ear Test (2010)
is a test of musical competence, designed to measure musical abilities of professional, amateur,
and non-musicians (Wallentin et al., 2010). As such, it is a test of music achievement. In the
Goldsmiths Musical Sophistication Index (Gold-MSI; Müllensiefen et al., 2014), the term
“musical sophistication” was used to label a psychometric construct intended to include musical
skills, expertise, achievements, and behaviors in college-age students or adults, without
assumptions made of innate, inherent, or acquired attainment. Thus, Gold-MSI is also a test of
music achievement. Another music achievement test, the Profile of Music Perception Skills
(PROMS; Law & Zentner, 2012), measured musical perception to determine musical
competence of college-age students or adults. Advanced amateur and professional musicians
were also the focus population of the Musical Ear Training Assessment (META; Wolf & Kopiez,
2018), which was designed to measure ear training skill, a form of music achievement. Because
the focus of these music achievement tests differed from the music aptitude focus of the current
study, an in-depth review of these tests was deemed extraneous.
Critique of Previous Music Aptitude Measures
Gordon’s criticism of previous music aptitude tests focused on content and validity. He
found, as had Mursell (1937), that Seashore’s Measures of Musical Talent (1919), in its isolation
of musical elements, was void of the musical context needed to describe music abilities (Gordon,
1969; Haroutounian, 2002). Gordon (2005) believed Seashore’s subtest measured auditory acuity
rather than a music trait. Haroutounian (2002) described Gordon’s objections as the laboratory nature of the sound production, extreme attempts to isolate each attribute of music aptitude, and
lack of musical context in the presentation of test items, which resulted in a test experience
devoid of musical functioning (p. 14). Gordon (1989b) noted that a pitch discrimination test,
such as he considered Seashore’s test battery, could have negative validity only and thus could
predict only whether students could not profit from instruction. Unlike the Seashore battery, the
MAP battery excluded subtests relying on discrimination (pitch or time) or memory (tonal or
rhythm) and contained preference subtests to measure musical sensitivity (Gordon, 1969). In his
1969 study of intercorrelations between MAP and the Seashore Measures of Musical Talents,
Gordon noted a weak relationship between corresponding subtests of the two batteries and
concluded the batteries assessed different abilities. Gordon also had reservations about certain
features of other music aptitude tests: he asserted the music aptitude tests of Wing and Bentley
were, fundamentally, measures of music achievement (Gordon, 2002), and disputed the musical
relevance of Wing’s (1939) subtests (Gordon, 2005). Geissel (1985) observed Wing’s use of the
same music content for younger and older students, which might have affected the test reliability
negatively when used with younger students (p. 2); these issues were noted for Bentley’s
Measures of Musical Abilities as well (Geissel, 1985; Gordon, 2002).
Boyle (1992) noted differences in the musical tasks and content measured in various
music aptitude tests (p. 249). Discrimination, memory, and recall required different processes
and were unequal tasks, yet had been measured as if equivalent in many aptitude tests. Music
ability, music aptitude, and music achievement had also been examined and compared
inaccurately as parallel constructs. Music sensitivity or preference as a construct seemed to be
generally accepted (Boyle, 1992, p. 251); however, there was little agreement on how it should
be measured (Bugos et al., 2014). In addition, the appropriateness of the musical tasks and
content for predicting musical potential was in dispute (Boyle, 1992, p. 249).
Karma dissented from the more commonly accepted definition of music aptitude as well
as from the type of validation typically sought and codified in Gordon’s work. Karma (2007) was
critical of the samples of music students in previous music aptitude testing which were
unrepresentative of the general population, and noted poor predictive or ecological validity of
music aptitude measures to predict real-world musical skills or behaviors. He asserted a music
aptitude measure with predictive validity was a composite of several constructs and therefore
unsuitable for testing music aptitude. Karma (1982) opposed the practice of establishing music
aptitude test validity through correlation of music aptitude test scores with an individual’s true
music aptitude level, noting the true nature of music aptitude was unknown and correlations were
often contaminated with variables that measured non-musical factors. Instead, Karma (2007)
preferred the use of construct validity, and proposed reporting of validity as measures of
relationship or difference rather than statistical significance (Karma, 1982). Gordon (1986c)
reported content validity as correlations of PMMA and IMMA with MAP and the Iowa Tests of
Music Literacy (ITML) and concurrent (criterion-related) validity as correlations with teachers’
ratings of student performance, in addition to longitudinal predictive validity (pp. 97–109).
Karma (1984) reported moderately high correlation coefficients (between .60 and .76) of
auditory structuring test scores and teachers’ estimates of student aptitude. He asserted high
correlations of instrumental teachers’ estimates of student music aptitude with results of music
achievement tests and self-reports were reasonable, but not with intelligence and amount of
musical training. Test validity could be claimed if correlations were predictive of test results
(Karma, 1982). However, Gordon (2012) adjudged music aptitude measures did not have
reasonable correlations with teachers’ estimates of student music aptitude (p. 51) and, in fact,
empirical evidence pointed to a weak relationship between teachers’ estimates of music aptitude
and student music achievement (Reynolds & Hyun, 2004; Stamou et al., 2010; Taggart, 1989).
Karma’s theory was far different from the generally-accepted construct of music aptitude based
on audiation, and his music aptitude measure was unconventional as well.
Another detractor of Gordon’s work, Australian John G. Geake (1996), examined
music-specific information processing involving perception, memorization, abstraction, and directed
attention planning. Although Geake (1996) concurred with Gordon that abilities are developed as
a result of appropriate experience, Geake (1999) attributed audiational abilities to information
processing, as manifested in neuropsychologist Alexander Luria’s (1966, 1970, 1973) cyclic
model of simultaneous, successive, and executive synthesis, in which musical perception closely
parallels text-reading (Geake, 1996). Geake concluded information was abstracted as it was
encoded for memory and planning within brain functioning. This frame of reference did not
account for an innate and individual baseline of music aptitude upon which early music
experience was imprinted, however, and assumed all cognitive thought was processed similarly.
Proponents of Gordon’s work might argue audiation was a uniquely musical process.
The relationship between information processing and music perception abilities in
samples of typical and “musically gifted” students, who were nominated by teachers for
demonstrating extreme musicality or were selected into advanced music training programs at an
early age, was investigated in Geake’s 1999 study. Geake (1999) speculated the “mozart” (p. 33)
students would perform better on the MAP than their non-gifted peers. From the findings of a
principal component analysis, Geake (1999) identified three components:
Component 1 reflected successive synthesis, suitable for musical tasks requiring informational
encoding such as sequences of rhythm patterns; Component 2 reflected executive synthesis,
defined as “the formation of intentions and programs for behaviour” (p. 11) and manifested in
the generation of expectancies (Geake, 1996); and Component 3, which showed high loadings
for simultaneous synthesis, marked by the ability to focus on “the complete auditory image
produced by the set of components as a whole” (p. 32). Geake (1999) concluded MAP’s
measurement of audiation was dependent on general attentional and successive information
processing abilities. The musically gifted students demonstrated significantly higher information
processing abilities than their non-gifted peers; Geake (1996) contended their ability to
concentrate on music learning and performance tasks could be explained by executive synthesis,
in which attention was employed for metacognition. In other words, Geake attributed the
musically gifted students’ advanced performance abilities to their superior executive synthesis
abilities. Nevertheless, the superior ability of the mozart subjects to concentrate on music
learning and performance tasks also could be explained by superior ability to audiate (higher
music aptitude). As the basis of their selection could hardly be viewed as objective, it was likely
Gordon would have questioned the qualifications of the mozart students as musically gifted,
thereby negating the premise of this study.
Geake (1996) went on to assert that short-term memory played a pivotal role in the
processing of musical information, serving as an underpinning ability. He challenged Gordon’s
disavowal of the importance of short-term memory in MAP performance, arguing that individual
differences in short-term memory might partially predict individual differences in MAP scores.
Geake (1999) noted the musically gifted sample demonstrated significantly better short-term
memories than their non-gifted peers, and reported strong correlations for scores on each MAP
test (tonal, rhythm, and musical sensitivity) with non-music specific abilities such as successive
synthesis, thus suggesting MAP might not provide the complete music aptitude assessment
intended. Nevertheless, the equivalency of short-term memory and other information processing
abilities and audiational abilities had not been established empirically. Gordon (1998) asserted
rapid and successive exposure to many different musical phrases would force students to
memorize, rather than audiate, and consequently, subtests would be related to music achievement
rather than music aptitude. A key feature of Gordon’s music aptitude measures was the strictly
limited amount of time in which the test-taker must select a “same”/“like”, “different”, or “in
doubt” response; Geake’s studies lacked sufficient detail describing the speed at which students
were tasked with processing information. The ability of readers to compare Geake’s theory of
music aptitude with Gordon’s music aptitude construct was inhibited by this dearth of
information. In addition, Geake (1996) noted teachers commonly related high music ability to
other intellectual abilities in mathematics and language, and deemed superior ability of
musicians labeled “gifted” or “talented” was due to ability to process musical information at a
high level. As previously discussed, Gordon disputed the ability of teachers to identify musical
giftedness impartially without the aid of a valid and objective music aptitude measure.
Furthermore, it had not been established that an “ability to process musical information” was
equivalent to the process of audiation.
Geake (1999) also disputed Gordon’s consideration of musical sensitivity as a dimension
of music aptitude, noting inclusion of musical sensitivity as a factor of music aptitude was
inconsistent with Boyle’s taxonomy and his cited evidence. Nevertheless, Geake commended
Gordon for reporting musical sensitivity scores independently of the composite MAP score,
which resulted in a genuine profile of music aptitudes.
Critique of Gordon’s Music Aptitude Measures
The original participant sample on which IMMA norms were based included a relatively
small group of Grade 1–4 students, many of whom had taken and scored high on PMMA. The
students attended schools in 11 school systems in Pennsylvania and New York, of which one was
described as a private academy and another an “inner city school” (Gordon, 1986c, p. 85). It was
likely the sample was more homogeneous than heterogeneous, given the limited geographic area
represented and the previous association of the school systems with Gordon, who had
administered IMMA to most students in the sample (p. 85). Selection of the norms sample served
two needs of IMMA use: to document the statistical properties of the test, and to provide an
objective comparison as an alternative to local norms (p. 85).
Lack of a standardized music curriculum among, and differences in instruction between,
the school systems represented in the norms sample was typical at the national level (Gordon,
1986c, p. 86). Gordon had not defined or identified the need for informal guidance at the time
IMMA was created (M. Runfola, personal communication, March 26, 2021). Thus, it was
presumed all students in the norms sample were taught using traditional instruction.
Due to the lack of diversity in the original norms samples and the type of instruction
offered to those students, it seemed findings of Gordon’s music aptitude measures were
generalized to samples not reflective of or relevant to current student samples in many areas of
the United States. Since the sample of the current study was racially homogeneous, examination
of the sample’s make-up was outside the parameters of the current study. However, creation of
local norms is recommended for a more accurate comparison of findings of diverse populations.
Gordon (1986c) suggested the development of local norms was an outgrowth of frequent test
administrations and might be superior for comparing relative standing. Holahan and Thomson
(1981) concurred, proposing construction of local norms as standard practice for all tests.
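The construction of local norms described above can be illustrated with a minimal sketch. The scores below are hypothetical, and the mid-rank convention for tied scores (counting half of the ties as falling below the score) is an assumption chosen for illustration, not the procedure documented in Gordon's test manuals.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(score, local_scores):
    """Percent of local scores below `score`, counting half of any ties."""
    ordered = sorted(local_scores)
    below = bisect_left(ordered, score)          # scores strictly lower
    ties = bisect_right(ordered, score) - below  # scores equal to `score`
    return 100.0 * (below + 0.5 * ties) / len(ordered)


# Hypothetical composite raw scores for one school's Grade 4 students.
local_scores = [52, 55, 58, 60, 60, 63, 65, 67, 70, 74]

# Relative standing of a raw score of 63 within this local group.
local_rank = percentile_rank(63, local_scores)
```

A percentile rank computed this way describes a student's standing relative to local peers only; it would not be expected to match the rank the same raw score receives in published norms tables.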
Around the time of Gordon’s research, a change was occurring in political perspective
from a perception of America as a “melting pot” of minority groups, responding to societal
pressure to assimilate into the dominant culture, toward a view of America as a “salad bowl”.
Thus multiculturalism, defined by the Stanford Encyclopedia of Philosophy as “an ideal in which
members of minority groups can maintain their distinctive collective identities and practices”
(Multiculturalism, 2016, para. 1), became the goal and practice of many Western societies. For
example, Canada officially adopted a policy of multiculturalism in 1970 (Hess, 2015).
Nevertheless, multiculturalism emphasized commonality while downplaying difference,
thereby circumventing crucial discussion of the inequity of power inherent in systemic
racism. Bradley (2007) noted the use of terms such as “culture”, “ethnicity”, and
“nationality” and euphemisms such as “poverty problem”, “welfare”, “urban schools”, and
“diversity” was a means of pointing out difference while avoiding direct discussion of race, and
attributed this to the emphasis on commonality promoted by multiculturalism. Yosso (2005)
concurred, noting “cultural difference” as an additional example of race coding. Thus, Gordon’s
use of the terms “culturally disadvantaged” and “inner-city”, while problematic, was indicative
of the time period in which his research was conducted. Despite this, future researchers are
obligated to look critically at how music aptitude is conceived, defined, and measured. If the
construct of music aptitude is truly universal, studies to confirm its accurate and equitable
measurement in students from all backgrounds and educational settings must be conducted.
Gerhardstein (2001), in his biography of Gordon, noted the impact of the University of
Iowa on Gordon’s work. Gordon served as a student, professor, and researcher at University of
Iowa until becoming the Director of Music Education at the State University of New York at
Buffalo in 1972 (p. 22); his tenure at University of Iowa, “a national center for educational
testing and measurement” (p. iv), was influential in his development of MAP. Gordon seemed to
concur with the procedures used to construct norms for the Iowa Test of Basic Skills (ITBS), a
well-established and widely used standardized measure of that time period, as evidenced by his
adoption of the pupil profile chart introduced in the ITBS scoring procedure, “in itself a plotting
device [italics in original], necessitating no recourse to tables of norms or overlay masks of any
kind” (Peterson, 1983, p. 32). In addition, Gordon provided grade-equivalent scales and
percentile norms similar to those published in ITBS in his music aptitude measures (Peterson,
1983, p. 136). The precedent of calculating separate norms for groups which exhibited different
levels of performance was established by ITBS (Peterson, 1983, p. 230) and duplicated in
Gordon’s (1995) aggregation of MAP scores for musically select students (p. 49). It was likely
Gordon’s terminology in describing the racial backgrounds of the participants in his post-MAP
research samples also was reflective of the influence of ITBS. State and national standards were
reported by ITBS, in order for school administrators to have the option to choose the set of
standards they believed most representative of their student population (Peterson, 1983, p. 231).
An ITBS test author noted factors such as population features, educational resources, and
cultural traditions to be considered in the creation of norms (Peterson, 1983, p. 230).
The fundamental differences in definition and process of the construct of music aptitude,
such as auditory structuring (Karma), information processing (Geake), and audiation (Gordon),
prohibited direct comparison of music aptitude measures. It remained the task of music
researchers and teachers to review extant literature, compare music aptitude constructs and data
supporting each construct, and draw their own conclusions when selecting an appropriate music
aptitude test for a particular group of students. As the current study focused on measures of
stabilized music aptitude based on Gordon’s construct, descriptions of those measures are
presented below in order of initial publication date, and the features Gordon selected and adapted
from previous music aptitude measures for his own test batteries are highlighted.
Music Aptitude Measures Developed by Gordon
The original 1965 publication of Gordon’s Musical Aptitude Profile (MAP) followed 8
years of extensive research in music aptitude, included more than 5,800 student participants in
seven pre-standardization studies (Runfola, 2016, pp. 360–361), and resulted in publication of a
comprehensive test of stabilized music aptitude for use with students in Grades 4–12. MAP
exhibited standards of reliability and validity on par with those reported for academic and
diagnostic achievement tests (Gordon, 2001c), the highest of any music aptitude test (Walters,
1991). Walters noted the import of Gordon’s (1967b) three-year longitudinal study in
establishing the predictive validity of MAP and isolating the constructs of aptitude and
achievement; the findings yielded impressive predictive validity coefficients (.61 in Year 1, .72
in Year 2, and .77 in Year 3) which served to promote the use of MAP as a predictor of
instrumental music achievement (Geissel, 1985, p. 8). Although a strong correlation of music
aptitude test scores and teacher ratings and judges’ evaluations of students’ performances was
observed, it should not be inferred that music aptitude scores are the result or outcome of music
achievement (Gordon, 2001c). Aptitude tests are marginally measures of achievement, although
the effect of achievement should be minimized as much as possible. Gordon recommended
administering the aptitude test prior to instruction, providing a prolonged training program,
evaluating performance after training, and comparing aptitude and performance scores to
establish predictive validity, as a longitudinal examination would provide clear evidence of the
test’s effectiveness in predicting music achievement.
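The longitudinal procedure Gordon recommended (test before instruction, train, evaluate performance, then compare) reduces, in its simplest form, to correlating the two sets of scores. The following sketch assumes a Pearson product-moment correlation and uses invented scores and ratings; neither the data nor the variable names come from Gordon's studies.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: aptitude scores collected before instruction began,
# and judges' performance ratings for the same students after training.
pre_training_aptitude = [41, 48, 52, 55, 60, 63, 67, 72]
post_training_rating = [2.0, 3.0, 2.5, 3.5, 3.0, 4.0, 4.5, 4.0]

# The predictive validity coefficient is the correlation between the two.
validity_coefficient = pearson_r(pre_training_aptitude, post_training_rating)
```

Because the aptitude scores are collected before any instruction, a high coefficient in such a design would be read as aptitude predicting later achievement rather than as achievement inflating the aptitude score.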
MAP, described as an eclectic battery due to its use of preference and non-preference
subtests, drew from both atomistic and gestalt theories (Gordon, 1986a). The four non-preference
MAP subtests reflected views of those in the nature camp and the three preference subtests
favored proponents of nurture (Walters, 1991). Gordon (1987) contended musical context was
necessary for the most accurate measurement of stabilized music aptitude; therefore, MAP test
content was presented through the use of pairs of original short melodic phrases in a musical
context (Gordon, 1995, p. 13), which students labeled as “like”, “different”, or “in doubt”.
Although raw scores might increase in subsequent years, relative standing remained constant
(Gordon, 2001c). Walters (1991) asserted the stability of relative standing across all grade levels
contributed to MAP’s power and validity as a measure of music aptitude.
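The distinction between rising raw scores and constant relative standing can be made concrete with a rank correlation. The sketch below assumes Spearman's rho as the index of rank stability and uses hypothetical scores with no ties; it is an illustration of the idea, not a reproduction of Gordon's analyses.

```python
def ranks(scores):
    """Rank scores from 1 (lowest) upward; assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(first, second):
    """Spearman rank correlation for two untied score lists."""
    n = len(first)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(first), ranks(second)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical raw scores for the same five students in consecutive years.
# Every raw score rises, but the ordering of students is unchanged.
year_one = [38, 45, 51, 57, 62]
year_two = [44, 49, 58, 61, 70]

stability = spearman_rho(year_one, year_two)
```

In this invented example every student gains points, yet the rank correlation indicates no change in relative standing, which is the pattern the text associates with stabilized music aptitude.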
However, Zentner and Gingras (2019) claimed the high reported validity coefficient of
MAP was tempered by the association of the researchers (Gordon or his students) and the means
of publication (a publishing house whose standards may not have been stringent). Nevertheless,
two unique features of MAP contributed to its high level of validity. Listeners had the option to
select “same”, “different”, or “?” in response to each MAP test prompt. The “in-doubt” option
was intended to prevent guessing (Gordon, 1995, p. 87), which increased the validity of the
subtest (Gordon, 1998, p. 102). In addition, test items of various difficulty were scattered
randomly throughout each subtest in order to maintain interest and deter frustration of students
with low or average music aptitude levels (Gordon, 1998, p. 102), which offered an additional
boost to validity.
Norms based on a sample size of 12,809 students from 20 school systems in 18 states
(Gordon, 1995, p. 69) were reported for composite scores of each music aptitude test (tonal,
rhythm, musical sensitivity), as well as for each subtest of MAP. Gordon (1998) noted percentile
rank norms and score distributions were not markedly different for students of different
geographical regions, school settings, or cultures (p. 11). MAP had been shown to be a valid
measure for minoritized students (Gordon, 1980b), students in Finland (Sell, 1976), Germany
(Schoenoff, 1973; Schoonover, 1974), South Korea (Reynolds & Hyun, 1994), and Taiwan
(Chuang, 1997), students with special needs (Curtis, 1981), and students with intellectual
giftedness (Drennan, 1984). However, its efficacy in measurement of stabilized music aptitude in
college and university students had not been substantiated (Gordon, 1990c). MAP, with its seven
subtests, had been used in numerous research studies for its ability to diagnose strengths and
weaknesses of individual students (Geissel, 1985, p. 4). Gordon reported split-half reliability
coefficients of .90–.96 (composite) and criterion validity of .73 (composite) when MAP scores
were compared with scores of a music achievement test (Haroutounian, 2002, p. 293).
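For readers unfamiliar with split-half reliability, a minimal sketch follows. It assumes an odd-even split of items and the Spearman-Brown correction to full test length, which are common conventions but are not confirmed here as the method used for MAP; the item responses are invented.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def split_half_reliability(item_matrix):
    """Odd-even split-half reliability with the Spearman-Brown correction.

    `item_matrix` holds one row per student and one 0/1 entry per item.
    """
    odd_totals = [sum(row[0::2]) for row in item_matrix]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_matrix]  # items 2, 4, 6, ...
    r_half = pearson_r(odd_totals, even_totals)
    # Step the half-test correlation up to the full test length.
    return 2 * r_half / (1 + r_half)


# Hypothetical responses (1 = correct) of five students to a six-item subtest.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
]

reliability = split_half_reliability(responses)
```

The correction is applied because each half contains only half the items, and a shorter test would be expected to yield a lower correlation than the full-length test.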
Although MAP was not unique in its inclusion of preference subtests, Stevens (1987) noted it
was the sole American test battery to include preference tests (p. 18): MAP’s inclusion of
musical sensitivity tests was relatively rare. Gordon (1987) found it unnecessary to group
musical sensitivity test questions in pairs: musical phrases in preference subtests did not require
concentrated attention, as students needed only an overall impression of the two test items in
order to decide which was preferred (p. 65). Timbre and dynamics were not included as separate
subtests; rather, aspects of timbre and dynamics were included in two preference subtests as an
intentional design feature to contribute to “a Gestalt of music phrasing and style” (Gordon,
1981). Gordon expressed confidence that formal music instruction contributed sparingly to
student responses on preference subtests, as the majority of students had no additional formal
training other than general music class. Yet student responses aligned with those made by
professional musicians, which lent support to Gordon’s (1998) contention that professional
musicians’ decisions were associated more closely with factors of music aptitude than music
achievement (p. 100). Gordon (1986a) concluded musical preference was a factor of stabilized
music aptitude, and asserted the three preference subtests contributed significantly to MAP’s
longitudinal predictive validity coefficient of .75 (Gordon, 1998, p. 61).
Due to the extensive reporting of validity associated with Gordon’s tests, Law and
Zentner (2012) selected MAP and AMMA for use in their examination of congruent validity of
the researcher-designed Profile of Music-Perception Skills (PROMS). However, the ability of MAP
and AMMA to measure isolated audiational skills was questioned. Law and Zentner reported
correlations of AMMA melody scores with other test components, and noted the emphasis on
tonal memory required by the MAP tempo test, as it required the listener to retain the ending of
the melodies in order to compare and judge them as same or different.
Gordon’s Ottumwa study, conducted during the development of MAP (Gordon, 1995, p.
33), yielded an ostensibly reliable measure of aptitude in students in kindergarten and first grade,
but only through extraordinary measures (Walters, 1991), such as selected inclusion of children
with exemplary achievement, one-on-one administration by a parent, and markedly increased
administration time. The development of the Primary Measures of Music Audiation (PMMA;
1979b) was the result of the continued investigation of inconsistent reliability coefficients
yielded in studies of MAP administered to early elementary-aged children (DeYarman, 1972,
1975; Harrington, 1969; Schleuter & DeYarman, 1977). From these findings, Gordon inferred a
need for the creation of a more appropriate measure for use with children younger than 9 years.
Gordon (1979a) benefited from the experimentation of other researchers, who modified
the length and format of MAP’s answer sheets and recorded directions (DeYarman, 1972, 1975;
Harrington, 1969; Schleuter & DeYarman, 1977). Norms for students in Grades 4–12 were
included in the original 1965 publication of MAP; Harrington (1969) adapted three of the seven
MAP subtests, including one preference subtest (phrasing), by simplifying and re-recording test
directions and color-coding answer sheets for use with a younger age group. Harrington
calculated reliability coefficients for MAP scores of students in Grades 2 and 3, and concluded
subtest scores demonstrated satisfactory reliability and test results appeared to be more closely
related to musical ability than scholastic ability. Therefore, he concluded the primary version of
MAP functioned adequately in measuring music aptitude of young students. However, Gordon
attributed the low composite reliability coefficient in Harrington’s study to the expected
fluctuations of music aptitude in young children and thereby drew a different conclusion: the
primary MAP was not measuring stabilized music aptitude in second- and third-grade students—
it was revealing developmental music aptitude (DeYarman, 1975).
Using adapted test directions, different answer sheets, and the same music included in
Harrington’s 1969 study, DeYarman (1972) conducted an investigation of the stability of music
aptitude of kindergarten and first-grade children using his own version of the primary MAP.
DeYarman reported higher MAP composite reliability coefficients for his early primary sample
than those reported for Harrington’s sample of second and third grade students. DeYarman
(1975) reasoned reliability coefficients of MAP were as high for students in kindergarten–first
grade as Gordon had reported for students in Grades 4–12; thus, music aptitude must stabilize as
early as age 5 or 6. Using this rationale, DeYarman conducted a subsequent study in 1975 to
further investigate music aptitude stability in primary-age children. The study’s research
questions addressed stabilization of music aptitude and the effects of formal instruction on music
aptitude levels before Grade 4. However, the sample did not include primary-aged students;
rather, approximately 3,000 fourth-grade students constituted the sample. Zimmerman (1986)
noted difficulty determining how MAP results of Grade 4 students could lead to a conclusion of
music aptitude stabilization of students aged 5 or 6. DeYarman’s conclusion must be
considered controversial and specious, given that his sample consisted solely of students in
fourth grade. Nevertheless, DeYarman’s (1975) research design in which correlations of
music aptitude test scores from a sample of fourth grade students were used to draw conclusions
about the onset of stabilized music aptitude was most efficacious and served as the template for
future studies: to note when music aptitude had stabilized rather than to attempt to observe when
developmental music aptitude had ceased was reasonable and prudent, given the potential
complexity of tracking the expected fluctuation of the relative standing of students in the
developmental music aptitude stage.
A follow-up study by Schleuter and DeYarman (1977) replicated and extended
DeYarman’s 1975 study. The MAP scores of fourth-grade students were compared to scores of
DeYarman’s 1975 sample. Schleuter and DeYarman (1977) concluded from their findings that
formal music instruction in the primary grades had little effect on student music aptitude levels;
thus, music aptitude must stabilize in or before kindergarten. The discrepancies between the
findings of DeYarman’s and Harrington’s studies caused Gordon to reconsider his earlier
agreement with DeYarman’s assertion that music aptitude stabilized in the primary grades
(DeYarman, 1975).
Gordon’s theory of developmental music aptitude resulted from his extensive research
involving approximately 10,000 primary students and his understanding of language
development and music perception. Neurological information supported the construct of
developmental music aptitude, as did research findings that indicated the effect of environmental
factors on music aptitude levels (Gordon, 2005), and substantiated developmental and stabilized
music aptitude as separate constructs through factor analysis of MAP, IMMA, and PMMA
scores (Gordon, 1986a). Gordon contended audiation of keyality and tempo were functions of
developmental music aptitude. Thus, discrimination of PMMA and IMMA test item pairs
emphasized keyality and tempo, rather than tonality and meter as in MAP (Gordon, 1986c, p.
100), further establishing IMMA as a measure of developmental music aptitude.
It was difference that contributed heavily to the validity of music aptitude tests: Gordon
(2002), in his examination of the item content of PMMA and IMMA, found total score reliability
was about as high for the 20 patterns with “different” as the correct answer as when all 40 test items
were included. Gordon (1981) concluded children in the developmental music aptitude stage
were more aware of how music was constructed than of its expressive qualities and were able to
recognize extremes of timbre and dynamics only. Therefore, the content of the three MAP
preference subtests was unusable for measuring the aptitude of young children (Walters, 1991).
Consequently, PMMA contained only non-preference subtests.
The design of PMMA was unique. Students in the developmental music aptitude stage
had difficulty distinguishing tonal features from rhythm features if the two were conflated within
melodic patterns. Therefore, the test items of developmental music aptitude measures contained
tonal patterns without rhythm and rhythm patterns without pitch (Gordon, 2004). Within the
separate tonal subtest and rhythm subtest, students were tasked with listening to a pair of tonal
patterns or rhythm patterns to determine if the pair was “same” or “different”; research findings
since the publication of PMMA and IMMA indicated an increase in test reliability when “not
same” was used in place of “different” (Gordon, 2002). Answers were marked by circling the box
containing two identical faces or that containing two dissimilar faces on the answer sheet (see
Figure 1); therefore, reading, writing, speaking English, performing, and understanding music
theory were unnecessary for accurate administration of PMMA.
Because young children were inclined to have their attention diverted from the music
itself toward the source of the sound, test content for students in the developmental music
aptitude stage was performed on electronic instruments, in contrast to the acoustical instruments
used in stabilized music aptitude tests (Gordon, 2004). The test content was unfamiliar and
presented without musical context, and the interval of time between test items was insufficient
for students to memorize or fully recall each pair of patterns. Thus, Gordon mitigated the effects
of prior learning, resulting in a test of music aptitude rather than music achievement (Walters,
1991). PMMA was normed for students in kindergarten through third grade, approximately ages
5 through 8 (Reese & Shouldice, 2019, p. 480): separate subtest norms as well as composite
PMMA norms were reported, reflecting the atomistic and gestalt viewpoints.
Figure 1
PMMA/IMMA Test Answer Sheet Design (Gordon, 1986c)
Moore (1990) summarized subsequent research findings (Flohr, 1981; Moore, 1987;
Norton, 1980) that supported the theory of developmental music aptitude as a stage defined by
fluctuation and influenced by instruction. Moore noted the comprehensive body of research
conducted by Gordon (1979b, 1980b, 1981, 1986a) to establish PMMA as a measure of music
aptitude. Gordon’s 1982 study of longitudinal predictive validity of PMMA yielded a validity
coefficient of .73 when pre-instructional PMMA scores and judges’ performance ratings were
correlated (Geissel, 1985, p. 12). Flohr (1981) found a significant effect of short-term music
instruction on PMMA scores (Reese & Shouldice, 2019, p. 478); Gordon (1980b) concluded
specialized music instruction focused on tonal and rhythm patterns resulted in high PMMA
scores. Walters (1991) noted evidence from subsequent studies indicated music aptitude was
unstable during the primary years, PMMA could be used to predict achievement, and
developmental aptitude was sensitive to instruction, especially for students with low music
aptitude. In addition, Bell (1981) found PMMA to be a valid measure for use with children with
developmental disabilities.
The Intermediate Measures of Music Audiation (IMMA; Gordon, 1982) was developed
for use with groups in which at least half of the students scored above the 80th percentile on one
or both PMMA subtests (Gordon, 1986c, p. 120), as in Phillips and Aitchison’s (1997) study in
which PMMA was administered to a sample of third-grade students because Gordon’s criterion
had not been met. Thus, IMMA, like PMMA, was a test of developmental music aptitude. However,
scores from PMMA and IMMA should not be compared because the difficulty level of test
content was not equivalent (Gordon, 1986c, pp. 66–67). Gordon (2005) had observed PMMA
was not complex enough for young students who had received superior music instruction: the
normal distribution of their scores skewed to the left and reliability decreased. Gordon (2005)
designed IMMA, a more advanced test battery, with those students in mind. Although originally
normed for students in Grades 1–4 (Geissel, 1985, pp. 2–3), the age range for IMMA was later
expanded to include students in Grades 5 and 6 (Gordon, 1986c, pp. 64–65). The format of
IMMA was identical to that of PMMA: students were tasked with determining sameness or
difference of two tonal patterns or two rhythm patterns, and tonal subtest, rhythm subtest, and
composite scores were yielded.
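Gordon's criterion for moving a group from PMMA to IMMA, as summarized above (at least half of the students scoring above the 80th percentile on one or both PMMA subtests), can be expressed as a simple decision rule. The function name, parameter names, and percentile ranks below are hypothetical and serve only to illustrate the rule.

```python
def imma_recommended(tonal_percentiles, rhythm_percentiles, cutoff=80, share=0.5):
    """Return True when the group meets the criterion for using IMMA.

    Each list holds one PMMA percentile rank per student, in the same order.
    A student counts as high scoring when either subtest rank exceeds `cutoff`.
    """
    if not tonal_percentiles or len(tonal_percentiles) != len(rhythm_percentiles):
        raise ValueError("Both subtests need one percentile rank per student.")
    high_scorers = sum(
        1
        for tonal, rhythm in zip(tonal_percentiles, rhythm_percentiles)
        if tonal > cutoff or rhythm > cutoff
    )
    return high_scorers / len(tonal_percentiles) >= share


# Hypothetical PMMA percentile ranks for a class of six students.
class_tonal = [85, 60, 90, 40, 70, 82]
class_rhythm = [50, 88, 75, 30, 65, 79]

use_imma = imma_recommended(class_tonal, class_rhythm)
```

Under this reading, a group that falls short of the criterion would continue with PMMA, as in the third-grade sample of Phillips and Aitchison (1997).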
Gordon (1976) developed a taxonomy of tonal patterns and rhythm patterns based on
their audiation difficulty level (easy, medium difficult, and difficult). The difficult tonal patterns
and rhythm patterns were the sole test content for the Intermediate Measures of Music Audiation
(Gordon, 1982); IMMA was therefore a more discriminating measure of high aptitude for
children aged 6 to 9 years than Gordon’s previous test of developmental music aptitude for
primary students, PMMA (Gordon, 1984a; Haroutounian, 2002; Moore, 1990; Walters, 1991).
Gordon (2006) noted the choice to use PMMA or IMMA at a particular grade level was
dependent on desired difficulty of test content, as both were tests of developmental music
aptitude. IMMA had been used in numerous studies with students under age 9 for increased
discernment (Culp, 2017; Gromko & Russell, 2002; Guilbault, 2004; Kratus, 1994; Rutkowski,
2015; Rutkowski & Miller, 2003b; Saunders & Holahan, 1993), and had been found to have
strong predictive validity for instrumental and vocal achievement of fourth grade students
(Geissel, 1985, p. 11). Gordon (1986c) noted a .90 longitudinal correlation coefficient for IMMA
and MAP composite scores, from which he concluded IMMA assessed in the developmental
music aptitude stage what MAP assessed in the stabilized music aptitude stage (p. 17).
A curious feature of IMMA was its apparent capability of simultaneously functioning as
a measure of developmental music aptitude and of stabilized music aptitude in students age 9
(Grade 4). Gordon (1989c), finding a .89 correlation between pre- and post-instruction IMMA
scores, hypothesized that the immutable nature of IMMA scores for fourth- and fifth-grade
students suggested IMMA functioned as a test of stabilized music aptitude for students aged 9
and older. Gordon (1984a) conducted a longitudinal predictive validity study in which judges’
evaluation scores of fourth-grade boys’ violin and recorder performances were correlated with
pre- and post-training IMMA scores. He concluded the student sample was in the stabilized
music aptitude stage, based on the high correlations between students’ pre-training and
post-training IMMA scores and achievement ratings, and pre-training IMMA scores with
post-training IMMA scores, as only low to moderate correlations had been observed in previous
studies investigating the relationship of PMMA scores from one semester to another. Gordon
(1984a) observed that, despite test content designed to measure developmental music aptitude,
IMMA scores could be assumed to reasonably function as stabilized music aptitude scores for
students age 9 and older. Nevertheless, Gordon acknowledged the skewed distribution of scores
in his longitudinal predictive validity study of IMMA, due to the small number (N = 33) and
homogeneity of subjects, which seemingly lowered the reliabilities compared to those published
in the IMMA manual. A profusion of high scores on the post-training IMMA were attributed to
the subjects’ approaching fifth grade, which belied the theory that stabilized music aptitude was
reached by age 9 and resistant to the effects of training, instruction, or chronological age.
Reese and Shouldice (2019) asserted IMMA was suitable as a test of stabilized music
aptitude for children as old as 11 years when use of MAP was not possible (p. 480). Gordon
(1984a) promoted the screening properties of IMMA as a means of identifying students to whom
the more comprehensive Musical Aptitude Profile should then be administered. Although
Gordon (1993) established IMMA as a test of developmental music aptitude and identified
students in Grades 5 and 6 as definitively in the stabilized music aptitude stage, he noted use of
IMMA was suitable if insufficient time precluded administration of a stabilized music aptitude
test (p. 245). Geissel (1985) claimed it was common practice to administer IMMA to
fourth-grade students, in addition to MAP: IMMA scores were used to identify students with high music
aptitude, who were then encouraged to participate in specialized music instruction, whereas
MAP scores were used to diagnose strengths and weaknesses to be addressed during instruction
(pp. 3–4). Walters (1991) concurred, acknowledging the suitability of IMMA as a predictor of
music achievement for fourth graders and its shortcomings as a diagnostic tool when compared
to MAP. In fact, when MAP’s purpose was identification rather than diagnosis, Gordon (1968)
suggested one or more of MAP’s subtests (balance or meter) need not be administered.
Nevertheless, these omissions were not recommended, as the savings in time would be minimal
and not worthy of the more cursory diagnosis.
Definitively determining when students had reached the stabilized music aptitude stage
could be an elusive prospect, however. Numerous researchers (DeYarman, 1975; Harrington,
1969; Schleuter & DeYarman, 1977; Stevens, 1987) had attempted to design studies to establish
when students’ music aptitude levels ceased to fluctuate, thereby indicating a level of stabilized
music aptitude. However, low reliability of measures when administered to younger students
confounded researchers: Wing published norms with questionable reliability for children aged 8
and younger, and Seashore asserted measures were not reliable until age 10 (Walters, 1991).
Moore (1990) summarized extant research of Deutsch (1982), Gordon (1971), and Mark (1986),
noting music aptitude may cease to develop beyond age 9 or 10. Others asserted music aptitude
stabilized much sooner. DeYarman (1975) concluded music aptitude stabilized by age 6 or
sooner, contradicting previous findings of fluctuating music aptitude in primary-grade students.
Although Gordon had specified the stabilized music aptitude stage begins at
approximately age 9, IMMA had been used repeatedly as a measure of developmental music
aptitude for students in Grades 4 and 5. In his 1998 factor analysis of fourth- and fifth-grade
students’ Harmonic Improvisation Readiness Record (HIRR), IMMA, and MAP scores and
improvisation performances, Gordon (1998) noted in the fourth-grade analysis, IMMA loaded on
factor II, and HIRR and improvisation on factor I. However, in the fifth-grade analysis, the
opposite occurred: MAP loaded on factor I, and HIRR and improvisation ability loaded on factor
II. IMMA was thus selected and functioned as a measure of developmental music aptitude for
students in Grade 4, presumably age 9, thus blurring the delineation of age of onset of stabilized
music aptitude (p. 72). The results of Gordon’s (1984a) longitudinal predictive validity study of
IMMA, described previously in this chapter, provided another example of the obscured age of
onset of stabilized music aptitude. Similarly, Levinowitz and Scheetz (1998) claimed
developmental music aptitude became less unstable as children neared age 9. As the predictive
validity of the Instrument Timbre Preference Test (ITPT) and MAP, a test of stabilized music
aptitude, had been investigated previously (Gordon, 1986b), Gordon (1989c) also investigated
the predictive validity of ITPT and IMMA. Thus, IMMA was administered specifically as a test
of developmental music aptitude. Students were administered IMMA in fourth grade, and again
in fifth grade, when the students were able to elect to study a musical instrument, some according
to their ITPT results. Belczyk (1992) conducted a similar study investigating predictive validity
of ITPT and IMMA. Although his sample of 805 fourth-grade students was administered IMMA,
it was not indicated whether IMMA was intended to function as a test of developmental or
stabilized aptitude. Perhaps the use of IMMA as a test of developmental aptitude with students
aged 9 and higher was a reflection of Gordon’s (1989c) perception that “developmental and
stabilized music aptitudes are more a matter of attributes of the mind than of the properties of a
test” (p. 12). From examination of regression analysis results, Stevens noted each age level’s
scores were significantly and progressively higher than the scores of the preceding age level until
the age of 9, after which significant increase discontinued (Stevens, 1987, pp. 115–116). From
these gain scores, Stevens concluded the stabilized music aptitude stage began near the age of 9:
a similar conclusion to that of Gordon, but arrived at through different means.
Culp (2017) observed that researchers, influenced by previous research findings, differed
in their approach to the construct and measurement of music aptitude, which in turn affected
their inferences. Stevens (1987) was influenced by DeYarman’s (1975) conclusion that music
aptitude could stabilize as early as age 6; Gordon (1980a, 1986c, 2012) theorized music aptitude
stabilized around age 9, based on the research of Stanton and Koerth (1933), Wing (1968), and
his own 1967 study (Culp, 2017). Although Stevens and Gordon both defined music aptitude as
the potential to draw generalizations in music and viewed music aptitude testing as a valuable
tool to collect data for individualization of instruction, Stevens (1987) viewed the construct of
music aptitude as an ability to aurally perceive music (p. 45). This contrasted somewhat with
Gordon’s more comprehensive view of music aptitude as an extension of audiation, as evidenced
in his empirical research findings. Adhering to the prevailing premise that cessation of change to
relative standing of music aptitude scores was the most effectual means to determine the onset of
stabilized music aptitude, Stevens (1987) sought to establish the construct validity of her
researcher-designed music aptitude test by analyzing the composite test scores of children at
each age level (pp. 115–116).
Gordon (1984a) reported a range of predictive validity coefficients of .55 to .70 for
IMMA, and asserted composite IMMA scores predicted first and second semester instrumental
and vocal achievement at only a slightly lower rate than MAP; he found a substantial
relationship between achievement in instrumental performance and in vocal performance.
IMMA’s predictive validity coefficients were unusually high when compared to its concurrent
validity coefficients. In addition, Gordon (1989c) estimated a predictive validity coefficient of
.80 for IMMA when combined with the Instrument Timbre Preference Test after two years of
instrumental instruction.
Gordon (1986a) conducted a factor analysis of MAP, PMMA, and IMMA (N = 110
fourth-grade students) and reported the following results: Factor I (unrotated analysis) and Factor
II (rotated analysis) were identified as stabilized music aptitude factors. Factor II (unrotated
analysis) and Factor I (rotated analysis) were identified as developmental music aptitude factors.
Factor III was identified as a stabilized music aptitude preference factor in both analyses.
Gordon concluded IMMA could serve as a test of developmental and stabilized music aptitude;
therefore, developmental and stabilized music aptitude were different. Nevertheless, the
congruent validity of IMMA as a test of stabilized music aptitude was in question. Carroll (1978)
summed up the premise of correlation, noting if two measures were significantly associated, they
might be regarded as “measuring the same thing” (p. 89). Gordon (1986c) concurred: if
correlation affirmed both tests were valid for their intended purpose, the tests exhibited
congruent validity (p. 109). Gordon (1986c) reported acceptably high correlation coefficients for
PMMA and IMMA (Grades 1–4, N = 126), IMMA and MAP (Grade 4, N = 92), and PMMA and
MAP (Grade 4, N = 227) (p. 111), and concluded PMMA, IMMA, and MAP measured similar
content and constituent characteristics. Although IMMA was found acceptable as a test of
stabilized music aptitude for students ages 9 and up, the congruent validity of IMMA as a test of
stabilized music aptitude with MAP was cursory and inconclusive. IMMA was not a newly
developed test, but instead demonstrated established validity for a different purpose (as a test of
developmental music aptitude). MAP had proven longitudinal validity but not parallel content.
Therefore, congruent validity for IMMA and MAP could not be established through traditional
means. It could be concluded PMMA, IMMA, and MAP exhibited congruent validity only if
“the same factor” they measured was music aptitude in general rather than developmental or
stabilized music aptitude.
Gordon (1989b) originally developed the Advanced Measures of Music Audiation
(AMMA) as a more complex measure of stabilized music aptitude for students in college or
university; subsequently, norms for high school and junior high students were added. Through
use of melodic test prompts within a single non-preference test, tonality, keyality, melody,
implied harmony, rhythm, meter, and tempo could be audiated concurrently. In a stabilized music
aptitude test such as MAP, students heard melodic patterns and discerned sameness or difference
of tonal or rhythm dimensions in separate subtests. However, in an advanced stabilized music
aptitude test such as AMMA, intended for administration to “mature students,” the student heard
melodic patterns and discerned sameness or difference of tonal and rhythm dimensions
simultaneously within a single test (Gordon, 2004). Gordon (1998) candidly observed he was
unable to explain why test design differences were necessary, only that he had arrived at this
conclusion through extensive research of test development, reliability, and validity (p. 112).
Because of its uniquely compact design, AMMA could be administered in approximately 20
minutes (Gordon, 1989b), compared to the estimated 50 minutes required for administration of
each MAP test (tonal, rhythm, and musical sensitivity) (Gordon, 1995, p. 36).
As with all of Gordon’s music aptitude tests, AMMA was designed to incorporate
atomistic and gestalt features. However, unlike MAP, AMMA did not include preference
subtests for measurement of stabilized music aptitude, due to its unique but labor-intensive
system of scoring which allowed tonal and rhythm scores to be calculated separately from the
administration of a single measure (Gordon, 1989b). Gordon published tonal, rhythm, and total
percentile rank norms for students in high school, college music majors, and college non-music
majors; however, Haroutounian (2002) asserted AMMA could be used to measure stabilized
music aptitude for students above Grade 4 (p. 16).
Concurrent validity with MAP was the initial means of validity established for AMMA.
Gordon (1989b) reasoned a moderate correlation coefficient of MAP and AMMA scores would
substantiate the validity of AMMA and promote design of additional test validity studies. An
investigation of the longitudinal predictive validity of AMMA was undertaken by Gordon in
1990. Total AMMA scores and etude performance ratings of college music students were
correlated, yielding a longitudinal predictive validity coefficient of .82 and suggesting AMMA
had a high degree of ability to predict performance achievement in college-age students: Gordon
(1990c) found more than 67 percent (the square of .82) of the reason or reasons for college
students’ success in music performance could be predicted by the total test scores on AMMA.
Gordon (1998) noted a high intercorrelation between AMMA tonal and rhythm test scores, likely
because questions with “same” as the correct answer were included in calculating both tonal and
rhythm scores (p. 114). Although MAP offered a more comprehensive diagnosis, it was not
consistently more valid; therefore, teachers might opt to administer AMMA to measure aptitude
levels of students in Grades 7–12 because of AMMA’s considerably shorter length (Gordon,
1998, p. 112).
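The arithmetic behind the 67 percent figure cited above can be checked directly. The following minimal Python sketch squares the reported coefficient of .82 to obtain the proportion of variance the two measures share; the function name variance_explained is hypothetical and is not drawn from Gordon’s materials.

```python
def variance_explained(r: float) -> float:
    """Return r squared, the proportion of variance shared by two measures."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    return r * r


# Coefficient reported in the text for AMMA total scores and etude ratings.
amma_r = 0.82
shared = variance_explained(amma_r)
print(f"r = {amma_r:.2f}, r squared = {shared:.4f}")
```

Under this reading, the squared coefficient slightly exceeds .67, which is consistent with the "more than 67 percent" wording.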
Audie (Gordon, 1989a), a “game” of music aptitude for children ages 3 and 4, was
designed for individual administration by a parent. Although Audie was a test of developmental
music aptitude, Gordon (2005) asserted the need for preschool children to be presented with
melodic test prompts, even though they were able to address only the tonal or the rhythm aspects
of the test question. Therefore, test prompts in Audie consisted of a single melody, “Audie’s
song”, from which students must discriminate a tonal difference or a rhythm difference in the test
responses of the tonal and rhythm subtests, respectively. A hallmark of Gordon’s music aptitude
tests was the absence of required reading, writing, English language speaking, and knowledge of
music theory, as these skills were superfluous in Gordon’s test design. Children only needed to
comprehend whether a given tonal or rhythm pattern was the same as or different from “Audie’s song”
and indicate their answer to the parent, who marked “yes”, “no”, or “?” accordingly on the
game’s answer sheet. Children were encouraged to play Audie repeatedly and independently.
After the parent had determined full engagement in the game, the child’s answers were marked
on the answer sheet. Either the tonal or rhythm subtest could be administered first. The parent
was encouraged to monitor recurrent results in order to observe growth and adapt instruction for
individual differences (Gordon, 1989a). Gordon found a typical child’s level of concentration
was limited to, at most, ten consecutive questions. Because the test was so brief, it was not useful
to calculate percentile norms. Instead, criterion validity was used to interpret scores.
Features of Stabilized Music Aptitude
Chronological Age Effect
The influence of chronological age on music aptitude received limited attention in extant
literature. Walters (1992) contrasted maturational readiness (chronological age) with experiential
readiness for music learning, which Gordon labeled “musical age”. Runfola and Etopio (2010)
further defined musical age: if developmental age was a measure of children’s physical and
psychological development, musical age was a measure of children’s musical development, in
which they progressed through preparatory audiation and audiation over time and under different
environmental circumstances. Radocy and Boyle (1979) viewed maturation as a reinforcer of
inherited musical potential, rather than an influence on music aptitude. Gordon (2001c) attributed
changes in MAP scores to general maturation, rather than to the effect of formal instrumental
instruction; changes were too small to be of practical significance. Chronological age was also
associated with score increase on developmental music aptitude tests and tests of stabilized
music aptitude (Gordon, 2005); however, maintenance of relative position in score distributions
distinguished the stabilized music aptitude stage from the developmental stage. Gordon (2002)
posited that the operation of different stages of music aptitude associated with chronological age
was suggested by substantial differences in paired PMMA and MAP scores of the same Grade 2 and
Grade 6 student sample. Thus, music aptitude stage was perceived as more reliable than tonal or
rhythm content in differentiating between students of different chronological ages. In his 1989
study of the effect of chronological age on AMMA scores, Gordon found the means and standard
deviations of children’s chorus participants aged 9–12 almost identical to data from college-aged
non-music majors from the AMMA standardization program. Thus, an effect of chronological age
on music aptitude was discounted.
Resistance to Instruction and Maintenance of Relative Standing
The manner in which music aptitude was historically defined would be classified
currently as stabilized music aptitude. Moore (1987) noted some scholars characterized music
aptitude as a series of fixed and unchangeable traits. This constancy, manifested as resistance to
direct instruction, training, or practice, was a hallmark of stabilized music aptitude. Gordon
(1987) was resolute in his characterization of music aptitude as immutable after approximately
age 9 (p. 9). Gordon (1995) based this conclusion on the findings of his 1967 study of predictive
validity of MAP, in which a negligible discrepancy between the mean difference of MAP scores
of band members and those in the “musically select” standardization sample over a 3-year period
was found (p. 97). Fosha (1964) administered MAP to elementary and secondary instrumental
and choral ensemble participants before and after one semester of formal music instruction, and
concluded from the slight differences in pre- and post-instruction scores that MAP scores of
musically select students were resistant to the influence of instruction. Gordon (1989c) compared
the pre- and post-instruction IMMA scores of approximately 170 elementary students in a 2-year
predictive validity study of ITPT and IMMA and found additional support for his assertion that
IMMA scores of students in Grades 4 and 5 were not affected by music instruction. Previous and
subsequent studies yielded similar results, thus indicating resistance to instruction was the
threshold for the stabilized music aptitude stage (DeYarman, 1975; Gordon, 1981; Mang, 2013).
Haroutounian (2002) observed future training can enhance but not extend music aptitude (p. 11).
Relative standing, marked by increases in raw scores and steady percentile ranks
(Gordon, 2001c), was maintained on all MAP subtests throughout the 3-year longitudinal
predictive study of MAP (Gordon, 1981). Scores of subsequent administrations of the same
stabilized music aptitude test yielded high correlations, whereas the correlation for scores on the
same developmental music aptitude test administered in subsequent years was low (Gordon,
2005). In addition, content, or “what is audiated”, was indifferent to instructional procedure
(Gordon, 1981), although considerably more fluctuation of developmental music aptitudes
occurred when instruction was modified to align with PMMA results. Stanton (1935) claimed
stability of music aptitude for participants of a longitudinal predictive validity study of the
Seashore measures, as cited by Gordon (1981). Nevertheless, Gordon (1995) later observed
individualized instruction should result in meaningful improvement in students’ music
achievement, though not an improvement in stabilized music aptitude.
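The notion of relative standing described above (raw scores rising while rank order holds and the correlation between administrations stays high) can be illustrated with a minimal Python sketch. The six pairs of scores below are hypothetical values invented for illustration only and are not data from any study cited here; the helper names pearson_r and ranks are likewise assumptions.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(scores):
    """Rank positions (1 = lowest score); ties are not handled in this sketch."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


# Hypothetical raw scores for the same six students in consecutive years.
year_1 = [22, 25, 28, 31, 35, 38]
year_2 = [25, 27, 31, 33, 38, 40]

raw_gain = sum(year_2) / len(year_2) - sum(year_1) / len(year_1)
same_order = ranks(year_1) == ranks(year_2)
r = pearson_r(year_1, year_2)

print(f"mean raw-score gain: {raw_gain:.2f}")
print(f"rank order preserved: {same_order}")
print(f"year-to-year correlation: {r:.3f}")
```

In this invented example, every student gains raw-score points, yet no student changes position relative to the others, which is the pattern the text associates with the stabilized stage.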
Primacy of Rhythm
Of interest was the prominence of rhythm in stabilized music aptitude referred to
throughout the literature. Gordon (2005) noted the combination of meter and tempo aptitudes
could predict school music success more accurately than melody and harmony together, and
contended rhythm was the basis of music aptitude, as it served as the foundation for musical
style and expressiveness (Gordon, 1998, p. 60). Although Gordon (2005) asserted higher
predictive validity for meter than for tempo, he concluded tempo was the most fundamental of
rhythm aptitudes (Gordon, 1998, p. 104). In addition, the MAP meter subtest was found to be a
more valid measure of rhythm aptitude than one in which phrases of melodic rhythm were
compared (Gordon, 1998, p. 104) and had a high, but unlikely, relation to the musical
sensitivity–balance subtest (Gordon, 1998, p. 55). Gordon (1986a) concluded the IMMA rhythm
subtest may be more suggestive of stabilized music aptitude than of developmental.
Even so, an adjustment to the MAP tempo subtest was made (dynamic accents or clicks,
representing macrobeats, underlaid the test prompts) in order to increase reliability, as was also
necessary in the rhythm subtests of PMMA and IMMA. Rhythm reacted the most robustly on
factors of developmental and stabilized music aptitude in Gordon’s (n.d.) investigation of HIRR,
PMMA, IMMA, MAP, and improvisation ability (Gordon, 1998, p. 173). Therefore, Gordon
(1998) claimed knowledge of chord changes in syntactic time might be more important to the
audiation process than awareness of chord changes (p. 172). This construct aligned with Karma’s
perspective on the temporal aspect of music aptitude, manifested in his focus on auditory
structuring as a measure of music aptitude.
Music Preference
Bugos et al. (2014) noted the perception of expressive characteristics was viewed as the
most important element of musical performance by musicians and educators. Gordon (1980b)
contended musical expression was another key feature of stabilized music aptitude, and asserted
the expressive dimension of stabilized music aptitude joined the tonal and rhythm dimensions to
yield comprehensive music aptitude (Gordon, 1998, p. 60). Music sensitivity or preference as a
construct seemed to be generally accepted (Boyle, 1992; Bugos et al., 2014), yet there was little
agreement on how it should be measured. There was even a lack of consensus on definition and
terminology in assessing musical sensitivity (Boyle, 1982). Boyle accepted Kuhn’s (1979)
definition of preference as “an act of choosing, esteeming, or gaining advantage of one thing
over another through verbal statement, rating scale response, or choice made from among two or
more alternatives” (Boyle, 1982), yet Geake (1999) considered musical sensitivity independent
of music aptitude, and attributed both auditory discrimination and affective response to musical
sensitivity. Colwell instead used the term “stylistic discrimination”, as cited in Boyle (1982).
Boyle (1982) specified perception and response to sensory stimuli was measured in preference
tests. Listeners of musical preference tests were frequently asked to indicate which music
fragment of a pair was preferred; some, like the Indiana–Oregon Music Discrimination Test
(1965), required the listener to identify whether the change was melodic, harmonic, or rhythmic.
Based on the findings of his research into audiation of “same” and “different” in
developmental music aptitude, Gordon (1986c) noted older children attended more to musical
characteristics in test items (p. 103). In contrast, the inclusion of the dimensions of timbre and
dynamics in PMMA, a test of developmental music aptitude, resulted in reliability approaching
zero (Gordon, 1981). Therefore, tests of musical expression were included in MAP but not
PMMA or IMMA (tests of developmental music aptitude), as young children were focused
primarily on the constructive elements of music (Gordon, 1980b) and incapable of reliably
making judgments about their music preferences (Gordon, 2002). Gordon (1989b) theorized the
subjective understanding inherent in music aptitude was manifested as musical sensitivity and
measured by preference tests; Boyle (1982) noted correlations between sensitivity subtests and
performance evaluations and music achievement test scores yielded high validity coefficients.
Interestingly, a paper/pencil version of the Musical Nuance Test created by Bugos et al. (2014)
indicated an increase in musical nuance perception only up to age 10. The inclusion of
expression subtests only minimally increased the longitudinal predictive power of the AMMA
total score (Gordon, 1990c), thus Gordon opted not to include music preference subtests in
AMMA. Music preference subtests were also excluded from the ancillary study undertaken
concurrently with Gordon’s 1989 predictive validity study of IMMA and ITPT, in which MAP
tonal imagery and rhythm imagery subtests were administered to sixth-grade students who had
taken IMMA in Grade 4 (Gordon, 1990c). Since IMMA did not include music preference
subtests corresponding to the MAP musical sensitivity subtests, MAP expression subtests were
disregarded. Thus, little evidence in published research was found to compare the effect of music
preference subtests on stabilized music aptitude measurement.
Nevertheless, reliability coefficients were moderately high for some tests of musical
sensitivity. Although Heller reported low test-retest reliability coefficients of .28 and .50 and a
split-half coefficient of .42 for college students (Boyle, 1982), Wing reported a reliability
coefficient of .84 for 15-year-old boys. Gordon reported reliability coefficients ranging from .84
to .90 for the MAP musical sensitivity test; Boyle’s 1982 investigation of the comparative
validity of Wing’s Standardised Tests of Musical Intelligence, MAP, and the Indiana–Oregon
Music Discrimination Test yielded split-half reliability coefficients of .88 (Grades 10–12) and
.82 (college students) and low correlation coefficients, indicating each test measured something
other than that measured by the other tests. Thus, Gordon (2004) asserted the high predictive
validity of music preference subtests established their importance as components of stabilized
music aptitude tests.
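For readers unfamiliar with the split-half coefficients reported above, the following minimal Python sketch shows one common way such a coefficient may be computed: items are divided into odd and even halves, the half scores are correlated, and the Spearman-Brown formula estimates full-length reliability. The sources cited here do not state whether their coefficients were corrected in this way, so that step is an assumption, and the item responses are hypothetical values invented for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_brown(r_half):
    """Estimate full-test reliability from a half-test correlation."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_scores):
    """item_scores holds one list of 0/1 item results per examinee."""
    odd_totals = [sum(row[0::2]) for row in item_scores]
    even_totals = [sum(row[1::2]) for row in item_scores]
    r_half = pearson_r(odd_totals, even_totals)
    return r_half, spearman_brown(r_half)


# Hypothetical responses: six examinees, eight items (1 = correct).
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
]

r_half, r_full = split_half_reliability(responses)
print(f"half-test correlation: {r_half:.3f}")
print(f"Spearman-Brown estimate: {r_full:.3f}")
```

An odd-even split is only one of several possible ways to halve a test; a different split of the same items would generally yield a somewhat different coefficient.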
Ensemble Participation Effect
Likewise, little research was found on the effect of ensemble participation on music
aptitude. Gordon (1998) found ensemble participants as a group scored higher on MAP than
nonparticipants (pp. 79–80), and reported higher composite MAP scores for ensemble students
(Gordon, 1987, p. 83). Nevertheless, lack of ensemble participation was not a limiting factor for
high music aptitude scores, nor did ensemble participation guarantee high music aptitude scores,
as evidenced by a comparison of percentile norms reported in the MAP manual (Gordon, 1998,
p. 80). Gordon (1998) found a minimal increase in MAP scores for students studying a musical
instrument when compared to students without that training (p. 106), and reported more
ensemble participants scored higher on MAP than did nonparticipants (Gordon, 1995, p. 31).
Musically select students demonstrated small gains in pre- to post-test MAP scores in Fosha’s
(1964) study of students in Grades 4–12 (p. 87).
Regrettably, the discrepancies in MAP scores between participants and non-participants
in school music ensembles might be attributed to the selectivity of music performance groups.
Gordon (1995) studied MAP scores of students in performance groups and non-participants to
determine whether ensemble participants were likely to have higher music aptitudes (p. 90).
Although non-participants scored somewhat lower, many high-scoring students were not
engaged in school music activities (p. 91). The value of MAP scores (and by extension, IMMA
scores) in identifying students with high music aptitude for potential enrollment in school music
ensembles should not be underestimated (p. 91).
Scores of all Grade 4–12 students in the MAP standardization program were used to
determine the need for separate norms for chorus, band, and orchestra rather than one set of
norms for all musically select students (Gordon, 1995, p. 90). Gordon found similarities in score
distributions for chorus, band, and orchestra members at each grade level and seemed to
conclude separate norms by instrument family were unnecessary (p. 91); however, no mention
was made of how the score distribution of non-participant scores compared to those of chorus,
band, and orchestra. In addition, the use of statistical testing to compare means of different
ensemble types was not referenced. Therefore, it is unknown if mean scores of one ensemble
type were significantly different from another. Scores of elementary students were considered
separately from those of middle and high school students in drawing conclusions about the need
for separate norms for musically select students, but not for determining the effect of ensemble
type on music aptitude.
Gordon (1970) noted a significant relationship between MAP scores and instrumental
music success, and reported students in performance groups scored higher on MAP than
nonparticipants (1995, p. 91). Subjects in his 1970 investigation of music aptitude differences in
Grade 4 and 5 beginning instrumental students were matched for music aptitude level, sex,
grade, and type of instrument; however, “years of participation” was not included as a variable.
Extant literature on the effect of ensemble participation on music aptitude was limited.
The elementary students in Gordon’s (1995) MAP study were in Grades 4–6. It is unknown to
what extent scores of students in each grade contributed to the conclusions drawn regarding the
need for separate norms. Consequently, it is not possible to aggregate the 1995 findings for
students in Grades 4 and 5 only, as constitutes the participant sample in the current study. It
appears the grade level at which a student participated in a performance group had not been
previously examined for its effect on music aptitude.
Transition to the Stabilized Music Aptitude Stage
The process of transition from developmental to stabilized music aptitude appeared to be
well-defined. Gordon (1981) noted in tonal audiation, one gradually attended to pitch center first,
then key, and finally mode. In rhythm audiation, the progression moved from paired beats of
equal length to melodic rhythm, and finally to meter. In terms of measurement, Gordon (1989b)
designed IMMA for students transitioning from the developmental to stabilized music aptitude
stage (ages 6 through 9) or those aged 10 and 11 who had already attained the stabilized music
aptitude stage. Gordon (1986a) contended the effect of musical environment, measured as gain
scores, decreased substantially with age until approximately age 9. Thus, it was conceivable
decreasing gain scores and diminishing influence of instruction were two indicators of a
transition period between the stages of developmental and stabilized music aptitude. An
additional indicator was the shift of student focus from keyality to tonality and from tempo to
meter, as Gordon (2002) asserted the higher a student’s tonal or rhythm music aptitude level, the
sooner the student’s audiation might begin to include tonality and meter.
Because of the discrepancy concerning the age at which music aptitude stabilized,
Schleuter and DeYarman (1977) concluded insufficient evidence was available to support
Gordon’s assertion that music aptitude stabilized at age 9. Gordon concurred with Seashore’s
theory that music aptitude is developmental in the early years, and asserted from findings of
extensive research involving a large sample of primary-aged children, studies of neurological
and language development, and music perception research that music aptitude was
developmental until approximately age 9 (Walters, 1991). Gordon (2005) further supported this
conclusion, noting developmental music aptitude stabilized at about age 9, which was
approximately the same age at which physical changes in brain development of the frontal lobe
occurred. Mang (2013), Phillips et al. (2002), and Stevens (1987) concurred. Moore (1990),
summarizing the work of Deutsch (1982), Gordon (1971), and Mark (1986), concluded stabilized
music aptitude began around age 9 or 10, when music aptitude might cease to improve despite
further training. Other researchers differed in their conclusions of age of stabilized aptitude
onset: DeYarman (1972, 1975) and Schleuter and DeYarman (1977) concluded music aptitude
stabilized as early as age 5 or 6, Forsythe (1984) found a partial stabilization of music aptitude in
a sample of preschool students who participated in music instruction (p. iii), Wing published
norms for students age 8 and younger (Walters, 1991), and Seashore hypothesized that music
aptitude stabilized at the age of 10 (Haroutounian, 2002, p. 15). Gordon (1967b) conducted a
multiple regression analysis using the grand composite performance achievement score as the
dependent variable and the seven MAP subtests as independent variables; in another study, test
items were factor analyzed by tonal subtest or rhythm subtest for “like” or “different” responses.
Gordon (1981) found MAP item factors of students age 9 and older were more similar to PMMA
item factors of students age 8 than to those of students age 5. Phillips et al. (2002) surmised Grade 3
(approximately age 9) was the pivotal year for development of aural skills, after which aural
acuity no longer hindered accurate pitch matching, and inferred support of Gordon’s definitions
of developmental and stabilized music aptitude. Thus, the grade level, as well as the age at which
music aptitude stabilized, had not yet been fully substantiated.
Dissension on the age of onset of stabilized music aptitude created a subsequent need for
accurate measurement of the stage and level of music aptitude. So certain was Gordon (1986c)
that the onset of stabilized music aptitude occurred in fourth grade that he stated IMMA would
measure developmental music aptitude of students in Grades 1–3 and stabilized music aptitude
of students in Grade 4 (p. 27). Moreover, Gordon (2005) suggested MAP was appropriate for
students who had entered the stabilized music aptitude stage and PMMA was suitable for those
who remained in the developmental music aptitude stage. Gordon’s implied recommendation
that the student’s stage of music aptitude must be known in order to determine which music
aptitude test was most appropriate to administer, especially in the case of the overlap of
suggested grade levels for tests such as IMMA and MAP, was impractical. In his 2001–2002
study examining the need for different music aptitude tests for developmental and stabilized
music aptitudes, Gordon (2002) concluded a test of developmental music aptitude was
insufficient to measure stabilized rhythm aptitude, thereby implying a need to use some music
aptitude tests as preliminary measures prior to administering the optimal music aptitude test for
each student. Gordon (2002) further warned use of an unsuitable test would discriminate against
very high and very low scoring students and yield misleading results, particularly for older
students. Gordon designed MAP, a test of stabilized music aptitude, for students as young as
fourth grade; the original range of grades for administration of IMMA, a test of developmental
music aptitude, was first through fourth grade. Thus, it could be implied the transition from
developmental to stabilized music aptitude occurred in fourth grade. The gap noted by Moore
(1987) between students age 8 (developmental music aptitude) and age 9 (stabilized music
aptitude) offered additional evidence of the likelihood a period of transition between developmental
and stabilized music aptitude occurred between ages 8 and 9.
An examination of music aptitude test scores seemed the best means of determining if a
student had fully transitioned from the developmental music aptitude stage and reached the
stabilized music aptitude stage. Walters (1991) noted Gordon’s choice of the label “primary” for
PMMA implied music aptitude of primary-aged children was not solidified. It was debatable
which music aptitude test should be administered to fourth and fifth grade students in particular,
because their chronological age (“about age nine”) suggested they might be transitioning from
the developmental music aptitude stage to the stabilized music aptitude stage. Without knowing
if these students had entered the stabilized music aptitude stage, were transitioning to the
stabilized music aptitude stage, or had not yet left the developmental music aptitude stage, a
level of ambiguity existed regarding the appropriate music aptitude test to be administered.
Gordon (2006) hypothesized,
It may be middle-school represents the period of a pronounced borderline between
developmental and stabilized music aptitude stages, and MAP is more appropriate for
students just entering the stabilized stage and AMMA for students who have gone
beyond middle-ground and already settled into that stage (p. 234).
The dissenting conclusions of past researchers lent support to the theory of a period of transition
from developmental to stabilized music aptitude: it was improbable that a shift between
developmental and stabilized aptitude occurred in an instant, without a period of transition or
staggered onset, even within one grade or age level. Instead, a continuum of music aptitude from
developmental to stabilized, with decreasing score gain, decreasing fluctuation of music aptitude
test score ranking, evidence of relative standing, and a progressive shift towards more fixed
levels of music aptitude seemed feasible as students approached and surpassed age 9.
Gordon conjectured the transition to stabilized music aptitude might occur in stages (e.g.,
entering, transitioning, and settling) on or around age 9. Yet IMMA, a test of developmental
music aptitude, was recommended for use with students age 10 and 11 in Grades 5 and 6, despite
the assumption their music aptitudes had stabilized previously (Gordon, 2006), and MAP was
advised for administration to students interested in special music studies due to its
comprehensive diagnostic capabilities. One presumed fourth graders, whose music aptitudes
were potentially in the process of stabilizing, might also benefit from the diagnostic capabilities
of MAP if pursuing special music studies. Thus, the advantage of IMMA administration for
students in fourth grade conceivably transitioning into the stabilized music aptitude stage was in
need of additional investigation.
Although much had been written about the measurement of music aptitude, it was
apparent an examination of music aptitude tests based on a singular approach was needed in
order to better establish and describe the transition from developmental to stabilized music
aptitude. To study aptitude measures based on Gordon’s unified construct of music aptitude was
thus optimal. It was through the examination of stabilized music aptitude onset and the practical
application of the findings of this research that the significance of this study was established.
Chapter 3
The purpose of this chapter was to introduce the research methodology for this study.
Using a quantitative approach, the mean difference in students’ longitudinal scores was
examined to determine their capacity to predict the onset of the stabilized music aptitude stage,
highlight changes in IMMA scores that were suggestive of the grade level at which stabilized
music aptitude begins, and establish the feasibility of a period of transition between the
developmental and stabilized music aptitude stages. Descriptions of the sample, sampling
technique, measures, research design, and data analysis method are the primary components of
this chapter.
Research Questions and Research Hypotheses
The following is a summary of this study’s research questions and corresponding
research hypotheses:
Research Question 1: At what grade level does chronological age cease to affect student
music aptitude?
Research Hypothesis 1: Although raw scores may continue to increase as chronological
age increases, significant score increase will decline, then cease at approximately
age 9; no effect of chronological age on music aptitude is expected.
Gordon (1989b) stated unequivocally: “Chronological age has very little effect on test results”
(p. 43). He noted raw scores were expected to increase as chronological age increased, regardless
of the stage of music aptitude; however, the same relative position in the score distribution would
be maintained by students in the stabilized music aptitude stage only (Gordon, 2005). It was
expected IMMA subtest scores from this study would fluctuate between Grades 3–5;
nevertheless, statistical significance of mean score difference served as the standard against
which influence of chronological age would be interpreted.
Research Question 2: At what grade level does instruction cease to affect student music
aptitude?
Research Hypothesis 2: An influence of instruction, measured as significant IMMA
score difference, will cease at approximately age 9; no effect of instruction on
music aptitude is expected.
The effect of instruction on music aptitude scores was used historically to define
stabilized music aptitude (DeYarman, 1972; Harrington, 1969; Seashore, 1919; Schleuter &
DeYarman, 1977; Stevens, 1987). Using this definition, identification of the grade level at which
instruction ceased to affect music aptitude scores would help determine the age of onset of
stabilized music aptitude.
Research Question 3: Is there evidence to substantiate the transition between the
developmental music aptitude stage and stabilized music aptitude stage at age
9/Grade 4?
Research Hypothesis 3: Significant score fluctuation will occur throughout Grade 3,
begin to decline during Grade 4, and discontinue in Grade 5; a period of transition
is expected.
Gordon proposed a shift from developmental music aptitude (ages 6–9) to stabilized
music aptitude (ages 10 and 11) (1989b), and speculated middle school might serve as a
definitive boundary between the developmental and stabilized music aptitude stages (Gordon,
2006). Prior to gathering evidence that IMMA was able to function as a test of stabilized music
aptitude for students in Grades 5 and 6, Gordon (1982) had developed norms for IMMA only
through fourth grade. IMMA was conceived as a measure of developmental music aptitude
through age 9 (fourth grade). However, it had been conjectured IMMA might serve as a measure
of stabilized music aptitude in students age 9 and higher (Gordon, 1989d).
Nonprobability sampling occurs when individuals are selected for availability,
convenience, and ability to represent the characteristic being studied (Creswell, 2012, p. 145).
Consequently, there is no known nonzero chance of selection: nonprobability sampling is
subjective. Convenience sampling is a type of nonprobability sampling in which participants are
selected for their availability and willingness to participate, and may be comprised of recruits or
volunteers; as such, the sample may not be representative of the population at large (Creswell,
2012, p. 145). Such samples must be described in detail in order for the reader to conceptualize
the abstract population about whom statistical inferences are made, and caution must be used
when generalizing results from the sample to the population (Huck, 2012, pp. 100–102).
Nevertheless, convenience sampling is used in educational research when randomization is not
feasible due to scheduling or economic restraints (Pedhazur & Schmelkin, 1991).
Nonprobability convenience sampling was used to define the sample used in the current
study. Intact classes were used as a means of efficiently collecting data with the least disruption
to the school’s instructional schedule. The use of intact classes in this study might be less
disadvantageous than in other studies, as the collected music aptitude test scores were considered
by grade level, rather than by class level. IMMA was administered once or twice per academic
year to intact classes of third-, fourth-, and fifth-grade students over a period of thirteen years.
Scores were investigated as matched pairs (e.g., Fall scores compared to Spring scores of the
same academic year; Spring scores of one year compared to Fall scores of the next year) by
grade level, and Grade 3 scores from 2007–2019 were considered longitudinally. Thus, the effect
of test administration to intact classes was mitigated.
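The matched-pairs arrangement described above can be illustrated with a minimal sketch that pairs Fall and Spring scores for the same students and summarizes the mean difference. The student identifiers and scores are invented for illustration, and the paired t statistic is an assumption about the kind of comparison that could be applied to such pairs; it is not presented as the analysis used in this study.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical composite scores keyed by an invented student ID.
fall = {"s01": 62, "s02": 70, "s03": 58, "s04": 66, "s05": 73}
spring = {"s01": 65, "s02": 71, "s03": 63, "s05": 72, "s06": 60}


def matched_pairs(first, second):
    """Return (first, second) score pairs for students present in both administrations."""
    shared = sorted(set(first) & set(second))
    return [(first[s], second[s]) for s in shared]


def paired_t(pairs):
    """Return the mean difference, the paired t statistic, and the degrees of freedom."""
    diffs = [b - a for a, b in pairs]
    n = len(diffs)
    mean_diff = mean(diffs)
    t_stat = mean_diff / (stdev(diffs) / sqrt(n))
    return mean_diff, t_stat, n - 1


pairs = matched_pairs(fall, spring)
mean_diff, t_stat, df = paired_t(pairs)
print(f"pairs={len(pairs)} mean difference={mean_diff:.2f} t={t_stat:.2f} df={df}")
```

Students who appear in only one administration (here the invented s04 and s06) drop out of the pairing, which is one way missing scores reduce the number of usable cases.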
An optimal sample was representative of the population in question, in order that
generalizations could be drawn to the population at large. An advantage of the sampling
technique used in this study was convenience: the researcher had easy access to the archived
IMMA scores of the sample. The routine administration of IMMA to upper elementary students
and the preservation of these IMMA scores over time created a unique data set of IMMA scores
that was accessible and advantageous for this longitudinal study. Nevertheless, the disadvantages
of nonprobability sampling were susceptibility to potential bias and increased sampling error.
The convenience samples could over- or under-represent a particular segment of the population,
which would then affect the ability to generalize to similar populations: the population
corresponding to the convenience sample was abstract, a hypothetical population that must be
inferred from the sample’s description (Huck, 2012, pp. 101–102).
Scores (N = 1,650) from students in Grades 3, 4, and 5 in a small, rural, public school
district in central Pennsylvania, where the researcher had been employed as the district
elementary general music teacher for sixteen years, provided the data for this study. The students
were predominantly White (98%), with 43.5% living in poverty (information based on free and
reduced-price lunch eligibility) (National Center for Education Statistics website, n.d.). The
district transiency rate for students was quite low: 93% of the 2018 graduating class and 84% of
the 2019 graduating class were enrolled in the researcher’s general music classes throughout
their tenure in elementary school.
An average of 60.1% of students in Grades 4 and 5 had participated in school
performance ensembles in the two most recent academic years. Thus, these students were
considered musically select as defined in the MAP manual (Gordon, 1995, p. 137). Students
could begin band instrument study and participation in chorus in fourth grade. Few students
engaged in private instrumental lessons, largely due to financial constraints and lack of
availability of local teachers. Therefore, opportunities such as participation in county band and
chorus were made available to students through the sponsorship of district music teachers.
Scores from bi-annual administrations of music aptitude tests (PMMA for Grades 1 and
2, IMMA for Grades 3–5) were used routinely in the sample school to differentiate instruction,
and scores were monitored to track individual students’ music aptitude development. IMMA was
administered intermittently once or twice per year to students in Grades 4 and 5; thus, grade level
sample sizes differed: Grade 3 (N = 1,035), Grade 4 (N = 389), and Grade 5 (N = 226). Scores
from all previous administrations of music aptitude tests for students in this school district had
been preserved. Consequently, it was possible to examine historical third grade IMMA scores
from more than a decade of test administrations, as well as to compare those scores with IMMA
scores from fourth- and fifth-grade students, when available. The stability of the overall sample
population, researcher’s long tenure in the school district, routine administration of music
aptitude tests, and preservation of past music aptitude scores created a unique longitudinal data
set of IMMA scores for use in this study. Nevertheless, the unequal numbers of grade level test
scores resulting from differences in IMMA administration by grade level from year to year might
have adversely affected the assumptions of the statistical procedures.
Missing Values
Missing data are prevalent in behavioral research (Leech et al., 2015, p. 292), yet
standard statistical methods typically presume complete information for all variables (Soley-Bori, 2013). Difficulty in generalizing to a population or even misrepresentation of the
population may arise as a result of missing data (Leech et al., 2015, p. 292). Landerman et al.
(1997) warned that the ability of the complete case sample to accurately represent the sample or
target population might be affected by the percentage of sample cases with missing data on one
or more variables. The missingness mechanism which describes the relationship between
observed and missing data must be identified in order to best address how to handle missing data
(Cook, 2020). Missingness may be categorized as MCAR (missing completely at random: no
relationship between missingness and values), MAR (missing at random: a relationship exists
between missingness and observed data, but not between missingness and missing values), or
MNAR (missing not at random: a relationship exists between missingness and the missing values
themselves, but cannot be detected from the observed data) (Cook, 2020). Although MCAR is rarely true when large amounts of data are
missing (Leech et al., 2015, p. 292), it is possible to test for MCAR despite its composition of
unobserved values. A significant result on Little’s (1988b) MCAR test may indicate a violation
of the MCAR assumption, but does not imply MAR or MNAR status (van Ginkel et al., 2020).
One is not able to test for missingness in cases of MAR, as systematic difference in observed and
unobserved data cannot be compared when the values of the missing data are unknown (Soley-Bori, 2013). With MAR data, one may impute the missing values of one variable from other
variables (Leech et al., 2015, p. 293). MNAR also cannot be determined inferentially, as
additional information about the population is needed to verify unobserved data (van Ginkel et
al., 2020). Determination of ignorability, in which missing data and the parameters of interest are
unrelated, or nonignorability, noted for the need to model missing data to accurately estimate
parameters in the model (Cook, 2020), may aid in identification of the missingness mechanism at
play. In addition, distinguishing between two main patterns of missingness may help determine
the appropriate method for mitigation of missing data (Soley-Bori, 2013). There is no consensus
on the amount of missing data that does or does not require mitigation. Rather, the need to
consider proportion of missing data in light of the unique context of their data set supersedes sole
reliance on critical values from other studies (Cook, 2020).
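The three missingness mechanisms can be contrasted with a small simulation. This is a minimal sketch under stated assumptions: the covariate, the scores, the cut point of 50, and the missingness probabilities are all hypothetical values chosen only to make the distinction visible.

```python
import random


def apply_missingness(cases, mechanism, rng):
    """Return the scores with some values replaced by None.

    cases: list of (covariate, score) tuples; the covariate is always observed.
    """
    masked = []
    for covariate, score in cases:
        if mechanism == "MCAR":
            p_missing = 0.30  # same chance for every case
        elif mechanism == "MAR":
            p_missing = 0.60 if covariate < 50 else 0.10  # depends on observed data only
        elif mechanism == "MNAR":
            p_missing = 0.60 if score < 50 else 0.10  # depends on the value that goes missing
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        masked.append(None if rng.random() < p_missing else score)
    return masked


rng = random.Random(2021)
cases = [(rng.randint(30, 70), rng.randint(30, 70)) for _ in range(200)]
for mechanism in ("MCAR", "MAR", "MNAR"):
    masked = apply_missingness(cases, mechanism, rng)
    share = sum(value is None for value in masked) / len(masked)
    print(f"{mechanism}: {share:.0%} missing")
```

In the MNAR branch the probability depends on the score that is removed, so an analyst who sees only the masked list has no direct way to confirm the mechanism.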
The simplest way to address missing data is through listwise deletion, the default option
in IBM’s Statistical Package for the Social Sciences (SPSS) software (IBM Corp., 2019b), which
excludes all cases with missing data (van Ginkel et al., 2020). MCAR is the same assumption
that underlies a complete-case analysis: when missing responses are discarded, a lack of power
and potential bias result (Schafer, 1999). Disadvantages of listwise deletion are the loss of
power from the smaller sample size and larger standard errors resulting from the discarding of
cases with missing data. Valid inferences can be drawn from data sets with discarded
cases only if the discarded cases are representative of the entire data set. However, estimates may be
biased when discarded cases differ systematically from the rest (Schafer, 1999). Acock (2005)
noted loss of 20–50% of data is typical with use of listwise deletion; McKnight et al. (2007)
estimated the potential loss of data at roughly 60% (p. 100).
Pairwise deletion excludes fewer cases and is thus less wasteful than listwise deletion, as
only data with a missing value are excluded from calculations involving the variable for which
there is no score (Field, 2009, p. 177). However, this results in an inconsistent set of participants;
therefore, “the covariances do not have the constraints they would have if all covariances were
based on the same set of participants” (Acock, 2005), and any conclusions drawn might differ by
subsample (Cook, 2020).
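The difference between the two deletion strategies can be seen in a minimal sketch. The records and variable names below are hypothetical; the point is only that listwise deletion keeps one fixed set of complete cases, whereas pairwise deletion keeps a different set of cases for each pair of variables.

```python
# Hypothetical records; None marks a missing score.
records = [
    {"tonal_fall": 34, "tonal_spring": 36, "rhythm_fall": 30},
    {"tonal_fall": 36, "tonal_spring": None, "rhythm_fall": 31},
    {"tonal_fall": None, "tonal_spring": 33, "rhythm_fall": 28},
    {"tonal_fall": 31, "tonal_spring": 35, "rhythm_fall": None},
    {"tonal_fall": 38, "tonal_spring": 39, "rhythm_fall": 35},
]


def listwise(rows):
    """Keep only rows with no missing value on any variable."""
    return [r for r in rows if all(v is not None for v in r.values())]


def pairwise(rows, var_a, var_b):
    """Keep rows that have observed values on the two variables being analyzed."""
    return [r for r in rows if r[var_a] is not None and r[var_b] is not None]


complete = listwise(records)
pair_ab = pairwise(records, "tonal_fall", "tonal_spring")
pair_ac = pairwise(records, "tonal_fall", "rhythm_fall")
print(len(complete), len(pair_ab), len(pair_ac))
```

Each pairwise subset retains more cases than the listwise set, but the two subsets contain different students, which is the inconsistency noted above.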
Single imputation procedures replace missing values with a single constant value or
predicted value, which addresses the issue of wastefulness because missing cases are retained
(van Ginkel et al., 2020), but may introduce additional bias: imputed values contain no error, as
they are completely determined by a model applied to the observed data (Soley-Bori, 2013).
Acock (2005) noted the tendency of single imputation to underestimate standard errors and
overestimate the level of precision, resulting in a perception of power that cannot be justified by
the data. In mean substitution, a type of single imputation procedure, missing values are replaced
with the mean score of all available values (Cook, 2020). A constant replacement
value should not be used to impute MAR or MNAR data, as it can be ineffective in accounting for
extreme values, yielding a loss of variance (Cook, 2020). Expectation-maximization (EM),
another single imputation approach, is an iterative procedure that uses a model to predict the
missing values from observed values (Cook, 2020). Although EM has been found to produce
accurate results in large data sets and in cases where missing data are ignorable, the procedure
can yield underestimated standard errors, which increases the probability of Type I error (Cook,
2019). Reliable estimates have been produced using EM with MCAR cases which include up to
50% of missingness (Cook, 2020). Finally, in hot deck imputation, missing values are imputed
from a donor value of an observed case similar to the missing case (Kleinke, 2018). However,
the missing value may remain missing if no similar value is found in the dataset (Kleinke, 2018).
The hot deck procedure has been found effective when up to 20% of data is MCAR or MAR and
up to 10% if MNAR (Kleinke, 2018).
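The loss of variance associated with mean substitution can be shown with a short worked example. The scores and the number of missing cases are hypothetical.

```python
from statistics import mean, variance

observed = [28, 31, 33, 35, 38]  # hypothetical observed scores
n_missing = 3                    # hypothetical number of missing scores

# Mean substitution: every missing score receives the same constant.
fill_value = mean(observed)
completed = observed + [fill_value] * n_missing

var_observed = variance(observed)
var_completed = variance(completed)
print(f"mean={fill_value} variance before={var_observed:.2f} after={var_completed:.2f}")
```

The mean is unchanged, but the sample variance shrinks because the filled-in cases add no spread while increasing the case count, which in turn makes standard errors look smaller than the data justify.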
Cook (2020) identified multiple imputation (MI) procedures as the methodological
standard for handling missing data in the social sciences. MI generates plausible values for the
missing data from observed variables in the data set (Li et al., 2015). This procedure is repeated multiple times,
creating many data sets that include imputed values. Each data set is slightly different from the
others; therefore, it is possible to use the same method and data, yet obtain different values
(Soley-Bori, 2013). The use of multiple plausible values quantifies the uncertainty of estimating
missing values and avoids the false precision that may result from single imputation (Li et al.,
2015). These data sets are each analyzed using standard statistical methods, the results of which
are pooled and a single overall inference drawn. Imputations should provide reasonable
predictions for the missing data, yet the variability should reflect a degree of uncertainty
(Schafer, 1999), as estimated by the standard error.
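The pooling step can be sketched with the combining rules commonly attributed to Rubin (1987): the pooled estimate is the mean of the per-data-set estimates, and the total variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/m). The estimates and standard errors below are hypothetical, and the function name is introduced here only for illustration.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool estimates from m imputed data sets; return (pooled estimate, pooled SE)."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])  # average sampling variance
    between = variance(estimates)                  # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Hypothetical mean-difference estimates from five imputed data sets.
estimates = [1.8, 2.1, 2.0, 2.3, 1.9]
std_errors = [0.50, 0.52, 0.49, 0.51, 0.50]
pooled, pooled_se = pool_rubin(estimates, std_errors)
print(f"pooled estimate={pooled:.2f} pooled SE={pooled_se:.3f}")
```

The pooled standard error exceeds the typical single-data-set standard error whenever the estimates differ across imputations, which is how the uncertainty about the missing values enters the final inference.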
Linear regression and predictive mean matching (PMM) are two MI procedures available
using SPSS (IBM Corp., 2019a). Missing values are estimated as random draws from a
conditional distribution in the linear regression approach (van Ginkel et al., 2020). Sufficiently
accurate results are reported regardless of missingness mechanism using the regression approach.
However, this procedure can generate imputed values outside the range of observed values and
non-normally distributed data can skew results (Cook, 2020). Nevertheless, the regression
procedure has been found suitable when more than 5% and less than 40% of ignorable data is
missing (Cook, 2020).
A linear regression model and monotone structure are included in the PMM model
(Horton & Lipsitz, 2001); values are predicted from observed values most similar to the missing
values. For each case with missing data, a set of cases (the donor pool, typically identified as k)
with observed values similar to those of the predicted value is identified. From among those
similar cases, one is randomly chosen and its observed value used to substitute for the missing
value (Allison, 2015). Because imputed data are based on observed data, values remain within
the range of possibility (Little, 1988a). The PMM approach is more immune to the effect of non-normal data (Cook, 2020) and misspecification (Schenker & Taylor, 1996). PMM is most
appropriate in cases with more than 5% and less than 40% of ignorable data (Cook, 2020). If few
suitable donor cases are available, performance of PMM might be poor (Kleinke, 2018);
however, Kleinke (2018) noted an increase in sample size might increase the accuracy of
statistical inferences. There is no mathematical theory to justify PMM; instead, Monte Carlo
simulations are relied upon (Allison, 2015). Nevertheless, based on reported results of extant
studies, it is generally accepted PMM is a potentially useful method (Allison, 2015). The SPSS
default setting for only one matched case results in no random draw, thereby underestimating
standard errors. Therefore, it was strongly recommended this default setting should be
overridden (Allison, 2015). Regardless of the missingness mechanism, both MI procedures are
appropriate for imputing missing data (Cook, 2020). MI yields more power than listwise
deletion, corrects for bias under MAR, and partly corrects for bias under MNAR when carried
out correctly (van Ginkel et al., 2020).
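The donor-based logic of PMM can be illustrated with a simplified sketch. It assumes a single predictor and a hypothetical donor pool size k, and it omits the random draw of regression parameters that complete PMM implementations perform for each imputation. It is therefore not a reproduction of the SPSS procedure, and the data are invented.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; return (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def pmm_impute(cases, k=3, rng=None):
    """Fill missing y values in (x, y) cases by drawing from the k closest donors."""
    rng = rng or random.Random()
    observed = [(x, y) for x, y in cases if y is not None]
    intercept, slope = fit_line([x for x, _ in observed], [y for _, y in observed])

    def predict(x):
        return intercept + slope * x

    completed = []
    for x, y in cases:
        if y is None:
            target = predict(x)
            # Donor pool: observed cases whose predicted values are closest to the target.
            donors = sorted(observed, key=lambda c: abs(predict(c[0]) - target))[:k]
            y = rng.choice(donors)[1]  # borrow the donor's observed value
        completed.append((x, y))
    return completed


# Hypothetical Fall (x) and Spring (y) scores; None marks a missing Spring score.
cases = [(60, 63), (62, 64), (65, 66), (68, 71), (70, 72),
         (72, 75), (75, 76), (66, None), (71, None)]
completed = pmm_impute(cases, k=3, rng=random.Random(7))
print(completed[-2:])
```

Because every imputed value is borrowed from an observed case, the results stay within the observed range. Setting k to 1 removes the random draw entirely, which mirrors the concern raised above about a donor pool of only one matched case.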
The number of imputations needed is dependent on the fraction of missing information
(γ), used by Rubin (1987) to define “the relative efficiency (RE) of multiple imputation as RE =
(1 + γ/m)^(−1/2), where m is the number of imputations” (Pan & Wei, 2016); a conclusion was drawn
that a small m (≤5) would be sufficient. Acock (2005) noted statistical software can generate
quick estimates of 10–20 imputations, which should be more than sufficient.
Soley-Bori (2013) suggested m = 20 was adequate. Nevertheless, Kleinke (2018) warned a too-large donor pool might result in selection of inadequate donors, implausible imputations, and
biased inferences. However, selection of a too-small donor pool might cause repeated selection
of the same donor, resulting in deceptively increased correlations of m imputations and
underestimated standard errors. Based on their findings, Graham et al. (2007) recommended
researchers use more imputations than previously proposed and consider both γ and the
amount of power falloff resulting from an inadequately-sized m.
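The relationship between the number of imputations and relative efficiency can be made concrete with a short calculation. The sketch uses the expression quoted above, RE = (1 + γ/m)^(−1/2); the fraction of missing information (γ = 0.3) and the values of m are hypothetical choices for illustration.

```python
def relative_efficiency(gamma, m):
    """Relative efficiency of m imputations for a fraction of missing information gamma."""
    if not 0 <= gamma <= 1:
        raise ValueError("gamma must lie between 0 and 1")
    if m < 1:
        raise ValueError("m must be at least 1")
    return (1 + gamma / m) ** -0.5


# Hypothetical fraction of missing information.
gamma = 0.3
for m in (3, 5, 10, 20):
    print(f"m={m:2d} RE={relative_efficiency(gamma, m):.4f}")
```

Relative efficiency rises toward 1 as m increases, with progressively smaller gains, which is why a small m was once considered sufficient; the later recommendation for more imputations rests on power considerations that this expression alone does not capture.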
Young and Johnson (2015) conducted preliminary research on the similarities and
differences in the use of multiple imputation for longitudinal data sets. They
described an advantage of longitudinal data use in providing stronger inferences
about change processes, but cautioned longitudinal studies may include a large amount of
missing data. In panel data, the most common type of longitudinal data, variables are measured
repeatedly, with measurements taken at the same times for all subjects (Allison, 2009). Missing
values resulting from nonresponse to test items are categorized as “within-wave” and are
addressed typically through the deletion and imputation procedures previously described. In
contrast, Young and Johnson (2015) noted when respondents do not participate at all data
collection time points, entire waves of data are missing and information on time-varying change
is lost. Nevertheless, data from prior waves can be modeled to account for attrition. Logistic
regression can estimate the degree to which variables in previous waves predict attrition from
subsequent ones. Statistical results can be used to infer the missingness mechanisms of the data.
Long or stacked data structures are typical, organizing each individual’s records as waves. The
key feature of longitudinal data, a large amount of missing data due to whole-wave missingness,
does not seem to describe the data set of this study, as students were administered IMMA
routinely and could not withdraw from testing at will. In addition, only two waves of data at a
time are examined for this study’s purposes. Allison (2009) noted multiple imputation can
readily handle missing panel data. For these reasons, it was concluded special treatment of this
study’s data set was unnecessary.
Zhang (2016) advocated for transparency in reporting how missing values are handled.
Of the 1,650 students in this study, 27 students were absent on all test administration dates within
a given academic year and no observed data were available for examination. In addition, Grade 3
observed scores were not available from 44 students who were not enrolled in the school district
until Grade 4 or Grade 5. Therefore, the data missing from these students, as well as missing
values from students who were absent for specific test administrations, were imputed using
predictive mean matching. As is typical in research studies, this study’s data set contained
numerous cases with missing values. The procedure used to manage missing data
in the current study is described below:
1. Conduct a pattern analysis to determine if the patterns of missing data are consistent with
MAR, monotonic or nonmonotonic, and warrant imputation (Leech et al., 2015, p. 294).
2. Impute missing data values using predictive mean matching with 10 imputations.
3. Conduct statistical tests using new data sets with imputed values.
4. Pool all data sets and generate an overall estimate.
The Intermediate Measures of Music Audiation (IMMA) (Gordon, 1982; updated 1986c)
was designed to test developmental music aptitude in students in Grades 1–4; norms for students
in Grades 5 and 6 were added subsequently. Test items on IMMA were selected from tonal and
rhythm patterns deemed “difficult” in Gordon’s (1976) taxonomy of tonal and rhythm patterns.
In contrast, PMMA test items made use of the “easy” patterns from Gordon’s taxonomy. Thus,
IMMA was considered a more advanced version of PMMA. Directions for the tonal and rhythm
subtests were recorded, as were the test prompts (synthesizer-produced tonal patterns without
rhythm and rhythm patterns without pitch). Students aurally compared two tonal patterns or two
rhythm patterns, determined their sameness or difference, and circled the appropriate box on the
answer sheet. The ability to read words, numbers, or music notation was unnecessary, as object
identifiers were used to label test items. Percentile ranks for students in Grades 1–6 were
provided for tonal scores, rhythm scores, and composite (tonal plus rhythm) scores.
Test reliability and validity were estimated for group administration of IMMA to students
in first grade through sixth grade. Gordon reported composite split-halves reliability and test-retest reliability coefficients between .76 and .91; the reliability coefficients reported for Grade 4,
when students presumably were transitioning to stabilized music aptitude, ranged from .76 to
.90. Content, concurrent, congruent, and longitudinal predictive validities were estimated for
IMMA and described at length in the IMMA test manual (Gordon, 1986c). Content validity, a
type of subjective validity (Gordon, 1989b) and an expression of how accurately the test content
measured what it was intended to measure, was established for IMMA through an examination
of test item difficulty and discrimination indices (Gordon, 1986c, pp. 98–99). Criterion validity,
a type of objective validity (Gordon, 1989b), was estimated most commonly through a
correlation of test scores with teacher ratings. Gordon conducted two such studies: the first study
(1984a) yielded a correlation coefficient of .36 for Grade 4 composite scores (IMMA scores
correlated with general music teacher’s ratings) and the second study estimated a correlation
coefficient of .81 for fourth grade students who participated in band. Geissel (1985) found the
validity coefficients of IMMA composite scores and MAP scores to be quite similar (.47 and .50,
respectively). Congruent validity of IMMA and PMMA was inferred because the correlation
estimated between the two tests was high (.74 for Grade 4) and the validity of PMMA had
previously been estimated as acceptably high (.73) (Geissel, 1985). The strength of the
relationships between groups of test scores was interpreted from correlation coefficients: weak
(.20–.35), moderate (.35–.65), strong (.66–.85), or very strong (.86 and above) (Creswell,
2012, p. 347).
This quantitative study investigated the relationship of music aptitude scores of a single
music aptitude measure in a stable student population over the course of thirteen years.
Permission was granted from the University at Buffalo’s Institutional Review Board (IRB) to
conduct a study using a sample that included historical student scores preserved from tests
administered prior to the study; permission was granted by the school district hosting the study
(the researcher’s employer). Students in Grades 3–5 were administered IMMA routinely as a
data-gathering measure for district-mandated music education. However, IMMA administration
was inconsistent: although third grade students were administered IMMA each Fall, a second
administration in the Spring was not always possible. In addition, IMMA scores for students in
Grades 4 and 5 were not collected routinely until the 2017–2018 academic year. The outbreak of
COVID-19 in Spring 2020 caused schools in Pennsylvania to cease in-person instruction and
move to remote, virtual instruction beginning March 16, 2020; this action prevented a Spring
administration of IMMA for all upper elementary students in the sample for that school year.
A power analysis was conducted to identify an appropriate sample size for group
comparisons (Creswell, 2012, p. 611). By applying Lipsey’s (1990) Sample Size Table, as cited
in Creswell (p. 611), it was estimated that the minimum sample size needed to attain a .80 criterion
level of power for a medium effect size of .6 at p = .05 (two-tailed) was 45 students. The sample
size of each grade level in the current study (Grade 3: N = 1,035; Grade 4: N = 389; Grade 5: N =
226) exceeded this minimum; therefore, there was greater power to detect meaningful
differences in mean scores.
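As a rough cross-check of the tabled value, the per-group sample size can be approximated from the normal distribution. This sketch is not Lipsey's table; the function name `n_per_group` is hypothetical, and the normal approximation is an assumption that typically lands within a participant or two of tabled values based on the t distribution.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-tailed comparison of two means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for two-tailed alpha
    z_power = NormalDist().inv_cdf(power)           # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

# Parameters reported in the text: effect size .6, alpha .05, power .80.
n_required = n_per_group(0.6)
```

The smallest grade-level sample in the study (Grade 5, N = 226) is several times larger than the value this approximation returns.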
Student names and archived IMMA scores were entered into a Microsoft Excel
spreadsheet and cross-checked for accuracy. Student names were moved to a separate Excel
spreadsheet and assigned a randomly-generated number. These random numbers were added to
the original data spreadsheet. Thus, the data were de-identified for use in analysis, yet a record of
names and assigned number labels was maintained separately as a contingency. Identifying
numerical labels and data were copied and pasted from Excel into SPSS to minimize potential
errors in data entry. Electronic data were stored securely in a password-protected online file;
student answer sheets were stored in a locked box in a locked closet in the researcher’s music
classroom. Thus, confidentiality of student data was ensured.
Statistical calculations in this study were performed using SPSS software (Version 26)
(IBM Corp., 2019b). SPSS, designed for the analysis of social sciences data, was selected due to
the researcher’s familiarity with the software and the capacity of the premium package to analyze
descriptive statistics and advanced statistics needed for this study.
Research Question 1
Paired t-Tests (Spring–Fall).
To address Research Question 1, which framed an examination of the chronological age
at which stabilized music aptitude begins, paired samples t-tests were used to test whether the
means of two groups were different (Field, 2009, p. 324). Of the two types of t-tests, independent
samples and dependent samples, the latter was most appropriate for this study, as music
education researchers use paired t-tests to examine mean differences of the same group over a
period of time (Russell, 2018, p. 67). Specifically, paired samples t-tests were used to determine
if there was a significant difference in Spring IMMA tonal, rhythm, and composite scores of one
academic year and corresponding Fall test scores of the following academic year. Assumptions
of this parametric test were satisfied: the dependent variable, IMMA scores, was continuous and
measured at the interval level. Observations were independent of others and matched for each
individual. An examination of histograms was used to check for outliers and determine normal
distribution. As chronological age was continuous and formal music instruction was paused
between the Spring of one academic year to the Fall of the following year, the comparison of
these score pairs was intended to highlight the effect of chronological age on music aptitude
while controlling for instruction.
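The paired samples t statistic for matched Spring and following-Fall scores can be sketched as follows. The scores are hypothetical and the function name `paired_t` is illustrative; the study's tests were run in SPSS, which also supplies the p value omitted here.

```python
import math
import statistics

def paired_t(first, second):
    """Return (t, df) for matched scores from the same individuals."""
    diffs = [b - a for a, b in zip(first, second)]   # within-student change
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)      # standard error of the mean difference
    return mean_diff / se, n - 1

# Hypothetical composite scores for six students (illustration only).
spring_scores = [30, 32, 28, 35, 31, 29]
fall_scores = [32, 33, 27, 37, 34, 30]
t_stat, df = paired_t(spring_scores, fall_scores)
```

Because each difference is computed within a student, the test reflects change in the same individuals rather than a comparison of two separate groups.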
Research Question 2
Wilcoxon Signed Rank Tests or Paired t-Tests by Grade Level.
To consider the general effect of instruction on music aptitude, a series of Wilcoxon
Signed Rank tests was used to examine scores of Fall and Spring test administrations from the
same academic year. Although the examination of score difference in matched pairs was
typically estimated using a paired samples t-test, when the assumption of normality was violated,
as indicated by statistically significant Shapiro-Wilk test results, the Wilcoxon Signed Rank test,
a nonparametric equivalent of the paired samples t-test which used signed ranks to test the
difference in observations, was conducted. Thus, scores from the Fall IMMA administration of
one academic year and the Spring IMMA administration of the same academic year functioned
as a pre-test and post-test, respectively. The assumptions of a paired t-test (use of a random
sample, independence of observations, normal distribution, and satisfaction of the equal variance
assumption) were satisfied prior to interpretation of findings (Huck, 2012, pp. 225–229) or
mitigated through use of the nonparametric Wilcoxon Signed Rank test.
Huck (2012) expressed caution that a Type I error, an inflated chance for a false positive
that causes an erroneous rejection of the null hypothesis, could result when several null
hypotheses corresponding to different dependent variables are tested simultaneously (p. 221). A
common strategy to adjust for a potential Type I error is the Bonferroni adjustment procedure, as
the modified level of significance is more rigorous, making it more difficult for a researcher to
reject the null hypothesis (Huck, 2012, p. 221). The conventional .05 alpha level was divided by
2 (the number of analyses); therefore, the Bonferroni correction applied to each set of paired t-tests was as follows: 𝛼altered = .05/2 = .025 (Huck, 2012, p. 221). The strength and direction of
correlations were considered, and Cohen’s d was calculated as t/√N (Russell, 2018, p. 76) to
estimate effect size of significant paired t-test results to consider practical significance. Effect
sizes of less than .20 were considered small, .50 was the threshold for medium effect sizes, and
.80 for large effects (Russell, 2018, p. 90). Effect size for Wilcoxon Signed Rank tests (r) was
calculated as Z/√N, where N was the number of observations, and interpreted as small (.10),
medium (.30), large (.50), or much larger than typical (.70) (Field, 2009, p. 558). Paired t-tests or
Wilcoxon Signed Rank tests were conducted using data from all academic years, as presented in
Table 1. A statistically significant difference in Fall and Spring IMMA scores from the same
academic year was interpreted as an indication that IMMA scores continued to fluctuate,
reflecting an effect of instruction on music aptitude.
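The Bonferroni adjustment and the two effect size formulas in this section can be sketched as below. The Wilcoxon Z uses a simplified normal approximation (zero differences dropped, average ranks for ties, no tie correction to the variance), so it may differ slightly from SPSS output. All function names and scores are hypothetical, and counting N as every score in both administrations is an assumption about the convention intended.

```python
import math

def wilcoxon_z(before, after):
    """Normal-approximation Z for the Wilcoxon signed rank test (simplified)."""
    diffs = sorted((b - a for a, b in zip(before, after) if b != a), key=abs)
    n = len(diffs)
    ranks = [0.0] * n
    i = 0
    while i < n:                       # assign average ranks to tied |differences|
        j = i
        while j + 1 < n and abs(diffs[j + 1]) == abs(diffs[i]):
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)   # sum of positive ranks
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w_plus - mean_w) / sd_w

def cohens_d_from_t(t, n):
    """Cohen's d as given in the text: t divided by the square root of N."""
    return t / math.sqrt(n)

def effect_size_r(z, n_obs):
    """Effect size r as given in the text: Z divided by the square root of N."""
    return abs(z) / math.sqrt(n_obs)

alpha_adjusted = 0.05 / 2              # Bonferroni: alpha divided by number of analyses

# Hypothetical Fall and Spring scores for eight students (illustration only).
fall = [30, 32, 28, 35, 31, 29, 33, 27]
spring = [32, 33, 27, 38, 35, 29, 38, 33]
z = wilcoxon_z(fall, spring)
r = effect_size_r(z, n_obs=2 * len(fall))   # assumption: N counts all scores
```

A result would be treated as significant only when its p value falls below `alpha_adjusted` rather than the conventional .05.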
Table 1
Statistical Tests by Grade Level and Academic Year
Research Question 3
Repeated Measures ANOVA.
The possibility of a stage of transition linking the developmental and stabilized music
aptitude stages was investigated through the use of a series of repeated measures ANOVA, a
univariate design. This statistical test was selected because of a desire to examine score
difference of individuals for three grade levels. Assumptions of one-way repeated measures
ANOVA were satisfied prior to data analysis (Field, 2009, p. 150). Continuous scale IMMA
scores (tonal, rhythm, or composite) served as the dependent variable and grade level as the
independent categorical variable. The assumption of normal distribution was satisfied through
examination of histograms. Sphericity was defined by Field (2009) as “the equality of variances
of the differences between treatment levels” (p. 459); however, one-way repeated measures
ANOVA is not robust to violations of the sphericity assumption (Huck, 2012, p. 321). Therefore,
when the Mauchly sphericity test yielded a statistically significant result, the Greenhouse-Geisser
correction was applied to produce a valid F-ratio (Field, 2009, p. 260) and thus account for the
lack of sphericity. The Huynh-Feldt correction was also reported, as the Greenhouse-Geisser
correction can be too conservative (Field, 2009, p. 466).
Four multivariate test statistics were produced by SPSS in a repeated measures ANOVA;
results of the four test statistics likely were different if there was more than one underlying
variate, and there was no consensus on which test statistic was preferable (Hatcher, 2013, p.
352). Wilks’s lambda (λ), the most widely reported multivariate statistic (Hatcher, 2013, p. 352),
“is a measure of the percent of variance in the dependent variables that is not explained by
differences in the independent variable” (Russell, 2018, p. 131), and should be reported in
addition to its corresponding F statistic. From Wilks’s lambda, η², an index of effect size, may be
interpreted (.01 as a small effect, .06 as a medium effect, and .14 as a large effect) (Hatcher,
2013, p. 352). However, Hatcher (2013) recommended interpretation of Pillai’s Trace as slightly
more powerful than the other reported statistics when more than one variate is evident, and Field
(2009) asserted Pillai’s Trace was the most robust to violations of multivariate normality (p.
605). Thus, it was concluded interpretation of Pillai’s Trace was suitable as an estimation of
significant difference in the repeated measures ANOVA of the current study. Partial eta-squared
was reported as the effect size of significant Pillai’s Trace results. Data collection included
matched IMMA scores of individual students in Grades 3, 4, and 5, as presented in Table 2.
Table 2
Three-Year Longitudinal Examination of IMMA Scores
As an omnibus F-test, ANOVA tests for all differences in a set of means (Evans, 1996, p.
339). A resulting significant F statistic provided evidence that not all group means were equal,
and must be succeeded by computing a post hoc statistic to identify which pairs of means
significantly affected the group difference. The Bonferroni post hoc test was selected to guard
against a high familywise rate of Type I errors (Evans, 1996, p. 363), or false positive results,
and was more robust to violations of sphericity than post hoc tests such as Tukey’s honestly
significant difference (HSD). Thus, groups were compared in pairs through use of pairwise
comparisons (Huck, 2012, p. 260), with the level of significance adjusted to mitigate the risk of
Type I error (Huck, 2012, p. 262).
The effect size was considered in order to address the interpretation of practical
significance for significant ANOVA findings. Huck (2012) asserted partial eta-squared (ηp²)
provided an index of the proportion of variability explained by the independent variable (pp.
222–223). However, Field (2009) noted a preference for use of omega squared as the best
measure of the overall effect size for repeated measures ANOVA, as he believed it more useful
to report effect sizes for focused comparisons rather than the main ANOVA (pp. 479–481).
Hatcher (2013) cautioned eta-squared, and by extension partial eta-squared, tended to
overestimate the effect size; however, omega squared provided an unbiased estimate of the
variance (p. 370).
Thus, omega squared (2) was selected as the post hoc statistical measure of effect size
for this study. The equation used to calculate omega squared was
2 = SSB – (k – 1)MSw
where SSB was the Sum of Squares Between, k – 1 the degrees of freedom of Sum of Squares
Between, MSw the Mean Square Within, and SST the Sum of Squares Total. Microsoft Excel was
used to conduct these calculations. The value of omega squared is 1; the value is negative if the
observed F is less than one. The effect size criteria used for omega squared were .01 (small), .06
(medium), and .14 (large) (Hatcher, 2013, p. 370); these were the same criteria for interpretation
of eta-squared and partial eta-squared (Hatcher, 2013, p. 370). Examination of the effect size
aided in interpretation of practical significance: findings that were negligibly above or below the
desired p < .05 level might suggest mean score differences that were only nominally divergent,
implying a period of transition in music aptitude stages.
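The omega squared calculation can be sketched in a few lines. The sketch assumes the commonly used form in which the numerator is SSB minus (k – 1) times MSw and the denominator is SST plus MSw. The function names and sums of squares are hypothetical; the study's own calculations were carried out in Microsoft Excel.

```python
def omega_squared(ss_between, ss_total, ms_within, k):
    """Omega squared from ANOVA summary values (assumed standard form)."""
    return (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)

def interpret(effect):
    """Label an effect size using the .01 / .06 / .14 criteria cited in the text."""
    if effect >= 0.14:
        return "large"
    if effect >= 0.06:
        return "medium"
    if effect >= 0.01:
        return "small"
    return "negligible"

# Hypothetical summary values for k = 3 grade levels (illustration only).
w2 = omega_squared(ss_between=120.0, ss_total=1000.0, ms_within=4.0, k=3)
label = interpret(w2)

# When the observed F is below 1 (mean square between < mean square within),
# the numerator, and therefore omega squared, is negative.
w2_negative = omega_squared(ss_between=6.0, ss_total=1000.0, ms_within=4.0, k=3)
```

The second call shows the negative value that arises when the mean square between (here 6/2 = 3) is smaller than the mean square within.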
Significant difference of scores of separate grade levels was interpreted as students
functioning at dissimilar stages of music aptitude. For example, if the findings indicated Grade 3
IMMA tonal scores were significantly different than Grade 5 IMMA tonal scores, it was likely
students in Grade 3 were still in the developmental music aptitude stage, while Grade 5 students
might have transitioned to the stabilized music aptitude stage. In contrast, stability of scores
between grade levels might indicate students had already achieved the stabilized music aptitude
stage, defined as resistant to instruction and therefore immutable to significant score changes. A
graphic representation of the research procedure is depicted as Figure 2.
Figure 2
Research Procedure
The research methodology for this study was detailed in this chapter. The sample,
nonprobability convenience sampling method, measure of stabilized music aptitude, data
analysis techniques (paired t-tests, Wilcoxon Signed Rank tests, and repeated measures
ANOVA), and quantitative design were described in detail for transparency and the possibility of
future replication. An examination of the relationships explored in this study may shed light on
the onset of, transition to, and constancy of stabilized music aptitude in students of intermediate
grade levels.
Chapter 4
Presentation and Interpretation of Data
The purpose of this study was to investigate the onset of, transition to, and longitudinal
constancy of stabilized music aptitude in upper elementary students. To achieve this purpose, the
following questions guided the research:
1. At what grade level does chronological age cease to affect student music aptitude?
2. At what grade level does instruction cease to affect student music aptitude?
3. Is there evidence to substantiate the transition between the developmental music aptitude
stage and stabilized music aptitude stage at approximately age 9/Grade 4?
IMMA scores were collected from a large sample of students (N = 1,650) in Grades 3, 4, and 5
over a thirteen-year period from 2007 to 2019. McKnight et al. (2007) noted the prevalence of
missing data in research studies (p. 1) and observed that missing data were more likely to result
from repeated observations than from a single observation (p. 54). As expected, numerous values
were missing in the data set of the current study. Pattern analysis of data “provides descriptions
of patterns of missing and can be a useful exploratory step before imputation” (IBM Corp.,
2019a), as well as in selection of the appropriate missing data procedure. Therefore, pattern
analysis was conducted on the observed data of the current study using SPSS (McKnight et al.,
2007, p. 122). The results, generated as percentage of variables, cases, and data points including
missing data, were scrutinized and the pattern of missing and nonmissing values examined. In
addition, the result of Little’s MCAR test (1988b), a single global statistic expressed as χ² and
used to determine whether data are missing completely at random, was considered.
Pattern Analysis of Missing Data
As is apparent in the first two images of Figure 3, all measured variables (all
combinations of grade level and Fall/Spring test administrations considered in the current study)
and all cases (all students in the current study’s sample) were missing at least one value within
the unimputed data set of tonal, rhythm, and composite scores. Approximately 65% of all the
values (individual scores) within the unimputed data set of observed scores were missing, as
represented in the third image of Figure 3.
Figure 3
Overall Summary of Missing Values (Unimputed Grade 3–5 Data Set, All Scores)
Analysis variables sorted by percent of missing data in decreasing order (IBM Corp.,
2019a) are displayed as a Variable Summary table (see Table 3). Missing data on the eighteen
variables (Fall and Spring administrations of tonal, rhythm, and composite scores for Grades 3–
5) ranged from 10–94% of the sample. The percentage of missing values was extensive,
particularly in the Spring administrations of Grades 4 and 5, as IMMA was not administered
routinely in those grades and often in Fall only. This proportion of missing data was concerning,
and care would need to be taken to mitigate the effects of such a high level of missingness.
Table 3
Variable Summary (Unimputed Grade 3–5 Data Set, All Scores)
One hundred three patterns of missing data were found for the eighteen variables and are
exhibited in Figure 4. A group of cases with a similar pattern of missing and nonmissing values
is represented in each row (IBM Corp., 2019a). Patterns of missingness are interpreted
horizontally; red bars represent missing data found for each variable. The variables are arranged
horizontally on the x-axis in increasing order of missing values in order to approximate a
monotonic pattern.
Figure 4
Missing Value Patterns (Unimputed Grade 3–5 Data Set, All Scores)
Monotonicity describes a pattern of missingness in which missing data are “dependent or
conditional on missing data for other items or groups of items” (McKnight et al., 2007, p. 62):
missing and nonmissing data will appear contiguous in the Missing Value Pattern figure if the
data are monotonic. As expected, the pattern of missingness approximated nonmonotonicity for
Grade 3 Fall and Spring variables: the red bars representing missing data and those representing
nonmissing data are not contiguous for Grade 3 variables. The pattern of missingness was less
haphazard for Grade 4 and Grade 5 Fall variables, and from the contiguous concentration of red
bars in the lower right corner, it was concluded the patterns of missingness for Grade 4 Spring
subtest scores were monotonic and would require imputation of scores to mitigate systematic
bias of that portion of the sample.
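The logic of tabulating missing value patterns and checking for monotonicity can be sketched as follows. This is an illustration of the idea, not the SPSS procedure. The function names and scores are hypothetical, and variables are assumed to be ordered so that those with the fewest missing values come first.

```python
from collections import Counter

def missing_patterns(rows):
    """Count each distinct pattern; True marks a missing value (None)."""
    return Counter(tuple(value is None for value in row) for row in rows)

def is_monotone(rows):
    """True if, within every case, no observed value follows a missing one."""
    for row in rows:
        seen_missing = False
        for value in row:
            if value is None:
                seen_missing = True
            elif seen_missing:
                return False
    return True

# Hypothetical scores: Grade 3 Fall, Grade 3 Spring, Grade 4 Fall (illustration only).
cases = [(31, 33, 35), (30, None, None), (28, 29, None), (27, None, None)]
patterns = missing_patterns(cases)
monotone = is_monotone(cases)

# A student absent in Grade 3 Spring but tested in Grade 4 Fall breaks monotonicity.
nonmonotone = is_monotone(cases + [(26, None, 30)])
```

Each key in `patterns` corresponds to one row of a missing value pattern chart, and its count corresponds to the number of cases sharing that pattern.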
The 10 most frequent patterns of missing values, illustrated in the bar graph in Figure 5,
offered another perspective. The most common pattern, Pattern 87, represents a large proportion
of missing data (approximately 40%); missing values for all Grade 4 and Grade 5 Fall and Spring
test administrations comprised the horizontal Pattern 87 in Figure 4. Similarly, it was indicated
by Pattern 100 that approximately 20% of cases were missing values on most test
administrations, from Grade 3 Spring through Grade 5 Spring.
Figure 5
Missing Value Patterns Bar Graph (Unimputed Grade 3–5 Data Set, All Scores)
From the missing values analysis, conducted to examine patterns of missing data for
identification of the missingness mechanism as missing at random (MAR), not missing at
random (NMAR), or missing completely at random (MCAR), and selection of the appropriate
statistical technique to handle the missing values, two patterns contained a markedly higher
percentage of missing values than others. Approximately 40% of the cases (all Grade 4 and
Grade 5 variables) were missing data, as modeled in Pattern 87; Pattern 100 reflected an
additional 20% of variables (from Grade 3 Spring tonal through Grade 5 Spring composite
variables) containing missing data. From these results, it was determined missing values were far
more prevalent in Grades 4 and 5, particularly for Spring IMMA administrations of all subtests,
and the missing value patterns appeared split between the nonmonotonicity of Grade 3 patterns
and the monotonicity of Grade 4 and Grade 5 patterns. The percentage of incomplete data in the
observed data set was quite large, as was prevalent in longitudinal studies. McKnight et al.
(2007) observed studies using longitudinal data often resulted in monotonic missing data
patterns. Although these patterns were somewhat predictable, they also generated large
proportions of missing data (p. 106). Consequently, neither listwise deletion, in which only
complete cases were included, nor pairwise deletion, in which only cases having nonmissing
values for both variables within a given pair of variables were included, was recommended for
the current study. Either would result in a decrease in power and could potentially introduce bias,
as it was unknown whether the loss of those specific cases might lead to misrepresentation of the
population, as cautioned by Landerman et al. (1997). The evidence reinforced the conclusion that
imputation of missing values was prudent.
Data missing because of specific factors might bias the results (Leech et al., 2015, pp.
292–299); therefore, classification of the data as MAR, NMAR, or MCAR was necessary and
reliant on consideration of missingness characteristics. To determine if the missingness could be
categorized as MCAR (missing completely at random), Little’s MCAR test was conducted using
Grade 3, Grade 4, and Grade 5 IMMA tonal scores, rhythm scores, and composite scores. The
results of Little’s MCAR test, χ²(414, N = 131) = 397.219, p > .05, were not significant, an
indication the data were missing randomly, with no relationship between missing data and
observed data (McKnight et al., 2007, p. 46). MCAR data are ignorable and do not need to be
modeled for the parameter estimation process (McKnight et al., 2005, p. 51); moreover, even with
a large proportion of missing data, the observed data might generate unbiased parameter
estimates when the mechanism is MCAR (McKnight et al., 2007, p. 61).
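The nonsignificant result reported for Little's MCAR test can be checked approximately from the statistic and its degrees of freedom. The sketch below uses the Wilson-Hilferty normal approximation to the chi-square distribution, which is an assumption made to avoid external libraries; the function name is hypothetical and an exact chi-square routine would give a slightly different p value.

```python
import math
from statistics import NormalDist

def chi_square_upper_tail(chi_sq, df):
    """Approximate P(X >= chi_sq) via the Wilson-Hilferty transformation."""
    h = 2 / (9 * df)
    z = ((chi_sq / df) ** (1 / 3) - (1 - h)) / math.sqrt(h)
    return 1 - NormalDist().cdf(z)

# Values reported in the text: chi-square = 397.219 with 414 degrees of freedom.
p_value = chi_square_upper_tail(397.219, 414)
not_significant = p_value > 0.05
```

Because the statistic is smaller than its degrees of freedom, the approximate p value is well above .05, consistent with the conclusion that the data could be treated as MCAR.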
McKnight et al. (2007) noted there did not appear to be a general agreement on
classification of the amount of missing data as small, medium, or large (p. 61); nevertheless,
Madley-Dowd et al. (2019) suggested even with large proportions of missing data (up to 90%),
use of a properly specified imputation model and MAR data can yield unbiased results. Jin and
Huber (2011) concurred, concluding multiple imputation (MI) may produce unbiased estimates
of large numbers of missing data classified as MCAR or MAR and adjudging MI as preferable to
complete case analysis under all missingness mechanisms. Due to the mixed results of the
pattern analysis, in which the pattern of monotonicity was split by grade levels, and Little’s
MCAR test, in which the missing data were classified as MCAR, it was concluded all observed
cases would be included and missing values imputed using predictive mean matching with 10
imputations as the preferred method for handling a large percentage of missing values.
Imputation of Missing Values
In predictive mean matching, a donor pool of cases with observed values similar to the
predicted value is identified. One donor is selected randomly, and its observed value is substituted
for the missing value (Schenker & Taylor, 1996). Ten imputations, conducted using SPSS,
resulted in a pooled data set that was then used in all subsequent statistical analyses (Leech et al.,
2015, p. 306). The pooled data set results of each statistical analysis are presented when
available. Due to the nature of multiple imputation, slight differences may be expected for each
instance of imputing values (Van Ginkel et al., 2020). Standard deviations are not pooled
automatically in SPSS; therefore, pooled standard deviations were calculated using Microsoft
Excel (Heymans & Eekhout, 2019).
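The donor selection at the heart of predictive mean matching can be sketched with a single predictor. This is a simplified illustration, not the SPSS algorithm: it fits one least-squares line instead of drawing regression parameters for each imputation, and the function name `pmm_impute`, the donor pool size `k`, and all scores are hypothetical.

```python
import random
import statistics

def pmm_impute(x, y, k=3, seed=0):
    """Fill None entries of y by borrowing observed values from similar cases."""
    rng = random.Random(seed)
    observed = [i for i, value in enumerate(y) if value is not None]
    mean_x = statistics.fmean([x[i] for i in observed])
    mean_y = statistics.fmean([y[i] for i in observed])
    sxx = sum((x[i] - mean_x) ** 2 for i in observed)
    slope = sum((x[i] - mean_x) * (y[i] - mean_y) for i in observed) / sxx
    intercept = mean_y - slope * mean_x
    predicted = [intercept + slope * xi for xi in x]    # predicted mean for every case
    completed = list(y)
    for i, value in enumerate(y):
        if value is None:
            # Donor pool: the k observed cases whose predictions are closest.
            donors = sorted(observed, key=lambda j: abs(predicted[j] - predicted[i]))[:k]
            completed[i] = y[rng.choice(donors)]        # borrow a real observed score
    return completed

# Hypothetical Fall (complete) and Spring (partly missing) scores (illustration only).
fall = [60, 62, 65, 58, 70, 66, 61, 68]
spring = [62, 63, None, 59, 72, None, 60, 69]
completed = pmm_impute(fall, spring)
```

Because every imputed value is copied from an observed case, imputed scores always fall within the range of real scores. Repeating the call with different seeds would produce multiple imputed data sets.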
Overview of Statistical Analyses
To address Research Question 1, a series of paired t-tests was conducted using paired
samples from Spring IMMA scores of one academic year and corresponding Fall IMMA scores
of the subsequent academic year to examine the effect of chronological age on music aptitude
while controlling for instruction. Research Question 2 was addressed by conducting a series of
Wilcoxon Signed Rank tests or paired t-tests to examine scores of Fall and Spring
administrations from the same grade level, thus considering the effect of instruction on music
aptitude by academic year. A one-way repeated measures ANOVA was conducted to examine
IMMA scores longitudinally over a 3-year period to consider if mean differences in scores might
suggest a period of transition between developmental and stabilized music aptitude stages, as
queried in Research Question 3.
For paired t-tests, Cohen’s d, a standardized measure of mean difference calculated as
t/√N (Russell, 2018, p. 76), was used to estimate effect size of statistically significant results in
order to consider practical significance. According to Russell (2018), small effect sizes are .20–
.30, as approximately 4–9% of the total variance is explained, a medium effect size (d = .50)
estimates approximately 25% of the total variance, and d = .80 is the threshold for a large effect,
in which approximately 64% of the total variance is estimated (p. 90). Cohen (1988) further
characterized a small effect as “difficult to detect”, a medium effect as “large enough to be
visible to the naked eye”, and a large effect as a “grossly perceptible” difference (Hatcher, 2013,
pp. 163–164). For Wilcoxon Signed Rank tests, r was calculated as Z/√N, where N is the number
of observations (Field, 2009, p. 558), and interpreted as small (.10), medium (.30), large (.50), or
much larger than typical (.70) (Leech et al., 2015, p. 95). For repeated measures ANOVA, omega
squared was used to estimate the amount of variance of the dependent variable that was
explained by the independent variable (Field, 2009, p. 479): .01 was considered a weak effect,
.06 a moderate effect, and .14 a strong effect (Hatcher, 2013, p. 370). The a priori alpha level
was set at p < .05 for all statistical tests (Russell, 2018, p. 24).
Results of the paired t-test, Wilcoxon Signed Rank test, and repeated measures ANOVA
analyses are presented and discussed in this chapter and related to the research questions posited
in this study. Thus, the effect of chronological age and instruction on music aptitude at a given
grade level was considered, and the feasibility of a period of transition between the
developmental and stabilized music aptitude stages discussed.
Research Question 1
At what grade level does chronological age cease to affect student music aptitude?
To investigate the effect of chronological age on music aptitude, a series of paired t-tests
was used to determine if the mean difference of Spring scores of one academic year and
corresponding Fall scores of the following academic year, which allowed for increased
chronological age during the summer months without the effect of school-related music
instruction, was statistically significant. IMMA was administered in Grade 4 in the following
semesters only: Fall 2011, Fall 2015, Fall 2017, Fall 2018, Spring 2019, and Fall 2019. Of the
1,035 students for whom Grade 3 IMMA scores were available, 238 also had Grade 4 scores
available. Scores from the same individuals on two measures were desired to conduct a paired t-test; therefore, scores of 795 cases could not be matched to Grade 3 scores of the same
individuals, as IMMA had not been administered to those students in Grade 4.
Although a rationale for imputation of missing scores had been established previously,
imputation of such a large proportion of missing scores (approximately 77% of Grade 4 scores)
required additional consideration. Imputation had been deemed valid for cases in which scores
were missing arbitrarily due to student absence on the day of test administration, and the random
nature of the missingness was speculated. Missing data were anticipated, as the repeated
measures used in the study design were often a source of incidental missing data (McKnight et
al., 2007, pp. 54–55). However, it seemed a different matter to impute test scores of an entire test
battery (tonal, rhythm, and composite scores) for each student to whom the test had not been
administered in a particular semester, and exclusion of cases could reduce the power needed to
reject the null hypothesis when it was false. Nevertheless, an a priori power analysis conducted
according to the following parameters estimated the suggested number of participants needed for
each group in the sample was 45: alpha level .05, power .80, and effect size .6 (Creswell, 2012,
p. 611). Thus, a sample size of 238, the number of students for whom both Grade 3 and Grade 4
IMMA scores were available, exceeded the minimum number of 45 recommended by the results
of the a priori power analysis. The power analysis results seemed adequate justification to
exclude the 795 cases with missing Grade 4 scores due to no test administration from the paired
t-test analysis for Research Question 1.
The ramifications of excluding a large number of missing cases were severe and not to be
taken lightly. Consequently, paired t-test results of Grade 3 Spring/Grade 4 Fall IMMA scores
for the observed case sample and the excluded case sample were compared. Missing values were
imputed using predictive mean matching with 10 imputations for both samples. Because values
were no longer missing due to imputation, the “observed case sample” will be referred to as the
“complete case sample” from this point forward.
Although the means for both the complete case sample and excluded case sample were
roughly comparable in most instances, the excluded means for Grade 3 Spring tonal and Grade 3
Spring composite scores were markedly different from comparable means of the complete case
sample. The mean Grade 3 Spring tonal score was the pooled result of 10 imputations using
predictive mean matching; the random draw from possible donors appeared skewed toward
lower scores during imputation. This contention was supported by an examination of the imputed
means and standard deviations for Grade 3 Spring tonal scores: the imputed mean scores were
several points lower than the original mean and the imputed standard deviations were higher than
those of other variables. It was likely the mean Grade 3 Spring composite score was affected by
the discrepancy in the mean Grade 3 Spring tonal score, as composite scores are composed in
part of tonal scores. In addition, the standard deviation for Grade 3 Spring composite scores in
the excluded case sample (SD = 19.917) was inconsistent with that of other variables from the
same sample or the complete case sample. As the standard deviation is an indication of the
dispersion of scores from the mean, such a high standard deviation was suggestive of a large
amount of variation in Grade 3 Spring composite scores. SPSS-generated histograms of Grade 3
Spring composite scores of both samples were examined. It was observed scores of the complete
case sample more strongly resembled a normal distribution than did those of the excluded case
sample. Nevertheless, expediency precluded display of all charts produced. Perhaps the larger
sample size of the complete case sample (N = 1,054) allowed a more even dispersion of scores,
thus masking the impact of outliers in a manner the smaller size of the excluded case sample (N
= 238) could not accommodate.
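The pooled means discussed above combine the estimates from 10 imputed data sets. A minimal sketch of such pooling follows, assuming Rubin's rules were the pooling method (the text does not name the rule applied); the function name `pool_rubin` and the example values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool one estimate across imputed data sets using Rubin's rules.

    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    pooled = mean(estimates)                       # average of the m estimates
    within = mean([se ** 2 for se in std_errors])  # mean within-imputation variance
    between = variance(estimates)                  # between-imputation variance
    total = within + (1 + 1 / m) * between         # total variance of pooled estimate
    return pooled, sqrt(total)


# Hypothetical means and standard errors from three imputed data sets
pooled_mean, pooled_se = pool_rubin([33.1, 32.7, 33.4], [0.21, 0.22, 0.20])
print(round(pooled_mean, 3), round(pooled_se, 3))
```

When imputed values vary widely across imputations, the between-imputation term grows and inflates the pooled standard error, which is one way an unstable donor draw can surface in pooled results.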
In a similar manner, paired t-test results of Grade 4 Spring/Grade 5 Fall IMMA scores for
the complete case sample and excluded case sample were compared and missing values imputed
for both samples. Means of both samples were quite similar. However, the standard deviation of
Grade 5 Fall composite scores was considerably larger for the excluded case sample than for all
other standard deviations, meaning scores for that test were more widely dispersed from the
mean than scores of other tests. An examination of all histograms of both imputed samples
revealed a similar result to that concluded for Grade 3 Spring composite scores: a normal
distribution was more closely modeled by Grade 5 Fall composite scores of the complete case
sample than by those of the excluded case sample. Perhaps the effect of outliers was mitigated by
the larger sample size (N = 1,069): the dispersion of the smaller number of scores from the
excluded case sample (N = 132) would be more limited, allowing outliers to have greater
influence on the shape of the distribution, resulting in a discrepancy in standard deviation values.
Descriptive statistics for all samples are illustrated in Table 4.
Table 4
Descriptive Statistics of Complete and Excluded Case Samples (Pooled)
A compiled table of correlation results is presented in Table 5. It was concluded Grade 3
Spring tonal, rhythm, and composite scores were significantly correlated with corresponding
Grade 4 Fall scores for both samples. In both samples, only composite score correlations
were significant for Grade 4 Spring and Grade 5 Fall scores. The correlation coefficients of the
complete case and excluded case samples were similar and quite small. These weak correlations
seemed to reflect Gordon’s (2012) assertion that correlations of scores from consecutive
semesters or years were weak, possibly because the influence of musical environment was
stronger than that of formal instruction for students in the developmental music aptitude stage (p.
54). Accordingly, it was concluded the use of the complete case sample (the sample containing
observed scores with missing values imputed) was preferable to address Research Question 1.
Table 5
Correlation Results of Complete and Excluded Case Samples (Pooled)
The paired samples results varied greatly between the two samples. The mean difference
in Grade 3 Spring scores and Grade 4 Fall scores was larger, the scores more widely dispersed
from the mean, and the distance of the sample mean from the population mean greater in the
excluded case sample than for the complete case sample. It appeared the ability of the excluded
case sample to accurately represent the nature of the complete case sample was flawed to such an
extent as to be deemed untrustworthy.
In contrast to the Grade 3 Spring/Grade 4 Fall findings, the paired samples test results for
the Grade 4 Spring and Grade 5 Fall complete case and excluded case samples were more
similar. Thus, the difference in scores (mean difference), score dispersion from the mean
(standard deviation), and distance of the sample mean from the population mean (standard error
of the mean) were comparable for both samples. Paired samples test results for complete cases
and excluded cases are displayed in Tables 6 and 7.
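The three quantities compared above (mean difference, standard deviation, and standard error of the mean) are the components of the paired t statistic. The sketch below illustrates how they relate, using hypothetical scores; the function name `paired_t` is an assumption, and differences are taken as Spring minus Fall, which appears to match the negative t values reported when Fall scores were higher.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(spring, fall):
    """Return mean difference, SD, SEM, t, and df for paired scores."""
    diffs = [s - f for s, f in zip(spring, fall)]  # Spring minus Fall
    n = len(diffs)
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)       # dispersion of the differences
    sem = sd_diff / sqrt(n)      # standard error of the mean difference
    t = mean_diff / sem
    return mean_diff, sd_diff, sem, t, n - 1


# Hypothetical paired scores for four students
spring_scores = [30, 32, 35, 28]
fall_scores = [31, 34, 35, 31]
print(paired_t(spring_scores, fall_scores))
```

A larger standard deviation of the differences raises the standard error and shrinks t, so two samples with the same mean difference can yield different test results.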
Table 6
Paired Samples t-Test Results–Complete Case Sample (Pooled)
Table 7
Paired Samples t-Test Results–Excluded Case Sample (Pooled)
A series of paired t-tests was conducted to determine if statistical results supported the
rationale to exclude missing cases due to student non-enrollment at the time of test
administration as described previously. The use of either the Grade 4 Spring/Grade 5 Fall
complete case sample or the excluded case sample could be justified from descriptive statistics,
correlation, and paired samples test results. This was in stark contrast to the findings for the
Grade 3 Spring/Grade 4 Fall complete case and excluded case samples, in which the use of the
complete case sample was favored strongly. There seemed no advantage to overlooking the
discrepancies in paired samples results, particularly for Grade 3 Spring/Grade 4 Fall scores; thus,
sole use of the excluded case sample seemed ill-advised. From these findings, it was concluded
the use of the complete case sample, which included observed values for all grade levels with all
missing values imputed, was more feasible and practical to address Research Question 1 of the
current study.
Paired Samples t-Test Results
Although Grade 4 Fall tonal scores were anticipated to exceed Grade 3 Spring tonal
scores based on Gordon’s (2005) conclusion that a score increase due to chronological age was
typical for tests, the mean difference in Grade 3 Spring/Grade 4 Fall tonal scores was not
significant. A weak but significant correlation (r = .278, p < .001) was found for Grade 3 Spring
scores and Grade 4 Fall scores (N = 1,070), as presented in Table 8: as Grade 3 Spring tonal
scores increased, so did Grade 4 Fall tonal scores. Correlation coefficients were interpreted as a
weak (.20–.35), moderate (.35–.65), strong (.66–.85), or very strong relationship (.86 and
above) (Creswell, 2012, p. 347). The mean difference of Grade 4 Fall and Grade 3 Spring tonal
scores was not statistically significant (t(27) = -.049, p > .05), as seen in Table 9.
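The interpretation bands cited from Creswell (2012) can be expressed as a simple lookup. This is a sketch only: the function name is hypothetical, the shared boundary of .35 is assigned to "moderate" by assumption, and the label for coefficients below .20 is an assumption because the cited bands do not name that range.

```python
def describe_correlation(r: float) -> str:
    """Label the strength of a correlation using the bands cited in the text."""
    size = abs(r)  # strength is judged on magnitude, not direction
    if size >= 0.86:
        return "very strong"
    if size >= 0.66:
        return "strong"
    if size >= 0.35:  # assumption: the shared boundary counts as moderate
        return "moderate"
    if size >= 0.20:
        return "weak"
    return "below the weak range"  # assumption: band not named in the source


# Tonal coefficients reported in this section
for r in (0.278, 0.257):
    print(r, describe_correlation(r))
```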
Table 8
3ST–4FT Descriptive Statistics and Correlation Coefficient
Table 9
3ST–4FT Paired t-Test Results
Scores (N = 1,070) of students who had been administered the IMMA tonal subtest in the
Spring of their fourth-grade year and the Fall of their fifth-grade year were significantly
correlated (r = .257, p < .001), as displayed in Table 10. As Grade 4 Spring tonal scores
increased, Grade 5 Fall tonal scores also increased. The mean difference of Grade 5 Fall tonal
and Grade 4 Spring tonal scores was not statistically significant (t(13) = -.423, p > .05), as
illustrated in Table 11.
The mean difference between scores from the Spring administration and the subsequent
Fall administration of the IMMA tonal subtest for Grades 3–5 was not statistically significant for
all grade levels considered; these findings are summarized and graphically represented in Figure
6. Gordon (2005) noted an increase in raw scores due to chronological age was typical for tests;
therefore as chronological age increased, scores were expected to increase from the first to the
second test administration within one grade level, from one grade level to the subsequent grade
level, and from the Spring administration of one grade level to the Fall administration of the
following grade level, as was the case in this study. However, there was no significant difference
in mean scores between grade levels, from which one could speculate students already had
achieved the stabilized music aptitude stage, as score fluctuation would be expected for students
still in the developmental music aptitude stage (Gordon, 1981). Similarly, a high correlation
between scores of successive grade levels might be anticipated if students had achieved the
stabilized music aptitude stage (Gordon, 2005), as the effect of musical environment or training
would have ceased and relative standing on music aptitude tests would be maintained (Gordon,
1980b). Nonetheless, only a modest relationship between Spring scores and Fall scores of the
following academic year was suggested by the correlation coefficients in this study.
Table 10
4ST–5FT Descriptive Statistics and Correlation Coefficient
The weak correlations were surprising: the test items for the Grade 3 Spring subtests and
Grade 4 Fall subtests were identical and it was anticipated the correlation between those sets of
scores would be strong. Still, Gordon (2012) noted a weak correlation between scores from one
semester or year to another, even when all students received quality music instruction: “It seems
students’ immediate impressions and intuitive responses to environmental influences have more
influence on developmental music aptitude than systematic formal instruction in music
achievement” (p. 54). In addition, Gordon (2012) observed magnitude, rather than direction, of
score changes from year to year seemed to affect lower longitudinal correlation coefficients of
developmental music aptitude tests, unlike those for stabilized music aptitude test scores (p. 27).
Table 11
4ST–5FT Paired t-Test Results
Although this observation seemed to explain the findings of the current study, it had been
asserted students after age 9 had achieved the stabilized music aptitude stage (Gordon, 2001a)
and thus their music aptitude was unaffected by environmental influences. It was speculated the
correlation findings in the current study supported a transition period between the developmental
and stabilized music aptitude stages during the upper elementary years, in which the influence of
musical environment continued to affect music aptitude or recurred in a less predictable pattern
than originally described. Perhaps students correctly answered a similar number of test items in
each test administration, as suggested by the negligible difference in mean scores, but the test
items that were answered correctly were not identical from the Spring test administration to the
subsequent Fall test administration, as suggested by the small correlation coefficients.
Regardless, evidence to support an effect of chronological age on tonal music aptitude was not
found from these contradictory results.
Figure 6
Paired t-Test Spring–Fall Results (Tonal)
Results of the paired t-test examination of rhythm scores were dissimilar to tonal results.
As presented in Table 12, correlations of scores (N = 1,070) of students who had been
administered the rhythm subtest in the Spring of their third-grade year and the Fall of their
fourth-grade year were not significant (r = .095, p > .05). Grade 4 Fall rhythm scores were an
average of .449 points higher than Grade 3 Spring rhythm scores, and this mean difference was
statistically significant (t(31) = -2.062, p = .048), as exhibited in Table 13. Cohen’s d was
estimated as d = .11, a small effect size.
Similarly, scores (N = 1,070) of students who had been administered the IMMA rhythm
subtest in the Spring of their fourth-grade year and the Fall of their fifth-grade year were not
significantly correlated (r = .103, p > .05), as displayed in Table 14. The mean Grade 4 Spring–
Grade 5 Fall rhythm score difference was not statistically significant (t(13) = -2.092, p > .05)
(see Table 15). The lack of correlation significance might have been more notable had the
correlation itself been stronger; however, the weak correlation aligned with Gordon’s (2012)
observation that correlations of scores of consecutive tests were low. The score decrease from
Grade 4 Spring to Grade 5 Fall scores was unexpected; nevertheless, the finding was not
significant. The significant increase in Grade 3 Spring–Grade 4 Fall scores coupled with the
decrease in Grade 4 Spring–Grade 5 Fall scores might have been noteworthy, as the trend
seemed to confirm the direction of score fluctuation described previously by Gordon (2012).
However, neither score difference reached the threshold of a 2-point increase ascribed by Gordon
(2002) to students who participate in traditional instruction, and the Grade 4 Spring–Grade 5 Fall
score difference was not statistically significant. Therefore, no effect of chronological age for
rhythm music aptitude was concluded, and practical significance was unlikely.
Table 12
3SR–4FR Descriptive Statistics and Correlation Coefficient
The mean difference in scores from the Spring administration to the subsequent Fall
administration of the IMMA rhythm subtest was approximately one-half point and statistically
significant for Grade 3 Spring–Grade 4 Fall scores only (p = .048). These findings are
summarized and graphically represented in Figure 7. A lack of notable score fluctuation, as was
apparent in this sample, would be anticipated if students had attained the stabilized music
aptitude stage (Gordon, 1980b). Based on these results, it was concluded chronological age had
little effect on the relative standing of rhythm aptitude scores in this sample. The weakness of the
correlations indicated a lack of association between rhythm scores of successive grade levels.
This was consistent with Gordon’s (2012) assertion that correlations of scores on consecutive
test administrations were “alarmingly low,” perhaps due to the influence of students’ responses to
environmental influence on developmental music aptitude (p. 54). If students had already
achieved the stabilized music aptitude stage, the influence of musical environment should have
waned, resulting in a lack of correlation. The discrepancy in expected and observed correlation
coefficients and minimal score difference could be explained as students correctly answering a
similar number of test items, but not answering the same test items correctly. This observation is
speculative and would need further investigation in future studies.
Table 13
3SR–4FR Paired t-Test Results
Table 14
4SR–5FR Descriptive Statistics and Correlation Coefficient
Table 15
4SR–5FR Paired t-Test Results
Figure 7
Paired t-Test Spring–Fall Results (Rhythm)
As anticipated, results of the paired t-test examination of composite scores mirrored those
of tonal scores and rhythm subtest scores. As exhibited in Table 16, composite scores (N =
1,070) from the Grade 3 Spring and Grade 4 Fall test administrations were significantly
correlated (r = .252, p < .01): as Grade 3 Spring composite scores increased, Grade 4 Fall
composite scores also increased. Grade 4 Fall composite scores were an average of .874 points
higher than Grade 3 Spring composite scores. The mean difference was statistically significant
(t(87) = -2.641, p = .010) (see Table 17), and Cohen’s d was estimated as d = .11, a small effect.
Composite scores (N = 1,070) of students from Grade 4 Spring and Grade 5 Fall test
administrations were not correlated significantly (r = .119, p > .05), as displayed in Table 18.
The mean Grade 4 Spring–Grade 5 Fall composite score difference was not statistically
significant (t(27) = 1.912, p > .05), as presented in Table 19.
The trend of composite scores was anticipated to simulate that of tonal or rhythm scores
and to reflect an increase in composite scores as chronological age increased. Therefore, the
decrease in mean composite scores from Grade 4 Spring to Grade 5 Fall test administrations,
though non-significant, was unexpected, as findings of extant literature also supported score
increase due to chronological age. Thus, composite results resembled rhythm results of the same
test administrations. Despite the finding of statistical significance for Grade 3 Spring/Grade 4
Fall composite score difference, practical significance is cautioned, as the mean score difference
was less than one point.
Table 16
3SC–4FC Descriptive Statistics and Correlation Coefficient
Table 17
3SC–4FC Paired t-Test Results
The paired t-test composite findings of Spring scores from one academic year and Fall
scores of the following year are summarized and graphically represented in Figure 8. A clear
interpretation of these findings was elusive. A lack of pronounced score fluctuation would be
expected if students had achieved the stabilized music aptitude stage (Gordon, 1980b); this
seemed to be the case in this sample. A statistically significant correlation (r = .252, p < .01) of
Grade 3 Spring/Grade 4 Fall composite scores seemed to support a determination of stabilized
music aptitude: an association between composite scores of successive grade levels would seem
to indicate mean score difference had dwindled as the influence of musical environment waned,
as described by Gordon (1981).
Table 18
4SC–5FC Descriptive Statistics and Correlation Coefficient
Table 19
4SC–5FC Paired t-Test Results
Nevertheless, both relationships between Spring scores and Fall scores of the subsequent
academic year were weak; a discrepancy between the number of correctly answered test items
and the stability of those answers in relation to specific test items was speculated. Gordon (2012)
reported median correlations of repeated stabilized music aptitude test administrations were
approximately .80; corresponding correlations for developmental music aptitude test
administrations only approximated .30 (p. 27). Following this logic, it seemed the weak
correlations found for composite scores in the current study were suggestive of students’
continued presence in the developmental music aptitude stage. Yet mean composite score
differences were small for all grade levels and could signify composite music aptitude was fixed.
Student music aptitude stage could not be established conclusively from these findings, and little
evidence was found to suggest an effect of chronological age on composite aptitude.
Figure 8
Paired t-Test Spring–Fall Results (Composite)
The relative constancy of scores over the 3-year period from Grade 3 through Grade 5
could be interpreted as evidence students had progressed to the stabilized music aptitude stage
before Grade 3. The findings of previous research both support and dispute this interpretation.
Degé et al. (2017) and Walters (1991) noted the influence of external factors on developmental
music aptitude before age 9, yet Gordon (1980b, 1986c, 2012, 2013) was consistent in his claim
that musical environment no longer affected music aptitude at approximately age 9, defined as
the period of stabilized music aptitude. Phillips et al. (2002) conjectured Grade 3 or
approximately age 9 might serve as the pivotal year for development of aural skills before the
stabilization of music aptitude. It should be noted for the current study’s school district, the terms
“Grade 3” and “age 9” did not describe the same students, as 9-year-old students were typically
in fourth grade.
Gordon (1998) noted a tendency for music aptitude test scores to increase with
chronological age for students in the stabilized music aptitude stage and for scores and percentile
ranks to fluctuate for students who take a developmental music aptitude test (p. 169). Mean score
differences in the current study were small: the largest difference, found for Grade 3
Spring/Grade 4 Fall composite scores, was less than one point. Nevertheless, Grade 3
Spring/Grade 4 Fall mean score differences were significant for rhythm scores and composite
scores in this sample. Correlations, although significant for tonal scores and Grade 3
Spring/Grade 4 Fall composite scores, were weak for all grade and subtest combinations.
Therefore, it was concluded the chronological age at which the developmental music aptitude
stage progressed to the stabilized music aptitude stage was not clarified by the results of the
current study, as there was no clear decrease in score fluctuation that might indicate the age at
which a shift between music aptitude stages might occur.
Research Question 2
At what grade level does instruction cease to affect student music aptitude?
The difference in mean scores of matched pairs by year and semester was examined to
gain an in-depth view of change in subtest scores by grade level as influenced by instruction. In
most instances, the assumption of normal distribution for paired t-tests was not met and the nonparametric Wilcoxon Signed Rank test was substituted. The following third grade scores were
not available, as IMMA was not administered in the Spring of those academic years:
Grade 3
Spring 2009, Spring 2011, Spring 2015, and Spring 2020
IMMA was not administered routinely in Grades 4 and 5. Therefore, only the following observed
scores were available:
Grade 4
Fall 2011, Fall 2015, Fall 2017, Fall 2018, Spring 2019, Fall 2019
Grade 5
Fall 2017, Fall 2018, Spring 2019, Fall 2019
Fall and Spring scores from the same academic year were compared and all missing values
imputed using predictive mean matching with 10 imputations (Acock, 2005). Results were
categorized and reported by statistical test, year, grade level, and semester. An online effect size
calculator (Stangroom, 2021) was used to calculate Cohen’s d, reported as an estimate of effect
size for paired t-test results and interpreted as a small effect (.2), medium effect (.5), or large
effect (.8) (Russell, 2018, p. 90). Wilcoxon Signed Rank test effect size was calculated using
Microsoft Excel as r = Z/√N (Field, 2009, p. 558), where N was the total number of observations.
The absolute value of r was interpreted as a small effect size (0.1), medium effect size (0.3), or
large effect size (0.5) (Field, 2009, p. 558).
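The Wilcoxon effect size formula and the benchmark values cited above can be sketched as follows. The function names and example values are hypothetical. Labeling an effect by its nearest benchmark is an assumption made for illustration, as the cited sources present the values as reference points rather than strict cutoffs.

```python
from math import sqrt


def wilcoxon_effect_size(z: float, n_observations: int) -> float:
    """Effect size r = Z / sqrt(N), where N is the total number of observations."""
    return z / sqrt(n_observations)


def nearest_benchmark(value: float, benchmarks: dict) -> str:
    """Return the label whose benchmark is closest to the absolute effect size."""
    size = abs(value)
    return min(benchmarks, key=lambda label: abs(benchmarks[label] - size))


# Benchmarks as cited in the text
R_BENCHMARKS = {"small": 0.1, "medium": 0.3, "large": 0.5}  # Field (2009)
D_BENCHMARKS = {"small": 0.2, "medium": 0.5, "large": 0.8}  # Russell (2018)

# Hypothetical Z statistic from 50 pairs, i.e. 100 total observations
r = wilcoxon_effect_size(z=-1.5, n_observations=100)
print(r, nearest_benchmark(r, R_BENCHMARKS))
```

The same helper can label a Cohen's d estimate by passing `D_BENCHMARKS` in place of `R_BENCHMARKS`.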
Wilcoxon Signed Rank Test Results
A Shapiro-Wilk test was conducted on each set of grade level subtest scores to estimate
the assumption of normality (Field, 2009, p. 144). A significant result indicated a violation of the
assumption of normality. In some instances, kurtosis was found to be leptokurtic: the steepness
of the distribution curve was high, which suggested heavy tails or outliers. These results were
supported by visual examination of SPSS-generated histograms and Normal Q-Q plots for each
of the 10 imputations of each set of scores. When the normality assumption was violated,
Wilcoxon Signed Rank tests were conducted instead of paired t-tests to compare differences in
mean scores.
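For readers unfamiliar with the substituted test, the following is a minimal sketch of the Wilcoxon Signed Rank computation, assuming the large-sample normal approximation without a tie or continuity correction. It is not the SPSS procedure used in the study, and statistical packages may report Z with a different sign convention; the function name and scores are hypothetical.

```python
from math import sqrt


def wilcoxon_signed_rank(fall, spring):
    """Return (W_plus, W_minus, z) for paired scores via the normal approximation."""
    # Differences are Spring minus Fall; pairs with no change are dropped
    diffs = [s - f for f, s in zip(fall, spring) if s != f]
    n = len(diffs)
    ordered = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the run while absolute differences are tied
        while end + 1 < n and abs(diffs[ordered[end + 1]]) == abs(diffs[ordered[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1  # mean of ranks start+1 through end+1
        for position in range(start, end + 1):
            ranks[ordered[position]] = average_rank
        start = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)   # ranks of increases
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)  # ranks of decreases
    expected = n * (n + 1) / 4
    std = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return w_plus, w_minus, (w_plus - expected) / std


# Hypothetical Fall and Spring scores for six students
fall_scores = [30, 28, 35, 31, 29, 33]
spring_scores = [32, 27, 38, 31, 33, 34]
print(wilcoxon_signed_rank(fall_scores, spring_scores))
```

Because the test ranks the size of each change rather than averaging raw scores, a few extreme scores have less influence on the result than they would in a paired t-test.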
An effect of instruction on music aptitude was not concluded from the results of
Wilcoxon Signed Rank tests for Grade 3 IMMA scores by academic year. Mean score
differences were generally negligible and non-significant. A relative constancy of subtest scores
was suggestive of an attainment of the stabilized music aptitude stage prior to Grade 3.
Wilcoxon Signed Rank test results are summarized and depicted for tonal scores (see Table 20),
rhythm scores (see Table 21), and composite scores (see Table 22) of all academic years in order
to present a broad perspective of longitudinal change.
Table 20
Wilcoxon Signed Rank Test Results (Tonal)
As is apparent from the compiled tonal results featured in Table 20, correlations of all
pairs of Fall and Spring tonal scores were strong and significant (p < .01). Only the mean
differences between Fall and Spring tonal scores for two years (2011–2012 and 2013–2014)
were statistically significant (p < .05). Results of each academic year are analyzed and discussed
separately later in this section in order to consider score change through a narrower lens.
Correlations of most pairs of Fall/Spring rhythm scores were strong and statistically
significant, as presented in Table 21. However, no mean differences were statistically significant
at p < .05. Results for each academic year are examined and interpreted in the next section for a
focused perspective on rhythm score change.
Table 21
Wilcoxon Signed Rank Test Results (Rhythm)
Correlations of Fall/Spring composite scores of all academic years were moderate or
strong and statistically significant. No significant mean difference of Fall/Spring composite
scores was found for any of the academic years considered. Composite findings of each
academic year are displayed and interpreted in the following section, in order to scrutinize score
change in detail.
Table 22
Wilcoxon Signed Rank Test Results (Composite)
2007–2008 Grade 3 Scores.
Sample sizes of 2007–2008 Grade 3 tonal, rhythm, and composite tests were slightly
different due to the number of student absences on each of the dates of test administration.
Therefore, missing values from each test were imputed using predictive mean matching with 10
imputations (Acock, 2005), in which values were selected randomly from a donor pool of
observed values and substituted for missing values (Allison, 2015). This resulted in samples of
equal size. The imputed data sets, 10 for each original data set of tonal, rhythm, and composite
scores, were used for all subsequent statistical tests, which yielded pooled results from which a
single inference was drawn (McKnight et al., 2007). Mean scores of all tonal, rhythm, and
composite tests were comparable across corresponding Fall and Spring administrations, as displayed
in Table 23.
Table 23
2007–2008 Grade 3 Descriptive Statistics (Pooled)
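The donor-pool logic of predictive mean matching described above can be illustrated with a simplified sketch. It assumes a single predictor and a fixed regression line; full implementations, including the SPSS procedure used in the study, also draw the regression parameters at random for each imputation, which is omitted here. The function name, the donor pool size `k`, and all scores are hypothetical.

```python
import random
from statistics import mean


def pmm_impute(predictor, target, k=5, seed=0):
    """Fill None entries in target by predictive mean matching on one predictor."""
    rng = random.Random(seed)
    observed = [(x, y) for x, y in zip(predictor, target) if y is not None]
    # Fit a simple least-squares line to the observed cases
    x_bar = mean(x for x, _ in observed)
    y_bar = mean(y for _, y in observed)
    sxx = sum((x - x_bar) ** 2 for x, _ in observed)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in observed) / sxx
    intercept = y_bar - slope * x_bar

    def predict(x):
        return intercept + slope * x

    completed = []
    for x, y in zip(predictor, target):
        if y is not None:
            completed.append(y)  # observed values are kept unchanged
            continue
        # Donor pool: the k observed cases with the closest predicted values
        donors = sorted(observed, key=lambda case: abs(predict(case[0]) - predict(x)))[:k]
        completed.append(rng.choice(donors)[1])  # borrow one donor's observed score
    return completed


# Hypothetical Fall scores (complete) and Spring scores (two missing)
fall_scores = [30, 32, 28, 35, 31, 29]
spring_scores = [31, None, 29, 36, None, 30]
imputed_sets = [pmm_impute(fall_scores, spring_scores, k=3, seed=s) for s in range(10)]
print(imputed_sets[0])
```

Because every imputed value is borrowed from an observed case, imputed scores always fall within the range of real scores. Running the function with 10 different seeds mirrors the 10 imputations from which pooled results were drawn.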
A linear relationship was suggested from visual examination of scatterplots for all
corresponding subtests. Therefore, a Pearson’s Product-Moment correlation test was run for each
pair of corresponding subtests. Corresponding Fall and Spring scores were significantly
correlated (p < .05) for all subtests. The results of the bivariate relationships among all
combinations of test types (Huck, 2012, p. 50) were summarized in a correlation matrix, presented
in Table 24. In addition, tonal and rhythm subtest intercorrelations ranged from .497 to .636,
which were lower than the corresponding reliabilities of the tests: the subtests seemed to have no
more than 40% of their variances in common. The reported intercorrelation coefficients of
IMMA tonal and rhythm subtests ranged from .40 to .46 (Gordon, 1986c, p. 94); thus, the ability
of the tonal subtest and rhythm subtest to measure unique dimensions of music aptitude was
somewhat lower for this 2007–2008 Grade 3 sample when compared to the IMMA
standardization sample. These results were interpreted as indirect evidence the preponderance of
variance of tonal and rhythm subtests was related to factors not shared by the two subtests (p.
94): the tonal subtest and rhythm subtest functioned according to the standard established in the
IMMA manual for this study’s 2007–2008 Grade 3 sample. As expected, the composite scores
were highly correlated with both tonal subtest (ranging from .643 to .840) and rhythm subtest
(ranging from .599 to .888) scores, as tonal and rhythm subtest scores contributed to the
composite score.
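The shared-variance figure cited above follows from squaring the intercorrelation coefficient. A brief sketch of that arithmetic, using the range of tonal and rhythm intercorrelations reported for this sample (the function name is hypothetical):

```python
def shared_variance(r: float) -> float:
    """Proportion of variance two measures have in common (r squared)."""
    return r ** 2


# Lowest and highest tonal-rhythm intercorrelations reported for 2007-2008
for r in (0.497, 0.636):
    print(f"r = {r:.3f} -> shared variance = {shared_variance(r):.1%}")
```

The upper coefficient of .636 corresponds to roughly 40% shared variance, which leaves about 60% of the variance in each subtest attributable to factors not shared with the other.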
Table 24
2007–2008 Grade 3 Correlation Matrix (Pooled)
From statistically significant Shapiro-Wilk results, exhibited in Table 25, a violation of
the necessary assumption of normal distribution for 2007–2008 Grade 3 tonal and composite
scores was concluded. The lower bound of the true significance value is an indication of the low
end of the range of possible p-values to which the estimated p-value belongs; it is, in effect, a
confidence interval for significance of Kolmogorov-Smirnov results. A visual examination of
associated Q-Q plots and boxplots supported the contention of assumptions violations.
Therefore, a series of Wilcoxon Signed Rank tests was conducted to investigate
differences between corresponding Fall and Spring subtest scores for two related samples.
Results are displayed below in Tables 26 and 27. This nonparametric test was selected because
the assumption of normality had been violated for most subtests, an indication that use of a
paired t-test was inappropriate. Unlike the paired t-test, which compares mean scores, the
Wilcoxon Signed Rank test uses signed ranks to test the difference between paired observations.
All Spring subtest ranks were
higher than corresponding Fall ranks, as determined by the greater number of ranks (Field, 2009,
p. 558). Although tonal medians were similar (Mdn = 34.00), Spring tonal subtest means (33.36)
tended to be higher than Fall tonal subtest means (32.95), Z = -1.172, r = 0.18, a small effect.
Table 25
2007–2008 Grade 3 Shapiro-Wilk Test of Normality Results
Spring rhythm subtest ranks (Mdn = 30.00) were apt to be higher than Fall rhythm subtest ranks
(Mdn = 29.00), Z = -.329, r = 0.005, and Spring composite ranks (Mdn = 64.00) also trended
higher than Fall composite ranks (Mdn = 63.00), Z = -.283, r = 0.004. No mean rank differences
were statistically significant (p > .05) for any set of ranks. Thus, no effect of instruction was
concluded for Fall and Spring tonal scores, rhythm scores, or composite scores for this 2007–
2008 sample. However, it was possible type of instruction adversely affected the finding of no
effect of instruction.
Table 26
2007–2008 Grade 3 Wilcoxon Signed Rank Test Results (Pooled)
Table 27
2007–2008 Grade 3 Wilcoxon Signed Rank Test Statistics (Pooled)
Gordon (2001b) theorized three stages of preparatory audiation: acculturation, in which
children absorb musical sounds, babble, and move with increasing purpose in relation to sounds
of their musical environment; imitation, in which children become aware of sameness and
difference between their own performance and that of others and begin to imitate sounds of their
musical environment; and assimilation, in which children recognize and learn to coordinate their
singing, chanting, breathing, and moving (pp. 6–10). Structured and unstructured informal
guidance offered students in preparatory audiation opportunities to absorb music and respond as
they wish (Gordon, 2012, pp. 253–254): “random and experimental responses prepare them
[children] to sing, chant, and move within a culturally based context through audiation” (Gordon,
2012, p. 254). However, if students had not experienced adequate informal guidance, they would
lack the necessary readiness for formal instruction. Gordon (2012) asserted, “Extended informal
guidance in music is more beneficial than premature formal instruction” (p. 255): informal music
guidance is foundational to and directly influences developmental music aptitude and, by
extension, stabilized music aptitude (Gordon, 2006).
Ideally, children begin to move through the steps of preparatory audiation from a young
age; however, not all students have transitioned from preparatory audiation or “music babble” by
the time they begin formal schooling. Gordon (2013) noted children moved through preparatory
audiation at different rates (p. 29) and may or may not have emerged from music babble when in
the developmental music aptitude stage (Gordon, 2012, p. 251). Consequently, if students in the
current sample had not emerged from the music babble stage yet, the formal guidance provided
would have been inappropriate for their musical needs and their music aptitude test scores would
have decreased accordingly (Reese & Shouldice, 2019). Gordon (1981) exhorted,
Retroactive inhibition on the part of the young child in attempting to erase supposedly
erroneous concepts rather than learning how to assimilate them into new understanding,
as a result of no, or inappropriate formal instruction, may be the most potent cause of low
developmental music aptitude among young children (p. 46).
Thus, it must be considered that instruction was not appropriate to support music aptitude, resulting
in no effect of instruction. Type and quality of instruction was beyond the purview of this study;
nevertheless, further examination is recommended as a topic of future research.
2008–2009 Grade 3 Scores.
Due to student absences on test administration dates, sample sizes for 2008–2009 Grade 3
tests were unequal. Predictive mean matching was used to impute missing values and the
imputed data set used for subsequent statistical tests, culminating in equal sample sizes and
pooled results. Mean Fall and Spring scores were comparable for all tests, as shown in Table 28.
Linearity of corresponding Fall and Spring scores was suggested by a visual examination of
tonal, rhythm, and composite scatterplots. Therefore, a Pearson’s Product-Moment correlation
test was conducted; results are featured in Table 29. Fall and Spring scores for all tests were
significantly correlated (tonal r = .780; rhythm r = .614; composite r = .840) and the effect sizes
large. All intercorrelations were also significant (p < .01) and ranged from .490 to .722,
suggesting subtests had no more than 52% of their variances in common. As predicted,
composite score intercorrelations were strong: tonal–composite coefficients ranged from .729 to
.926 and rhythm–composite coefficients from .679 to .920.
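The shared-variance figure cited above follows from squaring a correlation coefficient. A minimal Python sketch of that arithmetic, using the largest subtest intercorrelation reported for this sample (.722); the function name is illustrative only and is not part of the original analysis:

```python
def shared_variance_percent(r):
    """Percentage of variance two measures share: r squared, times 100."""
    return (r ** 2) * 100


# Largest subtest intercorrelation reported for the 2008-2009 Grade 3 sample.
print(round(shared_variance_percent(0.722), 1))  # 52.1
```

The same calculation applies to the shared-variance percentages reported for later samples.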
Table 28
2008–2009 Grade 3 Descriptive Statistics (Pooled)
A Shapiro-Wilk Test of Normality was conducted: all tests but Grade 3 Spring rhythm
were significant at p < .05, as presented in Table 30. A determination that the assumption of
normality had been violated for most groups of scores was supported by a visual examination of
boxplots and Normal Q-Q plots. Thus, the paired t-test was deemed inappropriate and the
Wilcoxon Signed Rank test substituted. These findings are displayed in Tables 31 and 32.
Table 29
2008–2009 Grade 3 Correlation Matrix (Pooled)
Table 30
2008–2009 Grade 3 Shapiro-Wilk Test of Normality Results
Although tonal medians were similar (Mdn = 34.00), Spring tonal subtest means (33.11)
tended to be higher than Fall tonal subtest means (32.88), Z = -1.323, r = .013, a small effect.
Spring rhythm subtest ranks (Mdn = 29.00) were lower than Fall rhythm subtest ranks (Mdn = 30.00),
Z = -.034, r = .0003, and Spring composite ranks (Mdn = 62.00) were lower than Fall composite
ranks (Mdn = 64.00), Z = -.622, r = .006. The mean difference was not significant (p >
.05) for any set of ranks.
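As an illustration of how a Z statistic and effect size of this kind can be obtained, the following Python sketch implements the Wilcoxon Signed Rank test with the normal approximation and a tie correction. It is a sketch under stated assumptions, not the procedure used in this study: the scores are invented, zero differences are dropped, Z is based on the smaller rank sum (so it is never positive), and the effect size uses the common convention r = |Z| / sqrt(N), where the choice of N is an assumption because the study does not state which N was used.

```python
import math


def wilcoxon_signed_rank(fall, spring):
    """Return (w_plus, w_minus, z) for paired scores, normal approximation."""
    # Spring minus Fall; pairs with no change are dropped (an assumption).
    diffs = [s - f for f, s in zip(fall, spring) if s != f]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        group_size = j - i + 1
        tie_term += group_size ** 3 - group_size
        i = j + 1

    w_plus = sum(rank for rank, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(rank for rank, d in zip(ranks, diffs) if d < 0)

    mean_w = n * (n + 1) / 4
    var_w = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    # Using the smaller rank sum makes z zero or negative.
    z = (min(w_plus, w_minus) - mean_w) / math.sqrt(var_w)
    return w_plus, w_minus, z


def effect_size_r(z, n_observations):
    """Effect size r = |Z| / sqrt(N); which N to use is an assumption here."""
    return abs(z) / math.sqrt(n_observations)


# Invented scores for five students, for illustration only.
fall_scores = [30, 32, 28, 35, 31]
spring_scores = [31, 34, 27, 35, 36]
w_plus, w_minus, z = wilcoxon_signed_rank(fall_scores, spring_scores)
print(w_plus, w_minus, round(z, 3))
```

With samples this small an exact test would normally be preferred; the normal approximation is shown only because the results above are reported as Z values.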
Table 31
2008–2009 Grade 3 Wilcoxon Signed Rank Test Results (Pooled)
Mean Fall ranks were greater than mean Spring ranks for rhythm and composite scores;
however, no difference was statistically significant. This finding reflected Gordon’s (2002)
assertion that no school music instruction could yield a decrease in average developmental music
aptitude test scores. For this score decrease to have occurred after one year of instruction was
suggestive that the formal rhythm instruction offered the students in this 2008–2009 Grade 3
sample was neither compensatory (students’ musical needs were mitigated) nor complementary
(students’ current musical needs were met). Gordon (1986c) recommended an evaluation of, and
likely a change to, the type of instruction provided as the result of a score decrease (p. 76). An
effect of instruction was not concluded for this sample, although it seemed probable inadequate
instruction played a role in this finding.
Table 32
2008–2009 Grade 3 Wilcoxon Signed Rank Test Statistics (Pooled)
2009–2010 Grade 3 Scores.
Sample size differed for each subtest, according to student attendance. Predictive mean
matching was used to impute missing values; subsequent statistical tests were based on the
imputed data set. Therefore, equal sample sizes and pooled results are presented. Fall mean
scores were quite similar to corresponding Spring scores, as can be seen in Table 33.
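The idea behind predictive mean matching can be sketched in a few lines of Python. This is a simplified, single-imputation illustration under stated assumptions, not the procedure used in this study: it assumes one predictor (a hypothetical Fall score predicting a missing Spring score), an ordinary least squares fit on complete cases, and a random draw from the k closest donors. Full implementations also draw the regression parameters at random and create several imputed data sets whose results are pooled. All names and values are invented for illustration.

```python
import random


def pmm_impute(predictor, target, k=3, seed=0):
    """Fill None entries in target with observed values from similar cases."""
    rng = random.Random(seed)
    observed = [(x, y) for x, y in zip(predictor, target) if y is not None]
    n = len(observed)

    # Ordinary least squares fit of target on predictor, complete cases only.
    mean_x = sum(x for x, _ in observed) / n
    mean_y = sum(y for _, y in observed) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in observed)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in observed)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    def predict(x):
        return intercept + slope * x

    # Each donor pairs its predicted value with its actually observed value.
    donors = [(predict(x), y) for x, y in observed]

    completed = []
    for x, y in zip(predictor, target):
        if y is not None:
            completed.append(y)
            continue
        predicted = predict(x)
        nearest = sorted(donors, key=lambda d: abs(d[0] - predicted))[:k]
        # The imputed value is a real observed score borrowed from a donor.
        completed.append(rng.choice(nearest)[1])
    return completed


# Invented scores; None marks a student absent for the Spring test.
fall_scores = [30, 32, 28, 35, 31, 33]
spring_scores = [31, 34, None, 35, 36, None]
print(pmm_impute(fall_scores, spring_scores))
```

Because every imputed value is borrowed from an observed case, imputed scores stay within the range of scores that actually occurred.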
A linear relationship between corresponding Fall and Spring test scores was suggested by a
visual examination of tonal, rhythm, and composite scatterplots. Consequently, a Pearson’s
Product-Moment correlation test was conducted to determine the strength, direction, and
magnitude of the relationship between corresponding Fall and Spring IMMA scores. The
correlation coefficients of all pairs of subtest scores are featured in Table 34. Fall and Spring
scores were significantly correlated (p < .01) for all subtests (tonal r = .669; rhythm r = .654;
composite r = .800) and the effect sizes large. Subtest intercorrelations were also significant:
tonal–rhythm intercorrelations ranged from .432 to .641: subtests seemed to have no more than
41% of their variances in common. Thus, the ability of the tonal subtest and rhythm subtest to
measure unique dimensions of music aptitude was moderate. As expected, intercorrelations with
composite scores were strong and significant: tonal–composite intercorrelations ranged from
.690 to .908 and rhythm–composite intercorrelations ranged from .664 to .903.
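For readers who wish to see how a Pearson Product-Moment coefficient of the kind reported above is computed, a minimal Python sketch follows. The scores are invented for illustration and the function name is hypothetical; significance testing and pooling across imputed data sets are not attempted.

```python
import math


def pearson_r(x, y):
    """Pearson Product-Moment correlation of two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a list has no variance")
    return sxy / math.sqrt(sxx * syy)


# Invented Fall and Spring scores for five students.
fall_scores = [30, 32, 28, 35, 31]
spring_scores = [31, 34, 27, 35, 36]
print(round(pearson_r(fall_scores, spring_scores), 3))  # 0.779
```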
Table 33
2009–2010 Grade 3 Descriptive Statistics (Pooled)
The results of the Shapiro-Wilk Test of Normality, presented in Table 35, were
statistically significant for all tests, p < .05; thus, the use of paired t-tests was deemed
inappropriate due to the violation of the normality assumption. A visual examination of Normal Q-Q
plots and boxplots of each subtest confirmed the presence of occasional outliers. Therefore, a
series of Wilcoxon Signed Rank tests was conducted in lieu of paired t-tests. The results are
exhibited below in Tables 36 and 37.
Table 34
2009–2010 Grade 3 Correlation Matrix (Pooled)
Table 35
2009–2010 Grade 3 Shapiro-Wilk Test of Normality Results
Although tonal medians were similar (Mdn = 33.00), Spring tonal subtest means (32.52)
were inclined to be lower than Fall tonal subtest means (32.67), Z = -.255, r = 0.003, and Spring
rhythm subtest ranks (Mdn = 29.00) lower than Fall rhythm subtest ranks (Mdn = 30.00), Z = -1.158,
r = 0.014. Composite medians were also similar (Mdn = 63.00), but Spring composite
means (61.67) tended to be lower than Fall composite means (62.33), Z = -.031, r = 0.0004.
Table 36
2009–2010 Grade 3 Wilcoxon Signed Rank Test Results (Pooled)
Nevertheless, mean differences were not statistically significant (p > .05) for any set of ranks.
From non-significant mean differences of Fall and Spring tonal, rhythm, and composite scores,
no effect of instruction was concluded. This finding was in contrast to Gordon’s (2002)
contention that students would demonstrate an improvement of approximately 2 points per year
on a developmental music aptitude test if traditional instruction was offered.
Type and quality of instruction, if inappropriate based on the tonal and rhythm progress
of students through the music babble stage, could have influenced students’ ability to benefit
from the formal instruction offered in the school music environment. Gordon (2002) suggested
average scores would increase to the highest score obtainable with specialized instruction
emphasizing audiation. Whereas complementary or compensatory instruction likely would result
in an increase or maintenance of test scores, a sustained period of tonal and rhythm instruction
mismatched with students’ musical needs and providing insufficient support for students to move
through preparatory audiation could have resulted in decreased or stagnant music aptitude test
scores, from which a conclusion of no effect of instruction could also be drawn.
Table 37
2009–2010 Grade 3 Wilcoxon Signed Rank Test Statistics (Pooled)
2010–2011 Grade 3 Scores.
Original sample sizes were unequal due to student absences on dates of various test
administrations. Missing values were imputed using predictive mean matching; thus, sample
sizes of the newly imputed data set, used for all statistical tests, were made equal. Mean scores of
corresponding Fall and Spring test administrations were very similar, as shown in Table 38.
A linear relationship of Fall and Spring scores was concluded from visual examination of
tonal, rhythm, and composite scatterplots and a Pearson’s Product-Moment correlation test
conducted. This correlation matrix can be seen in Table 39. There was a large effect for all
correlations of tonal scores (r = .759), rhythm scores (r = .650), and composite scores (r = .758);
all correlations were significant (p < .01). Tonal–rhythm intercorrelations ranged from .179 to
.668 and were mostly significant (p < .05). Composite intercorrelations were statistically
significant for tonal scores, ranging from .578 to .881, and rhythm scores, ranging from .580 to
.887 (p < .05); all effect sizes were large.
Table 38
2010–2011 Grade 3 Descriptive Statistics (Pooled)
Table 39
2010–2011 Grade 3 Correlation Matrix (Pooled)
Shapiro-Wilk test results, exhibited in Table 40, were statistically significant for all but
Grade 3 Spring rhythm scores, and composite scores were suggestive of a violation of the
assumption of normality. The presence of occasional outliers was confirmed by a visual
examination of boxplots and Normal Q-Q plots. Therefore, it was concluded paired t-tests were
inappropriate and Wilcoxon Signed Rank tests were substituted (see Tables 41 and 42).
Table 40
2010–2011 Grade 3 Shapiro-Wilk Test of Normality Results
Although tonal and rhythm medians were similar (Mdn = 34.00 and 30.00, respectively),
Spring subtest means were inclined to be lower than their Fall counterparts, but not significantly
so: Spring tonal means (33.58) versus Fall tonal means (33.70), Z = -1.166, r = .013, and
Spring rhythm means (29.11) versus Fall rhythm means (29.22), Z = -.928, r = .011. Spring
composite ranks (Mdn = 64.00) were apt to be higher than Fall composite ranks (Mdn = 63.00), Z
= -1.953, r = .022. The composite result also was not statistically significant, although the
observed p-value (p = .051) was almost equal to the a priori significance level (α = .05). Nevertheless,
practical significance was discounted because the effect size was quite small (r = .022). No effect
of instruction was concluded from the findings because mean score differences were not
statistically significant.
Table 41
2010–2011 Grade 3 Wilcoxon Signed Rank Test Results (Pooled)
Table 42
2010–2011 Grade 3 Wilcoxon Signed Rank Test Statistics (Pooled)
2011–2012 Grade 3 Scores.
As with previous test administrations, sample size differed for each subtest. However,
imputation of missing values created equal sample sizes; pooled results are presented. Mean
scores for corresponding subtests were comparable from the Fall to Spring administrations, as
seen in Table 43. An approximate linear relationship between corresponding Fall and Spring
scores was suggested from a visual examination of tonal, rhythm, and composite scatterplots.
Table 43
2011–2012 Grade 3 Descriptive Statistics (Pooled)
Results of a Pearson’s Product-Moment correlation test are presented in Table 44. Fall
and Spring scores were significantly correlated (p < .01) for all subtests (tonal r = .688; rhythm r
= .551; composite r = .699) and the effect sizes large. The intercorrelation between tonal and
rhythm subtest scores also was significant, with the exception of Grade 3 Fall tonal and Grade 3
Spring rhythm scores. Correlation coefficients ranged from .504 to .701; subtests seemed to have
no more than 49% of their variances in common. Therefore, tonal and rhythm subtests estimated
unique dimensions of music aptitude with modest success. As anticipated, intercorrelations with
composite scores were significant: tonal–composite correlation coefficients ranged from .635 to
.907; rhythm–composite correlation coefficients ranged from .599 to .928. The effect size of all
intercorrelations was large.
Table 44
2011–2012 Grade 3 Correlation Matrix (Pooled)
A Shapiro-Wilk Test of Normality was conducted; results for most tests were statistically
significant at p < .05, with the exception of Grade 3 Spring rhythm scores (see Table 45). A
visual examination of Normal Q-Q plots and boxplots for each subtest offered additional
evidence the assumption of normal distribution necessary to conduct a paired t-test had been
violated. Therefore, the Wilcoxon Signed Rank test was conducted to investigate mean
difference for each pair of corresponding subtests; results are featured in Tables 46 and 47.
Spring tonal subtest ranks (Mdn = 34.00) were significantly higher (p = .001) than Fall tonal
subtest ranks (Mdn = 32.00), Z= -3.270, r = 0.037, a small effect size. It was suggested by this
result that instruction might have had a nominal effect on tonal scores, as the mean Spring score
was 1.18 points higher than the Fall score. However, caution is recommended when considering
practical significance, as Gordon (2002) had suggested developmental music aptitude scores
increased approximately 2 points from year-to-year and the tonal score increase in question did
not meet that threshold. Spring rhythm subtest ranks (Mdn = 29.00) tended to be lower than Fall
rhythm subtest ranks (Mdn = 31.00), Z = -.831, r = 0.009; although composite medians were
similar (Mdn = 63.00), Spring composite means (60.94) trended higher than Fall composite
means (60.46), Z = -1.1326, r = 0.015. Nevertheless, no statistically significant difference in
rhythm ranks or composite ranks (p > .05) was found. From these non-significant mean score
differences, it was concluded instruction did not affect Fall and Spring rhythm or composite
scores. It was possible this sample of students had progressed beyond the tonal music babble
stage and consequently the formal instruction offered within the school music environment was
appropriate for their tonal development but not their rhythm development. However, the decrease
in rhythm ranks after one year of instruction was unexpected and indicative of instruction that
was neither compensatory nor complementary of students’ musical needs (Gordon, 1986c, p. 76).
Type and quality of instruction in general, and by tonal and rhythm dimension specifically, were
not considered in the current study; however, it becomes increasingly apparent as analyses are
interpreted that an examination of this topic is recommended in future studies.
Table 45
2011–2012 Grade 3 Shapiro-Wilk Test of Normality Results
2012–2013 Grade 3 Scores.
Again, sample sizes differed by subtest, according to student attendance on the dates of
test administration. Missing values were imputed using predictive mean matching and the
imputed data set used to conduct statistical tests. Corresponding Fall and Spring scores were
quite similar for all subtests, as seen in Table 48.
Table 46
2011–2012 Grade 3 Wilcoxon Signed Rank Test Results (Pooled)
A broad approximation of a linear relationship for Fall and Spring administrations was
suggested from a visual examination of scatterplots of tonal, rhythm, and composite scores. A
Pearson’s Product-Moment correlation test for each pair of corresponding subtests yielded
correlation coefficients for Fall and Spring administrations as follows: tonal r = .729, rhythm r =
.568, and composite r = .727. Thus, relationships between Fall and Spring scores of
corresponding subtests were strong, positive, and statistically significant at p < .01, as exhibited
in the correlation matrix shown in Table 49.
Table 47
2011–2012 Grade 3 Wilcoxon Signed Rank Test Statistics (Pooled)
Table 48
2012–2013 Grade 3 Descriptive Statistics (Pooled)
The intercorrelation between tonal and rhythm subtests was moderate, ranging from .378
to .590. Stronger correlations were found for relationships with composite scores: tonal–
composite correlation coefficients ranged from .644 to .872; rhythm–composite correlation
coefficients ranged from .579 to .907. This was to be expected, as tonal and rhythm scores were
summed to yield composite scores.
A Shapiro-Wilk Test of Normality yielded statistically significant results for all tests at p
< .05, as presented in Table 50, thus indicating a violation of the assumption of normality.
Additional evidence supporting the violation of the normality assumption, attributed to the
presence of outliers, was provided by a visual examination of Q-Q plots and boxplots.
Table 49
2012–2013 Grade 3 Correlation Matrix (Pooled)
Table 50
2012–2013 Grade 3 Shapiro-Wilk Test of Normality Results
The Wilcoxon Signed Rank test was used in lieu of the paired t-test, which required an
assumption of normal distribution. Although tonal and rhythm medians were similar (Mdn =
34.00 and 30.00, respectively), Spring tonal subtest means (33.27) were apt to be higher (p > .05)
than Fall tonal subtest means (32.91), Z= -.565, r = 0.007 and Spring rhythm subtest means
(29.46) higher than Fall rhythm subtest means (29.25), Z = -.540, r = 0.007. Spring composite
ranks (Mdn = 65.00) tended to be higher than Fall composite ranks (Mdn = 63.00), Z= -.647, r =
0.008. The difference was not statistically significant (p > .05) for any set of ranks; results can be
seen in Tables 51 and 52. From the non-significant results, it was concluded there was no
influence of instruction on tonal, rhythm, or composite music aptitude for this sample. Type and
quality of instruction was not investigated within this study’s design and might have had an
effect on music aptitude scores, as instruction inappropriately aligned with students’ musical
needs might have resulted in the non-significant score differences found in the current study.
This is further evidence that type and quality of instruction should be
investigated in future studies.
2013–2014 Grade 3 Scores.
Due to irregularities in student attendance, original sample size differed by subtest, as
displayed in Table 53. Missing values were imputed using predictive mean matching; Fall and
Spring mean scores were found to be comparable. Subtest scatterplots broadly approximated
linear relationships of Fall and Spring scores.
Consequently, a Pearson’s Product-Moment correlation test was conducted for each
corresponding pair of subtest scores. The results are exhibited in Table 54. Correlations of
corresponding tonal (r = .777), rhythm (r = .520), and composite (r = .723) scores were
statistically significant at p < .01 and effect sizes large. The tonal–rhythm subtest intercorrelation
coefficients ranged from .355 to .628. Correlations of composite scores were strong, ranging
from .638 to .880 for tonal–composite intercorrelations and .495 to .914 for rhythm–composite
intercorrelations, as would be expected when tonal and rhythm scores were included in
composite scores.
A Shapiro-Wilk Test of Normality yielded statistically significant results at p < .05, an
indication of a violation of the normality assumption required for the paired t-test; results are
displayed in Table 55. A visual examination of Normal Q-Q plots and boxplots confirmed the
suspicion of the normality assumption violation, likely due to the presence of outliers.
Table 51
2012–2013 Grade 3 Wilcoxon Signed Rank Test Results (Pooled)
Table 52
2012–2013 Grade 3 Wilcoxon Signed Rank Test Statistics (Pooled)
Table 53
2013–2014 Grade 3 Descriptive Statistics (Pooled)
Table 54
2013–2014 Grade 3 Correlation Matrix (Pooled)
Table 55
2013–2014 Grade 3 Shapiro-Wilk Test of Normality Results
In lieu of paired t-tests, a series of Wilcoxon Signed Rank tests was conducted on all
pairs of subtests. Results are featured in Tables 56 and 57. Spring tonal subtest ranks (Mdn =
34.00) were significantly higher (p = .001) than Fall tonal subtest ranks (Mdn = 33.00), Z = -3.234,
r = 0.41, a moderate effect. Although composite medians were similar (Mdn = 63.00),
Spring composite means (62.35) were apt to be higher than Fall composite means (61.50), Z = -1.099, r =
0.0014. Rhythm medians were also similar (Mdn = 30.00), yet Spring rhythm subtest means
(28.95) tended to be lower than Fall rhythm subtest means (29.26), Z = -.165, r = 0.002. The
mean difference was not statistically significant (p > .05) for rhythm or composite ranks. Gordon
(1986c) observed each child’s tonal scores often differed from their rhythm scores; therefore, it
was unsurprising the tonal and rhythm findings for this sample were also dissimilar. Whether the
difference was meaningful or caused by error of measurement was unknown, as Gordon (1986c)
noted (p. 67). The reason Fall–Spring tonal scores differed significantly for this particular sample
of students was also unknown, although it could be conjectured the significant increase in scores
was an indication tonal instruction had been compensatory (Reese & Shouldice, 2019). Thus, an
effect of instruction was speculated for tonal music aptitude for the 2013–2014 Grade 3 sample;
however, no effect of instruction was concluded for rhythm or composite music aptitude. It was
possible the formal rhythm instruction offered within the school environment might have been
inadequate for students who remained in the rhythm babble stage, resulting in the lack of rhythm
and composite score fluctuation seen in the findings of the current study.
Table 56
2013–2014 Grade 3 Wilcoxon Signed Rank Test Results (Pooled)
Table 57
2013–2014 Grade 3 Wilcoxon Signed Rank Test Statistics (Pooled)
2014–2015 Grade 3 Scores.
Imputation of missing values using predictive mean matching resulted in equal sample
sizes. All statistical tests were conducted from the imputed data set. Mean scores of
corresponding Fall and Spring tests were comparable as shown in Table 58.
Table 58
2014–2015 Grade 3 Descriptive Statistics (Pooled)
A Pearson’s Product-Moment correlation test was conducted on all scores; the tonal
correlation (r = .795), rhythm correlation (r = .694), and composite correlation (r = .828) were
significant (p < .01) and effect sizes large. Results are displayed in Table 59. Tonal–rhythm
intercorrelations also were significant; coefficients ranged from .423 to .685. Tonal–composite
intercorrelations, ranging from .726 to .910, and rhythm–composite intercorrelations, ranging
from .668 to .903, demonstrated a stronger association, as was anticipated due to tonal scores
and rhythm scores comprising composite scores. The absolute value of r was interpreted as a
small effect size (0.1), medium effect size (0.3), or large effect size (0.5).
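The benchmarks just stated can be expressed as a small lookup. The sketch below is illustrative only: it treats each benchmark (0.1, 0.3, 0.5) as the lower bound of its category, and the label "negligible" for values below 0.1 is an assumption that does not appear in this study.

```python
def interpret_r(r):
    """Label the absolute value of r using the 0.1 / 0.3 / 0.5 benchmarks."""
    size = abs(r)
    if size >= 0.5:
        return "large"
    if size >= 0.3:
        return "medium"
    if size >= 0.1:
        return "small"
    return "negligible"  # label assumed for illustration; not used in the study


# Coefficients taken from the 2014-2015 Grade 3 results reported above.
for coefficient in (0.795, 0.694, 0.828, 0.423):
    print(coefficient, interpret_r(coefficient))
```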
Results of a Shapiro-Wilk Test of Normality were predominantly significant (p < .05),
with the exception of rhythm scores and Spring composite scores (see Table 60). It was
determined the assumption of normality had been violated; Wilcoxon Signed Rank tests were
conducted in lieu of paired t-tests and the results exhibited in Tables 61 and 62.
Table 59
2014–2015 Grade 3 Correlation Matrix (Pooled)
Table 60
2014–2015 Grade 3 Shapiro-Wilk Test of Normality Results
Although Fall and Spring tonal medians were similar (Mdn = 34.00 and 33.00,
respectively), Spring tonal means (32.16) were inclined to exceed Fall tonal means (31.78), Z = -1.729,
r = .024. Composite medians were also similar (Mdn = 61.00), yet Spring composite
means (60.31) were apt to be slightly lower than Fall composite means (60.43), Z = -.091, r =
.001. Spring rhythm ranks (Mdn = 28.50) trended lower than Fall rhythm ranks (Mdn = 29.00), Z
= -.345, r = .005. No mean differences were statistically significant (p > .05). Spring scores
might be similar to or exceed Fall scores if instruction had been compensatory (students’ musical
needs were attenuated), complementary (students’ musical needs were satisfied), or both
(Gordon, 1986c, p. 76). Although no effect of instruction was concluded from the findings of the
2014–2015 Grade 3 sample, it was possible the decrease in rhythm scores after one year of
instruction might be indicative of inadequate type or quality of instruction for this sample of
students. If students had not yet transitioned from preparatory audiation, the formal rhythm
instruction offered could have been inappropriate for students’ musical needs.
2015–2016 Grade 3 Scores.
As anticipated, sample sizes, presented in Table 63, varied by subtest due to student
attendance on test administration days. Missing values were imputed using predictive mean
matching and the imputed data set used for statistical testing. Fall and Spring mean scores for
corresponding subtests were quite similar. A roughly linear relationship of corresponding subtest
scores was suggested from a visual examination of scatterplots.
Therefore, a Pearson’s Product-Moment correlation test was conducted for all subtests;
the correlation matrix may be found in Table 64. Correlations for corresponding tonal (r = .516),
rhythm (r = .415), and composite (r = .603) tests were statistically significant at p < .01 and the
effect sizes large. The tonal–rhythm intercorrelation coefficients ranged from .249 to .562.
Composite intercorrelations were moderate to strong: tonal–composite coefficients ranged from
.524 to .885; rhythm–composite coefficients ranged from .423 to .866. The absolute value of r
was interpreted as a small effect size (0.1), medium effect size (0.3), or large effect size (0.5).
Table 61
2014–2015 Grade 3 Wilcoxon Signed Rank Test Results (Pooled)
Results of a Shapiro-Wilk Test of Normality were statistically significant (p < .05) for
Grade 3 Spring tonal scores and Grade 3 Fall composite scores, as displayed in Table 65; however, Fall
tonal, Spring composite, and all rhythm scores were not statistically significant (p > .05). Thus, a
violation of the normality assumption for most tests was suggested from these results and the use
of paired t-tests was deemed inappropriate.
Table 62
2014–2015 Grade 3 Wilcoxon Signed Rank Test Statistics (Pooled)
Table 63
2015–2016 Grade 3 Descriptive Statistics (Pooled)
Therefore, a nonparametric Wilcoxon Signed Rank test was conducted for corresponding
pairs of subtest scores. Results are presented below in Tables 66 and 67. Although tonal and
rhythm medians were similar (Mdn = 32.00 and 28.00, respectively), Spring tonal subtest means
(31.18) were apt to be lower than Fall tonal subtest means (31.26), Z= -.796, r = 0.010 and
Spring rhythm means (27.71) higher than Fall rhythm means (27.62), Z = -.689, r = 0.009.
Spring composite ranks (Mdn = 60.00) were inclined to be higher than Fall composite ranks
(Mdn =59.00), Z= -.889, r = 0.011. The mean difference was not statistically significant (p > .05)
for tonal, rhythm, or composite ranks, and no effect of instruction was found for this sample. It
was possible students from this sample might have remained in preparatory audiation; if so, the
formal rhythm instruction offered likely was inappropriate for their musical needs. Gordon
(1986c) asserted complementary or compensatory instruction would maintain or increase scores;
therefore, it was possible the lack of complementary or compensatory instruction resulted in the
tonal score decrease observed in this sample. Although type and quality of instruction was not
investigated in the current study, it is suggested this topic be examined in future studies.
Table 64
2015–2016 Grade 3 Correlation Matrix (Pooled)
Table 65
2015–2016 Grade 3 Shapiro-Wilk Test of Normality Results
2016–2017 Grade 3 Scores.
As with tests from previous years, sample size differed by subtest due to student absences
on dates of test administration, as evidenced in Table 68. Predictive mean matching was used to
impute missing data; the imputed data set was used to conduct all statistical tests. Corresponding
Fall and Spring means were comparable. An approximate linear relationship between Fall and
Spring scores was suggested by a visual examination of subtest scatterplots.
Table 66
2015–2016 Grade 3 Wilcoxon Signed Rank Test Results (Pooled)
A Pearson’s Product-Moment correlation test was conducted for all subtests; results are
presented as a correlation matrix in Table 69. Statistically significant (p < .01) correlations were
interpreted for tonal (r = .571), rhythm (r = .507), and composite (r = .575) scores. Tonal–
rhythm intercorrelation coefficients ranged from .413 to .554. Composite correlations were
similarly strong: tonal–composite correlation coefficients ranged from .542 to .887; rhythm–
composite correlation coefficients ranged from .437 to .818. The absolute value of r was
interpreted as a small effect size (0.1), medium effect size (0.3), or large effect size (0.5).
A violation of the assumption of normality necessary for conducting paired t-tests was
concluded from statistically significant Shapiro-Wilk Test of Normality results (p < .05) for tonal
and composite scores. Rhythm results were not statistically significant (p > .05). Results are
illustrated in Table 70.
Table 67
2015–2016 Grade 3 Wilcoxon Signed Rank Test Statistics (Pooled)
Table 68
2016–2017 Grade 3 Descriptive Statistics (Pooled)
Table 69
2016–2017 Grade 3 Correlation Matrix (Pooled)
Table 70
2016–2017 Grade 3 Shapiro-Wilk Test of Normality Results
The contention the assumption of normality had been violated was supported by a visual
examination of Q-Q plots and boxplots of corresponding Fall and Spring scores. Consequently, a
nonparametric Wilcoxon Signed Rank test was conducted for all scores to investigate the
difference in scores of paired samples. Results are featured in Tables 71 and 72. Spring tonal
subtest ranks (Mdn = 33.00) tended to be higher than Fall tonal subtest ranks (Mdn = 32.00), Z =
-.595, r = 0.007. Although rhythm and composite medians were similar (Mdn = 29.00 and 60.00,
respectively), Spring rhythm means (27.89) were inclined to be slightly higher than Fall rhythm
means (27.87), Z = -.255, r = 0.003, and Spring composite means (58.69) slightly higher than
Fall composite means (58.63), Z= -.416, r = 0.005. The difference was not statistically
significant (p > .05) for any set of ranks. It appeared there was no effect of instruction for 2016–
2017 Grade 3 scores. However, an investigation of effect of type and quality of instruction was
recommended for future research, as it was possible the formal rhythm instruction offered in the
school environment might not have been appropriate for students who had not yet transitioned
from preparatory audiation, which in turn affected composite scores.
2017–2018 Grade 3 Scores.
As in all previous years, Grade 3 sample sizes in 2017–2018 differed according to student
attendance on test administration days, as displayed in Table 73. Missing values were imputed
using predictive mean matching and the imputed data set used to calculate all statistical tests.
From a visual examination of tonal, rhythm, and composite score scatterplots, a rough
approximation of linearity for corresponding Fall and Spring subtest scores was conjectured.
Therefore, a Pearson’s Product-Moment correlation test was conducted; results are displayed in
the correlation matrix in Table 74. Correlations for tonal (r = .743), rhythm (r = .381) and
composite (r = .410) scores were statistically significant (p < .05). Tonal–rhythm
intercorrelations ranged from .315 to .444. Tonal–composite intercorrelations were strong,
ranging from .627 to .861; rhythm–composite intercorrelations were more widely dispersed,
from .404 to .858. The absolute value of r was interpreted as a small effect size (0.1), medium
effect size (0.3), or large effect size (0.5).
Table 71
2016–2017 Grade 3 Wilcoxon Signed Rank Test Results (Pooled)
Table 72
2016–2017 Grade 3 Wilcoxon Signed Rank Test Statistics (Pooled)
Table 73
2017–2018 Grade 3 Descriptive Statistics (Pooled)
A violation of the assumption of normality necessary to conduct a paired t-test was
revealed by a significant result for Spring composite scores (p < .05) on the Shapiro-Wilk Test of
Normality (see Table 75). However, no other Shapiro-Wilk test results were statistically
significant at p < .05. Additional evidence to support the assumption of normality for most
subtests was provided through a visual examination of Q-Q plots and boxplots for corresponding
Fall and Spring tonal, rhythm, and composite scores.
Paired t-tests were conducted to examine mean differences in IMMA tonal, rhythm, and
composite scores of corresponding Fall and Spring test administrations. Results are displayed in
Table 76. No significant difference in mean Fall and Spring tonal scores (t(3876) = -.914, p
> .05), rhythm scores (t(4084) = 1.874, p > .05), or composite scores (t(4240) = 1.874, p > .05)
was found for this sample.
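For readers unfamiliar with the statistic, a paired t-test on a single complete data set can be sketched as follows. The scores are hypothetical, and the sign convention (Spring minus Fall) is an assumption. The large degrees of freedom reported above presumably reflect pooling across imputed data sets, which this single-data-set sketch does not reproduce; here the degrees of freedom are simply n - 1.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(fall, spring):
    """Return (t, df) for matched Fall/Spring scores, using Spring minus Fall."""
    diffs = [s - f for f, s in zip(fall, spring)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)  # sample SD of the differences
    return mean(diffs) / standard_error, n - 1


# Hypothetical scores for five students.
t_stat, df = paired_t([10, 10, 10, 10, 10], [11, 12, 13, 14, 15])
```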
2018–2019 Grade 3 Scores.
Inconsistent student attendance affected subtest sample size once again. Predictive mean
matching was used to impute missing values and all statistical tests were based on the imputed
data set. Mean scores for corresponding tonal, rhythm, and composite scores were comparable,
as in previous years. Details are displayed in Table 77.
Table 74
2017–2018 Grade 3 Correlation Matrix (Pooled)
Table 75
2017–2018 Grade 3 Shapiro-Wilk Test of Normality Results
A rough approximation of linearity for Fall and Spring scores was suggested by a visual
examination of scatterplots of tonal scores, rhythm scores, and composite scores. Therefore, a
Pearson’s Product-Moment Correlation test was conducted to estimate the relationships between
all subtest scores. Results are illustrated in the correlation matrix in Table 78. Statistically
significant (p < .01) correlations were found for tonal (r = .752), rhythm (r = .603), and
composite (r = .776) scores. The intercorrelation of tonal and rhythm scores was also significant
(p < .01) and ranged from .444 to .662. Intercorrelations with composite scores were statistically
significant (p < .01): tonal–composite correlation coefficients ranged from .718 to .913, rhythm–
composite correlation coefficients ranged from .595 to .887. The absolute value of r was
interpreted as a small effect size (0.1), medium effect size (0.3), or large effect size (0.5).
Table 76
2017–2018 Grade 3 Paired t-Test Results (Pooled)
Table 77
2018–2019 Grade 3 Descriptive Statistics (Pooled)
Results of a Shapiro-Wilk Test of Normality, presented in Table 79, were significant (p <
.05) for half of the subtests (Grade 3 Fall tonal, Grade 3 Spring tonal, and Grade 3 Spring
composite), indicating a violation of the assumption of normality. However, Shapiro-Wilk results
for Grade 3 Fall rhythm, Grade 3 Fall composite, and Grade 3 Spring rhythm subtest scores were
not statistically significant (p > .05), suggesting an approximation of a normal distribution. The
contention that the assumption of normality had been violated was supported by a visual examination
of Normal Q-Q plots and boxplots of Fall and Spring tonal, rhythm, and composite scores. A
linear relationship was approximated in the Normal Q-Q plots of Fall and Spring rhythm scores
before and after imputation, with no outliers indicated in SPSS-generated boxplots.
Table 78
2018–2019 Grade 3 Correlation Matrix (Pooled)
As a result of the normality assumption violation, paired t-tests were deemed
inappropriate. Instead, a Wilcoxon Signed Rank test was conducted to examine differences in
mean scores for corresponding Fall and Spring subtests and the results featured in Tables 80 and
81. Spring tonal subtest ranks (Mdn = 33.00) were inclined to be higher than Fall tonal subtest
ranks (Mdn = 32.00), Z = -1.815, r = 0.024, and Spring composite ranks (Mdn = 61.00) higher
than Fall composite ranks (Mdn = 60.00), Z = -1.425, r = 0.019. Although rhythm medians were
similar (Mdn = 28.00), Spring rhythm subtest means (28.17) tended to be higher than Fall
rhythm subtest means (27.79), Z = -.994, r = 0.013. No significant difference was estimated for
any set of scores (p > .05); therefore, no effect of instruction was concluded for this sample. It
was possible the tonal, rhythm, and composite score increases were the result of complementary
or compensatory instruction for this sample of students (Gordon, 1986c, p. 76); however, the
type and quality of instruction was not examined in the current study. Nevertheless, an
examination of the effect of type and quality of instruction was recommended for future study.
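The Z and r values reported for the Wilcoxon Signed Rank test can be related to the underlying ranks with a minimal sketch. It assumes the normal approximation without a tie correction, drops zero differences, and computes r as |Z| divided by the square root of the number of non-zero pairs; some authors divide by the total number of observations instead. Statistical packages may report Z with the opposite sign depending on which ranks they sum. The data are hypothetical.

```python
from math import sqrt


def wilcoxon_signed_rank(fall, spring):
    """Return (W_plus, z, r) for matched pairs via the normal approximation."""
    diffs = [s - f for f, s in zip(fall, spring) if s != f]  # zero differences dropped
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))

    # Assign ranks to |differences|, averaging ranks within tied groups.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = average_rank
        i = j + 1

    w_plus = sum(rank for rank, d in zip(ranks, diffs) if d > 0)
    expected = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)  # no tie correction in this sketch
    z = (w_plus - expected) / sd
    return w_plus, z, abs(z) / sqrt(n)


# Hypothetical scores: every student gains between 1 and 5 points.
w_plus, z, r = wilcoxon_signed_rank([10, 10, 10, 10, 10], [11, 12, 13, 14, 15])
```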
Table 79
2018–2019 Grade 3 Shapiro-Wilk Test of Normality Results
2019–2020 Grade 3 Scores.
Student absences on test administration dates produced unequal sample sizes, which were
mitigated through imputation of missing values using predictive mean matching. In addition,
remote online instruction due to the COVID-19 pandemic precluded a Spring 2020 IMMA test
administration; these missing scores were also imputed. The resulting data set was used for all
subsequent statistical tests. Mean tonal, rhythm, and composite scores on corresponding Fall and
Spring tests were comparable, as exhibited in Table 82.
A Pearson’s Product-Moment correlation test was conducted on all scores and results
presented in Table 83. The tonal correlation (r = .779), rhythm correlation (r = .677), and
composite correlation (r = .820) were statistically significant (p < .01), as were all
intercorrelations (p < .05): tonal–rhythm intercorrelations ranged from .379 to .716, tonal–
composite intercorrelations from .704 to .922, and rhythm–composite intercorrelations from .657
to .918. Composite intercorrelations were expected to be moderate-to-strong, as composite scores
are composed of tonal scores and rhythm scores and are thus related. The absolute value of r was
interpreted as a small effect size (0.1), medium effect size (0.3), or large effect size (0.5).
Table 80
2018–2019 Grade 3 Wilcoxon Signed Rank Test Results (Pooled)
With the exception of Grade 3 Fall rhythm scores, results of a Shapiro-Wilk Test of
Normality (see Table 84) were statistically significant (p < .05). Thus, a violation of the
normality assumption was concluded and paired t-tests were deemed inappropriate. Instead, the
nonparametric Wilcoxon Signed Rank test was conducted for estimation of mean differences of
matched pairs. Results are featured in Tables 85 and 86.
Although Fall and Spring tonal medians were similar (Mdn = 33.00 and 32.50,
respectively) and Fall and Spring composite medians were identical (Mdn = 60.00), Spring tonal
means (31.63) were inclined to exceed Fall tonal means (31.30), Z = -1.280, r = .018, and Spring
composite means (58.62) to exceed Fall composite means (58.36), Z = -.082, r = .001. Mean
Spring rhythm ranks (Mdn = 28.00) tended to be higher than Fall rhythm ranks (Mdn = 27.00), Z
= -.433, r = .006. However, no mean rank difference was statistically significant (p > .05).
Although no effect of instruction was hypothesized for students in the stabilized music aptitude
stage (Gordon, 2013, p. 15), a score increase was perhaps indicative of instruction well-suited to
the needs of this student sample (Gordon, 1986c, p. 76). Examination of the type and quality of
instruction was beyond the parameters of the current study; it is recommended they be
considered in future studies.
Table 81
2018–2019 Grade 3 Wilcoxon Signed Rank Test Statistics (Pooled)
An examination of Wilcoxon Signed Rank test results yielded a longitudinal perspective
of Grade 3 music aptitude through scrutiny of Grade 3 IMMA scores by grade level spanning a
period of 13 years. No effect of instruction on music aptitude was concluded from the minimal
score fluctuation in the current study, and achievement of the stabilized music aptitude stage
prior to Grade 3 was implied. Wilcoxon Signed Rank tests were used to compare IMMA Fall and
Spring scores; summaries of Wilcoxon Signed Rank test results by subtest are represented in
Figures 9, 10, and 11.
Table 82
2019–2020 Grade 3 Descriptive Statistics (Pooled)
Table 83
2019–2020 Grade 3 Correlation Matrix (Pooled)
Table 84
2019–2020 Grade 3 Shapiro-Wilk Test of Normality Results
There was a tendency for Spring tonal ranks to be higher than Fall tonal ranks for all
academic years except 2009–2010; Fall rhythm ranks were inclined to exceed Spring rhythm
ranks in approximately 60% of the years examined. Spring composite ranks were apt to surpass
Fall composite ranks for the majority of cases: it appeared tonal scores had a stronger influence
on composite scores for this data set. Nevertheless, significant mean differences were found only
for 2011–2012 and 2013–2014 tonal ranks and a medium effect size estimated; mean score
differences were not significantly different for most cases. Although tonal scores and composite
scores tended to increase from Fall to Spring, mean score differences were generally less than 1
point and non-significant. Gordon (2002) had noted an average yearly developmental music
aptitude score increase of 2 points with traditional instruction and optimal score increases when
instruction emphasized audiation; score differences in the current study were considerably
smaller. Thus, the mean score differences of the current study seemed to indicate either an
attainment of the stabilized music aptitude stage prior to Grade 3/age 8 or instruction that was ill-suited to the musical needs of the students. It was concluded the relative constancy of tonal
scores, rhythm scores, and composite scores did not support an effect of instruction on music
Table 85
2019–2020 Grade 3 Wilcoxon Signed Rank Test Results (Pooled)
Table 86
2019–2020 Grade 3 Wilcoxon Signed Rank Test Statistics (Pooled)
aptitude. With appropriate informal guidance and formal instruction, children’s birth level of
developmental music aptitude might be aspired to; without it, developmental music aptitude
would likely decrease (Gordon, 2001b, p. 83).
Figure 9
Wilcoxon Signed Rank Test Results (Tonal)
It is critical that guidance and instruction of students still in the developmental music
aptitude stage be compensatory as well as complementary to the students’ level of music aptitude
in order to maintain or increase student music aptitude toward birth levels (Gordon, 1986c, p.
76). Therefore, it must be considered that the formal instruction provided to the current study’s
Figure 10
Wilcoxon Signed Rank Test Results (Rhythm)
sample, with the exception of tonal instruction for 2011–2012 and 2013–2014 students, was not
sufficient for students’ musical needs, as no effect of instruction was found. Grade 3 students
were purported to be in the developmental music aptitude stage and it was expected their scores
would continue to fluctuate in response to influence of the musical environment. Lack of
significant score change could have indicated instruction was inadequate. Although the effect of
type of instruction was beyond the parameters of the current study, further research on this topic
is recommended.
Figure 11
Wilcoxon Signed Rank Test Results (Composite)
Research Question 3
Is there evidence to substantiate the transition between the developmental music
aptitude stage and stabilized aptitude stage at age 9/Grade 4?
One-Way Repeated Measures ANOVA Results
In order to examine the longitudinal change in tonal scores, rhythm scores, and composite
scores to consider a period of transition as queried in Research Question 3, a series of one-way
repeated measures ANOVA tests was conducted using scores from all combinations of available
Fall and Spring scores of consecutive third-, fourth-, and fifth-grade years. A limitation of the
current study was the decision to examine only scores of students who had been administered
IMMA for three consecutive years rather than all available data for Grades 3, 4, and 5. Missing
scores were imputed using predictive mean matching; the imputed data set was used to conduct
all statistical tests. Results of tonal, rhythm, and composite findings are combined, summarized,
and presented in Tables 87, 88, and 89.
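The omnibus statistic of a one-way repeated measures ANOVA can be sketched from its sums of squares. The example below assumes complete data with one score per student per grade level and assumes sphericity (no correction applied); the function name and scores are hypothetical and unrelated to the study's data.

```python
def repeated_measures_anova(scores):
    """scores: one list per student, holding one score per grade level.

    Returns (F, df_effect, df_error) for the within-subjects effect.
    """
    n, k = len(scores), len(scores[0])
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    level_means = [sum(row[j] for row in scores) / n for j in range(k)]
    student_means = [sum(row) / k for row in scores]

    ss_levels = n * sum((m - grand_mean) ** 2 for m in level_means)
    ss_students = k * sum((m - grand_mean) ** 2 for m in student_means)
    ss_total = sum((v - grand_mean) ** 2 for row in scores for v in row)
    # Error is what remains after removing level and between-student variation.
    ss_error = ss_total - ss_levels - ss_students

    df_effect, df_error = k - 1, (k - 1) * (n - 1)
    f_value = (ss_levels / df_effect) / (ss_error / df_error)
    return f_value, df_effect, df_error


# Hypothetical Grade 3 / Grade 4 / Grade 5 scores for three students.
f_value, df_effect, df_error = repeated_measures_anova(
    [[1, 2, 4], [2, 3, 4], [3, 5, 6]]
)
```

Removing between-student variation from the error term is what distinguishes this design from a between-groups ANOVA and is why consistent within-student gains can be detected even when students differ widely from one another.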
Table 87
Repeated Measures ANOVA Combined Results (Tonal)
Significant mean increases in tonal scores from Grade 3 to Grade 5 were found for
Groups C and D. For these two cases, tonal scores tended to increase throughout the 3-year
period, as illustrated in Table 87. However, mean differences from adjacent grade levels were
not significant. A significant increase in tonal scores also was found from Grade 3 to Grade 5 for
Group A; in addition, tonal scores increased significantly from Grade 3 to Grade 4 for this
sample. However, no significant score difference was found for Grade 4 to Grade 5. For this
case, tonal score increase seemed to taper after Grade 4. No significant mean differences were
found for Groups B, E, and F. From these results, a period of transition was speculated for tonal
music aptitude; nevertheless, the findings were not conclusive.
As evident from the results exhibited in Table 88, a similar trend of overall growth in
rhythm scores from Grade 3 to Grade 5 without significant increase in adjacent grade levels was
seen for Group E. No significant score difference was found for Groups A, C, D, or F. An
unusual pattern of score fluctuation was interpreted for Group B, in which rhythm scores
decreased from Grade 3 to Grade 4, then increased from Grade 4 to Grade 5. From the findings
Table 88
Repeated Measures ANOVA Combined Results (Rhythm)
of Group B, it could be speculated rhythm scores fluctuated between the developmental and
stabilized music aptitude stage during the 3-year period in question, as students transitioned from
one stage into the next. Nevertheless, a period of transition was not substantiated from the
majority of rhythm results.
Likely due to the influence of its component tonal and rhythm parts, significant mean
differences were noted for composite scores for most grade level groupings, as displayed in
Table 89. For Groups A, C, and D, significant score increases greater than 2 points were found
from Grade 3 to Grade 4 and from Grade 3 to Grade 5. No significant mean difference was noted
from Grade 4 to Grade 5; perhaps score gains tapered as chronological age increased, as Gordon
(1981) had asserted. A broad period of transition between the developmental and stabilized
music aptitude stages might also account for the inconsistency of score fluctuation. It was
conjectured tonal music aptitude had more influence on composite music aptitude for Groups A,
C, and D, as rhythm findings for those groups were not significant. A significant score increase
was noted from Grade 3 to Grade 5 for Group E; however, no significant mean differences were
found for adjacent grades. Again, a period of transition might account for this inconsistency. No
significant mean score difference was found for composite scores of Group F. An atypical score
trend similar to that found for rhythm scores was observed for Group B: composite scores
decreased from Grade 3 to Grade 4, then increased from Grade 4 to Grade 5. It was unlikely
students were in the stabilized music aptitude stage in Grade 3, as a decrease in score fluctuation
might suggest, yet returned to the developmental music aptitude stage in Grade 4, as an increase
in score fluctuation might suggest. Therefore, a period of transition was conjectured that might
account for this discrepancy in direction of score change and accommodate an ebb and flow
between stages of music aptitude.
Table 89
Repeated Measures ANOVA Combined Results (Composite)
In the following section, results were categorized, presented, and investigated by 3-year
grade level groupings (Grade 3/Grade 4/Grade 5), labeled A-F, and subtest. Descriptive statistics,
repeated measures ANOVA findings, multivariate confirmatory test results, and Bonferroni post
hoc findings were interpreted to provide a more in-depth examination.
Group A: Fall 2016 (Grade 3), Fall 2017 (Grade 4), Fall 2018 (Grade 5)
Tonal scores were available for students who were administered IMMA in consecutive
years during the Fall of their third-, fourth-, and fifth-grade years. Mean tonal scores are
exhibited in Table 90.
Table 90
Group A: Fall 2016/Fall 2017/Fall 2018 Descriptive Statistics Pooled Results (Tonal)
The results of Mauchly’s Test of Sphericity are displayed in Table 91. It was concluded
variances of differences between levels of tonal subtest results for Fall test administrations in
Grades 3–5 were significantly different. The assumption of sphericity had been violated (χ²(2) =
15.820, p < .05), resulting in an inaccurate F-test (Field, 2009, p. 476). However, application of
the Greenhouse-Geisser correction yielded an adjusted F-value and degrees of freedom (F(1.62,
99.04) = 9.136, p = .001), as did the Huynh-Feldt correction (F(1.67, 101.37) = 9.136, p = .001).
The result was statistically significant and the effect size moderate (ω² = .115) (Hatcher, 2013, p.
370), as evidenced in Table 92. This conclusion was supported by statistically significant
multivariate test results, displayed in Table 93: Pillai’s Trace = .193, F(2, 60) = 7.197, p = .002,
ηp² = .193. There were significant differences in mean tonal scores of students in Grades 3, 4,
and 5 (Field, 2009, p. 477).
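The fractional degrees of freedom reported above arise because a sphericity correction multiplies both the effect and error degrees of freedom by an estimate of epsilon. In the sketch below, the epsilon value (0.8118) and the sample size (n = 62) are assumptions: neither is stated in the text, n is inferred from the multivariate F(2, 60), and epsilon is back-calculated so that the output resembles the reported Greenhouse-Geisser values.

```python
def corrected_degrees_of_freedom(epsilon, levels, n_students):
    """Scale repeated measures ANOVA degrees of freedom by a sphericity epsilon."""
    df_effect = epsilon * (levels - 1)
    df_error = epsilon * (levels - 1) * (n_students - 1)
    return df_effect, df_error


# Assumed epsilon and n (see the note above); three grade levels.
df_effect, df_error = corrected_degrees_of_freedom(0.8118, levels=3, n_students=62)
```

The F-value itself is unchanged by the correction, which is why the same F = 9.136 appears under both the Greenhouse-Geisser and Huynh-Feldt rows; only the reference distribution, and therefore the p-value, is adjusted.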
The significant results of the Bonferroni post hoc test, displayed as pairwise comparisons
in Table 94, might be interpreted as follows: on average, Grade 4 tonal scores were 2.149 points
higher than Grade 3 tonal scores (p = .003, 95% CI [-3.644, -.653]) and Grade 5 tonal scores
were 2.335 points higher than Grade 3 tonal scores (p < .001, 95% CI [-3.534, -1.136]). The
mean difference in Grade 4 and Grade 5 tonal scores was not significant.
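The Bonferroni adjustment used in the pairwise comparisons can be sketched as multiplying each unadjusted p-value by the number of comparisons and capping the result at 1. The unadjusted p-values below are hypothetical; the values reported in the text are presumably already adjusted.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical unadjusted p-values for Grade 3 vs 4, Grade 3 vs 5, and Grade 4 vs 5.
adjusted = bonferroni_adjust([0.001, 0.02, 0.40])
```

With three grade levels there are three pairwise comparisons, so a comparison must reach an unadjusted p below roughly .0167 to remain significant at the .05 level after adjustment.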
Grade 4 and Grade 5 tonal scores were significantly greater than Grade 3 tonal scores for
this sample, which created the appearance that musical environment continued to influence tonal
music aptitude. However, the mean difference in Grade 4 and Grade 5 tonal scores was not
significant; the interpretation that musical environment had ceased to influence music aptitude
Table 91
Group A: Fall 2016/Fall 2017/Fall 2018 Mauchly’s Test of Sphericity Results (Tonal)
Table 92
Group A: Fall 2016/Fall 2017/Fall 2018 Tests of Within-Subjects Effects Results (Tonal)
Table 93
Group A: Fall 2016/Fall 2017/Fall 2018 Multivariate Test Results (Tonal)
Table 94
Group A: Fall 2016/Fall 2017/Fall 2018 Pairwise Comparisons Pooled Results (Tonal)
after Grade 4 was supported by this lack of significant score change. It was possible students
remained in the developmental tonal aptitude stage through Grade 5 and score difference tapered
between Grades 4 and 5; Gordon (1981) asserted environmental influence decreased as the
student’s chronological age increased. However, it also was possible the type of instruction in
Grades 4 and 5 had a limiting effect on growth of developmental tonal aptitude for the current
sample. While examination of type and quality of instruction was beyond the parameters of the
current study, the effect of type and quality of instruction on tonal aptitude
should be investigated in future studies. It was speculated a broad period of transition from
Grade 3 through Grade 5 could explain the discrepancy in tonal score fluctuation.
Mean rhythm scores for Fall 2016 (Grade 3), Fall 2017 (Grade 4), and Fall 2018 (Grade
5) are featured in Table 95. From the results of Mauchly’s Test of Sphericity, presented as Table
96, it was concluded variances of differences between levels of rhythm subtest results for Fall
test administrations in Grades 3–5 were not considered significantly different and the condition
of sphericity had been met (χ²(2) = 2.162, p > .05).
Table 95
Group A: Fall 2016/Fall 2017/Fall 2018 Descriptive Statistics Pooled Results (Rhythm)
The results of a one-way repeated-measures ANOVA are presented in Table 97. The
differences in rhythm scores from Grades 3–5 were not statistically significant, F(2, 130) =
1.947, p > .05; this conclusion was supported by multivariate test results displayed in Table 98:
Pillai’s Trace = .057, F(2, 64) = 1.940, p > .05, ηp² = .057. It was concluded the relative stability
Table 96
Group A: Fall 2016/Fall 2017/Fall 2018 Mauchly’s Test of Sphericity Results (Rhythm)
Table 97
Group A: Fall 2016/Fall 2017/Fall 2018 Tests of Within-Subjects Effects Results (Rhythm)
of IMMA rhythm scores from Grade 3 to Grade 5 implied students previously had attained the
stabilized rhythm aptitude stage. Thus, a period of transition was not supported for rhythm
aptitude for this sample. This finding contrasted with that for tonal aptitude for Group A, in
which significant score fluctuation seemed to indicate the stabilized music aptitude stage had not
yet been reached. It was conceivable tonal aptitude and rhythm aptitude operated separately in
the transition between the developmental and stabilized aptitude stages.
Table 98
Group A: Fall 2016/Fall 2017/Fall 2018 Multivariate Test Results (Rhythm)
Mean Fall composite scores of students in Grades 3–5 from 2016–2018 are presented in
Table 99. From the results of Mauchly’s Test of Sphericity, presented as Table 100, it was
concluded variances of differences between levels of composite test results for Fall test
administrations in Grades 3–5 were significantly different and the condition of sphericity had not
been met (χ²(2) = 11.371, p < .05). The violation of sphericity resulted in an inaccurate F-test
(Field, 2009, p. 476). However, application of the Greenhouse-Geisser correction yielded an
adjusted F-value and degrees of freedom (F(1.69, 94.37) = 6.487, p = .002); the effect size was
moderate (ω² = .087) (Hatcher, 2013, p. 370). Results of the Huynh-Feldt correction were similar
(F(1.732, 96.98) = 6.487, p = .002). These statistically significant results are displayed in Table
101. An examination of the statistically significant multivariate test results, presented as Table
102, confirmed this finding: Pillai’s Trace = .139, F(2, 55) = 4.449, p = .016, ηp² = .139.
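The relationship among the multivariate statistics reported above can be checked with a short sketch. For a single within-subjects factor, Pillai's trace V converts exactly to F as V / (1 - V) multiplied by the ratio of error to hypothesis degrees of freedom, and partial eta squared equals V, which is why the two values coincide in the text. The function name is hypothetical; the inputs are the rounded values reported for the Group A composite scores.

```python
def pillai_trace_to_f(trace, df_hypothesis, df_error):
    """Convert Pillai's trace to F for a single within-subjects effect."""
    return (trace / (1 - trace)) * (df_error / df_hypothesis)


# Reported values: Pillai's Trace = .139 with F(2, 55).
f_from_trace = pillai_trace_to_f(0.139, df_hypothesis=2, df_error=55)
```

The result is approximately 4.44, which agrees with the reported F = 4.449 within the range allowed by rounding the trace to three decimal places.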
Table 99
Group A: Fall 2016/Fall 2017/Fall 2018 Descriptive Statistics Pooled Results (Composite)
Table 100
Group A: Fall 2016/Fall 2017/Fall 2018 Mauchly’s Test of Sphericity Results (Composite)
Significant mean differences in composite scores favoring the older grade level were
suggested from results of a one-way repeated measures ANOVA. The results of the Bonferroni
post hoc test, displayed as pairwise comparisons in Table 103, were interpreted as follows: on
average, Grade 4 composite scores were 2.887 points higher than Grade 3 composite scores (p =
.007, 95% CI [-5.074, -.701]) and Grade 5 composite scores 3.009 points higher than Grade 3
composite scores (p = .003, 95% CI [-5.074, -.944]). However, there was no significant difference
in mean scores of Grades 4 and 5. It was concluded IMMA composite scores continued to be
influenced by musical environment through Grade 5/age 10, thus supporting the contention
students remained in the developmental music aptitude stage and had not yet achieved the
Table 101
Group A: Fall 2016/Fall 2017/Fall 2018 Tests of Within-Subjects Effects Results (Composite)
Table 102
Group A: Fall 2016/Fall 2017/Fall 2018 Multivariate Test Results (Composite)
Table 103
Group A: Fall 2016/Fall 2017/Fall 2018 Pairwise Comparisons Pooled Results (Composite)
stabilized music aptitude stage. This conclusion disputed Gordon’s findings that music aptitude
stabilized at age 9. The mean score difference was moderate, an indication score fluctuation had
not abated. Gordon (2002) had suggested a score increase averaging 2 points could be expected
with use of traditional instruction; score growth in the current study exceeded this average. The
lack of significant mean composite score difference between Grades 4 and 5 could be explained
as the natural waning of influence of musical environment with the increase of chronological age
(Gordon, 1981), instruction ill-suited to students’ current level of audiation, or a period of
transition between the developmental and stabilized composite aptitude stages.
Group B: Fall 2017 (Grade 3), Fall 2018 (Grade 4), Fall 2019 (Grade 5)
Mean tonal scores of students from Grades 3–5 are displayed in Table 104. From the
results of Mauchly’s Test of Sphericity, presented as Table 105, it was concluded variances of
differences between levels of tonal test results for Fall test administrations in Grades 3–5 were
not significantly different and the condition of sphericity had been met (χ²(2) = 5.209, p > .05).
Table 104
Group B: Fall 2017/Fall 2018/Fall 2019 Descriptive Statistics Pooled Results (Tonal)
Table 105
Group B: Fall 2017/Fall 2018/Fall 2019 Mauchly’s Test of Sphericity Results (Tonal)
The difference in tonal scores from Grades 3–5, displayed as Table 106, was statistically
significant, F(2, 102) = 3.599, p = .031; the effect size was small (ω² = .047) (Hatcher, 2013, p.
370). This conclusion was disputed by the multivariate test results displayed in Table 107, which
were not statistically significant: Pillai’s Trace = .109, F(2, 50) = 3.045, p > .05, ηp² = .109. A
Bonferroni post hoc test was conducted; results are exhibited as pairwise comparisons in Table
108. No set of tonal score differences was found significantly different from another (p > .05).
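The omnibus and multivariate p-values in this paragraph can be verified by hand, because the F distribution has a closed-form upper-tail probability when the numerator degrees of freedom equal 2, as they do for three grade levels with sphericity met. The function name is hypothetical; the inputs are the F statistics reported above.

```python
def f_upper_tail_two_numerator_df(f_value, df_error):
    """Upper-tail probability of F(2, df_error); valid only for 2 numerator df."""
    return (1 + 2 * f_value / df_error) ** (-df_error / 2)


# Omnibus test reported above: F(2, 102) = 3.599.
p_omnibus = f_upper_tail_two_numerator_df(3.599, 102)
# Multivariate test reported above: F(2, 50) = 3.045.
p_multivariate = f_upper_tail_two_numerator_df(3.045, 50)
```

The first probability is approximately .031 and the second falls just above .05, consistent with the omnibus result reaching significance while the multivariate result did not.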
Table 106
Group B: Fall 2017/Fall 2018/Fall 2019 Tests of Within-Subjects Effects Results (Tonal)
Table 107
Group B: Fall 2017/Fall 2018/Fall 2019 Multivariate Test Results (Tonal)
A statistically significant mean score difference was estimated for tonal scores, as
indicated from the omnibus results of a repeated measures ANOVA. However, no pair of tonal
scores was found significantly different from another (p > .05) in post hoc tests using the
Bonferroni correction. Therefore, it was concluded IMMA tonal scores remained relatively
stable from Grade 3 to Grade 5 in this sample and a period of transition was unsubstantiated.
Nevertheless, the effect of formal instruction lacking the informal guidance component necessary
to support students’ preparatory audiation needs must be considered as a possible influence on
these findings.
Mean rhythm scores for students from Grades 3–5 are displayed in Table 109. From the
results of Mauchly’s Test of Sphericity, presented as Table 110, it was concluded variances of
differences between levels of rhythm test results for Fall test administrations in Grades 3–5 were
not significantly different and the condition of sphericity had been met (χ²(2) = 2.089, p > .05).
Table 108
Group B: Fall 2017/Fall 2018/Fall 2019 Pairwise Comparison Pooled Results (Tonal)
Table 109
Group B: Fall 2017/Fall 2018/Fall 2019 Descriptive Statistics Pooled Results (Rhythm)
Table 110
Group B: Fall 2017/Fall 2018/Fall 2019 Mauchly’s Test of Sphericity Results (Rhythm)
A one-way repeated-measures ANOVA was conducted using Grade 3, Grade 4, and
Grade 5 scores from rhythm subtests. The results of the ANOVA are presented in Table 111. The
difference in rhythm scores from Grades 3–5 was statistically significant, F(2, 86) = 4.201, p =
.018, and the effect size moderate (ω² = .067) (Hatcher, 2013, p. 370). This conclusion was
supported by the multivariate test results displayed in Table 112, which were statistically
significant: Pillai’s Trace = .141, F(2, 42) = 3.439, p = .041, ηp² = .141. Both Grade 3 (1.971
points, p = .014) and Grade 5 (2.046 points, p = .006) mean rhythm scores were significantly
higher than Grade 4 rhythm scores, as indicated by the results of the Bonferroni post hoc test,
displayed as pairwise comparisons in Table 113.
Table 111
Group B: Fall 2017/Fall 2018/Fall 2019 Tests of Within-Subjects Effects Results (Rhythm)
Table 112
Group B: Fall 2017/Fall 2018/Fall 2019 Multivariate Test Results (Rhythm)
Mean rhythm scores differed significantly between grade levels, as indicated by omnibus
results of a repeated measures ANOVA. Results of a Bonferroni post hoc test revealed a
significant difference between Grade 4 rhythm scores and scores of adjacent grade levels.
However, scores did not fluctuate in a consistent direction, as mean scores decreased
from Grade 3 to Grade 4 but increased from Grade 4 to Grade 5. In addition, the mean difference
between rhythm scores of Grades 3 and 5 was not significant. The discrepancies of rhythm score
fluctuation were not consistent with previous patterns describing rhythm aptitude in the
developmental stage, defined by continual score fluctuation, or the stabilized stage, characterized
by the relative constancy of scores, as referenced in extant literature (Gordon, 1998, p. 10).
Table 113
Group B: Fall 2017/Fall 2018/Fall 2019 Pairwise Comparison Pooled Results (Rhythm)
Instead, Gordon (1986a, 2002) outlined decreasing gain scores and diminished influence of
instruction as indicators of a transition period, which did not characterize adequately the findings
of the current study. Perhaps a more expansive definition of a period of transition, during which
identified traits of developmental and stabilized rhythm aptitude vacillate and waver from their
established patterns as music aptitude shifts between stages, could be used to describe this
previously unaddressed pattern of rhythm score fluctuation. This finding was in contrast to that
of tonal scores for Group B, in which score fluctuation was relatively static. It was possible tonal
aptitude and rhythm aptitude functioned independently as the transition to the stabilized music
aptitude stage occurred.
Mean composite scores for students from Grades 3–5 (N = 65) are displayed in Table
114. From the results of Mauchly’s Test of Sphericity, presented as Table 115, it was concluded
variances of differences between levels of composite test results for Fall test administrations in
Grades 3–5 were not significantly different and the condition of sphericity had been met (χ²(2)
= 1.509, p > .05).
Table 114
Group B: Fall 2017/Fall 2018/Fall 2019 Descriptive Statistics Pooled Results (Composite)
Therefore, a one-way repeated measures ANOVA was conducted to examine the
longitudinal change in composite scores. The results are presented in Table 116. Mean
composite scores of students in Grades 3–5 were found significantly different, F(2,
80) = 4.913, p = .010. The effect size, calculated as omega squared (ω² = .086), was estimated as
moderate (Hatcher, 2013, p. 370). This omnibus conclusion was supported by the set of
multivariate test results displayed in Table 117, which were also statistically significant: Pillai’s
Trace = .187, F(2, 39) = 4.496, p = .017, ηp² = .187.
Table 115
Group B: Fall 2017/Fall 2018/Fall 2019 Mauchly’s Test of Sphericity Results (Composite)
Table 116
Group B: Fall 2017/Fall 2018/Fall 2019 Tests of Within-Subjects Effects Results (Composite)
The results of the Bonferroni post hoc test, displayed as pairwise comparisons in Table
118, were interpreted as follows: on average, Grade 3 composite scores were 2.358 points higher
than Grade 4 composite scores (p = .024, 95% CI [.255, 4.462]) and Grade 5 composite scores
3.149 points higher than Grade 4 composite scores (p = .002, 95% CI [1.067, 5.232]). However,
no significant difference was found for Grade 3 and Grade 5 composite scores (p > .05).
Table 117
Group B: Fall 2017/Fall 2018/Fall 2019 Multivariate Test Results (Composite)
Composite score results mirrored those of rhythm scores: mean composite scores differed
significantly between Grades 3 and 4 and Grades 4 and 5. However, the pattern of score
fluctuation throughout this 3-year period was unusual: composite scores decreased from Grade 3
to Grade 4, but increased from Grade 4 to Grade 5. No significant difference was found for
Grade 3 and Grade 5 composite scores. The developmental music aptitude stage had been
characterized as continually fluctuating (Moore, 1990), which differed from these results. The
stabilized music aptitude stage, defined by the relative constancy of scores (Gordon, 2005), also
differed from these results. Perhaps a broad period of transition might describe this distinctive
pattern of composite score fluctuation.
Table 118
Group B: Fall 2017/Fall 2018/Fall 2019 Pairwise Comparison Pooled Results (Composite)
Group C: Spring 2017 (Grade 3), Fall 2017 (Grade 4), Fall 2018 (Grade 5)
Mean tonal scores for students from Grades 3–5 are displayed in Table 119. From the
results of Mauchly’s Test of Sphericity, presented as Table 120, it was concluded variances of
differences between levels of tonal test results were significantly different and the condition of
sphericity had not been met (χ²(2) = 13.702, p = .001). The violation of sphericity resulted in an
inaccurate F-test (Field, 2009, p. 476). However, application of the Greenhouse-Geisser
correction yielded an adjusted F-value and degrees of freedom that were statistically significant
(F(1.67, 107.07) = 6.518, p = .004); the effect size was moderate (.078) (Hatcher, 2013, p.
370). Results of the Huynh-Feldt correction were similar (F(1.71, 109.61) = 6.518, p = .003).
These results are displayed in Table 121. This finding was supported by significant multivariate
test results, presented as Table 122: Pillai’s Trace = .146, F(2, 63) = 5.384, p = .007, ηp² = .146.
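The Greenhouse-Geisser adjustment described above can be illustrated with a minimal sketch. It assumes the standard Box epsilon formula applied to the covariance matrix of the repeated measures, and it assumes the corrected degrees of freedom are the uncorrected values multiplied by epsilon. The function names, both covariance matrices, and the values of epsilon (.8365) and sample size (65) are hypothetical and chosen only for illustration; they are not drawn from the archived IMMA data.

```python
def greenhouse_geisser_epsilon(cov):
    """Estimate Box's epsilon from a k x k covariance matrix (list of lists)."""
    k = len(cov)
    mean_diag = sum(cov[i][i] for i in range(k)) / k
    grand_mean = sum(sum(row) for row in cov) / (k * k)
    row_means = [sum(row) / k for row in cov]
    sum_sq = sum(x * x for row in cov for x in row)
    numerator = (k * (mean_diag - grand_mean)) ** 2
    denominator = (k - 1) * (
        sum_sq - 2 * k * sum(m * m for m in row_means) + (k * grand_mean) ** 2
    )
    return numerator / denominator


def corrected_df(epsilon, k, n):
    """Scale the uncorrected df, (k - 1) and (k - 1)(n - 1), by epsilon."""
    df_effect = epsilon * (k - 1)
    df_error = epsilon * (k - 1) * (n - 1)
    return df_effect, df_error


# Hypothetical compound-symmetric matrix: sphericity holds, so epsilon is 1.
spherical = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
# Hypothetical matrix with unequal covariances: epsilon falls below 1.
non_spherical = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.1], [0.1, 0.1, 1.0]]

print(greenhouse_geisser_epsilon(spherical))
print(greenhouse_geisser_epsilon(non_spherical))
# Assumed epsilon and sample size, for illustration only.
print(corrected_df(0.8365, k=3, n=65))
```

With three grade levels, epsilon ranges from .5 (maximal violation) to 1 (sphericity met), so the corrected numerator degrees of freedom range from 1 to 2.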
Table 119
Group C: Spring 2017/Fall 2017/Fall 2018 Descriptive Statistics Pooled Results (Tonal)
Table 120
Group C: Spring 2017/Fall 2017/Fall 2018 Mauchly’s Test of Sphericity Results (Tonal)
Bonferroni post hoc test results are displayed as pairwise comparisons in Table 123:
Grade 5 tonal scores averaged 1.623 points higher than Grade 3 scores (p = .011, 95% CI [.381,
2.865]). No significant difference was found between Grade 3 and 4 or Grade 4 and 5 (p > .05).
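The logic of a Bonferroni adjustment across the three pairwise grade comparisons can be sketched as follows. This is a minimal illustration, assuming the common form of the adjustment in which each unadjusted p-value is multiplied by the number of comparisons and capped at 1. The function name and the unadjusted p-values are hypothetical; the values reported in this study were already adjusted.

```python
def bonferroni_adjust(p_values):
    """Multiply each unadjusted p-value by the number of comparisons, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical unadjusted p-values for the three pairwise grade comparisons.
unadjusted = {"Grade 3 vs 4": 0.20, "Grade 4 vs 5": 0.45, "Grade 3 vs 5": 0.004}
adjusted = dict(zip(unadjusted, bonferroni_adjust(list(unadjusted.values()))))

for pair, p in adjusted.items():
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{pair}: adjusted p = {p:.3f} ({verdict})")
```

Because the adjustment is conservative, a difference spanning Grades 3 to 5 can remain significant while smaller adjacent-grade differences do not.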
From the results of a repeated measures ANOVA with the Greenhouse-Geisser and
Huynh-Feldt corrections, it was concluded a period of transition was possible between Grades 3,
4, and 5. An approach to the stabilized music aptitude stage was suggested by tapering of score
fluctuation of adjacent grades (Gordon, 1986a). However, mean tonal scores of Grade 3 and
Grade 5 were significantly different, which supported the contention students had not yet fully
achieved the stabilized tonal aptitude stage, as scores continued to increase due to influence of
the musical environment. A transition period between the developmental music aptitude stage,
marked by score fluctuation, and the stabilized music aptitude stage, marked by relative score
constancy, might explain the lack of significant difference in scores of adjacent grades as well as
the significant difference in scores of Grades 3 and 5. The proposed transition period was
characterized loosely by Gordon’s (1986a) description of decreasing gain scores and diminishing
influence of instruction as indicators of a transition period between stages of music aptitude.
Table 121
Group C: Spring 2017/Fall 2017/Fall 2018 Tests of Within-Subjects Effects Results (Tonal)
Table 122
Group C: Spring 2017/Fall 2017/Fall 2018 Multivariate Test Results (Tonal)
Table 123
Group C: Spring 2017/Fall 2017/Fall 2018 Pairwise Comparison Pooled Results (Tonal)
Mean rhythm scores for students from Grades 3–5 are displayed in Table 124. From the
results of Mauchly’s Test of Sphericity, exhibited as Table 125, it was concluded variances of
differences between levels of rhythm test results were not significantly different and the
condition of sphericity had been met (χ²(2) = .926, p > .05).
A one-way repeated-measures ANOVA was conducted using Grade 3, Grade 4, and
Grade 5 rhythm scores. The results of the rhythm ANOVA are presented in Table 126; the
difference in rhythm scores from Grades 3–5 was not statistically significant, F(2, 126) = .786, p
> .05. This conclusion was supported by the multivariate test results displayed in Table 127,
which were not statistically significant: Pillai’s Trace = .022, F(2, 62) = .697, p > .05, ηp² = .022.
The lack of equivalent levels of significant fluctuation of tonal scores and rhythm scores might
indicate stabilization of rhythm aptitude independent from that of tonal aptitude. It was
interpreted that Group C’s rhythm aptitude might have stabilized prior to Grade 3, as mean
rhythm score differences were non-significant. On the other hand, it appeared tonal aptitude of
Group C remained developmental, as tonal scores continued to fluctuate significantly from
Grade 3 through Grade 5.
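The omnibus test in a one-way repeated measures ANOVA partitions total variability into grade-level, student, and residual components. A minimal sketch of that partition follows, assuming complete data (every student tested at every grade level). The function name and the scores are hypothetical and do not come from the archived data.

```python
def repeated_measures_anova(scores):
    """Return (F, df_effect, df_error) for rows = students, columns = grade levels."""
    n = len(scores)          # number of students
    k = len(scores[0])       # number of grade levels
    grand = sum(x for row in scores for x in row) / (n * k)
    grade_means = [sum(row[j] for row in scores) / n for j in range(k)]
    student_means = [sum(row) / k for row in scores]

    ss_grades = n * sum((m - grand) ** 2 for m in grade_means)
    ss_students = k * sum((m - grand) ** 2 for m in student_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_grades - ss_students

    df_effect = k - 1
    df_error = (k - 1) * (n - 1)
    f_value = (ss_grades / df_effect) / (ss_error / df_error)
    return f_value, df_effect, df_error


# Hypothetical scores for three students at Grades 3, 4, and 5.
example = [[1, 2, 4], [2, 3, 4], [3, 4, 4]]
print(repeated_measures_anova(example))
```

Under this formula, three grade levels and 126 error degrees of freedom correspond to 64 students, because (3 - 1)(64 - 1) = 126.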
Table 124
Group C: Spring 2017/Fall 2017/Fall 2018 Descriptive Statistics Pooled Results (Rhythm)
Table 125
Group C: Spring 2017/Fall 2017/Fall 2018 Mauchly’s Test of Sphericity Results (Rhythm)
Table 126
Group C: Spring 2017/Fall 2017/Fall 2018 Tests of Within-Subjects Effects Results (Rhythm)
Table 127
Group C: Spring 2017/Fall 2017/Fall 2018 Multivariate Test Results (Rhythm)
Mean rhythm scores did not differ significantly between grade levels, as indicated by
results of a repeated measures ANOVA. This conclusion was supported by Pillai’s Trace test
results (p > .05). IMMA rhythm scores were relatively stable from Grade 3 through Grade 5;
thus, it was conjectured students were beyond the developmental rhythm aptitude stage, noted
for continual score fluctuation (Gordon, 2012, p. 47). Although Gordon (1981) acknowledged a
decrease in the effect of environment as children approached age 8, attainment of the stabilized
music aptitude stage prior to Grade 3 would have been contrary to the findings of some extant
literature (Gordon, 2013, p. 14), but would confirm the findings of researchers such as
DeYarman (1972, 1975), Harrington (1969), and Schleuter and DeYarman (1977). Nevertheless,
without an examination of scores preceding Grade 3, it was not possible to ascertain when score
fluctuation tapered and the shift to the stabilized rhythm aptitude stage had begun in the current
study. Therefore, future studies with an expanded range of grade levels, including those
preceding and succeeding those of the current sample, are recommended to further clarify when
the effect of the musical environment recedes and the stabilized music aptitude stage is attained.
Mean composite scores for students from Grades 3–5 are displayed in Table 128. From
the results of Mauchly’s Test of Sphericity, presented as Table 129, it was concluded variances
of differences between levels of composite test results were significantly different and the
condition of sphericity had been violated (χ²(2) = 28.841, p < .001). The violation of sphericity
resulted in an inaccurate F-test (Field, 2009, p. 476). However, application of the
Greenhouse-Geisser correction yielded a statistically significant adjusted F-value and degrees of freedom
(F(1.42, 77.80) = 6.636, p = .006); the effect size was moderate (.090) (Hatcher, 2013, p.
370). Results of the Huynh-Feldt correction were similar (F(1.44, 79.26) = 6.636, p = .005).
ANOVA results are displayed in Table 130 and supported by statistically significant multivariate
test results, presented as Table 131: Pillai’s Trace = .128, F(2, 54) = 3.977, p = .024, ηp² = .128.
The results of the Bonferroni post hoc test are exhibited as pairwise comparisons in Table
132: on average, Grade 4 composite scores were 2.586 points higher than Grade 3 composite
scores (p = .05, 95% CI [.024, 5.148]) and Grade 5 composite scores were 2.734 points higher
than Grade 3 composite scores (p = .015, 95% CI [.461, 5.008]). However, the difference
between Grade 4 and Grade 5 composite scores was not significant (p > .05).
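The reported confidence intervals can be read back against the mean differences with a short sketch. It assumes the intervals are symmetric around the mean difference, so the midpoint recovers the mean difference and an interval that excludes zero corresponds to a significant comparison. The function name is hypothetical; the interval bounds are those reported above.

```python
def summarize_interval(lower, upper):
    """Return the interval midpoint, half-width, and whether zero lies outside it."""
    midpoint = (lower + upper) / 2
    half_width = (upper - lower) / 2
    excludes_zero = lower > 0 or upper < 0
    return midpoint, half_width, excludes_zero


# Interval bounds as reported for the Group C composite comparisons.
intervals = {
    "Grade 4 vs Grade 3": (0.024, 5.148),
    "Grade 5 vs Grade 3": (0.461, 5.008),
}
for label, (lower, upper) in intervals.items():
    mid, half, sig = summarize_interval(lower, upper)
    print(f"{label}: mean difference = {mid:.3f}, half-width = {half:.3f}, excludes zero = {sig}")
```

The lower bound of .024 for the Grade 3 to Grade 4 comparison lies very close to zero, which is consistent with its borderline p-value.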
Table 128
Group C: Spring 2017/Fall 2017/Fall 2018 Descriptive Statistics Pooled Results (Composite)
Table 129
Group C: Spring 2017/Fall 2017/Fall 2018 Mauchly’s Test of Sphericity Results (Composite)
Mean composite scores differed significantly between Grades 3 and 4 and Grades 3 and
5, as indicated from results of a repeated measures ANOVA; no significant difference was found
between Grade 4 and Grade 5 scores. The significant differences in scores were indicative of
continued composite score fluctuation consistent with the developmental music aptitude stage.
Table 130
Group C: Spring 2017/Fall 2017/Fall 2018 Tests of Within-Subjects Effects Results (Composite)
Table 131
Group C: Spring 2017/Fall 2017/Fall 2018 Multivariate Test Results (Composite)
However, the score difference (.148 points) between Grade 4 and Grade 5 scores could be
explained as the tapering of score difference described by Gordon (1986a). The composite score
difference was non-significant, which was suggestive of the constancy that characterized the
stabilized music aptitude stage (Gordon, 1998, p. 10). A period of transition in which score
activity is inconsistent and fluctuation intermittent might account for the discrepancy of these
findings. Nevertheless, separate consideration of tonal and rhythm aptitude when examining the
shift from developmental to stabilized music aptitude was theorized from the findings of Group
C, as it appeared rhythm aptitude stabilized prior to tonal aptitude for this Group.
Table 132
Group C: Spring 2017/Fall 2017/Fall 2018 Pairwise Comparison Pooled Results (Composite)
Group D: Spring 2017 (Grade 3), Fall 2017 (Grade 4), Spring 2019 (Grade 5)
Mean tonal scores for students from Grades 3–5 are depicted in Table 133. From the
results of Mauchly’s Test of Sphericity, presented as Table 134, it was concluded variances of
differences between levels of tonal test results were not significantly different and the condition
of sphericity had been met (χ²(2) = 3.593, p > .05).
Table 133
Group D: Spring 2017/Fall 2017/Spring 2019 Descriptive Statistics Pooled Results (Tonal)
Table 134
Group D: Spring 2017/Fall 2017/Spring 2019 Mauchly’s Test of Sphericity Results (Tonal)
The results of a one-way repeated measures ANOVA are presented in Table 135: the
difference in tonal scores from Grades 3–5 was statistically significant, F(2, 118) = 7.306, p =
.001, and the effect size was moderate (.094) (Hatcher, 2013, p. 370). This conclusion was
supported by the statistically significant multivariate test results displayed in Table 136: Pillai’s
Trace = .174, F(2, 58) = 6.121, p = .004, ηp² = .174. The results of the Bonferroni post hoc test
are featured as pairwise comparisons in Table 137: on average, Grade 5 tonal scores were 2.067
points higher than Grade 3 tonal scores (p < .001, 95% CI [.832, 3.301]). The mean tonal
score differences for Grades 3–4 and Grades 4–5 were not significant (p > .05).
Table 135
Group D: Spring 2017/Fall 2017/Spring 2019 Tests of Within-Subjects Effects Results (Tonal)
Table 136
Group D: Spring 2017/Fall 2017/Spring 2019 Multivariate Test Results (Tonal)
Table 137
Group D: Spring 2017/Fall 2017/Spring 2019 Pairwise Comparison Pooled Results (Tonal)
The findings for Spring 2017/Fall 2017/Spring 2019 tonal scores were mixed. Significant
mean tonal score differences were found for Grades 3 and 5, yet the mean tonal score differences
for Grades 3 and 4 and Grades 4 and 5 were not statistically significant. This finding was similar
to the Group C tonal findings, from which a period of transition was speculated. Although the
developmental tonal aptitude stage had been described as a period in which tonal aptitude
constantly fluctuates (Gordon, 2001a), the level of score fluctuation in the current study was
inconsistent. A broad period of transition was theorized to account for this discrepancy in tonal
score fluctuation, as conflicting traits of both stages of music aptitude were exhibited
simultaneously within an extended time span.
Mean rhythm scores for students from Grades 3–5 are displayed in Table 138. From the
results of Mauchly’s Test of Sphericity, presented as Table 139, it was concluded variances of
differences between levels of rhythm test results were not significantly different and the
condition of sphericity had been met (χ²(2) = .459, p > .05).
Table 138
Group D: Spring 2017/Fall 2017/Spring 2019 Descriptive Statistics Pooled Results (Rhythm)
Table 139
Group D: Spring 2017/Fall 2017/Spring 2019 Mauchly’s Test of Sphericity Results (Rhythm)
A one-way repeated-measures ANOVA was conducted using Grade 3, Grade 4, and
Grade 5 rhythm scores. The results of the rhythm ANOVA may be seen in Table 140: the
difference in rhythm scores from Grades 3–5 was not statistically significant, F(2, 126) = 1.471,
p > .05. This conclusion was supported by the multivariate test results exhibited in Table 141:
Pillai’s Trace = .046, F(2, 62) = 1.493, p > .05, ηp² = .046.
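The relationship between Pillai’s Trace and the multivariate F statistic can be checked with a minimal sketch. It assumes a single within-subjects effect, in which case the conversion to F is exact and the partial eta squared equals the trace, as the matching values reported here suggest. The function name is hypothetical; the trace and degrees of freedom are those reported for the Group D rhythm test.

```python
def f_from_pillai(trace, df_hypothesis, df_error):
    """Convert Pillai's Trace to F, assuming a single within-subjects effect."""
    return (trace / (1 - trace)) * (df_error / df_hypothesis)


# Values as reported for the Group D rhythm multivariate test.
f_value = f_from_pillai(0.046, df_hypothesis=2, df_error=62)
print(round(f_value, 3))
```

The result agrees with the reported F of 1.493 to within the error introduced by rounding the trace to three decimal places.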
Table 140
Group D: Spring 2017/Fall 2017/Spring 2019 Tests of Within-Subjects Effects Results (Rhythm)
Table 141
Group D: Spring 2017/Fall 2017/Spring 2019 Multivariate Test Results (Rhythm)
Mean rhythm scores did not differ significantly between grade levels, as indicated by
results of a repeated measures ANOVA. This conclusion was supported by Pillai’s Trace test
results (p > .05). Therefore, it was concluded IMMA rhythm scores were relatively stable from
Grade 3 through Grade 5, in opposition to the defined parameters of developmental and
stabilized music aptitude reported in extant literature (Gordon, 2006). A period of rhythm
transition was unsubstantiated: it appeared rhythm aptitude had stabilized prior to Grade 3,
unlike tonal aptitude, which continued to exhibit significant score fluctuation indicative of the
developmental tonal aptitude stage. An atomistic perspective of developmental music aptitude,
as conjectured by Gordon (1998, p. 71), would suggest separate consideration of tonal aptitude
and rhythm aptitude when considering a period of transition.
Mean composite scores for students from Grades 3–5 are displayed in Table 142. From
the results of Mauchly’s Test of Sphericity, presented as Table 143, it was concluded variances
of differences between levels of composite test results were significantly different and the
condition of sphericity had not been met (χ²(2) = 7.654, p = .022). The violation of sphericity
resulted in an inaccurate F-test (Field, 2009, p. 476). However, application of the
Greenhouse-Geisser correction yielded an adjusted F-value and degrees of freedom (F(1.76, 91.28) = 7.607, p
= .001); the effect size was moderate (.110) (Hatcher, 2013, p. 370). Huynh-Feldt correction
results were similar (F(1.81, 94.22) = 7.607, p = .001). These statistically significant omnibus
results are displayed in Table 144. An examination of the statistically significant multivariate test
results, presented as Table 145, confirmed this finding: Pillai’s Trace = .181, F(2, 52) = 5.617, p
= .006, ηp² = .181.
The results of the Bonferroni post hoc test are displayed as pairwise comparisons in Table
146: on average, Grade 4 composite scores were 2.901 points higher than Grade 3 composite
scores (p = .033, 95% CI [.307, 5.495]) and Grade 5 composite scores were 4.047 points higher
than Grade 3 composite scores (p < .001, 95% CI [1.645, 6.449]). However, the difference
between Grade 4 and Grade 5 composite scores was not significant (p > .05).
Table 142
Group D: Spring 2017/Fall 2017/Spring 2019 Descriptive Statistics Pooled Results (Composite)
Table 143
Group D: Spring 2017/Fall 2017/Spring 2019 Mauchly’s Test of Sphericity Results (Composite)
Significant differences between composite scores of Grades 3 and 4 and of Grades 3 and 5
were indicated by results of a one-way repeated measures ANOVA. Thus,
continued score fluctuation was implied, suggesting students had not yet achieved the stabilized
music aptitude stage and remained in the developmental music aptitude stage. However, the
difference in Grade 4 and Grade 5 composite scores was not statistically significant (p > .05),
which implied stability of composite scores during that time period. The developmental and
stabilized music aptitude stages previously described did not account for this discrepancy in
score fluctuation. Thus, a period of transition between the two stages of music aptitude was
speculated to explain the variation in predicted score fluctuation described in previous research
(Gordon, 2006).
Table 144
Group D: Spring 2017/Fall 2017/Spring 2019 Tests of Within-Subjects Effects (Composite)
Group E: Spring 2018 (Grade 3), Spring 2019 (Grade 4), Fall 2019 (Grade 5)
Mean tonal scores for students from Grades 3–5 are displayed in Table 147. From the
results of Mauchly’s Test of Sphericity, presented as Table 148, it was concluded variances of
differences between levels of tonal test results were significantly different and the condition of
sphericity had not been met (χ²(2) = 8.020, p = .018). The violation of sphericity resulted in an
inaccurate F-test (Field, 2009, p. 476). However, application of the Greenhouse-Geisser correction
yielded an adjusted F-value and degrees of freedom (F(1.70, 73.26) = .272, p > .05); results of the
Huynh-Feldt correction were similar (F(1.77, 75.98) = .272, p > .05). These non-significant
results are displayed in Table 149. Multivariate test results (p > .05), presented as Table 150,
confirmed this finding: Pillai’s Trace = .017, F(2, 42) = .368, p > .05, ηp² = .017.
Table 145
Group D: Spring 2017/Fall 2017/Spring 2019 Multivariate Test Results (Composite)
Mean tonal scores did not differ significantly between grade levels, as indicated by
results of a repeated measures ANOVA. The relative stability concluded from the lack of
significant tonal score fluctuation was suggestive of the achievement of the stabilized music
aptitude stage prior to Grade 3. However, this could not be confirmed without an examination of
scores of younger grade levels. The conclusion that the stabilized music aptitude stage had been
attained prior to Grade 3 was contrary to that described by Gordon (2013, pp. 11–12), but was
supported by the findings of other researchers (DeYarman, 1972; Harrington, 1969; Stevens,
1987; Schleuter & DeYarman, 1977) who asserted music aptitude stabilized as early as age 6.
Because tonal aptitude appeared to have stabilized before the grade levels in question, a period of
transition was unsubstantiated.
Table 146
Group D: Spring 2017/Fall 2017/Spring 2019 Pairwise Comparison Pooled Results (Composite)
Table 147
Group E: Spring 2018/Spring 2019/Fall 2019 Descriptive Statistics Pooled Results (Tonal)
Mean rhythm scores for students from Grades 3–5 are displayed in Table 151. From the
results of Mauchly’s Test of Sphericity, presented as Table 152, it was concluded variances of
differences between levels of rhythm test results were not significantly different and the condition of
sphericity had been met (χ²(2) = 3.274, p > .05).
Table 148
Group E: Spring 2018/Spring 2019/Fall 2019 Mauchly’s Test of Sphericity Results (Tonal)
Table 149
Group E: Spring 2018/Spring 2019/Fall 2019 Tests of Within-Subjects Effects Results (Tonal)
A one-way repeated-measures ANOVA was conducted using Grade 3, Grade 4, and
Grade 5 rhythm scores. The results of the rhythm ANOVA are presented in Table 153: the
difference in rhythm scores from Grades 3–5 was statistically significant, F(2, 96) = 3.461, p =
.035, and the effect size was small (.047) (Hatcher, 2013, p. 370). This conclusion was supported
by the multivariate test results displayed in Table 154, which were statistically significant:
Pillai’s Trace = .156, F(2, 47) = 4.341, p = .019, ηp² = .156.
Table 150
Group E: Spring 2018/Spring 2019/Fall 2019 Multivariate Test Results (Tonal)
Table 151
Group E: Spring 2018/Spring 2019/Fall 2019 Descriptive Statistics Pooled Results (Rhythm)
The results of the Bonferroni post hoc test are featured as pairwise comparisons in Table
155. On average, Grade 5 rhythm scores were approximately 1.4 points higher than Grade 3
rhythm scores (p = .027, 95% CI [.195, 2.602]). However, the mean difference between rhythm
scores for Grades 3 and 4 and Grades 4 and 5 was not significant (p > .05). For Group E, the
conjectured proposal in favor of atomistic consideration of tonal aptitude and rhythm aptitude
also seemed to apply; however, tonal aptitude seemed to have stabilized before rhythm aptitude
for this Group.
Table 152
Group E: Spring 2018/Spring 2019/Fall 2019 Mauchly’s Test of Sphericity Results (Rhythm)
Table 153
Group E: Spring 2018/Spring 2019/Fall 2019 Tests of Within-Subjects Effects Results (Rhythm)
Table 154
Group E: Spring 2018/Spring 2019/Fall 2019 Multivariate Test Results (Rhythm)
Table 155
Group E: Spring 2018/Spring 2019/Fall 2019 Pairwise Comparison Pooled Results (Rhythm)
A significant difference in mean rhythm scores of Grade 3 and Grade 5 was concluded
from the results of a repeated measures ANOVA. However, no significant rhythm score
difference was estimated for adjacent grade levels. Thus, IMMA rhythm scores were
significantly different over the 3-year period in question, but not from Grade 3 to Grade 4 or
Grade 4 to Grade 5. That the influence of the musical environment was significant overall, but not
for adjacent grade levels, was puzzling. These results did not align with the findings of previous
studies in which score fluctuation was continual throughout the developmental music aptitude
stage and gain scores decreased as students approached age 9 (Gordon, 1986a), but might be
explained by a period of transition begun in Grade 3, after which rhythm score fluctuation was
minimal and non-significant until Grade 5.
Mean composite scores for students from Grades 3–5 are depicted in Table 156. From the
results of Mauchly’s Test of Sphericity, displayed as Table 157, it was concluded variances of
differences between levels of composite test results were significantly different and the condition
of sphericity had not been met (χ²(2) = 6.534, p = .038). The violation of sphericity resulted in
an inaccurate F-test (Field, 2009, p. 476). However, application of the Greenhouse-Geisser
correction yielded an adjusted F-value and degrees of freedom (F(1.72, 63.47) = 3.163, p > .05);
results of the Huynh-Feldt correction were similar (F(56.91, 66.25) = 3.163, p > .05). These
non-significant results are displayed in Table 158. This finding was contradicted by statistically
significant multivariate test results, presented as Table 159: Pillai’s Trace = .221, F(2, 36) =
5.119, p = .011, ηp² = .221.
The results of the Bonferroni post hoc test are displayed as pairwise comparisons in Table
160: Grade 5 composite scores were an average of 1.951 points higher than Grade 3 composite
scores (p = .018). However, the difference between composite scores for Grades 3 and 4 and
Grades 4 and 5 was not significant (p > .05).
Table 156
Group E: Spring 2018/Spring 2019/Fall 2019 Descriptive Statistics Pooled Results (Composite)
Table 157
Group E: Spring 2018/Spring 2019/Fall 2019 Mauchly’s Test of Sphericity Results (Composite)
Mean composite scores did not differ significantly between grade levels, as indicated by
results of a repeated measures ANOVA with the Greenhouse-Geisser correction or Huynh-Feldt
correction. This conclusion was not supported by Pillai’s Trace test results, which were
statistically significant. A post hoc test using the Bonferroni correction was conducted and
yielded a significant difference in Grade 3 and Grade 5 composite scores, but not for adjacent grade
levels. Thus, it was concluded IMMA composite scores were influenced minimally by musical
environment between adjacent grade levels and significantly over the 3-year period in
question. These results differed from those described in extant research, in which the
developmental music aptitude stage was characterized as ever changing in response to students’
interaction with the musical environment (Gordon, 2001b, p. 82), yet could be accounted for by a
broad period of transition begun in Grade 3, after which composite score fluctuation was
minimal and non-significant until Grade 5.
Table 158
Group E: Spring 2018/Spring 2019/Fall 2019 Tests of Within-Subjects Effects (Composite)
Group F: Fall 2017 (Grade 3), Spring 2019 (Grade 4), Fall 2019 (Grade 5)
Mean tonal scores for students from Grades 3–5 are presented in Table 161. From the
results of Mauchly’s Test of Sphericity, exhibited as Table 162, it was concluded variances of
differences between levels of tonal test results were significantly different and the condition of
sphericity had not been met (χ²(2) = 6.158, p = .046). The violation of sphericity resulted in an
inaccurate F-test (Field, 2009, p. 476). However, application of the Greenhouse-Geisser
correction yielded an adjusted F-value and degrees of freedom (F(1.78, 83.53) = 1.376, p > .05);
the Huynh-Feldt correction yielded similar results (F(1.84, 86.59) = 1.376, p > .05). The omnibus
ANOVA result was not statistically significant, as displayed in Table 163. This finding was
supported by multivariate test results, presented as Table 164: Pillai’s Trace = .065, F(2, 46) =
1.599, p > .05, ηp² = .065.
Table 159
Group E: Spring 2018/Spring 2019/Fall 2019 Multivariate Test Results (Composite)
From the results of a one-way repeated measures ANOVA, it was estimated mean tonal
scores did not differ significantly between grade levels. This conclusion was supported by
non-significant multivariate test results (p > .05). It was suggested by the relative constancy of tonal
scores that students had achieved the stabilized tonal aptitude stage before Grade 3, a conclusion
which differed from the findings of previous research (Gordon, 2013, p. 15) and supported the
findings of DeYarman (1972, 1975), Harrington (1969), and Schleuter and DeYarman (1977). A
period of transition was not substantiated by these findings. However, further research is
recommended to examine the effect of instruction with and without informal guidance, as the
lack of informal guidance activities necessary to move students beyond tonal and rhythm music
babble might have affected students’ ability to benefit from formal instruction, which in turn
might have inhibited accurate measurement and interpretation of IMMA scores as applied to a
period of transition.
Table 160
Group E: Spring 2018/Spring 2019/Fall 2019 Pairwise Comparison Pooled Results (Composite)
Table 161
Group F: Fall 2017/Spring 2019/Fall 2019 Descriptive Statistics Pooled Results (Tonal)
Mean rhythm scores for 65 students from Grades 3–5 are displayed in Table 165. From
the results of Mauchly’s Test of Sphericity, presented as Table 166, it was concluded variances
of differences between levels of rhythm test results were not significantly different and the
condition of sphericity had been met (χ²(2) = 1.066, p > .05).
Table 162
Group F: Fall 2017/Spring 2019/Fall 2019 Mauchly’s Test of Sphericity Results (Tonal)
Table 163
Group F: Fall 2017/Spring 2019/Fall 2019 Tests of Within-Subjects Effects Results (Tonal)
The results of a repeated-measures ANOVA are presented in Table 167. The difference in
rhythm scores from Grades 3–5 was not statistically significant, F(2, 94) = .149, p > .05. This
conclusion was supported by non-significant multivariate test results, displayed in Table 168:
Pillai’s Trace = .006, F(2, 46) = .133, p > .05, ηp² = .006.
Table 164
Group F: Fall 2017/Spring 2019/Fall 2019 Multivariate Test Results (Tonal)
Table 165
Group F: Fall 2017/Spring 2019/Fall 2019 Descriptive Statistics Pooled Results (Rhythm)
From these repeated measures ANOVA results, it was concluded mean rhythm scores did
not differ significantly between grade levels. This conclusion was supported by non-significant
multivariate Pillai’s Trace test results. It was suggested by the relative constancy of rhythm
scores that students had achieved the stabilized rhythm aptitude stage prior to Grade 3, which
aligned with findings for tonal aptitude for the same Group. Nonetheless, this contention was in
opposition to findings of previous research (Gordon, 2013). A period of transition was not
substantiated for rhythm aptitude.
Table 166
Group F: Fall 2017/Spring 2019/Fall 2019 Mauchly’s Test of Sphericity Results (Rhythm)
Table 167
Group F: Fall 2017/Spring 2019/Fall 2019 Tests of Within-Subjects Effects Results (Rhythm)
Mean composite scores for 65 students from Grades 3–5 may be seen in Table 169. From
the results of Mauchly’s Test of Sphericity, presented as Table 170, it was concluded variances
of differences between levels of composite test results were not significantly different and the
condition of sphericity had been met (χ²(2) = 2.114, p > .05).
Table 168
Group F: Fall 2017/Spring 2019/Fall 2019 Multivariate Test Results (Rhythm)
Table 169
Group F: Fall 2017/Spring 2019/Fall 2019 Descriptive Statistics Pooled Results (Composite)
A one-way repeated-measures ANOVA was conducted using Grade 3, Grade 4, and
Grade 5 composite scores. From the results of the composite ANOVA, presented in Table 171, it
was concluded the difference in composite scores from Grades 3–5 was not statistically
significant, F(2, 80) = .584, p > .05. This conclusion was supported by multivariate test results
displayed in Table 172, which were not statistically significant: Pillai’s Trace = .036, F(2, 39) =
.729, p > .05, ηp² = .036.
From these results, it was concluded mean composite scores did not differ significantly
between grade levels. This conclusion was supported by Pillai’s Trace test results. It was
suggested by the relative constancy of composite scores that students had achieved the stabilized
music aptitude stage prior to Grade 3, which was in contrast to the findings of previous studies in
which music aptitude stabilized at approximately age 9 (Gordon, 2013, pp. 11–12). A period of
transition was unsubstantiated for composite music aptitude from the results of the current study.
Table 170
Group F: Fall 2017/Spring 2019/Fall 2019 Mauchly’s Test of Sphericity Results (Composite)
Results of an examination of mean tonal score differences over a 3-year period using
one-way repeated measures ANOVA were mixed, as shown in Figure 12. Grade 4 tonal scores were
significantly higher than Grade 3 tonal scores for Groups A and D. Similarly, Grade 5 students
outscored Grade 3 students on the IMMA tonal subtest for Group A, Group C, and Group D.
Nevertheless, no significant score difference was found for Grade 4 and Grade 5 tonal scores,
and no significant score differences were found for any grade level combination for Groups B,
E, and F. From these mixed results, no definitive conclusion was deduced: for some grade level
combinations, significant tonal score increases from Grade 3 to Grade 4 and Grade 5 were
suggestive of the continued influence of instruction characteristic of the developmental music
aptitude stage, yet the lack of significant score difference from Grade 4 to Grade 5 implied a
lessening of environmental influence corresponding to an increase in chronological age (Gordon,
1986a). For other cases, the absence of significant score differences could be interpreted as
attainment of the stabilized tonal aptitude stage prior to Grade 3, after which tonal aptitude no
longer was susceptible to the influence of the musical environment. This interpretation was
counter to descriptions of the stages of music aptitude by some researchers (DeYarman, 1975;
Gordon, 1989b; Haroutounian, 2002; Mang, 2013; Moore, 1987; Schleuter & DeYarman, 1977).
Table 171
Group F: Fall 2017/Spring 2019/Fall 2019 Tests of Within-Subjects Effects Results (Composite)
Table 172
Group F: Fall 2017/Spring 2019/Fall 2019 Multivariate Test Results (Composite)
Nevertheless, other researchers concluded the onset of stabilized music aptitude occurred
as early as age 6 (DeYarman, 1975; Schleuter & DeYarman, 1977). It was possible the
developmental tonal aptitude stage began to evolve as score fluctuation leveled off; it was also
conceivable this transformation of the developmental tonal aptitude stage resulted in a period of
transition beginning in Grade 3, when significant tonal score fluctuation seemed to suggest a
continued influence of musical environment, and extending through Grade 5, when tonal scores
appeared to stabilize. An uneven increase in tonal scores and rhythm scores in a sample of
minoritized students led Gordon (1980b) to suggest an examination of the transition from the
developmental to stabilized music aptitude stages. The findings of the current study seemed to
suggest tonal aptitude and rhythm aptitude might not stabilize at the same time and therefore
might transition between music aptitude stages independently of each other. Nevertheless,
evidence to suggest a transition period was not definitive for the current sample, and further
research is necessary to investigate the stability of tonal scores of an extended range of grade
levels, particularly those immediately prior to Grade 3 and beyond Grade 5.
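The interpretive reasoning applied to the pairwise results can be summarized in a rough sketch. This is only an illustration of the logic discussed above, not a procedure used in the study: the function name, the labels, and the decision rules are assumptions, and they simplify interpretations that were treated as tentative.

```python
def describe_pattern(sig_3_4, sig_4_5, sig_3_5):
    """Map pairwise significance flags to a tentative, illustrative label."""
    if not (sig_3_4 or sig_4_5 or sig_3_5):
        return "no significant fluctuation (consistent with stabilization before Grade 3)"
    if sig_3_5 and not sig_3_4 and not sig_4_5:
        return "overall change without adjacent-grade change (possible transition)"
    if sig_3_4 and sig_4_5:
        return "fluctuation between all adjacent grades (developmental features)"
    return "mixed pattern (developmental and stabilized features)"


# Group C tonal: only the Grade 3 and Grade 5 comparison was significant.
print(describe_pattern(sig_3_4=False, sig_4_5=False, sig_3_5=True))
# Group F tonal: no comparison was significant.
print(describe_pattern(sig_3_4=False, sig_4_5=False, sig_3_5=False))
```

A label of this kind cannot distinguish a transition period from low statistical power, which is one reason an extended range of grade levels is recommended.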
Figure 12
Repeated Measures ANOVA Results (Tonal)
Longitudinal stability of rhythm aptitude was also inconclusive: the results of a series of
one-way repeated measures ANOVAs, as illustrated in Figure 13, exhibited non-significant mean
differences in grade level scores in the majority of cases. Minimal influence of musical
environment was noted in two instances: Grade 3 and Grade 5 scores were significantly higher
than Grade 4 scores (approximately 2 points) for Group B, and Grade 5 students outscored Grade
3 students by approximately 1.4 points for Group E. There was inadequate evidence from these
findings to substantiate a period of transition between the developmental and stabilized music
aptitude stages at age 9/Grade 4, as it was possible the lack of significant score fluctuation was
due to achievement of the stabilized music aptitude stage prior to Grade 3. Gordon (1986a)
suggested the IMMA rhythm subtest might be more characteristic of stabilized rhythm aptitude
than of developmental rhythm aptitude, as seemed to be the case with these ANOVA findings.
Figure 13
Repeated Measures ANOVA Results (Rhythm)
IMMA composite scores continued to be influenced nominally by musical environment,
as concluded from one-way repeated measures ANOVA findings, featured in Figure 14.
Inconsistent score fluctuation resulted in an atypical pattern of mean composite score difference
in which neither continual score fluctuation, which characterized the developmental music
aptitude stage, nor immutability of scores to instruction, the defining trait of the stabilized music
aptitude stage (Gordon, 1971), was described. This variation from previously described tenets of
developmental and stabilized music aptitude is worthy of continued study. A period of transition
was speculated to account for this discrepancy in score fluctuation, as Gordon (1980b) noted
from his observation of uneven growth of tonal and rhythm aptitude in minoritized students.
From the findings of that study, Gordon (1980b) recommended future research considering
“theories of instruction” of “culturally homogeneous” groups of students in the developmental
music aptitude stage.
Composite scores are the sum of tonal scores and rhythm scores. Therefore, it was
reasonable to presume the findings of an examination of mean tonal score difference might
influence composite score difference. It was previously concluded tonal results were mixed:
significant score increases from Grade 3 to Grades 4 and 5 were suggestive of a continuation of
the developmental tonal aptitude stage, but were at odds with the lack of significant tonal score
difference from Grade 4 to Grade 5. A period of transition was broached as a possible
explanation for this score discrepancy that had not been described in previous literature. A
similar trend was noted for composite results. It was anticipated rhythm results would be
reflected in composite results as well, and might even impose additional influence to that of tonal
scores, as Gordon (1986c) contended students with rhythm and composite scores equivalent to or
higher than criterion scores may be considered to have superior overall developmental music
aptitude to students whose tonal and composite scores were equivalent to or higher than criterion
scores (p. 69):
A child who has received raw scores on only the Rhythm test and the Composite test
which are the same as or higher than the criterion raw scores for his grade may be
considered superior in overall [developmental] music aptitude to a child who has
achieved raw scores on only the Tonal test and the Composite test which are the same as
or higher than the criterion raw score for his grade (Gordon, 1986c, p. 69).
Figure 14
Repeated Measures ANOVA Results (Composite)
In contrast, Moore (1987) noted that despite the contribution of rhythm aptitude to
developmental composite aptitude, the construct of music aptitude seemed too complex to be
influenced disproportionately by one element. Nonetheless, rhythm results were not reflected as
strongly in composite results as anticipated, as rhythm gain scores were stagnant or
non-significant for several years when composite scores increased. Instead, composite results seemed
to parallel tonal results more closely in the current study. The lack of significant score difference
in rhythm scores implied a relative stability inadequate to substantiate a period of transition when
considered in isolation, yet a period of transition might account for the unusual pattern of score
fluctuation observed in composite results. Although a transition period between the
developmental and stabilized music aptitude stages was theorized, it could not be substantiated
conclusively at a gestalt level from these repeated measures ANOVA findings. Nevertheless, it
was conjectured tonal aptitude and rhythm aptitude stabilized separately; rhythm aptitude
seemed to have stabilized prior to Grade 3 for many Groups.
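Because composite scores are the sum of tonal scores and rhythm scores, the mean composite change between two administrations equals the mean tonal change plus the mean rhythm change. The following minimal sketch illustrates that arithmetic; the scores are hypothetical and were invented for illustration only (they are not data from the current study), and the function names are assumptions of this sketch.

```python
from statistics import mean

# Hypothetical raw scores for five students (NOT data from the study).
tonal_g3 = [30, 32, 28, 35, 31]
tonal_g4 = [33, 34, 31, 36, 33]
rhythm_g3 = [29, 30, 27, 33, 30]
rhythm_g4 = [29, 31, 27, 32, 30]


def composite(tonal, rhythm):
    """Composite score: the sum of each student's tonal and rhythm scores."""
    return [t + r for t, r in zip(tonal, rhythm)]


def mean_change(before, after):
    """Mean of the per-student differences (after minus before)."""
    return mean([a - b for a, b in zip(after, before)])


tonal_change = mean_change(tonal_g3, tonal_g4)
rhythm_change = mean_change(rhythm_g3, rhythm_g4)
composite_change = mean_change(
    composite(tonal_g3, rhythm_g3), composite(tonal_g4, rhythm_g4)
)

# The composite change is fully accounted for by the two subtest changes.
print(tonal_change, rhythm_change, composite_change)
```

In this invented example the rhythm scores are flat on average, so the composite change tracks the tonal change, which mirrors the pattern described above without reproducing any actual result.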
To address Research Question 1, the effect of chronological age on music aptitude was
examined through paired t-tests of scores separated by summer months in which students do not
attend school, thus controlling for instruction. Fall scores were significantly different from
preceding Spring scores for rhythm and composite tests; however, the mean differences were
small. Thus, scores remained marginally fluid throughout Grades 3, 4, and 5, with minimal score
differences. From these results, it was determined an effect of chronological age was not
established conclusively by the findings of this study.
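The distinction between statistical and practical significance in a paired comparison can be sketched as follows. The scores are hypothetical (not data from the current study), the function name is an assumption of this sketch, and the critical value is taken from a standard t table; the 2-point benchmark is the value attributed to Gordon (2002) elsewhere in this chapter.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical matched scores for eight students (NOT data from the study).
spring_scores = [31, 34, 29, 36, 33, 30, 35, 32]
fall_scores = [32, 34, 30, 36, 34, 30, 36, 33]

CRITICAL_T = 2.365         # two-tailed critical t, df = 7, alpha = .05 (t table)
PRACTICAL_THRESHOLD = 2.0  # benchmark attributed to Gordon (2002) in the text


def paired_t(before, after):
    """Return (mean difference, t statistic, degrees of freedom)."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_diff = mean(diffs)
    standard_error = stdev(diffs) / sqrt(n)
    return mean_diff, mean_diff / standard_error, n - 1


mean_diff, t_stat, df = paired_t(spring_scores, fall_scores)

statistically_significant = abs(t_stat) > CRITICAL_T
practically_significant = abs(mean_diff) >= PRACTICAL_THRESHOLD
print(mean_diff, round(t_stat, 2), df)
print(statistically_significant, practically_significant)
```

With these invented scores the difference clears the critical value while remaining under one point, which is the kind of result that is statistically significant yet of doubtful practical importance.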
Wilcoxon Signed Rank tests were used to examine Grade 3 Fall and Spring tonal,
rhythm, and composite scores by academic year in order to determine the effect of instruction on
music aptitude, the focus of Research Question 2. The mean difference was statistically
significant for 2011–2012 and 2013–2014 tonal scores; however, the effect sizes were small. In
the majority of cases, Spring tonal scores exceeded Fall tonal scores; nevertheless, the mean
difference was often nominal and not statistically significant. No effect of instruction was
concluded, as Grade 3 tonal, rhythm, and composite scores were relatively stable.
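For readers unfamiliar with the Wilcoxon Signed Rank test, the following sketch shows how its rank sums are formed from matched Fall and Spring scores. The scores are hypothetical (not data from the current study) and the function name is an assumption of this sketch; the sketch stops at the rank sums, as the significance decision would ordinarily be obtained from statistical software or a published table.

```python
# Hypothetical matched scores for eight students (NOT data from the study).
fall_scores = [30, 33, 28, 35, 31, 29, 34, 32]
spring_scores = [32, 33, 27, 38, 32, 33, 34, 37]


def signed_rank_sums(before, after):
    """Return (sum of positive ranks, sum of negative ranks)."""
    # Zero differences carry no direction and are dropped before ranking.
    diffs = [a - b for a, b in zip(after, before) if a != b]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        # Find the run of tied absolute differences starting at position i.
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 through j
        for k in range(i, j):
            ranks[k] = average_rank
        i = j
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus


w_plus, w_minus = signed_rank_sums(fall_scores, spring_scores)
test_statistic = min(w_plus, w_minus)
print(w_plus, w_minus, test_statistic)
```

Because the test relies on ranks of the differences rather than on the differences themselves, it does not require the assumption of normal distribution that a paired t-test does.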
The longitudinal trend of tonal scores provided limited evidence of a transition period
between the developmental and stabilized music aptitude stages, which was central to Research
Question 3. Scores of IMMA tests administered in three consecutive years (Grades 3, 4, and 5)
were examined in a series of one-way repeated-measures ANOVA. From the findings, an
atypical pattern was observed in which the mean tonal score differences of adjacent early grades
(Grades 3 and 4) and nonadjacent grades (Grades 3 and 5) were statistically significant, yet
scores of adjacent later grades (Grades 4 and 5) were relatively stable. As this trend had not been
described in previous literature, a period of transition was proposed to account for the
discrepancies in tonal score fluctuation. Relative stability was suggested by the lack of
significant difference in rhythm scores; inadequate evidence was found to substantiate a period
of transition for rhythm aptitude. Mixed composite results, similar to tonal results, were
concluded from repeated measures ANOVA results: score increases from Grade 3 to Grade 4 and
Grade 4 to Grade 5 were suggestive of continued influence of musical environment. Score
fluctuation is the hallmark of developmental music aptitude: it has been described as continual
but varying by student (Walters, 1991), with decreased volatility as children near age 9
(Levinowitz & Scheetz, 1998). The decrease in environmental effect on music aptitude that
accompanied an increase in chronological age might seem to apply to the non-significant score
difference from Grade 4 to Grade 5; however, the significant overall composite score increase
from Grade 3 to Grade 5 did not align with that theory. The inconsistency of composite score
fluctuation, manifested as an atypical pattern not described in previous literature, led to
speculation of a period of transition for composite music aptitude.
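The logic of a one-way repeated measures ANOVA can be sketched by partitioning the total variation into a grade-level component, a between-student component, and residual error. The scores below are hypothetical (not data from the current study) and the function name is an assumption of this sketch; pairwise comparisons between grade levels would be a separate step that is not shown.

```python
# Hypothetical scores: one row per student, columns are Grades 3, 4, and 5
# (NOT data from the study).
scores = [
    [28, 31, 31],
    [30, 33, 34],
    [32, 34, 34],
    [27, 30, 30],
    [33, 35, 36],
]


def repeated_measures_f(rows):
    """Return (F statistic, df for grade level, df for error)."""
    n = len(rows)       # number of students
    k = len(rows[0])    # number of repeated measurements
    grand_mean = sum(sum(row) for row in rows) / (n * k)
    grade_means = [sum(row[j] for row in rows) / n for j in range(k)]
    student_means = [sum(row) / k for row in rows]

    ss_total = sum((x - grand_mean) ** 2 for row in rows for x in row)
    ss_grade = n * sum((m - grand_mean) ** 2 for m in grade_means)
    ss_student = k * sum((m - grand_mean) ** 2 for m in student_means)
    # Removing stable between-student differences leaves the error term.
    ss_error = ss_total - ss_grade - ss_student

    df_grade = k - 1
    df_error = (k - 1) * (n - 1)
    f_stat = (ss_grade / df_grade) / (ss_error / df_error)
    return f_stat, df_grade, df_error


f_stat, df_grade, df_error = repeated_measures_f(scores)
print(round(f_stat, 2), df_grade, df_error)
```

In this invented example the grade-level means rise sharply from Grade 3 to Grade 4 and only slightly from Grade 4 to Grade 5, so a large omnibus F can coexist with a negligible difference between the two later grades.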
Chapter 5
Discussion, Recommendations, and Conclusions
Purpose of the Study
The goal of adapting instruction to address individual learning differences is generally
accepted within the field of education (Heathers, 1977). By extension, differentiated instruction
in music education is advantageous in promoting student mastery. Music aptitude scores give an
overview of students’ musical strengths and weaknesses; this data is critical to selection of
appropriate tasks and content for each student.
Previous music aptitude tests differed in content and intent (Gordon, 1987). The terms
aural imagery, inner hearing, and aural perception were used as if equivalent, and terms such as
ability, talent, and achievement were used interchangeably and indiscriminately (Boyle, 1992),
thus confounding the construct test developers sought to measure. Some researchers speculated
one is born with a certain level of music aptitude (nature); others claimed music aptitude was the
result of the musical environment (nurture) (Gordon, 1998). A gestalt perspective yields a
comprehensive overview of music aptitude; proponents of the atomistic view measure discrete
dimensions of music aptitude (Grashel, 2008). Because of these disparities, an examination of
music aptitude test scores based on a unified approach to music aptitude was needed. Gordon
analyzed the effectiveness and validity of previous music aptitude measures and adapted select
features for use in his own music aptitude test batteries. Consequently, the theoretical framework
of this study is based on the research of Edwin Gordon on the construct and measurement of
stabilized music aptitude.
Gordon (2001a) conjectured two stages of music aptitude: the developmental music
aptitude stage, characterized by fluctuations in response to instruction and influence of the
musical environment, and stabilized music aptitude, established at approximately age 9 and
noted for its resistance to instruction and the effects of the musical environment. However,
DeYarman (1972, 1975), Harrington (1969), and Schleuter and DeYarman (1977) conducted
research using modifications of the Musical Aptitude Profile (MAP) with primary-age children
and concluded music aptitude stabilized before age 9, although there was no firm consensus
among them on the age of onset (Gordon, 1979a). Gordon (1982) first designed the Primary
Measures of Music Audiation (PMMA) to measure developmental music aptitude and later
developed the Intermediate Measures of Music Audiation (IMMA) as a more difficult measure of
developmental music aptitude for students who scored in the 80th percentile or higher on PMMA.
However, Gordon (1984a) asserted IMMA also could be used to measure stabilized music
aptitude in students nine years and older. The onset and manner of transition from the
developmental music aptitude stage to the stabilized music aptitude stage are unclear, as extant
research that specifically examined when and how music aptitude becomes stabilized was scarce
(Gordon, 1980b). Gordon hypothesized the stabilized music aptitude stage had been reached
when, despite score changes related to chronological age, the relative standing of students’ music
aptitude levels remained constant (Gordon, 2001b). It should be noted, however, that the
constancy of stabilized music aptitude has not been investigated in a sample of musically select
students, defined as students who participate in school ensembles (Gordon, 1995), as was the
case with numerous students in the current sample. Although this topic was beyond the
parameters of the current study, further research is recommended to investigate the effect of
ensemble participation on stabilized music aptitude.
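One way to make the idea of constant relative standing concrete is a rank correlation between two administrations: scores may rise with chronological age while each student's position in the group stays the same. The use of Spearman's rank correlation here is an assumption of this sketch rather than a procedure attributed to Gordon, the function names are likewise assumptions, and the scores are hypothetical (not data from the current study).

```python
from math import sqrt

# Hypothetical composite scores for five students (NOT data from the study).
grade4_scores = [60, 65, 58, 70, 63]
grade5_scores = [62, 68, 61, 73, 66]


def average_ranks(values):
    """Rank values from lowest to highest, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1
        for k in range(i, j):
            ranks[order[k]] = (i + 1 + j) / 2
        i = j
    return ranks


def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread_x = sum((a - mean_x) ** 2 for a in rx)
    spread_y = sum((b - mean_y) ** 2 for b in ry)
    return covariance / sqrt(spread_x * spread_y)


rho = spearman_rho(grade4_scores, grade5_scores)
mean_gain = sum(b - a for a, b in zip(grade4_scores, grade5_scores)) / len(grade4_scores)
print(rho, mean_gain)
```

In this invented example every score increases, yet the ordering of students is unchanged, so the rank correlation is at its maximum despite a nonzero mean gain.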
Purpose of the Study
Thus, the purpose of this research study was to investigate the onset of, transition to, and
longitudinal constancy of stabilized music aptitude in upper elementary students.
Research Questions
The following questions guided the research:
1. At what grade level does chronological age cease to affect student music aptitude?
2. At what grade level does instruction cease to affect student music aptitude?
3. Is there evidence to substantiate the transition between the developmental music aptitude
stage and stabilized music aptitude stage at approximately age 9/Grade 4?
A nonprobability convenience sample of IMMA scores (N = 1,650) was collected from
intact classes of students in Grades 3, 4, and 5 who were enrolled in a small, rural, public school
in central Pennsylvania where the researcher has been employed as the sole elementary general
music teacher for sixteen years. Little transiency is experienced by students in this school district
and the student population is quite stable: relatively few students move into or out of the district
and most were enrolled in the district throughout their tenure as elementary and secondary
students. As a result of the stability of this school population, most scores of students who were
administered IMMA in Grades 4 and 5 could be matched to their Grade 3 scores. A large
majority of students are White; approximately half live in rural poverty.
Research Instrument
The sole research instrument used for data collection was IMMA. It was presumed
IMMA would measure developmental music aptitude of Grade 3 students and stabilized music
aptitude of students in Grades 4 and 5.
Data Collection
IMMA was administered routinely to students in Grade 3 (N = 1,035) and intermittently
to students in Grade 4 (N = 389) and Grade 5 (N = 226) during the researcher’s tenure as music
teacher to provide data for individualized instruction. IMMA scores were collected and archived
by the researcher for a period of 13 years (2007–2019). A longitudinal examination of Grade 3–
Grade 5 Fall and Spring scores by grade level, between grade levels, and across the 3-year period
was conducted to investigate when score fluctuation, attributed to musical environment,
diminished in significance, thus indicating the approach of the stabilized music aptitude stage.
In this quantitative study, the mean difference in students’ historical IMMA scores was
examined using a variety of statistical tests. Individuals’ matched scores from Spring test
administrations and subsequent Fall test administrations were used in paired t-tests to examine
the effect of chronological age on music aptitude. Violations of the assumption of normal
distribution often precluded use of paired t-tests; therefore, a series of Wilcoxon Signed Rank
tests was conducted by academic year using tonal scores, rhythm scores, and composite scores
for a grade-by-grade view of the effect of instruction on music aptitude. Scores from consecutive
academic years were used in one-way repeated measures ANOVA to highlight longitudinal
changes over a 3-year period that might suggest a period of transition between stages of music
aptitude. Multivariate test results were used to confirm ANOVA results.
Research Question 1.
The results of a series of paired t-tests comparing IMMA Spring tonal scores, rhythm
scores, and composite scores with subsequent corresponding Fall scores were mixed. Mean tonal
scores increased, yet their difference was not statistically significant. In contrast, mean Grade 4
Spring/Grade 5 Fall rhythm and composite scores decreased; the mean differences also were not
significant. A small but significant increase for Grade 3 Spring/Grade 4 Fall rhythm scores and
composite scores was found. In all cases the mean score difference was less than one point.
Gordon (1998) observed a tendency for music aptitude scores to increase with
chronological age (p. 169); however, he noted scores gradually decreased as students approached
the stabilized music aptitude stage (Gordon, 1981). The significant increase of Grade 3
Spring/Grade 4 Fall rhythm scores was suggestive of students’ continued presence in the
developmental music aptitude stage, as it seemed the influence of instruction persisted. However,
Gordon (2002) had predicted an average increase of approximately 2 points for developmental
music aptitude scores of students receiving traditional instruction; the mean rhythm score
increase in the current study did not meet that minimal threshold and thus seemed to lack
practical significance. Although tonal scores generally continued to increase and Grade 4
Spring/Grade 5 Fall rhythm scores to decrease, their non-significant mean score differences
seemed to suggest the static score fluctuation expected of students who previously had attained
the stabilized music aptitude stage. Due to discrepancies in score direction and significance, the
results of paired t-tests were inconclusive in determining the effect of chronological age on
music aptitude.
Research Question 2.
Tonal and composite scores tended to increase from the Fall to Spring administrations of
the same academic year; however, the mean differences were generally modest, as interpreted
from Wilcoxon Signed Rank test results. Only tonal score differences for 2011–2012 and 2013–
2014 were statistically significant. Although rhythm scores tended to decrease from Fall to
Spring administrations, the mean differences were small and not significant. Gordon (2002) had
asserted a score increase of approximately 2 points when traditional instruction was offered. The
mean score increases of the current study were well below that threshold; therefore, no effect of
instruction was concluded. Nevertheless, it must be considered the type and quality of instruction
might have had a detrimental effect on student music aptitude, as the formal instruction offered
might have been ill-suited to the students’ current level of music aptitude, particularly for those
who remained in preparatory audiation.
Research Question 3.
Repeated measures ANOVA was used to examine longitudinal score change of IMMA
tonal scores, rhythm scores, and composite scores to consider a period of transition between the
developmental and stabilized music aptitude stages. ANOVA results were confirmed by
multivariate test results. An atypical pattern of tonal score fluctuation was observed in which
significant score increases occurred from Grade 3 to Grade 4 and Grade 3 to Grade 5, as if
students remained in the developmental music aptitude stage, but the mean tonal score difference
from Grade 4 to Grade 5 was not significant, as if the stabilized music aptitude stage had already
been achieved. Gordon (1981) had asserted score gains decreased as chronological age
increased: the inconsistency of score fluctuation from Grade 4 to Grade 5 seemed to exemplify
this assertion. A limiting effect of instruction on continued score gain of developmental music
aptitude must also be considered. Thus, a period of transition for tonal aptitude was asserted to
account for atypical tonal score results, but could not be concluded definitively. No significant
mean rhythm score differences were estimated for this sample; consequently, a period of
transition was unsubstantiated for rhythm music aptitude. An atypical pattern similar to that of
tonal scores also was observed for composite score difference: the score fluctuation typically
associated with the developmental music aptitude stage and the cessation of score change
characteristic of the stabilized music aptitude stage (Gordon, 1971) were observed
simultaneously. Discrepancies in score direction deviated from findings of previous research and
warranted additional study. A period of transition for composite aptitude was speculated.
Period of Transition
A period of transition for rhythm aptitude was unsubstantiated from the findings of a
series of repeated measures ANOVA. From the unusual pattern of rhythm score fluctuation and
non-significant findings for most Groups, it is conceivable students had already progressed to the
stabilized music aptitude stage. In contrast, a period of transition for tonal aptitude was
suggested. The general and often significant increase in tonal scores found from results of a
series of repeated measures ANOVA was suggestive of the continuation of the developmental
tonal aptitude stage with a tapering of scores as students transitioned to the stabilized tonal
aptitude stage. Significant composite score growth was interpreted for most Groups, which
seemed to be indicative of a continuation of the developmental music aptitude stage. From the
compilation of tonal, rhythm, and composite findings, one could interpret students had entered a
period of transition to the stabilized music aptitude stage, as score increase seemed to gradually
decline. However, if tonal aptitude and rhythm aptitude were examined individually, it was
conjectured the two constructs might have stabilized independently of one another. The Grade 3–
Grade 4 tonal score increase was significant, as was that for Grade 3–Grade 5; thus, it
seemed students remained in the developmental music aptitude stage. The mean difference in
Grade 4–Grade 5 tonal scores and for rhythm scores was not significant, however, which was
suggestive of a previous transition to the stabilized music aptitude stage. Although a period of
transition could not be substantiated conclusively from these mixed results, it was speculated
rhythm aptitude had stabilized independently of tonal aptitude. Additional research focused on
the independent onset of stabilized tonal aptitude and rhythm aptitude is recommended.
Gordon (2013) awarded more weight to difference between developmental and stabilized
music aptitudes than to similarity (p. 15). He noted numerous differences in how developmental
and stabilized music aptitudes were manifested in student musical behaviors: students in the
stabilized music aptitude stage preferred to hear tonal and rhythm dimensions simultaneously
and were able to attend to one or the other, showed consistent preference for phrasings, and
reliably perceived even relatively small differences in dynamics, timbres, and tonal ranges
(Gordon, 2013, pp. 15–16). It seems unlikely students would begin to exhibit stabilized music
aptitude traits wholly and simultaneously; rather, it is probable students begin to exhibit traits of
stabilized music aptitude by degrees.
Gordon (1984b) developed the Instrument Timbre Preference Test (ITPT) for a 2-year
investigation to determine if students demonstrated more success when they played an
instrument for which they demonstrated a timbre preference and if success was more accurately
predicted by MAP scores when students played an instrument whose timbre they preferred
(Gordon, 1989c). Although Gordon (1984b) initially stated ITPT could be administered to
students entering Grades 4, 5, or 6, he later clarified ITPT should be administered to students
prior to or in the grade in which beginning band instruction is offered. As many students begin
instrumental music instruction in Grade 4, ITPT might be administered in Grade 3 Spring or
Grade 4 Fall. Thus, Gordon seemed to believe students could discern differences in instrumental
timbre as early as age 8, before music aptitude was purported to stabilize, and that difference in
timbre might be perceived before difference in dynamics or other expressive elements. It was
speculated the transition from the developmental to stabilized music aptitude stage was also of a
gradual nature, in which changes to students’ perception occur differently for each student, as
students similarly transition from preparatory audiation to audiation at different paces, according
to their individual levels of developmental music aptitude. Nevertheless, it was conjectured from
repeated measures ANOVA findings the transition to stabilized rhythm aptitude occurred before
that of tonal aptitude in the current study.
Gordon (2012) acknowledged it would be unusual for both developmental tonal and
rhythm aptitude to be very high or very low for any individual student (p. 50). When interpreting
music aptitude test scores, an attempt to raise the lower subtest score should be undertaken
immediately (Gordon, 1998, p. 149). Thus, Gordon recognized tonal aptitude and rhythm
aptitude both manifest and progress at different rates. It was evident from a comparison of the
repeated measures ANOVA findings for tonal scores and corresponding rhythm scores of the
current study that tonal scores seemed to exhibit significant growth more frequently and for
different grade level groupings than did rhythm scores. The stages of developmental music
aptitude and stabilized music aptitude each had been viewed by the researcher as a single gestalt
construct: students either functioned musically in one stage or the other, as defined by their
composite IMMA scores. However, the findings of the current study have prompted a
consideration of developmental tonal aptitude and developmental rhythm aptitude from an
atomistic perspective instead: students might audiate tonally in one stage of music aptitude and
rhythmically in another. Norton (1980) cautioned children must progress sequentially through a
series of activities ordered by the level of abstract thinking required in order to develop musical
understanding. Students in Norton’s study conserved tonal elements more successfully than
rhythm patterns. The findings of previous studies in which researchers concluded music aptitude
stabilized prior to age 9 (DeYarman, 1972; Harrington, 1969; Schleuter & DeYarman, 1977)
seemed to be based on composite results: tonal aptitude and rhythm aptitude were not interpreted
separately. Yet Gordon (1998) suggested the finding that children in the developmental music
aptitude stage seemed to focus more on the musical instrument used to record music aptitude test
items than on the content of the test items supported the assertion that developmental music
aptitude was more closely related to atomistic than gestalt perspective (p. 71). If tonal aptitude
and rhythm aptitude operate as independent constructs, it is possible students may transition
between stages of music aptitude at different rates for each construct.
Another indication Gordon regarded rhythm and tonal aptitudes as separate constructs
was his emphasis on movement and its component parts—time, space, weight, and flow—which
interacted to create rhythm (Gordon, 2012, p. 188). Although Gordon (2012) stated movement
was foundational to rhythm (p. 190) and the best means through which students understand
rhythm (p. 74), he posited the construct of space audiation, “a silent auditory response rather than
a physical response” (Gordon, 2015) in his later writing. Gordon (2012) asserted the importance
of an audiation breath, during which a pause is inserted between the teacher’s performance and
the students’ performance of each tonal pattern to encourage audiation over imitation (p. 102),
and alleged tonal audiation occurred during that breath (Gordon, 2013, p. 94). In addition,
Gordon (1998) noted rhythm, particularly meter and tempo, was foundational to musical style
and expression (p. 60), and concluded tempo was the most fundamental of all rhythm aptitudes:
tempo was basic to meter and meter to rhythm (p. 104). Thus, movement was perceived as
foundational to rhythm, and rhythm in turn to tonal aptitude and musical style. Nonetheless,
Gordon theorized the expressive dimension of stabilized music aptitude joined the tonal and
rhythm dimensions, resulting in comprehensive music aptitude (Gordon, 1998, p. 60).
Gordon (1998) described rhythm aptitude as foundational and basic. He maintained
students with high IMMA rhythm and composite scores had overall music aptitude superior to
those with high IMMA tonal and composite scores (Gordon, 1986c, p. 69), and noted knowledge
of the occurrence of chord changes in syntactic time might be essential to the process of
audiation (Gordon, 1998, p. 172). This notion of the primacy of rhythm was supported by
empirical evidence, such as the findings of a factor analysis of MAP, PMMA, and IMMA, in
which Gordon concluded the IMMA rhythm subtest was factorially related to the
MAP meter subtest and not to the PMMA rhythm subtest, as might be expected. Thus, Gordon
(1986a) speculated the IMMA rhythm subtest might be more indicative of stabilized music
aptitude than of developmental music aptitude. In contrast, Moore (1987) concluded music
aptitude appeared too complex a concept to be wholly affected by rhythm aptitude, despite the
contribution of rhythm aptitude to developmental music aptitude as a whole.
Although IMMA tonal patterns and rhythm patterns were both composed of the difficult
patterns identified in Gordon’s taxonomic research (Gordon, 1986c, p. 22), to achieve the same
percentile rank, students must receive a higher raw score on the IMMA tonal subtest than on the
IMMA rhythm subtest. Thus, a difference in the level of difficulty of tonal patterns and rhythm
patterns was evident from reported IMMA percentile ranks (Gordon, 1986c, p. 64), with the
implication that difficult rhythm patterns were more complex than difficult tonal patterns.
In his study of “educationally disadvantaged” students, Gordon (1967a) attributed higher rhythm
aptitude than tonal aptitude to an environment more favorable to rhythmic development
(Zimmerman, 1986). Whether Gordon considered this environment to be a function of the school culture or
the students’ extracurricular musical environment was unknown. However, the focus on rhythm
activities to the detriment of tonal development was not uncommon in school music instruction.
Talley (2005) noted the content areas and skills most frequently assessed by elementary general
music teachers for students in kindergarten, first-, second-, and third-grades (singing voice
development, rhythm, matching pitch, and beat competency) included tonal instruction only
peripherally, and Young (1976) observed teachers considered rhythmic ability to be more
essential to student performance than music reading. Moore (1987) concluded focus on rhythm
aptitude might yield improvement. Thus, it was possible teachers’ concentration on rhythm
instruction might have resulted in higher rhythm achievement than tonal achievement, which in
turn might have accelerated growth of rhythm aptitude in the developmental music aptitude
stage. In addition, it was possible students’ acculturation was stronger for rhythm than for tonal
due to their musical experiences outside of school instructional time. Students might have had
difficulty accessing their head register or matching pitch, which might have led to frustration,
fear of failure, and reluctance to participate in tonal activities; thus, their level of tonal
achievement lagged behind that of rhythm. Gordon (1986c) advocated for teachers to use their
knowledge of student music aptitude scores to diagnose musical strengths and weaknesses for
individualizing instruction (p. 76). Although Gordon recommended teachers immediately
attempt to raise each student’s lower subtest score, it was possible tonal and rhythm dimensions
were emphasized equally for students whose rhythm scores already exceeded their tonal scores,
resulting in reinforcement and improvement of rhythm achievement to the detriment of tonal
achievement. Nevertheless, Gordon (1998) observed young children develop the second stage of
audiation more quickly for tonal patterns than rhythm patterns, even in those who have attained
the stabilized rhythm aptitude stage (p. 70). Gordon found it necessary to include audible clicks
with the recorded rhythm patterns in PMMA and IMMA to provide the context of tempo for
students in the developmental music aptitude stage and accents for students in the stabilized
music aptitude stage. Thus, Gordon’s assertion of rhythm aptitude as foundational to music
aptitude, a gestalt view, was reframed by the need to provide contextual support to establish
tempo. Perhaps stabilization of characteristics of the rhythm dimension such as tempo also occurs
over time rather than concurrently.
Conclusions regarding a period of transition differed by Group for tonal and rhythm
scores in the current study. As such, it was difficult to substantiate or reject a general transition
between stages of music aptitude for the entire sample of students, as the atomistic parts (tonal
and rhythm) contributed to the gestalt whole (composite) in differing ways. Not only did tonal
aptitude and rhythm aptitude appear to function as separate constructs, but rhythm aptitude also
seemed to stabilize before tonal aptitude. Significant tonal score differences seemed to suggest score
change consistent with continued presence in the developmental music aptitude stage, yet the
lack of significant difference in rhythm scores seemed to suggest attainment of the stabilized
music aptitude stage prior to Grade 3. Thus, it was speculated students had transitioned to the
stabilized stage of rhythm aptitude before their transition to stabilized tonal aptitude, and this
transition occurred prior to Grade 3. Gordon (1980b) had observed inconsistent growth of tonal
and rhythm aptitude in minoritized students and posited a period of transition to account for that
discrepancy. The uneven increase in tonal and rhythm scores in the current study confirmed
Gordon’s findings and established a foundation from which to conjecture attainment of stabilized
tonal aptitude separate from that of stabilized rhythm aptitude.
Effect of Instruction
No effect of instruction for music aptitude was concluded in the current study. Although
a significant increase in tonal scores was found, the mean difference was modest. A nonsignificant decrease in rhythm scores was found from Fall to Spring of the same academic year;
the mean score difference also was small. In no case did the mean difference exceed the 2-point
threshold asserted by Gordon (2002) for students participating in traditional instruction; thus,
practical significance of mean score increase or decrease was questionable. DeYarman (1975)
similarly questioned the practical significance of statistically significant results found in his
investigation of MAP use with primary students, noting a minimal effect of different amounts
and types of formal music instruction on music aptitude before Grade 4.
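The comparison of a mean score difference with the 2-point threshold can be illustrated with a brief sketch. The scores below are hypothetical and are not data from the current study; the function names and the constant are assumptions introduced only for illustration.

```python
from statistics import mean

# Assumed threshold: the approximate 2-point increase attributed in the text
# to Gordon (2002) for students participating in traditional instruction.
PRACTICAL_THRESHOLD = 2.0


def mean_paired_difference(fall, spring):
    """Return the mean of Spring minus Fall for matched pairs of raw scores."""
    if len(fall) != len(spring):
        raise ValueError("Fall and Spring lists must be matched pairs")
    return mean(s - f for f, s in zip(fall, spring))


def exceeds_threshold(fall, spring, threshold=PRACTICAL_THRESHOLD):
    """Return True when the absolute mean difference is larger than the threshold."""
    return abs(mean_paired_difference(fall, spring)) > threshold


# Hypothetical tonal scores for five students (not data from the study).
fall_tonal = [30, 28, 35, 31, 29]
spring_tonal = [31, 28, 36, 32, 30]

print(mean_paired_difference(fall_tonal, spring_tonal))
print(exceeds_threshold(fall_tonal, spring_tonal))
```

In this illustration, a mean increase of less than one point would not reach the threshold, regardless of whether a statistical test found the difference significant.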
The score fluctuation of the developmental music aptitude stage, tapering of score change
as students’ age increased, and cessation of environmental influence on the stabilized music
aptitude stage were described in previous research. Gordon (2013) noted the continual
fluctuation of music aptitude before age 9, as it interacted with the environment (p. 13), yet
environment had very little effect on music aptitude after age 9 (pp. 11–12). Between the
developmental and stabilized music aptitude stages, however, Gordon (1986c) described a
decline in score fluctuation due to decreasing influence of the musical environment as students’
chronological age increased (p. 103). In the current study, tonal scores increased significantly;
however, mean differences were small. Thus, students’ tonal music aptitude seemed to continue
in the developmental music aptitude stage, with score differences decreasing as students
transitioned to the stabilized music aptitude stage. This observation seemed to align with the
findings of Phillips et al. (2002), who asserted aural skills developed before and during Grade 3,
after which aural acuity no longer hampered pitch matching. In the current study, it seemed
students’ rhythm aptitude had stabilized prior to Grade 3 (age 8 or 9), as rhythm score difference
was not statistically significant. Gordon (2005) contended scores on developmental and
stabilized music aptitude tests would increase with chronological age; however, students’ relative
position in score distributions would remain constant when in the stabilized music aptitude stage.
Gordon (2012) stated improvement of instruction was the primary objective of a music
aptitude test (p. 51), through identification of students with high music aptitude to encourage
participation in music activities and diagnosis of each student’s musical strengths and
weaknesses to individualize instruction (Gordon, 1995, p. 9). Gordon (2001b) noted the
biological limitation of low music aptitude can be lessened with differentiated instruction (p. 86),
yet in order to individualize instruction appropriately, one must first ascertain students’ level of
music aptitude. In combination with knowledge of a student’s chronological age, music aptitude
test scores are suggestive of a student’s level of music aptitude (developmental or stabilized).
Nevertheless, Gordon (1998) noted it seemed possible students in either stage of music aptitude
could engage in preparatory audiation or audiation, regardless of chronological age (p. 72). Thus,
inclusion of informal guidance is critical to establish the foundation necessary from which
students may benefit most from formal instruction, regardless of the presumed stage of music
aptitude of school-age students.
Gordon (2012) asserted audiation of context and content of music was foundational to
music meaning (p. 11), unlike other approaches to music learning, and sought to develop a music
learning theory to describe how we learn music (p. 25). Through sequential stages of audiation,
students enjoy music through understanding (p. 28). Gordon advocated for knowledge of stages
of music aptitude to design sequential instruction of audiation skills. Students in the preparatory
audiation stage (music babble) need unstructured and structured informal guidance to allow
students to progress through acculturation, imitation, and assimilation organically and to
establish readiness for formal instruction. With appropriate guidance and instruction, children
typically emerge from tonal and rhythm babble between ages 5 and 9 (Gordon, 2012, p. 251).
Evidence of initial emergence from music babble is the ability to distinguish between major and
minor tonalities and usual duple and usual triple meters (Gordon, 2012, p. 251). Students have
passed through tonal babble when they are able to sing in major and minor tonalities relatively in
tune, using continuous flow of the breath (Gordon, 2012, p. 252) and through rhythm babble
when they are able to chant alternately in usual duple meter and usual triple meter with a
consistent tempo and chant a series of rhythm patterns in the same tempo without intervening
beats (Gordon, 2012, p. 253). Without achievement of these audiational skills, the foundation on
which to introduce formal instruction would be inadequate: Gordon (1987) emphasized students
would learn less from formal instruction without the readiness provided from informal
instruction (p. 9). Formal instruction, which includes use of tonal patterns and rhythm patterns
within sequenced tonal and rhythm learning activities and in combination with classroom music
activities, is most appropriate for students who have passed out of the preparatory audiation stage
and therefore have the necessary foundation on which to continue to build audiation capacity.
Thus, type and quality of instruction likely affected the results of the current study.
Instruction by the researcher included creative movement activities based on Laban themes,
vocal exploration, folk dance, singing games, tonal activities focused on presentation of
materials in a variety of tonalities, and rhythm activities focused on presentation of materials in a
variety of meters. Learning Sequence Activities using Gordon’s tonal (1990a) and rhythm
register books (1990b) were also included to individualize instruction according to music
aptitude level, as determined through scores of bi-annual administrations of IMMA. However,
little effort was made to address students remaining in preparatory audiation through
incorporation of informal guidance within the context of the formal instruction offered. The
cumulative effect of instruction misaligned with individual student musical age over a period of
several years might have skewed the findings, particularly in a longitudinal examination, as
students would not have had the appropriate level of readiness for each sequenced level of skills
in turn. Taggart (1989) noted Gordon’s contention that establishment of musical context is
necessary for accurate measurement of stabilized music aptitude. Non-compensatory and non-complementary instruction could have resulted in the lack of musical context needed to
accurately measure stabilized music aptitude in the current study. A detailed description of the
type of instruction provided, especially in its function as compensatory or complementary to
students’ musical needs, might have allowed a more precise interpretation of longitudinal IMMA
score change to address the question of effect of instruction in a more focused and direct manner.
Thus, despite the researcher’s best efforts to adapt instruction according to PMMA and
IMMA test results and students’ implied level of music aptitude, the possibility some students
remained in preparatory audiation for tonal or rhythm dimensions was not accounted for within
the context of school music instruction. Gordon (2013) acknowledged the benefit of informal
guidance and formal instruction, structured and unstructured, only when undertaken with
knowledge of music aptitudes (p. 17), and posited the influence of early guidance and instruction
would be greater on young children’s achievement than that of formal music instruction in later
years (Gordon, 2012, p. 47). Thus, an effect of instruction on music aptitude was unsubstantiated
within the context of the current study. Nonetheless, the lack of informal guidance opportunities
offered to address preparatory audiation deficits and establish a critical foundation for future
audiation of students in the current study likely affected the future cultivation of developmental
and stabilized music aptitude adversely.
Effect of Chronological Age
The findings of the current study did not suggest a significant increase in music aptitude
test scores due to chronological age. The results of paired t-tests of Spring tonal, rhythm, and
composite scores and corresponding Fall scores of the following grade level were mixed. Tonal
scores increased nominally, but not significantly. Grade 3 Spring/Grade 4 Fall rhythm and
composite scores increased significantly; however, the decrease in Grade 4 Spring/Grade 5 Fall
rhythm and composite scores was not significant. Thus, no conclusive evidence of an effect of
chronological age on music aptitude was interpreted. Gordon (1998) suggested a general score
increase due to chronological age was typical (p. 169) and specified an approximate average
increase of 2 points on developmental music aptitude tests for students participating in traditional
instruction (2002), yet the results of the current study did not reach that threshold. Thus, the
practical significance of score increase or decrease should be considered with caution.
The finding of no effect of chronological age was not surprising, as similar results were
reported in extant literature. Gordon (2013) noted the ability to generalize and infer was the basis
of both music aptitude and general intelligence (p. 13); thus, student ability to synthesize
information seemed more likely to affect music aptitude than did chronological age. Similarly,
Gordon (1995) noted the role of neural maturation in MAP score increase with age, as scores of
tests requiring continuous concentration likely increase in part because ability to concentrate is a
feature of maturity (p. 86). Perhaps skills associated with neural maturation were more
responsible for music aptitude than was chronological age itself. Although it might seem Gordon
(2013) was advocating for an association between music aptitude and academic intelligence,
indeed he was forthright in his opposition to such a conclusion (p. 19). He did, however,
acknowledge the characteristic skills inherent in measurement by a standardized test such as
IMMA or MAP: students’ ability to generalize and infer musical content and context was
analogous to their ability to audiate, and their effectiveness at maintaining concentration during
test-taking was an asset.
Nevertheless, Gordon (1986c) distinguished chronological age (years of age) from
musical age (developmental age specific to music), noting the latter was more important in
determining when to begin instrumental instruction (Gordon, 2013, p. 149), formal instruction,
or individualized instruction: musical age as measured by PMMA or IMMA, rather than
chronological age, is the critical factor when adapting instruction to support individual learning
differences (Gordon, 1986c, p. 75). In fact, Gordon (2013) asserted adequate readiness from
appropriate preparatory audiation experiences, regardless of chronological age, was necessary for
students to audiate well (p. 131), and cautioned quality and quantity of learning in music babble
superseded the chronological age of students when they emerged from music babble (Gordon,
2012, p. 251). Moore (1987) observed lesson design that stimulates and challenges through
research-based approaches could have a lifelong impact on students’ tonal aptitude and future
music comprehension. It appears then that investment in informal guidance for students of all
ages may be pivotal to a lifetime of audiation.
Limitations of the Study
Limitations of the study included the composition, homogeneity, and aggregate size of
the sample, need for and possible effect of multiple imputation of missing values, internal
validity concerns of testing, use of nonparametric statistical testing due to violations of
parametric statistical assumptions, use of raw scores as the data collection unit, and use of scores
of students who had been administered IMMA for 3 consecutive years exclusively for repeated
measures ANOVA.
It is probable the lack of diversity of the convenience sample of students in the current
study reflected less variation than that which might be found in the general population. In
addition, the stability of the student population contributed to a more consistent music education
than students in the general population might have experienced. Thus, it was possible this sample
was not representative of the general population and caution must be exercised in generalizing
the results of this study.
Scores from students in Grades 3, 4, and 5 were included in this study. However,
inconsistency of IMMA test administration by academic year and grade level resulted in Grade 4
and Grade 5 sample sizes that were considerably smaller than the Grade 3 sample. Use of this
intact sample likely would have had a detrimental effect on statistical power. Student absences
on the dates of test administration, for which make-up tests were not administered, resulted in
missing data. Consequently, multiple imputation using predictive mean matching was used to
complete the data set, which also increased the aggregate sample size of Grade 4 and Grade 5
scores. It was expected Fall and Spring scores of corresponding tests would be loosely related
such that a difference in score values might reflect a meaningful change in student music
aptitude. All missing tonal, rhythm, and composite values were imputed simultaneously in the
current study. Thus, it was possible an imputed Fall or Spring score could have been generated in
accordance with the parameters of predictive mean matching, yet did not comply with the
premise of score association described above. It was possible this manner of imputation could
have skewed composite scores, which, in turn, might have affected the mean composite score
difference. For example, an imputed Spring score that was dramatically different in value (e.g.,
15 points) than its corresponding Fall score likely would be interpreted quite differently than Fall
and Spring scores with a more modest score difference. An uncharacteristic increase or decline
in scores might have obscured a relationship between Fall and Spring scores that would have
been revealed had only observed values been included. In addition, artificially high or low scores
might have affected test validity, as test scores might not have accurately represented the
construct intended to be measured. Further consideration of the application of multiple
imputation in the research design is recommended in replication studies.
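A screening step of the kind implied above might be sketched as follows. The record layout, field names, and cutoff are hypothetical; the 15-point value appears in the text only as an example of a dramatic difference and is not an established criterion.

```python
# Hypothetical cutoff taken from the illustrative 15-point example in the text.
MAX_PLAUSIBLE_CHANGE = 15


def flag_implausible_pairs(records, max_change=MAX_PLAUSIBLE_CHANGE):
    """Return IDs of records that contain an imputed score and a large change."""
    flagged = []
    for rec in records:
        has_imputed = rec["fall_imputed"] or rec["spring_imputed"]
        change = abs(rec["spring"] - rec["fall"])
        if has_imputed and change >= max_change:
            flagged.append(rec["id"])
    return flagged


# Hypothetical records (not data from the study).
records = [
    {"id": "S01", "fall": 30, "spring": 32,
     "fall_imputed": False, "spring_imputed": False},
    {"id": "S02", "fall": 33, "spring": 17,
     "fall_imputed": False, "spring_imputed": True},
    {"id": "S03", "fall": 25, "spring": 27,
     "fall_imputed": True, "spring_imputed": False},
    {"id": "S04", "fall": 20, "spring": 36,
     "fall_imputed": False, "spring_imputed": False},
]

print(flag_implausible_pairs(records))
```

Only pairs that include an imputed value are flagged in this sketch; a large change between two observed scores (as in record S04) is left for the researcher to interpret.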
IMMA composite scores were intended to be the sum of IMMA tonal scores and rhythm
scores; however, predictive mean matching, when conducted for all missing values
simultaneously, did not accommodate this assumption. Thus, it was possible the combination of
imputed and observed tonal, rhythm, and composite scores would not result in an accurate “tonal
score plus rhythm score equals composite score” equation. Simultaneous multiple imputation
using predictive mean matching was a limitation of the research design that might have affected
the results of the current study. An adaptation of the research design to accommodate stratified
imputation of tonal scores, rhythm scores, and composite scores to ensure appropriate
relationships of all scores is suggested in future studies.
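The “tonal score plus rhythm score equals composite score” relationship can be checked and restored with a short routine. This is a minimal sketch under the assumption that each record stores the three scores under the hypothetical keys shown; it is not the procedure used in the current study.

```python
def find_inconsistent(records):
    """Return IDs of records whose composite differs from tonal + rhythm."""
    return [r["id"] for r in records
            if r["composite"] != r["tonal"] + r["rhythm"]]


def rebuild_composite(record):
    """Return a copy of the record with composite recomputed as tonal + rhythm."""
    fixed = dict(record)
    fixed["composite"] = record["tonal"] + record["rhythm"]
    return fixed


# Hypothetical records (not data from the study). In S02 the composite was
# imputed independently of the subtest scores and no longer equals their sum.
scores = [
    {"id": "S01", "tonal": 34, "rhythm": 30, "composite": 64},
    {"id": "S02", "tonal": 31, "rhythm": 28, "composite": 66},
]

print(find_inconsistent(scores))
repaired = [rebuild_composite(r) for r in scores]
print(find_inconsistent(repaired))
```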
No outliers were excluded from the sample, a limitation that may have affected
the dispersion of scores as well as the skewness and kurtosis of the distribution curve. Due to
violations of the assumption of normality, the nonparametric Wilcoxon Signed Rank test was
conducted in lieu of paired t-tests for comparison of matched pairs of scores of the same
academic year. Limitations of the measurement instrument may have affected the results of the
study: a defining attribute of stabilized music aptitude testing was the need for context in
measurement (Gordon, 2005; Taggart, 1989). However, it was possible the context provided in
IMMA was inadequate to measure stabilized music aptitude.
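For readers unfamiliar with the Wilcoxon Signed Rank test, the core calculation for matched Fall and Spring scores can be sketched as follows. The sketch computes only the positive and negative rank sums, using average ranks for tied absolute differences and discarding zero differences; it omits the p value, which in practice would be obtained from statistical software (for example, `scipy.stats.wilcoxon`). The scores are hypothetical.

```python
def signed_rank_sums(fall, spring):
    """Return (W_plus, W_minus) for matched pairs, dropping zero differences."""
    if len(fall) != len(spring):
        raise ValueError("Fall and Spring lists must be matched pairs")
    diffs = [s - f for f, s in zip(fall, spring) if s != f]
    ordered = sorted(diffs, key=abs)

    # Assign ranks to absolute differences; tied values share the average rank.
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + 1 + j + 1) / 2
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus


# Hypothetical rhythm scores for six students (not data from the study).
fall_rhythm = [30, 28, 35, 31, 29, 27]
spring_rhythm = [32, 27, 38, 31, 33, 26]

print(signed_rank_sums(fall_rhythm, spring_rhythm))
```

Because the test relies on ranks of differences rather than the differences themselves, it does not require the assumption of normality that was violated in the current study.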
Limited evidence in extant literature of the use of IMMA as a test of stabilized music
aptitude may have constrained confidence in the instrument as a measure of the construct of interest,
although Gordon (1986c) reported practice effects of PMMA and IMMA test taking were
negligible and not a threat to validity (p. 109). An effect of testing resulting from administration
of the same test each semester for 3 academic years could create a limitation of internal validity,
as outcomes may have differed due to repeated testing with the same instrument. Continued
study of the efficacy of IMMA as a measure of stabilized music aptitude is recommended.
An additional limitation of this study was the use of raw or observed tonal scores, rhythm
scores, and composite scores as the unit of data collection, rather than percentile ranks. It was
determined raw scores were best suited for comparison across grade levels and different
statistical tests, as percentile ranks were a function of students’ relative standing in comparison
to their peers and would thus vary in accordance with grade level. However, apt comparisons of
raw scores to IMMA standardization results could not be made, as Gordon (1986c) reported
percentile ranks only.
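The dependence of percentile ranks on the comparison group can be illustrated with a brief sketch. The midpoint formula below is one common definition of percentile rank and is an assumption; it is not taken from the IMMA manual, and the score lists are hypothetical rather than normative data.

```python
def percentile_rank(score, peer_scores):
    """Percent of peers scoring below the score, plus half of those tied with it."""
    below = sum(1 for s in peer_scores if s < score)
    equal = sum(1 for s in peer_scores if s == score)
    return 100.0 * (below + 0.5 * equal) / len(peer_scores)


# Hypothetical composite raw scores for two grade levels (not normative data).
grade3_scores = [24, 26, 28, 30, 32, 34, 36, 38]
grade5_scores = [28, 30, 32, 34, 36, 38, 40, 40]

# The same raw score occupies a different relative position in each group.
print(percentile_rank(32, grade3_scores))
print(percentile_rank(32, grade5_scores))
```

In this illustration the raw score is identical for both grade levels, yet the percentile rank is lower in the group with higher scores overall, which is why raw scores were preferred for comparison across grade levels.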
A final limitation of this study was the decision to use observed scores of only the
students who had been administered IMMA for 3 consecutive years (in Grade 3, Grade 4, and
Grade 5) to address Research Question 3, rather than all available IMMA scores. A close and in-depth examination of score change was desired, in order to observe differences by grade level
grouping. To supplement the interpretation of the detailed findings of the present study, a
MANCOVA using all available IMMA scores is recommended to increase insight into
longitudinal score change and allow interpretation of interactions between dependent variables.
No compelling evidence was found from the results of a series of paired t-tests to support
an effect of chronological age on music aptitude. Mean score differences, although statistically
significant, were small for Grade 3 Spring/Grade 4 Fall composite scores. The finding of no
statistical significance for most scores supported an interpretation that students had progressed to
the stabilized music aptitude stage. A gradual decrease in effect of environment due to age
(Gordon, 1981), measured as IMMA scores, could have been interpreted from the findings of the
current study; however, it appeared the decline in scores had begun as early as Grade 3. In
contrast, continued score increases could have been interpreted as maintenance of the
developmental music aptitude stage, in which scores would continue to fluctuate as
chronological age increased. Nevertheless, most mean differences were not significant and were
less than one-half point, well below the average yearly increase of 2 points expected with
traditional instruction during the developmental music aptitude stage (Gordon, 2002).
Developmental music aptitude was not depicted as typically described (Taggart, 1989),
nor was stabilized music aptitude characterized as in extant literature (Gordon, 2004) from the
results of these paired t-tests. It appeared music aptitude was not sensitive to an effect of
chronological age and consequently IMMA was an appropriate measure for students in Grades 3,
4, and 5 at varying places along the music aptitude continuum: those who remained fully in the
developmental music aptitude stage, were transitioning to the stabilized music aptitude stage, or
had attained the stabilized music aptitude stage wholly, regardless of chronological age. Whether
IMMA was as accurate a measure of stabilized music aptitude as of developmental music
aptitude cannot be verified from the results of the current study. Nevertheless, the usefulness of
IMMA as a measure of music aptitude for students in this period of transition between music
aptitude stages was not negated. IMMA may continue to be administered as a measure of
developmental music aptitude and stabilized music aptitude, as no effect of chronological age
was concluded for music aptitude in the current study. Additional research including students
younger and older than the sample group of the current study is recommended to clarify further
the effect of chronological age on music aptitude.
No overall effect of instruction was concluded from the results of a series of Wilcoxon
Signed Rank tests conducted by academic year for IMMA tonal, rhythm, and composite scores.
It was anticipated a comparison of pre- and post-instruction music aptitude scores would help
identify the grade level at which music aptitude appeared indifferent to the effect of instruction
and consequently the age of onset of the stabilized music aptitude stage would be disclosed.
Although tonal and composite score increases from Fall to Spring were observed, mean
differences were small and statistically significant only for 2011–2012 and 2013–2014 tonal
scores. Gordon (2005) asserted scores on a stabilized music aptitude test do not increase as a
result of practice or training; therefore, previous student attainment of the stabilized music
aptitude stage was implied by the lack of significant score increase.
Nevertheless, Reese and Shouldice (2019) warned of potential reduction of score gain if
instruction ceased or effectiveness of teaching decreased (p. 478). Gordon (2001b) specified the
need for informal guidance using age-appropriate techniques and materials to move students of
all ages through music babble before formal instruction commenced. Without unstructured and
structured guidance, students’ ability to develop audiation would be limited (p. 87). Thus, it
seemed plausible instruction was inadequate for student needs and had a detrimental effect on
expected score gains. The finding of no effect of instruction on music aptitude was specific to the
parameters of instruction as applied in the current study. As an examination of the type and
quality of instruction was beyond the parameters of the current study, further investigation is
required to understand more fully the effect of compensatory and complementary instruction on
music aptitude.
The subsequent action for a finding of no significance due to prior achievement of the
stabilized music aptitude stage differs greatly from that due to inappropriate instruction. To the
practitioner, attainment of the stabilized music aptitude stage prior to age 9/Grade 4 might not
affect a music educator’s choice of music aptitude test, as IMMA seemed to measure
developmental music aptitude and stabilized music aptitude equally well. On the other hand,
Gordon (2013) contended a year of exposure to preparatory audiation might be needed to acquire
readiness for formal instruction (p. 134); therefore, use of informal guidance in lieu of or as a
supplement to formal instruction is recommended to support students’ musical age, regardless of
their chronological age (p. 30). To the researcher, continued scrutiny of IMMA scores of
students younger and older than the participants in the current sample might help clarify the
grade level at which music aptitude stabilizes. However, further investigation of type and quality
of instruction would necessitate a change in research design, likely to a quasi-experimental study
in which pre- and post-treatment IMMA scores are examined after a specified period of well-defined instruction.
The results of a longitudinal examination of tonal, rhythm, and composite scores over a
3-year period, conducted using a series of repeated measures ANOVA to investigate a period of
transition between the developmental and stabilized music aptitude stages, were mixed. It
appeared students might have remained in the developmental music aptitude stage through Grade
5, as significant mean increases from Grade 3 to Grade 4 composite scores as well as Grade 3 to
Grade 5 composite scores seemed to indicate continued score fluctuation throughout the period
in question. However, no significant differences were found for rhythm scores
except those of Groups B and E: rhythm scores had ceased to fluctuate meaningfully, which
implied students might have progressed to the stabilized music aptitude stage prior to Grade 3.
Tonal scores mimicked the composite trend of significant score increase from Grade 3 to Grade
4 and Grade 3 to Grade 5 for only 50% of the Groups. Group B results were an anomaly: Grade
4 rhythm scores were significantly lower than Grade 3 rhythm scores, but Grade 5 rhythm scores
were significantly higher than Grade 4 rhythm scores for the same 3-year period. It was
speculated a broad period of transition could account for the discrepancies found in score
fluctuation, as well as this atypical trend not previously described in the research literature. For
music educators, the implication of a transition period would be continued use of IMMA to
measure music aptitude of students still in the developmental music aptitude stage, those
transitioning between music aptitude stages, and those who had attained the stabilized music
aptitude stage.
Nevertheless, the longitudinal effect of instruction must be considered in regard to these
findings. A lack of compensatory informal music guidance for students beyond preschool age
might have hindered acquisition of higher-level music skills (Gordon, 2013, p. 9). A deficit of
foundational audiation skills for students in the current study would not have been mitigated by
formal instruction (Gordon, 2012, p. 263) in Grades 3, 4 and 5, and might have affected score
changes in a manner that did not conform to that previously described for the developmental
music aptitude stage or stabilized music aptitude stage. For the music educator, an implication of
no effect due to inappropriate type of instruction was dire: an immediate shift in type of
instruction to include informal guidance is urged, lest students’ development of audiation
become inhibited.
Numerous recommendations for future research resulted from the current study. These
are organized into two sections and outlined below to guide the reader.
Figure 15
Adaptations and Extensions to the Current Study
Adaptations to the Current Study
Expanded and More Diverse Sample.
A limitation of the current study was the use of a convenience sample of students from
the school district in which the researcher was employed. This sample had limited diversity: a
large majority of students were White, lived in poverty, attended the same school district
throughout their primary and secondary levels of education, and resided in a rural area.
Therefore, replication of this study with a sample more representative of the socioeconomic and
cultural diversity of the general population of American elementary school students is suggested
in order to generalize findings to a larger and more heterogeneous population. Similarly, the
stability of student enrollment within the current sample’s school system was atypical of the
educational experience of students in many American schools. Students who transfer to other
school systems might experience less consistency in type and frequency of instruction, and
school systems with a more transient student population might struggle to provide individualized
instruction simultaneously to students from a variety of musical backgrounds. Both of these
instructional situations might influence the findings of a similar investigation of effect of
instruction. Thus, a study including a sample more representative of the transiency rate of the
general elementary school population is suggested in order to generalize findings to a more
typical population.
Numerous test scores were missing due to student absence during test administration and
inconsistent IMMA administration to students in Grades 4 and 5. It is recommended IMMA test
make-up sessions be offered to students who were absent in order to lessen the need for
imputation of a large number of missing values. In addition, annual IMMA administration to
students in Grades 3, 4, and 5 would yield a larger and more balanced sample of observed scores
with which to conduct a longitudinal examination of music aptitude. Although multiple
imputation is a reputable method of managing missing values, a research design that resulted in
fewer missing values would yield a data set of observed values that most accurately represented
the current music aptitude level of that specific sample of students.
Consistency of Multiple Imputation Implementation.
Multiple imputation using predictive mean matching with 10 imputations was conducted
in the current study to increase statistical power due to the sizable quantity of missing data, as
previously described. However, the premise that Fall and Spring scores would be loosely related
was not accommodated by the multiple imputation procedure; therefore, it was possible imputed
Spring scores could have appeared to increase or decline markedly from preceding Fall scores in
an improbable manner not in keeping with previously reported score trends. It is recommended
imputed data be examined for values inconsistent with accepted parameters of score relationships
and those affected be replaced randomly with imputed values more closely aligned with those
parameters. In addition, a slight adaptation of the research design to abstain from imputation of
composite scores and impute only tonal scores and rhythm scores is recommended. Composite
scores are defined as the sum of tonal scores and rhythm scores, yet imputed composite scores
may not have adhered to this standard. Manual calculation of composite scores is recommended
for instances in which a tonal score or rhythm score has been imputed, in order that the sum of an
imputed score and observed score will equal the composite score.
Mitigation of Outliers.
No outliers were omitted from the observed or imputed data set. Nevertheless, the
presence of outliers seemed to affect the score distribution, often resulting in a violation of the
assumption of normality. Nonparametric methods, which have less statistical power than
parametric methods (Russell, 2018, p. 23), were required to mitigate the effect of outliers.
Instead, an examination of the data set before and after the multiple imputation procedure is
recommended and management of outliers considered in advance of statistical testing. Field
(2009) proposed three options for dealing with outliers: removal, if there is substantial reason to
believe the outlier is not representative of the population to be sampled; transformation of data, if
it seems likely the statistical models perform better using transformed data than using data that
violate the assumption the transformation corrects (Field, 2009, p. 155); and score changing, if
the score is highly unrepresentative and biases the statistical model. It is possible exclusion of
occasional outliers might have reduced the impact of outliers on the line of best fit and
consequently affected the findings of the statistical tests (Russell, 2018, p. 240).
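One way to examine a data set for outliers before and after imputation is sketched below. The 1.5 × IQR rule used here is an assumption chosen for illustration; it is not a criterion attributed to Field (2009) or one applied in the current study, and the scores shown are hypothetical.

```python
from statistics import quantiles


def flag_outliers(scores, k=1.5):
    """Return scores lying more than k * IQR beyond the first or third quartile."""
    q1, _, q3 = quantiles(scores, n=4)   # quartile cut points
    spread = q3 - q1                     # interquartile range
    low, high = q1 - k * spread, q3 + k * spread
    return [s for s in scores if s < low or s > high]


# Hypothetical composite scores (not study data).
observed = [56, 60, 62, 64, 66, 68, 70, 72, 10]
# The same scores with hypothetical imputed values appended.
after_imputation = observed + [61, 65, 67]

flagged_before = flag_outliers(observed)
flagged_after = flag_outliers(after_imputation)
```

Flagged scores would then be reviewed against the three options described above (removal, transformation, or score changing) before any statistical testing.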
Mitigation of Practice Effects.
Gordon (1986c) reported no practice effects for repeated administrations of IMMA.
Nevertheless, continued examination of IMMA as a measure of stabilized music aptitude with a
larger and more diverse sample of participants with ages and grade levels similar to and different
from participants in the current study is recommended to corroborate or refute Gordon’s findings
and the findings of the current study.
Extensions to the Current Study
Parallel Examination of Percentile Ranks.
Recommended extensions to the current study include studies in which the research
design is expanded, such as a parallel examination of percentile ranks for comparison with raw
scores and use of alternate statistical procedures. In addition, future research on the effect of
instruction pertaining to type and quality of instruction, an expansion of grade levels within the
sample, and an investigation of the effect of instruction using frequent music aptitude testing to
continually adapt instruction are suggested. Attendance at professional development
opportunities focused on preparatory audiation, an examination of the effect of music preference
on music aptitude, mitigation of cultural bias in standardized testing, an investigation of effect of
ensemble participation on music aptitude, and an examination of independent stabilization of
tonal and rhythm aptitudes are also proposed to clarify the effect of instruction in future studies.
A limitation of the current study was the use of raw scores (number of correct answers) as
the unit of comparison. In contrast, IMMA standardization results were reported as percentile
ranks (Gordon, 1986c, pp. 64–65), standard units that situate students’ scores within the context
of their grade level peers’ performance as a form of normative analysis. It was not possible to
compare the results of the current study to those of the IMMA standardization group without use
of a common unit of measure. Thus, an adaptation to the current research design in which raw
scores are converted to local percentile ranks using frequency distributions of scores (Gordon,
2012, pp. 351–353) is proposed. In this way, findings based on comparisons made using raw
scores would be enhanced by normative comparisons with local and IMMA standardization
percentile ranks.
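A conversion of raw scores to local percentile ranks from a frequency distribution might look like the following sketch. The midpoint convention used here (scores below plus half of the scores at a given value) is a common textbook definition and is an assumption; it is not necessarily the exact procedure described by Gordon (2012). The function name and the scores are hypothetical.

```python
from collections import Counter


def local_percentile_ranks(raw_scores):
    """Map each distinct raw score to its percentile rank within the group.

    Midpoint convention (an assumption):
        PR = 100 * (count below + 0.5 * count at score) / N
    """
    n = len(raw_scores)
    freq = Counter(raw_scores)           # frequency distribution of raw scores
    ranks = {}
    below = 0                            # cumulative frequency below current score
    for score in sorted(freq):
        ranks[score] = 100 * (below + 0.5 * freq[score]) / n
        below += freq[score]
    return ranks


# Hypothetical composite raw scores for one grade level (not study data).
scores = [60, 62, 62, 65, 70, 70, 70, 74, 75, 78]
ranks = local_percentile_ranks(scores)
```

Each student's raw score could then be looked up in `ranks` and set beside the published IMMA percentile rank for the same raw score.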
Alternate Statistical Procedures.
A limitation of the current study was the decision to use only observed scores from
students who had been administered IMMA for 3 consecutive years (Grade 3, Grade 4, and
Grade 5) in an attempt to garner a deep understanding of longitudinal score change of specific
groups of students. Therefore, a recommendation for future research is to use all available scores
for Grades 3, 4, and 5 for a more comprehensive investigation of a period of transition and to
compare those results to the results of the more limited sample in the current study. In addition, a
series of repeated measures ANOVAs was used in the current study to probe longitudinal score
change at a basic level, and multivariate results were used to confirm ANOVA findings. It is
possible an examination of the dependent variables (IMMA tonal score, IMMA rhythm score,
and IMMA composite score) and independent variables (Grade 3, Grade 4, and Grade 5) would
reveal relationships between all categories of independent and dependent variables most
effectively. Therefore, a MANCOVA including three continuous dependent variables (Test
Score: IMMA tonal score, IMMA rhythm score, and IMMA composite score), one categorical
independent variable with 3 levels (Grade Level: Grade 3, Grade 4, and Grade 5), and one
covariate variable with 2 levels (Type of Instruction: with informal guidance and without
informal guidance), is recommended to determine whether there is a relationship between grade
level and IMMA score after controlling for type of instruction (Hatcher, 2013, p. 374).
Advantages of MANCOVA over ANOVA include its ability to detect relationships between dependent
variables, its greater power through reduction of the error term, and its adjustment of mean scores on
the independent variable for the covariate (Hatcher, 2013, pp. 375–376). Therefore, MANCOVA
results would build on the foundational understanding gleaned from the detailed analysis of
scores of grade level groupings from the current study in investigating a period of transition
between music aptitude stages.
Full Information Maximum Likelihood (FIML) is a method to estimate parameters in a
variety of statistical models such as structural equation modeling (SEM) and is a popular model-based procedure for handling missing data (McKnight et al., 2007, p. 163). Instead of imputing
missing values, FIML uses the likelihood function to estimate the probability of the data as a
function of the observed data and unknown parameters. Maximum Likelihood procedures
produce unbiased estimates in large samples, approximate a normal distribution in repeated
samplings (McKnight et al., 2007, pp. 160–164), and provide reproducible estimates with
smaller standard errors than multiple imputation (MI) (Ghisletta & Aichele, 2017). With large
amounts of missing information, MI can require 200–300 imputations of the data set to estimate
a standard error similar to that of FIML; therefore, FIML is more efficient than MI (Von Hippel,
2016). Although MI is a more flexible procedure, bias can be introduced if there are conflicts
between the assumptions of the imputation model and the analytic model. When used with SEM
statistical software, FIML can be employed in a single step, in contrast with the multiple-step
process of MI (data set imputation, statistical analysis, and pooling). Therefore, a future study
using FIML to handle missing data is recommended.
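The idea behind FIML can be illustrated with a small sketch of a casewise (observed-data) log-likelihood for paired Fall and Spring scores. It assumes a bivariate normal model and data missing at random; the function names, parameter values, and scores are hypothetical. In practice, SEM software searches for the parameter values that maximize this quantity, a step omitted here.

```python
import math


def loglik_univariate(x, mu, sd):
    """Log density of one observed score under a normal distribution."""
    z = (x - mu) / sd
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * z * z


def loglik_bivariate(x1, x2, mu1, mu2, sd1, sd2, rho):
    """Log density of a complete (Fall, Spring) pair under a bivariate normal."""
    z1 = (x1 - mu1) / sd1
    z2 = (x2 - mu2) / sd2
    one_minus_r2 = 1.0 - rho * rho
    return (
        -math.log(2 * math.pi) - math.log(sd1) - math.log(sd2)
        - 0.5 * math.log(one_minus_r2)
        - (z1 * z1 - 2 * rho * z1 * z2 + z2 * z2) / (2 * one_minus_r2)
    )


def observed_data_loglik(cases, mu1, mu2, sd1, sd2, rho):
    """Sum each case's log-likelihood using only the scores it actually has."""
    total = 0.0
    for fall, spring in cases:
        if fall is not None and spring is not None:
            total += loglik_bivariate(fall, spring, mu1, mu2, sd1, sd2, rho)
        elif fall is not None:
            total += loglik_univariate(fall, mu1, sd1)      # Spring missing
        elif spring is not None:
            total += loglik_univariate(spring, mu2, sd2)    # Fall missing
        # a case with both scores missing contributes nothing
    return total


# Hypothetical (Fall, Spring) scores; None marks a missing score.
cases = [(30, 32), (28, None), (None, 35), (33, 34)]
ll = observed_data_loglik(cases, mu1=31, mu2=33, sd1=4, sd2=5, rho=0.6)
```

No score is filled in at any point: incomplete cases contribute through the marginal distribution of the scores they have, which is why the imputation and pooling steps of MI are unnecessary.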
Structural equation modeling (SEM) is a flexible set of procedures used to estimate and
test models that hypothesize causal relationships between unobserved (latent) and observed
(manifest) variables (Hatcher, 2013, p. 478). The focus of SEM is on observation and indirect
measurement of latent variables in order to theorize causal connections among them (Huck,
2012, pp. 504–505). The researcher’s knowledge of theory and previous research is used initially
to define the latent variables, after which measurable variables are selected to illuminate the
latent variables’ qualities (pp. 507–508). Therefore, SEM is not exploratory; instead, SEM
compares actual relationships among variables to the theoretical relationships previously
hypothesized and evaluates the fit of the model to explain observed data (Huck, 2012, pp. 504–
505). Diagrams which depict types, associations, and causal relationships of and between
variables are used to present SEM results (Huck, 2012, p. 506). An advantage of SEM is that
inferences may be drawn from large data sets (Leech et al., 2015, p. 90). Thus, the use of SEM in
future studies is recommended, in order that relationships between the variables in the current
study may be elucidated more clearly and completely.
Investigation of Type and Quality of Instruction on Music Aptitude.
The limitation of type and quality of instruction, although not a predetermined focus, was
critical to the findings of the current study. No effect of instruction on music aptitude was found
in the current study; however, a consideration of the type of instruction offered for all primary
grades as well as each intermediate grade level considered in the study might have clarified its
function as traditional, compensatory, or complementary, as it was possible the necessary
audiational skills required to establish readiness for the formal introduction offered in the school
environment were insufficient for the participants in the current sample. This might have affected
the results of the study adversely, as the effect of the type of instruction being examined might
have differed from the type of instruction most appropriate for the participants. Gordon (1986c)
emphasized the import of appropriate informal and formal music experiences (p. 103), and
recommended longitudinal studies similar to those examining music achievement of culturally
homogeneous students with differing levels of stabilized music aptitude should be undertaken for
culturally homogeneous students with differing levels of developmental music aptitude, with
special consideration of theories of instruction (Gordon, 1980b). In addition, Gordon noted any
significant difference in developmental music aptitude between students from differing
backgrounds would indicate the need for diverse (and culturally responsive) instruction to
address those score differences. Therefore, type of instruction as it relates to compensatory and
complementary instruction should be included as a variable in future studies of effect of
instruction on music aptitude.
Expansion of Sample Grade Levels.
The current examination focused on IMMA scores of students in Grades 3, 4, and 5 in
order to include the grade levels prior to, including, and following age 9, the target age at which
the shift to stabilized music aptitude stage had been purported in previous research. As the results
of the current study were inconclusive for effects of chronological age and instruction on music
aptitude and a period of transition, the inclusion of Grade 2 and Grade 6 IMMA scores in future
studies is recommended to further investigate the onset of stabilized music aptitude; Gordon
(2013) reported the difference between highest and lowest scoring Grade 2 students was greater
than that of average scoring students in Grades 2 and 6 (p. 16). Gordon (2002) noted the use of
the same test allowed easily explained and understood comparisons for students in different
grades; thus, use of IMMA, which is standardized for all students in the proposed sample, is
optimal and warranted. Results of such a study could be compared to a similar study conducted
by Gordon (2002) in which PMMA and MAP non-preference subtests were administered to
students in Grade 2 and Grade 6 and resulting correlations used to consider differences in
developmental and stabilized music aptitudes, as further evidence of the dichotomy of music
aptitude stages.
Continual Adaptation of Instruction Based on Frequent Music Aptitude Testing.
To increase understanding of how instruction can be adapted through close monitoring of
developmental music aptitude, a study is recommended in which the scores of more frequent
aptitude testing (perhaps every two months) are used to guide instruction. A sample size of at
least N = 30 is small enough to provide the flexibility necessary for such frequent adaptation of
instruction, yet large enough to satisfy the central limit theorem, which states the sampling
distribution will be normally distributed regardless of the shape of the population distribution
when samples are larger than 30 (Field, 2009, p. 782). Gordon (1998) proposed periodic
administration of music aptitude testing, particularly for students still in the developmental music
aptitude stage, in order to diagnose students’ musical strengths and weaknesses for
individualized instruction and to identify students with high music aptitude to provide the
opportunities necessary to maintain that level of aptitude (p. 119). Reese and Shouldice (2019)
described the process of adapting instruction based on knowledge of music aptitude scores.
Scores of multiple test administrations may be compared to assess effect of instruction. Stable
scores indicate instruction supports the tested level of aptitude; score increase suggests
instruction has been compensatory. A decrease in scores signifies the need for an adjustment of
instruction to provide additional support (p. 482). Moore (1987) concurred, noting primary music
educators were able to monitor the effects of classroom instruction on developmental music
aptitude to good effect. Thus, findings of extant literature establish a foundation for further study
in which instruction is adapted consistently and frequently based on music aptitude test scores.
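The interpretation of score change between administrations described above can be expressed as a simple decision rule. The sketch below is illustrative only: the function name is hypothetical, and the tolerance that defines a "stable" score is an assumed value, not one given by Reese and Shouldice (2019) or Gordon.

```python
def interpret_change(previous, current, tolerance=2):
    """Classify the change between two test administrations.

    tolerance is a hypothetical band (in raw score points) within which
    scores are treated as stable; it would need empirical justification,
    for example from the standard error of measurement.
    """
    change = current - previous
    if abs(change) <= tolerance:
        return "stable"      # instruction supports the tested level of aptitude
    if change > 0:
        return "increase"    # instruction appears to have been compensatory
    return "decrease"        # instruction should be adjusted to add support


# Hypothetical scores from consecutive administrations (not study data).
labels = [interpret_change(p, c) for p, c in [(30, 31), (28, 34), (33, 27)]]
```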
Professional Development Focused on Preparatory Audiation.
A pragmatic suggestion for all music educators interested in understanding the construct
of preparatory audiation, building foundational preparatory audiation skills for use of formal
instruction, and applying Music Learning Theory is to participate in the Gordon Institute for
Music Learning (GIML) Professional Development Levels course in Early Childhood before or
concurrently with their principal GIML course of interest (elementary general music,
instrumental, or piano). Gordon (2012) noted students moved progressively through music
babble when in the developmental or stabilized music aptitude stage and should not be hurried
into audiation and formal instruction as they matriculate into school at age 5 or older until they
have phased through preparatory audiation by sufficient participation in informal guidance
activities (pp. 260–261). The need to address preparatory audiation deficiencies applies to
educators of all instructional levels and areas of concentration. Gordon (1986c) noted informal
instruction was organized according to tonal and rhythm concepts only (pp. 70–71); therefore,
participation in professional development workshops or courses focused on preparatory audiation
may provide needed clarification for inclusion of informal instruction beyond the early childhood
years. Implementation of the concepts presented within those preparatory audiation workshops or
courses, in an age-appropriate manner for school-age students, may help move students who
remain in preparatory audiation toward a level of audiation from which they may benefit wholly
from formal instruction.
Inclusion of Informal Guidance at All Levels of Instruction.
As is evident from the findings of the present study, this researcher, despite reasonable
and concerted efforts to adapt instruction to students’ musical needs according to their level of
developmental or stabilized music aptitude, did not possess the necessary understanding of
preparatory audiation to include sufficient informal guidance for students older than Grade 1. As
previously recommended, practicing music educators should receive professional development
focused on preparatory audiation, including how to identify each student’s phase of preparatory
audiation for tonal and rhythm dimensions and implement informal guidance to support all
students as they transition out of music babble, regardless of their chronological age or grade
level. Ideally, preparatory audiation as a construct would be presented to preservice teachers in
undergraduate methods courses, along with opportunities to observe and work with young
children still in music babble. The National Association for Music Education (NAfME)
published a position statement on early childhood music education (National Association for
Music Education, 2021) and identified PreK–8 music standards (National Association for Music
Education, n.d.), and organizations such as the Gordon Institute for Music Learning (GIML) and
the Early Childhood Music & Movement Association (ECMMA) actively promote music
education focused on early childhood, yet many music educators, including the researcher, were
not made aware of best practice for implementation of those standards for school-age students
remaining in preparatory audiation. Little information regarding early childhood music
instruction or preparatory audiation was available in the researcher’s preservice training, and
professional development opportunities were focused primarily on techniques and materials
applicable for formal instruction. Offering learning opportunities for groups of upper elementary
students, some of whose members likely remained in preparatory audiation for tonal or rhythm
dimensions, others who continued to function tonally or rhythmically in the developmental
music aptitude stage, and still more who had transitioned to the stabilized music aptitude stage
for tonal or rhythm dimensions, is akin to simultaneously spinning multiple plates. Just as the topic of
preparatory audiation is often absent from music teacher preparation, so the complexity of
instructional design combining formal instruction with simultaneous mitigation of preparatory
audiation deficits also has been overlooked. This lesson design skill set is specific and its
implementation regrettably infrequent.
In the current study, PMMA and IMMA were administered routinely by the researcher in
Fall and Spring to all students in kindergarten through Grade 3. Pre- and post-instruction scores
were scrutinized and instruction adapted accordingly by the researcher; instruction was
individualized based on bi-annual PMMA and IMMA scores and student performance. However,
adaptations to instruction were limited to content and skills within the context of formal
instruction; little attempt was made to modify instruction to accommodate students who had not
yet passed out of preparatory audiation. In contrast, Gordon (1986c) recommended students in
Grades K–3 participate in formal and informal instruction, as concurrent experience in both
strengthens the outcomes of formal instruction (p. 70). Thus, it is imperative music educators of
students at all levels and in all areas of music study recognize the need to include aspects of
informal guidance, as there are likely students in their classes or ensembles who remain in
preparatory audiation tonally or rhythmically and will not benefit from formal instruction until
they have passed out of music babble (students whose musical age is delayed in comparison to
their chronological age). Informal guidance activities such as singing tonal patterns, chanting
rhythm patterns, and exposure to songs and chants in a variety of tonalities and meters, in
conjunction with formal instruction, are suggested to strengthen audiation and thus increase
music aptitude scores (Gordon, 2005).
Effect of Music Preference Testing on Music Aptitude.
An investigation of the effect of music preference on determination of stabilized music
aptitude is suggested as an extension of the current study. Boyle (1992) noted general acceptance
of music preference as a construct (p. 251), yet the means of measurement of this construct were
not universally accepted. Gordon (1986a) reported music preference measures were within the
purview of stabilized music aptitude and noted young children in the developmental music
aptitude stage were unable to make reliable judgements about music preference, regardless of the
content or framing of test items (Gordon, 1998, p. 70). The dependence of future musical success
on the MAP musical sensitivity total test score (Gordon, 1995, p. 55), and the rhythm imagery
total test score in particular, was noted (Gordon, 1998, p. 141). However, IMMA, purported to
function as a measure of stabilized music aptitude for students age 9 and higher (Gordon,
1989d), contains no preference subtests due to its primary function as a measure of
developmental music aptitude more discriminating than PMMA (Walters, 1991). Gordon (2005)
described the following indirect findings associated with preference tests:
1. Successful music students score high on preference measures.
2. Students who score high on preference measures demonstrate higher levels of expression
and overall sensitivity in their performances.
3. Preference scores are highly correlated with potential to create and improvise music.
4. Preference scores are highly intercorrelated with the MAP meter subtest.
5. Preference scores are highly correlated with ability to recall and make musical inferences,
whereas non-preference subtest scores are more highly correlated with ability to
memorize and imitate music (p. 16).
Much may be gained from analysis of students’ scores on preference measures, yet only MAP
provides the opportunity to glean this information.
Therefore, an examination of concurrent IMMA and MAP scores of intermediate students
would offer the ability to correlate IMMA scores with those of a valid test of stabilized music
aptitude, with and without preference subtest scores, to confirm independently the assertion that
IMMA functions as a test of stabilized music aptitude for students age 9 and older. The
longitudinal predictive validity of the MAP battery was due in part to its three preference
subtests (Gordon, 1998, p. 61); MAP’s ability to diagnose musical strengths and weaknesses in
order to individualize instruction was notable (Gordon, 2001c) and superior to that of IMMA
(Geissel, 1985, p. 32). Therefore, the potential for MAP preference subtest scores to boost the
diagnostic capabilities of IMMA should be examined, as administration time of the two IMMA
subtests is markedly less than that of the full MAP battery. This investigation could lead in turn
to an examination of MAP preference subtest use to augment the ability of the Advanced
Measures of Music Audiation (AMMA), a measure of stabilized music aptitude standardized for
students in junior high through university, to describe stabilized music aptitude more fully, as
preference tests, despite a likely increase in test validity, were excluded from AMMA to
minimize test length (Gordon, 1998, p. 111).
Mitigation of Cultural Bias in Standardized Testing.
Even as of this writing, avoidance of talk about race was typical, as “White people have
been socialized not to talk about race” (Bradley, 2007, p. 152). Thus, the discourse necessary to
address the power differential resulting from racism was frequently absent in American society
and schools: Bradley, citing Pollock (2004), noted those involved in education in all roles
“lacked the language” to discuss race. Nevertheless, to avoid talking about race was to risk
perpetuation of racism by not disputing whiteness as the cultural norm and to marginalize
students who experience racism. Therefore, Gillborn (2006) advocated using knowledge gleaned
from past errors to adapt to the challenges of the present: an understanding of Gordon’s (1980b)
intent in conducting a study of “inner city” students must be tempered by the anti-racism and
social justice aims of Critical Race Theory (CRT), a theoretical and analytical framework for
educational research (DeCuir & Dixson, 2004), that reject the placement of White culture as
advantageous, despite the pervasiveness of this mindset in the social fabric of the American
culture (Gillborn, 2006).
Because no description of the racial, ethnic, or socioeconomic makeup of the IMMA
norms sample was given, it was inferred the sample was relatively homogeneous in composition.
Gordon (1998) stated definitively that MAP scores, including preference test scores, were
normally distributed (p. 100) and all students audiated similarly, regardless of cultural
background (Gordon, 1981). Gordon’s attempt to mitigate the effect of cultural difference was
apparent in his use of object identifiers to eliminate the need for reading, writing, and English
language skills in PMMA and IMMA (Gordon, 1986c, p. 33). He also endeavored to eliminate
cultural bias in the design of MAP (Gordon, 1967a) and as a factor in student achievement
(Gordon, 1980b) through selection of samples of diverse students. Gordon’s (1987) use of a
variety of modes and meters was an attempt to ensure MAP would generalize to “occidental
culture” (p. 69); results of research studies have confirmed the validity of MAP for use in East
Asian cultures as well (South Korea: Reynolds & Hyun, 1994; Taiwan: Chuang, 1997). Yet
Gordon interpreted potential differences in music aptitude by race as based on limited
environmental music opportunities and speculated students’ motivation to learn might have
affected their performance on PMMA. Today, we must question this deficit perspective and
instead suggest that the measure may not be sensitive to the environmental opportunities
available or music intellect possessed by minoritized students in the United States as well as non-Western cultures in other countries.
Typically, norms of standardized tests are based on the scores of dominant groups, which
can result in bias against minoritized students (Kim & Zabelina, 2015). This seemed to be the
case for the small norms sample used for IMMA, which was selected from a limited geographic
region (Gordon, 1986c, p. 85). Yet historically, standardized tests have been found to reproduce
racial and economic inequalities that correlate with societal inequities (Au, 2008). Variance
among test scores can be explained by noninstructional factors such as poverty rate, language
barriers, and racism (Kohn, 2000). Knoester and Au (2017) noted correlations of structural
inequalities associated with racism and poverty with K-12 standardized testing were stronger
than with any other factor. Further examination of IMMA test validity with minoritized students
is advised, as continued reliance on findings from limited studies including homogeneous
samples may affect generalization of those findings to minoritized student populations. In
addition, creation of local norms is recommended for a more accurate comparison of findings of
diverse populations. Gordon (1986c) suggested the development of local norms was an
outgrowth of frequent test administrations and might be superior for comparing relative standing
(p. 86). Holahan and Thomson (1981) concurred, and proposed construction of local norms as
standard practice for all tests.
Potential bias inherent in standardized testing must be considered. Standardized test
content frequently requires foundational knowledge and skills disproportionately possessed by
students from more economically privileged backgrounds (Kohn, 2000). Interpretation of test
scores and the resulting instructional decisions can be inequitable as well (Hood, 1998): low
scores may highlight a mismatch between the test creator’s frame of reference and the student’s
cultural frame of reference (Bond, 2017; Koelsch et al., 1995), rather than indicate low aptitude.
It behooves us to examine whether the inequities of standardized achievement tests also apply to
standardized music aptitude tests. If music aptitude scores are to represent students’ potential to
learn in music, it is critical we acknowledge the extent to which factors such as racism, poverty,
and foundational knowledge may influence scores. Future research is recommended in which the
impact of these extra-musical factors on music aptitude is investigated.
Gordon (1986c) recognized extra-musical factors also must be considered in designing
appropriate instruction and the expertise of the music teacher taken into account (p. 76). He
suggested differences in aptitude scores of students of diverse backgrounds indicated a need for
changes in instruction to mitigate those differences (Gordon, 1980b). Culturally- and musically-responsive instruction is needed to mitigate cultural bias in testing (Kim & Zabelina, 2015), and
such instruction is assessed more effectively through use of culturally responsive measures (Hood,
1998). Inclusion of creativity as an additional criterion might be considered to supplement the
information provided from a standardized measure of music aptitude; performance-based
assessment has been found to measure higher order thinking skills and provide a fairer
assessment of minoritized students (Hood, 1998). One such assessment tool is the Torrance Test
of Creative Thinking (TTCT), which measures creative strengths such as fluency, originality,
and elaboration. The addition of music preference tests to IMMA tests also might provide a more
complete snapshot of students’ music potential, as the TTCT tasks were not applicable directly to
standardized tests of music aptitude. Nonetheless, a study in which musical tasks similar to those
included in TTCT are examined for reduction or elimination of cultural bias might be useful. In
addition, future studies including more diverse samples are needed in order that generalizations
to more diverse populations can be drawn and appropriate, culturally-responsive changes to
instruction made.
The primary goals of music aptitude test administration are to improve instruction and
identify students with high music aptitude in order to encourage participation in school music
instruction: both are intended to help students fulfill their musical potential. This is an ideal
that recognizes and embraces innate music aptitude as an asset and a fund of knowledge (Moll,
1992) worth cultivating.
Effect of Ensemble Participation on Music Aptitude.
Elementary students in Grades 4 and 5 often have the opportunity to participate in school
performance ensembles such as band, chorus, or orchestra. Approximately 88.5% of the students
in the current study participated in performance ensembles as elementary students. However, few
research studies were found in which the effect of ensemble participation on music aptitude was
examined. As part of his 1967 standardization of the MAP battery, Gordon (1998) investigated
the relationship of stabilized music aptitude scores to instrumental music instruction (p. 79).
Instrumental ensemble members scored only slightly higher than choral ensemble members, who
in turn scored higher than nonparticipants (Gordon, 1998, pp. 79–80), although similar score
distributions were found for students who participated in band, orchestra, and chorus at each
grade level, as well as for instrument type (Gordon, 1995, p. 91). No findings of further
statistical testing were reported; however, separate MAP norms for musically select ensemble
participants at the elementary, junior high, and senior high school levels were published
(Gordon, 1995, p. 92).
Regrettably, the discrepancies in MAP scores between participants and non-participants
in school music ensembles might be attributable to the selectivity of music performance groups.
Gordon (1995) found approximately half the students who scored in the upper 20% on MAP did
not participate in school ensembles or receive special instruction (p. 9); consequently, MAP use
was promoted in order to identify high-aptitude students to encourage participation in music
activities (Gordon, 1995, p. 9). Nevertheless, Gordon (1998) noted non-participation in
ensembles did not limit students from scoring high on music aptitude measures, nor did
ensemble participation guarantee high music aptitude scores (p. 80). In fact, Gordon found MAP
scores of students with instrumental training were only negligibly higher than scores of those
who did not participate in the school instrumental program and concluded training had no effect
on MAP scores (p. 106). Although instrumental instruction did not seem to have an
effect on stabilized music aptitude as measured by MAP, students’ instrumental achievement
was greater when teachers used their knowledge of student MAP scores to adapt instruction
(Gordon, 1998, p. 106). No extant research on the effect of ensemble participation on
developmental music aptitude was found. Therefore, initial investigation on the effect of
ensemble participation on developmental music aptitude and further investigation to confirm or
refute Gordon’s findings on its effect on stabilized music aptitude are recommended.
Atomistic Examination of Tonal Aptitude and Rhythm Aptitude Stabilization.
Independent transition to stabilization of tonal and rhythm aptitude was conjectured from
the findings of the current study. In paired t-tests of Grade 3 Spring/Grade 4 Fall scores, only
rhythm scores differed significantly: rhythm aptitude seemed to remain in the
developmental music aptitude stage. Because no significant tonal score changes were found,
there was no evidence tonal aptitude continued to fluctuate. By definition, music aptitude that
was no longer influenced by the musical environment was considered stabilized. Therefore, it
was concluded the transition to stabilized music aptitude seemed complete prior to Grade 3 for
the tonal dimension but was just beginning in Grade 4 for the rhythm dimension.
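The adjacent-administration comparison described above can be illustrated with a minimal sketch of a paired t statistic. The function name and the score lists are hypothetical and are not data from the current study; the sketch only shows how each student's two scores are differenced before testing, and it stops short of the p value, which would come from a t table or statistical software.

```python
import math
import statistics


def paired_t_statistic(first_scores, second_scores):
    """Return (t, degrees_of_freedom) for paired scores from the same students."""
    if len(first_scores) != len(second_scores):
        raise ValueError("Each student needs a score from both administrations.")
    if len(first_scores) < 2:
        raise ValueError("At least two paired scores are required.")

    # Difference each student's later score from the earlier score.
    differences = [second - first for first, second in zip(first_scores, second_scores)]
    n = len(differences)
    mean_difference = statistics.mean(differences)
    sd_difference = statistics.stdev(differences)  # sample SD, n - 1 denominator
    if sd_difference == 0:
        raise ValueError("All differences are identical; t is undefined.")

    t = mean_difference / (sd_difference / math.sqrt(n))
    return t, n - 1


# Hypothetical rhythm raw scores for five students (illustration only).
grade3_spring = [30, 32, 28, 35, 31]
grade4_fall = [32, 33, 31, 36, 34]

t_value, df = paired_t_statistic(grade3_spring, grade4_fall)
print(f"t({df}) = {t_value:.2f}")
```

A t value that exceeds the critical value for the given degrees of freedom would correspond to the "significantly different" result reported for rhythm scores; a value below it would correspond to the tonal result.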
In contrast, significant mean tonal score differences were noted for Groups A, C, and D
in an examination of repeated measures ANOVA results; continued score fluctuation, a feature
of developmental music aptitude, was suggested from these findings. However, significant mean
rhythm score differences were noted only for Groups B and E: rhythm scores remained relatively
static for the majority of Groups, which suggested the stabilized rhythm aptitude stage had
been attained prior to Grade 3.
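For readers less familiar with the longitudinal analysis, the following is a minimal sketch of the F ratio for a one-way repeated measures ANOVA, in which each student contributes one score per test administration. The function name and scores are hypothetical, and the sketch assumes complete data and does not check sphericity or compute a p value; it is not the analysis procedure used in the current study.

```python
def repeated_measures_f(scores):
    """Return (F, df_occasions, df_error).

    scores: list of rows, one row per student, one column per administration.
    """
    if not scores:
        raise ValueError("No scores supplied.")
    n = len(scores)          # students
    k = len(scores[0])       # test administrations
    if n < 2 or k < 2:
        raise ValueError("Need at least two students and two administrations.")
    if any(len(row) != k for row in scores):
        raise ValueError("Every student needs a score for every administration.")

    grand_mean = sum(sum(row) for row in scores) / (n * k)
    occasion_means = [sum(row[j] for row in scores) / n for j in range(k)]
    student_means = [sum(row) / k for row in scores]

    # Partition total variability into occasions, students, and error.
    ss_occasions = n * sum((m - grand_mean) ** 2 for m in occasion_means)
    ss_students = k * sum((m - grand_mean) ** 2 for m in student_means)
    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_occasions - ss_students
    if ss_error <= 0:
        raise ValueError("No error variability; F is undefined.")

    df_occasions = k - 1
    df_error = (k - 1) * (n - 1)
    f_ratio = (ss_occasions / df_occasions) / (ss_error / df_error)
    return f_ratio, df_occasions, df_error


# Hypothetical tonal raw scores: four students, three administrations.
tonal_scores = [
    [30, 32, 33],
    [28, 29, 31],
    [35, 35, 38],
    [31, 34, 34],
]
f_ratio, df1, df2 = repeated_measures_f(tonal_scores)
print(f"F({df1}, {df2}) = {f_ratio:.2f}")
```

Because differences between students are removed from the error term, this design is sensitive to change within the same students over time, which is why it can disagree with tests on adjacent administrations alone.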
Gordon (2013) noted student engagement in tonal and rhythm preparatory audiation may
differ by type and stage (p. 30). However, discussion of tonal aptitude and rhythm aptitude as
independent constructs in the transition to the stabilized music aptitude stage was not found in
the extant literature reviewed for the current study. Gordon (1981) observed the interaction
between music aptitude and the musical environment likely occurred from birth to age 8,
although the effect of environment decreased as chronological age increased, until approximately
age 9, when music aptitude stabilized. Despite frequent descriptions of music aptitude stabilizing
at age 9 in extant literature, it was inferred this conclusion was drawn from the absence of
significant fluctuation in composite scores, as separate consideration of tonal score or rhythm
score change was not stipulated. Thus, a gestalt perspective of music aptitude stabilization was
concluded from composite results, while implications drawn from atomistic results seem to
indicate the possibility of independent stabilization of tonal and rhythm dimensions. Although
equating of the paired t-test and repeated measures ANOVA findings is not recommended, their
difference is notable: tonal aptitude seemed to stabilize before rhythm aptitude when scores from
adjacent test administrations were examined, yet rhythm aptitude appeared to stabilize first
when longitudinal scores were considered. Perhaps the former occurrence is only a preliminary,
short-term effect and the long-term effect of initial stabilization of rhythm aptitude is conclusive.
Regardless, it appears tonal and rhythm aptitude stabilize independently and the temporal aspect
of test administration (based on scores from consecutive administrations or those collected over a
prolonged period of time) should be considered when interpreting results of future studies.
Specifically, an investigation expressly focused on the concurrent or independent stabilization of
tonal aptitude and rhythm aptitude is recommended to confirm or refute the findings of the
current study.
The objective of the current study was to investigate the onset of, transition to, and
longitudinal constancy of stabilized music aptitude in upper elementary students. It was
predicted no effect of chronological age (Gordon, 1989b, 2005) or instruction (DeYarman, 1975;
Fosha, 1964; Gordon, 1981; Mang, 2013) would be found, based on findings of previous
research. In contrast, evidence of a period of transition between the developmental and stabilized
music aptitude stages at approximately age 9/Grade 4 was anticipated, based on observations by
Gordon (1989b, 2006).
As expected, no effect of chronological age on music aptitude was concluded. Significant
results were found only for Grade 3 Spring/Grade 4 Fall rhythm scores and composite scores.
Nevertheless, this close examination of tonal, rhythm, and composite scores confirmed several of
Gordon’s assertions: scores tended to increase with chronological age (Gordon, 1998, p. 169),
score gains began to decrease as students transitioned to the stabilized music aptitude stage
(Gordon, 1981), and an average annual increase of two points for developmental music aptitude
scores for students receiving traditional instruction (Gordon, 2002) served as a useful threshold
for determination of practical significance of score change. Replication of this study with a more
culturally and socio-economically diverse sample is recommended in order to better generalize
the results of the current study to a broader population. In addition, a comparable study including
students younger and older than those in the current sample is suggested to further define the
onset of stabilized music aptitude.
No effect of instruction on music aptitude was found for the current study. A small but
significant increase in tonal and composite scores was found; the direction of rhythm score
change was inconsistent, and mean rhythm score differences were not significant. Nonetheless,
the lack of informal guidance activities necessary to address deficits in students’ preparatory
audiation might have affected the longitudinal influence of formal instruction for this group of
students, as reflected in the IMMA scores under consideration. Thus, the corroboration of
Gordon’s (2013) assertion of the necessity of inclusion of informal guidance to enhance
readiness for formal instruction for students regardless of chronological age was an important
finding of this study. Future research focused on the effect of type and quality of instruction on
music aptitude is recommended.
A period of transition could not be substantiated conclusively from the results of the
current study. An atypical pattern of tonal score change seemed to support a period of transition;
however, non-significant mean rhythm score differences seemed to indicate students had already
attained the stabilized music aptitude stage. Thus, it was conjectured tonal aptitude and rhythm
aptitude stabilized independently. For the current study’s subjects, tonal aptitude seemed to have
stabilized prior to Grade 3, according to paired t-test results of IMMA scores of consecutive test
administrations of adjacent grade levels. However, it appeared rhythm aptitude stabilized before
tonal aptitude, as concluded from an examination of longitudinal data. Investigation of
independent stabilization of tonal aptitude and rhythm aptitude is recommended for future study.
In addition, an effect of instruction due to an insufficient level of preparatory audiation readiness
could have affected the findings for a period of transition; therefore, further research is necessary
to continue exploration of this topic.
An important premise of the current study was that recognition of the onset of stabilized
music aptitude would be more practical and unequivocal than identification of the culmination of
developmental music aptitude. In practice, this proved less straightforward than predicted, as
evaluation of the direction and size of score change was necessary to determine an interpretation
of students’ current music aptitude stage. It was anticipated a significant score increase would be
indicative of continuation in the developmental music aptitude stage. A significant score
decrease was interpreted as the active decline in influence of the musical environment (Gordon,
1986a) experienced by students in a period of transition from the developmental to the stabilized
music aptitude stage. A lack of significant score fluctuation, whether as an increase or decrease
in scores, signified static score change: students likely had already attained the stabilized music
aptitude stage.
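The three-way interpretation described above can be summarized as a simple decision rule. The sketch below is illustrative only: the function name, the stage labels, and the alpha level of .05 are assumptions and are not taken from the research design of the current study.

```python
def interpret_score_change(mean_change, p_value, alpha=0.05):
    """Map the direction and significance of a mean score change to a stage.

    mean_change: later mean minus earlier mean (positive = increase).
    p_value: significance of the change from the chosen statistical test.
    alpha: significance threshold (assumed here to be .05).
    """
    if not 0 <= p_value <= 1:
        raise ValueError("p_value must be between 0 and 1.")

    # No significant fluctuation in either direction: static score change.
    if p_value >= alpha or mean_change == 0:
        return "stabilized"
    # Significant increase: the musical environment still influences aptitude.
    if mean_change > 0:
        return "developmental"
    # Significant decrease: declining environmental influence during transition.
    return "transition"


# Hypothetical results (illustration only).
print(interpret_score_change(mean_change=1.8, p_value=0.01))
print(interpret_score_change(mean_change=-1.2, p_value=0.03))
print(interpret_score_change(mean_change=0.4, p_value=0.41))
```

Applying such a rule separately to tonal and rhythm scores, rather than only to composite scores, is what allows the two dimensions to be assigned to different stages for the same group of students.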
An a priori interpretation of scores from which the onset of stabilized music aptitude
would be determined was not defined in the research design of the current study. Although not
specified in extant literature, it was implied composite scores were used to determine the
approximate age at which the stabilized music aptitude stage was reached. This was reflective of
a gestalt perspective of onset of stabilized music aptitude. In contrast, an atomistic viewpoint was
represented in the conjecture that tonal aptitude and rhythm aptitude stabilized independently of
one another. Therefore, a period of transition could not be concluded definitively without a clear
understanding of the influence of the atomistic position in defining stabilized music aptitude.
Research is deemed significant if it contributes uniquely to the theory or knowledge base
of its field of study. Because the published research was sparse, an investigation of the onset of
and transition to stabilized music aptitude of upper elementary students was warranted. Although
a period of transition was not substantiated conclusively from the results of the current study, it
was conjectured tonal aptitude and rhythm aptitude stabilized independently of one another.
These findings yielded a new conceptualization of tonal aptitude and rhythm aptitude as separate
constructs whose independent stabilization will need to be confirmed or refuted through
continued research. As students’ transition from preparatory audiation may differ for tonal and
rhythm dimensions, so tonal and rhythm aptitude also may stabilize at different rates. Music
educators had been encouraged to provide, through inclusion of informal guidance with formal
instruction, compensatory instruction to raise the tonal or rhythm dimension identified through
music aptitude testing as each student’s weakness, as well as complementary instruction to
maintain or increase the tonal or rhythm dimension identified as each student’s strength (Gordon,
1986c). Continued compensatory and complementary instruction is encouraged throughout the
intermediate grade levels, as the simultaneous shift from developmental to stabilized music
aptitude for tonal and rhythm dimensions cannot be presumed if the independent stabilization of
tonal and rhythm aptitudes is accepted. Because it is conjectured tonal and rhythm aptitudes
stabilize at different times and different rates for individual students, it is critical informal
guidance activities are offered to students of all ages and levels of experience. Singing tonal
patterns, chanting rhythm patterns, and participating in musical interactions with songs and
chants in a variety of tonalities and meters are additional tools to individualize instruction for
students who have not emerged from music babble. Students may remain in or be transitioning
from the developmental music aptitude stage for one dimension (tonal or rhythm) and thus need
continued compensatory or complementary instruction, yet have already transitioned to the
stabilized music aptitude stage for the other dimension, for which instruction no longer has
influence on music aptitude. Not only is the premise that all students attain the stabilized music
aptitude level at approximately age 9/Grade 4 in question, but a presumption that all students
attain the stabilized music aptitude level for tonal and rhythm dimensions simultaneously is also
at issue.
Thus, future research is recommended to replicate this study with a more diverse sample,
adapt the multiple implication method, conduct a parallel examination using percentile ranks,
examine the data using different statistical testing, and expand the study to include students at
younger and older grade levels. Suggested extensions to the current study include investigations
of type and quality of instruction on music aptitude, including one in which instruction is
continually adapted based on frequent music aptitude testing. Recommendations for practical
application by music educators include professional development focused on addressing
preparatory audiation needs of all students, as well as inclusion of informal guidance at all grade
levels. Mitigation of cultural bias in standardized testing must be considered as more minoritized
students become included in heterogeneous study samples. Examinations of the effect of music
preference testing and ensemble participation on music aptitude would supplement the current
knowledge base as well. Finally, an investigation of the stabilization of tonal and rhythm
aptitudes as independent constructs is advocated to further understand and extend the findings of
the current study.
References
Acock, A. C. (2005). Working with missing values. Journal of Marriage and Family, 67, 1012–
1028. https://doi.org/10.1111/j.1741-3737.2005.00191.x
Allen, B. (1981). Student dropout in orchestra programs in three school systems in the state of
Arkansas (Publication No. 8201181) [Doctoral dissertation, Northeast Louisiana
University]. ProQuest Dissertations and Theses Global.
Allison, P. (2009). Missing data. In R. E. Millsap & A. Maydeu-Olivares (Eds.), The Sage
handbook of quantitative methods in psychology (pp. 72–89). Sage Publications
Ltd. http://dx.doi.org/10.4135/9780857020994.n4
Allison, P. (2015, March 5). Imputation by predictive mean matching: Promise & peril.
Statistical Horizons. https://statisticalhorizons.com/predictive-mean-matching
Amchin, R. (1995). Creative musical response: The effects of student–teacher interaction on the
improvisation abilities of fourth- and fifth-grade students (Publication No. 9542792)
[Doctoral dissertation, University of Michigan]. ProQuest Dissertations and Theses
Global.
Arms Gilbert, L. (1997). The effects of computer-assisted keyboard instruction on meter
discrimination and rhythm discrimination of general music education students in the
elementary school (Publication No. 9806336) [Doctoral dissertation, Tennessee State
University]. ProQuest Dissertations and Theses Global.
Atterbury, B. W., & Silcox, L. (1993). A comparison of home musical environment and musical
aptitude in kindergarten students. Update: Application of Research in Music Education,
11(2), 18–22. https://doi.org/10.1177/875512339301100205
Au, W. (2008). Devising inequality: A Bernsteinian analysis of high‐stakes testing and social
reproduction in education. British Journal of Sociology of Education, 29(6), 639–651.
Auh, M. (1995). Prediction of musical creativity in composition among selected variables for
upper elementary students (Publication No. 9604632) [Doctoral dissertation, Case
Western Reserve University]. ProQuest Dissertations and Theses Global.
Azzara, C. (1992). The effect of audiation-based improvisation techniques on the music
achievement of elementary instrumental students (Publication No. 9223853) [Doctoral
dissertation, University of Rochester, Eastman School of Music]. ProQuest Dissertations
and Theses Global.
Baer, D. E. (1987). Motor skill proficiency: Its relationship to instrumental music performance
achievement and music aptitude (Publication No. 8720238) [Doctoral dissertation,
University of Michigan]. ProQuest Dissertations and Theses Global.
Bailey, J. (1975). The relationships between the Colwell music achievement tests I and II, the
SRA achievement series, intelligence quotient, and success in instrumental music in the
sixth grade of the public schools of Prince William county, Virginia (Publication No.
7606685) [Doctoral dissertation, University of Illinois at Urbana–Champaign]. ProQuest
Dissertations and Theses Global.
Bash, L. (1983). The effectiveness of three instructional methods on the acquisition of jazz
improvisation skills (Publication No. 8325043) [Doctoral dissertation, The State
University of New York at Buffalo]. ProQuest Dissertations and Theses Global.
Belczyk, M. E. (1992). Using music aptitude and timbre preference test results to predict
performance achievement among beginning band students (Publication No. 9227434)
[Doctoral dissertation, Temple University]. ProQuest Dissertations and Theses Global.
Bell, W. A. (1981). An investigation of the validity of the “primary measures of music
audiation” for use with learning disabled children (Publication No. 8124579) [Doctoral
dissertation, Temple University]. ProQuest Dissertations and Theses Global.
Bentley, A. (1966). Measures of musical abilities. Harrap Audio–Visual.
Bergonzi, L. S. (1991). The effects of finger placement markers and harmonic context on the
development of intonation performance skills and other aspects of the musical
achievement of sixth-grade beginning string students (Publication No. 9208492)
[Doctoral dissertation, University of Michigan]. ProQuest Dissertations and Theses
Global.
Bernhard, H. C. (2003). The effects of tonal training on the melodic ear playing and sight
reading achievement of beginning wind instrumentalists (Publication No. 3093857)
[Doctoral dissertation, University of North Carolina at Greensboro]. ProQuest
Dissertations and Theses Global.
Bixler, J. (1968). Musical aptitude in the educable mentally retarded child. Journal of Music
Therapy, 5(2), 41–43. https://doi.org/10.1093/jmt/5.2.41
Bluestine, E. M. (2007). A comparative study of four approaches to teaching tonal music reading
to a select group of students in third, fourth, and fifth grade (Publication No. 3268133)
[Doctoral dissertation, Temple University]. ProQuest Dissertations and Theses Global.
Bolton, B. M. (1995). An investigation of same and different as manifested in the developmental
music aptitudes of students in first, second, and third grades (Publication No. 9535717)
[Doctoral dissertation, Temple University]. ProQuest Dissertations and Theses Global.
Bond, V. (2017). Culturally responsive education in music education: A literature review.
Contributions to Music Education, 42, 153–180. https://www.jstor.org/stable/26367441
Boyle, J. D. (1982). A study of the comparative validity of three published, standardized
measures of music preference. Psychology of Music, 10(1), 11–16.
https://doi.org/10.1177/0305735682101002
Boyle, J. D. (1992). Evaluation of music ability. In R. Colwell (Ed.), Handbook of research on
music teaching and learning: A project of the music educators national conference (pp.
246–265). Schirmer Books.
Boyle, J. D., & Radocy, R. E. (1987). Measurement and evaluation of musical experiences.
Schirmer Books.
Bradley, D. (2007). The sounds of silence: Talking race in music education. Action, Criticism,
and Theory for Music Education, 6(4), 132–162.
Briscuso, J. J. (1972). A study of ability in spontaneous and prepared jazz improvisation among
students who possess different levels of musical aptitude (Publication No. 7226656)
[Doctoral dissertation, University of Iowa]. ProQuest Dissertations and Theses Global.
Brokaw, J. P. (1983). The extent to which parental supervision and other selected factors are
related to achievement of musical and technical–physical characteristics by beginning
instrumental music students (Publication No. 8304452) [Doctoral dissertation, University
of Michigan]. ProQuest Dissertations and Theses Global.
Brown, M. (1969). The optimum length of the musical aptitude profile subtests. Journal of
Research in Music Education, 17(2), 240–247. https://doi.org/10.2307/3344329
Bugos, J., Heller, J., & Batcheller, D. (2014). Musical nuance task shows reliable differences
between musicians and nonmusicians. Psychomusicology: Music, Mind, and Brain,
24(3), 207–213. https://doi.org/10.1037/pmu0000051
Carroll, J. B. (1978). How shall we study individual differences in cognitive abilities?–
Methodological and theoretical perspectives. Intelligence, 2, 87–115.
Carroll, J. G. (1983). The use of musical verbal stimuli in teaching low-functioning autistic
children (Publication No. 8404269) [Doctoral dissertation, University of Mississippi].
ProQuest Dissertations and Theses Global.
Carson, A. D. (1998). Why has musical aptitude assessment fallen flat? And what can we do
about it? Journal of Career Assessment, 6(3), 311–328.
Cary, S. (1981). Individualized music instruction–Traditional music instruction: Relationships of
music achievement, music performance, music attitude, music aptitude, and reading
classes of fifth grade students (Publication No. 8201812) [Doctoral dissertation,
University of Oregon]. ProQuest Dissertations and Theses Global.
Choi, E. (1996). The development and implementation of interactive multimedia instrumental
discrimination skills training courseware for beginning clarinet students (Publication No.
9635496) [Doctoral dissertation, University of Michigan]. ProQuest Dissertations and
Theses Global.
Chuang, W. J. (1997). An investigation of the use of musical aptitude profile with Taiwanese
students in grades four to twelve (Publication No. 9734114) [Doctoral dissertation,
Michigan State University]. ProQuest Dissertations and Theses Global.
Ciepluch, G. M. (1988). Sightreading achievement in instrumental music performance, learning
gifts, and academic achievement: A correlation study (Publication No. 8810008)
[Doctoral dissertation, University of Wisconsin–Madison]. ProQuest Dissertations and
Theses Global.
Ciorba, C. R. (2006). The creation of a model to predict jazz improvisation achievement
(Publication No. 3243107) [Doctoral dissertation, University of Miami]. ProQuest
Dissertations and Theses Global.
Clark, B. J. (2005). The equity and effectiveness of policies and procedures instrumental music
instructors deem essential to program development for beginning percussionists
(Publication No. 3182240) [Doctoral dissertation, University of Illinois at Urbana–
Champaign]. ProQuest Dissertations and Theses Global.
Conkling, S. W. (1994). A comparison of the effects of learning sequence activities and vocal
development exercises on the vocal music achievement of middle level students
(Publication No. 9503122) [Doctoral dissertation, University of Rochester, Eastman
School of Music]. ProQuest Dissertations and Theses Global.
Cook, R. M. (2020). Addressing missing data in quantitative counseling research. Counseling
Outcome Research and Evaluation, 1, 1–11.
Cooper, H. M. (1989). Integrating research: A guide for literature review. Sage.
Crawford, L. A. (2016). Composing in groups: Creative processes of third and fifth grade
students (Publication No. 10195571) [Doctoral dissertation, University of Southern
California, Los Angeles]. ProQuest Dissertations and Theses Global.
Creswell, J. W. (2012). Educational research: Planning, conducting, and evaluating quantitative
and qualitative research (4th ed.). Pearson Education, Inc.
Cribari, P. B. (2014). A comparison of aural and aural–visual modeling on the development of
executive and performance skills of beginning recorder students (Publication No.
3662713) [Doctoral dissertation, Boston University]. ProQuest Dissertations and Theses
Global.
Culp, M. E. (2017). The relationship between phonological awareness and music aptitude.
Journal of Research in Music Education, 65(3), 328–346.
Curtis, C. (1981). A comparative analysis of the musical aptitude of normal children and mildly
handicapped children mainstreamed into regular classrooms (Publication No. 8121544)
[Doctoral dissertation, Vanderbilt University]. ProQuest Dissertations and Theses Global.
Davis, L. M. (1981). The effects of structured singing activities and self-evaluation practice on
elementary band students’ instrumental music performance, melodic tonal imagery,
self-evaluation and attitude (Publication No. 8128981) [Doctoral dissertation, The Ohio State
University]. ProQuest Dissertations and Theses Global.
DeCuir, J. T., & Dixson, A. D. (2004). “So when it comes out, they aren’t that surprised that it is
there”: Using critical race theory as a tool of analysis of race and racism in education.
Educational Researcher, 33(5), 26–31. https://doi.org/10.3102/0013189X033005026
Degé, F., Patscheke, H., & Schwarzer, G. (2017). Associations between two measures of music
aptitude: Are the IMMA and the AMMA significantly correlated in a sample of 9- to
13-year-old children? Musicae Scientiae, 21(4), 465–478.
Dell, C. E. (2003). Singing and tonal pattern instruction effects on beginning string students’
intonation skills (Publication No. 3084778) [Doctoral dissertation, University of South
Carolina]. ProQuest Dissertations and Theses Global.
Della Pietra, C. J. (1997). The effects of a three-phase constructivist instructional model for
improvisation on high school students’ perception and reproduction of musical rhythm
(Publication No. 9736258) [Doctoral dissertation, University of Washington]. ProQuest
Dissertations and Theses Global.
Deutsch, D. (Ed.) (1982). The psychology of music. Academic Press.
DeYarman, R. (1972). An experimental analysis of the development of rhythmic and tonal
capabilities of kindergarten and first grade children. Experimental Research in the
Psychology of Music, Studies in the Psychology of Music, Volume 8. University of Iowa
Press.
DeYarman, R. M. (1975). An investigation of the stability of musical aptitude among
primary-age children. In E. E. Gordon (Ed.), Experimental Research in the Psychology of Music:
10, 1–23. University of Iowa Press.
Drennan, C. B. (1984). The relationship of musical aptitude, academic achievement and
intelligence in merit (gifted) students of Murfreesboro city schools (Tennessee)
(Publication No. 8529568) [Doctoral dissertation, Tennessee State University]. ProQuest
Dissertations and Theses Global.
Edmund, D. C. (2009). The effect of articulation study on stylistic expression in high school
musicians’ jazz performance (Publication No. 3385921) [Doctoral dissertation,
University of Florida]. ProQuest Dissertations and Theses Global.
Etzel, M. (1979). The effect of training upon children’s ability in grades one through six to
perform selected musical listening tasks (Publication No. 7915345) [Doctoral
dissertation, University of Illinois at Urbana–Champaign]. ProQuest Dissertations and
Theses Global.
Evans, J. D. (1996). Straightforward statistics for the behavioral sciences. Thomson
Brooks/Cole Publishing Co.
Field, A. (2009). Discovering statistics using SPSS (3rd ed.). SAGE.
Flohr, J. W. (1981). Short-term music instruction and young children's developmental music
aptitude. Journal of Research in Music Education, 29(3), 219–223.
Forsythe, R. (1984). The development and implementation of a computerized preschool measure
of musical audiation (Publication No. 8425572) [Doctoral dissertation, Case Western
Reserve University]. ProQuest Dissertations and Theses Global.
Fosha, R. L. (1964). A study of the concurrent validity of the musical aptitude profile
(Publication No. 6500455) [Doctoral dissertation, University of Iowa]. ProQuest
Dissertations and Theses Global.
Frierson-Campbell, C. (2001). The effects of audiation-based enrichment activities on
second-year wind and percussion instrumental music achievement (Publication No. 9965251)
[Doctoral dissertation, University of Rochester, Eastman School of Music]. ProQuest
Dissertations and Theses Global.
Froseth, J. (1968). An investigation of the use of musical aptitude profile scores in the instruction
of beginning students in instrumental music (Publication No. 6816800) [Doctoral
dissertation, University of Iowa]. ProQuest Dissertations and Theses Global.
Froseth, J. (1971). Using MAP scores in the instruction of beginning students in instrumental
music. Journal of Research in Music Education, 19, 98–105.
Fullen, D. L. (1993). An investigation of the validity of the advanced measures of music
audiation with junior high and senior high school students (Publication No. 9316479)
[Doctoral dissertation, Temple University]. ProQuest Dissertations and Theses Global.
Gamble, D. K. (1989). A study of the effects of two types of tonal pattern instruction on the
audiational and performance skills of first-year clarinet students (Publication No.
8912430) [Doctoral dissertation, Temple University]. ProQuest Dissertations and Theses
Global.
Gardner, H. (1983). Frames of mind: The theory of multiple intelligences. Basic Books.
Geake, J. G. (1996). Why Mozart? Information processing abilities of gifted young musicians.
Research Studies in Music Education, 7(1), 28–45.
Geake, J. G. (1999). An information processing account of audiational abilities. Research
Studies in Music Education, 12, 10–23.
Geissel, L. S., Jr. (1985). An investigation of the comparative effectiveness of the musical
aptitude profile, the intermediate measures of music audiation, and the primary measures
of music audiation with fourth grade students (Publication No. 8521082) [Doctoral
dissertation, Temple University]. ProQuest Dissertations and Theses Global.
Gerhardstein, R. C. (2001). Edwin E. Gordon: A biographical and historical account of an
American music educator and researcher (Publication No. 3014435) [Doctoral
dissertation, Temple University]. ProQuest Dissertations and Theses Global.
Ghisletta, P., & Aichele, S. (2017). Quantitative methods in psychological aging research: A
mini-review. Gerontology, 63(6), 529–537. https://doi.org/10.1159/000477582
Gillborn, D. (2006). Critical race theory and education: Racism and anti-racism in educational
theory and praxis. Discourse: Studies in the Cultural Politics of Education, 27(1), 11–32.
Gordon, E. E. (1965). The musical aptitude profile: A new and unique musical aptitude test
battery. Bulletin of the Council for Research in Music Education, 6, 12–16.
Gordon, E. E. (1967a). A comparison of the performance of culturally disadvantaged students
with that of culturally heterogeneous students on the musical aptitude profile. Psychology
in the Schools, 4(3), 260–268.
Gordon, E. E. (1967b). A three-year longitudinal predictive validity study of the musical aptitude
profile. Experimental research in the psychology of music, studies in the psychology of
music, Volume 5. University of Iowa Press.
Gordon, E. E. (1968). The contribution of each musical aptitude profile subtest to the overall
validity of the battery. Bulletin of the Council for Research in Music Education, 12, 32–
36. https://www.jstor.org/stable/40316956
Gordon, E. E. (1969). Intercorrelations among musical aptitude profile and Seashore measures of
musical talents subtests. Journal of Research in Music Education, 17(3), 263–271.
Gordon, E. E. (1970). Taking into account musical aptitude differences among beginning
instrumental students. American Educational Research Journal, 7(1), 41–53.
Gordon, E. E. (1971). The psychology of music teaching. Prentice-Hall, Inc.
Gordon, E. E. (1976). Tonal and rhythm patterns: An objective analysis. SUNY Press.
Gordon, E. E. (1979a). Developmental music aptitude as measured by the primary measures of
music audiation. Psychology of Music, 7(1), 42–49.
Gordon, E. E. (1979b). Primary measures of music audiation. GIA Publications.
Gordon, E. E. (1980a). The assessment of music aptitudes of very young children. Gifted Child
Quarterly, 24(3), 107–111. https://doi.org/10.1177/001698628002400303
Gordon, E. E. (1980b). Developmental music aptitudes among inner-city primary children.
Bulletin of the Council for Research in Music Education, 63, 25–30.
Gordon, E. E. (1981). The manifestation of developmental music aptitude in the audiation of
“same” and “different” as sound in music. GIA Publications.
Gordon, E. E. (1982). Intermediate measures of music audiation: A music aptitude test for first,
second, third, and fourth grade children. GIA Publications.
Gordon, E. E. (1984a). A longitudinal predictive validity study of the intermediate measures of
music audiation. Bulletin of the Council for Research in Music Education, 78, 1–23.
Gordon, E. E. (1984b). Manual for the instrument timbre preference test. GIA Publications.
Gordon, E. E. (1986a). A factor analysis of the musical aptitude profile, the primary measures of
music audiation, and the intermediate measures of music audiation. Bulletin of the
Council for Research in Music Education, 87, 17–25.
Gordon, E. E. (1986b). Final results of a two-year longitudinal predictive validity study of the
instrument timbre preference test and the musical aptitude profile. Bulletin of the Council
for Research in Music Education, 89, 8–17. https://www.jstor.org/stable/40318138
Gordon, E. E. (1986c). Manual: Primary measures of music audiation and intermediate
measures of music audiation. GIA Publications.
Gordon, E. E. (1987). The nature, description, measurement, and evaluation of music aptitudes.
GIA Publications.
Gordon, E. E. (1989a). Audie: A game for understanding and analyzing your child’s music
potential. GIA Publications.
Gordon, E. E. (1989b). Manual for the advanced measures of music audiation. GIA Publications.
Gordon, E. E. (1989c). Predictive validity studies of IMMA and ITPT. GIA Publications.
Gordon, E. E. (1989d). A two-year longitudinal predictive validity study of the instrument timbre
preference test and the intermediate measures of music audiation. GIA Publications.
Gordon, E. E. (1990a). Jump right in: Rhythm register book 1. GIA Publications.
Gordon, E. E. (1990b). Jump right in: Tonal register book 1. GIA Publications.
Gordon, E. E. (1990c). A one-year longitudinal predictive validity study of the advanced
measures of music audiation. GIA Publications.
Gordon, E. E. (1991). Taking another look at scoring the advanced measures of music audiation:
The German study. In The advanced measures of music audiation and the instrument
timbre preference test: Three research studies (pp. 1–21). GIA Publications.
Gordon, E. E. (1993). Learning sequences in music: Skill, content and patterns. GIA Publications.
Gordon, E. E. (1995). Manual: Musical aptitude profile. GIA Publications.
Gordon, E. E. (1998). Introduction to research and the psychology of music. GIA Publications.
Gordon, E. E. (1999). All about audiation and music aptitudes: Edwin E. Gordon discusses
using audiation and music aptitudes as teaching tools to allow students to reach their full
music potential. Music Educators Journal, 86(2), 41–44. https://doi.org/10.2307/3399589
Gordon, E. E. (2001a). Music aptitude and related tests: An introduction. GIA Publications.
Gordon, E. E. (2001b). Preparatory audiation, audiation, and music learning theory: A
handbook of a comprehensive music learning sequence. GIA Publications.
Gordon, E. E. (2001c). A three-year study of the musical aptitude profile. GIA Publications.
(First printing The University of Iowa Press, Iowa City, 1967).
Gordon, E. E. (2002). Developmental and stabilized music aptitudes: Further evidence of the
duality. GIA Publications.
Gordon, E. E. (2004). Continuing studies in music aptitudes. GIA Publications.
Gordon, E. E. (2005). Vectors in my research: Reflections on the development of music learning
theory. In M. Runfola & C. C. Taggart (Eds.), The development and practical application
of music learning theory (pp. 3–50). GIA Publications.
Gordon, E. E. (2006). Nature, source, evaluation, and measurement of music aptitudes. Polskie
Forum Psychologiczne, 11(2), 227–237. http://repozytorium.ukw.edu.pl/handle/item/903
Gordon, E. E. (2010). The crucial role of music aptitudes in music instruction. In T. S. Brophy
(Ed.), The Practice of Assessment in Music Education: Frameworks, Models, and
Designs: Proceedings of the 2009 Florida Symposium on Assessment in Music Education
(pp. 211–215). GIA Publications.
Gordon, E. E. (2011). Untying Gordian knots. GIA Publications.
Gordon, E. E. (2012). Learning sequences in music: Skill, content, and patterns (2012 ed.). GIA
Publications.
Gordon, E. E. (2013). Music learning theory for newborn and young children. GIA Publications.
Gordon, E. E. (2015). Space audiation. GIA Publications.
Gouzouasis, P. (1990). An investigation of the comparative effects of two tonal pattern systems
and two rhythm pattern systems for learning to play guitar (Publication No. 9100281)
[Doctoral dissertation, Temple University]. ProQuest Dissertations and Theses Global.
Graham, J. W., Olchowski, A. E., & Gilreath, T. D. (2007). How many imputations are really
needed? Some practical clarifications of multiple imputation theory. Prevention Science,
8, 206–213. https://doi.org/10.1007/s11121-007-0070-9
Grashel, J. (2008). The measurement of musical aptitude in 20th century United States: A brief
history. Bulletin of the Council for Research in Music Education, 176, 45–49.
Green, B. R. (2003). The comparative effects of computer-mediated interactive instruction and
traditional instruction on music achievement in guitar performance (Publication No.
NQ86051) [Doctoral dissertation, The University of British Columbia (Canada),
Vancouver]. ProQuest Dissertations and Theses Global.
Groeling, C. R. (1975). A comparison of two methods of teaching instrumental music to fourth-grade
beginners (Publication No. 7529644) [Doctoral dissertation, Northwestern
University]. ProQuest Dissertations and Theses Global.
Gromko, J. E., & Russell, C. (2002). Relationships among young children’s aural perception,
listening condition, and accurate reading of graphic listening maps. Journal of Research
in Music Education, 50(4), 333–342. https://doi.org/10.2307/3345359
Gromko, J. E., & Walters, K. (1999). The development of musical pattern perception in school-aged
children. Research Studies in Music Education, 12, 24–29.
Grutzmacher, P. A. (1985). The effect of tonal pattern training on the aural perception, reading
recognition and melodic sight reading achievement of first year instrumental music
students (Publication No. 8514172) [Doctoral dissertation, Kent State University].
ProQuest Dissertations and Theses Global.
Guderian, L. V. (2008). Effects of applied music composition and improvisation assignments on
sight-reading ability, learning in music theory and quality in soprano recorder playing
(Publication No. 3331120) [Doctoral dissertation, Northwestern University]. ProQuest
Dissertations and Theses Global.
Guerrini, S. C. (2002). The acquisition and assessment of the developing singing voice among
elementary students (Publication No. 3040318) [Doctoral dissertation, Temple
University]. ProQuest Dissertations and Theses Global.
Guerrini, S. C. (2004). The relationship of vocal accuracy, gender, and music aptitude among
elementary students. Visions of Research in Music Education, 4, 1–14. http://wwwusr.rider.edu/~vrme/v4n1/visions/Guerrini%20The%20Relationship%20of%20Vocal%20
Guilbault, D. M. (2004). The effect of harmonic accompaniment on the tonal achievement and
tonal improvisations of children in kindergarten and first grade. Journal of Research in
Music Education, 52(1), 64–76. https://doi.org/10.2307/3345525
Hansen, D. A. (1991). The effect of prerequisite skill mastery on higher order skill attainment
and motivation in music learning (Publication No. 9207597) [Doctoral dissertation,
University of Missouri–Kansas City]. ProQuest Dissertations and Theses Global.
Haroutounian, J. (2002). Kindling the spark: Recognizing and developing musical talent. Oxford
University Press.
Harrington, C. J. (1969). An investigation of the primary level musical aptitude profile for use
with second and third grade students. Journal of Research in Music Education, 17(4),
359–368. https://doi.org/10.2307/3344164
Haston, W. A. (2004). Comparison of a visual and an aural approach to beginning wind
instrument instruction (Publication No. 3132535) [Doctoral dissertation, Northwestern
University]. ProQuest Dissertations and Theses Global.
Hasty, J. G. J. (1992). The influence of selected music teaching strategies upon aesthetic
responses to phrasing, balance, and style among middle school band students
(Publication No. 9316347) [Doctoral dissertation, University of Georgia]. ProQuest
Dissertations and Theses Global.
Hatcher, L. (2013). Advanced statistics in research: Reading, understanding, and writing up
data analysis results. Shadow Finch Media.
Heathers, G. (1977). A working definition of individualized instruction. Educational Leadership,
34(5), 342–345.
Henry, W. H. (1995). The effects of pattern instruction, repeated composing opportunities, and
musical aptitudes on the compositional processes and products of fourth-grade students
(Publication No. 9537223) [Doctoral dissertation, Michigan State University]. ProQuest
Dissertations and Theses Global.
Henry, W. H. (2002). The effects of pattern instruction, repeated composing opportunities, and
musical aptitude on the compositional process and products of fourth-grade students.
Contributions to Music Education, 29(1), 9–28. https://www.jstor.org/stable/24126972
Hess, J. (2015). Decolonizing music education: Moving beyond tokenism. International Journal
of Music Education, 33(3), 336–347. https://doi.org/10.1177/0255761415581283
Heymans, M. W., & Eekhout, I. (2019). Pooling means and standard deviations in SPSS. Applied
missing data analysis with SPSS and (R)Studio (First draft). Amsterdam.
Hobbs, C. (1985). A comparison of the music aptitude, scholastic aptitude, and academic
achievement of young children. Psychology of Music, 13(2), 93–98.
Holahan, J. M., & Thomson, S. W. (1981). An investigation of the suitability of the primary
measures of music audiation for use in England. Psychology of Music, 9(2), 63–68.
Hood, S. (1998). Culturally responsive performance-based assessment: Conceptual and
psychometric considerations. The Journal of Negro Education, 67(3), 187–196.
Hornbach, C. M., & Taggart, C. C. (2005). The relationship between developmental tonal
aptitude and singing achievement among kindergarten, first-, second-, and third-grade
students. Journal of Research in Music Education, 53(4), 322–331.
Horton, N. J., & Lipsitz, S. R. (2001). Multiple imputation in practice: Comparison of software
packages for regression models with missing variables. The American Statistician, 55(3),
244–254. https://www.jstor.org/stable/2685809
Huck, S. W. (2012). Reading statistics and research (6th ed.). Allyn & Bacon.
Hufstader, R. A. (1974). Predicting success in beginning instrumental music through use of
selected tests. Journal of Research in Music Education, 22(1), 52–57.
IBM Corp. (2019a). IBM SPSS missing values 26. Retrieved December 24, 2020, from
IBM Corp. (2019b). IBM SPSS statistics for Macintosh, Version 26.0. IBM Corp.
Jaffurs, S. E. (2000). The relationship between singing achievement and tonal music aptitude
(Publication No. 1399634) [Master’s thesis, Michigan State University]. ProQuest
Dissertations and Theses Global.
Jarvis, W. C. (1981). The effectiveness of verbalization upon the recognition and performance of
instrumental music notation (Publication No. 8120827) [Doctoral dissertation, Rutgers,
The State University of New Jersey]. ProQuest Dissertations and Theses Global.
Jin, H. L., & Huber, J. Jr. (2011). Multiple imputation with large proportions of missing data:
How much is too much? United Kingdom Stata Users’ Group Meetings 2011 (No. 23).
Stata Users Group. http://repec.org/usug2011/UK11_Lee.pptx
Johnson, D. A. (2000). The development of music aptitude and effects on scholastic achievement
of 8 to 12 year olds (Publication No. 9983062) [Doctoral dissertation, University of
Louisville]. ProQuest Dissertations and Theses Global.
Josuweit, D. (1991). The effects of an audiation-based instrumental music curriculum upon
beginning band students’ achievement in music (Publication No. 9207869) [Doctoral
dissertation, Temple University]. ProQuest Dissertations and Theses Global.
Karas, J. B. (2005). The effect of aural and improvisatory instruction on fifth-grade band
students’ sight reading ability (Publication No. 3199697) [Doctoral dissertation,
University of Nebraska, Lincoln]. ProQuest Dissertations and Theses Global.
Karma, K. (1982). Validating tests of musical aptitude. Psychology of Music, 10(1), 33–36.
Karma, K. (1984). Musical aptitude as the ability to structure acoustic material. International
Journal of Music Education, 3(1), 19–30. https://doi.org/10.1177/025576148400300104
Karma, K. (1994). Auditory and visual temporal structuring: How important is sound to musical
thinking? Psychology of Music, 22, 20–30. https://doi.org/10.1177/0305735694221002
Karma, K. (2007). Musical aptitude definition and measure validation: Ecological validity can
endanger the construct of musical aptitude tests. Psychomusicology: A Journal of
Research in Music Cognition, 19(2), 79–90.
Kendall, M. J. (1986). The effects of visual interventions on the development of aural and
instrumental performance skills in beginning fifth-grade instrumental students: A
comparison of two instruction approaches (reading, kinesthetic, musical technique)
(Publication No. 8612553) [Doctoral dissertation, University of Michigan]. ProQuest
Dissertations and Theses Global.
Kim, K. H., & Zabelina, D. (2015). Cultural bias in assessment: Can creativity assessment help?
International Journal of Critical Pedagogy, 6(2), 129–147.
Kimble, E. P. (1983). The effect of various factors on the ability of children to sing an added part
(Publication No. 8326407) [Doctoral dissertation, University of Georgia]. ProQuest
Dissertations and Theses Global.
Kleinke, K. (2018). Multiple imputation by predictive mean matching when sample size is small.
Methodology, 14(1), 3–15. https://doi.org/10.1027/1614-2241/a000141
Klinedinst, R. E. (1989). The ability of selected factors to predict performance achievement and
retention of fifth-grade instrumental music students (Publication No. 9006131) [Doctoral
dissertation, Kent State University]. ProQuest Dissertations and Theses Global.
Klinedinst, R. E. (1991). Predicting performance achievement and retention of fifth-grade
instrumental students. Journal of Research in Music Education, 39(3), 225–238.
Kluth, B. L. (1986). A procedure to teach rhythm reading: Development, implementation, and
effectiveness in urban junior high school music classes (Publication No. 8617078)
[Doctoral dissertation, Kent State University]. ProQuest Dissertations and Theses Global.
Knoester, M., & Au, W. (2017). Standardized testing and school segregation: Like tinder for
fire? Race Ethnicity and Education, 20(1), 1–14.
Koelsch, N., Estrin, E., & Farr, B. (1995). Guide to developing equitable performance
assessments. Office of Educational Research and Improvement.
Kohn, A. (2000). Standardized testing and its victims. Education Week, 20(4), 60–64.
Kołodziejski, M. (2019). Relationship between stabilised musical aptitude and harmonic and
rhythm improvisation readiness in adults in transversal research. Uniwersytet
Humanistyczno-przyrodniczy Im. Jana Długosza W Częstochowie (Poland), XIV, 177–
197. http://dx.doi.org/10.16926/em.2019.14.08
Kopiez, R., & Lee, J. I. (2006). Towards a dynamic model of skills involved in sight reading
music. Music Education Research, 8(1), 97–120.
Kopiez, R., & Lee, J. (2008). Towards a general model of skills involved in sight reading music.
Music Education Research, 10(1), 41–62. https://doi.org/10.1080/14613800701871363
Kratus, J. (1994). Relationships among children’s music audiation and their compositional
processes and products. Journal of Research in Music Education, 42(2), 115–130.
Kuhlman, K. (2005). Musical aptitude versus academic ability as a predictor of beginning
instrumental music achievement and retention: Research and implications. Update:
Applications of Research in Music Education, 24(1), 34–43.
Landerman, L. R., Land, K. C., & Pieper, C. F. (1997). An empirical evaluation of the predictive
mean matching method for imputing missing values. Sociological Methods &
Research, 26(1), 3–33. https://doi.org/10.1177/0049124197026001001
Law, L. N. C., & Zentner, M. (2012). Assessing musical abilities objectively: Construction and
validation of the profile of music perception skills. PLoS ONE, 7(12), 1–15.
Lee, E. (2007). A study of the effect of computer assisted instruction, previous music experience,
and time on the performance ability of beginning instrumental music students
(Publication No. 3284028) [Doctoral dissertation, The University of Nebraska–Lincoln].
ProQuest Dissertations and Theses Global.
Leech, N. L., Barrett, K. C., & Morgan, G. A. (2015). IBM SPSS for intermediate statistics: Use
and interpretation (5th ed.). Routledge.
Levinowitz, L. M., & Scheetz, J. (1998). The effects of group and individual echoing of rhythm
patterns on third-grade students’ rhythmic skills. Update: Applications of Research in
Music Education, 16(2), 8–11. https://doi.org/10.1177/875512339801600203
Li, P., Stuart, E. A., & Allison, D. B. (2015). Multiple imputation: A flexible tool for handling
missing data. Journal of the American Medical Association, 314(18), 1966–1967.
Linklater, R. F. (1994). A comprehensive investigation of the effects of audio and video tape
models on the musical development of beginning clarinet students (Publication No.
9500987) [Doctoral dissertation, University of Michigan]. ProQuest Dissertations and
Theses Global.
Liperote, K. A. (2004). A study of audiation-based instruction, music aptitude, and music
achievement of elementary wind and percussion students (Publication No. 3123215)
[Doctoral dissertation, University of Rochester, Eastman School of Music]. ProQuest
Dissertations and Theses Global.
Little, R. J. (1988a). Missing-data adjustments in large surveys. Journal of Business & Economic
Statistics, 6(3), 287–296. https://doi.org/10.1080/07350015.1988.10509663
Little, R. J. (1988b). A test of missing completely at random for multivariate data with missing
values. Journal of the American Statistical Association, 83(404), 1198–1202.
Lundin, R. (1967). An objective psychology of music (2nd ed.). Ronald Press Co.
Madley-Dowd, P., Hughes, R., Tilling, K., & Heron, J. (2019). The proportion of missing data
should not be used to guide decisions on multiple imputation. Journal of Clinical
Epidemiology, 110, 63–73. https://doi.org/10.1016/j.jclinepi.2019.02.016
Mang, E. (2013, December). Musicality profile of Hong Kong children. In 2013 International
Conference on the Modern Development of Humanities and Social Science (pp. 331–333).
Atlantis Press. https://doi.org/10.2991/mdhss-13.2013.87
Mawbey, W. E. (1973). Wastage from instrumental classes in schools. Psychology of Music, 1,
33–43. https://doi.org/10.1177/030573567311007
McCarthy, J. (1974). The effect of individualized instruction on the performance achievement of
beginning instrumentalists. Bulletin of the Council for Research in Music Education, 38,
1–16. https://www.jstor.org/stable/40317313
McDonald, K. J. (2010). The effect of vocal jazz aural skill instruction on student sight singing
achievement (Publication No. 3438011) [Doctoral dissertation, University of Hartford].
ProQuest Dissertations and Theses Global.
McDowell, R. (1974). The development and implementation of a rhythmic ability test designed
for 4-year-old preschool children (Publication No. 7422023) [Doctoral dissertation,
University of North Carolina at Greensboro]. ProQuest Dissertations and Theses Global.
McKnight, P. E., McKnight, K. M., Sidani, S., & Figueredo, A. J. (2007). Missing data: A gentle
introduction. The Guilford Press.
McPherson, G. E. (1995). ‘Honing the craft’: Improving the way we teach the musically gifted
and talented. In Honing the craft: Improving the quality of music education; Conference
proceedings of the Australian society for music education, 10th national conference (p.
169). Artemis Publishing.
Menard, E. (2009). An investigation of creative potential in high school musicians: Recognizing,
promoting, and assessing creative ability through music composition (Publication No.
3451495) [Doctoral dissertation, Louisiana State University and Agricultural &
Mechanical College]. ProQuest Dissertations and Theses Global.
Miceli, J. S. (1998). An investigation of an audiation-based high school general music
curriculum and its relationship to music aptitude, music achievement, and student
perception of learning (Publication No. 9825698) [Doctoral dissertation, University of
Rochester, Eastman School of Music]. ProQuest Dissertations and Theses Global.
Milford, G. F. (2002). Effect of three different pulse stimulus modes on the rhythm reading
achievement of beginning instrumentalists (Publication No. 3057395) [Doctoral
dissertation, Kent State University]. ProQuest Dissertations and Theses Global.
Mitchum, J. P. (1969). The Wing ‘standardized tests of musical intelligence’: An investigation
of predictability with selected seventh-grade beginning-band students (Publication No.
7008565) [Doctoral dissertation, Florida State University]. ProQuest Dissertations and
Theses Global.
Moll, L. C. (1992). Bilingual classroom studies and community analysis: Some recent trends.
Educational Researcher, 21(2), 20–24. https://doi.org/10.3102/0013189X021002020
Moore, J. L. (1987). An experiment with rhythm and movement upon developmental music
aptitude. Update: Applications of Research in Music Education, 6(1), 7–10.
Moore, J. L. (1990). Toward a theory of developmental music aptitude. Research Perspectives
in Music Education: A Bulletin of the Florida Music Educators Association, 1(1), 19–
23. https://bit.ly/2LKKdHq
Morgan, M. (1995). Effects of Gordon’s model for music education on the rhythmic aptitude of
second-grade students (Publication No. 9616875) [Doctoral dissertation, The State
University of New York at Albany]. ProQuest Dissertations and Theses Global.
Mota, G. (1997). Detecting young children’s musical aptitude: A comparison between
standardized measures of music aptitude and ecologically valid musical performances.
Bulletin of the Council for Research in Music Education, 133, 89–94.
Moustakas, C. (1994). Phenomenological research methods. Sage Publications, Inc.
Müllensiefen, D., Gingras, B., Musil, J., & Stewart, L. (2014). The musicality of non-musicians:
An index for assessing musical sophistication in the general population. PLoS ONE, 9(2),
e89642, 1–23. https://doi.org/10.1371/journal.pone.0089642
Multiculturalism (2016, August 12). Stanford Encyclopedia of Philosophy. Retrieved May 18,
2020, from https://plato.stanford.edu/entries/multiculturalism/
National Association for Music Education (n.d.). 2014 music standards (PK–8 general music).
Retrieved February 18, 2021, from https://nafme.org/wp-content/uploads/2014/11/2014-Music-Standards-PK-8-Strand.pdf
National Association for Music Education (2021). Early childhood music education. Retrieved
February 18, 2021, from https://nafme.org/about/position-statements/early-childhood-music-education/
National Center for Education Statistics (n.d.). School directory
Norton, D. (1980). Interrelationships among music aptitude, IQ, and auditory conservation.
Journal of Research in Music Education, 28(4), 207–217.
O’Leary, J. E. (2010). The effects of motor movement on elementary band students’ music and
movement achievement (Publication No. 3405999) [Doctoral dissertation, Boston
University]. ProQuest Dissertations and Theses Global.
Ortner, J. M. (1990). The effectiveness of a computer-assisted instruction program in rhythm for
secondary school instrumental music students (Publication No. 9115133) [Doctoral
dissertation, The State University of New York at Buffalo]. ProQuest Dissertations and
Theses Global.
Palmer, M. H. (1974). The relative effectiveness of the Richards and the Gordon approaches to
rhythm reading for fourth grade children (Publication No. 7511879) [Doctoral
dissertation, University of Illinois at Urbana–Champaign]. ProQuest Dissertations and
Theses Global.
Pan, Q., & Wei, R. (2016). Fraction of missing information (γ) at different missing data fractions
in the 2012 NAMCS physician workflow mail survey. Applied Mathematics, 7(10),
1057–1067. https://doi.org/10.4236/am.2016.710093
Parks, J. K. E. (2005). The effect of a program of portable electronic piano keyboard experience
on the acquisition of sight-singing skill in the novice high school chorus (Publication No.
3201961) [Doctoral dissertation, University of Maryland, College Park]. ProQuest
Dissertations and Theses Global.
Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, and analysis: An integrated
approach. Lawrence Erlbaum Associates.
Pereira, A. I., Rodrigues, H., & Rutkowski, J. (2017). The relationship between children’s use of
singing voice, singing accuracy, and self-perception on singing with text and neutral
syllable. In Context Matters, The 6th International Symposium on Assessment in Music
Education [Symposium], Birmingham, United Kingdom.
Peterson, J. J. (1983). The Iowa testing programs: The first fifty years. University of Iowa Press.
Phillips, K. H., & Aitchison, R. E. (1997). The relationship of singing accuracy to pitch
discrimination and tonal aptitude among third-grade students. Contributions to Music
Education, 24(1), 7–22. https://www.jstor.org/stable/24126943
Phillips, K. H., Aitchison, R. E., & Nompula, Y. P. (2002). The relationship of music aptitude to
singing achievement among fifth-grade students. Contributions to Music Education,
29(1), 47–58. https://www.jstor.org/stable/24126974
Pollock, M. (2004). Colormute: Race talk dilemmas in an American school. Princeton University
Press. https://www.jstor.org/stable/j.ctt7rjh1
Pruitt, J. S. (1966). A study of withdrawals from the beginning instrumental music programs of
selected schools in the school district of Greenville county, South Carolina (Publication No.
6609485) [Doctoral dissertation, New York University]. ProQuest Dissertations and Theses
Global.
Pursell, A. F. (2005). The effectiveness of iconic-based rhythmic instruction on middle school
instrumentalists’ ability to read rhythms at sight (Publication No. 3194875) [Doctoral
dissertation, Ball State University]. ProQuest Dissertations and Theses Global.
Radocy, R., & Boyle, J. D. (1979). Psychological foundations of musical behavior. Charles C
Reese, J. A., & Shouldice, H. N. (2019). Assessment in the music learning theory-based
classroom. In T. Brophy (Ed.), The Oxford Handbook of Assessment Policy and Practice
in Music Education, Volume 2 (pp. 477–501). Oxford University Press.
Reifinger, J. L. (2018). The relationship of pitch sight-singing skills with tonal discrimination,
language reading skills, and academic ability in children. Journal of Research in Music
Education, 66(1), 71–91. https://doi.org/10.1177/0022429418756029
Reynolds, A. M., & Hyun, K. (1994). Understanding music aptitude: Teachers’ interpretations.
Research Studies in Music Education, 23(1), 18–31.
Rowlyk, W. T. (2008). Effects of improvisation instruction on nonimprovisation music
achievement of seventh and eighth grade instrumental music students (Publication No.
3300374) [Doctoral dissertation, Temple University]. ProQuest Dissertations and Theses
Global.
Runfola, M. (2016). Development of MAP and ITML: Is music learning theory an unexpected
outcome? In T. S. Brophy, J. Marlatt, & G. K. Ritcher (Eds.), Connecting Practice,
Measurement, and Evaluation: Selected Papers from the Fifth International Symposium
on Assessment in Music Education (pp. 357–374). GIA Publications.
Runfola, M., & Etopio, E. (2010). The nature of performance-based criterion measures in early
childhood music education research, and related issues. In T. S. Brophy (Ed.), The
Practice of Assessment in Music Education: Frameworks, Models, and Designs.
Proceedings of the 2009 Florida Symposium on Assessment in Music Education (pp. 395–411).
GIA Publications.
Russell, J. A. (2018). Statistics in music education research. Oxford University Press.
Ruthsatz, J. M. (2000). Predicting expert performance within the musical domain: A test of
summation theory (Publication No. 9981862) [Doctoral dissertation, Case Western
Reserve University]. ProQuest Dissertations and Theses Global.
Rutkowski, J. (1986). The effect of restricted song range on kindergarten children’s use of
singing voice and developmental music aptitude (Publication No. 8619357) [Doctoral
dissertation, The State University of New York at Buffalo]. ProQuest Dissertations and
Theses Global.
Rutkowski, J. (1996). The effectiveness of individual/small-group singing activities on
kindergartners’ use of singing voice and developmental music aptitude. Journal of
Research in Music Education, 44(4), 353–368. https://doi.org/10.2307/3345447
Rutkowski, J. (2015). The relationship between children’s use of singing voice and singing
accuracy. Music Perception: An Interdisciplinary Journal, 32(3), 283–292.
Rutkowski, J., & Miller, M. S. (2003a). The effectiveness of fequency [sic] of instruction and
individual/small-group singing activities on first graders’ use of singing voice and
developmental music aptitude. Contributions to Music Education, 30(1), 23–38.
Rutkowski, J., & Miller, M. S. (2003b). The effect of teacher feedback and modeling on first
graders’ use of singing voice and developmental music aptitude. Bulletin of the Council
for Research in Music Education, 156, 1–10. https://www.jstor.org/stable/40319169
Salvador, K. (2011). Individualizing elementary general music instruction: Case studies of
assessment and differentiation (Publication No. 3482549) [Doctoral dissertation,
Michigan State University]. ProQuest Dissertations and Theses Global.
Saunders, T. C., & Holahan, J. M. (1993). Computerized response procedure to assess young
student reaction times of judgments of sameness and difference among paired tonal
patterns. Bulletin of the Council for Research in Music Education, 115, 31–48.
Schafer, J. L. (1999). Multiple imputation: A primer. Statistical Methods in Medical Research, 8,
3–15. https://doi.org/10.1177/096228029900800102
Schenker, N., & Taylor, J. M. G. (1996). Partially parametric techniques for multiple imputation.
Computational Statistics & Data Analysis, 22, 425–446. https://doi.org/10.1016/0167-9473(95)00057-7
Schleuter, S. L. (1978). Effects of certain lateral dominance traits, music aptitude, and sex
differences with instrumental music achievement. Journal of Research in Music
Education, 26(1), 22–31. https://doi.org/10.2307/3344786
Schleuter, S. L. (1984). A sound approach to teaching instrumentalists: An application of
content and learning sequences. Kent State University Press.
Schleuter, S. L., & DeYarman, R. (1977). Musical aptitude stability among primary school
children. Bulletin of the Council for Research in Music Education, 51, 14–22.
Schoenoff, A. (1973). An investigation of the comparability of American and German norms for
the musical aptitude profile (Publication No. 7313591) [Doctoral dissertation, University
of Iowa]. ProQuest Dissertations and Theses Global.
Schoonover, R. J. (1974). A study of the construct validity of selected musical aptitude tests
using the multitrait-multimethod matrix procedure (Publication No. 7428738) [Doctoral
dissertation, Northwestern University]. ProQuest Dissertations and Theses Global.
Seashore, C. E. (1919). The psychology of musical talent. Silver Burdett.
Sell, V. H. (1976). The musical aptitude of Finnish students: An investigative study in
comparative music education (Publication No. 628174) [Doctoral dissertation, University
of Wisconsin–Madison]. ProQuest Dissertations and Theses Global.
Sergeant, D., & Thatcher, G. (1974). Intelligence, social status and musical abilities. Psychology
of Music, 2(2), 32–57. https://doi.org/10.1177/030573567422005
Shuter, R. (1968). The psychology of musical ability. Methuen & Co.
Shuter-Dyson, R. (1999). Musical ability. In The Psychology of Music (2nd ed., pp. 627–651).
Academic Press. https://doi.org/10.1016/B978-012213564-4/50017-2
Simmons, J. C. H. (1981). An investigation of relationships among primary-level student
performance on selected measures of music aptitude, scholastic aptitude, and academic
achievement (Publication No. 8131595) [Doctoral dissertation, Peabody College for
Teachers of Vanderbilt University]. ProQuest Dissertations and Theses Global.
Smith, J. P. (2004). Music compositions of upper elementary students created under various
conditions of structure (Publication No. 3132610) [Doctoral dissertation, Northwestern
University]. ProQuest Dissertations and Theses Global.
Smith, N. (2006). The effect of learning and playing songs by ear on the performance of middle
school band students (Publication No. 3255935) [Doctoral dissertation, University of
Hartford]. ProQuest Dissertations and Theses Global.
Soley-Bori, M. (2013). Dealing with missing data: Key assumptions and methods for applied
analysis. Technical Report No. 4. http://www.bu.edu/sph/files/2014/05/Marina-techreport.pdf
Stamou, L., Schmidt, C. P., & Humphreys, J. T. (2010). Standardization of the Gordon primary
measures of music audiation in Greece. Journal of Research in Music Education, 58(1),
75–89. https://doi.org/10.1177/0022429409360574
Stangroom, J. (2021). Effect size calculator for t-test. Social Science Statistics.
Stanton, H. M. (1935). Measurement of musical talent: The Eastman experiment. In C. E.
Seashore (Ed.), University of Iowa Studies in the Psychology of Music. University of
Iowa Press.
Stanton, H. M., & Koerth, W. (1933). Musical capacity measures of children repeated after
musical training. University of Iowa Studies: Series of Aims & Progress of Research, 42,
New Ser., 259, 48.
Stevens, D. O. (1987). The construction and validation of a test of musical aptitude for young
children (Publication No. 8715880) [Doctoral dissertation, University of South Dakota,
Vermillion]. ProQuest Dissertations and Theses Global.
Stoltzfus, J. (2005). The effects of audiation-based composition on the music achievement of
elementary wind and percussion students (Publication No. 3169610) [Doctoral
dissertation, University of Rochester, Eastman School of Music]. ProQuest Dissertations
and Theses Global.
Stringham, D. (2010). Improvisation and composition in a high school instrumental music
curriculum (Publication No. 3445843) [Doctoral dissertation, University of Rochester,
Eastman School of Music]. ProQuest Dissertations and Theses Global.
Swaminathan, S., Schellenberg, E. G., & Khalil, S. (2017). Revisiting the association between
music lessons and intelligence: Training effects or music aptitude? Intelligence, 62,
119–124. https://doi.org/10.1016/j.intell.2017.03.005
Taggart, C. C. (1989). The measurement and evaluation of music aptitudes and achievement. In
D. L. Walters & C. C. Taggart (Eds.), Readings in Music Learning Theory (pp. 45–54).
GIA Publications.
Talley, K. E. (2005). An investigation of the frequency, methods, objectives and applications of
assessment in Michigan elementary general music classrooms (Publication No. 1428983)
[Master’s thesis, Michigan State University]. ProQuest Dissertations and Theses Global.
Van Ginkel, J. R., Linting, M., Rippe, R. C. A., & van der Voort, A. (2020). Rebutting existing misconceptions about
multiple imputation as a method for handling missing data. Journal of Personality
Assessment, 102(3), 297–308. https://doi.org/10.1080/00223891.2018.1530680
Von Hippel, P. T. (2016). New confidence intervals and bias comparisons show that maximum
likelihood can beat multiple imputation in small samples. Structural Equation Modeling:
A Multidisciplinary Journal, 23(3), 422–437. https://arxiv.org/pdf/1307.5875.pdf
Wallentin, M., Nielsen, A. H., Friis-Olivarius, M., Vuust, C., & Vuust, P. (2010). The musical
ear test, a new reliable test for measuring musical competence. Learning and Individual
Differences, 20(3), 188–196. https://doi.org/10.1016/j.lindif.2010.02.004
Walters, D. L. (1991). Edwin Gordon’s music aptitude work. The Quarterly, 2(1–2), 64–72.
Walters, D. L. (1992). Sequencing for efficient learning. In R. Colwell (Ed.), Handbook of
research on music teaching and learning (pp. 535–545). Schirmer Books.
Webb, M. N. A. (1984). An investigation of the relationship of musical aptitude and intelligence
of students at the third grade level (Publication No. 8509188) [Doctoral dissertation, The
University of North Carolina at Greensboro]. ProQuest Dissertations and Theses Global.
Westervelt, T. G. (2001). An investigation of harmonic and improvisation readiness among
upper elementary-age school children (Publication No. 3031569) [Doctoral dissertation,
Temple University]. ProQuest Dissertations and Theses Global.
Wing, H. D. (1939/1961). Standardised Tests of Musical Intelligence. The Mere, England:
National Foundation for Educational Research.
Wing, H. D. (1962). A revision of the “Wing musical aptitude test”. Journal of Research in
Music Education, 10(1), 39–46. https://doi.org/10.2307/3343909
Wolf, A., & Kopiez, R. (2018). Development and validation of the musical ear training
assessment (META). Journal of Research in Music Education, 66(1), 53–70. https://doi.org/10.1177/0022429418754845
Wöllner, C., Halfpenny, E., Ho, S., & Kurosawa, K. (2003). The effects of distracted inner
hearing on sight-reading. Psychology of Music, 31(4), 377–389.
Yosso, T. J. (2005). Whose culture has capital? A critical race theory discussion of community
cultural wealth. Race, Ethnicity and Education, 8(1), 69–91.
Young, R., & Johnson, D. R. (2015). Handling missing values in longitudinal panel data with
multiple imputation. Journal of Marriage and Family, 77(1), 277–294.
Young, W. T. (1971). The role of musical aptitude, intelligence, and academic achievement in
predicting the musical attainment of elementary instrumental music students. Journal of
Research in Music Education, 19(4), 385–398. https://doi.org/10.2307/3344291
Young, W. T. (1973). The Bentley “measures of musical abilities”: A congruent validity report.
Journal of Research in Music Education, 21(1), 74–79. https://doi.org/10.2307/3343982
Young, W. T. (1976). A longitudinal comparison of four music achievement and music aptitude
tests. Journal of Research in Music Education, 24(3), 97–109.
Zentner, M., & Gingras, B. (2019). The assessment of musical ability and its determinants. In
P. J. Rentfrow & D. J. Levitin (Eds.), Foundations in music psychology: Theory and
research (pp. 641–684). Massachusetts Institute of Technology.
Zhang, Z. (2016). Missing data imputation: Focusing on single imputation. Annals of
Translational Medicine, 4(1), 1–8. https://doi.org/10.3978/j.issn.2305-5839.2015.12.38
Zimmerman, M. P. (1986). Music development in middle childhood: A summary of selected
research studies. Bulletin of the Council for Research in Music Education, 86, 18–35.
ProQuest Number: 28419252
Distributed by ProQuest LLC (2021).
Copyright of the Dissertation is held by the Author unless otherwise noted.