Computer-assisted EFL writing and evaluations based on artificial intelligence: a case from a college reading and writing course

Zhijie Wang
College of Humanities and Development, China Agricultural University, Beijing, China

Library Hi Tech, Vol. 40 No. 1, 2022, pp. 80-97. DOI 10.1108/LHT-05-2020-0113
Received 11 May 2020; Revised 14 July 2020, 17 August 2020, 10 September 2020; Accepted 10 September 2020

Abstract
Purpose – The aim of this study is to explore students' expectations and perceived effectiveness of computer-assisted review tools, as well as the differences in reliability and validity between human and automatic evaluation, in order to find ways to improve students' English writing ability.
Design/methodology/approach – Based on expectancy disconfirmation theory (EDT) and the theory of Intelligent Computer-Assisted Language Learning (ICALL), an experiment was conducted using observation, semistructured interviews and a questionnaire survey. Respondents were asked to write and submit four essays in total on three online automated essay evaluation (AEE) systems, one essay every two weeks. Two teacher raters were also invited to score each student's first and last essays. The respondents' feedback was examined to gauge the effectiveness of the AEE systems, the evaluation results of the AEE systems and the teachers were compared, and descriptive statistics were used to analyze the experimental data.
Findings – The experiment revealed that the respondents held high expectations for the computer-assisted evaluation tools, and that the effectiveness of computer scoring feedback was higher than that of teacher scoring feedback. Moreover, by the end of the writing project, the students' independent learning ability and English writing ability had improved significantly. In addition, there was a positive correlation between students' initial expectations of the computer-assisted learning tools and their final evaluation of the learning results.
Originality/value – The innovation lies in combining observation, questionnaire surveys and data analysis in the experiment, together with deep learning theory, EDT and descriptive statistics, which offers a useful reference for future work.
Keywords Computer-assisted writing, Learner expectation and satisfaction, Artificial intelligence in writing and reading, AEE, AERW
Paper type Research paper

Funding: this paper is supported by the MOE (Ministry of Education in China) Industry-Academy Cooperation Program "Study on the Effect of Group Cooperation on Undergraduates' Improvement in EFL Reading and Writing" (Grant No. 201902184036). The paper forms part of a special section "Informetrics on Social Network Mining: Research, Policy and Practice Challenges - Part 2", guest edited by Mu-Yen Chen, Chien-Hsiang Liao, Edwin David Lughofer and Erol Egrioglu.

1. Introduction
1.1 Literature review
With the irresistible trend of globalization and internationalization, the ability to write well in English, the lingua franca of communication across diverse fields and cultures, has become an imperative in second/foreign language education (Wang et al., 2013; Yu, 2015; Khadka, 2020). As a result, teachers of EFL (English as a foreign language) writing courses face increasing pressure to evaluate language output and provide corrective feedback, especially when enrollments outrun handling capacity (Tang and Wu, 2011; Gao and Ma, 2020). In order to compensate for instructors' limited capacity to assess essays in a standardized way
and provide quick and valid feedback, computer-based software and online systems have been introduced into writing evaluation. The first experiment on computer-assisted essay evaluation was presented in 1966 (Page, 1966). Into the 21st century, more sophisticated automated essay evaluation (AEE) systems have been applied in writing curricula, such as the Criterion online writing service (Burstein et al., 2004), the IntelliMetric essay scoring system (Rudner et al., 2006) and the e-rater scoring engine (Attali, 2007). Computer-assisted scoring systems are reliable in the sense that, programmed in the same way, they stably yield identical results regardless of conditions such as timing, environment or the mood of human raters (Powers et al., 2000). Tests of the syntactic complexity analyzer (SCA) and Coh-Metrix showed that the two AEE systems were both reliable for studying syntactic complexity across genres (Polio and Yoon, 2018). It was also found that a combination of e-rater version 2.0 (an AEE system) and Critique writing analysis tools – a suite of programs that detects undesirable style and errors in grammar, usage and mechanics – provided students with reliable feedback to improve their writing skills (Burstein et al., 2004). Research also suggests that, from teachers' point of view, AEE software has partly sufficient qualities (Bas and Tekal, 2014). The Development Planning for a New Generation of Artificial Intelligence issued by the Chinese State Council in 2017 proposed to accelerate the innovation and application of artificial intelligence (AI) in the field of education, develop intelligent education and meet personalized and diverse educational needs. It is necessary to construct a learner-centered education environment, create an intelligent and interactive learning education system, and use intelligent technology to innovate talent training modes and teaching methods. In 2018, the Ministry of Education of China promulgated the Education Informatization 2.0 Action Plan, a specific implementation plan for "Internet + Education", and AI has gradually entered the implementation stage in the field of education. The validity of AEE technology has also been put to the test in Chinese universities in recent years. An experiment at Chongqing Technology and Business University showed that the use of Pigai.org (an AEE system in China) could effectively enhance students' writing proficiency, promote their writing motivation and improve their self-efficacy (Yang and Dai, 2015). An 18-week experiment was conducted with 81 non-English-major sophomores to explore the effectiveness of blog-assisted writing for Chinese English learners. The results indicated that during the blog-assisted process the students' English writing ability improved significantly. Moreover, most participants had a positive attitude toward this writing mode, which played a decisive role in solving the problems that may arise in traditional English writing teaching (Zhou, 2015). A tracking study of six non-English-major undergraduates found that an AEE-mediated three-phase process – analyzing, responding and revising – increased the enthusiasm of the respondents involved, reinforced their perception of learning goals and yielded positive feedback (Lu, 2016).
Another investigation showed that all 37 respondents enrolled in a writing program were satisfied with the feedback from Pigai.org, in that the warnings and suggestions on vocabulary and collocations contributed much to improving their lexical diversity and sophistication (Huang and Zhang, 2018). Recent research also found that, by using AI technology to capture language depth perception features and evaluating correlation, error and overall accuracy, automatic evaluation could be as reliable as manual scoring in dealing with deep-seated language issues (Pribadi et al., 2017; Tang and Sun, 2019; Boulanger and Kumar, 2019). Though many empirical studies in the field of AEE systems revealed positive effects on overall writing improvement, some studies have reported unfavorable learner evaluations. A study conducted in three EFL classrooms found that respondents in all three classes perceived the use of the AEE system unfavorably (Chen and Cheng, 2008). Others were concerned that AEE could be easily fooled into assigning high scores to essays that were long, syntactically complex and replete with sophisticated vocabulary (Wilson and Czik, 2016). Therefore, further research is required to test the efficiency of these systems and to determine which areas in the writing construct can be effectively improved via AEE systems (Aluthman, 2016). In addition, integrating technological advances into intelligent training models to help learners build conceptualized language ability remains a grand challenge for the future development of AEE tools (Winke and Isbell, 2017). In recent years, social network interactions have also been integrated into the writing process to provide more practical and profound implications for the use and evaluation of AEE systems. In fact, with the growing popularity of Web 2.0 technology, the value of web-based interactions and collaborations in EFL writing has been positively appraised (Jin, 2014; Chu et al., 2017). Experimental studies on EFL writing collaboration revealed that collaborative writing in the form of peer review exercises within task groups could improve writing outcomes (Cristina and Carrillo, 2016) and strengthen students' sense of fulfillment (Li and Liu, 2011; Liu et al., 2018). Both qualitative and quantitative analyses showed that collaborative writing on blogs could improve students' knowledge about their language performance in writing (Amir et al., 2011), trigger motivation to learn autonomously (Yang and Yang, 2013), and foster a sense of competence, usefulness and relatedness (Kramer and Kusurkar, 2017). In this study, all respondents were likewise encouraged to interact by cross-reviewing each other's essays and responding to peer reviews, using three AEE websites as academic social networking tools. Despite massive research efforts from multiple dimensions, some questions still need answers. First, in what way do learners engage with AEE systems? Second, how do learners respond to and perceive the corrective feedback from the AEE rating? Third, how can learners' expectations and outcome confirmation be measured? Fourth, how do the reliability and validity of AEE evaluations compare with those of human raters? Fifth, how do the latest technological advances, especially in AI technology, help address ingrained concerns such as learner engagement (Kukich, 2000)?
In summary, the traditional English writing teaching method focuses only on writing skills and the final product, which does little to stimulate students' interest in learning. As network technology exerts a growing influence on the reform of English teaching modes, it has become a focus of attention for both teachers and students. In this study, AI technology and deep learning theory are integrated into the AEE system to optimize the system and provide effective feedback on the evaluation results.

1.2 Research questions
This study deals with three research questions. First, what are the respondents' expectations and subjective evaluations of AEE? An exploration of this question could provide a further test of the existing research on learners' engagement with AEE systems (Tian and Zhou, 2020). Second, what is the respondents' perceived effectiveness of AEE in their English writing improvement? Research on this question could throw light on how student beliefs about AEE might relate to their writing performance (Wang, 2013). Third, what are the advantages and disadvantages of AEE in comparison with human rating? A study of this question could offer more evidence on whether AEE is a good substitute for human rating (Azmi et al., 2019) or whether it plays a supplementary role (Li, 2019).

2. Method and design
2.1 Respondents
The respondents of this study were 188 students from China Agricultural University who were taking the advanced English reading and writing (AERW) course taught by the teacher–researcher of this paper. As the course was open to all undergraduates, enrollment ranged from first-year students to fourth-year seniors. First-year students accounted for 52% of the total, sophomores 47%, and junior and senior students both less than 2%. As no classroom was big enough to accommodate all the students, and many students had to attend other compulsory courses at different time slots, the respondents were placed separately in Class A41, Class A42, Class A51 and Class A52, each of which met at different hours. To guarantee the validity and reliability of the analysis, each student had to finish and submit all four essays online under the teacher's monitoring in class within a given amount of time. Consequently, ten respondents were eliminated for failing to meet all requirements, and the remaining 178 were counted as qualified respondents of the research.

2.2 Research methods
The theoretical paradigm of this study is based on two theories: expectancy disconfirmation theory (EDT) and the theory of ICALL (Intelligent Computer-Assisted Language Learning). The research methods adopted were observation, semistructured interviews and a survey. Descriptive statistics were used to display and explain the data collected in the experiments.

2.2.1 EDT
EDT was first applied to analyze the effect of expectation–disconfirmation on affective judgments of a product (Oliver, 1977). At the heart of EDT is the idea that disconfirmed expectancy may positively or negatively influence customers' attitudes and purchase intentions (Oliver, 1980). EDT has been widely used in the analysis of affective evaluations and behavioral management, such as how an organization's satisfaction with its supply network's behavior influences its intention to pursue open innovation with that network (Bravo et al., 2017), and how citizen satisfaction with public services is shaped by prior expectations (Van Ryzin, 2013).
An EDT analysis model created by Van Ryzin (2013), shown in Figure 1, is used to build analytic indicators for respondents' subjective evaluations of the AEE systems in this study. Among the links, A indicates that high expectations result in negative disconfirmation; B shows that high performance yields positive disconfirmation; C represents a positive relation between disconfirmation and satisfaction; D indicates that expectations and performance are positively related; E shows a positive relationship between performance and satisfaction; and F shows that expectations may relate either positively or negatively to satisfaction.

Figure 1. EDT analysis model created by Van Ryzin (2013)

2.2.2 English writing teaching assisted by AI computing
Regarding the application of AI in the field of language teaching, the relevant framework is the theory of ICALL, a branch discipline developed from CALL (Computer-Assisted Language Learning). From the existing empirical research, the current applications of AI technology in language teaching mainly include speech recognition and semantic analysis; image recognition (such as recognizing learners and learning content, assisting online examinations, and homework correction and review); human–computer interaction (such as intelligent evaluation and intelligent learning feedback); adaptive and personalized learning (adjusting learning materials, test methods or learning sequences according to individual learning conditions to meet personalized learning needs); and scenario-based teaching using virtual reality. Although AI-assisted language teaching is still in its infancy as an emerging discipline, the development and application of the technology have dramatically enriched the means of English teaching and provided strong technical support for blended English teaching. The English writing teaching mode based on AI computer assistance is shown in Figure 2.

Figure 2. English writing teaching mode based on AI computer assistance (supported by the network environment and AI technology; before class: guidance, practice and autonomous learning; in class: learning by imitation, review, revision and commentary; after class: reciting to promote writing, transfer of sentence patterns and writing transfer; combining online and offline learning)

2.2.3 Data collection and analysis tools
A set of data collection and analysis tools was used in this research, including two questionnaires administered via www.wjx.cn, one of the leading Chinese social statistics websites, at the beginning and at the end of the writing project. The five-point Likert-scale questionnaires had 15 items concerning respondents' attitudes toward the effectiveness of the three AEE systems for writing improvement. Some items were also designed for respondents to compare different scoring approaches. An estimation of Cronbach's α showed high reliability of the questionnaire (α = 0.82). Some open-ended questions on respondents' evaluations of the three AEE tools were also included to supplement the quantitative analysis.

2.3 Data collection process
Data collection was conducted in the spring semester of the 2018 school year. The AERW course was given in a computer-equipped classroom, ensuring that each respondent had a computer at hand. Before writing tasks were assigned, respondents were asked to finish a background questionnaire. In the first two weeks of the class, the teacher explained the scoring mechanisms of Pigai.org, iWrite and Awrite, three well-known Chinese AEE systems compared in Table 1, and demonstrated the various features and functions of the three evaluation tools. As the respondents registered and practiced, the teacher circulated in the classroom to supervise their progress and respond to questions.
Table 1. A comparison of three well-known Chinese AEE systems
Software | Corporation | Technique | Main focus | Scoring | Feedback
Pigai | Beijing Ciwang Technology Incorporated | NLP + cloud computation | Grammar, syntactic complexity, style | Holistic and limited trait scores | Detailed individualized feedback
iWrite | Foreign Language Teaching and Research Press | NLP | Grammar, content, usage | Holistic and trait scores | Limited individualized feedback
Awrite | Fangzhou Education Technology Incorporated | NLP | Grammar, usage, organization | Holistic and trait scores | Detailed individualized feedback

During the period from the third week to the fifth week of the course, various modes of essay writing (argument and persuasion, comparison and contrast, cause and effect) were taught, and respondents were encouraged to practice organizing a paragraph on their own. After the instruction and practice in the first five weeks, the respondents were asked to write one essay every two weeks and upload their work to the websites of Pigai.org, iWrite and Awrite. The respondents had to write four essays in total, each finished in 30 minutes under the teacher's supervision in computer-equipped classrooms. Since there were 178 qualified respondents, 712 essays (178 respondents × four essays) were collected for the study. To guarantee that the functions of the three AEE systems were fully understood and properly used, each respondent was given 5 to 20 minutes of tutoring before writing. During the tutoring sessions, the respondents and the teacher first submitted a sample essay together and then reviewed the feedback from the scoring systems. Whenever respondents felt puzzled or encountered difficulties, the teacher offered timely advice and solutions. After each submission, respondents were encouraged to interact with each other by cross-reviewing each other's essays and responding to peers' reviews via the academic social networking features of the three AEE websites. An e-rater scoring metric (Quinlan et al., 2009) was offered to help the respondents classify and pinpoint the rating focuses common in AEE systems, as shown in Figure 3.

Figure 3. The organization of e-rater scoring (Quinlan et al., 2009)

After each assigned essay was finished, respondents were encouraged to check the feedback of the three AEE systems against the e-rater metric and to respond to the 15 items of a five-point Likert-scale questionnaire. After the writing tasks were finished, the respondents' affective feedback was collected by asking them to answer questions designed on the basis of expectation–disconfirmation theory. Respondents were categorized according to their feedback into four groups: low expectations and low performance; low expectations and high performance; high expectations and low performance; and high expectations and high performance. After all four essays were finished, the teacher retrieved the first and last essays of each student from the online electronic archives of the three AEE systems, and the word count and total score of each essay were recorded. After that, two experienced teacher raters were invited to review and evaluate the essays following the rubrics of CET (College English Test) Band 4, a popular standardized English test in Mainland China.
The respondents were then asked to finish another five-point Likert-scale questionnaire with 15 items comparing AEE and human scoring.

3. Results
3.1 Respondents' background
Analysis of the questionnaire shows that 116 of the respondents are female, accounting for 65.17% of the total, and the remaining 34.83% are male. As to age, 94%, or 168 respondents, are between 17 and 20 years old, with the remaining 10 between 20 and 22 years of age. As for the places where they received their high school education, 103 respondents (58% of the total) come from developed provinces and cities; 44 respondents (25%) are from moderately developed areas; 20 respondents (11%) are from less developed places; and the remaining 11 respondents (6%) come from poverty-stricken areas. Since all the respondents taking the AERW course are of comparatively high English proficiency and differ little from each other in language skills, the places of their high school education are not reckoned as factors of influence. Instead, an analysis according to their year in university was conducted in this study.

3.2 Effect of disconfirmed expectation
It is found in Table 2 that respondents in different years of university show different expectations for AEE systems. More first-year students than respondents of other years focus on the ability to enlarge their vocabulary, accounting for 30.86% of that group. Placing a premium on vocabulary building is common for new university students because they face more difficult text content with more new words than they did in high school (Liu, 2015; Jiang, 2019). Sophomores are comparatively less concerned with vocabulary, instead emphasizing better organization and development as well as better syntactic variety. Juniors are more concerned with better syntactic variety and better organization and development than with other English skills.

Table 2. Expectations of respondents for AEE
Year | N | Vocabulary enlargement | Grammar refining | Usage improvement | Better syntactic variety | Improvement in organization and development
First-year students | 81 | 25 (30.86%) | 15 (18.52%) | 11 (13.58%) | 13 (18.05%) | 15 (18.52%)
Sophomores | 72 | 14 (19.44%) | 17 (23.61%) | 4 (5.55%) | 19 (26.39%) | 15 (20.83%)
Juniors | 12 | 2 (16.67%) | 3 (25.00%) | 1 (8.33%) | 3 (25.00%) | 3 (25.00%)
Seniors | 13 | 2 (15.38%) | 3 (23.07%) | 1 (7.69%) | 3 (23.07%) | 3 (23.07%)

Table 3. Comparing expectations (E) and performance (P) ratings
Group | Vocabulary E/P | Grammar E/P | Usage E/P | Fluency E/P | Organization and development E/P
Low expectations, low performance | 2.23/3.12 | 2.26/2.89 | 2.32/3.15 | 2.54/2.46 | 2.67/2.86
Low expectations, high performance | 2.15/4.54 | 2.34/4.24 | 3.01/4.45 | 2.28/4.10 | 2.76/4.08
High expectations, low performance | 4.24/3.56 | 4.34/3.23 | 4.67/3.78 | 4.52/3.43 | 4.69/2.89
High expectations, high performance | 4.56/4.58 | 4.67/4.69 | 4.71/4.70 | 4.47/4.40 | 4.42/4.27

The effect of expectation–disconfirmation is revealed in Tables 3 and 4, with respondents grouped into four categories according to their ratings of AEE features at different stages: the first is low expectations and low performance; the second, low expectations and high performance; the third, high expectations and low performance; and the fourth, high expectations and high performance.
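The study does not state the exact cut-off used to label an expectation or performance rating as "high" or "low". Purely as an illustration, the sketch below shows how such a grouping, and the group means summarized in Tables 3 and 4, could be derived from respondent-level ratings; it assumes Python with pandas, a scale-midpoint cut-off of 3 on the five-point scale, and hypothetical column names, none of which are taken from the paper.

```python
# Illustrative sketch: grouping respondents by disconfirmed expectation.
# One row per respondent; ratings collected before (expectation) and after
# (performance) the writing tasks. Cut-off and column names are assumptions.
import pandas as pd

def edt_groups(df: pd.DataFrame, feature: str, cutoff: float = 3.0) -> pd.DataFrame:
    """Label each respondent by expectation/performance level for one feature."""
    exp = df[f"{feature}_expectation"]
    perf = df[f"{feature}_performance"]
    out = df.copy()
    out["disconfirmation"] = perf - exp   # positive = performed better than expected
    out["group"] = (
        exp.gt(cutoff).map({True: "high expectations", False: "low expectations"})
        + ", "
        + perf.gt(cutoff).map({True: "high performance", False: "low performance"})
    )
    return out

# Toy usage with made-up ratings for the "vocabulary" feature.
ratings = pd.DataFrame({
    "vocabulary_expectation": [2.0, 2.5, 4.5, 4.8],
    "vocabulary_performance": [2.8, 4.4, 3.2, 4.6],
})
grouped = edt_groups(ratings, "vocabulary")
print(grouped.groupby("group")[["vocabulary_expectation",
                                "vocabulary_performance"]].mean())
```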
The relationship between expectations and performance, as shown in Table 3, is positive when the levels of expectations and performance are both high or both low. The relationship can also diverge: when expectations are low, performance ratings can nevertheless be high, and in that case the ratings of AEE performance in vocabulary, grammar and usage are markedly higher than those of fluency and organization/development. When expectations are high but performance is low, the ratings of AEE performance in fluency and organization/development are markedly lower than those of vocabulary, grammar and usage.

From the satisfaction ratings shown in Table 4, it can be found that performance and satisfaction are positively related, whereas the relationship between expectations and satisfaction is comparatively complex. Unless the levels of expectations and performance are both high or both low, satisfaction levels are not positively related to expectations. When expectations are low but performance is high, satisfaction levels stay high, especially concerning vocabulary, grammar and usage. When expectations are high but performance is low, satisfaction with vocabulary, grammar and usage remains at relatively high levels, though affective evaluations of fluency and organization/development are low.

Table 4. Comparing satisfaction ratings with various features
Group | Vocabulary | Grammar | Usage | Fluency | Organization/development
Low expectations, low performance | 3.88 | 3.98 | 3.76 | 3.01 | 2.98
Low expectations, high performance | 4.68 | 4.76 | 4.58 | 3.78 | 3.69
High expectations, low performance | 3.87 | 3.76 | 3.68 | 3.15 | 3.24
High expectations, high performance | 4.77 | 4.56 | 4.50 | 3.56 | 3.27

3.3 Deep learning analysis based on AI technology
In recent years, as deep learning has been applied to word vector processing, more word vector representations have been trained, such as the word2vec word embedding. The text is trained through a neural network model, but word2vec is mainly used to predict the words surrounding a given word, so it shows weakness in that global statistical information is not exploited. Text classification of English materials can noticeably improve the user experience of an English learning system and provide a sound design basis for analyzing users' preferences and recommending articles, so that more suitable English articles can be pushed to users in a targeted manner. By applying a text classification training model in which different convolution windows are used for convolution training, the correlation between words and the text can be trained better than with a single convolution window. Adopting an improved averaging method for the pooling layer also increases the accuracy of the training results. The weight-sharing (or locally weight-sharing) characteristic of the convolutional layers gives the convolutional neural network (CNN) considerable advantages over traditional methods in speech recognition and two-dimensional image processing. Weight sharing also reduces the complexity of the neural network model, so that when the amount of input data is relatively large, excessive model complexity is avoided to a certain extent.
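To make the multi-window convolutional classifier described above concrete, the following is a minimal sketch in Python with Keras/TensorFlow: an embedding layer feeds three parallel convolution branches with different window sizes and tanh activations, each followed by pooling, and the merged features pass through a fully connected softmax layer. Only the 50-dimensional word vectors and the three convolutional branches follow this section's description; the vocabulary size, sequence length, window widths, filter count, number of classes and the choice of average pooling are illustrative assumptions rather than the authors' actual configuration.

```python
# Minimal sketch of a multi-window CNN text classifier (assumed configuration).
from tensorflow.keras import layers, Model

VOCAB_SIZE = 20000   # assumed vocabulary size
MAX_LEN = 400        # assumed maximum text length in tokens
EMBED_DIM = 50       # 50-dimensional word vectors, as stated in this section
NUM_CLASSES = 5      # assumed number of text categories

inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
embedded = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)

# Three parallel branches with different convolution windows and tanh activation.
branches = []
for window in (3, 4, 5):                                   # assumed window widths
    conv = layers.Conv1D(filters=128, kernel_size=window,
                         activation="tanh")(embedded)
    # Average pooling follows the prose above; the paper's figure shows max pooling.
    pooled = layers.GlobalAveragePooling1D()(conv)
    branches.append(pooled)

merged = layers.Concatenate()(branches)                    # merge the three branches
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(merged)

model = Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```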
English texts are first pre-classified into possible groups of different types to avoid the classification errors that a single-pass classification of some texts would cause. The method of controlling variables is adopted in the training process to isolate the effect of text preprocessing on the results. The dimension of the word vectors is fixed at 50, the number of training iterations is 10, the number of training loops is 1,000, and the convolutional part is three layers deep. The overall design of the neural network is shown in Figure 4.

Figure 4. The overall design of the neural network (input data and text preprocessing; three parallel branches, each consisting of a convolution layer, a tanh activation layer, a max pooling layer and a flattening step; a full connection layer with activation; classified output)

Under the background of AI, the blended teaching mode in English writing teaching is designed with students at the center. It combines online and offline learning and integrates a variety of teaching methods such as interest-driven learning, problem-oriented learning and flipped classrooms. At the same time, diverse teaching evaluation methods are used so that students can truly integrate into the classroom and become the protagonists of learning. The use of AI cloud classes helps students develop autonomous learning ability and critical thinking, and teaching resources are shared more effectively. In the context of "AI + Education", the use of modern technology can accelerate the reform of the talent training mode and effectively improve the quality of curriculum teaching.

3.4 Comparison between AEE and human rating
In this part, a set of five-point Likert-scale questionnaires was used to gauge respondents' perceived comparisons between automatic essay scoring and human rating in the AERW course. On the five-point scale, 1 means "totally disagree", 2 "disagree", 3 "slightly agree", 4 "agree" and 5 "totally agree". The questions were first entered on www.wjx.cn, the above-mentioned social statistics service website, and later exported to SPSS (version 22.0) for variable and reliability analysis. Table 5 shows eight favorable traits of AEE in comparison with human rating, including being timely, more detailed, more individualized and more understandable. It can also be found from Table 5 that respondents give highly positive evaluations of the more independent learning assisted by AEE. There are also positive responses regarding the reliability of the system's scoring method, as well as the timeliness and detail of the feedback content. Overall, the data show that most respondents accepted the AEE traits as more effective than human rating (total average of means = 4.07). It is not negligible, however, that some evaluation scores are comparatively low, especially for the effectiveness of AEE in vocabulary building and sentence learning. Interviews were arranged with the respondents; a specific analysis is provided in the discussion part of this paper.

Table 5. Respondents' perceived effective traits of AEE
Trait | Mean | N | SD | Sig. (two-tailed)
Feedback is timely | 4.69 | 178 | 0.536 | 0.000
Feedback is more detailed | 4.57 | 178 | 0.537 | 0.000
Feedback is more individualized | 3.68 | 178 | 0.734 | 0.000
Feedback is more clearly understandable | 4.08 | 178 | 0.625 | 0.000
Scoring measure is more reliable | 4.62 | 178 | 0.484 | 0.000
More convenient in vocabulary building | 3.53 | 178 | 0.879 | 0.000
More convenient in sentence learning | 3.64 | 178 | 0.954 | 0.000
Learning is more independent | 3.34 | 178 | 0.756 | 0.000
Note(s): Total average = 4.07; Cronbach's α = 0.816
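The reliability of the questionnaires is reported above only as Cronbach's α values (0.82 for the initial questionnaire and 0.816 in Table 5). For readers unfamiliar with the statistic, the following is a minimal sketch of how Cronbach's α can be computed from item-level responses, assuming Python with NumPy; the toy data are made up and are not the study's dataset.

```python
# Illustrative computation of Cronbach's alpha for a Likert-scale questionnaire.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array of shape (n_respondents, n_items)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                                   # number of items
    item_variances = items.var(axis=0, ddof=1).sum()     # sum of item variances
    total_variance = items.sum(axis=1).var(ddof=1)       # variance of total scores
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Toy example: 6 respondents answering 4 items on a 1-5 scale.
responses = np.array([
    [4, 5, 4, 4],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [2, 3, 2, 3],
    [4, 4, 5, 4],
    [3, 4, 3, 3],
])
print(f"Cronbach's alpha = {cronbach_alpha(responses):.3f}")
```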
To further fathom the respondents' attitude toward using AEE systems, an additional five-point Likert-scale questionnaire was designed. As the data in Table 6 show, most respondents in each class of the AERW course expressed a positive attitude: 78% of respondents in Class A41 considered AEE "greatly helpful" or "moderately helpful", and the corresponding percentages in Class A42, Class A51 and Class A52 are 73%, 86% and 84%, respectively. Class A51 and Class A52 show a comparatively higher ratio of positive attitudes, which merits further exploration.

Table 6. Overall perceived effectiveness of using AEE systems ("To what extent do you believe AEE helps you in writing?")
Response | Class A41 (N = 46) | Class A42 (N = 45) | Class A51 (N = 43) | Class A52 (N = 44)
Greatly helpful | 11 (24%) | 9 (20%) | 7 (16%) | 8 (18%)
Moderately helpful | 25 (54%) | 24 (53%) | 30 (70%) | 29 (66%)
Slightly helpful | 3 (7%) | 6 (13%) | 3 (7%) | 5 (11%)
Not helpful | 3 (7%) | 4 (9%) | 3 (7%) | 2 (5%)
Undecided | 4 (9%) | 2 (4%) | 0 (0%) | 0 (0%)

3.5 Effects of AEE on students' writing improvement
In order to examine the perceived effects of AEE on writing performance, another five-point Likert-scale questionnaire was designed to gauge the respondents' assessment of the functions of the three AEE systems separately. As seen in Table 7, most participants expressed positive attitudes toward the features of all three systems, especially concerning grammar, usage, mechanics and syntactic complexity. The scores for organization and development are comparatively low. Interviews with respondents revealed that they were satisfied with the content analysis of the AEE systems but expected more feedback on discourse elements.

Table 7. Respondents' overall assessment of AEE reliability in writing improvement
Feature | Pigai.org | iWrite | Awrite | Mean | N | SD | Sig. (two-tailed)
Grammar | 4.58 | 4.50 | 4.45 | 4.51 | 178 | 0.576 | 0.000
Usage | 4.55 | 4.28 | 4.50 | 4.44 | 178 | 0.549 | 0.000
Mechanics | 4.50 | 4.38 | 4.12 | 4.33 | 178 | 0.658 | 0.000
Style | 4.12 | 4.40 | 4.08 | 4.21 | 178 | 0.785 | 0.000
Organization | 3.88 | 3.90 | 3.78 | 3.85 | 178 | 0.804 | 0.000
Development | 3.28 | 3.45 | 3.48 | 3.40 | 178 | 0.746 | 0.000
Syntactic complexity | 4.60 | 4.54 | 4.38 | 4.51 | 178 | 0.634 | 0.000
Note(s): Total average = 4.07; Cronbach's α = 0.825

The respondents' writing improved over the course, as can be seen in Table 8, which reports the word counts and AEE scores of the first and fourth essays. For instance, the word count in the first essays of Class A41 ranges from 128 to 246 words, while in the fourth essays it ranges from 141 to 308 words. Besides the increase in word count, the AEE scores for Class A41 also rise from the original range of 4 to 9 to the range of 5 to 9; a marked improvement is seen in the minimum score, which advances from 4 to 5. The same tendency is found in all other classes, of which Class A42 registers the largest leap in the minimum score, from 3 to 5.
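Tables 9 and 10 below report t-tests and Cohen's d effect sizes for the change from the first to the fourth essay. As an illustration of how such figures can be obtained, the sketch below runs a paired t-test and computes Cohen's d with SciPy and NumPy on made-up scores; the convention used here (dividing the mean difference by the standard deviation of the paired differences) is an assumption, since the paper does not state which formula was applied.

```python
# Illustrative paired t-test and Cohen's d for first- vs fourth-essay scores.
# The arrays are toy data; in the study each array would hold the respondents'
# scores on the ten-point scale.
import numpy as np
from scipy import stats

def paired_cohens_d(before: np.ndarray, after: np.ndarray) -> float:
    """Cohen's d for paired samples, using the SD of the paired differences."""
    diff = np.asarray(after, dtype=float) - np.asarray(before, dtype=float)
    return diff.mean() / diff.std(ddof=1)

first_essay = np.array([6.5, 7.0, 7.5, 6.0, 7.0, 8.0, 6.5, 7.5])    # toy data
fourth_essay = np.array([7.5, 7.5, 8.5, 7.0, 8.0, 8.5, 7.0, 8.5])   # toy data

t_stat, p_value = stats.ttest_rel(fourth_essay, first_essay)
d = paired_cohens_d(first_essay, fourth_essay)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, Cohen's d = {d:.2f}")
```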
Table 8. Words/scores of first and last essays
Class | N | Word count, 1st essay (Min–Max) | Word count, 4th essay (Min–Max) | Scores, 1st essay (Min–Max) | Scores, 4th essay (Min–Max)
A41 | 46 | 128–246 | 141–308 | 4–9 | 5–9
A42 | 45 | 120–208 | 152–303 | 3–10 | 5–10
A51 | 43 | 119–234 | 150–356 | 4–10 | 6–10
A52 | 44 | 121–216 | 155–328 | 4–9 | 5–10

Table 9. Improvements in the mean scores (N = 178)
Class | Scores of the first essay, Mean (SD) | Scores of the fourth essay, Mean (SD) | t | p | Cohen's d effect size
A41 | 7.14 (0.53) | 8.34 (0.54) | 4.15 | 0.000*** | 0.56
A42 | 7.27 (0.48) | 8.61 (0.52) | 5.48 | 0.000*** | 0.61
A51 | 7.94 (0.54) | 9.26 (0.49) | 5.37 | 0.000*** | 0.79
A52 | 7.68 (0.60) | 9.12 (0.56) | 4.48 | 0.000*** | 0.51
Note(s): *** means p < 0.001

Table 10. Teacher raters' scores of the first and fourth essays
Scores of the first essay, Mean (SD) | Scores of the fourth essay, Mean (SD) | t | p | Cohen's d effect size
7.28 (1.05) | 8.76 (1.18) | 4.012 | 0.005** | 0.56
Note(s): ** means p < 0.01

An analysis of the significance of the difference between the scores of the first essay and the fourth essay also shows positive results. As seen in Table 9, the Cohen's d effect sizes underlying these significance levels range from 0.51 to 0.79, indicating medium to large effects. These results demonstrate a marked improvement in respondents' scores from the first essays to the fourth essays. Two experienced teacher raters were also invited to assess the first and fourth essays on the same scale of one to ten. Their evaluations of the essays show marked similarity to those of the three AEE systems. The score difference between the first essay and the fourth essay shows a significant improvement in respondents' writing skills, and the Cohen's d effect size shows that this significant difference is statistically moderate (Table 10).

4. Discussion
This paper discusses AEE systems and their application in EFL writing at China Agricultural University. Four dimensions are covered in the study: respondents' expectations, the comparison between AEE and human rating, the perceived effectiveness of AEE in English writing and the mediation of AI-based deep learning. Questionnaires and interviews are conducted to examine the expectations of the respondents enrolled in the writing project. In addition, to draw a comparison between AEE and human rating and to study the perceived effects of AEE on respondents' writing improvement, all respondents are asked to write four essays throughout the project, and at the end of the project two teacher raters are invited to evaluate the respondents' first and last essays. Furthermore, to provide a better understanding of the AEE-mediated learning context, a model is established based on deep learning concepts.

As for the first research question, it is found that respondents of different college years have different expectations. For first-year students, vocabulary enlargement is most important, while for sophomores syntactic variety is a priority. Respondents in their junior and senior years focus on better organization and development of the essay, while taking grammar refining as similarly necessary. Such results can be explained by the EFL teaching curriculum, not only at China Agricultural University but also at other colleges in Asia, where the learning mode usually shifts from a test-intensive approach in high school to a self-regulated one at college (Chou, 2015; Kim et al., 2016). First, grammar and syntax are supposedly taught in high schools, but text content is rendered simple enough to exclude words and expressions outside the list issued by educational authorities.
College English content, however, is not subject to a guided wordlist, and thus first-year students are challenged to learn as many new words as quickly as possible to keep up with the teaching schedule (Zheng, 2014). Second, none of the respondents are English majors and they will not be able to obtain writing tutorship afterward, so peer collaboration and AEE rating are alternatives for keeping their writing skills honed. The EDT-based analysis of students' affective evaluations of the AEE systems also reflects a high level of satisfaction with the AEE systems, even where there is a disparity between high expectations and low performance.

For the second research question, this study shows that the respondents confirmed their preference for AEE functions in many ways, such as timely, more detailed and more clearly understandable feedback and a reliable scoring scale. Similar research explained that instantaneous feedback and consistent grading helped students draft better essays (Azmi et al., 2019). It seems contradictory that the respondents place a high value on the "detailed" feature but meanwhile give a markedly lower score to the "individualized" feature. Moreover, respondents give a high score to the "clearly understandable" feedback of AEE but a pronouncedly low mark to "more convenient in vocabulary and sentence learning". Also unusual is the low score they give to "learning more independently". Interviews with some respondents show that they believe the AEE feedback does not give comments as individualized as those they receive when their essays are reviewed by the teacher, with whom they find it easy to develop an affinity during the tutoring process. Furthermore, though AEE is reliable in most cases, it sometimes gives inaccurate or confusing feedback. For instance, the three AEE systems wrongly code proper nouns like "iPhone" as capitalization errors and fail to distinguish "a long" from "along" in "Ocean water can lead to great damage to buildings along coastal areas", which indicates that AEE is oblivious to the logical errors of transitional words. Moreover, it fails to flag the missing subject and predicate in sentences like "When Robby got up early, which was a routine for twenty years", while the teacher raters identify them accurately. In addition, self-learning via AEE requires high motivation and well-informed interactions, which many respondents felt too challenging to handle.

In terms of the third question, in the comparison between AEE systems and human rating, the respondents view the former favorably, but interviews also show that the teacher raters' role is indispensable. On the one hand, the effect of the three AEE systems on the respondents' writing is positive, and respondents confirm AEE's better effectiveness in giving feedback on grammar, usage, mechanics and syntactic complexity. Other research also shows that human evaluation was somewhat inconsistent and that it would be advisable to use automatic evaluation (Azmi et al., 2019) in cases where punctuation, morphology, lexis, syntax and coherence are to be evaluated (Svatava et al., 2019). On the other hand, interviews with respondents show that human raters are indispensable where there is no consistent standard, such as logical consistency, organization and development. Research also shows that most AEE systems fail to give accurate evaluations of the structure and content of an essay (Bai and W, 2019).
Moreover, the automatic rating feature is at a disadvantage when written texts need to be evaluated in their specific context (Patout and Cordy, 2019). Simply put, this study reveals that AEE effectively helps students improve their writing skills, but the low satisfaction levels with AEE evaluations of development and organization indicate that human rating remains indispensable.

5. Conclusion and implications
This study explores students' disconfirmed expectations of, and motivation in, using AEE systems by creating a dynamic and longitudinal context. The results reveal marked relationships. First, if the relationship between expectations and performance is positive, then there is a positive relationship between expectations and satisfaction: low expectations combined with low performance, for example, lead to low levels of satisfaction, and vice versa. Second, when expectations are low but performance is high, satisfaction levels are high. Third, if expectations are high but performance is low, satisfaction levels stay in the middle, with respondents' overall ratings of the AEE systems remaining at high levels.

The study provides several significant implications. First, the results reveal a discrepancy between expectations and perceived performance, which not only affects student satisfaction but also acts as a moderating agent between expectations and satisfaction. Therefore, it is advisable for teacher–researchers to explain the strengths of AEE at the beginning, to inform respondents of the latest improvements of AEE systems during the writing process and to better integrate AEE systems with the EFL writing curriculum. Second, grasping the affective level of student expectations is challenging in that respondents may appear optimistic at the initial use of the AEE system features; however, their satisfaction can diminish if they do not experience the level of perceived performance they expected, resulting in a negative effect on their motivation to use valid AEE features. Third, although there is a positive relationship between performance and satisfaction, the relationship between expectations and disconfirmation is not linear and is mediated by factors such as the teacher's guidance and collaboration with peer reviewers, which provides insight for pedagogical reforms in the College English curriculum to foster students' learning motivation. Wilson conducted a one-semester experiment with respondents from fourth grade to junior high school across 28 schools and found that automatic feedback could actively assist teachers in dealing with various problems encountered in the writing and reviewing process, such as diagnosis and response (Wilson, 2014). In this way, teachers have more time and energy to devote to other aspects of feedback and to improving students' writing ability. This is consistent with the results of the present study, confirming the importance of educational informatization for teaching innovation. In summary, the AEE system can compensate for the lack of timely corrective feedback in traditional teaching, help students improve their English writing performance and promote the development of their writing ability. Moreover, it stimulates students' motivation and self-efficacy, cultivates students' interest and confidence in writing and helps build up the ability to learn independently.
Furthermore, the evolution of AI will transform the writing teacher's role from merely giving corrective feedback to constructing a learner-friendly social engagement context. Investing in intelligent computing technology and facilitating its use in classrooms has significant policy implications. To address the current limitations, future research may focus on how to enhance students' writing proficiency and satisfaction levels by helping them construct conceptual thoughts via deep engagement with AI technologies.

References
Aluthman, E.S. (2016), "The effect of using automated essay evaluation on ESL undergraduate students' writing skill", International Journal of English Linguistics, Vol. 6 No. 6, pp. 54-67.
Amir, Z., Ismail, K. and Hussin, S. (2011), "Blogs in language learning: maximizing students' collaborative", Procedia Social and Behavioral Sciences, Vol. 18 No. 2011, pp. 537-543.
Attali, Y. (2007), Construct Validity of E-Rater in Scoring TOEFL Essays (Research Report No. RR-0721), Educational Testing Service, available at: https://www.ets.org/research/policy_research_reports/publications/report/2007/hsmn.
Azmi, A.M., Al-Jouie, M.F. and Hussain, M. (2019), "AAEE – automated evaluation of students' essays in Arabic language", Information Processing and Management, Vol. 56 No. 5, pp. 1736-1752.
Bai, L.F. and W, J. (2019), "An overview of automatic essay evaluation in the past two decades", Foreign Language Research, No. 1, pp. 65-71.
Bas, F.C. and Tekal, M. (2014), "Evaluation of computer based foreign language learning software by teachers and students", The Turkish Online Journal of Educational Technology, Vol. 13 No. 2, pp. 71-78.
Boulanger, D. and Kumar, V. (2019), "Deep learning in automated essay scoring", presented at the 14th International Conference on Intelligent Tutoring Systems (ITS), Montreal, doi: 10.1007/978-3-319-91464-0_30.
Bravo, M.I.R., Montes, F.J.L. and Moreno, A.R. (2017), "Open innovation in supply networks: an expectation disconfirmation", Journal of Business and Industrial Marketing, Vol. 32 No. 3, pp. 432-444.
Burstein, F., Chodorow, M. and Leacock, C. (2004), "Automated essay evaluation: the Criterion online writing service", AI Magazine, Vol. 25 No. 3, pp. 27-36.
Chen, C.F.E. and Cheng, W.Y.E. (2008), "Beyond the design of automated writing evaluation: pedagogical practices and perceived learning effectiveness in EFL writing classes", Language Learning and Technology, Vol. 12 No. 2, pp. 94-112.
Chou, M.H. (2015), "Impacts of the test of English listening comprehension on students' English learning expectations in Taiwan", Language and Curriculum, Vol. 28 No. 2, pp. 191-208.
Chu, S.K.W., Capio, C.M., van Aalst, J.C.W. and Cheng, E.W.L. (2017), "Evaluating the use of a social media tool for collaborative group writing of secondary school students in Hong Kong", Computers and Education, Vol. 110 No. 2017, pp. 170-180.
Cristina, P.B. and Carrillo, C.A. (2016), "L2 collaborative E-writing", Procedia-Social and Behavioral Sciences, Vol. 228 No. 2016, pp. 601-607.
Gao, J.W. and Ma, S. (2020), "Instructor feedback on free writing and automated corrective feedback in drills: intensity and efficacy", Language Teaching Research, advance online publication, doi: 10.1177/1362168820915337.
Huang, A.Q. and Zhang, W.X. (2018), "The effect of automated writing evaluation feedback on students' vocabulary revision – taking Pigai.org for example", Modern Educational Technology, Vol. 2 No. 8, pp. 71-78 (in Chinese).
Jiang, Y.L.B. (2019), "The development of vocabulary and morphological awareness: a longitudinal study with college EFL students", Applied Psycholinguistics, Vol. 40 No. 4, pp. 877-903.
Jin, Y.Q. (2014), "An evidence-based research on the role of web 2.0 in building up college students' collaborative learning capabilities", China Educational Technology, Vol. 12 No. 335, pp. 139-145.
Khadka, S. (2020), "Meaning and teaching of writing in higher education in Nepal", in Bista, K., Sharma, S. and Raby, R.L. (Eds), Higher Education in Nepal: Policies and Perspectives, Routledge, Oxford, pp. 201-213.
Kim, Hee, N., Jung and Ae, M. (2016), "Korean EFL freshman students' English learning experiences in high school and in college", The Journal of Studies in Language, Vol. 32 No. 1, pp. 1-23.
Kramer, I.M. and Kusurkar, R.A. (2017), "Science-writing in the blogosphere as a tool to promote autonomous motivation in education", The Internet and Higher Education, Vol. 35 No. 2017, pp. 48-62.
Kukich, K. (2000), "Beyond automated essay scoring, the debate on automated essay grading", IEEE Intelligent Systems, Vol. 15 No. 5, pp. 22-27.
Li, G.F. (2019), "The impact of the integrated feedback on students' writing revision based on the AWE", Foreign Language Education, Vol. 40 No. 4, pp. 72-76 (in Chinese).
Li, H. and Liu, R.D. (2011), "Study on the effect of web-based collaborative writing", Web-Assisted Education, Vol. 219 No. 7, pp. 67-72.
Liu, H. (2015), "Freshmen's new words consciousness and its influence on language output competence measured on a corpus basis", Higher Education Exploration, Vol. 2015 No. 7, pp. 91-96 (in Chinese).
Liu, M., Liu, L.P. and Liu, L. (2018), "Group awareness increases student engagement in online collaborative writing", The Internet and Higher Education, Vol. 38 No. 2018, pp. 1-8.
Lu, L. (2016), "A study of the second writing process based on automated essay evaluation tools", Foreign Language World, Vol. 2016 No. 2, pp. 88-96 (in Chinese).
Oliver, R.L. (1977), "Effect of expectation and disconfirmation on postexposure product evaluations: an alternative interpretation", Journal of Applied Psychology, Vol. 62 No. 4, pp. 480-486.
Oliver, R.L. (1980), "A cognitive model of the antecedents and consequences of satisfaction decisions", Journal of Marketing Research, Vol. 17 No. 4, pp. 460-469.
Page, E.B. (1966), "The imminence of grading essays by computer", Phi Delta Kappan, Vol. 47 No. 5, pp. 238-243.
Patout, P.A. and Cordy, M. (2019), "Towards context-aware automated writing evaluation systems", presented at the 1st ACM SIGSOFT International Workshop on Education through Advanced Software Engineering and Artificial Intelligence, Tallinn, doi: 10.1145/3340435.3342722.
Polio, C. and Yoon, H.J. (2018), "The reliability and validity of automated tools for examining variation in syntactic complexity across genres", International Journal of Applied Linguistics, Vol. 128 No. 1, pp. 165-188.
Powers, D.E., Burstein, C., Chodorow, M., Fowles, M.E. and Kukich, K. (2000), Comparing the Validity of Automated and Human Scoring of Essays, Educational Testing Service, Princeton, NJ.
Pribadi, F.S., Utomo, A.B. and Mulwinda, A. (2017), "Automated short essay scoring system using normalized simpson methods", presented at the Engineering International Conference (EIC2017), Semarang, doi: 10.1063/1.5028081.
Quinlan, T., Higgins, D. and Wolff, S. (2009), Evaluating the Construct-Coverage of the E-Rater Scoring Engine, ETS, Princeton, NJ.
Rudner, L.M., Garcia, V. and Welch, C. (2006), "An evaluation of the IntelliMetric essay scoring system", The Journal of Technology, Learning, and Assessment, Vol. 4 No. 4, pp. 1-22.
Svatava, S., Katerina, R. and Magdalena, R. (2019), "Comparison of automatic and human evaluation of L2 texts in Czech", Issledovanija po slavjanskim jazykam, Vol. 24 No. 1, pp. 93-101.
Tang, D. and Sun, Y. (2019), "Automatic scoring method of English composition based on language depth perception", presented at the 2019 4th International Seminar on Computer Technology, Mechanical and Electrical Engineering (ISCME 2019), Chengdu, doi: 10.1088/1742-6596/1486/4/042045.
Tang, J.L. and Wu, Y. (2011), "Using automated writing evaluation in classroom assessment: a critical review", Foreign Language Teaching and Research, Vol. 2011 No. 2, pp. 273-282 (in Chinese).
Tian, L.L. and Zhou, Y. (2020), "Learner engagement with automated feedback, peer feedback and teacher feedback in an online EFL writing context", System, Vol. 91, 102247, doi: 10.1016/j.system.2020.102247.
Van Ryzin, G.G. (2013), "An experimental test of the expectancy-disconfirmation theory of citizen satisfaction", Journal of Policy Analysis and Management, Vol. 32 No. 3, pp. 597-614.
Wang, P.L. (2013), "Can automated writing evaluation programs help students improve their English writing?", International Journal of Applied Linguistics and English Literature, Vol. 2 No. 1, pp. 6-12.
Wang, Y.J., Shang, H.F. and Briody, P. (2013), "Exploring the impact of using automated writing evaluation in English as a foreign language university students' writing", Computer Assisted Language Learning, Vol. 26 No. 3, pp. 234-257.
Wilson, J. (2014), "Does automated feedback improve writing quality?", Learning Disabilities: A Contemporary Journal, Vol. 12 No. 1, pp. 93-118.
Wilson, J. and Czik, A. (2016), "Automated essay evaluation software in English language arts classrooms: effects on teacher feedback, student motivation, and writing quality", Computers and Education, Vol. 100 No. 2016, pp. 94-106.
Winke, P.M. and Isbell, D.R. (2017), "Computer-assisted language assessment", in Thorne, S. and May, S. (Eds), Language, Education and Technology. Encyclopedia of Language and Education, 3rd ed., Springer, Cham, pp. 1-13.
Yang, X.Y. and Dai, Y.C. (2015), "An empirical study on college English autonomous writing teaching model based on www.pigai.org", Computer-Assisted Foreign Language Education, Vol. 2015 No. 2, pp. 17-23 (in Chinese).
Yang, Y.J. and Yang, Y. (2013), "Blog-assisted process writing for English majors and pedagogic implications", Computer-Assisted Foreign Language Education, Vol. 153 No. 2013, pp. 46-51.
Yu, B.B. (2015), "Incorporation of automated writing evaluation software in language education: a case of evening university", International Journal of Information and Education Technology, Vol. 5 No. 11, pp. 808-813.
Zheng, R.N. (2014), "A discussion on the connection of English teaching between high school and college", Education Exploration, Vol. 2014 No. 12, pp. 35-36.
Zhou, H. (2015), "An empirical study of blog-assisted EFL process writing: evidence from Chinese non-English majors", Journal of Language Teaching and Research, Vol. 6 No. 1, pp. 189-195.

About the author
Dr. Zhijie Wang is an associate professor at the College of Humanities and Development at China Agricultural University. His research interests include computer-assisted language learning, second language writing and second language education.
Zhijie Wang can be contacted at: lynx17505@cau.edu.cn