Forming Groups for Collaborative Learning in Introductory Computer Programming Courses Based on Students' Programming Styles: An Empirical Study

Eustáquio São José de Faria¹, Juan Manuel Adán-Coello², Keiji Yamanaka³

¹ Instituto de Informática, Pontifícia Universidade Católica de Minas Gerais – Arcos, Brasil, eustaquio@pucminas.br
² Faculdade de Engenharia de Computação, Pontifícia Universidade Católica de Campinas, juan@puc-campinas.edu.br
³ Faculdade de Engenharia Elétrica, Universidade Federal de Uberlândia, Brasil, keiji@ufu.br

Abstract – This paper describes and evaluates an approach for forming groups for the collaborative learning of computer programming. Groups are formed based on students' programming styles. The style of a program is characterized by simple, well-known metrics, including the length of identifiers, the size and number of modules (functions/procedures), and the numbers of indented, commented and blank lines. A tool was implemented to automatically assess the style of the programs submitted by students. To evaluate the tool and the group-formation approach, experiments were conducted with Information Systems students enrolled in a course on algorithms and data structures. The experiments showed that collaborative learning was very effective in improving the students' programming style, particularly for students who worked in heterogeneous groups (formed by students with different levels of knowledge of programming style).

Index Terms – Collaborative Learning; Group Formation; Computer Programming; Programming Style.

INTRODUCTION

Learning computer programming is one of the first and most challenging tasks computing students encounter. The students' difficulties are reflected in high failure rates and in the problems they face in courses that depend directly on the abilities to program, to reason logically and to solve problems. In part, this is a consequence of how hard it is for instructors to effectively guide students during programming lab activities, given the large number of students per class.

The literature indicates that collaborative work involving groups of students can help solve this problem, provided that adequate mechanisms are available to form groups and mediate their work. When this is done, students have been observed to improve their performance, critical thinking and cooperative behavior [1] [2]. Collaborative work is based on the assumption that two or more individuals working together can reach a state of equilibrium in which ideas are exchanged and distributed among group members, generating new ideas and knowledge as a result [3] [4].

Some research projects have focused on the use of collaboration for learning abilities related to Computer Science. A good example of this type of work is COLER, an environment for the collaborative learning of entity-relationship modeling [5]. In COLER, an entity-relationship modeling problem is presented to the students, who construct individual solutions in a private workspace. The students then use a shared workspace to collaboratively construct a new entity-relationship diagram. The design of COLER was based on the theories of socio-cognitive conflict and cognitive dissonance [6] [7]. COLER's main functions are to detect learning opportunities and to coach collaboration; it recognizes learning opportunities by identifying relevant syntactic dissimilarities between individual and group diagrams.

Doise and Mugny [6] understand socio-cognitive conflict as a social interaction induced by the confrontation of divergent solutions from the participating subjects.
They highlight the importance of socio-cognitive conflict in the learning process and, building on the Piagetian theory of equilibration [4], argue that from such interaction individuals can reach higher states of knowledge.

Based on the above ideas, a collaborative strategy for learning computer programming can benefit from the formation of groups in which students learn from controversy while discussing their solutions to a given problem. One way of constructing such groups is to analyze students' programs looking for characteristics that evidence significant differences in some respect, for instance program quality. The differences should be relevant enough to motivate the students to discuss them.

SOFTWARE QUALITY METRICS

Besides meeting its functional specification, a program has several other attributes that reflect its quality. These attributes are not directly related to what the program does. Instead, they reflect the structure and organization of the source code and the related documentation, and they are usually considered important to distinguish good from bad programs and programmers [8]. The attributes that reflect the quality of a program are known empirically by expert programmers but are difficult to teach to novices, because they demand considerable abstraction and experience and because many of them are not well founded.

Several metrics have been proposed to measure the quality of programs, and some tools have been developed to assess it automatically [9] [10] [11] [12] [13] [14] [15]. Most of the available tools generate a report that is presented to the user, and some return a single numeric value that represents the overall quality of the evaluated program [9] [10].

Most of the metrics proposed to measure the quality of programs are related to the style used in their construction. Programming style is scarcely discussed in programming courses and books, and its importance for producing good programs is usually not adequately emphasized. Yet, against the general intuition, source code is written primarily to be read by people, not to be processed by compilers: it is read, reviewed and modified by different people, at different times, for different tasks, while it is compiled comparatively few times [16].

In this research, the quality of a program is assessed using some of the most common style metrics in the literature. They measure the student's capacity to write well-organized code:

1. Identifier length;
2. Percentage of identifiers defined as constants (#define / const);
3. Module length (functions and procedures);
4. Number of modules;
5. Percentage of indented lines;
6. Percentage of comment lines;
7. Percentage of blank lines.

The importance of each of these attributes can be justified as follows:
• Identifier length: the number of different identifiers used in a single module can reach some hundreds. Considering all the modules of a medium-size program, several thousand different elements can be in use, each with its particular name. Memorizing the meaning of all these names is very hard. Therefore, when reading the source code, it is desirable that the precise meaning, declaration point and computational properties of each name can be determined without consulting complementary documentation, or even the variable and function declarations. Very short identifiers usually do not express their meaning adequately.

• Percentage of identifiers defined as constants: including constants as literal values directly in the source code (for example, 3.14 or "parameter missing") complicates program maintenance. When performing any modification that affects one of these constants, it is necessary to find all the places where it is used. Besides, constants with different meanings can have the same value, making it difficult to modify the program correctly.

• Module length and number of modules: problem decomposition is a key factor for complexity reduction. Complexity is synonymous with variety, that is, the number of different situations that the problem can present. Thus, when decomposing a problem into subproblems, the programmer is also dividing its complexity and, as a consequence, simplifying the solution.

• Percentage of indented lines: indentation is the technique of shifting code text to the right or to the left whenever there is a change of scope/block. Indentation improves code readability, helping to make the logical structure of a program evident.

• Percentage of comment lines: comments are the main component of the internal documentation of a program. They help the reader understand the objective of each code unit and explain the logic of difficult sections. Novice programmers are rarely instructed on how to write good comments, although writing good comments is perhaps as important and as difficult as writing good programs. Good comments cannot redeem bad code, but bad comments can seriously worsen good code.

• Percentage of blank lines: blank lines help the programmer perceive the borders between blocks. They influence the readability of a program much as they do in natural language texts.
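As a concrete illustration of several of these attributes, the fragment below contrasts a poorly styled function with a version that would fare better on the metrics above. It is a minimal sketch written for this discussion, not an example taken from the experiments.

```c
/* Poor style: cryptic one-letter identifiers, a magic number, no comments. */
float a(float r) { return 3.14f * r * r; }

/* Better style: a named constant, descriptive identifiers, and a comment
   stating the module's purpose. */
#define PI 3.14159f

/* Compute the area of a circle with the given radius. */
float circle_area(float radius)
{
    return PI * radius * radius;
}
```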
A TOOL FOR PROGRAM QUALITY ASSESSMENT (PQA-C)

PQA-C is a tool implemented to assess the quality of student programs written in the C programming language, according to the set of style metrics presented above. PQA-C computes a score for each style metric using a method similar to Berry and Meekings' scoring method [9], illustrated in Figure 1. In the graph shown in Figure 1, L is the point below which no score is obtained; S is the start of the ideal expected range for the metric; F is the end of the ideal range; and H is the point above which no score is obtained.

FIGURE 1
METRIC VALUE CALCULATION
(Score as a function of the metric value: 0 below L, rising linearly to 100 between L and S, constant at 100 between S and F, falling linearly to 0 between F and H, and 0 above H.)

If vi is the value found for metric i, the score attributed to this metric, si, is computed as follows:

    si = 100,                        if S ≤ vi ≤ F
    si = ((vi − L) / (S − L)) × 100, if L ≤ vi < S
    si = ((H − vi) / (H − F)) × 100, if F < vi ≤ H
    si = 0,                          if vi < L or vi > H

Using the scores for each attribute, a global score (GS) that represents the overall quality of the program is computed:

    GS = Σi si

Besides computing the scores for the 7 quality metrics and the GS, PQA-C generates a Report of Anomalies and Solutions. Based on the score of each metric, this report tells the student which are the main stylistic deficiencies found in his program and gives suggestions on how to remove them, without revealing the ideal value of each metric.

EXPERIMENTAL EVALUATION

A number of experiments were conducted to evaluate the proposed group-formation strategy and the supporting PQA-C tool. The experiments involved students pursuing a major in Information Systems at PUC-Minas (Pontifícia Universidade Católica de Minas Gerais, Brasil), campus Arcos, enrolled in an Algorithms and Data Structures course.

I. Forming Groups

The experiments involved 32 students: 24 distributed into 6 groups of 4 students each, and 8 working individually. The groups had 4 students because groups with too many members tend to have difficulties organizing their work and sharing information, while groups that are too small may be unable to provide a sufficiently rich environment for idea generation and discussion.

The experiments measured and compared the learning gains of students working in homogeneous groups, in heterogeneous groups, and individually. The students had to accomplish three programming tasks. Each task was first developed by each student individually and then redone by the groups. All programs produced by the individuals and by the groups were evaluated by the PQA-C tool. Students working individually developed the same program twice, the second time after having the first version evaluated by PQA-C.

Groups were formed using the scores computed for the first program developed individually. The maximum global score (GS) for a program is 700. A GS lower than 210 (30% of the maximum) was considered low; a GS between 210 and 400 was considered medium; a GS greater than 400 was considered high. Of the 32 students that performed the first programming task, 8 got a high GS, 8 got a medium GS and 16 got a low GS. Using these scores, students were grouped as follows:

• Groups 1 and 2 (heterogeneous): 2 students with high GS and 2 with low GS each;
• Group 3 (homogeneous-high): 4 students with high GS;
• Group 4 (homogeneous-low): 4 students with low GS;
• Groups 5 and 6 (homogeneous-medium): 4 students with medium GS each.

In addition to the above groups, 8 students with low GS worked individually in all three tasks, making it possible to compare the effects of collaborative versus individual work on the considered learning task. No student with an initial medium or high GS worked individually, for two reasons: (1) the number of students with these scores was relatively low, and (2) there was a particular interest in focusing the study on the effects of collaborative work on weaker students. Future experiments should be conducted to evaluate the effects of collaborative versus individual work also on students with medium and high GS.
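To make the scoring and classification procedure concrete, the following minimal sketch implements the trapezoid score, the global score and the low/medium/high classification described above. The formula and the GS thresholds come from the paper; the per-metric range values and all identifier names are assumptions, since PQA-C's actual L, S, F and H values are not published here.

```c
#include <stdio.h>

#define NUM_METRICS 7

/* Per-metric scoring range, as in Figure 1. */
typedef struct {
    double low;     /* L: below this, no score       */
    double start;   /* S: start of the ideal range   */
    double finish;  /* F: end of the ideal range     */
    double high;    /* H: above this, no score       */
} MetricRange;

/* Trapezoid score: 100 inside [S, F], linear ramps on [L, S] and
   [F, H], and 0 outside [L, H]. */
double metric_score(double v, MetricRange r)
{
    if (v < r.low || v > r.high)
        return 0.0;
    if (v >= r.start && v <= r.finish)
        return 100.0;
    if (v < r.start)
        return (v - r.low) / (r.start - r.low) * 100.0;
    return (r.high - v) / (r.high - r.finish) * 100.0;
}

/* Global score: the sum of the seven metric scores (maximum 700). */
double global_score(const double v[], const MetricRange r[])
{
    double gs = 0.0;
    for (int i = 0; i < NUM_METRICS; i++)
        gs += metric_score(v[i], r[i]);
    return gs;
}

/* Classification used for group formation (thresholds from the paper). */
const char *classify(double gs)
{
    if (gs < 210.0)
        return "low";
    if (gs <= 400.0)
        return "medium";
    return "high";
}

int main(void)
{
    /* Hypothetical range for "percentage of indented lines". */
    MetricRange indented = {10.0, 40.0, 90.0, 100.0};
    printf("indentation score: %.1f\n", metric_score(65.0, indented));
    return 0;
}
```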
Groups worked together during lab classes. Students always had at hand their programs and the corresponding reports generated by PQA-C. The members of each group were asked to exchange their programs and reports and to try to find, in their colleagues' programs, the anomalies mentioned in the respective reports. Then each student discussed with his group mates the likely sources of the anomalies in the examined program and how to improve it. In each group, a leader, randomly chosen and previously oriented by the instructor, mediated the collaboration among his group colleagues and conducted the construction of a new version of the program. The new program was submitted to PQA-C, and a new report was generated and presented to the students. They were asked to compare and contrast the programs they had produced individually with the program produced by the group, together with the respective reports, and to comment on the salient differences. In parallel, the students working individually also developed two versions of the same program, the second version written after reading the report generated by PQA-C for the first version. The same cycle was repeated for tasks 2 and 3. All programs produced, and the respective reports, were stored to be analyzed afterward by the instructor.

II. The Programming Tasks

Programming task 1 required the development of a program to read data for a telephone directory with a constant number of entries. After initializing the phone list entries, a menu had to be presented to the user with the following options: (1) search phone number by subscriber name; (2) search subscriber by phone number; (3) exit.

Programming task 2 required constructing and maintaining a list of N (a given constant) students enrolled in a course. The list had to be kept ordered using the bubble sort algorithm (see the sketch after the task descriptions). A menu had to be presented to the user with the following options: (1) insert a student; (2) sort the list by name; (3) sort the list by student grade; (4) display all approved students; (5) display all students; (6) exit.

Programming task 3 consisted of writing a program to control a dynamic ordered list of guests for a party. The user had to be guided by a menu with the following options: (1) insert a guest in the list; (2) remove a guest from the list; (3) display the number of guests; (4) display the names of all guests; (5) display the names of all guests from a given city.
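For illustration, the core of task 2 might be organized as in the sketch below. This is a possible solution fragment with hypothetical names, written to follow the style metrics discussed earlier; it is not one of the programs produced in the experiments.

```c
#include <stdio.h>

#define MAX_STUDENTS 10   /* the given constant N from the task statement */

/* A course enrollee; the field names are hypothetical. */
typedef struct {
    char  name[40];
    float grade;
} Student;

/* Keep the list ordered by grade using bubble sort, as task 2 requires:
   repeatedly swap adjacent out-of-order pairs until the list is sorted. */
void sort_by_grade(Student list[], int count)
{
    for (int pass = 0; pass < count - 1; pass++) {
        for (int i = 0; i < count - 1 - pass; i++) {
            if (list[i].grade > list[i + 1].grade) {
                Student tmp = list[i];   /* swap the out-of-order pair */
                list[i] = list[i + 1];
                list[i + 1] = tmp;
            }
        }
    }
}

int main(void)
{
    Student class_list[MAX_STUDENTS] = {
        {"Ana", 8.5f}, {"Bruno", 6.0f}, {"Carla", 9.0f}
    };

    sort_by_grade(class_list, 3);
    for (int i = 0; i < 3; i++)
        printf("%s: %.1f\n", class_list[i].name, class_list[i].grade);
    return 0;
}
```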
III. Results

Table I shows the GSs for the programs produced by the members of the groups and by the groups themselves; Table II presents the results for the students that worked individually.

TABLE I
GLOBAL SCORES FOR INDIVIDUALS (IGS) AND GROUPS (GGS)
(For each group — heterogeneous, homogeneous-high, homogeneous-low and homogeneous-medium — the table lists the individual scores IGS1, IGS2 and IGS3 of its members and the group scores GGS1, GGS2 and GGS3 for tasks 1, 2 and 3, together with the task 1 → 3 ratios IGS3/IGS1 and GGS3/GGS1.)

Comparing the scores of students 1, 17, 5 and 20, the members of the heterogeneous groups that achieved an initial high GS, with the scores of the members of the homogeneous-high group, it can be noticed that the scores of the students from the heterogeneous groups had, on average, a higher increase than those of the members of the homogeneous-high group. This suggests that the stronger students also benefit from the collaboration with weaker students: the teaching activity gives new perspectives on the object of study, consolidating previously acquired knowledge.

Table I also shows some interesting data concerning the evolution of the scores of students with an initial low GS. The students with an initial low GS that worked in heterogeneous groups (students 4, 9, 25 and 26) increased their individual scores by more than 574% on average; the scores of the students from the homogeneous-low group had an average increase of 216%; finally, the 4 highest increases among the students that worked individually averaged only 51% (students 12, 24, 27 and 28). This is consistent with the literature [17] [18]: heterogeneous groups are usually more effective for individual learning because heterogeneity naturally produces controversy more frequently. Nevertheless, the homogeneous-low group (group 4) also presented a good level of individual evolution.

As expected, programs developed by groups scored higher than the programs developed by their members. This can be credited to the collaboration inside the groups, but also to the fact that group members had already developed a version of the program that had been evaluated by PQA-C.

TABLE II
GLOBAL SCORES FOR STUDENTS WORKING INDIVIDUALLY (IGS)
(For each task, the first value is the score of the first version and the second that of the revised version; a minus sign indicates a decrease.)

Student   Task 1       Task 2       Task 3       IGS3/IGS1
   2      203, 175     175, 233     200, 225      -1.01
   8      150,  75     175, 275     167, 225       1.11
  12      100, 150     133, 100     150, 150       1.50
  21      167, 200     133, 167     125, 175      -1.25
  22      100, 133     125, 233     133, 133       1.33
  24      100, 142     175, 225     150, 233       1.50
  27      100, 150     133, 250     167, 267       1.67
  28      200, 200     225, 100     275, 367       1.38
                           Average IGS3/IGS1:      1.28
Comparing the GSs for tasks 1 and 3, it is evident that the programs of the heterogeneous groups showed an increase well above that of the remaining groups. Table I shows that the increase for the heterogeneous groups (groups 1 and 2) was 13% and 22%, respectively, while for the homogeneous-high group (group 3) the increase was only 4%. In the remaining groups, the GS decreased. This was an unexpected result, because the GSs of those groups' members had a positive increase and, in most cases, the GSs of the programs written by the groups were noticeably higher than the GSs of their members.

Although more experiments have to be conducted, the reported experiments indicate that collaboration has a very positive impact on learning how to program, at least with respect to improving the students' programming style. The experiments also showed that the mediator is fundamental. When the mediator was not present, collaboration did not occur or was very precarious. In the heterogeneous groups, in the absence of a mediator, high-GS students tried to impose their positions on low-GS students without any discussion. In the homogeneous-high group, the absence of a coordinator led to generalized chaos, with each member believing that his position should prevail. In the two homogeneous-medium groups, collaboration occurred naturally in the absence of the mediator. Finally, in the homogeneous-low group, when the mediator was absent the students became apathetic, without even knowing how to start the discussion.

CONCLUSION

The conducted experiments make it evident that collaborative work resulted in more understandable programs. Besides the fact that the global scores (GS) of the programs increased with each activity, a manual inspection of the produced programs, performed by the instructor at the end of each experiment, showed that the programs were indeed progressively closer to the expected readability standards.

The highest increases in GS were observed among the students that had an initial low GS and worked in groups, especially in heterogeneous groups. On the other hand, if the heterogeneous groups accounted for the highest increases in GS, they were also the most resistant to collaboration at the beginning of the experiments. It was also observed in these groups that students with low GS tended to be submissive towards high-GS students. The submission was only broken when the instructor joined the groups and stimulated the participation of all group members. By contrast, students were more collaborative in homogeneous groups, where conflicts were usually more balanced. The experiments also verified that students with an initial low GS had much higher increases in GS when working in groups than when working individually.

FUTURE WORK

There are three main directions foreseen for continuing this work: the development of new experiments to confirm and refine the results described above; the extension of the PQA tool; and the construction of a mediator agent.

The PQA tool can be extended in a number of ways. For example, it currently supports only the C programming language; it would be very helpful if it could also analyze programs written in other modern languages used to teach programming to novices, such as Java. The metrics used to assess program quality and the method employed for scoring and comparing programs can also be improved. For example, PQA-C does not take into consideration the semantics of the identifiers used in a program; it only verifies their lengths. Although longer identifiers are usually more expressive than short ones, rigorously there is no direct relation between identifier length and semantic meaning. Ontologies could be of use here to verify whether a given identifier is meaningful in the context of a given program domain. Moreover, currently all metrics have the same weight when computing a program's GS. However, expert programmers know that not all style metrics contribute equally to producing readable, good-style programs, although the relative importance of each one is not evident.
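One natural refinement along these lines is a weighted global score, GS = Σi wi × si. The sketch below illustrates the idea; the weighting scheme is an assumption made for this discussion, not part of PQA-C.

```c
#define NUM_METRICS 7

/* Weighted global score: GS = sum of w[i] * s[i]. With weights that sum
   to 7, the maximum GS remains 700, as in the current unweighted scheme. */
double weighted_global_score(const double s[], const double w[])
{
    double gs = 0.0;
    for (int i = 0; i < NUM_METRICS; i++)
        gs += w[i] * s[i];
    return gs;
}
```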
The experiments also showed that the effectiveness of group work often depends heavily on the mediation of the members' interactions. Because it is difficult for a single instructor to adequately mediate the work of a large number of groups, a computer-supported environment for collaborative learning would be most helpful. Such an environment should integrate tools for group formation, group interaction and group mediation.

REFERENCES

[1] Suthers, D. D., "Computer Aided Education and Training Initiative", Technical Report, Learning Research and Development Center, University of Pittsburgh, 1998.
[2] Gokhale, A. A., "Collaborative Learning Enhances Critical Thinking", Journal of Technology Education, Vol. 7, No. 1, 1995.
[3] Lévy, P., "Collective Intelligence: Mankind's Emerging World in Cyberspace", Perseus Books, January 2000.
[4] Piaget, J., "The Development of Thought: Equilibration of Cognitive Structures", New York: Viking Penguin, 1977.
[5] Constantino-González, M. A., Suthers, D. D. and de los Santos, J. G. E., "Coaching Web-based Collaborative Learning based on Problem Solution Differences and Participation", International Journal of Artificial Intelligence in Education, Vol. 13, No. 2-4, 2003, pp. 263-299.
[6] Doise, W. and Mugny, W., "The Social Development of the Intellect", Oxford: Pergamon Press, 1984.
[7] Festinger, L., "A Theory of Cognitive Dissonance", Stanford University Press, 1957.
[8] Sommerville, I., "Software Engineering", 7th Ed., Addison Wesley, 2004.
[9] Berry, R. E. and Meekings, B. A. E., "A Style Analysis of C Programs", Communications of the ACM, Vol. 28, No. 1, January 1985, pp. 80-88.
[10] Hung, S., Kwork, L. and Chan, R., "Automatic Program Assessment", Computers and Education, Vol. 20, No. 2, 1993, pp. 183-190.
[11] Schorsch, T., "CAP: An Automated Self-Assessment Tool to Check Pascal Programs for Syntax, Logic and Style Errors", SIGCSE '95, Nashville, TN, USA, March 1995.
[12] Mengel, S. A. and Yerramilli, V., "A Case Study of the Static Analysis of the Quality of Novice Student Programs", SIGCSE '99, New Orleans, LA, USA, March 1999.
[13] Xenos, M., Stavrinoudis, D., Zikoulli, K. and Christodoulakis, D., "Object-Oriented Metrics – A Survey", Proc. of FESMA (Federation of European Software Measurement Associations), Madrid, Spain, 2000.
[14] Jackson, D., "A Semi-Automated Approach to Online Assessment", ITiCSE 2000, Helsinki, Finland, July 2000.
[15] Purao, S. and Vaishnavi, V., "Product Metrics for Object-Oriented Systems", ACM Computing Surveys, Vol. 35, No. 2, June 2003, pp. 191-221.
[16] Oman, P. W. and Cook, C. R., "A Programming Style Taxonomy", Journal of Systems and Software, Vol. 15, No. 3, July 1991.
[17] Ellis, A. P. J., Hollenbeck, J. R., Ilgen, D. R., Porter, C. O. L. H. and West, B. J., "Team Learning: Collectively Connecting the Dots", Journal of Applied Psychology, Vol. 88, No. 5, 2003, pp. 821-835.
[18] Maier, N., "Problem Solving and Creativity in Individuals and Groups", Brooks/Cole, Belmont, CA, 1970.