Locating Discussion Practices in Computational Models of Text

Carolyn P. Rosé, Miaomiao Wen, & Diyi Yang, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh PA, 15213, {cprose,mwen,diyiy}@cs.cmu.edu

Abstract: Language modeling techniques such as Latent Semantic Analysis (LSA) (Dumais et al., 1988) and Latent Dirichlet Allocation (LDA) (Blei, Ng, & Jordan, 2003) have been used to construct easy-to-obtain lenses on discussion behavior. They are desirable because they do not require labeled training data and are thus thought of as an easy way to perform a meaningful analysis of text. In this paper we contrast an LDA-style analysis that approximates an analysis of motivation and cognitive engagement with an analysis that uses hand-coded training data to model these phenomena directly. In both cases, we demonstrate that the automated indicators have predictive validity in connection with attrition over time, although the demonstration is stronger in the case of the direct modeling approach. We conclude with a discussion of the trade-offs between the two approaches and implications for future work in the area of Learning Analytics applied to discussion data.

Introduction

As the field of Learning Analytics matures, it seeks to move beyond shallow observations of student behavior that are easy to measure, such as number of posts, number of back clicks, or time spent watching videos. It reaches instead for more meaningful latent factors that these observational variables may reflect, such as commitment, metacognitive awareness, or cognitive engagement. The concept of a practice fits within this class of more abstract notions. Practices encompass more than isolated behaviors: they reflect identity, goals, and intentionality. In this paper, we focus specifically on discussion practices that are relevant in a learning context, in particular practices of signaling motivation and cognitive engagement.

It is not controversial that motivation and cognitive engagement are student orientations that are important for learning. As students interact with one another in a course discussion, they reflect their state of motivation and cognitive engagement in subtle or overt ways. At times, commiserating about the struggle to find the wherewithal to persist in a course may be part of rapport building or support seeking. At other times, it may be a way of projecting the image of a competent student and thus reflecting high motivation to succeed. It is not surprising to find students who appear unmotivated and then drop out of a course.

For a number of reasons it may be strategic to computationally model these practices. For example, if it is possible to detect low motivation or low cognitive engagement automatically, it may be easier for instructors or mentors to identify which students are vulnerable so they can allocate their mentoring and support accordingly. Alternatively, reflecting back to students their detected levels of motivation and cognitive engagement might support their metacognitive awareness and self-regulation.

The field of language technologies, and text mining in particular, offers a variety of modeling approaches that may be adopted and applied within the area of Learning Analytics. In this paper, we contrast two alternative approaches. In a top-down approach, we model motivation and cognitive engagement directly, using carefully constructed, meaningful knowledge sources and hand-coded training data.
In a bottom-up approach, we come to a model indirectly by applying an exploratory modeling approach and identifying latent factors that turn out to reflect motivation and cognitive engagement. Both approaches show some measure of predictive validity. We compare and contrast what is gained and lost in these two alternative approaches and conclude with some directions for future research.

Computational Modeling Approaches

In this paper, we compare two different approaches to modeling discussion practices associated with learning. First, we model motivation and cognitive engagement directly, using carefully constructed, meaningful knowledge sources and hand-coded training data. Second, we apply an exploratory modeling approach and identify latent factors that turn out to reflect motivation and cognitive engagement. In both cases, in order to validate the indicators provided by the models, we used a survival model to measure the extent to which the indicators predict attrition over time (Rabe-Hesketh & Skrondal, 2012). Survival analysis is known to provide less biased estimates than simpler techniques (e.g., standard least squares linear regression) that do not take into account the potentially truncated nature of time-to-event data (e.g., users who had not yet left the community at the time of the analysis but might at some point subsequently).

From a more technical perspective, a survival model is a form of proportional odds logistic regression, in which a prediction about the likelihood of a failure occurring is made at each time point based on the presence of some set of predictors. The estimated weights on the predictors are referred to as hazard ratios. The hazard ratio of a predictor indicates how the relative likelihood of the failure occurring increases or decreases with an increase or decrease in the associated predictor. A hazard ratio greater than 1 signifies that a higher than average measure of an independent variable is predictive of higher than average dropout at the next time point. In particular, subtracting 1 from the hazard ratio yields the percentage by which a participant is estimated to be more likely to drop out at the next time point if the value of the associated independent variable is 1 standard deviation higher than average; for example, a hazard ratio of 2 indicates a doubling of that probability. A hazard ratio between 0 and 1 signifies that a higher than average measure of an independent variable is predictive of lower than average dropout at the next time point. For example, if the hazard ratio is .3, then a participant is 70% less likely to drop out at the next time point if the value of the associated independent variable is 1 standard deviation higher than average for that student.

In preparation for a partnership with an instructor team for a Coursera MOOC launched in Fall 2013, we were given permission by Coursera to extract and study the discussion data from a small number of courses. One of those courses was a Python programming course, "Learn to Program: The Fundamentals", offered in August 2013, which had 3,590 active users and 24,963 forum posts. The course ran for seven weeks. Its forums include seven week-specific subforums and a separate subforum for more general discussion about the course. Our analysis is limited to behavior within the discussion forums.
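To make the shape of such an analysis concrete, the sketch below fits a Cox proportional hazards model to toy student-week data using the open-source Python library lifelines. This is only a minimal illustration of the kind of model described above, not the analysis reported in this paper; all column names and values are hypothetical.

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical student records: how many weeks each student stayed active,
# whether they eventually dropped out (1) or were still active when the
# analysis was run (0, i.e., censored), and a standardized (z-scored)
# predictor measured from their posts.
df = pd.DataFrame({
    "weeks_active": [2, 5, 7, 3, 7, 4, 6, 1],
    "dropped_out":  [1, 1, 0, 1, 0, 1, 0, 1],
    "motivation_z": [-0.8, 0.1, 1.2, -1.0, 0.9, -0.2, 1.4, -1.3],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="weeks_active", event_col="dropped_out")

# exp(coef) is the hazard ratio: a value below 1 (e.g., .84) means that a
# student one standard deviation above average on the predictor is that much
# less likely to drop out at the next time point.
print(cph.hazard_ratios_)
```

Note that lifelines fits a continuous-time Cox model rather than the discrete-time formulation sketched above, but the hazard ratio interpretation carries over.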
Direct Modeling of Motivation and Cognitive Engagement

We have developed and validated a targeted approach to measuring motivation towards a course and cognitive engagement in our prior work (Wen et al., 2014). Here we summarize that work. In order to model motivation, we extracted roughly 1,000 posts altogether from two other MOOCs. We used Amazon's Mechanical Turk service to have the posts hand annotated for displayed level of motivation towards the course on a Likert scale. We evaluated the reliability of the hand coding using the Intraclass Correlation, which was high (> .7). Next we trained a machine learning model to predict high versus low motivation from these hand codings, and we applied this model to all of the data in the Python MOOC. As a measure of cognitive engagement we used a carefully constructed, publicly available dictionary that scores words for how abstract or concrete they are on a Likert scale (Beukeboom, Tanis, & Vermeulen, 2013). We averaged the scores of the words within a post to derive an abstractness score, which we considered to be a measure of cognitive engagement with the material.

In our survival analysis, both the motivation indicator and the cognitive engagement measure were found to predict lower attrition. In particular, the hazard ratio associated with motivation was .84, which signifies that students whose posts were rated as high motivation at a time point were 16% less likely than average to drop out at the next time point. The hazard ratio associated with cognitive engagement was .53, indicating that students whose posts reflected a standard deviation higher measure of cognitive engagement at a time point were 47% less likely than average to drop out at the next time point.

Probabilistic Graphical Modeling Approach

The probabilistic graphical model we use in an exploratory way integrates two types of previously developed probabilistic graphical models (Yang et al., under review). First, in order to obtain a soft partitioning of the social network of the discussion forums, we used a Mixed Membership Stochastic Blockmodel (MMSB) (Airoldi et al., 2008). The advantage of MMSB over other graph partitioning methods is that it does not force assignment of students solely to one subcommunity; the model can track the way students move between subcommunities over the course of their participation. We have linked the community structure discovered by the model with a probabilistic topic model, so that for each person a distribution over identified communicative themes is estimated that mirrors that person's distribution across subcommunities. By integrating these two modeling approaches so that the representations learned by each are pressured to mirror one another, we are able to learn structure within the text portion of the model that helps identify the characteristics of within-subcommunity communication that distinguish the various subcommunities from one another. A well known topic modeling approach is Latent Dirichlet Allocation (LDA) (Blei et al., 2003), a generative model that is effective for uncovering the thematic structure of a document collection. In an LDA model, each latent word class is represented as a distribution over words. The words that rank most highly in the distribution are treated as most characteristic of the associated latent class, or topic. An important parameter that must be set prior to application of the modeling framework is the number of subcommunities to identify.
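For readers unfamiliar with this family of models, the sketch below trains a plain LDA model with the open-source gensim library and prints the top-ranking words per topic. It illustrates only the generic topic modeling component, not our integrated model; the mini-corpus and parameter settings are hypothetical, and the num_topics parameter plays the role of the subcommunity count just discussed.

```python
from gensim import corpora
from gensim.models import LdaModel

# Hypothetical mini-corpus: each document is one student-week of forum
# posts, already tokenized and lowercased.
docs = [
    ["python", "install", "windows", "version", "idle"],
    ["function", "parameter", "argument", "return", "value"],
    ["study", "group", "join", "please", "add"],
    ["loop", "index", "string", "error", "debug"],
]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# num_topics must be fixed before training; here it is tiny purely for
# illustration.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)

# Each topic is a distribution over the vocabulary; show its top-ranking
# words, the most visible output of a trained LDA model.
for topic_id, words in lda.show_topics(num_topics=2, num_words=5, formatted=False):
    print(topic_id, [w for w, _ in words])
```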
In this set of experiments, we set the number to twenty for each MOOC in order to enable the models to identify a diverse set of subcommunities reflecting different compositions in terms of content focus, participation goals, and time of initiating active participation. The trained model identifies a distribution of subcommunity participation scores across the twenty subcommunities for each student on each thread. Thus we are able to construct a subcommunity distribution for each student for each week of active participation in the discussion forums by averaging the subcommunity distributions for that student over each thread the student participated in that week. In this analysis we refer to student-weeks because for each student, for each week of their active participation in the discussion forum, we have one observational vector that we treat as one data point. The text associated with a student-week contains all of the messages posted by that student during that week. We use our integrated model to identify themes in these student-weeks by examining the student-weeks that have high scores for the topics that showed significantly higher or lower than average attrition in the quantitative analysis. We identified four such topics, referred to below as Topic9 (hazard ratio 1.06), Topic13 (hazard ratio .95), Topic17 (hazard ratio 1.09), and Topic18 (hazard ratio .95). In contrast to the effect of the directly modeled variables, these hazard ratios indicate a weaker effect, between 5% and 10%.

When an LDA model is trained, the most visible output that represents the trained model is a set of word distributions, one associated with each topic. Each distribution specifies a probabilistic association between each word in the vocabulary of the model and the associated topic. Top ranking words are most characteristic of the topic, and lowest ranking words are hardly representative of the topic at all. Typically when LDA models are used in research such as that presented in this paper, a table is offered that lists associations between topics and top ranking words, sometimes dropping words that do not form a coherent set with the other top ranking words. The set of words is then used to identify a theme. In our methodology, we did not interpret the word lists out of the context of the textual data that was used to induce them. Instead, we used the model to retrieve messages that fit each of the identified topics using a maximum likelihood measure and then assigned an interpretation to each topic based on the association between topics and texts rather than directly on the word lists. Word lists on their own can be misleading, especially with an integrated model like our own, where a student may get a high score for a topic within a week more because of who he was talking to than because of what he was saying. We will see that, at best, the lists of top ranking words bore an indirect connection with the texts in top ranking student-weeks. However, the texts associated with top ranking student-weeks were nevertheless thematically coherent. Because LDA is an unsupervised language processing technique, it would not be reasonable to expect that the identified themes would exactly match human intuition about the organization of topic themes; and yet, as a technique that models word co-occurrence associations, it can be expected to identify some things that would make sense as thematically associated.
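Continuing the hypothetical gensim sketch above, the following shows one simple way to retrieve the documents most strongly associated with a given topic. It approximates the kind of retrieval step just described; the exact scoring in our integrated model differs.

```python
def top_documents_for_topic(lda, corpus, topic_id, k=3):
    """Rank documents by the weight the trained model assigns to topic_id."""
    scored = []
    for doc_index, bow in enumerate(corpus):
        # get_document_topics returns (topic_id, probability) pairs
        # for this document
        topic_probs = dict(lda.get_document_topics(bow, minimum_probability=0.0))
        scored.append((topic_probs.get(topic_id, 0.0), doc_index))
    scored.sort(reverse=True)
    return scored[:k]

# Inspect the student-weeks most strongly associated with topic 0, then read
# the underlying posts to assign an interpretation, rather than relying on
# the top-ranking word list alone.
for prob, doc_index in top_documents_for_topic(lda, corpus, topic_id=0):
    print(round(prob, 3), docs[doc_index])
```

Reading the retrieved texts, rather than the word lists alone, is what guards against the misleading interpretations discussed above.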
In this light, we examine the sets of posts that the model identifies as strongly associated with each of the topics identified as predicting significantly more or less dropout in the survival analysis, and then, for each one, identify a coherent theme. Apart from the insights we gain about reasons for attrition from the qualitative analysis, what we learn at a methodological level is that this new integrated model identifies coherent themes in the data, in the spirit of what is intended for LDA, and yet the themes may not be represented strictly in word co-occurrences. Thus, we must interpret this integrated model with more care than a typical LDA model. What is interesting about the Python course is that we have topics within the same course, some of which predict higher attrition and others of which predict lower attrition, so we can compare them to see what is different in their nature. In each case, the top ranking words in the topic and the topic theme as identified from top ranking student-weeks bore little connection to one another, although we see some inklings of connection at an abstract level. Topics that signified higher than average attrition were more related to getting set up for the course, possibly indicating confusion with course procedures. Topics that signaled lower than average attrition were ones where students were deeply engaged with the content of the course, working together towards solutions. The interactions between students in the discussions associated with higher attrition were not particularly dysfunctional as discussions; they simply lacked a mentoring component that might have helped the struggling students get past their initial hurdles and make a personal connection with the substantive course material.

Topic9 [more attrition]. Top ranking words included keyword, trying, python, formulate, toolbox, workings, coursera, vids, seed, and tries. The top ranking student-weeks contained many requests to be added to study groups, but in virtually all of these cases, that was the last message posted by the student that week. Similarly, a large number of these student-weeks included an introduction and no other text. What appears to unify these student-weeks is that these are students who came into the course, made an appearance, but were not very quick to engage in discussions about the material. Some exceptions within the top ranking student-weeks were requests for help with course procedures.

Topic13 [less attrition]. Top ranking words include name error, uses, mayor, telly, setattr, hereby, gets, could be, every time, and adviseable. In contrast to Topic9, this topic contained many top ranking student-weeks with substantial discussion about course content. We see students discussing their struggles with the assignment, but not just complaining about confusion. Rather, we see students reasoning out solutions together. For example, "So 'parameter' is just another word for 'variable;' and an 'argument' is a specific value given to the variable. Okay; this makes a lot more sense now." or "For update_score(): Why append? are you adding a new element to a list? You should just update the score value."

Topic17 [more attrition]. Top ranking words include was beginner, amalgamate, thinking, defaultdef, less, Canada, locating, fundamentalist, only accountable, and English. Like Topic9, this topic contains many top ranking student-weeks with requests to join study groups as the only text for the week.
The substantive technical discussion was mainly related to getting set up for the course rather than about Python programming per se, for example "Hi;I am using ubuntu 12.04. I have installed python 3.2.3 Now my ubuntu12.04 has two version of python. How can I set default version of 3.2.3Please reply." or "For Windows 8 which version should I download ?Downloaded Python 3.3.2 Windows x86 MSI Installer?and I got the .exe file with the prompter ... but no IDLE application".

Topic18 [less attrition]. Top ranking words include one contribution, accidental, workable, instance, toolbox, wowed, meant, giveaway, patient, and will accept. Like Topic13, we see a great deal of talk related to problem solving, for example "i typed s1.find(s2;s1.find(s2)+1;len(s1)) and i can't get why it tells me it's wrong? do not use am or pm.... 3am=03:00 ; 3pm=15:00", or "I don't see why last choice doesn't work. It is basically the same as the 3rd choice. got it! the loop continues once it finds v. I mistakenly thought it breaks once it finds v. thanks!". The focus was on getting code to work. Perhaps "workable" is the most representative of the top ranking words.

In applying this integrated model, which brings together a view of the data from a social network perspective with a complementary view from the text contributed by students in their threaded discussions, we are able to identify emergent subcommunity structure, including subcommunities with differential rates of attrition. A qualitative post hoc analysis suggests that subcommunities associated with higher attrition demonstrate lower comfort with course procedures and lower expressed motivation and cognitive engagement with the course materials, whereas subcommunities associated with lower attrition reflect higher motivation and cognitive engagement, which is consistent with the results obtained through the direct modeling approach.

Discussion and Conclusions

It is not at all surprising that the approach that required more time to develop, namely the direct modeling approach, yielded stronger results quantitatively. Nevertheless, the analysis presented in this paper offers some lessons learned and directions for future work. The first important take-home message is that although the exploratory model produced meaningful results, this does not mean that one could not do better with a more carefully crafted measure. Researchers should consider carefully how much resolution into their data they are losing if they choose to take an "effortless" approach. The second important take-home message is that it can be dangerous to read too much into the lists of top ranking words per topic that come out of LDA variants, although that may again be a tempting shortcut. Nothing replaces actually going back to the data to see what structure the model is really picking up. Perhaps the most important take-home message is that while the top ranking words per topic did not turn out to be well represented in the top ranking posts, we do find some connection at an abstract level with the more abstract themes that the topics were revealed to pick up on by leveraging the network structure. We see here evidence that the network structure has the potential to make an important contribution to the interpretation of the text.
While pure text based approaches like standard LDA rely entirely on word co-occurrences, we see here that word co-occurrences and word overlap may miss collections of thematically related posts where the relationship is not reflected in the words themselves, because the commonality transcends individual words and operates at the level of functional word classes, such as words that signal emotion or words used as greetings. It is this last take-home message that is the most critical for modeling discussion practices. What we see here is that if we desire to push further on exploratory models that are useful for identification of practices rather than just low level observed behaviors, we should explore further how network structure may be leveraged to raise the level of awareness of the models above the limited notion of commonality found in simple word co-occurrences.

References

Airoldi, E., Blei, D., Fienberg, S., & Xing, E. P. (2008). Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9, 1981-2014.

Beukeboom, C. J., Tanis, M., & Vermeulen, I. E. (2013). The language of extraversion: Extraverted people talk more abstractly, introverts are more concrete. Journal of Language and Social Psychology, 32(2), 191-201.

Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.

Dumais, S. T., Furnas, G. W., et al. (1988). Using latent semantic analysis to improve access to textual information. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. Washington, D.C.: ACM.

Koller, D., & Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. The MIT Press.

Rabe-Hesketh, S., & Skrondal, A. (2012). Multilevel and Longitudinal Modeling Using Stata (Volumes I and II). STATA Press.

Wen, M., Yang, D., & Rosé, C. P. (2014). Linguistic reflections of student engagement in massive open online courses. In Proceedings of the International Conference on Weblogs and Social Media.