David Dornan
Learning Analytics: Tool Matrix

Each entry below follows the matrix columns: Tool (URL), Description, Opportunities in Learning Analytics Solutions, and Weaknesses/Concerns/Comments.

Data

One of the biggest hurdles in developing learning analytics tools is developing the data governance and privacy policy needed to access student data. The two initiatives in this section offer frameworks for opening access to student attention/learning data. The first provides a start on data collection standards; the second shows how and why it is not only feasible to deliver free open courses but also sensible, because they create a community-based research environment in which to explore, develop and test learning theories and learning feedback mechanisms/tools.

Tool: PSLC (Pittsburgh Science of Learning Center) DataShop
Description: The PSLC DataShop is a repository containing course data from a variety of math, science, and language courses.
Opportunities: Data Standards. Initiatives like PSLC will help the learning analytics community develop standards for collecting, anonymizing and sharing student-level course data.
Weaknesses/Concerns/Comments: Convincing individual institutions to contribute to this type of data repository may be difficult given that many institutions lack the data governance/sharing policies to share this type of information even internally.

Tool: Open Learning Initiative
Description: This is an exciting initiative at Carnegie Mellon University. Students' interaction with free online course material and activities provides a virtual learning analytics laboratory in which to experiment with algorithms and feedback mechanisms.
Opportunities: From Solo Sport to Community-Based Research Activity. Herbert Simon of Carnegie Mellon University stated that "improvement in Post Secondary Education will require converting teaching from a 'solo sport' to a community based research activity." Two concerns commonly arise when conducting experimentation using learning analytics:
1. Privacy concerns related to accessing student data.
2. Ethical concerns related to testing different feedback/instructional response mechanisms.
By offering free courses to students with full disclosure of how their interactions will be tracked and analyzed, these two issues cease to be roadblocks to learning analytics research.
Weaknesses/Concerns/Comments: As learning materials/objects become commodities, what will be valued is the development of learning analytics tools that help guide and direct students, and this requires that institutions build expertise in developing and sustaining the communities required to conduct community-based learning research.

Database Storage

The majority of current learning analytics initiatives are handled adequately using relational databases. However, as learning analytics programs begin to make use of the semantic web and social media tools, there will be a need to explore data storage technology that can handle large unstructured data sets. This section provides a brief description of the data storage required for LA programs.

Tool: Relational Database
Description: For years we have used relational databases to structure the data required for our analyses. Data is stored in tables consisting of rows and columns; the columns are well-defined attributes pertaining to the object represented by the table.
Opportunities: There are good open source relational databases such as Greenplum and MySQL. However, most universities have standard supported RDBMS offerings. At the University of Guelph we support both SQL Server and Oracle's RDBMS. Oracle provides a secure repository for structured data. The recent release of Oracle 11g also integrates with the R engine, permitting R to access data stored in the database.
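Structured institutional data like this is the usual first input to a LA program. As a minimal sketch of that hand-off (not a Guelph-specific recipe), the R snippet below pulls activity counts from a relational store over ODBC; the DSN, credentials, table and column names are all hypothetical:

```r
# Minimal sketch: pull student activity data from an institutional RDBMS
# into R via ODBC. The DSN, credentials, table and columns are hypothetical.
library(RODBC)

conn <- odbcConnect("institutional_dsn", uid = "la_user", pwd = "secret")
activity <- sqlQuery(conn, "
  SELECT student_id, course_id, COUNT(*) AS logins
  FROM   lms_activity_log
  GROUP  BY student_id, course_id")
odbcClose(conn)

summary(activity$logins)  # quick sanity check on the extracted data
```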
Tool: NoSQL Database / Hadoop / MapReduce
Description: Hadoop is an Apache project inspired by Google's MapReduce and the Google File System. It has become a standard for distributing large unstructured data sets. It provides a framework that can distribute a large data set over a number of servers and can provide intermediate results as data flows through the framework's pipeline.
Opportunities: As learning analytics programs begin to make use of the semantic web and social media tools, there will be a need to explore data storage technology that can handle large unstructured data.
Weaknesses/Concerns/Comments: Universities have good relational database infrastructure, including expertise. As LA programs grow to include analysis of unstructured data, universities will need to develop the skill and capacity to offer Hadoop data storage and retrieval services.

Tool: EC2
Description: A number of companies lease access to processing via virtual servers. Amazon's EC2 is a common cloud server option for hosting applications.
Opportunities: It is becoming common for organizations to look at moving applications to the cloud. For many traditional services, like the RDBMS, there is resistance to cloud-based deployments, primarily due to privacy concerns and resistance to change. As LA programs come to require new technologies such as Hadoop, along with infrequent but massive analytical cycles, there may be an opportunity to introduce cloud-based offerings such as EC2.
Weaknesses/Concerns/Comments: The first assignment for this course (the development of a LA tool) gave me an opportunity to deploy an application using EC2. EC2 is a great way to explore new technologies: if mistakes are made, one simply redeploys a new EC2 instance, and there are many publicly available instances that save time in deploying complete environments. In developing my LA tool, I deployed an Oracle XE instance (which required virtually no effort) and a Red Hat instance where I installed RevoDeployR. Since RevoDeployR was a new tool for me, I had to start over several times before completing a successful installation. It is possible to create backup images in EC2, but doing so was not as intuitive as creating a new instance.

Data Cleansing/Integration

Prior to conducting data analysis and presenting it through visualizations, data must be acquired (extracted), integrated, cleansed and stored in an appropriate data structure. The tools that perform these tasks are commonly referred to as ETL tools. Given the need for both structured and unstructured data (as described in the section above), the ideal ETL tool will be able to load data to and from sources including RSS feeds, API calls, RDBMSs and unstructured data stores such as Hadoop.

Tool: Needlebase
Description: Needlebase is a web-based web-scraping tool with an easy-to-use interface for acquiring, integrating and cleansing web-based data. As a user navigates a website tagging page elements of interest, Needlebase detects the underlying database structure and site navigation and automates collection of the underlying data into a table.
Opportunities: Needlebase is a great tool for accessing a website's underlying data when the data is not otherwise easily accessible. I have used Needlebase to create a lookup table for archived National Occupation Codes and a lookup table for our undergraduate course calendar.
Weaknesses/Concerns/Comments: There is no API access to the Needlebase scripts that are created. It seems best for one-off extracts or for applications where the entire dataset is acquired using Needlebase tools; it does not seem all that useful for an integrated solution. One other restriction I ran across was that it did not support accessing websites requiring authentication.
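Where Needlebase cannot be used (for example, behind authentication or inside a scripted pipeline), the same table-extraction idea can be scripted directly. Below is a minimal sketch in R using the XML package; the URL and page layout are hypothetical:

```r
# Minimal sketch: scrape an HTML table into an R data frame, the kind of
# extraction Needlebase automates. The URL and page structure are hypothetical.
library(XML)

page   <- "http://www.example.edu/calendar/courses.html"
tables <- readHTMLTable(page, stringsAsFactors = FALSE)

courses <- tables[[1]]   # assume the first table is the course listing
head(courses)
```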
Tool: Pentaho Data Integration
Description: Pentaho Data Integration (PDI) is a powerful, easy-to-learn open source ETL tool that supports acquiring data from a variety of sources including flat files, relational databases, Hadoop stores, RSS feeds, and RESTful API calls. It can also cleanse and output data to the same list of destinations.
Opportunities: PDI is a versatile ETL tool that can grow with the evolution of an institution's learning analytics program. For example, a LA program may initially start with institutional data that is easily accessible via relational databases; as the program grows to include text mining and recommendation systems that require extracting unstructured data from outside the institution, the skills developed with PDI will accommodate the new sources of data collection and cleansing.
Weaknesses/Concerns/Comments: I have two concerns with PDI:
1. Pentaho has no built-in integration with R; its data mining integration focuses on a WEKA module.
2. Pentaho is moving away from the open source model. PDI was originally an open source ETL tool called Kettle, developed by Matt Casters. Since Pentaho acquired Kettle (and Matt Casters), it has become a central piece of their subscription-based BI suite, and support costs are growing rapidly. Twice I have budgeted for support on this product only to find that the costs had more than doubled year over year.

Tool: Talend
Description: Talend is another open source ETL tool with many of the same features as PDI. The main differences between PDI and Talend are presented in this blog post: http://churriwifi.wordpress.com/2010/06/01/comparing-talend-open-studio-and-pentaho-data-integration-kettle/
Opportunities: Talend has the same strengths as PDI, with the additional benefit of built-in integration with R.
Weaknesses/Concerns/Comments: The main difference, from my perspective, is that Talend is a code generator whereas PDI is not. I have also found PDI a much easier tool to learn and use.

Tool: Yahoo Pipes
Description: Yahoo provides this free web-based GUI tool that lets users extract web-based data and create data streams that cleanse, filter or enhance the data before outputting it via an RSS feed.
Opportunities: Since PDI and Talend seem able to provide the same abilities, I did not spend a great deal of time exploring Yahoo Pipes. However, it seems that Yahoo Pipes could provide the web-scraping functionality that Needlebase provides, yet offer an RSS feed output that could be picked up by either Talend or PDI for scheduled nightly loads. It might be a more efficient way to pass web-based data streams through various APIs prior to extraction with PDI.
Weaknesses/Concerns/Comments: My one concern with Yahoo Pipes is that some of the unstructured data requiring analysis in a LA system will be posts by students. If a free public service like Yahoo Pipes is used to stream that data through various analytic APIs, we could potentially release personal student data.
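As a sketch of the RSS hand-off described above, the snippet below reads a feed into R with the XML package; the feed URL and item structure are hypothetical, standing in for whatever a pipeline like this would publish:

```r
# Minimal sketch: consume an RSS feed (e.g. one published by a Yahoo Pipes
# pipeline) in R. The feed URL is hypothetical.
library(XML)

feed  <- xmlParse("http://pipes.example.com/la_feed.rss")
items <- data.frame(
  title = xpathSApply(feed, "//item/title", xmlValue),
  link  = xpathSApply(feed, "//item/link",  xmlValue),
  stringsAsFactors = FALSE)

head(items)
```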
Statistical Modeling

There are three major statistical software offerings: SAS, SPSS and R. All three are excellent for developing the analytic/predictive models used in learning analytics. This section focuses on R. The open source R project has numerous packages and commercial add-ons available that position it well to grow with any LA program. Given that many researchers are proficient in R, incorporating the R engine into a LA platform also offers an opportunity to engage faculty in the development of reusable models/algorithms.

Tool: R
Description: R is an active open source project with numerous packages available to perform virtually any type of statistical modeling.
Opportunities: R's strength is that it is widely used by the research community. Code for analyses is widely available, and there are many packages to support any type of analysis and presentation that might be of interest, including:
1) Visualization:
   a) ggplot provides good charting functionality.
   b) googleVis provides an interface between R and the Google Visualization API.
2) Text mining (see the sketch after this entry):
   a) tm provides functions for manipulating text, including stripping whitespace and stop words and removing suffixes (stemming).
   b) openNLP identifies words as nouns, verbs, adjectives or adverbs.
   c) wordnet provides access to the WordNet library; this is often used to replace similar words with a common word prior to text analysis.
Here are a few articles that show the power of these text mining packages:
1. Creating a word cloud using tm and ggplot: http://www.r-bloggers.com/building-a-better-word-cloud/
2. An overview of conducting text analysis using R: http://www.jstatsoft.org/v25/i05/paper
Oracle has also integrated R into its 11g RDBMS, allowing R models direct access to RDBMS data.
Weaknesses/Concerns/Comments: Although I really like R, two issues may concern some universities:
1) Lack of support: only Revolution R provides support for the R product.
2) The high level of expertise required: how does a university retain people with the skill needed to develop and maintain R/RevoDeployR? However, since many faculty and students are proficient with R, building a platform similar to Datameer (see below) might allow R code to be community-sourced, letting the majority of faculty and students easily access and build their own learning dashboards.
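As referenced in the text-mining item above, here is a minimal sketch of the tm pipeline it describes, run over two invented forum posts (the stemming step assumes the SnowballC package is installed):

```r
# Minimal sketch: clean forum posts with the tm package, then build a
# document-term matrix. The two posts are invented examples.
library(tm)

posts <- c("Students are posting their questions in the forum.",
           "The forum posts show students questioning the reading.")

corpus <- Corpus(VectorSource(posts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)   # requires the SnowballC package

dtm <- DocumentTermMatrix(corpus)
inspect(dtm)   # term counts per post, ready for modeling or word clouds
```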
Tool: Revolution R offerings, including RevoDeployR, RevoConnectR, integration with IBM Netezza, and rApache
Description: Revolution R provides support for the open source R engine and offers add-ons that enhance the integration and use of R within databases and websites. RevoDeployR is a server-based platform that provides access to the R engine via a RESTful API. RevoConnectR allows the R engine to use Hadoop-stored data. Revolution R also provides integration with IBM Netezza data warehouse appliances, giving a scalable infrastructure for analyzing very large datasets. rApache is an open source Apache module named mod_R that embeds the R statistical engine inside the web server.
Opportunities: Revolution R is the only commercial support offering for R. It will be useful for institutions whose procurement or risk management policies restrict the use of unsupported open source products.
Weaknesses/Concerns/Comments: The support I received using RevoDeployR was very slow; however, I am not a supported customer. Revolution R tools are free for research purposes, and their support contracts and licenses for institutional purposes (i.e. learning analytics and dashboards) are very reasonable. I was quoted $4,000/core for the RevoDeployR product.

Tool: Zementis ADAPA
Description: Zementis offers a PMML-based scoring engine that can be deployed on-site, within a Greenplum database, within an Excel spreadsheet, or consumed as a web service using Zementis' Amazon cloud-based service. By using the PMML (Predictive Model Markup Language) standard, ADAPA can easily leverage predictive models developed in the major statistical packages, including R, SAS and SPSS. It can quickly provide scoring based on any of the following modeling techniques:
- Support Vector Machines
- Naive Bayes Classifiers
- Ruleset Models
- Clustering Models
- Decision Trees
- Regression Models
- Scorecards
- Association Rules
- Neural Networks
Opportunities: ADAPA allows easy consumption of predictive scores into a student or faculty web-based learning dashboard. The cloud-based service, starting at only $0.99/hr, requires an investment of only about $2,000/semester.
Weaknesses/Concerns/Comments: I tried using the API to create a Purdue-like dashboard in my LA tool, but I did not have time to get it working properly. Zementis has partnered with Revolution R to build its web-based subscription service on RevoDeployR; so if RevoDeployR is part of your LA architecture, it could provide the same functionality using your in-house service.

Network Analysis

Network analysis focuses on the relationships between entities. Whether the entities are students, researchers, learning objects or ideas, network analysis attempts to understand how the entities are connected rather than the attributes of the entities. Measures include density, centrality, connectivity, betweenness and degree. This is an important area to explore as we take up Herbert Simon's challenge and nudge learning and teaching "from a solo sport to a community based research activity". Network analysis can help us identify patterns that flag disconnected students, and predict success based on network metrics; these tools can also help students develop the networking skills required for successful lifelong learning and research.

Tool: SNAPP
Description: Social Networks Adapting Pedagogical Practice (SNAPP) is a network visualization tool delivered as a bookmarklet. Users can easily create network visualizations from LMS forums in real time.
Opportunities: Self-Assessment Tool for Students. SNAPP gives students easy access to network visualizations of forum postings; these diagrams can help students understand their contribution to class discussions. Identify At-Risk Students / Monitor the Impact of Learning Activities. Network visualizations can help faculty identify students who may be isolated, and can show whether specific activities have affected the class network.

Tool: NodeXL
Description: NodeXL is an Excel add-on that creates network visualizations from a worksheet containing a list of edges. The tool can calculate common network measures such as density, centrality, connectivity, betweenness and degree. Data can be exported in a format that can be imported into Gephi for further analysis or refined visualization.
Opportunities: Sophisticated Network Analysis. Both NodeXL and Gephi can be used to explore network patterns. These tools are useful for researchers; it would be interesting to explore the relationship between these network metrics (e.g. centrality and betweenness) and student success (see the sketch following the Gephi entry).

Tool: Gephi
Description: Gephi is a standalone product for analyzing networks. It is the most advanced of the three network analysis tools described in this section.
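The measures these tools report can also be scripted. Here is a minimal sketch using R's igraph package (not one of the three tools above) on an invented forum-reply edge list:

```r
# Minimal sketch: compute the network measures discussed above from an
# invented who-replied-to-whom edge list.
library(igraph)

edges <- data.frame(
  from = c("ann", "bob", "cam", "ann", "dee"),
  to   = c("bob", "cam", "ann", "dee", "bob"))

g <- graph_from_data_frame(edges, directed = TRUE)

degree(g)        # replies touching each student; low values flag isolation
betweenness(g)   # students who bridge otherwise separate groups
edge_density(g)  # overall connectedness of the class network
```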
Tool: Cohere
Description: Simon Buckingham Shum and Anna De Liddo have developed an enhanced, Diigo-like tagging/bookmarking tool that allows users to link their contributions to other ideas and websites with descriptive connectives.
Opportunities: Idea Creation. While this tool provides its creators with data useful for their discourse analysis research, it also gives people/researchers a tool that may connect them to others with related interests and ideas, and may help stimulate new ideas and collaborations.

Other Tools for Analysis

Tool: ViralHeat
Description: ViralHeat provides a full-featured tool set and an API that monitors web content for specific mentions of people, products and services.
Opportunities: Monitor and Evaluate Course/Program Satisfaction. This relatively cheap analytics offering could help introduce the use of analytics by evaluating a recruitment drive/strategy or a fundraising campaign.

Tool: WordNet
Description: Princeton University provides this lexical database, which links English words (or sets of words) by their common meaning; it is essentially a database that helps identify synonyms.
Opportunities: Identify the Main Concepts Found in a Learning Object / Forum Post. This lexical database is used in text analysis to replace similar words with one common descriptor.

Tool: Leximancer
Description: Leximancer provides sophisticated text analysis and presentation of the concepts found in a learning object. The API can return interactive concept maps demonstrating how different ideas connect, and the tool lets users drill from the concept map down to the text that spawned it.
Opportunities: Identify the Main Concepts Found in a Learning Object / Forum Post. Leximancer could be used to consolidate the main ideas of a lecture or discussion group. It can also give students easy access to the detailed discussion and material related to a concept via a link from the concept map to the discussion forum posting.

Tool: Wolfram|Alpha API
Description: The Wolfram|Alpha API gives developers the ability to submit free text/questions from a website to the Wolfram|Alpha engine and have the results returned.
Opportunities: Dynamic Content Delivery. The Wolfram|Alpha API could be used to provide supplemental material to an online discussion.

Linked Data

If Tim Berners-Lee's vision of linked data (http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html) succeeds in transforming the internet into a huge database, the value of delivering content via courses and programs will diminish, and universities will need to find new ways of adding value to learning. Developing tools that facilitate access to relevant content using linked data could be one way for universities to remain relevant in the higher learning sector.

Tool: Ontologies (e.g. DBpedia) and OpenCalais
Description: Ontologies are essentially an agreed-upon concept map for a particular domain of knowledge. Reuters offers OpenCalais, a free API that takes text input and returns tags linking the concepts in the text to other linked data on the web.
Opportunities: Dynamically Deliver Relevant Content. Using OpenCalais along with well-defined ontologies provides a mechanism for dynamically delivering/suggesting related readings.

Visualization

The presentation of the data after it has been extracted, cleansed and analyzed is critical to successfully engaging students in learning from, and acting on, the information presented.

Tool: Google Visualization API (http://code.google.com/apis/chart/), Protovis (http://mbostock.github.com/protovis/), and D3 (http://mbostock.github.com/d3/)
Description: Google Visualization provides an API to the Google chart library, allowing the creation of charts and other visualizations; Google has recently released an API for adding interactive controls to its charts. Protovis and D3 are JavaScript frameworks for creating web-based visualizations. Protovis is no longer an active open source project; it has been replaced by D3.
Opportunities: Interactive Learning Dashboards. All of these tools are useful for creating visualizations for learning feedback systems such as dashboards, and all can present data as heat maps, network analysis diagrams and tree maps. The Motion Chart (purchased from Gapminder) is one of my favourite interactive charts that Google provides access to via its API.
Weaknesses/Concerns/Comments: Learning to use these tools/libraries requires a fair amount of effort, and developer retention is a risk for system maintenance and enhancement.
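Since the googleVis package noted in the Statistical Modeling section bridges R and the Google Visualization API, a Motion Chart like the one mentioned above can be served from R in a few lines. This sketch uses the demo Fruits data frame that ships with googleVis:

```r
# Minimal sketch: render a Google Motion Chart from R via googleVis,
# using the package's bundled Fruits demo data frame.
library(googleVis)

chart <- gvisMotionChart(Fruits, idvar = "Fruit", timevar = "Year")
plot(chart)   # opens the interactive chart in a browser
```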
Tool: FusionCharts (http://www.fusioncharts.com/)
Description: FusionCharts provides a commercial JavaScript framework for creating dynamic visualizations.

Here is a link to an example dashboard created in D3, presenting university admissions data: http://keminglabs.com/ukuni/

Tool: Reporting Suites
Description: Many universities have reporting tools available for creating visualizations, including Tableau, Cognos, Pentaho and JasperReports.
Opportunities: All of these vendors provide good tools for creating reports and dashboards.
Weaknesses/Concerns/Comments: My favourite is Tableau; however, JasperReports and Pentaho are much more affordable.

Full Analytics Offerings

Tool: LOCO
Description: Using LOCO (Learning Object Context Ontologies), students' online activities are mapped to specific learning objectives. The tool set gives faculty feedback on how well material has been understood, and it provides network visualizations describing student interaction. The tool provides a framework for describing online learning environments.
Opportunities: Faculty Feedback Related to Learning Success.

Tool: Datameer (http://www.datameer.com/)
Description: Datameer provides a full set of tools allowing users to conduct advanced analytics on Hadoop-based data.
Opportunities: Engage Faculty in Learning Analytics. I like Datameer's wizard-based approach to user-controlled analytics. It suggests how one could give faculty the ability to contribute or reuse predictive models, quickly test them against historic data, deploy a learning analytics algorithm, and present the results in a learning dashboard (see the sketch below).
Weaknesses/Concerns/Comments: This approach may be too complicated for delivery to the masses, as I suspect the majority of faculty will want something that requires less effort.
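As a sketch of the kind of reusable predictive model faculty might contribute through such a platform, the snippet below fits a logistic regression flagging at-risk students from activity counts; all the data is invented. In principle a model like this could also be exported as PMML (e.g. via R's pmml package) for a scoring engine such as ADAPA.

```r
# Minimal sketch: a community-sourced predictive model in base R.
# Logistic regression of course success on activity counts; the data
# frame and its values are invented for illustration.
students <- data.frame(
  logins = c(2, 15, 4, 22, 1, 18, 7, 30, 12, 6),
  posts  = c(0, 6, 1, 9, 0, 5, 2, 11, 1, 3),
  passed = c(0, 1, 0, 1, 0, 1, 1, 1, 0, 0))

model <- glm(passed ~ logins + posts, data = students, family = binomial)

# Score a new student: estimated probability of passing given activity so far.
predict(model, newdata = data.frame(logins = 5, posts = 1), type = "response")
```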