Learning Analytics: Tool Matrix

Columns: Tool (URL); Description (Adapted from David Dornan); Opportunities in Learning Analytic Solutions; Weaknesses/Concerns/Comments

Data Cleansing/Integration

Tool: "Search and Replace" function
URL: http://blog.visual.ly/cleaning-data-sets/
Description: Quote from blog post: "The most basic data cleansing tool is find and replace in any text editor. With some carefully crafted replacements, it is possible to get data fairly clean and into a good format. Look for patterns and repetition in a file. Work around the parts that don't repeat, and use the existing structure of the data to your advantage."
Opportunities: This solution is interesting in its simplicity. It would not be suitable for big-data sets or highly complex data, but for someone just starting to explore data cleansing it provides a very accessible tool to look for patterns in data that repeat and, by extension, data that is anomalous or isolated.
Weaknesses/Concerns/Comments: Using search and replace tools creates a relatively manual process, and the interpretation of the data may be prone to human error.

Tool: Regular Expression Engines or Processors (Regex, Regexp)
URLs: http://en.wikipedia.org/wiki/Regular_expression#Syntax
http://www.regularexpressions.info/tutorial.html
Description: A regular expression processor is similar to the search and replace function, but it is designed to recognize expressions that contain different characters yet, within the context of a data set, represent the same concept. Example from the Wikipedia page: a regex engine programmed to treat "an", "än" and "aen" as the same would recognize "Handel", "Händel", and "Haendel" as the same data element. The differing syntax and behaviour of regex engines are called "flavors"; these include Perl, PCRE, PHP, .NET, Java, JavaScript, XRegExp, VBScript, Python, Ruby, Delphi, R, Tcl, and POSIX.
Opportunities: Regex engines would be an important component of the functionality of any data cleansing application. Data elements that contain only superficial differences should not be categorized as separate elements in the data cleansing analysis. If this were done, potentially important patterns and trends might be missed.
Weaknesses/Concerns/Comments: One weakness or issue with regex functionality is knowing in advance which patterns within the data are, in fact, equivalent. In the example in column 2, the data analysts would need to know that the composer's name had these three formats depending on the country or language in which the name was originally written. Another area of concern might be misspelled or corrupted data. To continue the example with the composer's name, I'm not sure how Hamdel or Handell might be captured by a regex engine. Perhaps, much like spell check, the engine might identify them as errors and the data analysis process could then opt to correct the data.

Tool: IBM InfoSphere Information Server
URL: http://www01.ibm.com/software/data/integration/info_server/
Description: The platform is a suite of products that includes data cleansing and integration functions. The key functionality is collecting data from a diverse range of sources and integrating it into useful reporting services. The website shows the following data management infrastructure: Information Governance allows for data transformation and delivery to Data Integration, which allows for data cleansing and monitoring, which creates Data Quality, which supports understanding and collaboration.
Some useful terminology that IBM uses and that is likely to be common across other platforms:
- Master Data Management & Warehousing
- Massive Parallel Processing (MPP)
- Extract, Transform, Load (ETL)
- Extract, Load, Transform (ELT)
- Performance
ETL and ELT refer to different configurations of integrated systems and to where and when each of these three functions occurs. Previously each function was handled separately, but as platforms become more robust in the era of big data, data can be transformed by the same systems that produce the reporting. (Source: http://blog.performancearchitects.com/wp/2013/06/13/etl-vs-eltwhats-the-difference/)
The issue of "trust" figures prominently in the promotional material for this platform. The ability to trust the data that is collected, processed and distributed through the platform is a critical success factor.
Weaknesses/Concerns/Comments: Having worked on several database projects, specifically around learning content management, one of the key learning points here is the importance of a data governance model as a foundation for a data analytics program. Unless you define data models, data sources and business goals, the analytics process becomes too chaotic to achieve any positive outcomes for the organization. Despite the lofty claims made by all software providers, the client needs to take ownership of the governance process, aided possibly by the vendor, before the software will contribute to positive business outcomes.

Statistical Modeling

The top 10 tools by share of users were (June 7, 2014):
1. RapidMiner, 44.2% share (39.2% in 2013)
2. R, 38.5% (37.4% in 2013)
3. Excel, 25.8% (28.0% in 2013)
4. SQL, 25.3% (na in 2013)
5. Python, 19.5% (13.3% in 2013)
6. Weka, 17.0% (14.3% in 2013)
7. KNIME, 15.0% (5.9% in 2013)
8. Hadoop, 12.7% (9.3% in 2013)
9. SAS base, 10.9% (10.7% in 2013)
10. Microsoft SQL Server, 10.5% (7.0% in 2013)
Source: http://www.kdnuggets.com/2014/06/kdnuggets-annual-software-poll-rapidminer-continues-lead.html

GARTNER Magic Quadrant for Advanced Analytics Software Providers
Source: http://rapidminer.com/resource/leader-gartners-magic-quadrant-advanced-analytics/

Tool: RapidMiner
URL: http://rapidminer.com/
Description: RapidMiner provides a GUI to design an analytical pipeline (the "operator tree" in RapidMiner parlance). The GUI generates an XML (eXtensible Markup Language) file that defines the analytical processes the user wishes to apply to the data. This file is then read by RapidMiner to run the analyses automatically. While these are running, the GUI can also be used to interactively control and inspect running processes. (Source: http://medblog.stanford.edu/lanefaq/archives/2009/05/what_is_rapidmi.html)
Opportunities:
- Churn Prevention/Student Retention: Know in advance which students are likely to drop out or leave after one academic cycle.
- Learner Segmentation: Identify common characteristics of student segments and develop/design programs and services to meet their specific needs.
- Next Best Action: Deliver the most suitable learning module to the learner depending on their previous learning activity.
Weaknesses/Concerns/Comments: Based on market-share data, RapidMiner appears to be a widely adopted application. Organizations leveraging RapidMiner would need to have an overall data strategy and either in-house staff or contracted consultants to manage the data-mining initiatives.

Visualization

The presentation of the data after it has been extracted, cleansed and analyzed is critical to successfully engaging students in learning and in acting on the information that is presented.

Tool: Mondrian
URL: http://www.theusrus.de/Mondrian/
Description: Mondrian is a data-visualization system that integrates with R.
Primary Features/Strengths:
- Interactive visualizations
- Categorical data
- Geographical data
- Large data
- Histograms, Boxplots, Scatterplots, Barcharts, Mosaicplots, Missing Value Plots, Parallel Co-ordinates, SPLOMs
Opportunities: See examples below.
Weaknesses/Concerns/Comments: Organizations leveraging Mondrian/R would need to have an overall data strategy and either in-house staff or contracted consultants to manage the data-mining initiatives.

Some illustrations of the data visualization formats available from Mondrian follow, with examples of how they could be used in a learning analytics context. Note that the illustrations are generic and not representative of the examples.
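
As an aside on the Data Cleansing/Integration rows: the regex behaviour described there, where "Handel", "Händel" and "Haendel" count as one data element, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a description of any particular cleansing product: the pattern, the function name `normalise_composer` and the sample records are all hypothetical, and it assumes each record holds the surname alone.

```python
import re
import unicodedata

# Hypothetical pattern: "a", "ä" and "ae" are treated as interchangeable
# in the second position of the surname, as in the Wikipedia example.
HANDEL = re.compile(r"H(?:a|ä|ae)ndel")

def normalise_composer(value: str, canonical: str = "Handel") -> str:
    """Return the canonical spelling if the whole value is a known variant."""
    # NFC normalisation turns "a" + combining diaeresis into the single
    # character "ä", so both encodings of the umlaut are matched.
    value = unicodedata.normalize("NFC", value)
    if HANDEL.fullmatch(value):
        return canonical
    # Anything else (for example a misspelling) is left untouched.
    return value

records = ["Handel", "Händel", "Haendel", "Hamdel", "Handell"]
cleaned = [normalise_composer(r) for r in records]
print(cleaned)
```

The three known variants collapse to one spelling, while "Hamdel" and "Handell" pass through unchanged. That illustrates the concern raised in the comments column: a regex only captures the equivalences the analyst already knows to write down, so misspellings need a separate step such as manual review or fuzzy matching.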
Histogram:
Example: Minutes spent online per day on the LMS (x axis) and submissions posted on course discussion boards (y axis).

Boxplot:
Boxplots show the median (second quartile) of a data set, along with the first and third quartiles and the minimum and maximum of all the data (shown by the vertical lines, or "whiskers").
Example: Median grades and max/min grades across geographically dispersed schools/learner groups.

Scatterplot:
Example: Final course grade (y axis) plotted against hours spent on specific study and review activities (x axis).

Bar Chart:
Example: Student enrolments by program area.
Note: A bar chart differs from a histogram in that the bars of a histogram represent ranges of data values, whereas the bars of a bar chart represent discrete groups or categories.

Mosaic Plot:
Example: Enrolments in various university faculties (y axis) from 1910 to 2010 (x axis).

Missing Value Plot:
Example: Daily access to online learning platform mapped against a defined range of lifestyle behaviours. Missing data: frequency of use of recreational drugs.

Parallel Coordinates:
Example: Vertical lines represent a range of lifestyle and demographic characteristics or behaviours of students. Horizontal "poly-lines" represent final grades for a particular academic program.

SPLOM (Scatterplot Matrix):
Example: Income, debt levels and age of students enrolled in an academic program.
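
To make the boxplot description above concrete, the five values it draws (minimum, first quartile, median, third quartile, maximum) can be computed directly. The following is a minimal sketch using only the Python standard library; the function name `five_number_summary` and the grades are made up for illustration, and the "inclusive" quartile method is one of several conventions, so other tools (Mondrian or R included) may report slightly different quartiles for the same data.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a basic boxplot draws."""
    # quantiles(..., n=4) returns the three cut points that split the data
    # into quarters: first quartile, median, third quartile.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Hypothetical final grades for one school or learner group.
grades = [52, 61, 67, 70, 74, 78, 83, 88, 95]

summary = five_number_summary(grades)
print(summary)
```

Running the same function over the grades of each school or learner group gives the numbers behind a side-by-side boxplot comparison like the one in the example. Note that many plotting tools draw whiskers at 1.5 times the interquartile range and mark points beyond that as outliers, rather than extending the whiskers to the absolute min/max as assumed here.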