Learning Analytics: Tool Matrix

(Adapted from David Dornan)
Columns: Tool (URL); Description; Opportunities in Learning Analytic Solutions; Weaknesses/Concerns/Comments
Data Cleansing/Integration
“Search and Replace” function
http://blog.visual.ly/cleaning-data-sets/
Regular Expression Engines or Processors (Regex, Regexp)
Description (search and replace), quote from blog post: “The most basic data cleansing tool is find and replace in any text editor. With some carefully crafted replacements, it is possible to get data fairly clean and into a good format. Look for patterns and repetition in a file. Work around the parts that don’t repeat, and use the existing structure of the data to your advantage.”
Description (regular expression engines): A regular expression processor is similar to the search and replace function. However, it is designed to recognize expressions that contain different characters but, within the context of a data set, represent the same concept.
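A minimal sketch of this idea in Python, borrowing the composer-name variants ("Handel", "Händel", "Haendel") that this matrix cites from Wikipedia. The pattern shown is one possible way to express the equivalence, and the function name and sample records are assumptions made for illustration only.

```python
import re

# One possible pattern: "a", "ä" or "ae" between "H" and "ndel" are treated
# as the same spelling. The analyst has to know these variants in advance.
HANDEL = re.compile(r"\bH(?:ä|ae?)ndel\b")

def normalize_composer(text: str, canonical: str = "Handel") -> str:
    """Replace every recognized spelling variant with one canonical form."""
    return HANDEL.sub(canonical, text)

# Hypothetical records, for illustration only.
records = ["Handel, Messiah", "Händel, Water Music", "Haendel, Xerxes"]
cleaned = [normalize_composer(r) for r in records]
print(cleaned)
```

Spellings the pattern does not anticipate, such as a misspelling, pass through unchanged, so the result is only as good as the list of variants known beforehand.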
Opportunities (search and replace): This solution is interesting in its simplicity. It would not be suitable for big-data sets or highly complex data, but for someone just starting to explore data cleansing it provides a very accessible tool to look for patterns in data that repeat and, by extension, data that is anomalous or isolated.
Weaknesses/Concerns/Comments (search and replace): Using search and replace tools creates a relatively manual process and the interpretation of the data may be prone to human error.
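The find-and-replace approach can also be scripted, which makes the same manual replacements repeatable. The sketch below is only an illustration: the replacement table and the sample lines are hypothetical and stand in for whatever patterns an analyst actually spots in a file.

```python
# Hypothetical replacement table: each key is a literal string noticed in the
# raw file; each value is the cleaned form. Order matters, top to bottom.
REPLACEMENTS = {
    "N/A": "",
    "n/a": "",
    " ;": ";",
    "; ": ";",
}

def find_and_replace(line: str) -> str:
    """Apply each literal replacement in turn, then trim outer whitespace."""
    for old, new in REPLACEMENTS.items():
        line = line.replace(old, new)
    return line.strip()

# Hypothetical semicolon-delimited lines with inconsistent spacing and blanks.
raw_lines = ["Smith ; N/A; 72", "Jones; 3.5 ;n/a"]
cleaned = [find_and_replace(line) for line in raw_lines]
print(cleaned)
```

Because every replacement is a literal string chosen by a person, the human-error concern noted above still applies; the script only guarantees that the same choices are applied consistently.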
http://en.wikipedia.org/wiki/Regular_expression#Syntax
http://www.regularexpressions.info/tutorial.html
Description (regular expression engines, continued): Example from Wikipedia page: a regex engine programmed to identify “an”, “än” and “aen” as the same would recognize these data elements as the same: "Handel", "Händel", and "Haendel".
The different syntaxes and behaviours of regex engines are called “flavors”, including Perl, PCRE, PHP, .NET, Java, JavaScript, XRegExp, VBScript, Python, Ruby, Delphi, R, Tcl and POSIX.
Opportunities (regular expression engines): Regex engines would be an important component of the functionality of any data cleansing application. Data elements that contain only superficial differences should not be categorized as separate elements in the data cleansing analysis. If this were done, potentially important patterns and trends might be missed.
Weaknesses/Concerns/Comments (regular expression engines): One weakness or issue with regex functionality is knowing in advance which patterns within the data are, in fact, equivalent. In the Handel example in the Description column, the data analysts would need to know that the composer’s name had these three formats depending on the country or language that the name was originally written in.
Another area of concern might be misspelled or corrupted data. To continue the example with the composer’s name, I’m not sure how Hamdel or Handell might be captured by a regex engine. Perhaps, much like spell check, the engine might identify them as errors and then the data analysis process could opt to correct the data.

IBM InfoSphere Information Server
http://www01.ibm.com/software/data/integration/info_server/
Description: Platform includes a suite of products that includes data cleansing and integration functions. The key functionality is collecting data from a diverse range of sources and integrating it into useful reporting services.
The website shows the following data management infrastructure: Information Governance allows for data transformation and delivery to Data Integration, which allows for data cleansing and monitoring, which creates Data Quality, which supports understanding and collaboration.
Some useful terminology that IBM uses that is likely to be common across other platforms:
-Master Data Management & Warehousing
-Massive Parallel Processing (MPP)
-Extract, Transform, Load (ETL)
-Extract, Load, Transform (ELT)
-Performance
ETL and ELT refer to different configurations of integrated systems and where and when each of these three functions occurs. Previously each function was handled separately but as platforms become more robust in the era of big data, data can be transformed by the same systems that produce the reporting.
http://blog.performancearchitects.com/wp/2013/06/13/etl-vs-eltwhats-the-difference/
Opportunities: The issue of “trust” figures prominently in the promotional material for this platform. The ability to trust the data that is collected, processed and distributed through the platform is a critical success factor.
Weaknesses/Concerns/Comments: Based on my work on several database projects, specifically around learning content management, one of the key learning points here is the importance of a data governance model as a foundation for a data analytics program. Unless you define data models, data sources and business goals, the analytics process becomes too chaotic to achieve any positive outcomes for the organization.
Despite the lofty claims made by all software providers, the client needs to take ownership of the governance process (aided possibly by the vendor) before the software will contribute to positive business outcomes.
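To make the ETL/ELT distinction in the IBM InfoSphere entry more concrete, here is a small sketch using Python's built-in sqlite3 module as a stand-in for a data warehouse. It is not how InfoSphere works internally; the table names, sample rows and cleaning rules are all assumptions chosen for illustration. The only point is where the transform step happens.

```python
import sqlite3

# Hypothetical raw source rows: (student name, grade as text).
RAW_ROWS = [(" alice ", "72"), ("BOB", "85"), ("carol ", "n/a")]

def clean(name, grade):
    """Transform step in Python: tidy the name, turn the grade into int or None."""
    return name.strip().lower(), (int(grade) if grade.isdigit() else None)

def run_etl(rows):
    """Extract, Transform (before loading), Load: only clean rows reach the database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE grades (name TEXT, grade INTEGER)")
    db.executemany("INSERT INTO grades VALUES (?, ?)", [clean(n, g) for n, g in rows])
    return db.execute("SELECT name, grade FROM grades ORDER BY name").fetchall()

def run_elt(rows):
    """Extract, Load (raw rows as-is), Transform: cleaning runs inside the database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE raw_grades (name TEXT, grade TEXT)")
    db.executemany("INSERT INTO raw_grades VALUES (?, ?)", rows)
    db.execute(
        "CREATE TABLE grades AS "
        "SELECT LOWER(TRIM(name)) AS name, "
        "CASE WHEN grade <> '' AND grade NOT GLOB '*[^0-9]*' "
        "THEN CAST(grade AS INTEGER) END AS grade "
        "FROM raw_grades"
    )
    return db.execute("SELECT name, grade FROM grades ORDER BY name").fetchall()

print(run_etl(RAW_ROWS))
print(run_elt(RAW_ROWS))
```

Both functions end with the same cleaned table. In the ELT variant the raw rows are also kept in the database, which is one reason that ordering can suit platforms where the same system stores, transforms and reports on the data.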
Statistical Modeling
The top 10 tools by share of users were (June 7, 2014):
1. RapidMiner, 44.2% share (39.2% in 2013)
2. R, 38.5% (37.4% in 2013)
3. Excel, 25.8% (28.0% in 2013)
4. SQL, 25.3% (n/a in 2013)
5. Python, 19.5% (13.3% in 2013)
6. Weka, 17.0% (14.3% in 2013)
7. KNIME, 15.0% (5.9% in 2013)
8. Hadoop, 12.7% (9.3% in 2013)
9. SAS base, 10.9% (10.7% in 2013)
10. Microsoft SQL Server, 10.5% (7.0% in 2013)
http://www.kdnuggets.com/2014/06/kdnuggets-annual-software-poll-rapidminer-continues-lead.html
GARTNER Magic Quadrant for Advanced Analytics Software Providers
http://rapidminer.com/resource/leader-gartners-magic-quadrant-advanced-analytics/
RapidMiner
http://rapidminer.com/
Description: RapidMiner provides a GUI to design an analytical pipeline (the "operator tree" in RapidMiner parlance). The GUI generates an XML (eXtensible Markup Language) file that defines the analytical processes the user wishes to apply to the data. This file is then read by RapidMiner to run the analyses automatically. While these are running, the GUI can also be used to interactively control and inspect running processes.
http://medblog.stanford.edu/lanefaq/archives/2009/05/what_is_rapidmi.html
Opportunities:
Churn Prevention/Student Retention: Know in advance which students are likely to drop out or leave after one academic cycle.
Learner Segmentation: Identify common characteristics of student segments and develop/design programs and services to meet their specific needs.
Next Best Action: Deliver the most suitable learning module to the learner depending on their previous learning activity.
Weaknesses/Concerns/Comments: Based on market share data, RapidMiner appears to be a widely adopted application. Organizations leveraging Mondrian/R would need to have an overall data strategy and either in-house staff or contracted consultants to manage the data-mining initiatives.
Visualization
The presentation of the data after it has been extracted, cleansed and analyzed is critical to successfully engaging students in learning and acting on the information that is presented.
Mondrian
http://www.theusrus.de/Mondrian/
Description: Mondrian is a data-visualization system that integrates with R.
Primary Features/Strengths:
-Interactive visualizations
-Categorical data
-Geographical data
-Large Data
-Histograms, Boxplots, Scatterplots, Barcharts, Mosaicplots, Missing Value Plots, Parallel Co-ordinates, SPLOMs.
Opportunities: See examples below.
Weaknesses/Concerns/Comments: Organizations leveraging Mondrian/R would need to have an overall data strategy and either in-house staff or contracted consultants to manage the data-mining initiatives.
Some illustrations of different data visualization formats available from Mondrian, with examples of how they could be used in a learning analytics context. Note that the illustrations are generic and not representative of the examples.
Histogram:
Example: Minutes spent online per day on the LMS (x axis) and submissions posted on course discussion boards (y axis).
Boxplot:
Boxplots show the median (second quartile) of a data set, together with the first and third quartiles and the minimum and maximum of all the data (shown by the vertical lines or “whiskers”).
Example: Median grades and max/min grades across geographically dispersed schools/learner groups.
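The values a boxplot draws can be computed directly. The sketch below uses Python's standard statistics module. The school names and grades are hypothetical, and the "inclusive" quartile method is only one of several conventions, so other tools may report slightly different quartiles for the same data.

```python
import statistics

def five_number_summary(values):
    """Return the five values a basic boxplot displays."""
    # quantiles(..., n=4) returns the three cut points: Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}

# Hypothetical grades for two learner groups.
grades_by_school = {
    "School A": [55, 62, 70, 74, 81, 88, 93],
    "School B": [48, 60, 65, 71, 79],
}

for school, grades in grades_by_school.items():
    print(school, five_number_summary(grades))
```

Comparing these summaries side by side is what the boxplot does visually: the box spans the first to third quartile, the line inside it marks the median, and the whiskers reach to the minimum and maximum.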
Scatterplot:
Example: Final course grade (y axis) correlated with hours spent on specific study and review activities (x axis).
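A scatterplot shows such a relationship visually; a correlation coefficient puts a number on it. The sketch below computes Pearson's r by hand in Python. The study hours and grades are invented for illustration, and a strong correlation in data like this would not by itself show that study time causes higher grades.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: hours of study and review (x) and final grade (y).
study_hours = [2, 4, 6, 8, 10]
final_grades = [55, 60, 70, 72, 85]

print(round(pearson_r(study_hours, final_grades), 3))
```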
Bar Chart
Example: Student enrolments by program area.
Note: A bar chart differs from a histogram in that the bars of a histogram represent ranges of values, whereas the bars of a bar chart represent discrete groups or categories.
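The difference shows up in how the counts behind each chart are prepared. In the sketch below, the function names, the 15-minute bin width and the sample data are assumptions for illustration: histogram counts come from grouping numeric values into ranges, while bar chart counts come from tallying category labels as they are.

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Count whole-number values per range [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return {f"{start}-{start + bin_width - 1}": counts[start] for start in sorted(counts)}

def bar_chart_counts(labels):
    """Count occurrences of each discrete category label."""
    return dict(Counter(labels))

# Hypothetical data: minutes online per day, and program area per enrolment.
minutes_online = [5, 12, 18, 25, 31, 33, 47]
programs = ["Arts", "Science", "Arts", "Business", "Science", "Arts"]

print(histogram_counts(minutes_online, 15))
print(bar_chart_counts(programs))
```

Changing the bin width changes the shape of the histogram, whereas the bar chart's groups are fixed by the categories in the data.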
Mosaic Plot
Example: Enrolments in various university faculties (y axis) from 1910 to 2010 (x axis).
Missing Value Plot
Example: Daily access to online learning platform mapped against defined range of lifestyle behaviours. Missing Data: Frequency of use of
recreational drugs.
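Before plotting, the figures behind a missing value plot can be tallied per field. The sketch below is a plain-Python illustration: the field names and survey records are hypothetical, and treating None and the empty string as "missing" is an assumption that would need to match how the real data encodes non-responses.

```python
def missing_value_counts(records, fields):
    """Count, for each field, how many records have no usable value."""
    return {
        field: sum(1 for record in records if record.get(field) in (None, ""))
        for field in fields
    }

# Hypothetical survey records; some respondents skipped the sensitive question.
records = [
    {"daily_logins": 3, "sleep_hours": 7, "substance_use": None},
    {"daily_logins": 1, "sleep_hours": "", "substance_use": None},
    {"daily_logins": 5, "sleep_hours": 6, "substance_use": "never"},
]
fields = ["daily_logins", "sleep_hours", "substance_use"]

print(missing_value_counts(records, fields))
```

A field with many missing answers, as in the example above, is worth flagging before it is used in any further analysis.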
Parallel Coordinates:
Example: Vertical lines: range of lifestyle and demographic characteristics or behaviours of students. Horizontal “poly-lines” represent final
grades for a particular academic program.
SPLOM (Scatterplot Matrix)
Example: Income, debt levels and age of students enrolled in an academic program.