D2-W5D Data Mining: Looking for Patterns in Data

Data Mining
Looking For Patterns in Data
This Presentation
 Why Data Mining?
 What is Data Mining?
 What’s happening in K-12 Schools
 The Data Mining Process
 Prepare your data for mining
 Data Mining Tools?
 Mining SCEGGS Data
Why Data Mining?
 Drowning in data…
 Schools store lots of data about students’ performance
 SCEGGS:
 10,000 records per reporting period per school year
 2 reporting periods per year since 1999
 160,000 records
 A gold mine: hidden information that has not yet been put to use?
My Shovel:
 Well written and easy to read
 Lots of examples
 As technical as you want to be
"If you have data that you want to analyze
and understand, this book and the
associated Weka toolkit are an excellent
way to start."
-Jim Gray, Microsoft Research
Ethical Issues
Important questions:
 Who is permitted access to the data?
 For what purpose was the data collected?

 What kind of conclusions can be legitimately drawn from it?
● Caveats must be attached to results
● Purely statistical arguments are never sufficient!
● Are resources put to good use?
What is Data Mining?
 Looking for patterns in data
 Data is stored electronically
 Search is augmented by computer
 Meaningful patterns may be used for prediction
 “Intelligently analysed data is a valuable resource. It can lead to
new insights…” (Witten & Frank)
 Data Mining is about solving problems by analysing data
already in databases
An example [3]
Classroom A
After “Mining”
(With cheating algorithm applied)
1. 112a4a342cb214d0001acd24a3a12dadbcb4a0000000
2. 1b2a34d4ac42d23b141acd24a3a12dadbcb4a2134141
3. db2abad1acbdda212b1acd24a3a12dadbcb400000000
4. d43a3a24acb1d32b412acd24a3a12dadbcb422143bc0
5. d43ab4d1ac3dd43421240d24a3a12dadbcb400000000
6. 1142340c2cbddadb4b1acd24a3a12dadbcb43d133bc4
7. dba2ba21ac3d2ad3c4c4cd40a3a12dadbcb400000000
8. 144a3adc4cbddadbcbc2c2cca3a12dadbcb4211ab343
9. 3b3ab4d14c3d2ad4cbcac1c003a12dadbcb4adb40000
10.d43aba3cacbddadbcbca42c2a3212dadbcb42344b3cb
11.214ab4dc4cbdd31b1b2213c4ad412dadbcb4adb00000
12.313a3ad1ac3d2a23431223c000012dadbcb400000000
13.d4aab2124cbddadbcb1a42cca3412dadbcb423134bc1
14.dbaab3dcacb1dadbc42ac2cc31012dadbcb4adb40000
15.db223a24acb11a3b24cacd12a241cdadbcb4adb4b300
16.d122ba2cacbd1a13211a2d02a2412d0dbcb4adb4b3c0
17.1423b4d4a23d24131413234123a243a2413a21441343
18.db4abadcacb1dad3141ac212a3a1c3a144ba2db41b43
19.db2a33dcacbd32d313c21142323cc300000000000000
20.1b33b4d4a2b1dadbc3ca22c000000000000000000000
21.d12443d43232d32323c213c22d2c23234c332db4b300
22.d4a2341cacbddad3142a2344a2ac23421c00adb4b3cb
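The block of identical answers running through the middle of these strings can also be found mechanically. The sketch below only illustrates the idea and is not the algorithm described in Freakonomics: it slides a fixed-width window across the answer strings and reports every window in which a large share of the class gave exactly the same run of answers. The function name `shared_answer_blocks`, the window width and the threshold are assumptions made for this example. The six strings are the first six rows of the slide, with blank answers written as the digit 0.

```python
from collections import Counter

# First six answer strings from the slide; 0 is taken to mean a blank answer.
CLASSROOM_A = [
    "112a4a342cb214d0001acd24a3a12dadbcb4a0000000",
    "1b2a34d4ac42d23b141acd24a3a12dadbcb4a2134141",
    "db2abad1acbdda212b1acd24a3a12dadbcb400000000",
    "d43a3a24acb1d32b412acd24a3a12dadbcb422143bc0",
    "d43ab4d1ac3dd43421240d24a3a12dadbcb400000000",
    "1142340c2cbddadb4b1acd24a3a12dadbcb43d133bc4",
]


def shared_answer_blocks(answer_strings, width=12, min_share=0.8):
    """Return (start, block, count) for each window where at least
    min_share of the students gave exactly the same run of answers."""
    shortest = min(len(s) for s in answer_strings)
    hits = []
    for start in range(shortest - width + 1):
        windows = Counter(s[start:start + width] for s in answer_strings)
        block, count = windows.most_common(1)[0]
        # Skip runs made up only of blanks: unanswered questions are not evidence.
        if set(block) != {"0"} and count / len(answer_strings) >= min_share:
            hits.append((start, block, count))
    return hits


for start, block, count in shared_answer_blocks(CLASSROOM_A):
    first, last = start + 1, start + len(block)
    print(f"questions {first}-{last}: {block} shared by {count} students")
```

A shared block is only a prompt for a closer look. As the Ethical Issues slide notes, a purely statistical argument is never sufficient on its own.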
Data Mining Compared with Statistics [1]
Statistics:
1. Computing skills required to manage the data and the analysis.
2. An understanding of design of data collection issues.
3. An understanding of statistical inferential issues.
4. A knowledge of relevant mathematics.
5. Insights from practical data analysis.
6. Application area insights.
7. Automation of data analysis.
Data Mining Compared with Statistics [1]
Data Mining:
1. Computing skills required to manage the data and the analysis.

An understanding of design of data collection issues.

An understanding of statistical inferential issues.

A knowledge of relevant mathematics.

Insights from practical data analysis.

Application area insights.
2. Automation of data analysis.
Data Mining in Schools
 Google scholar search: ‘data mining’ & K-12 & education –
513 hits
 Jamie McKenzie, FNO articles, 1 relevant in 2000: “[technology
will give us]… the ability to enable decision makers to collect and analyze data
more effectively, more frequently, and more easily”
http://www.fno.org/sept00/data.html accessed 28/5/2006
Data Mining in Schools - SAS [6] promises, promises:

 “Decisions based on data and a culture of evidence will enable us to sustain effective processes to meet the
needs of our large and diverse student population,” explains Kelly. “We expect to provide our staff and students
with easy access to information using our business intelligence solution.” http://www.sas.com/success/accd.html
accessed 28/5/06
 Improve student test scores and graduation rates.
 Better prepare students for college and the work force.
 Recruit and retain the best teachers.
 Ensure that students are performing at their highest level and determine which programs help individual students
make progress.
 Find online curriculum that meets state standards and makes learning more profound.
 Develop and manage a complex budget.
 Provide thousands of students with safe and efficient bus transportation, nutritious food, up-to-date libraries,
cutting-edge technology resources and secure campus facilities.
http://www.sas.com/govedu/edu/k12/index.html accessed 28/5/06
Data Mining in Schools-Broward County (Fla.) Public
Schools – the 1999 promise:
 Although the district is still just beginning to scratch the surface
of what's possible, the benefits of the data warehouse were
obvious from the start.
 “Schools have reported finding patterns of absenteeism. Now,
we can intervene sooner.” http://www.electronic-school.com/199909/0999f1.html

 If you're going to be using the data, you have much more of an
investment in making sure the data is correct.
 One immediate benefit of the system is the ability to automate
the generation of mandated state and federal reports

http://www.electronic-school.com/199909/0999f1.html accessed 28/5/06
Data Mining in Schools-Broward County (Fla.) Public
Schools – the reality:
 2002 - "On the first day of our training sessions, we
sometimes have to say, 'This is a mouse, you push this
button, and if you hold it down, you can drag,'" says
Phyllis Chasser, senior data warehouse analyst for the
Broward County School District, in Florida.
http://www.destinationcrm.com/articles/default.asp?ArticleID=2736 accessed 28/5/2006
 2006 – “Teachers, for instance, can drill down into an
individual student's attendance records and grades, and
parents can track their children's progress.”
http://www.computerworld.com/databasetopics/businessintelligence/story/0,10801,108951,00.html accessed
28/5/2006
Georgia Department of Education 2002 Data Mining
Study:
 Standard Tests for grades 4, 6 and 8
 294,000 students
 They found the following student-level success predictor variables:
 Gender
 Race
 Free/reduced lunch status
 Attendance rate
 Special education status
 LEP status (Limited English Proficient)
The Data Mining Process
http://www.siggraph.org/education/materials/HyperVis/applicat/data_mining/data_mining.html (Accessed 22/5/06)
Some Terms: Concepts, Instances and Attributes
 Concept description or class – the thing that is to be learned
 Instances or examples – the information that the learner is given
 Attributes – the values that measure different aspects of an
instance, which can be
 numeric
 nominal
 ordinal, etc.
Preparing your data
 Don’t underestimate this step – I found most of the time was taken
here
 Clean the data – it has most certainly not been gathered for data
mining
 Integrate data from different sources – e.g. Board of Studies numbers
used to link SC and HSC results
 Deal with missing values
 Sparse data – many attributes may be 0
 Remove inaccurate data – visual representation can help
 Get to know your data
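Several of the preparation steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure actually used on the SCEGGS data: the field names `bos_number` and `mark`, the helper names, the 0 to 100 valid range, and all of the records are invented for the example. It links SC and HSC records through a shared student number, treats `?` or empty cells as missing, and discards marks that fall outside the valid range.

```python
def clean_mark(raw, low=0.0, high=100.0):
    """Convert a raw mark to float; return None when it is missing or out of range."""
    if raw is None or raw.strip() in ("", "?"):
        return None
    try:
        mark = float(raw)
    except ValueError:
        return None
    return mark if low <= mark <= high else None


def link_by_student_number(sc_rows, hsc_rows, key="bos_number"):
    """Join SC and HSC records that share the same student number."""
    hsc_by_number = {row[key]: row for row in hsc_rows}
    linked = []
    for sc in sc_rows:
        hsc = hsc_by_number.get(sc[key])
        if hsc is not None:  # keep only students present in both sources
            linked.append({
                key: sc[key],
                "sc_science": clean_mark(sc["mark"]),
                "hsc_physics": clean_mark(hsc["mark"]),
            })
    return linked


# Invented records, for illustration only.
sc_records = [
    {"bos_number": "1001", "mark": "86"},
    {"bos_number": "1002", "mark": "?"},    # missing value, no HSC record either
    {"bos_number": "1003", "mark": "910"},  # keying error: out of range
]
hsc_records = [
    {"bos_number": "1001", "mark": "70"},
    {"bos_number": "1003", "mark": "75"},
]

linked = link_by_student_number(sc_records, hsc_records)
complete = [row for row in linked if None not in row.values()]
print(len(linked), "linked records,", len(complete), "complete")
```

Whether to drop incomplete records, as `complete` does here, or to keep them and let the mining tool handle the gaps is a judgment call that depends on how sparse the data is.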
4 styles of learning
 Classification Learning
 Classified examples from which it is expected to learn a way of classifying
unseen examples
 Association learning
 Any association among features is sought
 Clustering
 Groups of examples that belong together are sought
 Numeric Prediction
 Outcome to be predicted is a numeric quantity
 In practice success is often measured subjectively
Classification Learning – Naming Irises
 To test classification learning, try it out on an independent set of
data for which the classifications are known but not made available to
the machine
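A hold-out test of this kind can be sketched as follows. The example assumes a one-nearest-neighbour classifier, chosen only because it fits in a few lines; it is not the scheme used in the talk. The measurements (sepal length, sepal width, petal length, petal width, in cm) are illustrative values in the style of the Iris data, and the names `TRAIN`, `TEST`, `nearest_neighbour` and `holdout_accuracy` are made up for the example.

```python
import math

# Labelled examples the learner is allowed to see.
TRAIN = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
]

# Independent examples: labels are known to us but hidden from the learner.
TEST = [
    ((4.7, 3.2, 1.3, 0.2), "setosa"),
    ((6.9, 3.1, 4.9, 1.5), "versicolor"),
    ((7.1, 3.0, 5.9, 2.1), "virginica"),
]


def nearest_neighbour(train, features):
    """Predict the label of the closest training example (Euclidean distance)."""
    closest = min(train, key=lambda item: math.dist(item[0], features))
    return closest[1]


def holdout_accuracy(train, test):
    """Fraction of held-out examples whose predicted label matches the known one."""
    correct = sum(nearest_neighbour(train, f) == label for f, label in test)
    return correct / len(test)


print("hold-out accuracy:", holdout_accuracy(TRAIN, TEST))
```

The point of the design is the separation: the labels in `TEST` are used only to score predictions, never to make them.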
Association

 Association rules differ from classification rules in that they can predict any
attribute, not just the class.
 Useful, provided you are not swamped by association rules
If temperature=cool then humidity=normal
If humidity=normal and windy=false then play=yes
If outlook=sunny and play=no then humidity=high
If windy=false and play=no then outlook=sunny and humidity=high
All of these association rules are correct. There are many more of them, but just
how useful are they?
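One way to judge a rule is to count how many instances it covers and how often it holds. The sketch below assumes the rules above refer to the 14-instance weather data used by Witten & Frank; the table is reproduced here from memory of that book and should be checked against it. The function name `rule_stats` is made up for the example.

```python
FIELDS = ("outlook", "temperature", "humidity", "windy", "play")

# Weather data as assumed for this sketch (14 instances).
WEATHER = [dict(zip(FIELDS, row)) for row in [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]]


def rule_stats(rows, if_part, then_part):
    """Return (coverage, confidence) for the rule 'if if_part then then_part'.

    coverage   = number of rows matching both parts
    confidence = coverage / number of rows matching the 'if' part
    """
    matches_if = [r for r in rows if all(r[k] == v for k, v in if_part.items())]
    matches_both = [r for r in matches_if
                    if all(r[k] == v for k, v in then_part.items())]
    confidence = len(matches_both) / len(matches_if) if matches_if else 0.0
    return len(matches_both), confidence


RULES = [
    ({"temperature": "cool"}, {"humidity": "normal"}),
    ({"humidity": "normal", "windy": "false"}, {"play": "yes"}),
    ({"outlook": "sunny", "play": "no"}, {"humidity": "high"}),
    ({"windy": "false", "play": "no"}, {"outlook": "sunny", "humidity": "high"}),
]

for if_part, then_part in RULES:
    coverage, confidence = rule_stats(WEATHER, if_part, then_part)
    print(if_part, "=>", then_part, "covers", coverage, "confidence", confidence)
```

Every rule holds on all the instances it covers, but the last one covers only two of them, which is one reason a long list of "correct" rules may not be very useful.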
Clustering – the “Iris naming problem” without Iris
names

 Group items that seem to fall naturally together
 Success of “clustering” is often measured subjectively in terms of how
useful the result appears to humans
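As a minimal sketch of grouping without names, the example below runs plain k-means on unlabelled (petal length, petal width) pairs. K-means is only one of many clustering methods and is used here as an assumption, not because the talk names it. The measurements are illustrative values, and the starting centroids are picked by hand so the run is repeatable.

```python
import math

# Unlabelled measurements: (petal length, petal width) in cm, illustrative values.
POINTS = [
    (1.4, 0.2), (1.3, 0.2), (1.5, 0.2),
    (4.7, 1.4), (4.5, 1.5), (4.9, 1.5),
    (5.1, 1.9), (5.9, 2.1), (6.0, 2.5),
]


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its points; repeat a fixed number of times."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return clusters, centroids


# Start from the first and last points as the two initial centroids.
clusters, centroids = k_means(POINTS, [POINTS[0], POINTS[-1]])
for cluster, centre in zip(clusters, centroids):
    print(len(cluster), "points around", tuple(round(c, 2) for c in centre))
```

With two clusters the small-petalled flowers separate cleanly from the rest. Whether two groups or three is the "right" answer is exactly the kind of judgment the slide describes as subjective.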
Numeric Prediction – the focus of my work
 The predicted value may be of less interest than which attributes are
important and how they relate to the numeric outcome.
PRP = -55.9 + 0.0489 MYCT + 0.0153 MMIN + 0.0056 MMAX + 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX
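To see how each attribute feeds into the outcome, the regression equation can be evaluated term by term. The coefficients below are copied from the formula above; the machine configuration passed in is an assumed example input, and the names `predict_prp` and `contributions` are made up for this sketch.

```python
INTERCEPT = -55.9
COEFFICIENTS = {
    "MYCT": 0.0489,
    "MMIN": 0.0153,
    "MMAX": 0.0056,
    "CACH": 0.6410,
    "CHMIN": -0.2700,
    "CHMAX": 1.480,
}


def contributions(machine):
    """Return each attribute's term (coefficient * value) in the equation."""
    return {name: weight * machine[name] for name, weight in COEFFICIENTS.items()}


def predict_prp(machine):
    """Evaluate the linear equation for one machine configuration."""
    return INTERCEPT + sum(contributions(machine).values())


# Assumed example configuration, for illustration only.
example = {"MYCT": 125, "MMIN": 256, "MMAX": 6000, "CACH": 256, "CHMIN": 16, "CHMAX": 128}

for name, term in sorted(contributions(example).items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:6s} {term:9.3f}")
print("predicted PRP:", round(predict_prp(example), 1))
```

Sorting the terms by size shows which attributes dominate the prediction for a given instance. A coefficient alone does not show this, because the attributes are measured on very different scales.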
Format: Attribute-Relation File Format (ARFF)
@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
But don’t panic – Weka is smart enough to use .csv files.
Nominal and numeric attributes
age,sex,chest_pain_type,cholesterol,exercise_induced_angina,heart_disease
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
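For anyone who wants the ARFF form anyway, the conversion from CSV is mechanical. The sketch below is a simplified illustration, not Weka's own converter: it assumes a header row, treats `?` as a missing value, declares a column numeric when every known value parses as a number, and otherwise lists the distinct values as a nominal attribute. It does not quote attribute names or values containing spaces or special characters. The function names are made up for the example.

```python
import csv
import io

HEART_CSV = """age,sex,chest_pain_type,cholesterol,exercise_induced_angina,heart_disease
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
"""


def is_number(text):
    """True when the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation, missing="?"):
    """Build ARFF text from CSV text, inferring numeric vs nominal columns."""
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation]
    for index, name in enumerate(header):
        known = [row[index] for row in data if row[index] != missing]
        if known and all(is_number(value) for value in known):
            lines.append("@attribute " + name + " numeric")
        else:
            # Nominal: distinct values in the order they first appear.
            labels = list(dict.fromkeys(known))
            lines.append("@attribute " + name + " {" + ", ".join(labels) + "}")
    lines.append("@data")
    lines.extend(",".join(row) for row in data)
    return "\n".join(lines)


print(csv_to_arff(HEART_CSV, "heart-disease-simplified"))
```

A nominal attribute inferred this way lists only the values that happen to occur in the file, so a hand-written ARFF header may declare more values than the converted one.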
Numeric attributes
Sci_09_1,Sci_09_2,Sci_10_1,Sci_10_2,SC,11_1_phys,11_2_phys,12-2_phys,HSC_Phys_05
63,73,53,63,86,44,49,57,70
76,79,73,69,91,54,59,62,74
79,80,85,80,85,60,65,67,75
78,85,81,81,91,76,74,61,75
74,80,81,76,88,65,68,71,76
72,81,76,78,92,71,73,64,76
78,79,80,77,83,68,68,68,76
83,83,81,84,86,64,71,66,76
80,81,76,85,86,68,74,74,80
...
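A file like this is all that a numeric-prediction scheme needs. As a rough sketch of the idea, the code below fits a one-variable least-squares line that predicts `HSC_Phys_05` from `12-2_phys`, using only the nine rows shown above. This is a deliberately simpler technique than Weka's linear regression function, which uses several attributes at once, and nine rows are far too few to draw conclusions from. The function name `fit_line` is made up for the example.

```python
import csv
import io

SAMPLE = """Sci_09_1,Sci_09_2,Sci_10_1,Sci_10_2,SC,11_1_phys,11_2_phys,12-2_phys,HSC_Phys_05
63,73,53,63,86,44,49,57,70
76,79,73,69,91,54,59,62,74
79,80,85,80,85,60,65,67,75
78,85,81,81,91,76,74,61,75
74,80,81,76,88,65,68,71,76
72,81,76,78,92,71,73,64,76
78,79,80,77,83,68,68,68,76
83,83,81,84,86,64,71,66,76
80,81,76,85,86,68,74,74,80
"""


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
x = [float(r["12-2_phys"]) for r in rows]
y = [float(r["HSC_Phys_05"]) for r in rows]

intercept, slope = fit_line(x, y)
# Residual = actual HSC mark minus the mark the line predicts.
residuals = [actual - (intercept + slope * mark) for mark, actual in zip(x, y)]

print(f"HSC_Phys_05 = {intercept:.2f} + {slope:.3f} * 12-2_phys")
for mark, residual in zip(x, residuals):
    print(f"12-2_phys {mark:4.0f}  residual {residual:+.2f}")
```

Listing the residuals is the text equivalent of visualising the errors: large residuals point at students whose HSC result the school mark did not anticipate.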
Data Mining Tools
 Commercial and Open Source – a comprehensive list:
 http://www.togaware.com/datamining/catalogue.html
 The tool I have used: Weka
 http://www.cs.waikato.ac.nz/ml/weka/
The Source of SCEGGS Data: Academic Report Front Page
 Co-Curricular Activities
 General Comment
The Source of SCEGGS Data: Subject Report
 Numeric marks (/100 or /50)
 Subjective Effort Grade (A-E)
 Specific Outcomes (A-E)
 Subject teacher comments
SCEGGS Data – turning a mountain into a molehill
Report        Records   Numeric marks   Science only
                                        Physics   Chem   Bio
2001S1 Y8     10200     1056            96
2001S2 Y8     10200     1056            96
2002S1 Y9     10200     1056            96
2002S2 Y9     10200     1056            96
2003S1 Y10    10200     1056            96
2003S2 Y10    10200     1056            96
2004S1 Y11    10200     703             23        31     23
2004S2 Y11    10200     703             23        31     23
2005S2 Y12    10200     629             23        31     23
Total         91800     8371            645       669    645
03 SC Data              504             104
05 HSC Data             488             23        31     23
Overall                 9363            668       700    678
77 students studied the sciences
What I investigated:
 The relationship between numeric results for students in Years 7–12 and their
 SC Science results
 HSC results in Physics, Biology and Chemistry
 HSC results in English Extension 1
The weka!
Copyright: Martin Kramer (mkramer@wxs.nl)
Weka – loading the data
Weka - Visualising the data – a handy tool
Weka – Classifying using a linear regression function
Weka – Visualising the Errors
Supervised Attribute Selection
Further examples
 English: Years 7–10 with SC results
 English: Years 9–12 with Extension 1 HSC results
 Biology: Science Years 9–12 with HSC results
 Chemistry: Science Years 9–12 with HSC results
How valuable has this been?
 Weka is a nice tool for visualising data
 Most of the results we have seen are intuitive
 Learnt respect for those who can phrase SQL queries efficiently
 A feeling that I have a lot to learn
 There is much left undone.
What I wanted to investigate:
 Do co-curricular activities correlate well with HSC results? E.g.
do DofE students do better at Physics than non-DofE students?
 What, if any, is the relation between outcome grades A–E and SC
and HSC results?
 Can the comments in the reports be used for “text mining”?
References
1. Maindonald, J., Data Mining from a Statistical Perspective,
http://wwwmaths.anu.edu.au/~johnm/dm/dmpaper.html, accessed 20/5/2005
2. Witten, I. & Frank, E., Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Elsevier,
San Francisco, 2005
3. Levitt, S.D. & Dubner, S.J., Freakonomics: A Rogue Economist Explores the Hidden Side of
Everything, Penguin Group, Victoria, 2005
4. Williams, G., http://datamining.anu.edu.au/student/math3346_2005/050809-maths3346-topics2x2.pdf, accessed 22/5/06
5. http://www.electronic-school.com/199909/0999f1.html, accessed 28/5/06
6. http://www.sas.com/ This commercial organisation sells software that may be used to mine data
warehouses. Accessed 28/5/06
Other References

http://datamining.anu.edu.au/
http://www.csse.monash.edu.au/~webb/Bio.htm - Geoff Webb - Monash
http://wwwmaths.anu.edu.au/~johnm/dm/dmpaper.html - John Maindonald - ANU
http://www.csse.monash.edu.au/~mgaber/WResources.htm
http://www.datamining.monash.edu.au/index.shtml#Contacts
http://www.electronic-school.com/199909/0999f1.html
http://www.electronic-school.com/2000/03/0300f3.html
http://www.networkworld.com/archive/2000/88136_02-28-2000.html
Contact Details
 Ian Ralph
 IT Manager
 SCEGGS Darlinghurst
 ian@sceggs.nsw.edu.au