Data Mining
Looking for Patterns in Data

This Presentation
Why Data Mining?
What is Data Mining?
What's happening in K-12 schools
The Data Mining Process
Prepare your data for mining
Data Mining Tools
Mining SCEGGS Data

Why Data Mining?
Drowning in data...
Schools store lots of data about students' performance
SCEGGS: 10,000 records per reporting period per school year
2 reporting periods per year since 1999
160,000 records
A gold mine: hidden information that has not yet been made use of?

My Shovel:
Well written and easy to read
Lots of examples
As technical as you want to be
"If you have data that you want to analyze and understand, this book and the associated Weka toolkit are an excellent way to start." - Jim Gray, Microsoft Research

Ethical Issues
Important questions:
Who is permitted access to the data?
For what purpose was the data collected?
What kind of conclusions can be legitimately drawn from it?
● Caveats must be attached to results
● Purely statistical arguments are never sufficient!
● Are resources put to good use?

What is Data Mining?
Looking for patterns in data
Data is stored electronically
Search is augmented by computer
Meaningful patterns may be used for prediction
"Intelligently analysed data is a valuable resource. It can lead to new insights…" (Witten & Frank)
Data mining is about solving problems by analysing data already in databases

An example [3]
Classroom A after "mining" (with cheating algorithm applied)
1.  112a4a342cb214d0001acd24a3a12dadbcb4a0000000
2.  1b2a34d4ac42d23b141acd24a3a12dadbcb4a2134141
3.  db2abad1acbdda212b1acd24a3a12dadbcb400000000
4.  d43a3a24acb1d32b412acd24a3a12dadbcb422143bc0
5.  d43ab4d1ac3dd43421240d24a3a12dadbcb400000000
6.  1142340c2cbddadb4b1acd24a3a12dadbcb43d133bc4
7.  dba2ba21ac3d2ad3c4c4cd40a3a12dadbcb400000000
8.  144a3adc4cbddadbcbc2c2cca3a12dadbcb4211ab343
9.  3b3ab4d14c3d2ad4cbcac1c003a12dadbcb4adb40000
10. d43aba3cacbddadbcbca42c2a3212dadbcb42344b3cb
11. 214ab4dc4cbdd31b1b2213c4ad412dadbcb4adb00000
12. 313a3ad1ac3d2a23431223c000012dadbcb400000000
13. d4aab2124cbddadbcb1a42cca3412dadbcb423134bc1
14. dbaab3dcacb1dadbc42ac2cc31012dadbcb4adb40000
15. db223a24acb11a3b24cacd12a241cdadbcb4adb4b300
16. d122ba2cacbd1a13211a2d02a2412d0dbcb4adb4b3c0
17. 1423b4d4a23d24131413234123a243a2413a21441343
18. db4abadcacb1dad3141ac212a3a1c3a144ba2db41b43
19. db2a33dcacbd32d313c21142323cc300000000000000
20. 1b33b4d4a2b1dadbc3ca22c000000000000000000000
21. d12443d43232d32323c213c22d2c23234c332db4b300
22. d4a2341cacbddad3142a2344a2ac23421c00adb4b3cb

Data Mining Compared with Statistics [1]
Statistics:
1. Computing skills required to manage the data and the analysis.
2. An understanding of design of data collection issues.
3. An understanding of statistical inferential issues.
4. A knowledge of relevant mathematics.
5. Insights from practical data analysis.
6. Application area insights.
7. Automation of data analysis.

Data Mining Compared with Statistics [1]
Data Mining:
1. Computing skills required to manage the data and the analysis.
   An understanding of design of data collection issues.
   An understanding of statistical inferential issues.
   A knowledge of relevant mathematics.
   Insights from practical data analysis.
   Application area insights.
2. Automation of data analysis.

Data Mining in Schools
Google Scholar search: 'data mining' & K-12 & education – 513 hits
Jamie McKenzie, FNO articles, 1 relevant in 2000:
"[technology will give us]…the ability to enable decision makers to collect and analyze data more effectively, more frequently, and more easily"
http://www.fno.org/sept00/data.html accessed 28/5/2006

Data Mining in Schools – SAS [6]: promises, promises
"Decisions based on data and a culture of evidence will enable us to sustain effective processes to meet the needs of our large and diverse student population," explains Kelly. "We expect to provide our staff and students with easy access to information using our business intelligence solution."
http://www.sas.com/success/accd.html accessed 28/5/06
Improve student test scores and graduation rates.
Better prepare students for college and the work force.
Recruit and retain the best teachers.
Ensure that students are performing at their highest level and determine which programs help individual students make progress.
Find online curriculum that meets state standards and makes learning more profound.
Develop and manage a complex budget.
Provide thousands of students with safe and efficient bus transportation, nutritious food, up-to-date libraries, cutting-edge technology resources and secure campus facilities.
http://www.sas.com/govedu/edu/k12/index.html accessed 28/5/06

Data Mining in Schools – Broward County (Fla.) Public Schools – the 1999 promise:
Although the district is still just beginning to scratch the surface of what's possible, the benefits of the data warehouse were obvious from the start.
"Schools have reported finding patterns of absenteeism. Now, we can intervene sooner."
http://www.electronic-school.com/199909/0999f1.html
If you're going to be using the data, you have much more of an investment in making sure the data is correct.
One immediate benefit of the system is the ability to automate the generation of mandated state and federal reports.
http://www.electronic-school.com/199909/0999f1.html accessed 28/5/06

Data Mining in Schools – Broward County (Fla.) Public Schools – the reality:
2002 – "On the first day of our training sessions, we sometimes have to say, 'This is a mouse, you push this button, and if you hold it down, you can drag,'" says Phyllis Chasser, senior data warehouse analyst for the Broward County School District, in Florida.
http://www.destinationcrm.com/articles/default.asp?ArticleID=2736 accessed 28/5/2006
2006 – "Teachers, for instance, can drill down into an individual student's attendance records and grades, and parents can track their children's progress."
http://www.computerworld.com/databasetopics/businessintelligence/story/0,10801,108951,00.html accessed 28/5/2006

Georgia Department of Education 2002 Data Mining Study:
Standard tests for grades 4, 6 and 8
294,000 students
They found the following student-level success predictor variables:
Gender
Race
Free/reduced lunch status
Attendance rate
Special education status
LEP status (Limited English Proficient)

The Data Mining Process
http://www.siggraph.org/education/materials/HyperVis/applicat/data_mining/data_mining.html (accessed 22/5/06)

Some Terms: Concepts, Instances and Attributes
Concept description or class – the thing that is to be learned
Instances or examples – the information that the learner is given
Attributes – the values that measure different aspects of an instance; they can be numeric, nominal, ordinal, etc.

Preparing your data
Don't underestimate this step – I found most of the time was taken here
Clean the data – it has most certainly not been gathered for data mining
Integrate data from different sources – e.g. Board of Studies numbers used to link SC and HSC results
Deal with missing values
Sparse data – many attributes may be 0
Remove inaccurate data – visual representation can help
Get to know your data

4 styles of learning
Classification learning: classified examples from which it is expected to learn a way of classifying unseen examples
Association learning: any association among features is sought
Clustering: groups of examples that belong together are sought
Numeric prediction: the outcome to be predicted is a numeric quantity
In practice success is often measured subjectively

Classification Learning – Naming Irises
To test classification learning, try it out on an independent set of data for which the classifications are known but not available to the machine

Association
Association rules differ from classification rules as they can predict any attribute, not just the class.
Useful if you are not swamped by association rules
If temperature=cool then humidity=normal
If humidity=normal and windy=false then play=yes
If outlook=sunny and play=no then humidity=high
If windy=false and play=no then outlook=sunny and humidity=high
All of these association rules are correct. There are many more of them, but just how useful are they?

Clustering – the "Iris naming problem" without Iris names
Group items that seem to fall naturally together
Success of clustering is often measured subjectively in terms of how useful the result appears to humans

Numeric Prediction – the focus of my work
The predicted value may be of less interest than which attributes are important and how they relate to the numeric outcome.
PRP = -55.9 + 0.0489 MYCT + 0.0153 MMIN + 0.0056 MMAX + 0.6410 CACH – 0.2700 CHMIN + 1.480 CHMAX

Format
Attribute Relation File Format (ARFF)
@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
But don't panic – Weka is smart enough to use .csv files.

Nominal and numeric attributes
age,sex,chest_pain_type,cholesterol,exercise_induced_angina,heart_disease
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present

Numeric attributes
Sci_09_1,Sci_09_2,Sci_10_1,Sci_10_2,SC,11_1_phys,11_2_phys,12-2_phys,HSC_Phys_05
63,73,53,63,86,44,49,57,70
76,79,73,69,91,54,59,62,74
79,80,85,80,85,60,65,67,75
78,85,81,81,91,76,74,61,75
74,80,81,76,88,65,68,71,76
72,81,76,78,92,71,73,64,76
78,79,80,77,83,68,68,68,76
83,83,81,84,86,64,71,66,76
80,81,76,85,86,68,74,74,80
...
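
The numeric prediction idea can be tried by hand on the sample rows above. What follows is a minimal sketch, not the Weka procedure used for this talk: it fits a one-variable least-squares line that predicts HSC_Phys_05 from 12-2_phys, using only the nine sample rows shown. The choice of 12-2_phys as the single predictor and the helper names load_columns and fit_line are assumptions made for illustration; a linear regression in Weka would normally weigh several attributes at once.

```python
import csv
import io

# The nine sample rows copied from the numeric CSV listing above.
SAMPLE_CSV = """\
Sci_09_1,Sci_09_2,Sci_10_1,Sci_10_2,SC,11_1_phys,11_2_phys,12-2_phys,HSC_Phys_05
63,73,53,63,86,44,49,57,70
76,79,73,69,91,54,59,62,74
79,80,85,80,85,60,65,67,75
78,85,81,81,91,76,74,61,75
74,80,81,76,88,65,68,71,76
72,81,76,78,92,71,73,64,76
78,79,80,77,83,68,68,68,76
83,83,81,84,86,64,71,66,76
80,81,76,85,86,68,74,74,80
"""


def load_columns(text, x_name, y_name):
    """Read two named numeric columns from CSV text (hypothetical helper)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    xs = [float(row[x_name]) for row in rows]
    ys = [float(row[y_name]) for row in rows]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Assumption: use the last school physics mark as the only predictor.
xs, ys = load_columns(SAMPLE_CSV, "12-2_phys", "HSC_Phys_05")
intercept, slope = fit_line(xs, ys)

# Residuals are the prediction errors on the same rows used for fitting.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print(f"HSC_Phys_05 = {intercept:.2f} + {slope:.3f} * 12-2_phys")
print("largest absolute error:", round(max(abs(r) for r in residuals), 2))
```

With only nine rows the coefficients are illustrative only. As with classification, a fair test would apply the fitted line to students whose marks were not used for fitting.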
Data Mining Tools
Commercial and Open Source – a comprehensive list: http://www.togaware.com/datamining/catalogue.html
The tool I have used: Weka http://www.cs.waikato.ac.nz/ml/weka/

The Source of SCEGGS Data: Academic Report
Front page
Co-curricular activities
General comment

The Source of SCEGGS Data: Subject Report
Numeric marks (/100 or /50)
Subjective effort grade (A-E)
Specific outcomes (A-E)
Subject teacher comments

SCEGGS Data – turning a mountain into a molehill
Period  Year  Report records  Numeric marks  Science only
2001S1  Y8    10200           1056           96
2001S2  Y8    10200           1056           96
2002S1  Y9    10200           1056           96
2002S2  Y9    10200           1056           96
2003S1  Y10   10200           1056           96
2003S2  Y10   10200           1056           96
2004S1  Y11   10200           703            23  31  23
2004S2  Y11   10200           703            23  31  23
2005S2  Y12   10200           629            23  31  23
              91800           8371           645 669 645
03 SC Data 05 HSC Data 504 488 9363
Physics Chem Bio
77 students studied the sciences
104 23 668 31 700 23 678

What I investigated:
The relationship between numeric results for students in Years 7-12 and their
SC Science results
HSC results in Physics, Biology and Chemistry
HSC results in English Extension 1

The weka!
Copyright: Martin Kramer (mkramer@wxs.nl)

Weka – loading the data
Weka – visualising the data – a handy tool
Weka – classifying using a linear regression function
Weka – visualising the errors
Supervised attribute selection

Further examples
English: Years 7-10 with SC results
English: Years 9-12 with Extension 1 HSC
Biology: Science Years 9-12 with HSC results
Chemistry: Science Years 9-12 with HSC results

How valuable has this been?
Weka is a nice tool for visualising data
Most of the results we have seen are intuitive
Learnt respect for those who can phrase SQL queries efficiently
A feeling that I have a lot to learn
There is much left undone.

What I wanted to investigate:
Do co-curricular activities correlate well with HSC results? E.g. do DofE students do better at Physics than non-DofE students?
What, if any, is the relation between outcome grades A-E and SC and HSC results?
Can the comments in the reports be used for "text mining"?

References
1. Maindonald, J., Data Mining from a Statistical Perspective, http://wwwmaths.anu.edu.au/~johnm/dm/dmpaper.html, accessed 20/5/2005
2. Witten, I. & Frank, E., Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Elsevier, San Francisco, 2005
3. Levitt, S.D. & Dubner, S.J., Freakonomics: A Rogue Economist Explores the Hidden Side of Everything, Penguin Group, Victoria, 2005
4. Williams, G., http://datamining.anu.edu.au/student/math3346_2005/050809-maths3346-topics2x2.pdf, accessed 22/5/06
5. http://www.electronic-school.com/199909/0999f1.html, accessed 28/5/06
6. http://www.sas.com/ This commercial organisation sells software that may be used to mine data warehouses. Accessed 28/5/06

Other References
http://datamining.anu.edu.au/
http://www.csse.monash.edu.au/~webb/Bio.htm - Geoff Webb - Monash
http://wwwmaths.anu.edu.au/~johnm/dm/dmpaper.html - John Maindonald - ANU
http://www.csse.monash.edu.au/~mgaber/WResources.htm
http://www.datamining.monash.edu.au/index.shtml#Contacts
http://www.electronic-school.com/199909/0999f1.html and this http://www.electronic-school.com/2000/03/0300f3.html
http://www.networkworld.com/archive/2000/88136_02-28-2000.html

Contact Details
Ian Ralph
IT Manager
SCEGGS Darlinghurst
ian@sceggs.nsw.edu.au