Tutorial Overview & Learning Python
COMP 4332 Tutorial 1
Feb 13
WANG YUE
ywangby@connect.ust.hk

Course Work
• Three or four assignments (20%)
  1. Progress report of the first project
  2. Add cross-validation to the first project
  3. 3-4 quick questions about data mining
• Two projects (60%)
  1. KDD Cup 2014 - Predicting Excitement at DonorsChoose.org
  2. PAKDD 2014 - ASUS Malfunctional Components Prediction
• One term paper (10%)
• One presentation (10%)

Project-oriented tutorials
• Projects and assignments count for 80% of your grade.
• You will write code in a few languages/tools.
• More importantly, you will do experiments!
• Very different from COMP 4331: light on concepts/math, heavy on hands-on work.
COMP 4332 = COMP 4331 + COMP 4331

A data mining project requires ...
• 1. Explore the data and preprocess it.
• 2. Try algorithms: SVM, logistic regression, decision trees, dimensionality reduction, etc. Also vary the parameters of each algorithm.
  • Labor intensive!
  • Sometimes frustrating. You repeatedly go back to step 1 to reprocess the data so that it can be fed into different tools.
• 3. Summarize findings, design new methods, and go back to step 2. The creative part!

1. Explore data/look at the data
• Visualization:
  • 1D data summary: mean, variance, median, skewness; density estimation (pdf), cdf; outliers, etc.
  • 2D data summary: scatter plot, QQ-plot, correlation scores, etc.
  • High-dimensional data summary: apply dimensionality reduction, then plot in 2D or 3D
• Store data and extract the wanted part.
  • Organized: SQL-like queries...
  • Quick and dirty: write a script for each operation...

2. Run experiments using tools
• Most of the time, tools are available.
  • Weka, libsvm, etc. Good news :)
• Sometimes, you need to implement a variant of an existing algorithm.
  • A different decision tree
  • A classifier that handles unbalanced data
Numerical code is generally hard to write correctly (hard to DEBUG!). You will do this in this course!
• Run the methods, vary parameters, and plot results and trends.

3. Summarize findings and design new methods
• After each iteration of steps 1 and 2, you know more about the data. You may have new ideas and go back to steps 1 and 2.
• But before that, first document your findings.

A cloud of tools ...
• Data preprocessing: Python, Java/C++, SQL, Excel, text editors...
• Visualization: Excel, Matlab, R, matplotlib
• SVM: libsvm, svmlight, liblinear packages
• Logistic regression: liblinear
• Decision trees & tree ensembles: Weka, FEST
• Matrix factorization: libfm, GraphLab
*Bolded tools are the ones we will teach in the tutorials.
Teaching all of them is impossible!
• You have to take time to read the manuals of these tools, and sometimes their source code!
• Throughout this course, we will use Python to illustrate
  • Data preprocessing (mostly string processing)
  • Algorithm implementation (numpy/scipy)
  • Automatically performing experiments
  • Simple plotting (matplotlib)
• Sometimes we use R’s plotting packages (core, ggplot2) if matplotlib does not meet the requirement.

Why Python
• Easy to learn and easy to use.
• A good tool for us to illustrate the three steps of doing a data mining project.
• A concise and powerful language.
• A glue language: it easily integrates components written in other languages.
• Widely used in the IT industry.

Organizations using Python
• We will use the latest Python version in this course (Python 3.4).

Set Up a Python Scientific Environment
• Anaconda Scientific Python Distribution
  • It includes over 195 of the most popular Python packages for science, math, engineering, and data analysis (numpy, scipy, sklearn, matplotlib).
  • Cross-platform
  • No need to install scientific packages one by one
• The default IDE is weak. Recommended IDEs:
  • Sublime Text (recommended)
  • PyCharm (recommended)
  • Eclipse + PyDev (cross-platform)
  • Or simply the Notepad++ editor with syntax highlighting (Windows only)

Learn Python
• The official Python tutorial.
  Written for experienced programmers.
  • Read it twice and try every code snippet in the tutorial.
• Code Like a Pythonista: Idiomatic Python
• Python HOWTOs: sorting, logging, functional programming, etc.
• MIT 6.00 course material.
• Liang Huang’s Python Short Course.
• numpy examples and scipy tutorial.
• Best place to ask a Python-related question: http://stackoverflow.com/. It is better to post your Python question on Stack Overflow than to send it to our mailing list.

Learn Python (Books)
• A Byte of Python
• Learning Python
• Python Cookbook
• Moving from Python 2 to Python 3

Play with Python data structures
• basic types: bool, integer, float, complex
• tuple: (x, y, ...)
• list: [x, y, ...]
• string: 'hello', "world"
• dictionary: { x: a, y: b, ... }
• set: set([a, b, c, d])
• iterable/sequence: a unified view of data structures
  • tuple/list/dictionary/set/string are all iterable.

DEMO
• 1. Go through basic Python data structures and their operations.
• 2. Show Python’s functions and control structures (if-then-else/for/while).

Project Requirement
• You should register an account as a team on the Kaggle website.
• Every member of a team receives the same mark as the team.
• You should send me the source code of your project, or a report if you use other tools, plus your final online score (comp4332.ust.hk@gmail.com).
• Marking: source code/report (50%) + final score you get on the website (50%)
• Deadline: Project 1 (11:59pm, Apr 12th), Project 2 (11:59pm, May 10th)
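
The two DEMO items listed earlier (basic data structures and their operations, then functions and control structures) can be sketched roughly as follows. This is a minimal illustration, not the script used in the tutorial: the sample values and the helper names `word_counts` and `first_power_of_two_above` are assumptions made up for this example.

```python
# Illustrative sketch only: the sample values and both helper functions
# below are hypothetical, not taken from the tutorial's own demo.

# --- 1. Basic data structures and a few operations ---
flag = True                     # bool
count = 3                       # integer
ratio = 0.5                     # float
z = 1 + 2j                      # complex

point = (1, 2)                  # tuple: immutable sequence
scores = [3, 1, 2]              # list: mutable sequence
greeting = 'hello' + " world"   # string: single or double quotes both work
ages = {'ann': 20, 'bob': 22}   # dictionary: key -> value
letters = set(['a', 'b', 'a'])  # set: duplicates are dropped

scores.append(5)                # lists grow in place
scores.sort()                   # in-place sort -> [1, 2, 3, 5]
ages['cat'] = 19                # insert a new key

# All five containers are iterable, so one loop handles each of them.
lengths = [len(list(c)) for c in (point, scores, greeting, ages, letters)]


# --- 2. Functions and control structures ---
def word_counts(text):
    """Count how often each whitespace-separated word occurs in text."""
    counts = {}
    for word in text.split():   # for-loop over a list of words
        if word in counts:      # if/else branch
            counts[word] += 1
        else:
            counts[word] = 1
    return counts


def first_power_of_two_above(limit):
    """Return the smallest power of two strictly greater than limit."""
    value = 1
    while value <= limit:       # while-loop
        value *= 2
    return value


print(scores)
print(lengths)
print(word_counts("to be or not to be"))
print(first_power_of_two_above(100))
```

Typing these lines one at a time into the interactive interpreter, and inspecting each variable after it changes, is a quick way to get a feel for which operations modify a container in place and which return a new value.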