Tutorial Overview & Learning Python
COMP 4332 Tutorial 1
Feb 16
WANG YUE
ywangby@connect.ust.hk

Project-oriented tutorials
• Project and assignments count for 80% of your grade.
• You will write code in a few languages/tools.
• More importantly, you will do experiments!
• Very different from COMP 4331: light on concepts/math, heavy on hands-on work.
COMP 4332 = COMP 4331 + COMP 4331

A data mining project requires ...
• 1. Explore and preprocess the data.
• 2. Try algorithms (SVM, Logistic Regression, Decision Trees, Dimensionality Reduction, etc.) and vary the parameters of each algorithm.
  • Labor intensive!
  • Sometimes frustrating. You repeatedly go back to step 1 to reprocess the data so it can be fed into different tools.
• 3. Summarize findings, design new methods, and go back to step 2. The creative part!

1. Explore data/look at the data
• Visualization:
  • 1D data summary: mean, variance, median, skewness; density estimation (pdf), cdf; outliers, etc.
  • 2D data summary: scatter plot, QQ-plot, correlation scores, etc.
  • High-dimensional data summary: reduce the dimensionality and plot in 2D or 3D
• Store the data and extract the parts you want.
  • Organized: SQL-like queries...
  • Quick and dirty: write a script for each operation...

2. Run experiments using tools
• Most of the time, tools are available.
  • Weka, libsvm, etc. Good news :)
• Sometimes you need to implement a variant of an existing algorithm.
  • A different decision tree
  • A classifier that handles unbalanced data
Numerical code is generally hard to write correctly (hard to DEBUG!). You will do this in this course!
• Run the methods, vary the parameters, and plot the results and trends.

3. Summarize findings and design new methods
• After each iteration of steps 1 and 2, you know more about the data; you may have new ideas and go back to steps 1 and 2.
• But before that, first document your findings.

A cloud of tools ...
• Data preprocessing: Python, Java/C++, SQL, Excel, text editors, ...
• Visualization: Excel, Matlab, R, matplotlib
• SVM: libsvm, svmlight, liblinear packages
• Logistic regression: liblinear
• Decision trees & tree ensembles: Weka, FEST
• Matrix factorization: libfm, GraphLab

Teaching all of them is impossible
• You have to take time to read the manuals of these tools, and sometimes their source code!
• Throughout this course, we will use Python to illustrate:
  • Data preprocessing (mostly string processing)
  • Algorithm implementation (numpy/scipy)
  • Running experiments automatically
  • Simple plotting (matplotlib)
• Sometimes we use R’s plotting packages (core, ggplot2) if matplotlib does not meet our needs.

Why Python
• Easy to learn and easy to use.
• A good tool for us to illustrate the three steps of doing a data mining project.
• A concise and powerful language.
• A glue language: it easily integrates components written in other languages.
• Widely used in the IT industry.

Organizations using Python
• We will use the latest Python version in this course (Python 3.4).

Set Up a Python Scientific Environment
• Anaconda Scientific Python Distribution
  • It includes over 195 of the most popular Python packages for science, math, engineering, and data analysis (numpy, scipy, sklearn, matplotlib).
  • Cross-platform
  • No need to install scientific packages one by one
• The default IDE is weak. Recommended alternatives:
  • Sublime Text (recommended)
  • PyCharm (recommended)
  • Eclipse + PyDev (cross-platform)
  • Or simply the Notepad++ editor with syntax highlighting (Windows only)

Learn Python
• The official Python tutorial. Written for experienced programmers.
  • Read it twice and try every code snippet in the tutorial.
• Code Like a Pythonista: Idiomatic Python
• Python Howto: sort, logging, functional programming, etc.
• MIT 6.00 course material.
• Liang Huang’s Python Short Course.
• numpy examples and scipy tutorial.
• Best place to ask a Python-related question: http://stackoverflow.com/.
  It is better to send your Python questions to Stack Overflow rather than to our mailing list.

Learn Python (Books)
• A Byte of Python
• Learning Python
• Python Cookbook
• Moving from Python2 to Python3

Play with Python data structures
• basic types: bool, integer, float, complex
• tuple: (x, y, ..)
• list: [x, y, ...]
• string: 'hello', "world"
• dictionary: { x: a, y: b, ... }
• set: set([a, b, c, d])
• iterable/sequence: a unified view of data structures
  • tuple/list/dictionary/set/string are all iterable.

Learning By Doing
• 1. Go through basic Python data structures and their operations.
• 2. Show Python’s functions and control structures (if-then-else/for/while).
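
As a starting point for the two "Learning By Doing" steps, here is a minimal Python 3 sketch. It is not taken from the tutorial itself: all variable and function names (`point`, `scores`, `total_length`, `classify`, `countdown`) are hypothetical and chosen only for illustration. It builds one value of each data structure listed above, then uses a function with for, if/elif/else, and while to show that every container can be traversed the same way.

```python
# Basic types: bool, integer, float, complex
flag = True
count = 42
ratio = 3.14
z = 1 + 2j

# Containers (illustrative values, not from the slides)
point = (1, 2)                        # tuple: fixed-size, immutable
scores = [3, 1, 2]                    # list: ordered, mutable
greeting = 'hello'                    # string: sequence of characters
ages = {'ann': 30, 'bob': 25}         # dictionary: key -> value
letters = set(['a', 'b', 'c', 'a'])   # set: duplicates are dropped


def total_length(containers):
    """Count items across containers using only iteration (for loops)."""
    total = 0
    for container in containers:
        for _ in container:           # iterating a dict yields its keys
            total += 1
    return total


def classify(n):
    """if / elif / else: label a number by its sign."""
    if n < 0:
        return 'negative'
    elif n == 0:
        return 'zero'
    else:
        return 'positive'


def countdown(start):
    """while loop: collect start, start-1, ..., 1 in a list."""
    result = []
    while start > 0:
        result.append(start)
        start -= 1
    return result


# The same function works on a tuple, list, string, dict, and set,
# because all of them are iterable.
n_items = total_length([point, scores, greeting, ages, letters])
print(n_items, classify(count), countdown(3), sorted(scores))
```

Trying each line in an interactive interpreter, and changing the values to see what happens, is in the spirit of "try every code snippet" recommended earlier.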