T1_Overview

Tutorial Overview & Learning Python
COMP 4332 Tutorial 1
Feb 16
WANG YUE
ywangby@connect.ust.hk
Project-oriented tutorials
• Project and assignments count for 80% of your grade.
• You will write code in a few languages/tools.
• More importantly, you will do experiments!
• Very different from COMP 4331: light on concepts/math,
heavy on hands-on work.
COMP 4332 = COMP 4331 + COMP 4331
A data mining project requires ...
• 1. Explore and preprocess the data.
• 2. Try algorithms (SVM, Logistic Regression, Decision
Trees, Dimensionality Reduction, etc.) and vary the
parameters of each algorithm.
• Labor intensive!
• Sometimes frustrating.
You will repeatedly go back to step 1 to reprocess the data
so it can be fed into different tools.
• 3. Summarize findings, design new methods, and go
back to step 2.
The creative part!
1. Explore data/look at the data
• Visualization:
• 1D data summary: mean, variance, median, skewness; density
estimation (PDF), CDF; outliers, etc.
• 2D data summary: scatter plot, QQ-plot, correlation scores, etc.
• High-dimensional data summary: apply dimensionality reduction, then
plot in 2D or 3D.
• Store the data and extract the parts you want.
• Organized: SQL-like queries...
• Quick and dirty: write a script for each operation...
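As a minimal sketch of the 1D data summary listed above, the following uses only the Python standard library. The function name `summarize_1d` and the sample values are hypothetical and chosen for illustration; skewness is computed here as the average cubed z-score, which is one common definition among several.

```python
import statistics


def summarize_1d(values):
    """Return basic 1D summary statistics for a list of numbers."""
    n = len(values)
    mean = statistics.mean(values)
    variance = statistics.pvariance(values)  # population variance
    median = statistics.median(values)
    std = variance ** 0.5
    # Skewness as the average cubed z-score (0 for a symmetric sample).
    if std > 0:
        skewness = sum(((v - mean) / std) ** 3 for v in values) / n
    else:
        skewness = 0.0
    return {"mean": mean, "variance": variance,
            "median": median, "skewness": skewness}


# Hypothetical sample; the value 20.0 acts as an outlier.
sample = [1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 20.0]
summary = summarize_1d(sample)
for name in ("mean", "variance", "median", "skewness"):
    print(name, round(summary[name], 3))
```

The gap between the mean and the median, together with a positive skewness, is a quick hint that the sample has a long right tail and that the largest value deserves a closer look.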
2. Run experiments using tools
• Most of the time, tools are available.
• Weka, libsvm, etc.
Good news:)
• Sometimes, you need to implement a variant of an existing
algorithm.
• A different decision tree
• A classifier that handles unbalanced data
Numerical code is generally
hard to write correctly (hard
to DEBUG!). You will do this
in this course!
• Run the methods, vary the parameters, and plot the results
and trends.
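The "vary parameters and record results" loop might look like the sketch below. Everything in it is a hypothetical stand-in: the toy data, the one-parameter threshold classifier, and the list of thresholds. In a real project the `accuracy` function would be replaced by a call to an actual tool.

```python
def accuracy(threshold, points, labels):
    """Fraction of points classified correctly by the rule x >= threshold -> 1."""
    predictions = [1 if x >= threshold else 0 for x in points]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical toy data: one feature value and a 0/1 label per example.
points = [0.1, 0.4, 0.35, 0.8, 0.9, 0.6, 0.2, 0.7]
labels = [0, 0, 0, 1, 1, 1, 0, 1]

# Vary the parameter and record one result per setting.
results = []
for threshold in [0.0, 0.25, 0.5, 0.75, 1.0]:
    results.append((threshold, accuracy(threshold, points, labels)))

# A crude text "plot" of the trend; a plotting tool would be used in practice.
for threshold, acc in results:
    bar = "#" * int(acc * 20)
    print("threshold={:.2f}  accuracy={:.3f}  {}".format(threshold, acc, bar))

best_threshold, best_accuracy = max(results, key=lambda r: r[1])
print("best:", best_threshold, best_accuracy)
```

Keeping every (parameter, result) pair in a list, rather than printing and discarding it, makes it easy to plot the trend or write it to a file later.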
3. Summarize findings and design new
methods
• After each iteration of steps 1 and 2, you know more about
the data. You may have new ideas and go back to steps 1
and 2.
• But before that, first document your findings.
A cloud of tools ...
• Data preprocessing: Python, Java/C++, SQL, Excel, text
editors, ...
• Visualization: Excel, Matlab, R, matplotlib
• SVM: libsvm, svmlight, liblinear packages
• Logistic regression: liblinear
• Decision Trees & tree ensemble: Weka, FEST
• Matrix factorization: libfm, GraphLab
Teaching all of them is impossible
• You have to take time to read the manuals of these tools,
and sometimes their source code!
• Throughout this course, we will use Python to illustrate:
• Data preprocessing (mostly string processing)
• Algorithm implementation (numpy/scipy)
• Automatically running experiments
• Simple plotting (matplotlib)
• Sometimes we will use R's plotting packages (core, ggplot2)
if matplotlib does not meet the requirements.
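Since data preprocessing is described above as mostly string processing, here is a small sketch of what that can look like. The record layout (name, age, "|"-separated tags) and the `parse_record` helper are assumptions made for illustration, not a format used in the course.

```python
def parse_record(line):
    """Turn one raw text line into (user, age, tags); age is None when missing."""
    fields = [field.strip() for field in line.split(",")]
    user = fields[0].lower()                              # normalize case
    age = int(fields[1]) if fields[1].isdigit() else None  # "?" -> None
    tags = fields[2].split("|") if fields[2] else []      # empty field -> no tags
    return user, age, tags


# Hypothetical raw lines with stray whitespace and missing values.
raw_lines = [
    "  Alice , 23, sports|music ",
    "BOB,?,movies",
    "carol, 31,",
]
records = [parse_record(line) for line in raw_lines]
for record in records:
    print(record)
```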
Why Python
• Easy to learn and easy to use.
• A good tool for us to illustrate the three steps of doing a data mining
project.
• A concise and powerful language.
• A glue language. Easily integrate components written in other
languages.
• Widely used in the IT industry (see "Organizations Using Python").
• We will use the latest Python version in this course (Python 3.4).
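To illustrate the "glue language" point, the sketch below runs an external program from Python and parses its text output. As an assumption made so that the example runs anywhere, the Python interpreter itself (`sys.executable`) stands in for the external tool; with a real command-line tool, only the argument list would change.

```python
import subprocess
import sys

# Stand-in for an external command-line tool: here we simply start another
# Python process that prints a number. This command is a placeholder.
command = [sys.executable, "-c", "print(6 * 7)"]

# Run the external program and capture what it writes to standard output.
output = subprocess.check_output(command, universal_newlines=True)

# Glue step: convert the tool's text output into a Python value.
value = int(output.strip())
print("external tool returned", value)
```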
Setup Python Scientific Environment
• Anaconda Scientific Python Distribution
• It includes over 195 of the most popular Python packages for
science, math, engineering, and data analysis (numpy, scipy, sklearn,
matplotlib).
• Cross-platform
• No need to install scientific packages one by one
• Default IDE is weak. Recommended IDEs:
• Sublime Text (recommended)
• PyCharm (recommended)
• Eclipse + pydev (cross platform)
• Or simply the Notepad++ editor with syntax highlighting (Windows
only)
Learn Python
• The official Python tutorial. Written for experienced
programmers.
• Read it twice and try every code snippet in the tutorial.
• Code Like a Pythonista: Idiomatic Python
• Python Howto: sort, logging, functional programming, etc.
• MIT 6.00 course material.
• Liang Huang’s Python Short Course.
• numpy examples and scipy tutorial.
• Best place to ask a Python-related question:
http://stackoverflow.com/. It is better to send your Python
question to Stack Overflow than to our mailing list.
Learn Python (Books)
• A Byte of Python
• Learning Python
• Python Cookbook
• Moving from Python2 to Python3
Play with Python data structures
• basic types: bool, integer, float, complex
• tuple: (x, y, ..)
• list: [x, y, ...]
• string: 'hello', "world"
• dictionary: { x: a, y: b, ... }
• set: set([a, b, c, d])
• iterable/sequence: a unified view for data structures
• tuple/list/dictionary/set/string are all iterable.
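The short sketch below creates one value of each structure listed above and shows the "unified view": a single function that relies only on iteration works on all of them. The variable names and the `count_items` helper are made up for illustration.

```python
point = (3, 4)                  # tuple: fixed-size, immutable
scores = [70, 85, 90]           # list: ordered, mutable
name = "hello"                  # string: immutable sequence of characters
ages = {"ann": 20, "bob": 22}   # dictionary: key -> value mapping
seen = set([1, 2, 2, 3])        # set: unordered, duplicates removed

scores.append(100)              # lists can grow in place
ages["cat"] = 19                # dictionaries accept new keys


def count_items(iterable):
    """Count elements using only iteration, so any iterable works."""
    total = 0
    for _ in iterable:
        total += 1
    return total


# Iterating over a dictionary yields its keys.
counts = [count_items(x) for x in (point, scores, name, ages, seen)]
print(counts)
```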
Learning By Doing
• 1. Go through basic Python data structures and their
operations.
• 2. Show Python's functions and control structures
(if-then-else/for/while).
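As a first taste of step 2, here is a small sketch with one function per control structure. The function names are hypothetical; note that Python spells the conditional as `if`/`elif`/`else`, with no `then` keyword.

```python
def classify(n):
    """Conditional: describe the sign of a number."""
    if n < 0:
        return "negative"
    elif n == 0:
        return "zero"
    else:
        return "positive"


def sum_of_squares(numbers):
    """for loop: visit every element of a sequence."""
    total = 0
    for x in numbers:
        total += x * x
    return total


def halving_steps(n):
    """while loop: repeat until a condition stops holding."""
    steps = 0
    while n > 0:
        n //= 2  # integer division
        steps += 1
    return steps


print(classify(-5), classify(0), classify(7))
print(sum_of_squares([1, 2, 3]))
print(halving_steps(10))
```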