Introduction to Weka and NetDraw MIS510 Spring 2009

Outline
• Weka
– Introduction
– Weka Tools/Functions
– How to use Weka?
• Weka Data File Format (Input)
• Weka for Data Mining
• Sample Output from Weka (Output)
– Conclusion
• NetDraw
– Introduction
– How to use NetDraw?
• NetDraw Input Data File Format
• Draw Networks using NetDraw
– Conclusion
Weka
Introduction to Weka
(Data Mining Tool)
• Weka was developed at the University of
Waikato in New Zealand.
http://www.cs.waikato.ac.nz/ml/weka/
• Weka is an open-source data mining tool
developed in Java. It is used for research,
education, and applications. It runs on
Windows, Linux, and Mac.
What can Weka do?
• Weka is a collection of machine learning
algorithms for data mining tasks. The
algorithms can either be applied directly to
a dataset (using the GUI) or called from your
own Java code (using the Weka Java library).
• Weka contains tools for data preprocessing, classification, regression,
clustering, association rules, and
visualization. It is also well-suited for
developing new machine learning schemes.
Weka Tools/Functions
• Tools (or functions) in Weka include:
– Data preprocessing (e.g., Data Filters),
– Classification (e.g., BayesNet, KNN, C4.5 Decision Tree, Neural
Networks, SVM),
– Regression (e.g., Linear Regression, Isotonic Regression, SVM for
Regression),
– Clustering (e.g., Simple K-means, Expectation Maximization (EM)),
– Association rules (e.g., Apriori Algorithm, Predictive Accuracy,
Confirmation Guided),
– Feature Selection (e.g., Cfs Subset Evaluation, Information Gain, Chi-squared Statistic), and
– Visualization (e.g., View different two-dimensional plots of the data).
Weka’s Role in the Big Picture
Input
• Raw data
Data Mining by Weka
• Pre-processing
• Classification
• Regression
• Clustering
• Association Rules
• Visualization
Output
• Result
How to use Weka?
• Weka Data File Format (Input)
• Weka for Data Mining
• Sample Output from Weka (Output)
Weka Data File Format (Input)
• The most popular input data format for Weka is “arff” (with “arff” being
the file extension of your input data file).
FILE FORMAT
@relation RELATION_NAME
@attribute ATTRIBUTE_NAME ATTRIBUTE_TYPE
@attribute ATTRIBUTE_NAME ATTRIBUTE_TYPE
@attribute ATTRIBUTE_NAME ATTRIBUTE_TYPE
@attribute ATTRIBUTE_NAME ATTRIBUTE_TYPE
@data
DATAROW1
DATAROW2
DATAROW3
Example of “arff” Input File
@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
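To make the layout of an “arff” file concrete, here is a minimal plain-Java sketch that reads the header and data sections of a file like the example above. It is not Weka's own loader: the class name ArffSketch and its fields are hypothetical, and it assumes one declaration per line, comma-separated data rows, and no quoted values. Lines starting with % are treated as comments, and a “?” cell is treated as a missing value.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal ARFF reader sketch (hypothetical helper, not part of Weka).
// Assumes one declaration per line and no quoted names or values.
class ArffSketch {
    String relation;
    final List<String> attributeNames = new ArrayList<>();
    final List<String> attributeTypes = new ArrayList<>();
    final List<String[]> rows = new ArrayList<>();

    static ArffSketch parse(String text) {
        ArffSketch arff = new ArffSketch();
        boolean inData = false;
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("%")) continue; // skip blanks and comments
            String lower = line.toLowerCase();
            if (inData) {
                // Every line after @data is one comma-separated data row
                String[] cells = line.split(",");
                for (int i = 0; i < cells.length; i++) cells[i] = cells[i].trim();
                arff.rows.add(cells);
            } else if (lower.startsWith("@relation")) {
                arff.relation = line.substring("@relation".length()).trim();
            } else if (lower.startsWith("@attribute")) {
                // "@attribute NAME TYPE": the type is everything after the name
                String[] parts = line.substring("@attribute".length()).trim().split("\\s+", 2);
                arff.attributeNames.add(parts[0]);
                arff.attributeTypes.add(parts.length > 1 ? parts[1] : "");
            } else if (lower.startsWith("@data")) {
                inData = true;
            }
        }
        return arff;
    }

    // A "?" cell marks a missing value
    static boolean isMissing(String cell) {
        return "?".equals(cell);
    }

    public static void main(String[] args) {
        String text = String.join("\n",
            "@relation heart-disease-simplified",
            "@attribute age numeric",
            "@attribute sex { female, male}",
            "@attribute cholesterol numeric",
            "@attribute class { present, not_present}",
            "@data",
            "63,male,233,not_present",
            "38,female,?,not_present");
        ArffSketch arff = parse(text);
        System.out.println(arff.relation + ": " + arff.attributeNames
            + ", " + arff.rows.size() + " rows");
    }
}
```

In a real project the Weka library does this parsing for you; the sketch only shows which part of the file carries which information.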
Weka for Data Mining
• There are two main ways to use Weka for
your data mining tasks.
– Use the Weka Graphical User Interface (GUI)
• The GUI is straightforward and easy to use, but it is not flexible: it
cannot be called from your own application.
– Import the Weka Java library into your own Java application.
• Developers can use the Weka Java library to develop
software, or modify the source code to meet special
requirements. This is more flexible and advanced, but not as
easy to use as the GUI.
Weka GUI
[Screenshot of the Weka GUI, with callouts for: the different analysis tools/functions; the different attributes to choose from; and the value set of the chosen attribute, with the number of input items for each value.]
Weka GUI
[Screenshot of the Weka GUI showing the classification algorithms.]
Import Weka Java library to your own
Java application
• Three sets of classes you may need to use
when developing your own application
– Classes for Loading Data
– Classes for Classifiers
– Classes for Evaluation
Classes for Loading Data
• Related Weka classes
– weka.core.Instances
– weka.core.Instance
– weka.core.Attribute
• How to load input data file into instances?
– Each data row -> Instance, each attribute ->
Attribute, the whole dataset -> Instances
// Load an ARFF file as Instances (needs java.io.FileReader and weka.core.Instances)
FileReader reader = new FileReader(path);
Instances instances = new Instances(reader);
reader.close();
Classes for Loading Data
• Instances contains Attribute and Instance
– How to get every Instance within the Instances?
// Get the Instance at a given index
Instance instance = instances.instance(index);
// Get the number of Instances
int instanceCount = instances.numInstances();
– How to get an Attribute?
// Get the Attribute at a given index, and its name
Attribute attribute = instances.attribute(index);
String name = attribute.name();
// Get the number of Attributes
int attributeCount = instances.numAttributes();
Classes for Loading Data
– How to get the Attribute value of each Instance?
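// Get a value as a double (for a nominal attribute, this is the index of the value)
instance.value(index);
or
instance.value(attribute); // takes an Attribute object, not an attribute name
– Class Index (Very important!)
// Get the class index (negative if no class attribute has been set)
instances.classIndex();
or
instances.classAttribute().index();
// Set the class index (do this before building a classifier)
instances.setClass(attribute);
or
instances.setClassIndex(index);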
Classes for Classifiers
• Weka classes for C4.5, Naïve Bayes, and SVM
– Classifier: all classes which extend
weka.classifiers.Classifier
• C4.5: weka.classifiers.trees.J48
• NaiveBayes: weka.classifiers.bayes.NaiveBayes
• SVM: weka.classifiers.functions.SMO
• How to build a classifier?
// Build a C4.5 classifier
Classifier c = new weka.classifiers.trees.J48();
c.buildClassifier(trainingInstances);
// Build an SVM classifier
Classifier e = new weka.classifiers.functions.SMO();
e.buildClassifier(trainingInstances);
Classes for Evaluation
• Related Weka classes
– weka.classifiers.CostMatrix
– weka.classifiers.Evaluation
• How to use the evaluation classes?
// Evaluate classifier c on the testing instances
CostMatrix costMatrix = null;
Evaluation eval = new Evaluation(testingInstances, costMatrix);
for (int i = 0; i < testingInstances.numInstances(); i++) {
    eval.evaluateModelOnceAndRecordPrediction(c, testingInstances.instance(i));
}
// Print the results once, after all instances have been evaluated
System.out.println(eval.toSummaryString(false));
System.out.println(eval.toClassDetailsString());
System.out.println(eval.toMatrixString());
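The matrix and summary printed by the evaluation classes are built from simple counts of actual versus predicted classes. The plain-Java sketch below shows that counting for integer class indices. It is an illustration only, not Weka's Evaluation implementation: the class ConfusionSketch and its methods are hypothetical names, and classes are assumed to be numbered 0 to numClasses - 1.

```java
// Sketch of the counts behind an evaluation summary
// (hypothetical helper, not Weka's Evaluation class).
class ConfusionSketch {
    // matrix[a][p] = number of test instances of actual class a predicted as class p
    static int[][] confusionMatrix(int[] actual, int[] predicted, int numClasses) {
        int[][] matrix = new int[numClasses][numClasses];
        for (int i = 0; i < actual.length; i++) {
            matrix[actual[i]][predicted[i]]++;
        }
        return matrix;
    }

    // Accuracy = correctly classified instances (the diagonal) / all instances
    static double accuracy(int[][] matrix) {
        int correct = 0;
        int total = 0;
        for (int a = 0; a < matrix.length; a++) {
            for (int p = 0; p < matrix[a].length; p++) {
                total += matrix[a][p];
                if (a == p) correct += matrix[a][p];
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        int[] actual    = {0, 0, 1, 1, 1};
        int[] predicted = {0, 1, 1, 1, 0};
        int[][] matrix = confusionMatrix(actual, predicted, 2);
        System.out.println("accuracy = " + accuracy(matrix));
    }
}
```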
Classes for Evaluation
• Cross Validation
– In cross validation, we split a single
dataset into N equal shares (folds). In each of N rounds,
one share is held out as the testing dataset and the
remaining N-1 shares form the training dataset.
– The most widely used setting is 10-fold cross
validation.
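The splitting idea can be sketched in plain Java on instance indices alone. This is a simplified illustration under stated assumptions, not Weka's implementation: FoldSketch, makeFolds, and trainingIndices are hypothetical names, and the sketch shuffles and deals indices round-robin without any stratification by class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// N-fold splitting on indices (hypothetical helper, no stratification).
class FoldSketch {
    // Shuffle indices 0..n-1, then deal them round-robin into numFolds folds
    static List<List<Integer>> makeFolds(int n, int numFolds, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) indices.add(i);
        Collections.shuffle(indices, new Random(seed));
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < numFolds; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < n; i++) folds.get(i % numFolds).add(indices.get(i));
        return folds;
    }

    // Training set for one round: every fold except the held-out test fold
    static List<Integer> trainingIndices(List<List<Integer>> folds, int testFold) {
        List<Integer> train = new ArrayList<>();
        for (int f = 0; f < folds.size(); f++) {
            if (f != testFold) train.addAll(folds.get(f));
        }
        return train;
    }

    public static void main(String[] args) {
        List<List<Integer>> folds = makeFolds(10, 5, 42L);
        for (int round = 0; round < folds.size(); round++) {
            System.out.println("round " + round + ": test=" + folds.get(round)
                + " train=" + trainingIndices(folds, round));
        }
    }
}
```

Each index lands in exactly one test fold, so over the N rounds every instance is used for testing once and for training N-1 times.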
Classes for Evaluation
• How to obtain the training dataset and the
testing dataset?
Random random = new Random(seed);
instances.randomize(random);
instances.stratify(N); // the class index must be set before calling stratify
for (int i = 0; i < N; i++)
{
Instances train = instances.trainCV(N, i, random);
Instances test = instances.testCV(N, i); // testCV takes no Random argument
}
Sample Output from Weka
Conclusion about Weka
• In sum, the overall goal of Weka is to build a state-of-the-art facility for developing machine learning (ML)
techniques and to allow people to apply them to real-world
data mining problems.
• Detailed documentation about the different functions provided
by Weka can be found on the Weka website.
• WEKA is available at:
http://www.cs.waikato.ac.nz/ml/weka
NetDraw
Introduction to NetDraw
(Visualization Tool)
• NetDraw is a free program written by
Steve Borgatti from Analytic Technologies. It is
often used for visualizing both 1-mode and 2-mode
social network data.
• You can download it from:
http://www.analytictech.com/downloadnd.htm
• (Compared to Weka, it is much easier to use :P)
What can NetDraw do?
• NetDraw can:
– handle multiple relations at the same time, and
– use node attributes to set colors, shapes, and sizes of nodes.
• Pictures can be saved in metafile, jpg, gif and bitmap formats.
• Two basic kinds of layouts are implemented: a circle and an MDS
based on geodesic distance.
• You can also rotate, flip, shift, resize and zoom configurations.
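The geodesic distance between two nodes is the number of links on a shortest path between them. As a rough illustration of the quantity an MDS layout of this kind starts from, the plain-Java sketch below computes geodesic distances from one node with a breadth-first search. It is not NetDraw's code: GeodesicSketch and distancesFrom are hypothetical names, nodes are assumed to be numbered from 0, and ties are assumed to be unweighted.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Geodesic (shortest-path) distances by breadth-first search
// (hypothetical helper, not taken from NetDraw).
class GeodesicSketch {
    // adjacency.get(v) lists the neighbors of node v.
    // Returns the hop count from source to every node, or -1 if unreachable.
    static int[] distancesFrom(int source, List<List<Integer>> adjacency) {
        int[] dist = new int[adjacency.size()];
        Arrays.fill(dist, -1);
        dist[source] = 0;
        Deque<Integer> queue = new ArrayDeque<>();
        queue.add(source);
        while (!queue.isEmpty()) {
            int v = queue.remove();
            for (int w : adjacency.get(v)) {
                if (dist[w] == -1) {       // first visit is along a shortest path
                    dist[w] = dist[v] + 1;
                    queue.add(w);
                }
            }
        }
        return dist;
    }

    public static void main(String[] args) {
        // Path 0 - 1 - 2, plus an isolated node 3
        List<List<Integer>> adjacency = Arrays.asList(
            Arrays.asList(1),
            Arrays.asList(0, 2),
            Arrays.asList(1),
            Arrays.<Integer>asList());
        System.out.println(Arrays.toString(distancesFrom(0, adjacency))); // prints [0, 1, 2, -1]
    }
}
```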
How to use NetDraw?
• NetDraw Input Data File Format
• Draw Networks using NetDraw
NetDraw Input Data File Format
“vna” Data Format
The VNA data format (with “vna” being the extension name of the input data file) allows users to
store not only network data but also attributes of the nodes, along with information about how to
display them (color, size, etc.).
*node data
"ID", num
"$10 Gift Card off REGIS SALON (SALON SERVICES) + E" 2
"$10 iTunes Gift Certificate exp 9/2008" 2
"$10 STARBUCKS gift CARD CERTIFICATE" 3
"$10 Target Gift Card" 3
"$10.00 iTunes Music Gift Card - Free Shipping" 2
"$100 Best Buy Gift Card" 15
"$100 Gap Gift Card - FREE Shipping" 9
……………………
*Tie data
FROM TO "Strength"
"Home Depot Gift Card $500." "$100 Home Depot Gift Card Accepted Nationwide" 1
"** $250 Best Buy GiftCard Gift Card Gift Certifica" "$25 Best Buy Gift Card for Store or Online!" 1
"$50 Bed Bath & Beyond Gift Card - FREE SHIPPING!" "$200 Cost Plus World Market Gift Card 4 Jewelry Be" 1
"$500.00 Best Buy gift certificate" "$15 Best Buy Gift Card *Free Shipping*" 1
"$25 Best Buy Gift Card for Store or Online!" "$15 Best Buy Gift Card *Free Shipping*" 1
"Bath and Body Works $25 Gift Card" "$200 Cost Plus World Market Gift Card 4 Jewelry Be" 1
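If your node and tie data live in your own program, a file in this layout can be generated with a few lines of plain Java. The sketch below mirrors the header lines and row layout of the sample above; VnaSketch and toVna are hypothetical names, and the sketch assumes labels contain no double-quote characters, since it does no escaping.

```java
// Builds text in the layout of the "vna" sample above
// (hypothetical helper; labels must not contain double quotes).
class VnaSketch {
    static String quote(String label) {
        return "\"" + label + "\"";
    }

    // nodeIds[i] has attribute value nodeValues[i];
    // ties[j] = {from, to} has strength strengths[j].
    static String toVna(String[] nodeIds, int[] nodeValues, String[][] ties, int[] strengths) {
        StringBuilder sb = new StringBuilder();
        sb.append("*node data\n");
        sb.append("\"ID\", num\n");
        for (int i = 0; i < nodeIds.length; i++) {
            sb.append(quote(nodeIds[i])).append(' ').append(nodeValues[i]).append('\n');
        }
        sb.append("*Tie data\n");
        sb.append("FROM TO \"Strength\"\n");
        for (int j = 0; j < ties.length; j++) {
            sb.append(quote(ties[j][0])).append(' ')
              .append(quote(ties[j][1])).append(' ')
              .append(strengths[j]).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] nodes = {"$100 Best Buy Gift Card", "$100 Gap Gift Card - FREE Shipping"};
        int[] values = {15, 9};
        String[][] ties = {{nodes[0], nodes[1]}};
        int[] strengths = {1};
        System.out.print(toVna(nodes, values, ties, strengths));
    }
}
```

Write the returned string to a file with the “vna” extension and check it against the NetDraw documentation before loading it.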
Draw Networks using NetDraw
[Screenshot of the NetDraw window, with callouts for: the different functions; the network itself, with nodes representing the individuals and links representing the relations; and the display setup of the nodes and relations.]
Analysis Example:
Hot Item Analysis based on gift card sales information from eBay
• Each circle in the graph represents an active item in the database.
• The label of the circle is the item title.
• The bigger the circle and its label, the hotter the item.
• Items are clustered together based on brand information.
[Figure: Hot Topics during April 15 – April 22, 2007]
[Figure: Hot Topics during April 22 – April 29, 2007]
Conclusion
• In sum, NetDraw can be used for social network visualization.
• There are a lot of parameters to play with in the tool. The results can be
saved as EMF, WMF, BMP and JPG files.
• NetDraw is available at:
http://www.analytictech.com/downloadnd.htm
• The website also provides detailed documentation.
• If you are interested, you may also try other visualization tools
such as JUNG (http://jung.sourceforge.net/) and GraphViz
(http://www.graphviz.org/).
Some Suggestions
• Carefully prepare your data according to the input
format required by each tool.
• Read the documentation of each tool that you
decide to use and understand its functionality.
Think about how it can be applied to your project.
• Download and play with the tools. You cannot
learn anything unless you try them yourself!
Thanks!
Good luck with your
projects!