Major Web Intelligence Tools
© 2005
Web Intelligence Tools
• I. Collection
– Offline Explorer
– SpidersRUs (AI Lab)
– Google Scholar
• II. Analysis (Data and Text Mining)
– Google APIs
– Google Translation
– GATE
– Arizona Noun Phraser (AI Lab)
– Self-Organizing Map, SOM (AI Lab)
– Weka
• III. Visualization
– NetDraw
– JUNG
– Analyst’s Notebook and Starlight
Collection: Offline Explorer
• Developed by MetaProducts Corporation, Offline Explorer can download Web sites to your hard disk for offline browsing.
http://www.metaproducts.com/OE.html
• Advantages of Offline Explorer
– Save Time: Download up to 500 files simultaneously.
– Save Yesterday's Web Sites for Tomorrow's Use
– Monitor Web Sites
– Mine your Data
• The TextPipe tool in the Offline Explorer Pro edition can extract or change the desired data, or even export it to a database.
Offline Explorer
[Screenshot: the Offline Explorer project list and the project properties setup window, where the download URLs, download level, file modification check, file filters, URL filters, and other advanced properties are set.]
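The download level and URL filter settings can be pictured as a depth-limited, filtered walk over a site's links. The sketch below is only an illustration under stated assumptions, not Offline Explorer's actual behavior: the `crawl` function, the in-memory `links` map, and the example.com URLs are all hypothetical, and no network access is performed.

```python
from collections import deque


def crawl(links, start, max_level, url_filter=lambda url: True):
    """Breadth-first walk from start, following links at most max_level deep.

    links is a hypothetical in-memory map from a URL to the URLs it links to.
    """
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        url, level = queue.popleft()
        if level == max_level:
            continue  # do not follow links beyond the download level
        for target in links.get(url, []):
            if target not in seen and url_filter(target):
                seen.add(target)
                order.append(target)
                queue.append((target, level + 1))
    return order


# Hypothetical site structure used only for illustration.
links = {
    "http://example.com/": [
        "http://example.com/a",
        "http://example.com/b",
        "http://other.example.org/",
    ],
    "http://example.com/a": ["http://example.com/a/deep"],
}


def same_site(url):
    """URL filter: keep only pages under the starting site."""
    return url.startswith("http://example.com/")


level_one = crawl(links, "http://example.com/", 1, same_site)
level_two = crawl(links, "http://example.com/", 2, same_site)
```

Raising the level from 1 to 2 adds the page linked from `/a`, while the filter keeps the off-site URL out at every level.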
SpidersRUs
• The SpidersRUs Digital Library Toolkit was developed by the Artificial Intelligence Lab at the University of Arizona.
http://ai.eller.arizona.edu/spidersrus/
• It provides modular tools for spidering, indexing, and searching, so that users can build digital libraries in different languages in a simple DIY (Do-It-Yourself) way. Users can create their own search engines easily and quickly via the friendly user interface.
• SpidersRUs can automate the development of vertical search engines in different domains and languages. It can work on non-English languages such as Asian and Middle Eastern languages.
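The indexing and searching steps can be sketched with a small inverted index. This is a minimal illustration, not the SpidersRUs implementation: the function names and the three sample pages are hypothetical, and word splitting here relies on whitespace and punctuation, which would not be enough for languages written without spaces.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(hits)


# Hypothetical pages, standing in for what a spider would have collected.
docs = {
    "page1": "Digital libraries support keyword search",
    "page2": "Vertical search engines cover one domain",
    "page3": "Spiders collect pages for digital libraries",
}
index = build_index(docs)
results = search(index, "digital libraries")
```

Here `results` holds the two pages that mention both query words; a query word that appears in no page yields an empty list.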
SpidersRUs
[Screenshot: an example of a Chinese search engine built by SpidersRUs, showing the keyword search box and the search results.]
Google Scholar
• Google Scholar provides a simple way to broadly search for scholarly
literature.
http://scholar.google.com/
• Features of Google Scholar:
– Search diverse sources from one convenient place
– Find papers, abstracts and citations
– Locate the complete paper through your library or on the web
– Learn about key papers and scholars in any area of research
Google Scholar
[Screenshot: a search for “Bioterrorism” in Google Scholar. One result shows 366 citations and links to the list of papers citing it.]
Analysis: Google APIs
• Google provides many APIs to help you quickly develop your own applications.
http://code.google.com/more/
• Examples of Google APIs:
– Google API for Inlink: Discovers what pages link to your website.
– Google Data APIs: Provide a simple, standard protocol for reading and
writing data on the Web. Several Google services provide a Google Data
API, including Google Base, Blogger, Google Calendar, Google
Spreadsheets and Picasa Web Albums.
– Google AJAX Search API: Uses JavaScript to embed a simple, dynamic
Google search box and display search results in your own Web pages.
– Google Analytics: Allows users to gather, view, and analyze data about their
Website traffic. Users can see which content gets the most visits, as well as
the average page views and time on site per visit.
– Google Safe Browsing APIs: Allow client applications to check URLs
against Google's constantly-updated blacklists of suspected phishing and
malware pages.
– YouTube Data API: Integrates online videos from YouTube into your
applications.
Example: Google API for Inlink
[Screenshot: input “link URL” and search; the results list all the related inlink Web pages.]
Google Translation
• Google's Translate function.
http://www.google.com/language_tools?hl=en
• The input and output languages can be Arabic, Chinese, Dutch,
English, French, German, Greek, Italian, Japanese, Korean,
Portuguese, Russian or Spanish.
• Major functions of Google Translation include:
– Search multilingual Web pages
• Search the Internet in one language and get the results in another one.
– Translate text
• Translate free text into multiple languages.
– Translate a Web page
• Translate a Web page into multiple languages.
Google Translation
[Screenshots: searching multilingual Web pages, translating text from Arabic to English, and translating a Web page.]
GATE
• Generalised Architecture for Text Engineering (GATE) is a toolkit for text mining. It was developed by the NLP group at the University of Sheffield (UK). http://gate.ac.uk
• Information Extraction tasks:
– Named Entity Recognition (NE)
• Finds names, places, dates, etc.
– Co-reference Resolution (CO)
• Identifies identity relations between entities in texts.
– Template Element Construction (TE)
• Adds descriptive information to NE results (using CO).
– Template Relation Construction (TR)
• Finds relations between TE entities.
– Scenario Template Production (ST)
• Fits TE and TR results into specified event scenarios.
• GATE also includes:
– Parsers, stemmers, and Information Retrieval tools;
– Tools for visualizing and manipulating ontologies; and
– Evaluation and benchmarking tools.
GATE
[Screenshot: the GATE interface, showing project information, attributes, and the results display. Picture is from http://nlp.shef.ac.uk]
Arizona Noun Phraser
• The Arizona Noun Phraser was developed by the Artificial Intelligence Lab at the University of Arizona.
http://ai.arizona.edu/
• The Arizona Noun Phraser is made up of three major components: a tokenizer, a part-of-speech tagger, and a phrase generation tool. It generates precise topic descriptions.
– Tokenizer
• Separates punctuation and symbols from text without affecting content.
– Part of Speech (POS) Tagger
• Uses both lexical and contextual disambiguation in POS assignment;
• Lexicons include: Brown Corpus, Wall Street Journal, and Specialist Lexicon.
– Phrase Generation
• Uses Simple Finite State Automata (FSA) of noun phrasing rules;
• Breaks sentences and clauses into grammatically correct noun phrases.
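The three components can be sketched as a small pipeline. This is a toy illustration, not the Arizona Noun Phraser itself: the lexicon is a hypothetical hand-made table, tagging is by lookup only (no contextual disambiguation), and the automaton accepts just one assumed rule, zero or more adjectives followed by one or more nouns.

```python
import re

# Hypothetical toy lexicon; a real tagger draws on large lexicons and context.
LEXICON = {
    "the": "DET",
    "vertical": "ADJ", "digital": "ADJ",
    "toolkit": "NOUN", "search": "NOUN", "engines": "NOUN",
    "library": "NOUN", "users": "NOUN",
    "builds": "VERB", "for": "PREP",
}


def tokenize(text):
    """Separate words from punctuation, keeping both as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def tag(tokens):
    """Tag each token by lexicon lookup; unknown tokens are tagged OTHER."""
    return [(tok, LEXICON.get(tok.lower(), "OTHER")) for tok in tokens]


def noun_phrases(tagged):
    """Finite-state pass over the tags: ADJ* NOUN+ is emitted as a phrase."""
    phrases, current, state = [], [], "START"
    for word, pos in tagged + [("", "END")]:  # sentinel flushes the last phrase
        if state == "NOUN" and pos != "NOUN":
            phrases.append(" ".join(current))  # leaving the accepting state
            current, state = [], "START"
        if pos == "NOUN":
            current.append(word)
            state = "NOUN"
        elif pos == "ADJ":
            current.append(word)
            state = "ADJ"
        else:
            current, state = [], "START"  # any other tag resets the automaton
    return phrases


sentence = "The toolkit builds vertical search engines for digital library users."
phrases = noun_phrases(tag(tokenize(sentence)))
```

For the sample sentence, `phrases` contains "toolkit", "vertical search engines", and "digital library users".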
SOM
• The multi-level self-organizing map neural network
algorithm was developed by the Artificial Intelligence Lab
at the University of Arizona.
– On a 2D map display, similar topics are
positioned closer together according to their co-occurrence
patterns; more important topics occupy larger
regions.
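The idea of placing similar items near each other on a 2D map can be sketched with a basic, single-level self-organizing map. This is not the AI Lab's multi-level algorithm: the grid size, learning rate, radius schedule, and the five sample vectors are all illustrative assumptions.

```python
import math
import random


def best_matching_unit(grid, vector):
    """Return the (row, col) of the cell whose weights are closest to vector."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(grid):
        for c, weights in enumerate(row):
            dist = sum((w - v) ** 2 for w, v in zip(weights, vector))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best


def train_som(data, rows=4, cols=4, epochs=50, lr0=0.5, radius0=2.0, seed=0):
    """Train a small 2D map; every hyperparameter here is illustrative."""
    rng = random.Random(seed)
    dim = len(data[0])
    grid = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
            for _ in range(rows)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                    # learning rate decays
        radius = max(radius0 * (1.0 - frac), 0.5)  # neighborhood shrinks
        for vector in data:
            br, bc = best_matching_unit(grid, vector)
            for r in range(rows):
                for c in range(cols):
                    grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
                    influence = math.exp(-grid_dist2 / (2 * radius ** 2))
                    cell = grid[r][c]
                    # Pull the cell toward the input, more strongly near the BMU.
                    for i in range(dim):
                        cell[i] += lr * influence * (vector[i] - cell[i])
    return grid


# Hypothetical topic vectors (e.g. normalized co-occurrence features).
data = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0],
        [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]]
grid = train_som(data)
positions = [best_matching_unit(grid, v) for v in data]
```

After training, `positions` gives each input's cell on the map; inputs with similar vectors tend to land in the same or neighboring cells, though the exact layout depends on the seed and schedule.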
SOM
Example: FMD Paper Content Map (2001~2005), developed by the AI Lab at the University of Arizona
[Figure: content map. Labels mark the topic, the topic region, the number of documents belonging to each topic, and the different topics. Warm colors represent new topics.]
Weka
• Weka was developed at the University of Waikato in New Zealand.
http://www.cs.waikato.ac.nz/~ml/
• Tools include:
– Data preprocessing (e.g., Data Filters),
– Classification (e.g., BayesNet, KNN, C4.5 Decision Tree, Neural
Networks, SVM),
– Regression (e.g., Linear Regression, Isotonic Regression, SVM for
Regression),
– Clustering (e.g., Simple K-means, Expectation Maximization (EM),
Farthest First),
– Association rules (e.g., Apriori Algorithm, Predictive Accuracy,
Confirmation Guided),
– Feature Selection (e.g., Cfs Subset Evaluation, Information Gain, Chi-squared Statistic), and
– Visualization (e.g., View different two-dimensional plots of the data).
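As one concrete instance of the feature selection tools listed above, information gain scores an attribute by how much it reduces the entropy of the class labels. The sketch below computes it directly in Python; it does not call Weka, and the function names and the four-row sample data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus the entropy left after splitting on values."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical data: one attribute predicts the class, the other does not.
labels = ["yes", "yes", "no", "no"]
predictive = ["a", "a", "b", "b"]
uninformative = ["a", "b", "a", "b"]

gain_predictive = information_gain(predictive, labels)
gain_uninformative = information_gain(uninformative, labels)
```

The attribute that separates the classes perfectly recovers the full one bit of label entropy, while the attribute that is independent of the class gains nothing, so a ranking by information gain would keep the first and drop the second.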
Weka
[Screenshot: the Weka interface, showing the different analysis tools, the different attributes to choose from, and the value set of the chosen attribute with the number of input items for each value.]
Visualization: NetDraw
• NetDraw is an open source program written by Steve Borgatti
from Analytic Technologies for visualizing both 1-mode and 2-mode social network data.
http://www.analytictech.com/downloadnd.htm
• It can handle multiple relations at the same time, and can use node
attributes to set the colors, shapes, and sizes of nodes. Pictures
can be saved in metafile, jpg, gif and bitmap formats.
• Two basic kinds of layouts are implemented: a circle and an
MDS/spring embedding based on geodesic distance. You
can also rotate, flip, shift, resize and zoom configurations.
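The simpler of the two layouts, the circle, amounts to spacing the nodes at equal angles around a center. The sketch below shows that geometry only; it is not NetDraw's code, and the `circle_layout` function and node names are hypothetical.

```python
import math


def circle_layout(nodes, radius=1.0):
    """Place nodes at equal angles around a circle centered on the origin."""
    n = len(nodes)
    return {
        node: (radius * math.cos(2 * math.pi * i / n),
               radius * math.sin(2 * math.pi * i / n))
        for i, node in enumerate(nodes)
    }


# Hypothetical four-person network; only the node names matter for the layout.
positions = circle_layout(["ann", "bob", "cat", "dan"])
```

Each node ends up at the same distance from the center, so no node looks more central than another; a spring or MDS layout would instead move nodes according to their distances in the network.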
NetDraw
[Screenshot: the NetDraw interface, showing the different functions, the display setup of the nodes and relations, and the network itself, with nodes representing individuals and links representing relations.]
JUNG
• The Java Universal Network/Graph Framework (JUNG) is a software
library for the modeling, analysis, and visualization of data that can be
represented as a graph or network. It was developed by the School of
Information and Computer Science at the University of California, Irvine.
http://jung.sourceforge.net/index.html
• The current distribution of JUNG includes implementations of a number
of algorithms from graph theory, data mining, and social network
analysis:
– Clustering
– Decomposition
– Optimization
– Random Graph Generation
– Statistical Analysis
– Calculation of Network Distances and Flows and Importance
Measures (Centrality, PageRank, HITS, etc.).
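To make one of the importance measures above concrete, PageRank can be computed by power iteration over a link graph. The sketch below is written in plain Python and does not use or mirror the JUNG API; the `pagerank` function, the damping value of 0.85, the iteration count, and the four-node graph are illustrative assumptions.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    graph maps each node to the list of nodes it links to; every link
    target is assumed to also appear as a key.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical network: C is linked to by A, B, and D; nothing links to D.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(graph)
top_node = max(rank, key=rank.get)
```

In this graph C receives the most incoming weight and ranks highest, while D, which nothing links to, keeps only the baseline share that every node receives.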
JUNG
[Figures: examples of visualization types. Pictures are from http://jung.sourceforge.net/index.html]
Analyst’s Notebook & Starlight
• Analyst’s Notebook, by i2: A 2D graph and timeline
layout tool for crime and intelligence analysis
• Starlight, by Pacific Northwest Lab (PNL): A 3D
network visualization and navigation tool for
intelligence analysis
[Screenshots: Analyst’s Notebook (i2) and Starlight (PNL).]