Presented by:
Group 17
AIA 8803 Course
Feb 28, 2008
Large Amount of Deep Web Content
Refers to World Wide Web content that is not part of the surface Web indexed by search engines (Bergman, 2001)
In 2000, it was estimated that the deep Web contained approximately 7,500 terabytes of data and 550 billion individual documents
Characteristics of Deep Web Data:
Mostly generated by backend databases
Intrinsically structured: organized behind a database schema
Deep web crawling
Iterative querying
Deep web mining
Attribute labeling
Advanced search
Database construction
Object-level search
Comparison
Why is it difficult in the dynamic web space?
Hidden Web, Deep Web
Different from a traditional web crawler, which traverses a hyperlink graph with BFS or DFS to crawl web pages
Seed-based crawler
Seed Crawl
New Seed Crawl
…
Initial seed: car
New seeds: Lincoln, Deluxe, TracRac, Truck, SUV
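The seed-based loop above can be sketched in Python. This is a minimal illustration, not the crawler used in the course project: `TOY_DATABASE`, `query_site`, and `seed_crawl` are hypothetical names, and an in-memory list stands in for a real site's search form.

```python
from collections import deque

# Hypothetical stand-in for a deep-web site; a real crawler would submit
# each keyword through the site's search form instead.
TOY_DATABASE = [
    {"title": "Lincoln Town Car", "category": "car"},
    {"title": "Deluxe car cover", "category": "car"},
    {"title": "Deluxe TracRac rack", "category": "truck"},
    {"title": "Lincoln Navigator", "category": "SUV"},
]

def query_site(keyword):
    """Return records whose title words or category match the keyword."""
    kw = keyword.lower()
    return [r for r in TOY_DATABASE
            if kw in r["title"].lower().split() or kw == r["category"].lower()]

def seed_crawl(initial_seed, max_queries=50):
    """Query iteratively, turning words found in the results into new seeds."""
    pending = deque([initial_seed.lower()])
    issued = set()      # seeds already sent as queries
    harvested = {}      # title -> record
    while pending and len(issued) < max_queries:
        seed = pending.popleft()
        if seed in issued:
            continue
        issued.add(seed)
        for record in query_site(seed):
            harvested[record["title"]] = record
            for word in record["title"].split() + [record["category"]]:
                if word.lower() not in issued:
                    pending.append(word.lower())
    return harvested, issued
```

Starting from the seed "car", the first query surfaces records whose words ("lincoln", "deluxe") become the next seeds, which in turn reach records the first query could not see.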
What we have:
Large amount of web pages gathered from the crawler
Machine Learning / Data Mining techniques
What we need:
A structured database for web application
Problem
Different web sites may have different layouts
Conditional Random Fields (CRFs)
An undirected graphical model
X (Gray nodes): observations
Features extracted from the crawled web pages
Y (White nodes): hidden states
Labels
Product name, price, customer rating, etc.
CRF models the conditional probability p(y|x)
Key advantage
Rich, correlated feature sets
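To make p(y|x) concrete, here is a toy sketch in Python. It assumes a linear-chain CRF (the slides only say "undirected graphical model"), uses hypothetical hand-picked feature functions and weights rather than learned ones, and computes the normalizer by brute-force enumeration, which is only feasible for very short sequences.

```python
import itertools
import math

LABELS = ["name", "price", "rating"]

def features(prev_label, label, token):
    """Hypothetical binary features over (previous label, label, observation)."""
    return {
        "price_has_dollar": label == "price" and token.startswith("$"),
        "rating_has_stars": label == "rating" and token.endswith("stars"),
        "name_is_capitalized": label == "name" and token[:1].isupper(),
        "price_follows_name": prev_label == "name" and label == "price",
    }

# Illustrative weights only; a real CRF learns these from labeled pages.
WEIGHTS = {
    "price_has_dollar": 3.0,
    "rating_has_stars": 3.0,
    "name_is_capitalized": 1.5,
    "price_follows_name": 1.0,
}

def score(labels, tokens):
    """Sum the weights of all features that fire along the sequence."""
    total, prev = 0.0, None
    for label, token in zip(labels, tokens):
        for name, fired in features(prev, label, token).items():
            if fired:
                total += WEIGHTS[name]
        prev = label
    return total

def conditional_probability(labels, tokens):
    """p(y|x) = exp(score(y, x)) / Z(x), with Z summed over every labeling."""
    z = sum(math.exp(score(candidate, tokens))
            for candidate in itertools.product(LABELS, repeat=len(tokens)))
    return math.exp(score(labels, tokens)) / z

def best_labeling(tokens):
    """Return the highest-scoring label sequence (exhaustive search)."""
    return max(itertools.product(LABELS, repeat=len(tokens)),
               key=lambda candidate: score(candidate, tokens))
```

Because features may look at the whole observation and overlap freely (for example, a token pattern and a label transition), this shows the "rich, correlated feature sets" advantage in miniature.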
Data fusion will be necessary where multiple copies of data exist across sites
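One simple way to fuse duplicate records is sketched below in Python. The matching key (a normalized name) and the majority vote per attribute are assumptions chosen for illustration; the slides do not specify a fusion method, and `normalize` and `fuse` are hypothetical names.

```python
from collections import Counter, defaultdict

def normalize(name):
    """Crude matching key: lowercase, keep only letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def fuse(records):
    """Merge records sharing a normalized name; majority vote per attribute."""
    groups = defaultdict(list)
    for record in records:
        groups[normalize(record["name"])].append(record)

    fused = {}
    for key, group in groups.items():
        merged = {}
        attributes = {attr for record in group for attr in record}
        for attr in sorted(attributes):
            values = [r[attr] for r in group if r.get(attr) is not None]
            if values:
                # Most frequent value wins; ties go to the first one seen.
                merged[attr] = Counter(values).most_common(1)[0][0]
        fused[key] = merged
    return fused
```

Real sites disagree in messier ways (units, abbreviations, stale values), so a production system would need stronger record matching than this string key.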
Web object extraction and mining
Structured databases of web objects
Improve state-of-the-art Web search
Make some money
1. Object-level Web search: combine different features or attributes of an identical Web object from different Web sites to respond to a user query
DBLP (manual but highly precise); CiteSeer (automatic but less precise)
The challenge is how to build a precise and automatic object-level search platform (DBLP?)
2. Comparison Web search: compare attributes (e.g., price, performance) of Web objects across different sites or sources
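Both kinds of search can be sketched over a small object index in Python. Everything here is hypothetical: the `LISTINGS` data, the `.example` site names, and the function names are made up for illustration, and exact-name grouping stands in for real object matching.

```python
from collections import defaultdict

# Hypothetical records, as they might come out of the extraction step.
LISTINGS = [
    {"site": "shop-a.example", "object": "acme cam 100", "price": 199.0, "rating": 4.5},
    {"site": "shop-b.example", "object": "acme cam 100", "price": 189.0},
    {"site": "shop-c.example", "object": "acme cam 100", "price": 205.0, "megapixels": 10},
    {"site": "shop-a.example", "object": "other phone", "price": 99.0},
]

def build_object_index(listings):
    """Group per-site listings by the object they describe."""
    index = defaultdict(list)
    for listing in listings:
        index[listing["object"]].append(listing)
    return index

def object_search(index, query):
    """Object-level search: one combined answer per object, not per page."""
    results = []
    for name, listings in index.items():
        if query.lower() in name:
            combined = {"object": name, "sites": [l["site"] for l in listings]}
            for listing in listings:
                for attr, value in listing.items():
                    if attr not in ("site", "object", "price"):
                        combined.setdefault(attr, value)
            results.append(combined)
    return results

def compare_prices(index, name):
    """Comparison search: (price, site) pairs for one object, cheapest first."""
    return sorted((l["price"], l["site"]) for l in index.get(name, []))
```

A query for "acme" returns a single object whose rating comes from one site and megapixel count from another, while `compare_prices` ranks the same object's offers across sites.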
Building a LAMP Server
"LAMP" system: Linux, Apache, MySQL and PHP.
1. low acquisition cost
2. ubiquity of its components
Fancy restaurant (dynamic web server)
Apache: chef.
PHP: waiter.
MySQL: stockroom of ingredients
When a patron (or Web site visitor) comes to your restaurant, he or she sits down and orders a meal with specific requirements.
The waiter (PHP) takes those specific requirements back to the kitchen and passes them off to the chef (Apache).
The chef then goes to the stockroom (MySQL) to retrieve the ingredients (or data) to prepare the meal and presents the final dish to the patron, exactly the way he or she ordered the meal.
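Setting the analogy aside, the underlying flow (an order comes in, data is looked up, a page is built to order) can be sketched in Python. This is not LAMP code: the standard library's `sqlite3` stands in for MySQL, a plain function stands in for a PHP script, and the table and function names are hypothetical.

```python
import sqlite3

def make_stockroom():
    """In-memory SQLite database standing in for the MySQL 'stockroom'."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE dishes (name TEXT, diet TEXT)")
    db.executemany(
        "INSERT INTO dishes VALUES (?, ?)",
        [("pasta", "vegetarian"), ("steak", "any"), ("salad", "vegetarian")],
    )
    db.commit()
    return db

def handle_order(db, diet):
    """Build a page to order: fetch matching rows, then render them as HTML."""
    rows = db.execute(
        "SELECT name FROM dishes WHERE diet = ? ORDER BY name", (diet,)
    ).fetchall()
    items = "".join(f"<li>{name}</li>" for (name,) in rows)
    return f"<ul>{items}</ul>"
```

Calling `handle_order(make_stockroom(), "vegetarian")` produces a list built from the stored rows at request time, which is what makes the page dynamic rather than a static file.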
Q&A