Web mining (part 2)

Web Mining
Shah Mohammad Nur Alam Sawn
03/03/2014
What is Web Mining?
Discovering desired and useful information
from the World Wide Web
Exploiting Geographical Location Information of
Web Pages
• Orkut Buyukkokten (orkut@cs.stanford.edu)
• Junghoo Cho (cho@cs.stanford.edu)
• Hector Garcia-Molina (hector@cs.stanford.edu)
• Luis Gravano (gravano@cs.columbia.edu)
Department of Computer Science, Stanford
University, Stanford, CA 94305.
Department of Computer Science, Columbia
University, New York, NY 10027.
(December 27, 2008)
“Proof of Concept” using mapping databases
Ways of exploiting geographical information from the Internet:
Improving search engines, for example by not showing
results that are irrelevant to the query.
Identifying the “globality” of resources: by analyzing
hyperlinks and information about the linking web sites,
one can estimate how global a web
entity is.
Problems in exploiting the geographical location
information of entities
• How to compute geographical information?
• How to exploit this information?
Computing geographical information
• Information extraction: automatically
analyze web pages to extract geographic entities
such as area codes or zip codes.
• Network IP address analysis: focus on
the location of the web sites that host the pages.
Exploiting the information using databases
• Site Mapper (http://www.internic.net/)
This database has the phone numbers of the network administrators
of all Class A and B domains. From it, the area code of
each domain administrator was extracted and a
Site-Mapper table was built with area code
information for IP addresses belonging to Class A and
Class B addresses.
• Area Mapper (http://www.zipinfo.com/)
This maps cities and townships to a given area code. In
some cases an entire state (e.g., Montana) corresponds
to one area code; in others a big city has
multiple area codes (e.g., Los Angeles). Scripts
convert the data into a table that
maintains, for each area code, the
corresponding set of cities/counties.
• Zip-Code Mapper (http://www.zipinfo.com/)
This maps each zip code to a range of longitudes
and latitudes.
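The chain of lookups described above (IP address to area code, then area code to cities) might be sketched roughly as follows. This is only an illustration, not the prototype's code: the names `SITE_MAPPER`, `AREA_MAPPER`, `network_key` and `cities_for_ip` are hypothetical, the IP-to-area-code entries are invented placeholders on private address ranges, and the Zip-Code Mapper step is left out.

```python
# Placeholder tables: the IP-to-area-code pairs are invented for illustration.
SITE_MAPPER = {"10": "406", "172.16": "213"}      # network prefix -> area code
AREA_MAPPER = {                                    # area code -> cities/regions
    "406": ["Montana (entire state)"],
    "213": ["Los Angeles"],
}


def network_key(ip):
    """Return the classful network prefix of an IPv4 address.

    Class A (first octet below 128) is keyed by one octet, Class B
    (first octet 128-191) by two octets; other classes are not covered.
    """
    octets = ip.split(".")
    first = int(octets[0])
    if first < 128:
        return octets[0]
    if first < 192:
        return ".".join(octets[:2])
    return None


def cities_for_ip(ip):
    """Look up the area code for an IP, then the cities for that area code."""
    area_code = SITE_MAPPER.get(network_key(ip))
    return AREA_MAPPER.get(area_code, [])


print(cities_for_ip("10.1.2.3"))
print(cities_for_ip("192.168.0.1"))
```

Addresses outside Class A and Class B simply return an empty list here, mirroring the fact that the Site-Mapper table only covers those two classes.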
Graphical Interface of Proof of Concept Prototype
[Figure: prototype interface. Labeled elements: Input, URL, IP, Area Code, Zip code, City, Map, States, Cities, Output of search, Zoom, Refresh]
Geospatial Data Mining on the Web:
Discovering Locations of Emergency Service
Facilities (2012)
Wenwen Li, Michael F. Goodchild, Richard L. Church, and Bin Zhou
• GeoDa Center for Geospatial Analysis and Computation, School of
Geographical Sciences and Urban Planning, Arizona State University,
Tempe, AZ 85287 (Wenwen@asu.edu)
• Department of Geography, University of California, Santa Barbara,
Santa Barbara, CA 93106 ({good,church}@geog.ucsb.edu)
• Institute of Oceanographic Instrumentation, Shandong Academy of
Sciences, Qingdao, Shandong, China 266001 (senosy@gmail.com)
[Figure: Google search result for a fire station compared with its actual location]
Process of Web Crawler
A Web crawler is an Internet bot that systematically
browses the World Wide Web, typically for the
purpose of Web indexing. A Web crawler may also be
called a Web spider, or an automatic indexer.
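A minimal sketch of the systematic browsing described above, assuming a breadth-first strategy. The function `crawl`, the class `LinkParser` and the in-memory `FAKE_WEB` pages are hypothetical names introduced for illustration; a real crawler would fetch pages over HTTP and respect robots.txt and rate limits.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text or None."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Tiny in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.org/a": '<a href="/b">B</a>',
    "http://example.org/b": '<a href="/">home</a>',
}

pages = crawl(["http://example.org/"], FAKE_WEB.get)
print(list(pages))
```

Passing the fetch function as a parameter keeps the traversal logic separate from network access, which also makes the sketch easy to try on canned pages.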
Defining New Class Address Structure
Form of street address for identifying target web pages
Cont.
d1: Distance between p and the location of the foremost
digit in the number block closest to (and before) location
p.
d2: Distance between p and the location of the last digit
of the first number that appears after p (for detecting a 5-digit
ZIP code), or the last digit of the second number after
p if the token distance of the first and second number
block equals
r1: regular expression [1-9][0-9]*[\\s\\r\\n\\t]*([a-zA-Z0-9\\.]+[\\s\\r\\n\\t])+
r2: regular expression "cityPattern"[\\s\\r\\n\\t,]?+("statePattern")?+[\\s\\r\\n\\t,]*\\d{5}(-\\d{4})*
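To make the two patterns more concrete, here is a rough Python transcription. It is a sketch under assumptions: the slides do not give the contents of cityPattern or statePattern, so `CITY_PATTERN` and `STATE_PATTERN` below are hypothetical stand-ins, the sample address is made up, and plain greedy quantifiers replace the possessive `?+` of the original for compatibility with older Python versions.

```python
import re

# Hypothetical stand-ins for "cityPattern" and "statePattern".
CITY_PATTERN = r"(?:Santa Barbara|Goleta)"
STATE_PATTERN = r"(?:CA|California)"

# r1: a street number followed by one or more whitespace-terminated tokens.
# In Python, \s already covers \r, \n and \t.
R1 = re.compile(r"[1-9][0-9]*\s*(?:[a-zA-Z0-9.]+\s)+")

# r2: city, at most one separator, optional state, then ZIP or ZIP+4.
R2 = re.compile(
    CITY_PATTERN
    + r"[\s,]?"
    + "(?:" + STATE_PATTERN + ")?"
    + r"[\s,]*\d{5}(?:-\d{4})*"
)


def find_address_parts(text):
    """Return (r1 candidate span, r2 city/state/ZIP span), or None."""
    street = R1.search(text)
    locality = R2.search(text)
    if street is None or locality is None:
        return None
    return street.group(0).strip(), locality.group(0)


print(find_address_parts("819 Example St Santa Barbara CA 93103"))
```

As transcribed, r1 only flags a candidate span starting at a number, so its match can run on into the city name, and r2 allows a single separator character between city and state, so a string such as "Santa Barbara, CA 93103" (comma plus space) does not match without relaxing that part.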
Decision rules for desired addresses, learned from training
data based on semantic information
[Figure: decision rules. Labels include "Station + Num", the keyword "Station", and "fire station" in the web page title]
Architecture of Proposed Cyber Miner
The input is a set of seed web URLs and the output is the target addresses.
Search Results of Cyber Miner
Locations of all fire stations obtained by Cyber Miner from the address database
Web-based geographic search engine for location-aware
search in Singapore
• Flora S. Tsai
School of Electrical & Electronic Engineering,
Nanyang Technological University, Singapore
639798, Singapore (2010).
Geo search
The search engine can search for location-specific information
on Singapore-based Web sites. Users can view
their search locations on a satellite map instead of the
two-dimensional maps currently used in street
directories. It can search for locations by area name, building
name, group of landmark types, business name,
and business category. Users can also
supply their current coordinates as a parameter so
that results are returned in order of
distance from their current location.
Google Earth
Google Earth is used for the search.
Keyhole Markup Language
Keyhole Markup Language (KML) is a file
format used to display geographic data in an
earth browser such as Google Earth, Google
Maps and Google Maps for mobile.
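As a small illustration of the format, a KML file with a single placemark could be generated as sketched below. The function name `placemark_kml` and the sample point are hypothetical (the coordinates are only roughly in Singapore); the namespace URI and the longitude-before-latitude order of `<coordinates>` follow the KML 2.2 convention.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def placemark_kml(name, lat, lon):
    """Build a minimal KML document holding one named point."""
    ET.register_namespace("", KML_NS)  # serialize KML as the default namespace
    kml = ET.Element(f"{{{KML_NS}}}kml")
    placemark = ET.SubElement(kml, f"{{{KML_NS}}}Placemark")
    ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
    point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
    # KML coordinates are written as longitude,latitude.
    ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")


# Illustrative point only; not taken from the search engine's database.
doc = placemark_kml("Sample location", 1.35, 103.82)
print(doc)
```

A file with this content, saved with a `.kml` extension, is the kind of input an earth browser such as Google Earth displays as a pin.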
Street directory
http://www.streetdirectory.com/
Useful for mobile phones only; it is also a web map service that integrates with Google Earth.
Global Positioning System
Google Earth allows tracks and waypoints to be downloaded
from GPS devices and creates KML files for the downloaded
waypoints and tracks.
Design
Design Cont.
• BusinessAreaAddress, where the address is stored
without the postal code;
• BusinessAreaPostal, where the postal code is stored;
• Area, where the keywords of the area are stored, e.g.
Causeway Point;
• General Area, where the General Area of the location
is stored, e.g. Yishun.
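One way to picture these four fields is as a single record type, sketched below. This is an assumption about the layout rather than the actual schema: the class `BusinessLocation`, the function `search_general_area` and the placeholder address values are hypothetical, while the area names reuse examples mentioned in the slides.

```python
from dataclasses import dataclass


@dataclass
class BusinessLocation:
    business_area_address: str  # BusinessAreaAddress: address without postal code
    business_area_postal: str   # BusinessAreaPostal: postal code only
    area: str                   # Area: keywords of the area
    general_area: str           # General Area: the broader district


def search_general_area(records, general_area):
    """Return the records whose General Area matches, ignoring case."""
    wanted = general_area.strip().lower()
    return [r for r in records if r.general_area.lower() == wanted]


# Addresses and postal codes are placeholders, not real data.
RECORDS = [
    BusinessLocation("(placeholder address 1)", "(placeholder)", "Tanjong Katong", "Katong"),
    BusinessLocation("(placeholder address 2)", "(placeholder)", "Haig Road", "Katong"),
    BusinessLocation("(placeholder address 3)", "(placeholder)", "(placeholder)", "Yishun"),
]

for record in search_general_area(RECORDS, "katong"):
    print(record.area)
```

Keeping the postal code apart from the rest of the address, as the design does, lets either part be searched on its own.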
Algorithms
The haversine formula is used here for faster processing.
• For two points on a sphere of radius R with latitudes
φ1 and φ2, latitude separation Δφ = φ1 − φ2, and longitude
separation Δλ,
• where angles are in radians, the distance d
between the two points is related to their locations
by the formula:
h = haversin(Δφ) + cos(φ1) cos(φ2) haversin(Δλ)   (1)
Algorithms Cont.
• Let h denote haversin(d/R) as given above. d can then
be solved for either by applying the inverse haversine
(if available) or by using the arcsine (inverse sine)
function:
• d = R haversin⁻¹(h) = 2R arcsin(√h)   (2)
• This formula is only an approximation when applied to
the Earth, because the Earth is not a perfect sphere: its radius R varies
from 6356.78 km at the poles to 6378.14 km at the
equator. The error is therefore on the order of 0.1%, depending on the
location, due to this slight ellipticity, assuming that the
geometric mean R = 6367.45 km is used.
• The output of this formula is the distance between
two coordinates.
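Formulas (1) and (2) translate fairly directly into code. The sketch below is illustrative rather than the search engine's actual implementation: the function names `haversin`, `haversine_km` and `nearest_first` are hypothetical, inputs are assumed to be in degrees, and R defaults to the geometric mean radius of 6367.45 km quoted above.

```python
import math

EARTH_RADIUS_KM = 6367.45  # geometric mean radius quoted in the slides


def haversin(theta):
    """haversin(theta) = sin^2(theta / 2), with theta in radians."""
    return math.sin(theta / 2.0) ** 2


def haversine_km(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi1 - phi2
    d_lambda = math.radians(lon1 - lon2)
    # Equation (1)
    h = haversin(d_phi) + math.cos(phi1) * math.cos(phi2) * haversin(d_lambda)
    h = min(1.0, h)  # guard against rounding slightly above 1
    # Equation (2)
    return 2.0 * radius * math.asin(math.sqrt(h))


def nearest_first(user, places):
    """Sort (name, lat, lon) tuples by distance from user = (lat, lon)."""
    return sorted(
        places,
        key=lambda p: haversine_km(user[0], user[1], p[1], p[2]),
    )


# Illustrative points on the equator, not real search results.
places = [("far", 0.0, 10.0), ("near", 0.0, 1.0)]
print([name for name, _, _ in nearest_first((0.0, 0.0), places)])
```

Sorting by this distance is one plausible way to return results in order of distance from the user's current coordinates, as the search engine is described as doing.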
Result
The database from which these results are taken contains 1652
entries in the following categories:
• Apparel, Bank, Cinema, Department Store, Duty Free Shop,
Electronics, F&B (food and beverage), Fast Food, Food
Court, Furniture, Health and Beauty, Minimart, Musical
Instruments, Restaurant, Snack Bar, Sports,
Stationery, Seafood, and Supermarket.
• The landmark types searched for are Building, Road, MRT
stations, Schools and Shopping Centres. The General Area
searched under Advanced groups various roads into one
big area, e.g. Tanjong Katong and Haig Road are both
grouped under the Katong area.
Simple search
Input
Output
Advanced search
Thank you for your patience!