Presentation of Proximity Search in Databases

Proximity Search in Databases
A Paper by
Roy Goldman, Narayanan Shivakumar,
Suresh Venkatasubramanian, Hector
Garcia-Molina
Presented by
Arjun Saraswat
Flow of the Presentation








Introduction
Motivation
Problem Statement
Model/Design
Scoring Function
Implementation
Strategies
Performance Experiments
INTRODUCTION
Introduction





Basic idea: proximity search is used in IR to retrieve documents in which words occur near each other.
Here the database is viewed as a collection of objects that are related by a distance function.
Objects can be tuples, records, etc.
Traditional IR proximity search is intra-object: it looks for words that are close together within the same document.
The proximity search in this paper instead ranks objects based on their distance to other objects.
MOTIVATION
Motivation



There are situations in which a user cannot formulate a specific query, or it is impractical to do so, or the search needs to be based on the relevance of different data objects to one another.
Neither databases nor IR systems offer a facility for this kind of proximity search across objects.
The motivation is to develop a general-purpose proximity service that can be implemented independently of the underlying database.
PROBLEM STATEMENT
Problem Statement
Basic statement: rank the objects in one given set (the Find set) based on their proximity to the objects in another set (the Near set).
What is the Find set?
It is the set of objects the user is interested in; these are the objects that get ranked.
What is the Near set?
It is the set of objects against which distances are measured: Find set objects are ranked by their distance to Near set objects.
An example makes this clearer:
“Find Movie Near Travolta Cage”
Problem Statement
Find Movie
Looks for all objects of type movie, or objects that have the word "movie" in their body. It does not in any way mean a search for movies that contain Travolta and Cage.
Here Movie, Travolta and Cage are all different objects.
For the Query “Find Movie Near Travolta Cage”
The Top 10 results are:
1. Face/Off
2. She’s So Lovely
3. Primary Colors
4. Con Air
5. Mad City
6. Happy Birthday Elizabeth: A Celebration for Life
7. Original Sin’s
8. ’Night Sins’
9. That Old Feeling
10. Dancer Upstairs
Problem Statement
As we can see, “Face/Off” is the top hit because it stars both Travolta and Cage: both actor objects are a short distance away from the movie Face/Off.
The next five movies each feature one of the two stars.
The rest of the answers have only indirect affiliations, which means they are at larger distances.
MODEL/DESIGN
Model/Design
Basic Architecture
Fig.1
Model/Design




Figure 1 gives a clear view of the basic components of the proximity search architecture.
A database stores a set of objects, which can be tuples, records, etc.
The application issues Find and Near queries to the database to obtain the Find set and the Near set.
The Proximity Search Engine takes the Find and Near sets as input, together with distances obtained from the Distance Module, and outputs the Find set re-ranked by those distances.
Model/Design
In simplified terms, the Distance Module provides the Proximity Search Engine with a set of triplets (X, Y, d), where d is the distance between the objects with identifiers X and Y.
Assumption 1: all distances are greater than or equal to one.
Assumption 2: the Proximity Search Engine uses these distances to compute the lengths of shortest paths between objects. Since we are mostly interested in close objects, only distances up to some constant K are computed; any distance greater than K is treated as infinity.
This will become clearer when we discuss the algorithms.
Model/Design
From the perspective of the Proximity Search Engine, the database is an undirected graph with weighted edges. This does not mean that the underlying database needs to be maintained as an undirected graph.
As an example, the figure on the right shows a normalized relational schema for the Internet Movie Database.
Model/Design
Graph based representation
Model/Design
In the graph-based representation each tuple is broken down into multiple objects: one for the entity and an additional object for each attribute value.
Distances between objects are assigned on the following basis:
1. Small weights on edges between an entity and its attribute values, i.e. a close relationship.
2. Larger weights on edges between tuples linked through primary and foreign keys.
3. Largest weights on edges between entity tuples in the same relation.
SCORING FUNCTION
Scoring Function
The main idea is to rank each object f in the Find set F based on its proximity to the objects in the Near set N.
rF: ranking function over the Find set.
rN: ranking function over the Near set.
The range of both functions is [0,1], with 1 representing the highest possible rank.
The distance d(f, n) between two objects f ∈ F and n ∈ N is the weight of the shortest path between them in the underlying database graph.
The bond between f and n, where f ≠ n, is:
$$ b(f, n) = \frac{r_F(f)\, r_N(n)}{d(f, n)^t} $$
Here t is a tuning exponent, a non-negative real number that controls the impact of distance on the bond.
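The bond formula can be tried out with a few lines of Python. This is only a sketch: the function name `bond`, the argument checks, and the choice to map an infinite distance to a bond of zero are assumptions made for illustration, not something the slides specify.

```python
import math


def bond(r_f, r_n, distance, t=1.0):
    """Bond b(f, n) = rF(f) * rN(n) / d(f, n)^t for one Find/Near pair."""
    if not (0.0 <= r_f <= 1.0 and 0.0 <= r_n <= 1.0):
        raise ValueError("rankings must lie in [0, 1]")
    if distance < 1:
        raise ValueError("distances are assumed to be >= 1")
    if t < 0:
        raise ValueError("the tuning exponent t must be non-negative")
    # Assumption: objects farther apart than K (distance = infinity) have no bond.
    if math.isinf(distance):
        return 0.0
    return (r_f * r_n) / (distance ** t)


# A larger t makes the bond fall off faster with distance.
print(bond(1.0, 1.0, 2, t=1))  # 0.5
print(bond(1.0, 1.0, 2, t=2))  # 0.25
```

With t = 0 the distance is ignored entirely and the bond reduces to the product of the two rankings.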
Scoring Function
The bond lies in the range [0,1]; the higher the value, the stronger the bond.
How bonds are used depends on the application; different approaches can be taken to combine an object's bonds to the Near objects.
Some of the approaches are discussed below:
1. Additive: for the query
“Find Movie Near Travolta Cage”
we intuitively expect a movie that has both actors to be ranked highest, so we score each object f by the sum of its bonds with the Near objects:
$$ \mathrm{score}(f) = \sum_{n \in N} b(f, n) $$
Scoring Function
2. Maximum: in some cases the maximum bond may be more important than the total, in which case
$$ \mathrm{score}(f) = \max_{n \in N} b(f, n) $$
3. Beliefs: here we treat bonds as beliefs. Suppose the graph represents connections between electronic devices, such that two devices close together in the graph are close together physically as well.
rF indicates the known status of the Find devices.
rN gives our belief that a Near device is faulty.
b(f, n) is then the belief that f is faulty due to n: the closer f is to a faulty device, the more likely it is to be faulty.
$$ \mathrm{score}(f) = 1 - \prod_{n \in N} \bigl(1 - b(f, n)\bigr) $$
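The three ways of combining bonds can be compared side by side in Python. In this sketch each function takes the list of bonds b(f, n) of a single Find object f to all Near objects; the function names and the example bond values are hypothetical.

```python
import math


def additive_score(bonds):
    """Sum of the bonds; rewards objects that are close to many Near objects."""
    return sum(bonds)


def maximum_score(bonds):
    """Largest single bond; only the closest Near object matters."""
    return max(bonds, default=0.0)


def belief_score(bonds):
    """Treats each bond as an independent belief and combines them as 1 - prod(1 - b)."""
    return 1.0 - math.prod(1.0 - b for b in bonds)


# Hypothetical bonds of one movie to the Near objects Travolta and Cage.
bonds = [0.5, 0.25]
print(additive_score(bonds))  # 0.75
print(maximum_score(bonds))   # 0.5
print(belief_score(bonds))    # 0.625
```

Note that the additive score is not bounded by 1, whereas the maximum and belief scores stay in [0,1] as long as every bond does.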
IMPLEMENTATION
Implementation




The proximity search architecture was implemented on top of Lore, a database system designed at Stanford University for storing and querying graph-structured data.
Lore is based on OEM (the Object Exchange Model).
What is OEM?
An OEM object contains an OID, a textual label, a type and a value.
A value may be atomic or complex.
An atomic OEM value is any data value that the database should consider indivisible.
A complex OEM value, on the other hand, is a collection of zero or more OEM objects.
Implementation
Complex OEM Object:
<Birthday {
<Month "January">
<Day 7>
<Year 1972>
}>
Here Birthday is a single complex OEM object containing three atomic OEM objects: Month, Day and Year.
Implementation
Basics of OEM :
<Restaurant {
<Entree {<Name "Burger">
<NINE: Price 9.00>}>
<Entree {<Name "BLT">
<&NINE>}>
<Entree {<Name "Reuben">
<Cost &NINE>}>
}>
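Here NINE is a SymOid (symbolic object identifier): the Price object is defined once and then referenced as &NINE from the other entrees.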
STRATEGIES
Strategies
Naïve approach: a simple approach would be to compute the shortest distances between objects at search time using Dijkstra's single-source shortest path algorithm.
In each iteration the algorithm explores the N(v) vertices adjacent to some vertex v, so on a disk-based graph it makes N(v) random seeks per iteration and as many as |E_1| random seeks in total. This approach requires too many random seeks.
E_1: the edge list provided by the distance module, with entries of the form <u, v, w>.
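As an in-memory illustration of the naïve approach, the following Python sketch runs Dijkstra's algorithm from one source and stops extending paths once they exceed K. The function names, the dictionary-based graph, and the example triplets are assumptions for illustration; on a disk-based graph each neighbor lookup would correspond to the random seeks mentioned above.

```python
import heapq
from collections import defaultdict


def build_graph(triplets):
    """Turn <u, v, w> triplets into an undirected adjacency map."""
    graph = defaultdict(dict)
    for u, v, w in triplets:
        # Keep the smallest weight if a pair is listed more than once.
        if w < graph[u].get(v, float("inf")):
            graph[u][v] = w
            graph[v][u] = w
    return graph


def distances_within(graph, source, K):
    """Dijkstra from `source`, keeping only vertices at distance <= K."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue  # stale heap entry
        for u, w in graph[v].items():
            nd = d + w
            if nd <= K and nd < dist.get(u, float("inf")):
                dist[u] = nd
                heapq.heappush(heap, (nd, u))
    return dist


# Hypothetical unit-weight edges between movies and actors.
edges = [("Face/Off", "Travolta", 1), ("Face/Off", "Cage", 1), ("Con Air", "Cage", 1)]
print(distances_within(build_graph(edges), "Travolta", K=2))
# {'Travolta': 0, 'Face/Off': 1, 'Cage': 2}
```

Con Air is three steps from Travolta, so with K = 2 it is left out, which corresponds to treating its distance as infinity.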
Strategies
Algorithm for self-joins
Algorithm: Distance self-join
Input: Edge set E_1, maximum required distance K
Output: Lookup table Dist, which supplies the shortest distance (up to K) between any pair of objects
[1] For l = 1 to ⌈log2 K⌉
[2]   Copy E_l into E'_(l+1)
[3]   Sort E_l on first vertex // to improve performance
[4]   Scan sorted E_l:
[5]     For each <vi, vj, wk> and <vi, v'j, w'k> where vj ≠ v'j
[6]       If (wk + w'k ≤ 2^l) and (wk + w'k ≤ K)
[7]         Add <vj, v'j, wk + w'k> and <v'j, vj, wk + w'k> to E'_(l+1)
[8]   Sort E'_(l+1) on first vertex, and store in E_(l+1)
[9]   Scan sorted E_(l+1):
[10]    Remove tuple <u, v, w> if there exists another tuple <u, v, w'> with w > w'
[11] Let Dist be the final E_(l+1).
[12] Build index on first vertex in Dist.
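The join can be mimicked in memory with a short Python sketch. It is not the disk-based procedure from the slides: grouping entries by first vertex stands in for the sort and scan of steps [3]-[4], and keeping the minimum weight per pair stands in for steps [8]-[10]. As a simplification chosen for this sketch, only the K bound of step [6] is applied and the 2^l bound is left out. The function name and the dictionary format of the result are assumptions.

```python
import math
from collections import defaultdict


def distance_self_join(edges, K):
    """Return {(u, v): shortest distance} for all pairs with distance <= K.

    `edges` holds undirected <u, v, w> triplets with w >= 1, and K >= 1.
    """
    current = {}
    for u, v, w in edges:
        if w <= K:
            for a, b in ((u, v), (v, u)):  # store both directions
                if w < current.get((a, b), math.inf):
                    current[(a, b)] = w

    rounds = max(0, math.ceil(math.log2(K)))
    for _ in range(rounds):
        # Group by first vertex (plays the role of sorting on the first vertex).
        by_first = defaultdict(list)
        for (vi, vj), w in current.items():
            by_first[vi].append((vj, w))

        candidate = dict(current)  # copy of the current edge list
        for neighbours in by_first.values():
            for vj, w in neighbours:
                for vj2, w2 in neighbours:
                    total = w + w2
                    if vj != vj2 and total <= K:
                        # Keep only the minimum weight per pair.
                        if total < candidate.get((vj, vj2), math.inf):
                            candidate[(vj, vj2)] = total
        current = candidate
    return current


# Hypothetical path a - b - c - d with weights 1, 2, 1.
dist = distance_self_join([("a", "b", 1), ("b", "c", 2), ("c", "d", 1)], K=4)
print(dist[("a", "d")])  # 4
```

Each round doubles the number of edges a recorded path may contain, so ⌈log2 K⌉ rounds are enough to reach every path of weight at most K when all weights are at least one.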
Strategies
In the self-join algorithm:
E_l: edge-list representation of A^(2^(l-1)), where A is the adjacency matrix of the graph
E'_l: edge list before applying the min operator
The algorithm iterates ⌈log2 K⌉ times, squaring the matrix in each iteration, to arrive at A^K.
The final output, Dist, is a lookup table containing the distances between all vertices within a K-neighborhood of each other: it stores <vi, vj, wk> for all vertex pairs vi, vj with shortest distance wk ≤ K.
The main purpose is to query for d(vi, vj). This is efficient because Dist is indexed: look up the neighborhood of vi for a tuple <vi, vj, wk>; if it is there the distance is wk, otherwise the distance is greater than K.
The problem with this approach is that the generated edge lists require a lot of space, and scanning and sorting them can be expensive.
Strategies
Hub indexing: it requires far less space for storing shortest distances than the self-join algorithm, at the cost of access time.
Hubs: in the figure, p and q are hub vertices that connect two subgraphs, A and B.
By going through the hubs we need to store only on the order of (|A| + |B|) shortest distances rather than all (|A| * |B|) pairwise distances.

Strategies
Construction of hub indexes: the main components are a hub set H and a table of distances between pairs of objects whose shortest paths do not cross through H.
This table is the Dist lookup table generated by the self-join algorithm.
One step of that algorithm needs to be changed to fit hub indexing: the join must not combine two edges through a shared vertex that is in H, so that no stored path crosses a hub.
Strategies
We need to maintain in memory a matrix of pairwise hub distances of the form Hubs[hi][hj], initialized with distances equal to infinity; then for each edge <hi, hj, wk> where hi, hj ∈ H, set Hubs[hi][hj] = wk.
The Floyd-Warshall algorithm is then used to compute the shortest distances between hubs.
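The hub matrix construction can be sketched in Python as follows. The function name `hub_matrix` and the nested-dictionary layout are assumptions; the sketch also sets the distance from each hub to itself to zero, which the slide does not state but which the standard Floyd-Warshall formulation expects.

```python
import math


def hub_matrix(hubs, hub_edges):
    """All-pairs shortest distances between hubs via Floyd-Warshall.

    `hub_edges` holds <hi, hj, wk> triplets whose endpoints are both hubs.
    """
    # Initialize with infinity (assumption: zero on the diagonal).
    dist = {h: {g: (0 if h == g else math.inf) for g in hubs} for h in hubs}
    for hi, hj, w in hub_edges:
        if w < dist[hi][hj]:
            dist[hi][hj] = w
            dist[hj][hi] = w  # the graph is undirected

    # Allow each hub k in turn as an intermediate stop.
    for k in hubs:
        for i in hubs:
            for j in hubs:
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return dist


# Hypothetical hubs: going p -> q -> r (2 + 3) beats the direct edge p -> r (10).
hubs = hub_matrix(["p", "q", "r"], [("p", "q", 2), ("q", "r", 3), ("p", "r", 10)])
print(hubs["p"]["r"])  # 5
```

Because the matrix covers only hubs, it stays small enough to keep in memory when the hub set is a small fraction of the vertices.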
Strategies
Hub indexing algorithm
Algorithm: Pair-wise distance querying
Input: Lookup table on disk: Dist, lookup matrix in memory: Hubs, maximum required distance: K, hub set: H, vertices to compute distance between: u, v (u ≠ v)
Return value: Distance between u and v: d
[1] If u, v ∈ H, return d = Hubs[u][v].
[2] d = ∞
[3] If u ∈ H
[4]   For each <v, vi, wk> in Dist
[5]     If vi ∈ H // Path u ~ vi ~ v
[6]       d = min(d, wk + Hubs[vi][u])
[7]   If d > K, return d = ∞, else return d.
[8] Steps [4]-[7] are applied symmetrically if v ∈ H and u ∉ H.
[9] // Neither u nor v is in H
[10] Cache in main memory (EU) all <u, vi, wk> from Dist
[11] For each <v, v'i, w'k> in Dist
[12]   If (v'i = u)
[13]     d = min(d, w'k) // Path u ~ v without crossing hubs
[14]   For each edge <u, vi, wk> in EU
[15]     If v'i ∈ H and vi ∈ H // Path u ~ vi ~ v'i ~ v
[16]       d = min(d, wk + w'k + Hubs[v'i][vi])
[17] If d > K, return d = ∞, else return d.
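The querying steps translate fairly directly into Python. In this sketch Dist is modelled as a dictionary from a vertex to its list of (neighbor, distance) entries, mirroring an index on the first vertex, and Hubs as a nested dictionary; both layouts, the function name, and the sample data are assumptions. It also assumes Hubs[h][h] = 0 for every hub h.

```python
import math


def pair_distance(u, v, dist, hubs, hub_set, K):
    """Distance between u and v (u != v), or infinity if it exceeds K."""
    if u in hub_set and v in hub_set:
        return hubs[u][v]                          # step [1]

    d = math.inf                                   # step [2]
    if u in hub_set or v in hub_set:               # steps [3]-[8]
        hub, other = (u, v) if u in hub_set else (v, u)
        for vi, w in dist.get(other, []):
            if vi in hub_set:                      # path hub ~ vi ~ other
                d = min(d, w + hubs[vi][hub])
        return d if d <= K else math.inf

    eu = dist.get(u, [])                           # step [10]: cache u's entries
    for v2, w2 in dist.get(v, []):                 # step [11]
        if v2 == u:
            d = min(d, w2)                         # direct path, no hub crossed
        if v2 in hub_set:
            for vi, w in eu:
                if vi in hub_set:                  # path u ~ vi ~ v2 ~ v
                    d = min(d, w + w2 + hubs[v2][vi])
    return d if d <= K else math.inf               # step [17]


# Hypothetical graph: c - a - p - q - b, with hubs p and q.
H = {"p", "q"}
Dist = {
    "a": [("p", 1), ("c", 1)], "b": [("q", 1)], "c": [("a", 1), ("p", 2)],
    "p": [("a", 1), ("c", 2), ("q", 2)], "q": [("b", 1), ("p", 2)],
}
Hubs = {"p": {"p": 0, "q": 2}, "q": {"p": 2, "q": 0}}
print(pair_distance("a", "b", Dist, Hubs, H, K=5))  # 4
```

The distance from a to b is never stored; it is assembled at query time from a's distance to p, the hub-to-hub distance, and b's distance to q.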

Strategies



The algorithms discussed so far give the distance between a single pair of objects.
A naïve approach to a Find/Near query would be to check all pairs of Find and Near objects. To avoid unnecessary seeks, the objects can be clustered; this has to be set up by the engine administrator.
In this proximity search engine, clustering is done on labels such as Actors, Producers, etc.
Strategies
Hub Selection
Consider a graph G(V, E), and let V1, V2 be disjoint subsets of V. A set of vertices S ⊆ V separates V1 and V2 if, for all pairs of vertices (v1, v2) with v1 ∈ V1 and v2 ∈ V2, every path from v1 to v2 goes through some vertex from S.
We say that S is a balanced separator if
min(|V1|, |V2|) ≥ |V|/3
We say that S is a c-separator if
V - S = V1 ∪ V2,
i.e. removing S disconnects the graph.
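The separator definitions can be checked on a small graph with a breadth-first search that is not allowed to enter S. This Python sketch only tests a given candidate S; it does not choose hubs. The function names and the adjacency-dictionary input are assumptions.

```python
from collections import deque


def separates(adjacency, S, V1, V2):
    """True if every path from a vertex in V1 to a vertex in V2 passes through S."""
    blocked = set(S)
    targets = set(V2) - blocked
    seen = set(V1) - blocked
    queue = deque(seen)
    while queue:
        v = queue.popleft()
        if v in targets:
            return False  # reached V2 without touching S
        for u in adjacency.get(v, ()):
            if u not in blocked and u not in seen:
                seen.add(u)
                queue.append(u)
    return True


def is_balanced(V, V1, V2):
    """Balance condition: min(|V1|, |V2|) >= |V| / 3."""
    return min(len(V1), len(V2)) >= len(V) / 3


# Hypothetical path graph a - b - c: removing b disconnects a from c.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(separates(adjacency, {"b"}, {"a"}, {"c"}))       # True
print(is_balanced({"a", "b", "c"}, {"a"}, {"c"}))      # True
```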
PERFORMANCE EXPERIMENTS
Performance Experiments


The experiments were run on a Sun SPARC/Ultra II (2 x 200 MHz) running SunOS 5.6, with 256 MB of RAM and 18 GB of local disk space.
Two datasets were used: the IMDB dataset and the DBgroup dataset.
Performance Experiments






A generator is used that takes IMDB's edge list as input and scales the database by a scale factor S.
ISAM indexes are used for performance.
Performance issues discussed:
Index performance:
The first figure shows storage requirements for varying K.
The second figure shows index construction time for varying K.
Both are for the case where the number of hubs is small: scale factor S = 10 and 2.5% of vertices as hubs.
Performance Experiments





Algorithm scalability as the database grows in size.
The first figure shows total storage for varying scale factor.
For this, the scale factor is taken to be S = 10 with 2.5% of vertices as hubs.
The second figure shows the number of hubs as a percentage of vertices.
For this, K = 12, S = 10 and 2.5% of vertices as hubs.
THANK YOU
References
1. A Standard Textual Interchange Format
for the Object Exchange Model (OEM)
by Roy Goldman, Sudarshan Chawathe, Arturo
Crespo, Jason McHugh