Ferret: 360° Search

Prafulla Mahindrakar, Aniket Patil, Ketan Umare
Advisor: Dr. Ling Liu
CS8803: Advanced Internet Application Development, Group Project
Spring 09

Table of Contents

1. Motivation and Objectives
2. Related Work
    2.1 Googling
    2.2 Socially relevant search
    2.3 Categorization of search results
3. Proposed Work
    3.1 System Architecture
    3.2 Components
        3.2.1 OpenID Authentication
        3.2.2 Social Network Information Grabber APP
        3.2.3 Query Parsing
        3.2.4 Socially relevant Search
        3.2.5 External Search Interface
        3.2.6 Categorization of Search Results
        3.2.7 Display
4. Plan of action
    4.1 Resource
        4.1.1 Software Tools
        4.1.2 Hardware
        4.1.3 Operating system
    4.2 Schedule
5. Evaluation and Testing Method
    5.1 Deliverable
    5.2 Evaluation Strategy
    5.3 Testing Methodology
        5.3.1 Unit Testing
        5.3.2 Functional Testing
6. Bibliography

1. Motivation and Objectives

fer·ret (v) (\ˈfer-ət\): to find and bring to light by searching

Imagine trying to find a pair of the latest Ray-Ban glasses in the Lenox Square Mall. It is not an easy task! Now think about doing the same across the World Wide Web. Feeling dizzy?

The World Wide Web, with its astronomical amount of information, presents an enormous challenge for resource discovery. Precise navigation is impossible given the ever-growing collection of hyperlinks that users must traverse. Commercial search engines like Google and Yahoo have solved the problem at a fundamental level by making available a hypertext-based index of pages across the web. Web users can query the index for documents about a specific topic to find the desired document.

While search engines have become quite popular and are helping to redefine how people access information scattered across the wide-area network, they are not well suited to cases where users do not know exactly what they are looking for. In such a situation, using one of the popular search engines can be a messy, frustrating experience. What do you do when you don't know where to start? Give Ferret a try!

For any topic in the universe, Ferret provides a neatly organized view of the web. Our category guides bring together meaningful and relevant information that makes browsing a topic fast. Rather than leaving you to click back and forth through search results, we do the processing so that you can learn, explore and discover the things that matter to you.

Ferret offers you a new way to discover the Web: it's the place to be when you want to browse and discover everything the Web has to offer. Come to Ferret when you want to learn about a topic or explore what's happening now on the Web. We'll show you content that you might never have discovered otherwise, and we'll give you an at-a-glance look at everything related to the query. Think of Ferret as your guide for exploring the Web.

For instance, consider the search term 'Transformers'. A Google search returns a serially arranged list whose first page covers the movie 'Transformers' and electrical transformers. However, a user who is interested in the 'Transformer' class in Java or in the Transformers comics needs to browse several pages before such results turn up. Our system graphically arranges and classifies results into categories such as text, multimedia, entertainment, discussions, blogs and more. A user needs only a single click to get a 360-degree view of the content associated with the query term.

Ferret. Your guide to the world!

2. Related Work

Search is a constantly evolving and continuously researched topic. The industry has seen great success stories and even greater debacles. Web search has become such an important part of our lives that it has, in some cases, contributed to our vocabulary. The following are some of the most distinctive systems currently available online, from which we draw our inspiration.

Figure 1: Taxonomy of Existing Search Technologies

2.1 Googling

In their seminal work [1], Brin and Page described a new way of ranking web documents based on the idea of citation. The resulting search engine instantly became a hit and overtook all of its competitors. Its home page [2] is among the most visited pages online, and everyone knows "The Google Story". Google uses a simple keyword-based search; its most important contribution is the ranking of content. Google thus demonstrates that content alone is not enough: the way results are ranked and presented matters just as much. Google has continued to innovate and add new features, but it still has a long way to go.
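
The citation idea behind [1] can be illustrated with a small power-iteration sketch. This is a simplified PageRank over a toy link graph, not Google's production ranking. The class name `CitationRank`, the damping factor of 0.85 and the fixed iteration count are assumptions made for illustration.

```java
import java.util.Arrays;

/** Simplified PageRank by power iteration (illustrative sketch only). */
final class CitationRank {

    /**
     * links[i] lists the pages that page i links to.
     * Returns one score per page; the scores sum to 1.
     */
    static double[] rank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] score = new double[n];
        Arrays.fill(score, 1.0 / n);              // start from a uniform distribution

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double dangling = 0.0;                // score held by pages with no out-links
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    dangling += score[i];
                    continue;
                }
                double share = score[i] / links[i].length;
                for (int target : links[i]) {
                    next[target] += share;        // each link passes on an equal share
                }
            }
            for (int i = 0; i < n; i++) {
                // random jump + damped (link contributions + redistributed dangling mass)
                next[i] = (1.0 - damping) / n + damping * (next[i] + dangling / n);
            }
            score = next;
        }
        return score;
    }

    public static void main(String[] args) {
        // Toy graph: pages 0 and 1 both cite page 2; page 2 cites page 0.
        int[][] links = { {2}, {2}, {0} };
        System.out.println(Arrays.toString(rank(links, 0.85, 50)));
    }
}
```

In the toy graph, page 2 is cited by both other pages and ends up with the highest score, while page 1, which nobody cites, ends up with the lowest.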

2.2 Socially relevant search

Social search, or a social search engine, is a type of web search method that determines the relevance of search results by considering the interactions or contributions of users [3]. Delver [4] builds on this simple idea, using a user's social network to come up with better recommendations. It enables you to find, experience and benefit from the wealth of information created and referenced by your social world. Socially relevant search can really benefit users, since what matters to them is usually what matters to their peers. Mislove et al. [5] discuss the benefits of integrating web search and social search, quantify them with encouraging results, and delineate the challenges in doing so.

2.3 Categorization of search results

Categorizing search results is another important way to present them. Take the word 'Transformers' as an example. The same word can refer to different things: an electrical device, a movie, the cartoon series or a toy. There could also be a review of the movie, or news about the invention of a new, more efficient transformer. So how do you show these results? Which is more important? These questions are almost impossible to answer. Papers [6-9] show a variety of ways in which web search results can be classified and quantify them with interesting results. Kosmix [10] is one of the most promising sites to have leveraged this idea. It uses the search provided by Google and wraps it with its own classification system. It has been voted one of the best new startups [11], which underlines the importance of classifying results.

3. Proposed Work

The following sections outline the system architecture and briefly describe its important components.

3.1 System Architecture

Figure 2: System Architecture

3.2 Components

This section explains the various modules that constitute the Ferret search engine.

3.2.1 OpenID Authentication

We use OpenID authentication to log in to our system. OpenID also helps us obtain user profiles from social networking websites such as Facebook [12] and Orkut that support OpenID authentication.

3.2.2 Social Network Information Grabber APP

This module lets us retrieve a user's social network from websites such as Facebook [12] and Orkut. The network is stored locally in our system database along with each user's query history and preferences, which assist us in socially relevant search.

3.2.3 Query Parsing

The query parsing engine uses the Stanford tagger [13, 14] to extract keywords, which the socially relevant search module then uses to look for previously fruitful searches in the user's social network.
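
As a rough sketch of the keyword step, the snippet below filters part-of-speech-tagged text down to nouns and adjectives. It does not call the Stanford tagger itself. It assumes the tagger has already produced whitespace-separated `word_TAG` tokens with Penn Treebank tags. The class name `KeywordExtractor` and the choice of which tags count as keywords are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Picks keywords out of POS-tagged text (illustrative sketch only). */
final class KeywordExtractor {

    /**
     * Expects whitespace-separated tokens of the form word_TAG, e.g. "glasses_NNS".
     * Keeps nouns (NN*) and adjectives (JJ*), lower-cased, in query order.
     */
    static List<String> keywords(String tagged) {
        List<String> result = new ArrayList<>();
        for (String token : tagged.trim().split("\\s+")) {
            int sep = token.lastIndexOf('_');     // the tag follows the last underscore
            if (sep <= 0) {
                continue;                         // skip tokens without a word or a tag
            }
            String word = token.substring(0, sep);
            String tag = token.substring(sep + 1);
            if (tag.startsWith("NN") || tag.startsWith("JJ")) {
                result.add(word.toLowerCase(Locale.ROOT));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String tagged = "where_WRB can_MD I_PRP find_VB the_DT "
                + "latest_JJS Ray-Ban_NNP glasses_NNS";
        System.out.println(keywords(tagged));     // prints [latest, ray-ban, glasses]
    }
}
```

Keeping only nouns and adjectives is one plausible filter. A real deployment would tune the tag set against actual queries.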

3.2.4 Socially relevant Search

This module uses the parsed query to search through the user's social network and find out whether any relevant searches were made earlier.
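
One way to picture this lookup is a scan over the stored query histories of the user's friends, keeping past queries that share keywords with the current one. The sketch below is an assumption about how such matching could work, not the module's actual design. The class `SocialSearch`, the in-memory map standing in for the database and the overlap-count scoring are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Matches a new query against friends' past queries (illustrative sketch only). */
final class SocialSearch {

    /**
     * keywords: lower-case keywords of the new query.
     * history: maps a friend's name to the queries that friend ran earlier.
     * Returns past queries sharing at least one keyword with the new query,
     * ordered by the number of shared keywords (highest first).
     */
    static List<String> relevantQueries(Set<String> keywords,
                                        Map<String, List<String>> history) {
        Map<String, Integer> overlap = new LinkedHashMap<>();
        for (List<String> queries : history.values()) {
            for (String past : queries) {
                Set<String> shared = new HashSet<>(
                        Arrays.asList(past.toLowerCase(Locale.ROOT).split("\\s+")));
                shared.retainAll(keywords);       // keep only words also in the new query
                if (!shared.isEmpty()) {
                    overlap.merge(past, shared.size(), Integer::max);
                }
            }
        }
        List<String> ranked = new ArrayList<>(overlap.keySet());
        // Stable sort: ties keep the order in which the queries were encountered.
        ranked.sort((a, b) -> overlap.get(b) - overlap.get(a));
        return ranked;
    }
}
```

For the keywords `transformers` and `movie`, a friend's earlier query "Transformers movie review" (two shared words) would rank ahead of "electrical transformers" (one shared word), and "java tutorials" would be dropped.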

3.2.5 External Search Interface

This engine is multithreaded. It accepts the raw query and dispatches it to worker threads, which collect search results from a variety of search engines such as Google [2], A9 [15] and IMDB [16]. The worker threads communicate with the search engines through web services described in WSDL. The external interface is extensible, since collecting results from a new search engine simply requires implementing a WSDL interface. This allows our system to be augmented with additional search results from Yahoo, Windows Live or any other search engine.
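
The dispatch pattern described above might look roughly like this. The sketch uses the standard `java.util.concurrent` executor API. The `SearchSource` interface and the `ExternalSearch` class are hypothetical names, and the actual web-service calls are left to whatever implements `SearchSource`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical adapter: one implementation per external search engine. */
interface SearchSource {
    List<String> search(String query) throws Exception;
}

/** Fans a raw query out to all registered sources, one worker task per source. */
final class ExternalSearch {
    private final List<SearchSource> sources;

    ExternalSearch(List<SearchSource> sources) {
        this.sources = sources;
    }

    List<String> searchAll(String query) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, sources.size()));
        List<String> merged = new ArrayList<>();
        try {
            List<Callable<List<String>>> tasks = new ArrayList<>();
            for (SearchSource source : sources) {
                tasks.add(() -> source.search(query));
            }
            // invokeAll blocks until every worker has finished or failed,
            // and returns the futures in the same order as the tasks.
            for (Future<List<String>> future : pool.invokeAll(tasks)) {
                try {
                    merged.addAll(future.get());
                } catch (ExecutionException e) {
                    // One failing engine should not sink the whole search; skip it.
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // preserve the interrupt, return what we have
        } finally {
            pool.shutdown();
        }
        return merged;
    }
}
```

Supporting another engine then means writing one more `SearchSource` implementation and adding it to the list, which is the kind of extensibility the section describes.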

3.2.6 Categorization of Search Results

The results collected from the various websites are then categorized using k-means [9] and seed-list-based clustering, and grouped into different categories.
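
The seed-list half of this step can be sketched as follows: each category has a hand-picked list of seed words, and a result goes to the category whose seeds appear most often in its snippet. This illustrates seed-based assignment only; no k-means step is shown. The class `SeedCategorizer`, the fallback category `other` and any seed words passed in are made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Assigns result snippets to categories by seed-word hits (illustrative sketch only). */
final class SeedCategorizer {
    private final Map<String, Set<String>> seeds;   // category -> lower-case seed words

    SeedCategorizer(Map<String, Set<String>> seeds) {
        this.seeds = seeds;
    }

    /** Returns the best-matching category, or "other" when no seed word occurs. */
    String categorize(String snippet) {
        String[] words = snippet.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
        String best = "other";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : seeds.entrySet()) {
            int hits = 0;
            for (String word : words) {
                if (entry.getValue().contains(word)) {
                    hits++;
                }
            }
            // Strict '>' keeps the category seen first in the map's iteration order on ties.
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }

    /** Groups snippets by category, preserving input order inside each group. */
    Map<String, List<String>> group(List<String> snippets) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String snippet : snippets) {
            groups.computeIfAbsent(categorize(snippet), k -> new ArrayList<>()).add(snippet);
        }
        return groups;
    }
}
```

With seeds such as `movie`, `trailer`, `cast` for an entertainment category and `voltage`, `electrical`, `winding` for an engineering category, the two senses of "Transformers" discussed earlier would land in different groups.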

3.2.7 Display

The clustered results and the socially relevant search results are then shown to the end user in a tabbed format, which allows the user to easily find the appropriate content.

4. Plan of action

The following sections outline the resources needed to develop the Ferret system and the schedule until completion.

4.1 Resource

We will develop a Java-based application using the usual Java EE components. The following sections list the requirements in detail. The operating system section also lists the compatible platforms on which we will develop and test the Ferret system.

4.1.1 Software Tools

• Java 1.5
• Eclipse IDE
• J2EE 1.4
• Apache Tomcat 5.5
• MySQL 5.0
• Clustering algorithms (developed by us)
• Simile
• MySQL JDBC Connector
• JUnit 4.4
• WSDLs and open-source Web/REST APIs for Google, IMDB, Facebook, etc.

4.1.2 Hardware

We need only simple commodity hardware, as the system will be a proof of concept rather than a live system. For now, a desktop PC with a browser and internet connectivity will suffice. We will develop primarily on our laptops.

4.1.3 Operating system

The primary development and test platforms will be:

• Windows 98/XP/Vista
• Mac OS X 10.5.5 (Leopard)

However, most of the technologies we are using are completely portable, so we should be able to run on most systems that support Java.

4.2 Schedule

Week No.   Dates              Scheduled Work
1          Feb 23 – Feb 27    Installation of J2EE, MySQL, Tomcat
2          Mar 2 – Mar 6      Study of web services
3          Mar 9 – Mar 13     Design and implementation of external agent interface
4          Mar 16 – Mar 20    Connectivity to Google OpenID for authentication
5          Mar 23 – Mar 27    Study and implementation of connectivity with Google Search and Amazon A9
6          Mar 30 – Apr 3     Study and implementation of connectivity with Facebook social network
7          Apr 6 – Apr 10     Study and implementation of clustering algorithms
8          Apr 13 – Apr 17    Implementation of user interface
9          Apr 20 – Apr 24    Testing with JUnit and documentation

5. Evaluation and Testing Method

The following sections outline the deliverable at the end of the project, the evaluation strategy we will use to assess the results of the system, and the testing methods.

5.1 Deliverable

We envision a fully interactive, highly intuitive front end. The system will consist of a working query manager and interface manager, with at least a few interfaces already built in.

5.2 Evaluation Strategy

We will compare the quality of results at various levels. The most important criterion will be a comparison against Google. We will also carry out some user acceptance testing, because what matters most is how the user perceives the results. In addition, we will compare the results to Kosmix [10] and Delver [4]. We will try to create a small survey and ask COC students to compare the results. This will help us study the potential of such a system.

5.3 Testing Methodology

We will carry out testing at two levels, as explained below.

5.3.1 Unit Testing

Each piece will first be unit tested using JUnit. This will ensure that individual units of source code work properly. We will also test the simple functionality of each unit.

5.3.2 Functional Testing

This is an important step to ensure correctness. The main problem here is obtaining a sizeable amount of social network data. We will therefore have to manufacture most of the data and simulate the social network.
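
A simulated network could be manufactured along these lines. The generator below is a hypothetical sketch: it builds a random undirected friendship graph with a fixed seed so that test runs are repeatable. The class `SyntheticNetwork`, the uniform random wiring and the parameters are assumptions. Real social networks have more structure, such as clustered friend groups, than this produces.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Generates a repeatable random friendship graph (illustrative sketch only). */
final class SyntheticNetwork {

    /**
     * Builds an undirected friendship graph over users 0..users-1.
     * Each user initiates friendsPerUser links to distinct random other users,
     * so every user ends up with at least that many friends.
     */
    static Map<Integer, Set<Integer>> generate(int users, int friendsPerUser, long seed) {
        if (friendsPerUser >= users) {
            throw new IllegalArgumentException("friendsPerUser must be smaller than users");
        }
        Random random = new Random(seed);         // fixed seed gives repeatable test data
        Map<Integer, Set<Integer>> friends = new TreeMap<>();
        for (int u = 0; u < users; u++) {
            friends.put(u, new TreeSet<>());
        }
        for (int u = 0; u < users; u++) {
            Set<Integer> chosen = new HashSet<>();
            while (chosen.size() < friendsPerUser) {
                int v = random.nextInt(users);
                if (v != u) {
                    chosen.add(v);                // the set ignores repeated picks
                }
            }
            for (int v : chosen) {
                friends.get(u).add(v);            // friendship is symmetric,
                friends.get(v).add(u);            // so record it in both directions
            }
        }
        return friends;
    }
}
```

Query histories for each simulated user could then be attached to the nodes of this graph to exercise the socially relevant search module end to end.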

6. Bibliography

1. Brin, S. and L. Page, The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 1998. 30(1-7): p. 107-117.
2. Page, L. and S. Brin, Google. Available from: http://www.google.com.
3. Wikipedia - The Free Encyclopedia. Available from: http://www.wikipedia.com.
4. Liad Agmon, A.y., Sagie Davidovitch (co-founders), Delver.
5. Mislove, A., K. Gummadi, and P. Druschel. Exploiting Social Networks for Internet Search. 2006.
6. Chen, H. and S. Dumais. Bringing order to the Web: automatically categorizing search results. 2000: ACM Press New York, NY, USA.
7. Thet, T., J. Na, and C. Khoo, Automatic Classification of Web Search Results: Product Review vs. Non-review Documents. Lecture Notes in Computer Science, 2007. 4822: p. 65.
8. Vogel, D., et al., Classifying search engine queries using the web as background knowledge. SIGKDD Explor. Newsl., 2005. 7(2): p. 117-122.
9. Yeung, A., N. Gibbins, and N. Shadbolt, A k-Nearest-Neighbour Method for Classifying Web Search Results with Data in Folksonomies. 2008.
10. Venky Harinarayan, A.R. (co-founders), Kosmix. Available from: http://www.kosmix.com.
11. Read Write Web - Top 10 Alternative Search Engines of 2008. Available from: http://www.readwriteweb.com/archives/top_10_alternative_search_engi.phps.
12. Zuckerberg, M. Facebook - the Social Networking site. Available from: http://www.facebook.com.
13. Toutanova, K. and C. Manning. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. 2000.
14. Toutanova, K., et al. Feature-rich part-of-speech tagging with a cyclic dependency network. 2003: Association for Computational Linguistics Morristown, NJ, USA.
15. Bezos, J. A9 - Amazon's Search Engine. Available from: http://www.a9.com.
16. Needham, C. Internet Movie Database. Available from: http://www.imdb.com.
