Generalized Vector Space Model

• Definition Let ki be a vector associated with
the index term ki . Independence of index terms
in the vector model implies that the set of
vectors {k1 ,k2 ,…,kt} is linearly independent and
forms a basis for the subspace of interest. The
dimension of this space is the number t of
index terms in the collection.
An example of independent vectors
• V1=(1, 0, 0), V2=(0, 1, 0), V3=(0, 0, 1).
• V1 · V2 = 0+0+0 = 0.
• Vi · Vj = 0 for i ≠ j.
• Each element represents a keyword.
• Different keywords are treated as totally different
items. This is not reasonable since sometimes
they are related.
• Definition Given the set {k1, k2, …, kt} of index
terms in a collection, as before, let wi,j be the
weight associated with the term-document pair
[ki, dj]. If the wi,j weights are all binary, then all
possible patterns of term co-occurrence (inside
documents) can be represented by a set of 2^t
minterms given by m1 = (0,0,…,0),
m2 = (1,0,…,0), …, m_{2^t} = (1,1,…,1). Let gi(mj)
return the weight {0,1} of the index term ki in the
minterm mj.
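The minterm definition can be made concrete by listing every pattern for a small t. The following is a minimal sketch, not part of the original slides: the class and method names are hypothetical, and it assumes the ordering shown above, where bit i of the row index gives gi of that minterm.

```java
import java.util.Arrays;

class MintermEnumeration {
    // Returns all 2^t minterms; bit i of the row index r gives g_i of that minterm.
    static int[][] minterms(int t) {
        int[][] m = new int[1 << t][t];
        for (int r = 0; r < m.length; r++) {
            for (int i = 0; i < t; i++) {
                m[r][i] = (r >> i) & 1;
            }
        }
        return m;
    }

    public static void main(String[] args) {
        // For t = 3 this prints 8 patterns, from [0, 0, 0] to [1, 1, 1].
        for (int[] m : minterms(3)) {
            System.out.println(Arrays.toString(m));
        }
    }
}
```

The number of patterns doubles with every additional index term, so in practice only the patterns that actually occur in the collection are worth keeping.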
• Definition Let us define the following set of
vectors
m1 = (0, 0, …, 0, 1)
m2 = (0, 0, …, 1, 0)
…
m_{2^t} = (1, 0, …, 0, 0)
where each vector mi (of dimension 2^t) is associated
with the respective minterm mi.
• mi · mj = 0 for all i ≠ j
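\[
\vec{k}_i = \frac{\sum_{\forall r,\; g_i(m_r)=1} c_{i,r}\,\vec{m}_r}{\sqrt{\sum_{\forall r,\; g_i(m_r)=1} c_{i,r}^{2}}} \tag{1.1}
\]
\[
c_{i,r} = \sum_{d_j \,\mid\, g_l(d_j) = g_l(m_r)\ \text{for all } l} w_{i,j} \tag{1.2}
\]
\[
\vec{k}_i \cdot \vec{k}_j = \sum_{\forall r \,\mid\, g_i(m_r)=1 \,\wedge\, g_j(m_r)=1} c_{i,r} \times c_{j,r}
\]
\[
\vec{d}_j = \sum_{\forall i} w_{i,j}\,\vec{k}_i
\]
\[
\vec{q} = \sum_{\forall i} w_{i,q}\,\vec{k}_i
\]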
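Once every index term has a vector in the minterm space, a document or a query is a weighted sum of term vectors, and the two can be compared like any other pair of vectors. The sketch below is an illustration under stated assumptions: it uses cosine similarity for the comparison (the usual vector-model choice, which the slides do not spell out), and the term vectors, class name, and method names are hypothetical.

```java
class GvsmRanking {
    // Builds sum_i w[i] * k[i]: a weighted combination of the term vectors.
    static double[] combine(double[] w, double[][] k) {
        double[] v = new double[k[0].length];
        for (int i = 0; i < w.length; i++) {
            for (int r = 0; r < v.length; r++) {
                v[r] += w[i] * k[i][r];
            }
        }
        return v;
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int r = 0; r < a.length; r++) {
            s += a[r] * b[r];
        }
        return s;
    }

    // Cosine of the angle between two vectors; 0 if either is the zero vector.
    static double cosine(double[] a, double[] b) {
        double na = Math.sqrt(dot(a, a));
        double nb = Math.sqrt(dot(b, b));
        return (na == 0 || nb == 0) ? 0 : dot(a, b) / (na * nb);
    }

    public static void main(String[] args) {
        // Hypothetical term vectors over three minterm axes (not from the slides):
        // the two terms share the second axis, so they are correlated.
        double[][] k = {
            {0.6, 0.8, 0.0},
            {0.0, 0.8, 0.6},
        };
        double[] d = combine(new double[] {1, 0}, k); // document containing only term 0
        double[] q = combine(new double[] {0, 1}, k); // query containing only term 1
        System.out.println(cosine(d, q)); // roughly 0.64
    }
}
```

In the classic model the same document and query would score 0 because they share no term. Here they score about 0.64 because the two terms co-occur in one minterm.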
An example of the Generalized Vector Space Model
• Suppose that the system has 12 documents and
4 keywords.
• D1=(2, 1, 0, 0), D2=(5, 1, 0, 0), D3=(1, 1, 1, 1),
• D4=(0, 0, 2, 2), D5=(0, 1, 1, 2), D6=(0, 0, 1, 1),
• D7=(0, 0, 1, 0), D8=(1, 1, 0, 0), D9=(2, 1, 1, 1),
• D10=(0, 2, 2, 2), D11=(1, 0, 2, 0), D12=(0, 0, 2, 1).
• Minterms: only 6 co-occurrence patterns actually occur in the
12 documents. These 6 minterms are used as independent
vectors to form a basis.
• m1=(1, 1, 0, 0), m2=(1, 1, 1, 1), m3=(0, 0, 1, 1),
m4=(0, 1, 1, 1), m5=(0, 0,1, 0), m6=(1, 0, 1, 0).
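The six minterms can be recovered mechanically from the document weights. The following is a minimal sketch with hypothetical class and method names; it assumes a keyword counts as present in a document when its weight is positive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MintermGroups {
    // Documents D1..D12 from the example.
    static final int[][] DOCS = {
        {2, 1, 0, 0}, {5, 1, 0, 0}, {1, 1, 1, 1}, {0, 0, 2, 2},
        {0, 1, 1, 2}, {0, 0, 1, 1}, {0, 0, 1, 0}, {1, 1, 0, 0},
        {2, 1, 1, 1}, {0, 2, 2, 2}, {1, 0, 2, 0}, {0, 0, 2, 1},
    };

    // Maps each distinct 0/1 pattern to the 1-based numbers of the documents that have it.
    static Map<String, List<Integer>> groupByPattern(int[][] docs) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (int j = 0; j < docs.length; j++) {
            int[] p = new int[docs[j].length];
            for (int i = 0; i < p.length; i++) {
                p[i] = docs[j][i] > 0 ? 1 : 0;
            }
            groups.computeIfAbsent(Arrays.toString(p), key -> new ArrayList<>()).add(j + 1);
        }
        return groups;
    }

    public static void main(String[] args) {
        groupByPattern(DOCS).forEach((pattern, ids) ->
            System.out.println(pattern + " <- documents " + ids));
    }
}
```

The patterns come out in order of first appearance, which matches m1 through m6 above: for instance, D1, D2, and D8 all map to (1, 1, 0, 0).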
• Independent vectors:
v1= (1, 0, 0, 0, 0, 0), v2=(0, 1, 0, 0, 0, 0),
v3=(0, 0, 1, 0, 0, 0), v4=(0, 0, 0, 1, 0, 0),
v5=(0, 0, 0, 0, 1, 0), v6=(0, 0, 0, 0, 0, 1).
• Vi represents minterm mi.
• Each pair of Vi and Vj is orthogonal. (dot
product=0)
• The four keywords k1, k2, k3, and k4 are
represented by a combination of the independent
vectors.
k1=(c1,1V1+c1,2V2+c1,3V3+c1,4V4+c1,5V5+c1,6V6)/C
where c1,1=w1,1+w1,2+w1,8=2+5+1=8 (D1, D2, and D8
have minterm m1), c1,2=w1,3+w1,9=1+2=3 (D3 and
D9 have minterm m2),
c1,3=w1,4+w1,6+w1,12=0+0+0=0 (D4, D6, and D12
have minterm m3), c1,4=w1,5+w1,10=0+0=0 (D5 and
D10 have minterm m4),
c1,5=w1,7=0, c1,6=w1,11=1.
C=(c1,1^2+c1,2^2+c1,3^2+c1,4^2+c1,5^2+c1,6^2)^0.5
=(64+9+0+0+0+1)^0.5=74^0.5≈8.60
k2=(c2,1V1+c2,2V2+c2,3V3+c2,4V4+c2,5V5+c2,6V6)/C
where c2,1=w2,1+w2,2+w2,8=1+1+1=3 (D1, D2, and
D8 have minterm m1), c2,2=w2,3+w2,9=1+1=2 (D3
and D9 have minterm m2),
c2,3=w2,4+w2,6+w2,12=0+0+0=0 (D4, D6, and D12
have minterm m3), c2,4=w2,5+w2,10=1+2=3,
c2,5=w2,7=0, c2,6=w2,11=0.
C=(c2,1^2+c2,2^2+c2,3^2+c2,4^2+c2,5^2+c2,6^2)^0.5
=(9+4+0+9+0+0)^0.5=22^0.5≈4.69
k3=(c3,1V1+c3,2V2+c3,3V3+c3,4V4+c3,5V5+c3,6V6)/C
where c3,1=w3,1+w3,2+w3,8=0+0+0=0 (D1, D2, and D8 have
minterm m1), c3,2=w3,3+w3,9=1+1=2 (D3 and D9
have minterm m2), c3,3=w3,4+w3,6+w3,12=2+1+2=5
(D4, D6, and D12 have minterm m3),
c3,4=w3,5+w3,10=1+2=3, c3,5=w3,7=1, c3,6=w3,11=2.
C=(c3,1^2+c3,2^2+c3,3^2+c3,4^2+c3,5^2+c3,6^2)^0.5
=(0+4+25+9+1+4)^0.5=43^0.5≈6.56
k4=(c4,1V1+c4,2V2+c4,3V3+c4,4V4+c4,5V5+c4,6V6)/C
where c4,1=w4,1+w4,2+w4,8=0+0+0=0 (D1, D2, and D8 have
minterm m1), c4,2=w4,3+w4,9=1+1=2 (D3 and D9
have minterm m2), c4,3=w4,4+w4,6+w4,12=2+1+1=4
(D4, D6, and D12 have minterm m3),
c4,4=w4,5+w4,10=2+2=4, c4,5=w4,7=0, c4,6=w4,11=0.
C=(c4,1^2+c4,2^2+c4,3^2+c4,4^2+c4,5^2+c4,6^2)^0.5
=(0+4+16+16+0+0)^0.5=36^0.5=6
Each ki is thus converted from a vector of length 4 into a
vector of length 6 (one component per independent vector
V1, …, V6).
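The coefficient sums above are easy to get wrong by hand, so it can help to recompute them. The following sketch rebuilds every c(i,r) from the example data, normalizes each row by its length C, and prints one keyword correlation. The class and method names are hypothetical, and a keyword is assumed to be present in a document when its weight is positive.

```java
import java.util.Arrays;

class GvsmTermVectors {
    // Documents D1..D12 from the example; column i holds the weight of keyword k(i+1).
    static final int[][] DOCS = {
        {2, 1, 0, 0}, {5, 1, 0, 0}, {1, 1, 1, 1}, {0, 0, 2, 2},
        {0, 1, 1, 2}, {0, 0, 1, 1}, {0, 0, 1, 0}, {1, 1, 0, 0},
        {2, 1, 1, 1}, {0, 2, 2, 2}, {1, 0, 2, 0}, {0, 0, 2, 1},
    };
    // Minterms m1..m6 from the example.
    static final int[][] MINTERMS = {
        {1, 1, 0, 0}, {1, 1, 1, 1}, {0, 0, 1, 1},
        {0, 1, 1, 1}, {0, 0, 1, 0}, {1, 0, 1, 0},
    };

    // True when the document has a positive weight exactly where the minterm has a 1.
    static boolean matches(int[] doc, int[] minterm) {
        for (int l = 0; l < doc.length; l++) {
            if ((doc[l] > 0) != (minterm[l] == 1)) {
                return false;
            }
        }
        return true;
    }

    // c[i][r]: sum of keyword i's weights over the documents whose pattern is minterm r.
    static double[][] coefficients() {
        double[][] c = new double[4][MINTERMS.length];
        for (int[] doc : DOCS) {
            for (int r = 0; r < MINTERMS.length; r++) {
                if (!matches(doc, MINTERMS[r])) {
                    continue;
                }
                for (int i = 0; i < 4; i++) {
                    c[i][r] += doc[i];
                }
            }
        }
        return c;
    }

    // Divides a coefficient row by its Euclidean length C.
    static double[] normalize(double[] row) {
        double sumSq = 0;
        for (double x : row) {
            sumSq += x * x;
        }
        double norm = Math.sqrt(sumSq);
        double[] k = new double[row.length];
        for (int r = 0; r < row.length; r++) {
            k[r] = row[r] / norm;
        }
        return k;
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int r = 0; r < a.length; r++) {
            s += a[r] * b[r];
        }
        return s;
    }

    public static void main(String[] args) {
        double[][] c = coefficients();
        for (int i = 0; i < c.length; i++) {
            System.out.println("c" + (i + 1) + " = " + Arrays.toString(c[i]));
        }
        // Correlation between k1 and k3 after normalization, roughly 0.14.
        System.out.println("k1 . k3 = " + dot(normalize(c[0]), normalize(c[2])));
    }
}
```

The nonzero value of k1 · k3 reflects that the two keywords co-occur in documents with minterms m2 and m6, which the classic model with orthogonal keyword vectors cannot express.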
Google Web API
See: http://www.google.com/apis/
Concept:
• With the Google Web APIs service,
software developers can query more than
3 billion web documents directly from their
own computer programs.
• Google uses the SOAP and WSDL
standards so a developer can program in
his or her favorite environment - such as
Java, Perl, or Visual Studio .NET.
Google Web APIs provide three
services:
• Search for relevant web pages according to
the keyword(s) the user supplies
• Return the cached web page for the URL
the user supplies
• Correct the spelling of the word the user inputs
Search Requests:
• Search requests submit a query string and
a set of parameters to the Google Web
APIs service and receive in return a set of
search results. Search results are derived
from Google’s index of over 2 billion Web
pages.
Search Request Format:
Name         Description
Key          Provided by Google. Google uses the key for authentication and logging.
Q            Query string.
start        Zero-based index of the first desired result.
maxResults   Number of results desired per query. The maximum value per query is 10.
filter       Activates or deactivates automatic results filtering, which hides very
             similar results and results that all come from the same Web host.
restrict     Restricts the search to a subset of the Google Web index, such as a
             topic like “Linux”.
safeSearch   A Boolean value which enables filtering of adult content in the
             search results.
lr           Language Restrict. Restricts the search to documents within one or
             more languages.
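Because a single query returns at most 10 results, collecting more than 10 means issuing several requests with increasing start values. The sketch below only works out the (start, maxResults) pairs and makes no call to the service; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SearchPaging {
    // Per-query cap stated in the parameter table.
    static final int MAX_PER_QUERY = 10;

    // Returns {start, maxResults} pairs that together cover the first `wanted` results.
    static List<int[]> pages(int wanted) {
        List<int[]> out = new ArrayList<>();
        for (int start = 0; start < wanted; start += MAX_PER_QUERY) {
            out.add(new int[] {start, Math.min(MAX_PER_QUERY, wanted - start)});
        }
        return out;
    }

    public static void main(String[] args) {
        // 25 results need three requests: 0/10, 10/10, 20/5.
        for (int[] p : pages(25)) {
            System.out.println("start=" + p[0] + " maxResults=" + p[1]);
        }
    }
}
```

Each pair would then be passed as the start and maxResults parameters of one search request.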
Search Results Format:
• Search Response----Each time you issue
a search request to the Google service, a
response is returned to you. (We will
describe the meanings of the values
returned to you.)
• Result Element
Search Response:
• <documentFiltering>--A Boolean value
indicating whether filtering was performed
on the search results
• <searchComments>--A text string intended
for displaying to an end user
• <estimatedTotalResultsCount>--The
estimated total number of results that exist
for the query
• <estimateIsExact>--A Boolean value
indicating that the estimate value is
actually the exact value
• <resultElements>--An array of
<resultElement> items. This corresponds
to the actual list of search results
• <searchQuery>--This is the value of <Q>
for the search request
• <startIndex>--Indicates the index (1-based)
of the first search result in <resultElements>
• <endIndex>--Indicates the index (1-based)
of the last search result in <resultElements>
• <searchTips>--A text string intended for
displaying to the end user. It provides
instructive suggestions on how to use
Google
• <directoryCategories>--An array of
<directoryCategory> items
• <searchTime>--Text, floating-point number
indicating the total server time to return the
search results, measured in seconds
Cache Requests:
• Cache requests submit a URL to the
Google Web APIs service and receive in
return the contents of the URL when
Google’s crawlers last visited the page.
Spelling Requests:
• Spelling requests submit a query to the
Google Web APIs service and receive in
return a suggested spelling correction for the
query (if available).
Java Implementation:
• Google provides a Java implementation of
the Google Web APIs.
• We will take a look at it and finish with an
example.
The Java classes:
• com.google.soap.search.GoogleSearch
• com.google.soap.search.GoogleSearchResult
• com.google.soap.search.GoogleSearchResultElement
• com.google.soap.search.GoogleSearchFault
• com.google.soap.search.GoogleSearchDirectoryCategory
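Usage Demo:
    // The classes used here come from the com.google.soap.search package.
    // clientKey, directive, and directiveArg are supplied by the caller
    // (for example, from the command-line arguments).
    GoogleSearch s = new GoogleSearch();
    s.setKey(clientKey);
    try {
        if (directive.equalsIgnoreCase("search")) {
            // Web search: print the whole result object.
            s.setQueryString(directiveArg);
            GoogleSearchResult r = s.doSearch();
            System.out.println(r.toString());
        } else if (directive.equalsIgnoreCase("cached")) {
            // Cached page: the service returns the page as raw bytes.
            byte[] cachedBytes = s.doGetCachedPage(directiveArg);
            String cachedString = new String(cachedBytes);
            System.out.println(cachedString);
        } else if (directive.equalsIgnoreCase("spell")) {
            // Spelling suggestion for the given phrase.
            System.out.println("Spelling suggestion:");
            String suggestion = s.doSpellingSuggestion(directiveArg);
            System.out.println(suggestion);
        }
    } catch (GoogleSearchFault f) {
        // Reported when the call to the Google Web APIs fails.
        System.out.println("The call to the Google Web APIs failed:");
        System.out.println(f.toString());
    }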
How to build the executable file
• 1. Write your own code at the appropriate place in
GoogleAPIDemo.java.
• 2. Compile GoogleAPIDemo.java.
• 3. Add GoogleAPIDemo$1.class and
GoogleAPIDemo.class (both generated by
step 2) to the directory
“com.google.soap.search” of
GoogleAPI.jar, using WinRAR.
• 4. Double-click exec.bat to run the program.
Example program:
• You can download the executable files and
source files of the example from Dr.
Wang’s home page.