Generalized Vector Space Model

• Definition: Let $\vec{k}_i$ be a vector associated with the index term $k_i$. Independence of index terms in the vector model implies that the set of vectors $\{\vec{k}_1, \vec{k}_2, \ldots, \vec{k}_t\}$ is linearly independent and forms a basis for the subspace of interest. The dimension of this space is the number $t$ of index terms in the collection.

An example of independent vectors:
• V1=(1, 0, 0), V2=(0, 1, 0), V3=(0, 0, 1).
• V1·V2 = 0+0+0 = 0.
• Vi·Vj = 0 for all i ≠ j.
• Each element represents a keyword.
• Different keywords are treated as totally unrelated items. This is not reasonable, since keywords are sometimes related.

• Definition: Given the set $\{k_1, k_2, \ldots, k_t\}$ of index terms in a collection, as before, let $w_{i,j}$ be the weight associated with the term-document pair $[k_i, d_j]$. If the $w_{i,j}$ weights are all binary, then all possible patterns of term co-occurrence (inside documents) can be represented by a set of $2^t$ minterms given by $m_1 = (0, 0, \ldots, 0)$, $m_2 = (1, 0, \ldots, 0)$, …, $m_{2^t} = (1, 1, \ldots, 1)$. Let $g_i(m_j)$ return the weight {0,1} of the index term $k_i$ in the minterm $m_j$.

• Definition: Let us define the following set of vectors:
$\vec{m}_1 = (0, 0, \ldots, 0, 1)$, $\vec{m}_2 = (0, 0, \ldots, 1, 0)$, …, $\vec{m}_{2^t} = (1, 0, \ldots, 0, 0)$,
where each vector $\vec{m}_i$ is associated with the respective minterm $m_i$.

• These vectors are pairwise orthogonal:
\[ \vec{m}_i \cdot \vec{m}_j = 0 \quad \text{for all } i \neq j \]

• Each index term vector is a normalized combination of the minterm vectors:
\[ \vec{k}_i = \frac{\sum_{\forall r,\; g_i(m_r)=1} c_{i,r}\,\vec{m}_r}{\sqrt{\sum_{\forall r,\; g_i(m_r)=1} c_{i,r}^2}} \qquad (1.1) \]
\[ c_{i,r} = \sum_{d_j \,\mid\, g_l(\vec{d}_j) = g_l(m_r) \text{ for all } l} w_{i,j} \qquad (1.2) \]

• The correlation between two index terms, and the document and query vectors, are then:
\[ \vec{k}_i \cdot \vec{k}_j = \sum_{\forall r \,\mid\, g_i(m_r)=1 \,\wedge\, g_j(m_r)=1} c_{i,r} \times c_{j,r} \]
\[ \vec{d}_j = \sum_{\forall i} w_{i,j}\,\vec{k}_i, \qquad \vec{q} = \sum_{\forall i} w_{i,q}\,\vec{k}_i \]

An example for the Generalized Vector Space Model
• Suppose that the system has 12 documents and 4 keywords.
• D1=(2, 1, 0, 0), D2=(5, 1, 0, 0), D3=(1, 1, 1, 1),
• D4=(0, 0, 2, 2), D5=(0, 1, 1, 2), D6=(0, 0, 1, 1),
• D7=(0, 0, 1, 0), D8=(1, 1, 0, 0), D9=(2, 1, 1, 1),
• D10=(0, 2, 2, 2), D11=(1, 0, 2, 0), D12=(0, 0, 2, 1).
• Minterms: only the 6 minterms that actually occur in the documents are used as independent vectors to form a basis.
• m1=(1, 1, 0, 0), m2=(1, 1, 1, 1), m3=(0, 0, 1, 1), m4=(0, 1, 1, 1), m5=(0, 0, 1, 0), m6=(1, 0, 1, 0).
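The six minterms listed above can be checked mechanically: a document's minterm is its weight vector with every non-zero weight replaced by 1. The following is a minimal sketch under that assumption; the class and method names (`MintermDemo`, `pattern`, `groupByMinterm`) are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MintermDemo {
    // Weights of the 12 example documents (rows) over the 4 keywords (columns).
    static final int[][] DOCS = {
        {2, 1, 0, 0}, {5, 1, 0, 0}, {1, 1, 1, 1}, {0, 0, 2, 2},
        {0, 1, 1, 2}, {0, 0, 1, 1}, {0, 0, 1, 0}, {1, 1, 0, 0},
        {2, 1, 1, 1}, {0, 2, 2, 2}, {1, 0, 2, 0}, {0, 0, 2, 1}
    };

    // Binarize a document: a keyword counts as present when its weight is non-zero.
    static String pattern(int[] doc) {
        StringBuilder sb = new StringBuilder();
        for (int w : doc) {
            sb.append(w > 0 ? '1' : '0');
        }
        return sb.toString();
    }

    // Group 1-based document numbers by co-occurrence pattern,
    // keeping the patterns in order of first appearance.
    static Map<String, List<Integer>> groupByMinterm(int[][] docs) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (int j = 0; j < docs.length; j++) {
            groups.computeIfAbsent(pattern(docs[j]), k -> new ArrayList<>()).add(j + 1);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Prints one line per distinct minterm with the documents that share it.
        groupByMinterm(DOCS).forEach((m, ds) -> System.out.println(m + " -> documents " + ds));
    }
}
```

Only patterns that occur in at least one document are produced, which is why the example works with 6 basis vectors instead of all 2^4 = 16 possible minterms.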
• Independent vectors: $V_1 = (1, 0, 0, 0, 0, 0)$, $V_2 = (0, 1, 0, 0, 0, 0)$, $V_3 = (0, 0, 1, 0, 0, 0)$, $V_4 = (0, 0, 0, 1, 0, 0)$, $V_5 = (0, 0, 0, 0, 1, 0)$, $V_6 = (0, 0, 0, 0, 0, 1)$.
• $V_i$ represents minterm $m_i$.
• Each pair $V_i$, $V_j$ with $i \neq j$ is orthogonal (dot product = 0).
• The four keywords $k_1$, $k_2$, $k_3$, and $k_4$ are each represented by a combination of the independent vectors. Here $w_{i,j}$ is the weight of keyword $k_i$ in document $D_j$.

$k_1 = (c_{1,1}V_1 + c_{1,2}V_2 + c_{1,3}V_3 + c_{1,4}V_4 + c_{1,5}V_5 + c_{1,6}V_6)/C$, where
• $c_{1,1} = w_{1,1} + w_{1,2} + w_{1,8} = 2 + 5 + 1 = 8$ (D1, D2, and D8 have minterm m1),
• $c_{1,2} = w_{1,3} + w_{1,9} = 1 + 2 = 3$ (D3 and D9 have minterm m2),
• $c_{1,3} = w_{1,4} + w_{1,6} + w_{1,12} = 0 + 0 + 0 = 0$ (D4, D6, and D12 have minterm m3),
• $c_{1,4} = w_{1,5} + w_{1,10} = 0 + 0 = 0$ (D5 and D10 have minterm m4),
• $c_{1,5} = w_{1,7} = 0$ (D7 has minterm m5),
• $c_{1,6} = w_{1,11} = 1$ (D11 has minterm m6),
• $C = (c_{1,1}^2 + c_{1,2}^2 + c_{1,3}^2 + c_{1,4}^2 + c_{1,5}^2 + c_{1,6}^2)^{0.5} = \sqrt{74}$.

$k_2 = (c_{2,1}V_1 + c_{2,2}V_2 + c_{2,3}V_3 + c_{2,4}V_4 + c_{2,5}V_5 + c_{2,6}V_6)/C$, where
• $c_{2,1} = w_{2,1} + w_{2,2} + w_{2,8} = 1 + 1 + 1 = 3$,
• $c_{2,2} = w_{2,3} + w_{2,9} = 1 + 1 = 2$,
• $c_{2,3} = w_{2,4} + w_{2,6} + w_{2,12} = 0 + 0 + 0 = 0$,
• $c_{2,4} = w_{2,5} + w_{2,10} = 1 + 2 = 3$,
• $c_{2,5} = w_{2,7} = 0$,
• $c_{2,6} = w_{2,11} = 0$,
• $C = (c_{2,1}^2 + c_{2,2}^2 + c_{2,3}^2 + c_{2,4}^2 + c_{2,5}^2 + c_{2,6}^2)^{0.5} = \sqrt{22}$.

$k_3 = (c_{3,1}V_1 + c_{3,2}V_2 + c_{3,3}V_3 + c_{3,4}V_4 + c_{3,5}V_5 + c_{3,6}V_6)/C$, where
• $c_{3,1} = w_{3,1} + w_{3,2} + w_{3,8} = 0 + 0 + 0 = 0$,
• $c_{3,2} = w_{3,3} + w_{3,9} = 1 + 1 = 2$,
• $c_{3,3} = w_{3,4} + w_{3,6} + w_{3,12} = 2 + 1 + 2 = 5$,
• $c_{3,4} = w_{3,5} + w_{3,10} = 1 + 2 = 3$,
• $c_{3,5} = w_{3,7} = 1$,
• $c_{3,6} = w_{3,11} = 2$,
• $C = (c_{3,1}^2 + c_{3,2}^2 + c_{3,3}^2 + c_{3,4}^2 + c_{3,5}^2 + c_{3,6}^2)^{0.5} = \sqrt{43}$.

$k_4 = (c_{4,1}V_1 + c_{4,2}V_2 + c_{4,3}V_3 + c_{4,4}V_4 + c_{4,5}V_5 + c_{4,6}V_6)/C$, where
• $c_{4,1} = w_{4,1} + w_{4,2} + w_{4,8} = 0 + 0 + 0 = 0$,
• $c_{4,2} = w_{4,3} + w_{4,9} = 1 + 1 = 2$,
• $c_{4,3} = w_{4,4} + w_{4,6} + w_{4,12} = 2 + 1 + 1 = 4$,
• $c_{4,4} = w_{4,5} + w_{4,10} = 2 + 2 = 4$,
• $c_{4,5} = w_{4,7} = 0$,
• $c_{4,6} = w_{4,11} = 0$,
• $C = (c_{4,1}^2 + c_{4,2}^2 + c_{4,3}^2 + c_{4,4}^2 + c_{4,5}^2 + c_{4,6}^2)^{0.5} = \sqrt{36} = 6$.

Each keyword vector $k_i$ is thus converted from a vector of dimension 4 into a vector of dimension 6.
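The coefficient sums in the example above can be reproduced in a few lines of code. The sketch below is only an illustration under stated assumptions: the names (`GvsmTermVectors`, `coefficients`, `termVectors`, `dot`) are hypothetical, minterms are numbered in order of first appearance among the documents (which matches m1 to m6 for this data), and the sum runs over all minterms, since a minterm that does not contain keyword k_i contributes a coefficient of 0 anyway.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class GvsmTermVectors {
    // Weights of the 12 example documents (rows) over the 4 keywords (columns).
    static final int[][] DOCS = {
        {2, 1, 0, 0}, {5, 1, 0, 0}, {1, 1, 1, 1}, {0, 0, 2, 2},
        {0, 1, 1, 2}, {0, 0, 1, 1}, {0, 0, 1, 0}, {1, 1, 0, 0},
        {2, 1, 1, 1}, {0, 2, 2, 2}, {1, 0, 2, 0}, {0, 0, 2, 1}
    };

    // c[i][r] = sum of the weights of keyword i over all documents whose
    // binary co-occurrence pattern equals minterm r.
    static double[][] coefficients(int[][] docs) {
        int t = docs[0].length;
        Map<String, Integer> mintermIndex = new LinkedHashMap<>();
        int[] docMinterm = new int[docs.length];
        for (int j = 0; j < docs.length; j++) {
            StringBuilder sb = new StringBuilder();
            for (int w : docs[j]) {
                sb.append(w > 0 ? '1' : '0');
            }
            String key = sb.toString();
            Integer r = mintermIndex.get(key);
            if (r == null) {
                r = mintermIndex.size();   // next free minterm number
                mintermIndex.put(key, r);
            }
            docMinterm[j] = r;
        }
        double[][] c = new double[t][mintermIndex.size()];
        for (int j = 0; j < docs.length; j++) {
            for (int i = 0; i < t; i++) {
                c[i][docMinterm[j]] += docs[j][i];
            }
        }
        return c;
    }

    // Divide each coefficient row by its Euclidean norm C to obtain k_i.
    static double[][] termVectors(double[][] c) {
        double[][] k = new double[c.length][];
        for (int i = 0; i < c.length; i++) {
            double norm = Math.sqrt(dot(c[i], c[i]));
            k[i] = new double[c[i].length];
            for (int r = 0; r < c[i].length; r++) {
                k[i][r] = norm == 0 ? 0 : c[i][r] / norm;
            }
        }
        return k;
    }

    static double dot(double[] a, double[] b) {
        double sum = 0;
        for (int r = 0; r < a.length; r++) {
            sum += a[r] * b[r];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] c = coefficients(DOCS);
        double[][] k = termVectors(c);
        for (int i = 0; i < c.length; i++) {
            System.out.println("c" + (i + 1) + " = " + java.util.Arrays.toString(c[i]));
            System.out.println("k" + (i + 1) + " = " + java.util.Arrays.toString(k[i]));
        }
    }
}
```

Calling `dot` on two coefficient rows gives the sum of products c_{i,r} × c_{j,r} used as the keyword correlation, so related keywords (such as k3 and k4 here, which share minterms m2, m3, and m4) no longer have a zero dot product.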
Google Web API

See: http://www.google.com/apis/

Concept:
• With the Google Web APIs service, software developers can query more than 3 billion web documents directly from their own computer programs.
• Google uses the SOAP and WSDL standards, so a developer can program in his or her favorite environment, such as Java, Perl, or Visual Studio .NET.

Google Web APIs provide three services:
• Search for relevant web pages matching the keyword(s) the user supplies
• Return the cached copy of the web page at the URL the user supplies
• Suggest a spelling correction for the word(s) the user enters

Search Requests:
• Search requests submit a query string and a set of parameters to the Google Web APIs service and receive in return a set of search results. Search results are derived from Google’s index of over 2 billion Web pages.

Search Request Format:

Name        Description
key         Provided by Google. Google uses the key for authentication and logging.
q           Query string.
start       Zero-based index of the first desired result.
maxResults  Number of results desired per query. The maximum value per query is 10.
filter      Activates or deactivates automatic results filtering, which hides very similar results and results that all come from the same Web host.
restrict    Restricts the search to a subset of the Google Web index, such as a topic like “Linux”.
safeSearch  A Boolean value which enables filtering of adult content in the search results.
lr          Language Restrict. Restricts the search to documents within one or more languages.

Search Results Format:
• Search Response: Each time you issue a search request to the Google service, a response is returned to you. (The values it contains are described next.)
• Result Element

Search Response:
• <documentFiltering>: A Boolean value indicating whether filtering was performed on the search results
• <searchComments>: A text string intended for display to an end user
• <estimatedTotalResultsCount>: The estimated total number of results that exist for the query
• <estimateIsExact>: A Boolean value indicating that the estimate is actually the exact value
• <resultElements>: An array of <resultElement> items. This corresponds to the actual list of search results
• <searchQuery>: The value of <q> for the search request
• <startIndex>: The index (1-based) of the first search result in <resultElements>
• <endIndex>: The index (1-based) of the last search result in <resultElements>
• <searchTips>: A text string intended for display to the end user. It provides instructive suggestions on how to use Google
• <directoryCategories>: An array of <directoryCategory> items
• <searchTime>: Text, a floating-point number indicating the total server time to return the search results, measured in seconds

Cache Requests:
• Cache requests submit a URL to the Google Web APIs service and receive in return the contents of the URL as of the last time Google’s crawlers visited the page.

Spelling Requests:
• Spelling requests submit a query to the Google Web APIs service and receive in return a suggested spelling correction for the query (if available).

Java Implementation:
• Google provides a Java implementation of the Google Web APIs.
• We will look at its classes first and then at an example.
The Java classes:
• com.google.soap.search.GoogleSearch
• com.google.soap.search.GoogleSearchResult
• com.google.soap.search.GoogleSearchResultElement
• com.google.soap.search.GoogleSearchFault
• com.google.soap.search.GoogleSearchDirectoryCategory

Usage Demo:

    import com.google.soap.search.*;   // at the top of the source file

    // clientKey, directive, and directiveArg are assumed to have been set
    // earlier, for example from the command-line arguments.
    GoogleSearch s = new GoogleSearch();
    s.setKey(clientKey);   // the key provided by Google

    try {
        if (directive.equalsIgnoreCase("search")) {
            // Search request: submit the query string and print the response.
            s.setQueryString(directiveArg);
            GoogleSearchResult r = s.doSearch();
            System.out.println(r.toString());
        } else if (directive.equalsIgnoreCase("cached")) {
            // Cache request: fetch the cached copy of the given URL.
            byte[] cachedBytes = s.doGetCachedPage(directiveArg);
            String cachedString = new String(cachedBytes);
            System.out.println(cachedString);
        } else if (directive.equalsIgnoreCase("spell")) {
            // Spelling request: ask for a suggested correction.
            System.out.println("Spelling suggestion:");
            String suggestion = s.doSpellingSuggestion(directiveArg);
            System.out.println(suggestion);
        }
    } catch (GoogleSearchFault f) {
        // Reported when a call to the Google Web APIs service fails.
        System.out.println("The call to the Google Web APIs failed:");
        System.out.println(f.toString());
    }

How to build the executable:
• 1. Write your own code in the appropriate place in GoogleAPIDemo.java.
• 2. Compile GoogleAPIDemo.java.
• 3. Add GoogleAPIDemo$1.class and GoogleAPIDemo.class (both generated in step 2) to the directory “com.google.soap.search” of GoogleAPI.jar, using WinRAR.
• 4. Click exec.bat to run the program.

Example program:
• You can download the executable files and source files of the example from Dr. Wang’s home page.
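Because the search request format allows at most 10 results per query, retrieving more results takes several requests with increasing start values. The sketch below only plans those requests in plain Java and does not call the Google classes; the names (`SearchPaging`, `plan`) are hypothetical, and the mapping from the zero-based start parameter to the 1-based <startIndex> and <endIndex> values is an assumption drawn from the parameter descriptions, valid only when enough results exist.

```java
import java.util.ArrayList;
import java.util.List;

class SearchPaging {
    // Upper limit on maxResults stated in the search request format.
    static final int MAX_RESULTS_PER_QUERY = 10;

    // Returns {start, maxResults} pairs that together cover the first `wanted` results.
    static List<int[]> plan(int wanted) {
        List<int[]> requests = new ArrayList<>();
        for (int start = 0; start < wanted; start += MAX_RESULTS_PER_QUERY) {
            int maxResults = Math.min(MAX_RESULTS_PER_QUERY, wanted - start);
            requests.add(new int[] {start, maxResults});
        }
        return requests;
    }

    public static void main(String[] args) {
        for (int[] req : plan(25)) {
            int start = req[0];
            int maxResults = req[1];
            // Assumed 1-based indices of the first and last result in the response.
            int startIndex = start + 1;
            int endIndex = start + maxResults;
            System.out.println("start=" + start + " maxResults=" + maxResults
                    + " (results " + startIndex + " to " + endIndex + ")");
        }
    }
}
```

Each planned pair would be supplied as the start and maxResults parameters of one search request; a real client should also stop early when a response returns fewer results than requested.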