vector

Each call to vect_file takes the words from a region in a file, stems them,
discards common, meaningless words, and converts the rest to integer values
(wordcodes). A binary tree is built up with the wordcodes as keys to the
nodes, each node containing the number of occurrences of the word in the
document. When completed, the tree is flattened into an array by an
in-order traversal, resulting in the desired docvec. Zeroes are inserted
into the array for wordcodes which do not appear in the tree but do appear
in other documents in the collection.
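Here is a minimal sketch of the document tree just described. The node
layout and the names doc_node and doc_insert are illustrative assumptions,
not savant's actual identifiers:

    #include <stdlib.h>

    /* Assumed node layout: the wordcode keys the tree, the count tracks
       occurrences of that word in the current document. */
    typedef struct doc_node {
        int wordcode;                 /* integer code for a stemmed word */
        int count;                    /* occurrences in this document    */
        struct doc_node *left, *right;
    } doc_node;

    /* Record one occurrence of a word, keeping the tree sorted by
       wordcode. */
    doc_node *doc_insert(doc_node *root, int wordcode)
    {
        if (root == NULL) {
            root = malloc(sizeof *root);
            root->wordcode = wordcode;
            root->count = 1;
            root->left = root->right = NULL;
        } else if (wordcode < root->wordcode) {
            root->left = doc_insert(root->left, wordcode);
        } else if (wordcode > root->wordcode) {
            root->right = doc_insert(root->right, wordcode);
        } else {
            root->count++;            /* same word seen again */
        }
        return root;
    }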
The wordcodes which have appeared globally, so to speak, are tracked with a
lookup tree, whose nodes are sorted by wordcode like the document trees
above. In this case the nodes contain the index into a docvec of their
word; e.g., if the entry at node "banana" in this tree is 87, then
j_random_docvec[87] is the number of occurrences of "banana" in the
document. When a document tree is flattened into a docvec, it is compared
with this lookup tree so that zeroes can be inserted appropriately.
Conversely, if the document contains a word not appearing in the lookup
tree, a new node is created for it, containing the next highest index.
init_collection() initializes the lookup tree.
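Continuing the sketch above, the lookup tree and the flattening step might
look something like this (again, the names and globals are assumptions; in
savant, init_collection() would do the equivalent of resetting lookup_root
and n_words):

    /* Assumed lookup-tree node: maps a wordcode to its fixed index in
       every docvec. */
    typedef struct lookup_node {
        int wordcode;
        int index;                    /* this word's slot in any docvec */
        struct lookup_node *left, *right;
    } lookup_node;

    static lookup_node *lookup_root = NULL;
    static int n_words = 0;           /* next free docvec index */

    /* Return the docvec index for wordcode, creating a node with the
       next highest index if the word is new to the collection. */
    int lookup_index(lookup_node **node, int wordcode)
    {
        if (*node == NULL) {
            *node = malloc(sizeof **node);
            (*node)->wordcode = wordcode;
            (*node)->index = n_words++;
            (*node)->left = (*node)->right = NULL;
            return (*node)->index;
        }
        if (wordcode < (*node)->wordcode)
            return lookup_index(&(*node)->left, wordcode);
        if (wordcode > (*node)->wordcode)
            return lookup_index(&(*node)->right, wordcode);
        return (*node)->index;
    }

    /* Flatten a document tree into vec, which the caller has zeroed and
       sized for the worst case (n_words plus this document's distinct
       words, since new words grow the lookup tree during flattening).
       Slots for words absent from this document stay zero. */
    void flatten(doc_node *t, int *vec)
    {
        if (t == NULL) return;
        flatten(t->left, vec);
        vec[lookup_index(&lookup_root, t->wordcode)] = t->count;
        flatten(t->right, vec);
    }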
For now, the vectors returned will have a sparse representation.
Densification may come later. A docvec will be a struct containing an int
indicating the number of entries (because this will grow as indexing
progresses) and an array of ints, the vector itself. See vector.h for more
info.
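In code, that description suggests something along these lines; vector.h
is authoritative, and the field names here are guesses:

    /* Sketch of a docvec as described above: an entry count (which grows
       as indexing progresses) and the vector itself. */
    typedef struct {
        int num_entries;   /* how many slots of vec are meaningful        */
        int *vec;          /* vec[i] = occurrences of the word at index i */
    } docvec;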
We'll also have stuff in here for finding the covariance matrix of a bunch
of docvecs.
-------------------------------------------------

There's some conceptually important stuff in here. We have two kinds of
vectors, a DocVec and a WordVec. The first is associated with a particular
document from the collection (or a query) and contains arrays of word
hashes and weights, the frequencies of those words in the document. The
second is associated with a particular word appearing in the collection,
and contains an array of packed integers, each of which has a field
identifying a document in which the word appears and a field holding the
frequency of the word in that document.
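One plausible shape for such a packed entry is sketched below; the 24/8
bit split and the macro names are assumptions, not savant's actual layout:

    /* One packed WordVec entry: a document id and the word's frequency
       in that document squeezed into a single unsigned int. */
    #define FREQ_MASK 0xFFu

    #define PACK(doc, freq) (((unsigned)(doc) << 8) | ((freq) & FREQ_MASK))
    #define ENTRY_DOC(e)    ((e) >> 8)           /* which document      */
    #define ENTRY_FREQ(e)   ((e) & FREQ_MASK)    /* frequency, 0 to 255 */

    typedef struct {
        int num_entries;        /* number of documents containing the word */
        unsigned int *entries;  /* packed (document, frequency) pairs      */
    } wordvec;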
Basically, we can think of a document as a vector in "wordspace", a vector
space whose basis vectors correspond to words. For example, "Now is the
time for all quick brown men to jump over their lazy country" is parsed
into separate words and has the common words discarded, leaving "time quick
brown men jump lazy country". This can be thought of as a vector whose
"time" component is 1, whose "quick" component is 1, etc. If we perform
this operation on all documents in a collection, we can represent the
collection as a matrix by setting all these (column) vectors side by side.
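Concretely, if the example above were doc 1 and a made-up second document
reduced to "lazy brown dog" were doc 2, the collection matrix would be:

                doc 1   doc 2
    brown         1       1
    country       1       0
    dog           0       1
    jump          1       0
    lazy          1       1
    men           1       0
    quick         1       0
    time          1       0

Each column is a DocVec; each row, as described next, is a WordVec.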
Originally, savant used only these document vectors; collections were
saved as sequences of these vectors, and retrieval was performed by sucking
the whole thing into memory (since every single document vector has to be
examined in a query). It turns out, however, that it is much more
convenient to store and retrieve the row vectors of the matrix mentioned
above. These are the WordVecs. With these structures saved on disk, we
need only load one at a time: as we are finding the best matches for a
query, we are only looking at one word from it at a time, so we need only
grab from disk the row vector of the matrix that corresponds to that word.
We add the weights in this vector to a running total and discard it. Very
light on memory, and still very fast.
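That loop, under the packed-entry assumptions sketched earlier, might look
like this; load_wordvec and free_wordvec are hypothetical stand-ins for
savant's actual disk I/O:

    wordvec *load_wordvec(int wordcode); /* hypothetical: read row from disk */
    void     free_wordvec(wordvec *wv);  /* hypothetical: release it         */

    /* Score every document against a query by streaming one WordVec at a
       time: load the row for a query word, add its frequencies to the
       running totals, discard the row. Only one row is ever in memory. */
    void score_query(const int *query_words, int n_query_words,
                     float *scores /* one slot per document, zeroed */)
    {
        for (int w = 0; w < n_query_words; w++) {
            wordvec *row = load_wordvec(query_words[w]);
            if (row == NULL)
                continue;                 /* word absent from collection */
            for (int e = 0; e < row->num_entries; e++) {
                unsigned int entry = row->entries[e];
                scores[ENTRY_DOC(entry)] += (float)ENTRY_FREQ(entry);
            }
            free_wordvec(row);
        }
    }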