vector

Each call to vect_file takes the words from a region in a file, stems them,
discards common, meaningless words, and converts the rest to integer values
(wordcodes). A binary tree is built up with the wordcodes as keys to the
nodes, each node containing the number of occurrences of the word in the
document. When completed, the tree is flattened into an array by an
in-order traversal, resulting in the desired docvec. Zeroes are inserted
into the array for wordcodes which do not appear in the tree but do appear
in other documents in the collection.
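Here is a minimal sketch of the document tree just described. The node
layout and the names doc_node and doc_insert are illustrative assumptions,
not savant's actual identifiers:

    #include <stdlib.h>

    /* Assumed node layout: the wordcode keys the tree, the count tracks
       occurrences of that word in the current document. */
    typedef struct doc_node {
        int wordcode;                 /* integer code for a stemmed word */
        int count;                    /* occurrences in this document    */
        struct doc_node *left, *right;
    } doc_node;

    /* Record one occurrence of a word, keeping the tree sorted by
       wordcode. */
    doc_node *doc_insert(doc_node *root, int wordcode)
    {
        if (root == NULL) {
            root = malloc(sizeof *root);
            root->wordcode = wordcode;
            root->count = 1;
            root->left = root->right = NULL;
        } else if (wordcode < root->wordcode) {
            root->left = doc_insert(root->left, wordcode);
        } else if (wordcode > root->wordcode) {
            root->right = doc_insert(root->right, wordcode);
        } else {
            root->count++;            /* same word seen again */
        }
        return root;
    }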
The wordcodes which have appeared globally, so to speak, are tracked with a
lookup tree, whose nodes are sorted by wordcode like the document trees
above. In this case the nodes contain the index into a docvec of their
word; e.g., if the entry at node "banana" in this tree is 87, then
j_random_docvec[87] is the number of occurrences of "banana" in the
document. When a document tree is flattened into a docvec, it is compared
with this lookup tree so that zeroes can be inserted appropriately.
Conversely, if the document contains a word not appearing in the lookup
tree, a new node is created for it, containing the next highest index.
init_collection() initializes the lookup tree.
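Continuing the sketch above, the lookup tree and the flattening step might
look something like this (again, the names and globals are assumptions; in
savant, init_collection() would do the equivalent of resetting lookup_root
and n_words):

    /* Assumed lookup-tree node: maps a wordcode to its fixed index in
       every docvec. */
    typedef struct lookup_node {
        int wordcode;
        int index;                    /* this word's slot in any docvec */
        struct lookup_node *left, *right;
    } lookup_node;

    static lookup_node *lookup_root = NULL;
    static int n_words = 0;           /* next free docvec index */

    /* Return the docvec index for wordcode, creating a node with the
       next highest index if the word is new to the collection. */
    int lookup_index(lookup_node **node, int wordcode)
    {
        if (*node == NULL) {
            *node = malloc(sizeof **node);
            (*node)->wordcode = wordcode;
            (*node)->index = n_words++;
            (*node)->left = (*node)->right = NULL;
            return (*node)->index;
        }
        if (wordcode < (*node)->wordcode)
            return lookup_index(&(*node)->left, wordcode);
        if (wordcode > (*node)->wordcode)
            return lookup_index(&(*node)->right, wordcode);
        return (*node)->index;
    }

    /* Flatten a document tree into vec, which the caller has zeroed and
       sized for the worst case (n_words plus this document's distinct
       words, since new words grow the lookup tree during flattening).
       Slots for words absent from this document stay zero. */
    void flatten(doc_node *t, int *vec)
    {
        if (t == NULL) return;
        flatten(t->left, vec);
        vec[lookup_index(&lookup_root, t->wordcode)] = t->count;
        flatten(t->right, vec);
    }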
For now, the vectors returned will have a sparse representation.
Densification may come later. A docvec will be a struct containing an int
indicating the number of entries (because this will grow as indexing
progresses) and an array of ints, the vector itself. See vector.h for more
info.
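In code, that description suggests something along these lines; vector.h
is authoritative, and the field names here are guesses:

    /* Sketch of a docvec as described above: an entry count (which grows
       as indexing progresses) and the vector itself. */
    typedef struct {
        int num_entries;   /* how many slots of vec are meaningful        */
        int *vec;          /* vec[i] = occurrences of the word at index i */
    } docvec;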
We'll also have stuff in here for finding the covariance matrix of a bunch
of docvecs.
-------------------------------------------------

There's some conceptually important stuff in here. We have two kinds of
vectors, a DocVec and a WordVec. The first is associated with a particular
document from the collection (or a query) and contains arrays of word
hashes and weights, the frequencies of those words in the document. The
second is associated with a particular word appearing in the collection,
and contains an array of packed integers, each of which has a field
identifying a document in which the word appears and a field holding the
frequency of the word in that document.
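One plausible shape for such a packed entry is sketched below; the 24/8
bit split and the macro names are assumptions, not savant's actual layout:

    /* One packed WordVec entry: a document id and the word's frequency
       in that document squeezed into a single unsigned int. */
    #define FREQ_MASK 0xFFu

    #define PACK(doc, freq) (((unsigned)(doc) << 8) | ((freq) & FREQ_MASK))
    #define ENTRY_DOC(e)    ((e) >> 8)           /* which document      */
    #define ENTRY_FREQ(e)   ((e) & FREQ_MASK)    /* frequency, 0 to 255 */

    typedef struct {
        int num_entries;        /* number of documents containing the word */
        unsigned int *entries;  /* packed (document, frequency) pairs      */
    } wordvec;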
Basically, we can think of a document as a vector in "wordspace", a vector
space whose basis vectors correspond to words. For example, "Now is the
time for all quick brown men to jump over their lazy country" is parsed
into separate words and has the common words discarded, leaving "time quick
brown men jump lazy country". This can be thought of as a vector whose
"time" component is 1, whose "quick" component is 1, etc. If we perform
this operation on all documents in a collection, we can represent the
collection as a matrix by setting all these (column) vectors side by side.
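Concretely, if the example above were doc 1 and a made-up second document
reduced to "lazy brown dog" were doc 2, the collection matrix would be:

                doc 1   doc 2
    brown         1       1
    country       1       0
    dog           0       1
    jump          1       0
    lazy          1       1
    men           1       0
    quick         1       0
    time          1       0

Each column is a DocVec; each row, as described next, is a WordVec.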
Originally, savant used only these document vectors; collections were
saved as sequences of these vectors, and retrieval was performed by sucking
the whole thing into memory (since every single document vector has to be
examined in a query). It turns out, however, that it is much more
convenient to store and retrieve the row vectors of the matrix mentioned
above. These are the WordVecs. With these structures saved on disk, we
need only load one at a time: as we are finding the best matches for a
query, we are only looking at one word from it at a time, so we need only
grab from disk the row vector of the matrix that corresponds to that word.
We add the weights in this vector to a running total and discard it. Very
light on memory, and still very fast.
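That loop, under the packed-entry assumptions sketched earlier, might look
like this; load_wordvec and free_wordvec are hypothetical stand-ins for
savant's actual disk I/O:

    wordvec *load_wordvec(int wordcode); /* hypothetical: read row from disk */
    void     free_wordvec(wordvec *wv);  /* hypothetical: release it         */

    /* Score every document against a query by streaming one WordVec at a
       time: load the row for a query word, add its frequencies to the
       running totals, discard the row. Only one row is ever in memory. */
    void score_query(const int *query_words, int n_query_words,
                     float *scores /* one slot per document, zeroed */)
    {
        for (int w = 0; w < n_query_words; w++) {
            wordvec *row = load_wordvec(query_words[w]);
            if (row == NULL)
                continue;                 /* word absent from collection */
            for (int e = 0; e < row->num_entries; e++) {
                unsigned int entry = row->entries[e];
                scores[ENTRY_DOC(entry)] += (float)ENTRY_FREQ(entry);
            }
            free_wordvec(row);
        }
    }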