Latent Semantic Indexing - Querying and Augmenting LSI Vector Spaces

Querying and Augmenting LSI Vector Spaces

The computed Tk and Dk matrices define the term and document vector spaces, which with the computed singular values, Sk, embody the conceptual information derived from the document collection. The similarity of terms or documents within these spaces is a factor of how close they are to each other in these spaces, typically computed as a function of the angle between the corresponding vectors.

The same steps are used to locate the vectors representing the text of queries and new documents within the document space of an existing LSI index. By a simple transformation of the A = T S DT equation into the equivalent D = AT T S−1 equation, a new vector, d, for a query or for a new document can be created by computing a new column in A and then multiplying the new column by T S−1. The new column in A is computed using the originally derived global term weights and applying the same local weighting function to the terms in the query or in the new document.

A drawback to computing vectors in this way, when adding new searchable documents, is that terms that were not known during the SVD phase for the original index are ignored. These terms will have no impact on the global weights and learned correlations derived from the original collection of text. However, the computed vectors for the new text are still very relevant for similarity comparisons with all other document vectors.

The process of augmenting the document vector spaces for an LSI index with new documents in this manner is called folding in. Although the folding-in process does not account for the new semantic content of the new text, adding a substantial number of documents in this way will still provide good results for queries as long as the terms and concepts they contain are well represented within the LSI index to which they are being added. When the terms and concepts of a new set of documents need to be included in an LSI index, either the term-document matrix, and the SVD, must be recomputed or an incremental update method (such as the one described in ) be used.

Read more about this topic:  Latent Semantic Indexing

Famous quotes containing the words augmenting and/or spaces:

    [The Declaration of Independence] meant to set up a standard maxim for free society, which should be familiar to all, and revered by all; constantly looked to, constantly labored for, and even though never perfectly attained, constantly approximated, and thereby constantly spreading and deepening its influence, and augmenting the happiness and value of life to all people of all colors everywhere.
    Abraham Lincoln (1809–1865)

    Though there were numerous vessels at this great distance in the horizon on every side, yet the vast spaces between them, like the spaces between the stars,—far as they were distant from us, so were they from one another,—nay, some were twice as far from each other as from us,—impressed us with a sense of the immensity of the ocean, the “unfruitful ocean,” as it has been called, and we could see what proportion man and his works bear to the globe.
    Henry David Thoreau (1817–1862)