Model
Parse the document collection into a set of sentences.
Use classic (unsupervised) cluster analysis techniques to partition the set of sentences into theme clusters, i.e. disjoint subsets of sentences, such that each sentence in a cluster is "about" the same theme.
Compute each cluster center as (μ1, …, μn), where μi is the average frequency of the i-th term over the sentences in that cluster.
For each cluster, compute the distance from each sentence s to its cluster center c as 1 - cos(s, c).
Consider the document collection center (modeling what the collection is "about") to be the term frequency vector of the entire collection.
Compute the distance from each cluster center c to the document collection center d as 1 - cos(d, c).
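
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not name a sentence splitter or a specific clustering algorithm, so the sketch uses a naive punctuation-based splitter and a small k-means-style loop with cosine distance, seeded with the first k sentences. All function names (split_sentences, tf_vector, cluster, build_model) are hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(documents):
    """Naive splitter (an assumption): break after ., ! or ? plus whitespace."""
    sentences = []
    for doc in documents:
        for part in re.split(r"(?<=[.!?])\s+", doc.strip()):
            if part:
                sentences.append(part)
    return sentences


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_vector(text, vocab):
    """Raw term-frequency vector of `text` over a fixed vocabulary."""
    counts = Counter(tokenize(text))
    return [float(counts[term]) for term in vocab]


def cosine_distance(u, v):
    """1 - cos(u, v); defined as 1.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise average: the i-th entry is the mean frequency of term i."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cluster(vectors, k, iterations=10):
    """Toy k-means-style loop (an assumption, standing in for any
    classic clustering method); seeds are the first k vectors."""
    centers = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda j: cosine_distance(v, centers[j]))
            for v in vectors
        ]
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centers[j] = mean_vector(members)
    return labels, centers


def build_model(documents, k):
    sentences = split_sentences(documents)
    vocab = sorted({t for s in sentences for t in tokenize(s)})
    vectors = [tf_vector(s, vocab) for s in sentences]
    labels, centers = cluster(vectors, k)
    # Distance from each sentence s to its cluster center c.
    sentence_dist = [
        cosine_distance(v, centers[lab]) for v, lab in zip(vectors, labels)
    ]
    # Collection center d: term frequencies over the whole collection.
    collection_center = tf_vector(" ".join(documents), vocab)
    # Distance from each cluster center c to the collection center d.
    cluster_dist = [cosine_distance(collection_center, c) for c in centers]
    return sentences, labels, sentence_dist, cluster_dist
```

For example, build_model(["The cat sat. Stocks fell.", "The cat slept. Stocks rose."], k=2) returns the four sentences, one cluster label per sentence, one distance per sentence, and one distance per cluster. A production system would likely swap in a proper sentence tokenizer and a library clustering routine.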