Model
Parse the document collection into a set of sentences.
Use classic (unsupervised) cluster analysis techniques to partition the set of sentences into theme clusters, i.e. disjoint subsets of sentences such that all sentences in a cluster are “about” the same theme.
Compute each cluster center as (µ1, …, µn), where µi is the average frequency of the ith term over the sentences in the cluster.
For each cluster, compute the distance from each sentence s to its cluster center c as 1 - cos(s, c).
Consider the document collection center (modeling what the entire collection is “about”) to be the term frequency vector of the entire collection.
Compute the distance from each cluster center c to the document collection center d as 1 - cos(d, c).
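
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the model's actual implementation: the sentences are assumed to be already tokenized, the cluster labels are hypothetical stand-ins for the output of whichever clustering technique is used, and the helper names (term_vector, mean_vector, cosine_distance) are invented for this example. Raw term counts are used as the term frequencies.

```python
import math
from collections import Counter


def term_vector(tokens, vocab):
    """Term-frequency vector of one sentence over a fixed vocabulary."""
    counts = Counter(tokens)
    return [float(counts[t]) for t in vocab]


def mean_vector(vectors):
    """Component-wise average: mu_i is the mean frequency of the i-th term."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cos(a, b). Assumption: returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm


# Toy input: tokenized sentences plus hypothetical cluster labels
# (in practice the labels would come from the clustering step).
sentences = [
    ["cats", "chase", "mice"],
    ["cats", "eat", "mice"],
    ["stocks", "fell", "today"],
    ["stocks", "rose", "today"],
]
labels = [0, 0, 1, 1]

vocab = sorted({t for s in sentences for t in s})
vectors = [term_vector(s, vocab) for s in sentences]

# Cluster centers: average term frequencies of the sentences in each cluster.
clusters = {}
for vec, label in zip(vectors, labels):
    clusters.setdefault(label, []).append(vec)
centers = {label: mean_vector(vecs) for label, vecs in clusters.items()}

# Distance from each sentence s to its own cluster center c: 1 - cos(s, c).
sentence_dist = [cosine_distance(v, centers[l]) for v, l in zip(vectors, labels)]

# Collection center d: term-frequency vector of the entire collection.
collection_center = [sum(col) for col in zip(*vectors)]

# Distance from each cluster center c to the collection center d: 1 - cos(d, c).
center_dist = {l: cosine_distance(collection_center, c) for l, c in centers.items()}
```

Both distances lie between 0 and 1 for non-negative frequency vectors: a small sentence distance marks a sentence that is typical of its theme, and a small center distance marks a theme that is close to what the whole collection is “about”.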