## Parse the document collection into a set of sentences.

## Utilize classic (unsupervised, seeded) cluster analysis techniques to partition the set of sentences into theme clusters, i.e. disjoint subsets of sentences, such that each sentence in a cluster is “about” the same theme.

## Compute each cluster center as (µ1, …, µn), where µi is the average frequency of the i-th term in the cluster.

## For each cluster, compute the distance from each sentence s to its cluster center c as 1 − cos(s, c).
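The center and sentence-to-center distance can be sketched in a few lines of Python. This is only an illustration under assumptions the slide does not state: sentences are tokenized by whitespace, term frequency is a raw count, and the names `term_vector`, `centroid`, and `cosine_distance` are hypothetical helpers, not part of any described system.

```python
import math
from collections import Counter


def term_vector(sentence, vocabulary):
    """Raw term counts for one sentence, ordered by the vocabulary (assumed weighting)."""
    counts = Counter(sentence.lower().split())
    return [float(counts[term]) for term in vocabulary]


def centroid(vectors):
    """Cluster center: the average frequency of each term over the cluster's sentences."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """Distance 1 - cos(a, b); defined here as 1.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# Toy cluster of three sentences (illustrative data only).
cluster = ["the cat sat", "the cat ran", "a dog ran"]
vocabulary = sorted({word for s in cluster for word in s.lower().split()})

vectors = [term_vector(s, vocabulary) for s in cluster]
center = centroid(vectors)
distances = [cosine_distance(v, center) for v in vectors]

for sentence, d in zip(cluster, distances):
    print(f"{d:.3f}  {sentence}")
```

Sentences with a small distance share most of their terms with the rest of the cluster, so they sit close to the theme the center represents.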

## Consider the document collection center (modeling what the collection is “about”) to be the term frequency vector of the entire collection.

## Compute the distance from each cluster center c to the document collection center d as 1 − cos(d, c).
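As a rough sketch of the last two steps, the following Python builds the collection center from every sentence and measures how far each cluster center lies from it. The clusters, the whitespace tokenization, the raw-count weighting, and the helper names `tf` and `cosine_distance` are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Distance 1 - cos(a, b); defined here as 1.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical theme clusters, as produced by an earlier clustering step.
clusters = {
    "pets": ["the cat sat", "the cat ran"],
    "weather": ["rain fell today", "the rain stopped"],
}

all_sentences = [s for group in clusters.values() for s in group]
vocabulary = sorted({word for s in all_sentences for word in s.split()})


def tf(sentences):
    """Term frequency vector (raw counts) over a list of sentences."""
    counts = Counter(word for s in sentences for word in s.split())
    return [float(counts[term]) for term in vocabulary]


# Collection center d: term frequencies of the entire collection.
collection_center = tf(all_sentences)

cluster_to_collection = {}
for name, sentences in clusters.items():
    # Cluster center c: average term frequency per sentence in the cluster.
    center = [count / len(sentences) for count in tf(sentences)]
    cluster_to_collection[name] = cosine_distance(collection_center, center)

for name, d in sorted(cluster_to_collection.items(), key=lambda item: item[1]):
    print(f"{d:.3f}  {name}")
```

Because cosine ignores vector length, dividing by the number of sentences does not change the distance; it is kept only so the center matches the averaged definition above.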
