Unsupervised Topic Discovery (UTD)
Finds topics (a la PSM) for the whole corpus (all the folders) first.
- A topic is one of several coherent subjects within a document.
- A document contains many topics.
Merge very similar documents -> new documents
Find all close document pairs. Compute term intersections for all close pairs.
Cluster doc-intersections -> topics
Use the OnTopic EM method to sharpen distributions
Find understandable names (labels) for each topic
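
A minimal sketch of the middle steps above (close pairs, term intersections, clustering, labeling). It is an illustration under stated assumptions, not the UTD implementation: Jaccard similarity over whitespace tokens, the 0.3 threshold, the greedy single-pass clustering, and the "most frequent terms" labeling are all stand-ins chosen for brevity, and the function names are hypothetical. The merge step and the OnTopic EM sharpening step are not shown.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two term sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def close_pair_intersections(docs, threshold=0.3):
    """Return the shared terms of every document pair whose Jaccard
    similarity is at least `threshold` (assumed closeness measure)."""
    term_sets = [set(d.lower().split()) for d in docs]
    intersections = []
    for i, j in combinations(range(len(term_sets)), 2):
        if jaccard(term_sets[i], term_sets[j]) >= threshold:
            intersections.append(term_sets[i] & term_sets[j])
    return intersections


def cluster_intersections(intersections, threshold=0.3):
    """Greedy single-pass clustering (an assumption): each intersection
    joins the first cluster whose vocabulary it overlaps enough,
    otherwise it starts a new cluster. A cluster maps term -> count."""
    clusters = []
    for terms in intersections:
        for cluster in clusters:
            if jaccard(terms, set(cluster)) >= threshold:
                for t in terms:
                    cluster[t] = cluster.get(t, 0) + 1
                break
        else:
            clusters.append({t: 1 for t in terms})
    return clusters


def label(cluster, n=2):
    """Name a topic by its n most frequent terms (ties broken alphabetically)."""
    return sorted(cluster, key=lambda t: (-cluster[t], t))[:n]


# Toy corpus: two documents about markets, two about football.
docs = [
    "stock market prices fell",
    "stock market prices rose",
    "football team won match",
    "football team lost match",
]
topics = cluster_intersections(close_pair_intersections(docs))
for topic in topics:
    print(label(topic), sorted(topic))
```

On this toy corpus only the two within-subject pairs count as close, so each pair's shared terms form a separate topic. A real corpus would need stop-word removal and a weighted similarity measure before the intersections become meaningful.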