Anna Drummond, Topic Models For Feature Selection in Document Clustering

Slides

We investigate using a topic model, such as the popular Latent Dirichlet Allocation (LDA) model, as a feature selection step for unsupervised document clustering, where documents are clustered by the proportions of the topics present in each document. One concern with using LDA as a feature selection method for input to a clustering algorithm is that the Dirichlet prior on the topic mixing proportions is too smooth and well-behaved. It does not encourage a “bumpy” distribution of topic mixing proportion vectors, which is what one would want as input to a clustering algorithm. We therefore propose two variant topic models designed to produce topic mixing proportions with better clustering structure.
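
To make the pipeline concrete, here is a minimal sketch of the clustering step only, assuming the topic mixing proportions have already been inferred by some topic model. The `theta` values are made-up illustrative numbers, not results from the talk, and the `kmeans` helper is a hypothetical stand-in for whatever clustering algorithm is actually used. It does not implement LDA or either of the proposed variant models.

```python
import math

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; centers start at the first k points (deterministic)."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each document goes to its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical topic mixing proportions: six documents over three topics.
# Each row sums to 1 and is the feature vector for one document.
theta = [
    [0.80, 0.15, 0.05],
    [0.10, 0.85, 0.05],
    [0.75, 0.20, 0.05],
    [0.05, 0.90, 0.05],
    [0.70, 0.10, 0.20],
    [0.15, 0.80, 0.05],
]

labels, centers = kmeans(theta, k=2)
print(labels)
```

In this toy input the proportion vectors already fall into two well-separated groups, which is the kind of “bumpy” structure the abstract says a clustering algorithm would want. When the vectors are spread smoothly over the simplex, the same procedure still returns labels, but the clusters it finds are much less meaningful.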