Diametric Topic Mixtures for Documents Classification, Anna Drummond

The first large-scale clinical data warehouses (CDWs) have appeared over the last few years. A CDW is a large repository of computer-based patient records (CPRs) culled from multiple sources. These CPRs contain both structured information (numerical, categorical) and unstructured information (such as text and images). This research focuses on the clinical notes (text) associated with the CPR. Obtaining an unbiased sample from a CDW for rare-class problems may be very costly. This motivates the development of the proposed diametric topic mixture model (DTMM). In this model, documents are produced by a set of correlated topics, and each document is represented by a vector of probabilities that controls the extent to which each topic influences that document. These probability vectors are produced by sampling from a two-component mixture model, with one component associated with each class. To maximize the discriminative power of document classification, the two mixture components are "diametrically opposed" to one another. Although the proposed model has been designed for a specific problem that arises in biomedical information retrieval, it is suitable for general-purpose binary document classification.
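
The generative process described above can be illustrated with a minimal sketch. The abstract does not specify the distributional form, so everything concrete here is an assumption made for illustration: the two mixture components are taken to be logistic-normal distributions with a shared covariance (to model correlated topics) and mirror-image means `+mu` and `-mu` (one possible reading of "diametrically opposed"). The function name `sample_documents` and all parameter values are hypothetical and are not taken from the DTMM itself.

```python
import numpy as np


def softmax(x):
    """Map real-valued rows onto the probability simplex."""
    z = x - x.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def sample_documents(n_docs, doc_len, mu, cov, topic_word,
                     class_prior=0.5, seed=0):
    """Sample labelled bag-of-words documents from a two-component mixture.

    ASSUMPTION: class 1 draws topic logits from N(+mu, cov) and class 0
    from N(-mu, cov); the abstract does not state this form.
    """
    rng = np.random.default_rng(seed)
    # Choose the mixture component (class) for every document.
    labels = (rng.random(n_docs) < class_prior).astype(int)
    signs = np.where(labels == 1, 1.0, -1.0)
    # Correlated Gaussian noise, shifted by +mu or -mu depending on class.
    noise = rng.multivariate_normal(np.zeros(len(mu)), cov, size=n_docs)
    theta = softmax(noise + signs[:, None] * mu)  # per-document topic weights
    docs = np.zeros((n_docs, topic_word.shape[1]), dtype=int)
    for d in range(n_docs):
        word_probs = theta[d] @ topic_word  # mix topics into one word distribution
        word_probs = word_probs / word_probs.sum()  # guard against rounding drift
        docs[d] = rng.multinomial(doc_len, word_probs)
    return labels, theta, docs


# Hypothetical setup: 3 topics over a 20-word vocabulary.
rng = np.random.default_rng(1)
topic_word = rng.dirichlet(np.ones(20), size=3)  # each row is a topic
mu = np.array([2.0, -2.0, 0.0])
cov = np.array([[1.0, 0.3, 0.0],
                [0.3, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
labels, theta, docs = sample_documents(200, 50, mu, cov, topic_word)
```

Under these assumptions, documents of one class lean toward the first topic and documents of the other class lean toward the second, which is the kind of class separation in topic space that the abstract attributes to the opposed components.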