Unsupervised Topic Model Based Text Network Construction for Learning Word Embeddings

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Unsupervised word embeddings have proven remarkably effective at capturing word-level semantic and syntactic regularities in a language for many natural language processing (NLP) tasks. However, the performance of these text embedding models usually falls short on individual NLP classification tasks because they cannot learn the additional semantic information available for a specific task. One recently proposed semi-supervised representation learning method, Predictive Text Embedding (PTE), uses both semantically labeled and unlabeled data in information networks to learn text embeddings, and it produces state-of-the-art performance compared to other embedding methods. However, PTE uses supervised label information to construct one of the networks, and many other possible ways of constructing such information networks are left untested. We present two unsupervised methods for constructing a large-scale heterogeneous information network using topic models, which have emerged as a powerful technique for finding useful structure in an unstructured text collection because they learn distributions over words. The first method uses Latent Dirichlet Allocation (LDA) to build a topic model over text, and constructs a word-topic network with edge weights proportional to the word-topic probability distributions. The second method trains a shallow unsupervised neural net to learn the word-document distribution, with a single hidden layer representing a topic distribution. The two weight matrices of the neural net are directly reinterpreted as the edge weights of heterogeneous text networks, which can be used to train word embeddings that form an effective low-dimensional representation preserving the semantic closeness of words and documents for NLP tasks. We conduct extensive experiments to evaluate the effectiveness of our methods.
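The first method's network construction can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_word_topic_edges`, the `min_weight` pruning threshold, and the toy topic-word matrix are all assumptions made for illustration. The matrix stands in for the output of a fitted LDA model, and each edge weight is set to the normalized probability of the word under the topic.

```python
from typing import List, Sequence, Tuple

# One edge of the bipartite word-topic network: (word, topic node, weight).
Edge = Tuple[str, str, float]


def build_word_topic_edges(
    topic_word: Sequence[Sequence[float]],
    vocab: Sequence[str],
    min_weight: float = 0.0,
) -> List[Edge]:
    """Turn a topic-word matrix into weighted word-topic edges.

    topic_word[k][i] is the (possibly unnormalized) mass of word i under
    topic k. min_weight is a hypothetical pruning threshold, not something
    taken from the paper.
    """
    edges: List[Edge] = []
    for k, row in enumerate(topic_word):
        total = sum(row)
        if total <= 0:
            continue  # skip empty topics rather than divide by zero
        for word, mass in zip(vocab, row):
            p = mass / total  # p(word | topic k)
            if p > min_weight:
                edges.append((word, f"topic_{k}", p))
    return edges


# Toy stand-in for a fitted LDA topic-word matrix (assumed values).
vocab = ["goal", "match", "vote", "party"]
topic_word = [
    [6.0, 3.0, 0.5, 0.5],  # a sports-like topic
    [0.5, 0.5, 5.0, 4.0],  # a politics-like topic
]
edges = build_word_topic_edges(topic_word, vocab, min_weight=0.1)
for word, topic, weight in edges:
    print(f"{word} -- {topic}: {weight:.2f}")
```

In a full pipeline, the resulting edge list would be combined with the other text networks (for example word-word and word-document) before training embeddings. How the abstract's method weights or prunes edges beyond proportionality is not stated, so the threshold here is only a convenience.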
Original language: English
Title of host publication: Unknown book
State: Published - 2018
Event: 41st European Conference on Information Retrieval
Duration: Jan 1 2018 → …

Conference

Conference: 41st European Conference on Information Retrieval
Period: 01/1/18 → …
