Abstract
Distributed word embeddings have proven remarkably effective at capturing word-level semantic and syntactic regularities in language for many natural language processing tasks. One recently proposed semi-supervised representation learning method, Predictive Text Embedding (PTE), uses both semantically labeled and unlabeled data in information networks to learn text embeddings that achieve state-of-the-art performance compared to other embedding methods. However, PTE relies on supervised label information to construct one of its networks, and many other possible ways of constructing such information networks remain untested. We present two unsupervised methods for constructing a large-scale semantic information network from documents using topic models, which have emerged as a powerful technique for finding useful structure in unstructured text collections by learning distributions over words. The first method uses Latent Dirichlet Allocation (LDA) to build a topic model over the text and constructs a word-topic network with edge weights proportional to the word-topic probability distributions. The second method trains an unsupervised neural network to learn the word-document distribution, with a single hidden layer representing a topic distribution. The two weight matrices of the neural network are directly reinterpreted as the edge weights of heterogeneous text networks that can be used to train word embeddings, producing an effective low-dimensional representation that preserves the semantic closeness of words and documents for NLP tasks. We conduct extensive experiments to evaluate the effectiveness of our methods.
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 18th IEEE International Conference on Machine Learning and Applications, ICMLA 2019 |
| Editors | M. Arif Wani, Taghi M. Khoshgoftaar, Dingding Wang, Huanjing Wang, Naeem Seliya |
| Place of Publication | USA |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 155-161 |
| Number of pages | 7 |
| ISBN (Electronic) | 9781728145495 |
| State | Published - Dec 1 2019 |
| Event | 18th IEEE International Conference on Machine Learning and Applications, ICMLA 2019, Boca Raton, United States (Dec 16 2019 → Dec 19 2019) |
Keywords
- Document Clustering
- Document Topic Information Network
- LDA
- Learning Word Embeddings
- Topic Model
- Unsupervised Text Analysis