Supervised by: Ministry of Culture of PRC

Sponsored by: National Library of China
  Library Society of China

ISSN 1001-8867    CN 11-2746/G2

Filtering and Classifying Relevant Short Text with a Few Seed Words

Abstract: Filtering out irrelevant documents and classifying the relevant ones into topical categories is a de facto task in many applications. However, supervised learning solutions require considerable human effort for document labeling. In this paper, we propose a novel seed-guided topic model for dataless short text classification and filtering, named SSCF. Without using any labeled documents, SSCF takes a few “seed words” for each category of interest and conducts short text filtering and classification in a weakly supervised manner. To overcome the issues of data sparsity and imbalance, the short text collection is mapped to a collection of pseudo-documents, one for each word. SSCF infers two kinds of topics on the pseudo-documents: category-topics and general topics. Each category-topic is associated with one category of interest and covers the meaning of that category. In SSCF, we devise a novel word relevance estimation process, based on the seed words, for hidden topic inference. The dominating topic of a short text is identified through post-inference and then used for filtering and classification. On two real-world datasets in two languages, experimental results show that the proposed SSCF consistently achieves better classification accuracy than state-of-the-art baselines. We also observe that SSCF can even outperform the supervised classifiers supervised latent Dirichlet allocation (sLDA) and support vector machine (SVM) on some testing tasks.

Keywords: dataless text classification, short text, topic modeling, seed word, pseudo-document
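
The mapping from short texts to pseudo-documents mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract states only that there is one pseudo-document per word; the construction rule used here (a word's pseudo-document collects the other tokens that co-occur with it in the same short text) is an assumption made for illustration, and the function name build_pseudo_documents and the toy data are hypothetical rather than taken from the paper.

```python
from collections import defaultdict


def build_pseudo_documents(short_texts):
    """Map tokenized short texts to one pseudo-document per word.

    Assumption (not stated in the abstract): the pseudo-document of a
    word collects every other token that co-occurs with that word in
    the same short text.
    """
    pseudo_docs = defaultdict(list)
    for tokens in short_texts:
        for i, word in enumerate(tokens):
            # Append all tokens of this short text except position i.
            pseudo_docs[word].extend(tokens[:i] + tokens[i + 1:])
    return dict(pseudo_docs)


# Toy collection of already tokenized short texts (hypothetical data).
short_texts = [
    ["stock", "market", "falls"],
    ["market", "rally", "continues"],
    ["team", "wins", "final"],
]

pseudo_docs = build_pseudo_documents(short_texts)
print(pseudo_docs["market"])
```

Under this assumption, a word that appears in several short texts accumulates context from all of them, so its pseudo-document is longer than any single short text, which is one plausible way such a mapping could ease data sparsity.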

