Recently there has been significant activity in developing algorithms with provable guarantees for topic modeling [Arora et al.'12, Bansal et al.'14, Anandkumar et al.'14]. In standard topic models, a topic (such as sports, business, or politics) is viewed as a probability distribution \vec a_i over words, and a document is assumed to be generated by first selecting a mixture \vec w over topics and then generating words i.i.d. from the word distribution \vec w^T A, where A is the matrix whose rows are the topic vectors \vec a_i. Given a large collection of such documents, the goal is to recover the topic vectors and then to correctly classify new documents according to their topic mixture. In this work we consider a generalization of this framework in which words are no longer assumed to be drawn i.i.d.; instead, a topic is a complex distribution over sequences of paragraphs. Since one cannot hope even to represent such a distribution in general (even if paragraphs are given in some natural feature representation), we aim instead to learn a document classifier directly. That is, we efficiently learn a predictor that, given a new document, accurately predicts its topic mixture, without learning the distributions explicitly. More generally, our model can be viewed as a generalization of the multi-view or co-training setting in machine learning. We use tools from learning theory, perturbation theory, and matrix concentration to achieve our results. Based on joint work with Avrim Blum.
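For concreteness, here is a minimal Python sketch of the standard generative process the abstract describes, i.e. the baseline model being generalized (the talk's model drops the i.i.d. word draws, so this is not the generalized model itself). All concrete parameters here (the Dirichlet priors, vocabulary size, document length) are illustrative assumptions, not details from the talk.

```python
# Minimal sketch of the standard topic-model generative process:
# draw a topic mixture w, then draw words i.i.d. from w^T A.
# All parameter values below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
num_topics, vocab_size, doc_length = 3, 1000, 200

# A: each row a_i is a topic, i.e. a probability distribution over words.
A = rng.dirichlet(np.ones(vocab_size), size=num_topics)

def generate_document(A, doc_length, alpha=0.5):
    """Generate one document: sample a topic mixture w,
    then sample doc_length words i.i.d. from w^T A."""
    num_topics, vocab_size = A.shape
    w = rng.dirichlet(alpha * np.ones(num_topics))  # per-document topic mixture
    word_dist = w @ A                               # the mixed word distribution w^T A
    words = rng.choice(vocab_size, size=doc_length, p=word_dist)
    return w, words

w, doc = generate_document(A, doc_length)  # w is what a classifier should recover
```

In the generalized setting of the talk, the i.i.d. `rng.choice` step would be replaced by a draw from an arbitrary topic-specific distribution over paragraph sequences, which is why the work targets learning the predictor of \vec w directly rather than recovering the distributions themselves.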
Nika Haghtalab on "Algorithms for Generalized Topic Modeling" (YouTube)
People & Blogs YouTube Video Editor View attributions | Upload TimePublished on 28 Oct 2016 |