Introduction

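Abstract

Topic modeling is used to discover the latent semantic structure, i.e. the topics, in a large collection of documents. The most widely used methods are LDA (Latent Dirichlet Allocation) and PLSA (Probabilistic Latent Semantic Analysis). Despite their popularity, these methods have several weaknesses. To achieve optimal results they generally require the number of topics to be specified, along with custom stop-word lists, stemming, and lemmatization. In addition, they rely on a bag-of-words representation, which ignores the ordering and semantics of words. Distributed representations of documents and words have gained popularity because of their ability to capture the semantics of words and documents. We propose top2vec, which leverages a joint semantic embedding of documents and words to find topic vectors. This model does not require stop-word lists, stemming, or lemmatization, and it automatically finds the number of topics. The resulting topic vectors are embedded jointly with the document and word vectors, and the distance between them represents semantic similarity. Our experiments demonstrate that top2vec finds topics that are significantly more informative and more representative of the training corpus than those found by probabilistic generative models.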

Previous Methods

LSA

Overview

"Latent Semantic Analysis"

Apply singular value decomposition (SVD), a matrix factorization, to a word-document matrix or a window-based co-occurrence matrix to obtain a lower-dimensional semantic matrix that captures latent semantic meaning[6].
The lower dimensionality also makes subsequent computation more efficient.

Details

Apply SVD to a BoW matrix $A$

$$ A = U \Sigma V^\top $$
- $A$: an $m \times n$ matrix ($m$: # terms, $n$: # documents, $r$: rank of $A$, i.e. the number of nonzero singular values)
- Usually, $m \gg n$ (even a single document contains many distinct words)
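
The factorization above can be sketched on a toy example. This is a minimal illustration with NumPy, not a production pipeline: the three-sentence corpus, the whitespace tokenizer, and all variable names are assumptions made for the example.

```python
import numpy as np

# Hypothetical toy corpus; tokens are split on whitespace for simplicity.
docs = [
    "cat sat on the mat",
    "dog sat on the log",
    "cat chased the dog",
]

# Vocabulary: sorted unique terms, so row i of A corresponds to vocab[i].
vocab = sorted({tok for doc in docs for tok in doc.split()})
term_index = {term: i for i, term in enumerate(vocab)}

# A[i, j] = number of times term i occurs in document j (m terms x n documents).
A = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for tok in doc.split():
        A[term_index[tok], j] += 1

# Thin SVD: U is m x p, s holds the singular values (descending), Vt is p x n,
# where p = min(m, n).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The three factors reproduce A up to floating-point error.
A_rebuilt = U @ np.diag(s) @ Vt
print(A.shape, np.allclose(A, A_rebuilt))
```

Note that `np.linalg.svd` returns $V^\top$ directly (here `Vt`) and the singular values as a 1-D array rather than as the diagonal matrix $\Sigma$.
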
Keep only the $k$ largest singular values of $\Sigma$, giving $\Sigma_k$ (truncated SVD)

$$ A_k = \color{blue} U_k \Sigma_k V_k^\top \color{defaultcolor} $$
- $A_k$: still similar to the original matrix $A$, but with less information!
- $U_k$: an $m \times k$ matrix, $\Sigma_k$: a $k \times k$ matrix, $V_k^\top$: a $k \times n$ matrix ($k < r$: truncated)
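
As a rough sketch of the truncation step, the snippet below keeps the top $k$ singular triplets of a small random matrix and measures how much is lost. The matrix, the seed, and the choice $k = 2$ are arbitrary assumptions for illustration. The check at the end uses a standard property of the SVD: the Frobenius norm of $A - A_k$ equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np

# Hypothetical count matrix: 6 terms (rows) x 4 documents (columns).
rng = np.random.default_rng(0)
A = rng.integers(0, 4, size=(6, 4)).astype(float)

# Singular values come back sorted in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values and the matching singular vectors.
k = 2
U_k = U[:, :k]          # m x k
S_k = np.diag(s[:k])    # k x k
Vt_k = Vt[:k, :]        # k x n
A_k = U_k @ S_k @ Vt_k  # m x n, rank at most k

# Information lost by truncation, measured in the Frobenius norm.
err = np.linalg.norm(A - A_k, ord="fro")
discarded = np.sqrt(np.sum(s[k:] ** 2))
print(err, discarded)
```

The two printed numbers should agree up to floating-point error, which is one way to make "similar to $A$, but with less information" concrete.
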
Calculate document and word representations

$$ U_k^\top A_k = U_k^\top \color{blue}U_k \Sigma_k V_k^\top \color{defaultcolor}= I \Sigma_k V_k^\top = \Sigma_k V_k^\top = X_1 \\ A_k V_k = \color{blue}U_k \Sigma_k V_k^\top \color{defaultcolor}V_k = U_k \Sigma_k I = U_k \Sigma_k = X_2 $$
- Here, $X_1$ is a $k \times n$ matrix → each column is a document representation, a vector of length $k$.
- $X_2$ is an $m \times k$ matrix → each row is a word representation, a vector of length $k$.
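
The two identities above can be checked numerically, and the resulting document vectors compared with cosine similarity. The sketch below assumes a hand-made $5 \times 4$ count matrix in which documents 0 and 1 share most of their terms while document 2 uses different ones. The matrix, $k = 2$, and the helper `cosine` are illustrative assumptions.

```python
import numpy as np

# Hypothetical term-document counts: 5 terms (rows) x 4 documents (columns).
# Documents 0 and 1 mostly use terms 0-1; documents 2 and 3 mostly use terms 2-3.
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
    [1, 1, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]
A_k = U_k @ S_k @ Vt_k

X1 = S_k @ Vt_k  # k x n: column j is the representation of document j
X2 = U_k @ S_k   # m x k: row i is the representation of term i

# Both identities hold because U_k and V_k have orthonormal columns.
print(np.allclose(U_k.T @ A_k, X1), np.allclose(A_k @ Vt_k.T, X2))

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_same = cosine(X1[:, 0], X1[:, 1])  # documents with overlapping terms
sim_diff = cosine(X1[:, 0], X1[:, 2])  # documents with mostly different terms
print(sim_same, sim_diff)
```

In this toy case the documents that share terms end up much closer in the $k$-dimensional space than the ones that do not, which is the behavior the document-similarity use case relies on.
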

Pros & Cons

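Deerwester et al. (1990) and Landauer and Dumais (1997) report that applying this technique effectively preserves the latent (hidden) meaning between words and their contexts, which in turn can improve model performance on tasks such as measuring document similarity.
Rapp (2003) attributes its effectiveness to removing noise from the input data, and Vozalis and Margaritis (2003) to reducing the sparsity of the input data[6].
When the input data has size $m \times n$ and each document contains $c$ words on average, the computational complexity of LSA is only $O(mnc)$. However, whenever a new document or word is added, the whole computation has to be redone from scratch.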