How to use clustering to improve a language model

2020-04-23 nlp study group PV:

前情提要:

在做 speech recognizer 時
President Kennedy 和 precedent Kennedy 的發音很像
該如何讓speech recognizer 知道答案為President Kennedy?

Notation	Meaning	e.g.
$H$	hypothesis	President Kennedy precedent Kennedy
$D$	speech signal
$P(H)$	language model

用 noisy channal model
使找到的解答滿足:

$\hat H =
\underset{H}{\arg\max} P(H|D) =
\underset{H}{\arg\max} \frac{P(D|H)P(H)}{P(D)} =
\underset{H}{\arg\max} P(D|H)P(H)$

要如何優化 $P(H)$ , 使之更 generalization 便是今天要探討的問題

Notation	Meaning	Remark
L	corpus	L = $w_1,…w_N$
$\pi$	function	assign word to cluster

find function $\pi$
s.t minimize $H(L, \pi)$
(白話翻譯: 找到一個分類函數使之最小化對下個字預測的不確定性)

$H(L, \pi) = -\frac{1}{N}logP(w_{1,…N})$
經過課本 510-511頁一連串的推導後可得
$H(L, \pi) \approx H(W)-I(c_1; c_2)$

Remark: $I(c_1; c_2)$: mutual information between adjacent clusters

因此最大化 $I(c_1; c_2)$ 即可使 $H(L, \pi) 最小$

用bottom-up algorithm來分群，
用mutual information loss 當作loss

MI-loss($c_i$, $c_j$) = $\sum_{c_k \in C\setminus{c_i, c_j}} I(c_k;c_i) + I(c_k;c_j) - I(c_k;c_i\cup c_j)$

每次迭代將loss最小的兩群做合併
持續迭代到剩下k群為止(k為給定的參數課本是給1000)

Remark: 此演算法不能保證找到最佳解

課本給出的結果如下:

model	perplxity
cluster-based language model	277
word-based model	244
linear interpolation between the word-based and the cluster based model	236