How to use clustering to improve a language model

前情提要:

question:

在做 speech recognizer 時
President Kennedy 和 precedent Kennedy 的發音很像
該如何讓speech recognizer 知道答案為President Kennedy?

answer

Notation Meaning e.g.
$H$ hypothesis President Kennedy
precedent Kennedy
$D$ speech signal
$P(H)$ language model

用 noisy channal model
使找到的解答滿足:

$\hat H =
\underset{H}{\arg\max} P(H|D) =
\underset{H}{\arg\max} \frac{P(D|H)P(H)}{P(D)} =
\underset{H}{\arg\max} P(D|H)P(H)$

要如何優化 $P(H)$ , 使之更 generalization 便是今天要探討的問題

using clustering improving a language model

模型假設

  • bigram model
  • Markov assumption
Notation Meaning Remark
L corpus L = $w_1,…w_N$
$\pi$ function assign word to cluster

目標

find function $\pi$
s.t minimize $H(L, \pi)$
(白話翻譯: 找到一個分類函數 使之最小化 對下個字預測的不確定性)

Question: 要如何最小化$H(L, \pi)$?

$H(L, \pi) = -\frac{1}{N}logP(w_{1,…N})$
經過 課本 510-511頁一連串的推導後可得
$H(L, \pi) \approx H(W)-I(c_1; c_2)$

Remark: $I(c_1; c_2)$: mutual information between adjacent clusters

因此 最大化 $I(c_1; c_2)$ 即可使 $H(L, \pi) 最小$

Question: 要如何分群使$I(c_1; c_2)$的值最大?

用bottom-up algorithm來分群,
用mutual information loss 當作loss

MI-loss($c_i$, $c_j$) = $\sum_{c_k \in C\setminus{c_i, c_j}} I(c_k;c_i) + I(c_k;c_j) - I(c_k;c_i\cup c_j)$

每次迭代將loss最小的兩群做合併
持續迭代到剩下k群為止(k為給定的參數 課本是給1000)

Remark: 此演算法不能保證找到最佳解

Is cluster-based language model work?

課本給出的結果如下:

model perplxity
cluster-based language model 277
word-based model 244
linear interpolation between the word-based and the cluster based model 236

Powered by Hexo and Hexo-theme-hiker

Copyright © 2020 - 2021 DSMI Lab's website All Rights Reserved.

UV : | PV :