TSVM (Transductive Inference for Text Classification using Support Vector Machines)

TSVM (Transductive Inference for Text Classification using Support Vector Machines)

tags: SVM, semi-supervised learning, SSL

paper link
cited:3517
參考書籍:Learning to Classify Text Using Support Vector Machines

Particularly, S3VMs have been widely applied to many tasks [10], and their representative algorithm, TSVM [23], has won the Ten-Year Best Paper Award for machine learning in 2009. (from Towards Making Unlabeled Data Never Hurt)

簡介

這篇論文的方法主要是針對“text classification”這個task,從這個task研發出semi-supervised learning演算法 – TSVM。 TSVM是一個SVM based的semi-supervised 方法。

  • 附註:What is text classification?
    text classification的目標是要讓model能夠自動將文檔分類。各文檔的類別有可能是multiple, exactly one, or no category. 因此,可以分別將每個類別視作二元分類的問題(屬於此類別or不屬於此類別)。

    To facilitate effective and efficient learning, each category is treated as a separate binary classifcation problem. Each such problem answers the question of whether or not a document should be assigned to a particular category.

TSVM (Transductive Support Vector Machine)

論文提出的方法如下:

為什麼TSVM適用text classification?

  • 附註Text classification的特點

  1. High dimensional input space:文本裡的字都是features.
  2. Document vectors are sparse
  3. Few irrelevant features

    Arguments from [Joachims, 1998] show that SVMs are especially well-suited for this setting, outperforming conventional methods substantially while also being more robust. Dumais et al. [Dumais et al., 1998] come to similar conclusio

  • 為什麼TSVM適用text classification?
  1. TSVM大致上保留SVM的特性
  2. text的資料會有許多co-occurrence的資訊。因為TSVM是transductive,因此也同時考慮了test data裡的co-occurrence資訊(inductive learning沒有辦法去考慮test data的資訊)。

Algorithm for TSVM

上面的方法若遇到大量unlabeled data,則需要非常大的運算量。而通常semi-supervised learning的問題,unlabeled data通常滿大的(text classification通常有大量unlabeled data)。因此,作者提出以下加速的最佳化問題,來近似OP2。

實際演算法如下:

主要的想法:

  1. 先用labeled data去train一個inductive SVM,對unlabeled data標註。
  2. 再透過替換掉一些slack variable比較大的label,去降低objective function。

    The key idea of the algorithm is that it begins with a labeling of the test data based on the classification of an inductive SVM. Then it improves the solution by switching the labels of test examples so that the objective function

附註:

  • Loop 1:透過不斷的提高 $C_-^$ 和 $C_+^$,慢慢提升unlabeled data對模型的影響力。
  • Loop 2:過替換掉slack variable值較大的unlabeled data的label,去降低objective function的值。

論文中也提出定理來說Algorithm TSVM會在有限步收斂

Theorem 2 Algorithm 1 converges in a fnite number of steps.


Experiment result

evaluation: Precision/Recall-Breakeven Point (text classification常用的measure)

The P/R-breakeven point is defined as that value
for which precision and recall are equal.

註:

Precision: probability that a document predicted to be in class
“+” truly belongs to this class.
Recall: probability that a document belonging to class “+” is classified into this class




Reference

https://github.com/muratayoshio/tsvm

Powered by Hexo and Hexo-theme-hiker

Copyright © 2020 - 2021 DSMI Lab's website All Rights Reserved.

UV : | PV :