Automatic Generation of Personalized Annotation Tags for Twitter Users

What’s the problem in this paper?

The paper aims to tag each Twitter user's personal interests by extracting keywords from that user's messages.

How did they solve this problem?

Identifying an individual user’s interests and concerns can help potential commercial applications.
For instance, this information can be employed to produce “following” suggestions, either a person who shares similar interests (for expanding their social network) or a company providing products or services the user is interested in (for personalized advertisement).

Dataset in this paper

  • Training: messages from 11,376 Twitter users, with 180 to 200 messages per user.
  • Testing: 156 randomly selected Twitter users, used to evaluate the top-N precision of TFIDF ranking and TextRank.

techniques in this paper

flow chart:
The collected tweets go through the following preprocessing steps:

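  1. remove reply messages: replies mostly address the other person's views rather than express the user's own
  2. remove emoticons: they contribute little to keyword analysis
  3. Substituting / removing internet slang and abbreviations: these terms fall into three categories
    a. Meaningful terms that can be restored to their standard form: bff (best friend forever), fone (phone)
    b. Grammatical abbreviations: im (i'm), abt (about). Deleting them would affect the POS tagging results, so they are substituted
    c. Terms used as sentence breaks, usually without real meaning: lol (laugh out loud), clm (cool like me). These are removed in the paper's method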
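
The three preprocessing steps above can be sketched in Python as follows. This is only a minimal illustration under assumptions: the lookup tables `SUBSTITUTE` and `REMOVE` hold just the examples mentioned in these notes (the paper's full slang lists are not given here), the `EMOTICON` pattern is a rough guess at common Western emoticons, and a message is treated as a reply when it starts with `@`.

```python
import re

# Hypothetical lookup tables built only from the examples in these notes.
SUBSTITUTE = {"bff": "best friend forever", "fone": "phone",
              "im": "i'm", "abt": "about"}
REMOVE = {"lol", "clm"}

# Rough emoticon pattern (assumption): eyes, optional nose, mouth.
EMOTICON = re.compile(r"[:;=8][\-o\*']?[\)\]\(\[dDpP/\\|]")


def preprocess(messages):
    """Drop replies, strip emoticons, and normalize slang in a list of tweets."""
    cleaned = []
    for msg in messages:
        # Assumption: a message that opens with @user is a reply.
        if msg.lstrip().startswith("@"):
            continue
        msg = EMOTICON.sub(" ", msg)
        tokens = []
        for tok in msg.lower().split():
            word = tok.strip(".,!?")
            if word in REMOVE:
                continue  # filler terms are deleted
            tokens.append(SUBSTITUTE.get(word, word))  # slang is substituted
        cleaned.append(" ".join(t for t in tokens if t))
    return cleaned
```

For example, `preprocess(["@bob thanks!", "lost my fone :) lol"])` keeps only the second message and rewrites it as `"lost my phone"`.
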
  1. Part-of-Speech tagging and filtering

  2. Stemming and stopword removal: use the Porter stemmer to strip word suffixes, then remove stop words

  3. TFIDF ranking: all messages from a user are merged into one document.

    • $n_{i,u}:$ the count of word i in user u’s messages
    • $U_i:$ the number of users whose messages contain word i
    • $U:$ the total number of users in the Twitter corpus
  4. TextRank: build a TextRank graph with undirected edges for each Twitter user

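    • weight of edge: the number of messages in which the two words co-occur
    • rank update (the standard weighted TextRank form): $WS(V_{i}) = (1-d) + d \sum_{V_{j} \in E(V_{i})} \frac{w_{ji}}{\sum_{V_{k} \in E(V_{j})} w_{jk}} WS(V_{j})$, where $w_{ji}$ is the weight of the edge that links $V_{j}$ and $V_{i}$, and $E(V_{i})$ is the set of vertices to which $V_{i}$ is connected. $d$ (a damping factor) is set to 0.85, following Google’s PageRank algorithm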
    • The rank update iteration continues until convergence.
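
As a rough illustration of the two ranking steps, the Python sketch below scores words with TFIDF across users and runs a weighted TextRank iteration for a single user. It is a sketch under assumptions, not the paper's implementation: the TFIDF form $n_{i,u} \cdot \log(U / U_i)$ is one common variant consistent with the symbols listed above (the paper may normalize differently), the names `tfidf_scores` and `textrank` and the parameters `tol` and `max_iter` are hypothetical, and the input is assumed to be already tokenized, filtered, and stemmed.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def tfidf_scores(user_tokens):
    """user_tokens: {user: [token, ...]}, all of a user's messages merged."""
    total_users = len(user_tokens)            # U
    users_with = Counter()                    # U_i for every word i
    for tokens in user_tokens.values():
        users_with.update(set(tokens))
    scores = {}
    for user, tokens in user_tokens.items():
        counts = Counter(tokens)              # n_{i,u}
        scores[user] = {w: n * math.log(total_users / users_with[w])
                        for w, n in counts.items()}
    return scores


def textrank(messages, d=0.85, tol=1e-6, max_iter=200):
    """messages: list of token lists for ONE user. Returns {word: rank}."""
    # Undirected edge weight = number of messages where both words occur.
    weight = defaultdict(lambda: defaultdict(float))
    for tokens in messages:
        for a, b in combinations(sorted(set(tokens)), 2):
            weight[a][b] += 1.0
            weight[b][a] += 1.0
    if not weight:
        return {}
    rank = {v: 1.0 for v in weight}
    out_sum = {v: sum(weight[v].values()) for v in weight}
    for _ in range(max_iter):
        new = {v: (1 - d) + d * sum(weight[j][v] / out_sum[j] * rank[j]
                                    for j in weight[v])
               for v in weight}
        converged = max(abs(new[v] - rank[v]) for v in weight) < tol
        rank = new
        if converged:   # stop once no rank moves by more than tol
            break
    return rank
```

Sorting either result by score in descending order and keeping the first N words gives the top-N tags for a user. Words that never co-occur with another word do not enter the TextRank graph in this sketch.
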

evaluation

After we obtained the top-N outputs from the system, three human evaluators were asked to judge whether the output tags from the two systems (unidentified) reflected the corresponding Twitter user’s interests or concerns according to the full set of his/her messages.

result

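  • top-N means that each of the top N selected tags is judged by the three evaluators on whether it matches the content of the user's messages. As one would expect, precision drops as N grows.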
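
To make the top-N precision measure concrete, here is a small Python sketch. How the three evaluators' judgments are combined is not stated in these notes, so the majority vote below is an assumption, as are the function name `top_n_precision` and the input layout.

```python
def top_n_precision(judgments, n):
    """judgments: {user: [[vote, vote, vote], ...]}, one vote list per tag,
    ordered from the highest-ranked tag down. Each vote is True or False."""
    relevant = 0
    total = 0
    for votes_per_tag in judgments.values():
        for votes in votes_per_tag[:n]:
            # Assumption: a tag counts as correct on a majority of votes.
            if sum(votes) * 2 > len(votes):
                relevant += 1
            total += 1
    return relevant / total if total else 0.0
```

With one user whose first two tags are accepted by a majority and whose third tag is not, the precision is 1.0 at N = 1 and N = 2 and falls to 2/3 at N = 3.
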

Although most Twitter users express their interests to some extent in their messages, there are some users whose message content is not rich enough to extract reliable information. We investigated two measures for identifying such users:

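  • standard deviation of the top-10 TextRank values: the standard deviation of the rank values computed in a user's TextRank graph. A higher value means the ranks differ widely between words, so the extracted tags reflect the user's preferences better. A lower value means the ranks differ little between words, so the extracted tags are less discriminative.
  • text entropy: measures the richness of a user's messages. A higher value means richer message content, and the richer the content, the easier it is to find representative tags.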
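
Both measures can be sketched in a few lines of Python. The exact definitions used in the paper are not given in these notes, so this sketch assumes the population standard deviation over the ten highest rank values and Shannon entropy (base 2) over the user's word frequencies. The function names are hypothetical.

```python
import math
from collections import Counter
from statistics import pstdev


def top_k_rank_std(rank, k=10):
    """rank: {word: TextRank value}. Spread of the k highest values."""
    top = sorted(rank.values(), reverse=True)[:k]
    return pstdev(top) if top else 0.0


def text_entropy(tokens):
    """Shannon entropy (in bits) of the word distribution in tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())
```

A user who repeats a single word has entropy 0, while four equally frequent words give 2 bits. Users with low values on either measure could be flagged as having too little content for reliable tags.
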

new idea

To our knowledge, no previously published research has yet addressed problems on tagging user’s personal interests from Twitter messages via keyword extraction, though several studies have looked at keyword extraction using other genres.
Previous work on both TFIDF ranking and TextRank has been done mainly on text in a formal language style, such as academic papers, spoken documents or web pages, rather than on Twitter messages.

  • This is the first study on how to find important tags from a user's own messages

Reference

  1. Stanford Log-linear Part-Of-Speech Tagger
  2. The Porter Stemming Algorithm
  3. Google’s PageRank Algorithm: A Diagram of the Cognitive Capitalism and the Rentier of the Common Intellect
    • PageRank: the technique used by the early Google search engine

problem?

