arXiv: v1 [cs.CL] 25 Oct 2017


Linking Tweets with Monolingual and Cross-Lingual News using Transformed Word Embeddings

Aditya Mogadala 1, Dominik Jung 2 and Achim Rettinger 1

1 Institute AIFB, Karlsruhe Institute of Technology, Germany, aditya.mogadala@kit.edu, rettinger@kit.edu
2 Institute IISM, Karlsruhe Institute of Technology, Germany, dominik.jung2@kit.edu

Abstract. Social media platforms have grown into an important medium to spread information about an event published by the traditional media, such as news articles. Grouping such diverse sources of information that discuss the same topic from varied perspectives provides new insights. But the gap in word usage between informal social media content such as tweets and diligently written content (e.g. news articles) makes such assembling difficult. In this paper, we propose a transformation framework to bridge the word usage gap between tweets and online news articles across languages by leveraging their word embeddings. Using our framework, word embeddings extracted from tweets and news articles are aligned closer to each other across languages, thus facilitating the identification of similarity between news articles and tweets. Experimental results show a notable improvement over baselines for monolingual tweet and news article comparison, while new findings are reported for cross-lingual comparison.

1 Introduction

On the web, the growth of social media platforms has offered numerous opportunities along with several challenges to solve. Twitter is one such social media platform that allows its users to share text messages of 140 characters (popularly known as tweets) in multiple languages with their friends or followers. Tweets may contain personal information or a confined description of an event motivated by the traditional media such as online news articles. Studies [1] have shown that 85% of the tweets are news affiliated. Though some tweets acknowledge news articles by explicitly linking them, most of them do not.
This implicit linking of tweets with news topics provides novel insights. For example, most of the traditional media companies that publish online news write only facts about an event. Identifying relevant tweets for the corresponding news, however, adds people's opinions. Furthermore, attaching tweets to news articles gives a multi-dimensional view of controversial topics, thus empowering the editor of an article to modify upcoming or follow-up articles based on veracity.

However, the differences in word usage between informal tweets and attentively drafted writings such as news articles make this linking challenging. Nevertheless, different approaches have been pursued to solve the problem. Initially, monolingual comparison of tweets with news articles was achieved by comprehending the commonality between topics using unsupervised topic models [2]. Although scalable, this approach fails to capture the importance of words and their differences across corpora. A graph-based latent variable model [3] was further introduced for finding short text correlations using microblog hashtags and named entities in news articles. Even though it addresses the earlier drawbacks by giving importance to keywords such as named entities in news articles, it still ignores a large chunk of the remaining vocabulary. Krestel et al. [4] followed a different path by posing the comparison of tweets with news as a relevance assessment problem and designed a supervised binary classifier with many hand-crafted features. Although the approach is supervised, the hand-crafted features limit its scalability.

Further, the aforementioned approaches ignore the multilingual aspect of published news. Nowadays most online news about any event is multilingual. Identifying a news article in a single language is not enough to cover the collective views about an event. In this paper, we overcome the limitations of earlier approaches and propose a new scalable framework to support the comparison of tweets with monolingual and cross-lingual news articles. Our framework leverages monolingual [5] and bilingual [6] word embeddings acquired from tweets and news articles as basic units for bridging the word usage gap across these collections. Furthermore, a non-linear transformation of tweet word embeddings is performed to bring them closer to the news article word embeddings using manifold alignment with Procrustes analysis [7]. Work closely related to our approach is by Tan et al.
[8], who perform a lexical comparison of words observed in tweets and Wikipedia belonging to the same language with only a linear transformation, while we perform a non-linear transformation and also work across languages. Our three main contributions are summarized as follows:
1. We propose an approach to classify tweets by how relevant they are for a given news article in more than one language.
2. We create new evaluation corpora for monolingual and cross-lingual tweet to news article comparison.
3. We present lexical and task-specific evaluation results on two different datasets.

2 Related Work

Most of our research is closely related to work that identifies the relevance of tweets to online news or performs event detection. We divide the related work into separate categories.

2.1 Event Detection in Tweets

Analyzing the information flow about events as they emerge is an important aspect of event detection in tweets. Several works used this information in various ways. Some approaches [9] collocated emerging events and classified them into different categories, while some [10] found sentiment from the detected events. Others detected events as trends to track public health [11], political abuse [12] and crisis communication [13].

2.2 News and Relevant Tweets

Several approaches have been explored to match relevant noisy tweets with lengthy news articles. Initially, a semantic enrichment framework [14] was built to link news articles and tweets by identifying possible correlations to provide personalized news recommendations. Jin et al. [15] viewed the problem from a different perspective and introduced a dual latent Dirichlet allocation model to jointly learn two sets of topics. Later, a more sophisticated unsupervised topic modeling [2] approach was proposed for finding the overlap of topic distributions between tweets and news articles obtained from the New York Times.

2.3 Distributed Representations

Distributed word representations [5] have shown significant improvements in many NLP tasks [16]. Different variations of them, such as bilingual [6] and polylingual [17], are obtained by projecting a pair of languages or multiple languages into a shared semantic space. Word representations were also extended to meet the requirements of short or noisy text [18,19].

3 Monolingual Word Usage Characteristics

To understand the characteristics of word usage, news articles in German and English are initially collected between January, 2015 and December, [year missing]. To have a good overlap of topics, keywords are extracted from the news articles and used as queries for collecting tweets belonging to the same period with the Twitter search API. Acquired tweets are then polished by removing URLs, user mentions, the # symbol of hashtags, and all re-tweets.
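The tweet polishing step described above could look roughly like the following minimal sketch. The regular expressions and the rule that a re-tweet starts with "RT" are assumptions made for illustration; the paper does not state how re-tweets were detected, and `clean_tweets` is a hypothetical helper name.

```python
import re

# Assumed patterns; the paper does not specify the exact rules it used.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
RETWEET_RE = re.compile(r"^\s*RT\b")  # assumption: re-tweets start with "RT"


def clean_tweets(tweets):
    """Drop re-tweets, then strip URLs, user mentions and the '#' of hashtags."""
    cleaned = []
    for text in tweets:
        if RETWEET_RE.match(text):
            continue  # re-tweets are removed entirely
        text = URL_RE.sub(" ", text)
        text = MENTION_RE.sub(" ", text)
        text = text.replace("#", "")  # keep the hashtag word, drop the symbol
        text = " ".join(text.split())  # normalise whitespace
        if text:
            cleaned.append(text)
    return cleaned


if __name__ == "__main__":
    sample = [
        "RT @someone: old news",
        "Grexit talks resume http://t.co/abc @bbc #Grexit",
    ]
    print(clean_tweets(sample))
```

Keeping the hashtag word while dropping only the symbol means topical terms such as "Grexit" still contribute to the vocabulary learned from tweets.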
Additionally, GloVe is used to obtain word embeddings with 400 dimensions for both collections. The sizes of the final document sets and of the vocabulary extracted from GloVe are listed in Table 1. The word embeddings for each collection are now used to comprehend the word usage characteristics. Initially, the top 10 common and frequent words observed in both collections are visualized with t-SNE [20]. We observed that the same words learned separately from the tweet and news collections are highly separated.

Collection  Language  Documents  Vocabulary
News        English
News        German
Tweets      English
Tweets      German
Table 1. Collection Sizes

Furthermore, to apprehend the difference in slang, abbreviations etc. in both collections, we use the 5000 most frequent common vocabulary terms (both English and German) to perceive differences among their nearest neighbors. Based on the rank biased overlap (RBO) measure [21,8], which provides a comparison between incomplete and indefinite rankings, we observe a minimal average RBO measure of [value missing] and [value missing] for English and German respectively, with parameters ϕ = 0.9 and k = 100. This exhibits the difference in word usage between both collections and motivates us to transform word embeddings learned from tweets closer to word embeddings learned from news articles, or vice versa.

4 Transformed Word Embeddings (TWE)

The difference between embeddings learned from two different collections such as tweets and news requires bridging with an embedding transformation. In this section, we formulate the problem and present our approach for monolingual and cross-lingual transformation.

4.1 Problem Formulation

Let $T^l_{w_n} = \{t^l_{w_1}, t^l_{w_2}, \ldots, t^l_{w_i}, \ldots, t^l_{w_n}\}$ and $T^l_{e_n} = \{t^l_{e_1}, t^l_{e_2}, \ldots, t^l_{e_i}, \ldots, t^l_{e_n}\}$ represent the set of words and their corresponding embeddings extracted from the tweet collection, respectively, where $l$ is the language of the tweets, $n$ is the size of the vocabulary and each embedding is of dimension $t^l_{e_i} \in \mathbb{R}^{1 \times d}$. Similarly, $N^l_{w_m} = \{n^l_{w_1}, n^l_{w_2}, \ldots, n^l_{w_i}, \ldots, n^l_{w_m}\}$ and $N^l_{e_m} = \{n^l_{e_1}, n^l_{e_2}, \ldots, n^l_{e_i}, \ldots, n^l_{e_m}\}$ represent the set of words and their embeddings for the news corpora, respectively, where $l$ is the language of the news corpora, $m$ is the size of the vocabulary and each embedding is of dimension $n^l_{e_i} \in \mathbb{R}^{1 \times d}$. Formally, our research question is now to identify the common words $\{T^l_w, N^l_w\} = \{t^l_{w_i}, n^l_{w_i}\}_{i=1}$ and transform the word embeddings of the tweet collection ($T^l_e$) closer to the embeddings of the news collection ($N^l_e$), or vice versa.
This transformation is based on the assumption that a transformation relationship exists between the vectors of the frequent words of each collection. Some approaches [8] have earlier performed this simple transformation only when the tweets and the formal language corpora (e.g. news, Wikipedia) belong to the same language. The problem is non-trivial if the language of the tweets differs from that of the formal language corpora. In the following sections, we present the transformation of tweet embeddings closer to monolingual or cross-lingual news embeddings.
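The rank biased overlap comparison of nearest-neighbor lists used in Section 3 can be sketched as follows. This is a minimal illustration of the truncated RBO sum with persistence parameter p (0.9 in the text) evaluated to depth k (100 in the text). It is an assumption that the truncated form is the variant of interest; the extrapolated variant of [21] is not implemented, and `rbo_truncated` is a hypothetical helper name.

```python
def rbo_truncated(ranking_a, ranking_b, p=0.9, depth=100):
    """Truncated rank biased overlap of two rankings without duplicates.

    Computes (1 - p) * sum_{d=1..k} p**(d-1) * |A[:d] & B[:d]| / d,
    where k is the smaller of `depth` and the shorter list length.
    """
    k = min(depth, len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    overlap = 0  # size of the intersection of the two prefixes
    score = 0.0
    for d in range(1, k + 1):
        a, b = ranking_a[d - 1], ranking_b[d - 1]
        if a == b:
            overlap += 1
        else:
            overlap += (a in seen_b) + (b in seen_a)
        seen_a.add(a)
        seen_b.add(b)
        score += (p ** (d - 1)) * overlap / d
    return (1 - p) * score


if __name__ == "__main__":
    # Hypothetical nearest-neighbor lists of one word in the two collections.
    tweet_neighbors = ["gr8", "awesome", "cool", "nice"]
    news_neighbors = ["excellent", "awesome", "remarkable", "nice"]
    print(rbo_truncated(tweet_neighbors, news_neighbors, p=0.9, depth=100))
```

Because the sum is truncated, two identical lists of length k score 1 - p**k rather than 1, so values are only comparable when p and k are held fixed.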

4.2 Monolingual-TWE

Earlier approaches [8] assume only a linear relationship between embeddings from different collections to perform the transformation. Sometimes the relationship needs to handle disturbances such as scaling and rotation. To cater for such issues, we leverage manifold alignment using Procrustes analysis [7] to transform the word embeddings of tweets closer to the word embeddings of news articles with a three step procedure. Learning low-dimensional embeddings is the starting point for the transformation. We already have the low-dimensional embeddings $\{T^l_e, N^l_e\}$ of words observed in both the tweet and the news collection. To find the optimal values of the transformation, Procrustes superimposition is done by translating, rotating and scaling the objects (i.e. the rows of $T^l_e$ are transformed to make them similar to the rows of $N^l_e$). The transformation is achieved by:

Translation: Taking the mean of all members of each set so that the centroids of $T^l_e$ and $N^l_e$ (the means of their rows $t^l_{e_i}$ and $n^l_{e_i}$) lie at the origin.

Scaling and Rotation: The rotation and scaling that maximize the alignment are given by an orthogonal matrix ($Q$) and a scaling factor ($j$). They are obtained by solving the orthogonal Procrustes problem [22] given in Equation 1.

$$\arg\min_{j,Q} \left\| N^l_e - j\,T^l_e Q \right\|_F \quad (1)$$

where $j\,T^l_e Q$ is the matrix of transformed $T^l_e$ values and $\|\cdot\|_F$ is the Frobenius norm, constrained over $Q^T Q = I$. If $T^l_w$ represents the words of the transformed low-dimensional embeddings $j\,T^l_e Q$, then the final sets $\{T^l_w, N^l_w\}$ contain a closer correspondence. To understand the effectiveness of this transformation, we perform experiments similar to those of Section 3.

4.3 Cross-Lingual-TWE

Comparing the vocabulary obtained from tweets in one language ($l_1$) with the vocabulary of news articles in another language ($l_2$) is not straightforward. To subdue this concern, we propose a two step approach. In the first step, news articles from two different languages are acquired to learn bilingual distributed word representations (i.e. bilingual embeddings).
The aim of bilingual embeddings is to capture linguistic regularities across languages in a common semantic space such that English and German words (e.g. wonderful and wunderbar) are neighbors in the t-SNE visualization, thus bridging the language gap.

In the second step, cross-lingual transformation is achieved between the word embeddings obtained from tweets in $l_1$ and the word embeddings of news articles in $l_2$. As the bilingual word embeddings of news articles in $l_1$ also share linguistic regularities with $l_2$, mapping the word embeddings of tweets closer to the bilingual word embeddings of news articles in $l_1$ also helps to incorporate the linguistic regularities of $l_2$. Consequently, the transformation is attained in the same way as in Section 4.2, between the word embeddings of tweets and the bilingual word embeddings of news articles belonging to the same language.

Step-1 To learn bilingual embeddings, we leverage the approach of Gouws et al. [6], as it is fast and scalable, to jointly optimize the monolingual objective $M(\cdot)$ with the cross-lingual objective $\phi(\cdot)$ (i.e. the cross-lingual regularization term) to find the overall loss $L(\cdot)$. Documents in the news collections of languages $l_1$ and $l_2$ are used to learn monolingual models, along with the cross-lingual regularization term learned from parallel corpora (e.g. Europarl-v7). The overall loss function $L(\cdot)$ is given by Equation 2.

$$L(\cdot) = \min_{\theta^{l_1},\theta^{l_2}} \sum_{l \in \{l_1,l_2\}} \sum_{w_t,h \in C^l} M^l(w_t, h; \theta^l) + \lambda\,\phi(\theta^{l_1}, \theta^{l_2}) \quad (2)$$

$\phi(\cdot)$ eliminates the need for word alignment and assumes that each word observed in a document of language $l_1$ can potentially find its alignment in the document of language $l_2$. Thus, Equation 2 is modified into Equation 3.

$$L(\cdot) = \min_{\theta^{l_1},\theta^{l_2}} \sum_{l \in \{l_1,l_2\}} \sum_{w_t,h \in C^l} M^l(w_t, h; \theta^l) + \lambda \left\| \frac{1}{m} \sum_{w_i \in l_1}^{m} V^{l_1}_i - \frac{1}{n} \sum_{w_i \in l_2}^{n} V^{l_2}_i \right\|^2 \quad (3)$$

where $V^{l_1}$ and $V^{l_2}$ are the monolingual word vectors of the words in documents of languages $l_1$ and $l_2$ respectively, $C^l$ is a monolingual corpus (e.g. news), and $w_t$ is the predicted word in the context $h$ of a monolingual model.

Step-2 We follow a procedure similar to that of Section 4.2 but with a different set of embeddings. The low-dimensional embeddings used initially are $\{T^{l_1}_e, N^{l_1}_e\}$ of words observed in both the tweet and the news collection belonging to the same language.
Here, $N^{l_1}_e$ represents bilingual embeddings. The transformation is now achieved by translating, rotating and scaling the objects (i.e. the rows of $T^{l_1}_e$ are transformed to make them similar to the rows of $N^{l_1}_e$) using the same procedure as described in Section 4.2.
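The translation, rotation and scaling steps of Section 4.2 (reused in Step-2 with bilingual news embeddings) can be sketched with NumPy as below. This is a minimal illustration of the closed-form orthogonal Procrustes solution with a scaling factor, not the authors' code. The function names are hypothetical, and the rows of the two input matrices are assumed to hold the embeddings of the same common words in the same order.

```python
import numpy as np


def fit_procrustes(tweet_vecs, news_vecs):
    """Fit the translation, scaling j and orthogonal Q that minimise
    || (news_vecs - news_mean) - j * (tweet_vecs - tweet_mean) @ Q ||_F."""
    tweet_mean = tweet_vecs.mean(axis=0)
    news_mean = news_vecs.mean(axis=0)
    t0 = tweet_vecs - tweet_mean  # translation: centroid moved to the origin
    n0 = news_vecs - news_mean
    u, s, vt = np.linalg.svd(t0.T @ n0)
    q = u @ vt                     # optimal rotation (Q^T Q = I)
    j = s.sum() / (t0 ** 2).sum()  # optimal scaling factor
    return tweet_mean, news_mean, j, q


def transform(tweet_vecs, tweet_mean, j, q):
    """Map tweet embeddings into the centred news embedding space."""
    return j * (tweet_vecs - tweet_mean) @ q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tweets = rng.normal(size=(50, 4))        # toy tweet embeddings
    rotation, _ = np.linalg.qr(rng.normal(size=(4, 4)))
    news = 2.5 * (tweets - tweets.mean(axis=0)) @ rotation + 3.0
    t_mean, n_mean, j, q = fit_procrustes(tweets, news)
    residual = np.linalg.norm((news - n_mean) - transform(tweets, t_mean, j, q))
    print(round(j, 3), residual < 1e-8)
```

Fitting on the common words and then applying `transform` to all tweet embeddings gives the T2N direction; swapping the two arguments of `fit_procrustes` gives N2T.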

5 Experimental Setup

To evaluate our approach, we built a dataset for cross-language and monolingual pairwise tweet and news article relevance assessment. We also used an existing monolingual comparison corpus to compare with other approaches.

5.1 Corpus Creation

The unavailability of datasets for comparing news articles with tweets in different languages compelled us to create our own. We created a gold standard dataset for monolingual and cross-lingual comparison across collections by acquiring additional tweets and news articles, mainly in English and German, in the same way as described in Section 3. Tweets with a single URL link to a news article are collected and carefully evaluated to check that they do not simply reproduce the news title or summary. Tweets that only reproduce the news title or summary are considered trivial and are removed. After basic preprocessing, using the keyword Grexit (the Greece exit of the European Union), around 18 tweets and 18 news articles (both English and German) are selected for further human evaluation.

5.2 Human Evaluation

The goal of the human evaluation is to obtain pairwise comparison scores between tweets and news. Thus, each participant had to rate a pair of documents with respect to their semantic similarity. Three different annotators with English (E) and German (G) language skills were chosen to compare pairs of tweets and news based on the scores listed in Table 2.

Score  Type        Description
0      Dissimilar  Tweet and news article are not about the same topic.
1      Related     Tweet and news article share a topic, but important ideas in the news are not represented in the tweet.
2      Similar     Tweet and news article are about the same topic, and important ideas in the news are represented in the tweet.
Table 2. Similarity Scores

At the end, a list of 628 relevance judgments (i.e. 162 between (E)Tweets and (E)News, 162 between (E)Tweets and (G)News and so on) was produced. A significance test with Kendall's τ is computed to test the consistency among user judgments.
The results suggested that there is no significant difference in the score pairs of users (0.05 significance level). Specifically, the results showed that users have a similar understanding of the similarity assessment. To obtain the final score for each pair, the arithmetic mean of all user ratings was calculated, similar to the SemEval semantic similarity tasks. We term this resource Dataset-1. This dataset provides a more fine-grained comparison than other datasets [4] that provide only binary relevance.

5.3 Other Datasets

Evaluation of the monolingual comparison is also performed on another existing resource, that of Krestel et al. [4]. This dataset consists of 1600 relevance judgments for 17 news articles covering different topics, with tweets labeled as relevant or irrelevant for each news article. We term this resource Dataset-2.

5.4 Evaluation Metrics

For many pairwise semantic similarity tasks, statistical correlation based measures have been used. Here, we use the Pearson correlation coefficient (r) to evaluate our approaches on the dataset we created, while accuracy is used for the other dataset.

6 Experimental Results

In this section, we present our experimental results on different datasets with variation in parameters.

6.1 Baselines

Two different baselines are used for comparison with our approach.

Latent Dirichlet Allocation (LDA) Most of the earlier research [2,3] has shown significant interest in comparing news and tweets with LDA and its variations. We use the polylingual topic model [23] trained on English and German Wikipedia with 100 topics to support multiple languages. The similarity between a tweet and a news article, each represented as a topic vector, is measured using cosine similarity.

WTMF-G Weighted Textual Matrix Factorization on Graphs (WTMF-G) [3] is a baseline that compares tweets and news based on a graph connected by hashtags, named entities or temporal information. To train the WTMF-G model we used a regularization coefficient (λ = 20), a weight for missing words of w_n = 0.01, a number of neighbors (k = 4) and link weights (δ = 3), as suggested in earlier research.
A latent dimension of 100 is used to represent tweets and news, while the similarity between them is calculated using cosine similarity.
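The evaluation described in Sections 5.2 and 5.4, averaging the annotator ratings per pair and correlating them with system similarity scores, can be sketched as follows. The function names and the toy numbers are illustrative assumptions, not data from the paper.

```python
import math


def mean_ratings(ratings_per_pair):
    """Arithmetic mean of the annotator ratings for each tweet-news pair."""
    return [sum(ratings) / len(ratings) for ratings in ratings_per_pair]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical ratings of three annotators on the 0-2 scale of Table 2.
    gold = mean_ratings([(0, 0, 0), (1, 2, 0), (2, 2, 2), (1, 1, 2)])
    # Hypothetical cosine similarities produced by a system for the same pairs.
    system = [0.05, 0.40, 0.80, 0.55]
    print(pearson_r(system, gold))
```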

6.2 TWE Implementation

The major parameters that affect the training of GloVe are the dimensionality of the word embeddings and the size of the word context window. We choose 25, 50, 100, 200 and 400 word embedding dimensions and a context window of 5 words on the left and right. Similarly, for learning bilingual word embeddings we used the BilBOWA tool to learn the same embedding dimensions as before, with a 5 word left context window and the entire English-German Europarl-v7 as the parallel data. In both cases, words occurring fewer than 2 times in the entire corpus are discarded.

6.3 Monolingual Comparison

Before comparing monolingual news and tweets, we estimate the quality of the embedding transformation achieved with Monolingual-TWE by performing experiments similar to those in Section 3. The transformation can be either from tweets to news (T2N) or in the opposite direction (N2T). Though the two transformations differ, we observed that they produce similar t-SNE visualizations. There is also a slight decrease in the distance between common words across collections compared to the embeddings without transformation. The average RBO measure using the top 5000 frequent terms observed in both the tweet and news collections in German and English is recalculated to perceive the refinement. We perceived an improvement of 24.4% and 21.2% for English and German respectively.

Now, tweets and news articles in Dataset-1 and Dataset-2 are represented as the tf-idf weighted average of transformed word embeddings. They are used as input to an SVM classifier with default parameters to calculate accuracy, and to cosine similarity to compute the Pearson correlation. Furthermore, the top performing embedding dimensions are identified based on the Pearson correlation and accuracy measures using the validation data of the datasets. Figure 1 and Figure 2 show the comparison of results with ((T2N)TWE and (N2T)TWE) and without (Non-TWE) transformation on different datasets.
Once the top performing embedding dimensions are identified, testing data is used to compare different approaches with diverse measures in Table 3 and Table 4.

Fig. 1. Effect of Embedding Dimensions (Dataset-1)
Fig. 2. Effect of Embedding Dimensions (Dataset-2)

Method                 Dim  r
German
No-Transformation
LDA-PTM [23]
WTMF-G [3]
(T2N)Monolingual-TWE
(N2T)Monolingual-TWE
English
No-Transformation
LDA-PTM [23]
WTMF-G [3]
(T2N)Monolingual-TWE
(N2T)Monolingual-TWE
Table 3. Monolingual Tweets and News Comparison

6.4 Cross-Lingual Comparison

For the cross-lingual comparison, we follow a procedure similar to that in Section 6.3. Since the news word embeddings incorporate bilingual information from both German and English, calculating the RBO measure between tweets and news without transformation is not appropriate. Hence, we calculate the RBO measure after transformation to verify that it satisfies the minimum threshold of 0.328, which in general fetches satisfactory results [8]. To compare tweets and news belonging to the dataset listed in Section 5.1 across languages, we estimate the top performing embedding dimension based on the Pearson correlation measure using the validation data of the dataset. Figure 3 shows the comparison of results with (TWE) and without (Non-TWE) transformation. Once the top performing embedding dimension is identified, testing data is used to compare different approaches as provided in Table 5.
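The document representation used in Sections 6.3 and 6.4, a tf-idf weighted average of (transformed) word embeddings compared by cosine similarity, can be sketched as follows. The smoothed idf formula, the choice to skip words without an embedding, and all function names are assumptions made for this illustration; the paper does not specify these details.

```python
import math
from collections import Counter

import numpy as np


def inverse_document_frequency(tokenized_docs):
    """Smoothed idf per token (assumed variant: log((1 + N) / (1 + df)) + 1)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def document_vector(tokens, embeddings, idf):
    """tf-idf weighted average of the embeddings of the known tokens."""
    counts = Counter(t for t in tokens if t in embeddings and t in idf)
    if not counts:
        return None  # no token of the document has an embedding
    weights = np.array([tf * idf[tok] for tok, tf in counts.items()])
    vectors = np.stack([embeddings[tok] for tok in counts])
    return weights @ vectors / weights.sum()


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


if __name__ == "__main__":
    # Toy 2-dimensional embeddings standing in for transformed embeddings.
    emb = {"grexit": np.array([1.0, 0.0]), "vote": np.array([0.0, 1.0])}
    tweet = ["grexit", "vote", "lol"]
    news = ["grexit", "grexit", "vote"]
    idf = inverse_document_frequency([tweet, news])
    print(cosine_similarity(document_vector(tweet, emb, idf),
                            document_vector(news, emb, idf)))
```

The resulting document vectors can serve both as classifier features and as inputs to the cosine similarity whose values are correlated with the human ratings.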

Method                     Dim  Accuracy (%)
LDA-PTM [23]
Boosting [4]
(T2N)Monolingual-TWE+SVM
(N2T)Monolingual-TWE+SVM
Table 4. Accuracy (English)

Fig. 3. Effect of Embedding Dimensions (Cross-Lingual)

Method                   r
(E)Tweets - (G)News
LDA-PTM [23]
(T2N)Cross-Lingual-TWE
(N2T)Cross-Lingual-TWE
(G)Tweets - (E)News
LDA-PTM [23]
(T2N)Cross-Lingual-TWE
(N2T)Cross-Lingual-TWE
Table 5. Cross-Lingual Tweets and News Comparison With 100-Dimensions

7 Discussion

We start our analysis with the results in Table 3. Monolingual-TWE (either T2N or N2T) achieved a commendable improvement over the other approaches. However, the values of the Pearson correlation are low, which can be attributed to the fact that tweets and news are inherently very different, so achieving a high level of pairwise similarity is a complex task. For the accuracy assessment, which is mostly seen from the perspective of a classification task, there is a clear improvement over other approaches when transformed embeddings are used as features. Table 4 shows that T2N achieved better performance than N2T. Although this analysis is based on a small dataset, the results show a promising direction for Monolingual-TWE, which can easily scale with the size of the common vocabulary across collections, thus giving a possibility to improve or sustain the accuracy and Pearson correlation values on larger datasets.

Similar observations can be made about Cross-Lingual-TWE. Given the complexity associated with finding pairwise relevance between tweets and cross-language news, we compared only LDA based approaches with Cross-Lingual-TWE. Table 5 shows that T2N outperformed LDA-PTM with a notable improvement. Although the improvement may not be significant, these results are a preliminary examination that motivates research in this direction.

8 Conclusion and Future Work

In this paper, we focused on mapping tweets to monolingual and cross-lingual news by transforming their word embeddings closer to each other, thus bridging the lexical and word usage gap across collections. In future, we aim to improve the quality of results with more sophisticated approaches.

References

1. Kwak, H., Lee, C., Park, H., Moon, S.: What is twitter, a social network or a news media? In: Proceedings of ACM (2010)
2. Zhao, W.X., Jiang, J., Weng, J., He, J., Lim, E.P., Yan, H., Li, X.: Comparing twitter and traditional media using topic models. In: Advances in Information Retrieval, Springer Berlin Heidelberg (2011)
3. Guo, W., Li, H., Ji, H., Diab, M.T.: Linking tweets to news: A framework to enrich short text data in social media. In: Proceedings of ACL (2013)
4. Krestel, R., Werkmeister, T., Wiradarma, T.P., Kasneci, G.: Tweet-recommender: Finding relevant tweets for news articles. In: Proceedings of WWW, ACM (2015)
5. Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of EMNLP (2014)
6. Gouws, S., Bengio, Y., Corrado, G.: Bilbowa: Fast bilingual distributed representations without word alignments. In: arXiv preprint (2014)
7. Wang, C., Mahadevan, S.: Manifold alignment using procrustes analysis. In: Proceedings of ICML, ACM (2008)
8. Tan, L., Zhang, H., Clarke, C.L., Smucker, M.D.: Lexical comparison between wikipedia and twitter corpora by using word embeddings. In: Proceedings of ACL
(2015)
9. Ritter, A., Etzioni, O., Clark, S.: Open domain event extraction from twitter. In: Proceedings of KDD (2012)
10. Thelwall, M., Buckley, K., Paltoglou, G.: Sentiment in twitter events. Journal of the American Society for Information Science and Technology 62 (2011)
11. Paul, M.J., Dredze, M.: You are what you tweet: Analyzing twitter for public health. In: Proceedings of ICWSM (2011)
12. Ratkiewicz, J., Conover, M., Meiss, M., Gonçalves, B., Flammini, A., Menczer, F.: Detecting and tracking political abuse in social media. In: Proceedings of ICWSM (2011)
13. Crooks, A., Croitoru, A., Stefanidis, A., Radzikowski, J.: #earthquake: Twitter as a distributed sensor system. Transactions in GIS 17(1) (2013)
14. Abel, F., Gao, Q., Houben, G.J., Tao, K.: Analyzing user modeling on twitter for personalized news recommendations. In: Proceedings of UMAP (2011)
15. Jin, O., Liu, N.N., Zhao, K., Yu, Y., Yang, Q.: Transferring topical knowledge from auxiliary long texts for short text clustering. In: Proceedings of CIKM, ACM (2011)
16. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. The Journal of Machine Learning Research 12 (2011)
17. Al-Rfou, R., Bryan, P., Steven, S.: Polyglot: Distributed word representations for multilingual nlp. In: Proceedings of CoNLL, ACL (2013)
18. Ramon, A.F., Amir, S., Lin, W., Silva, M., Trancoso, I.: Learning word representations from scarce and noisy data with embedding sub-spaces. In: Proceedings of ACL (2015)
19. Kim, J., Rousseau, F., Vazirgiannis, M.: Convolutional sentence kernel from word embeddings for short text categorization. In: Proceedings of EMNLP (2015)
20. der Maaten, L.V., Hinton, G.: Visualizing data using t-sne. The Journal of Machine Learning Research 9 (2008)
21. Webber, W., Moffat, A., Zobel, J.: A similarity measure for indefinite rankings. ACM Transactions on Information Systems (TOIS) 4 (2010)
22. Schönemann, P.H.: A generalized solution of the orthogonal procrustes problem. Psychometrika 31(1) (1966)
23. Mimno, D., Wallach, H.M., Naradowsky, J., Smith, D.A., McCallum, A.: Polylingual topic models. In: Proceedings of EMNLP, ACL (2009)
