arxiv: v1 [cs.cl] 25 Oct 2017


Linking Tweets with Monolingual and Cross-Lingual News using Transformed Word Embeddings

Aditya Mogadala 1, Dominik Jung 2 and Achim Rettinger 1

arXiv:1710.09137v1 [cs.CL] 25 Oct 2017

1 Institute AIFB, Karlsruhe Institute of Technology, Germany, aditya.mogadala@kit.edu, rettinger@kit.edu
2 Institute IISM, Karlsruhe Institute of Technology, Germany, dominik.jung2@kit.edu

Abstract. Social media platforms have grown into an important medium to spread information about an event published by the traditional media, such as news articles. Grouping such diverse sources of information that discuss the same topic from varied perspectives provides new insights. But the gap in word usage between informal social media content such as tweets and diligently written content (e.g. news articles) makes such assembling difficult. In this paper, we propose a transformation framework to bridge the word usage gap between tweets and online news articles across languages by leveraging their word embeddings. Using our framework, word embeddings extracted from tweets and news articles are aligned closer to each other across languages, thus facilitating the identification of similarity between news articles and tweets. Experimental results show a notable improvement over baselines for monolingual tweets and news articles comparison, while new findings are reported for cross-lingual comparison.

1 Introduction

On the web, the growth of social media platforms has offered numerous opportunities along with several challenges to solve. Twitter 3 is one such social media platform that allows its users to share text messages of 140 characters (popularly known as tweets) in multiple languages with their friends or followers. Tweets may contain personal information or a confined description of an event motivated by the traditional media, such as online news articles. Studies [1] have shown that 85% of the tweets are news affiliated. Though some tweets acknowledge news articles by explicitly linking them, most of them do not.
This implicit linking of tweets with news topics provides novel insights. For example, most of the traditional media companies that publish online news write only facts about an event. However, identifying relevant tweets for the corresponding news adds people's opinions. Furthermore, attaching tweets to news articles allows a multi-dimensional view of controversial topics, thus empowering the editor of an article to modify upcoming or follow-up articles based on veracity.

3 https://twitter.com/

However, the differences in word usage between informal tweets and attentively drafted writing such as news articles make this linking challenging. Nevertheless, different approaches have been pursued to solve the problem. Initially, monolingual comparison of tweets with news articles was achieved by finding commonality between topics using unsupervised topic models [2]. Although scalable, this approach fails to capture the importance of words and their differences across corpora. A graph-based latent variable model [3] was further introduced for finding short text correlations using microblog hashtags and the named entities of news articles. Even though it addresses the earlier drawbacks by giving importance to keywords such as named entities in news articles, it still ignores a large chunk of the remaining vocabulary. Krestel et al. [4] followed a different path by posing the comparison of tweets with news as a relevance assessment problem and designed a supervised binary classifier with many hand-crafted features. Although the approach is supervised, the hand-crafted features limit its scalability. Further, the aforementioned approaches ignore the multilingual aspect of published news. Nowadays most of the online news about any event is multilingual. Identifying news articles in a single language is not enough to cover the collective views about an event.

In this paper, we overcome the limitations of earlier approaches and propose a new scalable framework to support the comparison of tweets with monolingual and cross-lingual news articles. Our framework leverages monolingual [5] and bilingual [6] word embeddings acquired from tweets and news articles as basic units for bridging the word usage gap across these collections. Furthermore, a non-linear transformation of tweet word embeddings is performed to bring them closer to the news article word embeddings using manifold alignment with Procrustes analysis [7]. The work most closely related to our approach is by Tan et al. [8], who perform a lexical comparison of words observed in tweets and Wikipedia in the same language with only a linear transformation, while we perform a non-linear transformation and also work across languages. Our three main contributions are summarized as follows:

1. We propose an approach to classify tweets by how relevant they are to a given news article in more than one language.
2. New evaluation corpora are created for monolingual and cross-lingual tweets to news article comparison.
3. Lexical and task-specific evaluation results are presented on two different datasets.

2 Related Work

Most of our research is closely related to work that identifies the relevance of tweets to online news or performs event detection. We divide the related work into separate categories.

2.1 Event Detection in Tweets

Analyzing the information flow about events as they emerge is an important aspect of event detection in tweets. Several works used this information in various ways. Some approaches [9] collocated emerging events and classified them into different categories, while some [10] found sentiment from the detected events. Others detected events as trends to track public health [11], political abuse [12] and crisis communication [13].

2.2 News and Relevant Tweets

Several approaches have been explored to identify noisy tweets relevant to lengthy news articles. Initially, a semantic enrichment framework [14] was built to link news articles and tweets by identifying possible correlations to provide personalized news recommendations. Jin et al. [15] viewed the problem from a different perspective and introduced a dual latent Dirichlet allocation model to jointly learn two sets of topics. Later, a more sophisticated unsupervised topic modeling approach [2] was proposed for finding the overlap of topic distributions between tweets and news articles obtained from the New York Times 4.

2.3 Distributed Representations

Distributed word representations [5] have shown significant improvements in many NLP tasks [16]. Different variations of them, such as bilingual [6] and polylingual [17] representations, are obtained by projecting a pair of languages or multiple languages into a shared semantic space. Word representations were also extended to meet the requirements of short or noisy text [18,19].

3 Monolingual Word Usage Characteristics

To understand the characteristics of word usage, news articles in German and English were first collected between January 2015 and December 2015. To have a good overlap of topics, keywords 5 were extracted from the news articles and used as queries for collecting tweets from the same period with the Twitter search API 6. The acquired tweets were then cleaned by removing URLs, user mentions, the # symbol of hashtags, and all re-tweets.
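The tweet cleaning step described above could be sketched as follows. This is a minimal illustration, not the authors' code: the regular expressions, the function name `clean_tweets`, and the assumption that re-tweets are marked by a leading "RT" are all hypothetical choices.

```python
import re

# Hypothetical patterns; the paper does not specify how each item is detected.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
RETWEET_RE = re.compile(r"^\s*RT\b", re.IGNORECASE)


def clean_tweets(tweets):
    """Drop re-tweets, then strip URLs, user mentions, and the '#' of hashtags."""
    cleaned = []
    for text in tweets:
        if RETWEET_RE.match(text):
            continue  # assumption: a re-tweet starts with the token "RT"
        text = URL_RE.sub(" ", text)
        text = MENTION_RE.sub(" ", text)
        text = text.replace("#", "")   # keep the hashtag word, drop the symbol
        text = " ".join(text.split())  # collapse leftover whitespace
        if text:
            cleaned.append(text)
    return cleaned
```

For example, `clean_tweets(["RT @a: hello", "Vote on #Grexit today http://t.co/x @bob"])` keeps only the second tweet, reduced to `"Vote on Grexit today"`.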
Additionally, GloVe 7 is used to obtain word embeddings with 400 dimensions for both collections. The sizes of the final document sets and of the vocabulary extracted from GloVe are listed in Table 1.

Collection  Language  Documents  Vocabulary
News        English   1027987    348419
News        German    198784     241014
Tweets      English   110731     47280
Tweets      German    56957      31887
Table 1. Collection Sizes

Word embeddings for each collection are now used to understand the word usage characteristics. Initially, the top 10 common and frequent words observed in both collections are visualized with t-SNE [20]. We observed that the same words learned separately from the tweet and news collections are highly separated. Furthermore, to capture the differences in slang, abbreviations, etc. in both collections, we use the 5000 most frequent common vocabulary terms (both English and German) to examine the differences among their nearest neighbors. Based on the rank-biased overlap (RBO) measure [21,8], which provides a comparison between incomplete and indefinite rankings, we observe a low average RBO of 0.2856 and 0.2589 for English and German respectively, with parameters ϕ = 0.9 and k = 100. This exhibits the difference in word usage between the two collections and motivates us to transform word embeddings learned from tweets closer to word embeddings learned from news articles, or vice versa.

4 http://www.nytimes.com/
5 https://github.com/aneesha/rake
6 https://dev.twitter.com/rest/public/search
7 https://github.com/stanfordnlp/glove

4 Transformed Word Embeddings (TWE)

The difference between embeddings learned from two different collections, such as tweets and news, has to be bridged with an embedding transformation. In this section, we formulate the problem and present our approach for monolingual and cross-lingual transformation.

4.1 Problem Formulation

Let $T^l_{w_n} = \{t^l_{w_1}, t^l_{w_2}, \ldots, t^l_{w_i}, \ldots, t^l_{w_n}\}$ and $T^l_{e_n} = \{t^l_{e_1}, t^l_{e_2}, \ldots, t^l_{e_i}, \ldots, t^l_{e_n}\}$ represent the set of words and their corresponding embeddings extracted from the tweet collection, respectively, where $l$ is the language of the tweets, $n$ is the size of the vocabulary, and each embedding $t^l_{e_i} \in \mathbb{R}^{1 \times d}$. Similarly, $N^l_{w_m} = \{n^l_{w_1}, n^l_{w_2}, \ldots, n^l_{w_i}, \ldots, n^l_{w_m}\}$ and $N^l_{e_m} = \{n^l_{e_1}, n^l_{e_2}, \ldots, n^l_{e_i}, \ldots, n^l_{e_m}\}$ represent the set of words and their embeddings in the news corpora, respectively, where $l$ is the language of the news corpora, $m$ is the size of the vocabulary, and each embedding $n^l_{e_i} \in \mathbb{R}^{1 \times d}$.
Formally, our research question is to identify the common words $\{T^l_w, N^l_w\} = \{t^l_{w_i}, n^l_{w_i}\}_{i=1}^{c}$ (with $c$ the number of common words) and to transform the word embeddings of the tweet collection ($T^l_e$) closer to the embeddings of the news collection ($N^l_e$), or vice versa. This transformation is based on the assumption that a transformation relationship exists between the vectors of the frequent words of each collection. Some approaches [8] have earlier performed this simple transformation only when the tweets and the formal language corpora (e.g. news, Wikipedia) are in the same language. It is non-trivial, however, if the language of the tweets and that of the formal language corpora differ. In the following sections, we present the transformation of tweet embeddings closer to monolingual or cross-lingual news embeddings.
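As an illustration of the formulation above, the common words and their row-aligned embeddings could be collected as in the sketch below. It assumes, hypothetically, that each collection's embeddings are held in a plain dictionary mapping a word to a list of floats; the function name `paired_embeddings` and the optional frequency table are not from the paper.

```python
def paired_embeddings(tweet_vecs, news_vecs, freq=None, top_k=None):
    """Return (words, T, N) where rows T[i] and N[i] embed the same common word.

    tweet_vecs / news_vecs: dict mapping word -> list of floats (assumed layout).
    freq: optional dict word -> count, used to keep the most frequent words first.
    top_k: optional cap on the number of common words (e.g. 5000).
    """
    common = set(tweet_vecs) & set(news_vecs)
    if freq is not None:
        # most frequent first; ties broken alphabetically for determinism
        ordered = sorted(common, key=lambda w: (-freq.get(w, 0), w))
    else:
        ordered = sorted(common)
    if top_k is not None:
        ordered = ordered[:top_k]
    T = [list(tweet_vecs[w]) for w in ordered]
    N = [list(news_vecs[w]) for w in ordered]
    return ordered, T, N
```

Keeping the two matrices row-aligned matters because the transformation in the following sections treats row i of the tweet matrix and row i of the news matrix as the same word.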

4.2 Monolingual-TWE

Earlier approaches [8] assume only a linear relationship between embeddings from different collections when performing the transformation. Sometimes the relationship needs to handle disturbances such as scaling and rotation. To cater to such issues, we leverage manifold alignment using Procrustes analysis [7] to transform the word embeddings of tweets closer to the word embeddings of news articles with a three-step procedure.

Learning low-dimensional embeddings is the starting point for the transformation. We already have low-dimensional embeddings $\{T^l_e, N^l_e\}$ of the words observed in both the tweet and news collections. To find the optimal values of the transformation, Procrustes superimposition is done by translating, rotating and scaling the objects (i.e. the rows of $T^l_e$ are transformed to make them similar to the rows of $N^l_e$). The transformation is achieved by:

Translation: Taking the mean of all the members of each set so that the centroids $\left(\sum_{i=1}^{c} T^l_{e_i},\ \sum_{i=1}^{c} N^l_{e_i}\right)$ over the $c$ common words lie at the origin.

Scaling and Rotation: The rotation and scaling that maximize the alignment are given by an orthogonal matrix ($Q$) and a scaling factor ($j$). They are obtained by solving the orthogonal Procrustes problem [22], given in Equation 1.

$$\arg\min_{j,Q} \left\| N^l_e - T^{l*}_e \right\|_F \tag{1}$$

where $T^{l*}_e$ is the matrix of transformed $T^l_e$ values given by $j\,T^l_e\,Q$ and $\|\cdot\|_F$ is the Frobenius norm, constrained over $Q^T Q = I$. If $T^l_w$ represents the words of the $T^{l*}_e$ low-dimensional embeddings, then the final sets $\{T^l_w, N^l_w\}$ are in closer correspondence. To understand the effectiveness of this transformation, we repeat the experiments of Section 3 in Section 6.3.

4.3 Cross-Lingual-TWE

Comparing the vocabulary obtained from tweets in one language ($l_1$) with the vocabulary of news articles in another language ($l_2$) is not straightforward. To address this concern, we propose a two-step approach. In the first step, news articles from two different languages are acquired to learn bilingual distributed word representations (i.e. bilingual embeddings).
The aim of bilingual embeddings is to capture linguistic regularities across languages in a common semantic space such that English and German words (e.g. wonderful and wunderbar) are neighbors in the t-SNE visualization, thus bridging the language gap.

In the second step, the cross-lingual transformation is achieved between word embeddings obtained from tweets in $l_1$ and word embeddings of news articles in $l_2$. As the bilingual word embeddings of news articles in $l_1$ also share linguistic regularities with $l_2$, mapping the word embeddings of tweets closer to the bilingual word embeddings of news articles in $l_1$ also helps to incorporate the linguistic regularities of $l_2$. Consequently, the transformation is attained in the same way as in Section 4.2, between the word embeddings of tweets and the bilingual word embeddings of news articles in the same language.

Step-1: To learn bilingual embeddings, we leverage the approach of Gouws et al. [6], as it is fast and scalable. It jointly optimizes the monolingual objective $M(\cdot)$ with the cross-lingual objective $\phi(\cdot)$ (i.e. the cross-lingual regularization term) to find the overall loss $L(\cdot)$. Documents in the news collections of languages $l_1$ and $l_2$ are used to learn the monolingual models, along with the cross-lingual regularization term learned from parallel corpora (e.g. Europarl-v7). The overall loss function $L(\cdot)$ is given by Equation 2.

$$L(\cdot) = \min_{\theta^{l_1},\theta^{l_2}} \sum_{l \in \{l_1, l_2\}} \sum_{C^l} M^l(w_t, h; \theta^l) + \lambda\,\phi(\theta^{l_1}, \theta^{l_2}) \tag{2}$$

$\phi(\cdot)$ eliminates the need for word alignment by assuming that each word observed in a document of language $l_1$ can potentially find its alignment in the document of language $l_2$. Thus, Equation 2 is modified into Equation 3.

$$L(\cdot) = \min_{\theta^{l_1},\theta^{l_2}} \sum_{l \in \{l_1, l_2\}} \sum_{C^l} M^l(w_t, h; \theta^l) + \lambda \left\| \frac{1}{m} \sum_{w_i \in l_1}^{m} V^{l_1}_i - \frac{1}{n} \sum_{w_i \in l_2}^{n} V^{l_2}_i \right\|^2 \tag{3}$$

where $V^{l_1}$ and $V^{l_2}$ are the monolingual word vectors of the words in documents of languages $l_1$ and $l_2$ respectively, and $C^l$ is a monolingual corpus (e.g. news). $w_t$ is the predicted word in the context $h$ of a monolingual model.

Step-2: We follow a similar procedure as in Section 4.2 but with a different set of embeddings. The low-dimensional embeddings used initially are $\{T^{l_1}_e, N^{l_1}_e\}$ of the words observed in both the tweet and news collections in the same language.
Here, $N^{l_1}_e$ represents bilingual embeddings. The transformation is now achieved by translating, rotating and scaling the objects (i.e. the rows of $T^{l_1}_e$ are transformed to make them similar to the rows of $N^{l_1}_e$) using the same procedure as described in Section 4.2.
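The translation, scaling and rotation steps of Section 4.2, which Step-2 reuses, can be sketched with NumPy as below. This is an illustrative sketch rather than the authors' implementation: the function name `procrustes_align` is hypothetical, the closed-form solution via a singular value decomposition is the standard one for the orthogonal Procrustes problem with a scale factor, and shifting the result onto the news centroid at the end is an added convenience not stated in the paper.

```python
import numpy as np


def procrustes_align(T, N):
    """Map rows of T (tweet embeddings) toward rows of N (news embeddings).

    T and N are (c x d) arrays whose i-th rows belong to the same common word.
    Returns (T_aligned, Q, j) with T_aligned = j * (T - mean(T)) @ Q + mean(N).
    """
    T = np.asarray(T, dtype=float)
    N = np.asarray(N, dtype=float)
    mu_t, mu_n = T.mean(axis=0), N.mean(axis=0)
    Tc, Nc = T - mu_t, N - mu_n            # translation: centroids at the origin
    U, S, Vt = np.linalg.svd(Tc.T @ Nc)    # SVD of the d x d cross-covariance
    Q = U @ Vt                             # orthogonal matrix, Q.T @ Q = I
    j = S.sum() / (Tc ** 2).sum()          # scaling that minimizes ||Nc - j*Tc@Q||_F
    # Assumption: move the aligned rows onto the news centroid for later use.
    return j * Tc @ Q + mu_n, Q, j
```

To embed tweet words outside the common vocabulary, the same `Q`, `j` and centroids learned from the common words would be applied to their vectors; that extension is an assumption about usage and is left out of the sketch.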

5 Experimental Setup

To evaluate our approach, we built a dataset for cross-language and monolingual pairwise relevance assessment of tweets and news articles. We also used an existing monolingual comparison corpus to compare with other approaches.

5.1 Corpus Creation

The unavailability of datasets for comparing news articles with tweets in different languages compelled us to create our own. We created a gold standard dataset for monolingual and cross-lingual comparison across collections by acquiring additional tweets and news articles, mainly in English and German, in the same way as described in Section 3. Tweets with a single URL link to any news article are collected and carefully evaluated to check that they do not simply repeat the news title or summary. Tweets that only repeat the news title or summary are considered trivial and are removed. After basic preprocessing, using the keyword Grexit (the Greece exit of the European Union), around 18 tweets and 18 news articles (both English and German) are selected for further human evaluation.

5.2 Human Evaluation

The goal of the human evaluation is to get pairwise comparison scores between tweets and news. Thus, each participant had to rate a pair of documents with respect to their semantic similarity. Three different annotators with English (E) and German (G) language skills were chosen to compare pairs of tweets and news based on the scores listed in Table 2.

Score  Type        Description
0      Dissimilar  Tweet and news article are not about the same topic.
1      Related     Tweet and news article share a topic, but important ideas in the news are not represented in the tweet.
2      Similar     Tweet and news article are about the same topic, and important ideas in the news are represented in the tweet.
Table 2. Similarity Scores

At the end, a list of 628 relevance judgments (i.e. 162 between (E)Tweets and (E)News, 162 between (E)Tweets and (G)News and so on) was produced. A significance test with Kendall's τ is computed to test the consistency among user judgments.
Results suggested that there is no significant difference in the score pairs of users (0.05 significance level). Specifically, the results showed that users have a similar understanding of the similarity assessment. To obtain the final score for each pair, similar to the SemEval semantic similarity tasks 8, the arithmetic mean was calculated over all user ratings. We term this resource Dataset-1 9. This dataset provides a more fine-grained comparison than other datasets [4] that provide only binary relevance.

5.3 Other Datasets

Evaluation of the monolingual comparison is also performed on an existing resource, that of Krestel et al. [4]. This dataset consists of 1600 relevance judgments for 17 news articles covering different topics, with the tweets labeled as relevant or irrelevant for each news article. We term this resource Dataset-2.

5.4 Evaluation Metrics

For many pairwise semantic similarity tasks, statistical correlation based measures have been used. Here, we use the Pearson correlation coefficient (r) to evaluate our approaches on the dataset we created, while accuracy is used for the other dataset.

6 Experimental Results

In this section, we present our experimental results on different datasets with variation in parameters.

6.1 Baselines

Two different baselines are used to compare with our approach.

Latent Dirichlet Allocation (LDA): Most of the earlier research [2,3] has shown significant interest in comparing news and tweets with LDA and its variations. We use the polylingual topic model [23] trained on English and German Wikipedia with 100 topics to support multiple languages. The similarity between a tweet and a news article, each represented as a topic vector, is measured using cosine similarity.

WTMF-G: Weighted Textual Matrix Factorization on Graphs (WTMF-G) [3] is a baseline that compares tweets and news based on a graph connected by hashtags, named entities or temporal information. To train the WTMF-G model we used a regularization coefficient (λ = 20), a weight of missing words of $w_n$ = 0.01, a number of neighbors (k = 4) and link weights (δ = 3), as suggested in earlier research. A latent dimension of 100 is used to represent tweets and news, while the similarity between them is calculated using cosine similarity.

8 http://ixa2.si.ehu.es/stswiki/index.php/Main_Page
9 http://people.aifb.kit.edu/amo/cicling2017/
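The scoring pipeline of Sections 5.2 and 5.4 (averaging the annotator ratings per pair, then correlating system scores with those averages) can be sketched as follows. The function names and the list-of-lists layout for ratings are hypothetical; only the arithmetic mean and the Pearson correlation coefficient themselves come from the text.

```python
from math import sqrt


def mean_ratings(ratings_per_pair):
    """Average the annotator scores (0, 1 or 2) given to each tweet-news pair."""
    return [sum(r) / len(r) for r in ratings_per_pair]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

A system would be evaluated by calling `pearson_r(system_scores, mean_ratings(annotations))`, where `system_scores` holds one similarity value per tweet-news pair in the same order as the annotations.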

6.2 TWE Implementation

The major parameters that affect the training of GloVe are the dimensionality of the word embeddings and the size of the word context window. We choose 25, 50, 100, 200 and 400 word embedding dimensions and a context window of 5 words to the left and right. Similarly, for learning bilingual word embeddings we used the BilBOWA tool 10 to learn the same embedding dimensions as before, with a 5-word left context window and the entire English-German Europarl-v7 11 as the parallel data. In both cases, words that occur fewer than 2 times in the entire corpus are discarded.

6.3 Monolingual Comparison

Before comparing monolingual news and tweets, we estimate the quality of the embedding transformation achieved with Monolingual-TWE by repeating the experiments of Section 3. The transformation can be either from tweets to news (T2N) or in the opposite direction (N2T). Though the two transformations differ, we observed that they produce similar t-SNE visualizations. Also, there is a slight decrease in the distance between common words across collections compared to no transformation. The average RBO measure using the top 5000 frequent terms observed in both the tweet and news collections in German and English is recalculated to assess the refinement. We observed an improvement of 24.4% and 21.2% for English and German respectively.

Now, tweets and news articles in Dataset-1 and Dataset-2 are represented as the tf-idf weighted average of transformed word embeddings. These representations are used as input to an SVM classifier 12 with default parameters to calculate accuracy, and to cosine similarity to compute the Pearson correlation. Furthermore, the top performing embedding dimensions are identified based on the Pearson correlation and accuracy measures using the validation data of the datasets. Figure 1 and Figure 2 show the comparison of results with ((T2N)TWE and (N2T)TWE) and without (Non-TWE) transformation on the different datasets.
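The document representation described above (a tf-idf weighted average of word embeddings, compared with cosine similarity) could look like the following sketch. The paper does not state which tf-idf variant was used, so the smoothed idf formula here is an assumption, as are the function names and the dictionary layout for the embeddings.

```python
from math import log, sqrt


def idf_table(tokenized_docs):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1 (assumed variant)."""
    n_docs = len(tokenized_docs)
    df = {}
    for doc in tokenized_docs:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    return {w: log((1 + n_docs) / (1 + c)) + 1.0 for w, c in df.items()}


def doc_vector(tokens, vectors, idf):
    """tf-idf weighted average of the embeddings of in-vocabulary tokens."""
    dim = len(next(iter(vectors.values())))
    total, weight_sum = [0.0] * dim, 0.0
    for word in set(tokens):
        if word not in vectors or word not in idf:
            continue  # out-of-vocabulary words are skipped
        weight = tokens.count(word) * idf[word]  # raw term frequency times idf
        weight_sum += weight
        for k, value in enumerate(vectors[word]):
            total[k] += weight * value
    if weight_sum == 0.0:
        return total  # all-zero vector when no token matched
    return [v / weight_sum for v in total]


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero norm."""
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
```

In this sketch a tweet and a news article would each be passed through `doc_vector` with their own (transformed) embedding table, and `cosine` of the two results gives the pairwise score that is later correlated with the human ratings.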
Once the top performing embedding dimensions are identified, the testing data is used to compare different approaches with diverse measures in Table 3 and Table 4.

10 https://github.com/gouwsmeister/bilbowa
11 http://www.statmt.org/europarl/
12 https://www.csie.ntu.edu.tw/~cjlin/libsvm/

Fig. 1. Effect of Embedding Dimensions (Dataset-1)

Fig. 2. Effect of Embedding Dimensions (Dataset-2)

Method                  Dim   r
German
No-Transformation       400   -0.1051
LDA-PTM [23]            100   0.0445
WTMF-G [3]              100   0.0498
(T2N)Monolingual-TWE    25    0.0607
(N2T)Monolingual-TWE    25    0.0601
English
No-Transformation       400   -0.1193
LDA-PTM [23]            100   0.0321
WTMF-G [3]              100   0.0491
(T2N)Monolingual-TWE    400   0.0605
(N2T)Monolingual-TWE    400   0.0599
Table 3. Monolingual Tweets and News Comparison

6.4 Cross-Lingual Comparison

For the cross-lingual comparison, we follow a similar procedure as in Section 6.3. Since the news word embeddings incorporate bilingual information from both German and English, calculating the RBO measure between tweets and news without transformation is not appropriate. Hence, we calculate the RBO measure after transformation to verify that it satisfies a minimum threshold of 0.328, which in general fetches satisfactory results [8]. To compare tweets and news from the dataset described in Section 5.1 across languages, we estimate the top performing embedding dimension based on the Pearson correlation measure using the validation data of the dataset. Figure 3 shows the comparison of results with (TWE) and without (Non-TWE) transformation. Once the top performing embedding dimension is identified, the testing data is used to compare different approaches, as provided in Table 5.
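The RBO checks mentioned in Sections 3, 6.3 and 6.4 compare, for each common word, its ranked nearest neighbors in the two embedding spaces. A minimal sketch of the measure is given below. It implements only the truncated sum of the RBO definition of Webber et al. [21] up to depth k, which is a lower bound on the full measure; the extrapolated variant from the same paper is not implemented, and whether the reported numbers use it is not stated in the text. The function names and the persistence parameter name `p` (written ϕ in Section 3) are assumptions.

```python
def rbo_truncated(ranking_a, ranking_b, p=0.9, k=100):
    """Truncated rank-biased overlap of two rankings without repeated items.

    Computes (1 - p) * sum_{d=1..depth} p**(d-1) * |A[:d] & B[:d]| / d,
    where depth is limited by k and by the shorter ranking.
    """
    depth = min(k, len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        agreement = len(seen_a & seen_b) / d   # overlap of the two top-d prefixes
        score += (p ** (d - 1)) * agreement
    return (1 - p) * score


def average_rbo(neighbors_a, neighbors_b, p=0.9, k=100):
    """Mean truncated RBO over the words present in both neighbor tables."""
    words = sorted(set(neighbors_a) & set(neighbors_b))
    if not words:
        return 0.0
    total = sum(rbo_truncated(neighbors_a[w], neighbors_b[w], p, k) for w in words)
    return total / len(words)
```

Because the sum is truncated, two identical rankings of length k score 1 - p**k rather than 1, so values from this sketch are not directly comparable to thresholds such as 0.328 unless the same variant is used on both sides.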

Method                      Dim   Accuracy
LDA-PTM [23]                100   79.1%
Boosting [4]                -     82.5%
(T2N)Monolingual-TWE+SVM    400   83.1%
(N2T)Monolingual-TWE+SVM    400   81.0%
Table 4. Accuracy (English)

Fig. 3. Effect of Embedding Dimensions (Cross-Lingual)

Method                    r
(E)Tweets - (G)News
LDA-PTM [23]              0.0821
(T2N)Cross-Lingual-TWE    0.1181
(N2T)Cross-Lingual-TWE    0.1018
(G)Tweets - (E)News
LDA-PTM [23]              0.0765
(T2N)Cross-Lingual-TWE    0.1073
(N2T)Cross-Lingual-TWE    0.1064
Table 5. Cross-Lingual Tweets and News Comparison With 100-Dimensions

7 Discussion

We start our analysis with the results in Table 3. It can be seen that Monolingual-TWE (either T2N or N2T) achieved a commendable improvement over the other approaches. However, the values of the Pearson correlation are low, which can be attributed to the fact that tweets and news are inherently very different and achieving a high level of pairwise similarity is a complex task. For the accuracy assessment, which is mostly seen from the perspective of a classification task, there is a clear improvement over other approaches when transformed embeddings are used as features. Table 4 shows that T2N achieved better performance than N2T. Although this analysis is based on a small dataset, the results show a promising direction for Monolingual-TWE, which can easily scale with the size of the common vocabulary across collections. This gives a possibility to improve or sustain the accuracy and Pearson correlation values on larger datasets.

Similar observations can be made about Cross-Lingual-TWE. Given the complexity associated with finding pairwise relevance between tweets and cross-language news, we compared only LDA based approaches with Cross-Lingual-TWE. It can be seen from Table 5 that T2N outperformed LDA-PTM with a notable improvement. Although the improvement may not be significant, these results are a preliminary examination meant to encourage research in this direction.

8 Conclusion and Future Work

In this paper, we focused on mapping tweets to monolingual and cross-lingual news by transforming their word embeddings closer to each other, thus bridging the lexical and word usage gap across collections. In the future, we aim to improve the quality of the results with more sophisticated approaches.

References

1. Kwak, H., Lee, C., Park, H., Moon, S.: What is twitter, a social network or a news media? In: Proceedings of WWW, ACM (2010) 591-600
2. Zhao, W.X., Jiang, J., Weng, J., He, J., Lim, E.P., Yan, H., Li, X.: Comparing twitter and traditional media using topic models. In: Advances in Information Retrieval, Springer Berlin Heidelberg (2011) 338-349
3. Guo, W., Li, H., Ji, H., Diab, M.T.: Linking tweets to news: A framework to enrich short text data in social media. In: Proceedings of ACL. (2013) 239-249
4. Krestel, R., Werkmeister, T., Wiradarma, T.P., Kasneci, G.: Tweet-recommender: Finding relevant tweets for news articles. In: Proceedings of WWW, ACM (2015) 53-54
5. Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of EMNLP. (2014) 1532-1543
6. Gouws, S., Bengio, Y., Corrado, G.: Bilbowa: Fast bilingual distributed representations without word alignments. arXiv preprint arXiv:1410.2455 (2014)
7. Wang, C., Mahadevan, S.: Manifold alignment using procrustes analysis. In: Proceedings of ICML, ACM (2008) 1120-1127
8. Tan, L., Zhang, H., Clarke, C.L., Smucker, M.D.: Lexical comparison between wikipedia and twitter corpora by using word embeddings. In: Proceedings of ACL. (2015)
9. Ritter, A., Etzioni, O., Clark, S.: Open domain event extraction from twitter. In: Proceedings of KDD. (2012) 1104-1112
10. Thelwall, M., Buckley, K., Paltoglou, G.: Sentiment in twitter events. Journal of the American Society for Information Science and Technology 62 (2011) 406-418
11. Paul, M.J., Dredze, M.: You are what you tweet: Analyzing twitter for public health. In: Proceedings of ICWSM. (2011) 265-272
12. Ratkiewicz, J., Conover, M., Meiss, M., Gonçalves, B., Flammini, A., Menczer, F.: Detecting and tracking political abuse in social media. In: Proceedings of ICWSM. (2011)
13. Crooks, A., Croitoru, A., Stefanidis, A., Radzikowski, J.: #earthquake: Twitter as a distributed sensor system. Transactions in GIS 17(1) (2013) 124-147
14. Abel, F., Gao, Q., Houben, G.J., Tao, K.: Analyzing user modeling on twitter for personalized news recommendations. In: Proceedings of UMAP. (2011) 1-12
15. Ou, J., Liu, N.N., Zhao, K., Yu, Y., Yang, Q.: Transferring topical knowledge from auxiliary long texts for short text clustering. In: Proceedings of CIKM, ACM (2011) 775-784
16. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. The Journal of Machine Learning Research 12 (2011) 2493-2537
17. Al-Rfou, R., Bryan, P., Steven, S.: Polyglot: Distributed word representations for multilingual nlp. In: Proceedings of CoNLL, ACL (2013) 183-192
18. Ramon, A.F., Amir, S., Lin, W., Silva, M., Trancoso, I.: Learning word representations from scarce and noisy data with embedding sub-spaces. In: Proceedings of ACL. (2015)
19. Kim, J., Rousseau, F., Vazirgiannis, M.: Convolutional sentence kernel from word embeddings for short text categorization. In: Proceedings of EMNLP. (2015)
20. der Maaten, L.V., Hinton, G.: Visualizing data using t-sne. The Journal of Machine Learning Research 9 (2008) 2579-2605
21. Webber, W., Moffat, A., Zobel, J.: A similarity measure for indefinite rankings. ACM Transactions on Information Systems (TOIS) 4 (2010)
22. Schönemann, P.H.: A generalized solution of the orthogonal procrustes problem. Psychometrika 31(1) (1966) 1-10
23. Mimno, D., Wallach, H.M., Naradowsky, J., Smith, D.A., McCallum, A.: Polylingual topic models. In: Proceedings of EMNLP, ACL (2009) 880-889