Semantic similarity and analysis of the word frequency dynamics

Journal of Physics: Conference Series. Paper, open access. To cite this article: V V Bochkarev et al 2017 J. Phys.: Conf. Ser. 936 012067

Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI. Published under licence by Ltd

V V Bochkarev, Yu S Maslennikova, A A Svetovidov
Kazan Federal University, Kremlevskaya str. 18, Kazan 420018, Russia
E-mail: svetovidov1994@gmail.com

Abstract. In this study, the similarity of frequency dynamics for semantically related words was analyzed using word statistics extracted from more than 4.5 million books written over a period of 205 years. The approach is based on correlation analysis of 1-gram frequency dynamics. We analyzed the frequency correlations of synonym pairs, their corresponding antonym groups, and random word pairs. We also compared several metrics to find the most effective one for assessing the degree of similarity in the dynamics of use of different words. Comparing logarithmic rank variations in pairs of synonyms and in random word pairs reveals significant differences, though they are smaller than might be expected.

1. Introduction

The evolution of written language is significantly influenced by a wide range of cultural and historical factors. Social and political events also influence the style and semantic context of written language. Furthermore, the choice of specific word forms in a text is determined by grammatical requirements. Over time, these external factors shape lexicon usage statistics: some word forms become more prevalent, while others fall out of use. It is therefore expected that concepts and word forms related in meaning will have similar usage frequency dynamics. However, obsolete concepts and archaisms may show usage frequency trends opposite to those of the concepts that replace them.

An increasing number of works are dedicated to the statistical analysis of the frequencies of semantically related words. For example, an approach called relative cosine similarity is proposed in [1]. Distributed word vectors were analyzed, and the cosine distance was used to estimate synonymic similarity for each part of speech separately. This approach significantly improved the accuracy of synonym recognition. A related study [2] analyzes the trajectory of synonym acquisition by non-native speakers learning English. It finds that the number of correctly selected synonyms increases as language proficiency grows, but even advanced learners have difficulty with synonyms in constructions where the context calls for one specific choice. The study [3] develops the theory of functional systems (politics, religion, etc.) and also analyzes frequency graphs for the words «money», «power» and «love». It notes that the frequency trends for the concepts «love» and «money» are rather close to each other in English. However, the authors analyzed these words without considering their synonyms or other words from the same semantic groups, so our research could supplement the results of that work. Recent results [4] show that semantically related words exhibit strong phase coherence. The authors note that the patterns found in language statistics may be a consequence of changes in the cultural framework that influence the thematic focus of writers.

The authors of [4] analyzed only a core vocabulary of 5,630 common English nouns, using only correlation metrics. The purpose of this work is to determine whether there is a correlation between the changes in usage frequency of semantically related words taken from pre-verified lists, and to identify the most informative measure of proximity between frequency series.

Our study is based on the 2012 version of the Google Ngram database [5]. We focus on the 1-gram data, which consist of frequency counts of single words for each year independently. The 2012 version of the database is annotated with part-of-speech tags [6], which allow the extraction of particular word classes. Additionally, the WordNet database [7] was used. WordNet is one of the largest lexical databases; it provides access to more than 140 000 words from four parts of speech (nouns, adjectives, verbs, adverbs) and to information about their semantic relationships. For our study, we considered groups of common nouns and adjectives.

2. Methods

For English, our starting data were the 1-gram counts from the year 1800 onward because, although the available data start at 1520, the data for the first 300 years are rather sparse. Using the part-of-speech tags included in the database [6], we extracted the 1-gram information for the noun and adjective classes. This procedure left a core vocabulary of 493 nouns and 584 adjectives. Figure 1 shows examples of normalized time series for a group of synonyms and antonyms.

Figure 1. Frequency dynamics of synonyms (on the left) and antonyms (on the right)

Figure 2. $L_\infty$-metric between random pairs of adjectives with different word frequencies

To obtain information about the degree of semantic similarity by comparing word-use data, correlation analysis was performed.
The following pairwise distances were used: Euclidean distance ($L_2$-metric), city block metric ($L_1$-metric), Chebyshev distance ($L_\infty$-metric), cosine distance, Pearson's correlation and Spearman's rank correlation. Correlation metrics and cosine distance reflect only changes in the shape of the series, whereas the first three metrics also depend on the absolute values of the frequencies. Since we are interested in the dynamics of word frequencies, each frequency time series was normalized by its mean over the considered period (1800-2005). We also compared logarithmic rank variations, i.e. logarithms of the position (rank) of a word in the frequency-ordered word list, as in [4].

To estimate the significance of the distances obtained between semantically related words, distances between random words (reference pairs) were also computed. Reference pairs were chosen from the same part of speech, because several papers demonstrate steady differences in usage frequency trends, for example between nouns and adjectives, which should be taken into account [6]. Average usage frequencies were also considered when choosing pairs of random words. It is well known that the frequency time series of rare words are extremely noisy. As shown in [8], the relative magnitude of word frequency variations has a power-law dependence on frequency: $\sigma_f / f \sim f^{-0.316}$. As a result, the distances obtained between the frequency series of rare words are larger. Figure 2 shows how the average Chebyshev distance varies with the logarithm of the usage frequency of random adjective pairs. Therefore, the reference pairs selected for our research have word frequencies similar to those of the analyzed synonym and antonym groups.

3. Results

As a first step, similarities between the normalized frequency dynamics of synonyms and of reference pairs were estimated. However, the distribution of distances within groups of synonyms turned out to be almost the same as the distribution of distances between random pairs of words. More significant differences were obtained by comparing the logarithmic rank variations, as in [4].
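The mean normalization and the six pairwise measures listed in Section 2 can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code used in the study: the function names (`normalize_by_mean`, `compare_series`, and so on) are hypothetical, each series is assumed to be a list of yearly relative frequencies of equal length, and neither series may be constant (the correlations are undefined in that case).

```python
import math

def normalize_by_mean(series):
    """Divide a yearly frequency series by its mean over the period."""
    mean = sum(series) / len(series)
    return [x / mean for x in series]

def minkowski_distances(a, b):
    """City block (L1), Euclidean (L2) and Chebyshev (L-infinity) distances."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    return {
        "L1": sum(diffs),
        "L2": math.sqrt(sum(d * d for d in diffs)),
        "Linf": max(diffs),
    }

def cosine_distance(a, b):
    """One minus the cosine of the angle between the two series."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def pearson(a, b):
    """Pearson's correlation coefficient (assumes non-constant series)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(a, b):
    """Spearman's rank correlation: Pearson's correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

def compare_series(freq_a, freq_b):
    """All six measures for two mean-normalized frequency series."""
    a = normalize_by_mean(freq_a)
    b = normalize_by_mean(freq_b)
    measures = minkowski_distances(a, b)
    measures["cosine"] = cosine_distance(a, b)
    measures["pearson"] = pearson(a, b)
    measures["spearman"] = spearman(a, b)
    return measures
```

After normalization, two series that differ only by a constant factor, such as `[1, 2, 3]` and `[2, 4, 6]`, are at distance zero under all three L-metrics, which is the reason for dividing by the mean before comparing words of different overall frequency.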
For each pair of synonyms (1457 pairs for adjectives and 1484 for nouns), we selected 40 random pairs of words of the same part of speech with sufficiently close frequencies. Then the logarithm of the ratio of the distance in reference pairs to the distance in pairs of synonyms was calculated. Figure 3 shows the distributions of this logarithmic ratio for the $L_1$-metric, for adjectives and nouns separately. Positive values of the logarithmic ratio mean that the distance between synonyms is smaller than the distance in reference pairs. The observed asymmetry of both distributions (with a significant prevalence of positive values) indicates that semantically related words are closer to each other than random reference pairs are.

Figure 3. Distributions of the logarithmic ratio of distances between reference pairs and pairs of synonyms for adjectives (on the left) and nouns (on the right)

To estimate the significance of the effect, we tested the null hypothesis that the samples of reference pairs and of synonym pairs are homogeneous. According to the sign test, the p-value was $6.84 \times 10^{-191}$ for adjectives and $3.48 \times 10^{-203}$ for nouns. Thus, it can be concluded that the effect is significant.

To compare the different pairwise distance metrics, we estimated the accuracy with which each recognizes semantically related words. To do this, we compared the distance in a synonym pair, $D_{syn}$, with the distances in its reference pairs, $D_{ref}$, and counted the number of cases in which $D_{syn} < D_{ref}$. If the analyzed pair consists of semantically unrelated words, this count can take any value from 0 to $N$, where $N$ is the number of reference pairs. If the analyzed pair consists of semantically related words, the count will, with high probability, be larger, as follows from Figure 3. In our study, the decision threshold was chosen so that the errors of the first and second kind are equal. The results for adjectives and nouns are shown in Figure 4. The $L_1$-metric is the most accurate, with the smallest errors: 0.29 for adjectives and 0.32 for nouns.

Figure 4. Errors of semantically related words recognition for different pairwise distance metrics

4. Conclusion

Comparing the distances between series of normalized frequencies of semantically related words with the similarly obtained distances for random word pairs, no significant differences are observed. In contrast, comparing increment series of ranks in pairs of synonyms and in random word pairs reveals significant differences, though they are smaller than might be expected. Apparently, a kind of competition between words, and word substitution within groups of synonyms, plays an important role in the dynamics of language use. It should also be taken into account that some words show, on average, no significant change in frequency of use over the last two centuries (for example, the word unbending in Figure 1). When assessing the degree of similarity in the dynamics of use of different words, the most effective metrics are $L_1$ and $L_2$.

This work was supported by the Russian Foundation for Basic Research, Grant no. 15-06-07402. The research of the first author was supported by the Russian Government Program of Competitive Growth of Kazan Federal University.

References

[1] Leeuwenberg A, Vela M, Dehdari J, van Genabith J 2016 A Minimally Supervised Approach for Synonym Extraction with Word Embeddings The Prague Bulletin of Mathematical Linguistics 111-142
[2] Liu D, Zhong S 2016 L2 vs. L1 use of synonymy: An empirical study of synonym use/acquisition Applied Linguistics 37 (2) 239-261
[3] Roth S et al. 2016 The Fashionable Functions Reloaded: An Updated Google Ngram View of Trends in Functional Differentiation (1800-2000) In: Mesquita A (Ed.) Research Paradigms and Contemporary Perspectives on Human-Technology Interaction (Hershey: IGI-Global)
[4] Montemurro M A, Zanette D H 2016 Coherent oscillations in word-use data from 1700 to 2008 Palgrave Communications
[5] Michel J, Shen Y, Aiden A et al. 2011 Quantitative Analysis of Culture Using Millions of Digitized Books Science 331 (6014) 176-182
[6] Lin Y, Michel J-B, Aiden E L, Orwant J, Brockman W and Petrov S 2012 Syntactic annotations for the Google Books Ngram Corpus Proceedings of the ACL
[7] Fellbaum C 1998 WordNet: An Electronic Lexical Database (Cambridge, MA: MIT Press)
[8] Bochkarev V, Solovyev V, Wichmann S 2014 Universals versus historical contingencies in lexical evolution J. R. Soc. Interface 11 20140841
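As an illustration of the comparison procedure in Section 3, the sketch below computes the logarithmic distance ratios, a sign test on them, and the count-based recognition rule with an equal-error threshold. It is a hedged reconstruction in standard-library Python, not the authors' code. The function names are hypothetical. The text does not say which variant of the sign test was used or how the 40 reference distances enter the ratio, so an exact two-sided test and one ratio per reference pair are assumed here.

```python
import math

def log_ratios(d_syn, d_refs):
    """log(D_ref / D_syn) per reference pair; positive means the synonyms are closer."""
    return [math.log(d_ref / d_syn) for d_ref in d_refs]

def sign_test_p_value(values):
    """Exact two-sided sign test of a zero median (zero values are dropped)."""
    positive = sum(1 for v in values if v > 0)
    negative = sum(1 for v in values if v < 0)
    n = positive + negative
    k = min(positive, negative)
    # Probability of k or fewer minority signs under a fair coin, doubled.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def count_closer(d_syn, d_refs):
    """Number of reference pairs that are farther apart than the analyzed pair."""
    return sum(1 for d_ref in d_refs if d_syn < d_ref)

def equal_error_threshold(related_counts, unrelated_counts, n_refs):
    """Threshold on the count where the two error rates are closest to equal.

    A pair is called related when its count is >= threshold. Returns the
    threshold, the miss rate on related pairs and the false-alarm rate on
    unrelated pairs.
    """
    best = None
    for threshold in range(n_refs + 2):
        miss = sum(c < threshold for c in related_counts) / len(related_counts)
        false_alarm = sum(c >= threshold for c in unrelated_counts) / len(unrelated_counts)
        gap = abs(miss - false_alarm)
        if best is None or gap < best[0]:
            best = (gap, threshold, miss, false_alarm)
    return best[1], best[2], best[3]
```

With 40 reference pairs per analyzed pair, `count_closer` returns a value between 0 and 40. `equal_error_threshold` then needs such counts for known synonym pairs and for known unrelated pairs in order to place the decision boundary.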