Entropy Rate Constancy in Text

Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 199-206.

Entropy Rate Constancy in Text

Dmitriy Genzel and Eugene Charniak
Brown Laboratory for Linguistic Information Processing
Department of Computer Science
Brown University
Providence, RI, USA, 02912
{dg,ec}@cs.brown.edu

Abstract

We present a constancy rate principle governing language generation. We show that this principle implies that local measures of entropy (ignoring context) should increase with the sentence number. We demonstrate that this is indeed the case by measuring entropy in three different ways. We also show that this effect has both lexical (which words are used) and non-lexical (how the words are used) causes.

1 Introduction

It is well known from Information Theory that the most efficient way to send information through noisy channels is at a constant rate. If humans try to communicate in the most efficient way, then they must obey this principle. The communication medium we examine in this paper is text, and we present some evidence that this principle holds here.

Entropy is a measure of information first proposed by Shannon (1948). Informally, the entropy of a random variable is proportional to the difficulty of correctly guessing the value of this variable (when the distribution is known). Entropy is highest when all values are equally probable, and is lowest (equal to 0) when one of the choices has probability 1, i.e. is deterministically known in advance.

In this paper we are concerned with the entropy of English as exhibited through written text, though these results can easily be extended to speech as well. The random variable we deal with is therefore a unit of text (a word, for our purposes [1]) that a random person who has produced all the previous words in the text stream is likely to produce next. We have as many random variables as we have words in a text.
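As a small illustration of the informal description of entropy in the Introduction, the following Python sketch computes the Shannon entropy (in bits) of three toy distributions. The function name `entropy` and the example distributions are illustrative assumptions and are not taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution.

    Terms with zero probability contribute nothing, following the
    usual convention 0 * log(0) = 0.
    """
    return sum(-p * math.log2(p) for p in probs if p > 0)


# Hypothetical distributions over four possible "next words".
uniform = [0.25, 0.25, 0.25, 0.25]   # all values equally probable
skewed = [0.7, 0.1, 0.1, 0.1]        # one value is much more likely
certain = [1.0, 0.0, 0.0, 0.0]       # value known in advance

for name, dist in [("uniform", uniform), ("skewed", skewed), ("certain", certain)]:
    print(name, entropy(dist))
```

The uniform case gives the largest value for four outcomes, and the deterministic case gives zero, matching the two extremes described in the text.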
The distributions of these per-word random variables are obviously different and depend on all previous words produced. We claim, however, that the entropy of these random variables is on average the same [2].

[1] It may seem like an arbitrary choice, but a word is a natural unit of length: after all, when one is asked to give the length of an essay, one typically chooses the number of words as a measure.

[2] Strictly speaking, we want the cross-entropy between all words in sentence number $n$ and the true model of English to be the same for all $n$.

2 Related Work

There has been work in the speech community inspired by this constancy rate principle. In speech, distortion of the audio signal is an extra source of uncertainty, and this principle can be applied in the following way: a given word in one speech context might be common, while in another context it might be rare. To keep the entropy rate constant over time, it would be necessary to take more time (i.e., pronounce more carefully) in less common situations. Aylett (1999) shows that this is indeed the case.

It has also been suggested that the principle of constant entropy rate agrees with biological evidence of how human language processing has evolved (Plotkin and Nowak, 2000).

Kontoyiannis (1996) also reports results on 5 consecutive blocks of characters from the works of Jane Austen which are in agreement with our principle and, in particular, with its corollary as derived in the following section.

3 Problem Formulation

Let $\{X_i\}, i = 1 \ldots n$ be a sequence of random variables, with $X_i$ corresponding to word $w_i$ in the corpus. Let us consider $i$ to be fixed. The random variable we are interested in is $Y_i$, a random variable that has the same distribution as $X_i \mid X_1 = w_1, \ldots, X_{i-1} = w_{i-1}$ for some fixed words $w_1 \ldots w_{i-1}$.

For each word $w_i$ there will be some word $w_j$ ($j \le i$) which is the starting word of the sentence $w_i$ belongs to. We will combine the random variables $X_1 \ldots X_{i-1}$ into two sets. The first, which we call $C_i$ (for context), contains $X_1$ through $X_{j-1}$, i.e. all the words from the preceding sentences. The remaining set, which we call $L_i$ (for local), contains the words $X_j$ through $X_{i-1}$. Both $L_i$ and $C_i$ could be empty sets. We can now write our variable $Y_i$ as $X_i \mid C_i, L_i$.

Our claim is that the entropy of $Y_i$, $H(Y_i)$, stays constant for all $i$. By the definition of relative mutual information between $X_i$ and $C_i$,

$$H(Y_i) = H(X_i \mid C_i, L_i) = H(X_i \mid L_i) - I(X_i; C_i \mid L_i)$$

where the last term is the mutual information between the word and the context given the sentence. As $i$ increases, so does the set $C_i$. $L_i$, on the other hand, increases until we reach the end of the sentence, and then becomes small again. Intuitively, we expect the mutual information at, say, word $k$ of each sentence (where $L_i$ has the same size for all $i$) to increase as the sentence number increases. Since $H(X_i \mid L_i) = H(Y_i) + I(X_i; C_i \mid L_i)$ and, by our hypothesis, $H(Y_i)$ is constant, we then expect $H(X_i \mid L_i)$ to increase with the sentence number as well.

Current techniques are not very good at estimating $H(Y_i)$, because we do not have a very good model of context, since this model must be mostly semantic in nature. We have shown, however, that if we can instead estimate $H(X_i \mid L_i)$ and show that it increases with the sentence number, we will provide evidence to support the constancy rate principle.
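The decomposition of the entropy of a word into a local term minus a context term can be checked numerically. The Python sketch below builds an arbitrary toy joint distribution over three binary variables standing in for a word X, a context C and a local history L, and confirms that H(X | C, L) equals H(X | L) minus the conditional mutual information between X and C given L. The variable layout, the helper names and the random toy distribution are all assumptions made for illustration; the paper works with word sequences, not binary variables.

```python
import itertools
import math
import random


def marginal(joint, indices):
    """Marginal distribution over the given coordinate indices."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in indices)
        out[key] = out.get(key, 0.0) + p
    return out


def conditional_entropy(joint, target, given):
    """H(target | given) in bits; target and given are index tuples."""
    p_tg = marginal(joint, target + given)
    p_g = marginal(joint, given)
    return sum(-p * math.log2(p / p_g[key[len(target):]])
               for key, p in p_tg.items() if p > 0)


def conditional_mutual_information(joint):
    """I(X; C | L) in bits for outcomes ordered as (x, c, l)."""
    p_l = marginal(joint, (2,))
    p_xl = marginal(joint, (0, 2))
    p_cl = marginal(joint, (1, 2))
    total = 0.0
    for (x, c, l), p in joint.items():
        if p > 0:
            total += p * math.log2(p * p_l[(l,)] / (p_xl[(x, l)] * p_cl[(c, l)]))
    return total


# Arbitrary toy joint distribution over (x, c, l), each binary.
rng = random.Random(0)
weights = {o: rng.random() for o in itertools.product((0, 1), repeat=3)}
norm = sum(weights.values())
joint = {o: w / norm for o, w in weights.items()}

h_x_given_l = conditional_entropy(joint, (0,), (2,))
h_x_given_cl = conditional_entropy(joint, (0,), (1, 2))
cmi = conditional_mutual_information(joint)

# The two sides of the identity agree up to floating-point error.
print(h_x_given_cl, h_x_given_l - cmi)
```

Because the mutual information term is never negative, the check also shows why ignoring the context can only raise the measured entropy, which is the basis of the argument in this section.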
The quantity $H(X_i \mid L_i)$ is much easier to estimate, because it involves only words from the beginning of the sentence whose relationship is largely local and can be successfully captured through something as simple as an n-gram model.

We are only interested in the mean value of $H(X_j \mid L_j)$ for $w_j \in S_i$, where $S_i$ is the $i$th sentence. This number is equal to $\frac{1}{|S_i|} H(S_i)$, which reduces the problem to that of estimating the entropy of a sentence.

We use three different ways to estimate the entropy:

- Estimate $H(S_i)$ using an n-gram probabilistic model.
- Estimate $H(S_i)$ using a probabilistic model induced by a statistical parser.
- Estimate $H(X_i)$ directly, using a non-parametric estimator. We estimate the entropy for the beginning of each sentence. This approach estimates $H(X_i)$, not $H(X_i \mid L_i)$, i.e. it ignores not only the context, but also the local syntactic information.

4 Results

4.1 N-gram

N-gram models make the simplifying assumption that the current word depends on a constant number of the preceding words (we use three). The probability model for sentence $S$ thus looks as follows:

$$P(S) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_2 w_1) \prod_{i=4}^{n} P(w_i \mid w_{i-1} w_{i-2} w_{i-3})$$

To estimate the entropy of the sentence $S$, we compute $-\log P(S)$. This is in fact an estimate of the cross-entropy between our model and the true distribution. Thus we are overestimating the entropy, but if we assume that the overestimation error is more or less uniform, we should still see our estimate increase as the sentence number increases.

Penn Treebank corpus (Marcus et al., 1993) sections 0-20 were used for training, and sections 21-24 for testing. Each article was treated as a separate text, results for each sentence number were grouped together, and the mean value is reported in Figure 1 (dashed line). Since most articles are short, there are fewer sentences available for larger sentence numbers, so results for large sentence numbers are less reliable.

The trend is fairly obvious, especially for small sentence numbers: sentences (with no context used) get harder as the sentence number increases, i.e. the probability of the sentence given the model decreases.

4.2 Parser Model

We also computed the log-likelihood of the sentence using the statistical parser described in Charniak (2001) [3]. The probability model for sentence $S$ with parse tree $T$ is (roughly):

$$P(S) = \prod_{x \in T} P(x \mid \mathrm{parents}(x))$$

where $\mathrm{parents}(x)$ are the words which are parents of node $x$ in the tree $T$. This model takes into account syntactic information present in the sentence which the previous model does not. The entropy estimate is again $-\log P(S)$. Overall, these estimates are lower (closer to the true entropy) in this model because the model is closer to the true probability distribution.

The same corpus, training and testing sets were used. The results are reported in Figure 1 (solid line). The estimates are lower (better), but follow the same trend as the n-gram estimates.

4.3 Non-parametric Estimator

Finally, we compute the entropy using the estimator described in Kontoyiannis et al. (1998). The estimation is done as follows. Let $T$ be our training corpus. Let $S = \{w_1 \ldots w_n\}$ be the test sentence. We find the largest $k \le n$ such that the sequence of words $w_1 \ldots w_k$ occurs in $T$. Then $\frac{\log |S|}{k}$ is an estimate of the entropy at the word $w_1$. We compute such estimates for many first sentences, second sentences, etc., and take the average.
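The match-length step of the non-parametric estimator above can be sketched in Python as follows. This is a naive quadratic scan written for clarity, not the authors' implementation; the function name `longest_prefix_match` and the toy corpus are assumptions made for illustration. The sketch only computes k, the length of the longest sentence prefix that occurs in the training text, and the entropy estimate is then formed from k as described in the text.

```python
def longest_prefix_match(training, sentence):
    """Return the largest k such that sentence[:k] occurs contiguously
    somewhere in training (both are lists of word tokens)."""
    best = 0
    n = len(training)
    for start in range(n):
        k = 0
        # Extend the match from this start position as far as it goes.
        while (k < len(sentence) and start + k < n
               and training[start + k] == sentence[k]):
            k += 1
        best = max(best, k)
    return best


# Hypothetical toy data, tokenized by whitespace.
training = ("the terms were not disclosed . "
            "the company said the terms were fair .").split()
sentence = "the terms were fair to both sides".split()

k = longest_prefix_match(training, sentence)
print(k)
```

A longer match means the sentence opening is easier to predict from the training text, so the resulting entropy estimate is lower. For a corpus of millions of words, a suffix array or similar index would be a more practical choice than this linear scan per sentence.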
[3] This parser does not proceed in a strictly left-to-right fashion, but this is not very important since we estimate entropy for the whole sentence, rather than for individual words.

For this experiment we used 3 million words of the Wall Street Journal (year 1988) as the training set and 23 million words (full year 1987) as the testing set [4]. The results are shown in Figure 2. They demonstrate the expected behavior, except for the strong abnormality on the second sentence. This abnormality is probably corpus-specific. For example, 1.5% of the second sentences in this corpus start with the words "the terms were not disclosed", which makes such sentences easy to predict and decreases entropy.

4.4 Causes of Entropy Increase

We have shown that the entropy of a sentence (taken without context) tends to increase with the sentence number. We now examine the causes of this effect.

These causes may be split into two categories: lexical (which words are used) and non-lexical (how the words are used). If the effects are entirely lexical, we would expect the per-word entropy of the closed-class words not to increase with sentence number, since presumably the same set of words gets used in each sentence.

For this experiment we use our n-gram estimator as described in Section 4.1. We evaluate the per-word entropy for nouns, verbs, determiners, and prepositions. The results are given in Figure 3 (solid lines). The results indicate that the entropy of the closed-class words increases with sentence number, which presumably means that non-lexical effects (e.g. usage) are present.

We also want to check for the presence of lexical effects. It has been shown by Kuhn and De Mori (1990) that lexical effects can be easily captured by caching. In its simplest form, caching involves keeping track of the words occurring in the previous sentences and assigning to each word $w$ a caching probability $P_c(w) = \frac{C(w)}{\sum_{w'} C(w')}$, where $C(w)$ is the number of times $w$ occurs in the previous sentences.
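The caching probability defined above is a relative frequency over the words seen in the preceding sentences of the same text. A minimal Python sketch, assuming whitespace-tokenized sentences and using the hypothetical helper name `cache_probabilities`, could look like this:

```python
from collections import Counter


def cache_probabilities(previous_sentences):
    """Map each word w seen so far to C(w) / sum of all counts,
    where C(w) counts occurrences in the previous sentences."""
    counts = Counter(w for sentence in previous_sentences for w in sentence)
    total = sum(counts.values())
    if total == 0:
        return {}  # no previous sentences, so the cache is empty
    return {w: c / total for w, c in counts.items()}


# Hypothetical preceding sentences of one article.
previous = [
    ["the", "bank", "raised", "rates"],
    ["the", "bank", "said"],
]
p_c = cache_probabilities(previous)
print(p_c["bank"], p_c["said"])
```

Words that have not appeared earlier in the text receive no cache probability in this sketch, so in practice the cache has to be combined with a regular language model rather than used alone.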
This caching probability is then mixed with the regular probability (in our case, a smoothed trigram) as follows:

$$P_{mixed}(w) = (1 - \lambda) P_{ngram}(w) + \lambda P_c(w)$$

where $\lambda$ was picked to be 0.1. This new probability model is known to have lower entropy. More complex caching techniques are possible (Goodman, 2001), but are not necessary for this experiment.

Thus, if lexical effects are present, we expect the model that uses caching to provide lower entropy estimates. The results are given in Figure 3 (dashed lines). We can see that caching gives a significant improvement for nouns and a small one for verbs, and gives no improvement for the closed-class parts of speech. This shows that lexical effects are present for the open-class parts of speech and (as we assumed in the previous experiment) are absent for the closed-class parts of speech. Since we have proven the presence of the non-lexical effects in the previous experiment, we can see that both lexical and non-lexical effects are present.

[4] This is not the same training set as the one used in the two previous experiments. For this experiment we needed a larger, but similar, data set.

[Figure 1: N-gram and parser estimates of entropy (in bits per word). Horizontal axis: sentence number; vertical axis: entropy estimate; curves: parser, n-gram.]

[Figure 2: Non-parametric estimate of entropy. Horizontal axis: sentence number; vertical axis: entropy estimate.]

5 Conclusion and Future Work

We have proposed a fundamental principle of language generation, namely the entropy rate constancy principle. We have shown that the entropy of sentences taken without context increases with the sentence number, which is in agreement with the above principle. We have also examined the causes of this increase and shown that they are both lexical (primarily for open-class parts of speech) and non-lexical.

These results are interesting in their own right, and may have practical implications as well. In particular, they suggest that language modeling may be a fruitful way to approach issues of contextual influence in text.

Of course, to some degree language-modeling caching work has always recognized this, but this is a rather crude use of context and does not address the issues which one normally thinks of when talking about context. We have seen, however, that entropy measurements can pick up much more subtle influences, as evidenced by the results for determiners and prepositions, where we see no caching influence at all but nevertheless observe increasing entropy as a function of sentence number.
This suggests that such measurements may be able to pick up more obviously semantic contextual influences than simply the repeating words captured by caching models. For example, sentences will differ in how much useful contextual information they carry. Are there useful generalizations to be made? E.g., might the previous sentence always be the most useful, or, perhaps, for newspaper articles, the first sentence? Can these measurements detect such already established contextual relations as the given-new distinction? What about other pragmatic relations? All of these deserve further study.

[Figure 3: Comparing Parts of Speech. Four panels: Nouns, Verbs, Prepositions, Determiners; curves in each panel: normal, caching.]

6 Acknowledgments

We would like to acknowledge the members of the Brown Laboratory for Linguistic Information Processing and particularly Mark Johnson for many useful discussions. Also thanks to Daniel Jurafsky, who early on suggested the interpretation of our data that we present here. This research has been supported in part by NSF grants IIS 0085940, IIS 0112435, and DGE 9870676.

References

M. P. Aylett. 1999. Stochastic suprasegmentals: Relationships between redundancy, prosodic structure and syllabic duration. In Proceedings of ICPhS 99, San Francisco.

E. Charniak. 2001. A maximum-entropy-inspired parser. In Proceedings of ACL 2001, Toulouse.

J. T. Goodman. 2001. A bit of progress in language modeling. Computer Speech and Language, 15:403-434.

I. Kontoyiannis, P. H. Algoet, Yu. M. Suhov, and A. J. Wyner. 1998. Nonparametric entropy estimation for stationary processes and random fields, with applications to English text. IEEE Trans. Inform. Theory, 44:1319-1327, May.

I. Kontoyiannis. 1996. The complexity and entropy of literary styles. NSF Technical Report No. 97, Department of Statistics, Stanford University, June. [unpublished, can be found at the author's web page].

R. Kuhn and R. De Mori. 1990. A cache-based natural language model for speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(6):570-583.

M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn treebank. Computational Linguistics, 19:313-330.

J. B. Plotkin and M. A. Nowak. 2000. Language evolution and information theory. Journal of Theoretical Biology, pages 147-159.

C. E. Shannon. 1948. A mathematical theory of communication. The Bell System Technical Journal, 27:379-423, 623-656, July, October.