Using Synonyms for Author Recognition

Clifford Stevenson
6 years ago

Abstract. An approach for identifying authors using synonym sets is presented. Drawing on modern psycholinguistic research, we justify the basis of our theory. Having formally defined the operations needed to algorithmically determine authorship, we present the results of applying our method to a corpus of classic literature. We argue that this technique is both accurate as an author identification tool and applicable to other domains in computer science, such as speaker recognition.

1 Introduction and Motivation

Current research in the area of Stylometry focuses on identifying idiosyncrasies in written literature to identify the author. We present a novel approach for identifying authors in written text that is also applicable to identifying speakers from automatically transcribed discourse. In this paper, we argue that the choice of words made by an author or speaker when many synonyms are available is idiosyncratic to the point of providing identification. Modern psycholinguistic evidence indicates that this is a well-founded approach.

Though previous methods of author attribution have focused on linguistic elements peculiar to written text, the use of synonyms does not require such elements as punctuation to be effective and so may be applied to automatically transcribed speech as well. Such a flexible approach is ideal for situations such as a Smart Home environment, where requests made in natural language could originate from sources as varied as a personal computer and a mounted microphone. Traditional methods might require both a vocal analyzer and a text-specific analyzer to identify the source. When using synonyms, no such dependence on disparate technologies is incurred.

The ability to discern the source of a document or statement also has value in the area of knowledge acquisition. One goal of knowledge acquisition is to gain the most accurate knowledge possible. However, not all sources of information are equally reliable, and so such a system could be designed not to trust them equally. The first step toward this type of learning is the ability to correctly identify the source of a piece of knowledge (i.e. a document). The presented method of author recognition aims to be adaptable to all of these applications.

2 Related Work

Author attribution is a well-studied area of artificial intelligence. Formal methods for determining authorship have even older roots. The field of Stylometry has thrived since well before the turn of the twentieth century, with several documented methods of analyzing texts to settle disputed authorship. Linguistic idiosyncrasies that have been identified as characteristic of an author include everything from keyword counts to punctuation patterns.

2.1 Related Efforts in Stylometry

The classification of documents by content (as opposed to by author) has been one area in which similar techniques have been employed. Keywords have been shown by many studies to be very effective in this area. This idea has been taken further by those such as Paek [10], who used keywords in image descriptions to classify the content of the images.

In the field of Stylometry, several algorithmic approaches have been applied to author attribution. For instance, Brinegar [3] and Glover and Hirst [5] used distributions of word length statistics to determine authorship, while others, including Morton [9], used sentence length for identification. The number of distinct words in a set of documents was studied by Holmes [7] and Thisted [14].

2.2 Psycholinguistic Foundations

In the development of this method, consideration was given not only to the empirical correctness of its results, but also to the theoretical foundations on which it is built. Current psycholinguistic research suggests that synonyms are a part of language that is affected by one's environment. Developed by the Cognitive Science Laboratory at Princeton University, the WordNet lexical database is itself rooted in modern psycholinguistic theory [13]. It is organized based largely upon how humans are believed to understand words. It is no coincidence that synonym sets are at the heart of WordNet.
In Holland's 1991 study, when subjects were requested to think of as many synonyms as possible for a set of words, a small but significant priming effect was found [6]. That is to say, subjects were more likely to produce as synonyms words that they had previously seen. This suggests that the synonyms produced by an individual are a product of experience, something that is unique to that individual.

Let us digress a moment and further pursue the concept of priming. Associative priming, the process by which one concept leads to another (e.g. a dog might cause one to think about a bone), is a result of spreading activation [1]. The way in which activation potentials spread through a neural network depends on what connections have been made and how often they have been asserted. According to Hebbian brain theory, connections that fire together are more likely to fire together in the future. Thus, from the perspective of cognitive psychology, it makes sense that though many synonyms may be activated by a concept, the word at the forefront of one's mind will be based on experience. The uniqueness of this experience and the corresponding uniqueness of synonym choice may be exploited to determine authorship.

3 Theory

3.1 Definitions

As a simple formalization of this theory, let us begin by defining a set of authors $\alpha$, which have been encountered by our system before any identification is processed. We then define $\lambda_i$ as the lexicon corresponding to author $\alpha_i$, so that $|\lambda_i|$ is the vocabulary richness. Next, consider two functions which may be applied to a word $w$ in $\lambda$: $\mathrm{occ}(w)$ and $\mathrm{syn}(w)$, where $\mathrm{occ}(w)$ is the number of times word $w$ was encountered and $\mathrm{syn}(w)$ is the number of synonyms for $w$.

Consider a threshold $\theta$, which is the minimum number of synonyms that a word must have before being considered an idiosyncratic identifier of an author. Note that it is the use of a sufficiently large value of $\theta$ that provides a reasonable running time for our algorithm. Now we define the filtered lexicon $\sigma_i$ as author $\alpha_i$'s lexicon restricted to words having a sufficiently large set of synonyms, such that $\sigma_i \subseteq \lambda_i$, where each word $\sigma_{ij}$ has $\mathrm{syn}(\sigma_{ij}) \geq \theta$.

Having considered the task of learning each author's style, we can examine the case of identifying the work of an unknown author $\alpha_u$, where $\alpha_u \in \alpha$. The heart of the algorithm is in the intersection of the filtered lexicon of the unknown author with that of each known author:

$$\rho_i = \sigma_i \cap \sigma_u \qquad (1)$$

Here, there are some special considerations to be made as to exactly how the intersection of these two lexicons is to be calculated. Since we have also associated a number of occurrences $\mathrm{occ}(w)$ with each element of $\sigma$, we need to determine how to evaluate this function for an intersection. For our purposes, let the number of occurrences for a word used by both authors $i$ and $j$ be

$$\mathrm{occ}(\sigma_i \cap \sigma_j) = \min\big(\mathrm{occ}(\sigma_i),\ \mathrm{occ}(\sigma_j)\big) \qquad (2)$$

Finally, we calculate the match factor for each author:

$$\mu(i,u) = \sum_{j=1}^{|\rho_i|} \mathrm{occ}(\rho_{ij})\,\mathrm{syn}(\rho_{ij}) \qquad (3)$$

The hypothetical author $\alpha_i$ of the text corresponds to the maximum value of the match factor, where $n = |\alpha|$, such that

$$\mu(i,u) = \max\big(\mu(0,u),\ \mu(1,u),\ \ldots,\ \mu(i,u),\ \ldots,\ \mu(n,u)\big) \qquad (4)$$
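The definitions of Section 3.1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names (`filtered_lexicon`, `match_factor`, `identify`) and the `syn_count` lookup table are hypothetical, the synonym counts are supplied as a plain dictionary rather than queried from WordNet, and the terms of the match factor are read as the product of occ and syn for each shared word.

```python
from collections import Counter


def filtered_lexicon(tokens, syn_count, theta):
    """Build sigma: map each word with syn(w) >= theta to occ(w)."""
    occ = Counter(tokens)
    return {w: n for w, n in occ.items() if syn_count.get(w, 0) >= theta}


def match_factor(sigma_i, sigma_u, syn_count):
    """Compute mu(i,u) over the shared words rho = sigma_i & sigma_u.

    Each shared word contributes min(occ_i, occ_u) * syn(w); the min is the
    occurrence rule for an intersection, and the product is an assumed
    reading of the summand in the match factor.
    """
    rho = sigma_i.keys() & sigma_u.keys()
    return sum(min(sigma_i[w], sigma_u[w]) * syn_count[w] for w in rho)


def identify(known, sigma_u, syn_count):
    """Return the known author whose match factor with sigma_u is largest."""
    return max(known, key=lambda a: match_factor(known[a], sigma_u, syn_count))


# Toy data, invented purely for illustration.
syn_count = {"big": 5, "large": 5, "happy": 4, "glad": 4, "the": 0}
theta = 3
known = {
    "A": filtered_lexicon("the big big happy the".split(), syn_count, theta),
    "B": filtered_lexicon("the large glad glad large large".split(), syn_count, theta),
}
sigma_u = filtered_lexicon("the big happy happy large".split(), syn_count, theta)
best = identify(known, sigma_u, syn_count)
```

With this toy data, author A shares "big" and "happy" with the unknown text while author B shares only "large", so A receives the larger match factor.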

3.2 Tractability

Though at first glance calculating $\sigma_i \cap \sigma_u$ for every author in $\alpha$ may raise complexity concerns, we have already addressed the problem of tractability by means of $\theta$. Recall that $\sigma$ is a subset of $\lambda$ and, as we will see, can be a very small subset indeed if the threshold is set high enough. Even though it may seem that we are cutting many possible matches, our theory holds that the words removed are the less important ones: when choosing a word with $\mathrm{syn}(w) < \theta$, the author was constrained by the lexicon of the language, since few synonyms were available.

3.3 Parallelization

Because the various stages of training and, later, of discernment are independent of one another, this algorithm is an excellent candidate for parallelization (see Fig. 1). Each author and, in fact, each document that is added to the training set $\alpha$ can be evaluated separately, leaving the merging of the sets to a central dispatcher. Since the string operations necessary to calculate the statistics are far more processor-intensive than calculating the union of said sets, a very high rate of speedup is possible. However, true to Amdahl's law, the stage of identifying the work of an unknown author is dependent on having all training data prepared. Still, each intersection $\sigma_i \cap \sigma_u$ and the subsequent calculation of $\mu(i,u)$ are independent of one another, giving yet another opportunity for parallelization. In the end, it is reasonable to distribute the calculations associated with each individual author to a separate parallel process.

Fig. 1. An illustration of the major parts of the algorithm that may be run in parallel.
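The independence of the per-author match factor calculations described in Section 3.3 can be illustrated with Python's standard `concurrent.futures` module. This is a hedged sketch rather than the dispatcher the paper describes: `rank_authors` and its `workers` parameter are hypothetical names, the match factor is assumed to be a sum of min-occurrence times synonym count over shared words, and a thread pool is used only to keep the example simple. Under CPython, a process pool (`ProcessPoolExecutor`, which exposes the same `map` interface) would be the more likely choice for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def match_factor(sigma_i, sigma_u, syn_count):
    """Sum min(occ_i, occ_u) * syn(w) over words shared by both lexicons."""
    shared = sigma_i.keys() & sigma_u.keys()
    return sum(min(sigma_i[w], sigma_u[w]) * syn_count[w] for w in shared)


def rank_authors(known, sigma_u, syn_count, workers=4):
    """Score every known author concurrently and sort by match factor.

    Each author's score depends only on that author's filtered lexicon and
    on sigma_u, so the tasks can be handed to separate workers.
    """
    names = list(known)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(
            pool.map(lambda n: match_factor(known[n], sigma_u, syn_count), names)
        )
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)


# Toy filtered lexicons (word -> occurrences), invented for illustration.
syn_count = {"big": 5, "large": 5, "happy": 4, "glad": 4}
known = {"A": {"big": 2, "happy": 1}, "B": {"large": 3, "glad": 2}}
sigma_u = {"big": 1, "happy": 2, "large": 1}
ranking = rank_authors(known, sigma_u, syn_count)
```

The first entry of `ranking` is the hypothesized author; the remaining entries give the margin by which the other candidates were outscored.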

4 Implementation

4.1 WordNet

As our method requires the ability to determine the number of synonyms for a word, we chose to use WordNet^1 to accomplish this task. In development since 1985, WordNet is now one of the foremost lexical resources in computational linguistics. With over 118,000 word forms, it encompasses a substantial portion of the English language [13].

In WordNet, each word is linked to one or more senses. These senses, in turn, can reside in synonym sets. Thus, we can determine whether a word shares at least one of its senses with another word, making it, by definition, a synonym. Furthermore, this gives us the number of synonyms for a word: the number of unique members of the word's synonym sets.

Though unimportant from a theoretical standpoint, it should be noted that the version of WordNet used in this research was not the traditional C library, but rather a normalized database format for PostgreSQL^2. This allowed the number of synonyms for a word to be determined in the execution of a single SQL statement.

4.2 Corpus

The texts used to train the system consisted of works of classic literature by Charles Dickens and William Shakespeare. As these texts are in the public domain, they are freely available for review by the curious reader^3.

Table 1. Texts included in our experimental corpus. The corpus comprised 286,898 words in total.

Set                  Total Words   Included Texts
Dickens Train        65,157        Battle of Life, Chimes, To be Read at Dusk
Dickens Test         80,832        Cricket on the Hearth, Three Ghost Stories, A Christmas Carol
Shakespeare Train    73,863        Comedy of Errors, Hamlet, Romeo and Juliet
Shakespeare Test     67,046        Julius Caesar, Henry V, Macbeth

^1 WordNet 2.1 is available for download from the Princeton Cognitive Science Laboratory at
^2 WordNet SQL Builder for WordNet 2.1, the application used to generate a PostgreSQL database of WordNet, is available for download at
^3 The texts listed here are available from Project Gutenberg at
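The synonym count described in Section 4.1 can be sketched without WordNet itself. The example below is an assumption-laden illustration: the `syn` function and the toy `synsets` list are hypothetical stand-ins for the PostgreSQL query used in the paper, and the word itself is excluded from its own count, which the text does not state explicitly.

```python
def syn(word, synsets):
    """Count the unique members of every synonym set containing `word`.

    `synsets` is an iterable of sets of words, standing in for the
    synonym sets of a lexical database. The word itself is not counted
    (an assumption of this sketch).
    """
    members = set()
    for synset in synsets:
        if word in synset:
            members |= synset
    members.discard(word)
    return len(members)


# Toy synonym sets, invented for illustration; not taken from WordNet.
synsets = [
    {"big", "large"},
    {"big", "great", "large"},
    {"happy", "glad"},
]
counts = {w: syn(w, synsets) for w in ("big", "glad", "zebra")}
```

Here "large" appears in two of the sets containing "big" but is counted once, which is the "unique members" behaviour the section describes.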

The selected works of Dickens include the so-called Christmas Stories, which are A Christmas Carol, The Chimes, The Cricket on the Hearth, and The Battle of Life, all written within the same decade. Dickens's other texts are the short stories To be Read at Dusk and Three Ghost Stories. The writings of Shakespeare that were analyzed are some of his more famous comedies and tragedies, such as The Comedy of Errors, Hamlet, Romeo and Juliet, Julius Caesar, Henry V, and Macbeth. These works of Shakespeare span a 16-year period of his writing career.

In total, the corpus contained 286,898 words. The works of Shakespeare contributed 140,909 of the words, while Dickens's texts constituted 145,989 words of the total. Each author used a relatively large vocabulary, though still a very small portion of the English language as a whole. Shakespeare's lexicon had over 13,000 words, while Dickens's had in excess of 12,000.

The corpus was divided into four sections. First, the texts were grouped by author. Second, each author's texts were divided in half, resulting in groups of about 72,000±9% words. These sets were designated train and test: the train set was used to acquire $\lambda$, the characteristic lexicon of an author, and a match factor $\mu(i,u)$ was then calculated for each train-test pair $\sigma_{\mathrm{Train}} \cap \sigma_{\mathrm{Test}}$.

5 Results

5.1 Author Identification Results

For our test data, the results showed that there is a well-defined difference in the match factor of a correct pair and an incorrect pair (see Fig. 2 and Fig. 3). This trend is consistent through all tested values of the threshold $\theta$. In the cases of both Dickens and Shakespeare, the correct test set was matched with its corresponding training set.

Moving beyond the simple fact that the system managed to produce the correct answers, let us take note of the margin by which the correct answer was ranked above the others. The strongest case in the set was that of the Shakespeare training set versus the two test sets. With the difference between the matching and non-matching set being roughly 17%, there is little question that this test run was a success. In the case of the Dickens training set run against the test sets, the answer is less confident, with a 10% difference between the matching and non-matching values. However, upon further consideration, it makes sense that the authors should have a good deal of their vocabulary in common. Though the unique qualities of an author's style may be subtle, we still have the ability to detect these subtleties and exploit them to determine authorship.

Fig. 2. [Plot: match factor against threshold value for each train-test pair.] The match factor $\mu(i,u)$ correctly correlates each test set with the training set written by the same author. Thus, given a set known to be by a certain author, the system can discern one author from another.

Fig. 3. [Plot: percent difference in match factors against threshold value, for Shakespeare Train vs Two Test Sets and Dickens Train vs Two Test Sets.] By applying sufficiently large values of the synonym threshold $\theta$, the number of synonym matches, and therefore the computational overhead, is greatly reduced.

Fig. 4. [Plot: number of synonym matches against threshold value for each train-test pair.] By applying sufficiently large values of the synonym threshold $\theta$, the number of synonym matches, and therefore the computational overhead, is greatly reduced. Due to our particular interest in words with large numbers of synonyms, the cutting of words with small synonym sets does not negatively affect the algorithm's ability to distinguish authors from one another.

Fig. 5. [Plot: reduction in synonym matches against threshold value for each train-test pair.] Even moderate values of $\theta$ yield in excess of 90% reduction in the synonym matches between authors.

5.2 Threshold Performance Results

As expected, the cutting of words having few synonyms was successful in reducing the size of $\rho$ for each author. In the case of Shakespeare's works, there were over 11,900 unique words. Yet, after filtering out words via a threshold of 30, only 539 words had to be examined, giving us a 94% overall reduction in the size of $\rho$ (see Fig. 4 and Fig. 5). The fact that the threshold was so successful in identifying those words that were useful and discarding others is key in making this algorithm tractable for larger sets of authors. Even higher values of $\theta$ do not negatively impact the accuracy of discerning authorship.

6 Future Work

6.1 Multilingual Implementation

The next step in our work is to determine whether the method is successful in other languages. Since the only language-dependent part of our algorithm is WordNet, we simply need to query a lexical database for our target language. Currently, there is much work being done on the creation and automatic generation of WordNet databases for various languages. One such implementation is MEANING, a project using the web as a huge corpus in an effort to build a multilingual WordNet. Already, the project has produced a web-based interface to view their progress^4.

6.2 Integration with an Automatic Speech Transcriber

One key point of this algorithm is that it is not dependent upon the peculiarities of the written word. That is, the system is not hindered by the punctuation generated by an automatic speech transcription system. Provided the transcription system can accurately generate text from speech, the results of our method of author recognition should carry over to speaker recognition. One possible implementation of this involves Carnegie Mellon's Sphinx Live Decoder. Using Hidden Markov Models, this program has the ability to transcribe voice input to text on the fly [8]. This type of system would be ideal for situations in which input could come from disparate sources such as keyboards and microphones.

^4 The demo of MEANING project can be found at The web-based demo can be accessed from this page.

6.3 Integration into a Smart Home Environment

In a Smart Home environment, it is key to provide a truly natural experience for the home's inhabitants. Thus, it follows that the Smart Home should be able to take vocal requests and that it should be able to discern who is making these requests. This will allow the system's responses to be more appropriate. For example, a truly intelligent Smart Home would not be likely to comply with a three-year-old child's demand for ten dozen cookies. An author-speaker recognition system using the methods proposed here is the first step in making this scenario a reality.

7 Conclusions

Research using synonyms to recognize authors shows much promise for the future. Because the method is grounded in psycholinguistic theory, we can be confident that it has a solid foundation on which future work can be built. Furthermore, the fact that this method can be optimized using higher threshold values and distributed processing allows it to be used in situations where running time is a consideration. Having displayed its ability to accurately identify authors for this domain, we look forward to applying our new theory to more areas of usage.

References

1. Anderson, J.R. Cognitive Psychology and its Implications. New York: W.H. Freeman and Company. (1995)
2. Baayen, H., van Halteren, H., Neijt, A., Tweedie, F. An experiment in authorship attribution. JADT 2002: 6es Journées internationales d'Analyse statistique des Données Textuelles. (2002)
3. Brinegar, C. Mark Twain and the Quintus Curtius Snodgrass Letters: A Statistical Test of Authorship. Journal of the American Statistical Association, 58. (1963)
4. Chaski, C. Who's At the Keyboard? Recent Results in Authorship Attribution. International Journal of Digital Evidence. Spring (2005)
5. Glover, A. and Hirst, G. Detecting stylistic inconsistencies in collaborative writing. In Sharples, Mike and van der Geest, Thea (eds.), The new writing environment: Writers at work in a world of technology. London: Springer-Verlag. (1996)
6. Holland, Cynthia Rose. Does synonym priming exist on a word completion task? Doctoral Thesis, Case Western Reserve University, Psychology. (1992)
7. Holmes, D. Authorship Attribution. Computers and the Humanities, 28, Kluwer Academic Publishers, Netherlands. (1994)
8. Huang, X. et al. The Sphinx-II Speech Recognition System. Computer Speech and Language. (1993)
9. Morton, A. The Authorship of Greek Prose. Journal of the Royal Statistical Society (A), 128. (1965)
10. Paek, S. et al. Integration of Visual and Text-Based Approaches for the Content Labeling and Classification of Photographs. ACM SIGIR. (1999)
11. Reiter, E. and S. Sripada. Contextual Influences on Near-Synonym Choice. Proceedings of INLG-2004. (2004)
12. Kaster, A., Siersdorfer, S., Gerhard, W. Combining Text and Linguistic Document Representations for Authorship Attribution. SIGIR Workshop: Stylistic Analysis of Text for Information Access (STYLE), Salvador, Bahia, Brazil. (2005)
13. Miller, George A. WordNet: A Lexical Database for English. Communications of the ACM, Vol. 38, No. 11, November 1995. (1995)
14. Thisted, R. and Efron, B. Did Shakespeare Write a Newly-discovered Poem? Biometrika, 74. (1987)
15. Uzuner, Ö., Katz, B. A Comparative Study of Language Models for Book and Author Recognition. Lecture Notes in Computer Science, Volume 3651/2005. Springer-Verlag GmbH. (2005)
