LATENT SEMANTIC ANALYSIS FOR RUSSIAN LITERATURE INVESTIGATION


P. I. Nakov

Abstract. The paper presents the results of experiments on using Latent Semantic Analysis to analyse textual data. The method is explained in brief, with special attention to its potential for comparing and investigating Russian literary texts. Two hypotheses are tested: (1) texts by the same author are alike and can be distinguished from those by a different author; (2) prose and poetry can be discriminated automatically.

Key words: latent semantic analysis, statistical stylistics.

1. Introduction

One of the most interesting text features is text variance. There are always several ways to express the same thought, and authors are forced to choose between different syntactic constructions, synonyms and terminology according to the intended audience and the impact the text must produce. Furnas, Landauer, Gomez and Dumais have shown in [5] that people use the same words to describe the same subject 10-20% of the time. Authors make their choices according to both the specific intention of the text and their own subjective preferences. These preferences (denoted as style) are consistent (throughout a text or across all of the author's oeuvres) and easy for humans to discover, but very hard to describe and measure.

Researchers in statistical stylistics have concentrated on word-based statistics (word length, word length distribution, long word counts, type/token ratios, see [11]), text-based statistics (sentence length, clause complexity, see [7,10]) and statistics based on specific items (pronoun counts, presence/absence of contractions and amplifiers, relative frequency of specific verbs such as seem and appear, see [2,6]). We take a different approach: our purpose here is to study the possibility of using a classic semantic analysis method, without additional tuning, to automatically discover texts from the same oeuvre or author and to discriminate between prose and poetry.

2. Latent Semantic Analysis

Latent Semantic Analysis (LSA) is a powerful statistical technique for indexing, retrieval and analysis of textual information that has been used in different fields of human cognition during the last decade. The method is fully automatic and is based on the general idea that there exists a set of latent mutual dependencies between words and their contexts (phrases, paragraphs and texts). Identifying and properly treating these dependencies permits LSA to deal successfully with synonymy and partially with polysemy, which are the major problems of word-based approaches.

LSA starts with the construction of a term-by-document occurrence frequency matrix, which is then submitted to singular value decomposition (SVD). As a result, each term or document is associated with a vector of low dimensionality (e.g. 100). The proximity between two documents can be calculated as the dot product between their normalised vectors (see [1,3,8,9,12] for details).

3. Experiments

The experiments were performed on Russian literature texts that we collected on the Web from the sites [16,17,18,19,20]. The file contents were carefully inspected and all index and biographic files were removed. The remaining files were pre-processed: the HTML tags were discarded together with the headers, footers and editors' comments, leaving just the plain text.
We thus obtained the following oeuvres, grouped by author:

Mikhail Bulgakov: Master and Margarita (800 KB), A Country Doctor's Notebook (228 KB)
Nikolay Gogol: Dead Souls (514 KB), Taras Bulba (247 KB)
Ilf and Petrov: The Twelve Chairs (944 KB), The Little Golden Calf (670 KB)
Aleksandr Pushkin: The Captain's Daughter (222 KB), Eugene Onegin (174 KB), Poetry (619 KB, collection of poems)
Marina Tzvetaeva: Poetry (832 KB, collection of poems)
Nikolay Gumilyov: Poetry (205 KB, collection of poems)
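The LSA steps outlined in Section 2 (term-by-document matrix, SVD, cosine between normalised low-dimensional vectors) can be illustrated with a minimal sketch. It is not the author's implementation: the toy matrix, the function names and the choice to scale the document coordinates by the singular values are all assumptions made for illustration, and NumPy stands in for SVDPACKC [1].

```python
import numpy as np


def lsa_document_vectors(term_doc, k):
    """Return unit-length k-dimensional document vectors.

    term_doc has one row per term and one column per document.
    Scaling by the singular values is an assumption of this sketch.
    """
    _, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    doc_vectors = vt[:k].T * s[:k]          # one row per document
    norms = np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    return doc_vectors / norms


def cosine_matrix(unit_vectors):
    """Dot products of unit vectors, i.e. cosines for all document pairs."""
    return unit_vectors @ unit_vectors.T


# Hypothetical toy counts: documents 0-1 share vocabulary, as do 2-3;
# the last term occurs equally in all four documents.
term_doc = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
    [1, 1, 1, 1],
], dtype=float)

similarities = cosine_matrix(lsa_document_vectors(term_doc, k=2))
```

In this toy case the two documents that share vocabulary end up with a higher cosine than documents from different groups, which is the effect the experiments below rely on.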

Since LSA tries to capture the mutual dependencies between words and their contexts, it is crucial to provide contexts of reasonable size. Small documents are usually indexed as they are, since it is best to work on the whole document. That is clearly not the case here, so we split the documents into chunks of approximately 4 KB (the chunk size varies because we do not split sentences). There is also a disproportion between the sizes of the texts, both across oeuvres and across authors. To deal with this we kept at most 100 chunks per oeuvre. The oeuvres whose plain text was smaller than 400 KB yielded fewer than 100 chunks (see above). The poetry collections by the same author were first concatenated and then processed as a single oeuvre. This resulted in 867 chunks in total.

Common meaningless words were removed by means of a stop-word list of 195 words. The list was created by inspecting the most frequent words in the texts and keeping some of them. Using a standard stop-word list such as the one we found on the Web at [21] (492 stop-words) gave inferior results. Then the words occurring in just one document were removed, since they cannot contribute to the proximity, which reduced the total number of distinct non-stop word forms considered. After the frequency matrix X was built, we divided each row by its entropy and only then performed SVD [1,3,8,12]. No word stemming was performed, since different oeuvres or authors may prefer one particular word form to another.

Figure 1. Choice of dimensionality

A crucial step when using LSA is the correct choice of dimensionality. Figure 1 shows the top 100 singular values sorted in descending order. The curve drops steeply and then flattens. We have to cut the singular values around the place where the curve's behaviour changes.
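The preprocessing described above (sentence-preserving chunks of roughly 4 KB, at most 100 chunks per oeuvre, and division of each term row by its entropy) might look roughly like the sketch below. Several details are assumptions, since the paper does not give them: sentences are detected with a simple punctuation rule, size is measured in characters, the trailing partial chunk is kept, and the entropy is the natural-log entropy of a term's distribution over documents.

```python
import math
import re


def split_into_chunks(text, target_chars=4096, max_chunks=100):
    """Group whole sentences into chunks of at least target_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, size = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        current.append(sentence)
        size += len(sentence) + 1
        if size >= target_chars:
            chunks.append(" ".join(current))
            current, size = [], 0
            if len(chunks) == max_chunks:
                return chunks          # cap reached, drop the rest
    if current:
        chunks.append(" ".join(current))
    return chunks


def divide_rows_by_entropy(matrix):
    """Divide each term row of a frequency matrix by the row's entropy."""
    weighted = []
    for row in matrix:
        total = sum(row)
        probs = [count / total for count in row if count > 0]
        entropy = -sum(p * math.log(p) for p in probs)
        if entropy == 0:
            # Happens only for terms found in a single document,
            # which the text removes before this step.
            raise ValueError("term occurs in only one document")
        weighted.append([count / entropy for count in row])
    return weighted
```

Terms spread evenly over many chunks have high entropy and are scaled down, while terms concentrated in a few chunks keep more weight. Removing single-document words first also guarantees that no row has zero entropy.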
If we cut further we lose important information, and if we keep more values we start modelling the noise. Figure 1 shows that in our case this value has to be somewhere between 10 and 30. Nevertheless, as we will see later, lower dimensionalities are useful as well.

We performed several different space reductions: to spaces with dimensionalities of 100, 30, 25, 20, 15, 10, 7, 4, 3 and 2. For each of these we calculated the dot product between the normalised vectors (the cosine) for all document pairs. As the dimensionality is reduced, the vectors get closer and their cosines higher. Fig. 2 shows the corresponding correlation matrices in five shades for the five correlation intervals: 87.5-100%, black; 75-87.5%, dark grey; 62.5-75%, grey; 50-62.5%, light grey; 0-50%, white. Note that we arranged the texts in the matrices so that the chunks from the same oeuvre come one right after another. In the same way, the oeuvres by the same author come one after the other and, at the higher level, all prose comes before all poetry. This is possible since only one author (Pushkin) is represented by both prose and poetry.

Figure 2. Correlation matrices for dimensions: 100, 30, 25, 20, 15, 10, 7, 4, 3, 2.

We will start our investigation with the dimensionality of 25. Figure 3 shows the corresponding correlation matrix. The points representing the texts belonging to the same oeuvre are outlined by a dark line. When two or three consecutive oeuvres are written by the same author, they are grouped together by an additional dashed rectangle. As expected, the figure shows that chunks from the same oeuvre are much closer (semantically more similar) to each other than chunks from different oeuvres.

Let us examine the matrix in Fig. 3 in more detail, from the upper left towards the lower right corner. The first dark rectangle contains the chunks of Master and Margarita (100 chunks). Its content is sparse and the cluster is not clear enough. The next cluster, A Country Doctor's Notebook (57 chunks) by the same author, is much clearer. Then come two more clear clusters by Nikolay Gogol: Dead Souls (100 chunks) and Taras Bulba (61 chunks). Next we can see the two smooth clusters of The Twelve Chairs (100 chunks) and The Little Golden Calf (100 chunks) by Ilf and Petrov. The following cluster represents The Captain's Daughter (55 chunks) by Aleksandr Pushkin and is the last prose oeuvre.
The two clusters that follow immediately are Pushkin's Eugene Onegin (43 chunks) and his poetry collection (100 chunks). Then come the poetry of Marina Tzvetaeva (100 chunks) and that of Nikolay Gumilyov (51 chunks).
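The layout of the correlation matrix walked through above can be reproduced from the chunk counts given in the text. The sketch below derives the row and column range of each oeuvre, maps a cosine to one of the five shades, and averages a block of the matrix. The helper names are hypothetical, and the treatment of interval boundaries and of negative cosines (mapped to white) is an assumption, since the paper does not specify it.

```python
from itertools import accumulate

# (oeuvre, chunk count) in the order used for the matrices in the text.
OEUVRES = [
    ("Master and Margarita", 100),
    ("A Country Doctor's Notebook", 57),
    ("Dead Souls", 100),
    ("Taras Bulba", 61),
    ("The Twelve Chairs", 100),
    ("The Little Golden Calf", 100),
    ("The Captain's Daughter", 55),
    ("Eugene Onegin", 43),
    ("Pushkin, poetry", 100),
    ("Tzvetaeva, poetry", 100),
    ("Gumilyov, poetry", 51),
]

# Lower bound of each shade; anything below 0.5 is drawn white.
SHADES = [(0.875, "black"), (0.75, "dark grey"),
          (0.625, "grey"), (0.5, "light grey")]


def block_ranges(oeuvres):
    """Map each oeuvre to the range of matrix rows/columns it occupies."""
    ends = list(accumulate(count for _, count in oeuvres))
    starts = [0] + ends[:-1]
    return {title: range(start, end)
            for (title, _), start, end in zip(oeuvres, starts, ends)}


def shade(cosine):
    """Return the shade used for a cosine value."""
    for lower_bound, name in SHADES:
        if cosine >= lower_bound:
            return name
    return "white"


def block_mean(sim, rows, cols):
    """Average similarity over one rectangular block of the matrix."""
    values = [sim[i][j] for i in rows for j in cols]
    return sum(values) / len(values)


ranges = block_ranges(OEUVRES)
```

With a full cosine matrix `sim`, a call such as `block_mean(sim, ranges["Dead Souls"], ranges["Taras Bulba"])` would give a single number for how strongly two oeuvres by the same author are related, which is what the dashed rectangles show visually.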

Figure 3. Correlation matrix for the dimension of 25

It is interesting to see that in the prose zone there are several smooth out-of-cluster dependencies, due to the correlation between chunks by the same author. In the poetry zone (the last four clusters) these dependencies are much stronger for the only poet with two oeuvres, Pushkin. But there are other high-level dependencies, between the poetry of Marina Tzvetaeva and that of Nikolay Gumilyov.

Surprisingly, the poetry of Marina Tzvetaeva is not monolithic and reveals two internal clusters. Tzvetaeva's poetry chunks were arranged chronologically, the earliest poem included being written in 1906 and the last one in 1919. (In fact there were poems up to 1941, but they were dropped in order to keep just 100 chunks. Only the last poem in the last chunk was written in 1919.) We found that the second cluster starts somewhere at the beginning of 1914, which splits the poems into a period before 1914 and a period from 1914 on. Our investigation showed that the First World War (1914-1918) had a major impact on Tzvetaeva's poetry, and this was caught by LSA.

Figure 4. Correlation matrix for the dimension of 15

Reducing the dimensionality further, we see that the oeuvres by the same author tend to get closer (see Fig. 2). Thus the whole dashed rectangles start getting filled. This applies especially to Bulgakov and Ilf and Petrov, and partially to Gogol. As for Pushkin, his last two clusters (Eugene Onegin and the poetry collection) form a much clearer common cluster, while The Captain's Daughter is left out. We think this is because it is prose and thus differs significantly from the poetry. What is interesting here is that Tzvetaeva and Gumilyov form a common cluster. This means that, according to LSA, their poetry is semantically very similar.
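The effect of reducing the dimensionality can be illustrated on a toy example by comparing the mean cosine within a group (e.g. chunks by the same author) with the mean cosine between groups at several ranks. This is only a sketch under stated assumptions: the matrix and labels are invented, the helper names are hypothetical, and document vectors are scaled by the singular values before normalisation.

```python
import numpy as np


def cosine_at_rank(term_doc, k):
    """Cosine matrix of the documents after reduction to k dimensions."""
    _, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    docs = vt[:k].T * s[:k]
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return docs @ docs.T


def within_and_between(sim, labels):
    """Mean cosine for same-label pairs (diagonal excluded) and other pairs."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diagonal = ~np.eye(len(labels), dtype=bool)
    return sim[same & off_diagonal].mean(), sim[~same].mean()


# Hypothetical counts: two chunks by "author a", two by "author b".
term_doc = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
    [1, 1, 1, 1],
], dtype=float)
labels = ["a", "a", "b", "b"]

summary = {k: within_and_between(cosine_at_rank(term_doc, k), labels)
           for k in (4, 2, 1)}
```

In this toy case both averages grow as the rank drops, and at rank 1 every pair has cosine 1, so all structure is lost. This mirrors the observation that the cosines get higher and the clusters merge as the dimensionality is reduced.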

It is important to stress that the whole poetry part of the matrix is much darker than the prose part. Looking at the further reductions (Fig. 2), we can see that it forms a dark cluster at the dimensionality of 10 and becomes completely monolithic at 7. In our opinion this must be due to the major difference between prose and poetry: poetry is much more restrictive in the way authors can express themselves. The variance of prose is much richer, and its zone does not consolidate into a single dark cluster until the dimensionality is reduced even further.

4. Discussion

As we saw, the different dimensionality reductions reveal different kinds of correlation between the texts. The higher-dimensionality matrices show that texts from the same oeuvre are more alike and tend to form separate clusters. If the dimensionality is high enough, some internal clusters can be seen inside a single oeuvre (see Fig. 3, Marina Tzvetaeva's poetry). With further reduction the oeuvres by the same author lose their differences and each author tends to obtain his or her own cluster (in fact two clusters must be expected if the author is represented by both prose and poetry, as is the case for Pushkin). With still further reduction we obtain just two classes: prose and poetry. Figure 2 shows that this process is not so clear-cut and that for some of the oeuvres and authors the more general dependencies appear earlier than expected.

5. Conclusion

The experiments show that in general the selected Russian authors can be distinguished using LSA, although this seems to be hard for some of them (Gogol). On the other hand, the corpus gives strong support for the hypothesis that prose and poetry can be discriminated automatically. These results are consistent with our previous experiments on Bulgarian (see [15]), English and German texts.

6. Further work

Additional experiments on new corpora with new authors have to be performed in order to confirm the results obtained and to better understand the factors influencing text proximity when using LSA. An interesting possibility is to combine the semantic proximity with the traditional stylistic statistics used by [4,6,7,10,11].

7. References

[1] Berry M., Do T., O'Brien G., Krishna V., Varadhan S. SVDPACKC (Version 1.0) User's Guide. April 1993.
[2] Biber D. A typology of English Texts. Linguistics, 27.
[3] Deerwester S., Dumais S., Furnas G., Landauer T., Harshman R. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41 (1990).
[4] Diab M., Schuster J., Bock P. A Preliminary Statistical Investigation into the Impact of an N-Gram Analysis Approach Based on Word Syntactic Categories toward Text Author Classification. Proc. of the 6th International Conference on Artificial Intelligence Applications. Cairo, Egypt.
[5] Furnas G., Landauer T., Gomez L., Dumais S. Statistical semantics: Analysis of the Potential Performance of Keyword Information Systems. Bell Syst. Tech. J., 62, Number 6.
[6] Karlgren J., Cutting D. Recognizing Text Genres with Simple Metrics Using Discriminant Analysis. Proceedings of COLING 94, Kyoto.
[7] Klare G. The Measurement of Readability. Ames: Iowa University Press.
[8] Landauer T., Foltz P., Laham D. Introduction to Latent Semantic Analysis. Discourse Processes, 25.
[9] LSA, see [address missing]
[10] Lorge I. The Lorge Formula for Estimating Difficulty of Reading Materials. New York: Teachers College Press, Columbia University.
[11] Losee R. Text Windows and Phrases Differing by Discipline, Location in Document, and Syntactic Structure. Information Processing & Management, 32 (Nov).
[12] Nakov P. Getting Better Results with Latent Semantic Indexing. In Proceedings of the Students Presentations at ESSLLI-2000, Birmingham, UK, August.
[13] Nakov P. Web-personalisation using extended Boolean operations with Latent Semantic Indexing. Proc. AIMSA-2000, Varna, Bulgaria, Lecture Notes in Artificial Intelligence 1904, Springer, 2000.
[14] Nakov P. Latent Semantic Analysis of Textual Data. In Proceedings of CompSysTech 2000, Sofia, Bulgaria. June.
[15] Nakov P. Latent Semantic Analysis for Bulgarian literature. In Proceedings of Spring Conference of Bulgarian Mathematicians Union. Borovetz.
[16]-[21] [addresses missing]
