Exploration of Full-Text Databases with Self-organizing Maps

Timo Honkela, Samuel Kaski, Krista Lagus, and Teuvo Kohonen
Helsinki University of Technology, Neural Networks Research Centre
Rakentajanaukio 2 C, FIN-02150 Espoo, Finland
Timo.Honkela@hut.fi

ABSTRACT

The availability of large full-text document collections in electronic form has created a need for intelligent information retrieval techniques. In particular, the expanding World Wide Web calls for methods for the systematic exploration of miscellaneous document collections. In this paper we introduce a new method, the WEBSOM, for this task. Self-Organizing Maps (SOMs) are used to represent the documents on a map that provides an insightful view of the text collection. This view visualizes similarity relations between the documents, and the display can be utilized for orderly exploration of the material rather than having to rely on traditional search expressions. The complete WEBSOM method involves a two-level SOM architecture comprising a word category map and a document map, together with means for interactive exploration of the database.

1. Introduction

Full-text classification may be based on the assumption that the elementary textual features of documents that deal with similar topics are statistically similar. Consider tentatively that the documents (text files) were simply described by their word histograms. A Self-Organizing Map (SOM) of the documents could then easily be computed using these histograms as input vectors. However, it has turned out to be a more effective encoding scheme to first construct the statistics of word categories, because this increases generality: the general nature of a text is not altered much if its words are replaced by their synonyms or by other words of the same category. The categorial histograms may then be used as inputs to the SOM mentioned above. The so-called semantic SOM [8] is able, effectively and completely automatically, to cluster words into grammatical and semantic categories on the basis of their short contexts, i.e., the frames of neighboring words in which they occur.

The document-searching method WEBSOM introduced in this paper likewise first extracts a great number of elementary contextual features from the text files and maps them onto an ordered two-dimensional SOM array. In the second stage, histograms of the contexts accumulated on the first array are further mapped into points on a second map, whereby similar files become mapped close to each other. This order facilitates easy browsing and searching of related files. The WEBSOM applies to arbitrary full-text files: no descriptors are needed. In this work we have used it to organize 20 selected Usenet newsgroups on the Internet (the term "newsgroup" has already become established, although "discussion group" would in most cases be more accurate), from which we have picked in total 10,000 discussions, or approximately 3,000,000 words. Systematic organization of such an amount of colloquial text is clearly a very hard task, but the WEBSOM seems to do it effectively.

The WEBSOM has two possible modes of operation: unsupervised and supervised. In the former, clustering of arbitrary text files is done by a conventional two-level Self-Organizing Map, whereby no class information about the documents is given; classification is based purely on the analysis of the raw texts. In the supervised mode of operation, separation of the document groups is enhanced if auxiliary class information, for instance the name of the newsgroup, can be given. In this work we use a partly supervised mode. The overall architecture of the WEBSOM method thus consists of two levels: the word category map and the document map.
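For concreteness, the incremental SOM adaptation that both map levels rely on can be sketched as follows. This is a minimal NumPy illustration rather than the authors' implementation; the grid size, initialization, and decay schedules are placeholder choices.

```python
import numpy as np

def train_som(data, rows=15, cols=21, epochs=10, lr0=0.5, radius0=None, seed=0):
    """Minimal Self-Organizing Map: fit a rows x cols grid of model vectors
    to `data`, an (n_samples, dim) array. Returns the (rows, cols, dim) codebook."""
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    codebook = 0.1 * rng.standard_normal((rows, cols, dim))
    # Grid coordinates of the units, used by the neighborhood function.
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)
    radius0 = radius0 if radius0 is not None else max(rows, cols) / 2.0
    n_steps, step = epochs * len(data), 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            frac = step / n_steps
            lr = lr0 * (0.01 / lr0) ** frac             # decaying learning rate
            radius = radius0 * (1.0 / radius0) ** frac  # shrinking neighborhood
            # Winner search: the unit whose model vector is closest to the input.
            dists = np.linalg.norm(codebook - x, axis=-1)
            winner = np.unravel_index(np.argmin(dists), dists.shape)
            # Gaussian neighborhood around the winner, measured on the map grid.
            d2 = ((grid - np.array(winner)) ** 2).sum(axis=-1)
            h = np.exp(-d2 / (2.0 * radius ** 2))
            codebook += lr * h[..., None] * (x - codebook)
            step += 1
    return codebook
```

In the WEBSOM setting, the first map would be trained on word context vectors and the second on encoded document vectors, as described below.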

2. The Word Category Map

2.1. Preprocessing of Text

Documents on the Internet contain plenty of details that may be only remotely connected with the topic, for example ASCII drawings and automatically included signatures. These were first removed using heuristic rules. The articles ("a", "an", "the") were also removed, and numerical expressions as well as special codes were treated with heuristic rules.

The documents may also contain plenty of words that occur only a few times in the whole database. Their contribution to the formation of the SOM would be highly erratic. For this reason, and also to reduce the computational load significantly, all words occurring less than 50 times were represented by a "don't care" symbol and neglected in further computation.

2.2. Formation of the Word Category Map

Several articles have been published on SOMs that map words into grammatical and semantic categories [1], [7], [8], [9], [14]. Below we only delineate the basic idea of the semantic SOM. Consider that all words in a text corpus are concatenated into a single symbol string, from which strange symbols and delimiters (such as punctuation marks) have been eliminated, and in which rare words have been denoted by the "don't care" symbol as mentioned above. In the vocabulary of all occurring words, each word is represented by an n-dimensional real vector x_i with random-number components. (In our present experiments we had x_i ∈ R^90.) We used the so-called averaging method, whereby we scanned the string of words and, for each different word encountered, formed its average context vector over the text corpus. If the code vector in word position i is x_i, then its context vector reads

    X(i) = [ E{x_{i-1} | x_i} ;  ε x_i ;  E{x_{i+1} | x_i} ],

where E denotes the estimate of the expectation value evaluated over the corpus, and ε is a small scalar (e.g., ε = 0.2). The X(i) ∈ R^270 now constitute the input vectors to the category map. The training set consists of all the X(i) with different x_i. After the training process, calibration of the SOM was made by inputting the X(i) once again and labeling the best-matching units according to the x_i parts of the X(i). In this method a unit may become labeled by a multitude of symbols. This kind of multiple labeling will be needed in the fast mapping discussed in Sec. 3.2.
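The averaging method can be sketched as follows. This is illustrative NumPy code under the assumptions stated in the text (90-dimensional random code vectors, ε = 0.2, rare words replaced by a "don't care" symbol); the function and symbol names are our own, not the authors'.

```python
import numpy as np
from collections import defaultdict

def context_vectors(tokens, dim=90, eps=0.2, min_count=50, seed=0):
    """Average context vectors X(i) = [E{x_prev|x_i}; eps*x_i; E{x_next|x_i}].

    tokens: the whole corpus as one list of words, delimiters removed.
    Returns a dict mapping each frequent word to its 3*dim context vector."""
    rng = np.random.default_rng(seed)
    counts = defaultdict(int)
    for w in tokens:
        counts[w] += 1
    # Words occurring fewer than min_count times become the "don't care" symbol.
    tokens = [w if counts[w] >= min_count else "<dontcare>" for w in tokens]
    # A random real code vector x_i for every distinct symbol.
    codes = {w: rng.standard_normal(dim) for w in set(tokens)}
    sums_prev = defaultdict(lambda: np.zeros(dim))
    sums_next = defaultdict(lambda: np.zeros(dim))
    occ = defaultdict(int)
    for i in range(1, len(tokens) - 1):
        w = tokens[i]
        if w == "<dontcare>":   # rare words serve as context only
            continue
        occ[w] += 1
        sums_prev[w] += codes[tokens[i - 1]]
        sums_next[w] += codes[tokens[i + 1]]
    return {w: np.concatenate([sums_prev[w] / n, eps * codes[w], sums_next[w] / n])
            for w, n in occ.items()}
```

Training the word category map then amounts to running a SOM such as the one sketched above on the stacked X(i) vectors, followed by the labeling step described in the text.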

3. The Document Map

3.1. Previous Work on Document Searching with the SOM

A small document map based on the titles of scientific documents has been presented by Lin et al. [4]. Scholtes has applied the SOM extensively to natural language processing (see e.g. [12], [13]). For information retrieval he has developed a neural filter along with a neural interest map that is related to the present principle [10], [11], [12], [14]. Although the neural interest map and the WEBSOM display look similar, Scholtes does not, for instance, apply a two-level SOM such as the one introduced in the current paper. Merkl [5], [6] has used the SOM to cluster textual descriptions of software library components.

3.2. Speeding up Word Category Histogram Computation, and Blurring of the Histograms

One severe problem in organizing documents is the vast number of computations needed to analyze and map large text files. When we scan the text files to form their word category histograms on the word category map, we should actually record where each word triplet, corresponding to a 270-component real vector, is mapped, and collect these incidents into a histogram. A full winner search would be necessary at each step. We have found that a sufficiently accurate and about 85,000 times faster method is to locate the images of the source text words directly on the calibrated word category map by tabular search using hash coding (cf. [2]). To locate the words by tabular search we need the multitude of labels mentioned in Sec. 2.2.

As will be indicated in the results (Sec. 4), one unit or node of the word category map mostly contains contextually related words. We have found experimentally that the invariance of classification with respect to contextual features is further increased if we blur the histograms on the word category map by convolution. Such blurring is a commonplace method in pattern recognition, and it is justifiable here because the map is ordered. When the map consisted of 15 by 21 units, we used as the convolution kernel a Gaussian with a full width at half maximum of seven map spacings.

3.3. Weighting of Words by Their Entropy

As mentioned above, we form histograms of the texts on the word category map, to be used after blurring as inputs to the document map. Experience has shown that class separation is greatly improved if we apply entropy weighting to the words before forming the histograms. Denote by n_i(w) the frequency of occurrence of word w in group i (i = 1, ..., 20), and let n(w) = Σ_i n_i(w). The entropy of this word is defined as

    E(w) = - Σ_{i=1}^{20} (n_i(w) / n(w)) ln( n_i(w) / n(w) ),

and the weight W(w) of word w is defined to be

    W(w) = E_max - E(w),

where E_max = ln 20. This kind of entropy-based weighting of the words is straightforward when the documents can be naturally and easily divided into groups, such as the different newsgroups. If no natural division exists, word entropies might still be computed over individual documents (which must then be large enough for sufficient statistical accuracy) or over clusters of documents.

3.4. Supervised Document Map

If the affiliation of the texts with particular newsgroups is known a priori, another measure that greatly improves class separation is to use the supervised-SOM principle [2], [3]. If H(i) is the histogram vector, then for the actual input vectors to the document map we used data vectors of the form

    D(i) = [ H(i) ; U(i) ],

where U(i) is a unit vector with 20 components whose nonzero value is in the position corresponding to the newsgroup. In reality we used a U(i) vector whose norm was 0.2 times the norm of the H(i) part, whereby the method is only partly supervised. The U(i) part introduces a bias into the vector space that is the same for documents of the same group but different between different groups, and thus increases the separation of the groups. (With a smaller norm of U(i) there would be more overlap between the newsgroups, whereby relevant articles from different groups might be better attracted to the same cluster.) Vectors constructed in the same way were used for both training and testing.
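The document-encoding pipeline of Secs. 3.2-3.4 (table lookup of word locations, entropy weighting, blurring by convolution, and the partly supervised unit-vector part) can be sketched as follows. This is illustrative NumPy code, not the authors' implementation: `word_to_unit` stands in for the hash-coded table from words to their map locations, and sigma ≈ 3 map spacings approximates the Gaussian FWHM of seven spacings mentioned above (FWHM ≈ 2.355 σ).

```python
import numpy as np

def entropy_weights(group_counts, n_groups=20):
    """W(w) = E_max - E(w) as in Sec. 3.3; group_counts maps each word w
    to a length-20 array of its occurrence counts n_i(w) per newsgroup."""
    e_max = np.log(n_groups)
    weights = {}
    for w, n in group_counts.items():
        p = n / n.sum()
        p = p[p > 0]                     # 0 * ln 0 is taken as 0
        weights[w] = e_max + (p * np.log(p)).sum()
    return weights

def encode_document(words, word_to_unit, weights, map_shape=(15, 21),
                    sigma=3.0, group=None, n_groups=20, sup_norm=0.2):
    """Blurred, entropy-weighted histogram of one document over the word
    category map, optionally concatenated with the group part U(i)."""
    hist = np.zeros(map_shape)
    for w in words:
        if w in word_to_unit:            # fast table lookup, no winner search
            r, c = word_to_unit[w]
            hist[r, c] += weights.get(w, 0.0)
    # Blur by convolution with a Gaussian kernel over the ordered map.
    rr, cc = np.indices(map_shape)
    blurred = np.zeros(map_shape)
    for r in range(map_shape[0]):
        for c in range(map_shape[1]):
            kern = np.exp(-((rr - r) ** 2 + (cc - c) ** 2) / (2 * sigma ** 2))
            blurred[r, c] = (kern * hist).sum()
    vec = blurred.ravel()
    if group is not None:                # partly supervised mode (Sec. 3.4)
        u = np.zeros(n_groups)
        u[group] = sup_norm * np.linalg.norm(vec)   # ||U|| = 0.2 ||H||
        vec = np.concatenate([vec, u])
    return vec
```

A library routine such as scipy.ndimage.gaussian_filter could replace the explicit convolution loop; it is written out here only to make the blurring step explicit.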

4. Results

The word category map, although it uses unsupervised learning only, clusters similar words together. For instance, one unit on the map has been found to contain the words cd, television, video, screen, radio, tv, cd-rom, tape, net, internet, and web.

[Fig. 1: Distribution of the documents of the different newsgroups on the document map (of size 24 x 32 units). Each small display depicts the locations of the articles of one of the 20 newsgroups (comp.ai.alife, comp.ai.fuzzy, comp.ai.games, comp.ai.genetic, comp.ai.neural-nets, comp.ai.philosophy, rec.arts.movies.current-films, rec.arts.movies.past-films, rec.humor, rec.music.bluenote, rec.music.bluenote.blues, rec.music.classical, rec.music.progressive, sci.cognitive, sci.lang, sci.philosophy.meta, sci.philosophy.tech, ...) on the same document map. The units are displayed as small circles whose shade indicates the number of articles the unit contains (black: a large number; white: none).]

The document map was formed in a partly supervised SOM learning process on a massively parallel CNAPS neurocomputer, whereby the norm of the unit-vector part U(i) of the input vectors was 0.2. Exploration of the map can be carried out on an ordinary workstation. On the document map formed in the partly supervised SOM learning process, nearby locations contain hierarchically similar documents. This property can be utilized as a powerful tool, a "road map", for exploring the database, as will be described in Sec. 5. As the supervised SOM enhances separability, the documents of each newsgroup are predominantly centered around one or a few clusters (Fig. 1). Different newsgroups most often occupy separate areas, but since the topics of different groups overlap, their areas on the map also overlap partly. The groups that have overlapping or neighboring areas in Fig. 1 discuss related topics.

Examples of such groups are the two philosophy groups shown in the last two displays. On the other hand, the segregation of the discussions of the groups bluenote and bluenote.blues was probably due to the supervised learning, which enforces a dissimilarity between the defined groups. Another possible explanation is that these groups may have different discussants or discussion styles. There is also important clustering within the groups: even a single node mostly contains related articles, as will be demonstrated in Fig. 2.

5. Scenario for Interactive Exploration

The document collection may be explored by moving on a document map, using a hypertext interface viewable in the WWW (World Wide Web). When exploring a document map that may represent tens of thousands of documents, it is necessary both to form a general idea of the locations of the different subject areas and to be able to explore an interesting area in more detail. This is made possible by zooming the document map display. The most general view of the map should contain some important landmarks (e.g., pre-selected representative documents) to give an overall idea of the document landscape available for further exploration. On a closer look (i.e., on a deeper zoom level), more details of the selected area are revealed. A sketch of such a configuration is presented in Fig. 2, where a point-and-click interface is used for moving between and across the different zoom levels, as well as for reaching the documents mapped to each node.

[Fig. 2: A sample scene of the WEBSOM interface as seen in a WWW browser. (a) Part of a zoomed document map display, in which the clusters are depicted in shades of gray: light shades indicate a high degree of clustering, whereas dark implies empty space between the clusters. Two newsgroup articles, landmarks that have been automatically considered representative of the corresponding map areas (based on a close resemblance between their encoded form and the model vector in the corresponding map location), are marked on the display to aid in navigating on the map. (b) A map node (7,9) brought into view by clicking near the upper landmark, listing comp.ai.neural-nets articles from June 1995, e.g. "Re: Inverse of Functions, and Back-propagation" (Bill Armstrong), "NetLife and ZooLife update" (Christopher G. Burch), "Test-reports of NN-Softwaretools" (Juergen Braun), and "Re: Unsupervised Hebbian learning" (George M. Georgiou). Interesting articles can be read by clicking on their subjects. Only a part of the discussions in this node is shown.]

6. Discussion

A new method for organizing and exploring large document collections has been presented. The essential new features of the method, called the WEBSOM, are the following:

1. The overall system architecture is based on a combination of two Self-Organizing Maps, namely a category map of words and a document map.

2. The words were weighted according to their entropy over the document classes (the Usenet newsgroups in the experiments) when encoding the documents as histograms on the word category map.

3. The histograms were blurred by convolution in order to improve the invariance of the categorial features.

4. The underlying learning method, the SOM itself, is inherently unsupervised, enabling maximally general application possibilities. In addition, available class information, e.g., the newsgroup in the experiments, can be utilized in the document map, resulting in a supervised or partly supervised SOM mode. This improves the separation of the newsgroups.

In preliminary experiments, all of the above new features of the present method, as discussed in Secs. 2.1, 2.2, 3.2, 3.3, and 3.4, were found important for achieving the results reported here. Naturally, many details of the method could still be refined. For instance, the method scales up so well that much larger document maps could be used. Then, however, some more effective computing methods must be developed in order to keep the system operating close to real time, as it already does at present. It is also a subject of further investigation whether it would be more justified to calibrate the category map directly by the words, i.e., using test vectors of the form [∅; x_i; ∅], where ∅ denotes "don't care" in the comparison of vector components.

Announcement: The WEBSOM demo is available on the Internet at the address http://websom.hut.fi/websom/

References

[1] T. Honkela, V. Pulkki, and T. Kohonen, "Contextual relations of words in Grimm tales analyzed by self-organizing map," in Proceedings of the International Conference on Artificial Neural Networks, ICANN-95 (F. Fogelman-Soulié and P. Gallinari, eds.), vol. 2, (Paris), pp. 3-7, EC2 et Cie, 1995.

[2] T. Kohonen, Self-Organizing Maps. Berlin, Heidelberg: Springer, 1995.

[3] T. Kohonen, K. Mäkisara, and T. Saramäki, "Phonotopic maps - insightful representation of phonological features for speech recognition," in Proc. 7ICPR, Int. Conf. on Pattern Recognition, (Los Alamitos, CA), pp. 182-185, IEEE Computer Soc. Press, 1984.

[4] X. Lin, D. Soergel, and G. Marchionini, "A self-organizing semantic map for information retrieval," in Proc. 14th Ann. Int. ACM/SIGIR Conf. on R & D in Information Retrieval, pp. 262-269, 1991.

[5] D. Merkl, "Structuring software for reuse - the case of self-organizing maps," in Proc. IJCNN-93-Nagoya, Int. Joint Conf. on Neural Networks, vol. III, (Piscataway, NJ), pp. 2468-2471, IEEE Service Center, 1993.

[6] D. Merkl and A. M. Tjoa, "The representation of semantic similarity between documents by using maps: Application of an artificial neural network to organize software libraries," in Proc. FID 94, General Assembly Conf. and Congress of the Int. Federation for Information and Documentation, 1994.

[7] R. Miikkulainen, Subsymbolic Natural Language Processing: An Integrated Model of Scripts, Lexicon, and Memory. Cambridge, MA: MIT Press, 1993.

[8] H. Ritter and T. Kohonen, "Self-organizing semantic maps," Biological Cybernetics, vol. 61, no. 4, pp. 241-254, 1989.

[9] H. Ritter and T. Kohonen, "Learning semantotopic maps from context," in Proc. IJCNN-90-WASH-DC, Int. Joint Conf. on Neural Networks, vol. I, (Hillsdale, NJ), pp. 23-26, Lawrence Erlbaum, 1990.

[10] J. C. Scholtes, "Kohonen feature maps in full-text data bases: A case study of the 1987 Pravda," in Proc. Informatiewetenschap 1991, (Nijmegen, Netherlands), pp. 203-220, STINFON, 1991.

[11] J. C. Scholtes, "Unsupervised learning and the information retrieval problem," in Proc. IJCNN 91, Int. Joint Conf. on Neural Networks, (Piscataway, NJ), pp. 95-100, IEEE Service Center, 1991.
[12] J. C. Scholtes, "Neural nets for free-text information filtering," in Proc. 3rd Australian Conf. on Neural Nets, Canberra, Australia, February 3-5, 1992.

[13] J. C. Scholtes, "Resolving linguistic ambiguities with a neural data-oriented parsing (DOP) system," in Artificial Neural Networks, 2 (I. Aleksander and J. Taylor, eds.), vol. II, (Amsterdam, Netherlands), pp. 1347-1350, North-Holland, 1992.

[14] J. C. Scholtes, Neural Networks in Natural Language Processing and Information Retrieval. PhD thesis, Universiteit van Amsterdam, Amsterdam, Netherlands, 1993.