
Centrality Measures of Sentences in an English-Japanese Parallel Corpus

Masanori Oya
Mejiro University
m.oya@mejiro.ac.jp

Abstract

This study introduces a directed acyclic graph representation of the typed dependencies among words in a sentence, and proposes a method for calculating the degree centrality and closeness centrality of a typed-dependency directed acyclic graph. The method is applied to sentences in a section of an English-Japanese parallel corpus, and the differences in the results are discussed, along with suggestions for further study.

Key Words: Dependency grammar, Typed-dependency directed acyclic graphs, Graph centralities, English-Japanese parallel corpus

1. Introduction

In [1], Oya proposed that graph centrality measures [2, 3] can be applied to typed-dependency trees of sentences. Accordingly, these measures have been applied to English sentences from different genres of texts [4, 5, 6, 7] and to English-Japanese translation pairs in a small-scale corpus [5, 8, 9]. In [10], an English-Japanese parallel corpus larger than those used in [5, 8, 9] is used. This study follows these studies to explore the possibility of automatic numerical analysis of the syntactic dependency structure of sentences in an English-Japanese parallel corpus, in terms of the centrality measures of the typed dependencies among words in the sentences.

The aim of this study is to represent the difference in structural settings between two languages based not on speakers' subjective intuition, but on objective, numerical measures. Such objective measures of the structural difference between these two languages can be applied for various purposes, such as language teaching and machine translation.

The remainder of this paper is organized as follows. Section 2 summarizes the theoretical background of this study: dependency grammar and the graph centrality measures (degree centrality and closeness centrality), as well as the interpretation of the two centrality measures. Section 3 reports the analysis of the English-Japanese parallel corpus in terms of the centrality measures of its sentences. Section 4 discusses the results of the analysis and the possibility of further study of the centrality measures of typed-dependency trees, and Section 5 concludes this study.

2. Theoretical background

2.1. Dependency grammar

Dependency grammar is a set of syntactic theories that focus on the dependency relationships among words in a sentence. Since it was first proposed in [11], dependency grammar frameworks have been developed by a number of researchers, e.g., Extensible Dependency Grammar [12, 13], Word Grammar [14], and Stanford dependencies [15, 16, 17].

2.2. Typed-dependency directed acyclic graphs

The dependency relationships among the words in a sentence can be represented by a typed-dependency directed acyclic graph (DAG) [1, 15, 16, 17]. For example, the sentence "I am studying dependency grammar" can be represented as follows:

[Figure 1. Typed-Dependency Directed Acyclic Graph for "I am studying dependency grammar"; its arcs are studying -NSUBJ-> I, studying -AUX-> am, studying -DOBJ-> grammar, and grammar -NN-> dependency]

In Figure 1, each word is represented as one node. The dependency relationship between two words is represented by an arc with a label, and the direction of the arc represents the direction of the dependency: an arc starts from the head node and ends at the tail node. For example, the word "studying" is the head of the words "I", "am", and "grammar", and the labels of these dependencies are nsubj, aux, and dobj, respectively.
The dependency relationships among the words in a sentence are acyclic, meaning that no path from any node in the graph leads back to that same node.
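Acyclicity is easy to verify mechanically. The following is a minimal Python sketch (illustrative only, not part of the study's toolchain): it represents the arcs of Figure 1 as head-dependent pairs and searches depth-first for an arc leading back to a node on the current path.

```python
def is_acyclic(edges):
    """Return True if the directed graph given as (head, dependent) arcs
    contains no cycle, using depth-first search with node coloring."""
    adj = {}
    for head, dep in edges:
        adj.setdefault(head, []).append(dep)
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited / on current path / finished
    color = {}

    def visit(node):
        color[node] = GRAY
        for nxt in adj.get(node, []):
            if color.get(nxt, WHITE) == GRAY:   # back arc: a cycle
                return False
            if color.get(nxt, WHITE) == WHITE and not visit(nxt):
                return False
        color[node] = BLACK
        return True

    return all(visit(n) for n in list(adj) if color.get(n, WHITE) == WHITE)

# The arcs of Figure 1 for "I am studying dependency grammar"
figure1_arcs = [("studying", "I"), ("studying", "am"),
                ("studying", "grammar"), ("grammar", "dependency")]
print(is_acyclic(figure1_arcs))  # True
```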

The output of the Stanford Parser [15, 17] concisely represents the dependency relationships among the words in a sentence. The output for the sentence "I am studying dependency grammar" is shown in (1):

(1) nsubj(studying-3, I-1)
    aux(studying-3, am-2)
    dobj(studying-3, grammar-5)
    nn(grammar-5, dependency-4)

The first line in (1) states that the dependency between the third and the first word in the input sentence is typed as nsubj (an abbreviation for "nominal subject"). The Stanford parser categorizes each dependency into one of 55 different types [17]. The parse output for a sentence is thus a typed-dependency DAG representation of the input sentence, and we can calculate the structural characteristics of the output. Consider sentences (2) and (3); both of them have three words, yet the dependency relationships among the words and the structural characteristics are different:

(2) Write an article.
(3) David wrote it.

The Stanford parser outputs for (2) and (3) are (4) and (5), respectively (the nodes for the period are excluded; the root node is an abstract node postulated to be the head of the root of the dependency tree):

(4) root(ROOT-0, Write-1)
    det(article-3, an-2)
    dobj(Write-1, article-3)

(5) nsubj(wrote-2, David-1)
    root(ROOT-0, wrote-2)
    dobj(wrote-2, it-3)

The typed-dependency DAG representations for (2) and (3) are shown in Figure 2 and Figure 3, respectively:

[Figure 2. Typed-Dependency DAG Representation for "Write an article.": Root -ROOT-> Write -DOBJ-> article -DET-> an]

[Figure 3. Typed-Dependency DAG Representation for "David wrote it.": Root -ROOT-> wrote, wrote -NSUBJ-> David, wrote -DOBJ-> it]

The dependency relationship in the former DAG is deeper than that in the latter: there are three arcs from the root node to the terminal node in the DAG in Figure 2, whereas there are two arcs from the root to the terminal nodes in the DAG in Figure 3. On the other hand, the dependency relationship in the latter DAG is wider than that in the former: the verb "wrote" is connected to two other words besides the root in Figure 3, while the verb "Write" is connected to only one word besides the root in Figure 2. For sentences with the same word count, we can thus have typed-dependency DAGs with different widths and depths, as will be shown in Section 2.4. The width and the depth of a given DAG can be calculated as degree centrality and closeness centrality in a well-defined manner.

2.3. Graph centrality measures

The degree centrality of a node in a given graph is defined as the number of nodes connected to it [2]. The degree centrality of a graph as a whole is the sum, over all nodes, of the difference between the maximum degree in the graph and each node's degree, divided by the largest value this sum can take [3]. Degree centrality increases in proportion to the flatness of a dependency DAG, and when a DAG is a star graph, its degree centrality is 1. For example, the degree centralities of sentences (2) and (3) are .333 and 1, respectively; these values indicate that sentence (3) is flatter than (2) in terms of the typed-dependency relationships among the words.

Closeness centrality is defined as the reciprocal of the average distance from a given node to the others in a graph [2, 3], where the distance from one node to another is the number of arcs between them. In typed-dependency DAG representations, the most relevant distance is that from the root node to all the other nodes, since it represents the depth of the dependency DAG.
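As a concrete illustration, here is a short, self-contained Python sketch (not the Ruby script used later in this study) that computes both measures for the DAGs of sentences (2) and (3); it reproduces the degree centralities above and the closeness centralities given in the next paragraphs.

```python
from collections import defaultdict, deque

def degree_centrality(edges):
    """Graph-level degree centrality after Freeman: the sum over all nodes of
    (maximum degree - node degree), divided by the largest value this sum can
    take for a graph of the same size, (n - 1)(n - 2), attained by a star."""
    deg = defaultdict(int)
    for head, dep in edges:
        deg[head] += 1
        deg[dep] += 1
    n = len(deg)
    max_deg = max(deg.values())
    return sum(max_deg - d for d in deg.values()) / ((n - 1) * (n - 2))

def closeness_centrality(edges, root="Root"):
    """Reciprocal of the average distance (number of arcs) from the root node
    to every other node, computed by breadth-first search over the arcs."""
    adj = defaultdict(list)
    for head, dep in edges:
        adj[head].append(dep)
    dist, queue = {root: 0}, deque([root])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    distances = [d for node, d in dist.items() if node != root]
    return len(distances) / sum(distances)

chain = [("Root", "Write"), ("Write", "article"), ("article", "an")]  # sentence (2)
star = [("Root", "wrote"), ("wrote", "David"), ("wrote", "it")]       # sentence (3)

print(degree_centrality(chain), degree_centrality(star))        # 0.333... 1.0
print(closeness_centrality(chain), closeness_centrality(star))  # 0.5 0.6
```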
The closeness centrality of a DAG is 1 when it contains only two nodes: the root node and one word dependent on it. Closeness centrality decreases in proportion to embeddedness, that is, the distance from the root to the other nodes; smaller closeness centralities indicate more

embedded typed-dependency DAGs. For example, the closeness centralities of sentences (2) and (3) are .5 and .6, respectively; these figures show that sentence (2) is more embedded than (3) in terms of the typed-dependency relationships among the words.

2.4. Interpretation of graph centrality measures

By observing the distribution of the sentences in a corpus in terms of these two centralities, we can gain objective insight into the structural characteristics of those sentences. For example, if sentences with a certain value of degree centrality or closeness centrality appear more frequently than sentences with other values, the former can be argued to have structural characteristics representative of the corpus in which they are found.

2.5. Related work

The first study of syntactic networks based on data from dependency treebanks is [18], which showed that such networks fit the small-world model [19]. Quantitative network analysis (QNA) is introduced in [20] as a means of classifying complex networks according to their topology, and QNA is applied to the genealogical classification of languages in [21]. An approach to the automatic classification of 11 languages according to their dependency networks is introduced in [22]; its typological network indices include degree and closeness centralities. The results match the genealogical similarities of these languages, yet the 11 languages do not include Japanese, a language genealogically and typologically different from all of them.

3. Analysis

In order to verify the idea in the previous section, a corpus-based analysis was carried out. In [5, 8, 9], analyses of a small-scale parallel corpus were conducted for different purposes. In this study, a large-scale English-Japanese parallel corpus is used to obtain more reliable cross-linguistic results.

3.1. Procedure

The centrality measures of the typed-dependency DAGs of the sentences in the corpus were calculated automatically. The English sentences in the parallel corpus were parsed by the Stanford typed-dependency parser, and their degree centralities and closeness centralities were calculated by the Ruby script used in [5]. The Japanese sentences, on the other hand, were parsed by the Kurohashi-Nagao Parser (KNP) [23, 24, 25], and its output was transformed automatically into Stanford-parser-style triples (the method of this automatic conversion is based on [26, 27]; a sketch of the triple-handling step is given at the end of this section), from which their degree centralities and closeness centralities were calculated by the same script. KNP outputs the dependency relationships among syntactic units (bunsetsu in Japanese), each of which consists of one content word followed by a particle when necessary to mark its case. In order to make the two parsers' outputs comparable, the output format of the Stanford dependency parser was set to "Collapsed Tree" [17], wherein a preposition is folded into the name of the dependency type. For example, the prepositional phrase "on Monday" in the sentence "I read this book on Monday." depends on the verb "read" with the dependency type prep_on.
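The triple format shown in (1), (4), and (5) is straightforward to turn into the edge lists that the centrality calculation needs. The following Python fragment is a hypothetical stand-in for that step (the study's own pipeline used a Ruby script, which is not reproduced here):

```python
import re

# Matches triples such as "nsubj(studying-3, I-1)" or "root(ROOT-0, Write-1)";
# \w+ also covers collapsed labels such as "prep_on".
TRIPLE = re.compile(r"(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)")

def triples_to_edges(lines):
    """Turn one sentence's worth of triple lines into (head, dependent) edges;
    the dependency labels are dropped, since the centrality calculation only
    needs the arcs themselves."""
    edges = []
    for line in lines:
        match = TRIPLE.match(line.strip())
        if match:
            label, head, h_idx, dep, d_idx = match.groups()
            edges.append((f"{head}-{h_idx}", f"{dep}-{d_idx}"))
    return edges

sentence_5 = ["nsubj(wrote-2, David-1)", "root(ROOT-0, wrote-2)", "dobj(wrote-2, it-3)"]
print(triples_to_edges(sentence_5))
# [('wrote-2', 'David-1'), ('ROOT-0', 'wrote-2'), ('wrote-2', 'it-3')]
```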
3.2. Description of the text data

The corpus used in this study is the Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles ver. 2.01 (National Institute of Information and Communications Technology, 2011). This corpus contains about 500,000 English-Japanese pairs of manually translated sentences on topics related to Kyoto. It is divided into 16 subcorpora according to their contents (such as religion, famous people, or famous buildings). In this study, the sentences in the subcorpus on notable buildings in Kyoto (henceforth BLD) were randomly chosen for the calculation of their degree centralities and closeness centralities.

3.3. Results

The descriptive statistics of the degree centralities and closeness centralities of the English and Japanese sentences in BLD are as follows:

       Degree           Closeness
       Mean    S.D.     Mean    S.D.
E      .30     .17      .32     .09
J      .39     .24      .39     .13

Table 1: Descriptive statistics (E: English; J: Japanese)

Figures 4 and 5 show the frequencies of the degree centralities of the Japanese and English sentences in BLD, respectively. When we round these degree centralities to two decimal places, the most frequent degree centrality among the Japanese sentences is 1 (approx. 7.6% of all the sentences), and the second most frequent is .24 (approx. 6.2%). The most frequent degree centrality among the English sentences is .24 (approx. 4.5%), and the second most frequent value is .26 (approx. 4.3%).

English sentences with degree centrality 1 are not as frequent as their Japanese counterparts.

[Figure 4. Frequency of Degree Centralities of Japanese Sentences in BLD; x = degree centrality; y = frequency; n = 1595]

[Figure 5. Frequency of Degree Centralities of English Sentences in BLD; x = degree centrality; y = frequency; n = 1595]

Figures 6 and 7 show the frequencies of sentences in terms of the values of the closeness centralities of the Japanese and English sentences in BLD, respectively. When we round these closeness centralities to two decimal places, the most frequent value among the Japanese sentences is .4 (approx. 5.3%), and the second most frequent value is .5 (approx. 4.6%). The most frequent value among the English sentences is .33 (approx. 5.3%), and the second most frequent value is .32 (approx. 5.2%).
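The frequency distributions plotted in Figures 4-7 amount to rounding each sentence's centrality to two decimal places and tallying the rounded values; a minimal sketch of that step (illustrative only, with a made-up input list) is:

```python
from collections import Counter

def frequency_table(centralities):
    """Round each per-sentence centrality to two decimal places and return
    the relative frequency of each rounded value."""
    counts = Counter(round(value, 2) for value in centralities)
    total = len(centralities)
    return {value: count / total for value, count in sorted(counts.items())}

print(frequency_table([0.24, 0.24, 1.0, 0.26]))
# {0.24: 0.5, 0.26: 0.25, 1.0: 0.25}
```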

[Figure 6. Frequency of Closeness Centralities of Japanese Sentences in BLD; x = closeness centrality; y = frequency; n = 1595]

[Figure 7. Frequency of Closeness Centralities of English Sentences in BLD; x = closeness centrality; y = frequency; n = 1595]

4. Discussion

The distributions of the degree centralities of the English and Japanese sentences in BLD differ in that the variety of degree centralities among the English sentences is larger than that among the Japanese sentences. In other words, the degree centralities of English sentences do not concentrate on particular values, compared to those of Japanese sentences. This difference suggests that English has a larger variety of structural settings of typed-dependency DAGs than Japanese.

In this context, special attention should be paid to the relatively high frequency of Japanese sentences with degree centrality 1, which means that their typed-dependency DAGs share the star-graph setting, in which one word is connected to all the other words (see Figure 3 for an example of a star graph). Japanese sentences of this type can be regarded as having structural characteristics typical of this language, as far as the flatness of their typed-dependency DAGs is concerned. In English, on the other hand, typed-dependency DAGs in the star-graph setting are not typical in terms of flatness.

The difference between the frequencies of the closeness centralities of the English and Japanese sentences in BLD is not as extensive as in the case of their degree centralities, yet in Japanese, certain closeness centralities are more frequent than others. This, too, can be considered to reflect the structural characteristics of Japanese, especially given that these more frequent closeness centralities are found in one-word sentences (closeness centrality .75), two-word sentences (.67), and three-word sentences (.63).

These tendencies in the distributions of degree centralities and closeness centralities can be applied in a variety of fields, along with the typological classification of languages [21, 22]. For example, in the field of language teaching, it may be possible to estimate the naturalness of English essays written by Japanese learners of English, or vice versa, in terms of the flatness expressed by degree centrality or the embeddedness expressed by closeness centrality. A rough sketch of applying centrality measures to language teaching is as follows: if it is found that some Japanese learners write flat English sentences more frequently than English native speakers do, they need advice so that they will write more embedded English sentences. Apart from language teaching, this estimation of the naturalness of sentences in terms of their flatness and embeddedness can also be applied to the output of sentence-generating applications, such as machine translation. The naturalness of sentences, whether generated by a system or written by humans, in terms of their centrality measures is one possible question for future research.

5. Conclusion

This study explored the possibility of automatic numerical analysis of the syntactic structure of sentences in an English-Japanese parallel corpus in terms of the centrality measures (degree centrality and closeness centrality) of the typed dependencies among words in these sentences.
The results suggest that centrality measures can reflect the difference in the structural settings of these languages in terms of their flatness, and further research is required to apply the centrality measures of typed-dependency DAGs to the naturalness of sentences.

6. Acknowledgement

This work was supported by JSPS KAKENHI Grant Number 26375.

7. References

[1] M. Oya, Directed Acyclic Graph Representation of Grammatical Knowledge and its Application for Calculating Sentence Complexity, Proceedings of the 15th International Conference of Pan-Pacific Association of Applied Linguistics, 2010, pp. 393-400.

[2] L. Freeman, Centrality in Social Networks, Social Networks, Vol. 1, 1979, pp. 215-239.
[3] S. Wasserman and K. Faust, Social Network Analysis, Cambridge: Cambridge University Press, 1994.
[4] M. Oya, Degree Centralities, Closeness Centralities, and Dependency Distances of Different Genres of Texts, Selected Papers from the 17th International Conference of Pan-Pacific Association of Applied Linguistics, 2013, pp. 42-53.
[5] M. Oya, A Study of Syntactic Typed-Dependency Trees for English and Japanese and Graph-Centrality Measures, Doctoral dissertation, Waseda University, 2014.
[6] M. Oya, Dependency-grammar analyses of different genres of English, Second Asia Pacific Corpus Linguistics Conference (APCLC 2014), The Hong Kong Polytechnic University, 2014.
[7] M. Oya, Extracting Structural Properties from Syntactic Dependency Corpus, The 4th Conference of Japan Association for English Corpus Studies, Kumamoto Gakuen University, 2014.
[8] M. Oya, Syntactic Dependency Structures of English and Japanese, Mejiro Journal of Humanities, Vol. 9, 2013, pp. 151-164.
[9] M. Oya, Typed-dependency Tree Pairs of English and Japanese, Mejiro Journal of Humanities, Vol. 10, 2014, pp. 205-215.
[10] M. Oya, An English-Japanese bilingual corpus-based comparison of their syntactic dependency structures, The 19th Conference of Pan-Pacific Association of Applied Linguistics, Waseda University, 2014.
[11] L. Tesnière, Éléments de syntaxe structurale, Paris: Klincksieck, 1959.
[12] R. Debusmann, Dependency Grammar as Graph Description, Prospects and Advances in the Syntax-Semantics Interface, Nancy, 2003. Retrieved July 27, 2010, from https://www.ps.uni-saarland.de/Publications/documents/passi3.pdf
[13] R. Debusmann and M. Kuhlmann, Dependency Grammar: Classification and Exploration, Project report (CHORUS, SFB 378), 2007. Retrieved July 3, 2010, from http://www.ps.uni-saarland.de/~rade/papers/sfb.pdf
[14] R. Hudson, An Introduction to Word Grammar, Cambridge University Press, 2010.
[15] M. C. de Marneffe, B. MacCartney, and C. D. Manning, Generating Typed Dependency Parses from Phrase Structure Parses, Proceedings of the Fifth International Conference on Language Resources and Evaluation, 2006.
[16] M. C. de Marneffe and C. D. Manning, The Stanford Typed Dependencies Representation, COLING Workshop on Cross-framework and Cross-domain Parser Evaluation, 2008.
[17] M. C. de Marneffe and C. D. Manning, Stanford Typed Dependency Manual. Retrieved July 3, 2010, from http://nlp.stanford.edu/software/dependencies_manual.pdf
[18] R. Ferrer i Cancho, R. Solé, and R. Köhler, Patterns in Syntactic Dependency Networks, Physical Review E, Vol. 69, 051915, 2004.
[19] D. J. Watts and S. H. Strogatz, Collective Dynamics of Small-World Networks, Nature, 393, 1998, pp. 440-442.
[20] A. Mehler, Structural Similarities of Complex Networks: A Computational Model by Text Representation, Applied Artificial Intelligence, Vol. 22, 2008, pp. 619-683.
[21] A. Mehler, O. Pustylnikov, and N. Diewald, The Geography of Social Ontologies: The Sapir-Whorf Hypothesis Revisited, Computer Speech and Language, London: Academic Press, 2010.
[22] O. Abramov and A. Mehler, Automatic Language Classification by Means of Syntactic Dependency Networks, Journal of Quantitative Linguistics, Vol. 18, No. 4, 2011, pp. 291-336.
[23] S. Kurohashi and M. Nagao, A Method for Analyzing Conjunctive Structures in Japanese, Journal of Information Processing Society of Japan, Vol. 33, No. 8, 1992, pp. 1022-1031.
[24] S. Kurohashi and M. Nagao, A Syntactic Analysis Method of Long Japanese Sentences based on Coordinate Structure Detection, Journal of Natural Language Processing, Vol. 1, No. 1, 1994, pp. 35-57.
[25] S. Kurohashi and M. Nagao, Building a Japanese Parsed Corpus while Improving the Parsing System, Proceedings of the 1st International Conference on Language Resources and Evaluation, 1998, pp. 719-724.
[26] M. Oya, A Method of Automatic Acquisition of Typed-dependency Representation of Japanese Syntactic Structure, Proceedings of the 14th Conference of Pan-Pacific Association of Applied Linguistics, 2009, pp. 337-340.
[27] M. Oya, Treebank-Based Automatic Acquisition of Wide Coverage, Deep Linguistic Resources for Japanese, M.Sc. thesis, School of Computing, Dublin City University, 2010.