Thematic Development for Measuring Cohesion and Coherence Between Sentences in English Paragraph


2016 Fourth International Conference on Information and Communication Technologies (ICoICT)

Thematic Development for Measuring Cohesion and Coherence Between Sentences in English Paragraph

Erinna Hardianto Putri, Diah Rostanti Fadilah, Ivan, Derwin Suhartono
Computer Science Department, Bina Nusantara University, Jakarta, Indonesia
erinna.hardiantoputri@binusian.org, dfadilah@binus.edu, vanz_9365@binusian.org, dsuhartono@binus.edu

Melania Wiannastiti
English Department, Bina Nusantara University, Jakarta, Indonesia
mwiannastiti@binus.edu

Abstract — Writing is the skill of putting coherent words on paper and composing a text. There are several criteria of good writing; two of them are cohesion and coherence. Research on measuring cohesion and coherence in writing has been conducted using the Centering Theory and Entity Transition Value methods. However, the results can still be improved. Therefore, in this research we use the Thematic Development approach, which focuses on the use of Theme and Rheme, to analyze the coherence level of a paragraph, combined with CogNIAC rules to analyze its cohesion. Besides improving on the results of previous methods, this research aims to help users evaluate and assess their written texts. To achieve these objectives, the proposed method is compared to previous works as well as to a human judge. Based on the experiments, the proposed method yields an average result of 91%, which is nearly equivalent to the human judge's 92%. Thematic Development also yields better results than Centering Theory (29%) and Entity Transition Value (0%) on the same data set of beginner and intermediate writing.

Keywords — cohesion; coherence; thematic development; english paragraph

I. INTRODUCTION

For many students, writing seems to be the most difficult subject because it requires inventing ideas, thinking about how to express those ideas, and organizing them into sentences and paragraphs [1]. Indeed, most writing tasks involve certain rules to be followed so that they can be categorized into different kinds of writing.
Furthermore, the practice of writing at school is based on a writing rubric for each grade. Despite these attempts to educate students in writing, many of them still cannot follow the procedures and perform poorly. In addition, students who learn to write consider only grammar, punctuation, and diction, whereas there are other essential things to include in writing, such as cohesion and coherence [1]. It has been shown that 50 percent of research samples cannot implement cohesion and coherence in their writing. Basically, the way to link sentences and ensure their cohesion is by using anaphora resolution [2]. Furthermore, to write a coherent and cohesive paragraph, students need knowledge of theme and rheme; as stated in [1], students write better papers when they develop the ability to use theme and rheme more effectively in their writing. In addition, the interaction of theme and rheme governs how the information in a text is developed [1]. Thematic development is necessary for the construction of an optimally coherent and grammatically cohesive structured text [3]. Another issue in academic writing is that teachers need to manually evaluate every single paragraph written by students, which can consume a lot of time. Therefore, teachers need a system to correct students' writing quickly. Seeing this issue, the researchers have built a web-based coherence and cohesion checker whose function is to analyze the connections between sentences in a paragraph. Previous research has implemented Centering Theory and Entity Transition Value to analyze local coherence between sentences. The E-rater essay scoring system was developed to examine local discourse coherence based on Centering Theory [4]. In parallel with the aforementioned study, [5] extended Centering Theory by proposing a new method to evaluate discourse coherence through Entity Transition Value. This method tracks all entities instead of a single entity as in the original Centering Theory.
However, from a linguistics point of view, there is the knowledge of Thematic Development, which is useful for checking the coherence and cohesion of writing. Nevertheless, a system that can evaluate cohesion and coherence in text using Thematic Development is not yet available. The main goal of this research is to develop a system based on the Thematic Development method whose results are similar to human judgment. As a comparison, the Centering Theory and Entity Transition Value methods are also used to check cohesion and coherence in this research.

ISBN: 978-1-4673-9879-4

II. LITERATURE REVIEW

A unit of language, or simply a group of sentences, which has a particular focus and is represented as paragraphs, sections, chapters, parts, or stories can be defined as discourse [5]. A text produced in a discourse should contain related and meaningful sentences and is expected to be both cohesive and coherent. A cohesive device is a linguistic device that helps to establish links among the sentences of a text. Cohesion in texts is about linking sentences or, more generally, textual units through

cohesive devices such as anaphors, entities pointing back to previously mentioned items in the text [2]. Coherence in writing means that the sentences hold together to form smooth movement between sentences. One way to achieve coherence in a paragraph is to use the same nouns and pronouns consistently [6]. In English, sentences (and the clauses of which they are composed) have a simple two-way division between what the sentence is about (its topic) and what the writer wants to tell the readers about that topic (the comment). The topic and comment are called theme and rheme. The theme is the subject of the sentence and is typically realized by a noun phrase. The rheme is the new information of the sentence and is used to explain the topic or theme [7]. This definition is supported by [1], who says that the topic or Theme is the subject of each sentence in the paragraph, while the rheme is the controlling idea that limits the topic in every sentence. The way the theme of a clause is developed is known as thematic development. The theme of a clause is taken from the theme or rheme of previous sentences [1]. There are three types of thematic development pattern, as follows:

Theme Reiteration or Constant Theme Pattern. In this pattern, the first theme is picked up and repeated at the beginning of the next clause, as described in figure 1.

Fig. 1. Theme Reiteration Pattern

Zig Zag Linear Theme Pattern. In this pattern, the subject matter in the Rheme of one clause is taken up as the Theme of the following clause. Figure 2 makes this clearer.

Fig. 2. Zig Zag Linear Theme Pattern

Multiple Theme / Split Rheme Pattern. In this pattern, as described in figure 3, a rheme may include a number of different pieces of information, each of which may be taken up as the theme of a number of subsequent clauses.

Fig. 3. Multiple Theme / Split Rheme Pattern

III. METHODOLOGY

Fig. 4. Thematic Development Process

We had to go through several processes before applying the thematic development patterns, as shown in figure 4.

1.
Parsing a paragraph into sentences. Sentence splitting is done by dividing a paragraph into several sentences using the pre-trained Punkt tokenizer for English. The Punkt sentence tokenizer is accessible in the nltk.tokenize module provided by the Natural Language Toolkit (NLTK). This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviations, collocations (combinations of words), and words that start sentences. It must be trained on a large collection of plain text in the target language before it can be used [8]. For example, the text "Jane is smart. Mr. John is a good man." will be separated into two sentences: Sentence 1: "Jane is smart." Sentence 2: "Mr. John is a good man."
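The splitting behavior can be illustrated with a tiny abbreviation-aware splitter. This is a simplified, rule-based stand-in for the trained Punkt model, not the actual NLTK implementation; the abbreviation list and the function name are our own:

```python
import re

# Hypothetical minimal abbreviation list; the real Punkt model learns
# abbreviations from a large corpus instead of using a fixed set.
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "prof", "e.g", "i.e"}

def split_sentences(text):
    """Split text at '.', '!' or '?' followed by whitespace, unless the
    word before the period is a known abbreviation (e.g. 'Mr.')."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        end = match.start() + 1               # keep the period in the sentence
        last_word = text[start:end].rstrip(".").split()[-1].lower()
        if last_word in ABBREVIATIONS:        # "Mr." does not end a sentence
            continue
        sentences.append(text[start:end].strip())
        start = match.end()
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences
```

On the example above, `split_sentences("Jane is smart. Mr. John is a good man.")` yields the two sentences rather than splitting after "Mr.".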

The Punkt tokenizer knows that the period in "Mr. John" does not mark a sentence boundary. Therefore, it does not incorrectly separate the text into three sentences ("Jane is smart"; "Mr."; "John is a good man").

2. Parsing a sentence into words and POS (Part-of-Speech) tagging. In this step, a sentence is parsed into words, and each word is annotated with its type, such as noun, verb, etc. Sentence parsing and POS tagging are done by a parser accessible in the pattern.en module. The parser uses a fast, regular-expression-based shallow parsing approach for English (it identifies sentence constituents, e.g., nouns and verbs), using a finite-state part-of-speech tagger extended with a tokenizer, lemmatizer, and chunker [9]. The concept of parsing can best be interpreted as a family of tagging, chunking, and named-entity recognition tasks which attempt to recover some syntactic-semantic information in a robust and deterministic way, at the expense of ignoring detailed configurational syntactic information [10]. Tagging is the classification of parts of speech such as verbs, nouns, pronouns, adverbs, adjectives, prepositions, conjunctions, and interjections. Chunking and Named Entity Recognition mean that the parser can also recognize noun phrases and named entities, such as locations, names, and organizations.

3. Chunking noun phrases. In this step, the pattern of tagged parts of speech is checked to see whether it fulfills the criteria for a noun phrase. A regular expression approach is used so that the researchers have more control over the tag patterns that need to be matched. The regular expression for a noun phrase is:

{<DT>?<CD>?<JJ|JJR|JJS>*(<NN|NNS|NNP|NNP-LOC|NNP-PERS|NNPS>)+}

If the words' POS tags match the defined pattern, the words are combined into a single noun phrase. For example, the sentence "The tall smart man likes the president of Indonesia." will yield three noun phrases: "The tall smart man"; "the president"; and "Indonesia".

4.
Getting basic semantic information. Basic semantic information for each pronoun and noun phrase is identified here: gender, number, and animacy. For a noun phrase, a head noun must be identified before the semantic information is determined. The head noun identification is based on the word's tag: if there are one or more consecutive nouns (including proper nouns), those nouns are considered the head noun. For example, the noun phrase "the fast programming language" yields "programming language" as the head noun, while the noun phrase "a fast white light" yields "light" as the head noun. Gender information for an entity is obtained from a Naive Bayes classifier. The features used for classifying gender are the first letter, the last letter, and the last two letters of a name. Names that end with a, e, and i are likely to be classified as female, while names that end with k, o, r, s, and t are likely to be classified as male. Number information for an entity is retrieved by checking the entity with the inflect engine: if the entity is in plural form, "plural" is returned, otherwise "singular" is returned by the system. For example, the word "men" is identified as a plural entity, whereas the word "man" is identified as a singular entity. Animacy information is retrieved by checking an entity in the WordNet database provided in pattern. If the entity is recognized as a person in WordNet, "person" is returned by the system, otherwise "non-person". However, if the entity is not found in the WordNet database, it is checked based on its POS tag: an entity with a proper noun or pronoun POS tag is identified as a person, otherwise as a non-person.

5. Finding anaphors. After all pronouns and noun phrases have obtained their syntactic information, the next step is to find the anaphors that need to be resolved. The anaphors resolved in this research are personal pronouns (e.g.
"he"), reflexive pronouns (e.g. "himself"), and possessive pronouns (e.g. "his"). The sample sentences are as follows: "Harry is a rich man. He likes to buy cars. However, Christina hates him." The anaphors to be resolved from these sentences are "He" and "him".

6. Finding antecedent candidates. In this step, the program searches for all possible antecedents for each anaphor. The search is done in the current and previous sentences. For instance, from the sentences "Harry is a rich man. He likes to buy cars. However, Christina hates him.", the possible antecedents for the anaphor "him" are "Harry"; "a rich man"; "cars"; and "Christina".

7. Applying the agreement filter. After all antecedent candidates have been gathered, agreements are applied to filter the candidates. Each candidate is filtered by matching its basic semantic information (gender, number, and animacy) with the anaphor's semantic information. If a candidate's semantic information does not match the anaphor's, it is no longer considered a candidate. From the previous example, the candidates "cars" and "Christina" are removed because their semantic information does not properly match the semantic information of the word "him". The remaining candidates for the word "him" are "Harry" and "a rich man".

8. Applying CogNIAC rules. After applying the agreement filter, the CogNIAC rules [11] are applied to the remaining candidates. The rules are applied in sequence until a certain rule's condition is met. The CogNIAC rules are: unique in discourse, reflexive, unique in current and prior, possessive pronoun, unique current sentence,

unique subject/object pronoun, cb-picking, and pick most recent.

9. Finding entities. The entities considered are nouns and noun phrases. For example: "Michael likes Emily. Emily is Republican." Table I contains the entities of these utterances.

TABLE I. EXAMPLE OF FINDING ENTITIES
No.   Utterance 1   Utterance 2
1     Michael       Emily
2     Emily         Republican

10. Finding theme and rheme. The theme is the topic of an utterance and is obtained from the subject of the utterance. The rheme is the new information of the sentence, which is used to explain the topic or theme. For example:
1. Merida was a student in computer science.
2. China is a large country.
3. Merida went to a private school in Jakarta.
4. Beijing is the capital of China.
5. She likes computer science.
Table II contains the theme and rheme of these utterances.

TABLE II. EXAMPLE OF FINDING THEME AND RHEME
Utterance   Theme     Rheme
0           Merida    Student, Computer Science
1           China     Large Country
2           Merida    Private School, Jakarta
3           Beijing   Capital, China
4           She       Computer Science

11. Applying thematic development patterns. Using the same example, the result of applying the patterns is as follows:

Fig. 5. Result of Applying Thematic Development Patterns

The result represents the coherence percentage of all utterances. The coherence of each utterance and the resulting coherence percentage are shown in table III below.

TABLE III. EXAMPLE OF COHERENCE CALCULATION
Utterance   Coherence
0           -
1           No
2           Yes
3           Yes
4           No
Percentage of Coherence: 50%

The first utterance has no coherence value because there is no other utterance to compare it with. The second utterance is not coherent with the previous one due to lack of similarity: the entities of both sentences differ in meaning and topic. The third utterance is compared to the first and the second sentences; even though it is not related to the second sentence, it is connected to the first one and meets the reiteration or constant theme pattern, so it is coherent. The fourth utterance meets the zig zag linear theme pattern against the second sentence, which makes it coherent.
The last utterance has no link to the previous sentences, so it is not coherent with any sentence. Since there are four scored utterances, of which two are coherent, the result is as follows:

Coherence Percentage = (Number of Coherent Utterances / Number of Compared Utterances) × 100% = (2 / 4) × 100% = 50%

12. Showing the thematic development result.

IV. RESULT AND DISCUSSION

Three kinds of data set were used in this experiment. The first data set was gathered from human experts, English lecturers of Bina Nusantara University. The collected data amounted to 10 essays: 5 essays written by lecturers and 5 essays written by college students. A paragraph was chosen from each essay to be evaluated by the application. The second data set was taken from writing examples of the IELTS test which had been marked with a score (band) of 5 or 6; ten paragraphs were chosen from 10 essays with different topics. Lastly, the third data set was taken from example paragraphs in academic writing books and several websites that explain Theme-Rheme; ten paragraphs were chosen from different English books and Theme-Rheme websites. In total, there were 30 paragraphs grouped into three data sets. In order to obtain scores from the human judge, each essay was assessed by an expert as well.

Set 1. This experiment used the paragraphs written by lecturers and students. The results of both the proposed method and the human judge are summarized in table IV.

TABLE IV. RESULT OF THEMATIC DEVELOPMENT AND HUMAN JUDGE IN PARAGRAPHS WRITTEN BY LECTURERS AND STUDENTS
Paragraph               1    2    3    4    5    6    7    8    9   10
Thematic Development   57  100   71   80  100  100  100   80   50  100
Human Judge            61   85   98   91   87   90   97   77   98  100

Set 2. This experiment used paragraphs taken from writing examples of IELTS tests marked with a score of 5 or 6. The results of both the proposed method and the human judge are summarized in table V.

TABLE V. RESULT OF THEMATIC DEVELOPMENT AND HUMAN JUDGE IN IELTS PARAGRAPH WRITINGS OF BAND 5 AND 6
Paragraph               1    2    3    4    5    6    7    8    9   10
Thematic Development  100   83  100  100   57  100  100  100   75   75
Human Judge            96   92   94   98   97   77   67   93   98   72

Set 3. This experiment used example paragraphs taken from academic writing books and several websites that explain Theme-Rheme. Ten paragraphs were chosen from different English books and the websites. The results are summarized in table VI.

TABLE VI. RESULT OF THEMATIC DEVELOPMENT AND HUMAN JUDGE IN PARAGRAPHS FROM ACADEMIC WRITING BOOKS AND WEBSITES
Paragraph               1    2    3    4    5    6    7    8    9   10
Thematic Development  100  100  100  100  100  100  100  100  100  100
Human Judge           100  100  100  100  100  100  100  100  100  100

Comparing these results to the previous methods, we found that our method performed better. The results given by the systems are shown in tables VII to IX. In order to make analysis and interpretation easier, the result value for each method is converted into binary (0 and 1). In Thematic Development, if two sentences are not connected (the value of the transition is "No"), the value is converted to 0; if they are connected ("Yes"), it is converted to 1. In the Entity Transition Value method, a value of 0.5 or above is converted to 0, while a value below 0.5 is converted to 1. In Centering Theory, continue and retain transitions are converted to 1, whereas smooth-shift and rough-shift transitions are converted to 0.

Set 1

TABLE VII.
RESULT OF THEMATIC DEVELOPMENT, ENTITY TRANSITION VALUE, AND CENTERING THEORY IN PARAGRAPHS WRITTEN BY LECTURERS AND STUDENTS
Paragraph   Thematic Development   Entity Transition Value   Centering Theory
1                  57                        0                       0
2                 100                        0                      17
3                  71                        0                      43
4                  80                        0                      20
5                 100                        0                       0
6                 100                        0                      17
7                 100                        0                      17
8                  80                        0                       0
9                  50                        0                       0
10                100                        0                      29

Set 2

TABLE VIII. RESULT OF THEMATIC DEVELOPMENT, ENTITY TRANSITION VALUE, AND CENTERING THEORY IN IELTS PARAGRAPH WRITINGS OF BAND 5 AND 6
Paragraph   Thematic Development   Entity Transition Value   Centering Theory
1                 100                        0                      15
2                  83                        0                      33
3                 100                        0                      40
4                 100                        0                      25
5                  57                        0                       0
6                 100                        0                       0
7                 100                        0                       0
8                 100                        0                      25
9                  75                        0                      13
10                 75                        0                      25

Set 3

TABLE IX. RESULT OF THEMATIC DEVELOPMENT, ENTITY TRANSITION VALUE, AND CENTERING THEORY IN PARAGRAPHS FROM ACADEMIC WRITING BOOKS AND WEBSITES
Paragraph   Thematic Development   Entity Transition Value   Centering Theory
1                 100                        0                      33
2                 100                        0                      67
3                 100                        0                      33
4                 100                        0                      67
5                 100                        0                      25
6                 100                        0                       0
7                 100                        0                      33
8                 100                        0                     100
9                 100                        0                     100
10                100                        0                     100
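The binary conversion used to compare the three methods can be written down directly. This is a sketch of the conversion rules as stated in the text; the helper name `to_binary` and the method/transition labels are our own:

```python
def to_binary(method, value):
    """Convert one method's raw transition value to 0/1, following the
    conversion rules described above (hypothetical helper)."""
    if method == "thematic_development":
        # "Yes" (sentences connected) -> 1, "No" -> 0
        return 1 if value == "Yes" else 0
    if method == "entity_transition_value":
        # value of 0.5 or above -> 0, below 0.5 -> 1
        return 0 if value >= 0.5 else 1
    if method == "centering_theory":
        # continue/retain transitions -> 1, smooth/rough shift -> 0
        return 1 if value in ("continue", "retain") else 0
    raise ValueError("unknown method: %s" % method)
```

Averaging these 0/1 values over all sentence pairs of a paragraph gives the percentages reported in tables VII to IX.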

It is obvious that the results of Thematic Development and human judgment do not differ significantly. Experiments show that the average result of the Thematic Development method is 91%, while the average result of human judgment is 92%. In summary, the result of the proposed application is almost the same as human judgment.

TABLE X. COMPARISON OF THEMATIC DEVELOPMENT, CENTERING THEORY, ENTITY TRANSITION VALUE, AND HUMAN JUDGMENT RESULTS IN THREE DIFFERENT KINDS OF DATA SET
Set      Thematic Development   Centering Theory   Entity Transition Value   Human Judgment
Set 1           84%                   14%                    0%                    88%
Set 2           89%                   18%                    0%                    88%
Set 3          100%                   56%                    0%                   100%
Total           91%                   29%                    0%                    92%

Thematic Development plays a role in analyzing sentence cohesion and coherence based on the connections from Theme to Theme or from Rheme to Theme. This approach is also applied by the human judge: both raters measure text cohesion and coherence based on repeated nouns and noun phrases in each sentence. Therefore, the average results are almost identical. Furthermore, the results of the three algorithms show a great gap: the average result of Thematic Development is 91%, while Centering Theory yields 29% and Entity Transition Value 0%. This difference happens because of the abundant variance of entities; Centering Theory and Entity Transition Value only excel on writing that has simple and consistently repeated mentions of nouns or noun phrases. In data set 1, there are various kinds of nouns and noun phrases in each sentence which explain the topic. Therefore, the Centering Theory and Entity Transition Value methods give low results. Nonetheless, the Thematic Development method still gives high results for some paragraphs, because it does not depend strongly on the number of entities in a paragraph. The results of Entity Transition Value stay at zero percent because in data set 2, several long sentences are used. It is difficult for Entity Transition Value to analyze long sentences because the calculation of connectivity in this method requires a low number of entities to achieve a cohesive and coherent result. If the variety of entities is too big, the number of distinct entities realized in adjacent sentences will be large; hence the result gets closer to 1, which means the sentences are unconnected.
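The Theme-to-Theme and Rheme-to-Theme linking that Thematic Development relies on can be sketched in a few lines. This is a minimal reading of the pattern rules, not the authors' implementation: each utterance is reduced to a (theme, rhemes) pair, matching is exact string comparison, and we include a check for an earlier Theme reappearing in the current Rheme so that the sketch reproduces the worked example from tables II and III; the actual rule set may differ.

```python
def is_coherent(index, utterances):
    """Check utterance `index` against every earlier utterance using the
    constant-theme (Theme -> Theme) and zig-zag (Rheme <-> Theme) links.
    Each utterance is a (theme, set_of_rheme_entities) pair."""
    theme, rhemes = utterances[index]
    for prev_theme, prev_rhemes in utterances[:index]:
        if theme == prev_theme or theme in prev_rhemes:
            return True              # constant theme / zig-zag pattern
        if prev_theme in rhemes:     # earlier Theme reappears in this Rheme
            return True
    return False

def coherence_percentage(utterances):
    # The first utterance is not scored: it has nothing to compare with.
    scored = [is_coherent(i, utterances) for i in range(1, len(utterances))]
    return 100.0 * sum(scored) / len(scored)

# The worked example from tables II and III:
utterances = [
    ("Merida",  {"Student", "Computer Science"}),
    ("China",   {"Large Country"}),
    ("Merida",  {"Private School", "Jakarta"}),
    ("Beijing", {"Capital", "China"}),
    ("She",     {"Computer Science"}),
]
```

On this example the sketch reproduces the 50% figure from the worked example: utterances 2 and 3 are linked to earlier sentences, while utterances 1 and 4 are not.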
Furthermore, Centering Theory also yields low results in data set 2. This is because data set 2 contains a lot of shifting entities; as a result, the focus of the passage is vague, causing Centering Theory to be less accurate. In contrast, the Thematic Development method assesses data set 2 with high accuracy. This is because Thematic Development allows an explanatory elaboration in one sentence (the Rheme) to become the focus of another sentence. In data set 3, several example paragraphs taken from academic writing books are used, so these paragraphs are proven good examples of both cohesive and coherent paragraphs. The Thematic Development method performs very well in analyzing this kind of paragraph, as characterized by perfect values in all tests.

V. CONCLUSION

Cohesion and coherence measurement with the Thematic Development method gives better experimental results in analyzing the cohesion and coherence level of a paragraph than Centering Theory and Entity Transition Value, since it is not restricted by the need to use the same entity in every sentence. The average result of Thematic Development tested on 3 different kinds of data set is 91%. As a comparison, the average result of Centering Theory is 29%, while the average result of Entity Transition Value is 0%. Furthermore, the result of the experiments with the Thematic Development method is nearly equivalent to human judgment, which gives a result of 92%. However, the current system is limited to simple writing and cannot evaluate advanced-level text. For future study, the proposed method can be developed to analyze more paragraphs at a more advanced level.

REFERENCES

[1] K. Rustipa. Theme-Rheme Organization of Learners' Texts. Dinamika Bahasa dan Ilmu Budaya, Vol. 4, No. 2, 2010.
[2] R. Mitkov. Discourse Processing. In A. Clark, C. Fox, and S. Lappin (Eds.), The Handbook of Computational Linguistics and Natural Language Processing (pp. 599-611). West Sussex: Wiley-Blackwell, 2010.
[3] E. Not. Implementation of the Progression and Realization Component. LRE Project 062-09, 1996.
[4] E. Miltsakaki and K. Kukich. Evaluation of text coherence for electronic essay scoring systems. Natural Language Engineering, Vol. 10, 2004.
[5] M. Tofiloski. Extending Centering Theory for the Measure of Coherence. Canadian AI 2009. Kelowna: Springer, 2009.
[6] A. Oshima and A. Hogue. Introduction to Academic Writing. New York: Pearson Education, 2007.
[7] S. Thornbury. Beyond the Sentence. Oxford: Macmillan Publishers Limited, 2005.
[8] NLTK Project. nltk.tokenize package. Natural Language Toolkit - NLTK 3.0 documentation. Retrieved December 11, 2014, from http://www.nltk.org/api/nltk.tokenize.html
[9] T. D. Smedt and W. Daelemans. Pattern for Python. Journal of Machine Learning Research, 2012.
[10] T. D. Smedt, V. V. Asch, and W. Daelemans. Memory-Based Shallow Parser for Python. Antwerp: CLiPS Research Center, 2010.
[11] B. Baldwin. CogNIAC: A High Precision Pronoun Resolution Engine. Journal of Semantics, Vol. 4, No. 3, 1996.