Basic Text Processing: Morphology Sentence Segmentation


Basic Text Processing
Every NLP task needs text normalization to determine what the words of the document are:
- Segmenting/tokenizing words in running text
  - Special characters such as the hyphen (-) and the apostrophe (')
- Normalizing word formats
  - (Non-)capitalization of words
  - Reducing words to stems or lemmas
- Segmenting sentences in running text
To do these tasks, we need to use morphology.

Synchronic Model of Language
Levels shown in the diagram: Syntactic, Lexical, Morphological, Semantic, Pragmatic, Discourse.

Morphology
- Morphology is the level of language that deals with the internal structure of words.
- General morphological theory applies to all languages, as all natural human languages have systematic ways of structuring words (even sign languages).
- It must be distinguished from the morphology of a specific language:
  - English words are structured differently from German words, although the two languages are historically related.
  - Both are vastly different from Arabic.

Minimal Units of Meaning
- Morpheme = the minimal unit of meaning in a word
  - walk + -ed
- Simple words cannot be broken down into smaller units of meaning
  - Monomorphemes
  - Called base words, roots, or stems
- Affixes are attached to free or bound forms
  - Prefixes, infixes, suffixes, circumfixes

Affixes
- Prefixes appear in front of the stem to which they attach
  - un- + happy = unhappy
- Infixes appear inside the stem to which they attach
  - -blooming- + absolutely = absobloominglutely
- Suffixes appear at the end of the stem to which they attach
  - emotion = emote + -ion
  - English may stack up to 4 or 5 suffixes on a word
  - Agglutinative languages like Turkish may have up to 10
- Circumfixes appear at both the beginning and the end of the stem
  - The German past participle of sagen is gesagt: ge- + sag + -t
- Spelling and sound changes often occur at the boundary
  - Very important for NLP

Inflection
- Inflection modifies a word's form in order to mark the grammatical subclass to which it belongs
  - apple (singular) > apples (plural)
- Inflection does not change the grammatical category (part of speech)
  - apple is a noun; apples is still a noun
- Inflection does not change the overall meaning
  - Both apple and apples refer to the fruit

Derivation
- Derivation creates a new word by changing the category and/or meaning of the base to which it applies
- Derivation can change the grammatical category (part of speech)
  - sing (verb) > singer (noun)
- Derivation can change the meaning
  - act of singing > one who sings
- Derivation is often limited to a certain group of words
  - You can Clintonize the government, but you can't Bushize the government
  - This restriction is partially phonological

Inflection & Derivation: Order
- Order is important when it comes to inflections and derivations
- Derivational suffixes must precede inflectional suffixes
  - sing + -er + -s is OK
  - sing + -s + -er is not
- This order may be used as a clue when working with natural language text

Inflection & Derivation in English
- English has few inflections
  - Many other languages use inflections to indicate the role of a word in the sentence
  - Use of case endings allows fairly free word order
  - English instead has a fixed word order: position in the sentence indicates the role of a word, so case endings are not necessary
  - This was not always true; Old English had many inflections
- English has many derivational affixes, and they are regularly used to form new words
  - Part of this is cultural: English speakers readily accept newly introduced terms
- For more details, see the examples in J&M, sections 3.1-3.3 (2nd ed.)

Classes of Words
- Closed classes are fixed: new words cannot be added
  - Pronouns, prepositions, comparatives, conjunctions, determiners (articles and demonstratives)
  - Function words
- Open classes are not fixed: new words can be added
  - Nouns, verbs, adjectives, adverbs
  - Content words
- New content words are a constant issue for NLP

Creation of New Words
- Derivation: adding prefixes or suffixes to form a new word
  - Clinton -> Clintonize
- Compounding: combining two existing words
  - home + page -> homepage
- Clipping: shortening a polysyllabic word
  - Internet -> net
- Acronyms: taking initial sounds or letters to form a new word
  - Self Contained Underwater Breathing Apparatus -> scuba
- Blending: combining parts of two words
  - motor + hotel -> motel
  - smoke + fog -> smog
- Backformation
  - resurrection -> resurrect

Word Formation Rules: Agreement
- Plurals
  - In English, the morpheme -s is often used to indicate plurals in nouns
  - Nouns and verbs must agree in number
- Gender: nouns, adjectives, and sometimes verbs in many languages are marked for gender
  - 2 genders (masculine and feminine) in Romance languages like French, Spanish, and Italian
  - 3 genders (masculine, feminine, and neuter) in Germanic and Slavic languages
  - Larger systems are called noun classes; Bantu languages have up to 20
  - Gender is sometimes explicitly marked on the word as a morpheme, but sometimes is just a property of the word

How does NLP make use of morphology?
- Stemming
  - Strip prefixes and/or suffixes to find the base root, which may or may not be an actual word
  - Spelling corrections are not made
- Lemmatization
  - Strip prefixes and/or suffixes to find the base root, which will always be an actual word
  - Spelling corrections are crucial
  - Often based on a word list, such as the one available from WordNet
- Part-of-speech guessing
  - Knowledge of the morphemes of a particular language can be a powerful aid in guessing the part of speech of an unknown term

Stemming
- Removal of affixes (usually suffixes) to arrive at a base form that may or may not constitute an actual word
- Continuum from very conservative to very liberal modes of stemming
  - Very conservative: remove only the plural -s
  - Very liberal: remove all recognized prefixes and suffixes
- Good resource: http://www.comp.lancs.ac.uk/computing/research/stemming/
- Example:
  - Input: for example compressed and compression are both accepted as equivalent to compress.
  - Stemmed: for exampl compress and compress ar both accept as equival to compress

Porter Stemmer
- Popular stemmer based on work done by Martin Porter
  - M.F. Porter. An algorithm for suffix stripping. Program 14(3), pp. 130-137, 1980.
- Very liberal stemmer with five steps applied in sequence (see the example rules on the next slide)
- Probably the most widely used stemmer
  - Has been incorporated into a number of Information Retrieval systems
- Does not require a lexicon
- Open source software is available for almost all programming languages

Examples of Porter Stemmer Rules

Step 1a
  sses -> ss          caresses -> caress
  ies  -> i           ponies -> poni
  ss   -> ss          caress -> caress
  s    -> ø           cats -> cat

Step 1b
  (*v*)ing -> ø       walking -> walk
                      sing -> sing
  (*v*)ed  -> ø       plastered -> plaster

Step 2 (for long stems)
  ational -> ate      relational -> relate
  izer    -> ize      digitizer -> digitize
  ator    -> ate      operator -> operate

Step 3 (for longer stems)
  al   -> ø           revival -> reviv
  able -> ø           adjustable -> adjust
  ate  -> ø           activate -> activ

Here (*v*) means that the stem must contain a vowel, which is why sing is left unchanged.
From Dan Jurafsky
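As a rough sketch, the Step 1a and Step 1b example rules listed on this slide can be written as ordered suffix tests. This covers only the rules shown here. The full Porter algorithm has more rules, a cleanup pass after Step 1b, and a finer definition of vowel (it sometimes counts y). The function names and the a/e/i/o/u vowel test are simplifying assumptions.

```python
import re

# Simplifying assumption: a vowel is one of a, e, i, o, u.
VOWEL = re.compile(r"[aeiou]")


def step_1a(word: str) -> str:
    """Apply the first matching Step 1a rule from the slide."""
    if word.endswith("sses"):
        return word[:-2]  # sses -> ss
    if word.endswith("ies"):
        return word[:-2]  # ies -> i
    if word.endswith("ss"):
        return word       # ss -> ss (unchanged)
    if word.endswith("s"):
        return word[:-1]  # s -> (nothing)
    return word


def step_1b(word: str) -> str:
    """Strip -ing or -ed only when the remaining stem contains a vowel."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return stem if VOWEL.search(stem) else word
    return word
```

The vowel condition is what keeps sing intact: removing -ing would leave the stem s, which contains no vowel, so the rule does not fire.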

Some other Stemmers for English
- Paice-Husk Stemmer
  - Simple iterative stemmer; rather heavy when used with the standard rule set
- Krovetz Stemmer
  - Light stemmer; removes inflections only
  - Removal of inflections is very accurate (it is actually a lemmatizer)
  - Often used as a first step before another stemmer, for increased compression
- Lovins Stemmer
  - Single-pass, context-sensitive, longest-match stemmer; not widely used
- Dawson Stemmer
  - Complex, linguistically targeted stemmer based on Lovins; not widely used

Lemmatization
- Removal of affixes (typically suffixes), but the goal is to find a base form that does constitute an actual word
- Example: parties -> remove -es, correct the spelling of the remaining form -> party
- Spelling corrections are often rule-based
- May use a lexicon to find actual words
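The parties -> party example can be sketched as suffix removal followed by a rule-based spelling correction, with a lexicon deciding whether a candidate is an actual word. The tiny lexicon, the suffix list, and the single i -> y correction rule below are hypothetical and only for illustration; a real lemmatizer would use a full word list such as WordNet's.

```python
# Hypothetical miniature lexicon standing in for a real word list.
LEXICON = {"party", "cat", "walk", "fly"}

# Suffixes to try, in order (an assumption made for this sketch).
SUFFIXES = ("es", "s", "ed", "ing")


def spelling_variants(stem: str):
    """Yield the stem as is, then with a rule-based spelling correction."""
    yield stem
    if stem.endswith("i"):
        yield stem[:-1] + "y"  # parti -> party


def lemmatize(word: str) -> str:
    """Return a base form found in the lexicon, or the word unchanged."""
    if word in LEXICON:
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            for candidate in spelling_variants(word[:-len(suffix)]):
                if candidate in LEXICON:
                    return candidate
    return word
```

Unlike a stemmer, this function never returns a non-word such as parti: if no candidate is in the lexicon, the input is returned unchanged.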

Guessing the Part of Speech
- English gains new words daily, and new words are a problem for many NLP systems
  - New words won't be found in the machine-readable dictionary (MRD) or lexicon, if one is used
- How might morphology be used to help solve this problem?
- What part of speech are these words?
  - clemness
  - foramtion
  - depickleated
  - outtakeable

Ambiguous Affixes
Some affixes are ambiguous:
- -er
  - Derivational: agentive -er, Verb + -er > Noun
  - Inflectional: comparative -er, Adjective + -er > Adjective
- -s or -es
  - Inflectional: plural, Noun + -(e)s > Noun
  - Inflectional: 3rd person singular, Verb + -(e)s > Verb
- -ing
  - Inflectional: progressive, Verb + -ing > Verb
  - Derivational: "act of", Verb + -ing > Noun
  - Derivational: "in process of", Verb + -ing > Adjective
As with all other ambiguity in language, this morphological ambiguity creates a problem for NLP.
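A suffix table is one simple way to guess the part of speech of an unknown word while keeping the ambiguity visible. The sketch below returns every candidate category for the longest matching suffix rather than picking one. The entries for -er, -(e)s, and -ing follow the slide above; the entries for -ness and -able are added assumptions, and the function name is hypothetical.

```python
# Longest suffixes first, so "ness" is tried before "s".
SUFFIX_TAGS = [
    ("ness", ["Noun"]),                      # assumption, not from the slide
    ("able", ["Adjective"]),                 # assumption, not from the slide
    ("ing", ["Verb", "Noun", "Adjective"]),  # progressive, "act of", "in process of"
    ("er", ["Noun", "Adjective"]),           # agentive or comparative
    ("es", ["Noun", "Verb"]),                # plural or 3rd person singular
    ("s", ["Noun", "Verb"]),
]


def guess_pos(word: str) -> list:
    """Return the candidate parts of speech suggested by the word's suffix."""
    for suffix, tags in SUFFIX_TAGS:
        if word.endswith(suffix) and len(word) > len(suffix):
            return list(tags)
    return []  # no recognized suffix: morphology gives no clue
```

For example, guess_pos("walking") returns three candidates. Choosing among them needs context from the sentence, which is where the ambiguity becomes a problem for later processing.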

Complex Morphology
Some languages require complex morpheme segmentation.
- Turkish: Uygarlastiramadiklarimizdanmissinizcasina
  - '(behaving) as if you are among those whom we could not civilize'
- Segmentation:
  - uygar 'civilized' + las 'become' + tir 'cause' + ama 'not able' + dik 'past' + lar 'plural' + imiz 'p1pl' + dan 'abl' + mis 'past' + siniz '2pl' + casina 'as if'

Importance of Punctuation
- So far we have discussed what words to keep and possible alternate forms of words through stemming and lemmatization
- Note that for further steps of language processing, we need to keep all the punctuation as tokens
- Punctuation determines the clauses of a sentence and can profoundly affect the meaning
- From the book Eats, Shoots and Leaves: The Zero Tolerance Approach to Punctuation by Lynne Truss, a collection of quotes:
  http://www.goodreads.com/work/quotes/854886-eats-shoots-leavesthe-zero-tolerance-approach-to-punctuation
- Another example, seen on a t-shirt:
  - Let's eat Grandma!
  - Let's eat, Grandma!
  - Commas save lives!
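Keeping punctuation as separate tokens can be sketched with a single regular expression. The pattern below is an assumption made for illustration, not a standard tokenizer: it treats a run of word characters (optionally with internal apostrophes, as in Let's) as one token and every other non-space character as its own token.

```python
import re

# Words (with optional internal apostrophes) or single punctuation marks.
TOKEN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def tokenize(text: str) -> list:
    """Split text into word tokens and punctuation tokens."""
    return TOKEN.findall(text)
```

On the t-shirt example, the two sentences differ by exactly one token, the comma, which is the information a later parsing step needs in order to tell them apart.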

Sentence Segmentation
Punctuation not only shows the internal structure of sentences, but is crucial in determining where sentences end.
- "!" and "?" are relatively unambiguous
- The period "." is quite ambiguous:
  - Sentence boundary
  - Abbreviations like Inc. or Dr.
  - Numbers like .02% or 4.3
- Treat this as a classification problem:
  - Look at a "." (or the word preceding the ".")
  - Decide EndOfSentence/NotEndOfSentence
  - Classifiers: hand-written rules, regular expressions, or machine learning
Slides in this section are from Dan Jurafsky.
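A hand-written-rules version of the EndOfSentence/NotEndOfSentence decision might look like the sketch below. It assumes whitespace tokenization, a two-entry abbreviation list taken from the examples above, and the heuristic that a sentence-final period is followed by a capitalized word. All names are hypothetical, and the rules are far from complete: for instance, an abbreviation that really does end a sentence is misclassified.

```python
# Abbreviations from the slide's examples; a real list would be much longer.
ABBREVIATIONS = {"inc.", "dr."}


def is_end_of_sentence(token: str, next_token) -> bool:
    """Classify a whitespace-delimited token as sentence-final or not."""
    if token.endswith(("!", "?")):
        return True   # relatively unambiguous
    if not token.endswith("."):
        return False  # covers numbers like 4.3 and .02%
    if token.lower() in ABBREVIATIONS:
        return False
    # Heuristic: a real boundary is followed by a capitalized word (or nothing).
    return next_token is None or next_token[0].isupper()


def split_sentences(text: str) -> list:
    """Group tokens into sentences using the classifier above."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if is_end_of_sentence(token, next_token):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

Because numbers such as 4.3 do not end in a period once the text is split on whitespace, they never reach the period rules at all.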

Classify whether a word is End-of-Sentence
An example of one way to classify is a decision tree. (The decision tree figure is not reproduced in this transcription.)

Classification Problem Features
- Each property used in the decision tree to decide which branch to take is usually called a feature of the word
- More sophisticated features for the end-of-sentence decision:
  - Word shape features
    - Case of the word with ".": Upper, Lower, Cap, Number
    - Look at the word after "." to see if it begins a new sentence. Case of the word after ".": Upper, Lower, Cap, Number
  - Numeric features
    - Length of the word with "."
    - Probability(word with "." occurs at end-of-s)
    - Probability(word after "." occurs at beginning-of-s)
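The features listed above can be collected into a dictionary that a decision tree or other classifier would consume. In this sketch the function names are hypothetical, the exact definitions of the four case categories are assumptions, and the two probability tables are passed in by the caller; in practice they would be estimated from a corpus with marked sentence boundaries.

```python
def word_case(word: str) -> str:
    """Map a word to one of the case categories: Upper, Lower, Cap, Number."""
    core = word.strip(".")
    if any(ch.isdigit() for ch in core):
        return "Number"
    if len(core) > 1 and core.isupper():
        return "Upper"  # all capitals, e.g. INC
    if core[:1].isupper():
        return "Cap"    # initial capital only, e.g. Dr
    return "Lower"


def period_features(word, next_word, p_end=None, p_start=None) -> dict:
    """Build the feature dictionary for a word ending in a period."""
    p_end = p_end or {}      # assumed table: P(word occurs at end of sentence)
    p_start = p_start or {}  # assumed table: P(word occurs at start of sentence)
    return {
        "case_of_word": word_case(word),
        "case_of_next": word_case(next_word),
        "length": len(word),  # length of the word including the "."
        "p_word_ends_sentence": p_end.get(word.lower(), 0.0),
        "p_next_starts_sentence": p_start.get(next_word.lower(), 0.0),
    }
```

For the pair ("Dr.", "Smith"), both case features come out as Cap and the length is 3, so the probability features are what would let a classifier separate this abbreviation from a genuine sentence boundary.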