Chunking. Ewan Klein, ICL, 14 November 2005

Ewan Klein (ewan@inf.ed.ac.uk), ICL, 14 November 2005

Problems with Full Parsing, 1
Goal: Build a complete parse tree for a sentence.
- Coverage and ambiguity:
  - No complete grammar of any language exists (Sapir: "All grammars leak")
  - As coverage increases, so does ambiguity
  - Problem of ranking parses by degree of plausibility
- Low accuracy:
  - Unbounded dependencies are hard to parse
  - Errors tend to propagate

Problems with Full Parsing, 2
Speed:
- The complexity of rule-based chart parsing is O(n^3) in the length n of the sentence, multiplied by a factor of O(G^2), where G is the size of the grammar.
- Practical results are often better, but still slow for parsing large corpora (e.g., a billion words) in reasonable time.
- Finite state machines have worst-case complexity O(n) in the length of the string.

Motivations for Parsing
Why parse sentences in the first place?
- Parsing is usually an intermediate stage in a larger processing framework.
- Full parsing is a sufficient but not necessary step for many NLP tasks.
- Full parsing often provides more information than we need or can deal with.

Partial Parsing / Chunking
Assign a partial structure to a sentence.
- Don't try to deal with all of language.
- Don't attempt to resolve all semantically significant decisions.
- Use deterministic grammars for easy-to-parse pieces, and other methods for other pieces, depending on the task.
  easy to parse = no ambiguity and no recursion
Partial parsing is usually:
- easier to implement
- more robust
- faster

Chunking, 1
Goal: Divide a sentence into a sequence of chunks.
- Abney (1994): [when I read] [a sentence], [I read it] [a chunk] [at a time]
- Chunks are non-overlapping regions of text:
  [walk] [straight past] [the lake]
- (Usually) each chunk contains a head, with the possible addition of some preceding function words and modifiers:
  [walk] [straight past] [the lake]
- Chunks are non-recursive: a chunk cannot contain another chunk of the same category.

Chunking, 2
- Chunks are non-exhaustive: some words in a sentence may not be grouped into a chunk.
  [take] [the second road] that [is] on [the left hand side]
- NP postmodifiers (e.g., PPs, relative clauses) are often recursive and/or structurally ambiguous, so they are not included in noun chunks.
- Chunks are typically subsequences of constituents (they don't cross constituent boundaries).

Chunk Parsing: Accuracy
Chunk parsing attempts to do less, but does it more accurately.
- Smaller solution space
- Less word-order flexibility within chunks than between chunks
- Better locality:
  - doesn't attempt to deal with unbounded dependencies
  - less context-dependence
- Doesn't attempt to resolve ambiguity: only do those things which can be done reliably
  [the boy] [saw] [the man] [with a telescope]
- Less error propagation

Chunk Parsing: Domain Independence
Chunk parsing can be relatively domain independent, in that dependencies involving lexical or semantic information tend to occur at levels higher than chunks:
- attachment of PPs and other modifiers
- argument selection
- constituent re-ordering

Chunk Parsing: Efficiency
Chunk parsing is more efficient:
- smaller solution space
- relevant context is small and local
- chunks are non-recursive
- can be implemented with a finite state automaton (FSA)
- can be applied to very large text sources

Psycholinguistic Motivations
- Chunks as processing units: evidence that humans tend to read texts one chunk at a time
- Chunks are phonologically relevant:
  - prosodic phrase breaks
  - rhythmic patterns
- Chunking might be a first step in full parsing

Chunking with Regular Expressions, 1
- Assume input is tagged.
- Identify chunks (e.g., noun groups) by sequences of tags:
  announce  any  new  policy  measures  in  his ...
  VB        DT   JJ   NN      NNS       IN  PRP$

Chunking with Regular Expressions, 2
- Define rules in terms of tag patterns.

Chunking with Regular Expressions, 3
rule = parse.ChunkRule('<DT><JJ><NN><NNS>', 'Modified plural NPs')
Extending the example:
  in  his   Mansion  House  speech
  IN  PRP$  NNP      NNP    NN
- DT or PRP$: <DT|PRP$><JJ><NN><NNS>
- JJ and NN are optional: <DT|PRP$><JJ>*<NN>*<NNS>
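The idea of matching a rule against a sequence of tags can be sketched in plain Python. This is only an illustration, not NLTK-Lite's implementation: the names chunk_by_tags and NP_REGEX are hypothetical, and the tag pattern <DT|PRP$><JJ>*<NN>*<NNS> has been translated by hand into an ordinary regular expression over a string of bracketed tags.

```python
import re

# Hand-translated version of the tag pattern <DT|PRP$><JJ>*<NN>*<NNS>
# (an assumption made for this sketch; NLTK-Lite does its own translation).
NP_REGEX = r"(?:<DT>|<PRP\$>)(?:<JJ>)*(?:<NN>)*<NNS>"

def chunk_by_tags(tagged, tag_regex):
    """Return (start, end) token spans whose tag sequence matches tag_regex."""
    # Record the character offset at which each token's tag starts, so that
    # match offsets in the tag string can be mapped back to token indices.
    offsets, pos = {}, 0
    for i, (_, tag) in enumerate(tagged):
        offsets[pos] = i
        pos += len(tag) + 2          # +2 for the surrounding < and >
    offsets[pos] = len(tagged)
    tag_string = "".join("<%s>" % tag for _, tag in tagged)
    return [(offsets[m.start()], offsets[m.end()])
            for m in re.finditer(tag_regex, tag_string)]

sentence = [("announce", "VB"), ("any", "DT"), ("new", "JJ"),
            ("policy", "NN"), ("measures", "NNS"), ("in", "IN"),
            ("his", "PRP$")]

for start, end in chunk_by_tags(sentence, NP_REGEX):
    print([word for word, _ in sentence[start:end]])
# prints ['any', 'new', 'policy', 'measures']
```

The final his/PRP$ is not chunked here because the pattern requires a plural noun (NNS) to close the chunk.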

Tag Patterns in Chunk Rules
NLTK-Lite tag patterns are a special kind of regular expression:
- Use < > for grouping instead of ( ), e.g. <JJ>*, <NN|NNS>*
- The wildcard . never matches beyond tag boundaries, e.g. <NN.*> matches <NN> and <NNS>, but not <NN JJ>
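One way to see why the wildcard stays inside a tag is to translate a tag pattern into an ordinary regular expression over bracketed tags. The sketch below is an assumption about how such a translation could work, not NLTK-Lite's actual code; tag_pattern_to_regex is a hypothetical name, and a sequence of two tags is encoded here as "<NN><JJ>".

```python
import re

def tag_pattern_to_regex(tag_pattern):
    """Translate a tag pattern such as '<DT|PRP$><JJ>*<NN.*>' to a Python regex."""
    def convert(match):
        parts = []
        for ch in match.group(1):
            if ch == ".":
                parts.append("[^<>]")        # wildcard cannot cross a tag boundary
            elif ch in "|*+?":
                parts.append(ch)             # regex operators pass through
            else:
                parts.append(re.escape(ch))  # literal tag characters, e.g. $
        # Each <...> becomes one group, so a following * or + applies to the tag.
        return "(?:<(?:%s)>)" % "".join(parts)
    return re.sub(r"<([^<>]+)>", convert, tag_pattern)

noun = re.compile(tag_pattern_to_regex("<NN.*>"))
print(bool(noun.fullmatch("<NN>")))      # True
print(bool(noun.fullmatch("<NNS>")))     # True
print(bool(noun.fullmatch("<NN><JJ>")))  # False: . does not match > or <
```

Because . is rewritten as "any character except < and >", the expression NN.* can extend a tag name but can never swallow a neighbouring tag.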

Chunk Grammars
Approach adopted in Cass (Abney):
- Recognition is carried out by a cascade of FSAs: the output of one is the input to another.
- Level 0: tagged words
- Level 1: all sequences at level 0 that match a given pattern are replaced by the appropriate label, e.g., date expressions are replaced by the label Date
- Level n: do something with the output of Level n-1
- Strings that don't match a pattern are just passed on unchanged.

CASS RegEx Grammar
Automata are defined by a regular expression grammar:
  :chunks
    nx -> DT? NN+
    vx -> VBZ VBD BE VBG
  :phrases
    vp -> vx nx*
    pp -> IN nx
  :clause
    c -> pp* nx pp* vp pp*

CASS Example
  take/vbp the/dt road/nn on/in the/dt left/nn
  [vx take/vbp] [nx the/dt road/nn] on/in [nx the/dt left/nn]
  [vx take/vbp] [nx the/dt road/nn] [pp on/in [nx the/dt left/nn]]
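The cascade in this example can be sketched as a loop over levels, where each level rewrites the sequence produced by the previous one. This is a rough illustration in plain Python rather than Cass itself. The names LEVELS, apply_rule, cascade and show are hypothetical, and the rule bodies are simplified stand-ins chosen to reproduce the example: vx is assumed to cover vbp, and only the pp rule is used at the phrase level.

```python
import re

# Each level is a list of (label, regex over bracketed category symbols).
LEVELS = [
    [("nx", r"(?:<dt>)?(?:<nn>)+"), ("vx", r"<vbp>|<vbz>|<vbd>")],  # chunks
    [("pp", r"<in><nx>")],                                           # phrases
]

def apply_rule(nodes, label, pattern):
    """Replace every run of nodes whose categories match pattern by one node."""
    # A node is (category, payload); payload is a word or a list of child nodes.
    starts, pos = {}, 0
    for i, (cat, _) in enumerate(nodes):
        starts[pos] = i
        pos += len(cat) + 2
    starts[pos] = len(nodes)
    cats = "".join("<%s>" % cat for cat, _ in nodes)
    out, last = [], 0
    for m in re.finditer(pattern, cats):
        if m.start() == m.end():
            continue                          # ignore empty matches
        i, j = starts[m.start()], starts[m.end()]
        out.extend(nodes[last:i])             # unmatched nodes pass on unchanged
        out.append((label, nodes[i:j]))
        last = j
    out.extend(nodes[last:])
    return out

def cascade(tagged, levels=LEVELS):
    nodes = [(tag, word) for word, tag in tagged]   # level 0: tagged words
    for rules in levels:
        # Simplification: rules within a level are applied one after another.
        for label, pattern in rules:
            nodes = apply_rule(nodes, label, pattern)
    return nodes

def show(node):
    cat, payload = node
    if isinstance(payload, str):
        return "%s/%s" % (payload, cat)
    return "[%s %s]" % (cat, " ".join(show(child) for child in payload))

sentence = [("take", "vbp"), ("the", "dt"), ("road", "nn"),
            ("on", "in"), ("the", "dt"), ("left", "nn")]
print(" ".join(show(node) for node in cascade(sentence)))
# prints [vx take/vbp] [nx the/dt road/nn] [pp on/in [nx the/dt left/nn]]
```

Passing only the first level, cascade(sentence, LEVELS[:1]), gives the intermediate line of the example, with on/in still outside any chunk.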

CONLL Notation for Chunks
Instead of using bracketing, as in
  announce [any new policy measures] in [his ...
we tag words according to where they are in a chunk:
  announce  any   new   policy  measures  in  his ...
  VB        DT    JJ    NN      NNS       IN  PRP$
  O         B-NP  I-NP  I-NP    I-NP      O   B-NP
where B-NP is Begin noun chunk, I-NP is Inside noun chunk and O is Outside any chunk.
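The two notations carry the same information, which a small conversion sketch makes concrete. The function names spans_to_iob and iob_to_spans are hypothetical, chunks are given as (start, end) token spans, and only a single chunk category is handled.

```python
def spans_to_iob(n_tokens, spans, label="NP"):
    """Turn bracketed chunks, given as (start, end) spans, into IOB tags."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

def iob_to_spans(tags):
    """Recover (start, end) chunk spans from a list of IOB tags."""
    spans, start = [], None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes a final chunk
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i))
            start = None
        if tag.startswith("B-"):
            start = i
    return spans

words = ["announce", "any", "new", "policy", "measures", "in", "his"]
chunks = [(1, 5), (6, 7)]          # [any new policy measures] and [his ...
tags = spans_to_iob(len(words), chunks)
print(tags)
# prints ['O', 'B-NP', 'I-NP', 'I-NP', 'I-NP', 'O', 'B-NP']
print(iob_to_spans(tags) == chunks)
# prints True
```

Because every word receives exactly one tag, chunking in this notation can be treated as a tagging problem.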

Summary
- Chunking is less ambitious than full parsing, but more efficient.
- It may be sufficient for many practical tasks:
  - Information Extraction
  - Question Answering
  - Extracting subcategorization frames
  - Providing features for machine learning, e.g., for building Named Entity recognizers
- Two main approaches:
  1. Regular expressions over tag sequences
  2. Tagging with IOB tags

Reading
- Jurafsky and Martin, Section 10.5
- NLTK-Lite Chunk Parsing Tutorial
- Steven Abney. Parsing By Chunks. In: Robert Berwick, Steven Abney and Carol Tenny (eds.), Principle-Based Parsing. Kluwer Academic Publishers, Dordrecht. 1991.
- Steven Abney. Partial Parsing via Finite-State Cascades. J. of Natural Language Engineering, 2(4): 337-344. 1996.
- Abney's publications: http://www.vinartus.net/spa/publications.html

Extra Tutorial
Extra tutorial on writing tag patterns: 5.00pm, Tuesday 15th Nov, HCRC Seminar Room, 2 Buccleuch Place
