in NLTK-Lite in Cass as Tagging Ewan Klein ewan@inf.ed.ac.uk ICL 14 November 2005
Problems with Full Parsing, 1
Goal: build a complete parse tree for a sentence.
- Coverage and ambiguity:
  - No complete grammar of any language (Sapir: "All grammars leak")
  - As coverage increases, so does ambiguity
  - Problem of ranking parses by degree of plausibility
- Low accuracy:
  - Unbounded dependencies are hard to parse
  - Errors tend to propagate
Problems with Full Parsing, 2
Speed:
- Complexity of rule-based chart parsing is O(n^3) in the length n of the sentence, multiplied by a factor of O(G^2), where G is the size of the grammar.
- Practical results are often better, but still slow for parsing large corpora (e.g., a billion words) in reasonable time.
- Finite state machines have worst-case complexity O(n) in the length of the string.
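To get a feel for the gap between the two bounds, here is a rough back-of-the-envelope sketch. The sentence length and grammar size are made-up illustrative values, and constant factors are ignored, so the numbers only indicate orders of magnitude.

```python
def chart_parsing_steps(n, grammar_size):
    """Order-of-magnitude bound for chart parsing: n^3 * G^2 (constants ignored)."""
    return n ** 3 * grammar_size ** 2


def finite_state_steps(n):
    """Order-of-magnitude bound for a finite state machine: n (constants ignored)."""
    return n


# Hypothetical values: a 20-word sentence and a grammar of size 100.
n, g = 20, 100
chart = chart_parsing_steps(n, g)
fsm = finite_state_steps(n)
print(chart, fsm, chart // fsm)
```

Under these assumed values the chart-parsing bound is several million times larger than the finite-state bound for a single sentence.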
Motivations for Parsing
Why parse sentences in the first place?
- Parsing is usually an intermediate stage in a larger processing framework.
- Full parsing is a sufficient but not necessary step for many NLP tasks.
- Full parsing often provides more information than we need or can deal with.
Partial Parsing / Chunking
Assign a partial structure to a sentence.
- Don't try to deal with all of language.
- Don't attempt to resolve all semantically significant decisions.
- Use deterministic grammars for easy-to-parse pieces, and other methods for other pieces, depending on the task.
  - easy to parse = no ambiguity and no recursion
- Partial parsing is usually:
  - easier to implement
  - more robust
  - faster
Chunking, 1
Goal: divide a sentence into a sequence of chunks.
- Abney (1994): [when I read] [a sentence], [I read it] [a chunk] [at a time]
- Chunks are non-overlapping regions of text:
    [walk] [straight past] [the lake]
- (Usually) each chunk contains a head, with the possible addition of some preceding function words and modifiers:
    [walk] [straight past] [the lake]
- Chunks are non-recursive: a chunk cannot contain another chunk of the same category.
Chunking, 2
- Chunks are non-exhaustive: some words in a sentence may not be grouped into a chunk.
    [take] [the second road] that [is] on [the left hand side]
- NP postmodifiers (e.g., PPs, relative clauses) are often recursive and/or structurally ambiguous: they are not included in noun chunks.
- Chunks are typically subsequences of constituents (they don't cross constituent boundaries):
  - noun groups: everything in NP up to and including the head noun
  - verb groups: everything in VP (including auxiliaries) up to and including the head verb
Chunk Parsing: Accuracy
Chunk parsing attempts to do less, but does it more accurately.
- Smaller solution space
- Less word-order flexibility within chunks than between chunks
- Better locality:
  - doesn't attempt to deal with unbounded dependencies
  - less context-dependence
  - doesn't attempt to resolve ambiguity; only does those things which can be done reliably:
      [the boy] [saw] [the man] [with a telescope]
- Less error propagation
Chunk Parsing: Domain Independence
Chunk parsing can be relatively domain independent, in that dependencies involving lexical or semantic information tend to occur at levels higher than chunks:
- attachment of PPs and other modifiers
- argument selection
- constituent re-ordering
Chunk Parsing: Efficiency
Chunk parsing is more efficient:
- smaller solution space
- relevant context is small and local
- chunks are non-recursive
- can be implemented with a finite state automaton (FSA)
- can be applied to very large text sources
Psycholinguistic Motivations
- Chunks as processing units: evidence that humans tend to read texts one chunk at a time
- Chunks are phonologically relevant:
  - prosodic phrase breaks
  - rhythmic patterns
- Chunking might be a first step in full parsing
Chunking with Regular Expressions, 1
- Assume input is tagged.
- Identify chunks (e.g., noun groups) by sequences of tags:
    announce  any  new  policy  measures  in  his ...
    VB        DT   JJ   NN      NNS       IN  PRP$
Chunking with Regular Expressions, 2
- Define rules in terms of tag patterns:
    rule = parse.chunkrule('<DT><JJ><NN><NNS>', 'Modified plural NPs')
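To make the idea of a rule over a fixed tag sequence concrete, here is a minimal pure-Python sketch. It does not use NLTK-Lite; `chunk_exact` is a hypothetical helper written only for illustration, and it handles nothing beyond an exact sequence of tags.

```python
def chunk_exact(tagged, tag_sequence):
    """Scan left to right and collect word groups whose tags equal tag_sequence."""
    n = len(tag_sequence)
    chunks, i = [], 0
    while i <= len(tagged) - n:
        window = tagged[i:i + n]
        if [tag for _, tag in window] == tag_sequence:
            chunks.append([word for word, _ in window])
            i += n          # chunks do not overlap, so skip past this one
        else:
            i += 1
    return chunks


sentence = [("announce", "VB"), ("any", "DT"), ("new", "JJ"), ("policy", "NN"),
            ("measures", "NNS"), ("in", "IN"), ("his", "PRP$")]
print(chunk_exact(sentence, ["DT", "JJ", "NN", "NNS"]))
# prints [['any', 'new', 'policy', 'measures']]
```

A fixed sequence like this is brittle: it finds "any new policy measures" but nothing that deviates from the four tags in that exact order.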
Chunking with Regular Expressions, 3
    rule = parse.chunkrule('<DT><JJ><NN><NNS>', 'Modified plural NPs')
Extending the example:
    in  his   Mansion  House  speech
    IN  PRP$  NNP      NNP    NN
- DT or PRP$: <DT|PRP$><JJ><NN><NNS>
- JJ and NN are optional: <DT|PRP$><JJ>*<NN>*<NNS>
- We can have NNPs: <DT|PRP$><JJ>*<NNP>*<NN>*<NNS>
- NN or NNS: <DT|PRP$><JJ>*<NNP>*<NN>*<NN|NNS>
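The final pattern can be tried out with an ordinary Python regular expression over an encoded tag string. This is only a sketch: `regex_chunks` is a hypothetical helper, the regex is a hand translation of the last pattern (written with standard `re` syntax rather than NLTK-Lite tag-pattern syntax), and NLTK-Lite's own matching may differ in detail.

```python
import re

# Hand translation of the last pattern on the slide into plain `re` syntax.
NP_PATTERN = r"(<DT>|<PRP\$>)(<JJ>)*(<NNP>)*(<NN>)*(<NN>|<NNS>)"


def regex_chunks(tagged, pattern):
    """Return the word groups whose tag sequence matches `pattern`."""
    encoded = "".join("<%s>" % tag for _, tag in tagged)
    # Map each character offset where a tag starts to its token index.
    index_at, pos = {}, 0
    for i, (_, tag) in enumerate(tagged):
        index_at[pos] = i
        pos += len(tag) + 2          # +2 for the angle brackets
    index_at[pos] = len(tagged)
    return [[word for word, _ in tagged[index_at[m.start()]:index_at[m.end()]]]
            for m in re.finditer(pattern, encoded)]


first = [("announce", "VB"), ("any", "DT"), ("new", "JJ"), ("policy", "NN"),
         ("measures", "NNS"), ("in", "IN"), ("his", "PRP$")]
second = [("in", "IN"), ("his", "PRP$"), ("Mansion", "NNP"),
          ("House", "NNP"), ("speech", "NN")]
print(regex_chunks(first, NP_PATTERN))
print(regex_chunks(second, NP_PATTERN))
```

Because every unit of the pattern is a complete `<TAG>`, a match can only begin and end on tag boundaries, which is what lets the offsets be mapped back to token positions.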
Tag Patterns in Chunk Rules
NLTK-Lite tag patterns are a special kind of regular expression:
- Use < > for grouping instead of ( ), e.g. <JJ>*, <NN|NNS>*
- The wildcard . never matches beyond tag boundaries, e.g. <NN.*> matches <NN> and <NNS>, but not <NN JJ>
- Whitespace is ignored in tag patterns, e.g. <NN | JJ> is equivalent to <NN|JJ>
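One way to picture these conventions is as a translation from a tag pattern into an ordinary regular expression. The sketch below is an illustration only, not NLTK-Lite's implementation: `tag_pattern_to_regex` is a hypothetical function, and it assumes alternatives inside angle brackets are separated by `|` and that `$` in a tag such as PRP$ is meant literally.

```python
import re


def tag_pattern_to_regex(tag_pattern):
    """Translate a tag pattern into a plain regex over strings like '<DT><NN>'."""
    p = re.sub(r"\s+", "", tag_pattern)        # whitespace is ignored
    p = p.replace("$", r"\$")                  # assumption: '$' is part of a tag name
    p = p.replace("<", "(?:<(?:").replace(">", ")>)")   # < > act as grouping
    p = p.replace(".", "[^<>]")                # wildcard cannot cross a tag boundary
    return p


wild = tag_pattern_to_regex("<NN.*>")
print(bool(re.fullmatch(wild, "<NN>")))        # single tag NN
print(bool(re.fullmatch(wild, "<NNS>")))       # single tag NNS
print(bool(re.fullmatch(wild, "<NN><JJ>")))    # two tags: the wildcard stops at '>'
print(tag_pattern_to_regex("<NN | JJ>") == tag_pattern_to_regex("<NN|JJ>"))
```

The order of the replacements matters: the wildcard is rewritten last, so that the angle brackets inside `[^<>]` are not themselves treated as grouping brackets.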
Chunk Grammars
Approach adopted in Cass (Abney):
- Recognition is carried out by a cascade of FSAs: the output of one is the input to another.
- Level 0: tagged words
- Level 1: all sequences at level 0 that match a given pattern are replaced by the appropriate label, e.g., date expressions are replaced by the label Date.
- Level n: do something with the output of Level n-1.
- Strings that don't match a pattern are just passed on unchanged.
CASS RegEx Grammar
Automata defined by a regular expression grammar:
    :chunks
      nx -> DT? NN+
      vx -> VBZ VBD BE VBG
    :phrases
      vp -> vx nx*
      pp -> IN nx
    :clause
      c -> pp* nx pp* vp pp*
CASS Example
    take/vbp the/dt road/nn on/in the/dt left/nn
    [vx take/vbp] [nx the/dt road/nn] on/in [nx the/dt left/nn]
    [vx take/vbp] [nx the/dt road/nn] [pp on/in [nx the/dt left/nn]]
    [c [vx take/vbp] [nx the/dt road/nn] [pp on/in [nx the/dt left/nn]]]
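The cascade idea can be sketched in a few lines of Python. This is not Cass and not its grammar format: `apply_rule`, `show` and the `CASCADE` rules are hypothetical, and the rules are a simplified set chosen only so that the example sentence passes through all three levels. Each level rewrites the sequence of labels produced by the level below; items that match no rule are passed on unchanged.

```python
import re

# An item is (label, body): body is a word for leaves, a list of items for nodes.
CASCADE = [   # hypothetical rules, one list per level
    [("nx", r"(<dt>)?(<nn>)+"), ("vx", r"<vbp>|<vbz>|<vbd>")],
    [("pp", r"<in><nx>")],
    [("c", r"<vx><nx>(<pp>)*")],
]


def apply_rule(items, label, pattern):
    """Replace every match of `pattern` over the label sequence by one new node."""
    encoded = "".join("<%s>" % item[0] for item in items)
    index_at, pos = {}, 0                 # offset of each label -> item index
    for i, item in enumerate(items):
        index_at[pos] = i
        pos += len(item[0]) + 2
    index_at[pos] = len(items)
    out, last = [], 0
    for m in re.finditer(pattern, encoded):
        i, j = index_at[m.start()], index_at[m.end()]
        out.extend(items[last:i])         # unmatched items pass through unchanged
        out.append((label, items[i:j]))
        last = j
    out.extend(items[last:])
    return out


def show(item):
    """Render a leaf as word/tag and a node as [label child child ...]."""
    label, body = item
    if isinstance(body, str):
        return "%s/%s" % (body, label)
    return "[%s %s]" % (label, " ".join(show(child) for child in body))


items = [("vbp", "take"), ("dt", "the"), ("nn", "road"),
         ("in", "on"), ("dt", "the"), ("nn", "left")]
stages = []
for level in CASCADE:
    for label, pattern in level:
        items = apply_rule(items, label, pattern)
    stages.append(" ".join(show(item) for item in items))
print("\n".join(stages))
```

With these assumed rules the three printed stages have the same bracketing as the three derived lines of the example. Rules within a level are applied one after another here, which is a simplification of running them as a single automaton.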
CoNLL Notation for Chunks
Instead of using bracketing, as in
    announce [any new policy measures] in [his ...
we tag words according to where they are in a chunk:
    announce  any   new   policy  measures  in  his ...
    VB        DT    JJ    NN      NNS       IN  PRP$
    O         B-NP  I-NP  I-NP    I-NP      O   B-NP
where B-NP is Begin noun chunk, I-NP is Inside noun chunk and O is Outside any chunk.
- Known both as BIO and IOB tagging
- Used in CoNLL shared tasks
- Allows off-the-shelf statistical taggers to be used for chunking as well as POS tagging
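Converting a bracketed sentence into IOB tags is mechanical, as the following sketch shows. `to_iob` and the nested-list input format are hypothetical, introduced only for illustration, and the open-ended chunk beginning at "his" is treated here as a one-word chunk.

```python
def to_iob(items, chunk_type="NP"):
    """Turn a mix of (word, tag) pairs and chunks (lists of pairs) into IOB rows."""
    rows = []
    for item in items:
        if isinstance(item, list):            # a chunk: first word B-, the rest I-
            for k, (word, tag) in enumerate(item):
                prefix = "B-" if k == 0 else "I-"
                rows.append((word, tag, prefix + chunk_type))
        else:                                 # a word outside any chunk
            word, tag = item
            rows.append((word, tag, "O"))
    return rows


bracketed = [
    ("announce", "VB"),
    [("any", "DT"), ("new", "JJ"), ("policy", "NN"), ("measures", "NNS")],
    ("in", "IN"),
    [("his", "PRP$")],
]
for word, tag, iob in to_iob(bracketed):
    print(word, tag, iob)
```

Once chunks are expressed this way, each word carries exactly one label, so chunking can be treated as a tagging problem over (word, POS tag) pairs.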
Summary
- Chunking is less ambitious than full parsing, but more efficient.
- May be sufficient for many practical tasks:
  - Information Extraction
  - Question Answering
  - Extracting subcategorization frames
  - Providing features for machine learning, e.g., for building Named Entity recognizers
- Two main approaches:
  1. Regular expressions over tag sequences
  2. Tagging with IOB tags
- Cass extends the regular expression approach using a cascade of finite state transducers.
Reading
- Jurafsky and Martin, Section 10.5
- NLTK-Lite Chunk Parsing Tutorial
- Steven Abney. Parsing By Chunks. In: Robert Berwick, Steven Abney and Carol Tenny (eds.), Principle-Based Parsing. Kluwer Academic Publishers, Dordrecht. 1991.
- Steven Abney. Partial Parsing via Finite-State Cascades. J. of Natural Language Engineering, 2(4): 337-344. 1996.
- Abney's publications: http://www.vinartus.net/spa/publications.html
Extra Tutorial
Extra tutorial on writing tag patterns: 5.00pm, Tuesday 15th Nov, HCRC Seminar Room, 2 Buccleuch Place