Parsing Morphologically Rich Languages


1 / 39 Parsing Morphologically Rich Languages
Sandra Kübler, Indiana University

2 / 39 Parsing Morphologically Rich Languages

joint work with Daniel Dakota, Wolfgang Maier, Joakim Nivre, Djamé Seddah, Reut Tsarfaty, Daniel Whyatt, and many more

Definition: a morphologically rich language expresses multiple levels of information at the word level, adding information about the grammatical function of a word, its grammatical relations to other words, pronominal clitics, inflectional affixes, etc.

Dan Bikel's classification (2010):
- morphologically clean: Chinese, English, ...
- morphologically dirty: German, Hungarian, ...
- morphologically filthy: Arabic, Hebrew, ...

3 / 39 Challenges

- segmentation: for some languages, words do not always correspond to ideal input tokens for parsing (multi-word expressions, syncretism)
- morphology: how do we integrate morphology into the parsing step?
- lexicon: how do we cope with the lower token/type ratio, i.e., each word form being observed less often?
- language vs. annotation scheme: are some languages harder to parse, or is it the annotation scheme?
- language-independent parsing?

4 / 39 Question: How do we integrate morphology into the parsing process?

5 / 39

- morphologically rich languages need morphology for parsing; e.g., in German, case is an indicator of the grammatical function of NPs
- morphological information can be attached to POS tags; in a pipeline (parsing on top of POS tags), this information can then be used by the parser (see the sketch below)
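As a sketch of what "attaching morphological information to POS tags" can look like in a pipeline (illustration only; the tagger interface and function names are mine, and the "%" separator follows the STTSmorph notation used on the later slides):

```python
# Illustration only (not code from the talk): attach morphological features
# to POS tags so a pipeline parser can condition on them (e.g., case).
# The tag_with_morph interface is a hypothetical stand-in.

def attach_morph(tokens, tag_with_morph):
    """Re-tag tokens as TAG%MORPH, e.g. NN -> NN%Acc.Sg.Neut."""
    enriched = []
    for word, (pos, morph) in zip(tokens, tag_with_morph(tokens)):
        enriched.append((word, f"{pos}%{morph}" if morph else pos))
    return enriched

# Toy run with gold analyses for "dieses Buch" from example (1) below:
gold = lambda toks: [("PDAT", "Acc.Sg.Neut"), ("NN", "Acc.Sg.Neut")]
print(attach_morph(["dieses", "Buch"], gold))
# [('dieses', 'PDAT%Acc.Sg.Neut'), ('Buch', 'NN%Acc.Sg.Neut')]
```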

6 / 39 Example: NP and Case

[constituent tree figure: a TiGer-style tree over sentence (1), with grammatical function edges (SB, OA, MO, NK, ...), POS tags, and case features such as Acc.Sg.Neut and Nom.Pl.*]

(1) dieses Buch finden vor allem diejenigen schwierig, die am meisten Bildung haben, vor allem psychoanalytische Bildung (...)
'this book is difficult, especially for those who have a higher education, especially a higher education in psychoanalysis (...)'

7 / 39 Question: What exactly is the effect of varying the morphological granularity of POS tags on both POS tagging and parsing?

- How well do the different POS taggers work with tagsets of varying morphological granularity?
- Do the differences in POS tagger performance translate into similar differences in parsing quality?

8 / 39 Treebanks and Parser

Treebanks:
- TiGer version 2.2: last 5000/5000 sentences for dev/test, rest for training
- TüBa-D/Z release 8: same number of sentences for training/dev/test, rest discarded

Parser: Berkeley Parser
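One plausible reading of that split as code (that dev precedes test at the end of the corpus is my assumption; the talk only gives the sizes):

```python
# Sketch of the split described above; the dev-before-test order at the end
# of the corpus is an assumption, not stated in the talk.
def split_treebank(sentences, n_dev=5000, n_test=5000):
    cut = len(sentences) - n_dev - n_test
    train = sentences[:cut]
    dev = sentences[cut : cut + n_dev]
    test = sentences[cut + n_dev :]
    return train, dev, test
```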

9 / 39 POS Taggers

- Morfette: averaged perceptron
- RF-Tagger: HMMs / decision trees for fine-grained tagsets
- Stanford Tagger: maximum entropy
- SVMTool: support vector machines
- TnT: cascaded HMMs
- Wapiti: CRFs

10 / 39 Tagset Variants

UTS: Universal Tagset, built for cross-language use; 12 tags
(2) Aber Bremerhavens AfB fordert jetzt Untersuchungsausschuß
    CONJ NOUN NOUN VERB ADJ NOUN
    'But the Bremerhaven AfB now demands a board of inquiry'
(3) Ausländische Investoren in Indien wieder willkommen
    ADJ NOUN ADP NOUN ADV ADJ
    'Foreign investors welcome again in India'

STTS: Stuttgart-Tübingen tagset, based on distributional regularities of German; 54 tags
(2) KON NE NE VVFIN ADV NN
(3) ADJA NN APPR NE ADV ADJD

STTSmorph: STTS with a morphological component; 585 tags available / 271 used in TiGer, 783 available / 761 used in TüBa-D/Z
(2) KON NE%gsn NE%nsf VVFIN%3sis ADV NN%asm
(3) ADJA%Pos.Nom.Pl.Masc NN%Nom.Pl.Masc APPR NE%Dat.Sg.Neut ADV ADJD%Pos

11 / 39 POS Tagging Evaluation (accuracy)

                           TiGer          TüBa-D/Z
Tagset       Tagger        dev    test    dev    test
UTS          Morfette      98.51  98.09   98.25  98.49
             RF-Tagger     97.89  97.41   97.69  97.96
             Stanford      97.88  96.83   97.11  97.26
             SVMTool       98.54  98.01   98.09  98.28
             TnT           97.94  97.48   97.72  97.92
             Wapiti        97.54  96.67   97.47  97.80
STTS         Morfette      94.12  93.23   92.95  93.41
             RF-Tagger     97.04  96.24   96.68  96.84
             Stanford      96.26  95.15   95.63  95.79
             SVMTool       97.06  96.22   96.46  96.69
             TnT           97.15  96.29   96.92  97.00
             Wapiti        92.93  91.62   90.99  91.81
STTSmorph    Morfette      82.71  80.10   81.19  82.26
             RF-Tagger     86.56  83.90   85.68  86.31
             Stanford      -      -       -      -
             SVMTool       82.47  79.53   80.33  81.31
             TnT           85.77  82.77   84.67  85.45
             Wapiti        79.83  75.92   77.27  78.29
STTSmorph→STTS TnT         97.08  96.15   96.78  96.82
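As a sketch of how these accuracies and the STTSmorph→STTS row can be computed (my code, not the talk's, using the %-notation from the previous slide):

```python
# Sketch: tagging accuracy, plus the back-mapping behind the table's last
# row, where STTSmorph predictions are scored as plain STTS tags by
# stripping the %-separated morphological suffix.

def accuracy(gold_tags, pred_tags):
    return 100.0 * sum(g == p for g, p in zip(gold_tags, pred_tags)) / len(gold_tags)

def to_stts(tag):
    """'NN%Nom.Pl.Masc' -> 'NN'; plain STTS tags pass through unchanged."""
    return tag.split("%", 1)[0]

gold = ["KON", "NE", "NE", "VVFIN", "ADV", "NN"]                # example (2), STTS
pred = ["KON", "NE%gsn", "NE%nsf", "VVFIN%3sis", "ADV", "NN%asm"]
print(accuracy(gold, [to_stts(t) for t in pred]))                # 100.0
```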

12 / 39 Parsing Results

TiGer test:
Tags    Tagset      POS    LP     LR     LF1
gold    UTS         99.97  71.80  70.26  71.02
        STTS        99.97  71.90  71.11  71.50
        STTSmorph   88.70  67.68  67.99  67.83
parser  UTS         97.83  71.13  69.50  70.30
        STTS        96.18  71.16  69.84  70.49
        STTSmorph   79.05  67.67  67.02  67.34
TnT     UTS         96.01  68.37  66.78  67.57
        STTS        96.19  71.16  69.84  70.49
        STTSmorph   75.05  65.43  64.78  65.10

TüBa-D/Z test:
Tags    Tagset      POS    LP     LR     LF1
gold    UTS         99.98  82.24  81.94  82.09
        STTS        99.99  84.54  84.46  84.50
        STTSmorph   90.55  83.57  79.91  81.70
parser  UTS         98.58  81.07  80.66  80.87
        STTS        97.39  82.93  82.78  82.85
        STTSmorph   81.68  81.89  78.20  80.00
TnT     UTS         98.58  81.07  80.66  80.87
        STTS        97.39  82.93  82.78  82.85
        STTSmorph   81.68  81.89  78.20  80.00
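The LP/LR/LF1 figures above are labeled bracket scores. A minimal evalb-style sketch (my simplification, not the evaluation script behind these numbers; real evalb treats the root and preterminals specially):

```python
# Trees are (label, children) tuples with token strings as leaves.
from collections import Counter

def bracket_spans(tree, start=0):
    """Return (Counter of (label, start, end) brackets, end position)."""
    label, children = tree
    out, i = Counter(), start
    for child in children:
        if isinstance(child, str):
            i += 1
        else:
            sub, i = bracket_spans(child, i)
            out += sub
    out[(label, start, i)] += 1
    return out, i

def labeled_prf(gold_tree, pred_tree):
    g, _ = bracket_spans(gold_tree)
    p, _ = bracket_spans(pred_tree)
    match = sum((g & p).values())          # multiset intersection
    lp = 100.0 * match / sum(p.values())   # labeled precision
    lr = 100.0 * match / sum(g.values())   # labeled recall
    return lp, lr, 2 * lp * lr / (lp + lr)
```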

13 / 39 Further Evaluation

We manually checked the parser outputs. Morphology sometimes helps, sometimes not:
- it can lead to the correct grammatical function label (e.g., case)
- it can lead to over-differentiation of grammatical function labels (cf. PP attachment)

Interaction between morphology and tree depth; comparing STTS vs. STTSmorph: substructures are too flat in TüBa-D/Z and too hierarchical in TiGer. This is confirmed by the number of edges:
- TiGer: more edges with STTSmorph than with STTS
- TüBa-D/Z: more edges with STTS than with STTSmorph

14 / 39 What did We Learn?

- POS tagging is easier with a less granular tagset
- the amount of morphology used for parsing needs to be just right
- even gold case is not useful for parsing as-is; we need different mechanisms for integrating case into the parse
- morphology seems to influence the depth of the trees in the parser output

15 / 39 Further Work

- Versley & Rehbein (2009) integrate subcategorization into constituent parsing
- Seeker & Kuhn (2013) use case as a filter in dependency parsing
- they show that the solution needs to be language dependent

16 / 39 Hard Language or Hard Annotation?

Question: If a parser performs worse on language A than on language B, does that mean A is more difficult than B, or does the difference lie in the annotation schemes?

17 / 39 Comparing German Treebanks: NEGRA/TIGER vs. TüBa-D/Z

Differences:
- TüBa-D/Z: more structure in phrases, topological fields, no traces, no empty categories
- NEGRA: very flat phrases, more structure on the S level, crossing branches

Approach: make TüBa-D/Z more similar to NEGRA (sketched below):
- flatten the phrase structure
- delete unary nodes
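A minimal sketch of the two conversions, assuming simple (label, children) tuple trees; the choice of which phrase labels to flatten is mine, not from the talk, and the original conversion scripts were certainly more careful:

```python
# Trees are (label, children) tuples, leaves are token strings.

def is_preterminal(t):
    return len(t[1]) == 1 and isinstance(t[1][0], str)

def remove_unary(tree):
    """Collapse unary phrasal chains X -> Y, keeping the POS level intact."""
    if is_preterminal(tree):
        return tree
    label, children = tree
    children = [c if isinstance(c, str) else remove_unary(c) for c in children]
    if (len(children) == 1 and not isinstance(children[0], str)
            and not is_preterminal(children[0])):
        return (label, children[0][1])     # adopt the grandchildren
    return (label, children)

def flatten(tree, phrase_labels=frozenset({"NP", "PP", "ADJP"})):
    """Splice same-category phrases into their parent (flat, NEGRA-style).
    The set of affected phrase labels is an assumption."""
    label, children = tree
    flat = []
    for c in children:
        if isinstance(c, str):
            flat.append(c)
        else:
            c = flatten(c, phrase_labels)
            if c[0] == label and label in phrase_labels:
                flat.extend(c[1])          # raise the child's daughters
            else:
                flat.append(c)
    return (label, flat)
```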

18 / 39 A NEGRA Tree

[tree figure] 'In the foyer of the town hall, the history of research on the Hochheimer Spiegel is presented next to the trove.'

19 / 39 A TüBa-D/Z Tree

[tree figure] 'The car convoy of the visitors of the rehearsal goes along a street which is even today called Lagerstraße.'

20 / 39 Comparing NEGRA and TüBa-D/Z

                           NEGRA   NEG+tr.  TüBa-D/Z
crossing brackets          1.04    1.03     1.93
func. labeled recall       52.75   49.03    73.65
func. labeled precision    51.85   50.49    76.13
func. labeled F-score      52.30   49.75    74.87
nodes/words (treebank)     0.88    0.88     2.38
nodes/words (parse)        0.62    0.63     1.30

21 / 39 A Flattened TüBa-D/Z Tree

[tree figure]

22 / 39 Making NEGRA More Similar to TüBa-D/Z

           cr. br.  LR     LP     F-score  % not parsed
NEGRA      1.04     52.75  51.85  52.30    12.59
NE field   1.21     69.85  69.53  69.19    2.17
TüBa       1.93     73.65  76.13  74.87    1.03
Tü NU      2.17     62.11  65.43  63.73    9.98
Tü flat    1.07     73.80  74.66  74.23    3.55
Tü fl N    1.29     53.63  58.87  56.13    18.87

23 / 39 What did We Learn?

- considerable differences between treebanks; unclear whether they stem from the annotation scheme, the evaluation metric, or differences in the texts
- unary nodes, more structure in phrases, and topological fields improve results
- more structure provides more coverage, but also more chances to make mistakes

24 / 39 Further Work

Rehbein & van Genabith (2007): similar experiments, different results. Possible explanations: different data sets (shorter sentences, different split); the evaluation metric favors trees with a high number of nodes.

Kübler et al. (2008): extend the work, evaluate with leaf-ancestor (LA) and on a converted dependency representation: TIGER better than TüBa-D/Z. BUT: LA is artificially high. BUT: the conversion is lossy, and the loss on parsing structures is unknown.
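A minimal sketch of the leaf-ancestor idea (the metric is Sampson's; this simplified version with a sequence-similarity ratio is mine, not the implementation used in the cited work):

```python
# Compare, for every leaf, the path of labels from root to leaf in the
# gold tree vs. the parsed tree. Trees are (label, children) tuples.
from difflib import SequenceMatcher

def lineages(tree, path=()):
    """One root-to-leaf label path per leaf, in left-to-right order."""
    label, children = tree
    out = []
    for child in children:
        if isinstance(child, str):
            out.append(path + (label,))
        else:
            out.extend(lineages(child, path + (label,)))
    return out

def leaf_ancestor(gold_tree, pred_tree):
    g, p = lineages(gold_tree), lineages(pred_tree)
    assert len(g) == len(p), "identical tokenization required"
    sims = [SequenceMatcher(None, a, b).ratio() for a, b in zip(g, p)]
    return 100.0 * sum(sims) / len(sims)
```

One intuition for the slide's remark that LA comes out artificially high: every leaf is scored by a whole path of labels, and even fairly wrong trees share long path prefixes with the gold tree.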

25 / 39 Question: What can we learn from a shared task scenario with 9 languages and aligned constituent and dependency representations?

26 / 39 Goals of the Shared Task

A clear view of the state of the art, regardless of the framework (constituency or dependency), in a realistic parsing scenario, with the most accurate evaluation protocol we could reasonably set up, on as many MRLs as possible.

Trying to assess the remaining challenges:
- in parsing MRLs
- in evaluating them

27 / 39 Data Sets

9 languages:
- Semitic: Arabic, Hebrew
- Romance: French
- Germanic: German, Swedish
- Isolates: Basque, Korean
- Uralic: Hungarian
- Slavic: Polish

Available in two syntactic representations, constituents (ptb) and dependency structures (conll):
- aligned at all levels (token, POS, sentence)
- containing at least the same morphological information
- available with gold and predicted morphology

Training sets: full and reduced (5k sentences) size

28 / 39 Evaluation Protocol: 3 Scenarios

- Gold: provided: unambiguous gold morphological segmentation, POS tags, and morphological features
- Predicted: provided: disambiguated morphological segmentation; unknown: POS tags and morphological features
- Raw: provided: morphologically ambiguous input; unknown: morphological segmentation, morphological features, and POS tags

Note: for all languages but Arabic and Hebrew, Raw = Predicted.

29 / 39 Evaluation Protocol (2)

Evaluation metrics operating in different dimensions.

Cross-parser evaluation in the Gold/Predicted scenarios:
- constituents: Evalb labeled F-score; LeafAncestor's macro-averaged accuracy
- dependencies: Eval07 labeled attachment scores
- also: MWE evaluation scores (for French, on dependency structures)

Cross-parser evaluation in the Raw scenario:
- standard metrics are not applicable with non-gold tokenization
- instead: TedEval's labeled accuracy (Tsarfaty et al., 2012) on sentences of up to 70 tokens

30 / 39 Evaluation Protocol (3)

Evaluation metrics operating in different dimensions (2).

Cross-framework evaluation:
- compare results of dependency and constituent parsers
- use the unlabeled TedEval metric: it internally converts all representation types into a normalized function tree

Cross-language evaluation:
- compare parsers for the same representation type across different languages
- the unlabeled TedEval metric is a reasonable approximation

31 / 39 7 Teams / 20 Systems

1. IMS-SZEGED-CIS: ensemble system (strong POS tagging, morphological lexicon, mate + turbo parser, (re)ranker, constituent features)
2. ALPAGE-DYALOG: transition-based parsing + beam + lattices
3. MALTOPTIMIZER: MaltOptimizer + automatic feature selection and splitting
4. AI-KU (multilingual): MaltOptimizer + unlabeled data (word clustering)
5. BASQUE-TEAM (multilingual): ensemble system (MaltBlender) + MaltOptimizer + efficient feature selection
6. IGM-ALPAGE (French): CRF MWE tagger + lexica + voting system (Mate, pipeline and joint)
7. CADIM (Arabic): easy-first parsing + rich lexicon and rich morphological features

(the voting idea behind the ensemble entries is sketched below)
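A minimal sketch of per-token voting (my simplification of MaltBlender-style blending, not any team's code; real blenders weight the votes and decode a well-formed tree, e.g., with the Chu-Liu/Edmonds maximum spanning tree algorithm):

```python
# Each member parser proposes a head for every token; the ensemble takes
# the per-token majority vote. Note this can yield an ill-formed tree.
from collections import Counter

def vote_heads(predictions):
    """predictions: one head array per parser; heads[i] is the head
    position of token i (0 = artificial root)."""
    n = len(predictions[0])
    return [Counter(heads[i] for heads in predictions).most_common(1)[0][0]
            for i in range(n)]

# Three hypothetical parsers, four-token sentence:
print(vote_heads([[2, 0, 2, 3], [2, 0, 2, 2], [0, 0, 2, 3]]))  # [2, 0, 2, 3]
```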

32 / 39 Results: Dependency (LAS), Predicted Scenario (full)

[bar chart, LAS roughly 69-89: scores per language (Arabic, Basque, French, German, Hebrew, Hungarian, Korean, Polish, Swedish) and soft average for AI-KU, ALPAGE_DYALOG, BASELINE_MALT, BASQUE_TEAM, CADIM, IGM-ALPAGE, IMS-SZEGED-CIS, MALTOPTIMIZER]

33 / 39 Results: Constituents (F1), Predicted Scenario (full)

                     Arabic  Basque  French  German  Hebrew  Hungarian  Korean  Polish  Swedish
BASELINE_BKY_RAW     79.19   70.50   80.38   78.30   86.96   81.62      71.42   79.23   79.18
BASELINE_BKY_TAGGED  78.66   74.74   79.76   78.28   85.42   85.22      78.56   86.75   80.64
IMS_SZEGED           81.32   87.86   81.83   81.27   89.46   91.85      84.27   87.55   83.99

34 / 39 Results (TedEval Unlabeled Accuracy): Raw Scenario (full)

[bar chart, accuracy 84-93: Arabic (full), Arabic (5k), and Hebrew (5k) for IMS-SZEGED-CIS (Const), IMS-SZEGED-CIS (Dep), ALPAGE_DYALOG, MALTOPTIMIZER, CADIM, ALPAGE_DYALOG_RAWLAT, AI-KU]

35 / 39 Correlation: Label Set Size, Training Set Size, and LAS

[scatter plot: mean LAS (72-90) against label set size (log scale, 10-1000), one point per language in the gold (G) and predicted (P) settings, e.g., KoG = Korean/gold, FrP = French/predicted]

36 / 39 Cross-Language Evaluation (Dependency)

[bar chart, unlabeled TedEval accuracy 91-99: per-language scores (Arabic, Basque, French, German, Hebrew, Hungarian, Korean, Polish, Swedish) for IMS_SZEGED_CIS-DEP, ALPAGE_DYALOG, BASELINE_MALT, AI-KU, MALTOPTIMIZER, CADIM]

37 / 39 Cross-Framework Evaluation (Dep. + Const.)

[bar chart, unlabeled TedEval accuracy 89-99: per-language scores plus average for IMS_SZEGED_CIS-DEP, IMS_SZEGED_CIS-CONST, BASELINE-CONST, BASELINE_MALT]

38 / 39 What did We Learn?

- parser (re-)ranking works best across languages
- there are clear differences between languages
- dependencies are not always better; BUT: the cross-language / cross-framework comparison is based on unlabeled evaluation
- shared task 2014: more, automatically labeled training data (out of domain) does not help; exception: languages with small data sets (mostly Swedish)

39 / 39 Where Do We Go from Here?

- segmentation: lattice parsing? more discriminative parsing? integrate multi-word expressions
- morphology: morphology needs to be integrated in a useful, language-specific manner; more discriminative parsing? better word clustering?
- lexicon: better clustering?
- language vs. annotation: adaptive parsers? universal annotation???
- language independence: better feature engineering? get away from reranking?