Treebank mining with GrETEL
Liesbeth Augustinus, Frank Van Eynde
GrETEL tutorial - 27 March, 2015
GrETEL
Greedy Extraction of Trees for Empirical Linguistics
Search engine for treebanks
Treebank = syntactically annotated corpus
  o Penn Treebank (English)
  o TüBa (German)
  o LASSY, CGN, SoNaR (Dutch)
NEDERBOOMS
Exploitation of Dutch treebanks for research in linguistics
CLARIN project, October 2010 - February 2012
Goals:
  o User-friendly tools
  o Fast and accurate
Result:
  o GrETEL 1.0
  o http://nederbooms.ccl.kuleuven.be
GrETEL 2.0
Update of GrETEL 1.0
CLARIN project, June 2013 - July 2014
Goals:
  o Improve GUI
  o Make more data accessible
Result:
  o GrETEL 2.0
  o http://gretel.ccl.kuleuven.be
TREEBANKS
CGN treebank
  o Spoken Dutch
  o Stylistic & regional differences: conversations vs read texts, NL vs VL
  o ± 1M words, 130k sentences
  o Manually corrected
LASSY small
  o Written Dutch
  o Stylistic differences: Wikipedia vs legal texts
  o ± 1M words, 65k sentences
  o Manually corrected
SoNaR
  o Written Dutch
  o Stylistic differences: Wikipedia vs legal texts
  o ± 500M words, 41M sentences
  o Not corrected
GrETEL
Search engine for treebanks
Treebank = syntactically annotated corpus
Parser
  o E.g. Alpino (Van Noord 2006)
ALPINO PARSER
Dit is een zin. (This is a sentence.)  >>  ALPINO parser  >>  XML trees
Query language: XPath
XPATH
  //node[@cat="smain"
         and node[@rel="su" and @pt="vnw" and @lemma="dit"]
         and node[@rel="hd" and @pt="ww" and @lemma="zijn"]
         and node[@rel="predc" and @cat="np"
                  and node[@rel="det" and @pt="lid" and @lemma="een"]
                  and node[@rel="hd" and @pt="n" and @lemma="zin"]]]
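To make the query above concrete, here is a minimal sketch of what it selects. The XML is a hand-written, heavily simplified stand-in for an Alpino parse of "Dit is een zin." (an assumption: real Alpino output carries more attributes per node). Because Python's standard `xml.etree.ElementTree` only supports a small XPath subset, the query is restated as a nested pattern and checked with a small hypothetical helper, `matches`; it is not part of GrETEL or Alpino.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written stand-in for an Alpino parse (assumption, not real output).
SENTENCE_XML = """
<alpino_ds>
  <node cat="top" rel="top">
    <node cat="smain" rel="--">
      <node rel="su" pt="vnw" lemma="dit" word="Dit"/>
      <node rel="hd" pt="ww" lemma="zijn" word="is"/>
      <node rel="predc" cat="np">
        <node rel="det" pt="lid" lemma="een" word="een"/>
        <node rel="hd" pt="n" lemma="zin" word="zin"/>
      </node>
    </node>
    <node rel="--" pt="let" lemma="." word="."/>
  </node>
</alpino_ds>
""".strip()

# The slide's query as (required attributes, required child patterns).
PATTERN = (
    {"cat": "smain"},
    [
        ({"rel": "su", "pt": "vnw", "lemma": "dit"}, []),
        ({"rel": "hd", "pt": "ww", "lemma": "zijn"}, []),
        ({"rel": "predc", "cat": "np"}, [
            ({"rel": "det", "pt": "lid", "lemma": "een"}, []),
            ({"rel": "hd", "pt": "n", "lemma": "zin"}, []),
        ]),
    ],
)


def matches(node, pattern):
    """True if node has all required attributes and, for every child
    pattern, at least one direct <node> child that matches it."""
    attrs, children = pattern
    if any(node.get(name) != value for name, value in attrs.items()):
        return False
    return all(
        any(matches(child, sub) for child in node.findall("node"))
        for sub in children
    )


def search(root, pattern):
    """Mimic the leading '//node': test every <node> in the tree."""
    return [node for node in root.iter("node") if matches(node, pattern)]


root = ET.fromstring(SENTENCE_XML)
hits = search(root, PATTERN)
print(len(hits), hits[0].get("cat"))
```

With a full XPath engine available (for example the third-party lxml package), the query string from the slide could presumably be evaluated directly instead of being restated as a pattern.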
GrETEL 2 search modes:
  o Example-based search
      advantage: no or limited knowledge of data structure and/or formal query languages needed
  o XPath search
The user
  1. Example sentence
  2. Inspect parse
  3. Indicate relevant items of the sentence
  4. Select treebank
  5. (Adapt XPath)
  6. Inspect results
GrETEL
  o Parser (Alpino)
  o Automatically generate XPath expression
  o Present results
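The "automatically generate XPath expression" step can be pictured with a short sketch. It assumes the user's selection is available as a small XML subtree in which only the chosen attributes remain. The function `node_to_xpath` and the fixed attribute order are hypothetical choices made for this illustration; they are not GrETEL's actual implementation.

```python
import xml.etree.ElementTree as ET

# Attributes considered for the query, in output order (assumption).
ATTRS = ("rel", "cat", "pt", "lemma")

# Hypothetical selection for "Dit is een zin.": the punctuation node has
# been left out and only the attributes the user ticked are kept.
SELECTION = ET.fromstring(
    '<node cat="smain" rel="--">'
    '<node rel="su" pt="vnw" lemma="dit"/>'
    '<node rel="hd" pt="ww" lemma="zijn"/>'
    '<node rel="predc" cat="np">'
    '<node rel="det" pt="lid" lemma="een"/>'
    '<node rel="hd" pt="n" lemma="zin"/>'
    '</node>'
    '</node>'
)


def node_to_xpath(node, is_root=True):
    """Turn a selected subtree into one nested XPath expression."""
    conditions = []
    for name in ATTRS:
        # The relation of the top node to its context is left unconstrained.
        if is_root and name == "rel":
            continue
        value = node.get(name)
        if value is not None:
            conditions.append(f'@{name}="{value}"')
    # Each selected child becomes a nested node[...] condition.
    for child in node.findall("node"):
        conditions.append(node_to_xpath(child, is_root=False))
    prefix = "//node" if is_root else "node"
    if not conditions:
        return prefix
    return f"{prefix}[{' and '.join(conditions)}]"


xpath = node_to_xpath(SELECTION)
print(xpath)
```

For this selection the sketch reproduces the query shown on the XPATH slide, which is why the top node's `rel` attribute is skipped here.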
OUTLINE
  GrETEL in a nutshell
  GrETEL demo
    o Case study
    o Search options
  Conclusions
CASE STUDY
Infinitivus Pro Participio (IPP) constructions in Dutch
  Hij heeft Marie horen/*gehoord zingen.
    He has heard Mary sing.
  dat Jan niet is kunnen/*gekund komen.
    that Jan was not able to come.
GrETEL ONLINE
INPUT
INPUT PARSE
SELECTION MATRIX
SELECTION GUIDELINES
TREEBANK SELECTION
QUERY OVERVIEW
RESULTS
IPP constructions in CGN
  Hij heeft Marie horen zingen.
    He has heard Mary sing.
    344 hits
RESULTS
RESULTS: table
RESULTS: data
RESULTS: data (greedy search)
RESULTS: trees
RESULTS
IPP constructions in CGN
  dat Jan niet is kunnen komen.
    that Jan was not able to come.
    24 hits
MORE RESULTS
Option 1: Use different queries
  Hij heeft Marie horen zingen.
    He has heard Mary sing.
    344 hits
  dat hij Marie heeft horen zingen.
    that he has heard Mary sing.
    79 hits
  dat Jan niet is kunnen komen.
    that Jan was not able to come.
    24 hits
  Jan is niet kunnen komen.
    Jan was not able to come.
    120 hits
  TOTAL: 567 hits
MORE RESULTS
Option 2: Adapt query (via XPath Search)

  //node[@cat="smain"
         and node[@rel="hd" and @pt="ww" and @lemma="hebben"]
         and node[@rel="vc" and @cat="inf"
                  and node[@rel="hd" and @pt="ww"]
                  and node[@rel="vc" and @cat="inf"
                           and node[@rel="hd" and @pt="ww"]]]]

  //node[(@cat="smain" or @cat="ssub")
         and node[@rel="hd" and (@lemma="hebben" or @lemma="zijn")]
         and node[@rel="vc" and @cat="inf"
                  and node[@rel="hd" and @pt="ww"]
                  and node[@rel="vc" and @cat="inf"
                           and node[@rel="hd" and @pt="ww"]]]]

  566 hits (one sentence matches twice: fva400364 10)
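The effect of relaxing the query (allowing `smain` or `ssub`, and `hebben` or `zijn`) can be illustrated on two toy trees. This is a sketch under assumptions: the XML is hand-written and much simpler than real treebank entries, and the pattern format and the helpers `matches` and `count_hits` are hypothetical, standing in for a real XPath engine.

```python
import xml.etree.ElementTree as ET

# Toy, simplified trees (assumption): a main clause with "hebben"
# and a subordinate clause with "zijn", each with two stacked infinitives.
MAIN_HEBBEN = ET.fromstring(
    '<node cat="smain">'
    '<node rel="su" pt="vnw" lemma="hij"/>'
    '<node rel="hd" pt="ww" lemma="hebben"/>'
    '<node rel="vc" cat="inf">'
    '<node rel="hd" pt="ww" lemma="horen"/>'
    '<node rel="vc" cat="inf"><node rel="hd" pt="ww" lemma="zingen"/></node>'
    '</node>'
    '</node>'
)
SUB_ZIJN = ET.fromstring(
    '<node cat="ssub">'
    '<node rel="su" pt="n" lemma="Jan"/>'
    '<node rel="hd" pt="ww" lemma="zijn"/>'
    '<node rel="vc" cat="inf">'
    '<node rel="hd" pt="ww" lemma="kunnen"/>'
    '<node rel="vc" cat="inf"><node rel="hd" pt="ww" lemma="komen"/></node>'
    '</node>'
    '</node>'
)

# Pattern = (attribute -> set of allowed values, list of child patterns).
HD_VERB = ({"rel": {"hd"}, "pt": {"ww"}}, [])
VC_CHAIN = ({"rel": {"vc"}, "cat": {"inf"}},
            [HD_VERB, ({"rel": {"vc"}, "cat": {"inf"}}, [HD_VERB])])

# First query on the slide: main clauses headed by "hebben".
NARROW = ({"cat": {"smain"}},
          [({"rel": {"hd"}, "pt": {"ww"}, "lemma": {"hebben"}}, []), VC_CHAIN])
# Adapted query: main or subordinate clauses, "hebben" or "zijn".
BROAD = ({"cat": {"smain", "ssub"}},
         [({"rel": {"hd"}, "lemma": {"hebben", "zijn"}}, []), VC_CHAIN])


def matches(node, pattern):
    """Each attribute must take one of the allowed values ('or' in XPath);
    each child pattern must be satisfied by some direct child ('and')."""
    attrs, children = pattern
    if any(node.get(name) not in allowed for name, allowed in attrs.items()):
        return False
    return all(
        any(matches(child, sub) for child in node.findall("node"))
        for sub in children
    )


def count_hits(trees, pattern):
    """Count matching nodes over all trees, as '//node[...]' would."""
    return sum(
        1 for tree in trees for node in tree.iter("node") if matches(node, pattern)
    )


trees = [MAIN_HEBBEN, SUB_ZIJN]
print(count_hits(trees, NARROW), count_hits(trees, BROAD))
```

Hits are counted per matching node rather than per sentence, so a single sentence can contribute more than one hit.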
ADVANCED SEARCH
SEARCH OPTIONS
Below the annotation matrix
WORD ORDER
PP-over-V
  o V + PP
      dat hij opstond met een kater.
      ... that he woke up with a hangover.
  o PP + V
      dat hij met een kater opstond.
      that he with a hangover woke-up
      ... that he woke up with a hangover.
WORD ORDER
PP-over-V in LASSY small
  o V + PP
      dat hij opstond met een kater.
      ... that he woke up with a hangover.
  o 2,890 hits in 2,764 sentences
  o But: results include PP + V as well!
WORD ORDER
PP-over-V in LASSY small
  o V + PP + word order option
      dat hij opstond met een kater.
      ... that he woke up with a hangover.
  o 787 hits in 775 sentences
  o Results only include V + PP
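Why the plain query also returns PP + V can be shown with a small sketch: a purely structural pattern says nothing about linear order. The sketch assumes each node carries a `begin` attribute with its start position in the sentence, as in Alpino-style XML; the toy trees and the helpers `children_with` and `find_v_pp` are hypothetical. In XPath, an order constraint of this kind could be written along the lines of `number(@begin) < number(../node[@cat="pp"]/@begin)`, though whether GrETEL generates exactly this form is not stated on the slides.

```python
import xml.etree.ElementTree as ET

# Toy clauses (assumption): positions count words from 0 in
# "dat hij opstond met een kater" and "dat hij met een kater opstond".
V_PP = ET.fromstring(
    '<node cat="ssub" begin="1" end="6">'
    '<node rel="su" pt="vnw" lemma="hij" begin="1" end="2"/>'
    '<node rel="hd" pt="ww" lemma="opstaan" begin="2" end="3"/>'
    '<node rel="mod" cat="pp" begin="3" end="6"/>'
    '</node>'
)
PP_V = ET.fromstring(
    '<node cat="ssub" begin="1" end="6">'
    '<node rel="su" pt="vnw" lemma="hij" begin="1" end="2"/>'
    '<node rel="mod" cat="pp" begin="2" end="5"/>'
    '<node rel="hd" pt="ww" lemma="opstaan" begin="5" end="6"/>'
    '</node>'
)


def children_with(node, **attrs):
    """Direct <node> children whose attributes equal the given values."""
    return [
        child for child in node.findall("node")
        if all(child.get(name) == value for name, value in attrs.items())
    ]


def find_v_pp(tree, respect_order):
    """Find (verb, PP) sibling pairs; optionally require verb before PP."""
    hits = []
    for clause in tree.iter("node"):
        for verb in children_with(clause, rel="hd", pt="ww"):
            for pp in children_with(clause, cat="pp"):
                in_order = int(verb.get("begin")) < int(pp.get("begin"))
                if in_order or not respect_order:
                    hits.append((verb, pp))
    return hits


for name, tree in (("V + PP", V_PP), ("PP + V", PP_V)):
    unordered = len(find_v_pp(tree, respect_order=False))
    ordered = len(find_v_pp(tree, respect_order=True))
    print(name, unordered, ordered)
```

Without the order constraint both clauses match, which mirrors the larger hit count on the first slide; with it, only the V + PP clause remains.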
IGNORE TOP NODE
CONTEXT
CONCLUSIONS
  GrETEL: search engine for Dutch treebanks
  Input = natural language example
  Output = sample of similar sentences
  Syntactic concordancer
  Available online (via Mozilla Firefox)
  No installation required
Try it yourself!
http://gretel.ccl.kuleuven.be
Thanks for your attention!