Memory-Bounded Left-Corner Unsupervised Grammar Induction on Child-Directed Input


Cory Shain, The Ohio State University (shain.3@osu.edu); William Bryce, University of Illinois at Urbana-Champaign (bryce2@illinois.edu); Lifeng Jin, The Ohio State University (jin.544@osu.edu); Victoria Krakovna, Harvard University (vkrakovna@fas.harvard.edu); Finale Doshi-Velez, Harvard University (finale@seas.harvard.edu); Timothy Miller, Boston Children's Hospital & Harvard Medical School (timothy.miller@childrens.harvard.edu); William Schuler, The Ohio State University (schuler@ling.osu.edu); Lane Schwartz, University of Illinois at Urbana-Champaign (lanes@illinois.edu)

Abstract

This paper presents a new memory-bounded left-corner parsing model for unsupervised raw-text syntax induction, using unsupervised hierarchical hidden Markov models (UHHMM). We deploy this algorithm to shed light on the extent to which human language learners can discover hierarchical syntax through distributional statistics alone, by modeling two widely-accepted features of human language acquisition and sentence processing that have not been simultaneously modeled by any existing grammar induction algorithm: (1) a left-corner parsing strategy and (2) limited working memory capacity. To model realistic input to human language learners, we evaluate our system on a corpus of child-directed speech rather than typical newswire corpora. Results beat or closely match those of three competing systems.

1 Introduction

The success of statistical grammar induction systems (Klein and Manning, 2002; Seginer, 2007; Ponvert et al., 2011; Christodoulopoulos et al., 2012) seems to suggest that sufficient statistical information is available in language to allow grammar acquisition on this basis alone, as has been argued for word segmentation (Saffran et al., 1999). But existing grammar induction systems make unrealistic assumptions about human learners, such as the availability of part-of-speech information and access to an index-addressable parser chart, which are not independently cognitively motivated. This paper explores the possibility that a memory-limited incremental left-corner parser, of the sort independently motivated in sentence processing theories (Gibson, 1991; Lewis and Vasishth, 2005), can still acquire grammar by exploiting statistical information in child-directed speech.

2 Related Work

This paper bridges work on human sentence processing and syntax acquisition on the one hand and unsupervised grammar induction (raw-text parsing) on the other. We discuss relevant literature from each of these areas in the remainder of this section.

This work is licensed under a Creative Commons Attribution 4.0 International Licence (licence details: http://creativecommons.org/licenses/by/4.0/). Published in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 964-975, Osaka, Japan, December 11-17, 2016.

2.1 Human sentence processing and syntax acquisition

Related work in psycholinguistics and cognitive psychology has provided evidence that humans have a limited ability to store and retrieve structures from working memory (Miller, 1956; Cowan, 2001; McElree, 2001), and may therefore employ a left-corner-like strategy during incremental sentence processing (Johnson-Laird, 1983; Abney and Johnson, 1991; Gibson, 1991; Resnik, 1992; Stabler, 1994; Lewis and Vasishth, 2005). Schuler et al. (2010) show that nearly all naturally-occurring sentences can be parsed using no more than four disjoint derivation fragments in a left-corner parser, suggesting that general-purpose working memory resources are all that is needed to account for information storage and retrieval during online sentence processing. These findings motivate our left-corner parsing strategy and depth-bounded memory store.

An extensive literature indicates that memory abilities develop with age (see e.g. Gathercole, 1998 for a review). Newport (1990) proposed that limited processing abilities actually facilitate language acquisition by constraining the hypothesis space (the less-is-more hypothesis). This theory has been supported by a number of subsequent computational and laboratory studies (e.g., Elman, 1993; Goldowsky & Newport, 1993; Kareev et al., 1997) and parallels similar developments in the curriculum learning training regimen for machine learning (Bengio et al., 2009).[1] Research on the acquisition of syntax has shown that infants are sensitive to syntactic structure (Newport et al., 1977; Seidl et al., 2003) and that memory limitations constrain the learning of syntactic dependencies (Santelmann and Jusczyk, 1998). Together, these results suggest both (1) that the memory constraints in infants and young children are even more extreme than those attested for adults and (2) that these constraints impact and may even facilitate learning. By implementing these constraints in a domain-general computational model, we can explore the extent to which human learners might exploit distributional statistics during syntax acquisition (Lappin and Shieber, 2007).

2.2 Unsupervised grammar induction

The process of grammar induction learns the syntactic structure of a language from a sample of unlabeled text, rather than a gold-standard treebank. The constituent context model (CCM) (Klein and Manning, 2002) uses expectation-maximization (EM) to learn differences between observed and unobserved bracketings, and the dependency model with valence (DMV) (Klein and Manning, 2004) uses EM to learn distributions that generate child dependencies, conditioned on valence (left or right direction) in addition to the lexical head. Both of these algorithms induce on gold part-of-speech tag sequences.

A number of successful unsupervised raw-text syntax induction systems also exist. Seginer (2007) (CCL) uses a non-probabilistic scoring system and a dependency-like syntactic representation to bracket raw-text input. Ponvert et al. (2011) (UPPARSE) use a cascade of hidden Markov model (HMM) chunkers for unsupervised raw-text parsing. Christodoulopoulos et al. (2012) (BMMM+DMV) induce part-of-speech (PoS) tags from raw text using the Bayesian multinomial mixture model (BMMM) of Christodoulopoulos et al. (2011), induce dependencies from those tags using DMV, and iteratively re-tag and re-parse using the induced dependencies as features in the tagging process. In contrast to ours, none of these systems employ a left-corner parsing strategy or model working memory limitations.
3 Methods

Experiments described in this paper use a memory-bounded probabilistic sequence model implementation of a left-corner parser (Aho and Ullman, 1972; van Schijndel et al., 2013) to determine whether natural language grammar can be acquired on the basis of statistics in transcribed speech within human-like memory constraints. The model assumes access to episodic memories of training sentences, but imposes constraints on working memory usage during sentence processing. The core innovation of this paper is the adaptation of this processing model to Bayesian unsupervised induction using constrained priors.

[1] The less-is-more hypothesis has been a subject of controversy, however. See e.g. Rohde and Plaut (2003) for a critical review.

[Figure 1: Trees and partial analyses for the sentence "We'll get you another one," taken from the training corpus: (a) the phrase structure tree; (b) the incremental left-corner partial analyses at time steps t = 1..5, with derivation fragments (e.g. S/VP, VP/VP, VP/NP, NP/NP) shown vertically stacked between words, using / to delimit top and bottom signs.]

3.1 Left-corner parsing

Left-corner parsing is attractive as a sentence processing model because it maintains a very small number of disjoint derivation fragments during processing (Schuler et al., 2010), in keeping with human working memory limitations (Miller, 1956; Cowan, 2001; McElree, 2001), and correctly predicts difficulty in recognizing center-embedded, but not left- or right-embedded structures (Chomsky and Miller, 1963; Miller and Isard, 1964; Karlsson, 2007).

A left-corner parser maintains a sequence of derivation fragments a/b, a′/b′, ..., each consisting of an active category a lacking an awaited category b yet to come. It incrementally assembles trees by forking off and joining up these derivation fragments, using a pair of binary decisions about whether to use a word w to start a new derivation fragment (initially a complete category c):[2]

  a/b  w  ⇒  a/b  c        where b →+ c ...;  c → w        (F=1)
  a/b  w  ⇒  c             where a = c;  b → w             (F=0)

and whether to use a grammatical inference rule to connect a complete category c to a previously disjoint derivation fragment a/b:

  a/b  c  ⇒  a/b′              where b → c b′                    (J=1)
  a/b  c  ⇒  a/b  a′/b′        where b →+ a′ ...;  a′ → c b′     (J=0)

These two binary decisions have four possible outcomes in total: the parser can fork only (which increases the number of derivation fragments by one), join only (which decreases the number of derivation fragments by one), both fork and join (which keeps the number of derivation fragments the same), or neither fork nor join (which also preserves the number of derivation fragments). An example derivation of the sentence "We'll get you another one" is shown in Figure 1.

[2] Here, b →+ c ... constrains c to be a leftmost descendant of b at some depth.
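To make the four fork/join outcomes concrete, the following is a minimal sketch (not the authors' implementation) of how a depth-bounded store of derivation fragments might be updated for each outcome. The category choices passed in as arguments are hypothetical placeholders; in the full model they are sampled from the learned distributions described in the next section.

```python
from collections import namedtuple

# A derivation fragment: an active category lacking an awaited category.
Fragment = namedtuple("Fragment", ["active", "awaited"])

MAX_DEPTH = 2  # memory bound used in the paper's experiments


def step(stack, fork, join, new_frag=None, new_awaited=None):
    """Apply one fork/join outcome to a bounded stack of derivation fragments.

    fork=1 starts a new fragment from the incoming word; join=1 integrates a
    complete category into the fragment below it. The specific categories
    (new_frag, new_awaited) are placeholders for draws from the learned models.
    """
    stack = list(stack)
    if fork and not join:            # fork only: +1 fragment (push)
        if len(stack) >= MAX_DEPTH:
            raise ValueError("exceeds memory bound")
        stack.append(new_frag)
    elif fork and join:              # fork and join: same depth; awaited category transitions
        a, _ = stack[-1]
        stack[-1] = Fragment(a, new_awaited)
    elif not fork and join:          # join only: -1 fragment (pop), awaited below transitions
        stack.pop()
        if stack:
            a, _ = stack[-1]
            stack[-1] = Fragment(a, new_awaited)
    else:                            # neither: same depth; top fragment's categories transition
        stack[-1] = new_frag
    return stack


# Example: "We" under S/VP, then "'ll" forks a new VP/VP fragment.
s = [Fragment("S", "VP")]
s = step(s, fork=1, join=0, new_frag=Fragment("VP", "VP"))
print(s)  # [Fragment(active='S', awaited='VP'), Fragment(active='VP', awaited='VP')]
```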

3.2 Probabilistic sequence model

A left-corner parser can be modeled as a probabilistic sequence model using hidden random variables at every time step for Active categories A, Awaited categories B, Preterminal or part-of-speech (POS) tags P, and an observed random variable W over Words. The model also makes use of two binary switching variables at each time step, F (for Fork) and J (for Join), that guide the transitions of the other states. These two binary switching variables yield four cases (1/1, 1/0, 0/1 and 0/0) at each time step.

Let D be the depth of the memory store at position t in the sequence, and let the state q_t^{1..D} be the stack of derivation fragments at t, consisting of one active category a_t^d and one awaited category b_t^d at each depth d. The joint probability of the hidden state q_t^{1..D} and observed word w_t, given their previous context, is defined using Markov independence assumptions and the fork-join variable decomposition of van Schijndel et al. (2013), which preserves PCFG probabilities in incremental sentence processing:

  P(q_t^{1..D} w_t | q_{1..t-1}^{1..D} w_{1..t-1}) = P(q_t^{1..D} w_t | q_{t-1}^{1..D})      (1)
    ≜ P(p_t f_t j_t a_t^{1..D} b_t^{1..D} w_t | q_{t-1}^{1..D})      (2)
    = P_θP(p_t | q_{t-1}^{1..D}) · P_θW(w_t | q_{t-1}^{1..D} p_t) · P_θF(f_t | q_{t-1}^{1..D} p_t w_t) · P_θJ(j_t | q_{t-1}^{1..D} p_t w_t f_t) · P_θA(a_t^{1..D} | q_{t-1}^{1..D} p_t w_t f_t j_t) · P_θB(b_t^{1..D} | q_{t-1}^{1..D} p_t w_t f_t j_t a_t^{1..D})      (3)

The part-of-speech tag p_t depends only on the lowest awaited category b_{t-1}^{d′} at the previous time step, where d′ is the depth of the stack at the previous time step and q_⊥ is an empty derivation fragment:

  P_θP(p_t | q_{t-1}^{1..D}) ≜ P_θP(p_t | d′ b_{t-1}^{d′});   d′ = max{d : q_{t-1}^d ≠ q_⊥}      (4)

The lexical item w_t depends only on the part-of-speech tag p_t at the same time step:

  P_θW(w_t | q_{t-1}^{1..D} p_t) ≜ P_θW(w_t | p_t)      (5)

The fork decision f_t is assumed to be independent of previous-state variables q_{t-1}^{1..D} except for the previous lowest awaited category b_{t-1}^{d′} and the part-of-speech tag p_t:

  P_θF(f_t | q_{t-1}^{1..D} p_t w_t) ≜ P_θF(f_t | d′ b_{t-1}^{d′} p_t);   d′ = max{d : q_{t-1}^d ≠ q_⊥}      (6)

The join decision j_t is decomposed into fork and no-fork cases depending on the outcome of the fork decision:

  P_θJ(j_t | q_{t-1}^{1..D} f_t p_t w_t) ≜      (7)
    if f_t=0:  P_θJ(j_t | d′ a_{t-1}^{d′} b_{t-1}^{d′-1})
    if f_t=1:  P_θJ(j_t | d′ p_t b_{t-1}^{d′})
    where d′ = max{d : q_{t-1}^d ≠ q_⊥}

When f_t=1, that is, a fork has been created, the decision of j_t is whether to immediately integrate the newly forked derivation fragment and transition the awaited category above it (j_t=1) or keep the newly forked derivation fragment (j_t=0). When f_t=0, that is, no fork has been created, the decision of j_t is whether to reduce a stack level (j_t=1) or to transition both the active and awaited categories at the current level (j_t=0).

Decisions about the active categories a_t^{1..D} are decomposed into fork- and join-specific cases depending on the previous state q_{t-1}^{1..D} and the current preterminal p_t. Since the fork and join outcomes only allow a single derivation fragment to be initiated or integrated, each case of the active category model only nondeterministically modifies at most one a_t^d variable from the previous time step:[3]

  P_θA(a_t^{1..D} | q_{t-1}^{1..D} f_t p_t w_t j_t) ≜      (8)
    if f_t=0, j_t=1:  [a_t^{1..d′-2} = a_{t-1}^{1..d′-2}] [a_t^{d′-1} = a_{t-1}^{d′-1}] [a_t^{d′..D} = a_⊥]
    if f_t=0, j_t=0:  [a_t^{1..d′-1} = a_{t-1}^{1..d′-1}] P_θA(a_t^{d′} | d′ b_{t-1}^{d′-1} a_{t-1}^{d′}) [a_t^{d′+1..D} = a_⊥]
    if f_t=1, j_t=1:  [a_t^{1..d′-1} = a_{t-1}^{1..d′-1}] [a_t^{d′} = a_{t-1}^{d′}] [a_t^{d′+1..D} = a_⊥]
    if f_t=1, j_t=0:  [a_t^{1..d′} = a_{t-1}^{1..d′}] P_θA(a_t^{d′+1} | d′ b_{t-1}^{d′} p_t) [a_t^{d′+2..D} = a_⊥]
    where d′ = max{d : q_{t-1}^d ≠ q_⊥}

[3] Here [φ] is a (deterministic) indicator function, equal to one when φ is true and zero otherwise.
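As a concrete (non-normative) illustration of the factorization in Equations 1-7, the sketch below scores the tag, word, fork, and join factors of a single time step by chaining the component distributions; the probability tables and category names are hypothetical placeholders for the learned multinomials, and the active/awaited factors (Equations 8-9) would be chained on in the same way.

```python
import math
from collections import namedtuple

Fragment = namedtuple("Fragment", ["active", "awaited"])


def score_step(theta, prev_stack, p, w, f, j):
    """Log-probability of the p, w, f, j factors of one time step under the
    fork-join decomposition (Eqs. 4-7). `theta` holds placeholder conditional
    probability tables keyed by their conditioning context."""
    # d' = index of the lowest non-empty derivation fragment at t-1
    d = max(i for i, frag in enumerate(prev_stack) if frag is not None)
    b_low = prev_stack[d].awaited                      # b_{t-1}^{d'}
    logp = math.log(theta["P"][b_low][p])              # Eq. 4: POS tag
    logp += math.log(theta["W"][p][w])                 # Eq. 5: word
    logp += math.log(theta["F"][(b_low, p)][f])        # Eq. 6: fork decision
    if f == 1:                                         # Eq. 7: join decision
        j_ctx = (p, b_low)
    else:
        above = prev_stack[d - 1].awaited if d > 0 else "ROOT"
        j_ctx = (prev_stack[d].active, above)
    logp += math.log(theta["J"][j_ctx][j])
    return logp


# Toy usage with made-up tables and categories:
theta = {
    "P": {"VP": {"PRP": 0.4}}, "W": {"PRP": {"you": 0.1}},
    "F": {("VP", "PRP"): {1: 0.6}}, "J": {("PRP", "VP"): {1: 0.7}},
}
stack = [Fragment("S", "VP"), None]
print(score_step(theta, stack, p="PRP", w="you", f=1, j=1))
```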

[Figure 2: Graphical representation of the probabilistic left-corner parsing model expressed in Equations 6-9 across two time steps, with D = 2: hidden variables a_t^1, b_t^1, a_t^2, b_t^2, p_t, f_t, j_t and observed word w_t at each time step.]

Decisions about the awaited categories b_t^{1..D} also depend on the outcomes of the fork and join variables. Again, since the fork and join outcomes only allow a single derivation fragment to be initiated or integrated, each case of the awaited category model only nondeterministically modifies at most one b_t^d variable from the previous time step:

  P_θB(b_t^{1..D} | q_{t-1}^{1..D} f_t p_t w_t j_t a_t^{1..D}) ≜      (9)
    if f_t=0, j_t=1:  [b_t^{1..d′-2} = b_{t-1}^{1..d′-2}] P_θB(b_t^{d′-1} | d′ b_{t-1}^{d′-1} a_{t-1}^{d′}) [b_t^{d′..D} = b_⊥]
    if f_t=0, j_t=0:  [b_t^{1..d′-1} = b_{t-1}^{1..d′-1}] P_θB(b_t^{d′} | d′ a_t^{d′} a_{t-1}^{d′}) [b_t^{d′+1..D} = b_⊥]
    if f_t=1, j_t=1:  [b_t^{1..d′-1} = b_{t-1}^{1..d′-1}] P_θB(b_t^{d′} | d′ b_{t-1}^{d′} p_t) [b_t^{d′+1..D} = b_⊥]
    if f_t=1, j_t=0:  [b_t^{1..d′} = b_{t-1}^{1..d′}] P_θB(b_t^{d′+1} | d′ a_t^{d′+1} p_t) [b_t^{d′+2..D} = b_⊥]
    where d′ = max{d : q_{t-1}^d ≠ q_⊥}

Thus, the parser has a fixed number of probabilistic decisions to make as it encounters each word, regardless of the depth of the stack. A graphical representation of this model is shown in Figure 2.

3.3 Model priors

Induction in this model follows the approach of Van Gael et al. (2008) by applying nonparametric priors over the active, awaited, and part-of-speech variables. This approach allows the model to learn not only the parameters of the model, such as what parts of speech are likely to be created from what awaited categories, but also the cardinality of how many active, awaited, and part-of-speech categories are present, in a fully unsupervised fashion. No labels are needed for inference, which alternates between inferring these unseen categories and the associated model parameters. The probabilistic sequence model defined above, augmented with priors, can be repeatedly sampled to obtain an estimate of the posterior distribution of its hidden variables given a set of observed word sequences.

Priors over the syntactic models are based on the infinite hidden Markov model (iHMM) used for part-of-speech tagging (van Gael et al., 2009). In that model, a hierarchical Dirichlet process HMM (Teh et al., 2006) is used to allow the observed number of states corresponding to parts of speech in the HMM to grow as the data requires. The hierarchical structure of the iHMM ensures that transition distributions share the same set of states, which would not be possible if we used a flat infinite mixture model. A fully infinite version of this model uses nonparametric priors on each of the active, awaited, and part-of-speech variables, allowing the cardinality of each of these variables to grow as the data requires.

Each model draws a base distribution from a root Dirichlet process, which is then used as a parameter to an infinite set of Dirichlet processes, one for each applicable combination of the conditioning variables a_{t-1}, b_{t-1}, p_t, j_t, f_t, a_t, and b_t:

  β_A ~ GEM(γ_A)      (10)
  P_θA(a_t^{d′} | d′ b_{t-1}^{d′-1} a_{t-1}^{d′}) ~ DP(α_A, β_A)      (11)
  P_θA(a_t^{d′+1} | d′ b_{t-1}^{d′} p_t) ~ DP(α_A, β_A)      (12)
  β_B ~ GEM(γ_B)      (13)
  P_θB(b_t^{d′-1} | d′ b_{t-1}^{d′-1} a_{t-1}^{d′}) ~ DP(α_B, β_B)      (14)
  P_θB(b_t^{d′} | d′ a_t^{d′} a_{t-1}^{d′}) ~ DP(α_B, β_B)      (15)
  P_θB(b_t^{d′} | d′ b_{t-1}^{d′} p_t) ~ DP(α_B, β_B)      (16)
  P_θB(b_t^{d′+1} | d′ a_t^{d′+1} p_t) ~ DP(α_B, β_B)      (17)
  β_P ~ GEM(γ_P)      (18)
  P_θP(p_t | d′ b_{t-1}^{d′}) ~ DP(α_P, β_P)      (19)

where DP is a Dirichlet process and GEM is the stick-breaking construction for DPs (Sethuraman, 1994). Models at depth greater than one use the corresponding model at the previous depth as a prior.

3.4 Inference

Inference is based on the beam sampling approach employed in van Gael et al. (2009) for part-of-speech induction. This inference approach alternates between two phases in each iteration. First, given the distributions θ_F, θ_J, θ_A, θ_B, θ_P, and θ_W, the model resamples values for all the hidden states {q_t^d, p_t}. Next, given the state values {q_t^d, p_t}, it resamples each set of multinomial distributions θ_F, θ_J, θ_A, θ_B, θ_P, and θ_W. The sampler is initialized by conservatively setting the cardinalities of the number of active, awaited, and part-of-speech states we expect to see in the data set, randomly initializing the state space, and then sampling the parameters for each distribution θ_F, θ_J, θ_A, θ_B, θ_P, and θ_W given the randomly initialized states and fixed hyperparameters.

As noted by Van Gael et al. (2008), token-level Gibbs sampling in a sequence model can be slow to mix. Preliminary work found that mixing with token-level Gibbs sampling is even slower in this model due to the tight constraints imposed by the switching variables: it is technically ergodic, but exploring the state space requires many low-probability moves. Therefore, the experiments described in this paper use sentence-level sampling instead of token-level sampling, first computing forward probabilities for the sequence and then doing sampling in a backwards pass; resampling the parameters for the probability distributions only requires computing the counts from the sampled sequence and combining with the hyperparameters.

To account for the infinite size of the state spaces, these experiments employ the beam sampler (Van Gael et al., 2008), with some modifications for computational speed. The standard beam sampler introduces an auxiliary variable u_t at each time step, which acts as a threshold below which transition probabilities are ignored. This auxiliary variable u_t is drawn from Uniform(0, p(q_t^{1..D} | q_{t-1}^{1..D})), so it will be between 0 and the probability of the previously sampled transition. The joint distribution over transitions, emissions, and auxiliary variables can be reduced so that the transition matrix is transformed into a boolean matrix with a 1 indicating an allowed transition. Depending on the cut-off value u_t, the size of the instantiated transition matrix will be different for every time step. Values of u_t can be sampled for active, awaited, and POS variables at every time step, rather than a single u_t for the transition matrix. It is possible to compile all the operations at each time step into a single large transition matrix, but computing this matrix is prohibitively slow for an operation that must be done at each time step in the data. To address this issue, the learner may interleave several iterations holding the cardinality of the instantiated space fixed with full beam-sampling steps in which the cardinality of the state space can change.
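The sentence-level sampling step described above follows the standard forward-filtering, backward-sampling pattern for HMM-like models. The sketch below shows that pattern for a generic finite transition matrix (the fixed-cardinality case); it is only illustrative, and omits the beam-sampling thresholds and the structured fork/join factorization of the real model.

```python
import numpy as np


def forward_filter_backward_sample(pi, T, E, obs, rng):
    """Sample one hidden-state sequence for a finite HMM.

    pi:  (S,) initial state distribution
    T:   (S, S) transition matrix, T[i, j] = P(state j | state i)
    E:   (S, V) emission matrix, E[s, w] = P(word w | state s)
    obs: list of word indices
    """
    S, n = len(pi), len(obs)
    alpha = np.zeros((n, S))
    alpha[0] = pi * E[:, obs[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, n):                      # forward pass: filtered marginals
        alpha[t] = (alpha[t - 1] @ T) * E[:, obs[t]]
        alpha[t] /= alpha[t].sum()
    states = np.empty(n, dtype=int)            # backward pass: joint sample
    states[-1] = rng.choice(S, p=alpha[-1])
    for t in range(n - 2, -1, -1):
        w = alpha[t] * T[:, states[t + 1]]
        states[t] = rng.choice(S, p=w / w.sum())
    return states


# Toy usage with a 2-state model over a 3-word vocabulary:
rng = np.random.default_rng(0)
pi = np.array([0.6, 0.4])
T = np.array([[0.7, 0.3], [0.2, 0.8]])
E = np.array([[0.5, 0.4, 0.1], [0.1, 0.2, 0.7]])
print(forward_filter_backward_sample(pi, T, E, [0, 2, 1], rng))
```

Resampling the multinomial parameters then amounts to collecting transition and emission counts from the sampled state sequences and combining them with the hyperparameters, as described above.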

When the cardinality of the state space is fixed, the learner can multiply out the states into one large, structured transition matrix that is valid for all time steps. The forward pass is thus reduced to an HMM forward pass (albeit one over a much larger set of states), vastly improving the speed of inference. Alternating between sampling the parameters of this matrix and the state values themselves corresponds to updating a finite portion of the infinite possible state space; by interleaving these finite steps with occasional full beam-sampling iterations, the learner is still properly exploring the posterior over models.

3.5 Parsing

There are multiple ways to extract parses from an unsupervised grammar induction system such as this. The optimal Bayesian approach would involve averaging over the values sampled for each model across many iterations, and then using those models in a Viterbi decoding parser to find the best parse for each sentence. Alternatively, if the model parameters have ceased to change much between iterations, the learner can be assumed to have found a local optimum. It can then use a single sample from the end of the run as its model and the analyses of each sentence in that run as the parses to be evaluated. This latter method is used in the experiments described below.

4 Experimental Setup

We ran the UHHMM learner for 4,000 iterations on the approximately 14,500 child-directed utterances of the Eve section of the Brown corpus from the CHILDES database (MacWhinney, 2000).[4] To model the limited memory capacity of young language learners, we restricted the depth of the store of derivation fragments to two.[5] The input sentences were tokenized following the Penn Treebank convention and converted to lower case. Punctuation was initially left in the input as a proxy for intonational phrasal cues (Seginer, 2007; Ponvert et al., 2011), then removed in a follow-up experiment.

[4] We used 4 active states; 4 awaited states; 8 parts of speech; and parameter values 0.5 for α_a, α_b, and α_c, and 1.0 for α_f, α_j, and γ. The burn-in period was 50 iterations.
[5] This limited stack depth permits discovery of interesting syntactic features like subject-aux inversion while modeling the severe memory limitations of infants (see Section 2.1). Greater depths are likely unnecessary to parse child-directed input (e.g., Newport et al., 1977).
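For reference, the run configuration described in this section and in footnote 4 can be summarized as a small settings block. The field names below are hypothetical (the paper does not specify a configuration format), but the values are the ones reported.

```python
# Hypothetical configuration block summarizing the reported experimental setup.
UHHMM_EVE_CONFIG = {
    "corpus": "CHILDES Brown/Eve (~14,500 child-directed utterances)",
    "iterations": 4000,
    "burn_in": 50,
    "max_depth": 2,        # bound on the store of derivation fragments
    "n_active": 4,         # initial cardinality of active categories
    "n_awaited": 4,        # initial cardinality of awaited categories
    "n_pos": 8,            # initial cardinality of part-of-speech tags
    "alpha_a": 0.5, "alpha_b": 0.5, "alpha_c": 0.5,
    "alpha_f": 1.0, "alpha_j": 1.0, "gamma": 1.0,
    "tokenization": "Penn Treebank conventions, lower-cased",
    "punctuation": ["kept (with-punc run)", "removed (no-punc run)"],
}
```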

                          With punc               No punc
                        P      R      F1        P      R      F1
  UPPARSE             60.50  51.96  55.90     38.17  48.38  42.67
  CCL                 64.70  53.47  58.55     56.87  47.69  51.88
  BMMM+DMV (directed) 62.08  62.51  62.30     61.01  59.24  60.14
  BMMM+DMV (undirect.)63.63  64.02  63.82     61.34  59.33  60.32
  UHHMM-4000, binary  46.68  58.28  51.84     37.62  46.97  41.78
  UHHMM-4000, flatten.68.83  57.18  62.47     61.78  45.52  52.42
  Right-branching     68.73  85.81  76.33     68.73  85.81  76.33

Table 1: Parsing accuracy on Eve with and without punctuation (phrasal cues) in the input. The UHHMM systems were given 8 PoS categories while the BMMM+DMV systems were given 45. UPPARSE and CCL do not learn PoS tags. Only the UHHMM systems model limited working memory capacity or incremental left-corner parsing.

To generate accuracy benchmarks, we parsed the same data set using the three competing raw-text induction systems discussed in Section 2: CCL (Seginer, 2007), UPPARSE (Ponvert et al., 2011),[6] and both directed and undirected variants of BMMM+DMV (Christodoulopoulos et al., 2012).[7] The BMMM+DMV system generates dependency graphs which are not directly comparable to our phrase-structure output, so we used the algorithm of Collins et al. (1999) to convert the BMMM+DMV output to the flattest phrase structure trees permitted by the dependency graphs. We evaluated accuracy against hand-corrected gold-standard Penn Treebank-style annotations for Eve (Pearl and Sprouse, 2013). All evaluations were of unlabeled bracketings with punctuation removed.[8] Accuracy results reported for our system are extracted from arbitrary samples taken after convergence had been reached: iteration 4000 for the with-punc model, and iteration 1500 for the no-punc model (see Figures 3 and 6, respectively).

5 Results

[Figure 3: Log probability by iteration (with punctuation). Figure 4: F-score by iteration (with punctuation). Figure 5: Depth=2 frequency by iteration (with punctuation).]

Figures 3, 4, and 5 show (respectively) log probability, F-score, and depth=2 frequency by iteration for the UHHMM trained on data containing punctuation. As the figures show, the model remains effectively depth 1 until around iteration 3000, at which point it discovers depth 2, rapidly overgeneralizes it, then scales back to around 350 uses over the entire corpus. Around this time, parsing accuracy drops considerably. This result is consistent with the less-is-more hypothesis (Newport, 1990), since accuracy decreases near the point when the number of plausible hypotheses suddenly grows. In our system, we believe this is because the model reallocates probability mass to deeper parses. Nonetheless, as we show below, final results are state of the art.

We sampled parses from iteration 4000 of our learner for evaluation. As shown in Table 1, initial accuracy measures are worse than all four competitors. However, our system generates exclusively binary-branching output, while all competitors can produce the higher-arity trees attested in the PTB-like evaluation standard (notice that our recall measure for the binary-branching output beats both CCL and UPPARSE). To correct this disadvantage, we flattened the UHHMM output by first converting binary trees to dependencies using a heuristic that selects for each parent the most frequently co-occurring child category as the head, then converting these dependencies back into phrase structures using the Collins et al. (1999) algorithm. As shown in Table 1, recall remains approximately the same while precision predictably improves, resulting in higher overall F-measures that beat or closely match those of all competing systems.[9]

[6] Using the best cascaded parser settings from that work: probabilistic right-linear grammar with uniform initialization.
[7] We ran both variants of the BMMM+DMV system for 10 generations, with 500 iterations of BMMM and 20 EM iterations of DMV per generation, as was done by Christodoulopoulos et al. (2012).
[8] Note that while punctuation was removed for all evaluations, inclusion/removal of punctuation in the training data was an independent variable in our experiment.
[9] It happens to be the case that these child-directed sentences are heavily right-branching, likely due to the simplicity and short length of child-directed utterances, and therefore the right-branching baseline (RB) outperforms all systems by a wide margin on this corpus. However, we argue that such utterances are a more realistic model of input to human language learners than newswire text, and therefore preferable for evaluation of systems that purport to model human language acquisition. Our system learns this directional bias from data, and does so at least as successfully as its competitors.
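The unlabeled bracketing scores in Table 1 follow the usual constituent-matching definitions of precision, recall, and F1. A minimal sketch of that metric (not the authors' evaluation script) is shown below, treating each parse as a set of (start, end) spans with punctuation already removed.

```python
def unlabeled_prf(gold_spans, pred_spans):
    """Unlabeled bracketing precision/recall/F1 over per-sentence span sets.

    gold_spans, pred_spans: lists of (start, end) constituent spans per
    sentence, e.g. [[(0, 6), (1, 6), (3, 6)], ...]. Trivial whole-sentence and
    single-word spans are assumed to have been filtered out already.
    """
    matched = gold_total = pred_total = 0
    for gold, pred in zip(gold_spans, pred_spans):
        gold, pred = set(gold), set(pred)
        matched += len(gold & pred)
        gold_total += len(gold)
        pred_total += len(pred)
    p = matched / pred_total if pred_total else 0.0
    r = matched / gold_total if gold_total else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Toy example: one sentence with two gold constituents; the parser recovers
# one of them plus one spurious span.
print(unlabeled_prf([[(1, 6), (3, 6)]], [[(1, 6), (2, 4)]]))  # (0.5, 0.5, 0.5)
```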

[Figure 6: Log probability by iteration (no punctuation). Figure 7: F-score by iteration (no punctuation). Figure 8: Depth=2 frequency by iteration (no punctuation).]

Figures 6, 7, and 8 show (respectively) log probability, F-score, and depth=2 frequency by iteration for the UHHMM trained on data containing no punctuation. Somewhat surprisingly, the model discovers depth 2 and converges much more quickly than it did for the with-punc corpus, requiring fewer than 1000 iterations to converge. This is possibly due to the slight reduction in corpus size. As in the case of the with-punc trained learner, once depth 2 is discovered, the system quickly overgeneralizes, then converges in a consistent range (in this case around 250 uses of depth 2).

To evaluate accuracy on the punctuation-free data, we sampled parses from iteration 1500 of our learner. Results are given in Table 1. Binary UHHMM results are on par with UPPARSE, worse than CCL, and considerably worse than BMMM+DMV, while flattened UHHMM results show higher overall F-measures than both CCL and UPPARSE. BMMM+DMV suffers less in the absence of punctuation than the other systems (and therefore generally provides the best induction results on no-punc). The large drop in UHHMM accuracy with the removal of punctuation provides weak evidence for the use of intonational phrasal cues in human syntax acquisition.

While the BMMM+DMV results are on par with ours, it is important to note that we used a severely restricted number of categories in order to improve computational efficiency. For example, our system was given 8 PoS tags to work with, while BMMM+DMV was given 45. Finer-grained state spaces in a more efficient implementation of our learner will hopefully improve upon the results presented here.

Finally, it is interesting to observe that the uses of depth 2 shown in Figures 5 and 8 are in general linguistically well-motivated. They tend to occur in subject-auxiliary inversion, ditransitive, and contraction constructions, in which depth 2 is often necessary in order to bracket auxiliary+subject, verb+object, and verb+contraction together, as illustrated in Figure 9. Unfortunately, due to the flat representation of these constructions in the gold standard trees, this insight on the part of our learner is not reflected in the accuracy measures in Table 1.

[Figure 9: Actual parses from UHHMM-4000 (with punctuation), illustrating the use of depth 2 for (1) subject-aux inversion ("oh, is rangy still on the step?"), (2) ditransitives ("we'll get you another one."), and (3) contractions ("that's a pretty picture, isn't it?"), with induced ACT/AWA/POS category labels on the tree nodes.]

6 Conclusion

This paper presented a grammar induction system that models the working memory limitations of young language learners and employs a cognitively plausible left-corner incremental parsing strategy, in contrast to existing raw-text induction systems. The fact that our system can model these aspects of human language acquisition and sentence processing while achieving the competitive results shown here on a corpus of child-directed speech indicates that humans can in principle learn a good deal of natural language syntax from distributional statistics alone. It also shows that modeling cognition more closely can match or improve on existing approaches to the task of raw-text grammar induction. In future research, we intend to make use of parallel processing techniques to increase the speed of inference and (1) allow the system to infer the optimal number of states in each component of the model, permitting additional granularity that might enable it to discover subtler patterns than is possible with our currently-restricted state inventories, and (2) allow the system to make use of depths 3 and 4, modeling the working memory capacities of older learners.

Acknowledgements

The authors would like to thank the anonymous reviewers for their comments. This project was sponsored by the Defense Advanced Research Projects Agency award #HR0011-15-2-0022. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

References

Steven P. Abney and Mark Johnson. 1991. Memory requirements and local ambiguities of parsing strategies. Journal of Psycholinguistic Research, 20(3):233-250.

Alfred V. Aho and Jeffery D. Ullman. 1972. The Theory of Parsing, Translation and Compiling, Vol. 1: Parsing. Prentice-Hall, Englewood Cliffs, New Jersey.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41-48, Montreal.

Noam Chomsky and George A. Miller. 1963. Introduction to the formal analysis of natural languages. In Handbook of Mathematical Psychology, pages 269-321. Wiley, New York, NY.

Christos Christodoulopoulos, Sharon Goldwater, and Mark Steedman. 2011. A Bayesian mixture model for part-of-speech induction using multiple features. In Proceedings of EMNLP, pages 638-647, Edinburgh, Scotland.

Christos Christodoulopoulos, Sharon Goldwater, and Mark Steedman. 2012. Turning the pipeline into a loop: Iterated unsupervised dependency parsing and PoS induction. In NAACL-HLT Workshop on the Induction of Linguistic Structure, pages 96-99, Montreal, Canada.

Michael Collins, Jan Hajic, Lance A. Ramshaw, and Christoph Tillman. 1999. A statistical parser for Czech. In Proceedings of ACL.

Nelson Cowan. 2001. The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences, 24:87-185.

Jeffrey L. Elman. 1993. Learning and development in neural networks: The importance of starting small. Cognition, 48:71-99.

Susan E. Gathercole. 1998. The development of memory. Journal of Child Psychology and Psychiatry, 39:3-27.

Edward Gibson. 1991. A computational theory of human linguistic processing: Memory limitations and processing breakdown. Ph.D. thesis, Carnegie Mellon.

Boris Goldowsky and Elissa Newport. 1993. Modeling the effects of processing limitations on the acquisition of morphology: the less is more hypothesis. In Jonathan Mead, editor, Proceedings of the 11th West Coast Conference on Formal Linguistics, pages 234-247.

Philip N. Johnson-Laird. 1983. Mental Models: Towards a Cognitive Science of Language, Inference, and Consciousness. Harvard University Press, Cambridge, MA, USA.

Yaakov Kareev, Iris Lieberman, and Miri Lev. 1997. Through a narrow window: Sample size and the perception of correlation. Journal of Experimental Psychology, 126:278-287.

Fred Karlsson. 2007. Constraints on multiple center-embedding of clauses. Journal of Linguistics, 43:365-392.

Dan Klein and Christopher D. Manning. 2002. A generative constituent-context model for improved grammar induction. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.

Dan Klein and Christopher D. Manning. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics.

Shalom Lappin and Stuart M. Shieber. 2007. Machine learning theory and practice as a source of insight into universal grammar. Journal of Linguistics, 43:1-34.

Richard L. Lewis and Shravan Vasishth. 2005. An activation-based model of sentence processing as skilled memory retrieval. Cognitive Science, 29(3):375-419.

Brian MacWhinney. 2000. The CHILDES Project: Tools for Analyzing Talk. Lawrence Erlbaum Associates, Mahwah, NJ, third edition.

Brian McElree. 2001. Working memory and focal attention. Journal of Experimental Psychology: Learning, Memory and Cognition, 27(3):817-835.

George A. Miller and Stephen Isard. 1964. Free recall of self-embedded English sentences. Information and Control, 7:292-303.

George A. Miller. 1956. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63:81-97.

Elissa Newport, Henry Gleitman, and Lila Gleitman. 1977. Mother, I'd rather do it myself: Some effects and non-effects of maternal speech style. In Catherine E. Snow, editor, Talking to Children, pages 109-149. Cambridge University Press, Cambridge.

Elissa Newport. 1990. Maturational constraints on language learning. Cognitive Science, 14:11-28.

Lisa Pearl and Jon Sprouse. 2013. Syntactic islands and learning biases: Combining experimental syntax and computational modeling to investigate the language acquisition problem. Language Acquisition, 20:23-68.

Elias Ponvert, Jason Baldridge, and Katrin Erk. 2011. Simple unsupervised grammar induction from raw text with cascaded finite state models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 1077-1086, Portland, Oregon.

Philip Resnik. 1992. Left-corner parsing and psychological plausibility. In Proceedings of COLING, pages 191-197, Nantes, France.

Douglas L. T. Rohde and David C. Plaut. 2003. Less is less in language acquisition. In Philip Quinlan, editor, Connectionist Modelling of Cognitive Development. Psychology Press, Hove, UK.

Jenny R. Saffran, Elizabeth K. Johnson, Richard N. Aslin, and Elissa L. Newport. 1999. Statistical learning of tone sequences by human infants and adults. Cognition, 70(1):27-52.

Lynn Santelmann and Peter W. Jusczyk. 1998. Sensitivity to discontinuous dependencies in language learners: Evidence for limitations in processing space. Cognition, 69:105-134.

William Schuler, Samir AbdelRahman, Tim Miller, and Lane Schwartz. 2010. Broad-coverage incremental parsing using human-like memory constraints. Computational Linguistics, 36(1):1-30.

Yoav Seginer. 2007. Fast unsupervised incremental parsing. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 384-391.

Amanda Seidl, George Hollich, and Peter W. Jusczyk. 2003. Early understanding of subject and object wh-questions. Infancy, 4(3):423-436.

Jayaram Sethuraman. 1994. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639-650.

Edward Stabler. 1994. The finite connectivity of linguistic structure. In Perspectives on Sentence Processing, pages 303-336. Lawrence Erlbaum.

Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566-1581.

Jurgen Van Gael, Yunus Saatci, Yee Whye Teh, and Zoubin Ghahramani. 2008. Beam sampling for the infinite hidden Markov model. In Proceedings of the 25th International Conference on Machine Learning, pages 1088-1095. ACM.

Jurgen van Gael, Andreas Vlachos, and Zoubin Ghahramani. 2009. The infinite HMM for unsupervised PoS tagging. In Proceedings of EMNLP, pages 678-687.

Marten van Schijndel, Andy Exley, and William Schuler. 2013. A model of language processing as hierarchic sequential prediction. Topics in Cognitive Science, 5(3):522-540.