Unsupervised Estimation for Noisy-Channel Models

Markos Mylonakis (mmylonak@science.uva.nl), Khalil Sima'an (simaan@science.uva.nl)
Language and Computation, University of Amsterdam, Pl. Muidergracht 24, 1018TV Amsterdam, Netherlands

Rebecca Hwa (hwa@cs.pitt.edu)
Department of Computer Science, University of Pittsburgh, 210 S. Bouquet St., Pittsburgh, PA 15260, U.S.A.

Appearing in Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, 2007. Copyright 2007 by the author(s)/owner(s).

Abstract

Shannon's noisy-channel model, which describes how a corrupted message might be reconstructed, has been the cornerstone for much work in statistical language and speech processing. The model factors into two components: a language model to characterize the original message and a channel model to describe the channel's corruptive process. The standard approach for estimating the parameters of the channel model is unsupervised Maximum-Likelihood estimation on the observation data, usually approximated using the Expectation-Maximization (EM) algorithm. In this paper we show that it is better to maximize the joint likelihood of the data at both ends of the noisy channel. We derive a corresponding bi-directional EM algorithm and show that it gives better performance than standard EM on two tasks: (1) translation using a probabilistic lexicon and (2) adaptation of a part-of-speech tagger between related languages.

1. Introduction

An influential paradigm in statistical natural language processing (NLP) is the noisy-channel model (Shannon & Weaver, 1949). It describes a communication process in which a sender emits the intended message m through an imperfect communication channel, such that the sequence o observed by the recipient is a noisy version of the original message. To reconstruct m from o, one may postulate a set of hypotheses, H(o), and compute the optimal Bayesian hypothesis,

$$m^{*} = \arg\max_{m \in H(o)} P(m \mid o) = \arg\max_{m \in H(o)} P(m)\, P(o \mid m),$$

where P(m) is called the language model and P(o | m) the channel model.

Many NLP problems can be framed in terms of the noisy-channel model. For example, in speech recognition, o is an acoustic utterance heard by the recipient and m is the speaker's intended message; in machine translation, o is a sentence expressed in a foreign language, s (source), m is the intended message expressed in the recipient's native language, t (target), and the channel model is a probabilistic translation lexicon (dictionary).

A major challenge in training the channel model for an NLP application is that the available data rarely contain explicit, in-depth mappings between o and m. For instance, consider the problem of training a channel model for machine translation. While it may not be hard to find bilingual texts, the texts themselves do not specify how individual words in the source language are translated into words in the target language. Thus, the channel model P(o | m) is usually explained by assuming a distribution over a hidden translation relation a: m → o, so that P(o | m) = Σ_a P(o, a | m) (Bahl et al., 1990; Brown et al., 1988). The parameters of the model can be estimated with the Expectation-Maximization (EM) algorithm (Baum et al., 1970; Dempster et al., 1977). However, this means that the parameters are fitted only to data from one side of the channel: the language model parameters depend solely on data from the message side, and the channel model parameters are chosen to maximize the likelihood of the data from the observable side of the channel alone. Because of weak language models, asymmetric channel models and sparse data, this approach leads to different estimates from each direction of the channel (P(m)P(o | m) vs. P(o)P(m | o)). Some recent work (Zens et al., 2004; Liang et al., 2006) suggests that this could be suboptimal in practice and that the two directions of the channel should be reconciled.
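To make the decoding rule above concrete, here is a minimal sketch (ours, not part of the paper) of Bayes-optimal reconstruction over an explicit hypothesis set; the toy language-model and channel-model tables are hypothetical.

```python
import math

def decode(observation, hypotheses, lm_logprob, channel_logprob):
    """Pick m* = argmax_m P(m) * P(o | m) over a finite hypothesis set H(o)."""
    best, best_score = None, -math.inf
    for m in hypotheses:
        score = lm_logprob(m) + channel_logprob(observation, m)
        if score > best_score:
            best, best_score = m, score
    return best

# Toy example with hypothetical numbers: reconstruct a one-word message.
lm = {"cat": math.log(0.6), "cab": math.log(0.4)}                    # P(m)
ch = {("kat", "cat"): math.log(0.7), ("kat", "cab"): math.log(0.1)}  # P(o | m)
print(decode("kat", ["cat", "cab"], lambda m: lm[m], lambda o, m: ch[(o, m)]))  # -> cat
```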

In this paper we explore methods of maximizing the likelihood of both the observable and message sides of the training data simultaneously. We propose that the two directions of translation, P(o) Σ_a P(a, m | o) and P(m) Σ_a P(a, o | m), employ the same set of joint probabilities P(o, a, m). This allows training on the joint data of messages and observations under Maximum-Likelihood. We extend the standard EM algorithm into a Bi-directional EM (Bi-EM) algorithm for re-estimating channel parameters. Unlike standard NLP applications of the noisy-channel model, our algorithm does not depend on a parallel corpus of messages and their corresponding corrupted observations (m, o) as the training data; it is sufficient to have separate corpora of m and o. This is especially beneficial for machine translation between languages for which bilingual texts are not abundant.

We present experiments comparing Bi-EM with the uni-directional EM on two tasks: (1) translation from one language to another using a probabilistic translation lexicon and two monolingual corpora, and (2) automatic adaptation of a part-of-speech (POS) tagger from a language for which there exists an annotated training corpus (written Modern Standard Arabic) to a related language (spoken Levantine dialect) for which there is only a small, unannotated corpus of sentences. On both tasks, and under varying training conditions, the Bi-EM estimates give better system performance than standard (uni-directional) EM.

2. Background and Related Work

It is useful to think of the noisy-channel problem as a translation task: the observation o is the source-language sentence s and the message m is the target-language sentence t. While channel models (P(s | t)) can be implemented in many ways, in this paper we consider only a probabilistic translation lexicon that bridges the source (observation) and target (message) texts. This choice does not impact the generality of the estimation algorithm presented, especially with regard to applications such as machine translation or speech recognition. Much work in Statistical Machine Translation (SMT) has been devoted to the estimation of lexicon probabilities. We briefly review the relevant literature as a background against which we present our algorithm.

2.1. Translation Probabilities in SMT

For a source sentence s = (s_1, ..., s_n) and a target sentence t = (t_1, ..., t_m), the objective of SMT can be expressed in the noisy-channel framework as

$$\arg\max_{t} p(t \mid s) = \arg\max_{t}\; p(s \mid t)\, p(t).$$

To learn the translation model, most SMT approaches require a large parallel corpus (see e.g. Brown et al., 1988; Koehn et al., 2003) in order to induce a hidden alignment a between the words of each pair of sentences s and t:

$$\arg\max_{t} p(t \mid s) = \arg\max_{t} \sum_{a} p(s, a \mid t)\, p(t).$$

To estimate the word-alignment probabilities and the lexicon probabilities, most work employs some form of the Expectation-Maximization algorithm.

2.2. Baseline Model

In contrast with work using parallel corpora, in Koehn and Knight (2000), as well as in this paper, only monolingual corpora (in both source and target languages) are available. Because the two corpora are not translations of each other, alignments between pairs of sentences by and large do not exist. Instead, we assume that we are provided with an ambiguous translation lexicon L (which may be obtained from a bilingual dictionary). For every source word s, L contains a set of translations L(s), and vice versa (for every target word t it contains a set L(t)). The goal is to estimate translation probabilities p(s | t), the probability that a word t translates as word s ∈ L(t), regardless of context. Let the set L(s) also stand for the set of all possible target sentences that result from translating the (ordered) sequence of words in s, one by one [1], using lexicon L.
Koehn and Knight derive the following model [2]:

$$\arg\max_{t \in L(s)} p(t \mid s) = \arg\max_{t \in L(s)}\; p_{\theta}(s \mid t)\, p(t) \qquad (1)$$

$$= \arg\max_{t \in L(s)}\; p(t) \prod_{i=1}^{n} \theta(s_i \mid t_i) \qquad (2)$$

where θ stands for the translation lexicon probabilities, i.e. p(s | t). The model employs a language model p(t) over target sentences, trained on the target-language monolingual corpus T, and a translation model with lexicon probabilities θ(s_i | t_i). Using fixed language model estimates p(t), the lexicon probabilities are estimated using EM over the source-language corpus S. Assuming an initial estimate θ_0 for θ, and denoting the estimate at iteration r by θ_r:

E-step r: for every s ∈ S and t ∈ L(s):

$$q(t \mid s; \theta_r) := \frac{1}{Z(s; \theta_r)}\; p(t) \prod_{i=1}^{n} \theta_r(s_i \mid t_i)$$

M-step r: maximize over θ to obtain θ_{r+1}:

$$\theta_{r+1} := \arg\max_{\theta} \sum_{s \in S} \sum_{t \in L(s)} q(t \mid s; \theta_r)\, \log\big[\, p(t)\, p_{\theta}(s \mid t)\,\big]$$

where $Z(s; \theta_r) = \sum_{t \in L(s)} p(t) \prod_{i=1}^{n} \theta_r(s_i \mid t_i)$. The maximization at iteration r (M-step r) is calculated by relative-frequency estimates as follows:

$$\theta_{r+1}(s \mid t) = \frac{\sum_{s' \in S} \sum_{t' \in L(s')} q(t' \mid s'; \theta_r) \sum_{j} \delta[s'_j, s]\, \delta[t'_j, t]}{\sum_{s' \in S} \sum_{t' \in L(s')} q(t' \mid s'; \theta_r) \sum_{j} \delta[t'_j, t]}$$

where δ[x, y] = 1 iff x == y, and zero otherwise. The actual implementation for Hidden Markov Models is known as the Baum-Welch or Forward-Backward algorithm (Baum et al., 1970).

[1] Thereby assuming the same word order and a one-to-one mapping between words, which also implies that sentence length is unchanged, i.e. m == n.
[2] The notation p_θ(.) stands for the probability under model (parameters) θ.
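For readers who prefer code, the following is a minimal sketch of the uni-directional EM of Section 2.2 under the paper's one-to-one, order-preserving assumption; it is our illustration, not the authors' Java package, and the corpus, lexicon, and fixed language model passed in are hypothetical stand-ins for S, L, and p(t).

```python
from collections import defaultdict
from itertools import product

def uni_em(source_corpus, lexicon, p_t, iterations=10):
    """Estimate theta(s|t) by EM over the source corpus S alone (cf. Section 2.2).

    source_corpus -- list of source sentences (each a list of words)
    lexicon       -- dict: source word -> set of candidate target words L(s)
                     (every source word is assumed to have at least one entry)
    p_t           -- function: tuple of target words -> fixed language model probability
    """
    # Uniform initialisation of theta(s|t) over the lexicon entries.
    theta = defaultdict(float)
    for s_word, cands in lexicon.items():
        for t_word in cands:
            theta[(s_word, t_word)] = 1.0 / len(cands)

    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (s, t) word pairs
        total = defaultdict(float)   # expected count of target words t
        for s in source_corpus:
            # L(s): word-by-word translations of s (cross product of candidate sets).
            hyps = list(product(*(lexicon[w] for w in s)))
            # E-step: q(t|s) proportional to p(t) * prod_i theta(s_i|t_i).
            weights = []
            for t in hyps:
                w = p_t(t)
                for s_i, t_i in zip(s, t):
                    w *= theta[(s_i, t_i)]
                weights.append(w)
            z = sum(weights) or 1.0
            # Accumulate posterior-weighted counts for the M-step.
            for t, w in zip(hyps, weights):
                for s_i, t_i in zip(s, t):
                    count[(s_i, t_i)] += w / z
                    total[t_i] += w / z
        # M-step: relative-frequency re-estimation of theta(s|t).
        theta = defaultdict(float,
                            {(sw, tw): c / total[tw] for (sw, tw), c in count.items()})
    return theta
```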
2.3. Existing Bi-directional Methods

It has been observed in the SMT literature that combining the alignments estimated from the two possible directions of translation, S → T and T → S, improves the precision of the alignment (Och & Ney, 2003). Reconciling the alignments of the two directions of translation culminates in the method of Zens et al. (2004). This method employs two directional translation models, each with a hidden directional alignment model and a word-to-word lexicon. The crucial observation of Zens et al., shared with our approach, is that the conditional lexicon probabilities can be computed using joint estimates (see equations 4) from counts over the alignments obtained from either translation direction. Contrary to our approach, however, Zens et al. employ two separate Uni-EM algorithms to construct two probabilistic directional alignments. After each iteration of these Uni-EM algorithms, each of the directional alignments is used to acquire estimates of the joint counts for the lexicon word pairs. These joint counts are then interpolated, leading to symmetrized lexicon probability estimates, which are in turn fed back into each of the separate Uni-EM algorithms. It is unclear what objective function of the data this method is optimizing. Furthermore, Zens et al. make unrealistic and unnecessary assumptions regarding the unigram counts in the two corpora.

Coming right up to date, Liang et al. (2006) present Alignment by Agreement: the key idea is to employ the parallel corpus (S, T) for the estimation of two alignments θ→ and θ← (the two directions of translation) under an objective likelihood function of (S, T) that measures individual fit to the data as well as mutual agreement between these alignments:

$$\mathcal{L}(S, T; \theta_{\rightarrow}) \cdot \mathcal{L}(S, T; \theta_{\leftarrow}) \cdot \mathcal{L}(S, T; \mathrm{Agr}(\theta_{\rightarrow}, \theta_{\leftarrow}))$$

where $\mathcal{L}(X; \theta) = \prod_{x \in X} p_{\theta}(x)$ stands for the likelihood of parallel corpus X (sentence pairs) under the model that employs alignment θ, and Agr(a, b) measures the agreement between the two alignments a and b given x ∈ X as the dot product of two probability vectors that range over all possible alignments for that pair (also called the set of generalized alignments). While the idea of agreement alignment is appealing, it is by definition not applicable in the present case, as we start out from non-parallel corpora. Furthermore, because the lexicon is large (relative to sentence length), it is computationally prohibitive to employ the same measure of agreement (such as the dot product) between the two estimates of probabilities (per direction) over the subsets of the translation lexicon (the power set of the lexicon).

3. Noisy-Channel Estimators

We start out from the intuition that the independent estimation of the lexicon probabilities $p_{\theta_{\rightarrow}}(s \mid t)$ and $p_{\theta_{\leftarrow}}(t \mid s)$ yields empirical estimates that do not agree on the joint probability p(s, t), i.e.

$$p(t)\, p_{\theta_{\rightarrow}}(s \mid t) \;\neq\; p(s)\, p_{\theta_{\leftarrow}}(t \mid s).$$

This inequality is expected due to the asymmetric statistics in T and S, asymmetry in the translation lexicon, and weak language models.

We hypothesize that the notion of agreement between the two models can be implemented by estimation under the constraint that consensus is achieved over this joint probability. A straightforward approach would be to take the weighted sum of the final EM estimates obtained over the two translation directions (each conducted on its own):

$$p(s, t) = \lambda\, p_{\theta_{\rightarrow}}(s, t) + (1 - \lambda)\, p_{\theta_{\leftarrow}}(s, t) \qquad (3)$$

where λ could be, e.g., the ratio of the corpora sizes. This leads to the re-estimates

$$p_{\theta}(s \mid t) = \frac{p(s, t)}{\sum_{s'} p(s', t)}, \qquad p_{\theta}(t \mid s) = \frac{p(s, t)}{\sum_{t'} p(s, t')}. \qquad (4)$$
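As a concrete reading of equations (3) and (4), here is a small sketch (ours, under the stated assumptions) that interpolates the two directional joint estimates and renormalizes them into conditionals; the input dictionaries are hypothetical.

```python
from collections import defaultdict

def interpolate_joints(p_joint_fwd, p_joint_bwd, lam):
    """Eq. (3): p(s,t) = lam * p_fwd(s,t) + (1 - lam) * p_bwd(s,t).

    Both inputs map (s, t) word pairs to joint probability estimates.
    """
    keys = set(p_joint_fwd) | set(p_joint_bwd)
    return {k: lam * p_joint_fwd.get(k, 0.0) + (1.0 - lam) * p_joint_bwd.get(k, 0.0)
            for k in keys}

def conditionals_from_joint(p_joint):
    """Eq. (4): p(s|t) = p(s,t) / sum_s p(s,t) and p(t|s) = p(s,t) / sum_t p(s,t)."""
    marg_t = defaultdict(float)  # sum over s of p(s,t)
    marg_s = defaultdict(float)  # sum over t of p(s,t)
    for (s, t), p in p_joint.items():
        marg_t[t] += p
        marg_s[s] += p
    p_s_given_t = {(s, t): p / marg_t[t] for (s, t), p in p_joint.items()}
    p_t_given_s = {(s, t): p / marg_s[s] for (s, t), p in p_joint.items()}
    return p_s_given_t, p_t_given_s
```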

While interpolating the estimates in this way could be useful, we take a novel approach that aims at maximizing the joint-likelihood directly, in just the same fashion as the EM is obtained from standard maximum-likelihood. Let us define two corpora C(S) and C(T) (see Figure 1): C(S) is the corpus that consists of a pair (s, t) for every sentence s ∈ S and every hypothesis t ∈ L(s); corpus C(T) is defined analogously.

[Figure 1. Corpus S and its image C(S) under lexicon L, alongside corpus T and its image C(T): the concatenation of the complete source and target corpora results in a single complete corpus.]

We seek the joint-likelihood of the two corpora under a joint probability model $p_{\theta}(s, t) = \prod_{i=1}^{n} \theta(s_i, t_i)$, which coordinates two internally hidden conditional, directed translation models that both employ the same set of translation parameters θ. Let p_1(s) be a language model estimated from S and, analogously, p_2(t) from T; we rewrite the directional translation models in terms of a single set of lexicon parameters θ:

$$\arg\max_{t} p(t \mid s) = \arg\max_{t}\; p_2(t)\, \frac{p_{\theta}(s, t)}{\sum_{s'} p_{\theta}(s', t)}, \qquad \arg\max_{s} p(s \mid t) = \arg\max_{s}\; p_1(s)\, \frac{p_{\theta}(s, t)}{\sum_{t'} p_{\theta}(s, t')}.$$

Stating the two models in terms of the same set of joint probabilities of words implies that the source and target corpora are assumed to have been generated from a single source: the joint lexicon probabilities. This allows us to state a new objective function, the Joint-Likelihood of two monolingual corpora:

$$\max_{\theta}\; \mathcal{L}(T; \theta, p_1, L)\cdot \mathcal{L}(S; \theta, p_2, L) \qquad (5)$$

$$\mathcal{L}(X; \theta, p_k, \hat{L}) = \prod_{x \in X} \sum_{y \in \hat{L}(x)} \mathcal{L}(x, y; \theta, p_k), \qquad \mathcal{L}(x, y; \theta, p_k) = p_k(y)\, \frac{p_{\theta}(x, y)}{\sum_{x'} p_{\theta}(x', y)}$$

This objective function optimizes over θ the joint-likelihood of two monolingual corpora, each under its own likelihood function, which involves the other corpus. Crucially, the joint-likelihood function has the same form as the usual likelihood function, with the minor difference that the multiplication ranges over two corpora rather than one (each under its own translation direction). In light of this observation we can directly obtain a Bi-directional EM algorithm that aims at this joint-likelihood (Figure 2):

E-step r:
for every (s, t) ∈ C(T): $\; q_1(s, t; \theta_r) := p_1(s) \prod_{i=1}^{n} \frac{\theta_r(s_i, t_i)}{\sum_{t'} \theta_r(s_i, t')}$
for every (s, t) ∈ C(S): $\; q_2(s, t; \theta_r) := p_2(t) \prod_{i=1}^{n} \frac{\theta_r(s_i, t_i)}{\sum_{s'} \theta_r(s', t_i)}$

M-step r: maximize over θ to obtain θ_{r+1}:

$$\theta_{r+1} := \arg\max_{\theta} \sum_{(s,t) \in C(T)} \underbrace{\frac{q_1(s, t; \theta_r)}{Z_1(t; \theta_r)} \log \mathcal{L}(s, t; \theta, p_1)}_{A_r(s,t;\theta)} \;+\; \sum_{(s,t) \in C(S)} \underbrace{\frac{q_2(s, t; \theta_r)}{Z_2(s; \theta_r)} \log \mathcal{L}(t, s; \theta, p_2)}_{B_r(s,t;\theta)}$$

with $\mathcal{L}(x, y; \theta, p) = p(x) \prod_{i=1}^{n} \frac{\theta(x_i, y_i)}{\sum_{y'} \theta(x_i, y')}$.

Figure 2. The Bi-EM algorithm.

Here $Z_1(t; \theta_r) = \sum_{s \in L(t)} q_1(s, t; \theta_r)$ and $Z_2(s; \theta_r) = \sum_{t \in L(s)} q_2(s, t; \theta_r)$ are unigram count estimates. The sum of the two sums in the M-step can be rearranged into a single sum if we precompute a single (complete) corpus C_r that concatenates C(S) with C(T) and stores the expected frequencies (A_r(s, t; θ) or B_r(s, t; θ)) with each pair, as

$$\log \mathrm{freq}_r(s, t; \theta) = \begin{cases} A_r(s, t; \theta) & (s, t) \in C(T) \\ B_r(s, t; \theta) & (s, t) \in C(S) \end{cases}$$

The M-step then becomes the M-step of a standard EM algorithm:

$$\theta_{r+1} := \arg\max_{\theta} \sum_{(s,t) \in C_r} \log \mathrm{freq}_r(s, t; \theta)$$

Hence, the Bi-EM inherits the properties of the common (uni-directional) EM algorithm, including convergence and the guarantee of a choice of θ that does not decrease the joint-likelihood after each iteration.

The actual update formula for the Bi-EM is:

$$q(s, t; \theta_r) = \begin{cases} q_1(s, t; \theta_r) & (s, t) \in C(T) \\ q_2(s, t; \theta_r) & (s, t) \in C(S) \end{cases}$$

$$\theta_{r+1}(s, t) = \frac{\sum_{(s',t') \in C_r} q(s', t'; \theta_r) \sum_{j} \delta[s'_j, s]\, \delta[t'_j, t]}{\sum_{(s',t') \in C_r} q(s', t'; \theta_r)}$$

Note that the Bi-EM takes only twice as much training time as the Uni-EM.
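The sketch below illustrates the Bi-EM loop of Figure 2 and the update formula above. It is our simplified illustration rather than the authors' implementation: it folds the per-word normalization of Figure 2 into a sentence-level posterior, and all inputs (corpora, lexicons, and the two fixed language models) are hypothetical.

```python
from collections import defaultdict
from itertools import product

def bi_em(corpus_s, corpus_t, lex_s2t, lex_t2s, p1_s, p2_t, iterations=10):
    """Bi-directional EM: both corpora update one shared joint table theta(s, t).

    corpus_s / corpus_t -- lists of source / target sentences (lists of words)
    lex_s2t / lex_t2s   -- dicts: word -> set of candidate translations (the lexicon L)
    p1_s / p2_t         -- fixed language models over source / target word tuples
    """
    # Uniform initialisation of the joint table over all lexicon pairs.
    pairs = {(s, t) for s, cands in lex_s2t.items() for t in cands} | \
            {(s, t) for t, cands in lex_t2s.items() for s in cands}
    theta = defaultdict(float, {p: 1.0 / len(pairs) for p in pairs})

    def accumulate(sentence, candidates, lm, count, flip):
        """E-step for one sentence: posterior-weighted counts of (s, t) pairs.

        The per-word normalisation of Figure 2 is folded into the posterior over
        the sentence hypotheses, which keeps the sketch short.
        """
        hyps = list(product(*(candidates[w] for w in sentence)))
        weights = []
        for hyp in hyps:
            w = lm(hyp)
            for a, b in zip(sentence, hyp):
                w *= theta[(b, a) if flip else (a, b)]
            weights.append(w)
        z = sum(weights) or 1.0
        for hyp, w in zip(hyps, weights):
            for a, b in zip(sentence, hyp):
                count[(b, a) if flip else (a, b)] += w / z

    for _ in range(iterations):
        count = defaultdict(float)
        # C(S): source sentences with hypothesised target sides, scored by p2(t).
        for s in corpus_s:
            accumulate(s, lex_s2t, p2_t, count, flip=False)
        # C(T): target sentences with hypothesised source sides, scored by p1(s).
        for t in corpus_t:
            accumulate(t, lex_t2s, p1_s, count, flip=True)
        # M-step: one relative-frequency update of the shared joint table theta(s, t).
        total = sum(count.values()) or 1.0
        theta = defaultdict(float, {k: c / total for k, c in count.items()})
    return theta
```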
4. Implementation Detail

The core of both the Uni-EM estimation method (Koehn & Knight, 2000) and the present Bi-EM estimator is the Baum-Welch algorithm (Baum et al., 1970) for Hidden Markov Models (HMMs), which is known to be an EM algorithm (Dempster et al., 1977). In its most general form this algorithm employs the Forward-Backward calculations to update expected counts of transition (language model) and emission (lexicon) probabilities. In our setting we fix the language model (transition) estimates and re-estimate only the lexicon (emission) probabilities. This is because language models can readily be constructed from large monolingual data and there is no reason to re-estimate them. For the generation of the language models we used the CMU-Cambridge Toolkit (Clarkson & Rosenfeld, 1997), employing a first-order Markov model. For the Baum-Welch algorithm, we implemented our own (Java) software package [3], which implements both the Uni- and Bi-EM algorithms. For POS tagging we employ the TnT tagger (Brants, 2000), which works with a 2nd-order HMM over POS tags and individual lexical (word-tag) probabilities.

5. Application I: Translation

Following Koehn and Knight (2000), our experiments are on translating noun sequences extracted from corpus sentences. As an absolute baseline we employ a translation model that assumes uniform lexicon probabilities (called the LM method). The actual baseline, however, is the standard EM of Koehn and Knight (2000) (subsequently called Uni-EM, uni-directional EM). We compare these baselines to the present Bi-EM algorithm (Section 3). During training, the input to the estimation methods consists of a non-parallel English-German corpus pair and an ambiguous lexicon containing up to seven German translations for every English word [4]. We initialize the lexicon parameters with a uniform distribution for both Uni- and Bi-EM. For evaluation purposes, we embed the lexicon estimates within a simple word-to-word translation system (Section 2.2) and evaluate the translation result against the translations available in a given parallel corpus. As in Koehn and Knight (2000), we use German-to-English translation. As a test corpus we use 5106 word translation pairs from 1850 noun sequences extracted from an equal number of sentences from de-news [5], which have been aligned down to the word level. We measure accuracy, the fraction of words whose translation matches the word used in the bitext. In addition, we also provide BLEU scores (Papineni et al., 2001) as an additional measure of translation quality.

[3] http://staff.science.uva.nl/~mmylonak
[4] The lexicon was obtained by automatic word alignment of the Europarl corpus.
[5] http://www.iccs.inf.ed.ac.uk/~pkoehn/publications/de-news/

5.1. Effect of Domain Mismatch

The different estimators operate under domain- and/or genre-mismatch between (1) the source corpus, (2) the target corpus, (3) the lexicon, and (4) the test corpus. We fix the lexicon and the test corpus throughout all experiments. Because the Bi-EM aims at the joint-likelihood of two corpora, a question may arise as to whether weakening the relatedness (in domain and/or genre) of the two corpora will affect the performance of Bi-EM relative to Uni-EM.

Highly related. The two corpora here consist of noun sequences from two non-overlapping sections of the Europarl (Koehn, 2005) parallel corpus (English-German). The baseline system using the LM method (uniform lexicon probabilities) achieves an accuracy of 63.11% (BLEU score 0.2372). The following table lists Uni- vs. Bi-EM results (accuracy, with BLEU in parentheses):

#sentences   Uni-EM            Bi-EM
40K          72.01% (0.3896)   76.19% (0.4394)
75K          74.13% (0.4242)   77.34% (0.4660)
100K         74.99% (0.4300)   77.78% (0.4714)

Compared against the baseline (63.11% for the LM method), these numbers improve by up to 15% (or, in fact, a 40% error reduction), and Bi-EM clearly outperforms the standard Uni-EM. It is evident from the results that the improved accuracy of the Bi-EM does not come from utilizing more data: Bi-EM trained on 40,000 English and the same amount of German sentences significantly outperforms Uni-EM trained on 100,000 English sentences (and a German language model). This is a strong indication that the Joint-Likelihood is a better objective function than the likelihood of a single corpus.

Less related. We use as training data newspaper text from the Gigaword corpus (English) and from the European Language Newspaper Text corpus (German), utilizing news stories coming from the same agencies and published during the same period (Associated Press, Agence France-Presse, May 1994 to December 1995). Unlike different sections of Europarl, this pair of corpora concerns news texts that originate from non-parallel sources and are in two different languages. We estimate translation probabilities using Uni-EM and Bi-EM, training with 100K sentences per language:

#sentences   Uni-EM            Bi-EM
100K         70.29% (0.3610)   72.80% (0.3809)

We notice again that the Bi-EM helps produce significantly more accurate translations. Interestingly, training Bi-EM on 100K sentences still gives better results than Uni-EM trained on 200K sentences (Uni-EM with 200K: 72.08% (0.3737)).

Distantly related. We also trained on a pair of distantly related corpora: the newspaper text from Gigaword (English) and the parliament proceedings from Europarl (German):

#sentences   Uni-EM            Bi-EM
100K         68.90% (0.3110)   70.98% (0.3303)

Bi-EM is still able to produce estimates that give more accurate translations than Uni-EM. Furthermore, Bi-EM trained on 100K sentences outperforms Uni-EM trained on 200K sentences (Uni-EM on 200K: 70.23% (0.3215)).

5.2. Smaller Target Language Data

We employ the corpora of Section 5.1, this time varying the amount of training sentences from the target language (English) while maintaining a fixed training corpus of 100K German (source) sentences. Figure 3 shows the average accuracies of Bi-EM as a function of target corpus size. Note that the zero point refers to Bi-EM trained on a target corpus of size zero, which is equivalent to the Uni-EM. Interestingly, 81% of the accuracy increase of Bi-EM relative to Uni-EM is already obtained by using only 25K sentences: 77.32% (0.4542). These accuracies are averages over 3 different non-overlapping sets of 25K English sentences.

[Figure 3. Bi-EM accuracy (%) as the target corpus size increases from 0 to 100K English sentences.]

6. Application II: Adapting Taggers

Part-of-Speech (POS) tagging is the task of classifying every word in a text into one POS category (e.g., verb, noun). Many machine learning techniques have been applied to POS tagging, including HMMs, Conditional Random Fields, Support Vector Machines and Memory-Based Learning, just to name a few (Ratnaparkhi, 1996; Daelemans et al., 1996; Brants, 2000; Lafferty et al., 2001). Here we focus on the POS tagging of transcripts of a spoken Levantine Arabic dialect. Unlike Modern Standard Arabic (MSA), Arabic dialects are spoken but rarely ever written, which makes it virtually impossible to obtain MSA-dialect parallel corpora (see Rambow et al., 2005). Available are a manually tagged MSA corpus (approx. 564K words) (Maamouri et al., 2004) and a tiny, manually created translation lexicon [6] that maps words between Levantine and MSA. Also available is a small Levantine corpus (approx. 31K words) consisting of two splits (18157 and 12238 words, respectively). The task is to utilize the MSA tagged corpus in order to automatically POS tag the Levantine side, using only unannotated Levantine sentences for training and the lexicon for translation.

We embed the MSA POS tagger and the MSA-Levantine lexicon in the noisy-channel approach. Let m = m_1 ... m_n be an MSA sentence and l = l_1 ... l_n a Levantine sentence. On the MSA side we have a POS tag sequence t = t_1 ... t_n associated with m.
We have two directions for the noisy channel:

$$P(m, t, l) = P(m, t)\, P(l \mid m, t) \qquad (6)$$

$$P(m, t, l) = P(l)\, P(m, t \mid l) \qquad (7)$$

[6] Originating from the JHU 2005 summer workshop, http://www.clsp.jhu.edu/ws2005/. The lexicon has 321 entries with, on average, approx. 1.5 Levantine words per MSA word; if averaged over ambiguous MSA words only, the ambiguity rises to 3.

where P(m, t) is an MSA POS tagger, P(l) is a Levantine language model, and the other two terms are channel models involving the translation lexicon in the two directions. The 2nd-order HMM MSA POS tagger and the Levantine language model are both standard [7]:

$$P(m, t) = \prod_{i=1}^{n} P(t_i \mid t_{i-2}, t_{i-1})\, P(m_i \mid t_i) \qquad (8)$$

$$P(l) = \prod_{i=1}^{n} P(l_i \mid l_{i-1}) \qquad (9)$$

[7] For brevity, any symbol x_j where j ≤ 0 is assumed to be the unique start symbol of a sentence.

For equation 8, we train an off-the-shelf HMM POS tagger (Brants, 2000) on the MSA data (accuracy 95.07% over a 66K-word test set). We make two strong assumptions: (1) the Levantine POS tagger differs from the MSA POS tagger only in the lexical model, and (2) when a Levantine word translates into an MSA word-tag pair, the POS tag remains the same. The latter means that we extend the MSA-Levantine lexicon from pairs (m, l) into triples (m, t, l), where t is any of the POS tags that co-occur with word m in the tagged MSA corpus. A word found in both corpora but not in the lexicon is mapped to itself, and a word found in the Levantine corpus but in neither the MSA corpus nor the lexicon is mapped to all open-category POS tags.

For the two Uni-EM versions, the channel probability employs the probabilistic lexicon in the two directions, $P(l \mid m, t) = \prod_{i=1}^{n} \theta(l_i \mid m_i, t_i)$ and $P(m, t \mid l) = \prod_{i=1}^{n} \theta(m_i, t_i \mid l_i)$. For the Bi-EM we assume one (non-directional/joint) set of parameters θ(m, t, l) that underlies the two directional/conditional parameters, as done within the translation task (Section 5). The estimate θ(m, t, l) is converted into a Levantine lexical model:

$$P(l \mid t) = \frac{\sum_{m} \theta(m, t, l)}{\sum_{m, l'} \theta(m, t, l')}.$$

This lexical model is used together with the 2nd-order Markov model over POS tags (trained on the MSA corpus) as a Levantine POS tagger.

Table 1. Adapting the MSA POS tagger to Levantine.

Adaptation   Training Data   Accuracy
None         MSA only        70.48%
Uni-EM       MSA-to-Lev      75.93%
Uni-EM       Lev-to-MSA      77.88%
Bi-EM        MSA-and-Lev     78.25%

Table 1 exhibits the results of the various POS taggers on the Levantine data, averaged over the two splits (the two Lev parts). The first row is the original MSA-trained POS tagger (70.48% accuracy, i.e. the percentage of correctly tagged test words). The second and third rows each correspond to an adapted MSA POS tagger using the Uni-EM estimates from one translation direction. Depending on the direction, the Uni-EM achieves 18-25% fewer errors relative to the unadapted tagger. The Bi-EM adapted POS tagger (last row) commits 2-10% fewer errors than the Uni-EM directions (or about 27.5% fewer errors than the MSA POS tagger). Note that we have not included any external knowledge. In Rambow et al. (2005), manual adaptation combined with EM leads to 77-78% accuracy on a modified version [8] of the Levantine data. On that test material, our experiments show that the Bi-EM scores 82.30% accuracy (averaged over the two splits).

We think that two factors contribute to the fact that the Bi-EM improves over Uni-EM: (1) it combines statistics from the MSA POS tagger (one direction) with statistics from the Levantine language model (the other direction), and (2) because the lexicon is asymmetric, Uni-EM updates only those entries used in the assumed direction, whereas Bi-EM updates the lexicon entries used in both directions.

[8] Clitics are marked with disambiguating symbols.
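As a small illustration of the conversion step described above (ours, with a hypothetical triple table as input), the joint estimate θ(m, t, l) can be marginalized into the Levantine lexical model P(l | t) as follows:

```python
from collections import defaultdict

def levantine_lexical_model(theta_mtl):
    """Convert joint estimates theta(m, t, l) into P(l | t) = sum_m theta / sum_{m,l} theta.

    theta_mtl -- dict mapping (msa_word, pos_tag, lev_word) -> estimated joint weight
    Returns a dict mapping (pos_tag, lev_word) -> P(l | t).
    """
    num = defaultdict(float)    # sum over m of theta(m, t, l), keyed by (t, l)
    denom = defaultdict(float)  # sum over m and l of theta(m, t, l), keyed by t
    for (m, t, l), w in theta_mtl.items():
        num[(t, l)] += w
        denom[t] += w
    return {(t, l): num[(t, l)] / denom[t] for (t, l) in num}
```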
7. Conclusions

This paper aims at improved channel estimates from data at both ends of the noisy channel. We presented a Joint Maximum-Likelihood approach and extended the EM algorithm into a bi-directional EM for unsupervised estimation. We exemplified the utility of Bi-EM on two tasks: translation via lexicon probability estimates, and adaptation of a POS tagger from a resource-rich to a resource-poor language. Bi-EM delivers better results than the standard EM regardless of mismatch in domain or genre between the source and target corpora.

In future work we aim at utilizing the Bi-EM for porting more linguistic processing tools from a resource-rich to a resource-poor language in cases where no parallel corpora exist. We also think that the Bi-EM could be useful in statistical machine translation, in particular for obtaining improved translation model estimates. Whenever the channel model (lexicon) is asymmetric and/or the language models are weak, it makes more sense to employ Bi-EM than standard (uni-directional) EM for noisy-channel applications.

Acknowledgements

Preliminary Bi-EM versions were explored by the 2nd and 3rd authors, together with Carol Nichols, during the 2005 JHU Summer Language Engineering Workshop. We thank the JHU organizers for the opportunity, the workshop participants for discussions and data, the ICML reviewers for comments, Andy Way and Hermann Ney for pointers to relevant literature, and Aspasia Beneti and Isaac Esteban for help with preliminary experiments. The first author is supported by a NUFFIC HSP Huygens scholarship HSP-HP.06/940-G, and the second author by NWO grant number 639.022.604.

References

Bahl, L. R., Jelinek, F., & Mercer, R. L. (1990). A maximum likelihood approach to continuous speech recognition (pp. 308-319). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Baum, L., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist., 41, 164-171.

Brants, T. (2000). TnT: A statistical part-of-speech tagger. Proceedings of the Sixth Conference on Applied Natural Language Processing (pp. 224-231). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Brown, P., Cocke, J., Della Pietra, S., Jelinek, F., Mercer, R., & Roossin, P. (1988). A statistical approach to language translation. COLING-88.

Clarkson, P., & Rosenfeld, R. (1997). Statistical language modeling using the CMU-Cambridge toolkit. Proceedings of ESCA Eurospeech.

Daelemans, W., Zavrel, J., Berck, P., & Gillis, S. (1996). MBT: A memory-based part of speech tagger-generator. Proceedings of the Fourth Workshop on Very Large Corpora (ACL SIGDAT) (pp. 14-27). Copenhagen, Denmark.

Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.

Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. MT Summit.

Koehn, P., & Knight, K. (2000). Estimating word translation probabilities from unrelated monolingual corpora using the EM algorithm. AAAI/IAAI.

Koehn, P., Och, F. J., & Marcu, D. (2003). Statistical phrase-based translation. Proceedings of the Human Language Technology Conference 2003 (HLT-NAACL 2003). Edmonton, Canada.

Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of the 18th International Conference on Machine Learning (pp. 282-289). San Francisco, CA: Morgan Kaufmann.

Liang, P., Taskar, B., & Klein, D. (2006). Alignment by agreement. Proceedings of the Human Language Technology Conference (HLT-NAACL 2006). New York.

Maamouri, M., Bies, A., Buckwalter, T., & Mekki, W. (2004). The Penn Arabic Treebank: Building a large-scale annotated Arabic corpus. Proceedings of NEMLAR 2004.

Och, F. J., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 19-51.

Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2001). BLEU: A method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02) (pp. 311-318). Morristown, NJ, USA: Association for Computational Linguistics.

Rambow, O., Chiang, D., Diab, M., Habash, N., Hwa, R., Sima'an, K., Lacey, V., Levy, R., Nichols, C., & Shareef, S. (2005). Parsing Arabic dialects (Technical Report). Johns Hopkins University 2005 Summer Workshop on Language Engineering.

Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Shannon, C. E., & Weaver, W. (1949). The mathematical theory of communication. Urbana: University of Illinois Press.

Zens, R., Matusov, E., & Ney, H. (2004). Improved word alignment using a symmetric lexicon model. Proceedings of the 20th International Conference on Computational Linguistics (COLING) (pp. 36-42). Geneva, Switzerland.