Guidelines for tagging of Sanskrit Compounds

K V Ramkrishnamacharyulu, Amba Kulkarni, Tirumala Kulkarni and Anil Kumar

Final draft for circulation among the SHMT consortium members, dated 12/03/2012

Note: The original draft has been modified after a series of workshops on samāsa tagging for Sanskrit. The details of the workshops are available on the SHMT portal, viz. http://sanskrit.uohyd.ernet.in/shmt/login.php

1 Background

With the advent of computers and the advances in the field of NLP, annotated corpora are gaining importance. Annotated corpora not only serve as an important resource for building statistical tools for automatic annotation, but also provide useful insights for language teachers, language learners, and researchers working on various aspects of language.

Compound formation is very productive in Sanskrit. On average, every fourth word in a Sanskrit text is a compound. It is not practical to store all the compounds and their analysis or meaning. So for Sanskrit, one needs a very good compound identifier and a program to generate the paraphrase of a compound.

Pāṇini has provided rules for compound formation and also semantic restrictions for many of the compound formations. But to implement these rules on a machine, one requires a knowledge base. A good collection of tagged compounds will be useful in deciding the parameters for the development of such a knowledge base. Similarly, a tagged corpus will be useful for obtaining the frequency distribution of the various compound types. Such a tagged corpus will also be useful for developing automatic compound identifiers using suitable machine learning algorithms.

2 Tag set for compounds

The Indian Grammatical Tradition has a vast literature on samāsa. Following the literature, Sanskrit compounds are classified into 5 major and 55
minor categories. The major categories are अव्ययीभाव, तत्पुरुष, कर्मधारय, बहुव्रीहि and द्वन्द्व. Note that there is a small deviation from the standard literature here: कर्मधारय has been added as a major type of compound, instead of as a sub-class of तत्पुरुष. The reason is purely the convenience of the tag names. We give below the sub-classification along with the associated tags.
compound type: अव्ययीभाव
  अव्यय-पूर्वपद  A1
  अव्यय-उत्तरपद  A2
  तिष्ठद्गु-प्रभृति  A3
  सङ्ख्या-पूर्वपद-नद्युत्तरपद  A4
  नद्युत्तरपद-अन्यपदार्थे संज्ञायाम्  A5
  सङ्ख्या-पूर्वपद-वंश्योत्तरपद  A6
  पारे-मध्ये-पूर्वपद षष्ठ्युत्तरपद  A7

compound type: कर्मधारय
  विशेषण-पूर्वपद-कर्मधारय  K1
  विशेषण-उत्तरपद-कर्मधारय  K2
  विशेषण-उभयपद-कर्मधारय  K3
  उपमान-पूर्वपद-कर्मधारय  K4
  उपमान-उत्तरपद-कर्मधारय  K5
  अवधारणा-पूर्वपद-कर्मधारय  K6
  सम्भावना-पूर्वपद-कर्मधारय  K7
  मध्यमपदलोपी-कर्मधारय  Km

compound type: तत्पुरुष
  प्रथमा-तत्पुरुष  T1
  द्वितीया-तत्पुरुष  T2
  तृतीया-तत्पुरुष  T3
  चतुर्थी-तत्पुरुष  T4
  पञ्चमी-तत्पुरुष  T5
  षष्ठी-तत्पुरुष  T6
  सप्तमी-तत्पुरुष  T7
  नञ्-तत्पुरुष  Tn
  समाहार-द्विगु  Tds
  तद्धितार्थ-द्विगु  Tdt
  उत्तरपद-द्विगु  Tdu
  गति-समास  Tg
  कु-समास  Tk
  प्रादि-समास  Tp
  मयूरव्यंसकादि  Tm
  तत्पुरुष-बहुपद  Tb
  तत्पुरुष-उपपद  U

compound type: बहुव्रीहि
  द्वितीयार्थ-बहुव्रीहि (समानाधिकरण)  Bs2
  तृतीयार्थ-बहुव्रीहि (समानाधिकरण)  Bs3
  चतुर्थ्यर्थ-बहुव्रीहि (समानाधिकरण)  Bs4
  पञ्चम्यर्थ-बहुव्रीहि (समानाधिकरण)  Bs5
  षष्ठ्यर्थ-बहुव्रीहि (समानाधिकरण)  Bs6
  सप्तम्यर्थ-बहुव्रीहि (समानाधिकरण)  Bs7
  दिग्वाचक-बहुव्रीहि (समानाधिकरण)  Bsd
  प्रहरणविषयक-बहुव्रीहि (समानाधिकरण)  Bsp
  ग्रहणविषयक-बहुव्रीहि (समानाधिकरण)  Bsg
  अस्त्यर्थ-मध्यमपदलोपी (नञ्)-बहुव्रीहि  Bsmn
  प्रादि-बहुव्रीहि  Bvp
  सङ्ख्योभयपद-बहुव्रीहि (समानाधिकरण)  Bss
  उपमानपूर्वपद-बहुव्रीहि (समानाधिकरण)  Bsu
  व्यधिकरण-बहुव्रीहि  Bv
  सङ्ख्योत्तरपद-व्यधिकरण-बहुव्रीहि  Bvs
  सहपूर्वपद-व्यधिकरण-बहुव्रीहि  BvS
  उपमानपूर्वपद-व्यधिकरण-बहुव्रीहि  BvU
  बहुपद-बहुव्रीहि  Bb
compound type: द्वन्द्व
  इतरेतरयोग-द्वन्द्व  Di
  समाहार-द्वन्द्व  Ds
  एकशेष  E

compound type: केवलसमास
  केवल  S

compound type:
  d

3 General Guidelines

The tagging of a compound involves the following steps:
  - separating the constituent padas by '-'s,
  - undoing the sandhi,
  - changing the samāsa pūrvapadas to their prātipadikas,
  - assigning a tag.

Below we give some examples of samasta padas and their tagging.
  राजपुरुषः   <राजन्-पुरुषः>T6
  ल य   <ल -छ य >T6
  धनुर्विद्या   <धनुष्-विद्या>T6
  षण्मासान्   <षट्-मासान्>Tds

Do not split the padas in the following cases:
  - तङ or क द with upa-padas, as in आग, ह र ण etc.
  - If the words have a रूढ-अर्थ, as in म डप, तर श, उप ह र etc.

In case either the पूर्वपद or the उत्तरपद or both are in turn समस्तपदs, then they are also to be tagged. E.g. तपःस्वाध्यायनिरतम् should be tagged as <<तपस्-स्वाध्याय>Di-निरतम्>T7. Note the use of < and > to indicate the words constituting the compound.

If a pada is a क्रिया-विशेषण in a given context, then it should be marked as Bvs only.
In case there is more than one possible tag, show both the tags, as in the following example. In वृक इव अविरतचित्तः पशून्, the word अविरत is ambiguous between Tn and T7. Hence it should be tagged as:
  <<अ-विरत>Tn-चित्तः>Bs6
  <<अवि-रत>T7-चित्तः>Bs6

Handling taddhita constituents: If a constituent of a compound is a taddhita formed from a compound prātipadika, then the taddhita suffix is to be added after indicating the samāsa tag. E.g.
  a) हेममालिकृत is to be tagged as <<<हेम-माला>T6 ˆ इ>-कृत>T6. Note the use of the i suffix, which indicates a taddhita pratyaya.
  b) एकार्णवीकृत = <<एक-अर्णव>K1 ˆ ई>-कृत
  c) रामकृष्णवत् = <राम-कृष्ण>Di ˆ वत्
  d) मोहननामक = <मोहन-नाम ˆ क>Bs6

In case the vigraha vākya must specify the number, we suggest that you specify the number information while tagging the compound, as in
  नृपतिः = <नृ{3}-पतिः>T6
  धर्माभिरामप्रियः = <<धर्म-अभिराम>Bs6{3}-प्रियः>T6

Similarly, if the लिङ्ग information is also required, it may be specified as
  सक पम = <स-क पम {P}>Bs, if क पम refers to क प च य, or
        = <स-क पम {S}>Bs, if the prātipadika of क पम is क प.

4 Rules for generating vigraha vākya

Though a substantial amount of literature on Sanskrit compounds is available, for the benefit of annotators we give below the vigraha vākya for each type of compound, with an example. Further, for the benefit of programmers, we also give rules to generate a vigraha vākya from a properly tagged compound. Thus each of the examples below consists of the name of a compound, its major class, its tag, an example, a paraphrase describing the meaning of the example compound (विग्रहवाक्य), and the rule to get the paraphrase mechanically (wherever possible) from its components.
4.1 Major class: अव्ययीभाव

1) compound type: अव्ययीभाव
   compound sub-type: अव्यय-पूर्वपद A1
   Example: उपकृष्णम् with <उप-कृष्णम्>A1
   paraphrase: कृष्णस्य समीपम्
   paraphrase rule: <x-y>a1 => y{6} f{x}, where f maps x to a noun with the same semantic content. A function f needs to be defined.

2) compound type: अव्ययीभाव
   compound sub-type: अव्यय-उत्तरपद A2
   Example: अक्षपरि with <अक्ष-परि>A2
   paraphrase: अक्षेण विपरीतम् वृत्तम्
   paraphrase rule: <x-y>a2 => x{3} विपरीतम् वृत्तम्

3) compound type: अव्ययीभाव
   compound sub-type: तिष्ठद्गु-प्रभृति A3
   Example: तिष्ठद्गु with <तिष्ठद्-गु>A3
   paraphrase: तिष्ठन्ति गावः यस्मिन् देशे
   paraphrase rule: List to be given; collect from Pāṇini's अष्टाध्यायी.

4) compound type: अव्ययीभाव
   compound sub-type: सङ्ख्या-पूर्वपद-नद्युत्तरपद A4
   Example: सप्तगङ्गम् with <सप्तन्-गङ्गम्>A4
   paraphrase: सप्तानाम् गङ्गानाम् समाहारः
   paraphrase rule: <x-y>a4 => x{6} y'{6} समाहारः, where y' is the प्रातिपदिक of y.
5) compound type: अव्ययीभाव
   compound sub-type: नद्युत्तरपद-अन्यपदार्थे संज्ञायाम् A5
   Example: उन्मत्तगङ्गम् with <उन्मत्त-गङ्गम्>A5
   paraphrase: उन्मत्ता गङ्गा यस्मिन् देशे
   paraphrase rule: <x-y>a5 => x'{1} y'{1} यस्मिन् देशे, where y' is the प्रातिपदिक of y and x' is derived from x by changing the gender to that of y'. the number of x and y will be plural except when x = they are in वचन

6) compound type: अव्ययीभाव
   compound sub-type: सङ्ख्या-पूर्वपद-वंश्योत्तरपद A6
   Example: त्रिमुनि with <त्रि-मुनि>A6
   paraphrase: त्रयाणाम् मुनीनाम् समाहारः
   paraphrase rule: <x-y>a6 => x{6} y{6} समाहारः

7) compound type: अव्ययीभाव
   compound sub-type: पारे-मध्ये-पूर्वपद षष्ठ्युत्तरपद A7
   Example: पारेगङ्गम् with <पारे-गङ्गम्>A7
   paraphrase: गङ्गायाः पारे
   paraphrase rule: <x-y>a7 => y{6} x

4.2 Major class: तत्पुरुष

8) compound type: तत्पुरुष
   compound sub-type: प्रथमा-तत्पुरुष T1
   Example: अर्धपिप्पली with <अर्ध-पिप्पली>T1
   paraphrase: अर्धम् पिप्पल्याः
   paraphrase rule: <x-y>t1 => x{1} y{6}
9) compound type: तत्पुरुष
   compound sub-type: द्वितीया-तत्पुरुष T2
   Example: कृष्णश्रितः with <कृष्ण-श्रितः>T2
   paraphrase: कृष्णम् श्रितः
   paraphrase rule: <x-y>t2 => x{2} y

10) compound type: तत्पुरुष
    compound sub-type: तृतीया-तत्पुरुष T3
    Example: शङ्कुलाखण्डः with <शङ्कुला-खण्डः>T3
    paraphrase: शङ्कुलया खण्डः
    paraphrase rule: <x-y>t3 => x{3} y

11) compound type: तत्पुरुष
    compound sub-type: चतुर्थी-तत्पुरुष T4
    Example: यूपदारु with <यूप-दारु>T4
    paraphrase: यूपाय दारु
    paraphrase rule: <x-y>t4 => x{4} y

12) compound type: तत्पुरुष
    compound sub-type: पञ्चमी-तत्पुरुष T5
    Example: चोरभयम् with <चोर-भयम्>T5
    paraphrase: चोरात् भयम्
    paraphrase rule: <x-y>t5 => x{5} y
13) compound type: तत्पुरुष
    compound sub-type: षष्ठी-तत्पुरुष T6
    Example: दशरथपुत्रः with <दशरथ-पुत्रः>T6
    paraphrase: दशरथस्य पुत्रः
    paraphrase rule: <x-y>t6 => x{6} y

14) compound type: तत्पुरुष
    compound sub-type: सप्तमी-तत्पुरुष T7
    Example: अक्षशौण्डः with <अक्ष-शौण्डः>T7
    paraphrase: अक्षेषु शौण्डः
    paraphrase rule: <x-y>t7 => x{7} y

15) compound type: तत्पुरुष - नञ्
    compound sub-type: नञ्-तत्पुरुष Tn
    Example: अब्राह्मणः / अनश्वः with <न-ब्राह्मणः>Tn / <न-अश्वः>Tn
    paraphrase: न ब्राह्मणः / न अश्वः
    paraphrase rule: <x-y>tn => न y

16) compound type: तत्पुरुष - द्विगु
    compound sub-type: समाहार-द्विगु Tds
    Example: पञ्चगवम् with <पञ्चन्-गवम्>Tds
    paraphrase: पञ्चानाम् गवाम् समाहारः
    paraphrase rule: <x-y>tds => x{6;ba} y{6;ba} समाहारः
17) compound type: तत्पुरुष
    compound sub-type: तद्धितार्थ-द्विगु Tdt
    Example: अष्टाकपालः with <अष्टन्-कपालः>Tdt
    paraphrase: अष्टसु कपालेषु संस्कृतः
    paraphrase rule: No specific rule

18) compound type: तत्पुरुष
    compound sub-type: उत्तरपद-द्विगु Tdu
    Example: पञ्चगवधनः with <<पञ्चन्-गव>Tdu-धनः>Bs
    paraphrase:
    paraphrase rule:

19) compound type: तत्पुरुष
    compound sub-type: गति-समास Tg
    Example: स ग with <सम-ग >Tg
    paraphrase: No paraphrase, since it is a नित्य compound
    paraphrase rule:

20) compound type: तत्पुरुष
    compound sub-type: कु-समास Tk
    Example: कुपुरुषः / कापुरुषः with <कु-पुरुषः>Tk / <कु-पुरुषः>Tk
    paraphrase: No paraphrase, since it is a नित्य compound
    paraphrase rule:
21) compound type: तत्पुरुष
    compound sub-type: प्रादि-समास Tp
    Example: प्राचार्यः with <प्र-आचार्यः>Tp
    paraphrase: प्रकृष्टः आचार्यः
    paraphrase rule: <x-y>tp => f(x) y. The meanings of the upasargas, f(x), need to be listed.

22) compound type: तत्पुरुष
    compound sub-type: मयूरव्यंसकादि Tm
    Example: राजान्तरम् with <राजन्-अन्तरम्>Tm
    paraphrase:
    paraphrase rule: <x-y>tm => ?? A गणपाठ is there, so no rule for making the विग्रहवाक्य is required. The list should be given.

23) compound type: तत्पुरुष
    compound sub-type: तत्पुरुष-बहुपद Tb
    Example: स नम with < -अ -स नम >Tb
    paraphrase:
    paraphrase rule: <x-y-z>tb = x{1} y{1} z{1}. Here y is the prathama puruṣa ekavacana rūpa of the verb whose kṛdanta

24) compound type: तत्पुरुष
    compound sub-type: तत्पुरुष-उपपद U
    Example: कुम्भकारः with <कुम्भ-कारः>U
    paraphrase: कुम्भम् करोति
    paraphrase rule: <x-y>u => x{2} y
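The paraphrase rules of this class can be read as small templates: x and y stand for the two components, {n} asks for the n-th vibhakti of that component, and any other word is copied literally. The following is a minimal sketch of such a template expander, not the consortium's implementation. The names RULES, paraphrase and symbolic_inflect are hypothetical, and the inflector is only a placeholder that records which form is wanted; a real system would call a morphological generator at that point.

```python
import re

# Right-hand sides of some tatpurusha rules from Section 4.2.
# x and y are the two components; {n} requests the n-th vibhakti.
RULES = {
    "T1": "x{1} y{6}",
    "T2": "x{2} y",
    "T3": "x{3} y",
    "T4": "x{4} y",
    "T5": "x{5} y",
    "T6": "x{6} y",
    "T7": "x{7} y",
    "Tn": "न y",
}

# A slot is x or y, optionally followed by a vibhakti number in braces.
SLOT = re.compile(r"([xy])(?:\{([1-7])\})?")


def symbolic_inflect(stem, vibhakti):
    """Placeholder inflector (an assumption): it does not decline the stem,
    it only marks the requested vibhakti, e.g. 'stem{6}'."""
    return f"{stem}{{{vibhakti}}}"


def paraphrase(x, y, tag, inflect=symbolic_inflect):
    """Expand the rule for `tag` with components x and y."""
    parts = {"x": x, "y": y}
    out = []
    for token in RULES[tag].split():
        slot = SLOT.fullmatch(token)
        if slot is None:
            out.append(token)                      # literal word of the rule
        elif slot.group(2) is None:
            out.append(parts[slot.group(1)])       # component copied unchanged
        else:
            out.append(inflect(parts[slot.group(1)], int(slot.group(2))))
    return " ".join(out)


if __name__ == "__main__":
    print(paraphrase("दशरथ", "पुत्रः", "T6"))
```

Passing a real inflector as the `inflect` argument would turn the symbolic output into an actual vigraha vākya; the table-driven design keeps the rules themselves in the same notation as the guidelines.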
4.3 Major class: कर्मधारय

25) compound type: कर्मधारय
    compound sub-type: विशेषण-पूर्वपद-कर्मधारय K1
    Example: नीलोत्पलम् with <नील-उत्पलम्>K1
    paraphrase: नीलम् तत् उत्पलम् च
    paraphrase rule: <x-y>k1 => x{1} तत् y{1} च

26) compound type: कर्मधारय
    compound sub-type: विशेषण-उत्तरपद-कर्मधारय K2
    Example: णब ल with < ण-ब ल >K2
    paraphrase: ण च ब ल च
    paraphrase rule: <x-y>k2 => x{1} च y{1} च

27) compound type: कर्मधारय
    compound sub-type: विशेषण-उभयपद-कर्मधारय K3
    Example: म श तल with <म -श तल >K3
    paraphrase: म च अस श तल च
    paraphrase rule: <x-y>k3 => x{1} च असौ y{1} च

28) compound type: कर्मधारय
    compound sub-type: उपमान-पूर्वपद-कर्मधारय K4
    Example: मेघश्यामः with <मेघ-श्यामः>K4
    paraphrase: मेघः इव श्यामः
    paraphrase rule: <x-y>k4 => x{1} इव y{1}
29) compound type: कर्मधारय
    compound sub-type: उपमान-उत्तरपद-कर्मधारय K5
    Example: पुरुषव्याघ्रः with <पुरुष-व्याघ्रः>K5
    paraphrase: पुरुषः व्याघ्रः इव
    paraphrase rule: <x-y>k5 => x{1} y{1} इव

30) compound type: कर्मधारय
    compound sub-type: अवधारणा-पूर्वपद K6
    Example: गुरुदेवः with <गुरु-देवः>K6
    paraphrase: गुरुः एव देवः
    paraphrase rule: <x-y>k6 => x{1} एव y{1}

31) compound type: कर्मधारय
    compound sub-type: सम्भावना-पूर्वपद K7
    Example: अयोध्यानगरी with <अयोध्या-नगरी>K7
    paraphrase: अयोध्या इति नगरी
    paraphrase rule: <x-y>k7 => x{1} इति y{1}

32) compound type: कर्मधारय
    compound sub-type: मध्यमपदलोपी Km
    Example: शाकपार्थिवः with <शाक-पार्थिवः>Km
    paraphrase: शाकप्रियः पार्थिवः
    paraphrase rule: <x-y>km => x{1} z* y{1}, where z* is a missing madhyama pada.
4.4 Major class: बहुव्रीहि

33) compound type: बहुव्रीहि
    compound sub-type: द्वितीयार्थ-बहुव्रीहि (समानाधिकरण) Bs2
    Example: प्राप्तोदकः with <प्राप्त-उदकः>Bs2
    paraphrase: प्राप्तम् उदकम् यम्
    paraphrase rule: <x-y>bs2 => x{1} y{1} यत्{g}{2}, where g is the gender of the given compound.

34) compound type: बहुव्रीहि
    compound sub-type: तृतीयार्थ-बहुव्रीहि (समानाधिकरण) Bs3
    Example: ऊढरथः with <ऊढ-रथः>Bs3
    paraphrase: ऊढः रथः येन
    paraphrase rule: <x-y>bs3 => x{1} y{1} येन/यया/येन

35) compound type: बहुव्रीहि
    compound sub-type: चतुर्थ्यर्थ-बहुव्रीहि (समानाधिकरण) Bs4
    Example: दत्तवस्त्रः with <दत्त-वस्त्रः>Bs4
    paraphrase: दत्तम् वस्त्रम् यस्मै
    paraphrase rule: <x-y>bs4 => x{1} y{1} यस्मै/यस्यै/यस्मै

36) compound type: बहुव्रीहि
    compound sub-type: पञ्चम्यर्थ-बहुव्रीहि (समानाधिकरण) Bs5
    Example: अपगतजीवः with <अपगत-जीवः>Bs5
    paraphrase: अपगतः जीवः यस्मात्
    paraphrase rule: <x-y>bs5 => x{1} y{1} यस्मात्/यस्याः/यस्मात्
37) compound type: बहुव्रीहि
    compound sub-type: षष्ठ्यर्थ-बहुव्रीहि (समानाधिकरण) Bs6
    Example: पीताम्बरः with <पीत-अम्बरः>Bs6
    paraphrase: पीतम् अम्बरम् यस्य
    paraphrase rule: <x-y>bs6 => x{1} y{1} यस्य/यस्याः/यस्य

38) compound type: बहुव्रीहि
    compound sub-type: सप्तम्यर्थ-बहुव्रीहि (समानाधिकरण) Bs7
    Example: न व स with < न - व स >Bs7
    paraphrase: न व स य न
    paraphrase rule: <x-y>bs7 => x{1} y{1} यस्मिन्/यस्याम्/यस्मिन्

39) compound type: बहुव्रीहि
    compound sub-type: दिग्वाचक-बहुव्रीहि (समानाधिकरण) Bsd
    Example: पूर्वोत्तरा with <पूर्वा-उत्तरा>Bsd
    paraphrase: पूर्वस्याः च उत्तरस्याः च यदन्तरालम्
    paraphrase rule: <x-y>bsd => x{6} च y{6} च यदन्तरालम्

40) compound type: बहुव्रीहि
    compound sub-type: प्रहरणविषयक-बहुव्रीहि (समानाधिकरण) Bsp
    Example: दण्डादण्डि with <दण्ड-दण्ड>Bsp
    paraphrase: दण्डैः च दण्डैः च इदम् युद्धम् प्रवृत्तम्
    paraphrase rule: <x-y>bsp => x{3} च y{3} च इदम् युद्धम् प्रवृत्तम्
41) compound type: बहुव्रीहि
    compound sub-type: ग्रहणविषयक-बहुव्रीहि (समानाधिकरण) Bsg
    Example: केशाकेशि with <केश-केश>Bsg
    paraphrase: केशेषु केशेषु गृहीत्वा इदम् युद्धम् प्रवृत्तम्
    paraphrase rule: <x-y>bsg => x{7}-y{7} गृहीत्वा इदम् युद्धम् प्रवृत्तम्

42) compound type: बहुव्रीहि
    compound sub-type: अस्त्यर्थ-मध्यमपदलोपी-(नञ्)बहुव्रीहि Bsmn
    Example: अपुत्रः with <अ-पुत्रः>Bsmn
    paraphrase: न विद्यते पुत्रः यस्य
    paraphrase rule: <x-y>bsmn => न विद्यते y{1} यस्य/यस्याः/यस्य

43) compound type: बहुव्रीहि
    compound sub-type: प्रादि-बहुव्रीहि Bvp
    Example: निर्दयः with <निर्-दयः>Bvp
    paraphrase: निर्गता दया यस्मात्
    paraphrase rule:

44) compound type: बहुव्रीहि
    compound sub-type: सङ्ख्योभयपद-बहुव्रीहि (समानाधिकरण) Bss
    Example: चत र with < -चत र >Bss
    paraphrase: य व चत र व य
    paraphrase rule: <x-y>bss => x{1} वा y{1} वा य /य /य
45) compound type: बहुव्रीहि
    compound sub-type: उपमान-पूर्वपद-बहुव्रीहि (समानाधिकरण) Bsu
    Example: चन्द्रमुखी with <चन्द्र-मुखी>Bsu
    paraphrase: चन्द्रः इव मुखम् यस्याः
    paraphrase rule: <x-y>bsu => x{1} इव y{1} यस्य/यस्याः/यस्य

46) compound type: बहुव्रीहि
    compound sub-type: व्यधिकरण-बहुव्रीहि Bv
    Example: कण्ठेकालः / चन्द्रशेखरः with <कण्ठे-कालः>Bv / <चन्द्र-शेखरः>Bv
    paraphrase: कण्ठे कालः यस्य / चन्द्रः शेखरे यस्य
    paraphrase rule: <x-y>bv => x y{1} यस्य/यस्याः/यस्य

47) compound type: बहुव्रीहि
    compound sub-type: सङ्ख्योत्तरपद-व्यधिकरण-बहुव्रीहि Bvs
    Example: उपदशाः with <उप-दशाः>Bvs
    paraphrase: दशानाम् समीपे ये सन्ति
    paraphrase rule: <x-y>bvs => y{6} x ये सन्ति

48) compound type: बहुव्रीहि
    compound sub-type: सहपूर्वपद-व्यधिकरण-बहुव्रीहि BvS
    Example: सपुत्रः with <स-पुत्रः>BvS
    paraphrase: पुत्रेण सह
    paraphrase rule: <x-y>bvS => y{3} सह
49) compound type: बहुव्रीहि
    compound sub-type: उपमानपूर्वपद-व्यधिकरण-बहुव्रीहि BvU
    Example: उष्ट्रमुखः with <उष्ट्र-मुखः>BvU
    paraphrase: उष्ट्रस्य इव मुखम् यस्य
    paraphrase rule: <x-y>bvu => x{6} इव y यस्य/यस्याः/यस्य

50) compound type: बहुव्रीहि
    compound sub-type: बहुपद-बहुव्रीहि Bb
    Example:
    paraphrase:
    paraphrase rule:

4.5 Major class: द्वन्द्व

51) compound type: द्वन्द्व
    compound sub-type: इतरेतरयोग-द्वन्द्व Di
    Example: रामकृष्णौ with <राम-कृष्णौ>Di
    paraphrase: रामः च कृष्णः च
    paraphrase rule: <x-y+>di => x{1} च (y{1} च)+. Here + indicates one or more occurrences.

52) compound type: द्वन्द्व
    compound sub-type: समाहार-द्वन्द्व Ds
    Example: संज्ञापरिभाषम् with <संज्ञा-परिभाषम्>Ds
    paraphrase: संज्ञा च परिभाषा च एतयोः समाहारः
    paraphrase rule: <x-y+>ds => x{1} च (y{1} च)+ एतत् n समाहारः. Here + indicates one or more occurrences; n=2 if there are only two components, n=3 otherwise.
53) compound type: एकशेष
    compound sub-type: एकशेष E
    Example: पितरौ with <पितरौ>E
    paraphrase: माता च पिता च
    paraphrase rule: No common rule. Give a list of exceptions with their विग्रहवाक्यम्.

54) compound type: केवल
    compound sub-type: केवल S
    Example: भूतपूर्वः with <भूत-पूर्वः>S
    paraphrase: पूर्वम् भूतः
    paraphrase rule: <x-y>s => y{1} x{1}

55) compound type:
    compound sub-type: d
    Example: उपर्युपरि with <उपरि-उपरि>d
    paraphrase: उपरि उपरि
    paraphrase rule: <x-y>d => x y

5 Examples of compound tagging from बालकाण्ड of वाल्मीकिरामायणम्

Sloka 1.1.1:
<<तपस्-स्वाध्याय>Di-निरतम्>T7 तपस्वी <वाग्-विदाम्>U वरम्
नारदम् परिपप्रच्छ वाल्मीकिः <मुनि-पुङ्गवम्>T7 1.1.1

Sloka 1.1.8:
<<इक्ष्वाकु-वंश>T6-प्रभवः>Bs6 रामः नाम जनैः श्रुतः
<नियत-आत्मा>Bs6 <महा(महत्)-वीर्यः>Bs6 द्युतिमान् धृतिमान् वशी 1.1.8
Sloka 1.1.14:
<<<वेद-<वेद-अङ्ग>T6>Di-तत्त्व>T6-ज्ञः>U <धनुर्-वेदे>T6 च निष्ठितः
<<<<सर्व-शास्त्र>K1-अर्थ>T6-तत्त्व>T6-ज्ञः>U स्मृतिमान् प्रतिभानवान् 1.1.14

6 Structure of Sanskrit Compounds

Sanskrit compounds are binary in nature (with the exception of द्वन्द्व and बहुपद-बहुव्रीहि). Hence they can be faithfully represented as binary trees, as in Figure 1. The analysis shown in this figure may be represented in a linear notation as <A-<B-C>>. We add a tag to each of the compounds, labeling it with its name. Thus the compound ABC after proper labeling will be <A-<B-C>tag1>tag2, where tag1 is the name of the compound formed by the words B and C, and tag2 is the name of the compound formed by A and BC. The grammar for validation of tagged compounds is given below.

7 Grammar of tagged compounds

compound: < component - component > tag
        | < component - component > tag taddhita
        | < component - component > tag number
        | < component - component > tag gender
        | < component - dvandvacomponents > dvandvatag
        | < component > Etag
        ;

dvandvacomponents: dvandvacomponents - component
        | component
        ;

component: pada
        | compound
        ;

tag: A[1-7]
        | Bs[2-7]
        | Bs[dpgsu]
        | Bsmn
        | Bv[psSU]
        | B[bv]
        | K[1-7]
        | Km
        | T[1-7]
        | T[bgkmnp]
        | Td[stu]
        | [ESUd]
        ;

dvandvatag: D[is]
        ;

Etag: E
        ;

pada: [a-zA-Z]+
        ;
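The grammar above can be checked with a small recursive-descent parser that also returns the tree structure described in Section 6. The following is a minimal sketch under several assumptions: padas are written in a roman transliteration matching the pada pattern of the grammar, the optional taddhita, number and gender annotations are not handled, and the names TagError, parse, parse_component and parse_compound are hypothetical. The tag Bvp is accepted because the tag table in Section 2 lists it.

```python
import re

# Tag inventory following the grammar; Bvp is included because the tag
# table in Section 2 defines it.
TAG = re.compile(
    r"A[1-7]|Bs[2-7]|Bs[dpgsu]|Bsmn|Bv[psSU]|B[bv]|K[1-7]|Km|"
    r"T[1-7]|T[bgkmnp]|Td[stu]|D[is]|[ESUd]"
)
PADA = re.compile(r"[A-Za-z]+")


class TagError(ValueError):
    """Raised when a string is not a well-formed tagged compound."""


def parse_component(text, pos):
    """component: pada | compound. Returns (node, next position)."""
    if text.startswith("<", pos):
        return parse_compound(text, pos)
    m = PADA.match(text, pos)
    if m is None:
        raise TagError(f"expected a pada or '<' at offset {pos}")
    return m.group(), m.end()


def parse_compound(text, pos):
    """compound: '<' component ('-' component)* '>' tag."""
    parts = []
    node, pos = parse_component(text, pos + 1)      # skip '<'
    parts.append(node)
    while text.startswith("-", pos):
        node, pos = parse_component(text, pos + 1)  # skip '-'
        parts.append(node)
    if not text.startswith(">", pos):
        raise TagError(f"expected '>' at offset {pos}")
    m = TAG.match(text, pos + 1)
    if m is None:
        raise TagError(f"unknown tag at offset {pos + 1}")
    tag = m.group()
    if len(parts) == 1 and tag != "E":
        raise TagError("only an E compound may have a single component")
    if len(parts) > 2 and tag not in ("Di", "Ds"):
        raise TagError("only dvandva compounds may have more than two components")
    return (tag, parts), m.end()


def parse(text):
    """Parse a whole tagged compound into nested (tag, components) tuples."""
    if not text.startswith("<"):
        raise TagError("a tagged compound must start with '<'")
    tree, pos = parse_compound(text, 0)
    if pos != len(text):
        raise TagError(f"unexpected trailing text at offset {pos}")
    return tree


if __name__ == "__main__":
    print(parse("<<tapas-svAdhyAya>Di-niratam>T7"))
```

Each compound becomes a pair of its tag and the list of its components, so the nesting of the tuples mirrors the binary tree of the compound; dvandva compounds are the only ones allowed to carry more than two children.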