Transliterated Search BITS PILANI HYDERABAD CAMPUS TEAM [ABHINAV MUKHERJEE, ANIRUDH RAVI, KAUSTAV DATTA]
Subtask 1: Language identification and back transliteration
Two main challenges were faced. Since the data came from user chats, it contained varied spellings and grammatical errors. In addition, forward transliteration produces many words that can be classified as both E (English) and L (Language).
Training Set
The data sets used were the ones provided, along with an external data set of 5000 frequently used English words. Character n-grams were extracted and used as features. The training data set was constructed in sparse ARFF format, for example:
{45 1, 54 1, 81 1, 86 1, 1653 1, 1873 1, 2634 1, 2755 1, 3377 1, 4039 1, 9394 1, 13316 1, 19162 1, 19550 english}
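The feature construction described above can be sketched as follows. This is a minimal illustration, not the team's actual code: the n-gram lengths, the vocabulary ordering, and the function names (`char_ngrams`, `build_vocab`, `sparse_arff_row`) are all assumptions, since the slides do not state them. The sketch assumes binary presence features and places the class label at the last attribute index, as in the example row.

```python
def char_ngrams(word, n_values=(1, 2, 3)):
    """Return the set of character n-grams of a word (n_values is an assumed choice)."""
    grams = set()
    for n in n_values:
        for i in range(len(word) - n + 1):
            grams.add(word[i:i + n])
    return grams


def build_vocab(words, n_values=(1, 2, 3)):
    """Map every n-gram seen in the word list to a 0-based attribute index."""
    all_grams = set()
    for w in words:
        all_grams |= char_ngrams(w, n_values)
    return {g: i for i, g in enumerate(sorted(all_grams))}


def sparse_arff_row(word, label, vocab, n_values=(1, 2, 3)):
    """Format one instance as a sparse ARFF data row.

    Only attributes with a non-default value are listed, in ascending
    index order; the class label is assumed to be the last attribute.
    """
    idxs = sorted(vocab[g] for g in char_ngrams(word, n_values) if g in vocab)
    parts = [f"{i} 1" for i in idxs]
    parts.append(f"{len(vocab)} {label}")
    return "{" + ", ".join(parts) + "}"


if __name__ == "__main__":
    vocab = build_vocab(["raja", "king"])
    print(sparse_arff_row("raja", "hindi", vocab))
```

N-grams that never occurred in the training vocabulary are silently dropped here; a real pipeline would have to decide how to treat them at test time.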
Language classification
Weka was used for machine learning. The classifier was trained using the Support Vector Machines algorithm with an optimised linear kernel function. Classifier performance was evaluated and tuned by cross-validation on the training set. The model was then tested on the test data provided.
Context Consideration
The forward-transliterated forms of many words are ambiguous between the two languages, for example to (तो), me (में), b (भी), use (उसे). For these words in the training set, we built a Naïve Bayes classifier that considers the language of the surrounding words.
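A context-based Naïve Bayes classifier of this kind could look like the sketch below. The slides do not specify the features or the smoothing, so the following are assumptions: the two features are the language labels of the immediately preceding and following words, add-one (Laplace) smoothing is used, and the class name `ContextNB` and the labels "H" and "E" are hypothetical.

```python
import math
from collections import Counter, defaultdict


class ContextNB:
    """Naive Bayes over the language labels of the neighbouring words."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # smoothing constant (assumed)
        self.class_counts = Counter()             # label -> number of examples
        self.feat_counts = defaultdict(Counter)   # (position, label) -> neighbour label counts
        self.values = set()                       # all neighbour labels seen in training

    def fit(self, examples):
        """examples: iterable of (prev_label, next_label, label) tuples."""
        for prev_lab, next_lab, label in examples:
            self.class_counts[label] += 1
            self.feat_counts[("prev", label)][prev_lab] += 1
            self.feat_counts[("next", label)][next_lab] += 1
            self.values.update([prev_lab, next_lab])
        return self

    def predict(self, prev_lab, next_lab):
        """Return the label with the highest smoothed log posterior."""
        total = sum(self.class_counts.values())
        n_values = max(len(self.values), 1)
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            for pos, val in (("prev", prev_lab), ("next", next_lab)):
                seen = self.feat_counts[(pos, label)][val]
                score += math.log((seen + self.alpha) / (count + self.alpha * n_values))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    train = [("H", "H", "H")] * 3 + [("E", "E", "E")] * 3
    model = ContextNB().fit(train)
    print(model.predict("H", "H"))
```

In practice one such model would be consulted only for the ambiguous word forms, with the SVM output supplying the neighbour labels.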
Results
Metric                                        Run 1      Run 2      Max Score  Med Score
EQMF All                                      0.005      0.004      0.005      0.001
EQMF without NE                               0.010      0.009      0.010      0.003
EQMF without Mix                              0.005      0.004      0.005      0.001
EQMF without Mix and NE                       0.010      0.009      0.010      0.003
EQMF All (No transliteration)                 0.205      0.177      0.276      0.194
EQMF without NE (No transliteration)          0.285      0.257      0.427      0.285
EQMF without Mix (No transliteration)         0.205      0.177      0.276      0.194
EQMF without Mix and NE (No transliteration)  0.285      0.257      0.427      0.285
ETPM                                          1923/2156  1876/2109  NA         NA
H-Precision                                   0.879      0.863      0.942      0.853
H-Recall                                      0.794      0.781      0.917      0.861
H-F Score                                     0.835      0.820      0.911      0.810
E-Precision                                   0.780      0.767      0.895      0.767
E-Recall                                      0.881      0.865      0.987      0.881
E-F Score                                     0.827      0.813      0.901      0.797
Transliteration Precision                     0.156      0.152      0.200      0.109
Transliteration Recall                        0.756      0.738      0.760      0.6335
Transliteration F Score                       0.258      0.252      0.304      0.1835
Labelling Accuracy                            0.838      0.826      0.886      0.792
Subtask 2
There were three main problems to tackle:
- Mixed-script documents and queries.
- Spelling variations, for example "raja ki aaegi baraat" and "raaja ki aaegi baaraat".
- Breaking and joining of words, for example "lejaenge lejaenge dilwale dulhaniya lejaenge" and "le jaenge le jaenge dilwale dulhaniya le jaenge".
Mixed Script Information Retrieval
There are two possible approaches:
- Expand the query to both scripts. This requires forward and backward transliteration of the query words.
- Convert the documents and queries to a single script. This requires backward transliteration of both the queries and the documents (if the native script is chosen).
Mixed Script Information Retrieval
We chose the second option because backward transliteration is more accurate than forward transliteration. We used Google's online transliteration tool to perform back transliteration; it returns the five nearest Hindi words.
Spelling variations
Special rules were implemented to normalize the spelling variations in the corpus:
- The letter ह is trimmed from words ending with that letter, as it is a source of spelling variation. For example, म ह and म, न यह च द ह ग and न य च द ह ग
- Words ending with ये, यै, यो, and यौ are often written with ए, ऐ, ओ, औ instead. For example, आइये can be written as आइए.
- On many occasions, when the vowels इ or ई, or a combination of both, occur on consecutive consonants, the latter vowel is dropped. For example, र श न and र न
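The first two rules can be sketched as a small word-level normalizer. This is only an illustration under stated assumptions: the direction of the second rule (rewriting the य forms to the independent vowel) is assumed, the third rule is left out because the slides do not define it precisely enough, and the name `normalize_word` is hypothetical. Code points are written as escapes so that the combining vowel signs are unambiguous.

```python
# Suffix rewrites for rule 2: ya + vowel sign -> independent vowel (assumed direction).
SUFFIX_MAP = {
    "\u092f\u0947": "\u090f",  # ये -> ए
    "\u092f\u0948": "\u0910",  # यै -> ऐ
    "\u092f\u094b": "\u0913",  # यो -> ओ
    "\u092f\u094c": "\u0914",  # यौ -> औ
}
HA = "\u0939"  # ह


def normalize_word(word):
    """Apply the trailing-ह rule and the ये/ए family rule to one word."""
    # Rule 1: trim a word-final ह (but never reduce a word to nothing).
    if len(word) > 1 and word.endswith(HA):
        word = word[:-1]
    # Rule 2: rewrite a word-final ये, यै, यो, यौ to ए, ऐ, ओ, औ.
    for suffix, replacement in SUFFIX_MAP.items():
        if word.endswith(suffix):
            word = word[:-len(suffix)] + replacement
            break
    return word


def normalize_text(text):
    """Normalize every whitespace-separated word of a query or document."""
    return " ".join(normalize_word(w) for w in text.split())


if __name__ == "__main__":
    print(normalize_word("\u0906\u0907\u092f\u0947"))  # आइये
```

The same function would have to be applied to both the documents and the queries so that the two sides meet on one spelling.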
Sub word indexing
Hindi words are broken and joined mostly along vowels (स्वर). This operation doesn't affect the consonants (व्यंजन) that make up the words. So the consonant patterns of the words in a document are concatenated to obtain the consonant pattern of the whole document. The document is then indexed along with the character n-grams (n = 3, 4, 5, 6) of this consonant pattern. For example:
lejayenge lejayenge dilwale dulhaniya lejayenge
le jayenge le jayenge dilwale dulhaniya le jayenge
Both give the same base, which is: ल-ज-ग-ल-ज-ग-द-ल-व-ल-द-ल-ह-न-य-ल-ज-ग
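The consonant-pattern idea can be sketched as below. The sketch assumes the text is already in Devanagari and spelling-normalized, and it treats the Unicode range U+0915 to U+0939 (क to ह) as the consonant set; that range and the function names are assumptions, not details given in the slides.

```python
def consonant_pattern(text):
    """Keep only Devanagari consonants (U+0915..U+0939), dropping vowels,
    vowel signs, anusvara, virama, and spaces."""
    return "".join(ch for ch in text if "\u0915" <= ch <= "\u0939")


def pattern_ngrams(pattern, n_values=(3, 4, 5, 6)):
    """Return all character n-grams of the consonant pattern, in order."""
    grams = []
    for n in n_values:
        for i in range(len(pattern) - n + 1):
            grams.append(pattern[i:i + n])
    return grams


if __name__ == "__main__":
    joined = "लेजाएंगे लेजाएंगे दिलवाले दुल्हनिया लेजाएंगे"
    split = "ले जाएंगे ले जाएंगे दिलवाले दुल्हनिया ले जाएंगे"
    # Both spellings reduce to the same consonant pattern.
    print(consonant_pattern(joined) == consonant_pattern(split))
    print("-".join(consonant_pattern(joined)))
    print(len(pattern_ngrams(consonant_pattern(joined))))
```

Because spaces are discarded before the n-grams are taken, a word that is split or joined differently in the query still yields the same index terms.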
Final query expansion
Consonant pattern indexing made the system resistant to spelling variations in vowels. To add resistance to spelling variations in consonants, we expand the query with sub-word patterns in which consonants are varied. The following mapping is used to vary the consonants, as these pairs were seen to be the major cause of spelling variations:
("क" -> "ख") ("ख" -> "क") ("ग" -> "घ") ("घ" -> "ग") ("च" -> "छ") ("छ" -> "च")
("ज" -> "झ") ("झ" -> "ज") ("त" -> "ट") ("ट" -> "त") ("ठ" -> "थ") ("थ" -> "ठ")
("द" -> "ध") ("ध" -> "द") ("न" -> "ण") ("ण" -> "न") ("ब" -> "भ") ("भ" -> "ब")
This was done within a certain limit, so that only a few permutations of consonant swaps are taken into consideration.
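A bounded expansion of this kind might be sketched as follows. The mapping is the one listed above; the limit itself is not given in the slides, so the `max_swaps` parameter and the function name `expand_pattern` are assumptions made for illustration.

```python
from itertools import combinations

# Consonant confusion pairs taken from the mapping above.
SWAPS = {
    "क": "ख", "ख": "क", "ग": "घ", "घ": "ग", "च": "छ", "छ": "च",
    "ज": "झ", "झ": "ज", "त": "ट", "ट": "त", "ठ": "थ", "थ": "ठ",
    "द": "ध", "ध": "द", "न": "ण", "ण": "न", "ब": "भ", "भ": "ब",
}


def expand_pattern(pattern, max_swaps=1):
    """Return the pattern plus every variant obtained by swapping at most
    max_swaps consonants (max_swaps is an assumed limit)."""
    positions = [i for i, ch in enumerate(pattern) if ch in SWAPS]
    variants = {pattern}
    for k in range(1, max_swaps + 1):
        for combo in combinations(positions, k):
            chars = list(pattern)
            for i in combo:
                chars[i] = SWAPS[chars[i]]
            variants.add("".join(chars))
    return variants


if __name__ == "__main__":
    for variant in sorted(expand_pattern("दलवल", max_swaps=1)):
        print(variant)
```

Keeping `max_swaps` small matters because the number of variants grows combinatorially with the number of swappable consonants in the query.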
Results of subtask 2
Run   NDCG@1  NDCG@5  MAP     MRR     RECALL
Run1  0.7500  0.7817  0.6263  0.7929  0.6818
Run2  0.7708  0.7954  0.6421  0.8171  0.6918