Native Language Identification


Ishan Somshekar, Stanford University, Symbolic Systems, ishans@stanford
Bogac Kerem Goksel, Stanford University, Symbolic Systems, bkgoksel@stanford
Huyen Nguyen, Stanford University, Computer Science, huyenn@stanford

Abstract

Native Language Identification is the task of identifying the first, or native, language of a speaker solely from their speech in another language, in this case primarily English. In this paper, we explore the use of Support Vector Machines and Neural Networks and compare their performance on an NLI classification task.

1 Introduction

Native Language Identification is a fascinating problem: the idea that simple linear classifiers or neural networks can learn differences in how speakers with different native languages speak English has potentially far-reaching implications, particularly for the science of language acquisition. If machines can easily learn the tendencies and mistakes of language learners, this could help improve education systems. Native Language Identification is a classification problem, and in this particular dataset there are speakers with 11 different native languages. We approach this 11-way classification problem in two basic ways: with linear classifiers and with neural networks. We explore a variety of features in both frameworks. Since the dataset can be separated into text (transcriptions of speech) and audio (i-vector representations of the speakers), we use a combination of techniques on both kinds of data.

2 Related Work

Malmasi and Dras (2017) extensively test different models on this task. The authors tested a variety of linear classifiers, concluding that ensemble classifiers are the state-of-the-art approach. They discuss the features used in great detail, some of which we discuss below. Malmasi and Dras found that using function-word and POS-tagged n-grams in addition to simple unigrams, bigrams, and character n-grams was very helpful. In addition, using the Stanford CoreNLP parsers to generate CFGs and dependencies for the sentences improved the score. The authors' best-performing models were ensemble classifiers: classifiers that take the outputs of first-level classifiers as input and then output their own classification. Stehwien and Padó (2015) also analyze the performance of SVMs on this task, and use their results to identify key features of the datasets. Typically, models learn that speakers with different L1s (first languages) have different topic biases. In addition, particular misspellings or mispronunciations can be tied to specific L1s. Speakers of certain L1s also overuse the same words. Linguistic style is also a factor, as there are L1s, like Japanese, that favor a much more formal style than, say, French. Malmasi et al. (2016) also examine the performance of neural networks compared to SVMs on this task with a similar dataset, and interestingly find that SVMs tend to outperform neural networks, and that character-level features generally outperform word-level features.

2.1 Data

2.1.1 Data Overview

The data we are using for our project is a dataset that was recently released into the public domain by ETS. The dataset contains information collected from English speakers with 11 different native languages.
It includes 13,200 essays written by the different speakers, 13,200 transcriptions of the oral part of the test, and the i-vectors that correspond to the pitch, tone and speech of the speakers involved. The dataset is divided into

a train set (11,000 responses), a dev set (1,100 responses), and a test set (1,100 responses). We currently only have access to the train and dev sets, as the organization has not yet released the test set. Therefore, all scores reported in this paper are on the dev set released by ETS. The test set will be released at the end of June, and we are excited to see how our models perform on it. In this project we did not use the essays, focusing instead on the speech transcriptions and i-vectors. The speech transcriptions are provided as text files and are already tokenized into words or distinct sounds. We collected statistics on the lengths of the speech transcriptions and found that they range from 0 to 212 words. Further analysis showed that transcription lengths are well distributed around the median. Visualizations of the data are shown below.

Length   Mean   Median   Stddev   Max   Min
Train    98.1   100      28.1     212   0
Dev      98.5   100      28.07    211   0

Figure 1: Distribution of transcription lengths in the train set
Figure 2: Distribution of transcription lengths in the dev set

We can see that the train and dev sets are drawn from very similar distributions. Furthermore, both sets are perfectly balanced across classes, so no bias correction is required in that respect. The dataset is also balanced across languages in terms of the prompts the speakers responded to, eliminating topic biases. Finally, the dataset is controlled for the rough English proficiency of the speaker (based on their test score). All in all, the dataset provides a good, balanced variety of topics and speaker proficiencies.
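As a quick illustration, the length statistics above can be reproduced with a few lines of Python once the tokenized transcriptions have been read in; the variable name train_transcripts (a list of token lists, one per response) is a placeholder of ours rather than part of the released data tools.

# Minimal sketch: length statistics over the tokenized transcriptions.
# train_transcripts is assumed to be a list of token lists, one per response.
import statistics

lengths = [len(tokens) for tokens in train_transcripts]
print("mean  :", round(statistics.mean(lengths), 1))
print("median:", statistics.median(lengths))
print("stddev:", round(statistics.pstdev(lengths), 2))
print("max   :", max(lengths))
print("min   :", min(lengths))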
2.2 Models

We implemented a variety of linear classifier models and neural network models, and designed numerous features based on our research. Our simple baseline models are linear classifiers. Since the two data sources we use for this task are speech transcriptions and i-vectors, we first implemented a simple linear classifier that used unigram indicator features on the speech transcription, and then another that used the same unigram features along with the i-vector for that particular speaker. We planned to add features to these classifiers to improve performance at the same time as we implemented the neural net models.

2.2.1 Features

We first began by using more advanced n-gram features: the classifiers use counts of both unigrams and bigrams. In keeping with the related work on this task, people with the same L1 tend to overuse the same words, so counting unigrams and bigrams should help identify the native language. Additionally, we stemmed the words and added another unigram feature on the stemmed forms, to remove inaccuracies that could arise from varied word endings. We also used the spaCy library to tag each word with its part of speech and implemented a feature that counts the parts of speech in the transcript; our reasoning is that speakers with the same L1 should have similar distributions of POS usage. Interestingly, in contrast to many NLP tasks where stop-words are filtered out from the start, this task is more a matter of stylistic than semantic classification, so we felt that a unigram feature on stop-words would actually be quite helpful. Our reasoning here is again that speakers with the same L1 tend to use similar stop-words when they stop to think during their speech recordings.
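A minimal sketch of how these count features might be assembled follows. The function name transcript_features and the choice of spaCy's small English model, NLTK's Snowball stemmer, and NLTK's stop-word list are illustrative assumptions on our part, not necessarily the exact tools used.

# Sketch: count features for one tokenized transcription.
from collections import Counter

import spacy
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

nlp = spacy.load("en_core_web_sm")            # POS tagger (model must be installed)
stemmer = SnowballStemmer("english")
stop_set = set(stopwords.words("english"))    # requires nltk.download("stopwords")

def transcript_features(text):
    """Word uni/bigram, stemmed-word, POS-tag, and stop-word counts for one transcript."""
    doc = nlp(text)
    tokens = [t.text.lower() for t in doc if not t.is_space]
    feats = Counter()
    feats.update("w1=" + w for w in tokens)                                # word unigrams
    feats.update("w2=" + a + " " + b for a, b in zip(tokens, tokens[1:]))  # word bigrams
    feats.update("stem=" + stemmer.stem(w) for w in tokens)                # stemmed unigrams
    feats.update("pos=" + t.pos_ for t in doc if not t.is_space)           # POS tag counts
    feats.update("stop=" + w for w in tokens if w in stop_set)             # stop-word unigrams
    return feats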

Finally, we added the i-vectors as features to the linear classifier. The i-vectors are high-dimensional representations of the sound of the speakers' speech. They encode tone, pitch, prosody, and other acoustic properties, and are naturally excellent features for classifying L1, as they carry a representation of the speaker's accent.

2.2.2 Linear Classifier

Our linear classifier is the linear SVM classifier of the sklearn toolkit, using the default parameters of squared hinge loss, L2 penalization, and a one-versus-rest approach to multi-class classification. While we considered more elaborate ensemble SVM and meta-classifier schemes, we wanted to use the linear classifier as a baseline to compare against the neural network models. As such, we evaluated the performance of a simple SVM against the neural network models when given different sets of features. In general, we evaluated the linear classifier using the following feature combinations (all features presented as unigram and bigram counts of the tokens of the given type):

1. Unstemmed words (unigrams only)
2. Unstemmed words (with bigrams)
3. Unstemmed words + stemmed words
4. Unstemmed words + stemmed words + part-of-speech tags
5. Unstemmed words + stemmed words + part-of-speech tags + stop words
6. Unstemmed words + stemmed words + part-of-speech tags + stop words + i-vectors

In the end, the biggest performance gains came from going from unigrams to bigrams and from adding the i-vector data. While the high gains from the i-vector data were expected, since the i-vectors contain far more information about the acoustic speech patterns of the speakers, we were surprised that other features like part-of-speech tags and stop words did not lead to net increases in performance, despite the results of Malmasi and Dras. One reason for this could be that, as opposed to a stacked SVM architecture, we fed all the features into the same SVM, leading to overfitting due to too many features.
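A minimal sketch of this baseline is given below, assuming the transcript_features helper from the previous sketch and placeholder variables (train_texts, train_labels, train_ivectors, and their dev counterparts) for the loaded data; LinearSVC's defaults already correspond to squared hinge loss, an L2 penalty, and one-versus-rest classification.

# Sketch: count features (plus i-vectors) fed to a linear SVM baseline.
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

vectorizer = DictVectorizer()
X_train = vectorizer.fit_transform([transcript_features(t) for t in train_texts])
X_dev = vectorizer.transform([transcript_features(t) for t in dev_texts])

# Feature combination 6: also append each speaker's i-vector as extra columns.
X_train = hstack([X_train, csr_matrix(np.asarray(train_ivectors))])
X_dev = hstack([X_dev, csr_matrix(np.asarray(dev_ivectors))])

clf = LinearSVC()  # defaults: loss="squared_hinge", penalty="l2", multi_class="ovr"
clf.fit(X_train, train_labels)
print("dev accuracy:", accuracy_score(dev_labels, clf.predict(X_dev)))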
2.2.3 Neural Networks

Our first choice for a neural network architecture was a recurrent one, due to the success of recurrent networks on language-related tasks thanks to their retention of longer-term dependencies. Upon experimentation, we ended up using GRU cells, as they matched the performance of LSTM cells with much lower training times. Our general architecture takes a sequence of embeddings as input and feeds it through a one- or two-layer recurrent network; the final hidden state is fed to a fully connected tanh layer, and the outputs of that layer are passed through a softmax layer to be normalized into a probability distribution over the 11 classes. We also implemented a bidirectional LSTM model that reads the inputs in both the forward and backward directions; the concatenated hidden states from both directions are used to make the prediction.

Figure 3: The architecture of our Bidirectional LSTM Neural Network Model
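The architecture can be sketched in PyTorch as follows; the class name and hyperparameters here are illustrative rather than our exact tuned values, and nn.GRU can be swapped in for nn.LSTM (which we ultimately preferred) with a small change to how the hidden state is unpacked.

# Sketch: embeddings -> (bi)LSTM -> final hidden state(s) -> tanh layer -> softmax.
import torch
import torch.nn as nn

class RecurrentNLIClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128,
                 num_classes=11, bidirectional=True):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, num_layers=1,
                           batch_first=True, bidirectional=bidirectional)
        rnn_dim = hidden_dim * (2 if bidirectional else 1)
        self.hidden_layer = nn.Linear(rnn_dim, rnn_dim)
        self.output_layer = nn.Linear(rnn_dim, num_classes)

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)           # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.rnn(embedded)               # h_n: (num_dirs, batch, hidden_dim)
        if self.rnn.bidirectional:
            final = torch.cat([h_n[-2], h_n[-1]], dim=-1)  # concat forward/backward states
        else:
            final = h_n[-1]
        hidden = torch.tanh(self.hidden_layer(final))  # fully connected tanh layer
        return torch.log_softmax(self.output_layer(hidden), dim=-1)  # train with nn.NLLLoss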

We tried several approaches to embedding the inputs to our model. The following are the different inputs we tested:

1. GloVe vectors: We used the Wikipedia + Gigaword pretrained GloVe vectors. These vectors cover a vocabulary of 400,000 words and are trained on 6,000,000,000 tokens. Despite the size of the GloVe vocabulary, we found that over 5,000 of the 10,000 unique words seen in our training data were out of vocabulary for this vector set. One further improvement could be using the Common Crawl GloVe set, which has a vocabulary of 1,900,000 words. However, even then, many of the out-of-vocabulary words in our dataset are pause words like "uh" and "umm", or words cut off in the middle due to hesitation and the timing of the prompts. In the end, we think that most pretrained word vectors will fail to cover large parts of our vocabulary because of such phenomena, which are found only in speech data, as opposed to the written data that most pretrained word vectors are trained on.

2. Randomly initialized word vectors: Due to the aforementioned deficiencies of pretrained vectors, we also tried using randomly initialized word vectors as embeddings and training them through backpropagation during model training. While this should perform better due to full vocabulary coverage on the train set, the disadvantage is that adding the word embeddings to backpropagation gives the model more parameters and requires even more data to train well. It also increases the risk of overfitting when run over multiple epochs on the same training data. In the end, we found that this approach performed worse than GloVe vectors for this dataset and our architecture.

3. Character-level: Another approach we took was to scan the entire training set to build an alphabet of all characters used, one-hot encode each character, and feed the one-hot encodings to the neural network. In this scheme, the sequences are character sequences. The advantage of a character-level model is that it has fewer parameters than a word-embedding model, since the dimensionality is smaller, and it does not have the vocabulary issues of pretrained models. However, it requires more training to learn even longer dependencies (with the longer sequence lengths). As such, given our dataset size limitation, this embedding type also underperformed GloVe vectors, and we decided not to use it.

4. Part-of-speech tags, one-hot encoding: Malmasi and Dras's state-of-the-art NLI model on essays makes use of part-of-speech tags as a feature of their stacked-generalization classifier. We wanted to experiment with feeding the part-of-speech tags directly to the neural network model. While this is not a common practice, we felt that POS tags would be especially helpful in this task, because language transfer affects stylistic and syntactic elements of language production more than semantic elements. While word embeddings are good at representing the semantic content of a word, the model may benefit from direct access to syntactic content like POS tags. More complex architectures trained on larger datasets might learn to infer the information carried by the POS tags, but the size of our dataset meant that we could not expect our model to do so. As a result, we used the CoreNLP library and its Stanza Python client to tag the parts of speech of the entire dataset, and one-hot encoded the part-of-speech tags to create POS embeddings alongside the word embeddings. We concatenated the word embeddings and the POS embeddings before feeding them into the model. This provided a 3% increase in accuracy with both randomly initialized and GloVe word vectors.

5. Part-of-speech tags, vector embeddings: We also experimented with embedding the POS tags with randomly initialized trainable vectors. In this model, each POS tag that appeared in the dataset was given a randomly initialized vector as its embedding, and these vectors were concatenated with randomly initialized word vectors. At train time, the entire embedding layer was trained as well. This model had the same disadvantages as the model using only randomly initialized word vectors, and as such it underperformed the one-hot encoding of POS tags.
6. I-vectors: While our main focus was on what we could achieve without resorting to the speakers' i-vectors, we also experimented with feeding the i-vector data into the neural network for comparison purposes. In our case, we appended the i-vector of the whole speech excerpt directly to the word embedding of each word in the sequence, as sketched after this list. This is a redundant way to feed in this information, since there is a single i-vector for the entire speech excerpt, so every index in the sequence carries the exact same i-vector. However, this approach let us keep our simple neural network structure, and it still gave us a big performance boost.
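The richest input representation described in items 4 and 6 can be sketched as follows (our naming, PyTorch as above): each token's word embedding is concatenated with a one-hot encoding of its POS tag and with the speaker's single i-vector, repeated at every position. The recurrent classifier sketched earlier would then consume this concatenated tensor instead of the word embeddings alone, with its input dimension adjusted accordingly.

# Sketch: word embedding + one-hot POS tag + per-speaker i-vector at every time step.
import torch
import torch.nn.functional as F

def build_token_inputs(word_ids, pos_ids, ivectors, embedding, num_pos_tags):
    """word_ids, pos_ids: (batch, seq_len) long tensors; ivectors: (batch, ivec_dim)."""
    word_emb = embedding(word_ids)                                 # (batch, seq_len, embed_dim)
    pos_onehot = F.one_hot(pos_ids, num_pos_tags).float()          # (batch, seq_len, num_pos_tags)
    ivec = ivectors.unsqueeze(1).expand(-1, word_ids.size(1), -1)  # same i-vector at every step
    return torch.cat([word_emb, pos_onehot, ivec], dim=-1)         # input to the recurrent layer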

Figure 4: The architecture of our Neural Network Model that used word embeddings concatenated with POS tags and i-vectors

3 Evaluation

3.1 Experimental Results

Neural Network Model                            Score
GRU with random initialization                  0.428
GRU with GloVe                                  0.451
Bidirectional LSTM with random initialization   0.444
Bidirectional LSTM with GloVe                   0.462
GRU with GloVe & POS tags                       0.481
GRU with GloVe & i-vectors                      0.612

Linear Classifier Features                      Score
Unstemmed words                                 0.516
Stemmed words                                   0.524
Stemmed words, POS tags, Stopwords              0.518
Unstemmed words & i-vectors                     0.799
Stemmed words, POS tags, Stopwords, i-vectors   0.785
Stemmed words & i-vectors                       0.802

Table 1: Results.

We had difficulty replicating with our neural network models the accuracy scores attained by the linear classifiers (Malmasi and Dras, 2017). We believe this is mainly because 11,000 training samples are far too few to successfully train these models. This can be seen specifically in the trainable randomly initialized word embeddings being outperformed by GloVe, even though over a third of the corpus did not exist in GloVe: 11,000 examples are not enough to train word vectors, leading to the drop in accuracy. Additionally, the increase in accuracy from the POS tags in the neural net is probably due to there not being enough training examples for the neural net to learn the POS of the words inherently. This is illustrated in the figure below, which shows that train accuracy often jumps to close to 90% within a small number of epochs, resulting in overfitting. This problem persisted even with very high dropout probabilities, as the dataset was simply too small for the model not to overfit. Additionally, the dev accuracy hits its plateau quite early and is not able to increase any further, showing that the model is not able to generalize from the training data.

Figure 5: Train and dev accuracy for different models

An important observation is that both our neural net models and our linear classifiers improved drastically once we either encoded i-vectors into the word embeddings or added them as features for the classifier. Since all other features are built from the speech transcriptions, the models were limited to the content of the subject's speech; although some stylistic tendencies can be learned, adding a feature that encapsulates the actual aural properties of the subject's voice greatly enriches the information provided to the model and makes classification much easier. This is quite intuitive: if we asked humans to identify native languages, they would probably make most of their judgments based on accents, information that is not in the speech transcriptions at all.

3.2 Error Analysis

Figure 6: A confusion matrix showcasing predicted and actual labels

Looking at the confusion matrix, we found the performance of our model over the different L1s to be distributed similarly to the performance of models from the previous literature. The biggest pain point for most models on the TOEFL NLI datasets is the Hindi-Telugu language pair, and our model had the same trouble: Telugu was more often classified as Hindi than as itself. Another group of languages confused with each other were Japanese, Korean, and Chinese, which are linguistically and historically close. Similarly, Turkish had a higher level of confusion with Arabic, due to a similar linguistic proximity. Hindi, Japanese, and German were the most accurately identified L1s for our model, whereas Telugu and Chinese were the worst. The main reasons for this were a strong tendency to misclassify Telugu as Hindi (but not the other way around) and a strong tendency to misclassify Chinese as Korean (but not the other way around). Italian, French, German, and Spanish were generally more likely to be confused with each other, again fitting with geographic proximity, but not to the extent that it significantly affected performance on these languages. Another interesting linguistic observation was that Arabic was most likely to be misclassified as Spanish (and Spanish as Arabic), hinting at the historical connections between the two languages due to the long Arab rule over the Iberian peninsula. Overall, our confusion matrix shows that even with its relatively weak accuracy, our model's performance can be explained by the linguistic relations among the L1 languages, and that similar L1s are more likely to be confused. This suggests that our model is learning relevant features, and increases our confidence that with a bigger dataset it could generalize well.
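For reference, this kind of analysis can be reproduced from the dev-set predictions with scikit-learn; dev_labels and dev_predictions are placeholder names for the gold and predicted L1s.

# Sketch: row-normalized confusion matrix and the most frequent L1 confusions.
from sklearn.metrics import confusion_matrix

langs = sorted(set(dev_labels))
cm = confusion_matrix(dev_labels, dev_predictions, labels=langs)
rates = cm / cm.sum(axis=1, keepdims=True)        # per-L1 confusion rates

# List the most frequent (true L1, predicted L1) confusions, e.g. Telugu -> Hindi.
pairs = [(rates[i, j], langs[i], langs[j])
         for i in range(len(langs)) for j in range(len(langs)) if i != j]
for rate, true_l1, pred_l1 in sorted(pairs, reverse=True)[:5]:
    print(f"{true_l1} classified as {pred_l1}: {rate:.1%}")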
4 Future Work

While our experiments confirmed the observations from previous papers that neural networks underperform linear classifiers on the NLI task, we still believe there is room for exploration in getting the neural network models to perform better. The biggest potential improvement would come from increasing the dataset size. One way of achieving this is to combine this dataset with the previous NLI Shared Task data (the TOEFL11 dataset). Since both datasets come from the same context (TOEFL exams) and are topic balanced, this should lead to a direct improvement in model performance. There are also other English L2 datasets, such as the scientists corpus (Stehwien and Padó, 2015); however, due to the different topical biases of these datasets, they may not lead to direct gains over the TOEFL data. Beyond acquiring more data, different neural architectures may be better suited to this task. While recurrent neural networks are good at semantic tasks, since they can synthesize semantic content and dependencies over longer sections of text, this may not be what matters here. Since language transfer from the L1 affects stylistic and syntactic features more than semantic features, a convolutional model may be better at capturing the repeating language constructs throughout the text, so we expect convolutional models to perform better on this dataset. Furthermore, the inputs to the model could be extended with more syntactic information. POS tags gave a solid performance boost, and other features used in linear classifiers, such as dependency parse information and CFG fragments, may also improve performance; further experimentation is required to see which combinations yield the best results. Finally, the confusion matrix shows that most of the classification errors of the model occur between specific language pairs. This may be why stacked classifiers have been so successful on this task in the past. The implication for neural net models may be that adding an attention mechanism that attends over the sequence based on the initial likelihoods could force the model to learn to discern between certain language pairs. Another approach may be to train specific 2-way or 3-way classifiers for often-confused language groups, and to feed inputs through the more specific classifier for their most likely language group. Either of these approaches may improve the performance of the system.

References

Shervin Malmasi and Mark Dras. 2017. Native language identification using stacked generalization. CoRR, abs/1703.06541.

Sabrina Stehwien and Sebastian Padó. 2015. Generalization in native language identification: Learners versus scientists. CLiC-it, page 264.