Conference Presentation

Towards automatic geolocalisation of speakers of European French
SCHERRER, Yves, GOLDMAN, Jean-Philippe

Abstract
Starting in 2015, Avanzi et al. (2016) launched several online surveys to inquire about regionalisms in European French (France, Belgium and Switzerland). Here, we investigate the use of data from these surveys for automatic speaker geolocalisation, both as a playful incentive to attract participants for further inquiries and as a scientific analysis method for the already collected data. Following Leemann et al. (2016), the problem of automatic speaker geolocalisation consists in predicting the dialect/regiolect of a speaker (typically, a speaker who has not participated in the survey) by asking a set of questions (typically, a small subset of the surveyed variables). Given our motivations, the success of a speaker geolocalisation method should be assessed not only by the percentage of correct answers, but also by its ability to entertain and surprise potential participants. Three parameters influence this success:
- The number and type of questions to be asked. No more than 20 questions should be asked, so as not to exceed participants' attention span.
- The number and type of the areas to predict. The areas should reflect the [...]

Reference
SCHERRER, Yves, GOLDMAN, Jean-Philippe. Towards automatic geolocalisation of speakers of European French. In: International Conference on Language Variation in Europe (ICLAVE 9), Malaga (Spain), 6-9 June, 2017
Available at: http://archive-ouverte.unige.ch/unige:95474
Disclaimer: layout of this document may differ from the published version.

Towards automatic geolocalisation of speakers of European French Yves Scherrer & Jean-Philippe Goldman University of Geneva

- Automatic speaker geolocalisation
- Data
- Simulation and methods:
  - Clustering and shibboleth detection
  - Recursive feature elimination
- Crowdsourced results

Automatic speaker geolocalisation
Ask a speaker n questions and predict his/her most likely area of origin (one out of m areas) with p% accuracy.

Goals:
- Provide a playful incentive to attract participants for further inquiries: collect more data
- Explore scientific analysis methods of the already collected data: select questions and areas to maximize accuracy
[Diagram labels: Observation, Prediction]

Parameters of the task:
- Expected accuracy of predictions
- Number and type of questions asked
- Number and type of predicted areas

Previous work (Leemann since 2013; parlometre.ch - TSR - 2015):
- Create a geolocalisation model using data from atlases
- Select n questions on the basis of a dialectologist's knowledge
- Use the same m areas as in the original data
- Assess accuracy post-hoc (compare model predictions with participants' real origins)

Our approach:
- ... from online inquiries
- Select optimal n questions by statistics
- Select optimal m areas by statistics
- Estimate accuracy (given n and m) using the same data as for model creation
- Assess accuracy post-hoc and compare with estimates

Data
Project Français de nos régions (Avanzi, Glikman et al., 2015): online surveys to inquire about regionalisms in European French (France, Belgium, Switzerland).
              Survey 1              Survey 2
Period        May 2015 - May 2016   September 2015 - May 2016
Questions     40                    90
Participants  12 000                8 000

Simulation
Simulation framework: {questions} + {areas} → prediction accuracy
Idea: leave-one-out method using two views of the same dataset
- Train model on aggregated data of all except one participant
- Predict origin of left-out participant, compare to ground truth
We do not leave out the test participant from the aggregated data:
- Much faster, as we don't have to train a new model for each participant
- Since training data are aggregated and there are always > 1 participants per area, there is never an exact correspondence between training and test data
- Preliminary tests show good correlation with true leave-one-out method
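
The simulation idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say which model is trained on the aggregated data, so the profile-matching predictor below is a stand-in, and the names `build_area_profiles`, `predict_area`, `simulated_accuracy` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def build_area_profiles(participants):
    """Aggregate answers per area: profiles[area][question] counts each variant."""
    profiles = defaultdict(lambda: defaultdict(Counter))
    for area, answers in participants:
        for question, variant in answers.items():
            profiles[area][question][variant] += 1
    return profiles

def predict_area(profiles, answers, questions):
    """Pick the area whose aggregated answers agree most with one participant."""
    def score(area):
        total = 0.0
        for q in questions:
            counts = profiles[area][q]
            n = sum(counts.values())
            if n and q in answers:
                # Share of this area's participants who gave the same variant.
                total += counts[answers[q]] / n
        return total
    # sorted() makes tie-breaking deterministic (first area in sorted order wins).
    return max(sorted(profiles), key=score)

def simulated_accuracy(participants, questions):
    """Share of participants whose own area is predicted from the aggregate."""
    # As on the slide, the tested participant is NOT removed from the aggregate.
    profiles = build_area_profiles(participants)
    hits = sum(predict_area(profiles, answers, questions) == area
               for area, answers in participants)
    return hits / len(participants)

# Hypothetical toy data: (area, {question: chosen variant}).
toy_participants = [
    ("GE", {"90": "nonante", "70": "septante"}),
    ("GE", {"90": "nonante", "70": "septante"}),
    ("75", {"90": "quatre-vingt-dix", "70": "soixante-dix"}),
    ("75", {"90": "quatre-vingt-dix", "70": "soixante-dix"}),
]
print(simulated_accuracy(toy_participants, ["90", "70"]))
```

Calling `simulated_accuracy` with different question subsets gives one accuracy estimate per {questions} choice, which is the quantity the simulation framework compares.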

Two preprocessing steps:
1. Settle on initial set of areas: FR départements, BE provinces, CH cantons (110)
2. Match participants from Survey 1 with participants from Survey 2 (same origin)
Two approaches to find {questions} and {areas}:
1. Clustering and shibboleth detection
2. Recursive feature elimination

Clustering and shibboleth detection
1. Determine the most relevant areal partition using hierarchical cluster analysis
[Maps: Ward's method, 5 clusters; Ward's method, 10 clusters; weighted average, 10 clusters]

2. Use the shibboleth detection algorithm (Prokic, Çöltekin & Nerbonne 2012) to find the most characteristic questions for each area (e.g. 5 shibboleths/cluster)

[Map labels: five shibboleths per cluster]
- Morve, Quatre-vingt-dix, Soixante-dix, Ving(t), Sèche-cheveux
- Sèche-cheveux, Groseilles, Clignotant, Quatre-vingt-dix, Soixante-dix
- Soixante-dix, Sèche-cheveux, Quatre-vingt-dix, Morve, Groseilles
- Groseilles, Sèche-cheveux, Clignotant, Sécher, Nombril
- Soixante-dix, Quatre-vingt-dix, Sèche-cheveux, Chocolatine, Groseilles
- Péguer, Challer, Soixante-dix, Sèche-cheveux, Quatre-vingt-dix
- Essuie-tout, Septante, Nonante, Quelle heure il est?, Morve
- Soixante-dix, Quatre-vingt-dix, Groseilles, Flaques, Clignotant
- Débarouler, Sèche-cheveux, Ving(t), Groseilles, Clignotant
- Encoubler/Achouper, Septante, Nonante, Ça joue, Souper

Clustering and shibboleth detection
Simulation results:
- 10 clusters, all 130 questions: 65.1% correct
  - The results are very sensitive to the cluster borders: -24% between 4 and 5 clusters; -21% between 10 and 11 clusters
  - It is difficult to determine a good number of clusters and an optimal cluster algorithm
- 10 clusters, 14 manually defined questions: 67.0% correct
  - Few carefully selected questions are better than all questions
- 10 clusters, 20 questions determined by shibboleth detection: 61.8% correct
  - Unintuitive choice of questions (standard variants for most areas)
  - Clusters are defined on all data, not on single determining questions

Recursive feature elimination
1. The linguistic variables may have several variants with different distributions. Treat each variant separately.
2. Some variants are hardly ever used or show no geographic variation at all. Discard them first.
3. Train a classifier with the remaining variants, remove the one variant that contributes least to the classification, repeat.
4. Use the 110 atomic areas and distance between centroids throughout the process. At the end, dynamically extend the areas to their immediate and second-order neighbors.

Step 1: the linguistic variables may have several variants with different distributions. Treat each variant separately.
- Binarize data: 130 n-ary variables → 639 binary variables
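
As an illustration of the binarization step, the sketch below one-hot encodes n-ary answers into one binary feature per (question, variant) pair. The function name `binarize` and the toy responses are hypothetical; the slides do not describe how the encoding was implemented.

```python
def binarize(responses):
    """One-hot encode n-ary answers into binary (question, variant) features."""
    # Every (question, variant) pair observed in the data becomes one column.
    features = sorted({(q, v) for answers in responses for q, v in answers.items()})
    # A cell is 1 if the participant chose that variant for that question, else 0.
    rows = [[1 if answers.get(q) == v else 0 for q, v in features]
            for answers in responses]
    return features, rows

# Hypothetical toy responses: one dict of {question: chosen variant} per participant.
toy_responses = [
    {"90": "nonante"},
    {"90": "quatre-vingt-dix"},
    {"90": "nonante", "pain": "chocolatine"},
]
toy_features, toy_rows = binarize(toy_responses)
for feature, column in zip(toy_features, zip(*toy_rows)):
    print(feature, column)
```

A question with k attested variants yields k binary columns, which is how a set of n-ary variables grows into a larger set of binary ones.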

Step 2: some variants are hardly ever used or show no geographic variation at all. Discard them first.
- Single-pass feature elimination based on χ² score
- Remove variables that are least statistically dependent on area
- Lowest average distance with 150 variants
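
A χ²-based single-pass filter could look roughly like the following. This is a sketch under assumptions: it computes the Pearson χ² statistic of each binary variant against the area labels and keeps the k highest-scoring variants; the exact scoring variant used by the authors is not given, and `chi2_score`, `select_by_chi2` and the toy columns are hypothetical names.

```python
from collections import Counter

def chi2_score(uses, areas):
    """Pearson chi-square statistic of a binary variant against area labels."""
    n = len(uses)
    area_totals = Counter(areas)
    used_total = sum(uses)
    observed = Counter(zip(areas, uses))
    stat = 0.0
    for area, area_n in area_totals.items():
        for value, value_n in ((1, used_total), (0, n - used_total)):
            # Expected count if usage were independent of area.
            expected = area_n * value_n / n
            if expected > 0:
                stat += (observed[(area, value)] - expected) ** 2 / expected
    return stat

def select_by_chi2(columns, areas, k):
    """Keep the k variants that depend most strongly on area (single pass)."""
    scores = {name: chi2_score(col, areas) for name, col in columns.items()}
    # Highest score first; the name breaks ties deterministically.
    return sorted(scores, key=lambda name: (-scores[name], name))[:k]

# Hypothetical toy data: one binary column per variant, one area label per participant.
toy_areas = ["A", "A", "B", "B"]
toy_columns = {
    "nonante": [1, 1, 0, 0],   # used only in area A
    "noise":   [1, 0, 1, 0],   # used equally often in both areas
    "unused":  [0, 0, 0, 0],   # never used
}
print(select_by_chi2(toy_columns, toy_areas, k=1))
```

Variants that are never used, or used equally often everywhere, score zero and are the first to be discarded.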

Step 3: train a classifier with the remaining variants, remove the one variant that contributes least to the classification, repeat (= recursive feature elimination).
- We test two classifiers: SVM and MaxEnt
- Both classifiers achieve much better simulation results than the χ² method
- MaxEnt slightly worse than SVM
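
The elimination loop itself can be sketched as below. Note the substitution: instead of an SVM or MaxEnt classifier, this example uses a small nearest-centroid classifier, and it measures a variant's contribution by the accuracy obtained without it (wrapper-style backward elimination). Both choices, and the names `centroid_accuracy` and `backward_elimination`, are assumptions made to keep the example self-contained.

```python
from collections import defaultdict

def centroid_accuracy(rows, areas, feats):
    """Accuracy of a nearest-centroid classifier restricted to the columns in feats."""
    sums = defaultdict(lambda: [0.0] * len(feats))
    counts = defaultdict(int)
    for row, area in zip(rows, areas):
        counts[area] += 1
        for i, f in enumerate(feats):
            sums[area][i] += row[f]
    centroids = {a: [s / counts[a] for s in sums[a]] for a in counts}

    def predict(row):
        def dist(area):
            return sum((row[f] - c) ** 2 for f, c in zip(feats, centroids[area]))
        # sorted() gives a deterministic winner when distances tie.
        return min(sorted(centroids), key=dist)

    return sum(predict(r) == a for r, a in zip(rows, areas)) / len(rows)

def backward_elimination(rows, areas, n_keep):
    """Repeatedly drop the feature whose removal hurts accuracy the least."""
    remaining = list(range(len(rows[0])))
    while len(remaining) > n_keep:
        def accuracy_without(f):
            # Re-evaluate the classifier on the remaining features minus f.
            return centroid_accuracy(rows, areas, [g for g in remaining if g != f])
        # Highest accuracy without f means f contributes least; ties drop the larger index.
        weakest = max(remaining, key=lambda f: (accuracy_without(f), f))
        remaining.remove(weakest)
    return remaining

# Hypothetical toy data: column 0 separates the areas, column 1 is noise, column 2 is constant.
toy_rows = [[1, 0, 0], [1, 1, 0], [0, 0, 0], [0, 1, 0]]
toy_areas = ["A", "A", "B", "B"]
print(backward_elimination(toy_rows, toy_areas, n_keep=1))
```

With a linear SVM or MaxEnt model, a common alternative is to rank variants by the magnitude of their learned weights after each retraining round rather than by re-measuring accuracy.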

Step 4: at the end, dynamically extend the areas to their immediate and second-order neighbors.
- Simulation results with 20 variants / 17 questions: 66.2% correct on second-order neighbors
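
Scoring a prediction against first- and second-order neighbors can be illustrated as follows, assuming an adjacency list of areas is available. The adjacency data, the function names `neighborhood` and `neighbor_accuracy`, and the toy chain of areas are hypothetical; the slides do not specify how neighborhood was computed.

```python
def neighborhood(adjacency, area, order):
    """Areas reachable from `area` in at most `order` border crossings (itself included)."""
    seen = {area}
    frontier = {area}
    for _ in range(order):
        # Expand one step outwards, ignoring areas already collected.
        frontier = {n for a in frontier for n in adjacency.get(a, ())} - seen
        seen |= frontier
    return seen

def neighbor_accuracy(adjacency, predictions, truths, order):
    """Share of predictions whose extended area contains the true origin."""
    hits = sum(truth in neighborhood(adjacency, predicted, order)
               for predicted, truth in zip(predictions, truths))
    return hits / len(truths)

# Hypothetical toy map: four areas in a row, A - B - C - D.
toy_adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
toy_predictions = ["A", "A", "A"]
toy_truths = ["A", "C", "D"]
for order in (0, 1, 2):
    print(order, neighbor_accuracy(toy_adjacency, toy_predictions, toy_truths, order))
```

Order 0 corresponds to an exact match on the atomic area, order 1 to immediate neighbors and order 2 to second-order neighbors, so accuracy can only stay equal or rise as the order grows.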

Online speaker geolocalisation

Three versions:
- Feature elimination with MaxEnt
- Feature elimination with SVM
- Manual selection of 15 questions
4000 participants 4000 200
40% of participants provided sociolinguistic info (country + zip, age, gender, email)
Social networks sharing and media coverage

Crowdsourced data (110 areas - f-score)
                          Part   Best   5-Best   Neighb-1   Neighb-2
Feature elimination ME    1631   11 %   43 %     40 %       62 %
Feature elimination SVM   1679   13 %   47 %     47 %       64 %
Manual selection          54     5 %    16 %     12 %       18 %
Random                           <1 %   4.5 %    ~4.5 %     ~9 %

Simulated data (110 areas - f-score)
                          Best   5-Best   Neighb-1   Neighb-2
Feature elimination ME    14 %   49 %     47 %       64 %
Feature elimination SVM   13 %   46 %     46 %       64 %
Manual selection          10 %   36 %     40 %       57 %

Discussion
- Attempt to apply machine learning techniques for question (and area) selection
- Estimate success of crowdsourced linguistic campaign before launch
- Automatic selection better than manual? (to be confirmed)
- Crowdsourced geolocalisation also means data collection
donnezvotrefrancais.fr

Recursive feature elimination
Retained features from the SVM classifier:
- Pain au chocolat / chocolatine / couque au chocolat / ...
- Ving[t]
- Crayon de papier / de bois / gris / ...
- Nonante / quatre-vingt-dix
- Péguer
- Gouttière / cheneau
- Il est midi vingt / et vingt / vingt
- Dîner / déjeuner
- Pain aux raisins / escargot / schnäcke
- Je vais y faire / le faire
- Faire tomber / tomber / échapper
- Séchoir / étendoir / étendage / tancarville
- Moin[s]
- Escargot / cagouille / luma
- Dégun / personne
- Septante / soixante-dix
Retained features from the MaxEnt classifier:
- Ving(t)
- Il est midi vingt / et vingt / vingt
- Pain au chocolat / chocolatine / couque au chocolat / ...
- Crayon de papier / de bois / gris /
- Ça joue / ça va
- Gorgée / schlouk / lichette
- Gouttière / cheneau
- Stan[d]
- Empêtrer / encoubler / achouper / ..
- Dîner / déjeuner
- Péguer
- Pain aux raisins / escargot / schnäcke
- Séchoir / étendoir / étendage / tancarville
- Papier ménage / Sopalin / essuie-tout