Seminar on: Detecting Latent User Properties in Text

L715/B659, Dept. of Linguistics, Indiana University, Fall 2014

Goals of this class
- Goal: based (solely) on the linguistic properties of a text, provide some characteristic of the writer
  - ≠ text classification, where the goal is to characterize the text
  - characteristic = age, native language, identity, etc.
- General approach: input features of the text into a classifier (supervised learning)
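To make the "features into a classifier" idea concrete, here is a minimal sketch, not the seminar's own method. It assumes a hypothetical feature set (relative frequencies of a handful of function words), a nearest-centroid classifier chosen only because it fits in a few lines, and made-up training texts with illustrative labels.

```python
from collections import Counter
import math

# Hypothetical feature set: relative frequencies of a few function words.
FUNCTION_WORDS = ["the", "a", "of", "and", "i", "you", "lol"]

def features(text):
    """Map a text to a vector of function-word relative frequencies."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def train(labeled_texts):
    """Average the feature vectors of each class (nearest-centroid training)."""
    sums, sizes = {}, Counter()
    for text, label in labeled_texts:
        vec = features(text)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        sizes[label] += 1
    return {label: [v / sizes[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, text):
    """Return the label whose centroid is closest (Euclidean) to the text."""
    vec = features(text)
    return min(centroids, key=lambda label: math.dist(vec, centroids[label]))

# Toy, made-up training data; the labels are illustrative only.
TRAIN = [
    ("lol i think you and i should go", "younger"),
    ("i cant believe you did that lol", "younger"),
    ("the report of the committee and the minutes of the meeting", "older"),
    ("a summary of the findings of the board", "older"),
]
centroids = train(TRAIN)
print(predict(centroids, "lol you and i"))
print(predict(centroids, "the end of the day"))
```

In practice the feature set and the classifier are exactly the parts one would vary; the sketch only fixes the overall shape: text in, feature vector, supervised model, writer label out.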

Goals of this class
Particular goals of this seminar:
- See the connections between what are at times disparate topics
- Understand the theoretical underpinnings of making such inferences
  - e.g., explore a range of linguistic features
- Obtain practice building a practical system
More generally, my goal in a seminar is to see you learn to:
- Develop as researchers
- Collaborate together on fun topics
Think of this more like a research lab than a class
- I hate to quote High School Musical, but: "We're all in this together."

Specific topics
The specific topics will include:
- Authorship attribution
- Plagiarism detection
- Deception detection
- Author profiling (e.g., sex) & attribution in social media
- Dialect identification
- Native language identification
- Language proficiency identification
Part of that will depend upon student interest... and maybe we'll come up with more in the process?

Authorship attribution
- Authorship attribution: who is the author of a document?
  - Given a set of authors, which one (if any) wrote a particular work?
- Authorship verification: did a particular person write a particular work or not?
  - More challenging problem
  - Connects to forensic linguistics
http://www.uni-weimar.de/medien/webis/research/events/pan-14/pan14-web/author-identification.html

Plagiarism detection
- Plagiarism detection can be seen as a subpart of authorship attribution
  - Is the author of the document who they claim to be?
- This can be divided into two tasks:
  - Source retrieval: find source documents for a suspicious document
  - Text alignment: identify reused text between two documents
- From our perspective, we may be interested in the opposite of source retrieval: is the current document dissimilar from other ones written by the same author?
http://www.uni-weimar.de/medien/webis/research/events/pan-14/pan14-web/plagiarism-detection.html
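As a rough illustration of identifying reused text between two documents, the sketch below finds word trigrams shared by a suspicious document and a candidate source, and reports what fraction of the suspicious document's trigrams are covered. The function names and the choice of whitespace tokenization and n = 3 are assumptions made for this example; real text-alignment systems are considerably more elaborate.

```python
def word_ngrams(text, n=3):
    """Set of word n-grams (as tuples) in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def shared_ngrams(suspicious, source, n=3):
    """N-grams that occur in both documents: candidate reused passages."""
    return word_ngrams(suspicious, n) & word_ngrams(source, n)

def containment(suspicious, source, n=3):
    """Fraction of the suspicious document's n-grams also found in the source."""
    susp = word_ngrams(suspicious, n)
    if not susp:
        return 0.0
    return len(susp & word_ngrams(source, n)) / len(susp)

# Made-up example texts.
source = "the quick brown fox jumps over the lazy dog"
suspicious = "we saw the quick brown fox yesterday"
print(sorted(shared_ngrams(suspicious, source)))
print(containment(suspicious, source))
```

The same measure can be turned around for the "opposite" question on the slide: a low overlap or similarity with an author's other documents would be the signal of interest, though with stylistic rather than verbatim features.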

Deception detection
- Deception detection, at least in one form, is the task of detecting whether a person is who they claim to be
  - We are less interested in finding deceptive content written by someone whose identity we know
  - If we have user meta-data (e.g., age), deception detection could take the form of author profiling (e.g., does the text match the age?)
- A similar task to plagiarism detection, but:
  - There is an overt attempt to hide one's identity
  - There is new content, i.e., you cannot compare content to other documents
- One specific instance of this is sockpuppet detection: detecting whether a user account is a fake
  - Can aggregate over multiple textual instances
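Aggregating over multiple textual instances could look like the following sketch. It assumes some per-text classifier has already produced either a label or a score for each post from an account; both aggregation functions, the labels, and the 0.5 threshold are hypothetical choices for illustration.

```python
from collections import Counter

def aggregate_votes(per_text_labels):
    """Majority vote over per-text labels; ties go to the label seen first."""
    counts = Counter(per_text_labels)
    return counts.most_common(1)[0][0]

def aggregate_scores(per_text_scores, threshold=0.5):
    """Average per-text scores and compare the mean against a threshold."""
    mean = sum(per_text_scores) / len(per_text_scores)
    return mean, mean >= threshold

# Made-up per-post outputs for a single account.
print(aggregate_votes(["fake", "real", "fake"]))
print(aggregate_scores([0.9, 0.4, 0.8]))
```

Averaging scores keeps more information than voting on hard labels, at the cost of needing a classifier that outputs comparable scores.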

Author profiling & attribution in social media
- Author profiling: categorize authors based on a demographic property, such as:
  - Sex/gender
  - Age
  - Political persuasion
  - ...
- Task is to take a document (or set of documents) and determine a category (e.g., "18-24 years old")
- This is commonly of interest for work in social media, where demographic information can help with product assessment
http://www.uni-weimar.de/medien/webis/research/events/pan-14/pan14-web/author-profiling.html

Dialect identification
- Dialect identification: determine the (regional) dialect of a writer of a text
  - A specific type of author profiling, often of interest to (socio)linguists
- Discriminating between Similar Languages (DSL) task: determine which variety of a language (from among a set) a document belongs to
  - i.e., which dialect is the writer using?
- Unlike demographics such as age, note that there are fuzzy boundaries between dialects, and that various factors (demographics) interact
  - i.e., language usage varies based on region, sex, age, ...
http://corporavm.uni-koeln.de/vardial/sharedtask.html

Native language identification
- Native language identification: identify the native language (L1) of a writer based on second language (L2) writing
  - Guessing a latent property based on systematic behavior in text
- As with dialect identification, we are getting into topics that are more specific to language usage
  - i.e., we are classifying a property of the writer's language, not some non-linguistic demographic
- Many language learners have some similar patterns, e.g., (non)usage of articles
  - But the entirety of their patterns tends to differ
  - Note that data size is thus an issue (as with all these topics)
https://sites.google.com/site/nlisharedtask2013/home
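One way to turn a pattern such as (non)usage of articles into a number a classifier can use is a simple rate, sketched below. The function name, the punctuation stripping, and the example sentences are assumptions for illustration; a single feature like this would not separate L1 groups on its own, which is the slide's point about needing the full pattern of behavior.

```python
ARTICLES = {"a", "an", "the"}

def article_rate(text):
    """Proportion of tokens that are English articles (crude tokenization)."""
    tokens = [t.strip(".,;:!?\"'()") for t in text.lower().split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    return sum(t in ARTICLES for t in tokens) / len(tokens)

# Made-up sentences: one with articles, one with the articles dropped.
print(article_rate("I went to the store and bought a book."))
print(article_rate("I went to store and bought book."))
```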

Language proficiency identification
- Language proficiency identification: determine the level of a language learner
  - Given some level categorization (e.g., DLI, CEFR, course placement)
- Some researchers investigate criterial features, those features which best distinguish levels within a language
- Language proficiency identification gets into more temporary & fluid properties of the writer
  - proficiency today ≠ proficiency a year from now

Other topics?
And are there other topics to examine?
- I don't know, but any (more or less permanent) demographic is in principle possible
  - where a demographic property is an inherent or cultural marker of the person
  - e.g., political persuasion, income, (language) disability, education, ...
  - NB: something like mood (or even political affiliation) is more fluid (a spectrum)
- Fishing for some of these would probably lead to accidental correspondence, but we should be curious to find anything where:
  - Input = natural language text(s)
  - Output = classification of the writer, in terms of some demographic property

For the next two-ish weeks, we'll make sure we're more or less on the same page:
- Machine learning & text classification techniques
- Natural Language Processing (NLP) tools
- Our focus will be largely practical
After that, it'll be less me & more you leading discussion
- By Monday, September 8, I want you to sign up to lead the discussion on a particular topic.
  - Lead ≠ Be an expert on
  - Lead = read papers ahead of time, help determine the interesting areas to explore, come to class with the interesting points identified, etc.
- I'll give you a specific assignment on this next time

Expectations
This seminar is going to be:
1. Exploratory & interactive: think of this as collaborative learning in a lab-like environment
2. Demand-driven: I have only sketched a syllabus; the contents will be driven in part by your interests
But note that we can't just pick one topic & ignore the others
- Notice how we have only scratched the surface on how these topics are similar and how they differ
- Issues such as inherentness, interaction of demographics, language-usage-specificity, etc. are important to sort out
- The degree to which we understand how to transfer techniques from one area to another depends on how well we understand this