API Linking in Informal Technical Discussion. Australian National University / Data61


API Linking in Informal Technical Discussion CHENG CHEN (U5969643) SUPERVISED BY ZHENCHANG XING Australian National University / Data61

Background In programming discussion platforms (e.g., Stack Overflow, Twitter), many APIs are mentioned in natural-language contexts, for example: "How to apply pos_tag_sents() to a pandas dataframe efficiently? The documentation describes support for the apply() method, but it doesn't accept any arguments."

Challenges 1. Common-word polysemy: e.g., 55.04% of Pandas's APIs have common-word simple names, such as: Series (a class name), apply (a method name). Therefore, in the sentence "I want to apply a function with arguments to a series in Python pandas. The documentation describes support for the apply method, but it doesn't accept any arguments.", it is hard to recognize whether each "apply" is the general verb or the function name of Pandas.

Challenges 2. Sentence-format and sentence-context variations: "I have finally decided to use apply which I understand is more flexible"; "if you run apply on a series the series is passed as a np.array"; "It is being run on each row of a Pandas DataFrame via the apply function". We cannot simply develop a complete set of regular expressions or an island-grammar checker to recognize API mentions.

Challenges 3. The variety of API-mention forms: a) variations caused by different presentation styles, coding styles, abbreviation methods, and contexts; b) accidental factors such as misspellings, inconsistent annotations, and spacing.

Idea of solutions 1. Some character-level features are commonly shared by many API mentions, e.g. Matplotlib.pyplot.savefig(path = ): case sensitivity, module names, brackets, parameter list.
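These surface cues can be checked with a few simple rules. A minimal sketch in Python (the feature names and the regular expression are illustrative assumptions, not the features actually used in the model):

```python
import re

def char_level_features(token):
    """Surface cues that often signal an API mention.

    Hypothetical feature set, illustrating the cues named on the slide:
    case sensitivity, module qualification, brackets, parameter list.
    """
    return {
        "has_internal_cap": any(c.isupper() for c in token[1:]),  # case sensitivity
        "has_dot_path": "." in token.strip("."),                  # module names
        "has_brackets": "(" in token,                             # call brackets
        "has_params": bool(re.search(r"\(\s*\S+.*\)", token)),    # parameter list
    }

feats = char_level_features("matplotlib.pyplot.savefig(path)")
```

A plain verb such as "apply" triggers none of these cues, which is why they help only in combination with sentence context.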

Idea of solutions 2. The sentence-context information helps distinguish API mentions from non-API words such as verbs, some nouns, and software-programming jargon. Some example posts: "It is being run on each row of a Pandas DataFrame via the apply function"; "if you run apply on a series the series is passed as a np.array".

Workflow: Stack Overflow discussion dataset → Tokenizer → Raw data → Manually labelled data → Data selector → DNN model training → Transfer learning

Preparation of the training data 1. The tokenizer is developed from a Twitter tokenizer, with special rules added to keep the API-mention structure intact. A general English tokenizer's output: "Matplotlib . Pyplot . Imshow ( )"; this work's tokenizer output: "matplotlib.pyplot.imshow()". 2. Manually labelled 1500 posts (including 3722 API mentions, over 5000 sentences).
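A minimal sketch of such an API-aware rule (the single regular expression below is an illustrative assumption, far simpler than the actual rule set): keep a dotted name with an optional trailing "()" together as one token instead of splitting on "." and "(".

```python
import re

# One illustrative rule: a dotted identifier, optionally followed by "()",
# is matched as a single token; anything else falls back to a non-space run.
API_TOKEN = re.compile(r"[A-Za-z_][\w.]*(?:\(\))?|\S")

def tokenize(sentence):
    return API_TOKEN.findall(sentence)

tokens = tokenize("you can use matplotlib.pyplot.imshow() here")
```

The alternation order matters: the API pattern is tried first, so `matplotlib.pyplot.imshow()` survives as one token rather than seven.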

Neural Network Structure: Input data → Convolutional layer (word embedding + character-level features) → Max-pooling layer → Bi-LSTM layer (sentence-level features) → Classifier of API mentions

Neural Network Structure 1. The convolutional layer and max-pooling layer are used to capture character-level features. A matrix of character vectors represents the word (e.g. "s a v e t x t"), and max pooling keeps the most important value from each filter's responses.
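The idea can be illustrated with a toy 1-D convolution and max pool in plain Python (the numbers standing in for character vectors and the filter weights are made up for illustration):

```python
# A scalar stand-in for the character-vector encoding of a word like "savetxt".
chars = [0.2, 0.1, 0.9, 0.3, 0.4, 0.8, 0.4]

def conv1d(char_vecs, kernel):
    """Slide the kernel over the character sequence, one response per window."""
    width = len(kernel)
    return [sum(kernel[i] * char_vecs[j + i] for i in range(width))
            for j in range(len(char_vecs) - width + 1)]

def max_pool(responses):
    """Keep only the strongest filter response, wherever it occurs in the word."""
    return max(responses)

pooled = max_pool(conv1d(chars, [0.5, 1.0, 0.5]))
```

Because only the maximum survives, the feature is insensitive to where in the word the matching character pattern appears.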

Neural Network Structure 2. A bidirectional long short-term memory (Bi-LSTM) layer is used to learn sentence-level information. The forward input sequence (e.g. "you can use apply() method") and the backward input sequence are each processed, and the two outputs are concatenated as the abstract matrix of the input sentence.
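The bidirectional pass can be sketched with a toy recurrent cell (plain Python; a real LSTM has input/forget/output gates, and the weights and inputs here are arbitrary illustrative numbers):

```python
import math

def rnn(seq, w_in=0.5, w_rec=0.8):
    """A minimal recurrent cell: each step mixes the input with the prior state."""
    h = 0.0
    for x in seq:
        h = math.tanh(w_in * x + w_rec * h)
    return h

# Scalar stand-ins for the embedded tokens of "you can use apply() method".
sentence = [0.1, 0.7, 0.3, 0.9]

# Run the same cell forward and backward, then concatenate the final states.
bi_state = (rnn(sentence), rnn(list(reversed(sentence))))
```

The two directions end in different states because each summarizes the sentence from a different side, which is what gives a Bi-LSTM context on both sides of a candidate API mention.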

Feasibility of Transfer Learning 1. The training data sets are generated from three Python libraries: NumPy, Pandas and Matplotlib. 2. The data share some character-level and sentence-level features, so they have the potential for transfer learning. 3. The convolutional layer captures character-level features and the Bi-LSTM layer learns sentence-level features, so the weights of each layer are separately sharable. 4. The weights of the neural network are loaded and frozen across different training tasks.
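Freezing can be sketched as skipping the gradient update for the shared layer (layer names and numbers below are illustrative placeholders, not the actual model parameters):

```python
# Pretrained per-layer weights loaded from a model trained on another library.
pretrained = {"conv": [0.3, -0.1], "bilstm": [0.7, 0.2], "classifier": [0.0, 0.0]}
frozen = {"conv"}  # reuse the character-level layer across libraries unchanged

def train_step(weights, grads, lr=0.1):
    """One gradient step; frozen layers keep their loaded values."""
    out = {}
    for layer, w in weights.items():
        if layer in frozen:
            out[layer] = list(w)  # no update: weights stay as loaded
        else:
            out[layer] = [wi - lr * gi for wi, gi in zip(w, grads[layer])]
    return out

grads = {"conv": [1.0, 1.0], "bilstm": [0.5, 0.5], "classifier": [1.0, -1.0]}
updated = train_step(pretrained, grads)
```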

Evaluation Methods The F1 score is defined as F1 = 2 · precision · recall / (precision + recall), where precision is the fraction of retrieved results that are true positives, and recall is the fraction of all positive results that are retrieved.
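The definition above in code (the counts are made-up illustrative numbers):

```python
def f1_score(true_positive, false_positive, false_negative):
    """F1 as the harmonic mean of precision and recall, from raw counts."""
    precision = true_positive / (true_positive + false_positive)
    recall = true_positive / (true_positive + false_negative)
    return 2 * precision * recall / (precision + recall)

# 80 correct API mentions found, 20 spurious, 20 missed -> P = R = F1 = 0.8
score = f1_score(80, 20, 20)
```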

Performance of API extraction. The F1 scores of API extraction:

Model             Matplotlib  Numpy  Pandas
Word-based model       75.71  72.81   81.53
Char-based model       75.42  71.27   77.45
Deep model             78.98  76.30   84.60

The overall performance improvement is about 4%.

Results of transfer learning for Numpy data [figure]: F1 score (45–70) vs. training-data size (25%, 50%, 100%), comparing a randomly initialized model against models loading weights trained on Pandas data and on Matplotlib data.

Results of transfer learning for Pandas data [figure]: F1 score (60–85) vs. training-data size (25%, 50%, 100%), comparing a randomly initialized model against models loading weights trained on Numpy data and on Matplotlib data.

Conclusion 1. Our work gets acceptable results on the API-mention linking task. 2. Transfer learning generally improves the model's performance. 3. The transfer-learning method improves neural-network training, and the improvement is more obvious when the training data set becomes smaller. 4. The Bi-LSTM weights have less transfer potential than the lower CNN layer, because they are influenced by the output of the CNN layer.

References:
A. Bacchelli, M. Lanza, and R. Robbes, "Linking e-mails and source code artifacts," in Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering, Volume 1. ACM, 2010, pp. 375–384.
Q. Gao, H. Zhang, J. Wang, Y. Xiong, L. Zhang, and H. Mei, "Fixing recurring crash bugs via analyzing Q&A sites (T)," in 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2015, pp. 307–318.
P. Liang, "Semi-supervised learning for natural language," Ph.D. dissertation, 2005.
J. D. Lafferty, A. McCallum, and F. C. N. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01), 2001, pp. 282–289.
Y. Yao and A. Sun, "Mobile phone name extraction from internet forums: a semi-supervised approach," World Wide Web, pp. 1–23, 2015.