API Linking in Informal Technical Discussion. Australian National University / Data61


1 API Linking in Informal Technical Discussion
Cheng Chen (U ), supervised by Zhenchang Xing
Australian National University / Data61

2 Background
On programming discussion platforms (e.g., Stack Overflow, Twitter), many APIs are mentioned in natural-language contexts, for example:
How to apply pos_tag_sents() to pandas dataframe efficiently?
The documentation describes support for apply() method, but it doesn't accept any arguments

3 Challenges
1. Common-word polysemy
E.g., % of Pandas's APIs have a common-word simple name, such as:
Series: a class name
apply: a method name
Therefore, in these sentences:
I want to apply a function with arguments to series in python pandas. The documentation describes support for apply method, but it doesn't accept any arguments.
it is hard to recognize which "apply" is the general verb and which is the function name of Pandas.
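The polysemy problem can be illustrated with a minimal sketch of naive dictionary matching. The two-entry dictionary and the function name `naive_dictionary_matches` are hypothetical, chosen only for this illustration; they are not part of the presented system.

```python
import re

# Hypothetical mini-dictionary of Pandas simple names that are also common words.
API_SIMPLE_NAMES = {"apply", "series"}

def naive_dictionary_matches(sentence):
    """Return every word whose lowercase form appears in the API name dictionary."""
    words = re.findall(r"[A-Za-z_]+", sentence)
    return [word for word in words if word.lower() in API_SIMPLE_NAMES]

sentence = (
    "I want to apply a function with arguments to series in python pandas. "
    "The documentation describes support for apply method."
)
# The verb "apply" and the method name "apply" are both returned,
# so dictionary lookup alone cannot tell them apart.
print(naive_dictionary_matches(sentence))  # ['apply', 'series', 'apply']
```

The first "apply" is a plain English verb, yet the lookup treats it exactly like the second one, which names the method.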

4 Challenges
2. Sentence-format variations and sentence-context variations
I have finally decided to use apply which I understand is more flexible.
if you run apply on a series the series is passed as a np.array
It is being run on each row of a Pandas DataFrame via the apply function
One cannot simply develop a complete set of regular expressions or an island grammar checker to recognize API mentions.
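To see why a fixed pattern set falls short, here is a small sketch that runs one plausible regular expression over the three example posts. The pattern `CODE_LIKE` is an assumption made for this illustration (it accepts names followed by `()` and dotted names); it is not taken from the presented work.

```python
import re

# Assumed "code-like" pattern: a name followed by (), or a dotted name.
CODE_LIKE = re.compile(
    r"\b[A-Za-z_][\w.]*\(\)"                # e.g. apply()
    r"|\b[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)+"   # e.g. np.array
)

posts = [
    "I have finally decided to use apply which I understand is more flexible.",
    "if you run apply on a series the series is passed as a np.array",
    "It is being run on each row of a Pandas DataFrame via the apply function",
]

for post in posts:
    # The bare mention "apply" is missed in all three posts.
    print(CODE_LIKE.findall(post))
```

Only `np.array` is found; every bare `apply` mention is missed. Loosening the pattern to accept bare words would bring back the polysemy problem from the previous slide.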

5 Challenges
3. The variety of API mention forms
a) The variety of API forms is caused by different presentation styles, coding styles, abbreviation methods and contexts.
b) Some variation is accidental, such as misspellings, inconsistent annotations and spacing.

6 Idea of solutions
1. Some character-level features are commonly shared by many API mentions, for example in Matplotlib.pyplot.savefig(path = ):
Case sensitivity
Module names
Brackets
Parameter list
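The four cues listed on this slide can be made concrete with a small hand-written sketch. Note the assumption: the slides describe a network that learns character-level features, so the explicit rules and the function name `char_level_cues` below are only a hypothetical illustration of what such features look like.

```python
import re

def char_level_cues(token):
    """Hand-written versions of the four character-level cues on the slide."""
    return {
        # Case sensitivity: API names often contain capitals (DataFrame, Series).
        "has_uppercase": any(ch.isupper() for ch in token),
        # Module names: a dot inside the token (not a trailing full stop).
        "has_module_path": "." in token.strip("."),
        # Brackets: an opening and a closing parenthesis.
        "has_brackets": "(" in token and ")" in token,
        # Parameter list: something word-like between the parentheses.
        "has_parameter_list": re.search(r"\([^()]*\w[^()]*\)", token) is not None,
    }

print(char_level_cues("Matplotlib.pyplot.savefig(path=)"))
print(char_level_cues("apply"))
```

The fully qualified mention triggers all four cues, while the bare word `apply` triggers none, which is why character-level features alone are not enough and sentence context is also needed.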

7 Idea of solutions
2. Sentence context information helps to distinguish API mentions from non-API words, such as verbs, some nouns, and software programming jargon.
Some example posts:
It is being run on each row of a Pandas DataFrame via the apply function
if you run apply on a series the series is passed as a np.array

8 Workflow
[Diagram components: Stack Overflow discussion dataset; Tokenizer; Raw data; Manually labelled data; Data Selector; DNN model training; Transfer learning]

9 Preparation of the training data
1. The tokenizer is based on a Twitter tokenizer, with special rules that keep the structure of API mentions intact.
General English tokenizer's output: Matplotlib. Pyplot. Imshow ( )
This work's tokenizer output: matplotlib.pyplot.imshow()
2. 1500 posts were manually labelled (including 3722 API mentions, over 5000 sentences).
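A tokenizer that keeps API mentions in one piece might look like the sketch below. The regular expression and the name `tokenize` are assumptions made for illustration; the actual rules added to the Twitter tokenizer are not given in the slides.

```python
import re

TOKEN_PATTERN = re.compile(
    r"""
      [A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*(?:\([^()\s]*\))?  # dotted name, optional ( ... )
    | \d+(?:\.\d+)?                                     # integer or decimal number
    | [^\w\s]                                           # any single punctuation mark
    """,
    re.VERBOSE,
)

def tokenize(text):
    """Split text into tokens without breaking dotted names or call parentheses."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Use matplotlib.pyplot.imshow() to show it."))
# ['Use', 'matplotlib.pyplot.imshow()', 'to', 'show', 'it', '.']
```

A dot only joins two tokens when an identifier follows it directly, so the sentence-final full stop is still split off while `matplotlib.pyplot.imshow()` stays whole.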

10 Neural Network Structure
[Diagram components: input data; word embedding; convolutional layer and max pooling layer (char-level feature); Bi-LSTM layer (sentence-level feature); classifier of API mention]

11 Neural Network Structure
1. The convolutional layer and the max pooling layer are used to capture character-level features.
The matrix represents the word, built from its character vectors (example: S a v e t x t).
Max pooling keeps the most important value.
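The convolution and max pooling step can be sketched in plain Python. Everything below is a toy under stated assumptions: the two-dimensional character vectors, the single hand-set filter and the function name `char_cnn_features` are hypothetical, whereas in the presented model both the character vectors and the filters would be learned.

```python
def char_cnn_features(word, char_vectors, filters, width):
    """Return one value per filter: the maximum response over all character windows."""
    dim = len(next(iter(char_vectors.values())))
    zero = [0.0] * dim
    # The word as a matrix: one row (character vector) per character.
    matrix = [char_vectors.get(ch, zero) for ch in word]
    # Pad both ends so that short words still produce at least one window.
    padding = [zero] * (width - 1)
    matrix = padding + matrix + padding

    features = []
    for filt in filters:  # each filter is a width x dim matrix
        responses = []
        for start in range(len(matrix) - width + 1):
            window = matrix[start:start + width]
            responses.append(sum(
                w * x
                for filt_row, char_row in zip(filt, window)
                for w, x in zip(filt_row, char_row)
            ))
        # Max pooling: keep only the strongest response of this filter.
        features.append(max(responses))
    return features

# Toy setup: the filter responds most strongly to "a" immediately followed by "b".
char_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
ab_filter = [[1.0, 0.0], [0.0, 1.0]]

print(char_cnn_features("xab", char_vectors, [ab_filter], width=2))  # [2.0]
print(char_cnn_features("ba", char_vectors, [ab_filter], width=2))   # [1.0]
```

Because only the maximum is kept, the feature vector has a fixed length (one value per filter) regardless of how long the word is.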

12 Neural Network Structure
2. A bidirectional long short-term memory (Bi-LSTM) layer is used to learn sentence-level information.
Forward input sequence: you can use apply() method
Backward input sequence: the same sentence (you can use apply() method), processed in reverse
The two outputs are concatenated to form the abstract matrix of the input sentence.
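The forward pass, backward pass and concatenation can be sketched with a tiny LSTM written in plain Python. This is a toy under stated assumptions: the weights are random and untrained, the sizes (4-dimensional inputs, 3 hidden units) are arbitrary, and all function names are hypothetical rather than taken from the presented system.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_lstm_params(input_size, hidden_size, rng):
    """Random weights and zero biases for the four gates: input, forget, candidate, output."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {gate: (matrix(hidden_size, input_size + hidden_size), [0.0] * hidden_size)
            for gate in ("i", "f", "g", "o")}

def lstm_step(params, x, h, c):
    """One LSTM time step on input x with previous hidden state h and cell state c."""
    z = x + h  # concatenate the input vector and the previous hidden state
    def affine(gate):
        weights, bias = params[gate]
        return [sum(w * v for w, v in zip(row, z)) + b for row, b in zip(weights, bias)]
    i = [sigmoid(v) for v in affine("i")]
    f = [sigmoid(v) for v in affine("f")]
    g = [math.tanh(v) for v in affine("g")]
    o = [sigmoid(v) for v in affine("o")]
    c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
    h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
    return h_new, c_new

def run_lstm(params, inputs, hidden_size):
    """Run the LSTM over a sequence and return the hidden state at every position."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    outputs = []
    for x in inputs:
        h, c = lstm_step(params, x, h, c)
        outputs.append(h)
    return outputs

def bilstm(fwd_params, bwd_params, inputs, hidden_size):
    """Concatenate, per token, the forward state and the re-aligned backward state."""
    forward = run_lstm(fwd_params, inputs, hidden_size)
    backward = run_lstm(bwd_params, inputs[::-1], hidden_size)[::-1]
    return [f + b for f, b in zip(forward, backward)]

rng = random.Random(0)
tokens = "you can use apply() method".split()
embeddings = {tok: [rng.uniform(-1, 1) for _ in range(4)] for tok in tokens}
inputs = [embeddings[tok] for tok in tokens]
fwd = make_lstm_params(4, 3, rng)
bwd = make_lstm_params(4, 3, rng)

sentence_matrix = bilstm(fwd, bwd, inputs, 3)
print(len(sentence_matrix), len(sentence_matrix[0]))  # 5 6
```

Each of the 5 tokens ends up with 6 values: 3 summarizing the words to its left and 3 summarizing the words to its right, so the row for `apply()` reflects both "use" and "method".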

13 Feasibility of transfer learning
1. The training data sets are generated from the Python libraries Numpy, Pandas and Matplotlib.
2. The data share some character-level and sentence-level features, and so have potential for transfer learning.
3. The convolutional layer captures character-level features and the Bi-LSTM layer learns sentence-level features, so the weights of each layer can be shared separately.
4. The weights of the neural network are loaded and frozen across different training tasks.
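Points 3 and 4 can be sketched without any deep learning framework. In this hypothetical illustration a model is just a dictionary from layer name to a list of weights; the layer names, the numbers and the helper functions `load_pretrained` and `sgd_step` are all assumptions, not the presented implementation.

```python
import copy

def load_pretrained(target_model, source_model, layers_to_transfer):
    """Copy the chosen layers from a trained model and report them as frozen."""
    model = copy.deepcopy(target_model)
    for name in layers_to_transfer:
        model[name] = list(source_model[name])
    return model, set(layers_to_transfer)

def sgd_step(model, gradients, frozen, learning_rate=0.1):
    """One gradient step that leaves frozen layers untouched."""
    updated = {}
    for name, weights in model.items():
        if name in frozen:
            updated[name] = list(weights)
        else:
            updated[name] = [w - learning_rate * g
                             for w, g in zip(weights, gradients[name])]
    return updated

# Hypothetical weights of a model already trained on one library's data.
source = {"char_cnn": [0.5, -0.2], "bilstm": [0.1, 0.3], "classifier": [0.7]}
# Fresh model for another library (zeros stand in for random initialization).
target = {"char_cnn": [0.0, 0.0], "bilstm": [0.0, 0.0], "classifier": [0.0]}

# Transfer only the character-level layer; the other layers train from scratch.
model, frozen = load_pretrained(target, source, ["char_cnn"])
gradients = {"char_cnn": [1.0, 1.0], "bilstm": [1.0, 1.0], "classifier": [1.0]}
model = sgd_step(model, gradients, frozen)
print(model)
```

Because each layer is stored under its own name, the CNN weights and the Bi-LSTM weights can be transferred independently, which is what makes it possible to compare their transfer potential.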

14 Evaluation methods
The F1 score is defined as: F1 = 2 * precision * recall / (precision + recall)
precision: the percentage of retrieved results that are truly positive
recall: the percentage of all positive results that are retrieved
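A short worked example of these three measures follows. The gold and predicted mentions are invented for illustration (each mention is a hypothetical pair of token position and text); they are not results from the presented experiments.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F1 for a set of predicted API mentions."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented example: 4 true mentions, 3 predictions, 2 of them correct.
gold = {(0, "apply"), (5, "np.array"), (9, "DataFrame"), (12, "savefig")}
predicted = {(0, "apply"), (5, "np.array"), (7, "series")}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.5 0.571
```

Here precision is 2/3, recall is 2/4, and F1 is their harmonic mean, 4/7.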

15 Performance of API extraction
[Table: F1 score of the API extraction results. Columns: Matplotlib, Numpy, Pandas. Rows: word-based model, char-based model, deep model]
The general performance improvement is 4%.

16 Results of transfer learning for Numpy data
[Figure: F1 score versus training data size (x-axis ticks include 50% and 100%) for three settings: randomly initialized; loaded from the model trained on Pandas data; loaded from the model trained on Matplotlib data]

17 Results of transfer learning for Pandas data
[Figure: F1 score versus training data size (x-axis ticks include 50% and 100%) for three settings: randomly initialized; loaded from the model trained on Numpy data; loaded from the model trained on Matplotlib data]

18 Conclusion
1. Our work achieves acceptable results on API mention linking tasks.
2. Transfer learning generally improves the model's performance.
3. Transfer learning improves neural network training, and the improvement is more pronounced when the training data set is smaller.
4. The Bi-LSTM weights have less transfer potential than those of the lower CNN layer, because the Bi-LSTM layer is influenced by the output of the CNN layer.

