API Linking in Informal Technical Discussion CHENG CHEN (U5969643), SUPERVISED BY ZHENCHANG XING Australian National University / Data61
Background On programming discussion platforms (e.g., Stack Overflow, Twitter), many APIs are mentioned in natural-language contexts, for example: "How to apply pos_tag_sents() to a pandas dataframe efficiently?" "The documentation describes support for the apply() method, but it doesn't accept any arguments."
Challenges 1. Common-word polysemy: e.g., 55.04% of Pandas's APIs have common-word simple names, such as: Series (a class name), apply (a method name). Therefore, in the sentence "I want to apply a function with arguments to a series in Python Pandas. The documentation describes support for the apply method, but it doesn't accept any arguments.", it is hard to recognize whether each "apply" is the general English verb or the name of the Pandas function.
Challenges 2. Sentence-format and sentence-context variations: "I have finally decided to use apply, which I understand is more flexible." "If you run apply on a series, the series is passed as a np.array." "It is being run on each row of a Pandas DataFrame via the apply function." Because of such variations, one cannot simply develop a complete set of regular expressions or an island-grammar checker to recognize API mentions.
Challenges 3. The variety of API mention forms: a) the variety of API forms is caused by different presentation styles, coding styles, abbreviation methods, and contexts; b) some accidental factors, such as misspellings, inconsistent annotations, and spacing, add further variety.
Idea of solutions 1. Some character-level features are commonly shared by many API mentions, e.g. Matplotlib.pyplot.savefig(path = ): case sensitivity, module names, brackets, and a parameter list.
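As a rough illustration of the surface cues listed above, a hand-written feature extractor might look like the sketch below. This is purely illustrative (the function name and exact cues are assumptions); in the presented model, such features are learned by the char-CNN rather than hand-coded.

```python
def char_level_features(token):
    """Illustrative surface cues that many API mentions share.
    In the actual model these are learned by a character-level CNN."""
    return {
        "has_internal_dot": "." in token,                      # module path: matplotlib.pyplot.savefig
        "has_brackets": "(" in token and ")" in token,         # call syntax / parameter list
        "mixed_case": token != token.lower() and token != token.upper(),  # camelCase names
        "has_underscore": "_" in token,                        # snake_case names
    }

print(char_level_features("matplotlib.pyplot.savefig(path=...)"))
print(char_level_features("apply"))  # a bare common word triggers none of these cues
```

No single cue is decisive (e.g. "apply" carries none of them), which is why the model combines character-level and sentence-level evidence.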
Idea of solutions 2. Sentence-context information helps distinguish API mentions from non-API words, such as verbs, common nouns, and software-programming jargon. Some example posts: "It is being run on each row of a Pandas DataFrame via the apply function." "If you run apply on a series, the series is passed as a np.array."
Workflow: Stack Overflow discussion dataset → Tokenizer → Raw data → Manually labelled data → Data selector → DNN model training → Transfer learning
Preparation of the training data 1. The tokenizer is developed based on a Twitter tokenizer, with special rules to keep the API mention structure intact. A general English tokenizer's output: Matplotlib . Pyplot . Imshow ( ) This work's tokenizer output: matplotlib.pyplot.imshow() 2. Manually labelled 1,500 posts (including 3,722 API mentions across over 5,000 sentences).
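The special rule can be approximated with a regular expression that matches a dotted identifier chain (optionally with a call) as a single token before falling back to ordinary word splitting. A minimal sketch, with an assumed pattern and function name, not the authors' implementation:

```python
import re

# First alternative keeps "matplotlib.pyplot.imshow()" as one token;
# then bare calls like "apply()", then plain words, then punctuation.
API_PATTERN = (r"[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)+(?:\([^)]*\))?"
               r"|\w+\(\)|\w+|[^\w\s]")

def tokenize(sentence):
    # The slide's example output is lower-cased; whether the real
    # tokenizer lower-cases is an assumption of this sketch.
    return re.findall(API_PATTERN, sentence.lower())

print(tokenize("You can use Matplotlib.pyplot.imshow() here."))
```

Keeping the full `module.submodule.name()` string as one token lets the downstream model see the dots and brackets as character-level evidence.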
Neural Network Structure: input data → convolutional layer + max-pooling layer (char-level features) and word embedding → Bi-LSTM layer (sentence-level features) → classifier of API mentions
Neural Network Structure 1. The convolutional layer and max-pooling layer are used to capture character-level features. Each word (e.g. "savetxt") is represented as a matrix of character vectors; max pooling keeps the most important value produced by each filter.
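The char-CNN step can be sketched in NumPy as a 1-D convolution over character embeddings followed by max-over-time pooling. All dimensions and the random weights are illustrative stand-ins for learned parameters, and the helper assumes the word is at least as long as the filter width:

```python
import numpy as np

rng = np.random.default_rng(0)

def char_cnn_features(word, char_dim=8, n_filters=4, width=3):
    """Toy char-CNN: embed each character, slide a window of `width`
    characters past a bank of filters, then max-pool over positions."""
    emb = rng.normal(size=(len(word), char_dim))            # (chars, char_dim)
    filt = rng.normal(size=(n_filters, width * char_dim))   # filter bank
    windows = np.stack([emb[i:i + width].ravel()
                        for i in range(len(word) - width + 1)])
    conv = windows @ filt.T                                  # (positions, filters)
    return conv.max(axis=0)  # max-over-time: keep each filter's strongest response

vec = char_cnn_features("savetxt")
print(vec.shape)  # fixed-length feature vector per word: (4,)
```

Max pooling makes the output length independent of the word length, so short names like "apply" and long ones like "pos_tag_sents" map to vectors of the same size.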
Neural Network Structure 2. A bidirectional long short-term memory (Bi-LSTM) layer is used to learn the sentence-level information: the forward pass reads the input sequence ("you can use apply() method") left to right, the backward pass reads it right to left, and the two outputs are concatenated as the abstract matrix of the input sentence.
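The bidirectional idea can be sketched with a simple tanh recurrence standing in for an LSTM cell: run the same recurrence over the sequence and over its reversal, then concatenate the two final states. Weights and dimensions below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def rnn_pass(seq, W, U):
    """One tanh-RNN pass (a simplified stand-in for an LSTM)."""
    h = np.zeros(W.shape[0])
    for x in seq:
        h = np.tanh(W @ h + U @ x)
    return h

d_in, d_h = 5, 3
W = rng.normal(size=(d_h, d_h))
U = rng.normal(size=(d_h, d_in))
# Toy word vectors for e.g. "you can use apply() method ."
sentence = [rng.normal(size=d_in) for _ in range(6)]

forward = rnn_pass(sentence, W, U)           # reads left to right
backward = rnn_pass(sentence[::-1], W, U)    # reads right to left
context = np.concatenate([forward, backward])  # sentence-level feature
print(context.shape)  # (6,)
```

Concatenating both directions gives each position access to context on both sides of a candidate token, which is what disambiguates verb "apply" from method "apply".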
Feasibility of transfer learning 1. The training datasets are generated from three Python libraries: Numpy, Pandas, and Matplotlib. 2. The data share some character-level and sentence-level features, so they have the potential for transfer. 3. The convolutional layer captures character-level features and the Bi-LSTM layer learns sentence-level features; the weights of each layer are separately sharable. 4. The weights of the neural network are loaded and frozen across different training tasks.
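Conceptually, the load-and-freeze step amounts to copying pretrained weights into the new model and excluding the frozen layers from updates. A minimal sketch of that bookkeeping (plain dicts standing in for real network layers; not the authors' code):

```python
# Pretrained weights from one library (e.g. Pandas); values are placeholders.
pretrained = {"cnn": [0.4, -0.2], "bilstm": [0.1, 0.3], "classifier": [0.7]}

def build_transfer_model(pretrained, freeze=("cnn",)):
    """Copy all pretrained weights; report which layers stay trainable.
    Frozen layers keep their loaded weights and receive no updates."""
    model = {name: list(weights) for name, weights in pretrained.items()}
    trainable = [name for name in model if name not in freeze]
    return model, trainable

model, trainable = build_transfer_model(pretrained)
print(trainable)  # only these layers are updated on the new library's data
```

Because the layers are separately sharable, one can freeze only the char-CNN, only the Bi-LSTM, or both, and compare the resulting transfer gains.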
Evaluation Method The F1 score is defined as F1 = 2 · precision · recall / (precision + recall), where precision is the fraction of retrieved results that are true positives, and recall is the fraction of all true positives that are retrieved.
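In code, the metric is computed directly from true-positive, false-positive, and false-negative counts:

```python
def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)  # fraction of predicted API mentions that are correct
    recall = tp / (tp + fn)     # fraction of true API mentions that were retrieved
    return 2 * precision * recall / (precision + recall)

print(f1_score(tp=80, fp=20, fn=20))  # precision = recall = 0.8, so F1 = 0.8
```

Using the harmonic mean penalises models that trade one of the two quantities away, e.g. a tagger that marks every token as an API mention gets perfect recall but a poor F1.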
Performance of API extraction The F1 scores of the API extraction results:

Model              Matplotlib   Numpy   Pandas
Word-based model        75.71   72.81    81.53
Char-based model        75.42   71.27    77.45
Deep model              78.98   76.30    84.60

The deep model's overall performance improvement is about 4%.
[Figure] Results of transfer learning on the Numpy data: F1 score (y-axis, 45 to 70) versus training data size (25%, 50%, 100%) for a randomly initialized model, a model loaded with weights trained on Pandas data, and a model loaded with weights trained on Matplotlib data.
[Figure] Results of transfer learning on the Pandas data: F1 score (y-axis, 60 to 85) versus training data size (25%, 50%, 100%) for a randomly initialized model, a model loaded with weights trained on Numpy data, and a model loaded with weights trained on Matplotlib data.
Conclusion 1. Our work achieves acceptable results on the API mention linking task. 2. Transfer learning generally improves the model's performance. 3. Transfer learning improves neural-network training, and the improvement is more obvious when the training dataset becomes smaller. 4. The Bi-LSTM weights have less transfer potential than the lower CNN layers, because they are influenced by the output of the CNN layers.