An intelligent Q&A system based on the LDA topic model for the teaching of Database Principles

World Transactions on Engineering and Technology Education, Vol.12, No.1, 2014 © 2014 WIETE

Lin Cui & Caiyin Wang
Suzhou University, Suzhou, Anhui, People's Republic of China

ABSTRACT: With the development of computer networks, network education has received more and more attention. In the network environment, teachers have limited time and cannot answer in a timely manner all the questions that students ask. Therefore, an intelligent Q&A (questions and answers) system based on the LDA (latent Dirichlet allocation) topic model was developed, and is discussed in this article. To address this problem, the research considered the knowledge points and the characteristics of the FAQs (frequently asked questions) for the course Database Principles. The intelligent Q&A system is based on the LDA model and on a topics-documentation-knowledge points structure. Users describe their problems in natural language and submit them to the system, which returns accurate answers related to the topic. Students opined that the Q&A system performed very well and could answer all the questions they posed about Database Principles.

INTRODUCTION

Network teaching frees people from the limitations of time and space since they can receive education whenever and wherever they like. But there are problems in the network teaching environment. For example, teachers cannot answer all the questions students ask because of time limitations, and this saps students' enthusiasm. Furthermore, teachers have to answer repeated questions [1]. The methods commonly used to solve these problems are BBS (bulletin board system), on-line FAQ (frequently asked questions), e-mail and message boards [2]. The BBS, e-mail and message boards do not produce an immediate response [3].
Traditional search engines, such as Google and Baidu, have the defect that the results they return are not direct answers to the question [4]. By analysing the frequently asked questions in the course Database Principles, an intelligent Q&A system based on the LDA (latent Dirichlet allocation) topic model was developed. The analysis of the FAQs showed that each question belongs to a specific type, and a method was proposed to determine a candidate question set, which greatly improves efficiency and accuracy.

RELATED WORK

As early as the 1960s, when research on artificial intelligence had just started, scholars proposed that computers should answer questions using natural language, which could be regarded as the rudimentary question answering system [5]. Q&A systems were fashionable for a time in the field of natural language processing in the 1980s [6]. However, with the rise of large-scale text processing technology, research on question answering systems died away. In recent years, with the rapid development of network and information technology, people's desire to obtain information faster has again promoted the development of question answering systems. Many companies and research institutes are involved in this development, including Microsoft, IBM and MIT (Massachusetts Institute of Technology) [7]. In 1999, TREC (Text Retrieval Conference) introduced an automatic question answering track. Since then, the Q&A track has gradually become one of the most popular TREC tracks [8]. Many countries have developed a number of relatively mature Q&A systems. Among the openly available Q&A systems are the START system (http://start.csail.mit.edu/), developed by the InfoLab Group of the MIT Computer Science and Artificial Intelligence Laboratory, and AnswerBus (http://www.answerbus.com) [9]. However, Q&A systems designed to answer questions for a specific course are very rare. Accordingly, an intelligent Q&A system that returns answers to users' questions on the course Database Principles was developed.

RESEARCH FOUNDATION

Knowledge Points

The Q&A system for a specific course aims to answer questions posed by users. A user's question usually contains detailed information about the course: a specific piece of information, several related pieces of information, or one aspect of a knowledge point. Knowledge points are the basic unit for organising and transmitting knowledge, and include words, sentences, concepts, definitions, theorems, formulas, rules and laws, etc. [10]. In teaching, knowledge is generally organised in the form of knowledge points. In this article, a knowledge tree is organised according to the structure of chapter, section and knowledge points, as shown in Figure 1.

Figure 1: The tree structure of knowledge points (Course at the root, then Chapters, Sections and Knowledge points).

LDA Model

The LDA model was proposed by Blei et al. in 2003 [11]. It has been applied successfully to text classification, information retrieval and many other related fields [12]. The LDA model is shown in Figure 2.

Figure 2: Representation of the LDA probability model.

The LDA model has a three-layer structure, i.e. words, topics and documents. Given a document collection, LDA represents each document as a mixture of topics, and each topic is a multinomial distribution over words that captures the relationships between words. The topics are shared by all the documents, while each document has its own topic proportions. LDA is determined by the parameters (α, β): α reflects the relative strength of the hidden topics in the document set and β represents the word probability distributions of all the latent topics. θ represents the proportion of each latent topic in a document, z is the topic assigned to each word of the document, and w is the word vector of the document.
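The roles of α, β, θ, z and w can be made concrete with a minimal Python sketch of the standard LDA generative process from [11]. It is an illustration only, not the authors' implementation (which was written in Java); the toy vocabulary, the two topics and all numeric values are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(alpha, beta, vocab, n_words, rng):
    """Generate one document following the LDA generative story.

    alpha: Dirichlet prior over topics (one positive value per topic).
    beta:  per-topic word distributions, beta[k][v] = p(word v | topic k).
    """
    theta = sample_dirichlet(alpha, rng)  # topic proportions of this document
    topics, words = [], []
    for _ in range(n_words):
        z = rng.choices(range(len(alpha)), weights=theta)[0]  # topic of this word
        w = rng.choices(vocab, weights=beta[z])[0]            # word drawn from topic z
        topics.append(z)
        words.append(w)
    return theta, topics, words

# Hypothetical toy setup: 2 topics over a 4-word vocabulary.
VOCAB = ["transaction", "rollback", "index", "query"]
BETA = [
    [0.5, 0.5, 0.0, 0.0],  # topic 0 only emits "transaction" and "rollback"
    [0.0, 0.0, 0.5, 0.5],  # topic 1 only emits "index" and "query"
]
ALPHA = [0.5, 0.5]

theta, topics, words = generate_document(ALPHA, BETA, VOCAB, 8, random.Random(0))
```

Training an LDA model inverts this process: given only the observed words w, it estimates θ for every document and the topic-word distributions β.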
In the model, N is the total number of documents in the document set and N_d is the total number of words in document d [13].

OVERALL ARCHITECTURE OF THE QUESTION ANSWERING SYSTEM

The proposed intelligent Q&A system for the course Database Principles is illustrated in Figure 3.

Figure 3: Overall framework for the system. (Flowchart: the user's question passes through word segmentation, question classification, keyword extraction and keyword expansion; the FAQ library is pretreated into a question-feature words table; the keywords and the question type determine the candidate question set; question similarity is calculated; if the similarity exceeds the threshold, the answer to the question is returned, otherwise the FAQ library is expanded.)

As shown in Figure 3, the system includes five major functional modules: understanding the question, FAQ pretreatment, determining the candidate question set, question similarity calculation and FAQ library expansion. Each module is introduced below.

The question understanding module comprises word segmentation, question classification, keyword extraction, keyword expansion and other processes. Its main function is to analyse the input question.

The candidate question set module narrows the search space according to the question type: the candidate question set is screened from the FAQ library, which makes searching for questions easier.

The FAQ pretreatment module segments the frequently asked questions into words to obtain the feature words contained in each question. This avoids reprocessing the questions at every similarity calculation, which improves the efficiency of the system.

The question similarity calculation module computes the similarity between the question input by the user and each question in the candidate question set, and returns the answer of the most similar question to the user.

The FAQ library expansion module adds questions that the system cannot answer to the FAQ library. The system uses the maximum similarity value calculated for a user's question to determine whether the question is already present in the FAQ library, by judging whether that value is larger than a preset threshold.
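The way the five modules fit together can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' Java implementation: whitespace splitting stands in for word segmentation, Jaccard overlap stands in for the LDA-based similarity, the classifier covers only two question types, and all names, the threshold value and the sample FAQ entries are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class FaqEntry:
    question: str
    answer: str
    qtype: str
    features: frozenset  # feature words precomputed during FAQ pretreatment

STOPWORDS = {"what", "is", "the", "a", "an", "of", "and", "between"}

def extract_keywords(question):
    """Stand-in for word segmentation plus keyword extraction."""
    tokens = question.lower().replace("?", " ").split()
    return frozenset(t for t in tokens if t not in STOPWORDS)

def classify(question):
    """Placeholder classifier covering only two of the six question types."""
    return "relation" if "difference" in question.lower() else "definition"

def jaccard(a, b):
    """Placeholder similarity; the paper uses an LDA-based score instead."""
    return len(a & b) / len(a | b) if a | b else 0.0

def make_entry(question, answer_text):
    """FAQ pretreatment: compute type and feature words once, up front."""
    return FaqEntry(question, answer_text, classify(question), extract_keywords(question))

def answer(question, faq, threshold=0.5):
    keywords = extract_keywords(question)              # understanding the question
    qtype = classify(question)
    candidates = [e for e in faq if e.qtype == qtype]  # candidate question set
    best = max(candidates, key=lambda e: jaccard(keywords, e.features), default=None)
    if best is not None and jaccard(keywords, best.features) > threshold:
        return best.answer
    # FAQ library expansion: store the unanswered question (answer left empty here).
    faq.append(FaqEntry(question, "", qtype, keywords))
    return None

faq = [
    make_entry("What is a transaction?", "A transaction is a logical unit of work."),
    make_entry("What is the difference between stored procedure and trigger?",
               "A trigger fires automatically on a table event; a stored procedure is called explicitly."),
]
```

Calling `answer("What is transaction?", faq)` returns the stored answer, while a question with no sufficiently similar candidate returns `None` and is appended to `faq`.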
If the maximum similarity value is less than the threshold, the question input by the user does not exist in the FAQ library. The question is then added to the FAQ library, which further improves the library [14].

RESEARCH ABOUT CORE ISSUES

The Domain Knowledge Library

The domain knowledge library can be implemented by constructing a knowledge points table, a domain feature table, a feature words similarity table and a question type table. In this article, the concepts of database principles are described by using domain feature vocabulary. The relationships between concepts are represented through the feature words similarity table. The domain feature table stores the special vocabulary of database principles, which is used to describe the course knowledge. It is the data basis for word segmentation and also for the question similarity calculation. The structure of question types was also defined: questions are divided into six types, viz. the definition type, reason type, list type, relation type, method type and other type. The question types are shown in Table 1.

Table 1: Question type table.

Typeid  Type
1       Definition type
2       Reason type
3       List type
4       Relation type
5       Method type
6       Other type

Question Similarity Calculation

Whether the question similarity calculation is accurate or not directly determines the performance of the question answering system. The LDA model regards the questions and answers in the system as containing multiple topics, and these topics have different weightings. The topic retrieval model includes implicit topic information. The specific calculation is as follows [15]:

$$P(C \mid B) = \lambda \prod_{w \in C} p(w \mid B) + (1 - \lambda) \prod_{w \in C} \sum_{t \in t_B} p(w \mid t)\, p(t \mid B) \tag{1}$$

where $t_B$ is the topic set of question B, t is one of the topics in $t_B$, and $\lambda$ is a parameter. $p(w \mid t)$ is the probability that word w occurs in topic t, and $p(t \mid B)$ is the probability that topic t occurs in question B. $P(C \mid B)$ is calculated for all the answers in the Q&A system to construct the topic retrieval model. If the probability $P(C \mid B)$ is less than a threshold, the answer is implicit; otherwise the answer is explicit.

SYSTEM IMPLEMENTATION

The intelligent Q&A system described in this article is based on the LDA model and was developed using Java and JSP in Eclipse, with MySQL as the database and Tomcat as the server. The interface of the system is easy to use, although the background program is quite complex.
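As one illustration of that background logic, the topic retrieval score of Equation (1) can be sketched in a few lines of Python (the system itself was written in Java). The sketch reads the aggregation over the words of C as a product, which is an assumption, and every function name, probability value and threshold below is hypothetical.

```python
import math

def topic_retrieval_score(words, p_w_given_b, p_w_given_t, p_t_given_b, lam=0.5):
    """Score words C against question B following Equation (1).

    Assumed reading: lam * prod_w p(w|B) + (1 - lam) * prod_w sum_t p(w|t) * p(t|B).
    """
    direct = math.prod(p_w_given_b.get(w, 0.0) for w in words)
    via_topics = math.prod(
        sum(p_w_given_t[t].get(w, 0.0) * p_t for t, p_t in p_t_given_b.items())
        for w in words
    )
    return lam * direct + (1.0 - lam) * via_topics

def answer_kind(score, threshold):
    """Below the threshold the answer is implicit, otherwise explicit."""
    return "implicit" if score < threshold else "explicit"

# Hypothetical toy values: "rollback" never occurs in B itself,
# but B's topic "tx" makes it likely.
WORDS = ["transaction", "rollback"]
P_W_GIVEN_B = {"transaction": 0.5, "rollback": 0.0}
P_T_GIVEN_B = {"tx": 0.5, "idx": 0.5}
P_W_GIVEN_T = {
    "tx": {"transaction": 0.5, "rollback": 0.5},
    "idx": {"index": 1.0},
}

score = topic_retrieval_score(WORDS, P_W_GIVEN_B, P_W_GIVEN_T, P_T_GIVEN_B, lam=0.5)
```

With these values the direct term is zero because "rollback" does not occur in B, yet the topic term still contributes, which is how the implicit topic information enters the score.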
Users do not need to know the background program in detail, so the system remains easy to use. The operation of the system is illustrated in Figures 4-7. For the definition type question "What is transaction?" (Figure 4), the system responds as shown in Figure 5.

Figure 4: Definition question.

Figure 5: Reply to the definition question.

For the relation type question "What is the difference between stored procedure and trigger?" (Figure 6), the system responds as shown in Figure 7.
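The two example questions above belong to different question types from Table 1. The paper does not describe how a question is assigned to its type; the Python sketch below assumes a simple cue-phrase rule purely for illustration. The cue phrases and their order are hypothetical, and only the six type names come from Table 1.

```python
QUESTION_TYPES = {
    1: "Definition type",
    2: "Reason type",
    3: "List type",
    4: "Relation type",
    5: "Method type",
    6: "Other type",
}

# Hypothetical cue phrases, checked in order; the first match wins.
CUES = [
    (4, ("difference between", "relationship between", "compared with")),
    (2, ("why", "reason")),
    (3, ("which", "list", "what are")),
    (5, ("how to", "how do", "how can")),
    (1, ("what is", "define", "meaning of")),
]

def classify_question(question):
    """Return the Typeid from Table 1 for a question, or 6 if no cue matches."""
    text = question.lower()
    for type_id, cues in CUES:
        if any(cue in text for cue in cues):
            return type_id
    return 6
```

Relation cues are checked before definition cues because a relation question such as the one in Figure 6 also begins with "What is". Substring matching is crude, so a real classifier would work on segmented words rather than raw text.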

Figure 6: Relation question.

Figure 7: Reply to the relation question.

CONCLUSIONS

With Database Principles as the study object and by using the LDA model, an intelligent Q&A system was designed and implemented. This intelligent Q&A system can reduce the workload of teachers and better meet the personalised needs of students. In the network environment, students can ask questions easily and answers can be rapidly produced, improving the students' learning and the quality of the teaching.

ACKNOWLEDGEMENTS

This work was supported by the ordinary project of Anhui Province Colleges and Universities Natural Science Foundation of China (No.KJ2013B283, No.KJ2012Z401); the open project of Intelligent Information Processing Laboratory at Suzhou University of China (No.2013YKF14); and the project of Anhui Province Higher Education Revitalisation Plan of China (No.2013zytz074).

REFERENCES

1. Guzdial, M., Education: teaching computing to everyone. J. of Communications of the ACM, 52, 5, 31-33 (2009).
2. Lugmayr, A., Applying design thinking as a method for teaching in media education. Proc. 15th Inter. Academic MindTrek Conf.: Envisioning Future Media Environments, 332-334 (2011).
3. Hardy, N., Pinto, M. and Wei, H., The impact of collaborative technology in IT and computer science education: harnessing the power of Web 2.0. Proc. 9th ACM SIGITE Conf. on Infor. Technol. Educ., Cincinnati, 63-64 (2008).
4. Eastman, C.M. and Jansen, B.J., Coverage, relevance, and ranking: the impact of query operators on Web search engine results. J. of ACM Trans. on Infor. Systems, 21, 4, 383-411 (2003).
5. Hendrix, G.G., Sacerdoti, E.D., Sagalowicz, D. and Slocum, J., Developing a natural language interface to complex data. ACM Trans. Database Systems, 3, 2, 105-147 (1978).
6. Lin, J.J. and Katz, B., Question answering from the Web using knowledge annotation and knowledge mining techniques. Proc. 12th Inter. Conf. on Infor. and Knowledge Manage., 116-123 (2003).
7. Jeongwoo, K., Eric, N. and Luo, S., A probabilistic graphical model for joint answer ranking in question answering. Proc. 30th Annual Inter. ACM SIGIR Conf., 343-350 (2007).
8. Jijkoun, V. and Rijke, M.D., Retrieving answers from frequently asked questions pages on the Web. Proc. ACM Conf. on Infor. and Knowledge Manage., 76-83 (2005).
9. Agichtein, E., Lawrence, S. and Gravano, L., Learning to find answers to questions on the Web. ACM Trans. on Internet Technol., 4, 2, 129-162 (2004).
10. Hijikata, Y., Takenaka, T. and Kusumura, Y., Interactive knowledge externalization and combination for SECI model. Proc. 4th Inter. Conf. on Knowledge Capture, 151-158 (2007).
11. Blei, D.M., Ng, A.Y. and Jordan, M.I., Latent Dirichlet allocation. J. of Machine Learning Research, 3, 993-1022 (2003).
12. Wang, D., Thint, M. and Al-Rubaie, A., Semi-supervised latent Dirichlet allocation and its application for document classification. Proc. IEEE/WIC/ACM Inter. Joint Conferences on Web Intelligence and Intelligent Agent Technol., 306-310 (2012).
13. Bhardwaj, A., Reddy, M. and Setlur, S., Latent Dirichlet allocation based writer identification in offline handwriting. Proc. 9th IAPR Inter. Workshop on Document Analysis Systems, 357-362 (2010).
14. Jijkoun, V. and Rijke, M., Retrieving answers from frequently asked questions pages on the Web. Proc. 14th ACM Inter. Conf. on Infor. and Knowledge Manage., 76-83 (2005).
15. Krestel, R., Fankhauser, P. and Nejdl, W., Latent Dirichlet allocation for tag recommendation. Proc. 3rd ACM Conf. on Recommender Systems, 61-68 (2009).