Discovering Characteristics of Aberrant Driving Behavior

LOUKAS TSIRONIS, Lecturer, Department of Production and Management Engineering, Democritus University of Thrace, Xanthi 67100, Greece, http://www.duth.gr/
VASSILIS MOUSTAKIS, Associate Professor
HARRY MAVROPOULOS
EMMANUEL MARAVELAKIS, Lecturer, Department of Natural Resources Engineering, Technological Educational Institute of Crete, 73133 Chania, http://www.chania.teicrete.gr
NICHOLAS BILALIS, Associate Professor

Abstract: Recent studies have shown that unsafe driver acts can be classified into two distinct categories (errors and violations), which call for different measures for reducing road traffic accidents [1],[2]. A survey of over 1400 drivers in Greece is reported in which a variety of aberrant driving behaviors were identified. Factor analysis was performed on the collected data and seven groups of violations were found. Further statistical analysis showed correlations between those groups and accident liability. The data mining software SEE5 was then applied to reveal the tendencies of Greek drivers and to describe dangerous drivers. The algorithm traced the violations that are responsible for risky driving acts and brought out useful but previously hidden information.

Key-Words: driver behavior, violations, errors, data mining, SEE5

1 The Method

A two-section questionnaire was distributed in the main cities of Greece. The first section contained general items such as the driver's age, gender and marital status. The second section consisted of 112 items based on the Driver Behavior Questionnaire [3] and the extensions to it introduced in a similar Swedish study [4]. Participants were asked to indicate on a six-point scale (never=1, very seldom=2, rather seldom=3, sometimes=4, often=5, very often=6) how often they committed the behaviour described in each item. More than 1450 questionnaires were completed and collected for further analysis.

The analysis had two stages. First, a factor analysis was performed to identify the main groups of violations. Then a machine learning approach using the SEE5 tool was applied to discover interesting patterns and trends and to bring out the hidden information contained in our database.

2 Factor Analysis of the Questionnaire Items

The 112 questionnaire items were submitted to a principal components analysis using oblimin rotation to allow for correlations among factors [5]. The scree plot suggested a seven-factor solution. The seven factors found by the analysis are as follows:

1. Mistakes
2. Highway Code Violations
3. Low Alertness
4. Aggressive Violations
5. Inexperience
6. Lack of Consideration
7. Parking Violations

3 Predictors of accident involvement

Hierarchical multiple regression analysis was used to predict accident rates, with age, gender, mileage and the seven classes of behavior as independent variables. The variables that independently and significantly predicted accident involvement were mileage, gender, age and Highway Code Violations (HCV). The HCV group consists of the following violations:

1. Exceed speed limit during low traffic (Sign5)
2. Disregard speed limit to follow traffic (Sign6)
3. Forget the speed limit (Sign7)
4. Deliberately exceed speed limit when overtaking (Take15)
5. Crossing solid line when changing lane (Take8)
6. Drive at the speed other drivers do (Guy3)
7. Accelerate at a green / yellow phase (Lit1)
8. Cross on lights that have just turned red (Lit2)
9. Disregard red lights at night (Lit7)

The names in brackets are the code names used for each violation.

4 Data Mining Concepts

Data mining is the discovery of interesting, yet hidden, knowledge in very large databases [6]. Corporate databases often contain unknown trends, patterns and relationships among objects (e.g. clients and products) that are of strategic importance to the organization. This knowledge cannot be discovered easily with conventional query tools or statistical packages, because they either lack support for handling very large data sets or expect the user to have some idea of the form of the hidden relationships from the beginning of the search process.

Data mining tools in general apply algorithms to large amounts of data in such a way that the data reveal hidden patterns and relationships and uncover correlations that were previously invisible to workers and the business [7]. Data mining tools help the enterprise understand customer behavior, predict events and expose the linkages between events and trends. It is important to realize that data mining is not so much a new technique as a new way to deal with information. A data mining environment can be realized on many different levels using several different techniques. The basic steps of a data mining project are shown in the following diagram:

Figure 1: The data mining process

5 The Data Mining Tool SEE5 / C5.0

Data mining is all about extracting patterns from an organization's stored or warehoused data. These patterns can be used to gain insight into aspects of the organization's operations, and to predict outcomes for future situations as an aid to decision-making. Patterns often concern the categories to which situations belong. For example, is a loan applicant creditworthy or not? Will a certain segment of the population ignore an incoming mail or respond to it? Will a process give high, medium, or low yield on a batch of raw material?

See5 (Windows) and its Unix counterpart C5.0 are sophisticated data mining tools for discovering patterns that delineate categories, assembling them into classifiers, and using them to make predictions. See5/C5.0 has been designed to analyze substantial databases containing thousands to hundreds of thousands of records and tens to hundreds of numeric or nominal fields. To maximize interpretability, See5/C5.0 classifiers are expressed as decision trees or sets of if-then rules, forms that are generally easier to understand than neural networks.

The algorithm used by the program is essentially that of its predecessor, C4.5 [8], which is one of the most popular classifiers. C4.5 was produced by J.R. Quinlan as an extension of the ID3 tree classifier [9]. Because of the widely acknowledged efficiency of ID3 and C4.5, the results generated by these algorithms have been used in comparative tests in numerous papers and have become characteristic benchmarks in the field of machine learning.

One of the strongest aspects of the C4.5 algorithm is the information gain, an information-based consistency measure used by the method to evaluate the partitioning of the examples into disjoint subsets. The measure is defined as follows. Let U denote a set of examples, n the number of different classes of examples in U, and p(U, j) the proportion of those examples in U that belong to the j-th class.
The information content of the set U is expressed as:

$$\mathrm{Info}(U) = -\sum_{j=1}^{n} p(U, j)\,\log\bigl(p(U, j)\bigr) \qquad (1)$$

6 The Data Mining Tool SEE5 / C5.0

From the records collected, the attributes previously found to correlate strongly with accident involvement were selected, formatted and imported into the SEE5 software in order to extract a ruleset that describes our database and is able to predict accident involvement efficiently. The attributes used, their value ranges, and the target attribute (accident involvement) are shown in Table 1:

ATTRIBUTE                                         CODE NAME   VALUES
Gender                                            Gender      1, 2
Age                                               Age         <25, 26_35, 36_45, 46_55, >56
Mileage                                           Mileage     0_5, 5_10, 10_20, 20_30, 30_50, >50
Exceed speed limit during low traffic             Sign5       1, 2, 3, 4, 5, 6
Disregard speed limit to follow traffic           Sign6       1, 2, 3, 4, 5, 6
Forget the speed limit                            Sign7       1, 2, 3, 4, 5, 6
Deliberately exceed speed limit when overtaking   Take15      1, 2, 3, 4, 5, 6
Crossing solid line when changing lane            Take8       1, 2, 3, 4, 5, 6
Drive at the speed other drivers do               Guy3        1, 2, 3, 4, 5, 6
Accelerate at a green / yellow phase              Lit1        1, 2, 3, 4, 5, 6
Cross on lights that have just turned red         Lit2        1, 2, 3, 4, 5, 6
Disregard red lights at night                     Lit7        1, 2, 3, 4, 5, 6
Target attribute: Accident Involvement            Β6          0 = NO, 1 = YES

Table 1: Accident Involvement Prediction Attributes

7 Results

After importing the data into SEE5 and running the algorithm, 17 rules were extracted that predict the class of each record:

Rule 1: (119, lift 2.4)
    age = 46_55
    Take8 <= 2
    -> class 0 [0.992]

Rule 2: (183/1, lift 2.4)
    Take8 <= 1
    -> class 0 [0.989]

Rule 3: (47, lift 2.4)
    mileage = >50
    -> class 0 [0.980]

Rule 4: (126/6, lift 2.3)
    mileage = 0_5
    -> class 0 [0.945]

Rule 5: (210/28, lift 2.1)
    Take15 <= 1
    Lit2 <= 2
    -> class 0 [0.863]

Rule 6: (62/10, lift 2.0)
    mileage = 5_10
    Sign7 <= 1
    -> class 0 [0.828]

Rule 7: (336/62, lift 2.0)
    Take15 <= 1
    -> class 0 [0.814]

Rule 8: (282/66, lift 1.8)
    Sign5 <= 1
    -> class 0 [0.764]

Rule 9: (379, lift 1.7)
    Take8 > 2
    -> class 1 [0.997]

Rule 10: (167, lift 1.7)
    age = 36_45
    -> class 1 [0.994]

Rule 11: (104, lift 1.7)
    age = 26_35
    -> class 1 [0.991]

Rule 12: (49, lift 1.7)
    age = <25
    -> class 1 [0.980]

Rule 13: (57/2, lift 1.6)
    mileage = 30_50
    -> class 1 [0.949]

Rule 14: (114/6, lift 1.6)
    mileage = 20_30
    Take8 > 3
    -> class 1 [0.940]

Rule 15: (439/36, lift 1.6)
    -> class 1 [0.916]

Rule 16: (181/15, lift 1.6)
    mileage = 5_10
    -> class 1 [0.913]

Rule 17: (135/11, lift 1.6)
    mileage = 20_30
    Sign5 > 1
    -> class 1 [0.912]

Table 2: Rules of Accident Involvement Prediction

Each rule consists of:

- A rule number, which is arbitrary and serves only to identify the rule.
- Statistics (n, lift x) or (n/m, lift x) that summarize the performance of the rule, where n is the number of training cases covered by the rule and m, if it appears, shows how many of them do not belong to the class predicted by the rule. The lift x is the estimated accuracy of the rule divided by the prior probability of the predicted class.
- One or more conditions that must all be satisfied if the rule is to be applicable.
- A class predicted by the rule.
- A value between 0 and 1 that indicates the confidence with which this prediction is made.

This ruleset correctly classifies 1371 of the 1453 records, an accuracy of 94.4%. The overall performance of the algorithm is shown in Table 3:

Class (0)   Class (1)   <- Classified as
   585          20      Class (0)
    62         786      Class (1)

Table 3: Algorithm Performance

8 Conclusions

In this study, data mining is proposed as an operational decision tool for the prediction of accident involvement in Greece. The method, conceived especially for multi-attribute classification problems, suits the problem well. The prediction model has the form of decision rules. The derived decision rules reveal the most relevant attributes that an analyst should consider in order to evaluate a driver's accident risk.
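The rule statistics and the accuracy figure reported in Section 7 can be cross-checked with a short sketch. The Python below is illustrative only and is not part of the original study. It rests on two assumptions: that the rows of Table 3 are the actual classes and the columns the predicted classes, and that the confidence of a rule is the Laplace ratio (n - m + 1) / (n + 2) of its coverage statistics. All names (`RuleStats`, `laplace_confidence`, `lift`, and so on) are hypothetical, and only a handful of rules from Table 2 are included.

```python
from dataclasses import dataclass

# Confusion matrix from Table 3.
# Assumption: outer key = actual class, inner key = predicted class.
CONFUSION = {
    0: {0: 585, 1: 20},
    1: {0: 62, 1: 786},
}


@dataclass(frozen=True)
class RuleStats:
    number: int     # rule number as listed in Table 2
    covered: int    # n: training cases covered by the rule
    errors: int     # m: covered cases outside the predicted class (0 if omitted)
    predicted: int  # class predicted by the rule


# Statistics of a few rules copied from Table 2 (conditions left out).
RULES = [
    RuleStats(1, 119, 0, 0),
    RuleStats(2, 183, 1, 0),
    RuleStats(5, 210, 28, 0),
    RuleStats(9, 379, 0, 1),
    RuleStats(15, 439, 36, 1),
]


def accuracy(cm):
    """Fraction of records on the diagonal of the confusion matrix."""
    total = sum(sum(row.values()) for row in cm.values())
    correct = sum(cm[c][c] for c in cm)
    return correct / total


def class_priors(cm):
    """Prior probability of each actual class (row totals / grand total)."""
    total = sum(sum(row.values()) for row in cm.values())
    return {c: sum(row.values()) / total for c, row in cm.items()}


def laplace_confidence(rule):
    """Assumed confidence estimate: Laplace ratio (n - m + 1) / (n + 2)."""
    return (rule.covered - rule.errors + 1) / (rule.covered + 2)


def lift(rule, priors):
    """Estimated rule accuracy divided by the prior of the predicted class."""
    return laplace_confidence(rule) / priors[rule.predicted]


if __name__ == "__main__":
    priors = class_priors(CONFUSION)
    print(f"accuracy = {accuracy(CONFUSION):.1%}")
    for rule in RULES:
        print(
            f"Rule {rule.number}: confidence = {laplace_confidence(rule):.3f}, "
            f"lift = {lift(rule, priors):.1f}"
        )
```

Under these assumptions the sketch reproduces the 94.4% accuracy and, for the rules included, the confidence and lift values listed in Table 2. Reading the columns of Table 3 as the actual classes instead would change the class priors and therefore the computed lifts.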
It is important to mention that the rules were derived from a particular data set and as such they represent a generalized description of the experience contained in it. These rules therefore cannot be applied uncritically to other databases. If such a need arises, however, a new data set may be created and the same method can be used to analyze it and generate the appropriate rules.

Concerning the classification of drivers, the data mining approach produced very satisfactory results. This is important because it makes the approach a strong alternative tool for the future analysis of similar problems. Finally, compared to other existing methods, this approach offers the following advantages:

- It discovers important facts hidden in data and expresses them in the natural language of decision rules.
- It accepts both quantitative and qualitative attributes.
- It can contribute to the minimization of the time and cost of the decision-making process, as it is an information processing system operating in real time.
- It offers transparency of classification decisions, allowing for their argumentation.
- It takes into account background knowledge of the decision maker.

References:

[1] Evans L. (1991): Traffic Safety and the Driver, Van Nostrand Reinhold, New York
[2] Rumar K. (1985): The role of perceptual and cognitive failures in observed behavior. In Human Behavior and Traffic Safety, Plenum Press, New York
[3] Reason J.T., Manstead A., Stradling S., Baxter J. and Campbell K. (1990): Errors and violations on the road: a real distinction?, Ergonomics, 33, 1315-1332
[4] Aberg L. and Rimmo P.A. (1998): Dimensions of aberrant driver behavior, Ergonomics, 41, 39-56
[5] Kontogiannis T., Kossiavelou Z. and Marmaras N. (2002): Self-reports of aberrant behaviour on the roads: errors and violations in a sample of Greek drivers, Accident Analysis and Prevention, 34, 381-399
[6] Adriaans P. and Zantinge D. (1996): Data Mining, Addison-Wesley
[7] Michalski R., Bratko I. and Kubat M. (1999): Machine Learning and Data Mining: Methods and Applications, John Wiley and Sons, NY, USA
[8] Quinlan J. R. (1993): C4.5: Programs for Machine Learning, Morgan Kaufmann
[9] Quinlan J. R. (1986): Induction of decision trees, Machine Learning, vol. 1