Modeling Reaction Time for Abstract and Concrete Concepts using a Recurrent Network

Dana Dahlstrom and Jonathan Ultis
Department of Computer Science and Engineering
University of California, San Diego
La Jolla, CA 92093-04, USA
{dana,jultis}@cs.ucsd.edu

Abstract

Several recent studies have emulated aspects of human language processing using attractor networks. One study attempted to emulate human behavior in the lexical decision task: humans recognize concrete words faster than abstract words. The network, though, produced the opposite behavior. We revisit that result, exploring several other stopping criteria as well as an entirely different notion of concept abstractness. Even with these modifications the network consistently settles faster on abstract input than on concrete input, leading us to question fundamentally the reasons behind the human response times.

1 Modeling human behavior with a recurrent network

Several recent studies have had success modeling certain aspects of human behavior in language processing with attractor networks. An attractor network is a recurrent neural network designed to settle to a stable output over time. Attractor networks may be more accurate simulations of human neural processes than previous models. One attractor network effectively modeled semantic and associative priming effects observed in humans [1]. Another exhibited many symptoms of dyslexia when damaged [2].

In order to further understand the relationship of attractor networks to human language processing, we revisited a study on the effect of abstractness on reaction time in the lexical decision task [3]. The lexical decision task is to decide whether a string of characters is a word or not. Several studies have indicated humans identify concrete words more quickly than abstract words [4, 5, 6, 7]. This effect is modulated by frequency: low-frequency words are more strongly affected than are high-frequency words [4, 6].

Many other variables might also affect lexical decision speed as performed by an attractor network: frequency in training, density in the semantic space, and the number of bits turned on in the semantic vectors [3]. Our contribution is an analysis of different settling criteria for the attractor network and of a different notion of the representational difference between abstract and concrete pseudo-semantic vectors.

[Figure 1: Network architecture. 20 orthographic input units, 100 hidden units, 100 semantic output units.]

2 Emulating the lexical decision task

Largely because humans identify concrete words faster than abstract words, many researchers think humans accomplish the lexical decision task through semantic access: decoding, or failing to decode, the meaning of the word in question. The assumption underlying this connection is that abstract concepts are more elusive and therefore slower to access. Since attractor networks trained to map orthography to semantics have successfully modeled other human data, it is natural to wonder whether they can also capture this differential latency in lexical decision. To show such an effect requires a notion of the difference between abstract and concrete pseudo-semantic vectors, and a measure corresponding to response time.

2.1 Network architecture

The network used in our experiments is depicted in Figure 1. This architecture is the one used by Plaut to model semantic priming [1]. Clouse, in the study this paper extends, used the same architecture [3]. There are 20 input units, 100 output units, and 100 hidden units. The hidden layer is fully connected to itself and the output layer; the output layer is fully connected to itself and the hidden layer, except that individual units are not connected to themselves. The hidden and output layers influence each other in a gradual settling process defined by the update equation

    x_j^{[t]} = \tau \sum_i s_i^{[t-\tau]} w_{ij} + (1 - \tau) x_j^{[t-\tau]}    (1)

where x_j^{[t]} is the state of node j at time t, \tau is a fraction ranging over [0..1], s_i is input i to unit x_j, and w_{ij} is the weight between input i and unit j. The value of s_i is simply x_i passed through the squashing function

    s_i^{[t]} = \frac{1}{1 + \exp(-x_i^{[t]})}    (2)

This scheme is a discrete approximation to a continuous activation process. The tick length \tau determines the granularity of the approximation: when \tau = 1 the state of each node is fully replaced by the weighted sum of its inputs at each time step; a smaller \tau results in more gradual change.
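As an illustration, here is a minimal sketch of this settling scheme in Python with NumPy. The function names and the packing of hidden and output units into a single state vector are assumptions made for presentation, not details of the original code; the weights would come from training, with absent connections (including self-connections) held at zero.

```python
import numpy as np

def squash(x):
    # Equation (2): logistic squashing of a node's state.
    return 1.0 / (1.0 + np.exp(-x))

def settle(x0, ortho, W_rec, W_in, tau=0.01, steps=400):
    """Discrete settling per equation (1).
    x0:    initial states of the hidden and output units (one vector)
    ortho: clamped orthographic input vector
    W_rec: recurrent weights among hidden and output units (zero diagonal)
    W_in:  weights from the input units (zero rows for units they do not feed)
    A smaller tau gives a finer-grained approximation of continuous settling."""
    x = x0.copy()
    trajectory = [squash(x)]
    for _ in range(steps):
        net = squash(x) @ W_rec + ortho @ W_in   # weighted sum of inputs to each unit
        x = tau * net + (1.0 - tau) * x          # equation (1): gradual state update
        trajectory.append(squash(x))
    return np.array(trajectory)                  # settling trajectory of unit outputs
```

The resulting trajectory of output activations is what the stopping criteria in Section 3.1 would be evaluated on.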

2.2 Representing abstractness

In the above architecture, the 20-unit input vector is used to represent orthography, and the 100-unit output vector is used to represent semantics. It is not immediately clear how the output should differ when representing an abstract versus a concrete concept; we consider several possibilities.

Clouse's study considered vectors with fewer one bits (and correspondingly more zero bits) to be more abstract. Intuitively, one bits might represent things known about a concept, and fewer things may be known about abstract concepts than concrete concepts.

Abstract concepts might also be viewed as generalizations of concrete concepts. For example, vehicle generalizes the more concrete car, bus, bicycle, and airplane. In this view, abstract semantic vectors would be surrounded in vector space by many closely related, more concrete instantiations. We analyze both these notions of abstractness.

2.3 Training set

The network was trained using pairs of randomly generated orthographic and semantic vectors. The orthographic vectors were 20 bits long and were randomly constructed such that the chance of any bit being on was 0.1. The semantic vectors were 100 bits long and were grouped into neighborhoods intended to represent families of similar concepts. Each neighborhood was based on a random prototype; the others were generated from it by randomly flipping a few bits. No attempt was made to correlate similar orthographic vectors with similar semantic vectors in any way.

Three variables influenced the details of semantic vector generation: the probability p(one) of a given bit in the prototype being a one; the probability p(flip) of a given bit in the prototype being flipped when generating neighbors; and the size of semantic neighborhoods.

When testing the notion that vectors representing abstract concepts have fewer one bits, the original prototype vectors were not included. In this case, p(one) was 0.1 for abstract and 0.5 for concrete. When testing the notion of abstractness as generalization, the prototype vectors were included and labeled as abstract. In this simulation all prototypes were generated with p(one) = 0.5. For both simulations, p(flip) alternated between 0.05 and 0.5 to vary the density of neighborhoods. The number of vectors in a semantic neighborhood also alternated between 2 and 6. Overall, there were as many vectors in neighborhoods of size 2 as in neighborhoods of size 6; that is, there were 3 times as many neighborhoods of size 2.
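A minimal sketch of the pseudo-semantic vector generation described in Section 2.3, assuming NumPy; the function names are hypothetical, and the parameters follow the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthography(n_bits=20, p_one=0.1):
    # Random 20-bit orthographic input; each bit is on independently.
    return (rng.random(n_bits) < p_one).astype(int)

def semantic_neighborhood(n_bits=100, p_one=0.5, p_flip=0.05, size=6,
                          prototype_is_abstract=False):
    """One neighborhood of pseudo-semantic vectors: a random prototype plus
    'size' neighbors made by independently flipping each prototype bit with
    probability p_flip. The prototype itself is kept (and labeled abstract)
    only under the abstractness-as-generalization notion."""
    prototype = (rng.random(n_bits) < p_one).astype(int)
    neighbors = [np.where(rng.random(n_bits) < p_flip, 1 - prototype, prototype)
                 for _ in range(size)]
    return ([prototype] if prototype_is_abstract else []) + neighbors

# Example: a sparse "abstract" neighborhood under the fewer-one-bits notion.
abstract_vectors = semantic_neighborhood(p_one=0.1, p_flip=0.05, size=2)
```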

2.4 Training method

For the first notion of abstractness we use the same ten training sets Clouse used; two new sets were generated to test the notion of prototype vectors as abstract concepts. For both analyses we use Clouse's training code, which implements Pearlmutter's continuous back-propagation through time [8].

Before each new concept the network's state was reset to s_i = 0.3 for each unit, which caused each x_j to be -0.8473. The value 0.3 was chosen because it was the average value of all entries in all training set vectors. As each pattern was presented the input was corrupted with noise drawn from a normal distribution with mean µ = 0 and standard deviation σ = 0.05. The noise is intended to keep the network from over-fitting the training input. The input was held constant for 4 time steps while the network settled. Cross-entropy error between the output and target was accumulated between time t = 2 and t = 4 [9]. The error accumulated over that period was back-propagated to time t = 0. The weights of the network were modified stochastically using gradient descent with momentum. The learning rate was 0.005 and momentum was 0.8.

Clouse reported results for training at several stages, but our results are reported only for the network after 1,000 epochs of training with τ = 0.2 followed by 60 epochs of training with τ = 0.05 and 40 epochs of training with τ = 0.01. In this annealing approach, the final finer-grained training is used to smooth out the network's response. In the testing phase, τ = 0.01.

3 Testing method

Before each pattern the network's state was reset to s_i = 0.3 for all nodes, just as in training. The network was allowed to run until a particular stopping criterion held. We experimented with three such criteria. Results are reported in terms of the number of time steps t the network took to settle. For each of the tests reported, τ = 0.01, so during each time step the unit states were updated 100 times.

3.1 Stopping criteria

In the original study, only two stopping criteria were considered. Both were based on the maximum change in the state of any node. The first criterion considered only the maximum change in state of any output node; the second considered the maximum change in state of any output or internal node. Since the state of internal nodes is not commonly part of the stopping criterion, the criteria considered in this study depend only on the output state.

The maximum change of any node is not a particularly well motivated stopping criterion. It is possible for all but one node to have minimal change while a single node oscillates between two values, preventing this criterion from ever holding. Conversely, all nodes in the network could change at just below the maximum-change threshold; this criterion would call the network settled despite the collectively larger change. The overall velocity of the network's output state, the square root of the sum of squared change over all output nodes, seems to capture settling better than max-change does, and is less susceptible to idiosyncratic behavior in a single node.

Since the target output for any particular input is known in this simulation, we also experimented with the sum-squared error as a stopping condition. The sum-squared error is the distance from the network's current output to the target output. It is more an ideal than a realistic stopping criterion, since the true error would not be known in practice (though a heavyweight mechanism such as another network could estimate it). For our purposes the sum-squared error criterion is something of a benchmark.

For each of the three stopping criteria we tried three threshold values: 0.1, 0.01, and 0.001.
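A minimal sketch of the three criteria as predicates over successive output states; the names are hypothetical, prev and curr stand for the output activations on consecutive time steps, and target is the known training target.

```python
import numpy as np

def max_change_settled(prev, curr, threshold=0.01):
    # Original criterion: the largest single-node change falls below the threshold.
    return np.max(np.abs(curr - prev)) < threshold

def velocity_settled(prev, curr, threshold=0.01):
    # Overall velocity: square root of the summed squared change over all output nodes.
    return np.sqrt(np.sum((curr - prev) ** 2)) < threshold

def sse_settled(curr, target, threshold=0.1):
    # Benchmark criterion: sum-squared error to the known target output.
    return np.sum((curr - target) ** 2) < threshold
```

Each predicate would be evaluated once per time step along the settling trajectory; the settling time is the step at which it first holds.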

4 Simulations with low p(one) as abstract

There was a strong consistency across all the stopping criteria we examined. For no threshold did any stopping criterion cause the network to perform faster on concrete than on abstract concepts. The graphs in Figure 2 are representative of the results.

[Figure 2: Settling times under each criterion for p(one) = 0.1 (abstract) and 0.5 (concrete): (a) max change = 0.01, (b) velocity = 0.01, (c) sum-squared error = 0.1.]

One of the trends the original study recognized was the correlation between response time and neighborhood density as determined by p(flip). While we observed this tendency in our results, it was so slight as to be insignificant according to a t-test. The settling times for terms from dense and sparse neighborhoods with a sum-squared error stopping criterion thresholded at 0.1 are depicted in Figure 3(a).

It is interesting to note that randomly flipping bits in a prototype with few one bits is likely to introduce more one bits. As a result, the randomly generated neighbors are likely to be more concrete than the original randomized seed. For abstract prototypes, then, the higher p(flip), the more concrete the derived vectors are likely to be. Since the correlation between p(flip) and response time is so slight, it may be entirely due to the correlation between p(flip) and the effective resulting p(one). If this is true, the correlation should disappear if derived vectors were generated by swapping pairs of bits rather than flipping them individually (see the sketch at the end of this section).

The most striking tendency in response time, both in this study and in Clouse's, is that terms with high training frequency are identified much more quickly than are low-frequency terms. Again this result held over all combinations of stopping criterion and threshold. The settling times for low- and high-frequency terms with a sum-squared error stopping criterion thresholded at 0.1 are depicted in Figure 3(b).

[Figure 3: Settling times with the sum-squared error criterion, threshold = 0.1: (a) dense versus sparse neighborhoods, p(flip) = 0.05 and 0.5; (b) low versus high training frequency.]
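To make the flipping-versus-swapping contrast concrete, here is a small illustrative sketch (not part of the original simulations): flipping the bits of a sparse prototype tends to raise its one-bit count, while swapping a one bit with a zero bit preserves the count exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

def derive_by_flipping(prototype, p_flip):
    # Flip each bit independently; a sparse prototype mostly gains one bits.
    flips = rng.random(prototype.size) < p_flip
    return np.where(flips, 1 - prototype, prototype)

def derive_by_swapping(prototype, n_swaps):
    # Turn off n_swaps randomly chosen one bits and turn on as many zero bits,
    # leaving the total number of one bits unchanged.
    vec = prototype.copy()
    ones, zeros = np.flatnonzero(vec == 1), np.flatnonzero(vec == 0)
    k = min(n_swaps, ones.size, zeros.size)
    vec[rng.choice(ones, size=k, replace=False)] = 0
    vec[rng.choice(zeros, size=k, replace=False)] = 1
    return vec

proto = (rng.random(100) < 0.1).astype(int)   # sparse "abstract" prototype
flipped = derive_by_flipping(proto, p_flip=0.5)
swapped = derive_by_swapping(proto, n_swaps=5)
print(proto.sum(), flipped.sum(), swapped.sum())   # flipping adds ones; swapping does not
```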

[Figure 4: Settling times under each criterion for centroid (abstract) versus derived (concrete) vectors: (a) max change = 0.01, (b) velocity = 0.01, (c) sum-squared error = 0.1.]

5 Simulations with centroid semantic vectors as abstract

No combination of stopping criterion and threshold caused the network to perform faster on derived vectors than on centroid vectors. The graphs in Figure 4 are representative of the results.

6 Conclusions

In this study we were able to confirm Clouse's result more thoroughly. The particular recurrent network model used did not react more quickly to concrete words than to abstract words for any combination of stopping criterion and notion of abstractness.

7 Intuitions

It may be that concrete terms are recognized more quickly than abstract terms not because of their representation, but because concrete terms are reinforced more heavily in learning. Concrete ideas are reinforced with sensory input, while abstract ideas may be learned only through their connection to other ideas. In addition, concrete ideas may be less subject to training noise: the statement "that is a domino" is more likely to be correct than the statement "you're sad today."

If concrete ideas are indeed reinforced more heavily in human learning, then concrete ideas should behave as though they had a higher frequency on average than abstract ideas. The frequency results from this and previous studies suggest that training concrete ideas with a higher frequency than abstract ideas would produce the speed difference seen in humans performing the lexical decision task.

8 Acknowledgments

We would like to thank Gary Cottrell for suggesting this topic and several approaches, and Dan Clouse for sharing with us his code and test sets.

References

[1] D.C. Plaut, J.L. McClelland, M.S. Seidenberg, and K. Patterson. Understanding normal and impaired word reading: Computational principles in quasi-regular domains. Psychological Review, 1996.

[2] D.C. Plaut and T. Shallice. Deep dyslexia: A case study of connectionist neuropsychology. Cognitive Neuropsychology, 1993.

[3] Daniel S. Clouse and Garrison W. Cottrell. Regularities in a random mapping from orthography to semantics. In Proceedings of the Twentieth Annual Cognitive Science Conference, 1998.

[4] C.T. James. The role of semantic information in lexical decisions. Journal of Experimental Psychology: Human Perception and Performance, 1975.

[5] C.P. Whaley. Word-nonword classification time. Journal of Verbal Learning and Verbal Behavior, 1978.

[6] J.F. Kroll and J.S. Merves. Lexical access for concrete and abstract words. Journal of Experimental Psychology: Learning, Memory, and Cognition, 1986.

[7] A.M.B. de Groot. Representational aspects of word imageability and word frequency as assessed through word association. Journal of Experimental Psychology: Learning, Memory, and Cognition, 1989.

[8] B.A. Pearlmutter. Learning state space trajectories in recurrent neural networks. Neural Computation, 1:243-269, 1989.

[9] G.E. Hinton. Connectionist learning procedures. Artificial Intelligence, 40:185-234, 1989.