Modeling Reaction Time for Abstract and Concrete Concepts using a Recurrent Network


Dana Dahlstrom and Jonathan Ultis
Department of Computer Science and Engineering
University of California, San Diego
La Jolla, CA, USA
{dana,jultis}@cs.ucsd.edu

Abstract

Several recent studies have emulated aspects of human language processing using attractor networks. One study attempted to emulate human behavior in the lexical decision task, in which humans recognize concrete words faster than abstract words. The network, though, produced the opposite behavior. We revisit that result, exploring several other stopping criteria as well as an entirely different notion of concept abstractness. Even with these modifications the network consistently settles faster on abstract input than on concrete input, leading us to question fundamentally the reasons behind the human response times.

1 Modeling human behavior with a recurrent network

Several recent studies have successfully modeled aspects of human language processing with attractor networks. An attractor network is a recurrent neural network designed to settle to a stable output over time. Attractor networks may be more accurate simulations of human neural processes than previous models. One attractor network effectively modeled semantic and associative priming effects observed in humans [1]. Another exhibited many symptoms of dyslexia when damaged [2].

To further understand the relationship of attractor networks to human language processing, we revisited a study on the effect of abstractness on reaction time in the lexical decision task [3]. The lexical decision task is to decide whether or not a string of characters is a word. Several studies have indicated that humans identify concrete words more quickly than abstract words [4, 5, 6, 7]. This effect is modulated by frequency: low frequency words are more strongly affected than high frequency words [4, 6].

Many other variables might also affect the speed of lexical decision as performed by an attractor network: frequency in training, density in the semantic space, and the number of bits turned on in the semantic vectors [3]. Our contribution is an analysis of different settling criteria for the attractor network and of a different notion of the representational difference between abstract and concrete pseudo-semantic vectors.

2 Emulating the lexical decision task

Largely because humans identify concrete words faster than abstract words, many researchers think humans accomplish the lexical decision task through semantic access: decoding or failing to decode the meaning of the word in question. The assumption underlying this connection is that abstract concepts are more elusive and therefore slower to access. Since attractor networks trained to map orthography to semantics have successfully modeled other human data, it is natural to wonder whether they can also capture this differential latency in lexical decision. Showing such an effect requires a notion of the difference between abstract and concrete pseudo-semantic vectors, and a measure corresponding to response time.

2.1 Network architecture

The network used in our experiments is depicted in Figure 1. This architecture is the one used by Plaut to model semantic priming [1]. Clouse, in the study this paper extends, used the same architecture [3]. There are 20 input units, 100 output units, and 100 hidden units. The hidden layer is fully connected to itself and the output layer; the output layer is fully connected to itself and the hidden layer, except that individual units are not connected to themselves.

Figure 1: Network architecture (20 orthographic units, 100 hidden units, 100 semantic units).

The hidden and output layers influence each other in a gradual settling process defined by the update equation

$$x_j^{[t]} = \tau \sum_i s_i^{[t-\tau]} w_{ij} + (1 - \tau)\, x_j^{[t-\tau]} \qquad (1)$$

where $x_j^{[t]}$ is the state of unit $j$ at time $t$, $\tau$ is a fraction in the range $[0, 1]$, $s_i$ is input $i$ to unit $j$, and $w_{ij}$ is the weight between input $i$ and unit $j$. The value of $s_i$ is simply $x_i$ passed through the squashing function

$$s_i^{[t]} = \frac{1}{1 + \exp\left(-x_i^{[t]}\right)} \qquad (2)$$

This scheme is a discrete approximation to a continuous activation process. The tick length $\tau$ determines the granularity of the approximation: when $\tau = 1$ the state of each node is fully replaced by the weighted sum of its inputs at each time step; a smaller $\tau$ results in more gradual change.
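As a rough illustration of the update in equations (1) and (2), the sketch below advances a small set of units by one tick. It is not the code used in the study: the function names `squash` and `step`, the `weights[i][j]` layout, and the choice to update all units synchronously without separately clamped input units are assumptions made for the example.

```python
import math


def squash(x):
    """Logistic squashing function, as in equation (2)."""
    return 1.0 / (1.0 + math.exp(-x))


def step(x, weights, tau):
    """Advance every unit state by one tick of length tau, as in equation (1).

    x       -- list of current unit states x_j
    weights -- weights[i][j] is the weight from unit i to unit j (assumed layout)
    tau     -- tick length in [0, 1]
    """
    # Outputs are computed from the states of the previous tick.
    s = [squash(v) for v in x]
    n = len(x)
    new_x = []
    for j in range(n):
        net = sum(s[i] * weights[i][j] for i in range(n))
        # Blend the weighted input with the old state.
        new_x.append(tau * net + (1.0 - tau) * x[j])
    return new_x


if __name__ == "__main__":
    # Two units with no self-connections, each driving the other.
    x = [0.0, 0.0]
    w = [[0.0, 2.0], [2.0, 0.0]]
    print(step(x, w, 1.0))  # state fully replaced by the weighted input
    print(step(x, w, 0.5))  # state moves halfway toward the weighted input
```

With `tau = 1.0` the old state contributes nothing, and with `tau = 0.0` the state does not change at all, which matches the description of the tick length above.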

2.2 Representing abstractness

In the above architecture, the 20-unit input vector is used to represent orthography, and the 100-unit output vector is used to represent semantics. It is not immediately clear how the output should differ when representing abstract versus concrete concepts; we consider several possibilities.

Clouse's study considered vectors with fewer one bits (and correspondingly more zero bits) to be more abstract. Intuitively, one bits might represent things known about a concept, and fewer things may be known about abstract concepts than about concrete concepts.

Abstract concepts might also be viewed as generalizations of concrete concepts. For example, "vehicle" generalizes the more concrete "car," "bus," "bicycle," and "airplane." In this view, abstract semantic vectors would be surrounded in vector space by many others which are closely related, more concrete instantiations. We analyze both of these notions of abstractness.

2.3 Training set

The network was trained using pairs of randomly generated orthographic and semantic vectors. The orthographic vectors were 20 bits long and were randomly constructed such that the chance of any bit being on was 0.1. The semantic vectors were 100 bits long and were grouped into neighborhoods intended to represent families of similar concepts. Each neighborhood was based on a random prototype; others were generated from it by randomly flipping a few bits. No attempt was made to correlate similar orthographic vectors with similar semantic vectors in any way.

Three variables influenced the details of semantic vector generation: the probability p(one) of a given bit in the prototype being a one; the probability p(flip) of a given bit in the prototype being flipped when generating neighbors; and the size of semantic neighborhoods.

When testing the notion that abstract concepts have fewer one bits, the original prototype vectors were not included in the training set. In this case, p(one) was 0.1 for abstract and 0.5 for concrete. When testing the notion of abstractness as generalization, the prototype vectors were included and labeled as abstract. In this simulation all prototypes were generated with p(one) = 0.5.

For both simulations, p(flip) alternated between 0.05 and 0.5 to vary the density of neighborhoods. The number of vectors in a semantic neighborhood also alternated between 2 and 6. Overall, there were as many vectors in neighborhoods of size 2 as in neighborhoods of size 6; that is, there were 3 times as many neighborhoods of size 2.

2.4 Training method

For the first notion of abstractness we used the same ten training sets Clouse used; two new sets were generated to test the notion of prototype vectors as abstract concepts. For both analyses we used Clouse's training code, which implements Pearlmutter's continuous back-propagation through time [8].

Before each new concept the network's state was reset to $s_i = 0.3$ for each unit, which by equation (2) corresponds to each $x_j$ being approximately $-0.85$. The value 0.3 was chosen because it was the average value of all entries in all training set vectors. As each pattern was presented, the input was corrupted with noise drawn from a normal distribution with mean $\mu = 0$ and standard deviation $\sigma$ = [...]. The noise is intended to keep the network from over-fitting the training input.

The input was held constant for 4 time steps while the network settled. Cross entropy error between the output and target was accumulated between time $t = 2$ and $t = 4$ [9]. The error accumulated over that period was back-propagated to time $t = 0$. The weights of the network were modified stochastically using gradient descent with momentum. The learning rate was [...] and momentum was 0.8.

Clouse reported results for training at several stages, but our results are reported only for the network after 1,000 epochs of training with $\tau = 0.2$ followed by 60 epochs of training with $\tau = 0.05$ and 40 epochs of training with $\tau = 0.01$. In this annealing approach, the final finer-grained training is used to smooth out the network's response. In the testing phase, $\tau = 0.01$.

3 Testing method

Before each pattern the network's state was reset to $s_i = 0.3$ for all nodes, just as in training. The network was allowed to run until a particular stopping criterion held. We experimented with three such criteria. Results are reported in terms of the number of time steps $t$ the network took to settle. For each of the tests reported, $\tau = 0.01$, so during each time step the unit states were updated 100 times.

3.1 Stopping criteria

In the original study, only two stopping criteria were considered. Both were based on the maximum change in the state of any node. The first criterion considered only the maximum change in state of any output node; the second considered the maximum change in state of any output or internal node. Since the state of internal nodes is not commonly part of the stopping criterion, the criteria considered in this study depend only on the output state.

The maximum change of any node is not a particularly well motivated stopping criterion. It is possible for all but one node to have minimal change while a single node oscillates between two values, preventing this criterion from holding. Conversely, all nodes in the network could change at just below the maximum change threshold; this criterion would call the network settled despite the collectively larger change. The overall velocity of the network's output state (the square root of the sum of squared changes over all output nodes) seems to capture settling better than max-change does, and is less susceptible to idiosyncratic behavior in a single node.

Since the target output for any particular input is known in this simulation, we also experimented with the sum-squared error as a stopping condition. The sum-squared error measures the distance from the network's current output to the target output. It is more an ideal than a realistic stopping criterion, since the true error would not be known in practice. (Still, a heavyweight mechanism such as another network could estimate the error.) For our purposes the sum-squared error criterion is something of a benchmark.

For each of the three stopping criteria we tried three threshold values: 0.1, 0.01, and [...].

4 Simulations with low p(one) as abstract

There was a strong consistency across all the stopping criteria we examined. For no threshold did any stopping criterion cause the network to perform faster on concrete than on abstract concepts. The graphs in Figure 2 are representative of the results.
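The three stopping criteria of Section 3.1 can be sketched as follows. This is an illustrative reading of the definitions, not the code used in the study: the function names, the `is_settled` helper, and the choice to compare two consecutive output states are assumptions.

```python
import math


def max_change(prev, curr):
    """Largest absolute change in any single output unit."""
    return max(abs(c - p) for p, c in zip(prev, curr))


def velocity(prev, curr):
    """Square root of the sum of squared changes over all output units."""
    return math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, curr)))


def sum_squared_error(curr, target):
    """Sum of squared differences between the current output and the target."""
    return sum((c - t) ** 2 for c, t in zip(curr, target))


def is_settled(prev, curr, criterion, threshold, target=None):
    """Return True when the chosen criterion falls below the threshold."""
    if criterion == "max_change":
        value = max_change(prev, curr)
    elif criterion == "velocity":
        value = velocity(prev, curr)
    elif criterion == "sse":
        if target is None:
            raise ValueError("the sse criterion needs a target vector")
        value = sum_squared_error(curr, target)
    else:
        raise ValueError("unknown criterion: " + criterion)
    return value < threshold


if __name__ == "__main__":
    # 100 output units that each move by 0.009 in one update.
    prev = [0.5] * 100
    curr = [0.509] * 100
    print(is_settled(prev, curr, "max_change", 0.01))
    print(is_settled(prev, curr, "velocity", 0.01))
```

In the usage example every unit changes by slightly less than the 0.01 threshold, so max-change reports the network as settled, while the velocity (about 0.09) remains well above the same threshold. This is the second weakness of max-change described in Section 3.1.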

Figure 2: Settling times with each criterion for p(one) = 0.1 (abstract) and 0.5 (concrete). Panels: (a) max change = 0.01; (b) velocity = 0.01; (c) s.s.e. = 0.1.

One of the trends the original study recognized was the correlation between response time and neighborhood density as determined by p(flip). While we observed this tendency in our results, it was so slight as to be insignificant according to a t-test. The settling times for terms from dense and sparse neighborhoods with a sum-squared error stopping criterion thresholded at 0.1 are depicted in Figure 3(a).

Note that randomly flipping bits in a prototype with few one bits is likely to introduce more one bits. As a result, the randomly generated neighbors are likely to be more concrete than the original randomized seed. For abstract prototypes, then, the higher p(flip), the more concrete the derived vectors are likely to be. Since the correlation between p(flip) and response time is so slight, it may be entirely due to the correlation between p(flip) and the effective resulting p(one). If so, the correlation should disappear when derived vectors are generated by swapping pairs of bits rather than flipping them individually, since swapping preserves the number of one bits.

The most striking tendency in response time, both in this study and in Clouse's, is that terms with high training frequency are identified much more quickly than low frequency terms. Again this result held over all combinations of stopping criterion and threshold. The settling times for low and high frequency terms with a sum-squared error stopping criterion thresholded at 0.1 are depicted in Figure 3(b).

Figure 3: Settling times with sum-squared error criterion, threshold = 0.1. Panels: (a) dense versus sparse neighborhoods (p(flip) = 0.05 and 0.5); (b) low versus high training frequency.
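The observation that flipping bits in a sparse prototype tends to add one bits, and the suggested alternative of swapping pairs of bits, can be illustrated with a short sketch. The function names, the use of Python's `random.Random`, and the particular way pairs are chosen for swapping are assumptions for the example; the study's own generation code is not shown in the text.

```python
import random


def make_prototype(n_bits, p_one, rng):
    """Random prototype in which each bit is one with probability p_one."""
    return [1 if rng.random() < p_one else 0 for _ in range(n_bits)]


def flip_neighbor(prototype, p_flip, rng):
    """Derive a neighbor by flipping each bit independently with probability p_flip."""
    return [1 - b if rng.random() < p_flip else b for b in prototype]


def swap_neighbor(prototype, n_swaps, rng):
    """Derive a neighbor by exchanging n_swaps (one, zero) pairs of bits.

    Each swap turns one bit off and another on, so the number of one bits
    is unchanged. This pairing scheme is an assumed illustration.
    """
    neighbor = list(prototype)
    ones = [i for i, b in enumerate(neighbor) if b == 1]
    zeros = [i for i, b in enumerate(neighbor) if b == 0]
    rng.shuffle(ones)
    rng.shuffle(zeros)
    for i, j in zip(ones[:n_swaps], zeros[:n_swaps]):
        neighbor[i], neighbor[j] = 0, 1
    return neighbor


def expected_p_one_after_flip(p_one, p_flip):
    """Expected fraction of one bits after independent flipping.

    A bit ends up one if it was one and was not flipped,
    or was zero and was flipped.
    """
    return p_one * (1.0 - p_flip) + (1.0 - p_one) * p_flip


if __name__ == "__main__":
    rng = random.Random(0)
    proto = make_prototype(100, 0.1, rng)
    print(sum(proto), sum(flip_neighbor(proto, 0.05, rng)))
    print(sum(proto), sum(swap_neighbor(proto, 3, rng)))
    print(expected_p_one_after_flip(0.1, 0.05))
```

For a prototype with p(one) = 0.1 and p(flip) = 0.05, the expected fraction of one bits in a derived vector is 0.1 × 0.95 + 0.9 × 0.05 = 0.14, so flipped neighbors are on average denser than their prototype. For p(one) = 0.5 the expected fraction stays at 0.5 for any p(flip), and a swapped neighbor always has exactly as many one bits as its prototype.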

Figure 4: Settling times with each criterion for centroid (abstract) versus derived (concrete) vectors. Panels: (a) max change = 0.01; (b) velocity = 0.01; (c) s.s.e. = 0.1.

5 Simulations with centroid semantic vectors as abstract

No combination of stopping criterion and threshold caused the network to perform faster on centroid vectors than on derived vectors. The graphs in Figure 4 are representative of the results.

6 Conclusions

In this study we were able to confirm Clouse's result more thoroughly. The particular recurrent network model used did not react more quickly to concrete words than to abstract words for any combination of stopping criterion and notion of abstractness.

7 Intuitions

It may be that concrete terms are recognized more quickly than abstract terms not because of their representation, but because concrete terms are reinforced more heavily in learning. Concrete ideas are reinforced with sensory input, while abstract ideas may be learned only through their connection to other ideas. In addition, concrete ideas may be less likely to have training noise. The statement "that is a domino" is more likely to be correct than the statement "you're sad today."

If concrete ideas are indeed reinforced more heavily in human learning, then concrete ideas should behave as though they had a higher frequency on average than abstract ideas. The frequency results from this and previous studies suggest that training concrete ideas with a higher frequency than abstract ideas would produce the speed difference seen in humans performing the lexical decision task.

8 Acknowledgments

We would like to thank Gary Cottrell for suggesting this topic and several approaches, and Dan Clouse for sharing with us his code and test sets.

References

[1] D. C. Plaut, J. L. McClelland, M. S. Seidenberg, and K. Patterson. Understanding normal and impaired word reading: Computational principles in quasi-regular domains. In Psychological Review, 1996.
[2] D. C. Plaut and T. Shallice. Deep dyslexia: A case study of connectionist neuropsychology. In Cognitive Neuropsychology, 1993.
[3] Daniel S. Clouse and Garrison W. Cottrell. Regularities in a random mapping from orthography to semantics. In Proceedings of the Twentieth Annual Cognitive Science Conference, 1998.
[4] C. T. James. The role of semantic information in lexical decisions. In J Exper Psych: Human Perception and Performance, 1975.
[5] C. P. Whaley. Word-nonword classification time. In J Verbal Learning and Verbal Behavior, 1978.
[6] J. F. Kroll and J. S. Merves. Lexical access for concrete and abstract words. In J Exper Psych: Learning, Memory and Cognition, 1986.
[7] A. M. B. de Groot. Representational aspects of word imageability and word frequency as assessed through word association. In J Exper Psych: Learning, Memory and Cognition, 1989.
[8] B. A. Pearlmutter. Learning state space trajectories in recurrent neural networks. In Neural Computation, volume 1, 1989.
[9] G. E. Hinton. Connectionist learning procedures. In Artificial Intelligence, volume 40, 1989.
