A Cortex-inspired Associative Memory with O(1) Time Complexity Learning, Recall, and Recognition of Sequences

Gerard J. Rinkus
Volen Center for Complex Systems, Brandeis University

Abstract

A cortex-inspired associative memory model possessing O(1) time complexity for both storage (learning) and retrieval (recall, recognition) of sequences is described. It learns sequences, specifically, binary spatiotemporal patterns, with single trials. Results are given demonstrating O(1) retrieval of: a) a sequence's remaining items when prompted with its initial item, i.e., episodic recall; and b) the most similar stored sequence when presented with a novel sequence, i.e., recognition/categorization. The hidden (representation) layer, L2, is organized into winner-take-all competitive modules (CMs) hypothesized to be analogous to cortical minicolumns. Representations consist of one active unit per CM. The heart of the model is a matching algorithm that, given the current input moment, σ, i.e., the current input item in the context of the sequence thus far: a) finds the stored representation of that previously experienced moment, σ*, out of all previously experienced moments, which is spatiotemporally most similar to σ; and b) returns a normalized measure, G, of that similarity. When in recall or recognition mode, the model simply reactivates that stored representation, since it is the most likely hypothesis given the model's history and the current input. When in learning mode, the model injects an amount of noise, inversely proportional to G, into the process of choosing the cells to represent σ. This yields the property that the size of the intersection between representations is an increasing function of the spatiotemporal similarity of the moments that they represent. Thus, the higher-order statistics (spatiotemporal similarity structure) of the set of learned sequences are reflected directly in the patterns of intersections over the set of representations. This property, in conjunction with the use of the binary sparse representation, makes the O(1) recall and recognition (i.e., inference) possible.

1. Introduction

Human capability still far exceeds that of machines on a vast range of information processing tasks: parsing speech in noisy environments; understanding the meaning of that speech at many levels, from the literal up through recognizing sarcasm or irony; reading an opponent's intention and thus being able to predict his or his team's next or ensuing movements; etc. These are all spatiotemporal (sequential) tasks that humans with sufficient experience routinely perform both accurately and in real time. An underlying commonality between tasks like these is that they all require rapid inference in exponentially large hypothesis spaces. For example, the space of possible movement patterns over the next five seconds of the set of opposing soccer team players is exponentially large. Nevertheless, an experienced player can often correctly predict the pattern with sufficient accuracy to be successful in the moment.

The relatively slow speed of neural operations implies that somehow the brain drastically pares down the huge hypothesis space so that very few hypotheses are explicitly considered at any given moment. At base, the ability to achieve such a paring down is a function of how the hypotheses are represented in memory. For example, the log₂N time complexity of binary search depends on the fact that the set of N items being searched over is arranged by numerical similarity, e.g., 8 is more similar to 9 than to 2. More generally, one might say that search and retrieval are abetted to the extent that the similarity structure of the hypothesis space is reflected in the physical representation (storage) of the hypotheses. Of course, ensuring that such a property, i.e., similar inputs map to similar representations (SISR), holds as successive inputs are presented to the system generally incurs increased storage-time complexity. However, the model described herein implements this property in such a way that it achieves the theoretically optimal time complexity, i.e., O(1), for both storage and retrieval.

This model, originally developed in Rinkus (1995, 1996), employs sparse distributed representations to do unsupervised learning, recall and recognition of binary spatiotemporal patterns, or sequences, and possesses this immediate search/retrieval property. More formally, the model has O(1) search time complexity, meaning that the number of computational steps necessary to find the most similar stored sequence (to some query sequence) remains constant no matter how many sequences have been stored, and no matter how long individual sequences are.

This holds whether the model is asked to recall the remainder of a sequence given a prompt (episodic recall) or recognize a novel sequence (by activating the memory trace of the most similar previously learned sequence). Moreover, the model does single-trial learning of its input sequences and, in fact, has O(1) storage time complexity as well.

The model, named TEMECOR (TEmporal MEmory using COmbinatorial Representations), achieves this performance primarily through two features: a) a particular, sparse distributed, k-WTA representation, modeled after the minicolumn structure of neocortex; and b) an algorithm which computes, at each moment during learning, the maximal similarity, G, between the currently presenting sequence and all of its stored sequences (and all of their prefixes) and then adds an amount of noise, inversely proportional to G, into the choice of units that will become active at that moment. Together, these features yield the key property that the degree of overlap between two stored sequence representations (memory traces) is an increasing function of the spatiotemporal similarity of the sequences that they represent, i.e., the SISR property. Thus, both the memory traces of the individual sequences and the higher-order statistics (similarity structure) of the set of sequences reside in a single monolithic structure (set of weights).

The specific claim of this paper is that the proposed model has O(1) computational time complexity for both storage (learning) and retrieval (recall, recognition) of sequences. This follows immediately from the fact that none of the model's equations (Eqs. 1-11) have any dependence on the number of sequences stored. While the computational space complexity, i.e., storage capacity, of an associative memory model is also of paramount importance, this paper does not present a definitive result for the model's space complexity. However, the appendix gives two indirect arguments suggesting that the model's space complexity will allow scaling to biologically relevant problem sizes. A comprehensive examination of the model's space complexity will be presented in a future paper.

2. Background

The use of distributed representations for categories (hidden variables, causes, factors) has been central to both the multilayer perceptron (MLP), e.g., Backpropagation, and associative memory approaches to computational intelligence. However, this is not true of the family of probabilistic models, generally referred to as graphical models, which has become the dominant paradigm in the field.

This very large family includes Bayesian Belief Networks (Pearl, 1988), hidden Markov models (HMMs), mixture models, and principal components analysis, amongst others (Roweis & Ghahramani, 1999). All of these models were originally described with the assumption of a localist (singleton) representation of categories. Many of them, notably HMMs, which are by far the predominant approach for speech recognition (and perhaps for the domain of temporal/sequential pattern recognition in general), have retained this localist assumption over almost their entire developmental courses.

More recently, there has been a movement towards distributed representations for categories in graphical models. This movement is often attributed to the observation, by Williams & Hinton (1991), of a fundamental inefficiency of localist representations. They observed that, in order to represent only N bits of information about the history of a time series, the single multinomial state variable of a standard, localist HMM requires 2^N possible values, i.e., an HMM with 2^N nodes. In contrast, a distributed representation of state with only N binary variables (an HMM with 2N nodes) can also store N bits of history information. A number of models using distributed representations of state have since been proposed (Ghahramani & Jordan, 1997; Brown & Hinton, 2001; Jacobs et al., 2002; Blitzer et al., 2005).

The benefit of distributed representations can be viewed as increased capacity for storing the higher-order statistics present in a sequence (or in any data set, more generally). This point can be made more concretely in terms of sparse distributed representations. In a sparse distributed representation, in which any given representee is represented by a small subset of representational units (e.g., 100 out of 1000) being active, the higher-order statistics, i.e., similarity structure, of the set of representees can be represented directly by the patterns of overlaps (intersections) over their representations. The more similar two representees are, the more units they have in common. Figure 1 illustrates this concept. At left is a hypothetical set of observations (A, B, D) in some feature space and a hypothetical underlying category structure (C1, C2) consistent with the distances between the observations in the feature space. At lower right is one possible sparse distributed representation: each observation is represented by 5 out of 12 of the units (which is sparse enough to make the point). The vertical dotted lines on the right show how the pattern of overlaps of the three representations can directly represent the higher-order statistics (a nested categorical structure) of the observations: the more similar observations, A and B, have three units in common, which constitutes a representation of the subcategory, C1, while the full set of observations has only two units in common, which constitutes a representation of the supercategory, C2.

[Figure 1 graphic. Panel labels: "Hidden, though actual, categorical structure" (left: a tree with C2 = animals over C1 = fish, where C1 spans trout (A) and shark (B) and C2 additionally spans bear (D)); "Observations"; "Representations of the individual observations"; "Pattern of overlaps, which implicitly represents the similarity structure of the input space."]

Figure 1: (Left) A hypothetical set of three observed patterns (A, B, D) from some input space and a hypothetical underlying category structure. A and B are more similar than A and D or B and D. (Lower right) A sparse distributed representation of the individual observations, A, B, and D. (Upper right) Different subsets of units, C1 and C2, corresponding to the pair-wise and triple-wise overlaps respectively, constitute explicit representations of higher-order similarity (i.e., category) structure.

Figure 1 is intended to show only that it is possible to represent the similarity structure of a set of individual observations by the pattern of overlaps over their representations. The information present in the pattern of overlaps in the box at upper right is fully analogous to a dendrogram (hierarchical clustering analysis) typically used to demonstrate the categorization behavior of learning models that use fully distributed representations (Elman, 1990). In the case of sparse distributed representations, as in Figure 1, the appropriate measure of distance between representations is Hamming distance. The pattern of overlaps in the upper right box is isomorphic to a dendrogram in which A and B would be clustered as C1 and then C1 and D would be clustered as C2, which would look exactly like the tree structure on the left, as desired. The evidence, given in Section 4, that the proposed model actually uses this implicit category information is that it correctly classifies novel observations (sequences) even though there are no explicit and separate representations of category information in the model.
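The following is a minimal Python sketch (mine, not the paper's) of the point made above and in Figure 1, using made-up unit indices: the pairwise intersection of the codes for A and B plays the role of the subcategory C1, the triple-wise intersection plays the role of the supercategory C2, and Hamming distance serves as the distance measure between sparse binary codes.

    # Hypothetical 12-unit sparse codes for the three observations of Figure 1.
    code_A = {0, 2, 5, 7, 9}
    code_B = {0, 2, 5, 8, 11}      # shares 3 units with A
    code_D = {0, 2, 4, 6, 10}      # shares only 2 units with A and with B

    C1 = code_A & code_B           # pair-wise overlap -> implicit code for C1
    C2 = code_A & code_B & code_D  # triple-wise overlap -> implicit code for C2

    def hamming(x, y):
        """Hamming distance between two binary codes given as sets of active units."""
        return len(x ^ y)

    print(sorted(C1), sorted(C2))                              # [0, 2, 5] [0, 2]
    print(hamming(code_A, code_B), hamming(code_A, code_D))    # A is closer to B (4) than to D (6)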

The representation of similarity structure by patterns of overlap constitutes nothing less than an additional representational dimension, one which is unavailable at either extreme of the representational continuum, i.e., in either localist or densely (fully) distributed representations. Under a localist scheme there are no code overlaps; thus, the only way to represent the higher-order statistical structure over the representees is in the weights between the representations. Under a dense scheme, where, formally, all units participate in all representations, all code overlaps are complete, e.g., a hidden layer with M real-valued units in a Backpropagation model. Here, the only available representation of similarity is a scalar (e.g., Euclidean) distance between the representations.[1] The possibility of representing similarity by overlaps has been generally underappreciated thus far. However, it is at the core of the spatial model of Rachkovskij & Kussul (2000, 2001) and of the spatiotemporal model proposed herein.

[1] This issue is central to compositionality. Classical AI approaches are localist. Compositionality is implicitly present owing to the clear mapping between representees, both wholes and parts, and their representations. For connectionist approaches that employ dense distributions, there is no such clear mapping available, hence the classicists' (Fodor & Pylyshyn, 1988) claim that connectionist systems lack what Van Gelder (1990) terms syntactic compositionality. However, as discussed here, sparse connectionist models can possess syntactic compositionality.

TEMECOR has not been developed within the framework of graphical models. In particular, although it deals with spatiotemporal patterns, it has not been developed as an extension of any variant of HMMs, nor of the more general class of Dynamic Belief Nets (DBNs). Its learning algorithm will, thus, not be cast as a version of the expectation-maximization (EM) algorithm (Dempster & Laird, 1977). Instead, the learning algorithm, which uses only single trials, will be described as computing a spatiotemporal similarity metric and using it to guide the choice of representation so that the pattern of overlaps over the final set of representations reflects the higher-order spatiotemporal statistics of the input pattern set. As will be shown, this property leads directly to a simple retrieval (inference) algorithm that finds the most similar stored sequence to a query sequence (i.e., the maximum likelihood hypothesis) in a single computation, i.e., a computation which does not depend on how many sequences have been stored. TEMECOR has also not been extended from the various MLP variants that have been applied to temporal/sequential patterns, e.g., Waibel (1989), Jordan (1986), Elman (1990), Williams & Zipser (1989). The MLP-based approach has been successfully applied to many categorization problems.

However, such models: a) have not been demonstrated to simultaneously explain both episodic memory (recall of sequences) and semantic memory (categorization of novel sequences); b) require numerous training trials for each exemplar; and c) suffer from catastrophic forgetting, in which nearly all information about a previously learned data set can be lost when the model learns a new data set (McCloskey & Cohen, 1989; French, 1991; Cleeremans, 1993). Moreover, as discussed by Cleeremans, such Backpropagation-based models have difficulties that are particularly exacerbated in the context of learning complex sequences, which are sequences in which items can recur multiple times and in varying contexts.

TEMECOR is closer to a number of sparse distributed associative memory models (e.g., Willshaw et al., 1969; Palm, 1980; Lynch, 1986; Kanerva, 1988; Moll & Miikkulainen, 1996), although it differs from these in many ways, including its focus on spatiotemporal patterns and its novel use of noise to effect the crucial SISR (similar inputs map to similar representations) property. It also has similarities to the Synfire Chains model (Abeles, 1991) in that it maps inputs, specifically sequences, into spatiotemporal memory traces, although the focus is not on the fine-scale timing of spikes.

Several authors, e.g., Dayan & Zemel (1995), have discussed the oppositional natures of the coding mechanisms of cooperation and competition. O'Reilly (1998) explains that any use of competition prevents the cooperativity and combinatoriality of true distributed representations, and that the need to preserve independence among the units in a [distributed representation] prevents the introduction of any true activation-based competition. He adds that the development of sparse distributed representations, which necessarily integrate these two opposing mechanisms, has been challenging largely because they are difficult to analyze mathematically. Relatedly, a number of authors have discussed the benefits of and need for sparse distributed representations (Olshausen & Field, 1996; Hinton & Ghahramani, 1997). TEMECOR may provide a hopeful alternative in this regard because its particular implementation of the sparse distributed paradigm cleanly separates the cooperative and competitive aspects. As will be seen shortly, competition is isolated within the model's winner-take-all competitive modules (CMs), and the cooperation is present in that internal representations (codes) are sets of coactive units, one per CM.

3. Model Description

Figure 2 shows the types of sequences processed by the model.

In this example, the input surface consists of 16 abstract binary features arranged in a vertical array; array position does not reflect any spatial relation in the physical input domain. Figure 2a shows an instance of a completely uncorrelated, or white noise, sequence in which a randomly chosen subset of binary features is chosen to be active at each moment. The intention of using input sets consisting of such uncorrelated sequences is to approximate the conditions of episodic memory, wherein essentially arbitrary combinations of features are remembered essentially permanently despite occurring only once. The model also handles correlated, or more specifically, complex, sequence sets (Figure 2b) in which whole input items can recur multiple times and in different temporal contexts, as is necessary for modeling language-like domains.

[Figure 2 graphic. Panels: a) Uncorrelated; b) Correlated (Complex), the letter sequence L, O, T, T, O.]

Figure 2: a) A random binary spatiotemporal pattern (sequence) having 10 time steps, or moments. The input surface has 16 abstract binary features, a randomly chosen subset of which is active at each moment. b) A complex sequence in which the same item (subset of features), e.g., an English letter, can recur in different contexts, as is appropriate for representing linguistic domains.

As shown in Figure 3, the two-layer model[2] uses sparse distributed binary representations, or codes, in its internal layer, L2. In this example, L2 is divided up into Q = 4 winner-take-all competitive modules (CMs) and every code consists of one active unit in each CM (black L2 units). Hence, the layer instantiates a particular type of k-WTA architecture. Here, L1 is arranged as a 2D grid of binary abstract features; the model is agnostic to modality and to any inherent dimensionality of the input space.

[2] This model has already been generalized to a hierarchical framework allowing an arbitrary number of layers (Rinkus & Lisman, 2005) and will be described in a separate manuscript.
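As an illustration of the representational layout just described (a sketch under my own naming conventions, not code from the paper), an L2 code can be held as a length-Q vector giving the index of the single winning unit in each of the Q CMs, and the overlap between two codes is then the number of CMs in which they share the same winner:

    import numpy as np

    Q, K = 4, 4                      # the toy model of Figure 3: 4 CMs, each with 4 units
    rng = np.random.default_rng(0)

    def random_code():
        """Pick one winner per CM uniformly at random (what happens when G is near 0)."""
        return rng.integers(0, K, size=Q)

    theta_A = random_code()          # e.g., a code chosen for moment [A]
    theta_AB = random_code()         # a code chosen for moment [AB]

    # Overlap between two codes = number of CMs in which the same unit wins.
    overlap = int(np.sum(theta_A == theta_AB))
    print(theta_A, theta_AB, overlap)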

Figure 3 shows the state of a small instance of the model on all four time steps during the learning of the sequence, [ABDB]. It illustrates several aspects of the model's basic operation, as well as some nomenclature. Note that only a small subset of the connections (weights) is shown in this figure. Throughout this paper, we will assume complete inter-level connectivity: all L1 units connect with all L2 units in both the bottom-up (BU) and top-down (TD) directions. We also assume nearly complete intra-level (horizontal, or H) connectivity (within L2 only): each L2 unit contacts all other L2 units except those in its own CM.

[Figure 3 graphic. Labels: codes θ_A, θ_AB, θ_ABD, θ_ABDB over layers L2 and L1; Time: 1, 2, 3, 4; Input: A, B, D, B; Moment: [A], [AB], [ABD], [ABDB].]

Figure 3: A depiction of a small instance of the model on all four time steps, or moments, while learning the sequence, [ABDB]. Only a representative subset of the weight increases on each time step is shown (black lines). See text for a detailed explanation of the figure.

Figure 3 shows that during learning, a new L2 code generally becomes active for each successive time step, or moment, of a sequence. A moment is defined as a particular spatial input (item) in the context of the particular sequence of items leading up to it (i.e., a particular prefix or history). Thus, the sequence [ABDB] consists of four unique moments, [A], [AB], [ABD], and [ABDB], but only three unique input items. The symbol σ is used to denote a moment. The symbol ϒ is used to denote an individual input item, i.e., a set of co-active L1 units (a purely spatial pattern). The symbol θ is used to denote an L2 code, i.e., a set of active L2 units; e.g., θ_AB (black units only) denotes the L2 code for the moment [AB]. The figure also shows that the same input item, e.g., B, generally gives rise to different L2 codes, depending on prior context; hence, θ_AB is not equal to θ_ABDB, although they overlap at one CM (the one at lower right).

Figure 3 graphically depicts a representative subset of the learning that occurs during presentation of the sequence. All weights, BU, TD, and H, are initially zero. At t=1, an input item, A, presents at the input layer, L1.

Active L1 units send signals via their BU weights to L2. Assume for the moment that some random L2 code, θ_A, is chosen. The details of how L2 codes are chosen will be elaborated shortly. The BU and TD weights between the three active L1 units comprising item A and the four units of θ_A are increased (only the increases involving the lower left CM's winning unit are shown). The arcs from the gray L2 unit in the upper left CM at t=2 show a subset of the horizontal (temporal) learning that would occur within L2 from θ_A (which was active at t=1 and is shown as the gray units in L2 at t=2) onto θ_AB (black units in L2 at t=2). Thus, successively active L2 codes are chained together by a temporal Hebbian learning law, Eq. 1a, in which an L2 unit active at t-1 increases its weights onto all L2 units active at t (except for units in its own CM). The learning law for the BU and TD weights is the more familiar Hebbian rule, Eq. 1b, in which the weight is increased if L1 unit i and L2 unit j are co-active.

    w_t(i,j) = 1,              if i ∈ θ_{t-1}, j ∈ θ_t, and CM(i) ≠ CM(j)        (1a)
             = w_{t-1}(i,j),   otherwise

    w_t(i,j) = 1,              if i and j are both active at t                    (1b)
             = w_{t-1}(i,j),   otherwise
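A brief Python sketch of these two binary Hebbian updates follows (the array shapes, the global indexing of L2 units as cm_index * K + winner_index, and all names are my assumptions, not the paper's code):

    import numpy as np

    Q, K, n_L1 = 4, 4, 9
    n_L2 = Q * K
    W_H  = np.zeros((n_L2, n_L2), dtype=np.uint8)   # horizontal (L2 -> L2) weights
    W_BU = np.zeros((n_L1, n_L2), dtype=np.uint8)   # bottom-up (L1 -> L2) weights
    W_TD = np.zeros((n_L2, n_L1), dtype=np.uint8)   # top-down (L2 -> L1) weights

    def learn_moment(item, code_prev, code_curr):
        """item: active L1 unit indices; codes: winner index per CM (code_prev may be None)."""
        curr_units = [z * K + code_curr[z] for z in range(Q)]
        # Eq. 1b: set BU and TD weights between co-active L1 and L2 units to 1.
        for i in item:
            for j in curr_units:
                W_BU[i, j] = 1
                W_TD[j, i] = 1
        # Eq. 1a: chain the code active at t-1 onto the code active at t,
        # skipping pairs of units that lie within the same CM.
        if code_prev is not None:
            prev_units = [z * K + code_prev[z] for z in range(Q)]
            for i in prev_units:
                for j in curr_units:
                    if i // K != j // K:
                        W_H[i, j] = 1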

The model's operation during learning is as follows. In Eq. 2, each individual L2 unit, i, computes the weighted sum, φ_t(i), of its horizontal (H) inputs from the L2 units active on the prior time step, t-1. The analogous summation, ψ_t(i), for the bottom-up (BU) inputs from the currently active L1 units, ϒ_t, is computed in Eq. 3.[3] In Eqs. 4 and 5, these summations are normalized, yielding Φ_t(i) and Ψ_t(i). The φ normalization is possible because L2 code size is invariant and equal to Q, the number of CMs. For instance, in the model of Figure 3, all L2 codes have exactly Q = 4 active units. Therefore, the maximum total horizontal input possible on any time step is Q-1 = 3 (since units do not receive H synapses from other units in their own CM). Thus, we normalize by dividing a unit's summed H inputs by Q-1. Similarly, BU inputs can be normalized because the number of active input features on any given time step is assumed to be a constant, M.[4]

[3] This equation is changed slightly in Sec. 4 to accommodate a slightly more complex representation of time used in the computer model. The details will be explained in Sec. 4.

[4] The theory does not require the strict normalization of the number of active input features per time step, but this is not discussed in this paper.

    φ_t(i) = Σ_{k ∈ θ_{t-1}} w(k, i)                      (2)

    ψ_t(i) = Σ_{k ∈ ϒ_t} w(k, i)                          (3)

    Φ_t(i) = φ_t(i) / (Q - 1)                             (4)

    Ψ_t(i) = ψ_t(i) / M                                   (5)

In Eq. 6, each L2 unit computes its own local degree of match (support), χ_t(i), to the current moment. Formally, this is implemented as a multiplication between the normalized versions of a unit's horizontal (H) and bottom-up (BU) inputs, as indicated in Figure 4. Note, however, that on the first time step of any sequence (t=1), there is no horizontal input present. In this case, χ_t depends only on Ψ_t. This multiplication results in a sharpening of the posterior probability (or likelihood) over the hypothesis space, as in the Product of Experts model (Hinton, 1999) and consistent with the ideas exposited by Lee & Mumford (2002).

    χ_t(i) = [Ψ_t(i)]^v,                 t = 1            (6)
           = [Φ_t(i)]^u · [Ψ_t(i)]^v,    t > 1

Let Φ̂_{t,z} (depicted near the top right of Figure 4) denote the normalized H input vector over the units of the z-th CM. Φ̂_{t,z} represents the temporal context of (the history leading up to) the current moment, and is a function of the specific set of L2 units active at t-1. Similarly, Ψ̂_{t,z} (near the bottom right of Figure 4) denotes the normalized BU input vector over the units of the z-th CM. Φ̂_{t,z} and Ψ̂_{t,z} are independently exponentiated, typically by some small integer (e.g., 2 or 3), to effect separate generalization gradients for each influence. This further sharpens the posterior.
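The per-unit computation of Eqs. 2-6 for a single time step might look as follows in Python (a sketch only; the array layout and parameter names are my assumptions):

    import numpy as np

    def local_support(item, prev_units, W_BU, W_H, Q, K, M, u=2, v=2):
        """item: active L1 units; prev_units: list of L2 units active at t-1 (empty at t=1)."""
        n_L2 = Q * K
        phi = W_H[prev_units, :].sum(axis=0) if len(prev_units) else np.zeros(n_L2)   # Eq. 2
        psi = W_BU[item, :].sum(axis=0)                                               # Eq. 3
        Phi = phi / (Q - 1)                                                           # Eq. 4
        Psi = psi / M                                                                 # Eq. 5
        if len(prev_units) == 0:
            chi = Psi ** v                        # Eq. 6, t = 1: BU evidence only
        else:
            chi = (Phi ** u) * (Psi ** v)         # Eq. 6, t > 1: the product sharpens the match
        return chi.reshape(Q, K)                  # one row of local supports per CM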

Figure 4: Ψ̂_{t,z} depicts a hypothetical vector (distribution) of normalized BU inputs (Ψ values) to the units of CM z at time t. The dashed ellipse represents a Ψ value of 1.0. Similarly, Φ̂_{t,z} depicts a vector of normalized H inputs for CM z. Ψ̂_{t,z} and Φ̂_{t,z} are exponentiated (see text) and then multiplied together, yielding a final combined local degree-of-support distribution, χ̂_{t,z}, over the units of CM z.

The model's ability to easily represent very long-distance temporal dependencies derives from the huge space of representations available for representing moments. In the small toy example model of Figure 3, whose layer 2 has only four CMs, each with four units, there are 4^4 = 256 possible L2 codes, which may not seem like all that much. However, the CM is envisioned as analogous to the cortical minicolumn (Mountcastle, 1957; Peters & Sethares, 1996), or more specifically, to that subset of a minicolumn corresponding to its principal representing cells, e.g., to a minicolumn's approximately 30 layer 2/3 pyramidals. Further, the model's L2 is envisioned as a patch of cortex on the order of a hypercolumn, which subsumes approximately 70 minicolumns. Such a situation, depicted in Figure 5, yields an astronomical number, i.e., 30^70, of unique codes. This is not to say that all 30^70 codes can be used.

Because the model's learning rule increases all weights from one L2 code to the next, complete saturation of the H weight matrix, and thus total loss of information, would occur long before all 30^70 codes were assigned. Just how large a fraction of the codes could be used while maintaining a given fraction of the total stored information is the issue of storage capacity. Although an analytic storage capacity result is not given here, empirical results for a simpler version of the model are given in the appendix. These results offer indirect evidence for substantial capacity, on the order of 0.135 bits/synapse, as the problem is scaled to arbitrarily large size.[5]

[5] The possibility of such a vast number of codes (e.g., 30^70) provides a possible answer to the binding problem. It is simply that there are enough unique codes available to explicitly represent the combinatorially large number of possible input configurations for reasonably constrained input spaces. The same basic idea was proposed in O'Reilly et al., 2003.

Figure 5: Diagram reflecting the approximate parameters of a hypercolumn-sized patch of cortex, e.g., about 70 minicolumns, where each minicolumn contains about 30 representing cells, i.e., the layer 2/3 pyramidals. The code shown (set of black cells) is therefore one out of a possible 30^70 codes.

In the next step of the learning algorithm, Eq. 7, the model simply finds the maximum χ value, χ_{t,z}, in each CM, z. Then, in Eq. 8, the model computes the maximum similarity, G, of the current moment, i.e., the current spatial input in the temporal context of the sequence leading up to the current input, to all previously experienced moments. That is, the model effectively evaluates the likelihoods of all stored hypotheses in parallel.

G is normalized between 0 and 1 and is defined as the average of the maximal local match values in each CM. It can be thought of as a global match (familiarity) signal. If G = 1, then the current moment is identical to some previously experienced moment. If G is close to 0, then the current moment is completely unfamiliar (novel).

    χ_{t,z} = max_{i ∈ CM_z} χ_t(i)                       (7)

    G = ( Σ_{z=1}^{Q} χ_{t,z} ) / Q                       (8)

In the last phase of the learning algorithm, the model uses the global familiarity measure, G, to nonlinearly transform the local degree-of-support values, χ, into final probabilities, ρ, of being chosen winner. Eqs. 9, 10 and 11 together define how G modulates, in a graded fashion, the character of that transform. The precise parameters of these equations are not particularly important: they provide various means of controlling generalization gradients and the average separation (i.e., Hamming distance) between codes. What is important is the overall effect achieved, i.e., that code overlap is an increasing function of spatiotemporal similarity.

    a = G^α / (1 - G^α + d·C)                             (9)

    ξ_t(i) = a / (1 + c·e^(-b·χ_t(i))) + 1/K              (10)

    ρ_t(i) = ξ_t(i) / Σ_{k ∈ CM(i)} ξ_t(k)                (11)

where K denotes the number of units in a CM. Once the final ρ-distribution is determined (Eq. 11), a choice of a single winner (in each CM) is made according to it. The behavior of the model in this last phase is best described in terms of three different regimes: G ≈ 1, G ≈ 0, and G somewhere in the middle. A G value close to 1 means that there is a particular previously experienced (stored) moment, σ*, that is extremely similar to the current (query) moment and therefore is causing the high G value. In this case, the model should reinstate σ*'s code, θ_σ*, with probability very close to 1.0.

This is achieved by adjusting the transform's parameters so that it is a highly expansive nonlinearity, as shown in Figure 6a. This effectively causes units with the highest χ values in their respective CMs to win with probability close to one, and therefore θ_σ*, as a whole, to be reinstated with probability close to one. In contrast, when G is near 0, the transform becomes a completely flat constant function (Figure 6b), which assigns equal probability of winning to all units in a module. This minimizes the average overlap of the set of winners (which is a new code) and all previously stored (known) codes. For middling G values ranging from 0 up to 1, the transform gradually morphs from the perfectly compressive constant function of Figure 6b into a highly expansive sigmoid as shown in Figure 6a. The higher the value of G, the larger the average overlap of the winning code and the closest matching stored code. Thus, this G-dependent gradual morphing of the transform directly yields the desired property that spatiotemporally similar moments map to similar codes.

[Figure 6 graphic. Panels: a) G ≈ 1.0; b) G ≈ 0.0. Horizontal axes: χ from 0 to 1; vertical axes: ρ.]

Figure 6: Mapping local similarity values (χ values) to final probabilities of being chosen winner (ρ values).
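The following Python sketch shows the overall effect of Eqs. 7-11: compute G from the per-CM maxima of χ, then convert χ into win probabilities whose peakedness grows with G. The particular amplitude and sigmoid parameters below (including centering the sigmoid at χ = 0.5) are placeholders of my own choosing; as noted above, the precise parameter settings are treated as unimportant, only the G-dependent morphing from a flat to an expansive transform.

    import numpy as np

    rng = np.random.default_rng(0)

    def choose_code(chi, b=20.0, gain=100.0):
        """chi: (Q, K) array of local supports, one row per CM; returns winners and G."""
        Q, K = chi.shape
        chi_max = chi.max(axis=1)                              # Eq. 7: maximal support per CM
        G = float(chi_max.mean())                              # Eq. 8: global familiarity in [0, 1]
        a = gain * G                                           # amplitude grows with G (role of Eq. 9)
        xi = a / (1.0 + np.exp(-b * (chi - 0.5))) + 1.0 / K    # sigmoid transform (role of Eq. 10)
        rho = xi / xi.sum(axis=1, keepdims=True)               # per-CM normalization (Eq. 11)
        winners = np.array([rng.choice(K, p=rho[z]) for z in range(Q)])
        return winners, G

    # When G ~ 0, a ~ 0, so rho is uniform in every CM and a maximally distinct, noisy
    # code is chosen; as G approaches 1, rho concentrates on the high-chi units, so the
    # code of the best-matching stored moment tends to be reinstated.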

Note that in the G ≈ 1 regime, the model is actually not in a learning mode. That is, because the choice of code depends on the product of H signals from the previously active code and BU signals from the input, exact reinstatement of a stored code implies that both the prior code and the current input are as they were on the past occasion that is so similar to the current moment (i.e., causing G ≈ 1). That means that there are no novel pairings, {i,j}, of active units, such that unit i is active on the prior time step and unit j is active on the current time step, and thus that there are no new opportunities for synaptic increases. Thus, no learning can occur. The expected number of novel pairings, and therefore the amount of learning that will occur, gradually increases as G drops to zero. Thus, G effectively automatically controls the model's movement along the continuum from learning to recalling/recognizing.

Figure 7 depicts two hypothetical situations corresponding to G ≈ 1 and G ≈ 0. There is a clear maximal unit in each CM in Figure 7a. However, note that even in Figure 7b, there will generally still be a maximal unit in each CM (although it might not be distinguishable in this picture). In general, early in the model's learning period (life), the set of maximal units, on each successive moment, will be identical to some previously stored code. This is true regardless of the value of G. This is why the model cannot simply choose the maximal unit in each CM as the final winner. Doing so would lead to assigning almost all moments to a tiny number of codes (even a single code), essentially losing almost all information about the data set.

Figure 7: a) A hypothetical instance where G ≈ 1. b) An instance where G ≈ 0.

This last point in the discussion of the learning algorithm bridges naturally to the discussion of retrieval, i.e., recall and recognition. For when G is near 1, it is safe, in fact, optimal, to simply choose the max unit in each CM to become active: the winner selection process becomes one that can be purely local and still be optimal.

By optimal, I mean that it retrieves the most similar stored moment to the current query moment, or, in other words, retrieves the maximum likelihood hypothesis. Thus, during recall and recognition, at each moment, the model simply activates the maximal unit in each CM. The recognition algorithm consists of Eqs. 2-6, followed by activating the max unit in each CM. The retrieval algorithm is slightly simpler because, following presentation of the first item (i.e., the prompt), only H signals are used to determine which L2 units become active at each successive moment. Hence, the retrieval algorithm consists of Eqs. 2-5, followed by Eq. 6′, and then by activating the maximal unit in each CM. Once the L2 code is activated, each L1 unit computes its total TD input. Those whose TD input exceeds a threshold, typically set slightly lower than the number of L2 CMs, Q, become active, thus replaying the original sequence at L1.

    χ_t(i) = Ψ_t(i),    t = 1                             (6′)
           = Φ_t(i),    t > 1
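A sketch of this retrieval step in Python (again under my own assumed array layout): pick the maximal-χ unit in each CM, then read the L1 pattern back out through the TD weights with a threshold just below Q.

    import numpy as np

    def retrieve_code(chi):
        """chi: (Q, K) local supports; return the index of the maximal unit in each CM."""
        return chi.argmax(axis=1)

    def read_out_L1(winners, W_TD, Q, K, margin=1):
        """Activate every L1 unit whose total TD input reaches at least Q - margin."""
        active_L2 = [z * K + w for z, w in enumerate(winners)]
        td_input = W_TD[active_L2, :].sum(axis=0)
        return np.flatnonzero(td_input >= Q - margin)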

An extended example follows, which illustrates two of the model's key properties: a) spatiotemporally similar moments map to similar codes, and b) the computation process for finding the closest matching stored moment does not depend on the number of moments (sequences) stored. This hypothetical example (Figure 8) shows the learning of the memory traces for four 2-item sequences by a small instance of the model. When the first item, A, of Sequence 1 presents, all BU weights are zero. All raw (ψ), and thus normalized (Ψ), BU input values are also zero. Thus, all local match (χ) values are zero, and thus, the global match (G) is zero. Accordingly, the χ values are mapped through a constant function, making all units in any given CM equally likely to win. The top panel shows a randomly chosen code, θ_A, for item A. Learning would occur at this point: the BU weights from the units of A to the units of θ_A would be increased to a weight of 1.

The second item, B, presents on the next time slice. Since B ∩ A = ∅, none of the BU weights from the units of B have been increased yet. Furthermore, no H weights have been increased yet. Since we are beyond the first time slice of the sequence, χ is computed as the product of normalized H and BU (Φ and Ψ) inputs. However, since both ψ and φ are zero for all L2 units, so are all Φ, Ψ, and χ values, and so is G. Again, all units end up being equally likely to win. Another randomly chosen code, θ_AB, is shown for item B (i.e., for moment [AB]). At this point, learning occurs in the H weight matrix from the previously active units of θ_A to the units of θ_AB, and in the BU matrix from the units of B to the units of θ_AB.

[Figure 8 graphic. Panels: Sequences 1-4, annotated with the overlaps of each first item with item A (|A ∩ C| = 0, |A ∩ D| = 1, |A ∩ E| = 2) and the resulting code overlaps with θ_A and θ_AB.]

Figure 8: Hypothetical example of the first four sequences learned by a model, illustrating the property that similar spatiotemporal patterns (sequences) map to similar internal representations (codes). Gray units denote intersections with θ_AB.

Next, Sequence 2 presents to the model. The first item, C, is unfamiliar. None of C's features have been present thus far. Again, G = 0; thus, another random code, θ_C, is chosen. Then the second item, B, presents. Item B has been experienced before, but in the context of a different predecessor. It seems reasonable that the model should remember this novel moment, [CB], as distinct from [AB]. Thus, it should choose a novel L2 code, θ_CB, having little (or at least, non-total) overlap with any previously chosen L2 code.

While [CB] clearly has some similarity to the previously experienced moment, [AB], all χ values would in fact be zero in this case (because the H inputs to all L2 cells would be zero). Thus, θ_CB is chosen randomly: let's assume that, by chance, it has one unit (the gray cell in the lower right CM) in common with θ_AB.

Now Sequence 3 presents. Item D has one feature in common with A (as well as one in common with B). Various L2 units will thus have non-zero ψ, and thus non-zero Ψ and χ values. However, none of the χ values will be close to 1; thus, G will be non-zero but closer to zero than to one. The χ-to-ρ transform will thus have some expansivity and therefore cause some bias favoring the units with higher χ values, though there will still be significant randomness (noise) in the final choice of winners. Consequently, we show the code, θ_D, as having one unit in common with θ_A. Then item B presents following D. This moment, [DB], has greater spatiotemporal similarity to [AB] than [CB] did. This increased similarity is manifest neurally in the unit that is common to θ_A and θ_D, which means that various L2 units will have non-zero φ, and therefore non-zero Φ, values at this moment. Item B is familiar: thus, various L2 units will have non-zero ψ, and thus Ψ, values. Finally, some L2 units will have a non-zero product, χ, of Φ and Ψ values; thus G will also be non-zero. Again, the transform will have some expansivity and thus cause some bias favoring units with higher χ values. Thus, we depict a code, θ_DB, having two units in common with θ_AB.

Finally, Sequence 4 presents. Its first item, E, is even more similar to A than D was (2 out of 3 features in common). Without belaboring further, I will just state that this increased similarity leads to a code, θ_E, having two units in common with θ_A, and that this in turn leads to a code, θ_EB, that has even more overlap with θ_AB than did θ_DB. Overall, this example shows how the model maps the spatiotemporal similarity of moments to the (spatial) similarity of codes. This is the crucial property of the model, which makes possible the O(1) retrieval complexity.

Figure 9 demonstrates the correctness of the model's retrieval strategy (i.e., that simply picking the max unit in each CM to win retrieves the most similar stored moment) for the first moment, [A], of Sequence 1. To show this, I introduce a variant of the G measure, the code-specific G, G(θ_x), which Eq. 12 defines as the average of the χ values for the units comprising a specific code, θ_x, rather than the average of the maximal χ value in each CM. The term θ_{x,z}, in Eq. 12, denotes the winning unit in CM z during moment x. Each panel of Figure 9 shows one of the eight L2 codes stored by the model during learning, along with item A at L1. The χ values, arising when item A presents, for each of the code's units are shown.

The average of those four χ values, i.e., G(θ_x), is shown at the top of each panel. Since this is the first time step of the retrieval event (query), the χ values equal the Ψ values (Eq. 6, with exponent v = 1).

    G(θ_x) = ( Σ_{z=1}^{Q} χ_t(θ_{x,z}) ) / Q             (12)

The main points here are: a) the code-specific G, G(θ_A), for the correct code, θ_A, is maximal; and b) the code-specific G's for the rest of the codes correlate roughly with the similarity between the moments that they represent and the query moment, [A]. For example, G(θ_A) > G(θ_E) > G(θ_D) > G(θ_C), and, by construction of the example, |A ∩ A| > |A ∩ E| > |A ∩ D| > |A ∩ C|. The code-specific G can be thought of as the relative likelihood of the hypothesis represented by the code.

[Figure 9 graphic. The eight panels show, for query [A]: G(θ_A) = 1.0, G(θ_AB) = .165, G(θ_C) = 0, G(θ_CB) = .25, G(θ_D) = .332, G(θ_DB) = .332, G(θ_E) = .665, G(θ_EB) = .082, with the per-unit χ values marked in each panel.]

Figure 9: The code-specific G values (relative likelihoods), when the query moment is [A], of all eight moments (hypotheses) stored in the model. Notice that the stored representation of [A], θ_A, has the highest code-specific G.
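In code, the code-specific G of Eq. 12 is just the mean of the χ values of the Q units that make up a particular stored code (a small illustrative sketch, not the paper's implementation):

    import numpy as np

    def code_specific_G(chi, code):
        """chi: (Q, K) array of local supports; code: stored winner index for each CM."""
        Q = chi.shape[0]
        return sum(chi[z, code[z]] for z in range(Q)) / Q

    # e.g., with Q = 2 CMs of K = 2 units:
    # code_specific_G(np.array([[1.0, 0.2], [0.3, 0.8]]), [0, 1]) == 0.9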

Figure 10 shows the code-specific G values for all eight codes (hypotheses) when the query moment is [AB]. Thus, this is a recognition test in which item A was presented on the prior time step and item B is now being presented. This figure reinforces the same two main points as Figure 9. The correct code, θ_AB, has the highest code-specific G, and the other codes' code-specific G's fall off roughly with spatiotemporal similarity to the query moment, [AB]. For example, G(θ_AB) > G(θ_EB) > G(θ_DB) > G(θ_CB). Again, by construction of the example, sim([AB],[AB]) > sim([AB],[EB]) > sim([AB],[DB]) > sim([AB],[CB]), where sim(σ_i, σ_j) is the simple spatiotemporal similarity metric, Eq. 13, that depends only on the total number of common features across all T time steps.

    sim(σ_i, σ_j) = Σ_{t=1}^{T} |σ_{i,t} ∩ σ_{j,t}|       (13)

[Figure 10 graphic. The eight panels show, for query [AB]: G(θ_A) = .082, G(θ_AB) = 1.0, G(θ_C) = .082, G(θ_CB) = .25, G(θ_D) = .75, G(θ_DB) = .665, G(θ_E) = .082, G(θ_EB) = .875, with the per-unit χ values marked in each panel.]

Figure 10: The code-specific G's, when the query moment is [AB] (i.e., when B is presenting right after A presented), for all eight stored moments.

Table 1 provides the code-specific G (relative likelihood) value for each of the eight stored moments (hypotheses) given each of those same moments as queries. The data in the first two rows come from Figures 9 and 10, respectively. The table demonstrates that the same properties shown in Figures 9 and 10 hold generally for the learning set as a whole.
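For concreteness, Eq. 13 amounts to the following, if each item is given as a set of active L1 feature indices (the data format here is my assumption):

    def sim(seq_i, seq_j):
        """Total number of features shared by two equal-length sequences, over all T steps."""
        return sum(len(a & b) for a, b in zip(seq_i, seq_j))

    # e.g., sim([{1, 2, 3}, {4, 5}], [{2, 3, 7}, {5, 6}]) == 3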

The most profound aspect of the model is that it finds the most similar stored moment without explicitly computing these code-specific G values. In fact, the model does not even have to compute G, i.e., Eq. 8, during recall and recognition (i.e., during perceptual inference). Because the way in which codes were chosen during learning maps the spatiotemporal similarity of moments into degrees of code overlap, and thus automatically and implicitly embeds the higher-order statistics (spatiotemporal similarity structure) of the sequence set in the model's weight matrices, simply choosing the maximal unit in each CM is guaranteed to reactivate the code of the most similar stored moment, provided the query moment is close enough to that stored moment.

Table 1: Code-specific G values (relative likelihoods) of all stored moments (hypotheses) given those same stored moments as queries.

                                      Hypothesis
    Query    θ_A      θ_AB     θ_C      θ_CB     θ_D      θ_DB     θ_E      θ_EB
    [A]      1.0      0.165    0        0.25     0.332    0.332    0.665    0.082
    [AB]     0.082    1.0      0.082    0.25     0.75     0.665    0.082    0.875
    [C]      0        0        1.0      0.25     0.75     0.665    0.082    0.875
    [CB]     0        0.25     0.5      1.0      0.25     0.25     0.25     0.25
    [D]      0.5      0.832    0.25     0.58     1.0      0.5      0.665    0.665
    [DB]     0.25     0.665    0.332    0.25     0.415    1.0      0.25     0.25
    [E]      0.83     0.25     0        0.332    0.5      0.332    1.0      0.165
    [EB]     0.082    0.75     0.082    0.25     0.5      0.665    0.082    1.0

What Table 1 does not demonstrate directly is that novel sequences (moments) that are perturbations (noisy versions) of the learning set sequences will be mapped to the spatiotemporally most similar stored moment. However, the demonstrated code-specific G gradients are, in some sense, the dual of that result. In any case, the simulation results in the following section do provide a direct demonstration of this property.

4. Simulation Results

The global familiarity, G, could be used as a signal that tells the model whether it is in storage (learning) or retrieval (recall, recognition) mode. However, the results reported here were produced using a simpler protocol in which the operational mode is mandated externally. Thus, these experiments have a learning phase, in which the set of sequences is presented one time each, and then a second, test phase that is either a recall test or a recognition test. Weights are frozen during the test phase.

In recall tests, the model is prompted with just the first item from one of the sequences that it learned.

query, or query momen. In his case, he model reurns he remainder of he sequence. This can be hough of as a demonsraion of episodic recall of a sequence. For recall ess, we repor he recall accuracy a boh layers. To compue he accuracy, R ( ) 2, a L2, we compare he code a ime during he recall es o he code ha occurred a ime during learning, as defined in Eq. 14. E() is he number of CMs in which he uni acivaed in recall differs from he winner during learning, i.e., errors. A sequence s overall L2 accuracy is he average L2 accuracy over all ime seps (iems) of he sequence, excluding he promp ime sep. The formula for L1 recall accuracy, R ( ) 1, is slighly more complicaed since, a L1, he wo kinds of errors, deleed feaures, D(), and inruding feaures I(), mus be kep rack of separaely. The L1 code reinsaed a ime during recall is compared o he L1 code ha occurred a ime during learning, as defined in Eq. 15. C() is he number of correc feaures a L1 a ime during recall. Q E( ) R2 ( ) = (14) Q C( ) D( ) R1 ( ) = C( ) + I( ) (15) In recogniion ess, he model is presened wih an enire sequence. This query sequence could be idenical o one of he learning sequences or i could be a perurbed version of one of he learning sequences. In he former case, he model should reinsae he same sequence of L2 codes (he memory race) ha occurred during he learning rial. In he laer case, he model should acivae, on each successive momen, he L2 code corresponding o he sored momen ha is spaioemporally mos similar o he curren query momen. This demonsraes recogniion (classificaion) of a sequence. For recogniion ess, we only repor he L2 accuracy because he L1 code is supplied from he environmen on all ime seps of a recogniion rial: hus, here is no noion of an error a L1 in a recogniion es. The recogniion accuracy a L2 is also compued using Eq. 14. In his case, he comparisons are made wih he sored sequence from which he perurbed es sequence was generaed. Several parameers are common o all hree simulaion experimens. In all cases, he nework s L1 consised of 100 binary feaure deecors; he L1 and L2 unis were compleely conneced in boh he BU and TD direcion; and he L2 unis were compleely conneced via he horizonal weigh marix. All weighs are binary and iniialized o zero. 23

Experiment 1

Experiment 1 provides a small proof-of-concept demonstration that a single network can learn a set of sequences with single trials and subsequently both recall all of the individual sequences essentially perfectly and also recognize substantially perturbed (very noisy), and thus novel, instances of the learning set sequences. The learning set, shown in Figure 11, contained five sequences, each having five items. Each item consists of ten randomly selected features. Thus, there are 25 unique moments. L2 consisted of eight CMs, each with 10 units. The recall accuracy for this set was nearly perfect: 98.18% at L2 and 100% at L1. Of the 400 reactivations of L2 units that occur over the course of recalling all five sequences, only eight errors occurred.

Figure 11: The five randomly created sequences comprising the learning set for Experiment 1. Item (time step) indices are across the top.

Figure 12 shows the complete learning and recognition traces, at both L1 and L2, for the first sequence. The top row of each trace shows the L2 codes. Each row of small gray squares in an L2 grid corresponds to a CM, and each individual square in the row, to a single unit within that CM. Black squares are the winners. Notice that the trace includes two sub-steps for each item of the sequence: e.g., item A is actually presented twice, at sub-steps 0 and 1, item B is presented at sub-steps 2 and 3, etc. This reflects the computer simulation's property that signals originating from a given input item are processed up through all layers (in this case, there are only two) before the next item is presented. Thus, on the first sub-step of each item, the input pattern itself becomes active at L1 and, on the second sub-step, the BU signals from that L1 code reach L2 and activate an L2 code. Notice also that the L2 codes also stay active for two sub-steps, although they are delayed one sub-step from the L1 codes. This is the more complex representation of time alluded to in an earlier footnote, and it entails modifying Eq. 3 to the following.

The only difference is that the summation is over signals arriving from L1 units active on the prior sub-step, i.e., the set ϒ_{t-1}, rather than the current sub-step.

    ψ_t(i) = Σ_{k ∈ ϒ_{t-1}} w(k, i)                      (3′)

[Figure 12 graphic. Labels: items 1 (A) through 5 (E) across the top; sub-steps 0-9; a Learning trace (L2 and L1 rows) above a Recognition trace (L2 and L1 rows), with per-sub-step R_2 scores of 87.5, 100, 100, 100, 100, 100, 100, 100, 100.]

Figure 12: (Top) The complete learning trace for Sequence 1 for both L1 and L2. (Bottom) The complete recognition trace for a perturbed version of Sequence 1, which had 40% (4 out of 10) of its features changed on each item. The numbers below the L2 codes in the recognition trace are R_2 scores (percentages). The R_2 value of 87.5% on sub-step 1 means that the correct winner was reinstated in 7 out of the 8 CMs. See text for further discussion.

Recognition accuracy was also extremely good: 94.78% averaged over all five perturbed sequences. Note that only the final (in this case, 2nd) sub-step of each item is included in the computation of a sequence's recognition accuracy. The intuition here is that as long as the recognition L2 code tracks the original (learning) L2 code on the item timescale (as opposed to the finer sub-step timescale), the model should be seen as demonstrating recognition. These five sequences had a 40% perturbation rate: i.e., 4 out of the 10 features comprising each of the learning items were randomly changed. This very substantial degree of perturbation can be seen by visually comparing two L1 codes [the original learning item (top) and its perturbed variant (bottom)], as in the vertically connected boxes on the right in Figure 12.
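For reference, a perturbed test item of the kind used here can be generated by swapping a fixed fraction of an item's active features for randomly chosen inactive ones; the sketch below (mine, with assumed parameter names) does this for the 40% rate of Experiment 1.

    import random

    def perturb_item(item, n_features=100, rate=0.4, seed=0):
        """Replace round(rate * |item|) of the item's active features with new ones."""
        rng = random.Random(seed)
        item = set(item)
        n_swap = int(round(rate * len(item)))
        dropped = set(rng.sample(sorted(item), n_swap))
        candidates = sorted(set(range(n_features)) - item)
        added = set(rng.sample(candidates, n_swap))
        return (item - dropped) | added

    # e.g., perturbing a 10-feature item at rate 0.4 changes 4 of its 10 features.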

This trace, typical of the whole set, shows that despite the constant extreme perturbation (noise) on every item of the recognition test sequence, the model reactivates the exact L2 code of the most similar stored moment on each sub-step throughout the whole recognition test sequence, except for one error on sub-step 1 (see the vertically connected boxes on the left; the error is in the CM represented by the top row of the L2 grid). That is, the model co-classifies the variant instance of Sequence 1 with the most similar stored sequence, which is Sequence 1.

Neither the learning algorithm (Eqs. 1-11) nor the retrieval algorithms (Eqs. 2-6 and Eqs. 2-6′) depend on the number of sequences (or moments) previously stored. Therefore, both learning and retrieval time are constant; i.e., O(1) time complexity.

Experiment 2

The second experiment provides a larger example. In this case, there were 10 sequences, each with 10 items. However, this network's L2 is much larger: 9 CMs with 26 units each. Again, recall accuracy was virtually perfect: 99.05% at L2 and 100% at L1. The set of learning sequences, as a whole, contained 4616 bits of information and the network had 95,472 weights, yielding an information storage rate of 0.0483 bits/synapse. While this is about an order of magnitude below the theoretical maximum for sparse binary associative memories, these experiments are intended as a simple proof of concept. Numerous modifications and optimizations are possible and will be explored in the future (some are briefly described in the discussion section and appendix). Recognition accuracy was also extremely good: 94.78% averaged over all ten perturbed sequences. These sequences had a 30% perturbation rate: i.e., 3 out of the 10 features comprising each of the learning items were randomly changed. As in Experiment 1, learning and retrieval are O(1). In this case, there are 10 x 10 = 100 stored moments whose representations compete at each query moment.

5. Future Work

As presented herein, the model has a significant problem learning a large set of sequences in which many sequences start with the same item or prefix of items, e.g., all the words of some natural language. The problem is not that the model has any intrinsic difficulty dealing with a set of complex sequences, i.e., sequences in which items can recur many times and in many different contexts. As the example of Section 3 shows, the choice of representation for an item depends on the prior context. Thus, there are four different codes for item B in that example. In general, this allows the model to move through a common state to the correct successor state. Rather, the problem is simply that far too small a set of L2 codes will ever be used to represent the initial moment of sequences in domains such as a natural language lexicon. For example, consider encoding the