Summary and Conclusion
Humans gather a lot of information about the environment via vision and language. Moreover, we can easily combine these two modalities. For example, when I ask you to look for your wallet, you are able to find this object amongst multiple others. Somehow you have managed to link the word "wallet" to the actual object in the visual environment and to select this object amongst competing information. But how do we perform such a task? More specifically, how do we integrate vision and language in order to select objects in the environment?

The dominant view within the visual attention literature is that selection is mainly visually driven. So when you are instructed to look for your wallet, you will imagine what your specific wallet looks like. In other words, you create a visual representation of your wallet. This visual representation is then used to find the matching object in the visual environment. Visual representations are, however, not the only type of representation that can be activated. Words, of course, have meaning, and the same is true for visual objects. For example, when you see or hear the word "wallet" you will also know that this object is meant to hold money, that you need it when you go out shopping and that it is normally stored inside a bag or in your pocket. This type of information is called semantic information, and people also activate semantic representations of visual objects or words. This means that in principle selection could also be semantically driven.

The main question of this thesis is: when are objects selected on the basis of visual representations (like shape, texture and color), and when are semantic representations (like function and meaning) more relevant for selection? This question was investigated in multiple experiments in which participants had to indicate whether a certain object was present or absent in a visual display.
Instructions were presented auditorily either before or after the visual display was presented. Some displays contained a picture of the correct referent (target) amongst pictures of other objects. However, in other displays, the crucial ones, the target was absent; these displays instead included pictures of objects that were either semantically or visually related to the spoken word (amongst several unrelated objects). For example, when people had to look for a banana, the search display contained a picture of a monkey (semantically related), a canoe (visually related), and a hat and a tambourine (unrelated). Eye movements towards these different objects provided the main measure. Specifically, it was measured how much time
people spent fixating each object within a specific time period. This is called proportion fixation time (or P(fix) for short). P(fix) can be plotted over multiple time periods, showing clearly how fixation preferences change over time. Visually related objects will receive more fixations than unrelated objects when people have a more visual representation of the target. This is called a visual bias. However, when the target is represented more semantically, people will orient more often towards semantically related objects than towards unrelated objects (i.e., a semantic bias). Thus, these fixation preferences over time give an indication of the representation that is currently most strongly activated.

None of these experiments could have been conducted without an extensive and properly controlled stimulus set. Such a set was lacking in the literature, and therefore Chapter 2 introduces a new stimulus set. This set consists of 100 word-picture pairs. Each word is paired with a semantically related object, a visually related object and two unrelated objects. A semantic relationship indicates that the pictured object and the object the word refers to share something in meaning or function, whereas a visual relationship implies shared visual features. Because both semantic and visual relationships are included, researchers can contrast semantic and visual influences on selection. The high number of word-picture pairs makes within-participants designs possible. The set is ecologically valid as it contains photos of real-life objects. Moreover, it is controlled for an extensive range of properties. First, word-picture pairs were matched on only one type of relationship: semantic, visual, or none at all. Second, the semantically related, visually related and unrelated pictures did not differ overall on visual and linguistic factors such as luminance, visual complexity and naming agreement.
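To make the proportion fixation time measure concrete, the following is a minimal Python sketch of how P(fix) per object type and time bin could be computed. It is an illustration under stated assumptions, not the analysis code used in the thesis: the fixation format `(start_ms, end_ms, label)`, the function name `proportion_fixation_time`, the choice of the bin duration as denominator, and the definition of a bias as the difference in P(fix) between a related and an unrelated object are all hypothetical.

```python
from collections import defaultdict


def proportion_fixation_time(fixations, bin_edges):
    """Return {label: [P(fix) per time bin]} for one trial.

    fixations: iterable of (start_ms, end_ms, label) tuples (hypothetical format).
    bin_edges: ascending bin boundaries in ms, e.g. [0, 200, 400].
    P(fix) is taken here as fixation time on an object within a bin divided
    by the bin duration (an assumption about the denominator).
    """
    n_bins = len(bin_edges) - 1
    totals = defaultdict(lambda: [0.0] * n_bins)
    for start, end, label in fixations:
        for i in range(n_bins):
            lo, hi = bin_edges[i], bin_edges[i + 1]
            # Part of this fixation that falls inside the current bin
            overlap = min(end, hi) - max(start, lo)
            if overlap > 0:
                totals[label][i] += overlap
    return {
        label: [t / (bin_edges[i + 1] - bin_edges[i]) for i, t in enumerate(times)]
        for label, times in totals.items()
    }


# Made-up fixations for a single target-absent trial
trial = [
    (0, 150, "visual"),
    (150, 300, "semantic"),
    (300, 400, "unrelated"),
]
pfix = proportion_fixation_time(trial, [0, 200, 400])

# Bias sketched as related minus unrelated P(fix), per time bin (assumption)
visual_bias = [v - u for v, u in zip(pfix["visual"], pfix["unrelated"])]
semantic_bias = [s - u for s, u in zip(pfix["semantic"], pfix["unrelated"])]
```

In this made-up trial the visual bias is positive in the first bin and negative in the second, which is the kind of change over time that plotting P(fix) across bins is meant to reveal. In practice the values would be averaged over trials and participants before any comparison.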
The stimulus set of Chapter 2 was extended with 20 additional trials for the studies reported in the later chapters. Additionally, to introduce a target absent-present task, another 120 word-picture pairs were created in which the word actually referred to an object in the display.

Chapter 3 tested the prediction that the order of stimulus presentation determines whether selection is more visually or more semantically driven. This prediction was directly derived from the cascaded activation model of visual-linguistic interactions. This model states that seeing a picture initially leads to a visual representation, as pictures are visual in nature, and only later to a semantic representation. For spoken words the information flow is different: here people are confronted with the phonological structure first (i.e.,
the sounds), and therefore a phonological representation will be activated before the associated semantic and visual representations. Thus when the visual display is presented before the word, people have enough time to activate both visual and semantic representations of the pictures. However, when the pictures are presented after the word, the visual representations of the pictures will be activated before the semantic representations. Thus in the first condition the visual and semantic biases are expected to arise around the same time, whereas in the second the visual bias is expected to arise earlier than the semantic bias. The results of Chapter 3 matched these predictions. When the word preceded the pictures, there was indeed a visual dominance, but this dominance disappeared when the pictures preceded the word. Importantly, substantial semantic biases were observed in both conditions, even when they arose later in time. In the second experiment of Chapter 3 it was investigated whether the timing of the semantic bias was influenced by the presence of a visually related object. The results showed that the temporal dynamics of the relative semantic bias were the same with or without visual competition (i.e., it did not matter whether there was also a visually related object in the display). Visual orienting is thus driven by priority settings that dynamically shift between visual and semantic representations, with each of these types of bias operating largely independently.

Another interesting question is whether visual and semantic representations are activated automatically or whether people have some form of control over them, so that representations are only activated when they are actually needed for the current task. In Chapter 4 this was investigated by manipulating the relevance of the spoken word. Participants always memorized the word for a subsequent verbal recognition task.
In the meantime, during the retention period, they performed a task in which they had to search either for the object named by the spoken word (i.e., the word was relevant for the task) or for another object (i.e., the spoken word was irrelevant for the task). In the relevant condition it is beneficial to activate visual and semantic representations of the word, as these are needed to find the target. However, in the irrelevant condition this is not needed. Moreover, it might even harm task performance. Logically, if people have some form of control over the activation of visual and semantic representations, they will activate these representations to a lesser extent in the irrelevant than in the relevant condition. Visual and semantic biases would thus be reduced in the
condition where the word is irrelevant for search. In particular, people are expected to rely less on visual representations, as these are not needed for the memory task. However, if the spread of activation is more automatic, then visual and semantic representations would be activated regardless of whether the word is relevant or irrelevant for the search task. In that case results should be similar for both conditions. Chapter 4 reports clearly different results for the relevant and irrelevant conditions. Overall, the biases were much reduced when the word was irrelevant for search. But more importantly, the relative balance between visual and semantic biases differed between the two conditions. When the word was relevant for search, there was a visual dominance, replicating the findings of Chapter 3. But in the condition where the word was irrelevant for search, the semantic and visual biases arose around the same time. It thus seems that people have some form of control over the activation of visual and semantic representations. Note that cognitive control has traditionally been considered a key feature of working memory. The results of Chapter 4 are thus in line with a recently proposed working memory hypothesis. Similar to the cascaded activation model, this hypothesis states that seeing an object or hearing a word leads to the activation of different representations, but it additionally makes explicit that working memory is needed to bind this long-term knowledge to temporary information about the environment (i.e., the location of the objects).

In the previous chapters the stimuli were always visually present when people performed the search. However, in Chapter 5, the visual stimuli were removed before participants received the target instruction. Here people had to search their memory to decide whether the object had been present or absent.
Previous research has shown that in such a condition people will still make eye movements towards locations previously occupied by the referred-to objects. So after hearing the word "rabbit", people would fixate the location where a picture of a rabbit was presented earlier. Note that in the rabbit example the word and the picture match at all levels of representation. In Chapter 5 it was investigated whether these looks-to-nothing could also be observed when words were only semantically or visually related to a previously shown object. In all three experiments, there were biases towards locations previously occupied by the target (i.e., the objects that matched the word at all levels of representation), replicating earlier work. Furthermore, in two of the three experiments there were also biases towards locations previously occupied by
visually or semantically related pictures. These biases were weaker than those towards the actual target, but the bias towards the target had the same pattern over time as the biases towards the semantically and visually related objects (with a high correlation between the semantic and visual biases as well). Thus, visual and semantic representations can also guide eye movements in memory search, but less strongly than when the visual stimuli are actually present.

The data of Chapter 3 and Chapter 4 suggest that semantically related objects can attract attention. More importantly, these objects seem to do so even when placed outside foveal vision (i.e., the central area of vision that perceives great detail). However, this is a controversial claim in the scene processing literature, as some researchers have argued that this is impossible. Therefore, the data of Chapter 3 and Chapter 4 were re-analyzed in Chapter 6. Rather than looking at orienting biases over time, three measures of attentional capture were assessed, namely the latency to first fixation on the object (i.e., the time elapsed between the onset of the display and the first fixation on the object), the probability of first fixation and the mean amplitude of the first incoming saccade. The results showed that the latency to first fixation was shorter for semantically related objects than for unrelated objects. Additionally, the probability of first fixation was higher for semantically related objects than for unrelated objects. Semantically related objects were thus initially selected more often than unrelated objects. More importantly, the amplitudes of the first incoming saccades were large, indicating that people selected the objects on the basis of extrafoveal vision. Overall, these re-analyses confirm that semantics can capture attention immediately. It seems that, under some circumstances, extrafoveal semantic processing can occur, for example when objects are presented alone rather than in a scene.
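The three measures of attentional capture used in the re-analysis can be sketched in Python as follows. This is only an illustration under assumptions, not the thesis's analysis pipeline: the trial format (a chronological list of `(onset_ms, x, y, label)` fixations, with onsets relative to display onset and the first entry being the fixation held at display onset), the function names, the reading of "probability of first fixation" as the proportion of trials in which the first fixation after display onset lands on the object, and the use of pixel distance for saccade amplitude are all hypothetical.

```python
import math


def _mean(values):
    """Arithmetic mean, or NaN for an empty list."""
    return sum(values) / len(values) if values else float("nan")


def first_fixation_measures(trials, label):
    """Summarise three first-fixation measures for one object type.

    trials: list of trials; each trial is a chronological list of
    (onset_ms, x, y, label) fixations (hypothetical format, see lead-in).
    """
    latencies, amplitudes, first_hits = [], [], 0
    for fixations in trials:
        # Did the first fixation made after display onset land on the object?
        if len(fixations) > 1 and fixations[1][3] == label:
            first_hits += 1
        # Find the first fixation on the object and the saccade leading to it
        for prev, cur in zip(fixations, fixations[1:]):
            if cur[3] == label:
                latencies.append(cur[0])
                amplitudes.append(math.hypot(cur[1] - prev[1], cur[2] - prev[2]))
                break
    return {
        "mean_latency_ms": _mean(latencies),
        "p_first_fixation": first_hits / len(trials) if trials else float("nan"),
        "mean_amplitude_px": _mean(amplitudes),
    }


# Two made-up trials, each starting from a central fixation at (0, 0)
trials = [
    [(0, 0, 0, "centre"), (200, 300, 400, "semantic")],
    [(0, 0, 0, "centre"), (180, 100, 0, "unrelated"), (400, 400, 400, "semantic")],
]
semantic = first_fixation_measures(trials, "semantic")
```

Comparing such summaries for semantically related versus unrelated objects corresponds to the contrasts described above: shorter latencies and a higher probability of first fixation would indicate earlier selection, and large incoming-saccade amplitudes would indicate that the object was selected from extrafoveal vision.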
Taken together, the current thesis shows that the question of whether visual orienting is driven by semantics is better rephrased as the question of when semantics drive visual orienting. All experiments reported in this thesis consistently show that both visual and semantic representations influence selection, independently of each other. But whether selection is more visually or more semantically driven is modulated by multiple factors, specifically temporal parameters and task requirements.