Chapter Ten

Interacting with Visualizations

The kinds of interactions we discuss in this chapter are what Kirsh and Maglio (1994) called epistemic actions. An epistemic action is an activity intended to uncover new information. A good visualization is not just a static picture or a three-dimensional (3D) virtual environment that we can walk through and inspect like a museum full of statues. A good visualization is something that allows us to drill down and find more data about anything that seems important. Ben Shneiderman (1998) developed a mantra to guide visual information-seeking behavior and the interfaces that support it: “Overview first, zoom and filter, then details on demand.” In reality, however, we are just as likely to see an interesting detail, zoom out to get an overview, find some related information in a lateral segue, and then zoom in again to get the details of the original object of interest. The important point is that a good computer-based visualization is an interface that can support all of these activities. Ideally, every data object on a screen will be active and not just a blob of color. It will be capable of displaying more information as needed, disappearing when not needed, and accepting user commands to help with the thinking processes.

Interactive visualization is a process made up of a number of interlocking feedback loops that fall into three broad classes. At the lowest level is the data manipulation loop, through which objects are selected and moved using the basic skills of eye–hand coordination. Delays of even a fraction of a second in this interaction cycle can seriously disrupt the performance of higher level tasks. At the intermediate level is an exploration and navigation loop, through which an analyst finds his or her way in a large visual data space. As people explore a new town, they build a cognitive spatial model using key landmarks and paths between them; something similar occurs when they explore data spaces. In the case of navigating data spaces, the time taken to get to a new vantage to find a particular piece of information is a direct cost of knowledge. Faster navigation means more efficient thinking. At the highest level is a problem-solving loop through which the analyst forms hypotheses about the data and refines them through an augmented visualization process. The process may be repeated through multiple visualization cycles as new data is added, the problem is reformulated, possible solutions are identified, and the visualization is revised or replaced. Sometimes the visualization may act as a critical externalization of the problem, forming a crucial extension of the cognitive process. This chapter deals with two of the three loops: low-level interaction and exploration. General problem solving is discussed in Chapter 11.

Data Selection and Manipulation Loop

A number of well established “laws” describe the simple, low-level control loops needed in tasks such as the visual control of hand position or the selection of an object on the screen.

Choice Reaction Time

Given an optimal state of readiness, with a finger poised over a button, a person can react to a simple visual signal in about 130 msec (Kohlberg, 1971). If the signals are very infrequent, the time can be considerably longer. Warrick et al. (1964) found reaction times as long as 700 msec under conditions where there could be as much as two days between signals. The participants were engaged in routine typing, so they were at least positioned appropriately to respond. If people are not positioned at workstations, their responses will naturally take longer.

Sometimes, one must make a choice before reacting to a signal. A simple choice-reaction-time task might involve pressing one button if a red light goes on and another if a green light goes on. This kind of task has been studied extensively. It has been discovered that reaction times can be modeled by a simple rule called the Hick–Hyman law for choice reaction time (Hyman, 1953). According to this law,

$\text{Reaction time} = a + b\,\log_2(C)$  (10.1)

where C is the number of choices, and a and b are empirically determined constants. The expression log2(C) represents the amount of information processed by the human operator, expressed in bits of information.

Many factors have been found to affect choice reaction time—the distinctness of the signal, the amount of visual noise, stimulus–response compatibility, and so on—but under optimal conditions the response time per bit of information processed is about 160 msec, plus the time to set up the response. Thus, if there are eight choices (3 bits of information), the response time will typically be on the order of the simple reaction time plus approximately 480 msec. Another important factor is the degree of accuracy required. People respond faster if they are allowed to make mistakes occasionally, and this effect is called a speed–accuracy tradeoff. For a useful overview of factors involved in determining reaction time, see Card et al. (1983).
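
To make the arithmetic concrete, here is a minimal sketch of Equation 10.1 in Python; the constants a and b are illustrative assumptions (a base time of about 130 msec and roughly 160 msec per bit, as discussed above), not measured values.

```python
import math

def hick_hyman_rt(num_choices, a=0.13, b=0.16):
    """Predicted choice reaction time (seconds) under the Hick-Hyman law.

    a: time to set up the response, in seconds (assumed ~ simple reaction time).
    b: seconds per bit of information processed (~160 msec per bit).
    """
    bits = math.log2(num_choices)  # information content of the choice, in bits
    return a + b * bits

# Eight choices carry 3 bits, adding about 480 msec to the simple reaction time.
print(f"{hick_hyman_rt(8):.2f} s")  # 0.13 + 0.16 * 3 = 0.61 s
```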

Two-Dimensional Positioning and Selection

In highly interactive visualization applications, it is useful to have graphical objects function not only as program output—a way of representing data—but also as program input—a way of finding out more about data. Selection using a mouse or similar input device (such as a joystick or trackball) is one of the most common interactive operations, and it has been extensively studied. A simple mathematical model provides a useful estimation of the time taken to select a target that has a particular position and size:

$\text{Time} = a + b\,\log_2(D/W + 1.0)$  (10.2)

where D is the distance to the center of the target, W is the width of the target, and a and b are constants determined empirically; see Figure 10.1(a). These are different for different devices.

Figure 10.1 (a) A simplified reaching task, where the red cursor must be moved into the beige rectangle. (b) The visually guided reaching control loop, where the human processor makes adjustments based on visual feedback from the computer.

This formula is known as Fitts’ law, after Paul Fitts (1954). The term log2(D/W + 1.0) is known as the index of difficulty (ID). The value 1/b is the index of performance (IP) and is given in units of bits per second. There are a number of variations in the index-of-difficulty expression, but the one given here is the most robust (MacKenzie, 1992). Typical IP values measured for movements made with the fingertip, the wrist, and the forearm are all in the vicinity of 4 bits per second (Balakrishnan & MacKenzie, 1997). To put this into perspective, consider moving a cursor 16 cm across a screen to a 0.5-cm target. The index of difficulty will be about 5 bits. The selection will take more than a second longer than selecting a target that is already under the cursor.
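
The worked example above follows directly from Equation 10.2. The sketch below assumes illustrative constants: a is a small setup time, and b = 0.25 s/bit corresponds to an index of performance of 4 bits per second.

```python
import math

def fitts_time(distance, width, a=0.1, b=0.25):
    """Predicted selection time (seconds) under Fitts' law.

    distance, width: target distance and width in the same units (e.g., cm).
    a: setup time in seconds (assumed for illustration).
    b: 1/IP; 0.25 s/bit corresponds to an index of performance of 4 bits/s.
    """
    index_of_difficulty = math.log2(distance / width + 1.0)  # bits
    return a + b * index_of_difficulty

# Moving 16 cm to a 0.5 cm target: ID = log2(33), about 5 bits, so the
# selection takes roughly 1.25 s longer than a target already under the cursor.
print(f"{fitts_time(16.0, 0.5):.2f} s")
```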

Fitts’ law can be thought of as describing an iterative process of eye–hand coordination, as illustrated in Figure 10.1(b). The human starts by judging the distance to the target and initiates the hand movement. On successive iterations, a corrective adjustment is made to the hand movement based on visual feedback showing the cursor position. Greater distances and smaller targets both result in more iterations. The logarithmic nature of the relationship derives from the fact that, on each iteration, the task difficulty is reduced in proportion to the remaining distance.

In many of the more complex data visualization systems, as well as in experimental data visualization systems using 3D virtual-reality (VR) technologies, there is a significant lag between a hand movement and the visual feedback provided on the display (Liang et al., 1991; Ware & Balakrishnan, 1994). Fitts’ law, modified to include lag, looks like this:

$\text{Time} = a + (\text{HumanTime} + \text{MachineLag})\,\log_2(D/W + 1.0)$  (10.3)

where HumanTime is the human processing time and MachineLag is the time the computer takes to update the display based on user input. According to this equation, the effects of lag increase as the target gets smaller. Because of this, a fraction of a second of lag can result in a subject taking several seconds longer to perform a simple selection task. This may not seem like much, but in a VR environment intended to make everything seem easy and natural, lag can make the simplest task difficult. Fitts’ law is part of an International Standards Organization standard (ISO 9241-9) that sets out protocols for evaluating user performance and comfort when using pointing devices with visual display terminals. It is invaluable as a tool for evaluating potential new input devices.
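
A sketch of the lag-modified form in Equation 10.3 shows why even a modest machine lag matters, especially for small targets; HumanTime and the lag value below are assumptions chosen for illustration.

```python
import math

def fitts_time_with_lag(distance, width, a=0.1,
                        human_time=0.25, machine_lag=0.0):
    """Selection time (seconds) when display lag multiplies the index of difficulty."""
    index_of_difficulty = math.log2(distance / width + 1.0)  # bits
    return a + (human_time + machine_lag) * index_of_difficulty

# For the 16 cm / 0.5 cm task (about 5 bits), a 0.3 s lag adds roughly 1.5 s.
print(f"{fitts_time_with_lag(16.0, 0.5):.2f} s")                   # no lag
print(f"{fitts_time_with_lag(16.0, 0.5, machine_lag=0.3):.2f} s")  # 0.3 s lag
```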

Hover Queries

The most common kind of epistemic action with a computer is done by dragging a cursor over an object and clicking the mouse button. The hover query dispenses with the mouse click. Extra information is revealed about an object when the mouse cursor passes over it. Usually it is implemented with a delay; for example, the function of an icon is shown by a brief text message after hovering for a second or two. However, a hover query can function without a delay, making it a very rapid way of getting additional information. This enables the mouse cursor to be dragged over a set of data objects, quickly revealing the data contents and perhaps allowing an interactive query rate of several per second in special circumstances.

[G10.1] For the fastest epistemic actions, use hover queries, activated whenever the mouse cursor passes over an object. These are only suitable where the query targets are dense and inadvertent queries will not be overly distracting.
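
Implementing a hover query amounts to running a hit test on every cursor-move event rather than waiting for a click. The sketch below is framework independent and uses hypothetical object records with bounding boxes; the optional delay parameter reproduces the tooltip-style behavior described above.

```python
def hover_query(cursor_x, cursor_y, objects, hover_time=0.0, delay=0.0):
    """Return the data record under the cursor, or None.

    objects: iterable of dicts with 'bounds' = (x0, y0, x1, y1) and 'data'.
    hover_time: how long the cursor has rested over its current position (s).
    delay: 0.0 gives an immediate query; 1-2 s gives tooltip-style behavior.
    """
    if hover_time < delay:
        return None
    for obj in objects:
        x0, y0, x1, y1 = obj["bounds"]
        if x0 <= cursor_x <= x1 and y0 <= cursor_y <= y1:
            return obj["data"]  # reveal the object's extra information
    return None
```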

Path Tracing

Fitts’ law deals with single, discrete actions, such as reaching for an object. Other tasks, such as tracing a curve or steering a car, involve continuous ongoing control. In such tasks, we are continually making a series of corrections based on visual feedback about the results of our recent actions. Accot and Zhai (1997) made a prediction, based on Fitts’ law, that applies to continuous steering tasks. Their derivation revealed that the speed at which tracing could be done should be a simple function of the width of the path:

$v = W/\tau$  (10.4)

where v is the velocity, W is the path width, and τ is a constant that depends on the motor control system of the person doing the tracing. In a series of experiments, the researchers found an almost perfect linear relationship between the speed of path following and the path width, confirming their theory. The actual values of τ lay between .05 and .11 sec, depending on the specific task. To make this more concrete, consider the problem of tracing a pencil along a 2-mm-wide path. Their results suggest that this will be done at a rate of between 1.8 and 4 cm/sec.
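
Equation 10.4 can be checked directly against the pencil-tracing example; the τ values below are the empirically reported range.

```python
def tracing_speed(path_width_cm, tau_seconds):
    """Maximum tracing speed (cm/s) predicted by the steering law: v = W / tau."""
    return path_width_cm / tau_seconds

# A 2 mm (0.2 cm) path, with tau between 0.05 and 0.11 s:
print(tracing_speed(0.2, 0.11))  # ~1.8 cm/s
print(tracing_speed(0.2, 0.05))  # 4.0 cm/s
```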

Two-Handed Interaction

In most computer interfaces, users select and move graphical objects around the screen with a mouse held in one hand, leaving the other hand unoccupied, but when interacting in the everyday world we frequently use both our hands. This leads us to the question of how we might improve the computer interface by taking advantage of both hands (Buxton & Myers, 1986).

The most important principle that has been discovered relating to the way tasks should be allocated to the two hands is Guiard’s kinematic chain theory (Guiard, 1987). According to this theory, the left hand and the right hand form a kinematic chain, with the left hand providing a frame of reference for movements with the right, in right-handed individuals. For example, if we sculpt a small object out of modeling clay, we are likely to hold it in the left hand and do the detailed shaping with the right. The left hand reorients the piece and provides the best view, whereas the right pokes and prods within that frame of reference.

Interface designers have incorporated this principle into superior interfaces for various tasks (Bier et al., 1993; Kabbash et al., 1994). In an innovative computer-based drawing package, Kurtenbach et al. (1997) showed how templates, such as the French curve, could be moved rapidly over a drawing by a designer using his left hand while using his right hand to paint around the shape.

[G10.2] When designing interfaces for two-handed data manipulations, the non-dominant hand (usually the left) should be used to control frame-of-reference information, while the dominant hand (usually the right) should be used to make detailed selections or manipulations of data.

Another beneficial use of the left hand is in positioning tools for easy access. In interactive drawing packages, users spend a lot of time moving between the drawing and various menus positioned off to the side of the screen. The toolglass and magic lens approach, developed by Bier et al. (1993), got around this problem by allowing users to use the left hand to position tool palettes and the right hand to do normal drawing operations. This allowed for very quick changes in color or brush characteristics. As an additional design refinement, they also made some of the tools transparent (hence toolglasses).

In an application more relevant to information visualization, Stone et al. (1994) developed the magic lens idea as a set of interactive information filters implemented as transparent windows that the user can move over an information visualization with the left hand. The magic lens can be programmed to be a kind of data X-ray, revealing normally invisible aspects of the data. Figure 10.2, for example, shows a map with land use patterns in the area around Charlotte, North Carolina (Butkiewicz et al., 2010). The magic lens reveals population density in the area it covers. A good interface for selections to be made based on population density would use the right hand, in a conventional way, to control a cursor to “click through” the magic lens, while the left hand would be used to position the lens.

Figure 10.2 A magic lens showing population density with a background map showing land-use patterns. From Butkiewicz et al. (2010). Courtesy of Tom Butkiewicz

Learning

Over time, people become more skilled at any task, barring fatigue, sickness, or injury. A simple expression known as the power law of practice describes the way task performance speeds up over time (Card et al., 1983):

$\log(T_n) = C - \alpha\,\log(n)$  (10.5)

where C = log(T1) is based on the time to perform the task on the first trial, Tn is the time required to perform the nth trial, and α is a constant that represents the steepness of the learning curve. The implication of the log function is that it may take thousands of trials for a skilled person to improve his performance by 10% because he is far along the learning curve. In contrast, a novice may see a 10% gain after only one or two trials.
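
Equation 10.5 is equivalent to Tn = T1·n^(−α). A minimal sketch, with α chosen purely for illustration, shows how quickly early gains arrive and how slowly later ones do.

```python
def practice_time(first_trial_time, n, alpha=0.3):
    """Time for the nth trial under the power law of practice: Tn = T1 * n**(-alpha)."""
    return first_trial_time * n ** (-alpha)

print(practice_time(10.0, 1))     # 10.0 s on the first trial
print(practice_time(10.0, 2))     # ~8.1 s: a large gain after a single repetition
print(practice_time(10.0, 1000))  # ~1.3 s: further gains now come very slowly
```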

One of the ways in which skilled performance is obtained is through the chunking of small subtasks into programmed motor procedures. The beginning typist must make a conscious effort to hit the letters t, h, and e when typing the word the, but the brains of experienced typists can execute preprogrammed bursts of motor commands so that the entire word can be typed with a single mental command to the motor cortex. Skill learning is characterized by more and more of the task becoming automated and encapsulated. To encourage skill automation, the computer system should provide rapid and clear feedback of the consequences of user actions (Hammond, 1987).

Control Compatibility

Some control movements are easier to learn than others, and this depends heavily on prior experience. If you move a computer mouse to the right, causing an object on the screen to move to the right, this positioning method will be easy to learn. A skill is being applied that you gained very early in life when you first moved an object with your hand and that you have been refining ever since. But, if the system interface has been created such that a mouse movement to the right causes a graphical object to move to the left, this will be incompatible with everyday experience, and positioning the object will be difficult. In the behaviorist tradition of psychology, this factor is generally called stimulus–response (S–R) compatibility. In modern cognitive psychology, the effects of S-R compatibility are readily understood in terms of skill learning and skill transfer.

In general, it will be easier to execute tasks in computer interfaces if the interfaces are designed in such a way that they take advantage of previously learned ways of doing things. Nevertheless, some inconsistencies are easily tolerated, whereas others are not. For example, many user interfaces amplify the effect of a mouse movement so that a small hand movement results in a large cursor movement. Psychologists have conducted extensive experiments that involve changing the relationship between eye and hand. If a prism is used to laterally displace what is seen relative to what is felt, people can adapt in minutes or even seconds (Welch & Cohen, 1991). This is like using a mouse that is laterally displaced from the screen cursor being controlled. People are also able to adapt easily to relatively small inconsistencies between the angle of a hand movement and the angle of an object movement that results. For example, a 30-degree angular inconsistency is barely noticed (Ware & Arsenault, 2004).

On the other hand, if people are asked to view the world inverted with a mirror, it can take weeks of adaptation for them to learn to operate in an upside-down world (Harris, 1965). Snyder and Pronko (1952) had subjects wear inverting prisms continuously for a month. At the end of this period, reaching behaviors seemed error free, but the world still seemed upside down. This suggests that if we want to achieve good eye–hand coordination in an interface, we do not need to worry too much about matching hand translation with virtual object translation, but we should worry about large inconsistencies in the axis of rotation.

[G10.3] When designing interfaces to move objects on the screen, be sure that object movement is in the same general direction as hand movement.

Some imaginative interfaces designed for virtual reality involve extreme mismatches between the position of the virtual hand and the proprioceptive feedback from the user’s body. In the Go-Go Gadget technique (named after the cartoon character Inspector Gadget), the user’s virtual hand is stretched out far beyond his or her actual hand position to allow for manipulation of objects at a distance (Poupyrev et al., 1996).

Studies by Ramachandran (1999) provide interesting evidence that even under extreme distortions people may come to act as if a virtual hand is their own, particularly if touch is stimulated. In one of Ramachandran’s experiments, he hid a subject’s hand behind a barrier and showed the subject a grotesque rubber Halloween hand. Next, he stroked and patted the subject’s actual hand and the Halloween hand in exact synchrony. Remarkably, in a very short time, the subject came to perceive that the Halloween hand was his or her own. The strength of this identification was demonstrated when the researcher hit the Halloween hand with a hammer. The subjects showed a strong spike in galvanic skin response (GSR), indicating a physical sense of shock. No shock was registered without the stroking. The important point from the perspective of VR interfaces is that even though the fake hand and the subject’s real hand were in quite different places a strong sense of identification occurred.

Consistency with real-world actions is only one factor in skill learning. There are also the simple physical affordances of the task itself. It is easier for us to make certain body movements than others. Very often we can make computer-mediated tasks easier to perform than their real-world counterparts. When designing a house, we do not need to construct it virtually with bricks and concrete. The magic of computers is that a single button click can often accomplish as much as a prolonged series of actions in the real world. For this reason, it would be naive to conclude that computer interfaces should evolve toward VR simulations of real-world tasks or even enhanced Go-Go Gadget types of interactions.

Exploration and Navigation Loop

Viewpoint navigation is important in visualization when the data is mapped into an extended and detailed 3D space. Viewpoint navigation is cognitively complex, encompassing theories of path finding and map use, cognitive spatial metaphors, and issues related to direct manipulation and visual feedback.

Figure 10.3 sketches the basic navigation control loop. On the human side is a cognitive logical and spatial model whereby the user understands the data space and his or her progress through it. If the data space is maintained for an extended period, parts of its spatial model may become encoded in long-term memory. On the computer side, the view of the visualization is changed, based on user input.

Figure 10.3 The navigation control loop.

We start with the problem of 3D locomotion; next we consider the problem of path finding and finally move on to the more abstract problem of maintaining focus and context in abstract data spaces.

Locomotion and Viewpoint Control

Some data visualization environments show information in such a way that it looks like a 3D landscape, not just a flat map. This is achieved with remote sensing data from other planets, as well as maps of the ocean floor and other data related to the terrestrial environment. The data landscape idea has also been applied to abstract data spaces such as the World Wide Web (see Figure 10.4 for an example). The idea is that we should find it easy to navigate through data presented in this way because we can harness our real-world spatial interpretation and navigation skills. James Gibson (1986) offered an environmental perspective on the problem of perceiving for navigation:

A path affords pedestrian locomotion from one place to another, between the terrain features that prevent locomotion. The preventers of locomotion consist of obstacles, barriers, water margins and brinks (the edges of cliffs). A path must afford footing; it must be relatively free of rigid foot-sized obstacles.

Figure 10.4 Websites arranged as a data landscape. From Bray (1996). Reproduced with permission

Gibson described the characteristics of obstacles, margins, brinks, steps, and slopes. According to Gibson, locomotion is largely about perceiving and using the affordances offered for navigation by the environment. (See Chapter 1 for a discussion of affordances.) His perspective can be used in a quite straightforward way in designing virtual environments, much as we might design a public museum or a theme park. The designer creates barriers and paths in order to encourage visits to certain locations and discourage others.

We can also understand navigation in terms of the depth cues presented in Chapter 7. All the perspective cues are important in providing a sense of scale and distance, although the stereoscopic cue is important only for close-up navigation in situations such as walking through a crowd. When we are navigating at higher speeds, in an automobile or a plane, stereoscopic depth is irrelevant, because the important parts of the landscape are beyond the range of stereoscopic discrimination. Under these conditions, structure-from-motion cues and information based on perceived objects of known size are critical.

It is usually assumed that a smooth motion flow of visual texture across the retina is necessary for judgment of the direction of self-motion within the environment. But Vishton and Cutting (1995) investigated this problem using VR technology, with subjects moving through a forest-like virtual environment, and concluded that relative displacement of identifiable objects over time was the key, not smooth motion. Their subjects could do almost as well with a low frame rate, with images presented only 1.67 times per second, but performance declined markedly when updates were less than 1 per second. The lesson for the design of virtual navigation aids is that such environments can be sparsely populated, so long as enough identifiable objects are visible to provide frame-to-frame cues about self-motion.

[G10.4] To support view navigation in 3D data spaces, a sufficient number of objects must be visible at any time to judge relative view position, and several objects must persist from one frame to the next to maintain continuity.

Ideally, frame rates should be at least 2 per second; however, although judgments of heading may not be impaired by low frame rates, other problems will result. Low frame rates cause lag in visual feedback and, as discussed previously, this can introduce serious performance problems.

Changing the viewpoint in a data space can be done using a navigation metaphor, such as walking or flying, or it can be done using a more abstract, nonmetaphoric style of interaction, such as zooming in to a selected point on a data object. Ultimately, the goal is to get to the most informative view of the data space efficiently. The use of metaphors may make learning the user interface easier, but a nonmetaphoric interaction method may ultimately be the best.

Spatial Navigation Metaphors

Interaction metaphors are cognitive models for interaction that can profoundly influence the design of interfaces to data spaces. Consider, for example, two different viewpoint control interfaces: one in which the user manipulates the scene itself, and another in which she flies a vehicle through it.

With the first interface metaphor, if the user wishes to look at the right side of the scene, she must rotate the scene to the left to get the correct view. With the second interface metaphor, the user must fly her vehicle forward and around to the right, while turning in toward the target. Although the underlying geometry in the two cases is the same, the user interface and the user’s conception of the task are very different.

Navigation metaphors have two fundamentally different kinds of constraints on their usefulness. The first of these constraints is essentially cognitive. The metaphor provides the user with a model that enables the prediction of system behavior given different kinds of input actions. A good metaphor is one that is apt, matches the system well, and is easy to understand. The second constraint is more of a physical limitation. The implementation of a particular metaphor will naturally make some actions physically easy to carry out and others difficult to carry out; for example, a walking metaphor limits the viewpoint to a few feet above ground level and the speed to a few meters per second. Both kinds of constraints are related to Gibson’s concept of affordances—a particular interface affords certain kinds of movement and not others, but it must also be perceived to embody those affordances.

Note that, as discussed in Chapter 1, we are going beyond Gibson’s view of affordances here. Gibsonian affordances are directly perceived properties of the physical environment. In computer interfaces interaction is indirect, mediated through the computer, and so is perception of data objects, so Gibson’s concept as he framed it does not strictly apply. We must extend the notion of affordances to apply to both the physical constraints imposed by the user interface and cognitive constraints relating to the user’s understanding of the data space. A more useful definition of an interface with good cognitive affordances is one that makes the possibility for action plain to the user and gives feedback that is easy to interpret.

Four main classes of metaphors have been employed for controlling the viewpoint in virtual 3D spaces. Figure 10.5 provides an illustration and summary. Each metaphor has a different set of affordances.

Figure 10.5 Four navigation metaphors: (a) World-in-hand. (b) Eyeball-in-hand. (c) Walking. (d) Flying.

World-in-hand. The user metaphorically grabs a part of the 3D environment and moves it (Ware & Osborne, 1990; Houde, 1992). Moving the viewpoint closer to a point in the environment actually involves pulling the environment closer to the user. Rotating the environment similarly involves twisting the world about a point as if it were held in the user’s hand. A variation on this metaphor has the object mounted on a virtual turntable or gimbal. The world-in-hand model would seem to be optimal for viewing discrete, relatively compact data objects, such as virtual vases or telephones. It does not provide affordances for navigating long distances over extended terrains.

Eyeball-in-hand. In the eyeball-in-hand metaphor, the user imagines that she is directly manipulating her viewpoint, much as she might control a camera by pointing it and positioning it with respect to an imaginary landscape. The resulting view is represented on the computer screen. This is one of the least effective methods for controlling the viewpoint. Badler et al. (1986) observed that “consciously calculated activity” was involved in setting a viewpoint. Ware and Osborne (1990) found that some viewpoints were easy to achieve but others led to considerable confusion. They also noted that with this technique physical affordances are limited by the positions in which the user can physically place her hand. Certain views from far above or below cannot be achieved or are blocked by the physical objects in the room.

Walking. One way of allowing inhabitants of a virtual environment to navigate is to simply let them walk. Unfortunately, even though a large extended virtual environment can be created, the user will soon run into the real walls of the room in which the equipment is housed. Most VR systems require a handler to prevent the inhabitant of the virtual world from tripping over the real furniture. A number of researchers have experimented with devices such as exercise treadmills so that people can walk without actually moving over the ground. Typically, something like a pair of handlebars is used to steer. In an alternative approach, Slater et al. (1995) created a system that captures the characteristic up-and-down head motion that occurs when people walk in place. When this head bobbing is detected, the system moves the virtual viewpoint forward in the direction of head orientation. This gets around the problem of bumping into walls and may be useful for navigating in environments such as virtual museums; however, the affordances are still restrictive.

Flying. Modern digital terrain visualization packages commonly have fly-through interfaces that enable users to smoothly create an animated sequence of views of the environment. Some of these are quite literal, having aircraft-like controls. Others use the flight metaphor only as a starting point. No attempt is made to model actual flight dynamics; rather, the goal is to make it easy for the user to get around in 3D space in a relatively unconstrained way. We (Ware & Osborne, 1990) developed a flying interface that used simple hand motions to control velocity. Unlike real aircraft, this interface makes it as easy to move up, down, or backward as it is to move forward. Subjects with actual flying experience had the most difficulty; because of their expectations about flight dynamics, pilots did unnecessary things such as banking on turns and were uncomfortable with stopping or moving backward. Subjects without flying experience were able to pick up the interface more quickly. Despite its lack of realism, this was rated as the most flexible and useful interface when compared to others based on the world-in-hand and eyeball-in-hand metaphors. It later became the original user interface for Fledermaus™, a 3D geospatial visualization package.

The optimal navigation method depends on the exact nature of the task. A virtual walking interface may be the best way to give a visitor a sense of presence in an architectural space. Something loosely based on the flying metaphor may be a more useful way of navigating through spatially extended data landscapes. The affordances of the virtual data space, the real physical space, and the input device all interact with the mental model of the task that the user has constructed.

Wayfinding, Cognitive Maps, and Real Maps

In addition to the problem of moving through an environment in real time, there is the meta-level problem of how people build an understanding of larger environments over time and how they use this understanding to seek information. One aspect of this problem is usually called wayfinding. It encompasses both the way in which people build mental models of extended spatial environments and the way they use physical maps as aids to navigation.

Unfortunately, this area of research is plagued with a diversity of terminology. Throughout the following discussion, bear in mind that there are two clusters of concepts, and the differences between these clusters relate to the dual-coding theory discussed in Chapter 9. One cluster includes the related concepts of declarative knowledge, procedural knowledge, topological knowledge, and categorical representations. These concepts are fundamentally logical and nonspatial and therefore mostly nonvisual. The other cluster includes the related concepts of spatial cognitive maps and coordinate representations. These are fundamentally spatial.

Seigel and White (1975) proposed that there are three stages in the formation of wayfinding knowledge. First, information about key landmarks is learned; initially, there is no spatial understanding of the relationships between them. This is sometimes called declarative knowledge. We might learn to identify a post office, a church, and the hospital in a small town.

Second, procedural knowledge about routes from one location to another is developed. Landmarks function as decision points. Verbal instructions often consist of procedural statements related to landmarks, such as, “Turn left at the church, go three blocks, and turn right by the gas station.” This kind of information also contains topological knowledge, because it includes connecting links between locations. Topological knowledge has no explicit representation of the spatial position of one landmark relative to another.

Third, a cognitive spatial map is formed. This is a representation of space that is two dimensional and includes quantitative information about the distances between the different locations of interest. With a cognitive spatial map, it is possible to estimate the distance between any two points, even though we have not traveled directly between them, and to make statements such as, “The university is about one kilometer northwest of the train station.” In Seigel and White’s initial theory and in much of the subsequent work, there has been a presumption that spatial knowledge developed strictly in the order of these three stages: declarative knowledge, procedural knowledge, and cognitive spatial maps.

Seigel and White’s theory, however, ignored the importance of map technologies. Cognitive maps can be acquired directly from an actual map much more rapidly than by traversing the terrain. Thorndyke and Hayes-Roth (1982) compared people’s ability to judge distances between locations in a large building. Half of them had studied a map for half an hour or so, whereas the other half never saw a map but had worked in the building for many months. The results showed that for estimating the straight-line Euclidean distance between two points, a brief experience with a map was equivalent to working in the building for about a year. For estimating the distance along the hallways, however, the people with experience in the building did the best.

People can easily construct spatial mental maps of the objects they can see together from a particular vantage point. Colle and Reid (1998) conducted an experimental study using a virtual building consisting of a number of rooms connected by corridors. The rooms contained various objects. In a memory task following the exploration of the building, subjects were found to be very poor at indicating the relative positions of objects located in different rooms, but they were good at indicating the relative positions of objects within the same room. This suggests that cognitive spatial maps form easily and rapidly in environments where the viewer can see everything at once, as is the case for objects within a single room. It is more likely that the paths from room to room were captured as procedural knowledge. The practical application of this is that overviews should be provided wherever possible in extended spatial information spaces.

[G10.5] Consider providing an overview map to speed up the acquisition of a mental map of a data space.

[G10.6] Consider providing a small overview map to support navigation through a large data space.

Perspective views are less effective in supporting the generation of mental maps. Darken et al. (1998) reported that Navy pilots typically fail to recognize landmark terrain features on a return path, even if these were identified correctly on the outgoing leg of a low-flying exercise. This suggests that terrain features are not encoded in memory as fully three-dimensional structures, but rather are remembered in some viewpoint-dependent fashion as predicted by the image-based theory of object recognition discussed in Chapter 8.

The results of Colle and Reid’s study fit well with a somewhat different theory of spatial knowledge proposed by Kosslyn (1987). He suggested that there are only two kinds of knowledge, not necessarily acquired in a particular order. He called them categorical and coordinate representations. For Kosslyn, categorical information is a combination of both declarative knowledge and topological knowledge, such as the identities of landmarks and the paths between them. Coordinate representation is like the cognitive spatial map proposed by Seigel. A spatial coordinate representation would be expected to arise from the visual imagery obtained with an overview. Conversely, if knowledge were constructed from a sequence of turns along corridors when the subject was moving from room to room, the natural format would be categorical.

Landmarks, Borders, and Place

In an influential paper in the field of city planning, Lynch (1960) classified the structure of a city in terms of regions where different kinds of activities took place, boundaries blocking locomotion, landmarks providing focus points and aids to navigation, and pathways affording navigation. Recent work in neuroscience has shown remarkable parallels between at least some of these structures and processes operating in and around the hippocampus, a region of the brain that has long been known to be important for our understanding of space (O’Keefe & Nadel, 1978). Three types of neurons have been identified: border cells signal impenetrable barriers (Solstad et al., 2008), place cells signal specific locations (e.g., at the fridge, by the stove, on the sofa in the living room), and grid cells maintain a continuously updated map of where we currently are relative to our surroundings (Hafting et al., 2005), with links to place cell and object cell information.

All of this suggests that visual landmarks representing meaningful data objects are important in visualization design. Landmarks can act as tie points between declarative or procedural representations in the mind and the spatial representations provided in an external map. Vinson (1999) created a set of design guidelines for landmarks in virtual environments. The following guidelines can be added to G10.4:

[G10.7] When designing a set of landmarks, make each landmark visually distinct from the others.

[G10.8] When designing a landmark, make it recognizable as far as possible at all navigable scales.

Creating recognizable landmarks in 3D environments can be difficult because of multiple viewpoints. As discussed in Chapter 8, we recognize objects better from familiar and canonical viewpoints. An interesting way to assist users to encode landmarks for navigation in 3D environments was developed by Elvins et al. (1997). They presented subjects with small 3D subparts of a virtual cityscape that they called worldlets. The worldlets provided 3D views of key landmarks, presented in such a way that observers could rotate them to obtain a variety of views. Subsequently, when they were tested in a navigation task, subjects who had been shown the worldlets performed significantly better than subjects who had been given pictures of the landmarks or subjects who had simply been given verbal instructions.

Frames of Reference

The ability to generate and use something cognitively analogous to a map can be thought of in terms of applying a different perspective or frame of reference to the world. A map is like a view from a viewpoint high in the sky. Cognitive frames of reference are often classified into egocentric and exocentric. According to this classification, a map is just one of many exocentric views—views that originate outside of the user.

Egocentric Frame of Reference

The egocentric frame of reference is, roughly speaking, our subjective view of the world. It is anchored to the head or torso, not the direction of gaze (Bremmer et al., 2001). Our sense of what is ahead, left, and right does not change as we move our eyes around the scene, but it does change with body and head orientation. As we explore the world, we change our egocentric viewpoint primarily around two, not three, axes of rotation. As illustrated in Figure 10.6 we turn our bodies mostly around a vertical axis (pan) to change heading, and swivel our heads on the neck (also pan) about a similar vertical axis for more rapid adjustments in view direction. We also tilt our heads forward and back but generally not to the side (roll). The same two axes are represented in eye movements; our eyeballs do not rotate about the axis of the line of sight. More concisely, human angle of view control normally has only two degrees of freedom (pan and tilt) and lacks roll.

Figure 10.6 Most of the time we only rotate our viewpoint about two axes, corresponding to tilt and pan.

A consequence of the fact that we are most familiar with only two of the three degrees of freedom of viewpoint rotation is that when viewing maps, either real or in a virtual environment, we are most comfortable with only two degrees of freedom of rotation. Figure 10.7 illustrates an interface for rotating geographical information spaces constructed to have the same two degrees of freedom (Ware et al., 2001). The widgets allow rotation around the center point (equivalent to turning the body) and tilt from horizontal up into the plane of the screen (equivalent to forward and back head tilt), but they do not allow rotation around the line of sight through the center of the screen (equivalent to the rarely used sideways head tilt).

Figure 10.7 View control widgets for examining geographic data. Note that the rotational degrees of freedom match the rotational degrees of freedom of egocentric coordinates. The three views show different amounts of tilt. The handle on the top widgets can be dragged up and down to change tilt and moved left and right to rotate about the vertical axis.

[G10.9] In interfaces to view map data in 3D, the default controls should allow for tilt around a horizontal axis and rotation about a vertical axis, but not rotation around the line of sight.

Because we tend to move our bodies forward and only rarely sideways, a simple interface to simulate human navigation can be constructed with only three degrees of freedom, two for rotations (heading and tilt) and one to control forward motion in the direction of heading. If a fourth degree of freedom is added, it may be most useful to allow for something analogous to head turning. This allows for sideways glances while traveling forward.
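
A minimal sketch of such a three-degree-of-freedom navigation model (heading, tilt, and forward speed), assuming a y-up world coordinate frame with heading measured about the vertical axis:

```python
import math

class EgocentricNavigator:
    """Viewpoint control with two rotations (heading, tilt) and forward motion only."""

    def __init__(self, x=0.0, y=0.0, z=0.0):
        self.x, self.y, self.z = x, y, z
        self.heading = 0.0  # rotation about the vertical (y) axis, radians
        self.tilt = 0.0     # rotation about the horizontal axis, radians; no roll

    def turn(self, d_heading, d_tilt):
        self.heading += d_heading
        # Tilt is clamped between looking straight down and straight up.
        self.tilt = max(-math.pi / 2, min(math.pi / 2, self.tilt + d_tilt))

    def move_forward(self, speed, dt):
        """Advance the viewpoint in the current view direction."""
        self.x += speed * dt * math.cos(self.tilt) * math.sin(self.heading)
        self.z += speed * dt * math.cos(self.tilt) * math.cos(self.heading)
        self.y += speed * dt * math.sin(self.tilt)
```

A fourth degree of freedom for sideways glances could be added as a view-direction offset applied when rendering, leaving this motion model unchanged.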

Exocentric Frames of Reference

The term exocentric simply means external. In 3D computer graphics, exocentric frames of reference are used for applications such as monitoring avatars in video games, controlling virtual cameras in cinematography, and monitoring the activities of remote or autonomous vehicles. Obviously, there is an infinite number of exocentric views. The following is a list of some of the more important and useful ones:

Another person’s view. For some tasks, it can be useful to take the egocentric view of someone else who is already present in our field of view. Depending on the angular disparity in the relative directions of gaze, this can be confusing, especially when the other person is facing us. In the ClearBoard system (Ishii & Kobayashi, 1992), a remote collaborator appeared to be writing on the other side of a pane of glass. By digitally reversing the image, a common left–right frame of reference was maintained.

Over the shoulder view. A view from just behind and to the side of the head of an individual. This view is commonly used in cinematography.

God’s-eye view. Following a vehicle or avatar from above and behind, as shown in Figure 10.8(a). This view is very common in video games. Because it provides a wider field of view, it can be better for steering a remote vehicle than the more obvious choice, an egocentric view from the vehicle itself (Wang & Milgram, 2001).

Figure 10.8 (a) God’s-eye view of a moving vehicle represented by the tube object in the foreground. (b) Wingman’s view of the same vehicle.

Wingman’s view. Following a vehicle or avatar while looking at it from the side, as shown in Figure 10.8(b). Exocentric views that follow a moving object, such as the God’s-eye or wingman’s views, are sometimes called tethered (Wang & Milgram, 2001).

Map Orientation

Three views are commonly available in electronic map and chart displays: north-up, track-up, and track-up-perspective. The first two of these are illustrated in Figure 10.9(a, b) and the third in Figure 10.9(c).

Figure 10.9 (a) North-up map. (b) Track-up map. (c) North-up map with user view explicitly displayed.

A number of studies have compared north-up plan views with track-up views and suggested that the track-up view is preferable in that it is easier to use and results in fewer errors (Levine et al., 1984; Shepard & Hurwitz, 1984; Aretz, 1991). Nevertheless, experienced navigators often prefer the north-up over the track-up view because it gives them a consistent frame of reference for interpreting geographic data. This is especially important when two map interpreters are communicating over a phone or radio link—for example, in battlefield situations or when scientists are collaborating at a distance.

In visual cognitive terms, using a map involves comparing imagery on a display with objects in the world. This can be conceptualized as creating a cognitive binding between visual objects visible in two different spatial reference frames. In his work on displays for pilots, Aretz (1991) identified two different mental rotations necessary for successful map use. The first, azimuthal rotation, is used to align a map with the direction of travel. The track-up display executes this rotation in the display computer, eliminating the need for the task to be performed mentally. The second is vertical tilt. A map can be horizontal, in which case it directly matches the plane of the displayed information, or it can be oriented vertically, as is typical of the map displays used in car dashboards. Of these two transformations, azimuthal misalignment is the one that gives the most cognitive difficulty, and its difficulty increases in a nonlinear fashion. People take much longer and are less accurate when a map is aligned more than 90 degrees from their direction of travel (Wickens, 1999; Gugerty & Brooks, 2001).
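
The azimuthal rotation that a track-up display performs for the user is simply a rotation of map coordinates about the viewer’s position by the negative of the heading. A sketch, assuming map x points east, map y points north, and heading is measured clockwise from north:

```python
import math

def to_track_up(map_x, map_y, user_x, user_y, heading_deg):
    """Rotate a map point about the user so the direction of travel points up.

    Returns display coordinates with the user at the origin and 'ahead' along +y.
    """
    h = math.radians(heading_deg)
    dx, dy = map_x - user_x, map_y - user_y
    display_x = dx * math.cos(h) - dy * math.sin(h)  # to the right of the track
    display_y = dx * math.sin(h) + dy * math.cos(h)  # along the track
    return display_x, display_y

# Heading east (90 degrees): a landmark due east of the user appears straight ahead.
print(to_track_up(1.0, 0.0, 0.0, 0.0, 90.0))  # approximately (0.0, 1.0)
```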

It is possible to enhance a north-up map and make it almost as effective as a track-up map, even for novices. Aretz (1991) evaluated a north-up map with the addition of a clear indicator of the forward field of view of the navigator. This significantly enhanced the ability of the users to orient themselves. Figure 10.9(c) illustrates this kind of enhanced map.

[G10.10] When designing an overview map, provide a “you are here” indicator that shows location and orientation.

Tilting a display to form an oblique track-up-perspective view (God’s-eye) is less disruptive of performance than azimuthal misalignment (Hickox & Wickens, 1999). This may be at least partly because the misalignment in tilt is never more than 90 degrees and often much less. A perspective track-up view (like the top panel of Figure 10.9) reduces the tilt mismatch between the display and the environment or can entirely eliminate it if the view exactly matches the world view of the user, in which case we have an egocentric view. But, one of the problems with the egocentric view is that the user cannot see very far ahead, especially if a 3D scene is rendered with buildings and landscape features obscuring the view.

Track-up-perspective views have been studied for application in the field of aviation human factors (Schreiber et al., 1998; Hickox & Wickens, 1999). These studies show that the relative tilt angle between the display perspective and the scene perspective has a significant effect only if the angles are relatively large. Mismatches of 20 degrees or less resulted in minimal or no disruption of performance.

[G10.11] Maps used in navigation should provide three views: north-up, track-up, and track-up-perspective. A track-up perspective view should be the default.

Focus, Context, and Scale in Nonmetaphoric Interfaces

We have been dealing with the problem of how people navigate through 3D data spaces, under the assumption that the methods used should reflect the way we navigate in the real world. The various navigation metaphors are all based on this assumption; however, several successful spatial navigation techniques do not use an explicit interaction metaphor but do involve visual spatial maps. These techniques make it easy to move quickly from one view to another at different scales; because of this, they are said to solve the focus-context problem. Think of the problem of wayfinding as one of discovering specific objects or detailed patterns (focus) in a larger data landscape (context). The focus–context problem is simply a generalization of this, the problem of finding detail in a larger context.

In a way, the terms focus and context are misleading. They imply that the small-scale detail is the more important subject of attention, but in data analysis important patterns can occur at any spatial scale. The important thing is to be able to easily relate large-scale patterns to small-scale patterns. We will not abandon the focus and context terms, though, because they are too deeply entrenched.

The three kinds of focus–context problems are concerned with the spatial properties, structural properties, or temporal properties of a data set. Sometimes all three can be involved.

It is worth noting that the focus-context problem has already been spatially solved by the human visual system, at least for moderate changes in scale. The brain continuously integrates detailed information from successive fixations of the fovea with the less-detailed information that is available at the periphery. This is combined with data coming from the prior sequence of fixations. For each new fixation, the brain must somehow match key objects in the previous view with those same objects moved to new locations. Differing levels of detail are supported in normal perception because objects are seen at much lower resolution at the periphery of vision than in the fovea. The fact that we have no difficulty in recognizing objects at different distances means that scale-invariance operations are supported in normal perception. The best solutions to the problem of providing focus and context in a display are likely to take advantage of these perceptual capabilities.

The spatial scale of maps, the structural levels of detail in computer programs, and the temporal scale in communications monitoring are very different application domains, but they belong to a class of related visualization problems and they can all be represented by means of spatial layouts of data. The same interactive techniques can often be applied. In the following sections, we consider the perceptual properties of four different visualization techniques to solve the focus-context problem: distortion, rapid zooming, elision, and multiple windows.

Distortion Techniques

A number of techniques have been developed that spatially distort a data representation, giving more room to designated points of interest and decreasing the space given to regions away from those points. What is of immediate interest is spatially expanded at the expense of what is not, thus providing both focus and context. Some techniques have been designed to work with a single focus, such as the hyperbolic tree browser (Lamping et al., 1995), as shown in Figure 10.10.

Figure 10.10 Hyperbolic tree browser from Lamping et al. (1995). The focus can be changed by dragging a node from the periphery to the center.

An obvious perceptual issue related to the use of distorting focus–context methods is whether the distortion makes it difficult to identify important parts of the structure. This problem can be especially acute when actual geographical maps are expanded. For example, Figure 10.11, from Keahey (1998), shows a distorted map of the Washington, D.C., subway system. The center is clear as intended, but the labels on the stations surrounding the center have been rendered unintelligible. This leads to the next guideline.

Figure 10.11 A fisheye view centered on downtown Washington, D.C. From Keahey (1998). Reproduced with permission

[G10.12] When designing a visualization that uses geometric fisheye distortion methods, allow a maximum scale change factor of five.
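
As a concrete illustration, a simple radial fisheye with the magnification at the focus capped at a factor of five might look like the sketch below. The distortion function used is one common choice from the literature, not necessarily the one used in the systems cited here.

```python
def fisheye(x, y, focus_x, focus_y, radius, magnification=5.0):
    """Radially expand points near a focus; points beyond 'radius' are unchanged.

    magnification: local scale factor at the focus, clamped to at most 5 (G10.12).
    """
    magnification = max(1.0, min(magnification, 5.0))
    d = magnification - 1.0                    # distortion parameter
    dx, dy = x - focus_x, y - focus_y
    r = (dx * dx + dy * dy) ** 0.5
    if r == 0.0 or r >= radius:
        return x, y
    rn = r / radius                            # normalized distance, 0..1
    rn_new = (d + 1.0) * rn / (d * rn + 1.0)   # expand near the focus
    scale = rn_new / rn
    return focus_x + dx * scale, focus_y + dy * scale
```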

Some methods allow multiple foci to be simultaneously expanded, such as the table lens (Rao & Card, 1994) illustrated in Figure 10.12. Many of these methods use simple algebraic functions to distort space based on the distance from each focus.

Figure 10.12 Table lens from Rao and Card (1994). Multiple row- and column-wise centers of focus can be created.

The basic perceptual problem that can occur with distortion techniques is that parts of the structure will no longer be recognized. Distorting layout algorithms sometimes move parts of an information structure to radically different locations in the display space. This, of course, entirely defeats the purpose of focus and context, which depends on memory of patterns to relate information represented at different spatial scales.

[G10.13] Design fisheye distortion methods so that meaningful patterns are always recognizable.

Rapid Zooming Techniques

Another way of enabling people to comprehend focus and context is to use a single window but make it possible to transition quickly between spatial scales. Rapid zooming techniques do this. A large information landscape is provided, although only a part of it is visible in the viewing window at any instant. The user is given the ability to zoom rapidly into and out of points of interest, which means that, although focus and context are not simultaneously available, the user can move quickly and smoothly from focus to context and back. If smooth scaling is used, the viewer can perceptually integrate the information over time. The Pad and Pad++ systems (Bederson & Hollan, 1994) are based on this principle. They provide a large planar data landscape, with an interface using a simple point-and-click technique to move quickly and smoothly in and out.

The proper rate of zoom has been a subject of study (Guo et al., 2000; Plumlee, 2004). Plumlee’s (2004) results suggest that the rate of zoom should be independent of the number of objects displayed and the frame rate, but individual preferences vary widely. Some people prefer a zoom rate as slow as 2× per second, while others prefer a rate as fast as 8× per second (a zoom rate of 8× per second means that the scale changes smoothly by a factor of 8 every second). Both studies suggest a default zoom rate of 3 to 4× per second.

[G10.14] When designing a zooming interface, set a default scaling rate of 3 to 4× (magnification or minification) per second. The rate should be user changeable so that experts can increase it.
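
Because a constant perceptual zoom rate means the scale changes exponentially with time, implementing this guideline amounts to multiplying the view scale by a fixed factor each frame. A minimal sketch, using the suggested default of about 4× per second:

```python
def zoom_scale(start_scale, rate_per_second, elapsed_seconds):
    """Scale after zooming smoothly for a given time; a rate of 3.0-4.0 is a good default.

    Use elapsed_seconds < 0 (or a rate of 1/rate) to zoom out.
    """
    return start_scale * rate_per_second ** elapsed_seconds

# Per-frame update at 60 frames per second with a 4x-per-second zoom rate:
frame_factor = 4.0 ** (1.0 / 60.0)  # multiply the view scale by this every frame
print(zoom_scale(1.0, 4.0, 1.0))    # 4.0 after one second of zooming in
```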

Mackinlay et al. (1990) invented a rapid navigation technique for 3D scenes that they called point of interest navigation. This method moves the user’s viewpoint rapidly, but smoothly, to a point of interest that has been selected on the surface of an object. At the same time, the view direction is smoothly adjusted to be perpendicular to the surface. A variant of this is to relate the navigation focus to an object. Parker et al. (1998) developed a similar technique that is object based rather than surface based; clicking on an object scales the entire 3D virtual environment about the center of that object while simultaneously bringing it to the center of the workspace. This method is illustrated in Figure 10.13.

image

Figure 10.13 Center of workspace navigation. Clicking and dragging down on the box shown on the left causes it to move to the center of the workspace and expands the space around that center. Dragging up shrinks the space.
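
A sketch of the transform behind this kind of object-centered navigation is given below; the function and argument names are assumptions for illustration and do not come from Parker et al. (1998). Each scene point is scaled about the selected object’s center, and that center is simultaneously mapped to the center of the workspace. Animating the scale factor smoothly, frame by frame at the rate recommended in [G10.14], produces the behavior shown in Figure 10.13.

def center_of_workspace_transform(p, object_center, workspace_center, scale):
    """Map a scene point so the selected object's center lands at the
    workspace center while the scene is scaled about that center.

    p, object_center, and workspace_center are (x, y, z) tuples; scale > 1
    expands the space around the object, scale < 1 shrinks it.
    """
    return tuple(w + scale * (pi - c)
                 for pi, c, w in zip(p, object_center, workspace_center))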

In all these systems, a key issue is the rapidity and ease with which the view can be changed from a focal one to an overview and back. Less than a second of transition time is probably a good rule of thumb, but the animation must be smooth to maintain the identity of objects in their contexts. To maintain a sense of location, landmark features should be designed so that they remain recognizable despite large changes in scale.

Elision Techniques

In visual elision, parts of a structure are hidden until they are needed. Typically, this is achieved by collapsing a large graphical structure into a single graphical object. It can be thought of as a kind of structural fisheye, also referred to as semantic zoom (Furnas, 1986). This is an essential component of the intelligent zoom system (Bartram et al., 1994), discussed in Chapter 11, and is becoming increasingly common in network visualizations. In these systems, when a node is opened it expands to reveal its contents. The success of structural methods depends on the extent to which related information can be naturally grouped into larger objects. Also, if the goal is to compare information that resides in these collapsed boxes, the external appearance of a box must give a clear indication of what it will reveal when opened.
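
A minimal data structure for this kind of elision is sketched below in Python; the class and method names are illustrative and are not taken from any of the systems cited. A collapsed node is drawn as a single glyph whose summary hints at its contents, and expanding it reveals its children.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ElidedNode:
    """A node in a network view that supports collapsing and expanding."""
    label: str
    children: List["ElidedNode"] = field(default_factory=list)
    expanded: bool = False

    def visible_nodes(self):
        """Yield the nodes that should currently be drawn."""
        yield self
        if self.expanded:
            for child in self.children:
                yield from child.visible_nodes()

    def summary(self):
        """External appearance of a collapsed node: label plus a content hint."""
        return f"{self.label} ({len(self.children)} items)"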

Multiple Simultaneous Views

In visualization systems where large data spaces are represented, it is common to have one window that shows an overview and several others that show expanded details. The major perceptual problem with multiple windows is that detailed information in one window becomes disconnected from the overview (context) information shown in another. One solution is to use lines to connect the boundaries of the zoomed window to the corresponding region in the larger view. Figure 10.14 illustrates a zooming window interface for an experimental calendar application. Day, month, and year are shown as tables in separate windows, which are connected by triangular areas that integrate the focus information in one table within the context provided by another (Mackinlay et al., 1994).

image

Figure 10.14 The spiral calendar (Mackinlay et al., 1994). The problem with multiple windows is that information can become visually fragmented. In this application, information in one window is linked to its context in another by a connecting transparent wedge.

The great advantage of the multiple window technique over the others listed previously is that it does not distort the data space and it can show focus and context simultaneously. Its main disadvantage is the cost of setting up and manipulating the extra windows.

If we have multiple views simultaneously, then the links between views can be made visually explicit (Ware & Lewis, 1995). Figure 10.15 shows an attached window used in a 3D zooming user interface. The method includes a viewpoint proxy, a transparent pyramid showing the direction and angle of the tethered view, and lines that visually link the secondary window with its source (Plumlee & Ware, 2003).

image

Figure 10.15 An attached window in GeoZui3D.

Practical experience suggests that multiple windows with view proxies remain effective at scale differences of a factor of 30 or more, making this method preferable to distortion methods when there is a large difference between focus and context.

[G10.15] For large 2D or 3D data spaces, consider providing one or more windows that show a magnified part of the larger data space. These can support a scale difference of up to 30 times. In the overview, provide a visual proxy for the locations and directions of the magnified views.
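
As a sketch of how such a proxy might be computed in the 2D case, the function below maps the region covered by a magnified detail window into overview pixel coordinates so that its outline, and the tether lines connecting it to the detail window, can be drawn. All names and parameters are illustrative assumptions rather than code from GeoZui3D.

def detail_view_proxy(center, detail_size_px, detail_scale, overview_scale):
    """Rectangle (x, y, width, height), in overview pixel coordinates,
    marking the region of the data space shown by a magnified detail window.

    center is the data-space point at the middle of the detail view;
    detail_size_px is the (width, height) of the detail window in pixels;
    detail_scale and overview_scale are pixels per data unit in each view.
    """
    w = detail_size_px[0] * overview_scale / detail_scale
    h = detail_size_px[1] * overview_scale / detail_scale
    x = center[0] * overview_scale - w / 2.0
    y = center[1] * overview_scale - h / 2.0
    return (x, y, w, h)

At a 30× scale difference, the proxy is only 1/30 the linear size of the detail window it represents, which is why a clearly drawn outline and connecting lines are needed to keep the two views visually linked.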

Conclusion

We have been emphasizing the use of spatial maps to help in navigating data spaces; however, keep in mind that maps are not always the best answer. Consistent layout is essential if spatial memory is to help a user return to an information entity by its location. Providing maps of the Internet, for example, has been tried many times, yet such maps have not proven useful. The Internet is vast and dynamic, and many different representations of the same information are required depending on what we are looking for. All of these factors make generating consistent maps difficult if not impossible; any one map will necessarily provide only a partial view of some aspect of the available data. Procedural instructions can be more useful when the task itself requires navigating from data object to data object, taking certain actions at each. In this case, the cognitive representation of the task is likely to be topological and process oriented, not spatial.

Another caveat must be added to some of the guidelines that have been provided. We have been discussing navigating a data space as quickly and transparently as possible. Doing so involves supporting eye–hand coordination, using well-chosen interaction metaphors, and providing rapid and consistent feedback. The word transparent in user interface design is a metaphor for an interface that is so easy to use that it all but disappears from consciousness, but transparency can also come from practice, not just good initial design. A violin has an extraordinarily difficult user interface, and to reach virtuosity may take thousands of hours, but once virtuosity is achieved the instrument will have become a transparent medium of expression. This highlights a thorny problem in the development of novel interfaces. It is very easy for the designer to become focused on the problem of making an interface that can be used quickly by the novice, but it is much more difficult to research and develop designs for the expert. It is almost impossible to carry out experiments on expert use of radical new interfaces for the simple reason that no one will ever spend enough time on a research prototype to become truly skilled. Also, someone who has spent thousands of hours navigating using a set of buttons on a game controller will find that particular user interface easy to use and natural, even though novices find it very difficult. This means that even a poorly designed user interface may be best for a user population that is already highly skilled with it.

One of the goals of cognitive systems design is to tighten the loop between human and computer, making it easier for the human to obtain important information from the computer via the display. Simply shortening the amount of time it takes to acquire a piece of information may seem like a small thing, but human visual and verbal working memories are very limited in capacity and the information stored is easily lost; even a few seconds of delay or an increase in the cognitive load can drastically reduce the rate of information uptake by the user. When a user must stop thinking about the task at hand and switch attention to the computer interface itself, the effect can be devastating to the thought process. The result can be the loss of all or most of the cognitive context that has been set up to solve the real task. After such an interruption, the train of thought must be reconstructed. Research on the effect of interruptions tells us that this can greatly reduce cognitive productivity (Field & Spence, 1994; Cutrell et al., 2000).