16.1 INTRODUCTION

Gradable adjectives and scalar expressions more generally have been one of the most active topics of research in formal semantics over the past twenty or so years. This period has been characterized not only by important theoretical progress but also by an increasing use of experimental approaches in service of theoretical goals, and as such this area offers a wealth of case studies illustrating the benefits of experimental research in semantics.

This chapter is intended to offer a survey of recent work in experimental semantics relating to the topics of gradability, scalarity, and adjective meaning, with a view to informing future research in this area. There is a long tradition of research on adjective meaning and its acquisition from the perspective of cognitive and developmental psychology as well as psycholinguistics (see e.g. Eilers et al., 1974; Bartlett, 1976; Rips & Turnbull, 1980; Ebeling & Gelman, 1988; Sedivy et al., 1999). I do not attempt to do justice to this tradition here, but rather focus on work that is more explicitly oriented towards issues in formal semantics.

In order to achieve more general relevance, the material that follows is organized not primarily by the specific research questions investigated, but rather according to the role that experimentation has played in the process of theory development and evaluation. Section 16.2 considers research that has served to strengthen and reinforce intuition-based data on which theoretical accounts are based. Section 16.3 moves on to cases where experimental approaches have yielded data beyond what is accessible to intuition and introspection, thus offering new power to test and refine formal theories. Finally, section 16.4 considers research that is not guided by a specific theoretical proposal or hypothesis, but rather serves to map out the empirical landscape with respect to a particular phenomenon, potentially raising new questions of theoretical importance. Under each of these broad headings, one or more specific research issues have been selected as case studies of the potential application of experimental techniques. Along the way, the specific methodologies used are described in some detail; this discussion is expanded in section 16.5, which offers some more general observations on appropriate methodologies for research on adjective meaning, and suggests points that the researcher should keep in mind.

16.2 REINFORCING INTUITION-BASED DATA

Traditionally, the semanticist’s data are sourced through introspection and perhaps informal elicitation methods, and are based on intuitions regarding the acceptability of linguistic expressions and their meaning, the latter diagnosed via judgements on allowable contexts of use, entailments, and suitable paraphrases. Such approaches still form the backbone of work in formal semantics, but it goes almost without saying that data collected in these ways are subtle, open to dispute, and subject to possible influence from the scholar’s own theoretically-driven perspective. Thus, as experimental methods have become more accessible and widely accepted in formal linguistics, it has become increasingly common for researchers in semantics to look to the results of structured experiments with ‘ordinary’ speakers to provide additional substantiation of the data on which theories are based. In what follows I discuss one case study where a formal theory of adjective meaning has been strengthened in this way.

16.2.1 Scale structure and the absolute/relative distinction

Probably the most influential recent theory of gradable adjective meaning is the scalar theory of Rotstein & Winter (2004), Kennedy & McNally (2005), and Kennedy (2007), according to which gradable adjectives lexicalize measure functions that relate individuals to degrees on scales. This is exemplified by the following lexical entry for the adjective tall:¹

On this approach, all gradable adjectives have the same basic semantic form. Adjectives differ, however, in the structures of the scales they lexicalize, specifically in whether or not these have maximum and/or minimum points. This gives rise to four possible scale types, on the basis of which gradable adjectives can be divided into two broad classes: absolute gradable adjectives, which lexicalize totally or partially closed scales, and relative gradable adjectives, whose corresponding scales are open on both ends.

The absolute/relative distinction manifests itself in a number of ways. First, absolute gradable adjectives, but not their relative counterparts, can occur with endpoint-oriented degree modifiers. Specifically, low-degree modifiers such as slightly and a bit are only possible with adjectives whose scales have minimum points, whereas maximality modifiers such as completely and perfectly can only occur with those whose scales have maximum points:

(3)a. slightly/perfectly transparent

b. slightly/*perfectly rough

c. *slightly/perfectly smooth

d. *slightly/*perfectly tall

Even more basically, the presence or absence of scalar endpoints is argued to determine the nature of the standard or threshold for the adjective in its positive (unmodified) form. Absolute gradable adjectives have endpoint-based standards: to be rough, for example, is to have a more than minimum degree of roughness, while to be smooth is to have a maximum degree of smoothness. Relative gradable adjectives, in contrast, have contextual standards, which intuitively are set relative to a comparison class. To be tall, for example, is to have a height exceeding the norm or standard for the sort of entities under consideration; thus a tall tree is taller than a tall child. The difference in standard type between absolute and relative gradable adjectives is further reflected in other phenomena, such as distinct patterns of entailment between the positive and comparative forms.

Kennedy (2007) accounts for the correlation between scale type and standard type by means of a Principle of Interpretive Economy, which calls for maximizing the semantic contribution of conventional elements such as scalar endpoints to the computation of truth values. If a scale has a maximum or minimum point, this provides the default standard; it is only when such elements are absent that a purely contextual standard is possible.
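The predictions described so far can be collected in a small lookup, offered here as a minimal sketch rather than as part of the scalar theory itself. The scale assignments are read off the pattern in (3); the names SCALES, licenses, and default_standards are hypothetical, and treating slightly and perfectly as simple endpoint tests is an idealization.

```python
# Hypothetical encoding of scale structure as (has_minimum, has_maximum).
# The four entries mirror the four adjectives in example (3).
SCALES = {
    "transparent": (True, True),    # totally closed scale
    "rough":       (True, False),   # lower-closed scale
    "smooth":      (False, True),   # upper-closed scale
    "tall":        (False, False),  # open scale (relative adjective)
}


def licenses(adjective, modifier):
    """Predicted acceptability of an endpoint-oriented degree modifier."""
    has_min, has_max = SCALES[adjective]
    if modifier == "slightly":      # low-degree modifier: needs a minimum point
        return has_min
    if modifier == "perfectly":     # maximality modifier: needs a maximum point
        return has_max
    raise ValueError(f"modifier not covered by this sketch: {modifier}")


def default_standards(adjective):
    """Standard types made available under Interpretive Economy."""
    has_min, has_max = SCALES[adjective]
    endpoints = []
    if has_min:
        endpoints.append("minimum")
    if has_max:
        endpoints.append("maximum")
    # A purely contextual standard is possible only when no endpoint exists.
    return endpoints or ["contextual"]


for adj in SCALES:
    print(adj, licenses(adj, "slightly"), licenses(adj, "perfectly"),
          default_standards(adj))
```

On this encoding, a single property of the scale drives both the modifier pattern and the choice of standard, which is the sense in which the account unifies superficially unrelated data.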

In its ability to explain a range of superficially unrelated data on the basis of a single property of scales, the scalar theory is elegant and appealing, and it is not surprising that it has emerged as the standard account of gradable adjective meaning. Its position has undoubtedly been further strengthened by the availability of diverse forms of experimental confirmation for the intuitions on which it is based.

In a series of four experiments, Frazier et al. (2008) investigated the processing of two sub-types of absolute gradable adjectives, those with maximum standards (e.g. dry, clean, straight) versus those with minimum standards (e.g. wet, dirty, bent), finding substantiation for judgements reported in the theoretical literature. As illustrated in (3), scalar-minimum based modifiers such as slightly and a little are reported to be acceptable with the latter but not the former class. This was confirmed in an online speeded acceptability judgement task, which found as predicted that adding slightly/a little decreased the acceptability of maximum standard but not minimum standard adjectives. An eye-tracking study supported a similar conclusion. Further experiments found that subjects were also sensitive to other contrasts in acceptability described in the theoretical literature, such as the following (from Rotstein & Winter, 2004):

(4)a. #The two towels are completely dry, but the red one is a little bit wetter than the blue one.

b. The two towels are completely wet, but the red one is a little bit drier than the blue one.

From the perspective of the above-described theoretical research these findings are not surprising, but they nonetheless help to dispel potential concerns that the basic data could be an artefact of the introspective method (see Clifton et al., 2006, Frazier et al., 2008, for related discussion).

Syrett, Bradley, et al. (2006) and Syrett, Kennedy, & Lidz (2010) provide a different sort of support for the reality of the absolute/relative distinction, collected in the context of research into children’s acquisition of gradable adjective meaning. The phenomenon that was investigated involved definite descriptions with gradable adjectives. The definite article the introduces existence and uniqueness presuppositions: to be felicitous, the long rod requires that there be one and only one rod describable as long in the context of utterance. But given two shortish rods, one somewhat longer than the other, the long rod can nonetheless be felicitously used to describe the longer of the two. This is because long has a context-dependent relative standard, which can be set in such a way that one and only one member of the pair—the longer one—has a length that exceeds it (cf. Kyburg & Morreau, 2000). By contrast, the straight wire cannot be used to describe the straighter of two bent wires, because the standard for straight is not relative and context-dependent but rather absolute, fixed to the scalar maximum point of complete straightness.

Syrett and colleagues tested this pattern in a series of experiments among adults and children aged 3-5, using a novel Presupposition Assessment Task. Subjects were presented with two objects differing in the extent to which they possessed some scalar property, with the request ‘Give me the Adj one’; their task was to give the experimenter the object requested if they could, and if not, to explain why not. As predicted, for the relative gradable adjectives long and big, subjects, both children and adults, consistently selected the longer/bigger of the two objects, even when neither would otherwise be considered long/big. A different pattern emerged for absolute gradable adjectives such as bent and full. In felicitous trials that featured one and only one object meeting the standard for the adjective (e.g. with one straight and one bent wire), subjects selected that one. But in infelicitous trials in which both or neither of the objects met the standard (e.g. two wires with different degrees of bend), adults and to a somewhat lesser extent children largely refused to select one. Thus ordinary speakers are sensitive to the distinction in context-sensitivity predicted by the scalar theory, accommodating existence and uniqueness presuppositions for relative but not absolute gradable adjectives. Furthermore, this sensitivity appears to develop as early as age 3.
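The logic behind the predicted responses to ‘Give me the Adj one’ can be sketched as follows. This is an illustration under stated assumptions, not a model used in the studies: the function names are hypothetical, degrees are arbitrary numbers, and straightness is placed on an invented 0 to 1 scale on which 1.0 is the scalar maximum.

```python
def referent_relative(degrees):
    """Relative adjective: the standard can be shifted so that exactly one
    object exceeds it, provided one object has strictly the greatest degree.
    Returns the index of that object, or None if there is a tie."""
    top = max(degrees)
    return degrees.index(top) if degrees.count(top) == 1 else None


def referent_absolute(degrees, standard):
    """Absolute adjective: the standard is fixed, so the description succeeds
    only if exactly one object meets it. Returns its index, or None."""
    meeting = [i for i, d in enumerate(degrees) if d >= standard]
    return meeting[0] if len(meeting) == 1 else None


# Two shortish rods (lengths in arbitrary units): 'the long one' still works.
print(referent_relative([10, 12]))

# Two bent wires on the invented straightness scale: 'the straight one' fails.
print(referent_absolute([0.6, 0.8], standard=1.0))

# One straight and one bent wire: 'the straight one' succeeds.
print(referent_absolute([1.0, 0.4], standard=1.0))
```

A return value of None corresponds to the infelicitous trials, in which subjects are predicted to refuse to hand over either object.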

The predicted difference in context-sensitivity between relative and absolute gradable adjectives has been demonstrated by other methods as well. McNabb (2012) conducted an online study in which adjectives were paired with stimulus arrays consisting of a target item in the context of three similar items with varying degrees of the corresponding property (e.g. ladders varying in height; glasses varying in fullness). The target item was held constant across conditions, while the remaining items were varied: in one condition, the target had the smallest degree of the property in question; in another, it was intermediate in the array; and in a final condition, it had the greatest degree of the relevant property. Subjects were asked to select one of four sentences that best described the target item: (i) the circled X is Adj; (ii) the circled X is very Adj; (iii) the circled X is not Adj; (iv) the circled X is neither Adj nor not Adj. As predicted, for relative adjectives such as tall and wide, the extent to which the target item was characterized as Adj/very Adj varied significantly with the experimental context, while absolute gradable adjectives such as bent and straight did not exhibit comparable context-sensitivity. Taking a completely different approach, Aparicio et al. (2015) report results of a visual world eye-tracking experiment showing a difference between the two classes in how context affects reference resolution.

Taken as a whole, the above studies have thus provided strong support for the scalar theory of gradable adjectives. At the same time, these and other works have drawn attention to ways in which the theory as described must be refined and augmented. Both McNabb’s participants and Syrett and colleagues’ child subjects sometimes applied maximum standard adjectives such as full to objects that did not have the maximum degree of the relevant property. This might be attributed to the loose or imprecise use of these terms, with the implication that the scalar theory must be supplemented with a theory of imprecision (e.g. along the lines of Lasersohn, 1999). Other experimental work has documented unexpected patterns in the distribution and interpretation of degree modifiers such as very (McNabb, 2012) and slightly (Sassoon, 2011; Bogal-Allbritten, 2012), in some cases suggesting particular points where the theory must be modified. Sassoon, for example, provides experimental evidence that compatibility with slightly is a diagnostic for standard type, not scale type. Still other researchers have sought to extend the inquiry beyond paradigm cases of absolute and relative gradable adjectives. Liao et al. (2016) and Liao & Meskin (2017) investigate aesthetic adjectives such as beautiful and elegant, while Hansen & Chemla (2017) look at colour words; experimental investigations show both classes to pattern intermediately between absolute and relative gradable adjectives. Finally, Sassoon (2012, 2013b) investigates multidimensional adjectives such as healthy and sick, proposing that for such pairs the antonymy relation must be understood in terms of universal versus existential quantification over dimensions, a view that can in a sense be seen as a generalization of the notion of scalar endpoints. Sassoon supports her quantificational theory with experimental as well as corpus-based findings on the acceptability of such adjectives with exception phrases, for example healthy/not sick except for high blood pressure.

In this way, we see that the theoretical and experimental research paths have proceeded in parallel, with the findings combining synergistically to provide a deeper understanding of adjective meaning and its relation to scale structure.

16.3 GOING BEYOND INTUITIONS

If experimental techniques had no other function in theoretical semantics beyond confirming data sourced through introspective methods, one might tend to characterize their overall importance as modest. But some of the most exciting applications have been ones that take us beyond the sort of data that is accessible to intuition and introspection, opening up a rich new source of data on which theoretical work can be founded. We have already seen this to some extent in the above-described work, for example in McNabb’s (2012) investigation of the truth conditions of gradable adjectives modified by very: at what point a bent nail can be characterized as very bent, and how this is affected by context, are not matters of easy and clear-cut intuitive judgements. In this section, I present two case studies that further demonstrate how experimental approaches can take us beyond the limits of introspection-based data.

16.3.1 Relative gradable adjectives and context-dependence

Above it was observed that relative gradable adjectives such as tall and long have contextually determined interpretations, which depend in some way on a comparison class to provide a threshold or standard of comparison. But none of the above-described works have addressed the question of how precisely the standard is set relative to a given context or comparison class. As an example, the truth conditions for tall relative to a comparison class C might be stated in any of the following ways, all of which have been suggested at one point or another in the semantics literature (see e.g. Bartsch & Vennemann, 1973; Von Stechow, 1984; Bale, 2008, 2011):

(5) 〚Anna is tall〛^C = 1 iff…

a. Anna is among the tallest n% (e.g. tallest third) of the Cs

b. Anna’s height is among the top n% of heights of Cs

c. Anna’s height is greater than the mean height of Cs

d. Anna’s height is more than k standard deviations greater than the mean height of Cs

Alternatively, the standard might be conceptualized not as a point but a range (Von Stechow, 2009; Solt, 2011), or as a probability distribution over precise points (Lassiter & Goodman, 2013; Qing & Franke, 2014a,b). We might even dispense with the notion of a standard entirely, and instead state the meaning of the adjective in terms of a classification problem, such that the tall members of C are those that are sufficiently similar in height to the tallest C, or to some prototypical tall entity (Schmidt et al., 2009; McNally, 2011).

While the formulations in (5) are on the surface extremely similar, they make distinct predictions as to how the extension of an adjective such as tall will change as the statistical properties of the comparison class are varied. Thus selecting between them requires eliciting not only static judgements as to the interpretation of an adjective in a given context, but also judgements of how that interpretation changes from context to context. This makes the question difficult if not impossible to investigate via introspective methods alone. One might have clear intuitions as to which members of a given set could be characterized, relative to that set, as tall; but it is much less easy to intuit how one’s judgements shift as the properties of the set are varied. To borrow a term from experimental psychology, the problem might be characterized in terms of learning effects: exposure to one stimulus may influence how a subject reacts to subsequent similar stimuli. In formal experiments, such effects can be controlled for by varying the order of presentation or by choosing a between-subjects rather than within-subjects design. But this is not possible when the semanticist is essentially her own research subject. Such limitations, coupled with the increasing availability of cost-effective research modalities, have led researchers interested in this question to turn to experimental approaches.
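How the formulations in (5) come apart as the comparison class changes can be illustrated with a short sketch. Everything specific here is an assumption made for illustration: the function names, the choice of one third for n and 1 for k, the use of the population standard deviation, the reading of (5b) as ranking distinct height values rather than individuals, and the invented heights (in cm).

```python
from statistics import mean, pstdev


def tall_5a(heights, frac=1/3):
    """(5a): heights of the individuals ranked in the tallest frac of the class."""
    k = round(len(heights) * frac)
    return set(sorted(heights, reverse=True)[:k])


def tall_5b(heights, frac=1/3):
    """(5b): the top frac of the distinct height values attested in the class."""
    degrees = sorted(set(heights), reverse=True)
    k = round(len(degrees) * frac)
    return set(degrees[:k])


def tall_5c(heights):
    """(5c): heights greater than the class mean."""
    m = mean(heights)
    return {h for h in heights if h > m}


def tall_5d(heights, k=1.0):
    """(5d): heights more than k standard deviations above the class mean."""
    cutoff = mean(heights) + k * pstdev(heights)
    return {h for h in heights if h > cutoff}


# Invented comparison classes: the second adds three more very short members.
baseline = [150, 160, 170, 180, 190, 200]
skewed = [150, 150, 150, 150, 160, 170, 180, 190, 200]

for rule in (tall_5a, tall_5b, tall_5c, tall_5d):
    print(rule.__name__, sorted(rule(baseline)), sorted(rule(skewed)))
```

With these invented classes, adding short individuals leaves the extension under (5b) unchanged but enlarges it under (5a), (5c), and (5d), each in a different way. It is contrasts of this kind, rather than judgements about a single context, that distinguish the candidate truth conditions.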

Some of the first work of this sort involved young children as subjects. Barner & Snedeker (2008) investigated 4-year-olds’ understanding of tall and short as applied to novel nouns as a way to study the emergence of compositional semantics in language development. Child subjects were shown a set of figures of varying heights which were given the name ‘pimwits’ and were asked to identify the tall pimwits or the short pimwits. In the baseline condition, roughly the tallest third of the set was selected as tall, and likewise roughly a third as short. But when the composition of the set was changed, children’s cutoffs for tall changed as well: when more very short pimwits were added, the threshold for tall was on average set lower, whereas when more very tall pimwits were added, the threshold was on average set higher. The results for short were similar though less clear-cut, a finding that is consistent with other work showing that acquisition of negative antonyms lags that of their positive counterparts (e.g. Donaldson & Wales, 1970). The conclusion that can be drawn is that children as young as 4 are able to extract the statistical properties of a comparison class and use these to determine the extension of a novel adjective+noun combination in a compositional way. See also Tribushinina (2013) for similar research with child subjects.

Subsequent research has applied the same sort of methodology to investigating adults’ understanding of adjectives such as tall. As in Barner and Snedeker’s work, such studies have typically exposed subjects to some array of items intended to represent a comparison class, and asked them to indicate which ones can be described by the adjective in question. The primary experimental manipulation involves the nature of the comparison class set. As an example, Figure 16.1 shows a stimulus item from Solt & Gotzner (2012) (discussed later in this section).

FIGURE 16.1. Sample stimuli from comparison class experiments (from Solt & Gotzner, 2012)

Schmidt et al. (2009) performed a series of online experiments with the goal of evaluating possible mathematical models of human judgements about the application of tall in context. Subjects saw arrays of items—bars of varying heights—and were asked to specify which were tall. The stimulus arrays were varied in their statistical properties, including the number of items and the mean and variance of their heights. Subjects’ judgements were compared with the predictions of a variety of mathematical models of the sort exemplified in (5), as follows: for each distribution, the model’s probability that an item would be classified as tall (using the best-fit model parameters) was compared to the proportion of subjects who actually labelled that item tall, using a mean difference measure. Two models were found to offer the best fit with the data: a threshold-based model of the form in (5b) and a categorization-based cluster model.
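The comparison between model predictions and subjects’ responses can be sketched in a few lines. The exact form of the mean difference measure is not given above, so the version below, a mean absolute difference across items, is an assumption, as are the function name and the numbers, which are invented and are not data from the study.

```python
def mean_abs_difference(predicted, observed):
    """Average absolute gap between a model's probability that each item is
    called 'tall' and the proportion of subjects who labelled it 'tall'."""
    if len(predicted) != len(observed):
        raise ValueError("need one prediction per observed item")
    gaps = [abs(p - o) for p, o in zip(predicted, observed)]
    return sum(gaps) / len(gaps)


# Invented values for a three-item array, shortest item first.
model_probabilities = [0.0, 0.5, 1.0]
observed_proportions = [0.0, 0.25, 0.75]

print(mean_abs_difference(model_probabilities, observed_proportions))
```

On a measure of this kind, a lower score indicates a closer fit, so candidate models can be ranked by their scores across all the stimulus distributions tested.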

Other work has applied variations of the same technique to somewhat different theoretical questions. In Tribushinina (2011), what was manipulated was not the statistical properties of the stimulus array but rather the type of entities that comprised it, the issue under investigation being the relative role of two sorts of standards: a scalar midpoint value based on the visually presented stimuli and an external reference point based on world knowledge. Dutch groot ‘big’ and klein ‘small’ were paired with sequences of items varying in size from large to small. In one condition, the items depicted were prototypically large entities (e.g. elephants); in a second, they were prototypically small entities (e.g. mice), and in a third, entities with no strong size associations (e.g. balloons). As expected, the largest items regardless of object category were consistently called ‘big’, and the smallest consistently called ‘small’. But subjects on average labelled a larger number of prototypically small items than prototypically big items as ‘big’; conversely, more prototypically big items than prototypically small items were labelled ‘small’. This provides evidence that speakers are able to dynamically integrate the two types of standards.

Solt & Gotzner (2012) sought not only to clarify the truth conditions of relative gradable adjectives, but also to determine what sort of ontological structures are required to express them. Four gradable adjectives were tested—tall, big, dark, and pointy—each paired with arrays of thirty-six items representing comparison classes (again see Figure 16.1). As the statistical properties of the array were varied, the average number of items classified by subjects as tall, dark, etc. likewise changed, as did the average cutoff point for the application of the adjective (e.g. the smallest height which counted as tall). This shows that the truth conditions cannot be stated in terms of rank orders of individuals, per (5a); tall, for example, cannot mean ‘among the tallest third of the comparison class’. Rather, degrees are necessary. But a simple degree-based formulation such as (5b) is not sufficient. A second experiment, which used comparison classes featuring ‘gaps’ in the distribution of items over degrees, confirmed that the meaning of such adjectives cannot be stated in terms of an ordinal scale constructed from an ordering on the comparison class, but rather requires a scale with a distance metric, which would support truth conditions such as (5c,d). Comparison classes with gaps also played a role in research by Booij & Sassoon (2013), who demonstrate that this factor influences threshold setting among adults but not children.

Finally, Qing & Franke (2014b) propose a model of gradable adjective meaning based on pragmatic reasoning about optimal language use, in which thresholds for gradable adjectives are inferred probabilistically from statistical properties of a comparison class, with a particular threshold being used with a probability proportional to its communicative efficacy. They replicate the experiment of Solt & Gotzner (2012) and demonstrate a close fit between the predictions of the model and the experimental results. In subsequent research (Qing & Franke, 2014a), it is shown that the distinction between absolute and relative gradable adjectives can be plausibly modelled as resulting from differences in the assumed prior distribution of the comparison class with respect to the property in question (see also Lassiter & Goodman, 2013 for a similar proposal).

Importantly, prior expectations are themselves not readily accessible to introspection. Schöller & Franke (2015, 2016, 2017) take up this issue in work on the gradable quantity adjectives many and few, in which experimental data was collected on both prior expectations and interpretations. Building on theoretical proposals by authors including Fernando & Kamp (1996), these authors investigate the hypothesis that many and few each have a stable context-independent core meaning in the form of a fixed threshold on a probability distribution that represents prior expectations. By way of example, a sentence such as Joe eats many burgers would on this view be true iff Joe eats more burgers than θ per cent of members of the relevant comparison class (say, adult American males)—for some context-invariant value θ. In Schöller & Franke’s experiments, priors were elicited in an online study in which subjects were asked questions such as that in (6) in the context (6a), and answered by rating the likelihood of different ranges by adjusting sliders, a method developed by Kao et al. (2014). Comprehension of many/few was assessed by asking the same question in context (6b), with participants answering by selecting the range they thought most likely to be the one the speaker had in mind.

(6) How many burgers do you think Joe eats per month?

a. Joe is a man from the US.

b. Joe is a man from the US who eats few/many burgers.

Finally, production of many/few was assessed by presenting subjects with a fact (e.g. that Joe eats ten to twelve burgers a month) and asking whether a corresponding statement (e.g. Compared to other men from the US, Joe eats many/few burgers a month) was a good description of that fact. The results were input into a data-based computational model, which achieved moderate success in inferring the latent threshold parameters for the two expressions.
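The fixed-threshold hypothesis for many and few can be made concrete with a small sketch. It is not the computational model used in the experiments: the function names, the prior over monthly burger counts, and the value 0.7 for θ are all invented, and the mirror-image rule for few is an assumption, since only the rule for many is spelled out above.

```python
def is_many(n, prior, theta):
    """True iff the prior probability of a count below n exceeds theta."""
    mass_below = sum(p for count, p in prior.items() if count < n)
    return mass_below > theta


def is_few(n, prior, theta):
    """Assumed mirror image: the prior probability of a count above n exceeds theta."""
    mass_above = sum(p for count, p in prior.items() if count > n)
    return mass_above > theta


# Invented prior expectations about burgers eaten per month (count -> probability).
prior = {0: 0.1, 2: 0.2, 4: 0.3, 6: 0.2, 8: 0.1, 10: 0.1}

print(is_many(8, prior, theta=0.7))   # eight burgers measured against the prior
print(is_many(4, prior, theta=0.7))
print(is_few(0, prior, theta=0.7))
```

The point of the design is that θ stays fixed while the prior changes with the comparison class, so all context-dependence is located in speakers’ expectations rather than in the lexical meaning.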

The specific research questions addressed in the above studies have varied, as have the formal frameworks in which they have been couched. All of this work, however, has pursued a broad common goal, namely to more precisely characterize how ordinary speakers determine the extensions of vague gradable expressions such as tall. While a standard model has yet to emerge, it is thanks to experimental work of the sort discussed here that it has been possible to go beyond the general observation that such adjectives have context-dependent meanings, and to formulate and test specific hypotheses as to their semantic content.

16.3.2 The logic of vagueness

Section 16.3.1 considered one aspect of the semantics of (relative) gradable adjectives, specifically their context-dependence. In this section we turn to a related phenomenon that also characterizes members of this class: vagueness.² On this topic, it is not only the subtlety of the required judgements but also semanticists’ own formal training and theoretical commitments that make it difficult to take an empirical approach based on introspection alone, and have led to an increasing use of experimental methods.

Vagueness is often characterized as the existence of borderline cases. For example, a man of height 6 feet (’) 4 inches (”) (1.93m) must certainly be considered tall (for an adult male), while a man of height 4’10” (1.48m) is certainly not tall. But a man of height 5’11” (1.8m) seems neither clearly tall nor clearly not tall; he is a borderline case. The lack of a sharp cutoff between the positive and negative extensions of predicates such as this also gives rise to the well-known Sorites paradox, otherwise known as the Paradox of the Heap (from the Greek word soros ‘heap’). In one of its common forms, the Sorites paradox is illustrated by the sequence in (7): the two premises in (7a) and (7b) strike us as unquestionably true, but the conclusion in (7c) which follows from them is just as clearly false.

(7)a. A man who is 6’4” tall is tall.

b. A man who is ⅙” shorter than a tall man is also tall.

c. A man who is 4’10” tall is tall.

The crucial property that gives rise to this paradox is tolerance (Wright, 1975), meaning that the applicability of the predicate is insensitive to small changes in the relevant measure.
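The way the premises in (7) chain together can be traced mechanically. The sketch below is only an illustration: the function name is hypothetical, heights are converted to inches (6’4” is 76 and 4’10” is 58), and the step size is taken as given in (7b).

```python
from fractions import Fraction


def sorites_steps(start, target, step):
    """Count how many applications of the tolerance premise (7b) are needed
    to carry 'tall' from the start height down to the target height."""
    height, steps = start, 0
    while height > target:
        height -= step   # by (7b), the slightly shorter man is also tall
        steps += 1
    return steps, height


# Exact rational arithmetic avoids rounding drift over many small steps.
steps, final_height = sorites_steps(Fraction(76), Fraction(58), Fraction(1, 6))
print(steps, final_height)
```

Each step is licensed by the same premise, so there is no principled point at which the chain can be broken; rejecting the conclusion in (7c) therefore requires rejecting tolerance or the logic that iterates it.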

A wide variety of logical frameworks have been applied to vagueness: multivalued and fuzzy logics (Lakoff, 1973; Zadeh, 1975a; Tye, 1994), supervaluations (Fine, 1975; Kamp, 1975), subvaluations (Hyde, 1997), epistemicism (Williamson, 1994), and contextualism (Klein, 1980; Kamp, 1981b; Bosch, 1983; Raffman, 1996; Fara, 2000); see Sorensen (2016) for an overview, and Keefe & Smith (1997b) for a more in-depth review. Much of the argumentation in favour of and against individual theories has been formal or theoretical in nature, and might strike the linguist as quite removed from claims about natural language. But certain theoretical arguments have rested explicitly or implicitly on assumptions about ordinary speakers’ use and understanding of vague language. An example is the hypothesis that certain phenomena can be attributed to speakers’ tendency to interpret a vague predicate P as definitely P (see Serchuk et al., 2011, for discussion). Such claims are difficult to assess or dispute on the basis of introspection, particularly as scholars working in this area typically have theoretical commitments whose predictions are not easy to disentangle from raw native speaker intuitions. For example, a scholar trained in theories of vagueness might take as a given the principle of bivalence, according to which every proposition is either true or false. For this reason, both philosophers and linguists have increasingly argued for the value of eliciting native speaker judgements on critical examples (see e.g. Smith, 2008), a development that has given rise to a growing body of experimental research.

An important early example of work of this sort comes out of the field of philosophy, namely Bonini et al. (1999). In a series of simple experiments, subjects were asked to name the numerical values that represent the cutoffs for the application of vague predicates, largely gradable adjectives (e.g. tall, expensive, long, dangerous, poor). One group, the truth judgers, were asked to name the minimum value that would make the predicate true, for example, the smallest height that makes it true to say that a man is tall. A second group, the falsity judgers, were asked to provide the maximum value at which the predicate would be false, for example, the greatest height at which it would be false to say that a man is tall. Across different vague predicates and variants of the question wording, a gap was consistently found between the two values elicited in this way: the average minimum value at which the predicate was judged true was greater than the average maximum value at which it was judged false. Thus participants’ performance was apparently not consistent with bivalence. While this might seem to favour an analysis of vagueness involving truth value gaps, Bonini and colleagues argue instead that their results support an epistemic theory according to which vagueness is construed as ignorance: there is some fact of the matter regarding the cutoff for vague predicates such as tall, but this value is unknown (and perhaps unknowable) to speakers. As evidence, they cite both theoretical arguments and the results of a further experiment, which found a similar ‘gappy’ pattern of answers to questions relating to a definite but unknown value, such as the average height of 30-year-old Italian men.

The methodology used by Bonini and colleagues has been criticized by later authors; Serchuk et al. (2011), for example, point to a potential ambiguity in the question wording, and demonstrate that the finding of a gap disappears when the ambiguity is removed. On the theoretical side, authors including Hampton (2007) and Alxatib & Pelletier (2011a) critique Bonini et al.’s argumentation against gap-based theories such as supervaluationism and in favour of epistemicism. To a large extent, this debate reflects the challenge of linking the behaviour of experimental subjects to the constructs of philosophical theories, a connection that necessarily requires ancillary assumptions. But efforts to resolve the debate have also led to further experimental undertakings, which have yielded results that do more to constrain possible theories of vagueness.

A significant work in this area is Alxatib & Pelletier (2011a), which again derives novel and theoretically significant insights from a very simple experimental methodology. As stimuli, subjects saw a picture of a mock police line-up including five men whose heights were seen to range from 5’4” (1.63m) to 6’6” (1.98m). For each ‘suspect’, they were asked to judge the following four sentences using the response options ‘true’, ‘false’, and ‘can’t tell’.

(8)a. X is tall.

b. X is not tall.

c. X is tall and not tall.

d. X is neither tall nor not tall.

The crucial judgements are those for the middle suspect, who at 5’11” (1.8m) represented a borderline case of tall. Considering first the simple statements (8a) and (8b), the authors found a preference for false versus true responses: subjects were more likely to judge it false that the borderline individual was tall than true that he was not tall, and similarly more likely to judge it false that he was not tall than true that he was tall. This represents a divergence from classical logic, according to which a proposition is false if and only if its negation is true. The findings relating to the conjunctions were even more surprising, and problematic for any of the leading logical approaches to vagueness. Classically, these are contradictions, and even authors who have argued for systems having intermediate or undetermined truth values in addition to the standard true/false have largely taken for granted that propositions of this form are necessarily false (e.g. Kamp, 1975). But Alxatib & Pelletier (2011a) found that in the case of the borderline-tall suspect (but not the clear cases of tall and not tall), roughly half of subjects evaluated these as true. Furthermore, ‘tall and not tall’ was judged true even by some subjects who judged both conjuncts false (and similarly, mutatis mutandis, for the negative conjunction).

The surprising acceptability of such ‘borderline contradictions’—that is, classical contradictions involving borderline cases of a predicate—has been further substantiated in other work using different methodologies. Ripley (2011) elicited judgements of sentences such as The circle both is and isn’t near the square paired with a series of seven images of circle/square pairs in which the distance between the two figures was varied from zero (the two figures touching) to large. The highest level of acceptability—roughly at the midpoint of a seven-point scale—was found for those pictures with an intermediate distance between the two, which might be characterized as borderline cases of near the square. Taking a different approach, Sauerland (2011) conducted an online study in which subjects were asked to rate the truth of sentences using a value between 0 and 100. Higher ratings were obtained for the conjunction of a proposition of intermediate truth value and its negation (e.g. A 5’10”-guy is tall and a 5’10”-guy isn’t tall) than for the conjunction of two equally true but unrelated propositions (e.g. A 5’10”-guy is tall and a guy with $100,000 isn’t rich). Subsequent research by Égré & Zehr (2016) has found that while both ‘gappy’ descriptions of the form x is neither P nor not P and ‘glutty’ descriptions of the form x is P and not P are judged to be fairly acceptable, the former are preferred over the latter.

The above-described findings are significant because formal logical theories of vagueness differ in their ability to account for them. The acceptability of borderline contradictions is in particular hard to reconcile with a simple subvaluationist or supervaluationist approach. Building on the experimental findings, the above authors and others have made further theoretical proposals aimed at explaining these patterns. Alxatib & Pelletier (2011a) propose that vague predicates are ambiguous between a superinterpretation on which they are neither true nor false for borderline cases and a subinterpretation on which they are both true and false for such individuals; pragmatic principles call for the selection of the strongest interpretation on which the resulting sentence could be true. Ripley (2011) argues that the data point towards a paraconsistent logic according to which contradictions are not necessarily false. In subsequent work, he and colleagues (Cobreros et al., 2012) propose an analysis couched in a novel logical framework featuring three related notions of truth: classical, strict and tolerant. Finally, Alxatib et al. (2013) argue for an account based on fuzzy logic augmented with a rescaling operation, which successfully overcomes earlier criticisms of fuzzy logic as a possible model of vague natural language (see especially Kamp, 1975). The reader is referred to Alxatib & Sauerland (Chapter 20 in this volume) for further discussion of these theoretical developments.
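
To make the contrast between these proposals more concrete, the following minimal sketch evaluates the four sentence types in (8) in a three-valued setting. It assumes the Strong Kleene connectives over the values 0, 1/2, and 1, and a simplified three-valued rendering of strict and tolerant truth (strictly true: value 1; tolerantly true: value of at least 1/2). This is an illustrative simplification rather than the formulation given by Cobreros et al. (2012), and all function names are hypothetical.

```python
from fractions import Fraction

HALF = Fraction(1, 2)  # assumed value for a borderline case


def neg(v):
    """Strong Kleene negation."""
    return 1 - v


def conj(a, b):
    """Strong Kleene conjunction (minimum of the two values)."""
    return min(a, b)


def strictly_true(v):
    return v == 1


def tolerantly_true(v):
    return v >= HALF


def evaluate(tall):
    """Values of the four sentence types in (8), given the value of 'X is tall'."""
    not_tall = neg(tall)
    return {
        "tall": tall,
        "not tall": not_tall,
        "tall and not tall": conj(tall, not_tall),
        "neither tall nor not tall": conj(neg(tall), neg(not_tall)),
    }


for label, value in [("clearly tall", 1), ("borderline", HALF), ("clearly not tall", 0)]:
    print(label)
    for sentence, v in evaluate(value).items():
        print(f"  {sentence}: value={v}, strict={strictly_true(v)}, "
              f"tolerant={tolerantly_true(v)}")
```

On these assumptions, both ‘tall and not tall’ and ‘neither tall nor not tall’ come out tolerantly but not strictly true for the borderline case only, and neither is tolerantly true for the clear cases. The same computation shows why conjunction as minimum is problematic on a plain fuzzy reading: the value of the contradiction can never exceed 1/2.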

A final strand of experimental research into vagueness has involved eliciting judgements on dynamic versions of Sorites series, such as a series of colour chips ranging in small increments from clear cases of blue to clear cases of green (Égré et al., 2013; Raffman, 2014). The main finding is that the location of the boundary drawn between the extensions of two adjacent colour adjectives is dependent on order of presentation. For example, the point of transition from blue to green is closer to the blue end of the spectrum when the presentation order is blue-to-green, and closer to the green end when the order is reversed (an effect known as enhanced contrast). Once again, this is the sort of pattern that would be difficult to detect via introspection (cf. the discussion of comparison classes in section 16.3.1). As with the case of borderline contradictions, these findings have been subject to different theoretical explanations: Raffman (2014) argues for a contextualist approach, while Égré and colleagues propose that their findings can be accounted for within the strict-tolerant framework of Cobreros and colleagues.

As may be inferred from the earlier discussion, experimental research into the interpretation of vague language has not provided clear pointers to any single theory of vagueness, and given the subtlety and complexity of the topic, it is somewhat unrealistic to expect this to emerge in the near future. What this work has achieved, however, is a range of new insights into the facts that a theory of vagueness must account for. This in turn has prompted important new theoretical developments.

16.4 MAPPING THE EMPIRICAL LANDSCAPE

The research we have seen so far has been strongly theory-driven. The starting point has been some formal theory or theories of a linguistic phenomenon; experimentally sourced data have been used to support or refine a particular theoretical approach, or to adjudicate between theories. The experimental materials have typically been based on a relatively small set of adjectives (or other scalar expressions), selected to be representative of established classes (e.g. absolute vs. relative gradable adjectives; vague predicates), whose behaviour is assumed to be uniform. Along the way, new facts have been uncovered, including differences among items expected to exhibit similar behaviour; an example is McNabb’s (2012) finding that some absolute maximum-standard adjectives (but not others) allow context-dependent, non-maximal interpretations. But this has not been the primary objective of the research.

In this section, we will see two cases of research which has sought to profile the behaviour of a wide range of items, with the goal of determining to what extent it is uniform, and what meaningful subclasses exist. Such work does not seek to test a particular theory, but rather yields data that can serve as the basis for subsequent theory building.

16.4.1 Ordering subjectivity

A further phenomenon relating to adjective meaning that has been the subject of considerable recent interest is subjectivity, as diagnosed by the possibility of so-called ‘faultless disagreement’ (Kölbel, 2004; Lasersohn, 2005; Stephenson, 2007). To give a classic example, two speakers might disagree as to whether or not a dish is tasty or an experience is fun, with neither seeming to have said something objectively incorrect or false. Recently, it has been recognized that there are at least two sorts of adjectival subjectivity (see e.g. Kennedy, 2013; Bylinina, 2014). So-called evaluative adjectives such as beautiful give rise to faultless disagreement effects in both their positive and comparative forms. Dimensional gradable adjectives such as tall in their positive forms likewise allow faultless disagreement; their comparative forms, however, allow only objective or factual readings:

Importantly, most of the discussion of this pattern has focused on clear cases such as tall and beautiful. Once we move beyond paradigm examples such as these, intuitions become murkier. This is particularly the case for adjectives that lack corresponding numerical measurement systems (unlike tall) but that are also not classically evaluative (unlike beautiful), examples being clean/dirty, smooth/bumpy, and sharp/dull. Can two speakers disagree faultlessly about which of two shirts is dirtier or which of two knives is sharper?

Solt (2018) explores this issue in an online study of thirty-five gradable adjectives using a novel faultless disagreement paradigm, in which subjects saw brief two-speaker dialogues of the sort in (10) and were asked to classify each using one of two response options: ‘only one can be right, the other must be wrong’ and ‘it’s a matter of opinion’. It was found that with regard to so-called ordering subjectivity, that is, the subjective interpretation of comparative forms, gradable adjectives pattern into not two but three subclasses. Some such as tall allow only objective readings; some such as beautiful allow only subjective readings; and a large group including clean/dirty and sharp/dull allow both types of interpretation. Thus by taking a broad view and testing a wide range of individual items without pre-established notions of how they might divide into subclasses, this research established that the landscape of adjectival subjectivity is richer and more diverse than previously recognized. To account for these findings, Solt builds on proposals by Kennedy (2013), Bylinina (2014), and McNally & Stojanovic (2017) by positing two distinct sources of ordering subjectivity. The first of these is multidimensionality, the specific claim being that certain gradable adjectives (e.g. clean/dirty) must be analysed as lexicalizing underspecified functions of some contextually determined set of underlying component dimensions. The second factor is judge dependence, whereby purely subjective adjectives such as beautiful and tasty include an explicit judge or experiencer argument as part of their lexical semantics.

16.4.2 Adjectives and implicature

Our final case study takes us from the domain of semantics proper to the semantics/pragmatics interface, and more specifically the topic of scalar implicature (Grice, 1975; Horn, 1989; Levinson, 2000a). By way of illustration, an utterance of the sentence in (11a) would typically license the inference that Anna did not eat all of the cookies. Similarly, the utterance of (11b) would typically allow the hearer to infer that the problem is not impossible to solve.

(11)a. Anna ate some of the cookies.

b. The problem is difficult to solve.

Inferences of this sort are widely considered to be a variety of conversational implicature, specifically scalar implicatures. Expressions such as some and difficult are taken to evoke lexical scales of alternatives, that is, sets of two or more lexical items ordered by semantic strength. In (11a), the scale might be 〈some, all〉, while in (11b) it is of the form 〈difficult, impossible〉. From the speaker’s choice to use a weaker scalar item (e.g. some, difficult), the hearer can infer that stronger items on the scale (e.g. all, impossible) fail to obtain.
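
The scale-based reasoning just described can be sketched schematically as follows. This is a deliberately simplified illustration: it treats scales as fixed ordered tuples, uses only the two scales mentioned in the text, and ignores the role of context, relevance, and the speaker’s epistemic state. The names SCALES and scalar_implicatures are hypothetical.

```python
# Lexical scales ordered from weaker to stronger (the two examples from the text).
SCALES = [
    ("some", "all"),
    ("difficult", "impossible"),
]


def scalar_implicatures(item, scales=SCALES):
    """Return the inferences licensed by using `item`.

    Each stronger scalemate that the speaker could have used, but did not,
    is inferred not to hold.
    """
    for scale in scales:
        if item in scale:
            position = scale.index(item)
            return [f"not {stronger}" for stronger in scale[position + 1:]]
    return []  # the item is not on any known scale


for weak_term in ("some", "difficult", "all"):
    print(weak_term, "->", scalar_implicatures(weak_term))
```

On this schematic picture every weak scalar term licenses the corresponding inference in the same way, whatever the scale; this is exactly the expectation of uniform behaviour that the studies discussed below put to the test.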

There is a large body of experimental research into scalar implicature, conducted with the goal of providing evidence for or against particular theoretical accounts.³ But as pointed out by Van Tiel et al. (2016), the great majority of such work has focused on the single item some, and to a lesser extent or. The assumption seems to be that these items are representative of the class of scalar items more generally, whose behaviour is (tacitly) expected to be uniform. Recently, there have been attempts to test this ‘uniformity assumption’, with interesting findings for gradable adjectives in particular.

Doran et al. (2009, 2012) investigate whether the type of scale affects the extent to which scalar implicatures are incorporated into the truth-conditional content of an utterance, using a novel experimental paradigm intended to separate literal or truth-conditional meaning from pragmatic implicatures. Subjects saw dialogues of the form in (12) in which a speaker uses a weak scalar term, coupled with a fact which supports the use of a stronger scalemate:

(12) Irene: How attractive is Kate?

Sam: She’s pretty.

FACT: Kate was voted ‘World’s Most Beautiful Woman’ this year.

Following a method developed by Larson et al. (2009), subjects were not asked directly about the truth or falsity of the crucial sentence, but rather whether it would be considered true or false by the character ‘Literal Lucy’, who always interprets language literally. Using this method, it was found that inferences from gradable adjectives such as pretty, big, and annoyed were less frequently incorporated into truth-conditional meaning than those from quantifiers, cardinal numerals, and rank orderings. Furthermore, only in the case of adjectives was the rate of implicatures dependent on the discourse context, in that implicatures were computed more frequently when alternatives were explicitly mentioned (e.g. when Irene’s question in (12) was instead ‘Is Kate average-looking, pretty, or gorgeous?’). This was taken to indicate that the lexical scales that give rise to implicatures from the use of gradable adjectives are less accessible than those involved in the interpretation of items such as quantifiers.

Van Tiel et al. (2016) take an even finer-grained view, investigating the rate of scalar inferences for forty-three items, including thirty-two adjectives. Citing possible issues with the verification task used by Doran and colleagues, such as the lack of parallel in the description of the facts in the different conditions, Van Tiel and colleagues chose instead to use an inferencing task, executed online via Amazon MTurk. Each trial consisted of a statement made by a speaker ‘John’ or ‘Mary’, containing a scalar expression in predicative position (for example, John says ‘She is intelligent.’). Subjects were asked to indicate whether they would conclude that according to that speaker, a statement including a stronger scalar alternative (e.g. that she is brilliant) would not obtain. An answer of ‘Yes’ means that the scalar implicature is calculated. The picture that emerged was one of extreme diversity. For some pairs, such as some/all, cheap/free, and possible/certain, scalar inferences were near universal; that is, subjects consistently made the inference from cheap to not free, and so forth. By contrast, for pairs such as small/tiny and content/happy, only a small minority of subjects drew the corresponding inferences. Other pairs spanned the range in between.

With the help of further online experiments as well as corpus data, the researchers profiled each pair on a range of dimensions that were hypothesized to play a role in accounting for this diversity. Only two factors were found to have a significant effect: the semantic distance between the two items (scalar inferences were more frequent when the stronger term was perceived as more distant from the weaker one); and boundedness (implicatures were more frequent when the stronger term was a scalar endpoint, as in the pair difficult/impossible). This latter finding relates to the discussion of scale structure in section 16.2.1 earlier, and suggests that this aspect of semantic scales also plays a role in pragmatic phenomena.

Importantly, a large proportion of the variance in the observed implicature rates remained unexplained by any of the factors considered in this study. Van Tiel and co-authors (2016) conclude that much of this remaining variance is idiosyncratic and unsystematic, perhaps resulting from statistical regularities in participants’ previous experience with the individual items in question. Subsequent authors have however questioned this (rather negative) takeaway. Focusing in particular on the adjectival results, McNally (2017) challenges Van Tiel et al.’s (2016) conclusion that the salience or availability of the stronger alternative has no role in explaining variability in implicature rates, arguing that the methods used to measure availability did not adequately account for the polysemy of gradable adjectives and the potential role of (inferred) discourse structure. And in another recent experimental undertaking, Benz et al. (2017) revisit Van Tiel et al.’s (2016) findings and demonstrate that lower rates of scalar implicatures are correlated with higher rates of negative strengthening, whereby the negation of the stronger term in the pair is strengthened such that it also negates the weaker term. For example, the denotation of the negated not happy spans the scalar territory corresponding to ‘content but not happy’ and ‘unhappy’, but pragmatically, its interpretation tends to be strengthened to ‘unhappy’. On this strengthened interpretation, not happy does not upper-bound content but instead contradicts it; thus no implicature from content to not happy is possible. Once this factor is taken into consideration, Benz and colleagues argue that Van Tiel et al.’s findings are compatible with a modified form of the uniformity hypothesis. Thus it remains very much an open question how much idiosyncratic variation there actually is in the rate of scalar implicatures for adjectival pairs.

16.5 NOTES ON METHODOLOGY

In closing this chapter, we will briefly take a closer look at methodological issues relating to the investigation of adjectival and scalar meaning.

The first important observation to be made is that most experimental work in this area has relied on quite simple behavioural tasks. Chief among these are the following:

• ‘Language/world matching’ tasks, where subjects relate words or sentences of natural language to some visually or verbally represented object(s) or state of affairs. This includes more specifically:

– truth/acceptability judgement tasks (subjects rate the truth/acceptability of a sentence in relation to a picture/verbal description)

– categorization tasks (subjects select which of a set of items can be described by a given linguistic expression)

– the Presupposition Assessment Task developed by Syrett et al. (2010)

• Acceptability/grammaticality judgement tasks

• Inferencing tasks, where subjects judge whether one sentence follows from another, or the compatibility of two sentences

• Interpretation tasks, where subjects are asked to describe their interpretation of a word/sentence, for example by giving a numerical value

• Other metalinguistic tasks, such as rating the acceptability of a paraphrase or the status of a disagreement.

Such tasks are essentially structured versions of the sorts of questions that semanticists pose to themselves and attempt to resolve through introspection. There has also been some use of methods that assess the on-line processing of scalar expressions, such as self-paced reading, reaction time measurement and eye-tracking, but on the whole these have played a lesser role. This pattern perhaps reflects the fact that most experimental work in this area has been carried out by semanticists and philosophers rather than psycholinguists and psychologists. At the same time, the history of success with the straightforward techniques listed earlier is clear evidence that theoretically relevant data can come from very simple approaches.

To date, there has been relatively little work directed explicitly at comparing methodologies in experimental semantics. This contrasts with the fields of experimental syntax and experimental pragmatics (particularly in the area of scalar implicature), where ‘research on research’ is a well-established tradition (see e.g. Featherston, 2007; Geurts & Pouscoulous, 2009; Sprouse et al., 2013). On the topic of adjectival semantics, there is however one phenomenon, namely the absolute/relative distinction, that has been investigated via a diverse range of methodologies (see section 16.2.1). That very different approaches have produced largely converging results is an encouraging sign for the continued use of experimentation to bolster theoretical work in this area.

The lack of explicit comparative work makes it somewhat premature at this stage to recommend preferred methodologies for investigating adjective meaning; and in any case, appropriate methodologies will be as varied as the theoretical goals against which they are applied. However, reviewing some of the studies discussed in the present chapter suggests some points to keep in mind.

16.5.1 Design

Research designs in which subjects evaluate a large number of similar items can be problematic, leading to fatigue or alternately giving rise to priming effects or the development of response strategies. Beyond this, designs that require subjects to directly compare items can yield differences that might not be present if those items are tested in isolation (see Frazier et al., 2008 and Sprouse et al., 2013, and references therein). Techniques for avoiding unwanted effects relating to multiple item exposure include the adequate use of fillers and randomization/rotation of items, as well as the choice of between-subject rather than within-subject designs. The latter has been made more feasible through the increased availability of online research modalities—particularly Amazon Mechanical Turk (MTurk)—which have made it possible and cost-effective to recruit very large research samples, with each subject responding to only a small number of experimental items. Studies on adjective meaning have, however, varied in the extent to which they take advantage of such possibilities. On one end of the spectrum we see ‘one-shot’ studies in which each subject gives only a single experimental judgement (Schmidt et al., 2009), as well as studies (both online and in the lab) designed such that subjects see only one version of a question or test item, for example either an adjective or its antonym, or an adjective in a single test context (e.g. Bonini et al., 1999; Frazier et al., 2008). At the other end of the spectrum, there have been studies in which subjects give a hundred or more judgements (e.g. Hansen & Chemla, 2017), as well as those where subjects repeatedly respond to a single linguistic item in multiple contexts (Ripley, 2011; Alxatib & Pelletier, 2011a) or where the task is to choose between linguistic descriptions or paraphrases (McNabb, 2012; Bogal-Allbritten, 2012). 
There are often solid practical reasons for employing such a design (especially for in-lab studies), and there is no strong evidence for questioning the results of studies conducted in this way, though there have been reports of order effects (see e.g. Syrett et al., 2010; Hansen & Chemla, 2017). But when it comes to the investigation of subtle aspects of meaning, it could be argued that the ideal study is one in which subjects respond in a ‘fresh’ way to each item, as if it were the first of its kind ever seen; and the researcher should strive to come as close to this ideal as is practicable.
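
By way of illustration, the rotation and randomization techniques mentioned above might be implemented along the following lines. This is a minimal sketch under stated assumptions, not a description of the procedure of any of the studies cited: the items, conditions, and fillers are hypothetical, as are the function names, and a Latin square rotation is assumed as the way of distributing item versions across presentation lists.

```python
import random


def latin_square_lists(items, conditions):
    """Build one presentation list per condition.

    In list k, item i appears in condition (i + k) mod n. Each subject
    (assigned to a single list) sees every item in exactly one condition,
    and across lists every item appears in every condition.
    """
    n = len(conditions)
    return [
        [(item, conditions[(i + k) % n]) for i, item in enumerate(items)]
        for k in range(n)
    ]


def build_trial_order(experimental, fillers, seed=None):
    """Mix experimental trials with fillers and shuffle the order."""
    rng = random.Random(seed)  # a seed makes a subject's order reproducible
    trials = [("experimental", t) for t in experimental]
    trials += [("filler", f) for f in fillers]
    rng.shuffle(trials)
    return trials


# Hypothetical materials, for illustration only.
adjectives = ["tall", "full", "clean", "expensive"]
conditions = ["positive form", "comparative form"]
fillers = ["filler 1", "filler 2", "filler 3", "filler 4"]

lists = latin_square_lists(adjectives, conditions)
for subject_id in range(4):
    assigned = lists[subject_id % len(lists)]  # rotate subjects through lists
    print(subject_id, build_trial_order(assigned, fillers, seed=subject_id))
```

With as many lists as conditions, no subject ever judges two versions of the same item, at the cost of requiring more subjects to obtain the same number of observations per item and condition.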

16.5.2 Task

In this (or any other) area, a very basic aspect of experimental design is to ensure that the task is one that subjects can reasonably be expected to carry out. If the goal is to tap into subjects’ intuitions as to the meaning and possible use of scalar expressions, perhaps the best sort of task is one that, for the subject at least, represents a natural extension of the everyday use and interpretation of those expressions. What is natural is of course open to debate, but self-paced reading tasks, acceptability judgements, language/world matching tasks, and inferencing tasks would seem to be good candidates, and all have a long history of use. One sort of task where caution is called for involves those in which subjects are asked to specify some aspect of meaning by providing a numerical value or values, as done by Bonini et al. (1999) (section 16.3.2) and in a much more sophisticated form by Schöller & Franke (2015) to elicit prior expectations (section 16.3.1). The results from this work are quite promising, but it perhaps should be kept in mind that such a technique asks the respondent to probe her intuitions in a way that she has likely never done before. Further experience would be helpful in establishing the validity and reliability of data sourced in this way.

16.5.3 Question/answer wording

In any experiment, the wording of questions and answer options can affect how subjects respond. Experimental work on the logic of vagueness provides a good illustration of possible issues in this area. Eliciting judgements of truth on a multipoint scale (as done by Ripley, 2011) or in the form of a number between 0 and 100 (per Sauerland, 2011) has the potential to educate respondents to the perhaps novel view that truth is a gradient rather than categorical notion. Conversely, using ‘true/false/can’t tell’ as response options (as done by Alxatib & Pelletier, 2011a) might instead imply an epistemic view of vagueness. While it is not possible to avoid issues of this sort, the researcher must be aware of the commitments inherent to the choices made.

16.5.4 Stimuli

As discussed earlier, a widespread technique in the investigation of adjective meaning requires subjects to relate an adjective to some representation of a state of affairs. Importantly, such an approach can only be used with adjectives whose meanings can be depicted or otherwise represented. An easy case is an adjective such as tall, probably the most-tested of all gradable adjectives: it is straightforward to visually represent objects of systematically varying heights. This is not so for adjectives whose meanings relate to perceptual dimensions other than the visual one, such as loud/quiet, hard/soft, or salty, particularly if the experiment will be carried out online. Most problematic are adjectives that refer to internal mental states (e.g. happy) and evaluative or aesthetic judgements (e.g. good, beautiful). An interesting attempt to overcome this problem is made by Liao et al. (2016) and Liao & Meskin (2017), who represent different degrees of beauty via pictures of sculptures and of human faces whose symmetry (known to correlate with perceptions of beauty) is digitally manipulated. It remains to be seen how far such approaches can be extended, and thus, for a range of adjective types, it may be necessary to rely on other task types, such as inferencing tasks.

16.6 CONCLUSIONS

This chapter has highlighted a variety of ways that experimental approaches have been used in developing and testing formal semantic theories of gradable adjectives and other scalar expressions. Space limitations and the case study format have made this review necessarily somewhat selective: there have been many other theoretically significant and methodologically sophisticated experimental studies on adjectival and scalar meaning that could just as well have been discussed. In any case, a look at the programme of any major semantics conference is sufficient to show that experimentally based work now has an established role in the field. It is to be expected that coming years will see the further development of methodologies for investigating scalar meaning, and an ever growing body of best practices.

ACKNOWLEDGEMENTS

For helpful discussions on topics in experimental semantics, I would like to thank Sam Alxatib, Chris Cummins, Nicole Gotzner, Scott Grimm, Uli Sauerland, and Carla Umbach. Work on this chapter was supported by the Deutsche Forschungsgemeinschaft under grant SO1157/1-2.

¹ The entry in (1), due to Kennedy (2007), takes the measure function to be the entire content of the adjective. An alternate approach, invoked by Heim (2000) and others, analyses gradable adjectives as degree relations, that is, type 〈d,〈e,t〉〉. This difference is not crucial to the present discussion.

² The topic of vagueness is covered in more depth by Alxatib & Sauerland in Chapter 20 in this volume. The present brief discussion focuses on the role of experimentation versus introspection in understanding adjectival vagueness.

³ For more in-depth discussion of experimental research on scalar implicature, see the contributions by Skordos & Barner, Degen & Tanenhaus, and Breheny in Chapters 2, 3, and 4, respectively, in the present volume. For an overview of current semantic/pragmatic theories of scalar implicature, the reader is referred to Sauerland (2012a).

CHAPTER 16

ADJECTIVE MEANING AND SCALES