In this section, I present a framework for analyzing causal effects and apply the framework to describe how
field experiments eliminate some of the possible bias in observational studies. For concreteness, I use the Rosenstone and Hansen (1993) study as a running example. In their participation study, some respondents are contacted by campaigns and others are not.
In the language of experiments, some subjects are “treated” (contacted) and others are “untreated” (not contacted). The key
challenge in estimating the causal effect of the
treatment is that the analyst must somehow use the available data to construct an estimate of a counterfactual: what outcome
would have been observed for the treated subjects had they not been treated? The idea that for each subject there is a “potential
outcome” in both the treated and the untreated state is expressed using the notational system termed the
“Rubin causal model” after
Rubin (1978, 1990). To focus on the main ideas, I initially ignore
covariates. For each individual
i, let
Yi0 be the outcome if
i does not receive the treatment (in this example, contact by the mobilization effort) and
Yi1 be the outcome if
i receives the treatment. The treatment effect for individual
i is defined as

Yi1 – Yi0   (1)

The treatment effect for individual
i is the difference between the outcomes for
i in two possible, although mutually exclusive, states
of the world – one in which
i receives the treatment and another in which
i does not. Moving from a single individual to the average for a set of individuals, the
average treatment effect for the treated (ATT) is defined as

ATT = E(Yi1 – Yi0|Ti = 1) = E(Yi1|Ti = 1) – E(Yi0|Ti = 1),   (2)

where E stands for a group average and Ti = 1 when a person is treated. In words, E(Yi1|Ti = 1) is the average post-treatment outcome among those who are treated, and E(Yi0|Ti = 1) is the average outcome that would have been observed had those same individuals not been treated. Equation (2) suggests why it is difficult to estimate a causal effect. Because each individual is either treated or not, for each individual we observe either Yi1 or Yi0; however, calculating Equation (2) requires both quantities for each treated individual. In a dataset, the values of Yi1 are observed for those who are treated, but the causal effect of the treatment cannot be measured without an estimate of what the average Yi0 would have been for these individuals had they not been treated. Experimental and observational research designs employ different
strategies for producing estimates of this counterfactual. Observational data analysis forms a comparison group using those
who remain untreated.
This approach generates
selection bias in the event that the outcomes in the untreated state for those who are untreated are different from the outcomes
in the untreated state for those who are treated. In other words, selection bias occurs if the differences between those who
are and are not treated extend beyond exposure to the treatment. Stated formally, the observational comparison of the treated
and the untreated compares:

E(Yi1|Ti = 1) – E(Yi0|Ti = 0) = [E(Yi1|Ti = 1) – E(Yi0|Ti = 1)] + [E(Yi0|Ti = 1) – E(Yi0|Ti = 0)]   (3)

A comparison of the average outcomes for the treated and the untreated equals the average treatment effect for the treated
plus a selection bias term. The selection bias is due to the difference in the outcomes in the untreated state for those treated
and those untreated. This selection bias problem is a critical issue addressed by experimental methods.
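To make the decomposition in Equation (3) concrete, here is a minimal simulation sketch (not part of the original text; the 5-point contact effect, the turnout propensities, and the targeting rule are all invented for illustration). It verifies that the naive comparison of treated and untreated outcomes equals the ATT plus the selection bias term.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical latent turnout propensity; campaigns tend to contact high-propensity voters.
propensity = rng.uniform(0.1, 0.8, n)

y0 = rng.binomial(1, propensity)                            # turnout if not contacted
y1 = rng.binomial(1, np.minimum(propensity + 0.05, 1.0))    # turnout if contacted (+5 points on average)

# Nonrandom "treatment": the probability of contact rises with propensity (selection).
treated = rng.binomial(1, propensity).astype(bool)

att = (y1[treated] - y0[treated]).mean()                    # E(Yi1 - Yi0 | Ti = 1)
naive = y1[treated].mean() - y0[~treated].mean()            # observed treated/untreated comparison
selection_bias = y0[treated].mean() - y0[~treated].mean()   # E(Yi0|Ti = 1) - E(Yi0|Ti = 0)

print(f"ATT                 : {att:.3f}")
print(f"Naive comparison    : {naive:.3f}")
print(f"ATT + selection bias: {att + selection_bias:.3f}")  # identical to the naive comparison
```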
Random assignment forms groups without reference to either observed or unobserved attributes of the subjects and, consequently,
creates groups of individuals that are similar prior to application of the treatment. When groups are formed through random
assignment, the group randomly labeled the control group has the same expected average outcome in the untreated state as the
set of subjects designated at random to receive the treatment. The randomly assigned control group can therefore be used to
produce an unbiased estimate of what the outcome would have been for the treated subjects, had the treated subjects remained
untreated,
thereby avoiding selection bias.
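Under the same hypothetical setup, a short sketch of how random assignment removes the selection term: because assignment is independent of the potential outcomes, a simple difference in means recovers the average treatment effect up to sampling variability.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

propensity = rng.uniform(0.1, 0.8, n)
y0 = rng.binomial(1, propensity)                          # turnout if untreated
y1 = rng.binomial(1, np.minimum(propensity + 0.05, 1.0))  # turnout if treated

# Random assignment: treatment status is independent of the potential outcomes.
assigned = rng.random(n) < 0.5
estimate = y1[assigned].mean() - y0[~assigned].mean()     # difference in means
truth = (y1 - y0).mean()                                  # average effect over the whole pool

print(f"Difference in means: {estimate:.3f}   True average effect: {truth:.3f}")
```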
The critical assumption for observational work to produce unbiased treatment effect estimates is that, controlling for
covariates (whether through regression or through matching),
E(Yi0|Ti = 1) = E(Yi0|Ti = 0) (i.e., apart from their exposure to the treatment, the treated and untreated group outcomes are on average the same
in the untreated state). Subject to sampling variability, this will be true by design when treatment and control groups are
formed at random. In contrast, observational research uses the observables to adjust the observed outcomes and thereby produce
a proxy for the treated subject's potential outcomes in the untreated state. If this effort is successful, then there is no
selection bias. Unfortunately, without a clear rationale based on detailed knowledge of why some observations are selected
for treatment and others are not, this assumption is rarely convincing. Consider the case of estimating the effect of campaign
contact on voter turnout. First, there are likely to be important omitted variables that are correlated with campaign contact but not captured by the included variables. Campaigns are strategic and commonly use voter files to plan which households
to contact. A key variable in many campaign targeting plans is the household's
history of participation, and households that do not vote tend to be ignored. The set of control variables available in the
ANES data, or other survey datasets, does not commonly include vote history or other variables that might be available to
the campaign for its strategic planning. Second, past turnout is highly correlated with current turnout. Therefore,
E(Yi0|Ti = 1) may be substantially higher than E(Yi0|Ti = 0). Moreover, although it may be possible to make a reasonable guess at the direction of selection bias, analysts rarely have a clear notion of the magnitude of selection bias in particular applications, so it is uncertain how estimates may be corrected.
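As a stylized illustration of this concern (the covariates, the contact rule, and the effect sizes below are all invented), suppose campaigns contact households with a history of voting, past voting strongly predicts current turnout, and the survey dataset contains only a covariate unrelated to targeting. Adjusting for the available covariate then fails to remove the selection bias:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

vote_history = rng.binomial(1, 0.5, n)     # targeting variable, NOT available in the survey
education = rng.binomial(1, 0.5, n)        # covariate that IS available to the analyst

# Campaigns contact past voters at a much higher rate.
contact = rng.binomial(1, 0.1 + 0.6 * vote_history)

# True contact effect is 5 points; past voters turn out 30 points more regardless of contact.
turnout = rng.binomial(1, 0.3 + 0.3 * vote_history + 0.05 * contact + 0.02 * education)

# Regression adjustment using only the available covariate (vote history is omitted).
X = np.column_stack([np.ones(n), contact, education])
coef, *_ = np.linalg.lstsq(X, turnout, rcond=None)
print(f"Estimated contact effect: {coef[1]:.3f}   (true effect: 0.05)")
```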
In addition to selection bias, field experiments address a number of other common methodological difficulties in observational
work, many of which relate to measurement. In field experiments, the analyst controls the treatment assignment, so there is no error in measuring who is
targeted for treatment. Although observational studies could, in principle, also measure the treatment assignment accurately,
in practice analysis is frequently based on survey data, which relies on self-reports. Again, consider the case of the voter
mobilization work. Contact is self-reported (and, for the most part, so is the outcome, voter turnout). When there is misreporting,
the set of individuals who report receiving the treatment is in fact a mix of treated and untreated individuals. By
placing untreated individuals in the treated group and treated individuals in the untreated group, random misclassification
will tend to attenuate the estimated treatment effects. In the extreme case, where the survey report of contact is unrelated
to actual treatment status or individual characteristics, the difference in outcomes for those reporting treatment and those
not reporting treatment will vanish. In contrast, systematic measurement error could lead to exaggeration of treatment effects.
In the case of survey-based voter mobilization research, there is empirical support for concern that misreporting of treatment
status leads to overestimation of treatment effects. Research has demonstrated both large amounts of misreporting and also
a positive correlation between misreporting having been contacted and misreporting having voted
(Vavreck 2007; Gerber and Doherty 2009).
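A small sketch of the attenuation argument, again with invented numbers (a true 8-point effect, a 30 percent contact rate, and varying rates of purely random misreporting). At an error rate of 0.5 the report carries no information about actual contact and the estimated difference collapses toward zero:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

contacted = rng.random(n) < 0.3                       # true treatment status
turnout = rng.binomial(1, 0.45 + 0.08 * contacted)    # true effect: 8 percentage points

def estimate_from_reports(error_rate):
    # With probability error_rate, a respondent's contact report is flipped at random.
    flip = rng.random(n) < error_rate
    reported = np.where(flip, ~contacted, contacted)
    return turnout[reported].mean() - turnout[~reported].mean()

for err in (0.0, 0.2, 0.5):   # 0.5 means the report is unrelated to actual contact
    print(f"misreport rate {err:.1f}: estimated effect = {estimate_from_reports(err):.3f}")
```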
There are some further difficulties with survey-based observational research that are addressed by field experiments. In addition to the uncertainty regarding who
was assigned the treatment, it is sometimes unclear what the treatment was because survey measures are sometimes not sufficiently
precise. For example, the ANES item used for campaign contact in the Rosenstone and Hansen study asks respondents: “Did anyone
from one of the political parties call you up or come around and talk to you about the campaign?” This question ignores nonpartisan
contact; conflates different modes of communication, grouping together face-to-face canvassing, volunteer calls, and commercial
calls (while omitting important activities such as campaign mailings); and does not measure the frequency or timing of contact.
In addition to the
biases discussed thus far, another potential source of difference between the observational and experimental estimates is
that those who are treated outside the experimental context may not be the same people who are treated in an experiment. If
those who are more likely to be treated in the real world (perhaps because they are likely to be targeted by political campaigns)
have especially large (or small) treatment effects, then an experiment that studies a random sample of registered voters will
underestimate (or overestimate) the ATT of what may often be the true population of interest – those individuals most likely
to be treated in typical campaigns. A partial corrective is to weight the experimental results so that the sample composition matches that of the population treated in natural settings, although this would fail to account for differences in treatment effects between those who are actually treated in real-world settings and those who “look” like them but are not treated.
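A minimal sketch of this weighting corrective, with invented strata, effect estimates, and shares: stratum-specific experimental estimates are reweighted to match the composition of the population that campaigns actually contact.

```python
import numpy as np

# Hypothetical stratum-specific effects estimated from an experiment
# (say, frequent vs. infrequent past voters); the numbers are invented.
effects = np.array([0.08, 0.02])         # effect for frequent voters, infrequent voters

sample_shares = np.array([0.5, 0.5])     # composition of the experimental sample
contacted_shares = np.array([0.8, 0.2])  # composition of those campaigns actually contact

print(f"Estimate weighted to the experimental sample  : {np.dot(effects, sample_shares):.3f}")
print(f"Estimate reweighted to the contacted population: {np.dot(effects, contacted_shares):.3f}")
```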
Finally, although this discussion has focused on the advantages of randomized experiments over observational studies, field
experimentation also has some advantages over conventional laboratory experimentation in estimating campaign effects. Briefly,
field experiments of campaign communications typically study the population of registered voters (rather than a student population or other volunteers),
measure behavior in the natural context (versus a university laboratory or a “simulated” natural environment; also, subjects
are typically unaware that they are part of an experiment), and typically estimate the effect of treatments on actual turnout (rather than on a surrogate measure such as stated vote intention or political interest).
Field experiments are not a panacea, and there are often substantial challenges in the implementation, analysis, and interpretation
of findings. For an informative recent discussion of some of the limitations of field experiments, see
Humphreys and Weinstein (2009) and especially Deaton (2009); for a reply to Deaton, see Imbens (2009). Rather than compile and evaluate a comprehensive list of potential concerns and limitations, I provide in this section a somewhat informal account of how I address some of the questions I am frequently asked about field experiments. The issue of the external validity of field experiments is left for Section 7.
Some field experiments have high levels of noncompliance due to the inability to treat all of those assigned to the treatment group (low contact rates). Other methods,
such as lab experiments, seem to have perfect compliance. Does this mean field experiments are biased?
Given that one-sided noncompliance (i.e., the control group remains untreated, but some of those assigned to the treatment
group are not treated) is by far the most common situation in political science field experiments, the answer addresses this
case. If the researcher is willing to make some important technical assumptions (see Angrist, Imbens, and Rubin [1996] for a formal statement of the result), then when there is failure to treat in a randomized experiment, a consistent (large-sample unbiased) estimate of the average treatment effect on the treated can be obtained by differencing the mean outcomes for those assigned to the treatment and control groups and dividing this difference by the proportion of the treatment group that is actually treated.
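A minimal simulation of this estimator (the 40 percent contact rate, the baseline turnout, and the 10-point effect are invented, and the effect is assumed homogeneous to keep the sketch simple):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

reachable = rng.random(n) < 0.4                   # only "reachable" subjects can be treated
assigned = rng.random(n) < 0.5                    # random assignment to treatment or control
treated = assigned & reachable                    # one-sided noncompliance: controls stay untreated

y0 = rng.binomial(1, 0.45, n)                     # turnout if untreated
y1 = rng.binomial(1, 0.55, n)                     # turnout if treated (10-point effect)
y = np.where(treated, y1, y0)                     # observed outcome

itt = y[assigned].mean() - y[~assigned].mean()    # intent-to-treat effect
contact_rate = treated[assigned].mean()           # share of the treatment group actually treated
cace = itt / contact_rate                         # estimated effect on those treated

print(f"ITT: {itt:.3f}   Contact rate: {contact_rate:.2f}   CACE: {cace:.3f}")
```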
The consequences of failure to treat are illustrated in
Figure 9.1, which depicts the population analogues for the quantities that are produced by an experiment with noncompliance.
Figure 9.1 provides some important intuitions about the properties and limitations of the treatment effect estimate when some portion
of the treatment group is not treated. It depicts a pool of subjects in which there are three types of people (a person's
type is not directly observable to the experimenter), and in which each type has different values of
Yi(0) and Yi(1), where Yi(X) is the potential outcome for a subject of type i when treated (X = 1) or untreated (X = 0). Individuals are arrayed by group, with the X axis marking the population proportion of each type and the
Y axis indicating average outcome levels for subjects in each group. Panel A depicts the subjects when they are assigned to
the treatment group, and panel B shows the subjects when assigned to the control group. (Alternatively,
Figure 9.1 can be thought of as depicting the potential outcomes for a large population sample, with some subjects randomly assigned
to the treatment group and others to the control group. In this case, the independence of treatment group assignment and potential
outcomes ensures that for a large sample, the proportions of each type of person are the same for the treatment and the control
group, as are the Yi(X) levels.)
Panel A shows the case where two of the three types of people are actually treated when assigned to the treatment group and
one type is not successfully treated when assigned to the treatment group (in this example, type 1 and type 2 are called “compliers”
and type 3 people are called “noncompliers”). The height of each of the three columns represents the average outcome for each
group, and their widths represent the proportion of
the subject population of that type. Consider a simple comparison of the average outcome when subjects are assigned to the
treatment group versus the control group (a.k.a. the intent-to-treat [ITT] effect). The geometric analogue to this estimate
is to calculate the difference in the total area of the shaded rectangles for both treatment and control assignment. Visually,
it is clear that the difference between the total area in panel A and the total area in panel B is an area created by the
change in
Y in panel A due to the application of the treatment to groups 1 and 2 (the striped rectangles). Algebraically, the difference
between the treatment group average and the control group average, the ITT, is equal to [Y1(1) – Y1(0)]p1 + [Y2(1) – Y2(0)]p2. Dividing this quantity by the share of the treatment group actually treated, (p1 + p2), produces the ATT. This is also called the
complier average causal effect (CACE), highlighting the fact that the difference between the average outcomes when the group
is assigned to the treatment versus the control condition is produced by the changing treatment status and subsequent difference
in outcomes for the subset of the subjects who are compliers.
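A worked numeric version of this logic follows (the population shares and outcome levels are invented for illustration, not taken from Figure 9.1): the ITT is built from the complier terms alone and then rescaled by the contact rate.

```python
# Hypothetical shares and potential outcomes for the three types depicted in Figure 9.1.
p = {1: 0.3, 2: 0.3, 3: 0.4}        # population proportion of each type
y0 = {1: 0.40, 2: 0.50, 3: 0.60}    # average outcome if untreated
y1 = {1: 0.55, 2: 0.60, 3: 0.62}    # average outcome if treated (type 3's is never observed)

# Only types 1 and 2 (the compliers) change treatment status when assigned to treatment.
itt = sum(p[t] * (y1[t] - y0[t]) for t in (1, 2))
contact_rate = p[1] + p[2]
cace = itt / contact_rate

print(f"ITT  = {itt:.3f}")          # 0.3*0.15 + 0.3*0.10 = 0.075
print(f"CACE = {cace:.3f}")         # 0.075 / 0.6 = 0.125
```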
As
Figure 9.1 suggests, one consequence of failure to treat all of those assigned to the treatment group is that the average treatment
effect is estimated for the treated, not the entire subject population.
The average treatment effect (ATE) for the entire subject pool equals [Y1(1) – Y1(0)]p1 + [Y2(1) – Y2(0)]p2 + [Y3(1) – Y3(0)]p3. Because the final term in the ATE expression is not observed, an implication of noncompliance is that the researcher can directly estimate treatment effects only for the subset of the population that can actually be treated. The implications of
measuring the ATT rather than the ATE depend on the research objectives and whether treatment effects vary across individuals.
Sometimes the treatment effect among those who are treated is what the researcher is interested in, in which case failure
to treat some types of subjects is a feature of the experiment, not a deficiency. For example, if a campaign is interested
in the returns from a particular type of canvassing sweep through a neighborhood, the campaign wants to know the response
of the people whom the effort will likely reach, not the hypothetical responses of people who do not open the door to canvassers
or who have moved away.
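Continuing the numeric sketch above (with the same invented numbers, plus an assumed 2-point effect for the never-treated type), the unobserved type 3 term is what separates the ATE from the CACE:

```python
# Same invented shares as above; effect = Y(1) - Y(0) for each type.
p = {1: 0.3, 2: 0.3, 3: 0.4}
effect = {1: 0.15, 2: 0.10, 3: 0.02}   # type 3's effect is hypothetical and never observed

cace = sum(p[t] * effect[t] for t in (1, 2)) / (p[1] + p[2])
ate = sum(p[t] * effect[t] for t in (1, 2, 3))

print(f"CACE = {cace:.3f}")            # 0.125: effect among the compliers only
print(f"ATE  = {ate:.3f}")             # 0.083: includes the unobserved type 3 term
```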
If treatment effects are homogeneous, then the CACE and the ATE are the same, regardless of the contact rate. Demonstrating
that those who are treated in an experiment have pre-treatment observables that differ from the overall population mean is
not sufficient to show that the CACE is different from the ATE, because what matters is the treatment effect for compliers
versus noncompliers (see Equation [1]), not the covariates or the level of Yi(0). Figure 9.1 could be adjusted (by making the size of the gap between Yi(0) and Yi(1) equal for all groups) so that all groups have different Yi(0) but the same values of Yi(1) – Yi(0). Furthermore, higher contact rates may help reduce any gap between the CACE and the ATE. As
Figure 9.1 illustrates, if the type 3 (untreated) share of the population approaches zero (the column narrows), then the treatment effect
for this type would have to be very different from that of the other subjects in order to produce enough “area” to create a large difference between the ATE and the CACE. Although raising the share of the treatment group that is successfully treated
typically reduces the difference between ATE and CACE, in a pathological case, if the marginal treated individual has a more
atypical treatment effect than the average of those “easily” treated, then the gap between CACE and ATE may grow as the proportion
treated increases. The ATE and CACE gap can be investigated empirically by observing treatment effects under light and intensive
efforts to treat. This approach parallels the strategy of investigating the effects of survey nonresponse by using extra
effort to interview and determining whether there are differences in the lower and higher response rate samples (Pew Research
Center 1998).
Although the issue of partial treatment of the target population is very conspicuous in many field experiments, it is also
a common problem in laboratory experiments. Designs that, like typical laboratory experiments, postpone randomization until compliance is assured achieve a 100 percent treatment rate, but this does not “solve” the problem of measuring the treatment effect for a population (ATE) versus the effect for those who are treatable (CACE). The estimand for a laboratory experiment is the ATE for the particular group of people
who show up for the experiment. Unless this is also the ATE for the broader target population, failure to treat has entered
at the subject recruitment stage.
One final note is that nothing in this answer should be taken as asserting that a low contact rate does not matter. The discussion
has focused on estimands, but there are several important difficulties in estimating the CACE when there are low levels of
compliance. First, noncompliance affects the precision of the experimental estimates. Intuitively, when there is nearly 100 percent failure to treat, meaningful experimental estimates of the CACE could hardly be produced: the noise from random differences in Y due to sampling variability in the treatment and control groups would presumably swamp any of the difference between the
treatment and control groups that was generated by the treatment effect. Indeed, a low contact rate will lead to larger standard
errors and may leave the experimenter unable to produce useful estimates of the treatment effect for the compliers.
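To illustrate the precision point, the following sketch (all parameters invented) repeatedly simulates the same experiment at different contact rates and reports how the sampling variability of the CACE estimate grows as the contact rate falls:

```python
import numpy as np

rng = np.random.default_rng(5)

def cace_sampling_sd(contact_rate, n=20_000, reps=500, effect=0.10):
    # Simulate the same experiment many times and collect the CACE estimates.
    estimates = []
    for _ in range(reps):
        assigned = rng.random(n) < 0.5
        treated = assigned & (rng.random(n) < contact_rate)
        y = rng.binomial(1, 0.45 + effect * treated)
        itt = y[assigned].mean() - y[~assigned].mean()
        estimates.append(itt / treated[assigned].mean())
    return np.std(estimates)

for rate in (0.6, 0.3, 0.1):
    print(f"contact rate {rate:.1f}: sampling std. dev. of the CACE estimate = {cace_sampling_sd(rate):.3f}")
```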
Further, estimating the CACE by dividing the observed difference in treatment and control group outcomes by the share of
the treatment group that is treated can be represented as a two-stage estimator. The first stage regression estimates observed
treatment status (treated or not treated) as a function of whether the subject is assigned to the treatment group. When treatment
group assignment produces only a small change in treatment status (e.g., the contact rate is very low in a mobilization experiment)
and the sample size is small, group assignment is a weak instrument and estimates may be meaningfully biased. The danger of
substantial bias from weak instruments can be diagnosed by checking whether assignment to the treatment group produces a strong, statistically significant effect on treatment status. See Angrist and Pischke (2009) for an extended discussion of this issue.
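A minimal sketch of the two-stage logic, using only numpy and invented parameters: the first-stage coefficient is simply the contact rate, its squared t statistic approximates the kind of first-stage diagnostic discussed by Angrist and Pischke (2009), and the second stage divides the ITT by the first-stage coefficient.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2_000                                               # a fairly small experiment

assigned = rng.random(n) < 0.5                          # instrument: random assignment
treated = assigned & (rng.random(n) < 0.05)             # very low contact rate (5 percent)
y = rng.binomial(1, 0.45 + 0.10 * treated)              # hypothetical 10-point effect on the treated

# First stage: effect of assignment on treatment status (the contact rate).
first_stage = treated[assigned].mean() - treated[~assigned].mean()

# Rough first-stage F statistic (squared t for the difference in treatment rates);
# a small value signals a weak instrument.
var = (treated[assigned].var(ddof=1) / assigned.sum()
       + treated[~assigned].var(ddof=1) / (~assigned).sum())
f_stat = first_stage**2 / var

# Second stage (Wald / two-stage least squares): ITT divided by the first-stage coefficient.
itt = y[assigned].mean() - y[~assigned].mean()
print(f"first stage = {first_stage:.3f}, F = {f_stat:.1f}, 2SLS estimate = {itt / first_stage:.3f}")
```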
Do Field Experiments Assume Homogeneous Treatment Effects?
The answer is “no.” See
Figure 9.1, which depicts a population in which the compliers are divided into two subpopulations with different treatment effects.
The ITT and the CACE are both averages of individual treatment effects, which may vary across individuals.
Are Field Experiments Ethical?
All activities, including research, raise ethical questions. For example, it is surprising to read that certain physics experiments
currently being conducted are understood by theoreticians to have a measurable (although very small) probability of condensing
the planet Earth into a sphere 100 meters in diameter
(Posner 2004). I am not aware of any field experiments in political science that pose a remotely similar level of threat. A full treatment
of the subject of research ethics is well beyond the scope of a brief response and not my area of expertise, but I will make
several points that I believe are sometimes neglected.
First, advocates of randomized trials in medicine turn the standard ethical questions around and argue that those who treat
patients in the absence of well-controlled studies should reflect on the ethics of using unproven methods and not performing
the experiments necessary to determine whether the interventions they employ actually work. They argue that many established
practices and policies are often merely society-wide experiments (and, as such, poorly designed experiments that lack a control
group but somehow sidestep ethical scrutiny
and bureaucratic review). They recount the tragedies that followed when practices were adopted without the support of experimental
evidence
(Chalmers 2003). Taking this a step further, recent work has begun to quantify the lives lost due to delays imposed by institutional review
boards
(Whitney and Schneider 2010).
Second, questions are occasionally raised as to whether an experimental intervention might change a social outcome, such as
affecting an election outcome by increasing turnout. Setting aside the issue of whether changing an election outcome through
increased participation or a more informed electorate (the most common mechanism for this hypothetical event, given current
political science field experiments) is problematic or praiseworthy, in the highly unlikely event that an experiment did alter
an election result, this would only occur for the small subset of elections where the outcome would have been tied or nearly
tied in the absence of the experiment. In such cases, there are countless other mundane and essentially arbitrary factors that affect the outcome, with electoral consequences orders of magnitude larger than those of the typical experimental intervention.
A partial list includes ballot order (Miller and Krosnick 1998); place of voting (Berger, Meredith, and Wheeler 2008); number of polling places (Brady and McNulty 2004); use of optical scan versus punch card ballots (Ansolabehere and Stewart 2005); droughts, floods, or recent shark attacks (Achen and Bartels 2004); rain on Election Day (Knack 1994); and a win by the local football team on the weekend prior to the election (Healy, Malhotra, and Mo 2009). That numerous trivial or even ridiculous factors might swing an election seems at first galling, but note that these factors
only matter when the electorate is very evenly divided. In this special case, however, regardless of the election outcome,
approximately equal numbers of citizens will be pleased and disappointed with the result. As long as there is no systematic bias in which side gets the benefit of chance, there may be little reason for concern. Perhaps this is why we do not bankrupt
the treasury to make sure our elections are entirely error
free.
Does the Fact That Field Experiments Do Not Control for Background Activity Cause Bias?
Background activity affects the interpretation of the experimental results but does not cause bias. Background conditions affect Yi(0) and Yi(1), but subjects can still be randomly assigned and unbiased treatment effects estimated in the usual fashion. That is not to say that background conditions do not matter: they may affect Yi(0) and Yi(1) and therefore the treatment effect Yi(1) – Yi(0). If the treatment effect varies with background conditions, then background factors affect the generalizability of the results, and the treatment effect that is estimated should be thought of as conditional on the background conditions.
Are Field Experiments Too Expensive to Be Used in My Research?
Field experiments tend to be expensive, but there are ways to reduce the cost, sometimes dramatically. Many recent field experiments
were performed in cooperation with organizations that are interested in evaluating a program or communications effort. Fortunately,
a growing proportion of foundations are requiring (and paying for) rigorous evaluation of the programs they support, which
should provide a steady flow of projects looking for partners to assist in experimental evaluations.
What about Treatment “Spillover” Effects?
Spillover effects occur when those who are treated in turn alter their behavior in a way that affects other subjects.
Spillover is a potentially serious issue in field experiments. It is also fair to note that spillover is typically not a problem in the controlled environment of laboratory experiments, because contact among subjects can be observed and regulated. In most applications, the presence of spillover effects attenuates estimated treatment effects by causing the control group
to be partially treated. If the researcher is
concerned about mitigating the danger from spillover effects, then reducing the density of treatment is one option; this will
likely reduce the share of the control group affected by spillover. Another perspective is to consider spillover effects as
worth measuring in their own right; some experiments have been designed to measure spillover
(Nickerson 2008). It is sometimes forgotten that spillover is also an issue in observational research. In survey-based observational studies
of party contact and candidate choice, for example, only those who report direct party contact are coded as contacted. If
those who are contacted in turn mobilize those who are not contacted, then this will introduce a downward bias into the observational
estimate of the causal effect of party contact, which is based on comparison of those coded treated and those coded
untreated.
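A stylized sketch of the attenuation mechanism (invented numbers; compliance is assumed perfect and spillover is modeled crudely as untreated subjects receiving the full treatment effect with some probability, which stands in for the density of treatment around them):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

def estimated_effect(spillover_prob):
    # Direct effect of treatment is 8 points; with probability spillover_prob an
    # untreated (control) subject is indirectly mobilized and gets the same boost.
    assigned = rng.random(n) < 0.5
    spilled = ~assigned & (rng.random(n) < spillover_prob)
    y = rng.binomial(1, 0.45 + 0.08 * (assigned | spilled))
    return y[assigned].mean() - y[~assigned].mean()

for q in (0.0, 0.25, 0.5):
    print(f"spillover probability {q:.2f}: estimated effect = {estimated_effect(q):.3f}")
```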
Abramowitz, Alan I. 1988. “Explaining Senate Election Outcomes.” American Political Science Review 82: 385–403.
Achen, Christopher H., and Larry M. Bartels. 2004. “Blind Retrospection: Electoral Responses to Droughts, Flu, and Shark Attacks.” Estudio/Working Paper 2004/199.
Adams, William C., and Dennis J. Smith. 1980. “Effects of Telephone Canvassing on Turnout and Preferences: A Field Experiment.” Public Opinion Quarterly 44: 389–95.
Addonizio, Elizabeth, Donald Green, and James M. Glaser. 2007. “Putting the Party Back into Politics: An Experiment Testing Whether Election Day Festivals Increase Voter Turnout.” PS: Political Science & Politics 40: 721–27.
Angrist, Joshua D. 1990. “Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from Social Security Administrative Records.” American Economic Review 80: 313–36.
Angrist, Joshua D., Guido Imbens, and Donald B. Rubin. 1996. “Identification of Causal Effects Using Instrumental Variables.” Journal of the American Statistical Association 91: 444–72.
Angrist, Joshua D., and Alan B. Krueger. 1991. “Does Compulsory School Attendance Affect Schooling and Earnings?” Quarterly Journal of Economics 106: 979–1014.
Angrist, Joshua D., and Jorn-Steffen Pischke. 2009. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton, NJ: Princeton University Press.
Ansolabehere, Stephen D., and Alan S. Gerber. 1994. “The Mismeasure of Campaign Spending: Evidence from the 1990 U.S. House Elections.” Journal of Politics 56: 1106–18.
Ansolabehere, Stephen D., and Shanto Iyengar. 1996. Going Negative: How Political Advertising Divides and Shrinks the American Electorate. New York: Free Press.
Ansolabehere, Stephen D., and James M. Snyder. 1996. “Money, Elections, and Candidate Quality.” Typescript, Massachusetts Institute of Technology.
Ansolabehere, Stephen D., and Charles Stewart III. 2005. “Residual Votes Attributable to Technology.” Journal of Politics 67: 365–89.
Arceneaux, Kevin, Alan S. Gerber, and Donald P. Green. 2006. “Comparing Experimental and Matching Methods Using a Large-Scale Voter Mobilization Experiment.” Political Analysis 14: 1–36.
Arceneaux, Kevin, and David Nickerson. 2009. “Who Is Mobilized to Vote? A Re-Analysis of Eleven Randomized Field Experiments.” American Journal of Political Science 53: 1–16.
Bergan, Daniel E. 2009. “Does Grassroots Lobbying Work?: A Field Experiment Measuring the Effects of an E-Mail Lobbying Campaign on Legislative Behavior.” American Politics Research 37: 327–52.
Berger, Jonah, Marc Meredith, and S. Christian Wheeler. 2008. “Contextual Priming: Where People Vote Affects How They Vote.” Proceedings of the National Academy of Sciences 105: 8846–49.
Brady, Henry E., and John E. McNulty. 2004. “The Costs of Voting: Evidence from a Natural Experiment.” Paper presented at the annual meeting of the Society for Political
Methodology, Palo Alto, CA.
Butler, Daniel M., and David W. Nickerson. 2009. “Are Legislators Responsive to Public Opinion? Results from a Field Experiment.” Typescript, Yale University.
Chalmers, Iain. 2003. “Trying To Do More Good Than Harm in Policy and Practice: The Role of Rigorous, Transparent, Up-to-Date Evaluations.” Annals of the American Academy of Political and Social Science 589: 22–40.
Chin, Michelle L., Jon R. Bond, and Nehemia Geva. 2000. “A Foot in the Door: An Experimental Study of PAC and Constituency Effects on Access.” Journal of Politics 62: 534–49.
Dale, Allison, and Aaron Strauss. 2007. “Mobilizing the Mobiles: How Text Messaging Can Boost Youth Voter Turnout.” Working paper, University of Michigan.
Davenport, Tiffany C., Alan S. Gerber, and Donald P. Green. 2010. “Field Experiments and the Study of Political Behavior.” In The Oxford Handbook of American Elections and Political Behavior, ed. Jan E. Leighley. New York: Oxford University Press, 69–88.
Deaton, Angus S. 2009. “Instruments of Development: Randomization in the Tropics, and the Search for the Elusive Keys to Economic
Development.” NBER Working Paper No. 14690.
Eldersveld, Samuel J. 1956. “Experimental Propaganda Techniques and Voting Behavior.” American Political Science Review 50: 154–65.
Eldersveld, Samuel J., and Richard W. Dodge. 1954. “Personal Contact or Mail Propaganda? An Experiment in Voting Turnout and Attitude Change.” In Public Opinion and Propaganda, ed. Daniel Katz. New York: Holt, Rinehart and Winston, 532–42.
Erikson, Robert S., and Thomas R. Palfrey. 2000. “Equilibria in Campaign Spending Games: Theory and Data.” American Political Science Review 94: 595–609.
Gerber, Alan S. 1998. “Estimating the Effect of Campaign Spending on Senate Election Outcomes Using Instrumental Variables.” American Political Science Review 92: 401–11.
Gerber, Alan S. 2004. “Does Campaign Spending Work?: Field Experiments Provide Evidence and Suggest New Theory.” American Behavioral Scientist 47: 541–74.
Gerber, Alan S. in press. “New Directions in the Study of Voter Mobilization: Combining Psychology and Field Experimentation.”
Gerber, Alan S., and David Doherty. 2009. “Can Campaign Effects Be Accurately Measured Using Surveys?: Evidence from a Field Experiment.” Typescript, Yale University.
Gerber, Alan S., David Doherty, and Conor M. Dowling. 2009. “Developing a Checklist for Reporting the Design and Results of Social Science Experiments.” Typescript, Yale University.
Gerber, Alan S., James G. Gimpel, Donald P. Green, and Daron R. Shaw. in press. “The Size and Duration of Campaign Television Advertising Effects: Results from a Large-Scale Randomized Experiment.”
American Political Science Review.
Gerber, Alan S., and Donald P. Green. 2000. “The Effects of Canvassing, Direct Mail, and Telephone Contact on Voter Turnout: A Field Experiment.” American Political Science Review 94: 653–63.
Gerber, Alan S., and Donald P. Green. 2008. “Field Experiments and Natural Experiments.” In Oxford Handbook of Political Methodology, eds. Janet M. Box-Steffensmeier, Henry E. Brady, and David Collier. New York: Oxford University Press, 357–81.
Gerber, Alan S., and Donald P. Green. 2011. Field Experiments: Design, Analysis, and Interpretation. Unpublished Manuscript, Yale University.
Gerber, Alan S., Donald P. Green, and Edward H. Kaplan. 2004. “The Illusion of Learning from Observational Research.” In Problems and Methods in the Study of Politics, eds. Ian Shapiro, Rogers Smith, and Tarek Massoud. New York: Cambridge University Press, 251–73.
Gerber, Alan S., Donald P. Green, and Christopher W. Larimer. 2008. “Social Pressure and Voter Turnout: Evidence from a Large-Scale Field Experiment.” American Political Science Review 102: 33–48.
Gerber, Alan S., Donald P. Green, and David W. Nickerson. 2001. “Testing for Publication Bias in Political Science.” Political Analysis 9: 385–92.
Gerber, Alan S., Gregory A. Huber, and Ebonya Washington. in press. “Party Affiliation, Partisanship, and Political Beliefs: A Field Experiment.” American Political Science Review.
Gerber, Alan S., Dean Karlan, and Daniel Bergan. 2009. “Does the Media Matter? A Field Experiment Measuring the Effect of Newspapers on Voting Behavior and Political Opinions.” American Economic Journal: Applied Economics 1: 35–52.
Gerber, Alan S., and Kyohei Yamada. 2008. “Field Experiment, Politics, and Culture: Testing Social Psychological Theories Regarding Social Norms Using a Field Experiment
in Japan.” Working paper, ISPS Yale University.
Gosnell, Harold F. 1927. Getting-Out-the-Vote: An Experiment in the Stimulation of Voting. Chicago: The University of Chicago Press.
Green, Donald P., and Alan S. Gerber. 2004. Get Out the Vote: How to Increase Voter Turnout. Washington, DC: Brookings Institution Press.
Green, Donald P., and Alan S. Gerber. 2008. Get Out the Vote: How to Increase Voter Turnout. 2nd ed. Washington, DC: Brookings Institution Press.
Green, Donald P., and Jonathan S. Krasno. 1988. “Salvation for the Spendthrift Incumbent: Reestimating the Effects of Campaign Spending in House Elections.” American Journal of Political Science 32: 884–907.
Guan, Mei, and Donald P. Green. 2006. “Non-Coercive Mobilization in State-Controlled Elections: An Experimental Study in Beijing.” Comparative Political Studies 39: 1175–93.
Habyarimana, James, Macartan Humphreys, Dan Posner, and Jeremy Weinstein. 2007. “Why Does Ethnic Diversity Undermine Public Goods Provision? An Experimental Approach.” American Political Science Review 101: 709–25.
Harrison, Glenn W., and John A. List. 2004. “Field Experiments.” Journal of Economic Literature 42: 1009–55.
Healy, Andrew J., Neil Malhotra, and Cecilia Hyunjung Mo. 2009. “Do Irrelevant Events Affect Voters’ Decisions? Implications for Retrospective Voting.” Stanford Graduate School of Business
Working Paper No. 2034.
Humphreys, Macartan, and Jeremy M. Weinstein. 2007. “Policing Politicians: Citizen Empowerment and Political Accountability in Africa.” Paper presented at the annual meeting
of the American Political Science Association, Chicago.
Humphreys, Macartan, and Jeremy Weinstein. 2009. “Field Experiments and the Political Economy of Development.” Annual Review of Political Science 12: 367–78.
Hyde, Susan D. 2010. “Experimenting in Democracy Promotion: International Observers and the 2004 Presidential Elections in Indonesia.” Perspectives on Politics 8: 511–27.
Imbens, Guido W. 2009. “Better Late Than Nothing: Some Comments on Deaton (2009) and Heckman and Urzua (2009).” National Bureau of Economic Research
Working Paper No. 14896.
Jacobson, Gary C. 1978. “The Effects of Campaign Spending in Congressional Elections.” American Political Science Review 72: 469–91.
Jacobson, Gary C. 1985. “Money and Votes Reconsidered: Congressional Elections, 1972–1982.” Public Choice 47: 7–62.
Jacobson, Gary C. 1990. “The Effects of Campaign Spending in House Elections: New Evidence for Old Arguments.” American Journal of Political Science 34: 334–62.
Jacobson, Gary C. 1998. The Politics of Congressional Elections. New York: Longman.
John, Peter, and Tessa Brannan. 2008. “How Different Are Telephoning and Canvassing? Results from a ‘Get Out the Vote’ Field Experiment in the British 2005 General
Election.” British Journal of Political Science 38: 565–74.
Knack, Steve. 1994. “Does Rain Help the Republicans? Theory and Evidence on Turnout and the Vote.” Public Choice 79: 187–209.
LaLonde, Robert J. 1986. “Evaluating the Econometric Evaluations of Training Programs with Experimental Data.” American Economic Review 76: 604–20.
Levitt, Steven D. 1994. “Using Repeat Challengers to Estimate the Effect of Campaign Spending on Election Outcomes in the U.S. House.” Journal of Political Economy 102: 777–98.
Loudon, Irvine. 2000. The Tragedy of Childbed Fever. Oxford: Oxford University Press.
Medical Research Council, Streptomycin in Tuberculosis Trials Committee. 1948. “Streptomycin Treatment for Pulmonary Tuberculosis.” British Medical Journal 2: 769–82.
Michelson, Melissa R. 2003. “Getting Out the Latino Vote: How Door-to-Door Canvassing Influences Voter Turnout in Rural Central California.” Political Behavior 25: 247–63.
Michelson, Melissa R., Lisa García Bedolla, and Margaret A. McConnell. 2009. “Heeding the Call: The Effect of Targeted Two-Round Phonebanks on Voter Turnout.” Journal of Politics 71: 1549–63.
Miller, Joanne M., and Jon A. Krosnick. 1998. “The Impact of Candidate Name Order on Election Outcomes.” Public Opinion Quarterly 62: 291–330.
Miller, Roy E., David A. Bositis, and Denise L. Baer. 1981. “Stimulating Voter Turnout in a Primary: Field Experiment with a Precinct Committeeman.” International Political Science Review 2: 445–60.
Nickerson, David W. 2008. “Is Voting Contagious? Evidence from Two Field Experiments.” American Political Science Review 102: 49–57.
Olken, Benjamin. 2010. “Direct Democracy and Local Public Goods: Evidence from a Field Experiment in Indonesia.” American Political Science Review 104: 243–67.
Paluck, Elizabeth Levy, and Donald P. Green. 2009. “Deference, Dissent, and Dispute Resolution: An Experimental Intervention Using Mass Media to Change Norms and Behavior in
Rwanda.” American Political Science Review 103: 622–44.
Panagopoulos, Costas, and Donald P. Green. 2008. “Field Experiments Testing the Impact of Radio Advertisements on Electoral Competition.” American Journal of Political Science 52: 156–68.
Posner, Richard A. 2004. Catastrophe: Risk and Response. Oxford: Oxford University Press.
Rosenstone, Steven J., and John Mark Hansen. 1993. Mobilization, Participation, and Democracy in America. New York: MacMillan.
Rubin, Donald B. 1978. “Bayesian Inference for Causal Effects: The Role of Randomization.” The Annals of Statistics 6: 34–58.
Rubin, Donald B. 1990. “Formal Modes of Statistical Inference for Causal Effects.” Journal of Statistical Planning and Inference 25: 279–92.
Taubes, Gary. 1993. Bad Science: The Short Life and Weird Times of Cold Fusion. New York: Random House.
Vavreck, Lynn. 2007. “The Exaggerated Effects of Advertising on Turnout: The Dangers of Self-Reports.” Quarterly Journal of Political Science 2: 287–305.
Verba, Sidney, Kay Lehman Schlozman, and Henry E. Brady. 1995. Voice and Equality: Civic Voluntarism in American Politics. Cambridge, MA: Harvard University Press.
Wantchekon, Leonard. 2003. “Clientelism and Voting Behavior: Evidence from a Field Experiment in Benin.” World Politics 55: 399–422.
Whitney, Simon N., and Carl E. Schneider. 2010. “A Method to Estimate the Cost in Lives of Ethics Board Review of Biomedical Research.” Paper presented at the “Is Medical
Ethics Really in the Best Interest of the Patient?” Conference, Uppsala, Sweden.