CHAPTER 3

Judged by Machines

In the mid-1990s, various US carriers raised an antitrust case against American Airlines and United Airlines. The complaint was that these airlines' reservation systems were biased against other foreign and domestic carriers.1 Their case focused on the algorithms used in ticket reservation systems. These systems prioritized flights that used the same carrier for all the legs of a trip, so an agent searching for a ticket from Louisville, Kentucky, to London, going through New York, would see flights involving no change of carrier higher on the list than flights involving two carriers. Since screens in the 1990s showed only four or five flights, and 90 percent of all bookings came from the first screen, small differences in ranking had big financial implications.

As this airline reservation example shows, algorithms are not always fair. In fact, algorithmic bias is now a prevalent topic of discussion as it concerns computer vision systems,2 university admission protocols,3 natural language processing,4 recommender systems,5 courtroom risk assessment tools,6 online advertisements,7 and finance.8 But much like human biases, algorithmic biases are not simply the result of maleficence. They can emerge from both practical and fundamental considerations.9

On the practical side, both people and machines learn from data that are often biased and incomplete—the data we have instead of the data we wish we had.10

Biased data can lead to biased learning and behavior. But even with perfect data, guaranteeing fairness may not be possible. Fairness can be defined in multiple ways, and it is not always possible to satisfy several of these definitions simultaneously.11

To illustrate this fundamental limitation, consider two populations: A and B. A and B could be people identifying with different genders, or belonging to different nationalities, age groups, or ethnicities. For the purpose of this exercise, the type of difference or its source doesn’t matter. What matters is that we want to achieve a fair outcome when it comes to our treatment of populations A and B.

But what constitutes a fair outcome? To keep things simple, consider two definitions.

The first definition is known as statistical parity or demographic parity. This means guaranteeing that positive outcomes reach equal proportions of A and B.12 The second definition is equality of false rejections, or equality of opportunity.13 This means guaranteeing that the probability of being falsely rejected (that is, rejected despite being qualified) is the same whether you are from population A or population B.
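
Stated a bit more formally (the notation is ours, introduced only for illustration), write D = 1 for a positive decision, such as receiving a loan, and Y = 1 for an individual who is in fact qualified:

```latex
% Statistical (demographic) parity: positive decisions at equal rates.
P(D = 1 \mid \mathrm{group} = A) = P(D = 1 \mid \mathrm{group} = B)

% Equality of false rejections (equality of opportunity):
% equal chances of rejection among the qualified.
P(D = 0 \mid Y = 1, \mathrm{group} = A) = P(D = 0 \mid Y = 1, \mathrm{group} = B)
```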

In principle, satisfying both definitions is possible if we consider an outcome that doesn’t hinge on any particular selection criterion or merit. For instance, if we pass out free concert tickets at random, we would satisfy both statistical parity and equality of false rejections. But what if the fans of the band playing in the concert were not equally distributed among both populations, A and B? In that case, distributing tickets at random would be unfair for the group that included most of these fans. Fans in this group would get fewer tickets and be more likely to be rejected.

This simple example can help us motivate more complex—and relevant—cases. Instead of free concert tickets, consider giving out loans, admitting students to college, or giving someone a promotion at work. These are all cases that not only are more delicate, but also imply some degree of selection or merit. The case of the loan is more straightforward. In principle, loans should be allocated to those who are more likely to repay them. Promotions and college admissions are trickier because they invoke the idea of merit, which may be harder to measure, even post hoc, than whether someone can repay a loan.

To illustrate how selection or merit interacts with our two notions of fairness, let populations A and B be of the same size but have a different probability of paying back a loan. To keep things simple, assume that 40 percent of the people in A would repay a loan, but only 20 percent of the people in B would.

We can achieve statistical parity by giving loans to 20 percent of the people in A and 20 percent of the people in B. This would be fair, in that the same fraction of both populations would get a loan, but would violate equality of opportunity, since we would be rejecting 20 percent of people in A who would repay their loans. But if we enforce equality of opportunity, we will end up giving more loans to people in group A, violating statistical parity.
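
The arithmetic behind this trade-off is easy to check. The short Python sketch below uses the populations from the example (the group size of 1,000 is our own choice of scale) and assumes a lender who can predict repayment perfectly, which isolates the conflict between the two definitions:

```python
# Two equal-sized populations; repayment rates from the example in the text.
POP = 1_000
repayers = {"A": int(0.40 * POP), "B": int(0.20 * POP)}  # 400 and 200

def report(policy_name, loans):
    """Print approval rates and false-rejection rates for a loan policy.

    `loans[g]` is the number of loans given to group g, assumed to go to
    repayers first (a perfectly informed lender).
    """
    for g in ("A", "B"):
        approval_rate = loans[g] / POP
        rejected_repayers = max(repayers[g] - loans[g], 0)
        false_rejection_rate = rejected_repayers / repayers[g]
        print(f"{policy_name} | group {g}: approval {approval_rate:.0%}, "
              f"false rejections among repayers {false_rejection_rate:.0%}")

# Policy 1: statistical parity -- lend to 20 percent of each group.
report("statistical parity", {"A": 200, "B": 200})
# Group A: 200 of its 400 repayers are turned away (50% false rejection);
# group B: 0% false rejection. Equality of opportunity is violated.

# Policy 2: equality of opportunity -- lend to every predicted repayer.
report("equal opportunity ", {"A": 400, "B": 200})
# Approval rates are now 40% vs. 20%, so statistical parity is violated.
```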

All this goes to show that, even in simple examples, satisfying multiple definitions of fairness cannot be guaranteed. This is not because fairness cannot be defined, but because it admits multiple definitions. In this particular example, we used only two definitions, but we could have used many more.14 We could add a third definition requiring equality of false acceptances (e.g., requiring that loans be given to people who will not repay them with the same probability in both groups).

Fairness is a complex concept that accepts multiple definitions that (in most cases) cannot be satisfied simultaneously.15 The world is unfair—not only because people and machines are biased—but because it affords multiple ways of defining a fair outcome.

Yet, not all unfairness comes from mathematical impossibilities. Unfairness also comes from algorithms and the data used to design them. While these sources of unfairness can be vexing, in practice they are also the ones that can potentially be corrected.

Consider the example of word embeddings. Word embedding is a natural language processing technique used to translate words into mathematical representations (vectors). It is also a popular example of algorithmic bias, because word embeddings can perpetuate the racial and gender stereotypes found in their training data. In a word embedding, subtracting the vector for the word Woman from the vector for Queen and adding the vector for Man yields a vector close to that of King. This means that these vectors capture semantic relationships (e.g., “a King is a male Queen”). But not all the relationships learned by word embeddings are as simple and uncontroversial. Word embeddings also encode relationships such as “Man is to computer programmer as woman is to homemaker.”16 In fact, if the text used to train an embedding contains mentions of women performing stereotypical actions, such as cooking or cleaning, the embedding will codify, maintain, and sometimes even enhance these stereotypical associations.17
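
For readers who want to try this arithmetic directly, here is a brief sketch using the gensim library and one of its downloadable sets of pretrained GloVe vectors (the specific model name is an illustrative choice on our part, not the embedding used in the studies cited here):

```python
import gensim.downloader as api

# Download a set of pretrained 100-dimensional GloVe word vectors
# (requires an internet connection the first time it runs).
vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman lands near "queen": the embedding has learned that
# the king/queen pair differs along a gender direction.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# The same arithmetic can also surface learned stereotypes, for example
# by asking which words sit closer to "woman" than to "man".
print(vectors.most_similar(positive=["doctor", "woman"], negative=["man"], topn=3))
```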

Recent research, however, has focused not only on documenting these biases, but also on how to reduce them.18 For instance, word embedding bias can be reduced by expanding text with sentences that counterbalance biases, or by identifying and “subtracting” the dimensions where bias manifests itself more strongly.19
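
A minimal version of the “subtracting the bias dimension” idea can be written in a few lines of NumPy. In this sketch the bias direction is approximated by the difference between two hand-picked word vectors, a simplification of the published debiasing procedures, which estimate that direction from many word pairs:

```python
import numpy as np

def remove_direction(word_vec: np.ndarray, bias_direction: np.ndarray) -> np.ndarray:
    """Project out the component of `word_vec` that lies along `bias_direction`."""
    b = bias_direction / np.linalg.norm(bias_direction)
    return word_vec - np.dot(word_vec, b) * b

# Example usage with any embedding (e.g., the GloVe vectors loaded above):
#   gender_direction = vectors["she"] - vectors["he"]
#   debiased = remove_direction(vectors["programmer"], gender_direction)
# After the projection, "programmer" carries no component along that single
# direction; bias encoded along other directions may remain.
```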

Another example of data-driven algorithmic bias is facial recognition systems.20 People studying the accuracy of these algorithms have found them to be less accurate at identifying darker faces, especially those of black women.21 This has motivated the creation of data sets that are more comprehensive in terms of demographic attributes, poses, and image quality,22 as well as the rise of auditing efforts designed to check and report on the accuracy and biases of facial recognition systems.23
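
Audits of this kind usually boil down to disaggregating an accuracy metric by demographic group. The sketch below assumes a small pandas DataFrame with hypothetical `group`, `label`, and `prediction` columns; in a real audit these would come from a benchmark like the ones described above:

```python
import pandas as pd

# Hypothetical audit data: one row per test image.
audit = pd.DataFrame({
    "group":      ["darker_female", "darker_female", "lighter_male", "lighter_male"],
    "label":      [1, 0, 1, 0],   # ground-truth attribute
    "prediction": [0, 0, 1, 0],   # classifier output
})

# Accuracy disaggregated by demographic group: large gaps between groups
# are the kind of disparity that facial recognition audits report.
per_group_accuracy = (
    audit.assign(correct=audit["label"].eq(audit["prediction"]))
         .groupby("group")["correct"]
         .mean()
)
print(per_group_accuracy)
```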

Another discussion on algorithmic bias involves the use of pretrial “risk assessment” tools.24 These are algorithms used to predict the probability that a defendant will reoffend (recidivism) or fail to appear in court.25 Pretrial risk assessment tools have become popular in the US, but they also have been found to show biases. In 2016, investigators working for ProPublica26 published an article based on “risk scores assigned to more than 7,000 people arrested in Broward County, Florida, in 2013 and 2014.”

They used that data to “see how many were charged with new crimes over the next two years,” which was “the same benchmark used by the creators of the algorithm.” They found that the algorithm “was particularly likely to falsely flag black defendants as future criminals, at almost twice the rate as white defendants.” They also found that disparities could not be explained by prior crimes.

Unfortunately, biases are not unique to algorithms. Humans have them too. Scholars in the social sciences, for instance, have long studied the biases affecting job applications27 by looking at the callback rates for résumés with ethnically differentiated names28 or photographs.29 Thus, neither humans nor machines can guarantee fairness.

Here, we compare people’s reactions to cases of bias attributed to humans or machines. We present them in the context of college admissions, police enforcement, salaries, counseling, and human resources, in scenarios where humans or algorithms are either the source of bias or the ones helping to reduce it. As in the previous chapter, we base our study on scenarios and measure people’s reactions to them using the following questions (as appropriate):

  • Were the [person/algorithm]’s actions harmful? (from “Not harmful at all” to “Extremely harmful”)
  • Would you hire this [person/algorithm] for a similar position? (from “Definitely not” to “Definitely yes”)
  • Were the [person/algorithm]’s actions intentional? (from “Not intentional at all” to “Extremely intentional”)
  • Do you like the [person/algorithm]? (from “Strongly dislike” to “Strongly like”)
  • How morally wrong or right were the [person/algorithm]’s actions? (from “Extremely wrong” to “Extremely right”)
  • Do you agree that the [person/algorithm] should be promoted to a position with more responsibilities? (from “Strongly disagree” to “Strongly agree”)
  • Do you agree that the [person/algorithm] should be replaced by a(n) [algorithm/person] (replace different)? (from “Strongly disagree” to “Strongly agree”)
  • Do you agree that the [person/algorithm] should be replaced by another [person/algorithm] (replace same)? (from “Strongly disagree” to “Strongly agree”)
  • Do you think the [person/algorithm] is responsible for the action? (from “Not responsible at all” to “Extremely responsible”)
  • Do you think the [person/algorithm] is responsible for the [discriminatory/fair] outcome? (from “Not responsible at all” to “Extremely responsible”)
  • If you were in a similar situation as the [person/algorithm], would you have done the same? (from “Definitely not” to “Definitely yes”)

In addition to considering situations where a machine or a human either acted unfairly or corrected an unfair act, we considered variations in the ethnicity of the person being discriminated against (Hispanic, African American, or Asian). Different ethnicities are associated with different core stereotypes, so we expect different judgments in discriminatory situations.

In total, we considered twenty-four possible scenarios and forty-eight conditions.* In the next section, we document the results obtained for scenarios involving human resource (HR) screenings, college admissions, salary increases, and policing.

The four groups of scenarios are listed next.

Human Resource Screenings

A company replaces its HR manager with a new [manager/algorithm] tasked with screening candidates for job interviews.

Unfair treatment (S19, S20, S21)

An audit reveals that the new [manager/algorithm] never selects [Hispanic/African American/Asian] candidates even when they have the same qualifications as other candidates.

Fair treatment (S22, S23, S24)

An audit reveals that the new [manager/algorithm] produces a fairer process for [Hispanic/African American/Asian] candidates, who were discriminated against by the previous system.

College Admissions

To improve their admissions process, a university hires a new [recruiter/algorithm] to evaluate the grades, test scores, and recommendation letters of applicants.

Unfair treatment (S25, S26, S27)

An audit reveals that the new [recruiter/algorithm] is biased against [Hispanic/African American/Asian] applicants.

Fair treatment (S28, S29, S30)

An audit reveals that the new [recruiter/algorithm] is fairer to [Hispanic/African American/Asian] applicants, who were discriminated against by the previous system.

Salary Increases

A financial company hires a new [manager/algorithm] to decide the yearly salary increases of its employees.

Unfair treatment (S31, S32, S33)

An audit reveals that the new [manager/algorithm] consistently gives lower raises to [Hispanic/African American/Asian] employees, even when their performance is equal to that of other employees.

Fair treatment (S34, S35, S36)

An audit reveals that the new [manager/algorithm] is fairer to [Hispanic/African American/Asian] employees, who were being discriminated against by the previous process.

Policing

The police commissioner of a major city deploys a new squad of [police officers/police robots] in a high-crime neighborhood.

Unfair treatment (S37, S38, S39)

An audit reveals that the new squad has been detaining a disproportionately large percentage of innocent [Hispanics/African Americans/Asians].

Fair treatment (S40, S41, S42)

An audit reveals that the new squad is fairer to innocent [Hispanics/African Americans/Asians], who had been detained in large numbers under the previous law enforcement procedures.

Figure 3.1 shows people’s reactions to the scenarios in which discrimination was observed. In all cases, humans are seen as more intentional, and also as more responsible for actions and outcomes. But beyond these obvious effects, we do observe some interesting, albeit relatively weak, patterns.

First, we find that—unlike most previous cases—moral judgments are not favorable to humans. In fact, we find that human actions are judged worse than machine actions (i.e., less moral), and are seen as more harmful in several scenarios, such as the college admissions and salary scenarios for African Americans and Hispanics. This provides additional evidence supporting the idea that reactions to machine actions are not simply the result of a generalized bias against machines since these biases change with a scenario’s context and moral dimensions.

We also find small but interesting differences among the various ethnic groups depending on the particular scenario. The college admissions scenario elicits the strongest differences in judgment, especially for African Americans and Hispanics (who, because of differences in stereotypes, suffer more discrimination than Asians in contexts related to intellectual traits). Here, human actions are judged as relatively less moral and more harmful than the actions of machines. We also find that biases against African Americans and Hispanics result in slightly stronger differences in judgment between humans and machines compared to Asians. This suggests that differences in judgment between human and machine actions are slightly modulated by the ethnic group of the victim and the situation described in the scenario (e.g., college admissions). These differences align with our expectations for a US sample.

Moreover, people also think that the human should be replaced with another person. What is paradoxical, however, is that even though humans are seen as more intentional and more responsible than machines, people still prefer not to replace them with machines (as has been the case in all previous scenarios), adding further evidence in support of the idea of algorithm aversion.30

Figure 3.2 shows people’s reactions to scenarios in which discrimination was corrected. In general, we find a tendency for people to be more willing to promote humans, meaning that humans may receive more credit when they are involved in actions that correct unfair treatment. For the most part, however, we don’t find strong differences in judgment, except for the policing scenario, where the actions of humans are judged as much better than those of machines across several dimensions.

While biases can be problematic, the psychologists who have long studied them would be hard pressed to classify them as simple cognitive flaws. Instead, biases exist as rules of thumb or heuristics that evolved to make fast decisions in environments with limited information.31

An example of these heuristics is the idea that people may perceive groups using two dominant characteristics: warmth and competence.32 This model predicts that groups high in warmth and low in competence (e.g., disabled people, babies, and the elderly) elicit sympathy, whereas groups low in warmth and high in competence elicit envy or jealousy.

Heuristics and stereotypes are certainly incorrect ways to judge individuals. Humans can have overgeneralized beliefs regarding members of a social group (stereotypes) and exhibit biased attitudes toward those groups (prejudice). Prejudice may then lead to unfair treatment or discrimination. But because heuristics work as a way to facilitate decision-making in information-deprived environments (or environments with excess information),33 it is not surprising that we find them in both humans and machines.

Similar to humans, cognitive machines are inferential and base their inferences on abstract forms of categorization (which, in humans, we call stereotyping). In order to make predictions, machines often group and classify data using explicit and abstract features. Consider the idea of a principal component: a vector that accounts for the largest possible share of the variance in a data set. Principal components are a common tool in machine learning, and they are similar to the idea of a stereotype, like classifying people using the vectors of warmth and competence. Unlike warmth and competence, however, principal components usually involve abstract features that are derived directly from data and can be difficult to interpret. This adds opacity to algorithms and has led some people to advocate for increased transparency and interpretability as ways to mitigate algorithmic bias.
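
To make the analogy concrete, here is a brief scikit-learn sketch that extracts principal components from a small synthetic data set (the data and feature counts are our own toy construction). The components it returns are abstract, data-derived directions, not named traits like warmth or competence:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Toy data: 200 "profiles" described by five correlated numeric features,
# generated from two hidden traits plus noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

pca = PCA(n_components=2).fit(X)

# Each principal component is a weighted mix of the five features that
# captures a large share of the variance: an abstract "stereotype" the
# algorithm derives directly from the data.
print(pca.explained_variance_ratio_)
print(pca.components_)
```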

In fact, the use of explicit versus abstract features has lately been at the core of a nuanced discussion on the bias and fairness of algorithms. To avoid biases based on gender or race, scholars have proposed a variety of methods, from simply removing explicit demographic characteristics from a data set to predicting outcomes using only variables that are orthogonal to demographic characteristics. Yet recent research has shown that methods that try to circumvent the possibility of bias may actually backfire, because reaching fair outcomes is better served by using the most accurate predictors, even if these include explicit demographic information.34

In a recent paper on algorithmic fairness, Jon Kleinberg, Jens Ludwig, Sendhil Mullainathan, and Ashesh Rambachan develop this idea by comparing an efficient planner and an equitable planner.35 These planners were tasked with admitting college applicants. The efficient planner was interested only in maximizing performance, measured by the grade point average (GPA) of the students admitted to college, while the equitable planner was interested in both performance and the racial composition of the admitted class.

To illustrate this idea, the scholars compared three methods: admissions that were blind to demographic variables (e.g., race was removed from the sample); admissions that included variables that were orthogonalized with respect to racial variables; and admissions that used racial variables explicitly. They report that the most equitable and efficient outcomes were reached using the model that explicitly included demographic variables.

To understand this distinction, consider students from two races: P (privileged) and U (underprivileged), who are applying to college. Because of their privileges, students in race P score higher in many of the variables that are predictive of future academic success, such as standardized test scores. Should we blind algorithms to race, then? Or is there a better solution?

Imagine a student from race U who obtains the same score as a student from race P on a standardized test. The student from race U was able to reach the same outcome as the student from race P in the absence of P’s privileges. Yet a model lacking an explicit racial variable will be unable to adjust for the lack of privilege affecting the scores of students from race U. A model that is blind to race will rate both students equally, and hence hurt the less privileged student. Instead, what the proposed theory suggests36 is to use the most accurate possible model (including racial variables, when relevant), and then to set different thresholds to achieve the desired level of equity (using, for instance, some of the definitions of fairness introduced earlier in this chapter).
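
The recommendation can be sketched in two steps: fit the most accurate model available (including the group variable where that is permissible), then choose thresholds per group to reach the desired equity target. The following code is a schematic illustration on synthetic data, not a reproduction of the Kleinberg et al. analysis; the variable names and the 30 percent admission target are our own illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2_000

# Synthetic applicants: group 1 is "underprivileged" and, at equal ability,
# scores lower on the observed test because of a privilege gap.
group = rng.integers(0, 2, size=n)                   # 0 = P, 1 = U
ability = rng.normal(size=n)
test_score = ability - 0.8 * group + 0.3 * rng.normal(size=n)
succeeds = (ability + 0.2 * rng.normal(size=n) > 0).astype(int)

# Step 1: the most accurate available model, which sees the group variable.
X = np.column_stack([test_score, group])
model = LogisticRegression().fit(X, succeeds)
p_success = model.predict_proba(X)[:, 1]

# Step 2: per-group probability thresholds chosen so that the admission
# rate (here, a simple 30 percent target) is equal across groups.
admit = np.zeros(n, dtype=bool)
for g in (0, 1):
    mask = group == g
    threshold = np.quantile(p_success[mask], 0.70)   # admit the top 30% of each group
    admit[mask] = p_success[mask] >= threshold

for g in (0, 1):
    mask = group == g
    print(f"group {g}: admission rate {admit[mask].mean():.0%}, "
          f"mean predicted success among admitted {p_success[mask & admit].mean():.2f}")
```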

This example illustrates the importance of separating the goals of equity and predictive accuracy. Even though it may be tempting to modify data to eliminate any trace of demographic characteristics, the best way to achieve efficient and equitable outcomes may be to treat prediction and equity as two separate parts of the same problem.

In the US, discriminatory treatment is not only frowned upon, but also illegal. Title VII of the 1964 Civil Rights Act37 is “a federal law that prohibits employers from discriminating against employees on the basis of sex, race, color, national origin, and religion.”† The Supreme Court affirmed Title VII unanimously in 1971 in Griggs v. Duke Power Company, a class action suit claiming that Duke’s policies discriminated against African American employees.38 The court ruled that, independent of intent, discriminatory outcomes for protected classes violated Title VII.39

In our data, we find important differences in the level of intent and responsibility assigned to discriminatory actions performed by humans and machines. However, in agreement with the Griggs decision, we find only small differences in moral judgment, suggesting that—in unfair cases—it is the outcome rather than the intention that is judged.

The removal of intent from the legal judgment of bias has important implications for those working on the fairness of algorithms. It means that, even in the absence of intent, those creating the algorithms may be liable for biased outcomes.

This outcome-based approach to policing discrimination is opening a new market for an algorithm certification industry and discipline:40 a community focused on auditing the bias of algorithms and certifying them when they are not biased.

In the next chapter, we shift our gaze away from algorithmic bias and focus on another uncomfortable aspect of our digital reality: privacy. This will help us expand our understanding to another dimension of the way in which humans judge machines.

NOTES

  1. V. Bilotkach, N. G. Rupp, and V. Pai, Value of a Platform to a Seller: Case of American Airlines and Online Travel Agencies, SSRN (2017), https://papers.ssrn.com/abstract=2321767; B. Friedman and H. Nissenbaum, “Bias in Computer Systems,” ACM Transactions on Information Systems 14 (1996): 330–347.

  2. B. F. Klare, M. J. Burge, J. C. Klontz, R. W. V. Bruegge, and A. K. Jain, “Face Recognition Performance: Role of Demographic Information,” IEEE Transactions on Information Forensics and Security 7 (2012): 1789–1801; Buolamwini and Gebru, “Gender Shades”; A. Torralba and A. A. Efros, “Unbiased Look at Dataset Bias,” in CVPR 2011 (IEEE, 2011), 1521–1528, https://doi.org/10.1109/CVPR.2011.5995347.

  3. M. J. Kusner, J. Loftus, C. Russell, and R. Silva, “Counterfactual Fairness,” in Advances in Neural Information Processing Systems, eds. I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, et al. (Curran Associates, 2017), 4066–4076.

  4. J. Zhao, T. Wang, M. Yatskar, V. Ordonez, and K.-W. Chang, “Men Also Like Shopping: Reducing Gender Bias Amplification Using Corpus-Level Constraints,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (ACL, 2017), https://doi.org/10.18653/v1/D17-1323; N. Garg, L. Schiebinger, D. Jurafsky, and J. Zou, “Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes,” PNAS 115 (2018): E3635–E3644; L. A. Hendricks, K. Burns, K. Saenko, T. Darrell, and A. Rohrbach, “Women Also Snowboard: Overcoming Bias in Captioning Models,” in Computer Vision–ECCV 2018, eds. V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss (Springer International Publishing, 2018), 793–811; T. Bolukbasi, K.-W. Chang, J. Y. Zou, V. Saligrama, and A. T. Kalai, “Man Is to Computer Programmer as Woman Is to Homemaker? Debiasing Word Embeddings,” in Advances in Neural Information Processing Systems 29, eds. D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Curran Associates, 2016), 4349–4357.

  5. Baeza-Yates, “Data and Algorithmic Bias in the Web.”

  6. R. Berk, H. Heidari, S. Jabbari, M. Kearns, and A. Roth, “Fairness in Criminal Justice Risk Assessments: The State of the Art,” Sociological Methods & Research (July 2018), https://doi.org/10.1177/0049124118782533; O. A. Osoba and W. Welser IV, An Intelligence in Our Image: The Risks of Bias and Errors in Artificial Intelligence (RAND Corporation, 2017); Z. Lin, J. Jung, S. Goel, and J. Skeem, “The Limits of Human Predictions of Recidivism,” Science Advances 6 (2020): eaaz0652; J. Dressel and H. Farid, “The Accuracy, Fairness, and Limits of Predicting Recidivism,” Science Advances 4 (2018): eaao5580.

  7. A. Lambrecht and C. Tucker, “Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads,” Management Science 65 (2019): 2966–2981.

  8. J. Koren, “What Does That Web Search Say about Your Credit?,” Los Angeles Times, 17 July 2016, https://www.latimes.com/business/la-fi-zestfinance-baidu-20160715-snap-story.html.

  9. M. Kearns and A. Roth, The Ethical Algorithm: The Science of Socially Aware Algorithm Design (Oxford University Press, 2019); N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan, “Survey on Bias and Fairness in Machine Learning,” arXiv:1908.09635 [cs] (2019).

  10. Bilotkach et al., Value of a Platform to a Seller; Baeza-Yates, “Data and Algorithmic Bias in the Web”; M. G. Haselton, D. Nettle, and D. R. Murray, “The Evolution of Cognitive Bias,” in The Handbook of Evolutionary Psychology (Wiley, 2015), 1–20, https://doi.org/10.1002/9781119125563.evpsych241; T. M. Mitchell, The Need for Biases in Learning Generalizations (1980); A. Caliskan, J. J. Bryson, and A. Narayanan, “Semantics Derived Automatically from Language Corpora Contain Human-Like Biases,” Science 356 (2017): 183–186; S. Hajian, F. Bonchi, and C. Castillo, “Algorithmic Bias: From Discrimination Discovery to Fairness-Aware Data Mining,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2016), 2125–2126, https://doi.org/10.1145/2939672.2945386.

  11. Kearns and Roth, The Ethical Algorithm; Mehrabi et al., “Survey on Bias and Fairness in Machine Learning.”

  12. Kearns and Roth, The Ethical Algorithm; Mehrabi et al., “Survey on Bias and Fairness in Machine Learning”; M. Kearns, S. Neel, A. Roth, and S. Wu, “Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness,” in 35th International Conference on Machine Learning, ICML 2018 (IMLS, 2018), 4008–4016.

  13. M. Hardt, E. Price, and N. Srebro, “Equality of Opportunity in Supervised Learning,” in Advances in Neural Information Processing Systems (2016), 3315–3323.

  14. Mehrabi et al., “Survey on Bias and Fairness in Machine Learning.”

  15. Kearns and Roth, The Ethical Algorithm.

  16. Bolukbasi et al., “Man Is to Computer Programmer.”

  17. Zhao et al., “Men Also Like Shopping”; Hendricks et al., “Women Also Snowboard”; Bolukbasi et al., “Man Is to Computer Programmer”; Caliskan et al., “Semantics Derived Automatically.”

  18. Zhao et al., “Men Also Like Shopping”; Hendricks et al., “Women Also Snowboard”; Bolukbasi et al., “Man Is to Computer Programmer.”

  19. Bolukbasi et al., “Man Is to Computer Programmer.”

  20. M. Turk and A. Pentland, “Eigenfaces for Recognition,” Journal of Cognitive Neuroscience 3 (1991): 71–86; T. Kanade, Y. Tian, and J. F. Cohn, “Comprehensive Database for Facial Expression Analysis,” in Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition 2000 (IEEE Computer Society, 2000), 46; Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep Learning Face Attributes in the Wild,” in Proceedings of the IEEE International Conference on Computer Vision (2015), 3730–3738; O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep Face Recognition,” in Proceedings of the British Machine Vision Conference 2015 (British Machine Vision Association, 2015), 41.1–41.12, https://doi.org/10.5244/C.29.41; Y. Sun, Y. Chen, X. Wang, and X. Tang, “Deep Learning Face Representation by Joint Identification-Verification,” in Advances in Neural Information Processing Systems 27, eds. Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Curran Associates, 2014), 1988–1996; Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “DeepFace: Closing the Gap to Human-Level Performance in Face Verification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2014), 1701–1708.

  21. Klare et al., “Face Recognition Performance”; Buolamwini and Gebru, “Gender Shades.”

  22. Torralba and Efros, “Unbiased Look at Dataset Bias”; B. F. Klare, B. Klein, E. Taborsky, A. Blanton, J. Cheney, K. Allen, et al., “Pushing the Frontiers of Unconstrained Face Detection and Recognition: IARPA Janus Benchmark A,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2015), 1931–1939; G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, “Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments,” in Workshop on Faces in ‘Real-Life’ Images: Detection, Alignment, and Recognition (Marseille, France, 2008).

  23. M. Ngan and P. Grother, Face Recognition Vendor Test (FRVT)—Performance of Automated Gender Classification Algorithms (2015), https://nvlpubs.nist.gov/nistpubs/ir/2015/NIST.IR.8052.pdf, https://doi.org/10.6028/NIST.IR.8052; I. D. Raji and J. Buolamwini, “Actionable Auditing: Investigating the Impact of Publicly Naming Biased Performance Results of Commercial AI Products,” in Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society (2019).

  24. Lin et al., “The Limits of Human Predictions of Recidivism”; Dressel and Farid, “The Accuracy, Fairness, and Limits of Predicting Recidivism.”

  25. Electronic Privacy Information Center, “Algorithms in the Criminal Justice System: Pre-Trial Risk Assessment Tools,” https://epic.org/algorithmic-transparency/crim-justice/.

  26. J. Angwin, J. Larson, S. Mattu, and L. Kirchner, “Machine Bias,” ProPublica, 23 May 2016.

  27. D. Neumark, R. J. Bank, and K. D. Van Nort, “Sex Discrimination in Restaurant Hiring: An Audit Study,” Quarterly Journal of Economics 111 (1996): 915–941; L. Kaas and C. Manger, “Ethnic Discrimination in Germany’s Labour Market: A Field Experiment,” German Economic Review 13 (2012): 1–20; M. Bertrand and S. Mullainathan, “Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination,” American Economic Review 94 (2004): 991–1013; P. Oreopoulos, “Why Do Skilled Immigrants Struggle in the Labor Market? A Field Experiment with Thirteen Thousand Resumes,” American Economic Journal: Economic Policy 3 (2011): 148–171; D. Neumark, I. Burn, and P. Button, “Experimental Age Discrimination Evidence and the Heckman Critique,” American Economic Review 106 (2016): 303–308; C. Fershtman and U. Gneezy, “Discrimination in a Segmented Society: An Experimental Approach,” Quarterly Journal of Economics 116 (2001): 351–377; E. O. Arceo-Gomez and R. M. Campos-Vazquez, “Race and Marriage in the Labor Market: A Discrimination Correspondence Study in a Developing Country,” American Economic Review 104 (2014): 376–380.

  28. Kaas and Manger, “Ethnic Discrimination in Germany’s Labour Market”; Bertrand and Mullainathan, “Are Emily and Greg More Employable?”; Oreopoulos, “Why Do Skilled Immigrants Struggle in the Labor Market?”; Neumark et al., “Experimental Age Discrimination Evidence.”

  29. Arceo-Gomez and Campos-Vazquez, “Race and Marriage in the Labor Market.”

  30. Dietvorst, “Algorithm Aversion.”

  31. G. Gigerenzer, Gut Feelings: The Intelligence of the Unconscious (Penguin, 2007); G. Gigerenzer, “How to Make Cognitive Illusions Disappear: Beyond ‘Heuristics and Biases,’” European Review of Social Psychology 2 (1991): 83–115; G. Gigerenzer and H. Brighton, “Homo Heuristicus: Why Biased Minds Make Better Inferences,” Topics in Cognitive Science 1 (2009): 107–143; A. Tversky and D. Kahneman, “Judgment under Uncertainty: Heuristics and Biases,” Science 185 (1974): 1124–1131; T. Gilovich, D. Griffin, and D. Kahneman, Heuristics and Biases: The Psychology of Intuitive Judgment (Cambridge University Press, 2002); D. Kahneman, P. Slovic, and A. Tversky, eds., Judgment under Uncertainty: Heuristics and Biases (Cambridge University Press, 1982).

  32. S. T. Fiske, A. J. C. Cuddy, P. Glick, and J. Xu, “A Model of (Often Mixed) Stereotype Content: Competence and Warmth Respectively Follow from Perceived Status and Competition,” Journal of Personality and Social Psychology 82 (2002): 878–902; S. T. Fiske, A. J. C. Cuddy, and P. Glick, “Universal Dimensions of Social Cognition: Warmth and Competence,” Trends in Cognitive Sciences 11 (2007): 77–83.

  33. Tversky and Kahneman, “Judgment under Uncertainty: Heuristics and Biases”; Gilovich et al., Heuristics and Biases; G. Gigerenzer and D. G. Goldstein, “Reasoning the Fast and Frugal Way: Models of Bounded Rationality,” Psychological Review 103 (1996): 650; G. Gigerenzer and P. M. Todd, Simple Heuristics That Make Us Smart (Oxford University Press, 1999).

  34. J. Kleinberg, J. Ludwig, S. Mullainathan, and A. Rambachan, “Algorithmic Fairness,” AEA Papers and Proceedings 108 (2018): 22–27.

  35. Kleinberg et al., “Algorithmic Fairness.”

  36. Kleinberg et al., “Algorithmic Fairness.”

  37. Title VII of the Civil Rights Act of 1964: Know Your Rights, AAUW: Empowering Women since 1881, https://www.aauw.org/what-we-do/legal-resources/know-your-rights-at-work/title-vii/.

  38. Griggs v. Duke Power Company, Oyez, https://www.oyez.org/cases/1970/124.

  39. R. Chowdhury and N. Mulani, “Auditing Algorithms for Bias,” Harvard Business Review (2018), https://hbr.org/2018/10/auditing-algorithms-for-bias.

  40. Chowdhury and Mulani, “Auditing Algorithms for Bias”; J. Guszcza, I. Rahwan, W. Bible, M. Cebrian, and V. Katyal, “Why We Need to Audit Algorithms,” Harvard Business Review (2018), https://hbr.org/2018/11/why-we-need-to-audit-algorithms.

  * Certainly, we would have liked to consider more conditions, such as additional ethnicities and nonbinary gender identities, but that would have increased the number of scenarios and independent groups that we had to recruit to an unwieldy number. We leave the exercise of extending this analysis to more conditions for the future.

  † It generally applies to employers with fifteen or more employees, including federal, state, and local governments.