DISCRIMINATION
Billy Ball transformed Major League Baseball in the early 2000s. Billy Beane became the general manager of the Oakland Athletics as its owners were slashing what had recently been one of the highest payrolls in baseball. His charter was to cut costs and still field a winning team. To do so, Beane turned to statistics and used supervised learning to determine which low-cost players to acquire. The method worked so well that in 2002 the A's produced a twenty-game winning streak, then an American League record, and racked up 103 wins for the season. The methodology was subsequently adopted throughout baseball and other sports and became the subject of the best-selling book and movie Moneyball.1
When Anne Milgram became the attorney general of New Jersey in 2007, she observed that the criminal justice system rarely used data to make decisions. Detectives would piece together evidence using sticky notes, but the data collected was not being shared. So she decided to take a Moneyball approach to criminal justice in New Jersey and apply the same principles to the most critical public safety decisions, including assessing the risk that an arrested individual poses to the public. Her idea was to create a system to influence decisions about whether someone should be released or detained. She also designed the system to influence sentencing and whether the justice department recommended drug treatment.2
To do this, she created a training table containing 1.5 million rows. Each row represented one individual. The input columns included prior convictions, incarceration history, history of violent behavior, and court appearance failures. The output columns were whether they had committed a new crime or a new act of violence and whether they had subsequently failed to appear in court. The attorney general’s office created three separate supervised learning systems to predict the three output columns from the input columns. It combined the results into a score using conventional programming and made the score available to judges to aid them in making bail and sentencing decisions. This process reportedly reduced murder rates in Camden, New Jersey, one of the most dangerous cities in America, by 41 percent and overall crime by 26 percent.
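The details of New Jersey's system have never been published, but the general pattern is simple: train one supervised model per outcome column on the same inputs, then combine the predictions into a single score with conventional code. Here is a minimal Python sketch of that pattern using scikit-learn; the file name, column names, models, and weights are invented for illustration, not taken from the actual system.

```python
# Hypothetical sketch only: the real system's features, models, and weights
# are not public. File name and column names are invented for illustration.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("arrest_history.csv")  # hypothetical training table

inputs = ["prior_convictions", "incarcerations",
          "violent_history", "court_failures"]                # input columns
outcomes = ["new_crime", "new_violence", "failed_to_appear"]  # output columns

X_train, X_test, y_train, y_test = train_test_split(
    df[inputs], df[outcomes], test_size=0.2, random_state=0)

# One supervised model per outcome column.
models = {name: LogisticRegression(max_iter=1000).fit(X_train, y_train[name])
          for name in outcomes}

# Combine the three predicted probabilities into a single score using
# conventional, hand-written weighting, as the text describes.
weights = {"new_crime": 0.4, "new_violence": 0.4, "failed_to_appear": 0.2}
risk_score = sum(weights[name] * models[name].predict_proba(X_test)[:, 1]
                 for name in outcomes)
```

The hand-chosen weights are what the text means by "conventional programming": the final score is ordinary arithmetic over the three predictions, not another learned model.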
An automated decision system (ADS) is a Moneyball-like software program that makes decisions and recommendations that previously had been made by people. ADSes have a wide range of uses, including scoring loan applications,3 identifying possible terrorists and criminals,4 evaluating college applications,5 and making employment decisions.6 In some ways, ADSes can make better decisions than humans. They can be less biased and more consistent, as they will not overtly apply racial stereotypes. A 1996 study reviewed 136 studies that compared ADS-based predictions of behaviors such as violence with predictions based on human judgment and found the ADS-based predictions were far more accurate.7
The use of ADSes in law enforcement goes back to 1983, when a Rand Corporation study proposed the use of statistics to predict the likelihood that someone convicted of a crime would commit new crimes when they were released. Rand developed a statistical model based on seven factors. They proposed that justice departments use this prediction technique as a sentencing model. Through this process, now known as selective incapacitation, criminals who were highly likely to commit future crimes received longer sentences.8
Unfortunately, despite the best efforts of an ADS tool's designer, ADSes can be discriminatory. During US president Barack Obama's second term, Attorney General Eric Holder argued that the ADS-based risk assessments judges use to gauge the likelihood of reoffending discriminate against minorities and lower-income groups.9 The input columns for these ADS tools typically include marital history, employment status, education, and neighborhood. These factors correlate strongly with race and income. They are not overtly discriminatory; however, because they act as stand-ins for protected characteristics, they are effectively discriminatory.
It is illegal in the US to discriminate in employment, housing, loans, credit, education, criminal justice, and other matters based on race, religion, gender, disability, and family status. Despite these laws, discrimination happens both intentionally and unintentionally. In 2003, economists Marianne Bertrand and Sendhil Mullainathan responded to help-wanted ads in Boston and Chicago with fake resumes. The researchers gave the resumes random names that sounded African American (e.g., Lakisha and Jamal) or Caucasian (e.g., Emily and Greg). The Caucasian resumes received 50 percent more callbacks.10 Social media such as LinkedIn facilitates intentional discrimination by providing a place where biased hiring managers can view an applicant’s picture.
ADSes can help avoid explicit bias. Unfortunately, if the data is bad, they can also institutionalize it. A University of California at Berkeley study found that when loan decisions were made by an ADS, there was 40 percent less discrimination than when the decisions were made by people. Still, ADS discrimination resulted in Latinx and African American borrowers paying 5.3 basis points more than nonminority borrowers for purchase loans and 2.0 basis points more for refinance loans.11
If the training table contains biases, the resulting ADS will be biased. Factors such as race, religion, color, gender, disability, and family status can be explicitly removed from training tables to prevent ADSes from making decisions based on these factors. However, as Holder discovered, the training tables may contain other factors that are correlated. For example, if there are high concentrations of people from certain races and religions in certain zip codes, then the zip codes might serve as a proxy for race and religion and effectively be used in making biased decisions.12
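A simple way to check whether such proxies have crept into a training table is to see how well the remaining columns predict the protected attribute after it has been removed. Here is a minimal sketch of that check, assuming a hypothetical pandas table with race, label, and zip_code columns.

```python
# Hypothetical proxy check: train a model to predict the protected attribute
# (here, race) from the remaining columns. If it succeeds well above the
# base rate, columns such as zip code are acting as proxies.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("training_table.csv")            # hypothetical data
features = df.drop(columns=["race", "label"])     # protected attribute removed
features = pd.get_dummies(features, columns=["zip_code"])

proxy_model = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(proxy_model, features, df["race"], cv=5)
print(f"Race predicted from remaining columns with {scores.mean():.0%} accuracy")
# Accuracy far above the base rate suggests the model can still
# discriminate even though race was never an input column.
```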
Researchers have developed various ADSes to predict recidivism for prospective parolees. If some of these ADSes include a column containing the person's neighborhood and judges use the resulting recidivism predictions in criminal sentencing, then people from specific neighborhoods may end up staying in jail longer than people from other neighborhoods. In fact, researchers have shown that ADS-based risk assessments used in the US criminal justice system to set bond amounts and to assess the likelihood of future violent acts and general recidivism are biased against African Americans.13 Worse, the likelihood of using “dirty data” is even higher for police departments with a history of racial bias.14
Biased algorithms do not require biased developers. The designers of the ADSes for predicting violent acts and recidivism were not consciously trying to institutionalize racism; they were trying to do the opposite, and they would have succeeded had their data not been biased.
Credit card fraud detection and lending algorithms are often inherently biased against the poor, because poorer people mostly pay cash. With more data available on people who use credit cards, lenders can make better decisions about fraud and lending risk for those customers. The result is less favorable lending decisions and higher interest rates for the average person who does not use credit cards. In November 2019, a Wall Street regulator began investigating whether Goldman Sachs's credit card practices rely on biased algorithms, and Apple cofounder Steve Wozniak has called on the government to investigate whether Apple's credit card practices are discriminatory.15
Job screening ADSes use data that incorporates the hiring preferences and experience of previous hiring managers. Amazon built an ADS to predict which job applicants would be the best employees. However, because most software engineers were historically male, the ADS inadvertently learned a bias against female applicants.16 Amazon discontinued the system when they discovered this issue.
Hospitals use an ADS from United Health Group to determine which patients with chronic ailments could most benefit from a more personalized approach to care. One of the ADS variables was the amount of money previously spent on healthcare for an individual. However, since healthcare spending for African-American patients was generally less than for white patients, the ADS was more likely to select white patients for this more personalized approach, thereby increasing the discrimination against African-American patients and widening the gap in care quality.17
The effects of data bias are not limited to ADSes. Recall that word embeddings are representations of words that capture word relationships, such as man is to woman as king is to queen. These word embeddings, so widely used in natural language processing systems, also often contain biases. For example, researchers have found that word embeddings encode relationships such as man is to computer programmer as woman is to homemaker. This out-of-date view of the world results from training on older news articles. Unless we change the data used to train these machine learning systems, the results will remain biased.
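To make the mechanism concrete, here is a minimal sketch of how these analogy relationships are computed with the gensim library and a pretrained word2vec file. The file path and tokens are assumptions, and the answers depend entirely on the corpus the embeddings were trained on.

```python
# Sketch of analogy arithmetic in word embeddings, using gensim and a
# pretrained word2vec file (the file path and vocabulary are assumptions).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# man : woman :: king : ?  -> typically "queen"
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# The same arithmetic can surface learned stereotypes.
# man : computer_programmer :: woman : ?
print(vectors.most_similar(
    positive=["computer_programmer", "woman"], negative=["man"], topn=3))
```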
Robyn Speer, a former MIT researcher who cofounded the natural language processing company Luminoso, tried to build a sentiment analysis application that read restaurant reviews and rated the restaurants. She found that the application consistently rated Mexican restaurants low, even though the Mexican restaurants received good reviews from people. The problem was that the word embedding for the word Mexican captures information about the words that most commonly appear alongside it in news articles. Unfortunately, news articles that contain the term Mexican often also contain the word illegal. As a result, the word embedding absorbs some of the associated negative sentiment.18 Speer also pointed out that the word embeddings for terms like girlfriend, teen, and Asian carry pornographic connotations.
Similarly, people in poor neighborhoods with mostly African-American populations pay higher auto insurance premiums than people who have the same risk level but live in wealthier, predominantly white communities.19
Data issues also impact facial recognition systems. Google Photos' early facial recognition system labeled photos of Black people as gorillas.20 The reason was that researchers trained the system on tables containing mostly Caucasian faces.
A facial recognition system trained primarily on images of men will not work well for women. A system trained on images of white women will not work well for women from the Middle East. Joy Buolamwini, founder of the Algorithmic Justice League, studied the facial recognition software of IBM, Microsoft, and Face++ and found that commercial facial recognition software performs up to 34 percent better at identifying lighter-skinned men than darker-skinned women.21
Similarly, early speech recognition systems in cars had difficulty understanding women's voices.22 Researchers have also found that speech recognition systems perform poorly for African-American speakers.23 Some language identification systems do not recognize nonstandard dialects of English, misclassifying them as other languages or processing them with degraded accuracy.24 Even web searches will be less effective for women, African Americans, and speakers of nonstandard dialects. This lowered performance puts people at a disadvantage in their ability to obtain information online.
In response to these data bias issues, researchers are starting to develop tools that detect and remove bias in training tables25 and in word embeddings.26 AI researchers also have proposed an approach to building automated tests that show promise in detecting bias in training tables,27 and Microsoft has released an open source tool for assessing the fairness of machine learning systems.28
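One of the published ideas behind embedding debiasing is to estimate a "bias direction" in the embedding space and remove it from words that should be neutral. The sketch below illustrates that single step, assuming the word vectors are held in a plain Python dictionary of numpy arrays; it is an illustration of the general idea, not a rendering of any particular vendor's tool.

```python
# Minimal sketch of the "neutralize" step for reducing gender bias in
# embeddings: estimate a gender direction and remove its component from
# words that should be neutral. "vectors" maps words to numpy arrays.
import numpy as np

def bias_direction(vectors):
    """Estimate a gender direction from a definitional word pair."""
    d = vectors["she"] - vectors["he"]
    return d / np.linalg.norm(d)

def neutralize(word_vector, direction):
    """Remove the component of word_vector that lies along direction."""
    return word_vector - np.dot(word_vector, direction) * direction

# Usage sketch:
# g = bias_direction(vectors)
# vectors["programmer"] = neutralize(vectors["programmer"], g)
```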
Public awareness of issues around fairness and discrimination is on the rise, but we have a long way to go. Only 13 percent of large companies are taking steps to mitigate algorithmic unfairness and discrimination issues.29 As public awareness builds, I am hopeful we will see more voluntary, as well as regulatory, efforts around this critical issue.
Also, to get to the right solutions, we need to recognize that AI did not create all these fairness and discrimination issues.30 These issues are the inevitable outcome of using large training tables to make predictions and decisions. The core statistical techniques have been around for decades. Yes, researchers have developed new methods, but it is the availability of powerful computers and big data that have expanded the impact of these techniques on society. If AI had never been invented, these statistical techniques would still be in use, and the issues they create would be nearly as serious.31
People with good intentions create many of the fairness and discrimination issues. Sure, some consciously discriminate, but a far greater number discriminate unconsciously. Most people who create algorithms are well intentioned. Unfortunately, biased data exacerbates unconscious biases. We must make sure we do not throw the proverbial baby out with the bathwater. If we ban all algorithmic decisions, all we will have left are conscious and unconscious bias and haphazard, inconsistent decision-making. We are better off figuring out how to remove bias from the data and use algorithms that are interpretable and transparent.
HOW TO STOP DISCRIMINATION
There are many actions that corporations, governments, and other organizations can take to reduce discrimination:
•Hire a diverse workforce to reduce intentional discrimination.
•Use only ADS systems that use interpretable algorithms.
•When building ADS systems, preprocess the data to remove bias.
•Run tests on ADS systems to determine whether they are biased (a minimal sketch of such a test appears after this list).
•Use only ADS systems that are certified as bias-free by independent third parties.32
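As an illustration of the kind of bias test recommended above, the sketch below compares favorable-decision rates across groups, a rough screen inspired by the "four-fifths rule" used in US employment law. The table and column names are hypothetical.

```python
# Hypothetical bias test: compare favorable-decision rates across groups.
import pandas as pd

def selection_rates(decisions: pd.Series, group: pd.Series) -> pd.Series:
    """Rate of favorable (1) decisions for each group."""
    return decisions.groupby(group).mean()

def disparate_impact_ratios(decisions, group, reference):
    """Each group's selection rate relative to a reference group."""
    rates = selection_rates(decisions, group)
    return rates / rates[reference]

df = pd.read_csv("loan_decisions.csv")   # hypothetical decision log
print(disparate_impact_ratios(df["approved"], df["race"], "white"))
# Ratios well below 0.8 for any group are a common flag for adverse impact.
```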
There are also steps that individuals can take to protect themselves and others against discrimination. The first is to vote with your wallet. Among banks, insurance companies, and other companies whose ADS systems affect people's lives, choose the ones that demonstrate antidiscrimination practices. Check whether they publish statistics showing a diverse hiring pattern. Determine whether they use only ADS systems that are explainable. Find out whether they test their ADS systems to ensure they are nondiscriminatory. Discover whether they have third-party nondiscrimination certifications for their ADS systems.
INTERPRETABILITY
One way to avoid data bias is to require transparency in ADSes. If decision makers use an ADS system to make judgments that significantly affect people, such as making determinations on loan applicants or predicting the likelihood that a person will reoffend, it is essential to understand the underlying logic the system uses to make that judgment. It is not enough to get a yes or no answer. It is important to make sure that the algorithm is fair and that the data is unbiased.
Deep learning technology exacerbates these social issues because deep learning systems are often difficult or impossible to interpret.33 For example, banking systems once used supervised learning algorithms that made it easy to understand the rationale for approving or denying a loan application. However, to handle larger training tables and to better predict whether an applicant would default, many banks turned to deep learning technology, which might provide a more accurate prediction. Unfortunately, such a system often cannot provide the rationale for its decision.34 Some researchers argue that no system should be allowed to make or influence decisions unless the algorithm is interpretable and the characteristics of the underlying data are well understood. It should also be available for public inspection, or at least accessible enough to be certified as free of bias across factors such as race, gender, religion, culture, sexual orientation, and socioeconomic status.
The majority of ADSes use supervised learning. Some supervised learning algorithms are interpretable; that is, it is easy to see the relative weights the algorithm gave to the input columns. Other supervised learning algorithms (and especially deep neural networks) are more like black boxes. They produce an answer but not a rationale. When deep learning powers an ADS for credit scoring, it may improve the accuracy of the scoring predictions, but that leaves open the question of whether the lack of transparency is worth the improvement.35
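To make the contrast concrete, here is a minimal sketch that fits both an interpretable model, whose learned weights can be read directly, and a neural network on the same hypothetical credit data. The file and column names are invented for illustration.

```python
# Hypothetical credit-scoring example: interpretable model vs. black box.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

df = pd.read_csv("loan_applications.csv")       # hypothetical training table
X = df[["income", "debt_ratio", "years_employed"]]
y = df["defaulted"]

interpretable = LogisticRegression(max_iter=1000).fit(X, y)
# The coefficients show how each input column pushes the decision.
print(dict(zip(X.columns, interpretable.coef_[0])))

black_box = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500).fit(X, y)
# The network may predict defaults more accurately, but its weights offer
# no comparable per-column rationale for an individual decision.
```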
DATA FUNDAMENTALISM AND NECESSARY SECRECY
Data fundamentalism exacerbates these issues. The term, coined by MIT professor Kate Crawford, who studies the impact of AI on society,36 refers to the tendency to assume that computers speak the truth. When I was a postdoc at Yale, Roger Schank took a group of students to Belmont Park to watch thoroughbred racing. Before the first race, he stood in front of the bleachers and gave a lecture on handicapping. Of course, all the New York pundits in the stands rolled their eyes and elbowed each other: What does this guy know about handicapping? However, when Roger pulled out some notes written on green-and-white computer printout paper (this was back in 1979), the pundits stopped laughing and crowded around to hear what Roger had to say. The point is that people often assume computers are right. Unfortunately, data fundamentalism causes people to forget that any ADS judgment is only as good as the training data.
We need to educate people about the need to avoid data fundamentalism. Computers are not always right, and their output can be wrong for many opaque reasons. Before we act on an answer or recommendation from a computer system, it is often prudent to investigate how the computer system arrived at the answer. If we are to accept answers and recommendations from deep learning systems that lack interpretability, we at least need guidelines from the vendor on how to evaluate reliability.
Another issue is that some predictive algorithms must remain secret. The US Internal Revenue Service only has the resources to audit perhaps 1 percent of all tax returns. It has long used predictive algorithms to determine which taxpayers are most likely to attempt to evade taxes. Being selected for an audit can cause significant expense and wasted effort. At the same time, if the IRS disclosed its algorithm, it would provide a blueprint for people who want to evade taxes on how to avoid an audit.
Similarly, Google keeps the details of its search ranking algorithm secret. If disclosed, the algorithm would serve as a blueprint for people who want to use unfair techniques to get their pages to the top of search results and would result in a poor experience for Google search users.37
AI researchers are also putting significant effort into making supervised learning algorithms easier to interpret.38 More importantly, vendors such as Google39 and Microsoft40 are starting to offer explanations as part of their cloud-based supervised learning tools. These explanations are far from perfect and work better for some algorithms than others, but they are a start.
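One widely used, model-agnostic explanation technique is permutation importance: shuffle one input column at a time and measure how much the model's accuracy drops. The sketch below shows the idea with scikit-learn on the hypothetical credit data from the earlier example; it illustrates the general technique, not any vendor's specific tool.

```python
# Hypothetical sketch: permutation importance as a model-agnostic explanation.
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.neural_network import MLPClassifier

df = pd.read_csv("loan_applications.csv")       # hypothetical training table
X = df[["income", "debt_ratio", "years_employed"]]
y = df["defaulted"]

model = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500).fit(X, y)

# Shuffle each input column in turn and measure how much accuracy drops;
# large drops indicate columns the black-box model relies on heavily.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(dict(zip(X.columns, result.importances_mean)))
```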
Large companies are starting to take notice, but only 19 percent of those surveyed in the 2019 Stanford AI Index reported that they are taking steps to improve the interpretability of their ADS systems.41 Several governmental regulations are intended to reduce ADS discrimination. The European Union General Data Protection Regulation now requires an individual's consent before an ADS can make a decision that has a consequential impact on that individual. Even when consent is given, an explanation of the ADS decision is required. In the US, the Equal Credit Opportunity Act requires finance companies to provide customers with reasons for unfavorable decisions. The Algorithmic Accountability Act of 2019, a bill proposed in the US Congress, would direct the Federal Trade Commission to require large companies to assess the risk of discrimination before using an ADS for high-risk decision-making. The Hong Kong Monetary Authority issued regulations that hold the human user responsible for ADS loan decisions.
Courts are also starting to respond to these data bias issues. A Texas teachers' union won a 2017 court ruling in a case in which teachers objected to the use of an automated scoring system as the primary method of identifying 221 teachers for termination. The issue was that the school system had no way to know whether the scoring relied on biased data.42 The parties then settled out of court, with the school system agreeing to stop using the automated scoring system.
Some analysts and ethicists have suggested that lawmakers adopt regulations requiring that protected classes be uncorrelated with an algorithm's decisions. However, that would, in effect, set up quota systems, which have their own issues. A more promising path is the growing research effort, noted above, to make supervised learning algorithms easier to interpret.43