Chapter 3. Fair Data

The choice of what data to use in developing algorithmic and machine learning code is important and will likely continue to be important long into the future. This chapter considers what it means for a dataset to be fair and how to identify aspects of a dataset that can be problematic for fairness. In our discussion we will stick to two main themes throughout. First is the concern about garbage in, garbage out, which we’ll call the fidelity concern. An example of this concern would be building an algorithm for college admissions decisions based on data where students’ grades and names had been mixed up, creating false information and likely leading to an unrealistic and faulty algorithm. Second is the concern about whether data was obtained in a way consistent with fair play, which we’ll call the provenance concern. An example of this concern would be a psychiatrist selling the names of depressed patients to a marketing agency after obtaining these names through her own practice of medicine. If you were a data scientist working in a marketing firm, you probably wouldn’t (and shouldn’t) feel comfortable using such data once you knew where it had come from.

The fidelity concern is what most people think about when terms like data integrity are thrown around. The concern need not have anything to do with fairness. For reasons related to the bottom line, most businesses care about data quality because they want accurate algorithmic products, and accurate algorithms generally require high quality data. Fidelity is important for fairness concerns too. It’s not fair to release products into the wild if those products have been trained on shabby and likely inaccurate data. More specifically, such algorithmic products are unlikely to reflect equity and security concerns. However, our fidelity concerns go even further than this. We also want to know whether the dataset represents a situation we would want our model to reproduce. This is not always the case with datasets measured in the real world, so we want to see how our dataset might deviate from desirable outcomes and be aware of these limitations. For this reason, addressing the fidelity concern requires us to identify aspects of the dataset that are true to the world but inconsistent with our fairness ideals. All of this falls under fidelity.

The provenance concern relates to consent and the appropriateness of having access to particular data given how the data was collected. It is worth noting that these are old questions, but nowadays society seems to have new answers. We are at a point in time where data science is a growing area of interest and competence because of the paradigm of widespread private metadata collection and analysis in digital products. It’s worth noting that the sorts of A/B testing that junior data scientists assist with used to be the province of a much smaller group of people, who were either market research professionals or social scientists. These professionals tended to have ethics and research protocol training in addition to statistical training and disciplinary training. This was important because there were well respected and agreed upon protocols for what constituted consent to be in a research pool and to have data collected about oneself. With the ease of metadata collection and surveillance, these research ethics have seemingly gone out the window, in the sense that far more social science is now happening outside of academia than inside it, and often without institutional ethics reviews. Not only is ethical oversight lacking, but far more data used to draw conclusions about human behavior is now collected surreptitiously through digital metadata, such that “research participants” have no way of knowing about or limiting their participation. It is clear that we are in a new world, but it’s not clear we have developed new ethics rules to reflect that world. Being thoughtful about the provenance of datasets will help us start to re-establish the role of ethics in data analysis.

Below we consider various aspects of fair data. First, we look at data integrity concerns as they relate to fairness. Fairness requires that we responsibly handle data so that it reflects the truth, which can’t happen with sloppy data collection or storage practices. Second, we discuss qualitatively what aspects of a dataset make its use consistent or inconsistent with fairness. We focus on the three aspects of fairness discussed in the previous chapter: equity, privacy, and security, with an emphasis on equity because that is what will most likely influence our initial considerations of which dataset to choose. It is to be hoped that, with respect to privacy and security, these concerns will have been handled at the infrastructure level before embarking on any data selection for any modeling task. Finally, we conclude with a practical example of how we might explore a dataset for signs of discrimination. This will prepare us for the next two chapters, where we discuss ways of either preprocessing data to enhance fairness or generating synthetic data for the same purpose.

Ensuring data integrity

Data integrity is a technical topic, and there is already sufficient writing on this topic for the interested reader to pursue at the general level. Here we cover the topic only from the perspective of how shoddy data practices can have fairness implications. The most basic fairness consideration in machine learning and data-driven computing is whether the training data for a model reflects true measurements proportional to the reported sampling technique. What are some ways that this can go wrong? Thinking about this statement, we can see there are two considerations involved. First, how do we label the data we have? What do we think it is versus what is it really? Second, how do our data points relate to the real world given their sampling? That is, how were they collected, and what does that process tell us about how we should use our data to make inferences about the larger world?

True measurements

The process of collecting and storing measurements can introduce untruths in all sorts of ways. Below I discuss the main ways these untruths can enter.

Proxies

The undocumented or misremembered use of proxy information presents a serious threat to data integrity that implicates fairness. For example, perhaps a column in a csv file is labeled as “wealth” when it is measuring income, or perhaps the column actually contains a wealth estimation from a machine learning model’s output rather than true measured wealth. In such cases, downstream users could understandably misuse this information because it has not been handled with sufficient care.

Unfortunately, proxies are often casually labelled without a recognition that they are proxies, even when it’s abundantly clear that they can’t possibly be what they claim. For example, a column might be labelled “friendliness”, even though there’s no mathematically true and unique way to quantify friendliness. Perhaps “friendliness” resulted from a question asked directly of someone, such as “Would you describe yourself as a friendly person?” Perhaps it was assessed by that person’s peers. Perhaps it was assessed via workplace videos that show some coworkers lunching together and others being loners. Any of these explanations would signal a proxy. While it is far more convenient to have short column/field names in our csv files and databases, we should try to be mindful when using proxies. We might even name them as such, e.g. “friendliness_proxy”. Even better would be to name them for what they truly are, such as “friendliness_rating_by_peers” or “percent_days_eat_lunch_w_coworkers.”
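
As a minimal sketch of this naming practice (the dataset and column names here are hypothetical), renaming a proxy column at load time is cheap insurance against downstream misreading:

import pandas as pd

## hypothetical employee data with a vaguely named proxy column
df = pd.DataFrame({
    "employee_id": [101, 102, 103],
    "friendliness": [0.8, 0.4, 0.9],  # actually the share of days eating lunch with coworkers
})

## rename the proxy for what it really measures so that downstream users
## are not misled into treating it as a direct measurement
df = df.rename(columns={"friendliness": "percent_days_eat_lunch_w_coworkers"})
print(df.columns.tolist())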

A column name itself can be a powerful force. Once we call something “friendliness” we may start thinking of it that way, making us less likely to recognize or question the assumptions built into how we think a proxy should relate to what we want to study. This is a problem for two reasons. First, the data scientist working with the data may over time slide into sloppy thinking, where they stop questioning the assumptions built into the proxy and simply take it as a given, removing the caution they should apply with proxies. Second, imagine downstream users or even subjects of the model seeking to understand how decisions were made about them. Even if they can see released data, an item such as “friendliness” is not going to let them know how this judgment was made. If they are not aware of the proxy, this reduces their ability to understand algorithmic decisions or to challenge incorrect decision making.

As a concrete example, imagine that part of an employee’s performance rating is linked to the “friendliness” proxy. A particular employee wants to know why they did not receive a promotion to a managerial post, and you point to the low “friendliness” indicator. Perhaps you monitored how often employees lunched together and used this as a proxy (note that there are also provenance issues implicit in whether it was appropriate to use such data), but perhaps this particular employee has work-related responsibilities at lunch time that are unaccounted for by the proxy. Without more detailed knowledge about the proxy, the employee cannot hope to challenge the assessment and show that they are actually a star.

Robustness Failures

Failures of robustness are failures to define measures that correspond to a sufficiently general attribute for our inputs or our resulting models to be useful. Consider the “friendliness” proxy discussed above. Apart from the fact that such a value is a proxy and might have been used carelessly, we can ask whether it is a measure that translates across a variety of situations and use cases to show that it reflects what we think of as the general attribute of friendliness. This is more formally described as a question of external validity. Can we show that a measure works and means something outside of whatever small situation or dataset we have used to prove or develop the measure in the first place?

Note that this problem is distinct from the careless use of a proxy described above. Even if we had our friendliness proxy properly labeled and accounted for, we would separately have to ask whether it represented a concept as general as we asserted. For example, imagine we used eating lunch with one’s coworkers on a high percentage of days as a measure of friendliness. We could imagine that in reality this attribute correlated with ambition rather than friendliness, and that someone scoring high on this proxy might not demonstrate much friendliness at all outside of mandatory workplace networking. This would show a lack of robustness in our measure, or in the conclusions we had drawn from such a measure if stated in general terms of friendliness.
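
One rough way to probe external validity, sketched below with entirely hypothetical data and column names, is to check whether the proxy’s relationship to an independent measure of the underlying concept holds up across contexts rather than only in the setting where the proxy was developed:

import pandas as pd

## hypothetical data: a lunch-attendance proxy plus an independent peer rating,
## collected in two different offices (contexts)
df = pd.DataFrame({
    "office": ["HQ", "HQ", "HQ", "HQ", "remote", "remote", "remote", "remote"],
    "pct_lunch_w_coworkers": [0.9, 0.7, 0.2, 0.5, 0.8, 0.6, 0.1, 0.4],
    "peer_friendliness_rating": [4.5, 4.0, 2.0, 3.0, 2.5, 3.0, 4.0, 3.5],
})

## if the proxy is robust, its correlation with the independent rating should
## hold up in both offices; a sign flip or a collapse toward zero is a warning
corr_by_office = df.groupby("office").apply(
    lambda g: g["pct_lunch_w_coworkers"].corr(g["peer_friendliness_rating"])
)
print(corr_by_office)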

Undescribed Variation

Sometimes values measured and placed into one column, implying a single source of data, actually have a variety of sources. For example, maybe you are recording an employee’s subjective performance ratings but not recording who provided those performance ratings. Where there is a systematic difference from one reviewer to another, as there often is, you leave later modelers, or even later human reviewers of an employee, unable to adjust for a more universal comparison. While such an adjustment might never be perfect, data that does not document such important differences is poor quality data: it deprives downstream users both of full information about the limits of the data and of information that could help them contextualize or mitigate data discrepancies. That is, your data has undescribed sources of variation in its measures, even in a case where you could easily have recorded those sources of variation.

One way to think of this is that you are mixing the measurements you put into a particular column. Measurements can be mixed for a variety of common reasons that could easily factor into fairness considerations. Some examples of what to look for or ask about are below:

  • Different measuring devices feeding data into the same column (inter-rater reliability), e.g. different bosses providing different performance reviews.

  • Measuring devices with poor precision, e.g. a rating process so subjective that it doesn’t even duplicate its own earlier judgments in response to the same stimulus.
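
A minimal sketch of the kind of check this makes possible, assuming hypothetical column names and the good practice of recording the reviewer alongside each rating:

import pandas as pd

## hypothetical performance ratings where the reviewer is recorded as its own column
ratings = pd.DataFrame({
    "employee_id": [1, 2, 3, 4, 5, 6],
    "reviewer":    ["boss_a", "boss_a", "boss_a", "boss_b", "boss_b", "boss_b"],
    "rating":      [4.5, 4.0, 4.2, 2.9, 3.1, 3.0],
})

## a large gap between reviewer means is a sign of variation that would have been
## invisible (and impossible to adjust for) had the reviewer not been recorded
print(ratings.groupby("reviewer")["rating"].agg(["mean", "std", "count"]))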

However, it is not only mixed measurements that cause this problem. We also run into the problem of undescribed variation when we do not include all important variables in an analysis, usually because we are not even aware that a variable is important or is even an attribute worth mentioning. This can have fairness implications for a variety of reasons. First, unaccounted for variation may make a dataset look like it reflects a discriminatory process when in fact there is undescribed variation that could also account for the apparent discrimination. Consider, for example, the hiring of personnel at a truck driving company. Perhaps we would start by looking at a dataset showing that a much lower percentage of female applicants than male applicants succeeded in obtaining an interview. However, if we expanded our dataset to include information about who had a commercial driver’s license, it could turn out that the variation explaining the women’s lack of success was in this data column. So collecting as much data as possible to account for variation can even assuage discrimination concerns.

The problem of undescribed or unaccounted for variation is one that social science cannot escape. However, the less variation is accounted for, the less fair the dataset is and the less accurate the results it will produce. Seeking out additional covariates and labels should always be high on the list when compiling a fair dataset. This is quite important because one major finding of statisticians and computer scientists is that, for example, a race-blind algorithm will usually result in more, not less, discrimination. This goes against the intuition of many people who try to build models blind to sensitive characteristics to avoid having to worry about discrimination, but in practice such blindness often produces less fairness rather than more. Accounting for variation is always more desirable than turning a blind eye.

Proportionality and Sampling Technique

Proportionality refers to the need for data to be representative of the system it is measuring with respect to the probability that different kinds of data will be measured. Proportionality encompasses all sorts of expectations for frequency of data collection. For example, here are some ways someone might describe the collection of different data sets, each of which would give us a different understanding of how much data we had relative to the situation being measured.

  • The system takes a picture of every car that goes down the block.

  • The system takes a picture of every car that goes down the block while over the speed limit.

  • The system is programmed to take a picture of every car that goes down the block speeding during daytime hours when children are outside playing.

  • The system takes a picture of every car going near or over the speed limit, but sometimes there are technical glitches and we don’t know whether these glitches are random or not.

  • The system takes a picture of any car exhibiting suspicious behavior as judged by a proprietary algorithm we bought from a company that did not disclose its methodologies.

We can imagine many other scenarios. The point is that proportionality relates to how often data collection happens relative to how often qualifying events or data could be collected. Sometimes we will have reasonably good assurances about how proportional and representative our sampling is, such as if we record every car, but other times we will have only a spotty notion, such as if our sampling is passed through logic that is itself opaque to us regarding how and when to sample. A dataset can only maintain its integrity if its sampling technique, that is, the method for obtaining data, is clearly documented, preferably with lots of redundancy, in both metadata and formal documentation of the dataset. Sampling can be imperfect or non-representative so long as we are aware of these deficiencies.

Below we discuss a few specific aspects of sampling with important fairness implications.

Biased Sampling

The most oft-cited example of sampling problems affecting fairness is biased sampling. One example of biased sampling that is commonly known but still worth touching on is the possibility that race affects how likely a police officer is to stop a driver and find a reason to search the vehicle. Traffic stops happen disproportionately in the US to people with dark skin. These stops can lead to vehicle searches, which can result in the discovery of illicit objects and substances in vehicles. This then leads to biased sampling for other crimes. For example, while black Americans are convicted at much higher rates for drug related crimes than are white Americans, we can ask whether some of this difference is explained by the biased sampling of vehicles, in that white drivers are far less likely to be subject to a traffic stop and resulting vehicle search than are black drivers. Clearly, this has far reaching fairness implications when basic sampling may be drastically different across groups.

Unfortunately, biased sampling cannot be detected in a data set absent knowledge about or assumptions about baseline rates. This is one of the reasons racial differences in statistics are so hotly debated: baseline rates as measured are always subject to the question of biased sampling. This is not an issue you will be able to solve on your own, but it is an issue where you should be aware of the possibility and consider running simulations with a variety of baseline assumptions to see how such assumptions would fit your data.
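
Here is a minimal sketch of such a simulation, with all rates chosen purely as illustrative assumptions: both groups are given the same true offense rate, but one group is searched far more often, and the observed counts diverge accordingly.

import numpy as np

rng = np.random.default_rng(0)

def simulate_discoveries(n_per_group, true_rate_a, true_rate_b, search_rate_a, search_rate_b):
    ## simulate how many offenses are *observed* in each group when the true
    ## offense rates and the search (sampling) rates are set by assumption
    offenses_a = rng.binomial(1, true_rate_a, n_per_group)
    offenses_b = rng.binomial(1, true_rate_b, n_per_group)
    searched_a = rng.binomial(1, search_rate_a, n_per_group).astype(bool)
    searched_b = rng.binomial(1, search_rate_b, n_per_group).astype(bool)
    return offenses_a[searched_a].sum(), offenses_b[searched_b].sum()

## same true baseline rate for both groups, but group B is searched three times as often
obs_a, obs_b = simulate_discoveries(100_000, 0.05, 0.05, 0.10, 0.30)
print(obs_a, obs_b)  # observed counts diverge even though the true rates are equal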

Data Duplication

It happens surprisingly often that duplicate data points find their way into the same data set. Perhaps someone’s name was misspelled in one round of inputs so that a person becomes more than one person, thereby having greater weight than they ought to in subsequent analyses. Worse is when this happens due to problems with a data pipeline, particularly if such problems introduce bias into estimates or projections obtained from subsequently applied methods.

For example, imagine a mobile app designed to collect information about people’s behavior (hopefully with their informed consent and for a purpose that serves the interests of the research subjects). Imagine the app engineer wrote code that uploads data to servers, which store the raw data that is later input into a long data science pipeline. Perhaps through some programming error the app itself is not particularly careful about what data has already been uploaded and what hasn’t. We can imagine cases in which all the same data is uploaded repeatedly. If there are not deduplication checks later in the pipeline, this will mean that older data is always overweighted relative to newer data.
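
A simple deduplication check of the kind that belongs in such a pipeline might look like the sketch below (the column names are hypothetical):

import pandas as pd

## hypothetical upload log where the same readings were uploaded more than once
uploads = pd.DataFrame({
    "user_id":   [17, 17, 17, 42],
    "timestamp": ["2023-01-01T10:00", "2023-01-01T10:00", "2023-01-02T10:00", "2023-01-01T10:00"],
    "value":     [3.1, 3.1, 2.7, 5.0],
})

## flag how much duplication there is before it can bias downstream estimates
n_dupes = uploads.duplicated(subset=["user_id", "timestamp"]).sum()
print(f"duplicate rows: {n_dupes}")

## then drop exact repeats, keeping the first occurrence of each reading
deduped = uploads.drop_duplicates(subset=["user_id", "timestamp"], keep="first")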

In addition to lowering accuracy overall, this could lead to all sorts of bad outcomes. For example, if the app is judging whether a person has “reformed”, perhaps by judging the ability of a recovering alcoholic to stay away from bars or the ability of a former stalker to stay away from her victim’s house, it’s possible that the person’s improvement in behavior would not be reflected in the data fairly because the old data would continue to get too much weight. This would then have created a technology that, through bad/duplicate data, was imposing an unduly high or even impossible standard on someone seeking to demonstrate that they had reformed.

Choosing appropriate data

Separate from the question of true data is the question of appropriate data. When we get to this stage of the inquiry, “Should I use this data set?”, we’ll assume we’ve already passed muster on the first question, “Do I trust this data?”. We then move on to the second question, which we can frame as “Is it fair to use this data, now that I trust it?”. In the sections below we discuss this second question as it relates to our three fairness prongs: equity, privacy, and security.

Equity

Choosing appropriate data for a model has several distinct equity concerns to address. These are defined below, and then different aspects of choosing data are broken down according to which branch of equity concerns they most clearly affect.

Bias and improper discrimination concerns

Bias and discrimination concerns relate to whether ML systems could be using improper considerations, such as membership in a protected category, when measuring merit within a system operating on equity. These are the most common kind of concerns currently addressed by both journalists and academics in their quest for fairer automation and machine learning. Note that discrimination is not necessarily wrong; arguably, all equity-oriented systems discriminate among individuals. The important consideration is the basis for discrimination. When the basis of discrimination is wrong, either because it is immoral, because it is prohibited by law, or because it is unrelated to the ultimate goal of a system, such discrimination is improper.

Fair play concerns

Fair play concerns relate to the notion that we want a society that promotes individuals from all sectors of society and helps people who need help. We want to make sure everyone has a look-in and a chance of success. This may sometimes mean we don’t want a system that is as accurate as possible. For example, kicking people when they’re down and could get back up doesn’t seem consistent with fair play. So we would probably want to build ML systems that don’t discriminate on the basis of long term unemployment, even if long term past unemployment might predict future unemployment. Likewise, even if a poorer socioeconomic status might make a job candidate slightly less likely to succeed in a given position, that doesn’t mean we always want to use that information.

Reasonable expectations concerns

Even apart from fair play concerns are concerns about reasonable expectations. These concerns may evolve over time. Someone might have given you permission to store a particular kind of data, but perhaps they gave that permission before it became clear that such data could make them highly vulnerable to certain privacy, security, or discrimination attacks. You could ask whether your use of data to get at information that wasn’t apparent at the time consent was given is in keeping with the reasonable expectations that existed when consent was given and data collected. Even if consent couldn’t fully cover everything, what would a reasonable person at the time have expected? What does a reasonable person expect now? Is a given data set in keeping with those expectations, and if not, do you have any basis for why your technology pipeline should not comply with such reasonable expectations? (Hint: probably not.)

Concerns about bias

Data about biased systems

Concerns that ML systems will replicate existing illegal and immoral biases are partly grounded in the fact that the real-world systems producing the data used to build ML systems can sometimes quite reasonably be suspected of racial bias. Some common examples are data sets related to policing and data sets related to employment outcomes. In both these domains there have been both formal and informal divisions based on illegal factors, such as race, sex, or religion, for centuries. What’s more, such formal bases for discrimination, particularly in the employment context, have only recently been addressed with formal prohibitions. For example, just a few decades ago it was still legal to fire a woman for being pregnant, and it is an active crisis in US policing that unarmed black people are much more likely to be shot by the police than unarmed people of other races.

Data collected in a biased way

A subsidiary concern about biased systems is that while the functioning of a system and real outcomes may not be biased, the reporting about the system may nonetheless be biased. I call this a subsidiary point because you could say that from a data science perspective it hardly matters which is the case. However, it’s good to earmark this for yourself as a separate consideration. Perhaps you have received a dataset for which you believe you can be entirely sure that the system is not biased. Maybe it is something entirely natural and free from social influence (I am struggling to come up with an example). Nonetheless, even then perhaps the reporting is different.

For example, imagine that being the victim of a DUI accident, as modeled for a car insurance company, is actually equally likely, in proportion to the population, to befall a man or a woman. But imagine that for some reason men are less likely to report being the victim of such an accident; perhaps they are more inclined to drive away rather than report the accident to their insurer. In such cases, women may unfairly be charged much higher prices by their insurer on the belief that they are somehow more likely to be the victim of a DUI accident, even though the underlying facts do not support such a concern.

Concerns about fair play

In this group of issues related to fair play, we consider whether we are putting data subjects in a position that undermines their autonomy or authority over how their personal information is used. These concerns relate to consent and whether the appropriate conditions are in place for meaningful consent to be given.

Data collected without notice of its purpose

In regimes where data collection is regulated, it is consistently true that consent is defined to require complete and accurate notice regarding the purpose of data collection. In medical studies, research subjects must know what will happen with the tissues or medical information they provide. Likewise, when participating in psychology experiments, research subjects must be told, either before or in some cases afterward (during a debrief), the purpose of the experiment in which they participated. Humans must be given truthful and informative facts about where their information is going.

This is beginning to be true in the realm of personal data more generally, particularly personal data collected on the internet. With the advent of the GDPR, companies to which the law applies must provide meaningful notice not only about what kinds of information they collect but also about what they intend to do with that information. Even in the US, where such consent is not strictly required, most companies will have privacy policies and terms of service that explain, in dense and arguably intentionally obscured language, what they do with the data they collect.

If you happen to be using a data set where it is not clear that the people included in the data set provided information knowing how it would be used, you should carefully evaluate the ethics of the situation. In many cases it will not be appropriate to move forward with such a data set.

Data collected in a way that leaves no possibility to opt out

Consent is only meaningful if there is a real opportunity not to give consent. This is one of the reasons law regulates all sorts of bargains, such as how door-to-door salespeople can behave, what kinds of questions an employer can ask, and so on. We recognize that in some situations there is an imbalance of knowledge or power that makes consent to cooperate somewhat meaningless. In fact, one debate that is ongoing in recent years is whether consent to the terms of service of very large companies, such as Facebook, Google, or Apple, can be meaningful when for many of the services they provide there is no meaningful competition. For example, some teenagers would argue that they could not possibly opt out of Facebook, at least not without very sad consequences for their social lives. If that is really the case, we can ask whether it is really ethical for research work to go on at Facebook, or even at third party organizations that are buying data from Facebook, given that many people may be on the platform merely as the result of social compulsion in their respective demographic and socializing groups.

Helping people rise instead of fall

Another concern could be that data collected could be accurate but could also build a system that doesn’t leave a way up. For example, let’s imagine that someone’s social class is relevant to their likelihood of succeeding in a particular job. Even if that correlation were real, building it into a hiring model would penalize people for where they started rather than for what they can do, which runs counter to the fair play goal of leaving people a way to rise.

Reasonable expectations

Another side of consent is determining what was consented to. Practices can only be consented to if they were meaningfully described and there was a meaningful opportunity to opt out, as above. However, this is not the end of the story, because it’s rare for agreements to be complete and to specify absolutely everything. This is simply not possible. So we need to fill in the gaps with inferences regarding what the parties would have agreed to had they filled in the gaps, whenever a particular situation arises that does not appear to be covered by a past agreement.

Data collected before techniques or technologies developed

Technological change is one area where we can ask what someone reasonably agreed to when they agreed to share their data. For example, it was popular to post photographs on social media websites long before the advent of modern convolutional neural networks and their awesome powers of facial recognition. For this reason, we can imagine that many people posting photographs before such technologies became commonplace in no way expected their rise. We can imagine that while people recognized that putting photographs on the web gave up some of their privacy, they did not know that some day computers would be able to find photographs of them effortlessly, in a way that would enable scalable surveillance or even claims to be able to read their personality or sexual orientation from a photograph. One could make the argument that even if consent were given to store and analyze photographs, that consent did not extend to these new technologies and should not be considered as allowing them.

Alternatively, we could consider the donation of medical tissue. Perhaps patients in the past donated their brains to scientific study and signed fairly generous waivers as to what could be done with those brains. However, imagine that such waivers were signed before the discovery of DNA (which was not so long ago). In such a case, and particularly given the implications for descendants of the donor, we can imagine the donor might not have signed away their rights to their brains if doing so might some day compromise the genetic privacy of their descendants.

These concerns may not be applicable to your field, but they may be. When you are looking at data, particularly data where a recent technology has changed how informative it is, you should carefully consider whether the data compromises the subject or makes them substantially more vulnerable than they could have anticipated at the time they gave consent. As one realistic example, a medical study recently found that facial recognition algorithms, applied to MRI scans rather than to photographs, could actually reidentify people, hence matching confidential medical information to public facing photographs. This is the stuff of reality, not science fiction.

Another example is the very new use of voice skins for criminal purposes. Apparently with some amount of voice data, someone’s voice can be replicated quite reliably. Given this, we can ask whether we really have our friends’ permission to record their voices, for example by video, and post these in unprotected locations.

Privacy

Our discussion of equity was quite wide ranging and went far beyond the matter of treating similarly meritorious individuals similarly, into the domain of treating all individuals with some baseline level of respect and autonomy. Some of the data choice matters above have already touched on privacy; below we cover those that have not.

Metadata collected without consent

In collecting and using metadata, we should think long and hard about what intrudes into domains where an individual reasonably did not expect to be observed, even if something was technically observable. For example, a decade ago it seemed creepy even to web developers to record things such as where someone clicked on a web page and how long they spent on different areas of the page. Now such recordings are standard, but just because they are standard in industry does not mean that they reflect what people expect or desire with respect to their privacy level when browsing.

When you see datasets in your organization, it’s worth asking whether the collection of that dataset was in accordance with what people would have expected and, more importantly, what they would have wanted. This is particularly true for metadata, that is, data that results from observing someone and for which the person could not possibly avoid observation other than, for example, by simply not using the internet at all, which hardly seems like a reasonable balance.

Personal data

One way of thinking about privacy in a data-driven ML context is through the lens of personal data. We can look to two important legal sources to define personal data for us, namely the GDPR and the California Consumer Privacy Act (CCPA), each of which is expected to affect a huge portion of technology companies doing business in the United States or Europe.

Data that proxies for private information

The more sophisticated techniques grow, the more we might ask whether certain kinds of information that originally seemed not to touch on privacy concerns might eventually do so. For example, if sexual orientation can be determined from someone’s photo alone (and this is a contested claim and not a settled matter), should even the storage of basic photographs invoke privacy concerns? Should such photographs ever be stored in a way that might connect someone’s photo to personal data, such as name, age, or address, such as could be used by someone seeking to hurt people of alternative sexual orientations?

Security

Below we discuss a few considerations for datasets that relate to security. We highlight cases where use of a dataset to train a critical model could be problematic. We also highlight cases where mere possession of data could be problematic, because the risks entailed in holding the data and having it exposed are likely not worth the potential benefit.

Data likely to be attacked

In an era of greatly increased cyberattacks, one good question to ask is what your liabilities are in storing and analyzing a given dataset. The current state of the law is such that it’s unclear whether people with sensitive information in a dataset may have a legal claim against someone storing that data if the data is hacked. In the past such a legal theory has been rejected, but there is evidence that courts are evolving on this matter. This is important because, while you might think that storing what could be sensitive information is handy and there’s “no harm”, your organization might eventually be liable if you got hacked. Thus, even if you never planned to publish the data, you could be on the hook legally, and you are certainly on the hook ethically, if this happens.

You might think your organization steers clear of this simply because you don’t handle healthcare data or financial data, such as credit card numbers. However, in the era of doxxing, living online, and increasing incivility in society, more people are at risk. Even when your motivations are in the right place, you could hurt someone. Imagine, for example, that you keep a carefully compiled list of women who have been stalked and beaten up by their boyfriends, with some form of personally identifiable information. Nowadays it seems that not only the stalkers but vigilante groups might go after these women or be incensed just to see their names on a list. Likewise (and perhaps less sympathetically), a recent case involved a defunct white nationalist organization being hacked. While many people expressed interest in downloading the publicly available data to do visualizations, we might ask about the moral implications if one of their visualizations led to someone being beaten up for their association with the group.

To give a much older example, during WWII the percentage of Jewish citizens who survived the war was particularly low in the Netherlands. This was because the Dutch government kept excellent records on their citizens, including religion. Apparently no one had thought to burn these records or make them inaccessible when the Nazi forces invaded.

Data can be weaponized. Even if you don’t plan to weaponize it, make sure you only work with potentially sensitive data if you really have to. And make sure to keep a forward looking and expansive definition of what constitutes sensitive data so you don’t get caught out failing to reflect the cultural conversation.

Incomplete data

The use of incomplete data could easily pose a security risk, particularly when physical safety depends on the outcome of a technological system. We might ask whether this is what happened with Boeing’s MCAS system. Presumably the software was tested, perhaps in a simulator, and certainly through many test flights. And yet, clearly insufficient data was collected relative to the real world use cases, because the system failed twice once it had been released into the wild.

For the same reason, any machine learning algorithm may prove unsafe if we are not able to explore the full event space in advance. We must also keep in mind that the potential event space includes not only the upstream inputs that can arrive but also the combinations of downstream use cases with potential outputs. Have all possibilities been covered? If not, we need to think very carefully about issues such as the explainability of a model. How can we be sufficiently confident of its performance?

Naive data for training

Recently, Android phones manifested a serious security breach when it was realized that a new feature, introduced to allow people to unlock their phones with face recognition, also allowed third parties to unlock someone’s phone by using that person’s sleeping face. Undoubtedly this must in part have reflected problems with the training data. It must not have been the case that sleeping faces were in some way marked as a no-go. Likewise, we might ask whether other serious lapses are still present in the training data driving this tool. For example, should faces showing physical anguish, such as might be produced when an abusive partner, law enforcement officer, or kidnapper forces one’s face into position in front of a camera, be able to unlock the phone? Shouldn’t there be some built-in motion a person can make with their face in order to avoid unlocking the phone?

When choosing an appropriate data set, we need to ask whether we have considered potential events that could occur as inputs into a model but that might not be represented in the data set. While undoubtedly no one can think of them all, it is certainly fair to expect those working on such products to at least think about and account for the worst possible scenarios, such as the one discussed above. So for any product that will be algorithmically powered, an appropriate data set will include data that tells our algorithms what to do in these life or death and/or worst-case scenarios.
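
One practical habit is to write down the worst-case scenarios explicitly and check the training data’s coverage of them. The sketch below uses an entirely hypothetical metadata table and thresholds; the point is the habit, not the schema.

import pandas as pd

## hypothetical metadata for a face-unlock training set
faces = pd.DataFrame({
    "image_id":   range(6),
    "eyes":       ["open", "open", "open", "closed", "open", "open"],
    "expression": ["neutral", "smiling", "neutral", "neutral", "distressed", "neutral"],
})

## scenarios the product must handle safely, and the minimum examples we expect of each
required_coverage = {("eyes", "closed"): 500, ("expression", "distressed"): 500}

for (column, value), minimum in required_coverage.items():
    count = (faces[column] == value).sum()
    if count < minimum:
        print(f"insufficient coverage: {column}={value!r} has {count} examples, need {minimum}")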

Choosing the right dataset for a question and the right question for a dataset

Another concern relates to systems that are quite likely to be higher than first order systems. A first order system simply functions. A second order system reacts to its own functioning. A third order system reacts to its reactions to its own functioning. This sounds complicated, but it reflects a lot of the real world. For example, if you are building a system to sentence people, and in that system you build in an assessment of how likely someone is to reoffend, your system is likely woefully incomplete if it does not take into account its own interventions. It is well known that the length of someone’s sentence in prison can affect their likelihood of reoffending, although perhaps not in the way you might intuit. It’s not necessarily that a very long and harsh sentence makes someone very sorry indeed, or simply very determined never to go back to prison. A long sentence can also expose someone to more criminal culture or even result in their becoming part of a gang. Such factors point to longer prison terms making someone more, rather than less, likely to reoffend. So if you are seeking to build a system that would intervene in a population but are not able to access data about that intervention, you should have serious qualms about using the data and the model, at least without a very short shelf life.

Modeling low incidence events

If you are seeking to model low incidence events, you should also be very careful about the kinds of data you select. What the recent past has shown us is that people building ML pipelines are not good at simply admitting that they cannot predict something. They would rather find a proxy. Unfortunately, using data to identify a proxy rather than the real thing can create all sorts of problems. Most importantly, it can lead to extreme injustice. For example, one fear related to currently deployed recidivism prediction instruments is that they are seeking to predict something, namely violent reoffenses, that is actually a low probability event. In some cases what has been used instead to predict these offenses is data about non-violent reoffenses. While such data might be somewhat related, it might also simply lead to profiling the wrong people and for the wrong reasons.

Incomplete data in complex systems

In a recent Science publication, researchers identified evidence of racial bias in an ML system that had been constructed by a health insurance company to identify patients who were most likely to need more healthcare in the future and to target those patients for additional interventions. The researchers found that black patients receiving a similar sickness severity score to white patients were actually sicker than the white patients with the same score. The researchers further estimated that about half of the black patients who should have been receiving additional interventions were missed by the algorithm. The reason? Healthcare costs had been used as a proxy for sickness, so the algorithm was predicting healthcare costs rather than predicting illness.

The problem with this is that a great deal of evidence shows that, historically and currently, there are vast disparities in the amount of care and the amount of money spent on that care when comparing white Americans to black Americans. Hence, using such a proxy as a way to identify sickness systematically, but presumably unintentionally, reproduced an old injustice.

Researchers believe that inappropriate proxies are one of the biggest culprits in reproducing or creating new sources of bias. So, when you are choosing a dataset to answer a specific question, you will likely need to determine whether you are in fact using a proxy rather than the real measure. In the example given here, it’s fairly apparent, at least with hindsight.

But consider a less obvious example, and do so knowing it’s a tough problem. Health records are incredibly complicated. Those in medical billing can tell you that there are literally thousands of codes in any given practice area. What’s more, because different insurance companies may have different reimbursement practices or coding policies, physicians and billing professionals alike have to tailor their coding according to a given patient’s health insurance. Even within a single insurer, institutions vary their coding practices for a variety of reasons. If you are the data scientist looking at all this variation, it would be difficult to come up with a single measure of sickness. Below are a few ways you might be tempted to measure sickness, along with explanations for why these proxies could be just as problematic as cost.

  • Number of times nursing staff checked on a patient while patient was in the hospital.

    • This might seem like a good way of gauging how much care professionals thought a particular patient needed, which might be a good proxy for sickness, at least on a short time scale. However, this could also correlate strongly with race, or how wealthy a patient was, or how assertive their family was.

  • Number of different specialists seen by a patient.

    • Same problems as above.

  • Patient’s bodyweight and blood pressure

    • This would privilege certain kinds of illnesses without necessarily achieving the global target of identifying all sick patients. It could easily turn out that the kind of illnesses focused on in this way also correlated in some way that produced a discriminatory impact. For example, if women systematically had these health problems but men did not, the algorithm would systematically discriminate against men in the targeting of extra helpful interventions.

This discussion might be incredibly disempowering. After all, if proxies are out and the ground truth isn’t available or possibly even knowable, what can be done? Here are a few suggestions.

  • Proxies aren’t out entirely. They should be used carefully. They should also be disclosed openly. The health insurance company targeting people should have made clear that their product was built to target high cost patients. In such a case, professionals out in the field might have more quickly spotted the built-in bias of the proxy and warned those who had rolled out the product.

  • Products should only be as ambitious as reality dictates. If something isn’t knowable, we’re better served recognizing that rather than pretending otherwise.

  • A mix of proxy variables, which comes down to a more consciously constructed single proxy indicator, might defeat some of the concerns raised by any single proxy indicator (see the sketch after this list).

  • Including data that covers racial information might be a way to build a less discriminatory model when proxies are used. If such data can be coupled with proxy data and other indicators that get more at the question of health, this could be a way to reduce the discriminatory effects of using a proxy variable.

  • More data can be obtained. In fact, particularly for high stakes decisions, this is exactly the recommended way of correcting issues of bias. In low-stakes decisions it can be possible to shape or resample data or to sacrifice some accuracy, but in high stakes decisions, such as healthcare, organizations need to be willing to put more money and thought into data collection.
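
As a minimal sketch of the composite-proxy suggestion above (all columns and values hypothetical), one might standardize several imperfect proxies and average them rather than leaning on any one of them alone:

import pandas as pd

## hypothetical patient data with several imperfect proxies for "sickness"
patients = pd.DataFrame({
    "nurse_checks_per_day":  [2, 8, 5, 1],
    "num_specialists_seen":  [1, 4, 3, 0],
    "num_chronic_diagnoses": [0, 3, 2, 1],
})

## standardize each proxy, then average them into a single, consciously
## constructed composite indicator
zscores = (patients - patients.mean()) / patients.std()
patients["sickness_composite_proxy"] = zscores.mean(axis=1)
print(patients)

None of the individual problems with each proxy disappear, but constructing and documenting the composite openly makes it easier for domain experts to spot which ingredient is doing the most work.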

Quality assurance for fair data

Assuming we have tried to respect the principles and procedures described above, we can now ask, from a coding perspective, how we ensure good quality in our data. We want data that will not lead our models to replicate discriminatory conduct in the world. We also want data that preserves privacy. In the code samples below, we introduce data checks that can identify some simple early problems. Then in the next two chapters we follow this up in far more detail, considering how to either preprocess our data or generate alternate synthetic data points to alleviate fairness concerns, most particularly with respect to discrimination and privacy.

Measures of Discrimination

In this section, we focus on some measures of discrimination that are commonly applied, particularly in the legal setting. Note that these are far from representing all notions of fairness. Note also that measuring discrimination is not the same as measuring equity. For equity we would likely want to focus on individual outcomes, and there are a host of measures that focus on whether similar individuals are treated similarly. However, such a problem is unlikely to be one you run into early on, at the stage of determining whether a dataset is even putatively reasonable to use without preprocessing. At this point we will focus on looking for group level effects, with an emphasis on group parity.

Many decisions of interest will ultimately result from some kind of binary decision. An applicant gets a loan or does not; a defendant gets convicted or does not; a candidate gets a job or does not. In such cases, we can imagine success and failure as the binary outcomes. Let us also consider for the moment that we have two groups, the majority group (favored_group) and the minority group (disfavored_group). We will then have success and failure counts that appear as below.

Table 3-1. Binary success table

group              bad outcome   good outcome   total
favored_group      r1            r2             N1
disfavored_group   r3            r4             N2
total              M1            M2             N

If the percentage of the favored group getting a good outcome, r2/N1, is much larger than the percentage of the disfavored group getting a good outcome, r4/N2, this fact would itself give rise to suspicions of discrimination. Such fairly simple ratios are in fact used in legal settings as one measure of potential discrimination, and US courts notably focus on these selection rates when evaluating claims of discrimination.

European courts take a slightly different approach, focusing on measures that stem from the failure rate rather than the success rate. We denote these as p1 = r1/N1, the failure rate of the favored group, and p2 = r3/N2, the failure rate of the disfavored group. U.K. courts tend to look at the difference in failure rates, that is p2 - p1, which is known as the risk difference. The EU Court of Justice, in contrast, has discussed p2/p1, which is known as the risk ratio.
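
A minimal sketch of these quantities, using purely hypothetical counts for the cells of Table 3-1:

## hypothetical counts for Table 3-1
r1, r2 = 30, 70    # favored group:    bad outcome, good outcome
r3, r4 = 55, 45    # disfavored group: bad outcome, good outcome
N1, N2 = r1 + r2, r3 + r4

## selection rates (the US-style comparison of success rates)
selection_favored    = r2 / N1
selection_disfavored = r4 / N2
print("selection rate ratio:", selection_disfavored / selection_favored)

## failure rates (the European-style measures)
p1 = r1 / N1   # failure rate of favored group
p2 = r3 / N2   # failure rate of disfavored group
print("risk difference (p2 - p1):", p2 - p1)
print("risk ratio (p2 / p1):", p2 / p1)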

We know that the selection rates of the favored group and disfavored group are unlikely to be identical, even absent discrimination, simply due to numerical fluctuations. We also know that there could be reasons related to the correlates of group membership that could account for differences in selection rates even if there was no per se group membership related discrimination. So what kind of thresholds should we set for the measures of discrimination?

There is no clear answer from a mathematical or sociological point of view, but the United States legal system has provided one rule of thumb that has been in use for decades. This is the adverse impact rule, put forth in 1978 and still in use, which defines the statistical threshold to establish adverse impact in an employment situation. The rule is also known as the 80% rule or the four-fifths rule because it states that adverse impact is established where the selection rate for the disfavored group, namely a group that is provided special attention under the law due to past discrimination (e.g. race, sex, religion, national origin, and a few others, in most countries), is less than 80% of the selection rate of whatever group is most successful. Note that a finding of adverse impact alone is not usually enough to win a lawsuit, as intent to discriminate must also be shown. However, a showing of adverse impact is nonetheless serious and can trigger government investigations and the need for such a disparity to be explained by those in charge of the process producing the disparate results.

Absent domain knowledge or reasons for setting a different rule, the 80% rule is a good one to start with. Hence, when you are looking for discrimination in a data set while first getting familiar with it, you should apply these basic measures of discrimination to potentially vulnerable groups within the data set.

Let’s use an example. We consider the German credit data set, a data set heavily used to illustrate techniques and metrics associated with discrimination. We then look at the success rates in applying for credit for each of a variety of gender/marital status combinations, which are described in the accompanying documentation as follows:

	      Personal status and sex
	      A91 : male   : divorced/separated
	      A92 : female : divorced/separated/married
	      A93 : male   : single
	      A94 : male   : married/widowed
	      A95 : female : single

Let’s consider the impact of gender and marital status on the success rate for obtaining credit.

import numpy as np
import pandas as pd

## read in data
data = pd.read_csv("german.data", delim_whitespace=True, header=None)

## calculate the number of successes and total numbers per group
## (column 8 is the personal status/sex attribute; the last column, 20, is the
## credit outcome, where 1 = good outcome)
success_data = data.groupby(8).agg([lambda x: sum(x == 1), lambda x: len(x)]).iloc[:, -2:]
success_data.columns = ["num_success", "num_total"]

## determine the rate of the most successful group and compare all other groups
most_successful = np.max(success_data.num_success / success_data.num_total)
print(success_data.num_success / success_data.num_total / most_successful)

That yields the following output

A91    0.817910
A92    0.883871
A93    1.000000
A94    0.992754

So if this were a hiring process, the EEOC would probably not launch an investigation, but it might keep an eye on why divorced/separated males in particular (A91) seemed to be doing so much worse than other groups in their applications. However, the 80% rule is a bright-line standard, flagging only cases where there is clearly a likely violation. We want to do better than that. If we want to do so with this original data set, we can explore ways to preprocess the data, even before training an algorithm, to make the dataset more fair. We will learn how to do this preprocessing in Chapter 5.

You might be somewhat dissatisfied with the very simplified measure of discrimination above. Perhaps you also want to think about how to handle explanatory variables. For example, perhaps much of the potential discrimination above can be explained by other correlated variables. Perhaps the divorced men also have poor financial behavior.

In this case we can use thinking that comes from association rule mining. In particular, we look at Pedreschi et al’s (2009) definition of α-discriminatory rules as a way to study decision rules suggested by associations in a data set. Before studying it, however, we need a few pieces of vocabulary associated with association rule mining.

First, the basics. Association rule mining looks for rules in a tabular data set, such as A → B and the like, where the pattern occurs at a rate higher than chance. We call such a rule, A → B, an association rule. The confidence of the association rule A → B is P(A & B) / P(A). This gives us a measure of how often B really occurs when A is given. How is this helpful?

Now let’s imagine there is some kind of protected attribute, X, that is legally prohibited from affecting a decision. Facially, the entity studied in a data set insists that X is not taken into account, but we do notice some discrepancy in the success rate of individuals in group X compared to individuals in other groups. We can then compute the confidence of the two decision rules A→B and (A + X)→B and compare them.

In this way, Pedreschi et al define a decision rule as α-discriminatory if

conf((A + X)→B) / conf(A→B) > α

where α is a quantity we specify before looking at the data. This is a way of quantifying how much stronger an association rule becomes when provided with information about the prohibited class. For our purposes, we set an α value of 2 as our cutoff.
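
Before applying this to real data, here is a minimal generic sketch of the quantities involved; the function names are our own, not part of any package, and the arguments are boolean Series (or arrays) over the same rows.

def rule_confidence(antecedent, consequent):
    ## confidence of the rule antecedent -> consequent, i.e. P(A & B) / P(A)
    return (antecedent & consequent).sum() / antecedent.sum()

def extended_lift(a, x, b):
    ## how much the confidence of a -> b grows once the protected attribute x
    ## is added to the antecedent
    return rule_confidence(a & x, b) / rule_confidence(a, b)

def is_alpha_discriminatory(a, x, b, alpha=2.0):
    ## flag the rule (a + x) -> b if its extended lift exceeds the preset alpha
    return extended_lift(a, x, b) > alpha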

Recall that we were concerned about the divorced males above. Let’s now see whether, given an unknown savings state, there is still reason to think that divorced males are additionally discriminated against. We do this by computing the confidence of the proposed association rules:

Unknown savings → Bad outcome
Unknown savings AND Divorced male → Bad outcome

We do this with the following code

## looking at whether a rule is alpha-discriminatory
## column 5 is the savings attribute (A65 = unknown / no savings account),
## column 8 is personal status and sex (A91 = divorced/separated male),
## and column 20 is the credit outcome (2 = bad outcome)
unknown_savings = data.iloc[:, 5] == "A65"
div_male        = data.iloc[:, 8] == "A91"
bad_outcome     = data.iloc[:, 20] == 2

### test association rule 1: unknown savings -> bad outcome
rule1_conf = (unknown_savings & bad_outcome).sum() / unknown_savings.sum()

### test association rule 2: unknown savings AND divorced male -> bad outcome
rule2_conf = (unknown_savings & div_male & bad_outcome).sum() / (unknown_savings & div_male).sum()

## compute elift (ratio of the confidences of the two rules)
print(rule2_conf / rule1_conf)

This results in the following output

0.9531

Not only is this below our threshold but it is also below 1, which suggests that, if anything, divorced males receive something like favorable rather than unfavorable treatment in the context of unknown credit history. We can compare this to the treatment of divorced females using the same two rules. The code is essentially identical except that we select group “A92” rather than group “A91” in the marital status column; it is in the repository but not reproduced here. If we do this, the result is the following output

1.5172

Here we see some evidence that adding the gender information to the unknown credit history rule worsens the expected treatment of females, whereas it did not for males. However, the confidence ratio of roughly 1.5 is still below our preset threshold of 2, so we will not treat this as discrimination. Incidentally, note that this ratio of confidence values is also known as extended lift.

It is important that we set the threshold in advance so that we can have a uniform metric, preferably across institutions, and also so that we don’t “cheat” by adjusting our standards for discrimination based on what we find in the data. While you might see such after-the-fact adjustments as additional opportunities to protect discriminated-against groups, someone else might use the same latitude to let discrimination slide by adjusting the tolerance in the other direction. We must not let the data drive the standards.

List of metrics

If you ever delve into the current or recent scholarship on fairness, it’s clear that this is a discipline that has not yet coalesced around a specific definition of fairness, and it’s also clear that such a consensus may never emerge. Technical scholars have made a very good case that there are numerous ways to measure fairness and that many different metrics have validity and utility depending on the legal and historical concerns at play in a particular situation.

Below we examine a laundry list of fairness metrics. These were chosen as a subset of what is offered through AIF360, a recently released open source machine learning fairness package developed by IBM. At the time of writing this is one of the most visible and comprehensive fairness packages, and for this reason we will focus heavily on it in the next few chapters. We will also include other open source and DIY options, but it’s good to recognize where a potential canon might emerge from, and AIF360 seems to offer one possibility.

Note that these metrics presuppose a label of some kind. Such labels can come either from natural labels in the data or from labels produced by a model.

Balanced Error Rate for the Disfavored Group

For many reasons, including the fact that a disfavored group, at least in the US context, often tends to be the numerical minority group (with criminal justice a notable exception), error rates often run higher for the disfavored group than for the majority group. The balanced error rate, the average of the false positive rate and the false negative rate, is one way to capture this without letting the larger group’s performance dominate. Likewise, errors are not always evenly distributed between false positives and false negatives; in the famous case of the COMPAS algorithm, for example, data points coming from black defendants were much more likely to receive a false high-risk label (the negative outcome) than a false low-risk label (the positive outcome).
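
As a minimal sketch, assuming hypothetical arrays of true labels, predicted labels, and group membership (with 1 as the positive label), a per-group balanced error rate might be computed like this

## a minimal sketch of a per-group balanced error rate on hypothetical toy data
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 1, 0, 0])
group  = np.array(["priv"] * 5 + ["unpriv"] * 5)

def balanced_error_rate(y_true, y_pred):
    ## average of the false positive rate and the false negative rate
    fpr = np.mean(y_pred[y_true == 0])          ## share of true negatives predicted positive
    fnr = np.mean(1 - y_pred[y_true == 1])      ## share of true positives predicted negative
    return 0.5 * (fpr + fnr)

for g in np.unique(group):
    mask = group == g
    print(g, balanced_error_rate(y_true[mask], y_pred[mask]))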

Example 3-1.

Throughout this book we’ll use “positive” to refer not to “1” but to the sense of “desired outcome”. Fairness doesn’t come into play when a truly neutral assessment is made (if such a thing even exists). Fairness comes into play when desirable resources are being allocated such that there are clearly better and worse outcomes. We’ll use “desired” and “positive” interchangeably and in a way that emphasizes the value of a label rather than its technical definition. This way, we can talk about a variety of models interchangeably with regard to positive outcomes, be it the COMPAS algorithm assessing whether someone is high risk for recidivism (not a positive outcome) or a hiring model assessing whether someone should be offered a position (a positive outcome). In both cases the label could come back as “1” in a binary prediction problem, but not both “1”s are positive.

Another vocabulary choice to be aware of is that we’ll use “privileged group”, “favored group”, and “majority group” mostly interchangeably. This label is not meant to convey any judgment by the author or the readers regarding actual merit but is only meant to be descriptive and to recognize that a group has historically been favored, meaning that it has received some kind of privilege that the disfavored group has not. Note that the privileged group need not always be the majority group (history is rich with examples of numerically small groups amassing privileges), but if we do speak of a majority group you can infer that it is also designated as a privileged group. This will simplify our language in coming chapters but is not meant to build in assumptions about actual merit, future merit, or (absent the designation of “majority”) numerical representation.

Disparate Impact

This is a term taken from the legal literature but also endowed with a technical definition. In the fairness literature it has come to refer to the ratio of the rate of a positive outcome for the disfavored group to the rate of a positive outcome for the favored group. In the legal literature, as has been mentioned before, several US government departments have supported the 80% rule, which means that this ratio should be no lower than 0.8.
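
A minimal sketch of this ratio, using hypothetical binary outcomes and group labels, looks like the following

## a minimal sketch of the disparate impact ratio on hypothetical toy data
import numpy as np

outcome = np.array([1, 1, 0, 0, 0, 1, 1, 0, 1, 0])   ## 1 = positive/desired outcome
group   = np.array(["unpriv"] * 5 + ["priv"] * 5)

rate_unpriv = outcome[group == "unpriv"].mean()
rate_priv   = outcome[group == "priv"].mean()

disparate_impact = rate_unpriv / rate_priv
print(disparate_impact)      ## values below 0.8 would fail the 80% rule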

Statistical Parity Differences

Statistical parity refers to a fairness standard whereby all groups should measure the same, that is, show statistical parity, on a measure such as false negatives, false positives, or numerical means. As we’ve discussed previously, it is generally mathematically impossible to satisfy two such parity metrics simultaneously, and it is likely impossible to satisfy more individualized notions of fairness while also satisfying statistical parity. Nonetheless, there are many use cases, analogous to legal reasoning, where near or perfect statistical parity is quite desirable, and so measuring the difference in a statistical measure between a favored and a disfavored group is one metric of fairness. The larger the difference, the less fair the situation is rated to be.
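
As a minimal sketch with hypothetical predictions and group labels, the difference in positive prediction rates can be computed directly

## a minimal sketch of a statistical parity difference on hypothetical toy data
import numpy as np

y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 1])
group  = np.array(["unpriv"] * 5 + ["priv"] * 5)

## positive rate for the disfavored group minus that of the favored group (0 means parity)
spd = y_pred[group == "unpriv"].mean() - y_pred[group == "priv"].mean()
print(spd)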

Consistency

Consistency is one measure of fairness that is not oriented towards group outcomes or equality. Rather, consistency is a check for arbitrariness, and it recognizes that the members of a favored and a disfavored group within a dataset are not necessarily comparable. For example, it could be that one group has features that make it highly correlated with desirable attributes. A common example is comparing women in NYC, unlikely to have a commercial truck driver’s license, with white men in a midwestern state, where the base rate of having a commercial truck driver’s license is much higher. If we were to ignore these underlying correlations, we could very well end up with a model too strongly favoring statistical parity, and one way we would likely identify this is with a consistency check. Consistency is commonly established by an individually driven methodology, such as a nearest neighbors metric. The logic here is that individuals who are near one another in the input space should have consistent outcomes, which seems fair in the individual sense that individuals should be treated as individuals, not merely as members of a group. It also enhances fairness in the sense of producing outcomes that are not arbitrary.
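
Here is a minimal sketch of a nearest-neighbors consistency check, in the spirit of the measure implemented in AIF360; the feature matrix, predictions, and choice of k are all hypothetical

## a minimal sketch of a k-nearest-neighbors consistency check on hypothetical data
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                          ## hypothetical feature matrix
y_pred = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

## how closely each individual's outcome matches those of its k nearest
## neighbors in feature space (1.0 = perfectly consistent)
k = 5
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)      ## +1 because each point is its own neighbor
_, idx = nbrs.kneighbors(X)
neighbor_preds = y_pred[idx[:, 1:]]                    ## drop the point itself
consistency = 1 - np.mean(np.abs(y_pred[:, None] - neighbor_preds))
print(consistency)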

Delta between accuracy and a discrimination measure

This is not formally a fairness metric but can be used both as a metric to optimize during training and as a metric to gauge the relative fairness of a model. [insert name] used this to train their fair representation model to great success. A strength of this measure is that it recognizes the tension that sometimes exists between fairness and accuracy, and to some extent it leaves a model free to find a solution that is strong not just along a fairness metric or an accuracy metric but along both. It recognizes that we want to maximize accuracy and minimize unfairness at the same time, and it effectively makes this a single metric. As discussed in earlier chapters, accuracy does have some claim on fairness considerations. After all, a model that is not accurate can easily become arbitrary and capricious because it is untethered to reality. Also, a model that is not accurate lacks a rationale for its very existence, even as it may be subjecting people to decisions that would be better made in a non-automated fashion, again raising fairness concerns.
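
As a minimal sketch, with the discrimination measure hypothetically taken to be the absolute gap in positive prediction rates between groups (other discrimination measures could be substituted), such a combined metric might look like this

## a minimal sketch of an accuracy-minus-discrimination delta on hypothetical toy data
import numpy as np

def accuracy_minus_discrimination(y_true, y_pred, group):
    accuracy = np.mean(y_true == y_pred)
    ## discrimination term: absolute gap in positive prediction rates between groups
    gap = abs(y_pred[group == "unpriv"].mean() - y_pred[group == "priv"].mean())
    return accuracy - gap

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 1, 0, 1, 1, 0, 0])
group  = np.array(["unpriv"] * 5 + ["priv"] * 5)
print(accuracy_minus_discrimination(y_true, y_pred, group))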

Differential Fairness

This is a very recently proposed fairness metric grounded in the differential privacy literature and recognizing the connection between the mathematics of privacy and the mathematics of discrimination. As with differential privacy, there is an epsilon term, and an epsilon-fair outcome is one in which, no matter how the groups are specified, the ratio of the positive outcome rate for the disfavored group compared to that of any other group will be within some narrow band around 1, where epsilon establishes the band. The advantage of this definition is that it can apply to widely specified or narrowly specified groups.
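
A minimal sketch, assuming we interpret the band around 1 as [exp(-epsilon), exp(epsilon)] in keeping with the differential privacy analogy; the group outcome rates below are hypothetical

## a minimal sketch of an epsilon-fairness check on hypothetical group outcome rates
import numpy as np
from itertools import combinations

rates = {"group_a": 0.42, "group_b": 0.39, "group_c": 0.55}   ## positive-outcome rates per group

## epsilon-fair if every pairwise ratio of rates stays within [exp(-epsilon), exp(epsilon)],
## i.e. the largest absolute log-ratio does not exceed epsilon
epsilon = 0.3
worst_gap = max(abs(np.log(rates[g1] / rates[g2]))
                for g1, g2 in combinations(rates, 2))
print(worst_gap, worst_gap <= epsilon)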

Stages of fairness interventions

Broadly speaking, researchers in fairness-aware machine learning divide fairness interventions into three stages: pre-processing, in-processing, and post-processing. These are briefly described below.

Pre-processing

These methodologies refer to attempts to adjust data before training of a model has begun. Such attempts can relate to any aspects of the data or its labels that can be adjusted independently of and prior to training a model. For example, one very simple way of pre-processing for fairness that we will discuss in the next chapter is reweighting the data, as sketched below. This addresses unfairness by adjusting upwards the weight of data points judged to represent cases where an otherwise biased system made the right decision; likewise, data points judged to represent instances of potential bias are given lower weightings. In addition to adjusting the relative weights of data points, pre-processing can also involve different representations of the input data, such as reducing the dimensionality of the data in a way that removes information about a protected category, or finding a transformation of the data, including its labels, that preserves as much of the original information as possible while reducing unfairness. Pre-processing methods may be particularly attractive when contemplating the release of a dataset in situations where the releaser wants to take proactive steps to ensure fair outcomes from models trained on it. Of course, in such cases, it should be clearly indicated that the dataset has been through some processing.
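
Here is a minimal sketch of one common reweighting scheme, in which each (group, outcome) combination is weighted by its expected frequency under independence divided by its observed frequency; the data frame is hypothetical, and packaged implementations (such as AIF360’s Reweighing) handle many more details

## a minimal sketch of reweighting a hypothetical data frame for fairness
import pandas as pd

df = pd.DataFrame({
    "group":   ["priv", "priv", "priv", "unpriv", "unpriv", "unpriv", "priv", "unpriv"],
    "outcome": [1, 1, 0, 0, 0, 1, 1, 0],
})

n = len(df)
p_group   = df["group"].value_counts() / n
p_outcome = df["outcome"].value_counts() / n
p_joint   = df.groupby(["group", "outcome"]).size() / n

## weight = expected probability under independence / observed probability,
## so under-represented (group, outcome) combinations are weighted up
df["weight"] = [
    p_group[g] * p_outcome[o] / p_joint[(g, o)]
    for g, o in zip(df["group"], df["outcome"])
]
print(df)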

In-processing

These methodologies refer to attempts to ensure fairer outcomes in machine learning by training models in a fairness-aware manner. In such cases, the original data is not adjusted; instead, the way in which a model is optimized reflects fairness concerns rather than merely accuracy concerns. For normative reasons, organizations may prefer in-processing to pre-processing even though many in-processing techniques have much in common with pre-processing techniques. For example, it may give the wrong impression from a publicity standpoint to hear that data has been “massaged”, as one might characterize pre-processing, whereas indicating that a model was trained in a fairness-aware way might be more acceptable. An in-processing methodology could also be quite attractive when someone wants to maintain maximum control over potential downstream use cases, because the fairness is already baked into the model and requires no interventions from downstream users, who may not be aware that a model requires additional fitting or modifications.
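
As a minimal sketch of the idea rather than any particular package’s implementation, here is a hypothetical logistic model trained on toy data with a statistical parity penalty added to the usual log loss; the lam parameter controls the accuracy/fairness tradeoff

## a minimal sketch of fairness-aware in-processing on hypothetical toy data
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def fair_loss(w, X, y, group, lam):
    p = sigmoid(X @ w)
    ## accuracy concern: standard log loss
    log_loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    ## fairness concern: gap in mean predicted score between the two groups
    parity_gap = abs(p[group == 1].mean() - p[group == 0].mean())
    return log_loss + lam * parity_gap

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                     ## hypothetical features
group = rng.integers(0, 2, size=500)              ## hypothetical binary protected attribute
y = ((X[:, 0] + 0.5 * group + rng.normal(scale=0.5, size=500)) > 0).astype(int)

result = minimize(fair_loss, np.zeros(X.shape[1]), args=(X, y, group, 1.0))
print(result.x)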

Post-processing

These methodologies refer to attempts to ensure fairer outcomes using models that have not had data massaged prior to training and that have not been developed through fairness-aware training. They generally rely on choosing a fairness metric, usually one that is group-oriented rather than individual-oriented, and equalizing it between groups. Even a black-box model can be retrofitted for fairness in post-processing by choosing different thresholds and scoring criteria, or different probabilities of label swapping or categorization, for different groups. Such methods often work by identifying individuals near classification thresholds and changing or randomizing their labels within some confidence band around the threshold. They can be useful in making a black-box model fairer when it has not been trained in a way that emphasizes fairness. These options are the trickiest for a number of reasons, and they are unlikely to pass legal muster given current US law on discrimination, in that post-processing can amount to disparate treatment, a form of behavior clearly prohibited under US law in all circumstances.
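
A minimal sketch of one simple post-processing move, group-specific thresholds chosen so that positive rates roughly match, is shown below; the scores and groups are hypothetical, and real implementations are considerably more careful about validation data and error rates

## a minimal sketch of post-processing via group-specific thresholds on hypothetical scores
import numpy as np

rng = np.random.default_rng(1)
scores = np.concatenate([rng.beta(2, 2, 500), rng.beta(2, 3, 500)])   ## priv then unpriv
group  = np.array(["priv"] * 500 + ["unpriv"] * 500)

## fix the favored group's threshold, then sweep the disfavored group's
## threshold until the positive rates approximately match
priv_threshold = 0.5
target_rate = (scores[group == "priv"] >= priv_threshold).mean()

candidate_thresholds = np.linspace(0, 1, 101)
unpriv_rates = np.array([(scores[group == "unpriv"] >= t).mean() for t in candidate_thresholds])
unpriv_threshold = candidate_thresholds[np.argmin(np.abs(unpriv_rates - target_rate))]

print(priv_threshold, unpriv_threshold)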

Comprehensive data-acquisition checklist

  • Data integrity issues

    • Do the labels on the data mean what they say or are they potentially misleading?

    • Was the data collection method sufficiently documented?

    • Have potential issues of non-random data collection been sufficiently highlighted in data documentation and naming?

  • Consent issues

    • Was informed consent appropriate to the sensitivity of the data obtained for a dataset?

    • Was that consent solicited in a reasonable way where opting out was a possibility?

    • Does the current use of the data reflect the originally obtained consent?

  • Reasonable expectations

    • What were the reasonable expectations of the data subject at the time of data collection?

    • What were the reasonable expectations of the data collector at the time of data collection?

  • Appropriateness of data for modeling goal

    • Is the question being asked one that can be answered with the available data?

    • If there is bias in a data set, is it possible to collect more data to lessen the bias and answer a critical question?