9

Transparency

Too much light makes the baby go blind.

—Facetious title of a long-running comedy improv show

AI offers incredible possibilities for our country, but it also presents peril. Transparency into how AI models are trained and what data is used to train them is critical for consumers and policy makers.

—US House Representative Anna Eshoo

Transparency isn’t just an important ideal. It is essential to successful AI accountability.

—Marietje Schaake, 2023

“Transparency”—being clear about what you’ve done and what the impact is. It sounds wonky, but matters enormously. Companies like Microsoft often give lip service to “transparency,” but provide precious little actual transparency into how their systems work, how they are trained, or how they are tested internally, let alone what trouble they may have caused.

We need to know what goes into systems, so we can understand their biases (political and social), their reliance on purloined works, and how to mitigate their many risks. We need to know how they are tested, so we can know whether they are safe.

Companies don’t really want to share.

Which doesn’t mean they don’t pretend otherwise. For example, in May 2023, Microsoft’s president Brad Smith announced a new “5 point plan for governing AI,” allegedly “promoting transparency”; Microsoft’s CEO immediately amplified his remarks, saying, “We are taking a comprehensive approach to ensure we always build, deploy, and use AI in a safe, secure, and transparent way.”1

But as I write this, you can’t find out what Microsoft’s major systems were trained on. You can’t find out how much they relied on copyrighted materials. You can’t find out what kind of biases might follow from their choice of materials. And you can’t find out enough about what they were trained on to do good science (e.g., in order to figure out how well the models are reasoning versus whether they simply regurgitate what they are trained on). You also can’t find out whether they have caused harm in the real world. Have large language models been used, for example, to make job decisions, and done so in a biased way? We just don’t know. In an interview with Joanna Stern of The Wall Street Journal, OpenAI’s CTO Mira Murati wouldn’t even give the most basic answers about what data had been used in training their system Sora, claiming, improbably, to have no idea.2

Not long ago, in a briefing on AI that I gave at the UN, I highlighted this gap between words and action.3 Since then, a team with members from Stanford University, MIT, and Princeton, led by computer scientists Rishi Bommasani and Percy Liang, created a careful and thorough index of transparency, looking at ten companies across 100 factors, ranging from the nature of the data that was used to the origins of the labor involved to what had been done to mitigate risks.4

Every single AI company received a failing grade. Meta had the highest score (57 percent), but even it failed on factors such as transparency about its data, labor, usage policy, and feedback mechanisms.5

Not a single company was truly transparent about what data they used, not even Microsoft (despite their lip service to transparency) or OpenAI, despite their name.

The report’s conclusions were scathing:

The status quo is characterized by a widespread lack of transparency across developers. . . . Transparency is a broadly-necessary condition for other more substantive societal progress, and without improvement opaque foundation models are likely to contribute to harm. Foundation models are being developed, deployed, and adopted at a frenetic pace: for this technology to advance the public interest, real change must be made to rectify the fundamental lack of transparency in the ecosystem.6


Worse, as the Stanford/Princeton/MIT team put it, “While the societal impact of these models is rising, transparency is on the decline.”

While I was sketching this chapter, a nonprofit reassuringly called the Data & Trust Alliance—sponsored by 20-plus big tech companies—managed to get coverage in a New York Times article titled “Big Companies Find a Way to Identify A.I. Data They Can Trust.”7

When I checked out the alliance’s webpage, it had all the right buzzwords (like “[data] provenance” and “privacy and protection”), but the details were, at best, geared toward protecting companies, not consumers.8 With something like GPT-4, it would tell you almost nothing you actually wanted to know, for example, about copyrighted sources, likely sources of bias, or other issues. It would be like saying for a Boeing 787: “source of parts: various, US and abroad; engineering: Boeing and multiple subcontractors.” True, but so vague as to be almost useless. To actually be protected, we would need much more detail.
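To make the contrast concrete, here is a deliberately hypothetical sketch, written in Python, of the gap between that kind of vague, company-protecting “provenance” and the level of detail consumers and researchers would actually need. Every field name and value below is invented for illustration; none of it is drawn from the Alliance’s standard or from any company’s actual disclosure.

```python
# Hypothetical illustration only: neither record reflects any real company's disclosure.

# Roughly the level of detail the buzzwords would yield:
vague_provenance = {
    "data_sources": "various, licensed and public",
    "privacy": "reviewed internally",
}

# The level of detail that would let outsiders assess bias, copyright exposure,
# and safety (all field names invented for this sketch):
useful_provenance = {
    "corpora": [
        {"name": "<web crawl snapshot>", "size_tokens": None, "license": "mixed/unknown",
         "includes_copyrighted_material": True, "opt_outs_honored": False},
        {"name": "<licensed news archive>", "size_tokens": None, "license": "commercial",
         "includes_copyrighted_material": True, "opt_outs_honored": True},
    ],
    "labor": {"annotation_workforce": "<contractor, country, wage range>"},
    "bias_audits": ["<which social and political skews were measured, and the results>"],
    "safety_testing": ["<red-team protocols and failure rates>"],
    "known_downstream_uses": ["<e.g., hiring or lending tools, if any>"],
}
```

The point is not this particular format; it is that without fields like these, “provenance” tells the public nothing it can act on.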

What should we, as citizens, demand?

Writing good transparency bills takes hard work. As Archon Fung and his coauthors put it in Full Disclosure: “to be successful, transparency policies must be accurate, keep ahead of disclosers’ efforts to find loopholes, and, above all, focus on the needs of ordinary citizens”—and it is work that absolutely must be done.21

The good news is there is some movement here. In December 2023, Representatives Anna Eshoo (D-CA) and Don Beyer (D-VA) introduced an important bill on transparency; in February 2024, Senators Ed Markey (D-MA) and Martin Heinrich (D-NM), working together with Representatives Eshoo and Beyer, introduced a bill for environmental transparency.22 I hope these bills make their way into law.