7

Data Rights

On the one hand, we want to encourage remixes, collage, sampling, reimagining. We want people to look at what exists, in new ways, and offer new insights into what they mean and how. On the other hand, those who want better data rights want people to be comfortable putting effort and care into writing, art making, and online communication. That’s unlikely to happen if they don’t feel like they own and control the data they share, regardless of the platforms they use.

—Eryk Salvaggio, 2023

Don’t let them steal your content; don’t let them destroy the arts. And don’t let them steal your soul. The Great Data Heist has to stop, and you need to get a cut. The first word in data should be consent.

One way to think about this is legalistically, in terms of current laws. As I write this, there are multiple lawsuits pending, filed by writers, artists, computer programmers, and others, arguing that Generative AI has infringed on their intellectual property.1 My guess is that the plaintiffs will win (or settle) some of these cases, but the law is not always entirely clear, and neither judges nor juries are always predictable.

Regardless of where the law has stood historically, a second question is moral: Should companies have the right to train on copyrighted materials, or the right to use your personal data? A third question concerns future laws. In other words, what kind of society should we want, and what changes in our laws might be necessary to get there?


For years, most people looked on quietly as companies like Google and Meta routinely invaded our privacy. Nowadays, too few people are speaking up as companies like OpenAI train their models on massive amounts of content, much of it copyrighted, with essentially no compensation to the artists and writers who created it.

Venture capitalist Vinod Khosla (perhaps not coincidentally a major investor in OpenAI) even went so far as to suggest that there shouldn’t be any rules to protect content creators from having their works absorbed wholesale by Generative AI:

To restrict AI from training on copyrighted material would have no precedent in how other forms of intelligence that came before AI, train.

There are no authors of copyrighted material that did not learn from copyrighted works, be it in manuscripts, art or music. You can’t separate Gauguin’s influence from Matisse, Velazquez’s from Dali, Picasso’s from Pollock, Beyonce’s from Taylor Swift, nor Charles Taylor’s from Yuval Noah Harari. The list goes on.2

The point about precedents should cut no ice here. Before the printing press came into being, there were no copyright laws at all.3

Copyright laws were developed to protect intellectual property, but they were instituted long before large language models existed. The relevant precedent here, which Khosla ignored, is inventing new laws in light of new technology. The point of copyright laws, back in the fifteenth and sixteenth centuries (and continuing today), was to protect writers from having their work plagiarized by printers. The point now should be to update those laws, to prevent people from being ripped off in a new way, namely by the chronic (near) regurgitators known as large language models.

Just because things can (perhaps) be done presently doesn’t mean we should allow them going forward. To take one example, we have, with good reason, developed laws against usury, against suckering people into paying outlandishly high interest rates. Society let loan sharking slide for a while, and eventually realized that was a bad idea.4 We can pass laws to prohibit things that are unjust, even ones that were not anticipated historically. To make sure that creators’ intellectual property is properly protected, we should, where necessary, pass new laws that are clear and unambiguous.


Within the tech community, the AI researcher/composer Ed Newton-Rex stands apart, as one of the first to speak out. He resigned from leading the audio team at Stability AI (a major Generative AI startup), saying, “I can only support Generative AI that doesn’t exploit creators by training models—which may replace them—on their work without permission.”5

We should stand with him. We should not use Generative AI that exploits creators, period. We should stand with the musician and polymath Jaron Lanier, and demand what he has called data dignity:

In a world with data dignity, digital stuff would typically be connected with the humans who want to be known for having made it. In some versions of the idea, people could get paid for what they create, even when it is filtered and recombined through big models, and tech hubs would earn fees for facilitating things that people want to do.6

Is that really too much to ask? Lanier has long called for micropayments; if a company uses your data, you should get a small cut. That’s not going to be enough to make a living from, but it’s not unreasonable for society to demand some measure of profit sharing.


Software engineer Pete Dietert put all this even more strongly and more generally, in a scathing post on LinkedIn cautioning about what he calls “digital replicants” of living people:

Regardless of economic infringements, if someone takes my digital text, and digital images of me, and digital recordings of my voice, and models a virtual me without my consent, that is a direct violation of my “moral rights” to my own identity—regardless if someone makes money out of “digital me” or not. This is the current “replicant” threat we are talking about. . . . Currently, parts of my digital physical/mental identity are already being sold by data brokers. So my autonomy and moral rights to my own works, my own digital behaviours, and my own digital self, are already frequently being violated. In this sense, “#AGI” means creating an ever higher fidelity digital replicant of me, that is effectively owned by someone else, and is poised to directly compete against me, simply because I “chose” to exist, or have some digital works of “me” made available on the Internet. No. I did not and do not consent to a digital replicant being made of me. That should be “end of story.” But the Tech Bros. “answer” is “too bad, so sad, you ‘posted’ so your digital identity now belongs to us.”

As I wrote recently on X:

I . . . can imagine a world in which AI would create genuinely original works of art, using a form of deeper and more original AI than we currently know how to build.

But the world I actually foresee in the near-term is one in which big tech companies use stochastic mimicry and power to constantly encroach on the rights and economics of journalists, artists, musicians, and more, driving most out of business, and leaving the world adrift in a sea of mediocrity.

I hope Congress won’t let that happen.

I hope that Congress will insist that there can be no use of copyrighted work for training without two things: consent and compensation.

Creators who do not wish to consent should not be coerced into having their work used in the training of AI systems that regularly produce near-plagiaristic outputs.7

For the sake of artists, writers, musicians, and other creators, and for the sake of all of us who appreciate their works, I hope that all of this happens, soon.