Why training AI can't be IP theft

44 min read (1+ hr w/ quotes)
Tagged: #AI #publication #IP #enforcement #prosthesis #rhetoric #plagiarism

Posted Thu Apr 03, 2025 in cyber

AI is a huge subject, so it’s hard to boil my thoughts down into any single digestible take. That’s probably a good thing. As a rule, if you can fit your understanding of something complex into a tweet, you’re usually wrong. So I’m continuing to divide and conquer here, eat the elephant one bite at a time, etc.

Right now I want to address one specific question: whether people have the right to train AI in the first place. The argument that they do not¹ goes like this:

When a corporation trains generative AI they have unfairly used other people’s work without consent or compensation to create a new product they own. Worse, the new product directly competes with the original workers. Since the corporations didn’t own the original material and weren’t granted any specific rights to use it for training, they did not have the right to train with it. When the work was published, there was no expectation it would be used like this, as the technology didn’t exist and people did not even consider “training” as a possibility. Ultimately, the material is copyrighted, and this action violates the authors’ copyright.

I have spent a lot of time thinking about this argument and its implications. Unfortunately, even though I think it identifies a legitimate complaint, the argument is dangerously wrong, and the consequences of acting on it (especially enforcing a new IP right) would be disastrous. Let me work through why:

The complaint is real

Artists wanting to use copyright to limit the “right to train” isn’t the right approach, but not because their complaint isn’t valid. Sometimes a course of action is bad because the goal is bad, but in this case I think people making this complaint are trying to address a real problem.

I agree that the dynamic of corporations making for-profit tools using previously published material to directly compete with the original authors, especially when that work was published freely, is “bad.” This is also a real thing companies want to do. Replacing labor that has to be paid wages with capital that can be owned outright increases profits, which is every company’s purpose. And there’s certainly a push right now to do this. For owners and executives, production without workers has always been the dream. But even though it’s economically incentivized for corporations, the wholesale replacement of human work in creative industries would be disastrous for art, artists, and society as a whole.

So there’s a fine line to walk here, because I don’t want to dismiss the fear. The problem is real and the emotions are valid, but that doesn’t mean none of the reactions are reactionary and dangerous. And the idea that corporations training on material is copyright infringement is just that.

The learning rights approach

So let me focus in on the idea that one needs to license a “right to train”, especially for training that uses copyrighted work. Although I’m ultimately going to argue against it, I think this is a reasonable first thought. It’s also a very serious proposal that’s actively being argued for in significant forums.

Copyright isn’t a stupid first thought. Copyright (or creative rights in general) intuitively seems like the relevant mechanism for protecting work from unauthorized uses and plagiarism, since the AI models are trained using copyrighted work that is licensed for public viewing but not for commercial use. Fundamentally, the thing copyright is “for” is making sure artists are paid for their work.

This was one of my first thoughts too. Looking at the inputs and outputs, as well as the overall dynamic of unfair exploitation of creative work, “copyright violation” is a good place to start. I even have a draft article where I was going to argue for this same point myself. But as I’ve thought through the problem further, that logic breaks down. And the more I work through it, the clearer it becomes that every IP-based argument I’ve seen to try to support artists has massively harmful implications that make the cure worse than the disease.

Definition, proposals, assertions

The idea of a learning right is this: in addition to the traditional reproduction right copyright reserves to the author, authors should be able to prevent people from training AI on their work by withholding the right.

This learning right would be parallel to other reservable rights, like reproduction: it could be denied outright, or licensed separately from both viewing and reproduction rights at the discretion of the rightsholder. Material could be published such that people were freely able to view it but not able to use it as part of a process that would eventually create new work, including training AI. The mechanical ability to train on data is not severable from the ability to view it, but the legal right would be.

This is already being widely discussed in various forms, usually as a theory of legal interpretation or a proposal for new policy.

Asserting this right already exists

Typically, when the learning rights theory is seen in the wild it’s being pushed by copyright rightsholders who are asserting that the right to restrict others from training on their works already exists.

A prime example of this is the book publishing company Penguin Random House, which asserts that the right to train an AI from a work is already a right that they can reserve:

Penguin Random House Copyright Statement (Oct 2024):
No part of this book may be used or reproduced in any manner for the purpose of training artificial intelligence technologies or systems. In accordance with Article 4(3) of the Digital Single Market Directive 2019/790, Penguin Random House expressly reserves this work from the text and data mining exception.

In the same story, the Society of Authors explicitly affirms the idea that AI training cannot be done without a license, especially if that right is explicitly claimed:

So you want to write an AI art license

6 min read
Tagged: #ai #ip #technical #enforcement #plagiarism #publication

Posted Sat Apr 08, 2023 in cyber

Hi, The EFF, Creative Commons, Wikimedia, World Leaders, and whoever else,

Do you want to write a license for machine vision models and AI-generated images, but you’re tired of listening to lawyers, legal scholars, intellectual property experts, media rightsholders, or even just people who use any of the tools in question even occasionally?

You need a real expert: me, a guy whose entire set of relevant qualifications is that he owns a domain name. Don’t worry, here’s how you do it:

Given our current system of how AI models are trained and how people can use them to generate new art, which is this:

sequenceDiagram
    Alice->>Model: Hello. Here are N images and<br>text descriptions of what they contain.
    Model->>Model: Training (looks at images, "makes notes", discards originals)
    Model->>Alice: OK. I can try to make similar images from my notes,<br>if you tell me what you want.
    Curio->>Model: Hello. I would like a depiction of this new <br>thing you've never seen before.
    Model->>Curio: OK. Here are some possibilities.

The Joy of RSS

4 min read
Tagged: #platforms #media-consumption #gush #big-tech #plagiarism

Posted Sun Oct 17, 2021 in tech

During the years when Homestuck updated regularly, I usually had some sort of update notifier that pinged me when a new page was posted. But since Homestuck usually updated daily, I ended up just keeping a tab open and refreshing it. And that’s pretty much how I kept up with other serial media on the internet, for years. A writing blog that posts regular updates? Keep a dedicated tab open and refresh it occasionally. Comic? Tab. To this day, I have a “serial” browser window that’s just tabs of sites I check regularly. (Or imagine I might want to check regularly, at least.)

please don’t tell anyone how I live

Of course, this is terrible. The biggest problem is browser tabs are expensive. If you have a tab open, that takes up a dedicated chunk of memory, even when you’re not reading anything. CPU too, probably, if the site has JavaScript running on it (which is to say, is either decades out of date, or this one). Not to mention the clutter.

Unfortunately, dedicated browser tabs fit the specific use case of keeping up with serial media well. Social media feeds (all of them: Twitter, Facebook, Tumblr, Reddit, YouTube) are explicitly “media aggregators”, services that combine multiple media sources into one feed. This is no good for serial media. If you’re following multiple sources, they likely update on different schedules, and updates from the more active ones will bury updates from the slower ones. Even email updates have this problem. No, you need a dedicated space for each source (but not each update), which a dedicated browser tab will get you.

There is a good system for this, though: RSS.

RSS (Really Simple Syndication) is a fantastic technology that has fallen out of favour in the mainstream lately. It works like this: the media source puts up a small file somewhere that notes the dates, titles, and (optionally) content of posts. And that’s it. There’s no API, it’s just a file people can read if they want. It’s like traditional syndication, but instead of selling articles to multiple distributors (as with syndicated cartoons), you’re distributing articles to many consumers directly.
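
To give a rough sense of how little is involved on the reading side, here is a minimal sketch in Python using only the standard library. Everything specific in it is made up for illustration: the sample feed, the example.com URLs, and the `read_feed` helper are assumptions, not anything a particular site or reader actually uses. It also assumes a plain RSS 2.0 file; real feeds vary (Atom, for instance, uses different element names).

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# A hypothetical RSS 2.0 file, standing in for the "small file" a site publishes.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Serial</title>
    <link>https://example.com/</link>
    <description>A made-up feed for illustration.</description>
    <item>
      <title>Chapter 1</title>
      <link>https://example.com/chapter-1</link>
      <pubDate>Sun, 10 Oct 2021 12:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Chapter 2</title>
      <link>https://example.com/chapter-2</link>
      <pubDate>Sun, 17 Oct 2021 12:00:00 GMT</pubDate>
      <description>Optional post content goes here.</description>
    </item>
  </channel>
</rss>
"""


def read_feed(xml_text):
    """Return the feed's posts as dicts, newest first (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            # RSS 2.0 dates use the RFC 822 style that email headers use.
            "published": parsedate_to_datetime(item.findtext("pubDate")),
            # Content is optional, so this is None when the tag is absent.
            "content": item.findtext("description"),
        })
    posts.sort(key=lambda post: post["published"], reverse=True)
    return posts


if __name__ == "__main__":
    for post in read_feed(SAMPLE_FEED):
        print(post["published"].date(), post["title"], post["link"])
```

In practice a reader would fetch the file over HTTP on a schedule and remember which links it has already shown you, but the parsing step is about this simple.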

GioCities

blogs by Gio

Tagged: plagiarism