You Can't Un-Train a Model: Protecting Your Work in the Age of AI

June 16, 2026

A large AI model can read the work of a million people in the time it takes you to finish this sentence. It usually does not ask, it does not always credit, and once your work is part of what it learned, there is no button that takes it back out. Most of the time you cannot even tell it happened.

That is the uncomfortable shape of the problem, and pretending otherwise helps no one. So this is not a post that promises to make your work AI-proof. Nothing does. It is about the one simple, durable thing you can still do, why it is worth doing anyway, and where the rules are heading.

The past: how the machines learned to read

The first generation of large language and image models did not learn from nothing. They learned from us. Their training data was the public internet at scale: enormous web crawls, Wikipedia, digitized books, code repositories, forums and vast libraries of images. This is not a guess. The model builders' own research papers list these sources, and when reporters and researchers pried open one of the most widely used training sets, Wikipedia turned out to be its second-largest single source, behind only a database of patents. If it was reachable and readable, it was often treated as fair game, and much of it was gathered without asking the people who made it.

That produced a real leap in what software can do. It also swept up an enormous amount of human craft (novels, photographs, illustrations, songs, blog posts written for a few hundred readers) and folded it, uncredited, into systems now worth billions. Whether that counts as fair use or as infringement is an open and actively debated question, one that regulators are examining and that is being tested, case by case, in the courts. We are not going to try to settle it here. The practical point for a creator is simpler: the scraping already happened, at a scale and speed no individual can match or reverse.

Training is only the first door

There are actually two ways your work reaches an AI, and only one of them is the training we just described. That first door, training, looks backward and, as we have seen, is effectively permanent.

The second door is wide open right now. Most leading assistants (ChatGPT, Claude and others) can search the live web while they answer. Ask one a question and it may pull up your article, your Reddit comment or your LinkedIn post in that very moment and weave the idea, or the wording, into its reply. Attribution is not always clear: when the output is a fresh, synthesized article, it is hard to tell how much of it rests on your words, and the model will not necessarily show its sources unless you explicitly ask it to cite everything and avoid plagiarism. Your work does not need to be in any training set to end up shaping the output. It just has to be findable when someone asks.

The two doors differ in ways that matter. Training is a past snapshot you cannot easily change; live retrieval is continuous, and it reaches even something you published an hour ago. Training happens once; retrieval can resurface your work, uncredited, to a different person every day. What they share is the part that should give you pause: in both cases your authorship can be quietly folded into someone else's answer, and in neither case is anyone asking you first.

The present: the part that is genuinely hard

Here is the part most articles skip, and it is worth being honest about. It can feel like there is nothing you can do, and on one narrow point that feeling is understandable: proving that a particular model trained on your particular work, or pulled it into a live answer, is, today, very hard. Training sets are rarely published in full. Outputs are blends, not copies, so a system can be soaked in your style without ever reproducing your file. From the outside the model is a black box that occasionally says something suspiciously familiar.

We will be blunt about what this means for the tool we make: a timestamp does not prove that an AI scraped you, and it will not stop one from doing so. Anyone who tells you otherwise is selling something. What a timestamp gives you is narrower, and more useful than it first sounds.

What you can actually do: keep dated proof of what's yours

Strip the problem down and a single question survives every dispute about creative work: can you show that this exact thing existed, in your hands, on this date? Not a vague "I made it years ago", but a record anyone can check.

That is what timestamping is for. You compute a fingerprint of your file (a short SHA-256 hash that changes completely if a single byte changes) and anchor that fingerprint to a public record with a date nobody can backdate. The work itself stays private. The fingerprint is meaningless to anyone who does not already hold the file. But on the day precedence matters (a stolen design, a "we came up with it independently", a licensing claim, a future scheme that credits or pays creators), you hold verifiable evidence of what existed and when, instead of a screenshot and a story.
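For the curious, the fingerprint step is small enough to sketch in a few lines of Python. This is a minimal illustration using the standard library, not the code EMOZ runs; the function name fingerprint and the sample file names are our own, and the sketch covers only the hashing, not the anchoring to a public record.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 fingerprint of a file as a 64-character hex string.

    The file is read in chunks so that large files (video, archives)
    do not have to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Illustrative only: two throwaway files that differ by a single byte.
    with tempfile.TemporaryDirectory() as folder:
        original = Path(folder) / "essay.txt"
        altered = Path(folder) / "essay-altered.txt"
        original.write_bytes(b"My essay, final draft.")
        altered.write_bytes(b"My essay, final draft!")

        print(fingerprint(original))
        print(fingerprint(altered))
        # The two fingerprints share no useful resemblance,
        # even though only the last byte differs.
        print(fingerprint(original) == fingerprint(altered))
```

Checking a file later is the same operation in reverse: recompute the fingerprint and compare it with the one that was anchored. If they match, the file is byte-for-byte the one that existed on that date.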

The reason to do this is almost embarrassingly practical: it is inexpensive, it is immediate, and the record is durable. Timestamping a file takes seconds and costs about as much as a pizza. It is insurance you buy once and hope never to need. With EMOZ the fingerprint is computed in your browser, so the file never leaves your device, and the proof can be verified by anyone, independently, even if EMOZ disappeared tomorrow. When something is this affordable and the downside of skipping it is "I had no proof", the math is not complicated.

Declare your terms, too

Proving authorship is one half. Stating how your work may be used is the other, and there is a quiet movement to make that machine-readable.

If you publish on your own website, you can tell well-behaved crawlers to stay out with a robots.txt file, and in Europe you can formally reserve your work against text and data mining, an opt-out the EU AI Act now requires providers of general-purpose AI models to identify and respect. A newer proposal, llms.txt, goes a step further: a simple Markdown file, placed at the root of your site and addressed to AI systems directly, in which you can also state your terms. If your site runs on WordPress, plugins such as Website LLMs.txt will generate it for you, and as of this writing that one plugin alone has more than 40,000 active installations, a sign that plenty of website owners are already looking for ways to state their terms to AI.
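To make the robots.txt part concrete, here is a small sketch that holds a sample robots.txt and checks how a rule-following crawler would read it, using Python's standard urllib.robotparser module. The crawler names are examples of user-agent tokens that AI companies have published (GPTBot, ClaudeBot, CCBot); the list is an assumption on our part, it changes over time, and you should check each company's own documentation. The helper name is_allowed and the example.com address are ours.

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt: keep a few AI crawlers out, let everyone else in.
# The user-agent tokens below are examples and may be out of date.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if a crawler that obeys robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/essays/my-essay"
    for agent in ("GPTBot", "ClaudeBot", "CCBot", "SomeOtherBrowser"):
        print(agent, is_allowed(ROBOTS_TXT, agent, page))
```

The check only tells you what a crawler that honors the file would do. A crawler that ignores robots.txt is not stopped by it.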

Be clear-eyed about this. Today it is closer to a polite sign on the door than a lock. No major AI company has committed to honoring llms.txt, and a sign only works on those who choose to read it. So why mention it at all? Because it is one more thing you can put on the record. The timestamp is what carries the date and the tamper-evidence; pair it with a note of your terms and you can later show not only that the work existed on a given day, but that on that day you had already said you did not want it used to train AI. If and when these signals are given teeth, the people who declared their terms early and can prove when they did will be holding a record instead of an argument.
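One way to pair the two, sketched under our own assumptions: put the fingerprint of the work and the text of your terms into a single small record, then fingerprint that record and timestamp it. The field names work_sha256 and terms are hypothetical, and this is not a format that EMOZ or any standard prescribes.

```python
import hashlib
import json


def sha256_hex(data):
    """SHA-256 of a bytes object, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def build_record(work, terms):
    """Combine a work's fingerprint and a terms statement into one JSON record.

    Keys are sorted and spacing is fixed so that the same inputs always
    produce exactly the same bytes, and therefore the same fingerprint.
    """
    record = {
        "work_sha256": sha256_hex(work),  # hypothetical field name
        "terms": terms,                   # hypothetical field name
    }
    return json.dumps(record, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)


if __name__ == "__main__":
    work = b"The bytes of the finished work go here."
    terms = "All rights reserved. Not to be used for AI training."

    record = build_record(work, terms)
    # This is the value you would timestamp: it commits to both
    # the work and the terms you had declared on that date.
    print(sha256_hex(record.encode("utf-8")))
```

Because the record's fingerprint depends on both parts, changing either the work or the wording of the terms afterwards produces a different fingerprint, which is what makes the pairing tamper-evident. Keep the record itself alongside your original file; the fingerprint alone cannot be turned back into the text.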

The future: the EU AI Act, and the rules being written now

The reason any of this is worth the effort is that the ground is moving. Regulators have decided that opacity about training data is no longer acceptable, and Europe has gone first.

The EU AI Act entered into force on August 1, 2024, and its rules switch on in phases. The part that matters most to creators, the obligations on general-purpose AI models in Article 53, has applied since August 2, 2025; most of the rest of the Act follows on August 2, 2026 (as of this writing, in mid-2026, that date is only weeks away). Under Article 53, the providers of those models must publish a "sufficiently detailed summary" of the content they trained on, following a template the European Commission issued in July 2025, and must operate a policy that respects EU copyright law, including honoring the machine-readable opt-out that lets creators reserve their work from text and data mining. Models that were already on the market have until August 2, 2027 to publish their summary.

Read that back as a creator. For the first time, the companies behind these models must state, publicly and on the record, what they trained on, and must respect a signal that means "do not use this work without asking". That transparency is the hinge: a disclosure rule and an opt-out only become leverage for you if, when the moment comes, you can show what you made, when you made it, and that you had reserved it. The United States is moving more slowly, but the US Copyright Office has published report after report on exactly this question, and licensing deals between AI companies and large rights holders are appearing where there used to be only silence.

None of this is settled, and we are not going to pretend it protects you today. But the direction of travel is unmistakable: toward more transparency, real opt-outs and the possibility of credit or payment for the work that trains these systems. When that machinery arrives it will run on evidence, and the creators positioned to benefit are the ones who can answer "what did you make, and when?" without flinching.

The honest bottom line

So, plainly, what a timestamp does and does not do.

It proves that a specific work existed in a specific form on a specific date, verifiably, by anyone, for as long as the public record stands. That is the foundation for any claim of precedence. It does not prove an AI used your work. It does not stop scraping. And it does not grant or replace copyright: copyright arises automatically the moment you create your work and fix it in some form, and where the stakes justify it you should still register it formally, because registration carries benefits in some places that no timestamp can. A timestamp is the complement: an instant, inexpensive record you can create the moment a work exists. Formal registration is a far heavier undertaking, and it works differently in every jurisdiction (the EU, the US and the rest each run their own systems): it can take months, and once trademarks, multiple countries or a lawyer are involved, it easily runs into thousands of euros or dollars.

The age of AI did not invent the creator's oldest problem, which is proving that your work was yours, first. It just made it urgent and a great deal harder to ignore. You cannot un-train the models that already read the internet. You can, starting with your very next file, make sure that everything you create from here forward carries a proof you can stand behind.

Timestamp it when you finish it. Declare your terms. Keep your originals. It costs almost nothing, and it is the part of this whole mess that is actually in your hands.