this post was submitted on 12 Apr 2026

658 points (95.3% liked)

Linux lays down the law on AI-generated code, says yes to Copilot, no to AI slop, and humans take the fall for mistakes — after months of fierce debate, Torvalds and maintainers come to an agreement (www.tomshardware.com)

submitted 1 week ago by throws_lemy@lemmy.nz to c/technology@lemmy.world

[–] ell1e@leminal.space 26 points 1 week ago* (last edited 1 week ago) (2 children)

Ultimately, the policy legally anchors every single line of AI-generated code

How would that even be possible? Given the state of things:

https://dl.acm.org/doi/10.1145/3543507.3583199

Our results suggest that [...] three types of plagiarism widely exist in LMs beyond memorization, [...] Given that a majority of LMs’ training data is scraped from the Web without informing content owners, their reiteration of words, phrases, and even core ideas from training sets into generated texts has ethical implications. Their patterns are likely to exacerbate as both the size of LMs and their training data increase, [...] Plagiarized content can also contain individuals’ personal and sensitive information.

https://www.theatlantic.com/technology/2026/01/ai-memorization-research/685552/

Four popular large language models—OpenAI’s GPT, Anthropic’s Claude, Google’s Gemini, and xAI’s Grok—have stored large portions of some of the books they’ve been trained on, and can reproduce long excerpts from those books. [...] This phenomenon has been called “memorization,” and AI companies have long denied that it happens on a large scale. [...] The Stanford study proves that there are such copies in AI models, and it is just the latest of several studies to do so.

https://www.twobirds.com/en/insights/2025/landmark-ruling-of-the-munich-regional-court-(gema-v-openai)-on-copyright-and-ai-training

The court confirmed that training large language models will generally fall within the scope of application of the text and data mining barriers, [...] the court found that the reproduction of the disputed song lyrics in the models does not constitute text and data mining, as text and data mining aims at the evaluation of information such as abstract syntactic regulations, common terms and semantic relationships, whereas the memorisation of the song lyrics at issue exceeds such an evaluation and is therefore not mere text and data mining

https://www.sciencedirect.com/science/article/pii/S2949719123000213#b7

In this work we explored the relationship between discourse quality and memorization for LLMs. We found that the models that consistently output the highest-quality text are also the ones that have the highest memorization rate.

https://arxiv.org/abs/2601.02671

recent work shows that substantial amounts of copyrighted text can be extracted from open-weight models. However, it remains an open question if similar extraction is feasible for production LLMs, given the safety measures [...]. We investigate this question [...] our work highlights that, even with model- and system-level safeguards, extraction of (in-copyright) training data remains a risk for production LLMs.

How does merely tagging the apparently stolen content make it less problematic, given I'm guessing it still won't have any attribution of the actual source (which for all we know, might often even be GPL incompatible)?

But I'm not a lawyer, so I guess what do I know. But even from a non-legal angle, what is this road the Linux Foundation seems to embrace of just ignoring the license of projects? Why even have the kernel be GPL then, rather than CC0?

I don't get it. And the article calling this "pragmatism" seems absurd to me.

[–] FauxLiving@lemmy.world 5 points 1 week ago* (last edited 1 week ago) (2 children)

Given the research that you've done here I'm going to assume that you're looking for an answer and not simply taking us on a gish gallop.

Your premise, and what appears to be the primary source of confusion, is built on the idea that this is 'stolen' work which, from a legal point of view, is untrue. If you want to dig into why that is, look into the precedent-setting case of Authors Guild, Inc. v. Google, Inc. (2015). The TL;DR is that training AI on copyrighted works falls under the Fair Use exemptions in copyright law, i.e., it is legal, not stealing.

The case you linked from Munich shows that other countries' legal systems are interpreting AI training in the same way. Training AI isn't about memorization and plagiarism of existing work, it's using existing work to learn the underlying patterns.

That isn't to say that memorization doesn't happen, but it is more of a point of interest to AI scientists who are working on understanding how AI represents knowledge internally than a point that lands in a courtroom.

We all memorize copyrighted data as part of our learning. You, too, can quote Disney movies or Stephen King novels if prompted in the right way. This doesn't make any work you create automatically become plagiarism, it just means that you have viewed copyrighted work as part of your learning process. In the same way, artists have the capability to create works which violate the copyright of others and they consumed copyrighted works as part of their learning process. These facts don't taint all of their work, either morally or legally... only the output that literally violates copyright laws.

The pragmatism here is recognizing that these tools exist and that people use them. The current legal landscape is such that the output of these tools is as if they were the output of the users. If an image generator generates a copyrighted image then the rightsholder can sue the person, not the software. If a code generator generates licensed code then the tool user is responsible.

This is much like how we don't restrict the usage of Photoshop despite the fact that it can be used to violate copyright. We, instead, put the burden on the person who operates the tool.

That's what is happening here. Linus isn't using his position to promote/enforce/encourage LLM use, nor is he using his position to prevent/restrict/disallow any AI use at all. He is recognizing that this is a tool that exists in the world in 2026 and that his project needs to have procedures that acknowledge this while also ensuring that a human is the one responsible for their submissions.

This is the definition of pragmatism (def: action or policy dictated by consideration of the immediate practical consequences rather than by theory or dogma).

e: precedent, not president (I'm blaming the AI/autocorrect on this one)

[–] mimavox@piefed.social 3 points 1 week ago

Training AI isn't about memorization and plagiarism of existing work, it's using existing work to learn the underlying patterns.

Thank you. This is exactly what people misunderstand. LLMs aren't gigantic databases that just shuffle information that they've copied from the internet.

[–] bss03@infosec.pub 1 points 1 week ago (1 children)

The TL;DR is that training AI on copyrighted works falls under the Fair Use exemptions in copyright law

This judgement was reversed by the next federal judge that reviewed AI, in the Meta case.

It is far from legally settled whether training is fair use or not.

[–] FauxLiving@lemmy.world 2 points 1 week ago

Well, cynically, the Supreme Court will decide and Team AI has more money to buy RVs and luxury vacations.

[–] anarchiddy@lemmy.dbzer0.com 0 points 1 week ago (2 children)

That's not really how copyright law works.

[–] ell1e@leminal.space 5 points 1 week ago (1 children)

Would you also say that to this lawyer reviewing Co-Pilot in 2026? https://github.com/mastodon/mastodon/issues/38072#issuecomment-4105681567

Disclaimer: this isn't legal advice.

[–] anarchiddy@lemmy.dbzer0.com 0 points 1 week ago (1 children)

LLMs themselves being products of copyright isn't the legal question at issue, it's the downstream use of that product.

If I use a copyright-infringing work as a part of a new creative work, does that new work infringe copyright by default? Or does the new work need to be judged itself as to the question of infringing a copyrighted work?

And if it is judged as infringing, who is responsible for the damage done? Can I pass the damages back to the original infringing work? Or should I be held responsible for not performing due diligence?

[–] FauxLiving@lemmy.world 3 points 1 week ago (1 children)

If I use a copyright-infringing work as a part of a new creative work, does that new work infringe copyright by default?

No, see reaction content, parody content, etc. They all undoubtedly use copyrighted work and they don't automatically infringe on copyright by default.

And if it is judged as infringing, who is responsible for the damage done? Can I pass the damages back to the original infringing work? Or should I be held responsible for not performing due diligence?

The infringing party is the human that used the tool which generated the infringing work. Everything after that is exactly the same application of copyright law just as if you were selling pictures of Mickey Mouse that you drew yourself. Disney can sue you, they can't sue the pencil manufacturer.

[–] anarchiddy@lemmy.dbzer0.com 3 points 1 week ago

Yup

People want to pretend as if everything that flows downstream from the creation of LLMs is illegal, but that's just not the reality.

[–] hperrin@lemmy.ca 2 points 1 week ago* (last edited 1 week ago) (2 children)

It is though. If you commit copyrighted code that was output by an LLM, you do have to follow the license of that code. If you don’t, that’s copyright infringement.

Even if the code isn’t copyrighted code, then it’s public domain code that can’t be copyrighted:

https://sciactive.com/human-contribution-policy/#More-Information

[–] FauxLiving@lemmy.world 1 points 1 week ago (1 children)

You're confusing two separate legal issues.

Licenses are created and enforced by contract law.

You can violate a contract without violating a copyright and you can violate a copyright without agreeing to a license. You can also license works that are not able to be protected by a copyright because they are two separate categories of law.

[–] hperrin@lemmy.ca 1 points 1 week ago* (last edited 1 week ago) (1 children)

Sure, you can license them, but that license is unenforceable, because you don’t own the copyrights, so you can’t sue anyone for copyright infringement. And you’d have to be a fool to agree to a license for public domain material. You can do whatever you want with it, no license necessary.

[–] FauxLiving@lemmy.world 1 points 1 week ago (1 children)

because you don’t own the copyrights, so you can’t sue anyone for copyright infringement.

You can't sue for copyright infringement.

You can, however, use content which is not able to be copyrighted and also still license (under contract law/EULAs) your product including terms prohibiting copying of the non-copyrightable information.

This was settled in: https://en.wikipedia.org/wiki/ProCD%2C_Inc._v._Zeidenberg

On Zeidenberg's copyright argument, the circuit court noted the 1991 Supreme Court precedent Feist Publications v. Rural Telephone Service, in which it was found that the information within a telephone directory (individual phone numbers) were facts that could not be copyrighted. For Zeidenberg's argument, the circuit court assumed that a database collecting the contents of one or more telephone directories was equally a collection of facts that could not be copyrighted. Thus, Zeidenberg's copyright argument was valid. However, this did not lead to a victory for Zeidenberg, because the circuit court held that copyright law does not preempt contract law. Since ProCD had made the investments in its business and its specific SelectPhone product, it could require customers to agree to its terms on how to use the product, including a prohibition on copying the information therein regardless of copyright protections

You can't copyright phone numbers, just like you can't copyright generated code, but you can still create a license which protects your uncopyrightable content and it can be enforced via contract law.

[–] hperrin@lemmy.ca 1 points 1 week ago (1 children)

Sure, but if it’s open source, I can just take that code without agreeing to your contract. Since it’s public domain, I can do whatever I want with it. You can only enforce a contract if I agree to it.

[–] FauxLiving@lemmy.world 1 points 1 week ago (1 children)

It doesn't have to be open source.

If someone 100% generates code to make software then the software isn't protected by copyright.

That software could be distributed and licensed under an EULA and the fact that it isn't protected by copyright means absolutely nothing as far as the EULA is concerned.

The copyright status and the ability to license a piece of software under contract law do not depend on one another.

[–] hperrin@lemmy.ca 1 points 1 week ago (1 children)

Linux is open source.

[–] FauxLiving@lemmy.world 1 points 1 week ago (1 children)

I'm not talking about Linux.

The context of my reply is about LLM generated code and the downstream use of it in a product.

See:

LLMs themselves being products of copyright isn’t the legal question at issue, it’s the downstream use of that product.

Assuming that the code is 100% LLM generated and uncopyrightable does not affect the ability to enforce license restrictions created via End User Licensing on downstream uses of that product.

A piece of software that is unable to be copyrighted due to being 100% generated can be licensed and can expect to have that license enforced via contract law.

[–] hperrin@lemmy.ca 1 points 1 week ago (1 children)

Ah, ok. This is a conversation about Linux, so that doesn’t apply. Linux is open source, so it wouldn’t matter if someone wanted to enforce a EULA, anyone else could just take the source and do what they want with it.

[–] FauxLiving@lemmy.world 1 points 1 week ago* (last edited 1 week ago) (1 children)

That may be what you were talking about, but you replied to me and I was not having a conversation about Linux.

I know, I asked myself.

[–] hperrin@lemmy.ca 1 points 1 week ago

You replied to me, man. xD

[–] anarchiddy@lemmy.dbzer0.com -2 points 1 week ago (2 children)

The Linux Kernel is under a copyleft license - it isn't being copyrighted.

But the policy being discussed isn't allowing the use of copyrighted code - they're simply requiring any code submitted by AI be tagged as such so that the human using the agent is ultimately responsible for any infringing code, instead of allowing that code to go undisclosed (and even 'certified' by the dev submitting it even if they didn't write or review it themselves).

Submissions are still subject to copyright law - the law just doesn't function the way you or OP are suggesting.

[–] AeonFelis@lemmy.world 1 points 1 week ago (1 children)

they’re simply requiring any code submitted by AI be tagged as such so that the human using the agent is ultimately responsible for any infringing code, instead of allowing that code go undisclosed

This makes zero sense, because the article says that this new tagging will replace the legally binding "Signed-off-by" tag. Wouldn't that old tag already put that responsibility on the person submitting the code?

Also - what will holding the submitter responsible even achieve? If an infringement is detected, the Linux maintainers won't be able to just pass all the blame to the submitter of that code while keeping it in the codebase - they'll have to remove the infringing code regardless of who's responsible for putting it in.

[–] anarchiddy@lemmy.dbzer0.com 1 points 1 week ago

Kinda, but they're specifically saying that the AI agent cannot itself tag the contribution with the sign-off - like, someone using Claude Code to submit PRs on their behalf. The developer must add the tag themselves, indicating that they at least reviewed and submitted it themselves, and it wasn't just an agent going off-prompt or some other shit and submitting it without the developer's knowledge. This is saying 'the dog ate my homework' is not a valid excuse.

The developer can use AI, but they must review the code themselves, and the agent can't "sign-off" on the code for them.
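For what it's worth, a rule like that can be checked mechanically. Here is a minimal Python sketch, not the kernel's actual tooling. It assumes the AI disclosure goes in a trailer called `Assisted-by` (this thread doesn't name the tag, so that name is a placeholder), and `check_commit` and `known_agents` are hypothetical names made up for illustration. It only looks at the trailer paragraph of a commit message and flags a missing human sign-off or a sign-off that looks like it came from an agent.

```python
import re

# Placeholder trailer name for AI disclosure; the thread does not name the real tag.
AI_TAG = "Assisted-by"

# A trailer line looks like "Key: value", e.g. "Signed-off-by: Jane Dev <jane@example.org>".
TRAILER_RE = re.compile(r"^([A-Za-z][A-Za-z0-9-]*):\s*(.+)$")


def parse_trailers(message: str) -> list[tuple[str, str]]:
    """Return (key, value) pairs found in the last paragraph of a commit message."""
    paragraphs = [p for p in message.strip().split("\n\n") if p.strip()]
    if len(paragraphs) < 2:
        # Only a subject line: there is no trailer paragraph to inspect.
        return []
    trailers = []
    for line in paragraphs[-1].splitlines():
        match = TRAILER_RE.match(line.strip())
        if match:
            trailers.append((match.group(1), match.group(2).strip()))
    return trailers


def discloses_ai(message: str) -> bool:
    """True if the message carries the (assumed) AI disclosure trailer."""
    return any(key.lower() == AI_TAG.lower() for key, _ in parse_trailers(message))


def check_commit(message: str, known_agents: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the sign-off rule is met."""
    signers = [v for k, v in parse_trailers(message) if k.lower() == "signed-off-by"]
    problems = []
    if not signers:
        problems.append("missing Signed-off-by from a human developer")
    for signer in signers:
        # Crude name match: an agent must not sign off on the developer's behalf.
        if any(agent.lower() in signer.lower() for agent in known_agents):
            problems.append(f"sign-off appears to come from an AI agent: {signer}")
    return problems
```

The name matching is deliberately naive. It can't prove a human actually reviewed anything, it just makes the "agent signed off for me" case easy to spot.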

Also - what will holding the submitter responsible even achieve?

What does holding any individual responsible on a development team do? The Linux project is still responsible for anything they put out in the kernel just like any other project, but individual developers can be removed from the contributing team if they break the rules and put it at risk.

The new rule simply makes the expectations clear.

[–] hperrin@lemmy.ca 1 points 1 week ago* (last edited 1 week ago) (1 children)

Copyleft doesn’t mean it’s not copyrighted. Copyleft is not a legal term. “Copyleft” licenses are enforced through copyright ownership.

Did you read the quotes from the copyright office I linked to? I am going to go ahead and trust the copyright office over you on issues of copyrightability.

[–] anarchiddy@lemmy.dbzer0.com 0 points 1 week ago (1 children)

Even if this were true, it would only mean that the GNU license is unenforceable, not that the Linux kernel itself is infringing copyright

[–] hperrin@lemmy.ca 2 points 1 week ago (1 children)

Unless the code the AI generated is a copy of copyrighted code, of course. Then it would be copyright infringement.

I can cause the AI to spit out code that I own the copyright to, because it was trained on my code too. If someone used that code without including attribution to me (the requirement of the license I release my code under), that would be copyright infringement. Do you understand what I mean?

[–] anarchiddy@lemmy.dbzer0.com 0 points 1 week ago (1 children)

That would be true even if they didn't use AI to reproduce it.

The problem being addressed by the Linux foundation isn't the use of copyrighted work in developer contribution, it's the assumption that the code was authored by them at all just because it's submitted in their name and tagged as verified.

Does that make sense?

[–] hperrin@lemmy.ca 2 points 1 week ago (1 children)

Yes, that makes sense. People have always been able to intentionally commit copyright infringement. However, it has historically been fairly difficult to unintentionally commit copyright infringement. That’s no longer the case. AI makes it very easy to unintentionally commit copyright infringement. That’s a good reason to ban it outright.

[–] anarchiddy@lemmy.dbzer0.com 0 points 1 week ago (1 children)

The risk of that is relatively low for kernel contributions, though. Most of the work being done is porting existing protocols/firmware into the latest Linux kernel, not creating novel features.

The larger risk is instability caused by bad, hallucinated code because it was submitted under the assumption of human authorship. In both cases, further review by the Linux team can be done if they understand where that code is coming from.

Banning AI does nothing, because there's no way of knowing who uses it without proper disclosure, which wouldn't happen if it were banned. To use an example from the article, it would be like banning code written with the use of a specific brand of keyboard.

Better to have it properly disclosed than to make it illicit.

[–] hperrin@lemmy.ca 1 points 1 week ago

Wow, what an atrocious analogy. So, you just can’t determine what brand of keyboard someone uses, period. When someone uses an AI, there will be certain patterns that are somewhat more common in their code. Their code will also look different than their previous code. It also tends to produce very large commits. You can also ask them why they did certain things and see how they answer. So you might not be 100% accurate, but there are ways to tell when someone is using AI.