Courts keep ruling that theft-by-AI is “fair use” so no license will protect you from legalized theft.
Opensource
A community for discussion about open source software! Ask questions, share knowledge, share news, or post interesting stuff related to it!
Doesn't matter. The companies will do it anyway and make you sue to stop them, at which point they'll exhaust your resources long before any potential victory in court.
Literally any license that says derivatives of your work must include attribution; bonus points if you use a license that says derivative works must be shared under the same license, e.g. GPL ~~or MIT~~.
The AI Bros will take it anyway though and ignore your license, and the courts are very pro-business and AI is like half the US economy at this point, so it's probably all pretty pointless.
They're currently being sued for just that in Doe v. GitHub et al., which has already been going on for years. It's currently waiting to be scheduled in the Ninth Circuit; they've already been waiting 18 months.
(Edited to correct, see comment below)
MIT is a permissive license and doesn't enforce anything.
If you want freedom, use strict licenses like the GPL.
GNU has a great guide on licenses.
MIT still requires the license and copyright notice to be maintained, though; that's why even proprietary software includes an 'open-source licenses' listing somewhere under Help or alongside the distribution. Arguably, an AI model reproducing a bit of MIT-licensed code would be just as much in violation as with any other license.
GPL still gives much better guarantees w.r.t. providing the source code and modifications made thereto, though.
That is a great guide, thanks for the info. Edited my comment to correct.
There are various licenses that do that, but by definition they are not open source. Those two things are mutually exclusive.
I find the masses' sudden abhorrence of ignoring copyright very interesting. What happened to all the pirates? The shift to "copyright is good, actually" by anti-AI folks is bewildering.
Y'all really think the system is set up to protect the little folks in the first place?
I thought the whole point of open source is sharing 🤷
Edit: Makes me think of "socialists" who would deny the basic rights they're saying they're fighting for from "rich" people.
Not sure if you are specifically talking to me, but yeah I do find it strange.
With all the recent lawsuits between tech companies and publishers, people somehow became defenders of publishers, when those fuckers are the same people exploiting creative people in the first place. Creative people are not the beneficiaries of all these settlements that are happening.
I think it's the anger of self-published creators with followings convincing people that they are being destroyed, and those people blindly directing that anger at tech companies without fully understanding the situation.
Completely agree, and my comment wasn't really directed at you, but I figured you saw the... Oddness of the request with how you worded your response.
I found this, which adds additional text to the existing licenses to prohibit training an AI on the licensed code: https://github.com/non-ai-licenses/non-ai-licenses
Though, per OSI's definition, your code probably would no longer be open source, since training an LLM is technically considered a field of endeavor:
OSD number 6:
The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research.
This is exactly what I was looking for. Thanks!
We need a new field of licensing, something like an Ethical Source License. With AI now on the field, and even before, tbh, Open Source has alas become a paradigm of the past.
If we can play semantics, the program (the compiled binary) can be used for anything with no field restrictions.
But the code is not the program itself, it's the recipe, and usage could be restricted in some specific ways.
In my opinion, since free licenses already have restrictions regarding distribution, a license could say that AI models trained on this data are derivative works and must be licensed compatibly (i.e. the training data set, the training methods, and the models themselves being free).
I feel that's a better middle ground, where the freedom of users is neither violated nor restricted, and the code/knowledge stays free.
Maybe you could say that AI training is not a use of the program, it is a use of the source code.
OSI is not a US court (or at least I hope not).
Playing on words isn't going to get the license accepted.
On the other hand, why does it have to be accepted?
You are doing something different. Just do the different thing.
The problem is that the AI companies claim they are just "reading" the code to let their AI models learn from it, so licenses and copyright don't apply in their interpretation of the situation.
The interesting thing is that in my testing with very niche code, it downright copied the only online example (my repo). It even reused my variable names.
Yeah, but you (or someone in a similar situation) will have to take them to court for the legal situation to be clarified. Just putting another license on the code will not stop them.
Licensing only works as well as enforcing it. How do you show a LLM consumed your code as part of its training data?
Some authors typed the first few sentences of their book and the LLM spit out the rest.
That generally only happens in cases of overfitting, where the model was trained on a poorly de-duplicated data set that contains many copies of that book (or excerpts, quotes, and so forth). This is considered a flaw by AI trainers and a lot of work goes into sanitizing the training data to prevent it.
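For illustration only (this sketch is not from the thread, and real pipelines are far more elaborate): the simplest form of the de-duplication described above is exact-duplicate removal via content hashing, so the model never sees the same document many times over. Production pipelines also do fuzzy/near-duplicate detection (e.g. MinHash), which is omitted here.

```python
import hashlib

def deduplicate(documents):
    """Keep only the first occurrence of each exact document."""
    seen = set()
    unique = []
    for doc in documents:
        # Hash the content so comparison is cheap even for large corpora.
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["once upon a time...", "def main(): pass", "once upon a time..."]
print(deduplicate(corpus))  # the repeated document appears only once
```

Memorization of near-duplicates (excerpts, quotes, reformatted copies) is exactly what this naive version misses, which is why the fuzzy matching step matters in practice.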
But you're otherwise disgusted by the fact that material is plagiarized without consent to begin with...
...Right, FaceDeer?
You went digging through my Reddit comments to find a two-month-old thread, that must have taken a lot of effort. But I'm afraid I don't see what the relevance of it is, aside from a general "it's about AI". The bulk of the comments I wrote there were about water usage.
I'm genuinely puzzled. Are you saying that deduplicating data is "hiding unethical behaviour?" It's actually intended for improving the model's performance, having a model spit out exact copies of its training data means you've produced a hugely expensive and wasteful re-implementation of copy-and-paste rather than a generative AI. The whole point of generative AI is to produce novel outputs.
The thing about licenses is that they only work if they can be defended in court. In the US system in particular, it is simply impossible for a private individual to do so (even multi-billion-dollar corporations with their highly paid lawyers seem to be powerless against artificially inflated AI giants such as OpenAI).
Therefore, it must be assumed that even restrictive licenses will simply be disregarded.
Do you think the companies scraping your data give a fuck?
Technically, a lot of licenses already disallow this use without maintained copyright notices, which the AI companies aren't providing. I doubt the AI companies would follow the new license, but at least it would be even more explicitly illegal.
CC BY-SA is this, in spirit.
This license enables reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use. If you remix, adapt, or build upon the material, you must license the modified material under identical terms.
See: https://creativecommons.org/licenses/by-sa/4.0/
There are other flavours of Creative Commons licenses that might suit you better. Anyway, worth looking into.
As nice as this would be, it's not very likely... Licenses are usually limp suggestions from the perspective of companies with billions of dollars. AI companies train on millions of copyrighted materials, both literature and art, without any express permission from the authors or artists, and with essentially no recourse or compensation to the authors. You could append a 'no AI training' clause to an existing license like the MIT license, but the impact that will have will mostly be brief personal satisfaction and won't change what the AI companies do. It's genuinely more useful to keep code proprietary to prevent it from being used to train AI models.
It may be that such a license can't exist. The way these viral copyleft licenses work is that they offer things to people who accept them that copyright otherwise doesn't permit. The usual example: you can distribute copies of this work (a thing that copyright prohibits you from doing by default), but in exchange you must release any derivative works you make under the same license.
The problem is that you actually can reject that license. You can download the software (that's allowed because the person distributing the software agreed to the copyleft license) and then decide you're not going to accept the license that came with it. At that point you're restricted by ordinary copyright and can only do the things it would normally allow.
There have already been court cases in the US that have ruled that training an AI is fair use, and the resulting model is not a derivative work covered by the copyright of the original. So you can just go ahead and train the AI at that point.
(Facedeer is a prolific pro-AI concern troll from Reddit, so take his post here with a huge grain of salt)
Did I say anything incorrect?
No. I explained the same thing a few days ago at https://discuss.tchncs.de/post/55176429/24029241 without remembering anything you ever wrote.
Thanks. This has actually been a thing that bothered me many years before AI was ever a thing, there are open source programs I've installed that pop open a clickwrap "agree with the GPL before you can install this" step and it shows a misunderstanding of how these licenses fundamentally work. They're not EULAs.
As for whether I'm a "concern troll", AI happens to be an area of significant interest to me right now and so I've been commenting a lot on it. My opinion on it also happens to be unpopular. I don't like the idea of closed social media bubbles where only groupthink is allowed, so I just go ahead and speak my mind even knowing it'll likely get hit with a lot of downvotes. I'm finding the Fediverse to be a lot more insular than Reddit is in this regard, I suspect because the population in general is a lot smaller, but at least downvotes don't tend to "bury" comments.
If anyone can't stand reading my comments, I recommend blocking me. It's the ultimate downvote.
On that other question you raise, here is the FSF's position: https://www.gnu.org/licenses/gpl-faq.html#ClickThrough
Yeah, I'm not bothered by the actual clicking of "I accept the GPL"; it's more the misunderstanding that the existence of the click-through represents. If someone's licensing their code, I would hope they'd spend a bit of time researching how the license they're using actually works.
You can create your own license with whatever idiotic limitations you wish.
Example: Litsenzy -- zis code for no bad peple, no robots and no to that ugly shit who stepped on my foot yesterday in ze tram.
Yikes 😬 you seem unhinged