this post was submitted on 06 Sep 2024
Technology
It is just extracting patterns. It is building statistics about which token ("word", informally speaking) is likely to follow, given the previous stream.
It can only reproduce passages of things it has seen many, many times. It cannot reproduce the whole work. Those two quotes can be seen elsewhere on the internet plenty of times. And it's fair use there, so it would be fair use with a chat bot as well.
There have been papers published where researchers were able to regenerate an image that was present in the training set of Stable Diffusion. But they were only able to find that image (and others) in particular, because they were present in the training set multiple times, and the caption was the same (it was the portrait picture of some executive at a company).
Yeah, you are not gonna be able to do that with an LLM. They will be able to quote only some passages, and only of popular books that have been quoted often enough.
You cannot do that with an LLM.
I hate that some corporations are burning money, resources and energy on this, but the solution is not to restrict fair use even further. Machine Learning is complex, but if I had to summarize it in some way, it is "just" gathering statistics of which word comes next (in the case of a text model). This is no different than getting a large corpus of text and sampling it for word frequency, letter frequency, N-gram frequency, etc. It is well known that this is fair use. You only store the copyrighted works to run the software and produce a very transformative work that is a summary many orders of magnitude smaller than the copyrighted work. This is fair use, and it should stay that way. Changing that is gonna harm the public, small companies and independent researchers way more than big tech companies.
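To make the word-frequency comparison concrete, here is a minimal sketch of sampling a corpus for N-gram counts and "which word comes next" statistics. The corpus string and the function names are made up for illustration, and this is a toy count table, not how an actual LLM is trained or stored.

```python
from collections import Counter, defaultdict


def ngram_counts(text, n):
    """Count every run of n consecutive words in the text."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def next_word_table(text):
    """Map each word to a Counter of the words that directly follow it."""
    words = text.lower().split()
    table = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        table[current][following] += 1
    return table


# Hypothetical toy corpus standing in for a large body of text.
corpus = "the cat sat on the mat and the cat slept"

unigrams = ngram_counts(corpus, 1)   # plain word frequency
bigrams = ngram_counts(corpus, 2)    # frequency of word pairs
table = next_word_table(corpus)      # next-word statistics

print(unigrams[("the",)])            # how often "the" appears
print(bigrams[("the", "cat")])       # how often "the cat" appears
print(table["the"].most_common(1))   # most frequent word after "the"
```

The output is only a set of counts about the text. Whether it is smaller than the input depends on the corpus: on this ten-word toy it is not, and the "orders of magnitude smaller" point only applies at scale.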
As I said in another comment, I would very much welcome a way to force big corpos to release their models. Make a model bigger than N parameters? You needed too much fair use in one gulp: your model has to be public, and in the public domain. I would fucking welcome that! But going in the opposite direction is just risky.
I don't understand why small individuals think that copyright is their friend, and will protect them from big tech companies. Copyright will always harm the weak and protect the powerful as a net result. It's already a miracle that we can enjoy free software and culture by licenses that leverage copyright in our favor.
If I want to go and read a Harry Potter book, I presumably have to pay someone something (excluding library services, because those are services provided for actual people, not AIs)?
This LLM clearly has read Harry Potter and the Chamber of Secrets, and is merely refusing to display the data it already has on it. "Data" in this case meaning the work, the book.
I'm not for current copyright laws, but I find defending these hypocritical companies despicable. I'm sure you're able to imagine that if it suited OpenAI, they might argue the exact opposite of what they're arguing. Companies don't really argue things in good faith, rather always arguing for the thing that will be the most profitable for them, no matter the veracity.