A new research paper laid out ways in which AI developers should try and avoid showing LLMs have been trained on copyrighted material.

OpenAI now tries to hide that ChatGPT was trained on copyrighted books, including J.K. Rowling’s Harry Potter series

paraphrand
English
131Y

Why are people defending a massive corporation that admits it is attempting to create something that will give them unparalleled power if they are successful?

@SCB@lemmy.world
English
01Y

Leftists hating on AI while dreaming of post-scarcity will never not be funny

Cosmic Cleric
English
71Y

Because ultimately, it’s about the truth of things, and not what team is winning or losing.

Because everyone learns from books, it’s stupid.

@Whimsical@lemmy.world
English
31Y

The dream would be that they manage to make their own glorious free & open source version, so that after a brief spike in corporate profit as they fire all their writers and artists, suddenly nobody needs those corps anymore because EVERYONE gets access to the same tools - if everyone has the ability to churn out massive content without hiring anyone, that theoretically favors those who never had the capital to hire people to begin with, far more than those who did the hiring.

Of course, this stance doesn’t really have an answer for any of the other problems involved in the tech, not the least of which is that there’s bigger issues at play than just “content”.

@Skanky@lemmy.world
English
381Y

Vanilla Ice had it right all along. Nobody gives a shit about copyright until big money is involved.

@uis@lemmy.world
English
21Y

Yep. Legally every word is copyrighted. Yes, law is THAT stupid.

People think it’s a broken system, but it actually works exactly how the rich want it to work.

I don’t get why this is an issue. Assuming they purchased a legal copy that it was trained on then what’s the problem? Like really. What does it matter that it knows a certain book from cover to cover or is able to imitate art styles etc. That’s exactly what people do too. We’re just not quite as good at it.

Hildegarde
English
61Y

A copyright holder has the right to control who has the right to create derivative works based on their copyright. If you want to take someone’s copyright and use it to create something else, you need permission from the copyright holder.

The one major exception is Fair Use. It is unlikely that AI training is a fair use. However this point has not been adjudicated in a court as far as I am aware.

@FatCat@lemmy.world
English
71Y

It is not a derivative it is transformative work. Just like human artists “synthesise” art they see around them and make new art, so do LLMs.

Hildegarde
English
41Y

Transformative works are not a thing.

If you copy the copyrightable elements of another work, you have created a derivative work. That work needs to be transformative in order to be eligible for its own copyright, but being transformative alone is not enough to make it non-infringing.

There are four fair use factors. Transformativeness is only considered by one of them. That is not enough to make a fair use.

@BURN@lemmy.world
English
31Y

LLMs don’t create anything new. They have limited access to what they can be based on, and all assumptions made by it are based on that data. They do not learn new things or present new ideas. Only ideas that have been already done and are present in their training.

@LordShrek@lemmy.world
English
51Y

this is so fucking stupid though. almost everyone reads books and/or watches movies, and their speech is developed from that. the way we speak is modeled after characters and dialogue in books. the way we think is often from books. do we track down what percentage of each sentence comes from what book every time we think or talk?

Aye, but I’m thinking the whole notion of copyright is banking on the fact that human beings are inherently lazy and not everyone will start churning out books in the same universe or style. And if they do, it takes quite some time to get the finished product and they just get sued for it. It’s easy, because there’s a single target.

So there’s an extra deterrent to people writing and publishing a new harry potter novel, unaffiliated with the current owner of the copyright. Invest all that time and resources just to be sued? Nah…

Issue with generating stuff with 'puters is that you invest way less time, so the same issue pops up for the copyright owner, they’re just DDoS-ed on their possible attack routes. Will they really sue thousands or hundreds of thousands of internet randos generating harry potter erotica using an LLM? Would you even know who they are? People can hide money away in Switzerland from entire governments, I’m sure there are ways to hide your identity from a book publisher.

It was never about the content, it’s about the opportunities the technology provides to halt the gears of the system that works to enforce questionable laws. So they’re nipping it in the bud.

@LordShrek@lemmy.world
English
01Y

this brings up the question: what is a book? what is art? if an “AI” can now churn out the next harry potter sequel and people literally can’t tell that it’s not written by JK Rowling, then what does that mean for what people value in stories? what is a story? is this a sign that we humans should figure something new out, instead of reacting according to an outdated protocol?

yes, authors made money in the past before AI. now that we have AI and most people can get satisfied by a book written by AI, what will differentiate human authors from AI? will it become a niche thing, where some people can tell the difference and they prefer human authors? or will there be some small number of exceptional authors who can produce something that is obviously different from AI?

i see this as an opportunity for artists to compete with AI, rather than say “hey! no fair! he can think and write faster than me!”

Well, poor literature has always existed, which some might not even dignify to call literature. Are writers of such things threatened by LLMs? Of course they are. Every new technology has brought with it the fear of upending somebody’s world. And to some extent, every new technology has indeed done just that.

Personally, and… this will probably be highly unpopular, I honestly don’t care who or what created a piece of art. Is it pretty? Does it satisfy my need for just the right amount of weird, funny and disturbing to stir emotions or make me go ‘heh, interesting!’? Then it really doesn’t matter where it comes from. We put way too much emphasis on the pedigree of art and not on the content. Hell, one very nice short story I read was the greentext one about humans being AI and escaping from the simulation. Wonder how many would scoff at calling art something that came out of 4chan?

Maybe this is the issue? Art is thought of as a purely human endeavour (also birds do it, and that one pufferfish that draws on the seabed, but they’re “dumb” animals so they don’t count, right? hell, there’s even a jumping spider that does some pretty rad dances). And if code in a machine can do it just as well (can it? let it - we’ll be all the better for it. can’t it? let it be then - no issue) then what would be the significance of being human?

People are acting like ChatGPT is storing the entire Harry Potter series in its neural net somewhere. It’s not storing or reproducing text in a 1:1 manner from the original material. Certain material, like very popular books, has likely been interpreted tens of thousands of times due to how many times it was reposted online (and therefore how many times it appeared in the training data).

Just because it can recite certain passages almost perfectly doesn’t mean it’s redistributing copyrighted books. How many quotes do you know perfectly from books you’ve read before? I would guess quite a few. LLMs are doing the same thing, but on mega steroids with a nearly limitless capacity for information retention.

@abbotsbury@lemmy.world
English
21Y

but on mega steroids with a nearly limitless capacity for information retention.

That sounds like redistributing copyrighted books

Hup!
English
6
edit-2
1Y

Nope, people are just acting like ChatGPT is making commercial use of the content. Knowing a quote from a book isn’t copyright infringement. Selling that quote is. Also it doesn’t need to be content stored 1:1 somewhere to be infringement. That misses the point. If you’re making money off a synopsis you wrote based on imperfect memory and in your own words, it’s still copyright infringement until you sign a licensing agreement with JK. Even transforming what you read into a different medium like a painting or poetry can infringe the original author’s copyrights.

Now mull that over and tell us what you think about modern copyright laws.

@Ronath@lemmy.world
English
41Y

Just adding that, outside of Rowling, who I believe has a different contract than most authors due to the expanded Wizarding World and Pottermore, most authors themselves cannot quote their own novels online, because that would be publishing part of the novel digitally and that’s a right they’ve sold to their publisher. The publisher usually ignores this as it creates hype for the work, but authors are careful not to abuse it.

Lol:

Content industry: It can reproduce our stuff
OpenAI:
Content industry: They are hiding that it can reproduce us

@dx1@lemmy.world
English
2
edit-2
1Y

Kopimi

(edit 4 minutes in - hey I have this guy’s album already (“Red Extensions of Me”))

I’m basically on the same page as this guy except I don’t think the government has to manage a royalties system. People can handle that freely, no? Plus you can pretty immediately envision they’re gonna have some kind of asinine censorship policy for what content is acceptable and what content isn’t.

@LordShrek@lemmy.world
English
11Y

the government in its current form would have that flaw in the content distribution system, yes, but his main idea is that it would be run like open source, in the sense of “government of the people”

@dx1@lemmy.world
English
11Y

That’s optimistic.

@scarabic@lemmy.world
English
21Y

One of the first things I ever did with ChatGPT was ask it to write some Harry Potter fan fiction. It wrote a short story about Ron and Harry getting into trouble. I never said the word McGonagall and yet she appeared in the story.

So yeah, case closed. They are full of shit.

@rosenjcb@lemmy.world
English
24
edit-2
1Y

The powers that be have done a great job convincing the layperson that copyright is about protecting artists and not publishers. It’s historically inaccurate and you can discover that copyright law was pushed by publishers who did not want authors keeping second hand manuscripts of works they sold to publishing companies.

Additional reading: https://en.m.wikipedia.org/wiki/Statute_of_Anne

@bachalxyz@lemmy.world
English
11Y

How are they going to prove if something was written by an AI?

stevedidWHAT
English
01Y

It’s a complicated answer I’m unqualified to answer but essentially there exists some sort of baseline either for people or for how gpt responds usually and then they can figure out which way the text “leans”

So that explains the “problematic” responses.

@LordShrek@lemmy.world
English
21Y

are we no longer allowed to borrow books from friends?

@benni@lemmy.world
English
91Y

Yeah, but if you wanna act out the contents of the book and sell it as a movie, you need to buy the rights.

@nednobbins@lemmy.world
English
61Y

Yes but there’s a threshold of how much you need to copy before it’s an IP violation.

Copying a single word is usually only enough if it’s a neologism.
Two matching words in a row usually isn’t enough either.
At some point it is enough though and it’s not clear what that point is.
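
To make the "how much is copied" idea concrete, here is a minimal sketch that measures the longest run of consecutive words two texts share. The function name `longest_shared_run` and the sample sentences are made up for illustration, and this is not a test any court applies; it only shows that verbatim overlap is something you can count, while the cutoff stays a judgment call.

```python
def longest_shared_run(a: str, b: str) -> int:
    """Length, in words, of the longest word sequence found in both texts.

    Hypothetical helper for illustration only; it is not a legal test.
    """
    x, y = a.lower().split(), b.lower().split()
    best = 0
    # prev[j] = length of the shared run ending at x[i-2] and y[j-1]
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                # Extend the run that ended one word earlier in both texts.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


original = "the quick brown fox jumps over the fence"
candidate = "a quick brown fox sleeps under the fence"
print(longest_shared_run(original, candidate))  # shares "quick brown fox"
```

A count like this only captures exact matches, so it says nothing about texts that are similar without sharing wording.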

On the other hand it can still be considered an IP violation if there are no exact word matches but it seems sufficiently similar.

Until now we’ve basically asked courts to step in and decide where the line should be on a case by case basis.

We never set the level of allowable copying to 0, we set it to “reasonable”. In theory it’s supposed to be at a level that’s sufficient to, “promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.” (US Constitution, Article I, Section 8, Clause 8).

Why is it that with AI we take the extreme position of thinking that an AI that makes use of any information from humans should automatically be considered to be in violation of IP law?

@LordShrek@lemmy.world
English
-11Y

yes, but that’s a different situation. with the LLM, the issue is that the text from copyrighted books is influencing the way it speaks. this is the same with humans.

@Touching_Grass@lemmy.world
English
-1
edit-2
1Y

Mods remove this comment as this instance no longer tolerates discussions of piracy. We went through this last week

@ClamDrinker@lemmy.world
English
11
edit-2
1Y

This is just OpenAI covering their ass by attempting to block the most egregious and obvious outputs in legal gray areas, something they’ve been doing for a while, hence why their AI models are known to be massively censored. I wouldn’t call that ‘hiding’. It’s kind of hard to hide it was trained on copyrighted material, since that’s common knowledge, really.

Cosmic Cleric
English
-1
edit-2
1Y

It feels like we’ve just taken our first steps down the path of the Robin Williams acted movie ‘Bicentennial Man’ timeline.

It’s a bit pedantic, but I’m not really sure I support this kind of extremist view of copyright and the scale of what’s being interpreted as ‘possessed’ under the idea of copyright. Once an idea is communicated, it becomes a part of the collective consciousness. Different people interpret and build upon that idea in various ways, making it a dynamic entity that evolves beyond the original creator’s intention. It’s like issues with sampling beats or records in the early days of hip-hop. It’s like the very principle of an idea goes against this vision, more that, once you put something out into the commons, it’s irretrievable. It’s not really yours any more once it’s been communicated. I think if you want to keep an idea truly yours, then you should keep it to yourself. Otherwise you are participating in a shared vision of the idea. You don’t control how the idea is interpreted so it’s not really yours any more.

Whether that’s ChatGPT or Public Enemy is neither here nor there to me. The idea that a work like Peter Pan is still possessed is such a very real but very silly obvious malady of this weirdly accepted but very extreme view of the ability to possess an idea.

@Bogasse@lemmy.world
English
81Y

Well, I’d consider agreeing if the LLMs were considered as a generic knowledge database. However I had the impression that the whole response from OpenAI & cie. to this copyright issue is “they build original content”, both for LLMs and stable diffusion models. Now that they started this line of defence I think that they are stuck with proving that their “original content” is not derived from copyrighted content 🤷

Yeah I suppose that’s on them.

@Toasteh@lemmy.world
English
51Y

Copyright definitely needs to be stripped back severely. Artists need time to use their own work, but after a certain time everything needs to enter the public space for the sake of creativity.

@Blapoo@lemmy.ml
English
451Y

We have to distinguish between LLMs

  • Trained on copyrighted material and
  • Outputting copyrighted material

They are not one and the same

Should we distinguish it though? Why shouldn’t (and didn’t) artists have a say if their art is used to train LLMs? Just like publicly displayed art doesn’t provide a permission to copy it and use it for other unspecified purposes, it would be reasonable that the same would apply to AI training.

Good news, they already do! Artists can license their work under a permissive license like the Creative Commons CC0 license. If not specified, rights are reserved to the creator.

I know, but one of the biggest conflicts between artists and AI developers is that they didn’t seek a license to use them for training. They just did it. So even if the end result is not an exact reproduction, it still relied on unauthorized use.

@BURN@lemmy.world
English
11Y

Unfortunately AI training sets don’t tend to respect those licenses. Since it’s near impossible to prove they used it without permission they’re SoL

@Blapoo@lemmy.ml
English
21Y

Ah, but that’s the thing. Training isn’t copying. It’s pattern recognition. If you train a model on “The dog says woof” and then ask the model “What does the dog say”, it’s not guaranteed to say “woof”.

Similarly, just because a model was trained on Harry Potter, all that means is it has a good corpus of how the sentences in that book go.

Thus the distinction. Can I train on a comment section discussing the book?
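
For what it's worth, the "dog says woof" point can be shown with a toy next-word model. This is only a sketch under big assumptions: a real LLM is a neural network, not a word-pair counter, and the two extra training sentences plus the names `train_bigrams`, `next_word_probs` and `sample_next` are invented for the example.

```python
import random
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count which word follows which across all training sentences."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, word):
    """Turn the raw follow-counts for `word` into probabilities."""
    followers = counts.get(word.lower(), Counter())
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()}


def sample_next(counts, word, rng=random):
    """Sample one next word, or None if `word` was never seen."""
    probs = next_word_probs(counts, word)
    if not probs:
        return None
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]


corpus = [
    "the dog says woof",
    "the dog says hello",  # hypothetical extra training sentences
    "the cat says meow",
]
model = train_bigrams(corpus)

# After "says" the toy model has three equally likely continuations,
# so "woof" is possible but not guaranteed.
print(next_word_probs(model, "says"))
print(sample_next(model, "says"))
```

The flip side shows up in the same toy: if one sentence is repeated many times in the corpus, its continuation dominates the counts and the output starts to look a lot like recitation.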
