Those claiming AI training on copyrighted works is “theft” misunderstand key aspects of copyright law and AI technology. Copyright protects specific expressions of ideas, not the ideas themselves. When AI systems ingest copyrighted works, they’re extracting general patterns and concepts - the “Bob Dylan-ness” or “Hemingway-ness” - not copying specific text or images.

This process is akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages. The AI discards the original text, keeping only abstract representations in “vector space”. When generating new content, the AI isn’t recreating copyrighted works, but producing new expressions inspired by the concepts it’s learned.

This is fundamentally different from copying a book or song. It’s more like the long-standing artistic tradition of being influenced by others’ work. The law has always recognized that ideas themselves can’t be owned - only particular expressions of them.

Moreover, there’s precedent for this kind of use being considered “transformative” and thus fair use. The Google Books project, which scanned millions of books to create a searchable index, was ruled legal despite protests from authors and publishers. AI training is arguably even more transformative.

While it’s understandable that creators feel uneasy about this new technology, labeling it “theft” is both legally and technically inaccurate. We may need new ways to support and compensate creators in the AI age, but that doesn’t make the current use of copyrighted works for AI training illegal or unethical.

For those interested, this argument is nicely laid out by Damien Riehl in FLOSS Weekly episode 744. https://twit.tv/shows/floss-weekly/episodes/744

Roflmasterbigpimp

Okay that’s just stupid. I’m really fond of AI but that’s just common Greed.

“Free the Serfs?! We can’t survive without their labor!!” “Stop Child labour?! We can’t survive without them!” “40 Hour Work Week?! We can’t survive without their 16 Hour work Days!”

If you can’t make profit yet, then fucking stop.

@dhork@lemmy.world

Bullshit. AI are not human. We shouldn’t treat them as such. AI are not creative. They just regurgitate what they are trained on. We call what it does “learning”, but that doesn’t mean we should elevate what they do to be legally equal to human learning.

It’s this same kind of twisted logic that makes people think Corporations are People.

It’s an interesting area. Are they suggesting that a human reading copyright material and learning from it is a breach?

I’ll train my AI on just the bee movie. Then I’m going to ask it “can you make me a movie about bees”? When it spits the whole movie, I can just watch it or sell it or whatever, it was a creation of my AI, which learned just like any human would! Of course I didn’t even pay for the original copy to train my AI, it’s for learning purposes, and learning should be a basic human right!

@Valmond@lemmy.world

In the meantime I’ll introduce myself into the servers of large corporations and read their emails, codebase, teams and strategic analysis, it’s just learning!

@FatCat@lemmy.world
creator

I am thrilled to see the output you get!

“learning should be a basic human right!”

Education is a basic human right (except maybe in the USA, in which case it should be one there).

Yeah. A human right.

@HereIAm@lemmy.world

“This process is akin to how humans learn… The AI discards the original text, keeping only abstract representations…”

Now I sail the high seas myself, but I don’t think Paramount Studios would buy anyone’s defence that they were only pirating their movies to learn the general content and produce their own knockoff.

Yes artists learn and inspire each other, but more often than not I’d imagine they consumed that art in an ethical way.

There is an easy answer to this, but it’s not being pursued by AI companies because it’ll make them less money, albeit totally ethically.

Make all LLM models free to use, regardless of sophistication, and be collaborative with sharing the algorithms. They don’t have to be open to everyone, but they can look at requests and grant them on merit without charging for it.

So how do they make money? How does Google search make money? Advertisements. If you have a good, free product, advertisement space will follow. If it’s impossible to make an AI product while also properly compensating people for training material, then don’t make it a sold product. Use copyrighted training material freely to offer a free product with no premiums.

Nimo

I hate to say this, but “let the market decide”: if AI is something the consumer wants/needs, they’ll pay for it; otherwise let it die.

HexesofVexes

I rather think the point is being missed here. Copyright is already causing huge issues, such as the troubles faced by the Internet Archive, and the fact that academics get nothing from their work.

Surely the argument here is that copyright law needs to change, as it acts as a barrier to education and human expression. Not, however, just for AI, but as a whole.

Copyright law needs to move with the times, as all laws do.

Copyright is a lesser evil compared to taking human labor and creativity for free to sell a product.

HexesofVexes

Come visit academia some time… Copyright laws ensure we do all the work and get nothing in return;)

Let’s engage in a little fantasy. Someone invents a magic machine that is able to duplicate apartments, condos, houses, … You want to live in New York? You can copy yourself a penthouse overlooking Central Park for just a few cents. It’s magic. You don’t need space. It’s all in a pocket dimension like the Tardis or whatever. Awesome, right? Of course, not everyone would like that. The owner of that penthouse, for one. Their multi-million dollar investment is suddenly almost worthless. They would certainly demand that you must not copy their property without consent. And so would a lot of people. And what about the poor construction workers, ask the owners of construction companies? And who will pay to have any new house built?

So in this fantasy story, the government goes and bans the magic copy machine. Taxes are raised to create a big new police bureau to monitor the country and to make sure that no one uses such a machine without a license.

That’s turned from magical wish fulfillment into a dystopian story. A society that rejects living in a rent-free wonderland but instead chooses to make itself poor. People work to ensure poverty, not to create wealth.

You get that I’m talking about data, information, knowledge. The first magic machine was the printing press. Now we have computers and the Internet.

I’m not talking about a utopian vision here. Facts, scientific theories, mathematical theorems, … All of these are free for all. Inventors can get patents, but only for 20 years and only if they publish them. They can keep their invention secret and take their chances. But if they want a government-enforced monopoly, they must publish their inventions so that others may learn from them.

In the US, that’s how the Constitution demands it. The copyright clause: [The United States Congress shall have power] To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.

Cutting down on Fair Use makes everyone poorer and only a very few, very rich people richer. Have you ever thought about where the money goes if AI training requires a license?

For example, to Reddit, because Reddit has rights to all those posts. So do Facebook and Xitter. Of course, there’s also old money, like the NYT or Getty. The NYT has the rights to all their old issues going back about a century. If AI training requires a license, they can sell all their old newspapers again. That’s pure profit. Do you think they will give their employees raises out of the pure goodness of their hearts if they win their lawsuits? They have no legal or economic reason to do so. The belief that this would happen is trickle-down economics.

@Emerald@lemmy.world

Thanks for a comment like this. It’s interesting how everyone steps in to endorse piracy (unauthorized copying of copyrighted works), yet when a business does it for AI purposes everyone freaks out.

The copyright industry wants money. So, 4 legs good, 2 legs better. It’s depressing to see how easily people are led around by the nose.

@mriormro@lemmy.world

You know, those obsessed with pushing AI would do a lot better if they dropped the patronizing tone in every single one of their comments defending them.

It’s always fun reading “but you just don’t understand”.

@LANIK2000@lemmy.world

“This process is akin to how humans learn…”

I’m so fucking sick of people saying that. We have no fucking clue how humans LEARN, aka gather understanding, aka how cognition works or what it truly is. On the contrary, we can deduce that it probably isn’t very close to human memory/learning/cognition/sentience (or any other buzzwords that are stand-ins for things we don’t understand yet), considering human memory is extremely lossy and tends to infer its own bias, as opposed to LLMs that do neither and religiously follow patterns to their own fault.

It’s quite literally a text prediction machine that started its life as a translator (and still does amazingly at that task), it just happens to turn out that general human language is a very powerful tool all on its own.

I could go on and on as I usually do on lemmy about AI, but your argument is literally “Neural network is theoretically like the nervous system, therefore human”, I have no faith in getting through to you people.

Even worse is, in order to further humanize machine learning systems, they often give them human-like names.

The “you wouldn’t download a car” statement is made against personal cases of piracy, which got rightfully clowned upon. It obviously doesn’t work at all when you use its ridiculousness to defend big ass corporations that try to profit from so much of the stuff they “downloaded”.

Besides, it is not “theft”. It is “plagiarism”. And I’m glad to see that people who try to defend these plagiarism machines, which are being humanised and inflated into something they can never be, get clowned. It warms my heart.

@fancyl@lemmy.world

Are the models that OpenAI creates open source? I don’t know enough about LLMs, but if ChatGPT wants exemptions from the law, it should result in a public good (emphasis on public).

The STT (speech to text) model that they created is open source (Whisper) as well as a few others:

https://github.com/openai/whisper

https://github.com/orgs/openai/repositories?type=all

@WalnutLum@lemmy.ml

Those aren’t open source, neither by the OSI’s Open Source Definition nor by the OSI’s Open Source AI Definition.

The important part for the latter being a published listing of all the training data. (Trainers don’t have to provide the data, but they must provide at least a way to recreate the model given the same inputs).

Data information: Sufficiently detailed information about the data used to train the system, so that a skilled person can recreate a substantially equivalent system using the same or similar data. Data information shall be made available with licenses that comply with the Open Source Definition.

They are model-available if anything.

I did a quick check on the license for Whisper:

Whisper’s code and model weights are released under the MIT License. See LICENSE for further details.

So that definitely meets the Open Source Definition on your first link.

And it looks like it also meets the definition of open source as per your second link.

Additional WER/CER metrics corresponding to the other models and datasets can be found in Appendix D.1, D.2, and D.4 of the paper, as well as the BLEU (Bilingual Evaluation Understudy) scores for translation in Appendix D.3.

@WalnutLum@lemmy.ml

Whisper’s code and model weights are released under the MIT License. See LICENSE for further details. So that definitely meets the Open Source Definition on your first link.

Model weights by themselves do not qualify as “open source”, as the OSAID qualifies. Weights are not source.

Additional WER/CER metrics corresponding to the other models and datasets can be found in Appendix D.1, D.2, and D.4 of the paper, as well as the BLEU (Bilingual Evaluation Understudy) scores for translation in Appendix D.3.

This is not training data. These are testing metrics.

Edit: additionally, assuming you might have been talking about the link to the research paper. It’s not published under an OSD license. If it were this would qualify the model.

I don’t understand. What’s missing from the code, model, and weights provided to make this “open source” by the definition of your first link? It seems to meet all of those requirements.

As for the OSAID, the exact training dataset is not required, per your quote, they just need to provide enough information that someone else could train the model using a “similar dataset”.

@WalnutLum@lemmy.ml

Oh and for the OSAID part, the only issue stopping Whisper from being considered open source as per the OSAID is that the information on the training data is published through arxiv, so using the data as written could present licensing issues.

Ok, but the most important part of that research paper is published on the github repository, which explains how to provide audio data and text data to recreate any STT model in the same way that they have done.

See the “Approach” section of the github repository: https://github.com/openai/whisper?tab=readme-ov-file#approach

And the Training Data section of their github: https://github.com/openai/whisper/blob/main/model-card.md#training-data

With this you don’t really need to use the paper hosted on arxiv, you have enough information on how to train/modify the model.

There are guides on how to fine-tune the model yourself: https://huggingface.co/blog/fine-tune-whisper

Which, from what I understand on the link to the OSAID, is exactly what they are asking for. The ability to retrain/finetune a model fits this definition very well:

The preferred form of making modifications to a machine-learning system is:

  • Data information […]
  • Code […]
  • Weights […]

All 3 of those have been provided.

@WalnutLum@lemmy.ml

The problem with just shipping AI model weights is that they run up against the issue of point 2 of the OSD:

The program must include source code, and must allow distribution in source code as well as compiled form. Where some form of a product is not distributed with source code, there must be a well-publicized means of obtaining the source code for no more than a reasonable reproduction cost, preferably downloading via the Internet without charge. The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.

AI models can’t be distributed purely as source because they are pre-trained. It’s the same as distributing pre-compiled binaries.

It’s the entire reason the OSAID exists:

  1. The OSD doesn’t fit because it requires you distribute the source code in a non-preprocessed manner.
  2. AIs can’t necessarily distribute the training data alongside the code that trains the model, so in order to help bridge the gap the OSI made the OSAID - as long as you fully document the way you trained the model so that somebody that has access to the training data you used can make a mostly similar set of weights, you fall within the OSAID

Edit: also, the information about the training data has to be published under an OSD-equivalent license (such as Creative Commons) so that using it doesn’t cause licensing issues with research paper print companies (like arxiv).

@graycube@lemmy.world

Nothing about OpenAI is open-source. The name is a misdirection.

If you use my IP without my permission and profit from it, then that is IP theft, whether or not you republish a plagiarized version.

@dariusj18@lemmy.world

So I guess every reaction and review on the internet that is ad-supported or behind a paywall is theft too?

lettruthout

If they can base their business on stealing, then we can steal their AI services, right?

