Wondering what data OpenAI used to train its buzzy new text-to-video AI? OpenAI CTO Mira Murati seems to be wondering, too.

So plagiarism?

CTO should definitely know this.

@ItsMeSpez@lemmy.world

They do know this. They’re avoiding any legal exposure by being vague.

Of course she knows it. She just doesn’t want to get sued.

@blazeknave@lemmy.world

I feel like at their scale, if any company is going to have a figurehead, marketable CTO, it’s going to be this one. If not, you’re right, and she’s lying lol

I have a feeling that the training material involves cheese pizza…

There is no way in hell it isn’t copyrighted material.

@abhibeckert@lemmy.world

Every video ever created is copyrighted.

The question is — do they need a license? Time will tell. This is obviously going to court.

Politically Incorrect

removed by mod

What is this human going to do with what they read? Are they going to produce something using part of this book or article?

If yes, that’s copyright infringement.

@echo64@lemmy.world

If you read an article, then copy parts of that article into a new article, that’s copyright infringement. Same with AIs.

@anlumo@lemmy.world

Depends on how much is copied, if it’s a small amount it’s fair use.

@echo64@lemmy.world

Fair use depends on a lot, and just being a small amount doesn’t factor in. It’s the actual use. Small amounts just often fly under the radar of legal teams.

@FireTower@lemmy.world

Fair use is a four-factor test. The amount used is a factor, but a low amount being used doesn’t strictly mean something is fair use. You could use a single frame of a movie and have it not qualify as fair use.

@RatBin@lemmy.world

Obviously nobody fully knows where so much training data comes from. They used web scraping tools like there’s no tomorrow, and with that amount of information you can’t tell where all the training material comes from. That doesn’t mean the tool is unreliable, but it does mean we don’t truly know why it’s that good, unless you can somehow access all the layers of the digital brains operating these machines. That isn’t doable with a closed-source model, so we can only speculate. This is what is called a black box, and we use it because we trust the output enough to do so. Knowing in detail the process behind each query would thus be taxing. Anyway… I’m starting to see more and more AI-generated content. YouTube is slowly but surely losing significance and importance, as I no longer search for information there, AI being one of the reasons for this.

@Gakomi@lemmy.world

A company CEO does not know shit about what goes on in the dev department, so her answer does not surprise me. Ask the devs or the team leader in charge of the project. The CEO is only there to make sure the company makes money, as they and the shareholders only care about money!

@overload@sopuli.xyz

Chief Technology Officer, not CEO

@Gakomi@lemmy.world

So you mean another person who has no idea because they’re so high up the chain of command that all they care about is how to make more money? Seriously, in every company I’ve worked at until now, everyone at management level or above had mostly no idea about this shit, and I have no idea how most of them got into those positions, as they have close to zero technical skill! And the speeches those people give are written by people who, again, are not part of the infrastructure or development team. I do find this disturbing as hell, but at this point it’s also what I expect to happen, as it’s all I’ve seen.

@TimeNaan@lemmy.world

She’s CTO not CEO. She absolutely should know the answer.

@Gakomi@lemmy.world

She should, but she does not. As I mentioned in another post, anyone at team leader level or above in all the companies I’ve worked at so far barely had any technical skill and didn’t have any idea about this shit, only some bits and pieces that they got from documentation the dev team wrote. They had some vague idea of how our infrastructure works, but that’s about it.

@Fedizen@lemmy.world

this is why code AND cloud services shouldn’t be copyrightable or licensable without some kind of transparency legislation to ensure people are honest. Either forced open source or some kind of code review submission to a government authority that can be unsealed in legal disputes.

@CosmoNova@lemmy.world

I almost want to believe they legitimately do not know or care that they’re committing a gigantic data and labour heist, but the truth is they know exactly what they’re doing and they rub our noses in it.

@laxe@lemmy.world

Of course they know what they’re doing. Everybody knows this, how could they be the only ones that don’t?

@Bogasse@lemmy.ml

Yeah, the fact that AI progress just relies on “we will make so much money that no lawsuit will significantly alter our growth” is really infuriating. The fact that the general audience apparently doesn’t care is even more infuriating.

@toddestan@lemmy.world

I’d say not really, Tolkien was a writer, not an artist.

What you are doing is violating the trademark Middle-Earth Enterprises has on the Gandalf character.

The point was that I absorbed that information to inform my “art”, since we’re equating training with stealing.

I guess this would have been a better example lol. It’s clearly not Gandalf, but I wouldn’t have ever come up with it if I hadn’t seen that scene

@stackPeek@lemmy.world

This tells you so much about what kind of company OpenAI is

@wabafee@lemmy.world

Half open or half closed?

An Intelligence piracy company?

what’s wrong with her face?

@girl@sopuli.xyz

she grimaced?

qaz

They use awkward stills to generate clicks

It’s annoying and distracting, just like the headline.

@dezmd@lemmy.world

LLM is just another iteration of Search. Search engines do the same thing. Do we outlaw search engines?

@AliasAKA@lemmy.world

Sora is a generative video model, not exactly a large language model.

But to answer your question: if all LLMs did was redirect you to where the content was hosted, then they would be search engines. But instead they reproduce what someone else was hosting, which may include copyrighted material. So they’re fundamentally different from a simple search engine. They don’t direct you to the source; they reproduce a facsimile of the source material without acknowledging or directing you to it. Sora is similar. It produces video content, but it doesn’t direct you to the similar video content it is reproducing from. And we can argue about how close something needs to be to an existing artwork to count as a reproduction, but I think for AI models we should enforce citation models.

@dezmd@lemmy.world

How does a search engine know where to point you? It ingests all that data and processes it ‘locally’ on the search engine’s systems, using algorithms to organize the data for search. It’s effectively the same dataset.

LLM is absolutely another iteration of Search, with natural language output for the same input data. Are you arguing that search engine data ingestion is also not fair use and a copyright violation?

You equate LLM to Intelligence, which it is not. It is an algorithmic search iteration with natural language responses, but that doesn’t sound as cool as AI. It’s neat, it’s useful, and yes, it should cite the sourcing details (upon request), but it’s not (yet?) a real intelligence and is equal to search in terms of fair use and copyright arguments.

@AliasAKA@lemmy.world

I never equated LLMs to intelligence. And indexing the data is not the same as reproducing the webpage or the content on a webpage. To get beyond the small snippet that matched your query when you search, you have to follow a link to the source material. Now of course Google doesn’t like this, so they did that stupid AMP thing, which has its own issues, and I disagree with AMP as a general rule as well. So, LLMs can look at the data, I just don’t think they can reproduce that data without attribution (or payment to the original creator). Perplexity.ai is a little better in this regard because it does link back to sources and is attempting to be a search-engine-like entity. But OpenAI does not, in almost all cases.

dantheclamman

I feel conflicted about the whole thing. Technically it’s a model. I don’t feel that people should be able to sue me as a scientist for making a model based on publicly available data. I myself am merely trying to use the model itself to explain stuff about the world. But OpenAI are also selling access to the outputs of the model, that can very closely approximate the intellectual property of people. Also, most of the training data was accessed via scraping and other gray market methods that were often explicitly violating the TOU of the various places they scraped from. So it all is very difficult to sort through ethically.
