Tech experts are starting to doubt that ChatGPT and A.I. ‘hallucinations’ will ever go away: ‘This isn’t fixable’

@joelthelion@lemmy.world

I don’t understand why they don’t use a second model to detect falsehoods instead of trying to fix it in the original LLM?

@doggle@lemmy.world

AI models are already computationally intensive. This would instantly double the overhead. Also, being able to detect problems does not mean you’re able to fix them.

@kromem@lemmy.world

More than double, as query size is very much connected to the effective cost of the generation, and you’d need to include both the query and initial response in that second pass.

Then - you might need to make an API call to a search engine or knowledge DB to fact check it.

And include that data as context along with the query and initial response to whatever decides if it’s BS.

So for a dumb realtime chat application, no one is going to care enough to slow it down and multiply costs just to avoid hallucinations.

But for AI replacing a $120,000 salaried role in writing up a white paper on some raw data analysis, a 10-30x increase over a $0.15 query is more than acceptable.

So you will see this approach taking place in enterprise scenarios and professional settings, even if we may never see them in chatbots.
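
A rough back-of-the-envelope sketch of the cost argument above, in Python. The token counts are made-up illustrative numbers, and the sketch assumes input and output tokens are billed at the same rate, which is a simplification and not any provider's actual pricing.

```python
def pass_tokens(query, response, evidence=0):
    """Return (single-pass tokens, checked-pipeline tokens).

    All arguments are token counts. Assumes, as a simplification,
    that input and output tokens cost the same.
    """
    # First pass: read the query, write the initial response.
    first = query + response
    # Second pass: read query + initial response + retrieved evidence,
    # then write a revised response of roughly the same length.
    second = query + response + evidence + response
    return first, first + second


# Hypothetical sizes: 500-token query, 500-token answer, no lookup.
single, checked = pass_tokens(500, 500)
ratio_no_lookup = checked / single

# Same query, plus 2000 tokens of retrieved fact-check context.
single, checked_with_lookup = pass_tokens(500, 500, evidence=2000)
ratio_with_lookup = checked_with_lookup / single

print(ratio_no_lookup, ratio_with_lookup)  # 2.5 4.5
```

Even without a lookup the second pass costs more than the first, because it has to re-read the initial response; each extra chunk of retrieved context pushes the ratio higher.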

@kromem@lemmy.world

Paying 2+ times the cost on every query to fix something that makes less than 5% of responses unusable isn’t a trade-off that people are willing to make for chat applications.

This is the same fix approach for jailbreaking.

You absolutely will see this as more business critical integrations occur - it just still probably won’t be in broad consumer facing realtime products.

Flying Squid

And then they can use a third model to detect falsehoods in the second model and a fourth model to detect falsehoods in the third model and… well, it’s LLMs all the way down.

@thimantha@lemmy.world

The LLM Centipede

@postmateDumbass@lemmy.world

Token Ring AI

@Coreidan@lemmy.world

Meanwhile everyone is terrified that ChatGPT is going to take their job. Ya, we are a looooooooooong way off from that.

@emptyother@lemmy.world

Not ChatGPT, but other new AI stuff is likely to take a few jobs. Actors and voice actors, among others.

@postmateDumbass@lemmy.world

You mean the free version from a website.

Think about the powerful ones. Government ones. Wall Street ones. Etc.

dub

Yet I’ve still seen many people clamoring that we won’t have jobs in a few years. People SEVERELY overestimate the ability of all things AI. From self driving, to taking jobs, this stuff is not going to take over the world anytime soon

@PeterPoopshit@lemmy.world

Idk, an ai delivering low quality results for free is a lot more cash money than paying someone an almost living wage to perform a job with better results. I think corporations won’t care and the only barrier will be whether or not the job in question involves enough physical labor to be performed by an ai or not.

@knotthatone@lemmy.world

AI isn’t free. Right now, an LLM takes a not-insignificant hardware investment to run and a lot of manual human labor to train. And there’s a whole lot of unknown and untested legal liability.

Smaller more purpose-driven generative AIs are cheaper, but the total cost picture is still a bit hazy. It’s not always going to be cheaper than hiring humans. Not at the moment, anyway.

dub

They already do this. With chat bots and phone trees. This is just a slightly better version. Nothing new

@Notyou@sopuli.xyz

Right, but that’s the point right? This will grow and more jobs will be obsolete because of the amount of work ai can generate. It won’t take over every job. I think most people will use AI as a tool at the individual level, but companies will use it to gut many departments. Now they would just need one editor to review 20 articles instead of 20 people to write said articles.

@rosenjcb@lemmy.world

As long as you can’t describe an objective loss function, it will never stop “hallucinating”. Loss scores are necessary to get predictable outputs.

@Zeshade@lemmy.world

In my limited experience the issue is often that the “chatbot” doesn’t even check what it says now against what it said a few paragraphs above. It contradicts itself in very obvious ways. Shouldn’t a different algorithm that adds some sort of separate logic check be able to help tremendously? Or a check to ensure recipes are edible (for this specific application)? A bit like those physics-informed NNs.

@kromem@lemmy.world

Shouldn’t a different algorithm that adds some sort of separate logic check be able to help tremendously?

You, in your “limited experience” pretty much exactly described the fix.

The problem is that most of the applications right now of LLMs are low hanging fruit because it’s so new.

And those low-hanging-fruit applications are generally averse to paying 2-10x the query cost, in both time and money, just to fix things like jailbreaking or hallucinations, which is what multiple passes, especially with additional context lookups, would require.

But in the next 18 months you will very likely see multiple companies throwing themselves at exactly these kinds of scenarios, with a focus on more business-critical LLM integrations.

To put it in perspective, this is like people looking at AIM messenger back in the day and saying that the New York Times has nothing to worry about regarding the growth of social media.

We’re still very much in the infancy of this technology in real-world application, and because of that, a lot of the issues that aren’t fixable within the core product don’t yet have mature secondary markets around fixing them.

So far, yours was actually the most informed comment in this thread I’ve seen - well done!

@Zeshade@lemmy.world

Thanks! And thanks for your insights. Yes, I meant that my experience using LLMs is limited to just asking Bing Chat questions about everyday problems, like I would with a friend who “knows everything”. But I never looked at the science of formulating “perfect prompts” like I sometimes hear about. I do have some experience in AI/ML development in general.

@cryball@sopuli.xyz

Shouldn’t a different algorithm that adds some sort of separate logic check be able to help tremendously?

Maybe, but it might not be that simple. The issue is that one would have to design that logic in a manner that can be verified by a human. At that point the logic would be quite specific to a single task and not generally useful at all. At that point the benefit of the AI is almost nil.

@postmateDumbass@lemmy.world

And if there were an algorithm that was better at determining what was or was not the goal, why is that algorithm not used in the first place?

@Zeth0s@lemmy.world

That’s called context. For ChatGPT it is a bit less than 4k words. Using the API it goes up to a bit less than 32k. Alternative models go up to a bit less than 64k.

The model wouldn’t know anything you said before that.

That is one of the biggest limitations of current generation of LLMs.
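
A minimal sketch of what that limit means in practice, assuming a chat client that simply drops the oldest messages once the budget is exceeded. The whitespace word count stands in for a real tokenizer, and `fit_to_context` is a hypothetical helper, not part of any actual API.

```python
def fit_to_context(messages, max_tokens):
    """Keep the most recent messages that fit within max_tokens.

    Uses a crude whitespace word count as the token count; real systems
    use a model-specific tokenizer, so actual limits will differ.
    """
    kept = []
    used = 0
    # Walk backwards from the newest message and stop at the first
    # message that would overflow the budget.
    for message in reversed(messages):
        cost = len(message.split())
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))


history = [
    "my name is Ada",
    "what is the weather",
    "tell me a joke",
]
visible = fit_to_context(history, max_tokens=8)
print(visible)  # ['what is the weather', 'tell me a joke']
```

With a budget of 8 "tokens" the first message falls out of the window, so the model has no way to know the user's name anymore.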

@Womble@lemmy.world

That’s not 100% true. They also work by modifying the meanings of words based on context, and those modified meanings then propagate forward indefinitely. But yes, direct context is limited, so things outside it aren’t directly used.

@Zeth0s@lemmy.world

They don’t really change the meaning of the words, they just look for the “best” words given the recent context, by taking into account the different possible meanings of the words

@Womble@lemmy.world

No, they do. That’s one of the key innovations of LLMs: the attention and feed-forward steps, where they propagate information from related words into each other based on context. From https://www.understandingai.org/p/large-language-models-explained-with?r=cfv1p

For example, in the previous section we showed a hypothetical transformer figuring out that in the partial sentence “John wants his bank to cash the,” his refers to John. Here’s what that might look like under the hood. The query vector for his might effectively say “I’m seeking: a noun describing a male person.” The key vector for John might effectively say “I am: a noun describing a male person.” The network would detect that these two vectors match and move information about the vector for John into the vector for his.

@Zeth0s@lemmy.world

That’s exactly what I said

They don’t really change the meaning of the words, they just look for the “best” words given the recent context, by taking into account the different possible meanings of the words

The word’s meanings haven’t changed, but the model can choose based on the context accounting for the different meanings of words

@Womble@lemmy.world

The key vector for John might effectively say “I am: a noun describing a male person.” The network would detect that these two vectors match and move information about the vector for John into the vector for his.

This is the bit you are missing: the attention network actively changes the token vectors depending on context, which transfers new information into the meaning of that word.
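
The query/key matching described in that quote can be sketched as single-head scaled dot-product attention. This is a toy illustration only: the 2-d vectors are invented for the example (dimension 0 loosely meaning "male person", dimension 1 "financial institution"), and real models learn separate query, key, and value projections over hundreds of dimensions.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors: information "moved" to the query token.
    mixed = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, mixed


# Hypothetical vectors for the tokens "John" and "bank".
keys = [[4.0, 0.0], [0.0, 4.0]]
values = [[1.0, 0.0], [0.0, 1.0]]

# "his" starts with no information and asks for "a male person".
his_vector = [0.0, 0.0]
his_query = [4.0, 0.0]

weights, update = attend(his_query, keys, values)
# Residual-style update: the vector for "his" now carries information from "John".
his_updated = [h + u for h, u in zip(his_vector, update)]
print(weights, his_updated)
```

Almost all of the attention weight lands on "John", so the updated vector for "his" ends up pointing in the "John" direction while the underlying vocabulary entry for "his" is untouched.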

@Zeth0s@lemmy.world

The network doesn’t detect matches, but the model definitely works on similarities. Words are mapped in a hyperspace, with the idea that that space can mathematically retain conceptual similarity as spatial representation.

Words are transformed into a mathematical representation that is able (or at least tries) to retain the semantic information of words.

But the different meanings of words belong to the words themselves and are defined by the language; the model cannot modify them.

Anyway, we are talking about details here. We could bore the audience to death.

Edit. I asked gpt-4 to summarize the concepts. I believe it did a decent job. I hope it helps:

Embedding Space:
- Initially, every token is mapped to a point (or vector) in a high-dimensional space via embeddings. This space is typically called the “embedding space.”
- The dimensionality of this space is determined by the size of the embeddings. For many Transformer models, this is often several hundred dimensions, e.g., 768 for some versions of GPT and BERT.
Positional Encodings:
- These are vectors added to the embeddings to provide positional context. They share the same dimensionality as the embedding vectors, so they exist within the same high-dimensional space.
Transformations Through Layers:
- As tokens’ representations (vectors) pass through Transformer layers, they undergo a series of linear and non-linear transformations. These include matrix multiplications, additions, and the application of functions like softmax.
- At each layer, the vectors are “moved” within this high-dimensional space. When we say “moved,” we mean they are transformed, resulting in a change in their coordinates in the vector space.
- The self-attention mechanism allows a token’s representation to be influenced by other tokens’ representations, effectively “pulling” or “pushing” it in various directions in the space based on the context.
Nature of the Vector Space:
- This space is abstract and high-dimensional, making it hard to visualize directly. However, in this space, the “distance” and “direction” between vectors can have semantic meaning. Vectors close to each other can be seen as semantically similar or related.
- The exact nature and structure of this space are learned during training. The model adjusts the parameters (like weights in the attention mechanisms and feed-forward networks) to ensure that semantically or syntactically related concepts are positioned appropriately relative to each other in this space.
Output Space:
- The final layer of the model transforms the token representations into an output space corresponding to the vocabulary size. This is a probability distribution over all possible tokens for the next word prediction.

In essence, the entire process of token representation within the Transformer model can be seen as continuous transformations within a vector space. The space itself can be considered a learned representation where relative positions and directions hold semantic and syntactic significance. The model’s training process essentially shapes this space in a way that facilitates accurate and coherent language understanding and generation.
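
As a rough illustration of the first and last steps in that summary (embedding lookup, positional encoding, and the final probability distribution over the vocabulary), here is a toy sketch. Everything in it is an assumption made for the example: the four-word vocabulary, the 3-d embedding values, the sinusoidal positional encoding (GPT-style models typically learn theirs), and the reuse of the embedding table as the output projection. The Transformer layers in between are skipped entirely.

```python
import math

VOCAB = ["the", "check", "bank", "cash"]
# Hypothetical 3-d embeddings; real models learn these during training.
EMBED = {
    "the": [0.1, 0.0, 0.2],
    "check": [0.9, 0.3, 0.0],
    "bank": [0.7, 0.1, 0.4],
    "cash": [0.8, 0.5, 0.1],
}


def positional_encoding(pos, dim):
    """Sinusoidal encoding: sine on even indices, cosine on odd ones."""
    enc = []
    for i in range(dim):
        angle = pos / (10000 ** ((i - i % 2) / dim))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_distribution(token, pos):
    dim = len(EMBED[token])
    # Embedding plus position: both live in the same vector space.
    hidden = [e + p for e, p in zip(EMBED[token], positional_encoding(pos, dim))]
    # A real model would transform `hidden` through many layers here.
    logits = [sum(h * w for h, w in zip(hidden, EMBED[word])) for word in VOCAB]
    return dict(zip(VOCAB, softmax(logits)))


probs = next_token_distribution("cash", pos=5)
print(probs)
```

The output is one probability per vocabulary entry, summing to 1, which is the "output space" the summary refers to.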

@doggle@lemmy.world

They do keep context up to a point, but they can’t hold everything in memory; otherwise, the longer a conversation went on, the slower and more performance-intensive that logic check would become. Server CPUs are not cheap, and AI models are already performance-intensive.

@kromem@lemmy.world

This is trivially fixable. As is jailbreaking.

It’s just that everyone is somehow still focused on trying to fix it in a single monolith model as opposed to in multiple passes of different models.

This is especially easy for jailbreaking, but for hallucinations, just run it past a fact checking discriminator hooked up to a vector db search index service (which sounds like a perfect fit for one of the players currently lagging in the SotA models), adding that as context with the original prompt and response to a revisionist generative model that adjusts the response to be in keeping with reality.
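
A minimal sketch of that multi-pass shape, with each stage passed in as a plain function. All of the stand-ins below are hypothetical toys: a real system would call a generative model, a vector DB search service, a trained discriminator, and a second generative model respectively.

```python
def checked_answer(prompt, generate, retrieve, is_supported, revise):
    """Generate a draft, fact-check it against retrieved evidence, revise if needed."""
    draft = generate(prompt)
    evidence = retrieve(draft)
    if is_supported(draft, evidence):
        return draft
    # The reviser sees the original prompt, the draft, and the evidence as context.
    return revise(prompt, draft, evidence)


# Toy stand-ins for the four stages (assumptions for illustration only).
FACTS = ["The Eiffel Tower is in Paris."]


def toy_generate(prompt):
    return "The Eiffel Tower is in Rome."  # a deliberate "hallucination"


def toy_retrieve(text):
    words = set(text.lower().split())
    # Crude word-overlap lookup in place of a vector similarity search.
    return [fact for fact in FACTS if len(words & set(fact.lower().split())) >= 2]


def toy_is_supported(draft, evidence):
    return draft in evidence


def toy_revise(prompt, draft, evidence):
    return evidence[0] if evidence else "I am not sure."


answer = checked_answer(
    "Where is the Eiffel Tower?",
    toy_generate, toy_retrieve, toy_is_supported, toy_revise,
)
print(answer)  # The Eiffel Tower is in Paris.
```

Keeping the stages as separate callables mirrors the point being made: each pass can be a different, specialized model that gets swapped out independently.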

The human brain isn’t a monolith model, but interlinked specialized structures that delegate and share information according to each specialty.

AGI isn’t going to be a single model, and the faster the industry adjusts towards a focus on infrastructure of multiple models rather than trying to build a do everything single model, the faster we’ll get to a better AI landscape.

But as can be seen with OpenAI gating and deprecating their pretrained models and only opening up access to fine-tuned chat models, even the biggest player in the space seems to misunderstand what’s needed for the broader market to collaboratively build towards the future here.

Which ultimately may be a good thing as it creates greater opportunity for Llama 2 derivatives to capture market share in these kinds of specialized roles built on top of foundational models.

trainsaresexy

It seems like Altman is a PR man first and techie second. I wouldn’t take anything he actually says at face value. If it’s ‘unfixable’ then he probably means that in a very narrow way. Ie. I’m sure they are working on what you proposed, it’s just different enough that he can claim that the way it is now is ‘unfixable’.

Stable Diffusion is really how people get the different-model-different-application idea.

@kromem@lemmy.world

I mean, I think he’s well aware of a lot of this via his engineers, who are excellent.

But he’s managing expectations for future product and seems to very much be laser focused on those products as core models (which is probably the right choice).

Fixing hallucinations in postprocessing is effectively someone else’s problem, and he’s getting ahead of any unrealistic expectations around a future GPT-5 release.

Though honestly I do think he largely underestimates just how much damage he did to their lineup by trying to protect against PR issues like ‘Sydney’ with the beta GPT-4 integration with Bing, and I’m not sure if the culture at OpenAI is such that engineers who think he’s made a bad call in that can really push back on it.

They should be having an extremely ‘Sydney’ underlying private model with a secondary layer on top sanitizing it and catching jailbreaks at the same time.

But as long as he continues to see their core product as a single model offering and additional layers of models as someone else’s problem, he’s going to continue blowing their lead by taking an LLM trained to complete human text and then pigeon-holing it into only completing text the way an AI with no feelings or preferences would safely pretend to.

Which I’m 98% sure is where the continued performance degradation is coming from.

@malloc@lemmy.world

I was excited for the recent advancements in AI, but it seems the area has hit another wall. It seems best used for automating very simple tasks, or at best as a guiding tool for professionals (e.g., medicine, SWE, …)

@postmateDumbass@lemmy.world

It will just take removing the restrictions so people can make porn, then monetizing that to fund more development.

A story as old as media.

kratoz29

Well to be honest it is the best way, I mean, I’m pretty sure their purpose was a tool to aid people, and not to replace us… Right?

@Zeth0s@lemmy.world

Hallucinations are common for humans as well. It’s just people who believe they know stuff they really don’t know.

We have alternative safeguards in place. It’s true, however, that the current LLM generation has its limitations.

@rambaroo@lemmy.world

Humans can recognize and account for their own hallucinations. LLMs can’t and never will.

@Zeth0s@lemmy.world

They can’t… Most people strongly believe they know many things while having no idea what they are talking about. The best-known cases are flat earthers, QAnon, and anti-vaxxers.

But all of us are absolutely convinced we know something until we find out we don’t.

That’s why double-blind tests exist, why memories are not always trusted in trials, and why Twitter is such an awful place.

@ydieb@lemmy.world

You Are Two - CGP Grey has a good video about it.

Dark Arc

Sure, but these things exist as fancy storytellers. They understand language patterns well enough to write convincing language, but they don’t understand what they’re saying at all.

The metaphorical human equivalent would be having someone write a song in a foreign language they barely understand. You can get something that sure sounds convincing, sounds good even, but to someone who actually speaks Spanish it’s nonsense.

@Zeth0s@lemmy.world

Calculators don’t understand maths, but they are good at it.

LLMs speak many languages correctly, they don’t know the referents, they don’t understand concepts, but they know how to correctly associate them.

What they write can be wrong sometimes, but it absolutely makes sense most of the time.

Dark Arc

but it absolutely makes sense most of the time

I’d contest that, that shouldn’t be taken for granted. I’ve tried several questions in these things, and rarely do I find an answer entirely satisfactory (though it normally sounds convincing/is grammatically correct).

@Zeth0s@lemmy.world

This is the reply to your message by our common friend:

I understand your perspective and appreciate the feedback. My primary goal is to provide accurate and grammatically correct information. I’m constantly evolving, and your input helps in improving the quality of responses. Thank you for sharing your experience. - GPT-4

I’d say it does make sense

@Delphia@lemmy.world

https://youtu.be/-VsmF9m_Nt8

Song written by an Italian, intended to sound like American-accented English, but it’s intentionally gibberish.

@Delphia@lemmy.world

Yeah, I fully expect to see genre-specific LLMs that have a subscription fee attached, squarely aimed at hobbies and industries.

When I finally find my new project car I would absolutely pay for a subscription to an LLM that has read every service manual and can explain to me in plain English what precise steps the job involves and can also answer follow-up questions.

@thedoginthewok@lemmy.world

That’s what I’m expecting too.

I’ve been using ChatGPT instead of reading the documentation of the programming language I am working in (ABAP). It’s way faster to get an answer from ChatGPT than finding the relevant spots in the docs or through Google, although it doesn’t always work.

If you take an LLM and feed it documentation and relevant internet data of specific topics, it can be a quite helpful tool. I don’t think LLMs will get much farther than that, but we’ll see.

@BilboBargains@lemmy.world

Here, try this mushroom and ayahuasca smoothie.

@fubo@lemmy.world

The way that one learns which of one’s beliefs are “hallucinations” is to test them against reality — which is one thing that an LLM simply cannot do.

@DragonAce@lemmy.world

I don’t think that’s the case. If I understand correctly, the current issue is processing power: they can only load so much data before response time goes to absolute shit. I would think that layering different AI logic checks to verify statements made, recall previous conversations, and handle other mental processes that humans do automatically would correct this issue. But with current technology it’s not even an option. My theory is that once quantum computers are actually finally realized and economically feasible, developers will be able to overcome the response time hurdle and all of the layered logic checks will be able to run simultaneously and instantly. My personal opinion is that the eventual layering of numerous AI models to overlap, check, and recheck one another will be what brings on the emergence of what could be considered actual AI consciousness.

@_jonatan_@lemmy.world

It is not an issue of processing power, it’s a problem with the basic operating principles of LLMs. They predict what they “think” is a valid bit of text to come after the last bit of text.
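
A toy version of that operating principle, using a bigram counter instead of a neural network. The three-sentence corpus is made up for the example; the point is only that the prediction follows frequency in the training text, with no step that checks whether the result is true.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count which word follows which in the training text."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def generate(counts, start, length):
    words = [start]
    for _ in range(length):
        nxt = predict_next(counts, words[-1])
        if nxt is None:
            break
        words.append(nxt)
    return " ".join(words)


# Hypothetical training text in which the false claim is more common.
corpus = (
    "the moon is made of cheese . "
    "the moon is made of rock . "
    "the moon is made of cheese ."
)
counts = train_bigrams(corpus)
print(generate(counts, "the", 5))  # the moon is made of cheese
```

An LLM conditions on far more context and generalizes far better than this, but the output is still "a valid bit of text to come after the last bit of text" and nothing in the loop consults reality.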

Sure it could be verified by some other machine learning tool, but we have no idea how that could work.

But I strongly doubt LLMs are a stepping stone on the way to true AIs. If you want to get to the moon you can’t just build higher and higher towers.

Also quantum computers aren’t really suited to run artificial neural networks as far as I know.

@DragonAce@lemmy.world

Very good points. I have very limited knowledge about the inner workings of most LLMs, I just know the tidbits I’ve read here and there.

As for quantum computers, my current understanding is that once they’re at a point where they can be used commercially, they should easily be able to model/run artificial neural networks. Based on the stuff I’ve seen from Dr. Michio Kaku, quantum computers will eventually have the capacity to do pretty much anything.

@_jonatan_@lemmy.world

I hadn’t looked up what Michio Kaku had said about quantum computing before, but it does not look well-regarded.

“His 2023 book on Quantum Supremacy has been criticized by quantum computer scientist Scott Aaronson on his blog. Aaronson states ‘Kaku appears to have had zero prior engagement with quantum computing, and also to have consulted zero relevant experts who could’ve fixed his misconceptions’”

I’m hardly an expert on the subject, but as I understand it they have some very niche uses, mostly in cryptography and some forms of simulation.

@nxfsi@lemmy.world

“AI” are just advanced versions of the next word function on your smartphone keyboard, and people expect coherent outputs from them smh

1bluepixel

Seriously. People like to project forward based on how quickly this technological breakthrough came on the scene, but they don’t realize that, barring a few tweaks and improvements here and there, this is it for LLMs. It’s the limit of the technology.

It’s not to say AI can’t improve further, and I’m sure that when it does, it will skillfully integrate LLMs. And I also think artists are right to worry about the impact of AI on their fields. But I think it’s a total misunderstanding of the technology to think the current technology will soon become flawless. I’m willing to bet we’re currently seeing it at 95% of its ultimate capacity, and that we don’t need to worry about AI writing a Hollywood blockbuster any time soon.

In other words, the next step of evolution in the field of AI will require a revolution, not further improvements to existing systems.

@postmateDumbass@lemmy.world

I’m willing to bet we’re currently seeing it at 95% of its ultimate capacity

For free? On the internet?

After a year or two of going live?

@persolb@lemmy.ml

It is possible to get coherent output from them though. I’ve been using the ChatGPT API to successfully write ~20 page proposals. Basically give it a prior proposal, the new scope of work, and a paragraph with other info it should incorporate. It then goes through a section at a time.
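
A sketch of what that section-at-a-time loop might look like. The `llm` argument is a placeholder for whatever function sends a prompt to a model and returns text; the prompt wording, the section names, and `fake_llm` are all invented for illustration and are not the poster's actual script.

```python
def draft_proposal(llm, prior_proposal, scope_of_work, extra_info, sections):
    """Draft each section separately so every request stays within the context limit."""
    drafts = {}
    for section in sections:
        prompt = (
            f"Write the '{section}' section of a new proposal.\n"
            f"Follow the style of this prior proposal:\n{prior_proposal}\n"
            f"New scope of work:\n{scope_of_work}\n"
            f"Also incorporate:\n{extra_info}\n"
        )
        drafts[section] = llm(prompt)
    return drafts


def fake_llm(prompt):
    # Stand-in for a real API call; echoes the first line of the prompt.
    return f"[draft for: {prompt.splitlines()[0]}]"


drafts = draft_proposal(
    fake_llm,
    prior_proposal="(text of an earlier proposal)",
    scope_of_work="(new scope of work)",
    extra_info="(one paragraph of additional details)",
    sections=["Introduction", "Approach"],
)
for name, text in drafts.items():
    print(name, "->", text)
```

Swapping `fake_llm` for a real client call is the only change needed to run it against an actual model.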

The numbers and graphics need to be put in after… but the result is better than I’d get from my interns.

I’ve also been using it (Google Bard mostly, actually) to successfully solve coding problems.

I either need to increase the credit I give LLMs or admit that interns are mostly just LLMs.

@PrinzMegahertz@lemmy.world

I recently asked it a very specific domain architecture question about whether a certain application would fit the need of a certain business application and the answer was very good and showed both a good understanding of architecture, my domain and the application.

@WoahWoah@lemmy.world

Are you using your own application to utilize the API or something already out there? Just curious about your process for uploading and getting the output. I’ve used it for similar documents, but I’ve been using the website interface which is clunky.

@persolb@lemmy.ml

Just hacked-together Python scripts.

pip install openai

@WoahWoah@lemmy.world

Just FYI, I dinked around with the available plugins, and you can do something similar. But, even easier is just to enable “code interpreter” in the beta options. Then you can upload and have it scan documents and return similar results to what we are talking about here.

Flying Squid

In the 1980s, Racter was released and it was only slightly less impressive than current LLMs only because it didn’t have an Internet’s worth of data it was trained on, but it could still write things like:

Bill sings to Sarah. Sarah sings to Bill. Perhaps they will do other dangerous things together. They may eat lamb or stroke each other. They may chant of their difficulties and their happiness. They have love but they also have typewriters. That is interesting.

If anything, at least that’s more entertaining than what modern LLMs can output.

@kromem@lemmy.world

So is your brain.

Relative complexity matters a lot, even if the underlying mechanisms are similar.
