Tech experts are starting to doubt that ChatGPT and A.I. ‘hallucinations’ will ever go away: ‘This isn’t fixable’
Experts are starting to doubt it, and even OpenAI CEO Sam Altman is a bit stumped.
I don’t understand why they don’t use a second model to detect falsehoods instead of trying to fix them in the original LLM.
AI models are already computationally intensive. This would instantly double the overhead. Also, being able to detect problems does not mean you’re able to fix them.
More than double, as query size is very much connected to the effective cost of the generation, and you’d need to include both the query and initial response in that second pass.
Then - you might need to make an API call to a search engine or knowledge DB to fact check it.
And include that data as context along with the query and initial response to whatever decides if it’s BS.
So for a dumb realtime chat application, no one is going to care enough to slow it down and multiply costs to avoid hallucinations.
But for AI replacing a $120,000 salaried role in writing up a white paper on some raw data analysis, a 10-30x increase over a $0.15 query is more than acceptable.
So you will see this approach taking place in enterprise scenarios and professional settings, even if we may never see them in chatbots.
Paying 2+ times the cost on every query to fix something that makes fewer than 5% of responses unusable isn’t a trade-off that people are willing to make for chat applications.
This is the same approach that fixes jailbreaking.
You absolutely will see this as more business critical integrations occur - it just still probably won’t be in broad consumer facing realtime products.
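A back-of-the-envelope sketch of the cost argument above, in Python. The per-token prices and token counts are made-up placeholders, not real pricing. The only point it illustrates is that the checking pass has to re-read the query, the first response, and any retrieved context, so it costs more than the first pass did.

```python
def call_cost(input_tokens, output_tokens, in_price, out_price):
    """Cost of one model call; prices are per 1,000 tokens."""
    return (input_tokens * in_price + output_tokens * out_price) / 1000


def pipeline_cost(query_tokens, response_tokens, context_tokens,
                  in_price=0.03, out_price=0.06):
    # Prices above are hypothetical placeholders, not any vendor's rates.
    # Pass 1: generate the initial response from the query alone.
    first = call_cost(query_tokens, response_tokens, in_price, out_price)
    # Pass 2: the checker reads query + first response + retrieved context,
    # then writes a revised response of roughly the same length.
    second = call_cost(query_tokens + response_tokens + context_tokens,
                       response_tokens, in_price, out_price)
    return first, first + second


single, checked = pipeline_cost(query_tokens=500, response_tokens=500,
                                context_tokens=2000)
print(f"single pass: ${single:.3f}, with checking pass: ${checked:.3f}, "
      f"ratio: {checked / single:.1f}x")
```

With these placeholder numbers the checked pipeline comes out at more than double the single pass, and the gap widens as more retrieved context is added.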
And then they can use a third model to detect falsehoods in the second model and a fourth model to detect falsehoods in the third model and… well, it’s LLMs all the way down.
The LLM Centipede
Token Ring AI
Meanwhile everyone is terrified that ChatGPT is going to take their job. Yeah, we are a looooooooooong way off from that.
Not ChatGPT, but other new AI stuff is likely to take a few jobs. Actors and voice actors, among others.
You mean the free version from a website.
Think about the powerful ones. Government ones. Wall Street ones. Etc.
Yet I’ve still seen many people clamoring that we won’t have jobs in a few years. People SEVERELY overestimate the ability of all things AI. From self-driving to taking jobs, this stuff is not going to take over the world anytime soon.
Idk, an AI delivering low-quality results for free is a lot more cash money than paying someone an almost living wage to perform a job with better results. I think corporations won’t care, and the only barrier will be whether the job in question involves too much physical labor to be performed by an AI.
AI isn’t free. Right now, an LLM takes a not-insignificant hardware investment to run and a lot of manual human labor to train. And there’s a whole lot of unknown and untested legal liability.
Smaller more purpose-driven generative AIs are cheaper, but the total cost picture is still a bit hazy. It’s not always going to be cheaper than hiring humans. Not at the moment, anyway.
They already do this with chatbots and phone trees. This is just a slightly better version. Nothing new.
Right, but that’s the point, right? This will grow and more jobs will become obsolete because of the amount of work AI can generate. It won’t take over every job. I think most people will use AI as a tool at the individual level, but companies will use it to gut many departments. Now they would just need one editor to review 20 articles instead of 20 people to write said articles.
As long as you can’t describe an objective loss function, it will never stop “hallucinating”. Loss scores are necessary to get predictable outputs.
In my limited experience the issue is often that the “chatbot” doesn’t even check what it says now against what it said a few paragraphs above. It contradicts itself in very obvious ways. Shouldn’t a different algorithm that adds some sort of separate logic check be able to help tremendously? Or a check to ensure recipes are edible (for this specific application)? A bit like those physics-informed NNs.
You, in your “limited experience” pretty much exactly described the fix.
The problem is that most of the applications right now of LLMs are low hanging fruit because it’s so new.
And those low hanging fruit examples are generally averse to paying 2-10x the query cost in both time and money just to fix things like jailbreaking or hallucinations, which is what multiple passes, especially with additional context lookups, would require.
But in the next 18 months you will very likely see multiple companies going after exactly these kinds of scenarios, with a focus on more business critical LLM integrations.
To put it in perspective, this is like people looking at AIM messenger back in the day and saying that the New York Times has nothing to worry about regarding the growth of social media.
We’re still very much in the infancy of this technology in real world application, and because of that infancy, a lot of the issues that are inherent to the core product and can’t be fixed within it don’t yet have mature secondary markets built around fixing those shortcomings.
So far, yours was actually the most informed comment in this thread I’ve seen - well done!
Thanks! And thanks for your insights. Yes, I meant that my experience using LLMs is limited to just asking Bing Chat questions about everyday problems like I would with a friend that “knows everything”. But I never looked at the science of formulating “perfect prompts” like I sometimes hear about. I do have some experience in AI/ML development in general.
Maybe, but it might not be that simple. The issue is that one would have to design that logic in a manner that can be verified by a human. At that point the logic would be quite specific to a single task and not generally useful at all. At that point the benefit of the AI is almost nil.
And if there were an algorithm that was better at determining what was or was not the goal, why is that algorithm not used in the first place?
That’s called context. For ChatGPT it is a bit less than 4k words. Using the API it goes up to a bit less than 32k. Alternative models go up to a bit less than 64k.
The model wouldn’t know anything you said before that.
That is one of the biggest limitations of the current generation of LLMs.
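A minimal Python sketch of what that limit does to a conversation. It assumes a plain "keep the newest messages that fit" policy and approximates tokens by whitespace-separated words. Both are simplifications for illustration; real chat products and tokenizers may behave differently.

```python
def trim_to_context(messages, max_tokens):
    """Keep the most recent messages that fit in the token budget.

    Token counts are approximated by whitespace-separated words, which is
    an assumption for illustration; real tokenizers count differently.
    """
    kept, used = [], 0
    for message in reversed(messages):      # walk from newest to oldest
        size = len(message.split())
        if used + size > max_tokens:
            break                           # everything older is dropped
        kept.append(message)
        used += size
    return list(reversed(kept))             # restore chronological order


history = [
    "my name is Ada",
    "I am allergic to peanuts",
    "suggest a dessert recipe",
]
visible = trim_to_context(history, max_tokens=8)
print(visible)
```

With a budget this small only the last message survives, so the model answering the recipe request never sees the allergy. That is the "contradicts what it said earlier" failure in miniature.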
That’s not 100% true. They also work by modifying the meanings of words based on context, and those modified meanings then propagate indefinitely forwards. But yes, direct context is limited, so things outside it aren’t directly used.
They don’t really change the meaning of the words, they just look for the “best” words given the recent context, by taking into account the different possible meanings of the words.
No, they do. That’s one of the key innovations of LLMs: the attention and feed-forward steps, where they propagate information from related words into each other based on context. From https://www.understandingai.org/p/large-language-models-explained-with?r=cfv1p
That’s exactly what I said
The words’ meanings haven’t changed, but the model can choose based on the context, accounting for the different meanings of words.
This is the bit you are missing: the attention network actively changes the token vectors depending on context, which transfers new information into the meaning of that word.
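A toy Python sketch of what "the attention step changes the token vectors" means mechanically. It is heavily simplified: one head, no learned query/key/value projections, and hand-made 2-d vectors that are pure assumptions for illustration. The same starting vector for "bank" comes out different depending on its neighbour.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(vectors, index):
    """Return a context-mixed vector for the token at `index`.

    Simplified dot-product self-attention: no learned projections,
    a single head, and the token vectors are hand-made.
    """
    query = vectors[index]
    scores = [sum(q * k for q, k in zip(query, key)) for key in vectors]
    weights = softmax(scores)
    dims = len(query)
    return [sum(w * vec[d] for w, vec in zip(weights, vectors))
            for d in range(dims)]


# Hypothetical 2-d embeddings: axis 0 ~ "finance", axis 1 ~ "nature".
bank = [1.0, 1.0]
money = [2.0, 0.0]
river = [0.0, 2.0]

bank_near_money = attend([bank, money], index=0)
bank_near_river = attend([bank, river], index=0)
print(bank_near_money, bank_near_river)
```

The input vector for "bank" is identical in both calls. Only the output differs, leaning toward the finance axis next to "money" and toward the nature axis next to "river".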
The network doesn’t detect matches, but the model definitely works on similarities. Words are mapped in a hyperspace, with the idea that that space can mathematically retain conceptual similarity as spatial representation.
Words are transformed into a mathematical representation that is able (or at least tries) to retain the semantic information of words.
But the different meanings of the different words belong to the words themselves and are defined by the language; the model cannot modify them.
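To make the "similarity as spatial representation" point concrete, here is a small Python sketch using cosine similarity. The three vectors are hand-picked stand-ins, not real learned embeddings, and real models use hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-picked 3-d vectors standing in for learned embeddings (assumption).
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
print(f"cat~dog: {cat_dog:.2f}, cat~car: {cat_car:.2f}")
```

Vectors that point in nearly the same direction score close to 1, which is the sense in which nearby points in the space are treated as conceptually similar.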
Anyway, we are talking about details here. We could kill the audience with boredom.
Edit: I asked GPT-4 to summarize the concepts. I believe it did a decent job. I hope it helps:
Embedding Space:
Positional Encodings:
Transformations Through Layers:
Nature of the Vector Space:
Output Space:
In essence, the entire process of token representation within the Transformer model can be seen as continuous transformations within a vector space. The space itself can be considered a learned representation where relative positions and directions hold semantic and syntactic significance. The model’s training process essentially shapes this space in a way that facilitates accurate and coherent language understanding and generation.
They do keep context to a point, but they can’t hold everything in their memory, otherwise the longer a conversation went on, the slower and more performance intensive that logic check would become. Server CPUs are not cheap, and AI models are already performance intensive.
This is trivially fixable. As is jailbreaking.
It’s just that everyone is somehow still focused on trying to fix it in a single monolith model as opposed to in multiple passes of different models.
This is especially easy for jailbreaking. For hallucinations, just run the response past a fact checking discriminator hooked up to a vector DB search index service (which sounds like a perfect fit for one of the players currently lagging in the SotA models), then add the results as context, along with the original prompt and response, to a revisionist generative model that adjusts the response to be in keeping with reality.
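Roughly, the multi-pass idea could be wired up like the Python sketch below. Every component is passed in as a plain function, and the stand-ins at the bottom are trivial fakes so it runs without any model or database. The names `generate`, `retrieve`, `is_supported`, and `revise` are hypothetical, not any real library's API.

```python
from typing import Callable, List


def checked_answer(
    prompt: str,
    generate: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    is_supported: Callable[[str, List[str]], bool],
    revise: Callable[[str, str, List[str]], str],
) -> str:
    """Generate, fact-check against retrieved evidence, revise if needed."""
    draft = generate(prompt)
    evidence = retrieve(draft)                 # e.g. a vector DB lookup
    if is_supported(draft, evidence):          # the discriminator pass
        return draft
    return revise(prompt, draft, evidence)     # the revision pass


# Stand-in components (assumptions) so the sketch runs offline.
FACTS = ["The Eiffel Tower is in Paris."]

answer = checked_answer(
    "Where is the Eiffel Tower?",
    generate=lambda p: "The Eiffel Tower is in Rome.",
    retrieve=lambda draft: [f for f in FACTS if "Eiffel" in draft and "Eiffel" in f],
    is_supported=lambda draft, ev: draft in ev,
    revise=lambda p, draft, ev: ev[0] if ev else "I don't know.",
)
print(answer)
```

Each of the three extra steps is its own call, which is where the added latency and cost discussed elsewhere in the thread come from.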
The human brain isn’t a monolith model, but interlinked specialized structures that delegate and share information according to each specialty.
AGI isn’t going to be a single model, and the faster the industry adjusts towards a focus on infrastructure of multiple models rather than trying to build a do everything single model, the faster we’ll get to a better AI landscape.
But as can be seen with OpenAI gating and deprecating their pretrained models and only opening up access to fine tuned chat models, even the biggest player in the space seems to misunderstand what’s needed for the broader market to collaboratively build towards the future here.
Which ultimately may be a good thing as it creates greater opportunity for Llama 2 derivatives to capture market share in these kinds of specialized roles built on top of foundational models.
It seems like Altman is a PR man first and techie second. I wouldn’t take anything he actually says at face value. If it’s ‘unfixable’ then he probably means that in a very narrow way. Ie. I’m sure they are working on what you proposed, it’s just different enough that he can claim that the way it is now is ‘unfixable’.
Stable Diffusion is really how people get the different-model-different-application idea.
I mean, I think he’s well aware of a lot of this via his engineers, who are excellent.
But he’s managing expectations for future product and seems to very much be laser focused on those products as core models (which is probably the right choice).
Fixing hallucinations in postprocessing is effectively someone else’s problem, and he’s getting ahead of any unrealistic expectations around a future GPT-5 release.
Though honestly I do think he largely underestimates just how much damage he did to their lineup by trying to protect against PR issues like ‘Sydney’ with the beta GPT-4 integration with Bing, and I’m not sure if the culture at OpenAI is such that engineers who think he’s made a bad call in that can really push back on it.
They should be having an extremely ‘Sydney’ underlying private model with a secondary layer on top sanitizing it and catching jailbreaks at the same time.
But as long as he continues to see their core product as a single model offering and additional layers of models as someone else’s problem, he’s going to continue blowing their lead by taking an LLM trained to complete human text and then pigeon-holing it into only completing text like an AI with no feelings and preferences would safely pretend to.
Which I’m 98% sure is where the continued performance degradation is coming from.
I was excited for the recent advancements in AI, but it seems the area has hit another wall. It seems best used for automating very simple tasks, or at best as a guiding tool for professionals (i.e., medicine, SWE, …)
It will just take removing the restrictions so people can make porn, then monetizing that to fund more development.
A story as old as media.
Well to be honest it is the best way, I mean, I’m pretty sure their purpose was a tool to aid people, and not to replace us… Right?
Hallucinations are common for humans as well. It’s just people who believe they know stuff they really don’t know.
We have alternative safeguards in place. It’s true, however, that the current LLM generation has its limitations.
Humans can recognize and account for their own hallucinations. LLMs can’t and never will.
They can’t… Most people strongly believe they know many things while they have no idea what they are talking about. The best-known cases are flat earthers, QAnon, and anti-vaxxers.
But all of us are absolutely convinced we know something until we find out we don’t.
That’s why double-blind tests exist, why memories are not always trusted in trials, why Twitter is such an awful place.
You Are Two - CGP Grey has a good video about it.
Sure, but these things exist as fancy storytellers. They understand language patterns well enough to write convincing language, but they don’t understand what they’re saying at all.
The metaphorical human equivalent would be having someone write a song in a foreign language they barely understand. You can get something that sure sounds convincing, sounds good even, but to someone who actually speaks Spanish it’s nonsense.
Calculators don’t understand maths, but they are good at it.
LLMs speak many languages correctly, they don’t know the referents, they don’t understand concepts, but they know how to correctly associate them.
What they write can be wrong sometimes, but it absolutely makes sense most of the time.
I’d contest that; it shouldn’t be taken for granted. I’ve tried several questions in these things, and rarely do I find an answer entirely satisfactory (though it normally sounds convincing/is grammatically correct).
This is the reply to your message by our common friend:
I’d say it does make sense
https://youtu.be/-VsmF9m_Nt8
Song written by an Italian, intended to sound like American-accented English, but it’s intentionally gibberish.
Yeah, I fully expect to see genre-specific LLMs with a subscription fee attached, aimed squarely at hobbies and industries.
When I finally find my new project car I would absolutely pay for a subscription to an LLM that has read every service manual and can explain to me in plain English what precise steps the job involves and can also answer followup questions.
That’s what I’m expecting too.
I’ve been using ChatGPT instead of reading the documentation of the programming language I am working in (ABAP). It’s way faster to get an answer from ChatGPT than finding the relevant spots in the docs or through Google, although it doesn’t always work.
If you take an LLM and feed it documentation and relevant internet data of specific topics, it can be a quite helpful tool. I don’t think LLMs will get much farther than that, but we’ll see.
Here, try this mushroom and Ayahuasca smoothie.
The way that one learns which of one’s beliefs are “hallucinations” is to test them against reality — which is one thing that an LLM simply cannot do.
I don’t think that’s the case. If I understand correctly, the current issue is processing power: they can only load so much data before response time goes to absolute shit. I would think that layering different AI logic checks to verify statements made, recall previous conversations, and handle other mental processes that humans do automatically would correct this issue. But with current technology it’s not even an option. My theory is that once quantum computers are actually finally realized and economically feasible, developers will be able to overcome the response time hurdle, and all of the layered logic checks will be able to run simultaneously and instantly. My personal opinion is that the eventual layering of numerous AI models to overlap, check, and recheck one another will be what brings on the emergence of what could be considered actual AI consciousness.
It is not an issue of processing power, it’s a problem with the basic operating principles of LLMs. They predict what they “think” is a valid bit of text to come after the last bit of text.
Sure it could be verified by some other machine learning tool, but we have no idea how that could work.
But I strongly doubt LLMs are a stepping stone on the way to true AIs. If you want to get to the moon you can’t just build higher and higher towers.
Also quantum computers aren’t really suited to run artificial neural networks as far as I know.
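For anyone who wants to see the "predict a valid next bit of text" principle in miniature, here is a toy Python sketch. It is a word-level bigram counter, which is vastly simpler than an LLM and only an analogy. The tiny corpus is made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count which word follows which in the training text."""
    follows = defaultdict(Counter)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def continue_text(follows, start, length):
    """Greedily append the most frequent next word, up to `length` times."""
    out = [start]
    for _ in range(length):
        options = follows.get(out[-1])
        if not options:
            break                      # no known continuation for this word
        out.append(options.most_common(1)[0][0])
    return " ".join(out)


# Made-up training text (assumption for illustration).
corpus = "the moon is made of rock . the cheese is made of milk ."
model = train_bigrams(corpus)
generated = continue_text(model, "cheese", 4)
print(generated)
```

Every step picks a locally plausible next word, and nothing in the loop checks the finished sentence against reality. After "made of" the counter has seen both "rock" and "milk" equally often and has no way to know which one fits cheese.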
Very good points. I have very limited knowledge about the inner workings of most LLMs, I just know the tidbits I’ve read here and there.
As for quantum computers, my current understanding is that once they’re at a point where they can be used commercially, they should easily be able to model/run artificial neural networks. Based on the stuff I’ve seen from Dr. Michio Kaku, quantum computers will eventually have the capacity to do pretty much anything.
I hadn’t looked up what Michio Kaku had said about quantum computing before, but it does not look well-regarded.
“His 2023 book on Quantum Supremacy has been criticized by quantum computer scientist Scott Aaronson on his blog. Aaronson states ‘Kaku appears to have had zero prior engagement with quantum computing, and also to have consulted zero relevant experts who could’ve fixed his misconceptions.’”
I’m hardly an expert on the subject, but as I understand it they have some very niche uses, mostly in cryptography and some forms of simulation.
“AI” are just advanced versions of the next word function on your smartphone keyboard, and people expect coherent outputs from them smh
Seriously. People like to project forward based on how quickly this technological breakthrough came on the scene, but they don’t realize that, barring a few tweaks and improvements here and there, this is it for LLMs. It’s the limit of the technology.
It’s not to say AI can’t improve further, and I’m sure that when it does, it will skillfully integrate LLMs. And I also think artists are right to worry about the impact of AI on their fields. But I think it’s a total misunderstanding of the technology to think the current technology will soon become flawless. I’m willing to bet we’re currently seeing it at 95% of its ultimate capacity, and that we don’t need to worry about AI writing a Hollywood blockbuster any time soon.
In other words, the next step of evolution in the field of AI will require a revolution, not further improvements to existing systems.
For free? On the internet?
After a year or two of going live?
It is possible to get coherent output from them though. I’ve been using the ChatGPT API to successfully write ~20 page proposals. Basically give it a prior proposal, the new scope of work, and a paragraph with other info it should incorporate. It then goes through a section at a time.
The numbers and graphics need to be put in after… but the result is better than I’d get from my interns.
I’ve also been using it (google Bard mostly actually) to successfully solve coding problems.
I either need to increase the credit I give LLMs or admit that interns are mostly just LLMs.
I recently asked it a very specific domain architecture question about whether a certain application would fit the need of a certain business application and the answer was very good and showed both a good understanding of architecture, my domain and the application.
Are you using your own application to utilize the API or something already out there? Just curious about your process for uploading and getting the output. I’ve used it for similar documents, but I’ve been using the website interface which is clunky.
Just hacked-together Python scripts.
pip install openai
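For a rough idea of what a section-at-a-time script like that could look like, here is a Python sketch. It is not the commenter's actual code. The `complete` argument is a hypothetical wrapper around whatever chat API you use (prompt string in, model text out), and a stub stands in for it so the example runs offline.

```python
from typing import Callable, Dict


def draft_proposal(
    prior_proposal: Dict[str, str],
    scope_of_work: str,
    extra_info: str,
    complete: Callable[[str], str],
) -> Dict[str, str]:
    """Rewrite a prior proposal one section at a time.

    `complete` is a hypothetical wrapper around a chat-completion API;
    it takes a prompt string and returns the model's text.
    """
    new_sections = {}
    for title, old_text in prior_proposal.items():
        prompt = (
            f"Rewrite the '{title}' section of a proposal.\n\n"
            f"Prior version:\n{old_text}\n\n"
            f"New scope of work:\n{scope_of_work}\n\n"
            f"Other information to incorporate:\n{extra_info}\n"
        )
        new_sections[title] = complete(prompt)
    return new_sections


# Stub model so the sketch runs offline; swap in a real API call.
def fake_complete(prompt: str) -> str:
    return f"[draft based on {len(prompt)} characters of prompt]"


result = draft_proposal(
    {"Background": "We surveyed 3 sites.", "Approach": "Phase 1, then Phase 2."},
    scope_of_work="Survey 5 sites in 2 phases.",
    extra_info="Client prefers monthly reporting.",
    complete=fake_complete,
)
for title, text in result.items():
    print(title, "->", text)
```

Going one section at a time keeps each prompt well inside the context limit, at the price of one API call per section.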
Just FYI, I dinked around with the available plugins, and you can do something similar. But, even easier is just to enable “code interpreter” in the beta options. Then you can upload and have it scan documents and return similar results to what we are talking about here.
In the 1980s, Racter was released and it was only slightly less impressive than current LLMs only because it didn’t have an Internet’s worth of data it was trained on, but it could still write things like:
If anything, at least that’s more entertaining than what modern LLMs can output.
So is your brain.
Relative complexity matters a lot, even if the underlying mechanisms are similar.