ChatGPT’s voice mode is getting worse, and here’s a possible reason



I use ChatGPT’s voice mode a lot. After the release of Advanced Voice Mode, it quickly became an essential tool for me to practice my English, get quick answers, or just kill time.

But recently, or more precisely since the release of GPT-5, I’ve noticed a conspicuous downgrade in voice mode.

The first notable issue is that it parrots back your custom instructions. If you include “Respond in a natural tone” in your custom instructions, the voice-mode response will almost certainly open with something like “Absolutely, I’ll respond in a natural tone as you requested,” which sounds unnatural and repetitive.

More importantly, ChatGPT’s personality has changed in voice mode.

Its responses have become flat and reactive. Most of the time, it just echoes what you say or paraphrases it without offering new ideas. Even worse, it sometimes interrupts itself, hallucinates things you didn’t say, and responds to those hallucinations.

Why is this happening?

The reason seems simple: they’re not using the same model.

You may have noticed that even though GPT-4o is now labeled a “Legacy model,” when you use voice mode and check that conversation in text chat, it still says GPT-4o, not GPT-5.

When voice mode was first announced in the article “ChatGPT can now see, hear, and speak”, GPT-4o didn’t exist yet. The model back then couldn’t understand or generate audio directly.

Instead, OpenAI used a separate speech recognition model to transcribe your voice into text and a TTS model to read out the model’s text response.
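
For reference, a cascaded pipeline of that kind looks roughly like this with today’s API (a minimal sketch; the whisper-1, gpt-4, and tts-1 model names are stand-ins, since the exact models OpenAI ran internally aren’t public):

```python
# Minimal sketch of a cascaded voice pipeline: speech-to-text -> chat model -> text-to-speech.
# Model names are illustrative stand-ins, not the exact models OpenAI used internally.
from openai import OpenAI

client = OpenAI()

# 1. Transcribe the user's speech into text.
with open("user_audio.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

# 2. Get a text reply from the same chat model used for text conversations.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
reply_text = reply.choices[0].message.content

# 3. Read the reply aloud with a separate TTS model.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply_text)
speech.write_to_file("assistant_reply.mp3")
```

Every turn chains three separate model calls, which is a big part of the latency problem described below.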

This setup had several drawbacks. Because the model couldn’t perceive audio directly, it couldn’t understand tone or emotion. Similarly, the TTS voice often sounded flat and emotionless. And the biggest issue was latency: responses often took 3–4 seconds, which felt unnatural in a real conversation.

However, there was one major advantage: you were actually talking to the same model you texted with. The personality and reasoning remained consistent between voice and text.

Later, OpenAI switched to a more native solution: a single multimodal model, the GPT-4o we’re now familiar with, that could understand both text and audio directly. This made conversations more natural and reduced latency.
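
On the developer platform, the closest public analogue of this single-model approach is the audio-capable GPT-4o endpoint. A minimal sketch, assuming the gpt-4o-audio-preview snapshot (what the ChatGPT app actually runs may differ):

```python
# Minimal sketch of the single-model approach: one audio-capable model both
# understands the user's speech and generates a spoken reply.
# gpt-4o-audio-preview is the public API analogue; the app's internal model may differ.
import base64
from openai import OpenAI

client = OpenAI()

with open("user_audio.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[{
        "role": "user",
        "content": [
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
)

# The spoken reply comes back as base64-encoded audio with a transcript attached.
message = completion.choices[0].message
with open("assistant_reply.wav", "wb") as out:
    out.write(base64.b64decode(message.audio.data))
print(message.audio.transcript)
```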


Note: the following content contains personal suppositions that haven’t been verified yet.


But here’s the catch: in the ChatGPT app, the GPT-4o used for voice mode is probably not the same model as the GPT-4o used for text.
It seems OpenAI created a smaller, faster multimodal variant optimized for real-time voice conversation, a smart compromise between speed and depth.
That’s why responses in voice mode sound simpler or less expressive than in text.


Remember when I said the current version of voice mode still uses GPT-4o? That’s another interesting point, because GPT-5 very likely lacks native audio understanding and generation, or is at least worse at it.

When you go to the OpenAI developer platform and look at the audio section, you will find there’s no model whose name starts with gpt-5 at all!
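
You can also check this programmatically. Here’s a small sketch that lists the models exposed by the API and filters for anything audio-related in the GPT-5 family (the keyword filter is my own rough heuristic):

```python
# List the models available on the API and look for audio/realtime models
# in the GPT-5 family. The keyword filter is a rough heuristic.
from openai import OpenAI

client = OpenAI()

audio_keywords = ("audio", "realtime", "tts", "transcribe")
gpt5_audio_models = [
    model.id
    for model in client.models.list()
    if model.id.startswith("gpt-5") and any(k in model.id for k in audio_keywords)
]

# If the observation above holds, this prints an empty list.
print(gpt5_audio_models)
```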

It’s possible that they will train a GPT-5 model that emphasizes multimodal understanding and call it GPT-5o, but given that they still prioritize text chat and have limited compute, I don’t expect this to be fixed anytime soon.


If you look at social media like Twitter, you’ll find a lot of people who use voice mode as a companion or even their closest friend. I really wish OpenAI would pay more attention to this feature, because it means a lot to people in this socially isolated digital era.
