Long before ink touched paper, human civilizations used the spoken word as a way to store, retrieve, and spread information and knowledge. As one of the most natural ways of human communication, voice will become one of the most intuitive and powerful interfaces for interacting with advanced AI models.
Recent advancements in text-to-speech (TTS) synthesis and speech-to-text recognition are producing more realistic, human-like Voice AI. In particular, transformer-based approaches improve contextual and emotional intelligence, i.e. the nuance of knowing how to say something, while novel multimodal techniques improve prosody (the speech patterns that convey expressiveness). This marks a clear advance over earlier concatenative methods, which produced robotic, choppy, staccato-like speech.
By 2026, it is estimated that 90% of online content will be synthetically generated. And humans aren't as good at detecting synthetic voices as we might think. In a 2023 study funded by the Dawes Centre for Future Crime, researchers found that participants could distinguish deepfakes from human voices only 70% of the time in a test environment, and the study suggests we would likely do worse in real-life scenarios.
Like every technology, Voice AI can be used for positive and for more sinister purposes. To master the future, we believe we have to design with it. This living research report looks into the emergent use cases, key technical concepts, and forecasts for a technology poised to become far more prevalent in the near future.