Strange Ventures
A Living Research Report

Voice AI

By Tara Tan and Daniella Levitan
Design by Jacob Waites
Last updated: Sep 12 2023



Introduction

Long before ink touched paper, human civilizations used the spoken word as a way to store, retrieve, and spread information and knowledge. As one of the most natural ways of human communication, voice will become one of the most intuitive and powerful interfaces for interacting with advanced AI models.
  1. Zero UI:
    Voice interactions are an intuitive way to navigate advanced AI models like knowledge graphs (KGs), large language models (LLMs), and more. 
  2. Human-Machine Feedback Loops:
    Real-time feedback loops enhance the learning curve of these algorithms, paving the way for more personalized and efficient AI experiences.
Recent advancements in text-to-speech (TTS) synthesis and speech-to-text (STT) recognition are leading to more realistic and human-like Voice AI. In particular, novel approaches like transformers improve contextual and emotional intelligence, i.e. the nuances of knowing how to say something. Novel multimodal techniques also improve prosody (speech patterns that convey expressiveness) over the earlier concatenative methods, which produced more robotic, choppy, staccato-like speech.

Listen to audio samples from various papers

By 2026, it is estimated that 90% of online content will be synthetically generated. And humans aren’t as good at detecting synthetic voices as we might think. In a 2023 study funded by the Dawes Centre for Future Crime, researchers found that humans could only tell deepfakes from human voices about 70% of the time in a test environment. The study suggests that we’d likely do worse in real-life scenarios.

Like every technology, Voice AI can be used for positive and more sinister purposes. To master the future, we believe we have to design with it, and this living research report looks into the emergent use-cases, key technical concepts, and forecasts for a technology that looks to become a lot more prevalent in the near future. 

On The Horizon

We may see more of these emergent patterns and trends as Voice AI and its ecosystem mature in development and adoption.

Personalization at Scale

Gone are the days of the cookie-cutter approach. With Voice AI, businesses are headed towards personalizing their interactions with every customer, increasing touchpoints and opportunities to upsell. Think sales assistants that remember what you last bought or game characters that chat and react to how you play.

Invisible to the Human Ear

AI catches stuff we simply can't. As a quick example: Hedge funds are now using Voice AI to tune into earnings calls. Apart from the usual numbers, this tech picks up on voice changes, hesitations, and even the confidence in an exec's voice, and is used to infer stock market moves. In a different context, teachers in classrooms are using Voice AI to detect early signs of dyslexia in a more efficient way.

Getting the Bigger Picture

This is about getting the whole story—with real-time feedback as events unfold. Contextual intelligence will become part of live diligence and reporting. Think a copilot for “fact checking” during a live congressional hearing or a presidential debate.

Democratizing Access

Perhaps one of the most profound impacts of Voice AI is in the realm of accessibility. By breaking down barriers related to special needs or language, the technology helps make information accessible to all. For individuals with special needs, voice-controlled devices can be life-changing, allowing them to regain the ability to speak or to listen to the written word the way they prefer.

At A Glance

Latest Concepts

Contextual Intelligence
The effectiveness of Voice AI is closely tied to its ability to understand and interpret context. Recent models pair text analysis (what is being said) with audio language models (how it’s being said) to get to more expressive and nuanced speech.
Zero- or Few-Shot Learning
We are getting closer to zero- or few-shot learning, which means models can produce convincing output with little additional training or only a few samples. This brings greater speed and efficiency in producing voice models, for example by recreating a voice from a small dataset of recordings of, say, a niche regional dialect.

Prosody & Expressiveness
Recent models have made strides in the proper modeling of prosody (i.e. the rhythm, melody, and intonation patterns of spoken language) to achieve more natural and expressive voice synthesis. Techniques include duration modeling, pitch contour generation, and emphasis modeling. This can also be applied to accents, speech patterns and quirks, leading to more accurate voice models and avatars.
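
As a minimal sketch of what duration modeling can look like in practice: some non-autoregressive TTS systems predict how many acoustic frames each phoneme should last, then stretch the phoneme sequence to that length before generating audio features. The function name, phoneme labels, and duration values below are illustrative assumptions, not taken from any specific model.

```python
from typing import List, Sequence


def expand_by_duration(phonemes: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Repeat each phoneme for its predicted number of frames.

    `durations` stands in for the output of a duration predictor;
    here the values are supplied by hand (an assumption for illustration).
    """
    if len(phonemes) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")

    frames: List[str] = []
    for phoneme, n_frames in zip(phonemes, durations):
        # A longer duration stretches the sound; a shorter one clips it.
        frames.extend([phoneme] * n_frames)
    return frames


# Hypothetical phonemes for "hello", with a lengthened final vowel.
neutral = expand_by_duration(["HH", "AH", "L", "OW"], [2, 3, 2, 4])
drawn_out = expand_by_duration(["HH", "AH", "L", "OW"], [2, 3, 2, 9])
print(len(neutral), len(drawn_out))
```

Changing only the durations changes the rhythm of the utterance while the words stay the same, which is one reason duration is treated as a separate, controllable part of prosody.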

Key Papers To Check Out

AudioPaLM (2023)

NaturalSpeech2 (2023)

FastSpeech2 (2021)

Tacotron (2017)

WaveNet (2016)

AudioPaLM (2023)

ELI5: AudioPaLM fuses text-based and speech-based language models, PaLM-2 and AudioLM, into a unified multimodal architecture that can perform speech-to-speech translation and zero-shot speech-to-text translation across multiple languages.

Link | Link to papers.chat | Listen to audio samples

NaturalSpeech2 (2023)

ELI5: NaturalSpeech 2 is a diffusion-based TTS system from Microsoft Research that is able to synthesize natural voices with high expressiveness and strong zero-shot ability (needs little training or few samples to reproduce).

Link | Link to papers.chat | Listen to audio samples

FastSpeech2 (2021)

ELI5: Drawing inspiration from the Transformer architecture in machine translation, FastSpeech2 is a non-autoregressive model (meaning the output is not generated one step at a time, with each step waiting on the previous one, and is thus faster) that generates speech from text in parallel. The base model produces mel-spectrograms, while its FastSpeech 2s variant generates waveforms directly.

Link | Link to papers.chat | Listen to audio samples

Tacotron (2017)

ELI5: Tacotron is an end-to-end text-to-speech system that creates speech directly from text, without requiring extensive domain expertise to build. It was tested on US English and rated more natural than an existing parametric system. Tacotron is also faster than sample-level autoregressive methods because it generates speech at the frame level, which means it works on a chunk of samples at a time rather than sample by sample.
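
To make the frame-versus-sample difference concrete, the sketch below counts how many output steps one second of audio needs under each scheme. The 24 kHz sample rate and 12.5 ms frame hop are assumed values chosen for illustration, and the function name is hypothetical.

```python
from typing import Tuple


def generation_steps(duration_s: float, sample_rate_hz: int, frame_hop_ms: float) -> Tuple[int, int]:
    """Return (sample-level steps, frame-level steps) for a clip.

    A sample-level model emits one audio sample per step; a frame-level
    model emits one frame (a chunk of samples) per step.
    """
    total_samples = round(duration_s * sample_rate_hz)
    samples_per_frame = round(sample_rate_hz * frame_hop_ms / 1000)
    if samples_per_frame < 1:
        raise ValueError("frame hop is shorter than one sample")
    # Ceiling division: a partial final frame still costs one step.
    total_frames = -(-total_samples // samples_per_frame)
    return total_samples, total_frames


# Assumed settings: 1 second of audio, 24 kHz, 12.5 ms per frame.
sample_steps, frame_steps = generation_steps(1.0, 24_000, 12.5)
print(f"{sample_steps} sample-level steps vs {frame_steps} frame-level steps")
```

Under these assumptions one second of speech takes 24,000 sample-level steps but only 80 frame-level steps, which is where the speed advantage comes from.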

Link | Link to papers.chat

WaveNet (2016) 

ELI5: A deep generative model from DeepMind that leapfrogged prior approaches in its expressiveness. The model directly works on the raw waveform to generate highly detailed and natural speech. Its main limitation is computational intensity, making real-time synthesis a challenge.

Link | Link to papers.chat | Listen to audio samples

Current Challenges

There are a few challenges and limitations in the current state of Voice AI technology that keep it in the realm of the “uncanny valley”.

Emotional Intelligence

Ensuring that synthesized speech conveys the intended emotion remains a significant challenge. Models still struggle to consistently detect the nuance and subtlety of intent encoded in what is being said.

Going Multilingual

Producing natural-sounding speech is challenging, especially in low-resource languages (languages with thin datasets).

Slang or Colloquial Speech

Language is an evolving thing, and the variety in colloquial speech, technical jargon, or even regional dialect makes it difficult for generalized models to be accurate.

Deepfakes & Ethics

It can be hard to track or detect synthetic voice models, which can be used irresponsibly, and voices can even be cloned without notice.

Library of Repos & Tools

A1000.org Audio Samples Comparison

Text-to-Speech Synthesis Benchmark

Emergent Applications

Media, News, Entertainment

Americans spend over 8 hours a day consuming digital content, and that figure is only forecast to grow. In this section, we explore four areas where we see the most potential for Voice AI to disrupt: podcasts, sports, news, and media production.
Strange Hunches: How AI May Supercharge Trends in 2024
  • Podcast and audio content consumption is on the rise
    • Over 150M monthly podcast episodes were consumed worldwide in July 2023, doubling from 2021. We believe that we will see even more growth in audio media.
  • Dynamic sports entertainment and betting experiences
    • Sports live streaming and betting experiences will be enhanced with real-time analytics, for instance bringing updated statistics to live sportscasting or betting odds.
  • Exponential increase in user-generated and studio content
    • The pace of media production has accelerated more than twofold. From content generation to transcription to new editing tools, we will see an exponential increase in user-generated and studio content across all channels.
  • A “land grab” moment in media and entertainment
    • Media companies will seize the opportunity to expand across mediums, platforms, thematics, and geographies, accelerated by AI.
Notable Case Studies
AI Sportscasters
Major sports tournaments, such as Wimbledon and the Masters, are leveraging AI to provide insights, analysis, and statistics for sports games.

Check out this AI commentary delivered over highlight reels, aided by models built by IBM.
Round-the-Clock Reporting
DeFiance Media partnered with Hour One to automate news content creation, delivering news stories every 2 hours, all year round.

View their 24/7 news channel.
Film & Audio Post-Production
To continue the legacy of James Earl Jones' portrayal of Darth Vader past his retirement, Respeecher cloned the actor's voice using archival footage for the Obi-Wan Kenobi series on Disney Plus.

Gaming

Nearly 40% of the world’s population are gamers. That’s more than one in every three people you meet. And they invest serious time into gaming, with 18% of US gamers playing 6 to 10 hours a week. We believe that this trend is set to accelerate with AI, through the creation of more immersive, dynamic, and collaborative gaming experiences that can also safely expand across demographics (such as gaming for kids) and geographies.
Strange Hunches: How AI May Supercharge Trends in 2024
  • Immersive Experiences
    Voice AI adds depth to avatars and characters, enhancing the gameplay experience.
  • Safer gaming environments
    Live moderation of in-game chats will lead to the safe expansion into markets like the under-18s, and beyond.
  • Easy Localization
    Localization through language and narratives is now more accessible to even indie game studios.
Notable Case Studies
Immersive experiences
Interactions with non-player characters (NPCs) are no longer a side-show. AI can create more dynamic and lifelike voice interactions with NPCs, enhancing player engagement, game narrative, and storytelling. 

Get a glimpse of the future possibilities of gaming here.
Localization, Now Within Reach
Localization was previously limited to game-makers with large budgets. Now, even indie game-makers can localize narrations or scenes in multiple languages quickly and cost-effectively. Voices, clues, and narratives can be generated in different languages.
Moderation Makes New Markets
ToxMod, developed by Modulate, is an AI voice moderation tool designed to combat toxicity and extremism in online gaming. Powered by machine learning, it identifies harmful behaviors in gaming voice chats, prioritizing player protection and safer gaming environments.
"The bonus factor is that there's empirical evidence that strong safety tools pay for themselves - we've found ToxMod lifts platforms' new user retention, for instance, by as much as 10-20%."
Mike Pappas, CEO, Modulate

Learning & Development

Education pervades every stage of life, from early development in childhood to continuous learning in one’s career. The applications for Voice AI in education particularly stood out to us in 3 sectors of the industry: schooling (K-12 and universities), language fluency, and corporate training.

The market size of each vector is massive. Revenue for public and private schools in the US is estimated at over 1 trillion US dollars [1, 2]. The global English language learning market alone is forecasted to reach a value greater than 35 billion USD by 2030 [3]. And global corporate training was valued at 151.75 billion US dollars in 2021 and is expected to grow to 487.3 billion USD by 2030 [4]. The opportunities to make a dent in these areas are vast and exciting.
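
For a sense of scale, the corporate training forecast above can be turned into an implied yearly growth rate. The sketch below applies the standard compound annual growth rate formula to the two figures quoted (151.75 billion USD in 2021 and 487.3 billion USD in 2030). The function name is our own, and the result is a back-of-the-envelope reading of those two endpoints rather than a number taken from the cited source.

```python
def compound_annual_growth_rate(start_value: float, end_value: float, years: int) -> float:
    """Constant yearly growth rate that takes start_value to end_value."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted in the text, in billions of US dollars.
cagr = compound_annual_growth_rate(151.75, 487.3, 2030 - 2021)
print(f"Implied annual growth: {cagr:.1%}")
```

This works out to roughly 14% a year, sustained over nine years.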
Strange Hunches: How AI May Supercharge Trends in 2024
  • Personalized attention in the classroom
    • Voice AI enhances teachers’ abilities to provide more personalized feedback and assessments of all their students, from language fluency to identifying potential learning difficulties, more efficiently.
  • Less effort is required to develop corporate training videos
    • The flexibility to create, edit, and share Voice AI generated content enables companies to build and disseminate essential corporate training modules across different roles and geographies faster.
  • Accessibility in educational institutions is becoming a more prevalent topic
    • Text-to-speech AI technology will become increasingly common in educational settings as an alternative method to digest written material.
Notable Case Studies
Dyslexia Identification
EarlyBird Education partnered with SoapBox Labs, a speech recognition company, to enhance its early literacy assessment, including detection of dyslexia.

This voice-based assessment helps educators identify challenges faced by struggling young students before they start reading. Learn more about it here.
Read by Listening
Students, from NYC Public Schools to Barnard College, who struggle with reading comprehension can listen to textbook chapters on their mobile devices at 2x speed while going for a walk, rather than struggling for hours to read through one chapter.
Corporate Training
Voice AI is particularly handy for corporate training. Companies can put together division- or region-specific training videos and lessons with ease by using text-to-speech technology that does not require extensive recording and editing.

Sales, Marketing, Customer Support

Whether making a purchase (via Alexa or even a restaurant drive-thru), listening to an ad, or getting information about a product, you may have interacted with some form of Voice AI without even knowing it. All of these touch on core customer-facing functions like sales (making a purchase or upselling), marketing (listening to an ad), and customer support (getting info on a product). Statista estimates that US spending on digital advertising will be 271 billion US dollars in 2023, rising to 325 billion in 2025, with spend specifically on audio advertising increasing to nearly 7 billion in 2025.
Strange Hunches: How AI May Supercharge Trends in 2024
  • Customers want to feel “seen”
    • As machine learning continues to advance, advertisements, including a voice ad you might hear on Spotify, will become far more targeted to the specific listener.
  • Restaurant drive-thrus are streamlining the process
    • In a constant effort to keep finding new ways to improve the consumer experience, we are starting to see fast food chains implementing Voice AI solutions to free up workers for more challenging customer interactions.
  • Advanced customer support
    • Voice AI can take the human bias out of assessing emotion. With a Voice AI companion, sales agents can get additional insight into customer sentiment, helping to find upsell/cross-sell opportunities and a better understanding of customer satisfaction.
Notable Case Studies
Customer Calls 
Now any business can scale how they chase leads at an exponentially lower cost. 

With the right integrations, Air AI voice calls can analyze ongoing conversations, identify customer sentiments, and even make a sale. 

Check out Air AI's demo of an AI agent's 10-minute sales call.
Hyper-relevant Ads
A Bollywood star gives a personal shoutout to thousands of small family-run grocers — a feat only possible with AI.

In 2021, Ogilvy Mumbai cloned Shah Rukh Khan's face and voice to create hundreds of personalized ads as part of a campaign for Cadbury Chocolate.
Drive-Thrus
Fast-food giants are getting into the game. Franchises like Wendy’s and White Castle are piloting Voice AI in drive-thrus. White Castle has set a number of hefty goals, including an order completion rate of 90% (higher than current staff benchmarks) and an average order received and processed in just over 60 seconds.

Looking Forward

This living research report will be updated on a bi-annual basis. Questions, comments, feedback? Let us know at research@strangevc.com.