A New Era Of Voice

Forbes Technology Council

Matt Hocking is the Executive Chairman and co-founder of WellSaid Labs, an AI text-to-speech voice generation solution.

If you spend any amount of time on social media these days (especially TikTok or Instagram), you have undoubtedly heard an AI-generated voice, whether you knew it or not. Why are companies and individuals using these AI voices, and how did we get to the point where an AI voice is indistinguishable from a human's?

The first ventures into text-to-speech (TTS) technology, or AI voice, consisted of devices or computer-based systems that approximated human speech by concatenating recorded speech sounds. Although these early methods produced intelligible voices, they were rudimentary and robotic, far from resembling real human speech. Over the decades, advances in computing power, algorithms and data processing enabled TTS to spread everywhere, from voice-activated virtual assistants like Microsoft's Cortana, Apple's Siri and Amazon's Alexa to the automated voices of customer support lines and even memes seen across social media.

Today, AI voice can be found in almost every industry, from financial services, insurance and healthcare to retail, media, hospitality and more. The opportunities go way beyond what we've seen thus far, offering possibilities for more personalized experiences in every domain we can think of, including advertisements, onboarding, training videos and even news narration.

Moreover, AI voice technology has created a monumental shift in how we interact with technology and gives brands an entirely new format for engaging with their audiences. Turning written content into rich audio that truly captures listeners' attention not only deepens engagement but also extends a brand's reach to new demographics.

By incorporating AI voice into products, companies can have personalized content spoken to end users, making multimodal user experiences even richer. These advances help create more affordable solutions that let individuals listen to their preferred content. Imagine listening to your news articles while cooking or exercising instead of reading them on your phone.

From day one, the biggest challenge for TTS systems was replicating near-perfect human speech; most results were emotionless and flat. These systems simply couldn't capture the rich variation and intonation that goes into every spoken word or phrase. Speech isn't just a series of words: the pitch, tone and even regional dialect all carry emotional depth and context and encapsulate the essence of the individual behind the voice.

An AI voice is meticulously crafted through several steps that combine advanced technology, sophisticated algorithms and various essential tools. Data collection is only the first step. With advances in AI, TTS can leverage vast amounts of data, advanced algorithms and sophisticated tools to offer a wider breadth of voices and speaking styles than ever before. But how does AI actually learn to capture the uniqueness of human voices?

To understand what makes human voices so unique, particularly in English, we need to understand graphemes and phonemes. Graphemes, as the name may imply, are how words are written; phonemes are how they sound when spoken. These two concepts govern how we communicate in English, and the relationship between them must be taught. Teaching AI phonemes is key to producing synthetic voices that capture the nuance of human communication. However, doing so requires a standardized way of writing pronunciation so the model can derive the correct sounds just from reading a word.
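To make the grapheme-phoneme gap concrete, consider how the same spelling pattern can map to entirely different sounds. The toy lexicon below is purely illustrative (it is not a real grapheme-to-phoneme model), but it shows why the relationship has to be learned rather than read directly off the letters:

```python
# A minimal, self-contained sketch of why grapheme-to-phoneme mapping
# must be learned: the grapheme sequence "ough" maps to different
# phonemes depending on the word. Transcriptions are in IPA; the
# lexicon is a hand-written illustration, not a real G2P system.
LEXICON = {
    "though":  "ðoʊ",   # "ough" -> /oʊ/
    "through": "θɹu",   # "ough" -> /u/
    "tough":   "tʌf",   # "ough" -> /ʌf/
    "cough":   "kɔf",   # "ough" -> /ɔf/
}

def phonemes_for(word: str) -> str:
    """Look up a word's phonemic transcription; real systems fall back
    to a learned grapheme-to-phoneme model for unknown words."""
    return LEXICON.get(word.lower(), "<unknown: needs a G2P model>")

for w in ["though", "through", "tough", "cough"]:
    print(f"{w!r} -> /{phonemes_for(w)}/")
```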

This is where the International Phonetic Alphabet (IPA) comes in. The IPA was created in the late 19th century to establish a uniform written system for representing the sounds heard in various languages, not just English. It comprises symbols that each represent a distinct sound, aiming for a consistent mapping between symbols and sounds across languages.

By ensuring a direct correspondence between symbols and sounds, the IPA promotes consistency and reduces ambiguity. For instance, the word "banana" is represented as /bəˈnænə/. In this IPA transcription, each sound is distinctly represented: the two unstressed "a"s are written as /ə/ (the schwa), while the stressed "a" is written as /æ/, with the preceding /ˈ/ marking the stress. This clarity ensures that despite variations in spelling, each IPA representation has only one pronunciation.
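As a quick illustration, here is a small script that walks through /bəˈnænə/ symbol by symbol; the symbol table is hand-written and the descriptions are simplified for illustration only:

```python
# Walk the IPA transcription of "banana", /bəˈnænə/, symbol by symbol.
# The descriptions below are simplified; a full IPA chart covers far
# more symbols and articulatory detail.
IPA_SYMBOLS = {
    "b": "voiced bilabial stop",
    "ə": "schwa (unstressed vowel)",
    "n": "alveolar nasal",
    "æ": "near-open front vowel",
    "ˈ": "primary stress on the next syllable",
}

for symbol in "bəˈnænə":
    print(f"{symbol}  ->  {IPA_SYMBOLS[symbol]}")
```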

IPA helps deep learning models understand pronunciation, but on the user side it's too cumbersome to be helpful. A respelling system, which rewrites pronunciation using familiar English spelling conventions, is much easier and more efficient for users.
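As a rough sketch of what such a respelling system might do, the snippet below converts syllabified IPA into a reader-friendly respelling, capitalizing the stressed syllable. The symbol mapping and syllable handling here are hypothetical simplifications; production respelling systems are considerably more involved:

```python
# A hedged sketch: turn syllabified IPA into a reader-friendly
# respelling, e.g. /bəˈnænə/ -> "buh-NAN-uh". The mapping table is a
# tiny hypothetical subset, not a complete system.
IPA_TO_RESPELLING = {"b": "b", "ə": "uh", "n": "n", "æ": "a"}

def respell(syllables: list[str]) -> str:
    """Respell syllabified IPA; a leading ˈ marks the stressed
    syllable, which is rendered in capitals."""
    out = []
    for syl in syllables:
        stressed = syl.startswith("ˈ")
        text = "".join(IPA_TO_RESPELLING[ch] for ch in syl.lstrip("ˈ"))
        out.append(text.upper() if stressed else text)
    return "-".join(out)

print(respell(["bə", "ˈnæn", "ə"]))  # -> "buh-NAN-uh"
```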

An example of this is our company's partnership with Oxford Languages, which provides WellSaid Labs with its most up-to-date syllabified IPA transcriptions of words. Using these transcriptions, we map words in scripts to our respelling system. The model is then trained on both the regular spellings of words and their respelled counterparts. Through this approach, the model learns how both graphemes and respellings correspond to phonemes, giving users precise control over pronunciation that still sounds human.
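To give a feel for this pairing step, here is a hedged sketch of preparing training text in both forms. The RESPELLINGS lookup stands in for a mapping derived upstream from syllabified IPA; this is an illustration of the idea, not WellSaid Labs' actual pipeline:

```python
# Illustrative sketch: pair each script with a respelled version so a
# model sees both forms of the same text. The lookup table here is a
# hypothetical stand-in for a full IPA-derived respelling dictionary.
RESPELLINGS = {"banana": "buh-NAN-uh", "data": "DAY-tuh"}

def respell_script(script: str) -> str:
    """Replace each known word with its respelling; leave others as-is."""
    return " ".join(RESPELLINGS.get(w.lower(), w) for w in script.split())

script = "banana data pipeline"
training_pairs = [(script, respell_script(script))]
print(training_pairs)
# [('banana data pipeline', 'buh-NAN-uh DAY-tuh pipeline')]
```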

Other big players are taking advantage of AI-driven TTS capabilities. For instance, Microsoft Azure's TTS service leverages neural networks to deliver natural-sounding speech, and Amazon Polly uses deep learning to turn text into speech across dozens of languages and voices.
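For a sense of how accessible these services are, here is a minimal Amazon Polly example using the boto3 SDK. It assumes AWS credentials are already configured; the voice and region are arbitrary choices:

```python
# A minimal sketch of calling Amazon Polly's neural TTS from Python,
# assuming AWS credentials are already configured for boto3.
import boto3

polly = boto3.client("polly", region_name="us-east-1")

response = polly.synthesize_speech(
    Text="Hello from a neural text-to-speech voice.",
    OutputFormat="mp3",
    VoiceId="Joanna",   # one of Polly's built-in English voices
    Engine="neural",    # request the neural (deep-learning) engine
)

# The audio arrives as a streaming body; write it to a local file.
with open("hello.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```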

So why does this matter? Achieving human-like performance has been a long journey, but now that it's within reach, we are beginning to see the full potential of this technology. Enterprise businesses will be transformed, from call centers, creative agencies and marketing teams to products, experiences and even corporate training. All of this is possible thanks to the technical advances outlined above.

As we look into the future of TTS, our industry will be able to create even more lifelike, expressive and personalized voices. Pretty soon, we can expect AI voice to become even more integral to our daily lives and indistinguishable from human speech, capable of conveying any emotion desired.

