Why Using AI Speech in Your Project Is a Bad Idea on a Practical Level

The ethics matter, but there’s more to consider

I’m going to give you an excellent reason to avoid AI speech that’s different from the battle of creative people fighting to be paid fairly for their talent and skills.

I know that ethics do matter. Hollywood studios want to use the likenesses and voices of actors and voice actors in perpetuity for a one-time payment, and that’s not even close to fair. Smaller producers don’t want to pay at all and aren’t even interested in getting your consent.

Much is being made about James Earl Jones and the quest to model his voice for Darth Vader while he’s still alive so Disney can keep using it forever in their Star Wars productions.

The difference is that Jones is being paid a boatload of money to license his voice so that his heirs will get royalties from its usage. He can do this because he has clout in Hollywood and a singular voice that is difficult to imitate.

SAG-AFTRA wants the same for ALL actors.

However, I think AI-modeled speech has another problem at the moment, and it’s this:

It still doesn’t sound real.

Here’s why it makes a difference.

There’s a video going around featuring a virtual Morgan Freeman in appearance and voice. It’s pretty good, and if you didn’t know it was fake, you’d likely be fooled. But it’s not perfect.

The thing is, this video is the exception, not the rule.

Its imperfection lies in the two things ChatGPT and other AI products are struggling with: nuance and belief.

Nuance is lost in AI speech.

It’s not about what you’re hearing but what you’re feeling.

I can say “I love you” in dozens of different ways. I can say it in anger, sadness, fear, joy, and more. As a voice actor, I can quickly change to any emotion based on what a director asks.

How do you tell an AI voice to do this and make it sound natural? How can AI inject two different emotions into a single sentence? Right now, it can’t, while a professional voice actor has no problem doing this.

AI speech also struggles to get you to engage with the story.

There’s no belief in an AI voice, and no command prompt can add it. This matters more than you think.

Have you ever been engrossed in a book only to find yourself pulled out of the story by a typo? It’s like that.

Actors make their living by making you believe there really is an alien chasing them, this laundry detergent is the best, or they’re in love with their co-star, the one they hate in real life.

With this in mind, who would believe an AI voice that sounds flat?

There are other characteristics of modeled speech I’m surprised people are willing to put up with.

The breathing patterns are wrong and unnatural.

It often lacks the cadence of human speech we’re all accustomed to. This can make you feel uncomfortable.

The speech is often in a monotone.

This is especially noticeable when the AI is reciting a list. This can annoy you.

Modeled speech can also lead to listening fatigue. Eventually, your brain will tune out, and you’ll start getting distracted.

“But what about James Earl Jones?” you may ask.

“Why are they going through the trouble of recording him for posterity if AI speech isn’t any good?”

The answer lies in how his speech is being recorded and the system being used. He has been in the studio for hundreds of hours, recording every kind of word and emotion possible. It’s costing Disney a fortune to do, but they know it will pay off in the long term.

They’re also counting on the technology getting better, and when it does, they will have all those samples in storage. In addition, they have the time, energy, and resources to make it sound good right now.

However, this is only for one voice, and it’s being done by a company with deep pockets. Large video game developers are also able to spend big coin on their text-to-speech (TTS).

The average producer can’t afford this, so they use the currently available simple systems, which leads to the problems I’ve discussed.

Fake voices are far from new, by the way.

Synthesized speech has been around since the 1960s. The DECtalk system Stephen Hawking used was introduced in the 1980s.

These synthesized systems have been a godsend for people with sight, speech, and reading issues.

We’ve also had modeled speech in our hands and homes for over a decade. Apple’s Siri appeared in 2010, followed by Amazon’s Alexa in 2014.

So why are producers, YouTubers, and audiobook creators using AI speech?

To save money.

To save time.

But there’s a catch.

They’re banking on the people who can’t tell the difference, the same people who think a $20 pair of headphones sounds just as good as the $500 ones.

They’re banking on people not caring about the sound of the voice they’re listening to, even with all the inhuman quirks.

They’re letting themselves be talked into a need to feed the beast daily, and AI is letting them get content out there quickly, even if it’s not high quality.

However, using AI speech may come at the cost of alienating a large portion of the intended audience, either because listeners understand the moral issues or because the voice is challenging to listen to.

From a practical point of view, AI speech is not ready for prime time.

We are at a crossroads. Modeled everything threatens to overwhelm us before we’ve even finished discussing its dangers.

And dashing in without thinking it through due to FOMO leads to an onslaught of crappy, derivative work. I have friends who tell me it’s getting better, but I’m not seeing or hearing it yet.

We also need to solve the problem of creative people being paid fairly sooner rather than later.