Summary

This repository provides a starting point for a FastAPI backend that auto-dubs YouTube videos using voice timbre recognition and text-to-speech synthesis. It is inspired by OpenVoice technology and can be deployed on cloud platforms such as Cloud Run.

Abstract

The repository serves as a foundation for a FastAPI backend application that dubs YouTube videos. The application uses OpenVoice technology to recognize the voice timbre of the original video, then synthesizes speech from the translated subtitles that matches this timbre. The project can be deployed through GitHub Actions, Cloud Build, and Cloud Run. The documentation also explains how to set up the development environment, download the required model checkpoints, run the application, and use the API endpoints to process and download the dubbed videos. Planned enhancements include model performance improvements, serverless GPU support, a frontend interface, and better translation synchronization.

Opinions

The project is presented as a cutting-edge solution for automated video dubbing, implying that it represents a significant advancement in the field.
The use of OpenVoice technology is highlighted as a key feature, suggesting that it contributes to high-quality voice timbre recognition and synthesis.
The emphasis on flexible deployment options indicates a strong focus on accessibility and scalability for developers and users.
The mention of future directions, including model improvements and serverless GPU support, suggests a commitment to continuous innovation and performance optimization.

YouTube Auto-Dub with FastAPI, OpenVoice, Docker and Cloud Run

This repository serves as a starting point for developing a FastAPI backend for dubbing YouTube videos by capturing and inferring the voice timbre using OpenVoice.

https://github.com/mazzasaverio/youtube-auto-dub

Core Features

Voice Timbre Recognition: Utilizes OpenVoice technology to accurately recognize the voice timbre from the original YouTube video.
Text-to-Speech Synthesis: Downloads the video's subtitles, translates them, and converts the translation into speech that matches the original voice timbre as closely as possible.
Flexible Deployment: Supports deployment via GitHub Actions and Cloud Build, and is compatible with Cloud Run. Inference currently runs on CPU. For instructions on setting up Cloud Run with Terraform, refer to the following repository: https://github.com/mazzasaverio/fastapi-cloudrun-starter

Getting Started

To get started with YouTube Auto-Dub, follow these steps:

Environment Setup

For local development, we recommend setting up a conda environment with:

conda install mamba -n base -c conda-forge
mamba create -n youtube-auto-dub python=3.9 -y
mamba install -n youtube-auto-dub pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia -y
conda activate youtube-auto-dub
pip install -r requirements.txt
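
After installing, it can be useful to confirm that the key packages are visible from the activated environment. The snippet below is a minimal sketch, not part of the repository: the function name is hypothetical, and the default module list is an assumption based on the commands above (torch, torchaudio) and on the FastAPI/uvicorn stack this project uses.

```python
import importlib.util
import sys


def environment_report(modules=("torch", "torchaudio", "fastapi", "uvicorn")):
    """Map each top-level module name to True if it can be found for import.

    The default module list is an assumption; adjust it to match
    the actual contents of requirements.txt.
    """
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    for name, found in environment_report().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

Run it with the youtube-auto-dub environment active. A MISSING entry suggests that the corresponding install step did not complete in this environment.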

Download Required Checkpoints

Download the model checkpoints necessary for voice timbre recognition and synthesis:

sudo aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://myshell-public-repo-hosting.s3.amazonaws.com/checkpoints_1226.zip -d /code -o checkpoints_1226.zip
sudo unzip /code/checkpoints_1226.zip -d backend/checkpoints
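
To check that the archive was extracted where the application expects it, you can list what ended up under backend/checkpoints. This is a small sketch under stated assumptions: the helper name is hypothetical, and it makes no claim about which files the archive contains. It only reports whatever is present.

```python
from pathlib import Path


def summarize_checkpoints(root="backend/checkpoints"):
    """Return (relative_path, size_in_bytes) for every file under root, sorted by path."""
    base = Path(root)
    if not base.is_dir():
        raise FileNotFoundError(f"checkpoint directory not found: {base}")
    files = sorted(p for p in base.rglob("*") if p.is_file())
    return [(p.relative_to(base).as_posix(), p.stat().st_size) for p in files]


if __name__ == "__main__":
    try:
        for rel_path, size in summarize_checkpoints():
            print(f"{size:>12}  {rel_path}")
    except FileNotFoundError as err:
        print(err)
```

Run it from the repository root. An empty listing or a missing directory usually means the unzip step wrote to a different location.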

Running the Application

With the environment set up and checkpoints downloaded, navigate to the backend directory and start the application using:

cd backend
uvicorn app.main:app --reload

To use YouTube Auto-Dub, begin by submitting a YouTube link via the endpoint:

/api/v1/download/

The application will process the video, recognize the voice timbre, translate the subtitles, synthesize the translated speech matching the original timbre, and then assemble the final video. The processed video will be saved in backend/data/final_videos. With the video ID returned in the output, you can check the processing status through the endpoint:

/api/v1/status/{video_id}

Finally, download the final video by passing the same video ID to the endpoint:

/api/v1/download-video/{video_id}
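
The status and download endpoints above can be driven from a small client. The sketch below uses only the standard library and rests on several assumptions that the text does not confirm: the server is running locally on uvicorn's default address, the status endpoint answers GET requests with a JSON object, and that object carries a "status" field whose final value is "completed". All function names are hypothetical. The submission call to /api/v1/download/ is left out on purpose, since its HTTP method and request body are not specified here.

```python
import json
import time
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # uvicorn's default host and port


def status_url(video_id, base_url=BASE_URL):
    """Build the status URL for a given video ID."""
    return f"{base_url}/api/v1/status/{video_id}"


def download_video_url(video_id, base_url=BASE_URL):
    """Build the final-video download URL for a given video ID."""
    return f"{base_url}/api/v1/download-video/{video_id}"


def fetch_json(url):
    """GET a URL and decode the body as JSON (assumes a JSON response)."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def wait_until_done(video_id, fetch=fetch_json, interval=5.0,
                    max_polls=120, done_values=("completed",)):
    """Poll the status endpoint until the job reports a finished state.

    The "status" key and the "completed" value are assumptions; check
    the real response of your deployment and adjust done_values.
    """
    for _ in range(max_polls):
        payload = fetch(status_url(video_id))
        if payload.get("status") in done_values:
            return payload
        time.sleep(interval)
    raise TimeoutError(f"video {video_id} not finished after {max_polls} polls")
```

Once wait_until_done returns, the URL from download_video_url(video_id) can be passed to urllib.request.urlretrieve together with a local file name of your choice. The fetch parameter is injectable so that the polling logic can be exercised without a running server.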

This project is designed with cloud deployment in mind. The provided cloudbuild.yaml and Terraform configurations facilitate deployment on Google Cloud Platform, specifically using Cloud Run for scalable, serverless application hosting.

The development of YouTube Auto-Dub was inspired by the following repository:

OpenVoice: Instant voice cloning technology by MyShell, utilized for voice timbre recognition and synthesis in this project.

Future Directions

Model Improvements: Explore and integrate better models for voice recognition and synthesis.
Serverless GPU Support: Investigate options for serverless GPU computing to accelerate processing.
Frontend Interface: Develop a user-friendly frontend for easier interaction with the application.
Translation Synchronization: Enhance the synchronization between translated text and video content for a seamless viewing experience.