Abstract
Synthesizing a real-time, high-resolution, lip-synced digital human is a challenging task. Although the Wav2Lip model represents a remarkable advancement in real-time lip-sync, its visual clarity is limited. To address this, we enhanced the Wav2Lip model in this study and trained it on a high-resolution video dataset produced in our laboratory. Experimental results indicate that the improved Wav2Lip model produces digital humans with greater clarity than the original model, while maintaining its real-time performance and accurate lip-sync. We implemented the improved Wav2Lip model in a government interface application, generating a government digital human. Testing revealed that this government digital human can interact seamlessly with users in real time, delivering clear visuals and synthesized speech that closely resembles a human voice.
Keywords
0 Introduction
A “digital human” is a digital technology designed to simulate real-world human characteristics. An “interactive digital human” can engage and converse with users in real time. An interactive digital human system comprises three main components: the digital human itself, a question-answering (QA) system, and a text-to-speech (TTS) system. Optionally, a speech recognition (SR) system can be included, or it can be replaced with text input. Digital humans have been extensively applied across sectors such as government, e-commerce, healthcare, cultural tourism, and education. A “government interactive digital human” in this context is designed for public administration, where it functions as an intelligent customer service tool for governance-related inquiries. This technology offers swift, convenient access to public services and has helped facilitate the digital transformation of government affairs.
The core technology supporting digital humans is speech-driven talking face synthesis, the most challenging aspect of which is lip-sync accuracy. While recent advancements have greatly enhanced this technology, certain challenges persist. The Wav2Lip [1] model, a leading solution in terms of real-time performance and lip-sync accuracy, still lacks clarity as it processes facial images with a resolution of only 96×96 pixels; further, the videos used to train it are low-resolution and not sufficiently sharp. In this study, we improved the Wav2Lip model to enhance the clarity of its digital-human visuals. The improved model can process facial images with a resolution of 384×384 pixels and was trained on high-resolution video data produced in our laboratory.
The key contributions are as follows:
1) We improved the Wav2Lip model by adding a six-layer convolutional neural network (CNN) structure to the lip-sync discriminator's face encoder. We also introduced two extra blocks to both the face encoder block and the face decoder block of the generator. We recorded a high-resolution video dataset to train the model, enabling it to subsequently generate high-resolution digital humans in real time. The resolution of generated faces increased from 96×96 pixels to 384×384 pixels while retaining the original model's lip-sync accuracy and real-time performance.
2) We assembled and organized a government knowledge base (KB), a government QA system, and a fine-tuned TTS system. We integrated the digital human, government QA system, and TTS to form a comprehensive and interactive digital human platform.
1 Related Work
1.1 Digital Human
Existing digital human technologies can be either two-dimensional (2D) or three-dimensional (3D). This research primarily focuses on 2D digital humans, which rely on speech-driven talking-face synthesis as their core technology. Current research on this topic falls into two main categories: speaker-specific models and speaker-independent models. Speaker-specific models require identity-specific training and extensive data on a particular speaker. In this study, we focused on speaker-independent models.
There are many algorithmic models for digital humans, including Wav2Lip [1], Speech2Vid [2], Kim et al.'s model [3], ATVG [4], FOMM [5], MakeItTalk [6], FACIAL [7], AnyoneNet [8], Audio2Head [9], EVP [10], PC-AVS [11], AD-NeRF [12], StyleTalker [13], DFA-NeRF [14], DFRF [15], and DiffTalk [16]. These can be divided into five categories according to their generation methods. Methods based on audio features include Speech2Vid, Wav2Lip, PC-AVS, and StyleTalker; among them, Wav2Lip, PC-AVS, and StyleTalker adopt generative adversarial networks (GANs) [17]. Methods based on 2D facial representations include ATVG, MakeItTalk, and AnyoneNet. Methods based on 3D morphable models (3DMM) [18] include Kim et al.'s, FACIAL, EVP, AD-NeRF, DFA-NeRF, and DFRF; among them, AD-NeRF, DFA-NeRF, and DFRF use neural radiance fields (NeRFs) [19]. Methods based on dense motion fields [20] include FOMM and Audio2Head, while those based on diffusion models [21] include DiffTalk.
Wav2Lip [1] is widely regarded as the state of the art for its synthesis speed and lip-sync accuracy. However, its output resolution is low, and the videos it generates are fairly blurry. Gupta et al. [22] created improved high-resolution talking-face videos by training a lip-sync generator in a compact vector-quantized space. They also used the GPEN face-enhancement method for enhanced resolution, which is effective but overly time-consuming and not conducive to real-time synthesis. Muaz et al. [23] applied shift-invariant learning to generate photo-realistic, high-resolution videos with accurate lip-sync; this approach is also computationally costly. Ideally, a digital human model needs to balance high clarity and processing speed.
1.2 Question Answering System
Many types of QA systems have been established to date; our focus here is on systems supported by question-answer pairs. These systems operate by calculating the similarity between a user-submitted question and the questions stored in a knowledge base, returning the answer associated with the most similar question as a candidate answer. To complete this calculation, both the user's question and the stored questions must first be converted into vector forms. There has been significant research on text vectorization, which can be used to facilitate this conversion.
Early text vectorization methods include the word2vec model [24], a type of neural network model. After 2018, the field shifted towards pre-training models, marked by the advent of large language models such as GPT [25-27], BERT [28], and ERNIE [29]. These models fundamentally rely on the Transformer [30] deep learning structure.
1.3 TTS
TTS synthesis is designed to produce human-like speech from text. Typically, TTS models first generate a mel-spectrogram from input text [31-38], which is then converted into a speech waveform using a separately pre-trained vocoder [39-42]. Alternatively, some models directly generate the waveform from text in an end-to-end manner [43-46].
Traditional TTS systems were trained on relatively small datasets. In contrast, today's large-scale TTS systems [47-49] are trained on tens of thousands of hours of speech data. These advanced systems can be fine-tuned with as little as 1 min or even 5 s of training data. The resulting speech closely matches the original voice and sounds more natural.
2 Methods
2.1 Real-Time and High-Resolution Digital Human
The Wav2Lip [1] model includes two parts: a lip-sync discriminator and a generator. To accommodate the increased input image resolution in our model, additional convolutional layers were necessary.
2.1.1 Lip-sync discriminator and its training principle
The lip-sync discriminator consists of a face encoder and an audio encoder. Both predominantly utilize CNN architectures. The face encoder of the original Wav2Lip model accepts an input image size of 96×96 pixels, whereas our improved version uses an input size of 384×384 pixels. This four-fold increase in both length and width necessitated the addition of six layers to the convolutional structure of the face encoder beyond that of the original Wav2Lip model. Details regarding the face encoder model are described in Appendix 1. The design of the original audio encoder was kept unchanged.
The original dataset consists of recorded video files that are segmented into shorter clips during preprocessing, each containing only one sentence. Each short clip is then split into frames, faces are detected, and the audio is extracted, yielding multiple facial images and an accompanying audio file. This preprocessing transforms the original video data into a dataset suitable for training.
The training process begins by randomly selecting a starting frame number and extracting five consecutive images from that point. The mel-spectrogram matrix of the audio corresponding to these five images is calculated. These images, along with their associated mel-spectrogram matrix, form a positive sample. Conversely, a negative sample is created when the audio does not correspond to the images.
In this system, a positive sample is labeled with a ground truth of “1” and a negative sample with a ground truth of “0”. The face encoder processes each image to produce a vector v, while the audio encoder processes the mel-spectrogram matrix to produce a vector s. Both vectors have a shape of 1×512. The output value of the lip-sync discriminator is the cosine similarity between these two vectors, as shown in Eq. (1).
$P = \dfrac{v \cdot s}{\lVert v \rVert_2 \, \lVert s \rVert_2}$  (1)
During model training, multiple samples are collected into a batch (e.g., a batch size of 10). The cosine similarities p of these 10 samples are computed to create an array, and their corresponding ground truths y are compiled into another array. The loss between these two arrays is calculated using the cross-entropy loss function, as shown in Eq. (2). The parameters of the face encoder and audio encoder models are optimized based on this loss.
$L = -\dfrac{1}{N}\sum_{i=1}^{N}\left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right]$  (2)
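The discriminator objective described above can be sketched in PyTorch as follows. The clamp is our addition for numerical stability (a raw cosine similarity can be negative, which binary cross-entropy cannot accept); the 512-dimensional embeddings match the encoder outputs described above.

```python
import torch
import torch.nn.functional as F

def sync_loss(v, s, y, eps=1e-8):
    """Cosine-similarity sync loss, following Eqs. (1)-(2).

    v, s : (B, 512) face and audio embeddings
    y    : (B,) ground truth, 1 for matching pairs, 0 for mismatched
    """
    p = F.cosine_similarity(v, s, dim=1, eps=eps)   # Eq. (1), one value per sample
    p = p.clamp(eps, 1.0 - eps)                     # keep p inside (0, 1) for BCE
    return F.binary_cross_entropy(p, y.float())     # Eq. (2), averaged over the batch
```

In training, v and s come from the face and audio encoders respectively, and the resulting scalar loss is backpropagated through both.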
2.1.2 Generator and its training principle
The generator includes three key components: a face encoder block model, a face decoder block model, and an audio encoder. All primarily utilize CNN structures. Compared to the original Wav2Lip model, we increased the depth of both the face encoder block model and the face decoder block model from seven blocks to nine blocks for enhanced performance. Each face decoder block includes one deconvolution layer followed by convolutional layers, while each face encoder block consists of convolutional layers only. Appendices 2 and 3 provide details of the face encoder block model and face decoder block model, respectively. The design of the audio encoder was kept consistent with the original model.
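The blocks listed in Appendices 2 and 3 can be assembled from a small convolutional unit. As a hedged sketch: the Conv2d → BatchNorm → ReLU composition is our assumption, following the public Wav2Lip implementation; the paper's appendices specify only channels, kernel, stride, and padding.

```python
import torch
import torch.nn as nn

def conv_unit(cin, cout, kernel=3, stride=1, padding=1):
    # Conv2d -> BatchNorm -> ReLU; the normalisation and activation are an
    # assumption based on the public Wav2Lip code, not stated in the paper
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel, stride, padding),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

# e.g. face encoder block 4 from Appendix 2: one downsampling layer (stride 2)
# followed by two refinement layers at the same resolution
encoder_block4 = nn.Sequential(
    conv_unit(32, 64, stride=2),
    conv_unit(64, 64),
    conv_unit(64, 64),
)
```

A stride-2 layer halves each spatial dimension, which is how nine such blocks reduce a 384×384 input to the small bottleneck feature map.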
To train the generator, begin by selecting a random number i and obtaining the five consecutive images starting from this index. Replace the pixels in the lower half of these five images with zero values. Then, randomly select another five images. Combine these ten images, denoted as x. Calculate the mel-spectrogram matrix corresponding to the five consecutive images starting from the i-th image, denoted as mel. Additionally, calculate separate mel-spectrogram matrices for the five consecutive images starting from the (i-1)-th, i-th, (i+1)-th, (i+2)-th, and (i+3)-th images, denoted as indiv_mel. The ground truth, denoted as gt, is the sequence of five consecutive images starting from the i-th image.
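The image-side sampling above can be sketched as follows (mel-spectrogram extraction omitted). The channel-wise concatenation of masked target frames with unmasked reference frames follows the public Wav2Lip code and is our assumption about how x is assembled.

```python
import numpy as np

def build_generator_sample(frames, i, rng, window=5):
    """Assemble one generator training sample as described above.

    frames : array of face crops, shape (T, H, W, 3)
    i      : start index of the ground-truth window
    rng    : a numpy Generator, e.g. np.random.default_rng()
    """
    gt = frames[i:i + window].copy()              # ground truth window, gt
    masked = gt.copy()
    masked[:, gt.shape[1] // 2:] = 0              # zero the lower half of each face
    j = rng.integers(0, len(frames) - window)     # random reference window
    ref = frames[j:j + window]
    x = np.concatenate([masked, ref], axis=-1)    # channel-wise: (window, H, W, 6)
    return x, gt
```

The masked lower half is what the generator must inpaint, conditioned on the audio, while the reference frames supply identity and pose.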
Model processing includes four steps:
1) Process the audio indiv_mel using the audio encoder model;
2) Process image x using the face encoder block model;
3) Use the face decoder block model to process the outputs of the first two steps;
4) Use the output block model to process the results of the third step to obtain the generated image, denoted as Lg.
The loss function includes two parts: 1) the L1 norm loss between the generated image Lg and the ground truth LG, as shown in Eq. (3); and 2) a loss value from the pre-trained expert lip-sync discriminator, computed by feeding the generated image Lg and the mel-spectrogram matrix into the discriminator and applying the cross-entropy loss function. These two loss values are then combined into a total loss, and the generator model parameters are optimized accordingly.
$L_1 = \dfrac{1}{N}\sum_{i=1}^{N} \lVert L_g - L_G \rVert_1$  (3)
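Combining the two terms can be sketched as below. The relative weighting sw = 0.03 is our assumption, taken from the original Wav2Lip paper rather than stated in the text above; sync_term stands for the expert discriminator's cross-entropy loss on the generated frames.

```python
import torch

def generator_loss(Lg, LG, sync_term, sw=0.03):
    # L1 reconstruction loss (Eq. 3) combined with the expert sync loss;
    # sw = 0.03 follows the original Wav2Lip paper (an assumption here)
    l1 = torch.abs(Lg - LG).mean()
    return (1.0 - sw) * l1 + sw * sync_term
```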
2.1.3 High-resolution video data
High-resolution video data is required for model training to create high-resolution digital humans. We recorded a dataset of 4K-resolution videos totaling 300 min in duration and featuring ten individuals. Each participant contributed 30 min of spoken content in Chinese, with only one person speaking per video.
We used the S3FD [50] algorithm for face detection, obtaining cropped face images. Typically, such cropped images are rectangular and are directly resized to a square format [1]; however, this causes distortion. We addressed this problem by using the bottom edge of the cropped face image as a reference, taking a square region anchored at that edge, and then resizing it to 384×384 pixels.
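A minimal sketch of this cropping step, under our reading of the method: keep the detector's bottom edge fixed, use the box width as the side length, and extend upward to form a square, so no rectangular-to-square distortion is introduced.

```python
import numpy as np

def square_face_crop(frame, box):
    """Square crop anchored at the bottom edge of the detected face box.

    frame : image array of shape (H, W, 3)
    box   : (x1, y1, x2, y2) coordinates from the face detector
    """
    x1, y1, x2, y2 = box
    side = x2 - x1                 # box width as the square side length
    top = max(0, y2 - side)        # extend upward from the bottom edge
    crop = frame[top:y2, x1:x2]
    # in practice the crop is then resized to 384x384, e.g. with
    # cv2.resize(crop, (384, 384)); omitted here to stay dependency-free
    return crop
```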
2.2 Government Question Answering System
2.2.1 Construction of government knowledge base
We compiled relevant governmental data to establish a government question answering system, collecting 400 question-answer pairs from the Zhejiang Provincial Government Service Network. This network covers a wide range of information, including household registration, ID card services, immigration, social insurance and housing funds, vehicle-related services, corporate groups, professional certifications, and departmental responsibilities. Relevant question-and-answer pairs were organized into a knowledge base structured to include both questions and corresponding answers. Using a Python program, we constructed a dictionary wherein each question serves as a key and its respective answer as the value.
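The dictionary structure described above is straightforward; the entries below are invented placeholders for illustration only (the real knowledge base holds roughly 400 pairs from the provincial service network).

```python
# Illustrative, invented entries; each question is a key, its answer the value.
qa_pairs = [
    ("How do I apply for a resident ID card?",
     "Submit an application at your local public security service window."),
    ("How do I check my housing fund balance?",
     "Log in to the provincial government service portal and open the housing fund page."),
]

knowledge_base = dict(qa_pairs)
```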
2.2.2 Construction of government question answering system
We built a government question answering system based on the knowledge base. The system primarily computes the similarity between a user-submitted question and the questions contained within the knowledge base, then selects the answer corresponding to the most similar question as a candidate answer. To complete this process, both the user's question and the knowledge base questions must be converted into vector format. We employed the BERT model, a robust text vectorization tool, to transform the text into vectors.
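The retrieval step can be sketched as follows. Here embed is a stand-in for a sentence-level BERT encoder (e.g. mean-pooled hidden states), and the 0.5 similarity threshold is an illustrative assumption, not a value from the paper.

```python
import numpy as np

def build_index(knowledge_base, embed):
    # precompute one vector per stored question
    questions = list(knowledge_base)
    vectors = np.stack([embed(q) for q in questions])
    return questions, vectors

def answer(query, knowledge_base, questions, vectors, embed, threshold=0.5):
    # cosine similarity between the query vector and every stored question
    q = embed(query)
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q) + 1e-8)
    best = int(np.argmax(sims))
    if sims[best] < threshold:
        return None                    # no sufficiently similar question
    return knowledge_base[questions[best]]
```

Precomputing the knowledge-base vectors once keeps the per-query cost to a single embedding plus a matrix-vector product, which matters for real-time interaction.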
2.3 TTS
Existing TTS technology is relatively mature. We employed the open-source speech synthesis system GPT-SoVITS to fine-tune our TTS model. Fine-tuning requires audio data from a target speaker. By recording just 1 min of audio from a speaker, we were able to fine-tune our model so that the synthesized voice closely approximates the original voice.
3 Experimentation and Practical Application
3.1 Technical Route
The objective of this study was first to develop a real-time and high-resolution digital human, followed by the creation of a government interactive digital human.
The core of the digital human is based on the improved Wav2Lip model (as shown in Fig.1). The technical route for its establishment began with capturing and preparing video data for subsequent processing, followed by developing the lip-sync discriminator model and the generator model as components of the improved Wav2Lip. This improved model was then trained using the preprocessed data, focusing on both the lip-sync discriminator and the generator. Once the generator model was fully trained, inputting audio into the system enabled the generation of digital human videos in real time.
Fig.1 The technical route of the digital human
The technical route for government digital human interaction is depicted in Fig.2. First, the user inputs a question into the government question answering system. If the system finds a matching answer, it outputs the corresponding text. The answer text is then fed into the TTS system, which converts it into audio. Finally, the audio is input into the improved Wav2Lip model, which generates a digital human video providing a visual and auditory response to the user.
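The interaction route above reduces to a thin pipeline. The three callables below are stand-ins for the subsystems described in this paper; their interfaces are our assumption for illustration.

```python
def respond(user_question, qa_system, tts, wav2lip, face_frames):
    """End-to-end interaction: QA -> TTS -> talking-face synthesis."""
    answer_text = qa_system(user_question)
    if answer_text is None:
        return None                     # no matching answer in the knowledge base
    audio = tts(answer_text)            # synthesized speech waveform
    return wav2lip(face_frames, audio)  # digital human video driven by the audio
```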
Fig.2 The technical route of the digital human in government interaction
3.2 Program Implementation and Deployment
The digital human, government question answering system, and TTS components were all developed using the PyTorch deep learning framework and implemented in Python. The system was deployed using Docker to ensure a scalable and isolated environment. The three components were integrated into a cohesive interactive platform using the Vue framework. The hardware configuration supporting the system comprises an NVIDIA RTX 3060 Ti graphics card with 12 GB of memory and an i9-10900K CPU, on a computer running Linux.
3.3 Real-time and High-Resolution Digital Human Generation
The improved Wav2Lip model successfully generates high-quality digital humans after training, exhibiting two main advantages: 1) the generated face images are clear, as shown in Fig.3; 2) the model swiftly produces video content (e.g., a 10 s video is returned in only 5 s).
Fig.3 Images saved during the training phase
3.4 Quantitative Evaluations
To evaluate the quality of lip-sync, we use the Lip Sync Error-Confidence (LSE-C) and Lip Sync Error-Distance (LSE-D) metrics introduced in Wav2Lip [1]. We also use the Fréchet Inception Distance (FID) to measure the quality of the generated faces [51]. The mean LSE-D, LSE-C, and FID scores are shown in Table 1.
Table 1 Quantitative comparison of different methods on our new dataset
As shown in Table 1, our method outperforms previous approaches. The lip-sync accuracy of the videos generated using our method is almost as good as that of real synced videos, and our method produces lip-synced videos at very high resolutions.
3.5 Practical Application
3.5.1 Text-generated digital human
Text can be transformed into digital-human video by combining the TTS and digital human models, as depicted in Fig.4. We employed the open-source speech synthesis system GPT-SoVITS to fine-tune our TTS model, resulting in a high-quality synthesized voice closely approximating the original human voice. The speech synthesis also operates at high speed, generating 1 s of audio in only 0.5 s, allowing for real-time synthesis.
Fig.4 Text-generated digital human
3.5.2 Government interaction with digital humans
In this study, we integrated a government question answering system, TTS, and a digital human to create a government interactive digital human system. When a user submits a question, if there is an answer in the knowledge base, the digital human can quickly vocalize a response (as shown in Fig.5).
Fig.5 Government interaction with digital humans
The government digital human we are developing is currently in the experimental phase and has not been formally implemented in any practical governance operations. Government departments have dedicated data storage and management teams to ensure security. If our digital human were adopted by these departments, the associated QA system and KB would need to be carefully screened to include only publicly accessible data, further safeguarding data security. Furthermore, the number and distribution of users could be counted, and user feedback could be recorded to optimize the digital human model further. An optimized digital human can provide users with a better experience, increase user engagement, and ultimately contribute to broader societal benefits.
4 Conclusions
In this study, we improved the Wav2Lip model to enable it to process resolutions up to 384×384 pixels, compared to the original 96×96 pixels. We also created a set of high-resolution video data for training, allowing the model to generate high-resolution, accurately lip-synced digital humans in real time and achieve clear results without the need for facial enhancement algorithms. We constructed the digital human, government QA system, and TTS system, then integrated these three components to create a government digital human capable of real-time interaction with clear, high-quality output. This digital human system is well suited for deployment in public service sectors and may significantly advance the digital transformation of government operations.
The government QA system utilized in this work operates on a fixed set of question-and-answer pairs, limiting its ability to handle a broader array of inquiries. Moving forward, we plan to explore the development of a large language model for government affairs to expand its capabilities. The system could also respond to a wider variety of questions after integrating the digital human with large models such as Zhipu Qingyan's ChatGLM4. Additionally, while the proposed government interactive digital human currently supports only Chinese, it may be extended to other languages after gathering video data in those languages.
Note: The facial images used in this study were provided by the author of this article and are presented here with her consent.
Appendix 1: The face encoder model
(1) Input layer: the input consists of 5 facial images, each with 3 channels and 384×384 pixels, for a total of 15 channels.
(2) Face encoder model:
(2.1) Convolutional layer: 16 channels, 7×7 kernel, stride 1, padding 3.
(2.2) Convolutional layer: 32 channels, 5×5 kernel, stride (1, 2), padding 1.
(2.3) Convolutional layer: 32 channels, 3×3 kernel, stride 1, padding 1.
(2.4) Convolutional layer: 32 channels, 3×3 kernel, stride 1, padding 1.
(2.5) Convolutional layer: 64 channels, 3×3 kernel, stride 2, padding 1.
(2.6) Convolutional layer: 64 channels, 3×3 kernel, stride 1, padding 1.
(2.7) Convolutional layer: 64 channels, 3×3 kernel, stride 1, padding 1.
(2.8) Convolutional layer: 128 channels, 3×3 kernel, stride 2, padding 1.
(2.9) Convolutional layer: 128 channels, 3×3 kernel, stride 1, padding 1.
(2.10) Convolutional layer: 128 channels, 3×3 kernel, stride 1, padding 1.
(2.11) Convolutional layer: 256 channels, 3×3 kernel, stride 2, padding 1.
(2.12) Convolutional layer: 256 channels, 3×3 kernel, stride 1, padding 1.
(2.13) Convolutional layer: 256 channels, 3×3 kernel, stride 1, padding 1.
(2.14) Convolutional layer: 256 channels, 3×3 kernel, stride 2, padding 1.
(2.15) Convolutional layer: 256 channels, 3×3 kernel, stride 1, padding 1.
(2.16) Convolutional layer: 256 channels, 3×3 kernel, stride 1, padding 1.
(2.17) Convolutional layer: 512 channels, 3×3 kernel, stride 2, padding 1.
(2.18) Convolutional layer: 512 channels, 3×3 kernel, stride 1, padding 1.
(2.19) Convolutional layer: 512 channels, 3×3 kernel, stride 1, padding 1.
(2.20) Convolutional layer: 512 channels, 3×3 kernel, stride 2, padding 1.
(2.21) Convolutional layer: 512 channels, 3×3 kernel, stride 1, padding 0.
(2.22) Convolutional layer: 512 channels, 1×1 kernel, stride 1, padding 0.
Appendix 2: The face encoder block model
(1) Face encoder block 1:
(1.1) Convolutional layer: 8 channels, 7×7 kernel, stride 1, padding 3.
(2) Face encoder block 2:
(2.1) Convolutional layer: 16 channels, 3×3 kernel, stride 2, padding 1.
(2.2) Convolutional layer: 16 channels, 3×3 kernel, stride 1, padding 1.
(2.3) Convolutional layer: 16 channels, 3×3 kernel, stride 1, padding 1.
(3) Face encoder block 3:
(3.1) Convolutional layer: 32 channels, 3×3 kernel, stride 2, padding 1.
(3.2) Convolutional layer: 32 channels, 3×3 kernel, stride 1, padding 1.
(3.3) Convolutional layer: 32 channels, 3×3 kernel, stride 1, padding 1.
(4) Face encoder block 4:
(4.1) Convolutional layer: 64 channels, 3×3 kernel, stride 2, padding 1.
(4.2) Convolutional layer: 64 channels, 3×3 kernel, stride 1, padding 1.
(4.3) Convolutional layer: 64 channels, 3×3 kernel, stride 1, padding 1.
(5) Face encoder block 5:
(5.1) Convolutional layer: 128 channels, 3×3 kernel, stride 2, padding 1.
(5.2) Convolutional layer: 128 channels, 3×3 kernel, stride 1, padding 1.
(5.3) Convolutional layer: 128 channels, 3×3 kernel, stride 1, padding 1.
(6) Face encoder block 6:
(6.1) Convolutional layer: 256 channels, 3×3 kernel, stride 2, padding 1.
(6.2) Convolutional layer: 256 channels, 3×3 kernel, stride 1, padding 1.
(6.3) Convolutional layer: 256 channels, 3×3 kernel, stride 1, padding 1.
(7) Face encoder block 7:
(7.1) Convolutional layer: 512 channels, 3×3 kernel, stride 2, padding 1.
(7.2) Convolutional layer: 512 channels, 3×3 kernel, stride 1, padding 1.
(7.3) Convolutional layer: 512 channels, 3×3 kernel, stride 1, padding 1.
(8) Face encoder block 8:
(8.1) Convolutional layer: 512 channels, 3×3 kernel, stride 2, padding 1.
(8.2) Convolutional layer: 512 channels, 3×3 kernel, stride 1, padding 1.
(9) Face encoder block 9:
(9.1) Convolutional layer: 512 channels, 3×3 kernel, stride 2, padding 0.
(9.2) Convolutional layer: 512 channels, 3×3 kernel, stride 1, padding 0.
Appendix 3: The face decoder block model
(1) Face decoder block 1:
(1.1) Convolutional layer: 512 channels, 1×1 kernel, stride 1, padding 0.
(2) Face decoder block 2:
(2.1) Deconvolution layer: 512 channels, 3×3 kernel, stride 1, padding 0.
(2.2) Convolutional layer: 512 channels, 3×3 kernel, stride 1, padding 1.
(3) Face decoder block 3:
(3.1) Deconvolution layer: 512 channels, 3×3 kernel, stride 2, padding 1, output padding 1.
(3.2) Convolutional layer: 512 channels, 3×3 kernel, stride 1, padding 1.
(3.3) Convolutional layer: 512 channels, 3×3 kernel, stride 1, padding 1.
(4) Face decoder block 4:
(4.1) Deconvolution layer: 256 channels, 3×3 kernel, stride 2, padding 1, output padding 1.
(4.2) Convolutional layer: 256 channels, 3×3 kernel, stride 1, padding 1.
(4.3) Convolutional layer: 256 channels, 3×3 kernel, stride 1, padding 1.
(5) Face decoder block 5:
(5.1) Deconvolution layer: 128 channels, 3×3 kernel, stride 2, padding 1, output padding 1.
(5.2) Convolutional layer: 128 channels, 3×3 kernel, stride 1, padding 1.
(5.3) Convolutional layer: 128 channels, 3×3 kernel, stride 1, padding 1.
(6) Face decoder block 6:
(6.1) Deconvolution layer: 64 channels, 3×3 kernel, stride 2, padding 1, output padding 1.
(6.2) Convolutional layer: 64 channels, 3×3 kernel, stride 1, padding 1.
(6.3) Convolutional layer: 64 channels, 3×3 kernel, stride 1, padding 1.
(7) Face decoder block 7:
(7.1) Deconvolution layer: 32 channels, 3×3 kernel, stride 2, padding 1, output padding 1.
(7.2) Convolutional layer: 32 channels, 3×3 kernel, stride 1, padding 1.
(7.3) Convolutional layer: 32 channels, 3×3 kernel, stride 1, padding 1.
(8) Face decoder block 8:
(8.1) Deconvolution layer: 16 channels, 3×3 kernel, stride 2, padding 1, output padding 1.
(8.2) Convolutional layer: 16 channels, 3×3 kernel, stride 1, padding 1.
(8.3) Convolutional layer: 16 channels, 3×3 kernel, stride 1, padding 1.
(9) Face decoder block 9:
(9.1) Deconvolution layer: 8 channels, 3×3 kernel, stride 2, padding 1, output padding 1.
(9.2) Convolutional layer: 8 channels, 3×3 kernel, stride 1, padding 1.
(9.3) Convolutional layer: 8 channels, 3×3 kernel, stride 1, padding 1.
The number of parameters in the model was determined through trial and error. A larger number of channels increases model complexity and slows inference, making real-time synthesis increasingly challenging. Conversely, too few channels lead to an overly simplistic model that cannot synthesize a digital human effectively.