VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing

View a PDF of the paper titled VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing, by Chunyu Qiang and 14 other authors
View PDF HTML (experimental)
Abstract: Deep learning has brought significant improvements to the field of cross-modal representation learning. For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired, one that emphasizes the semantic content of the text modality while de-emphasizing the paralinguistic information of the speech modality. We propose a method called "Vector Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP)", which uses a cross-modal aligned sequence transcoder to bring text and speech into a joint multimodal space, learning how to connect text and speech at the frame level. The proposed VQ-CTAP is a paradigm for cross-modal sequence representation learning, offering a promising solution for fine-grained generation and recognition tasks in speech processing. VQ-CTAP can be applied directly to VC and ASR tasks without fine-tuning or additional structures. We propose a sequence-aware semantic connector, which links multiple frozen pre-trained modules for the TTS task, exhibiting plug-and-play capability. We design a stepping optimization strategy to ensure effective model convergence by gradually injecting and adjusting the influence of various loss components. Furthermore, we propose a semantic-transfer-wise paralinguistic consistency loss to enhance representational capabilities, allowing the model to better generalize to unseen data and capture the nuances of paralinguistic information. In addition, VQ-CTAP achieves high-compression speech coding at a rate of 25 Hz from 24 kHz input waveforms, a 960-fold reduction in the sampling rate. The audio demo is available at this https URL
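The two core ideas in the abstract, frame-level contrastive alignment of text and speech embeddings and vector quantization of the speech representation, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the NumPy-based InfoNCE formulation, and the toy codebook are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosines."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def frame_level_contrastive_loss(text_emb, speech_emb, temperature=0.07):
    """InfoNCE-style loss over time-aligned frames.

    Frame i of the text sequence is the positive for frame i of the speech
    sequence; every other frame in the batch acts as a negative.
    """
    t = l2_normalize(text_emb)
    s = l2_normalize(speech_emb)
    logits = t @ s.T / temperature                # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # matched pairs sit on the diagonal

def vector_quantize(frames, codebook):
    """Nearest-codeword lookup: map each frame to a discrete token index."""
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return idx, codebook[idx]
```

As a sanity check, a correctly aligned text/speech pair should score a lower contrastive loss than a misaligned one, and the quoted compression rate follows from simple arithmetic: 24000 Hz / 25 Hz = 960.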
Submission history
From: Chunyu Qiang [view email]
[v1]
Sun, 11 Aug 2024 12:24:23 UTC (5,865 KB)
[v2]
Wed, 28 May 2025 03:51:33 UTC (9,086 KB)