

  1. ASRT
  2. ASR (Automatic Speech Recognition) and PaddleSpeech
    Datasets involved: AISHELL, WenetSpeech, LibriSpeech…
    ① DeepSpeech2: End-to-End Speech Recognition in English and Mandarin;
    ② U2: Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition;
    Conformer, Transformer, chunk-conformer
    ① SpeedySpeech: Efficient Neural Speech Synthesis (conformer);
    ② Conformer: Convolution-augmented Transformer for Speech Recognition
    The decoding methods also include Attention, … and so on.
    Different decoding methods give different character error rates (CER).
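
As a reminder of what CER measures, here is a minimal sketch: the character-level edit distance between the reference and the hypothesis, divided by the reference length. The function names `edit_distance` and `cer` are my own for illustration, not taken from PaddleSpeech or any other toolkit.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because the denominator is the reference length, CER can exceed 1.0 when the hypothesis contains many insertions.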

About end-to-end (E2E):
E2E models combine the acoustic, pronunciation and language models into a single neural network, and show competitive results compared to conventional ASR systems.
There are three main popular E2E approaches: CTC, the recurrent neural network transducer (RNN-T), and the attention-based encoder-decoder (AED).


U2 is a new framework proposed to unify non-streaming and streaming speech recognition.

The framework is based on the hybrid CTC/attention architecture with Conformer blocks.
A dynamic chunk-based attention strategy is proposed to allow arbitrary right context length.

To support streaming, the Conformer block is modified, with negligible performance degradation.

The model has three parts: a Shared Encoder, a CTC Decoder and an Attention Decoder.
The Shared Encoder consists of multiple Transformer or Conformer encoder layers.

The CTC Decoder consists of a linear layer and a log softmax layer;
The CTC loss function is applied over the softmax output in training.
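
A minimal sketch of what the CTC Decoder computes for one encoder frame: a linear projection to vocabulary size followed by a log softmax. It uses plain Python lists instead of a tensor library, and the names `ctc_decoder_frame`, `weight` and `bias` are hypothetical, chosen only for this illustration.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log softmax over one vector of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def ctc_decoder_frame(enc: List[float],
                      weight: List[List[float]],
                      bias: List[float]) -> List[float]:
    """Linear layer (one weight row per output token) followed by log softmax.

    enc is the shared-encoder output for a single frame; the result holds
    one log-probability per vocabulary entry (including the CTC blank).
    """
    logits = [sum(w * x for w, x in zip(row, enc)) + b
              for row, b in zip(weight, bias)]
    return log_softmax(logits)
```

In training, the CTC loss would be computed over these per-frame log-probabilities for the whole utterance.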

The Attention Decoder consists of multiple Transformer decoder layers.


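  1. The shared encoder contains multiple Transformer or Conformer layers;
    (the encoder's Conformer layers are specifically modified: the convolution is changed to a causal convolution)
  2. The CTC decoder is a fully connected layer followed by a softmax layer;
  3. The attention decoder contains multiple Transformer layers.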


At inference time, the CTC decoder generates n-best hypotheses in a streaming manner.

The inference latency could be easily controlled by only changing the chunk size.
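
A rough back-of-the-envelope sketch of why chunk size controls latency: the encoder cannot emit anything for a chunk until all of that chunk's audio has arrived. The default values below (a 10 ms frame shift and 4x subsampling in front of the encoder) are assumptions made for illustration, not numbers taken from these notes, and the function name is hypothetical.

```python
def chunk_duration_ms(chunk_size: int,
                      subsampling_rate: int = 4,
                      frame_shift_ms: float = 10.0) -> float:
    """Audio duration covered by one encoder chunk, in milliseconds.

    chunk_size is counted in encoder frames (after subsampling).
    subsampling_rate and frame_shift_ms are assumed front-end settings.
    """
    return chunk_size * subsampling_rate * frame_shift_ms


# Under these assumptions, a chunk of 16 encoder frames covers 640 ms of audio,
# while a chunk of 4 covers 160 ms.
latency_large = chunk_duration_ms(16)
latency_small = chunk_duration_ms(4)
```

This only counts the time needed to collect a chunk of audio; compute time and the rescoring pass come on top of it.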

The CTC hypotheses are then rescored by the attention decoder to get the final result.
This efficient rescoring process causes negligible sentence-level latency.

The training loss has two parts: the CTC loss and the AED loss.
In the combined loss, the first term is the CTC loss and the second term is the AED loss.
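
Written out, as I recall it from the U2 paper, the combined loss is a weighted sum of the two terms. Here x is the acoustic feature sequence, y is the label sequence, and λ is a hyperparameter between 0 and 1 that balances the two losses; the symbol names are my assumption of the paper's notation.

```latex
L_{\text{combined}}(x, y) = \lambda \, L_{\text{CTC}}(x, y) + (1 - \lambda) \, L_{\text{AED}}(x, y)
```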


If the Shared Encoder sees only a limited right context, the CTC decoder can run in streaming mode in the first pass. In other words, to make the model support streaming, the future information visible to the shared encoder must be limited.

To support streaming speech recognition, Dynamic Chunk Training is proposed.
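
A minimal sketch of the idea behind Dynamic Chunk Training, assuming (as I understand the U2 paper) that a chunk size is drawn at random for each training batch, uniformly between 1 and the longest utterance length in the batch. The function name and the exact sampling rule here are illustrative assumptions; real toolkits may use a different distribution.

```python
import random
from typing import Optional


def sample_chunk_size(max_len: int, rng: Optional[random.Random] = None) -> int:
    """Draw a chunk size for one training batch.

    max_len is the length (in encoder frames) of the longest utterance in
    the batch. A draw equal to max_len amounts to full attention, so the
    same model is trained on everything from tiny chunks to whole utterances.
    """
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    rng = rng or random.Random()
    return rng.randint(1, max_len)  # both endpoints are inclusive
```

Because the model sees many chunk sizes during training, any chunk size can be chosen at inference time.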

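U2 can stream only if the shared encoder is streaming. A standard Transformer encoder layer uses full self-attention, so with the scheme in panel (a) of the figure, streaming is not possible: (a) is standard self-attention, where every input time step t depends on the whole utterance.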


With a limited right context, input t sees only the frames t+1, t+2, …, t+W on its right, where W is the right context of each encoder layer; the total context accumulates through all the encoder layers.
For example, with N encoder layers that each have a right context of W, the total right context is N × W.
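
A small sketch of the limited-right-context scheme, using 0-based frame indices. The first function builds a per-layer attention mask; it assumes full left context, since the notes above only constrain the right side. The second one walks through the layers to show how the look-ahead adds up to N × W. Both function names are hypothetical.

```python
from typing import List


def limited_right_context_mask(num_frames: int, right_context: int) -> List[List[bool]]:
    """mask[i][j] is True when frame i may attend to frame j within one layer."""
    return [[j <= i + right_context for j in range(num_frames)]
            for i in range(num_frames)]


def farthest_visible_frame(frame: int, num_layers: int,
                           right_context: int, num_frames: int) -> int:
    """Index of the last input frame that can influence `frame` after all layers."""
    reach = frame
    for _ in range(num_layers):
        # Each layer lets the output look right_context frames further right.
        reach = min(reach + right_context, num_frames - 1)
    return reach
```

With 3 layers and W = 2, the output at frame 0 can be influenced by inputs up to frame 6, which matches N × W.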

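To address this, chunk attention is proposed, shown in panel (c) of the figure.
The input is split into several chunks of a fixed chunk size C; dark green marks the current chunk. Each chunk has the inputs [t+1, t+2, …, t+C], and every chunk depends on itself and on all previous chunks.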
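
The chunk attention rule can be sketched as an attention mask: a frame may attend to every frame in its own chunk and in all earlier chunks, but to nothing in later chunks. This is a plain-Python illustration with 0-based frame indices and a hypothetical function name, not code from any toolkit.

```python
from typing import List


def chunk_attention_mask(num_frames: int, chunk_size: int) -> List[List[bool]]:
    """mask[i][j] is True when frame i may attend to frame j."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    mask = []
    for i in range(num_frames):
        # One past the last frame of the chunk that contains frame i.
        chunk_end = min((i // chunk_size + 1) * chunk_size, num_frames)
        mask.append([j < chunk_end for j in range(num_frames)])
    return mask
```

For 5 frames and C = 2, frames 0 and 1 see only frames 0 to 1, frames 2 and 3 see frames 0 to 3, and frame 4 sees everything. Setting C to the utterance length gives back full self-attention.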


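The CTC decoder outputs the first-pass hypotheses in a streaming manner. The second pass can then run in one of two modes:
1. Attention Decoder mode. The CTC results are ignored in this mode.
The Attention Decoder generates outputs auto-regressively, attending to the output of the Shared Encoder.
2. Rescoring mode.
The n-best hypotheses from CTC are scored by the attention decoder in teacher-forcing mode, using the output of the shared encoder. The best rescored hypothesis is used as the final result. This mode avoids the auto-regressive process and achieves a better real-time factor.
In addition, the CTC score can be combined through a simple weighting to get a better result.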
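
A minimal sketch of the final selection step in rescoring mode. It assumes each n-best hypothesis already carries two log-probability scores: one from the CTC first pass and one from the attention decoder run in teacher-forcing mode. The combination rule (attention score plus `ctc_weight` times CTC score) and all names are assumptions made for illustration; the notes only say that the scores are combined with a weight.

```python
from typing import List, Tuple

# (text, ctc_log_prob, attention_log_prob); a hypothetical container format.
Hypothesis = Tuple[str, float, float]


def rescore(hyps: List[Hypothesis], ctc_weight: float = 0.5) -> str:
    """Return the text of the hypothesis with the best combined score."""
    if not hyps:
        raise ValueError("need at least one hypothesis")

    def combined(h: Hypothesis) -> float:
        _, ctc_score, att_score = h
        return att_score + ctc_weight * ctc_score

    return max(hyps, key=combined)[0]
```

With `ctc_weight=0.0` the choice depends on the attention decoder alone; larger weights pull the result back toward the CTC ranking.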



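SoX (Sound eXchange) is a cross-platform (Windows, Linux, macOS, etc.) command-line utility that converts audio files between a wide range of formats.
SoX can also apply various effects to the input audio, and on most platforms it can play and record audio files.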


Python: convert .mp3 to .wav

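# pydub needs ffmpeg (or libav) available on the system to decode MP3.
from pydub import AudioSegment

mp3_file = '2.mp3'    # input MP3 file
wav_file = 'now.wav'  # output WAV file

# Decode the MP3, then write it back out as WAV.
song = AudioSegment.from_mp3(mp3_file)
song.export(wav_file, format="wav")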

Note: not installed via pip install.

