Engineering a streaming ASR system means building a service on top of the streaming model so that it can handle the heavy request volume of real business traffic and provide stable support for the ASR needs of the target scenarios. When studying a model's latency offline, it is sufficient to look at the inference real-time factor (RTF); for a deployed ASR service, however, one must also measure how much concurrency a single server can sustain, i.e., how many simultaneous requests a fully loaded CPU or GPU can serve while still returning results within a latency users do not perceive, and then use the expected traffic volume to estimate the hardware required for deployment. The model should therefore be optimized and compressed to minimize inference time, achieving high concurrency while keeping resource consumption low. From the hardware perspective, the Transformer supports parallel computation, so GPUs yield a substantial speedup; RNNs, by contrast, have sequential time dependencies, so GPU acceleration is less pronounced and CPUs are the more cost-effective choice. The lowest-cost hardware configuration thus depends on the model selected.
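As a rough illustration of the capacity reasoning above (a back-of-envelope sketch, not part of the original text; the function names and the `headroom` parameter are hypothetical), RTF relates processing time to audio duration, and each real-time stream occupies roughly RTF of one core, so one core can sustain about 1/RTF concurrent streams:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent decoding / duration of the audio decoded.
    RTF < 1 means the model runs faster than real time."""
    return processing_seconds / audio_seconds

def max_concurrent_streams(rtf: float, cores: int, headroom: float = 0.8) -> int:
    """Rough capacity estimate: each real-time stream consumes ~RTF of one core,
    so a core sustains ~1/RTF streams; `headroom` reserves margin for load spikes."""
    return int(cores * headroom / rtf)

# Example: decoding 10 s of audio in 2 s gives RTF = 0.2,
# so an 8-core server with 80% headroom sustains ~32 real-time streams.
rtf = real_time_factor(2.0, 10.0)
capacity = max_concurrent_streams(rtf, cores=8)
```

In practice the estimate is only a starting point: memory bandwidth, batching strategy, and tail-latency targets all reduce the achievable concurrency, so the figure should be validated with load testing.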