Paper Title


Exploring Gaussian mixture model framework for speaker adaptation of deep neural network acoustic models

Authors

Natalia Tomashenko, Yuri Khokhlov, Yannick Estève

Abstract


In this paper we investigate GMM-derived (GMMD) features for adaptation of deep neural network (DNN) acoustic models. The adaptation of a DNN trained on GMMD features is done through maximum a posteriori (MAP) adaptation of the auxiliary GMM model used for GMMD feature extraction. We explore fusion of the adapted GMMD features with conventional features, such as bottleneck and MFCC features, in two different neural network architectures: DNN and time-delay neural network (TDNN). We analyze and compare different types of adaptation techniques, such as i-vectors and feature-space adaptation based on maximum likelihood linear regression (fMLLR), with the proposed adaptation approach, and explore their complementarity using various types of fusion, such as feature level, posterior level, lattice level and others, in order to discover the best possible way of combination. Experimental results on the TED-LIUM corpus show that the proposed adaptation technique can be effectively integrated into DNN and TDNN setups at different levels and provides additional gains in recognition performance: up to 6% relative word error rate reduction (WERR) over a strong fMLLR speaker-adapted DNN baseline, and up to 18% relative WERR in comparison with a speaker-independent (SI) DNN baseline trained on conventional features. For TDNN models the proposed approach achieves up to 26% relative WERR in comparison with an SI baseline, and up to 13% in comparison with a model adapted using i-vectors. The analysis of the adapted GMMD features from various points of view demonstrates their effectiveness at different levels.
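The abstract's core mechanism can be illustrated concretely: GMMD features are built from the per-frame log-likelihoods of the components of an auxiliary GMM, and speaker adaptation amounts to MAP re-estimation of that GMM's parameters from a speaker's adaptation data. Below is a minimal numpy sketch under simplifying assumptions (diagonal covariances, mean-only MAP with a relevance factor `tau`); the function names are hypothetical and this is not the paper's actual implementation.

```python
import numpy as np

def component_logliks(x, means, variances, logweights):
    """GMMD-style features: log-likelihood of each diagonal-covariance
    Gaussian component for each frame. x: (T, D) frames; means,
    variances: (K, D); logweights: (K,). Returns a (T, K) matrix."""
    diff = x[:, None, :] - means[None, :, :]                      # (T, K, D)
    ll = -0.5 * np.sum(diff ** 2 / variances
                       + np.log(2.0 * np.pi * variances), axis=2)  # (T, K)
    return ll + logweights

def map_adapt_means(x, means, variances, logweights, tau=10.0):
    """Mean-only MAP adaptation of the auxiliary GMM from adaptation
    frames x: interpolate the data-driven component means with the
    prior means, weighted by the soft occupancy count of each component."""
    ll = component_logliks(x, means, variances, logweights)
    # Posterior responsibility of each component for each frame
    gamma = np.exp(ll - ll.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)                     # (T, K)
    n_k = gamma.sum(axis=0)                                       # (K,)
    x_bar = gamma.T @ x / np.maximum(n_k, 1e-10)[:, None]         # (K, D)
    alpha = (n_k / (n_k + tau))[:, None]                          # (K, 1)
    # Components with little adaptation data stay close to the prior
    return alpha * x_bar + (1.0 - alpha) * means
```

After adaptation, `component_logliks` evaluated with the MAP-updated means yields the speaker-adapted GMMD features that the paper then fuses with bottleneck or MFCC features as DNN/TDNN input.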
