There are many ways to do this. The simplest is to look at the approaches proposed for the Kaggle competition TensorFlow Speech Recognition Challenge https://www.kaggle.com/c/tensorflow-speech-recognition-challenge (just sort by most votes). This one https://www.kaggle.com/alphasis/light-weight-cnn-lb-0-74 is particularly clear and simple, and contains the function below. The inputs are a numeric vector of samples extracted from the wav file, the sample rate, the frame (window) size in milliseconds, the step (stride or skip) size in milliseconds, and a small offset (eps) added before taking the log.
from scipy.io import wavfile
from scipy import signal
import numpy as np

# path_to_wav_file is a placeholder for the wav file you want to process.
sample_rate, audio = wavfile.read(path_to_wav_file)

def log_specgram(audio, sample_rate, window_size=20,
                 step_size=10, eps=1e-10):
    # Convert the window and step sizes from milliseconds to sample counts.
    nperseg = int(round(window_size * sample_rate / 1e3))
    noverlap = int(round(step_size * sample_rate / 1e3))
    freqs, times, spec = signal.spectrogram(audio,
                                            fs=sample_rate,
                                            window='hann',
                                            nperseg=nperseg,
                                            noverlap=noverlap,
                                            detrend=False)
    # Transpose to (time, frequency) and log-compress; eps avoids log(0).
    return freqs, times, np.log(spec.T.astype(np.float32) + eps)
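To see what comes out, here is a minimal usage sketch; the filename 'sample.wav' and the 16 kHz / one-second figures are my assumptions based on the competition data, not part of the original kernel:

# Placeholder filename; substitute any wav file from the competition data.
sample_rate, audio = wavfile.read('sample.wav')
freqs, times, log_spec = log_specgram(audio, sample_rate)

# For a 16 kHz, one-second clip with the default 20 ms window and 10 ms step, expect roughly:
#   freqs.shape    -> (161,)  frequency bins from 0 Hz up to 8000 Hz
#   times.shape    -> (99,)   one entry per analysis frame
#   log_spec.shape -> (len(times), len(freqs))  because of the transpose in the function
print(freqs.shape, times.shape, log_spec.shape)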
The outputs are as defined in the SciPy manual https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.spectrogram.html, with one exception: the spectrogram is rescaled with a monotonic function (log) that suppresses large values far more than small ones, while still keeping larger values larger than smaller ones. This way the extreme values in the spectrogram do not dominate the computation. Alternatively, you could clip the values at some quantile, but the log (or even a square root) is usually preferred. There are many other ways to normalize the heights of the spectrogram, i.e. to prevent extreme values from "bullying" the output :) A sketch of the quantile-clipping alternative follows the output listing below.
freq (f) : ndarray, Array of sample frequencies.
times (t) : ndarray, Array of segment times.
spec (Sxx) : ndarray, Spectrogram of x. By default, the last axis of Sxx corresponds to the segment times.
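For comparison, here is a sketch of the quantile-clipping alternative mentioned above; the 0.99 quantile is an arbitrary choice of mine, and np.quantile requires NumPy 1.15 or newer:

def clipped_specgram(audio, sample_rate, window_size=20,
                     step_size=10, quantile=0.99):
    # Like log_specgram, but caps extreme values at a chosen quantile
    # instead of compressing them with a log.
    nperseg = int(round(window_size * sample_rate / 1e3))
    noverlap = int(round(step_size * sample_rate / 1e3))
    freqs, times, spec = signal.spectrogram(audio,
                                            fs=sample_rate,
                                            window='hann',
                                            nperseg=nperseg,
                                            noverlap=noverlap,
                                            detrend=False)
    spec = spec.T.astype(np.float32)
    cap = np.quantile(spec, quantile)  # value below which `quantile` of the bins fall
    return freqs, times, np.minimum(spec, cap)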
Alternatively, you can check the train.py and models.py code in the github repo https://github.com/tensorflow/tensorflow/tree/v1.4.0/tensorflow/examples/speech_commands from the TensorFlow audio recognition example https://www.tensorflow.org/tutorials/audio_recognition.
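If you would rather compute the features inside the TensorFlow graph, as that example does, a rough equivalent can be sketched with tf.signal; this assumes TensorFlow 2.x and is not the exact ops used in the v1.4 speech_commands code:

import tensorflow as tf

def tf_log_spectrogram(waveform, sample_rate=16000, window_ms=20,
                       step_ms=10, eps=1e-10):
    # Convert window and hop sizes from milliseconds to samples.
    frame_length = int(round(window_ms * sample_rate / 1e3))
    frame_step = int(round(step_ms * sample_rate / 1e3))
    # Note: frame_step is the hop between frames, whereas noverlap above is the overlap.
    stft = tf.signal.stft(tf.cast(waveform, tf.float32),
                          frame_length=frame_length,
                          frame_step=frame_step)
    # Magnitude spectrogram, log-compressed like the SciPy version.
    return tf.math.log(tf.abs(stft) + eps)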
Here is another thread https://www.kaggle.com/timolee/audio-data-conversion-to-images-eda that explains and gives code for building spectrograms in Python.
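In the same spirit, a small sketch of saving the log-spectrogram as an image file that a CNN image pipeline can consume; matplotlib and the output filename are my own choices here, not taken from that thread:

import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt

def wav_to_spectrogram_image(wav_path, png_path):
    sample_rate, audio = wavfile.read(wav_path)
    _, _, log_spec = log_specgram(audio, sample_rate)
    # Frequency on the vertical axis, time on the horizontal, no axes or margins.
    plt.figure(figsize=(4, 4))
    plt.axis('off')
    plt.imshow(log_spec.T, aspect='auto', origin='lower')
    plt.savefig(png_path, bbox_inches='tight', pad_inches=0)
    plt.close()

wav_to_spectrogram_image('sample.wav', 'out.png')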