Machine Learning / Speech Recognition · August 1, 2023

Training and Decoding a YESNO Monophone Model

1. Training and Testing

First, change into the recipe directory and run the training script:

cd /data/zsf/Zsf_WorkSpace/Kaldi_WorkSpace/kaldi/egs/yesno/s5
./run.sh

The end of the log output looks like this:

fsttablecompose exp/mono0a/graph_tgpr/Ha.fst data/lang_test_tg/tmp/CLG_1_0.fst 
fstrmsymbols exp/mono0a/graph_tgpr/disambig_tid.int 
fstminimizeencoded 
fstdeterminizestar --use-log=true 
fstrmepslocal 
fstisstochastic exp/mono0a/graph_tgpr/HCLGa.fst 
0.5342 -0.000422432
HCLGa is not stochastic
add-self-loops --self-loop-scale=0.1 --reorder=true exp/mono0a/final.mdl exp/mono0a/graph_tgpr/HCLGa.fst 
steps/decode.sh --nj 1 --cmd utils/run.pl exp/mono0a/graph_tgpr data/test_yesno exp/mono0a/decode_test_yesno
decode.sh: feature type is delta
steps/diagnostic/analyze_lats.sh --cmd utils/run.pl exp/mono0a/graph_tgpr exp/mono0a/decode_test_yesno
steps/diagnostic/analyze_lats.sh: see stats in exp/mono0a/decode_test_yesno/log/analyze_alignments.log
Overall, lattice depth (10,50,90-percentile)=(1,1,2) and mean=1.2
steps/diagnostic/analyze_lats.sh: see stats in exp/mono0a/decode_test_yesno/log/analyze_lattice_depth_stats.log
local/score.sh --cmd utils/run.pl data/test_yesno exp/mono0a/graph_tgpr exp/mono0a/decode_test_yesno
local/score.sh: scoring with word insertion penalty=0.0,0.5,1.0
%WER 0.00 [ 0 / 232, 0 ins, 0 del, 0 sub ] exp/mono0a/decode_test_yesno/wer_10_0.0
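The %WER line reports the total number of errors over the number of reference words, followed by the insertion, deletion and substitution counts. As a small sketch, the percentage can be recomputed from those counts. wer_from_line is a hypothetical helper (not a Kaldi script), and it assumes exactly the field layout seen in the log above.

```shell
# Recompute WER from a Kaldi-style "%WER" line read on stdin.
# wer_from_line is a hypothetical helper; it assumes this field layout:
#   %WER <pct> [ <err> / <N>, <I> ins, <D> del, <S> sub ] <file>
wer_from_line() {
    awk '$1 == "%WER" {
        n = $6 + 0                # "232," converts to the number 232
        errs = $7 + $9 + $11      # insertions + deletions + substitutions
        if (n > 0) printf "%.2f\n", 100 * errs / n
    }'
}

# Usage with the line from the log above:
echo '%WER 0.00 [ 0 / 232, 0 ins, 0 del, 0 sub ] exp/mono0a/decode_test_yesno/wer_10_0.0' | wer_from_line
```

With 0 errors over 232 reference words, this gives 0.00, matching the value the scoring script reports.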

The run completes without problems, provided Kaldi has been installed properly as described in the earlier blog post.

Next, copy the trained model to the board so that it can be tested there:

cd exp
scp -r mono0a/ root@192.168.1.158:/root/yesorno  # copy the trained model to the board

The directory structure on the board is shown below. It is long, so only part of it is listed:

root@NanoPi-M1:~/yesorno# tree
.
└── mono0a
    ├── 0.mdl
    ├── 40.mdl
    ├── 40.occs
    ├── ali.1.gz
    ├── cmvn_opts
    ├── decode_test_yesno
    │   ├── lat.1.gz
    │   ├── log
    │   │   ├── analyze_alignments.log
    │   │   ├── analyze_lattice_depth_stats.log
    │   │   ├── decode.1.log
...

Next, record a clip of spoken "yes" and "no" and copy it to the board as well:

(base) zsf@BSP-D01-srv:~$ scp yesorno.wav  root@192.168.1.158:/root
The authenticity of host '192.168.1.158 (192.168.1.158)' can't be established.
ECDSA key fingerprint is a2:88:9a:23:d0:bf:f0:f9:3e:af:77:6d:02:86:7b:3a.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.1.158' (ECDSA) to the list of known hosts.
root@192.168.1.158's password: 
yesorno.wav                                                                                                                               100%  252KB 252.4KB/s   00:00    
(base) zsf@BSP-D01-srv:~$ 

2. Testing on the Board

First, look at the usage of the online-wav-gmm-decode-faster decoder:

root@NanoPi-M1:~# ./online-wav-gmm-decode-faster
./online-wav-gmm-decode-faster

Reads in wav file(s) and simulates online decoding.
Writes integerized-text and .ali files for WER computation. Utterance segmentation is done on-the-fly.
Feature splicing/LDA transform is used, if the optional(last) argument is given.
Otherwise delta/delta-delta(i.e. 2-nd order) features are produced.
Caution: the last few frames of the wav file may not be decoded properly.
Hence, don't use one wav file per utterance, but rather use one wav file per show.

Usage: online-wav-gmm-decode-faster [options] wav-rspecifier model-infst-in word-symbol-table silence-phones transcript-wspecifier alignments-wspecifier [lda-matrix-in]

Example: ./online-wav-gmm-decode-faster --rt-min=0.3 --rt-max=0.5 --max-active=4000 --beam=12.0 --acoustic-scale=0.0769 scp:wav.scp model HCLG.fst words.txt '1:2:3:4:5' ark,t:trans.txt ark,t:ali.txt
Options:
  --acoustic-scale            : Scaling factor for acoustic likelihoods (float, default = 0.1)
  --batch-size                : Number of feature vectors processed w/o interruption (int, default = 27)
  --beam                      : Decoding beam.  Larger->slower, more accurate. (float, default = 16)
  --beam-delta                : Increment used in decoder [obscure setting] (float, default = 0.5)
  --beam-update               : Beam update rate (float, default = 0.01)
  --channel                   : Channel to extract (-1 -> expect mono, 0 -> left, 1 -> right) (int, default = -1)
  --cmn-window                : Number of feat. vectors used in the running average CMN calculation (int, default = 600)
  --hash-ratio                : Setting used in decoder to control hash behavior (float, default = 2)
  --inter-utt-sil             : Maximum # of silence frames to trigger new utterance (int, default = 50)
  --left-context              : Number of frames of left context (int, default = 4)
  --max-active                : Decoder max active states.  Larger->slower; more accurate (int, default = 2147483647)
  --max-beam-update           : Max beam update rate (float, default = 0.05)
  --max-utt-length            : If the utterance becomes longer than this number of frames, shorter silence is acceptable as an utterance separator (int, default = 1500)
  --min-active                : Decoder min active states (don't prune if #active less than this). (int, default = 20)
  --min-cmn-window            : Minumum CMN window used at start of decoding (adds latency only at start) (int, default = 100)
  --num-tries                 : Number of successive repetitions of timeout before we terminate stream (int, default = 5)
  --right-context             : Number of frames of right context (int, default = 4)
  --rt-max                    : Approximate maximum decoding run time factor (float, default = 0.75)
  --rt-min                    : Approximate minimum decoding run time factor (float, default = 0.7)
  --update-interval           : Beam update interval in frames (int, default = 3)

Standard options:
  --config                    : Configuration file to read (this option may be repeated) (string, default = "")
  --help                      : Print out usage message (bool, default = false)
  --print-args                : Print the command line arguments (to stderr) (bool, default = true)
  --verbose                   : Verbose level (higher->more logging) (int, default = 0)

Next, prepare the data. We take test data straight from the dataset and run it on the board. The data is in the s5 directory:

/data/zsf/Zsf_WorkSpace/Kaldi_WorkSpace/kaldi/egs/yesno/s5/waves_yesno

To make testing easier, copy the whole directory to the board:

/yesno/s5$ scp -r waves_yesno root@192.168.1.158:/root
root@192.168.1.158's password: 
1_0_1_0_1_0_0_1.wav                                                                                                                       100%   77KB   4.3MB/s   00:00    
README                                                                                                                                    100%  833   447.3KB/s   00:00    
0_1_1_1_1_0_1_0.wav                                                                                                                       100%   94KB   4.4MB/s   00:00    
0_1_0_0_1_0_1_0.wav  
...

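Next, create the SCP file, which maps each utterance ID (here the file name) to the path of its WAV file. With this file the decoder can decode several WAV files in one run. We pick five files arbitrarily, since the goal is only to exercise the decoder: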

1_0_0_0_0_0_0_0 /root/waves_yesno/1_0_0_0_0_0_0_0.wav
1_0_0_0_0_0_0_1 /root/waves_yesno/1_0_0_0_0_0_0_1.wav
1_0_0_0_0_0_1_1 /root/waves_yesno/1_0_0_0_0_0_1_1.wav
1_0_0_0_1_0_0_1 /root/waves_yesno/1_0_0_0_1_0_0_1.wav
1_0_0_1_0_1_1_1 /root/waves_yesno/1_0_0_1_0_1_1_1.wav
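Rather than typing the list by hand, it can be generated. The following is a minimal sketch: make_wav_scp is a hypothetical helper, and the directory /root/waves_yesno and the output name test_yesno_wav.scp are taken from the commands in this post. It keeps the first N files in glob order, which is one arbitrary choice among many.

```shell
# Print "<utterance-id> <wav-path>" lines for the first N WAV files in a directory.
# make_wav_scp is a hypothetical helper, not part of Kaldi.
make_wav_scp() {
    local wav_dir="$1"
    local count="$2"
    local n=0
    local wav
    for wav in "$wav_dir"/*.wav; do
        [ -e "$wav" ] || continue            # directory had no .wav files
        [ "$n" -lt "$count" ] || break       # stop after N entries
        # The utterance ID is the file name without its .wav extension.
        printf '%s %s\n' "$(basename "$wav" .wav)" "$wav"
        n=$((n + 1))
    done
}

# Usage (paths assumed from the text):
# make_wav_scp /root/waves_yesno 5 > test_yesno_wav.scp
```

Generating the paths from the directory listing also avoids hand-typed path mistakes such as a missing slash.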

Then fetch the following files from the training output:

final.mdl  HCLG.fst  words.txt
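They correspond to the model, HCLG.fst and words.txt arguments in the usage example above. All of them are under s5/exp/mono0a; the exact paths are omitted here.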
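Before decoding, it can save time to confirm that every input is actually in place. The sketch below is an illustration only: check_decode_inputs is a hypothetical helper that checks the given files and every WAV path listed in the SCP file, and returns non-zero if anything is missing.

```shell
# Verify that the decoder inputs exist before running the decoder.
# check_decode_inputs is a hypothetical helper, not part of Kaldi.
# Usage: check_decode_inputs <wav.scp> <file>...
check_decode_inputs() {
    local scp="$1"
    local missing=0
    local f id path
    shift
    if [ ! -f "$scp" ]; then
        echo "missing scp file: $scp" >&2
        return 1
    fi
    # Model, graph and symbol table passed as the remaining arguments.
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    # Each scp line is "<utterance-id> <path>".
    while read -r id path; do
        if [ -n "$id" ] && [ ! -f "$path" ]; then
            echo "missing wav for $id: $path" >&2
            missing=1
        fi
    done < "$scp"
    return "$missing"
}

# Usage (file names as used in this post):
# check_decode_inputs test_yesno_wav.scp final.mdl HCLG.fst words.txt && echo "inputs OK"
```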

Now run the decoder:

./online-wav-gmm-decode-faster --rt-min=0.3 --rt-max=0.5 --max-active=4000 --beam=12.0 --acoustic-scale=0.0769 scp:test_yesno_wav.scp final.mdl HCLG.fst words.txt '1:2:3:4:5' ark,t:trans.txt ark,t:ali.txt
./online-wav-gmm-decode-faster --rt-min=0.3 --rt-max=0.5 --max-active=4000 --beam=12.0 --acoustic-scale=0.0769 scp:test_yesno_wav.scp final.mdl HCLG.fst words.txt 1:2:3:4:5 ark,t:trans.txt ark,t:ali.txt
File: 1_0_0_0_0_0_0_0
ERROR (online-wav-gmm-decode-faster[5.5.120~486-da328]:main():online-wav-gmm-decode-faster.cc:148) Sampling rates other than 16kHz are not supported!

[ Stack-Trace: ]
./online-wav-gmm-decode-faster(kaldi::MessageLogger::LogMessage() const+0x7dc) [0x1c2cd4]
./online-wav-gmm-decode-faster(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x14) [0xbf950]
./online-wav-gmm-decode-faster(main+0xa68) [0xbcad0]
/lib/arm-linux-gnueabihf/libc.so.6(__libc_start_main+0x9d) [0xb6bc28aa]

It turns out that the original files are not sampled at 16 kHz. We therefore convert the WAV files from 8 kHz to 16 kHz. Any tool will do for this conversion; I simply used Audacity.
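As a command-line alternative to Audacity, the same check and conversion can be sketched with standard tools. This is an illustration under assumptions, not the method used in this post: wav_sample_rate and resample_dir_to_16k are hypothetical helpers, the header read assumes a canonical RIFF/WAVE layout on a little-endian host, and the resampling step assumes sox is installed.

```shell
# Print the sample rate stored in a WAV header.
# Assumes a canonical header (the "fmt " chunk first), where bytes 24..27
# hold the rate as a little-endian 32-bit integer, and a little-endian host.
wav_sample_rate() {
    od -An -t u4 -j 24 -N 4 "$1" | tr -d ' \n'
}

# Copy every WAV in src_dir to dst_dir, resampling to 16 kHz where needed.
# Assumes sox is installed; files already at 16 kHz are copied unchanged.
resample_dir_to_16k() {
    local src_dir="$1"
    local dst_dir="$2"
    local wav
    mkdir -p "$dst_dir"
    for wav in "$src_dir"/*.wav; do
        [ -e "$wav" ] || continue
        if [ "$(wav_sample_rate "$wav")" -ne 16000 ]; then
            sox "$wav" -r 16000 "$dst_dir/$(basename "$wav")"
        else
            cp "$wav" "$dst_dir/"
        fi
    done
}

# Usage (directory names are assumptions):
# wav_sample_rate waves_yesno/1_0_0_0_0_0_0_0.wav
# resample_dir_to_16k waves_yesno waves_yesno_16k
```

Checking the rate up front shows which files the decoder would reject before it is ever started.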

./online-wav-gmm-decode-faster --rt-min=0.3 --rt-max=0.5 --max-active=4000 --beam=12.0 --acoustic-scale=0.0769 scp:test_yesno_wav.scp final.mdl HCLG.fst words.txt '1:2:3:4:5' ark,t:trans.txt ark,t:ali.txt
./online-wav-gmm-decode-faster --rt-min=0.3 --rt-max=0.5 --max-active=4000 --beam=12.0 --acoustic-scale=0.0769 scp:test_yesno_wav.scp final.mdl HCLG.fst words.txt 1:2:3:4:5 ark,t:trans.txt ark,t:ali.txt
File: 1_0_0_0_0_0_0_0
NO

NO

NO

NO

NO

NO

NO

NO

NO

NO

NO

NO

NO

NO

NO

File: 1_0_0_0_0_0_0_1
NO

NO

NO

NO

NO

NO

NO

NO

NO

NO

NO

NO

NO

NO

NO

NO

File: 1_0_0_0_0_0_1_1
NO

NO

NO

NO

NO

NO

NO

NO

NO

NO

NO

NO

NO

File: 1_0_0_0_1_0_0_1
NO

NO

NO

NO

NO

NO

NO

NO

NO

NO

NO

NO

File: 1_0_0_1_0_1_1_1
NO

NO

NO

NO

NO

NO

NO

NO

NO

NO

NO

NO

YES

root@NanoPi-M1:~#
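The yesno file names encode the reference transcript (1 for yes, 0 for no), so the decoder output can be lined up against them. The following is a rough sketch: summarize_decode is a hypothetical helper that assumes the console format shown above, with a "File: <id>" line followed by one recognized word per line.

```shell
# Summarize decoder console output as "<id> | <reference> | <hypothesis>".
# summarize_decode is a hypothetical helper; the reference is derived from
# the utterance ID, where 1 stands for YES and 0 for NO.
summarize_decode() {
    awk '
        function emit() {
            if (id != "") print id " | " ref " | " hyp
        }
        /^File: / {
            emit()                      # finish the previous utterance
            id = $2
            ref = id
            gsub(/_/, " ", ref)         # 1_0_0 -> 1 0 0
            gsub(/1/, "YES", ref)
            gsub(/0/, "NO", ref)
            hyp = ""
            next
        }
        $1 == "YES" || $1 == "NO" {
            hyp = hyp (hyp == "" ? "" : " ") $1
        }
        END { emit() }
    ' "$@"
}

# Usage (decode.log is an assumed file holding the saved console output):
# summarize_decode decode.log
```

Each output line puts the eight reference words next to what the decoder produced, which makes differences in word count or content easy to spot.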