1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290
| 2021.9.17 #!/bin/bash # Copyright 2017 Johns Hopkins University (Shinji Watanabe) # Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0) . ./path.sh || exit 1; . ./cmd.sh || exit 1; # general configuration backend=pytorch stage=0 # start from 0 if you need to start from data preparation #从步骤0开始 stop_stage=100 #到步骤100结束 ngpu=1 # number of gpus ("0" uses cpu, otherwise use gpu) #gpu的数量设为1 debugmode=1 dumpdir=dump # directory to dump full features N=0 # number of minibatches to be used (mainly for debugging). "0" uses all minibatches. verbose=0 # verbose option resume= # Resume the training from snapshot # feature configuration do_delta=false train_config=conf/tuning/train_CTC.yaml #训练设置 lm_config=conf/lm.yaml #语言模型设置 decode_config=conf/decode.yaml #解码设置 # rnnlm related lm_resume= # specify a snapshot file to resume LM training lmtag= # tag for managing LMs # decoding parameter recog_model=model.acc.best # set a model to be used for decoding: 'model.acc.best' or 'model.loss.best' n_average=10 #recog_model因为用了10条路径,选择正确率最优的路径,注释里说选择损失函数最好的路径 # data data=./export/data data_url=www.openslr.org/resources/33 #data数据存放的目录 #data_url数据下载网址 # exp tag tag="CTC" # tag for managing experiments. #实验命名,为了区分 建议设置 . ./utils/parse_options.sh || exit 1; #shell命令使用parse_options.sh传递 # Set bash to 'debug' mode, it will exit on : # -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands', set -e set -u set -o pipefail train_set=train_sp train_dev=dev recog_set="dev test" if [ ${stage} -le -1 ] && [ ${stop_stage} -ge -1 ]; then echo "stage -1: Data Download" local/download_and_untar.sh ${data} ${data_url} data_aishell local/download_and_untar.sh ${data} ${data_url} resource_aishell fi #下载数据集,通过运行download_and_untar.sh脚本下载aishell数据集,下载后存放在$data下 #stage0~2: kaldi格式数据集准备 #stage0: 数据准备 if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then ### Task dependent. You have to make data the following preparation part by yourself. ### But you can utilize Kaldi recipes in most cases echo "stage 0: Data preparation" local/data_prep.sh ${data}/data_record/wav ${data}/data_record/transcript # remove space in text for x in train dev test; do cp data/${x}/text data/${x}/text.org paste -d " " <(cut -f 1 -d" " data/${x}/text.org) <(cut -f 2- -d" " data/${x}/text.org | tr -d " ") \ > data/${x}/text rm data/${x}/text.org done fi #上面就是数据准备的工作,目的就是创建和kaldi风格一样的数据格式wav.scp,utt2spk,spk2utt,text feat_tr_dir=${dumpdir}/${train_set}/delta${do_delta}; mkdir -p ${feat_tr_dir} feat_dt_dir=${dumpdir}/${train_dev}/delta${do_delta}; mkdir -p ${feat_dt_dir} #递归创建这些文件夹 #stage1:特征提取 if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then ### Task dependent. You have to design training and dev sets by yourself. ### But you can utilize Kaldi recipes in most cases echo "stage 1: Feature Generation" fbankdir=fbank # Generate the fbank features; by default 80-dimensional fbanks with pitch on each frame steps/make_fbank_pitch.sh --cmd "$train_cmd" --nj 32 --write_utt2num_frames true \ data/train exp/make_fbank/train ${fbankdir} utils/fix_data_dir.sh data/train steps/make_fbank_pitch.sh --cmd "$train_cmd" --nj 10 --write_utt2num_frames true \ data/dev exp/make_fbank/dev ${fbankdir} utils/fix_data_dir.sh data/dev steps/make_fbank_pitch.sh --cmd "$train_cmd" --nj 10 --write_utt2num_frames true \ data/test exp/make_fbank/test ${fbankdir} utils/fix_data_dir.sh data/test #make_fbank_pitch.sh脚本用来提取80维的fbank特征+3维pitch特征,放在exp/make_fbank/train/fbank文件夹下。 #fix_data_dir.sh是检验数据集,删除一些没有任何特征的数据并且保证文件是有序的,节省算力。 #数据增强操作:最常用的方法是速度扰动和音量扰动。 #速度扰动 # speed-perturbed utils/perturb_data_dir_speed.sh 0.9 data/train data/temp1 utils/perturb_data_dir_speed.sh 1.0 data/train data/temp2 utils/perturb_data_dir_speed.sh 1.1 data/train data/temp3 # perturb_data_dir_speed.sh脚本进行速度扰动,就是使原来的语音进行变速,使得训练出来的模型能够适应不同的语音语速。 utils/combine_data.sh --extra-files utt2uniq data/${train_set} data/temp1 data/temp2 data/temp3 #把三种不同语音速度的data进行组合 rm -r data/temp1 data/temp2 data/temp3 steps/make_fbank_pitch.sh --cmd "$train_cmd" --nj 32 --write_utt2num_frames true \ data/${train_set} exp/make_fbank/${train_set} ${fbankdir} utils/fix_data_dir.sh data/${train_set} #进行语音增强后要重新进行特征提取 #倒谱均值归一化:作用使特征服从均值为0,方差为1的高斯分布。具体看kaldi代码中的注释 # compute global CMVN compute-cmvn-stats scp:data/${train_set}/feats.scp data/${train_set}/cmvn.ark #特征转存 # dump features for training split_dir=$(echo $PWD | awk -F "/" '{print $NF "/" $(NF-1)}') if [[ $(hostname -f) == *.clsp.jhu.edu ]] && [ ! -d ${feat_tr_dir}/storage ]; then utils/create_split_dir.pl \ /export/a{11,12,13,14}/${USER}/espnet-data/egs/${split_dir}/dump/${train_set}/delta${do_delta}/storage \ ${feat_tr_dir}/storage fi if [[ $(hostname -f) == *.clsp.jhu.edu ]] && [ ! -d ${feat_dt_dir}/storage ]; then utils/create_split_dir.pl \ /export/a{11,12,13,14}/${USER}/espnet-data/egs/${split_dir}/dump/${train_dev}/delta${do_delta}/storage \ ${feat_dt_dir}/storage fi dump.sh --cmd "$train_cmd" --nj 32 --do_delta ${do_delta} \ data/${train_set}/feats.scp data/${train_set}/cmvn.ark exp/dump_feats/train ${feat_tr_dir} for rtask in ${recog_set}; do feat_recog_dir=${dumpdir}/${rtask}/delta${do_delta}; mkdir -p ${feat_recog_dir} dump.sh --cmd "$train_cmd" --nj 10 --do_delta ${do_delta} \ data/${rtask}/feats.scp data/${train_set}/cmvn.ark exp/dump_feats/recog/${rtask} \ ${feat_recog_dir} done fi # #字典准备:因为计算机进行深度学习模型的时,字符不能直接作为输入的,要将字符转换成特定的数字 dict=data/lang_1char/${train_set}_units.txt echo "dictionary: ${dict}" if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then ### Task dependent. You have to check non-linguistic symbols used in the corpus. echo "stage 2: Dictionary and Json Data Preparation" mkdir -p data/lang_1char/ #递归创建字典文件夹 echo "make a dictionary" echo "<unk> 1" > ${dict} # <unk> must be 1, 0 will be used for "blank" in CTC text2token.py -s 1 -n 1 data/${train_set}/text | cut -f 2- -d" " | tr " " "\n" \ | sort | uniq | grep -v -e '^\s*$' | awk '{print $0 " " NR+1}' >> ${dict} wc -l ${dict} #espnet中使用text2token.py来通过映射文件中的text文件生成词典.uniq用于去重复 echo "make json files" data2json.sh --feat ${feat_tr_dir}/feats.scp \ data/${train_set} ${dict} > ${feat_tr_dir}/data.json for rtask in ${recog_set}; do feat_recog_dir=${dumpdir}/${rtask}/delta${do_delta} data2json.sh --feat ${feat_recog_dir}/feats.scp \ data/${rtask} ${dict} > ${feat_recog_dir}/data.json done fi #espnet中训练神经网络时,不是直接使用提取的特征和text这些映射文件,而是通过脚本data2json.sh将文件打包成一个json文件。整体结构分为两部分:input和ouput。input为语音的特征以及特征的形状shape;out为语音对应的文本及数字表示。 # you can skip this and remove --rnnlm option in the recognition (stage 5) if [ -z ${lmtag} ]; then lmtag=$(basename ${lm_config%.*}) fi lmexpname=train_rnnlm_${backend}_${lmtag} lmexpdir=exp/${lmexpname} mkdir -p ${lmexpdir} #语言模型的训练 if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then echo "stage 3: LM Preparation" lmdatadir=data/local/lm_train mkdir -p ${lmdatadir} text2token.py -s 1 -n 1 data/train/text | cut -f 2- -d" " \ > ${lmdatadir}/train.txt text2token.py -s 1 -n 1 data/${train_dev}/text | cut -f 2- -d" " \ > ${lmdatadir}/valid.txt #使用lm_train.py脚本进行语言模型的训练 ${cuda_cmd} --gpu ${ngpu} ${lmexpdir}/train.log \ lm_train.py \ --config ${lm_config} \ --ngpu ${ngpu} \ --backend ${backend} \ --verbose 1 \ --outdir ${lmexpdir} \ --tensorboard-dir tensorboard/${lmexpname} \ --train-label ${lmdatadir}/train.txt \ --valid-label ${lmdatadir}/valid.txt \ --resume ${lm_resume} \ --dict ${dict} fi #需要准备4个文件,第一个在conf里面要准备的lm.yaml配置脚本对应lm_config,第二个第三个就是训练集train.txt和验证集文本文件valid.txt,最后一个就是词典了对应dict,将训练集和验证集文本转成数字进行训练。其他的就是ngpu就是训练时用的gpu个数,bakend就是选择网络框架(我们基本都是用pytorch),verbose表示log信息的输出格式,tensorboard工具进行训练过程监督训练的训练信息存储文件夹,resume表示加载上一次训练结束后保存的模型的路径。outdir是训练完成后语言模型存放的路径。这段代码脚本结束后,得到的是训练完成后语言模型。 if [ -z ${tag} ]; then expname=${train_set}_${backend}_$(basename ${train_config%.*}) if ${do_delta}; then expname=${expname}_delta fi else expname=${train_set}_${backend}_${tag} fi expdir=exp/${expname} mkdir -p ${expdir} #stage4:声学模型训练 if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then echo "stage 4: Network Training" ${cuda_cmd} --gpu ${ngpu} ${expdir}/train.log \ asr_train.py \ --config ${train_config} \ --ngpu ${ngpu} \ --backend ${backend} \ --outdir ${expdir}/results \ --tensorboard-dir tensorboard/${expname} \ --debugmode ${debugmode} \ --dict ${dict} \ --debugdir ${expdir} \ --minibatches ${N} \ --verbose ${verbose} \ --resume ${resume} \ --train-json ${feat_tr_dir}/data.json \ --valid-json ${feat_dt_dir}/data.json fi #声学模型的配置:基本和lm训练一样;outdir是训练完后模型存放路径,还有两个data.json文件,分别是训练集和验证集。最重要的就是配置文件train.yaml,可以用来选择不同的模型。 #解码 if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then echo "stage 5: Decoding" nj=32 if [[ $(get_yaml.py ${train_config} model-module) = *transformer* ]]; then recog_model=model.last${n_average}.avg.best average_checkpoints.py --backend ${backend} \ --snapshots ${expdir}/results/snapshot.ep.* \ --out ${expdir}/results/${recog_model} \ --num ${n_average} fi #以上这一部分如果是transformer,就进行设置 pids=() # initialize pids for rtask in ${recog_set}; do ( decode_dir=decode_${rtask}_$(basename ${decode_config%.*})_${lmtag} feat_recog_dir=${dumpdir}/${rtask}/delta${do_delta} # split data splitjson.py --parts ${nj} ${feat_recog_dir}/data.json #### use CPU for decoding ngpu=0 ${decode_cmd} JOB=1:${nj} ${expdir}/${decode_dir}/log/decode.JOB.log \ asr_recog.py \ --config ${decode_config} \ --ngpu ${ngpu} \ --backend ${backend} \ --batchsize 0 \ --recog-json ${feat_recog_dir}/split${nj}utt/data.JOB.json \ --result-label ${expdir}/${decode_dir}/data.JOB.json \ --model ${expdir}/results/${recog_model} \ --rnnlm ${lmexpdir}/rnnlm.model.best #解码这一部分和前面也是类似的 score_sclite.sh ${expdir}/${decode_dir} ${dict} ) & pids+=($!) # store background pids done i=0; for pid in "${pids[@]}"; do wait ${pid} || ((++i)); done [ ${i} -gt 0 ] && echo "$0: ${i} background jobs are failed." && false echo "Finished" fi
|