Two ways of segmenting/organizing Kaldi corpora, and how each is handled

Date: 2017-09-09 23:18:38

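Each transcript segment in text is indexed by an audio file (indexed by utterance)
egs that use this layout: librispeech, timit, thchs30, atc_en, atc_cn
The corpus is organized as:
one audio file (containing one utterance) corresponds to one transcript file (containing one segment)
or
one audio file (containing one utterance) corresponds to one segment inside a transcript file (containing multiple segments)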
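
The utterance-indexed layout can be sketched as a small consistency check. This is only an illustration: it assumes the usual Kaldi data-directory convention in which, with no segments file present, both text and wav.scp are keyed by the utterance ID. The function names, utterance IDs, and paths below are hypothetical.

```python
def read_kaldi_table(lines):
    """Parse '<key> <value...>' lines into a dict, skipping blank lines."""
    table = {}
    for line in lines:
        parts = line.strip().split(None, 1)
        if not parts:
            continue
        table[parts[0]] = parts[1] if len(parts) > 1 else ""
    return table


def utterances_without_audio(text_lines, wav_scp_lines):
    """Return utterance IDs that appear in text but not in wav.scp."""
    text = read_kaldi_table(text_lines)
    wav = read_kaldi_table(wav_scp_lines)
    return sorted(set(text) - set(wav))


# Hypothetical utterance IDs, transcripts, and paths, for illustration only.
text_lines = [
    "spk1-utt001 hello world",
    "spk1-utt002 good morning",
]
wav_scp_lines = [
    "spk1-utt001 /path/to/spk1-utt001.wav",
]
missing = utterances_without_audio(text_lines, wav_scp_lines)
print(missing)  # prints ['spk1-utt002']
```

In this layout one key identifies both the audio and its transcript, which is why no separate time information is needed.
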
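Each transcript segment in text is indexed by a time slice (indexed by segment)
- egs that use this layout:
  tedlium, atc0_comp_LDC94S14A
  The time slices are specified by the segments file, usually located in data/train, data/test, and data/dev
- How it is handled:
  Taking tedlium as an example, the corpus is organized as one sph audio file (containing multiple utterances) paired with one stm transcript file (containing multiple segments)
  stm is the transcript text format used by this Kaldi recipe; an example from tedlium: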
  AaronHuey_2010X 1 AaronHuey_2010X 223.12 232.68 <o,f0,female> we appropriated land for(2) trails and(2) trains to shortcut through the heart of the lakota nation <sil> the treaties were(2) out the window <sil> in response three tribes led by the lakota chief {SMACK} red cloud <sil> (AaronHuey_2010X-223.12-232.68-F0_F-S27)
  Format of an stm file:
  <file-name> <?> <speaker-name> <segment-begin> <segment-end> <LABEL> <TEXT> <segment>
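  Transcripts: db/TEDLIUM_release1/$set/stm/*.stm (these include the time-slice information)
  Audio: db/TEDLIUM_release1/$set/sph/*.sph
  tedlium/s5/run.sh calls local/prepare_data.sh, which normalizes the stm files; this includes removing the explicit silence markers and generating the segments file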
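
To make the stm field layout concrete, here is a minimal parsing sketch. It is not taken from the recipe. It assumes whitespace-separated fields in the order given above, a label with no internal spaces, and an optional trailing segment ID in parentheses. The name `channel` for the field shown as `<?>` is an assumption based on the NIST STM convention, and `parse_stm_line` is a hypothetical helper. The sample line is a shortened version of the tedlium example.

```python
def parse_stm_line(line):
    """Split one stm line into its fields (illustrative, not the recipe's code)."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError("expected at least 6 fields, got %d" % len(fields))
    file_name, channel, speaker, begin, end, label = fields[:6]
    words = fields[6:]
    segment_id = None
    # A final token fully wrapped in parentheses is treated as the segment ID;
    # tokens such as "for(2)" do not start with "(" and stay in the text.
    if words and words[-1].startswith("(") and words[-1].endswith(")"):
        segment_id = words.pop()[1:-1]
    return {
        "file_name": file_name,
        "channel": channel,  # assumed meaning of the <?> field
        "speaker": speaker,
        "begin": float(begin),  # seconds
        "end": float(end),      # seconds
        "label": label,
        "words": words,
        "segment_id": segment_id,
    }


sample = (
    "AaronHuey_2010X 1 AaronHuey_2010X 223.12 232.68 <o,f0,female> "
    "we appropriated land for(2) trails <sil> "
    "(AaronHuey_2010X-223.12-232.68-F0_F-S27)"
)
record = parse_stm_line(sample)
print(record["begin"], record["end"], record["words"])
```
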
  Format of the segments file:
  <utterance-id> <recording-id> <segment-begin> <segment-end>
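  Here segment-begin and segment-end are given in seconds. They specify the time offsets of an utterance within a recording. The "recording-id" is the same identifier string that is used in "wav.scp". To repeat, it is just an arbitrary identifier string, and you can choose it freely.
  Kaldi handles silence (SIL) implicitly, so it does not need to be annotated explicitly. This does not mean that other noises need no explicit annotation, for example: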
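
The step from one stm segment to a segments entry and a text entry can be sketched as follows. This is an assumption-laden illustration, not the code in local/prepare_data.sh: the utterance-ID scheme (recording ID plus zero-padded begin/end times in centiseconds) and the function name `stm_fields_to_kaldi` are hypothetical, and only the `<sil>` marker is dropped.

```python
def stm_fields_to_kaldi(recording_id, begin, end, words):
    """Build one segments line and one text line for a single stm segment."""
    if end <= begin:
        raise ValueError("segment end must be greater than segment begin")
    # Hypothetical utterance-ID scheme: recording ID plus zero-padded
    # begin/end times in centiseconds, so IDs sort in time order.
    utt_id = "%s-%07d-%07d" % (recording_id, round(begin * 100), round(end * 100))
    # segments: <utterance-id> <recording-id> <segment-begin> <segment-end>
    segments_line = "%s %s %.2f %.2f" % (utt_id, recording_id, begin, end)
    # Drop explicit silence markers; other tokens are left untouched.
    kept = [w for w in words if w != "<sil>"]
    text_line = "%s %s" % (utt_id, " ".join(kept))
    return segments_line, text_line


segments_line, text_line = stm_fields_to_kaldi(
    "AaronHuey_2010X",
    223.12,
    232.68,
    ["out", "the", "window", "<sil>", "in", "response"],
)
print(segments_line)
print(text_line)
```

The recording ID written into the segments line must match a key in wav.scp, while the utterance ID is the key shared by segments and text.
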