
Data Formats and Data Loading Options in PDNN


Since we're using someone else's tool, there's no point griping about it; let's just follow its rules.

Training and validation data are specified on the command line with arguments like the following:

--train-data "train.pfile,context=5,ignore-label=0:3-9,map-label=1:0/2:1,partition=1000m"
--valid-data "valid.pfile,stream=False,random=True"

The part before the first comma (if any) specifies the file name.

Glob-style wildcards can be used to specify multiple files (currently not supported for Kaldi data files).

Data files may also be compressed with gzip or bz2, in which case an extension such as ".gz" or ".bz2" follows the original extension.
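As a rough sketch of how such transparent decompression can be handled, the extension can simply be inspected when opening the file (this helper is hypothetical and not part of PDNN):

```python
import bz2
import gzip

def open_data_file(path, mode="rb"):
    """Open a data file, transparently decompressing .gz / .bz2.

    Hypothetical helper: the compression format is inferred from the
    extension appended after the original one, e.g. "train.pfile.gz"
    or "valid.pkl.bz2". Uncompressed files are opened as-is.
    """
    if path.endswith(".gz"):
        return gzip.open(path, mode)
    if path.endswith(".bz2"):
        return bz2.open(path, mode)
    return open(path, mode)
```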

After the file name, you can specify any number of data loading options in the format "key=value". The functions of these options are described in the sections below.

Supported Data Formats

PDNN currently supports three data formats: PFiles, Python pickle files, and Kaldi files.

PFiles

The PFile is the ICSI feature file archive format. PFiles have the extension .pfile. A PFile can store multiple sentences, each of which is a sequence of frames.

Each frame is associated with a feature vector and has one or more labels. Below are the contents of an example PFile:

Sentence ID   Frame ID   Feature Vector                   Class Label
0             0          [0.2, 0.3, 0.5, 1.4, 1.8, 2.5]   10
0             1          [1.3, 2.1, 0.3, 0.1, 1.4, 0.9]   179
1             0          [0.3, 0.5, 0.5, 1.4, 0.8, 1.4]   32

For speech processing, sentences and frames correspond to utterances and frames, respectively. Frames are indexed within each sentence.

For other applications, you can use made-up sentence indices and frame indices.

For example, if there are N instances, you can set all the sentence indices to 0, and let the frame indices run from 0 to N-1.
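The dummy-index scheme above can be sketched in a couple of lines of NumPy (illustrative only; writing an actual PFile still requires a PFile tool):

```python
import numpy as np

# Fabricate PFile-style indices for N standalone instances that have
# no real sentence structure: one "sentence", frames 0 .. N-1.
N = 5
sentence_ids = np.zeros(N, dtype=np.int32)   # all instances in sentence 0
frame_ids = np.arange(N, dtype=np.int32)     # frame indices 0 .. N-1
```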

A standard toolkit for PFiles is pfile_utils-v0_51. This script will install it automatically if you are running Linux. HTK users can use this Python script to convert HTK features and labels into PFiles. See the comments there for more information.

Python Pickle Files

Python pickle files may have the extension ".pickle" or ".pkl". A Python pickle file serializes a tuple of two numpy arrays, (feature, label). There is no notion of "sentences" in pickle files; in other words, a pickle file stores exactly one sentence. feature is a 2-D numpy array, where each row is the feature vector of one instance; label is a 1-D numpy array, where each element is the class label of one instance.

To read a (gzip-compressed) pickle file in Python:

import cPickle, numpy, gzip
with gzip.open('filename.pkl.gz', 'rb') as f:
    feature, label = cPickle.load(f)

To create a (gzip-compressed) pickle file in Python:

import cPickle, numpy, gzip
feature = numpy.array([[0.2, 0.3, 0.5, 1.4], [1.3, 2.1, 0.3, 0.1], [0.3, 0.5, 0.5, 1.4]], dtype='float32')
label = numpy.array([2, 0, 1])
with gzip.open('filename.pkl.gz', 'wb') as f:
    cPickle.dump((feature, label), f)
 
Kaldi Files

The Kaldi data files accepted by PDNN are "Kaldi script files" with the extension ".scp". These files contain "pointers" to the actual feature data stored in "Kaldi archive files" ending in ".ark". Each line of a Kaldi script file specifies the name of an utterance (equivalent to a sentence in pfiles), and its offset in a Kaldi archive file, as follows:

utt01 train.ark:15213

Labels corresponding to the features are provided by "alignment files" ending in ".ali". To specify an alignment file, use the option "label=filename.ali". Alignment files are plain text files, where each line specifies the name of an utterance, followed by the label of each frame in this utterance. Below is an example:

utt01 0 51 51 51 51 51 51 48 48 7 7 7 7 51 51 51 51 48
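The alignment format above is simple enough to parse by hand; a minimal sketch (not PDNN's actual reader) might look like this:

```python
def parse_ali_line(line):
    """Parse one line of a Kaldi-style alignment (.ali) file.

    Each line holds an utterance name followed by one integer label
    per frame. Returns (utterance_name, [frame labels]). A simplified
    sketch of the text format described above.
    """
    parts = line.split()
    return parts[0], [int(x) for x in parts[1:]]
```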

On-The-Fly Context Padding and Label Manipulation

Oftentimes, we want to include the features of neighboring frames into the feature vector of the current frame. Of course this can be done when you prepare the data files, but this will bloat their size. A more clever way is to perform this "context padding" on the fly. PDNN provides the option "context" to do this. Specifying "context=5" will pad each frame with 5 frames on either side, so that the feature vector becomes 11 times the original dimensionality. Specifying "context=5:1" will pad each frame with 5 frames on the left and 1 frame on the right. Alternatively, you can also specify "lcxt=5,rcxt=1". Context padding does not cross sentence boundaries. At the beginning and end of each sentence, the first and last frames are repeated when the context reaches beyond the sentence boundary.
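The padding behavior described above, including the repetition of boundary frames, can be sketched as follows (an illustration under the stated semantics, not PDNN's implementation; the lcxt/rcxt names mirror the options):

```python
import numpy as np

def pad_context(frames, lcxt, rcxt):
    """Pad each frame of ONE sentence with lcxt left and rcxt right
    neighbors, repeating the first/last frame at the boundaries.

    frames: 2-D array (n_frames, dim). The output has
    dim * (lcxt + 1 + rcxt) columns. Padding never crosses the
    sentence boundary because only this sentence's frames are used.
    """
    n = len(frames)
    padded = []
    for i in range(n):
        # Clamp out-of-range neighbor indices to the sentence edges.
        window = [frames[min(max(i + k, 0), n - 1)]
                  for k in range(-lcxt, rcxt + 1)]
        padded.append(np.concatenate(window))
    return np.stack(padded)
```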

Some frames in the data files may be garbage frames (i.e. they do not belong to any of the classes to be classified), but they are important in making up the context for useful frames. To ignore such frames, you can assign a special class label (say c) to these frames, and specify the option "ignore-label=c". The garbage frames will be discarded; but the context of neighboring frames will still be correct, as the garbage frames are only discarded after context padding happens. Sometimes you may also want to train a classifier for only a subset of the classes in a data file. In such cases, you may specify multiple class labels to be ignored, e.g. "ignore-label=0:2:7-9". Multiple class labels are separated by colons; contiguous class labels may be specified with a dash.
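The "ignore-label" syntax (colon-separated labels, dashes for contiguous ranges) can be expanded as in this sketch (a hypothetical parser for the option format described above):

```python
def parse_label_set(spec):
    """Expand a label list like "0:2:7-9" into the set {0, 2, 7, 8, 9}.

    Colons separate individual labels; a dash denotes an inclusive
    range of contiguous labels.
    """
    labels = set()
    for part in spec.split(":"):
        if "-" in part:
            lo, hi = part.split("-")
            labels.update(range(int(lo), int(hi) + 1))
        else:
            labels.add(int(part))
    return labels
```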

When training a classifier of N classes, PDNN requires that their class labels be 0, 1, ..., N-1. When you ignore some class labels, the remaining class labels may not form such a sequence. In this situation, you may use the "map-label" option to map the remaining class labels to 0, 1, ..., N-1. For example, to map the classes 1, 3, 4, 5, 6 to 0, 1, 2, 3, 4, you can specify "map-label=1:0/3:1/4:2/5:3/6:4". Each pair of labels are separated by a colon; pairs are separated by slashes. The label mapping happens after unwanted labels are discarded; all the mappings are applied simultaneously (therefore class 3 is mapped to class 1 and is not further mapped to class 0). You may also use this option to merge classes. For example, "map-label=1:0/3:1/4-6:2" will map all the labels 4, 5, 6 to class 2.
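A sketch of parsing the "map-label" syntax (hypothetical, not PDNN's code). Note that applying the resulting dict in a single pass naturally gives the simultaneous-mapping behavior described above, since each original label is looked up only once:

```python
def parse_label_map(spec):
    """Parse a mapping like "1:0/3:1/4-6:2" into a dict.

    Pairs are separated by slashes, old:new by a colon, and a dash on
    the old side expands a range, so "4-6:2" maps 4, 5 and 6 to 2.
    """
    mapping = {}
    for pair in spec.split("/"):
        old, new = pair.split(":")
        if "-" in old:
            lo, hi = old.split("-")
            for k in range(int(lo), int(hi) + 1):
                mapping[k] = int(new)
        else:
            mapping[int(old)] = int(new)
    return mapping
```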


Partitions, Streaming and Shuffling

The training / validation corpus may be too large to fit in the CPU or GPU memory. Therefore it is broken down into several levels of units: files, partitions, and minibatches. Such division happens after context padding and label manipulation, so the concept of "sentences" is no longer relevant. As a result, a sentence may be broken up across multiple partitions or minibatches.

Both the training and validation corpora may consist of multiple files that can be matched by a single glob-style pattern. At any point in time, at most one file is held in the CPU memory. This means if you have multiple files, all the files will be reloaded every epoch. This can be very inefficient; you can avoid this inefficiency by lumping all the data into a single file if they can fit in the CPU memory.

A partition is the amount of data that is fed to the GPU at a time. For pickle files, a partition is always an entire file; for other files, you may specify the partition size with the option "partition", e.g. "partition=1000m". The partition size is specified in megabytes (2^20 bytes); the suffix "m" is optional. The default partition size is 600 MB.
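The partition-size rule can be made concrete with a small sketch (a hypothetical helper illustrating the option format, not PDNN's parser):

```python
def parse_partition_size(value, default_mb=600):
    """Interpret a "partition" option value like "1000m" as bytes.

    The size is given in megabytes (2**20 bytes); the "m" suffix is
    optional. When the option is absent, a 600 MB default applies.
    """
    if value is None:
        return default_mb * 2**20
    if value.endswith("m"):
        value = value[:-1]
    return int(value) * 2**20
```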

Files may be read in either the "stream" or the "non-stream" mode, controlled by the option "stream=True" or "stream=False". In the non-stream mode, an entire file is kept in the CPU memory. If there is only one file in the training / validation corpus, the file is loaded only once (and this is efficient). In the stream mode, only a partition is kept in the CPU memory. This is useful when the corpus is too large to fit in the CPU memory. Currently, PFiles can be loaded in either the stream mode or the non-stream mode; pickle files can only be loaded in the non-stream mode; Kaldi files can only be loaded in the stream mode.

It is usually desirable that instances of different classes be mixed evenly in the training data. To achieve this, you may specify the option "random=True". This option shuffles the order of the training instances loaded into the CPU memory at a time: in the stream mode, instances are shuffled partition by partition; in the non-stream mode, instances are shuffled across an entire file. The latter achieves better mixing, so it is again recommended to turn off the stream mode when the files can fit in the CPU memory.
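The effect of "random=True" amounts to permuting features and labels in unison over whatever unit (partition or file) is in CPU memory; a minimal sketch:

```python
import numpy as np

def shuffle_instances(feature, label, seed=None):
    """Shuffle features and labels with the same random permutation.

    Illustrates the effect of "random=True": instance order is
    randomized while each feature row stays paired with its label.
    Not PDNN's loader.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(label))
    return feature[perm], label[perm]
```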

A minibatch is the amount of data consumed by the training procedure between successive updates of the model parameters. The minibatch size is not specified as a data loading option, but as a separate command-line argument to the training scripts. A partition may not consist of a whole number of minibatches; the last instances in each partition that are not enough to make a minibatch are discarded.
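The discard rule for incomplete minibatches reduces to integer division; a trivial sketch:

```python
def split_partition(n_instances, batch_size):
    """Return (n_minibatches, n_discarded) for one partition.

    The trailing instances that are not enough to make a full
    minibatch are discarded, as described above.
    """
    n_batches = n_instances // batch_size
    return n_batches, n_instances - n_batches * batch_size
```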


Original post: http://www.cnblogs.com/tuhooo/p/5380315.html
