Version 3.17 released on April Fools' day, 2013. We slightly adjust the way class labels are handled internally. By default labels are ordered by their first occurrence in the training set. Hence for a set with -1/+1 labels, if -1 appears first, then internally -1 becomes +1. This has caused confusion. Now for data with -1/+1 labels, we specifically ensure that internally the binary SVM has positive data corresponding to the +1 instances. For developers, see changes in the subroutine svm_group_classes of svm.cpp.
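The svm_group_classes function groups training data of the same class:
void svm_group_classes(const svm_problem *prob, int *nr_class_ret, int **label_ret, int **start_ret, int **count_ret, int *perm)
- nr_class_ret: receives the total number of classes found in the sample set
- label_ret: points to the array that stores the class labels
- start_ret: points to the array that stores the start position of each class
- count_ret: points to the array that stores the number of samples in each class
- perm: an array of indices into the original data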
Consider an example: { 6 samples in 4 classes, where y[0]=y[1], y[2]=y[3], and y[4] and y[5] each form a class of their own }. The for loop then proceeds as follows:
i=0 label[0]=y[0], data_label[0]=0
i=1 label[0]=y[0]=y[1], data_label[1]=0 count[0]=2
i=2 label[1]=y[2], data_label[2]=1
i=3 label[1]=y[2]=y[3], data_label[3]=1 count[1]=2
i=4 label[2]=y[4], data_label[4]=2 count[2]=1
i=5 label[3]=y[5], data_label[5]=3 count[3]=1
// label: label name, start: begin of each class, count: #data of classes, perm: indices to the original data
// perm, length l, must be allocated before calling this subroutine
static void svm_group_classes(const svm_problem *prob, int *nr_class_ret, int **label_ret, int **start_ret, int **count_ret, int *perm)
{
    int l = prob->l; // total number of samples
    int max_nr_class = 16; // doubled automatically when it is not enough (see below)
    int nr_class = 0;
    int *label = Malloc(int,max_nr_class); // Malloc(type,n) expands to (type *)malloc((n)*sizeof(type))
    int *count = Malloc(int,max_nr_class);
    int *data_label = Malloc(int,l);
    int i;
    for(i=0;i<l;i++)
    {
        int this_label = (int)prob->y[i]; // class label of sample i
        int j;
        for(j=0;j<nr_class;j++)
        {
            // label is empty at first, but nr_class is 0 then, so this inner loop does not run for the first sample
            if(this_label == label[j])
            {
                ++count[j];
                break;
            }
        }
        data_label[i] = j;
        if(j == nr_class)
        {
            if(nr_class == max_nr_class)
            {
                max_nr_class *= 2; // double the class capacity
                label = (int *)realloc(label,max_nr_class*sizeof(int));
                count = (int *)realloc(count,max_nr_class*sizeof(int));
            }
            label[nr_class] = this_label;
            count[nr_class] = 1; // first sample of a new class, so the count starts at 1
            ++nr_class;
        }
    }
    //
    // Labels are ordered by their first occurrence in the training set.
    // However, for two-class sets with -1/+1 labels and -1 appears first,
    // we swap labels to ensure that internally the binary SVM has positive data corresponding to the +1 instances.
    //
    if (nr_class == 2 && label[0] == -1 && label[1] == 1)
    {
        swap(label[0],label[1]);
        swap(count[0],count[1]);
        for(i=0;i<l;i++)
        {
            if(data_label[i] == 0)
                data_label[i] = 1;
            else
                data_label[i] = 0;
        }
    }
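The next part of the code computes the start position of each class (start) and the perm array, which records where each sample sits in the original data once the samples are grouped by class. In perm[i]=j, i is the position in the grouped order and j is the position in the original data. The first pass advances start[c] as it fills perm, so start is recomputed afterwards.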
    int *start = Malloc(int,nr_class);
    start[0] = 0;
    for(i=1;i<nr_class;i++)
        start[i] = start[i-1]+count[i-1];
    for(i=0;i<l;i++)
    {
        perm[start[data_label[i]]] = i;
        ++start[data_label[i]];
    }
    start[0] = 0;
    for(i=1;i<nr_class;i++)
        start[i] = start[i-1]+count[i-1];
    *nr_class_ret = nr_class;
    *label_ret = label;
    *start_ret = start;
    *count_ret = count;
    free(data_lab