Abstract
We present a new learning architecture: the Decision Directed Acyclic Graph (DDAG), which is used to combine many two-class classifiers into a multiclass classifier. For an $N$-class problem, the DDAG contains $N(N-1)/2$ classifiers, one for each pair of classes. We present a VC analysis of the case when the node classifiers are hyperplanes; the resulting bound on the test error depends on $N$ and on the margin achieved at the nodes, but not on the dimension of the space. This motivates an algorithm, DAGSVM, which operates in a kernel-induced feature space and uses two-class maximal margin hyperplanes at each decision-node of the DDAG. The DAGSVM is substantially faster to train and evaluate than either the standard algorithm or Max Wins, while maintaining comparable accuracy to both of these algorithms.
The problem of multiclass classification, especially for systems like SVMs, doesn't present an easy solution. It is generally simpler to construct classifier theory and algorithms for two mutually-exclusive classes than for $N$ mutually-exclusive classes. We believe constructing $N$-class SVMs is still an unsolved research problem.
The standard method for $N$-class SVMs is to construct $N$ SVMs. The $i$th SVM will be trained with all of the examples in the $i$th class with positive labels, and all other examples with negative labels. We refer to SVMs trained in this way as 1-v-r SVMs (short for one-versus-rest). The final output of the $N$ 1-v-r SVMs is the class that corresponds to the SVM with the highest output value. Unfortunately, there is no bound on the generalization error for the 1-v-r SVM, and the training time of the standard method scales linearly with $N$.
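The one-versus-rest decision rule described above can be sketched in a few lines of Python. This is only an illustration: the three `scorers` are hypothetical stand-ins for the real-valued outputs of trained SVMs, not anything taken from the paper.

```python
from typing import Callable, Sequence


def one_vs_rest_predict(
    scorers: Sequence[Callable[[Sequence[float]], float]],
    x: Sequence[float],
) -> int:
    """Return the index of the classifier with the highest output on x."""
    scores = [score(x) for score in scorers]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical decision functions, one per class (toy stand-ins for SVMs).
scorers = [
    lambda x: x[0] - 1.0,      # class 0 "versus rest"
    lambda x: x[1] - 1.0,      # class 1 "versus rest"
    lambda x: -x[0] - x[1],    # class 2 "versus rest"
]

predicted = one_vs_rest_predict(scorers, [3.0, 0.0])
```

Every one of the $N$ scorers is evaluated for each test point, which is why the evaluation cost grows with the number of classes.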
Another method for constructing $N$-class classifiers from SVMs is derived from previous research into combining two-class classifiers. Knerr suggested constructing all possible two-class classifiers from a training set of $N$ classes, each classifier being trained on only two out of $N$ classes. There would thus be $K = N(N-1)/2$ classifiers. When applied to SVMs, we refer to this as 1-v-1 SVMs (short for one-versus-one).
Knerr suggested combining these two-class classifiers with an "AND" gate. Friedman suggested a Max Wins algorithm: each 1-v-1 classifier casts one vote for its preferred class, and the final result is the class with the most votes. Friedman shows circumstances in which this algorithm is Bayes optimal. Kreßel applies the Max Wins algorithm to Support Vector Machines with excellent results.
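A minimal sketch of Max Wins voting follows. The `pairwise` function is a hypothetical stand-in for a bank of trained two-class classifiers (here, a nearest-center rule on a 1-D input), and the tie-breaking rule is an assumption, since the text does not specify one.

```python
from collections import Counter
from itertools import combinations


def pairwise(i: int, j: int, x: float) -> int:
    """Hypothetical two-class classifier: class c has center c on the real line."""
    return i if abs(x - i) <= abs(x - j) else j


def max_wins_predict(n_classes: int, x: float) -> tuple[int, int]:
    """Return (winning class, number of pairwise classifiers evaluated)."""
    votes = Counter()
    evaluations = 0
    for i, j in combinations(range(n_classes), 2):
        votes[pairwise(i, j, x)] += 1
        evaluations += 1
    # Assumption: ties go to the smallest class index.
    winner = max(range(n_classes), key=lambda c: (votes[c], -c))
    return winner, evaluations


winner, evaluations = max_wins_predict(4, 2.2)
```

With four classes, all $4 \cdot 3 / 2 = 6$ pairwise classifiers are consulted for a single test point.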
A significant disadvantage of the 1-v-1 approach, however, is that, unless the individual classifiers are carefully regularized (as in SVMs), the overall $N$-class classifier system will tend to overfit. The "AND" combination method and the Max Wins combination method do not have bounds on the generalization error. Finally, the size of the 1-v-1 classifier may grow superlinearly with $N$, and hence, may be slow to evaluate on large problems.
A Directed Acyclic Graph (DAG) is a graph whose edges have an orientation and no cycles. A Rooted DAG has a unique node such that it is the only node which has no arcs pointing into it. A Rooted Binary DAG has nodes which have either 0 or 2 arcs leaving them. We will use Rooted Binary DAGs in order to define a class of functions to be used in classification tasks. The class of functions computed by Rooted Binary DAGs is formally defined as follows.
Definition 1 Decision DAGs (DDAGs). Given a space $X$ and a set of boolean functions $\mathcal{F} = \{f : X \to \{0, 1\}\}$, the class $\mathrm{DDAG}(\mathcal{F})$ of Decision DAGs on $N$ classes over $\mathcal{F}$ are functions which can be implemented using a rooted binary DAG with $N$ leaves labeled by the classes, where each of the $K = N(N-1)/2$ internal nodes is labeled with an element of $\mathcal{F}$. The nodes are arranged in a triangle with the single root node at the top, two nodes in the second layer and so on until the final layer of $N$ leaves. The $i$-th node in layer $j < N$ is connected to the $i$-th and $(i+1)$-st node in the $(j+1)$-st layer.
To evaluate a particular DDAG $G$ on input $x \in X$, starting at the root node, the binary function at a node is evaluated. The node is then exited via the left edge, if the binary function is zero; or the right edge, if the binary function is one. The next node's binary function is then evaluated. The value of the decision function $D(x)$ is the value associated with the final leaf node. The path taken through the DDAG is known as the evaluation path. The input $x$ reaches a node of the graph, if that node is on the evaluation path for $x$. We refer to the decision node distinguishing classes $i$ and $j$ as the $ij$-node. Assuming that the number of a leaf is its class, this node is the $i$-th node in the $(N-j+i)$-th layer provided $i < j$. Similarly the $j$-nodes are those nodes involving class $j$, that is, the internal nodes on the two diagonals containing the leaf labeled by $j$.
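The triangular layout can be checked with a short sketch. It assumes classes are numbered from 1 and that the $ij$-node (with $i < j$) sits at position $i$ of layer $N - j + i$; the helper names `ddag_layers` and `children` are illustrative only.

```python
def ddag_layers(n_classes: int) -> list[list[tuple[int, int]]]:
    """List the internal nodes layer by layer as (i, j) class pairs, 1-indexed.

    Position i in layer L holds the pair (i, N - L + i), so layer L has L nodes.
    """
    return [
        [(i, n_classes - layer + i) for i in range(1, layer + 1)]
        for layer in range(1, n_classes)
    ]


def children(i: int, j: int) -> tuple[tuple[int, int], tuple[int, int]]:
    """Return the two nodes reached from the ij-node in the next layer.

    The first child drops class j from consideration, the second drops class i.
    A pair (c, c) denotes the leaf labeled with class c.
    """
    return (i, j - 1), (i + 1, j)


layers = ddag_layers(4)
total_nodes = sum(len(layer) for layer in layers)
```

For four classes this yields the layers `[(1, 4)]`, `[(1, 3), (2, 4)]` and `[(1, 2), (2, 3), (3, 4)]`, six internal nodes in total, and the two children of each node are adjacent entries of the next layer.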
The DDAG is equivalent to operating on a list, where each node eliminates one class from the list. The list is initialized with a list of all classes. A test point is evaluated against the decision node that corresponds to the first and last elements of the list. If the node prefers one of the two classes, the other class is eliminated from the list, and the DDAG proceeds to test the first and last elements of the new list. The DDAG terminates when only one class remains in the list. Thus, for a problem with $N$ classes, $N-1$ decision nodes will be evaluated in order to derive an answer.
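The list-based view translates almost directly into code. The sketch below assumes a hypothetical `pairwise(i, j, x)` function that returns whichever of the two classes the corresponding node prefers; a toy nearest-center rule stands in for trained SVMs.

```python
def pairwise(i: int, j: int, x: float) -> int:
    """Hypothetical node classifier: class c has center c on the real line."""
    return i if abs(x - i) <= abs(x - j) else j


def ddag_predict(n_classes: int, x: float) -> tuple[int, int]:
    """Evaluate the DDAG as list elimination.

    Returns (predicted class, number of decision nodes evaluated).
    """
    remaining = list(range(n_classes))
    evaluations = 0
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        preferred = pairwise(first, last, x)
        evaluations += 1
        if preferred == first:
            remaining.pop()     # the last class is eliminated
        else:
            remaining.pop(0)    # the first class is eliminated
    return remaining[0], evaluations


label, evaluations = ddag_predict(4, 2.2)
```

Each pass through the loop removes exactly one class, so the loop body runs $N - 1$ times regardless of the input; for four classes that is three node evaluations.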
The current state of the list is the total state of the system. Therefore, since a list state is reachable in more than one possible path through the system, the decision graph the algorithm traverses is a DAG, not simply a tree.
Decision DAGs naturally generalize the class of Decision Trees, allowing for a more efficient representation of redundancies and repetitions that can occur in different branches of the tree, by allowing the merging of different decision paths. The class of functions implemented is the same as that of Generalized Decision Trees, but this particular representation presents both computational and learning-theoretical advantages.
In this paper we study DDAGs where the node-classifiers are hyperplanes. We define a Perceptron DDAG to be a DDAG with a perceptron at every node. Let $\mathbf{w}$ be the (unit) weight vector correctly splitting the $i$ and $j$ classes at the $ij$-node with threshold $\theta$. We define the margin of the $ij$-node to be $\gamma = \min_{c(x) \in \{i, j\}} |\langle \mathbf{w}, x \rangle - \theta|$, where $c(x)$ is the class associated to training example $x$. Note that, in this definition, we only take into account examples with class labels equal to $i$ or $j$.
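As a rough illustration of the node margin, the sketch below computes the smallest distance to the hyperplane over the training examples of the two classes a node separates. The function name `node_margin` and the toy data are assumptions; the weight vector is rescaled to unit length (with the threshold rescaled to match) so that the result is a geometric distance.

```python
import math
from typing import Sequence


def node_margin(
    w: Sequence[float],
    theta: float,
    examples: Sequence[tuple[Sequence[float], int]],
    class_i: int,
    class_j: int,
) -> float:
    """Smallest |<w, x> - theta| over examples labeled class_i or class_j."""
    norm = math.sqrt(sum(c * c for c in w))
    unit_w = [c / norm for c in w]
    unit_theta = theta / norm  # same hyperplane after rescaling w
    return min(
        abs(sum(a * b for a, b in zip(unit_w, x)) - unit_theta)
        for x, label in examples
        if label in (class_i, class_j)  # examples of other classes are ignored
    )


examples = [
    ([5.0, 0.0], 1),
    ([0.0, 0.0], 2),
    ([100.0, 100.0], 3),  # not class 1 or 2, so it does not affect the margin
]
gamma = node_margin([3.0, 4.0], 5.0, examples, 1, 2)
```

Only the examples of classes 1 and 2 enter the minimum, matching the remark that other class labels are not taken into account.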
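Theorem 1 Suppose we are able to classify a random $m$-sample of labeled examples using a Perceptron DDAG on $N$ classes containing $K$ decision nodes with margins $\gamma_i$ at node $i$, then we can bound the generalization error with probability greater than $1 - \delta$ to be less than
$$\frac{130 R^2}{m} \left( D' \log(4em) \log(4m) + \log \frac{2(2m)^K}{\delta} \right),$$
where $D' = \sum_{i=1}^{K} \frac{1}{\gamma_i^2}$ and $R$ is the radius of a ball containing the distribution's support.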
Theorem 1 implies that we can control the capacity of DDAGs by enlarging their margin. Note that, in some situations, this bound may be pessimistic: the DDAG partitions the input space into polytopic regions, each of which is mapped to a leaf node and assigned to a specific class. Intuitively, the only margins that should matter are the ones relative to the boundaries of the cell where a given training point is assigned, whereas the bound in Theorem 1 depends on all the margins in the graph.
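By the above observations, we would expect that a DDAG whose $j$-node margins are large would be accurate at identifying class $j$, even when other nodes do not have large margins. Theorem 2 substantiates this by showing that the appropriate bound depends only on the $j$-node margins, but first we introduce the notation $\epsilon_j(G) = P\{x : (x \text{ is in class } j \text{ and } x \text{ is misclassified by } G) \text{ or } x \text{ is misclassified as class } j \text{ by } G\}$.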
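Theorem 2 Suppose we are able to correctly distinguish class $j$ from the other classes in a random $m$-sample with a DDAG $G$ over $N$ classes containing $K$ decision nodes with margins $\gamma_i$ at node $i$, then with probability $1 - \delta$,
$$\epsilon_j(G) \le \frac{130 R^2}{m} \left( D' \log(4em) \log(4m) + \log \frac{2(2m)^{N-1}}{\delta} \right),$$
where $D' = \sum_{i \in j\text{-nodes}} \frac{1}{\gamma_i^2}$, and $R$ is the radius of a ball containing the support of the distribution.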
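To make the difference between the two theorems concrete, the sketch below compares the margin-dependent term when it is summed over every node and when it is restricted to the $j$-nodes. It assumes that this term is the sum of inverse squared margins, $\sum 1/\gamma^2$; the margin values and helper names are invented for illustration.

```python
def capacity_term(margins: list[float]) -> float:
    """Sum of 1 / gamma^2 over the given node margins (assumed form of D')."""
    return sum(1.0 / (g * g) for g in margins)


def j_node_margins(node_margins: dict[tuple[int, int], float], j: int) -> list[float]:
    """Margins of the nodes that involve class j."""
    return [g for pair, g in node_margins.items() if j in pair]


# Hypothetical margins for a 3-class DDAG, keyed by the class pair at each node.
node_margins = {(1, 2): 0.5, (1, 3): 1.0, (2, 3): 0.1}

all_nodes_term = capacity_term(list(node_margins.values()))
class_1_term = capacity_term(j_node_margins(node_margins, 1))
```

In this toy case the small margin at the (2, 3) node dominates the sum over all nodes, while the term for class 1 stays small because neither node involving class 1 has a small margin.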
Large Margin DAGs for Multiclass Classification
Original article: http://www.cnblogs.com/JenifferWu/p/5929347.html