
[Reposted] Distributed Deep Learning on MPP and Hadoop


Joint work performed by Regunathan Radhakrishnan, Gautam Muralidhar, Ailey Crow, and Sarah Aerni of Pivotal’s Data Science Labs.

Deep learning greatly improves upon manual design of features, allows companies to get more insights from data, and shortens the time to explore, understand, and operationalize analytical results. The approach has recently become popular, both in academia and industry, as a machine-learning framework for learning structure (commonly referred to as features) from unlabeled data as well as feature generation for a supervised learning task (with labeled data). Researchers in computer vision and natural language processing (NLP) have shown that deep-learning-generated features provide state-of-the-art performance when compared to using engineered features (those manually designed) in machine learning. In this blog article, we show how deep learning can be implemented on distributed computing platforms such as Pivotal Greenplum Database (GPDB) and Hadoop. In the following sections, we will briefly introduce the building block of deep learning, explain the auto-encoder, and then describe the details of the implementation itself.

Deep Learning Examples and Extending the Reach of Machine Learning

Applications of deep learning include classification of images into different types where the total number of classes is not known. For example, using a large volume of YouTube videos, researchers were able to automatically identify various types of content in videos, which might be useful in automatically curating and recommending new content to users. A second example is the automated generation of features from gene expression data to detect or classify cancer types. This publication explains how a deep learning-based classifier outperforms the state of the art on several image classification tasks such as handwritten digit recognition, traffic sign detection, and more.

The complexity of designing features, particularly in the former case of identifying the space of possible classes, is daunting. The use of deep learning can increase the reach of machine learning by removing the reliance on and limitations of human-generated features. Since deep learning is computationally intensive, it lends itself naturally to a distributed framework with large scale computing platforms such as Hadoop and massively parallel processing (MPP) databases to cycle through the desired large datasets.

1. Auto-Encoder: The Building Block of Deep Learning

An auto-encoder is a neural network with one hidden layer that learns an identity function under sparsity and regularization constraints. In other words, the auto-encoder attempts to reconstruct the input data by projecting it onto a lower-dimensional subspace defined by the hidden nodes. Hence, the hidden layer is forced to learn structure from the input training examples so that it can reconstruct the input at the output. For instance, consider the auto-encoder shown in Figure 1 below, which takes image patches as the input x and learns a hidden layer y1 used to produce the output x̂. The input layer x is a set of intensity values from image patches. The hidden layer nodes project the high-dimensional input layer onto a set of low-dimensional activation values of the hidden nodes. The activation values y1 of the hidden nodes are then combined to create the output layer x̂, which is an approximation to the input pixels. The hidden layer in this case learns structure from pixels in the form of edges in various orientations. The hidden layer typically has fewer nodes than the input layer, and hence the hidden nodes are forced to compress the information in the input layer in such a way that the output layer can still be created. Since most local image patches tend to be smooth, the only structure the hidden layer needs to learn is the set of edges in different orientations that are common among the images.
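To make this concrete, the following is a minimal R sketch of a single forward pass through such an auto-encoder. This is illustrative code, not the implementation described here; the 8x8 patch size, 25 hidden nodes, and random weights are assumptions.

# Minimal sketch: one auto-encoder forward pass on a single 8x8 image patch.
sigmoid <- function(z) 1 / (1 + exp(-z))

n_in  <- 64   # assumed input size (8x8 pixel patch)
n_hid <- 25   # assumed number of hidden nodes

set.seed(1)
W1 <- matrix(runif(n_hid * n_in, -0.1, 0.1), n_hid, n_in)  # input  -> hidden
b1 <- rep(0, n_hid)
W2 <- matrix(runif(n_in * n_hid, -0.1, 0.1), n_in, n_hid)  # hidden -> output
b2 <- rep(0, n_in)

x     <- runif(n_in)              # stand-in for the pixel intensities of a patch
a1    <- sigmoid(W1 %*% x + b1)   # hidden activations y1 (25 values)
x_hat <- sigmoid(W2 %*% a1 + b2)  # reconstruction x̂ of the 64 input values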


Figure 1: Auto-Encoder to learn structure from pixels

Auto-encoders can be stacked one on top of the other to learn higher-order structures that encode different relationships between the structural elements from the previous layer. For example, we can learn another auto-encoder whose input is y1 and whose hidden layer is y2. The hidden layer y2 now learns relationships between edges to form shapes, the way the first layer learned relationships between pixels in regions. We can derive higher-order features by building on the hidden layer of the previous auto-encoder, as shown in Figure 2 below.


Figure 2: Auto-Encoders for Computer Vision

Stacking the hidden layers y1, y2, and y3 yields a deep learning framework based on auto-encoders, commonly referred to as a stacked auto-encoder. It can create features at the level of object attributes starting from information in pixels in a completely unsupervised manner (without any labels on the input image examples). Figure 3 below shows the final stacked auto-encoder.
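Once the layers are learned, the stacked feature extractor is just the composition of the individual encoders. A minimal R sketch follows, with assumed layer sizes and random weights standing in for learned ones:

sigmoid <- function(z) 1 / (1 + exp(-z))
sizes <- c(64, 25, 16, 9)   # assumed sizes: pixels -> y1 -> y2 -> y3
set.seed(2)
W <- lapply(1:3, function(l)
  matrix(runif(sizes[l + 1] * sizes[l], -0.1, 0.1), sizes[l + 1], sizes[l]))
b <- lapply(1:3, function(l) rep(0, sizes[l + 1]))

x  <- runif(64)                          # an 8x8 input patch
y1 <- sigmoid(W[[1]] %*% x  + b[[1]])    # edges
y2 <- sigmoid(W[[2]] %*% y1 + b[[2]])    # shapes built from edges
y3 <- sigmoid(W[[3]] %*% y2 + b[[3]])    # object-attribute-level features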


Figure 3: Deep learning framework using stacked auto-encoders

2. Learning an Auto-Encoder

In order to learn an auto-encoder from a set of N unlabeled training examples, we need to find the set of parameters P = (W1, b1, W2, b2) such that the reconstruction error Σ(x – x̂)² is minimized, subject to regularization and sparsity constraints on the parameters. Figure 4 below shows an example of an auto-encoder with 3 input variables and 2 hidden nodes.


Figure 4: Parameters learned in an auto-encoder (W1, b1, W2, b2)

The parameters that minimize this cost function can be learned using a gradient descent procedure, as suggested in the Unsupervised Feature Learning and Deep Learning Tutorial. The high-level steps during learning are the following:

  • Step 1: Initialize the parameters P randomly.
  • Step 2: Compute the cost function and gradient of the cost function with current set of parameters P.
  • Step 3: Apply the gradient descent rule to update P.
  • Step 4: Repeat Steps 2 and 3 until convergence of the cost function.

The computation of the cost function and its gradient is based on the standard neural network techniques of forward and backward propagation. For a large dataset of training examples, this process is computationally intensive, and a distributed platform and framework speeds the process well beyond what is possible with traditional systems.
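As a self-contained illustration of Steps 1-4 on a single machine, here is a minimal R sketch of the learning loop (the sizes, learning rate, and iteration count are assumptions, and only the reconstruction-error term of the cost is included for brevity). The distributed versions described next split exactly the Step 2 computations, the cost and the gradient, across segments.

sigmoid <- function(z) 1 / (1 + exp(-z))
n_in <- 64; n_hid <- 25; alpha <- 0.1     # assumed sizes and learning rate
set.seed(3)
X <- matrix(runif(n_in * 100), n_in, 100) # 100 random stand-in training patches

# Step 1: initialize the parameters P = (W1, b1, W2, b2) randomly
W1 <- matrix(runif(n_hid * n_in, -0.1, 0.1), n_hid, n_in); b1 <- rep(0, n_hid)
W2 <- matrix(runif(n_in * n_hid, -0.1, 0.1), n_in, n_hid); b2 <- rep(0, n_in)

for (iter in 1:50) {
  # Step 2: forward propagation, cost, and gradients (via backward propagation)
  A1    <- sigmoid(W1 %*% X + b1)              # hidden activations
  X_hat <- sigmoid(W2 %*% A1 + b2)             # reconstructions
  cost  <- sum((X - X_hat)^2) / (2 * ncol(X))  # reconstruction error only
  D3  <- -(X - X_hat) * X_hat * (1 - X_hat)    # delta at the output layer
  D2  <- (t(W2) %*% D3) * A1 * (1 - A1)        # delta at the hidden layer
  gW2 <- D3 %*% t(A1) / ncol(X); gb2 <- rowMeans(D3)
  gW1 <- D2 %*% t(X)  / ncol(X); gb1 <- rowMeans(D2)
  # Step 3: gradient descent update of P (Step 4: keep iterating)
  W2 <- W2 - alpha * gW2; b2 <- b2 - alpha * gb2
  W1 <- W1 - alpha * gW1; b1 <- b1 - alpha * gb1
}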

3. Distributed Learning of Auto-Encoder on Pivotal GPDB

In this section, we show how to distribute the learning problem on Pivotal Greenplum Database (GPDB) and Pivotal HD by explaining how the cost and gradient computations are distributed.

3.1 DISTRIBUTED COMPUTATION OF COST FUNCTION

In order to understand how the cost function computation can be distributed, let us consider the computational tasks involved. For each training example x, perform forward propagation as shown by the equations below:

a1 = sigmoid(W1 x + b1)

x̂ = sigmoid(W2 a1 + b2)

Here, the first equation computes the activations a1 of all the hidden nodes for the input example x, while the second equation computes the response x̂ of the output layer. Both of these steps can be performed in parallel in all the segments of GPDB on the corresponding data that resides in those segments. After computing a1 and x̂, the cost function can be computed as the sum of the following terms:

  1. The reconstruction error term: Σ(x – x̂)²
  2. The regularization term: Σ||W1||² + Σ||W2||²
  3. The sparsity term: Σ [ρ log(ρ/ρ̂j) + (1 – ρ) log((1 – ρ)/(1 – ρ̂j))], which is a function of the average activation value ρ̂j of each hidden node over all the examples

All of this can be accomplished in GPDB through a PL/R function that is called on the data residing in each of the segments. The final cost function value for all the data can then be computed as the aggregated sum of the individual cost function values from each segment. Figure 5 below illustrates this procedure in GPDB. In each of the N segments in GPDB, a PL/R function performs the forward propagation steps on the data stored in the corresponding segment to obtain a1 and x̂. Then, terms 1-3 above are evaluated to obtain "cost_i", the cost function computed on the data stored in segment i. Finally, the cost values from all segments are aggregated to obtain "cost_all".
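As a rough sketch, the body of such a per-segment function might look like the R code below (the names, shapes, and the hyper-parameters lambda, beta, and rho are assumptions, not details from the original implementation). For simplicity it computes all three terms from the segment's local data; in practice the regularization term should be added only once and the average activations used by the sparsity term need to be aggregated across segments.

# Sketch of a per-segment cost computation. X_seg is assumed to be a 64 x m
# matrix holding the m image patches stored on this segment; the current
# parameters W1, b1, W2, b2 are passed in by the driver.
segment_cost <- function(X_seg, W1, b1, W2, b2,
                         lambda = 1e-4, beta = 3, rho = 0.01) {
  sigmoid <- function(z) 1 / (1 + exp(-z))
  m      <- ncol(X_seg)
  A1     <- sigmoid(W1 %*% X_seg + b1)              # hidden activations a1
  X_hat  <- sigmoid(W2 %*% A1 + b2)                 # reconstructions x̂
  rho_j  <- rowMeans(A1)                            # average activation per hidden node
  recon  <- sum((X_seg - X_hat)^2) / (2 * m)        # term 1: reconstruction error
  reg    <- (lambda / 2) * (sum(W1^2) + sum(W2^2))  # term 2: regularization
  sparse <- beta * sum(rho * log(rho / rho_j) +
                       (1 - rho) * log((1 - rho) / (1 - rho_j)))  # term 3: sparsity
  list(cost_i = recon + reg + sparse, n = m)        # "cost_i" is summed into "cost_all"
}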


Figure 5: Distributed computation of cost function in GPDB while learning the Auto-Encoder

3.2 DISTRIBUTED COMPUTATION OF GRADIENT FUNCTION

For gradient computation, in addition to computing the activations a1 and the output response x̂ (as shown in the previous section), we need to perform backward propagation. This step propagates the error through the auto-encoder, as shown by the two equations below:

delta(3) = -(x – x̂) * sigmoid_derivative(W2 a1 + b2)

delta(2) = transpose(W2) delta(3) * sigmoid_derivative(W1 x + b1)

Finally, the gradient value is computed as a function of the activation values and delta values. Similar to the cost function computation, computation of the delta and the gradient value can be distributed using a PL/R function in GPDB. Therefore, each segment just computes the gradient value for data that resides in that segment. Then, we can aggregate the gradient values from all the segments to perform one step of the gradient descent algorithm.
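A per-segment gradient computation can be sketched in R in the same way (again, the names and shapes are assumptions; only the reconstruction term's contribution to the gradient is shown, and the regularization and sparsity terms add further contributions omitted here):

segment_gradient <- function(X_seg, W1, b1, W2, b2) {
  sigmoid <- function(z) 1 / (1 + exp(-z))
  A1    <- sigmoid(W1 %*% X_seg + b1)            # forward propagation
  X_hat <- sigmoid(W2 %*% A1 + b2)
  D3 <- -(X_seg - X_hat) * X_hat * (1 - X_hat)   # backward propagation: delta(3)
  D2 <- (t(W2) %*% D3)   * A1    * (1 - A1)      # delta(2)
  # Per-segment gradient sums; the driver aggregates these across segments and
  # divides by the total number of examples before taking the descent step.
  list(gW1 = D2 %*% t(X_seg), gb1 = rowSums(D2),
       gW2 = D3 %*% t(A1),    gb2 = rowSums(D3),
       n   = ncol(X_seg))
}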

4. Learned Hidden Nodes from Natural Image Patches

We implemented the distributed deep learning algorithm in GPDB on the natural image dataset referenced here and obtained the hidden layer shown below. As illustrated in Figure 6, the first-level hidden layer uncovers edges and ridges at different orientations from the raw input pixel data.


Figure 6: Deep learning features (hidden nodes from first auto-encoder) from natural image patches (8×8)

5. Distributed Learning of Auto-Encoder on Pivotal Hadoop and HAWQ

Where Pivotal really provides an advantage is in the seamless reuse of the GPDB deep learning implementation on Pivotal HD and on HAWQ, Pivotal's SQL-on-Hadoop solution. In Hadoop, the per-segment gradient and cost function computations, which are implemented as PL/R functions, can easily be re-implemented as mapper functions. A reducer function can then simply aggregate the values from all the mappers. In HAWQ, the GPDB PL/R functions are deployed as-is, and the algorithm can be run entirely in-database within HAWQ. For running deep learning on Pivotal Hadoop or HAWQ, the image patches need to be stored in HDFS.
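For illustration, a hypothetical R streaming mapper might look like the sketch below (the parameter file, input format, and output keys are assumptions rather than details of the Pivotal HD implementation); the reducer would simply sum the emitted values by key.

#!/usr/bin/env Rscript
# Hypothetical streaming mapper: reads one tab-separated 8x8 patch per line
# from stdin and emits its partial contribution to the cost.
sigmoid <- function(z) 1 / (1 + exp(-z))
params  <- readRDS("params.rds")   # assumed: current W1, b1, W2, b2 shipped to
                                   # every mapper, e.g. via the distributed cache
cost_sum <- 0; n <- 0
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  x     <- as.numeric(strsplit(line, "\t")[[1]])
  a1    <- sigmoid(params$W1 %*% x + params$b1)
  x_hat <- sigmoid(params$W2 %*% a1 + params$b2)
  cost_sum <- cost_sum + sum((x - x_hat)^2) / 2
  n <- n + 1
}
close(con)
cat("cost\t",  cost_sum, "\n", sep = "")   # key<TAB>value pairs for the reducer
cat("count\t", n, "\n", sep = "")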

Conclusions

Deep learning, which is a framework for learning structure from unlabeled data, can be implemented to run on distributed computing platforms such as Hadoop, GPDB, and HAWQ. We showed that the computation of the gradient descent steps can be distributed across multiple compute nodes using PL/R in GPDB and HAWQ and using MapReduce in Hadoop. The implementation in R also allowed seamless code reuse across platforms: PL/R in GPDB and HAWQ, and MapReduce using R streaming on Hadoop. The ease of using this framework on these platforms ensures that we can learn features from large collections of unlabeled data and learn about a domain in an unsupervised fashion. The described implementation provides an important toolkit for data scientists: deep learning functionality for large volumes of data on Hadoop or GPDB/HAWQ. We note that the disk I/O incurred across gradient descent iterations can become a bottleneck when learning the auto-encoder on Hadoop or on GPDB/HAWQ. Nevertheless, it is useful to have deep learning functionality in the toolkit of data scientists on these platforms as well.

In a future blog post, our colleague Victor Fang will describe how this limitation for particularly large datasets can be addressed by implementing deep learning on Spark, which is rapidly gaining popularity as a distributed, in-memory computing framework.


Original source: http://www.cnblogs.com/daleloogn/p/4234753.html
