标签:
Brevity is the soul of wit
This powerful quote by William Shakespeare applies well to techniques used in data science & analytics as well. Intrigued ? Allow me to prove it using a short story.
In May ‘ 2015, we conducted a Data Hackathon ( a data science competition) in Delhi-NCR, India.
Register for Data Hackathon 3.0 – The Battle of Survival
We gave participants the challenge to identify Human Activity Recognition Using Smartphones Data Set. The data set had 561 variables for training model used for the identification of Human activity in test data set.
The participants in hackathon had varied experience and expertise level. As expected, the experts did a commendable job at identifying the human activity. However, beginners & intermediates struggled with sheer number of variables in the dataset (561 variables). Under the pressure of time, these people tried using variables really without understanding the significance level of variable(s). They lacked the skill to filter information from seemingly high dimensional problems and reduce them to a few relevant dimensions – the skill of dimension reduction.
Further, this lack of skill came across in several forms in way of questions asked by various participants:
If you have faced similar questions, you are reading the right article. In this article, we will look at various methods to identify the significant variables using the most common dimension reduction techniques and methods.
The problem of unwanted increase in dimension is closely related to fixation of measuring / recording data at a far granular level then it was done in past. This is no way suggesting that this is a recent problem. It has started gaining more importance lately due to surge in data.
Lately, there has been a tremendous increase in the way sensors are being used in the industry. These sensors continuously record data and store it for analysis at a later point. In the way data gets captured, there can be a lot of redundancy. For example, let us take case of a motorbike rider in racing competitions. Today, his position and movement gets measured by GPS sensor on bike, gyro meters, multiple video feeds and his smart watch. Because of respective errors in recording, the data would not be exactly same. However, there is very little incremental information on position gained from putting these additional sources. Now assume that an analyst sits with all this data to analyze the racing strategy of the biker – he/ she would have a lot of variables / dimensions which are similar and of little (or no) incremental value. This is the problem of high unwanted dimensions and needs a treatment of dimension reduction.
Let’s look at other examples of new ways of data collection:
With more variables, comes more trouble! And to avoid this trouble, dimension reduction techniques comes to the rescue.
Dimension Reduction refers to the process of converting a set of data having vast dimensions into data with lesser dimensions ensuring that it conveys similar information concisely. These techniques are typically used while solving machine learning problems to obtain better features for a classification or regression task.
Let’s look at the image shown below. It shows 2 dimensions x1 and x2, which are let us say measurements of several object in cm (x1) and inches (x2). Now, if you were to use both these dimensions in machine learning, they will convey similar information and introduce a lot of noise in system, so you are better of just using one dimension. Here we have converted the dimension of data from 2D (from x1 and x2) to 1D (z1), which has made the data relatively easier to explain.
In similar ways, we can reduce n dimensions of data set to k dimensions (k < n) . These k dimensions can be directly identified (filtered) or can be a combination of dimensions (weighted averages of dimensions) or new dimension(s) that represent existing multiple dimensions well.
One of the most common application of this technique is Image processing. You might have come across this Facebook application – “Which Celebrity Do You Look Like?“. But, have you ever thought about the algorithm used behind this?
Here’s the answer: To identify the matched celebrity image, we use pixel data and each pixel is equivalent to one dimension. In every image, there are high number of pixels i.e. high number of dimensions. And every dimension is important here. You can’t omit dimensions randomly to make better sense of your overall data set. In such cases, dimension reduction techniques help you to find the significant dimension(s) using various method(s). We’ll discuss these methods shortly.
Let’s look at the benefits of applying Dimension Reduction process:
There are many methods to perform Dimension reduction. I have listed the most common methods below:
1. Missing Values: While exploring data, if we encounter missing values, what we do? Our first step should be to identify the reason then impute missing values/ drop variables using appropriate methods. But, what if we have too many missing values? Should we impute missing values or drop the variables?
I would prefer the latter, because it would not have lot more details about data set. Also, it would not help in improving the power of model. Next question, is there any threshold of missing values for dropping a variable? It varies from case to case. If the information contained in the variable is not that much, you can drop the variable if it has more than ~40-50% missing values.
2. Low Variance: Let’s think of a scenario where we have a constant variable (all observations have same value, 5) in our data set. Do you think, it can improve the power of model? Ofcourse NOT, because it has zero variance. In case of high number of dimensions, we should drop variables having low variance compared to others because these variables will not explain the variation in target variables.
3. Decision Trees: It is one of my favorite techniques. It can be used as a ultimate solution to tackle multiple challenges like missing values, outliers and identifying significant variables. It worked well in our Data Hackathon also. Several data scientists used decision tree and it worked well for them.
4. Random Forest: Similar to decision tree is Random Forest. I would also recommend using the in-built feature importance provided by random forests to select a smaller subset of input features. Just be careful that random forests have a tendency to bias towards variables that have more no. of distinct values i.e. favor numeric variables over binary/categorical values.
5. High Correlation: Dimensions exhibiting higher correlation can lower down the performance of model. Moreover, it is not good to have multiple variables of similar information or variation also known as “Multicollinearity”. You can use Pearson (continuous variables) or Polychoric (discrete variables) correlation matrix to identify the variables with high correlation and select one of them using VIF (Variance Inflation Factor). Variables having higher value ( VIF > 5 ) can be dropped.
6. Backward Feature Elimination: In this method, we start with all n dimensions. Compute the sum of square of error (SSR) after eliminating each variable (n times). Then, identifying variables whose removal has produced the smallest increase in the SSR and removing it finally, leaving us with n-1 input features.
Repeat this process until no other variables can be dropped. Recently in Online Hackathon organised by Analytics Vidhya (11-12 Jun’15), Data scientist who held second position used Backward Feature Elimination in linear regression to train his model.
Reverse to this, we can use “Forward Feature Selection” method. In this method, we select one variable and analyse the performance of model by adding another variable. Here, selection of variable is based on higher improvement in model performance.
7. Factor Analysis: Let’s say some variables are highly correlated. These variables can be grouped by their correlations i.e. all variables in a particular group can be highly correlated among themselves but have low correlation with variables of other group(s). Here each group represents a single underlying construct or factor. These factors are small in number as compared to large number of dimensions. However, these factors are difficult to observe. There are basically two methods of performing factor analysis:
8. Principal Component Analysis (PCA): In this technique, variables are transformed into a new set of variables, which are linear combination of original variables. These new set of variables are known as principle components. They are obtained in such a way that first principle component accounts for most of the possible variation of original data after which each succeeding component has the highest possible variance.
The second principal component must be orthogonal to the first principal component. In other words, it does its best to capture the variance in the data that is not captured by the first principal component. For two-dimensional dataset, there can be only two principal components. Below is a snapshot of the data and its first and second principal components. You can notice that second principle component is orthogonal to first principle component.The principal components are sensitive to the scale of measurement, now to fix this issue we should always standardize variables before applying PCA. Applying PCA to your data set loses its meaning. If interpretability of the results is important for your analysis, PCA is not the right technique for your project.
Recently, we received this question on our data science forum. Here’s the complete answer.
In this article, we looked at the simplified version of Dimension Reduction covering its importance, benefits, the commonly methods and the discretion as to when to choose a particular technique. In future post, I would write about the PCA and Factor analysis in more detail.
Did you find the article useful? Do let us know your thoughts about this article in the comment box below. I would also want to know which dimension reduction technique you use most and why?
Beginners Guide To Learn Dimension Reduction Techniques
标签:
原文地址:http://www.cnblogs.com/yymn/p/4687175.html