What is high-dimensional, sparse data?

Dimension reduction

Dimension reduction is a collection of statistical methods that reduce the dimension of the data while retaining the relevant information.

High-dimensional data is widely used by government agencies, scientific surveys, and industrial companies. However, the high dimension and the large volume of data raise at least two questions:

  1. How can one overcome the curse of dimensionality, i.e. the fact that high-dimensional spaces are inherently sparse, even with a large number of observations?
  2. How can the information within the data be represented parsimoniously?
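The sparsity mentioned in the first question can be seen directly in a small numerical experiment (a sketch with made-up sample sizes): as the dimension grows, the nearest and farthest neighbours of a point become almost equally far away, so distance-based reasoning loses discriminating power.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ratio of nearest to farthest distance from one reference point.
# In low dimensions the ratio is small; in high dimensions it
# approaches 1, a symptom of the curse of dimensionality.
ratios = {}
for d in (2, 100, 10_000):
    X = rng.uniform(size=(200, d))
    dists = np.linalg.norm(X - X[0], axis=1)[1:]  # distances from point 0
    ratios[d] = dists.min() / dists.max()
    print(d, round(ratios[d], 3))
```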

Dimension reduction techniques address these problems to varying degrees by reducing the set of variables to a smaller set of either the original variables or new ones, where the new variables are linear combinations or even nonlinear functions of the original variables. If the number of dimensions of the new data set is relatively small (usually up to about 3), data visualization becomes possible, which often makes data modeling much easier.

High dimensionality can make machine learning algorithms difficult to apply: it increases the computational complexity and raises the risk of overfitting (since the algorithm has more degrees of freedom).


Dimension reduction techniques can be divided into two main categories: supervised dimension reduction and unsupervised dimension reduction.

Unsupervised dimensionality reduction treats all variables the same, whereas supervised analysis usually has a natural definition of the information that interests us. Unsupervised dimensionality reduction methods find a new, smaller set of variables that either offers a simpler representation or preserves the intrinsic structure of the data, while retaining most of the important information. Below are a few of the most commonly used techniques.

Principal component analysis

Principal component analysis (PCA) finds a few orthogonal linear combinations of the original variables with the greatest variances; these linear combinations are the principal components that are retained for later analysis. In PCA, the relevant information is the variation within the data. As a rule, the principal components are sorted in descending order of their variance. The number of principal components to include in the analysis depends on how much variance one wants to retain.
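A minimal sketch of this with scikit-learn's `PCA`, on synthetic data whose 10 features are driven by only 2 latent directions (the data and the 95% variance threshold are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 samples, 10 correlated features built from 2 latent directions
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(200, 10))

# Keep just enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # far fewer than 10 columns
print(pca.explained_variance_ratio_)   # sorted in descending order
```

Passing a float to `n_components` lets the variance threshold, rather than a fixed count, decide how many components to keep.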

Factor analysis

Factor analysis assumes that a number of variables are interrelated through a smaller number of common factors. It estimates the common factors using assumptions about the variance-covariance structure.
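As a hedged sketch, scikit-learn's `FactorAnalysis` can recover such a structure from synthetic data where six observed variables are generated from two common factors plus unique noise (the data and the choice of two factors are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# Six observed variables driven by two common factors plus unique noise
factors = rng.normal(size=(300, 2))
loadings = rng.normal(size=(2, 6))
X = factors @ loadings + 0.3 * rng.normal(size=(300, 6))

fa = FactorAnalysis(n_components=2)
scores = fa.fit_transform(X)    # estimated factor scores per observation

print(fa.components_.shape)     # (2, 6): estimated loadings
print(scores.shape)             # (300, 2): one score pair per observation
```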

Canonical correlation analysis

Canonical correlation analysis identifies and measures the association between two sets of random variables. It finds a linear combination of variables from each set such that these two new variables show the greatest correlation. Canonical correlation is appropriate in the same situations as multiple regression, but where there are multiple inter-correlated output variables.

Correspondence analysis

Correspondence analysis is a graphical tool for exploratory analysis of a contingency table. It projects the rows and columns as points on a diagram, with rows (or columns) that have similar profiles appearing close together.
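The coordinates behind such a diagram come from a singular value decomposition of the table's standardized residuals. A minimal NumPy sketch on a made-up 3x3 contingency table (the numbers are purely illustrative):

```python
import numpy as np

# Toy contingency table: rows = groups, columns = categories
N = np.array([[20., 10.,  5.],
              [10., 25., 15.],
              [ 5., 15., 30.]])

P = N / N.sum()                       # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)   # row and column masses

# Standardized residuals: departures from row/column independence
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S)

# Principal coordinates for rows and columns on the first two axes;
# plotting these gives the correspondence-analysis map
row_coords = (U[:, :2] * sv[:2]) / np.sqrt(r)[:, None]
col_coords = (Vt.T[:, :2] * sv[:2]) / np.sqrt(c)[:, None]
print(row_coords.shape, col_coords.shape)
```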

Multidimensional scaling

Multidimensional scaling (MDS) finds a projection of the data into a lower-dimensional space such that the distances between the points in the new space reflect the proximities in the original data. The number of dimensions in an MDS plot can exceed 2 and is specified a priori.
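A short sketch with scikit-learn's `MDS`, embedding illustrative 8-dimensional data into 2 dimensions (both the data and the target dimension are assumptions chosen for the example):

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))   # illustrative high-dimensional data

# The target dimension (here 2, for plotting) is specified a priori;
# by default MDS works from pairwise Euclidean distances of X
mds = MDS(n_components=2, random_state=0)
X_2d = mds.fit_transform(X)
print(X_2d.shape)
```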

Random forests

Random forests are effective not only as classifiers, but also for feature selection. One approach to reducing dimensionality is to build a large, carefully constructed set of decision trees against a target attribute and then use each attribute's usage statistics to find the most informative subset of features. Specifically, one can grow a large set (e.g. 2000) of very shallow trees (e.g. 2 levels), each tree trained on a small fraction (e.g. 3) of the total number of attributes. If an attribute is often chosen as the best split, it is most likely an informative feature worth keeping. A score calculated from each attribute's usage in the random forest then tells us, relative to the other attributes, which are the most predictive.
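This scheme can be sketched with scikit-learn's `RandomForestClassifier`: many shallow trees, each split considering only a few candidate features, with `feature_importances_` as the usage-based score. The synthetic data (only the first three of twenty features drive the target) and the reduced tree count of 500 (for speed, instead of the 2000 mentioned above) are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# 20 candidate features, but only the first 3 determine the target
X = rng.normal(size=(500, 20))
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)

# Many very shallow trees (depth 2), each split choosing among
# only 3 randomly drawn candidate features
forest = RandomForestClassifier(n_estimators=500, max_depth=2,
                                max_features=3, random_state=0)
forest.fit(X, y)

# Features chosen often for good splits accumulate high importance
importances = forest.feature_importances_
top3 = np.argsort(importances)[::-1][:3]
print(sorted(top3.tolist()))
```

Ranking the importances and keeping the top-scoring attributes yields the reduced feature subset.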