Introduction
This blog focuses on the core intuition and ideas behind these algorithms rather than their nitty-gritty technical details. By the end, you should come away with a broad understanding of dimensionality reduction. Technical details are sometimes crucial, but they are often not worth getting into here: by the time you have explained all of them, you have lost the point you were trying to make in the first place.

Data science often demands that we work in high dimensions, where extremely simple things become hard. When building machine learning models, we tend to add features to capture every salient aspect of the data. Beyond a certain point, however, the model's performance decreases as the number of features grows. This phenomenon is often referred to as "the curse of dimensionality." If we keep adding features without also increasing the number of training samples, the feature space becomes sparser and sparser. Because of this sparsity, it becomes much easier to find a "perfect" solution for the model: it corresponds too closely to one particular set of data and doesn't generalize well. It works too well on the training dataset, fails on future data, and makes the predictions unreliable. You may also know this as overfitting. To boost our model's performance and arrive at a good solution, we need to reduce the number of features considerably. What if I tell you that, fortunately for us, in most real-life problems it is possible to reduce the dimensions in our data without losing too much of the variance within it? Sounds perfect, right?
Now suppose we want to get an intuition for how some high-dimensional data is arranged. Is there some way to understand the underlying structure of the data? A simple approach is to build a map in which each high-dimensional object is represented by a point, in such a way that similar objects are represented by nearby points and dissimilar objects by distant points. We embed these points in a low-dimensional map of maybe two or three dimensions, and we can then visualize that map using a scatter plot.
Here is an example. We have a bunch of high-dimensional data points and we try to embed them in a two-dimensional space. If the distances between points in the low-dimensional map are very different from the distances in the high-dimensional space, we clearly didn't do a very good job. What we want to end up with is a map in which the low-dimensional distances reflect the similarities in the original high-dimensional data. To do this, we minimize an objective function that measures the discrepancy between the similarities in the original data and the similarities in the map. This is called dimensionality reduction, or embedding.
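To make "minimizing a discrepancy" concrete, here is a minimal numpy sketch (on made-up data) of one such objective: the sum of squared differences between pairwise distances in the original space and in a candidate map, in the spirit of classical MDS stress. The names and data are purely illustrative, not the objective any particular algorithm uses.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Toy data: 100 points in 50 dimensions, plus a random candidate 2-D embedding.
rng = np.random.default_rng(0)
X_high = rng.normal(size=(100, 50))   # original high-dimensional data
Y_low = rng.normal(size=(100, 2))     # candidate low-dimensional map

def embedding_stress(X, Y):
    """Sum of squared differences between pairwise distances
    in the original space and in the embedding (MDS-style stress)."""
    d_high = pdist(X)   # pairwise distances in the original space
    d_low = pdist(Y)    # pairwise distances in the map
    return np.sum((d_high - d_low) ** 2)

print(embedding_stress(X_high, Y_low))
```

An embedding algorithm would repeatedly adjust `Y_low` to drive a quantity like this down.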
Essentially there are only two categories of dimensionality reduction, and they cover a broad range of techniques.

Matrix factorization: This includes Principal Component Analysis and all kinds of probabilistic techniques. The goal of matrix factorization is to express a matrix, approximately, as the product of two smaller matrices. The bigger matrix is the data, where each row is a sample and each column is a feature, and we want to factor it into a low-dimensional representation multiplied by a set of exemplars used to reconstruct the original data. A single sample of your data corresponds to one row of the representation, which acts on the entire matrix of exemplars: that row is expressed as a linear combination of the exemplars, and the coefficients of that linear combination are its low-dimensional representation. This is matrix factorization. In mathematical notation, $A \approx UV$, where $A$ is the data ($m \times c$), $U$ is the representation ($m \times n$), and $V$ holds the archetypes ($n \times c$). Why the approximation sign? Because approximation means optimization: we try to minimize some notion of error between our original and our reconstructed data, subject to some constraints. Variations of these losses and constraints give us different matrix factorization techniques.
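Here is a minimal numpy sketch of the $A \approx UV$ idea, using a truncated SVD as one concrete way to minimize the squared reconstruction error (the same factorization that underlies PCA). The matrix sizes and variable names are illustrative.

```python
import numpy as np

# Toy data matrix A: m samples x c features.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 20))

# Rank-n factorization A ~= U V via truncated SVD.
n = 3
u, s, vt = np.linalg.svd(A, full_matrices=False)
U = u[:, :n] * s[:n]      # m x n  -> low-dimensional representation
V = vt[:n, :]             # n x c  -> archetypes / exemplars

reconstruction = U @ V
error = np.linalg.norm(A - reconstruction) ** 2   # squared reconstruction loss
print(U.shape, V.shape, error)
```

Each row of `U` is the low-dimensional representation of the corresponding row of `A`, and `V` holds the exemplars that the rows are linear combinations of.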
Neighbor graphs: This includes t-distributed stochastic neighbor embedding (t-SNE). We build a graph from the data and then embed that graph in a low-dimensional space. So how do we build the graph, and how do we embed it? Let's look at some of the intuition. Suppose the data has some non-linear, low-dimensional structure to it; a linear approach like PCA will not be able to find it. But if we draw a graph in which each point is linked by an edge to its k nearest neighbors, the graph is able to uncover the structure we wanted to find.
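A minimal sketch of the graph-building step, assuming scikit-learn is available: connect each point to its k nearest neighbors with `kneighbors_graph`. The toy "circle" data below is just a stand-in for data with non-linear low-dimensional structure.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

# Toy non-linear data: points on a noisy circle embedded in 2-D.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=200)
X = np.column_stack([np.cos(theta), np.sin(theta)]) + 0.05 * rng.normal(size=(200, 2))

# Connect each point to its 10 nearest neighbors.
# The resulting sparse adjacency matrix is the neighbor graph that
# methods in this family then try to embed in a low-dimensional space.
graph = kneighbors_graph(X, n_neighbors=10, mode="connectivity")
print(graph.shape, graph.nnz)  # 200 x 200 adjacency, 200 * 10 edges
```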
In the famous words of Mick Jagger, “You can’t always get what you want, but if you try sometimes, you just might find you get what you need”
PCA
Let's dive into the intuition behind PCA!
Picture a bunch of points in 2-D, in two colors, red and blue. Let's think about the distribution of the red points. These points have a mean $\bar{x}$ and a set of eigenvectors $v_1, v_2$ (of their covariance matrix). The idea is that we can represent each red point by only its $v_1$ coordinate plus the mean. Essentially we pretend that all the red points lie on a line, and all we record is where on that line each one sits; we ignore how far off the line it is. So we have just reduced the dimensions from 2 to 1. This doesn't sound like that big a deal, nor is it. But in higher dimensions, this could be a huge deal. Imagine you've got something in a 10,000-dimensional space. If you said, well, I'm going to represent each point by one number, or even 30, that would be a huge reduction. So if I tell you the direction in which the data varies the most, and give you only each point's value along that direction, then you have reduced the data from a lot of numbers to a small amount of numbers.
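Here is a small numpy sketch of exactly this idea on made-up "red points": compute the mean and the eigenvectors of the covariance, keep only each point's coordinate along $v_1$, and reconstruct an approximation from the mean plus that single coordinate. The data and names are illustrative.

```python
import numpy as np

# Toy 2-D "red points": a correlated Gaussian cloud.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[2.0, 1.0],
                            cov=[[3.0, 1.5], [1.5, 1.0]],
                            size=200)

x_bar = X.mean(axis=0)                       # the mean
cov = np.cov(X, rowvar=False)                # 2 x 2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvectors of the covariance
v1 = eigvecs[:, np.argmax(eigvals)]          # direction of largest variance

# 1-D representation: where each point sits along v1.
coords_1d = (X - x_bar) @ v1

# Approximate reconstruction: mean plus (v1-coordinate times v1).
X_approx = x_bar + np.outer(coords_1d, v1)
print(coords_1d.shape, X_approx.shape)       # (200,) and (200, 2)
```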
Let's look at PCA from a different perspective. Suppose you wish to differentiate between food items based on their nutritional content. Which variable would be a good choice for telling food items apart? If you choose a variable that varies a lot from one food item to another, you will be able to separate them properly. Your job will be much harder if the chosen variable is almost the same across food items. What if the data doesn't have a variable that separates food items properly? We can create an artificial variable as a linear combination of the original variables, for example artVar1 = 2 × orgVar1 - 3 × orgVar2 + 5 × orgVar3. This is essentially what PCA does: it finds the best linear combinations of the original variables so that the variance, or spread, along the new variable is maximal. These new variables are known as the principal components, and they define a new coordinate system. We can do an awful lot with mean squared error: classical PCA writes the loss as the squared error between the original and reconstructed entries, summed over all of them, $\min_{U,V} \lVert A - UV \rVert_F^2 = \sum_{i,j} \big(A_{ij} - (UV)_{ij}\big)^2$, and it is computed with standard linear algebra techniques (an eigendecomposition or singular value decomposition). We want to choose the axis that retains the maximum amount of variance in our data set, as it will most likely lose less information than other projections. The axis we choose is the first principal component (PC1), as it captures the maximum amount of variance within our data set; the second principal component (PC2) accounts for the largest remaining variance in our dataset.
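In practice you rarely compute this by hand. A minimal example with scikit-learn's `PCA`, on a random stand-in for a food-by-nutrient table, looks like this:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy stand-in for a "food items x nutrients" table: 50 items, 8 nutrient columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))

# Standardize first so no single nutrient dominates just because of its units.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# Each principal component is a linear combination of the original variables,
# chosen so that PC1 captures the most variance and PC2 the most of what remains.
print(pca.components_)                # the linear-combination coefficients
print(pca.explained_variance_ratio_)  # fraction of variance captured by PC1, PC2
```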
t-SNE
t-SNE stands for t-distributed stochastic neighbor embedding, and here is how it works. In the high-dimensional space, we measure similarities between points, and we do it in such a way that we only look at local similarities, that is, similarities to nearby points. Take a point $x_i$ in the high-dimensional space. We center a Gaussian over this point and measure the density of all the other points under that Gaussian. This gives us a set of probabilities $p_{ij}$ that measure the similarity between pairs of points $i$ and $j$. You can think of this as a probability distribution over pairs of points, where the probability of picking a particular pair is proportional to their similarity. If two points are close together in the original high-dimensional space, $p_{ij}$ will be large; if two points are dissimilar (far apart), $p_{ij}$ will be essentially infinitesimal. So now we have an idea of what the local similarities in the high-dimensional space are. How are we going to use them? Now we look at the low-dimensional space of two or three dimensions: this will be our final map. Take the corresponding point $y_i$ in the map. We do the same thing here: we center a kernel over $y_i$ and measure the density of all the other points under that distribution. This gives us a probability $q_{ij}$ that measures the similarity of two points in the low-dimensional map. What we want is for these probabilities $q_{ij}$ to reflect the high-dimensional similarities $p_{ij}$ as well as possible. If the $q_{ij}$ are essentially identical to the $p_{ij}$, then the structure of the map is very similar to the structure of the data in the original high-dimensional space. We measure the difference between the $p_{ij}$ values in the high-dimensional space and the $q_{ij}$ values in the low-dimensional map with the KL divergence. As in many other methods, we set up two similarity measures, one in the original space and one in the embedded space, and minimize the difference between them. In t-SNE in particular, the original-space similarity is based on a Gaussian distribution and the embedded-space similarity is based on the heavy-tailed Student-t distribution.
• If two points are close in the original space, there is a strong attractive force between them in the embedding.
• Conversely, if two points are far apart in the original space, the algorithm is relatively free to place them around.
We want to lay out the points in the low-dimensional map so that this KL divergence is minimized. To do that we run gradient descent on the KL divergence, which boils down to moving the points around until the divergence becomes small.
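For readers who want the notation, these are the standard definitions from the original t-SNE formulation (van der Maaten and Hinton, 2008): a Gaussian similarity in the high-dimensional space, a Student-t similarity in the map, and the KL divergence between the two as the objective.

$$
p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}, \qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}
$$
$$
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}, \qquad
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
$$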
One subtlety in computing the $q_{ij}$ is that we are not using a Gaussian kernel; we are using a Student-t distribution, which has much heavier tails than the Gaussian. Why this distribution? Suppose our data is intrinsically high dimensional, for instance uniformly sampled from a 10-dimensional hypercube, and we project it down to two dimensions. We can never preserve all the pairwise distances accurately; we need to compromise somewhere. As we saw, t-SNE tries to model the local structure, keeping similar things close together in the map, so the dissimilar data has to be modeled too far apart in the map. A very simple example: take three points in two-dimensional space. The short distances between neighboring points are the local structure; the large pairwise distance across the triangle is the global structure. If I embed this data in 1-D while preserving the local structure, the two small pairwise distances are preserved quite accurately, but the distance between the two far-apart points has grown. Using the heavy-tailed distribution, we allow this to happen. If two points have a pairwise distance of, say, 10, and a Gaussian assigns that pair a certain density, then to get the same density under the Student-t distribution, because of its heavy tails, the points have to be 20 or 30 apart. So for dissimilar points, the heavy-tailed $q_{ij}$ allows them to be modeled too far apart in the map. When you watch t-SNE run gradient descent on the KL divergence, for example on the MNIST digits, you see a lot more structure than in the PCA plots: the 10 different digits are fairly well separated in the low-dimensional map, and if you run it a little longer you get quite large gaps between the clusters. The digit labels were not used to generate the embedding; they are only used to color the plots. It is a completely unsupervised method.
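A minimal scikit-learn sketch of the digits experiment described above, using the small 8x8 digits dataset rather than full MNIST and illustrative hyperparameters:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# The 8x8 digits dataset: 1797 samples in a 64-dimensional space.
digits = load_digits()
X, y = digits.data, digits.target

# t-SNE embeds the 64-D points into 2-D by minimizing the KL divergence
# described above; the labels y are NOT used to compute the embedding.
X_2d = TSNE(n_components=2, perplexity=30, init="pca",
            random_state=0).fit_transform(X)

# Labels are used only to color the scatter plot.
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=5)
plt.colorbar(label="digit")
plt.show()
```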
For more on how t-SNE behaves on MNIST and other datasets, and how its results depend on hyperparameters such as perplexity, see the Distill article "How to Use t-SNE Effectively" (https://distill.pub/2016/misread-tsne/).
Autoencoders