I wish every data scientist had interactive data explorers like this one!
A thorough understanding of the data is a top priority for every data scientist. One important part of this process is plotting and visualizing the data. However, real-world data is often high-dimensional, so visualizing it is not straightforward. For this reason we use data projection techniques like UMAP, t-SNE or even PCA.
All these data projection techniques are amazing, but they share one drawback: they are not interactive, at least not in their Python implementations. In this short geeky coffee break, we will take a look at an interactive data exploration tool built on UMAP projections of the MNIST dataset.
The UMAP Explorer is a web app rendering an interactive UMAP visualization of the MNIST dataset. For even greater satisfaction, each data point is rendered as the image of the hand-written digit itself. Besides UMAP, we can also load a t-SNE projection of the data and observe how these two projection techniques differ.
It is a React application whose purpose is to demonstrate how to render tens of thousands of images mapped to data points, but it also serves as an excellent tool for data scientists.
Interactivity is crucial for understanding the data better. To fully immerse ourselves in data exploration, we need to dive interactively into particular data points. This way we can see their neighborhoods and how they compare with other data points.
In this MNIST use case, there are at least two aspects to observe: different digits that look alike, and the gaps, some tiny and some wide, between different groups of digits.
Different digits similar to each other
In the UMAP Explorer we can zoom into clusters of data points, click on them for a more detailed view, and observe the visual cues of the digits. This lets us compare a digit with other similar digits and even conclude that some entirely different digits are written in a very similar way. One such case is depicted in the figure below. A digit labeled and written as 2 sits among a cluster of digits written as 7. No wonder if some classifier mistakes that 2 for a 7.
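The same kind of check can be roughed out in code. The sketch below is not part of the UMAP Explorer. It assumes we already hold the 2D coordinates and the labels as plain Python lists (the names `points`, `labels` and `find_label_outliers`, as well as the toy coordinates, are made up for illustration) and flags points whose nearest neighbors in the projection mostly carry a different label, like a 2 sitting among 7s.

```python
import math
from collections import Counter


def find_label_outliers(points, labels, k=3):
    """Return (index, own_label, majority_neighbor_label) for each point whose
    k nearest neighbors in the 2D projection mostly carry a different label.

    Brute force, O(n^2): fine for a toy example, too slow for all of MNIST.
    """
    outliers = []
    for i, p in enumerate(points):
        # Distances from point i to every other point, nearest first.
        nearest = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        votes = Counter(labels[j] for _, j in nearest)
        majority_label, count = votes.most_common(1)[0]
        # Flag the point only if a strict majority of neighbors disagrees.
        if majority_label != labels[i] and count > k / 2:
            outliers.append((i, labels[i], majority_label))
    return outliers


# Hypothetical projected coordinates: a cluster of 7s with one 2 inside it,
# and a separate cluster of 2s.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
labels = [7, 7, 7, 7, 2, 2, 2, 2, 2]

outliers = find_label_outliers(points, labels, k=3)
print(outliers)
```

Points flagged this way are only candidates for a closer look. Whether they are genuinely ambiguous digits still has to be judged by eye, which is exactly what the explorer makes easy.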
Blending between the similar digits
Zooming into particular cases is not the only advantage of interactive visualizations. We can also investigate global properties of the data, especially the boundaries between the different classes. As illustrated below, the gap between the digits 8 and 1 is obvious, whereas the digits 8 and 3 seem to blend. No wonder if some classifier struggles to distinguish 8s from 3s.
The UMAP Explorer combines multiple technologies. First, it pre-computes the UMAP projections of the MNIST digits. This gives the projections of the images in the 2D plane, i.e. their (x, y) coordinates.
Very briefly, UMAP (Uniform Manifold Approximation and Projection) is a non-linear dimensionality reduction algorithm. It builds a high-dimensional graph representation of the data, then optimizes a low-dimensional graph to be as structurally similar as possible. Its two most important parameters are:
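One rough way to put a number on such gaps is the smallest distance between any point of one class and any point of another in the projection. The sketch below is an illustration only: the function name `min_gap_between_classes` and the toy coordinates are invented, and distances in a UMAP or t-SNE embedding are not guaranteed to reflect distances in the original pixel space, so the numbers describe the picture rather than the data itself.

```python
import math
from itertools import combinations


def min_gap_between_classes(points, labels):
    """Map each pair of labels to the smallest distance between their points."""
    by_label = {}
    for point, label in zip(points, labels):
        by_label.setdefault(label, []).append(point)

    gaps = {}
    for a, b in combinations(sorted(by_label), 2):
        gaps[(a, b)] = min(
            math.dist(p, q) for p in by_label[a] for q in by_label[b]
        )
    return gaps


# Hypothetical coordinates: the 3s and 8s nearly touch, the 1s sit far away.
points = [(0.0, 0.0), (1.0, 0.0),      # 3s
          (1.5, 0.0), (2.5, 0.0),      # 8s
          (10.0, 0.0), (11.0, 0.0)]    # 1s
labels = [3, 3, 8, 8, 1, 1]

gaps = min_gap_between_classes(points, labels)
# List the class pairs from the narrowest gap to the widest.
for pair, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(pair, round(gap, 2))
```

Pairs at the top of the printed list are the ones where classes blend in the projection and where a classifier is most likely to need a closer look.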
- n_neighbors: the number of approximate nearest neighbors used to construct the initial high-dimensional graph
- min_dist: the minimum distance between points in low-dimensional space
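To make `n_neighbors` a bit more concrete, here is a minimal sketch of the very first step only: building a k-nearest-neighbor graph. It is not the UMAP implementation. It uses exact brute-force search in pure Python instead of approximate nearest-neighbor search, and it skips the edge weighting and the low-dimensional optimization (where `min_dist` comes in) entirely. The function name `knn_graph` and the toy data are made up for illustration.

```python
import math


def knn_graph(data, n_neighbors):
    """Return {index: [indices of its n_neighbors nearest points]}."""
    graph = {}
    for i, p in enumerate(data):
        # Rank every other point by its Euclidean distance to point i.
        ranked = sorted(
            (j for j in range(len(data)) if j != i),
            key=lambda j: math.dist(p, data[j]),
        )
        graph[i] = ranked[:n_neighbors]
    return graph


# Toy 3-dimensional data: two tight groups of three points each.
data = [(0, 0, 0), (0, 0, 1), (0, 1, 0),
        (9, 9, 9), (9, 9, 8), (9, 8, 9)]

small = knn_graph(data, n_neighbors=2)   # edges stay inside each group
large = knn_graph(data, n_neighbors=4)   # edges have to bridge the groups

print(small[0])
print(large[0])
```

With a small `n_neighbors`, every edge stays inside its own group, so only local structure is captured. A larger value forces edges across groups, which is why larger values tend to favor the global picture over fine local detail.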
In the Resources section below you can find an amazing set of articles that explain UMAP.
I wish we had more of these interactive data explorers.
Resources
- The GitHub repository of the UMAP Explorer
- A great tutorial on understanding UMAP
- A deeper dive in UMAP