In this article I’m going to describe a pipeline for Text Clustering. I could not find any satisfactory explanation, so I decided to share my findings after a few weeks of research about this task.
In a previous article, I wrote about the value of datasets, and in particular annotated (gold) datasets. But another type of data fills the whole internet: unlabelled data, in overwhelming volumes.
So, the idea of learning from all this raw data is quite interesting, and lots of research has been done using two methods: semi-supervised – labelled and unlabelled data are both used, and unsupervised – only unlabelled data is used.
The first intuition when looking at raw data is to try to find patterns. Clustering similar information together makes it easier to understand what’s going on inside the data.
Clustering is the task of organizing unlabelled objects in a way that objects in the same group are similar to each other and dissimilar to those in other groups. In other words, clustering is like unsupervised classification where the algorithm models the similarities instead of the boundaries.
It has been widely used for various tasks, including: opinion mining on social media to detect a tendency in posts, image segmentation where the goal is to detect the boundaries of any object, customer profiling based on product purchases, and document clustering based on phrases or words. But you can also use it to label data in place of a person, allowing the human to go from a supervising role to a validating role.
I’m going to cover the whole process of text clustering, from actual clustering to data visualization, through data preprocessing. In the end, you should be able to build a solid pipeline, using state-of-the-art techniques to further improve your datasets’ usage!
To start off, we need to meet three requirements. First of all, we need a distance measure to define whether or not two documents are similar, a criterion function to compute the quality of our clusters and finally an algorithm to optimize this criterion.
A distance measure can help us define the proximity of two points in our dataset. It should be small if documents 1 and 2 are similar and large if they differ.
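As a minimal sketch of such a measure, here is cosine distance computed between toy term-count vectors (the vector values are made up for illustration):

```python
import math

def cosine_distance(a, b):
    """Cosine distance: ~0 for vectors pointing the same way, 1 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy term-count vectors for three documents
doc1 = [1, 2, 0, 1]
doc2 = [1, 2, 0, 1]   # same word counts as doc1
doc3 = [0, 0, 3, 0]   # no words in common with doc1

d_same = cosine_distance(doc1, doc2)      # ~0: similar documents
d_diff = cosine_distance(doc1, doc3)      # 1.0: nothing in common
```

Cosine distance is popular for text precisely because it ignores document length and only compares the direction of the vectors.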
The criterion function will inform us when we have found the best clusters, and stop the pipeline. As an example, one approach would be to maximize the similarity between each document and the center of the cluster it has been assigned to. Another approach could be to maximize the difference between clusters, instead of their internal similarities.
Finally, we need an algorithm to optimize this criterion function. This algorithm can have several stages. A common way is to use a greedy approach consisting of two steps: initial clustering, and refinement.
The initial phase will select random documents from our corpus and assign them to clusters. Then the refinement phase will iterate over random documents and compute the criterion function when each document is moved to another cluster. If the score improves, we keep the move and continue iterating; when no move improves it any more, we have found a (locally) optimal clustering of the given data.
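This two-step greedy loop can be sketched in a few lines. The sketch below is a deterministic toy variant (it sweeps over the points in order instead of sampling them at random, so the example is reproducible) on one-dimensional "documents", with the sum of squared distances to each cluster mean as the criterion:

```python
def cluster_cost(points, assign, k):
    """Criterion: sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for c in range(k):
        members = [p for p, a in zip(points, assign) if a == c]
        if members:
            center = sum(members) / len(members)
            total += sum((p - center) ** 2 for p in members)
    return total

def refine(points, assign, k, max_sweeps=20):
    """Greedy refinement: move each point to the cluster that most lowers
    the criterion; stop when a full sweep changes nothing."""
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(points)):
            best_c = assign[i]
            best_cost = cluster_cost(points, assign, k)
            for c in range(k):
                assign[i] = c
                cost = cluster_cost(points, assign, k)
                if cost < best_cost:
                    best_c, best_cost = c, cost
                    changed = True
            assign[i] = best_c   # keep only the best move for this point
        if not changed:
            break
    return assign

# Two obvious groups on a line; start from a deliberately bad assignment
points = [1.0, 1.2, 0.9, 10.0, 10.3, 9.8]
assign = refine(points, [i % 2 for i in range(6)], k=2)
# the refinement recovers the two natural groups
```

Real implementations recompute the cost incrementally rather than from scratch at every move, but the logic is the same.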
Those three requirements are the baseline of a clustering task, and the criterion function depends on the approach you selected for your clustering experiment. From here, clustering methods branch out in many directions, and covering each algorithm would be too tedious. Instead, we are going to look at three algorithms distributed over three major approaches: partitional, hierarchical and density-based.
K-means is THE go-to clustering algorithm. Fast, available and easy to wrap your head around, it requires you to know the number of clusters you expect. One downside is that K-means tends to assume that your clusters will be simple (roughly spherical and of similar size), due to its partitional approach: it tries to decompose the dataset into non-overlapping subsets. Expect quick results, but with noise.
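Here is K-means on synthetic 2-D points standing in for document vectors (the data is fabricated so the two groups are obvious):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of 20 points each, as a toy stand-in for vectors
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(5, 0.3, (20, 2))])

# n_clusters must be chosen up front: the main practical constraint of K-means
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
labels = km.labels_
```

On real text vectors you would feed `KMeans` the output of a vectorizer instead of random blobs, and likely try several values of `n_clusters`.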
On the other hand, hierarchical clustering tends to model the dataset into clusters organized as a tree, where you can go up from one document to the whole corpus. This can be done from top to bottom (divisive) or bottom to top (agglomerative). Agglomerative clustering is an approach that yields a dendrogram (a tree diagram) of your dataset, for you to cut at the desired threshold. This leaves you with the freedom to view the assumed clusters before selecting them, but they will still be noisy.
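The agglomerative approach can be sketched with SciPy: `linkage` builds the merge tree bottom-up, and `fcluster` cuts it at a chosen threshold (here, into two clusters):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two synthetic groups of 10 points in 2-D
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.2, (10, 2)),
               rng.normal(4, 0.2, (10, 2))])

Z = linkage(X, method="ward")                     # bottom-up merge tree
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
```

Calling `scipy.cluster.hierarchy.dendrogram(Z)` draws the tree diagram, which is exactly the view that lets you eyeball the clusters before committing to a cut.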
Finally, density-based clustering will create clusters in the denser regions of your dataset. DBSCAN (and its improvement HDBSCAN) combines the best of agglomerative clustering with the capacity to discard noisy documents. The main parameters you have to select are the maximal distance at which two documents count as neighbours, and a minimum neighbourhood size, and DBSCAN will do the rest.
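A quick DBSCAN sketch on fabricated data, with one isolated point that should come out labelled as noise (`-1` in scikit-learn's convention):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(1)
X = np.vstack([rng.normal(0, 0.2, (20, 2)),
               rng.normal(5, 0.2, (20, 2)),
               [[2.5, 2.5]]])        # one isolated point, far from both groups

# eps: maximal neighbour distance; min_samples: minimum neighbourhood size
db = DBSCAN(eps=1.0, min_samples=3).fit(X)
labels = db.labels_                  # noise points get the label -1
```

Unlike K-means, you never tell DBSCAN how many clusters to expect, and noisy documents are excluded instead of being forced into a cluster.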
If you want a more complete comparison of different clustering methods, you can find one here, with visuals too!
Here we covered the basics of clustering and listed several algorithms that can be used to find the best groups in our corpus. But for the algorithms to work, it is important to feed them the data under the right light. In the next part, we’re going to talk about preprocessing techniques, which will help reduce noise and facilitate data visualisation.
Cleaning the data
As always with text, documents can be noisy, hiding information between stopwords, inflections and sparse representations. The preprocessing step is crucial to obtain a dataset that is easier to work with.
Traditional approaches to text clustering tend to tokenize the documents into their component words using available tools (TreeTagger for example). This leaves us with a finer grain to work with: words instead of whole sentences. Some of those might be pluralized, conjugated or inflected. To cope with that, lemmatization and stemming can be used: the first yields an uninflected form while keeping the sense, whereas the other truncates the word, resulting in a simpler but sometimes meaningless token. On top of that, you can remove the stopwords, to further simplify the semantics of the document and thus improve the clustering.
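To make the pipeline concrete, here is a deliberately tiny sketch: a regex tokenizer, a toy stopword list, and a crude suffix-stripping "stemmer". A real pipeline would use a proper tokenizer, a full stopword list and a genuine lemmatizer or stemmer (e.g. from NLTK or spaCy):

```python
import re

# Toy stopword list -- real lists contain hundreds of entries
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def crude_stem(word):
    """Naive stemming: strip a few common suffixes (can produce non-words)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(document):
    tokens = re.findall(r"[a-z]+", document.lower())    # tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]  # drop stopwords
    return [crude_stem(t) for t in tokens]              # normalize inflections

tokens = preprocess("The clusters are emerging in the processed documents")
# -> ['cluster', 'emerg', 'process', 'document']
```

Note how "emerging" became the meaningless token "emerg": that is exactly the stemming trade-off described above, where a lemmatizer would have returned "emerge".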
One last thing that can be done is to make some words stand out in our corpus. Once stopwords are removed, the semantics of a sentence depend on the theme (nouns and adjectives) and the action (verbs, auxiliaries and adverbs). Those can be enhanced by adding synonyms, hyponyms and hypernyms, whether with rules or using pre-trained word embeddings. This enrichment step can yield better results, or worsen the noise in your data. There’s no right or wrong here, you’ll have to try!
Once the corpus is pre-processed, we need to transform our data into something the algorithm can work with: vectors. Most of the time, TF-IDF is the preferred solution, but due to their sparsity, those vectors can be set aside in favour of denser representations. Paragraph vectors appeared after the enthusiasm for word embeddings, and thanks to their internal composition, allow an easy computation of similarity. Doc2Vec, an implementation of this paper by Quoc Le and Tomas Mikolov, is freely available in gensim, you should give it a try!
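TF-IDF vectorization is a one-liner with scikit-learn; the example below also computes cosine similarities between the resulting document vectors on a made-up three-document corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A made-up corpus: the first two documents share vocabulary, the third doesn't
corpus = [
    "machine learning for text clustering",
    "deep learning for text classification",
    "the cat sat on the mat",
]

X = TfidfVectorizer().fit_transform(corpus)  # sparse document-term matrix
sim = cosine_similarity(X)                   # pairwise document similarities
```

For the dense alternative, gensim's `Doc2Vec` exposes a similar workflow (train on `TaggedDocument` objects, then call `infer_vector`), but it needs far more data than three sentences to produce useful vectors.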
I cannot end this part without talking about the curse of dimensionality. This phenomenon appears when you work with high-dimensional data, such as text, where the size of the vectors is often equal to the size of the vocabulary. Put simply, the more dimensions you have, the sparser the data gets, and computing the similarity between two points becomes increasingly hard. Here’s a longer explanation, explained like you’re five!
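One symptom of the curse is easy to demonstrate numerically: as dimensionality grows, the distances from one random point to all the others concentrate around the same value, so "near" and "far" lose their contrast. A small NumPy experiment (with fabricated uniform data):

```python
import numpy as np

rng = np.random.RandomState(0)

def relative_contrast(dim, n=200):
    """Spread of distances from one point to the rest, relative to their mean."""
    X = rng.rand(n, dim)                        # n random points in a unit cube
    d = np.linalg.norm(X - X[0], axis=1)[1:]    # distances from point 0
    return (d.max() - d.min()) / d.mean()

low_dim = relative_contrast(2)      # in 2-D: distances vary a lot
high_dim = relative_contrast(1000)  # in 1000-D: distances look almost identical
```

With everything almost equidistant, nearest-neighbour reasoning (and thus distance-based clustering) degrades, which is why dimensionality reduction matters so much for text.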
The way you represent your data, and the way you present it to your algorithm can change everything, so try sparse and dense vectors on your heavy or lightly preprocessed documents.
But before jumping right into clustering, look at your data: this simple step can save you hours of tinkering with your algorithms.
Just look at it!
Yes, it’s that simple! Look at the raw data you are about to process. You will probably find patterns yourself, which will help you define the number of clusters you can expect, or the amount of preprocessing you need to apply to let the clusters emerge by themselves.
Keep in mind that once you’ve vectorized your data, you’re working with high dimensions, and it’s incredibly hard to understand what’s going on.
A mathematician and an engineer attend a lecture by a physicist concerning theories that occur in spaces with dimensions of 9. The mathematician is sitting, clearly enjoying the lecture, while the engineer is looking puzzled. By the end the engineer has a terrible headache, and asks his friend: “How do you understand this stuff?”. The mathematician answers: “I just visualize the process.” To that the engineer replies: “How can you POSSIBLY visualize something that occurs in 9-dimensional space?” Visibly amused, the mathematician responds: “Easy, first visualize it in N-dimensional space, then let N be 9.”
To be able to visually analyse the data, we need to transform our N-dimensional data into a 2- or 3-dimensional representation.
The goal of dimensionality reduction is to extract the principal information contained inside our data without using everything. PCA (for Principal Component Analysis) does just that.
The underlying idea behind PCA is to retain the data dimensions that contain most of the information (i.e. those with the highest variance). Conversely, dimensions with low variance add little information to the data and can be safely removed. Of course, most of the time this leads to a loss of information, but who cares when you can go from 300+ dimensions to only 3 while keeping more than 70 percent of the information? Well, sometimes you should care about it, but since this is an introduction, I won’t hold it against you.

Other approaches, which work better with high-dimensional data, try to preserve the really small distances between documents: by bringing the nearest neighbors closer together, the dissimilar documents drift apart.
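A quick PCA sketch with scikit-learn, on fabricated 50-dimensional data that secretly lives along a single direction (so two components capture almost all of its variance):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# 100 points that really vary along one direction in 50-D, plus tiny noise
direction = rng.rand(50)
X = np.outer(rng.randn(100), direction) + 0.01 * rng.randn(100, 50)

pca = PCA(n_components=2).fit(X)
X2 = pca.transform(X)                              # 100 x 2: ready to plot
explained = pca.explained_variance_ratio_.sum()    # variance kept by 2 dims
```

`explained_variance_ratio_` is exactly the "how much information did I keep" number mentioned above; on real text vectors it drops much faster than on this idealized example.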
t-SNE (for t-Distributed Stochastic Neighbor Embedding) is a technique that uses this intuition to measure the similarities between close points in high-dimensional space, and then tries to lay out the points in a low-dimensional space in such a way that close points stay close. This is done by optimizing a distance measure (Kullback-Leibler divergence, for the nerds reading me) using gradient descent. t-SNE tends to yield better results than PCA for data in high dimensions, and this article explains in great detail how to use it effectively. Also, here is a video from one of the two authors of t-SNE, talking about the maths behind his algorithm.
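scikit-learn ships a t-SNE implementation with the same `fit_transform` interface as PCA; here it is on two fabricated 20-dimensional groups:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
# Two well-separated groups of 30 points in 20-D
X = np.vstack([rng.normal(0, 0.5, (30, 20)),
               rng.normal(8, 0.5, (30, 20))])

# perplexity roughly sets how many neighbours each point "attends to"
X2 = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
```

The `perplexity` parameter is the knob the linked article spends the most time on: too low and clusters shatter, too high and they smear together, so it is worth sweeping a few values.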
Once done, use your preferred plotting library (I personally like the availability of matplotlib, with the beauty of seaborn) or tool (Tensorflow’s projector is really something to try!) and enjoy your whole dataset being displayed in front of you after a few minutes of computation!
In the end, we can define text clustering as the following pipeline: first look at your data to have an idea of what you’ll be working with, then apply a preprocessing step to reduce noise, visualize your dataset to find natural patterns and/or clusters, actually cluster it, analyze the results either by a metric or by eye, and repeat!