The Delicacy of Data Augmentation in Natural Language Processing
Disclaimer: I am the author of this post, which was originally published on Medium as part of the blog portal of iGenius, the company I was working for at the time.
Data is everything in AI.
It is one of the most important resources, and tremendous effort is invested in collecting and curating it. Since high-quality data is scarce, it is worth researching how to leverage the existing data in order to produce even more data. In other words, doing more with less.
Data augmentation is a set of techniques for automatically generating new, high-quality data on top of the existing data.
For example, if we have a set of images, we can perform dozens of operations on every image: rotate, scale, shift, crop, change color intensities, etc.
However, the same operations are not applicable in the domain of natural language processing (NLP), where the main entity is a sentence. There, the data augmentation process is more challenging and far less straightforward. In this blog post, we uncover the delicacy of data augmentation in NLP and explain some common techniques for doing it.
Text Substitution
The main principle is seamless substitution: replacing some words with their equivalents, also known as synonyms, without changing the meaning of the phrase. This comes naturally, since we use different words to express the same concepts. For instance, “smart”, “wise” and “intelligent” can mean the same thing in a certain context.
Rule-Based Substitution
The simplest mechanism is to replace words with synonyms taken from a hand-crafted system. One such example is WordNet, a manually curated lexical database that links words with their synonyms.
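As a rough sketch, this is how a WordNet-based substitution could look in Python with the NLTK library (the example sentence and the word to replace are chosen purely for illustration):

```python
# A minimal sketch of rule-based synonym substitution with WordNet (via NLTK).
# Assumes: pip install nltk, plus the "wordnet" corpus downloaded once.
import random

import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)

def synonym_substitute(sentence, word):
    """Replace `word` in `sentence` with a randomly chosen WordNet synonym."""
    synonyms = {
        lemma.name().replace("_", " ")
        for synset in wordnet.synsets(word)
        for lemma in synset.lemmas()
        if lemma.name().lower() != word.lower()
    }
    if not synonyms:
        return sentence  # no synonym found, keep the original sentence
    return sentence.replace(word, random.choice(sorted(synonyms)))

print(synonym_substitute("She gave a smart answer", "smart"))
# e.g. "She gave a chic answer" -- which also hints at the context problem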
This strategy has the clear edge of being lightweight and easy to implement. With little effort, we can considerably increase the amount of data and make it more diverse. However, the technique has its limitations. First of all, it does not take into consideration the context in which the words appear. Second, it is agnostic to the scope of the dataset. To mitigate these effects, we need context-aware techniques, which we cover in the next section.
ML-Based Substitution
Instead of a simple substitution that is independent of the context, we can act more intelligently. With the help of Machine Learning models, we can learn and embed the context in the text replacement process. The most popular approaches are substitution based on word embeddings and mask-based substitution using BERT-like models.
Word Embeddings Substitution
Word Embedding models like Word2Vec, GloVe, and FastText propelled the feature extraction revolution in NLP. With their help, we can transform words into a multidimensional vector representation and perform different mathematical operations on top of them. The learned vector representations are inferred from the context in which the words appear.
Comparing vectors is a piece of cake. We can use any of the pre-trained word embedding models and substitute the words with their nearest neighbors in the vector space.
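A minimal sketch of this idea, assuming the gensim library and one of its downloadable pre-trained GloVe models:

```python
# A minimal sketch of embedding-based substitution using gensim's pre-trained vectors.
# Assumes: pip install gensim (the model is downloaded on first use).
import gensim.downloader as api

# Load pre-trained GloVe vectors (100-dimensional, trained on Wikipedia + Gigaword).
vectors = api.load("glove-wiki-gigaword-100")

def embedding_substitutes(word, topn=5):
    """Return the `topn` nearest neighbors of `word` in the embedding space."""
    return [neighbor for neighbor, _score in vectors.most_similar(word, topn=topn)]

print(embedding_substitutes("smart"))
# Nearby words such as "intelligent" or "clever" can serve as replacements.
```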
Mask-Based Substitution
The Transformer-based model BERT (and its variants like ALBERT and RoBERTa) utterly changed the way we train and deploy NLP systems. It is a pre-trained model designed to be fine-tuned on a multitude of downstream tasks, for instance, question answering.
The pre-training is completely unsupervised and conditioned on the context in which the words appear: words are masked and their values are predicted from the surrounding context. We can use exactly this feature to augment our NLP datasets. All we have to do is mask some words in our sentences and let the BERT-like model predict the most probable replacements.
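A minimal sketch of this idea with the Hugging Face transformers library (the model choice and the example sentence are just illustrative assumptions):

```python
# A minimal sketch of mask-based substitution with a BERT-like model
# through the Hugging Face `transformers` fill-mask pipeline.
# Assumes: pip install transformers torch.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

sentence = "The assistant gave a [MASK] answer to the question."
for candidate in unmasker(sentence, top_k=3):
    # Each candidate is the full sentence with the mask replaced by a probable word.
    print(candidate["sequence"], f'(score: {candidate["score"]:.3f})')
```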
The clear advantage of all text substitution techniques is that they are simple and rely on models that are already proven to work well. However, in order to fully automate the data augmentation process, we need an automated way of selecting which words to substitute. This usually requires additional heuristics, with no guarantee of their efficiency.
Back Translation
Back translation is an intriguing byproduct of automatic machine translation. A sentence in one language can be translated into a set of similar yet different sentences in another language. Translating these sentences back to the original language can then increase the diversity of our original dataset.
For example, we can translate an English sentence into its corresponding French, German and Russian equivalents. If the back translation to English leads to a different sentence, we include it in the dataset.
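A rough sketch of an English-French-English round trip, assuming the publicly available MarianMT models served through the transformers library:

```python
# A minimal sketch of back translation (English -> French -> English)
# using publicly available MarianMT models via the `transformers` pipelines.
# Assumes: pip install transformers torch sentencepiece.
from transformers import pipeline

en_to_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
fr_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

def back_translate(sentence):
    """Translate a sentence to French and back; return the round-trip result."""
    french = en_to_fr(sentence)[0]["translation_text"]
    return fr_to_en(french)[0]["translation_text"]

original = "Data augmentation helps us do more with less."
augmented = back_translate(original)
if augmented != original:
    print("New paraphrase:", augmented)
```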
Text Generation
Generative systems, or language models, have a mind-boggling ability to generate text when given only a few words. We can make limited use of these text generation models to augment our data.
We can tackle this problem from many angles. One example is to fine-tune a pre-trained GPT model on our dataset and let it generate text. Another possibility is to train a Generative Adversarial Network (GAN) to artificially synthesize sentences similar to the ones in our dataset.
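As an illustrative sketch, this is how generation with a pre-trained GPT-2 model could look through the transformers library; the prompt and sampling settings are assumptions, and the fine-tuning step on our own dataset is left out:

```python
# A minimal sketch of augmentation by text generation with a pre-trained GPT-2 model
# (without the fine-tuning step, which would be done separately on the target dataset).
# Assumes: pip install transformers torch.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "The customer support experience was"
for output in generator(prompt, max_length=30, num_return_sequences=3, do_sample=True):
    # Each generated continuation is a candidate synthetic sentence.
    print(output["generated_text"])
```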
Although this set of techniques sounds very promising and could potentially be a game-changer in data augmentation for NLP, it is not so trivial to make it work. These are usually neural networks with billions of parameters that require tons of computational resources and manual tuning of many hyperparameter knobs. Moreover, there is high uncertainty about the direction in which the generated sentence might continue.
Conclusion
In this blog post, we raised an important question about data augmentation in NLP. It is a process in which we increase our data, based on the existing data, without losing quality. This is not trivial to achieve because of the nature of written language.
One word, even one letter, can change the meaning of everything.
Nevertheless, there are a couple of interesting techniques like word substitution, back translation, and automatic text generation.
Despite all the effort and progress made in this area, the general conclusion is that there is still a gap to be filled before data augmentation in NLP becomes truly reliable and efficient.