Background
What are some of the biggest hurdles that Natural Language Processing / Understanding (NLP/U) practitioners, especially those in industry, have to deal with?
Until a year or so back, it used to be the lack of robust deep-learning architectures - compared to the cousin field of Computer Vision - that could carry out complicated and nuanced classification tasks (such as emotion analysis and sentiment analysis). With the advent of Transformers, and models like BERT, GPT-2, XLNet, etc. built upon that architecture, there now exists a plethora of models to choose from, no matter what task you wish to accomplish. But these models brought with them their own set of challenges, mainly deployment: how do you put such deep models into production and still keep response times under 700 ms? Luckily for us, this question too has been answered over the last few months, with researchers developing methods to distill and / or prune these huge NLP models so as to make them amenable to practical use. Despite all this progress, there is still room for improvement, both in the accuracy metrics and in reducing the space complexity of these models without affecting their performance. This is the story I want to talk about.
Dataset
My team and I, around 6 months back, took our first concrete step towards tackling the specific problem of accurately calculating emotion and sentiment valence scores for written text, along with providing contextually-aware recommendations (and not just synonyms / antonyms) for the words and phrases which the model predicts are most strongly correlated with a particular emotion.
To this end, we started off with collecting data … obviously. This tends to be a laborious task, as any data engineer would recognise, but it is also the most important one: without a thorough understanding of the data, its distribution, its t-SNE plots, etc., one can never be sure about any biases or errors which may have crept into the dataset. For the purposes of tagging, we decided to narrow down to the following label sets: for sentiment - positive, negative, neutral - and for emotion - joy, surprise, anger, fear, sad - based upon certain principles of neuro-marketing. We were lucky enough to find a few publicly available datasets, though not completely tagged, but the majority of our dataset (~0.75 million data points) was collated using scraping code written in-house, followed by multiple data-augmentation techniques. After the customary cleaning-up and pre-processing of the dataset, it was time for the interesting part of our job: training the NLP models!
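For the curious, here is a minimal sketch of one common text-augmentation technique - random synonym replacement via WordNet. The helper below is illustrative only; it is not our in-house scraping or augmentation code, and the example sentence is a made-up placeholder.

```python
# Illustrative sketch of synonym-replacement augmentation via WordNet.
# Not our actual pipeline; requires nltk with the "wordnet" corpus downloaded.
import random
from nltk.corpus import wordnet  # nltk.download("wordnet") beforehand

def augment_synonyms(sentence: str, n_replacements: int = 2) -> str:
    """Replace up to n_replacements words with a random WordNet synonym."""
    words = sentence.split()
    candidates = list(range(len(words)))
    random.shuffle(candidates)
    replaced = 0
    for idx in candidates:
        if replaced >= n_replacements:
            break
        synsets = wordnet.synsets(words[idx])
        lemmas = {l.name().replace("_", " ") for s in synsets for l in s.lemmas()}
        lemmas.discard(words[idx])
        if lemmas:
            words[idx] = random.choice(sorted(lemmas))
            replaced += 1
    return " ".join(words)

print(augment_synonyms("the launch made customers very happy"))
```

In practice, such replacements are combined with other techniques (back-translation, random deletion, etc.) and sanity-checked so that the augmented text still carries the original label.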
Baseline
We started off by testing some traditional Machine Learning (ML) based methods in order to establish a baseline; irrespective of how the baseline performed, the accuracy could only get better from there. As is usual, we began with our good ol' friend - logistic regression. Generally overlooked in the last few years, especially since the resurgence of Deep Learning (DL) methods, it is still a robust method and gives very good results (sometimes even better than more complex ones). We also tried a number of featurisation schemes on our training data - TF-IDF vectors, plus SentencePiece and Punkt tokenisation - whilst using SVMs for the classification tasks. This gave us a fair idea of how these approaches performed on our dataset (a minimal sketch of such a baseline is shown below). Then we moved on to a slightly more advanced classification method - LSTMs. The bi-directional variant, combined with the attention mechanism, gave us some promising initial results; attention proved especially useful in understanding the correlation between words and emotions. But we wanted to dig “deeper.”
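To make that baseline concrete, here is a minimal scikit-learn sketch in the same spirit - TF-IDF features feeding a linear SVM. The sentences and labels are toy placeholders, not our dataset, and the hyper-parameters are untuned assumptions.

```python
# Minimal baseline sketch: TF-IDF features feeding a linear SVM.
# The sentences and labels below are toy placeholders, not our dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

texts = [
    "absolutely loved the experience", "this made me so angry",
    "the parcel arrived on time", "i am scared this will fail",
    "what a delightful surprise", "terrible, never again",
]
labels = ["joy", "anger", "neutral", "fear", "surprise", "anger"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42)

baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
    ("svm", LinearSVC()),
])
baseline.fit(X_train, y_train)
print(classification_report(y_test, baseline.predict(X_test), zero_division=0))
```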
Training & Results
A small drawback of LSTMs is their difficulty in handling longer-range dependencies. As subscribers to the adage - One Model to Rule them All - we were determined to find that one model which could handle all our classification tasks along with the syntax and semantics of sentences. In walks … BERT. For training, we used Stochastic Gradient Descent with Nesterov momentum (Adam didn’t perform as well), and to ensure faster convergence we used the 1-cycle policy. As the model was going to be served from a single K80 GPU machine and we wanted seamless, real-time results, we explored several optimisation techniques: (i) data parallelisation of both the input and the output of the model; and (ii) compressing the model with DistilBERT, which gave us far fewer parameters with similar results. This proved to be a fun engineering challenge, and the end result was something the entire team was happy with, so we decided to put that model into production for v1.0 of our tool / product. To further improve classification, we used label smoothing with a KL-divergence loss. Finally, our model achieves an F1-score of ~0.945 and an AUC of ~0.933 on the classification tasks.
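For readers who want to see roughly what that training recipe looks like, below is a hedged PyTorch / Hugging Face sketch combining SGD with Nesterov momentum, a 1-cycle learning-rate schedule, and label smoothing expressed as a KL-divergence loss on top of DistilBERT. The sentences, labels, and hyper-parameters are placeholders; this is not our production training code.

```python
# Hedged sketch of the recipe described above: SGD + Nesterov momentum,
# OneCycleLR, and label smoothing via KL divergence, on DistilBERT.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

NUM_CLASSES, EPOCHS, SMOOTHING = 5, 3, 0.1  # joy, surprise, anger, fear, sad

# Toy placeholder data, not our dataset
texts = ["what a pleasant surprise", "this is terrifying",
         "i am furious", "such a joyful day"]
labels = torch.tensor([1, 3, 2, 0])

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
train_loader = DataLoader(
    TensorDataset(enc["input_ids"], enc["attention_mask"], labels), batch_size=2)

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=NUM_CLASSES)

# SGD with Nesterov momentum; OneCycleLR drives the learning rate over training
optimizer = torch.optim.SGD(model.parameters(), lr=2e-5, momentum=0.9, nesterov=True)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=2e-5, total_steps=EPOCHS * len(train_loader))

def smoothed_kl_loss(logits, targets, smoothing=SMOOTHING):
    """KL divergence between predictions and a label-smoothed target distribution."""
    n = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    true_dist = torch.full_like(log_probs, smoothing / (n - 1))
    true_dist.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)
    return F.kl_div(log_probs, true_dist, reduction="batchmean")

model.train()
for _ in range(EPOCHS):
    for input_ids, attention_mask, y in train_loader:
        optimizer.zero_grad()
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
        loss = smoothed_kl_loss(logits, y)
        loss.backward()
        optimizer.step()
        scheduler.step()
```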
Future Work
At this point, I must point out the significant help we have received from open-source (OSS) libraries, code and blogs, which we built upon to achieve our goals. In the same spirit, we are working out concrete ways to give back to the community: we endeavour to open-source our datasets and code so that researchers can reproduce our results, and to publish research papers outlining the novel methods used in our work for others to build upon.
As you may have noticed, this blog focused primarily on our English language tool and the NLP research aspect of our work. In an upcoming blog, I will give you an overview of the Hindi tool and what it takes to deploy DL models on the cloud.
So do stay tuned.
Till then, let the emotions flow! Cheers.