The war on shady texts: how to search for suspicious SMS content
At Sinch, we want to make it easy to use our SMS services. It should be quick to sign up and get started, but at the same time, we’re super alert for any suspicious activity. We’re constantly working to weed out spam and fraudulent traffic - but what’s the best way to do that?
This blog is part one in a series describing our message labeling journey: how we hunt down suspicious traffic and protect our clients and their end users.
Let’s dive in!
First, we analyzed our SMS traffic by pulling similar messages together using unsupervised machine learning from a research area known as topic modeling. We tried two models: Latent Dirichlet Allocation (LDA), which you can read more about here, and the Biterm Topic Model (BTM).
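To make the idea concrete, here's a minimal sketch of topic-modeling a handful of messages with LDA. This uses scikit-learn's `LatentDirichletAllocation` purely for illustration (the messages, topic count, and library choice are ours, not necessarily what was used in production):

```python
# Minimal LDA sketch: cluster toy SMS messages into topics.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

messages = [
    "Your verification code is 123456",
    "Use code 987654 to log in",
    "50% off all shoes this weekend only",
    "Flash sale: buy one get one free today",
]

# LDA works on word counts, so vectorize the messages first.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(messages)

# Fit a two-topic model; fit_transform returns a per-message
# distribution over topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Assign each message to its dominant topic for analysis.
dominant = doc_topics.argmax(axis=1)
```

With real traffic, you would inspect the top words per topic to see which themes (marketing, verification codes, and so on) each cluster represents.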
LDA was our first choice since it was the first popular topic model. But as LDA doesn’t perform too well with shorter text messages, we tried two different variants. The first was the LDA model as is, with one text message counting as one document. For the second, we pre-processed messages, joining several together so long as they came from the same sender.
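The second variant's pre-processing step is simple enough to sketch in a few lines. The sender IDs and messages below are invented; the point is just the join-per-sender logic:

```python
# Toy sketch of the second LDA variant: concatenate messages from the
# same sender into a single, longer document.
from collections import defaultdict

def join_by_sender(messages):
    """Group (sender, text) pairs into one document per sender."""
    docs = defaultdict(list)
    for sender, text in messages:
        docs[sender].append(text)
    return {sender: " ".join(texts) for sender, texts in docs.items()}

sample = [
    ("acme", "Your code is 1234"),
    ("shoeshop", "Weekend sale!"),
    ("acme", "Your code is 9876"),
]
documents = join_by_sender(sample)
# documents["acme"] == "Your code is 1234 Your code is 9876"
```

The longer per-sender documents give LDA more word co-occurrences to work with, which is exactly what short individual messages lack.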
We found that BTM could be a possible solution to LDA’s shorter text performance issues, so we included this model in our initial investigations. All three models create groups or clusters of messages, which we could then analyze to see which topics came up most often.
Though they all highlighted common topics like marketing or verification codes, we couldn’t find the specific topics we expected. Why? Because they didn’t come up very often. We were looking for content aimed at mature audiences or marketing drugs, among other things. Even when we increased the number of topics in our search, the existing groups just split into subtopics instead of the rare topics surfacing.
In the end, we had to sample and create a textual dataset using our internal data, which we then labeled. That gave us a dataset of roughly 6,000 messages. We tagged each text message with several labels, turning the text classification problem into a multilabel classification problem that included 15 different labels.
Most machine learning models only work with numbers, not text, so we needed to encode the messages into vectors (or embeddings) before training separate machine learning models. In general, we decided to split the machine learning model into two parts: the embedder and the classifier. The embedder turns text into vectors, and we decided to have a separate classifier for each label, using the vectors from the embedder as input. This approach makes it easier to add new labels. It also gives us the flexibility to use different classifier models for different labels.
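The split described above can be sketched schematically. The embedder and classifiers below are trivial stand-ins (message length and digit count as a fake "embedding"), not the production models; the structure is what matters:

```python
# Schematic of the two-part design: one shared embedder, one independent
# classifier per label. Adding a label means adding only a classifier.
from typing import Callable, Dict, List

class LabelPipeline:
    def __init__(self, embed: Callable[[str], List[float]]):
        self.embed = embed
        self.classifiers: Dict[str, Callable[[List[float]], bool]] = {}

    def add_label(self, name: str, classifier) -> None:
        # New labels reuse the same embedder; only the classifier is new.
        self.classifiers[name] = classifier

    def predict(self, message: str) -> Dict[str, bool]:
        vector = self.embed(message)
        return {name: clf(vector) for name, clf in self.classifiers.items()}

# Toy embedder: [message length, digit count] as a 2-d "embedding".
embedder = lambda text: [len(text), sum(c.isdigit() for c in text)]

pipeline = LabelPipeline(embedder)
pipeline.add_label("contains_code", lambda v: v[1] >= 4)
result = pipeline.predict("Your verification code is 123456")
```

Because each label has its own classifier over the shared vectors, one label can use XGBoost while another uses logistic regression, without retraining anything else.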
From here on, we tried out different machine learning models. We started with BERT, since it was state-of-the-art on several evaluation tasks when it was released. For the classifier, we used a simple dense neural network on top of BERT’s pooled embeddings and trained the two together, giving us a BERT model fine-tuned for this task.
But even though many people love BERT, it wasn’t a good fit for us this time around. One of its downsides, although often forgotten, is the amount of computation it requires: we could process roughly ten messages per second on a single GPU. Realizing this, we decided not to move forward with this embedding model.
While experimenting, our thoughts turned to the characteristics of our dataset. SMS messages often use abbreviations not seen in the natural language used to train the models. That’s why we decided to use embedding models that could handle unseen words. We found two algorithms that might work: ELMo and fastText. (Not familiar with ELMo? You can read more about it here.)
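The trick that lets fastText embed unseen words is subword information: a word's vector is built from its character n-grams, so an abbreviation can share n-grams with the full word it came from. The toy function below only extracts the n-grams (real fastText hashes each n-gram into a table of learned vectors and sums them):

```python
# Toy illustration of fastText-style character n-grams. fastText pads
# each word with boundary markers before extracting n-grams (its default
# range is 3 to 6 characters).
def char_ngrams(word, n_min=3, n_max=6):
    padded = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.add(padded[i:i + n])
    return grams

# An SMS abbreviation still overlaps with the full word's n-grams,
# so even a never-seen "txt" gets a meaningful vector:
overlap = char_ngrams("txt") & char_ngrams("text")
```

This is also why the subword approach suits SMS traffic: typos and abbreviations degrade gracefully instead of becoming out-of-vocabulary tokens.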
We’d heard of ELMo before; it generated quite a buzz on its release. We were intrigued! As with BERT, we used a dense neural network on top of ELMo as a classifier. Evaluating the models was straightforward enough – our key metric was the F1 score. As a starting point, it gave us a good balance: few enough false positives that we could react to them all, while keeping false negatives low enough not to miss harmful content. The results varied a lot.
For some labels, we reached an F1 score between 0.8 and 0.95, which we considered enough for now. For other labels, we got a score of 0 since there were very few positive samples. With so few positives, the model could reach high accuracy on those labels just by predicting 0 for every message. That wasn’t what we wanted it to learn. On top of that, ELMo’s speed was still slower than we hoped. Given the volume of traffic we process, our solution had to handle at least 1,000 messages per second; benchmarking ELMo, we reached roughly 400 messages per second. So, we decided to see how well fastText could perform.
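A small worked example shows why accuracy misleads on these rare labels while F1 does not. With 1 positive out of 100 messages, a classifier that always predicts 0 looks 99% accurate yet catches nothing:

```python
# Why F1 beats accuracy on imbalanced labels: the "always predict 0"
# classifier gets 99% accuracy but an F1 score of 0.
def f1_score(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives means zero precision and recall
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 1 positive sample out of 100:
y_true = [1] + [0] * 99
y_pred = [0] * 100  # classifier that never flags anything

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# accuracy == 0.99, but f1_score(y_true, y_pred) == 0.0
```

Optimizing F1 instead of accuracy forces the model to actually find the rare positives rather than ignore them.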
With the information from our previous tests, we tried both pre-trained fastText models (available online) and custom fastText embeddings trained on our data. Instructions for this method are here. We used roughly 1 GB of our text data spread over time to train our embeddings. This time, we tried different classifiers, including XGBoost, convolutional neural networks, and logistic regression.
The results were surprisingly good! We were happy with fastText since it kept the F1 score in the 0.8-0.9 range while massively improving the processing speed compared to ELMo. The self-trained embeddings also outperformed the pre-trained models, so we decided to use them. As for the classifiers, we got the best results with XGBoost, so we moved forward with that.
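The final setup — one boosted-tree classifier per label, all sharing the same embeddings — can be sketched as follows. This uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost (whose `XGBClassifier` exposes a near-identical `fit`/`predict` API), and the embeddings and labels are synthetic:

```python
# Sketch of per-label boosted-tree classifiers over shared message
# embeddings. GradientBoostingClassifier stands in for XGBoost here;
# the data is randomly generated for illustration.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 16))  # one 16-d vector per message

# Synthetic binary labels, one array per label name.
labels = {
    "marketing": (embeddings[:, 0] > 0).astype(int),
    "verification": (embeddings[:, 1] > 0).astype(int),
}

# Train one independent classifier per label on the same embeddings.
classifiers = {
    name: GradientBoostingClassifier(random_state=0).fit(embeddings, y)
    for name, y in labels.items()
}

# Scoring a new message yields one 0/1 decision per label.
new_message = rng.normal(size=(1, 16))
predictions = {name: int(clf.predict(new_message)[0])
               for name, clf in classifiers.items()}
```

Because the classifiers are independent, a single message can carry several labels at once, which is exactly the multilabel setup described earlier.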
Overall, choosing and creating our models taught us a lesson: we hopped on the hype train of new models too fast. Even though the NLP (natural language processing) community is moving toward large, expensive pre-trained models, they’re not always the best tool for the job.
If you thought this was a fun ride (hello, fellow nerds!), feel free to join us on the next part of this series, where we’ll discuss other problems we found with our dataset: data drifting and template messages.