The Problem
Searching for a job can be dangerous, especially for women. Some job listings are created by ill-intentioned people who use them to harass women. This behavior is especially rampant when apps offer free credits to every user. We saw this first-hand: when we stopped offering free job listings at 24saatteiş, the number of malicious users dropped sharply. At 24saatteiş, one of our main priorities is to create a safe environment for both job candidates and companies. In the past, we regulated this environment by screening every company that registered and by blocking companies from the app when we received a complaint. This is a passive method: because you wait for complaints from users, the harm is already done by the time you detect these companies. It also misses companies whose victims were harassed but did not complain. Last but not least, it is not scalable. As the app’s popularity grew, this passive method became too much of a burden on our moderation team. Because of these downsides, we chose a proactive approach to moderation, in which we identify malicious companies before they act.
The Challenge
We had two main challenges for this project: context and data quality.
Context matters…
When it comes to building a model to detect malicious behaviour, context matters. On Instagram, commenting “You are so pretty” is not malicious, but in a job search app it is. This is why using data from an external source such as Kaggle was not an option for us. Instead, we first went through every banned user on our app and decided which of them were malicious. We then picked a random sample of companies and screened them again one by one.
90% of our time was spent on data cleaning
The second challenge was that the data was extremely dirty. Almost 80% of the malicious companies were created by the same people under different accounts. So, even though they appeared to be different companies, they were the same people. The only way to detect those companies was to look at how they wrote their messages and infer the author from the writing style. Unless we removed or grouped these duplicate accounts, our model would overfit. To solve this, we came up with an algorithm based on graph theory. We first created a correlation matrix of all labeled companies using their text data. We then represented this matrix as a graph in which each node was a company, and drew an edge between two companies if they had a high correlation. By extracting clusters from this graph, we successfully identified multiple accounts that were created by the same people.
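The clustering step described above can be sketched roughly as follows. This is a minimal illustration, not our production code: it assumes cosine similarity on word counts as the pairwise score, a hypothetical threshold of 0.8 for drawing an edge, and connected components as the way of extracting clusters. The function names are made up for the example.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_accounts(texts: dict, threshold: float = 0.8) -> list:
    """Group company ids whose message text is similar above `threshold`.

    `texts` maps a company id to all of its message text. The threshold
    value is an assumption chosen for illustration only.
    """
    vectors = {cid: Counter(text.lower().split()) for cid, text in texts.items()}
    parent = {cid: cid for cid in texts}  # union-find forest, one node per company

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Draw an edge (merge two components) when a pair is similar enough.
    for a, b in combinations(texts, 2):
        if cosine_similarity(vectors[a], vectors[b]) >= threshold:
            parent[find(a)] = find(b)

    # Each connected component is treated as one underlying author.
    clusters = {}
    for cid in texts:
        clusters.setdefault(find(cid), set()).add(cid)
    return list(clusters.values())
```

Each resulting cluster can then be collapsed into a single training example, or kept entirely on one side of the train/test split, so that the model does not simply memorize one person's writing style.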
Finally, the text data contained many entities that are not part of natural speech, such as first names, phone numbers, email addresses and dates. We wrote entity recognition algorithms from scratch to recognize these entities in the text.
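To give an idea of what this kind of entity tagging looks like, here is a small rule-based sketch. The regular expressions, the tag names (EMAIL, DATE, PHONE, FIRSTNAME) and the tiny name list are all assumptions made for the example; our actual rules are more involved.

```python
import re

# Illustrative patterns only; real phone, date and email formats vary widely.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
DATE_RE = re.compile(r"\b\d{1,2}[./-]\d{1,2}[./-]\d{2,4}\b")
PHONE_RE = re.compile(r"(?<!\d)\+?\d(?:[\s-]?\d){9,12}(?!\d)")
WORD_RE = re.compile(r"\b\w+\b")

# Hypothetical lookup list; a real one would hold thousands of first names.
FIRST_NAMES = {"ayse", "mehmet"}


def mask_entities(text: str) -> str:
    """Replace emails, dates, phone numbers and first names with capitalized tags."""
    text = EMAIL_RE.sub("EMAIL", text)
    text = DATE_RE.sub("DATE", text)    # dates before phones, both are digit runs
    text = PHONE_RE.sub("PHONE", text)
    # Tag any remaining word that appears in the first-name list.
    return WORD_RE.sub(
        lambda m: "FIRSTNAME" if m.group().lower() in FIRST_NAMES else m.group(),
        text,
    )
```

Replacing each entity with a single capitalized tag keeps the signal ("this company shared a phone number") while discarding the specific value, which would otherwise look like a unique word to the model.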
The Solution
To detect malicious companies among the hundreds of companies signing up to our app, before we receive any complaint from job seekers, we decided to analyze the data and build a machine learning model. In our experience, malicious companies behave in very distinct patterns, and we thought we could exploit those patterns to detect them. So how are they different?
1. They talk too much…
The biggest factor that separates good companies from bad ones is that malicious companies have extremely long conversations with job seekers. We looked at the length of each company’s longest conversation and found a strong relationship between being a babbler and being malicious.
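As a rough sketch, the feature can be computed like this. It assumes that conversation length is measured as the number of messages a company sent in a conversation and that each message is available as a (company_id, conversation_id) pair; both are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def longest_conversation_length(messages):
    """Return {company_id: message count of that company's longest conversation}.

    `messages` is an iterable of (company_id, conversation_id) pairs,
    one pair per message sent by the company.
    """
    per_conversation = Counter(messages)  # messages per (company, conversation)
    longest = defaultdict(int)
    for (company_id, _conversation_id), count in per_conversation.items():
        longest[company_id] = max(longest[company_id], count)
    return dict(longest)
```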
2. …and when they do, they talk differently
Apart from app usage, we also did some text mining on the messages sent by malicious and decent companies. Here are the word clouds of the words most commonly used by each group. Notice the capitalized words in the word clouds: those are entities that were tagged during the data cleaning process. Malicious companies tend to use adjectives a lot more, and they address workers in informal language.
3. As a result, they are much more likely to get blocked
Our app has a blocking feature that lets job seekers and companies block each other, so that the two parties never interact again. We observed that 74% of malicious companies were blocked at least once, compared with only 13% of decent companies.
The model
After cleaning the dataset, we created a random forest model that predicts whether a company is malicious or not. We tried other methods, such as SVM with linear, radial and polynomial kernels, naive Bayes, logistic regression and k-nearest neighbors, but the random forest yielded the highest accuracy. Below is the confusion matrix of the predictions on the test data. This table shows that our model’s predictions are correct 79.49% of the time, and when we predict a company as malicious, we are correct 80% of the time. Our model also has an equal number of Type-1 and Type-2 errors, meaning our predictions are not biased toward either class.
Since failing to flag a malicious company is more harmful than wrongly flagging a decent one, we paid extra attention to the sensitivity of our model. Thankfully, its sensitivity is satisfactory, and 79.49% accuracy is good enough for us.
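For readers less familiar with these terms, the sketch below shows how accuracy, precision and sensitivity are derived from the four cells of a confusion matrix. The counts in the usage example are invented for illustration: they were chosen only to be consistent with the percentages quoted above and are not the original table.

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Derive common metrics from confusion-matrix counts.

    tp: malicious companies predicted malicious
    fp: decent companies predicted malicious (Type-1 error)
    fn: malicious companies predicted decent (Type-2 error)
    tn: decent companies predicted decent
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),    # how often a "malicious" prediction is right
        "sensitivity": tp / (tp + fn),  # share of malicious companies we catch
        "specificity": tn / (tn + fp),  # share of decent companies we leave alone
    }


# Hypothetical counts, NOT the real test set.
metrics = confusion_metrics(tp=16, fp=4, fn=4, tn=15)
print({name: round(value, 4) for name, value in metrics.items()})
```

Note that whenever the number of Type-1 and Type-2 errors is equal, precision and sensitivity coincide, because both divide the same number of true positives by the same denominator.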
The Impact
After creating the model, we built an API with a predict function. This function predicts whether each company that sent a message in a given interval is malicious. We also built a Slack bot that calls our API and requests predictions every 15 minutes. After getting the response from the API, it sends a message to our moderation channel on Slack. The 24saatteiş moderation team screens every company that was predicted to be malicious and blocks the company from using the app if necessary. So far, the model has received positive feedback from the moderation team and has enabled us to detect malicious companies almost instantly. Thanks to our team’s great effort, 24saatteiş continues to be the safest place to search for a job online.
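The polling bot described above could look roughly like the sketch below. It is not our actual bot: the shape of the prediction response (company_id, malicious, score), the message wording and the function names are assumptions. The only external fact relied on is that a Slack incoming webhook accepts a JSON POST with a "text" field.

```python
import json
import urllib.request
from datetime import timedelta

INTERVAL = timedelta(minutes=15)  # polling window, as described in the text


def format_alert(predictions):
    """Build the Slack message, or return None when nothing was flagged."""
    flagged = [p for p in predictions if p["malicious"]]
    if not flagged:
        return None
    lines = [f"- company {p['company_id']} (score {p['score']:.2f})" for p in flagged]
    return "Companies flagged for review:\n" + "\n".join(lines)


def post_to_slack(webhook_url, text):
    """Send `text` to a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request, timeout=10).close()


def run_once(fetch_predictions, send, now):
    """One polling cycle: fetch predictions for the last window and notify.

    `fetch_predictions(start, end)` stands in for the call to the prediction
    API and `send(text)` for the Slack call; both are injected so the cycle
    can be exercised without any network access.
    """
    text = format_alert(fetch_predictions(now - INTERVAL, now))
    if text is not None:
        send(text)
    return text
```

In a deployment, `run_once` would be triggered every 15 minutes by a scheduler such as cron, with `send` bound to `post_to_slack` and the moderation channel's webhook URL.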
Ege Taşlıçukur, Data Scientist