SKILLBERT: "SKILLING" THE BERT TO CLASSIFY SKILLS

Abstract

In the age of digital recruitment, job posts can attract a large number of applications, and screening them manually can become a very tedious task. These recruitment records are stored as tables in our recruitment database (Electronic Recruitment Records, referred to as ERRs). We have released a de-identified ERR dataset to the public domain 1 . We also propose a BERT-based model, SkillBERT, whose embeddings are used as features for classifying the skills present in the ERRs into groups referred to as "competency groups". A competency group is a group of similar skills, and it is used as the matching criterion (instead of matching on individual skills) for finding the overlap between a candidate's skills and a job's requirements. This proxy match takes advantage of BERT's ability to derive meaning from the structure of competency groups present in the skill dataset. In our experiments, SkillBERT, which is trained from scratch on the skills present in job requisitions, outperforms both pre-trained BERT (Devlin et al., 2019) and Word2Vec (Mikolov et al., 2013). We have also explored K-means clustering (Lloyd, 1982) and spectral clustering (Chung, 1997) on SkillBERT embeddings to generate cluster-based features; both algorithms provide similar performance benefits. Lastly, we have experimented with different machine learning algorithms, such as Random Forest (Breiman, 2001) and XGBoost (Chen & Guestrin, 2016), and a deep learning algorithm, Bi-LSTM (Schuster & Paliwal, 1997; Hochreiter & Schmidhuber, 1997). We did not observe a significant performance difference among these algorithms, although XGBoost and Bi-LSTM perform slightly better than Random Forest. The features created using SkillBERT are the most predictive in the classification task, which demonstrates that SkillBERT is able to capture information about the skills' ontology from the data. We have made the source code and the trained models 1 of our experiments publicly available.

1. INTRODUCTION

A competency group can be thought of as a group of similar skills required for success in a job. For example, skills such as Apache Hadoop and Apache Pig represent competency in Big Data analysis, while HTML and Javascript are part of the Front-end competency. Classifying skills into the right competency groups can help in gauging a candidate's job interest and in automating the recruitment process. Recently, several contextual word embedding models have been explored on various domain-specific datasets, but no work has been done on exploring those models on job-skill-specific datasets.
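As an illustrative sketch of this idea, the snippet below assigns skills to competency groups by nearest-centroid matching over embedding vectors. The 3-dimensional vectors, centroid values, and group names here are toy stand-ins invented for illustration; a real system would use SkillBERT embeddings and centroids derived from labeled or clustered data.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical competency-group centroids (e.g., the mean embedding
# of the skills already known to belong to each group).
centroids = {
    "Big Data": [0.9, 0.1, 0.0],
    "Front-end": [0.1, 0.9, 0.1],
}

# Toy skill embeddings standing in for real SkillBERT vectors.
skills = {
    "Apache Hadoop": [0.85, 0.15, 0.05],
    "Javascript": [0.05, 0.95, 0.10],
}

def assign_group(embedding):
    # A skill is assigned to the competency group whose centroid
    # is most similar to its embedding.
    return max(centroids, key=lambda g: cosine(embedding, centroids[g]))

assignments = {skill: assign_group(emb) for skill, emb in skills.items()}
print(assignments)
```

Matching candidates to jobs at the level of these group assignments, rather than on exact skill strings, is what makes the proxy match robust to surface-level skill variation.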



1 https://www.dropbox.com/s/wcg8kbq5btl4gm0/code_data_pickle_files.zip?dl=0



Domains like medicine and law have already explored these models. Lee et al. (2019), with their BioBERT model, trained BERT on a large biomedical corpus. They found that, without changing the architecture much across tasks, BioBERT beats BERT and previous state-of-the-art models on several biomedical text mining tasks by a large margin. Alsentzer et al. (2019) trained the publicly released BERT-Base and BioBERT-finetuned models on clinical notes and discharge summaries. They have shown that the resulting embeddings are superior to general-domain or BioBERT-specific embeddings on two well-established clinical NER tasks (i2b2 2010 (Uzuner et al., 2011) and i2b2 2012 (Sun et al., 2013a;b)) and one medical natural language inference task (MedNLI (Romanov & Shivade, 2018)).

