I recently participated in a weekend-long data science hackathon, titled ‘The Smart Recruits’. Organized by the amazing folks at Analytics Vidhya, it saw some serious competition. Although my performance can be classified as decent at best (47 out of 379 participants), it was among the more satisfying ones I have participated in on both AV (profile) and Kaggle (profile) over the last few months. Thus, I decided it might be worthwhile to try and share some insights as a data science autodidact.
The competition required us to use historical data to create a model to help an organization pick out better recruits. The evaluation metric to be used for judging the predictions was AUC (area under the ROC curve). You can read the problem statement on the competition page.
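To make the metric concrete, here is a minimal base-R sketch of AUC computed through the rank-sum (Mann-Whitney) identity. The helper name `auc_rank` and the toy labels and scores are assumptions for illustration; they are not taken from the competition data or code.

```r
# AUC via the rank-sum (Mann-Whitney U) identity.
# `auc_rank` is an illustrative helper, not part of the competition code.
auc_rank <- function(labels, scores) {
  stopifnot(length(labels) == length(scores), all(labels %in% c(0, 1)))
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  stopifnot(n_pos > 0, n_neg > 0)
  r <- rank(scores)  # average ranks, so tied scores are handled
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example (made-up values)
labels <- c(0, 0, 1, 1)
scores <- c(0.10, 0.40, 0.35, 0.80)
auc_rank(labels, scores)  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives, which is why only the order of the predicted scores matters, not their scale.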
The hackathon itself was a weekend-long sprint and the data reflected this. The training set (1.2 MB) comprised 9,527 observations of 23 variables, including the target variable. The test set (~600 KB) comprised 5,045 observations of 22 predictor variables.
The Science behind it all
My code for the competition can be found here. The code is in R although the description of my approach towards the problem in this post is presented in a language-agnostic manner.
As for the choice of language, I used R because I am more proficient in it than in Python, my other language of choice for data-intensive work. That mattered given the sprint nature of the competition. I also find R better suited than Python to data manipulation and visualization, and I prefer Python when dabbling in deep learning or working with image and text data. Then again, that’s just my opinion.
Recommended read: R vs Python for Data Science: The Winner is …
I spent some time exploring the data and the nature of the variables through statistical summaries and visualizations. Getting to know your data is extremely important, and throwing it into a model straightaway without doing so is a really bad approach. That can’t be stressed enough.
That being said, I didn’t spend as much time on this part as one normally would when participating in a month-long Kaggle competition or when working with real-life data. It’s a sprint after all.
Recommended read: A Comprehensive Guide to Data Exploration
If I were forced to choose just one significant lesson from participating in such competitions, it would be the importance of cross-validation.
So the next thing I did after playing around with data for a while was to try and set up a good CV framework. There are all sorts of complicated frameworks that are used by experienced campaigners; peer into an active Kaggle forum and you will know.
But k-fold cross-validation, which I used, is in general a good enough start for most competitions. The decision on how to perform the split is critical. Random splits might be good enough at times. At other times the classes (0/1) are imbalanced, so you might need stratified sampling, and sometimes time-based splits (month, quarter, year, etc.) are required.
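As a rough illustration of stratified k-fold splitting, the base-R sketch below assigns rows to folds class by class so that each fold keeps a similar 0/1 ratio. The helper `stratified_folds` and the toy target are assumptions made for this example, not the framework used in the competition code.

```r
# Assign each row to one of k folds, keeping the 0/1 ratio similar in every fold.
# `stratified_folds` is an illustrative helper; the toy target below is made up.
stratified_folds <- function(y, k = 5, seed = 42) {
  set.seed(seed)
  fold <- integer(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    # Shuffle the rows of this class, then deal them out to folds 1..k in turn
    idx <- idx[sample.int(length(idx))]
    fold[idx] <- rep_len(seq_len(k), length(idx))
  }
  fold
}

y <- c(rep(0, 80), rep(1, 20))   # toy, imbalanced target
fold <- stratified_folds(y, k = 5)
print(table(fold, y))            # every fold holds 16 zeros and 4 ones

for (i in 1:5) {
  train_idx <- which(fold != i)  # rows used to fit the model
  valid_idx <- which(fold == i)  # held-out rows used to score it
  # fit on train_idx, compute AUC on valid_idx, then average over the folds
}
```

For a time-based split, the same loop applies with `fold` derived from a date column instead of a random shuffle.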
Recommended read: Improve Your Model Performance using Cross Validation
As a beginner, I used to stumble across the term ‘feature engineering’ quite a lot and was unable to find a good resource on it straightaway. Over time, I have learnt what it represents and why it’s an indispensable part of any machine learning problem. A quote from this gem of an article, Discover Feature Engineering, sums it up nicely:
Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.
The bulk of my time was spent on this part. Briefly, I encoded the categorical variables (such as Gender, Occupation, Qualification), imputed missing/NA values, created new variables and removed some variables.
The encoding part is fairly straightforward; a glance through the code and one can understand the process.
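For readers who want a picture of what encoding looks like, here is a small base-R sketch showing two common options: integer (label) encoding and one-hot encoding. The column names echo the variables mentioned above, but the values and the derived column names are made up for illustration.

```r
# Toy data; the column names mirror the post, the values are made up.
df <- data.frame(
  Gender     = factor(c("M", "F", "F", "M")),
  Occupation = factor(c("Salaried", "Business", "Others", "Salaried"))
)

# Integer (label) encoding: each level becomes its position in levels().
# Levels are sorted alphabetically here, so F = 1 and M = 2.
df$Gender_int <- as.integer(df$Gender)

# One-hot encoding: one 0/1 column per level ("- 1" drops the intercept column).
occ_onehot <- model.matrix(~ Occupation - 1, data = df)
print(colnames(occ_onehot))
print(occ_onehot)
```

Integer encoding keeps the data compact but implies an order between categories; one-hot encoding avoids that at the cost of more columns.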
Dealing with missing values is tricky for beginners and experienced campaigners alike. There are various methodologies one might employ. You can do away with incomplete observations altogether, but that is rarely a good idea, and in a case like ours, where we don’t have a lot of data points, it’s a very bad approach. Another common approach is to fill in the mean or median for numeric variables and the mode for categorical ones. That is what I used. There are also more sophisticated algorithms for imputation. MICE had given me good results in the past, but after trying it out, I chose not to use it in the final model.
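A minimal base-R sketch of simple imputation follows: the median for numeric columns and the most frequent value for everything else. The helper `impute_simple` and the toy columns are hypothetical and only illustrate the idea; they are not lifted from the competition script.

```r
# Median for numeric columns, most frequent value (mode) for all other columns.
# `impute_simple` is an illustrative helper; the toy data is made up.
impute_simple <- function(df) {
  for (col in names(df)) {
    x <- df[[col]]
    if (!anyNA(x)) next
    if (is.numeric(x)) {
      x[is.na(x)] <- median(x, na.rm = TRUE)
    } else {
      counts <- table(x)  # table() ignores NA by default
      x[is.na(x)] <- names(counts)[which.max(counts)]
    }
    df[[col]] <- x
  }
  df
}

toy <- data.frame(
  score  = c(2, NA, 4, 10),
  Gender = c("M", NA, "F", "M"),
  stringsAsFactors = FALSE
)
imputed <- impute_simple(toy)
print(imputed)  # score[2] becomes 4 (median), Gender[2] becomes "M" (mode)
```

In a competition setting it is usually safer to compute the fill values on the training set and reuse them on the test set, so both are treated consistently.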
Creating useful new variables was key in this competition, and I managed to come up with a few. For example, I split Applicant_DOB into three separate columns for day, month and year, and then used the year column to create Applicant_Age.
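The date split can be sketched in base R as below. The date format ("YYYY-MM-DD"), the new column names other than Applicant_Age, and the reference year are all assumptions for illustration; check the actual format in the data and pick a reference year that suits it.

```r
# Assumes Applicant_DOB is stored as "YYYY-MM-DD" text; adjust `format` otherwise.
# The dates below are made up.
df <- data.frame(
  Applicant_DOB = c("1985-03-14", "1990-11-02", NA),
  stringsAsFactors = FALSE
)

dob <- as.Date(df$Applicant_DOB, format = "%Y-%m-%d")
df$DOB_Day   <- as.integer(format(dob, "%d"))
df$DOB_Month <- as.integer(format(dob, "%m"))
df$DOB_Year  <- as.integer(format(dob, "%Y"))

ref_year <- 2016L  # placeholder reference year, an assumption for this sketch
df$Applicant_Age <- ref_year - df$DOB_Year
print(df)          # missing dates simply propagate as NA
```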
Recommended read: Discover Feature Engineering, How to Engineer Features and How to Get Good at It
The final model employs XGBoost, one of the most popular algorithms in data science competitions. I also tried out a couple of models from the H2O package but eventually went with XGBoost since it performed better both locally and on the public leaderboard.
Simply put, hyperparameters can be thought of as cogs which can be turned to fine-tune the machine that is your algorithm. In the case of XGBoost, nrounds (number of iterations performed during training) and max_depth (maximum depth of a tree created during training) are examples of hyperparameters.
There are automated methods such as GridSearch and RandomizedSearch which can be used. I used a manual approach, as described here by Davut Polat, a Kaggle master.
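To give a feel for what tuning involves, here is a small sketch of a hand-rolled grid search using `xgb.cv` from the xgboost R package. The synthetic data, the grid values and the fixed settings are assumptions chosen to keep the example quick; this is not the manual procedure referenced above, nor the settings from my final model.

```r
library(xgboost)

# Synthetic binary-classification data (made up for this sketch)
set.seed(1)
n <- 500
X <- matrix(rnorm(n * 5), ncol = 5)
y <- as.numeric(X[, 1] + 0.5 * X[, 2] + rnorm(n) > 0)
dtrain <- xgb.DMatrix(data = X, label = y)

# Small illustrative grid over two hyperparameters
grid <- expand.grid(max_depth = c(2, 4, 6), eta = c(0.05, 0.1))
grid$best_auc <- NA_real_
grid$best_nrounds <- NA_integer_

for (i in seq_len(nrow(grid))) {
  params <- list(
    objective   = "binary:logistic",
    eval_metric = "auc",
    max_depth   = grid$max_depth[i],
    eta         = grid$eta[i]
  )
  cv <- xgb.cv(params = params, data = dtrain,
               nrounds = 100, nfold = 5, verbose = 0)
  auc <- cv$evaluation_log$test_auc_mean  # mean held-out AUC per boosting round
  grid$best_auc[i] <- max(auc)
  grid$best_nrounds[i] <- which.max(auc)  # round at which the CV AUC peaked
}

print(grid[order(-grid$best_auc), ])      # best combination first
```

The round with the highest cross-validated AUC gives a reasonable starting value for nrounds at each max_depth and eta setting.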
Recommended read: How to Evaluate Machine Learning Models: Hyperparameter Tuning
Solving a problem such as this takes time and perseverance, among other things. Analytics Vidhya’s hackathons have progressively become better in terms of the quality of competition on offer. Having Kaggle grandmasters and masters (including the eventual winner, Rohan Rao) in the discussion forums, and the competition itself, helped.
It’s fun, so if you are looking to get your hands dirty in a data science competition, don’t think too much. Dive right in. Even if it seems overwhelming at first, you will end up having fun and learning something along the way.
The code is available as a GitHub repo.
If you read and liked the article, sharing it would be a good next step.
Drop me a mail, or hit me up on Twitter or Quora in case you want to get in touch.
5 thoughts on “Data Science Competitions 101: Anatomy and Approach”
Nice post! You may find that encoding categorical variables as single columns of integers imposes an implied ‘order’ to the categories that you don’t intend. With as few categories as this data had, something like one-hot encoding (aka dummy coding) may yield better results.
I know about one-hot encoding but hadn’t thought about the ‘implied order’ bit. I will keep this in mind. Thanks for the tip!
Thanks for sharing your code and approach. I am new to R as well as modeling. Your code really helped me understand the problem better.
Two observations –
1. Converting factors to numerics is not a good idea.
Example: Assigning Female = 1 , Male=2 means male value > female value which does not make any sense here.
Dummy variables creation for all categorical variables is the right way (also known as one-hot encoding).
2. Found one minor error in your R script.
In the very last section where model is trained on entire train data –
clf <- xgb.train( params = param,
                  data = d_train,
                  nrounds = 150,
                  verbose = 2,
                  watchlist = watchlist )
In parameters of xgb.train() function, data should be 'dtrain' instead of 'd_train'. (d_train in your case is referring to cv train data)