What I learned at Kaggle Days Dubai

Anyone interested in analytics or machine learning is certainly aware of Kaggle. Kaggle hosts the world's largest community of data scientists and lets companies run prize-money competitions that data scientists around the world compete in, which has also made it the largest online competition platform. More recently, Kaggle has begun organizing offline meetups around the world.

One such initiative is Kaggle Days. Four Kaggle Days events have been organized so far in various cities around the world, the most recent being in Dubai. The format is a two-day session: presentations, practical workshops, and brainstorming sessions on the first day, followed by an offline competition on the second. As a machine learning enthusiast with intermediate experience in the field, participating in a Kaggle-hosted competition and teaming up with a Kaggle Grandmaster to compete against other Grandmasters was an enjoyable experience in itself. I couldn't reach the top ranks in the competition, but competing and networking with the dozens of Grandmasters and other enthusiasts present during the two-day event boosted my learning and abilities.

My goal was to make the best use of this opportunity, learn as much as I could, and ask the right questions of the Grandmasters present at the event to draw on their wisdom and learn the best ways to approach any data science problem. It was heartwarming to discover how supportive they were as they shared tricks and advice for reaching the top of data science competitions and improving the performance of any machine learning project. In this blog, I'd like to share the insights I gathered during my conversations and the noteworthy points I recorded during their presentations.

Strengthen Your Basic Knowledge

My primary mentor during the offline competition was Yauhen Babakhin. Yauhen is a data scientist at H2O.ai who has worked across a range of domains, including e-commerce, gaming, and banking, and specializes in NLP-related problems. An inspiring personality and one of the youngest Kaggle Grandmasters, he was also the person I got to network with the most. His profile dispelled my misconception that only someone with a doctoral degree can achieve the prestige of being a Grandmaster.

During our conversations, the most significant advice from Yauhen was to strengthen our basic knowledge and build an intuition for various machine learning concepts and algorithms. One does not need to go extremely deep into these concepts or be exceptionally knowledgeable to begin with. As he said, "start learning a few important learning models but get to know how they work!" It is ideal to start with the basics and extend your knowledge along the way by gaining experience through competitions, especially those hosted on Kaggle. For most queries, Yauhen suggests, one simply needs to know what to search for on Google; that skill alone can get us through most problems, even with limited experience relative to our competitors.

Furthermore, Yauhen emphasized how Kaggle single-handedly played a leading role in sharpening his skills. He stressed how challenges pushed him to perform better and learn more. It was such challenges that drove him beyond his existing knowledge and into areas outside his specialization, such as computer vision, said the winner of the $100,000 TGS Salt Identification Challenge. These challenges prompted him to dive into various areas of machine learning, and this is the trick he suggested we use to accelerate our own career growth.

Through this conversation, I learned the importance of going broad. Though Yauhen insisted on selecting problems that cover a broad range of data science topics, he also suggested limiting that breadth to what aligns with our career pursuits, and asking whether we really need to target something we are never going to use. Lastly, the Grandmaster, still in his late 20s, encouraged us to practice with deep learning models, since they let us target a broad set of problems, and to discover the best approaches used by previous winners and combine them in our projects or competition submissions. These approaches can be found in blogs, kernels, and forum discussions.

Remain Persistent

My next detailed interaction was with Abhishek Thakur. The conversation prompted me to ask as many questions as I could, as every suggestion from Abhishek seemed wise and encouraging. One of the rare people to hold two Kaggle Grandmaster titles, in competitions and in discussions, Abhishek is the chief data scientist at boost.ai and once reached rank 3 in Kaggle's global competition rankings. What made his profile even more convincing was his accelerated growth from novice to Grandmaster within a year and a half. He started his machine learning career from scratch, and he started it on Kaggle itself. Having initially placed near the bottom of competitions, Abhishek was adamant that Kaggle is the one platform a person can rely on to catapult their growth within such a short period of time.

However, as Abhishek repeatedly said, it all required continuous persistence. Even after placing near the bottom in his early competitions, he carried on, and that persistence proved to be the key to his success. When asked about the techniques that earned him gold medals in his recent competitions, Thakur emphasized feature engineering, insisting that this step, more than any other, distinguishes the winner. Likewise, he suggested that thorough exploratory data analysis can help uncover those magical features that lead to winning results.
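
To make the feature-engineering advice concrete, below is a minimal sketch of the kind of derived features competitors often build from tabular data. The column names (user_id, price, quantity) and the aggregation choices are purely hypothetical examples, not features from any competition Abhishek mentioned.

```python
import pandas as pd

# Hypothetical transactional data; the columns are illustrative only.
df = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 2],
    "price":    [10.0, 12.5, 3.0, 4.5, 6.0],
    "quantity": [2, 1, 5, 3, 2],
})

# Interaction feature: combine two existing columns into a new one.
df["total_spend"] = df["price"] * df["quantity"]

# Aggregation features: summarize each user's behaviour and join it back.
user_stats = (
    df.groupby("user_id")["total_spend"]
      .agg(user_mean_spend="mean", user_max_spend="max")
      .reset_index()
)
df = df.merge(user_stats, on="user_id", how="left")

# Ratio feature: how a single row compares to the user's typical behaviour.
df["spend_vs_user_mean"] = df["total_spend"] / df["user_mean_spend"]

print(df.head())
```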

Like other Grandmasters who have attained massive success in this domain, Abhishek also emphasized improving one's personal profile through Kaggle. Not only does it offer a distinct and fast-paced learning experience, as it did for all the Grandmasters at the event, but its rankings are also recognized across various industries and by major employers. Abhishek described how it brought him numerous lucrative job offers over time.

Start Instantly with Competitions

During the first day, I attended Pavel Pleskov's workshop on 'Building The Ultimate Binary Classification Pipeline'. Based in Russia, Pavel currently works for an NLP startup, PointAPI, and was once ranked number 2 among Kagglers globally. The workshop was fantastic, but the conversations during and after it intrigued me the most, as they mostly consisted of tips for beginners.

Someone who quit his profitable business to compete on Kaggle, Pavel insisted on the 'do what you love' strategy, as it leads to more life satisfaction and profit. He told us how he started with some of the most popular online machine learning courses but found them lacking in practical skills and homework, a gap he filled using Kaggle. For beginners, he strongly recommended not putting off Kaggle contests or waiting to finish courses, but starting immediately. According to him, practical experience on Kaggle matters more than any course assignment.

Another noteworthy tip from Pavel concerned how to win such competitions: unlike many students who treat Kaggle as an academic problem, build fancy architectures, and ultimately do not score well, Pavel approaches each problem with a business mindset. He increased his probability of success by leveraging resources, for example by recruiting teammates who had a GPU, or by merging his team with another to improve the overall score.

When asked how to balance time spent building theoretical knowledge against time spent generating new ideas, Pavel advised reading Kaggle forum threads; they show how much theoretical knowledge you are missing relative to the competitors around you. Pavel is an avid user of LightGBM and CatBoost, which he credits for his superior rankings in competitions. He also suggested the fast.ai library, which, despite receiving many critical reviews, he has found flexible and useful and keeps in regular consideration.
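
As a rough illustration of the kind of baseline these tools enable, here is a minimal binary-classification sketch using LightGBM with out-of-fold validation. The synthetic data, parameter values, and fold setup are my own assumptions for the sketch, not Pavel's actual workshop pipeline.

```python
import lightgbm as lgb
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for a competition dataset.
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

oof_pred = np.zeros(len(y))  # out-of-fold predictions for honest validation
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for train_idx, valid_idx in folds.split(X, y):
    model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
    model.fit(
        X[train_idx], y[train_idx],
        eval_set=[(X[valid_idx], y[valid_idx])],
        callbacks=[lgb.early_stopping(50, verbose=False)],
    )
    oof_pred[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]

print("Out-of-fold AUC:", roc_auc_score(y, oof_pred))
```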

Hunt for Ideas and Rework Them

Given the limited time during the two-day event, I heard less from another young Grandmaster from Russia, Pavel Ostyakov, who coincidentally shares a first name with his fellow Russian Grandmaster. Remarkably, Pavel was still an undergraduate student at the time and had been working with Yandex and Samsung AI for the past couple of years.

He brought a distinct set of advice that can prove extremely useful when one is targeting gold in competitions. He emphasized writing clean code that can be reused later and allows easy collaboration with teammates, a practice often overlooked that later becomes troublesome for participants. He also insisted on reading as many Kaggle forum threads as possible, not just those related to the current competition but those from other competitions as well, since many problems are similar. Beyond searching for workable solutions, Pavel suggested also looking for ideas that failed: try using, and reworking, those failed ideas, because there is a chance they may work.

Pavel also pointed out that reading research papers and implementing their solutions can increase your chances of surpassing other competitors. Throughout, he stressed keeping the mindset that anyone can achieve gold in a competition, even with limited experience relative to others.

Experiment with Diverse Strategies

Other noteworthy tips and ideas I collected while mingling with Grandmasters and attending their presentations included those from Gilberto Titericz (Giba), the Grandmaster from Brazil with 45 gold medals! When I spoke with Giba personally, he repeatedly used the keyword 'experiment' and insisted that it is always important to experiment with new strategies, methods, and parameters. It is a simple, if tedious, way to learn quickly and get great results.

Giba also proposed that, to attain top performance, one must build models from different viewpoints of the data. This diversity can come from feature engineering, from varying training algorithms, or from different transformations, so it pays to explore all the possibilities. Furthermore, Giba suggested that fitting a model with default hyperparameters is good enough to start a competition and establish a benchmark score to improve upon. Regarding teaming up, he repeated that diversity is key there as well: choosing someone who thinks just like you is not a good move.

A great piece of advice from Giba was to blend models. Combining models can improve the performance of the final solution, especially when the models' predictions have low correlation with each other. A blend can be something as simple as a weighted average. For instance, non-linear models such as gradient boosting machines blend very well with neural-network-based models.
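
Here is a minimal sketch of such a blend, assuming we already have validation-set probabilities from a gradient boosting model and a neural network; the toy arrays and the 0.6/0.4 weights below are made up for illustration. Checking the correlation of the two prediction vectors first hints at whether the blend is likely to help.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical out-of-fold probabilities from two different models.
y_true   = np.array([0, 1, 1, 0, 1, 0, 1, 1])
gbm_pred = np.array([0.2, 0.8, 0.6, 0.3, 0.7, 0.4, 0.9, 0.5])
nn_pred  = np.array([0.3, 0.6, 0.9, 0.2, 0.8, 0.1, 0.7, 0.6])

# Low correlation between the two prediction vectors suggests blending may help.
print("Prediction correlation:", np.corrcoef(gbm_pred, nn_pred)[0, 1])

# A simple weighted-average blend; the 0.6/0.4 weights are an arbitrary choice
# that would normally be tuned on validation data.
blend = 0.6 * gbm_pred + 0.4 * nn_pred

for name, pred in [("GBM", gbm_pred), ("NN", nn_pred), ("Blend", blend)]:
    print(f"{name} AUC: {roc_auc_score(y_true, pred):.3f}")
```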

Conclusion

Considering the key takeaways from these Grandmasters' suggestions, and watching the way they competed during the offline competition, I noted that beginners in data science should spend their effort trying as many varied methodologies as they can. The recommendations above also stress the importance of taking part in online competitions no matter how much knowledge or experience one possesses. I also noticed that most of the experienced data scientists were fond of ensemble techniques, and that one of their most prominent methods was creating new features out of existing ones; in fact, this is exactly what the winners of the offline competition cited as their strategy for success. In short, meetups like these let one interact with the top minds in the field and learn a great deal in a short time, as I fortunately did.

