Kaggle: Where data scientists learn and compete

By hosting datasets, notebooks, and competitions, Kaggle helps data scientists discover how to build better machine learning models

Data science is typically more of an art than a science, despite the name. You start with dirty data and an old statistical predictive model and try to do better with machine learning. Nobody checks your work or tries to improve it: If your new model fits better than the old one, you adopt it and move on to the next problem. When the data starts drifting and the model stops working, you update the model from the new dataset.

Doing data science on Kaggle is quite different. Kaggle is an online machine learning environment and community. It has standard datasets that hundreds or thousands of individuals or teams try to model, and there’s a leaderboard for each competition. Many contests offer cash prizes and status points, and competitors can keep refining their models until the contest closes to improve their scores and climb the leaderboard. Tiny percentage differences often separate the winners from the runners-up.

Kaggle is something that professional data scientists can play with in their spare time, and aspiring data scientists can use to learn how to build good machine learning models.

What is Kaggle?

Looked at more comprehensively, Kaggle is an online community for data scientists that offers machine learning competitions, datasets, notebooks, access to training accelerators, and education. Anthony Goldbloom (CEO) and Ben Hamner (CTO) founded Kaggle in 2010, and Google acquired the company in 2017.

Kaggle competitions have advanced the state of the art in machine learning in several areas, among them mapping dark matter and HIV/AIDS research. Looking at the winners of Kaggle competitions, you’ll see lots of XGBoost models, some Random Forest models, and a few deep neural networks.

Kaggle competitions

There are five categories of Kaggle competition: Getting Started, Playground, Featured, Research, and Recruitment.

Getting Started competitions are semi-permanent, and are meant to be used by new users just getting their foot in the door in the field of machine learning. They offer no prizes or points, but have ample tutorials. Getting Started competitions have two-month rolling leaderboards.

Playground competitions are one step above Getting Started in difficulty. Prizes range from kudos to small cash prizes.

Featured competitions are full-scale machine learning challenges that pose difficult prediction problems, generally with a commercial purpose. Featured competitions attract some of the most formidable experts and teams, and offer prize pools that can be as high as a million dollars. That might sound discouraging, but even if you don’t win one of these, you’ll learn from trying and from reading other people’s solutions, especially the high-ranked solutions.

Research competitions involve problems that are more experimental than Featured competition problems. They do not usually offer prizes or points due to their experimental nature.

In Recruitment competitions, individuals compete to build machine learning models for corporation-curated challenges. At the competition’s close, interested participants can upload their resume for consideration by the host. The prize is (potentially) a job interview at the company or organization hosting the competition.

There are several formats for competitions. In a standard Kaggle competition, users can access the complete datasets at the beginning of the competition, download the data, build models on the data locally or in Kaggle Notebooks (see below), generate a prediction file, then upload the predictions as a submission on Kaggle. Most competitions on Kaggle follow this format, but there are alternatives. A few competitions are divided into stages. Some are code competitions that must be submitted from within a Kaggle Notebook.
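
To make that workflow concrete, here is a minimal sketch in Python for a Getting Started-style tabular competition. The file names (train.csv, test.csv) and column names (PassengerId, Survived, and the feature columns) follow the classic Titanic competition and are assumptions for illustration; check the data page of whatever competition you actually enter.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Train on the provided training data, predict on the test data,
# and write a submission file in the format the competition expects.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

features = ["Pclass", "SibSp", "Parch", "Fare"]
X_train = train[features].fillna(0)
y_train = train["Survived"]
X_test = test[features].fillna(0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": model.predict(X_test),
})
submission.to_csv("submission.csv", index=False)  # upload this file as your entry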

Kaggle datasets

Kaggle hosts more than 35,000 datasets. These are in a variety of publication formats, including comma-separated values (CSV) for tabular data, JSON for tree-like data, SQLite databases, ZIP and 7z archives (often used for image datasets), and BigQuery Datasets, which are multi-terabyte SQL datasets hosted on Google’s servers.
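
As a rough sketch, here is how you might read a few of those formats in Python once you have downloaded a dataset; the file names below are placeholders rather than actual Kaggle files.

import json
import sqlite3
import pandas as pd

table = pd.read_csv("some_table.csv")              # CSV: tabular data
print(table.head())

with open("some_tree.json") as f:                  # JSON: tree-like data
    tree = json.load(f)

conn = sqlite3.connect("some_database.sqlite")     # SQLite database file
tables = pd.read_sql_query(
    "SELECT name FROM sqlite_master WHERE type='table'", conn)
print(tables)
conn.close()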

There are several ways of finding Kaggle datasets. On the Kaggle home page you will find a listing of “hot” datasets and datasets uploaded by people you follow. On the Kaggle datasets page you will find a dataset list (initially ordered by “hottest” but with other ordering options) and a search filter. You can also use tags and tag pages to locate datasets, for example https://www.kaggle.com/tags/crime.

You can create public and private datasets on Kaggle from your local machine, URLs, GitHub repositories, and Kaggle Notebook outputs. You can set a dataset created from a URL or GitHub repository to update periodically.
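
For example, here is one way to publish a local folder as a dataset with the Kaggle command-line tool covered later in this article. This is only a sketch: my_dataset_folder is a placeholder, and you fill in the title and id in the generated metadata file before creating the dataset.

import subprocess

# Write a dataset-metadata.json template into the folder, edit it to set
# the dataset title and id, then create the dataset from the folder contents.
subprocess.run(["kaggle", "datasets", "init", "-p", "my_dataset_folder"], check=True)
# ... edit my_dataset_folder/dataset-metadata.json by hand ...
subprocess.run(["kaggle", "datasets", "create", "-p", "my_dataset_folder"], check=True)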

At the moment, Kaggle has quite a few COVID-19 datasets, challenges, and notebooks. There have already been several community contributions to the effort to understand this disease and the virus that causes it.

Kaggle Notebooks

Kaggle supports three types of notebook: scripts, RMarkdown scripts, and Jupyter Notebooks. Scripts are files that execute everything as code, sequentially; RMarkdown scripts mix R code with Markdown narrative. You can write scripts and Jupyter Notebooks in either R or Python. R coders and people submitting code for competitions often use scripts; Python coders and people doing exploratory data analysis tend to prefer Jupyter Notebooks.

Notebooks of any type can optionally use a free GPU (Nvidia Tesla P100) or TPU accelerator and can call Google Cloud Platform services, but quotas apply, for example 30 hours of GPU time and 30 hours of TPU time per week. Basically, don’t use a GPU or a TPU in a notebook unless you need to accelerate deep learning training. Using Google Cloud Platform services may incur charges to your Google Cloud Platform account if you exceed the free tier allowances.
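
For instance, inside a Kaggle notebook you can check whether an accelerator is actually attached with a couple of lines of Python. This sketch assumes PyTorch and TensorFlow, both of which ship in Kaggle’s standard notebook image.

import torch
import tensorflow as tf

# Report whether a GPU is visible to PyTorch.
print("GPU available:", torch.cuda.is_available())

# TPUClusterResolver raises ValueError when no TPU is attached.
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print("TPU found:", tpu.master())
except ValueError:
    print("No TPU found")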

You can add Kaggle datasets to Kaggle notebooks at any time. You can also add Competition datasets, but only if you accept the rules of the competition. If you wish, you can chain notebooks by adding the output of one notebook to the data of another notebook.
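
Within a notebook, attached datasets show up as read-only files under /kaggle/input, and anything written to /kaggle/working becomes the notebook’s output, which another notebook can then attach as data. The usual way to see what you have is a short directory walk:

import os

# List every file from the datasets attached to this notebook.
for dirname, _, filenames in os.walk("/kaggle/input"):
    for filename in filenames:
        print(os.path.join(dirname, filename))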

Notebooks run in kernels, which are essentially Docker containers. You can save versions of your notebooks as you develop them.

You can search for notebooks using a site-wide keyword query filtered to notebooks, or by browsing the Kaggle home page. You can also use the notebook listing; like the dataset listing, it is ordered by “hotness” by default. Reading public notebooks is a good way to learn how people do data science.

You can collaborate with others on a notebook in several ways, depending on whether the notebook is public or private. If it is public, anyone can view it, and you can grant editing privileges to specific users. If it is private, you can grant specific users viewing or editing privileges.

Kaggle public API

In addition to building and running interactive notebooks, you can interact with Kaggle from your local machine using the Kaggle command-line interface (CLI), which calls the Kaggle public API. You can install the Kaggle CLI with the Python 3 package installer pip, and authenticate your machine by downloading an API token (a kaggle.json file) from your Kaggle account page.

The Kaggle CLI and API can interact with competitions, datasets, and notebooks (kernels). The API is open source and is hosted on GitHub at https://github.com/Kaggle/kaggle-api. The README file there provides the full documentation for the command-line tool.
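
A few typical calls, driven here from Python via subprocess rather than a shell, look like the sketch below. The competition slug titanic is the classic Getting Started competition; the search terms and the submission file name are placeholders, and you must have joined a competition and accepted its rules on the site before downloading its data or submitting.

import subprocess

def kaggle(*args):
    # Thin wrapper around the Kaggle CLI; assumes ~/.kaggle/kaggle.json exists.
    subprocess.run(["kaggle", *args], check=True)

kaggle("competitions", "list")                                  # browse active competitions
kaggle("competitions", "download", "-c", "titanic")             # download competition data
kaggle("competitions", "submit", "-c", "titanic",
       "-f", "submission.csv", "-m", "random forest baseline")  # make a submission
kaggle("datasets", "list", "-s", "crime")                       # search datasets
kaggle("kernels", "list", "-s", "covid")                        # search notebooks (kernels)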

Kaggle community and education

Kaggle hosts community discussion forums and micro-courses. Forum topics include Kaggle itself, getting started, feedback, Q&A, datasets, and micro-courses. Micro-courses cover skills relevant to data scientists in a few hours each: Python, machine learning, data visualization, Pandas, feature engineering, deep learning, SQL, geospatial analysis, and so on.

All in all, Kaggle is very useful for learning data science and for competing with others on data science challenges. It’s also very useful as a repository for standard public datasets. It’s not, however, a replacement for paid cloud data science services or for doing your own analysis.

Copyright © 2020 IDG Communications, Inc.