[Updated] 12 Must-See Coronavirus Data Sets for Data Scientists and Researchers



For the last few weeks, the world has been in the grip of the COVID-19 pandemic, popularly known as “coronavirus”, which is threatening a global shutdown. It has spread at a rate of knots, leaving little time to prepare for this novel virus. Every section of society, whether government, healthcare or NGOs, is taking precautionary measures to protect people from the coronavirus. In this article, we’ll look at some of the publicly available coronavirus data sets, which can be used to explore a large number of questions related to the coronavirus.

It is a good opportunity for our data science and machine learning communities to use this coronavirus data and shed more light on the traits of the virus.

List of Coronavirus Data Sets


1. Kaggle – Day-level information on COVID-19 affected cases

This dataset, hosted on Kaggle, has information about the number of affected cases, deaths, and recoveries from the 2019 novel coronavirus. It consists of multiple CSV files with different features. Because it covers cases from across the globe and is updated regularly, it supports broader analysis than a single-country dataset.

Moreover, this coronavirus data set contains time-series data, which gives us an opportunity to get hands-on practice with time-series analysis.
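
As a first time-series exercise, cumulative case counts can be converted into daily increases. The snippet below is only a minimal sketch: the column names (`date`, `country`, `confirmed`) and the numbers are made up for illustration and are not the actual schema or contents of the Kaggle files, so adjust them to whatever the downloaded CSV contains.

```python
import csv
import io
from collections import defaultdict

# Made-up sample in the shape of a day-level file: one row per country
# per date, holding a cumulative confirmed count. The column names and
# values are assumptions for illustration, not the real Kaggle schema.
SAMPLE_CSV = """date,country,confirmed
2020-03-01,Country A,100
2020-03-02,Country A,150
2020-03-03,Country A,225
2020-03-01,Country B,40
2020-03-02,Country B,40
2020-03-03,Country B,70
"""

def daily_new_cases(csv_text):
    """Turn cumulative counts into per-day increases for each country."""
    cumulative = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        cumulative[row["country"]].append((row["date"], int(row["confirmed"])))

    new_cases = {}
    for country, series in cumulative.items():
        series.sort()  # ISO-formatted dates sort chronologically as strings
        # Pair each day with the previous day and take the difference.
        new_cases[country] = [
            (date, count - prev_count)
            for (_, prev_count), (date, count) in zip(series, series[1:])
        ]
    return new_cases

result = daily_new_cases(SAMPLE_CSV)
for country, increases in result.items():
    print(country, increases)
```

With a real file, the same function can be fed the text of the downloaded CSV once the column names are mapped to the ones it actually uses.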

Refer here for the dataset.

2. Kaggle – COVID-19 cases in Italy

The virus originated in Wuhan, a city in the Hubei Province of China. At present, however, the country with the most suspected cases and deaths after China is Italy. This is another Kaggle-hosted dataset, where we can find information about coronavirus cases by region and by province. There is also a dashboard that visualizes the current situation.

Through this coronavirus dataset, one can understand how the rise in coronavirus cases took place in Italy. We can also use data science techniques to work out which regions and provinces are most affected by this life-threatening virus.
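
Ranking regions by case count is a simple group-and-sum. The following is a rough sketch under stated assumptions: the column names (`date`, `region`, `province`, `total_cases`) and the sample rows are hypothetical placeholders, not the layout of the actual Italian dataset.

```python
import csv
import io
from collections import defaultdict

# Hypothetical province-level rows with cumulative totals per date.
# Column names and values are placeholders, not the real dataset.
SAMPLE_CSV = """date,region,province,total_cases
2020-03-10,Region A,Province A1,120
2020-03-10,Region A,Province A2,80
2020-03-10,Region B,Province B1,40
2020-03-11,Region A,Province A1,150
2020-03-11,Region A,Province A2,95
2020-03-11,Region B,Province B1,55
"""

def latest_totals_by_region(csv_text):
    """Sum province totals per region on the most recent date, largest first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    latest_date = max(row["date"] for row in rows)  # ISO dates compare as strings
    totals = defaultdict(int)
    for row in rows:
        if row["date"] == latest_date:
            totals[row["region"]] += int(row["total_cases"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

ranking = latest_totals_by_region(SAMPLE_CSV)
print(ranking)
```

Only the latest date is summed because the sample totals are cumulative; if a file stores daily counts instead, every date would need to be included.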

To download the dataset, click here.

3. Kaggle – Dataset of COVID-19 in South Korea

After China and Italy, South Korea, one of China's neighboring countries, has been badly hit by the coronavirus. This coronavirus data set focuses on the various kinds of information available from South Korea. It contains 5 CSV files, with information about registered cases, patients, places visited by patients, time-series data for South Korea, and coronavirus-related keywords searched by locals. It is a very detailed dataset that can assist in projects involving data visualization, geographical visualization and time-series analysis.

For accessing the data, go here.

4. EU Open Data Portal

This source, hosted on the EU Open Data Portal and provided by the European Centre for Disease Prevention and Control, offers three datasets on the coronavirus, all updated daily. The first covers cases found in Europe, and the second has information on worldwide cases. The third contains the geographical distribution of COVID-19 cases as an Excel file. Using these datasets, we can analyze how European countries were affected and visualize the results on maps.

The dataset can be downloaded from here.

5. The Humanitarian Data Exchange (HDX) – Novel Coronavirus Cases

This dataset is compiled by the Johns Hopkins University Center for Systems Science and Engineering from multiple sources, including WHO, DXY.cn, China CDC, US CDC, the Government of Canada and many other governmental and non-governmental organizations. It records the number of people confirmed to have coronavirus, along with suspected cases, recoveries and deaths. The dataset has 6 CSV files, each comprising a different set of data. You can also view the results through an interactive visualization that shows a world map of the affected regions.

To download this interactive dataset, click here

6. Our World in Data – Coronavirus Data

Next on the list is the data provided by Our World in Data. Here we can find five coronavirus datasets in the form of CSV files. These files contain information about confirmed cases and deaths all over the world, and the source keeps both sets of figures up to date.

Download the dataset from here.

7. Nextstrain – Coronavirus Dataset

The next source is Nextstrain. Along with interactive visualizations that offer intriguing filters, you can also download the dataset used to build them. The salient feature of this dataset is the number of sources that have contributed data. Using this dataset, we can explore data visualization at an advanced level and learn about some new kinds of visualization. I also recommend playing around with the interactive visualizations.

Access the dataset, by clicking here

8. Coronavirus Dataset of India

Unfortunately, the coronavirus has started to spread throughout India. Numerous cases and a couple of deaths have been reported. All of this information is now available in a dataset that covers confirmed and cured cases. The dataset contains multiple CSV files, which will help in a deeper analysis of how the virus entered India and what impact it has had on the various states. The dataset is hosted on Kaggle and can be downloaded from the link given below.

Click here for dataset

9. Coronavirus Data set for R Package

Recently, an R package for coronavirus data was released. If you work with R, you can take a look at this package and work with it. For Python users, there is a GitHub repository from which the CSV file of the dataset can be downloaded for analysis. The dataset contains information about cases all over the world, so it is useful for exploring both the coronavirus and data science.

To access the dataset, click here

10. COVID-19 Open Research Dataset (CORD-19)

One of the largest datasets among these resources is this one, hosted on Kaggle. Recently, the Allen Institute for AI, Microsoft, the Chan Zuckerberg Initiative, and the US Government collaborated with other research groups to create this open research dataset. Anyone can access it without constraints, and analysis of such a large body of data may well yield new and unique results.

To fetch the data click here.

11. COVID-19 cases with chest X-ray or CT images

This is an open dataset consisting of chest X-ray and CT scan images of coronavirus patients. The dataset is still growing with each passing day. It is hosted on GitHub, where other relevant information about the dataset is also provided. Using this source, we can study what these scans reveal about coronavirus patients. Along the way, we can sharpen our computer vision skills and learn some new techniques in the field.

To download dataset click here.

12. Coronavirus Tweets Dataset

We are well aware that social media mirrors what society is thinking. This dataset, hosted on Kaggle, has three CSV files. One file contains information about the tweets; the other two cover the countries from which the tweets originated and the hashtags used in them. Using this dataset, we can take on projects such as sentiment analysis and other natural language processing tasks.
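
Before attempting sentiment analysis, a quick way to get a feel for tweet text is to count hashtags. This is a small sketch only: the tweets below are invented examples rather than rows from the Kaggle files, and `count_hashtags` is a hypothetical helper name.

```python
import re
from collections import Counter

# Invented example tweets; a real run would read the text column
# from the downloaded tweets CSV instead.
SAMPLE_TWEETS = [
    "Stay home and wash your hands #COVID19 #StaySafe",
    "New numbers released today #covid19",
    "Thank you to all healthcare workers #StaySafe #Heroes",
]

# A hashtag is taken to be '#' followed by word characters.
HASHTAG_PATTERN = re.compile(r"#(\w+)")

def count_hashtags(tweets):
    """Count hashtags case-insensitively across a list of tweet texts."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG_PATTERN.findall(text))
    return counts

hashtag_counts = count_hashtags(SAMPLE_TWEETS)
print(hashtag_counts.most_common(2))
```

Lowercasing the tags merges variants such as `#COVID19` and `#covid19`, which usually matters more than exact casing when looking for trending topics.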

To download the dataset, click here.


So we have come to the end of this article, where we looked at some of the most useful sources of coronavirus data sets from across the globe. You can use these datasets to learn and explore different aspects of data science, such as data preprocessing, data visualization and data analysis. Finally, do take care of your health and stay safe.
