Introductory Tutorial to the Hugging Face Datasets Library

Ankur K.
Last Updated On August 2, 2023

What is Hugging Face Datasets Library?

Hugging Face is best known for providing ready-to-use pre-trained Transformer models. However, it is also a great place to find datasets for your machine learning projects, and it provides a library called Datasets for easy data access, management, preprocessing, and caching.

Some of the advantages of using the Hugging Face Datasets library are:

1. Disk-Based Storage

One of the most important features of the Datasets library is its efficient data access. It uses Apache Arrow for memory-mapped data access, which means the data stays on disk and only the parts you read are loaded, so you can process datasets that won't fit into memory. This makes it a powerful tool for large-scale data analysis and machine learning.

2. Immutable and Versioned Datasets

Reproducibility is crucial in data science and machine learning. Datasets on the Hugging Face Hub are versioned, and the Arrow data behind a loaded dataset is immutable: transformations return a new dataset instead of changing the existing one in place. This makes it easy to keep track of the exact state of the data used in a given run of your experiment.

3. Smart Caching

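The library caches the result of each transformation on disk, so rerunning the same step on the same data reloads the cached result instead of recomputing it. Each transformation returns a new dataset, which makes it easy to chain steps into a pipeline. On a regular Dataset, transformations such as map run as soon as they are called; lazy, on-demand computation applies when a dataset is used in streaming mode.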

4. Integration With Other Frameworks

The Hugging Face Datasets library is built to integrate with popular machine learning frameworks such as TensorFlow, PyTorch, JAX, and Spark, which makes it easy to feed your data into a model. For example, you can have a dataset return PyTorch or TensorFlow tensors with a single method call.

Install Hugging Face Datasets Library

The Hugging Face Datasets library can be installed with pip as shown below.

In [0]:

pip install datasets

The list_datasets Function

Fetching All Datasets of Hugging Face

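Let us start by fetching the list of datasets available on the Hugging Face Hub. For this, we use the list_datasets function of the datasets library and print the number of available datasets. Note that more recent releases of the library deprecate this function in favor of list_datasets from the huggingface_hub package, so the cell below may need an older version of datasets to run as written.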

In [1]:

from datasets import list_datasets

all_datasets = list_datasets()
print('Number of datasets are', len(all_datasets))

Out[1]:

Number of datasets are 49046

Listing First 10 Datasets

Next, we print the names of the first 10 datasets in the list.

In [2]:

all_datasets[0:10]

Out[2]:

['acronym_identification', 'ade_corpus_v2', 
'adversarial_qa', 'aeslc', 
'afrikaans_ner_corpus', 'ag_news', 
'ai2_arc', 'air_dialogue', 
'ajgt_twitter_ar', 'allegro_reviews']

The load_dataset Function

Loading a Dataset

To load a particular dataset, we use the load_dataset function of the datasets library. In the example below, we load a dataset called "squad" and print its metadata, which shows the feature names and the number of rows in the train and validation splits.

In [3]:

from datasets import load_dataset

data_set = load_dataset("squad")
print(data_set)

Out[3]:

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

Fetching a Data Row

Here we access the training split and print its first entry. Indexing a split with an integer returns that row as a Python dictionary.

In [4]:

train_ds = data_set["train"]
train_ds[0]

Out[4]:

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}

Accessing Column Names

The column names of the dataset can be read from its column_names attribute, as shown below.

In [5]:

train_ds.column_names

Out[5]:

['id', 'title', 'context', 'question', 'answers']

Getting Features Metadata

The features and their data types can be read from the features attribute of the dataset, as shown below.

In [6]:

train_ds.features

Out[6]:

{'id': Value(dtype='string', id=None),
 'title': Value(dtype='string', id=None),
 'context': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None),
 'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None)}

Setting Hugging Face Datasets Format to Pandas

By default, Hugging Face datasets are backed by the Apache Arrow format. However, the output can be switched to Pandas for easier data wrangling.

In the example below, the split is initially an Arrow-backed Dataset object. We then call the set_format method with type="pandas" and slice the training split. The second print statement confirms that the result is a Pandas DataFrame.

In [7]:

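import pandas as pd  # pandas must be installed for the "pandas" format
from datasets import load_dataset

data_set = load_dataset("squad")
train_ds = data_set["train"]

# Before conversion, the split is an Arrow-backed Dataset
print('Default Type:')
print(type(train_ds))

# Switch the output format in place; slicing now returns a DataFrame
data_set.set_format(type="pandas")
train_df = data_set["train"][:]

print('\nAfter Conversion to Pandas Type:')
print(type(train_df))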

Out[7]:

Default Type:
<class 'datasets.arrow_dataset.Dataset'>

After Conversion to Pandas Type:
<class 'pandas.core.frame.DataFrame'>

Reference: Hugging Face Documentation
