What is the Hugging Face Datasets Library?
Hugging Face is best known for its ready-to-use pre-trained Transformer models. However, it is also a great place to find datasets for your machine learning projects, and it provides a library called Datasets for easy data access, management, preprocessing, and caching.
Some of the advantages of using the Hugging Face Datasets library are:
1. Disk-Based Storage
One of the most important features of the Datasets library is its efficient data access. It leverages Apache Arrow for memory-mapped data access, meaning you can process large datasets that don't fit into RAM. This makes it a powerful tool for large-scale data analysis and machine learning.
2. Immutable and Versioned Datasets
Ensuring the reproducibility of experiments is crucial in data science and machine learning. The Datasets library supports versioning, and all datasets are immutable, meaning the underlying data can't be changed once created. This makes it easy to keep track of the exact state of the data used in a specific version of your experiment.
3. Smart Caching
The library caches the results of processing steps, making it highly efficient: each transformation generates a new dataset, and previously computed results are reused from the on-disk cache instead of being recomputed. This allows for efficient pipelining and chaining of transformations.
4. Integration With Other Frameworks
The Hugging Face Datasets library is built to integrate seamlessly with popular machine learning frameworks like TensorFlow, PyTorch, JAX, and Spark, making it easy to feed your data into a model. You can convert your entire dataset into PyTorch tensors or TensorFlow tensors with a single method call.
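For example, here is a minimal sketch of the PyTorch case, assuming torch is installed alongside datasets; with_format("tf") works analogously for TensorFlow.
from datasets import load_dataset

train_ds = load_dataset("squad", split="train")

# with_format returns a view whose array-like columns come back as torch tensors;
# plain string columns such as 'question' are left as Python strings
torch_ds = train_ds.with_format("torch")
print(type(torch_ds[0]["answers"]["answer_start"]))  # expected: <class 'torch.Tensor'>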
Install Hugging Face Datasets Library
The Datasets library can be installed with pip as shown below.
In [0]:
pip install datasets
The list_datasets Function
Fetching All Datasets of Hugging Face
Let us start by fetching the list of available datasets on the Hugging Face Hub. For this, we use the list_datasets function of the datasets library and print the number of available datasets.
In [1]:
from datasets import list_datasets

all_datasets = list_datasets()
print('Number of datasets:', len(all_datasets))
Out[1]:
Number of datasets: 49046
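Note that in newer releases of the datasets library, list_datasets has been deprecated and removed in favor of the equivalent function in the huggingface_hub package. A rough sketch of the newer approach:
from huggingface_hub import list_datasets

# list_datasets() returns an iterator of DatasetInfo objects;
# limit caps how many entries are fetched from the Hub
for ds_info in list_datasets(limit=10):
    print(ds_info.id)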
Listing First 10 Datasets
Next, we print the names of the first 10 datasets in the list.
In [2]:
all_datasets[0:10]
Out[2]:
['acronym_identification',
 'ade_corpus_v2',
 'adversarial_qa',
 'aeslc',
 'afrikaans_ner_corpus',
 'ag_news',
 'ai2_arc',
 'air_dialogue',
 'ajgt_twitter_ar',
 'allegro_reviews']
The load_dataset Function
Loading a Dataset
To load a particular dataset, we use the load_dataset function of the datasets library. In the example below, we load a dataset called "squad" and print its metadata, which shows the names of the features and the number of rows in the train and validation splits.
In [3]:
from datasets import load_dataset

data_set = load_dataset("squad")
print(data_set)
Out[3]:
DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})
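If only one split is needed, load_dataset can return it directly as a Dataset rather than a DatasetDict; the split string also supports slicing. A small sketch:
from datasets import load_dataset

# load just the train split as a Dataset
train_only = load_dataset("squad", split="train")
print(train_only.num_rows)   # 87599

# split strings support slicing, e.g. the first 100 training rows
sample = load_dataset("squad", split="train[:100]")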
Fetching a Data Row
A single row can be fetched by indexing into a split of the dataset, just like a Python list.
In [4]:
train_ds = data_set["train"]
train_ds[0]
Out[4]:
{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}
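Slicing works too: a sliced query returns a dictionary mapping column names to lists of values, and a whole column can be fetched by name. A quick sketch, continuing with the same train_ds:
# the first three rows, as a dict of column name -> list of values
print(train_ds[:3]["question"])

# an entire column, fetched by name
titles = train_ds["title"]
print(titles[:3])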
Accessing Column Names
The column names of the dataset can be accessed through its column_names attribute, as shown below.
In [5]:
train_ds.column_names
Out[5]:
['id', 'title', 'context', 'question', 'answers']
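A few related metadata attributes are available on the same object; for example:
print(len(train_ds))        # number of rows: 87599
print(train_ds.num_rows)    # same count, as an attribute
print(train_ds.shape)       # (rows, columns): (87599, 5)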
Getting Features Metadata
The features and their data types can be fetched through the features attribute of the dataset, as shown below.
In [6]:
train_ds.features
Out[6]:
{'id': Value(dtype='string', id=None),
 'title': Value(dtype='string', id=None),
 'context': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None),
 'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None)}
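Individual features can also be looked up by column name, which is handy for checking the expected dtype of a field. A small sketch:
# the schema of a single column
print(train_ds.features["answers"])

# the dtype of a scalar column
print(train_ds.features["id"].dtype)   # 'string'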
Setting Hugging Face Datasets Format to Pandas
By default, Hugging Face datasets are stored in the Apache Arrow format. However, the output format can be switched to Pandas for ease of use and data wrangling.
In the example below, we can see that the dataset initially has the default Arrow-backed Dataset type. We then use the set_format method to set type="pandas". The second print statement confirms that rows are now returned as a Pandas DataFrame.
In [7]:
import pandas as pd
from datasets import load_dataset

data_set = load_dataset("squad")
train_ds = data_set["train"]

print('Default Type:')
print(type(train_ds))

# switch the output format so that indexing returns pandas objects
data_set.set_format(type="pandas")
train_df = data_set["train"][:]

print('\nAfter Conversion to Pandas Type:')
print(type(train_df))
Out[7]:
Default Type:
<class 'datasets.arrow_dataset.Dataset'>

After Conversion to Pandas Type:
<class 'pandas.core.frame.DataFrame'>
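With the pandas output format set, the usual DataFrame tooling applies directly, and reset_format restores the default Arrow-backed behaviour. A short sketch, using the column names from the SQuAD schema above:
# ordinary pandas operations on the extracted DataFrame
print(train_df["title"].value_counts().head())

# restore the default output format; rows come back as plain dicts again
data_set.reset_format()
print(type(data_set["train"][0]))   # <class 'dict'>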