Introduction
In this tutorial, we will go through the PyTorch DataLoader, a flexible utility for loading datasets for training in your deep learning project. We will understand why this utility is used and also see some examples of how to use the DataLoader in PyTorch.
What is DataLoader in PyTorch?
Sometimes, when working with a big dataset, it becomes quite difficult to load the entire data into memory at once. The only way forward is to load the data into memory in batches for processing, which normally means you have to write extra code to do this. But do not worry, PyTorch has you covered with its DataLoader class.
The DataLoader class is available in the PyTorch torch.utils.data module and supports the following tasks –
- Customization of Data Loading Order
- Map-Style and Iterable-Style Datasets
- Automatic Batching
- Data Loading with single and multiple processes
- Automatic Memory Pinning
Syntax of PyTorch DataLoader
The following section shows the syntax of the DataLoader class in the PyTorch library along with information about its parameters.
DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
           batch_sampler=None, num_workers=0, collate_fn=None,
           pin_memory=False, drop_last=False, timeout=0,
           worker_init_fn=None, *, prefetch_factor=2,
           persistent_workers=False)
Parameters
- Dataset – It is mandatory for a DataLoader to be constructed with a dataset first. PyTorch DataLoaders support two kinds of datasets:
- Map-style datasets – These datasets map keys to data samples. Each item is retrieved by a __getitem__() method implementation.
- Iterable-style datasets – These datasets implement the __iter__() protocol. Such datasets retrieve data as a stream in sequence rather than doing random reads as in the case of map-style datasets. (A short sketch of both styles appears after this parameter list.)
- Batch size – Refers to the number of samples in each batch.
- Shuffle – Whether you want the data to be reshuffled or not.
- Sampler – Refers to an optional torch.utils.data.Sampler class instance. A sampler defines the strategy to retrieve samples – sequentially, randomly, or in any other manner. Shuffle should be left as False when a sampler is used.
- Batch_Sampler – Same as the data sampler defined above, but works at a batch level.
- num_workers – Number of sub-processes needed for loading the data.
- collate_fn – Collates samples into batches. Customized collation is possible by passing your own function, as shown in the second sketch after this list.
- pin_memory – Pinned (page-locked) memory locations are used by GPUs for faster data access. When set to True, this option enables the data loader to copy tensors into the CUDA pinned memory.
- drop_last – If the total data size is not a multiple of batch_size, the last batch has fewer elements than batch_size. This incomplete batch can be dropped by setting this option to True.
- timeout – Sets the time to wait while collecting a batch from the workers (sub-processes).
- worker_init_fn – Defines a routine to be called by each worker process. Allows customized routines.
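To make the difference between the two dataset styles concrete, here is a minimal sketch. The class names SquaresMapDataset and SquaresIterableDataset are made up for illustration; the map-style dataset supports random access and shuffling, while the iterable-style dataset simply streams its samples.

import torch
from torch.utils.data import Dataset, IterableDataset, DataLoader

class SquaresMapDataset(Dataset):
    # Map-style: random access through __getitem__() and __len__()
    def __init__(self, n):
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, idx):
        return idx * idx

class SquaresIterableDataset(IterableDataset):
    # Iterable-style: yields samples as a stream through __iter__()
    def __init__(self, n):
        self.n = n
    def __iter__(self):
        return (i * i for i in range(self.n))

print(next(iter(DataLoader(SquaresMapDataset(10), batch_size=4, shuffle=True))))
print(next(iter(DataLoader(SquaresIterableDataset(10), batch_size=4))))   # shuffle is not supported here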
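Similarly, here is a hedged sketch of a custom collate_fn. The pad_collate function below is a made-up example that pads variable-length 1-D tensors in each batch to a common length.

import torch
from torch.utils.data import DataLoader

# A plain Python list also behaves like a map-style dataset
sequences = [torch.arange(n) for n in (3, 5, 2, 4)]

def pad_collate(batch):
    # Pad every sequence in the batch to the length of the longest one
    max_len = max(item.size(0) for item in batch)
    padded = torch.zeros(len(batch), max_len, dtype=batch[0].dtype)
    for i, item in enumerate(batch):
        padded[i, :item.size(0)] = item
    return padded

loader = DataLoader(sequences, batch_size=2, collate_fn=pad_collate)
for batch in loader:
    print(batch.shape)   # torch.Size([2, 5]) and torch.Size([2, 4])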
Example of DataLoader in PyTorch
Example – 1 – DataLoaders with Built-in Datasets
This first example will showcase how the built-in MNIST dataset of PyTorch can be handled with the DataLoader class. (MNIST is a famous dataset containing images of hand-written digits.)
import torch
import matplotlib.pyplot as plt
from torchvision import datasets, transforms
In this example, we are using the transforms module of torchvision. It is generally used when we have to handle image datasets and can help in normalizing, resizing, and cropping the images.
For this MNIST dataset, we are using the normalization technique. ToTensor converts the pixel values to the range 0 to 1, and Normalize with a mean and standard deviation of 0.5 then rescales them to the range -1 to +1.
The following code builds the transform used for normalization.
# Define a transform to normalize the data
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,)),
                                ])
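If you want to verify the effect of this transform yourself, here is a small, hypothetical sanity check (the dummy image below is not part of the MNIST workflow):

import numpy as np
from PIL import Image

# Apply the transform to a random 28x28 grayscale image
dummy = Image.fromarray(np.random.randint(0, 256, (28, 28), dtype=np.uint8), mode='L')
out = transform(dummy)
print(out.min().item(), out.max().item())   # values now lie roughly in [-1.0, 1.0]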
The following code snippet is used for loading the desired dataset. We load the data with the PyTorch DataLoader using batch_size = 64, and we also enable shuffling so that the data is reordered at each epoch.
# Download and load the training data
trainset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)
Extracting /root/.pytorch/MNIST_data/MNIST/raw/t10k-labels-idx1-ubyte.gz to /root/.pytorch/MNIST_data/MNIST/raw
Processing...
Done!
(You may also see a torchvision UserWarning about a non-writeable NumPy array at this step; it does not affect this tutorial.)
To fetch a batch of images from the dataset, we create an iterator over the dataloader and request the next batch.
dataiter = iter(trainloader)
images, labels = next(dataiter)
print(images.shape)
print(labels.shape)
plt.imshow(images[1].numpy().squeeze(), cmap='Greys_r')
torch.Size([64, 1, 28, 28])
torch.Size([64])
<matplotlib.image.AxesImage at 0x7fdc324cdb50>
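Once batches can be fetched like this, the trainloader is normally consumed inside a training loop. The following is only a minimal sketch with a made-up placeholder model and optimizer, not part of the original example:

from torch import nn, optim

# Hypothetical placeholder model: flatten the 28x28 image and apply a linear classifier
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

for epoch in range(2):                      # loop over the full dataset twice
    for images, labels in trainloader:      # each iteration yields one batch of 64 samples
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch} done, last batch loss: {loss.item():.4f}")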
Example – 2 – DataLoaders on Custom Datasets
This second example shows how we can use PyTorch dataloader on custom datasets. So let us first create a custom dataset.
The below code snippet helps us to create a custom dataset that contains 999 random integers drawn from a given range.
from torch.utils.data import Dataset
import random
class SampleDataset(Dataset):
    def __init__(self, r1, r2):
        # Build a list of 999 random integers between r1 and r2
        randomlist = []
        for i in range(1, 1000):
            n = random.randint(r1, r2)
            randomlist.append(n)
        self.samples = randomlist

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

dataset = SampleDataset(4, 445)
dataset[100:120]
[435, 117, 315, 266, 279, 441, 364, 383, 241, 299, 146, 124, 74, 128, 404, 400, 214, 237, 40, 382]
Finally, we use the DataLoader on our custom dataset. Notice that we have set batch_size to 12 and have also enabled parallel multiprocess data loading with num_workers=2.
The output shows that the data is divided into batches of 12 samples each. Some of the tensors are displayed for reference.
from torch.utils.data import DataLoader
loader = DataLoader(dataset, batch_size=12, shuffle=True, num_workers=2)

for i, batch in enumerate(loader):
    print(i, batch)
0 tensor([ 16, 179, 246, 127, 263, 418, 33, 410, 107, 281, 438, 164])
1 tensor([421, 55, 183, 19, 47, 402, 336, 290, 241, 121, 308, 140])
2 tensor([265, 149, 62, 421, 67, 427, 302, 149, 134, 269, 116, 267])
3 tensor([318, 404, 365, 324, 229, 184, 10, 391, 71, 424, 387, 256])
4 tensor([178, 138, 200, 398, 420, 98, 147, 338, 341, 434, 58, 332])
5 tensor([403, 256, 290, 238, 186, 57, 343, 361, 388, 81, 271, 111])
6 tensor([340, 59, 73, 298, 275, 102, 20, 413, 95, 83, 380, 323])
7 tensor([ 71, 15, 443, 44, 394, 252, 103, 11, 383, 292, 57, 109])
8 tensor([398, 406, 84, 369, 272, 409, 367, 205, 353, 24, 305, 21])
9 tensor([280, 200, 79, 424, 26, 58, 233, 194, 362, 379, 228, 428])
10 tensor([316, 225, 231, 272, 382, 132, 306, 295, 150, 365, 420, 17])
11 tensor([280, 432, 51, 123, 356, 29, 172, 225, 143, 147, 226, 262])
12 tensor([208, 366, 267, 389, 135, 398, 359, 365, 52, 210, 152, 214])
.
.
.
69 tensor([ 43, 351, 383, 435, 368, 26, 316, 145, 409, 140, 224, 159])
70 tensor([210, 68, 404, 30, 32, 324, 18, 416, 340, 354, 337, 436])
71 tensor([414, 114, 233, 320, 105, 318, 326, 139, 319, 205, 69, 123])
72 tensor([165, 265, 381, 33, 392, 261, 57, 23, 131, 186, 232, 186])
73 tensor([404, 105, 345, 436, 51, 392, 263, 138, 364, 439, 12, 295])
74 tensor([163, 70, 137, 435, 250, 354, 190, 335, 39, 323, 365, 96])
75 tensor([148, 383, 322, 300, 309, 125, 46, 29, 231, 432, 258, 376])
76 tensor([314, 266, 248, 236, 296, 434, 93, 138, 140, 12, 444, 302])
77 tensor([ 41, 257, 13, 64, 295, 330, 396, 251, 379, 232, 108, 364])
78 tensor([ 70, 161, 168, 41, 434, 258, 327, 270, 42, 347, 384, 282])
79 tensor([392, 13, 258, 416, 146, 308, 32, 276, 302, 177, 410, 263])
80 tensor([186, 433, 420, 11, 273, 230, 377, 416, 303, 83, 20, 240])
81 tensor([ 47, 354, 171, 207, 178, 351, 137, 138, 33, 224, 422, 280])
82 tensor([214, 193, 444, 432, 274, 268, 67, 217, 64, 84, 27, 102])
83 tensor([419, 62, 244])
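Since our custom dataset contains 999 samples, the final batch above has only 3 elements. As a small follow-up sketch, setting drop_last=True (described in the parameters section) makes the DataLoader discard that incomplete batch:

loader = DataLoader(dataset, batch_size=12, shuffle=True, num_workers=2, drop_last=True)
print(len(loader))                              # 83 full batches instead of 84
print(min(batch.shape[0] for batch in loader))  # every remaining batch has exactly 12 samples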
Conclusion
This is the end of another tutorial based on PyTorch. In this tutorial, we understood, with a couple of examples, how the PyTorch DataLoader is quite useful for loading a huge amount of data into memory in batches. We hope this tutorial will be useful for your own deep learning projects.
Reference – PyTorch Documentation