PyTorch num_workers, a tip for speedy training

Talha Anwar
2 min readSep 23, 2021

There is an ongoing debate about what the optimal num_workers should be for your DataLoader. num_workers tells the DataLoader instance how many sub-processes to use for data loading. If num_workers is zero (the default), the GPU has to wait for the CPU to load data. Theoretically, the greater the num_workers, the more efficiently the CPU loads data and the less the GPU has to wait.
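For context, here is a minimal sketch of where the argument goes (the dataset variable is just a placeholder, not part of the experiment below):

from torch.utils.data import DataLoader

# num_workers=0 (default): batches are loaded in the main process,
# so the GPU idles while the CPU prepares the next batch.
loader = DataLoader(my_dataset, batch_size=64, shuffle=True, num_workers=0)

# num_workers=4: four worker processes prefetch batches in parallel
# with the GPU computation.
loader = DataLoader(my_dataset, batch_size=64, shuffle=True, num_workers=4)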

DeepLizard has presented a great table comparing batch size and number of workers. They claimed that increasing num_workers helps only up to a point, and that performance starts diminishing beyond it. I agree with this, but the question remains: what is the optimal value? They said the best num_workers value is 2, and that beyond it the time starts increasing. However, their approach is not very practical. They used batch sizes of 100, 1,000, and 10,000, and in practical scenarios even modern GPUs such as the RTX series cannot handle a batch size of 1,000 or 10,000. Even an RTX 3070 Ti cannot afford a batch size of 100 on 224x224 images using ResNet50.

I observed that the optimal value of num_workers depends on your data. I also noticed that if we fetch data from scattered folders, the CPU takes more time to load it. If we use a well-defined structure such as the following, the CPU takes less time. In this scenario we are using PyTorch's ImageFolder to load the data (see the sketch after the directory tree below).

Train
- Class A
  - image1
  - image2
  - imagen
- Class B
  - image1
  - image2
  - imagen
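With that layout, loading is a one-liner. A minimal sketch, assuming the root folder is named Train and a simple resize-to-tensor transform:

from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# ImageFolder infers the class labels from the sub-folder names (Class A, Class B).
train_reader = datasets.ImageFolder(root="Train", transform=transform)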

If, on the other hand, we have a data frame like the following, the data loader takes more time.

df = {'path': ['DictA/FolderA/imageA', 'DictA/FolderB/imageA', 'DictB/FolderA/imageA', 'DictB/FolderB/imageA'],
      'label': [0, 1, 0, 1]}
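Reading from such a data frame requires a custom Dataset. A minimal sketch, where the class name DataFrameDataset is hypothetical and only the 'path' and 'label' columns come from the dictionary above:

import pandas as pd
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class DataFrameDataset(Dataset):
    """Loads (image, label) pairs from paths stored in a data frame."""
    def __init__(self, df, transform=None):
        self.df = pd.DataFrame(df)
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        # Each item opens a file from a different folder, which is where
        # the extra CPU time goes.
        image = Image.open(row['path']).convert('RGB')
        if self.transform is not None:
            image = self.transform(image)
        return image, row['label']

transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_reader = DataFrameDataset(df, transform=transform)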

I am using the following code snippet to test the runtime for different values of num_workers with a fixed batch size of 64. The image size is 224x224 and a pre-trained ResNet50 is used for binary classification. The data is loaded from the data frame, which holds the image paths and their corresponding labels. To put the numbers in context, my system is an RTX 3070 Ti, 32 GB RAM, and a 12-core Ryzen CPU (24 threads).

from time import time
import multiprocessing as mp
from torch.utils.data import DataLoader

# Try even worker counts from 2 up to the number of CPU threads.
for num_workers in range(2, mp.cpu_count(), 2):
    train_loader = DataLoader(train_reader, shuffle=True, num_workers=num_workers,
                              batch_size=64, pin_memory=True)
    start = time()
    # Iterate over the loader for two epochs without training,
    # so only the data-loading time is measured.
    for epoch in range(1, 3):
        for i, data in enumerate(train_loader, 0):
            pass
    end = time()
    print("Finish with:{} second, num_workers={}".format(end - start, num_workers))

Results

Finish with:329.18824434280396 second, num_workers=2
Finish with:185.9790952205658 second, num_workers=4
Finish with:133.18926239013672 second, num_workers=6
Finish with:110.62966227531433 second, num_workers=8
Finish with:96.49145102500916 second, num_workers=10
Finish with:91.31508040428162 second, num_workers=12
Finish with:84.63301157951355 second, num_workers=14
Finish with:87.73049902915955 second, num_workers=16
Finish with:91.74906945228577 second, num_workers=18
Finish with:93.52409195899963 second, num_workers=20
Finish with:105.80439901351929 second, num_workers=22

From the results, we can see that num_workers=2 took 329 seconds, while the optimal value here is 14, at about 85 seconds. So I would advise you to run the code snippet above and tune this value for your own setup. It can speed up your training a lot.

Reference: https://deeplizard.com/learn/video/kWVgvsejXsE
