To keep the other thread focused: Is there a way to manage a dataset of about 1 million files in AzureML? The files are about 4 KB each of binary data and are entirely independent from each other.
We are already using a partitioned dataset, but my main complaint is more specifically about AzureML (and maybe Python): it takes a long, long while to provision a compute that can handle that, and a partitioned dataset has to list all the files in order to map each one to a callable that loads its data into memory (roughly the access pattern sketched below).
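For context, the access pattern looks roughly like this. This is only a sketch: the import paths assume a recent kedro-datasets (older Kedro versions spell it `kedro.io.PartitionedDataSet`), and the account, path, and per-file loader are placeholders, not our actual setup.

```python
# Sketch of the PartitionedDataset pattern described above (placeholder names).
from kedro_datasets.partitions import PartitionedDataset
from kedro_datasets.pickle import PickleDataset

dataset = PartitionedDataset(
    path="abfs://my-container/my-folder",       # hypothetical blob/ADLS path
    dataset=PickleDataset,                      # placeholder loader per file
    credentials={"account_name": "myaccount"},  # placeholder credentials
)

# load() must first enumerate every file under `path` (slow with ~1M blobs),
# then returns {partition_id: callable}, so each file is only read on demand.
partitions = dataset.load()
for partition_id, load_partition in partitions.items():
    data = load_partition()  # actually reads this one ~4 KB blob
    ...
```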
My main issue is that it takes about 15 minutes to go from "pipeline is started" to actually getting inside my node's function (and that is true regardless of the size of the image, although I have to say our image is quite big: some specialized tools plus the NVIDIA drivers add up).
Is it on blob storage? I suspect a lot of that time is spent just iterating the directory and opening a million files...
IIRC Kedro uses fsspec, and that is much slower at enumerating large numbers of files than e.g. pyarrowfs-adlgen2, which uses a newer ADLS API. Maybe you want to look into that.
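The enumeration path there looks roughly like this. Untested sketch: the account name, container, and prefix are placeholders, and `AccountHandler.from_account_name` is the name I remember from the pyarrowfs-adlgen2 README, so double-check it against your installed version.

```python
import pyarrow.fs
import pyarrowfs_adlgen2
from azure.identity import DefaultAzureCredential

# Wrap the ADLS Gen2 account in a pyarrow-compatible filesystem
# (names taken from the pyarrowfs-adlgen2 README; treat as assumptions).
handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    "myaccount", DefaultAzureCredential()  # placeholder account name
)
fs = pyarrow.fs.PyFileSystem(handler)

# Recursively list everything under container/prefix via the Gen2 API,
# which is the part that is claimed to be faster than fsspec's listing.
selector = pyarrow.fs.FileSelector("my-container/my-folder", recursive=True)
infos = fs.get_file_info(selector)
print(len(infos), "entries found")
```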
Yes, it is on blob storage, and yes, just enumerating that many files is an issue (even locally on an NVMe drive).
There is a way to mount blob storage into AML's compute, but it does not seem to be an option with Kedro-AzureML (and preliminary tests suggest mounts are about as slow as fsspec). Downloading (as in, modifying the command sent to AML so that it downloads the dataset for us) seems faster (I'm guessing the download is parallelized, versus listing everything and then downloading it piece by piece), but again, that is not supported by kedro-azureml, though it seems doable, if not exactly trivial; see the sketch below.
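Outside of kedro-azureml, the "ask AML to download the input for us" variant would look roughly like this with the v2 azure-ai-ml SDK. Only a sketch: the script, datastore path, environment, and compute target are placeholders, and this is not something kedro-azureml generates today.

```python
from azure.ai.ml import Input, command
from azure.ai.ml.constants import AssetTypes, InputOutputModes

# Ask AzureML to materialise the dataset on the compute before the user
# code starts, instead of mounting it lazily. All names are placeholders.
job = command(
    command="python run_node.py --data ${{inputs.raw_files}}",
    inputs={
        "raw_files": Input(
            type=AssetTypes.URI_FOLDER,
            path="azureml://datastores/my_blob_store/paths/my-folder/",
            mode=InputOutputModes.DOWNLOAD,    # download up front (parallelized)
            # mode=InputOutputModes.RO_MOUNT   # the slower mount alternative
        )
    },
    environment="my-big-image@latest",         # placeholder environment
    compute="gpu-cluster",                     # placeholder compute target
)
```

Submitting would then go through the usual `MLClient.jobs.create_or_update(job)` call; the interesting part is only the `mode` on the input.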