To keep the other thread focused: Is there a way to manage a dataset of about 1 million files in AzureML? The files are about 4 KB each of binary data and are entirely independent from each other.
We are already using a partitioned dataset, but my main complaint is more specifically about AzureML (and maybe Python): it takes a long, long while to provision a compute that can handle that, and a partitioned dataset has to list all the files in order to map each one to a callable that loads its data into memory (roughly the access pattern sketched below).
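For context, the access pattern looks roughly like this. This is only a sketch: the import paths assume a recent kedro-datasets (older Kedro versions spell it `kedro.io.PartitionedDataSet`), and the account, path, and per-file loader are placeholders, not our actual setup.

```python
# Sketch of the PartitionedDataset pattern described above (placeholder names).
from kedro_datasets.partitions import PartitionedDataset
from kedro_datasets.pickle import PickleDataset

dataset = PartitionedDataset(
    path="abfs://my-container/my-folder",       # hypothetical blob/ADLS path
    dataset=PickleDataset,                      # placeholder loader per file
    credentials={"account_name": "myaccount"},  # placeholder credentials
)

# load() must first enumerate every file under `path` (slow with ~1M blobs),
# then returns {partition_id: callable}, so each file is only read on demand.
partitions = dataset.load()
for partition_id, load_partition in partitions.items():
    data = load_partition()  # actually reads this one ~4 KB blob
    ...
```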
My main issue is that it takes about 15 minutes to go from "pipeline is started" to actually getting inside my node's function (and that is true regardless of the size of the image, although I have to say our image is quite big: some specialized tools plus the NVIDIA drivers add up).
Is it on blob storage? I suspect a lot of that time is spent just iterating the directory and opening a million files...
IIRC Kedro uses fsspec, and that is much slower at enumerating large numbers of files than e.g. pyarrowfs-adlgen2, which uses a newer ADLS API. Maybe you want to look into that.
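The enumeration path there looks roughly like this. Untested sketch: the account name, container, and prefix are placeholders, and `AccountHandler.from_account_name` is the name I remember from the pyarrowfs-adlgen2 README, so double-check it against your installed version.

```python
import pyarrow.fs
import pyarrowfs_adlgen2
from azure.identity import DefaultAzureCredential

# Wrap the ADLS Gen2 account in a pyarrow-compatible filesystem
# (names taken from the pyarrowfs-adlgen2 README; treat as assumptions).
handler = pyarrowfs_adlgen2.AccountHandler.from_account_name(
    "myaccount", DefaultAzureCredential()  # placeholder account name
)
fs = pyarrow.fs.PyFileSystem(handler)

# Recursively list everything under container/prefix via the Gen2 API,
# which is the part that is claimed to be faster than fsspec's listing.
selector = pyarrow.fs.FileSelector("my-container/my-folder", recursive=True)
infos = fs.get_file_info(selector)
print(len(infos), "entries found")
```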
Yes, it is on blob storage, and yes, just enumerating that many files is an issue (even locally on an NVMe drive).
There is a way to mount blob storage into AML's compute, but it does not seem to be an option with Kedro-AzureML (and preliminary tests suggest mounts are about as slow as fsspec). Downloading (as in, modifying the command sent to AML so that it downloads the dataset for us) seems faster (I'm guessing the download is parallelized, versus listing everything and then downloading it piece by piece), but again, that is not supported by kedro-azureml, though it seems doable, if not exactly trivial; see the sketch below.
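Outside of kedro-azureml, the "ask AML to download the input for us" variant would look roughly like this with the v2 azure-ai-ml SDK. Only a sketch: the script, datastore path, environment, and compute target are placeholders, and this is not something kedro-azureml generates today.

```python
from azure.ai.ml import Input, command
from azure.ai.ml.constants import AssetTypes, InputOutputModes

# Ask AzureML to materialise the dataset on the compute before the user
# code starts, instead of mounting it lazily. All names are placeholders.
job = command(
    command="python run_node.py --data ${{inputs.raw_files}}",
    inputs={
        "raw_files": Input(
            type=AssetTypes.URI_FOLDER,
            path="azureml://datastores/my_blob_store/paths/my-folder/",
            mode=InputOutputModes.DOWNLOAD,    # download up front (parallelized)
            # mode=InputOutputModes.RO_MOUNT   # the slower mount alternative
        )
    },
    environment="my-big-image@latest",         # placeholder environment
    compute="gpu-cluster",                     # placeholder compute target
)
```

Submitting would then go through the usual `MLClient.jobs.create_or_update(job)` call; the interesting part is only the `mode` on the input.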