Mounting an AWS EFS volume to a Kubeflow pipeline with Kedro

Hey folks, I am looking for a way to mount an AWS EFS volume to my Kedro pipeline, which will be executed by Kubeflow. I am using the kedro-kubeflow plugin.
The config has the two options below for volumes, and I am not sure which one serves what purpose:

1.
  volume:

    # Storage class - use null (or no value) to use the default storage
    # class deployed on the Kubernetes cluster
    storageclass: # default

    # The size of the volume that is created. Applicable for some storage
    # classes
    size: 1Gi

    # Access mode of the volume used to exchange data. ReadWriteMany is
    # preferred, but it is not supported on some environments (like GKE)
    # Default value: ReadWriteOnce
    #access_modes: [ReadWriteMany]

    # Flag indicating if the data-volume-init step (copying raw data to the
    # fresh volume) should be skipped
    skip_init: False

    # Allows specifying the user executing pipelines within containers
    # Default: root user (to avoid issues with volumes in GKE)
    owner: 0

    # Flag indicating if the volume for inter-node data exchange should be
    # kept after the pipeline is deleted
    keep: False
2.
  # Optional section to allow mounting additional volumes (such as EmptyDir)
  # to specific nodes
  extra_volumes:
    tensorflow_step:
    - mount_path: /dev/shm
      volume:
        name: shared_memory
        empty_dir:
          cls: V1EmptyDirVolumeSource
          params:
            medium: Memory

8 comments

Can you give some thoughts here?

  1. Is used as the "main" volume, which will be mounted under /home/kedro/data.
  2. Is used for "extras", meaning your use-case-specific needs: if you need an additional volume for any purpose, you can attach it using this method. The most common use case is the one in the example: extending /dev/shm for distributed training in PyTorch (the default shared memory in Kubernetes pods is too small for that).
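For example, if the EFS share is already exposed in the cluster as a PVC, option 2 can attach it to a single node as well. A minimal sketch, using the same cls/params convention as the empty_dir example above; the node name train_step and the claim name efs-claim are placeholders, and I have not verified this exact snippet against the plugin:

  extra_volumes:
    train_step:
    # Mount an existing EFS-backed PVC into this node's container
    - mount_path: /mnt/efs
      volume:
        name: efs_data
        persistent_volume_claim:
          cls: V1PersistentVolumeClaimVolumeSource
          params:
            claim_name: efs-claim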

Can I conclude that the first volume is the one I need to configure if I want to use the EFS system?

Also, the storage class is something I need to check with the k8s cluster manager for the EFS I want to mount.

This is how our EFS system is exposed as a PersistentVolume in our Kubernetes cluster:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: data-pv-kubeflow
spec:
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 100Gi
  csi:
    driver: efs.csi.aws.com
    volumeHandle: "fs-02d6475f7552a3c13:/data"
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  volumeMode: Filesystem
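
The claim side that binds this PV looks roughly like the following. The name and namespace here are made up, and only one claim can bind a statically provisioned PV at a time:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc-kubeflow    # made-up name
  namespace: kubeflow        # adjust to the namespace the pipeline runs in
spec:
  accessModes:
    - ReadWriteMany
  # Must match the PV's storageClassName for the claim to bind
  storageClassName: efs-sc
  resources:
    requests:
      storage: 100Gi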

So as per our discussion, storageClassName: efs-sc is what we need to use as the storage class, right?

I defined the above storage class as shown below:

  # Optional volume specification
  volume:
    storageclass: efs-sc

    access_modes: [ReadWriteMany]

    # Flag indicating if the data-volume-init step (copying raw data to the
    # fresh volume) should be skipped
    skip_init: False

    # Allows specifying the user executing pipelines within containers
    # Default: root user (to avoid issues with volumes in GKE)
    owner: 0

    # Flag indicating if the volume for inter-node data exchange should be
    # kept after the pipeline is deleted
    keep: False

Logs -

                    INFO     Loading data from companies     data_catalog.py:539
                             (CSVDataset)...                                    
                    INFO     Running node:                           node.py:364
                             preprocess_companies_node:                         
                             preprocess_companies([companies]) ->               
                             [preprocessed_companies]                           
                    DEBUG    Inside Preprocess Companies             nodes.py:32
                    DEBUG    Checking EFS Mount now                  nodes.py:33
                    DEBUG    ['01_raw']                              nodes.py:34

This only shows ['01_raw'], so it seems the EFS is still not accessible.

Any heads-up?
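
For what it's worth, I am going to check with kubectl get pvc in the pipeline's namespace whether the claim the plugin creates actually ends up Bound rather than Pending. If efs-sc is expected to provision volumes dynamically instead of binding our static PV, my understanding is that the EFS CSI driver needs an access-point style StorageClass, roughly like below (fileSystemId is taken from our PV above; directoryPerms is a guess):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  # Dynamic provisioning via EFS access points
  provisioningMode: efs-ap
  # File system id from the PV definition above
  fileSystemId: fs-02d6475f7552a3c13
  # Permissions for the per-volume directory; a guess, adjust as needed
  directoryPerms: "700"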

Can you also look into this thread?

Unfortunately I have never used Kubeflow myself, so I can't be much help here.

As we discussed, I tried doing that, but no success. Can you also look into it?
