Getting Started with DVC and MinIO

🤔 What Are DVC and MinIO

DVC

DVC is a tool designed for managing large datasets and machine learning models. Think of it like Git, but for data. It helps you track changes to your data files, datasets, and model files over time, just like Git tracks changes in code. This is especially useful in machine learning projects where data is constantly being updated and you want to keep a history of these changes, collaborate with others, or reproduce results.

MinIO

MinIO is a high-performance, open-source object storage system. It is similar to Amazon S3, which is a popular cloud storage service. MinIO allows you to store and manage large amounts of data (like datasets, images, videos, etc.) in a way that is scalable and easy to access. You can think of it as your personal cloud storage solution, which can be hosted on your servers.

How DVC connects with a MinIO S3 bucket

When you work with DVC, your data can become quite large. Instead of storing this data directly in your Git repository (which is not practical for large files), you store it in an external storage system like MinIO. DVC keeps track of where your data is stored and ensures that the correct versions of your datasets are linked to your project.

🚀 Let's get started!

To run MinIO locally, we'll use the following docker-compose.yaml configuration. This setup will create a container for MinIO, enabling you to easily manage your storage needs.

services:
  minio:
    image: quay.io/minio/minio:latest
    container_name: minio
    environment:
      MINIO_ROOT_USER: ${MINIO_ROOT_USER}
      MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD}
    ports:
      - "${MINIO_ENDPOINT_PORT}:9000"
      - "${MINIO_WEBUI_PORT}:9001"
    volumes:
      - minio_data:/data
    command: server /data --console-address ":9001"

volumes:
  minio_data:

🛠 Setting Up Environment Variables

Before running the container, make sure you add a .env file to define your environment variables. This file will provide the necessary credentials and port configurations for MinIO.

MINIO_ROOT_USER=minio
MINIO_ROOT_PASSWORD=minio123
MINIO_ENDPOINT_PORT=12000
MINIO_WEBUI_PORT=12001

Now we can run the container we've defined with docker-compose up -d.

💡

This command will start the MinIO container in detached mode, meaning it will run in the background. Once MinIO is up and running, you can access the MinIO web console by navigating to http://localhost:12001 in your browser. Log in using the credentials you specified in the .env file.
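If you want to double-check the server from code before moving on, MinIO exposes an unauthenticated liveness endpoint. This is an optional sketch; the host and port below assume the localhost setup and the MINIO_ENDPOINT_PORT value (12000) from the .env above.

```python
# Optional sanity check: ping MinIO's liveness probe (no credentials required).
import urllib.request


def health_url(host: str = "localhost", port: int = 12000) -> str:
    """Build the MinIO liveness-probe URL for a given endpoint."""
    return f"http://{host}:{port}/minio/health/live"


def minio_is_up(host: str = "localhost", port: int = 12000) -> bool:
    """Return True if the MinIO server answers its liveness probe."""
    try:
        with urllib.request.urlopen(health_url(host, port), timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False


# Example: print(minio_is_up())  # True once `docker-compose up -d` has finished
```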

🌱 Downloading the Dataset from Kaggle

Next, we'll download the Crop Recommendation Dataset from Kaggle. This dataset will be stored in your project's structure, ready to be used with DVC.

  1. Download the Dataset: Head over to Kaggle and download the dataset.
  2. Organize Your Project Structure: Place the downloaded dataset in your project directory, structured as follows:
    β”œβ”€β”€ data/
    β”‚   β”œβ”€β”€ raw/
    β”‚   β”‚   └── Crop_recommendation.csv
    β”œβ”€β”€ docker-compose.yaml
    β”œβ”€β”€ .env
    

🔧 Setting Up DVC with MinIO

Once you've set up MinIO and organized your project structure, the next step is to configure DVC (Data Version Control) to use MinIO as your storage backend. This involves setting up DVC to interact with MinIO securely using your access and secret keys.

🛠 Step 1: Initialize DVC in Your Project

First, ensure that DVC is initialized in your project. If you haven't done so already, run the following command in your project directory:

dvc init
🚨

If you encounter the following error when trying to initiate DVC:

ERROR: failed to initiate DVC - ~/dvc-tutorial is not tracked by any supported SCM tool (e.g. Git). Use --no-scm if you don't want to use any SCM or --subdir if initializing inside a subdirectory of a parent SCM repository.

This error means that DVC requires your project to be tracked by a Source Control Management (SCM) tool like Git. Essentially, DVC expects your project to be under version control so it can manage changes effectively. Before running dvc init, make sure your project is initialized as a Git repository by running git init. Once Git is initialized, you can proceed with dvc init.

🌐 Step 2: Access the MinIO Console and Create a Bucket

Before configuring DVC to use MinIO, you need to create a bucket where your data will be stored.

  1. Access the MinIO Web Console: Open your web browser and navigate to the MinIO console URL:
    http://localhost:MINIO_WEBUI_PORT
    
    💡

    Replace MINIO_WEBUI_PORT with the port you specified for the MinIO web console (e.g., 12001).

  2. Log In: Use the root user and password you defined in the .env file to log in.
  3. Create a New Bucket:
    • Once logged in, navigate to Administrator/Buckets, then click the + icon or the Create Bucket button in the MinIO dashboard.
    • Name your bucket (e.g., my-bucket) and complete the creation process.
  4. Create New Access and Secret Keys:
    • Navigate to the "User/Access Keys" section in the MinIO console and click the Create access key + button; you will be redirected to the Create Access Key form.
    • Adjust the expiry of the access key if needed, then click Create.
    • Copy your Access Key and Secret Key to a safe place, for example your .env file.
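If you keep the new keys in your .env, it might look like this (the variable names MINIO_ACCESS_KEY and MINIO_SECRET_KEY are just a suggested convention, and the values below are placeholders for the keys you generated):

```
MINIO_ACCESS_KEY=<YOUR_MINIO_ACCESS_KEY>
MINIO_SECRET_KEY=<YOUR_MINIO_SECRET_KEY>
```

Note that docker-compose only needs the root credentials; these new keys are for DVC.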

🔗 Step 3: Add MinIO as a Remote Storage in DVC

Now that your bucket is ready and you have your new access and secret keys, let's configure DVC to use MinIO as a remote storage location. You can do this by running

dvc remote add -d minio-remote s3://BUCKET_NAME
💡

Replace BUCKET_NAME with the name of the bucket you created in MinIO (e.g., my-bucket).

ℹ️

s3://BUCKET_NAME is the S3-compatible URL pointing to your MinIO bucket.

Sometimes DVC reports an error saying that no remote is set as the default. If that happens, set minio-remote as the default remote:

dvc remote default minio-remote

🌐 Step 4: Configure MinIO Endpoint URL

Since MinIO is running locally or on a custom server, you also need to specify the endpoint URL where MinIO is accessible. Use the following command

dvc remote modify minio-remote endpointurl http://localhost:MINIO_ENDPOINT_PORT
💡

Replace MINIO_ENDPOINT_PORT with the port number you've configured for MinIO (e.g., 12000).

βš™οΈ Step 5: Configuring the Remote Locally

For additional security or project-specific settings, you might want to configure your DVC remote locally. This ensures that your access credentials aren’t pushed to a shared repository. To set this up, use

dvc remote modify minio-remote --local access_key_id <YOUR_MINIO_ACCESS_KEY>
dvc remote modify minio-remote --local secret_access_key <YOUR_MINIO_SECRET_KEY>
ℹ️

Step 4 and Step 5 generate configuration files in the .dvc directory named config and config.local.
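For reference, after Steps 3-5 the two files should look roughly like this (my-bucket and the ports follow the example values used in this tutorial):

```
# .dvc/config (safe to commit)
[core]
    remote = minio-remote
['remote "minio-remote"']
    url = s3://my-bucket
    endpointurl = http://localhost:12000

# .dvc/config.local (credentials; DVC keeps this file out of Git via .dvc/.gitignore)
['remote "minio-remote"']
    access_key_id = <YOUR_MINIO_ACCESS_KEY>
    secret_access_key = <YOUR_MINIO_SECRET_KEY>
```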

📂 Step 6: Add Your Data to DVC

Now that DVC is configured with MinIO, it's time to track your data files. Navigate to the directory containing your dataset (e.g., data/raw/) and run

dvc add data/raw/Crop_recommendation.csv
ℹ️

This command will track the specified file with DVC and generate a .dvc file in your project, which contains metadata about the tracked data.
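The generated Crop_recommendation.csv.dvc is a small YAML pointer file that you commit to Git in place of the data itself. Its contents look roughly like this (the hash and size below are illustrative, not the dataset's real values):

```
outs:
- md5: d8e8fca2dc0f896fd7cb4cb0031ba249
  size: 150034
  path: Crop_recommendation.csv
```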

🚀 Step 7: Push Data to MinIO

Finally, push the tracked data to your MinIO remote using

dvc push
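To convince yourself the push actually landed in the bucket, you can list its contents with boto3 (pip install boto3). This is an optional sketch, not part of the tutorial's required steps; the endpoint, bucket name, and credential placeholders follow the example values used above.

```python
def minio_client_kwargs(endpoint: str, access_key: str, secret_key: str) -> dict:
    """Connection settings for an S3-compatible MinIO endpoint."""
    return {
        "endpoint_url": endpoint,
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }


def list_bucket(bucket: str, endpoint: str, access_key: str, secret_key: str) -> list:
    """Return the object keys currently stored in the bucket."""
    import boto3  # pip install boto3

    s3 = boto3.client("s3", **minio_client_kwargs(endpoint, access_key, secret_key))
    # DVC stores pushed files under content-addressed keys (derived from each
    # file's hash), not under their original file names.
    return [obj["Key"] for obj in s3.list_objects_v2(Bucket=bucket).get("Contents", [])]


# Example:
# print(list_bucket("my-bucket", "http://localhost:12000",
#                   "<YOUR_MINIO_ACCESS_KEY>", "<YOUR_MINIO_SECRET_KEY>"))
```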

🔄 Step 8: Updating the Dataset

If you ever make changes to your dataset, such as adding new data, removing entries, or modifying existing data, you'll need to update DVC to track these changes. Here's how:

  1. Add the Updated Dataset: Use the dvc add command to track the new version of the dataset.
    dvc add data/raw/Crop_recommendation.csv
    
  2. Commit the Changes: Next, commit the changes to your Git repository. This will ensure that your DVC project is in sync with the latest dataset.
    git add data/raw/Crop_recommendation.csv.dvc
    git commit -m "Updated dataset with new changes"
    
  3. Push the Updated Data: Finally, push the updated dataset to your MinIO remote.
    dvc push
    

🔄 Pull Data from MinIO

If you're working on a different machine or need to retrieve the dataset and model files from MinIO, you can use the dvc pull command. This command will download the tracked data from your remote storage back into your local project.

dvc pull

🎉 Bonus: Working with Scripts

Managing datasets is just one part of a data science project. Often, you'll also have scripts that process data, train models, and evaluate results. DVC can help you manage these scripts too, ensuring that they are tracked and executed in the correct order.

🛠 Step 1: Add Your Scripts

Let's say you have several scripts for different stages of your workflow:

  • cleaning.py: Cleans the dataset by checking for missing values and normalizing the data.
  • split.py: Splits the dataset into training and testing sets.
  • train.py: Trains a machine learning model on the training set.
  • test.py: Tests the trained model and evaluates its performance.
ℹ️

Here are the scripts you can use for this tutorial.
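As an illustration of what cleaning.py could contain, here is a minimal sketch: it drops rows with missing values and min-max normalizes the numeric columns. The exact columns and the choice of normalization are assumptions, so adapt it to the actual tutorial scripts.

```python
# Sketch of a cleaning step: drop missing rows, min-max scale numeric columns.
import pandas as pd


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing values and min-max normalize numeric columns."""
    df = df.dropna().copy()
    numeric = df.select_dtypes("number").columns
    df[numeric] = (df[numeric] - df[numeric].min()) / (
        df[numeric].max() - df[numeric].min()
    )
    return df


# In cleaning.py, the stage's input and output paths would be wired up like:
# clean(pd.read_csv("data/raw/Crop_recommendation.csv")).to_csv(
#     "data/processed/Crop_recommendation_cleaned.csv", index=False)
```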

🗂 Step 2: Organize Your Project Structure

First, organize your project so that all your scripts are placed in a scripts directory. This keeps everything neat and easy to manage.

.
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ raw/
β”‚   β”œβ”€β”€ processed/
β”œβ”€β”€ scripts/
β”‚   β”œβ”€β”€ cleaning.py
β”‚   β”œβ”€β”€ split.py
β”‚   β”œβ”€β”€ train.py
β”‚   β”œβ”€β”€ test.py
β”œβ”€β”€ model/
β”‚   └── model.pkl
β”œβ”€β”€ .dvc/
β”œβ”€β”€ .gitignore
β”œβ”€β”€ dvc.yaml
β”œβ”€β”€ .env
└── docker-compose.yaml

🔗 Step 3: Create DVC Stages for Each Script

Next, use DVC to create stages for each of these scripts. This ensures that each script is executed in the correct order, and the outputs are tracked properly.

  1. Data Cleaning Stages

    dvc stage add -n clean_data -d scripts/cleaning.py -d data/raw/Crop_recommendation.csv -o data/processed/Crop_recommendation_cleaned.csv python scripts/cleaning.py
    
  2. Data Splitting Stages

    dvc stage add -n split_data -d scripts/split.py -d data/processed/Crop_recommendation_cleaned.csv -o data/processed/train.csv -o data/processed/test.csv python scripts/split.py
    
  3. Model Training Stages

    dvc stage add -n train_model -d scripts/train.py -d data/processed/train.csv -o model/model.pkl python scripts/train.py
    
  4. Model Testing Stages

    dvc stage add -n test_model -d scripts/test.py -d model/model.pkl -d data/processed/test.csv python scripts/test.py
    
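Taken together, the four commands above produce a dvc.yaml along these lines:

```
stages:
  clean_data:
    cmd: python scripts/cleaning.py
    deps:
    - data/raw/Crop_recommendation.csv
    - scripts/cleaning.py
    outs:
    - data/processed/Crop_recommendation_cleaned.csv
  split_data:
    cmd: python scripts/split.py
    deps:
    - data/processed/Crop_recommendation_cleaned.csv
    - scripts/split.py
    outs:
    - data/processed/test.csv
    - data/processed/train.csv
  train_model:
    cmd: python scripts/train.py
    deps:
    - data/processed/train.csv
    - scripts/train.py
    outs:
    - model/model.pkl
  test_model:
    cmd: python scripts/test.py
    deps:
    - data/processed/test.csv
    - model/model.pkl
```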

🚀 Step 4: Running the Pipeline

Once all the stages are set up, you can run your entire pipeline with a single command:

dvc repro

This command will execute each stage in the order you've defined, from cleaning the data to training and testing the model. DVC tracks everything, so you can reproduce your results at any time, or revert to previous versions if needed.

🎯 Step 5: Push Everything to MinIO

After running your pipeline, push all your data, models, and DVC metadata to MinIO

dvc push

This ensures that your datasets and models are safely stored and versioned in your MinIO bucket, while your scripts and DVC metadata remain versioned in Git.

📄 Example Output

$ dvc repro
'data\raw\Crop_recommendation.csv.dvc' didn't change, skipping
Running stage 'clean_data':
> python scripts/cleaning.py
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'

Running stage 'split_data':
> python scripts/split.py
Updating lock file 'dvc.lock'

Running stage 'train_model':
> python scripts/train.py
Updating lock file 'dvc.lock'

Running stage 'test_model':
> python scripts/test.py
Model Accuracy: 98.64%
Updating lock file 'dvc.lock'

To track the changes with git, run:

        git add 'data\processed\.gitignore' dvc.lock 'model\.gitignore'

To enable auto staging, run:

        dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.

$ dvc push
Collecting                                                                                                                                   |5.00 [00:00,  277entry/s] 
Pushing
4 files pushed

πŸ“ Final Thoughts

By integrating the scripts with DVC, you've created a powerful, reproducible workflow that handles everything from data preparation to model evaluation. Whether you're working on your local machine or collaborating with others, DVC ensures that your work is organized, tracked, and easily shareable.