What is DVC and MinIO?
DVC
DVC is a tool designed for managing large datasets and machine learning models. Think of it like Git, but for data. It helps you track changes to your data files, datasets, and model files over time, just like Git tracks changes in code. This is especially useful in machine learning projects where data is constantly being updated and you want to keep a history of these changes, collaborate with others, or reproduce results.
MinIO
MinIO is a high-performance, open-source object storage system. It is similar to Amazon S3, which is a popular cloud storage service. MinIO allows you to store and manage large amounts of data (like datasets, images, videos, etc.) in a way that is scalable and easy to access. You can think of it as your personal cloud storage solution, which can be hosted on your servers.
How DVC connects with a MinIO S3 bucket
When you work with DVC, your data can become quite large. Instead of storing this data directly in your Git repository (which is not practical for large files), you store it in an external storage system like MinIO. DVC keeps track of where your data is stored and ensures that the correct versions of your datasets are linked to your project.
Let's get started!
To run MinIO locally, we'll use the following docker-compose.yaml configuration. This setup will create a container for MinIO, enabling you to easily manage your storage needs.
services:
  minio:
    image: quay.io/minio/minio:latest
    container_name: minio
    environment:
      MINIO_ROOT_USER: ${MINIO_ROOT_USER}
      MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD}
    ports:
      - "${MINIO_ENDPOINT_PORT}:9000"
      - "${MINIO_WEBUI_PORT}:9001"
    volumes:
      - minio_data:/data
    command: server /data --console-address ":9001"
volumes:
  minio_data:
Setting Up Environment Variables
Before running the container, make sure you add a .env file to define your environment variables. This file will provide the necessary credentials and port configurations for MinIO.
MINIO_ROOT_USER=minio
MINIO_ROOT_PASSWORD=minio123
MINIO_ENDPOINT_PORT=12000
MINIO_WEBUI_PORT=12001
Now we can start the container we've defined with docker-compose up -d.
This command will start the MinIO container in detached mode, meaning it will run in the background. Once MinIO is up and running, you can access the MinIO web console by navigating to http://localhost:12001 in your browser. Log in using the credentials you specified in the .env file.
Downloading the Dataset from Kaggle
Next, we'll download the Crop Recommendation Dataset from Kaggle. This dataset will be stored in your project's structure, ready to be used with DVC.
- Download the Dataset: Head over to Kaggle and download the dataset.
- Organize Your Project Structure: Place the downloaded dataset in your project directory, structured as follows:
.
├── data/
│   └── raw/
│       └── Crop_recommendation.csv
├── docker-compose.yaml
└── .env
Setting Up DVC with MinIO
Once you've set up MinIO and organized your project structure, the next step is to configure DVC (Data Version Control) to use MinIO as your storage backend. This involves setting up DVC to interact with MinIO securely using your access and secret keys.
Step 1: Initialize DVC in Your Project
First, ensure that DVC is initialized in your project. If you haven't done so already, run the following command in your project directory:
dvc init
If you encounter the following error when trying to initialize DVC:
ERROR: failed to initiate DVC - ~/dvc-tutorial is not tracked by any supported SCM tool (e.g. Git). Use --no-scm if you don't want to use any SCM or --subdir if initializing inside a subdirectory of a parent SCM repository.
This error means that DVC requires your project to be tracked by a source control management (SCM) tool like Git; DVC expects your project to be under version control so it can manage changes effectively. Before running dvc init, initialize your project as a Git repository with
git init
Once Git is initialized, you can proceed with dvc init.
Step 2: Access the MinIO Console and Create a Bucket
Before configuring DVC to use MinIO, you need to create a bucket where your data will be stored.
- Access the MinIO Web Console: Open your web browser and navigate to the MinIO console URL: http://localhost:MINIO_WEBUI_PORT. Replace MINIO_WEBUI_PORT with the port you specified for the MinIO web console (e.g., 12001).
- Log In: Use the root user and password you defined in the .env file to log in.
- Create a New Bucket: Once logged in, navigate to Administrator/Buckets in the sidebar, click the + icon or the Create Bucket button, name your bucket (e.g., my-bucket), and complete the creation process.
- Create New Access and Secret Keys: Navigate to the User/Access Keys section in the MinIO console, click the Create access key + button, and you will be redirected to the Create Access Key form. Don't forget to set an Expire date for the access key, then click Create. Finally, copy your Access Key and Secret Key to a safe place, for example your .env file.
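If you keep the keys in your .env file, you can append them like this (the MINIO_ACCESS_KEY and MINIO_SECRET_KEY names are just a suggestion; nothing reads them automatically, they are only a safe place to copy the values from later):

```
MINIO_ACCESS_KEY=<YOUR_MINIO_ACCESS_KEY>
MINIO_SECRET_KEY=<YOUR_MINIO_SECRET_KEY>
```

Make sure .env is listed in .gitignore so the keys never reach your Git history.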
Step 3: Add MinIO as a Remote Storage in DVC
Now that your bucket is ready and you have your new access and secret keys, let's configure DVC to use MinIO as a remote storage location. You can do this by running
dvc remote add -d minio-remote s3://BUCKET_NAME
Replace BUCKET_NAME with the name of the bucket you created in the MinIO console (e.g., my-bucket); s3://BUCKET_NAME is the S3-compatible URL pointing to your MinIO bucket. The -d flag marks minio-remote as the default remote.
If you omitted the -d flag, or DVC reports an error saying that no remote is set as the default, set minio-remote as the default remote explicitly:
dvc remote default minio-remote
Step 4: Configure MinIO Endpoint URL
Since MinIO is running locally or on a custom server, you also need to specify the endpoint URL where MinIO is accessible. Use the following command
dvc remote modify minio-remote endpointurl http://localhost:MINIO_ENDPOINT_PORT
Replace MINIO_ENDPOINT_PORT with the port number you've configured for MinIO (e.g., 12000).
Step 5: Configuring the Remote Locally
For additional security or project-specific settings, you might want to configure your DVC remote locally. This ensures that your access credentials aren't pushed to a shared repository. To set this up, use
dvc remote modify minio-remote --local access_key_id <YOUR_MINIO_ACCESS_KEY>
dvc remote modify minio-remote --local secret_access_key <YOUR_MINIO_SECRET_KEY>
Step 4 and Step 5 will generate configuration files in the .dvc directory, called config and config.local respectively.
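For reference, here is roughly what the two files look like after Steps 4 and 5 (the bucket name and port follow the examples above; your values will differ):

```
# .dvc/config (committed to Git)
[core]
    remote = minio-remote
['remote "minio-remote"']
    url = s3://my-bucket
    endpointurl = http://localhost:12000

# .dvc/config.local (kept out of Git)
['remote "minio-remote"']
    access_key_id = <YOUR_MINIO_ACCESS_KEY>
    secret_access_key = <YOUR_MINIO_SECRET_KEY>
```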
Step 6: Add Your Data to DVC
Now that DVC is configured with MinIO, it's time to track your data files. Navigate to the directory containing your dataset (e.g., data/raw/) and run
dvc add data/raw/Crop_recommendation.csv
This command will track the specified file with DVC and generate a .dvc file in your project, which contains metadata about the tracked data.
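The generated .dvc file is a small YAML pointer; its contents look roughly like this (the md5 and size values below are placeholders, yours will differ):

```
outs:
- md5: 0123456789abcdef0123456789abcdef
  size: 150000
  path: Crop_recommendation.csv
```

The actual data moves into DVC's cache, and this small pointer file is what you commit to Git.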
Step 7: Push Data to MinIO
Finally, push the tracked data to your MinIO remote using
dvc push
Step 8: Updating the Dataset
If you ever make changes to your dataset, such as adding new data, removing entries, or modifying existing data, you'll need to update DVC to track these changes. Here's how:
- Add the Updated Dataset: Use the dvc add command to track the new version of the dataset.
dvc add data/raw/Crop_recommendation.csv
- Commit the Changes: Next, commit the changes to your Git repository. This will ensure that your DVC project is in sync with the latest dataset.
git add data/raw/Crop_recommendation.csv.dvc
git commit -m "Updated dataset with new changes"
- Push the Updated Data: Finally, push the updated dataset to your MinIO remote.
dvc push
Pull Data from MinIO
If you're working on a different machine or need to retrieve the dataset and model files from MinIO, you can use the dvc pull command. This command will download the tracked data from your remote storage back into your local project.
dvc pull
Bonus: Working with Scripts
Managing datasets is just one part of a data science project. Often, you'll also have scripts that process data, train models, and evaluate results. DVC can help you manage these scripts too, ensuring that they are tracked and executed in the correct order.
Step 1: Add Your Scripts
Let's say you have several scripts for different stages of your workflow:
- cleaning.py: Cleans the dataset by checking for missing values and normalizing the data.
- split.py: Splits the dataset into training and testing sets.
- train.py: Trains a machine learning model on the training set.
- test.py: Tests the trained model and evaluates its performance.
Here are the scripts you can use for this tutorial.
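As an example, a minimal cleaning.py might look like the sketch below. The original tutorial's scripts are not reproduced here, so the specifics (dropping rows with missing values, min-max normalization of the numeric columns) are assumptions:

```python
# cleaning.py - a minimal sketch of the data-cleaning step.
# Dropping missing rows and min-max scaling are assumptions made for
# illustration; adapt them to your actual cleaning logic.
import os

import pandas as pd

RAW = "data/raw/Crop_recommendation.csv"
OUT = "data/processed/Crop_recommendation_cleaned.csv"

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing values, then min-max scale numeric columns to [0, 1]."""
    df = df.dropna().copy()
    numeric = df.select_dtypes("number").columns
    df[numeric] = (df[numeric] - df[numeric].min()) / (df[numeric].max() - df[numeric].min())
    return df

# Only touch the filesystem when run as a script and the raw data is present.
if __name__ == "__main__" and os.path.exists(RAW):
    os.makedirs(os.path.dirname(OUT), exist_ok=True)
    clean(pd.read_csv(RAW)).to_csv(OUT, index=False)
```

The split.py, train.py, and test.py scripts follow the same pattern: each reads its declared dependencies and writes its declared outputs, which is what lets DVC wire them into a pipeline in Step 3.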
Step 2: Organize Your Project Structure
First, organize your project so that all your scripts are placed in a scripts directory. This keeps everything neat and easy to manage.
.
├── data/
│   ├── raw/
│   └── processed/
├── scripts/
│   ├── cleaning.py
│   ├── split.py
│   ├── train.py
│   └── test.py
├── model/
│   └── model.pkl
├── .dvc/
├── .gitignore
├── dvc.yaml
├── .env
└── docker-compose.yaml
Step 3: Create DVC Stages for Each Script
Next, use DVC to create stages for each of these scripts. This ensures that each script is executed in the correct order, and the outputs are tracked properly.
- Data Cleaning Stage:
dvc stage add -n clean_data -d scripts/cleaning.py -d data/raw/Crop_recommendation.csv -o data/processed/Crop_recommendation_cleaned.csv python scripts/cleaning.py
- Data Splitting Stage:
dvc stage add -n split_data -d scripts/split.py -d data/processed/Crop_recommendation_cleaned.csv -o data/processed/train.csv -o data/processed/test.csv python scripts/split.py
- Model Training Stage:
dvc stage add -n train_model -d scripts/train.py -d data/processed/train.csv -o model/model.pkl python scripts/train.py
- Model Testing Stage:
dvc stage add -n test_model -d scripts/test.py -d model/model.pkl -d data/processed/test.csv python scripts/test.py
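After running these four commands, DVC records the pipeline in dvc.yaml; it should look roughly like this:

```
stages:
  clean_data:
    cmd: python scripts/cleaning.py
    deps:
    - scripts/cleaning.py
    - data/raw/Crop_recommendation.csv
    outs:
    - data/processed/Crop_recommendation_cleaned.csv
  split_data:
    cmd: python scripts/split.py
    deps:
    - scripts/split.py
    - data/processed/Crop_recommendation_cleaned.csv
    outs:
    - data/processed/train.csv
    - data/processed/test.csv
  train_model:
    cmd: python scripts/train.py
    deps:
    - scripts/train.py
    - data/processed/train.csv
    outs:
    - model/model.pkl
  test_model:
    cmd: python scripts/test.py
    deps:
    - scripts/test.py
    - model/model.pkl
    - data/processed/test.csv
```

Because each stage's outputs are the next stage's dependencies, DVC can infer the execution order automatically.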
Step 4: Running the Pipeline
Once all the stages are set up, you can run your entire pipeline with a single command:
dvc repro
This command will execute each stage in the order you've defined, from cleaning the data to training and testing the model. DVC tracks everything, so you can reproduce your results at any time, or revert to previous versions if needed.
Step 5: Push Everything to MinIO
After running your pipeline, push all your data, models, and DVC metadata to MinIO
dvc push
This ensures that your entire project, including scripts, datasets, and models, is safely stored and versioned in your MinIO bucket.
Example Output
$ dvc repro
'data\raw\Crop_recommendation.csv.dvc' didn't change, skipping
Running stage 'clean_data':
> python scripts/cleaning.py
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'
Running stage 'split_data':
> python scripts/split.py
Updating lock file 'dvc.lock'
Running stage 'train_model':
> python scripts/train.py
Updating lock file 'dvc.lock'
Running stage 'test_model':
> python scripts/test.py
Model Accuracy: 98.64%
Updating lock file 'dvc.lock'
To track the changes with git, run:
git add 'data\processed\.gitignore' dvc.lock 'model\.gitignore'
To enable auto staging, run:
dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.
$ dvc push
Collecting |5.00 [00:00, 277entry/s]
Pushing
4 files pushed
Final Thoughts
By integrating the scripts with DVC, you've created a powerful, reproducible workflow that handles everything from data preparation to model evaluation. Whether you're working on your local machine or collaborating with others, DVC ensures that your work is organized, tracked, and easily shareable.