Orchestrating LLM fine-tuning on K8s with SkyPilot and MLflow
While many focus on the modeling aspects of LLM fine-tuning, efficiently managing compute and tracking experiments are equally crucial. This guide demonstrates how to leverage SkyPilot and MLflow, two powerful open-source tools, to orchestrate LLM fine-tuning jobs.
Modern LLM fine-tuning workflows involve multiple moving parts:
Resource orchestration across different cloud providers
Environment setup and dependency management
Experiment tracking and monitoring
Distributed training coordination
System metrics collection
Using SkyPilot for resource orchestration and MLflow for experiment tracking provides an easy-to-use and fully open-source stack for managing these complexities.
We’ll use the Llama 3.1 8B fine-tuning example from Philipp Schmid’s How to fine-tune open LLMs in 2025 blog post to demonstrate these tools in action.
Below, I’ll be using Nebius’ Kubernetes cluster; however, SkyPilot supports virtually every cloud provider you can think of.
First, install SkyPilot with Kubernetes support using pip:
$ pip install "skypilot[kubernetes]"
Configure your Kubernetes cluster access by ensuring your kubeconfig is properly set up. Then, verify the installation:
$ sky check kubernetes
Checking credentials to enable clouds for SkyPilot.
Kubernetes: enabled
To enable a cloud, follow the hints above and rerun: sky check
If any problems remain, refer to detailed docs at: https://docs.skypilot.co/en/latest/getting-started/installation.html
🎉 Enabled clouds 🎉
✔ Kubernetes
SkyPilot uses a YAML configuration to define jobs. The SkyPilot task definition below, sky.yaml, is all that’s needed to kick off our training job on our infrastructure.
# To launch the cluster:
# sky launch -c dev sky.yaml --env-file .env
# To rerun training (i.e. only the "run" section):
# sky exec dev sky.yaml --env-file .env

resources:
  cloud: kubernetes  # or aws, gcp, azure, and many others
  accelerators: H100:8

workdir: .  # syncs current directory to ~/sky_workdir/ on the cluster

envs:
  CONFIG_FILE: recipes/llama-3-1-8b-qlora.yaml

# setup step is executed once upon cluster provisioning with `sky launch`
setup: |
  sudo apt install nvtop -y
  pip install -U -r requirements.txt
  FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE pip install flash-attn --no-build-isolation
  python generate_train_dataset.py

# run step is executed for both `sky exec` and `sky launch` commands
run: |
  accelerate launch \
    --num_processes 8 \
    train.py --config $CONFIG_FILE
You can find the rest of the training code and configs in this repository. The details of what’s inside train.py are beyond the scope of this post. The repository also contains sky_multi_node.yaml, a multi-node version of the sky.yaml file.
In this example project, MLflow configuration is managed through environment variables. Create a .env file:
# .env
MLFLOW_TRACKING_URI=https://your-mlflow-server
MLFLOW_TRACKING_SERVER_CERT_PATH=/path/to/cert.pem
MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING=true
MLFLOW_EXPERIMENT_NAME=LLM_Fine_Tuning
MLFLOW_TRACKING_USERNAME=your-username
MLFLOW_TRACKING_PASSWORD=your-password
HF_TOKEN=your-huggingface-token
# TEST_MODE=true # Uncomment for development
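Before kicking off an expensive training run, it’s worth sanity-checking that the tracking server is reachable with these variables exported. A minimal sketch using MLflow’s standard Python API:

import mlflow

# With the variables from .env exported, MLflow picks up the tracking URI,
# certificate path, and credentials from the environment automatically.
print(mlflow.get_tracking_uri())

# A round trip to the server confirms the URI and credentials actually work.
for experiment in mlflow.search_experiments():
    print(experiment.name)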
The MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING=true setting enables collection of system metrics but requires additional dependencies:
# requirements.txt
psutil==6.1.1
pynvml==12.0.0
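As an alternative to the environment variable, system metrics logging can also be toggled from Python; a minimal sketch:

import mlflow

# Equivalent to MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING=true: starts background
# sampling of CPU, memory, disk, network and (via pynvml) GPU utilization.
mlflow.enable_system_metrics_logging()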
For this tutorial, I used Nebius’ Managed MLflow. Setting up the MLflow server instance was straightforward, and they provided all the necessary configuration values to connect with the server.
In distributed training environments, MLflow logging must be carefully managed to prevent logging conflicts between processes. Multiple processes attempting to log metrics simultaneously can lead to race conditions, duplicate entries, or corrupted logs. Additionally, system metrics need to be properly attributed to individual nodes to maintain accurate monitoring data. Here’s the key integration code:
def train_function(model_args, script_args, training_args):
    # Initialize MLflow callback
    mlflow_callback = MLflowCallback()

    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        tokenizer=tokenizer,
        peft_config=peft_config,
        callbacks=[mlflow_callback],
    )

    # Only initialize MLflow on the main process
    run_id = None
    if trainer.accelerator.is_main_process:
        mlflow_callback.setup(training_args, trainer, model)

        # Set node ID for system metrics
        node_id = trainer.accelerator.process_index
        mlflow_callback._ml_flow.system_metrics.set_system_metrics_node_id(node_id)

        # Get run ID for post-training logging
        run_id = mlflow_callback._ml_flow.active_run().info.run_id
        logger.info(f'Run ID: {run_id}')

    # Training loop
    train_result = trainer.train()

    # Post-training metrics logging only on main process
    if trainer.accelerator.is_main_process:
        if run_id is not None:
            metrics = train_result.metrics
            train_samples = len(train_dataset)
            with mlflow.start_run(run_id=run_id):
                mlflow.log_param('train_samples', train_samples)
                for key, value in metrics.items():
                    mlflow.log_metric(key=key, value=value)
Process management: Only the main process should initialize MLflow runs and log metrics to avoid conflicts.
Run ID tracking: The MLflow run ID is stored because trainer.train() automatically ends its MLflow run when complete. Without capturing the ID beforehand, we wouldn’t be able to log additional metrics after training finishes.
System metrics: Each node in distributed training needs a unique identifier for system metrics collection.
MLflow provides a web UI for monitoring experiments. Key metrics tracked include training metrics (loss, learning rate, batch size and training speed in samples/second) and system metrics.
Metrics can also be queried programmatically, assuming you’ve set all required MLflow environment variables.
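A minimal sketch using MLflow’s client API (the experiment name matches MLFLOW_EXPERIMENT_NAME from the .env file above; the 'loss' metric key is just an example of a key the trainer logs):

import mlflow
from mlflow.tracking import MlflowClient

# Fetch all runs of the experiment as a pandas DataFrame, most recent first.
runs = mlflow.search_runs(
    experiment_names=['LLM_Fine_Tuning'],
    order_by=['start_time DESC'],
)
print(runs[['run_id', 'status', 'start_time']].head())

# Full per-step history of a single metric for the latest run.
client = MlflowClient()
latest_run_id = runs.iloc[0]['run_id']
for measurement in client.get_metric_history(latest_run_id, 'loss'):
    print(measurement.step, measurement.value)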