Nebius AI monthly digest, April 2024
Our main news of the past month is that Nebius AI has become available to everyone! We also appeared on the MLOps Community podcast and published several videos about setting up training, as well as stories about how Nebius AI clients are building their models.
Videos on setting up training infrastructure
Fail fast & recover faster: infrastructure resilience of the LLM training
Training an LLM in a multi-node setup is a complex and expensive process. Training failures can’t be eliminated, but downtime can be reduced. In this talk, Filipp Fisin, Senior ML Engineer at Nebius AI, gave an overview of the techniques for more resilient training that we find useful.
How to deploy Slurm on Nebius AI
Our Cloud Solution Architect Panu Koskela is back to show you the essentials of deploying Slurm on our platform.
MLOps Community podcast: handling multi-terabyte large model checkpoints
In the latest episode of the podcast, our ML Engineer Simon Karasik shared his five years of experience in the field and gave an introduction to LLM checkpointing.
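The core idea behind checkpointing can be shown in a few lines. This is a minimal sketch, not the approach discussed in the episode: real LLM checkpoints are multi-terabyte and sharded across nodes, while here the "model" is just a dict and the training loop is simulated.

```python
# Minimal checkpointing sketch: save state atomically, resume after a crash.
# The model/optimizer state is a plain dict here; in practice it would be
# framework-specific (e.g. tensors) and far too large for a single file.
import os
import pickle
import tempfile

CKPT = "checkpoint.pkl"

def save_checkpoint(state: dict, path: str = CKPT) -> None:
    # Write to a temp file, then rename: a crash mid-write never
    # corrupts the previously saved checkpoint.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path: str = CKPT) -> dict:
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0}  # no checkpoint yet: fresh start

state = load_checkpoint()
for step in range(state["step"], 5):
    state = {"step": step + 1}      # simulated training step
    save_checkpoint(state)          # in practice, every N steps, not every step

print(load_checkpoint()["step"])    # → 5
```

If the process dies and restarts, `load_checkpoint` picks up from the last completed step instead of step 0, which is the whole point of the technique.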
What’s new on our docs and blog
Nebius AI is now open to everyone
Whether you are a company or an individual engineer, log in and start using the platform.
Demo: applying RAG with open tools
Retrieval-augmented generation is a technique that enhances language models by combining generative AI with a retrieval component. Check out a quick example of applying RAG in a real-world context.
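To make the idea concrete, here is a toy sketch of the retrieval step, independent of the demo above. It uses bag-of-words cosine similarity so it runs with no dependencies; real RAG systems use dense embeddings and a vector database, and the final prompt would be passed to an actual LLM.

```python
# Toy RAG sketch: retrieve the most relevant document for a query,
# then build an augmented prompt from it. The "embedding" here is a
# simple term-frequency vector, used purely for illustration.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Term-frequency "embedding" over lowercase tokens.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "Slurm is a workload manager for Linux clusters.",
    "RAG augments a language model with retrieved context.",
    "Checkpointing lets training resume after a failure.",
]
context = retrieve("How does RAG work with a language model?", docs)

# The retrieved document is prepended to the prompt before generation.
prompt = f"Context: {context[0]}\nQuestion: How does RAG work?"
print(prompt)
```

The generative model then answers the question grounded in the retrieved context rather than relying only on its parametric knowledge.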
Training a 20B foundational model: Recraft’s journey
Recraft, which recently raised a funding round led by Khosla Ventures and former GitHub CEO Nat Friedman, has built the first generative AI model made for designers. Featuring 20 billion parameters, the model was trained from scratch on Nebius AI. Here’s how.
How Unum partnered with us to preserve knowledge in compact models
In our field, effective partnerships that harness complementary strengths can drive significant breakthroughs. Such is the case with the collaboration between Nebius AI and Unum, an AI research lab known for developing compact and efficient models.
The first AI safety benchmark is here, with Nebius AI contribution
The AI Safety v0.5 benchmark proof of concept has been released by MLCommons, with Nebius AI among the contributors.
Marketplace releases
Kubeflow
Handle machine learning workflows on Kubernetes with this open-source toolkit.
NVIDIA Triton™ Inference Server
Serve AI models from multiple frameworks.
Ray Cluster
Orchestrate scalable distributed computing environments for a variety of large-scale AI workloads.