Nebius AI monthly digest, April 2024

Our main news of the past month is that Nebius AI has become available to everyone! We also appeared on the MLOps Community podcast and published several videos about setting up training, as well as stories about how Nebius AI clients are building their models.

Videos on setting up training infrastructure

Fail fast & recover faster: infrastructure resilience of the LLM training
Training an LLM in a multi-node setup is a complex and expensive process. Training failures can’t be eliminated, but downtime can be reduced. In this talk, Filipp Fisin, Senior ML Engineer at Nebius AI, provided an overview of the techniques for more resilient training that we find useful:
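The full talk goes deeper, but here is a minimal sketch of one such technique: periodic checkpointing with automatic resume, so a restarted job loses at most one save interval of work. The checkpoint path, save interval and loop internals below are illustrative, not Filipp’s exact approach:

```python
import os
import torch

CKPT_PATH = "checkpoint.pt"   # illustrative path
SAVE_EVERY = 100              # illustrative interval, in steps

def train(model, optimizer, data_loader, total_steps):
    # Resume from the latest checkpoint if one exists, so a restarted
    # job loses at most SAVE_EVERY steps of work.
    start_step = 0
    if os.path.exists(CKPT_PATH):
        ckpt = torch.load(CKPT_PATH)
        model.load_state_dict(ckpt["model"])
        optimizer.load_state_dict(ckpt["optimizer"])
        start_step = ckpt["step"] + 1

    data_iter = iter(data_loader)  # assumes the loader yields enough batches
    for step in range(start_step, total_steps):
        batch = next(data_iter)
        loss = model(batch).mean()  # placeholder loss computation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if step % SAVE_EVERY == 0:
            # Write to a temp file, then rename atomically, so a crash
            # mid-save never corrupts the last good checkpoint.
            tmp = CKPT_PATH + ".tmp"
            torch.save({"model": model.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "step": step}, tmp)
            os.replace(tmp, CKPT_PATH)
```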

How to deploy Slurm on Nebius AI
Our Cloud Solution Architect Panu Koskela is back to show you the essentials of running Slurm, an open-source tool for managing resources in a computing environment, as a cluster on Nebius AI.
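The video walks through the deployment itself. As a small companion sketch, here is how a Python training script launched under Slurm (for example, via srun) might use the environment variables Slurm sets for each task to initialize torch.distributed. The variable names are real Slurm ones; everything else is illustrative:

```python
import os
import torch.distributed as dist

# Slurm exports these variables to every task launched with srun;
# they map naturally onto a torch.distributed process group.
rank = int(os.environ["SLURM_PROCID"])         # global rank of this task
world_size = int(os.environ["SLURM_NTASKS"])   # total number of tasks
local_rank = int(os.environ["SLURM_LOCALID"])  # rank within this node

dist.init_process_group(
    backend="nccl",        # NCCL for GPU-to-GPU communication
    init_method="env://",  # expects MASTER_ADDR / MASTER_PORT to be exported
    rank=rank,
    world_size=world_size,
)
print(f"rank {rank}/{world_size} up on local GPU {local_rank}")
```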

MLOps Community podcast: handling multi-terabyte large model checkpoints
In the latest episode of the podcast, our ML Engineer Simon Karasik shared insights from his five years in the field and gave an introduction to LLM checkpointing. The audio is available across popular podcast platforms, and here’s the video.
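The episode covers much more, but to give a flavor: one common way to keep multi-terabyte checkpoints manageable is to shard them, so that under a sharded training setup (such as FSDP) each rank writes only its own slice of the state in parallel. A minimal, hypothetical sketch, not necessarily the approach Simon describes:

```python
import os
import torch
import torch.distributed as dist

def save_sharded_checkpoint(model, optimizer, step, ckpt_dir):
    """Write one shard per rank, so a huge checkpoint becomes many
    smaller files saved in parallel instead of one giant blob."""
    rank = dist.get_rank()
    os.makedirs(ckpt_dir, exist_ok=True)
    shard_path = os.path.join(ckpt_dir, f"shard-{rank:05d}.pt")
    # With a sharded setup (e.g. FSDP configured for sharded state
    # dicts), these dicts hold only this rank's slice of the state.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
    }, shard_path)
    dist.barrier()  # checkpoint is complete only once every rank has written
```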

What’s new in our docs and blog

Nebius AI is now open to everyone
Whether you are a company or an individual engineer, log in with your Google or GitHub account and start running your ML experiments. To make your journey easier, we have prepared pages in the docs on how to create an account, set up your billing (including, for example, linking a credit card) and sort out your taxes.

Demo: applying RAG with open tools
Retrieval-augmented generation (RAG) enhances language models by pairing text generation with a retrieval component that supplies relevant documents at query time. Check out a quick example of applying RAG in a real-world context.
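For readers who want the gist before clicking through, here is a deliberately tiny RAG sketch: TF-IDF retrieval over an in-memory document list, plus a placeholder generation call. The call_llm function is hypothetical, and the demo itself uses different tools:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A toy document store; in practice this would be a vector database.
docs = [
    "Nebius AI offers GPU clusters for ML training.",
    "RAG combines retrieval with text generation.",
    "Slurm schedules jobs across compute nodes.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)

def answer(question: str, top_k: int = 2) -> str:
    # Retrieve: rank documents by similarity to the question.
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, doc_vectors)[0]
    context = "\n".join(docs[i] for i in scores.argsort()[::-1][:top_k])
    # Augment: prepend the retrieved context to the prompt.
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return call_llm(prompt)

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder; swap in any hosted or local model.
    return f"[model response to: {prompt[:40]}...]"

print(answer("What does RAG do?"))
```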

The natural next step after the demo is to build a production environment for applying RAG, the topic of our upcoming webinar on May 16. We’ll build a pipeline powered by NVIDIA® H100 Tensor Core GPUs and discuss integration with Kubernetes, CUDA, Triton Inference Server, TensorRT, Milvus, PyTorch and Llama 2.

Training a 20B foundational model: Recraft’s journey
Recraft, which recently raised a round led by Khosla Ventures and former GitHub CEO Nat Friedman, has built the first generative AI model made for designers. Featuring 20 billion parameters, the model was trained from scratch on Nebius AI. Here’s how.

How Unum partnered with us to preserve knowledge in compact models
In our field, effective partnerships that harness complementary strengths can drive significant breakthroughs. Such is the case with the collaboration between Nebius AI and Unum, an AI research lab known for developing compact and efficient models.

The first AI safety benchmark is here, with a contribution from Nebius AI
The AI Safety v0.5 Proof of Concept is the result of months of collaboration between industry professionals and researchers. The benchmark has been developed by MLCommons, an engineering consortium based on the philosophy of open collaboration to enhance AI systems. Fedor Zhdanov, our Head of Applied AI, participates in the AI Safety working group developing this benchmark.

Marketplace releases

Kubeflow
Handle machine learning workflows on Kubernetes with this open-source toolkit.

NVIDIA Triton™ Inference Server
Serve AI models from multiple frameworks.

Ray Cluster
Orchestrate scalable distributed computing environments for a variety of large-scale AI workloads.
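To illustrate the idea, here is a minimal Ray sketch that fans a function out across a cluster as remote tasks. The workload is a toy, and connection details depend on your cluster:

```python
import ray

ray.init()  # connect to the local or remote Ray cluster

@ray.remote
def square(x: int) -> int:
    # Each call runs as a task scheduled somewhere on the cluster.
    return x * x

# Fan out 8 tasks in parallel and gather the results.
results = ray.get([square.remote(i) for i in range(8)])
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```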

Author: Nebius team