Young Builder’s Notes: AI on Quicksand

Nebius Token Factory’s youngest team member learns why building AI is chasing a moving target.

As a solutions architect at Nebius, my work is primarily in helping run proofs of concept for prospective clients. I work with LLMs, which are just one slice of the AI landscape. Yet they span an endless mix of use cases, applications, and challenges that make each PoC wildly diverse.

Some are simple, taking only a few hours. Longer engagements can require model editing, heavy optimizations, and even changes to the inference-serving engine itself. Speaking with the client, I note their needs: model, average input and output sizes, latency constraints. Once everything is understood, it is time to begin the PoC.

It can be a straightforward process. I know what levers to turn and what places on the sword to strike. I pick the correct hardware, tune GPU memory, fit the prompt-batching to their input volume — each adding performance towards our target. I finish promptly, and my work is done.

Other times, it is not.

Simple optimizations yield little improvement, and the solution can take days. Blockers in model compatibility can take weeks. In an industry that moves this fast, that can be a major problem.

It is a common week, and I get a common task — our client is using GLM-4.6, they are paying X per 1 million tokens to their provider, and we need to beat that price. Simple. I head to the forge, and begin my craft. My tool? Four last-gen GPUs from the global leader.
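The arithmetic behind "beat that price" is simple enough to sketch. Assuming a fixed hourly cost per GPU and a steady aggregate throughput, the serving cost per 1 million tokens falls out directly. The function names and every number below are hypothetical placeholders, not figures from this engagement.

```python
def cost_per_million_tokens(gpu_hourly_usd, num_gpus, tokens_per_second):
    """Serving cost in USD per 1M tokens at a steady aggregate throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    node_cost_per_second = gpu_hourly_usd * num_gpus / 3600
    return node_cost_per_second / tokens_per_second * 1_000_000


def required_throughput(gpu_hourly_usd, num_gpus, target_usd_per_million):
    """Minimum aggregate tokens/s needed to match a target price per 1M tokens."""
    if target_usd_per_million <= 0:
        raise ValueError("target_usd_per_million must be positive")
    node_cost_per_second = gpu_hourly_usd * num_gpus / 3600
    return node_cost_per_second / target_usd_per_million * 1_000_000


if __name__ == "__main__":
    # Placeholder inputs: 4 GPUs at a made-up hourly rate and throughput.
    print(cost_per_million_tokens(gpu_hourly_usd=2.0, num_gpus=4, tokens_per_second=4000))
    print(required_throughput(gpu_hourly_usd=2.0, num_gpus=4, target_usd_per_million=0.5))
```

The second function is the one that matters in a PoC like this: it turns the client's current price into a throughput number the hardware has to reach. It ignores idle time and any margin, so treat the result as a floor rather than a goal.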

I start with the basics: batching, sequence limits, GPU utilization. But the client needs to hit a specific requests-per-second (RPS) target, and our GPUs could theoretically reach it. The gap between theory and practice, though, is massive. I need to squeeze every last bit of performance from the hardware.
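To make the link between an RPS target and batching concrete, here is a back-of-the-envelope sketch. It assumes decoding dominates (prefill cost is ignored) and uses Little's law, which says the average number of requests in flight equals arrival rate times average latency. The helper names and the numbers are hypothetical, chosen only for illustration.

```python
def sustainable_rps(decode_tokens_per_second, avg_output_tokens):
    """Rough RPS ceiling if all throughput goes to generating output tokens."""
    if avg_output_tokens <= 0:
        raise ValueError("avg_output_tokens must be positive")
    return decode_tokens_per_second / avg_output_tokens


def in_flight_requests(target_rps, avg_latency_seconds):
    """Little's law: average concurrency = arrival rate * average latency."""
    return target_rps * avg_latency_seconds


if __name__ == "__main__":
    # Placeholder workload: 4000 tokens/s aggregate, 500 output tokens per request.
    rps = sustainable_rps(decode_tokens_per_second=4000, avg_output_tokens=500)
    print(rps)
    # Requests the engine must hold concurrently at that rate and a 12.5 s latency.
    print(in_flight_requests(target_rps=rps, avg_latency_seconds=12.5))
```

The concurrency figure is the useful one when sizing batches: if the engine cannot hold that many sequences in GPU memory at once, the theoretical RPS stays theoretical.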

So I go deeper. I tune the backend selection, enable performance-boosting settings, and try dozens of other adjustments — each one fighting for another fraction of throughput. Days turn into two weeks of grinding through optimization after optimization, slowly pushing us closer to the RPS ceiling.

With all the research this project requires, I find a home in the vLLM repository: a rapidly evolving, state-of-the-art inference engine, backed by thousands of open-source contributors. I read the documentation, ask the lead devs questions, and search through hundreds of issues the community has raised.

I slowly sharpen the sword until I finally perfect my build: a patched vLLM version, fourteen different flags, and GLM-4.6 operating at blazing speeds. We deliver the endpoint to the client, and we are both happy about a job well done.

Almost every week, the vLLM team releases an update with large implications. And a new one pops up just a couple of days after I deliver the job. Buried in the release notes is a major breakthrough. What took me two weeks of careful tuning to approach, the update can handle alone. My optimizations still work, but they’re unnecessary — there is now a direct path to the same place.

This is what working in AI is like. It is not the twice-a-year GPT release or the staggering new development seen on the news. It is an accumulation of tiny changes across engineering systems. You cannot prepare for them or anticipate them, and there are no guarantees. You simply have to get used to building on quicksand.

I am not upset about the time that I spent. It wasn’t useless. I delivered something to the client that we were both happy with. I also learned new things that will carry over to future tasks, new skills that make me a better engineer. Most importantly, I learned the importance of staying agile in an industry that is moving so fast.

As a college student, I have learned to treat my work like a class. Every day is a lecture where I get to learn and expand my knowledge, and PoCs are the homework where I apply what I learn. If you skip the lectures or the homework, the ground has already shifted by the time the next unit comes along.

At vLLM, contributors are open-source developers building out of genuine passion. Closed-source companies are moving just as fast, even if much of it happens behind the scenes. If you’re not using the latest tools, you sink. And all of this keeps bringing me back to one question: which is shaping the future more, open-source or closed-source AI?

Alex Hanley

Solutions Architect