Async workloads and batch inference at scale

Process millions of requests asynchronously with our high-throughput Batch API.

Asynchronous API

Submit entire inference jobs at once and retrieve results typically within 24 hours, freeing your systems from real-time processing constraints.

Cost optimization

Get fast model quality at base model prices. Batch requests to fast models are charged at the economical base model rate, allowing you to maximize your AI budget.

Process up to 10 GB in one go

Handle large-scale data effortlessly. Our Batch API supports input files up to 10 GB, helping you maintain efficient, high-volume processing and avoid rate limits.

Batch vs. normal API: processing 1M requests

Example: processing 1 million inference requests with Meta/Llama-3.1-70B-Instruct (rate limits: 3 million TPM, 1,200 RPM).

Batch API

  • Single JSONL file (up to 10 GB) containing 1 million requests, processed asynchronously
  • Processing time: ~24 hours
  • No rate-limit consumption
  • Set and forget: no monitoring needed
  • Fast model variants at base model price

Normal API

  • 1 million individual API calls
  • Processing time: minimum ~13.9 hours (1,000,000 requests ÷ 1,200 RPM ≈ 833 minutes), assuming sustained maximum throughput
  • Requires complex retry and queue logic
  • Consumes from your rate limits
  • Higher than base model price

Transform your AI workflows

Model distillation pipeline

Distill knowledge from a large, state-of-the-art language model into a smaller, more efficient one by first generating a massive synthetic training set from a wide range of prompts and documents. With the Batch API, submit these large text sets in a single operation, retrieve the outputs asynchronously, and use the collected results to fine-tune the smaller model for improved performance and reduced latency.
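
As a hedged sketch of the submission step, assuming the JSONL request format described under "How it works" below; the prompts, file name and teacher model here are illustrative placeholders:

```python
import json

# Illustrative prompts; in practice these come from your corpus of
# prompts and documents.
prompts = ["Summarize the water cycle.", "Explain TCP slow start."]

# Each JSONL line is one request: custom ID, method, URL and body.
with open("distillation_batch.jsonl", "w") as f:
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"distill-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "Meta/Llama-3.1-70B-Instruct",  # teacher model (placeholder)
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 512,
            },
        }
        f.write(json.dumps(request) + "\n")
```

The returned completions are keyed by custom_id, so they can be joined back to their prompts and used directly as fine-tuning targets for the smaller model.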

Embeddings for vector database construction

Build a high-performance vector database from millions of documents by leveraging the Batch API. Submit all texts in one request and retrieve their embeddings in bulk, eliminating repetitive single-request generation. This streamlined process accelerates indexing and enables faster, more efficient semantic searches.
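
A minimal sketch of both directions, assuming an OpenAI-compatible /v1/embeddings request body and result-file layout; the model name, documents and file names are placeholders:

```python
import json

# Illustrative documents keyed by a stable ID.
documents = {"doc-0": "First document text...", "doc-1": "Second document text..."}

# One embedding request per document.
with open("embeddings_batch.jsonl", "w") as f:
    for doc_id, text in documents.items():
        f.write(json.dumps({
            "custom_id": doc_id,
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": "my-embedding-model", "input": text},  # placeholder model
        }) + "\n")

# After the batch completes and the result file is downloaded,
# map each custom_id back to its vector for indexing.
vectors = {}
with open("embeddings_results.jsonl") as f:
    for line in f:
        result = json.loads(line)
        vectors[result["custom_id"]] = result["response"]["body"]["data"][0]["embedding"]
```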

Content moderation and analysis

Process backlogged content during off-peak hours or handle high-volume content analysis tasks by submitting batches of text for automated review, classification or extraction of key insights.
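
For example, one request line for such a review pass might look like the following sketch; the model name, label set and content are illustrative:

```python
import json

# One moderation request line; the system prompt pins the model to a fixed
# label set (illustrative) so results are trivial to parse afterwards.
request = {
    "custom_id": "post-42",
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {
        "model": "Meta/Llama-3.1-70B-Instruct",  # placeholder model
        "messages": [
            {"role": "system",
             "content": "Classify the user content as exactly one of: SAFE, SPAM, TOXIC."},
            {"role": "user", "content": "Free crypto!!! Click here to claim your prize."},
        ],
        "max_tokens": 5,
    },
}
print(json.dumps(request))
```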

Technical capabilities

Process up to 5 million requests per file

Handle millions of individual inference operations in a single batch (limit can be raised on request).

Support for files up to 10 GB

Submit extensive datasets in one operation, without splitting or chunking.

Run up to 500 concurrent batches

Scale your processing across multiple parallel jobs for maximum throughput (limit can be raised on request).

How it works

Prepare your JSONL file

Create a file where each line represents a request with a custom ID, method, URL and body.
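
For illustration, a single line might look like this (all values hypothetical):

```json
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "Meta/Llama-3.1-70B-Instruct", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 100}}
```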

Upload your file

Upload the prepared file through our API with a single call.
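
A minimal sketch, assuming an OpenAI-compatible Python SDK; the base URL, API key and file name are placeholders:

```python
from openai import OpenAI

# Placeholder endpoint and key; any OpenAI-compatible SDK works the same way.
client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_API_KEY")

# Upload the JSONL file for batch processing.
batch_file = client.files.create(
    file=open("distillation_batch.jsonl", "rb"),
    purpose="batch",
)
print(batch_file.id)
```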

Create a batch

Specify the target endpoint and a completion window of up to 24 hours.
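
Continuing the same hedged sketch:

```python
# Create the batch job against the uploaded file; the 24h window matches
# the completion window described above.
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
```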

Monitor progress and download results

Track batch status through our API. Access completed results when processing is finished.
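
And to finish the sketch, a simple polling loop (the one-minute interval is arbitrary):

```python
import time

# Poll until the batch reaches a terminal state, then fetch the results.
while True:
    batch = client.batches.retrieve(batch.id)
    if batch.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(60)

if batch.status == "completed":
    client.files.content(batch.output_file_id).write_to_file("results.jsonl")
```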

Pricing is simple

Batch inference is automatically billed at 50% of the base real-time model price, rounded up to the nearest cent.

Example: if a model’s base price is $0.13 input and $0.40 output, Batch inference is $0.07 input ($0.065 rounded up to the nearest cent) and $0.20 output.
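
The stated rule can be expressed with exact decimal arithmetic; a small sketch:

```python
from decimal import Decimal, ROUND_UP

def batch_price(base_price: str) -> Decimal:
    # 50% of the base price, rounded up to the nearest cent.
    return (Decimal(base_price) / 2).quantize(Decimal("0.01"), rounding=ROUND_UP)

print(batch_price("0.13"), batch_price("0.40"))  # 0.07 0.20
```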

Questions and answers

What is a Batch API?

A Batch API allows you to submit large sets of data or multiple tasks at once, process them asynchronously and retrieve all results in a single response. This approach reduces network overhead, improves efficiency and streamlines the handling of extensive workloads.