> ## Documentation Index
> Fetch the complete documentation index at: https://novita.ai/docs/llms.txt
> Use this file to discover all available pages before exploring further.

# Novita Deployments User Guide

> **Navigation**: Models Console → Deployments
> **Applies to**: Current production version
> **Last updated**: 2026-04-20

***

## Table of Contents

1. [What are Deployments](#1-what-are-deployments)
2. [Quick Start (Go live in 5 minutes)](#2-quick-start-go-live-in-5-minutes)
3. [Creating a Deployment](#3-creating-a-deployment)
   * 3.1 [Naming](#31-naming)
   * 3.2 [Selecting a Model](#32-selecting-a-model)
   * 3.3 [Selecting a GPU Instance](#33-selecting-a-gpu-instance)
   * 3.4 [Configuring Autoscaling](#34-configuring-autoscaling)
   * 3.5 [Engine Settings (Advanced)](#35-engine-settings-advanced)
4. [Deployment Lifecycle & Status](#4-deployment-lifecycle--status)
5. [Autoscaling In Depth](#5-autoscaling-in-depth)
6. [LoRA Adapter Support](#6-lora-adapter-support)
7. [Billing](#7-billing)
8. [FAQ & Troubleshooting](#8-faq--troubleshooting)

***

## 1. What are Deployments

A Deployment is Novita's **dedicated AI inference endpoint** product. Unlike Serverless endpoints that share compute resources with other users, each Deployment gives you:

* **Exclusive GPU**: All compute resources are yours alone — no noisy neighbors
* **Predictable Performance SLA**: Dedicated compute means consistent, foreseeable inference latency
* **Flexible Model Sources**: Deploy any model from Hugging Face or the Novita model catalog
* **OpenAI-Compatible Chat API**: For pure text inference, simply swap `base_url` and `model` to migrate existing OpenAI integrations
* **Per-second Billing**: You are only charged while the endpoint is active. Billing pauses automatically when Scale-to-Zero kicks in

**When to use Deployments:**

| Use Case                            | Why it fits                                     |
| ----------------------------------- | ----------------------------------------------- |
| Production API services             | Stable latency, fully isolated from other users |
| Private or fine-tuned model serving | Deploy any custom HuggingFace model             |
| High-concurrency inference          | Scale to multiple replicas automatically        |
| Cost-sensitive workloads            | Scale-to-Zero stops billing during idle periods |

***

## 2. Quick Start (Go live in 5 minutes)

**Step 1 — Navigate to Deployments**

Log in to Novita → left sidebar → **Models Console** → **Models APIs** → **Deployments**

**Step 2 — Create a Deployment**

Click **+ New Deployment** and fill in:

* A Deployment name (e.g. `my-llama3-endpoint`)
* Model source (the Novita model catalog is recommended for the fastest setup)
* GPU instance (the system auto-recommends a suitable spec for your model)
* Autoscaling settings

**Step 3 — Wait for the Deployment to start**

Startup time varies with model size, typically **5–60 minutes**, progressing through three phases:

1. Requesting GPU
2. Downloading Model
3. Engine Initializing

Once the status shows **RUNNING**, the endpoint is ready to receive requests.

**Step 4 — Call the API**

Go to the Deployment detail page → **Quick Start** panel → copy the ready-to-run code snippet.

> Manage your API Keys under **Settings → API Keys**.

***

## 3. Creating a Deployment

Click **+ New Deployment** to open the creation form, which has four configuration sections.

### 3.1 Naming

Recommended naming format: `{model}-{environment}-{purpose}` — e.g. `llama3-prod-chatbot`.

### 3.2 Selecting a Model

Two model sources are supported:

#### Novita Model Catalog (Recommended)

Choose from Novita's hosted model list — no token required, **works out of the box**. Covers all major open-source models (Llama 3, Qwen, DeepSeek, Mistral, and more).

> Novita pre-validates model compatibility and applies engine optimizations, resulting in faster startup and higher stability.

#### Hugging Face Models

Enter a HuggingFace repository ID (e.g. `meta-llama/Meta-Llama-3-8B-Instruct`).

* **Public models**: No token needed, deploy directly
* **Private or Gated models**: A HuggingFace Access Token must be linked first

**How to link your HF Token:**

1. Go to [HuggingFace → Settings → Access Tokens](https://huggingface.co/settings/tokens) and create a token
2. In the Model field on the Create Deployment form, click **Integrate HF Token**
3. Paste and save the token

> If your token expires or is revoked, active Deployments that rely on it will fail to re-pull the model. Keep your token up to date.

#### LoRA Adapter (Optional)

After selecting a Base Model, you can attach one or more LoRA Adapters from HuggingFace. Multiple adapters can run on the same Deployment without requiring additional GPU resources.

See [Section 6 — LoRA Adapter Support](#6-lora-adapter-support) for details.

**Model file format requirements (for custom HuggingFace models):**

### 3.3 Selecting a GPU Instance

The system automatically recommends a GPU configuration based on your model size.

> **TIGHT MEMORY warning**: If the selected GPU has limited VRAM for the chosen model, the system shows a `TIGHT MEMORY` warning. Increase the GPU count or contact Novita support.

> GPU type **cannot be changed** after a Deployment is created. To switch GPU type, delete and recreate the Deployment.

***

### 3.4 Configuring Autoscaling

Autoscaling controls how many replicas run in response to traffic.

#### Enable Autoscaling (Recommended)

Use the dual-handle slider to set the replica range:

| Parameter        | Description                                                                 | Default        |
| ---------------- | --------------------------------------------------------------------------- | -------------- |
| Min Replicas     | Minimum active replicas at all times. Set to 0 to enable Scale-to-Zero      | 1              |
| Max Replicas     | Maximum replicas during peak traffic                                        | 3              |
| Scale-down Delay | Seconds to wait after traffic drops before scaling down (prevents flapping) | 300s (minimum) |

**Scale-to-Zero (Min Replicas = 0):**

* After idling for longer than the Scale-down Delay, the Deployment enters **SLEEPING** status and billing pauses
* The first incoming request wakes it up automatically
* Cold start time: typically 5 minutes depending on model size
* ⚠️ Best suited for dev/test or low-frequency workloads. For production, keep Min Replicas ≥ 1

#### Disable Autoscaling

Runs a fixed number of replicas. Best for workloads with strict latency SLAs that cannot tolerate any scaling delay.

### 3.5 Engine Settings (Advanced)

Novita supports two inference engines — **vLLM** and **SGLang** — matched automatically to your model. These settings are hidden by default during Deployment creation.

#### Max Concurrency per Replica

Controls how many requests a single replica handles simultaneously.

| Setting              | Effect                                                  |
| -------------------- | ------------------------------------------------------- |
| Below recommended    | Lower latency, but limited throughput                   |
| Equal to recommended | Optimal balance of throughput and latency (recommended) |
| Above recommended    | Higher throughput, but increased per-request latency    |

> The system calculates a recommended value based on your GPU instance. Default is 16.

#### Suffix Decoding

N-gram based speculative decoding that pre-generates future tokens to speed up inference.

* Most effective for **highly predictable output formats** (e.g. code generation, structured JSON)
* Provides limited benefit for free-form conversation; excessively high values may actually increase latency

***

## 4. Deployment Lifecycle & Status

### State Transition Diagram

```text theme={"system"}
Create
  │
  ▼
PENDING ──── Waiting for GPU resource allocation
  │
  ▼
DEPLOYING ── Three sub-phases:
  │            ├─ Requesting GPU
  │            ├─ Downloading Model
  │            └─ Engine Initializing
  │
  ├──────────────── FAILED (deployment failed)
  │
  ▼
RUNNING ──── Live and accepting requests
  │
  ├─ Zero traffic + Scale-to-Zero enabled ──► SLEEPING
  │                                               │
  │                                 First request ──► DEPLOYING ──► RUNNING
  │
  ├─ Config update ──► ROLLING (zero-downtime rolling update)
  │
  ├─ Traffic change ──► SCALING (autoscaling in progress)
  │
  └─ Manual terminate ──► TERMINATING ──► TERMINATED (can be redeployed or deleted)
```

> **When billing starts**: Only running replicas are billed. Instances still deploying and replicas still scaling up do not count toward charges.

***

## 5. Autoscaling In Depth

### How It Works

Novita autoscaling monitors live traffic and dynamically adjusts replica count within the Min–Max range:

* **Scale-Up**: Request queue backlog detected → add replicas → more GPUs handle requests in parallel
* **Scale-Down**: Traffic drops → wait for Scale-down Delay to expire → reduce replicas
* **Scale-to-Zero**: When Min Replicas = 0 and the Deployment has been idle past the delay, it enters SLEEPING and billing stops

### Cost vs. Availability Trade-off

| Configuration        | Cost                         | Availability                       | Best for                           |
| -------------------- | ---------------------------- | ---------------------------------- | ---------------------------------- |
| Min=0, Max=N         | Lowest (no charge when idle) | Cold start delay (5 min)           | Dev/test, low-frequency workloads  |
| Min=1, Max=N         | Medium                       | Always available, scales on demand | Most production workloads ✅        |
| Min=N, Max=N (fixed) | Highest                      | No scaling delay at all            | Ultra-low latency SLA requirements |

### Per-Replica Cost

Each additional replica adds cost at the same GPU rate as the base replica.
Example: a 2× H100 Deployment that scales to 2 replicas doubles the GPU cost.

### Best Practices

* Set **Min Replicas = 1** in production to avoid cold starts impacting end users
* The default Scale-down Delay of 300s (5 minutes) works well for most cases; increase it if your traffic is highly variable
* Set Max Replicas to no more than 1.5× your expected (peak QPS / per-replica QPS) to avoid unexpected cost spikes

***

## 6. LoRA Adapter Support

### What is LoRA

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that adds lightweight adapter layers on top of a Base Model to customize it for specific tasks — without retraining the full model.

### Using LoRA in Novita Deployments

**Adding adapters at creation time:**

In Create Deployment → Model field → after selecting a Base Model, click **+ Add Adapter** and enter the LoRA adapter's HuggingFace repository ID.

**Viewing adapters at runtime:**

In the Engine Configuration panel, a `+N LoRA` badge appears next to the Model ID. Hover over it to see the full list of attached adapters.

### Multi-LoRA: Multiple Adapters on One Deployment

A single Deployment can run multiple LoRA adapters simultaneously. Specify which adapter to use per request via the `model` field:

> Multi-LoRA requires no extra GPU resources. All adapters share a single copy of the Base Model weights in memory.

***

## 7. Billing

### Billing Unit

Charged by **GPU-second**: number of GPUs × seconds running × unit price.

### When Billing Starts and Stops

| Event                  | Details                                                                              |
| ---------------------- | ------------------------------------------------------------------------------------ |
| **Billing starts**     | After GPU allocation completes during DEPLOYING (i.e. when Downloading Model begins) |
| **Billing stops**      | When the Deployment enters SLEEPING or TERMINATED status                             |
| **Continuous billing** | A RUNNING Deployment is billed even when it receives zero API requests               |

### GPU Pricing

> For the latest pricing, refer to the [Novita pricing page](https://novita.ai/pricing).

### Billing Example

**Scenario**: A customer deploys model instance X on a single RTX 4090 (priced at \$0.61/GPU/hour), with autoscaling set to Min=0, Max=5.

Usage and charges for 9:00–10:00:

1. **9:00:00 – 9:15:40** — Instance is SLEEPING. Charge: **\$0.00**
2. **9:15:41 – 9:16:45** — 1 running replica serving traffic (65 seconds).
   Charge: ($0.61 ÷ 3600) × 1 replica × 65s = **$0.011\*\*
3. **9:16:46 – 10:00:00** — 2 running replicas serving traffic (1,994 seconds).
   Charge: ($0.61 ÷ 3600) × 2 replicas × 1,994s = **$0.676\*\*

**Total for 9:00–10:00: $0 + $0.011 + $0.676 = $0.687**

### Cost Control Tips

1. **Enable Scale-to-Zero** (Min Replicas = 0) for low-frequency workloads — zero cost when idle
2. **Audit your Deployment list regularly** and delete unused Deployments
3. **Cap Max Replicas conservatively** to prevent unexpected cost spikes from runaway autoscaling
4. **TERMINATED status costs nothing** — terminate and redeploy on demand

***

## 8. FAQ & Troubleshooting

### Deployment Issues

**Q: My Deployment has been stuck in DEPLOYING for a long time — what should I do?**

* `Requesting GPU`: GPU resources may be constrained. Wait 5–10 minutes, or try a different GPU type
* `Downloading Model`: Large models (70B+) can take 10+ minutes to download
* `Engine Initializing`: Should complete within 5 minutes under normal conditions

**Q: My Deployment shows FAILED — what are the common causes?**

* Model is not in `.safetensors` format (`.bin` is not supported)
* HuggingFace Token is invalid or lacks access to a gated model
* Insufficient GPU VRAM for the model (TIGHT MEMORY configuration)
* Model architecture is not yet supported

Debugging steps: check the change log in the Settings Tab → verify model file format → validate the HF Token → increase GPU count and recreate the Deployment.

**Q: My Deployment is SLEEPING — how do I wake it up?**

Send any API request to it. The Deployment wakes up automatically. The first request waits for the cold start to complete before receiving a response.

***

### API Issues

**Q: What do the common HTTP error codes mean?**

| Code  | Cause                                                | Resolution                                                           |
| ----- | ---------------------------------------------------- | -------------------------------------------------------------------- |
| `400` | Malformed request                                    | Validate your request JSON; ensure all required fields are present   |
| `401` | Missing or invalid API Key                           | Include a valid key in `Authorization: Bearer <Key>`                 |
| `403` | API Key lacks access to this endpoint                | Confirm the key belongs to the same account that owns the Deployment |
| `404` | Wrong Endpoint URL or Model ID                       | Re-copy the URL and Model ID from the Quick Start panel              |
| `422` | Invalid parameter value (e.g. max\_tokens too large) | Adjust the parameter — try reducing max\_tokens                      |
| `429` | Rate limit exceeded                                  | Reduce request frequency, or contact Novita to raise your limit      |
| `500` | Internal server error                                | Retry after a short wait; if it persists, contact Novita support     |

**Q: Where do I find my API Key?**

Go to **Settings → API Keys** to create or manage keys. A key is only shown once at creation — save it immediately.

***

### Billing Issues

**Q: Why am I being charged when there are no requests?**

A RUNNING Deployment continuously occupies GPU resources regardless of request volume.
**Fix**: Enable Autoscaling and set Min Replicas = 0. The Deployment will automatically sleep and stop billing when idle.

**Q: How do I stop all charges completely?**

Two options:

* **Scale-to-Zero**: Let autoscaling trigger naturally (requires Autoscaling on with Min = 0)
* **Terminate**: Click **Terminate** on the Deployment detail page to release the GPU immediately

***

## Appendix: Glossary

| Term             | Definition                                                                            |
| ---------------- | ------------------------------------------------------------------------------------- |
| Deployment       | Novita's dedicated inference endpoint product                                         |
| Replica          | A single running instance of the inference service; multiple replicas run in parallel |
| Scale-to-Zero    | Setting Min Replicas to 0 so the endpoint sleeps when idle and billing stops          |
| Scale-down Delay | Wait period before scaling down, preventing flapping on variable traffic              |
| LoRA Adapter     | Lightweight fine-tuning plugin layered on top of a Base Model                         |
| Endpoint URL     | The API access address for this Deployment                                            |
| Endpoint ID      | Unique identifier for this Deployment                                                 |
| Base Model       | The underlying foundation model being served                                          |
| Max Concurrency  | Maximum simultaneous requests a single replica handles                                |
| Suffix Decoding  | N-gram speculative decoding to accelerate inference on predictable outputs            |
| GPU-second       | Billing unit: 1 GPU running for 1 second                                              |

***

*For support, contact the Novita team at: [support@novita.ai](mailto:support@novita.ai)*
