RunPod Pods vs Serverless: When to Use Which (And Why It Matters)
Pods give you a full GPU machine with SSH. Serverless gives you a function that runs on a GPU when called. They're not interchangeable — here's how to pick the right one, and why the workflow that actually works usually involves both.
I've been running workloads on RunPod long enough to have made the wrong choice a few times. Rented a Pod for an inference API I should have put on Serverless. Tried to run a fine-tuning job on Serverless when I needed a Pod. The distinction matters — not just for cost, but for what's actually possible on each.
The fundamental difference
Pods are rented machines. You get a GPU instance with root access, SSH, JupyterLab, and VS Code support via Remote-SSH. The GPU is yours for as long as you're paying, billed by the second with no ingress or egress fees. You can SSH in, install packages, run interactive notebooks, start and stop processes. It's just a computer with a good GPU.
Serverless is GPU function-as-a-service. You write a handler function, package it in Docker, and RunPod runs it when jobs arrive. No persistent process, no SSH access by default, no root access to a running machine between jobs. You pay per second of actual execution. When nothing is running, cost is zero.
This isn't better/worse — they solve different problems. I use both, often for the same model at different stages.
When Pods are the right tool
Interactive development
You can't efficiently iterate on a model if every code change requires rebuilding a Docker image,
pushing it to a registry, and waiting for a cold start. On a Pod, you SSH in, edit code in
/workspace, run it, see results. Feedback loop is seconds.
Jupyter notebooks and data exploration
Most RunPod GPU templates come pre-configured with JupyterLab. For data science workflows — exploring datasets, prototyping model architectures, one-off experiments — you want a persistent notebook environment, not a request handler. Serverless doesn't fit this workflow at all.
Long training runs
Fine-tuning a model takes hours. You need a persistent GPU process with checkpoint support.
Serverless has an execution timeout (600 seconds by default, configurable up to 7 days), but the practical workflow of long, iterative training with checkpoints in /workspace belongs on a Pod.
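A minimal sketch of that checkpoint-and-resume pattern, assuming a PyTorch training loop; the path and state-dict keys are placeholders, and the only RunPod-specific part is writing under /workspace so the file survives a stop:

```python
import os
import torch

CKPT = "/workspace/checkpoints/latest.pt"  # volume disk: survives pod stops

def save_checkpoint(model, optimizer, epoch):
    os.makedirs(os.path.dirname(CKPT), exist_ok=True)
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "epoch": epoch},
        CKPT,
    )

def resume_if_possible(model, optimizer):
    # Pick up where the last run left off; otherwise start from epoch 0
    if os.path.exists(CKPT):
        state = torch.load(CKPT, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["epoch"] + 1
    return 0
```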
Debugging and inspection
When something breaks, you want a shell. Pods give you that. Attach to a running process, check
GPU memory with nvidia-smi, modify files in place, restart without rebuilding.
Serverless doesn't give you this interactivity.
VS Code / Cursor remote development
If your workflow involves Remote-SSH with a full IDE, that's a Pod. Install the Remote-SSH
extension, configure the connection using the pod's TCP port details, and your IDE is running
directly on the GPU machine. Default workspace is /workspace.
When Serverless is the right tool
Inference APIs with variable traffic
Once a model is packaged and working, you don't need interactive access. A Serverless endpoint gives you autoscaling, job queuing, per-second billing, and an HTTP API. If traffic is bursty — quiet most of the time with occasional spikes — Serverless is dramatically cheaper than keeping a Pod running 24/7.
Production model serving
Submit jobs via /run (async) or /runsync (sync wait), poll status,
get results. Monitoring, 90-day log retention, worker lifecycle management — handled. You focus
on the handler function.
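A minimal client sketch for the sync path, assuming an endpoint ID and API key from the RunPod console; the input payload shape is whatever your handler expects:

```python
import os
import requests

ENDPOINT_ID = "your-endpoint-id"            # from the Serverless console
API_KEY = os.environ["RUNPOD_API_KEY"]      # personal API key

# /runsync waits for the job and returns the handler's output in one call
resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "hello world"}},
    timeout=120,
)
print(resp.json())
```

For async submission, the same payload goes to /run instead, and you poll /status/{job_id} until the job completes.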
Pay-per-request economics
A 7B model inference job taking 5 seconds on an A4000 ($0.00016/s) costs $0.0008 per call. An A4000 Pod running idle costs roughly $0.28/hr. If you're averaging fewer than ~350 requests per hour, Serverless is cheaper — and it scales to zero when nobody is calling.
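The break-even point is worth recomputing for your own traffic; a quick sketch using the illustrative rates above (check current pricing before trusting the exact numbers):

```python
serverless_rate = 0.00016      # $/s on an A4000, illustrative
seconds_per_call = 5
pod_hourly = 0.28              # $/hr for the same card, illustrative

cost_per_call = serverless_rate * seconds_per_call   # $0.0008
breakeven = pod_hourly / cost_per_call                # ~350 calls/hour
print(f"Serverless is cheaper below ~{breakeven:.0f} requests/hour")
```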
Storage: the three tiers
This is where a lot of RunPod confusion lives. Three storage types with very different behavior:
| Type | Survives stop? | Survives delete? | Mount path | Cost |
|---|---|---|---|---|
| Container disk | No | No | System-managed | $0.10/GB/mo |
| Volume disk | Yes | No | /workspace | $0.10/GB/mo running, $0.20/GB/mo stopped |
| Network volume | Yes | Yes | /workspace | $0.07/GB/mo (<1 TB), $0.05/GB/mo (>1 TB) |
Container disk is ephemeral — anything outside of /workspace is
gone when the pod stops. Don't store model checkpoints there.
Volume disk persists until you explicitly delete the pod. Good for datasets and checkpoints you're actively working with. Note: it can only be expanded, never shrunk.
Network volumes survive pod deletion and can be attached to different pods. For fine-tuned models you want to keep permanently, this is the right tier.
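One practical pattern this enables: download weights onto the network volume once and reuse them from any pod that attaches it later. A sketch assuming huggingface_hub is installed; the model ID is just an example:

```python
from huggingface_hub import snapshot_download

# /workspace is where the volume mounts; weights placed here outlive the pod
snapshot_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.2",  # example model, swap in your own
    local_dir="/workspace/models/mistral-7b",
)
```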
VRAM sizing
The rule of thumb: ~2 GB VRAM per billion parameters at 16-bit precision (BF16/FP16); FP32 doubles that. 4-bit quantization cuts the requirement roughly 4x relative to 16-bit.
- 7B model at BF16: ~14 GB — fits a 16 GB A4000
- 7B model at 4-bit: ~4 GB — fits almost any GPU
- 13B model at BF16: ~26 GB — needs a 48 GB card
- 13B model at 4-bit: ~7 GB — comfortable on 16 GB
- 70B model at BF16: ~140 GB — H200 or multi-GPU
- 70B model at 4-bit: ~35 GB — A100 (80 GB) is comfortable
- SDXL / Flux image gen: 16–24 GB depending on resolution
| Workload | GPU recommendation | Min VRAM |
|---|---|---|
| LLM inference (7–13B) | RTX 4090, L4 | 24 GB |
| LLM inference (30–70B) | A100, H100 | 48–80 GB |
| LLM training/fine-tuning | A100, H100 | 40–80 GB |
| Image gen (SDXL, Flux) | RTX 4090, L4 | 16–24 GB |
| Computer vision | Entry to mid-range | 8–16 GB |
Before committing to a GPU tier, run your model through the HuggingFace Model Memory Calculator; it gives precise VRAM requirements for specific models at specific quantization levels.
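If you just want a ballpark before reaching for the calculator, the rule of thumb is easy to encode; the 20% overhead factor for activations and KV cache below is an assumption, not a measurement:

```python
def estimate_vram_gb(params_billion, bits=16, overhead=1.2):
    """Rough estimate: weights only, plus a fudge factor for runtime overhead."""
    weights_gb = params_billion * (bits / 8)
    return weights_gb * overhead

for params, bits in [(7, 16), (7, 4), (13, 16), (70, 4)]:
    print(f"{params}B @ {bits}-bit: ~{estimate_vram_gb(params, bits):.0f} GB")
```

These come out slightly above the weights-only figures in the list above; the overhead factor is the difference.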
Pricing comparison
Both Pods and Serverless bill by the second with no ingress/egress fees. The meaningful distinctions are cloud tier and pricing mode.
Community Cloud vs Secure Cloud
Community Cloud is peer-to-peer GPU rental. Cheaper, fewer reliability guarantees, GPU availability can change. Good for experimentation and cost-sensitive workloads that can tolerate interruption.
Secure Cloud runs in Tier 3/4 data centers. Higher reliability, more expensive. Right for production workloads where availability matters.
On-demand, savings plans, spot
- On-demand: standard rate, non-interruptible, pay as you go.
- Savings plans: prepay for 3 or 6 months of compute for a significant discount. Storage bills separately at standard rates. Right for predictable long-running workloads.
- Spot instances: cheapest rate, interruptible with a 5-second warning (SIGTERM then SIGKILL). Excellent for checkpointed training jobs. Do not use for interactive work or anything without checkpoint/resume logic.
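Concretely, "checkpoint/resume logic" on spot means trapping SIGTERM and flushing a checkpoint inside the 5-second window. A minimal sketch; the training step and checkpoint writer are stand-ins for your own code:

```python
import signal
import sys
import time

interrupted = False

def handle_sigterm(signum, frame):
    # RunPod sends SIGTERM roughly 5 seconds before SIGKILL when a spot pod is reclaimed
    global interrupted
    interrupted = True

signal.signal(signal.SIGTERM, handle_sigterm)

def save_checkpoint(step):
    # stand-in: persist model/optimizer state to /workspace here
    print(f"checkpoint written at step {step}")

for step in range(1, 10_001):
    time.sleep(0.1)                    # stand-in for one training step
    if interrupted:
        save_checkpoint(step)          # must finish inside the warning window
        sys.exit(0)
    if step % 500 == 0:
        save_checkpoint(step)          # periodic backstop regardless of interruption
```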
SSH setup
RunPod offers two SSH modes. The basic proxied SSH works immediately but doesn't support SCP or SFTP. For full SSH with file transfers, expose TCP port 22 when creating the pod:
```bash
# Generate a key if you don't have one
ssh-keygen -t ed25519 -C "runpod"

# Add the public key to your RunPod account settings
runpodctl ssh add-key --key "$(cat ~/.ssh/id_ed25519.pub)"

# Basic SSH (proxied, no SCP/SFTP)
ssh user@ssh.runpod.io -i ~/.ssh/id_ed25519

# Full SSH with TCP port exposed (SCP/SFTP supported)
ssh root@POD_IP -p PORT -i ~/.ssh/id_ed25519
```
The external port is shown in the pod dashboard and in the RUNPOD_TCP_PORT_22
environment variable inside the pod. Port numbers can change on restart — worth scripting
rather than memorizing.
Global networking between pods
All NVIDIA GPU pods in your account are connected on a private network — no public internet,
no extra config. Internal DNS is POD_ID.runpod.internal, bandwidth is 100 Mbps
between pods, and this works across all 17 RunPod data centers.
This matters for multi-pod workflows: inference serving on one pod, pre/post-processing on another, distributed training, or just splitting services cleanly. Access a service on another pod:
```bash
# From inside any pod in your account:
curl -X POST http://abc123xyz456.runpod.internal:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"prompt": "hello world"}'
```

```python
# Or in Python
import requests

resp = requests.post(
    "http://abc123xyz456.runpod.internal:8000/predict",
    json={"prompt": "hello world"},
)
```
The pod ID is in the RUNPOD_POD_ID environment variable inside the pod itself.
Make sure services bind to 0.0.0.0 (not localhost) for the internal
network to reach them.
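A minimal serving sketch to make the binding point concrete, assuming FastAPI and uvicorn; the route and payload handling are placeholders:

```python
from fastapi import FastAPI
import uvicorn

app = FastAPI()

@app.post("/predict")
def predict(payload: dict):
    # placeholder: call your model here
    return {"echo": payload}

if __name__ == "__main__":
    # 0.0.0.0 makes the service reachable at POD_ID.runpod.internal:8000;
    # binding to localhost would hide it from the private network
    uvicorn.run(app, host="0.0.0.0", port=8000)
```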
The workflow that actually works
Develop on a Pod. Deploy to Serverless. Use the same Docker image for both.
- Spin up a Pod with the GPU tier your model needs. SSH in or use Jupyter.
- Install dependencies, download the model to /workspace, and iterate on your handler code interactively.
- Write the Dockerfile. Build it on the Pod itself — same architecture, no cross-compilation issues.
- Test locally with python handler.py --rp_serve_api before pushing the image.
- Push to a registry. Create a Serverless endpoint using the same image.
- Terminate the Pod. Serverless handles production serving at a fraction of the idle cost.
Use an environment variable to switch behavior between the two modes in the same image:
```python
import os
import runpod

# Model loads once regardless of mode
model = load_model()
tokenizer = load_tokenizer()

def handler(job):
    result = model.generate(job["input"]["prompt"])
    return {"output": tokenizer.decode(result)}

if os.environ.get("MODE_TO_RUN") == "pod":
    # Keep alive for SSH/Jupyter access
    import time
    print("Pod mode: connect via SSH or Jupyter")
    while True:
        time.sleep(60)
else:
    runpod.serverless.start({"handler": handler})
```
Set MODE_TO_RUN=pod in your Pod environment. Leave it unset on the Serverless
endpoint (defaults to serverless mode). One image, correct behavior in both contexts, no
drift between dev and prod.
Quick decision guide
| Use a Pod when… | Use Serverless when… |
|---|---|
| You need interactive access (SSH, Jupyter) | You have a packaged model ready to serve |
| You're training or fine-tuning | Traffic is bursty or intermittent |
| You need to debug actively | You want pay-per-request economics |
| The workflow requires a persistent process | You want autoscaling and built-in queuing |
| You're iterating on code quickly | The model is stable and production-ready |
| You want VS Code Remote-SSH | Cold starts are acceptable for your use case |
The most common mistake: trying to skip the Pod phase and iterate directly in Serverless. Rebuilding Docker images to test a one-line code change is painful. Start on a Pod. Move to Serverless when the model works.