RunPod Pods vs Serverless: When to Use Which (And Why It Matters)
Pods give you a full GPU machine with SSH. Serverless gives you a function that runs on a GPU when called. They're not interchangeable — here's how to pick the right one, and why the workflow that actually works usually involves both.
I've been running workloads on RunPod long enough to have made the wrong choice a few times. Rented a Pod for an inference API I should have put on Serverless. Tried to run a fine-tuning job on Serverless when I needed a Pod. The distinction matters — not just for cost, but for what's actually possible on each.
The fundamental difference
Pods are rented machines. You get a GPU instance with root access, SSH, JupyterLab, and VS Code support via Remote-SSH. The GPU is yours for as long as you're paying, billed by the second with no ingress or egress fees. You can SSH in, install packages, run interactive notebooks, start and stop processes. It's just a computer with a good GPU.
Serverless is GPU function-as-a-service. You write a handler function, package it in Docker, and RunPod runs it when jobs arrive. No persistent process, no SSH access by default, no root access to a running machine between jobs. You pay per second of actual execution. When nothing is running, cost is zero.
This isn't better/worse — they solve different problems. I use both, often for the same model at different stages.
When Pods are the right tool
Interactive development
You can't efficiently iterate on a model if every code change requires rebuilding a Docker image,
pushing it to a registry, and waiting for a cold start. On a Pod, you SSH in, edit code in
/workspace, run it, see results. Feedback loop is seconds.
Jupyter notebooks and data exploration
Most RunPod GPU templates come pre-configured with JupyterLab. For data science workflows — exploring datasets, prototyping model architectures, one-off experiments — you want a persistent notebook environment, not a request handler. Serverless doesn't fit this workflow at all.
Long training runs
Fine-tuning a model takes hours. You need a persistent GPU process with checkpoint support.
Serverless has an execution timeout (600 seconds by default, configurable up to 7 days), but the practical workflow of long, iterative training with checkpoints in /workspace belongs on a Pod.
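A minimal sketch of that checkpoint-and-resume pattern, assuming a PyTorch training loop; the path and state-dict keys are placeholders, and the only RunPod-specific part is writing under /workspace so the file survives a stop:

```python
import os
import torch

CKPT = "/workspace/checkpoints/latest.pt"  # volume disk: survives pod stops

def save_checkpoint(model, optimizer, epoch):
    os.makedirs(os.path.dirname(CKPT), exist_ok=True)
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "epoch": epoch},
        CKPT,
    )

def resume_if_possible(model, optimizer):
    # Pick up where the last run left off; otherwise start from epoch 0
    if os.path.exists(CKPT):
        state = torch.load(CKPT, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["epoch"] + 1
    return 0
```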
Debugging and inspection
When something breaks, you want a shell. Pods give you that. Attach to a running process, check
GPU memory with nvidia-smi, modify files in place, restart without rebuilding.
Serverless doesn't give you this interactivity.
VS Code / Cursor remote development
If your workflow involves Remote-SSH with a full IDE, that's a Pod. Install the Remote-SSH
extension, configure the connection using the pod's TCP port details, and your IDE is running
directly on the GPU machine. Default workspace is /workspace.
When Serverless is the right tool
Inference APIs with variable traffic
Once a model is packaged and working, you don't need interactive access. A Serverless endpoint gives you autoscaling, job queuing, per-second billing, and an HTTP API. If traffic is bursty — quiet most of the time with occasional spikes — Serverless is dramatically cheaper than keeping a Pod running 24/7.
Production model serving
Submit jobs via /run (async) or /runsync (sync wait), poll status,
get results. Monitoring, 90-day log retention, worker lifecycle management — handled. You focus
on the handler function.
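A minimal client sketch for the sync path, assuming an endpoint ID and API key from the RunPod console; the input payload shape is whatever your handler expects:

```python
import os
import requests

ENDPOINT_ID = "your-endpoint-id"            # from the Serverless console
API_KEY = os.environ["RUNPOD_API_KEY"]      # personal API key

# /runsync waits for the job and returns the handler's output in one call
resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "hello world"}},
    timeout=120,
)
print(resp.json())
```

For async submission, the same payload goes to /run instead, and you poll /status/{job_id} until the job completes.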
Pay-per-request economics
A 7B model inference job taking 5 seconds on an A4000 ($0.00016/s) costs $0.0008 per call. An A4000 Pod running idle costs roughly $0.28/hr. If you're averaging fewer than ~350 requests per hour, Serverless is cheaper — and it scales to zero when nobody is calling.
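The break-even point is worth recomputing for your own traffic; a quick sketch using the illustrative rates above (check current pricing before trusting the exact numbers):

```python
serverless_rate = 0.00016      # $/s on an A4000, illustrative
seconds_per_call = 5
pod_hourly = 0.28              # $/hr for the same card, illustrative

cost_per_call = serverless_rate * seconds_per_call   # $0.0008
breakeven = pod_hourly / cost_per_call                # ~350 calls/hour
print(f"Serverless is cheaper below ~{breakeven:.0f} requests/hour")
```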
Storage: the three tiers
This is where a lot of RunPod confusion lives. Three storage types with very different behavior:
| Type | Survives stop? | Survives delete? | Mount path | Cost |
|---|---|---|---|---|
| Container disk | No | No | System-managed | $0.10/GB/mo |
| Volume disk | Yes | No | /workspace | $0.10/GB/mo running, $0.20/GB/mo stopped |
| Network volume | Yes | Yes | /workspace | $0.07/GB/mo (<1 TB), $0.05/GB/mo (>1 TB) |
Container disk is ephemeral — anything outside of /workspace is
gone when the pod stops. Don't store model checkpoints there.
Volume disk persists until you explicitly delete the pod. Good for datasets and checkpoints you're actively working with. Note: it can only be expanded, never shrunk.
Network volumes survive pod deletion and can be attached to different pods. For fine-tuned models you want to keep permanently, this is the right tier.
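One practical pattern this enables: download weights onto the network volume once and reuse them from any pod that attaches it later. A sketch assuming huggingface_hub is installed; the model ID is just an example:

```python
from huggingface_hub import snapshot_download

# /workspace is where the volume mounts; weights placed here outlive the pod
snapshot_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.2",  # example model, swap in your own
    local_dir="/workspace/models/mistral-7b",
)
```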
VRAM sizing
The rule of thumb: ~2 GB VRAM per billion parameters at 16-bit precision (BF16/FP16); FP32 doubles that. 4-bit quantization cuts the requirement roughly 4x relative to 16-bit.
- 7B model at BF16: ~14 GB — fits a 16 GB A4000
- 7B model at 4-bit: ~4 GB — fits almost any GPU
- 13B model at BF16: ~26 GB — needs a 48 GB card
- 13B model at 4-bit: ~7 GB — comfortable on 16 GB
- 70B model at BF16: ~140 GB — H200 or multi-GPU
- 70B model at 4-bit: ~35 GB — A100 (80 GB) is comfortable
- SDXL / Flux image gen: 16–24 GB depending on resolution
| Workload | GPU recommendation | Min VRAM |
|---|---|---|
| LLM inference (7–13B) | RTX 4090, L4 | 24 GB |
| LLM inference (30–70B) | A100, H100 | 48–80 GB |
| LLM training/fine-tuning | A100, H100 | 40–80 GB |
| Image gen (SDXL, Flux) | RTX 4090, L4 | 16–24 GB |
| Computer vision | Entry to mid-range | 8–16 GB |
Before committing to a GPU tier, run your model through the HuggingFace Model Memory Calculator; it gives precise VRAM requirements for specific models at specific quantization levels.
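If you just want a ballpark before reaching for the calculator, the rule of thumb is easy to encode; the 20% overhead factor for activations and KV cache below is an assumption, not a measurement:

```python
def estimate_vram_gb(params_billion, bits=16, overhead=1.2):
    """Rough estimate: weights only, plus a fudge factor for runtime overhead."""
    weights_gb = params_billion * (bits / 8)
    return weights_gb * overhead

for params, bits in [(7, 16), (7, 4), (13, 16), (70, 4)]:
    print(f"{params}B @ {bits}-bit: ~{estimate_vram_gb(params, bits):.0f} GB")
```

These come out slightly above the weights-only figures in the list above; the overhead factor is the difference.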
Pricing comparison
Both Pods and Serverless bill by the second with no ingress/egress fees. The meaningful distinctions are cloud tier and pricing mode.
Community Cloud vs Secure Cloud
Community Cloud is peer-to-peer GPU rental. Cheaper, fewer reliability guarantees, GPU availability can change. Good for experimentation and cost-sensitive workloads that can tolerate interruption.
Secure Cloud runs in Tier 3/4 data centers. Higher reliability, more expensive. Right for production workloads where availability matters.
On-demand, savings plans, spot
- On-demand: standard rate, non-interruptible, pay as you go.
- Savings plans: prepay for 3 or 6 months of compute for a significant discount. Storage bills separately at standard rates. Right for predictable long-running workloads.
- Spot instances: cheapest rate, interruptible with a 5-second warning (SIGTERM then SIGKILL). Excellent for checkpointed training jobs. Do not use for interactive work or anything without checkpoint/resume logic.
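Concretely, "checkpoint/resume logic" on spot means trapping SIGTERM and flushing a checkpoint inside the 5-second window. A minimal sketch; the training step and checkpoint writer are stand-ins for your own code:

```python
import signal
import sys
import time

interrupted = False

def handle_sigterm(signum, frame):
    # RunPod sends SIGTERM roughly 5 seconds before SIGKILL when a spot pod is reclaimed
    global interrupted
    interrupted = True

signal.signal(signal.SIGTERM, handle_sigterm)

def save_checkpoint(step):
    # stand-in: persist model/optimizer state to /workspace here
    print(f"checkpoint written at step {step}")

for step in range(1, 10_001):
    time.sleep(0.1)                    # stand-in for one training step
    if interrupted:
        save_checkpoint(step)          # must finish inside the warning window
        sys.exit(0)
    if step % 500 == 0:
        save_checkpoint(step)          # periodic backstop regardless of interruption
```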
SSH setup
RunPod offers two SSH modes. The basic proxied SSH works immediately but doesn't support SCP or SFTP. For full SSH with file transfers, expose TCP port 22 when creating the pod:
```bash
# Generate a key if you don't have one
ssh-keygen -t ed25519 -C "runpod"

# Add the public key to your RunPod account settings
runpodctl ssh add-key --key "$(cat ~/.ssh/id_ed25519.pub)"

# Basic SSH (proxied, no SCP/SFTP)
ssh user@ssh.runpod.io -i ~/.ssh/id_ed25519

# Full SSH with TCP port exposed (SCP/SFTP supported)
ssh root@POD_IP -p PORT -i ~/.ssh/id_ed25519
```
The external port is shown in the pod dashboard and in the RUNPOD_TCP_PORT_22
environment variable inside the pod. Port numbers can change on restart — worth scripting
rather than memorizing.
Global networking between pods
All NVIDIA GPU pods in your account are connected on a private network — no public internet,
no extra config. Internal DNS is POD_ID.runpod.internal, bandwidth is 100 Mbps
between pods, and this works across all 17 RunPod data centers.
This matters for multi-pod workflows: inference serving on one pod, pre/post-processing on another, distributed training, or just splitting services cleanly. Access a service on another pod:
```bash
# From inside any pod in your account:
curl -X POST http://abc123xyz456.runpod.internal:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"prompt": "hello world"}'
```

```python
# Or in Python
import requests

resp = requests.post(
    "http://abc123xyz456.runpod.internal:8000/predict",
    json={"prompt": "hello world"},
)
```
The pod ID is in the RUNPOD_POD_ID environment variable inside the pod itself.
Make sure services bind to 0.0.0.0 (not localhost) for the internal
network to reach them.
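A minimal serving sketch to make the binding point concrete, assuming FastAPI and uvicorn; the route and payload handling are placeholders:

```python
from fastapi import FastAPI
import uvicorn

app = FastAPI()

@app.post("/predict")
def predict(payload: dict):
    # placeholder: call your model here
    return {"echo": payload}

if __name__ == "__main__":
    # 0.0.0.0 makes the service reachable at POD_ID.runpod.internal:8000;
    # binding to localhost would hide it from the private network
    uvicorn.run(app, host="0.0.0.0", port=8000)
```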
The workflow that actually works
Develop on a Pod. Deploy to Serverless. Use the same Docker image for both.
- Spin up a Pod with the GPU tier your model needs. SSH in or use Jupyter.
- Install dependencies, download the model to /workspace, and iterate on your handler code interactively.
- Write the Dockerfile. Build it on the Pod itself — same architecture, no cross-compilation issues.
- Test locally with python handler.py --rp_serve_api before pushing the image.
- Push to a registry. Create a Serverless endpoint using the same image.
- Terminate the Pod. Serverless handles production serving at a fraction of the idle cost.
Use an environment variable to switch behavior between the two modes in the same image:
```python
import os
import runpod

# Model loads once regardless of mode
model = load_model()
tokenizer = load_tokenizer()

def handler(job):
    result = model.generate(job["input"]["prompt"])
    return {"output": tokenizer.decode(result)}

if os.environ.get("MODE_TO_RUN") == "pod":
    # Keep alive for SSH/Jupyter access
    import time
    print("Pod mode: connect via SSH or Jupyter")
    while True:
        time.sleep(60)
else:
    runpod.serverless.start({"handler": handler})
```
Set MODE_TO_RUN=pod in your Pod environment. Leave it unset on the Serverless
endpoint (defaults to serverless mode). One image, correct behavior in both contexts, no
drift between dev and prod.
Quick decision guide
| Use a Pod when… | Use Serverless when… |
|---|---|
| You need interactive access (SSH, Jupyter) | You have a packaged model ready to serve |
| You're training or fine-tuning | Traffic is bursty or intermittent |
| You need to debug actively | You want pay-per-request economics |
| The workflow requires a persistent process | You want autoscaling and built-in queuing |
| You're iterating on code quickly | The model is stable and production-ready |
| You want VS Code Remote-SSH | Cold starts are acceptable for your use case |
The most common mistake: trying to skip the Pod phase and iterate directly in Serverless. Rebuilding Docker images to test a one-line code change is painful. Start on a Pod. Move to Serverless when the model works.