How to Run AI Models on Proxmox VE w/ GPU Passthrough

Image: A homelab with multiple servers running different LLMs (Llama, Mistral, Phi, Qwen, Gemma) on Proxmox VE.

Running AI models such as large language models (LLMs) on your own hardware keeps your prompts and data local, avoids cloud API costs, and fits right into a Proxmox homelab. With GPU passthrough, you can give a Proxmox VE VM or LXC container direct access to a GPU and run tools such as Ollama, text-generation-webui, or vLLM for self-hosted AI inference. This guide covers why and how to run AI models on Proxmox VE: PCI passthrough setup, choosing between a VM and an LXC, and a concrete Ollama example. For Proxmox basics and migration from VMware, see our Proxmox VE Guide.

Why Run AI Models on Proxmox?

Proxmox VE already hosts your VMs and containers, so adding AI workloads keeps everything in one place, with the same web UI, backup (vzdump, Proxmox Backup Server), and monitoring (e.g. Pulse) you use for other guests. Giving a guest access to a GPU lets that workload use the card directly, which is ideal for LLM inference and other GPU-accelerated AI tasks. Without it, you are limited to CPU-only inference (much slower) or more advanced setups such as vGPU. For most homelab setups, a dedicated GPU per AI guest is the simplest and fastest option.

VM vs LXC for AI Workloads

VMs (KVM): Full isolation and any guest OS. You can run Ubuntu, Windows, or another OS and install the GPU driver and AI stack inside the VM. GPU passthrough to KVM guests is well documented on Proxmox: bind the GPU to vfio-pci on the host, add it as a PCI device in the Proxmox UI, and install the driver in the guest (e.g. the NVIDIA driver on Ubuntu 22.04). The GPU is then dedicated to that one VM. Overhead is slightly higher than with containers, but you get maximum flexibility and the most straightforward path for NVIDIA GPU passthrough.

LXC containers: Lighter and Linux-only. Here the Proxmox host keeps its own GPU driver (the card is not bound to vfio-pci), and the container is given access to the GPU’s device nodes (/dev/dri for Intel and AMD, /dev/nvidia* for NVIDIA) through its config. Because the host driver owns the card, several containers can share one GPU. This suits Ollama and other Linux-native AI tools when you want lower resource use and fast startup. For NVIDIA, the user-space driver libraries inside the container must match the host driver version, or you can use the NVIDIA container toolkit.

Recommendation: Use a VM if you need Windows or want the most straightforward GPU driver setup; use LXC if you’re comfortable with Linux-only and want lower resource use, faster startup, or to share one GPU between several LLM workloads.

Suggested GPUs for AI Workloads

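VRAM is the main constraint for running LLMs. As a rough guide for 4-bit quantized models (the default for most Ollama tags): 7B–8B models fit in 8 GB, 13B–14B models need roughly 8–10 GB, so a 12 GB card is comfortable, 30B–34B models need about 20–24 GB, and 70B models need roughly 40 GB or more, which means two 24 GB cards or partial offload to system RAM. The table below lists consumer and prosumer NVIDIA and AMD cards commonly used with Ollama and similar tools, from good to best for local inference.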

Tier	NVIDIA	AMD	VRAM (NVIDIA / AMD)	Notes
Good	RTX 3060	RX 6600 XT	12 GB / 8 GB	Entry-level; the 12 GB RTX 3060 is strong value for 7B–13B models
Good	RTX 4060	RX 6650 XT	8 GB / 8 GB	Compact builds; 7B–8B models
Better	RTX 3070 / 4060 Ti (16 GB)	RX 6700 XT	8–16 GB / 12 GB	Sweet spot for 7B–13B at good speed (13B is tight on 8 GB); 16 GB fits quantized models up to roughly 20B
Better	RTX 3080	RX 6800 XT	10–12 GB / 16 GB	Fast inference; the 16 GB AMD card fits larger models
Best	RTX 3090 / 4090	RX 7900 XT / XTX	24 GB / 20–24 GB	Fits quantized 30B–34B models; 70B-class needs about 40 GB at 4-bit (two cards or CPU offload)
Best	RTX 4080	n/a	16 GB	Strong performance; 16 GB limits model size

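NVIDIA has the broadest support in Ollama and most AI stacks on Linux. AMD cards need ROCm in the guest; Ollama officially supports a specific list of Radeon GPUs, and some cards in this table (e.g. the RX 6600 XT and RX 6700 XT) are not on it and may need the HSA_OVERRIDE_GFX_VERSION workaround described in Ollama’s GPU documentation. Both vendors work with Proxmox GPU passthrough.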
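To sanity-check whether a model fits one of these cards, a common rule of thumb is weights ≈ parameters × bits per weight ÷ 8, plus headroom for the context (KV cache) and GPU runtime. The sketch below is a rough estimator, not a guarantee: the function names are hypothetical, and the defaults of 4.5 bits per weight (typical of 4-bit GGUF quantizations such as Q4_0) and a flat 1.5 GB overhead are assumptions; real usage grows with context length.

```shell
#!/usr/bin/env bash
# Rough VRAM estimate for a quantized LLM (rule of thumb only).
#   weights_gb = params_in_billions * bits_per_weight / 8
#   total_gb   = weights_gb + overhead_gb   (KV cache, CUDA/ROCm runtime)
# Assumed defaults: 4.5 bits/weight (~Q4_0) and 1.5 GB overhead.
vram_estimate_gb() {
  local params_b="$1" bits="${2:-4.5}" overhead="${3:-1.5}"
  awk -v p="$params_b" -v b="$bits" -v o="$overhead" \
    'BEGIN { printf "%.1f\n", p * b / 8 + o }'
}

# fits_in_vram PARAMS_B VRAM_GB [BITS] [OVERHEAD_GB]
# Exit status 0 if the estimate fits in VRAM_GB, 1 otherwise.
fits_in_vram() {
  local need
  need=$(vram_estimate_gb "$1" "${3:-4.5}" "${4:-1.5}")
  awk -v n="$need" -v v="$2" 'BEGIN { exit !(n <= v) }'
}
```

For example, vram_estimate_gb 13 gives about 8.8 GB, and fits_in_vram 70 24 fails, matching the table’s note that 70B-class models need more than one 24 GB card or CPU offload.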

Enable GPU Passthrough on Proxmox

Host prerequisites

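Enable IOMMU (Input-Output Memory Management Unit) in the BIOS/UEFI: Intel VT-d or AMD-Vi, depending on your CPU. Then make sure the kernel has IOMMU enabled. On Intel, add intel_iommu=on to the kernel command line; on AMD, the IOMMU is normally enabled automatically once it is on in firmware. On hosts that boot with GRUB, add the parameter to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and run update-grub; on hosts that boot with systemd-boot (common for ZFS-on-root UEFI installs), add it to /etc/kernel/cmdline and run proxmox-boot-tool refresh. Then reboot. If you plan to pass the GPU through to a VM and the host currently uses it, blacklist the host driver (e.g. nouveau or nvidia) so that vfio-pci can claim it; for LXC, keep the host driver loaded instead. After the reboot, run dmesg | grep -e DMAR -e IOMMU to confirm IOMMU is active.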
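Once IOMMU is active, it also helps to check how devices are grouped: a GPU passes through cleanly to a VM when its IOMMU group contains only the GPU’s own functions (typically video and audio). The sketch below lists every group from sysfs. The function name is hypothetical, and the base directory is a parameter only so the function can be tested against a fake tree.

```shell
#!/usr/bin/env bash
# list_iommu_groups [BASE]: print "group N: PCI-ADDRESS" for every device,
# reading the sysfs tree (default /sys/kernel/iommu_groups), sorted by group.
list_iommu_groups() {
  local base="${1:-/sys/kernel/iommu_groups}" group dev
  if [ ! -d "$base" ] || [ -z "$(ls -A "$base" 2>/dev/null)" ]; then
    echo "no IOMMU groups under $base (is IOMMU enabled?)" >&2
    return 1
  fi
  for group in "$base"/*; do
    for dev in "$group"/devices/*; do
      [ -e "$dev" ] || continue          # skip an unmatched glob
      printf 'group %s: %s\n' "${group##*/}" "${dev##*/}"
    done
  done | sort -t' ' -k2,2n -k3           # numeric group order, then address
}
```

Pass an address from the output to lspci -nns (e.g. lspci -nns 0000:01:00.0) to see which device it is.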

Proxmox: bind the GPU to vfio-pci (VM passthrough only)

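Find the GPU’s PCI address and vendor:device IDs with lspci -nn. NVIDIA’s vendor ID is 10de and AMD’s is 1002, so an NVIDIA GPU shows up as 10de:<device>. Most GPUs also expose an HDMI/DisplayPort audio function at the same address (e.g. 01:00.0 for video and 01:00.1 for audio); bind both. Add to /etc/modprobe.d/vfio.conf:

options vfio-pci ids=xxxx:yyyy,xxxx:zzzz

Replace xxxx:yyyy and xxxx:zzzz with the vendor:device IDs of the GPU’s video and audio functions. Add vfio, vfio_iommu_type1, and vfio_pci to /etc/modules (one per line), then run update-initramfs -u -k all and reboot. After the reboot, lspci -nnk should show "Kernel driver in use: vfio-pci" for the GPU. That is your success checkpoint before adding the device to a guest.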
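Collecting the IDs by hand is error-prone when a card has several functions. The sketch below, assuming bash and the usual lspci -nn line format, extracts the vendor:device ID of every function at a given bus:device address and prints the vfio.conf line. The function name is hypothetical; review its output before writing it anywhere.

```shell
#!/usr/bin/env bash
# vfio_ids_for_slot SLOT: read `lspci -nn` output on stdin and print a
# vfio-pci options line covering every function at bus:device SLOT
# (e.g. 01:00 matches both 01:00.0 video and 01:00.1 HDMI audio).
vfio_ids_for_slot() {
  local slot="$1" ids
  if ! [[ "$slot" =~ ^[0-9a-f]{2}:[0-9a-f]{2}$ ]]; then
    echo "expected SLOT like 01:00, got: $slot" >&2
    return 2
  fi
  # For lines of this slot, capture the last [vendor:device] pair (the greedy
  # .* skips past the [class] code), then de-duplicate and join with commas.
  ids=$(sed -nE "s/^${slot}\.[0-7] .*\[([0-9a-f]{4}:[0-9a-f]{4})\].*/\1/p" \
    | sort -u \
    | paste -sd, -)
  if [ -z "$ids" ]; then
    echo "no PCI functions found at $slot" >&2
    return 1
  fi
  printf 'options vfio-pci ids=%s\n' "$ids"
}
```

Typical use would be lspci -nn | vfio_ids_for_slot 01:00, then copying the line into /etc/modprobe.d/vfio.conf. Binding by ID claims every card with that ID, so this approach does not fit a host that must keep an identical second GPU for itself.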

VM: add the PCI device

In the VM’s Hardware tab in the Proxmox web UI, choose Add > PCI Device and select your GPU. Enable "All Functions" so the GPU’s audio function is passed through with it. ROM-Bar is enabled by default; leave it on, since some NVIDIA cards need it for proper initialization. Enable "PCI-Express" as well; it requires the q35 machine type, and q35 with OVMF (UEFI) firmware is the usual recommendation for GPU passthrough. Start the VM and install the GPU driver inside the guest (e.g. the NVIDIA driver on Ubuntu). Verify in the guest with nvidia-smi, or rocm-smi for AMD.
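The same settings can be applied from the host shell with qm. The sketch below only prints the command so you can review it before running it; the function name is hypothetical. It assumes the VM can use the q35 machine type (required by pcie=1), and it passes the bus:device address without a function suffix (e.g. 01:00), which Proxmox treats as "all functions", like the checkbox in the UI.

```shell
#!/usr/bin/env bash
# hostpci_cmd VMID SLOT [INDEX]: print (not run) a qm command that sets the
# q35 machine type and passes through all functions of the device at SLOT.
hostpci_cmd() {
  local vmid="$1" slot="$2" idx="${3:-0}"
  if ! [[ "$vmid" =~ ^[0-9]+$ && "$idx" =~ ^[0-9]+$ ]]; then
    echo "VMID and INDEX must be numbers" >&2
    return 2
  fi
  # bus:device without ".function" means "all functions" to Proxmox
  if ! [[ "$slot" =~ ^([0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}$ ]]; then
    echo "expected SLOT like 01:00 or 0000:01:00, got: $slot" >&2
    return 2
  fi
  printf 'qm set %s --machine q35 --hostpci%s %s,pcie=1\n' "$vmid" "$idx" "$slot"
}
```

Run the printed command on the Proxmox host while the VM is stopped. Switching an existing VM to q35 changes its virtual hardware layout, so existing guests (Windows in particular) may need to re-detect devices.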

LXC: expose the GPU to the container

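For LXC, the host owns the GPU, so do not bind the card to vfio-pci: install the GPU driver on the Proxmox host, then expose the device nodes by editing the container config (e.g. /etc/pve/lxc/<CTID>.conf). For Intel or AMD GPUs (DRM devices, major number 226):

lxc.cgroup2.devices.allow: c 226:* rwm
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir

For NVIDIA, allow major 195 instead and bind-mount the NVIDIA device nodes (/dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm, /dev/nvidia-uvm-tools) with create=file. The nvidia-uvm nodes get a dynamically assigned major number, so check yours with ls -l /dev/nvidia* and add a matching devices.allow line; AMD ROCm also needs /dev/kfd, whose major number is assigned dynamically too. (See the Proxmox and LXC docs for your GPU.) Inside the container, install NVIDIA user-space libraries that exactly match the host driver version (e.g. the same .run installer with --no-kernel-module) or use the NVIDIA container toolkit. Security note: this gives the container direct access to the GPU and the host driver.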
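Because some major numbers are assigned at boot, generating the config lines from the live device nodes avoids typos. The sketch below (hypothetical function names) has two parts: gpu_device_majors reads the major number of each existing character device with GNU stat’s %t (hex) format, and lxc_gpu_conf turns "PATH MAJOR" pairs into one cgroup allow rule per distinct major plus one bind mount per node. It targets device nodes only; keep the create=dir mount for the /dev/dri directory.

```shell
#!/usr/bin/env bash
# lxc_gpu_conf: read "DEVICE-PATH MAJOR" pairs on stdin and print LXC config
# lines: one cgroup allow rule per distinct major, one bind mount per node.
lxc_gpu_conf() {
  local path major allows="" mounts="" seen=" "
  while read -r path major; do
    [ -n "$path" ] && [ -n "$major" ] || continue
    case "$seen" in
      *" $major "*) ;;                     # major already allowed
      *) allows+="lxc.cgroup2.devices.allow: c ${major}:* rwm"$'\n'
         seen+="$major " ;;
    esac
    # The container-side path is relative (no leading slash), as LXC expects.
    mounts+="lxc.mount.entry: ${path} ${path#/} none bind,optional,create=file"$'\n'
  done
  printf '%s%s' "$allows" "$mounts"
}

# gpu_device_majors [DEVICE...]: print "PATH MAJOR" for each existing
# character device (default: the usual NVIDIA nodes). %t is the hex major.
gpu_device_majors() {
  local dev
  [ "$#" -gt 0 ] || set -- /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia-uvm-tools
  for dev in "$@"; do
    if [ -c "$dev" ]; then
      printf '%s %d\n' "$dev" "0x$(stat -c '%t' "$dev")"
    fi
  done
}
```

On the host, gpu_device_majors | lxc_gpu_conf prints lines you can review and append to the container’s config; for AMD ROCm, gpu_device_majors /dev/kfd covers the compute node.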

Troubleshooting GPU passthrough

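GPU not visible in the guest: For a VM, confirm the GPU is bound to vfio-pci on the host (lspci -nnk), try enabling ROM-Bar and PCI-Express (q35 machine type), and make sure no other running guest uses the same PCI device. For LXC, the opposite applies: the host driver (nvidia, amdgpu, or i915) must own the card, the device nodes must exist on the host, and the lxc.cgroup2.devices.allow and lxc.mount.entry lines must match their major numbers.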
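For the VM case, a quick host-side check is to feed lspci -nnk -s <slot> (e.g. -s 01:00, which matches every function of that device) into a small helper. The sketch below is hypothetical and assumes the standard lspci -k layout with an indented "Kernel driver in use:" line under each function; it reports each function’s driver and fails unless all of them use vfio-pci. For an LXC setup you would expect the host driver here instead.

```shell
#!/usr/bin/env bash
# check_vfio_bound: read `lspci -nnk` output on stdin, print "ADDR -> DRIVER"
# for each function, and exit 1 unless every function uses vfio-pci.
check_vfio_bound() {
  awk '
    /^[0-9a-f]/ { addr = $1; order[++n] = addr; drv[addr] = "none" }
    /Kernel driver in use:/ { drv[addr] = $NF }
    END {
      bad = (n == 0)
      for (i = 1; i <= n; i++) {
        printf "%s -> %s\n", order[i], drv[order[i]]
        if (drv[order[i]] != "vfio-pci") bad = 1
      }
      exit bad
    }'
}
```

A result such as "01:00.1 -> snd_hda_intel" usually means the audio function’s ID is missing from vfio.conf.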

IOMMU not enabled: Verify in the BIOS that Intel VT-d or AMD-Vi is on, check cat /proc/cmdline for intel_iommu=on on Intel hosts, and make sure the host was rebooted after you changed the GRUB or systemd-boot configuration.

Ollama or app doesn’t see the GPU: Inside the guest, confirm the driver is loaded (nvidia-smi or ls /dev/dri). For Ollama, ensure the service runs after the GPU is available; restart the Ollama service if you added the GPU after the guest was already running.
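Once a model is loaded (for example right after ollama run), ollama ps reports where it is running in its PROCESSOR column: 100% GPU, 100% CPU, or a CPU/GPU split. The helper below is a hypothetical sketch that assumes this output format and flags any loaded model that is not fully on the GPU.

```shell
#!/usr/bin/env bash
# ollama_gpu_check: read `ollama ps` output on stdin and report each loaded
# model. Exit 0 if all are "100% GPU", 1 if any are not, 2 if none are loaded.
ollama_gpu_check() {
  awk '
    NR == 1 { next }                  # skip the header row
    NF == 0 { next }                  # ignore blank lines
    { loaded++ }
    /100% GPU/ { print $1 ": on GPU"; next }
    { print $1 ": NOT fully on GPU"; bad = 1 }
    END {
      if (loaded == 0) { print "no models loaded" > "/dev/stderr"; exit 2 }
      exit (bad ? 1 : 0)
    }'
}
```

For example, ollama run llama3.2 "hello" >/dev/null followed by ollama ps | ollama_gpu_check shows whether inference actually landed on the GPU.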

Install and Run Ollama in a Proxmox Guest

Guest OS: Use Ubuntu 22.04 LTS or similar. For NVIDIA, install the driver inside the guest (or use an LXC with the host’s driver access as above). Assumed: you have root or sudo and the GPU is already visible in the guest.

Install Ollama: In the guest, run:

curl -fsSL https://ollama.com/install.sh | sh

On systemd-based distros, the install script creates and starts an ollama service; if it is not running, start it with sudo systemctl enable --now ollama. Pull a model (e.g. Llama 3.2):

ollama pull llama3.2

Run a chat:

ollama run llama3.2

Ollama API: Ollama serves an HTTP API on port 11434, bound to 127.0.0.1 by default, so only the guest itself can reach it. To call it from other VMs or your network, set OLLAMA_HOST=0.0.0.0 for the service (e.g. with sudo systemctl edit ollama), and put it behind a firewall and, if exposed more widely, a reverse proxy. Alternatives such as text-generation-webui (oobabooga) or vLLM can run in the same VM or LXC; install them in the guest once the GPU is working for additional model formats and inference options.
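Ollama listens only on 127.0.0.1 unless OLLAMA_HOST is set, and its documented way to change that on Linux is an Environment= line for the systemd service (via sudo systemctl edit ollama). The sketch below writes an equivalent drop-in file directly; the helper names are hypothetical, and the default address 0.0.0.0:11434 (all interfaces) is an assumption you should narrow to one interface if you can.

```shell
#!/usr/bin/env bash
# ollama_host_dropin [ADDR]: print a systemd drop-in that makes the ollama
# service listen on ADDR (default 0.0.0.0:11434, i.e. all interfaces).
ollama_host_dropin() {
  local addr="${1:-0.0.0.0:11434}"
  printf '[Service]\nEnvironment="OLLAMA_HOST=%s"\n' "$addr"
}

# install_ollama_dropin [ADDR] [DIR]: write the drop-in as override.conf
# (run as root in the guest; DIR defaults to the ollama unit's drop-in dir).
install_ollama_dropin() {
  local dir="${2:-/etc/systemd/system/ollama.service.d}"
  mkdir -p "$dir" && ollama_host_dropin "${1:-0.0.0.0:11434}" > "$dir/override.conf"
}
```

After writing the file, run systemctl daemon-reload and systemctl restart ollama, then test from another VM with curl http://<guest-ip>:11434/api/generate -d '{"model": "llama3.2", "prompt": "Hello", "stream": false}'.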

Top 5 open source models to run on your stack

Once Ollama is installed in your Proxmox guest, you can pull and run these popular open-source models. Use ollama pull <model> (see download links). Quantized variants keep VRAM within the ranges in the GPU table above.

Model	Description	Size / VRAM	Download
Llama 3.2	Meta; general chat (1B and 3B text models; use llama3.1 for 8B)	1B–3B; 4GB+	ollama.com/library/llama3.2
Mistral 7B	Mistral AI; strong 7B for chat and reasoning	7B; ~6GB VRAM (quantized)	ollama.com/library/mistral
Phi-3	Microsoft; small 3.8B and 14B, good for edge	3.8B–14B; 4GB+ / 10GB+	ollama.com/library/phi3
Qwen 2.5	Alibaba; multilingual, 0.5B to 72B	0.5B–72B; scale to your GPU	ollama.com/library/qwen2.5
Gemma 2	Google; open weights, 2B / 9B / 27B	2B–27B; 8GB+ for 9B	ollama.com/library/gemma2

After pulling, run with ollama run <model> (e.g. ollama run mistral). More models: Ollama model library.
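To fetch several of these models in one go, a loop over ollama pull is enough. The sketch below assumes the ollama CLI is on PATH and skips any model that ollama list already shows; names without a tag are treated as :latest, as Ollama does. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# pull_missing MODEL...: pull each model unless `ollama list` already has it.
pull_missing() {
  local installed model name
  # First column of `ollama list` (after the header) is NAME, e.g. mistral:latest
  installed=$(ollama list | awk 'NR > 1 { print $1 }')
  for model in "$@"; do
    name="$model"
    [[ "$name" == *:* ]] || name="$name:latest"
    if grep -qxF "$name" <<< "$installed"; then
      echo "skip $name (already pulled)"
    else
      ollama pull "$model" || return 1
    fi
  done
}
```

For example: pull_missing llama3.2 mistral phi3 qwen2.5 gemma2.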

Backup and Monitor Your AI Stack

Back up the VM or LXC on a schedule with Proxmox’s built-in backup (vzdump) or Proxmox Backup Server (PBS) so your AI workload and model data are protected. Our Proxmox backup guide covers configuration and restore. For monitoring guest health (CPU, memory, disk, and optionally GPU metrics), use a solution like Pulse; see our Proxmox monitoring guide for setup. With backups and monitoring in place, your AI stack on Proxmox is as reliable as the rest of your homelab.

FAQ

Can I pass through an NVIDIA GPU to a Proxmox VM?
Yes. Enable IOMMU (Intel VT-d or AMD-Vi) on the host, bind the GPU to vfio-pci, add the PCI device to the VM in the Proxmox UI, and install the NVIDIA driver inside the guest. The GPU will appear as a standard NVIDIA device in the VM.

Is it better to run Ollama in a VM or LXC on Proxmox?
Use a VM for maximum compatibility and any OS (e.g. Windows); use LXC for lower overhead and Linux-only if you’re comfortable configuring GPU access in the container. Both support running AI models on Proxmox VE with Ollama.

Do I need a dedicated GPU for Ollama on Proxmox?
A dedicated GPU is strongly recommended for good inference performance. Ollama can run on CPU only, but LLM inference will be much slower. PCI passthrough gives a VM exclusive access to the card; with LXC, the host driver owns the GPU and several containers can share it.

Can I run multiple AI models on one Proxmox host?
Yes. Create multiple VMs or LXC containers. With PCI passthrough, each VM needs its own GPU; several LXC containers can share one GPU through the host driver, and smaller models can also run on CPU. You can run different Ollama models or different runners (e.g. Ollama and vLLM) in separate guests.

How do I back up my Ollama data on Proxmox?
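Back up the entire VM or LXC with vzdump or Proxmox Backup Server. Optionally, also back up Ollama’s model directory inside the guest for an extra copy of pulled models. With the Linux install script, models are stored under /usr/share/ollama/.ollama/models by default, or wherever the OLLAMA_MODELS variable points.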
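For that optional extra copy, a dated tarball of the model directory is enough. This sketch assumes the default path used by the Linux install script (/usr/share/ollama/.ollama/models), overridable through OLLAMA_MODELS just as Ollama itself reads it; the function name and archive naming scheme are hypothetical.

```shell
#!/usr/bin/env bash
# backup_ollama_models DEST_DIR: write DEST_DIR/ollama-models-YYYYMMDD.tar.gz
# containing the Ollama model store, and print the archive path.
backup_ollama_models() {
  local dest="$1"
  local src="${OLLAMA_MODELS:-/usr/share/ollama/.ollama/models}"
  local archive
  archive="$dest/ollama-models-$(date +%Y%m%d).tar.gz"
  if [ ! -d "$src" ]; then
    echo "model directory not found: $src" >&2
    return 1
  fi
  mkdir -p "$dest" || return 1
  # -C keeps paths relative, so the archive restores as models/...
  tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" && echo "$archive"
}
```

Model files are large and compress poorly, so point DEST_DIR at storage with plenty of room, such as a mounted NAS share.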

What is IOMMU and why do I need it for GPU passthrough on Proxmox?
IOMMU (Input-Output Memory Management Unit) remaps and restricts the memory addresses that devices can reach through DMA, which is what makes PCI passthrough safe: a passed-through GPU can only access its own guest’s memory. Enable it in BIOS (Intel VT-d or AMD-Vi) and in the kernel so vfio-pci can isolate the GPU for a single guest.

Why does my GPU not appear in the Proxmox VM after passthrough?
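Confirm the GPU is bound to vfio-pci on the host with lspci -nnk. In the VM config, enable "All Functions" and "ROM-Bar", plus "PCI-Express" if the VM uses the q35 machine type. After adding the PCI device, fully stop and start the VM (or use Reboot in the Proxmox UI); a reboot from inside the guest does not apply hardware changes. Then install the GPU driver inside the guest.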

Can I use AMD GPUs for AI inference with Ollama on Proxmox?
Yes. For a VM, pass the AMD card through with vfio-pci; for an LXC, keep the host’s amdgpu driver loaded and expose /dev/dri and /dev/kfd to the container. In the guest, use a Linux distribution supported by ROCm; Ollama supports AMD GPUs on Linux, although some cards need the HSA_OVERRIDE_GFX_VERSION workaround.

What port does Ollama use and how do I secure it on Proxmox?
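Ollama serves its API on port 11434, bound to 127.0.0.1 by default. If you set OLLAMA_HOST so other machines can reach it, restrict access with a firewall (e.g. allow only your LAN or your reverse proxy). If exposing it beyond your LAN, put Ollama behind a reverse proxy (e.g. Nginx) with TLS and authentication, since Ollama’s API has no built-in authentication.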
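Inside an Ubuntu guest, ufw is one way to apply such a rule. The sketch prints (rather than runs) commands that allow the Ollama port from one subnet and deny it otherwise; the helper name is hypothetical, and 192.168.1.0/24 is only an example LAN.

```shell
#!/usr/bin/env bash
# ollama_ufw_rules CIDR [PORT]: print ufw commands that allow Ollama's port
# only from CIDR and deny it otherwise (ufw matches rules in the order added).
ollama_ufw_rules() {
  local cidr="$1" port="${2:-11434}"
  if ! [[ "$cidr" =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}/[0-9]{1,2}$ ]]; then
    echo "expected an IPv4 CIDR like 192.168.1.0/24, got: $cidr" >&2
    return 2
  fi
  printf 'ufw allow from %s to any port %s proto tcp\n' "$cidr" "$port"
  printf 'ufw deny %s/tcp\n' "$port"
}
```

Review the output of ollama_ufw_rules 192.168.1.0/24 before running it with sudo. If ufw is not yet enabled, allow SSH first (e.g. sudo ufw allow OpenSSH) so enabling it does not lock you out.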

How do I monitor GPU usage for my AI workload on Proxmox?
Use Proxmox’s own metrics and a monitoring solution such as Pulse (see our Proxmox monitoring guide). Inside the guest, use nvidia-smi (NVIDIA) or GPU-specific tools to see utilization; Pulse can aggregate guest and node metrics for your AI VMs and containers.
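For a quick look without a dashboard, nvidia-smi can print selected fields as CSV. This sketch assumes an NVIDIA GPU and the --query-gpu and --format=csv,noheader,nounits options; the helper name is hypothetical, and the parsing reads stdin so it can be tested without a GPU.

```shell
#!/usr/bin/env bash
# gpu_summary: read lines "util, mem_used, mem_total" (MiB, no units) on stdin,
# as produced by:
#   nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
#              --format=csv,noheader,nounits
# and print one human-readable line per GPU.
gpu_summary() {
  awk -F', *' 'NF == 3 {
    printf "GPU%d: util %s%%, VRAM %s/%s MiB (%.1f%%)\n",
           NR - 1, $1, $2, $3, ($3 > 0 ? 100 * $2 / $3 : 0)
  }'
}
```

For example, inside the guest: while sleep 2; do nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits | gpu_summary; done.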

Conclusion

Running AI models on Proxmox VE is straightforward once GPU access is in place: for a VM, bind the GPU to vfio-pci and add it as a PCI device; for an LXC, keep the host driver and expose the device nodes to the container. Then install the driver (VM) or matching user-space libraries (LXC) and Ollama or another runner such as vLLM or text-generation-webui, and use your existing backup and monitoring to keep the stack reliable. For more on Proxmox itself, backups, and monitoring, use the guides linked above.

VMinstall Training
