Ollama CVE-2026-7482: Critical Vulnerability Fixed

Overview: “Bleeding Llama” Hits Local AI Servers

On 4 May 2026, a critical vulnerability was published under CVE-2026-7482 in Ollama, the widely used runtime for running large language models (LLMs) locally. Documented by researchers at Cyera under the name “Bleeding Llama”, the flaw affects the GGUF model loader and allows unauthenticated attackers to read memory contents of the server process over the network. According to the entry in the NVD (National Vulnerability Database), the CVSS 3.1 score is 9.1 (CRITICAL).

The bug was fixed in patch commit 88d57d0 (Pull Request #14406) and shipped in Ollama v0.17.1. According to the vendor, all versions prior to 0.17.1 are affected. The incident is particularly serious because — according to several scans, including by runZero and Cyera — around 300,000 Ollama instances are publicly reachable on the internet, and the critical endpoints require no authentication by default.

Key facts about the vulnerability

CVE ID: CVE-2026-7482 (“Bleeding Llama”) · CWE-125: Out-of-bounds Read · CVSS 3.1: 9.1 (CRITICAL) · Vector: AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:N/A:H

Affected: Ollama < 0.17.1 · Fix: v0.17.1 · Published: 2026-05-04, last updated 2026-05-11.

Technical Analysis of the Out-of-Bounds Read Vulnerability

The vulnerability lies in the processing of GGUF model files, a container format for quantised LLMs established by the llama.cpp/Ollama community. Specifically, the HTTP endpoint /api/create accepts an attacker-controlled GGUF file. According to the NVD description, the attack works when the tensor offset and tensor size declared in this file extend beyond the actual file length. During the subsequent quantisation step, in the files fs/ggml/gguf.go and server/quantization.go (function WriteTo()), the server then reads past the end of the allocated heap buffer.

Because this out-of-bounds memory content is then incorporated into the model artefact produced by the quantisation, it can be exfiltrated via the equally unauthenticated /api/push endpoint to an attacker-controlled registry. CVE-2026-7482 therefore describes not just a classic out-of-bounds read (CWE-125) but a complete exfiltration chain that requires no user interaction.
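The missing validation can be illustrated with a minimal sketch. It is not Ollama's actual Go code or the content of the patch; the TensorInfo type, the function names and the data_start parameter are hypothetical and only model the idea that a declared tensor range must lie entirely inside the uploaded file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorInfo:
    """Hypothetical, simplified view of one tensor entry in a model file."""
    name: str
    offset: int  # byte offset of the tensor data, relative to the data section
    size: int    # declared length of the tensor data in bytes


def tensor_in_bounds(tensor: TensorInfo, data_start: int, file_length: int) -> bool:
    """Return True only if the declared tensor range lies fully inside the file."""
    if tensor.offset < 0 or tensor.size < 0 or data_start < 0:
        return False
    end = data_start + tensor.offset + tensor.size
    return end <= file_length


def validate_tensors(tensors, data_start: int, file_length: int) -> None:
    """Raise ValueError for the first tensor whose declared range leaves the file."""
    for tensor in tensors:
        if not tensor_in_bounds(tensor, data_start, file_length):
            raise ValueError(f"tensor {tensor.name!r} exceeds file bounds")
```

Python integers do not overflow, so the sum is safe here. In a language with fixed-width integers, the same check would also have to guard against overflow when adding offset and size.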

Metric	Value	Meaning
AV:N	Network	Exploitable over the network
AC:L	Low	Low attack complexity
PR:N	None	No authentication required
UI:N	None	No user interaction required
S:U	Unchanged	Impact stays within the vulnerable component
C:H	High	High impact on confidentiality
I:N	None	No impact on integrity
A:H	High	High impact on availability (crash possible)
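The base score of 9.1 can be reproduced from the vector. The sketch below follows the base score equations and metric weights of the CVSS 3.1 specification, but it is deliberately limited to vectors with unchanged scope (S:U); the function names are illustrative.

```python
import math

# Metric weights from the CVSS 3.1 specification (scope-unchanged values only).
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "PR": {"N": 0.85, "L": 0.62, "H": 0.27},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}


def roundup(value: float) -> float:
    """Round up to one decimal place, as defined in the CVSS 3.1 specification."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector: str) -> float:
    """Compute the CVSS 3.1 base score for a scope-unchanged (S:U) vector."""
    metrics = dict(part.split(":", 1) for part in vector.split("/"))
    if metrics.get("S") != "U":
        raise ValueError("this sketch only handles S:U vectors")
    w = {name: WEIGHTS[name][metrics[name]] for name in WEIGHTS}
    iss = 1 - (1 - w["C"]) * (1 - w["I"]) * (1 - w["A"])
    impact = 6.42 * iss
    exploitability = 8.22 * w["AV"] * w["AC"] * w["PR"] * w["UI"]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


print(base_score("AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:N/A:H"))  # 9.1
```

For this vector the impact sub-score is about 5.18 and the exploitability sub-score about 3.89; their sum of roughly 9.06 is rounded up to 9.1.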

What ends up in the leaked memory

According to the NVD entry, the heap regions that are read out can contain sensitive data such as environment variables, API keys (for example for cloud LLM providers integrated as a fallback), system prompts with internal instructions, and conversation data from other concurrent users. In multi-user or multi-tenant setups this is a direct breach of tenant separation.

Attack Chain: From Manipulated GGUF to Data Exfiltration

To understand how “Bleeding Llama” works, you need to consider the unauthenticated endpoints and the quantisation workflow together. The following steps mirror the chain documented in the NVD description.

Sequence of a Bleeding Llama attack

1. An attacker scans the internet (e.g. Shodan, Censys) for exposed Ollama instances on TCP/11434.
2. They craft a GGUF file in which the tensor offset and size deliberately extend beyond the actual file length.
3. The file is sent via an unauthenticated POST to /api/create to trigger a quantisation operation.
4. While WriteTo() in server/quantization.go reads the tensor, it reads past the heap buffer (CWE-125).
5. The memory contents, including environment variables, API keys and other users' chat data, flow into the new model artefact.
6. Via /api/push the attacker uploads the model to a registry they control and extracts the data from there.

What stands out is how few prerequisites attackers need: no credentials, no user click, no local access. Anyone running Ollama with OLLAMA_HOST=0.0.0.0 — a configuration documented in the official examples — exposes the vulnerable endpoints directly at the network edge.

Sample probe request (defensive analysis)

# Only use in authorised test environments!
# Checks whether an Ollama server responds on /api/version and which version is running.

curl -sS http://ollama.example.internal:11434/api/version
# -> {"version":"0.17.0"}   ← vulnerable, because < 0.17.1

# Quick reachability check on the well-known default port via netcat
nc -zv ollama.example.internal 11434
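The version comparison can also be scripted, for example when checking many hosts. The following sketch assumes the JSON shape of the /api/version response shown above; the function names and the decision to ignore suffixes after MAJOR.MINOR.PATCH are illustrative assumptions, so pre-release builds would need separate handling.

```python
import json
import re

FIXED_VERSION = (0, 17, 1)  # first fixed release according to the advisory


def parse_version(text: str) -> tuple:
    """Extract the leading MAJOR.MINOR.PATCH numbers, ignoring any suffix."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def is_vulnerable(version_response: str) -> bool:
    """Return True if the JSON body from /api/version reports a release < 0.17.1."""
    version = json.loads(version_response)["version"]
    return parse_version(version) < FIXED_VERSION


print(is_vulnerable('{"version":"0.17.0"}'))  # True
print(is_vulnerable('{"version":"0.17.1"}'))  # False
```

Comparing integer tuples rather than raw strings matters here: as a string, "0.9.0" sorts after "0.17.1", although it is the older release.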

No authentication on /api/create and /api/push

In the upstream distribution, both endpoints are unauthenticated by default. Although Ollama binds only to 127.0.0.1 in its default configuration, the variant OLLAMA_HOST=0.0.0.0 shown in the official documentation is very widespread in practice — for example in container deployments and on development VMs.

Immediate Actions for Operators

The most important measure is simple but critical: upgrade to v0.17.1 or newer immediately. Beyond that, operators should harden the network path to Ollama and verify for every instance whether its endpoints need to be externally reachable at all.

Roll out the patch

Update Ollama to v0.17.1 or higher. Patch commit 88d57d0 contains the bounds check in the GGUF processing.

Check exposure

Scan your own infrastructure for hosts listening on TCP/11434. Block external access unless it is strictly necessary.
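A simple reachability sweep of your own hosts could look like the sketch below. It only tests whether a TCP connection on the default port succeeds and does not confirm that the listener is actually Ollama; the function names and the one-second timeout are assumptions. Only run it against systems you are authorised to test.

```python
import socket

OLLAMA_PORT = 11434  # default Ollama port named in this article


def is_port_open(host: str, port: int = OLLAMA_PORT, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def find_exposed(hosts, port: int = OLLAMA_PORT):
    """Return the subset of hosts that accept TCP connections on the port."""
    return [host for host in hosts if is_port_open(host, port)]
```

Calling find_exposed() with an inventory list of internal host names returns the hosts that should then be checked for their Ollama version and network exposure.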

Reverse proxy with AuthN

Place Ollama behind a reverse proxy (e.g. with mTLS, bearer token or OAuth proxy), as /api/create itself does not implement authentication.
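The core of a bearer token check at such a gateway can be sketched as follows. This is not a complete proxy and not an Ollama feature; the function name and the way headers are passed in as a dictionary are assumptions made for illustration.

```python
import hmac


def is_authorised(headers: dict, expected_token: str) -> bool:
    """Check an 'Authorization: Bearer <token>' header in constant time."""
    value = ""
    for name, candidate in headers.items():
        if name.lower() == "authorization":  # header names are case-insensitive
            value = candidate
            break
    scheme, _, token = value.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(token.strip().encode(), expected_token.encode())
```

A gateway would reject every request for which this function returns False before forwarding anything to /api/create or /api/push.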

Rotate secrets

If exploitation is suspected: rotate API keys from environment variables — primarily those for downstream cloud LLM providers and registries.
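To build a rotation list, it can help to inventory which environment variables of the Ollama service look like credentials. The sketch below uses a name-based heuristic that is purely an assumption; the variable names in the usage example are hypothetical, and variables with unremarkable names can of course also hold secrets.

```python
import os
import re

# Assumed naming heuristic; adjust to the conventions used in your environment.
SECRET_NAME_PATTERN = re.compile(
    r"(KEY|TOKEN|SECRET|PASSWORD|CREDENTIAL)", re.IGNORECASE
)


def secret_candidates(environ=None):
    """Return sorted names of environment variables that look like secrets."""
    env = os.environ if environ is None else environ
    return sorted(name for name in env if SECRET_NAME_PATTERN.search(name))


# Hypothetical example environment; only the names are inspected, never the values.
example = {"OPENAI_API_KEY": "x", "HF_TOKEN": "y", "PATH": "/usr/bin"}
print(secret_candidates(example))  # ['HF_TOKEN', 'OPENAI_API_KEY']
```

The function should be run against the environment of the server process itself, since that is the memory the vulnerability can expose.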

Upgrade examples

# Linux (standard installer)
curl -fsSL https://ollama.com/install.sh | sh
ollama --version
# expected output: ollama version is 0.17.1 (or higher)

# Docker deployment
docker pull ollama/ollama:0.17.1
docker stop ollama && docker rm ollama
docker run -d --name ollama \
  -p 127.0.0.1:11434:11434 \
  -v ollama:/root/.ollama \
  ollama/ollama:0.17.1

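# Optional, native (non-Docker) installs only: explicitly force bind to loopback.
# The variable must be set in the environment of the Ollama server process.
export OLLAMA_HOST=127.0.0.1:11434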

Network segmentation as a second line of defence

Even after the patch, the recommendation stands not to expose inference servers directly to the internet. A dedicated zone for AI workloads makes sense, in which only specific applications are allowed to communicate with the model server — ideally via token-based authentication at an upstream gateway. This significantly reduces the attack surface for future vulnerabilities in LLM runtimes as well.
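The allowlisting idea behind such a zone can be sketched with Python's standard ipaddress module. The subnets below are hypothetical placeholders, and in practice this rule would normally be enforced by a firewall or gateway rather than in application code.

```python
import ipaddress

# Hypothetical application subnets that are allowed to reach the model server.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.20.0.0/24"),
    ipaddress.ip_network("127.0.0.0/8"),
]


def is_allowed(client_ip: str, networks=ALLOWED_NETWORKS) -> bool:
    """Return True if client_ip falls inside one of the permitted networks."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in networks)


print(is_allowed("10.20.0.15"))   # True
print(is_allowed("203.0.113.7"))  # False
```

Denying by default and listing only the application subnets keeps the model server unreachable for everything else, including hosts added to the network later.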

Context: What “Bleeding Llama” Means for AI Security

CVE-2026-7482 is a typical example of a class of vulnerabilities that will appear more frequently as local AI stacks become more widespread. Model formats such as GGUF are highly complex binary containers, and their parsers walk the line between performance optimisation and memory safety. Where bounds checks are missing, classic memory safety problems arise, this time in a Go code path, a language that is generally considered robust in this respect.

Two structural lessons can be drawn: first, unauthenticated admin endpoints such as /api/create or /api/push must be understood as an anti-pattern — every inference platform should require authentication by default. Second, LLM penetration testing will need to go beyond classical prompt injection: the runtime layer (model loaders, quantisers, tool-use engines) must also be actively tested.

Conclusion and Source Overview

CVE-2026-7482 shows that locally operated AI models are not a security island. A missing bounds check in the GGUF parser, combined with unauthenticated management endpoints, amounts to a critical memory disclosure that exposes not just API keys and system prompts but also data from other concurrent users. The patch in Ollama v0.17.1 closes the immediate gap; the structural lessons around authentication, network segmentation and LLM runtime hardening remain relevant for the long term.

Primary sources used

According to the NVD entry CVE-2026-7482 (NIST National Vulnerability Database, published 2026-05-04, last updated 2026-05-11), the vulnerability is located in the files fs/ggml/gguf.go and server/quantization.go.

The fix is documented in the official patch commit 88d57d0 / Pull Request #14406 of the Ollama repository on GitHub; the corrections are part of release v0.17.1.

The CVSS 3.1 vector (AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:N/A:H) and the CWE classification (CWE-125 Out-of-bounds Read) are taken from the NVD and the GitLab Advisory Database (GLAD).