FastAPI + Ollama: an AI backend in Python in 30 minutes
How to build a REST API that queries local AI models with FastAPI and Ollama. Streaming, chat endpoints, Celery integration for heavy tasks.
Published: June 3, 2025
FastAPI is one of the fastest Python frameworks for building REST APIs. Ollama exposes your local AI models over HTTP at http://localhost:11434. The two connect naturally: in under half an hour you have a working AI backend with no cloud, no per-token costs, and no data leaving your network.
Minimal setup: /chat endpoint in 15 lines
Install the dependencies first:
pip install fastapi uvicorn ollama
Then create main.py:
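from fastapi import FastAPI
from ollama import Client

app = FastAPI()
ollama = Client()  # defaults to http://localhost:11434

@app.post("/chat")
def chat(prompt: str):
    # Plain "def": FastAPI runs it in a worker thread, so the blocking
    # ollama.chat() call does not stall the event loop.
    response = ollama.chat(
        model="qwen2.5:7b",
        messages=[{"role": "user", "content": prompt}]
    )
    return {"response": response['message']['content']}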
Start with uvicorn main:app --reload and you have a POST endpoint at http://localhost:8000/chat. Test immediately:
curl -X POST "http://localhost:8000/chat?prompt=Explain+SOLID+in+3+lines"
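Ollama must already be running in the background. If it is not, start it with docker run -d -p 11434:11434 ollama/ollama. The model also has to be downloaded once before the first request, for example with ollama pull qwen2.5:7b (run it inside the container if you use Docker).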
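The same request can also be sent from Python with only the standard library, which is handy for quick scripts and smoke tests. This is a minimal sketch, assuming the API listens on http://localhost:8000 as above; the helper names build_chat_url and ask are illustrative and not part of FastAPI or Ollama.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: the uvicorn address used above


def build_chat_url(prompt: str, base_url: str = BASE_URL) -> str:
    """Return the /chat URL with the prompt encoded as a query parameter."""
    query = urllib.parse.urlencode({"prompt": prompt})
    return f"{base_url}/chat?{query}"


def ask(prompt: str, timeout: float = 120.0) -> str:
    """POST the prompt to /chat and return the text under the "response" key."""
    request = urllib.request.Request(build_chat_url(prompt), method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["response"]
```

With the server running, ask("Explain SOLID in 3 lines") should return the same text as the curl call. The generous timeout is a guess: a local 7B model can take a while on the first request while it loads.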
Token-by-token streaming (Server-Sent Events)
The synchronous endpoint returns nothing until the model has finished generating. For a ChatGPT-style experience, where the text appears token by token as it is generated, use streaming:
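from fastapi.responses import StreamingResponse

@app.post("/chat/stream")
async def chat_stream(prompt: str):
    # generate() is a sync generator; StreamingResponse iterates it in a
    # worker thread, so the blocking Ollama calls stay off the event loop.
    def generate():
        stream = ollama.chat(
            model="qwen2.5:7b",
            messages=[{"role": "user", "content": prompt}],
            stream=True
        )
        for chunk in stream:
            token = chunk['message']['content']
            # A token may contain newlines; SSE needs one "data:" line per text line
            lines = "\n".join(f"data: {part}" for part in token.split("\n"))
            yield f"{lines}\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")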
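Because the endpoint is a POST, the frontend should consume the stream with fetch + ReadableStream; the browser's EventSource API only issues GET requests, so it would need a GET variant of the route. The practical upside: the user sees the response start in roughly 200 ms (once the model is loaded) instead of waiting 8-10 seconds for a long reply.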
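To check the stream without a browser, the event format can be parsed in a few lines of Python. This is a rough sketch of the data-field handling from the Server-Sent Events format (events separated by blank lines, several data: lines joined with a newline); iter_sse_data is a hypothetical helper name, and it ignores the event:, id: and retry: fields that this endpoint never sends.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each Server-Sent Event found in `lines`."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # exactly one leading space is dropped
        if field == "data":
            buffer.append(value)
```

Feed it the decoded lines of the HTTP response from /chat/stream (for example the lines of a urllib.request.urlopen response, each passed through .decode("utf-8")) and print every payload as it arrives to watch the reply build up.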
Celery for heavy tasks: don’t block the HTTP thread
If you need to process 100 PDFs with AI, analyze a dataset, or run operations that take tens of seconds, don’t do it in the HTTP thread. The user waits, the server stalls, timeouts hit.
Solution: Celery with Redis as broker. The endpoint responds immediately with a task ID, and the worker processes the job in the background. This needs a running Redis server and one more install: pip install "celery[redis]".
tasks.py:
from celery import Celery
from ollama import Client

celery_app = Celery("tasks", broker="redis://localhost:6379/0",
                    backend="redis://localhost:6379/0")

@celery_app.task
def analyze_document(file_path: str, user_email: str):
    client = Client()
    with open(file_path) as f:
        content = f.read()
    result = client.chat(
        model="qwen2.5:7b",
        messages=[{"role": "user", "content": f"Analyze: {content}"}]
    )
    # send_email(user_email, result['message']['content'])
    return result['message']['content']
In main.py add:
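from celery.result import AsyncResult
from tasks import analyze_document, celery_app

@app.post("/analyze")
async def analyze(file_path: str, email: str):
    # .delay() only enqueues the job in Redis and returns at once
    task = analyze_document.delay(file_path, email)
    return {"task_id": task.id, "status": "processing"}

@app.get("/status/{task_id}")
async def status(task_id: str):
    result = AsyncResult(task_id, app=celery_app)
    # On failure result.result is an exception object, which is not
    # JSON-serializable, so return it only when the task succeeded
    output = result.result if result.successful() else None
    return {"status": result.status, "result": output}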
Start the worker: celery -A tasks worker --loglevel=info
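On the client side, the task ID is typically polled until Celery reports a final state. The loop below is a minimal sketch, assuming the /status/{task_id} payload shape shown above; wait_for_task and its fetch_status argument are hypothetical names, and the fetch function is passed in so the loop works with any HTTP client.

```python
import time
from typing import Callable

# Celery task states after which the status no longer changes
TERMINAL_STATES = {"SUCCESS", "FAILURE", "REVOKED"}


def wait_for_task(
    fetch_status: Callable[[], dict],
    interval: float = 1.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status() until the task reaches a terminal state.

    fetch_status must return the JSON body of /status/{task_id} as a dict.
    Raises TimeoutError if no terminal state is seen within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch_status()
        if payload.get("status") in TERMINAL_STATES:
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError("task did not finish in time")
        sleep(interval)
```

In practice fetch_status would be a small function that GETs http://localhost:8000/status/<task_id> and decodes the JSON. A one-second interval is an arbitrary starting point; for jobs that run for minutes a longer interval keeps the load on the API low.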
Deploy with Docker Compose
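docker-compose.yml with all four services (API, worker, Redis, Ollama). Inside the Compose network the containers reach each other by service name rather than localhost: the ollama client picks up the OLLAMA_HOST environment variable, and the Redis URLs in tasks.py need to become redis://redis:6379/0.
services:
  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OLLAMA_HOST=http://ollama:11434
    depends_on: [redis, ollama]
  worker:
    build: .
    command: celery -A tasks worker --loglevel=info
    environment:
      - OLLAMA_HOST=http://ollama:11434
    depends_on: [redis, ollama]
  redis:
    image: redis:7-alpine
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
volumes:
  ollama_data: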
Minimal Dockerfile (requirements.txt lists fastapi, uvicorn, ollama and celery[redis]):
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
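One caveat with this setup: the short depends_on form only controls start order, it does not wait until Redis or Ollama actually accept connections. A possible workaround is a small readiness check that runs before the API or worker starts. The sketch below is one way to do it with plain sockets; wait_for_port is a hypothetical helper, and the service names in the usage note are the ones from the compose file above.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds.

    Retries every `interval` seconds and returns False when `timeout`
    seconds have passed without a successful connection.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Connection attempt with its own short timeout
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Inside the containers this would be called as wait_for_port("ollama", 11434) and wait_for_port("redis", 6379) before serving traffic. An open port only shows that the process is listening, not that a model is already pulled, so treat it as a coarse check.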
What to do
- Start with the minimal setup: install FastAPI + Ollama and get the /chat endpoint running locally. 10 minutes, zero infrastructure.
- Add streaming as soon as you want to use it in a frontend: the UX difference is immediately obvious and the implementation is a handful of lines.
- Use Celery only when you have tasks that run longer than 2-3 seconds — for everything else, the synchronous endpoint or streaming is more than enough.