AImpact
AI for developers 6 min read

FastAPI + Ollama: an AI backend in Python in 30 minutes

How to build a REST API that queries local AI models with FastAPI and Ollama. Streaming, chat endpoints, Celery integration for heavy tasks.

Published: June 3, 2025

FastAPI is one of the fastest Python frameworks for building REST APIs. Ollama serves your local AI models over HTTP at http://localhost:11434. The two connect naturally: in under half an hour you have a working AI backend with no cloud, no per-token costs, and no data leaving your network.

Minimal setup: /chat endpoint in 15 lines

Install the dependencies first:

pip install fastapi uvicorn ollama

Then create main.py:

from fastapi import FastAPI
from ollama import Client

app = FastAPI()
ollama = Client()

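@app.post("/chat")
def chat(prompt: str):
    # Plain `def`: FastAPI runs it in a thread pool, so the blocking
    # ollama.chat() call does not stall the event loop.
    response = ollama.chat(
        model="qwen2.5:7b",
        messages=[{"role": "user", "content": prompt}]
    )
    return {"response": response['message']['content']}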

Start with uvicorn main:app --reload and you have a POST endpoint at http://localhost:8000/chat. Test immediately:

curl -X POST "http://localhost:8000/chat?prompt=Explain+SOLID+in+3+lines"

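Ollama must already be running in the background. If it is not: docker run -d -p 11434:11434 ollama/ollama. The model also has to be downloaded once with ollama pull qwen2.5:7b (run that inside the container via docker exec when Ollama runs in Docker).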
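Before starting the API it can help to confirm that Ollama is reachable and that the model is actually present. The following is a minimal sketch using only the standard library. It assumes Ollama's /api/tags endpoint, which lists locally available models; the helper names has_model and fetch_tags are illustrative and not part of any library.

```python
import json
from urllib.request import urlopen

OLLAMA_URL = "http://localhost:11434"  # default address used in this article


def has_model(tags: dict, model: str) -> bool:
    """Return True if `model` appears in an /api/tags response payload."""
    names = {m.get("name") for m in tags.get("models", [])}
    return model in names


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 3.0) -> dict:
    """GET /api/tags from a running Ollama server (raises URLError if it is down)."""
    with urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


# Offline example with a hand-written payload shaped like the /api/tags response
sample = {"models": [{"name": "qwen2.5:7b"}, {"name": "llama3.2:3b"}]}
print(has_model(sample, "qwen2.5:7b"))  # True
print(has_model(sample, "mistral:7b"))  # False
```

Against a live server the check would be has_model(fetch_tags(), "qwen2.5:7b").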

Token-by-token streaming (Server-Sent Events)

The plain endpoint returns nothing until the model has finished generating. For a ChatGPT-style experience, where text appears token by token as it is produced, use streaming:

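from fastapi.responses import StreamingResponse

@app.post("/chat/stream")
async def chat_stream(prompt: str):
    def generate():
        # Sync generator: StreamingResponse iterates it in a worker thread
        stream = ollama.chat(
            model="qwen2.5:7b",
            messages=[{"role": "user", "content": prompt}],
            stream=True
        )
        for chunk in stream:
            token = chunk['message']['content']
            # A token may contain newlines. SSE needs one "data:" line per
            # text line, plus a blank line to end the event.
            lines = "".join(f"data: {line}\n" for line in token.split("\n"))
            yield lines + "\n"

    return StreamingResponse(generate(), media_type="text/event-stream")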

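On the frontend, read the stream with fetch + ReadableStream. EventSource only sends GET requests, so it works with this route only if you expose it as GET. The practical upside: once the model is loaded, the user sees the response start in roughly 200 ms instead of waiting 8-10 seconds for a long reply.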
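To make the wire format concrete, here is a small sketch of how a client could turn the raw event stream back into tokens. It follows the Server-Sent Events framing rules (one optional space after "data:", several data lines joined by a newline, a blank line ending each event). The function name iter_sse_data is made up for this example, and other SSE fields such as event: or id: are ignored.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE event from an iterable of text lines."""
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith("data:"):
            value = line[5:]
            # The spec strips exactly one leading space, keeping any others
            if value.startswith(" "):
                value = value[1:]
            buffer.append(value)
        # Other fields and comment lines are ignored in this sketch


# Two events as they would appear on the wire; the second token starts with a space
frames = ["data: Hello\n", "\n", "data:  world\n", "\n"]
print("".join(iter_sse_data(frames)))  # Hello world
```

In a real client the lines would come from the HTTP response body rather than from a list.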

Celery for heavy tasks: don’t block the HTTP thread

If you need to process 100 PDFs with AI, analyze a dataset, or run operations that take tens of seconds, don’t do it in the HTTP thread. The user waits, the server stalls, timeouts hit.

Solution: Celery with Redis as broker (install both with pip install "celery[redis]"). The endpoint responds immediately with a task ID and the worker processes the job in the background.

tasks.py:

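import os

from celery import Celery
from ollama import Client

# REDIS_URL is a variable name chosen for this example. It defaults to a
# local Redis; override it when Redis runs elsewhere (for example
# redis://redis:6379/0 inside Docker Compose).
REDIS_URL = os.environ.get("REDIS_URL", "redis://localhost:6379/0")

celery_app = Celery("tasks", broker=REDIS_URL, backend=REDIS_URL)

@celery_app.task
def analyze_document(file_path: str, user_email: str):
    client = Client()
    # Plain-text input is assumed; PDFs need text extraction first.
    with open(file_path, encoding="utf-8") as f:
        content = f.read()
    result = client.chat(
        model="qwen2.5:7b",
        messages=[{"role": "user", "content": f"Analyze: {content}"}]
    )
    # send_email(user_email, result['message']['content'])
    return result['message']['content']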
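The task above sends the whole file in a single prompt, which can exceed the model's context window for large documents. One possible approach, not part of the original setup, is to split the text into paragraph-aligned chunks and analyze each chunk separately. The sketch below assumes a character limit as a rough stand-in for a token limit; the function name and the default of 4000 characters are arbitrary choices for illustration.

```python
def split_into_chunks(text: str, max_chars: int = 4000) -> list[str]:
    """Group paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is hard-split.
    """
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Hard-split oversized paragraphs so no piece exceeds the limit
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                # Adding this piece would overflow: close the current chunk
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


print(split_into_chunks("aaaa\n\nbbbb\n\ncccc", max_chars=10))
# ['aaaa\n\nbbbb', 'cccc']
```

Inside the task, each chunk could be sent to client.chat in turn and the partial answers combined afterwards.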

In main.py add:

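from tasks import analyze_document, celery_app

@app.post("/analyze")
async def analyze(file_path: str, email: str):
    # .delay() only enqueues the job; the worker does the actual work
    task = analyze_document.delay(file_path, email)
    return {"task_id": task.id, "status": "processing"}

@app.get("/status/{task_id}")
async def status(task_id: str):
    # Bind the lookup to our Celery app so it uses the configured backend
    result = celery_app.AsyncResult(task_id)
    # On failure, result.result is an exception object, which is not
    # JSON-serializable, so return its text instead.
    payload = result.result if result.successful() else None
    error = str(result.result) if result.failed() else None
    return {"status": result.status, "result": payload, "error": error}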

Start the worker: celery -A tasks worker --loglevel=info
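On the client side, the task ID is typically polled until the job finishes. The following is a rough sketch of such a loop. The function wait_for_task and its parameters are invented for this example, and fetch_status stands in for whatever HTTP call retrieves GET /status/{task_id}. It relies on Celery's terminal task states SUCCESS, FAILURE, and REVOKED.

```python
import time
from typing import Callable

TERMINAL_STATES = {"SUCCESS", "FAILURE", "REVOKED"}  # Celery's final task states


def wait_for_task(
    fetch_status: Callable[[str], dict],
    task_id: str,
    interval: float = 1.0,
    timeout: float = 60.0,
) -> dict:
    """Poll until the task reaches a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch_status(task_id)
        if payload["status"] in TERMINAL_STATES:
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still {payload['status']}")
        time.sleep(interval)


# Stand-in for an HTTP call to GET /status/{task_id}
_replies = iter([
    {"status": "PENDING", "result": None},
    {"status": "PENDING", "result": None},
    {"status": "SUCCESS", "result": "summary text"},
])
final = wait_for_task(lambda _id: next(_replies), "abc123", interval=0.01)
print(final["status"])  # SUCCESS
```

Passing the fetch function in as an argument keeps the loop independent of the HTTP library and easy to test.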

Deploy with Docker Compose

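docker-compose.yml with all four services. Inside Compose, containers reach each other by service name rather than localhost, so the api and worker need the addresses redis://redis:6379/0 and http://ollama:11434. The ollama Python client reads OLLAMA_HOST from the environment; REDIS_URL is a variable name chosen here, and tasks.py has to read it in place of a hardcoded localhost URL:

services:
  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OLLAMA_HOST=http://ollama:11434
      - REDIS_URL=redis://redis:6379/0
    depends_on: [redis, ollama]

  worker:
    build: .
    command: celery -A tasks worker --loglevel=info
    environment:
      - OLLAMA_HOST=http://ollama:11434
      - REDIS_URL=redis://redis:6379/0
    depends_on: [redis, ollama]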

  redis:
    image: redis:7-alpine

  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama

volumes:
  ollama_data:
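Note that depends_on only controls start order; it does not wait until Redis or Ollama actually accept connections. One way to bridge the gap is a small wait loop at container startup. The sketch below uses only the standard library. The function name and the timeout defaults are assumptions, and an open TCP port does not guarantee that a model has already been pulled or loaded.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Closing the socket right away is fine: we only probe reachability
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

In the api or worker entrypoint this could be called as wait_for_port("redis", 6379) and wait_for_port("ollama", 11434), where the hostnames are the Compose service names.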

Minimal Dockerfile (requirements.txt must list fastapi, uvicorn, ollama, and celery[redis]):

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

What to do

  • Start with the minimal setup: install FastAPI + Ollama and get the /chat endpoint running locally — 10 minutes, zero infrastructure.
  • Add streaming as soon as you want to use it in a frontend: the UX difference is immediately obvious and the implementation is a handful of lines.
  • Use Celery only when you have tasks that run longer than 2-3 seconds — for everything else, the synchronous endpoint or streaming is more than enough.