YOLOv8: AI object detection in Python for sysadmins and developers
How to use YOLOv8 for real-time object detection in images and video. 5-minute setup, use with webcam/ONVIF IP camera, Docker deployment. No cloud GPU needed.
Published: June 3, 2025
YOLOv8 is one of the most widely used object detection models. It runs on an ordinary CPU: on a modern i7 it reaches roughly 25-30 FPS on 720p video with no GPU, depending on model size. Install one package, write a few lines of Python, and you can detect people, vehicles, and other objects in video in real time. No cloud, no variable costs, no data leaving your network.
Install and first test in 3 commands
pip install ultralytics
yolo predict model=yolov8n.pt source='https://ultralytics.com/images/bus.jpg'
yolo predict model=yolov8n.pt source=0 # webcam
The first command installs everything. The second downloads the yolov8n model (nano, 6MB) and runs it on a test image — you see bounding boxes around people and the bus. The third runs it on your webcam in real time.
Models range from yolov8n (nano, fastest, least accurate) to yolov8x (extra-large, slowest, about 54 mAP50-95 on COCO). For CPU-only use, yolov8n or yolov8s is the right starting point.
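Throughput depends heavily on the CPU, the model size, and the input resolution, so it is worth measuring on your own hardware before choosing a model. The following is a minimal sketch of a timing helper; `measure_fps` and its `warmup` parameter are hypothetical names introduced here, not part of ultralytics.

```python
import time
from typing import Any, Callable, Iterable


def measure_fps(infer: Callable[[Any], Any], frames: Iterable[Any], warmup: int = 3) -> float:
    """Return the average frames per second of `infer` over `frames`.

    The first `warmup` frames are run but not timed, because the first
    calls to a model are usually slower than steady-state inference.
    """
    frames = list(frames)
    if len(frames) <= warmup:
        raise ValueError("need more frames than warmup iterations")

    # Untimed warmup calls
    for frame in frames[:warmup]:
        infer(frame)

    # Timed calls
    start = time.perf_counter()
    for frame in frames[warmup:]:
        infer(frame)
    elapsed = time.perf_counter() - start

    timed = len(frames) - warmup
    return timed / elapsed if elapsed > 0 else float("inf")
```

To compare models, you could pass something like `lambda f: model(f, verbose=False)` as `infer`, with a list of frames read from a local video, once for yolov8n and once for yolov8s.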
Connecting to an ONVIF IP camera via RTSP
If you have IP cameras on your network (Hikvision, Dahua, any ONVIF camera), they connect via RTSP stream:
from ultralytics import YOLO

model = YOLO("yolov8n.pt")

for result in model.track(source="rtsp://admin:password@192.168.0.92/stream1",
                          stream=True, persist=True):
    # result.boxes contains bounding boxes, confidence, class
    boxes = result.boxes
    for box in boxes:
        cls = result.names[int(box.cls)]
        conf = float(box.conf)
        print(f"{cls}: {conf:.2f}")
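stream=True returns a generator that yields results frame by frame instead of accumulating them all in memory. persist=True tells the tracker to keep its existing tracks between calls, which matters mainly when you call model.track() yourself once per frame. Note that model.track() uses the BoT-SORT tracker by default; to use ByteTrack, pass tracker="bytetrack.yaml" as in the next section. The RTSP URL varies by brand: for Hikvision it’s usually rtsp://user:pass@IP/Streaming/Channels/101.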
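Because the path differs per brand and passwords often contain characters such as `@` or `:` that break a URL, it can help to build the RTSP URL in one place. This is a small sketch under stated assumptions: `build_rtsp_url` and `RTSP_PATHS` are hypothetical names, the two paths are only the ones mentioned in this article, and 554 is the default RTSP port. Check your camera's manual for the actual path.

```python
from urllib.parse import quote

# Paths taken from the examples in this article; other brands and
# firmware versions may use different ones.
RTSP_PATHS = {
    "generic": "/stream1",
    "hikvision": "/Streaming/Channels/101",
}


def build_rtsp_url(host: str, user: str, password: str,
                   brand: str = "generic", port: int = 554) -> str:
    """Build an RTSP URL with percent-encoded credentials."""
    path = RTSP_PATHS[brand]
    # safe="" also encodes "/", "@" and ":" so they cannot be
    # mistaken for URL delimiters
    creds = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return f"rtsp://{creds}@{host}:{port}{path}"


url = build_rtsp_url("192.168.0.92", "admin", "p@ss:word", brand="hikvision")
print(url)
```

The resulting string can be passed as `source=` to `model.track()` or to `cv2.VideoCapture()`.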
Tracking with ByteTrack: counting unique people
YOLOv8 alone tells you “there are 3 people in this frame”. ByteTrack tells you “that person is the same one from the previous frame, ID is 42”. With tracking you can count unique visits, calculate dwell time, detect intrusions.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
unique_ids = set()  # track IDs already seen for the "person" class

for result in model.track(source="rtsp://192.168.0.92/stream1",
                          stream=True, persist=True, tracker="bytetrack.yaml"):
    if result.boxes.id is None:
        continue  # no tracked objects in this frame
    track_ids = result.boxes.id.int().tolist()
    class_ids = result.boxes.cls.int().tolist()
    for track_id, cls_idx in zip(track_ids, class_ids):
        if result.names[cls_idx] == "person" and track_id not in unique_ids:
            unique_ids.add(track_id)
            # Print only when the count changes, not on every frame
            print(f"Unique people detected: {len(unique_ids)}")
ByteTrack is included in ultralytics — nothing extra to install. Practical use cases: counting customers entering a store, monitoring access to a restricted area, detecting vehicles in a parking lot.
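For the restricted-area and dwell-time use cases, the track IDs can be combined with a simple geometric test. The sketch below is one possible approach, not something ultralytics provides: `point_in_polygon` and `ZoneDwell` are hypothetical helpers, the zone is a polygon in pixel coordinates that you define yourself, and the bottom-center of each box is assumed to approximate where a person is standing.

```python
from dataclasses import dataclass, field

Point = tuple[float, float]


def point_in_polygon(point: Point, polygon: list[Point]) -> bool:
    """Ray-casting test: True if `point` lies inside `polygon`."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


@dataclass
class ZoneDwell:
    """Tracks when each track ID was first and last seen inside a zone."""
    polygon: list[Point]
    first_seen: dict[int, float] = field(default_factory=dict)
    last_seen: dict[int, float] = field(default_factory=dict)

    def update(self, track_id: int,
               xyxy: tuple[float, float, float, float],
               timestamp: float) -> bool:
        """Record one detection; return True if it is inside the zone."""
        x1, y1, x2, y2 = xyxy
        foot = ((x1 + x2) / 2, y2)  # bottom-center of the bounding box
        if not point_in_polygon(foot, self.polygon):
            return False
        self.first_seen.setdefault(track_id, timestamp)
        self.last_seen[track_id] = timestamp
        return True

    def dwell_seconds(self, track_id: int) -> float:
        """Seconds between first and last sighting inside the zone."""
        if track_id not in self.first_seen:
            return 0.0
        return self.last_seen[track_id] - self.first_seen[track_id]
```

Inside the tracking loop you could call `zone.update(track_id, box_xyxy, time.time())` for each person, using `result.boxes.xyxy.tolist()` for the coordinates, and raise an alert when it returns True. Note that this measures first-to-last sighting, so time spent briefly outside the zone in between is counted too.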
Deploying on a server with FastAPI in front
To expose the annotated video over HTTP as an MJPEG stream:
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from ultralytics import YOLO
import cv2

app = FastAPI()
model = YOLO("yolov8n.pt")

@app.get("/detect")
async def detect_stream():
    def generate():
        cap = cv2.VideoCapture("rtsp://192.168.0.92/stream1")
        try:
            while True:
                ret, frame = cap.read()
                if not ret:
                    break  # stream ended or camera unreachable
                results = model(frame)
                annotated = results[0].plot()  # frame with boxes drawn on it
                ok, buffer = cv2.imencode(".jpg", annotated)
                if not ok:
                    continue  # skip frames that fail to encode
                yield (b"--frame\r\nContent-Type: image/jpeg\r\n\r\n"
                       + buffer.tobytes() + b"\r\n")
        finally:
            cap.release()  # release the capture when the generator exits
    return StreamingResponse(generate(),
                             media_type="multipart/x-mixed-replace; boundary=frame")
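The endpoint above returns annotated video. If a client needs the detections themselves as JSON, the per-frame results have to be converted to plain Python types first. The following is a minimal sketch of that conversion; `detections_to_records` and its `min_conf` parameter are hypothetical names, and the function takes plain lists so it does not depend on ultralytics directly.

```python
import json


def detections_to_records(names, classes, confidences, boxes, min_conf=0.25):
    """Convert one frame's detections into JSON-serializable dicts.

    names:       mapping from class index to label
    classes:     class index per detection
    confidences: confidence score per detection
    boxes:       [x1, y1, x2, y2] per detection, in pixels
    """
    records = []
    for cls_idx, conf, (x1, y1, x2, y2) in zip(classes, confidences, boxes):
        if conf < min_conf:
            continue  # drop low-confidence detections
        records.append({
            "label": names[int(cls_idx)],
            "confidence": round(float(conf), 3),
            "box": [round(float(v), 1) for v in (x1, y1, x2, y2)],
        })
    return records


# Example with made-up values standing in for one frame's output
sample = detections_to_records(
    names={0: "person", 5: "bus"},
    classes=[0.0, 5.0, 0.0],
    confidences=[0.91, 0.80, 0.10],
    boxes=[[1, 2, 3, 4], [10, 20, 30, 40], [0, 0, 1, 1]],
)
print(json.dumps(sample))
```

With a real result you could pass `result.names`, `result.boxes.cls.tolist()`, `result.boxes.conf.tolist()`, and `result.boxes.xyxy.tolist()`, and return the list from a FastAPI route, which serializes it to JSON.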
Minimal docker-compose.yml:
services:
  detector:
    build: .
    ports:
      - "8000:8000"
    volumes:
      - ./models:/app/models
    devices:
      - /dev/video0:/dev/video0  # only if using USB webcam
What to do
- Install ultralytics and test yolov8n on your webcam or a local video before touching IP cameras: you need to see the boxes working locally to calibrate expectations.
- For ONVIF cameras, verify the RTSP URL with VLC before using it in code: VLC → Open Network → rtsp://user:pass@IP/stream1.
- Add ByteTrack only if you need persistent identity across frames. If you just need “how many objects are there right now”, plain detection is enough and faster.
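As a scripted complement to the VLC check, you can at least confirm that the camera's RTSP port accepts connections before starting a long-running detector. This is a rough sketch using only the standard library; `rtsp_port_reachable` is a hypothetical helper, and it tests TCP reachability only, so it does not validate the credentials or the stream path.

```python
import socket
from urllib.parse import urlparse


def rtsp_port_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(url)
    if parsed.scheme != "rtsp" or not parsed.hostname:
        raise ValueError("expected an rtsp:// URL with a host")
    port = parsed.port or 554  # 554 is the default RTSP port
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False
```

A False result usually points to a network or firewall problem rather than a wrong path, which narrows down where to look before opening VLC.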