local ai

How to Integrate Ollama Local API with Python: Step-by-Step Guide

How to Integrate Ollama Local API with Python: Step-by-Step Guide

Running large language models locally using Ollama is highly efficient, but the real power comes from integrating these models into your own applications. Because Ollama runs a background service that exposes a local REST API, you can easily connect it to Python scripts for tasks like text parsing, code generation, and agents.

In this guide, we'll implement step-by-step integrations with the local Ollama API using Python, covering streaming outputs and JSON schemas.

<rect width="800" height="240" rx="16" fill="var(--bg-secondary)" stroke="var(--border-glass)" stroke-width="1" />

<!-- Python Script Block -->
<rect x="40" y="70" width="180" height="100" rx="12" fill="var(--bg-primary)" stroke="url(#gradient-python)" stroke-width="1.5" />
<text x="130" y="110" font-size="15" fill="var(--text-primary)" font-weight="700" text-anchor="middle">Python Script</text>
<text x="130" y="132" font-size="11" fill="var(--text-muted)" text-anchor="middle">import ollama</text>
<text x="130" y="150" font-size="11" fill="var(--text-dimmed)" text-anchor="middle">Inference Script Client</text>

<!-- Request Arrow -->
<path d="M 220 100 L 340 100" stroke="var(--accent-cyan)" stroke-width="2" stroke-dasharray="4" />
<text x="280" y="85" font-size="11" fill="var(--accent-cyan)" text-anchor="middle">POST /api/chat</text>

<!-- Ollama Port Block -->
<rect x="340" y="70" width="160" height="100" rx="12" fill="var(--bg-primary)" stroke="var(--border-glass)" stroke-width="1.5" />
<text x="420" y="110" font-size="15" fill="var(--text-primary)" font-weight="700" text-anchor="middle">Ollama API Host</text>
<text x="420" y="132" font-size="12" fill="var(--accent-cyan)" font-weight="600" text-anchor="middle">localhost:11434</text>
<text x="420" y="150" font-size="11" fill="var(--text-dimmed)" text-anchor="middle">REST HTTP Service</text>

<!-- Response Arrow -->
<path d="M 340 140 L 220 140" stroke="var(--accent-purple)" stroke-width="2" stroke-dasharray="4" />
<text x="280" y="158" font-size="11" fill="var(--accent-purple)" text-anchor="middle">JSON Stream (SSE)</text>

<!-- Local Model Box -->
<rect x="540" y="70" width="220" height="100" rx="12" fill="var(--bg-primary)" stroke="url(#gradient-api)" stroke-width="2" />
<text x="650" y="110" font-size="15" fill="var(--text-primary)" font-weight="700" text-anchor="middle">Inference Engine</text>
<text x="650" y="132" font-size="12" fill="var(--success)" font-weight="600" text-anchor="middle">Llama 3.1 8B (GPU)</text>
<text x="650" y="150" font-size="11" fill="var(--text-dimmed)" text-anchor="middle">Zero Token Latency</text>

<!-- Connective Line -->
<path d="M 500 120 L 540 120" stroke="var(--border-focus)" stroke-width="2" />

1. Connecting via the Official Ollama Python SDK

The easiest way to communicate with Ollama in Python is by using the official SDK. Install the package using pip or uv:

pip install ollama

Now, let's write a basic script to query our model (e.g., Llama 3.1 or Qwen 2.5 Coder) and generate a response:

import ollama

response = ollama.chat(model='llama3.1', messages=[
    {
        'role': 'user',
        'content': 'Write a Python function to compute the Fibonacci sequence.',
    },
])

print(response['message']['content'])

The SDK automatically serializes inputs and returns a structured dictionary containing model metadata and generated tokens.


2. Streaming Token Outputs in Real-Time

For long-form writing or code generation, waiting for the model to finish processing before showing the response creates a slow user experience. We can stream tokens in real-time as the local GPU generates them:

import ollama

stream = ollama.chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'Explain quantum computing in three paragraphs.'}],
    stream=True,
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
print() # Add a final newline

This returns tokens character-by-character, mimicking standard commercial chat interfaces.


3. Direct HTTP Requests (Without the SDK)

If you prefer to avoid third-party dependencies, you can communicate directly with Ollama's HTTP server running on port 11434 using Python's built-in urllib or the requests library.

Here is how to query the /api/generate endpoint using the requests package:

import requests
import json

url = "http://localhost:11434/api/generate"
payload = {
    "model": "llama3.1",
    "prompt": "List 3 high-intensity exercises.",
    "stream": False
}

headers = {
    "Content-Type": "application/json"
}

response = requests.post(url, headers=headers, data=json.dumps(payload))
result = response.json()

print(result['response'])

Technical Parameter Specifications

When creating custom agent tasks, adjusting the generation parameters directly impacts output speed and quality. Here is a reference table for key parameters:

Parameter Type Default Description
temperature float 0.8 Controls randomness. Lower values make responses more predictable.
top_p float 0.9 Nucleus sampling. Limits choices to high-probability tokens.
num_ctx int 2048 Sets context window length (token memory).
num_predict int -1 Maximum number of tokens to predict in output.

You can feed these parameters directly into the SDK's options:

response = ollama.generate(
    model='llama3.1',
    prompt='Write a short sci-fi plot hook.',
    options={
        'temperature': 0.3,
        'num_ctx': 4096
    }
)

Frequently Asked Questions

Can I run multiple models concurrently via Python?
Ollama runs model requests sequentially by default to protect system memory. If you submit multiple requests at the same time, Ollama loads them into memory sequentially, which can slow down execution times. For concurrent tasks, consider hosting models on separate nodes or upgrading memory.
How do I verify if the local Ollama server is running?
You can ping the server's root address using a simple GET request. Visiting `http://localhost:11434` in your browser should return the text: `Ollama is running`.
M

Written by Mehmet Demir

Mehmet is a Systems Architect specializing in local LLM deployments and workplace automations.

Sponsored Content
AdSlot: 728x90 In-Article Banner
Development Placeholder (AdSense Inactive)