How to Integrate Ollama Local API with Python: Step-by-Step Guide

Running large language models locally using Ollama is highly efficient, but the real power comes from integrating these models into your own applications. Because Ollama runs a background service that exposes a local REST API, you can easily connect it to Python scripts for tasks like text parsing, code generation, and agents.

In this guide, we'll implement step-by-step integrations with the local Ollama API using Python, covering streaming outputs and JSON schemas.

<rect width="800" height="240" rx="16" fill="var(--bg-secondary)" stroke="var(--border-glass)" stroke-width="1" />

<!-- Python Script Block -->
<rect x="40" y="70" width="180" height="100" rx="12" fill="var(--bg-primary)" stroke="url(#gradient-python)" stroke-width="1.5" />
<text x="130" y="110" font-size="15" fill="var(--text-primary)" font-weight="700" text-anchor="middle">Python Script</text>
<text x="130" y="132" font-size="11" fill="var(--text-muted)" text-anchor="middle">import ollama</text>
<text x="130" y="150" font-size="11" fill="var(--text-dimmed)" text-anchor="middle">Inference Script Client</text>

<!-- Request Arrow -->
<path d="M 220 100 L 340 100" stroke="var(--accent-cyan)" stroke-width="2" stroke-dasharray="4" />
<text x="280" y="85" font-size="11" fill="var(--accent-cyan)" text-anchor="middle">POST /api/chat</text>

<!-- Ollama Port Block -->
<rect x="340" y="70" width="160" height="100" rx="12" fill="var(--bg-primary)" stroke="var(--border-glass)" stroke-width="1.5" />
<text x="420" y="110" font-size="15" fill="var(--text-primary)" font-weight="700" text-anchor="middle">Ollama API Host</text>
<text x="420" y="132" font-size="12" fill="var(--accent-cyan)" font-weight="600" text-anchor="middle">localhost:11434</text>
<text x="420" y="150" font-size="11" fill="var(--text-dimmed)" text-anchor="middle">REST HTTP Service</text>

<!-- Response Arrow -->
<path d="M 340 140 L 220 140" stroke="var(--accent-purple)" stroke-width="2" stroke-dasharray="4" />
<text x="280" y="158" font-size="11" fill="var(--accent-purple)" text-anchor="middle">JSON Stream (SSE)</text>

<!-- Local Model Box -->
<rect x="540" y="70" width="220" height="100" rx="12" fill="var(--bg-primary)" stroke="url(#gradient-api)" stroke-width="2" />
<text x="650" y="110" font-size="15" fill="var(--text-primary)" font-weight="700" text-anchor="middle">Inference Engine</text>
<text x="650" y="132" font-size="12" fill="var(--success)" font-weight="600" text-anchor="middle">Llama 3.1 8B (GPU)</text>
<text x="650" y="150" font-size="11" fill="var(--text-dimmed)" text-anchor="middle">Zero Token Latency</text>

<!-- Connective Line -->
<path d="M 500 120 L 540 120" stroke="var(--border-focus)" stroke-width="2" />

1. Connecting via the Official Ollama Python SDK

The easiest way to communicate with Ollama in Python is by using the official SDK. Install the package using pip or uv:

pip install ollama

Now, let's write a basic script to query our model (e.g., Llama 3.1 or Qwen 2.5 Coder) and generate a response:

import ollama

response = ollama.chat(model='llama3.1', messages=[
    {
        'role': 'user',
        'content': 'Write a Python function to compute the Fibonacci sequence.',
    },
])

print(response['message']['content'])

The SDK automatically serializes inputs and returns a structured dictionary containing model metadata and generated tokens.

2. Streaming Token Outputs in Real-Time

For long-form writing or code generation, waiting for the model to finish processing before showing the response creates a slow user experience. We can stream tokens in real-time as the local GPU generates them:

import ollama

stream = ollama.chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'Explain quantum computing in three paragraphs.'}],
    stream=True,
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
print() # Add a final newline

This returns tokens character-by-character, mimicking standard commercial chat interfaces.

3. Direct HTTP Requests (Without the SDK)

If you prefer to avoid third-party dependencies, you can communicate directly with Ollama's HTTP server running on port 11434 using Python's built-in urllib or the requests library.

Here is how to query the /api/generate endpoint using the requests package:

import requests
import json

url = "http://localhost:11434/api/generate"
payload = {
    "model": "llama3.1",
    "prompt": "List 3 high-intensity exercises.",
    "stream": False
}

headers = {
    "Content-Type": "application/json"
}

response = requests.post(url, headers=headers, data=json.dumps(payload))
result = response.json()

print(result['response'])

Technical Parameter Specifications

When creating custom agent tasks, adjusting the generation parameters directly impacts output speed and quality. Here is a reference table for key parameters:

Parameter	Type	Default	Description
`temperature`	float	0.8	Controls randomness. Lower values make responses more predictable.
`top_p`	float	0.9	Nucleus sampling. Limits choices to high-probability tokens.
`num_ctx`	int	2048	Sets context window length (token memory).
`num_predict`	int	-1	Maximum number of tokens to predict in output.

You can feed these parameters directly into the SDK's options:

response = ollama.generate(
    model='llama3.1',
    prompt='Write a short sci-fi plot hook.',
    options={
        'temperature': 0.3,
        'num_ctx': 4096
    }
)

Frequently Asked Questions

Can I run multiple models concurrently via Python?

Ollama runs model requests sequentially by default to protect system memory. If you submit multiple requests at the same time, Ollama loads them into memory sequentially, which can slow down execution times. For concurrent tasks, consider hosting models on separate nodes or upgrading memory.

How do I verify if the local Ollama server is running?

You can ping the server's root address using a simple GET request. Visiting `http://localhost:11434` in your browser should return the text: `Ollama is running`.

How to Integrate Ollama Local API with Python: Step-by-Step Guide

1. Connecting via the Official Ollama Python SDK

2. Streaming Token Outputs in Real-Time

3. Direct HTTP Requests (Without the SDK)

Technical Parameter Specifications

Frequently Asked Questions

Written by Mehmet Demir

Smart Related Articles

Integrating Llama 3.1 Local API with Node.js: Quickstart

Setting Up a Local RAG System with LangChain and Python