local ai

How to Run Llama 3.1 Locally with Ollama: Step-by-Step Guide

How to Run Llama 3.1 Locally with Ollama: Step-by-Step Guide

Running large language models (LLMs) on your local hardware has shifted from a hobbyist experiment to a necessity for privacy-conscious developers and solopreneurs. Hosting models on your own machine yields zero API costs, offline functionality, and complete data sovereignty—your sensitive inputs never touch external servers.

In this guide, we'll walk through installing Ollama and deploying Meta's Llama 3.1 on your local machine.


What is Ollama?

Ollama is a lightweight, open-source tool that packages model weights, configurations, and runtime dependencies into a single executable. It detects your system's GPU (graphics card) automatically and exposes a local API server that mirrors the OpenAI API structure, allowing for easy drop-in integrations.

[!NOTE] Ollama runs natively on macOS, Linux, and Windows. It automatically leverages Apple Silicon (M1/M2/M3/M4 Max/Ultra) unified memory and NVIDIA CUDA cores for acceleration.

<rect width="800" height="280" rx="16" fill="var(--bg-secondary)" stroke="var(--border-glass)" stroke-width="1" />

<!-- Developer/User Icon & Box -->
<rect x="40" y="90" width="130" height="100" rx="12" fill="var(--bg-primary)" stroke="var(--border-glass)" stroke-width="1.5" />
<text x="105" y="130" font-size="14" fill="var(--text-primary)" font-weight="600" text-anchor="middle">Terminal / IDE</text>
<text x="105" y="152" font-size="11" fill="var(--text-dimmed)" text-anchor="middle">User CLI / Client</text>
<text x="105" y="170" font-size="11" fill="var(--accent-cyan)" text-anchor="middle">Port 11434</text>

<!-- Arrow 1 -->
<path d="M 170 140 L 270 140" stroke="var(--border-focus)" stroke-width="2" stroke-dasharray="4" />
<text x="220" y="125" font-size="11" fill="var(--text-muted)" text-anchor="middle">HTTP REST</text>

<!-- Ollama Server Engine Box -->
<rect x="280" y="50" width="240" height="180" rx="16" fill="var(--bg-primary)" stroke="url(#cyan-purple)" stroke-width="2" />
<text x="400" y="85" font-size="16" fill="var(--text-primary)" font-weight="700" text-anchor="middle">Ollama Service Engine</text>
<line x1="300" y1="100" x2="500" y2="100" stroke="var(--border-glass)" stroke-width="1" />

<!-- Model Store inside Ollama -->
<rect x="300" y="120" width="200" height="85" rx="8" fill="var(--bg-secondary)" stroke="var(--border-glass)" stroke-width="1" />
<text x="400" y="145" font-size="12" fill="var(--text-muted)" text-anchor="middle">Memory Manager & Router</text>
<text x="400" y="170" font-size="14" fill="var(--text-primary)" font-weight="600" text-anchor="middle">Llama 3.1 8B Model</text>
<text x="400" y="188" font-size="11" fill="var(--accent-purple)" text-anchor="middle">Local RAM / Unified Memory</text>

<!-- Arrow 2 -->
<path d="M 520 140 L 620 140" stroke="var(--border-focus)" stroke-width="2" stroke-dasharray="4" />
<text x="570" y="125" font-size="11" fill="var(--text-muted)" text-anchor="middle">Inference</text>

<!-- GPU Acceleration Box -->
<rect x="630" y="90" width="130" height="100" rx="12" fill="var(--bg-primary)" stroke="var(--border-glass)" stroke-width="1.5" />
<text x="695" y="130" font-size="14" fill="var(--text-primary)" font-weight="600" text-anchor="middle">System Hardware</text>
<text x="695" y="152" font-size="11" fill="var(--text-dimmed)" text-anchor="middle">GPU (Metal/CUDA)</text>
<text x="695" y="170" font-size="11" fill="var(--success)" text-anchor="middle">Hardware Acceleration</text>

Step 1: Install Ollama

Navigate to the official Ollama website and download the client for your operating system:

  1. macOS: Download the .zip archive, extract it, and drag the Ollama application to your Applications folder.
  2. Windows: Download the .exe installer and follow the wizard.
  3. Linux: Open a terminal and run the one-line install script:
curl -fsSL https://ollama.com/install.sh | sh

Verify that the installation was successful by running the version command in your terminal:

ollama --version

Step 2: Download and Run Llama 3.1

Meta's Llama 3.1 is available in several parameter sizes. For typical developer laptops and desktop setups, the 8B (8 Billion parameter) model is the sweet spot. It runs smoothly on hardware with at least 8GB of unified memory or VRAM. If you are constrained by system memory, read our guide on the best local LLM for 8GB RAM setups.

Run the following command to download and initiate the model:

ollama run llama3.1

Ollama will fetch the model weights (approximately 4.7 GB). Once the download completes, your terminal prompt will change:

>>> Send a message (/? for help)

You can now chat with your local AI directly inside the terminal. All token inference is computed locally on your system, completely offline.


Essential Ollama CLI Reference

Use these commands in your daily workflow to manage your local library:

Command Description Example Usage
ollama run <model> Pulls and starts a model in interactive chat mode ollama run llama3.1
ollama pull <model> Downloads a model to your local disk without launching ollama pull phi3
ollama rm <model> Deletes a model to free up disk storage space ollama rm llama3
ollama list Lists all models currently installed on your computer ollama list
ollama show <model> Displays technical details and parameters of a model ollama show llama3.1

Customizing the System Prompt via Modelfile

You can easily adjust Llama 3.1's behavior and system guidelines by creating a custom Modelfile. Let's build a dedicated coding assistant that returns clean code outputs.

Create an extension-less file named Modelfile on your computer and add the following configuration:

FROM llama3.1

# Lower temperature for predictable, logical responses
PARAMETER temperature 0.2

# Set custom system instructions
SYSTEM """
You are a senior TypeScript software engineer. Answer questions directly, providing clean code blocks and concise bullet-point explanations.
"""

Save the file and build your custom model by executing:

ollama create mycoder -f ./Modelfile

Run your new specialized assistant:

ollama run mycoder

Frequently Asked Questions

Do I need an active internet connection to use Ollama?
No. An internet connection is only required to download the model weights initially. Once saved locally, Ollama runs entirely offline.
What are the minimum hardware specs for Llama 3.1 8B?
You need at least 8GB of RAM (unified memory on Mac) or VRAM (NVIDIA GPU). For larger parameter sizes like the 70B model, you will need a minimum of 64GB RAM or dual high-end graphics cards.
M

Written by Mehmet Demir

Mehmet is a Systems Architect specializing in local LLM deployments and workplace automations.

Sponsored Content
AdSlot: 728x90 In-Article Banner
Development Placeholder (AdSense Inactive)