Running large language models (LLMs) on your local hardware has shifted from a hobbyist experiment to a necessity for privacy-conscious developers and solopreneurs. Hosting models on your own machine yields zero API costs, offline functionality, and complete data sovereignty—your sensitive inputs never touch external servers.
In this guide, we'll walk through installing Ollama and deploying Meta's Llama 3.1 on your local machine.
What is Ollama?
Ollama is a lightweight, open-source tool that packages model weights, configurations, and runtime dependencies into a single executable. It detects your system's GPU (graphics card) automatically and exposes a local API server that mirrors the OpenAI API structure, allowing for easy drop-in integrations.
[!NOTE] Ollama runs natively on macOS, Linux, and Windows. It automatically leverages Apple Silicon (M1/M2/M3/M4 Max/Ultra) unified memory and NVIDIA CUDA cores for acceleration.
Step 1: Install Ollama
Navigate to the official Ollama website and download the client for your operating system:
- macOS: Download the
.ziparchive, extract it, and drag the Ollama application to your Applications folder. - Windows: Download the
.exeinstaller and follow the wizard. - Linux: Open a terminal and run the one-line install script:
curl -fsSL https://ollama.com/install.sh | sh
Verify that the installation was successful by running the version command in your terminal:
ollama --version
Step 2: Download and Run Llama 3.1
Meta's Llama 3.1 is available in several parameter sizes. For typical developer laptops and desktop setups, the 8B (8 Billion parameter) model is the sweet spot. It runs smoothly on hardware with at least 8GB of unified memory or VRAM. If you are constrained by system memory, read our guide on the best local LLM for 8GB RAM setups.
Run the following command to download and initiate the model:
ollama run llama3.1
Ollama will fetch the model weights (approximately 4.7 GB). Once the download completes, your terminal prompt will change:
>>> Send a message (/? for help)
You can now chat with your local AI directly inside the terminal. All token inference is computed locally on your system, completely offline.
Essential Ollama CLI Reference
Use these commands in your daily workflow to manage your local library:
| Command | Description | Example Usage |
|---|---|---|
ollama run <model> |
Pulls and starts a model in interactive chat mode | ollama run llama3.1 |
ollama pull <model> |
Downloads a model to your local disk without launching | ollama pull phi3 |
ollama rm <model> |
Deletes a model to free up disk storage space | ollama rm llama3 |
ollama list |
Lists all models currently installed on your computer | ollama list |
ollama show <model> |
Displays technical details and parameters of a model | ollama show llama3.1 |
Customizing the System Prompt via Modelfile
You can easily adjust Llama 3.1's behavior and system guidelines by creating a custom Modelfile. Let's build a dedicated coding assistant that returns clean code outputs.
Create an extension-less file named Modelfile on your computer and add the following configuration:
FROM llama3.1
# Lower temperature for predictable, logical responses
PARAMETER temperature 0.2
# Set custom system instructions
SYSTEM """
You are a senior TypeScript software engineer. Answer questions directly, providing clean code blocks and concise bullet-point explanations.
"""
Save the file and build your custom model by executing:
ollama create mycoder -f ./Modelfile
Run your new specialized assistant:
ollama run mycoder