Ollama: The Ultimate Solution for Building a Local AI Assistant
Ollama is currently one of the most convenient ways to deploy local large language models (LLMs). With its lightweight runtime framework and strong ecosystem, you can run open-source models like Llama3, Qwen, Mistral, and Gemma locally without a network, enabling chat, document summarization, code generation, and even providing API services.
🧠 What is Ollama?
Ollama is an open-source project that allows you to easily pull, run, and invoke LLM models without complex configuration. Supported models include:
- Meta's LLaMA series
- Alibaba's Qwen series
- Mistral and Mixtral
- Google's Gemma
- Microsoft's Phi series, etc.
Ollama also exposes a built-in API compatible with the OpenAI style, making it very suitable for integration with projects like LangChain, LlamaIndex, and FastGPT.
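Because the API follows the OpenAI format, you can even point the official openai Python client at a local Ollama instance once it is running. A minimal sketch (assumes the openai package is installed via pip install openai and Ollama is listening on its default port; the api_key value is required by the client but ignored by Ollama):

from openai import OpenAI

# Point the OpenAI client at the local Ollama server's /v1 endpoint.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)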
🚀 Quick Start (Local Deployment)
1. Install Ollama
Supported on Linux / macOS / Windows.
# macOS (using Homebrew)
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: install via the official installer from ollama.com/download, or run inside WSL2:
wsl --install
After installation, verify:
ollama --version
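If you also want to confirm that the Ollama service itself is reachable (the desktop app starts it in the background; on Linux you may need to run ollama serve), here is a quick Python sketch using the /api/tags endpoint, which lists locally installed models:

import requests

# Assumes the Ollama service is running on the default port 11434.
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
models = resp.json().get("models", [])
print("Installed models:", [m["name"] for m in models])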
2. Pull and Run a Model
Take llama3 as an example:
ollama pull llama3
ollama run llama3
After starting, you'll see an interactive interface like:
>>> What is the capital of France?
Paris is the capital of France.
By default, Ollama pulls a 4-bit quantized build of the model, which is very memory efficient: only about 6–8 GB of RAM is needed to run Llama 3 8B.
🧩 Programming Interface (API Calls)
Ollama has a built-in local REST API, listening by default at http://localhost:11434.
You can call it with Python:
requirements.txt:
requests

ollama_client.py:
import requests

def ask_ollama(prompt, model="llama3"):
    """Send a single prompt to the local Ollama server and return the full response."""
    url = "http://localhost:11434/api/generate"
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return the whole answer as one JSON object
    }
    response = requests.post(url, json=payload)
    response.raise_for_status()
    return response.json()["response"]

if __name__ == "__main__":
    print(ask_ollama("Write a Python quicksort function"))
Example output:

def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    less = [x for x in arr[1:] if x < pivot]
    greater = [x for x in arr[1:] if x >= pivot]
    return quicksort(less) + [pivot] + quicksort(greater)
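Besides /api/generate, Ollama also exposes a /api/chat endpoint that takes a list of role/content messages, which is more convenient for multi-turn conversations. A minimal sketch along the same lines (the helper name chat_ollama is just for illustration):

import requests

def chat_ollama(messages, model="llama3"):
    """Send a message history to the local /api/chat endpoint and return the reply text."""
    url = "http://localhost:11434/api/chat"
    payload = {"model": model, "messages": messages, "stream": False}
    response = requests.post(url, json=payload)
    response.raise_for_status()
    return response.json()["message"]["content"]

if __name__ == "__main__":
    history = [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ]
    print(chat_ollama(history))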
🔁 Streaming Responses
Ollama supports streaming responses: with "stream": true, the API returns one JSON object per line as tokens are generated, letting you reproduce ChatGPT's "typewriter" effect. Example code:
import json
import requests

def stream_chat(prompt, model="llama3"):
    """Stream tokens from the local Ollama server and print them as they arrive."""
    url = "http://localhost:11434/api/generate"
    payload = {"model": model, "prompt": prompt, "stream": True}
    with requests.post(url, json=payload, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue
            data = json.loads(line.decode("utf-8"))  # one JSON object per line
            print(data.get("response", ""), end="", flush=True)
            if data.get("done"):
                print()
                break

if __name__ == "__main__":
    stream_chat("Please introduce the Four Great Inventions of China.")
🧠 Multi-Model Running & Custom Models
Ollama supports running multiple models side by side, and you can define your own model variants with a Modelfile. For example, to build a Qwen variant that always answers in Chinese (saved as MyQwen.Modelfile):
FROM qwen:7b
SYSTEM "You are a Chinese knowledge expert, only answer in Chinese."
Create and run it:
ollama create my-qwen -f MyQwen.Modelfile
ollama run my-qwen
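The custom model can then be called through the same API as any built-in model; the generate endpoint also accepts an options object for sampling parameters such as temperature. A small sketch (the model name my-qwen matches the create command above):

import requests

# Call the custom model created above; "options" tunes sampling parameters.
payload = {
    "model": "my-qwen",
    "prompt": "Briefly introduce the Dragon Boat Festival.",
    "stream": False,
    "options": {"temperature": 0.7},
}
resp = requests.post("http://localhost:11434/api/generate", json=payload)
resp.raise_for_status()
print(resp.json()["response"])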
🧪 Practice: Local Intelligent Q&A Assistant
You can quickly build a Q&A system by combining LangChain with Ollama:
pip install langchain langchain-community
from langchain_community.chat_models import ChatOllama
from langchain_core.messages import HumanMessage

# Talks to the local Ollama server; no API key needed.
llm = ChatOllama(model="llama3")
response = llm.invoke([HumanMessage(content="Summarize the main plot of 'The Three-Body Problem' in Chinese")])
print(response.content)
Example output:
'The Three-Body Problem' mainly tells the story of scientist Ye Wenjie sending a signal to space from the Red Coast Base, which is received by the Trisolaran civilization, leading to their decision to invade Earth... (omitted)
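ChatOllama also supports LangChain's standard streaming interface, so the same assistant can print tokens as they are generated. A short sketch using the chat model's .stream() method:

from langchain_community.chat_models import ChatOllama
from langchain_core.messages import HumanMessage

llm = ChatOllama(model="llama3")

# Print the answer token by token instead of waiting for the full response.
for chunk in llm.stream([HumanMessage(content="Give three tips for learning Python.")]):
    print(chunk.content, end="", flush=True)
print()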
🖥️ System Resource Usage Test
Tested on a MacBook Pro (M1) with LLaMA 3 8B:
- Memory usage: about 6.2GB
- Inference latency: average 400ms/sentence
- No internet required, no privacy leakage risk
✅ Summary: Why Choose Ollama?
| Feature | Ollama |
| --- | --- |
| Easy installation | ✅ One command to set up |
| Multi-model support | ✅ LLaMA / Qwen / Mistral, etc. |
| Low resource usage | ✅ Runs quantized models |
| API compatibility | ✅ Supports the OpenAI format |
| Private deployment & security | ✅ Runs locally, data never leaves your machine |
🔚 Postscript: From Ollama to RAG and Intelligent Agents
Ollama is not just a chat tool, but a core component for building local AI systems. You can combine it with frameworks like LangChain, LlamaIndex, and FastGPT to build intelligent Q&A, agent executors, custom semantic search, and more.
If you also want a local AI assistant that works offline, is always available, and responds quickly, Ollama is undoubtedly one of the best choices right now.
Blog: ftmi.info
Model recommendations: llama3 (strong in English), qwen:7b (excellent in Chinese)