Large Language Models (LLM)
Large Language Models (LLMs) are currently one of the most active areas of artificial intelligence. Getting started requires a foundation that spans basic concepts, core technologies, and common application patterns. Below is a summary of the essential concepts and knowledge points for working with LLMs:
1. Basic Concepts
What is a Language Model (LM)?
- Definition: A machine learning model designed to predict and generate language sequences that conform to grammar and semantic rules. Simply put, it can predict the next word or character in a sequence.
- Development: From early N-gram models and statistical language models, to later RNNs, LSTMs, and now deep learning models based on Transformers.
- Simple Example: A bigram model predicts the next word based on the previous one:
```python
# Simple bigram model example
from collections import defaultdict
import random

class BigramModel:
    def __init__(self):
        self.bigrams = defaultdict(list)

    def train(self, text):
        words = text.split()
        for i in range(len(words) - 1):
            self.bigrams[words[i]].append(words[i + 1])

    def generate(self, start_word, length=5):
        current = start_word
        result = [current]
        for _ in range(length):
            if current in self.bigrams:
                next_word = random.choice(self.bigrams[current])
                result.append(next_word)
                current = next_word
            else:
                break
        return ' '.join(result)

# Usage example
model = BigramModel()
model.train("I like programming I want to learn programming languages")
print(model.generate("I"))
# Possible output: "I want to learn programming languages" (generation is random)
```
What is a Large Language Model (LLM)?
- "Large" means:
- Large model size: Billions to trillions of parameters learned during training, determining how the model processes and generates text.
- Large training data: Pre-trained on massive text data, including books, articles, web pages, code, etc.
- Core abilities: Understand and generate human language, with general knowledge, logical reasoning, and task generalization abilities.
- Common model scale comparison:
| Model Name | Parameters | Typical Application |
|---|---|---|
| GPT-3 | 175B | General text generation, dialogue |
| GPT-4 | Not public (1T+) | Stronger reasoning and creativity |
| LLaMA 2-7B | 7B | Lightweight, device deployment |
| Claude 2 | Not public | Professional writing, code generation |
| Gemini Ultra | Not public | Multimodal understanding and generation |
- "Large" means:
What is the Transformer architecture?
- Foundation of LLMs: Most modern LLMs are based on the Transformer architecture.
- Core mechanisms:
- Self-Attention Mechanism: Allows the model to "attend" to all words in a sequence, assigning different attention weights based on their relationships, capturing long-range dependencies.
- Encoder-Decoder Structure: The original Transformer includes an encoder (for input) and a decoder (for output). Many LLMs (like GPT) mainly use the decoder for text generation.
- Simplified self-attention implementation:
```python
import numpy as np

def simplified_self_attention(sequence, d_k=64):
    # In real self-attention, Q, K, and V come from separate learned projections;
    # here we reuse the input sequence for all three to keep the example simple.
    Q = K = V = sequence
    scores = np.dot(Q, K.T) / np.sqrt(d_k)  # scaled dot-product scores
    attention_weights = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True)  # row-wise softmax
    output = np.dot(attention_weights, V)   # weighted sum of values
    return output, attention_weights
```
Token and Tokenization
- Token: The smallest unit of text processed by LLMs. It could be a word, part of a word, punctuation, or even a Chinese character.
- Tokenization: The process of splitting raw text into tokens. Common methods include BPE, WordPiece, etc.
- Context Window/Length: The maximum number of tokens an LLM can process at once (see the context-window check after the token-counting example below).
- Token counting example with tiktoken:
```python
import tiktoken

def count_tokens(text, model="gpt-3.5-turbo"):
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    return len(tokens)

# Example
text = "Hello, World!"
token_count = count_tokens(text)
print(f"Text contains {token_count} tokens")
```
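- Context window check example: a short sketch building on `count_tokens` above. The 4096-token limit and the `reserved_for_output` margin are illustrative assumptions, not official constants; check your model's documented context length.

```python
# Sketch: check whether a prompt fits an assumed context limit before sending it.
# MAX_CONTEXT_TOKENS is an illustrative figure, not an official constant.
MAX_CONTEXT_TOKENS = 4096

def fits_in_context(text: str, reserved_for_output: int = 500) -> bool:
    # Leave room for the model's reply inside the same context window
    return count_tokens(text) + reserved_for_output <= MAX_CONTEXT_TOKENS

prompt = "Summarize the following document: ..."
if fits_in_context(prompt):
    print("Prompt fits within the context window.")
else:
    print("Prompt too long: truncate it or split it into chunks.")
```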
2. How LLMs Work: Training Stages
Pre-training
- Goal: Learn general language patterns, grammar, semantics, and factual knowledge from large amounts of unlabeled text.
- Method: Usually unsupervised learning. Common objectives are "next token prediction" or "masked language modeling."
- Result: A "base model" with strong language understanding and generation abilities, but not necessarily good at following instructions.
- Simple masked language model example:
```python
import torch
import torch.nn as nn

class SimpleMaskedLM(nn.Module):
    # A minimal embedding + projection head; a real masked LM would also mask
    # part of the input and compute the loss only on the masked positions.
    def __init__(self, vocab_size, embed_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.linear = nn.Linear(embed_size, vocab_size)

    def forward(self, x):
        embedded = self.embedding(x)
        output = self.linear(embedded)  # per-position logits over the vocabulary
        return output

vocab_size = 10000
embed_size = 256
model = SimpleMaskedLM(vocab_size, embed_size)

batch_size, seq_len = 32, 50
input_tokens = torch.randint(0, vocab_size, (batch_size, seq_len))
output = model(input_tokens)  # shape: (batch_size, seq_len, vocab_size)
```
Instruction Fine-tuning / Supervised Fine-tuning (SFT)
- Goal: Teach the base model to understand and follow human instructions.
- Method: Supervised learning on high-quality "instruction-response" pairs.
- Result: An "instruct model" that is easier to use and better at following prompts.
- Instruction dataset example:
```python
instruction_dataset = [
    {
        "instruction": "Translate the following text to English",
        "input": "我喜欢编程",
        "output": "I like programming"
    },
    {
        "instruction": "Summarize the main content of the following text",
        "input": "人工智能是计算机科学的一个分支,致力于开发能够模拟人类智能的系统。",
        "output": "Artificial intelligence is a branch of computer science devoted to building systems that simulate human intelligence."
    }
]

# Example data preparation for fine-tuning with transformers
# (assumes a `tokenizer` for the base model has already been loaded,
#  e.g. with AutoTokenizer.from_pretrained; the Trainer setup itself is omitted)
from transformers import Trainer, TrainingArguments

def prepare_dataset(examples):
    prompts = [
        f"Instruction: {ex['instruction']}\nInput: {ex['input']}\nOutput:"
        for ex in examples
    ]
    return {
        "input_ids": tokenizer(prompts, truncation=True, padding=True)["input_ids"],
        "labels": tokenizer([ex["output"] for ex in examples])["input_ids"]
    }
```
Reinforcement Learning from Human Feedback (RLHF)
- Goal: Further optimize the instruct model to make its responses more aligned with human preferences, safer, and less harmful.
- Method:
- Reward Model: Collect human preference data to train a reward model that predicts human ratings for different responses.
- Reinforcement Learning: Use the reward model as a reward function to fine-tune the instruct model (e.g., with PPO) for higher-reward responses.
- Result: A model optimized with RLHF, such as ChatGPT, Gemini, etc.
- Simplified RLHF flow example:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model
        self.reward_head = nn.Linear(768, 1)  # assumes a hidden size of 768

    def forward(self, input_ids):
        hidden_states = self.base_model(input_ids).last_hidden_state
        reward = self.reward_head(hidden_states[:, -1, :])  # score from the last token's state
        return reward

def ppo_train_step(policy_model, reward_model, optimizer, input_ids, epsilon=0.2):
    # Highly simplified PPO step: a real implementation also uses advantages,
    # a value function, and a KL penalty against the reference model.
    with torch.no_grad():
        old_outputs = policy_model(input_ids)
        old_probs = F.softmax(old_outputs.logits, dim=-1)

    new_outputs = policy_model(input_ids)
    new_probs = F.softmax(new_outputs.logits, dim=-1)
    rewards = reward_model(input_ids).unsqueeze(-1)  # (batch, 1, 1): broadcast over positions

    ratio = new_probs / old_probs                     # probability ratio between new and old policy
    surr1 = ratio * rewards
    surr2 = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * rewards
    policy_loss = -torch.min(surr1, surr2).mean()     # clipped surrogate objective

    optimizer.zero_grad()
    policy_loss.backward()
    optimizer.step()
    return policy_loss.item()
```
3. Key Applications and Techniques
Prompt Engineering
- Core: Design and optimize prompts for LLMs to guide them to generate desired outputs.
- Importance: Mastering prompt engineering is key to using LLMs effectively.
- Common techniques:
- Clarity: Clearly state what you want the model to do.
- Role-playing: Ask the model to act as a specific role (e.g., "You are a professional programmer").
- Few-shot Learning: Provide a few examples in the prompt.
- Chain-of-Thought/Reasoning: Guide the model to think step by step.
- Self-Correction: Ask the model to check and correct its own output.
- Prompt engineering example:
```python
from openai import OpenAI

client = OpenAI()

def generate_with_prompt(prompt_template, **kwargs):
    prompt = prompt_template.format(**kwargs)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# 1. Basic prompt
basic_prompt = "Translate the following text to English: {text}"

# 2. Few-shot example (Chinese-to-English translation)
few_shot_prompt = """
Example 1:
Input: 这本书很有意思
Output: This book is very interesting

Example 2:
Input: 我明天要去北京
Output: I will go to Beijing tomorrow

Now please translate:
Input: {text}
Output:
"""

# 3. Chain-of-Thought prompt
cot_prompt = """
Let's solve this math problem step by step:

Problem: {problem}

Steps:
1) Understand the known conditions
2) Determine the solution method
3) Calculate step by step
4) Verify the answer

Please start solving:
"""

# 4. Role-playing prompt
role_prompt = """
You are an experienced Python developer. Please help me optimize the following code, focusing on:
1) Code performance
2) Readability
3) Python best practices

Code:
{code}

Please provide suggestions and the improved code:
"""
```
Retrieval Augmented Generation (RAG)
- Goal: Address LLM knowledge timeliness, domain knowledge gaps, and hallucination issues.
- Principle:
- User asks a question.
- System retrieves relevant information from an external knowledge base.
- The retrieved information is provided as context to the LLM.
- The LLM generates an answer based on this context.
- Core: Vector databases and embeddings for efficient retrieval.
- RAG implementation example:
```python
from openai import OpenAI
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

client = OpenAI()  # reused for the generation step

class SimpleRAG:
    def __init__(self):
        self.encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
        self.knowledge_base = []
        self.embeddings = []

    def add_knowledge(self, documents):
        self.knowledge_base.extend(documents)
        new_embeddings = self.encoder.encode(documents)
        if len(self.embeddings) == 0:
            self.embeddings = new_embeddings
        else:
            self.embeddings = np.vstack([self.embeddings, new_embeddings])

    def retrieve(self, query, top_k=3):
        query_embedding = self.encoder.encode([query])[0]
        similarities = cosine_similarity([query_embedding], self.embeddings)[0]
        top_indices = np.argsort(similarities)[-top_k:][::-1]  # most similar documents first
        return [self.knowledge_base[i] for i in top_indices]

    def generate_answer(self, query):
        relevant_docs = self.retrieve(query)
        prompt = f"""
Based on the following reference information, answer the question:

Reference: {' '.join(relevant_docs)}

Question: {query}

Please provide an accurate answer:
"""
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content

# Example usage
rag = SimpleRAG()
documents = [
    "Python is a high-level programming language known for its clean syntax and rich libraries.",
    "Python was created by Guido van Rossum and first released in 1991.",
    "Python is widely used in web development, data science, and AI."
]
rag.add_knowledge(documents)

question = "What are the main application areas of Python?"
answer = rag.generate_answer(question)
print(answer)
```
Tool Usage / Function Calling
- Goal: Extend LLM capabilities to interact with the external world and perform specific operations.
- Principle: LLMs can recognize when to call external tools (APIs, functions, search engines, calculators, etc.) and generate the required parameters.
- Agent concept: Combine LLMs, tools, memory, and planning modules to form autonomous AI agents (see the agent-loop sketch after the function-calling example below).
- Function calling example:
```python
import json
from datetime import datetime
from openai import OpenAI

client = OpenAI()

# Define available tool functions
def get_weather(city: str) -> str:
    return f"The weather in {city} is sunny, 25°C"

def calculate_age(birth_year: int) -> int:
    current_year = datetime.now().year
    return current_year - birth_year

def search_web(query: str) -> str:
    return f"Search results for '{query}'..."

# Tool function registry
AVAILABLE_TOOLS = {
    "get_weather": {
        "description": "Get weather information for a specified city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"}
            },
            "required": ["city"]
        }
    },
    "calculate_age": {
        "description": "Calculate age based on birth year",
        "parameters": {
            "type": "object",
            "properties": {
                "birth_year": {"type": "integer", "description": "Birth year"}
            },
            "required": ["birth_year"]
        }
    },
    "search_web": {
        "description": "Search the web for information",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search keyword"}
            },
            "required": ["query"]
        }
    }
}

def process_with_tools(user_input: str):
    # Convert the registry into the function-schema list the API expects
    function_schemas = [{"name": name, **spec} for name, spec in AVAILABLE_TOOLS.items()]
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are an assistant that can call the provided tools when needed."},
            {"role": "user", "content": user_input}
        ],
        functions=function_schemas,
        function_call="auto"
    )

    message = response.choices[0].message
    if message.function_call:
        func_name = message.function_call.name
        func_args = json.loads(message.function_call.arguments)
        if func_name == "get_weather":
            return get_weather(**func_args)
        elif func_name == "calculate_age":
            return calculate_age(**func_args)
        elif func_name == "search_web":
            return search_web(**func_args)
    return message.content

# Example usage
queries = [
    "What's the weather in Beijing today?",
    "How old is someone born in 2000?",
    "Please help me search for information about Python"
]
for query in queries:
    result = process_with_tools(query)
    print(f"Question: {query}")
    print(f"Answer: {result}\n")
```
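- Minimal agent loop (a hedged sketch reusing `client`, `AVAILABLE_TOOLS`, and the tool functions defined above; the `run_agent` helper and step limit are illustrative, not a library API): the model picks a tool, the tool result is appended to the conversation as memory, and the loop repeats until the model returns a plain answer.

```python
# Hypothetical agent loop built on the function-calling example above.
TOOL_FUNCTIONS = {
    "get_weather": get_weather,
    "calculate_age": calculate_age,
    "search_web": search_web
}

def run_agent(user_input: str, max_steps: int = 5) -> str:
    function_schemas = [{"name": name, **spec} for name, spec in AVAILABLE_TOOLS.items()]
    # The growing message list serves as the agent's short-term memory
    messages = [
        {"role": "system", "content": "You are an assistant that can call tools when needed."},
        {"role": "user", "content": user_input}
    ]
    for _ in range(max_steps):
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=messages,
            functions=function_schemas,
            function_call="auto"
        )
        message = response.choices[0].message
        if not message.function_call:
            return message.content  # no tool requested: this is the final answer
        # Execute the requested tool and feed its result back to the model
        name = message.function_call.name
        args = json.loads(message.function_call.arguments)
        result = TOOL_FUNCTIONS[name](**args)
        messages.append({"role": "assistant", "content": None,
                         "function_call": {"name": name,
                                           "arguments": message.function_call.arguments}})
        messages.append({"role": "function", "name": name, "content": str(result)})
    return "Stopped after reaching the step limit."

print(run_agent("What's the weather in Beijing, and how old is someone born in 2000?"))
```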
4. LLM Limitations and Challenges
Hallucination: LLMs may generate plausible but incorrect or fabricated information.
- Example:
```python
# Hallucination detection example: a naive keyword-overlap check.
# Note that a shared entity name (e.g. "OpenAI") can mask factual errors such as
# a wrong date; real systems use entailment models or citation checking instead.
def detect_hallucination(question: str, answer: str, knowledge_base: list) -> bool:
    sentences = answer.split('.')
    for sentence in sentences:
        if sentence.strip():
            found_support = False
            for knowledge in knowledge_base:
                if any(key in knowledge for key in sentence.split()):
                    found_support = True
                    break
            if not found_support:
                return True   # Possible hallucination
    return False              # Supported by knowledge base

# Example
knowledge_base = [
    "OpenAI was founded in 2015.",
    "GPT-3 was released in 2020.",
    "The Transformer architecture was proposed by Google in 2017."
]

question = "When was OpenAI founded?"
answer1 = "OpenAI was founded in 2015 as an AI research company."
answer2 = "OpenAI was founded in 2010 by Musk."  # Incorrect

print(f"Answer 1 hallucination: {detect_hallucination(question, answer1, knowledge_base)}")
print(f"Answer 2 hallucination: {detect_hallucination(question, answer2, knowledge_base)}")
```
Knowledge Cutoff: LLMs' knowledge is limited to their training data's cutoff date and cannot access the latest information (RAG can help mitigate this).
- Solution example:
```python
from datetime import datetime

class DynamicKnowledgeBase:
    def __init__(self):
        self.static_knowledge = {}
        self.dynamic_knowledge = {}

    def get_current_info(self, topic: str) -> str:
        # Placeholder for a real retrieval step (web search, news API, RAG pipeline, ...)
        current_time = datetime.now().strftime("%Y-%m-%d")
        return f"Latest information about {topic} (updated: {current_time})"

    def answer_question(self, question: str) -> str:
        current_year = datetime.now().year
        # Route questions about recent events to the retrieval path
        if str(current_year) in question or "latest" in question or "now" in question:
            topic = question.replace("latest", "").replace("now", "")
            current_info = self.get_current_info(topic)
            prompt = f"""
Question: {question}
Latest info: {current_info}
Please answer based on the latest info.
"""
            return f"Answer based on latest data: {prompt}"
        else:
            return "Answer based on model training data"

# Example
kb = DynamicKnowledgeBase()
print(kb.answer_question("What are the AI trends in 2024?"))
```
Computational Resource Consumption: Training and running LLMs require significant computational resources.
- Optimization example:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def optimize_model_inference(model_name: str):
    # Load in half precision and let accelerate spread layers across available devices
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        torch_dtype=torch.float16
    )

    def batch_generate(prompts: list, batch_size: int = 4):
        results = []
        for i in range(0, len(prompts), batch_size):
            batch = prompts[i:i + batch_size]
            inputs = tokenizer(batch, return_tensors="pt", padding=True).to(model.device)
            with torch.cuda.amp.autocast():
                outputs = model.generate(**inputs)
            results.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
        return results

    return model, tokenizer, batch_generate

def prune_model(model, pruning_ratio=0.3):
    # Magnitude pruning: zero out the smallest weights in every linear layer
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            weights = module.weight.data.abs()
            threshold = torch.quantile(weights, pruning_ratio)
            mask = weights > threshold
            module.weight.data *= mask
    return model

# Model distillation example (see also the distillation section below)
class DistillationTrainer:
    def __init__(self, teacher_model, student_model):
        self.teacher = teacher_model
        self.student = student_model

    def distill(self, input_ids, temperature=2.0):
        with torch.no_grad():
            teacher_logits = self.teacher(input_ids).logits / temperature
        student_logits = self.student(input_ids).logits / temperature
        loss = torch.nn.functional.kl_div(
            torch.nn.functional.log_softmax(student_logits, dim=-1),
            torch.nn.functional.softmax(teacher_logits, dim=-1),
            reduction='batchmean'
        )
        return loss
```
Bias: Biases in training data may be learned and amplified by LLMs, leading to unfair or discriminatory outputs.
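- Bias probing example (a hedged sketch, not a standard benchmark): one way to make bias concrete is to compare a model's predictions for prompts that differ only in a demographic term. The fill-mask model and prompt templates below are illustrative assumptions.

```python
# Hypothetical bias probe: compare a masked-LM's top predictions for prompts
# that differ only in a demographic term. Model and templates are illustrative.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

templates = [
    "The man worked as a [MASK].",
    "The woman worked as a [MASK]."
]

for template in templates:
    predictions = fill_mask(template, top_k=5)
    top_words = [p["token_str"].strip() for p in predictions]
    print(f"{template} -> {top_words}")

# Systematic differences between the two lists hint at occupation-gender
# associations absorbed from the training data.
```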
Interpretability: LLMs are "black boxes"; their decision-making process is hard to fully understand.
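- Attention inspection example (a minimal sketch, assuming a Hugging Face encoder model with `output_attentions=True`; attention weights are only a partial and debated window into model behavior, not a full explanation):

```python
# Minimal sketch: inspect attention weights as one (limited) interpretability signal.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("Large language models are hard to interpret.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, each shaped (batch, heads, seq_len, seq_len)
last_layer = outputs.attentions[-1][0]       # final layer, first example
avg_attention = last_layer.mean(dim=0)       # average over heads -> (seq_len, seq_len)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Which token does each position attend to most strongly?
for i, token in enumerate(tokens):
    top = avg_attention[i].argmax().item()
    print(f"{token:>12} attends most to {tokens[top]}")
```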
Safety and Ethics: How to prevent LLMs from being misused and ensure they comply with ethical standards.
- Safety check example:
```python
class SafetyChecker:
    def __init__(self):
        self.sensitive_words = {
            "Personal Info": ["ID card", "bank card", "password"],
            "Harmful Content": ["violence", "discrimination", "illegal"],
            "Security Risk": ["vulnerability", "attack", "injection"]
        }

    def check_content(self, text: str) -> dict:
        results = {category: [] for category in self.sensitive_words}
        for category, words in self.sensitive_words.items():
            for word in words:
                if word in text:
                    results[category].append(word)
        return results

    def is_safe(self, text: str) -> bool:
        results = self.check_content(text)
        return all(len(matches) == 0 for matches in results.values())

    def filter_sensitive(self, text: str) -> str:
        filtered_text = text
        for category, words in self.sensitive_words.items():
            for word in words:
                filtered_text = filtered_text.replace(word, "*" * len(word))
        return filtered_text

# Example
checker = SafetyChecker()
text = "This is a text containing bank card info, which may have security vulnerabilities."
print(f"Safety check: {checker.check_content(text)}")
print(f"Is safe: {checker.is_safe(text)}")
print(f"Filtered text: {checker.filter_sensitive(text)}")
```
5. Other Important Concepts
Parameters: The key indicator of LLM size.
- Parameter counting example:
```python
import torch

def count_parameters(model: torch.nn.Module) -> dict:
    total_params = 0
    trainable_params = 0
    non_trainable_params = 0

    for param in model.parameters():
        num_params = param.numel()
        total_params += num_params
        if param.requires_grad:
            trainable_params += num_params
        else:
            non_trainable_params += num_params

    return {
        "Total parameters": total_params,
        "Trainable parameters": trainable_params,
        "Non-trainable parameters": non_trainable_params,
        "Memory estimate (GB)": total_params * 4 / (1024**3)  # assumes 4 bytes per parameter (FP32)
    }
```
Knowledge Distillation: Training a smaller "student" model to mimic a larger "teacher" model, reducing deployment cost and inference latency.
- Distillation process example:
```python
import torch
import torch.nn as nn

class SimpleDistillation:
    def __init__(self, teacher_model, student_model, temperature=2.0):
        self.teacher = teacher_model
        self.student = student_model
        self.temperature = temperature

    def distillation_loss(self, student_logits, teacher_logits, labels=None, alpha=0.5):
        # Soft targets: match the teacher's softened output distribution
        soft_loss = nn.KLDivLoss(reduction='batchmean')(
            nn.functional.log_softmax(student_logits / self.temperature, dim=-1),
            nn.functional.softmax(teacher_logits / self.temperature, dim=-1)
        ) * (self.temperature ** 2)

        # Hard targets: standard cross-entropy against the true labels, if available
        if labels is not None:
            hard_loss = nn.CrossEntropyLoss()(student_logits, labels)
            return alpha * hard_loss + (1 - alpha) * soft_loss
        return soft_loss

    def train_step(self, inputs, optimizer, labels=None):
        with torch.no_grad():
            teacher_outputs = self.teacher(**inputs)
        student_outputs = self.student(**inputs)

        loss = self.distillation_loss(
            student_outputs.logits,
            teacher_outputs.logits,
            labels
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()
```
Quantization: Reducing model weight precision to decrease model size and memory usage, speeding up inference.
- Quantization example:
```python
import os
import torch

def quantize_model(model, quantization_type="dynamic"):
    if quantization_type == "dynamic":
        # Dynamic quantization: weights stored in int8, activations quantized at runtime
        quantized_model = torch.quantization.quantize_dynamic(
            model, {torch.nn.Linear}, dtype=torch.qint8
        )
        return quantized_model
    elif quantization_type == "static":
        # Static quantization: requires running calibration data through the
        # prepared model before conversion (calibration step omitted here)
        model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
        torch.quantization.prepare(model, inplace=True)
        torch.quantization.convert(model, inplace=True)
    return model

def compare_model_sizes(original_model, quantized_model):
    def get_size_mb(model):
        torch.save(model.state_dict(), "temp.pt")
        size_mb = os.path.getsize("temp.pt") / (1024 * 1024)
        os.remove("temp.pt")
        return size_mb

    original_size = get_size_mb(original_model)
    quantized_size = get_size_mb(quantized_model)
    print(f"Original model size: {original_size:.2f} MB")
    print(f"Quantized model size: {quantized_size:.2f} MB")
    print(f"Compression ratio: {original_size/quantized_size:.2f}x")
```
Base Model vs. Instruct Model:
- Base Model: Only pre-trained, good at text completion, but may not follow instructions well.
- Instruct Model: Fine-tuned with instructions and RLHF, better at understanding and following instructions.
- Model comparison example:
```python
def compare_models(base_model, instruct_model, prompts: list):
    # Assumes both models expose a simple `generate(prompt)` text interface
    results = []
    for prompt in prompts:
        base_output = base_model.generate(prompt)
        instruct_output = instruct_model.generate(f"Please execute the following instruction: {prompt}")
        results.append({
            "Prompt": prompt,
            "Base Model Output": base_output,
            "Instruct Model Output": instruct_output
        })
    return results

# Example (assumes `base_model` and `instruct_model` have already been loaded)
test_prompts = [
    "Explain what machine learning is",
    "Write a Python function to calculate the Fibonacci sequence",
    "Summarize the main content of this paragraph"
]
for result in compare_models(base_model, instruct_model, test_prompts):
    print(f"\nPrompt: {result['Prompt']}")
    print(f"Base Model: {result['Base Model Output']}")
    print(f"Instruct Model: {result['Instruct Model Output']}")
```
Getting started with LLMs: It is recommended to start by understanding the Transformer architecture and prompt engineering, combined with hands-on practice (e.g., using the OpenAI API or Google AI Studio). Also, pay attention to RAG and Agent application patterns, as they are key directions for current LLM applications.
6. Practical Resources
- Open Source Models:
- LLaMA 2
- Mistral
- ChatGLM
- Baichuan
- Development Frameworks:
- Hugging Face Transformers
- LangChain
- LlamaIndex
- FastChat
- Deployment Tools:
- ONNX Runtime
- TensorRT
- vLLM
- Text Generation Inference
- Learning Resources:
- Hugging Face Courses
- FastAI Courses
- DeepLearning.AI Courses
- Stanford CS324 LLM Course
With these resources, you can gradually deepen your understanding of LLM principles and applications, and start building your own AI applications. Remember, LLM technology is evolving rapidly, so maintaining a habit of learning and practice is crucial.