Large Language Models (LLM)
Large Language Models (LLMs) are currently one of the most active areas of artificial intelligence. Getting started requires a foundation that spans basic concepts, core technologies, and common application patterns. Below is a summary of the essential concepts and knowledge points for working with LLMs:
1. Basic Concepts
What is a Language Model (LM)?
- Definition: A machine learning model designed to predict and generate language sequences that conform to grammar and semantic rules. Simply put, it can predict the next word or character in a sequence.
- Development: From early N-gram models and statistical language models, to later RNNs, LSTMs, and now deep learning models based on Transformers.
- Simple Example: A bigram model predicts the next word based on the previous one:
```python
# Simple bigram model example
from collections import defaultdict
import random

class BigramModel:
    def __init__(self):
        self.bigrams = defaultdict(list)

    def train(self, text):
        words = text.split()
        for i in range(len(words) - 1):
            self.bigrams[words[i]].append(words[i + 1])

    def generate(self, start_word, length=5):
        current = start_word
        result = [current]
        for _ in range(length):
            if current in self.bigrams:
                next_word = random.choice(self.bigrams[current])
                result.append(next_word)
                current = next_word
            else:
                break
        return ' '.join(result)

# Usage example
model = BigramModel()
model.train("I like programming I want to learn programming languages")
print(model.generate("I"))
# Possible output: "I want to learn programming languages" (generation is random)
```
What is a Large Language Model (LLM)?
- "Large" means:
- Large model size: Billions to trillions of parameters learned during training, determining how the model processes and generates text.
- Large training data: Pre-trained on massive text data, including books, articles, web pages, code, etc.
- Core abilities: Understand and generate human language, with general knowledge, logical reasoning, and task generalization abilities.
- Common model scale comparison:
| Model Name | Parameters | Typical Application |
|---|---|---|
| GPT-3 | 175B | General text generation, dialogue |
| GPT-4 | Not public (1T+) | Stronger reasoning and creativity |
| LLaMA 2-7B | 7B | Lightweight, device deployment |
| Claude 2 | Not public | Professional writing, code generation |
| Gemini Ultra | Not public | Multimodal understanding and generation |
- "Large" means:
What is the Transformer architecture?
- Foundation of LLMs: Most modern LLMs are based on the Transformer architecture.
- Core mechanisms:
- Self-Attention Mechanism: Allows the model to "attend" to all words in a sequence, assigning different attention weights based on their relationships, capturing long-range dependencies.
- Encoder-Decoder Structure: The original Transformer includes an encoder (for input) and a decoder (for output). Many LLMs (like GPT) mainly use the decoder for text generation.
- Simplified self-attention implementation:
```python
import numpy as np

def simplified_self_attention(sequence, d_k=64):
    # In real self-attention, Q, K, and V come from separate learned projections;
    # here we reuse the input sequence for all three to keep the example simple.
    Q = K = V = sequence
    scores = np.dot(Q, K.T) / np.sqrt(d_k)  # scaled dot-product scores
    attention_weights = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True)  # row-wise softmax
    output = np.dot(attention_weights, V)   # weighted sum of values
    return output, attention_weights
```
Token and Tokenization
- Token: The smallest unit of text processed by LLMs. It could be a word, part of a word, punctuation, or even a Chinese character.
- Tokenization: The process of splitting raw text into tokens. Common methods include BPE, WordPiece, etc.
- Context Window/Length: The maximum number of tokens an LLM can process at once (see the context-window check after the token-counting example below).
- Token counting example with tiktoken:
```python
import tiktoken

def count_tokens(text, model="gpt-3.5-turbo"):
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    return len(tokens)

# Example
text = "Hello, World!"
token_count = count_tokens(text)
print(f"Text contains {token_count} tokens")
```
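- Context window check example: a short sketch building on `count_tokens` above. The 4096-token limit and the `reserved_for_output` margin are illustrative assumptions, not official constants; check your model's documented context length.

```python
# Sketch: check whether a prompt fits an assumed context limit before sending it.
# MAX_CONTEXT_TOKENS is an illustrative figure, not an official constant.
MAX_CONTEXT_TOKENS = 4096

def fits_in_context(text: str, reserved_for_output: int = 500) -> bool:
    # Leave room for the model's reply inside the same context window
    return count_tokens(text) + reserved_for_output <= MAX_CONTEXT_TOKENS

prompt = "Summarize the following document: ..."
if fits_in_context(prompt):
    print("Prompt fits within the context window.")
else:
    print("Prompt too long: truncate it or split it into chunks.")
```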
2. How LLMs Work: Training Stages
Pre-training
- Goal: Learn general language patterns, grammar, semantics, and factual knowledge from large amounts of unlabeled text.
- Method: Usually unsupervised learning. Common objectives are "next token prediction" or "masked language modeling."
- Result: A "base model" with strong language understanding and generation abilities, but not necessarily good at following instructions.
- Simple masked language model example:
```python
import torch
import torch.nn as nn

class SimpleMaskedLM(nn.Module):
    # A minimal embedding + projection head; a real masked LM would also mask
    # part of the input and compute the loss only on the masked positions.
    def __init__(self, vocab_size, embed_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.linear = nn.Linear(embed_size, vocab_size)

    def forward(self, x):
        embedded = self.embedding(x)
        output = self.linear(embedded)  # per-position logits over the vocabulary
        return output

vocab_size = 10000
embed_size = 256
model = SimpleMaskedLM(vocab_size, embed_size)

batch_size, seq_len = 32, 50
input_tokens = torch.randint(0, vocab_size, (batch_size, seq_len))
output = model(input_tokens)  # shape: (batch_size, seq_len, vocab_size)
```
Instruction Fine-tuning / Supervised Fine-tuning (SFT)
- Goal: Teach the base model to understand and follow human instructions.
- Method: Supervised learning on high-quality "instruction-response" pairs.
- Result: An "instruct model" that is easier to use and better at following prompts.
- Instruction dataset example:
```python
instruction_dataset = [
    {
        "instruction": "Translate the following text to English",
        "input": "我喜欢编程",
        "output": "I like programming"
    },
    {
        "instruction": "Summarize the main content of the following text",
        "input": "人工智能是计算机科学的一个分支,致力于开发能够模拟人类智能的系统。",
        "output": "Artificial intelligence is a branch of computer science devoted to building systems that simulate human intelligence."
    }
]

# Example data preparation for fine-tuning with transformers
# (assumes a `tokenizer` for the base model has already been loaded,
#  e.g. with AutoTokenizer.from_pretrained; the Trainer setup itself is omitted)
from transformers import Trainer, TrainingArguments

def prepare_dataset(examples):
    prompts = [
        f"Instruction: {ex['instruction']}\nInput: {ex['input']}\nOutput:"
        for ex in examples
    ]
    return {
        "input_ids": tokenizer(prompts, truncation=True, padding=True)["input_ids"],
        "labels": tokenizer([ex["output"] for ex in examples])["input_ids"]
    }
```
Reinforcement Learning from Human Feedback (RLHF)
- Goal: Further optimize the instruct model to make its responses more aligned with human preferences, safer, and less harmful.
- Method:
- Reward Model: Collect human preference data to train a reward model that predicts human ratings for different responses.
- Reinforcement Learning: Use the reward model as a reward function to fine-tune the instruct model (e.g., with PPO) for higher-reward responses.
- Result: A model optimized with RLHF, such as ChatGPT, Gemini, etc.
- Simplified RLHF flow example:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model
        self.reward_head = nn.Linear(768, 1)  # assumes a hidden size of 768

    def forward(self, input_ids):
        hidden_states = self.base_model(input_ids).last_hidden_state
        reward = self.reward_head(hidden_states[:, -1, :])  # score from the last token's state
        return reward

def ppo_train_step(policy_model, reward_model, optimizer, input_ids, epsilon=0.2):
    # Highly simplified PPO step: a real implementation also uses advantages,
    # a value function, and a KL penalty against the reference model.
    with torch.no_grad():
        old_outputs = policy_model(input_ids)
        old_probs = F.softmax(old_outputs.logits, dim=-1)

    new_outputs = policy_model(input_ids)
    new_probs = F.softmax(new_outputs.logits, dim=-1)
    rewards = reward_model(input_ids).unsqueeze(-1)  # (batch, 1, 1): broadcast over positions

    ratio = new_probs / old_probs                     # probability ratio between new and old policy
    surr1 = ratio * rewards
    surr2 = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * rewards
    policy_loss = -torch.min(surr1, surr2).mean()     # clipped surrogate objective

    optimizer.zero_grad()
    policy_loss.backward()
    optimizer.step()
    return policy_loss.item()
```
3. Key Applications and Techniques
Prompt Engineering
- Core: Design and optimize prompts for LLMs to guide them to generate desired outputs.
- Importance: Mastering prompt engineering is key to using LLMs effectively.
- Common techniques:
- Clarity: Clearly state what you want the model to do.
- Role-playing: Ask the model to act as a specific role (e.g., "You are a professional programmer").
- Few-shot Learning: Provide a few examples in the prompt.
- Chain-of-Thought/Reasoning: Guide the model to think step by step.
- Self-Correction: Ask the model to check and correct its own output.
- Prompt engineering example:
```python
from openai import OpenAI

client = OpenAI()

def generate_with_prompt(prompt_template, **kwargs):
    prompt = prompt_template.format(**kwargs)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# 1. Basic prompt
basic_prompt = "Translate the following text to English: {text}"

# 2. Few-shot example (Chinese-to-English translation)
few_shot_prompt = """
Example 1:
Input: 这本书很有意思
Output: This book is very interesting

Example 2:
Input: 我明天要去北京
Output: I will go to Beijing tomorrow

Now please translate:
Input: {text}
Output:
"""

# 3. Chain-of-Thought prompt
cot_prompt = """
Let's solve this math problem step by step:

Problem: {problem}

Steps:
1) Understand the known conditions
2) Determine the solution method
3) Calculate step by step
4) Verify the answer

Please start solving:
"""

# 4. Role-playing prompt
role_prompt = """
You are an experienced Python developer. Please help me optimize the following code, focusing on:
1) Code performance
2) Readability
3) Python best practices

Code:
{code}

Please provide suggestions and the improved code:
"""
```
Retrieval Augmented Generation (RAG)
- Goal: Address LLM knowledge timeliness, domain knowledge gaps, and hallucination issues.
- Principle:
- User asks a question.
- System retrieves relevant information from an external knowledge base.
- The retrieved information is provided as context to the LLM.
- The LLM generates an answer based on this context.
- Core: Vector databases and embeddings for efficient retrieval.
- RAG implementation example:
```python
from openai import OpenAI
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

client = OpenAI()  # reused for the generation step

class SimpleRAG:
    def __init__(self):
        self.encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
        self.knowledge_base = []
        self.embeddings = []

    def add_knowledge(self, documents):
        self.knowledge_base.extend(documents)
        new_embeddings = self.encoder.encode(documents)
        if len(self.embeddings) == 0:
            self.embeddings = new_embeddings
        else:
            self.embeddings = np.vstack([self.embeddings, new_embeddings])

    def retrieve(self, query, top_k=3):
        query_embedding = self.encoder.encode([query])[0]
        similarities = cosine_similarity([query_embedding], self.embeddings)[0]
        top_indices = np.argsort(similarities)[-top_k:][::-1]  # most similar documents first
        return [self.knowledge_base[i] for i in top_indices]

    def generate_answer(self, query):
        relevant_docs = self.retrieve(query)
        prompt = f"""
Based on the following reference information, answer the question:

Reference: {' '.join(relevant_docs)}

Question: {query}

Please provide an accurate answer:
"""
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content

# Example usage
rag = SimpleRAG()
documents = [
    "Python is a high-level programming language known for its clean syntax and rich libraries.",
    "Python was created by Guido van Rossum and first released in 1991.",
    "Python is widely used in web development, data science, and AI."
]
rag.add_knowledge(documents)

question = "What are the main application areas of Python?"
answer = rag.generate_answer(question)
print(answer)
```
Tool Usage / Function Calling
- Goal: Extend LLM capabilities to interact with the external world and perform specific operations.
- Principle: LLMs can recognize when to call external tools (APIs, functions, search engines, calculators, etc.) and generate the required parameters.
- Agent concept: Combine LLMs, tools, memory, and planning modules to form autonomous AI agents (see the agent-loop sketch after the function-calling example below).
- Function calling example:
```python
import json
from datetime import datetime
from openai import OpenAI

client = OpenAI()

# Define available tool functions
def get_weather(city: str) -> str:
    return f"The weather in {city} is sunny, 25°C"

def calculate_age(birth_year: int) -> int:
    current_year = datetime.now().year
    return current_year - birth_year

def search_web(query: str) -> str:
    return f"Search results for '{query}'..."

# Tool function registry
AVAILABLE_TOOLS = {
    "get_weather": {
        "description": "Get weather information for a specified city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"}
            },
            "required": ["city"]
        }
    },
    "calculate_age": {
        "description": "Calculate age based on birth year",
        "parameters": {
            "type": "object",
            "properties": {
                "birth_year": {"type": "integer", "description": "Birth year"}
            },
            "required": ["birth_year"]
        }
    },
    "search_web": {
        "description": "Search the web for information",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search keyword"}
            },
            "required": ["query"]
        }
    }
}

def process_with_tools(user_input: str):
    # Convert the registry into the function-schema list the API expects
    function_schemas = [{"name": name, **spec} for name, spec in AVAILABLE_TOOLS.items()]
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are an assistant that can call the provided tools when needed."},
            {"role": "user", "content": user_input}
        ],
        functions=function_schemas,
        function_call="auto"
    )

    message = response.choices[0].message
    if message.function_call:
        func_name = message.function_call.name
        func_args = json.loads(message.function_call.arguments)
        if func_name == "get_weather":
            return get_weather(**func_args)
        elif func_name == "calculate_age":
            return calculate_age(**func_args)
        elif func_name == "search_web":
            return search_web(**func_args)
    return message.content

# Example usage
queries = [
    "What's the weather in Beijing today?",
    "How old is someone born in 2000?",
    "Please help me search for information about Python"
]
for query in queries:
    result = process_with_tools(query)
    print(f"Question: {query}")
    print(f"Answer: {result}\n")
```
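- Minimal agent loop (a hedged sketch reusing `client`, `AVAILABLE_TOOLS`, and the tool functions defined above; the `run_agent` helper and step limit are illustrative, not a library API): the model picks a tool, the tool result is appended to the conversation as memory, and the loop repeats until the model returns a plain answer.

```python
# Hypothetical agent loop built on the function-calling example above.
TOOL_FUNCTIONS = {
    "get_weather": get_weather,
    "calculate_age": calculate_age,
    "search_web": search_web
}

def run_agent(user_input: str, max_steps: int = 5) -> str:
    function_schemas = [{"name": name, **spec} for name, spec in AVAILABLE_TOOLS.items()]
    # The growing message list serves as the agent's short-term memory
    messages = [
        {"role": "system", "content": "You are an assistant that can call tools when needed."},
        {"role": "user", "content": user_input}
    ]
    for _ in range(max_steps):
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=messages,
            functions=function_schemas,
            function_call="auto"
        )
        message = response.choices[0].message
        if not message.function_call:
            return message.content  # no tool requested: this is the final answer
        # Execute the requested tool and feed its result back to the model
        name = message.function_call.name
        args = json.loads(message.function_call.arguments)
        result = TOOL_FUNCTIONS[name](**args)
        messages.append({"role": "assistant", "content": None,
                         "function_call": {"name": name,
                                           "arguments": message.function_call.arguments}})
        messages.append({"role": "function", "name": name, "content": str(result)})
    return "Stopped after reaching the step limit."

print(run_agent("What's the weather in Beijing, and how old is someone born in 2000?"))
```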
4. LLM Limitations and Challenges
Hallucination: LLMs may generate plausible but incorrect or fabricated information.
- Example:
```python
# Hallucination detection example: a naive keyword-overlap check.
# Note that a shared entity name (e.g. "OpenAI") can mask factual errors such as
# a wrong date; real systems use entailment models or citation checking instead.
def detect_hallucination(question: str, answer: str, knowledge_base: list) -> bool:
    sentences = answer.split('.')
    for sentence in sentences:
        if sentence.strip():
            found_support = False
            for knowledge in knowledge_base:
                if any(key in knowledge for key in sentence.split()):
                    found_support = True
                    break
            if not found_support:
                return True   # Possible hallucination
    return False              # Supported by knowledge base

# Example
knowledge_base = [
    "OpenAI was founded in 2015.",
    "GPT-3 was released in 2020.",
    "The Transformer architecture was proposed by Google in 2017."
]

question = "When was OpenAI founded?"
answer1 = "OpenAI was founded in 2015 as an AI research company."
answer2 = "OpenAI was founded in 2010 by Musk."  # Incorrect

print(f"Answer 1 hallucination: {detect_hallucination(question, answer1, knowledge_base)}")
print(f"Answer 2 hallucination: {detect_hallucination(question, answer2, knowledge_base)}")
```
Knowledge Cutoff: LLMs' knowledge is limited to their training data's cutoff date and cannot access the latest information (RAG can help mitigate this).
- Solution example:
```python
from datetime import datetime

class DynamicKnowledgeBase:
    def __init__(self):
        self.static_knowledge = {}
        self.dynamic_knowledge = {}

    def get_current_info(self, topic: str) -> str:
        # Placeholder for a real retrieval step (web search, news API, RAG pipeline, ...)
        current_time = datetime.now().strftime("%Y-%m-%d")
        return f"Latest information about {topic} (updated: {current_time})"

    def answer_question(self, question: str) -> str:
        current_year = datetime.now().year
        # Route questions about recent events to the retrieval path
        if str(current_year) in question or "latest" in question or "now" in question:
            topic = question.replace("latest", "").replace("now", "")
            current_info = self.get_current_info(topic)
            prompt = f"""
Question: {question}
Latest info: {current_info}
Please answer based on the latest info.
"""
            return f"Answer based on latest data: {prompt}"
        else:
            return "Answer based on model training data"

# Example
kb = DynamicKnowledgeBase()
print(kb.answer_question("What are the AI trends in 2024?"))
```
Computational Resource Consumption: Training and running LLMs require significant computational resources.
- Optimization example:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def optimize_model_inference(model_name: str):
    # Load in half precision and let accelerate spread layers across available devices
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        torch_dtype=torch.float16
    )

    def batch_generate(prompts: list, batch_size: int = 4):
        results = []
        for i in range(0, len(prompts), batch_size):
            batch = prompts[i:i + batch_size]
            inputs = tokenizer(batch, return_tensors="pt", padding=True).to(model.device)
            with torch.cuda.amp.autocast():
                outputs = model.generate(**inputs)
            results.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
        return results

    return model, tokenizer, batch_generate

def prune_model(model, pruning_ratio=0.3):
    # Magnitude pruning: zero out the smallest weights in every linear layer
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            weights = module.weight.data.abs()
            threshold = torch.quantile(weights, pruning_ratio)
            mask = weights > threshold
            module.weight.data *= mask
    return model

# Model distillation example (see also the distillation section below)
class DistillationTrainer:
    def __init__(self, teacher_model, student_model):
        self.teacher = teacher_model
        self.student = student_model

    def distill(self, input_ids, temperature=2.0):
        with torch.no_grad():
            teacher_logits = self.teacher(input_ids).logits / temperature
        student_logits = self.student(input_ids).logits / temperature
        loss = torch.nn.functional.kl_div(
            torch.nn.functional.log_softmax(student_logits, dim=-1),
            torch.nn.functional.softmax(teacher_logits, dim=-1),
            reduction='batchmean'
        )
        return loss
```
Bias: Biases in training data may be learned and amplified by LLMs, leading to unfair or discriminatory outputs.
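- Bias probing example (a hedged sketch, not a standard benchmark): one way to make bias concrete is to compare a model's predictions for prompts that differ only in a demographic term. The fill-mask model and prompt templates below are illustrative assumptions.

```python
# Hypothetical bias probe: compare a masked-LM's top predictions for prompts
# that differ only in a demographic term. Model and templates are illustrative.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

templates = [
    "The man worked as a [MASK].",
    "The woman worked as a [MASK]."
]

for template in templates:
    predictions = fill_mask(template, top_k=5)
    top_words = [p["token_str"].strip() for p in predictions]
    print(f"{template} -> {top_words}")

# Systematic differences between the two lists hint at occupation-gender
# associations absorbed from the training data.
```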
Interpretability: LLMs are "black boxes"; their decision-making process is hard to fully understand.
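- Attention inspection example (a minimal sketch, assuming a Hugging Face encoder model with `output_attentions=True`; attention weights are only a partial and debated window into model behavior, not a full explanation):

```python
# Minimal sketch: inspect attention weights as one (limited) interpretability signal.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("Large language models are hard to interpret.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, each shaped (batch, heads, seq_len, seq_len)
last_layer = outputs.attentions[-1][0]       # final layer, first example
avg_attention = last_layer.mean(dim=0)       # average over heads -> (seq_len, seq_len)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Which token does each position attend to most strongly?
for i, token in enumerate(tokens):
    top = avg_attention[i].argmax().item()
    print(f"{token:>12} attends most to {tokens[top]}")
```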
Safety and Ethics: How to prevent LLMs from being misused and ensure they comply with ethical standards.
- Safety check example:
```python
class SafetyChecker:
    def __init__(self):
        self.sensitive_words = {
            "Personal Info": ["ID card", "bank card", "password"],
            "Harmful Content": ["violence", "discrimination", "illegal"],
            "Security Risk": ["vulnerability", "attack", "injection"]
        }

    def check_content(self, text: str) -> dict:
        results = {category: [] for category in self.sensitive_words}
        for category, words in self.sensitive_words.items():
            for word in words:
                if word in text:
                    results[category].append(word)
        return results

    def is_safe(self, text: str) -> bool:
        results = self.check_content(text)
        return all(len(matches) == 0 for matches in results.values())

    def filter_sensitive(self, text: str) -> str:
        filtered_text = text
        for category, words in self.sensitive_words.items():
            for word in words:
                filtered_text = filtered_text.replace(word, "*" * len(word))
        return filtered_text

# Example
checker = SafetyChecker()
text = "This is a text containing bank card info, which may have security vulnerabilities."
print(f"Safety check: {checker.check_content(text)}")
print(f"Is safe: {checker.is_safe(text)}")
print(f"Filtered text: {checker.filter_sensitive(text)}")
```
5. Other Important Concepts
Parameters: The key indicator of LLM size.
- Parameter counting example:
```python
import torch

def count_parameters(model: torch.nn.Module) -> dict:
    total_params = 0
    trainable_params = 0
    non_trainable_params = 0

    for param in model.parameters():
        num_params = param.numel()
        total_params += num_params
        if param.requires_grad:
            trainable_params += num_params
        else:
            non_trainable_params += num_params

    return {
        "Total parameters": total_params,
        "Trainable parameters": trainable_params,
        "Non-trainable parameters": non_trainable_params,
        "Memory estimate (GB)": total_params * 4 / (1024**3)  # assumes 4 bytes per parameter (FP32)
    }
```
Knowledge Distillation: Training a smaller "student" model to mimic a larger "teacher" model, reducing deployment cost and inference latency.
- Distillation process example:
```python
import torch
import torch.nn as nn

class SimpleDistillation:
    def __init__(self, teacher_model, student_model, temperature=2.0):
        self.teacher = teacher_model
        self.student = student_model
        self.temperature = temperature

    def distillation_loss(self, student_logits, teacher_logits, labels=None, alpha=0.5):
        # Soft targets: match the teacher's softened output distribution
        soft_loss = nn.KLDivLoss(reduction='batchmean')(
            nn.functional.log_softmax(student_logits / self.temperature, dim=-1),
            nn.functional.softmax(teacher_logits / self.temperature, dim=-1)
        ) * (self.temperature ** 2)

        # Hard targets: standard cross-entropy against the true labels, if available
        if labels is not None:
            hard_loss = nn.CrossEntropyLoss()(student_logits, labels)
            return alpha * hard_loss + (1 - alpha) * soft_loss
        return soft_loss

    def train_step(self, inputs, optimizer, labels=None):
        with torch.no_grad():
            teacher_outputs = self.teacher(**inputs)
        student_outputs = self.student(**inputs)

        loss = self.distillation_loss(
            student_outputs.logits,
            teacher_outputs.logits,
            labels
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()
```
Quantization: Reducing model weight precision to decrease model size and memory usage, speeding up inference.
- Quantization example:
```python
import os
import torch

def quantize_model(model, quantization_type="dynamic"):
    if quantization_type == "dynamic":
        # Dynamic quantization: weights stored in int8, activations quantized at runtime
        quantized_model = torch.quantization.quantize_dynamic(
            model, {torch.nn.Linear}, dtype=torch.qint8
        )
        return quantized_model
    elif quantization_type == "static":
        # Static quantization: requires running calibration data through the
        # prepared model before conversion (calibration step omitted here)
        model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
        torch.quantization.prepare(model, inplace=True)
        torch.quantization.convert(model, inplace=True)
    return model

def compare_model_sizes(original_model, quantized_model):
    def get_size_mb(model):
        torch.save(model.state_dict(), "temp.pt")
        size_mb = os.path.getsize("temp.pt") / (1024 * 1024)
        os.remove("temp.pt")
        return size_mb

    original_size = get_size_mb(original_model)
    quantized_size = get_size_mb(quantized_model)
    print(f"Original model size: {original_size:.2f} MB")
    print(f"Quantized model size: {quantized_size:.2f} MB")
    print(f"Compression ratio: {original_size/quantized_size:.2f}x")
```
Base Model vs. Instruct Model:
- Base Model: Only pre-trained, good at text completion, but may not follow instructions well.
- Instruct Model: Fine-tuned with instructions and RLHF, better at understanding and following instructions.
- Model comparison example:
```python
def compare_models(base_model, instruct_model, prompts: list):
    # Assumes both models expose a simple `generate(prompt)` text interface
    results = []
    for prompt in prompts:
        base_output = base_model.generate(prompt)
        instruct_output = instruct_model.generate(f"Please execute the following instruction: {prompt}")
        results.append({
            "Prompt": prompt,
            "Base Model Output": base_output,
            "Instruct Model Output": instruct_output
        })
    return results

# Example (assumes `base_model` and `instruct_model` have already been loaded)
test_prompts = [
    "Explain what machine learning is",
    "Write a Python function to calculate the Fibonacci sequence",
    "Summarize the main content of this paragraph"
]
for result in compare_models(base_model, instruct_model, test_prompts):
    print(f"\nPrompt: {result['Prompt']}")
    print(f"Base Model: {result['Base Model Output']}")
    print(f"Instruct Model: {result['Instruct Model Output']}")
```
Getting started with LLMs: It is recommended to start by understanding the Transformer architecture and prompt engineering, combined with hands-on practice (e.g., using the OpenAI API or Google AI Studio). Also, pay attention to RAG and Agent application patterns, as they are key directions for current LLM applications.
6. Practical Resources
- Open Source Models:
- LLaMA 2
- Mistral
- ChatGLM
- Baichuan
- Development Frameworks:
- Hugging Face Transformers
- LangChain
- LlamaIndex
- FastChat
- Deployment Tools:
- ONNX Runtime
- TensorRT
- vLLM
- Text Generation Inference
- Learning Resources:
- Hugging Face Courses
- FastAI Courses
- DeepLearning.AI Courses
- Stanford CS324 LLM Course
With these resources, you can gradually deepen your understanding of LLM principles and applications, and start building your own AI applications. Remember, LLM technology is evolving rapidly, so maintaining a habit of learning and practice is crucial.