LLM Reference

About LLM Reference

The LLM Reference is a searchable code-snippet reference covering the most important APIs, patterns, and techniques in large language model application development. It is organized into eight categories — OpenAI, Anthropic, Prompting, Embeddings, Fine-tuning, RAG, Agents, and Evaluation — making it a practical companion for AI engineers building production LLM applications.

AI engineers, machine learning researchers, and backend developers building AI-powered applications use this reference when integrating models from OpenAI and Anthropic, designing prompt strategies, implementing retrieval-augmented generation pipelines, and setting up LangChain agents. The OpenAI section covers the full chat completions API including streaming with chunk iteration, function calling with tools schema, JSON mode for structured output, and the Vision API for multimodal image input. The Anthropic section mirrors this with the messages.create() API, streaming with the context manager pattern, tool use with input_schema, system prompts, and the Batches API.

The Embeddings section shows how to generate vectors with openai.embeddings.create(), compute cosine similarity with NumPy, use sentence-transformers with all-MiniLM-L6-v2, store and query vectors in ChromaDB, and build FAISS indexes for fast similarity search. The RAG section covers the full retrieval-augmented generation pipeline: document chunking with RecursiveCharacterTextSplitter, LangChain RetrievalQA, and hybrid BM25+vector search. The Agents section explains the ReAct pattern, LangChain tool definitions with @tool, and multi-agent architectures with Planner/Executor/Reviewer roles.
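The NumPy cosine-similarity step mentioned above can be sketched in a few lines. The vectors here are tiny stand-ins; real embeddings (e.g. from text-embedding-3-small) have 1536 dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```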

Key Features

  • OpenAI API: chat.completions.create, streaming chunks, function calling with tools schema, JSON mode, Vision with image_url
  • Anthropic API: messages.create, streaming context manager, tool use with input_schema, system prompts, Batches API
  • Prompt engineering: zero-shot, few-shot examples, chain-of-thought reasoning, persona assignment, structured output requests
  • Embeddings: OpenAI text-embedding-3-small, cosine similarity with NumPy, sentence-transformers, ChromaDB storage, FAISS index
  • Fine-tuning: OpenAI JSONL format and fine_tuning.jobs.create, LoRA with PEFT, SFT Trainer from trl, DPO training
  • RAG: RecursiveCharacterTextSplitter chunking, LangChain RetrievalQA, hybrid BM25+vector search with RRF
  • Agents: ReAct thought/action/observation loop, LangChain create_tool_calling_agent, @tool decorator, multi-agent patterns
  • Evaluation: LLM-as-Judge prompts, BLEU/ROUGE metrics, human evaluation framework, A/B testing, MMLU/HumanEval benchmarks
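The Reciprocal Rank Fusion (RRF) step named in the hybrid-search bullet can be sketched as follows. The function shape is an assumption, not a snippet from the guide; `k=60` is the constant commonly used with RRF.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc ids (e.g. one from BM25, one from
    vector search) into a single ranking via Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            # Each list contributes 1 / (k + rank); higher ranks contribute more.
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=lambda d: scores[d], reverse=True)
```

A document that appears near the top of both lists outranks one that is top-ranked in only one of them.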

Frequently Asked Questions

What does the LLM Reference cover?

It covers eight practical areas of LLM development: the OpenAI Chat Completions API (including streaming, function calling, and vision), the Anthropic Claude Messages API (including tool use and batches), prompt engineering techniques, vector embeddings and similarity search, fine-tuning workflows (OpenAI, LoRA, SFT, DPO), RAG pipeline construction, LangChain agent patterns, and model evaluation frameworks.

How do I enable streaming responses with the OpenAI API?

Pass stream=True to client.chat.completions.create(). The method then returns a stream iterator. Iterate over it and check chunk.choices[0].delta.content — it will be a string fragment or None. Print each fragment with end="" to display the response incrementally as it arrives.
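The chunk-iteration pattern can be sketched as a helper that consumes the iterator returned by `client.chat.completions.create(..., stream=True)`; the function name is illustrative.

```python
def collect_stream(stream) -> str:
    """Print and accumulate text from a chat-completions stream.

    `stream` is the iterator returned by
    client.chat.completions.create(model=..., messages=..., stream=True).
    """
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta is not None:  # role-only and final chunks carry no text
            print(delta, end="", flush=True)
            parts.append(delta)
    return "".join(parts)
```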

What is the difference between OpenAI function calling and Anthropic tool use?

Both allow the model to request calling an external function. In OpenAI, you pass a tools list with objects containing type: "function" and a function definition including name and parameters JSON schema. In Anthropic, you pass a tools list with name and input_schema fields. The model responds with a tool_use block instead of a text block when it decides to call a tool.
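As a sketch, here is the same hypothetical `get_weather` tool declared for each API — the tool itself is invented for illustration; the field layout follows each vendor's documented schema.

```python
# OpenAI: the JSON schema nests under a "function" object.
openai_tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Anthropic: the schema sits at the top level under "input_schema".
anthropic_tools = [{
    "name": "get_weather",
    "description": "Return current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]
```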

What is RAG and when should I use it?

RAG (Retrieval-Augmented Generation) combines a vector search over your own documents with an LLM to answer questions grounded in specific data. Use RAG when you need the model to reference private, up-to-date, or domain-specific knowledge that was not in its training data. The typical pipeline is: chunk documents, embed chunks, store in a vector DB, retrieve similar chunks at query time, and include them in the LLM prompt.
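That pipeline can be sketched end to end in a toy form. The `embed` function below is a bag-of-words stand-in for a real embedding model, and all names are illustrative — a production pipeline would use an embedding API and a vector DB as described above.

```python
import math
from collections import Counter

def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size chunks."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(text: str) -> Counter:
    """Toy embedding: word counts (a real system would call an embedding model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query; these would then be
    included in the LLM prompt as grounding context."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```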

What is LoRA fine-tuning and how does it differ from full fine-tuning?

LoRA (Low-Rank Adaptation) adds small trainable matrices to the model's weight layers instead of updating all parameters. This reduces GPU memory requirements by 10-100x compared to full fine-tuning while achieving comparable results. Use the peft library with LoraConfig(r=8, lora_alpha=32) to apply LoRA to a base model before training.
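The memory saving follows directly from the parameter counts: instead of updating a d_out × d_in matrix W, LoRA trains B (d_out × r) and A (r × d_in) and uses W + BA. A quick illustrative calculation for one 4096 × 4096 weight matrix:

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when updating the full weight matrix W."""
    return d_out * d_in

def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters for the low-rank factors B (d_out x r) and A (r x d_in)."""
    return d_out * r + r * d_in

d = 4096                      # illustrative hidden size
full = full_params(d, d)      # 16777216 params per layer
lora = lora_params(d, d, r=8) # 65536 params per layer, a 256x reduction
```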

What is chain-of-thought prompting?

Chain-of-thought (CoT) prompting asks the model to reason step by step before giving a final answer. Adding "Think step by step:" or "Let's think through this:" to the prompt significantly improves accuracy on multi-step reasoning tasks like math word problems and logical deduction. It works because it forces the model to generate intermediate reasoning tokens that guide the final answer.
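A minimal sketch of wrapping a question in a CoT trigger — the exact phrasing and output convention are illustrative, not prescribed by the guide:

```python
def cot_prompt(question: str) -> str:
    """Wrap a question with a chain-of-thought trigger and an answer marker,
    so the final answer can be parsed out after the reasoning."""
    return (
        f"Question: {question}\n"
        "Think step by step, then give the final answer on its own line, "
        "prefixed with 'Answer:'."
    )
```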

How do I use LLM-as-Judge for evaluation?

LLM-as-Judge uses a capable model (typically GPT-4 or Claude) to score the outputs of another model based on criteria like correctness, helpfulness, and safety. Write an evaluation prompt that presents the question, the model's response, and a scoring rubric, then have the judge model output a score. This is more scalable than human evaluation and more nuanced than automatic metrics like BLEU.
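A hedged sketch of building such a judge prompt — the rubric, the 1-5 scale, and the `Score:` output format are assumptions; the result would be sent to a strong model and the score line parsed from its reply:

```python
def judge_prompt(question: str, response: str) -> str:
    """Build an LLM-as-Judge prompt that presents the question, the response
    under evaluation, and a scoring rubric."""
    return "\n".join([
        "You are an impartial evaluator.",
        f"Question: {question}",
        f"Response: {response}",
        "Rate the response from 1 (poor) to 5 (excellent) on correctness,",
        "helpfulness, and safety. Reply with exactly one line: 'Score: <n>'.",
    ])
```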

What benchmarks are commonly used to evaluate LLMs?

MMLU (Massive Multitask Language Understanding) tests broad knowledge across 57 subjects. HumanEval measures code generation quality on Python programming problems. HellaSwag tests commonsense reasoning about everyday situations. TruthfulQA measures how well a model avoids generating false statements. These benchmarks are referenced in most LLM research papers and model cards.