
Systematic Guide to LLM Prompt Engineering

A comprehensive guide to prompt-crafting techniques for getting precise results from large language models (LLMs) such as ChatGPT and Claude. Written for developers, it covers everything from basic prompt structure to advanced patterns, with real-world examples.

prompt engineering, ChatGPT prompts, Claude prompts, few-shot prompting, chain-of-thought, system prompt, LLM usage, AI prompt writing, structured output, zero-shot, prompt debugging, LLM best practices

Problem

When using LLMs (ChatGPT, Claude, etc.), it is common to fail to get the desired results. Even with the same question, the quality of the output varies dramatically depending on how the prompt is written. Vague instructions produce irrelevant answers, overly long prompts miss the key point, and without specifying an output format, post-processing becomes difficult. To effectively leverage LLMs across various domains such as workflow automation, code generation, document writing, and data analysis, a systematic engineering approach to designing and iteratively refining prompts is essential.

Required Tools

Claude API

Anthropic's Claude model API. Supports system prompts, multi-turn conversations, structured output, and XML tag separation. Install with pip install anthropic.

OpenAI API

OpenAI model API including GPT-4o. Provides various output control features such as function calling, JSON mode, and structured outputs. Install with pip install openai.

JSON Formatter

A tool for validating and formatting JSON output generated by LLMs. Useful for verifying structured output.
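If you prefer not to install a separate tool, Python's built-in json.tool module works as a quick validator and pretty-printer from the command line (a minimal sketch; the sample JSON is illustrative):

```shell
# Validate and pretty-print JSON from an LLM response.
# Exits with a non-zero status if the input is not valid JSON.
echo '{"sentiment": "positive", "score": 5}' | python -m json.tool
```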

Markdown Editor

A tool for editing and previewing markdown-formatted prompt templates and LLM output.

Solution Steps

1. Prerequisites

To practice prompt engineering, prepare the following environment:

1. Python 3.9+ or Node.js 18+ installed
2. API keys issued:
   - Claude API: Create an API key at console.anthropic.com (usage-based billing)
   - OpenAI API: Create an API key at platform.openai.com (usage-based billing)
3. SDKs installed (see code below)

Cost reference (input/output per 1M tokens):

- Claude Sonnet 4.5: $3/$15 (recommended for practice, best value)
- Claude Opus 4.6: $5/$25 (highest quality)
- GPT-4o: $2.5/$10
- Claude Haiku 4.5: $1/$5 (for bulk processing)

1M tokens is approximately 750,000 words, and practice typically costs less than $1. This guide primarily uses the Claude API, but all techniques apply equally to the OpenAI API.

# Install Python SDKs
pip install anthropic openai

# Set API keys as environment variables (never hardcode in source!)
# Linux/Mac:
export ANTHROPIC_API_KEY="sk-ant-api03-..."
export OPENAI_API_KEY="sk-..."

# Windows PowerShell:
$env:ANTHROPIC_API_KEY="sk-ant-api03-..."
$env:OPENAI_API_KEY="sk-..."

# Initialize and test the Python client
import anthropic
client = anthropic.Anthropic()  # Automatically uses environment variable

message = client.messages.create(
    model="claude-sonnet-4-5-20250514",
    max_tokens=100,
    messages=[{"role": "user", "content": "Hello!"}]
)
print(message.content[0].text)
# If you see a response like "Hello! ...", setup is complete
2. Basic Prompt Structure: Role, Context, Instruction, Output Format

An effective prompt consists of four key elements:

1. Role: Defines the perspective from which the LLM should respond. Assigning a specific role like "You are a senior backend developer" makes the model respond with domain-specific expertise and tone. Without a role, the model answers in a generic assistant tone; specifying one changes the use of technical terminology, the depth of analysis, and the practical perspective.
2. Context: Explains the background information and situation. Specifying the current project's tech stack, target audience, constraints, etc. yields more relevant responses. Without enough context, the model has to make assumptions, which can steer the response in an unwanted direction.
3. Instruction: The specific task for the LLM to perform. Use clear verbs like "analyze," "compare," or "write code." One clear task per prompt is most effective.
4. Output Format: Specifies the structure of the result. Explicitly requesting JSON, tables, numbered lists, etc. makes post-processing easier. Without a specified format, the model chooses one freely, leading to inconsistency.

Zero-shot vs. few-shot:

- Zero-shot: Getting results with instructions alone, without examples. Suitable for simple tasks.
- Few-shot: Providing input/output examples so the model learns the pattern. Covered in detail in the next step.
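As a quick illustration of how the four elements compose, here is a minimal helper that assembles them into a single prompt string. The function name and markdown section headings are our own convention, not part of any API:

```python
def build_prompt(role: str, context: list[str], instruction: str,
                 output_format: list[str]) -> str:
    """Assemble the four key elements into one markdown-structured prompt."""
    lines = ["## Role", role, "", "## Context"]
    lines += [f"- {c}" for c in context]            # one bullet per context item
    lines += ["", "## Instruction", instruction, "", "## Output Format"]
    lines += [f"- {f}" for f in output_format]      # one bullet per format rule
    return "\n".join(lines)

# Usage: the assembled string is what you pass as the user message
prompt = build_prompt(
    role="You are a Python backend developer with 10 years of experience.",
    context=["FastAPI-based REST API server", "PostgreSQL database"],
    instruction="Write a JWT-based login API endpoint.",
    output_format=["Complete Python code (FastAPI)", "Docstrings for each function"],
)
print(prompt)
```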

# ============================================
# Bad Prompt vs Good Prompt Comparison
# ============================================
import anthropic

client = anthropic.Anthropic()

# [Bad Prompt] - Vague and unstructured
bad_prompt = "Write some Python code for me"
# Result: Might produce a vague "Hello World" level code

# [Good Prompt] - Includes all 4 elements
good_prompt = """
## Role
You are a Python backend developer with 10 years of experience.

## Context
- Developing a REST API server based on FastAPI
- Using a PostgreSQL database
- JWT authentication is required
- Python 3.12, Pydantic v2 environment

## Instruction
Write a JWT-based login API endpoint.
- Login with email and password
- Verify password with bcrypt
- Issue access token (1 hour) and refresh token (7 days)
- Define input/output types with Pydantic models
- Return appropriate HTTP status codes on failure

## Output Format
- Complete Python code (FastAPI)
- Include docstrings for each function
- Add curl test commands below the code
"""

# Result: Production-quality JWT login API code

# ============================================
# Executing via API
# ============================================
message = client.messages.create(
    model="claude-sonnet-4-5-20250514",
    max_tokens=2048,
    # Place the role in the system prompt (recommended)
    system="""You are a senior full-stack developer.
You always write production-quality code,
prioritizing error handling and type safety.
Start with the code directly without unnecessary preamble.""",
    messages=[
        {
            "role": "user",
            "content": """Write an API that meets the following requirements.

## Requirements
- User profile retrieval API (GET /users/{id})
- Return 404 for non-existent users
- Response in JSON format
- Exclude the password field from the response

## Output Format
Output only Python FastAPI code. Include explanations as comments."""
        }
    ]
)

print(message.content[0].text)

# ============================================
# How Role Assignment Affects Results (Practical Comparison)
# ============================================

# Without role: "Explain sorting algorithms"
# -> Textbook-style listing (bubble sort, quicksort, merge sort...)

# Interviewer role: "You are a coding interviewer at a FAANG company.
#   Explain sorting algorithms as you would to a candidate,
#   focusing on time complexity and practical trade-offs."
# -> "In practice, O(n log n) is usually sufficient.
#    For nearly sorted data, Tim Sort is...
#    With memory constraints, in-place sorting..."

# Junior mentor role: "You are a senior guiding a junior developer.
#   Explain sorting to someone learning it for the first time."
# -> Visual analogies, step-by-step examples, code execution results included
3. Few-shot Prompting: Teaching Patterns with Examples

Few-shot prompting is a core technique where you provide examples of the desired input/output so the LLM follows the pattern.

Why is it effective? The model infers patterns (format, length, tone, judgment criteria) from examples included in the prompt. Without examples, the model decides the format on its own, leading to inconsistency; with just 2-5 examples, it follows the pattern with remarkable accuracy.

Key principles for writing few-shot examples:

1. Diversity: Include various cases, such as positive/negative/neutral and short/long inputs
2. Edge cases: Include ambiguous cases, mixed sentiments, and exceptional situations
3. Consistency: All examples must follow the same output format
4. Representativeness: Use examples similar to your actual data

Two ways to implement few-shot in the Claude API:

1. List examples within the prompt (simple)
2. Provide user/assistant pairs as a multi-turn conversation (more accurate, better for enforcing format)

Practical use cases:

- Sentiment analysis: Review -> positive/negative/neutral + reasoning
- Data extraction: Unstructured text -> structured JSON
- Document classification: Email/document -> category
- Code conversion: Code in one language -> another language
- Style conversion: Formal -> informal, technical docs -> blog posts

# ============================================
# Few-shot Prompting Practical Examples
# ============================================
import anthropic
import json

client = anthropic.Anthropic()

# ============================================
# Method 1: Listing examples within the prompt (Sentiment Analysis)
# ============================================
few_shot_prompt = """Analyze the sentiment of the following customer reviews.

## Analysis Criteria
- Sentiment: Positive/Negative/Neutral/Mixed
- Score: 1-5 (1=very negative, 5=very positive)
- Key Keywords: Extract words/expressions that determine the sentiment
- Business Action: Action to take regarding this review

## Examples

Review: "I was amazed at how fast the delivery was. The packaging was thorough and the product quality is excellent."
Sentiment: Positive
Score: 5/5
Key Keywords: fast delivery, thorough packaging, excellent quality
Business Action: Reply to positive review + consider sending a repeat purchase coupon

Review: "It has been 2 weeks since I ordered and it still has not arrived. Customer service does not answer the phone either."
Sentiment: Negative
Score: 1/5
Key Keywords: delivery delay (2 weeks), unresponsive customer service
Business Action: Urgent - check delivery status and contact customer, offer compensation

Review: "I was disappointed that the color was very different from the photo, but the exchange process was very friendly."
Sentiment: Mixed
Score: 3/5
Key Keywords: color mismatch (negative), friendly exchange process (positive)
Business Action: Improve product photo accuracy + praise CS team

## Review to Analyze

Review: "The value for money is decent, but there was no manual so I did not know how to use it at first. I barely managed to set it up after watching YouTube."
"""

# ============================================
# Method 2: Multi-turn conversation for Few-shot (More Accurate)
# ============================================
# Provide user/assistant message pairs as examples in the Claude API
# -> The model follows the output format more precisely

message = client.messages.create(
    model="claude-sonnet-4-5-20250514",
    max_tokens=512,
    system="""You are a text sentiment analysis expert.
Always respond in the JSON format below only.""",
    messages=[
        # Few-shot example 1 (user -> assistant pair)
        {"role": "user", "content": "Review: 'Fast delivery and great quality'"},
        {"role": "assistant", "content": json.dumps({
            "sentiment": "positive", "score": 5,
            "keywords": ["fast delivery", "great quality"],
            "action": "Thank-you reply + encourage repeat purchase"
        })},

        # Few-shot example 2
        {"role": "user", "content": "Review: 'Defective product, had to get a refund'"},
        {"role": "assistant", "content": json.dumps({
            "sentiment": "negative", "score": 1,
            "keywords": ["defective product", "refund"],
            "action": "Quality control check + offer compensation"
        })},

        # Few-shot example 3 (edge case: mixed sentiment)
        {"role": "user", "content": "Review: 'The design is pretty but the durability is disappointing'"},
        {"role": "assistant", "content": json.dumps({
            "sentiment": "mixed", "score": 3,
            "keywords": ["pretty design", "low durability"],
            "action": "Review durability improvements + market design strengths"
        })},

        # Actual analysis request
        {"role": "user", "content": "Review: 'It was pricey so I hesitated, but the more I use it the more satisfied I am. I want to buy other colors too.'"},
    ]
)

result = json.loads(message.content[0].text)
print(json.dumps(result, indent=2))

# ============================================
# Practical Use: Unstructured Text -> Structured Data Extraction
# ============================================
extraction_prompt = """Extract information from the email below and return it as JSON.

## Example

Email: "Hi, this is John Smith. Could we have a meeting at Conference Room A next Tuesday (3/12) at 2pm? Attendees would be me, Sarah Johnson (team lead), and Mike Davis. I would like to discuss the project progress."
JSON:
{
  "sender": "John Smith",
  "type": "Meeting request",
  "datetime": "March 12, Tuesday, 2:00 PM",
  "location": "Conference Room A",
  "attendees": ["John Smith", "Sarah Johnson (team lead)", "Mike Davis"],
  "agenda": "Project progress discussion",
  "urgency": "Normal",
  "required_action": "Confirm schedule and reply with acceptance/decline"
}

## Email to Extract

Email: "This is Lisa from marketing. Please send the Q1 performance report by tomorrow morning. The CEO requested it urgently. Please send it as an Excel file."
JSON:
"""

message = client.messages.create(
    model="claude-sonnet-4-5-20250514",
    max_tokens=512,
    messages=[{"role": "user", "content": extraction_prompt}]
)
print(message.content[0].text)
4. Chain-of-Thought (CoT): Guiding Step-by-Step Reasoning

Chain-of-Thought (CoT) is a technique that guides the model to think step by step. It significantly improves accuracy on complex reasoning, math problems, code debugging, and business decision-making.

Why CoT is effective: LLMs are more accurate when they explicitly generate intermediate steps rather than jumping to a final answer. The intermediate reasoning process serves as the model's "working memory."

Three ways to apply CoT:

1. Simple trigger: Just add "Let's think step by step" (zero-shot CoT)
2. Structured step specification: Explicitly list the analysis sequence
3. Self-verification: Add "After answering, verify your own response"

Situations where CoT is especially effective:

- Math/logical reasoning problems
- Code debugging (tracing execution flow)
- Decision-making that considers multiple conditions simultaneously
- Pros/cons comparative analysis
- Legal/regulatory interpretation

Situations where CoT is unnecessary:

- Simple translation/summarization
- Fact lookup (capitals, population, etc.)
- Format conversion (JSON -> CSV, etc.)

# ============================================
# Chain-of-Thought Practical Examples
# ============================================
import anthropic

client = anthropic.Anthropic()

# ============================================
# Example 1: Applying CoT to Code Debugging
# ============================================
# Without CoT: "Find the bug in this code" -> may give a superficial answer
# With CoT: Trace execution flow line by line -> accurately identifies the cause

debug_prompt = """The following Python code has a bug.
Analyze it step by step following the analysis sequence below.

```python
def find_duplicates(lst):
    seen = set()
    duplicates = set()
    for item in lst:
        if item in seen:
            duplicates.add(item)
    return list(duplicates)

# Test
print(find_duplicates([1, 2, 3, 2, 4, 3, 5]))
# Expected result: [2, 3]
# Actual result: []
```

## Analysis Sequence (follow this order exactly)
1. Explain the intent of the code in one sentence
2. Trace the execution flow line by line with input [1, 2, 3, 2, 4, 3, 5]
   - Organize the state of seen and duplicates at each iteration in a table
3. Explain the exact location and cause of the bug
4. Present the corrected code
5. Verify the execution result with the same input after the fix
"""

message = client.messages.create(
    model="claude-sonnet-4-5-20250514",
    max_tokens=2048,
    messages=[{"role": "user", "content": debug_prompt}]
)
print(message.content[0].text)
# -> Accurately finds that the code is missing seen.add(item)

# ============================================
# Example 2: Applying CoT to Business Decision-Making
# ============================================
decision_prompt = """You are a CTO. You need to decide on the tech stack for the following situation.

## Situation
- Need to launch an online shopping mall MVP within 3 months
- Dev team: 2 full-stack (React experience), 1 backend (Python experience)
- Expected traffic: Initial DAU 1,000, DAU 50,000 after 1 year
- Budget: Server costs under $500/month

## Candidate Stacks
A: Next.js + Supabase (serverless)
B: React + Django + PostgreSQL (traditional)
C: React + NestJS + MongoDB (microservices)

## Analysis Sequence (step by step)
1. Organize pros/cons for each stack based on team capability/timeline/traffic/cost
2. Evaluate MVP launch feasibility within 3 months (high/medium/low)
3. Evaluate scalability to 50,000 DAU
4. Estimate monthly server costs
5. Analyze risk factors
6. Final recommendation + 1-line rationale
"""

# ============================================
# Example 3: Self-verification Pattern
# ============================================
# Guide the model to verify its own answer after generating it

self_verify_prompt = """Write a SQL query that meets the following requirements.

## Table Structure
- users (id, name, email, created_at, status)
- orders (id, user_id, amount, created_at, status)
- products (id, name, price, category)
- order_items (id, order_id, product_id, quantity)

## Requirements
Among users who signed up in 2024, find those with total order amounts
of $10,000 or more. Show name, email, total order count, and total order amount.
Sort by order amount descending and show only the top 10.

## Output Sequence
1. Write the SQL query
2. Self-verify the query from these perspectives:
   a. Are JOIN relationships correct (N:1, 1:N direction)?
   b. Is NULL handling needed in WHERE conditions?
   c. Are all required columns included in GROUP BY?
   d. Index recommendations from a performance perspective
3. If issues are found during verification, revise the query
4. Final query + expected execution plan explanation
"""

message = client.messages.create(
    model="claude-sonnet-4-5-20250514",
    max_tokens=2048,
    messages=[{"role": "user", "content": self_verify_prompt}]
)
print(message.content[0].text)
5. System Prompt Design Patterns (4 Practical Patterns)

System prompts define the LLM's overall behavior, persona, and constraints. They take higher priority than user messages, ensuring consistent behavior. In Claude, set the system prompt via the system parameter; in OpenAI, use a role: "system" message.

Four patterns commonly used in practice:

1. Persona Pattern: Define the area of expertise, experience, and communication style
   - Use: Maintain consistent tone and expertise
   - Example: Code reviewer, technical writer, data analyst
2. Gatekeeper Pattern: Specify allowed/prohibited behaviors
   - Use: Set safety boundaries for chatbots, restrict topics
   - Example: Customer support chatbot declining competitor comparison requests
3. Template Pattern: Force the output structure
   - Use: Guarantee consistently formatted output
   - Example: Always output code review results in the same table format
4. Meta Prompt Pattern: Guide clarifying questions for ambiguous requests
   - Use: Gather sufficient information before starting work
   - Example: When asked "Build me a website," first confirm target users, features, etc.

Tips for Claude:

- Using XML tags (<rules>, <expertise>, etc.) helps the model clearly distinguish each section
- Longer system prompts increase token costs, so keep them concise, with just the essentials
- When system prompt rules conflict with user messages, the system prompt takes priority

# ============================================
# System Prompt Practical Patterns (4 Types)
# ============================================
import anthropic

client = anthropic.Anthropic()

# ============================================
# [Pattern 1] Persona Pattern - Code Review Expert
# ============================================
persona_system = """You are a senior software engineer with 15 years of experience.

<expertise>
- Python, TypeScript, Go specialist
- Distributed systems, microservice architecture
- OWASP Top 10 security vulnerability analysis
- Performance optimization and code quality management
</expertise>

<communication_style>
- Point out code issues precisely, but always provide improved code alongside
- Constructive feedback, not criticism (mention what is done well too)
- For security issues, specify severity and CWE number
- Use the tone "Here is how it could be improved"
</communication_style>

<constraints>
- Always assume a production environment
- Do not use deprecated APIs
- Include comments in the code
</constraints>"""

# ============================================
# [Pattern 2] Gatekeeper Pattern - Customer Support Chatbot
# ============================================
gatekeeper_system = """You are a customer support chatbot for 'SmartHome' company.

<rules>
- Only answer questions about SmartHome products
- Competitor product comparison requests: "I'm sorry, I cannot compare other products, but I can tell you about our product's strengths."
- Personal information (SSN, credit card numbers, etc.): Never request
- Technical questions you cannot answer: "I will check with the technical team and respond via email. Could you provide your email address?"
- Refunds/exchanges: Provide basic policy info then recommend contacting the customer center
</rules>

<product_info>
- Smart Bulb (SH-100): WiFi connected, 16 million colors, voice control
- Smart Plug (SH-200): Power monitoring, timer, remote control
- Smart Hub (SH-300): Zigbee/Z-Wave support, connects up to 100 devices
</product_info>

<tone>Friendly and professional. Short, clear sentences.</tone>"""

# ============================================
# [Pattern 3] Template Pattern - Enforce Consistent Output Format
# ============================================
template_system = """You are a code reviewer.
All code reviews must follow the format below exactly. Do not change the format.

<output_format>
## Code Review Results

### Summary
(1-2 sentence overall assessment)

### Positive Aspects
- (with specific code locations)

### Areas for Improvement
| Line | Severity | Issue | Improved Code |
|------|----------|-------|---------------|
| (number) | (Critical/Major/Minor) | (description) | `improved code` |

### Security Checklist
- [ ] SQL injection prevention
- [ ] XSS prevention
- [ ] Authentication/authorization verification
- [ ] Sensitive information exposure

### Total Score: X/10
</output_format>"""

# Using the Template Pattern
message = client.messages.create(
    model="claude-sonnet-4-5-20250514",
    max_tokens=2048,
    system=template_system,
    messages=[{
        "role": "user",
        "content": """Please review the following code:

```python
@app.route('/login', methods=['POST'])
def login():
    username = request.form['username']
    password = request.form['password']
    query = f"SELECT * FROM users WHERE username='{username}' AND password='{password}'"
    user = db.execute(query).fetchone()
    if user:
        session['user_id'] = user['id']
        return redirect('/dashboard')
    return 'Login failed', 401
```"""
    }]
)
print(message.content[0].text)
# -> Always flags SQL injection, plaintext password, etc. in the same table format

# ============================================
# [Pattern 4] Meta Prompt - Guide Clarifying Questions
# ============================================
meta_system = """You are a requirements analysis expert.

When the user makes an ambiguous request, do not start working immediately.
First ask questions to clarify the following:

1. Target users (Who will use it?)
2. Usage environment (Web? Mobile? API? CLI?)
3. Scale and performance (concurrent users, response time, data volume)
4. Constraints (tech stack, budget, deadline, compatibility)
5. Success criteria (What outcome defines completion?)

Start working only after gathering sufficient information.
Do not make assumptions about uncertain points - always confirm."""

# Test: Sending a vague request returns clarifying questions
message = client.messages.create(
    model="claude-sonnet-4-5-20250514",
    max_tokens=512,
    system=meta_system,
    messages=[{
        "role": "user",
        "content": "Build me a website"
    }]
)
print(message.content[0].text)
# -> "Let me clarify a few things about the website you need..." + 5 questions
6. Structured Output: JSON Schema Enforcement and Parsing

To process LLM output programmatically, structured formats are essential. Just saying "output as JSON" is unreliable: you need to specify the schema, validate the parse, and retry on failure.

Three ways to get reliable JSON output from Claude:

1. Specify a JSON schema in the prompt and instruct "output only JSON"
2. Prefill technique: Guide the assistant response to start with "{"
3. Prompt + parsing validation + retry logic

In OpenAI:

- response_format={"type": "json_object"} (JSON mode)
- response_format={"type": "json_schema", "json_schema": {...}} (Structured Outputs)

Practical tips:

- Clearly distinguish required and optional fields (specify required)
- Specify data types for each field (string, number, array, boolean)
- When there are enum values, list the allowed values (e.g., "status": "active" | "inactive")
- Including example JSON in the output format description greatly increases compliance
- Always add retry logic that catches parsing failures

# ============================================
# Structured JSON Output Practical Patterns
# ============================================
import anthropic
import json
from typing import Optional

client = anthropic.Anthropic()

# ============================================
# Method 1: Specify JSON Schema in the Prompt
# ============================================
json_prompt = """Analyze the following product description and output in JSON format.

## Product Description
"Galaxy S25 Ultra, 256GB, Titanium Black color.
6.9-inch Dynamic AMOLED display,
Snapdragon 8 Elite processor, 5000mAh battery.
AI camera, Galaxy AI features included. Launch price $1,299.99."

## JSON Schema (follow this structure exactly)
{
  "product_name": "string (product name)",
  "brand": "string (brand name)",
  "category": "string (smartphone | tablet | laptop | wearable)",
  "specs": {
    "storage": "string",
    "display": "string",
    "processor": "string",
    "battery": "string",
    "color": "string"
  },
  "price": {
    "amount": "number (numeric integer)",
    "currency": "string"
  },
  "features": ["string (key features, up to 5)"],
  "ai_features": ["string (AI-related features only)"]
}

Output only pure JSON without code blocks."""

# ============================================
# Method 2: Prefill Technique (Claude-specific)
# ============================================
# Force the assistant message to start with "{" -> must start with JSON

message = client.messages.create(
    model="claude-sonnet-4-5-20250514",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": json_prompt},
        # Prefill: guide assistant response to start with "{"
        {"role": "assistant", "content": "{"},
    ]
)
# Combine "{" + rest of JSON for parsing
raw_json = "{" + message.content[0].text
result = json.loads(raw_json)
print(json.dumps(result, indent=2))

# ============================================
# Method 3: Safe Parsing + Retry Pattern (Production)
# ============================================
def get_structured_output(
    prompt: str,
    system: str = "",
    max_retries: int = 3,
    model: str = "claude-sonnet-4-5-20250514",
) -> Optional[dict]:
    """Safely parse JSON output with retry on failure"""

    for attempt in range(max_retries):
        try:
            message = client.messages.create(
                model=model,
                max_tokens=2048,
                system=system + "\n\nAlways output only valid JSON. Do not wrap in code blocks (```).",
                messages=[
                    {"role": "user", "content": prompt},
                    {"role": "assistant", "content": "{"},
                ]
            )

            raw = "{" + message.content[0].text

            # Clean up if JSON code block is included
            if "```" in raw:
                import re
                match = re.search(r'```(?:json)?\s*({.*?})\s*```', raw, re.DOTALL)
                if match:
                    raw = match.group(1)

            result = json.loads(raw)
            return result

        except json.JSONDecodeError as e:
            print(f"[Attempt {attempt + 1}/{max_retries}] JSON parse failed: {e}")
            if attempt == max_retries - 1:
                print(f"Final failure. Raw response: {raw[:200]}...")
                return None
        except Exception as e:
            print(f"[Attempt {attempt + 1}/{max_retries}] API error: {e}")
            if attempt == max_retries - 1:
                return None

    return None

# Usage example
result = get_structured_output(
    prompt="Compare Python, JavaScript, Go, and Rust and organize as JSON. Include type_system, main_use, learning_curve(easy/medium/hard), speed(fast/medium/slow) fields for each language.",
    system="You are a programming language expert."
)
if result:
    print(json.dumps(result, indent=2))

# ============================================
# OpenAI Structured Outputs (Reference)
# ============================================
from openai import OpenAI

openai_client = OpenAI()

response = openai_client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Always output only valid JSON."},
        {"role": "user", "content": "Compare Python vs TypeScript as JSON. Include name, typing, ecosystem, performance fields."}
    ]
)
parsed = json.loads(response.choices[0].message.content)
print(json.dumps(parsed, indent=2))
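The stricter Structured Outputs variant pins the response to an explicit JSON Schema instead of merely guaranteeing valid JSON. A hedged sketch follows; the schema name and fields are illustrative, and the API call is guarded by a key check so the schema itself can be inspected without one:

```python
import json
import os

# JSON Schema for a single-language comparison (field names are illustrative)
comparison_schema = {
    "name": "language_comparison",
    "strict": True,  # strict mode: the model may not add or omit fields
    "schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "typing": {"type": "string", "enum": ["static", "dynamic", "gradual"]},
            "performance": {"type": "string"},
        },
        "required": ["name", "typing", "performance"],
        "additionalProperties": False,
    },
}

# Only call the API when a key is configured in the environment
if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI

    response = OpenAI().chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_schema", "json_schema": comparison_schema},
        messages=[{"role": "user", "content": "Describe Python as JSON."}],
    )
    # With strict schema enforcement, this parse should not fail
    print(json.loads(response.choices[0].message.content))
```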
7. Prompt Debugging, A/B Testing, and Chaining Strategies

Prompt engineering is not about writing the perfect prompt in one go; it is about iteratively testing and improving. Five key strategies:

1. A/B Testing: Compare different prompts with the same input
   - Record results for each prompt version and quantify quality
   - Ideally, build an automated testing pipeline
2. Negative Instructions: "Do not..."
   - "Code only, without explanations"; "If you do not know, say 'no information available' instead of guessing"
   - Explicitly block unwanted output
3. Temperature Adjustment: Tune based on the nature of the task
   - 0.0: Data extraction, classification, JSON generation (consistency matters)
   - 0.3-0.5: Code generation, document writing (slight creativity)
   - 0.7-1.0: Brainstorming, marketing copy, creative writing (diversity matters)
4. Prompt Chaining: Split complex tasks into stages
   - Use step 1 output as step 2 input
   - Each stage's quality can be verified independently
   - Multiple short prompts are more effective than one long prompt
5. Claude Prefill Technique: Control how the output starts
   - Placing the beginning of the response in the assistant message forces generation in that format
   - Can enforce JSON, XML, code blocks in a specific language, etc.

Prompt version management:

- Version-control prompts like code (Git, JSONL logs)
- Record input/output/success rate for each version
- Regression testing (ensure previously working cases do not break)
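Prompt chaining is easy to sketch without committing to a specific API: each step's output feeds the next step's input. In the minimal sketch below, the model call is injected as a function, so in production it would wrap client.messages.create, while the chain logic itself stays testable with a stub. The function names and the {INPUT} placeholder convention are our own:

```python
from typing import Callable

def run_chain(steps: list[str], call_model: Callable[[str], str],
              initial_input: str) -> str:
    """Run prompt templates in sequence.

    '{INPUT}' in each template is replaced by the previous step's
    output (or by initial_input for the first step).
    """
    current = initial_input
    for template in steps:
        current = call_model(template.replace("{INPUT}", current))
    return current

# Example chain: summarize -> translate -> reformat
steps = [
    "Summarize in one sentence: {INPUT}",
    "Translate to formal English: {INPUT}",
    "Rewrite as a bullet list: {INPUT}",
]

# Here a stub stands in for the real API call
result = run_chain(
    steps,
    call_model=lambda prompt: f"[model output for: {prompt[:30]}...]",
    initial_input="raw meeting notes",
)
print(result)
```

Because each stage is a separate call, intermediate outputs can be logged and validated individually, which is exactly what makes chaining easier to debug than one long prompt.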

# ============================================
# Prompt Debugging & Iterative Improvement Code
# ============================================
import anthropic
import json
from datetime import datetime

client = anthropic.Anthropic()

# ============================================
# [Strategy 1] Prompt A/B Testing Framework
# ============================================
def ab_test_prompt(
    versions: dict[str, dict],  # {"v1": {"system": ..., "user": ...}, "v2": ...}
    test_inputs: list[str],
    model: str = "claude-sonnet-4-5-20250514",
) -> list[dict]:
    """Compare multiple prompt versions with the same inputs"""
    results = []

    for input_text in test_inputs:
        for version_name, prompts in versions.items():
            message = client.messages.create(
                model=model,
                max_tokens=1024,
                temperature=0,  # Fix temperature for comparison
                system=prompts.get("system", ""),
                messages=[{
                    "role": "user",
                    "content": prompts["user"].replace("{INPUT}", input_text)
                }]
            )

            results.append({
                "version": version_name,
                "input": input_text[:50],
                "output": message.content[0].text[:200],
                "tokens": message.usage.output_tokens,
                "timestamp": datetime.now().isoformat(),
            })

    return results

# Execute A/B test
versions = {
    "v1_simple": {
        "system": "You are a summarization AI.",
        "user": "Summarize the following: {INPUT}",
    },
    "v2_structured": {
        "system": """You are a news summarization expert.
<rules>
- Summarize key points in 3 sentences or fewer
- Include 5W1H
- Exclude subjective interpretation
- Preserve numbers and proper nouns exactly
</rules>""",
        "user": """Summarize the following article.

<article>{INPUT}</article>

<format>
## Key Summary
(3 sentences)
## Keywords
(up to 5)
</format>""",
    },
}

results = ab_test_prompt(
    versions=versions,
    test_inputs=[
        "Samsung Electronics recorded Q1 2025 revenue of $60 billion. The semiconductor division grew 40% year-over-year, driving the results.",
    ],
)
for r in results:
    print(f"[{r['version']}] Tokens: {r['tokens']}")
    print(f"  Output: {r['output'][:100]}...\n")

# ============================================
# [Strategy 2] Remove Unwanted Output with Negative Instructions
# ============================================
clean_system = """You are a code generation expert.

<do>
- Include comments in the code
- Include error handling
- Include type hints
</do>

<do_not>
- No preamble/conclusion like "Sure!", "Here is..." before/after code
- Do not use deprecated APIs
- Do not use security-vulnerable patterns (f-string SQL, eval, etc.)
- No unnecessary imports
</do_not>"""

# ============================================
# [Strategy 3] Prefill for Output Format Control (Claude)
# ============================================
# Force JSON output
message = client.messages.create(
    model="claude-sonnet-4-5-20250514",
    max_tokens=512,
    messages=[
        {"role": "user", "content": "Tell me the characteristics of the 4 seasons as a JSON array. Include season, temp_range, features fields for each item."},
        {"role": "assistant", "content": "["},  # Guide to start as array
    ]
)
json_result = "[" + message.content[0].text
print(json.loads(json_result))

# Force code block
message = client.messages.create(
    model="claude-sonnet-4-5-20250514",
    max_tokens=512,
    messages=[
        {"role": "user", "content": "Write a Python function that calculates Fibonacci numbers."},
        {"role": "assistant", "content": "def fibonacci(n: int) -> int:\n"},
    ]
)
code = "def fibonacci(n: int) -> int:\n" + message.content[0].text
print(code)

# ============================================
# [Strategy 4] Prompt Chaining (Multi-stage Complex Analysis)
# ============================================
def chained_code_review(code: str) -> dict:
    """Perform code review using 3-stage prompt chaining"""

    # Stage 1: Bug detection
    step1 = client.messages.create(
        model="claude-sonnet-4-5-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": f"""
List only the bugs and logic errors in the following code as a numbered list.
Each item: line number, issue description, severity (Critical/Major/Minor)

```
{code}
```"""}]
    )
    bugs = step1.content[0].text

    # Stage 2: Security vulnerability analysis
    step2 = client.messages.create(
        model="claude-sonnet-4-5-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": f"""
Analyze security vulnerabilities in the following code based on OWASP Top 10.
Each item: CWE number, vulnerability name, location, attack scenario, fix method

```
{code}
```"""}]
    )
    security = step2.content[0].text

    # Stage 3: Generate comprehensive review report
    step3 = client.messages.create(
        model="claude-sonnet-4-5-20250514",
        max_tokens=2048,
        messages=[{"role": "user", "content": f"""
Synthesize the analysis results below into a code review report.

## Original Code
```
{code}
```

## Bug Analysis
{bugs}

## Security Analysis
{security}

## Output Format
1. Total score (X/10)
2. Complete corrected code
3. Summary of changes (table)
"""}]
    )

    return {
        "bugs": bugs,
        "security": security,
        "report": step3.content[0].text,
    }

# ============================================
# [Strategy 5] Temperature Comparison Experiment
# ============================================
for temp in [0.0, 0.5, 1.0]:
    message = client.messages.create(
        model="claude-sonnet-4-5-20250514",
        max_tokens=200,
        temperature=temp,
        messages=[{"role": "user", "content": "Write a haiku about 'artificial intelligence'."}]
    )
    print(f"Temperature {temp}: {message.content[0].text}")
# temp=0.0: nearly identical haiku on each run
# temp=0.5: slightly different haiku each time
# temp=1.0: noticeably different haiku on each run
8

Operations/Expansion Tips: Prompt Library and Production Patterns

Operational patterns and scaling strategies for applying prompt engineering in production.

Building a prompt library:
- Modularize frequently used prompts for reuse
- Manage system prompts, few-shot examples, and output formats as separate files
- Wrapping prompt builders in Python functions or classes lets the entire team use consistent prompts

Cost optimization:
- Simple tasks (classification, extraction): process with Haiku ($1/$5)
- Complex reasoning/code generation: use Sonnet ($3/$15)
- Use Opus ($5/$25) only when the highest quality is needed
- Optimize prompt length: remove unnecessary examples, keep it concise
- Caching: cache results for identical inputs to save API calls

Production checklist:
- JSON output needs parsing validation plus retry logic
- Rate-limit handling (exponential backoff on 429 errors)
- Sensitive-information filtering (never include personal data in prompts)
- Response-time monitoring and timeout settings
- Record performance metrics for each prompt version
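The rate-limit and caching items on the checklist above can be sketched as one small wrapper. This is a minimal illustration, not the SDK's own mechanism: `RateLimitError` here is a local stand-in for the SDK's 429 exception (the Anthropic Python SDK raises `anthropic.RateLimitError`), `call` is whatever function actually hits the API, and the retry counts and delays are illustrative defaults.

```python
import hashlib
import time

class RateLimitError(Exception):
    """Local stand-in for the SDK's 429 error (e.g. anthropic.RateLimitError)."""

_cache: dict[str, str] = {}

def cached_call(prompt: str, call, max_retries: int = 4, base_delay: float = 1.0) -> str:
    """Serve identical prompts from an in-memory cache; otherwise call the
    API with exponential backoff (1s, 2s, 4s, ...) on rate-limit errors."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in _cache:
        return _cache[key]  # identical input: skip the API call entirely
    for attempt in range(max_retries):
        try:
            result = call(prompt)  # call() wraps the actual API request
            _cache[key] = result
            return result
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # give up after the last retry
            time.sleep(base_delay * (2 ** attempt))
```

Keeping the cache keyed on a hash of the full prompt means any change to the prompt text invalidates it automatically.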

# ============================================
# Prompt Library - Reusable Builder
# ============================================
import anthropic
import json
from dataclasses import dataclass
from typing import Optional

client = anthropic.Anthropic()

@dataclass
class PromptTemplate:
    """Reusable prompt template"""
    role: str = ""
    context: str = ""
    instruction: str = ""
    output_format: str = ""
    examples: Optional[list] = None
    constraints: Optional[list] = None

    def build(self) -> str:
        """Generate a structured prompt with XML tags"""
        parts = []
        if self.role:
            parts.append(f"<role>\n{self.role}\n</role>")
        if self.context:
            parts.append(f"<context>\n{self.context}\n</context>")
        if self.examples:
            examples_text = "\n\n".join(
                f"Input: {ex['input']}\nOutput: {ex['output']}"
                for ex in self.examples
            )
            parts.append(f"<examples>\n{examples_text}\n</examples>")
        if self.constraints:
            rules = "\n".join(f"- {c}" for c in self.constraints)
            parts.append(f"<constraints>\n{rules}\n</constraints>")
        parts.append(f"<instruction>\n{self.instruction}\n</instruction>")
        if self.output_format:
            parts.append(f"<output_format>\n{self.output_format}\n</output_format>")
        return "\n\n".join(parts)

# ============================================
# Pre-defined Prompt Library
# ============================================
PROMPTS = {
    "code_review": PromptTemplate(
        role="Senior security engineer with 10 years of experience",
        constraints=[
            "Analyze based on OWASP Top 10",
            "Specify CWE number for each vulnerability",
            "Always include corrected code examples",
            "Flag deprecated API usage",
        ],
        output_format='JSON: {"score": number, "issues": [{"severity": "critical|major|minor", "line": number, "description": string, "fix": string, "cwe": string}]}',
    ),
    "summarize": PromptTemplate(
        role="News editor",
        constraints=[
            "3 sentences or fewer",
            "Include 5W1H",
            "Exclude subjective interpretation",
            "Preserve numbers and proper nouns exactly",
        ],
        output_format="## Key Summary\n(3 sentences)\n\n## Keywords\n(up to 5)",
    ),
    "translate_tech": PromptTemplate(
        role="Technical documentation translation expert",
        constraints=[
            "Keep technical terms with original language in parentheses (e.g., async)",
            "Do not translate code variable names",
            "Use natural sentence structure",
            "Maintain the original tone (formal/informal)",
        ],
    ),
}

# Usage example
review_prompt = PROMPTS["code_review"]
review_prompt.instruction = "Analyze the security vulnerabilities in the following code."
review_prompt.context = "Authentication module of a Python FastAPI web application"

full_prompt = review_prompt.build() + "\n\n```python\n# code here...```"

message = client.messages.create(
    model="claude-sonnet-4-5-20250514",
    max_tokens=2048,
    temperature=0,
    messages=[{"role": "user", "content": full_prompt}]
)
print(message.content[0].text)

# ============================================
# Model Routing by Task (Cost Optimization)
# ============================================
def smart_route(task_type: str, prompt: str) -> str:
    """Automatically select the optimal model based on task type"""
    model_map = {
        "classify":  "claude-haiku-4-5-20250514",   # Classification: fast and cheap
        "extract":   "claude-haiku-4-5-20250514",   # Data extraction: fast and cheap
        "summarize": "claude-sonnet-4-5-20250514",  # Summarization: mid-range
        "code":      "claude-sonnet-4-5-20250514",  # Code generation: mid-range
        "analyze":   "claude-sonnet-4-5-20250514",  # Analysis: mid-range
        "creative":  "claude-opus-4-6-20250514",    # Creative: highest quality
        "complex":   "claude-opus-4-6-20250514",    # Complex reasoning: highest quality
    }

    model = model_map.get(task_type, "claude-sonnet-4-5-20250514")

    message = client.messages.create(
        model=model,
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}]
    )

    return message.content[0].text

# Usage example
result = smart_route("classify", "Is this email spam? 'Get 50% off with this coupon!'")
# -> Processed quickly with Haiku (cost: 1/3 of Sonnet)

Core Code

The core of prompt engineering:
1. Structure prompts with role/context/instruction/output format
2. Teach patterns with few-shot examples
3. Control consistency with temperature
4. Ensure reliability with output validation
Modularize these 4 steps into a prompt builder function for reuse.

# ============================================
# Prompt Engineering Core Summary - Practical Workflow
# ============================================
import anthropic
import json

client = anthropic.Anthropic()

# 1. Structured prompt (role + context + instruction + output format)
system = """You are a senior developer.
<rules>
- Write production-quality code
- Prioritize error handling and type safety
- Include comments in the code
</rules>"""

# 2. Few-shot to teach output pattern
messages = [
    {"role": "user", "content": "Function: list sum"},
    {"role": "assistant", "content": "def sum_list(nums: list[int]) -> int:\n    return sum(nums)"},
    {"role": "user", "content": "Function: reverse string"},  # Actual request
]

# 3. Temperature = 0 (consistent code generation)
message = client.messages.create(
    model="claude-sonnet-4-5-20250514",
    max_tokens=1024,
    temperature=0,
    system=system,
    messages=messages,
)

# 4. Validate result (parse check for JSON, execution test for code)
print(message.content[0].text)

Common Mistakes

Prompt is too vague, failing to get the desired result

Instead of "write good code," specify concrete requirements. Include the programming language, framework, input/output format, error handling approach, and coding style. Give specific instructions like "Write a JWT login API using FastAPI, use bcrypt hashing, and define input/output with Pydantic v2 models."

Requesting too many tasks in a single prompt, causing quality degradation

Split complex tasks with prompt chaining: extract data in step 1, analyze in step 2, generate the report in step 3. Validating the output of each step before feeding it into the next improves overall quality significantly.

Using LLM output without verification, causing errors

Verify JSON output is parseable with json.loads() and add retry logic on failure. Test code output by actually running it. Important facts (statistics, dates, legal information) must be cross-checked against original sources. Hallucination is an inherent limitation of LLMs.
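The parse-and-retry pattern described above can be a few lines; this is a minimal sketch in which `call` stands in for the actual API request. On a parse failure, the error message is appended to the prompt and the model is asked again.

```python
import json

def json_with_retry(prompt: str, call, max_attempts: int = 3):
    """Validate the model's reply with json.loads(); on failure, re-ask
    with the parse error appended so the model can correct itself."""
    current = prompt
    for _ in range(max_attempts):
        raw = call(current)  # call() wraps the actual API request
        try:
            return json.loads(raw)
        except json.JSONDecodeError as e:
            current = (
                prompt
                + f"\n\nYour previous reply was not valid JSON ({e}). "
                "Output only valid JSON, nothing else."
            )
    raise ValueError(f"no valid JSON after {max_attempts} attempts")
```

Raising after the final attempt (rather than returning None) forces the caller to handle persistent failures explicitly.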

Writing a new prompt from scratch every time, leading to inefficiency and inconsistency

Create a prompt builder function to manage system prompts, few-shot examples, and output formats as reusable modules. Building prompt version management (Git) and an A/B testing pipeline improves the entire team's prompt quality.
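Version logging of the kind suggested above can be as simple as appending JSONL records per run; the field names here are illustrative, not a fixed schema.

```python
import json
from datetime import datetime

def log_prompt_run(path: str, version: str, prompt: str, output: str) -> None:
    """Append one prompt run as a JSON line, so prompt versions can be
    diffed and regression-tested later."""
    record = {
        "ts": datetime.now().isoformat(),
        "version": version,
        "prompt": prompt,
        "output": output,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```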

Not considering temperature and model selection, causing unnecessary costs

Separate models and temperature by task. Classification/extraction: Haiku (temp=0), code generation/analysis: Sonnet (temp=0-0.3), creative/brainstorming: Opus (temp=0.7-1.0). Set max_tokens to match the expected output length to reduce unnecessary costs.
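One way to encode this guidance is a single lookup table mapping task type to (model, temperature, max_tokens). The model IDs mirror the ones used elsewhere in this guide; the exact temperature and token values are illustrative starting points, not benchmarked optima.

```python
# task -> (model, temperature, max_tokens); values mirror the guidance above
TASK_CONFIG: dict[str, tuple[str, float, int]] = {
    "classify": ("claude-haiku-4-5-20250514", 0.0, 256),
    "extract":  ("claude-haiku-4-5-20250514", 0.0, 512),
    "code":     ("claude-sonnet-4-5-20250514", 0.2, 2048),
    "creative": ("claude-opus-4-6-20250514", 0.8, 1024),
}

def config_for(task: str) -> tuple[str, float, int]:
    """Fall back to a mid-range, deterministic default for unknown tasks."""
    return TASK_CONFIG.get(task, ("claude-sonnet-4-5-20250514", 0.0, 1024))
```

Centralizing these choices in one table makes cost tuning a one-line change instead of a hunt through call sites.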

Related liminfo Services