Model Selection & Evaluation
Understand model tradeoffs, read benchmarks critically, and design multi-model routing strategies.
Stage 1: Understanding Model Tradeoffs
You’ll know this when… you can explain the key differences between model families (Claude, GPT, Gemini, Llama) and confidently choose the right model for a given task based on capability, cost, and speed.
Key Concepts
- Model families and their strengths: Claude (reasoning, safety), GPT (ecosystem), Gemini (multimodal, context), Llama (open weights, self-hosted)
- Model tiers within a family: Opus vs. Sonnet vs. Haiku — when to use each
- The cost-quality-speed triangle: you can usually optimize for two of the three, and the third pays for it
- Context window sizes and why they matter (and when they don’t)
- Multimodal capabilities: text, images, code, audio, video
- When “good enough” beats “best” — most tasks don’t need the flagship model
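To make the cost corner of the triangle concrete, here is a minimal sketch of a per-call cost estimate from token counts. The tier names, the `ASSUMED_PRICES` table, and the `estimateCost` helper are all hypothetical, and the prices are placeholders rather than real list prices.

```javascript
// Illustrative prices in USD per million tokens. These are placeholders,
// NOT real list prices; substitute current figures from your provider.
const ASSUMED_PRICES = {
  flagship: { inputPerMTok: 10, outputPerMTok: 50 },
  small: { inputPerMTok: 1, outputPerMTok: 5 },
};

// Cost of one call (or one batch) given token usage and a price entry.
function estimateCost(usage, price) {
  const inputCost = (usage.inputTokens / 1_000_000) * price.inputPerMTok;
  const outputCost = (usage.outputTokens / 1_000_000) * price.outputPerMTok;
  return inputCost + outputCost;
}

// Example: the same workload priced on both assumed tiers.
const workload = { inputTokens: 2_000_000, outputTokens: 1_000_000 };
const flagshipCost = estimateCost(workload, ASSUMED_PRICES.flagship);
const smallCost = estimateCost(workload, ASSUMED_PRICES.small);
```

Output tokens are priced separately from input tokens here because providers commonly charge more for them, so a task that generates long answers shifts the comparison.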
Recommended Resources
Practice Project
Benchmark your own pipeline. Run your Intelligence Hub’s evaluation step (agent/evaluate.js) with three different models: Claude Opus, Claude Sonnet, and Claude Haiku. Use the same 10 input articles for each. Compare: scoring consistency, quality of so_what analysis, why_useful clarity, API cost, and response time. Write up which model you’d recommend for daily runs vs. deep analysis.
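One possible shape for the comparison harness is sketched below. It assumes you supply a `callModel(model, article)` function that wraps your real API call and resolves to `{ score, inputTokens, outputTokens }`; that function, the model labels, and the helper names are hypothetical and not taken from agent/evaluate.js.

```javascript
// Placeholder labels, not real API model IDs.
const MODELS = ["opus", "sonnet", "haiku"];

function mean(xs) {
  return xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Runs every article through every model and collects scores, token
// usage, and average wall-clock latency per article.
async function benchmarkModels(articles, callModel, models = MODELS) {
  const results = {};
  for (const model of models) {
    const scores = [];
    let inputTokens = 0;
    let outputTokens = 0;
    const start = Date.now();
    for (const article of articles) {
      const res = await callModel(model, article);
      scores.push(res.score);
      inputTokens += res.inputTokens;
      outputTokens += res.outputTokens;
    }
    const elapsedMs = Date.now() - start;
    results[model] = {
      scores,
      meanScore: mean(scores),
      inputTokens,
      outputTokens,
      avgLatencyMs: elapsedMs / Math.max(articles.length, 1),
    };
  }
  return results;
}

// Scoring consistency: for each article, the gap between the highest
// and lowest score any model gave it. Smaller gaps mean more agreement.
function scoreSpread(results) {
  const perModel = Object.values(results).map((r) => r.scores);
  if (perModel.length === 0) return [];
  return perModel[0].map((_, i) => {
    const column = perModel.map((scores) => scores[i]);
    return Math.max(...column) - Math.min(...column);
  });
}
```

The token totals can be turned into API cost with your provider's current prices. The quality of so_what and why_useful still needs a human read, since the harness only captures the numeric parts of the comparison.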
Stage 2: Benchmarks & Evaluation — Reading the Fine Print
You’ll know this when… you can critically evaluate model benchmarks, understand what they actually measure, and avoid being misled by cherry-picked results.
Key Concepts
- Common benchmarks: MMLU, HumanEval, GPQA, MATH, ARC — what each tests
- Why benchmark scores don’t tell the whole story (contamination, overfitting, task mismatch)
- Evaluation for YOUR use case: building task-specific test sets
- Human evaluation vs. automated evaluation (LLM-as-judge)
- Elo ratings and arena-style comparisons (Chatbot Arena)
- The difference between “performs well on benchmarks” and “works for my workflow”
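To see how pairwise votes become a ranking, here is a minimal sketch of the classic online Elo update. The K-factor of 32 and the function names are assumptions for illustration, and public arena leaderboards may fit ratings with a different statistical method.

```javascript
// Probability that A beats B under the Elo model.
function expectedScore(ratingA, ratingB) {
  return 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
}

// outcomeA is 1 if A wins, 0 if A loses, 0.5 for a tie.
// k controls how far a single comparison moves the ratings.
function updateElo(ratingA, ratingB, outcomeA, k = 32) {
  const delta = k * (outcomeA - expectedScore(ratingA, ratingB));
  return { ratingA: ratingA + delta, ratingB: ratingB - delta };
}

// Two models start level; a human voter prefers model A's answer.
const afterVote = updateElo(1000, 1000, 1);
// afterVote is { ratingA: 1016, ratingB: 984 }
```

An upset moves ratings more than an expected result, which is why a single vote tells you little and a large number of votes is needed before the ordering stabilizes.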
Recommended Resources
Practice Project
Build a briefing quality scorecard. Create a rubric for evaluating Intelligence Hub briefing quality: relevance accuracy (does a score-8 item deserve an 8?), summary clarity, actionability of so_what, beginner-friendliness of why_useful. Score 3 real briefings. Use this rubric to compare model outputs from Stage 1 more rigorously. This is your own custom benchmark.
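The rubric can be captured as a small weighted scorecard. The sketch below assumes each criterion is rated 1 to 5 by a reviewer; the weights, the scale, and the names `RUBRIC` and `scoreBriefing` are hypothetical choices you should adjust.

```javascript
// Relative weights per criterion (assumed values; they need not sum to 1).
const RUBRIC = {
  relevanceAccuracy: 4,
  summaryClarity: 2,
  actionability: 2,
  beginnerFriendliness: 2,
};

// Returns the weighted average rating for one briefing on the 1-5 scale.
function scoreBriefing(ratings, rubric = RUBRIC) {
  let total = 0;
  let weightSum = 0;
  for (const [criterion, weight] of Object.entries(rubric)) {
    const rating = ratings[criterion];
    if (typeof rating !== "number" || rating < 1 || rating > 5) {
      throw new RangeError(`rating for ${criterion} must be a number from 1 to 5`);
    }
    total += rating * weight;
    weightSum += weight;
  }
  return total / weightSum;
}

// One reviewer's ratings for a single briefing.
const example = scoreBriefing({
  relevanceAccuracy: 5,
  summaryClarity: 3,
  actionability: 4,
  beginnerFriendliness: 2,
});
```

Scoring the same briefings from each model with this function gives one number per model to compare, while the per-criterion ratings show where a model falls short.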
Stage 3: Multi-Model Routing — Right Model for the Job
You’ll know this when… you can design and implement a system that automatically routes different tasks to different models based on complexity, cost, and quality requirements.
Key Concepts
- Why one model doesn’t fit all: cheap models for simple tasks, expensive models for hard ones
- Routing strategies: rule-based (task type), classifier-based (complexity scoring), cascading (try cheap first, escalate if uncertain)
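- Cost optimization: routing the simple majority of calls (say 80%) to Haiku and reserving Opus for the hard remainder (say 20%) can cut costs substantially, because the flagship tier costs several times more per token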
- Fallback chains: if Model A fails or times out, try Model B
- A/B testing models in production
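- Caching: store responses keyed by model and prompt so a repeated request costs nothing; model output is not guaranteed to be identical across calls, so cache only where reusing an earlier answer is acceptable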
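Fallback chains and caching from the list above can be combined in a few lines. This is a sketch under assumptions: `callModel(model, prompt)` is a hypothetical function wrapping your real API client, the 30-second timeout is arbitrary, and the in-memory `Map` stands in for whatever cache store you actually use.

```javascript
const responseCache = new Map();

// Rejects if the promise has not settled within ms milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Tries each model in order; moves to the next on error or timeout.
async function withFallback(models, prompt, callModel, timeoutMs = 30000) {
  let lastError;
  for (const model of models) {
    try {
      const text = await withTimeout(callModel(model, prompt), timeoutMs);
      return { model, text };
    } catch (err) {
      lastError = err; // remember why this model failed, then try the next
    }
  }
  throw lastError ?? new Error("no models configured");
}

// Serves a stored response when the same model chain and prompt repeat.
async function cachedCall(models, prompt, callModel, timeoutMs = 30000) {
  const key = JSON.stringify([models, prompt]);
  if (responseCache.has(key)) {
    return { ...responseCache.get(key), cached: true };
  }
  const result = await withFallback(models, prompt, callModel, timeoutMs);
  responseCache.set(key, result);
  return { ...result, cached: false };
}
```

A call such as `cachedCall(["primary", "backup"], prompt, callModel)` returns which model answered and whether the response came from the cache, which is useful to log when you later compare cost and reliability. Failures are not cached, so a later identical request retries the full chain.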
Recommended Resources
- Anthropic — Model Selection Guide
- Anthropic — Extended Thinking for Complex Tasks
- Martian — LLM Routing
Practice Project
Add model routing to your agent. Modify agent/evaluate.js to use a two-tier strategy: (1) Send all items to Claude Haiku for initial scoring (fast, cheap). (2) Only send items that Haiku scores 5+ to Claude Opus for deep analysis (so_what, why_useful). Compare total cost and quality against sending everything to Opus. Log the cost savings in agent-log.json.
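The two-tier strategy could be sketched as follows. The functions `scoreWithCheapModel` and `analyzeWithStrongModel` are hypothetical wrappers around your Haiku and Opus calls, and the per-call costs are relative units chosen for illustration, not real prices; none of this reflects the actual contents of agent/evaluate.js.

```javascript
// Assumed relative cost per call (illustrative units, NOT real prices).
const ASSUMED_COST_UNITS = { cheap: 1, deep: 30 };

// Scores every item with the cheap model, then escalates only items at
// or above the threshold to the strong model for deep analysis.
async function twoTierEvaluate(
  items,
  scoreWithCheapModel,
  analyzeWithStrongModel,
  { threshold = 5, cost = ASSUMED_COST_UNITS } = {}
) {
  const results = [];
  let escalated = 0;
  for (const item of items) {
    const score = await scoreWithCheapModel(item);
    if (score >= threshold) {
      // Expected to resolve to an object like { so_what, why_useful }.
      const analysis = await analyzeWithStrongModel(item);
      escalated += 1;
      results.push({ item, score, ...analysis });
    } else {
      results.push({ item, score });
    }
  }
  // Compare against the baseline of sending every item to the strong model.
  const routedCost = items.length * cost.cheap + escalated * cost.deep;
  const baselineCost = items.length * cost.deep;
  return {
    results,
    escalated,
    routedCost,
    baselineCost,
    savings: baselineCost - routedCost,
  };
}
```

The returned `routedCost`, `baselineCost`, and `savings` fields are the figures to record in agent-log.json. For real numbers, replace the assumed units with costs computed from the token usage each API response reports.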