Model Selection & Evaluation — Learning Track

Stage 1: Understanding Model Tradeoffs

You’ll know this when… you can explain the key differences between model families (Claude, GPT, Gemini, Llama) and confidently choose the right model for a given task based on capability, cost, and speed.

Key Concepts

Model families and their strengths: Claude (reasoning, safety), GPT (ecosystem), Gemini (multimodal, context), Llama (open-weight, self-hostable)
Model tiers within a family (for Claude: Opus vs. Sonnet vs. Haiku) and when to use each
The cost-quality-speed triangle: you can usually optimize for two of the three, at the expense of the third
Context window sizes and why they matter (and when they don’t)
Multimodal capabilities: text, images, code, audio, video
When “good enough” beats “best” — most tasks don’t need the flagship model
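To make the cost corner of the triangle concrete, here is a minimal sketch of a per-call cost estimate. The tier names and every price in PRICES_PER_MILLION_TOKENS are placeholders invented for illustration, not real pricing; substitute your provider's published rates before drawing conclusions.

```javascript
// Hypothetical prices in USD per million tokens. Placeholders only:
// replace them with the rates your provider actually publishes.
const PRICES_PER_MILLION_TOKENS = {
  flagship: { input: 15, output: 75 },
  mid: { input: 3, output: 15 },
  small: { input: 1, output: 5 },
};

// Estimate the cost of a single call from its token counts.
function estimateCallCost(tier, inputTokens, outputTokens) {
  const price = PRICES_PER_MILLION_TOKENS[tier];
  if (!price) throw new Error(`Unknown tier: ${tier}`);
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}

// Compare tiers on the same workload: 2,000 input and 500 output tokens.
for (const tier of Object.keys(PRICES_PER_MILLION_TOKENS)) {
  const cost = estimateCallCost(tier, 2000, 500);
  console.log(`${tier}: $${cost.toFixed(4)} per call`);
}
```

Multiplying the per-call figure by your daily call volume gives a quick sense of whether the flagship tier is worth it for a given task.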

Recommended Resources

Practice Project

Benchmark your own pipeline. Run your Intelligence Hub’s evaluation step (agent/evaluate.js) with three different models: Claude Opus, Claude Sonnet, and Claude Haiku. Use the same 10 input articles for each. Compare: scoring consistency, quality of so_what analysis, why_useful clarity, API cost, and response time. Write up which model you’d recommend for daily runs vs. deep analysis.
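One possible shape for the comparison harness is sketched below. It assumes a function callModel(model, article) that you supply and that resolves to { score, costUsd }; that function, its return shape, and the model labels are all hypothetical, since the internals of agent/evaluate.js are not shown here. Qualitative checks (so_what quality, why_useful clarity) still need a human read.

```javascript
// Run the same articles through each model and summarize score, cost, latency.
// `callModel(model, article)` is a hypothetical function you supply; it is
// assumed to resolve to { score, costUsd } for one article.
async function benchmarkModels(models, articles, callModel) {
  if (articles.length === 0) throw new Error('Need at least one article');
  const summary = [];
  for (const model of models) {
    const scores = [];
    let totalCostUsd = 0;
    const started = Date.now();
    for (const article of articles) {
      const result = await callModel(model, article);
      scores.push(result.score);
      totalCostUsd += result.costUsd;
    }
    const elapsedMs = Date.now() - started;
    summary.push({
      model,
      scores,
      meanScore: scores.reduce((a, b) => a + b, 0) / scores.length,
      totalCostUsd,
      avgLatencyMs: elapsedMs / articles.length,
    });
  }
  return summary;
}

// Mean absolute difference between two models' scores on the same articles:
// a rough measure of how much the two models disagree.
function meanAbsoluteDifference(scoresA, scoresB) {
  if (scoresA.length !== scoresB.length || scoresA.length === 0) {
    throw new Error('Score lists must be non-empty and the same length');
  }
  const total = scoresA.reduce((sum, s, i) => sum + Math.abs(s - scoresB[i]), 0);
  return total / scoresA.length;
}

// Stand-in for a real API call so the sketch runs offline.
async function fakeCallModel(model, article) {
  return { score: article.length % 10, costUsd: 0.001 };
}

benchmarkModels(['opus', 'sonnet', 'haiku'], ['article one', 'article two'], fakeCallModel)
  .then((summary) => console.log(summary));
```

Swapping fakeCallModel for a wrapper around your real evaluation call is the only change needed to run this against live models.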

Stage 2: Benchmarks & Evaluation — Reading the Fine Print

You’ll know this when… you can critically evaluate model benchmarks, understand what they actually measure, and avoid being misled by cherry-picked results.

Key Concepts

Common benchmarks: MMLU, HumanEval, GPQA, MATH, ARC — what each tests
Why benchmark scores don’t tell the whole story (contamination, overfitting, task mismatch)
Evaluation for YOUR use case: building task-specific test sets
Human evaluation vs. automated evaluation (LLM-as-judge)
Elo ratings and arena-style comparisons (Chatbot Arena)
The difference between “performs well on benchmarks” and “works for my workflow”
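The rating idea behind arena-style comparisons can be sketched with the classic Elo update. This is the textbook formula, not necessarily what any particular leaderboard runs (some fit a different statistical model to the votes), and the K-factor of 32 is an assumed conventional default.

```javascript
// Expected score of A against B under the Elo model (between 0 and 1).
function expectedScore(ratingA, ratingB) {
  return 1 / (1 + 10 ** ((ratingB - ratingA) / 400));
}

// Update both ratings after one head-to-head comparison.
// outcomeA: 1 if A's answer was preferred, 0 if B's was, 0.5 for a tie.
// k (assumed default 32) controls how far a single vote moves the ratings.
function updateElo(ratingA, ratingB, outcomeA, k = 32) {
  const delta = k * (outcomeA - expectedScore(ratingA, ratingB));
  return { ratingA: ratingA + delta, ratingB: ratingB - delta };
}

// Two models start level and model A wins one vote.
console.log(updateElo(1000, 1000, 1)); // { ratingA: 1016, ratingB: 984 }
```

An upset (a low-rated model beating a high-rated one) moves the ratings further than an expected win, which is why rankings settle only after many votes.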

Recommended Resources

Practice Project

Build a briefing quality scorecard. Create a rubric for evaluating Intelligence Hub briefing quality: relevance accuracy (does a score-8 item deserve an 8?), summary clarity, actionability of so_what, beginner-friendliness of why_useful. Score 3 real briefings. Use this rubric to compare model outputs from Stage 1 more rigorously. This is your own custom benchmark.
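A rubric like this can be turned into a single number with a weighted sum. The sketch below uses the four criteria named above; the key names, the weights, and the 1-to-5 rating scale are assumptions chosen for illustration, so adjust them to your own priorities.

```javascript
// Criteria come from the project description. The key names and weights are
// assumptions for illustration; the weights are chosen to sum to 1.
const RUBRIC = [
  { key: 'relevanceAccuracy', weight: 0.4 },
  { key: 'summaryClarity', weight: 0.2 },
  { key: 'soWhatActionability', weight: 0.2 },
  { key: 'whyUsefulBeginnerFriendliness', weight: 0.2 },
];

// Combine per-criterion ratings (assumed 1-5 scale) into one weighted score.
function scoreBriefing(ratings) {
  let total = 0;
  for (const { key, weight } of RUBRIC) {
    const rating = ratings[key];
    if (typeof rating !== 'number' || rating < 1 || rating > 5) {
      throw new Error(`Missing or out-of-range rating for ${key}`);
    }
    total += rating * weight;
  }
  return total;
}

const example = {
  relevanceAccuracy: 4,
  summaryClarity: 3,
  soWhatActionability: 5,
  whyUsefulBeginnerFriendliness: 4,
};
console.log(scoreBriefing(example).toFixed(2)); // "4.00"
```

Scoring each model's briefings with the same rubric and weights keeps the Stage 1 comparison consistent from one run to the next.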

Stage 3: Multi-Model Routing — Right Model for the Job

You’ll know this when… you can design and implement a system that automatically routes different tasks to different models based on complexity, cost, and quality requirements.

Key Concepts

Why one model doesn’t fit all: cheap models for simple tasks, expensive models for hard ones
Routing strategies: rule-based (task type), classifier-based (complexity scoring), cascading (try cheap first, escalate if uncertain)
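Cost optimization: using Haiku for 80% of calls and Opus for 20% can cut costs substantially, because flagship-tier calls are typically priced several times higher per token than small-tier calls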
Fallback chains: if Model A fails or times out, try Model B
A/B testing models in production
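Caching: store the response to a prompt and reuse it when the identical prompt comes in again, instead of paying for a repeat call (model outputs are not guaranteed to be identical across calls, so a cache also makes results repeatable)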
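Two of the ideas above, fallback chains and caching, can be sketched as small wrappers. Both assume a function callModel(model, prompt) that you supply; that function, the 10-second default timeout, and the in-memory Map cache are assumptions for illustration rather than a recommended production setup.

```javascript
// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Fallback chain: try each model in order and return the first success.
// `callModel(model, prompt)` is a hypothetical function you supply.
async function callWithFallback(models, prompt, callModel, timeoutMs = 10_000) {
  let lastError;
  for (const model of models) {
    try {
      const text = await withTimeout(callModel(model, prompt), timeoutMs);
      return { model, text };
    } catch (err) {
      lastError = err; // remember the failure and move on to the next model
    }
  }
  throw new Error(`All models failed: ${lastError?.message}`);
}

// Cache: wrap a model-calling function so a repeated (model, prompt) pair
// reuses the stored response. Failed calls are not cached.
function withCache(callModel) {
  const cache = new Map();
  return async (model, prompt) => {
    const key = JSON.stringify([model, prompt]);
    if (!cache.has(key)) cache.set(key, await callModel(model, prompt));
    return cache.get(key);
  };
}
```

The two compose: passing withCache(callModel) into callWithFallback gives a chain in which repeated prompts skip the API entirely. A cache that must survive restarts would need persistent storage, which this sketch leaves out.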

Recommended Resources

Practice Project

Add model routing to your agent. Modify agent/evaluate.js to use a two-tier strategy: (1) Send all items to Claude Haiku for initial scoring (fast, cheap). (2) Only send items that Haiku scores 5+ to Claude Opus for deep analysis (so_what, why_useful). Compare total cost and quality against sending everything to Opus. Log the cost savings in agent-log.json.
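The two-tier strategy could look roughly like the sketch below. The functions scoreWithCheapModel and analyzeWithStrongModel are hypothetical stand-ins for your Haiku and Opus calls, and their return shapes (including costUsd) are assumptions. Writing to agent-log.json is left out because the log's schema is not shown here.

```javascript
// Two-tier evaluation: a cheap model scores every item, and only items at or
// above the threshold are sent to the expensive model for deep analysis.
// Both model functions are hypothetical and supplied by you:
//   scoreWithCheapModel(item)    -> { score, costUsd }
//   analyzeWithStrongModel(item) -> { so_what, why_useful, costUsd }
async function evaluateTwoTier(items, scoreWithCheapModel, analyzeWithStrongModel, threshold = 5) {
  const results = [];
  let totalCostUsd = 0;
  let escalated = 0;
  for (const item of items) {
    const { score, costUsd } = await scoreWithCheapModel(item);
    totalCostUsd += costUsd;
    const result = { item, score, analysis: null };
    if (score >= threshold) {
      const deep = await analyzeWithStrongModel(item);
      totalCostUsd += deep.costUsd;
      result.analysis = { so_what: deep.so_what, why_useful: deep.why_useful };
      escalated += 1;
    }
    results.push(result);
  }
  return { results, totalCostUsd, escalated };
}

// Offline demo with stand-in functions and made-up costs.
const demoItems = [{ title: 'A', fakeScore: 3 }, { title: 'B', fakeScore: 7 }];
const demoCheap = async (item) => ({ score: item.fakeScore, costUsd: 0.001 });
const demoStrong = async () => ({ so_what: '...', why_useful: '...', costUsd: 0.05 });

evaluateTwoTier(demoItems, demoCheap, demoStrong).then(({ totalCostUsd, escalated }) => {
  console.log(`Escalated ${escalated} of ${demoItems.length}, cost $${totalCostUsd.toFixed(3)}`);
});
```

To measure the savings, run the same items through the strong model alone and compare its total cost against totalCostUsd from the two-tier run.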