Model Selection & Evaluation

Understand model tradeoffs, read benchmarks critically, and design multi-model routing strategies.

beginner · 3 stages · 8 hours
Claude API · Node.js

Stage 1: Understanding Model Tradeoffs

You’ll know this when… you can explain the key differences between model families (Claude, GPT, Gemini, Llama) and confidently choose the right model for a given task based on capability, cost, and speed.

Key Concepts

  • Model families and their commonly cited strengths: Claude (reasoning, safety), GPT (ecosystem), Gemini (multimodal, context), Llama (open-weight, self-hosted)
  • Model tiers within a family: Opus vs. Sonnet vs. Haiku — when to use each
  • The cost-quality-speed triangle: you can usually optimize for two of the three, at the expense of the third
  • Context window sizes and why they matter (and when they don’t)
  • Multimodal capabilities: text, images, code, audio, video
  • When “good enough” beats “best” — most tasks don’t need the flagship model
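
The cost corner of the triangle is plain arithmetic once you know the token counts. The sketch below is illustrative only: the tier labels and the per-million-token prices are placeholder assumptions, not real pricing, so substitute the current figures from your provider's pricing page.

```javascript
// Placeholder prices in USD per million tokens. These are NOT real prices;
// replace them with the figures from your provider's pricing page.
const HYPOTHETICAL_PRICES = {
  small: { input: 1, output: 2 },
  medium: { input: 4, output: 8 },
  large: { input: 20, output: 40 },
};

// Estimate the cost in USD of one call from its input and output token counts.
function estimateCost(tier, inputTokens, outputTokens, prices = HYPOTHETICAL_PRICES) {
  const price = prices[tier];
  if (!price) throw new Error(`Unknown tier: ${tier}`);
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}

// Same workload on every tier: 10 articles at roughly 2,000 input
// and 500 output tokens each (assumed sizes).
for (const tier of Object.keys(HYPOTHETICAL_PRICES)) {
  const total = 10 * estimateCost(tier, 2000, 500);
  console.log(`${tier}: $${total.toFixed(4)} for 10 articles`);
}
```

Running the same workload through each tier makes the price gap concrete, which is the starting point for deciding whether the flagship model's extra quality is worth paying for on a given task.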

Practice Project

Benchmark your own pipeline. Run your Intelligence Hub’s evaluation step (agent/evaluate.js) with three different models: Claude Opus, Claude Sonnet, and Claude Haiku. Use the same 10 input articles for each. Compare: scoring consistency, quality of so_what analysis, why_useful clarity, API cost, and response time. Write up which model you’d recommend for daily runs vs. deep analysis.
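
One possible shape for the comparison harness is sketched below. It assumes a `callModel(model, article)` function that you write yourself around your API client and that resolves to an object with a numeric `score`; that function, the model labels, and the result fields are all hypothetical and not part of any SDK.

```javascript
// Run the same articles through several models, collecting scores and wall-clock time.
// `callModel(model, article)` is a hypothetical function you supply; it should
// resolve to an object like { score: 7 }.
async function compareModels(articles, models, callModel) {
  const results = [];
  for (const model of models) {
    const scores = [];
    const started = Date.now();
    for (const article of articles) {
      const { score } = await callModel(model, article);
      scores.push(score);
    }
    results.push({
      model,
      scores,
      meanScore: scores.reduce((a, b) => a + b, 0) / scores.length,
      totalMs: Date.now() - started,
    });
  }
  return results;
}

// Mean absolute difference between two models' scores on the same articles.
// A rough consistency measure: 0 means the two models always agree.
function meanAbsDifference(scoresA, scoresB) {
  if (scoresA.length !== scoresB.length) throw new Error("Score lists must align");
  const total = scoresA.reduce((sum, s, i) => sum + Math.abs(s - scoresB[i]), 0);
  return total / scoresA.length;
}
```

Calls run sequentially here so that the timings are comparable across models; parallel requests would finish sooner but blur the per-model response time.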


Stage 2: Benchmarks & Evaluation — Reading the Fine Print

You’ll know this when… you can critically evaluate model benchmarks, understand what they actually measure, and avoid being misled by cherry-picked results.

Key Concepts

  • Common benchmarks: MMLU, HumanEval, GPQA, MATH, ARC — what each tests
  • Why benchmark scores don’t tell the whole story (contamination, overfitting, task mismatch)
  • Evaluation for YOUR use case: building task-specific test sets
  • Human evaluation vs. automated evaluation (LLM-as-judge)
  • Elo ratings and arena-style comparisons (Chatbot Arena)
  • The difference between “performs well on benchmarks” and “works for my workflow”
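
To make the Elo bullet concrete, here is the classic Elo update applied to one head-to-head comparison between two models. This is a minimal sketch of the textbook formula; arena-style leaderboards may fit their ratings with different statistical methods, and the K-factor of 32 is an assumed, commonly used default.

```javascript
// Expected score of A against B under the Elo model (a value between 0 and 1).
function expectedScore(ratingA, ratingB) {
  return 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
}

// Update both ratings after one comparison.
// outcomeA: 1 if A's answer was preferred, 0 if B's was, 0.5 for a tie.
// k controls how far a single result moves a rating (32 is assumed here).
function updateElo(ratingA, ratingB, outcomeA, k = 32) {
  const expectedA = expectedScore(ratingA, ratingB);
  const delta = k * (outcomeA - expectedA);
  return { ratingA: ratingA + delta, ratingB: ratingB - delta };
}

console.log(updateElo(1000, 1000, 1)); // { ratingA: 1016, ratingB: 984 }
```

With equal starting ratings the expected score is 0.5, so a win moves each rating by half of K. An upset against a much higher-rated model moves the ratings further than an expected win does.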

Practice Project

Build a briefing quality scorecard. Create a rubric for evaluating Intelligence Hub briefing quality: relevance accuracy (does a score-8 item deserve an 8?), summary clarity, actionability of so_what, beginner-friendliness of why_useful. Score 3 real briefings. Use this rubric to compare model outputs from Stage 1 more rigorously. This is your own custom benchmark.
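
A scorecard like this can be kept as data plus one small scoring function. In the sketch below the four criteria come from the rubric described above, while the key names, the 1 to 5 rating scale, and the weights are illustrative assumptions to tune for your own briefings.

```javascript
// Criteria from the rubric; weights are illustrative assumptions.
const RUBRIC = [
  { key: "relevanceAccuracy", weight: 4 },
  { key: "summaryClarity", weight: 2 },
  { key: "soWhatActionability", weight: 2 },
  { key: "whyUsefulBeginnerFriendliness", weight: 2 },
];

// Combine per-criterion ratings (1 to 5) into one weighted average on the same scale.
function scoreBriefing(ratings, rubric = RUBRIC) {
  let weightedSum = 0;
  let weightTotal = 0;
  for (const { key, weight } of rubric) {
    const rating = ratings[key];
    if (typeof rating !== "number" || rating < 1 || rating > 5) {
      throw new Error(`Rating for ${key} must be a number from 1 to 5`);
    }
    weightedSum += rating * weight;
    weightTotal += weight;
  }
  return weightedSum / weightTotal;
}

// Example: a briefing with accurate relevance scores but a weak so_what.
console.log(
  scoreBriefing({
    relevanceAccuracy: 5,
    summaryClarity: 4,
    soWhatActionability: 2,
    whyUsefulBeginnerFriendliness: 4,
  })
); // 4
```

Scoring the same briefings from each model with one fixed rubric is what turns the Stage 1 impressions into numbers you can compare.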


Stage 3: Multi-Model Routing — Right Model for the Job

You’ll know this when… you can design and implement a system that automatically routes different tasks to different models based on complexity, cost, and quality requirements.

Key Concepts

  • Why one model doesn’t fit all: cheap models for simple tasks, expensive models for hard ones
  • Routing strategies: rule-based (task type), classifier-based (complexity scoring), cascading (try cheap first, escalate if uncertain)
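  • Cost optimization: routing roughly 80% of calls to Haiku and only the hardest 20% to Opus can cut costs substantially, because the smaller model is priced at a fraction of the larger one per token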
  • Fallback chains: if Model A fails or times out, try Model B
  • A/B testing models in production
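  • Caching: an identical prompt doesn’t need a second API call, so store the first response and reuse it (model output is not guaranteed to be identical across calls, so a cache also pins one answer)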
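
The fallback-chain and caching ideas above can be combined in one small wrapper. This is a sketch under stated assumptions: `callModel(model, prompt)` is a hypothetical function you supply around your API client, the in-memory `Map` stands in for whatever cache you actually use, and the 10-second timeout is an arbitrary default.

```javascript
// Reject if the promise does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Build a `complete(prompt)` function that tries each model in order and
// serves repeated prompts from the cache.
// `callModel(model, prompt)` is a hypothetical function you supply.
function createRouter(callModel, models, { timeoutMs = 10000, cache = new Map() } = {}) {
  return async function complete(prompt) {
    // The model order is fixed per router, so the prompt alone is the cache key.
    if (cache.has(prompt)) return { ...cache.get(prompt), cached: true };

    const errors = [];
    for (const model of models) {
      try {
        const text = await withTimeout(callModel(model, prompt), timeoutMs);
        const result = { model, text };
        cache.set(prompt, result);
        return { ...result, cached: false };
      } catch (err) {
        // Record the failure and fall through to the next model in the chain.
        errors.push(`${model}: ${err.message}`);
      }
    }
    throw new Error(`All models failed: ${errors.join("; ")}`);
  };
}
```

Only successful responses are cached, so a transient failure on the first model does not get replayed on later calls. Note that the timeout stops waiting for a slow call but does not cancel the underlying request.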

Practice Project

Add model routing to your agent. Modify agent/evaluate.js to use a two-tier strategy: (1) Send all items to Claude Haiku for initial scoring (fast, cheap). (2) Only send items that Haiku scores 5+ to Claude Opus for deep analysis (so_what, why_useful). Compare total cost and quality against sending everything to Opus. Log the cost savings in agent-log.json.
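
A minimal sketch of the two-tier flow is shown below. It does not reflect the actual contents of agent/evaluate.js: `scoreWithCheap` and `analyzeWithExpensive` are hypothetical functions you would write around your Haiku and Opus calls, each assumed to resolve to an object that includes a `costUsd` field, and the log entry's field names are assumptions rather than an existing agent-log.json schema.

```javascript
// Score every item with the cheap model, then send only items at or above
// the threshold to the expensive model for deep analysis.
// `scoreWithCheap(item)` should resolve to { score, costUsd }.
// `analyzeWithExpensive(item)` should resolve to { so_what, why_useful, costUsd }.
// Both are hypothetical functions you supply.
async function evaluateTwoTier(items, scoreWithCheap, analyzeWithExpensive, threshold = 5) {
  const results = [];
  let totalCostUsd = 0;
  let escalated = 0;

  for (const item of items) {
    const scored = await scoreWithCheap(item);
    totalCostUsd += scored.costUsd;
    const result = { item, score: scored.score };

    if (scored.score >= threshold) {
      const analysis = await analyzeWithExpensive(item);
      totalCostUsd += analysis.costUsd;
      result.so_what = analysis.so_what;
      result.why_useful = analysis.why_useful;
      escalated += 1;
    }
    results.push(result);
  }
  return { results, escalated, totalCostUsd };
}

// Build a cost-comparison entry to append to agent-log.json.
// The field names are assumptions, not an existing schema.
function buildCostLogEntry(twoTierCostUsd, allExpensiveCostUsd) {
  const savedUsd = allExpensiveCostUsd - twoTierCostUsd;
  return {
    twoTierCostUsd,
    allExpensiveCostUsd,
    savedUsd,
    savedPercent: allExpensiveCostUsd > 0 ? (savedUsd / allExpensiveCostUsd) * 100 : 0,
  };
}
```

The savings depend on how many items clear the threshold. If most articles score 5 or higher, nearly everything is escalated and the cheap pass becomes added cost, so it is worth logging `escalated` alongside the totals.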