How to measure Claude Code proficiency (beyond token counts)

Every team is hiring for “AI-fluent” developers. Nobody knows how to measure it.

The most common attempt right now is counting tokens. Some teams pull token usage from the Anthropic Console or OpenAI dashboard and rank developers by spend. It feels like a metric. It isn’t a measurement of skill.

This post lays out what actually signals proficiency with AI coding agents like Claude Code, Codex CLI, and OpenCode, and how to use those signals for hiring, internal benchmarking, and your own growth.

Why token counts mislead you

A senior developer who has refined their workflow can ship more in 100,000 tokens than a junior burning through a million. The high-skill move is fewer turns, sharper prompts, better planning, smaller context windows. The high-spend developer is often the one who hasn’t learned to structure their work.

Worse, token-maxxing actively rewards the wrong behavior. If you measure your team by token usage, you incentivize them to use AI for tasks where it isn’t the right tool, to pad their context windows with irrelevant files, and to restart sessions instead of editing prompts.

You can replace “token count” with “lines of code” or “commit count” and the same critique applies. Output volume isn’t skill.

Eleven dimensions that actually signal AI coding agent skill

Real proficiency with Claude Code, Codex CLI, OpenCode, and similar tools shows up across eleven dimensions. Eight apply to every tool. Three are tool-specific bonuses. Each is observable from session activity, so you can measure it without watching someone work.

Eight core dimensions (every tool)

1. Customization

Has this developer shaped the tool to their workflow? Concretely: do they have a CLAUDE.md with project-specific instructions? Custom slash commands? Hooks that run on tool calls? An AGENTS.md for OpenCode? A configuration file under .codex/ for Codex CLI?

A new user runs the defaults. A power user has bent the tool around their preferences. Customization is the first dimension that shows up as a developer moves from curious to capable.
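As a rough illustration, the customization check can be sketched as a scan for well-known files and directories in a repository. The marker list below is an assumption chosen for this example (it is not exhaustive, and it is not how any particular scoring tool works), and the function names are hypothetical.

```python
from pathlib import Path

# Assumed, non-exhaustive markers of customization inside a repository.
CUSTOMIZATION_MARKERS = {
    "claude_md": "CLAUDE.md",                 # project instructions
    "agents_md": "AGENTS.md",                 # agent instructions file
    "slash_commands": ".claude/commands",     # custom slash commands
    "subagents": ".claude/agents",            # specialized agent definitions
    "claude_settings": ".claude/settings.json",  # settings, including hooks
}


def customization_signals(repo_root):
    """Return a dict saying which markers exist under repo_root."""
    root = Path(repo_root)
    return {name: (root / rel).exists() for name, rel in CUSTOMIZATION_MARKERS.items()}


def customization_count(repo_root):
    """Number of markers present; a crude proxy, not a score."""
    return sum(customization_signals(repo_root).values())
```

A count of zero suggests a developer running the defaults. A higher count only shows that the files exist, not that they are any good.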

2. Parallel agents

Can this developer orchestrate multiple agents at once? In Claude Code this means spawning subagents in parallel with the Task tool, defining specialized agents in .claude/agents/, and using them for independent work streams. In Codex CLI it means running several sessions at once.

A developer who only ever runs one agent at a time is leaving 5-10x leverage on the table. Parallel orchestration is the dimension that separates “AI as autocomplete” from “AI as a team you direct.”

3. Background work

Does this developer delegate work to the agent and step away, or do they babysit every turn? Sessions that load context up front, hand off a task, and run while the human is doing something else are a real signal of trust and structure.

Background-work-shy developers treat AI like an autocomplete. Background-work-fluent developers treat it like a colleague who can finish things without supervision.

4. Tool breadth

How many distinct tools, skills, and MCP servers does this developer wire into their environment? An MCP server connecting to GitHub, Linear, or a database. A skill that handles a recurring workflow. A custom hook for a project-specific pre-flight check.

Tool breadth correlates strongly with how deeply someone has explored the tool. New users have none. Power users have a dozen.

5. Planning

Does this developer plan before acting? Look for explicit use of plan mode, written plan documents in the repo, structured /spec or /plan workflows, and conversations that start with “let’s plan this first” before any file edit.

Developers who plan ship more correct code with fewer revisions. Developers who skip planning generate more turns, more tokens, and more rework. This is one of the strongest signals because it correlates with everything else.
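One way to turn "plans before acting" into a number is to ask, for each session that edits files, whether a planning step came before the first edit. The sketch below assumes a hypothetical event format (a list of dicts with a "type" key such as "plan" or "edit"); real session logs differ by tool and would need their own parser.

```python
def planned_before_editing(events):
    """True if a 'plan' event appears before the first 'edit' event."""
    for event in events:
        if event["type"] == "plan":
            return True
        if event["type"] == "edit":
            return False
    return False  # neither a plan nor an edit: nothing to credit


def planning_rate(sessions):
    """Fraction of editing sessions in which planning preceded the first edit."""
    editing = [s for s in sessions if any(e["type"] == "edit" for e in s)]
    if not editing:
        return 0.0
    return sum(planned_before_editing(s) for s in editing) / len(editing)
```

Sessions with no edits are left out of the denominator, so pure question-and-answer sessions neither help nor hurt the rate.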

6. Repetition

How deeply does this developer engage with the skills they have installed? Skill breadth (count) and skill repetition (total invocations) are different signals. Someone with 30 installed skills who has invoked each one twice is not the same as someone with 5 skills they actually use every day.

Repetition shows commitment. It is the difference between collecting tools and building muscle memory.
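The difference between the two signals is easy to see in code. This minimal sketch assumes the input is a flat list of skill names, one entry per invocation (a hypothetical format). Note that it counts skills that were invoked at least once, which can be smaller than the number installed.

```python
from collections import Counter


def skill_breadth_and_repetition(invocations):
    """Return (distinct skills invoked, total invocations)."""
    counts = Counter(invocations)
    breadth = len(counts)
    repetition = sum(counts.values())
    return breadth, repetition


# 30 skills invoked twice each versus 5 skills invoked 30 times each.
collector = [f"skill_{i}" for i in range(30)] * 2
daily_user = [f"skill_{i}" for i in range(5)] * 30
```

Here the collector scores (30, 60) and the daily user scores (5, 150): more breadth on one side, far more repetition on the other.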

7. Custom skills

Has this developer written their own reusable skill or command? Defined a workflow once and turned it into something they invoke by name? Shared a skill with a teammate?

This is the highest-leverage dimension. A developer who has internalized “if I do this twice, I make it a skill” is on a different growth curve than a developer who reruns the same prompt every time.

8. Multi-tasking

Does this developer run multiple Claude Code, Codex CLI, or OpenCode sessions in parallel? Different repositories. Different features. Different time horizons.

Single-session users treat AI like a chat window. Multi-session users treat it like a team. The shift in mental model produces a step change in output, and it shows up clearly in activity data.
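In activity data, multi-tasking shows up as sessions whose time ranges overlap. A minimal sketch, assuming each session is reduced to a (start, end) pair of comparable timestamps (an assumed format, not a real log schema), finds the peak number of sessions open at once:

```python
def max_concurrent_sessions(sessions):
    """Peak number of overlapping sessions, given (start, end) pairs."""
    events = []
    for start, end in sessions:
        events.append((start, 1))   # session opens
        events.append((end, -1))    # session closes
    # At equal timestamps, process closes (-1) before opens (+1) so that
    # back-to-back sessions are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak
```

A peak of 1 across a whole measurement window means a single-session user. Anything higher means at least two sessions were open at the same moment.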

Three tool-specific bonuses

These show up only on certain tools, but they contribute to the composite score when they apply.

9. Thinking mode (Codex)

Codex CLI exposes reasoning effort as a first-class control. A developer who knows when to dial reasoning up (hard refactors, ambiguous specs) and when to keep it low (mechanical edits) is operating at a higher level than one who lets it default forever. This dimension reads reasoning_engagement from Codex sessions.

10. Schedule reliability (Cowork)

Claude Cowork lets you schedule autonomous local-agent-mode sessions on a cadence: every morning, every Monday, every commit. The dimension measures whether your scheduled agents are actually firing on cadence or silently failing. Reliability separates “set and forget” from “set, forget, regret.”
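Firing "on cadence" can be approximated by comparing the times a run was due with the times a run actually started. The sketch below is an assumption about how such a check might look, not Cowork's own accounting: the inputs are plain lists of datetimes and the 15-minute tolerance is arbitrary.

```python
from datetime import datetime, timedelta


def schedule_reliability(expected, actual, tolerance=timedelta(minutes=15)):
    """Fraction of expected run times that have an actual run within tolerance.

    Returns None when nothing was scheduled, since reliability is undefined.
    """
    if not expected:
        return None
    hits = sum(
        any(abs(run - due) <= tolerance for run in actual)
        for due in expected
    )
    return hits / len(expected)
```

With three scheduled mornings and runs on only two of them, the function returns two thirds, which is the kind of silent failure this dimension is meant to surface.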

11. Queue discipline (Cowork)

When Cowork queues work for an agent to pick up, do you follow through on what got queued? The dimension reads queue completion across the window. Low scores correlate with developers who add to the queue faster than they review the output.

How to measure each dimension

Every AI coding agent writes a session log locally. A proficiency measurement reads those logs, counts the structural signals across the eleven dimensions, and produces a percentile per dimension plus a composite score.

This is what AIQ Rank does. The scan runs locally on the developer’s machine, the raw transcripts never leave it, and the output is a score card per tool plus a combined AIQ rank.
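To make "a percentile per dimension plus a composite score" concrete, here is a minimal sketch. It is not AIQ Rank's actual formula, which this post does not specify. It assumes a simple percentile rank against a reference population and an unweighted mean over the dimensions that apply, and the function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(value, population):
    """Percent of the population scoring below value, counting ties as half."""
    ordered = sorted(population)
    below = bisect_left(ordered, value)
    ties = bisect_right(ordered, value) - below
    return 100.0 * (below + 0.5 * ties) / len(ordered)


def composite(percentiles):
    """Unweighted mean of the dimensions that apply (None means not applicable)."""
    applicable = [p for p in percentiles.values() if p is not None]
    return sum(applicable) / len(applicable)
```

Skipping dimensions marked None keeps a tool-specific bonus from dragging down the composite for developers whose tools do not expose that signal.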

Using this for hiring

If you’re an engineering manager screening developers who claim AI proficiency, the question isn’t “how many tokens do you burn.” The question is how they actually configure and use the tool day to day.

Concretely: send candidates the install command for AIQ Rank, then ask them to run it and share the resulting profile URL. You see their score per dimension, which tools they actually use, and whether their planning-to-edit ratio looks like a power user’s or a novice’s. The scan takes 60 seconds and gives you far more signal than asking “are you good at AI” in an interview.

This is the kind of pre-interview AI proficiency assessment that scales. It replaces the broken interview question (developers who can’t actually use AI will still say they can) with an objective check.

Using this for internal team measurement

For engineering leaders asking “is my team actually using AI well or just installing it,” the same framework applies. Pull AIQ Rank scores for each team member. See who’s customizing the tool, orchestrating parallel agents, writing custom skills. See who is stuck running default Claude Code with no CLAUDE.md.

The result is a leaderboard. The leaderboard is a teaching artifact: high-scorers become mentors, low-scorers see specifically where they’re behind. It works better than mandatory training because it gives engineers a concrete next move on each dimension.

Using this for your own growth

For an individual developer, the score is a mirror. You see where you rank, you see which dimensions you’re light on, you get something concrete to practice.

If your tool breadth is in the bottom quartile, install three MCP servers and use them this week. If your custom skills score is zero, take a recurring task and turn it into a slash command. If your planning score is low, start using plan mode before file edits for the next ten sessions.

The point isn’t the score. The point is the next move the score makes obvious.

What about other tools?

This framework works for every AI coding agent that surfaces structural signals: Claude Code, Codex CLI, OpenCode, Cursor, Cowork, Aider, and others. The eight core dimensions hold across the set. The tool-specific bonuses (Codex thinking mode, Cowork schedule reliability and queue discipline) apply where the tool surfaces those signals.

Measure real activity. Skip token counts.

If you want to see where you rank, the scan is at aiqrank.com. It takes 60 seconds and runs locally. Free for individuals, free during launch for teams.