
Introducing the Eco Score: The First Sustainability Metric for AI-Assisted Coding

Every Claude Code and Codex turn has an energy cost. The Eco Score makes your AI coding climate impact visible — with a composite 0–100 score, three sub-scores, and a leaf rating you can actually improve.

Pierre Sauvignon · 8 min read
[Image: Eco Score leaf rating on the LobsterOne dashboard]

The Eco Score is LobsterOne’s new sustainability metric for AI-assisted developers — a single 0–100 rating, surfaced as one to four leaves, that captures how efficiently you use Claude Code, Codex, and other frontier coding tools. It is built on three sub-scores that are under your control: model discipline, cache efficiency, and session concision. None of the frontier labs publish per-token energy data, so the score does not pretend to be a precise carbon number. What it does is turn the behavioural levers that actually move your impact into something you can see and improve.

If you want to jump straight to the data, the Eco Score is live at app.lobsterone.ai/climate for anyone signed in. This post explains why we built it, what it measures, and what to do with it.

Why AI-Assisted Developers Need a Sustainability Metric

Modern Claude Code sessions routinely burn 10–80 million tokens. A heavy developer can consume several billion cache-read tokens across a month — numbers that sound abstract until you convert them into energy. Our triangulation from Epoch AI’s analysis, Google’s Gemini environmental disclosure, and the “How Hungry is AI?” arXiv benchmark puts a typical Claude Code session at around 40–80 watt-hours, and a full workday of agentic coding north of a kilowatt-hour per developer. That is not catastrophic for a single developer. Multiply it by the tens of thousands of engineers shifting to agentic workflows this year, and it stops being a rounding error.

The problem is that developers have no signal. Token counts are too low-level. Dollar cost ignores the energy mix and overstates the variable cost of caching. Nobody has built a behavioural metric that maps cleanly onto the things you can actually change. Eco Score is that metric.

For the underlying energy methodology and where each coefficient comes from, see our post on how much energy AI coding uses. For the actionable playbook, see our guide to reducing your AI coding energy consumption.

What the Eco Score Measures

The score has three sub-components, each 0–100, weighted into a single composite.

Model discipline (40%) weights your token usage by model class. Haiku carries a weight of 1.0, Sonnet 0.6, Opus 0.15. A developer who uses Haiku for grep-style tasks, Sonnet as the workhorse, and Opus only for genuinely hard problems scores high. A developer who sends every prompt to Opus scores low — not because Opus is bad, but because Opus-for-everything is the single biggest unnecessary energy draw in an agentic workflow. Opus output costs roughly ten times more energy per token than Haiku’s, and most coding tasks do not benefit from the capability gap enough to justify it.
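
In practice this reads as a token-weighted average of the per-model weights, scaled to 0–100. A minimal sketch of that reading, assuming the sub-score is computed exactly this way (the weights come from this post; the function name and token split are illustrative):

```python
# Per-model weights from this post: Haiku 1.0, Sonnet 0.6, Opus 0.15.
MODEL_WEIGHTS = {"haiku": 1.0, "sonnet": 0.6, "opus": 0.15}

def model_discipline(tokens_by_model: dict[str, int]) -> float:
    """Token-weighted average of model weights, scaled to 0-100 (assumed formula)."""
    total = sum(tokens_by_model.values())
    if total == 0:
        return 0.0
    weighted = sum(MODEL_WEIGHTS[model] * count for model, count in tokens_by_model.items())
    return 100.0 * weighted / total

# All-Opus usage lands at 15, matching the worked profile later in this post.
print(round(model_discipline({"opus": 1_000_000})))  # 15
```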

Cache efficiency (30%) measures cache_read / cache_creation on a logarithmic scale. A 10x reuse ratio scores 50, a 100x reuse ratio scores 100. Cache reads are an order of magnitude cheaper per token than fresh input because they skip KV-cache recomputation entirely. A developer whose workflow keeps the system prompt and project context stable between turns gets dozens of reads per cached block. A developer who constantly swaps projects, edits CLAUDE.md mid-session, or resets context thrashes the cache and pays full input cost every turn.
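
The two anchor points (10× reuse scoring 50, 100× scoring 100) imply a simple base-10 log mapping. A sketch under that assumption, clamped to the 0–100 range (names are illustrative, not LobsterOne’s API):

```python
import math

def cache_efficiency(cache_read_tokens: int, cache_creation_tokens: int) -> float:
    """Assumed log-scale mapping: 50 * log10(read / creation), clamped to 0-100."""
    if cache_creation_tokens <= 0 or cache_read_tokens <= 0:
        return 0.0
    ratio = cache_read_tokens / cache_creation_tokens
    return max(0.0, min(100.0, 50.0 * math.log10(ratio)))

print(round(cache_efficiency(10_000, 1_000)))   # 50  (10x reuse)
print(round(cache_efficiency(41_400, 1_000)))   # 81  (the profile below)
print(round(cache_efficiency(100_000, 1_000)))  # 100 (100x reuse)
```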

Session concision (30%) penalises marathon sessions. The formula is 100 × (1 − median_session_tokens / 50M), floored at zero. A median session of 5M tokens scores 90; at 50M it scores 0. Long sessions are expensive for a reason most developers miss: attention compute is O(n²) in the context length, and the cache invalidates more often as the conversation drifts. A focused 30-minute session that ships a PR is structurally more efficient than a four-hour debugging session that carried all of that context on every turn.
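
A sketch of that formula, assuming the published examples pin down the linear scaling (the function name is illustrative):

```python
def session_concision(median_session_tokens: float) -> float:
    """100 * (1 - median / 50M), floored at zero for medians beyond 50M tokens."""
    return max(0.0, 100.0 * (1.0 - median_session_tokens / 50_000_000))

print(session_concision(5_000_000))           # 90.0
print(round(session_concision(13_300_000)))   # 73  (the profile below)
print(session_concision(50_000_000))          # 0.0
```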

The composite is a weighted average. We show it as a 0–100 number, a leaf rating (one to four filled leaves with half-leaf granularity), and a band label — “Excellent”, “Room to improve”, or “Needs attention”.
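
The weighting itself is simple to write down. A minimal sketch of the composite, using the 40/30/30 weights above; the leaf and band mapping sits on top of this number and is not reproduced here:

```python
WEIGHTS = {"model_discipline": 0.4, "cache_efficiency": 0.3, "session_concision": 0.3}

def eco_score(model_discipline: float, cache_efficiency: float, session_concision: float) -> float:
    """Weighted composite of the three 0-100 sub-scores."""
    return (WEIGHTS["model_discipline"] * model_discipline
            + WEIGHTS["cache_efficiency"] * cache_efficiency
            + WEIGHTS["session_concision"] * session_concision)

# The sub-scores from the profile in the next section reproduce its composite of 52.
print(round(eco_score(15, 81, 73)))  # 52
```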

Why a Leaf Rating and Not a Number

We tested the composite as a big number. Developers looked at their 62 and asked what good looked like. We tested it as a ring gauge. Developers looked at their half-full ring and asked whether that was bad. A leaf rating borrows the pattern everyone already understands from restaurant reviews, hotel stars, and app store ratings. Two leaves is clearly room to improve. Four leaves is clearly excellent. Three is a comfortable middle. You do not need to read a methodology page to understand whether you should care.

The rating is not a percentile. We deliberately use absolute thresholds. If the whole industry improves its cache hit rate, the score benchmark does not silently rise to compensate. A four-leaf developer in 2026 is doing the same thing as a four-leaf developer in 2027.

What a Real User’s Score Looks Like

Here are the numbers for a heavy Claude Code user — 148 sessions in 30 days, 5 billion cache-read tokens, 9 million output tokens, roughly 34,000 turns.

  • Model discipline: 15 — 100% on Opus. The dominant lever. Moving 50% of this load to Sonnet would roughly double this sub-score.
  • Cache efficiency: 81 — 41.4× reuse ratio. Strong. Most of the savings are already being captured.
  • Session concision: 73 — 13.3M-token median session. Reasonable. A few marathon sessions above 100M tokens pull the p90 up, but the median is fine.
  • Composite: 52 — “Room to improve”, two leaves.

The lesson from this profile is counterintuitive for someone who has been optimising token counts. The levers that look like the obvious targets (session length, cache behaviour) are already close to their ceiling. The lever that is easy to overlook because it is invisible in a token count (model choice) is the one carrying almost all of the remaining headroom. Moving from 100% Opus to a realistic mix — Haiku for search and summarisation, Sonnet as the default, Opus reserved for hard reasoning — would lift the composite from 52 into the 70s. That is a structural change, not a grind.
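
That what-if is easy to check against the weights above. The model split below is an illustrative guess at what a “realistic mix” might look like, not measured data:

```python
weights = {"haiku": 1.0, "sonnet": 0.6, "opus": 0.15}

# Illustrative split: Haiku for search/summarisation, Sonnet as default, Opus for hard reasoning.
mix = {"haiku": 0.25, "sonnet": 0.60, "opus": 0.15}

discipline = 100 * sum(mix[m] * weights[m] for m in mix)   # ~63, up from 15
composite = 0.4 * discipline + 0.3 * 81 + 0.3 * 73         # ~71.5, up from 52
print(round(discipline), round(composite))
```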

How to Improve Your Eco Score

Move grep-like tasks off Opus. Reading files, summarising diffs, pattern-matching across a repo, routing between tools — these do not benefit from Opus’ reasoning depth. Claude Code supports model selection per interaction. Use it.

Stabilise your context between turns. Every edit to CLAUDE.md, every project switch, and every manual context reset invalidates the prompt cache. If you must change scope, do it at natural boundaries (new task, new session) rather than mid-conversation. See our session analytics guide for a deeper look at how context drift hurts you.

Start fresh sessions. When a conversation has drifted from the original task, spin up a new session instead of carrying the accumulated drift forward as context. The O(n²) tax on long sessions is real and largely invisible until you look at p90 session sizes.

Watch the drift detector. The Wh-per-turn chart on the Climate page catches the one failure mode that is hard to notice: sessions getting slowly more expensive over time without producing proportionally more output. The divergence banner quantifies it. A positive divergence of 20% or more is the kind of drift that compounds.
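
LobsterOne’s exact divergence computation is not spelled out in this post, but the intuition is simple: compare what a turn costs late in the window against early in it. A hedged sketch of that idea only (illustrative, not the product’s formula):

```python
def wh_per_turn_divergence(wh_per_turn: list[float]) -> float:
    """Illustrative drift check: % change of mean Wh/turn, second half vs first half."""
    half = len(wh_per_turn) // 2
    if half == 0:
        return 0.0
    early = sum(wh_per_turn[:half]) / half
    late = sum(wh_per_turn[half:]) / (len(wh_per_turn) - half)
    return 100.0 * (late - early) / early

# Turns creeping from ~1.0 Wh to ~1.4 Wh show roughly +26% divergence.
print(round(wh_per_turn_divergence([1.0, 1.0, 1.1, 1.2, 1.3, 1.4])))  # 26
```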

Coming Next

Eco Score is the first public product of a sustainability thread we plan to keep pulling. Next on our roadmap: team-level Eco Scores on private leaderboards, a weekly digest that flags when your score drifts, and a public methodology page with the complete derivation of every coefficient. If there is a behavioural lever we should be measuring that we do not, we want to hear about it — we are on the public leaderboard and we respond to messages sent through the blog’s contact page.

Track these metrics automatically with LobsterOne

Get Started Free

Frequently Asked Questions

Is the Eco Score based on real energy measurements?

No. The frontier labs do not publish per-token energy figures for their proprietary models, so any precise carbon number would be false precision. The score is based on behavioural signals that correlate strongly with real energy cost — model choice, cache discipline, session length — weighted from the best public research available. The methodology post walks through exactly where each coefficient comes from.

Why does Opus penalise me so heavily?

Opus is roughly seven to ten times more energy-intensive per output token than Haiku, based on the best public estimates and Anthropic’s own pricing structure. It is an excellent model for hard reasoning. It is overkill for tasks that Haiku or Sonnet handle equally well, and the energy gap is where the biggest sustainability headroom is hiding.

Can the Eco Score be gamed?

You can push your score up by switching to cheaper models without actually improving your outcomes. That is not gaming — that is the score doing its job. The score is not a KPI to optimise blindly; it is a feedback loop that should make you pause before defaulting to the heaviest model for trivial tasks.

Does Eco Score work with Codex and other tools?

Yes. The scoring framework applies to any agentic coding tool whose telemetry we ingest. Codex is supported at launch. Cursor and others follow the same session-level rollup once telemetry is connected.

How often does the score update?

The score is computed over a rolling window: the last 7 days for the header pill and the last 30 days for the dashboard card. It updates within minutes of new telemetry arriving.

Pierre Sauvignon

Founder

Founder of LobsterOne. Building tools that make AI-assisted development visible, measurable, and fun.
