
How Much Energy Does AI Coding Use? A Developer's Guide to LLM Carbon Footprint

The public data on Claude Code and Codex energy consumption — Wh per token, per session, per workday — triangulated from Epoch AI, Google's Gemini disclosure, and peer-reviewed benchmarks. What the numbers actually mean for developers.

Pierre Sauvignon · 10 min read
[Figure: energy per token chart for Claude Code and Codex]

AI coding uses more energy than most developers realise and less than headlines suggest. The best public estimates put a typical Claude Code session at 40–80 watt-hours and a full workday of agentic coding north of one kilowatt-hour per developer — roughly equivalent to charging a smartphone fifty to seventy times, or driving a gasoline car about two kilometres. This guide walks through where those numbers come from, what they actually measure, and why the per-token energy cost varies by an order of magnitude depending on which Claude or Codex model you use.

None of the frontier AI labs — Anthropic, OpenAI, or anyone else — publishes per-token energy data for their proprietary coding models. That silence is why the conversation has been dominated by breathless headlines and hand-waving dismissals. We are going to do the thing those are skipping: assemble the public evidence that does exist, triangulate it, and give you numbers you can act on.

What We Actually Know

Google Gemini published its first environmental disclosure in August 2025. The median text prompt uses 0.24 watt-hours, emits 0.03 grams of CO₂e, and consumes 0.26 millilitres of water. This is the only first-party data point we have from a major frontier lab.

Epoch AI published a bottom-up derivation for ChatGPT in February 2025. Their central estimate: roughly 0.3 watt-hours for a typical GPT-4o query with 500 output tokens on H100 hardware. At maximum context length (100,000 input tokens), the same analysis pushes the estimate to roughly 40 watt-hours per query. The full methodology is worth reading if you want the compute and utilisation assumptions behind the number.

“How Hungry is AI?”, a May 2025 arXiv benchmark, measured real inference energy across frontier models on H100/H200 hardware with batched requests. GPT-4o ran at approximately 1.2 watt-hours for a prompt with 1,000 input and 1,000 output tokens. Claude 3.7 Sonnet ran at 2.8 watt-hours for the same prompt — roughly 2.3 times higher. Reasoning models (o3, DeepSeek-R1) ran 10 to 30 times higher still.

Sam Altman’s blog, in June 2025, claimed “the average query uses about 0.34 watt-hours.” The figure is undocumented, the model is unspecified, and there is no reproducible methodology — but it sits squarely within the Google and Epoch ranges, so we include it as a sanity check rather than a source of truth.

Simon Couch, in January 2026, back-solved Claude per-token-type energy figures from Epoch’s GPT-4o anchors and Anthropic’s published pricing ratios. His derived figures are the basis for our own.

Triangulating all of this, we land on a defensible central estimate in watt-hours per million tokens (Wh/MTok), broken out by token type:

Token type          Wh / MTok
Input (uncached)          400
Output                  2,000
Cache read                 40
Cache write               500

These are the coefficients LobsterOne uses to compute climate impact in the dashboard. They are defensible within roughly ±3× uncertainty, which is the honest error bar given the state of public disclosure. If Anthropic or OpenAI publish real data tomorrow, these will shift. Until then, they are the best numbers available.
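As a concrete illustration, here is a minimal sketch, in Python, of how coefficients like these turn per-type token counts into an energy estimate. The token counts in the example are hypothetical, chosen to land in the typical session range discussed below.

```python
# Central energy coefficients in watt-hours per million tokens
# (Wh/MTok), from the table above. Uncertainty is roughly ±3x.
WH_PER_MTOK = {
    "input_uncached": 400,
    "output": 2_000,
    "cache_read": 40,
    "cache_write": 500,
}

def session_energy_wh(tokens: dict[str, int]) -> float:
    """Estimate session energy in Wh from per-type token counts."""
    return sum(
        WH_PER_MTOK[kind] * count / 1_000_000
        for kind, count in tokens.items()
    )

# A hypothetical hour-long agentic session: cache reads dominate,
# fresh input and output are comparatively small.
example = {
    "input_uncached": 20_000,
    "output": 10_000,
    "cache_read": 500_000,
    "cache_write": 20_000,
}
print(f"{session_energy_wh(example):.0f} Wh")  # -> 58 Wh
```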

Why Output Is Five Times More Expensive Than Input

The Wh/MTok table above shows output at 2,000 and uncached input at 400. The reason is structural. Input tokens are processed in parallel during the prefill stage — the model ingests the entire prompt in one go, heavily batched, with strong amortisation of per-token overhead. Output tokens are produced one at a time during the decode stage, and each decoded token requires a full pass through the model’s parameters. In practice, decoding costs roughly an order of magnitude more energy per token than prefill, though the exact ratio depends on prompt length, batch size, and inference engine.

Anthropic’s pricing reflects this. Output tokens are billed at five times the rate of input. Cache reads, which skip prefill recomputation entirely and are close to pure memory I/O, cost one tenth of fresh input. Cache writes cost slightly more than fresh input because the KV cache has to be populated. The pricing ratios are not arbitrary — they track the underlying energy physics closely enough that using them to back-solve per-token-type coefficients is a defensible shortcut when first-party energy data is absent.
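That back-solving logic fits in a few lines: anchor uncached input at the Epoch-derived figure and scale by the pricing ratios. A sketch in Python, where the ~400 Wh/MTok anchor is the assumption discussed earlier, not a first-party figure:

```python
# Anchor: ~400 Wh/MTok for uncached input, derived from Epoch AI's
# GPT-4o analysis. Not a first-party Anthropic figure.
INPUT_ANCHOR = 400  # Wh/MTok

# Anthropic's published price multipliers relative to fresh input.
PRICE_RATIO = {
    "input_uncached": 1.0,
    "output": 5.0,        # billed at 5x input
    "cache_read": 0.1,    # billed at 1/10 of input
    "cache_write": 1.25,  # billed slightly above input
}

coefficients = {k: INPUT_ANCHOR * r for k, r in PRICE_RATIO.items()}
# -> {'input_uncached': 400.0, 'output': 2000.0,
#     'cache_read': 40.0, 'cache_write': 500.0}
```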

What This Means for a Claude Code Session

A realistic Claude Code session — an hour of agentic work, maybe five or six turns, each one loading project context — typically lands in the 40 to 80 watt-hour range. A full workday with two or three concurrent Claude Code instances (which is not rare for developers doing parallel tasks) adds up to roughly 1.3 kilowatt-hours. At the US grid average of 400 grams of CO₂e per kilowatt-hour (EPA eGRID 2022), that is 520 grams of CO₂e per developer per day.

For context, that is roughly:

  • 2.9 kilometres driven in a gasoline car (at 180g CO₂e/km)
  • 87 smartphone charges (at 15 Wh each)
  • 65 minutes of microwave use
  • One quarter of a beef burger’s lifecycle footprint
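The arithmetic behind those equivalences, as a quick sketch; the per-unit figures (180 g CO₂e/km, 15 Wh per charge, and an assumed 1.2 kW microwave) come from the list above:

```python
DAILY_KWH = 1.3        # workday estimate from above
GRID_G_PER_KWH = 400   # US grid average, EPA eGRID 2022

daily_g_co2e = DAILY_KWH * GRID_G_PER_KWH   # 520 g CO2e

km_driven = daily_g_co2e / 180              # ~2.9 km driven
phone_charges = DAILY_KWH * 1000 / 15       # ~87 phone charges
microwave_minutes = DAILY_KWH / 1.2 * 60    # ~65 minutes at 1.2 kW
```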

Multiply that across tens of thousands of developers doing this daily and the number stops looking small. But unlike most climate signals, this one has actionable levers. The difference between using Opus for everything and using Opus only when needed is roughly a 5× reduction in per-turn energy. That is not a rounding error.

The Model Matters More Than the Volume

The most counterintuitive finding in the literature is that for a given developer, the choice of model moves energy consumption more than the sheer volume of tokens. The “How Hungry is AI?” benchmark showed Claude 3.7 Sonnet consuming 2.3 times more energy than GPT-4o for identical prompts, and reasoning models (extended thinking, o3-class) consuming 10 to 30 times more than standard queries.

Within the Claude family, the energy curve is rough but clear:

  • Haiku: most efficient per token. Suitable for grep-style tasks, summarisation, classification, short-form routing.
  • Sonnet: balanced. The workhorse for most coding tasks. Roughly 4× Haiku’s energy cost per output token on H100-class hardware, based on parameter counts.
  • Opus: the premium option. Roughly 7–10× Haiku’s energy cost. Reserved for tasks where the reasoning gap meaningfully matters.
  • Extended thinking (any model): multiplies the decode cost by 5–10× because it generates large volumes of internal reasoning tokens before producing the final answer.

A developer who lives in Opus with thinking mode enabled is burning roughly 50–100× the energy of a developer using Haiku for the same simple tasks. For a hard architectural question, that ratio is justified. For “read this file and summarise the imports,” it is not.
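To make that ratio concrete, here is a sketch using midpoints of the ranges above. The multipliers are rough assumptions for illustration, not measured values:

```python
# Energy per output token relative to Haiku, midpoints of the
# ranges quoted above. Rough assumptions, not measurements.
RELATIVE_ENERGY = {"haiku": 1.0, "sonnet": 4.0, "opus": 8.5}
THINKING_MULTIPLIER = 7.5  # extended thinking: 5-10x decode cost

def relative_cost(model: str, thinking: bool = False) -> float:
    base = RELATIVE_ENERGY[model]
    return base * THINKING_MULTIPLIER if thinking else base

ratio = relative_cost("opus", thinking=True) / relative_cost("haiku")
print(f"{ratio:.0f}x")  # ~64x, inside the 50-100x range quoted above
```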

This is the insight behind the Eco Score’s model discipline sub-score — the 40% weight on model choice is there because, empirically, that is where the biggest sustainability headroom lives.

Caching Is Invisible and Enormous

The other lever that surprises developers is caching. Cache read tokens dominate the token counts of any active Claude Code user — by factors of 10 to 100 over fresh input tokens, because each conversational turn re-reads the entire prior context from the KV cache. A heavy user can accumulate billions of cache-read tokens in a month against tens of millions of fresh-input tokens.

The good news is cache reads are energetically cheap. At 40 Wh per million tokens against 400 for fresh input, they are an order of magnitude less expensive. The bad news is that the cache invalidates when context changes. Every edit to CLAUDE.md mid-session, every project switch, every context reset blows the cache and forces full prefill on the next turn.

A healthy cache reuse ratio — how many times each cached block gets read before invalidation — runs 30–50× for disciplined workflows and over 100× for the most stable ones. A chaotic workflow that frequently invalidates runs under 10×, which roughly doubles the energy cost of every turn.
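One simplified way to see the effect: amortise each cache write over the number of reads it serves before invalidation. A sketch using the Wh/MTok coefficients from the table above:

```python
CACHE_WRITE, CACHE_READ = 500, 40  # Wh/MTok, from the table above

def effective_prefill_wh_per_mtok(reuse_ratio: float) -> float:
    """One cache write amortised over N reads, plus the read itself."""
    return CACHE_WRITE / reuse_ratio + CACHE_READ

print(effective_prefill_wh_per_mtok(50))  # disciplined: 50 Wh/MTok
print(effective_prefill_wh_per_mtok(5))   # chaotic:    140 Wh/MTok
```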

For the full tactical playbook on capturing these savings, see reducing your AI coding energy consumption.

What to Stop Worrying About

A few things that get breathless coverage but do not meaningfully move the needle:

Context length within a session. Attention is O(n²), but the cache handles most of the repeated work. A 100k-token context is not 100× more expensive than a 1k-token context — it is closer to 5–10× per turn, most of which is amortised across turns as long as the cache holds.

Token volume in isolation. A developer burning 5 billion tokens of cache reads is not necessarily worse than one burning 50 million — it depends entirely on how much of that volume is cache reads (cheap) versus fresh input and output (expensive). Raw token counts without the type breakdown are a poor proxy for energy.

Water consumption per query. Google’s 0.26 mL per median query is real but contextually small against datacenter-wide water cycles. Some articles quote 20–50 mL per 1,000 tokens by conflating direct cooling water with upstream power-plant cooling water — this is a scope error. Stick to the first-party Gemini number or scale energy estimates by the ~1.8 L/kWh US datacenter WUE.
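If you want a water figure anyway, scale the energy estimate rather than quoting per-token water numbers. A one-liner, assuming the ~1.8 L/kWh WUE mentioned above:

```python
session_kwh = 0.06                   # a ~60 Wh Claude Code session
water_ml = session_kwh * 1.8 * 1000  # ~108 mL, at 1.8 L/kWh WUE
```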

What to Actually Optimise

In rough order of impact on a typical Claude Code workflow:

  1. Model selection — route trivial tasks to Haiku, default tasks to Sonnet, hard problems to Opus. Do not default everything to Opus.
  2. Cache stability — keep CLAUDE.md, project context, and system prompts stable between turns within a session.
  3. Session concision — start fresh sessions when context drifts rather than carrying it forward indefinitely.
  4. Extended thinking discipline — only enable thinking mode for problems that actually need it. For simple refactors and summarisation, it is pure energy waste.
  5. Output length — ask for diffs rather than full file rewrites when the change is small.

Every one of those is a behavioural change, not an infrastructure change. The Eco Score measures exactly these levers because they are the ones developers control.


Frequently Asked Questions

Is there an official per-token energy figure from Anthropic?

No. Anthropic has not published per-token energy data for any Claude model. Our coefficients are triangulated from the best adjacent public research (Epoch AI’s GPT-4o analysis, the “How Hungry is AI?” benchmark, and Anthropic’s own pricing ratios). See the Eco Score launch post for the full derivation.

Why do different sources give different numbers?

Because they measure different things under different assumptions. Hardware (H100 vs A100), batch size, inference engine (vLLM vs TensorRT-LLM vs Transformers), quantisation (FP8 vs FP16), prompt length, and model size all move the number. A single Wh-per-token figure without those caveats is meaningless. Our table gives central estimates with a ±3× uncertainty band, which is as honest as the data allows.

Does the energy cost include training?

No. Training is amortised over billions of queries once a model is deployed, making the per-query contribution negligible. The numbers in this guide are inference-only, which is where developer behaviour actually lives.

Is this better or worse than a Google search?

A single modern Google search uses roughly 0.3 watt-hours including the result page render, network, and indexing. A single Claude Code turn uses 5–20 watt-hours. The comparison is not really fair — Claude is doing orders of magnitude more computation per query — but the ratio is real and worth being aware of.

How does the carbon footprint compare to other developer activities?

A developer’s daily AI coding energy (roughly 1 kWh) is comparable to running a laptop for 8 hours, or one-fifth of a CI/CD pipeline run on a large repo, or one-tenth of a 10-mile commute in a gasoline car. It is neither negligible nor dominant. It is a new category of footprint that deserves its own attention without displacing the others.

Pierre Sauvignon

Founder of LobsterOne. Building tools that make AI-assisted development visible, measurable, and fun.
