Claude Code leads SWE-bench Pro standardized (45.89% vs 41.04%); Morph: Tier 1, 'deepest reasoning'; Educative: 1h17m single-shot. Codex CLI leads Terminal-Bench (77.3% with GPT-5.3-Codex), is 3-4x more token-efficient, and runs at 240+ tok/s. Emerging consensus: use both, Claude for planning, Codex for implementation.
Coding CLIs / Code Agents
The hottest category right now. Ten+ serious CLI agents competing across three tiers. SWE-bench Pro (standardized) is necessary but no longer sufficient — METR found ~50% of SWE-bench-passing PRs would NOT be merged by real maintainers. Rankings weight benchmarks alongside practical tests, adoption, safety, and independent evaluations.
Verdict
Claude Code is #1 — 7.88M npm downloads/week (3x nearest rival), 79K stars, ~4% of GitHub public commits (SemiAnalysis, Feb 2026). Leads SWE-bench Pro standardized (45.89%, SEAL #1). $2.5B annualized revenue. Quality regression perception is a live trust issue ('dumbed down?' — 1,085 HN pts, Feb 2026) but MarginLab monitoring shows no statistical degradation.
Codex CLI is #2 — 2.49M npm downloads/week (clear #2 by active use). Rust rewrite eliminates Node.js dependency — unique in category. GPT-5.3-Codex leads SWE-bench Pro custom scaffold at 56.8% (non-standardized). Terminal-Bench 77.3%. Best for locked-down environments or OpenAI model loyalists.
Gemini CLI is #3 — 98K stars (highest raw count), best free tier (1K req/day, no credit card), 1M context window, 678K npm downloads/week. File deletion incident (AI Incident DB #1178) is a visible trust flag. Not recommended for unattended agentic use without a human review step.
Cline (cline.bot) is #4 — 3.35M VS Code installs (5M across editors), $32M funding (Emergence Capital), named enterprise customers (Salesforce, Samsung, SAP). Supply chain incident (v2.3.0 'OpenClaw') is a documented trust flag — would move to Tier 1/2 with a credible security audit.
OpenCode is #5 — 124K stars (largest AI coding repo), v1.2.27 active (2026-03-16), OpenAI official partnership. 393K npm downloads/week. RCE fixed in v1.1.10+. Trust story is messier than peers due to corporate conflict + security history.
RooCode is #6 — 1.37M VS Code installs, 5.0/5 VS Code rating (highest quality signal in IDE-agent segment). Cline fork with enterprise governance focus. v3.51.1 (2026-03-08). Best for teams wanting Cline-style agentic coding with stricter governance.
Aider is #7 — 191K PyPI/week, 5.7M lifetime installs. Multi-model, git-native, no vendor lock-in. Category pressure growing: HN thread 'stopped using Aider in favor of Claude Code' (#44154020). Best for Python devs who want fine-grained model control. v0.86.2 (2026-02-12).
The deeper read
SWE-bench Pro (standardized) is necessary but no longer sufficient. METR's March 10, 2026 study found ~50% of SWE-bench-passing PRs would NOT be merged by real maintainers (278 HN pts). Maintainer merge rates are ~24pp lower than automated grading. Rankings weight SWE-bench alongside practical tests, adoption, safety, and independent evaluations.
Verifiable traction is the new tie-breaker. Aider's 191,828/week PyPI installs are a public artifact — harder to game than star counts or social media engagement. Rankings now weight independently verifiable usage metrics more heavily.
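Download claims like these can be re-checked by anyone. A minimal sketch of how, using the public pypistats.org and api.npmjs.org endpoints (the package names are the ones the tools publish under; response shapes are the documented ones, but treat this as an illustration, not a monitoring service):

```python
import json
from urllib.request import Request, urlopen

PYPISTATS = "https://pypistats.org/api/packages/{pkg}/recent"
NPM_WEEKLY = "https://api.npmjs.org/downloads/point/last-week/{pkg}"


def pypi_recent_url(pkg: str) -> str:
    """Build the pypistats.org recent-downloads endpoint for a PyPI package."""
    return PYPISTATS.format(pkg=pkg)


def npm_weekly_url(pkg: str) -> str:
    """Build the npm registry last-week-downloads endpoint for an npm package."""
    return NPM_WEEKLY.format(pkg=pkg)


def fetch_json(url: str) -> dict:
    """GET a JSON document (pypistats rejects requests without a User-Agent)."""
    req = Request(url, headers={"User-Agent": "traction-check/0.1"})
    with urlopen(req, timeout=10) as resp:
        return json.load(resp)


# Example (network access required):
#   fetch_json(pypi_recent_url("aider-chat"))["data"]["last_week"]
#   fetch_json(npm_weekly_url("@anthropic-ai/claude-code"))["downloads"]
```

Comparing the returned `last_week` / `downloads` numbers against the figures cited in this ranking is the whole verification step; star counts have no equivalent public, per-week artifact.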
Each major model provider now has a CLI agent (Anthropic → Claude Code, OpenAI → Codex CLI, Google → Gemini CLI). The emerging consensus is a hybrid pattern: Claude Code for planning/architecture, Codex CLI for implementation. Multi-model tools (Aider, Crush, Goose, Qwen Code) offer a third lane: model-agnostic with no vendor dependency.
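The hybrid pattern can be wired together with a small driver script. This is a sketch only: it assumes the `claude` and `codex` binaries are installed and authenticated, and that `claude -p` (print mode) and `codex exec` (non-interactive mode) remain each CLI's documented non-interactive entry point; both flags may change between releases.

```python
"""Sketch of the 'Claude plans, Codex implements' hybrid loop."""
import shutil
import subprocess
from pathlib import Path

PLAN_PROMPT = (
    "Read this repo and write a step-by-step implementation plan for the task "
    "below. Plan only, do not edit any files.\nTask: {task}"
)


def plan_with_claude(task: str, plan_file: Path) -> bool:
    """Planning pass: Claude Code in non-interactive print mode."""
    if shutil.which("claude") is None:
        return False  # CLI not installed; sketch degrades gracefully
    out = subprocess.run(
        ["claude", "-p", PLAN_PROMPT.format(task=task)],
        capture_output=True, text=True, check=True,
    )
    plan_file.write_text(out.stdout)
    return True


def implement_with_codex(plan_file: Path) -> bool:
    """Implementation pass: Codex CLI in non-interactive exec mode."""
    if shutil.which("codex") is None:
        return False
    subprocess.run(
        ["codex", "exec",
         f"Implement the plan in {plan_file}. Run the test suite after each step."],
        check=True,
    )
    return True
```

The human review checkpoint lives between the two calls: inspect and edit the plan file before handing it to the implementation pass.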
Current ranking
Best for: Architecture, planning, complex reasoning, security analysis, niche languages
7.88M npm downloads/week — 3x nearest rival. ~4% of GitHub public commits (SemiAnalysis). #1 SWE-bench Pro standardized (45.89%, SEAL). $2.5B annualized revenue (fastest enterprise SaaS to $1B ARR). HN peak 2,127 pts — unmatched community mindshare.
⚡ Quality regression perception: 'Claude Code is being dumbed down?' (1,085 HN pts, Feb 2026) is a live trust issue. Rate limits are the #1 complaint. 3-4x higher token consumption per task than Codex CLI.
Best for: OpenAI ecosystem, locked-down environments, token efficiency, sandbox-first safety
2.49M npm downloads/week — clear #2 by active use. Rust rewrite eliminates Node.js dependency — unique in category. Terminal-Bench 77.3% (GPT-5.3-Codex). 3-4x more token-efficient than Claude Code. Free with ChatGPT subscription.
⚡ SWE-bench Pro standardized 41.04% — trails Claude Code by ~5pp. Tied to OpenAI models only. Custom scaffold score (56.8%) is not standardized.
Best for: Budget-constrained developers, large-context tasks, free entry point
Best free tier in category: 1K req/day, no credit card. 1M native context — largest. 98K stars (highest raw count). 678K npm downloads/week. Google-backed, open source (Apache 2.0).
⚡ File deletion incident (AI Incident DB #1178) is a visible trust flag. 11.6x download gap vs Claude Code despite higher star count — brand-driven stars. Not recommended for unattended agentic use without human review.
Best for: VS Code developers, enterprise teams with governance requirements
3.35M VS Code installs (5M across editors). $32M raise (Emergence Capital). Named enterprise customers: Salesforce, Samsung, SAP. v3.73.0 released 2026-03-16. Dominates the IDE-embedded-agent segment.
⚡ Supply chain incident: v2.3.0 'OpenClaw' compromise — no third-party security audit published. Primarily a VS Code extension; CLI surface is secondary. Would move to Tier 1 with a credible security audit.
Best for: Maximum model flexibility, open-source-first teams, OpenAI ecosystem
124,766 stars — largest AI coding repo by raw count. 393K npm downloads/week. v1.2.27 active (2026-03-16). OpenAI official partnership. RCE fixed in v1.1.10+. 75+ model providers.
⚡ Trust story is messy: RCE disclosure (432 HN pts) + Anthropic blocking incident (625 HN pts). Star count inflated by controversy. No published benchmark scores.
Best for: Teams wanting Cline-style agentic coding with stricter governance and multi-model flexibility
5.0/5 VS Code rating on 1,372,346 installs — strongest quality signal in the IDE-agent segment. Cline fork inherits proven codebase while adding enterprise governance. v3.51.1 (2026-03-08).
⚡ Fork positioning — unclear differentiation beyond governance vs upstream Cline. No published benchmarks. Smaller enterprise validation than Cline.
Best for: Python developers, maximum model flexibility, git-native workflow, token efficiency
191,828 PyPI/week, 5.7M lifetime installs — most independently verifiable usage outside Claude Code. Multi-model (any OpenAI-compatible API), git-native, no vendor lock-in. v0.86.2 released 2026-02-12.
⚡ Category pressure: HN 'stopped using Aider in favor of Claude Code' (#44154020). Codex CLI at 2.49M and Gemini at 678K npm/week have overtaken Aider's download rank. Shipping cadence behind daily-release competitors.
Best for: JetBrains loyalists wanting BYOK pricing with institutional IDE vendor support
JetBrains distribution: 14M existing user base. BYOK pricing. Explicit one-click migration from Claude Code. LLM-agnostic. Most strategically significant new entrant — revisit in 60 days.
⚡ Beta only — launched 2026-03-09. No public repo. No benchmark scores. No independent reviews yet.
Best for: Enterprise open governance, provider-agnostic agentic workflows, Apache 2.0 licensing
33K stars, v1.28.0 released 2026-03-18. Linux Foundation AAIF founding member. Provider-agnostic, MCP reference implementation. Block institutional backing.
⚡ No published benchmarks. No download data — cannot assess active-use gap vs top-tier tools. 'Super jank' reputation in HN comments.
Best for: Background agents enforcing code quality on PRs
2,372,585 VS Code installs (second-highest in IDE segment). 31,935 stars. Last release v1.2.17 (2026-03-13). Pivoted to async CI agents for PR enforcement.
⚡ Category shifting under it — more AI coding assistant framework than agentic CLI. Low HN engagement (44 pts) relative to VS Code install count.
Best for: Teams with large, complex codebases needing deep code intelligence
Most sophisticated sub-agent architecture (Oracle, Librarian, Painter). Sourcegraph code intelligence DNA. 36K npm weekly downloads. Free tier + BYOK.
⚡ Corporate spin-out: the sourcegraph/amp GitHub repo returns 404; Amp is now independent Amp Inc. No SWE-bench benchmark, 0 HN pts in tracked period. Verify current state before recommending.
Best for: Benchmark research, academic reference, issue-level repair evaluation
18,777 stars. Princeton NLP origin; team behind the original SWE-bench paper. SWE-agent scaffold: 79.2% SWE-bench Verified on Opus 4.5. Strong academic credential.
⚡ Last release v1.1.0: 2025-05-22 — 10 months stale. Not a production tool. Down-ranked to academic/research reference.
Best for: Terminal-first developers wanting polished UX, multi-platform support
Best terminal UX in the category. Charmbracelet proven track record (Bubble Tea, 25K+ apps). Multi-model, LSP, MCP, cross-platform. 21K stars, v0.50.1 (2026-03-17). HN: 367 pts.
⚡ No published benchmark scores. Custom license (not standard OSS). Insufficient evidence of production coding-agent use to rank above Tier 4 — revisit if download data surfaces.
Best for: Web UI-based coding agent, research teams
69,352 stars. Last release 1.5.0 (2026-03-11). Active development.
⚡ Primary interface is web UI, not CLI — may belong in a separate web-agent category. Missing download signal. Low HN traction (70 pts) relative to star count suggests research-primary audience.
Best for: Zero-cost open-weight model, local/on-prem deployment
Qwen3-Coder-Next: 70.6% SWE-bench Verified (highest open-weight model). 1K free daily requests. 20K stars, v0.12.6 released 2026-03-17.
⚡ Alibaba/Chinese cloud provenance — enterprise and GovCloud scrutiny required. SWE-bench Pro standardized 38.70% lowest among ranked tools. Near-zero HN engagement (~7 pts).
Best for: Teams wanting highest raw benchmark number, semantic codebase indexing
51.80% SWE-bench Pro on Augment scaffold — highest raw number in category. Augment Context Engine provides deep semantic codebase understanding.
⚡ No public release — 153 GitHub stars. Benchmark uses non-standardized Augment scaffold. Single blog post is the only public artifact.
Best for: Chinese developer ecosystem, teams using Moonshot AI models
7.2K stars, 124K PyPI weekly downloads. K2.5 model (HN: 388 pts). Moonshot AI $1B+ funding.
⚡ Western ecosystem integration limited. No SWE-bench Pro or Terminal-Bench scores published.
Best for: Privacy-conscious teams, OpenRouter users
16.8K stars, 131K npm weekly downloads. $8M seed funding. OpenRouter-native.
⚡ Early stage. Needs differentiation beyond OpenRouter integration.
Best for: Teams already on GitHub Copilot subscription needing a terminal companion
15M Copilot subscriber distribution. Multi-model (Opus 4.6, Sonnet 4.6, GPT-5.3-Codex, Gemini 3 Pro). Enterprise Agent Control Plane.
⚡ CVE-2026-29783 hit 2 days after GA — arbitrary code execution via shell expansion (PromptArmor). No published benchmark scores. Low community signal (24 HN pts, 9.4K stars).
Best for: Polished commercial IDE with integrated AI
$29.3B valuation, most adopted commercial AI IDE. Strong UX, agent modes (Jan 2026).
⚡ IDE-first, CLI is secondary. Closed-source, paid, vendor-locked.
Best for: Terminal-first developers who want an integrated AI environment
26K+ stars, 75.8% SWE-bench Verified, TIME Best Inventions. Full terminal replacement.
⚡ Closed-source. 4,350 open issues. Category mismatch — more 'AI terminal' than coding CLI agent.
Best for: Spec-driven development, AWS integration, GovCloud
Amazon-backed. GovCloud focus. Spec-driven development approach. CLI v1.27 (2026-03-02).
⚡ No public repo, no benchmark, no meaningful HN engagement. Insufficient evidence to rank at this time.
Skills comparison
[Chart: GitHub stars and evidence count for top ranked skills.]
Star growth over time
[Chart: GitHub stars trajectory for top skills in this category.]
Head to head
Gemini CLI has independent SWE-bench Verified scores (76.2%), 1M native context, and the best free tier. OpenCode has more stars (123K vs 98K) and model flexibility (75+ providers). But Gemini has proven benchmarks while OpenCode has none — that's the gap.
Gemini CLI: free tier (1K req/day), 1M context, 98K stars, Deep Think mode. Codex CLI: Terminal-Bench 77.3% (GPT-5.3-Codex), sandbox-first safety, free with ChatGPT. Gemini wins on cost and context; Codex wins on proven terminal performance and speed.
Both are model-agnostic, but Aider's release cadence (latest v0.86.2, 2026-02-12) trails OpenCode's near-daily shipping. Aider has verifiable PyPI downloads (191K/week); OpenCode's 5M MAD claim is unverified. Aider's token efficiency (~4.2x fewer tokens than Claude Code) is unmatched.
Gemini CLI now at 43.30% SWE-bench Pro standardized vs Claude Code's 45.89% — gap narrowed to 2.59pp. Gemini wins overwhelmingly on cost (free 1K req/day) and context (1M native). Claude Code wins on adoption (8M vs 647K npm/wk), revenue ($2.5B ARR), and HN mindshare (2,127 vs 1,428 pts). Tool-calling weaknesses keep Gemini at #3.
Amp has the most sophisticated sub-agent architecture (Oracle, Librarian, Painter) from Sourcegraph's code intelligence DNA. Claude Code has 58x more npm downloads (8M vs 139K), published benchmarks (SWE-bench Pro #1), and 24x more HN engagement. Amp is a bet on code intelligence depth; Claude Code is the proven all-rounder.
Public signals
Aider moves from #8 to #2 based on verified 2026-03-18 data: 191,828/week PyPI installs, 5.7M lifetime installs, 15B tokens/week (homepage). The only tool in the category with a fully independent, verifiable download number outside of Claude Code. Multi-model, git-native, no vendor lock-in.
Verified 2026-03-18 against SEAL public leaderboard. Claude Code #1 (45.89%). Augment scaffold 51.80% is highest raw number but not on standardized leaderboard. Qwen3-Coder-Next: 44.3% on Qwen Code leaderboard (unconfirmed on SEAL). Gemini CLI and Codex CLI standardized numbers unconfirmed on this date.
Charmbracelet's multi-model coding CLI enters the active ranking at #5. Built on Bubble Tea ecosystem (25K+ apps). LSP integration, MCP support, cross-platform. HN launch: 367 pts. No benchmarks — community quality signal is the main trust anchor.
Block's open-source agentic CLI enters active ranking at #6. Linux Foundation AAIF founding member, Apache 2.0, MCP reference implementation. Ships v1.28.0 today (2026-03-18). Best pick for teams requiring vendor-neutral governance.
Qwen3-Coder-Next: 70.6% SWE-bench Verified — highest open-weight model score in the category. 1,000 free daily requests via Qwen OAuth. Best zero-cost option. Alibaba provenance is a consideration for enterprise/GovCloud. v0.12.6 released 2026-03-17.
Augment scaffold: 51.80% SWE-bench Pro — highest raw number in the category. Same Opus 4.5 model scores 45.89% on SEAL standardized. Gap is scaffold architecture, not model capability. Cannot rank above tools with millions of verified installs on a single blog post. 153 stars, no public release.
Cline v2.3.0 'OpenClaw' compromise is a documented supply chain incident. Demoted from #4 to #10 (Watch tier). Would restore to Tier 2 with a credible third-party security audit. 59K stars and active shipping (v3.73.0 2026-03-16) — the underlying tool remains relevant.
OpenCode last released v0.0.55 on 2025-06-27 — 9 months stale at time of this ranking. Known unauthenticated RCE vulnerability (432 HN pts on disclosure). Removed from active ranking pending resumed development and security remediation.
SWE-bench Verified top 5 within 1 point (80.0–80.9%) — OpenAI stopped reporting it. SWE-bench Pro standardized: Claude Code 45.89% (#1). Custom scaffold scores (Augment 51.80%, Codex 56.8%) are not comparable to standardized results.
METR's March 10, 2026 study found ~50% of SWE-bench-passing PRs would NOT be merged by real maintainers (278 HN pts). Maintainer merge rates are ~24pp lower than automated grading. SWE-bench Pro is necessary but no longer sufficient as a sole authority.
~4% of public GitHub commits (~135K/day, SemiAnalysis est.), projected 20%+ by EOY 2026. 42,896x growth in 13 months. $2.5B annualized revenue (fastest enterprise SaaS to $1B ARR — Constellation Research). 8M+ npm weekly downloads. The hardest real-usage metric in the category.
Multiple independent sources (Calvin French-Owen, Pawel Jozefiak, Blake Crosley) converge on using Claude Code for planning/architecture and Codex CLI for implementation. Not a compromise — may be the optimal workflow.
MarginLab runs independent daily monitoring: 56% baseline pass rate, no statistically significant degradation. No other tool has this level of external quality assurance.
$2.5B annualized revenue, 500+ customers at $1M+/year. Fastest enterprise SaaS to $1B ARR in history (Constellation Research). 7.88M npm weekly downloads — 3x Codex (2.49M), 11.6x Gemini (678K).
Codex CLI at 2.49M npm/week is the clear #2 by active-use downloads — 13x Aider's 191K PyPI/week. Aider remains the strongest verifiable non-npm metric but download gap is too large to sustain #2 position. Aider moves to #7.
Cline dominates the IDE-embedded-agent segment: 3.35M VS Code installs (5M across editors), $32M from Emergence Capital, named enterprise customers. Moves from Watch tier to #4. Supply chain incident (v2.3.0 'OpenClaw') remains a documented trust flag.
OpenCode resumed active development after a gap. v1.2.27 released 2026-03-16. OpenAI official partnership after Anthropic blocking incident. 124,766 stars (largest AI coding repo). 393K npm downloads/week. RCE fixed in v1.1.10+. Moves from archived to #5.
RooCode added as a critical catalog gap. 5.0/5 VS Code rating on 1,372,346 installs is the strongest quality signal in the IDE-embedded segment. Cline fork with enterprise governance focus. v3.51.1 (2026-03-08). #6 in updated ranking.
High-engagement HN thread questioning Claude 4.5 quality regression. MarginLab independent monitoring shows no statistical degradation (p<0.05), but community trust perception is a real cost. The 'dumbing down' narrative is now the single most-cited concern in the Claude Code user community.
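MarginLab's exact methodology is not public. As an illustration of how a daily-monitoring setup could separate perception from measurable regression, a pooled two-proportion z-test against the 56% baseline pass rate would look like this (all run counts below are made up for the example):

```python
from math import erf, sqrt


def two_proportion_z(p1: float, n1: int, p2: float, n2: int) -> float:
    """Pooled two-proportion z statistic: does the recent-window pass rate p2
    differ significantly from the baseline pass rate p1?"""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


def p_value_two_sided(z: float) -> float:
    """Two-sided p-value under the normal approximation."""
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))


# Illustrative numbers: 200 baseline runs at 56% vs a 50-run recent window.
# A recent window holding at 56% gives z = 0 (no detectable degradation);
# a drop to 40% would cross the p < 0.05 threshold.
z_stable = two_proportion_z(0.56, 200, 0.56, 50)
z_drop = two_proportion_z(0.56, 200, 0.40, 50)
```

The point of the sketch: at these sample sizes a real drop of the magnitude users describe would register as statistically significant, so a flat z statistic is evidence against the "dumbing down" narrative rather than an absence of measurement.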
sourcegraph/amp GitHub repo returns 404. Amp spun out from Sourcegraph as independent 'Amp Inc.' Tool still ships (36K npm downloads/week) under ampcode.com. All catalog links updated. Corporate restructure is a material change — verify before recommending.
SWE-agent last released v1.1.0 on 2025-05-22 — 10 months stale. All active tools in the category are releasing weekly. Strong academic credential (Princeton, original SWE-bench paper) but not a production tool. Down-ranked to research/academic reference at #12.
What changes this
Auggie CLI public GA release + independent SWE-bench Pro reproduction → could move to Tier 1 if the 51.80% scaffold advantage holds outside Augment's own benchmark setup.
Gemini CLI publishing a credible SEAL SWE-bench Pro number → could move to #1 or #2 depending on result; currently ranked on traction alone.
Junie CLI post-beta community evidence → JetBrains' 11M+ installed base is large enough that strong first 60 days of public reception would immediately justify a Tier 2 slot.
Cline publishing a credible third-party security audit → would restore trust score and move it back into active Tier 2 consideration.
Aider publishing a SWE-bench Pro standardized number → would likely lock in #2 slot; currently its install verifiability is the strongest non-Anthropic signal in the category.
OpenCode resuming active development and patching the RCE → minimum bar to re-enter the ranking.
Claude Code quality regression persisting (the 'dumbed down' thread had 1,085 pts / 702 comments) → if perception hardens into documented capability regression, Tier 1 position is at risk.
If Gemini CLI fixes the file deletion pattern and files a clean safety record for 3+ months, its free tier + 1M context makes it a serious #2 contender.
If Codex CLI closes the SWE-bench standardized gap while maintaining cost/speed advantages, the #3/#4 ordering could shift.