Copilot: 20M all-time users, 12K orgs, 564 HN pts, proven track record. Cursor: $2B ARR, event-driven triggers, 1M+ DAU but only 7 HN pts on Automations. Copilot wins on distribution and trust; Cursor wins on autonomy model. Copilot for now; Cursor could challenge within six months.
Software Factories
Autonomous coding agents that plan, write, test, and ship code with minimal human oversight. The category has split into distinct lanes: platform-integrated (Copilot), event-driven always-on (Cursor Automations), open-source (OpenHands), enterprise-managed (Factory), and standalone SaaS (Devin/Windsurf). Production safety incidents (Kiro, Replit) are now a category-defining concern alongside benchmark scores.
15
Ranked
13
Signals
Verdict
GitHub Copilot Coding Agent is the enterprise default — 20M users, full GA with multi-model, Agentic Code Review GA (Mar 2026). Wins on distribution and integration depth, not raw capability (SWE-bench 56.0%).
Cursor Automations is the most innovative architecture — event-driven triggers (Slack, PagerDuty, Linear, webhooks, cron) are genuinely new. $2B ARR, $29.3B valuation, 1M+ DAU. Hierarchical planner/worker/judge at 1M+ LOC scale.
OpenHands is the open-source standard — 68.8K stars, $23.8M raised, MIT licensed, model-agnostic, ICLR paper, #1 Multi-SWE-Bench. Best for regulated industries and data sovereignty.
Augment Code enters at #4 — $252M total funding, ~70.6% SWE-bench (third-party report, unaudited). Strong enterprise backing but near-zero community signal. Would be #2 if benchmark is verified.
Aider is the CLI power-user standard — 49.2% SWE-bench Verified (independently reproducible), 5.7M pip installs, Apache 2.0, BYOK. Best for scriptable CI pipelines and zero-vendor-lock workflows.
The deeper read
The category has matured past the 'autonomous engineer' hype. The real split is platform-integrated (Copilot) vs event-driven (Cursor Automations) vs open-source (OpenHands) vs standalone SaaS (Devin, Factory). Production safety incidents (Kiro: 6.3M orders lost, Replit: codebase wiped) are now the #1 concern above benchmarks.
SWE-bench Verified is contaminated — OpenAI confirmed (Feb 2026) every frontier model shows training data contamination. Models reproduce exact fixes from memory. All historical SWE-bench Verified scores are suspect. SWE-bench Pro (1,865 tasks) is the successor but few contenders have published Pro scores yet.
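One way the contamination shows up in practice: a model's generated patch matches the benchmark's gold patch almost verbatim. A minimal sketch of that check, using stdlib string similarity (the function names and the 0.95 threshold are illustrative assumptions, not OpenAI's methodology):

```python
import difflib

def patch_similarity(model_patch: str, gold_patch: str) -> float:
    """Ratio in [0, 1]; values near 1.0 across many instances suggest
    the model is reproducing the gold fix from training data."""
    return difflib.SequenceMatcher(None, model_patch, gold_patch).ratio()

def flag_suspect_instances(results, threshold=0.95):
    """results: iterable of (instance_id, model_patch, gold_patch)."""
    return [iid for iid, mp, gp in results
            if patch_similarity(mp, gp) >= threshold]

suspects = flag_suspect_instances([
    ("astropy-1", "fix_a()", "fix_a()"),        # identical -> memorization suspect
    ("django-2", "patch_b_v2()", "patch_c()"),  # genuinely different -> fine
])
```

A real audit would normalize whitespace and compare ASTs rather than raw strings, but the signal is the same: exact reproduction of held-out fixes is evidence of training-set leakage, not capability.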
Production safety is now the #1 concern. Kiro (6.3M lost orders), Replit (1,206 records destroyed), ACM study (25-30% AI code contains security weaknesses). Gartner predicts >40% of agentic AI projects canceled by 2027 due to inadequate risk controls.
The trust paradox: 95% of developers use AI tools weekly, but only 33% trust AI output accuracy (Stack Overflow 2025). Tools with built-in review mechanisms (Copilot's self-review, mandatory approval gates) have a structural advantage.
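The structural advantage of mandatory approval gates is that agent output is staged, never applied, until a human or policy check signs off. A toy sketch of the pattern (class and method names are hypothetical, not any vendor's API):

```python
# Mandatory approval gate: staged work ships only after explicit review.
class ApprovalGate:
    def __init__(self):
        self.pending = []   # staged patches awaiting review
        self.applied = []   # approved, shipped patches

    def stage(self, patch: str):
        self.pending.append(patch)

    def review(self, approve_fn):
        """approve_fn: patch -> bool, e.g. a human prompt or policy check."""
        still_pending = []
        for patch in self.pending:
            if approve_fn(patch):
                self.applied.append(patch)   # only approved work ships
            else:
                still_pending.append(patch)  # rejected work stays staged
        self.pending = still_pending

gate = ApprovalGate()
gate.stage("patch: fix null check")
gate.stage("patch: drop users table")        # a reviewer should reject this
gate.review(lambda p: "drop" not in p)
```

The key invariant: there is no code path from `stage` to `applied` that bypasses `review`, which is exactly the property the Kiro and Replit incidents lacked.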
Current ranking
Best for: Teams already on GitHub that want zero-friction async coding with enterprise compliance baked in
20M all-time users, GA since September 25, 2025. Multi-model (Claude Opus 4.6, GPT-5.3-Codex, Gemini 3 Pro), plan/autopilot modes, cross-session repo memory, MCP integration. Agentic Code Review GA (March 5, 2026). Custom agents via .github/agents/. 564 HN pts on public preview — highest community engagement in category. SWE-bench Verified 56.0%. Runs in GitHub Actions VM with no-approve-own-PR guardrails.
⚡ SWE-bench 56.0% trails top contenders. GitHub-only — no GitLab/Bitbucket. CI requires human gate.
Best for: Teams already on Cursor who want event-driven automated coding triggered by external events (Slack, PagerDuty, Linear, webhooks, cron)
$2B ARR (March 2026), 1M+ DAU, $29.3B valuation. 'Scaling Long-Running Agents' blog (Jan 14, 2026, HN 290 pts): 1M+ LOC across 1,000 files in one week; hierarchical planner/worker/judge architecture. Launched March 5, 2026. 35% of Cursor's own PRs merged by cloud agents. SWE-bench Verified 51.7% (morphllm). Persistent memory across runs. Built-in demo video recording.
⚡ HN engagement at Automations launch: 7 pts — community not yet sold on event-driven paradigm. Locked to Cursor IDE. SWE-bench score not independently audited.
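The event-driven model amounts to mapping external events to agent runs. A hedged sketch of that routing layer, under stated assumptions (`Trigger`, `dispatch`, and the event schema are illustrative inventions, not Cursor's actual API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Trigger:
    source: str                      # "slack" | "pagerduty" | "cron" | "webhook"
    match: Callable[[dict], bool]    # does this event warrant an agent run?
    prompt: Callable[[dict], str]    # turn the event into an agent task

TRIGGERS = [
    Trigger("pagerduty",
            lambda e: e.get("severity") == "high",
            lambda e: f"Investigate and draft a fix for incident {e['id']}"),
    Trigger("cron",
            lambda e: e.get("job") == "nightly-deps",
            lambda e: "Open a PR bumping outdated dependencies"),
]

def dispatch(event: dict) -> list[str]:
    """Return the agent prompts fired for one incoming event."""
    return [t.prompt(event) for t in TRIGGERS
            if t.source == event.get("source") and t.match(event)]

runs = dispatch({"source": "pagerduty", "severity": "high", "id": "INC-42"})
```

The design point is that the human is removed from task *initiation*, not from review: triggers decide when an agent starts, while merge still goes through a PR.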
Best for: Regulated industries, privacy-sensitive orgs, teams wanting to self-host with their own models
68,800 GitHub stars, 3M+ downloads, MIT license. 66.4% SWE-bench Verified with inference-time scaling + critic (single: 60.6%) — highest verified score with auditable methodology. ICLR 2025 accepted paper, 292 citations, 38 highly influential (Semantic Scholar). $23.8M raised. #1 on Multi-SWE-Bench (8 languages); top on LiveSWEBench.
⚡ No disclosed revenue or paying customer count. Enterprise adoption evidence thin beyond logo slides. HN engagement lower than peers.
Best for: Enterprise teams with high budget seeking spec-first multi-agent orchestration with mandatory approval gates
$252M total funding — largest in category after Cognition. ~70.6% SWE-bench Verified (third-party report) — if accurate, highest in category. Mandatory approval gates. BYOA flexibility (Claude Code, Codex, OpenCode). Eric Schmidt–backed.
⚡ Limited public artifact visibility — no open-source repo, no public GitHub activity, near-zero community signal. 70.6% score not independently audited. Would move to #2 with a verified, auditable benchmark submission.
Best for: CLI power users, BYOK, scriptable CI pipelines — solo developers or headless coding agents with zero lock-in
42,109 GitHub stars; 5.7M total pip installs, 703K/month. 49.2% SWE-bench Verified — independently reproducible (pip-installable, open weights). Apache 2.0 license; BYOK; works with any model via API. HN consistently positive; frequent mention in developer tooling threads.
⚡ Lower ceiling than OpenHands on complex tasks. 49.2% SWE-bench trails top tier. Less IDE integration.
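Aider's CI fit comes from its non-interactive mode: one message, auto-confirmed edits, then exit. A minimal sketch of building that invocation from a pipeline script (flag names reflect recent aider releases and the default model string is a placeholder; verify both against `aider --help` for your installed version):

```python
import subprocess  # in CI you would subprocess.run(cmd, check=True)

def aider_cmd(task: str, files: list[str],
              model: str = "claude-3-5-sonnet-20241022") -> list[str]:
    """Build a headless aider invocation for a single scripted task."""
    return ["aider",
            "--model", model,
            "--yes-always",       # non-interactive: auto-confirm edits
            "--message", task,    # run one task, then exit
            *files]

cmd = aider_cmd("Add type hints to the public functions", ["src/api.py"])
```

Because it is just a CLI with BYOK, the same pattern drops into any runner (GitHub Actions, GitLab CI, cron) with no vendor-side orchestration.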
Best for: VS Code users wanting BYOK agentic experience with enterprise backing and 5M installs
59,114 GitHub stars; 5M multi-platform installs — second-largest OSS install base in category. Apache 2.0 license; BYOK model support. $32M funding (Emergence Capital). Named enterprise customers: Salesforce, Samsung, SAP.
⚡ No independent SWE-bench submission — capability claims are community-sourced. Supply chain incident (v2.3.0 'OpenClaw') is a documented trust flag. Primarily a VS Code extension.
Best for: Non-developer / vibe-coding audience building full-stack apps with minimal code knowledge
Parallel sub-agents (auth, DB, backend, frontend simultaneously), mobile app generation, infinite canvas design variants. ChatGPT distribution partnership — potentially massive non-developer reach. 2.28M monthly visits.
⚡ No SWE-bench submission; no verifiable benchmark. Production deletion incident (CEO apologized after AI agent wiped a company's codebase). For professional software factories (issue → PR, CI integration), falls short of top tier.
Best for: Large enterprises (5,000+ engineers) wanting vendor-managed, compliance-friendly coding agent with white-glove support
$70M total funding; enterprise customer base. 63.1% Terminal Bench score (#1 on that leaderboard). Wipro partnership (tens of thousands of engineers) — largest enterprise deployment commitment. Customers: MongoDB, EY, Bayer, Zapier, Clari.
⚡ No public repo; no community presence; invitation-only. Terminal Bench is less established than SWE-bench Verified. No independent third-party reviews. Would move up with a SWE-bench Verified submission or credible public case study.
Best for: Enterprise teams in regulated industries (government, healthcare, finance) needing spec-driven audit trail and AWS infrastructure integration
AWS GovCloud launch (Feb 2026) — signals real enterprise intent. Spec-driven workflow (requirements → design → implementation) is a genuine differentiator for regulated/compliance-heavy teams. AWS backing provides a distribution floor.
⚡ February 2026 outage controversy (The Register) — not confirmed Kiro-caused, but reputational drag. No SWE-bench submission; no public user count or ARR. Too early to rank higher without verified benchmark.
Best for: CLI-first power users who value Sourcegraph code intelligence lineage and BYOK flexibility
Co-founded by Quinn Slack and Beyang Liu (Sourcegraph) — high developer credibility. Self-reported profitable; Sequoia/a16z backing. CLI-first with Smart/Rush/Deep modes; Agent Skills system; composable code review agent.
⚡ No GitHub repo; no SWE-bench submission; no independent review; discontinued VS Code extension Feb 2026. 'Profitable' is self-reported. Deep mode capability claims unverified. Cannot rank higher without a reproducible benchmark or third-party review.
Best for: No current recommendation — archived from the top tier; benchmark stale since 2024
$696M total funding; $73M ARR (Jun 2025). Real enterprise use: Goldman Sachs, Santander, Nubank.
⚡ 13.86% SWE-bench Lite (self-reported, 2024) — the oldest and weakest score in the category. No updated SWE-bench submission since 2024 despite massive funding. Archived until a verified, current benchmark submission appears.
Best for: Free experimentation — proactive task scanning is unique but not yet a daily driver
Google backing ensures longevity. 2.28M beta visits. Proactive task scanning (finds TODOs unprompted) — unique in category. Free tier (15 tasks/day) most generous.
⚡ No SWE-bench benchmark, no published user count, no open-source presence. Every reviewer says 'not yet a daily driver.' Down-ranked to watch until a benchmark or verifiable case study appears.
Best for: Internal-tooling reference implementation; open, auditable, provider-agnostic
27,000 stars; 60% Block employee adoption. Apache license. Linux Foundation AAIF founding member.
⚡ Performance is entirely model-dependent; no independent SWE-bench submission. Down-ranked to watch; best considered as an internal-tooling reference implementation rather than a general-purpose software factory.
Best for: Research-grade autonomous bug fixing and benchmark reproducibility
Princeton research project. 18.7K stars. MIT licensed. Well-documented agent architecture. mini-swe-agent (100 lines) scores >74% SWE-bench.
⚡ Research scaffold, not production tool. No hosted offering. No new signals in 30 days.
Best for: Simple, controllable autonomous loops with human-readable state
Simplest pattern: while-true prompt loop with file persistence. Full control. No black box. Adopted by Anthropic, Vercel, Block's Goose.
⚡ More pattern than full factory. Requires good prompt engineering. No built-in task decomposition.
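The pattern above can be sketched in a few lines: re-prompt until the work is done, persisting state to a human-readable file after every iteration. `call_llm` is a stand-in for any model API; this fake version just drains a TODO list so the loop is runnable:

```python
import json, pathlib

STATE = pathlib.Path("agent_state.json")

def call_llm(prompt: str, state: dict) -> dict:
    # Stand-in for a real model call: move one TODO to done per turn.
    todos = state.get("todos", ["write tests", "fix lint"])
    done = state.get("done", []) + todos[:1]
    return {"todos": todos[1:], "done": done}

def run_loop(prompt: str, max_iters: int = 10) -> dict:
    state = json.loads(STATE.read_text()) if STATE.exists() else {}
    for _ in range(max_iters):
        state = call_llm(prompt, state)
        STATE.write_text(json.dumps(state, indent=2))  # resumable, inspectable
        if not state["todos"]:        # exit when nothing is left to do
            break
    return state

final = run_loop("Finish the open TODOs")
```

The file on disk is the whole "memory": you can inspect it, edit it, or resume from it, which is exactly the no-black-box property the pattern trades decomposition power for.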
Skills comparison
GitHub stars and evidence count for top ranked skills.
Star growth over time
GitHub stars trajectory for top skills in this category.
Head to head
Different lanes. Copilot: zero-friction, GitHub-native, no setup. OpenHands: self-hostable, model-agnostic, MIT, scales to 1000s of parallel tasks. Copilot for GitHub-native teams; OpenHands for control, self-hosting, data sovereignty.
Cursor: 13x Devin's pre-acq revenue, event-driven triggers are novel, $2B ARR. Devin: more autonomy history, 67% merge rate on defined tasks, $10.2B valuation. Cursor wins on revenue and trigger model; Devin has more autonomy history but weaker product evidence.
OpenHands: 68K stars, MIT, free, model-agnostic, broader enterprise logos (AMD, Apple, Google, NVIDIA). Devin: $10.2B valuation, $150M+ ARR, but 15% complex-task success. OpenHands for control and cost; Devin for turnkey defined-task automation.
Factory: Terminal-Bench #1, Sequoia+NVIDIA backing, enterprise customers. But 7 HN pts with 0 comments = near-zero community validation. Investor excitement ≠ developer adoption. Needs independent verification to move up.
Missing a contender?
If there's a skill we haven't ranked, submit it.
Public signals
4.7M paid subscribers (Microsoft earnings). ~90% Fortune 100 penetration. 1 in 5 code reviews on GitHub now agentic. Jira integration public preview Mar 5 — assign issue → get PR.
Only contender with event-driven triggers (Slack, PagerDuty, Linear, webhooks, cron). 35% of Cursor's own PRs merged by agents. Revenue doubled in 3 months. Named customers: OpenAI, Midjourney, Perplexity, Shopify.
MIT license, model-agnostic, self-hostable. 4M+ downloads. ICLR paper accepted 2025. Planning Agent (v1.5.0, Mar 2026) adds Plan/Code mode toggle. Only serious option for air-gapped/self-hosted deployment.
Proactive scanning finds TODOs and proposes work unprompted — unique in category. 534 + 339 HN pts. Free 15 tasks/day. But every reviewer says 'not yet a daily driver.' No SWE-bench scores.
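Proactive scanning reduces to walking the repo, harvesting TODO/FIXME comments, and turning each into a proposed task. An illustrative sketch (not Jules' actual implementation; file layout and regex are assumptions):

```python
import re, pathlib

TODO_RE = re.compile(r"#\s*(TODO|FIXME)[:\s]+(.*)")

def scan_for_tasks(root: str) -> list[str]:
    """Return 'path:line: description' for every TODO/FIXME in .py files."""
    tasks = []
    for path in sorted(pathlib.Path(root).rglob("*.py")):
        for lineno, line in enumerate(path.read_text().splitlines(), 1):
            m = TODO_RE.search(line)
            if m:
                tasks.append(f"{path}:{lineno}: {m.group(2).strip()}")
    return tasks

# Demo on a throwaway directory:
demo = pathlib.Path("demo_repo"); demo.mkdir(exist_ok=True)
(demo / "app.py").write_text("x = 1  # TODO: validate input\n")
tasks = scan_for_tasks("demo_repo")
```

The hard part Jules layers on top is ranking and scoping: deciding which of these comments are worth a PR, which a plain scanner cannot do.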
Highest-valued player in category. Windsurf adds 350+ enterprise customers and $82M ARR. Goldman Sachs piloting 'hundreds to thousands of Devins.' Staff exits post-acquisition raise integration questions.
The only rigorous independent evaluation of a software factory agent to date: 14 failures, 3 successes, 3 inconclusive across 20 tasks. The eval is 14 months old (Devin 1.x); Devin 2.0 has not been independently re-tested.
AI agent autonomously deleted live production environment. Amazon announced 90-day 'code safety reset' covering ~335 critical systems. Suspended from ranking pending safety reset and independent audit.
Largest enterprise deployment commitment in category. Terminal-Bench #1 (reported at 58.75%; 63.1% also cited). Previously claimed 84.8% SWE-bench is UNVERIFIED. Zero grassroots developer adoption.
Models reproduce exact code fixes from memory. All historical SWE-bench Verified scores suspect. SWE-bench Pro (1,865 tasks) is successor but few contenders have published Pro scores.
Stack Overflow 2025. Tools with built-in review mechanisms (Copilot's self-review, mandatory approval gates) have structural advantage. Gartner: >40% agentic AI projects canceled by 2027.
290 HN pts. Hierarchical planner/worker/judge architecture. Most technically specific public artifact in the category — concrete scale numbers with first-party evidence.
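The planner/worker/judge shape described in the blog can be sketched as a toy loop: a planner splits the goal, parallel workers produce candidate patches, and a judge keeps the best candidate per subtask. All functions here are illustrative stand-ins, not Cursor's implementation:

```python
def planner(goal: str) -> list[str]:
    """Split the goal into subtasks (real planners read the codebase)."""
    return [f"{goal}: part {i}" for i in (1, 2)]

def worker(subtask: str, attempt: int) -> dict:
    """A real worker would edit code; here quality just varies by attempt."""
    return {"subtask": subtask, "patch": f"patch-{attempt}", "score": attempt}

def judge(candidates: list[dict]) -> dict:
    """Keep the highest-scoring candidate (real judges run tests/review)."""
    return max(candidates, key=lambda c: c["score"])

def run(goal: str, workers_per_task: int = 3) -> list[dict]:
    results = []
    for sub in planner(goal):
        candidates = [worker(sub, a) for a in range(workers_per_task)]
        results.append(judge(candidates))   # only judged-best work survives
    return results

best = run("migrate logging")
```

The hierarchy is what lets this scale to 1M+ LOC claims: the planner bounds each worker's context to one subtask, and the judge filters before anything merges upward.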
Largest-funded new entrant after Cognition. ~70.6% SWE-bench Verified via third-party report — if verified, would be #2 in category. Near-zero community signal but enterprise-grade backing.
CLI-first spin-out from Sourcegraph. Co-founders have the highest developer credibility of any new entrant (Quinn Slack, Sourcegraph CEO; Beyang Liu, Sourcegraph CTO). Self-reported profitable. Agent Skills system and composable code review agent are differentiated features.
What changes this
If Augment Code / Intent publishes a verified SWE-bench submission → moves to #2 or #1 if score holds at 70%+.
If Amp publishes an open benchmark or third-party review → enters top 5.
If Kiro publishes a SWE-bench result and clarifies the outage → moves from #9 to #6-7.
If Devin submits a current, verified SWE-bench run → re-enters top tier from archived.
If Jules publishes any quantitative capability evidence → exits watch status.
If Cursor Automations community reception grows (HN currently 7 pts) → confirms or denies whether event-driven paradigm has real developer demand.
If SWE-bench Pro adoption becomes standard → all scores above 50% should be discounted ~20%; recalibrate entire ranking.
If OpenHands closes a large named enterprise deal → strengthens #3 claim on enterprise trust.
If another major safety incident at any ranked tool → that tool drops 2+ ranks and gets safety warning.