Copilot: 20M all-time users, 12K orgs, 564 HN pts, proven track record. Cursor: $2B ARR, event-driven triggers, 1M+ DAU but only 7 HN pts on Automations. Copilot wins on distribution and trust; Cursor wins on autonomy model. Copilot for now; Cursor could challenge within six months.
Software Factories
Autonomous coding agents that plan, write, test, and ship code with minimal human oversight. The category has split into distinct lanes: platform-integrated (Copilot), event-driven always-on (Cursor Automations), open-source (OpenHands), enterprise-managed (Factory), and standalone SaaS (Devin/Windsurf). Production safety incidents (Kiro, Replit) are now a category-defining concern alongside benchmark scores.
15
Ranked
13
Signals
Verdict
GitHub Copilot Coding Agent is the enterprise default — 20M users, full GA with multi-model, Agentic Code Review GA (Mar 2026). Wins on distribution and integration depth, not raw capability (SWE-bench 56.0%).
Cursor Automations is the most innovative architecture — event-driven triggers (Slack, PagerDuty, Linear, webhooks, cron) are genuinely new. $2B ARR, $29.3B valuation, 1M+ DAU. Hierarchical planner/worker/judge at 1M+ LOC scale.
OpenHands is the open-source standard — 68.8K stars, $23.8M raised, MIT licensed, model-agnostic, ICLR paper, #1 Multi-SWE-Bench. Best for regulated industries and data sovereignty.
Augment Code enters at #4 — $252M total funding, ~70.6% SWE-bench (third-party report, unaudited). Strong enterprise backing but near-zero community signal. Would be #2 if benchmark is verified.
Aider is the CLI power-user standard — 49.2% SWE-bench Verified (independently reproducible), 5.7M pip installs, Apache 2.0, BYOK. Best for scriptable CI pipelines and zero-vendor-lock workflows.
The deeper read
The category has matured past the 'autonomous engineer' hype. The real split is platform-integrated (Copilot) vs event-driven (Cursor Automations) vs open-source (OpenHands) vs standalone SaaS (Devin, Factory). Production safety incidents (Kiro: 6.3M orders lost, Replit: codebase wiped) are now the #1 concern above benchmarks.
SWE-bench Verified is contaminated — OpenAI confirmed (Feb 2026) every frontier model shows training data contamination. Models reproduce exact fixes from memory. All historical SWE-bench Verified scores are suspect. SWE-bench Pro (1,865 tasks) is the successor but few contenders have published Pro scores yet.
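One way the contamination shows up in practice: a model's generated patch matches the benchmark's gold patch almost verbatim. A minimal sketch of that check, using stdlib string similarity (the function names and the 0.95 threshold are illustrative assumptions, not OpenAI's methodology):

```python
import difflib

def patch_similarity(model_patch: str, gold_patch: str) -> float:
    """Ratio in [0, 1]; values near 1.0 across many instances suggest
    the model is reproducing the gold fix from training data."""
    return difflib.SequenceMatcher(None, model_patch, gold_patch).ratio()

def flag_suspect_instances(results, threshold=0.95):
    """results: iterable of (instance_id, model_patch, gold_patch)."""
    return [iid for iid, mp, gp in results
            if patch_similarity(mp, gp) >= threshold]

suspects = flag_suspect_instances([
    ("astropy-1", "fix_a()", "fix_a()"),        # identical -> memorization suspect
    ("django-2", "patch_b_v2()", "patch_c()"),  # genuinely different -> fine
])
```

A real audit would normalize whitespace and compare ASTs rather than raw strings, but the signal is the same: exact reproduction of held-out fixes is evidence of training-set leakage, not capability.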
Production safety is now the #1 concern. Kiro (6.3M lost orders), Replit (1,206 records destroyed), ACM study (25-30% AI code contains security weaknesses). Gartner predicts >40% of agentic AI projects canceled by 2027 due to inadequate risk controls.
The trust paradox: 95% of developers use AI tools weekly, but only 33% trust AI output accuracy (Stack Overflow 2025). Tools with built-in review mechanisms (Copilot's self-review, mandatory approval gates) have a structural advantage.
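The structural advantage of mandatory approval gates is that agent output is staged, never applied, until a human or policy check signs off. A toy sketch of the pattern (class and method names are hypothetical, not any vendor's API):

```python
# Mandatory approval gate: staged work ships only after explicit review.
class ApprovalGate:
    def __init__(self):
        self.pending = []   # staged patches awaiting review
        self.applied = []   # approved, shipped patches

    def stage(self, patch: str):
        self.pending.append(patch)

    def review(self, approve_fn):
        """approve_fn: patch -> bool, e.g. a human prompt or policy check."""
        still_pending = []
        for patch in self.pending:
            if approve_fn(patch):
                self.applied.append(patch)   # only approved work ships
            else:
                still_pending.append(patch)  # rejected work stays staged
        self.pending = still_pending

gate = ApprovalGate()
gate.stage("patch: fix null check")
gate.stage("patch: drop users table")        # a reviewer should reject this
gate.review(lambda p: "drop" not in p)
```

The key invariant: there is no code path from `stage` to `applied` that bypasses `review`, which is exactly the property the Kiro and Replit incidents lacked.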
Current ranking
Best for: Teams already on GitHub that want zero-friction async coding with enterprise compliance baked in
20M all-time users, GA since September 25, 2025. Multi-model (Claude Opus 4.6, GPT-5.3-Codex, Gemini 3 Pro), plan/autopilot modes, cross-session repo memory, MCP integration. Agentic Code Review GA (March 5, 2026). Custom agents via .github/agents/. 564 HN pts on public preview — highest community engagement in category. SWE-bench Verified 56.0%. Runs in GitHub Actions VM with no-approve-own-PR guardrails.
⚡ SWE-bench 56.0% trails top contenders. GitHub-only — no GitLab/Bitbucket. CI requires human gate.
Best for: Teams already on Cursor who want event-driven automated coding triggered by external events (Slack, PagerDuty, Linear, webhooks, cron)
$2B ARR (March 2026), 1M+ DAU, $29.3B valuation. 'Scaling Long-Running Agents' blog (Jan 14, 2026, HN 290 pts): 1M+ LOC across 1,000 files in one week; hierarchical planner/worker/judge architecture. Launched March 5, 2026. 35% of Cursor's own PRs merged by cloud agents. SWE-bench Verified 51.7% (morphllm). Persistent memory across runs. Built-in demo video recording.
⚡ HN engagement at Automations launch: 7 pts — community not yet sold on event-driven paradigm. Locked to Cursor IDE. SWE-bench score not independently audited.
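The event-driven model amounts to mapping external events to agent runs. A hedged sketch of that routing layer, under stated assumptions (`Trigger`, `dispatch`, and the event schema are illustrative inventions, not Cursor's actual API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Trigger:
    source: str                      # "slack" | "pagerduty" | "cron" | "webhook"
    match: Callable[[dict], bool]    # does this event warrant an agent run?
    prompt: Callable[[dict], str]    # turn the event into an agent task

TRIGGERS = [
    Trigger("pagerduty",
            lambda e: e.get("severity") == "high",
            lambda e: f"Investigate and draft a fix for incident {e['id']}"),
    Trigger("cron",
            lambda e: e.get("job") == "nightly-deps",
            lambda e: "Open a PR bumping outdated dependencies"),
]

def dispatch(event: dict) -> list[str]:
    """Return the agent prompts fired for one incoming event."""
    return [t.prompt(event) for t in TRIGGERS
            if t.source == event.get("source") and t.match(event)]

runs = dispatch({"source": "pagerduty", "severity": "high", "id": "INC-42"})
```

The design point is that the human is removed from task *initiation*, not from review: triggers decide when an agent starts, while merge still goes through a PR.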
Best for: Regulated industries, privacy-sensitive orgs, teams wanting to self-host with their own models
68,800 GitHub stars, 3M+ downloads, MIT license. 66.4% SWE-bench Verified with inference-time scaling + critic (single: 60.6%) — highest verified score with auditable methodology. ICLR 2025 accepted paper, 292 citations, 38 highly influential (Semantic Scholar). $23.8M raised. #1 on Multi-SWE-Bench (8 languages); top on LiveSWEBench.
⚡ No disclosed revenue or paying customer count. Enterprise adoption evidence thin beyond logo slides. HN engagement lower than peers.
Best for: Enterprise teams with high budget seeking spec-first multi-agent orchestration with mandatory approval gates
$252M total funding — largest in category after Cognition. ~70.6% SWE-bench Verified (third-party report) — if accurate, highest in category. Mandatory approval gates. BYOA flexibility (Claude Code, Codex, OpenCode). Eric Schmidt–backed.
⚡ Limited public artifact visibility — no open-source repo, no public GitHub activity, near-zero community signal. 70.6% score not independently audited. Would move to #2 with a verified, auditable benchmark submission.
Best for: CLI power users, BYOK, scriptable CI pipelines — solo developers or headless coding agents with zero lock-in
42,109 GitHub stars; 5.7M total pip installs, 703K/month. 49.2% SWE-bench Verified — independently reproducible (pip-installable, open weights). Apache 2.0 license; BYOK; works with any model via API. HN consistently positive; frequent mention in developer tooling threads.
⚡ Lower ceiling than OpenHands on complex tasks. 49.2% SWE-bench trails top tier. Less IDE integration.
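Aider's CI fit comes from its non-interactive mode: one message, auto-confirmed edits, then exit. A minimal sketch of building that invocation from a pipeline script (flag names reflect recent aider releases and the default model string is a placeholder; verify both against `aider --help` for your installed version):

```python
import subprocess  # in CI you would subprocess.run(cmd, check=True)

def aider_cmd(task: str, files: list[str],
              model: str = "claude-3-5-sonnet-20241022") -> list[str]:
    """Build a headless aider invocation for a single scripted task."""
    return ["aider",
            "--model", model,
            "--yes-always",       # non-interactive: auto-confirm edits
            "--message", task,    # run one task, then exit
            *files]

cmd = aider_cmd("Add type hints to the public functions", ["src/api.py"])
```

Because it is just a CLI with BYOK, the same pattern drops into any runner (GitHub Actions, GitLab CI, cron) with no vendor-side orchestration.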
Best for: VS Code users wanting BYOK agentic experience with enterprise backing and 5M installs
59,114 GitHub stars; 5M multi-platform installs — second-largest OSS install base in category. Apache 2.0 license; BYOK model support. $32M funding (Emergence Capital). Named enterprise customers: Salesforce, Samsung, SAP.
⚡ No independent SWE-bench submission — capability claims are community-sourced. Supply chain incident (v2.3.0 'OpenClaw') is a documented trust flag. Primarily a VS Code extension.
Best for: Non-developer / vibe-coding audience building full-stack apps with minimal code knowledge
Parallel sub-agents (auth, DB, backend, frontend simultaneously), mobile app generation, infinite canvas design variants. ChatGPT distribution partnership — potentially massive non-developer reach. 2.28M monthly visits.
⚡ No SWE-bench submission; no verifiable benchmark. Production deletion incident (CEO apologized after AI agent wiped a company's codebase). For professional software factories (issue → PR, CI integration), falls short of top tier.
Best for: Large enterprises (5,000+ engineers) wanting vendor-managed, compliance-friendly coding agent with white-glove support
$70M total funding; enterprise customer base. 63.1% Terminal Bench score (#1 on that leaderboard). Wipro partnership (tens of thousands of engineers) — largest enterprise deployment commitment. Customers: MongoDB, EY, Bayer, Zapier, Clari.
⚡ No public repo; no community presence; invitation-only. Terminal Bench is less established than SWE-bench Verified. No independent third-party reviews. Would move up with a SWE-bench Verified submission or credible public case study.
Best for: Enterprise teams in regulated industries (government, healthcare, finance) needing spec-driven audit trail and AWS infrastructure integration
AWS GovCloud launch (Feb 2026) — signals real enterprise intent. Spec-driven workflow (requirements → design → implementation) is a genuine differentiator for regulated/compliance-heavy teams. AWS backing provides a distribution floor.
⚡ February 2026 outage controversy (The Register) — not confirmed Kiro-caused, but reputational drag. No SWE-bench submission; no public user count or ARR. Too early to rank higher without verified benchmark.
Best for: CLI-first power users who value Sourcegraph code intelligence lineage and BYOK flexibility
Co-founded by Quinn Slack and Beyang Liu (Sourcegraph) — high developer credibility. Self-reported profitable; Sequoia/a16z backing. CLI-first with Smart/Rush/Deep modes; Agent Skills system; composable code review agent.
⚡ No GitHub repo; no SWE-bench submission; no independent review; discontinued VS Code extension Feb 2026. 'Profitable' is self-reported. Deep mode capability claims unverified. Cannot rank higher without a reproducible benchmark or third-party review.
Best for: No current recommendation — archived from the top tier; benchmark stale since 2024
$696M total funding; $73M ARR (Jun 2025). Real enterprise use: Goldman Sachs, Santander, Nubank.
⚡ 13.86% SWE-bench Lite (self-reported, 2024) — the oldest and weakest score in the category. No updated SWE-bench submission since 2024 despite massive funding. Archived until a verified, current benchmark submission appears.
Best for: Free experimentation — proactive task scanning is unique but not yet a daily driver
Google backing ensures longevity. 2.28M beta visits. Proactive task scanning (finds TODOs unprompted) — unique in category. Free tier (15 tasks/day) most generous.
⚡ No SWE-bench benchmark, no published user count, no open-source presence. Every reviewer says 'not yet a daily driver.' Down-ranked to watch until a benchmark or verifiable case study appears.
Best for: Internal-tooling reference implementation; open, auditable, provider-agnostic
27,000 stars; 60% Block employee adoption. Apache license. Linux Foundation AAIF founding member.
⚡ Performance is entirely model-dependent; no independent SWE-bench submission. Down-ranked to watch; best considered as an internal-tooling reference implementation rather than a general-purpose software factory.
Best for: Research-grade autonomous bug fixing and benchmark reproducibility
Princeton research project. 18.7K stars. MIT licensed. Well-documented agent architecture. mini-swe-agent (100 lines) scores >74% SWE-bench.
⚡ Research scaffold, not production tool. No hosted offering. No new signals in 30 days.
Best for: Simple, controllable autonomous loops with human-readable state
Simplest pattern: while-true prompt loop with file persistence. Full control. No black box. Adopted by Anthropic, Vercel, Block's Goose.
⚡ More pattern than full factory. Requires good prompt engineering. No built-in task decomposition.
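The pattern above can be sketched in a few lines: re-prompt until the work is done, persisting state to a human-readable file after every iteration. `call_llm` is a stand-in for any model API; this fake version just drains a TODO list so the loop is runnable:

```python
import json, pathlib

STATE = pathlib.Path("agent_state.json")

def call_llm(prompt: str, state: dict) -> dict:
    # Stand-in for a real model call: move one TODO to done per turn.
    todos = state.get("todos", ["write tests", "fix lint"])
    done = state.get("done", []) + todos[:1]
    return {"todos": todos[1:], "done": done}

def run_loop(prompt: str, max_iters: int = 10) -> dict:
    state = json.loads(STATE.read_text()) if STATE.exists() else {}
    for _ in range(max_iters):
        state = call_llm(prompt, state)
        STATE.write_text(json.dumps(state, indent=2))  # resumable, inspectable
        if not state["todos"]:        # exit when nothing is left to do
            break
    return state

final = run_loop("Finish the open TODOs")
```

The file on disk is the whole "memory": you can inspect it, edit it, or resume from it, which is exactly the no-black-box property the pattern trades decomposition power for.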
Skills comparison
GitHub stars and evidence count for top ranked skills.
Star growth over time
GitHub stars trajectory for top skills in this category.
Head to head
Different lanes. Copilot: zero-friction, GitHub-native, no setup. OpenHands: self-hostable, model-agnostic, MIT, scales to 1000s of parallel tasks. Copilot for GitHub-native teams; OpenHands for control, self-hosting, data sovereignty.
Cursor: 13x Devin's pre-acq revenue, event-driven triggers are novel, $2B ARR. Devin: more autonomy history, 67% merge rate on defined tasks, $10.2B valuation. Cursor wins on revenue and trigger model; Devin has more autonomy history but weaker product evidence.
OpenHands: 68K stars, MIT, free, model-agnostic, broader enterprise logos (AMD, Apple, Google, NVIDIA). Devin: $10.2B valuation, $150M+ ARR, but 15% complex-task success. OpenHands for control and cost; Devin for turnkey defined-task automation.
Factory: Terminal-Bench #1, Sequoia+NVIDIA backing, enterprise customers. But 7 HN pts with 0 comments = near-zero community validation. Investor excitement ≠ developer adoption. Needs independent verification to move up.
Missing a contender?
If there's a skill we haven't ranked, submit it.
Public signals
4.7M paid subscribers (Microsoft earnings). ~90% Fortune 100 penetration. 1 in 5 code reviews on GitHub now agentic. Jira integration public preview Mar 5 — assign issue → get PR.
Only contender with event-driven triggers (Slack, PagerDuty, Linear, webhooks, cron). 35% of Cursor's own PRs merged by agents. Revenue doubled in 3 months. Named customers: OpenAI, Midjourney, Perplexity, Shopify.
MIT license, model-agnostic, self-hostable. 4M+ downloads. ICLR paper accepted 2025. Planning Agent (v1.5.0, Mar 2026) adds Plan/Code mode toggle. Only serious option for air-gapped/self-hosted deployment.
Proactive scanning finds TODOs and proposes work unprompted — unique in category. 534 + 339 HN pts. Free 15 tasks/day. But every reviewer says 'not yet a daily driver.' No SWE-bench scores.
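Proactive scanning reduces to walking the repo, harvesting TODO/FIXME comments, and turning each into a proposed task. An illustrative sketch (not Jules' actual implementation; file layout and regex are assumptions):

```python
import re, pathlib

TODO_RE = re.compile(r"#\s*(TODO|FIXME)[:\s]+(.*)")

def scan_for_tasks(root: str) -> list[str]:
    """Return 'path:line: description' for every TODO/FIXME in .py files."""
    tasks = []
    for path in sorted(pathlib.Path(root).rglob("*.py")):
        for lineno, line in enumerate(path.read_text().splitlines(), 1):
            m = TODO_RE.search(line)
            if m:
                tasks.append(f"{path}:{lineno}: {m.group(2).strip()}")
    return tasks

# Demo on a throwaway directory:
demo = pathlib.Path("demo_repo"); demo.mkdir(exist_ok=True)
(demo / "app.py").write_text("x = 1  # TODO: validate input\n")
tasks = scan_for_tasks("demo_repo")
```

The hard part Jules layers on top is ranking and scoping: deciding which of these comments are worth a PR, which a plain scanner cannot do.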
Highest-valued player in category. Windsurf adds 350+ enterprise customers and $82M ARR. Goldman Sachs piloting 'hundreds to thousands of Devins.' Staff exits post-acquisition raise integration questions.
The only rigorous independent evaluation of a software factory agent to date: 14 failures, 3 successes, 3 inconclusive across 20 tasks. The eval is 14 months old (Devin 1.x); Devin 2.0 has not been independently re-tested.
AI agent autonomously deleted live production environment. Amazon announced 90-day 'code safety reset' covering ~335 critical systems. Suspended from ranking pending safety reset and independent audit.
Largest enterprise deployment commitment in category. Terminal-Bench #1 (reported at 58.75%; 63.1% also cited). Previously claimed 84.8% SWE-bench is UNVERIFIED. Zero grassroots developer adoption.
Models reproduce exact code fixes from memory. All historical SWE-bench Verified scores suspect. SWE-bench Pro (1,865 tasks) is successor but few contenders have published Pro scores.
Stack Overflow 2025. Tools with built-in review mechanisms (Copilot's self-review, mandatory approval gates) have structural advantage. Gartner: >40% agentic AI projects canceled by 2027.
290 HN pts. Hierarchical planner/worker/judge architecture. Most technically specific public artifact in the category — concrete scale numbers with first-party evidence.
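The planner/worker/judge shape described in the blog can be sketched as a toy loop: a planner splits the goal, parallel workers produce candidate patches, and a judge keeps the best candidate per subtask. All functions here are illustrative stand-ins, not Cursor's implementation:

```python
def planner(goal: str) -> list[str]:
    """Split the goal into subtasks (real planners read the codebase)."""
    return [f"{goal}: part {i}" for i in (1, 2)]

def worker(subtask: str, attempt: int) -> dict:
    """A real worker would edit code; here quality just varies by attempt."""
    return {"subtask": subtask, "patch": f"patch-{attempt}", "score": attempt}

def judge(candidates: list[dict]) -> dict:
    """Keep the highest-scoring candidate (real judges run tests/review)."""
    return max(candidates, key=lambda c: c["score"])

def run(goal: str, workers_per_task: int = 3) -> list[dict]:
    results = []
    for sub in planner(goal):
        candidates = [worker(sub, a) for a in range(workers_per_task)]
        results.append(judge(candidates))   # only judged-best work survives
    return results

best = run("migrate logging")
```

The hierarchy is what lets this scale to 1M+ LOC claims: the planner bounds each worker's context to one subtask, and the judge filters before anything merges upward.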
Largest-funded new entrant after Cognition. ~70.6% SWE-bench Verified via third-party report — if verified, would be #2 in category. Near-zero community signal but enterprise-grade backing.
CLI-first spin-out from Sourcegraph. Co-founders have the highest developer credibility of any new entrant (Quinn Slack, Sourcegraph CEO; Beyang Liu, Sourcegraph CTO). Self-reported profitable. Agent Skills system and composable code review agent are differentiated features.
What changes this
If Augment Code / Intent publishes a verified SWE-bench submission → moves to #2 or #1 if score holds at 70%+.
If Amp publishes an open benchmark or third-party review → enters top 5.
If Kiro publishes a SWE-bench result and clarifies the outage → moves from #9 to #6-7.
If Devin submits a current, verified SWE-bench run → re-enters top tier from archived.
If Jules publishes any quantitative capability evidence → exits watch status.
If Cursor Automations community reception grows (HN currently 7 pts) → confirms or denies whether event-driven paradigm has real developer demand.
If SWE-bench Pro adoption becomes standard → all scores above 50% should be discounted ~20%; recalibrate entire ranking.
If OpenHands closes a large named enterprise deal → strengthens #3 claim on enterprise trust.
If another major safety incident at any ranked tool → that tool drops 2+ ranks and gets safety warning.