Skillbench Specification
Product Summary
Skillbench is a curated website for finding the best skill, platform, or native capability for a specific agent task category.
The product is not a broad marketplace and not a generic registry mirror. The core value is:
- narrow, useful categories
- strong ranking and comparison
- evidence-backed recommendations
- real citations to people comparing tools in public
The main user intent is:
"What's the best skill for the thing I'm trying to do?"
The site should answer that quickly, explain why, and show the strongest alternatives in the same category.
V1 Scope
V1 is a research and comparison product, not a live benchmark lab.
We will not initially run the skills ourselves in a formal benchmark harness. Instead, we will:
- collect public evidence
- collect pairwise comparisons
- collect examples of real people using the skills
- collect official posts, blog posts, Reddit threads, and X/Twitter posts
- rank skills and platforms based on evidence quality and relevance
Later, the system may run local benchmarks, but that is not required for the initial version.
Core Principles
- Flat browsing experience, not a deep directory maze
- Opinionated recommendations, not neutral listing
- Comparisons first
- Evidence first
- Minimal number of page types
- Static website, rebuilt locally on a schedule
Core Entities
There are five first-class entities:
1. Category (formerly "Job")
What the user wants to do. A narrow task category.
Examples:
- Web browsing
- Web form automation
- Google Docs reading
- Notion reading
- Task tracking
- Browser testing
- iOS simulator control
- Xcode builds
- Coding CLIs
2. Platform
The underlying product, environment, or native capability used to solve the category.
Examples:
- Google Docs
- Notion
- Linear
- GitHub Issues
- Browser Use
- Playwright
- Puppeteer
- Claude Code native browser
- Cursor native support
Important: a platform may solve the category natively without a separate skill.
3. Skill
A specific skill, package, repo, integration, or workflow used on top of a platform.
4. Source
A citation that provides evidence.
Examples:
- X/Twitter post
- Reddit thread
- blog post
- official documentation
- release note
- demo video
5. Bundle
A full setup/stack from a known persona. Bundles track what popular, trending builders actually ship with day to day, referencing skills in our registry.
Examples:
- Karpathy's coding stack (Claude Code + Cursor)
- swyx's agent stack (Claude Code + Firecrawl + Browser Use)
Taxonomy
The data model is tiered:
Category
-> Platform
-> Skill
The browsing experience should still feel flat. Users should mostly start at the homepage, then go to a category page, and only drill down if needed.
This structure solves the case where multiple platforms can address the same category.
Example:
Task Tracking
-> Linear
-> Skill A
-> Skill B
-> Notion
-> Skill C
-> GitHub Issues
-> Native support
-> Skill D
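The tiered data model above can be sketched as TypeScript types. This is an illustrative shape only, not a final schema; field names like `nativeSupport` are assumptions:

```typescript
type Status = "active" | "watch" | "archived";

interface Skill {
  name: string;
  status: Status;
}

interface PlatformEntry {
  name: string;
  nativeSupport: boolean; // a platform may solve the category natively
  skills: Skill[];
}

interface Category {
  name: string;
  platforms: PlatformEntry[];
}

// The "Task Tracking" example from above:
const taskTracking: Category = {
  name: "Task Tracking",
  platforms: [
    {
      name: "Linear",
      nativeSupport: false,
      skills: [
        { name: "Skill A", status: "active" },
        { name: "Skill B", status: "active" },
      ],
    },
    {
      name: "Notion",
      nativeSupport: false,
      skills: [{ name: "Skill C", status: "active" }],
    },
    {
      name: "GitHub Issues",
      nativeSupport: true, // native support is itself an option
      skills: [{ name: "Skill D", status: "active" }],
    },
  ],
};
```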
Page Types
V1 should have only four page types:
1. Homepage
Primary goals:
- show core categories
- show top current recommendations
- show recent changes
- surface strong evidence and comparisons
2. Category Page
This is the main product surface.
A category page should include:
- ranked options
- comparison table
- short explanation of the ranking
- supporting evidence
- related platforms and related categories
3. Platform Page
A platform page answers:
- how good this platform is for agent usage
- whether native support is enough
- which skills on top of it are best
4. Skill Page
A skill page answers:
- what this skill is good for
- how it ranks inside its platform or category
- what evidence supports it
- what better alternatives exist
Homepage Structure
The homepage should be simple and editorial.
Suggested sections:
- search bar
- core categories
- best right now
- what changed
- evidence feed
The homepage is not meant to be exhaustive. It should route users into narrow categories quickly.
Ranking Model
The site should rank options clearly rather than presenting them as an unordered directory.
Ranking should happen at two levels:
- within a category
- within a platform
The output should be a clear ranked table, not a vague prose-only recommendation.
We do not need a single universal score across the whole site. Ranking is context-specific.
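A minimal sketch of context-specific ranking, assuming each option carries an aggregate evidence score computed per category (the scoring input is an assumption, not a final model):

```typescript
interface RankedOption {
  name: string;
  evidenceScore: number; // aggregate of weighted evidence for THIS category only
}

// Rank within one category; there is deliberately no site-wide score.
function rankWithinCategory(options: RankedOption[]): RankedOption[] {
  return [...options].sort((a, b) => b.evidenceScore - a.evidenceScore);
}
```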
Comparison Model
Comparison is the core product behavior.
We especially care about:
- pairwise comparisons between skills
- pairwise comparisons between platforms
- public posts where someone explicitly says one option is better than another
Strong comparison evidence includes:
- "X is better than Y for this workflow"
- "I switched from X to Y because..."
- "Y is now outdated"
- "X is official, but Y is more capable"
Status Model
Each skill or platform should have a status:
active, watch, or archived
Active
Currently relevant and should appear in default tables.
Watch
Possibly relevant, but uncertain, stale, or not yet sufficiently evidenced.
Archived
Dead, superseded, stale, or clearly outperformed.
Archived items should not appear in default comparisons, but they should still have their own page and a visible reason for archival.
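The default-table rule can be sketched as a filter. One assumption here: only active items appear in default tables (the spec leaves watch items ambiguous, so this sketch excludes them too):

```typescript
type ItemStatus = "active" | "watch" | "archived";

interface TrackedItem {
  name: string;
  status: ItemStatus;
  archivedReason?: string; // should be set whenever status is "archived"
}

// Archived (and, under this assumption, watch) items keep their own
// pages but are excluded from default comparison tables.
function defaultTableItems(items: TrackedItem[]): TrackedItem[] {
  return items.filter((item) => item.status === "active");
}
```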
Evidence Model
We want public citations and backlinks, not unsupported claims.
Evidence sources can include:
- official docs
- release notes
- maintainer posts
- blog posts
- X/Twitter posts
- Reddit threads
- YouTube videos
We especially want sources where people:
- tried the skill
- compared it against another option
- showed output
- explained tradeoffs
Source Credibility
Sources should be weighted by signal quality.
Suggested tiers:
Tier 1: Official
- official docs
- official repos
- official release notes
- creator posts
Tier 2: Expert
- respected practitioners
- credible researchers
- detailed technical blog posts
- known builders in the space
Tier 3: Community
- Reddit threads
- community X/Twitter posts
- forum discussion
Tier 4: Weak — DISCARD by default
- low-effort SEO pages
- shallow reposts
- generic AI-generated summaries
- HN threads with fewer than 10 points or fewer than 5 substantive comments
- drive-by opinions without specifics ("honestly easier with X" with no supporting detail)
- isolated praise without comparison or usage evidence
Tier 4 signals should almost never appear in the product. If one is included, it must have an explicit justification for why it is still useful despite being weak.
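An illustrative weighting for these tiers; the numeric weights are assumptions, not settled values:

```typescript
type SourceTier = 1 | 2 | 3 | 4;

// Hypothetical weights: official > expert > community > weak.
const TIER_WEIGHT: Record<SourceTier, number> = {
  1: 1.0, // official
  2: 0.8, // expert
  3: 0.5, // community
  4: 0.0, // weak: discarded by default
};

// Tier 4 only survives with an explicit justification for inclusion.
function includeSource(tier: SourceTier, justification?: string): boolean {
  return tier !== 4 || justification !== undefined;
}
```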
Signal quality minimum bar
Every signal shipped in the product must meet ALL of these:
- Traction: visible community engagement (upvotes, replies, stars, retweets). A 2-comment HN thread is not a signal.
- Substance: a specific claim, comparison, experience report, or demonstrated output. Generic sentiment is not enough.
- Attribution: who said it matters. Maintainers and known builders outweigh anonymous drive-by comments.
- Freshness: signals older than 2 months are dead weight; cut them. If a tool is truly a leader, it will have fresh mentions and fresh comparisons. Do not grandfather stale signals just because they were once relevant.
- Visual proof: screenshots must show the product in use, not homepages or landing pages.
- Multiple independent sources required for ALL claims: a single source is worth nothing. Every claim shipped in the product (verdict, strength, weakness, comparison, capability statement) must be backed by at least 2-3 independent sources. Single-source claims are hypotheses at best. Label them as unverified or cut them.
- Engagement floor: if the source has few likes, few comments, few upvotes, it is not a signal. A blog post with zero comments, an HN thread with 2 replies, a tweet with no engagement: all noise. The public must have visibly reacted for it to count.
- Author reputation: if the author has no visible track record (no public projects, no history in the space, no reputation), the signal is suspect. Weight maintainers, known builders, and people with public work 10x over anonymous or low-profile commenters.
- Date tracking: every artifact (evidence items, live signals, observed outputs) MUST carry a date. Undated artifacts cannot be assessed for freshness and are treated as stale.
When in doubt, cut the signal. Three strong signals beat eight weak ones.
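The minimum bar above can be sketched as a single gate function. The field names and the exact engagement floor are assumptions; the 2-month freshness window and 2-source minimum come from the rules:

```typescript
interface LiveSignal {
  date?: Date; // undated artifacts are treated as stale
  engagement: number; // likes + replies + upvotes, however counted
  independentSources: number;
}

const TWO_MONTHS_MS = 61 * 24 * 60 * 60 * 1000;
const ENGAGEMENT_FLOOR = 5; // illustrative threshold, not final

function meetsMinimumBar(signal: LiveSignal, now: Date): boolean {
  if (!signal.date) return false; // date tracking is mandatory
  if (now.getTime() - signal.date.getTime() > TWO_MONTHS_MS) return false;
  if (signal.engagement < ENGAGEMENT_FLOOR) return false;
  if (signal.independentSources < 2) return false; // multi-source rule
  return true;
}
```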
Promotional Material Rule
NEVER trust the product's own website, blog, or company review as strong evidence. Self-reported claims (official docs, launch posts, company blogs about their own product) may be referenced for factual details (feature lists, pricing, benchmarks they claim) but MUST be clearly marked as selfReported. They do NOT count toward multi-source verification. "Company X says their product is great" is noise. "Independent reviewer Y says X is great" is signal.
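A sketch of how the self-reported exclusion interacts with multi-source verification; field names are assumptions:

```typescript
interface CitedSource {
  url: string;
  selfReported: boolean; // official docs, launch posts, company blogs
}

// Self-reported sources never count toward the independent minimum.
function independentSourceCount(sources: CitedSource[]): number {
  return sources.filter((s) => !s.selfReported).length;
}

function isVerified(sources: CitedSource[]): boolean {
  return independentSourceCount(sources) >= 2;
}
```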
We should also classify evidence type:
- direct use
- comparison
- announcement
- tutorial
- complaint
- opinion
Discovery and Research Pipeline
The content pipeline should run locally on the machine, not on a hosted server.
This pipeline has four stages:
1. Discover
Goal:
- detect new or newly relevant skills, platforms, and discussions
Inputs:
- top entries from skill registries
- web search
- X/Twitter search
- Reddit search
- recent posts from the last day or week
Important:
- we should not rely only on a fixed list of tracked accounts
- search should be freshness-based and topic-based
- we care more about top or hot items than exhaustive coverage
Typical discovery triggers:
- new skill appears in top registry results
- a known skill gets a major update
- recent public comparisons appear
- recent discussion suggests a skill is now hot or obsolete
2. Deep Dive
Goal:
- gather structured evidence for a category, platform, or skill
Tasks:
- collect official references
- collect credible public commentary
- collect pairwise comparisons
- collect examples and quotes
- collect reasons people prefer one option over another
Deep dive should happen for:
- all items initially
- later, only triggered items or stale pages
3. Rank
Goal:
- update rankings and statuses
Tasks:
- rank skills inside each platform
- rank platforms inside each category
- assign active, watch, or archived
- update short verdicts
- update which items appear in default comparison tables
4. Publish
Goal:
- regenerate the website
Tasks:
- update homepage
- update category pages
- update platform pages
- update skill pages
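The four stages above can be sketched as one local run loop. The stage bodies here are placeholders, and the function names are assumptions, not an API:

```typescript
// Placeholder stages; real implementations would call search APIs,
// write Markdown with frontmatter, and rebuild the static site.
async function discover(): Promise<string[]> {
  return ["new-skill"]; // 1. detect new or newly relevant items
}
async function deepDive(items: string[]): Promise<string[]> {
  return items; // 2. gather structured evidence per item
}
async function rank(items: string[]): Promise<string[]> {
  return items; // 3. update rankings, statuses, verdicts
}
async function publish(items: string[]): Promise<void> {
  // 4. regenerate homepage, category, platform, and skill pages
  void items;
}

async function runPipeline(): Promise<string[]> {
  const discovered = await discover();
  const evidenced = await deepDive(discovered);
  const ranked = await rank(evidenced);
  await publish(ranked);
  return ranked;
}
```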
Content Format
Content should live in Markdown with frontmatter.
Reasons:
- easy to edit manually
- easy for agents to generate
- easy to render statically
- good fit for research-heavy pages
Frontmatter should be type-checked.
Recommended approach:
- Markdown files for categories, platforms, skills, and sources
- TypeScript schemas with Zod for validation
We should avoid introducing extra YAML files where a Markdown document is enough.
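A hypothetical shape for a skill file, to make the format concrete. Every field name and value here is illustrative, not a final schema; the Zod schemas would mirror whatever fields we settle on:

```md
---
type: skill
name: example-browser-skill
platform: browser-use
categories:
  - web-browsing
  - web-form-automation
status: active
updated: 2025-01-15
sources:
  - url: https://example.com/independent-comparison
    tier: 2
    evidenceType: comparison
    selfReported: false
---

Short verdict, ranking rationale, and evidence summary go here as the Markdown body.
```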
Search
Search should be very simple in V1.
Requirements:
- client-side only
- no backend search service
- no vector database
- basic string similarity / keyword matching
This is enough for initial navigation.
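A minimal sketch of what V1 search could look like under these constraints: pure client-side token matching, no backend, no vectors. The document shape is an assumption:

```typescript
interface SearchDoc {
  title: string;
  keywords: string[];
}

// Score each doc by how many lowercase query terms appear in its
// title or keywords; drop non-matches and sort by score.
function searchDocs(query: string, docs: SearchDoc[]): SearchDoc[] {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  return docs
    .map((doc) => {
      const haystack = [doc.title, ...doc.keywords].join(" ").toLowerCase();
      const score = terms.filter((t) => haystack.includes(t)).length;
      return { doc, score };
    })
    .filter((r) => r.score > 0)
    .sort((a, b) => b.score - a.score)
    .map((r) => r.doc);
}
```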
Design Direction
The site should feel sharp and editorial, not like a generic docs template.
Principles:
- strong typography
- intentional contrast
- minimal but expressive layout
- hover states that reveal useful comparison details
- evidence and ranking should be visually prominent
We should avoid generic AI-generated landing page aesthetics.
Stack Choice
Chosen stack:
- Next.js
- Tailwind CSS v4
- shadcn/ui
Reasoning:
- strongest tool support across coding agents
- best generation quality from modern AI coding tools
- easy static-friendly content site
- good fit for a custom editorial UI
We are not using a monorepo for now.
V1 will be a single Next.js application.
Publishing Infrastructure
Chosen deployment target:
- Vercel
Reasoning:
- best default fit for Next.js
- simplest deployment path
- easy preview deployments
- enough for a static-first content product
The research and content generation pipeline still runs locally. Vercel is only for publishing the built site.
Initial Open Questions
These are still open and can be refined after the first real category pages:
- which initial categories should appear on the homepage
- how much detail should appear directly on the homepage versus category pages
- whether platform pages should be fully first-class in navigation or mostly supporting pages
- how much of the source material should be embedded inline versus linked out
- how we visually show a pairwise comparison without making tables too heavy
V1 Success Criteria
V1 is successful if a user can:
- land on the homepage
- identify their category quickly
- see the best options ranked
- understand why the top option wins
- inspect real public evidence
- navigate to a narrower platform or skill page if needed
The website should feel useful even before formal local benchmarks exist.