Skillbench Specification

Product Summary

Skillbench is a curated website for finding the best skill, platform, or native capability for a specific agent task category.

The product is not a broad marketplace and not a generic registry mirror. The core value is:

  • narrow, useful categories
  • strong ranking and comparison
  • evidence-backed recommendations
  • real citations to people comparing tools in public

The main user intent is:

"What's the best skill for the thing I'm trying to do?"

The site should answer that quickly, explain why, and show the strongest alternatives in the same category.

V1 Scope

V1 is a research and comparison product, not a live benchmark lab.

We will not initially run the skills ourselves in a formal benchmark harness. Instead, we will:

  • collect public evidence
  • collect pairwise comparisons
  • collect examples of real people using the skills
  • collect official posts, blog posts, Reddit threads, and X/Twitter posts
  • rank skills and platforms based on evidence quality and relevance

Later, the system may run local benchmarks, but that is not required for the initial version.

Core Principles

  • Flat browsing experience, not a deep directory maze
  • Opinionated recommendations, not neutral listing
  • Comparisons first
  • Evidence first
  • Minimal number of page types
  • Static website, rebuilt locally on a schedule

Core Entities

There are five first-class entities:

1. Category (formerly "Job")

What the user wants to do. A narrow task category.

Examples:

  • Web browsing
  • Web form automation
  • Google Docs reading
  • Notion reading
  • Task tracking
  • Browser testing
  • iOS simulator control
  • Xcode builds
  • Coding CLIs

2. Platform

The underlying product, environment, or native capability used to solve the category.

Examples:

  • Google Docs
  • Notion
  • Linear
  • GitHub Issues
  • Browser Use
  • Playwright
  • Puppeteer
  • Claude Code native browser
  • Cursor native support

Important: a platform may solve the category natively without a separate skill.

3. Skill

A specific skill, package, repo, integration, or workflow used on top of a platform.

4. Source

A citation that provides evidence.

Examples:

  • X/Twitter post
  • Reddit thread
  • blog post
  • official documentation
  • release note
  • demo video

5. Bundle

A full setup/stack from a known persona. Bundles track what popular/trending builders actually ship with day to day, referencing skills in our registry.

Examples:

  • Karpathy's coding stack (Claude Code + Cursor)
  • swyx's agent stack (Claude Code + Firecrawl + Browser Use)

Taxonomy

The data model is tiered:

Category
  -> Platform
    -> Skill

The browsing experience should still feel flat. Users should mostly start at the homepage, then go to a category page, and only drill down if needed.

This structure handles the case where multiple platforms can address the same category.

Example:

Task Tracking
  -> Linear
    -> Skill A
    -> Skill B
  -> Notion
    -> Skill C
  -> GitHub Issues
    -> Native support
    -> Skill D
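
A minimal TypeScript sketch of this tiered model. Field and type names are illustrative, not final, and would be refined alongside the frontmatter schemas described under Content Format:

```ts
// Illustrative sketch of the Category -> Platform -> Skill tiers.

type Status = "active" | "watch" | "archived";

interface Category {
  slug: string;            // e.g. "task-tracking"
  name: string;            // e.g. "Task Tracking"
  platformSlugs: string[]; // platforms that can address this category
}

interface Platform {
  slug: string;            // e.g. "linear"
  name: string;
  nativeSupport: boolean;  // a platform may solve the category without a separate skill
  status: Status;
}

interface Skill {
  slug: string;
  name: string;
  platformSlug: string;    // the platform this skill sits on top of
  status: Status;
}
```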

Page Types

V1 should have only four page types:

1. Homepage

Primary goals:

  • show core categories
  • show top current recommendations
  • show recent changes
  • surface strong evidence and comparisons

2. Category Page

This is the main product surface.

A category page should include:

  • ranked options
  • comparison table
  • short explanation of the ranking
  • supporting evidence
  • related platforms and related categories

3. Platform Page

A platform page answers:

  • how good this platform is for agent usage
  • whether native support is enough
  • which skills on top of it are best

4. Skill Page

A skill page answers:

  • what this skill is good for
  • how it ranks inside its platform or category
  • what evidence supports it
  • what better alternatives exist

Homepage Structure

The homepage should be simple and editorial.

Suggested sections:

  • search bar
  • core categories
  • best right now
  • what changed
  • evidence feed

The homepage is not meant to be exhaustive. It should route users into narrow categories quickly.

Ranking Model

The site should rank options clearly rather than presenting them as an unordered directory.

Ranking should happen at two levels:

  • within a category
  • within a platform

The output should be a clear ranked table, not a vague prose-only recommendation.

We do not need a single universal score across the whole site. Ranking is context-specific.
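
As a sketch, ranking can be modeled as sorting options inside one context (a category or a platform) by a score computed only from that context's evidence. How the score is computed is an open design question; the point is only that there is no site-wide score:

```ts
// Sketch only: how a per-context score is computed is not specified here.

interface RankedOption {
  slug: string;
  status: "active" | "watch" | "archived";
  score: number; // derived from evidence for this category or platform only
}

// Produces the ranked table for one context; archived items stay off default tables.
function rankWithinContext(options: RankedOption[]): RankedOption[] {
  return options
    .filter((o) => o.status !== "archived")
    .sort((a, b) => b.score - a.score);
}
```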

Comparison Model

Comparison is the core product behavior.

We especially care about:

  • pairwise comparisons between skills
  • pairwise comparisons between platforms
  • public posts where someone explicitly says one option is better than another

Strong comparison evidence includes:

  • "X is better than Y for this workflow"
  • "I switched from X to Y because..."
  • "Y is now outdated"
  • "X is official, but Y is more capable"

Status Model

Each skill or platform should have a status:

  • active
  • watch
  • archived

Active

Currently relevant and should appear in default tables.

Watch

Possibly relevant, but uncertain, stale, or not yet sufficiently evidenced.

Archived

Dead, superseded, stale, or clearly outperformed.

Archived items should not appear in default comparisons, but they should still have their own page and a visible reason for archival.

Evidence Model

We want public citations and backlinks, not unsupported claims.

Evidence sources can include:

  • official docs
  • release notes
  • maintainer posts
  • blog posts
  • X/Twitter posts
  • Reddit threads
  • YouTube videos

We especially want sources where people:

  • tried the skill
  • compared it against another option
  • showed output
  • explained tradeoffs

Source Credibility

Sources should be weighted by signal quality.

Suggested tiers:

Tier 1: Official

  • official docs
  • official repos
  • official release notes
  • creator posts

Tier 2: Expert

  • respected practitioners
  • credible researchers
  • detailed technical blog posts
  • known builders in the space

Tier 3: Community

  • Reddit threads
  • community X/Twitter posts
  • forum discussion

Tier 4: Weak — DISCARD by default

  • low-effort SEO pages
  • shallow reposts
  • generic AI-generated summaries
  • HN threads with fewer than 10 points or fewer than 5 substantive comments
  • drive-by opinions without specifics ("honestly easier with X" with no supporting detail)
  • isolated praise without comparison or usage evidence

Tier 4 signals should almost never appear in the product. If one is included, it must have an explicit justification for why it is still useful despite being weak.

Signal quality minimum bar

Every signal shipped in the product must meet ALL of these:

  1. Traction: visible community engagement (upvotes, replies, stars, retweets). A 2-comment HN thread is not a signal.

  2. Substance: a specific claim, comparison, experience report, or demonstrated output. Generic sentiment is not enough.

  3. Attribution: who said it matters. Maintainers and known builders outweigh anonymous drive-by comments.

  4. Freshness: signals older than 2 months are dead weight — cut them. If a tool is truly a leader, it will have fresh mentions and fresh comparisons. Do not grandfather stale signals just because they were once relevant.

  5. Visual proof: screenshots must show the product in use, not homepages or landing pages.

  6. Multiple independent sources required for ALL claims: a single source is worth nothing. Every claim shipped in the product — verdict, strength, weakness, comparison, capability statement — must be backed by at least 2-3 independent sources. Single-source claims are hypotheses at best. Label them as unverified or cut them.

  7. Engagement floor: if the source has few likes, few comments, few upvotes — it is not a signal. A blog post with zero comments, an HN thread with 2 replies, a tweet with no engagement — all noise. The public must have visibly reacted for it to count.

  8. Author reputation: if the author has no visible track record — no public projects, no history in the space, no reputation — the signal is suspect. Weight maintainers, known builders, and people with public work 10x over anonymous or low-profile commenters.

  9. Date tracking: every artifact — evidence items, live signals, observed outputs — MUST carry a date. Undated artifacts cannot be assessed for freshness and are treated as stale.

When in doubt, cut the signal. Three strong signals beat eight weak ones.
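
Parts of this bar can be checked mechanically at build time; substance, attribution, and visual proof still need editorial judgment. A sketch of the mechanical checks, with illustrative thresholds:

```ts
// Sketch: automates freshness, date tracking, the engagement floor, and the
// independent-source count. Thresholds are illustrative, not decided.

interface Signal {
  url: string;
  date?: string;           // ISO date; undated artifacts are treated as stale
  engagementCount: number; // upvotes + replies + stars + retweets, summed
}

const MAX_AGE_DAYS = 60;   // "signals older than 2 months are dead weight"
const MIN_ENGAGEMENT = 5;  // illustrative engagement floor

function passesMinimumBar(signal: Signal, now = new Date()): boolean {
  if (!signal.date) return false;
  const ageDays = (now.getTime() - Date.parse(signal.date)) / 86_400_000;
  if (ageDays > MAX_AGE_DAYS) return false;
  return signal.engagementCount >= MIN_ENGAGEMENT;
}

// A claim ships only if at least two independent signals clear the bar.
function claimIsVerified(signals: Signal[]): boolean {
  return signals.filter((s) => passesMinimumBar(s)).length >= 2;
}
```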

Promotional Material Rule

NEVER trust the product's own website, blog, or company review as strong evidence. Self-reported claims (official docs, launch posts, company blogs about their own product) may be referenced for factual details (feature lists, pricing, benchmarks they claim) but MUST be clearly marked as selfReported. They do NOT count toward multi-source verification. "Company X says their product is great" is noise. "Independent reviewer Y says X is great" is signal.

We should also classify evidence type:

  • direct use
  • comparison
  • announcement
  • tutorial
  • complaint
  • opinion
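
Combining tier, evidence type, and the self-reported flag, a source record might look like this (names and values are illustrative):

```ts
type SourceTier = "official" | "expert" | "community" | "weak";

type EvidenceType =
  | "direct-use"
  | "comparison"
  | "announcement"
  | "tutorial"
  | "complaint"
  | "opinion";

interface Source {
  url: string;
  tier: SourceTier;
  evidenceType: EvidenceType;
  selfReported: boolean; // the product's own docs, blog, or launch posts
  date: string;          // ISO date
}

// Self-reported sources never count toward multi-source verification,
// and weak (tier 4) sources are discarded by default.
const countsTowardVerification = (s: Source): boolean =>
  !s.selfReported && s.tier !== "weak";
```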

Discovery and Research Pipeline

The content pipeline should run locally on the machine, not on a hosted server.

This pipeline has four stages:

1. Discover

Goal:

  • detect new or newly relevant skills, platforms, and discussions

Inputs:

  • top entries from skill registries
  • web search
  • X/Twitter search
  • Reddit search
  • recent posts from the last day or week

Important:

  • we should not rely only on a fixed list of tracked accounts
  • search should be freshness-based and topic-based
  • we care more about top or hot items than exhaustive coverage

Typical discovery triggers:

  • new skill appears in top registry results
  • a known skill gets a major update
  • recent public comparisons appear
  • recent discussion suggests a skill is now hot or obsolete

2. Deep Dive

Goal:

  • gather structured evidence for a category, platform, or skill

Tasks:

  • collect official references
  • collect credible public commentary
  • collect pairwise comparisons
  • collect examples and quotes
  • collect reasons people prefer one option over another

Deep dive should happen for:

  • all items initially
  • later, only triggered items or stale pages

3. Rank

Goal:

  • update rankings and statuses

Tasks:

  • rank skills inside each platform
  • rank platforms inside each category
  • assign active, watch, or archived
  • update short verdicts
  • update which items appear in default comparison tables

4. Publish

Goal:

  • regenerate the website

Tasks:

  • update homepage
  • update category pages
  • update platform pages
  • update skill pages
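
Taken together, the four stages could run as one local script per scheduled rebuild. A sketch of the orchestration (stage names are placeholders, not an API):

```ts
// Sketch of the local pipeline as four sequential stages.
// Each stage is assumed to read and write Markdown content files.

type Stage<In, Out> = (input: In) => Promise<Out>;

declare const discover: Stage<void, string[]>;    // candidate skills, platforms, discussions
declare const deepDive: Stage<string[], unknown>; // structured evidence per candidate
declare const rank: Stage<unknown, unknown>;      // rankings, statuses, short verdicts
declare const publish: Stage<unknown, void>;      // regenerate all four page types

async function runPipeline(): Promise<void> {
  const candidates = await discover();
  const evidence = await deepDive(candidates);
  const rankings = await rank(evidence);
  await publish(rankings);
}
```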

Content Format

Content should live in Markdown with frontmatter.

Reasons:

  • easy to edit manually
  • easy for agents to generate
  • easy to render statically
  • good fit for research-heavy pages

Frontmatter should be type-checked.

Recommended approach:

  • Markdown files for categories, platforms, skills, and sources
  • TypeScript schemas with Zod for validation

We should avoid introducing extra YAML files where a Markdown document is enough.
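
A sketch of what frontmatter validation could look like for a skill file, assuming Zod for the schema and gray-matter for parsing (gray-matter is an assumption here, not part of this spec):

```ts
import { z } from "zod";
import matter from "gray-matter"; // assumed frontmatter parser, not specified above

// Illustrative frontmatter schema for a skill document.
export const skillFrontmatter = z.object({
  name: z.string(),
  slug: z.string(),
  platform: z.string(),                            // platform slug
  categories: z.array(z.string()),                 // category slugs
  status: z.enum(["active", "watch", "archived"]),
  verdict: z.string(),                             // one-line editorial verdict
  lastReviewed: z.coerce.date(),                   // every artifact carries a date
});

export type SkillFrontmatter = z.infer<typeof skillFrontmatter>;

// Parse and validate one Markdown file at build time; throws on invalid frontmatter.
export function loadSkillFrontmatter(markdown: string): SkillFrontmatter {
  return skillFrontmatter.parse(matter(markdown).data);
}
```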

Search

Search should be very simple in V1.

Requirements:

  • client-side only
  • no backend search service
  • no vector database
  • basic string similarity / keyword matching

This is enough for initial navigation.
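
A client-side keyword match over a small prebuilt index is likely enough. A sketch (index shape and scoring are illustrative):

```ts
// Sketch of client-side keyword matching; no backend, no vector database.

interface SearchEntry {
  title: string;
  keywords: string[];
  url: string;
}

export function search(query: string, index: SearchEntry[]): SearchEntry[] {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  if (terms.length === 0) return [];
  return index
    .map((entry) => {
      const haystack = [entry.title, ...entry.keywords].join(" ").toLowerCase();
      const score = terms.filter((term) => haystack.includes(term)).length;
      return { entry, score };
    })
    .filter((result) => result.score > 0)
    .sort((a, b) => b.score - a.score)
    .map((result) => result.entry);
}
```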

Design Direction

The site should feel sharp and editorial, not like a generic docs template.

Principles:

  • strong typography
  • intentional contrast
  • minimal but expressive layout
  • hover states that reveal useful comparison details
  • evidence and ranking should be visually prominent

We should avoid generic AI-generated landing page aesthetics.

Stack Choice

Chosen stack:

  • Next.js
  • Tailwind CSS v4
  • shadcn/ui

Reasoning:

  • strongest tool support across coding agents
  • best generation quality from modern AI coding tools
  • easy static-friendly content site
  • good fit for a custom editorial UI

We are not using a monorepo for now.

V1 will be a single Next.js application.

Publishing Infrastructure

Chosen deployment target:

  • Vercel

Reasoning:

  • best default fit for Next.js
  • simplest deployment path
  • easy preview deployments
  • enough for a static-first content product

The research and content generation pipeline still runs locally. Vercel is only for publishing the built site.

Initial Open Questions

These are still open and can be refined after the first real category pages:

  • which initial categories should appear on the homepage
  • how much detail should appear directly on the homepage versus category pages
  • whether platform pages should be fully first-class in navigation or mostly supporting pages
  • how much of the source material should be embedded inline versus linked out
  • how we visually show a pairwise comparison without making tables too heavy

V1 Success Criteria

V1 is successful if a user can:

  • land on the homepage
  • identify their category quickly
  • see the best options ranked
  • understand why the top option wins
  • inspect real public evidence
  • navigate to a narrower platform or skill page if needed

The website should feel useful even before formal local benchmarks exist.