Documentation
Methodology
How SkillPack measures, scores, and ranks every solution in the catalog. All metrics are computed from public data — GitHub APIs, package registries, and community signals.
Overview
What we measure and why
Every solution in the SkillPack catalog is evaluated across multiple dimensions: trust, complexity, tier, type, evidence quality, and community traction. The goal is to give developers a fast, honest signal about whether a solution is worth adopting — backed by verifiable public data, not marketing claims.
We collect data from GitHub (stars, push activity, languages, contributors), package registries (npm, PyPI weekly downloads), and social platforms (Hacker News mentions). On top of raw data, our research pipeline produces editorial evidence and rankings reviewed by human editors.
Trust Score
0–100 · Computed
Tier
4 levels · LLM-classified
Complexity
1–5 · Repo analysis
Evidence
Strong / Moderate · Editorial
Stars & Downloads
Absolute · API
Mentions
7-day window · HN Algolia
Trust Score
Composite health signal (0–100)
The trust score is a weighted composite of five sub-scores, each normalized to 0–100. It is computed at display time from the latest data — never cached or stale.
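Putting the five components together, a minimal TypeScript sketch of the composite. The 30% freshness weight is documented below; the remaining weights are illustrative assumptions, not SkillPack's actual values.

```typescript
// Sketch of the trust-score composite. Only the 30% freshness weight is
// documented; the other weights are assumptions for illustration.
interface SubScores {
  freshness: number; // 0–100
  community: number; // 0–100
  adoption: number;  // 0–100
  evidence: number;  // 0–100
  momentum: number;  // 0–100
}

const WEIGHTS = {
  freshness: 0.30, // documented: heaviest-weighted component
  community: 0.20, // assumed
  adoption: 0.20,  // assumed
  evidence: 0.20,  // assumed
  momentum: 0.10,  // assumed
};

function trustScore(s: SubScores): number {
  const raw =
    s.freshness * WEIGHTS.freshness +
    s.community * WEIGHTS.community +
    s.adoption * WEIGHTS.adoption +
    s.evidence * WEIGHTS.evidence +
    s.momentum * WEIGHTS.momentum;
  return Math.round(raw);
}
```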
Freshness component
Based on how recently the repository received a push. Actively maintained projects score highest.
| Last push | Score |
|---|---|
| < 7 days | 100 |
| < 30 days | 85 |
| < 90 days | 60 |
| < 180 days | 30 |
| ≥ 180 days | 0 |
| No data | 50 (neutral) |
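The table above translates directly into a lookup; a minimal sketch, with a null date standing in for the no-data case:

```typescript
// Freshness sub-score from days since the repository's last push,
// per the threshold table above. Missing data scores a neutral 50.
function freshnessScore(daysSincePush: number | null): number {
  if (daysSincePush === null) return 50; // no data: neutral
  if (daysSincePush < 7) return 100;
  if (daysSincePush < 30) return 85;
  if (daysSincePush < 90) return 60;
  if (daysSincePush < 180) return 30;
  return 0; // ≥ 180 days: effectively abandoned
}
```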
Community component
Measures star velocity — the growth rate between the two most recent data points. Uses a continuous scale instead of step thresholds. Falls back to log-scale absolute star count when insufficient history.
| Growth % | Score |
|---|---|
| ≥ 20% | 100 |
| 0–20% | 45–100 (continuous) |
| Declining | 20–45 (continuous) |
Fallback (fewer than 2 data points): uses log-scale based on absolute star count. 100 stars → ~40, 1K → ~60, 10K → ~80, 50K+ → 100.
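A sketch of this logic. The table fixes only the band endpoints, so the linear interpolation inside the 45–100 and 20–45 bands, the -20% point at which the declining band bottoms out, and the neutral 50 for a zero baseline are all assumptions:

```typescript
// Community sub-score from a star-count history (oldest to newest).
// Interpolation slopes and the zero-baseline fallback are assumptions;
// the documented values are the band endpoints and the log-scale anchors.
function communityScore(history: number[]): number {
  if (history.length < 2) {
    // Fallback: log scale on absolute stars (100 → ~40, 1K → ~60, 10K → ~80)
    const stars = history[0] ?? 0;
    if (stars >= 50_000) return 100;
    if (stars < 10) return 20; // assumed floor for near-zero repos
    return Math.min(100, 20 * Math.log10(stars));
  }
  const [prev, latest] = history.slice(-2);
  if (prev === 0) return 50; // assumed neutral when there is no baseline
  const growth = (latest - prev) / prev; // 0.20 = 20% growth
  if (growth >= 0.20) return 100;
  if (growth >= 0) return 45 + (growth / 0.20) * 55;   // 0–20% → 45–100
  return Math.max(20, 45 + (growth / 0.20) * 55);      // declining → 20–45
}
```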
Adoption component
Weekly download count from npm or PyPI, scored on a continuous log scale. For solutions without a package (GitHub-only tools), GitHub stars are used as an adoption proxy.
| Condition | Score |
|---|---|
| ≥ 100,000 weekly downloads | 100 |
| < 100,000 weekly downloads | 20 + 20 × log₁₀(downloads), capped at 100 |
| No package, ≥ 10K stars | 70 |
| No package, ≥ 1K stars | 55 |
| No package, ≥ 100 stars | 40 |
| No package, < 100 stars | 30 |
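The table above as a function. The explicit cap at 100 for the log-scale branch follows from the ≥ 100,000 row:

```typescript
// Adoption sub-score: weekly downloads on a log scale, or GitHub stars
// as a proxy for solutions that ship no package.
function adoptionScore(weeklyDownloads: number | null, stars: number): number {
  if (weeklyDownloads !== null && weeklyDownloads > 0) {
    if (weeklyDownloads >= 100_000) return 100;
    // 20 + 20·log10(downloads), capped at 100
    return Math.min(100, 20 + 20 * Math.log10(weeklyDownloads));
  }
  // No package: fall back to star tiers
  if (stars >= 10_000) return 70;
  if (stars >= 1_000) return 55;
  if (stars >= 100) return 40;
  return 30;
}
```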
Evidence component
Based on the ratio of strong evidence to total evidence items collected by our research pipeline. Scales from 30 (all moderate) to 100 (all strong).
If a solution has zero evidence items, this component scores 40 (neutral) rather than 0 — absence of evidence is not punitive.
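The two rules above as a function, with a linear scale between the documented endpoints:

```typescript
// Evidence sub-score: ratio of strong items to total items,
// scaled 30 (all moderate) to 100 (all strong); 40 when no evidence exists.
function evidenceScore(strong: number, total: number): number {
  if (total === 0) return 40; // absence of evidence is not punitive
  const ratio = strong / total; // 0 = all moderate, 1 = all strong
  return 30 + 70 * ratio;
}
```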
Momentum component
Rewards solutions with accelerating growth. Compares the most recent star growth period against the one before it. Accelerating growth scores up to 100, decelerating growth scores down to 20. Solutions with fewer than 3 data points score 50 (neutral).
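A sketch of the momentum comparison. The text fixes only the endpoints (up to 100 when accelerating, down to 20 when decelerating, 50 for short histories); the linear mapping and the ±10-percentage-point span are assumptions:

```typescript
// Momentum sub-score: compares the latest growth period against the
// previous one. The linear map around a 60 midpoint is an assumption.
function momentumScore(history: number[]): number {
  if (history.length < 3) return 50; // fewer than 3 data points: neutral
  const [a, b, c] = history.slice(-3);
  const prevGrowth = a > 0 ? (b - a) / a : 0;
  const lastGrowth = b > 0 ? (c - b) / b : 0;
  const delta = lastGrowth - prevGrowth; // > 0 accelerating, < 0 decelerating
  // Assumed: ±10 percentage points of growth-rate change spans the range.
  const score = 60 + (delta / 0.10) * 40;
  return Math.max(20, Math.min(100, score));
}
```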
Hard caps
Regardless of sub-scores, the final trust score is capped in certain conditions.
Tier
How many moving parts
Tier measures the architectural scope of a skill — from a single focused capability to a bundle of coordinated skills. Classification is performed by an LLM analyzing each skill's name and summary.
- Tier 1: Single focused capability. Does one thing well. Example: a linter rule, a single MCP tool, a formatting script.
- Tier 2: Multi-step workflow combining several operations. Example: a code review bot, a test generator with coverage analysis.
- Tier 3: Coordinates multiple tools with decision logic. Example: a coding CLI that plans, edits, tests, and iterates autonomously.
- Tier 4: Bundle of multiple skills working as a suite. Example: a full development environment with linting, testing, deploy, and monitoring.
Skill Type
What the skill does
Each skill is classified into one of four functional types by an LLM analyzing its name, summary, and description.
- Adds domain knowledge — patterns, conventions, best practices. Teaches the agent how to think about a domain.
- Produces output — code, files, content, scaffolding. The agent creates something tangible.
- Prevents mistakes — scanning, linting, auditing, enforcement. The safety net.
- Bridges to external services — MCP servers, APIs, CLI tools. Extends what the agent can reach.
Complexity
Setup & expertise required (1–5)
Complexity is derived from the repository's codebase size (total bytes across all languages) and the number of programming languages used. It approximates how much effort is needed to understand, configure, and run the skill.
| Level | Condition | Visual |
|---|---|---|
| 5 | > 500 KB + 4+ languages | ●●●●● |
| 4 | > 200 KB + 3+ languages | ●●●●○ |
| 3 | > 50 KB + 2+ languages | ●●●○○ |
| 2 | > 10 KB | ●●○○○ |
| 1 | ≤ 10 KB | ●○○○○ |
Data is collected from the GitHub Languages API, which returns a byte count per language for each repository.
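The thresholds above as a function over a Languages API response, which maps each language to a byte count. The KB thresholds are taken as decimal kilobytes, which is an assumption:

```typescript
// Complexity level (1–5) from the GitHub Languages API response,
// e.g. { TypeScript: 412_000, CSS: 9_000 }. KB = 1,000 bytes (assumed).
function complexityLevel(languages: Record<string, number>): number {
  const totalBytes = Object.values(languages).reduce((sum, b) => sum + b, 0);
  const langCount = Object.keys(languages).length;
  if (totalBytes > 500_000 && langCount >= 4) return 5;
  if (totalBytes > 200_000 && langCount >= 3) return 4;
  if (totalBytes > 50_000 && langCount >= 2) return 3;
  if (totalBytes > 10_000) return 2;
  return 1;
}
```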
Evidence Quality
Strong vs. moderate sources
Every skill's assessment is backed by collected evidence — URLs to public artifacts like HN threads, GitHub issues, blog posts, and benchmark reports. Each evidence item is tagged with a quality level.
Strong: HN threads with high engagement (100+ points), published third-party benchmarks, public GitHub issues showing real adoption, independent reviews from recognized publications.
Moderate: Company blog posts, press releases, forum discussions with limited engagement, conference talks by the skill's own team. These may carry bias but still contribute useful signal.
Evidence is collected during the research pipeline's Deep-Dive stage. Each item records the source URL, date, engagement level, author, and a one-line gist.
Rankings & Cut Line
Editorial ranking per category
Within each category, skills are ranked by an editorial process that weighs evidence quality, real-world usage, recency, and direct workflow fit. The ranking is opinionated — it reflects our assessment of which tools actually deliver value.
Ranking criteria
- Official vendor support and active maintenance
- Workflow fit — does it solve a real problem in its category?
- Public trust signals — community adoption, independent reviews
- Recency — recent releases and active development
- Demonstrability — can the skill be shown working in practice?
The cut line
Skills that don't meet the quality threshold are placed below the cut line. These are still tracked and displayed (at reduced opacity) but are not recommended for adoption. A skill can move above the cut line when new evidence or traction warrants it.
GitHub Stars
Community interest signal
Star counts are fetched from the GitHub REST API (GET /repos/{owner}/{repo}) and stored as monthly snapshots. The absolute count is displayed alongside a trend arrow showing growth direction.
Stars are a useful but imperfect signal. They indicate awareness, not adoption. That's why stars are just one component of the trust score — weighted alongside downloads and evidence.
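A sketch of the snapshot fetch, assuming Node 18+ with a global fetch; the function names are ours, and only the two fields the docs reference are read:

```typescript
// Builds the GitHub REST API URL for GET /repos/{owner}/{repo}.
function repoUrl(owner: string, repo: string): string {
  return `https://api.github.com/repos/${owner}/${repo}`;
}

// Fetches the star count and last-push timestamp for one repository.
// The token is optional but raises the unauthenticated rate limit.
async function fetchRepoSignals(owner: string, repo: string, token?: string) {
  const res = await fetch(repoUrl(owner, repo), {
    headers: {
      Accept: "application/vnd.github+json",
      ...(token ? { Authorization: `Bearer ${token}` } : {}),
    },
  });
  if (!res.ok) throw new Error(`GitHub API returned ${res.status}`);
  const data = await res.json();
  return {
    stars: data.stargazers_count as number, // absolute star count
    pushedAt: data.pushed_at as string,     // feeds the freshness metric
  };
}
```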
Downloads
Weekly install counts
We collect weekly download counts from two package registries: npm for JavaScript packages and PyPI for Python packages.
Download data is stored as monthly snapshots and feeds into both the trust score's adoption component and the trend calculation.
Social Mentions
Hacker News signal
We track how often a skill is mentioned on Hacker News using the Algolia Search API. Only stories (not comments) with more than 5 points in the last 7 days are counted — this filters out noise and self-promotion.
Mention counts are stored as monthly snapshots and displayed as time-series charts on category pages.
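A sketch of the corresponding query URL. The function name is ours; the endpoint and the tags/numericFilters parameters are the public Algolia HN Search API:

```typescript
// Builds an HN Algolia search URL for stories only (not comments)
// with more than 5 points in the last 7 days.
function hnMentionsUrl(query: string, now: number = Date.now()): string {
  const weekAgo = Math.floor(now / 1000) - 7 * 24 * 60 * 60; // unix seconds
  const params = new URLSearchParams({
    query,
    tags: "story",                                   // exclude comments
    numericFilters: `points>5,created_at_i>${weekAgo}`, // engagement + recency
  });
  return `https://hn.algolia.com/api/v1/search?${params.toString()}`;
}
```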
Trends
Direction arrows (↑ ↓ —)
Trend arrows show the growth direction between the two most recent data points in any time-series metric (stars, downloads, mentions).
| Change | Direction | Arrow |
|---|---|---|
| ≥ +5% | Up | ↑ (green) |
| -5% to +5% | Flat | — (gray) |
| ≤ -5% | Down | ↓ (red) |
If fewer than 2 data points exist or the previous value is zero, no trend is shown.
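The rules above as a function, returning null when no trend can be shown:

```typescript
// Trend direction from the two most recent data points, per the table.
type Trend = "up" | "flat" | "down";

function trendArrow(series: number[]): Trend | null {
  if (series.length < 2) return null; // not enough data
  const [prev, latest] = series.slice(-2);
  if (prev === 0) return null;        // previous value is zero: no trend
  const change = (latest - prev) / prev;
  if (change >= 0.05) return "up";    // ≥ +5%
  if (change <= -0.05) return "down"; // ≤ -5%
  return "flat";                      // within ±5%
}
```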
Freshness
Last push activity
Freshness is simply the number of days since the repository's last push, fetched from the GitHub API's pushed_at field.
| Days since push | Display |
|---|---|
| 0 | "today" (green) |
| 1–7 | "Nd ago" (green) |
| 8–30 | "Nw ago" (gray) |
| 31+ | "Nmo ago" (dim) |
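The display mapping as a function. The table gives only the units, so the rounding at week and month boundaries is an assumption:

```typescript
// Human-readable freshness label from days since last push, per the
// display table above. Rounding within buckets is assumed.
function freshnessLabel(days: number): string {
  if (days === 0) return "today";
  if (days <= 7) return `${days}d ago`;                 // 1–7 days
  if (days <= 30) return `${Math.round(days / 7)}w ago`; // 8–30 days
  return `${Math.round(days / 30)}mo ago`;               // 31+ days
}
```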
Freshness is the heaviest-weighted component (30%) of the trust score because an abandoned repository is the single strongest negative signal for a skill's reliability.
Research Pipeline
How data is collected
All research is orchestrated by Ralph — our automated pipeline that runs per-category through six stages. Each stage spawns a dedicated AI agent with specific tools and constraints.
1. Web search, HN Algolia, GitHub trending, registry checks. Finds all serious contenders and new signals.
2. Builds measurable evidence for every contender. Collects engagement metrics, official artifacts, and usage evidence. Hard quality gates.
3. Editorial ranking per category. Top skills recommended, the rest placed below the cut line. Evidence-first, opinionated.
4. Reads rank findings and updates the catalog — evidence, rankings, verdicts, and signals.
5. Collects GitHub stars, npm/PyPI downloads, and HN mentions via APIs. No AI needed.
6. Builds the project and runs link checks. Catches broken TypeScript or dead URLs before deploy.
The pipeline can run all categories in parallel with configurable concurrency. A full sweep across all categories takes roughly 2–5 hours depending on parallelism.