Methodology

How SkillPack measures, scores, and ranks every solution in the catalog. All metrics are computed from public data — GitHub APIs, package registries, and community signals.

Overview

What we measure and why

Every solution in the SkillPack catalog is evaluated across multiple dimensions: trust, complexity, tier, type, evidence quality, and community traction. The goal is to give developers a fast, honest signal about whether a solution is worth adopting — backed by verifiable public data, not marketing claims.

We collect data from GitHub (stars, push activity, languages, contributors), package registries (npm, PyPI weekly downloads), and social platforms (Hacker News mentions). On top of raw data, our research pipeline produces editorial evidence and rankings reviewed by human editors.

Trust Score: 0–100 · Computed
Tier: 4 levels · LLM-classified
Complexity: 1–5 · Repo analysis
Evidence: Strong / Moderate · Editorial
Stars & Downloads: Absolute · API
Mentions: 7-day window · HN Algolia

Trust Score

Composite health signal (0–100)

The trust score is a weighted composite of five sub-scores, each normalized to 0–100. It is computed at display time from the latest data — never cached or stale.

score = (freshness × 0.25) + (community × 0.25) + (adoption × 0.25) + (evidence × 0.15) + (momentum × 0.10)
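The weighted sum above can be sketched directly. The interface and function names below are illustrative, not SkillPack's actual code; each sub-score is assumed to arrive already normalized to 0–100.

```typescript
// Weighted composite of the five sub-scores, per the formula above.
interface SubScores {
  freshness: number;
  community: number;
  adoption: number;
  evidence: number;
  momentum: number;
}

function trustScore(s: SubScores): number {
  const raw =
    s.freshness * 0.25 +
    s.community * 0.25 +
    s.adoption * 0.25 +
    s.evidence * 0.15 +
    s.momentum * 0.1;
  return Math.round(raw);
}
```

With every sub-score at 100 the composite is exactly 100, since the weights sum to 1.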

Freshness component

Based on how recently the repository received a push. Actively maintained projects score highest.

Last push | Score
< 7 days | 100
< 30 days | 85
< 90 days | 60
< 180 days | 30
≥ 180 days | 0
No data | 50 (neutral)
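The step thresholds above translate into a straightforward mapping; a minimal sketch (the function name is ours):

```typescript
// Map days since last push to a 0–100 freshness score, per the table above.
// A missing pushed_at date scores a neutral 50.
function freshnessScore(daysSincePush: number | null): number {
  if (daysSincePush === null) return 50; // no data: neutral
  if (daysSincePush < 7) return 100;
  if (daysSincePush < 30) return 85;
  if (daysSincePush < 90) return 60;
  if (daysSincePush < 180) return 30;
  return 0;
}
```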

Community component

Measures star velocity — the growth rate between the two most recent data points — on a continuous scale instead of step thresholds. Falls back to a log-scale score based on absolute star count when there is insufficient history.

Growth % | Score
≥ 20% | 100
0–20% | 45–100 (continuous)
Declining | 20–45 (continuous)

Fallback (fewer than 2 data points): uses log-scale based on absolute star count. 100 stars → ~40, 1K → ~60, 10K → ~80, 50K+ → 100.
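The document specifies only the band endpoints, so the linear interpolation inside each band below is an assumption; the log fallback of 20·log₁₀(stars), clamped to 100 at 50K+, approximately reproduces the stated anchors.

```typescript
// Community score from star growth between the two most recent snapshots.
// Interpolation inside each band is an assumption; only the band endpoints
// (0–20% → 45–100, declining → 20–45) are documented.
function communityScore(growthPct: number): number {
  if (growthPct >= 20) return 100;
  if (growthPct >= 0) return 45 + (growthPct / 20) * 55;
  // Declining: clamp at -20% for the bottom of the 20–45 band (assumption).
  const clamped = Math.max(growthPct, -20);
  return 45 + (clamped / 20) * 25;
}

// Fallback with fewer than 2 data points: log scale on absolute stars.
// 100 → 40, 1K → 60, 10K → 80; 50K+ clamps to 100 per the stated anchors.
function communityFallback(stars: number): number {
  if (stars < 1) return 0;
  if (stars >= 50_000) return 100;
  return Math.min(100, 20 * Math.log10(stars));
}
```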

Adoption component

Weekly download count from npm or PyPI, scored on a continuous log scale. For solutions without a package (GitHub-only tools), GitHub stars are used as an adoption proxy.

Weekly downloads | Score
≥ 100,000 | 100
Log scale | 20 + 20 × log₁₀(downloads)
No package, ≥10K stars | 70
No package, ≥1K stars | 55
No package, ≥100 stars | 40
No package, <100 stars | 30
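Both branches of the table fit in one function; a sketch under the assumption that a missing package is signalled by a null download count:

```typescript
// Adoption score: log-scaled weekly downloads, or a star-based proxy when
// the solution ships no package, per the table above.
function adoptionScore(weeklyDownloads: number | null, stars: number): number {
  if (weeklyDownloads !== null && weeklyDownloads > 0) {
    if (weeklyDownloads >= 100_000) return 100;
    return Math.min(100, 20 + 20 * Math.log10(weeklyDownloads));
  }
  // GitHub-only fallback: stars as an adoption proxy.
  if (stars >= 10_000) return 70;
  if (stars >= 1_000) return 55;
  if (stars >= 100) return 40;
  return 30;
}
```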

Evidence component

Based on the ratio of strong evidence to total evidence items collected by our research pipeline. Scales from 30 (all moderate) to 100 (all strong).

evidence_score = 30 + (strong_count / total_count) × 70

If a solution has zero evidence items, this component scores 40 (neutral) rather than 0 — absence of evidence is not punitive.
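The formula and the zero-evidence rule combine into a two-line function:

```typescript
// Evidence score from the strong/total ratio, per the formula above.
// Zero evidence items score a neutral 40 rather than 0.
function evidenceScore(strongCount: number, totalCount: number): number {
  if (totalCount === 0) return 40; // absence of evidence is not punitive
  return 30 + (strongCount / totalCount) * 70;
}
```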

Momentum component

Rewards solutions with accelerating growth. Compares the most recent star growth period against the one before it. Accelerating growth scores up to 100, decelerating growth scores down to 20. Solutions with fewer than 3 data points score 50 (neutral).
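Only the endpoints of this mapping are documented (accelerating → up to 100, decelerating → down to 20, fewer than 3 points → 50), so the ratio-based ramp below is an assumption:

```typescript
// Momentum from the ratio of the latest growth period to the one before it.
// The linear ramp (ratio 0 → 20, 1 → 60, ≥2 → 100) is illustrative only.
function momentumScore(starHistory: number[]): number {
  if (starHistory.length < 3) return 50; // not enough history: neutral
  const n = starHistory.length;
  const recent = starHistory[n - 1] - starHistory[n - 2];
  const prior = starHistory[n - 2] - starHistory[n - 3];
  if (prior <= 0) return recent > 0 ? 100 : 50;
  const ratio = recent / prior; // >1 accelerating, <1 decelerating
  return Math.max(20, Math.min(100, 20 + ratio * 40));
}
```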

Hard caps

Regardless of sub-scores, the final trust score is capped in certain conditions:

CAP ≤ 10 | Repository is archived on GitHub
CAP ≤ 30 | Last push was more than 365 days ago
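The caps apply after the weighted composite; a sketch (function name ours):

```typescript
// Apply hard caps to a computed trust score, per the table above.
function applyCaps(score: number, archived: boolean, daysSincePush: number): number {
  if (archived) return Math.min(score, 10);
  if (daysSincePush > 365) return Math.min(score, 30);
  return score;
}
```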

Display colors

≥ 80 | Healthy
50–79 | Caution
< 50 | At Risk

Tier

How many moving parts

Tier measures the architectural scope of a skill — from a single focused capability to a bundle of coordinated skills. Classification is performed by an LLM analyzing each skill's name and summary.

Atomic

Single focused capability. Does one thing well.

Example: A linter rule, a single MCP tool, a formatting script

Composite

Multi-step workflow combining several operations.

Example: Code review bot, test generator with coverage analysis

Orchestrator

Coordinates multiple tools with decision logic.

Example: Coding CLI that plans, edits, tests, and iterates autonomously

Pack

Bundle of multiple skills working as a suite.

Example: Full development environment with linting, testing, deploy, and monitoring

Skill Type

What the skill does

Each skill is classified into one of four functional types by an LLM analyzing its name, summary, and description.

Expertise

Adds domain knowledge — patterns, conventions, best practices. Teaches the agent how to think about a domain.

Generator

Produces output — code, files, content, scaffolding. The agent creates something tangible.

Guardian

Prevents mistakes — scanning, linting, auditing, enforcement. The safety net.

Connector

Bridges to external services — MCP servers, APIs, CLI tools. Extends what the agent can reach.

Complexity

Setup & expertise required (1–5)

Complexity is derived from the repository's codebase size (total bytes across all languages) and the number of programming languages used. It approximates how much effort is needed to understand, configure, and run the skill.

complexity = f(total_code_bytes, language_count)
Level | Condition | Visual
5 | > 500 KB + 4+ languages | ●●●●●
4 | > 200 KB + 3+ languages | ●●●●○
3 | > 50 KB + 2+ languages | ●●●○○
2 | > 10 KB | ●●○○○
1 | ≤ 10 KB | ●○○○○

Data is collected from the GitHub Languages API, which returns a byte count per language for each repository.
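Evaluated top-down, the table becomes a short cascade. Whether "KB" here means 1000 or 1024 bytes is not stated; the sketch assumes 1024.

```typescript
// Complexity level from total code bytes and language count, checked from
// the highest level down so the first matching condition wins.
function complexityLevel(totalBytes: number, languageCount: number): number {
  const KB = 1024; // assumption: binary kilobytes
  if (totalBytes > 500 * KB && languageCount >= 4) return 5;
  if (totalBytes > 200 * KB && languageCount >= 3) return 4;
  if (totalBytes > 50 * KB && languageCount >= 2) return 3;
  if (totalBytes > 10 * KB) return 2;
  return 1;
}
```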

Evidence Quality

Strong vs. moderate sources

Every skill's assessment is backed by collected evidence — URLs to public artifacts like HN threads, GitHub issues, blog posts, and benchmark reports. Each evidence item is tagged with a quality level.

Strong: Independent, verifiable artifacts

HN threads with high engagement (100+ points), published third-party benchmarks, public GitHub issues showing real adoption, independent reviews from recognized publications.

Moderate: Secondary or self-reported sources

Company blog posts, press releases, forum discussions with limited engagement, conference talks by the skill's own team. These may carry bias but still contribute useful signal.

Evidence is collected during the research pipeline's Deep-Dive stage. Each item records the source URL, date, engagement level, author, and a one-line gist.

Rankings & Cut Line

Editorial ranking per category

Within each category, skills are ranked by an editorial process that weighs evidence quality, real-world usage, recency, and direct workflow fit. The ranking is opinionated — it reflects our assessment of which tools actually deliver value.

Ranking criteria

  • Official vendor support and active maintenance
  • Workflow fit — does it solve a real problem in its category?
  • Public trust signals — community adoption, independent reviews
  • Recency — recent releases and active development
  • Demonstrability — can the skill be shown working in practice?

The cut line

Skills that don't meet the quality threshold are placed below the cut line. These are still tracked and displayed (at reduced opacity) but are not recommended for adoption. A skill can move above the cut line when new evidence or traction warrants it.

Rank 3 | Recommended skill
(cut line)
Rank 4 | Not recommended yet

GitHub Stars

Community interest signal

Star counts are fetched from the GitHub REST API (GET /repos/{owner}/{repo}) and stored as monthly snapshots. The absolute count is displayed alongside a trend arrow showing growth direction.

Stars are a useful but imperfect signal. They indicate awareness, not adoption. That's why stars are just one component of the trust score — weighted alongside downloads and evidence.

Downloads

Weekly install counts

We collect weekly download counts from two package registries:

npm | api.npmjs.org/downloads/point/last-week/{package}
PyPI | pypistats.org/api/packages/{package}/recent

Download data is stored as monthly snapshots and feeds into both the trust score's adoption component and the trend calculation.
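A collector against these endpoints reduces to URL building plus response parsing. The response shapes below follow the public APIs as commonly documented (npm returns `{ downloads }`, pypistats returns `{ data: { last_week } }`) but should be verified against live responses.

```typescript
// Registry endpoints for weekly download counts, per the table above.
const npmUrl = (pkg: string) =>
  `https://api.npmjs.org/downloads/point/last-week/${encodeURIComponent(pkg)}`;
const pypiUrl = (pkg: string) =>
  `https://pypistats.org/api/packages/${encodeURIComponent(pkg)}/recent`;

// Extract the weekly count from each registry's JSON body (assumed shapes).
function parseNpmDownloads(body: { downloads: number }): number {
  return body.downloads;
}
function parsePypiDownloads(body: { data: { last_week: number } }): number {
  return body.data.last_week;
}
```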

Social Mentions

Hacker News signal

We track how often a skill is mentioned on Hacker News using the Algolia Search API. Only stories (not comments) with more than 5 points in the last 7 days are counted — this filters out noise and self-promotion.

HN Algolia → tags=story, points > 5, created within last 7 days

Mention counts are stored as monthly snapshots and displayed as time-series charts on category pages.
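The filter above maps onto the HN Algolia Search API's `tags` and `numericFilters` parameters; a sketch of the URL construction (the function name is ours):

```typescript
// Build an HN Algolia search URL matching the filter above: stories only,
// more than 5 points, created within the last 7 days.
function hnMentionsUrl(query: string, now: Date = new Date()): string {
  const weekAgo = Math.floor(now.getTime() / 1000) - 7 * 24 * 60 * 60;
  const filters = `points>5,created_at_i>${weekAgo}`;
  return (
    "https://hn.algolia.com/api/v1/search?" +
    `query=${encodeURIComponent(query)}` +
    `&tags=story&numericFilters=${encodeURIComponent(filters)}`
  );
}
```

The response's `nbHits` field would then give the mention count for the snapshot.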

Trends

Direction arrows (↑ ↓ —)

Trend arrows show the growth direction between the two most recent data points in any time-series metric (stars, downloads, mentions).

pct = ((latest - previous) / previous) × 100
Change | Direction | Arrow
≥ +5% | Up | ↑ (green)
-5% to +5% | Flat | — (gray)
≤ -5% | Down | ↓ (red)

If fewer than 2 data points exist or the previous value is zero, no trend is shown.
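The percentage formula, the thresholds, and the no-trend rule fit in one small function:

```typescript
type Trend = "up" | "down" | "flat" | null;

// Trend direction between the two most recent data points; null when there
// is insufficient history or the previous value is zero.
function trend(series: number[]): Trend {
  if (series.length < 2) return null;
  const [prev, latest] = series.slice(-2);
  if (prev === 0) return null;
  const pct = ((latest - prev) / prev) * 100;
  if (pct >= 5) return "up";
  if (pct <= -5) return "down";
  return "flat";
}
```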

Freshness

Last push activity

Freshness is simply the number of days since the repository's last push, fetched from the GitHub API's pushed_at field.

Days since push | Display
0 | "today" (green)
1–7 | "Nd ago" (green)
8–30 | "Nw ago" (gray)
31+ | "Nmo ago" (dim)
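A display formatter for this table might look as follows; how N is rounded for the week and month labels is not stated, so the rounding here is an assumption.

```typescript
// Format days-since-push for display, per the table above.
function freshnessLabel(days: number): string {
  if (days === 0) return "today";
  if (days <= 7) return `${days}d ago`;
  if (days <= 30) return `${Math.round(days / 7)}w ago`; // rounding assumed
  return `${Math.round(days / 30)}mo ago`; // rounding assumed
}
```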

Freshness is tied for the heaviest weight in the trust score (25%, alongside community and adoption) because an abandoned repository is the single strongest negative signal for a skill's reliability.

Research Pipeline

How data is collected

All research is orchestrated by Ralph — our automated pipeline that runs per-category through six stages. Each stage spawns a dedicated AI agent with specific tools and constraints.

1. Discover (~5 min)

Web search, HN Algolia, GitHub trending, registry checks. Finds all serious contenders and new signals.

2. Deep-Dive (~15–25 min)

Builds measurable evidence for every contender. Collects engagement metrics, official artifacts, and usage evidence. Hard quality gates.

3. Rank (~5 min)

Editorial ranking per category. Top skills recommended, rest placed below the cut line. Evidence-first, opinionated.

4. Catalog Update (~5 min)

Reads rank findings and updates the catalog — evidence, rankings, verdicts, and signals.

5. Metrics (~1 min)

Collects GitHub stars, npm/PyPI downloads, and HN mentions via APIs. No AI needed.

6. QA (~2 min)

Builds the project and runs link checks. Catches broken TypeScript or dead URLs before deploy.

The pipeline can run all categories in parallel with configurable concurrency. A full sweep across all categories takes roughly 2–5 hours depending on parallelism.
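Bounded parallelism over categories is a standard worker-pool pattern; a sketch (the actual orchestrator, Ralph, is not public, so this is illustrative only):

```typescript
// Run a per-category task over all categories with at most `concurrency`
// tasks in flight, preserving input order in the results.
async function runAll<T>(
  categories: string[],
  run: (category: string) => Promise<T>,
  concurrency: number
): Promise<T[]> {
  const results: T[] = new Array(categories.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < categories.length) {
      const i = next++; // claim the next index before awaiting
      results[i] = await run(categories[i]);
    }
  }
  const workers = Array.from(
    { length: Math.min(concurrency, categories.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}
```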