Jina Reader

stale

Simplest URL-to-markdown conversion — prepend r.jina.ai to any URL. ReaderLM-v2 (1.5B SLM) presented at ICLR 2025. OSS repo stale since May 2025.

Score: 60 (stale)

Where it wins

Simplest possible interface — just prepend r.jina.ai/ to any URL

ReaderLM-v2: 1.5B params, 512K context, 29 languages, ICLR 2025

10,248 GitHub stars

Apache-2.0 license

Where to be skeptical

OSS repo stale — no commits since May 2025 (10+ months)

Reader-only — no search or discovery capability

Firecrawl is a strict superset and 4-5x cheaper at volume

On downward trajectory — recommend delisting if no activity by mid-2026

Editorial verdict

#6 in search-news — effectively dead. No commits since May 2025 (10+ months). ReaderLM-v2 is strong for edge/on-device but hosted API only. Firecrawl is a strict superset. Do not recommend for new projects.

Source

GitHub: jina-ai/reader

Docs: jina.ai

Videos

Reviews, tutorials, and comparisons from the community.

Build your Content Aggregator that finds the Latest AI News (N8N Step-by-Step)

Ricardo Taipe·2025-03-20

Vibe code an email agent with Replit

Matt Palmer·2025-09-23

Convert ANY site to markdown for free

Matt Palmer·2025-03-25

Search & News

#06 of 18

Simplest URL-to-markdown conversion (one-line API) with ReaderLM-v2 for local extraction

Crawl4AI

Free, open-source web scraping (Apache-2.0). 62K stars, 6,353 forks (nearly matches Firecrawl), actively maintained (v0.8.5, 2026-03-18), 384K weekly PyPI downloads. Best open-source alternative to Firecrawl.

SearXNG

Privacy-first, self-hosted meta-search engine aggregating 70+ upstream engines. Zero cost, zero API keys, full data sovereignty.

Exa MCP Server

Official Exa MCP for fast web search and crawling when the workflow is search-first rather than page-ops-first.

ScrapeGraphAI

LLM-graph-based web scraper — describe what you want, AI builds the extraction graph. 23K stars, 194 HN pts, active development (v1.74.0, Mar 2026). Open-source + hosted API.

Public evidence

strong · 2026-03

10,248 GitHub stars

Solid star count but growth has plateaued with stale repo.

10,248 stars · GitHub community

strong · 2025

HN: ReaderLM launch — 199 pts

Strong HN reception for the ReaderLM model, not the hosted service.

199 points on HN · HN community

strong · 2025

ReaderLM-v2 presented at ICLR 2025

1.5B SLM, 512K context, 29 languages, 3x quality over v1. Academic credibility.

ICLR 2025 presentation · ICLR (top ML conference)

moderate · 2025-05

OSS repo stale since May 2025

Hosted API and ReaderLM-v2 are active, but the open-source repo shows no development.

No commits for 10+ months · GitHub activity

Raw GitHub source

GitHub README peek

Constrained peek so you can sanity-check the source material without leaving the site.

Reader

Your LLMs deserve better input.

Reader does two things:

Read: It converts any URL to an LLM-friendly input with https://r.jina.ai/https://your.url. Get improved output for your agent and RAG systems at no cost.
Search: It searches the web for a given query with https://s.jina.ai/your+query. This allows your LLMs to access the latest world knowledge from the web.

Check out the live demo

Or just visit these URLs and see for yourself: (Read) https://r.jina.ai/https://github.com/jina-ai/reader, (Search) https://s.jina.ai/Who%20will%20win%202024%20US%20presidential%20election%3F

Feel free to use Reader API in production. It is free, stable and scalable. We are maintaining it actively as one of the core products of Jina AI. Check out rate limit

This repository is the open source branch of the codebase behind https://r.jina.ai and https://s.jina.ai. It runs in stateless or bucket-cached mode; the MongoDB-backed SaaS storage layer is not included here.

Updates

2026-04 — Re-synchronized the open source branch with the SaaS code. The MongoDB-backed storage layer is stripped; the oss branch runs in stateless mode out of the box, with optional MinIO/S3-compatible bucket caching via docker compose. See Local development.
2025-12 — Storage layer decoupled and binary file uploads landed. PDFs and MS Office documents (Word, Excel, PowerPoint) can now be POSTed directly via the file body field — no need to host them first. See cookbooks.md.
2025-03 — Major refactor: Reader is no longer a Firebase application. The SaaS migrated off Firestore + Cloud Functions to a Cloud Run image with MongoDB Atlas, removing the platform-coupled bits and unblocking the local-Docker path above.
2024-05 — s.jina.ai launched, extending Reader from URL→markdown to search→markdown. PDFs added the same month — any URL ending in .pdf is parsed with PDF.js and returned as markdown.
2024-04 — Reader released and r.jina.ai went live as Jina AI's first SaaS API for converting URLs to LLM-friendly input.

What Reader can read

Web pages — rendered with headless Chrome, or fetched lightweight via curl-impersonate. Reader picks intelligently between the two.
PDFs — any URL, parsed with PDF.js. See this NASA PDF result vs the original.
MS Office documents — Word, Excel, PowerPoint, converted via LibreOffice and then processed as HTML/PDF.
Images — captioned by a vision-language model, so your downstream text-only LLM gets just enough hints to reason about them.

Usage

Using `r.jina.ai` for single URL fetching

Simply prepend https://r.jina.ai/ to any URL. For example, to convert the URL https://en.wikipedia.org/wiki/Artificial_intelligence to an LLM-friendly input, use the following URL:

https://r.jina.ai/https://en.wikipedia.org/wiki/Artificial_intelligence
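The prepend rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the Reader codebase: the helper name `reader_url` is hypothetical, and the scheme check is an added safeguard rather than something the API is documented to require.

```python
from urllib.parse import urlsplit

READER_PREFIX = "https://r.jina.ai/"


def reader_url(target: str) -> str:
    """Return the Reader URL for an absolute http(s) URL (hypothetical helper)."""
    parts = urlsplit(target)
    # Reject relative or scheme-less input early; this check is our own
    # precaution, not a documented API rule.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"expected an absolute http(s) URL, got {target!r}")
    return READER_PREFIX + target


print(reader_url("https://en.wikipedia.org/wiki/Artificial_intelligence"))
```

The resulting string can be passed to any HTTP client; the response body is the LLM-friendly text.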

Using `r.jina.ai` to fetch a full website (Google Colab)

Using `s.jina.ai` for web search

Simply prepend https://s.jina.ai/ to your search query. Note that if you are using this in the code, make sure to encode your search query first, e.g. if your query is Who will win 2024 US presidential election? then your url should look like:

https://s.jina.ai/Who%20will%20win%202024%20US%20presidential%20election%3F
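The encoding step can be done with Python's standard library. A minimal sketch, assuming the query is percent-encoded into the URL path as in the example above; the helper name `search_url` is hypothetical.

```python
from urllib.parse import quote

SEARCH_PREFIX = "https://s.jina.ai/"


def search_url(query: str) -> str:
    """Percent-encode a free-text query and prepend the search endpoint."""
    # safe="" also encodes "/" and "?", which would otherwise be read as
    # path separators or the start of a query string.
    return SEARCH_PREFIX + quote(query, safe="")


print(search_url("Who will win 2024 US presidential election?"))
```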

Behind the scenes, Reader searches the web, fetches the top 5 results, visits each URL, and applies r.jina.ai to it. This differs from the web-search function-calling tools in many agent/RAG frameworks, which often return only the title, URL, and description provided by the search engine API. With those tools, if you want to read a result in depth, you have to fetch the content from that URL yourself. With Reader, http://s.jina.ai automatically fetches the content from the top 5 search result URLs for you (reusing the tech stack behind http://r.jina.ai). This means you don't have to handle browser rendering, blocking, or any issues related to JavaScript and CSS yourself.

Using `s.jina.ai` for in-site search

Simply specify site in the query parameters such as:

curl 'https://s.jina.ai/When%20was%20Jina%20AI%20founded%3F?site=jina.ai&site=github.com'
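The same request URL can be assembled in Python, with the `site` parameter repeated once per domain as in the curl command above. This is a sketch only; `site_search_url` is a hypothetical helper name.

```python
from urllib.parse import quote, urlencode


def site_search_url(query: str, sites: list[str]) -> str:
    """Build an s.jina.ai URL restricted to the given sites (hypothetical helper)."""
    base = "https://s.jina.ai/" + quote(query, safe="")
    if not sites:
        return base
    # A list of pairs lets the same key ("site") appear several times.
    return base + "?" + urlencode([("site", site) for site in sites])


print(site_search_url("When was Jina AI founded?", ["jina.ai", "github.com"]))
```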

Interactive Code Snippet Builder

We highly recommend using the code builder to explore different parameter combinations of the Reader API.

Using request headers

You can control the behavior of the Reader API using request headers. The list below covers the most useful ones — for the full surface with up-to-date defaults and validation rules, see the live API docs at https://r.jina.ai/docs, or the source of truth in src/dto/crawler-options.ts.

x-respond-with — select the output format.
- markdown returns markdown without going through readability
- html returns documentElement.outerHTML
- text returns document.body.innerText
- screenshot returns the URL of the webpage's screenshot
- pageshot similar to screenshot but tries to capture the whole page instead of just the viewport
- frontmatter returns Markdown with a YAML frontmatter block. The default plain-text response uses a custom Title: … / URL Source: … header format; frontmatter replaces that with a front matter block. Example:
```
curl -H 'X-Respond-With: frontmatter' 'https://r.jina.ai/https://example.com'
```
```
---
title: "Example Domain"
description: "This domain is for use in illustrative examples."
url: "https://example.com/"
---

## Example Domain

This domain is for use in illustrative examples in documents. ...
```
- markdown+frontmatter — like frontmatter but covers the full page without readability filtering.
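A frontmatter response like the example above is easy to split on the client side. The sketch below assumes the block contains only flat `key: "value"` lines, as in that example; it is not a general YAML parser, and `split_frontmatter` is a hypothetical helper name.

```python
def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a frontmatter response into (metadata, markdown body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no front matter block at all
    meta: dict[str, str] = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the markdown body.
            body = "\n".join(lines[index + 1:]).lstrip("\n")
            return meta, body
        key, sep, value = line.partition(":")
        if sep:
            # Assumes simple quoted scalars; nested YAML is not handled.
            meta[key.strip()] = value.strip().strip('"')
    return {}, text  # closing delimiter missing: treat as plain markdown


sample = (
    '---\n'
    'title: "Example Domain"\n'
    'url: "https://example.com/"\n'
    '---\n'
    '\n'
    '## Example Domain\n'
)
meta, body = split_frontmatter(sample)
print(meta["title"], "|", body)
```

For anything beyond flat string fields, a real YAML library would be the safer choice.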
x-engine — enforces a fetching engine: browser (headless Chrome), curl (lightweight, no JS), or auto (the default, which combines browser and curl).
x-proxy-url — route the traffic through your designated proxy.
x-cache-tolerance — integer seconds; how stale a cached page is acceptable.
x-no-cache: true — bypass the cached page (lifetime 3600s). Equivalent to x-cache-tolerance: 0.
x-target-selector — a CSS selector. Reader returns content within the matched element instead of the full page. Useful when automatic content extraction misses what you want.
x-wait-for-selector — a CSS selector. Reader waits until the matched element is rendered before returning. If x-target-selector is set, this can be omitted to wait for the same element.
x-timeout — integer seconds (max 180). When set, Reader will not return early; it waits for network idle or until the timeout is reached.
x-max-tokens — integer (≥500). Trim the response so it never exceeds this many tokens. Useful as a per-request guardrail when feeding a fixed-size context window — Reader truncates rather than rejects.
x-token-budget — integer. Reject the request if the resulting content would exceed this many tokens. Use this when over-budget output is worse than no output (e.g. cost control). Ignored on the search endpoint.
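The difference between the two token headers comes down to what should happen on overflow: truncate or reject. A small sketch of choosing between them on the client side follows; the helper name and its `on_overflow` argument are hypothetical, and only the 500-token minimum for x-max-tokens comes from the text above.

```python
def token_limit_headers(limit: int, on_overflow: str = "truncate") -> dict[str, str]:
    """Pick x-max-tokens (truncate) or x-token-budget (reject) for a request."""
    if on_overflow == "truncate":
        if limit < 500:
            raise ValueError("x-max-tokens must be at least 500")
        return {"x-max-tokens": str(limit)}
    if on_overflow == "reject":
        if limit <= 0:
            # Positive-only is our own sanity check, not a documented rule.
            raise ValueError("x-token-budget must be positive")
        return {"x-token-budget": str(limit)}
    raise ValueError(f"unknown overflow policy: {on_overflow!r}")


print(token_limit_headers(8000))            # trim to fit a context window
print(token_limit_headers(8000, "reject"))  # fail rather than overspend
```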
x-respond-timing — explicit control over when Reader is willing to return. Trade off latency against completeness:
- html — return as soon as the raw HTML lands. No JS execution, no waiting.
- visible-content — return the moment readable content is parseable. Lowest latency that still produces text.
- mutation-idle — wait for DOM mutations to settle for ≥0.2s. Good default for SPAs that lazy-render above the fold.
- resource-idle — wait for content-affecting resources to finish loading (≥0.5s quiet). The default heuristic for content-shaped requests.
- media-idle — wait for media (images, video, fonts) to also finish. Use with screenshot / pageshot / vlm.
- network-idle — full networkidle0. Slowest, most complete. Implied when x-timeout ≥ 20.
When omitted, Reader picks one based on x-respond-with, x-timeout, and x-with-iframe. See presumedRespondTiming in src/dto/crawler-options.ts for the exact rules.
x-with-generated-alt: true — caption images on the page with a VLM.
x-retain-images — control how images survive into the output:
- all (default) — keep ![alt](url) markdown for every image.
- none — drop images entirely.
- alt — keep alt text only, no URLs. Cheap on tokens; useful when the downstream LLM has no use for the image link.
x-retain-links — control how links survive into the output:
- all (default) — keep [text](url) markdown.
- none — drop links entirely.
- text — keep link anchor text only, drop URLs. Best for embedding / semantic-index pipelines where URLs are noise.
- gpt-oss — emit citations in gpt-oss's 【{id}†...】 format and append a numbered URL footer (also auto-enables x-with-links-summary).
x-retain-media — control how <video>, <audio>, and embedded video iframes (<iframe> from YouTube, Vimeo, Bilibili, etc.) appear in the output:
- link (default) — markdown link, e.g. [Video 1](url). Embedded iframes are rewritten to their canonical watch URL. Respects x-md-link-style.
- none — drop media entirely; non-video iframes fall back to their inner text content.
- text — bare label only, e.g. Video 1 or Audio 1. No URL.
- image — markdown image syntax, e.g. ![Video 1](url).
- html — the original HTML element with cosmetic attributes (class, id, style, data-*, aria-*) stripped. Embedded video iframes keep their original embed src rather than the canonical watch URL.
x-with-links-summary / x-with-images-summary — append a deduplicated footer of all links / images to the output. Combine with x-retain-links: text or x-retain-images: alt to get inline anchor/alt text plus one canonical URL list at the end — convenient when you want the model to see URLs without paying for them inline. x-with-links-summary: all keeps every link instead of only the unique ones.
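The combination described above (inline anchor and alt text, plus one URL list at the end) could be requested as sketched below. The request is built but not sent. The value `true` as the plain on-switch for the summary headers is an assumption modeled on other boolean headers in this list; only `all` is named explicitly above. The helper name `compact_link_request` is hypothetical.

```python
from urllib.request import Request


def compact_link_request(target: str, keep_duplicates: bool = False) -> Request:
    """Prepare (but do not send) a Reader request with footer-only URLs."""
    headers = {
        "x-retain-links": "text",   # inline anchor text, no inline URLs
        "x-retain-images": "alt",   # inline alt text, no inline URLs
        # "all" is documented above; "true" is an assumed on-switch.
        "x-with-links-summary": "all" if keep_duplicates else "true",
        "x-with-images-summary": "true",
    }
    return Request("https://r.jina.ai/" + target, headers=headers)


request = compact_link_request("https://example.com")
print(request.full_url)
print(request.header_items())
```

Passing the returned object to `urllib.request.urlopen` would perform the actual fetch.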
x-markdown-chunking — opt-in semantic chunking of the markdown response. Returns a JSON array (or -delimited text) of chunks instead of one blob:
- true / h1 … h5 — heading-based split at the given heading level (e.g. h3 chunks at #, ##, and ###).
- structured / s1 … s5 — block-level structured split. s1 is coarsest, s5 finest.
x-preset — apply a pre-packaged option bundle for common scenarios. Preset values only take effect for options the caller does not set explicitly (via body or another header). See cookbooks.md for examples.
- reader — for displaying content to human users.
- index — for semantic indexing / embedding pipelines.
- research — for AI research agents needing structured, citable output.
- agent — for AI agents doing everyday browsing tasks.
- spider — for recursive site crawling with a full link inventory.
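The precedence rule for presets (explicit options win, preset values fill the gaps) can be pictured as a dictionary merge. The sketch below is only a mental model: the contents of `HYPOTHETICAL_PRESETS` are invented for illustration and do not reflect what any real preset bundles, and the merge itself happens server-side in Reader.

```python
# Invented bundle for illustration only; the real preset contents are not
# listed in this document.
HYPOTHETICAL_PRESETS = {
    "index": {"x-retain-links": "text", "x-retain-images": "alt"},
}


def effective_options(preset: str, explicit: dict[str, str]) -> dict[str, str]:
    """Model how explicit options override preset defaults."""
    defaults = HYPOTHETICAL_PRESETS.get(preset, {})
    # Later keys win in a dict merge, so explicit settings take priority.
    return {**defaults, **explicit}


print(effective_options("index", {"x-retain-links": "all"}))
```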
x-detach-invisibles — detach elements with eventual display:none before snapshotting. Implies browser engine; disables caching.
