Small Language Model for Converting Raw HTML to Clean Markdown

Jina Reader turns any webpage into clean Markdown with a simple URL tweak, optimizing content for LLMs. It is efficient, lightweight, and supports multiple languages.

Sep 18, 2024


In April 2024, we launched Jina Reader, a simple yet powerful API. Just prepend `https://r.jina.ai/` to any URL, and it converts the webpage into Markdown optimized for large language models (LLMs).
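As a rough illustration of the URL tweak, here is a minimal Python sketch. It assumes the prefix takes the form `https://r.jina.ai/` and that a plain GET returns the Markdown as UTF-8 text; the helper names `reader_url` and `fetch_markdown` are hypothetical, not part of any official client.

```python
from urllib.request import urlopen

# Assumed form of the Reader prefix described above.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(target_url: str) -> str:
    """Build the Reader URL by prepending the prefix to the target URL."""
    return READER_PREFIX + target_url


def fetch_markdown(target_url: str, timeout: float = 30.0) -> str:
    """Fetch the Markdown version of a page (hypothetical helper, needs network access)."""
    with urlopen(reader_url(target_url), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Only the URL construction runs here; fetch_markdown is not called.
print(reader_url("https://example.com/post"))
# → https://r.jina.ai/https://example.com/post
```

Calling `fetch_markdown("https://example.com/post")` would then return the page as Markdown, assuming the service is reachable and accepts the request.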

Though the technology behind Jina Reader is complex, the core “reading” process is fairly simple:

First, we use a headless browser to load the webpage and read its rendered HTML. Then, Mozilla’s Readability extracts the main content, removing headers, footers, navigation bars, and sidebars. Finally, we use regular expressions and the Turndown library to clean up the HTML and convert it to Markdown. With the resulting structured Markdown file, LLMs can easily extract information, summarize, and reason.
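The shape of that pipeline can be sketched in a few lines of Python. This is not Jina Reader's implementation: it uses only the standard library, a hard-coded tag list stands in for Readability, a tiny `HTMLParser` subclass stands in for Turndown, and it assumes the HTML has already been rendered by a browser. The names `MarkdownSketch`, `html_to_markdown`, and `SAMPLE_HTML` are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Subtrees under these tags are dropped: a crude stand-in for Readability.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}


class MarkdownSketch(HTMLParser):
    """Very small HTML-to-Markdown converter (stand-in for Turndown)."""

    def __init__(self):
        super().__init__()
        self.out = []        # collected Markdown fragments
        self.skip_depth = 0  # > 0 while inside a skipped subtree
        self.href = None     # href of the link currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        if self.skip_depth:
            return
        if tag in HEADINGS:
            self.out.append("\n\n" + HEADINGS[tag])
        elif tag == "p":
            self.out.append("\n\n")
        elif tag in ("ul", "ol"):
            self.out.append("\n")
        elif tag == "li":
            self.out.append("\n- ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        if tag == "a":
            self.out.append(f"]({self.href or ''})")
            self.href = None

    def handle_data(self, data):
        if not self.skip_depth:
            # Collapse runs of whitespace, as a browser would when rendering.
            self.out.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html: str) -> str:
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    md = "".join(parser.out)
    # Regex cleanup: trim spaces around line breaks, collapse extra blank lines.
    md = re.sub(r"[ \t]*\n[ \t]*", "\n", md)
    md = re.sub(r"\n{3,}", "\n\n", md)
    return md.strip()


SAMPLE_HTML = """<html><body>
<nav><a href="/">Home</a></nav>
<h1>Title</h1>
<p>Hello <a href="https://example.com">world</a>.</p>
<ul><li>One</li><li>Two</li></ul>
<footer>Copyright</footer>
</body></html>"""

print(html_to_markdown(SAMPLE_HTML))
```

On the sample page, the navigation link and footer are dropped while the heading, paragraph, link, and list survive as Markdown. Real pages are far messier, which is exactly where rule-based extraction starts to need patching.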

In the first few weeks after launch, we received a lot of user feedback about content quality.

Some thought the content was too detailed, while others said it wasn’t detailed enough. Still others pointed out that the Readability filter removed the wrong content, or that Turndown’s HTML conversion wasn’t perfect.

Fortunately, many of these issues were solved with regex and small tweaks.

However, we started wondering: instead of continuously patching (which is hard to maintain and doesn’t support multiple languages), could we solve this issue end-to-end with language models?

At first glance, using LLMs for this seems costly and slow. But what if we use small language models (SLMs)?

AI Disruption
