Small Language Model for Converting Raw HTML to Clean Markdown
Jina Reader turns any webpage into clean Markdown with a simple URL tweak, optimizing content for LLMs. It is efficient, lightweight, and supports multiple languages.
In April 2024, we launched Jina Reader, a simple yet powerful API. Just prepend `https://r.jina.ai/` to any URL, and it converts the webpage into a Markdown format optimized for large language models (LLMs).
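As a rough illustration of the URL tweak, here is a minimal Python sketch. The helper names `reader_url` and `fetch_markdown` are hypothetical and not part of any official client. The sketch assumes the service is reached by placing `https://r.jina.ai/` directly in front of the full target URL and that a plain GET returns the Markdown as text.

```python
from urllib.parse import urlsplit
from urllib.request import urlopen

# Assumption: the Reader endpoint is used as a plain prefix on the target URL.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(target: str) -> str:
    """Build the Reader URL for an absolute http(s) target URL."""
    parts = urlsplit(target)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"expected an absolute http(s) URL, got {target!r}")
    return READER_PREFIX + target


def fetch_markdown(target: str, timeout: float = 30.0) -> str:
    """Fetch the Markdown rendering of a page (performs a network request)."""
    with urlopen(reader_url(target), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `reader_url("https://example.com/post")` gives `https://r.jina.ai/https://example.com/post`, which can also be pasted straight into a browser.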
Though the technology behind Jina Reader is complex, the core “reading” process is fairly simple:
First, a headless browser loads the page and reads its HTML. Then Mozilla’s Readability extracts the main content, removing headers, footers, navigation bars, and sidebars. Finally, regex rules and the Turndown library clean up the remaining HTML and convert it to Markdown. With the resulting structured Markdown, LLMs can easily extract information, summarize, and reason.
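To make the shape of that pipeline concrete, here is a toy Python sketch of the last two steps: dropping page chrome and emitting Markdown. It is not Readability or Turndown, and it is not how Jina Reader is implemented. It is a standard-library stand-in that skips a fixed set of tags and handles only headings, paragraphs, links, and list items. The names `MarkdownSketch`, `html_to_markdown`, and `SKIP_TAGS` are made up for illustration, and the headless-browser step is omitted entirely.

```python
import re
from html.parser import HTMLParser

# Assumption: these tags approximate the "chrome" a content extractor removes.
SKIP_TAGS = {"header", "footer", "nav", "aside", "script", "style"}
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class MarkdownSketch(HTMLParser):
    """Collects Markdown fragments while ignoring content inside SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        if self.skip_depth:
            return
        if tag in HEADINGS:
            self.out.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n\n")
        elif tag in ("ul", "ol"):
            self.out.append("\n")
        elif tag == "li":
            self.out.append("\n- ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        if tag == "a":
            self.out.append(f"]({self.href or ''})")
            self.href = None

    def handle_data(self, data):
        if not self.skip_depth:
            # Collapse runs of whitespace, as HTML rendering would.
            self.out.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html: str) -> str:
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    text = "".join(parser.out)
    text = re.sub(r"[ \t]+\n", "\n", text)  # strip trailing spaces
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse extra blank lines
    return text.strip()


sample = (
    '<nav><a href="/">Home</a></nav><h1>Title</h1>'
    '<p>Hello <a href="https://example.com">world</a>.</p>'
    "<ul><li>one</li><li>two</li></ul><footer>Contact us</footer>"
)
print(html_to_markdown(sample))
```

Even this tiny version hints at why hand-written rules get hard to maintain: every new tag, nesting case, or site-specific layout needs another branch or another regex.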
In the first few weeks after launch, we received a lot of user feedback about content quality.
Some thought the content was too detailed, while others said it wasn’t detailed enough. Still others pointed out that the Readability filter removed the wrong content, or that Turndown’s HTML-to-Markdown conversion wasn’t perfect.
Fortunately, many of these issues were solved with regex and small tweaks.
However, we started wondering: instead of continuously patching (which is hard to maintain and doesn’t support multiple languages), could we solve this issue end-to-end with language models?
At first glance, using LLMs for this seems costly and slow. But what if we use small language models (SLMs)?