DEFUDDLE(1)

NAME

defuddle - Extract the main content from web pages.

SYNOPSIS

DESCRIPTION

Extract the main content from web pages.

README

de·fud·dle /diˈfʌdl/ transitive verb
to remove unnecessary elements from a web page, and make it easily readable.

Beware! Defuddle is very much a work in progress!

Defuddle extracts the main content from web pages. It cleans up web pages by removing clutter like comments, sidebars, headers, footers, and other non-essential elements, leaving only the primary content.

Features

Defuddle aims to output clean and consistent HTML documents. It was written for Obsidian Web Clipper with the goal of creating a more useful input for HTML-to-Markdown converters like Turndown.

Defuddle can be used as a replacement for Mozilla Readability with a few differences:

  • More forgiving, removes fewer uncertain elements.
  • Provides a consistent output for footnotes, math, code blocks, etc.
  • Uses a page's mobile styles to guess at unnecessary elements.
  • Extracts more metadata from the page, including schema.org data.

Usage

Browser

import Defuddle from 'defuddle';

// Parse the current document
const defuddle = new Defuddle(document);
const result = defuddle.parse();

// Access the content and metadata
console.log(result.content);
console.log(result.title);
console.log(result.author);

Node.js

import { JSDOM } from 'jsdom';
import { Defuddle } from 'defuddle/node';

// Parse HTML from a string
const html = '<html><body><article>...</article></body></html>';
const resultFromString = await Defuddle(html);

// Parse HTML from a URL
const dom = await JSDOM.fromURL('https://example.com/article');
const resultFromDom = await Defuddle(dom);

// With options
const url = 'https://example.com/article'; // Original URL of the page
const result = await Defuddle(dom, url, {
  debug: true,    // Enable debug mode for verbose logging
  markdown: true  // Convert content to markdown
});

// Access the content and metadata
console.log(result.content);
console.log(result.title);
console.log(result.author);

Note: for defuddle/node to import properly, the module format in your package.json has to be set to { "type": "module" }

CLI

Defuddle includes a command-line interface for parsing web pages directly from the terminal.

# Parse a local HTML file
defuddle parse page.html

# Parse a URL
defuddle parse https://example.com/article

# Output as markdown
defuddle parse page.html --markdown

# Output as JSON with metadata
defuddle parse page.html --json

# Extract a specific property
defuddle parse page.html --property title

# Save output to a file
defuddle parse page.html --output result.html

# Enable debug mode
defuddle parse page.html --debug

CLI Options

Option             Alias  Description
--output <file>    -o     Write output to a file instead of stdout
--markdown         -m     Convert content to markdown format
--md                      Alias for --markdown
--json             -j     Output as JSON with metadata and content
--property <name>  -p     Extract a specific property (e.g., title, description, domain)
--debug                   Enable debug mode

Installation

npm install defuddle

For Node.js usage, you'll also need to install JSDOM:

npm install jsdom

Response

Defuddle returns an object with the following properties:

Property       Type    Description
author         string  Author of the article
content        string  Cleaned up string of the extracted content
description    string  Description or summary of the article
domain         string  Domain name of the website
favicon        string  URL of the website's favicon
image          string  URL of the article's main image
metaTags       object  Meta tags
parseTime      number  Time taken to parse the page in milliseconds
published      string  Publication date of the article
site           string  Name of the website
schemaOrgData  object  Raw schema.org data extracted from the page
title          string  Title of the article
wordCount      number  Total number of words in the extracted content
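
As a rough illustration of how these properties might be consumed, the sketch below turns a result-shaped object into a short plain-text header. The describeResult helper and the sample values are hypothetical and not part of Defuddle; only the property names come from the table above.

```javascript
// Hypothetical helper: formats a Defuddle-style result as a plain-text header.
// Only the property names (title, author, ...) come from the response table.
function describeResult(result) {
  const lines = [];
  lines.push(`Title: ${result.title || '(untitled)'}`);
  if (result.author) lines.push(`Author: ${result.author}`);
  if (result.published) lines.push(`Published: ${result.published}`);
  if (result.site || result.domain) {
    // Prefer the site name, fall back to the bare domain.
    lines.push(`Source: ${result.site || result.domain}`);
  }
  if (typeof result.wordCount === 'number') {
    lines.push(`Words: ${result.wordCount}`);
  }
  return lines.join('\n');
}

// Sample object with made-up values, shaped like the response described above.
const sample = {
  title: 'Example article',
  author: 'Jane Doe',
  domain: 'example.com',
  wordCount: 1234,
  content: '<p>...</p>',
};

console.log(describeResult(sample));
```

Missing properties are skipped rather than printed as empty fields, since not every page exposes an author or publication date.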

Bundles

Defuddle is available in three different bundles:

  1. Core bundle (defuddle): The main bundle for browser usage. No dependencies.
  2. Full bundle (defuddle/full): Includes additional features for math equation parsing and Markdown conversion.
  3. Node.js bundle (defuddle/node): Optimized for Node.js environments using JSDOM. Includes full capabilities for math and Markdown conversion.

The core bundle is recommended for most use cases. It still handles math content, but doesn't include fallbacks for converting between MathML and LaTeX formats. The full bundle adds the ability to create reliable <math> elements using the mathml-to-latex and temml libraries.

Options

Option                  Type     Default  Description
debug                   boolean  false    Enable debug logging
url                     string            URL of the page being parsed
markdown                boolean  false    Convert content to Markdown
separateMarkdown        boolean  false    Keep content as HTML and return contentMarkdown as Markdown
removeExactSelectors    boolean  true     Remove elements matching exact selectors like ads, social buttons, etc.
removePartialSelectors  boolean  true     Remove elements matching partial selectors like ads, social buttons, etc.
removeImages            boolean  false    Remove images.
useAsync                boolean  true     Allow async extractors to fetch from third-party APIs when no local content is available.
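
To make the defaults concrete, here is a minimal sketch of how a partial options object combines with the documented defaults. The DEFAULT_OPTIONS constant and the effectiveOptions helper are hypothetical names used for illustration; Defuddle may resolve its options differently internally.

```javascript
// Defaults copied from the options table. `url` has no listed default.
const DEFAULT_OPTIONS = {
  debug: false,
  markdown: false,
  separateMarkdown: false,
  removeExactSelectors: true,
  removePartialSelectors: true,
  removeImages: false,
  useAsync: true,
};

// Hypothetical helper (not part of Defuddle): shows the effective settings
// when caller overrides are layered on top of the documented defaults.
function effectiveOptions(overrides = {}) {
  return { ...DEFAULT_OPTIONS, ...overrides };
}

const options = effectiveOptions({
  url: 'https://example.com/article',
  markdown: true,
  removeImages: true,
});

console.log(options);
```

Only the keys that differ from the defaults need to be passed; everything else keeps the value listed in the table.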

Debug mode

You can enable debug mode by passing an options object when creating a new Defuddle instance:

const article = new Defuddle(document, { debug: true }).parse();

In debug mode, Defuddle:

  • Logs more verbosely to the console about the parsing process
  • Preserves HTML class and id attributes that are normally stripped
  • Retains all data-* attributes
  • Skips div flattening to preserve document structure

HTML standardization

Defuddle attempts to standardize HTML elements to provide a consistent input for subsequent manipulation such as conversion to Markdown.

Headings

  • The first H1 or H2 heading is removed if it matches the title.
  • H1s are converted to H2s.
  • Anchor links in H1 to H6 elements are removed and become plain headings.
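
The first two rules can be sketched on a simplified model where each heading is just a level and its text. This is an illustration only, not Defuddle's implementation: the standardizeHeadings helper is hypothetical, "matches the title" is assumed to mean a case-insensitive comparison of trimmed text, and anchor-link removal is not modelled.

```javascript
// Simplified model: a heading is { level: 1..6, text: string }.
// Illustrates the first two heading rules; not Defuddle's actual code.
function standardizeHeadings(headings, title) {
  // Assumed notion of "matches": equal after trimming and lowercasing.
  const normalize = (s) => s.trim().toLowerCase();
  const result = headings.map((h) => ({ ...h })); // leave the input untouched

  // Rule 1: drop the first H1 or H2 if it matches the title.
  const firstIndex = result.findIndex((h) => h.level === 1 || h.level === 2);
  if (firstIndex !== -1 &&
      normalize(result[firstIndex].text) === normalize(title)) {
    result.splice(firstIndex, 1);
  }

  // Rule 2: convert the remaining H1s to H2s.
  return result.map((h) => (h.level === 1 ? { ...h, level: 2 } : h));
}

const headings = [
  { level: 1, text: 'My Article' },
  { level: 1, text: 'Introduction' },
  { level: 3, text: 'Background' },
];

console.log(standardizeHeadings(headings, 'My Article'));
```

With the sample input, the title heading is dropped, "Introduction" becomes an H2, and the H3 is left alone.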

Code blocks

Code blocks are standardized. If present, line numbers and syntax highlighting are removed, but the language is retained and added as a data attribute and class.

<pre>
  <code data-lang="js" class="language-js">
    // code
  </code>
</pre>
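
Because the language always ends up in the same attribute, downstream code can read it without knowing which highlighter the original page used. The sketch below lists the languages found in an HTML string; the codeBlockLanguages helper is hypothetical, and a regular expression is used only to keep the example dependency-free (a real consumer would more likely query the DOM).

```javascript
// Hypothetical helper: lists the languages of standardized code blocks by
// reading the data-lang attribute on <code> elements.
function codeBlockLanguages(html) {
  const pattern = /<code\b[^>]*\bdata-lang="([^"]*)"/g;
  return Array.from(html.matchAll(pattern), (match) => match[1]);
}

// Made-up sample in the standardized shape shown above.
const html = `
<pre><code data-lang="js" class="language-js">const x = 1;</code></pre>
<pre><code data-lang="python" class="language-python">x = 1</code></pre>
`;

console.log(codeBlockLanguages(html)); // [ 'js', 'python' ]
```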

Footnotes

Inline references and footnotes are converted to a standard format:

Inline reference<sup id="fnref:1"><a href="#fn:1">1</a></sup>.

<div id="footnotes">
  <ol>
    <li class="footnote" id="fn:1">
      <p>
        Footnote content.&nbsp;<a href="#fnref:1" class="footnote-backref">↩</a>
      </p>
    </li>
  </ol>
</div>

Math

Math elements, including MathJax and KaTeX, are converted to standard MathML:

<math xmlns="http://www.w3.org/1998/Math/MathML" display="inline" data-latex="a \neq 0">
  <mi>a</mi>
  <mo>≠</mo>
  <mn>0</mn>
</math>
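
One way a downstream converter might use the data-latex attribute is to emit delimited LaTeX instead of MathML. The sketch below is an assumption-laden illustration, not a description of Defuddle's own Markdown conversion: the mathToDelimitedLatex helper is hypothetical, it assumes display="block" marks display math, and it ignores HTML entities inside the attribute value.

```javascript
// Hypothetical helper: wraps the LaTeX source stored in data-latex in
// Markdown-style math delimiters. Not part of Defuddle.
function mathToDelimitedLatex(mathHtml) {
  const latex = /\bdata-latex="([^"]*)"/.exec(mathHtml);
  if (!latex) return null; // no LaTeX source recorded on this element
  // Assumption: display="block" marks display math; anything else is inline.
  const isBlock = /\bdisplay="block"/.test(mathHtml);
  const delimiter = isBlock ? '$$' : '$';
  return delimiter + latex[1] + delimiter;
}

// String.raw keeps the backslash in \neq from being read as an escape.
const inlineMath = String.raw`<math xmlns="http://www.w3.org/1998/Math/MathML" display="inline" data-latex="a \neq 0"><mi>a</mi><mo>≠</mo><mn>0</mn></math>`;

console.log(mathToDelimitedLatex(inlineMath)); // $a \neq 0$
```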

Development

Build

To build the package, you'll need Node.js and npm installed. Then run:

# Install dependencies
npm install

# Clean and build
npm run build

Third-party services

When using parseAsync(), if no content can be extracted from the local HTML, Defuddle may fetch content from third-party APIs as a fallback. This only happens when the page HTML contains no usable content (e.g. client-side rendered SPAs). You can disable this by setting useAsync: false in options.

  • FxTwitter API — Used to extract X (Twitter) article content, which is not available in server-rendered HTML.
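
A minimal sketch of opting out of the fallback is shown below. It assumes that parseAsync() is called on an instance created the same way as in the debug mode example, with options as the second constructor argument; the parseLocalOnly helper is hypothetical, and the class is passed in as a parameter so the snippet does not assume a particular import path.

```javascript
// Sketch: run extraction without the third-party fallback.
// `DefuddleClass` is injected by the caller; in a browser it would be the
// class imported from 'defuddle'. The constructor shape and the parseAsync()
// call are assumptions based on the usage shown elsewhere in this document.
async function parseLocalOnly(DefuddleClass, doc) {
  const defuddle = new DefuddleClass(doc, { useAsync: false });
  return defuddle.parseAsync();
}
```

In a browser this might be called as parseLocalOnly(Defuddle, document), so that only the local HTML is used for extraction.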

SEE ALSO

clihub    3/4/2026    DEFUDDLE(1)