Convert HTML to Markdown with a serverless function

Jul 4, 2020 · 982 words · 5 minute read

Outlined below is the setup for an AWS Lambda function that fetches the HTML for a URL, strips it back to just the essential article content, and then converts it to Markdown. To deploy it you’ll need an AWS account and the Serverless Framework installed.

Step 1 - Download the full HTML for the URL

First, get the full HTML of the URL being converted. As this runs in a Lambda function, I decided to try out an ultra-lightweight Node HTTP client called phin (which is 95% smaller than my usual favourite, Axios):

const phin = require('phin');

// fetch the raw HTML for the given URL
const fetchPageHtml = async (fetchUrl) => {
  const response = await phin(fetchUrl);
  return response.body;
};
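One thing the function above does not do is check the response status, so an error page would be passed on to the later steps as if it were the article. Below is a minimal sketch of a guard. It assumes the response object exposes a numeric statusCode next to body (treat that shape as an assumption and check it against the client version you install), and the name assertOkResponse is made up for this example:

```javascript
// Hypothetical helper: reject anything that is not a 2xx response.
// Assumes `response` has a numeric `statusCode` and a `body` property.
const assertOkResponse = (response, url) => {
  const { statusCode } = response;
  if (typeof statusCode !== 'number' || statusCode < 200 || statusCode >= 300) {
    throw new Error(`Fetching ${url} failed with status ${statusCode}`);
  }
  return response.body;
};

// Demonstration with a stand-in response object (no network involved).
const okBody = assertOkResponse(
  { statusCode: 200, body: '<html></html>' },
  'https://example.com'
);
console.log(okBody);
```

Note that this check also rejects 3xx responses, so it only makes sense if redirects are either followed by the client or not expected.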

Step 2 - Convert to readable HTML

Converting to readable HTML is a feature originally offered by Instapaper (going back to 2008) as part of the core experience of a “read it later” service, but it is now built into most browsers. Before converting to Markdown, it’s a good idea to strip out the unnecessary parts of the HTML (adverts, menus, images, etc.) and keep just the text of the main article, displayed in a clean and less distracting way.

This process won’t work for every web page - it is designed for blog posts, news articles etc which have a clear “body content” section which can be the focus of the output.

Mozilla have open sourced their code for doing this in a Readability library, which can be reused here:

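const Readability = require("readability");
const JSDOM = require("jsdom").JSDOM;

const extractMainContent = (pageHtml, url) => {
  // passing the url lets relative links in the page be resolved
  const doc = new JSDOM(pageHtml, {
    url,
  });
  const reader = new Readability(doc.window.document);
  const article = reader.parse();
  // return the whole article object, as the handler needs both title and content
  return article;
};

This returns the parsed article, which includes the title and the HTML for just the article content in a more readable form.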
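For pages without a clear body section, reader.parse() may return null instead of an article object, in which case reading its content or title would throw a confusing error. Here is a small sketch of a guard that turns this into a clearer failure. requireArticle is a hypothetical name, and the object passed in below is only a stand-in for what parse() returns:

```javascript
// Hypothetical guard around the result of reader.parse().
const requireArticle = (article, url) => {
  if (!article || !article.content) {
    throw new Error(`No readable article content found at ${url}`);
  }
  return article;
};

// Stand-in value for demonstration; a real one would come from reader.parse().
const parsed = requireArticle(
  { title: 'Example', content: '<p>Hello</p>' },
  'https://example.com/post'
);
console.log(parsed.title);
```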

Step 3 - Convert readable HTML to markdown

There is a CLI tool called pandoc which converts HTML to markdown. The elevator pitch for pandoc is:

If you need to convert files from one markup format into another, pandoc is your swiss-army knife.

To try this out locally before running it from the Lambda function, you can follow one of their installation methods, and then test it from the command line by piping an HTML file in as the input:

cat sample.html | pandoc -f html -t commonmark-raw_html+backtick_code_blocks --wrap none

The options used here are:

-f html is the input format

-t commonmark is the output format (a particular markdown flavour)

You can add extra configuration options to the output by adding them to the output name.

commonmark-raw_html+backtick_code_blocks sets the converter to disable the raw_html extension, so no plain html is included in the output. It enables the backtick_code_blocks extension so that any code blocks are fenced with backticks rather than being indented.
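To make the format syntax concrete, here is a sketch that builds the same output format string from a base format and two lists of extensions. The helper name buildPandocFormat is made up for this example; only the resulting string is something pandoc sees:

```javascript
// Build a pandoc format string such as "commonmark-raw_html+backtick_code_blocks".
// Extensions are appended to the base format: "-name" disables, "+name" enables.
const buildPandocFormat = (base, { disable = [], enable = [] } = {}) =>
  base +
  disable.map((name) => `-${name}`).join('') +
  enable.map((name) => `+${name}`).join('');

const outputFormat = buildPandocFormat('commonmark', {
  disable: ['raw_html'],
  enable: ['backtick_code_blocks'],
});

// The same options as the command line above, as an argument array.
const pandocArgs = ['-f', 'html', '-t', outputFormat, '--wrap', 'none'];
console.log(pandocArgs.join(' '));
// prints "-f html -t commonmark-raw_html+backtick_code_blocks --wrap none"
```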

The pandoc tool needs to be executed from within the Node script, which involves spawning it in a child process, writing the HTML to the child’s stdin and then collecting the Markdown output from the child’s stdout.

Most of these functions have been taken from this very helpful blog post on working with stdout and stdin in nodejs.

First off this is the generic streamWrite function, which allows you to pipe the html to the pandoc process, by writing to the stdin stream of the child process.

const streamWrite = async (stream, chunk, encoding = 'utf8') =>
  new Promise((resolve, reject) => {
    const errListener = (err) => {
      stream.removeListener('error', errListener);
      reject(err);
    };
    stream.addListener('error', errListener);
    const callback = () => {
      stream.removeListener('error', errListener);
      resolve(undefined);
    };
    stream.write(chunk, encoding, callback);
  });

This similar function reads from the stdout stream of the child process, so you can collect the markdown that is output:

const { chunksToLinesAsync, chomp } = require('@rauschma/stringio');

// read the stream line by line, stripping the trailing newline from each line
const collectFromReadable = async (readable) => {
  const lines = [];
  for await (const line of chunksToLinesAsync(readable)) {
    lines.push(chomp(line));
  }
  return lines;
};
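If you would rather avoid the extra dependency, a readable stream can also be consumed directly with for await. This is a sketch rather than a drop-in replacement: collectText is a hypothetical name, it returns one string instead of an array of lines, and it assumes a Node version where streams are async iterable (such as the nodejs12.x runtime used in this post):

```javascript
const { Readable } = require('stream');

// Collect every chunk from a readable stream and return it as one string.
// Chunks are joined as Buffers first, so a multi-byte character split across
// two chunks is still decoded correctly.
const collectText = async (readable) => {
  const chunks = [];
  for await (const chunk of readable) {
    chunks.push(Buffer.from(chunk));
  }
  return Buffer.concat(chunks).toString('utf8');
};

// Demonstration with an in-memory stream standing in for the child stdout.
collectText(Readable.from(['# Title\n', '\nBody text\n'])).then((markdown) => {
  console.log(markdown);
});
```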

Finally this helper function converts the callback events for the child process into an “awaitable” async function:

const onExit = async (childProcess) =>
  new Promise((resolve, reject) => {
    childProcess.once('exit', (code) => {
      if (code === 0) {
        resolve(undefined);
      } else {
        reject(new Error('Exit with error code: '+code));
      }
    });
    childProcess.once('error', (err) => {
      reject(err);
    });
  });

To make the API a bit cleaner, here is that all wrapped up in a single helper function:

const { spawn } = require('child_process');

// spawns a child process, supplying stdin to the child STDIN, then reads from the child STDOUT and
// returns this as a string
const spawnHelper = async (command, stdin) => {
  // note: splitting on spaces means no argument may itself contain a space
  const commandParts = command.split(" ");
  const childProcess = spawn(commandParts[0], commandParts.slice(1));
  await streamWrite(childProcess.stdin, stdin);
  childProcess.stdin.end();
  const outputLines = await collectFromReadable(childProcess.stdout);
  await onExit(childProcess);
  return outputLines.join("\n");
};

This makes calling pandoc from the node script much simpler:

const convertToMarkdown = async (html) => {
  const convertedOutput = await spawnHelper('/opt/bin/pandoc -f html -t commonmark-raw_html+backtick_code_blocks --wrap none', html)
  return convertedOutput;
}
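For comparison, Node’s built-in spawnSync can feed a string to stdin and return stdout in a single call, which removes the need for the stream helpers at the cost of blocking the event loop while the command runs. That is probably acceptable in a function that handles one request at a time, but it is a trade-off rather than a recommendation. In this sketch runWithInput is a hypothetical name, and node itself stands in for pandoc so the example runs without pandoc installed:

```javascript
const { spawnSync } = require('child_process');

// Run a command with the given stdin and return its stdout as a string.
// Arguments are passed as an array, so nothing has to be split on spaces.
const runWithInput = (command, args, input) => {
  const result = spawnSync(command, args, { input, encoding: 'utf8' });
  if (result.error) {
    throw result.error; // e.g. the binary could not be found
  }
  if (result.status !== 0) {
    throw new Error(`Exit with error code: ${result.status}\n${result.stderr}`);
  }
  return result.stdout;
};

// Demonstration: node echoes its stdin back to stdout, standing in for pandoc.
const echoed = runWithInput(
  process.execPath,
  ['-e', 'process.stdin.pipe(process.stdout)'],
  '<p>Hello</p>'
);
console.log(echoed);
```

Because spawnSync buffers the whole output in memory, its maxBuffer option may need raising for very long articles.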

To run this as an AWS lambda you need to include the pandoc binary. This is achieved by adding a shared lambda layer which includes a precompiled pandoc binary. You can build this yourself, or just include the public published layer in your serverless config.

# function config
layers:
  - arn:aws:lambda:us-east-1:145266761615:layer:pandoc:1

Step 4 - Wrapping this up in the lambda handler function

Export a function from this module which has been configured as the handler. This is the function AWS will run every time the lambda receives a request.

module.exports.endpoint = async (event) => {
  const url = event.body
  const pageHtml = await fetchPageHtml(url);
  const article = await extractMainContent(pageHtml, url);
  const bodyMarkdown = await convertToMarkdown(article.content);
  // add the title and source url to the top of the markdown
  const markdown = `# ${article.title}\n\nSource: ${url}\n\n${bodyMarkdown}`
  return {
    statusCode: 200,
    body: markdown,
    headers: {
      'Content-type': 'text/markdown'
    }
  }
}
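The handler above trusts that the request body is a usable URL. Below is a sketch of how the input could be checked first. extractUrl is a hypothetical helper, and it assumes the API Gateway proxy event shape, where body may arrive base64 encoded with isBase64Encoded set to true:

```javascript
// Hypothetical helper: pull the URL out of the incoming event and validate it.
// Assumes the API Gateway proxy event shape with `body` and `isBase64Encoded`.
const extractUrl = (event) => {
  const raw = event.isBase64Encoded
    ? Buffer.from(event.body || '', 'base64').toString('utf8')
    : event.body || '';
  const parsed = new URL(raw.trim()); // throws a TypeError on invalid input
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    throw new Error(`Unsupported protocol: ${parsed.protocol}`);
  }
  return parsed.href;
};

console.log(extractUrl({ body: ' https://example.com/post \n' }));
// prints "https://example.com/post"
```

In the handler, a failure here could be caught and returned as a 400 response instead of letting the function crash.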

This is the full serverless.yml configuration that is needed for serverless to deploy everything:

service: url-to-markdown

frameworkVersion: ">=1.1.0 <2.0.0"

provider:
  name: aws
  runtime: nodejs12.x
  region: us-east-1

functions:
  downloadAndConvert:
    handler: handler.endpoint
    timeout: 10
    layers:
      - arn:aws:lambda:us-east-1:145266761615:layer:pandoc:1
    events:
      - http:
          path: convert
          method: post

Wrap Up

The full source code is available on GitHub. Once deployed, you can test it from the command line like so:

curl -X POST -d 'https://www.atlasobscura.com/articles/actual-1950s-proposal-nuke-alaska' https://zm13c3gpzh.execute-api.us-east-1.amazonaws.com/dev/convert