Sitegen - Open Source Static Site Generator in Python

If you're reading this, you know it worked!

This project is released under the MIT license. See the code on GitHub.
See the discussion on Hacker News.

Motivation

Four years ago, I started this blog to document some of my personal projects, and to share some of what I had learned in working on them. After a whirlwind of posts in the first month, it sat untouched for twenty-two more, at which point I took it down. While I intended to post regularly, or at the very least to keep it updated, I failed to do either.

I made that first version of the site using Squarespace, as I wanted to get something up and running quickly without having to worry too much about web design and hosting; in retrospect, that was a mistake. To be clear, it wasn't a bad product—it's just that nothing billed as 'all-in-one' will ever work for everyone. While their service allowed me to jump right into creating content, it never afforded me the flexibility that I wanted, and as the content that I created with it was not easily portable, I was reluctant to add more content knowing that doing so would create even more work down the line.

After migrating my domain name away from Squarespace and canceling my subscription to their service, I looked into other options for hosting, and decided to host it as a static site on Amazon S3, but left it at that. A few months ago, I finally decided to get back to work on it.

Why another static site generator?

Before deciding to write my own, I tried many of the existing static site generators, but wasn't enamored with any of them. Some were too bloated, others too incomplete (often with significant real or perceived overhead to implementing new features) or opinionated (without good overlap between what I wanted to do and what they required from me), or required markup languages I didn't care for, or lacked complete or correct documentation, or were too slow. It felt a little like shopping the cereal aisle when they're out of your favorite: way too many options, each with more corn than the last, none quite right.

While I'm sure that, in time, I could have made one or several of these work well for me, I decided that my time would be better spent building exactly what I was looking for, even if it would be just as opinionated and quirky as some of what's already out there.

Goals

Write posts using Markdown with YAML metadata

  • Write posts using Markdown (mostly adhering to the GitHub Flavored Markdown specification, but with a few additions) for portability, longevity (as Markdown files are really just text files), ease of use, and support in my favorite Markdown editor on Linux, Typora. Typora has lots of useful features, like support for custom CSS themes in the editor (so what I see when I write is what you'll see when I post) and the ability to automatically copy pictures referenced in the Markdown file to a specified relative or absolute directory.
  • Place post metadata at the top of the document as YAML front-matter
  • The act of writing posts should be largely decoupled from the site generation process and specifics of the site design
  • All data (metadata included) about a post that is used by the site generator should live within that post's Markdown file, images excepted

Write site generator using Python

Write a flexible, extensible, and fast static site generator using Python 3.7, the language I'm most comfortable with nowadays. It should be able to, at a minimum:

  • Create a blank markdown post on command, with all supported YAML metadata fields present and partially filled (though using this feature shouldn't be necessary; copying and modifying an existing post should work the same)
  • Accept a directory of posts written in markdown (along with any assets that they reference) and a directory of templates (i.e. essentially the complete site, but with unfilled Jinja2 variables/expressions) as inputs
  • Build and save the complete site to a specified directory, consisting of, at a minimum:
    • a standalone page for each blog post
    • a blog index, consisting of a paginated sequence of blog posts, with a specified number of posts per-page
    • a 'topics' page, consisting of a list of all topics tagged in post metadata and the posts that pertain to them
    • all other static pages and assets referenced by the above pages, even if the site generator is not responsible for creating or modifying them during the build process
  • Handle any image processing (compressing and scaling) necessary to build the site
  • Cache previously-processed images and posts, only reprocessing posts that have been modified
  • Deterministically generate post URLs and anchor links, so that they do not inadvertently change between builds
  • Support comments, if possible, without bloating the page or adding in a bunch of trackers
  • Easily preview the newest build of the site locally before publishing it

Design site and templates in HTML and CSS, and style it with Bootstrap

  • Design the site using clean HTML/CSS templates (without using any Javascript) and the Bootstrap4 framework for styling, either in Bootstrap Studio or by hand
  • Use Jinja2 for templating

Implementation

Visual design

First, I designed the look of the site using Bootstrap Studio, with a bunch of lorem ipsum dummy text for anything that would be filled in by the site generator, and with actual content for any pages that the site generator would ignore (i.e. the 'About' and 'Resume' pages).

I hadn't used the Bootstrap framework before (and hadn't used HTML/CSS since writing a rather hideous Geocities site in the late '90s), so this was a great way to get acclimated to it. Bootstrap Studio offers simultaneous editing of HTML and CSS, drag-and-drop addition of Bootstrap components (that can be edited visually or converted to HTML for complete control), realtime visualization of everything (even custom elements), support for Google webfonts, one-click exporting of all assets for use as a static site, and it runs on Linux. It's also regularly updated and reasonably priced. I'm not getting paid to say this; I'm just a happy customer.

Programming the site generator

Once I was happy with how things looked, I started working on the actual site generator, beginning with processing a few test markdown files (filled with examples of things I wanted to support, like tables, code blocks, and LaTeX-formatted math equations). I figured that using the python-markdown library would be a reasonable way to start, but found it to be dreadfully slow, even for processing a single post. After benchmarking it and several others, I instead decided to use misaka, which was at least an order of magnitude faster.

I began by writing the code in one long script, adding and testing one new feature at a time, and using type annotations and docstrings to help with organization. Once it became faster to find things by searching than by scrolling, I gradually separated related functions into modules. Most of the site generator is written functionally, and simple data structures are favored (e.g. dicts and nested dicts).

Features

Site Generation

At the moment, sitegen generates static HTML pages for the following, by expanding user-provided HTML templates containing Jinja2 variables and expressions:

  • a page for each blog post
  • a paginated blog index, with a user-specified number of posts per page
  • an 'all-topics' page, listing all 'topics' and the posts that pertain to them
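As a rough illustration of paginating by a fixed number of posts per page (a sketch, not sitegen's actual code; the function name `paginate` and the example slugs are hypothetical):

```python
from typing import List, Sequence


def paginate(slugs: Sequence[str], posts_per_page: int) -> List[List[str]]:
    """Split an ordered sequence of post slugs into blog-index pages."""
    if posts_per_page < 1:
        raise ValueError("posts_per_page must be at least 1")
    return [
        list(slugs[start:start + posts_per_page])
        for start in range(0, len(slugs), posts_per_page)
    ]


# Five posts at two per page yields three index pages.
pages = paginate(["post-a", "post-b", "post-c", "post-d", "post-e"], posts_per_page=2)
```

Each inner list would then be handed to the blog-index template once, producing one HTML file per page.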

In addition, the site generator respects other (non-template) static pages and assets placed in its templates directory. These, like the 'Resume' and 'About' pages of this site, are moved to the output directory without modification. This makes it very easy to design the site using placeholder content first, then convert some of the pages to templates by replacing the placeholder content with Jinja2 variables and expressions.

If you aren't familiar with Jinja2, anything in {% %} is a pseudo-pythonic statement (such as a for loop or an if block) and anything in {{ }} is an expression whose value is printed. Neither appears in the output HTML, as each is replaced with what it evaluates to. Since the syntax isn't quite Python, there are some quirks (like needing to use none rather than None when type-checking against it), but the documentation is extensive enough to figure things out without too much hassle.

For an example of how simple this can be, the main body of the template for the 'all-topics' page (accessible by clicking on any topic listed under the title of any post in the blog) consists solely of:

<div class="post-body">
  {% for topic, posts in topic_posts_dict.items() %}
    <h2><a id="{{ topic }}" class="topic-list-title">{{ topic }}</a></h2>
    <ul class="topic-list">
      {% for post in posts %}
        <li><a href="{{ post_db[post].url }}">{{ post_db[post].post_title }}</a><span class="text-muted topic-list-date">, posted {{ post_db[post].date|get_month_name }} {{ post_db[post].date.year }}.</span></li>
      {% endfor %}
    </ul>
  {% endfor %}
</div>

To expand this template, I reference two variables: post_db, a dictionary of dictionaries used to hold all data about each post, and topic_posts_dict, a dictionary relating topics (keys) to posts tagged with them (values). Here, I iterate over the topics and their respective posts, and for each topic:posts pair, place the name of the topic into an h2 heading (with an anchor link appropriate for linking from other pages) and the titles and publishing dates of all posts tagged with it into a list immediately below it.
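For a sense of how the second of those dictionaries could be derived from the first, here is a minimal sketch. The helper name `build_topic_posts_dict`, the sample posts, and the alphabetical ordering of topics are my assumptions for illustration; only the names post_db and topic_posts_dict come from the template above.

```python
from collections import defaultdict
from typing import Dict, List


def build_topic_posts_dict(post_db: Dict[str, dict]) -> Dict[str, List[str]]:
    """Map each topic to the slugs of the posts tagged with it."""
    topic_posts = defaultdict(list)
    for slug, post in post_db.items():
        # "topics" may be missing or left blank (None) in the metadata.
        for topic in post.get("topics") or []:
            topic_posts[topic].append(slug)
    # Sort topics so the page order does not change between builds.
    return {topic: topic_posts[topic] for topic in sorted(topic_posts)}


post_db = {
    "sitegen": {"post_title": "Sitegen", "topics": ["python", "web"]},
    "benchmarks": {"post_title": "Benchmarks", "topics": ["python"]},
    "hello": {"post_title": "Hello", "topics": None},
}
topic_posts_dict = build_topic_posts_dict(post_db)
```

Jinja2 falls back to item lookup when attribute lookup fails, which is why a template can write post_db[post].url even when each post is a plain dict.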

Caching of previously processed images and posts

Images and the HTML bodies of previously processed posts are automatically cached between builds, and are only re-processed when modified (detected by comparing the MD5 hashes of current files to those of their previously-processed counterparts). The post cache is JSON-formatted, and is saved to a directory defined in sitegen.config.
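The change-detection idea can be sketched roughly as follows. This is a simplified illustration rather than sitegen's implementation: the function names are hypothetical, and the cache here maps file paths to hashes only, whereas the real post cache also holds the processed HTML bodies.

```python
import hashlib
import json
from pathlib import Path


def md5_of_file(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_processing(path: Path, cache: dict) -> bool:
    """True if the file is new or its contents changed since the last build."""
    return cache.get(str(path)) != md5_of_file(path)


def load_cache(cache_file: Path) -> dict:
    """Load the JSON cache, or start with an empty one on the first build."""
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    return {}


def save_cache(cache_file: Path, cache: dict) -> None:
    """Write the cache back to disk as JSON."""
    cache_file.write_text(json.dumps(cache, indent=2), encoding="utf-8")
```

A build loop would call needs_processing for each input file, process only those that return True, store the new hash, and call save_cache once at the end.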

Given that image processing (compression and scaling) is the most computationally intensive and time consuming part of the build process, image caching significantly reduces build time. All html pages are regenerated on each build, but as this entails little more than inserting the previously-processed html bodies into the appropriate templates, it adds very little time to the build process.

As an illustrative example, processing the first 31 posts and corresponding 288 images, on a fairly anemic dual-core ultrabook, takes about 184 seconds without caching. Processing the same 31 posts without scaling or compressing the images takes a mere 2.2 seconds, and processing a new post (with images) generally takes a few seconds or less. Processing a single new post without images takes well under a second.

Sidebar table of contents

A table of contents is generated and placed in the sidebar if the render_toc metadata flag is set to True. TOC entries link to the relevant (and correct) section of the post, even if the heading is used multiple times within the post or by multiple posts on the same page, using header anchor links that are automatically generated for every post (even if a TOC is not generated).
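One way to get anchors that stay unique and stable is to prefix each with the post slug and add a counter for repeated headings. The sketch below assumes that scheme; the anchor format and the function names are hypothetical and may differ from what sitegen actually emits.

```python
import re
from typing import List


def slugify(text: str) -> str:
    """Lowercase the text and collapse runs of non-alphanumerics into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def heading_anchors(post_slug: str, headings: List[str]) -> List[str]:
    """Return one anchor id per heading, unique within the post.

    Prefixing with the post slug keeps ids unique across posts that share a
    page (assuming post slugs themselves are unique).
    """
    used = set()
    anchors = []
    for heading in headings:
        base = f"{post_slug}-{slugify(heading)}"
        anchor, suffix = base, 0
        # Repeated headings get a numeric suffix until the id is unused.
        while anchor in used:
            suffix += 1
            anchor = f"{base}-{suffix}"
        used.add(anchor)
        anchors.append(anchor)
    return anchors


anchors = heading_anchors("sitegen", ["Goals", "Features", "Goals"])
```

Because the ids depend only on the post slug, the heading text, and heading order, they do not change between builds unless the post itself changes.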

Post comments via utteranc.es

A comments thread for a given post is displayed on the page for that post (but not in the blog index) if the enable_comments metadata flag is set to True.

Configurable paths

All paths used for storing templates, input posts (along with their image and non-image assets), generated outputs, and caching are configurable in sitegen.config.

Sitemap Generation

A sitemap is automatically generated, listing the pages generated for each blog post as well as any pages specified in the whitelist in sitegen.config.
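A sitemap of this kind is a short XML file, and can be sketched with the standard library alone. The function name, the base URL, and the example paths below are placeholders, not values taken from sitegen.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(base_url: str, paths: Iterable[str]) -> str:
    """Return sitemap XML listing base_url joined with each relative path."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for path in paths:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = f"{base_url.rstrip('/')}/{path.lstrip('/')}"
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


post_pages = ["blog/sitegen.html"]          # pages generated for blog posts
whitelist = ["index.html", "about.html"]    # extra pages named in the config
sitemap_xml = build_sitemap("https://example.com", post_pages + whitelist)
```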

Local site previewing

Immediately after building the site, sitegen starts a local HTTP server, and serves the site to localhost for previewing; CTRL + C shuts it down. This may be disabled in sitegen.config if desired.
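Python's standard library is enough for this kind of preview. The following is a minimal sketch, assuming Python 3.7 or newer; the function names and the default port are my own choices, not sitegen's.

```python
import functools
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_preview_server(directory: str, port: int = 8000) -> ThreadingHTTPServer:
    """Create (but do not start) a server for `directory`, bound to localhost."""
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    return ThreadingHTTPServer(("127.0.0.1", port), handler)


def preview(directory: str, port: int = 8000) -> None:
    """Serve the built site until interrupted with CTRL + C."""
    server = make_preview_server(directory, port)
    print(f"Serving {directory} at http://127.0.0.1:{port}/")
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        pass
    finally:
        server.server_close()
```

Binding to 127.0.0.1 rather than all interfaces keeps the preview from being reachable by other machines on the network.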


Blog-post YAML front-matter metadata

All data contained within post_db stems directly from—or through minimal processing of—post metadata or post content. Supported metadata fields (automatically inserted at the top of a new post when using the create_new_post function) include:

| Field | Type | Description | Default Value |
|---|---|---|---|
| post_title | str | title of post | None |
| post_description | Optional[str] | description of post | None (may be left blank) |
| author | str | post author | default author set in sitegen.config |
| date | datetime | strftime formatted date and time, %Y-%m-%d %H:%M:%S | time and date that the post markdown file was first created (e.g. 2019-05-06 23:41:39) |
| slug | str | short-name identifier for the post | value entered when creating the post (also used as the filename) |
| topics | List[str] | list of topics the post pertains to | None (may be left blank) |
| related_posts | List[str] | bulleted list of related posts (listed by slug) | None (may be left blank) |
| render_toc | bool | if True, a table of contents will be generated for the post and shown on the sidebar | False |
| enable_comments | bool | if False, post comments will not be shown | True |
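To make the defaults concrete, here is a sketch of how a blank post's front matter could be generated from these fields. It is not the create_new_post function itself: the helper name new_post_front_matter, the `---` delimiters, and the example author are assumptions for illustration.

```python
from datetime import datetime

# Fields left empty parse as YAML null, matching the None defaults.
FRONT_MATTER_TEMPLATE = """\
---
post_title:
post_description:
author: {author}
date: {date}
slug: {slug}
topics:
related_posts:
render_toc: False
enable_comments: True
---
"""


def new_post_front_matter(slug: str, author: str, created: datetime) -> str:
    """Return YAML front matter with the default values filled in."""
    return FRONT_MATTER_TEMPLATE.format(
        author=author,
        date=created.strftime("%Y-%m-%d %H:%M:%S"),
        slug=slug,
    )


front_matter = new_post_front_matter(
    "sitegen", "Jane Doe", datetime(2019, 5, 6, 23, 41, 39)
)
```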

Markdown support

Standard markdown features are supported, as well as any extended features supported by Misaka (and by extension, Hoedown), including tables, inline code and fenced codeblocks with syntax highlighting (in any language supported by the Pygments library), inline and standalone math expressions in LaTeX, etc.

On top of the above features, the following were added through pre-processing of the input markdown files, or through the use of non-standard HTML tags (which are ignored by the markdown parser) that are identified through post-processing of the output HTML (using the Beautiful Soup library). While some of the features below are little more than syntactic sugar for assigning Bootstrap classes (where such classes are not automatically assigned), others hide a fair bit of additional complexity.

Processing of input images

By default, local images with recognized extensions are compressed and scaled before saving them to the output directory; images with unrecognized extensions are copied without modification. If you'd like to add support for any currently unrecognized formats, doing so in sitegen.utils.compress_image should be trivial. Note that no input images are directly modified, only their copies.

Compression: jpgs (or jpegs), pngs, gifs, and tifs (or tiffs) are compressed using Pillow; svgs are compressed using svgcleaner.

Scaling (optional): By default, images wider than 1024 px (configurable in sitegen.config) are scaled to have that width unless their filenames end with 'large', in which case they are compressed but not scaled.
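The scaling rule itself comes down to a little arithmetic, sketched below. The function name is hypothetical, and I am assuming that "ends with 'large'" refers to the filename before its extension; the actual resizing would be done by an image library such as Pillow.

```python
from pathlib import Path
from typing import Tuple


def target_size(filename: str, width: int, height: int,
                max_width: int = 1024) -> Tuple[int, int]:
    """Return the output (width, height) for an image under the scaling rule."""
    keep_full_size = Path(filename).stem.endswith("large")
    if keep_full_size or width <= max_width:
        return width, height
    # Scale down to max_width, preserving the aspect ratio.
    return max_width, max(1, round(height * max_width / width))


print(target_size("photo.jpg", 2048, 1536))        # (1024, 768)
print(target_size("photo_large.jpg", 2048, 1536))  # (2048, 1536)
```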

Image formatting in output html

Alignment: Images may be aligned to the left, right, or center of the container by wrapping the markdown image references in <float-left>, <float-right>, <float-center>, or <autoscale> tags (the last 2 are identical). While potentially confusing, the tag names were chosen to avoid collision with existing <align-left> and <align-right> HTML tags.

Scaling: Images may be scaled by defining width, height, or both within any of the above custom alignment tags (e.g. <float-center width="50%">) using any units supported in CSS (e.g. %, em, rem, px).

Captioning: If a caption is defined within any of the above custom alignment tags (e.g. <float-right caption="awesome image">), it is placed below the image and constrained to the image width. Any images given a caption are automatically wrapped in <figure> tags.

Carousels: Image references placed on consecutive lines, wrapped within a single set of <carousel> tags, are placed into carousels (slideshows) with left/right controls and indicators, formatted using the Bootstrap framework.

Effortless YouTube video embedding

If a YouTube video link is wrapped in the custom <youtube-embed> tag, a 'privacy-enhanced' YouTube embed link will be generated (if the link provided is not already an embed link), and an iframe for it will be created and made responsive, autoscaling to the parent container. The resulting embedded video defaults to an aspect ratio of '16by9' (unless ratio is set in the tag; '16by9', '4by3', '1by1', and '21by9' are supported) and supports fullscreen viewing.

This will work for vanilla youtube links, embed links, or youtube links with complex query strings. For example, the markdown

<youtube-embed>
    [](https://www.youtube.com/watch?v=eCIHPdx1OAs&list=RDeCIHPdx1OAs&start_radio=1)
</youtube-embed>

would be replaced with the following HTML.

<div class="embed-responsive embed-responsive-16by9">
    <iframe allowfullscreen="" class="embed-responsive-item" src="https://www.youtube-nocookie.com/embed/eCIHPdx1OAs">
    </iframe>
</div>
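The link conversion in the example above can be sketched with urllib.parse. This is an illustration under my own assumptions (the function name is hypothetical, and only watch and embed links are handled), not the code sitegen uses.

```python
from urllib.parse import parse_qs, urlparse

EMBED_PREFIX = "https://www.youtube-nocookie.com/embed/"


def privacy_enhanced_embed_url(url: str) -> str:
    """Convert a YouTube watch or embed link to a youtube-nocookie embed link."""
    parsed = urlparse(url)
    if parsed.path.startswith("/embed/"):
        # Already an embed link: the id is the path segment after /embed/.
        video_id = parsed.path[len("/embed/"):].split("/")[0]
    else:
        # Watch links carry the id in the "v" query parameter; any other
        # parameters (playlist, start_radio, ...) are dropped.
        video_id = parse_qs(parsed.query).get("v", [""])[0]
    if not video_id:
        raise ValueError(f"could not find a video id in {url!r}")
    return EMBED_PREFIX + video_id


embed_url = privacy_enhanced_embed_url(
    "https://www.youtube.com/watch?v=eCIHPdx1OAs&list=RDeCIHPdx1OAs&start_radio=1"
)
```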

Task lists with checkboxes

  • for unordered list items that begin with [] or [ ], bullets are removed and brackets are replaced with an unchecked—but disabled—checkbox
  • for those that begin with [x] or [X], bullets are removed and filled-brackets are replaced with a checked—but disabled—checkbox
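A simplified sketch of the two rules above, using a regular expression on the rendered HTML instead of Beautiful Soup. The function name and the task-list-item class are hypothetical; hiding the bullet would be left to a CSS rule on that class.

```python
import re

# Matches "<li>" followed by "[]", "[ ]", "[x]", or "[X]" and optional spaces.
TASK_ITEM = re.compile(r"<li>\s*\[([ xX]?)\]\s*")


def render_task_items(html: str) -> str:
    """Replace leading bracket markers in list items with disabled checkboxes."""
    def replace(match):
        checked = " checked" if match.group(1) in ("x", "X") else ""
        return f'<li class="task-list-item"><input type="checkbox" disabled{checked}> '
    return TASK_ITEM.sub(replace, html)


rendered = render_task_items(
    "<ul><li>[ ] write post</li><li>[x] build site</li><li>plain</li></ul>"
)
```

List items that do not start with a bracket marker pass through unchanged.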

Table formatting

Defined table column widths: by placing a width within angle brackets in the header cell of a given column (e.g. <3em> or <10%>), the column may be given a fixed width. All other columns in the table (any whose widths are not specified) will remain responsive. This is especially useful for ensuring that the columns of multiple tables in a post align regardless of their contents.

| Left-aligned | Centered | Right-aligned | 10em | 5em | 10% |
|:---|:---:|---:|---|---|---|
| Example | table | with | contents | aligned | and |
| sized | as | you | would | expect | . |
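Pulling a width marker out of a header cell might look like the sketch below. The function name and the set of accepted units are assumptions; how the width is then applied (for example, as an inline style on the header cell) is left out.

```python
import re
from typing import Optional, Tuple

# A width such as <3em>, <10%>, <2.5rem>, or <120px> inside a header cell.
WIDTH_SPEC = re.compile(r"<\s*(\d+(?:\.\d+)?(?:%|em|rem|px))\s*>")


def split_width(header_cell: str) -> Tuple[str, Optional[str]]:
    """Return (cell text without the width marker, width or None)."""
    match = WIDTH_SPEC.search(header_cell)
    if match is None:
        return header_cell.strip(), None
    text = (header_cell[:match.start()] + header_cell[match.end():]).strip()
    return text, match.group(1)


print(split_width("Name <10em>"))  # ('Name', '10em')
print(split_width("Notes"))        # ('Notes', None)
```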

Prettification: To aid with legibility, tables are automatically assigned the following Bootstrap classes:

  • table-sm: cuts table padding in half
  • table-striped: shades alternate rows of the table
  • table-hover: shades a row on hover

Simple Blockquote/Author Formatting

A markdown blockquote, by default, will be placed in <blockquote> tags and assigned the blockquote bootstrap class. If the author's name is placed within the blockquote, separated from the quote by a blank line and placed after an endash, emdash, or series of 2 to 3 dashes and a space, the author's name will be placed in the blockquote's footer for special treatment. For example, the markdown blockquote

> It's better to be approximately right than precisely wrong.
>
> — Warren Buffett

will be replaced with the following HTML

<blockquote class="blockquote">
    <p>It's better to be approximately right than precisely wrong.</p>
    <footer class="blockquote-footer">Warren Buffett</footer>
</blockquote>

and rendered as below.

It's better to be approximately right than precisely wrong.

Warren Buffett
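Detecting the author line can be sketched with one regular expression. This assumes the blockquote has already been split into paragraphs; the function name is hypothetical and the real implementation may differ.

```python
import re
from typing import List, Optional, Tuple

# An en dash, em dash, or 2 to 3 hyphens, followed by a space and the name.
ATTRIBUTION = re.compile(r"^(?:\u2013|\u2014|-{2,3}) (.+)$")


def split_attribution(paragraphs: List[str]) -> Tuple[List[str], Optional[str]]:
    """Split blockquote paragraphs into (quote paragraphs, author or None)."""
    # The author must be its own final paragraph, after at least one quote line.
    if len(paragraphs) > 1:
        match = ATTRIBUTION.match(paragraphs[-1].strip())
        if match:
            return paragraphs[:-1], match.group(1).strip()
    return paragraphs, None
```

When an author is found, the remaining paragraphs go into the blockquote body and the name goes into the footer element; otherwise the blockquote is left as is.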

Moving forward

To do

A few features that I'm interested in implementing—but haven't gotten around to—include:

  •  Parallelization of image processing steps to reduce build time
  •  Single-step post generation from Jupyter Notebooks
  •  Automated generation of related-posts (e.g. using TF-IDF or similar to identify keywords, and clustering to compare posts)
  •  Automated generation of suggested topics, in case I forgot to add a relevant one
  •  Automated generation of a suggested post-description if one is not provided
  •  Support for linking between posts, in the post text, by post slug rather than URL (this one shouldn't be hard, but I'll need to decide what syntax to use in markdown to signal that such a link should be inserted)
  •  Support for captions on tables and carousels (this also shouldn't be hard; I just haven't gotten around to it)
  •  For images that the user does not want to be scaled, display a scaled version of the image in the post that links to the full-size image
  •  Support for a simplified blog index—without full text—to reduce the size of blog index pages
  •  Pagination of the blog index by page length rather than by number of posts per page

If you have any suggestions, comments, or recommendations, feel free to leave a comment below.

Concerns and open questions

In compressing and scaling images as I did, I tried to balance (1) file size, (2) compatibility/longevity, and (3) sanity.

To lower file sizes, I considered converting input images to webps, which can be smaller than comparable jpegs or pngs, but support for them is still not universal and the likelihood that I'll be able to open them in twenty years is probably lower. I could have also saved multiple scaled copies of each image and loaded whichever was most appropriate for the user's viewport width, but this seems messy, and I'd have N times as many images to deal with.

I mention all of this as my blog is fairly image-heavy, and for readers on slow connections, I worry a little bit about page-size. I doubt that the pages for individual blog posts will present much of a problem, but having five or ten blog posts on a blog-index page might.

At the moment, I'm weighing three different options (none of which have to do with image format):

  • Simplify the blog index (listing titles and metadata but removing content) and link to the full blog post content instead. I worry that this might hurt discoverability, if anyone actually decides to read my blog.
  • Paginate the blog index variably. Rather than paginating by posts per page, paginate by page length, page size, or something else entirely. I worry that this might be confusing to readers, or that this might hurt accessibility in some way that I haven't considered.
  • Reduce the number of posts per page. Right now I have 5 per page—so maybe 3? This seems like the most sane.

Did my blog index load slowly for you? Do any of the above options seem better than what I have now? Any other suggestions?


To leave a comment below, sign in using Github.