Writing an Asciidoc Parser in Rust: Asciidocr
I really only ever make something when I want something to exist that doesn’t already, or when I want something that does exist to more readily suit my (admittedly) idiosyncratic needs or thoughts about how it should exist. For better or worse, I have a lot of wants, and so I make a lot of things (e.g., Two Page Tuesday, or last night’s mostly-successful attempt at tapering a pair of pants I got at Global Thrift, or an early solve for the problem I’m solving here).
So: I wrote an asciidoc parser in Rust. I called it asciidocr because the Command-Line Rust book put an r after all the "clone a UNIX tool" projects, and I liked that convention.
Asciidoc is a lightweight markup language that is, in my opinion, the best one. Why it’s the best one is a separate issue entirely, but we can at least safely assume that it’s a good one, and the one that, for better or worse, I’ve been using to write nearly everything I’ve written for personal or professional use in the last five years or so. While it started as a Python project, it got new life (and a bunch of new features) when it was more or less taken over by the fine Asciidoctor folks, who wrote their converter in Ruby. It works very well, and does a lot of things. But.
It’s in Ruby, a language I have petty beef with and which, more importantly, is an interpreted, not compiled, language, which means that for every new machine I want to convert asciidoc files on, I need to install Ruby. And there are some other things too, in part pertaining to the way that templates must be written for custom output(s); it’s frankly a little slow; and whatever else.
But mostly it was the "I don’t want to have to write Ruby to extend the thing"
that got me thinking. I was dreaming about a
text-based writing management tool (like a
Scrivener but for folks who use vim), and having already written a tool to
make generating PDFs
from asciidoc easier, I knew that if I wanted to write this next app in
anything but Ruby, I’d need to either (a) subprocess out to the Ruby; (b) rely
on the old asciidoc.py
project, with its limitations (and also therefore
limiting myself to writing in Python, which, like Ruby, means that if I wanted
to share my tool, the folks using it would need to be able to install Python);
or (c) find or build a converter in a different language. So after getting part
of the way through an (a) implementation in Python, I cut my losses and started
looking more seriously into option (c), for I was learning Rust and Go(lang).
There is, in fact, a pretty good Go implementation of an asciidoc parser/converter. And there was a hot second when it looked like my company might transition to Go for some backend stuff, so I picked up Powerful Command-Line Applications in Go and got to work. Unfortunately I realized pretty quickly that I am allergic to the following, oft-repeated pattern in the language:
if err != nil {
    return err
}
And then it became clear that we weren’t going to be using Go at work, so I dropped it.
Rust, on the other hand: boy-howdy did I love (and still do) working in that. And sure, there wasn’t a very feature-complete asciidoc parser or converter yet, but I liked the language and figured I could learn something: so I asked for some mentorship (thanks big time to Kit Dallege for everything that follows) and got to work.
My background is, of course, very humanities-focused. I mean, sure, there was a math minor in there somewhere, but that was all in service of a brief glimmer of a future doing philosophy of math, so. I’ve written a lot of code, and have been writing some kind of code or other since I was a small kid (thank you, hackable Geocities sites), but I have no "computer science education." Learning how to write a parser seemed like a good way to go.
And instead of relying on a lexing package (e.g., something like pest), where you write a grammar and the thing does it for you, Kit recommended I do the whole thing by hand, since I’d learn more (and potentially it could be faster, or at least a smaller binary).
So that’s more or less what I did. It’s not perfect; it could, of course, be improved; there are some decisions I made early on that I would not make today, knowing what I know now; and I am very fucking proud of it. So we can dig in.
Pretending it’s a Compiler
Googling around got me to a few resources that seemed like they’d be relevant, specifically the Commonmark Spec section about parsing, but what really ended up sticking in my brain was a book called Crafting Interpreters, which I someday would love to go back and really read for its intended purpose. But since I was going to be doing more or less the first half (up to the point where you do something with the tree you’ve created by scanning and parsing the code), I figured this would be a good place to start, and it was! Very well-written, too. So much so that it made sense even though I haven’t pretended to know anything about reading Java in years.
What this meant, anyway, was that I had a clear path forward. Prior to asking for help, I’d written a half-of-a-half implementation that mixed up the lexing and the parsing and the output all together, but this was going to be better: in terms of building it, in terms of architecture, and in terms of being able to do other things with the tree/graph once I had it. So what I would then do was:
- Scan the document into tokens
- Parse those tokens into a tree
- Take that tree and do something with it
Easy enough, right?
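(Jumping ahead a little: in the finished crate those three steps end up wired together in just a couple of lines. Here is a hedged sketch of that pipeline, using the same calls the pyo3 bindings near the end of this post use; the module paths are my assumption based on that example, not a promise about the crate’s public API.)

use std::path::PathBuf;

// A sketch of the three-step pipeline. Paths and signatures mirror the pyo3
// example shown later in this post; treat them as illustrative, not gospel.
use asciidocr::backends::htmls::render_htmlbook;
use asciidocr::parser::Parser;
use asciidocr::scanner::Scanner;

fn convert_to_html(adoc: &str) -> Option<String> {
    let scanner = Scanner::new(adoc); // 1. scan the document into tokens
    let graph = Parser::new(PathBuf::from("-")).parse(scanner); // 2. parse tokens into a tree/graph
    render_htmlbook(&graph).ok() // 3. do something with it (here: make HTML)
}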
Scanning, Lexing, Whatever You Want to Call It
I’m still not sure what the difference between "scanning" and "lexing" is, if there is one at all, but anyway I needed to generate some tokens. I don’t plan on going into too much detail about the why/how of this (instead I refer you back to Crafting Interpreters), but there are a few interesting (annoying?) things about asciidoc that I think are worth mentioning here.
Like markdown, asciidoc is essentially a line-based language. The most
significant character is therefore the line break, \n, and in some
worlds/lights it makes sense to parse asciidoc line-by-line. If I were to go
back and do it as a "one-shot" parser (which, according to the chatter in the
Asciidoc community chat, isn’t possible anyway), I might do it as a
line-by-line thing. Instead, however, I did the scanning character-by-character,
in part because that’s what the book told me to do, and in part because keeping
track of the newline tokens actually made parsing much easier in the end (I
think/hope, anyway).
So the scanning.
Maybe the best "new thing I started using a lot" of 2024 was the humble Enum. I started using them in Python for a specific thing, and then started using them more, and one of the things I like best about Rust is that it takes its Enums seriously. So, to wit, the first thing I did was create a big-ass TokenType enum:
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenType {
    NewLineChar,
    LineContinuation,
    ThematicBreak,
    PageBreak,
    Comment,
    PassthroughBlock, // i.e., "++++"
    SidebarBlock,     // i.e., "****"
    SourceBlock,      // i.e., "----"
    // ...snip
Note: All source can be found in the Github repo. I’m going to condense and remove some comments and things in this post as needed to keep it clean.
And then a Struct for each token:
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Token {
    pub token_type: TokenType,
    pub lexeme: String, // raw string of code
    pub literal: Option<String>, // our literals are only ever strings (or represented as such)
    pub line: usize,
    pub startcol: usize,
    pub endcol: usize,
    /// The file's stack hierarchy if it's an include, otherwise stays empty
    pub file_stack: Vec<String>,
}
There is a draft official schema for how an asciidoc document should be (able
to be) represented, and that’s why we’re keeping track of line, startcol, etc.
I think if I were to go back and clean this up, we could probably drop the
literal attribute, since we don’t really need it (this was inspired/copied from
the Crafting Interpreters way of doing things, which has different requirements
than what we have, ultimately).
So once we have our Token structs to play with, we can then proceed to actually scanning the document into tokens. We create a Scanner struct to hold some state and the source and things:
#[derive(Debug)]
/// Scans an asciidoc `&str` into [`Token`]s to be consumed by the Parser.
pub struct Scanner<'a> {
    pub source: &'a str,
    start: usize,
    startcol: usize,
    current: usize,
    line: usize,
    file_stack: Vec<String>,
}
And then, because Rust has such good pattern matching, the actual work just becomes a(n admittedly gigantic) match/switch statement:
fn scan_token(&mut self) -> Token {
    let c = self.source.as_bytes()[self.current] as char;
    self.current += 1;
    match c {
        '\n' => self.add_token(TokenType::NewLineChar, false, 1),
        '\'' => {
            if self.starts_repeated_char_line(c, 3) {
                self.current += 2;
                self.add_token(TokenType::ThematicBreak, false, 0)
            } else if ['\0', ' ', '\n'].contains(&self.peek_back()) && self.peek() == '`' {
                self.current += 1;
                self.add_token(TokenType::OpenSingleQuote, true, 0)
            } else {
                self.add_text_until_next_markup()
            }
        }
        // ...snip
In order to keep things moving along speedily (because, in addition to being "cool," Rust is also supposed to be "fast"), the actual scanning function is implemented as an Iterator (a "generator" in Python-speak):
impl<'a> Iterator for Scanner<'a> {
    type Item = Token;

    fn next(&mut self) -> Option<Self::Item> {
        if !self.is_at_end() {
            self.start = self.current;
            return Some(self.scan_token());
        }
        None
    }
}
(It was amazing how easy it was to do that, really.)
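Because the scanner is just an Iterator over Tokens, driving it is ordinary iterator code. A quick illustrative sketch (the strings and printing here are mine, not from the repo; it assumes Scanner and Token are in scope):

fn main() {
    // Hypothetical driver code, for illustration: Scanner implements
    // Iterator<Item = Token>, so we can loop over it like any other iterator.
    let scanner = Scanner::new("== A Heading\n\nSome text.\n");
    for token in scanner {
        println!("{:?} on line {}", token.token_type, token.line);
    }

    // ...or gather the whole token stream up front, e.g. for a test assertion:
    let tokens: Vec<Token> = Scanner::new("Some text.\n").collect();
    assert!(!tokens.is_empty());
}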
Some fun nuances came up because we’re dealing with "text" instead of "code,"
and they ended up being about character boundaries. So take something like the
humble ellipsis (…) or an emoji: these require multiple bytes to represent.
This means that sometimes you might try to do something between the bytes it
takes to represent the character, which makes the scanner sad (and die, or in
Rust-parlance, panic!).
(It occurs to me now that I should have specified earlier that we’re scanning byte by byte, not character-by-character; there are some reasons for doing this that I don’t feel like explaining to do with the way text is encoded and then handled by Rust, so, just, like, trust me that this was a good way to do it.)
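A tiny standalone illustration of the boundary problem (nothing to do with the parser itself, just plain std):

fn main() {
    let s = "…"; // U+2026 HORIZONTAL ELLIPSIS: one character, three bytes in UTF-8
    assert_eq!(s.len(), 3); // len() counts bytes, not characters
    assert!(s.is_char_boundary(0)); // the start of the character is a boundary
    assert!(!s.is_char_boundary(1)); // byte 1 is in the middle of the character
    // let oops = &s[0..1]; // uncommenting this slice would panic at runtime
}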
Getting around this means that we just check for character boundaries when we
look around to see, based on context, what kind of token we should be producing.
And we do a lot of looking around! Here are a few of those lookaround helpers,
noting the easy-to-use is_char_boundary() function in there:
fn peek(&self) -> char {
    if self.is_at_end() || !self.source.is_char_boundary(self.current) {
        return '\0';
    }
    self.source.as_bytes()[self.current] as char
}

fn peek_back(&self) -> char {
    if self.start == 0 || !self.source.is_char_boundary(self.start - 1) {
        return '\0';
    }
    self.source.as_bytes()[self.start - 1] as char
}

fn peeks_ahead(&self, count: usize) -> &str {
    if self.is_at_end()
        || self.current + count > self.source.len()
        || !self.source.is_char_boundary(self.current + count)
    {
        return "\0";
    }
    &self.source[self.current..self.current + count]
}
This means that, say, if we get a character -, and know it’s the beginning of
a new line (i.e., that self.peek_back() == '\n'), and we can peeks_ahead to
see that self.peeks_ahead(4) == "---\n", we know that we should generate a
TokenType::SourceBlock delimiter token. Scanning is essentially that, but,
like, a bunch of times with a bunch of edge cases and nuances (e.g., because
that four-repeated-characters-before-a-newline is such a common pattern, you
write a function that checks that for you).
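Since that check comes up a lot, here is a rough, free-standing sketch of what one can look like. To be clear, this is not asciidocr’s actual starts_repeated_char_line helper, just an illustration of the idea:

/// Illustrative only (not asciidocr's actual helper): report whether, starting
/// at byte offset `start`, the source contains `count` copies of `c` followed
/// by a newline (or the end of input), i.e., a delimiter line like "----".
fn is_repeated_char_line(source: &str, start: usize, c: char, count: usize) -> bool {
    let mut chars = source[start..].chars();
    for _ in 0..count {
        if chars.next() != Some(c) {
            return false;
        }
    }
    matches!(chars.next(), Some('\n') | None)
}

fn main() {
    assert!(is_repeated_char_line("----\nfn main() {}\n----", 0, '-', 4));
    assert!(!is_repeated_char_line("--- just a dash-y paragraph\n", 0, '-', 4));
}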
This, naturally, segues into unit testing!
There are a lot of tests around the scanner! I haven’t yet gotten around to
running coverage on it, but I think it’s pretty good. One thing I don’t like
about Rust is that, by convention, you keep unit tests in the same file as the
code they’re testing. I see why you’d want to do that, but also my
scanner/mod.rs file is a whopping 1932 lines long. Coming from Python-land…
ouch! Still: it works, especially if you use my new best friend rstest, which
works so analogously to our dear friend pytest that I was able to get up and
running in a matter of minutes with it, simplifying the test-cases dramatically:
#[rstest]
#[case("NOTE", TokenType::NotePara)]
#[case("TIP", TokenType::TipPara)]
#[case("IMPORTANT", TokenType::ImportantPara)]
#[case("CAUTION", TokenType::CautionPara)]
#[case("WARNING", TokenType::WarningPara)]
fn inline_admonitions(#[case] markup_check: &str, #[case] expected_token: TokenType) {
    let markup = format!("{}: bar.", markup_check);
    let expected_tokens = vec![
        Token::new_default(
            expected_token,
            format!("{}: ", markup_check),
            Some(format!("{}: ", markup_check)),
            1,
            1,
            markup_check.len() + 2, // account for space
        ),
        Token::new_default(
            TokenType::Text,
            "bar.".to_string(),
            Some("bar.".to_string()),
            1,
            markup_check.len() + 3,
            markup_check.len() + 6,
        ),
    ];
    scan_and_assert_eq(&markup, expected_tokens);
}
Easy, right? So let’s now suppose we scan our document-as-a-&str into a bunch of tokens. We then parse them. Yay!
Parser-ing
…and again we use a big-ass match statement. But before we can really get into
that, we need to look at what we’re doing all this parsing into, namely a
(mostly) spec-compliant Abstract Syntax Graph.

So parsing then becomes a matter of looking at a given Token and deciding what
to do with it. Because "what to do with it" is often a matter of context, we
build a lot of that context into our Parser:
/// Parses a stream of tokens into an [`Asg`] (Abstract Syntax Graph), returning the graph once all
/// tokens have been parsed.
pub struct Parser {
    /// Where the parsing "starts," i.e., the adoc file passed to the script
    origin_directory: PathBuf,
    /// allows for "what just happened" matching
    last_token_type: TokenType,
    /// optional document header
    document_header: Header,
    /// document-level attributes, used for replacements, etc.
    document_attributes: HashMap<String, String>,
    /// holding ground for graph blocks until it's time to push to the main graph
    block_stack: Vec<Block>,
    /// holding ground for inline elements until it's time to push to the relevant block
    inline_stack: VecDeque<Inline>,
    /// holding ground for includes file names; if inside an include push to stack, popping off
    /// once the file's tokens have been accommodated (this allows for simpler nesting)
    file_stack: Vec<String>,
    /// holding ground for a block title, to be applied to the subsequent block
    block_title: Option<Vec<Inline>>,
    /// holding ground for block metadata, to be applied to the subsequent block
    metadata: Option<ElementMetadata>,
    /// counts in/out delimited blocks by line reference; allows us to warn/error if they are
    /// unclosed at the end of the document
    open_delimited_block_lines: Vec<usize>,
    /// appends text to block or inline regardless of markup, token, etc. (will need to change
    /// if/when we handle code callouts)
    open_parse_after_as_text_type: Option<TokenType>,
    // convenience flags
    in_document_header: bool,
    /// designates whether we're to be adding inlines to the previous block until a newline
    in_block_line: bool,
    /// designates whether new literal text should be added to the last span
    in_inline_span: bool,
    /// designates whether, despite newline last_tokens_types, we should append the current block
    /// to the next
    in_block_continuation: bool,
    /// forces a new block when we add inlines; helps distinguish between adding to section.title
    /// and section.blocks
    force_new_block: bool,
    /// Temporarily preserves newline characters as separate inline literal tokens (where ambiguous
    /// blocks, i.e., DListItems, may require splitting the inline_stack on the newline)
    preserve_newline_text: bool,
    /// Some parent elements have non-obvious closing conditions, so we want an easy way to close these
    close_parent_after_push: bool,
    /// Used to see if we need to add a newline before new text; we don't add newlines to the text
    /// literals unless they're continuous (i.e., we never count newline paras as paras)
    dangling_newline: Option<Token>,
}
(As an aside: I’m keeping the comments on this struct, as opposed to many of the others I’ve shown above, in part because it’s useful and in part because I want to shout out to docs.rs for making it SUPER easy to generate really nice documentation for your project. Makes my former technical writer heart happy.)
We keep track of a lot of state, and frankly it got a little over-complicated, but also I didn’t have the time to make it simpler, so: it works, you know?
Again we have a big match statement with a lot of arms like:
TokenType::QuoteVerseBlock => {
    // check if it's verse
    if let Some(metadata) = &self.metadata {
        if metadata.declared_type == Some(AttributeType::Verse) {
            self.parse_delimited_leaf_block(token);
            return;
        }
    } else if self.open_parse_after_as_text_type.is_some() {
        self.parse_delimited_leaf_block(token);
        return;
    }
    self.parse_delimited_parent_block(token);
}
These, in turn, generate various Block and Inline objects that get added to our Abstract Syntax Graph:
#[derive(Serialize, Debug)]
pub struct Asg {
    pub name: String,
    #[serde(rename = "type")]
    pub node_type: NodeTypes,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub attributes: Option<HashMap<String, String>>,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub header: Option<Header>,
    #[serde(skip)]
    /// footnote references
    document_id: String,
    #[serde(skip)]
    /// Hash of all IDs in the document, and the references they point to
    document_id_hash: HashMap<String, Vec<Inline>>,
    /// Document contents
    pub blocks: Vec<Block>,
    pub location: Vec<Location>,
}
So by and by we build our graph, which takes something like:
This document has two paragraphs.


Paragraphs may be separated by one or more empty lines.
Into:
{
  "name": "document",
  "type": "block",
  "blocks": [
    {
      "name": "paragraph",
      "type": "block",
      "inlines": [
        {
          "name": "text",
          "type": "string",
          "value": "This document has two paragraphs.",
          "location": [ { "line": 1, "col": 1 }, { "line": 1, "col": 33 } ]
        }
      ],
      "location": [ { "line": 1, "col": 1 }, { "line": 1, "col": 33 } ]
    },
    {
      "name": "paragraph",
      "type": "block",
      "inlines": [
        {
          "name": "text",
          "type": "string",
          "value": "Paragraphs may be separated by one or more empty lines.",
          "location": [ { "line": 4, "col": 1 }, { "line": 4, "col": 55 } ]
        }
      ],
      "location": [ { "line": 4, "col": 1 }, { "line": 4, "col": 55 } ]
    }
  ],
  "location": [ { "line": 1, "col": 1 }, { "line": 4, "col": 55 } ]
}
Note: All that location stuff is required by the schema; I don’t like it, but hey, it’s not all about me. If ever somebody takes this to create a better asciidoc LSP or something, it’ll be useful information. (Or if I ever start doing more error handling/verification for the user.)
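Getting that JSON out of the graph is, happily, not much work on the Rust side, since the Asg derives Serialize. A sketch (serde_json as an assumed dependency, with a generic bound so I don’t have to guess at module paths):

/// Sketch: the Asg derives serde::Serialize (as shown above), so producing the
/// schema-shaped JSON is a single serde_json call on a parsed graph.
fn to_asg_json<T: serde::Serialize>(graph: &T) -> serde_json::Result<String> {
    serde_json::to_string_pretty(graph)
}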
I could perhaps go into more detail about how the parsing actually works, but, you know, it’s just creating objects, and this is getting long. So if you’re curious, look at the code (or holler at me on Bluesky and I’ll do a follow-up post about whichever part you’re interested in). We’ll now turn to doing something with this graph we’ve made.
Turning it Into Something Useful (Templating)
The first, most obvious useful thing for the parser to do is produce HTML, since
that can be turned into basically anything else, one way or another. Instead of
targeting the kind of HTML that Asciidoctor produces (which I find overly
div-heavy), I targeted an HTML standard called "HTMLBook", in part because
that’s what I use for work and am therefore most comfortable with, and in part
because it’s clean and simple and more like what pick-your-favorite-markdown
converter produces. So to make HTML, we use templating. Yes! Our old friend
templating. From Dreamweaver templates to LiquidTemplates to handlebars to
Jinja/Django, they’re all more or less the same. More or less usable. Etc. For
this project I went with one called tera, after trying one called askama, which
was really really cool but ultimately was hard to make work nicely with serde.

tera, on the other hand, is basically just Django templates. I write Django
templates at work. Easy:
{% import "inline.html.tera" as inline_macros %}
{% import "block.html.tera" as block_macros %}
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>{%- if header %}
{%- for inline in header.title %}
{{- inline_macros::process_inline(inline=inline) -}}
{% endfor -%}
{% endif -%}</title>
</head>
<body>{% for block in blocks %}
{{ block_macros::process_block(block=block,skip_tag=false) -}}
{% endfor %}
</body>
</html>
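On the Rust side, handing the graph to those templates is mostly serde’s doing. Something along these lines works (a sketch using tera’s public API; the glob and template name below are made up for illustration and are not the exact asciidocr wiring):

use serde::Serialize;
use tera::{Context, Tera};

// Sketch, not the exact asciidocr wiring: anything that derives Serialize --
// like the Asg -- can be turned into a tera Context and rendered. The template
// glob and entry-point name here are hypothetical.
fn render_with_tera<T: Serialize>(graph: &T) -> Result<String, tera::Error> {
    let tera = Tera::new("templates/**/*.tera")?; // compile the .tera templates
    let context = Context::from_serialize(graph)?; // header, blocks, etc.
    tera.render("htmlbook.html.tera", &context) // render the top-level template
}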
There is a pretty annoying recursion issue (not the fault of tera so much as
the fault of what I’m trying to do with it), which means that the block and
inline macro code is… ugly. But hey, it works to produce nice, clean documents
like the following:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title></title>
</head>
<body>
<p>What follows is an aside.</p>
<aside data-type="sidebar">
<h5>Aside Title</h5>
<p>Some aside text!</p></aside>
</body>
</html>
Nice.
But Wait! There’s More!
So now we’ve more or less gotten to the point where we’ve duplicated a good
chunk of what asciidoctor, the reference implementation, does in terms of
parsing and conversion, but of course: asciidocr, this implementation, DOES NOT
DO EVERYTHING ASCIIDOCTOR DOES, and doesn’t intend to. It does, however, handle
a whole bunch of the language, including nice things like include:: directives
(see the limitations doc in the repo for more). But this all started because I
not only wanted a non-interpreted-language implementation (with Rust we can
generate binaries), but also because I wanted to do other stuff, more easily.
So let’s talk about a little of that.
Docx
If there is a "killer feature" of asciidocr, it is that it will — eventually —
produce Word/docx files natively. Creating docx files is a PAIN IN THE ASS, but
it’ll be worth it for folks like me who want to write their fictions and
whatever else in asciidoc, but then have to send journals and agents and
publishers Word documents.
I’m currently rewriting the implementation of the DOCX backend, but even now, if
you install the tool with the --features docx flag enabled (for more on what I’m
talking about when I talk about installing a Rust feature, see here), you can
get a docx created IF:
- It’s only prose and headings
- BUT it can include italics and bold and stuff
The reimplementation will be better and handle more things — tables, lists,
etc. — but I wanted to write this post now, instead of waiting for it to be
"done," since "done" is a myth when it comes to software. Anyway: go try it out!
My hope is for the docx backend to be stable enough that I don’t need to hide
it behind a feature flag anymore.
Rust and Python
And, somewhat finally, another feature-flag thing: calling asciidocr from
Python, making asciidoc conversions super fast with modern syntax (compared to
asciidoc.py). All the credit for this really goes to the pyo3 project, but
building on top of their brilliant work, it’s very easy to do something like:
#![cfg(feature="python")]
use std::path::PathBuf;

use crate::scanner;
use crate::parser;
use crate::backends::htmls::render_htmlbook;
use pyo3::{exceptions::PyRuntimeError, prelude::*};

/// parses a string using the specified backend
#[pyfunction]
fn parse_to_html(adoc_str: &str) -> PyResult<String> {
    let graph = parser::Parser::new(PathBuf::from("-")).parse(scanner::Scanner::new(adoc_str));
    match render_htmlbook(&graph) {
        Ok(html) => Ok(html),
        Err(_) => Err(PyRuntimeError::new_err("Error converting asciidoc string")),
    }
}

#[pymodule]
fn asciidocr(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(parse_to_html, m)?)
}
Build a wheel, install it, and then from within Python:
$ python
Python 3.13.1 (main, Jan 7 2025, 10:41:20) [Clang 16.0.0 (clang-1600.0.26.6)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import asciidocr
>>> asciidoc = "This is _pretty freakin' cool_, right?!"
>>> html = asciidocr.parse_to_html(asciidoc)
>>> print(html)
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title></title>
</head>
<body>
<p>This is <em>pretty freakin' cool</em>, right?!</p>
</body>
</html>
So that’s nice, and potentially useful. As a friend pointed out recently, I need to get this up on PyPI, but, you know, in time…
Loose Ends
So there’s writing an asciidoc parser in Rust, in a pretty high-level way (I could in theory go back and add more detail, but this post is far, far too long). And there are plenty of loose ends so far as the project itself goes, like:
- Actually covering the entirety of the asciidoc language
- Allowing users to supply stylesheets for HTML builds via the CLI (and I never talked above about the CLI, did I? Or the packaging process? Maybe separate posts; anyway I used clap).
- Creating an Asciidoctor-compliant HTML backend, because that means that folks can use this more as a "drop-in replacement" if they want
- Finishing the docx build
- …other future dream-big builds that I don’t want to talk about yet (OK: PDFs, I’m talking about PDFs).
- And much, much more!
In any case.
As with all newer skills, the biggest benefit to my Rust knowledge was just
having to write an ass-ton of Rust. I also think I learned something about
design patterns, about balance (i.e., maybe it would have been more "pure" to
keep some things in the Parser, but it was so much easier to just make the
Scanner a little bit smarter sometimes), and about writing software more
generally. I like Rust, in part, because it makes you really consider what the
"right" thing to do is (okay: I really like it mostly because the tooling is so
damn good), and this in turn makes me think about writing all code differently
(apologies to my coworkers, who now have to put up with me importing Rust-y
patterns into Python — I promise I’ll only do it when it makes sense!).
Mostly, though: I’m just happy I now have a tool that does more or less what I
want it to do, and quickly (not to brag, but compare some very non-scientific
testing that has asciidocr converting a file to HTML in 0.01s user, whereas
asciidoctor takes a whole 0.32s user. It’s an admittedly small but noticeable
difference, especially for larger documents). So in that sense: mission
achieved. Yay.