draft: post about formatting lib

りき萌 2025-08-17 22:22:00 +02:00
parent 332fb13668
commit 153ba9f0c2
5 changed files with 699 additions and 5 deletions

554
content/fmt.dj Normal file

@ -0,0 +1,554 @@
title = "A string formatting library in 60 lines of C++"
+++
In this write-up, I will walk you through an implementation of a string formatting library for C++ I came up with for my video game.
The end result came out really compact, at only 60 lines of code---providing a skeleton that can be supplemented with additional functionality at low cost.
## Usage
Given a format buffer...
```cpp
char str[64];
String_Buffer buf = {str, sizeof str};
```
...the `fmt::format` function can be called with a format string parameter, containing the character sequence `{}` (a _hole_) where parameters are to be substituted, as well as the parameters themselves.
```cpp
fmt::format(buf, "Hello, {}!", "world");
assert(strcmp(str, "Hello, world!") == 0);
```
When a literal `{}` is needed, the `{` must be doubled---even when format arguments are not present.
```cpp
fmt::format(buf, "Hello, {{}!");
assert(strcmp(str, "Hello, {}!") == 0);
```
Further, when a format argument is not present, no undefined behaviour occurs---the hole is rendered as the empty string.
```cpp
fmt::format(buf, "empty {} hole");
assert(strcmp(str, "empty  hole") == 0);
```
Multiple format arguments can be specified as well.
```cpp
fmt::format(buf, "[{}] [{}] {}", "main", "info", "Hewwo :3");
assert(strcmp(str, "[main] [info] Hewwo :3") == 0);
```
In case the buffer is not sufficiently large to contain the full string, the function writes as many characters as it can, and sets the `String_Buffer`'s `len` field to the number of characters required.
```cpp
fmt::format(
buf,
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
"aaaaaaaaaaaaaaaaaa{} {}",
"Vector", "Amber"
);
assert(strcmp(
str,
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
"aaaaaaaaaaaaaaaaaaV"
) == 0);
assert(buf.len == 74);
```
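A caller can detect truncation by comparing `len` against `cap`. The following is only a minimal sketch: the helper name `was_truncated` is made up for this example and is not part of the library.

```cpp
#include <cassert>

struct String_Buffer
{
    char* str;
    int cap;
    int len = 0;
};

// Hypothetical helper, not part of the library.
// One byte of `cap` is reserved for the NUL terminator, so the output fits
// only if `len` is strictly less than `cap`.
bool was_truncated(const String_Buffer& buf)
{
    return buf.len >= buf.cap;
}

void example()
{
    // Same numbers as the truncated call above: 74 characters were needed,
    // but a 64-byte buffer holds at most 63 of them.
    String_Buffer buf = {nullptr, 64, 74};
    assert(was_truncated(buf));
}
```

A check like this is one place where a caller could decide to retry with a larger buffer.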
Additional functions can be written on top of this base functionality to improve ergonomics in real-world code.
These are omitted from this write-up for the sake of brevity.
## Problem statement
1. A string formatting library consists of a single function `format`.
You give the function a _format string_, which describes the output shape, as well as a set of _format parameters_, which the function then substitutes into the format string, rendering them in a human-readable way.
2. The `format` function ought to write to a pre-allocated buffer of characters.
This is a choice made in favour of simplicity: writing to a pre-allocated buffer can fail, but compared to arbitrary I/O, there is only one failure mode: the buffer is exhausted.
Naturally, this cannot work in memory-constrained environments, such as embedded devices---where you would want to write to a small buffer and flush it in a loop to reduce memory usage---but this does not apply in the context of a desktop video game.
3. As already mentioned in the usage overview, if the buffer is full, the function should report the number of characters that _would_ have been written, had the buffer capacity not been exceeded.
4. There _has_ to be a format string.
An example of a format string-less API is C++'s `<iostream>`.
Instead of having a format string like `printf`, `<iostream>` opts to use overloads of `operator<<` to write to the output.
This has the disadvantage of not being greppable (which is useful for debugging error logs), as well as not being localisable (because there is no format string that could be replaced at runtime).
Additionally, I don't want the format string to have extra specifiers such as C's `%d`, `%x`, etc. specifying the type of output, or Python's `{:.3}`, for specifying the style of output. The C approach is error-prone and inextensible, and the Python approach, while convenient, reduces greppability.
Instead, the representation is defined only according to the formatted value's type.
5. It has to have a small footprint.
There exist plenty of string formatting libraries for C++, such as [{fmt}](https://github.com/fmtlib/fmt), or even the recently introduced `std::print`, but they suffer from gigantic compile-time complexity through their heavy use of template metaprogramming.
While my compilation time benchmark results for {fmt} weren't as dire as those presented [in their README](https://github.com/fmtlib/fmt/tree/127413ddaa0d31149c8d41c7e10dcc27ae984b5a?tab=readme-ov-file#compile-time-and-code-bloat), they still don't paint a pretty picture---with a simple program using `printf` taking ~35 ms to compile, and the equivalent program using {fmt} taking ~200 ms.
I also find the benefits of an open rather than closed API, as well as compile-time checked format strings, dubious. Instead, I want something lean and small, using basic features of the language, and easy enough to drop into your own project, then extend and modify according to your needs---in the spirit of [rxi's simple serialisation system](https://rxi.github.io/a_simple_serialization_system.html).
6. Simply using `printf` is [not good enough](#Why-not-printf).
## Implementation walkthrough
We will start by defining the `String_Buffer` type, which also serves as the formatter's state.
It represents a user-provided string buffer with a capacity and a length.
```cpp
struct String_Buffer
{
char* str;
int cap;
int len = 0;
};
```
A `String_Buffer` is intended to be initialised via aggregate initialisation (`{str, cap}`).
This mimics the `snprintf` API, which accepts its buffer and size arguments in the same order.
At the core of the library's output is `write`.
It performs a bounds-checked write of a string with known length to the output string buffer.
```cpp
void write(String_Buffer& buf, const char* str, int len)
{
int remaining_cap = buf.cap - buf.len - 1; // leave one byte for NUL
int write_len = len > remaining_cap ? remaining_cap : len;
if (write_len > 0)
memcpy(buf.str + buf.len, str, write_len);
buf.len += len;
}
```
My implementation truncates the output if the buffer size is exhausted, but keeps incrementing the buffer's `len` past `cap`, such that the caller can know the full number of characters written after all `write`s, and adjust accordingly.
This is a deliberate choice coming from the fact that `String_Buffer` does not own the buffer's allocation, and the fact that string formatting is a performance-sensitive piece of code, which will be called often in the game loop.
However, it is trivial to replace the length saturation logic with a call to `realloc`, should that be the more appropriate choice.
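As a rough sketch of that alternative, here is a growing variant of `write`. It assumes `str` was obtained from `malloc`/`realloc` (or is null), which is not how the buffers in this write-up are set up. The name `write_growing`, the doubling strategy, and the `bool` return value are all choices made for this example.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

struct String_Buffer
{
    char* str; // in this variant: owned, allocated with malloc/realloc, or null
    int cap;
    int len = 0;
};

// Hypothetical growing variant of `write`: instead of truncating, it doubles
// the allocation until the new text and a trailing NUL fit.
bool write_growing(String_Buffer& buf, const char* str, int len)
{
    int required_cap = buf.len + len + 1; // one extra byte for NUL
    if (required_cap > buf.cap) {
        int new_cap = buf.cap > 0 ? buf.cap : 16;
        while (new_cap < required_cap)
            new_cap *= 2;
        char* new_str = static_cast<char*>(std::realloc(buf.str, new_cap));
        if (new_str == nullptr)
            return false; // allocation failed; the old buffer is left untouched
        buf.str = new_str;
        buf.cap = new_cap;
    }
    std::memcpy(buf.str + buf.len, str, len);
    buf.len += len;
    buf.str[buf.len] = 0; // keep the contents NUL-terminated after every write
    return true;
}

void example()
{
    String_Buffer buf = {nullptr, 0};
    assert(write_growing(buf, "Hello, ", 7));
    assert(write_growing(buf, "world!", 6));
    assert(std::strcmp(buf.str, "Hello, world!") == 0);
    std::free(buf.str);
}
```

The trade-off is that every write can now allocate, which is the cost the truncating version avoids.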
Now onto parsing format strings.
Format strings can be defined as a sequence of _literals interspersed with arguments_.
That is, a format string always takes the form:
```ebnf
fstr = { literal, hole }, literal;
```
The leading and trailing `literal` can be the empty string.
The task of processing the literal parts is done by a function called `next_hole`.
It parses the format string, looking for a character sequence representing a hole `{}`, and writes the string preceding the hole `{}` to the output buffer.
```cpp
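bool next_hole(String_Buffer& buf, const char*& fstr)
{
    const char* prefix = fstr;
    while (*fstr != 0) {
        if (*fstr == '{') {
            int len = fstr - prefix;
            ++fstr;
            if (*fstr == '}') {
                // `{}`: flush the preceding literal and stop past the hole
                ++fstr;
                write(buf, prefix, len);
                return true;
            }
            if (*fstr == '{') {
                // `{{`: flush the literal, start a new span at the second `{`
                write(buf, prefix, len);
                prefix = fstr;
                ++fstr;
            }
        } else {
            // only skip ordinary characters here, so that a `{` at the very
            // end of the string never moves `fstr` past the terminating NUL
            ++fstr;
        }
    }
    write(buf, prefix, fstr - prefix);
    return false;
}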
```
`fstr` is received as a reference to a pointer, representing the format string's parsing state.
A call to `next_hole` will find the literal part, visualised with `---`, and leave the `fstr` pointer past the hole `{}`, visualised with `^`.
```
Hello, {}!
-------  ^
```
In this case, it will return `true` to signal that it stopped at a hole.\
In case there is no hole however, and the end of the string is reached, it will return `false`.
```
Hello, {}!
         -^ end of string
```
Additionally, we handle the `{{` escaping case.
Without the extra `if` clause, it would be printed into the output literally as `{{`.
Therefore, when `{` is encountered directly after another `{`, we have to flush the current span, and start a new one directly after the first `{`. Underlined with `---` are the spans of characters that get written to the output.
```
empty {{} hole
------ -------
```
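To see the span handling in action, `next_hole` can be driven by hand. The sketch below repeats `String_Buffer`, `write` and `next_hole` so that it compiles on its own; `fill_holes` is a made-up driver that exists only for this example.

```cpp
#include <cassert>
#include <cstring>

struct String_Buffer
{
    char* str;
    int cap;
    int len = 0;
};

void write(String_Buffer& buf, const char* str, int len)
{
    int remaining_cap = buf.cap - buf.len - 1; // leave one byte for NUL
    int write_len = len > remaining_cap ? remaining_cap : len;
    if (write_len > 0)
        std::memcpy(buf.str + buf.len, str, write_len);
    buf.len += len;
}

// Copy of `next_hole`. This copy skips ordinary characters in an `else`
// branch, so the character following a `{` is always examined.
bool next_hole(String_Buffer& buf, const char*& fstr)
{
    const char* prefix = fstr;
    while (*fstr != 0) {
        if (*fstr == '{') {
            int len = static_cast<int>(fstr - prefix);
            ++fstr;
            if (*fstr == '}') {
                ++fstr;
                write(buf, prefix, len);
                return true;
            }
            if (*fstr == '{') {
                write(buf, prefix, len);
                prefix = fstr;
                ++fstr;
            }
        } else {
            ++fstr;
        }
    }
    write(buf, prefix, static_cast<int>(fstr - prefix));
    return false;
}

// Hypothetical driver: fills every hole with the same `filler` text and
// returns how many holes it saw.
int fill_holes(String_Buffer& buf, const char* fstr, const char* filler)
{
    int holes = 0;
    while (next_hole(buf, fstr)) {
        write(buf, filler, static_cast<int>(std::strlen(filler)));
        ++holes;
    }
    return holes;
}

void example()
{
    char str[64] = {}; // zero-filled, so the output is NUL-terminated
    String_Buffer buf = {str, sizeof str};
    int holes = fill_holes(buf, "[{}] {{} {}!", "x");
    assert(holes == 2); // `{{}` is an escape, not a hole
    assert(std::strcmp(str, "[x] {} x!") == 0);
}
```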
Finally, we define `format`: the function that accepts a format string, a set of arguments, and inserts them into the output string.
It is the sole template in this library, and also the part that was most tricky to come up with.
```cpp
template<typename... Args>
void format(String_Buffer& buf, const char* fstr, const Args&... args)
{
(format_value(buf, fstr, args), ...);
while (next_hole(buf, fstr)) {}
}
```
`format_value` is a function implemented by the user of the library.
Here is an example implementation for strings:
```cpp
void format_value(String_Buffer& buf, const char*& fstr, const char* value)
{
if (next_hole(buf, fstr))
write(buf, value, strlen(value));
}
```
The task of `format_value` is to consume a single hole, and fill it in with the formatted `value`.
To provide support for more data types, the function can be overloaded.
For example, providing an implementation of:
```cpp
void format_value(String_Buffer& buf, const char*& fstr, int value);
// ^^^^^^^^^
```
will make it possible to write out integers in addition to strings.
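One possible body for that overload is sketched below. It goes through `snprintf` and a small temporary buffer, which is an assumption of this example rather than something the library prescribes. The rest of the listing repeats the code from above so that it compiles on its own.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

struct String_Buffer
{
    char* str;
    int cap;
    int len = 0;
};

void write(String_Buffer& buf, const char* str, int len)
{
    int remaining_cap = buf.cap - buf.len - 1; // leave one byte for NUL
    int write_len = len > remaining_cap ? remaining_cap : len;
    if (write_len > 0)
        std::memcpy(buf.str + buf.len, str, write_len);
    buf.len += len;
}

// Copy of `next_hole`, skipping ordinary characters in an `else` branch.
bool next_hole(String_Buffer& buf, const char*& fstr)
{
    const char* prefix = fstr;
    while (*fstr != 0) {
        if (*fstr == '{') {
            int len = static_cast<int>(fstr - prefix);
            ++fstr;
            if (*fstr == '}') {
                ++fstr;
                write(buf, prefix, len);
                return true;
            }
            if (*fstr == '{') {
                write(buf, prefix, len);
                prefix = fstr;
                ++fstr;
            }
        } else {
            ++fstr;
        }
    }
    write(buf, prefix, static_cast<int>(fstr - prefix));
    return false;
}

void format_value(String_Buffer& buf, const char*& fstr, const char* value)
{
    if (next_hole(buf, fstr))
        write(buf, value, static_cast<int>(std::strlen(value)));
}

// Sketch of the `int` overload: render the digits into a temporary buffer,
// then append them with `write`.
void format_value(String_Buffer& buf, const char*& fstr, int value)
{
    if (next_hole(buf, fstr)) {
        char digits[16]; // enough for a 32-bit int, its sign, and a NUL
        int len = std::snprintf(digits, sizeof digits, "%d", value);
        write(buf, digits, len);
    }
}

template<typename... Args>
void format(String_Buffer& buf, const char* fstr, const Args&... args)
{
    (format_value(buf, fstr, args), ...);
    while (next_hole(buf, fstr)) {}
}

void example()
{
    char str[64] = {}; // zero-filled, so the output is NUL-terminated
    String_Buffer buf = {str, sizeof str};
    format(buf, "{}% of {} loaded", 33, "assets");
    assert(std::strcmp(str, "33% of assets loaded") == 0);
}
```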
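Note that the overloads of `format_value` _must_ be declared before `format`.
This is because the unqualified name `format_value` is looked up at `format`'s definition site, and the only lookup deferred until instantiation is argument-dependent lookup. That lookup has no namespace to search for built-in types such as `int` or `const char*`, and `String_Buffer` lives outside the `fmt` namespace, so overloads declared in `fmt` after `format` are never found.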
This choice was made for the sake of simplicity, but if it turns out to be a problem, it is possible to use specialisation. It is important to note though that specialisation bypasses overload resolution, so this will not work:
```cpp
template<typename T>
void format_value(String_Buffer& buf, const char*& fstr, T value) = delete;
template<>
void format_value<const char*>(
String_Buffer& buf, const char*& fstr, const char* value)
{
if (next_hole(buf, fstr))
write(buf, value, strlen(value));
}
template<typename... Args>
void format(String_Buffer& buf, const char* fstr, const Args&... args)
{
(format_value<Args>(buf, fstr, args), ...);
while (next_hole(buf, fstr)) {}
}
format(buf, "Hello, {}!", "world");
```
because `Args` is deduced as `char [6]` for the string literal `"world"`, and not `const char*`, and `format_value<char [6]>` is deleted.
This should be solvable with some additional work, but I've deemed it unnecessary in my case.
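One direction that additional work could take is to decay the argument type before picking the specialisation, so that string literals arrive as `const char*`. This is only a sketch: using `std::decay_t` here is my assumption about what would be enough, and it has only been thought through for the string case.

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

struct String_Buffer
{
    char* str;
    int cap;
    int len = 0;
};

void write(String_Buffer& buf, const char* str, int len)
{
    int remaining_cap = buf.cap - buf.len - 1; // leave one byte for NUL
    int write_len = len > remaining_cap ? remaining_cap : len;
    if (write_len > 0)
        std::memcpy(buf.str + buf.len, str, write_len);
    buf.len += len;
}

// Copy of `next_hole`, skipping ordinary characters in an `else` branch.
bool next_hole(String_Buffer& buf, const char*& fstr)
{
    const char* prefix = fstr;
    while (*fstr != 0) {
        if (*fstr == '{') {
            int len = static_cast<int>(fstr - prefix);
            ++fstr;
            if (*fstr == '}') {
                ++fstr;
                write(buf, prefix, len);
                return true;
            }
            if (*fstr == '{') {
                write(buf, prefix, len);
                prefix = fstr;
                ++fstr;
            }
        } else {
            ++fstr;
        }
    }
    write(buf, prefix, static_cast<int>(fstr - prefix));
    return false;
}

template<typename T>
void format_value(String_Buffer& buf, const char*& fstr, T value) = delete;

template<>
void format_value<const char*>(
    String_Buffer& buf, const char*& fstr, const char* value)
{
    if (next_hole(buf, fstr))
        write(buf, value, static_cast<int>(std::strlen(value)));
}

template<typename... Args>
void format(String_Buffer& buf, const char* fstr, const Args&... args)
{
    // `const char (&)[6]` decays to `const char*`, `const int&` to `int`.
    (format_value<std::decay_t<const Args&>>(buf, fstr, args), ...);
    while (next_hole(buf, fstr)) {}
}

void example()
{
    char str[64] = {}; // zero-filled, so the output is NUL-terminated
    String_Buffer buf = {str, sizeof str};
    format(buf, "Hello, {}!", "world");
    assert(std::strcmp(str, "Hello, world!") == 0);
}
```

Types without a specialisation still hit the deleted primary template and fail to compile.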
Put in a single .cpp file and wrapped in a namespace, this implementation, including `format_value` for strings, comes to a mere 60 lines of code.
In a real project, you will probably want to move some of the private implementation details to a separate .cpp file.
Therefore, here's the full source code listing, split into a header file, and an implementation file.
```cpp
#pragma once
struct String_Buffer
{
char* str;
int cap;
int len = 0;
};
namespace fmt {
// implementation detail
bool next_hole(String_Buffer& buf, const char*& fstr);
void format_value(String_Buffer& buf, const char*& fstr, const char* value);
// (add additional overloads here)
template<typename... Args>
void format(String_Buffer& buf, const char* fstr, const Args&... args)
{
(format_value(buf, fstr, args), ...);
while (next_hole(buf, fstr)) {}
}
}
```
```cpp
#include "format.hpp"
#include <cstring>
namespace fmt
{
static void write(String_Buffer& buf, const char* str, int len)
{
int remaining_cap = buf.cap - buf.len - 1; // leave one byte for NUL
int write_len = len > remaining_cap ? remaining_cap : len;
if (write_len > 0)
memcpy(buf.str + buf.len, str, write_len);
buf.len += len;
}
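bool next_hole(String_Buffer& buf, const char*& fstr)
{
    const char* prefix = fstr;
    while (*fstr != 0) {
        if (*fstr == '{') {
            int len = fstr - prefix;
            ++fstr;
            if (*fstr == '}') {
                // `{}`: flush the preceding literal and stop past the hole
                ++fstr;
                write(buf, prefix, len);
                return true;
            }
            if (*fstr == '{') {
                // `{{`: flush the literal, start a new span at the second `{`
                write(buf, prefix, len);
                prefix = fstr;
                ++fstr;
            }
        } else {
            // only skip ordinary characters here, so that a `{` at the very
            // end of the string never moves `fstr` past the terminating NUL
            ++fstr;
        }
    }
    write(buf, prefix, fstr - prefix);
    return false;
}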
void format_value(String_Buffer& buf, const char*& fstr, const char* value)
{
if (next_hole(buf, fstr))
write(buf, value, strlen(value));
}
}
```
## Design remarks
### Escaping ambiguity
The choice of `{}` as the hole syntax is not accidental.
I evaluated whether holes could be represented with a single character `%`, like:
```cpp
fmt::format(buf, "Hello, %!", "world");
```
But it turned out that using only a single character introduces an ambiguity around escaping.
What should this format to: `hello%`, or `%hello`?
```cpp
fmt::format(buf, "%%%", "hello");
```
It would be possible to use a different, unambiguous combination for escaping, such as `%_`, but it looks very alien, and you have to use it any time you want a `%` sign.
```cpp
fmt::format(buf, "%%_ complete", 33);
```
Compare this to the current approach, where you only have to double the `{` when it's directly preceding `}`.
```cpp
fmt::format(buf, "{}% complete", 33);
```
It also more closely mimics the final output string.
Reading the previous `%%_` example requires knowing that `%_` is a special sequence that turns into `%`, whereas reading this example doesn't require any extra knowledge (and progress reporting with percentages is a somewhat common use case for format strings).
### Iteration through parameter packs
Another idea I had was to do a `std::cout`-style API, though done with a function call rather than an operator chain:
```cpp
format(buf, "Hello, ", "world!");
```
The observation about poor greppability didn't occur to me until later, but it seemed simple enough to implement.
```cpp
void format_value(String_Buffer& buf, const char* value);
void format_value(String_Buffer& buf, int value);
template<typename... Args>
void format(String_Buffer& buf, const Args&... args)
{
(format_value(buf, args), ...);
}
```
If I went with this approach, it would be even less code, but the poor greppability and non-localisability of format strings kept bugging me, so I started wondering if there's some way to add that format string.
It seemed impossible, because the format string can be provided at runtime.
This would mean `format` would have to iterate through the format string to parse out the holes `{}`, and when a hole is hit, insert the Nth parameter, counting from 0 for the first hole.
But it _seemed_ to require indexing the parameter pack, and
- there is no way to index a parameter pack in C++20,
- there is no way to index it using a runtime value in C++26, which adds parameter pack indexing `pack...[x]`.
A few hours later, I realised it is possible to have _the parameter pack expansion drive the parsing_, rather than driving the parsing from `format` and trying to index the parameter pack.
I think this is easily the most elegant bit of this library.
- It generates optimal, extremely minimal code: a sequence of calls to the appropriate overloads of `format_value`.
- It handles out-of-bounds gracefully: there is no indexing of parameters, and therefore nothing that can go out of bounds.
It makes me wonder what other cool things could be done with this technique.
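To make the expansion concrete, the sketch below compares `format` with two string arguments against the same calls written out by hand. `format_two_by_hand` is a name invented for this illustration, and the supporting functions are repeated so that the listing compiles on its own.

```cpp
#include <cassert>
#include <cstring>

struct String_Buffer
{
    char* str;
    int cap;
    int len = 0;
};

void write(String_Buffer& buf, const char* str, int len)
{
    int remaining_cap = buf.cap - buf.len - 1; // leave one byte for NUL
    int write_len = len > remaining_cap ? remaining_cap : len;
    if (write_len > 0)
        std::memcpy(buf.str + buf.len, str, write_len);
    buf.len += len;
}

// Copy of `next_hole`, skipping ordinary characters in an `else` branch.
bool next_hole(String_Buffer& buf, const char*& fstr)
{
    const char* prefix = fstr;
    while (*fstr != 0) {
        if (*fstr == '{') {
            int len = static_cast<int>(fstr - prefix);
            ++fstr;
            if (*fstr == '}') {
                ++fstr;
                write(buf, prefix, len);
                return true;
            }
            if (*fstr == '{') {
                write(buf, prefix, len);
                prefix = fstr;
                ++fstr;
            }
        } else {
            ++fstr;
        }
    }
    write(buf, prefix, static_cast<int>(fstr - prefix));
    return false;
}

void format_value(String_Buffer& buf, const char*& fstr, const char* value)
{
    if (next_hole(buf, fstr))
        write(buf, value, static_cast<int>(std::strlen(value)));
}

template<typename... Args>
void format(String_Buffer& buf, const char* fstr, const Args&... args)
{
    (format_value(buf, fstr, args), ...);
    while (next_hole(buf, fstr)) {}
}

// What `format(buf, fstr, "a", "b")` amounts to once the pack is expanded,
// written out by hand for illustration.
void format_two_by_hand(String_Buffer& buf, const char* fstr)
{
    format_value(buf, fstr, "a"); // literal up to the first hole, then "a"
    format_value(buf, fstr, "b"); // literal up to the second hole, then "b"
    while (next_hole(buf, fstr)) {} // whatever is left of the format string
}

void example()
{
    char expanded[32] = {};
    char by_hand[32] = {};
    String_Buffer buf_a = {expanded, sizeof expanded};
    String_Buffer buf_b = {by_hand, sizeof by_hand};
    format(buf_a, "{} + {}", "a", "b");
    format_two_by_hand(buf_b, "{} + {}");
    assert(std::strcmp(expanded, "a + b") == 0);
    assert(std::strcmp(expanded, by_hand) == 0);
}
```

Surplus arguments simply find no hole to fill and write nothing, which is where the graceful out-of-bounds behaviour comes from.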
### Failed idea: using dynamic typing for format arguments
My initial idea for a minimal C++ formatting library involved a `Format_Argument` type, passed in a `std::initializer_list` to the `format` function.
The API was shaped like this:
```cpp
enum class Format_Argument_Type
{
boolean,
int32,
float32,
vec4,
};
struct Format_Argument
{
Format_Argument_Type type;
union
{
bool b;
int32_t i;
float f;
Vec4 v;
};
};
void format_value(String_Buffer& buf, const Format_Argument& arg);
void format(
String_Buffer& buf,
const char* fstr,
std::initializer_list<Format_Argument> args);
```
This approach has a couple of problems though, which were enough of a deal breaker for me that I dropped the idea.
- Efficiency.
The size of `Format_Argument` is as large as the biggest value able to be formatted.
In this case, assuming `Vec4` is four 32-bit floats, it is 20 bytes.
This space has to be allocated on the stack for the `initializer_list`.
It is unlikely compilers would be able to optimise this all away, especially if the `format` function lived in a separate object file.
- Verbosity.
The example above is actually incomplete.
What `Format_Argument` _has_ to look like is actually this:
```cpp
struct Format_Argument
{
Format_Argument_Type type;
union
{
bool b;
int32_t i;
float f;
Vec4 v;
};
Format_Argument(bool b) : type(Format_Argument_Type::boolean), b(b) {}
Format_Argument(int32_t i) : type(Format_Argument_Type::int32), i(i) {}
Format_Argument(float f) : type(Format_Argument_Type::float32), f(f) {}
Format_Argument(Vec4 v) : type(Format_Argument_Type::vec4), v(v) {}
};
```
And then you have to `switch` on the format argument's `type` in `format_value`, introducing further duplication.
### Why not `printf`
The elephant in the room.
Why do this when you have `printf`?
The answer to this is: verbosity.
Firstly, there is no way to extend `printf` with your own types in standard C.
I often want to `printf` 3D vectors for debugging, and I have to resort to listing out all the axes manually.
```cpp
printf(
"%f %f %f",
player.position.x,
player.position.y,
player.position.z
);
```
I think you can see how this gets old real quick.
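For comparison, here is a sketch of how a vector could be given its own `format_value` overload. `Vec3` is a stand-in type defined for this example, and the `%g`-based rendering is just one possible choice of representation.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

struct String_Buffer
{
    char* str;
    int cap;
    int len = 0;
};

struct Vec3 { float x, y, z; }; // hypothetical stand-in for a game's vector type

void write(String_Buffer& buf, const char* str, int len)
{
    int remaining_cap = buf.cap - buf.len - 1; // leave one byte for NUL
    int write_len = len > remaining_cap ? remaining_cap : len;
    if (write_len > 0)
        std::memcpy(buf.str + buf.len, str, write_len);
    buf.len += len;
}

// Copy of `next_hole`, skipping ordinary characters in an `else` branch.
bool next_hole(String_Buffer& buf, const char*& fstr)
{
    const char* prefix = fstr;
    while (*fstr != 0) {
        if (*fstr == '{') {
            int len = static_cast<int>(fstr - prefix);
            ++fstr;
            if (*fstr == '}') {
                ++fstr;
                write(buf, prefix, len);
                return true;
            }
            if (*fstr == '{') {
                write(buf, prefix, len);
                prefix = fstr;
                ++fstr;
            }
        } else {
            ++fstr;
        }
    }
    write(buf, prefix, static_cast<int>(fstr - prefix));
    return false;
}

// Sketch of a user-defined overload: the axes are listed once, here, rather
// than at every call site.
void format_value(String_Buffer& buf, const char*& fstr, const Vec3& value)
{
    if (next_hole(buf, fstr)) {
        char text[64]; // three `%g` floats and two spaces fit comfortably
        int len = std::snprintf(
            text, sizeof text, "%g %g %g", value.x, value.y, value.z);
        write(buf, text, len);
    }
}

template<typename... Args>
void format(String_Buffer& buf, const char* fstr, const Args&... args)
{
    (format_value(buf, fstr, args), ...);
    while (next_hole(buf, fstr)) {}
}

void example()
{
    char str[64] = {}; // zero-filled, so the output is NUL-terminated
    String_Buffer buf = {str, sizeof str};
    Vec3 position = {1.5f, 2.0f, -0.25f};
    format(buf, "position: {}", position);
    assert(std::strcmp(str, "position: 1.5 2 -0.25") == 0);
}
```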
Combine this with the inability to use `printf` as an expression, which is particularly painful with ImGui---where I often want to format a window title, or button label.
```cpp
char entity_name[64];
snprintf(
entity_name, sizeof entity_name,
"%d(%d) %s",
entity_id.index, entity_id.generation,
entity_kind::names[entity.kind]
);
if (ImGui::TreeNode(entity_name)) {
// ...
ImGui::TreePop();
}
```
Technically it is possible to write a function which allocates the temporary buffer and writes to it in one go, but this gets into the weeds of C's `va_list`, which is very verbose to use.
`printf` is also error-prone.
It is easy to mess up and use the wrong specifier type, or pass too few arguments to the function.
```c
printf("%x", 1.0f); // oops
printf("%x"); // ...not again
```
There is also no easy, idiomatic way to concatenate strings written with `snprintf`.
```c
char str[4] = {0};
int cursor = 0;
cursor += snprintf(str + cursor, sizeof str - cursor, "hello ");
cursor += snprintf(str + cursor, sizeof str - cursor, "world!");
```
This naive way is not actually correct, because `snprintf` returns the number of characters that _would_ be written into `str`, had the buffer been large enough.
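A version that stays in bounds has to clamp the cursor before using it as an offset. The helper below is a sketch of that idea. The name `append` and its signature are made up for this example.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Hypothetical helper: appends `text` at `cursor` and returns the new cursor.
// The offset is clamped to `cap`, so the pointer and the remaining size passed
// to snprintf stay in range even after the buffer has been exhausted.
int append(char* str, int cap, int cursor, const char* text)
{
    int offset = cursor < cap ? cursor : cap;
    int written = std::snprintf(str + offset, cap - offset, "%s", text);
    return cursor + written; // keeps counting past `cap`, like `len` does
}

void example()
{
    char str[4] = {0};
    int cursor = 0;
    cursor = append(str, sizeof str, cursor, "hello ");
    cursor = append(str, sizeof str, cursor, "world!");
    assert(cursor == 12);                  // characters that would be needed
    assert(std::strcmp(str, "hel") == 0);  // what actually fit, NUL included
}
```

This is the same bookkeeping that `write` does with `len` and `cap`, spelled out on top of `snprintf`.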


@ -178,7 +178,7 @@ impl<'a> Writer<'a> {
}
out.push_str("<p");
}
-Container::Heading { level, .. } => write!(out, "<h{level}")?,
+Container::Heading { level, id, .. } => write!(out, r#"<h{level} id="{id}""#)?,
Container::TableCell { head: false, .. } => out.push_str("<td"),
Container::TableCell { head: true, .. } => out.push_str("<th"),
Container::Caption => out.push_str("<caption"),


@ -1,5 +1,5 @@
main.doc {
---doc-text-width: 80ch;
+--doc-text-width: 85ch;
display: flex;
flex-direction: row;
@ -52,12 +52,15 @@ main.doc {
& ul,
& ol {
/* Is there a better way to add spacing to the marker, other than adding whitespace? */
-list-style: "- ";
margin-top: 0;
margin-bottom: 0;
padding-bottom: 0.5lh;
padding-left: 3.2em;
}
+& ul {
+list-style: "- ";
+}
}
& section.feed {


@ -22,8 +22,8 @@
"captures": ["identifier", "keyword2"]
}
},
-{ "regex": "(u8|u|U|L)'(\\\\'|[^'])'", "is": "string" },
-{ "regex": "(u8|u|U|L)\"(\\\\\"|[^\"])*\"", "is": "string" },
+{ "regex": "(u8|u|U|L)?'(\\\\'|[^'])'", "is": "string" },
+{ "regex": "(u8|u|U|L)?\"(\\\\\"|[^\"])*\"", "is": "string" },
{ "regex": "[a-zA-Z_][a-zA-Z0-9_]*", "is": "identifier" },
{ "regex": "0[bB][01']+[uUlLfFlLdDwWbB]*", "is": "literal" },
{

137
static/syntax/cpp.json Normal file

@ -0,0 +1,137 @@
{
"patterns": [
{
"regex": "#include (<.+?>)",
"is": { "default": "keyword1", "captures": ["string"] }
},
{ "regex": "#[a-zA-Z0-9_]+", "is": "keyword1" },
{ "regex": "\\/\\/.*", "is": "comment" },
{
"regex": "\\/\\*.*?\\*\\/",
"flags": ["dotMatchesNewline"],
"is": "comment"
},
{
"regex": "[a-zA-Z_][a-zA-Z0-9_]*(\\()",
"is": { "default": "function", "captures": ["default"] }
},
{ "regex": "(u8|u|U|L)?'(\\\\'|[^'])'", "is": "string" },
{ "regex": "(u8|u|U|L)?\"(\\\\\"|[^\"])*\"", "is": "string" },
{ "regex": "[a-zA-Z_][a-zA-Z0-9_]*", "is": "identifier" },
{ "regex": "0[bB][01']+[uUlLfFlLdDwWbB]*", "is": "literal" },
{
"regex": "0[xX][0-9a-fA-F']+(\\.[0-9a-fA-F']*([pP][-+]?[0-9a-fA-F']+)?)?+[uUlLwWbB]*",
"is": "literal"
},
{
"regex": "[0-9']+(\\.[0-9']*([eE][-+]?[0-9']+)?)?[uUlLfFlLdDwWbB]*",
"is": "literal"
},
{ "regex": "[+=/*^%<>!~|&\\.?:#-]+", "is": "operator" },
{ "regex": "[,;]", "is": "punct" }
],
"keywords": {
"alignas": { "into": "keyword1" },
"alignof": { "into": "keyword1" },
"and": { "into": "keyword1" },
"and_eq": { "into": "keyword1" },
"asm": { "into": "keyword1" },
"auto": { "into": "keyword1" },
"bitand": { "into": "keyword1" },
"bitor": { "into": "keyword1" },
"break": { "into": "keyword1" },
"case": { "into": "keyword1" },
"catch": { "into": "keyword1" },
"class": { "into": "keyword1" },
"compl": { "into": "keyword1" },
"concept": { "into": "keyword1" },
"const": { "into": "keyword1" },
"consteval": { "into": "keyword1" },
"constexpr": { "into": "keyword1" },
"constinit": { "into": "keyword1" },
"const_cast": { "into": "keyword1" },
"continue": { "into": "keyword1" },
"contract_assert": { "into": "keyword1" },
"co_await": { "into": "keyword1" },
"co_return": { "into": "keyword1" },
"co_yield": { "into": "keyword1" },
"decltype": { "into": "keyword1" },
"default": { "into": "keyword1" },
"delete": { "into": "keyword1" },
"do": { "into": "keyword1" },
"dynamic_cast": { "into": "keyword1" },
"else": { "into": "keyword1" },
"enum": { "into": "keyword1" },
"explicit": { "into": "keyword1" },
"export": { "into": "keyword1" },
"extern": { "into": "keyword1" },
"for": { "into": "keyword1" },
"friend": { "into": "keyword1" },
"goto": { "into": "keyword1" },
"if": { "into": "keyword1" },
"inline": { "into": "keyword1" },
"mutable": { "into": "keyword1" },
"namespace": { "into": "keyword1" },
"new": { "into": "keyword1" },
"noexcept": { "into": "keyword1" },
"not": { "into": "keyword1" },
"not_eq": { "into": "keyword1" },
"operator": { "into": "keyword1" },
"or": { "into": "keyword1" },
"or_eq": { "into": "keyword1" },
"private": { "into": "keyword1" },
"protected": { "into": "keyword1" },
"public": { "into": "keyword1" },
"register": { "into": "keyword1" },
"reinterpret_cast": { "into": "keyword1" },
"requires": { "into": "keyword1" },
"return": { "into": "keyword1" },
"sizeof": { "into": "keyword1" },
"static": { "into": "keyword1" },
"static_assert": { "into": "keyword1" },
"static_cast": { "into": "keyword1" },
"struct": { "into": "keyword1" },
"switch": { "into": "keyword1" },
"template": { "into": "keyword1" },
"this": { "into": "keyword2" },
"thread_local": { "into": "keyword1" },
"throw": { "into": "keyword1" },
"try": { "into": "keyword1" },
"typedef": { "into": "keyword1" },
"typeid": { "into": "keyword1" },
"typename": { "into": "keyword1" },
"union": { "into": "keyword1" },
"using": { "into": "keyword1" },
"virtual": { "into": "keyword1" },
"volatile": { "into": "keyword1" },
"wchar_t": { "into": "keyword1" },
"while": { "into": "keyword1" },
"xor": { "into": "keyword1" },
"xor_eq": { "into": "keyword1" },
"bool": { "into": "keyword2" },
"char": { "into": "keyword2" },
"char8_t": { "into": "keyword2" },
"char16_t": { "into": "keyword2" },
"char32_t": { "into": "keyword2" },
"double": { "into": "keyword2" },
"float": { "into": "keyword2" },
"int": { "into": "keyword2" },
"long": { "into": "keyword2" },
"short": { "into": "keyword2" },
"signed": { "into": "keyword2" },
"unsigned": { "into": "keyword2" },
"void": { "into": "keyword2" },
"_Atomic": { "into": "keyword2" },
"_BitInt": { "into": "keyword2" },
"_Complex": { "into": "keyword2" },
"_Decimal128": { "into": "keyword2" },
"_Decimal32": { "into": "keyword2" },
"_Decimal64": { "into": "keyword2" },
"_Imaginary": { "into": "keyword2" },
"nullptr": { "into": "literal" },
"false": { "into": "literal" },
"true": { "into": "literal" }
}
}