fmt: draft 2

りき萌 2025-08-22 12:46:48 +02:00
parent 49af219330
commit ee12500ce7
2 changed files with 291 additions and 46 deletions


@ -1,9 +1,9 @@
title = "A string formatting library in 60 lines of C++"
title = "A string formatting library in 65 lines of C++"
+++
In this write-up, I will walk you through an implementation of a string formatting library for C++ I came up with for my video game.
The end result came out really compact, at only 60 lines of code---providing a skeleton that can be supplemented with additional functionality at low cost.
The end result came out really compact, at only 65 lines of code---providing a skeleton that can be supplemented with additional functionality at low cost.
## Usage
@ -15,7 +15,7 @@ char buffer[64];
String_Buffer buf = {str, sizeof str};
```
...the `fmt::format` function can be called with a format string parameter, containing the character sequence `{}` (a _hole_) where parameters are to be substituted, as well as the parameters themselves.
...the `fmt::format` function provided by this library can be called with a format string parameter, containing the character sequence `{}` (a _hole_) where parameters are to be substituted, as well as the parameters themselves.
```cpp
fmt::format(buf, "Hello, {}!", "world");
@ -44,6 +44,7 @@ assert(strcmp(str, "[main] [info] Hewwo :3") == 0);
```
In case the buffer is not sufficiently large to contain the full string, the function writes as many characters as it can, and sets the `String_Buffer`'s `len` variable to the number of characters required.
That way, it is possible for the caller to tell if the buffer has been exhausted, and reallocate it to an appropriate size.
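A retry loop built on this behaviour might look like the following. This is a minimal sketch rather than part of the library: `write` mirrors the saturating write shown later in this post, while `greet_into` (a stand-in for a `fmt::format` call) and `greet_with_retry` are hypothetical helpers invented for the example.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

struct String_Buffer
{
    char* str;
    int cap;
    int len = 0;
};

// Saturating write: copies what fits, but always advances `len` by the
// full length, so `len` ends up as the size the output needed.
static void write(String_Buffer& buf, const char* str, int len)
{
    int remaining_cap = buf.cap - buf.len - 1; // leave one byte for NUL
    int write_len = len > remaining_cap ? remaining_cap : len;
    if (write_len > 0)
        memcpy(buf.str + buf.len, str, write_len);
    buf.len += len;
}

// Hypothetical stand-in for a call to fmt::format.
static void greet_into(String_Buffer& buf, const char* name)
{
    write(buf, "Hello, ", 7);
    write(buf, name, int(strlen(name)));
    write(buf, "!", 1);
}

// Format once; if the output was truncated, grow the storage to the
// reported length (plus the NUL terminator) and format again.
std::string greet_with_retry(const char* name, int initial_cap)
{
    std::vector<char> storage(initial_cap, '\0');
    String_Buffer buf = {storage.data(), initial_cap};
    greet_into(buf, name);
    if (buf.len >= buf.cap) {
        int needed = buf.len + 1;
        storage.assign(needed, '\0');
        buf = String_Buffer{storage.data(), needed};
        greet_into(buf, name);
    }
    return std::string(storage.data(), buf.len);
}
```

The truncation check is `buf.len >= buf.cap` because one byte of the capacity is always reserved for the terminator.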
```cpp
fmt::format(
@ -61,7 +62,7 @@ assert(buf.len == 74);
```
Additional functions can be written on top of this base functionality to improve ergonomics in real-world code.
These are omitted from this write-up for the sake of brevity.
These are included in [Ergonomic functions](#Ergonomic-functions).
## Problem statement
@ -75,7 +76,7 @@ You give the function a _format string_, which describes the output shape, as we
Naturally, this cannot work in memory-constrained environments, such as embedded devices---where you would want to write to a small buffer and flush it in a loop to reduce memory usage---but this does not apply in the context of a desktop video game.
3. As already mentioned in the usage overview, if the buffer is full, the function should return the number of characters that _would_ have been written, had the buffer capacity not been exceeded.
3. As already mentioned in the usage overview, if the buffer is full, the function should return the number of characters that _would_ have been written, had the buffer capacity not been exceeded---such that the caller can choose to reallocate the backing buffer to an appropriate size, and try formatting again.
4. There _has_ to be a format string.
@ -83,7 +84,7 @@ You give the function a _format string_, which describes the output shape, as we
Instead of having a format string like `printf`, `<iostream>` opts to use overloads of `operator<<` to write to the output.
This has the disadvantage of not being greppable (which is useful for debugging error logs), as well as not being localisable (because there is no format string that could be replaced at runtime).
Additionally, I don't want the format string to have extra specifiers such as C's `%d`, `%x`, etc. specifying the type of output, or Python's `{:.3}`, for specifying the style of output. The C approach is error-prone and inextensible, and the Python approach, while convenient, reduces greppability.
Additionally, I don't want the format string to have extra specifiers such as C's `%d`, `%x`, etc. specifying the type of output, or Python's `{:.3}`, for specifying the style of output. The C approach is error-prone and inextensible, and the Python approach, while convenient, increases parser complexity and reduces greppability.
Instead, the representation is defined only according to the formatted value's type.
5. It has to have a small footprint.
@ -133,6 +134,24 @@ This is a deliberate choice coming from the fact that `String_Buffer` does not o
However, it is trivial to replace the length saturation logic with a call to `realloc`, should that be the more appropriate choice.
Having this base `write` function, we can implement a set of overloaded functions that will write out values of various types to the string buffer.
These functions will be used by our `format` function, to write out format arguments.
The set of functions implemented here directly corresponds to the types of arguments you'll be able to pass into `format`.
```cpp
void write_value(String_Buffer& buf, const char* value)
{
write(buf, value, strlen(value));
}
void write_value(String_Buffer& buf, bool value) { /* ... */ }
void write_value(String_Buffer& buf, char value) { /* ... */ }
void write_value(String_Buffer& buf, int value) { /* ... */ }
```
See [`write_value` for various types](#write_value-for-various-types) for a set of example implementations for other types.
Now onto parsing format strings.
Format strings can be defined as a sequence of _literals interspersed with arguments_.
@ -175,7 +194,7 @@ bool next_hole(String_Buffer& buf, const char*& fstr)
`fstr` is received as a reference to a pointer, representing the format string's parsing state.
A call to `next_hole` will find the literal part, visualised with `---`, and leave the `fstr` pointer past the hole `{}`, visualised with `^`.
A call to `next_hole` will write out the literal part, visualised with `---`, and leave the `fstr` pointer past the hole `{}`, visualised with `^`.
{.monospaced}
```
@ -193,9 +212,7 @@ Hello, {}!
```
Additionally, we handle the `{{` escaping case.
Without the extra `if` clause, it would be printed into the output literally as `{{`.
Therefore, when `{` is encountered directly after another `{`, we have to flush the current span, and start a new one directly after the first `{`. Underlined with `---` are the spans of characters that get written to the output.
When `{` is encountered directly after another `{`, we have to flush the current literal, and start a new one directly after the first `{`. Underlined with `---` are the spans of characters that get written to the output.
{.monospaced}
```
@ -204,9 +221,16 @@ empty {{} hole
```
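To see the escaping rule in action, here is a small self-contained harness. It is a sketch under a few assumptions: `next_hole` below is a variant that peeks at the character after `{` rather than advancing onto it, and `trace_holes` is a hypothetical helper that marks every hole with `#` so that the literal spans are visible in the result.

```cpp
#include <cassert>
#include <cstring>
#include <string>

struct String_Buffer
{
    char* str;
    int cap;
    int len = 0;
};

static void write(String_Buffer& buf, const char* str, int len)
{
    int remaining_cap = buf.cap - buf.len - 1; // leave one byte for NUL
    int write_len = len > remaining_cap ? remaining_cap : len;
    if (write_len > 0)
        memcpy(buf.str + buf.len, str, write_len);
    buf.len += len;
}

// Variant of next_hole that peeks at fstr[1] (an assumption of this sketch).
static bool next_hole(String_Buffer& buf, const char*& fstr)
{
    const char* prefix = fstr;
    while (*fstr != 0) {
        if (*fstr == '{') {
            int len = int(fstr - prefix);
            if (fstr[1] == '}') {
                // A hole: flush the literal and stop just past "{}".
                fstr += 2;
                write(buf, prefix, len);
                return true;
            }
            if (fstr[1] == '{') {
                // An escape: flush the literal, then start a new one at
                // the second '{' so that exactly one brace is emitted.
                write(buf, prefix, len);
                prefix = fstr + 1;
                ++fstr;
            }
        }
        ++fstr;
    }
    write(buf, prefix, int(fstr - prefix));
    return false;
}

// Hypothetical helper: runs next_hole to the end of the format string,
// writing '#' wherever an argument would have been substituted.
std::string trace_holes(const char* fstr)
{
    char storage[128] = {};
    String_Buffer buf = {storage, sizeof storage};
    while (next_hole(buf, fstr))
        write(buf, "#", 1);
    return storage;
}
```

For instance, `trace_holes("empty {{} hole")` yields `empty {} hole` with no `#` in it, because the escaped brace never opens a hole.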
Finally, we define `format`: the function that accepts a format string, a set of arguments, and inserts them into the output string.
It is the sole template in this library, and also the part that was most tricky to come up with.
It makes use of an additional function `format_value`, which tries to find the next hole, and if found, writes out a format argument in its place.
```cpp
template<typename T>
void format_value(String_Buffer& buf, const char*& fstr, const T& value)
{
if (next_hole(buf, fstr))
write_value(buf, value);
}
template<typename... Args>
void format(String_Buffer& buf, const char* fstr, const Args&... args)
{
@ -215,59 +239,59 @@ void format(String_Buffer& buf, const char* fstr, const Args&... args)
}
```
`format_value` is a function implemented by the user of the library.
Here is an example implementation for strings:
For those unfamiliar with C++ template metaprogramming, `(format_value(buf, fstr, args), ...)` is a [_fold expression_](https://en.cppreference.com/w/cpp/language/fold.html).
Given any number of `args`, it will expand into a sequence of calls to `format_value`, one for each element in `args`, separated by the `,` operator. For example, if two arguments, a `const char*` and an `int`, are passed into `format`, the instantiation is roughly equivalent to:
```cpp
void format_value(String_Buffer& buf, const char*& fstr, const char* value)
template<>
void format<const char*, int>(
String_Buffer& buf,
const char* fstr,
const char* a1, int a2)
{
if (next_hole(buf, fstr))
write(buf, value, strlen(value));
(format_value(buf, fstr, a1), format_value(buf, fstr, a2));
while (next_hole(buf, fstr)) {}
}
```
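The property that matters here is that a fold over the comma operator evaluates its operands strictly left to right, which is what lets each call advance the shared parsing state in order. The following standalone sketch shows this with hypothetical names (`append_value`, `join_all`) and a `std::string` in place of `String_Buffer`; it requires C++17.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for format_value: appends one argument,
// followed by a separator, to a shared output string.
template<typename T>
void append_value(std::string& out, const T& value)
{
    out += std::to_string(value);
    out += ';';
}

template<typename... Args>
std::string join_all(const Args&... args)
{
    std::string out;
    // Expands to (append_value(out, a1), (append_value(out, a2), ...)).
    // The comma operator sequences its left operand before its right
    // one, so the calls run in argument order.
    (append_value(out, args), ...);
    return out;
}
```

With an empty parameter pack, the fold over `,` expands to `void()`, so `join_all()` simply returns an empty string.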
The task of `format_value` is to consume a single hole, and fill it in with the formatted `value`.
To provide support for more data types, the function can be overloaded.
For example, providing an implementation of:
```cpp
void format_value(String_Buffer& buf, const char*& fstr, int value);
```
will make it possible to write out integers in addition to strings.
Note that the overloads of `format_value` _must_ be declared before `format`.
This is because the `format_value` name is not dependent on any template arguments, and is therefore early-bound at `format`'s definition site.
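Note that the overloads of `write_value` _must_ be declared before `format_value`.
This is because unqualified lookup of the `write_value` name happens at `format_value`'s definition site; argument-dependent lookup at instantiation time can only add overloads from the namespaces associated with the argument types, which for `String_Buffer` and built-in types such as `int` does not include `fmt`.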
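One nuance worth knowing is that argument-dependent lookup still runs when the template is instantiated, so an overload for a user-defined type can be declared later, provided it lives in that type's own namespace. The sketch below illustrates the rule with simplified signatures (a `std::string` output instead of `String_Buffer`); the `game` namespace and `Entity_Id` type are hypothetical.

```cpp
#include <cassert>
#include <string>

// Declared before the template: found by ordinary unqualified lookup.
inline void write_value(std::string& out, int value)
{
    out += std::to_string(value);
}

template<typename T>
void format_value(std::string& out, const T& value)
{
    write_value(out, value);
}

namespace game // hypothetical namespace, for illustration only
{
struct Entity_Id
{
    int index;
};

// Declared after the template, but inside Entity_Id's namespace, so
// argument-dependent lookup finds it when format_value is instantiated.
inline void write_value(std::string& out, Entity_Id id)
{
    out += '#';
    out += std::to_string(id.index);
}
}

std::string lookup_demo()
{
    std::string out;
    format_value(out, 7);
    format_value(out, game::Entity_Id{3});
    return out;
}
```

The same trick does not work for `int` or `const char*`, which have no associated namespaces; overloads for those types have to be visible at the template's definition.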
This choice was made for the sake of simplicity, but if it turns out to be a problem, it is possible to use specialisation. It is important to note though that specialisation bypasses overload resolution, so this will not work:
```cpp
template<typename T>
void format_value(String_Buffer& buf, const char*& fstr, T value) = delete;
void write_value(String_Buffer& buf, T value) = delete;
template<>
void format_value<const char*>(
String_Buffer& buf, const char*& fstr, const char* value)
void write_value<const char*>(
String_Buffer& buf, const char* value)
{
if (next_hole(buf, fstr))
write(buf, value, strlen(value));
}
template<typename T>
void format_value(String_Buffer& buf, const char*& fstr, const T& value)
{
if (next_hole(buf, fstr))
write_value<T>(buf, value);
}
template<typename... Args>
void format(String_Buffer& buf, const char* fstr, const Args&... args)
{
(format_value<Args>(buf, fstr, args), ...);
(format_value(buf, fstr, args), ...);
while (next_hole(buf, fstr)) {}
}
format(buf, "Hello, {}!", "world");
```
because the type of `world` is `char [5]`, and not `const char*`, and `format_value<char [5]>` is deleted.
because the type of `"world"` is `const char [6]` (five characters plus the NUL terminator) rather than `const char*`: `T` is deduced as `char [6]`, and `write_value<char [6]>` is deleted.
This should be solvable with some additional work, but I've deemed it unnecessary in my case.
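One possible shape for that additional work is to decay the argument type before selecting the specialisation, so that string literals map onto `const char*`. This is only a sketch of the idea, not something the library does: it uses `std::decay_t` from `<type_traits>` and, for brevity, a `std::string` output in place of `String_Buffer`.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

template<typename T>
void write_value(std::string& out, T value) = delete;

template<>
void write_value<const char*>(std::string& out, const char* value)
{
    out += value;
}

template<>
void write_value<int>(std::string& out, int value)
{
    out += std::to_string(value);
}

template<typename T>
void format_value(std::string& out, const T& value)
{
    // Decaying `const T&` turns `const char (&)[6]` into `const char*`
    // and `const int&` into `int`, matching the specialisations above.
    write_value<std::decay_t<const T&>>(out, value);
}

std::string decay_demo()
{
    std::string out;
    format_value(out, "world");
    format_value(out, 42);
    return out;
}
```

Note that decaying `T` on its own would not be enough: `T` is deduced as `char [6]`, which decays to `char*` and would still hit the deleted primary template.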
In a single .cpp file, together with wrapping all the functionality in a namespace, this implementation, together with the implementation of `format_value` for strings, equates to a mere 60 lines of code.
Placed in a single .cpp file and wrapped in a namespace, this implementation, including `write_value` for strings, comes to a mere [65 lines of code](/static/text/20250822_fmt_min.cpp).
In a real project, you will probably want to move some of the private implementation details to a separate .cpp file.
Therefore, here's the full source code listing, split into a header file, and an implementation file.
@ -284,11 +308,18 @@ struct String_Buffer
namespace fmt {
void write_value(String_Buffer& buf, const char* value);
// (additional overloads here)
// implementation detail
bool next_hole(String_Buffer& buf, const char*& fstr);
void format_value(String_Buffer& buf, const char*& fstr, const char* value);
// (add additional overloads here)
template<typename T>
void format_value(String_Buffer& buf, const char*& fstr, const T& value)
{
if (next_hole(buf, fstr))
write_value(buf, value);
}
template<typename... Args>
void format(String_Buffer& buf, const char* fstr, const Args&... args)
@ -317,6 +348,11 @@ static void write(String_Buffer& buf, const char* str, int len)
buf.len += len;
}
void write_value(String_Buffer& buf, const char* value)
{
write(buf, value, strlen(value));
}
bool next_hole(String_Buffer& buf, const char*& fstr)
{
const char* prefix = fstr;
@ -341,12 +377,6 @@ bool next_hole(String_Buffer& buf, const char*& fstr)
return false;
}
void format_value(String_Buffer& buf, const char*& fstr, const char* value)
{
if (next_hole(buf, fstr))
write(buf, value, strlen(value));
}
}
```
@ -389,7 +419,7 @@ Reading the previous `%%_` example requires knowing that `%_` is a special seque
### Iteration through parameter packs
Another idea I had was to do an `std::cout`-style API, though done with a function call rather than an operator chain:
Another idea I had was to do an `<iostream>`-style API, though done with a function call rather than an operator chain:
```cpp
format(buf, "Hello, ", "world!");
@ -468,7 +498,7 @@ This approach has a couple problems though, which were enough of a deal breaker
In this case, assuming `Vec4` is four 32-bit floats, it is 20 bytes.
This space has to be allocated on the stack for the `initializer_list`.
It is unlikely compilers would be able to optimise this all away, especially if the `format` function lived in a separate object file.
It is unlikely compilers would be able to optimise all that away, especially if the `format` function lived in a separate object file.
- Verbosity.
@ -534,7 +564,7 @@ if (ImGui::TreeNode(entity_name)) {
}
```
Technically it is possible to write a function which allocates the temporary buffer and writes to it in one go, but this gets into the weeds of C's `va_list`, which is very verbose to use.
It is possible to write a function which allocates the temporary buffer and writes to it in one go, akin to [my `fmt::print` function](#Ergonomic-functions), but even doing _that_ is verbose, as you have to deal with `va_list`---therefore needing two sets of functions, one for variadic arguments `...` and one for `va_list`.
`printf` is also error-prone.
It is easy to mess up and use the wrong specifier type, or pass too few arguments to the function.
@ -544,6 +574,8 @@ printf("%x", 1.0f); // oops
printf("%x"); // ...not again
```
This makes it unusable for localisation purposes.
There is also no easy, idiomatic way to concatenate strings written with `snprintf`.
```c
@ -554,3 +586,151 @@ cursor += snprintf(str, sizeof str, "world!");
```
This naive way is not actually correct, because `snprintf` returns the number of characters that _would_ be written into `str`, had the buffer been large enough.
Therefore, the second call to `snprintf` in the above example ends up writing past the buffer's bounds (at index 6.)
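Doing it correctly means clamping by hand on every call. The following sketch shows one way to do that; `append` is a hypothetical helper, not part of the C library, and it keeps counting the characters that would have been written once the buffer is full.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Hypothetical helper: appends `text` at position `cursor` of a buffer
// with capacity `cap`, and returns the new cursor. Like snprintf, the
// cursor counts the characters that *would* have been written.
int append(char* str, int cap, int cursor, const char* text)
{
    // Once the cursor has run past the buffer, stop writing but keep
    // counting, so the caller still learns the required length.
    if (cursor >= cap)
        return cursor + int(strlen(text));
    int n = snprintf(str + cursor, size_t(cap - cursor), "%s", text);
    return n < 0 ? cursor : cursor + n;
}
```

Every call site has to thread both the capacity and the cursor through, which is exactly the bookkeeping that `String_Buffer` keeps in one place.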
## Extras
Since the base library is very bare-bones, I'm including some additional snippets to help you get it integrated into your project.
### `write_value` for various types
```cpp
void write_value(String_Buffer& buf, const char* value)
{
write(buf, value, int(strlen(value)));
}
void write_value(String_Buffer& buf, bool value)
{
if (value)
write(buf, "true", 4);
else
write(buf, "false", 5);
}
void write_value(String_Buffer& buf, char value)
{
write(buf, &value, 1);
}
```
For integers, here's an implementation of `write_value` for `int64_t`.
This can confuse C++'s overload resolution, so I'd recommend adding additional overloads for smaller integers `int8_t`, `int16_t`, `int32_t`, also `long long`, and `ptrdiff_t`, calling into the `int64_t` overload.
```cpp
void write_value(String_Buffer& buf, int64_t value)
{
if (value == 0) {
write(buf, "0", 1);
return;
}
// Negate in unsigned arithmetic: `-value` would overflow for INT64_MIN.
uint64_t magnitude = uint64_t(value);
if (value < 0) {
write(buf, "-", 1);
magnitude = 0 - magnitude;
}
char digits[20] = {};
int i = sizeof digits - 1;
while (magnitude > 0) {
digits[i--] = char('0' + magnitude % 10);
magnitude /= 10;
}
int ndigits = sizeof digits - i - 1;
write(buf, digits + i + 1, ndigits);
}
```
A `uint64_t` version can be created in a similar manner, by removing the `if (value < 0)` case near the beginning.
This algorithm works for any radix (base 2, base 8, base 16, ...).
In my own implementation, I have a `Format_Hex` newtype, which changes the output to base 16.
```cpp
struct Format_Hex
{
uint64_t value;
};
namespace fmt
{
inline Format_Hex hex(uint64_t value) { return {value}; }
}
```
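A matching `write_value` overload could look like the sketch below. It is an assumption about how such an overload might be written, not my actual implementation: the lowercase digits, the lack of a `0x` prefix, and the `hex_string` test helper are all choices made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

struct String_Buffer
{
    char* str;
    int cap;
    int len = 0;
};

static void write(String_Buffer& buf, const char* str, int len)
{
    int remaining_cap = buf.cap - buf.len - 1; // leave one byte for NUL
    int write_len = len > remaining_cap ? remaining_cap : len;
    if (write_len > 0)
        memcpy(buf.str + buf.len, str, write_len);
    buf.len += len;
}

struct Format_Hex
{
    uint64_t value;
};

// Same digit loop as the decimal version, with radix 16 and a lookup
// string to map each remainder onto its digit character.
void write_value(String_Buffer& buf, Format_Hex hex)
{
    uint64_t value = hex.value;
    if (value == 0) {
        write(buf, "0", 1);
        return;
    }
    char digits[16] = {}; // 64 bits at 4 bits per hex digit
    int i = int(sizeof digits) - 1;
    while (value > 0) {
        digits[i--] = "0123456789abcdef"[value % 16];
        value /= 16;
    }
    int ndigits = int(sizeof digits) - i - 1;
    // Parenthesised so the pointer never steps before the array when
    // all 16 digits are used and i is -1.
    write(buf, digits + (i + 1), ndigits);
}

// Hypothetical helper for trying the overload out.
std::string hex_string(uint64_t value)
{
    char storage[32] = {};
    String_Buffer buf = {storage, sizeof storage};
    write_value(buf, Format_Hex{value});
    return storage;
}
```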
For floats, I defer the work onto `snprintf`'s `%g` specifier, because I trust it to do a better job than I ever could, even if a bit slow.
You can also use [Ryu](https://github.com/ulfjack/ryu) for this purpose.
```cpp
void write_value(String_Buffer& buf, double value)
{
char f[32] = {};
int len = snprintf(f, sizeof f, "%g", value);
if (len > sizeof f - 1)
len = sizeof f - 1;
write(buf, f, len);
}
```
And of course, don't forget about vectors---which were one of my motivating examples for abandoning `printf`.
```cpp
void write_value(String_Buffer& buf, Vec3 value)
{
write(buf, "(", 1);
write_value(buf, value.x);
write(buf, ", ", 2);
write_value(buf, value.y);
write(buf, ", ", 2);
write_value(buf, value.z);
write(buf, ")", 1);
}
```
### Ergonomic functions
The ergonomics of having to allocate a backing buffer, and then a `String_Buffer` afterwards, can get a bit cumbersome.
To help alleviate this, I have a `Static_String` type, together with a `print` function, which formats to a `Static_String` and returns it:
```cpp
template<int N>
struct Static_String
{
char data[N] = {};
const char* operator*() const { return data; }
};
namespace fmt
{
template<int N, typename... Args>
Static_String<N> print(const char* fstr, const Args&... args)
{
Static_String<N> str;
String_Buffer buf = {str.data, sizeof str.data};
format(buf, fstr, args...);
return str;
}
}
```
This makes it very easy to use a format string wherever an ordinary `const char*` is expected.
```cpp
if (ImGui::TreeNode(
*fmt::print<64>("{}({}) {}", index, generation, entity_kind)
)) {
// ...
ImGui::TreePop();
}
```
---
Thank you to my friend Tori for giving a whole bunch of solid feedback on a draft of this post.


@ -0,0 +1,65 @@
#include <cstring>
struct String_Buffer
{
char* str;
int cap;
int len = 0;
};
namespace fmt
{
void write(String_Buffer& buf, const char* str, int len)
{
int remaining_cap = buf.cap - buf.len - 1; // leave one byte for NUL
int write_len = len > remaining_cap ? remaining_cap : len;
if (write_len > 0)
memcpy(buf.str + buf.len, str, write_len);
buf.len += len;
}
void write_value(String_Buffer& buf, const char* value)
{
write(buf, value, strlen(value));
}
bool next_hole(String_Buffer& buf, const char*& fstr)
{
const char* prefix = fstr;
while (*fstr != 0) {
if (*fstr == '{') {
int len = fstr - prefix;
// peek instead of advancing, so a trailing '{' cannot step past the NUL
if (fstr[1] == '}') {
fstr += 2;
write(buf, prefix, len);
return true;
}
if (fstr[1] == '{') {
write(buf, prefix, len);
prefix = fstr + 1;
++fstr;
}
}
++fstr;
}
write(buf, prefix, fstr - prefix);
return false;
}
template<typename T>
void format_value(String_Buffer& buf, const char*& fstr, const T& value)
{
if (next_hole(buf, fstr))
write_value(buf, value);
}
template<typename... Args>
void format(String_Buffer& buf, const char* fstr, const Args&... args)
{
(format_value(buf, fstr, args), ...);
while (next_hole(buf, fstr)) {}
}
}