programming: composable virtual file systems

This commit is contained in:
liquidex 2024-11-25 16:59:51 +01:00
parent b522732a67
commit 3673036bec
4 changed files with 522 additions and 0 deletions

View file

@ -9,6 +9,10 @@
% id = "01HPD4XQQ5GPQ20C6BPA8G670F"
- :folder: *blog*
% content.link = "programming/blog/vfs"
id = "01JDJJHCGTAJ78K8WRXV00ZEJC"
+ :page: composable virtual file systems
% content.link = "programming/blog/cstring-starts-with"
id = "01JCGAM56KS2C6D4XZ1MRATXH4"
+ :page: prefix matches with C strings
@ -69,3 +73,4 @@
+ :folder: *opinions*

View file

@ -0,0 +1,504 @@
%% title = "composable virtual file systems"
% id = "01JDJGVC7BRZDSYTZCH0S357B8"
- you know what I hate?
% id = "01JDJGVC7BTT39GCS96D3N7GF1"
- [State.](https://overwerk.bandcamp.com/album/state){.secret title="the linked album is pretty good, though."}
% id = "01JDJGVC7BSXT52HB8A240XME2"
- guess what file systems are:
giant balls of state controlled fully by imperative code!
% id = "01JDJGVC7BEFG210RZV2G82TNC"
- how the heck are we supposed to write reliable code in an environment where we can't even be sure everything is in the state we expect it to?
% id = "01JDJGVC7BV6PQ1F6QK79M9RKJ"
- having to handle state desynchronizations results in a massive amount of complexity and space for bugs.
and unfortunately... there's really nothing you can do about this, because the whole _point_ of file systems is to enable persistent storage. :oh:
% id = "01JDJGVC7B88P2JXYMV1TS3EMY"
- so I've been thinking if _maybe_ there was some way to make file systems more declarative.
% id = "01JDJGVC7BKTJDSQ4BNWAZKZ66"
- after all, functional programming exists, and it enables you to think about data and transformations, rather than sending commands to the computer.
% id = "01JDJGVC7BGSEKHKSP75CFCQ5V"
+ systemd exists, NixOS also exists... both enable you to specify what state your system _should_ reach, rather than _how_ it should be done.
% id = "01JDJGVC7BCNEBQWTYKFG523K3"
- well, right.
actually, systemd _does_ have an imperative element to it, because you have to tell it what command should be executed to get your process spinning.
I'd argue that is still way better than a shell script, because it can restart failing services selectively without any extra work on your part.
% id = "01JDJGVC7B1M21FZ2HQ2T83JWY"
- my idea was basically to enable the programmer to specify an expected file system structure declaratively, like so:
```lua
root = dir {
crates = dir "crates",
readme = file "README.md",
}
```
and then you'd be able to access the readme file via `root.readme`, rather than by the usual fopen-fwrite-fclose interface.
% id = "01JDJGVC7B22EV79JSJ95QGW6Y"
- but the more I thought about the idea in this way, the less sense it made.
% id = "01JDJGVC7BZS445KBG16AMWHGS"
- _sure_ it may be fine for asserting the expected structure of files in the file system, but what if some of the files don't exist?
what about directories where you explicitly _don't_ know about the file structure?
% id = "01JDJGVC7BSKPBDZ6M5BGQNFCW"
- basically, this idea was weirdly bidirectional in a way that my little cat brain couldn't process.
so that never went anywhere.
% id = "01JDJGVC7B09FX0BEDK6A133G6"
- a few weeks ago however, I had a different revelation: what if instead of interfacing with the underlying file system, we... build one?
like... you know. a _virtual_ file system?
% id = "01JDJGVC7BKKFS0758KXN22VFQ"
- I was familiar with the idea of creating an API exposing a virtual file system through [LÖVE](https://love2d.org/), which exposes bindings to [PhysicsFS](https://icculus.org/physfs/).
% id = "01JDJGVC7B1SYGAADP1ECXA05G"
- I know of [at least one game](https://thelettervsixtim.es/) that makes use of PhysicsFS outside of LÖVE, but I had never used it myself in a project.
% id = "01JDJGVC7BCRY04ZZYYQ0DHVSS"
- looking at the API exposed by LÖVE though, it's pretty simple---you get a (global) file system, to which you can add mount points, which represent physical directories on your computer.
% id = "01JDJGVC7B5AR4NQC23ZY6HENB"
- this idea seemed incredibly cool, though going with my functional zen I just _needed_ to make it composable---so my end goal was to have a file system made out of pieces you can fit together like LEGO bricks!
% id = "01JDJGVC7BK00WN1B8WGAXEC07"
- ### a [fractal](https://www.youtube.com/watch?v=8lF2PQpwTEU){.secret} of files
% id = "01JDJGVC7B633NSHE870KR761Q"
- I started designing.
I knew I at least needed the ability to enumerate and read files, so I needed at least these two functions:
```rust
trait Dir {
/// List all entries under the given path.
fn dir(&self, path: &VPath) -> Vec<VPathBuf>;
/// Return the byte content of the entry at the given path, or `None` if the path does not
/// contain any content.
fn content(&self, path: &VPath) -> Option<Vec<u8>>;
}
```
this alone already gave me an _insane_ amount of insight!
% id = "01JDJGVC7BPEW2NAT7XSB5F97E"
- first of all, from the perspective of the program, do we _really_ need to differentiate between directories and files?
% id = "01JDJGVC7BHQWZDS17DY0096F6"
- from a GUI design standpoint, they're a useful abstraction for us humans---the more concrete an object, the easier it is to grasp.
_files_ and _folders_ seem a lot more easy to grasp than abstract _entries_ which _may_ represent documents, folders, or both.
but for a program, it couldn't care less.
% id = "01JDJGVC7BTKRY1YQ31E5WGGBQ"
- compare this code for walking the file system:
```rust
fn walk_dir_rec(dir: &dyn Dir, path: &VPath, mut f: impl FnMut(&VPath)) {
for entry in dir.dir(path) {
f(&entry);
walk_dir_rec(dir, &entry, f);
}
}
fn process_all_png_files(dir: &dyn Dir) {
walk_dir_rec(dir, VPath::ROOT, |path| {
if path.extension() == Some("png") {
if let Some(content) = dir.content(path) {
// do stuff with the file
}
}
});
}
```
% id = "01JDJH0ZNAHM8JPQNKMPZKQ2ZT"
- to this code, which has to differentiate between files and directories, because calling `dir` on a file or `content` on a directory is an error:
```rust
fn walk_dir_rec(dir: &dyn Dir, path: &VPath, mut f: impl FnMut(&VPath)) {
for entry in dir.dir(path) {
f(&entry);
if entry.kind == DirEntryKind::Dir {
walk_dir_rec(dir, &entry, f);
}
}
}
fn process_all_png_files(dir: &dyn Dir) {
walk_dir_rec(dir, VPath::ROOT, |entry| {
if entry.path.extension() == Some("png")
&& entry.kind == DirEntryKind::File
{
if let Some(content) = dir.content(entry.path) {
// do stuff with the file
}
}
});
}
```
to me, the logic seems a lot simpler in the former case, separating the concerns of walking the directory in `walk_dir_rec` from the concerns of reading the files in `process_all_png_files`!
% id = "01JDJGVC7B6GFRVG842PFEEC8E"
+ this does not _automatically_ mean it's a good idea to design an operating system around this, but it's interesting to think about the properties that emerge from removing the separation.
% id = "01JDJGVC7BDF8NWFJ90NN1EXSN"
- it may not even be the greatest idea to interface with the physical file system in this way, if the communication has to be bidirectional---since real world file systems separate files from directories, think about what happens if your program tries to write `content` to an entry which already has a `dir`.
% id = "01JDJGVC7BHFXM4XR1S4Y9XHBK"
- in that manner, this is a leaky abstraction.
% id = "01JDJGVC7BDB79H5S7WJHA53DM"
- second... this looks a lot like [resource forks](https://en.wikipedia.org/wiki/Fork_%28file_system%29)!
so imagine that you can add even more metadata to file system entries, by adding more methods to this trait.
% id = "01JDJGVC7BG0K62PK3YEH4NAJA"
- feels like an incredibly useful way to propagate auxiliary data through the program!
% id = "01JDJGVC7BJ41R94JDY78EYEWZ"
- with these two functions, the ability to join paths, and remove their prefixes, this is enough to start building interesting things.
% id = "01JDJGVC7B2GXQGZD3YGB6XYGF"
- since we'd like our file system to be composable, we'll need a composition operator first.
I'm naming mine `MemDir`, because it represents an in-*mem*ory *`dir`* with entries.
I'll spare you the implementation details, but it acts more or less like a hash map:
```rust
let mut dir = MemDir::new();
dir.add(VPath::new("README.txt"), readme_txt);
dir.add(VPath::new("src"), src);
```
% id = "01JDJGVC7B96AM20AQ5YGN7XQR"
- and now for the _opposite_ of the composition operator: we'll need an operator that decomposes `Dir`s into smaller ones.
the name of this one I'll take from the command line---`Cd`, meaning _change directory_:
```rust
let src = Cd::new(dir, VPath::new("src"));
```
% id = "01JDJGVC7B1N3EF1V2Q8MGK1N7"
- and voilá, the file system is composable!
% id = "01JDJGVC7BH8TN1WNZNJEVYEBB"
- interestingly enough, assuming your virtual paths cannot represent parent directories, this already forms the foundation of a capability-based file system.
% id = "01JDJGVC7BWSEYK4CFD7KDB0R3"
- if you want to give a program access to your `src` directory, but _nothing_ else, give it that `Cd` from above, and it cannot access anything outside of it.
% id = "01JDJGVC7BF77CE3N07NN9F2MX"
- ### [get real](https://www.youtube.com/watch?v=rwNSpBYszIk){.secret}
% id = "01JDJGVC7BQZAY4TP1WG5N6QR3"
- having a design in mind, I thought it would be interesting to integrate it into a real project.
the treehouse seemed like the right thing to test it out on, since it's effectively a compiler---it transforms a set of source directories into a target directory.
% id = "01JDJH59SN5219QPGJPZS7FMA6"
- and I have to say: so far, I'm liking it!
% id = "01JDJGVC7BVTP35M0QG74A15RJ"
- I ended up needing a few more resource forks to implement all the existing functionality.
```rust
pub trait Dir: Debug {
/// List all entries under the provided path.
fn dir(&self, path: &VPath) -> Vec<DirEntry>;
/// Return the byte content of the entry at the given path.
fn content(&self, path: &VPath) -> Option<Vec<u8>>;
/// Get a string signifying the current version of the provided path's content.
/// If the content changes, the version must also change.
///
/// Returns None if there is no content or no version string is available.
fn content_version(&self, path: &VPath) -> Option<String>;
/// Returns the size of the image at the given path, or `None` if the entry is not an image
/// (or its size cannot be known.)
fn image_size(&self, _path: &VPath) -> Option<ImageSize> {
None
}
/// Returns a path relative to `config.site` indicating where the file will be available
/// once served.
///
/// May return `None` if the file is not served.
fn anchor(&self, _path: &VPath) -> Option<VPathBuf> {
None
}
/// If a file can be written persistently, returns an [`EditPath`] representing the file in
/// persistent storage.
///
/// An edit path can then be made into an [`Edit`].
fn edit_path(&self, _path: &VPath) -> Option<EditPath> {
None
}
}
```
% id = "01JDJGVC7BNDFRCSB9B58MF9H0"
- `content_version` and `anchor` are both used to assemble URLs out of `Dir` entries.
I have a function `url` which, given a root URL, returns a URL with a `?v=` parameter for cache busting.
```rust
pub fn url(site: &str, dir: &dyn Dir, path: &VPath) -> Option<String> {
let anchor = dir.anchor(path)?;
if let Some(version) = dir.content_version(path) {
Some(format!("{}/{anchor}?v={version}", site))
} else {
Some(format!("{}/{anchor}", site))
}
}
```
% id = "01JDJGVC7BY8KF3AVJDQBED21Z"
+ `image_size` is used to automatically determine the size of images at build time.
that way I can add `width="" height=""` attributes to all `<img>` tags, preventing [layout shift](https://web.dev/articles/cls).
% id = "01JDJGVC7BTGNABZVPCV38XFBV"
- technically, as of writing this, not _all_ images have this.
notably, ones I paste into the markup, like this one:
![goofy close up screenshot of Hat Kid staring down into the camera][pic:01JDJCD4205JHQ01XJTYPP5KB3]
if you refresh the page with this branch open, you will notice the layout shift.
unless I've already fixed it as of the time you're reading this, in which case... :partying:
% id = "01JDJGVC7BVFG1T741YMEAWMEY"
- one notable piece of functionality that is currently missing is _version history_.
to be honest, I'm still figuring that one out in my head;
I have the feeling it's not exactly going to be simple, but it should end up being a lot more principled than [whatever this was][vh].
[vh]: https://src.liquidev.net/liquidex/treehouse/src/commit/86b4bf5b2ddb38f3b5fdf362da3c3e75704bfd48/crates/treehouse/src/generate.rs#L491
% id = "01JDJGVC7BJAKXR98XPJV4FWPC"
- ### [Radio Edit (radio edit)](https://www.youtube.com/watch?v=WQzx9o2-0d0){.secret}
% id = "01JDJGVC7BH3VFBZ4MG5TXJT25"
- _but wait liquidex! what's that `edit_path` do?_
% id = "01JDJGVC7B7WD9RA8KZ1RT4MPS"
- one notable thing about this virtual file system is that it doesn't allow writing to the virtual files.
% id = "01JDJGVC7BRFGQX5DK3XKCTH14"
- I mean, think about it.
it just doesn't make sense!
we have a `&` immutable reference, and allowing the program to edit files as it's compiling could wreck some real havoc...!
% id = "01JDJGVC7B4YT790RRS3ME6600"
- and not only that, there's also the question of _useless edits_, edits that tweak in-memory files.
those don't persist across restarts, so they don't really make much sense, do they?
% id = "01JDJGVC7B2EF87ZMN66P0KJDZ"
- it may seem pointless to want to have the treehouse---again, a compiler of sorts---capable of editing source files, but there _is_ one legit use case: I have a `fix` command, which fills in all the missing branch IDs in a file, optionally saving it in place.
% id = "01JDJGVC7B8SAJ90Z3XWW59N0G"
- there's also a `fix-all` command, which does this to all files.
% id = "01JDJGVC7BM5ZX03KDAW376WW9"
- so I needed to devise an API that would let me _write_ to files after the command is done running.
that's what the `edit_path` is for.
% id = "01JDJGVC7B0BCHNHQPZFXVNFA9"
- `edit_path` returns an `EditPath`, which represents a location somewhere in persistent storage.
having an `EditPath`, you can construct an `Edit`.
```rust
/// Represents a pending edit operation that can be written to persistent storage later.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Edit {
/// An edit that doesn't do anything.
NoOp,
/// Write the given string to a file.
Write(EditPath, String),
/// Execute a sequence of edits in order.
Seq(Vec<Edit>),
/// Execute the provided edits in parallel.
All(Vec<Edit>),
/// Makes an edit dry.
///
/// A dry edit only logs what operations would be performed, does not perform the I/O.
Dry(Box<Edit>),
}
```
`Edit`s take many shapes and forms, but the most important one for us is `Write`: it allows you to write a file to the disk.
the other ones are for composing `Edit`s together into larger ones.
% id = "01JDJGVC7BCT46CXXVKMXATAA3"
- `Seq` can be used to implement transactions, where all edits have to succeed for the parent edit to be considered successful.
% id = "01JDJHDZ8XWEZAA94AW1VS4P53"
- I use this for writing backups.
if writing a backup fails, we wouldn't want to overwrite the original file, because we wouldn't be able to restore it!
% id = "01JDJGVC7B05H6079QHXAXC8Y4"
- `All` can be used to aggregate independent edits together and execute them in parallel.
% id = "01JDJGVC7BKXXJSWYTGX7XA979"
- `Dry` can be used to implement a `--dry-run` command, only printing an edit instead of applying it.
% id = "01JDJGVC7B7MNC3CA5DEP1HXM6"
+ `NoOp` can be used when you need to produce an `Edit`, but don't actually want to perform any operations.
% id = "01JDJGVC7BJRSMS1E4NQRVZ1H2"
- this runs contrary to [my opinion on `None` enums][branch:01HCG7KTGGAFS07QYJXZG6WHJJ], for one reason: would you rather have to handle `Option<Edit>` everywhere, or just assume whatever `Edit` you're being passed is valid?
% id = "01JDJHJ69797960A42P068X4F2"
- try replacing recursive `Edit` references in the example above with `Option<Edit>`s.
% id = "01JDJGVC7BNBB2HQHHZ0WCDA1R"
- while `Edit`s are meant to be composed, they aren't really meant to be inspected outside of the code that applies edits to disk---therefore having a `NoOp` variant actually _improves_ readability, since `Edit` is constructed more often than it is deconstructed.
% id = "01JDJGVC7B0NV9B16QJKKK15NG"
- if this doesn't yell "you shouldn't treat anyone's opinions as dogma---not even your own ones," I don't know what will.
% id = "01JDJGVC7B988WXSERGV95345R"
- although I said before that the fork-based virtual file system is a leaky abstraction when you introduce writing to the physical file system, I don't think this particular API is susceptible to this---since it can expose `EditPath`s for entries that _can_ actually be written (ones with a `content`), you can disallow writing to directories that way.
% id = "01JDJGVC7BGJQXTVZ6Z9BK12BG"
- of course then you cannot create directories.
% id = "01JDJGVC7B2JTWKJVEE1VP6P68"
- also, [TOCTOU](https://en.wikipedia.org/wiki/Time-of-check_to_time-of-use) bugs are a thing, but I disregard those as they don't really fit into a compiler's threat model.
% id = "01JDJGVC7BQVB72WJJ2DV1F91H"
- as I said in the beginning, I don't really like mutable state...
% id = "01JDJGVC7BQSKCS5GC93QF9EE1"
- ### improvise, _adapt_, overcome
% id = "01JDJGVC7BAN5P6KGHM5PVFQ78"
- thanks to the `Dir`'s inherent composability, it is trivial to build adapters on top of it.
I have a few in the treehouse myself.
% id = "01JDJGVC7BCG0W3SWSYR647ZCC"
- `TreehouseDir` is a file system which renders `.tree` files into HTML lazily.
% id = "01JDJGVC7BBJCR1TQCBKBHMBNS"
- since generating all this HTML is an expensive operation, I have it wrapped in a `ContentCache` adapter, which caches any successfully generated pages in memory.
% id = "01JDJGVC7B5458Y8KJKZ2K0HVR"
- I pre-warm the cache with all pages too, so that I can see any warnings that arise during generation.
this is done on multiple threads, because all my `Dir`s are `Send + Sync`.
% id = "01JDJGVC7BKJW4SJGK2MSX1HY3"
- I also have a `Dir` called `HtmlCanonicalize` layered on top of the `ContentCache`, which removes `html` extensions from paths.
`TreehouseDir` exposes canonical paths _without_ an `html`, but I have to support it for compatibility with old links.
% id = "01JDJGVC7BRMRT5YMTCTA4KQ38"
- `Blake3ContentVersionCache` is the sole implementor of `Dir::content_version`.
its purpose is to compute `content_version`s and cache them in memory for each path.
as the name suggests, versions are computed using a (truncated) [BLAKE3](https://en.wikipedia.org/wiki/BLAKE_%28hash_function%29#BLAKE3) hash.
% id = "01JDJGVC7BKKFMW6PQX3FZS4D4"
- `ImageSizeCache` is the sole implementor of `Dir::image_size`.
for paths with a supported extension (`png`, `svg`, `jpg`, `jpeg`, `webp`), it reads the stored image file's size and caches it in memory.
% id = "01JDJGVC7B29N205TP3NT2CR4R"
- `Anchored` is the sole implementor of `Dir::anchor`.
it lets me specify that the source file system's `static` directory ends up being available under `/static` on the website.
% id = "01JDJGVC7BCNF5DNH8FN8WP7H8"
- `Overlay` combines a base directory with an overlay directory, first routing requests to the overlay directory, and if those fail, routing them to the base directory.
this allows me to overlay a `MemDir` with the `static` directory and a `robots.txt` on top of a `TreehouseDir`---which together form the compiler's target directory.
% id = "01JDJGVC7BXA6G2AX5F2RVNH35"
- and guess what---the server serves straight from that virtual target directory, too!
after all, what's the point of writing it to disk if you already have everything assembled.
_It's all dynamic.™_
% id = "01JDJGVC7B2KFPFAB8BRV735YA"
- for the curious, here's roughly how the treehouse's virtual file systems are structured:
```rust
source: ImageSizeCache(Blake3ContentVersionCache(MemDir {
"treehouse.toml": BufferedFile(..), // content read at startup
"static": Anchored(PhysicalDir("static"), "static"),
"template": PhysicalDir("template"),
"content": PhysicalDir("content"),
}))
target: Overlay(
HtmlCanonicalize(ContentCache(TreehouseDir)),
MemDir {
"static": Cd(source, "static"),
"robots.txt": Cd(source, "static/robots.txt"),
},
)
```
% id = "01JDJGVC7BZKPH3ATEE1YP2MYS"
- I'm not too fond of that `treehouse.toml` hack, but this is because `PhysicalDir` cannot really be attached to singular files right now.
I'll definitely fix that one day...
% id = "01JDJGVC7B1DG4Y4PTVRT0NQFV"
- ### a fatal [flaw](https://musicbyflaws.bandcamp.com/){.secret}
% id = "01JDJGVC7BVV81GF2GRTQVCK7E"
- there is one flaw I want to point out with the current implementation of `Dir`: it uses trait methods to add new resource forks.
% id = "01JDJGVC7BJS79G1VJN1G29GRK"
- any time you add a new resource fork that's implemented only by one or two `Dir`s, you have to duplicate a stub across all existing adapters!
this means introducing a new fork means performing up to `N` edits, where `N` is the number of implementers of `Dir`.
% id = "01JDJJDAPMEVMJNQX7NF2AWEJE"
- this problem doesn't exist for `Dir`s which are _not_ adapters (aka suppliers), but in practice there are far less suppliers than adapters.
% id = "01JDJGVC7B7VDMV5HT6Q1W664D"
- one idea I've had to [fix this](https://anilist.co/anime/9253/SteinsGate/){.secret title="My Fork."} was to change the API shape to a single trait method.
```rust
pub trait Dir {
fn forks(&self, path: &VPath, forks: &mut Forks);
}
impl Dir for MyDir {
fn forks(&self, path: &VPath, forks: &mut Forks) {
forks.insert(|| MyFork);
}
}
impl<T> Dir for AdapterDir<T> {
fn forks(&self, path: &VPath, forks: &mut Forks) {
self.inner.forks(path, forks);
forks.insert(|| MyMomsEpicSilverware);
}
}
```
but that hasn't come to fruition yet, as I have no idea how to make it efficient yet object-safe...
I'm yet to add profiling to the treehouse, so I don't want to make risky performance decisions like this at this point.
% id = "01JDJGVC7B9799AC57MYCSA9X1"
- dynamic typing has a cost, after all!
% id = "01JDJGVC7BTM4MH9PC4M15MJWW"
- maybe one day I'll post an update on that.
% id = "01JDJGVC7BMVXNKZPJ6YHR7BMB"
- and that basically concludes all this virtual file system shenaniganry!
I hope you got something useful out of it.
% id = "01JDJGVC7B60FZDTJFJMQ09EVV"
- I look forward to finding out how this system will fare in the long run.
guess I'll report my progress, I dunno. next year?
% id = "01JDJGVC7BG54TARPJ20T48ZGP"
- see you then!

View file

@ -14,6 +14,18 @@ if you've been wondering what I've been up to, you've come to the right place.
% id = "01JCGWPM6TMAJT0B50GQSA4BDW"
- there is currently no RSS or Atom feed for this page, sorry!
% id = "01JDJJSEWASRWJGKMBNYMFD9B5"
- ### [composable virtual file systems][page:programming/blog/vfs]
% id = "01JDJJSEWAVZGJN3PWY94SJMXT"
- I recently rebuilt the treehouse to use a virtual file system for its source and target directories.
% id = "01JDJJSEWA7K5T3Z0Y6NQ8RBGX"
- this is an exploration of how I built my abstraction, how it works, and what I learned from it.
% id = "01JCGAM553TJJCEJ96ADEWETQC"
- ### [prefix matches with C strings][page:programming/blog/cstring-starts-with]
% id = "01JBAGZAZ30K443QYPK0XBNZWM"
- ### [the curious case of Amon Tobin's Creatures][page:music/creatures]
@ -157,3 +169,4 @@ if you've been wondering what I've been up to, you've come to the right place.
% id = "01HQ6G30PT1D729Z29NYVDCFDB"
- this post explores basically just that.

Binary file not shown.

After

Width:  |  Height:  |  Size: 145 KiB