% id = "01HA0GPJ8B0BZCTCDFZMDAW4W4"
|
|
- [repo][def:stitchkit/repo]
|
|
|
|
% id = "01HA0GPJ8BXETX6BWP8E9484DT"
|
|
- my UnrealScript compiler
|
|
|
|
% id = "01HA0GPJ8B6QW73GAH6KJQP0GD"
|
|
+ part of the Stitchkit project, which aims to build a set of Hat in Time modding tools that are a joy to use
|
|
|
|
% id = "01HA0GPJ8B0YJBXVS61M57VRX0"
|
|
- the name "MuScript" is actually a reference to Mustache Girl
|
|
|
|
% id = "01HA0GPJ8BKRNXP9KJQCAWW4MD"
|
|
+ ### architecture
|
|
|
|
% id = "01HA0GPJ8BP1QVDE9Z2GHFV5DV"
|
|
- MuScript uses a query-based architecture similar to rustc
|
|
|
|
% id = "01HA0GPJ8BCX44E6N7BG350AET"
|
|
- the classic pass-based compiler architecture has the compiler drive itself by first
|
|
parsing all files, then analyzing them, then emitting bytecode - in passes
|
|
|
|
% id = "01HA0GPJ8B4K7VTV8BC45FW4YW"
|
|
- a query-based architecture works by driving the compiler by asking it questions
|
|
|
|
% id = "01HA0GPJ8BYPDRG206ZYREYPBX"
|
|
+ the most interesting question to us all being "do you know the bytecode of this package?"
|
|
|
|
% id = "01HA0GPJ8BZM9NGGGX765JM0CD"
|
|
+ which then triggers another question "do you know the classes in this package?"
|
|
|
|
% id = "01HA0GPJ8B9HZMCJ3DK0C2DT2Y"
|
|
+ which then triggers another question "do you know the contents (variables, functions, structs, enums, states, ...) of those classes?"
|
|
|
|
% id = "01HA0GPJ8BFPFDFB37ZHVWHZKV"
|
|
+ which then triggers another question "do you know all the variables in class `ComfortZone`?"
|
|
|
|
% id = "01HA0GPJ8BGWWAHAM5CZ722ZZH"
|
|
+ which then triggers another question "do you know the ID of the type `Int`?"
|
|
|
|
% id = "01HA0GPJ8BDN6XYJBP792XWDWG"
|
|
- "yeah, it's 4."
|
|
|
|
% id = "01HA0GPJ8B95ZTVG14HVTBXSBF"
|
|
- "there's this variable `HuggingQuota` with ID 427."
|
|
|
|
% id = "01HA0GPJ8BKCAEVNC9CJB7JA9S"
|
|
+ which then triggers another question "do you know all the functions in class `ComfortZone`?"
|
|
|
|
% id = "01HA0GPJ8BY6VT4PMGK2Q8J640"
|
|
+ which then triggers another question "do you know the bytecode of function `Hug`?"
|
|
|
|
% id = "01HA0GPJ8BR6ZX06RSQJEVWXMQ"
|
|
+ which then triggers another question "do you know the ID of the class `Person`?"
|
|
|
|
% id = "01HA0GPJ8BYP7GQK2HV804YF9S"
|
|
- "yes indeed, this class exists and has ID 42."
|
|
|
|
% id = "01HA0GPJ8BBS7TKTY2D203VMN1"
|
|
+ which then triggers another question "do you know the bytecode of this function?"
|
|
|
|
% id = "01HA0GPJ8BDJMHGD7S3A7F4Z4Q"
|
|
- …you get the idea.
|
|
|
|
% id = "01HA0GPJ8B7GXF79PMKF89NZ3H"
|
|
- "alright, here's the bytecode for `Hug`."
|
|
|
|
% id = "01HA0GPJ8BTB9AF5QCQF555BX3"
|
|
- "that's all."
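- a minimal sketch of the question-asking idea above, assuming a toy setup; `Compiler`, `class_vars`, `type_id` and the `sources` table are hypothetical names for illustration and not MuScript's actual API. it only shows that one question triggers sub-questions, and that an answer is computed once and then remembered:
```rust
use std::collections::HashMap;

/// Hypothetical type ID; the concrete numbers are made up, like the ones in the walkthrough.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct TypeId(u32);

/// A toy "compiler" that answers questions on demand and remembers the answers.
struct Compiler {
    /// Stand-in for parsed source code: class name -> (variable name, type name).
    sources: HashMap<&'static str, Vec<(&'static str, &'static str)>>,
    /// Memoized answers to "do you know the ID of the type X?".
    type_ids: HashMap<String, TypeId>,
    /// How many times a type ID actually had to be computed.
    type_id_computations: u32,
}

impl Compiler {
    fn new() -> Self {
        Compiler {
            sources: HashMap::new(),
            type_ids: HashMap::new(),
            type_id_computations: 0,
        }
    }

    /// Query: "do you know the ID of the type `name`?"
    fn type_id(&mut self, name: &str) -> TypeId {
        if let Some(&id) = self.type_ids.get(name) {
            return id; // already answered before, no work to do
        }
        self.type_id_computations += 1;
        let id = TypeId(self.type_ids.len() as u32);
        self.type_ids.insert(name.to_owned(), id);
        id
    }

    /// Query: "do you know all the variables in class `class`?"
    /// Answering it asks the `type_id` question for every variable.
    fn class_vars(&mut self, class: &str) -> Vec<(String, TypeId)> {
        let decls = self.sources.get(class).cloned().unwrap_or_default();
        decls
            .into_iter()
            .map(|(var, ty)| (var.to_owned(), self.type_id(ty)))
            .collect()
    }
}
```
asking for the variables of a class with two `Int` variables computes the ID of `Int` only once; the second lookup is served from the memo table.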
|
|
|
|
% id = "01HA0GPJ8BY2R40Y5GP515853E"
|
|
+ ### ideas to try out
|
|
|
|
% id = "01HA0GPJ8BSMZ13V2S7DPZ508P"
|
|
- I jot down various silly ideas for MuScript in the future here
|
|
|
|
% id = "01HA0GPJ8B05TABAEC9JV73VRR"
|
|
+ parallelization with an event loop
|
|
|
|
% id = "01HA0GPJ8B48K60BWQ2XZZ0PB5"
|
|
+ the thing with a pass-based architecture is that with enough locking, it _may_ be easy to parallelize.
|
|
|
|
% id = "01HA0GPJ8B5QB6DF8YVTYC2HJY"
|
|
+ I can imagine parallelization existing on many levels here.
|
|
|
|
% id = "01HA0GPJ8BMQ8JAH09YGC3Y7VW"
|
|
- if you have a language like Zig where every line can be tokenized independently, you can spawn a task per line.
|
|
|
|
% id = "01HA0GPJ8BKGJ15YYNS9C1QRYB"
|
|
- you could analyze all the type declarations in parallel. though with dependencies it gets hairy.
|
|
|
|
% id = "01HA0GPJ8B2984NNB7E8X0Z1X7"
|
|
- you could analyze the contents of classes in parallel.
|
|
|
|
% id = "01HA0GPJ8BFYNKVA6Z923E2370"
|
|
- thing is, this stuff gets pretty hairy when you get to think about dependencies between all these different stages.
|
|
|
|
% id = "01HA0GPJ8B915YCDR0MZSW5RN1"
|
|
- hence why we haven't seen very many compilers that would adopt this architecture; most of them just sort of do their thing on one thread, and expect to parallelize by spawning more processes
|
|
of `cc` or `rustc` or what have you.
|
|
|
|
% id = "01HA0GPJ8BHF1KRM8KGFMT1875"
|
|
+ where with a pass-based architecture the problem is dependencies between independent stages you're trying to parallelize, with a query-based architecture like MuScript's it gets even harder,
|
|
because the entire compiler is effectively a dependency ~~hell~~ machine.
|
|
|
|
% id = "01HA0GPJ8B20D3MKV6TMB19GW2"
|
|
- one query depends on 10 subqueries, which all depend on their own subqueries, and so on and so forth.
|
|
|
|
% id = "01HA0GPJ8BA4T1C36R8WFJ3G2F"
|
|
- but there's _technically_ nothing holding us back from executing certain queries in parallel.
|
|
|
|
% id = "01HA0GPJ8B3FGTM417H46N0EXD"
|
|
- in fact, imagine if we executed _all_ queries in parallel.
|
|
|
|
% id = "01HA0GPJ8BENERTESAFQJ7G3R9"
|
|
+ enter: the `async` event loop compiler
|
|
|
|
% id = "01HA0GPJ8BF9E0MW1WS2EJY0J8"
|
|
- there is a central event loop that distributes tasks to be done to multiple threads
|
|
|
|
% id = "01HA0GPJ8B5WA232A319JTEFM2"
|
|
- every query function is `async`
|
|
|
|
% id = "01HA0GPJ8BQ15PT2M4BTA1ZNDX"
|
|
- meaning that when it needs an answer that isn't available yet, it suspends execution of the current function and writes "need to compute the type ID of `Person`" into a concurrent set (very important that it's a _set_) - let's call this set the "TODO set"
|
|
|
|
% id = "01HA0GPJ8B34GP8TD21AY8NXS1"
|
|
- on the next iteration, the event loop spawns a task for each element of the TODO set, and the tasks compute all the questions asked
|
|
|
|
% id = "01HA0GPJ8BKPHBPN44HM6WHKDJ"
|
|
- because we're using a set, the computation is never duplicated; remember that if an answer has already been memoized, it does not spawn a task and instead returns the answer immediately
|
|
|
|
% id = "01HA0GPJ8B0V2VJMAV1YCQ19Q8"
|
|
- though this may be hard to do with Rust because, as far as I know, there is no way to suspend a function conditionally? *(needs research.)*
|
|
|
|
% id = "01HA0GPJ8BBREEJCJRWPJJNR3N"
|
|
- once the TODO set is empty and no tasks are left running, we're done compiling
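- a rough, single-threaded sketch of the TODO set loop, under a big assumption: instead of suspending a query, it _re-runs_ the query on a later round once its dependency is memoized, which sidesteps the conditional suspension question rather than answering it. `Query`, `Needs`, `run` and `event_loop` are hypothetical names, and the "answers" are dummy numbers:
```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical questions the compiler can be asked.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
enum Query {
    TypeId(&'static str),
    FunctionBytecode(&'static str),
}

/// What a query reports when it cannot finish yet.
struct Needs(Query);

/// Tries to answer `query` using only answers that are already memoized.
fn run(query: &Query, memo: &HashMap<Query, u32>) -> Result<u32, Needs> {
    match query {
        // Leaf query: pretend the ID is just the length of the name.
        Query::TypeId(name) => Ok(name.len() as u32),
        // Pretend every function mentions `Person`, so its ID is needed first.
        Query::FunctionBytecode(_) => {
            let dep = Query::TypeId("Person");
            match memo.get(&dep) {
                Some(&id) => Ok(1000 + id),
                None => Err(Needs(dep)),
            }
        }
    }
}

/// Runs rounds until the TODO set is empty.
/// Returns the memo table and the number of rounds it took.
fn event_loop(roots: impl IntoIterator<Item = Query>) -> (HashMap<Query, u32>, usize) {
    let mut memo: HashMap<Query, u32> = HashMap::new();
    let mut todo: BTreeSet<Query> = roots.into_iter().collect();
    let mut rounds = 0;
    while !todo.is_empty() {
        rounds += 1;
        // Tasks in one round only read `memo`, so they are independent of each
        // other; this is the part that could be handed out to worker threads.
        let results: Vec<_> = todo.iter().map(|q| (q.clone(), run(q, &memo))).collect();
        let mut next = BTreeSet::new();
        for (query, result) in results {
            match result {
                Ok(answer) => {
                    memo.insert(query, answer);
                }
                Err(Needs(dep)) => {
                    next.insert(dep); // it's a set: asking twice queues one task
                    next.insert(query); // retry the blocked query later
                }
            }
        }
        // Never queue a question whose answer is already memoized.
        next.retain(|q| !memo.contains_key(q));
        todo = next;
    }
    (memo, rounds)
}
```
with two functions that both need the ID of `Person`, the dependency lands in the TODO set once, gets computed in the second round, and both functions finish in the third.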
|
|
|
|
% id = "01HAS9RREB6JCSQY986TE1SXVV"
|
|
+ parsing less
|
|
|
|
% id = "01HAS9RREB1VS73PWNRDVXE9C1"
|
|
- I measured the compiler's performance yesterday and most of the time is actually spent on parsing, of all things
|
|
|
|
% id = "01HAS9RREB8K52CGJNJDSHMHFP"
|
|
- going along with the philosophy of [be lazy][branch:01HAS6RMNCZS9Y84N8WZ6594D1], we should probably be parsing fewer things then
|
|
|
|
% id = "01HAS9RREBH11B8VFZNNJR6SZH"
|
|
+ I don't think we need to parse method bodies if we're not emitting IR
|
|
|
|
% id = "01HAS9RREBXTPT2VKWJCD484C6"
|
|
- basically, out of this code:
|
|
```unrealscript
|
|
function Hug(Hat_Player OtherPlayer)
|
|
{
|
|
PlayAnimation('Hugging');
|
|
// or something, idk UE3
|
|
}
|
|
```
|
|
parse only this:
|
|
```unrealscript
|
|
function Hug(Hat_Player OtherPlayer) { /* token blob: PlayAnimation ( 'Hugging' ) ; */ }
|
|
```
|
|
omitting the entire method body and treating it as an opaque blob of tokens until we need to emit IR
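- skipping a body like this could be as simple as counting braces over the token stream. a minimal sketch, assuming tokens are already lexed; the three-variant `TokenKind` and `skip_body` are hypothetical and much simpler than a real token set:
```rust
use std::ops::Range;

/// A deliberately tiny, hypothetical set of token kinds.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenKind {
    LeftBrace,
    RightBrace,
    Other,
}

/// Given the index of a `{`, returns the range of tokens strictly between it
/// and its matching `}`, without looking at what those tokens mean.
/// Returns `None` if `open` is not a `{` or the braces are unbalanced.
fn skip_body(tokens: &[TokenKind], open: usize) -> Option<Range<usize>> {
    if tokens.get(open) != Some(&TokenKind::LeftBrace) {
        return None;
    }
    let mut depth = 0usize;
    for (i, kind) in tokens.iter().enumerate().skip(open) {
        match kind {
            TokenKind::LeftBrace => depth += 1,
            TokenKind::RightBrace => {
                depth -= 1;
                if depth == 0 {
                    // The blob is everything inside the outermost braces.
                    return Some(open + 1..i);
                }
            }
            TokenKind::Other => {}
        }
    }
    None
}
```
the returned range is the opaque blob: it can be stored next to the function signature and only handed to the statement parser once IR is requested.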
|
|
|
|
% id = "01HAS9RREBHFF98VX1YDRC78NA"
|
|
+ I don't think we need to parse the entire class if we only care about its superclass
|
|
|
|
% id = "01HAS9RREBG6V27P9C5CZJBD9K"
|
|
- basically, out of this code:
|
|
```unrealscript
|
|
class lqGoatBoy extends Hat_Player;
|
|
|
|
defaultproperties
|
|
{
|
|
Model = SkeletalMesh'lqFluffyZone.Sk_GoatBoy';
|
|
// etc
|
|
}
|
|
```
|
|
only parse the following:
|
|
```unrealscript
|
|
class lqGoatBoy extends Hat_Player;
|
|
|
|
/* parser stops here, rest of text is ignored until needed */
|
|
```
|
|
and then only parse the rest if any class items are requested
|
|
|
|
% id = "01HASA3CG20D3EC87SCTWVR48A"
|
|
- real case: getting the superclasses of `Hat_Player` takes a really long time because it's _big_
|
|
(`Hat_Player.uc` itself is around 8000 lines of code, and it has many superclasses which are also pretty big)
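- a sketch of what "stop after the header" could look like, assuming the file starts directly with the class declaration (no leading comments or preprocessor directives) and working on raw text rather than tokens for brevity; `superclass` is a hypothetical helper:
```rust
/// Reads only the `class Name extends Super ...;` header and returns the
/// superclass name. Everything after the first `;` is never looked at.
fn superclass(source: &str) -> Option<&str> {
    // Only the text before the first semicolon is inspected.
    let header = source.split(';').next()?;
    let mut words = header.split_whitespace();
    // UnrealScript keywords are case-insensitive.
    if !words.next()?.eq_ignore_ascii_case("class") {
        return None;
    }
    let _class_name = words.next()?;
    if !words.next()?.eq_ignore_ascii_case("extends") {
        return None;
    }
    words.next()
}
```
for the `lqGoatBoy` example this yields `Hat_Player` without touching `defaultproperties`; walking a superclass chain would then cost one header per file instead of one full parse per file.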
|
|
|
|
% id = "01HD6NRBEZ8FMFMHW0TF62VBEC"
|
|
+ ### ideas I tried out
|
|
|
|
% id = "01HAS9RREBVAXX28EX3TGWTCSW"
|
|
+ :done: lexing first
|
|
|
|
% id = "01HAS9RREBM9VXFEPXKQ2R3EAZ"
|
|
- something that MuScript did not use to do is have a separate tokenization stage
|
|
|
|
% id = "01HAS9RREBE94GKXXM70TZ6RMJ"
|
|
+ this is because UnrealScript has some fairly idiosyncratic syntax which requires us to treat _some_ things in braces `{}` as strings, such as `cpptext`
|
|
|
|
% id = "01HAS9RREBQY6AWTXMD6DNS9DF"
|
|
- ```unrealscript
|
|
cpptext
|
|
{
|
|
// this has to be parsed as C++ code, which has some differing lexing rules
|
|
template <typename T>
|
|
void Hug(T& Whomst)
|
|
{
|
|
DoHug(T::StaticClass(), &Whomst);
|
|
}
|
|
}
|
|
```
|
|
|
|
% id = "01HAS9RREB4ZC9MN8YQWWNN7D2"
|
|
- but C++ is similar enough to UnrealScript that we are able to get away with lexing it using the main UnrealScript lexer
|
|
|
|
% id = "01HAS9RREBN6FS43W0YKC1BXJE"
|
|
- we even lex variable metadata `var int Something <ToolTip=bah>;` using the lexer, storing invalid characters and errors as some `InvalidCharacter` token kind or something
|
|
|
|
% id = "01HAS9RREBAXYQWNA068KKNG07"
|
|
+ and that's without emitting diagnostics - let the parser handle those instead
|
|
|
|
% id = "01HAS9RREBWZKAZGFKH3BXE409"
|
|
- one place where the current approach of the lexer eagerly emitting diagnostics fails is the case of `<ToolTip=3D location>`, where `3D` is lexed as a number literal with an invalid suffix and thus errors out
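- a sketch of a lexer that never reports errors itself and instead marks bad input with an `Invalid` token kind, leaving it to the parser to decide whether that matters. the token kinds and the lexing rules here are hypothetical simplifications, not UnrealScript's real grammar:
```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenKind {
    Ident,
    Number,
    Symbol,
    /// Anything the lexer could not make sense of; no diagnostic is emitted.
    Invalid,
}

#[derive(Debug, PartialEq, Eq)]
struct Token {
    kind: TokenKind,
    start: usize,
    end: usize,
}

/// Lexes `source` into tokens, skipping ASCII whitespace.
fn lex(source: &str) -> Vec<Token> {
    let bytes = source.as_bytes();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let start = i;
        let c = bytes[i];
        if c.is_ascii_whitespace() {
            i += 1;
            continue;
        }
        let kind = if c.is_ascii_digit() {
            while i < bytes.len() && bytes[i].is_ascii_alphanumeric() {
                i += 1;
            }
            // Digits followed by letters (like `3D`) are not a valid number,
            // but whether that is an error is for the parser to decide.
            if bytes[start..i].iter().all(u8::is_ascii_digit) {
                TokenKind::Number
            } else {
                TokenKind::Invalid
            }
        } else if c.is_ascii_alphabetic() || c == b'_' {
            while i < bytes.len() && (bytes[i].is_ascii_alphanumeric() || bytes[i] == b'_') {
                i += 1;
            }
            TokenKind::Ident
        } else {
            i += 1;
            if c.is_ascii_punctuation() {
                TokenKind::Symbol
            } else {
                TokenKind::Invalid
            }
        };
        tokens.push(Token { kind, start, end: i });
    }
    tokens
}
```
with this, `<ToolTip=3D location>` lexes without any error; a metadata parser can take the source text of the tokens between `=` and `>` verbatim and never look at the `Invalid` kind at all.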
|
|
|
|
% id = "01HD6NRBEZ2TCHKY1C4JK2EK0N"
|
|
- implementing this taught me one important lesson: context switching is expensive
|
|
|
|
% id = "01HD6NRBEZCKP5ZYZ3XQ9PVJTD"
|
|
- having the lexer as a separate pass made the parsing 2x faster, speeding up the
|
|
compiler pretty much two-fold (because that's where the compiler was spending most of its time)
|
|
|
|
% id = "01HD6NRBEZP6V4J1MS84C6KN1P"
|
|
- my suspicion as to why this was slow is that the code for parsing, preprocessing,
|
|
and reading tokens was scattered across memory - also with lots of branches that
|
|
needed to be checked for each token requested by the parser
|
|
|
|
% id = "01HD6NRBEZDM4QSN38TZJCXRAA"
|
|
+ I think having token data in one contiguous block of memory also helped, though
|
|
it isn't as efficient as it could be _yet_.
|
|
|
|
% id = "01HD6NRBEZWSA9HFNPKQPRHQK1"
|
|
- the current data structure as of writing this is
|
|
```rust
|
|
struct Token {
|
|
kind: TokenKind,
|
|
source_range: Range<usize>,
|
|
}
|
|
|
|
struct TokenArena {
|
|
tokens: Vec<Token>,
|
|
}
|
|
```
|
|
(with some irrelevant things omitted - things like source files are not relevant
|
|
for token streams themselves)
|
|
|
|
% id = "01HD6NRBEZXCE5TQSMQHQ29D90"
|
|
- I don't know if I'll ever optimize this to be even more efficient than it
|
|
already is, but source ranges are mostly irrelevant to the high level task of
|
|
matching tokens, so maybe arranging the storage like
|
|
```rust
|
|
struct Tokens {
|
|
kinds: Vec<TokenKind>,
|
|
source_ranges: Vec<Range<usize>>,
|
|
}
|
|
```
|
|
could help
|
|
|
|
% id = "01HD6NRBEZ90Z3GJ8GBFGN0KFC"
|
|
- another thing that could help is changing the `usize` source ranges to
|
|
`u32`, but I don't love the idea because it'll make it even harder to
|
|
support large files - not that we necessarily _will_ ever support them,
|
|
but it's something to consider
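- a sketch combining both ideas (separate arrays plus `u32` ranges), to make the trade-off concrete. the `TokenKind` variants and the `push` method are hypothetical; the point is that scanning `kinds` touches one byte per token, while `Range<u32>` takes 8 bytes instead of the 16 that `Range<usize>` takes on a typical 64-bit target:
```rust
use std::convert::TryFrom;
use std::mem::size_of;
use std::ops::Range;

/// Hypothetical token kinds, forced to one byte each.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
enum TokenKind {
    Ident,
    Number,
    Symbol,
}

/// Struct-of-arrays token storage: kinds are packed tightly,
/// source ranges live on the side and are only read when needed.
#[derive(Default)]
struct Tokens {
    kinds: Vec<TokenKind>,
    source_ranges: Vec<Range<u32>>,
}

impl Tokens {
    /// Appends a token and returns its index, or `None` if the source
    /// offsets do not fit in `u32` (the "large files" concern).
    fn push(&mut self, kind: TokenKind, range: Range<usize>) -> Option<usize> {
        let start = u32::try_from(range.start).ok()?;
        let end = u32::try_from(range.end).ok()?;
        self.kinds.push(kind);
        self.source_ranges.push(start..end);
        Some(self.kinds.len() - 1)
    }
}

/// Bytes per token range for the two candidate representations.
fn range_sizes() -> (usize, usize) {
    (size_of::<Range<usize>>(), size_of::<Range<u32>>())
}
```
the `None` case in `push` is where files larger than 4 GiB would have to be rejected or handled some other way.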
|
|
|
|
% id = "01HA4KNTTGG3YX2GYFQ89M2V6Q"
|
|
+ ### insanium
|
|
|
|
% id = "01HA4KNTTG8M9B0M86Z1DRNBQY"
|
|
- section for incredibly wack ideas that will never work but fill my inner bit twiddle magician with joy
|
|
|
|
% id = "01HA4KNTTG08HQV1GYGKEBNY4E"
|
|
- also see [Stitchkit's insanium][branch:01HA4KNTTK4TCWTWQXDWPPEQE0]
|
|
|