% id = "01HA0GPJ8B0BZCTCDFZMDAW4W4" - [repo][def:stitchkit/repo] % id = "01HA0GPJ8BXETX6BWP8E9484DT" - my UnrealScript compiler % id = "01HA0GPJ8B6QW73GAH6KJQP0GD" + part of the Stitchkit project, which aims to build a set of Hat in Time modding tools that are a joy to use % id = "01HA0GPJ8B0YJBXVS61M57VRX0" - the name "MuScript" is actually a reference to Mustache Girl % id = "01HA0GPJ8BKRNXP9KJQCAWW4MD" + ### architecture % id = "01HA0GPJ8BP1QVDE9Z2GHFV5DV" - MuScript uses a query-based architecture similar to rustc % id = "01HA0GPJ8BCX44E6N7BG350AET" - the classic pass-based compiler architecture has the compiler drive itself by first parsing all files, then analyzing them, then emitting bytecode - in passes % id = "01HA0GPJ8B4K7VTV8BC45FW4YW" - a query-based architecture works by driving the compiler by asking it questions % id = "01HA0GPJ8BYPDRG206ZYREYPBX" + the most interesting question to us all being "do you know the bytecode of this package?" % id = "01HA0GPJ8BZM9NGGGX765JM0CD" + which then triggers another question "do you know the classes in this package?" % id = "01HA0GPJ8B9HZMCJ3DK0C2DT2Y" + which then triggers another question "do you know the contents (variables, functions, structs, enums, states, ...) of those classes?" % id = "01HA0GPJ8BFPFDFB37ZHVWHZKV" + which then triggers another question "do you know all the variables in class `ComfortZone`?" % id = "01HA0GPJ8BGWWAHAM5CZ722ZZH" + which then triggers another question "do you know the ID of the type `Int`?" % id = "01HA0GPJ8BDN6XYJBP792XWDWG" - "yeah, it's 4." % id = "01HA0GPJ8B95ZTVG14HVTBXSBF" - "there's this variable `HuggingQuota` with ID 427." % id = "01HA0GPJ8BKCAEVNC9CJB7JA9S" + which then triggers another question "do you know all the functions in class `ComfortZone`?" % id = "01HA0GPJ8BY6VT4PMGK2Q8J640" + which then triggers another question "do you know the bytecode of function `Hug`?" % id = "01HA0GPJ8BR6ZX06RSQJEVWXMQ" + which then triggers another question "do you know the ID of the class `Person`?" % id = "01HA0GPJ8BYP7GQK2HV804YF9S" - "yes indeed, this class exists and has ID 42." % id = "01HA0GPJ8BBS7TKTY2D203VMN1" + which then triggers another question "do you know the bytecode of this function?" % id = "01HA0GPJ8BDJMHGD7S3A7F4Z4Q" - …you get the idea. % id = "01HA0GPJ8B7GXF79PMKF89NZ3H" - "alright, here's the bytecode for `Hug`." % id = "01HA0GPJ8BTB9AF5QCQF555BX3" - "that's all." % id = "01HA0GPJ8BY2R40Y5GP515853E" + ### ideas % id = "01HA0GPJ8BSMZ13V2S7DPZ508P" - I jot down various silly ideas for MuScript in the future here % id = "01HA0GPJ8B05TABAEC9JV73VRR" + parallelization with an event loop % id = "01HA0GPJ8B48K60BWQ2XZZ0PB5" + the thing with a pass-based architecture is that with enough locking, it *may* be easy to parallelize. % id = "01HA0GPJ8B5QB6DF8YVTYC2HJY" + I can imagine parallelization existing on many levels here. % id = "01HA0GPJ8BMQ8JAH09YGC3Y7VW" - if you have a language like Zig where every line can be tokenized independently, you can spawn a task per line. % id = "01HA0GPJ8BKGJ15YYNS9C1QRYB" - you could analyze all the type declarations in parallel. though with dependencies it gets hairy. % id = "01HA0GPJ8B2984NNB7E8X0Z1X7" - you could analyze the contents of classes in parallel. % id = "01HA0GPJ8BFYNKVA6Z923E2370" - thing is, this stuff gets pretty hairy when you get to think about dependencies between all these different stages. % id = "01HA0GPJ8B915YCDR0MZSW5RN1" - hence why we haven't seen very many compilers that would adopt this architecture; most of them just sort of do their thing on one thread, and expect to parallelize by spawning more processes of `cc` or `rustc` or what have you. % id = "01HA0GPJ8BHF1KRM8KGFMT1875" + where with a pass-based architecture the problem is dependencies between independent stages you're trying to parallelize, with a query-based architecture like MuScript's it gets even harder, because the entire compiler is effectively a dependency ~~hell~~ machine. % id = "01HA0GPJ8B20D3MKV6TMB19GW2" - one query depends on 10 subqueries, which all depend on their own subqueries, and so on and so forth. % id = "01HA0GPJ8BA4T1C36R8WFJ3G2F" - but there's _technically_ nothing holding us back from executing certain queries in parallel. % id = "01HA0GPJ8B3FGTM417H46N0EXD" - in fact, imagine if we executed _all_ queries in parallel. % id = "01HA0GPJ8BENERTESAFQJ7G3R9" + enter: the `async` event loop compiler % id = "01HA0GPJ8BF9E0MW1WS2EJY0J8" - there is a central event loop that distributes tasks to be done to multiple threads % id = "01HA0GPJ8B5WA232A319JTEFM2" - every query function is `async` % id = "01HA0GPJ8BQ15PT2M4BTA1ZNDX" - meaning it suspends execution of the current function, writes back "need to compute the type ID of `Person` because that's not yet available" into a concurrent set (very important that it's a _set_) - let's call this set the "TODO set" % id = "01HA0GPJ8B34GP8TD21AY8NXS1" - on the next iteration, the event loop spawns a task for each element of the TODO set, and the tasks compute all the questions asked % id = "01HA0GPJ8BKPHBPN44HM6WHKDJ" - because we're using a set, the computation is never duplicated; remember that if an answer has already been memoized, it does not spawn a task and instead returns the answer immediately % id = "01HA0GPJ8B0V2VJMAV1YCQ19Q8" - though this may be hard to do with Rust because, as far as I know, there is no way to suspend a function conditionally? **(needs research.)** % id = "01HA0GPJ8BBREEJCJRWPJJNR3N" - once there are no more tasks in the queue, we're done compiling % id = "01HAS9RREB6JCSQY986TE1SXVV" + parsing less % id = "01HAS9RREB1VS73PWNRDVXE9C1" - I measured the compiler's performance yesterday and most time is actually spent on parsing, out of all things % id = "01HAS9RREB8K52CGJNJDSHMHFP" - going along with the philosophy of [be lazy][branch:01HAS6RMNCZS9Y84N8WZ6594D1], we should probably be parsing less things then % id = "01HAS9RREBH11B8VFZNNJR6SZH" + I don't think we need to parse method bodies if we're not emitting IR % id = "01HAS9RREBXTPT2VKWJCD484C6" - basically, out of this code: ```unrealscript function Hug(Hat_Player OtherPlayer) { PlayAnimation('Hugging'); // or something, idk UE3 } ``` parse only this: ```unrealscript function Hug(Hat_Player OtherPlayer) { /* token blob: PlayAnimation ( 'Hugging' ) ; */ } ``` omitting the entire method body and treating it as an opaque blob of tokens until we need to emit IR % id = "01HAS9RREBHFF98VX1YDRC78NA" + I don't think we need to parse the entire class if we only care about its superclass % id = "01HAS9RREBG6V27P9C5CZJBD9K" - basically, out of this code: ```unrealscript class lqGoatBoy extends Hat_Player; defaultproperties { Model = SkeletalMesh'lqFluffyZone.Sk_GoatBoy'; // etc } ``` only parse the following: ``` class lqGoatBoy extends Hat_Player; /* parser stops here, rest of text is ignored until needed */ ``` and then only parse the rest if any class items are requested % id = "01HASA3CG20D3EC87SCTWVR48A" - real case: getting the superclasses of `Hat_Player` takes a really long time because it's _big_ (`Hat_Player.uc` itself is around 8000 lines of code, and it has many superclasses which are also pretty big) % id = "01HAS9RREBVAXX28EX3TGWTCSW" + lexing first % id = "01HAS9RREBM9VXFEPXKQ2R3EAZ" - something that MuScript does not do currently is a separate tokenization stage % id = "01HAS9RREBE94GKXXM70TZ6RMJ" + this is because UnrealScript has some fairly idiosyncratic syntax which requires us to treat _some_ things in braces `{}` as strings, such as `cpptext` % id = "01HAS9RREBQY6AWTXMD6DNS9DF" - ```unrealscript cpptext { // this has to be parsed as C++ code, which has some differing lexing rules template void Hug(T& Whomst) { DoHug(T::StaticClass(), &Whomst); } } ``` % id = "01HAS9RREB4ZC9MN8YQWWNN7D2" - but C++ is similar enough to UnrealScript that we may be able to get away with lexing it using the main UnrealScript lexer % id = "01HAS9RREBN6FS43W0YKC1BXJE" - we could even lex variable metadata `var int Something ;` using the lexer, storing invalid characters and errors as some `InvalidCharacter` token kind or something % id = "01HAS9RREBAXYQWNA068KKNG07" + and that's without emitting diagnostics - let the parser handle those instead % id = "01HAS9RREBWZKAZGFKH3BXE409" - one place where the current approach of the lexer eagerly emitting diagnostics fails is the case of ``, where `3D` is parsed as a number literal with an invalid suffix and thus errors out % id = "01HA4KNTTGG3YX2GYFQ89M2V6Q" + ### insanium % id = "01HA4KNTTG8M9B0M86Z1DRNBQY" - section for incredibly wack ideas that will never work but fill my inner bit twiddle magician with joy % id = "01HA4KNTTG08HQV1GYGKEBNY4E" - also see [Stitchkit's insanium][branch:01HA4KNTTK4TCWTWQXDWPPEQE0]