treehouse/content/programming/projects/muscript.tree

% id = "01HA0GPJ8B0BZCTCDFZMDAW4W4"
- [repo][def:stitchkit/repo]

% id = "01HA0GPJ8BXETX6BWP8E9484DT"
- my UnrealScript compiler

% id = "01HA0GPJ8B6QW73GAH6KJQP0GD"
+ part of the Stitchkit project, which aims to build a set of Hat in Time modding tools that are a joy to use

    % id = "01HA0GPJ8B0YJBXVS61M57VRX0"
    - the name "MuScript" is actually a reference to Mustache Girl

% id = "01HA0GPJ8BKRNXP9KJQCAWW4MD"
+ ### architecture

    % id = "01HA0GPJ8BP1QVDE9Z2GHFV5DV"
    - MuScript uses a query-based architecture similar to rustc

        % id = "01HA0GPJ8BCX44E6N7BG350AET"
        - the classic pass-based compiler architecture has the compiler drive itself by first
          parsing all files, then analyzing them, then emitting bytecode - in passes

        % id = "01HA0GPJ8B4K7VTV8BC45FW4YW"
        - a query-based architecture works by driving the compiler by asking it questions

            % id = "01HA0GPJ8BYPDRG206ZYREYPBX"
            + the most interesting question to us all being "do you know the bytecode of this package?"

                % id = "01HA0GPJ8BZM9NGGGX765JM0CD"
                + which then triggers another question "do you know the classes in this package?"

                    % id = "01HA0GPJ8B9HZMCJ3DK0C2DT2Y"
                    + which then triggers another question "do you know the contents (variables, functions, structs, enums, states, ...) of those classes?"

                        % id = "01HA0GPJ8BFPFDFB37ZHVWHZKV"
                        + which then triggers another question "do you know all the variables in class `ComfortZone`?"

                            % id = "01HA0GPJ8BGWWAHAM5CZ722ZZH"
                            + which then triggers another question "do you know the ID of the type `Int`?"

                                % id = "01HA0GPJ8BDN6XYJBP792XWDWG"
                                - "yeah, it's 4."

                            % id = "01HA0GPJ8B95ZTVG14HVTBXSBF"
                            - "there's this variable `HuggingQuota` with ID 427."

                        % id = "01HA0GPJ8BKCAEVNC9CJB7JA9S"
                        + which then triggers another question "do you know all the functions in class `ComfortZone`?"

                            % id = "01HA0GPJ8BY6VT4PMGK2Q8J640"
                            + which then triggers another question "do you know the bytecode of function `Hug`?"

                                % id = "01HA0GPJ8BR6ZX06RSQJEVWXMQ"
                                + which then triggers another question "do you know the ID of the class `Person`?"

                                    % id = "01HA0GPJ8BYP7GQK2HV804YF9S"
                                    - "yes indeed, this class exists and has ID 42."

                                % id = "01HA0GPJ8BBS7TKTY2D203VMN1"
                                + which then triggers another question "do you know the bytecode of this function?"

                                    % id = "01HA0GPJ8BDJMHGD7S3A7F4Z4Q"
                                    - …you get the idea.

                                % id = "01HA0GPJ8B7GXF79PMKF89NZ3H"
                                - "alright, here's the bytecode for `Hug`."

                            % id = "01HA0GPJ8BTB9AF5QCQF555BX3"
                            - "that's all."

% id = "01HA0GPJ8BY2R40Y5GP515853E"
+ ### ideas to try out

    % id = "01HA0GPJ8BSMZ13V2S7DPZ508P"
    - I jot down various silly ideas for MuScript in the future here

    % id = "01HA0GPJ8B05TABAEC9JV73VRR"
    + parallelization with an event loop

        % id = "01HA0GPJ8B48K60BWQ2XZZ0PB5"
        + the thing with a pass-based architecture is that with enough locking, it _may_ be easy to parallelize.

            % id = "01HA0GPJ8B5QB6DF8YVTYC2HJY"
            + I can imagine parallelization existing on many levels here.

                % id = "01HA0GPJ8BMQ8JAH09YGC3Y7VW"
                - if you have a language like Zig where every line can be tokenized independently, you can spawn a task per line.

                % id = "01HA0GPJ8BKGJ15YYNS9C1QRYB"
                - you could analyze all the type declarations in parallel. though with dependencies it gets hairy.

                % id = "01HA0GPJ8B2984NNB7E8X0Z1X7"
                - you could analyze the contents of classes in parallel.

            % id = "01HA0GPJ8BFYNKVA6Z923E2370"
            - thing is, this stuff gets pretty hairy when you get to think about dependencies between all these different stages.

                % id = "01HA0GPJ8B915YCDR0MZSW5RN1"
                - hence why we haven't seen very many compilers that would adopt this architecture; most of them just sort of do their thing on one thread, and expect to parallelize by spawning more processes
                  of `cc` or `rustc` or what have you.

        % id = "01HA0GPJ8BHF1KRM8KGFMT1875"
        + where with a pass-based architecture the problem is dependencies between independent stages you're trying to parallelize, with a query-based architecture like MuScript's it gets even harder,
        because the entire compiler is effectively a dependency ~~hell~~ machine.

            % id = "01HA0GPJ8B20D3MKV6TMB19GW2"
            - one query depends on 10 subqueries, which all depend on their own subqueries, and so on and so forth.

                % id = "01HA0GPJ8BA4T1C36R8WFJ3G2F"
                - but there's _technically_ nothing holding us back from executing certain queries in parallel.

                    % id = "01HA0GPJ8B3FGTM417H46N0EXD"
                    - in fact, imagine if we executed _all_ queries in parallel.

        % id = "01HA0GPJ8BENERTESAFQJ7G3R9"
        + enter: the `async` event loop compiler

            % id = "01HA0GPJ8BF9E0MW1WS2EJY0J8"
            - there is a central event loop that distributes tasks to be done to multiple threads

            % id = "01HA0GPJ8B5WA232A319JTEFM2"
            - every query function is `async`

                % id = "01HA0GPJ8BQ15PT2M4BTA1ZNDX"
                - meaning it suspends execution of the current function, writes back "need to compute the type ID of `Person` because that's not yet available" into a concurrent set (very important that it's a _set_) - let's call this set the "TODO set"

            % id = "01HA0GPJ8B34GP8TD21AY8NXS1"
            - on the next iteration, the event loop spawns a task for each element of the TODO set, and the tasks compute all the questions asked

                % id = "01HA0GPJ8BKPHBPN44HM6WHKDJ"
                - because we're using a set, the computation is never duplicated; remember that if an answer has already been memoized, it does not spawn a task and instead returns the answer immediately

                    % id = "01HA0GPJ8B0V2VJMAV1YCQ19Q8"
                    - though this may be hard to do with Rust because, as far as I know, there is no way to suspend a function conditionally? *(needs research.)*

            % id = "01HA0GPJ8BBREEJCJRWPJJNR3N"
            - once there are no more tasks in the queue, we're done compiling

    % id = "01HAS9RREB6JCSQY986TE1SXVV"
    + parsing less

        % id = "01HAS9RREB1VS73PWNRDVXE9C1"
        - I measured the compiler's performance yesterday and most time is actually spent on parsing, out of all things

            % id = "01HAS9RREB8K52CGJNJDSHMHFP"
            - going along with the philosophy of [be lazy][branch:01HAS6RMNCZS9Y84N8WZ6594D1], we should probably be parsing less things then

        % id = "01HAS9RREBH11B8VFZNNJR6SZH"
        + I don't think we need to parse method bodies if we're not emitting IR

            % id = "01HAS9RREBXTPT2VKWJCD484C6"
            - basically, out of this code:
            ```unrealscript
            function Hug(Hat_Player OtherPlayer)
            {
                PlayAnimation('Hugging');
                // or something, idk UE3
            }
            ```
            parse only this:
            ```unrealscript
            function Hug(Hat_Player OtherPlayer) { /* token blob: PlayAnimation ( 'Hugging' ) ; */ }
            ```
            omitting the entire method body and treating it as an opaque blob of tokens until we need to emit IR

        % id = "01HAS9RREBHFF98VX1YDRC78NA"
        + I don't think we need to parse the entire class if we only care about its superclass

            % id = "01HAS9RREBG6V27P9C5CZJBD9K"
            - basically, out of this code:
            ```unrealscript
            class lqGoatBoy extends Hat_Player;

            defaultproperties
            {
                Model = SkeletalMesh'lqFluffyZone.Sk_GoatBoy';
                // etc
            }
            ```
            only parse the following:
            ```
            class lqGoatBoy extends Hat_Player;

            /* parser stops here, rest of text is ignored until needed */
            ```
            and then only parse the rest if any class items are requested

            % id = "01HASA3CG20D3EC87SCTWVR48A"
            - real case: getting the superclasses of `Hat_Player` takes a really long time because it's _big_
            (`Hat_Player.uc` itself is around 8000 lines of code, and it has many superclasses which are also pretty big)

% id = "01HD6NRBEZ8FMFMHW0TF62VBEC"
+ ### ideas I tried out

    % id = "01HAS9RREBVAXX28EX3TGWTCSW"
    + :done: lexing first

        % id = "01HAS9RREBM9VXFEPXKQ2R3EAZ"
        - something that MuScript did not use to do is have a separate tokenization stage

            % id = "01HAS9RREBE94GKXXM70TZ6RMJ"
            + this is because UnrealScript has some fairly idiosyncratic syntax which requires us to treat _some_ things in braces `{}` as strings, such as `cpptext`

                % id = "01HAS9RREBQY6AWTXMD6DNS9DF"
                - ```unrealscript
                cpptext
                {
                    // this has to be parsed as C++ code, which has some differing lexing rules
                    template <typename T>
                    void Hug(T& Whomst)
                    {
                        DoHug(T::StaticClass(), &Whomst);
                    }
                }
                ```

            % id = "01HAS9RREB4ZC9MN8YQWWNN7D2"
            - but C++ is similar enough to UnrealScript that we are able to get away with lexing it using the main UnrealScript lexer

            % id = "01HAS9RREBN6FS43W0YKC1BXJE"
            - we even lex variable metadata `var int Something <ToolTip=bah>;` using the lexer, storing invalid characters and errors as some `InvalidCharacter` token kind or something

                % id = "01HAS9RREBAXYQWNA068KKNG07"
                + and that's without emitting diagnostics - let the parser handle those instead

                    % id = "01HAS9RREBWZKAZGFKH3BXE409"
                    - one place where the current approach of the lexer eagerly emitting diagnostics fails is the case of `<ToolTip=3D location>`, where `3D` is parsed as a number literal with an invalid suffix and thus errors out

        % id = "01HD6NRBEZ2TCHKY1C4JK2EK0N"
        - implementing this taught me one important lesson: context switching is expensive

            % id = "01HD6NRBEZCKP5ZYZ3XQ9PVJTD"
            - having the lexer as a separate pass made the parsing 2x faster, speeding up the
            compiler pretty much two-fold (because that's where the compiler was spending most of its time)

                % id = "01HD6NRBEZP6V4J1MS84C6KN1P"
                - my suspicion as to why this was slow is that the code for parsing, preprocessing,
                and reading tokens was scattered across memory - also with lots of branches that
                needed to be checked for each token requested by the parser

            % id = "01HD6NRBEZDM4QSN38TZJCXRAA"
            + I think also having token data in one contiguous block of memory also helped, though
            isn't as efficient as it could be _yet_.

                % id = "01HD6NRBEZWSA9HFNPKQPRHQK1"
                - the current data structure as of writing this is
                ```rust
                struct Token {
                    kind: TokenKind,
                    source_range: Range<usize>,
                }

                struct TokenArena {
                    tokens: Vec<Token>,
                }
                ```
                (with some irrelevant things omitted - things like source files are not relevant
                for token streams themselves)

                    % id = "01HD6NRBEZXCE5TQSMQHQ29D90"
                    - I don't know if I'll ever optimize this to be even more efficient than it
                    already is, but source ranges are mostly irrelevant to the high level task of
                    matching tokens, so maybe arranging the storage like
                    ```rust
                    struct Tokens {
                        kinds: Vec<TokenKind>,
                        source_ranges: Vec<Range<usize>>,
                    }
                    ```
                    could help

                        % id = "01HD6NRBEZ90Z3GJ8GBFGN0KFC"
                        - another thing that could help is changing the `usize` source ranges to
                        `u32`, but I don't love the idea because it'll make it even harder to
                        support large files - not that we necessarily _will_ ever support them,
                        but it's something to consider

% id = "01HA4KNTTGG3YX2GYFQ89M2V6Q"
+ ### insanium

    % id = "01HA4KNTTG8M9B0M86Z1DRNBQY"
    - section for incredibly wack ideas that will never work but fill my inner bit twiddle magician with joy

    % id = "01HA4KNTTG08HQV1GYGKEBNY4E"
    - also see [Stitchkit's insanium][branch:01HA4KNTTK4TCWTWQXDWPPEQE0]
muscript async event loop idea 2023-09-10 23:41:55 +02:00			`% id = "01HA0GPJ8B0BZCTCDFZMDAW4W4"`
			`- [repo][def:stitchkit/repo]`

			`% id = "01HA0GPJ8BXETX6BWP8E9484DT"`
			`- my UnrealScript compiler`

			`% id = "01HA0GPJ8B6QW73GAH6KJQP0GD"`
			`+ part of the Stitchkit project, which aims to build a set of Hat in Time modding tools that are a joy to use`

			`% id = "01HA0GPJ8B0YJBXVS61M57VRX0"`
			`- the name "MuScript" is actually a reference to Mustache Girl`

			`% id = "01HA0GPJ8BKRNXP9KJQCAWW4MD"`
			`+ ### architecture`

			`% id = "01HA0GPJ8BP1QVDE9Z2GHFV5DV"`
			`- MuScript uses a query-based architecture similar to rustc`

			`% id = "01HA0GPJ8BCX44E6N7BG350AET"`
			`- the classic pass-based compiler architecture has the compiler drive itself by first`
			`parsing all files, then analyzing them, then emitting bytecode - in passes`

			`% id = "01HA0GPJ8B4K7VTV8BC45FW4YW"`
			`- a query-based architecture works by driving the compiler by asking it questions`

			`% id = "01HA0GPJ8BYPDRG206ZYREYPBX"`
			`+ the most interesting question to us all being "do you know the bytecode of this package?"`

			`% id = "01HA0GPJ8BZM9NGGGX765JM0CD"`
			`+ which then triggers another question "do you know the classes in this package?"`

			`% id = "01HA0GPJ8B9HZMCJ3DK0C2DT2Y"`
			`+ which then triggers another question "do you know the contents (variables, functions, structs, enums, states, ...) of those classes?"`

			`% id = "01HA0GPJ8BFPFDFB37ZHVWHZKV"`
			+ which then triggers another question "do you know all the variables in class `ComfortZone`?"

			`% id = "01HA0GPJ8BGWWAHAM5CZ722ZZH"`
			+ which then triggers another question "do you know the ID of the type `Int`?"

			`% id = "01HA0GPJ8BDN6XYJBP792XWDWG"`
			`- "yeah, it's 4."`

			`% id = "01HA0GPJ8B95ZTVG14HVTBXSBF"`
			- "there's this variable `HuggingQuota` with ID 427."

			`% id = "01HA0GPJ8BKCAEVNC9CJB7JA9S"`
			+ which then triggers another question "do you know all the functions in class `ComfortZone`?"

			`% id = "01HA0GPJ8BY6VT4PMGK2Q8J640"`
			+ which then triggers another question "do you know the bytecode of function `Hug`?"

			`% id = "01HA0GPJ8BR6ZX06RSQJEVWXMQ"`
			+ which then triggers another question "do you know the ID of the class `Person`?"

			`% id = "01HA0GPJ8BYP7GQK2HV804YF9S"`
			`- "yes indeed, this class exists and has ID 42."`

			`% id = "01HA0GPJ8BBS7TKTY2D203VMN1"`
			`+ which then triggers another question "do you know the bytecode of this function?"`

			`% id = "01HA0GPJ8BDJMHGD7S3A7F4Z4Q"`
			`- …you get the idea.`

			`% id = "01HA0GPJ8B7GXF79PMKF89NZ3H"`
			- "alright, here's the bytecode for `Hug`."

			`% id = "01HA0GPJ8BTB9AF5QCQF555BX3"`
			`- "that's all."`

			`% id = "01HA0GPJ8BY2R40Y5GP515853E"`
some notes on muscript lexer refactor + done emoji 2023-10-20 15:52:37 +02:00			`+ ### ideas to try out`
muscript async event loop idea 2023-09-10 23:41:55 +02:00
			`% id = "01HA0GPJ8BSMZ13V2S7DPZ508P"`
			`- I jot down various silly ideas for MuScript in the future here`

			`% id = "01HA0GPJ8B05TABAEC9JV73VRR"`
			`+ parallelization with an event loop`

			`% id = "01HA0GPJ8B48K60BWQ2XZZ0PB5"`
update issues 2024-07-22 20:34:42 +02:00			`+ the thing with a pass-based architecture is that with enough locking, it _may_ be easy to parallelize.`
muscript async event loop idea 2023-09-10 23:41:55 +02:00
			`% id = "01HA0GPJ8B5QB6DF8YVTYC2HJY"`
			`+ I can imagine parallelization existing on many levels here.`

			`% id = "01HA0GPJ8BMQ8JAH09YGC3Y7VW"`
			`- if you have a language like Zig where every line can be tokenized independently, you can spawn a task per line.`

			`% id = "01HA0GPJ8BKGJ15YYNS9C1QRYB"`
			`- you could analyze all the type declarations in parallel. though with dependencies it gets hairy.`

			`% id = "01HA0GPJ8B2984NNB7E8X0Z1X7"`
			`- you could analyze the contents of classes in parallel.`

			`% id = "01HA0GPJ8BFYNKVA6Z923E2370"`
			`- thing is, this stuff gets pretty hairy when you get to think about dependencies between all these different stages.`

			`% id = "01HA0GPJ8B915YCDR0MZSW5RN1"`
			`- hence why we haven't seen very many compilers that would adopt this architecture; most of them just sort of do their thing on one thread, and expect to parallelize by spawning more processes`
			of `cc` or `rustc` or what have you.

			`% id = "01HA0GPJ8BHF1KRM8KGFMT1875"`
			`+ where with a pass-based architecture the problem is dependencies between independent stages you're trying to parallelize, with a query-based architecture like MuScript's it gets even harder,`
			`because the entire compiler is effectively a dependency ~~hell~~ machine.`

			`% id = "01HA0GPJ8B20D3MKV6TMB19GW2"`
			`- one query depends on 10 subqueries, which all depend on their own subqueries, and so on and so forth.`

			`% id = "01HA0GPJ8BA4T1C36R8WFJ3G2F"`
			`- but there's _technically_ nothing holding us back from executing certain queries in parallel.`

			`% id = "01HA0GPJ8B3FGTM417H46N0EXD"`
			`- in fact, imagine if we executed _all_ queries in parallel.`

			`% id = "01HA0GPJ8BENERTESAFQJ7G3R9"`
			+ enter: the `async` event loop compiler

			`% id = "01HA0GPJ8BF9E0MW1WS2EJY0J8"`
			`- there is a central event loop that distributes tasks to be done to multiple threads`

			`% id = "01HA0GPJ8B5WA232A319JTEFM2"`
			- every query function is `async`

			`% id = "01HA0GPJ8BQ15PT2M4BTA1ZNDX"`
			- meaning it suspends execution of the current function, writes back "need to compute the type ID of `Person` because that's not yet available" into a concurrent set (very important that it's a _set_) - let's call this set the "TODO set"

			`% id = "01HA0GPJ8B34GP8TD21AY8NXS1"`
			`- on the next iteration, the event loop spawns a task for each element of the TODO set, and the tasks compute all the questions asked`

			`% id = "01HA0GPJ8BKPHBPN44HM6WHKDJ"`
			`- because we're using a set, the computation is never duplicated; remember that if an answer has already been memoized, it does not spawn a task and instead returns the answer immediately`

			`% id = "01HA0GPJ8B0V2VJMAV1YCQ19Q8"`
update issues 2024-07-22 20:34:42 +02:00			`- though this may be hard to do with Rust because, as far as I know, there is no way to suspend a function conditionally? (needs research.)`
muscript async event loop idea 2023-09-10 23:41:55 +02:00
			`% id = "01HA0GPJ8BBREEJCJRWPJJNR3N"`
			`- once there are no more tasks in the queue, we're done compiling`
just a little insanium for y'all 2023-09-12 14:31:36 +02:00
muscript/stitchkit notes 2023-09-20 14:47:50 +02:00			`% id = "01HAS9RREB6JCSQY986TE1SXVV"`
			`+ parsing less`

			`% id = "01HAS9RREB1VS73PWNRDVXE9C1"`
			`- I measured the compiler's performance yesterday and most time is actually spent on parsing, out of all things`

			`% id = "01HAS9RREB8K52CGJNJDSHMHFP"`
			`- going along with the philosophy of [be lazy][branch:01HAS6RMNCZS9Y84N8WZ6594D1], we should probably be parsing less things then`

			`% id = "01HAS9RREBH11B8VFZNNJR6SZH"`
			`+ I don't think we need to parse method bodies if we're not emitting IR`

			`% id = "01HAS9RREBXTPT2VKWJCD484C6"`
			`- basically, out of this code:`
			```unrealscript
			`function Hug(Hat_Player OtherPlayer)`
			`{`
			`PlayAnimation('Hugging');`
			`// or something, idk UE3`
			`}`
			```
			`parse only this:`
			```unrealscript
			`function Hug(Hat_Player OtherPlayer) { /* token blob: PlayAnimation ( 'Hugging' ) ; */ }`
			```
			`omitting the entire method body and treating it as an opaque blob of tokens until we need to emit IR`

			`% id = "01HAS9RREBHFF98VX1YDRC78NA"`
			`+ I don't think we need to parse the entire class if we only care about its superclass`

			`% id = "01HAS9RREBG6V27P9C5CZJBD9K"`
			`- basically, out of this code:`
			```unrealscript
			`class lqGoatBoy extends Hat_Player;`

			`defaultproperties`
			`{`
			`Model = SkeletalMesh'lqFluffyZone.Sk_GoatBoy';`
			`// etc`
			`}`
			```
			`only parse the following:`
			```
			`class lqGoatBoy extends Hat_Player;`

			`/* parser stops here, rest of text is ignored until needed */`
			```
			`and then only parse the rest if any class items are requested`

			`% id = "01HASA3CG20D3EC87SCTWVR48A"`
			- real case: getting the superclasses of `Hat_Player` takes a really long time because it's _big_
			(`Hat_Player.uc` itself is around 8000 lines of code, and it has many superclasses which are also pretty big)

fix tree 2023-10-20 15:53:04 +02:00			`% id = "01HD6NRBEZ8FMFMHW0TF62VBEC"`
some notes on muscript lexer refactor + done emoji 2023-10-20 15:52:37 +02:00			`+ ### ideas I tried out`

muscript/stitchkit notes 2023-09-20 14:47:50 +02:00			`% id = "01HAS9RREBVAXX28EX3TGWTCSW"`
some notes on muscript lexer refactor + done emoji 2023-10-20 15:52:37 +02:00			`+ :done: lexing first`
muscript/stitchkit notes 2023-09-20 14:47:50 +02:00
			`% id = "01HAS9RREBM9VXFEPXKQ2R3EAZ"`
some notes on muscript lexer refactor + done emoji 2023-10-20 15:52:37 +02:00			`- something that MuScript did not use to do is have a separate tokenization stage`
muscript/stitchkit notes 2023-09-20 14:47:50 +02:00
			`% id = "01HAS9RREBE94GKXXM70TZ6RMJ"`
			+ this is because UnrealScript has some fairly idiosyncratic syntax which requires us to treat _some_ things in braces `{}` as strings, such as `cpptext`

			`% id = "01HAS9RREBQY6AWTXMD6DNS9DF"`
			- ```unrealscript
			`cpptext`
			`{`
			`// this has to be parsed as C++ code, which has some differing lexing rules`
			`template <typename T>`
			`void Hug(T& Whomst)`
			`{`
			`DoHug(T::StaticClass(), &Whomst);`
			`}`
			`}`
			```

			`% id = "01HAS9RREB4ZC9MN8YQWWNN7D2"`
some notes on muscript lexer refactor + done emoji 2023-10-20 15:52:37 +02:00			`- but C++ is similar enough to UnrealScript that we are able to get away with lexing it using the main UnrealScript lexer`
muscript/stitchkit notes 2023-09-20 14:47:50 +02:00
			`% id = "01HAS9RREBN6FS43W0YKC1BXJE"`
some notes on muscript lexer refactor + done emoji 2023-10-20 15:52:37 +02:00			- we even lex variable metadata `var int Something <ToolTip=bah>;` using the lexer, storing invalid characters and errors as some `InvalidCharacter` token kind or something
muscript/stitchkit notes 2023-09-20 14:47:50 +02:00
			`% id = "01HAS9RREBAXYQWNA068KKNG07"`
			`+ and that's without emitting diagnostics - let the parser handle those instead`

			`% id = "01HAS9RREBWZKAZGFKH3BXE409"`
			- one place where the current approach of the lexer eagerly emitting diagnostics fails is the case of `<ToolTip=3D location>`, where `3D` is parsed as a number literal with an invalid suffix and thus errors out

fix tree 2023-10-20 15:53:04 +02:00			`% id = "01HD6NRBEZ2TCHKY1C4JK2EK0N"`
some notes on muscript lexer refactor + done emoji 2023-10-20 15:52:37 +02:00			`- implementing this taught me one important lesson: context switching is expensive`

fix tree 2023-10-20 15:53:04 +02:00			`% id = "01HD6NRBEZCKP5ZYZ3XQ9PVJTD"`
some notes on muscript lexer refactor + done emoji 2023-10-20 15:52:37 +02:00			`- having the lexer as a separate pass made the parsing 2x faster, speeding up the`
			`compiler pretty much two-fold (because that's where the compiler was spending most of its time)`

fix tree 2023-10-20 15:53:04 +02:00			`% id = "01HD6NRBEZP6V4J1MS84C6KN1P"`
some notes on muscript lexer refactor + done emoji 2023-10-20 15:52:37 +02:00			`- my suspicion as to why this was slow is that the code for parsing, preprocessing,`
			`and reading tokens was scattered across memory - also with lots of branches that`
			`needed to be checked for each token requested by the parser`

fix tree 2023-10-20 15:53:04 +02:00			`% id = "01HD6NRBEZDM4QSN38TZJCXRAA"`
some notes on muscript lexer refactor + done emoji 2023-10-20 15:52:37 +02:00			`+ I think also having token data in one contiguous block of memory also helped, though`
			`isn't as efficient as it could be _yet_.`

fix tree 2023-10-20 15:53:04 +02:00			`% id = "01HD6NRBEZWSA9HFNPKQPRHQK1"`
some notes on muscript lexer refactor + done emoji 2023-10-20 15:52:37 +02:00			`- the current data structure as of writing this is`
			```rust
			`struct Token {`
			`kind: TokenKind,`
			`source_range: Range<usize>,`
			`}`

			`struct TokenArena {`
			`tokens: Vec<Token>,`
			`}`
			```
			`(with some irrelevant things omitted - things like source files are not relevant`
			`for token streams themselves)`

fix tree 2023-10-20 15:53:04 +02:00			`% id = "01HD6NRBEZXCE5TQSMQHQ29D90"`
some notes on muscript lexer refactor + done emoji 2023-10-20 15:52:37 +02:00			`- I don't know if I'll ever optimize this to be even more efficient than it`
			`already is, but source ranges are mostly irrelevant to the high level task of`
			`matching tokens, so maybe arranging the storage like`
Lua highlighting 2024-03-25 22:07:52 +01:00			```rust
some notes on muscript lexer refactor + done emoji 2023-10-20 15:52:37 +02:00			`struct Tokens {`
			`kinds: Vec<TokenKind>,`
			`source_ranges: Vec<Range<usize>>,`
			`}`
			```
			`could help`

fix tree 2023-10-20 15:53:04 +02:00			`% id = "01HD6NRBEZ90Z3GJ8GBFGN0KFC"`
some notes on muscript lexer refactor + done emoji 2023-10-20 15:52:37 +02:00			- another thing that could help is changing the `usize` source ranges to
			`u32`, but I don't love the idea because it'll make it even harder to
			`support large files - not that we necessarily _will_ ever support them,`
			`but it's something to consider`

just a little insanium for y'all 2023-09-12 14:31:36 +02:00			`% id = "01HA4KNTTGG3YX2GYFQ89M2V6Q"`
			`+ ### insanium`

			`% id = "01HA4KNTTG8M9B0M86Z1DRNBQY"`
			`- section for incredibly wack ideas that will never work but fill my inner bit twiddle magician with joy`

			`% id = "01HA4KNTTG08HQV1GYGKEBNY4E"`
			`- also see [Stitchkit's insanium][branch:01HA4KNTTK4TCWTWQXDWPPEQE0]`