2024-07-24 18:20:47 +02:00
%% title = "haku - writing a little programming language for fun"
scripts = ["treehouse/vendor/codejar.js", "treehouse/components/literate-programming.js"]
% id = "01J3K8A0D1774SFDPKDK5G9GPV"
- I've had this idea on my mind as of late, of a little pure functional programming language that would run in your browser.
% id = "01J3K8A0D1WTM2KHERFZG2FWBJ"
+ the primary use case would be writing fun audiovisual sketches you can inspect and edit live, because after all everything is declarative.
this was motivated by my discovery of [glisp][], which was recently on the front page of [Lobsters][glisp lobsters].
[glisp]: https://glisp.app
[glisp lobsters]: https://lobste.rs/s/amanh7/glisp_graphical_lisp
% id = "01J3K8A0D16PAM5AV11E8JF3AF"
- [I even commented about it!](https://lobste.rs/s/amanh7/glisp_graphical_lisp#c_oqa6ap)
% id = "01J3K8A0D1N4EGRKPFTP0FNZSW"
- so let's get going!
% id = "01J3K8A0D1ZXQ9NJ8CVGBQ7FZB"
- ### parsing
% id = "01J3K8A0D11KMK6MWCRT5KQV09"
- I don't know about you, but I like writing parsers.
however, since I'm trying to keep this language absolutely _tiny_, I think S-expressions might be the best fit for this purpose.
% id = "01J3K8A0D1PT058QRSXS142Y5T"
- honestly I don't even like S-expressions that much.
I find them extremely hard to read, but I dunno - maybe my mind will change after having written a language using them.
we can always swap the syntax out for something else later.
% id = "01J3K8A0D1198QXV2GFWF7JCV0"
- let me show you an example of how I'd like haku to look.
I find that is the best way of breaking down syntax into smaller parts.
```haku
; Recursive fibonacci
(def fib
  (fn (n)
    (if (< n 2)
      n
      (+ (fib (- n 1)) (fib (- n 2))))))
(print (fib 10))
```
% id = "01J3K8A0D1KNHJ10WCVX8C88WP"
- we have a handful of lexical elements: parentheses, identifiers, and numbers.
there are also comments and whitespace, of course.
those will get skipped over by the lexer, because we're not building a production-grade language that would need them.
% id = "01J3K8A0D14Z8W5K6KDEJQ6DZJ"
- syntactically, we only really have two types of productions.
there are literals, and there are lists.
% id = "01J3K8A0D1N8SP9J8EMBNEVG9C"
- when I say _literals_, I'm referring to both identifiers and integers.
we will of course differentiate between them in the syntax, because they mean different things.
% id = "01J3K8A0D14A94S2RNFV97DX18"
- we will start by writing the lexical analysis part of our parser, to join single characters up to slightly more manageable pieces.
{:program=haku}
```javascript
export const lexer = {};
```
% id = "01J3K8A0D1YZMHNSRZMSBNQVD4"
- the entire idea of a lexer is that you read the input string left to right, top to bottom, and piece together larger _tokens_ out of that.
% id = "01J3K8A0D1C9YBXWK257GFMR68"
- for instance, for the input string
```haku
(example s-expression)
```
we will produce the tokens
| type | start | end | text |
| --- | --: | --: | --- |
| ( | 0 | 1 | `(` |
| identifier | 1 | 8 | `example` |
| identifier | 9 | 21 | `s-expression` |
| ) | 21 | 22 | `)` |
| end of file | 22 | 22 | |
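note that the `text` column doesn't have to be stored in the token itself - it can always be recovered from `start` and `end`.
here's a minimal standalone sketch of that, with the tokens from the table written out by hand; `tokenText` is a hypothetical helper, not something the lexer defines.
```javascript
// Hypothetical helper (not part of haku): recover a token's text from its span.
const tokenText = (input, token) => input.substring(token.start, token.end);

const input = "(example s-expression)";
// Hand-written tokens matching the table above.
const tokens = [
    { kind: "(", start: 0, end: 1 },
    { kind: "identifier", start: 1, end: 8 },
    { kind: "identifier", start: 9, end: 21 },
    { kind: ")", start: 21, end: 22 },
    { kind: "end of file", start: 22, end: 22 },
];

const texts = tokens.map((token) => tokenText(input, token));
console.log(texts); // the end of file token yields an empty string
```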
% id = "01J3K8A0D1GGQ292D4MQBCGHWC"
- to lex the input into tokens, we'll need to know the input string (of course), and where we currently are in the string.
{:program=haku}
```javascript
lexer.init = (input) => {
    return {
        input,
        position: 0,
    };
};
```
% id = "01J3K8A0D139JN9J5TTA2WAP4R"
- we'll also define a few helper functions to make reading text a little easier, without having to perform any bounds checks whenever we read tokens.
{:program=haku}
```javascript
export const eof = "end of file";
lexer.current = (state) => {
    return state.position < state.input.length
        ? state.input.charAt(state.position)
        : eof;
};
lexer.advance = (state) => ++state.position;
```
% id = "01J3K8A0D1GPMDD8S063K6ETM3"
- our lexer will run in a loop, producing tokens until it hits the end of input or an error.
{:program=haku}
```javascript
export function lex(input) {
    let tokens = [];
    let state = lexer.init(input);
    while (true) {
        let start = state.position;
        let kind = lexer.nextToken(state);
        let end = state.position;
        tokens.push({ kind, start, end });
        if (kind == eof || kind == "error") break;
    }
    return tokens;
}
```
% id = "01J3K8A0D10GZMN36TDZWYH632"
- remember that error handling is important!
we mustn't forget that the user can produce invalid input - such as this string:
```haku
{example}
```
haku does not have curly braces in its syntax, so that's clearly an error!
reporting this to the user will be a much better experience than, perhaps... getting stuck in an infinite loop. :oh:
% id = "01J3K8A0D117B6AQ8YKMCX4KAK"
- now for the most important part - that `lexer.nextToken` we used will be responsible for reading back a token from the input, and returning what kind of token it has read.
for now, let's make it detect parentheses.
we of course also need to handle end of input - whenever our lexer runs out of characters to consume, as well as when it encounters any characters we don't expect.
{:program=haku}
```javascript
lexer.nextToken = (state) => {
    let c = lexer.current(state);
    if (c == "(" || c == ")") {
        lexer.advance(state);
        return c;
    }
    if (c == eof) return eof;
    lexer.advance(state);
    return "error";
};
```
% id = "01J3K8A0D1C5C5P32WQFW1PD0R"
- with all that frameworking in place, let's test if our lexer works!
{:program=haku}
```javascript
export function printTokens(input) {
    let tokens = lex(input);
    for (let { kind, start, end } of tokens) {
        if (kind == "error") {
            let errorString = input.substring(start, end);
            console.log(`unexpected characters at ${start}..${end}: '${errorString}'`);
        } else {
            console.log(`${kind} @ ${start}..${end}`);
        }
    }
}
printTokens(`()((()))`);
```
{:program=haku}
```output
( @ 0..1
) @ 1..2
( @ 2..3
( @ 3..4
( @ 4..5
) @ 5..6
) @ 6..7
) @ 7..8
end of file @ 8..8
```
...seems pretty perfect!
% id = "01J3K8A0D1AV280QZ0Y10CPN62"
- except, of course, we're not handling whitespace or comments.
{:program=haku}
```javascript
printTokens(`( )`);
```
{:program=haku}
```output
( @ 0..1
unexpected characters at 1..2: ' '
```
% id = "01J3K8A0D1RHK349974Y23DG56"
- so let's write another function that will lex those.
{:program=haku}
```javascript
lexer.skipWhitespaceAndComments = (state) => {
    while (true) {
        let c = lexer.current(state);
        if (c == " " || c == "\t" || c == "\n" || c == "\r") {
            lexer.advance(state);
            continue;
        }
        if (c == ";") {
            while (
                lexer.current(state) != "\n" &&
                lexer.current(state) != eof
            ) {
                lexer.advance(state);
            }
            // skip over the newline too, unless the comment ended at end of input
            if (lexer.current(state) == "\n") lexer.advance(state);
            continue;
        }
        break;
    }
};
```
% id = "01J3K8A0D10F11DPN5TN0Y7AAX"
- except instead of looking at whitespace and comments in the main token reading function, we'll do that _outside_ of it, to avoid getting whitespace caught up in the actual tokens' `start`..`end` spans.
{:program=haku}
```javascript
export function lex(input) {
    let tokens = [];
    let state = lexer.init(input);
    while (true) {
        lexer.skipWhitespaceAndComments(state); // <--
        let start = state.position;
        let kind = lexer.nextToken(state);
        let end = state.position;
        tokens.push({ kind, start, end });
        if (kind == eof || kind == "error") break;
    }
    return tokens;
}
```
% id = "01J3K8A0D1AQWFJHSC9XCCKNKF"
- now if we look at the output...
{:program=haku}
```javascript
printTokens(`( )`);
```
{:program=haku}
```output
( @ 0..1
) @ 2..3
end of file @ 3..3
```
the whitespace is ignored just fine!
% id = "01J3K8A0D1S7MCHYYVYMPWEHEF"
- and comments of course follow:
{:program=haku}
```javascript
printTokens(`
    ( ; comment comment!
    )
`);
```
{:program=haku}
```output
( @ 5..6
) @ 30..31
end of file @ 32..32
```
% id = "01J3K8A0D16NF69K3MNNYH1VJ1"
- it'd be really nice if we could use identifiers though...
{:program=haku}
```javascript
printTokens(`(hello world)`);
```
{:program=haku}
```output
( @ 0..1
unexpected characters at 1..2: 'h'
```
so I guess that's the next thing on our TODO list!
% id = "01J3K8A0D1SF46M2E7DEP6V44N"
- we'll introduce a function that will tell us if a given character is a valid character in an identifier.
since S-expressions are so minimal, it is typical to allow all sorts of characters in identifiers -
in our case, we'll allow alphanumerics, as well as a bunch of symbols that seem useful.
and funky!
{:program=haku}
```javascript
export const isIdentifier = (c) =>
    /^[a-zA-Z0-9+~!@$%^&*=<>+?/.,:\\|-]$/.test(c);
```
% id = "01J3K8A0D10TTSM7TV0C05PVNJ"
- this could probably be a whole lot faster if I had used a simple `c >= 'a' && c <= 'z'` chain, but I'm lazy, so a regex it is.
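for the curious, here's roughly what such a chain could look like.
it's only a sketch - `isIdentifierFast` and `identifierSymbols` are made-up names, and I haven't measured whether it's actually any faster.
the `c.length == 1` check is there so that multi-character strings like the `end of file` sentinel don't sneak through the `>=`/`<=` comparisons.
```javascript
// Hypothetical regex-free variant of isIdentifier.
// The symbol list mirrors the character class used in the regex version.
const identifierSymbols = "+~!@$%^&*=<>?/.,:\\|-";

const isIdentifierFast = (c) =>
    c.length == 1 &&
    ((c >= "a" && c <= "z") ||
        (c >= "A" && c <= "Z") ||
        (c >= "0" && c <= "9") ||
        identifierSymbols.includes(c));

console.log(isIdentifierFast("x")); // true
console.log(isIdentifierFast(",")); // true
console.log(isIdentifierFast("#")); // false
console.log(isIdentifierFast("end of file")); // false
```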
% id = "01J3K8A0D16VA5D4JGT26YZ4KP"
- when I said funky, I wasn't joking - have you ever seen `,` in an identifier?
% id = "01J3K8A0D11GYDHXVZJXVWAGHN"
- I'm allowing it since it isn't really gonna hurt anything.
I _did_ disallow `#` though, because that's commonly used for various extensions.
who knows what I might be able to cram under that symbol!
% id = "01J3K8A0D17S0FTBXHP36VVP8C"
- with a character set established, we can now stuff identifiers into our lexer.
I'll start by introducing a function that'll chew as many characters that meet a given condition as it can:
{:program=haku}
```javascript
lexer.advanceWhile = (state, fn) => {
    while (fn(lexer.current(state))) {
        lexer.advance(state);
    }
};
```
% id = "01J3K8A0D1YV77A2TR64R74HRD"
- now we can add identifiers to `nextToken`:
{:program=haku}
```javascript
lexer.nextToken = (state) => {
    let c = lexer.current(state);
    if (isIdentifier(c)) {
        lexer.advanceWhile(state, isIdentifier);
        return "identifier";
    }
    if (c == "(" || c == ")") {
        lexer.advance(state);
        return c;
    }
    if (c == eof) return eof;
    lexer.advance(state);
    return "error";
};
```
% id = "01J3K8A0D1DKA8YCBCJVZXXGR4"
- let's try lexing that `(hello world)` string now.
{:program=haku}
```javascript
printTokens(`(hello world)`);
```
{:program=haku}
```output
( @ 0..1
identifier @ 1..6
identifier @ 7..12
) @ 12..13
end of file @ 13..13
```
nice!
% id = "01J3K8A0D15G77YG2A0CN8P0M6"
- in the original example, there were also a couple of numbers:
```haku
(+ (fib (- n 1)) (fib (- n 2)))
```
so let's also add support for some basic integers; we'll add decimals later if we ever need them.
% id = "01J3K8A0D18MA59WFYW7PCPQ30"
- defining integers is going to be a similar errand to identifiers, so I'll spare you the details and just dump all the code at you:
{:program=haku}
```javascript
export const isDigit = (c) => c >= "0" && c <= "9";
lexer.nextToken = (state) => {
    let c = lexer.current(state);
    if (isDigit(c)) {
        lexer.advanceWhile(state, isDigit);
        return "integer";
    }
    if (isIdentifier(c)) {
        lexer.advanceWhile(state, isIdentifier);
        return "identifier";
    }
    if (c == "(" || c == ")") {
        lexer.advance(state);
        return c;
    }
    if (c == eof) return eof;
    lexer.advance(state);
    return "error";
};
```
% id = "01J3K8A0D1SZ4YSR1KD2HYAWPV"
- note how we check `isDigit` _before_ `isIdentifier` -
this is really important, because otherwise identifiers would take precedence over integers!
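the reason is that `isIdentifier` accepts digits too, so whichever check runs first claims the token.
a small standalone sketch of the difference - `digitsFirst` and `identifiersFirst` are hypothetical names that only classify a single starting character:
```javascript
// Standalone copies of the two character predicates.
const isDigit = (c) => c >= "0" && c <= "9";
const isIdentifier = (c) => /^[a-zA-Z0-9+~!@$%^&*=<>+?/.,:\\|-]$/.test(c);

// Same checks, different order.
const digitsFirst = (c) =>
    isDigit(c) ? "integer" : isIdentifier(c) ? "identifier" : "error";
const identifiersFirst = (c) =>
    isIdentifier(c) ? "identifier" : isDigit(c) ? "integer" : "error";

console.log(digitsFirst("1")); // integer
console.log(identifiersFirst("1")); // identifier
```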
% id = "01J3K8A0D1B5J858DJ6BKNJRKT"
- now let's see the results of all that hard work.
{:program=haku}
```javascript
printTokens(`(fib (- n 1))`);
```
{:program=haku}
```output
( @ 0..1
identifier @ 1..4
( @ 5..6
identifier @ 6..7
identifier @ 8..9
integer @ 10..11
) @ 11..12
) @ 12..13
end of file @ 13..13
```
looks good!
% id = "01J3K8A0D148R9B0HVMH79A3CK"
- #### an amen break
% id = "01J3K8A0D1WX6EH5H61BVR1X31"
- to let your head rest a bit after reading all of this, here are some fun numbers:
% id = "01J3K8A0D11D479PJKY22AQFTC"
- there are a total of
{:program=haku}
```javascript
console.log(Object.keys(lexer).length);
```
{:program=haku}
```output
6
```
functions in the `lexer` namespace.
not a whole lot, huh?
% id = "01J3K8A0D19XK0MRH4Z461G2J0"
- I was personally quite surprised how tiny an S-expression lexer can be.
they were right about S-expressions being a good alternative for when you don't want to write syntax!
the entire thing fits in *86 lines of code.*
% id = "01J3K8A0D1CG89X84KM2DN14ZT"
+ :bulb: for the curious: *here's why I implement lexers like this!*
% id = "01J3K8A0D1FYBKJ6X2W17QAK3Z"
- many tutorials will have you implementing lexers such that data is _parsed_ into the language's data types.
for instance, integer tokens would be parsed into JavaScript `number`s.
I don't like this approach for a couple reasons.
% id = "01J3K8A0D1P258JKRVG11M7B64"
- pre-parsing data like this pollutes your lexer code with wrangling tokens into useful data types.
I prefer it if the lexer is only responsible for _reading back strings_.
implemented my way, it can concern itself only with chewing through the source string; no need to extract substrings out of the input or anything.
% id = "01J3K8A0D14VZTKBPJTG3BGD0M"
- there's also a performance boost from implementing it this way: _lazy_ parsing, as I like to call it, allows us to defer most of the parsing work until it's actually needed.
if the token never ends up being needed (e.g. due to a syntax error,) we don't end up doing extra work eagerly!
% id = "01J3K8A0D1GYZ9Y9MK6K24JME7"
- if that doesn't convince you, consider that now all your tokens are the exact same data structure, and you can pack them neatly into a flat array.
if you're using a programming language with flat arrays, that is.
such as Rust or C.
I'm implementing this in JavaScript of course, but it's still neat not having to deal with mass `if`osis when extracting data from tokens - you're always guaranteed a token will have a `kind`, `start`, and `end`.
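that said, JavaScript does have typed arrays, so the flat packing can be approximated there as well.
a minimal sketch, assuming a fixed, hypothetical numbering of token kinds (`kinds`, `packTokens`, and `unpackToken` are names made up for this example - haku keeps using plain objects):
```javascript
// Hypothetical numbering of token kinds; haku itself uses strings for these.
const kinds = ["(", ")", "identifier", "integer", "end of file", "error"];

// Pack tokens into a flat Uint32Array, three slots per token: kind, start, end.
// Assumes every token's kind appears in the `kinds` table.
function packTokens(tokens) {
    const packed = new Uint32Array(tokens.length * 3);
    tokens.forEach((token, i) => {
        packed[i * 3] = kinds.indexOf(token.kind);
        packed[i * 3 + 1] = token.start;
        packed[i * 3 + 2] = token.end;
    });
    return packed;
}

// Read the i-th token back out of the packed array.
function unpackToken(packed, i) {
    return {
        kind: kinds[packed[i * 3]],
        start: packed[i * 3 + 1],
        end: packed[i * 3 + 2],
    };
}

const packed = packTokens([
    { kind: "(", start: 0, end: 1 },
    { kind: "integer", start: 1, end: 3 },
    { kind: ")", start: 3, end: 4 },
]);
console.log(unpackToken(packed, 1));
```
this only works because every token has the same three fields; a token carrying a pre-parsed value would not fit into fixed-size slots as easily.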
% id = "01J3K8A0D1NTPSD77WM84KVMRX"
- now. back to your regularly scheduled programming!
% id = "01J3K8A0D1X6A68K6TGX00FCTE"
- it's time for us to implement a parser for our S-expressions.
{:program=haku}
```javascript
export const parser = {};
```
% id = "01J3K8A0D1ZMJJHDMW24D1GESE"
- the goal is to go from this flat list of tokens:
| type | start | end | text |
| --- | --: | --: | --- |
| ( | 0 | 1 | `(` |
| identifier | 1 | 8 | `example` |
| identifier | 9 | 21 | `s-expression` |
| ) | 21 | 22 | `)` |
| end of file | 22 | 22 | |
to a nice recursive tree that represents our S-expressions:
```haku.ast
list
    identifier example
    identifier s-expression
```
% id = "01J3K8A0D1SSWPAKSNG8TA4N1H"
- there are many parsing strategies we could go with, but in my experience you can't go simpler than good ol' [recursive descent][].
[recursive descent]: https://en.wikipedia.org/wiki/Recursive_descent_parser
% id = "01J3K8A0D1NHD7QGQ1NZTDQRWX"
- the idea of recursive descent is that you have a stream of tokens that you read from left to right, and you have a set of functions that parse your non-terminals.
essentially, each function corresponds to a single type of node in your syntax tree.
% id = "01J3K8A0D1F01CKXP10M7WD6VV"
- does the "stream of tokens that you read from left to right" ring a bell?
if it does, that's because lexing operates on a _very_ similar process - it's just non-recursive!
% id = "01J3K8A0D111A22X9WW8NP3T3X"
- knowing that similarity, we'll start off with a similar set of helper functions to our lexer.
{:program=haku}
```javascript
parser.init = (tokens) => {
    return {
        tokens,
        position: 0,
    };
};
parser.current = (state) => state.tokens[state.position];
parser.advance = (state) => {
    if (state.position < state.tokens.length - 1) {
        ++state.position;
    }
};
```
note however that instead of letting `current` read out of bounds, we clamp `advance` to the very last token - which is guaranteed to be `end of file`.
% id = "01J3K8A0D1XF9PEBQ6D4F1P3BA"
- the S-expression grammar can compose in the following ways:
% id = "01J3K8A0D1CWFBC9JTM6PFZRR8"
- an S-expression is a literal integer, identifier, or a list.
% id = "01J3K8A0D1BM9QGDHWCX7PANPR"
- literal integers `65` and identifiers `owo` stand on their own.
they do not nest anything else inside of them.
% id = "01J3K8A0D19BXABXNV75N93A18"
- lists `(a b c)` are sequences of S-expressions enclosed in parentheses.
inside, they can contain literal integers and identifiers, or even other lists recursively.
% id = "01J3K8A0D1G43KZDVH7EW0ZAKQ"
- this yields the following [EBNF][] grammar:
```ebnf
Expr = "integer" | "identifier" | List;
List = "(" , { Expr } , ")";
```
[EBNF]: https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form
% id = "01J3K8A0D1FPZE52S1RVWCR66Y"
- we'll start by implementing the `Expr = "integer" | "identifier"` rule.
parsing integers and identifiers is as simple as reading their single token, and returning a node for it:
{:program=haku}
```javascript
parser.parseExpr = (state) => {
    let token = parser.current(state);
    switch (token.kind) {
        case "integer":
        case "identifier":
            parser.advance(state);
            return { ...token };
        default:
            parser.advance(state);
            return {
                kind: "error",
                error: "unexpected token",
                start: token.start,
                end: token.end,
            };
    }
};
```
% id = "01J3K8A0D1ENMQV0ZSP8C5ZX5A"
- of course again, we mustn't forget about errors!
it's totally possible for our lexer to produce a token we don't understand - such as an `error`, or an `end of file`.
or really any token we choose to introduce in the future, but choose to not be valid as an `Expr` starter.
% id = "01J3K8A0D1QRSPTYPH2JQ77HW9"
+ we'll wrap initialization and `parseExpr` in another function, which will accept a list of tokens and return a syntax tree, hiding the complexity of managing the parser state underneath.
{:program=haku}
```javascript
parser.parseRoot = (state) => parser.parseExpr(state);
export function parse(input) {
    let state = parser.init(input);
    let expr = parser.parseRoot(state);
    if (parser.current(state).kind != eof) {
        let strayToken = parser.current(state);
        return {
            kind: "error",
            error: `found stray '${strayToken.kind}' token after expression`,
            start: strayToken.start,
            end: strayToken.end,
        };
    }
    return expr;
}
```
this function also checks that there aren't any tokens after we're done parsing the root `Expr` production.
it wouldn't be very nice UX if we let the user input tokens that didn't do anything!
% id = "01J3K8A0D1KE4JRKEXWPAQJFDV"
- I'm adding that `parseRoot` alias in so that it's easy to swap the root production to something other than `Expr`.
% id = "01J3K8A0D1GP31XPC0VVZTJPMV"
- now we can try to parse a tree out of a little expression...
{:program=haku}
```javascript
export function printTree(input) {
    let tokens = lex(input);
    let tree = parse(tokens);
    console.log(JSON.stringify(tree, null, " "));
}
```
...and print it into the console:
{:program=haku}
```javascript
printTree("-w-")
```
{:program=haku}
```output
{
"kind": "identifier",
"start": 0,
"end": 3
}
```
nice!
% id = "01J3K8A0D14YEA038BD8KAAECC"
- now it's time to parse some lists.
for that, we'll introduce another function, which will be called by `parseExpr` with an existing `(` token.
its task will be to read as many expressions as it can, until it hits a closing parenthesis `)`, and then construct a node out of that.
{:program=haku}
```javascript
parser.parseList = (state, leftParen) => {
    parser.advance(state);
    let children = [];
    while (parser.current(state).kind != ")") {
        if (parser.current(state).kind == eof) {
            return {
                kind: "error",
                error: "missing closing parenthesis ')'",
                start: leftParen.start,
                end: leftParen.end,
            };
        }
        children.push(parser.parseExpr(state));
    }
    let rightParen = parser.current(state);
    parser.advance(state);
    return {
        kind: "list",
        children,
        start: leftParen.start,
        end: rightParen.end,
    };
};
```
% id = "01J3K8A0D1YZ93B7X3A14X1W0N"
- and the last thing left to do is to hook it up to our `parseExpr`, in response to a `(` token:
{:program=haku}
```javascript
parser.parseExpr = (state) => {
    let token = parser.current(state);
    switch (token.kind) {
        case "integer":
        case "identifier":
            parser.advance(state);
            return { ...token };
        case "(":
            return parser.parseList(state, token); // <--
        default:
            parser.advance(state);
            return {
                kind: "error",
                error: "unexpected token",
                start: token.start,
                end: token.end,
            };
    }
};
```
% id = "01J3K8A0D1RHWQAA9FMDC654S9"
- now let's try parsing an S-expression!
{:program=haku}
```javascript
printTree("(hello! ^^ (nested nest))");
```
{:program=haku}
```output
{
"kind": "list",
"children": [
{
"kind": "identifier",
"start": 1,
"end": 7
},
{
"kind": "identifier",
"start": 8,
"end": 10
},
{
"kind": "list",
"children": [
{
"kind": "identifier",
"start": 12,
"end": 18
},
{
"kind": "identifier",
"start": 19,
"end": 23
}
],
"start": 11,
"end": 24
}
],
"start": 0,
"end": 25
}
```
% id = "01J3K8A0D1AJP9WHVKBBKKC3B7"
- I don't know about you, but I personally find the JSON output quite distracting and long.
I can't imagine how long it'll be on even larger expressions!
to counteract that, let's write an S-expression pretty printer:
{:program=haku}
```javascript
export function exprToString(expr, input) {
    let inputSubstring = input.substring(expr.start, expr.end);
    switch (expr.kind) {
        case "integer":
        case "identifier":
            return inputSubstring;
        case "list":
            return `(${expr.children.map((expr) => exprToString(expr, input)).join(" ")})`;
        case "error":
            return `<error ${expr.start}..${expr.end} '${inputSubstring}': ${expr.error}>`;
    }
}
```
% id = "01J3K8A0D1CB6B8BEY65ADJZSV"
- obviously this loses some information compared to the JSON - we no longer report start and end indices, but that is easy enough to add if you need it.
I don't need it, so I'll conveniently skip it for now.
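in case you do need it, here's one way it could look - a standalone sketch with a hand-built tree, where `exprToStringWithSpans` is a hypothetical name and the `@start..end` suffix is just a notation I picked for the example:
```javascript
// Hypothetical variant of the pretty printer that appends each node's span.
function exprToStringWithSpans(expr, input) {
    const span = `@${expr.start}..${expr.end}`;
    switch (expr.kind) {
        case "integer":
        case "identifier":
            return `${input.substring(expr.start, expr.end)}${span}`;
        case "list": {
            const children = expr.children.map((child) =>
                exprToStringWithSpans(child, input),
            );
            return `(${children.join(" ")})${span}`;
        }
        case "error":
            return `<error ${expr.error}>${span}`;
    }
}

// Hand-built syntax tree for the input below, shaped like the parser's output.
const input = "(a (b))";
const tree = {
    kind: "list",
    start: 0,
    end: 7,
    children: [
        { kind: "identifier", start: 1, end: 2 },
        {
            kind: "list",
            start: 3,
            end: 6,
            children: [{ kind: "identifier", start: 4, end: 5 }],
        },
    ],
};

console.log(exprToStringWithSpans(tree, input)); // (a@1..2 (b@4..5)@3..6)@0..7
```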
% id = "01J3K8A0D1G1BPN5W4GT26EJX4"
- let's see if our pretty printer works!
{:program=haku}
```javascript
export function printTree(input) {
    let tokens = lex(input);
    let tree = parse(tokens);
    console.log(exprToString(tree, input));
}
printTree("(hello! -w- (nestedy nest))");
```
{:program=haku}
```output
(hello! -w- (nestedy nest))
```
that's... the same string.
% id = "01J3K8A0D1XP4FQB2HZR9GV5CJ"
- let's try something more complicated, with comments and such.
{:program=haku}
```javascript
export function printTree(input) {
    let tokens = lex(input);
    let tree = parse(tokens);
    console.log(exprToString(tree, input));
}
printTree(`
    (def add-two
        ; Add two to a number.
        (fn (n) (+ n 2)))
`);
```
{:program=haku}
```output
(def add-two (fn (n) (+ n 2)))
```
looks like it works!
% id = "01J3K8A0D10DRSP49WF8YH5WSH"
- of course this is hardly the _prettiest_ printer in the world.
% id = "01J3K8A0D1VCJ7TV6CN7M07N5J"
- for one, it does not even preserve your comments.
% id = "01J3K8A0D1K3M9223YM96PS68B"
- it does not add indentation either, it just blindly dumps a minimal S-expression into the console.
% id = "01J3K8A0D1P2EF0C657J1REV9Z"
- but it proves that our parser _works_ - we're able to parse an arbitrary S-expression into a syntax tree, and then traverse that syntax tree again, performing various recursive algorithms on it.
isn't that cool?
% id = "01J3K8A0D1PB6MSPBS1K6K6KR3"
- and that's all there'll be to parsing, at least for now!
% id = "01J3K8A0D11M0NJCBKKPAMVJ2J"
- maybe in the future I'll come up with something more complex, with a more human-friendly syntax.
who knows!
right now it's experimentation time, so these things don't really matter.
% id = "01J3K8A0D1HB566XYSET099Q26"
- #### amen break, part two
% id = "01J3K8A0D1KX5EWV5NW29PF525"
- the S-expression parser consists of a whopping
{:program=haku}
```javascript
console.log(Object.keys(parser).length);
```
{:program=haku}
```output
6
```
functions.
just like the lexer!
% id = "01J3K8A0D1RZE0F75S2C7PPTAZ"
- the parser is *99 lines of code*. quite tiny, if you ask me!
% id = "01J3K8A0D1K91SY17T780S7MPK"
- together with the lexer, the entire S-expression parser is *185 lines of JavaScript.*
that's a pretty small amount, especially given that it's extremely simple code!
% id = "01J3K8A0D1PJNDGKJH8DXN4G3G"
- I wouldn't call this parser production-ready, though.
a production-ready parser would have some way of _preserving comments_ inside the syntax tree, such that you can pretty-print it losslessly.
if you're bored, you can try to add that in!
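as a possible starting point, here's a standalone sketch that records where comments are instead of throwing them away.
`collectComments` is a hypothetical function, separate from the lexer above, and it assumes `;` can only ever start a comment - which holds for the syntax so far. attaching the collected spans to syntax tree nodes is left out entirely.
```javascript
// Standalone sketch: walk the input and record the span of every `;` comment.
function collectComments(input) {
    const comments = [];
    let position = 0;
    while (position < input.length) {
        if (input.charAt(position) == ";") {
            const start = position;
            // A comment runs until the end of the line, or the end of input.
            while (position < input.length && input.charAt(position) != "\n") {
                ++position;
            }
            comments.push({ kind: "comment", start, end: position });
        } else {
            ++position;
        }
    }
    return comments;
}

const input = "(def x ; the answer\n  42)";
for (const { start, end } of collectComments(input)) {
    console.log(input.substring(start, end)); // ; the answer
}
```
the spans use the same `start`/`end` shape as tokens, so a pretty printer could look them up and re-emit the comment text from the source string.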
% id = "01J3K8A0D1PJQJFAG2YADEKVNB"
+ here's a fun piece of trivia: I wrote a [Nim S-expression parser for Rosetta Code][nim s-expr] way back in [July 2019][nim s-expr diff].
[nim s-expr]: https://rosettacode.org/wiki/S-expressions#Nim
[nim s-expr diff]: https://rosettacode.org/wiki/S-expressions?diff=prev&oldid=202824
% id = "01J3K8A0D1BWG3TFFXDD6BCPP2"
- you can see it's quite different from how I wrote this parser - in particular, because I didn't need to focus so much on the parser being hot-patchable and reusable, it came out quite a lot more compact, despite having fully static types!
% id = "01J3K8A0D1F4R8KPHETV9N08YP"
- it's definitely not how I would write a parser nowadays.
it's pretty similar, but the syntax tree structures are quite different - it doesn't use the [lazy parsing][branch:01J3K8A0D1FYBKJ6X2W17QAK3Z] trick I talked about before.
% id = "01J3K8A0D178J6W49AFCE9HEQ6"
- I mean, it's only a trick I learned last year!
% id = "01J3K8A0D12VCHW6AJX0ZGPQBY"
- code style-wise it's also not my prettiest Nim code ever - it kind of abuses `template`s for referring to the current character with a single word, but that doesn't convey the fact that it's an effectful operation very well.
% id = "01J3NVV2RX2KCN1P0K287D4CRF"
- ### interpretation
% id = "01J3NVV2RXBFKXCR3NRMMW0X4M"
- with a parser now ready, it would be nice if we could execute some actual code!
% id = "01J3NVV2RX79WXQCYRBZTHE09N"
- we'll again start off by setting a goal.
I want to be able to evaluate arbitrary arithmetic expressions, like this one:
```haku
(+ (* 2 1) 1 (/ 6 2) (- 10 3))
```
% id = "01J3NVV2RXKR1AQ7JDR23FQC8R"
- the simplest way to get some code up and running would be to write a _tree-walk interpreter_.
{:program=haku}
```javascript
export const treewalk = {};
```
this kind of interpreter is actually really simple!
it just involves walking through your syntax tree, executing each node one by one.
% id = "01J3NVV2RXED7Y4XSK0JVTHJ5P"
- we'll again start off by defining a function that initializes our interpreter's state.
right now there isn't really anything to initialize, but recall that we don't have our tokens parsed into any meaningful data yet, so we'll need access to the source string to do that.
{:program=haku}
```javascript
treewalk.init = (input) => {
    return { input };
};
```
% id = "01J3NVV2RXSGW5PRZY7PG7CGZY"
- the core of our interpretation will be a function that descends down the node tree and _evaluates_ each node, giving us a result.
{:program=haku}
```javascript
treewalk.eval = (state, node) => {
    switch (node.kind) {
        default:
            throw new Error(`unhandled node kind: ${node.kind}`);
    }
};
```
for now we'll leave it empty.
% id = "01J3NVV2RX2T5K473GGY3K5VBX"
- in the meantime, let's prepare a couple convenient little wrappers to run our code:
{:program=haku}
```javascript
export function run(input, node) {
    let state = treewalk.init(input);
    return treewalk.eval(state, node);
}
export function printEvalResult(input) {
    try {
        let tokens = lex(input);
        let ast = parse(tokens);
        let result = run(input, ast);
        console.log(result);
    } catch (error) {
        console.log(error.toString());
    }
}
```
% id = "01J3NVV2RXV5JW9HA3S4S0A8YQ"
- now we can try running some code!
let's see what happens.
{:program=haku}
```javascript
printEvalResult("65");
```
{:program=haku}
```output
Error: unhandled node kind: integer
```
...of course.
% id = "01J3NVV2RXAHMZRY06B17M6B8E"
- so let's patch those integers in!
this is where we'll need that source string of ours - we don't actually have a JavaScript `number` representation of the integers, so we'll need to parse them into place.
{:program=haku}
```javascript
treewalk.eval = (state, node) => {
    switch (node.kind) {
        case "integer":
            let sourceString = state.input.substring(node.start, node.end);
            return parseInt(sourceString);
        default:
            throw new Error(`unhandled node kind: ${node.kind}`);
    }
};
```
% id = "01J3NVV2RXW2Y9DT15P0KHRX6R"
- now when we run the program above...
{:program=haku}
```javascript
printEvalResult("65");
```
{:program=haku}
```output
65
```
we get sixty five!
% id = "01J3NVV2RXJ91M0V541JB5WQZB"
- but that's of course a bit boring - it would be nice if we could like, y'know, _perform some arithmetic_.
% id = "01J3NVV2RX2N540RF6T3S6XWE3"
- traditionally, in Lisp-like languages, a list expression always represents a function application, with the head of the list being the function to call, and the tail of the list being the arguments to apply to the function.
let's implement that logic then!
{:program=haku}
```javascript
export const builtins = {};
treewalk.eval = (state, node) => {
    switch (node.kind) {
        case "integer":
            let sourceString = state.input.substring(node.start, node.end);
            return parseInt(sourceString);
        case "list": // <--
            let functionToCall = node.children[0];
            let builtin = builtins[state.input.substring(functionToCall.start, functionToCall.end)];
            return builtin(state, node);
        default:
            throw new Error(`unhandled node kind: ${node.kind}`);
    }
};
```
% id = "01J3NVV2RXE0JH63D6XN1GPTN9"
- we're putting all of our built-in magic functions into a separate object `builtins`, so that they're easy to patch partially later.
you've seen my tricks already with hot-patching functions in objects, so this shouldn't be too surprising.
% id = "01J3NVV2RXCHN7HAAVS5MREN7Z"
+ you'll note I'm kind of cheating here - because we have no mechanism to represent variables just yet, I'm using the node's text as the key to our `builtins` table.
% id = "01J3NVV2RX00ZZ6W7V4DYJ92YP"
- heck, I'm not even validating that this is an identifier - so you can technically do something like this, too:
```haku
((what the fuck) lol)
```
which will call the builtin named `(what the fuck)`.
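if you wanted to be stricter about it, a lookup with a few checks could look something like the sketch below.
`lookupBuiltin` is a hypothetical helper, not something haku defines, and the error messages are made up. the own-property check matters because a plain `builtins[name]` lookup on an ordinary object would happily find inherited things like `Object.prototype.toString`.
```javascript
// Hypothetical helper: find the builtin a list node refers to, with validation.
function lookupBuiltin(builtins, input, node) {
    if (node.children.length == 0) {
        throw new Error("cannot call an empty list");
    }
    const head = node.children[0];
    if (head.kind != "identifier") {
        throw new Error(`expected a function name, got ${head.kind}`);
    }
    const name = input.substring(head.start, head.end);
    // Only accept properties defined directly on the builtins table.
    if (!Object.prototype.hasOwnProperty.call(builtins, name)) {
        throw new Error(`unknown function: ${name}`);
    }
    return builtins[name];
}

// Hand-built node for `(toString 1)`, shaped like the parser's output.
const builtins = { "+": (a, b) => a + b };
const input = "(toString 1)";
const call = {
    kind: "list",
    start: 0,
    end: 12,
    children: [
        { kind: "identifier", start: 1, end: 9 },
        { kind: "integer", start: 10, end: 11 },
    ],
};

try {
    lookupBuiltin(builtins, input, call);
} catch (error) {
    console.log(error.message); // unknown function: toString
}
```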
% id = "01J3NVV2RXMW943792EYSQHY4R"
- we could try this out now, except we don't actually have any builtins! so I'll add a few in, so that we can _finally_ perform our glorious arithmetic:
{:program=haku}
```javascript
function arithmeticBuiltin(op) {
    return (state, node) => {
        if (node.children.length < 3)
            throw new Error("arithmetic operations require at least two arguments");
        let result = treewalk.eval(state, node.children[1]);
        for (let i = 2; i < node.children.length; ++i) {
            result = op(result, treewalk.eval(state, node.children[i]));
        }
        return result;
    };
}
builtins["+"] = arithmeticBuiltin((a, b) => a + b);
builtins["-"] = arithmeticBuiltin((a, b) => a - b);
builtins["*"] = arithmeticBuiltin((a, b) => a * b);
builtins["/"] = arithmeticBuiltin((a, b) => a / b);
```
% id = "01J3NVV2RXKPEP5RY27EG3HJ94"
- one thing of note is how `arithmeticBuiltin` accepts two or more arguments.
you're free to pass in more than that, which is common among Lisps.
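the arguments get folded from left to right, so `(- 10 3 2)` means `(10 - 3) - 2`.
here's the same folding on plain JavaScript numbers, as a standalone sketch - `fold` is a hypothetical helper that skips the syntax tree entirely:
```javascript
// Hypothetical helper: left-fold `op` over a non-empty array of numbers,
// starting from the first element, just like arithmeticBuiltin does with nodes.
const fold = (op, args) => args.slice(1).reduce((a, b) => op(a, b), args[0]);

console.log(fold((a, b) => a - b, [10, 3, 2])); // 5
console.log(fold((a, b) => a / b, [24, 2, 3])); // 4
```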
% id = "01J3NVV2RXGPN2SXRYYFVJQ8KX"
- now let's try running our full arithmetic expression! drum roll please...
{:program=haku}
```javascript
printEvalResult("(+ (* 2 1) 1 (/ 6 2) (- 10 3))");
```
{:program=haku}
```output
13
```
% id = "01J3NVV2RX3Y31RKGEF106ZB25"
- #### a brief intermission
% id = "01J3NVV2RXCZKVWSFG6Y2T0XQS"
- I will now pause here to say, I'm kind of tired of writing this `printEvalResult` ceremony over and over again.
so I took a bit of time to enhance the treehouse's capabilities, and it's now capable of running languages other than JavaScript!
% id = "01J3NVV2RXTZ8J0AW1WW5WSGQM"
- all we have to do is swap out the evaluation [kernel][]{title="like in Jupyter! Jupyter kernels are basically just support for different programming languages" style="cursor: help; text-decoration: 1px dotted underline;"}...
[kernel]: https://docs.jupyter.org/en/latest/projects/kernels.html
{:program=haku}
```javascript
import { getKernel, defaultEvalModule } from "treehouse/components/literate-programming/eval.js";
export const kernel = getKernel();
kernel.evalModule = async (state, source, language, params) => {
if (language == "haku") {
printEvalResult(source);
return true;
} else {
return await defaultEvalModule(state, source, language, params);
}
};
```
% id = "01J3NVV2RXXG67E9A8RPJFRV51"
- and now we can write haku in code blocks!
{:program=haku}
```haku
(+ (* 2 1) 1 (/ 6 2) (- 10 3))
```
{:program=haku}
```output
13
```
% id = "01J3REN79KVAPHJGXYYG1MJQ7K"
- anyways, it's time to turn haku into a real programming language!
% id = "01J3REN79KQ0RXG4FBPCBBKPQT"
- programming languages as we use them in the real world are [_Turing-complete_](https://en.wikipedia.org/wiki/Turing_completeness) - roughly speaking, a language is Turing-complete if it can simulate a Turing machine.
% id = "01J3REN79KEN5TZ8101GNTSKWV"
- this is not an accurate definition at all - for that, I strongly suggest reading the appropriate Wikipedia articles.
% id = "01J3REN79K76X933BKT9NH3H4H"
- the TL;DR is that conditional loops are all you really need for Turing-completeness.
% id = "01J3REN79KD5XJ7CJ7E0GF7RMW"
- there exist two main models of Turing-complete computation: Turing machines, and lambda calculus.
% id = "01J3REN79KEKF91X5HR85N7WH4"
- Turing machines are the core of imperative programming languages - a Turing machine basically just models a state machine, similar to what you may find in a modern processor.
% id = "01J3REN79KH9MBRGGTG5VV084V"
- lambda calculus on the other hand is a declarative system, a stripped-down version of math if you will.
an expression in lambda calculus computes a result, and that's it.
no states, no side effects.
just like functional programming.
% id = "01J3REN79K7TQ2K9ZCV3FW7ZSJ"
- which is why we'll use it for haku!
% id = "01J3REN79KZYEQWCTM8A8QA5VE"
- at the core of lambda calculus is the _lambda_ - yes, that one from your favorite programming language!
there are a few operations we can do on lambdas.
% id = "01J3REN79KJKQ865P2205770AD"
- first of all, a lambda is a function which takes one argument, and produces one result - both of which can be other lambdas.
in haku, we will write down lambdas like so:
```haku
(fn (a) r)
```
where `a` is the name of the argument, and `r` is the resulting expression.
% id = "01J3REN79KCV615KP159KCD057"
- in fact, haku will extend this idea by permitting multiple arguments.
```haku
(fn (a b c) r)
```
% id = "01J3REN79KZCMGRW652JCMS25X"
- a lambda can be _applied_, which basically corresponds to a function call in your favorite programming language.
we write application down like so:
```haku
(f x)
```
where `f` is any expression producing a lambda, and `x` is the argument to pass to that lambda.
% id = "01J3REN79K9QEW18H7KJ9FTQ3A"
- what's also important is that nested lambdas capture their outer lambdas' arguments!
so the result of this:
```haku
(((fn (x) (fn (y) (+ x y))) 1) 2)
```
is 3.
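for comparison, here's the same expression sketched with JavaScript arrow functions, which capture their outer arguments in the same way. this is only an analogy to build intuition, not how haku evaluates it.
```javascript
// (fn (x) (fn (y) (+ x y))) written as nested arrow functions.
const add = (x) => (y) => x + y;
// (((fn ...) 1) 2): apply to 1 first, then apply the resulting lambda to 2.
const addOne = add(1);
console.log(addOne(2)); // 3
```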
% id = "01J3REN79KFMM7TKTBDVY8YF8Y"
- this is by no means a formal explanation, just my intuition as to how it works.
formal definitions don't really matter for us anyways, since we're just curious little cats playing around with computer :cowboy:
% id = "01J3REN79KAFD4CVT56671TQFT"
- we'll start out with a way to define variables.
variables generally have _scope_ - look at the following JavaScript, for example:
{:program=scope-example}
```javascript
let x = 0;
console.log(x);
{
let x = 1;
console.log(x);
}
console.log(x);
```
{:program=scope-example}
```output
0
1
0
```
% id = "01J3REN79KXGFPQR5YZSKFVDRM"
- the same thing happens in haku (though we don't have a runtime for this yet, so you'll have to take my word for it.)
```haku
((fn (x)
((fn (x)
x)
2))
1)
```
this is perfectly fine, and the result should be 2 - not 1!
try evaluating this in your head, and you'll see what I mean.
it's better than me telling you all about it.
% id = "01J3REN79KQGCGPXVANVFYTXF7"
- so to represent scope, we'll introduce a new variable to our interpreter's state.
{:program=haku}
```javascript
treewalk.init = (input) => {
return { input, scopes: [new Map(Object.entries(builtins))] };
};
```
`scopes` will be a stack of [`Map`][Map]s, each representing a single scope.
our builtins will now live at the bottom of all scopes, as an ever-present scope living in the background.
[Map]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Map
% id = "01J3REN79KS7612NEDY8NCAQS5"
- variable lookup will be performed by walking the scope stack _from top to bottom_, until we find a variable that matches.
{:program=haku}
```javascript
treewalk.lookupVariable = (state, name) => {
for (let i = state.scopes.length; i-- > 0; ) {
let scope = state.scopes[i];
if (scope.has(name)) {
return scope.get(name);
}
}
throw new Error(`variable ${name} is undefined`);
};
```
we're stricter than JavaScript here and will error out on any variables that are not defined.
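to see the lookup order in action, here's a small standalone sketch. it redefines `lookupVariable` as a plain function so that it can run on its own, and the scope contents are made up for illustration.
```javascript
function lookupVariable(state, name) {
    // Walk from the innermost (last pushed) scope down to the outermost.
    for (let i = state.scopes.length; i-- > 0; ) {
        let scope = state.scopes[i];
        if (scope.has(name)) {
            return scope.get(name);
        }
    }
    throw new Error(`variable ${name} is undefined`);
}

const state = {
    scopes: [
        new Map([["x", 1], ["y", 10]]), // outer scope
        new Map([["x", 2]]), // inner scope, shadows x
    ],
};

console.log(lookupVariable(state, "x")); // 2, the inner x wins
console.log(lookupVariable(state, "y")); // 10, found further down the stack

let message = null;
try {
    lookupVariable(state, "z");
} catch (error) {
    message = error.message;
}
console.log(message); // variable z is undefined
```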
% id = "01J3REN79KZC6SWHJBGG8FK17S"
- now we can go ahead and add variable lookups to our `eval` function!
we'll also go ahead and replace our bodged-in builtin support with proper evaluation of the first list element.
in most cases, such as `(+ 1 2)`, this will result in a variable lookup.
{:program=haku}
```javascript
treewalk.eval = (state, node) => {
switch (node.kind) {
case "integer":
let sourceString = state.input.substring(node.start, node.end);
return parseInt(sourceString);
case "identifier":
return treewalk.lookupVariable(state, state.input.substring(node.start, node.end));
case "list":
let functionToCall = treewalk.eval(state, node.children[0]);
return functionToCall(state, node);
default:
throw new Error(`unhandled node kind: ${node.kind}`);
}
};
```
% id = "01J3REN79K8HQX3XGSMJ039KPJ"
- if we didn't screw anything up, we should still be getting 13 here:
{:program=haku}
```haku
(+ (* 2 1) 1 (/ 6 2) (- 10 3))
```
{:program=haku}
```output
13
```
looks like all's working correctly!
% id = "01J3REN79KM4ZS66S5915E40E6"
- time to build our `fn` builtin.
we'll split the work into two functions: the actual builtin, which will parse the node's structure into some useful variables...
{:program=haku}
```javascript
builtins.fn = (state, node) => {
if (node.children.length != 3)
throw new Error("an `fn` must have an argument list and a result expression");
let params = node.children[1];
if (node.children[1].kind != "list")
throw new Error("expected parameter list as second argument to `fn`");
let paramNames = [];
for (let param of params.children) {
if (param.kind != "identifier") {
throw new Error("`fn` parameters must be identifiers");
}
paramNames.push(state.input.substring(param.start, param.end));
}
let expr = node.children[2];
return makeFunction(state, paramNames, expr);
};
```
% id = "01J3REN79K43RXJ974CBCY5EFP"
- and `makeFunction`, which will take that data, and assemble it into a function that follows our `(state, node) => result` calling convention.
{:program=haku}
```javascript
export function makeFunction(state, paramNames, bodyExpr) {
return (state, node) => {
if (node.children.length != paramNames.length + 1)
throw new Error(
`incorrect number of arguments: expected ${paramNames.length}, but got ${node.children.length - 1}`,
);
let scope = new Map();
for (let i = 0; i < paramNames.length; ++i) {
scope.set(paramNames[i], treewalk.eval(state, node.children[i + 1]));
}
state.scopes.push(scope);
let result = treewalk.eval(state, bodyExpr);
state.scopes.pop();
return result;
};
}
```
% id = "01J3REN79KEPTVDDWMFZR16PWC"
- now let's try using that new `fn` builtin!
{:program=haku}
```haku
((fn (a b)
(+ a b))
1 2)
```
{:program=haku}
```output
3
```
nice!
% id = "01J3REN79KWD313K6R91B33PBH"
- but, remember that lambdas are supposed to capture their outer variables! I wonder if that works.
{:program=haku}
```haku
((fn (f)
((f 1) 2))
(fn (x)
(fn (y)
(+ x y))))
```
{:program=haku}
```output
Error: variable x is undefined
```
...I was being sarcastic here, of course it doesn't work. :ralsei_dead:
% id = "01J3REN79K62M94MSMRKVDAYGM"
- so to add support for that, we'll clone the entire scope stack into the closure, and then restore it when necessary.
{:program=haku}
```javascript
export function makeFunction(state, paramNames, bodyExpr) {
let capturedScopes = [];
// Start from 1 to skip builtins, which are always present anyways.
for (let i = 1; i < state.scopes.length; ++i) {
// We don't really mutate the scopes after pushing them onto the stack, so keeping
// references to them is okay.
capturedScopes.push(state.scopes[i]);
}
return (state, node) => {
if (node.children.length != paramNames.length + 1)
throw new Error(
`incorrect number of arguments: expected ${paramNames.length}, but got ${node.children.length - 1}`,
);
let scope = new Map();
for (let i = 0; i < paramNames.length; ++i) {
scope.set(paramNames[i], treewalk.eval(state, node.children[i + 1]));
}
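let previousDepth = state.scopes.length;
state.scopes.push(...capturedScopes); // <--
state.scopes.push(scope);
let result = treewalk.eval(state, bodyExpr);
// Drop everything we pushed, captured scopes included, not just the parameter scope.
state.scopes.length = previousDepth;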
return result;
};
}
```
with that, our program now works correctly:
{:program=haku}
```haku
((fn (f)
((f 1) 2))
(fn (x)
(fn (y)
(+ x y))))
```
{:program=haku}
```output
3
```
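if the moving parts are hard to follow with all the AST plumbing around them, here's a stripped-down model of the same idea. everything in it is hypothetical: `makeClosure`, `lookup`, and the use of plain JavaScript thunks as function bodies are all made up for this sketch, and it restores the stack by resetting its length.
```javascript
// A global scope stack; index 0 stands in for the builtins scope.
const scopes = [new Map()];

function lookup(name) {
    for (let i = scopes.length; i-- > 0; ) {
        if (scopes[i].has(name)) return scopes[i].get(name);
    }
    throw new Error(`variable ${name} is undefined`);
}

// `body` is a plain JavaScript thunk standing in for an AST node to evaluate.
function makeClosure(paramName, body) {
    // Remember the scopes that are live at the point of definition.
    const captured = scopes.slice(1);
    return (argument) => {
        const depth = scopes.length;
        scopes.push(...captured, new Map([[paramName, argument]]));
        try {
            return body();
        } finally {
            // Drop everything pushed above, captured scopes included.
            scopes.length = depth;
        }
    };
}

// (fn (x) (fn (y) (+ x y)))
const makeAdder = makeClosure("x", () => makeClosure("y", () => lookup("x") + lookup("y")));
const addOne = makeAdder(1);
// By now the scope holding x has been dropped from the stack,
// but addOne still carries a reference to it.
console.log(addOne(2)); // 3
console.log(scopes.length); // 1
```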
% id = "01J42RD8Y4VYAQB97XY057R26G"
- being able to define arbitrary functions gives us some pretty neat powers!
to test this out, let's write a little program that will calculate Fibonacci numbers.
% id = "01J42RD8Y4FJXH7HGG2AT3SDJC"
- there are a couple of ways to write a function that calculates numbers in the Fibonacci sequence.
% id = "01J42RD8Y4SWPXCT67J8XKX87Z"
- the most basic is the recursive way, which is really quite simple to do:
{:program=fib-recursive}
```javascript
function fib(n) {
if (n < 2) {
return n;
} else {
return fib(n - 1) + fib(n - 2);
}
}
console.log(fib(10));
```
{:program=fib-recursive}
```output
55
```
the downside is that it's really inefficient! we end up wasting a lot of time doing repeat calculations.
try going through it yourself and see just how many calculations are repeated!
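if you'd rather let the computer do the counting, here's a quick sketch that tallies the calls. `countedFib` and the `calls` counter are made-up names for this example.
```javascript
let calls = 0;
function countedFib(n) {
    ++calls; // count every invocation, including the repeated ones
    if (n < 2) {
        return n;
    } else {
        return countedFib(n - 1) + countedFib(n - 2);
    }
}
console.log(countedFib(10)); // 55
console.log(calls); // 177
```
that's 177 calls to produce the 10th number, where the iterative version needs only 10 loop iterations.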
% id = "01J42RD8Y4V3G6RCB2ZABSTE5R"
- the one that's more efficient is the iterative version:
{:program=fib-iterative}
```javascript
function fib(n) {
let a = 0;
let b = 1;
let t = null;
for (let i = 0; i < n; ++i) {
t = a;
a = b;
b += t;
}
return a;
}
console.log(fib(10));
```
{:program=fib-iterative}
```output
55
```
% id = "01J42RD8Y4T30Z1BP0MZXHG4C8"
- in either version, you will notice we need to support comparisons to know when to stop!
so let's add those into our builtins:
{:program=haku}
```javascript
function comparisonBuiltin(op) {
return (state, node) => {
if (node.children.length != 3)
throw new Error("comparison operators require exactly two arguments");
let a = treewalk.eval(state, node.children[1]);
let b = treewalk.eval(state, node.children[2]);
return op(a, b) ? 1 : 0;
};
}
builtins["="] = comparisonBuiltin((a, b) => a === b);
builtins["<"] = comparisonBuiltin((a, b) => a < b);
```
it's easy enough to derive `!=`, `<=`, `>`, and `>=` from these, so we won't bother adding those in for now.
% id = "01J42RD8Y4H02HKWVD650T9BYG"
- if you're curious how to derive `!=` and `<=`, consider that we're returning zeros and ones, so we can do an AND operation by multiplying them, and a NOT by subtracting from 1.
% id = "01J42RD8Y4WZSKMT0BYXBM91GE"
- `>` can be derived by reversing the arguments of `<`.
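spelled out on plain numbers, the derivations might look like this. `lt` and `eq` stand in for the `<` and `=` builtins, and all the other names are hypothetical; NOT is done by subtracting from 1.
```javascript
// Stand-ins for the two builtins, returning 1 for true and 0 for false.
const lt = (a, b) => (a < b ? 1 : 0);
const eq = (a, b) => (a === b ? 1 : 0);

const not = (x) => 1 - x; // NOT: flips 0 and 1
const and = (x, y) => x * y; // AND: 1 only if both are 1

const ne = (a, b) => not(eq(a, b)); // a != b
const gt = (a, b) => lt(b, a); // a > b is b < a
const le = (a, b) => not(gt(a, b)); // a <= b is NOT (a > b)
const ge = (a, b) => not(lt(a, b)); // a >= b is NOT (a < b)
// a <= b could also be written with AND: NOT (a != b AND NOT a < b)
const leViaAnd = (a, b) => not(and(ne(a, b), not(lt(a, b))));

console.log(ne(1, 2), gt(2, 1), le(2, 2), ge(1, 2), leViaAnd(1, 2)); // 1 1 1 0 1
```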
% id = "01J42RD8Y4EWZ0V4KC7HX2KJAZ"
- of course, we'll also need an `if` to be able to branch on the result of our comparison operators.
{:program=haku}
```javascript
builtins["if"] = (state, node) => {
if (node.children.length != 4)
throw new Error("an `if` must have a condition, true expression, and false expression");
let condition = treewalk.eval(state, node.children[1]);
if (condition !== 0) {
return treewalk.eval(state, node.children[2]);
} else {
return treewalk.eval(state, node.children[3]);
}
};
```
% id = "01J42RD8Y4XBB8WE9QR36WFAQH"
- now we can write ourselves a recursive Fibonacci!
{:program=haku}
```haku
((fn (fib)
(fib fib 10))
; fib
(fn (fib n)
(if (< n 2)
n
(+ (fib fib (- n 1)) (fib fib (- n 2))))))
```
note that in order to achieve recursion, we need to pass `fib` into itself - this is because the `fib` variable we're binding into the first function is not visible in the second function.
but if we run it now:
{:program=haku}
```output
55
```
we can see it works just as well as the JavaScript version!
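the self-passing trick isn't specific to haku. here's the same program sketched in JavaScript, under the assumption that we're not allowed to refer to a function by its own name:
```javascript
// The outer function binds `fib`; the inner one receives itself as its first argument.
const result = ((fib) => fib(fib, 10))(
    (fib, n) => (n < 2 ? n : fib(fib, n - 1) + fib(fib, n - 2)),
);
console.log(result); // 55
```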
% id = "01J42RD8Y4BS3EBAQXNR410ZH5"
- ### [remember to remember](https://www.youtube.com/watch?v=0ucW1eN8h9Y){.secret}
% id = "01J42RD8Y47WMW5DSVFVCADF60"
- now, you might be wondering why I'm cutting our Fibonacci adventures short.
after all, we're only just getting started?
% id = "01J42RD8Y46NJ03J6ZMT2EDBDB"
- thing is, I _really_ want to build something bigger.
and one expression per code block's not gonna cut it.
% id = "01J42RD8Y4SJS75FTA9SQ28RE2"
- I'd like to start building a little library of utilities for writing haku code, but I have no way of saving these utilities for later!
% id = "01J42RD8Y4GA0Q5Q2Z446DRD5Y"
- therefore, it's time for... a persistent environment!
% id = "01J42RD8Y4DCWSG17XJFSJF1SR"
- once again, let me sketch out what I'd like it to look like.
to declare a persistent value, you use `def`:
```haku
(def fib
(fn (n)
(if (< n 2)
n
(+ (fib (- n 1)) (fib (- n 2))))))
```
if this looks familiar, that's because it probably is - [I used the exact same example at the start of the post][branch:01J3K8A0D1198QXV2GFWF7JCV0]!
% id = "01J42RD8Y46GDWJA41A76B57VF"
- once you `def`ine a persistent value, you can refer to it as usual.
persistent values will sit in a scope _above_ builtins, so you will be able to shadow those if you want to (but please don't.)
```haku
(def fn if) ; Whoops! Guess your soul belongs to me now
```
% id = "01J42RD8Y4ZF0XQH1RT020099B"
- of course, values will persist across code blocks, so I'd be able to refer to `fib` here as well:
```haku
(fib 12)
```
% id = "01J42RD8Y4EDKYXXZZ5SGFQCCS"
- and lastly, it'll be possible to put multiple expressions in a code block.
we'll only treat the last one as the result.
```haku
(def x 1)
(def y 2)
(def z (+ x y))
```
% id = "01J42RD8Y4FJ1S12WG27DVWFD7"
- so let's start by implementing the easiest part - the `def` builtin.
we'll need to augment our interpreter state once again, this time with the persistent environment:
{:program=haku}
```javascript
treewalk.init = (env, input) => {
return {
input,
scopes: [new Map(Object.entries(builtins)), env],
env,
};
};
```
% id = "01J42RD8Y4BWY2B56NMSNR27EP"
- of course now we will also need to teach our whole runtime about the environment, right down to the kernel...
{:program=haku}
```javascript
import { defaultEvalModule } from "treehouse/components/literate-programming/eval.js";
export function run(env, input, node) {
let state = treewalk.init(env, input);
return treewalk.eval(state, node);
}
export function printEvalResult(env, input) {
try {
let tokens = lex(input);
let ast = parse(tokens);
let result = run(env, input, ast);
// NOTE: `def` will not return any value, so we'll skip printing it out.
if (result !== undefined) {
console.log(result);
}
} catch (error) {
console.log(error.toString());
}
}
kernel.evalModule = async (state, source, language, params) => {
if (language == "haku") {
state.haku ??= { env: new Map() };
printEvalResult(state.haku.env, source);
return true;
} else {
return await defaultEvalModule(state, source, language, params);
}
};
```
% id = "01J42RD8Y4BREBB4KQ2WR0TH8Q"
- now for `def` - it'll take the value on the right and insert it into `env`, so that it can be seen in the future.
{:program=haku}
```javascript
builtins.def = (state, node) => {
if (node.children.length != 3)
throw new Error(
"a `def` expects the name of the variable to assign, and the value to assign to the variable",
);
if (node.children[1].kind != "identifier")
throw new Error("variable name must be an identifier");
let name = node.children[1];
let value = treewalk.eval(state, node.children[2]);
state.env.set(state.input.substring(name.start, name.end), value);
};
```
% id = "01J42RD8Y4FZNB2FV99YH00EHZ"
- now let's test it out!
{:program=haku}
```haku
(def x 1)
```
{:program=haku}
```haku
(+ x 1)
```
{:program=haku}
```output
2
```
seems to be working!
% id = "01J42RD8Y4HST3XK86HBVFA2XT"
- now for the second part: we still want to permit multiple declarations per block of code, but currently our syntax doesn't handle that:
{:program=haku}
```haku
(def x 1)
(def y 2)
```
{:program=haku}
```output
Error: unhandled node kind: error
```
~and by the way, I know this is a terrible error message. we'll return to that later.~
% id = "01J42RD8Y4JA8AZ7WT8E0WMXNA"
- this is a pretty simple augmentation to the base syntax.
instead of reading a single expression, we will read a _toplevel_ - as many expressions as possible until we hit `end of file`.
{:program=haku}
```javascript
parser.parseToplevel = (state) => {
let children = [];
while (parser.current(state).kind != eof) {
children.push(parser.parseExpr(state));
}
return {
kind: "toplevel",
children,
// Don't bother with start..end for now.
};
};
parser.parseRoot = (state) => parser.parseToplevel(state);
```
% id = "01J42RD8Y40SQVBHRBRWHWM9WD"
- I'm stealing the name _toplevel_ from OCaml.
the name _file_ didn't quite seem right, since a haku program is not really made out of files, but is rather a long sequence of code blocks.
% id = "01J42RD8Y4BYF2S4YSB4QB7YAQ"
- with a `toplevel` node ready, we can now handle it in our interpreter:
{:program=haku}
```javascript
treewalk.eval = (state, node) => {
switch (node.kind) {
case "integer":
let sourceString = state.input.substring(node.start, node.end);
return parseInt(sourceString);
case "identifier":
return treewalk.lookupVariable(state, state.input.substring(node.start, node.end));
case "list": {
let functionToCall = treewalk.eval(state, node.children[0]);
let result = functionToCall(state, node);
return result;
}
case "toplevel":
let result = undefined;
for (let i = 0; i < node.children.length; ++i) {
result = treewalk.eval(state, node.children[i]);
if (result !== undefined && i != node.children.length - 1)
throw new Error(`expression ${i + 1} had a result despite not being the last`);
}
return result;
default:
throw new Error(`unhandled node kind: ${node.kind}`);
}
};
```
% id = "01J42RD8Y49ZB65BE7C6WQRDZT"
- since `eval` (and likewise, a treehouse code block) is only allowed to have one result, we disallow any results other than the last one.
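to make the rule concrete, here's a tiny standalone model of what the `toplevel` case checks. `lastResult` is a hypothetical helper that takes already-computed results instead of AST nodes, with `undefined` standing for expressions like `def` that produce nothing.
```javascript
function lastResult(results) {
    let result = undefined;
    for (let i = 0; i < results.length; ++i) {
        result = results[i];
        if (result !== undefined && i != results.length - 1)
            throw new Error(`expression ${i + 1} had a result despite not being the last`);
    }
    return result;
}

// (def x 1) (def y 2) (+ x y)
console.log(lastResult([undefined, undefined, 3])); // 3

// (+ 1 2) (def x 1)
let message = null;
try {
    lastResult([3, undefined]);
} catch (error) {
    message = error.message;
}
console.log(message); // expression 1 had a result despite not being the last
```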
% id = "01J42RD8Y4A18TXC73V2020ZWH"
- and with that...
{:program=haku}
```haku
(def x 1)
(def y 2)
(+ x y)
```
{:program=haku}
```output
3
```
we can now declare multiple, persistent values per code block!
% id = "01J42RD8Y4QDRRGT2JRPYKR7GE"
- ### but it's never that easy is it
% id = "01J42RD8Y4XTD4N5S2KWQQC6DX"
- so let's declare a little function to add some numbers together...
{:program=haku}
```haku
(def add-two
(fn (x) (+ x 2)))
```
{:program=haku}
```haku
(add-two 1)
```
{:program=haku}
```output
Error: variable is undefined
```
'scuse me??
% id = "01J42RD8Y473B94NGG17REKXH0"
- not gonna lie, this one took me a while to figure out!
but recall the structure of our AST nodes.
it looks something like this:
```json
{
"kind": "identifier",
"start": 30,
"end": 32
}
```
% id = "01J42RD8Y44MHWB6HTDKKBYPA2"
- now remember what we do in order to look up variables.
```javascript
return treewalk.lookupVariable(state, state.input.substring(node.start, node.end));
```
what do you imagine happens when the `state.input` source string is different?
% id = "01J42RD8Y4FQZW5PBTZYQCAHG4"
- _and_, the source string _does_ end up being different, because we end up parsing each block from scratch - we never concatenate them into something bigger!
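you can reproduce the mismatch without the interpreter at all. in this sketch the two strings play the role of the two code blocks, and the node is built by hand (its offsets are computed with `indexOf`, which is an assumption about how the lexer would have numbered things):
```javascript
const firstBlock = "(def add-two\n  (fn (x) (+ x 2)))";
const secondBlock = "(add-two 1)";

// A hand-built identifier node for the `+` inside the first block.
const start = firstBlock.indexOf("+");
const node = { kind: "identifier", start, end: start + 1 };

// Slicing the source the node came from gives the right name...
console.log(JSON.stringify(firstBlock.substring(node.start, node.end))); // "+"
// ...but the offsets lie past the end of the second block, so we get an empty name.
console.log(JSON.stringify(secondBlock.substring(node.start, node.end))); // ""
```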
% id = "01J42RD8Y4KJCNNKNS74AQ7BEH"
- so we'll have to fix this up by remembering the source string alongside each node somehow.
I see two paths:
% id = "01J42RD8Y4PVBEYMHN43ZNWW6Z"
- pre-slice the source string into each node
% id = "01J42RD8Y48GK9QWMCRGM71KDM"
- store a reference to the entire source string in each node
% id = "01J42RD8Y4BBB813M8GZ5MZTPP"
+ I'm no JavaScript optimization expert, but the 2nd option seems like it would avoid a bit of overhead...
but I really _do_ like the fact our AST can be neatly printed into readable JSON, so to preserve that property, we'll go with the 1st option.
% id = "01J42RD8Y48Y1S1R92ZPXGH9Q5"
- speed isn't really our main concern with this first iteration of the interpreter - I prefer inspectability and easy prototyping.
% id = "01J42RD8Y4Y0E5HATN35JKJ05G"
- we'll write a function that walks over our AST, and inserts source strings into it.
{:program=haku}
```javascript
export function insertSources(node, input) {
if (node.start != null) {
node.source = input.substring(node.start, node.end);
}
if (node.children != null) {
for (let child of node.children) {
insertSources(child, input);
}
}
}
```
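here's roughly what that does to a small tree. the AST below is written out by hand for the input `(f x)`, so treat the exact node layout as an assumption; `insertSources` is repeated so that the snippet runs on its own.
```javascript
function insertSources(node, input) {
    if (node.start != null) {
        node.source = input.substring(node.start, node.end);
    }
    if (node.children != null) {
        for (let child of node.children) {
            insertSources(child, input);
        }
    }
}

const input = "(f x)";
const ast = {
    kind: "list",
    start: 0,
    end: 5,
    children: [
        { kind: "identifier", start: 1, end: 2 },
        { kind: "identifier", start: 3, end: 4 },
    ],
};
insertSources(ast, input);

console.log(ast.source); // (f x)
console.log(ast.children.map((child) => child.source).join(" ")); // f x
// The tree still serializes to plain JSON, now with the text baked in.
console.log(JSON.stringify(ast.children[0])); // {"kind":"identifier","start":1,"end":2,"source":"f"}
```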
% id = "01J42RD8Y4HMG0E6KZFTDRAZ4R"
- now I _am_ aware this is changing [object shapes][] quite a lot, which is suboptimal.
but I would _really_ like to keep the interpreter simple, so bear with me.
[object shapes]: https://mathiasbynens.be/notes/shapes-ics
% id = "01J42RD8Y4RXF274JDZRWAXZ6D"
- now we can patch the relevant parts of the interpreter to read from the `node.source` field, instead of `substring`ing the source string passed to the interpreter. this is pretty mechanical so I'll just dump all the relevant code here:
{:program=haku}
```javascript
treewalk.eval = (state, node) => {
switch (node.kind) {
case "integer":
return parseInt(node.source); // <--
case "identifier":
return treewalk.lookupVariable(state, node.source); // <--
case "list":
let functionToCall = treewalk.eval(state, node.children[0]);
return functionToCall(state, node);
case "toplevel":
let result = undefined;
for (let i = 0; i < node.children.length; ++i) {
result = treewalk.eval(state, node.children[i]);
if (result !== undefined && i != node.children.length - 1)
throw new Error(`expression ${i + 1} had a result despite not being the last`);
}
return result;
default:
throw new Error(`unhandled node kind: ${node.kind}`);
}
};
builtins.fn = (state, node) => {
if (node.children.length != 3)
throw new Error("an `fn` must have an argument list and a result expression");
let params = node.children[1];
if (node.children[1].kind != "list")
throw new Error("expected parameter list as second argument to `fn`");
let paramNames = [];
for (let param of params.children) {
if (param.kind != "identifier") {
throw new Error("`fn` parameters must be identifiers");
}
paramNames.push(param.source); // <--
}
let expr = node.children[2];
return makeFunction(state, paramNames, expr);
};
builtins.def = (state, node) => {
if (node.children.length != 3)
throw new Error(
"a `def` expects the name of the variable to assign, and the value to assign to the variable",
);
if (node.children[1].kind != "identifier")
throw new Error("variable name must be an identifier");
let name = node.children[1];
let value = treewalk.eval(state, node.children[2]);
state.env.set(name.source, value); // <--
};
```
% id = "01J42RD8Y4YWW1DR71RE5A1RC3"
- and of course, to top it all off, we still need to insert source information into the nodes before evaluating our tree:
{:program=haku}
```javascript
import { defaultEvalModule } from "treehouse/components/literate-programming/eval.js";
export function printEvalResult(env, input) {
try {
let tokens = lex(input);
let ast = parse(tokens);
insertSources(ast, input); // <--
let result = run(env, input, ast);
// NOTE: `def` will not return any value, so we'll skip printing it out.
if (result !== undefined) {
console.log(result);
}
} catch (error) {
console.log(error.stack ? error.toString() + "\n\n" + error.stack : error.toString());
}
}
kernel.evalModule = async (state, source, language, params) => {
if (language == "haku") {
state.haku ??= { env: new Map() };
printEvalResult(state.haku.env, source);
return true;
} else {
return await defaultEvalModule(state, source, language, params);
}
};
```
% id = "01J42RD8Y4QJS26B0EFSSZES3P"
- let's see if `add-two` works now.
we have an outdated version of it in our `env` map, so let's declare it again, using two input blocks like we did before:
{:program=haku}
```haku
(def add-two
(fn (x) (+ x 2)))
```
{:program=haku}
```haku
(add-two 2)
```
{:program=haku}
```output
4
```
cool!
% id = "01J42RD8Y4NKFM2KS4J5EQ7J2M"
- ### data structures
% id = "01J42RD8Y46XQ0A8SAYCXD5HMZ"
- for a language to really be useful, it needs to have data structures.
fortunately we already have them at our disposal - enter *linked lists!*
% id = "01J42RD8Y48GHZ145RM9Z6CAQW"
- the coolest part about lists is that we don't even need to do anything on the JavaScript side to implement them - we can use our good old friend lambda calculus, along with a really cool tool called [Church encoding][], which allows us to encode lists using nothing but functions!
[Church encoding]: https://en.wikipedia.org/wiki/Church_encoding
% id = "01J42RD8Y424WKHG4C16ZXW3WC"
- haku also has some tricks up its sleeve which allow us to break free from the minimalistic confines of lambda calculus, which means we don't have to implement _everything_.
without further ado though, let's get started!
% id = "01J42RD8Y49SN1TDA7ST663958"
- first, we'll implement a way to construct a linked list node - aka `cons`.
{:program=haku}
```haku
(def clist/cons
(fn (h t)
(fn (get)
(get h t))))
```
% id = "01J42RD8Y4YFBQPV75DNDG7S2F"
- the way our lists will work is that each list node is an ordinary function.
we'll be able to pass a "getter" function to the list function to obtain the list's head and tail.
% id = "01J42RD8Y4JK7R4K43A102DQXW"
- I'm prefixing all of our Church-encoded list operations with `clist/` to differentiate them from potential future list representations we'd want to implement.
% id = "01J42RD8Y4J7WPR0WTKMFSZWXJ"
- now for extracting our head and tail.
{:program=haku}
```haku
(def clist/head
(fn (list)
(list (fn (h t) h))))
(def clist/tail
(fn (list)
(list (fn (h t) t))))
```
these happen by passing that getter function to our list and using it to extract its head or tail _only._
% id = "01J42RD8Y4KNKKBVZNF4PCNPWB"
- the last missing part is a marker for signifying the end of the list.
thing is, we don't really have to implement this, because we already have the literal `0`! so knowing whether we're at the end of the list is as simple as `(= (clist/tail node) 0)`.
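if the nested `fn`s are hard to read, the same three definitions can be sketched with JavaScript arrow functions. the names `cons`, `head`, and `tail` are just stand-ins for the `clist/` functions above.
```javascript
// A list node is a function that hands its head and tail to whatever getter it's given.
const cons = (h, t) => (get) => get(h, t);
const head = (list) => list((h, t) => h);
const tail = (list) => list((h, t) => t);

const node = cons(1, cons(2, 0)); // 0 marks the end of the list
console.log(head(node)); // 1
console.log(head(tail(node))); // 2
console.log(tail(tail(node))); // 0
```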
% id = "01J42RD8Y49H5NWQY1TJNRBWCT"
- and that's our list representation!
let's give it a shot.
we'll define a list containing the first five Fibonacci numbers:
{:program=haku}
```haku
(def clist-with-fib-5
(clist/cons 1 (clist/cons 1 (clist/cons 2 (clist/cons 3 (clist/cons 5 0))))))
```
% id = "01J42RD8Y4X61HPNY7E5RZDC03"
- and a function to _reduce_ a list to a single element.
this function has various names in various languages, but the idea is that it allows us to walk over a list, modifying a value along the way, until we get a single, final value.
{:program=haku}
```haku
(def clist/reduce
(fn (init op list)
(if (= (clist/tail list) 0)
(op init (clist/head list))
(clist/reduce (op init (clist/head list)) op (clist/tail list)))))
```
once again, the recursive logic is kind of tricky; if you draw it out, you should be able to understand it much more easily!
% id = "01J42RD8Y4HS91N3CG3BBYRD5D"
- let's see if we can sum our Fibonacci numbers together:
{:program=haku}
```haku
(clist/reduce 0 + clist-with-fib-5)
```
{:program=haku}
```output
12
```
nice!
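for comparison, here's the same reduction sketched in JavaScript. `cons`, `head`, `tail`, and `reduce` are stand-ins for the `clist/` functions, and just like the haku version this assumes a non-empty list terminated by `0`.
```javascript
const cons = (h, t) => (get) => get(h, t);
const head = (list) => list((h, t) => h);
const tail = (list) => list((h, t) => t);

function reduce(init, op, list) {
    const next = op(init, head(list));
    // A tail of 0 means this was the last node.
    return tail(list) === 0 ? next : reduce(next, op, tail(list));
}

const fib5 = cons(1, cons(1, cons(2, cons(3, cons(5, 0)))));
console.log(reduce(0, (a, b) => a + b, fib5)); // 12
console.log(reduce(1, (a, b) => a * b, fib5)); // 30
```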
% id = "01J42RD8Y4YVAV8M82229NT7E7"
- #### can I just say something real quick
% id = "01J42RD8Y49CBEG05CT288WJTN"
- I'm swiftly starting to dislike my parenthesized syntax choices here.
they would be fine in an editor capable of highlighting mismatched parentheses, but [Helix][] refuses to highlight _any_ parentheses in [`.tree`][branch:01H8V55APDWN8TV31K4SXBTTWB] files until I add a `tree-sitter` grammar to it.
[Helix]: https://helix-editor.com
% id = "01J42RD8Y4CKBD54RD089X1YKT"
- the example above took me way longer to get working than I want to admit.
honestly it's a failure of tooling on my side (should've embedded source spans into all these errors so that they can be reported more cleanly!), but I _really_ don't want to spend too much time on what's basically just a prototype.
% id = "01J42RD8Y4ED5TTP392VYGGWXS"
- I'll carry on with them for a bit longer though, I really don't wanna write a complicated parser right now.
% stage = "Draft"
id = "01J3K8A0D1D0NTT3JYYFMRYVSC"
- ### tests
% id = "01J3K8A0D1DQZCZSX4H82QQBHR"
- parser
{:program=test-parser}
```javascript
import { lex, parse, exprToString } from "haku/sexp.js";
let input = "(example s-expression)";
let tokens = lex(input);
tokens.forEach(token => console.log(`${token.kind} ${token.start}..${token.end} '${input.substring(token.start, token.end)}'`));
let ast = parse(tokens);
console.log(exprToString(ast, input));
```
{:program=test-parser}
```output
( 0..1 '('
identifier 1..8 'example'
identifier 9..21 's-expression'
) 21..22 ')'
end of file 22..22 ''
(example s-expression)
```
% id = "01J3NVV2RX1N1XETTTT177H9RM"
- treewalk
{:program=test-treewalk}
```javascript
import { lex, parse, exprToString, insertSources } from "haku/sexp.js";
import { run } from "haku/treewalk.js";
let input = `
(def x 1)
`;
let tokens = lex(input);
let ast = parse(tokens);
insertSources(ast, input);
console.log(run(new Map(), input, ast));
```
{:program=test-treewalk}
```output
```
% stage = "Draft"
id = "01J3REN79K08JWA7FKQ94YTB5Y"
+ ### design notes to self
% id = "01J3REN79KT9MEFYAZ39WT49V3"
- if I ever get to the point where haku compiles and runs itself, the interpreter shall be called `haha`
% id = "01J3REN79KDGD50J9VBGVMV6AB"
- if I ever get to the point where haku compiles itself to wasm, the compiler should be called `wah`