treehouse/content/programming/blog/lvalues.tree
2024-07-24 18:20:12 +02:00

345 lines
14 KiB
Text

%% title = "places, or what is up with *x not always meaning the same thing in different contexts"
% id = "01HY5R1ZV9DD7BV0F66Y0DHAEA"
- I recently got a question from my someone telling me they doesn't understand why `*x` does not read from the pointer `x` when on the left-hand side of an assignment.
% id = "01HY5R1ZV9G92SVA0XP7CG1X6K"
- as in this case:
```c
void example(int *x) {
int y = *x; // (1)
*x = 10; // (2)
}
```
% id = "01HY5R1ZV9MVWYF9VNPK403JDE"
- it seems pretty weird, right? why does it read from the pointer in declaration `(1)`, but not in the assignment `(2)`?
% id = "01HY5R1ZV9N9AY0HAYEWC0RBFT"
- doesn't `*x` mean "read the value pointed to by `x`?"
% id = "01HY5R1ZV9WJM9DMW5QGK04DCR"
- TL;DR: the deal with this example is that `*x` _does not mean_ "read from the location pointed to by `x`", but _just_ "the location pointed to by `x`".
% id = "01HY5R1ZV9JN9GZ8BECJ0KV1FC"
- *`*x` is not a _value_, it's a _memory location_, or _place_ in Rust parlance*
% id = "01HY5R1ZV9QVH3W5BRS16V6EPG"
- same thing with `x`
% id = "01HY5R1ZV9RQQCBNFHY0ZSX4RP"
- same thing with `some_struct.abc` or `some_struct->abc`
% id = "01HY5R1ZV9BMQ7NZJ06B0W48HP"
- but instead of jumping to conclusions, let's go back to the beginning.
let's think, what is it that makes places _different_?
% id = "01HY5R1ZV9JGE9FTN9CN736FJN"
- the main thing is that you can write to them and create references to them.
for instance, this doesn't work:
```c
void example(void) {
1 = 2; // error!
}
```
but this does:
```c
void example(void) {
int i;
i = 2; // fine!
}
```
% id = "01HY5R1ZV91NN3YMXFNRACPDS7"
- so really, places are kind of a different _type_ - we can do certain additional operations with them, such as writing!
% id = "01HY5R1ZV9KH99CGCFM7TM1ZZB"
- we'll call this type `place(T)`, where `T` is any arbitrary type that is stored at the memory location represented by the `place(T)`.
% id = "01HY5R1ZV9F6BEQ5RHRPZPRVM6"
- `place(T)` behaves a bit weirdly compared to other types.
for starters, it is impossible to write the type down in C code, so we're always bound to turn `place(T)` into something else quickly after its creation.
% id = "01HY5R1ZV92VA27TGY6DJPZNP9"
- for instance, in this example:
```c.types
void take_int(int x);
void example(int x) {
take_int(x /*: place(int) */);
}
```
the type of `x` being passed into the `take_int` function is `place(int)`, but since that function accepts an `int`, we convert from `place(int)`
to a regular `int`.
% id = "01HY5R1ZV9ST0B5PHERGD4J4R8"
- this conversion happens implicitly and involves _reading_ from the place -
remember that places represent locations in memory, so they're a bit like pointers.
we have to read from them before we can access the value.
% id = "01HY5R1ZV9E1NSCCTF1CXPMY3D"
- but there are operations in the language that _expect_ a `place(T)`, and therefore do not perform the implicit conversion.
% id = "01HY5R1ZV95BHSV9MF4Z46M377"
+ we're able to describe these operations as _functions_ which take in a type `T` and return a type `U` - written down like `T -> U`.
% id = "01HY5R1ZV90Z3S1FZ65R977P8Y"
- this notation is taken from functional languages like Haskell.
% id = "01HY5R1ZV9RERR87NHVSGS6TWP"
- the `->` operator is _right-associative_ - `T -> U -> V` is a function which returns a function `U -> V`, not a function that accepts a `T -> U`.
% id = "01HY5R1ZV9CG67DEA8Y649D40T"
- one of these operations is _assignment_, which is like a function `place(T) -> T -> T`.
it accepts a `place(T)` to write to, a `T` to write to that place, and returns the `T` written.
note that in that case no read occurs, since the implicit conversion described before does not apply.
```c.types
void example(void) {
int x = 0;
x /*: place(int) */ = 1 /*: int */; /*-> int (discarded) */
}
```
% id = "01HY5R1ZV9MNZKK381TS7K3VR2"
- another one of these operations is the _`&` reference operator_, which is like a function `place(T) -> T*`.
it accepts a `place(T)` and returns a pointer `T*` that points to the place's memory location in exchange.
```c.types
void example(void) {
int x = 0;
int* p = &(x /*: place(T) */);
}
```
% id = "01HY5R1ZV9YN7H33BMK0NPX4QT"
- and of course its analogue, the _`*` dereferencing operator_, which does not consume a place, but produces one.
it accepts a `T*` and produces a `place(T)` that is placed at the pointer's memory location - it's the reverse of `&x`, `T* -> place(T)`.
```c
void example(int* x) {
int y = *x;
}
```
% id = "01HY5R1ZV9KZZARNK8KXZTGAFX"
- another couple of operations that accept a `place(T)` is the _`.` and `[]` operators_, both of which can be used to refer to subplaces
within the place.
% id = "01HY5R1ZV9N6PY2ANQC45NN27E"
- the difference is that `.` is a _static, compile-time known_ subplace, while `[]` may be _dynamic, runtime known_.
% id = "01HY5R1ZV99N78RA7Z5GSVVPRB"
- the `.` operator takes in a `place(T)` and returns a `place(U)` depending on the type of structure field we're referencing.
% id = "01HY5R1ZV99PZT6SDK2HRCRMAZ"
- since there is no type that represents the set of fields of a structure `S`, we'll invent a type `anyfield(S)` which represents that set.
% id = "01HY5R1ZV9RYF2P5X53VW9S9K8"
- the type of a specific field `f` in the structure `S` is `field(S, f)`.
% id = "01HY5R1ZV9PACB0HVE976TZZ69"
- we'll also introduce a type `fieldtype(F)` which is the type stored in the field `F`.
% id = "01HY5R1ZV99M92Z0XH1H3385JE"
- given that, the type of the `.` operator is `place(T) -> F -> place(fieldtype(F))`, where `F` is an `anyfield(T)`.
% id = "01HY5R1ZV9VZFHRKY4QWBDTFJ7"
- example:
```c.types
void example(void) {
struct S { int x; } s;
s /*: place(struct S) */;
s /*: place(struct S) */ .x /*: field(struct S, x) */; /*-> place(int) (discarded) */
}
```
% id = "01HY5R1ZV952B3C5AHFME5TJG4"
- the `[]` operator takes in a `T*`, a `ptrdiff_t` to offset the pointer by, and returns a `place(T)` whose memory location is the offset pointer.
the function signature is therefore `T* -> ptrdiff_t -> place(T)`.
example:
```c.types
void example(int* array) {
int* p = &((array /*: int* */)[123] /*: place(int) */);
}
```
% id = "01HY5R1ZV9VC2NNFKY9W0AE8F1"
- we can actually think of this `a[i]` operator as syntax sugar for `*(a + i)`.
% id = "01HY5R1ZV9SKX5QGJCBRG5QAD9"
- this has a funny consequence where `array[0]` is equivalent to `0[array]` -
offsetting a pointer is just addition, and addition is commutative.
therefore we can swap the operands to `[]` and it will work just fine!
% id = "01HY5R1ZV9MQWK5VQZQG8JJ03J"
- I do wonder though why it doesn't produce a warning.
I'm no standards lawyer, but I _believe_ this may have something to do with implicit type conversions - the `0` gets promoted to a
pointer as part of the desugared addition.
I really need to read up about C's type promotion rules.
% id = "01HY5R1ZV9KA2Q3TKQWQPS19KV"
- now I have to confess, I lied to you.
there are no places in C.
% id = "01HY5R1ZV9PAD1XRS29YV64GV2"
- the C standard actually calls this concept "lvalues", which comes from the fact that they are *values* which are valid
*l*eft-hand sides of assignment.
% id = "01HY5R1ZV9BNZM45RK2AW8BF5N"
+ however, I don't like that name since it's quite esoteric - if you tell a beginner "`x` is not an lvalue," they will look at you confused.
but if you tell a beginner "`x` is not a place in memory," then it's immediately more clear!
so I will keep using the Rust name despite the name "lvalues" technically being more "correct" and standards-compliant.
% id = "01HY5R1ZV9DXQ8AQJXVK1JE9XS"
- I'm putting "correct" in quotes because I don't believe this is a matter of correctness, just opinion.
% id = "01HY5R1ZV9HP593J62VWDBWHK4"
- what's interesting about `place(T)` is that it's actually a real type in C++ - except under a different name: `T&`.
% id = "01HY5R1ZV961P77W5TPDP9AMT9"
- references are basically a way of introducing places into the type system for real, which is nice,
but on the other hand having places bindable to names results in some weird holes in the language.
% id = "01HY5R1ZV9F292M89VEBSMB80F"
- to begin with, in C we could assume that referencing any variable `T x` by its name `x` would produce a `place(T)`.
this is a simple and clear rule to understand.
% id = "01HY5R1ZV9WPRVN3S84B7H9W25"
- in C++, this is no longer the case - referencing a variable `T x` by its name `x` produces a `T&`,
but referencing a variable `T& x` by its name `x` produces a `T&`, not a `T& &`!
in layman's terms, C++ makes it impossible to rebind references to something else.
you can't make this variable point to `y`:
{% NOTE: using `c` syntax here instead of `cpp` because I don't have a working C++ syntax at the moment! %}
```c
int x = 0;
int y = 1;
int& r = x;
r = y; // nope, this is just the same as x = y
```
% id = "01HY5R1ZV9FQX72E6XDJ8BHY7K"
- and it's not like it could've been done any better - if we got a `T& &` instead, we'd be able to reassign a different place to
the variable, but then we'd get a type mismatch on something like `r = 1`
% id = "01HY5R1ZV98W2X9HSG1KM7CDPN"
- because assignment is `T& -> T -> T`;
if our `T` is `int& &`, the expected signature is `int& & -> int& -> int&`, but we're providing an `int`, not an `int&` -
and we can't make a reference out of a value!
% id = "01HY5R1ZV9AVSDRGP8XYG56KNK"
- so we'd need a way of doing `T& -> T`, but guess what: (almost) this already exists and is called "pointers" and "the unary `*` operator".
% id = "01HY5R1ZV99T9X3ANYQJS40NBX"
- except of course, with pointers the signature is `T* -> T&`.
% id = "01HY5R1ZV94MHBS9RERA4CWPTM"
- so by introducing references, C++ was actually made less consistent!
% id = "01HY5R1ZV93GGZ7CPTEPY8CW8K"
- I actually kind of wish references were more like they are in Rust - basically just pointers but non-null and guaranteed to be aligne
% id = "01HY5R1ZV9QWHZRJ5V53CVYG5V"
- anyways, as a final {-boss-} bonus of this blog post, I'd like to introduce you to the `x->y` operator (the C one)
% id = "01HY5R1ZV9AYFWV96FC07WX68G"
- if you've been programmming C or C++ for a while, you'll know that it's pretty dangerous to just go pointer-[walkin'](https://www.youtube.com/watch?v=d_dLIy2gQGU) with the `->` operator
```c
int* third(struct list* first) {
return &list->next->next->value;
}
```
% id = "01HY5R1ZV9XR75M4TN7H08HYPF"
+ there's a pretty high chance that using the `third` function will cause a crash for you if there are only two elements in the list.
% id = "01HY5R1ZV97K8Y4S4V68555MKV"
- if it doesn't cause a crash, you may have more serious problems to worry about :kamien:
% id = "01HY5R1ZV9EEN448SB62YMAGZP"
- but how does it cause a crash if we're taking the reference out of that whole `->` chain? shouldn't taking a reference not cause any reads?
% id = "01HY5R1ZV977KHP1QDXWY7CTZJ"
- the secret lies in what the `x->y` operator really does.
basically, it's just convenience syntax for `(*x).y`.
% id = "01HY5R1ZV9X8M5T64BD0MZ2WN2"
- let's start by dismantling the entire pointer access chain into separate expressions:
```c
int* third(struct list* first) {
struct list* second = first->next;
struct list* third = second->next;
return &third->value;
}
```
% id = "01HY5R1ZV9PGQMS2A6H5XTWYZX"
- now let's desugar the `->` operator:
```c
int* third(struct list* first) {
struct list* second = (*first).next;
struct list* third = (*second).next;
return &(*third).value;
}
```
% id = "01HY5R1ZV92CZAV13P4KFABTR1"
- and add some type annotations:
```c.types
int* third(struct list* first) {
struct list* second = (*first).next /*: place(struct list*) */;
struct list* third = (*second).next /*: place(struct list*) */;
return &(*third).value;
}
```
% id = "01HY5R1ZV9671KDJAATWBPGD2C"
- and now let's follow it line by line.
% id = "01HY5R1ZV936BNY9MY4ANCXYH1"
- ```c.types
struct list* second = (*first).next /*: place(struct list*) */;
```
first we read the value of the `next` field from the structure pointed to by `first`.
assuming `first` is a valid pointer, this shouldn't fail.
% id = "01HY5R1ZV9VNTRSCFWBW9BCTHJ"
- ```c.types
struct list* third = (*second).next /*: place(struct list*) */;
```
but now something bad happens: we don't know if the `second` pointer we just got a `place(T)` from is valid.
we offset it by `.next` and implicitly read from it, which is bad!
% id = "01HY5R1ZV9S50S4MYYQ1Q5P00Z"
- at this point there's no point in analyzing the rest of the function - we've hit Undefined Behavior!
% id = "01HY5R1ZV9A1FS9SRFJBM5NVSR"
- the conclusion here is that chaining `x->y` can be really dangerous if you don't check for the validity of each reference.
just doing one hop and a reference - `&x->y` - is fine, because we never end up reading from the invalid pointer -
it's like doing `&x[1]`.
but two hops is where it gets hairy - in `x->y->z`, the `->z` has to _read_ from `x->y` to know the pointer to read from.
% id = "01HY5R1ZV9VSE2WWH93NCAGRS8"
- TODO: in the future I'd like to embed a C compiler here that will desugar all place operations into explicit ones.
stay tuned for that!