2023-03-18

Type System Forests + Tunnels + Cliques

One of the easiest graph structures to reason about has two properties:

It’s a directed, acyclic graph
Each node has only one outgoing edge

This means that from any node, if I want to “move” from it, I can make exactly one choice.

I wasn’t aware that the name for this structure is a polytree.

When working with a type system I like to think about primitives as little as possible — my mental energy should be applied toward solving problems and not appeasing the capricious, pedantic overlords, governing from the compiler. Of course, it’s crucial to be able to control the type system for the sake of optimization and semantics; I want it to serve me, rather than the other way around.

You see where I’m going here: a type system modeled as polytrees is easy to build an intuition for and somewhat effortlessly work on top of. If you have, let’s say, a 16-bit signed integer and you want to add it to a 32-bit signed integer, it’s very simple for the compiler to understand what you meant — you almost certainly want the result to be a 32-bit signed integer.

There are, of course, types whose only commensurability is that they’re represented as bits. This means our types shouldn’t be expressed as a polytree, but instead a set — or forest — of polytrees. There is no natural correspondence between GIS data and intervals of time; that’s fine and natural. It obeys our intuition.

However, you might want to do something that the compiler thinks is a little goofy in moving between types — let’s say turning a signed integer into a date. There likely needs to be an interface there for the conversion because it necessarily has some explicit and potentially unintuitive semantics — for this we want to build paths between our polytrees. We don’t want anyone to unthinkingly traverse the path, but if they want to go that way, we want to allow it. I think of these as tunnels between the trees in the forests.

Imagine a cute image of a squirrel bounding along a tunnel between two trees

To summarize, you can implicitly traverse a type’s “native” polytree freely (where traversal is only among outgoing edges, remember), and can move to another polytree (or along your native tree’s incoming edges) only explicitly with a command. This structure is easy to reason about, easy to build intuition around, but doesn’t sacrifice expressivity.

This is largely how type systems work and I’m sure I’ve belabored a point obvious to most.

Examples

In Rust:

From moves within a polytree (i64::from(0i32))
TryFrom moves between polytrees (i32::try_from(0i64))

In Postgres:

Implicit casts move within a polytree (1::int + 2::bigint produces a bigint)
Explicit casts move between polytrees (1::bool)

However, Postgres is not the most sterling example, which we’ll focus on for the rest of this post.

PostgreSQL’s string clique

In PostgreSQL, each type belongs to a category, which ideally represents the polytree relationship I’ve been discussing, which neatly expresses and encapsulates the relationship between types, from the perspective of their ability to be cast to one another. Moving between categories requires an explicit cast, i.e. they’re obviously polyetrees.

What spurned me into writing this post, though, is one notable, annoying exception to the “connected polytree forest” design: TEXT, CHAR, and VARCHAR. All three types have very slightly different semantics, though they all aim to represent strings of text. Their differences don’t seem as material as this fact: all three can be implicitly cast to one another. They form a clique rather than a tree.

We could dive into the PostgreSQL code to find this, but it’s relatively simple to observe this from the API itself using COALESCE, which requires that there is a “best common type” among the parameters, i.e. there is a common type which all types can be cast to implicitly.

First some proof about what COALESCE expects:

SELECT COALESCE(INTERVAL '1m', DATE '2000-03-04');
ERROR:  COALESCE types interval and date cannot be matched

Now, this madness:

SELECT COALESCE('a'::TEXT, 'a'::CHAR);
a
SELECT COALESCE('a'::TEXT, 'a'::VARCHAR);
a
SELECT COALESCE('a'::CHAR, 'a'::TEXT);
a
SELECT COALESCE('a'::VARCHAR, 'a'::TEXT);
a

We can also inspect the types of the return values:

SELECT pg_typeof(COALESCE('a'::TEXT, 'a'::CHAR));
text
SELECT pg_typeof(COALESCE('a'::TEXT, 'a'::CHAR));
text
SELECT pg_typeof(COALESCE('a'::CHAR, 'a'::TEXT));
character
SELECT pg_typeof(COALESCE('a'::VARCHAR, 'a'::TEXT));
character varying

This shows us that COALESCE’s return type is dependent on the order of the inputs. A quick sketch of the algorithm that powers COALESCE’s type selection then is like this:

fn best_common_type(types: &[Types]) -> Option<Type> {
    let (mut candidate, rest) = types.split_first()?;

    for t in rest {
        // Check the directionality of the casts
        let candidate_to_t = can_implicitly_cast_from_to(candidate, t);
        let t_to_candidate = can_implicitly_cast_from_to(t, candidate);

        if
        // Types are incommensurate
        !(candidate_to_t || t_to_candidate) {
            return None;
        } else if
        // t is further up the hierarchy than candidate
        candidate_to_t && !t_to_candidate {
            candidate = t;
        }
    }

    return Some(candidate);
}

This bias toward the first element in types feels unsatisfying to me: this means that our ability to determine the best common types between sets is non-commutative.

Further, this code only works this way to account for the clique behavior among the text-like types; otherwise we’d be able to write something a little more elegant:

fn best_common_type(types: &[Types]) -> Option<Type> {
        types
            .iter()
            .try_reduce(|best_common_type, new| highest_in_polytree(best_common_type, new))
    }

For instance, imagining we use this same algorithm to determine the schema of a UNION (hint: PostgreSQL does); the resulting schema of a UNION b differs from b UNION a.

SELECT
 pg_typeof(x)
FROM
 (
  SELECT x FROM (VALUES ('a'::TEXT)) AS a (x)
  UNION SELECT * FROM (VALUES ('a'::CHAR)) AS b (x)
 ) AS o;
----
text

SELECT
 pg_typeof(x)
FROM
 (
  SELECT x FROM (VALUES ('a'::CHAR)) AS a (x)
  UNION SELECT * FROM (VALUES ('a'::TEXT)) AS b (x)
 ) AS o;
----
character

This represents an unexpected, unintuitive corner of PG’s type system: a simple fix of which would have been to linearize the relationship between the types as CHAR -> VARCHAR -> TEXT.

So what?

In my first draft of this post, I left things here–but it nagged me that I hadn’t really gotten to the heart of the problem, and that is going to take a little more digging and context.

We’re going to be looking at types and typemods, so we’re going to make a little utility to make understanding how PG sees these types a little simpler.

tl;dr: take (table name, colum name) and show us the type PG sees.

CREATE FUNCTION inspect_type(IN TEXT, IN TEXT)
	RETURNS TEXT
	LANGUAGE SQL
	IMMUTABLE
	RETURNS NULL ON NULL INPUT
	AS $$SELECT
	format_type(id, typmod)
FROM
	(
		SELECT
			atttypid AS id, atttypmod AS typmod
		FROM
			pg_attribute AS att
			JOIN pg_class AS class ON att.attrelid = class.oid
		WHERE
			class.relname = $1 AND att.attname = $2
	)
		AS r;$$;

That out of the way: let’s start by understanding what the differences are between these types are, which I’m just copying from PG:

Type	Use
`VARCHAR(n)`	variable-length with limit of `n`
`CHAR(n)`	fixed-length (`n`), blank padded
`TEXT`	variable unlimited length

One important complexity that this table elides is that CHAR equality also ignores the blank padding on the end of characters; this differs from the other types, which perform a byte-for-byte equality check.

Ok, now here comes the wrinkle:

Without applying a typmod (the bit after the type name in parens):

VARCHAR is exactly the same as TEXT, i.e. it is of unlimited length. This is reflected by the absence of a typmod, as you’d expect.

CREATE TABLE v (no_typmod VARCHAR, typmod VARCHAR(2));

SELECT inspect_type('v', 'no_typmod');
----
character varying

SELECT inspect_type('v', 'typmod');
----
character varying(2)

CHAR defaults to (1), i.e. one character.

CREATE TABLE c (no_typmod CHAR, typmod CHAR(2));

SELECT inspect_type('c', 'no_typmod');
----
character(1)

SELECT inspect_type('c', 'typmod');
----
character(2)

Note that TEXT doesn’t take typmods; no big deal.

So what if we create a relation with an implicitly-defined CHAR type?

CREATE TABLE implicit_char AS SELECT 'abc'::CHAR AS v;

SELECT inspect_type('implicit_char', 'v');
---
character(1)

Sane. Now, what if we do something that invokes the best common type algorithm?

CREATE TABLE wtf AS
	SELECT
		x
	FROM
		(
			SELECT x FROM (VALUES ('abcdef'::CHAR(2))) AS a (x)
			UNION SELECT * FROM (VALUES ('abcdef'::TEXT)) AS b (x)
		)
			AS o;

SELECT * FROM wtf;
----
ab
abcdef

Now had you guessed that the results would have been two rows ab, we’d be in agreement; but it seems like we somehow got a type of CHAR that doesn’t have a typmod applied to it?

SELECT inspect_type('wtf', 'x');
----
bpchar

BPCHAR! What is that? According to a throwaway line in PG’s docs:

bpchar (“blank-padded char”, the internal name of the character data type)

Ok, well if it’s the internal representation of the blank-padded character type, then maybe it…adds blank padding to the characters and…set the width to the maximum of the values?

SELECT length(x) FROM wtf;
----
1
2

No? Sure.

In reality, BPCHAR winds up being essentially an alias for TEXT except for the fact that it uses CHAR-like equality, which ignores trailing white space.

First, let’s insert a value equal to one of our other values but with some trailing whitespace:

INSERT INTO wtf VALUES ('ab     ');

Now this query’s a little tortured, but it shows values that are considered equal, but whose lengths are not equal.

SELECT
	l.x, octet_length(l.x), r.x, octet_length(r.x), l.x = r.x
FROM
	wtf AS l, wtf AS r
WHERE
	l.x = r.x AND octet_length(l.x) != octet_length(r.x);
----
ab|2|ab     |7|t
ab     |7|ab|2|t

We have to use octet_length because length(bpchar) ignores whitespace, natch.

The real problem

Now what would happen if we had changed the order of the statements in the UNION used to create wtf? Does the non-commutativity matter at all?

CREATE TABLE wtf_non_commutative AS
	SELECT
		x
	FROM
		(
      SELECT * FROM (VALUES ('abcdef'::TEXT)) AS b (x)
			UNION SELECT x FROM (VALUES ('abcdef'::CHAR(2))) AS a (x)
		)
			AS o;

SELECT * FROM wtf_non_commutative;
----
ab
abcdef

SELECT inspect_type('wtf_non_commutative', 'x');
----
text

So if we perform the same type of equality check on this new text-like table, we instead get no results.

SELECT
	COUNT(*)
FROM
	wtf_non_commutative AS l, wtf_non_commutative AS r
WHERE
	l.x = r.x AND octet_length(l.x) != octet_length(r.x);
----
0

Your tables' equality semantics are impacted by the type system’s non-commutativity. Most notably, this changes the behavior of filter and joins–absolutely essential parts of using a database. And the changes are subtle, making them very easy to miss entirely or misunderstand.

Rationalizing your way through this takes a long time and is, for most users, incredibly uninteresting. Keeping these minute details and differences in mind when trying to write business logic annoys and taxes people.

So I’ll implore you: next time you’re designing a type system, keep polytrees in mind.