Type System Forests + Tunnels + Cliques
One of the easiest graph structures to reason about has two properties:
- It’s a directed, acyclic graph
- Each node has only one outgoing edge
This means that from any node, if I want to “move” from it, I can make exactly one choice.
I wasn’t aware that the name for this structure is a polytree.
When working with a type system I like to think about primitives as little as possible — my mental energy should be applied toward solving problems and not appeasing the capricious, pedantic overlords, governing from the compiler. Of course, it’s crucial to be able to control the type system for the sake of optimization and semantics; I want it to serve me, rather than the other way around.
You see where I’m going here: a type system modeled as polytrees is easy to build an intuition for and somewhat effortlessly work on top of. If you have, let’s say, a 16-bit signed integer and you want to add it to a 32-bit signed integer, it’s very simple for the compiler to understand what you meant — you almost certainly want the result to be a 32-bit signed integer.
There are, of course, types whose only commensurability is that they’re represented as bits. This means our types shouldn’t be expressed as a polytree, but instead a set — or forest — of polytrees. There is no natural correspondence between GIS data and intervals of time; that’s fine and natural. It obeys our intuition.
However, you might want to do something that the compiler thinks is a little goofy in moving between types — let’s say turning a signed integer into a date. There likely needs to be an interface there for the conversion because it necessarily has some explicit and potentially unintuitive semantics — for this we want to build paths between our polytrees. We don’t want anyone to unthinkingly traverse the path, but if they want to go that way, we want to allow it. I think of these as tunnels between the trees in the forests.
Imagine a cute image of a squirrel bounding along a tunnel between two trees
To summarize, you can implicitly traverse a type’s “native” polytree freely (where traversal is only among outgoing edges, remember), and can move to another polytree (or along your native tree’s incoming edges) only explicitly with a command. This structure is easy to reason about, easy to build intuition around, but doesn’t sacrifice expressivity.
This is largely how type systems work and I’m sure I’ve belabored a point obvious to most.
Examples
In Rust:
From
moves within a polytree (i64::from(0i32)
)TryFrom
moves between polytrees (i32::try_from(0i64)
)
In Postgres:
- Implicit casts move within a polytree (
1::int + 2::bigint
produces abigint
) - Explicit casts move between polytrees (
1::bool
)
However, Postgres is not the most sterling example, which we’ll focus on for the rest of this post.
PostgreSQL’s string clique
In PostgreSQL, each type belongs to a category, which ideally represents the polytree relationship I’ve been discussing, which neatly expresses and encapsulates the relationship between types, from the perspective of their ability to be cast to one another. Moving between categories requires an explicit cast, i.e. they’re obviously polyetrees.
What spurned me into writing this post, though, is one notable, annoying exception to the “connected
polytree forest” design: TEXT
, CHAR
, and VARCHAR
. All three types have very slightly different
semantics, though they all aim to represent strings of text. Their differences don’t seem as
material as this fact: all three can be implicitly cast to one another. They form a clique rather
than a tree.
We could dive into the PostgreSQL code to find this, but it’s relatively simple to observe this from
the API itself using COALESCE
, which requires that there is a “best common type” among the
parameters, i.e. there is a common type which all types can be cast to implicitly.
First some proof about what COALESCE
expects:
SELECT COALESCE(INTERVAL '1m', DATE '2000-03-04');
ERROR: COALESCE types interval and date cannot be matched
Now, this madness:
SELECT COALESCE('a'::TEXT, 'a'::CHAR);
a
SELECT COALESCE('a'::TEXT, 'a'::VARCHAR);
a
SELECT COALESCE('a'::CHAR, 'a'::TEXT);
a
SELECT COALESCE('a'::VARCHAR, 'a'::TEXT);
a
We can also inspect the types of the return values:
SELECT pg_typeof(COALESCE('a'::TEXT, 'a'::CHAR));
text
SELECT pg_typeof(COALESCE('a'::TEXT, 'a'::CHAR));
text
SELECT pg_typeof(COALESCE('a'::CHAR, 'a'::TEXT));
character
SELECT pg_typeof(COALESCE('a'::VARCHAR, 'a'::TEXT));
character varying
This shows us that COALESCE
’s return type is dependent on the order of the inputs. A quick sketch
of the algorithm that powers COALESCE
’s type selection then is like this:
fn best_common_type(types: &[Types]) -> Option<Type> {
let (mut candidate, rest) = types.split_first()?;
for t in rest {
// Check the directionality of the casts
let candidate_to_t = can_implicitly_cast_from_to(candidate, t);
let t_to_candidate = can_implicitly_cast_from_to(t, candidate);
if
// Types are incommensurate
!(candidate_to_t || t_to_candidate) {
return None;
} else if
// t is further up the hierarchy than candidate
candidate_to_t && !t_to_candidate {
candidate = t;
}
}
return Some(candidate);
}
This bias toward the first element in types feels unsatisfying to me: this means that our ability to determine the best common types between sets is non-commutative.
Further, this code only works this way to account for the clique behavior among the text
-like
types; otherwise we’d be able to write something a little more elegant:
fn best_common_type(types: &[Types]) -> Option<Type> {
types
.iter()
.try_reduce(|best_common_type, new| highest_in_polytree(best_common_type, new))
}
For instance, imagining we use this same algorithm to determine the schema of a UNION (hint: PostgreSQL does); the resulting schema of a UNION b differs from b UNION a.
SELECT
pg_typeof(x)
FROM
(
SELECT x FROM (VALUES ('a'::TEXT)) AS a (x)
UNION SELECT * FROM (VALUES ('a'::CHAR)) AS b (x)
) AS o;
----
text
SELECT
pg_typeof(x)
FROM
(
SELECT x FROM (VALUES ('a'::CHAR)) AS a (x)
UNION SELECT * FROM (VALUES ('a'::TEXT)) AS b (x)
) AS o;
----
character
This represents an unexpected, unintuitive corner of PG’s type system: a simple fix of which would
have been to linearize the relationship between the types as CHAR
-> VARCHAR
-> TEXT
.
So what?
In my first draft of this post, I left things here–but it nagged me that I hadn’t really gotten to the heart of the problem, and that is going to take a little more digging and context.
We’re going to be looking at types and typemods, so we’re going to make a little utility to make understanding how PG sees these types a little simpler.
tl;dr: take (table name, colum name)
and show us the type PG sees.
CREATE FUNCTION inspect_type(IN TEXT, IN TEXT)
RETURNS TEXT
LANGUAGE SQL
IMMUTABLE
RETURNS NULL ON NULL INPUT
AS $$SELECT
format_type(id, typmod)
FROM
(
SELECT
atttypid AS id, atttypmod AS typmod
FROM
pg_attribute AS att
JOIN pg_class AS class ON att.attrelid = class.oid
WHERE
class.relname = $1 AND att.attname = $2
)
AS r;$$;
That out of the way: let’s start by understanding what the differences are between these types are, which I’m just copying from PG:
Type | Use |
---|---|
VARCHAR(n) |
variable-length with limit of n |
CHAR(n) |
fixed-length (n ), blank padded |
TEXT |
variable unlimited length |
One important complexity that this table elides is that CHAR
equality also ignores the blank
padding on the end of characters; this differs from the other types, which perform a byte-for-byte
equality check.
Ok, now here comes the wrinkle:
Without applying a typmod
(the bit after the type name in parens):
-
VARCHAR
is exactly the same asTEXT
, i.e. it is of unlimited length. This is reflected by the absence of a typmod, as you’d expect.CREATE TABLE v (no_typmod VARCHAR, typmod VARCHAR(2)); SELECT inspect_type('v', 'no_typmod'); ---- character varying SELECT inspect_type('v', 'typmod'); ---- character varying(2)
-
CHAR
defaults to(1)
, i.e. one character.CREATE TABLE c (no_typmod CHAR, typmod CHAR(2)); SELECT inspect_type('c', 'no_typmod'); ---- character(1) SELECT inspect_type('c', 'typmod'); ---- character(2)
Note that TEXT
doesn’t take typmods; no big deal.
So what if we create a relation with an implicitly-defined CHAR
type?
CREATE TABLE implicit_char AS SELECT 'abc'::CHAR AS v;
SELECT inspect_type('implicit_char', 'v');
---
character(1)
Sane. Now, what if we do something that invokes the best common type algorithm?
CREATE TABLE wtf AS
SELECT
x
FROM
(
SELECT x FROM (VALUES ('abcdef'::CHAR(2))) AS a (x)
UNION SELECT * FROM (VALUES ('abcdef'::TEXT)) AS b (x)
)
AS o;
SELECT * FROM wtf;
----
ab
abcdef
Now had you guessed that the results would have been two rows ab
, we’d be in agreement; but it
seems like we somehow got a type of CHAR
that doesn’t have a typmod applied to it?
SELECT inspect_type('wtf', 'x');
----
bpchar
BPCHAR
! What is that? According to a throwaway line in PG’s docs:
bpchar
(“blank-padded char”, the internal name of the character data type)
Ok, well if it’s the internal representation of the blank-padded character type, then maybe it…adds blank padding to the characters and…set the width to the maximum of the values?
SELECT length(x) FROM wtf;
----
1
2
No? Sure.
In reality, BPCHAR
winds up being essentially an alias for TEXT
except for the fact that it
uses CHAR
-like equality, which ignores trailing white space.
First, let’s insert a value equal to one of our other values but with some trailing whitespace:
INSERT INTO wtf VALUES ('ab ');
Now this query’s a little tortured, but it shows values that are considered equal, but whose lengths are not equal.
SELECT
l.x, octet_length(l.x), r.x, octet_length(r.x), l.x = r.x
FROM
wtf AS l, wtf AS r
WHERE
l.x = r.x AND octet_length(l.x) != octet_length(r.x);
----
ab|2|ab |7|t
ab |7|ab|2|t
We have to use octet_length
because length(bpchar)
ignores whitespace, natch.
The real problem
Now what would happen if we had changed the order of the statements in the UNION
used to create
wtf
? Does the non-commutativity matter at all?
CREATE TABLE wtf_non_commutative AS
SELECT
x
FROM
(
SELECT * FROM (VALUES ('abcdef'::TEXT)) AS b (x)
UNION SELECT x FROM (VALUES ('abcdef'::CHAR(2))) AS a (x)
)
AS o;
SELECT * FROM wtf_non_commutative;
----
ab
abcdef
SELECT inspect_type('wtf_non_commutative', 'x');
----
text
So if we perform the same type of equality check on this new text
-like table, we instead get no
results.
SELECT
COUNT(*)
FROM
wtf_non_commutative AS l, wtf_non_commutative AS r
WHERE
l.x = r.x AND octet_length(l.x) != octet_length(r.x);
----
0
Your tables' equality semantics are impacted by the type system’s non-commutativity. Most notably, this changes the behavior of filter and joins–absolutely essential parts of using a database. And the changes are subtle, making them very easy to miss entirely or misunderstand.
Rationalizing your way through this takes a long time and is, for most users, incredibly uninteresting. Keeping these minute details and differences in mind when trying to write business logic annoys and taxes people.
So I’ll implore you: next time you’re designing a type system, keep polytrees in mind.