As much as I like Go, I sometimes really miss dynamically-typed languages. I find Go tedious when I have to parse some arbitrary text format or lines from stdin, which would be trivial with Python, Ruby or Perl.
Parsing JSON with Go is easy – just unmarshal it right into a struct. I wanted to unmarshal a line of text the same way, so I wrote a library, strum, to do exactly that.
The first part of this article demonstrates strum. It’s short! If you’re not interested in learning Go reflection, you can stop there.
The second part explores strum internals. I explain how it uses Go’s reflect library, covering input validation, type-based dispatch, interface satisfaction, pointer following, and more.
N.B. This was written before Go 1.18, generics, and the any alias.
Using strum to unmarshal text
Advent of Code gives amazing coding puzzles every December. Each puzzle has input in the form of text files. Here’s an example from Day 2 of the 2021 AoC, involving a navigation DSL:
forward 5
down 5
forward 8
up 3
down 8
forward 2
That’s a string and an int. Let’s model each line like this:
type Move struct {
	Direction string
	Distance  int
}
The text file has one move per line. We want to read the whole file and produce a slice containing all the moves. Here’s how to do that in 3 lines with strum, assuming that the input is piped via stdin.
d := strum.NewDecoder(os.Stdin)
var moves []Move
err := d.DecodeAll(&moves)
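For contrast, here is roughly what the same job looks like by hand with only the standard library. This is a sketch for comparison, not how strum is implemented; parseMoves is a hypothetical helper name.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

type Move struct {
	Direction string
	Distance  int
}

// parseMoves (hypothetical helper) reads one "direction distance" pair per
// line and converts each field by hand.
func parseMoves(r io.Reader) ([]Move, error) {
	var moves []Move
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			return nil, fmt.Errorf("expected 2 fields, got %d", len(fields))
		}
		n, err := strconv.Atoi(fields[1])
		if err != nil {
			return nil, err
		}
		moves = append(moves, Move{Direction: fields[0], Distance: n})
	}
	return moves, sc.Err()
}

func main() {
	moves, err := parseMoves(strings.NewReader("forward 5\ndown 5\nup 3\n"))
	fmt.Println(moves, err)
}
```

Every new input shape needs another function like this, with its own field counting and conversions, which is the tedium a struct-driven decoder removes.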
By default, strum splits on white space. But strum can do more advanced tokenization. Consider the input from Day 5, where each line of text represents the start and end points of a line in the Cartesian plane:
105,697 -> 287,697
705,62 -> 517,250
531,627 -> 531,730
21,268 -> 417,268
913,731 -> 271,89
With strum and a regular expression, it’s easy to turn each line into a slice of ints of the form [x1,y1,x2,y2]. The WithTokenRegexp method changes a decoder from white space tokenization to getting tokens from submatches:
re := regexp.MustCompile(`(\d+),(\d+) -> (\d+),(\d+)`)
d := strum.NewDecoder(os.Stdin).WithTokenRegexp(re)
var lines [][]int
err := d.DecodeAll(&lines)
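To make "tokens from submatches" concrete, here is a standard-library sketch of the idea: the capture groups of the regular expression become the tokens, which are then converted one by one. tokensFromRegexp is a hypothetical helper, not part of strum’s API.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// tokensFromRegexp (hypothetical helper) returns the capture groups of re
// found in line, or an error if the line does not match.
func tokensFromRegexp(re *regexp.Regexp, line string) ([]string, error) {
	m := re.FindStringSubmatch(line)
	if m == nil {
		return nil, fmt.Errorf("line %q does not match %s", line, re)
	}
	return m[1:], nil // m[0] is the whole match; the rest are submatches
}

func main() {
	re := regexp.MustCompile(`(\d+),(\d+) -> (\d+),(\d+)`)
	tokens, err := tokensFromRegexp(re, "105,697 -> 287,697")
	if err != nil {
		panic(err)
	}
	var coords []int
	for _, tok := range tokens {
		n, err := strconv.Atoi(tok)
		if err != nil {
			panic(err)
		}
		coords = append(coords, n)
	}
	fmt.Println(coords)
}
```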
For AoC, that’s probably enough. But what if we wanted to get fancy and model each line as a struct of two points? strum can do that, too, though it takes some extra indirection.
strum converts a string into tokens, then decodes each token into a Go variable based on the variable’s type. If a variable implements the encoding.TextUnmarshaler interface, strum uses that to convert the token. Let’s model a Line as follows:
type Line struct {
	Start *Point
	End   *Point
}
The fields are *Point, not Point, so that we can implement TextUnmarshaler on *Point. The WithSplitOn method tokenizes on something other than whitespace, so we can split on " -> " to get tokens for each point, and then rely on the TextUnmarshaler to turn tokens like "105,697" into a *Point struct.
d := strum.NewDecoder(os.Stdin).WithSplitOn(" -> ")
var lines []Line
err := d.DecodeAll(&lines)
All that remains is defining Point and implementing its TextUnmarshaler. We can use strum again, splitting on the comma to get the int fields:
type Point struct {
	X int
	Y int
}

func (p *Point) UnmarshalText(text []byte) error {
	d := strum.NewDecoder(bytes.NewBuffer(text)).WithSplitOn(",")
	var q Point
	err := d.Decode(&q)
	if err != nil {
		return err
	}
	*p = q
	return nil
}
This is a bit contrived, but it works! With strum, we can ignore text conversion details and just focus on the structure of the data we want from our input. Perfect!
How strum uses Go’s reflect library
The reflect library allows us to inspect a variable’s type and to make type-aware manipulations of the variable at runtime. Decoders like strum take interface{} as an argument and use reflection to determine how to decode to the desired type. For struct types, reflection reveals the names and types of fields.
Reflection starts with the reflect.ValueOf(v) call, which returns a reflect.Value that references v. In many ways, reflect.Value is like interface{} because it can reference any variable type. Unlike interface{}, we can call methods on it to manipulate the underlying variable. If the method is inappropriate for the underlying variable, like trying to dereference a non-pointer value, reflect generally panics.
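Here is a small sketch of that behavior. safeElem is a hypothetical wrapper that turns the panic into an error so the program can print it; it isn’t something strum provides.

```go
package main

import (
	"fmt"
	"reflect"
)

// safeElem (hypothetical wrapper) calls rv.Elem() and converts a reflect
// panic into an ordinary error.
func safeElem(rv reflect.Value) (elem reflect.Value, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("reflect panicked: %v", r)
		}
	}()
	return rv.Elem(), nil
}

func main() {
	n := 42

	// A pointer can be dereferenced with Elem().
	elem, err := safeElem(reflect.ValueOf(&n))
	fmt.Println(elem.Int(), err)

	// An int cannot, so Elem() panics and safeElem reports it.
	_, err = safeElem(reflect.ValueOf(n))
	fmt.Println(err)
}
```

In real code it is usually better to check Kind() before calling a method than to recover afterwards.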
Working with reflect.Value involves inspecting its Type (declared type) or Kind (underlying, built-in representation) and calling type-appropriate methods.
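The difference between the two is easiest to see with a named type. In this sketch, Celsius and describe are made-up names for illustration.

```go
package main

import (
	"fmt"
	"reflect"
	"time"
)

// Celsius is a named type whose underlying representation is float64.
type Celsius float64

// describe (hypothetical helper) formats a value's Type and Kind as "type/kind".
func describe(v interface{}) string {
	rv := reflect.ValueOf(v)
	return fmt.Sprintf("%s/%s", rv.Type(), rv.Kind())
}

func main() {
	fmt.Println(describe(Celsius(21.5)))  // named type, float64 kind
	fmt.Println(describe(time.Second))    // time.Duration is an int64 underneath
	fmt.Println(describe([]string{"a"})) // unnamed slice type, slice kind
}
```

Many types can share one Kind, which is why a decoder that wants special behavior for time.Duration has to look at the Type before falling back to the Kind.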
In this walkthrough, I’m going to focus on the Decode method for a single line of text, rather than the DecodeAll method I showed above, but the principles are the same. I’ll be pulling out excerpts focusing on the use of reflection rather than going line-by-line, but you can see all the code in the strum repo if that helps you.
The signature for Decode is this:
func (d *Decoder) Decode(v interface{}) error
Decode must verify that it has received a non-nil pointer.
Why? Because we need it to store the decoded result somewhere. This check is trickier than it seems because interface{} can hold either a literal nil or a typed pointer that is nil, and only the latter yields a valid reflect.Value. If the reflect.Value is actually a pointer, we can check it with IsNil() and dereference it with Elem().
I factored that logic out into a function for reuse that takes interface{} and returns the reflect.Value of the variable that v points to:
func extractDestValue(v interface{}) (reflect.Value, error) {
	if v == nil {
		return reflect.Value{}, fmt.Errorf("argument must be a non-nil pointer")
	}
	argValue := reflect.ValueOf(v)
	if argValue.Kind() != reflect.Ptr {
		return reflect.Value{}, fmt.Errorf("argument must be a pointer, not %s", argValue.Kind())
	}
	if argValue.IsNil() {
		return reflect.Value{}, fmt.Errorf("argument must be a non-nil pointer")
	}
	return argValue.Elem(), nil
}
That return value is either the variable that will receive a decoded value or it’s a pointer that we may need to follow recursively later.
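To see each branch fire, this sketch repeats the function and feeds it the four interesting inputs: a literal nil, a non-pointer, a typed nil pointer, and a valid pointer. The main function is only a demonstration harness.

```go
package main

import (
	"fmt"
	"reflect"
)

func extractDestValue(v interface{}) (reflect.Value, error) {
	if v == nil {
		return reflect.Value{}, fmt.Errorf("argument must be a non-nil pointer")
	}
	argValue := reflect.ValueOf(v)
	if argValue.Kind() != reflect.Ptr {
		return reflect.Value{}, fmt.Errorf("argument must be a pointer, not %s", argValue.Kind())
	}
	if argValue.IsNil() {
		return reflect.Value{}, fmt.Errorf("argument must be a non-nil pointer")
	}
	return argValue.Elem(), nil
}

func main() {
	var n int
	var nilPtr *int

	for _, arg := range []interface{}{nil, n, nilPtr, &n} {
		rv, err := extractDestValue(arg)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		// Elem() of a pointer is addressable, so the result can be assigned to.
		fmt.Println("ok:", rv.Kind(), "settable:", rv.CanSet())
	}
}
```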
Decode must be type-aware.
To decide what rules to use for that variable, we need to understand what kind of variable it is. There are three categories of interest:
- Types for which Decode has special handling (e.g. time.Time)
- Types that implement TextUnmarshaler
- Everything else, for which we switch based on the underlying, unnamed, built-in variable type, known as a Kind.
Given a reflect.Value, rv, the type is provided by rv.Type(), which returns a reflect.Type. That can be matched against a pre-computed type like so:
var durationType = reflect.TypeOf(time.Duration(0))
var timeType = reflect.TypeOf(time.Time{})

switch rv.Type() {
case durationType:
	// ...
case timeType:
	// ...
}
It’s a little more complicated to check if a variable’s type implements an interface like TextUnmarshaler. We need to make a nil pointer to the interface, get the reflect.Type of that, then get the interface type with Elem(). Once we have that, we can check if a value implements the interface with Implements():
var textUnmarshalerType = reflect.TypeOf((*encoding.TextUnmarshaler)(nil)).Elem()

func isTextUnmarshaler(rv reflect.Value) bool {
	return rv.Type().Implements(textUnmarshalerType)
}
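One consequence worth seeing in action: when UnmarshalText has a pointer receiver, only the pointer type implements the interface. The sketch below uses a stand-in Point that parses with fmt.Sscanf (an assumption made to keep the example free of dependencies; the earlier example used strum).

```go
package main

import (
	"encoding"
	"fmt"
	"reflect"
)

type Point struct{ X, Y int }

// UnmarshalText has a pointer receiver, so it is in the method set of
// *Point but not of Point.
func (p *Point) UnmarshalText(text []byte) error {
	_, err := fmt.Sscanf(string(text), "%d,%d", &p.X, &p.Y)
	return err
}

var textUnmarshalerType = reflect.TypeOf((*encoding.TextUnmarshaler)(nil)).Elem()

func isTextUnmarshaler(rv reflect.Value) bool {
	return rv.Type().Implements(textUnmarshalerType)
}

func main() {
	var p Point
	fmt.Println(isTextUnmarshaler(reflect.ValueOf(p)))  // Point
	fmt.Println(isTextUnmarshaler(reflect.ValueOf(&p))) // *Point
}
```

This is the reflection-level reason the Line struct earlier uses *Point fields rather than Point.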
Fortunately, switching on the reflect.Kind is straightforward:
switch rv.Kind() {
case reflect.Bool:
	// ...
case reflect.String:
	// ...
}
Whenever strum needs type-aware behaviors, we check for specific types with custom behavior first, then for TextUnmarshaler, then fall back to Kind.
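Put together, that ordering can be sketched as a single function. decodeStrategy is a hypothetical name, and it returns a label instead of decoding anything, purely to show which rule wins for a given value.

```go
package main

import (
	"encoding"
	"fmt"
	"net"
	"reflect"
	"time"
)

var (
	durationType        = reflect.TypeOf(time.Duration(0))
	timeType            = reflect.TypeOf(time.Time{})
	textUnmarshalerType = reflect.TypeOf((*encoding.TextUnmarshaler)(nil)).Elem()
)

// decodeStrategy (hypothetical helper) names the rule a decoder would pick:
// specific types first, then TextUnmarshaler, then the Kind.
func decodeStrategy(rv reflect.Value) string {
	switch rv.Type() {
	case durationType, timeType:
		return "special type " + rv.Type().String()
	}
	if rv.Type().Implements(textUnmarshalerType) {
		return "TextUnmarshaler"
	}
	return "kind " + rv.Kind().String()
}

func main() {
	fmt.Println(decodeStrategy(reflect.ValueOf(time.Second))) // matched by type, not by its int64 kind
	fmt.Println(decodeStrategy(reflect.ValueOf(&net.IP{})))   // *net.IP has an UnmarshalText method
	fmt.Println(decodeStrategy(reflect.ValueOf(int8(1))))     // nothing special, so use the kind
}
```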
Decode treats single and compound destinations differently.
There are two main places where strum is type-aware. The first is determining the destination type to decide whether to tokenize and how many tokens are allowed:
- A slice or a struct – multiple tokens are allowed
- A string – no tokenization; the string gets the whole line
- Everything else – a single token is allowed
But there’s a special case to consider. What if the destination is a pointer? That could happen if someone passes Decode a pointer to a pointer, like this:
var m *Move
d.Decode(&m)
In this case, strum recursively dereferences the pointer with Elem() to find the destination. But note in the example above that m is a nil pointer. (This is not a problem for Decode itself – it receives the non-nil pointer to the nil pointer.)
In order to dereference the pointer, we must make sure it’s non-nil. To do that, we make use of the reflect.New() function, which given a type returns a pointer to a new zero value of that type. Again, a tricky bit with reflect is that we have a pointer, rv, but New() wants the type that rv points to. We can get that via rv.Type().Elem(). Once we have the new pointer, we can assign it with Set():
func maybeInstantiatePtr(rv reflect.Value) {
	if rv.Kind() == reflect.Ptr && rv.IsNil() {
		np := reflect.New(rv.Type().Elem())
		rv.Set(np)
	}
}
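Here is a sketch of how that helper combines with Elem() to follow a chain of pointers. followPtrs is a hypothetical name for the loop; strum’s actual code is organized differently.

```go
package main

import (
	"fmt"
	"reflect"
)

type Move struct {
	Direction string
	Distance  int
}

func maybeInstantiatePtr(rv reflect.Value) {
	if rv.Kind() == reflect.Ptr && rv.IsNil() {
		np := reflect.New(rv.Type().Elem())
		rv.Set(np)
	}
}

// followPtrs (hypothetical helper) allocates and dereferences until it
// reaches a non-pointer value.
func followPtrs(rv reflect.Value) reflect.Value {
	for rv.Kind() == reflect.Ptr {
		maybeInstantiatePtr(rv)
		rv = rv.Elem()
	}
	return rv
}

func main() {
	var m *Move // nil, as in the example above

	// ValueOf(&m).Elem() is the settable *Move variable itself.
	dest := followPtrs(reflect.ValueOf(&m).Elem())
	dest.Field(0).SetString("forward")
	dest.Field(1).SetInt(5)

	fmt.Println(*m)
}
```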
The Set() method of reflect.Value is both magical and dangerous. It takes a reflect.Value and assigns it to the receiver as long as the receiver is settable and the types are compatible, and panics otherwise. In this case, we know it’s compatible because of how we’ve constructed np.
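When the source of a value is less certain, the conditions can be checked up front. The sketch below uses CanSet() and AssignableTo() from the reflect package; safeSet is a hypothetical guard, not something strum defines.

```go
package main

import (
	"fmt"
	"reflect"
)

// safeSet (hypothetical guard) checks the conditions under which Set()
// would panic and returns an error instead.
func safeSet(dst, src reflect.Value) error {
	if !src.IsValid() {
		return fmt.Errorf("source is not a valid value")
	}
	if !dst.CanSet() {
		return fmt.Errorf("destination is not settable")
	}
	if !src.Type().AssignableTo(dst.Type()) {
		return fmt.Errorf("cannot assign %s to %s", src.Type(), dst.Type())
	}
	dst.Set(src)
	return nil
}

func main() {
	n := 1

	// A Value made from a copy of n is not settable.
	fmt.Println(safeSet(reflect.ValueOf(n), reflect.ValueOf(2)))

	// The Elem() of a pointer to n is settable, but the types must match.
	dst := reflect.ValueOf(&n).Elem()
	fmt.Println(safeSet(dst, reflect.ValueOf("two")))
	fmt.Println(safeSet(dst, reflect.ValueOf(2)), n)
}
```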
Decode must zero structs before iterating fields.
When decoding into a struct, if there are fewer tokens than fields, we want the remaining fields to be zero-valued, not whatever existed beforehand. For example, what if a struct defined outside a loop is iteratively decoded inside the loop?
var m Move
for {
	err := d.Decode(&m)
	if err != nil {
		break
	}
	processMove(m)
}
If there aren’t enough tokens on some line for all fields of Move, then after Decode, we don’t want m to have values from the previous call to Decode.
We zero the struct with tools we’ve seen before: New(), Elem() and Set(). Remember that New() makes a pointer to a type, so we need to dereference it with Elem() to have a valid value for Set() to assign:
destType := destValue.Type()
destValue.Set(reflect.New(destType).Elem())
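A runnable version of those two lines, wrapped in a hypothetical zeroValue helper so the effect is visible:

```go
package main

import (
	"fmt"
	"reflect"
)

type Move struct {
	Direction string
	Distance  int
}

// zeroValue (hypothetical helper) resets the variable that destValue
// refers to, using the New/Elem/Set combination.
func zeroValue(destValue reflect.Value) {
	destType := destValue.Type()
	destValue.Set(reflect.New(destType).Elem())
}

func main() {
	m := Move{Direction: "up", Distance: 3}
	zeroValue(reflect.ValueOf(&m).Elem())
	fmt.Printf("%+v\n", m)
}
```

The reflect package also offers reflect.Zero(destType), which returns the zero value of a type directly, so destValue.Set(reflect.Zero(destType)) would be an equivalent way to write the reset.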
Once that’s done, we can iterate struct fields. If a reflect.Value is a struct, then Field(i) gives a reflect.Value for the ith field and NumField() tells us how many fields there are. We can loop over tokens and decode them into the corresponding field, erroring if we have more tokens than fields.
However, there’s one more problem. The Set() method will panic if applied to a non-exported field. So we also have to check if each field is exported, using the field’s reflect.StructField from the struct’s reflect.Type.
That leaves us with a loop like this for struct decoding:
numFields := destValue.NumField()
for i := range tokens {
	if i >= numFields {
		return errors.New("too many tokens")
	}
	if !destType.Field(i).IsExported() {
		return errors.New("cannot decode to unexported field")
	}
	// ... decode tokens[i] to destValue.Field(i) ...
}
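To show the loop end to end, here is a simplified, self-contained sketch. decodeStruct is a hypothetical function that handles only string and int fields, where the real code dispatches on many more types. It relies on StructField.IsExported(), which needs Go 1.17 or later.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
	"strconv"
)

type Move struct {
	Direction string
	Distance  int
}

// decodeStruct (hypothetical, simplified) zeroes the struct and then
// decodes one token per field, supporting only string and int fields.
func decodeStruct(destValue reflect.Value, tokens []string) error {
	destType := destValue.Type()
	destValue.Set(reflect.New(destType).Elem()) // zero before decoding

	numFields := destValue.NumField()
	for i, tok := range tokens {
		if i >= numFields {
			return errors.New("too many tokens")
		}
		if !destType.Field(i).IsExported() {
			return errors.New("cannot decode to unexported field")
		}
		field := destValue.Field(i)
		switch field.Kind() {
		case reflect.String:
			field.SetString(tok)
		case reflect.Int:
			n, err := strconv.ParseInt(tok, 0, field.Type().Bits())
			if err != nil {
				return err
			}
			field.SetInt(n)
		default:
			return fmt.Errorf("unsupported kind %s", field.Kind())
		}
	}
	return nil
}

func main() {
	var m Move
	err := decodeStruct(reflect.ValueOf(&m).Elem(), []string{"forward", "5"})
	fmt.Println(m, err)
}
```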
Slice unmarshaling is similar, but instead of getting a reflect.Value for a field, we have to create a new reflect.Value for an element of the slice, decode the token into that value, and then append it.
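A minimal sketch of that idea, using reflect.New, reflect.MakeSlice and reflect.Append. decodeSlice is a hypothetical function that supports only string and integer elements; it is not strum’s implementation.

```go
package main

import (
	"fmt"
	"reflect"
	"strconv"
)

// decodeSlice (hypothetical, simplified) builds a new slice with one
// element per token and assigns it to destValue.
func decodeSlice(destValue reflect.Value, tokens []string) error {
	elemType := destValue.Type().Elem()
	out := reflect.MakeSlice(destValue.Type(), 0, len(tokens))

	for _, tok := range tokens {
		// New returns a pointer; Elem gives a settable element value.
		elem := reflect.New(elemType).Elem()
		switch elem.Kind() {
		case reflect.String:
			elem.SetString(tok)
		case reflect.Int, reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64:
			n, err := strconv.ParseInt(tok, 0, elemType.Bits())
			if err != nil {
				return err
			}
			elem.SetInt(n)
		default:
			return fmt.Errorf("unsupported element kind %s", elem.Kind())
		}
		out = reflect.Append(out, elem)
	}

	destValue.Set(out)
	return nil
}

func main() {
	var coords []int
	err := decodeSlice(reflect.ValueOf(&coords).Elem(), []string{"105", "697", "287", "697"})
	fmt.Println(coords, err)
}
```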
Decode must parse tokens based on their type or kind.
The second main place where strum is type-aware is decoding a token string for a destination variable of a particular type. As before, types with special handling are preferred over types that implement TextUnmarshaler. If neither applies, then the decoding is based on the Kind.
For example, given a specially-handled type like time.Duration, we convert the token string, s, using time.ParseDuration(s) and then Set() the parsed duration into the destination value (which might be a single variable, a struct field, or element to be added to a slice). Recall that Set() takes a reflect.Value, so we can’t directly pass the parsed duration.
switch rv.Type() {
case durationType:
	t, err := time.ParseDuration(s)
	if err != nil {
		return err
	}
	rv.Set(reflect.ValueOf(t))
// ... other cases ...
}
If the destination value instead implements TextUnmarshaler, we pass the token string to the UnmarshalText() method. However, this only works on an instantiated value, so we may have to instantiate a pointer as shown earlier.
To call the method, we retrieve the method as a reflect.Value with MethodByName(). Normally – when not using reflection – if you call a method as a function, you have to pass the receiver as the first argument. But with reflection, the return value of MethodByName() already includes the receiver internally. We call a reflect.Value method with Call(), which both takes and returns []reflect.Value.
Thus, the TextUnmarshaler handling looks like this:
if isTextUnmarshaler(rv) {
	maybeInstantiatePtr(rv)
	f := rv.MethodByName("UnmarshalText")
	xs := []byte(s)
	args := []reflect.Value{reflect.ValueOf(xs)}
	ret := f.Call(args)
	if !ret[0].IsNil() {
		return ret[0].Interface().(error)
	}
	return nil
}
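For comparison, a different technique reaches the same method without MethodByName() and Call(): convert the reflect.Value back to an interface{} with Interface() and type-assert it to encoding.TextUnmarshaler. The sketch below is that alternative, not what strum does; unmarshalText is a hypothetical name and Point parses with fmt.Sscanf to stay dependency-free.

```go
package main

import (
	"encoding"
	"fmt"
	"reflect"
)

type Point struct{ X, Y int }

func (p *Point) UnmarshalText(text []byte) error {
	_, err := fmt.Sscanf(string(text), "%d,%d", &p.X, &p.Y)
	return err
}

type Line struct {
	Start *Point
	End   *Point
}

func maybeInstantiatePtr(rv reflect.Value) {
	if rv.Kind() == reflect.Ptr && rv.IsNil() {
		rv.Set(reflect.New(rv.Type().Elem()))
	}
}

// unmarshalText (hypothetical alternative) uses a type assertion instead of
// MethodByName and Call.
func unmarshalText(rv reflect.Value, s string) error {
	maybeInstantiatePtr(rv)
	u, ok := rv.Interface().(encoding.TextUnmarshaler)
	if !ok {
		return fmt.Errorf("%s does not implement encoding.TextUnmarshaler", rv.Type())
	}
	return u.UnmarshalText([]byte(s))
}

func main() {
	var l Line
	start := reflect.ValueOf(&l).Elem().Field(0) // the nil *Point field
	if err := unmarshalText(start, "105,697"); err != nil {
		panic(err)
	}
	fmt.Println(*l.Start)
}
```

The assertion gives a statically typed call and an ordinary error return, at the cost of going through Interface(), which panics for values obtained from unexported fields.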
If the destination isn’t a special type or a TextUnmarshaler, then the decoding behavior is based on the Kind. Generally, these cases are all handled similarly using strconv functions to parse the token string. Here’s an example of handling all int kinds together, using rv.Type().Bits() for the width:
case reflect.Int, reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64:
	i, err := strconv.ParseInt(s, 0, rv.Type().Bits())
	if err != nil {
		return decodingError(name, err)
	}
	rv.SetInt(i)
Unlike Set(), SetInt() and similar methods take a normal, typed variable, not a reflect.Value.
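Passing the bit width to strconv.ParseInt matters: it makes out-of-range tokens fail at parse time instead of being silently truncated. A small sketch, with setInt as a hypothetical wrapper around the case body above:

```go
package main

import (
	"fmt"
	"reflect"
	"strconv"
)

// setInt (hypothetical wrapper) parses s with the destination's bit width
// and stores the result.
func setInt(rv reflect.Value, s string) error {
	i, err := strconv.ParseInt(s, 0, rv.Type().Bits())
	if err != nil {
		return err
	}
	rv.SetInt(i)
	return nil
}

func main() {
	var small int8
	dst := reflect.ValueOf(&small).Elem()

	fmt.Println(setInt(dst, "127"), small) // fits in 8 bits
	fmt.Println(setInt(dst, "128"))        // out of range for int8

	// Base 0 lets ParseInt infer the base from a prefix such as 0x.
	var big int64
	fmt.Println(setInt(reflect.ValueOf(&big).Elem(), "0x10"), big)
}
```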
Reflection is a rare tool, but you don’t need to be afraid of it.
Reflection is complex, inefficient, and prone to panics if used incorrectly. But as I hope you can see, sometimes it’s the right tool for the job.
This article omitted some details, particularly around error handling, reusable abstractions, and testing. But strum has only about 600 lines of non-test code, so check out the strum repo if you’d like to see what I left out.
Notes:
- In a reddit comment on this article, nofeaturesonlybugs describes how to cache reflection data for better performance and links to a code example.
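As a rough illustration of the general idea (this is not the code linked from that comment), per-type reflection results can be computed once and stored in a map keyed by reflect.Type. The sketch below caches which fields of a struct are exported; fieldCache and exportedFields are hypothetical names, and the unexported note field exists only for the demonstration.

```go
package main

import (
	"fmt"
	"reflect"
	"sync"
)

type Move struct {
	Direction string
	Distance  int
	note      string // unexported, for demonstration only
}

// fieldCache maps a reflect.Type to a []bool of per-field export flags.
var fieldCache sync.Map

// exportedFields (hypothetical helper) reports, per field index, whether the
// field is exported. The answer is computed once per struct type.
func exportedFields(t reflect.Type) []bool {
	if cached, ok := fieldCache.Load(t); ok {
		return cached.([]bool)
	}
	exported := make([]bool, t.NumField())
	for i := range exported {
		exported[i] = t.Field(i).IsExported()
	}
	// LoadOrStore keeps the first stored slice if another goroutine raced us.
	actual, _ := fieldCache.LoadOrStore(t, exported)
	return actual.([]bool)
}

func main() {
	t := reflect.TypeOf(Move{})
	fmt.Println(exportedFields(t))
	fmt.Println(exportedFields(t)) // second call is served from the cache
}
```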