Unmarshal text with Go reflection

As much as I like Go, I sometimes really miss dynamically-typed languages. I find Go tedious when I have to parse some arbitrary text format or lines from stdin, which would be trivial with Python, Ruby or Perl.

Parsing JSON with Go is easy – just unmarshal it right into a struct. I wanted to unmarshal a line of text the same way, so I wrote a library, strum, to do exactly that.

The first part of this article demonstrates strum. It’s short! If you’re not interested in learning Go reflection, you can stop there.

The second part explores strum internals. I explain how it uses Go’s reflect library, covering input validation, type-based dispatch, interface satisfaction, pointer following, and more.

N.B. This was written before Go 1.18, generics, and the any alias.

Using strum to unmarshal text

Advent of Code gives amazing coding puzzles every December. Each puzzle has input in the form of text files. Here’s an example from Day 2 of the 2021 AoC, involving a navigation DSL:

forward 5
down 5
forward 8
up 3
down 8
forward 2

That’s a string and an int. Let’s model each line like this:

type Move struct {
    Direction string
    Distance  int
}

The text file has one move per line. We want to read the whole file and produce a slice containing all the moves. Here’s how to do that in 3 lines with strum, assuming that the input is piped via stdin.

d := strum.NewDecoder(os.Stdin)

var moves []Move
err := d.DecodeAll(&moves)
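
For comparison, here is roughly what those three lines replace when written by hand with only the standard library. This is a sketch, not how strum works internally; parseMoves and parseMovesString are hypothetical helper names, and the sketch assumes every line has exactly two whitespace-separated fields.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// Move mirrors the struct defined above.
type Move struct {
	Direction string
	Distance  int
}

// parseMoves (hypothetical) reads one Move per line, splitting on whitespace.
func parseMoves(r io.Reader) ([]Move, error) {
	var moves []Move
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			return nil, fmt.Errorf("expected 2 fields, got %d", len(fields))
		}
		n, err := strconv.Atoi(fields[1])
		if err != nil {
			return nil, err
		}
		moves = append(moves, Move{Direction: fields[0], Distance: n})
	}
	return moves, sc.Err()
}

// parseMovesString (hypothetical) is a convenience wrapper for string input.
func parseMovesString(s string) ([]Move, error) {
	return parseMoves(strings.NewReader(s))
}

func main() {
	moves, err := parseMovesString("forward 5\ndown 5\nup 3\n")
	fmt.Println(moves, err)
}
```

Every new input shape needs another function like this, which is the tedium that a reflection-based decoder removes.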

By default, strum splits on white space. But strum can do more advanced tokenization. Consider the input from Day 5, where each line of text represents the start and end points of a line in the Cartesian plane:

105,697 -> 287,697
705,62 -> 517,250
531,627 -> 531,730
21,268 -> 417,268
913,731 -> 271,89

With strum and a regular expression, it’s easy to turn each line into a slice of ints of the form [x1,y1,x2,y2]. The WithTokenRegexp method changes a decoder from white space tokenization to getting tokens from submatches:

re := regexp.MustCompile(`(\d+),(\d+) -> (\d+),(\d+)`)
d := strum.NewDecoder(os.Stdin).WithTokenRegexp(re)

var lines [][]int
err := d.DecodeAll(&lines)
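
To make "tokens from submatches" concrete, here is a minimal standard-library sketch of the idea. It is not strum's code; tokensFromSubmatches and parseLine are hypothetical names, and the sketch assumes a line that does not match the expression should be an error.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

var lineRE = regexp.MustCompile(`(\d+),(\d+) -> (\d+),(\d+)`)

// tokensFromSubmatches (hypothetical) returns the capture groups as tokens.
// FindStringSubmatch puts the whole match at index 0, so that entry is dropped.
func tokensFromSubmatches(re *regexp.Regexp, s string) ([]string, error) {
	m := re.FindStringSubmatch(s)
	if m == nil {
		return nil, fmt.Errorf("line %q does not match %s", s, re)
	}
	return m[1:], nil
}

// parseLine (hypothetical) converts each token to an int: [x1, y1, x2, y2].
func parseLine(s string) ([]int, error) {
	tokens, err := tokensFromSubmatches(lineRE, s)
	if err != nil {
		return nil, err
	}
	out := make([]int, 0, len(tokens))
	for _, t := range tokens {
		n, err := strconv.Atoi(t)
		if err != nil {
			return nil, err
		}
		out = append(out, n)
	}
	return out, nil
}

func main() {
	coords, err := parseLine("105,697 -> 287,697")
	fmt.Println(coords, err)
}
```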

For AoC, that’s probably enough. But what if we wanted to get fancy and model each line as a struct of two points? strum can do that, too, though it takes some extra indirection.

strum converts a string into tokens, then decodes each token into a Go variable based on the variable’s type. If a variable implements the encoding.TextUnmarshaler interface, strum uses that to convert the token. Let’s model a Line as follows:

type Line struct {
    Start *Point
    End   *Point
}

The fields are *Point, not Point, so that we can implement TextUnmarshaler on *Point. The WithSplitOn method tokenizes on something other than whitespace, so we can split on " -> " to get tokens for each point, and then rely on the TextUnmarshaler to turn tokens like "105,697" into a *Point struct.

d := strum.NewDecoder(os.Stdin).WithSplitOn(" -> ")

var lines []Line
err := d.DecodeAll(&lines)

All that remains is defining Point and implementing its TextUnmarshaler. We can use strum again, splitting on the comma to get the int fields:

type Point struct {
    X int
    Y int
}

func (p *Point) UnmarshalText(text []byte) error {
    d := strum.NewDecoder(bytes.NewBuffer(text)).WithSplitOn(",")
    var q Point
    err := d.Decode(&q)
    if err != nil {
        return err
    }
    *p = q
    return nil
}

This is a bit contrived, but it works! With strum, we can ignore text conversion details and just focus on the structure of the data we want from our input. Perfect!

How strum uses Go’s reflect library

The reflect library allows us to inspect a variable’s type and to make type-aware manipulations of the variable at runtime. Decoders like strum take interface{} as an argument and use reflection to determine how to decode to the desired type. For struct types, reflection reveals the names and types of fields.

Reflection starts with the reflect.ValueOf(v) call, which returns a reflect.Value that references v. In many ways, reflect.Value is like interface{} because it can reference any variable type. Unlike interface{}, we can call methods on it to manipulate the underlying variable. If the method is inappropriate for the underlying variable, like trying to dereference a non-pointer value, reflect generally panics. Working with reflect.Value involves inspecting its Type (declared type) or Kind (underlying, built-in representation) and calling type-appropriate methods.
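
A small sketch may help separate Type from Kind and show the panic behavior. Celsius is a hypothetical named type invented for the illustration, and typeName, kindName and elemPanics are hypothetical helpers, not part of strum.

```go
package main

import (
	"fmt"
	"reflect"
	"time"
)

// Celsius is a hypothetical named type whose underlying kind is float64.
type Celsius float64

// typeName returns the declared type of v. It assumes v is not a literal nil.
func typeName(v interface{}) string { return reflect.TypeOf(v).String() }

// kindName returns the underlying built-in kind of v.
func kindName(v interface{}) string { return reflect.ValueOf(v).Kind().String() }

// elemPanics reports whether calling Elem() on v's reflect.Value panics.
func elemPanics(v interface{}) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	reflect.ValueOf(v).Elem() // only valid for pointers and interfaces
	return false
}

func main() {
	c := Celsius(21.5)
	fmt.Println(typeName(c), kindName(c))
	fmt.Println(typeName(time.Second), kindName(time.Second))
	fmt.Println(typeName(&c), kindName(&c))
	fmt.Println(elemPanics(c), elemPanics(&c))
}
```

The named types differ, but the kinds are the plain built-ins (float64, int64, ptr), which is why code that switches on Kind handles user-defined types without knowing about them.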

In this walkthrough, I’m going to focus on the Decode method for a single line of text, rather than the DecodeAll method I showed above, but the principles are the same. I’ll be pulling out excerpts focusing on the use of reflection rather than going line-by-line, but you can see all the code in the strum repo if that helps you.

The signature for Decode is this:

func (d *Decoder) Decode(v interface{}) error

`Decode` must verify that it has received a non-nil pointer.

Why? Because we need it to store the decoded result somewhere. This check is trickier than it seems because interface{} can hold either a literal nil or a typed pointer that is nil, and only the latter yields a valid reflect.Value. Once we know the reflect.Value is a pointer, we can check it with IsNil() and dereference it with Elem().

I factored that logic out into a function for reuse that takes interface{} and returns the reflect.Value of the variable that v points to:

func extractDestValue(v interface{}) (reflect.Value, error) {
    if v == nil {
        return reflect.Value{}, fmt.Errorf("argument must be a non-nil pointer")
    }

    argValue := reflect.ValueOf(v)

    if argValue.Kind() != reflect.Ptr {
        return reflect.Value{}, fmt.Errorf("argument must be a pointer, not %s", argValue.Kind())
    }

    if argValue.IsNil() {
        return reflect.Value{}, fmt.Errorf("argument must be a non-nil pointer")
    }

    return argValue.Elem(), nil
}

That return value is either the variable that will receive a decoded value or it’s a pointer that we may need to follow recursively later.
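
The difference between a literal nil and a typed nil pointer is easy to see in isolation. The following sketch uses hypothetical helpers, isLiteralNil and inspect, to show what each check observes.

```go
package main

import (
	"fmt"
	"reflect"
)

// isLiteralNil reports whether the interface itself is nil.
// A typed nil pointer stored in an interface is NOT equal to nil.
func isLiteralNil(v interface{}) bool { return v == nil }

// inspect (hypothetical) summarizes what reflect sees for v.
func inspect(v interface{}) string {
	rv := reflect.ValueOf(v)
	if !rv.IsValid() {
		// reflect.ValueOf(nil) returns the zero reflect.Value.
		return "invalid (zero reflect.Value)"
	}
	if rv.Kind() == reflect.Ptr && rv.IsNil() {
		return "nil pointer of type " + rv.Type().String()
	}
	return "usable " + rv.Kind().String()
}

func main() {
	var p *int
	n := 7
	fmt.Println(isLiteralNil(nil), inspect(nil))
	fmt.Println(isLiteralNil(p), inspect(p))
	fmt.Println(isLiteralNil(&n), inspect(&n))
}
```

This is why extractDestValue needs both the v == nil test and the later IsNil() test: each one catches a case the other misses.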

`Decode` must be type-aware.

To decide what rules to use for that variable, we need to understand what kind of variable it is. There are three categories of interest:

Types for which Decode has special handling (e.g. time.Time)
Types that implement TextUnmarshaler
Everything else, for which we switch based on the underlying, unnamed, built-in variable type, known as a Kind.

Given a reflect.Value, rv, the type is provided by rv.Type(), which returns a reflect.Type. That can be matched against a pre-computed type like so:

var durationType = reflect.TypeOf(time.Duration(0))
var timeType = reflect.TypeOf(time.Time{})

switch rv.Type() {
case durationType:
    // ...
case timeType:
    // ...
}

It’s a little more complicated to check if a variable’s type implements an interface like TextUnmarshaler. We need to make a nil pointer to the interface, get the reflect.Type of that, then get the interface type with Elem(). Once we have that, we can check if a value implements the interface with Implements():

var textUnmarshalerType =
    reflect.TypeOf((*encoding.TextUnmarshaler)(nil)).Elem()

func isTextUnmarshaler(rv reflect.Value) bool {
    return rv.Type().Implements(textUnmarshalerType)
}

Fortunately, switching on the reflect.Kind is straightforward:

switch rv.Kind() {
case reflect.Bool:
    // ...
case reflect.String:
    // ...
}

Whenever strum needs type-aware behaviors, we check for specific types with custom behavior first, then for TextUnmarshaler, then fall back to Kind.
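
Putting the three checks together, the dispatch order can be sketched as one function. This is an illustration under assumptions, not strum's actual code: strategy and strategyOf are hypothetical names, and Level is a hypothetical type that implements TextUnmarshaler on its pointer.

```go
package main

import (
	"encoding"
	"fmt"
	"reflect"
	"time"
)

var (
	durationType        = reflect.TypeOf(time.Duration(0))
	timeType            = reflect.TypeOf(time.Time{})
	textUnmarshalerType = reflect.TypeOf((*encoding.TextUnmarshaler)(nil)).Elem()
)

// Level is a hypothetical type with UnmarshalText on the pointer receiver.
type Level int

func (l *Level) UnmarshalText(text []byte) error {
	switch string(text) {
	case "low":
		*l = 0
	case "high":
		*l = 1
	default:
		return fmt.Errorf("unknown level %q", text)
	}
	return nil
}

// strategy (hypothetical) reports which decoding path would apply to rv.
func strategy(rv reflect.Value) string {
	switch rv.Type() {
	case durationType, timeType:
		return "special type"
	}
	if rv.Type().Implements(textUnmarshalerType) {
		return "TextUnmarshaler"
	}
	return "kind " + rv.Kind().String()
}

// strategyOf is a convenience wrapper around strategy.
func strategyOf(v interface{}) string { return strategy(reflect.ValueOf(v)) }

func main() {
	var l Level
	fmt.Println(strategyOf(time.Second))
	fmt.Println(strategyOf(&l))
	fmt.Println(strategyOf(l))
	fmt.Println(strategyOf("text"))
}
```

The order matters: time.Duration has Kind int64, so without the type check it would be parsed as a plain integer. Note also that Level itself falls through to its Kind, because only *Level has the UnmarshalText method. That is the same reason the Line struct earlier uses *Point fields.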

`Decode` treats single and compound destinations differently.

There are two main places where strum is type-aware. The first is determining the destination type to decide whether to tokenize and how many tokens are allowed:

A slice or a struct – multiple tokens are allowed
A string – no tokenization; the string gets the whole line
Everything else – a single token is allowed

But there’s a special case to consider. What if the destination is a pointer? That could happen if someone passes Decode a pointer to a pointer, like this:

var m *Move
d.Decode(&m)

In this case, strum recursively dereferences the pointer to find the destination with Elem(). But note in the example above that m is a nil pointer. (This is not a problem for Decode itself – it receives the non-nil pointer to the nil pointer.)

In order to dereference the pointer, we must make sure it’s non-nil. To do that, we make use of the reflect.New() function, which, given a type, returns a reflect.Value holding a pointer to a new zero value of that type. Again, a tricky bit with reflect is that we have a pointer, rv, but New() wants the type that rv points to. We can get that via rv.Type().Elem(). Once we have the new pointer, we can assign it with Set():

func maybeInstantiatePtr(rv reflect.Value) {
    if rv.Kind() == reflect.Ptr && rv.IsNil() {
        np := reflect.New(rv.Type().Elem())
        rv.Set(np)
    }
}

The Set() method of reflect.Value is both magical and dangerous. It takes a reflect.Value and assigns it as long as the destination is settable and the types are compatible; otherwise it panics. In this case, we know it’s compatible because of how we’ve constructed np.
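
Here is a runnable sketch of following a chain of nil pointers down to the struct at the end. followPointers and decodeDemo are hypothetical names, and the loop is only one way to express the recursion described above.

```go
package main

import (
	"fmt"
	"reflect"
)

type Move struct {
	Direction string
	Distance  int
}

// maybeInstantiatePtr is the helper shown above.
func maybeInstantiatePtr(rv reflect.Value) {
	if rv.Kind() == reflect.Ptr && rv.IsNil() {
		rv.Set(reflect.New(rv.Type().Elem()))
	}
}

// followPointers (hypothetical) instantiates and dereferences pointers
// until it reaches a non-pointer destination.
func followPointers(rv reflect.Value) reflect.Value {
	for rv.Kind() == reflect.Ptr {
		maybeInstantiatePtr(rv)
		rv = rv.Elem()
	}
	return rv
}

// decodeDemo starts from a nil **Move and fills in the Move it ends up at.
func decodeDemo() Move {
	var m **Move
	dest := followPointers(reflect.ValueOf(&m).Elem())
	dest.Field(0).SetString("forward")
	dest.Field(1).SetInt(5)
	return **m
}

// canSetDirect shows that a Value made from a copy is not settable.
func canSetDirect() bool { var m *Move; return reflect.ValueOf(m).CanSet() }

// canSetViaPtr shows that a Value reached through a pointer is settable.
func canSetViaPtr() bool { var m *Move; return reflect.ValueOf(&m).Elem().CanSet() }

func main() {
	fmt.Println(decodeDemo())
	fmt.Println(canSetDirect(), canSetViaPtr())
}
```

The two CanSet helpers show why the chain has to start from the Elem() of the pointer passed to Decode: a reflect.Value built directly from a variable holds a copy, and Set() on it would panic.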

`Decode` must zero structs before iterating fields.

When decoding into a struct, if there are fewer tokens than fields, we want the remaining fields to be zero-valued, not whatever existed beforehand. For example, what if a struct defined outside a loop is iteratively decoded inside the loop?

var m Move
for {
    err := d.Decode(&m)
    if err != nil {
        break
    }
    processMove(m)
}

If there aren’t enough tokens on some line for all fields of Move, then after Decode, we don’t want m to have values from the previous call to Decode.

We zero the struct with tools we’ve seen before: New(), Elem() and Set(). Remember that New() makes a pointer to a type, so we need to dereference it with Elem() to have a valid value for Set() to assign:

destType := destValue.Type()
destValue.Set(reflect.New(destType).Elem())
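
As an aside, the reflect package also provides reflect.Zero(t), which returns the zero value of a type directly and can be passed to Set() in the same way. The sketch below uses that alternative spelling; it is not what strum does, and zero is a hypothetical helper that assumes it is given a non-nil pointer.

```go
package main

import (
	"fmt"
	"reflect"
)

type Move struct {
	Direction string
	Distance  int
}

// zero (hypothetical) resets the variable that ptr points to.
// It assumes ptr is a non-nil pointer and would panic otherwise.
func zero(ptr interface{}) {
	destValue := reflect.ValueOf(ptr).Elem()
	destValue.Set(reflect.Zero(destValue.Type()))
}

func main() {
	m := Move{Direction: "forward", Distance: 5}
	zero(&m)
	fmt.Printf("%+v\n", m)
}
```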

Once that’s done, we can iterate struct fields. If a reflect.Value is a struct, then Field(i) gives a reflect.Value for the i-th field and NumField() tells us how many fields there are. We can loop over tokens and decode them into the corresponding field, erroring if we have more tokens than fields.

However, there’s one more problem. The Set() method will panic if applied to a non-exported field. So we also have to check if each field is exported, using the reflect.StructField that the struct’s reflect.Type returns from its own Field(i) method.

That leaves us with a loop like this for struct decoding:

numFields := destValue.NumField()
for i := range tokens {
    if i >= numFields {
        return errors.New("too many tokens")
    }
    if !destType.Field(i).IsExported() {
        return errors.New("cannot decode to unexported field")
    }
    // ... decode tokens[i] to destValue.Field(i) ...
}
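
To see the zeroing and the field loop work together, here is a self-contained sketch. It is deliberately simplified and is not strum's implementation: decodeStruct is a hypothetical function that splits on whitespace and supports only string and int fields.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
	"strconv"
	"strings"
)

type Move struct {
	Direction string
	Distance  int
}

// decodeStruct (hypothetical) decodes whitespace-separated tokens into the
// exported fields of the struct that v points to.
func decodeStruct(line string, v interface{}) error {
	rv := reflect.ValueOf(v)
	if rv.Kind() != reflect.Ptr || rv.IsNil() || rv.Elem().Kind() != reflect.Struct {
		return errors.New("argument must be a non-nil pointer to a struct")
	}
	destValue := rv.Elem()
	destType := destValue.Type()
	destValue.Set(reflect.New(destType).Elem()) // zero out stale values

	numFields := destValue.NumField()
	for i, tok := range strings.Fields(line) {
		if i >= numFields {
			return errors.New("too many tokens")
		}
		if !destType.Field(i).IsExported() {
			return errors.New("cannot decode to unexported field")
		}
		field := destValue.Field(i)
		switch field.Kind() {
		case reflect.String:
			field.SetString(tok)
		case reflect.Int:
			n, err := strconv.ParseInt(tok, 0, field.Type().Bits())
			if err != nil {
				return err
			}
			field.SetInt(n)
		default:
			return fmt.Errorf("unsupported field kind %s", field.Kind())
		}
	}
	return nil
}

func main() {
	m := Move{Direction: "stale", Distance: 99}
	err := decodeStruct("up", &m)
	fmt.Printf("%+v %v\n", m, err)
}
```

With only one token on the line, Distance ends up as zero rather than keeping the stale 99, which is the behavior the zeroing step is there to guarantee.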

Slice unmarshaling is similar, but instead of getting a reflect.Value for a field, we have to create a new reflect.Value for an element of the slice, decode the token into that value, and then append it.
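
A sketch of that slice path, under the same simplifying assumptions: decodeSlice is a hypothetical function that splits on whitespace, handles only string and signed integer elements, and replaces the slice contents rather than appending to what was there.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
	"strconv"
	"strings"
)

// decodeSlice (hypothetical) decodes each token into a new element and
// appends it to the slice that v points to.
func decodeSlice(line string, v interface{}) error {
	rv := reflect.ValueOf(v)
	if rv.Kind() != reflect.Ptr || rv.IsNil() || rv.Elem().Kind() != reflect.Slice {
		return errors.New("argument must be a non-nil pointer to a slice")
	}
	destValue := rv.Elem()
	elemType := destValue.Type().Elem()
	out := reflect.MakeSlice(destValue.Type(), 0, 0)

	for _, tok := range strings.Fields(line) {
		// New returns a pointer to a zero element; Elem makes it settable.
		elem := reflect.New(elemType).Elem()
		switch elem.Kind() {
		case reflect.String:
			elem.SetString(tok)
		case reflect.Int, reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64:
			n, err := strconv.ParseInt(tok, 0, elemType.Bits())
			if err != nil {
				return err
			}
			elem.SetInt(n)
		default:
			return fmt.Errorf("unsupported element kind %s", elem.Kind())
		}
		out = reflect.Append(out, elem)
	}
	destValue.Set(out)
	return nil
}

func main() {
	var xs []int
	err := decodeSlice("105 697 287 697", &xs)
	fmt.Println(xs, err)
}
```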

`Decode` must parse tokens based on their type or kind.

The second main place where strum is type-aware is decoding a token string for a destination variable of a particular type. As before, types with special handling are preferred over types that implement TextUnmarshaler. If neither applies, then the decoding is based on the Kind.

For example, given a specially-handled type like time.Duration, we convert the token string, s, using time.ParseDuration(s) and then Set() the parsed duration into the destination value (which might be a single variable, a struct field, or element to be added to a slice). Recall that Set() takes a reflect.Value, so we can’t directly pass the parsed duration.

switch rv.Type() {
case durationType:
    t, err := time.ParseDuration(s)
    if err != nil {
        return err
    }
    rv.Set(reflect.ValueOf(t))
// ... other cases ...
}

If the destination value instead implements TextUnmarshaler, we pass the token string to the UnmarshalText() method. However, this only works on an instantiated value, so we may have to instantiate a pointer as shown earlier.

To call the method, we retrieve the method as a reflect.Value with MethodByName(). Normally – when not using reflection – if you call a method as a function, you have to pass the receiver as the first argument. But with reflection, the return value of MethodByName() already includes the receiver internally. We call a reflect.Value method with Call(), which both takes and returns []reflect.Value.

Thus, the TextUnmarshaler handling looks like this:

if isTextUnmarshaler(rv) {
    maybeInstantiatePtr(rv)
    f := rv.MethodByName("UnmarshalText")
    xs := []byte(s)
    args := []reflect.Value{reflect.ValueOf(xs)}
    ret := f.Call(args)
    if !ret[0].IsNil() {
        return ret[0].Interface().(error)
    }
    return nil
}
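
For comparison, the same call can also be made without MethodByName() by converting the reflect.Value back to an interface and type-asserting it. This is an alternative sketch, not the approach strum takes; unmarshalInto and decodePoint are hypothetical names, and this Point parses its text by hand to keep the example free of dependencies.

```go
package main

import (
	"encoding"
	"fmt"
	"reflect"
	"strconv"
	"strings"
)

type Point struct{ X, Y int }

// UnmarshalText parses "x,y" using only the standard library.
func (p *Point) UnmarshalText(text []byte) error {
	parts := strings.Split(string(text), ",")
	if len(parts) != 2 {
		return fmt.Errorf("expected x,y but got %q", text)
	}
	x, err := strconv.Atoi(parts[0])
	if err != nil {
		return err
	}
	y, err := strconv.Atoi(parts[1])
	if err != nil {
		return err
	}
	*p = Point{X: x, Y: y}
	return nil
}

// unmarshalInto (hypothetical) instantiates a nil pointer if needed, then
// calls UnmarshalText through a type assertion instead of Call().
func unmarshalInto(rv reflect.Value, s string) error {
	if rv.Kind() == reflect.Ptr && rv.IsNil() {
		rv.Set(reflect.New(rv.Type().Elem()))
	}
	u, ok := rv.Interface().(encoding.TextUnmarshaler)
	if !ok {
		return fmt.Errorf("%s does not implement encoding.TextUnmarshaler", rv.Type())
	}
	return u.UnmarshalText([]byte(s))
}

// decodePoint (hypothetical) decodes into a *Point that starts out nil.
func decodePoint(s string) (*Point, error) {
	var p *Point
	err := unmarshalInto(reflect.ValueOf(&p).Elem(), s)
	return p, err
}

func main() {
	p, err := decodePoint("105,697")
	fmt.Println(*p, err)
}
```

The type assertion gives a statically typed method call and an ordinary error return, at the cost of one Interface() conversion.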

If the destination isn’t a special type or a TextUnmarshaler, then the decoding behavior is based on the Kind. Generally, these cases are all handled similarly using strconv functions to parse the token string. Here’s an example of handling all int kinds together, using rv.Type().Bits() for the width:

case reflect.Int, reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64:
    i, err := strconv.ParseInt(s, 0, rv.Type().Bits())
    if err != nil {
        return decodingError(name, err)
    }
    rv.SetInt(i)

Unlike Set(), SetInt() and similar methods take a normal, typed variable, not a reflect.Value.
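
Passing the destination's bit width to strconv.ParseInt is what turns overflow into an error instead of a silently truncated value. A small sketch, where setInt is a hypothetical helper that assumes it is given a non-nil pointer to a signed integer:

```go
package main

import (
	"fmt"
	"reflect"
	"strconv"
)

// setInt (hypothetical) parses s and stores it in the integer that ptr
// points to, using the destination's own bit size as the range limit.
func setInt(ptr interface{}, s string) error {
	rv := reflect.ValueOf(ptr).Elem()
	n, err := strconv.ParseInt(s, 0, rv.Type().Bits())
	if err != nil {
		return err
	}
	rv.SetInt(n)
	return nil
}

func main() {
	var small int8
	err := setInt(&small, "127")
	fmt.Println(small, err)

	err = setInt(&small, "128") // out of range for int8
	fmt.Println(small, err)
}
```

Because the base argument is 0, ParseInt also accepts prefixed input such as "0x10".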

Reflection is a rare tool, but you don’t need to be afraid of it.

Reflection is complex, inefficient, and prone to panics if used incorrectly. But as I hope you can see, sometimes it’s the right tool for the job.

This article omitted some details, particularly around error handling, reusable abstractions, and testing. But strum has only about 600 lines of non-test code, so check out the strum repo if you’d like to see what I left out.

Notes:

In a reddit comment on this article, nofeaturesonlybugs describes how to cache reflection data for better performance and links to a code example.
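
As a rough illustration of the general idea of caching reflection data, and not a reproduction of the commenter's code, per-type information can be computed once and stored in a map keyed by reflect.Type. Everything here is hypothetical: the names fieldCache, exportedFields and exportedFieldsOf, the unexported note field, and the cacheMisses counter, which exists only to make the caching visible and is not safe for concurrent use.

```go
package main

import (
	"fmt"
	"reflect"
	"sync"
)

type Move struct {
	Direction string
	Distance  int
	note      string // unexported, so it should be skipped
}

// fieldCache maps a reflect.Type to the indexes of its exported fields.
var fieldCache sync.Map

// cacheMisses counts how often the field list had to be computed.
var cacheMisses int

// exportedFields (hypothetical) returns the indexes of the exported fields
// of struct type t, computing them only on the first request for that type.
func exportedFields(t reflect.Type) []int {
	if cached, ok := fieldCache.Load(t); ok {
		return cached.([]int)
	}
	cacheMisses++
	var idx []int
	for i := 0; i < t.NumField(); i++ {
		if t.Field(i).IsExported() {
			idx = append(idx, i)
		}
	}
	fieldCache.Store(t, idx)
	return idx
}

// exportedFieldsOf is a convenience wrapper taking a struct value.
func exportedFieldsOf(v interface{}) []int { return exportedFields(reflect.TypeOf(v)) }

func main() {
	fmt.Println(exportedFieldsOf(Move{}))
	fmt.Println(exportedFieldsOf(Move{}))
	fmt.Println("computed", cacheMisses, "time(s)")
}
```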
