Serialization Format

SimpleBuffers is designed to encode simple, stable data schemas as efficiently as possible. Fixed-size data is packed optimally with no padding, labels, or any other metadata. Variable-sized data structures such as lists and strings require a small amount of additional data (by default, two bytes). This is explained more below.

Note that data is serialized into little-endian format, as this is natively supported by practically all modern processors, allowing for efficient decoding in almost all scenarios.

Take the following example schema:

enum RobotJoint {
    j0 = 0;
    j1 = 1;
    j2 = 2;
    j3 = 3;
    j4 = 4;
    j5 = 5;
}

sequence Init {
    expected_firmware: u32;
}

sequence MoveToEntry {
    joint: RobotJoint;
    angle: f32;
    speed: f32;
}

sequence MoveTo {
    joints: [MoveToEntry];
    stop_smoothly: bool;
}

sequence Request {
    id: u32;
    payload: oneof {
        init: Init;
        moveTo: MoveTo;
    };
}

This schema represents a simple serial protocol that can be used to control a robot arm. Let's go through it step-by-step.

Enums

Every element of an enum must be explicitly assigned to a number. When enums are serialized, the appropriate number is written into the buffer, which can be decoded back into an enum later. Enums will always use the smallest possible data type that can fully represent them. Most enums, including RobotJoint, are encoded to a single octet.

enum BigEnum {
    element_a = 0;
    element_b = 1;
    element_c = 1000;
}

The above BigEnum will be serialized as a 16-bit value because element_c cannot fit within an octet. Note that this is true even if the value being serialized is element_a or element_b; the size of an enum is fixed.

Fixed-Sized Sequences

Next, let's take a look at our Init sequence:

sequence Init {
    expected_firmware: u32;
}

It only has a single value: expected_firmware, which is a 32-bit unsigned integer. Sequences induce zero overhead. This means that the size of Init is exactly equal to the sum of the sizes of its elements. Init, therefore, will always use 32 bits.

MoveToEntry also only includes fixed-size elements:

sequence MoveToEntry {
    joint: RobotJoint;
    angle: f32;
    speed: f32;
}

angle and speed are both 32-bit floats, and joint is an enum. In this case, RobotJoint fits into a single octet, so MoveToEntry uses \(32 + 32 + 8 = 72\) bits. The actual serialization of a MoveEntry would look like this:

block-beta
    columns 3

    block:raw:3
        rawjoint["joint = j1"]
        rawangle["angle = 45"]
        rawspeed["speed = 100"]
    end
    space:3
    block:ser:3
        serjoint["0x01"]
        serangle["0x42340000"]
        serspeed["0x42c80000"]
    end
    space:1 down<[" "]>(down):1 space:1
    block:final:3
        final["01 42 34 00 00 42 c8 00 00"]:3
    end

    rawjoint-->serjoint
    rawangle-->serangle
    rawspeed-->serspeed

Fixed-size sequences are great. They are not only 100% data-efficient, but they also provide constant-time access to any element, no matter how deeply nested. This is true because the positions of all elements are known at compile-time and can be baked into the generated code. However, some types of data do not have a set size. This data must be encoded differently.

Lists

Lists consist of a variable number of repeated data. Because we do not know the length of the list at compile-time, we cannot allocate fixed-size field in a sequence. Take a look at MoveTo:

sequence MoveTo {
    joints: [MoveToEntry];
    stop_smoothly: bool;
}

We know the size of stop_smoothly, but joints could have any number of elements. This is a problem because now we cannot know the position of stop_smoothly at compile-time; it will change depending on the length of joints.

block-beta
    columns 2

    block:raw:2
        rawjoints["joints = [...]"]
        rawstop["stop_smoothly = true"]
    end
    space:2
    block:ser:2
        serjoints["???"]
        serstop["0x01"]
    end
    down<[" "]>(down):2
    block:final:2
        final["??? 01"]:2
    end

    rawjoints-->serjoints
    rawstop-->serstop

To solve this, we must find a way to force joints to be a fixed size. Fortunately, a solution already exists: pointers. Instead of storing the full list in joints, we can instead store a fixed-size pointer and place the list at the end of the buffer where it can no longer hurt us. In practice, we prefer to store a relative offset rather than an absolute pointer, as this allows complex sequences to be decomposed more effectively.

We must also encode the size of the list. This is done in the fixed-sized segment, which allows access without indirection. Both the offset and the list size are stored as unsigned 16-bit integers.

block-beta
    columns 4

    block:raw:3
        rawjointslen["joints size = 3"]
        rawjoints["joints = [...]"]
        rawstop["stop_smoothly = true"]
    end
    space
    space:4
    block:serstatic:3
        serjointslen["0x0003"]
        serjointsptr["0x0003 (offset)"]
        serstop["0x01"]
    end
    block:serdyn:1
        serjointsdata["..."]
    end
    space:1 down<[" "]>(down):2 space:1
    block:final:4
        final["03 00 03 00 01 ..."]:4
    end

    rawjoints-->serjointsdata
    rawjointslen-->serjointslen
    rawstop-->serstop

Strings

Unlike lists, strings must be null-terminated. This means that we do not have to store the size of the string. Otherwise, they are identical.

Oneofs

There are two possible ways to implement the oneof: as a fixed-sized union or a dynamically-sized pointer. The main drawback of a union is the space requirements. Like an enum, a union must always be as large as the largest possible member. Unlike an enum, however, union members are expected to sometimes be vastly different sizes. This leads to increased storage inefficiency in all but the best case. For this reason, SimpleBuffers instead implements the oneof as a dynamically-sized structure.

Like a list, the oneof stores two values in the fixed-size segment of the buffer: the type of data being stored, and an offset to the data. The data type is stored as a single octet. Oneofs with more than 255 members are unsupported.

Let's take a look at how a Request with an Init payload would be serialized:

sequence Request {
    id: u32;            // <-- 0
    payload: oneof {
        init: Init;     // <-- .expected_firmware = 3
        moveTo: MoveTo;
    };
}

block-beta
    columns 4

    block:braw:4
        rawid["id = 0"]
        rawpayloadtype["payload type = init"]
        rawfw["payload.expected_firmware = 3"]
    end
    space:4
    block:bser0:3
        serid["0x00000000"]
        serpayloadtype["0x00"]
        serpayloadoffset["0x0002 (offset)"]
    end
    block:serdyn:1
        serfw["0x03"]
    end
    space:1 down<[" "]>(down):2 space:1
    block:final:4
        final["00 00 00 00 02 00 03"]:4
    end

    rawid-->serid
    rawpayloadtype-->serpayloadtype
    rawfw-->serfw

Now, we look at the other oneof case: a MoveTo payload. At first, this appears slightly more complicated, as MoveTo requires its own dynamic sizing. In practice, however, it is fairly simple.

When serializing data, a cursor is placed in the destination buffer at the end of the fixed-sized data. Every piece of dynamic data is placed at the cursor position, and the cursor is incremented to the end of the new data.

sequence Request {
    id: u32;            // <-- 1
    payload: oneof {
        init: Init;
        moveTo: MoveTo; // <-- .joints size = 3, .stop_smoothly = true, .joints = ...
    };
}

block-beta
    columns 5

    block:braw:5
        rawid["id = 1"]
        rawpayloadtype["payload type = moveto"]
        rawjointslen["payload.joints size = 3"]
        rawjoints["payload.joints = [...]"]
        rawstop["payload.stop_smoothly = true"]
    end
    space:5
    block:bser:3
        serid["0x00000001"]
        serpayloadtype["0x01"]
        serpayloadoffset["0x0002 (offset)"]
    end
    space:2
    block:bdyn0:3
        serjointslen["0x03"]
        serjointsptr["0x0003 (offset)"]
        serstop["0x01"]
    end
    block:serdyn1:1
        serjoints["..."]
    end
    space:1
    space:2 down<[" "]>(down) space:2
    block:final:5
        final["01 00 00 00 01 02 00 03 03 00 01 ..."]:5
    end

    rawid-->serid
    rawpayloadtype-->serpayloadtype
    rawjointslen-->serjointslen
    rawjoints-->serjoints
    rawstop-->serstop