Note: This is a candidate version of the specification. This note will be removed once v1 is closed, and any changes will be described at the end. Please get in touch if you’re implementing it.
Contents
- Introduction
- Supported values
- Representation
- Variable-length encoding
- Reference-implementation
- Changes
Introduction
This specification defines strepr, a stable representation that enables computing hashes and cryptographic signatures out of a defined set of composite values that is commonly found across a number of languages and applications.
Although the defined representation is a serialization format, it isn’t meant to be used as a traditional one. It may not be seen entirely in memory at once, or written to disk, or sent across the network. Its role is specifically in aiding the generation of hashes and signatures for values that are serialized via other means (JSON, BSON, YAML, HTTP headers or query parameters, configuration files, etc).
The format is designed with the following principles in mind:
Understandable — The representation must be easy to understand to increase the chances of it being implemented correctly.
Portable — The defined logic works properly when the data is being transferred across different platforms and implementations, independently from the choice of protocol and serialization implementation.
Unambiguous — As a natural requirement for producing stable hashes, there is a single way to process any supported value being held in the native form of the host language.
Meaning-oriented — The stable representation holds the meaning of the data being transferred, not its type. For example, the number 7 must be represented in the same way whether it’s being held in a float64 or in an uint16.
Supported values
The following values are supported:
- nil: the
nil/null/none
singleton - bool: the
true
andfalse
singletons - string: raw sequence of bytes
- integers: positive, zero, and negative integer numbers
- floats: IEEE754 binary floating point numbers
- list: sequence of values
- map: associative value→value pairs
Representation
nil = 'z'
The nil/null/none
singleton is represented by the single byte 'z'
(0x7a
).
bool = 't' / 'f'
The true
and false
singletons are represented by the bytes 't'
(0x74
) and 'f'
(0x66
), respectively.
unsigned integer = 'p' <value>
Positive and zero integers are represented by the byte 'p'
(0x70
) followed by the variable-length encoding of the number.
For example, the number 131
is always represented as {0x70, 0x81, 0x03}
, independently from the type that holds it in the host language.
negative integer = 'n' <absolute value>
Negative integers are represented by the byte 'n'
(0x6e
) followed by the variable-length encoding of the absolute value of the number.
For example, the number -131
is always represented as {0x6e, 0x81, 0x03}
, independently from the type that holds it in the host language.
string = 's' <num bytes> <bytes>
Strings are represented by the byte 's'
(0x73
) followed by the variable-length encoding of the number of bytes in the string, followed by the specified number of raw bytes. If the string holds a list of Unicode code points, the raw bytes must contain their UTF-8 encoding.
For example, the string hi
is represented as {0x73, 0x02, 'h', 'i'}
Due to the complexity involved in Unicode normalization, it is not required for the implementation of this specification. Consequently, Unicode strings that if normalized would be equal may have different stable representations.
binary float = 'd' <binary64>
32-bit or 64-bit IEEE754 binary floating point numbers that are not holding integers are represented by the byte 'd'
(0x64
) followed by the big-endian 64-bit IEEE754 binary floating point encoding of the number.
There are two exceptions to that rule:
1. If the floating point value is holding a NaN
, it must necessarily be encoded by the following sequence of bytes: {0x64, 0x7f, 0xf8, 0x00 0x00, 0x00, 0x00, 0x00, 0x00}
. This ensures all NaN
values have a single representation.
2. If the floating point value is holding an integer number it must instead be encoded as an unsigned or negative integer, as appropriate. Floating point values that hold integer numbers are defined as those where floor(v) == v && abs(v) != ∞
.
For example, the value 1.1 is represented as {0x64, 0x3f, 0xf1, 0x99, 0x99, 0x99, 0x99, 0x99, 0x9a}
, but the value 1.0 is represented as {0x70, 0x01}
, and -0.0 is represented as {0x70, 0x00}
.
This distinction means all supported numbers have a single representation, independently from the data type used by the host language and serialization format.
list = 'l' <num items> [<item> ...]
Lists of values are represented by the byte 'l'
(0x6c
), followed by the variable-length encoding of the number of pairs in the list, followed by the stable representation of each item in the list in the original order.
For example, the value [131, -131]
is represented as {0x6c, 0x70, 0x81, 0x03, 0x6e, 0x81, 0x03, 0x65}
map = 'm' <num pairs> [<item key> <item value> ...]
Associative maps of values are represented by the byte 'm'
(0x6d
) followed by the variable-length encoding of the number of pairs in the map, followed by an ordered sequence of the stable representation of each key and value in the map. The pairs must be sorted so that the stable representation of the keys is in ascending lexicographical order. A map must not have multiple keys with the same representation.
For example, the map {"a": 4, 5: "b"}
is always represented as {0x6d, 0x02, 0x70, 0x05, 0x73, 0x01, 'b', 0x73, 0x01, 'a', 0x70, 0x04}
.
Variable-length encoding
Integers are variable-length encoded so that they can be represented in short space and with unbounded size. In an encoded number, the last byte holds the 7 least significant bits of the unsigned value, and zero as the eight bit. If there are remaining non-zero bits, the previous byte holds the next 7 bits, and the eight bit is set on to flag the continuation to the next byte. The process continues until there are non-zero bits remaining. The most significant bits end up in the first byte of the encoded value, which must necessarily not be 0x80
.
For example, the number 128 is variable-length encoded as {0x81, 0x00}
.
Reference implementation
A reference implementation is available, including a test suite which should be considered when implementing the specification.
Changes
draft1 → draft2
- Enforce the use of UTF-8 for Unicode strings and explain why normalization is being left out.
- Enforce a single NaN representation for floats.
- Explain that map key uniqueness refers to the representation.
- Don’t claim the specification is easy to implement; floats require attention.
- Mention reference implementation.
What was the rationale for defining strings as a sequence of bytes instead of Unicode characters? Reduction of complexity is the priority?
Strings can still be treated agnostically if metadata about their encoding is provided. Leaving it out as an excercise to the developer is bound to cause varying interpretations on how to deal with strings, which violates the unambiguity principle that has been established.
> What was the rationale for defining strings as a sequence of bytes instead of Unicode characters?
From a signature standpoint, there’s little value in stating that the bytes are an encoding of unicode code points (UTF-8, etc) as there’s still room for equivalent strings to be encoded in different ways. For such a requirement to have real value, it’d also have to state that the unicode string must be in a normalized form, which is a non-trivial procedure. Instead, it seems better to leave that out of the specification, and state that the signature should cover whatever bytes are part of the string. If you want to (or need to) normalize unicode strings, that’s an easy requirement on top of strepr v1.
It seems good. I think it would be better to rename string to “bytes” and use ‘b’ instead of “string” and ‘s’ in order to improve understandability.
Also, it would be good if you elaborate on why you decided against tnetstrings or msgpack.
> Also, it would be good if you elaborate on why you decided against tnetstrings or msgpack.
I didn’t decide against them. They have different goals and strepr can in fact be used to hash and sign values in these formats. The introduction has more details about that distinction.
Like Carlo, for me the introduction didn’t get across the advantage of this over alternatives. Also, I wonder if lacking a -0.0 representation might cause problems for some uses.
> Like Carlo, for me the introduction didn’t get across the advantage of this over alternatives.
Thanks for the feedback. The document is indeed lacking a good rationale section. I’ll fix that.
> Also, I wonder if lacking a -0.0 representation might cause problems for some uses.
The question to ask is whether hashing and signing +0.0 and -0.0 the same way is okay or not. I believe for most practical purposes it is not only okay, but desirable, because unfortunately it’s not rare for code to misuse that information, which would change the hash. Practical example coming from Javascript:
This doesn’t mean you cannot represent it, though. Representing the value is left up to whatever serialization format is being used.
I think the lexicographic order for keys of maps is currently underspecified. How do you sort {ä, ö, x, y}? By the Finnish rules or the Estonian rules?
> How do you sort {ä, ö, x, y}?
The sorting is applied to the stable representation of the keys, so it’s sorting sequences of bytes.
Specification updated to draft 2, as detailed in the changes section.
What about floats that are large enough that they “represent an integer” according to the definition above, but don’t represent a unique integer?
> (…) but don’t represent a unique integer?
All integers are unique. What you allude to is that floats are unable to represent certain integer numbers, which isn’t a problem in the context of strepr.
There exists a float that represents an integer n as well as n+1. Should it be encoded as n or as n+1 in strepr? (I think this is the case for instance for n=2^50.)
> There exists a float that represents an integer n as well as n+1.
There’s no float number that represents two different integers. What exists is two different integer numbers that may be rounded to the same float number due to constraints of the format. That’s not a relevant issue for strepr.
Thanks. I’ve learned today that there is a canonical mapping from IEEE floats to reals and not to real intervals.