Reference strepr implementation

In an effort to polish the recently released draft of the strepr v1 specification, I’ve spent the last couple of days working on a Go reference implementation.

The implemented algorithm is relatively simple and efficient, and consumes a conservative amount of memory. The aspect that deserved the most attention was the efficient encoding of a float when it carries an integer value, as covered before. The provided tests are a useful reference as well.
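
To make that idea concrete, here is a minimal sketch of the logic (an illustration, not the package’s actual code): a float that holds an exact integer value may be encoded as that integer, of arbitrary size, so that equivalent numbers produce the same representation:

package main

import (
	"fmt"
	"math"
	"math/big"
)

// integerValue reports whether f carries an exact integer value.
// When it does, an encoder can emit that integer (of arbitrary
// size, hence the big.Int) instead of a float representation,
// so that 1, 1.0, and an integral 1e100 all agree with each other.
func integerValue(f float64) (*big.Int, bool) {
	if math.IsNaN(f) || math.IsInf(f, 0) || f != math.Trunc(f) {
		return nil, false
	}
	i, _ := big.NewFloat(f).Int(nil) // exact, since f is integral
	return i, true
}

func main() {
	for _, f := range []float64{1.0, 2.5, 1e100} {
		if i, ok := integerValue(f); ok {
			fmt.Printf("%g carries the integer value %s\n", f, i)
		} else {
			fmt.Printf("%g is a true float\n", f)
		}
	}
}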

The API offered by the implemented package is minimal and matches existing conventions. For example, this simple snippet (assuming the crypto/sha1, fmt, and launchpad.net/strepr imports) will generate a hash for the stable representation of the provided value:

value := map[string]interface{}{"a": 1, "b": []int{2, 3}}
hash := sha1.New()
strepr.NewEncoder(hash).Encode(value)
fmt.Printf("%x\n", hash.Sum(nil))
// Outputs: 29a77d09441528e02a27dc498d0a757da06250a0

Along with the reference implementation comes a simple command line tool to play with the concept. It makes it easy to arrive at the same result obtained above by processing a JSON value instead:

$ echo '{"a": 1.0, "b": [2, 3]}' | ./strepr -in-json -out-sha1
29a77d09441528e02a27dc498d0a757da06250a0

Or YAML:

$ cat | ./strepr -in-yaml -out-sha1
a: 1
b:
   - 2
   - 3
29a77d09441528e02a27dc498d0a757da06250a0

Or even BSON, the binary format used by MongoDB:

$ bsondump dump.bson
{ "a" : 1, "b" : [ 2, 3 ] }
1 objects found
$ cat dump.bson | ./strepr -in-bson -out-sha1
29a77d09441528e02a27dc498d0a757da06250a0

In all of those cases the hash obtained is the same, even though the processed values were typed differently in some cases. For example, due to JSON’s JavaScript background, some JSON libraries unmarshal all numbers as binary floating point values, while others distinguish integers from floats based on the formatting used. The strepr algorithm flattens out that distinction so that different platforms can easily agree on a common result.
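
As a quick check of that claim using the API shown above, an integer-typed value and its float-typed equivalent should hash identically (a sketch along the lines of the earlier snippet):

package main

import (
	"crypto/sha1"
	"fmt"

	"launchpad.net/strepr"
)

// hashOf returns the SHA-1 of the stable representation of v,
// following the pattern from the snippet above.
func hashOf(v interface{}) string {
	h := sha1.New()
	strepr.NewEncoder(h).Encode(v)
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	asInts := map[string]interface{}{"a": 1, "b": []int{2, 3}}
	asFloats := map[string]interface{}{"a": 1.0, "b": []float64{2, 3}}
	// Both lines should print 29a77d09441528e02a27dc498d0a757da06250a0.
	fmt.Println(hashOf(asInts))
	fmt.Println(hashOf(asFloats))
}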

To visualize (or debug) the stable representation defined by strepr, the reference implementation includes a debug dump facility, which is also exposed in the command line tool:

$ echo '{"a": 1.0, "b": [2, 3]}' | ./strepr -in-json -out-debug
map with 2 pairs (0x6d02):
   string of 1 byte (0x7301) "a" (0x61)
    => uint 1 (0x7001)
   string of 1 byte (0x7301) "b" (0x62)
    => list with 2 items (0x6c02):
          - uint 2 (0x7002)
          - uint 3 (0x7003)
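
The hexadecimal annotations hint at the framing: each node starts with an ASCII tag byte (0x6d is ‘m’ for map, 0x6c is ‘l’ for list, 0x73 is ‘s’ for string) followed by a count or value. A rough sketch of emitting such a header, assuming the base128 variant mentioned in the comments below (an illustration, not necessarily the exact wire format of the specification):

package main

import "fmt"

// appendHeader appends an ASCII tag byte followed by n in a
// base128 form: seven value bits per byte, with the high bit
// marking continuation. This encoding is an assumption made for
// illustration; the specification defines the real layout.
func appendHeader(buf []byte, tag byte, n uint64) []byte {
	buf = append(buf, tag)
	for n >= 0x80 {
		buf = append(buf, byte(n)|0x80)
		n >>= 7
	}
	return append(buf, byte(n))
}

func main() {
	// A map with 2 pairs, matching the 0x6d02 header in the dump.
	fmt.Printf("% x\n", appendHeader(nil, 'm', 2)) // prints "6d 02"
}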

Assuming a Go compiler and the go tool are available, the command line strepr tool may be installed with:

$ go get launchpad.net/strepr/cmd/strepr

As a result of the reference implementation work, a few clarifications and improvements were made to the specification:

  • Enforce the use of UTF-8 for Unicode strings and explain why normalization is being left out.
  • Enforce a single NaN representation for floats (see the sketch after this list).
  • Explain that map key uniqueness refers to the representation.
  • Don’t claim the specification is easy to implement; floats require attention.
  • Mention the reference implementation.
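
On the NaN point, for instance, an encoder needs to collapse the many possible NaN bit patterns into one before hashing. A minimal sketch of that normalization, assuming an IEEE 754 quiet NaN as the canonical choice (the specification fixes the actual pattern):

package main

import (
	"fmt"
	"math"
)

// canonicalBits returns the bits to encode for f, collapsing
// every NaN payload into a single quiet-NaN pattern. The pattern
// chosen here is illustrative; the specification picks the one
// that all implementations must agree on.
func canonicalBits(f float64) uint64 {
	if math.IsNaN(f) {
		return 0x7ff8000000000000
	}
	return math.Float64bits(f)
}

func main() {
	a := math.NaN()
	b := math.Float64frombits(0x7ff8000000000123) // NaN, odd payload
	fmt.Println(canonicalBits(a) == canonicalBits(b)) // true
}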

2 Responses to Reference strepr implementation

  1. John Meinel says:

    You should probably add tests for very large floats. Like 1e100. Your current implementation will write that out in long form, while the float64 representation is a lot more compact. Is that intentional?

    Given the base128 encoding, anything larger than (2^(7*8) = 2^56 ~ 72e15) is more compact in float64 without loss of precision. You could go up to 2^64 as the limit of uint64. Obviously if you had a uint128 with all bits of precision, you would lose precision in a float64, but you’re starting with a float64, so you know you have no more than 52 bits of precision.

    The masking trick you did with (1<<(-(exp)) - 1) is pretty neat.

    I don't know the exact cutoff you want, but just adding something like:

    exp < 4 (for 2^56 bits)
    or
    exp > …

    >>> json.dumps({"x": 2**100+1})
    '{"x": 1267650600228229401496703205377}'
    >>> json.loads('{"x": 1267650600228229401496703205377}')['x']
    1267650600228229401496703205377L

    even though:
    >>> long(float((2**100)+1)) - ((2**100)+1)
    -1L

    Anyway, my point is that while you can encode 2**100+1 with all bits of precision, by the time golang has unmarshalled it into a float64, you’ve already lost it.

    I guess the worst case in float64 is 10^308 or so, which is only 128 bytes. Though 16x the size of the float64 representation, it isn’t unbounded.

  2. niemeyer says:

    > (…) will write that out in long form, while the float64 representation is a lot more compact. Is that intentional?

    Yes, this is intentional to achieve the results described both in this post and in the specification. It’s also not a problem in any way, since the point of strepr is not to work as yet another serialization library, but to produce a stable representation for hashing. The current implementation can encode a list of 10k floats, all of them carrying the maximum value, in under 4ms and with a buffer of under 150 bytes.

    > Anyway, my point is that while you can encode 2**100+1 with all bits of precision, by the time golang has unmarshalled it into a float64, you’ve already lost it.

    There’s no unmarshalling. That’s not the purpose of strepr.
