home     blog     about     contact
published: 2021-02-08
last change: 2021-02-23

Twopow (2pow) Integer Prefix Coding


A variable, arbitrary length format for integers, which is also self-delimited (i.e. an integer prefix code) is very useful for all integer numbers like lengths or timestamps in communication protocol specifications. They are suitable especially for use in blockchain and cryptographic protocols.

Examples for such codes are the universal codes, e.g. the Exponential-Golomb coding, of which Elias gamma coding is a special case.

Format

Two format variants are shown below. The format has an unary encoding of a precode p using '0' bits and a delimiter of a single '1' bit, trailed by a binary subcode s of length either len(s) = floor(2p-1) (variant 1) or len(s) = 2p (variant 2):

Variant 1

 n: bit string representing n
-------------------------------------------------------------
 0: '1' // p = 0, s = 0
 1: '01 0' // p = 1, s = 0
 2: '01 1' // p = 1, s = 1
 3: '001 00' // p = 2, s = 0
 4: '001 01' // p = 2, s = 1
 5: '001 10' // p = 2, s = 2
 6: '001 11' // p = 2, s = 3
 7: '0001 0000' // p = 3, s = 0
 8: '0001 0001' // p = 3, s = 1
 9: '0001 0010' // p = 3, s = 2
10: '0001 0011' // p = 3, s = 3
11: '0001 0100' // p = 3, s = 4
12: '0001 0101' // p = 3, s = 5
13: '0001 0110' // p = 3, s = 6
14: '0001 0111' // p = 3, s = 7
15: '0001 1000' // p = 3, s = 8
16: '0001 1001' // p = 3, s = 9
17: '0001 1010' // p = 3, s = 10
18: '0001 1011' // p = 3, s = 11
19: '0001 1100' // p = 3, s = 12
20: '0001 1101' // p = 3, s = 13
21: '0001 1110' // p = 3, s = 14
22: '0001 1111' // p = 3, s = 15
23: '00001 00000000' // p = 4, s = 0
24: '00001 00000001' // p = 4, s = 1
25: '00001 00000010' // p = 4, s = 2
26: '00001 00000011' // p = 4, s = 3
...
..: '000001 00000000 00000000' // p = 5, s = 0
...
..: '0000001 00000000 00000000 00000000 00000000' // p = 6, s = 0
...
..: '00000001 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000' // p = 7, s = 0
...

This Variant 1 is very similar to Elias gamma coding. The difference is that that Elias gamma coding's precode p denotes the length of the following subcode len(s) = p directly whereas variant 1's unary precode denotes the subcode's length len(s) = floor(2p-1) by a power of 2 of the precode. That helps Variant 1 to save space in comparison to Elias gamma when encoding longer chunks of data, which line up to boundaries of 2n bits (or bytes), like cryptographic hashes or signatures.

Variant 2

 n: bit string representing n
-------------------------------------------------------------
 0: '1 0' // p = 0, s = 0
 1: '1 1' // p = 0, s = 1
 2: '01 00' // p = 1, s = 0
 3: '01 01' // p = 1, s = 1
 4: '01 10' // p = 1, s = 2
 5: '01 11' // p = 1, s = 3
 6: '001 0000' // p = 2, s = 0
 7: '001 0001' // p = 2, s = 1
 8: '001 0010' // p = 2, s = 2
 9: '001 0011' // p = 2, s = 3
10: '001 0100' // p = 2, s = 4
11: '001 0101' // p = 2, s = 5
12: '001 0110' // p = 2, s = 6
13: '001 0111' // p = 2, s = 7
14: '001 1000' // p = 2, s = 8
15: '001 1001' // p = 2, s = 9
16: '001 1010' // p = 2, s = 10
17: '001 1011' // p = 2, s = 11
18: '001 1100' // p = 2, s = 12
19: '001 1101' // p = 2, s = 13
20: '001 1110' // p = 2, s = 14
21: '001 1111' // p = 2, s = 15
22: '0001 00000000' // p = 3, s = 0
...
..: '0001 11111111' // p = 3, s = 255
..: '00001 00000000 00000000' // p = 4, s = 0
...
..: '000001 00000000 00000000 00000000 00000000' // p = 5, s = 0
...
..: '0000001 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000' // p = 6, s = 0
...
..: '00000001 +128 bit' // p = 7
...
..: '000000001 +256 bit' // p = 8
...
..: '0000000001 +512 bit' // p = 9
...

Difference of Variant 1 and 2

The biggest difference of variant 2 in comparison to variant 1:
The data size of the value 0 is 2 bits instead of 1 bit, so it has more data size. Similarly the 2 has one more data bit, but for all other numbers we need one bit less. Variant 2 sounds favorable in most cases to me.

Further Variants

You can also create other variants of this encoding by parameterizing the formula for the length of s with a paramter m: len(s) = floor(2p+m), with m ∈ ℤ.
With m < 0, the first |m| numbers are encoded to unary with len(s)=0, i.e. without trailing subcode, like in Variant 1 (m=-1). Variant 2 is m=0. With m > 0, the minimum data size of a number (which is always equal to the data size of the encoded number zero in that encoding) is increased to min_data_size = 1 + 2m. This is a tradeoff for encoding small numbers in more data and really big numbers in less data.

Lining Up With Bytes

To create an encoding that lines up the encoded sizes to byte or word boundaries, you have to do something slightly more complex: len(s) = 2p+m - p - 1 with m=3 or m=4, respectively.

Advantages for Blockchain and Cryptographic Protocols

These variants are more suitable for blockchain protocols than Exponential-Golomb coding is, because they use less bits for longer chunks of data like hashes and signatures, as they frequently occur in cryptographic protocols. For example any 64 bit integer (8 bytes) can be encoded in 9 bytes or less in the above variants, whereas it can take up to 126 bits in typical Exponential-Golomb codings like Elias gamma.
In general using variable length codes instead of fixed lengths allows for seamless extensions of the protocol by only adding new semantics for existing syntax.

Such a number format (any variable length integer prefix code) has many advantages for specifying communication protocols:
- complex format specifications which make use of this integer format are easier to read and implement because no additional length information is required. because this encoding has the prefix property, the length can be determined from only the datum itself, i.e. it is self-delimited.
- resulting data formats are theoretically unbounded and therefor
- data formats are forward compatible when valid ranges for values need to be extended
- the encoding automatically compresses the data, resulting in smaller data sizes

You can encode arbitrary binary data by prepending a '1' bit to the data and then encoding it using one of the above variants. This works with any integer prefix code or universal code, e.g. with Exponential-Golomb coding. To decode, do the two steps in reverse: first decode to an integer and then remove the leading '1' bit from the integer's data.