Serialization#
This page defines a binary format used to store ErgoTree contracts in persistent stores, transfer them over the wire, and enable cross-platform interoperation.
Overview#
The terms of the language can be serialized into an array of bytes for storage in the Ergo blockchain (e.g., Box.propositionBytes).
When the guarding script of an input box of a transaction is validated, the propositionBytes array is deserialized to an ErgoTree IR (represented by the ErgoTree class), which can be evaluated.
The serialization procedure is specified in general terms. The serialization format of ErgoTree types (specified in the SType class) and nodes (specified in the Value class) is correspondingly defined in section 5.1 and Appendix C.
Table 1 shows size limits checked during contract deserialization, which is important to resist malicious script attacks.
Table 1: Serialization limits#
Constant | Value | Description |
---|---|---|
VLQ max | 10 | Maximum size of VLQ encoded byte sequence (See VLQ formats E.1) |
T max | 100 | Maximum size of serialized type term (see Type format 5.1) |
D max | 4Kb | Maximum size of serialized data instance (see Data format 5.2) |
C max | = T max + D max | Maximum size of serialized data instance (see Const format 5.3) |
Expr max | 4Kb | Maximum size of serialized ErgoTree term (see Expr format 5.4) |
ErgoTree max | 4Kb | Maximum size of serialized ErgoTree contract (see ErgoTree format 5.5) |
All the serialization formats used and defined throughout this section are listed in Table 2, which introduces a name for each format and shows the number of bytes each format may occupy in the byte stream.
Table 2: Serialization formats#
Format | #bytes | Description |
---|---|---|
\(\lst{Byte}\) | \(1\) | 8-bit signed two's-complement integer |
\(\lst{Short}\) | \(2\) | 16-bit signed two's-complement integer (big-endian) |
\(\lst{Int}\) | \(4\) | 32-bit signed two's-complement integer (big-endian) |
\(\lst{Long}\) | \(8\) | 64-bit signed two's-complement integer (big-endian) |
\(\lst{UByte}\) | \(1\) | 8-bit unsigned integer |
\(\lst{UShort}\) | \(2\) | 16-bit unsigned integer (big-endian) |
\(\lst{UInt}\) | \(4\) | 32-bit unsigned integer (big-endian) |
\(\lst{ULong}\) | \(8\) | 64-bit unsigned integer (big-endian) |
\(\lst{VLQ(UShort)}\) | \([1..3]\) | Encoded unsigned \(\lst{Short}\) value using VLQ. |
\(\lst{VLQ(UInt)}\) | \([1..5]\) | Encoded unsigned 32-bit integer using VLQ. |
\(\lst{VLQ(ULong)}\) | \([1..\MaxVlqSize]\) | Encoded unsigned 64-bit integer using VLQ. |
\(\lst{Bits}\) | \([1..\MaxBits]\) | A collection of bits packed in a sequence of bytes. |
\(\lst{Bytes}\) | \([1..\MaxBytes]\) | A sequence of bytes, which size is stored elsewhere or wellknown. |
\(\lst{Type}\) | \([1..\MaxTypeSize]\) | Serialized type terms of \(\langname\). |
\(\lst{Data}\) | \([1..\MaxDataSize]\) | Serialized data values of \(\langname\). |
\(\lst{GroupElement}\) | \(33\) | Serialized elements of eliptic curve group. |
\(\lst{SigmaProp}\) | \([1..\MaxSigmaProp]\) | Serialized sigma propositions. |
\(\lst{AvlTree}\) | \(44\) | Serialized dynamic dictionary digest. |
\(\lst{Constant}\) | \([1..\MaxConstSize]\) | Serialized \(\langname\) constants (values with types). |
\(\lst{Expr}\) | \([1..\MaxExprSize]\) | Serialized expression terms of \(\langname\). |
\(\lst{ErgoTree}\) | \([1..\MaxErgoTreeSize]\) | Serialized instances of \(\langname\) contracts. |
We use the [1..n]
notation when serialization may produce from 1 to n bytes (depending on the actual data).
The serialization format of ErgoTree is optimized for compact storage and rapid deserialization.
In many cases, the serialization procedure is data dependent and thus has branching logic.
We use a pseudo-language with operators like for, match, if, and optional to express this complex serialization logic in the specification.
The language allows us to specify a structure out of simple serialization slots.
Each slot specifies a fragment of the serialized stream of bytes, whereas operators specify how the slots are combined to form the resulting stream of bytes.
The notation is summarized in Table 3.
Table 3: Serialization Notation#
Notation | Description |
---|---|
\(\Denot{T}\) where \(T\) - type | Denotes a set of values of type \(T\) |
\(v \in \Denot{T}\) | The value \(v\) belongs to the set \(\Denot{T}\) |
\(v : T\) | Same as \(v \in \Denot{T}\) |
\(\lst{match}\) \((t, v)\) | Pattern match on pair \((t, v)\) where \(t, v\) - values |
\(\lst{with}\) \((Unit, v \in \Denot{Unit})\) | Pattern case |
\(\lst{for}\) i=1 \(\lst{to}\) len \ \(~\lst{serialize(}\)v_i\(\lst{)}\) \ \(\lst{end for}\) | Call the given \(\lst{serialize}\) function repeatedly.The outputs bytes of all invocations are concatenated and become the output of the \(\lst{for}\) statement. |
\(\(\lst{if}~\)condition\)~\(\lst{then}\) | Serialize one of the branches depending of the condition. The output bytes of the executed branch becomes the output of the \(\lst{if}\) statement. |
\(\lst{serialize1(}\)v_1$\lst{)} | \lst{else} |
In the next section, we describe how types (Int, Coll[Byte], etc.) are serialized; then, we define the serialization of typed data.
This will give us a basis to describe the serialization of Constant nodes of ErgoTree. After that, we will proceed to the serialization of arbitrary ErgoTree trees.
Type Serialization#
For the motivation behind this type of encoding, please see Appendix D.1.
Distribution of type codes#
The whole space of 256 one-byte codes is divided, as shown in Table 4.
Table 4: Distribution of type codes between Data and Function types#
Value/Interval | Distribution |
---|---|
\(\lst{0x00}\) | special value to represent undefined type (\(\lst{NoType}\) in \(\ASDag\)) |
\(\lst{0x01 - 0x6F(111)}\) | data types including primitive types, arrays, options aka nullable types, classes (in future), 111 = 255 - 144 different codes |
\(\lst{0x70(112) - 0xFF(255)}\) | function types \(\lst{T1 => T2}\), 144 = 12 x 12 different codes~\footnote{Note that the function types are never serialized in version 1 of the Ergo protocol, this encoding is reserved for future development of the protocol.} |
Encoding of Data Types#
There are eight different values for embeddable types, and three more are reserved for future extensions. Each embeddable type has a type code in the range 1,...,11
as shown in Table 5.
Table 5: Embeddable Types#
Code | Type |
---|---|
1 | Boolean |
2 | Byte |
3 | Short (16-bit) |
4 | Int (32 bit) |
5 | Long (64-bit) |
6 | BigInt (represented by java.math.BigInteger) |
7 | GroupElement (represented by org.bouncycastle.math.ec.ECPoint) |
8 | SigmaProp |
9 | reserved for Char |
10 | reserved |
11 | reserved |
Table 6: Code Ranges of Data Types#
Interval | Constructor | Description |
---|---|---|
0x01 - 0x0B(11) | embeddable types (including 3 reserved) | |
0x0C(12) | Coll[_] | |
0x0D(13) - 0x17(23) | Coll[_] | Collection of embeddable types (Coll[Byte], Coll[Int], etc.) |
0x18(24) | Coll[Coll[_]] | Nested collection of non-embeddable types (Coll[Coll[(Int,Boolean)]]) |
0x19(25) - 0x23(35) | Coll[Coll[_]] | Nested collection of embeddable types (Coll[Coll[Byte]], Coll[Coll[Int]]) |
0x24(36) | Option[_] | Option of non-embeddable type (Option[(Int, Byte)]) |
0x25(37) - 0x2F(47) | Option[_] | Option of embeddable type (Option[Int]) |
0x30(48) | Option[Coll[_]] | Option of Coll of non-embeddable type (Option[Coll[(Int, Boolean)]]) |
0x31(49) - 0x3B(59) | Option[Coll[_]] | Option of Coll of embeddable type (Option[Coll[Int]]) |
0x3C(60) | (,) | Pair of non-embeddable types (((Int, Byte), (Boolean,Box)), etc.) |
0x3D(61) - 0x47(71) | (_, Int) | Pair of types where first is embeddable ((_, Int)) |
0x48(72) | (,,_) | Triple of types |
0x49(73) - 0x53(83) | (Int, _) | Pair of types where second is embeddable ((Int, _)) |
0x54(84) | (,,,) | Quadruple of types |
0x55(85) - 0x5F(95) | (, ) | Symmetric pair of embeddable types ((Int, Int), (Byte,Byte), etc.) |
0x60(96) | (,...,) | Tuple type with more than 4 items (Int, Byte, Box, Boolean, Int) |
0x61(97) | Any | Any type |
0x62(98) | Unit | Unit type |
0x63(99) | Box | Box type |
0x64(100) | AvlTree | AvlTree type |
0x65(101) | Context | Context type |
0x66(102) | reserved for String | |
0x67(103) | reserved for TypeVar | |
0x68(104) | Header | Header type |
0x69(105) | PreHeader | PreHeader type |
0x6A(106) | Global | Global type |
0x6B(107)-0x6E(110) | reserved for future use | |
0x6F(111) | Reserved for future Class type (e.g. user-defined types) |
We use the encoding schema defined below for each type of constructor, like Coll or Option.
- Type constructor has an associated base code which is a multiple of 12
- (e.g. 12 for Coll[_], 24 for Coll[Coll[_]], etc.).
- The base code can be added to the embeddable type code to produce the code of the constructed type.
- For example
12 + 1 = 13
is a code of Coll[Byte]. - The code of the type constructor (e.g. 12 in this example) is used when the type parameter is non-embeddable type (e.g. Coll[(Byte, Int)]).
- In this case, the code of the type constructor is read first, and then recursive descent is performed to read bytes of the parameter type (in this case (Byte, Int)).
- For example
This encoding allows very simple and fast decoding using div and mod operations.
Following the above encoding schema, the interval of codes for data types is divided, as shown in Table 6
Encoding of Function Types#
We use 12 different values for both domain and range types of functions.
This gives us 144 (12∗12) function types in total and allows us to represent 121 (11∗11) functions over primitive types using just a single byte.
Each code F in a range of the function types (i.e. F ∈ {112, . . . , 255}) can be represented as F = D∗12+R+112, where D, R ∈ {0, . . . , 11} - indices of domain and range types correspondingly, 112 - is the first code in an interval of function types.
- If D = 0, the domain type is not embeddable, and the recursive descent is necessary to write/read the domain type.
- If R = 0, then the range type is not embeddable, and the recursive descent is necessary to write/read the range type.
Recursive Descent#
When an argument of a type constructor is not a primitive type, we fall back to the simple encoding schema, in which case we emit the separate code for the type constructor according to the table above and descend recursively to every child node of the type tree.
We do this descend only for those children whose code cannot be embedded in the parent code.
For example, serialization of Coll[(Int,Boolean)] proceeds as the following:
- Emit 0x0C because the elements type of the collection is not embeddable
- Recursively serialize (Int, Boolean)
- Emit 0x41(=0x3D+4) because the first type of the pair is embeddable, and its code is 4
- Recursively serialize Boolean
- Emit 0x02 - the code for embeddable type Boolean
More examples of type serialization are shown in Table 7
Type | D | R | Serialized Bytes | #Bytes | Comments |
---|---|---|---|---|---|
Byte | 2 | 1 | simple embeddable type | ||
Coll[Byte] | 12 + 2 = 14 | 1 | embeddable type in Coll | ||
Coll[Coll[Byte]] | 24 + 2 = 26 | 1 | embeddable type in nested Coll | ||
Option[Byte] | 36 + 2 = 38 | 1 | embeddable type in Option | ||
Option[Coll[Byte]] | 48 + 2 = 50 | 1 | embeddable type in Coll nested in Option | ||
(Int,Int) | 84 + 4 = 88 | 1 | symmetric pair of embeddable type | ||
Int=>Boolean | 4 | 1 | 161 = 4*12+1+112 | 1 | embeddable domain and range |
(Int,Int)=>Int | 0 | 4 | 115=0*12+4+112, 88 | 2 | embeddable range, then symmetric pair |
(Int,Boolean) | 60 + 4, 1 | 2 | Int embedded in pair, then Boolean | ||
(Int,Box)=>Boolean | 0 | 1 | 0*12+1+112, 60+4, 99 | 3 func with embedded range, then Int embedded, then Box |
Data Serialization#
In ErgoTree, all runtime data values have an associated type also available at runtime (this is called type reification).
However, the serialization format separates data values from their type descriptors. This saves space when for example, a collection of items is serialized. It is done so that a type tree can fully describe the contents of a typed data structure.
For example, having a typed data object d: (Int, Coll[Byte], Boolean), we can tell, by examining the structure of the type, that d is a tuple with three items; the first item contains a 32-bit integer, the second - a collection of bytes, and the third - logical true/false value.
To serialize/deserialize typed data, we need to know its type descriptor (type tree). The data serialization procedure is recursive over a type tree and the corresponding sub-components of the data object. For primitive types (the leaves of the type tree), the format is fixed. The data values of ErgoTree types are serialized according to the predefined recursive function shown in Figure 5 which uses the notation from Table 3.
Figure 5: Data serialization format
GroupElement serialization#
A value of the GroupElement type is represented in reference implementation using SecP256K1Point class of the org.bouncycastle.math.ec.custom.sec
package and serialized into ASN.1 encoding.
The different encodings are considered during deserialization, including point compression for Fp (see X9.62 sec. 4.2.1 pg. 17).
Figure 6: GroupElement serialization format
SigmaProp serialization#
In reference implementation values of the SigmaProp type are serialized using SigmaBoolean. serializer.
AvlTree serialization#
In reference, implementation values of the AvlTree type are serialized using AvlTreeData. serializer.
Figure 8: AvlTree serialization format
Constant Serialization#
The Constant format is simple and self-sufficient to represent any data value. Serialized bytes of the Constant format contains both the type bytes and the data bytes; thus, they can be stored, or wire transferred and then later unambiguously interpreted. The format is shown in Figure 9
Figure 9: Constant serialization format
To parse the Constant format, first, use the type serializer from section [5.1] and read the type. Then use the parsed type as an argument of the data serializer given in section [5.2]
Expression Serialization#
Expressions of ErgoTree are serialized as tree data structures using the recursive procedure described in [Figure 10]. Expression nodes are represented in the reference implementation using the Value class.
ErgoTree serialization#
The ErgoTree propositions stored in UTXO boxes are represented in the reference implementation using ErgoTree class. The class is serialized using the ErgoTree serialization format shown in [Figure 11].
Figure 11: ErgoTree serialization format
Serialized instances of ErgoTree class are self-sufficient and can be stored and passed around.
ErgoTree format defines the top-level serialization format of ErgoTree scripts. The interpretation of the byte array depends on the first header bytes, which use VLQ encoding up to 30 bits. We define meaning for only the first byte, which may be extended in future versions. The meaning of the bits is shown in [Figure 12].
Figure 12: ErgoTree header bits
Currently, we don’t specify interpretation for the second and other bytes of the header. We reserve the possibility to extend the header by using Bit 7 == 1 and chain additional bytes as in VLQ.
Once the new bytes are required, a new language version should be created and implemented via soft-forkability. That new language will give an interpretation for the new bytes.
The default behavior of ErgoTreeSerializer is to preserve the original structure of ErgoTree and check the consistency. In case of any inconsistency, the serializer throws an exception.
If constant segregation Bit4 is set to 1, then the constants collection contains the constants for which there may be ConstantPlaceholder nodes in the tree. However, if the constant segregation bit is 0, then the constants collection should be empty, and any placeholder in the tree will lead to an exception.
Type Serialization#
This page is a WIP. Please see ErgoTree.pdf for full details.
In this section, we describe how the types (like \lst{Int}, \lst{Coll[Byte]}, etc.) are serialized; then, we define the serialization of typed data. This will give us a basis to describe the serialization of Constant nodes of \ASDag. From that, we proceed to serialization of arbitrary \ASDag trees.
For the motivation behind this type of encoding, please see Appendix~\ref{sec:appendix:motivation:type}.
Distribution of type codes#
The whole space of 256 codes is divided as the following:
Interval | Distribution |
---|---|
\(\lst{0x00}\) | special value to represent undefined type (\(\lst{NoType}\) in \(\ASDag\)) |
\(\lst{0x01 - 0x6F(111)}\) | data types including primitive types, arrays, options aka nullable types, classes (in future), 111 = 255 - 144 different codes |
\(\lst{0x70(112) - 0xFF(255)}\) | function types \(\lst{T1 => T2}\), 144 = 12 x 12 different codes |
Encoding Data Types#
There are nine different values for primitive types, and two more are reserved for future extensions. Each primitive type has an id in a range {1,...,11} as the following.
Id | Type |
---|---|
1 | Boolean |
2 | Byte |
3 | Short (16 bit) |
4 | Int (32 bit) |
5 | Long (64 bit) |
6 | BigInt (java.math.BigInteger) |
7 | GroupElement (org.bouncycastle.math.ec.ECPoint) |
8 | SigmaProp |
9 | reserved for Char |
10 | reserved for Double |
11 | reserved |
For each type constructor like Coll
or Option
, we follow the encoding schema defined below. Each type constructor has an associated base code (e.g., 12 for Coll[_]
, 24 for Coll[Coll[_]]
, etc.), which is a multiple of 12.
To obtain the code of a constructed type, the base code is added to the primitive type ID. For example, 12 + 1 = 13 represents the code for Coll[Byte]
. The code of the type constructor (12 in this example) is used when the type parameter is a non-primitive type, such as Coll[(Byte, Int)]
. In this case, the code of the type constructor is read first, followed by a recursive descent to read the bytes of the parameter type (in this case, (Byte, Int)
).
This encoding scheme enables simple and efficient decoding by utilizing division and modulus operations.
The interval of codes for data types is divided as the following:
Interval | Type constructor | Description |
---|---|---|
0x01 - 0x0B(11) | primitive types (including 2 reserved) | |
0x0C(12) | Coll[_] |
Collection of non-primitive types (Coll[(Int,Boolean)] ) |
0x0D(13) - 0x17(23) | Coll[_] |
Collection of primitive types (Coll[Byte] , Coll[Int] , etc.) |
0x18(24) | Coll[Coll[_]] |
Nested collection of non-primitive types (Coll[Coll[(Int,Boolean)]] ) |
0x19(25) - 0x23(35) | Coll[Coll[_]] |
Nested collection of primitive types (Coll[Coll[Byte]] , Coll[Coll[Int]] ) |
0x24(36) | Option[_] |
Option of non-primitive type (Option[(Int, Byte)] ) |
0x25(37) - 0x2F(47) | Option[_] |
Option of primitive type (Option[Int] ) |
0x30(48) | Option[Coll[_]] |
Option of Coll of non-primitive type (Option[Coll[(Int, Boolean)]] ) |
0x31(49) - 0x3B(59) | Option[Coll[_]] |
Option of Coll of primitive type (Option[Coll[Int]] ) |
0x3C(60) | (_,_) |
Pair of non-primitive types (((Int, Byte), (Boolean,Box)) , etc.) |
0x3D(61) - 0x47(71) | (_, Int) |
Pair of types where first is primitive ((_, Int) ) |
0x48(72) | (_,_,_) |
Triple of types |
0x49(73) - 0x53(83) | (Int, _) |
Pair of types where second is primitive ((Int, _) ) |
0x54(84) | (_,_,_,_) |
Quadruple of types |
0x55(85) - 0x5F(95) | (_, _) |
Symmetric pair of primitive types ((Int, Int) , (Byte,Byte) , etc.) |
0x60(96) | (_, ..., _) |
Tuple type with more than 4 items (Int, Byte, Box, Boolean, Int) |
0x61(97) | Any |
Any type |
0x62(98) | Unit |
Unit type |
0x63(99) | Box |
Box type |
0x64(100) | AvlTree |
AvlTree type |
0x65(101) | Context |
Context type |
0x65(102) | String |
String |
0x66(103) | IV |
TypeIdent |
0x67(104) - 0x6E(110) | reserved for future use | |
0x6F(111) | Reserved for future Class type (e.g. user-defined types) |
Encoding Function Types#
We use \(12\) different values for both domain and range types of functions. This gives us \(12 * 12 = 144\) function types in total and allows us to represent \(11*11 = 121\) functions over primitive types using just a single byte.
Each code \(F\) in a range of function types can be represented as \(F = D * 12 + R + 112\), where \(D, R \in \{0,\dots,11\}\) - indices of domain and range types correspondingly, \(112\) - is the first code in an interval of function types.
If \(D = 0\), then the domain type is not primitive and recursive descent is necessary to write/read the domain type.
If \(R = 0\), then the range type is not primitive and recursive descent is necessary to write/read the range type.
Recursive Descent#
When an argument of a type constructor is not a primitive type, we fallback to the simple encoding schema.
In such a case, we emit the special code for the type constructor according to the table above and descend recursively to every child node of the type tree.
We only perform this descent for those children whose code cannot be embedded in the parent code. For example, serialization of Coll[(Int,Boolean)]
proceeds as follows:
- Emit
0x0C
because the element of the collection is not primitive. - Recursively serialize
(Int, Boolean)
. - Emit
0x3D
because the first item in the pair is primitive. - Recursively serialize
Boolean
. - Emit
0x02
- the code for the primitive typeBoolean
.
Table: Size of Serialized Types#
Type | D | R | Bytes | #Bytes | Comments |
---|---|---|---|---|---|
Byte |
1 | 1 | |||
Coll[Byte] |
12 + 1 = 13 | 1 | |||
Coll[Coll[Byte]] |
24 + 1 = 25 | 1 | |||
Option[Byte] |
36 + 1 = 37 | 1 | register | ||
Option[Coll[Byte]] |
48 + 1 = 49 | 1 | register | ||
(Int,Int) |
84 + 3 = 87 | 1 | fold | ||
Box => Boolean |
7 | 2 | 198 = 7 * 12 + 2 + 112 | 1 | exist, for all |
(Int,Int) => Int |
0 | 3 | 115 = 0 * 12 + 3 + 112, 87 | 2 | fold |
(Int,Boolean) |
60 + 3, 2 | 2 | |||
(Int,Box) => Boolean |
0 | 2 | 0 * 12 + 2 + 112, 60 + 3, 7 | 3 |