Serialization#
This page defines the binary format used to store ErgoTree contracts in persistent stores, transfer them over the wire, and enable cross-platform interoperation.
Overview#
The terms of the ErgoTree language can be serialized into an array of bytes for storage in the Ergo blockchain (e.g., in the propositionBytes
field of a Box
).
When the guarding script of a transaction's input box is validated, the propositionBytes
array is deserialized into an ErgoTree Intermediate Representation (IR) (represented by the ErgoTree
class in the reference implementation), which can then be evaluated.
The serialization procedure is specified in general terms. The serialization format for ErgoTree types (specified by the SType
class hierarchy) and expression nodes (specified by the Value
class hierarchy) is defined in section 5.1 and Appendix C, respectively.
Table 1 shows the size limits checked during contract deserialization, which are important for resisting malicious script attacks.
Table 1: Serialization limits#
Constant | Value | Description |
---|---|---|
VLQ max | 10 | Maximum size of VLQ encoded byte sequence (See VLQ formats E.1) |
T max | 100 | Maximum size of serialized type term (see Type format 5.1) |
D max | 4Kb | Maximum size of serialized data instance (see Data format 5.2) |
C max | = T max + D max | Maximum size of serialized data instance (see Const format 5.3) |
Expr max | 4Kb | Maximum size of serialized ErgoTree term (see Expr format 5.4) |
ErgoTree max | 4Kb | Maximum size of serialized ErgoTree contract (see ErgoTree format 5.5) |
All the serialization formats used and defined throughout this section are listed in Table 2, which introduces a name for each format and shows the number of bytes each format may occupy in the byte stream.
Table 2: Serialization formats#
Format | #bytes | Description |
---|---|---|
\(\lst{Byte}\) | \(1\) | 8-bit signed two's-complement integer |
\(\lst{Short}\) | \(2\) | 16-bit signed two's-complement integer (big-endian) |
\(\lst{Int}\) | \(4\) | 32-bit signed two's-complement integer (big-endian) |
\(\lst{Long}\) | \(8\) | 64-bit signed two's-complement integer (big-endian) |
\(\lst{UByte}\) | \(1\) | 8-bit unsigned integer |
\(\lst{UShort}\) | \(2\) | 16-bit unsigned integer (big-endian) |
\(\lst{UInt}\) | \(4\) | 32-bit unsigned integer (big-endian) |
\(\lst{ULong}\) | \(8\) | 64-bit unsigned integer (big-endian) |
\(\lst{VLQ(UShort)}\) | \([1..3]\) | Encoded unsigned \(\lst{Short}\) value using VLQ. |
\(\lst{VLQ(UInt)}\) | \([1..5]\) | Encoded unsigned 32-bit integer using VLQ. |
\(\lst{VLQ(ULong)}\) | \([1..\MaxVlqSize]\) | Encoded unsigned 64-bit integer using VLQ. |
\(\lst{Bits}\) | \([1..\MaxBits]\) | A collection of bits packed in a sequence of bytes. |
\(\lst{Bytes}\) | \([1..\MaxBytes]\) | A sequence of bytes, whose size is stored elsewhere or well-known. |
\(\lst{Type}\) | \([1..\MaxTypeSize]\) | Serialized type terms of \(\langname\). |
\(\lst{Data}\) | \([1..\MaxDataSize]\) | Serialized data values of \(\langname\). |
\(\lst{GroupElement}\) | \(33\) | Serialized elements of elliptic curve group. |
\(\lst{SigmaProp}\) | \([1..\MaxSigmaProp]\) | Serialized sigma propositions. |
\(\lst{AvlTree}\) | \(44\) | Serialized dynamic dictionary digest. |
\(\lst{Constant}\) | \([1..\MaxConstSize]\) | Serialized \(\langname\) constants (values with types). |
\(\lst{Expr}\) | \([1..\MaxExprSize]\) | Serialized expression terms of \(\langname\). |
\(\lst{ErgoTree}\) | \([1..\MaxErgoTreeSize]\) | Serialized instances of \(\langname\) contracts. |
We use the [1..n]
notation when serialization may produce from 1 to n bytes (depending on the actual data).
The serialization format of ErgoTree is optimized for compact storage and rapid deserialization.
In many cases, the serialization procedure is data dependent and thus has branching logic.
We use a pseudo-language with operators like for
, match
, if
, and optional
to express this complex serialization logic in the specification.
The language allows us to specify a structure composed of simple serialization slots.
Each slot specifies a fragment of the serialized stream of bytes, while operators specify how the slots are combined to form the resulting stream of bytes.
The notation is summarized in Table 3.
Table 3: Serialization Notation#
Notation | Description |
---|---|
\(\Denot{T}\) where \(T\) - type | Denotes a set of values of type \(T\) |
\(v \in \Denot{T}\) | The value \(v\) belongs to the set \(\Denot{T}\) |
\(v : T\) | Same as \(v \in \Denot{T}\) |
\(\lst{match}\) \((t, v)\) | Pattern match on pair \((t, v)\) where \(t, v\) - values |
\(\lst{with}\) \((Unit, v \in \Denot{Unit})\) | Pattern case |
\(\lst{for}\) i=1 \(\lst{to}\) len \(~\lst{serialize(}\)v_i\(\lst{)}\) \ \(\lst{end for}\) | Call the given \(\lst{serialize}\) function repeatedly. The output bytes of all invocations are concatenated and become the output of the \(\lst{for}\) statement. |
\(\(\lst{if}~\)condition\)~\(\lst{then}\) | Serialize one of the branches depending on the condition. The output bytes of the executed branch become the output of the \(\lst{if}\) statement. |
\(\lst{serialize1(}\)v_1$\lst{)} | \lst{else} |
In the next section, we describe how types (Int, Coll[Byte], etc.) are serialized; then, we define the serialization of typed data.
This will give us a basis to describe the serialization of Constant nodes of ErgoTree. After that, we will proceed to the serialization of arbitrary ErgoTree trees.
Type Serialization#
For the motivation behind this type of encoding, please see Appendix D.1.
Distribution of type codes#
The whole space of 256 one-byte codes is divided, as shown in Table 4.
Table 4: Distribution of type codes between Data and Function types#
Value/Interval | Distribution |
---|---|
\(\lst{0x00}\) | special value to represent undefined type (\(\lst{NoType}\) in \(\ASDag\)) |
\(\lst{0x01 - 0x6F(111)}\) | data types including primitive types, arrays, options aka nullable types, classes (in future), 111 = 255 - 144 different codes |
\(\lst{0x70(112) - 0xFF(255)}\) | function types \(\lst{T1 => T2}\), 144 = 12 x 12 different codes~\footnote{Note that the function types are never serialized in version 1 of the Ergo protocol, this encoding is reserved for future development of the protocol.} |
Encoding of Data Types#
There are eight different values for embeddable types, and three more are reserved for future extensions. Each embeddable type has a type code in the range 1,...,11
as shown in Table 5.
Table 5: Embeddable Types#
Code | Type |
---|---|
1 | Boolean |
2 | Byte |
3 | Short (16-bit) |
4 | Int (32 bit) |
5 | Long (64-bit) |
6 | BigInt (represented by java.math.BigInteger) |
7 | GroupElement (represented by org.bouncycastle.math.ec.ECPoint) |
8 | SigmaProp |
9 | reserved for Char |
10 | reserved |
11 | reserved |
Table 6: Code Ranges of Data Types#
Interval | Constructor | Description |
---|---|---|
0x01 - 0x0B(11) | embeddable types (including 3 reserved) | |
0x0C(12) | Coll[_] | |
0x0D(13) - 0x17(23) | Coll[_] | Collection of embeddable types (Coll[Byte], Coll[Int], etc.) |
0x18(24) | Coll[Coll[_]] | Nested collection of non-embeddable types (Coll[Coll[(Int,Boolean)]]) |
0x19(25) - 0x23(35) | Coll[Coll[_]] | Nested collection of embeddable types (Coll[Coll[Byte]], Coll[Coll[Int]]) |
0x24(36) | Option[_] | Option of non-embeddable type (Option[(Int, Byte)]) |
0x25(37) - 0x2F(47) | Option[_] | Option of embeddable type (Option[Int]) |
0x30(48) | Option[Coll[_]] | Option of Coll of non-embeddable type (Option[Coll[(Int, Boolean)]]) |
0x31(49) - 0x3B(59) | Option[Coll[_]] | Option of Coll of embeddable type (Option[Coll[Int]]) |
0x3C(60) | (,) | Pair of non-embeddable types (((Int, Byte), (Boolean,Box)), etc.) |
0x3D(61) - 0x47(71) | (_, Int) | Pair of types where first is embeddable ((_, Int)) |
0x48(72) | (,,_) | Triple of types |
0x49(73) - 0x53(83) | (Int, _) | Pair of types where second is embeddable ((Int, _)) |
0x54(84) | (,,,) | Quadruple of types |
0x55(85) - 0x5F(95) | (_, _) | Symmetric pair of embeddable types ((Int, Int), (Byte,Byte), etc.) |
0x60(96) | (,...,) | Tuple type with more than 4 items (Int, Byte, Box, Boolean, Int) |
0x61(97) | Any | Any type |
0x62(98) | Unit | Unit type |
0x63(99) | Box | Box type |
0x64(100) | AvlTree | AvlTree type |
0x65(101) | Context | Context type |
0x66(102) | reserved for String | |
0x67(103) | reserved for TypeVar | |
0x68(104) | Header | Header type |
0x69(105) | PreHeader | PreHeader type |
0x6A(106) | Global | Global type |
0x6B(107)-0x6E(110) | reserved for future use | |
0x6F(111) | Reserved for future Class type (e.g. user-defined types) |
We use the encoding schema defined below for each type constructor, like Coll or Option.
- A type constructor has an associated base code which is a multiple of 12
- (e.g., 12 for Coll[_], 24 for Coll[Coll[_]], etc.).
- The base code can be added to the embeddable type code to produce the code of the constructed type.
- For example,
12 + 1 = 13
is the code for Coll[Byte]. - The code of the type constructor (e.g., 12 in this example) is used when the type parameter is a non-embeddable type (e.g., Coll[(Byte, Int)]).
- In this case, the code of the type constructor is read first, and then recursive descent is performed to read the bytes of the parameter type (in this case, (Byte, Int)).
- For example,
This encoding allows very simple and fast decoding using div and mod operations.
Following the above encoding schema, the interval of codes for data types is divided, as shown in Table 6.
Encoding of Function Types#
We use 12 different values for both domain and range types of functions.
This gives us 144 (12∗12) function types in total and allows us to represent 121 (11∗11) functions over primitive types using just a single byte.
Each code F in the range of function types (i.e., F ∈ {112, . . . , 255}) can be represented as F = D∗12+R+112, where D, R ∈ {0, . . . , 11} are the indices of domain and range types correspondingly, and 112 is the first code in the interval of function types.
- If D = 0, the domain type is not embeddable, and recursive descent is necessary to write/read the domain type.
- If R = 0, then the range type is not embeddable, and recursive descent is necessary to write/read the range type.
Recursive Descent#
When an argument of a type constructor is not a primitive type, we fall back to the simple encoding schema. In this case, we emit the separate code for the type constructor according to the table above and descend recursively to every child node of the type tree.
We perform this descent only for those children whose code cannot be embedded in the parent code.
For example, serialization of Coll[(Int,Boolean)] proceeds as follows:
- Emit 0x0C because the element type of the collection is not embeddable.
- Recursively serialize (Int, Boolean).
- Emit 0x41(=0x3D+4) because the first type of the pair is embeddable, and its code is 4.
- Recursively serialize Boolean.
- Emit 0x02 - the code for the embeddable type Boolean.
More examples of type serialization are shown in Table 7.
Type | D | R | Serialized Bytes | #Bytes | Comments |
---|---|---|---|---|---|
Byte | 2 | 1 | simple embeddable type | ||
Coll[Byte] | 12 + 2 = 14 | 1 | embeddable type in Coll | ||
Coll[Coll[Byte]] | 24 + 2 = 26 | 1 | embeddable type in nested Coll | ||
Option[Byte] | 36 + 2 = 38 | 1 | embeddable type in Option | ||
Option[Coll[Byte]] | 48 + 2 = 50 | 1 | embeddable type in Coll nested in Option | ||
(Int,Int) | 84 + 4 = 88 | 1 | symmetric pair of embeddable type | ||
Int=>Boolean | 4 | 1 | 161 = 4*12+1+112 | 1 | embeddable domain and range |
(Int,Int)=>Int | 0 | 4 | 115=0*12+4+112, 88 | 2 | embeddable range, then symmetric pair |
(Int,Boolean) | 60 + 4, 1 | 2 | Int embedded in pair, then Boolean | ||
(Int,Box)=>Boolean | 0 | 1 | 0*12+1+112, 60+4, 99 | 3 | func with embedded range, then Int embedded, then Box |
Data Serialization#
In ErgoTree, all runtime data values have an associated type also available at runtime (this is called type reification).
However, the serialization format separates data values from their type descriptors. This saves space when, for example, a collection of items is serialized. This is done so that a type tree can fully describe the contents of a typed data structure.
For example, having a typed data object d: (Int, Coll[Byte], Boolean), we can tell, by examining the structure of the type, that d is a tuple with three items; the first item contains a 32-bit integer, the second a collection of bytes, and the third a logical true/false value.
To serialize/deserialize typed data, we need to know its type descriptor (type tree). The data serialization procedure is recursive over the type tree and the corresponding sub-components of the data object. For primitive types (the leaves of the type tree), the format is fixed. The data values of ErgoTree types are serialized according to the predefined recursive function shown in Figure 5, which uses the notation from Table 3.
Figure 5: Data serialization format
GroupElement serialization#
A value of the GroupElement type is represented in the reference implementation using the SecP256K1Point
class of the org.bouncycastle.math.ec.custom.sec
package and serialized using ASN.1 encoding.
Different encodings are considered during deserialization, including point compression for Fp (see X9.62 sec. 4.2.1 pg. 17).
Figure 6: GroupElement serialization format
SigmaProp serialization#
In the reference implementation, values of the SigmaProp type are serialized using SigmaBoolean.serializer
.
AvlTree serialization#
In the reference implementation, values of the AvlTree type are serialized using AvlTreeData.serializer
.
Figure 8: AvlTree serialization format
Constant Serialization#
The Constant format is simple and self-sufficient to represent any data value. Serialized bytes in the Constant format contain both the type bytes and the data bytes; thus, they can be stored or transferred over the wire and then later unambiguously interpreted. The format is shown in Figure 9.
Figure 9: Constant serialization format
To parse the Constant format, first, use the type serializer from section 5.1 and read the type. Then use the parsed type as an argument of the data serializer given in section 5.2.
Expression Serialization#
Expressions of ErgoTree are serialized as tree data structures using the recursive procedure described in Figure 10. Expression nodes are represented in the reference implementation using the Value
class hierarchy.
ErgoTree serialization#
The ErgoTree propositions stored in UTXO boxes are represented in the reference implementation using the ErgoTree
class. The class is serialized using the ErgoTree serialization format shown in Figure 11.
Figure 11: ErgoTree serialization format
Serialized instances of the ErgoTree
class are self-sufficient and can be stored and passed around.
The ErgoTree format defines the top-level serialization format of ErgoTree scripts. The interpretation of the byte array depends on the first header byte(s), which use VLQ encoding (up to 30 bits are reserved). We currently define meaning only for the first byte, which may be extended in future versions. The meaning of the bits is shown in Figure 12.
Figure 12: ErgoTree header bits
Currently, we don’t specify interpretation for the second and other bytes of the header. We reserve the possibility to extend the header by using Bit 7 == 1 and chaining additional bytes as in VLQ.
Once new bytes are required, a new language version should be created and implemented via soft-forkability. That new language will give an interpretation for the new bytes.
The default behavior of ErgoTreeSerializer is to preserve the original structure of ErgoTree and check consistency. In case of any inconsistency, the serializer throws an exception.
If the constant segregation bit (Bit 4) is set to 1, then the constants collection contains the constants for which there may be ConstantPlaceholder
nodes in the tree. However, if the constant segregation bit is 0, then the constants collection should be empty, and any placeholder in the tree will lead to an exception.