Skip to content

ErgoScript Compiler Phases#

This document provides an overview of the key phases in the ErgoScript compiler pipeline, including specific implementations in the sigma-rust and sigmastate-interpreter projects. The ErgoScript compiler takes high-level ErgoScript code and translates it into an intermediate form that can be evaluated by the ErgoTree interpreter or serialized for execution in the Ergo blockchain.

Compiler Pipeline Overview#

The Sigma frontend implements the following pipeline:

SourceCodeparsebindtypecheckbuildGraphbuildTreeErgoTree

Here’s a breakdown of each phase:

  1. Source Code: The input is a string of Unicode characters representing the ErgoScript code.
  2. Parse: Converts the source code into an Abstract Syntax Tree (AST).
  3. Bind: Resolves identifiers within the AST, linking them to their corresponding definitions.
  4. Typecheck: Ensures all expressions in the AST are type-safe by assigning types to each node.
  5. BuildGraph: Transforms the typed AST into a directed acyclic graph (DAG) for optimization.
  6. BuildTree: Converts the optimized graph into an ErgoTree, the executable intermediate representation.

1. Lexical Analysis (Lexer)#

Objective: The Lexer, or tokenizer, scans the source code and converts it into a sequence of tokens. Tokens are the smallest meaningful elements in the code, such as keywords, operators, literals, and identifiers.

  • How it works: The Lexer reads the input source code character by character and groups them into tokens. These tokens are categorized and labeled for the parser to consume.

    • Example: Keywords like if, else, or operators like +, - are converted into tokens that represent these elements in the language.
  • In sigma-rust: Implemented using the Logos crate, which provides an efficient way to define lexing rules in Rust.

  • In sigmastate-interpreter: A custom Scala-based Lexer is used for tokenization.

2. Syntax Analysis (Parser)#

Objective: The Parser takes the list of tokens generated by the Lexer and constructs an Abstract Syntax Tree (AST). The AST represents the syntactical structure of the source code in a tree format, where each node corresponds to a language construct like expressions, statements, or functions.

  • How it works: The parser traverses the tokens in a predefined grammatical order to build the AST. This process involves ensuring that the source code follows the syntax rules of the language. In Sigma, this is done via the SigmaCompiler.parse method.

    • Example: An expression like a + b would be represented as a node in the AST with + as the root, and a and b as its children.

    • Error Handling: Any parsing errors throw a ParserException.

  • In sigma-rust: The parser uses the Rowan library to manage Concrete Syntax Trees (CST) and Abstract Syntax Trees (AST). AST nodes wrap Rowan's trees (CST) and expose node-specific details through methods (e.g., BinaryExpr::lhs()).

  • In sigmastate-interpreter: The Scala implementation utilizes its own parsing strategies to construct the AST from tokens.

3. High-level Intermediate Representation (HIR)#

Objective: The HIR simplifies the AST by abstracting away certain syntactical details while retaining the logical structure of the code. This intermediate representation is more conducive to further analysis and transformation.

  • How it works: The AST is transformed into HIR by "lowering" its structure, which means simplifying and standardizing the representation while retaining all necessary information for the subsequent compilation phases.

    • Example: Complex language constructs in the AST are reduced to simpler, more uniform representations in the HIR.
  • In sigma-rust: Each HIR node has a kind (enum), span (source code reference), and an optional type. This makes it easier to perform operations like type inference and binding.

  • In sigmastate-interpreter: HIR serves a similar purpose, with emphasis on compatibility with the Scala language's functional programming features.

4. Binding (Binder)#

Objective: The Binder phase resolves and links identifiers in the HIR to their corresponding definitions. This includes resolving variables, functions, and constants to ensure that the script's logic is correctly mapped to the underlying data structures.

  • How it works: The Binder, implemented in SigmaBinder.bind, traverses the HIR and replaces symbolic names with actual references. This process involves checking the environment (like global variables and predefined constants) and linking them appropriately. It also transforms environment values of predefined Scala types (such as Int, Boolean, Box, etc.) into constant nodes (IntConstant, BoxConstant, etc.) of the corresponding type.

    • Example: An identifier like HEIGHT in a script would be resolved to the actual block height from the environment.

    • Error Handling: Any binding errors throw a BinderException.

  • In sigma-rust: The Binder rewrites the HIR tree, swapping identifiers (e.g., HEIGHT), some predefined functions (e.g., min/max), and variables from the environment (ScriptEnv) with their HIR nodes, ensuring correct references.

  • In sigmastate-interpreter: A similar process is implemented in Scala, using the Scala language's features for name resolution and binding.

5. Type Inference#

Objective: Type Inference ensures that all expressions in the script are type-safe by determining the types of all expressions in the script. This phase is crucial for ensuring that all operations in the script are performed on compatible types.

  • How it works: The Type Inference algorithm recursively analyzes the HIR and assigns types to each expression. It checks for type consistency and infers types where they are not explicitly provided. The type assignment is performed by the assignType tree transformation, which assigns correct types to all tree nodes.

    • Example: In a script, if an integer is added to a floating-point number, the type inference will ensure that the operation is valid and determine the result's type.
  • Error Handling: Any type inference errors throw a TyperException.

  • In sigma-rust: Type inference is implemented as a separate phase that traverses the HIR tree and assigns types, ensuring type correctness before moving to the next phase.

  • In sigmastate-interpreter: Type inference follows Scala's type system rules, ensuring that all expressions are type-safe.

6. Graph Building#

Objective: This phase converts the type-checked AST into a directed acyclic graph (DAG) where nodes represent operations and edges represent dependencies between operations. This graph representation allows for optimizations such as constant propagation, common subexpression elimination, and dead

code elimination.

  • How it works: The IRContext.buildGraph method takes the typed AST and builds the graph. During graph building, the following optimizations are performed:
  • Constant Propagation: Replaces variables with their known constant values.
  • Common Subexpression Elimination: Identifies and eliminates redundant computations.
  • Dead Code Elimination: Removes code that does not affect the final output.

  • In sigma-rust: The MIR generated during this phase is the final form of the intermediate representation used by the interpreter and for serialization.

  • In sigmastate-interpreter: The graph building phase focuses on preparing the code for efficient execution by the ErgoTree interpreter.

7. Tree Building#

Objective: This phase converts the optimized graph into an ErgoTree, which is the executable intermediate representation of the script. The ErgoTree is then ready to be interpreted by the ErgoTree interpreter.

  • How it works: The IRContext.buildTree method takes the graph produced in the previous phase and constructs the final ErgoTree. This involves finalizing the structure of the script in a form that can be executed by the interpreter.

  • In sigmastate-interpreter: The resulting ErgoTree can be processed by the Sigma interpreter, allowing for secure and efficient execution of the script.

8. Type Checking#

Objective: Type Checking ensures that the types of all expressions and operations in the MIR are consistent and correct. This phase verifies that the script adheres to the type rules established during type inference.

  • How it works: The Type Checker traverses the MIR, checking that each operation's input types are compatible with its output types, and ensuring that the script is type-safe.

    • Example: The Type Checker would flag an error if a string were incorrectly used in a numerical operation.
  • In sigma-rust: Type checking is performed alongside the MIR traversal, validating the correctness of the types before the script is executed.

9. Execution (ErgoTree Interpreter)#

Objective: The ErgoTree interpreter evaluates the MIR nodes by calling the Evaluable::eval() method on the tree root. Each node implements the Evaluable::eval() trait method.

10. ErgoTree Serialization#

Objective: Each MIR node implements the SigmaSerializable trait with sigma_parse() and sigma_serialize() methods for serialization and deserialization.

  • How it works: This phase ensures that the ErgoTree can be serialized into a format suitable for storage or transmission and can be deserialized back into an executable form.

  • In sigma-rust: Each MIR node implements the SigmaSerializable trait for serialization and deserialization.

Resources#