[−][src]Module binjs::io::context::format
Format documentation.
Context 0.1 Format (née "fbssdc")
- Dominic Cooney (dpc) 2019-06-27
- Reformated for markdown David Teller (yoric) 2019-09-02
Overview
BinAST is a Work In Progress format for JavaScript programs designed for compact transfer and fast, streaming startup. This document describes the "Context 0.1" format of files on disk.
The goals of this iteration of the BinAST format are to:
- Represent JavaScript syntax trees and metadata about variable capture, etc.
- Be compact when compressed with Brotli.
- Support random access to the content of lazy functions within the decompressed stream.
The syntax trees and metadata are described in this IDL file: https://github.com/binast/binjs-ref/spec/es6.webidl and can be produced using the binjs_encode
tool’s --show-ast
option; see https://github.com/binast/binjs-ref/. You should have a copy of the IDL handy when reading this document.
This document describes the structure of the files on disk at the bit level. This document includes commentary about design choices in the format, but not in the AST/IDL structure. The Context 0.1 format consists of the following items, in order:
- A signature, the bytes
89h
,42h
,4ah
,53h
(these three bytes"BJS"
)dh
ah
0h
ah
2h
(NOTE: Currently not implemented!) - A table of strings.
- A set of tables for constructing codes used to decode the AST
- An encoded Script AST node
The subsequent sections describe each of these items in detail.
To ease skimming, the format is summarized in BNF in green. The previous list can be written this way:
File ::= Signature StringTable CodeTables Node
Repeated items will be written between square brackets with braces indicating the number of items. Literal bytes will be written in hexadecimal with a trailing h
.
Signature
The file starts with these bytes: 89h
, 42h
, 4ah
, 53h
(these three bytes "BJS"
) dh
ah
0h
ah
followed by a version number.
Signature ::= 89h 42h 4ah 53h 0dh 0ah 00h 0ah VersionNumber VersionNumber ::= 02h
String Table
JavaScript programs contain a lot of text data, not only in string literals, but also in property names, variable names, etc. The format lifts string data into a header section and refers to strings by index.
This has a couple of benefits: First, property names can be relatively long and are referred to repetitively, so using an index is more succinct. Second, there is a lot of redundancy between strings and Brotli compresses these more effectively if they are close together (although the benefit only scales ~logarithmically.)
String Table Format
The string table consists of:
- The number of strings, n, in varuint encoding.
- n strings.
Each string consists of the literal bytes of string content, with a null terminator. Because strings may contain embedded nulls, nulls are escaped as 01h 00h
, which means 01h
bytes must be escaped too; they’re escaped as 01h 01h
.
The format does not do any semantically impactful string mangling, for example, if a string literal has an escape sequence, the escape sequence will be represented in the string data.
StringTable ::= StringCount [String]{StringCount} StringCount ::= VarUint String ::= [StringByte] 00h StringBytes ::= 02h .. ffh (a literal byte) | 01 00h (the byte 00h) | 01 01h (the byte 01h)
External String Tables
The format can optionally refer to strings from an external string table. External string tables make it possible to produce very small files because strings can often be cached, reused, and shared between files (for example a library function and its caller will both refer to that method’s name.) These strings are assigned indices starting after the last intrinsic string. The external string table is specified out-of-band and a decoder has limited capacity to detect a missing external string table -- when it encounters an out-of-bounds string table reference.
External string tables start with the ASCII bytes astdict
. The following bytes are in the same format as the intrinsic string table.
StringFile ::= StringSignature StringTable StringSignature ::= 61h 73h 74h 64h 69h 63h 74h
Code Tables
In general the AST is serialized by traversing the AST writing the symbols encountered into the file. (There is an exception to this supporting random access to lazy function bodies described later.) Because the content of files can vary a lot, the format does not use a fixed set of codes but instead the file contains tables of codes used in the file.
Symbols
ASTs contain the following kinds of symbols:
- Primitive values: Booleans, unsigned integers, JavaScript numbers (IEEE 754 64-bit floating point numbers), strings, and optional strings.
- Enumeration values. (These are described using strings in es6.webidl, but the format treats enums as a distinct kind of value.)
- Interface type tags. For example, Script and Block are interface type tags. Note that typedefs in es6.webidl are for convenience when reading but the format deals with desugared interface definitions without typedefs.
- None, a special symbol which represents an absent interface.
Symbols can appear in the file in two ways: literally; or encoded with a code. The format makes it clear when each form is used and the two forms are never mixed.
Symbol Context Modeling
The Context 0.1 format saves space by using different symbol → code mappings depending on the part of the AST being encoded. For example, when encoding a Boolean value, only two symbols are valid so the format can use a code which is at most 1 bit long, even though that overlaps with codes in other contexts.
There are two kinds of contexts:
- Fields, for example
AssertedBoundName.isCaptured
. - Array lengths of a given array type, for example a
FrozenArray<(Spread or Expression)>
length
The set of symbols which are allowed depend on the type associated with the context:
- For a field with an array type, the associated type is the element type of the array.
- For other fields, the associated type is the type of the field.
- For an array length, the associated type is unsigned long.
Given an associated type, the allowable symbols are:
- For a primitive type, the allowed symbols are values of that type.
- For an enum type, the allowed symbols are the members of the enum.
- For a set of interface type tags, possibly including
None
, the allowed symbols are those type tags, andNone
if applicable.
Each context which appears in the AST will have a code table, with two exceptions:
-
Fields with a monomorphic interface type or a
FrozenArray
of a monomorphic interface element type must always produce that interface type tag. It is redundant to produce tables for these and the format prevents that. -
The
StaticMemberAssignmentTarget.property
field andStaticMemberExpression.property
field are treated as the same context. These fields usually have large tables of property names with similar lengths, so they are combined to save space. Algorithms which talk about enumerating the fields of an interface should treat these two interfaces as sharing one property field.
Code Table Encoding
Each code table does not specify the codes directly, but instead specifies the length of each symbol’s code. How codes are assigned based on length is described later.
Code tables can be encoded multiple ways:
CodeTable ::= UnitCodeTable | MultiCodeTableImplicit | MultiCodeTableExplicit | EmptyCodeTable UnitCodeTable ::= 00h LiteralSymbol MultiCodeTableExplicit ::= 01h CodeCount [CodeLength]{CodeCount} [LiteralSymbol]{CodeCount} CodeLength ::= 00h .. 14h MultiCodeTableExplicit ::= 01h [CodeLength]{SymbolCount} EmptyCodeTable ::= 02h
The UnitCodeTable
is a table with a single symbol. This symbol is assigned given a zero-length code. As a consequence when that table is used to encode a symbol in the tree, it generates no bits in the output.
Because modern minified JavaScript tends to be very regular UnitCodeTables
occur a lot in practice and are a large contributor to the size reduction of BinAST encoded tree sizes. For example, BinAST tracks whether each scope uses eval
(see hasDirectEval
in es6.webidl
.) However modern JavaScript practice avoids using eval
. Encoding a unit code table for false
takes two bytes, 00h 00h
, so the format can represent an unlimited number of block scope’s lack of eval
with a fixed overhead of two bytes -- fewer after Brotli compression.
Note that a unit code table is the only way to produce a zero-length code. MultiCodeTableImplicit
may mention symbols with zero length, but these are skipped during code assignment.
The MultiCodeTable
is a table with multiple symbols. There are two variants of the encoding depending on the type of symbol being encoded.
The implicit table is for symbols which form a small, closed set. The lengths of all symbols are enumerated. Unused symbols are denoted with zero length and are skipped during code assignment.
The explicit table is for symbols which form an effectively open set. The table includes a symbol count, then the lengths of all symbols, then the literal symbols.
Type | Table Type | Ordering |
---|---|---|
Boolean | Implicit | false , true |
Unsigned integers | Explicit | Encoder determined |
JavaScript numbers | Explicit | Encoder determined |
Strings | Explicit | Encoder determined |
Optional strings | Explicit | Encoder determined |
Enumeration values | Implicit | IDL declaration order |
Interface type tags & None | Implicit | None first; then lexical order of interface names |
Finally, the EmptyCodeTable
represents a set of unused codes. How empty code tables arise is described later.
Symbol Literal Encoding
Symbols appear literally in the UnitCodeTable
and MultiCodeTableExplicit
. When a symbol appears literally, it is encoded this way:
Type | Literal Encoding |
---|---|
Booleans | One byte. false , 00h ; true , 01h . |
Unsigned integers | 4-byte big-endian integer |
JavaScript numbers | 8-byte big-endian IEEE-754 |
Strings | Varuint string table index |
Optional string | If None , 00h , otherwise Varuint 1 + string table index |
Enumeration value | Varuint index of IDL declaration order |
Interface type tag & None | Given a set of relevant interface type tags, and None if relevant, sorted by symbol ordering (see below), varuint index of the symbol in the set |
Code Assignment: From Symbol Lengths to Codes
To encode or decode a file, each symbol must be assigned not just a code length but a specific sequence of bits. The format uses a canonical Huffman code for this purpose.
Given a set S
of length, symbol pairs, canonical Huffman codes can be assigned this way:
- Sort
S
by length. For pairs with the same length, sort them by the symbol. (Symbol ordering is defined below.) - Initialize
code = 0
,last_length = 0
- For i ∈ 1...|S|
next_length
= length associated withS[i]
- Assign code of length
next_length
to symbolS[i]
code = (code + 1) × 2^(next_length - last_length)
- last_length = next_length
The codes constructed by this algorithm begin in the high-order bit, so when these codes are written into the file the high-order bit must appear first. That is, codes are written as bitwise big-endian, variable-length numbers.
A decoder should check codes fit in their prescribed length; a file which produced codes longer than their length is invalid. Codes have a maximum length of 20 bits, which in practice limits JavaScript files to roughly a million strings and a million numbers.
Symbol Ordering for Code Assignment
To briefly recap, files contain tables specifying the code lengths (but not the codes themselves) of symbols which appear in the serialized AST. The encoder and decoder must agree on the order different symbols with the same code length are assigned codes using the algorithm above. Here is the order:
Type | Order for Code Assignment |
---|---|
Boolean | false , true |
Unsigned integers | Order literals appear in the code table |
JavaScript numbers | Order literals appear in the code table |
Strings | String table index |
Optional strings | None , then string table index |
Enumeration values | Lexicographic order of IDL string constant |
Interface type tags & None | None , then lexicographic order of name |
Note, this order is for assigning codes. This ordering is not related to the ordering values like strings and numbers might usually have, although that situation could arise coincidentally as in implementation detail of an encoder. As you have seen, enumeration values use a different ordering in enumerating lengths in MultiCodeImplicitTables
. (This was an oversight but a harmless one.)
Code Table Set Encoding
The format can model hundreds of contexts. Most files use far fewer contexts than this. To keep the file compact, the code table set only encodes tables used by the file. It does this by doing a depth-first walk through the fields of the IDL, starting with the fields of the Script interface. The walk is pruned in two ways: when it encounters array length contexts with code tables which only contain the symbol zero; when interface type tags are not present in a set of allowable symbols.
To decode models:
- Initialize the set of visited contexts to empty
- Initialize the set of empty array types to empty
- Push all of the fields of the Script interface
- While the stack of fields is not empty
- Pop a field
- Decode a code table for the effective type of the field.
- For each of the code table’s symbols that are interface type tags:
- Push all of the fields of the interface
To push a field:
- If the field is in the set of visited contexts, stop
- Add the field to the set of visited fields
- If the field has a FrozenArray type
- Determine if the array type is always empty
- If so, stop
- If the effective type is a monomorphic interface, push all of the interface’s fields
- Otherwise, push the field onto the stack
To determine a field’s effective type:
- If the field’s type is an array type, the effective type is the array element type. Stop.
- Otherwise, the effective type is the field’s type. Stop.
To determine if an array type is always empty:
- If the array type is in the set of empty arrays types, it is always empty. Stop.
- If the array type is in the set of visited contexts, it is not always empty. Stop.
- Decode a code table for unsigned longs. This is the code table for this array type.
- Add the array type to the set of visited contexts.
- If the code table has any non-zero symbols, the array type is not always empty. Stop.
- Otherwise, add the array type to the set of empty array types. It is always empty. Stop.
The sharing of array length models by array type, instead of specific fields, explains the need for EmptyCodeTables
: A situation can arise where an array length model includes zero and non-zero values, but within the context of a specific field the array length is always zero in practice.
Node Encoding
With a complete string table and set of codes in hand, encoding and decoding nodes is straightforward:
Node ::= [Symbol] <pad to byte boundary> LazyItemCount [LazyItemLength]{LazyItemCount} [Node]{LazyItemCount} LazyItemCount ::= varuint LazyItemLength ::= varuint
To deserialize an interface I:
- For each field f in I:
- If f’s type is an array:
1. Decode an array length, n, using the code table for the array type consuming the prefix of the input bit stream.
2. For 1 ... n:
- Deserialize a value in context f.
- Deserialize a value in context f.
To deserialize a value in a field context f:
- If f has a Lazy attribute, add the effective type of f to the lazy item list. (Currently, these are always monomorphic interface types.)
- If f’s effective type is a monomorphic interface type, deserialize an interface of f’s effective type. Note, this does not consume the bit stream or consult any code table.
- If f’s effective type includes interface type tags:
- Decode an interface type tag J, or None, using the code table for f consuming the prefix of the input bitstream.
- If J is not none, deserialize an interface J.
- Otherwise, decode a symbol using the code table for f consuming the prefix of the input bitstream.
An AST subtree may contain lazy sections; these are tagged [Lazy]
in es6.webidl. A tree or subtree has postamble 0-padding to a byte boundary, a redundant count of lazy items, and a list of byte lengths which refer to the following serialized lazy nodes. This arrangement supports random access to lazy fields, once the subtree containing the field has been deserialized to discover the types of those fields.
The encoding is recursive -- the script’s lazy nodes may themselves have lazy nodes which are written at the end of that node’s serialization.
Variable-length Unsigned Integers (varuint)
Encoding
Unsigned integers appear in various places in the format. These are encoded using a byte-oriented format of 7 bits of little-endian data and the high order bit indicating continuation. These should not exceed 5 bytes containing up to 32 bits and 3 zero bits.
Varuint ::= [1nnnnnnn]{0..4} 0nnnnnnn