Module binjs_io::multipart

An optimization of TokenReader/TokenWriter, designed to minimize the size of the file. A multipart format, in which each part can be compressed independently.

Overview

The file is divided into sections. Each section is prefixed by its byte length, so as to permit skipping a section and/or reading sections concurrently. Each section may be compressed independently, possibly with different compression formats, with the expectation that this will let compressors take best advantage of the distinct structure of each section.

(future versions may allow file-wide compression, too)

The sections are:

the grammar table;
the strings table (which contains both strings and identifiers);
the representation of the tree.

The grammar table lists the AST nodes used in the file. Its primary role is to serve as a lightweight versioning mechanism. For instance, older versions of JS may define a node Function with three fields (body, arguments and optional name), while more recent versions of JS may define the same node with five fields (body, arguments, async, generator and optional name). A BinAST file may contain either variant of Function, depending on when it was created. The grammar table lets recent parsers determine that some fields are omitted and should be replaced by their default value. In fact, a BinAST file could even contain both variants of Function, for compression purposes. Also, when a parser encounters a grammar table with nodes that either have an unknown name or contain unknown fields, it may reject the file immediately, although it is not required to.
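The versioning check described above can be sketched as follows. This is only an illustration: `Compat` and `check_node` are hypothetical names, not part of this module, and the sketch assumes fields are compared by name only.

```rust
/// Outcome of comparing a node as declared in a file's grammar table
/// with the same node as known to the parser. (Hypothetical type.)
#[derive(Debug, PartialEq)]
enum Compat<'a> {
    /// The file is usable; these parser-side fields are absent from the
    /// file and should be replaced by their default value.
    Ok { defaulted: Vec<&'a str> },
    /// The file declares a field that the parser does not know.
    /// The caller may reject the file, but does not have to.
    UnknownField(String),
}

/// Compares the fields listed in the file with the fields the parser expects.
fn check_node<'a>(file_fields: &[&str], parser_fields: &[&'a str]) -> Compat<'a> {
    for &field in file_fields {
        if !parser_fields.contains(&field) {
            return Compat::UnknownField(field.to_string());
        }
    }
    let defaulted: Vec<&'a str> = parser_fields
        .iter()
        .copied()
        .filter(|field| !file_fields.contains(field))
        .collect();
    Compat::Ok { defaulted }
}
```

With the Function example from the text, a file written with the three-field variant and read by a parser that knows the five-field variant yields `async` and `generator` as the fields to default.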

The strings table lists all strings (including identifiers) in the file. Its primary role is to speed up parsing by making sure that each string only needs to be parsed/checked/atomized once during parsing. Its secondary role is compression.

In the current version, the tree is a sequence of tokens. All these tokens are ambiguous and a stream may only be tokenized by a client that knows both the grammar and the grammar table. Specific tokens (lists) contain their byte length, so as to allow skipping them for purposes of lazy parsing and/or concurrent parsing.

Format

The entire file is formatted as:

the characters "BINJS";
a container version number (varnum, currently 0);
the compressed grammar table (see below);
the compressed strings table (see below);
the compressed tree (see below).
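As a rough sketch of reading the start of a file, the following checks the "BINJS" marker and decodes the container version. The varnum encoding is not specified on this page, so the sketch assumes an unsigned LEB128 encoding (7 payload bits per byte, high bit set on every byte except the last); the actual encoding used by this module may differ. `read_varnum` and `read_header` are hypothetical helper names.

```rust
/// Marker that opens every file, per the format description.
const MAGIC: &[u8] = b"BINJS";

/// ASSUMPTION: varnum is unsigned LEB128. Returns the decoded value and the
/// number of bytes consumed, or `None` on truncated or oversized input.
fn read_varnum(input: &[u8]) -> Option<(u32, usize)> {
    let mut value: u64 = 0;
    for (i, &byte) in input.iter().enumerate().take(5) {
        value |= u64::from(byte & 0x7F) << (7 * i);
        if byte & 0x80 == 0 {
            if value > u64::from(u32::MAX) {
                return None;
            }
            return Some((value as u32, i + 1));
        }
    }
    None
}

/// Checks the marker, then returns the container version and the bytes
/// that follow it (the start of the grammar table section).
fn read_header(input: &[u8]) -> Option<(u32, &[u8])> {
    let rest = input.strip_prefix(MAGIC)?;
    let (version, used) = read_varnum(rest)?;
    Some((version, &rest[used..]))
}
```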

Grammar table

The grammar table serves to map tagged tuple indices to actual constructions in the JS grammar.

the characters "[GRAMMAR]";
a prefix identifying the compression format used for the grammar table (one of "identity;", "br;", "gzip;", "compress;", "deflate;");
the number of compressed bytes (varnum);
compressed in the format identified by prefix:
- the number of entries (varnum);
- for each entry,
  - byte length of entry (varnum);
  - one of
    - the invalid UTF-8 sequence [255, 0] (representing the null interface, only valid if byte length is 2);
    - a UTF-8 encoded string (byte length bytes, no terminator).
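The entry layout above can be decoded along these lines once the payload has been decompressed. This is a minimal sketch, not the module's implementation: `read_table` is a hypothetical name, and varnum is assumed to be unsigned LEB128, which this page does not confirm.

```rust
/// ASSUMPTION: varnum is unsigned LEB128. Returns (value, bytes consumed).
fn read_varnum(input: &[u8]) -> Option<(u32, usize)> {
    let mut value: u64 = 0;
    for (i, &byte) in input.iter().enumerate().take(5) {
        value |= u64::from(byte & 0x7F) << (7 * i);
        if byte & 0x80 == 0 {
            if value > u64::from(u32::MAX) {
                return None;
            }
            return Some((value as u32, i + 1));
        }
    }
    None
}

/// Decodes an already-decompressed table payload.
/// A `None` entry stands for the null interface, encoded as [255, 0].
fn read_table(mut input: &[u8]) -> Option<Vec<Option<String>>> {
    let (count, used) = read_varnum(input)?;
    input = &input[used..];
    let mut entries = Vec::new();
    for _ in 0..count {
        let (len, used) = read_varnum(input)?;
        let len = len as usize;
        // `get` returns None if the declared length runs past the payload.
        let bytes = input.get(used..used + len)?;
        input = &input[used + len..];
        if matches!(bytes, [255, 0]) {
            entries.push(None);
        } else {
            entries.push(Some(String::from_utf8(bytes.to_vec()).ok()?));
        }
    }
    Some(entries)
}
```

Because 255 never appears in valid UTF-8, the two-byte null marker cannot collide with a real entry.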

Strings table

The strings table serves to map string indices to actual strings.

the characters "[STRINGS]";
a prefix identifying the compression format used for the strings table (one of "identity;", "br;", "gzip;", "compress;", "deflate;");
the number of compressed bytes (varnum);
compressed in the format identified by prefix:
- the number of entries (varnum);
- for each entry,
  - byte length of string (varnum);
  - one of
    - the invalid UTF-8 sequence [255, 0] (representing the null string, only valid if byte length is 2);
    - a UTF-8 encoded string (byte length bytes, no terminator).
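On the writing side, the uncompressed payload of the strings table could be produced roughly as below. The function names are hypothetical, compression is left out, and varnum is again assumed to be unsigned LEB128 (an assumption, since the encoding is not specified here).

```rust
/// ASSUMPTION: varnum is unsigned LEB128 (7 bits per byte, high bit set on
/// every byte except the last).
fn write_varnum(mut value: u32, out: &mut Vec<u8>) {
    loop {
        let byte = (value & 0x7F) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

/// Serializes the entries of a strings table, before compression.
/// `None` is the null string and is written as the two bytes [255, 0].
fn write_strings_table(entries: &[Option<&str>]) -> Vec<u8> {
    let mut out = Vec::new();
    write_varnum(entries.len() as u32, &mut out);
    for entry in entries {
        let bytes: &[u8] = match entry {
            Some(s) => s.as_bytes(),
            None => &[255, 0],
        };
        write_varnum(bytes.len() as u32, &mut out);
        out.extend_from_slice(bytes);
    }
    out
}
```

For example, the table containing "foo" followed by the null string serializes to the bytes 2, 3, 'f', 'o', 'o', 2, 255, 0 under these assumptions.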

The tree

This contains the actual tree for a specific grammar. The file does not contain all the information needed to determine the nature of the next token. Rather, decoding must be driven by the grammar.

the characters "[TREE]";
a prefix identifying the compression format used for the tree (one of "identity;", "br;", "gzip;", "compress;", "deflate;");
the number of compressed bytes (varnum);
compressed in the format identified by prefix:
- one tree token.
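All three sections share the same envelope: a marker, a compression prefix and a byte length. The sketch below reads that envelope and returns the still-compressed payload together with the bytes that follow, which is what makes it possible to skip a section or hand it to another thread. `Section` and `read_section` are hypothetical names, decompression is not shown, and varnum is assumed to be unsigned LEB128.

```rust
/// Compression names listed in the format description, without the ';'.
const FORMATS: [&str; 5] = ["identity", "br", "gzip", "compress", "deflate"];

/// ASSUMPTION: varnum is unsigned LEB128. Returns (value, bytes consumed).
fn read_varnum(input: &[u8]) -> Option<(u32, usize)> {
    let mut value: u64 = 0;
    for (i, &byte) in input.iter().enumerate().take(5) {
        value |= u64::from(byte & 0x7F) << (7 * i);
        if byte & 0x80 == 0 {
            if value > u64::from(u32::MAX) {
                return None;
            }
            return Some((value as u32, i + 1));
        }
    }
    None
}

/// One section as laid out in the file, with its payload still compressed.
struct Section<'a> {
    compression: &'a str,
    payload: &'a [u8],
}

/// Reads the section that starts with `marker` (for instance "[TREE]").
/// Returns the section and the bytes located after it.
fn read_section<'a>(input: &'a [u8], marker: &str) -> Option<(Section<'a>, &'a [u8])> {
    let rest = input.strip_prefix(marker.as_bytes())?;
    // The compression name runs up to the terminating ';'.
    let end = rest.iter().position(|&b| b == b';')?;
    let compression = std::str::from_utf8(&rest[..end]).ok()?;
    if !FORMATS.contains(&compression) {
        return None;
    }
    let rest = &rest[end + 1..];
    let (len, used) = read_varnum(rest)?;
    let len = len as usize;
    let payload = rest.get(used..used + len)?;
    Some((Section { compression, payload }, &rest[used + len..]))
}
```

A caller that only needs the tree can call `read_section` for "[GRAMMAR]" and "[STRINGS]", ignore their payloads, and continue from the returned remainder.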

Tree token

A tree token is defined as one of

a number of bytes (aka Offset), represented as:
- a varnum;
a null float, represented as:
- a little-endian IEEE 754 64-bit floating-point value holding a signalling NaN (8 bytes);
a non-null float, represented as:
- a little-endian IEEE 754 64-bit floating-point value other than a signalling NaN (8 bytes);
a null boolean, represented as:
- a single byte with value 2 (one byte);
a non-null boolean, represented as:
- a single byte with value 0 (false) or 1 (true) (one byte);
a string, represented as
- an entry in the table of strings (varnum);
a list, represented as
- number of items (varnum);
- for each item
  - the token;
a tagged tuple, represented as
- an entry in the grammar table (varnum);
- for each field
  - the token.
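The two fixed-size tokens can be decoded without reference to the grammar, as sketched below. The function names are hypothetical. The page does not say which signalling NaN bit pattern encodes the null float, so this sketch assumes that any NaN whose quiet bit (bit 51) is clear counts as null; the actual implementation may use one specific pattern.

```rust
/// Decodes a boolean token: 0 is false, 1 is true, 2 is the null boolean.
/// Returns the value and the remaining bytes, or `None` on invalid input.
fn read_bool(input: &[u8]) -> Option<(Option<bool>, &[u8])> {
    let (&byte, rest) = input.split_first()?;
    let value = match byte {
        0 => Some(false),
        1 => Some(true),
        2 => None,
        _ => return None,
    };
    Some((value, rest))
}

/// Decodes a float token from 8 little-endian bytes.
/// ASSUMPTION: any signalling NaN (quiet bit clear) is read as the null float.
fn read_float(input: &[u8]) -> Option<(Option<f64>, &[u8])> {
    const QUIET_BIT: u64 = 1 << 51;
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(input.get(..8)?);
    let bits = u64::from_le_bytes(bytes);
    let value = f64::from_bits(bits);
    let is_signalling_nan = value.is_nan() && (bits & QUIET_BIT) == 0;
    let value = if is_signalling_nan { None } else { Some(value) };
    Some((value, &input[8..]))
}
```

The remaining token kinds (offsets, strings, lists and tagged tuples) all begin with a varnum, so telling them apart requires the grammar, as noted above.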

Structs

FormatProvider	Command-line management.
Statistics
Targets
TreeTokenReader
TreeTokenWriter