Module binjs_io::multipart
An optimization of TokenReader/TokenWriter, designed to minimize the size of the file. A multipart format, in which each part can be compressed independently.
Overview
The file is divided into sections. Each section is prefixed by its byte length, so that a reader can skip a section and/or read sections concurrently. Each section may be compressed independently, possibly with a different compression format, with the expectation that this will let compressors take best advantage of the distinct structure of each section.
(future versions may allow file-wide compression, too)
The sections are:
- the grammar table;
- the strings table (which contains both strings and identifiers);
- the representation of the tree.
The grammar table lists the AST nodes used in the file. Its primary role is to serve as a lightweight versioning mechanism. For instance, older versions of JS may define a node Function with three fields (body, arguments and optional name), while more recent versions of JS may define the same node with five fields (body, arguments, async, generator and optional name). A BinAST file may contain either variant of Function, depending on when it was created. The grammar table lets recent parsers determine that some fields are omitted and should be replaced by their default value. In fact, a BinAST file could even contain both variants of Function, for compression purposes. Also, when a parser encounters a grammar table with nodes that either have an unknown name or contain unknown fields, it may decide to reject the file immediately (though it is not required to).
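As an illustration of how a recent parser might fill in omitted fields, here is a minimal sketch. The names `Value`, `default_for` and `fill_defaults` are hypothetical, and the default values (false for async and generator) are an assumption made for the example; the format description above does not specify them.

```rust
use std::collections::HashMap;

/// Hypothetical field value; a real decoder would hold AST nodes here.
#[derive(Clone, Debug, PartialEq)]
enum Value {
    Bool(bool),
    Text(String),
    Null,
}

/// ASSUMPTION: illustrative defaults for fields missing from the file.
fn default_for(field: &str) -> Value {
    match field {
        "async" | "generator" => Value::Bool(false),
        _ => Value::Null,
    }
}

/// Build the fields a recent parser expects (`expected`) from the fields
/// declared in the file's grammar table (`declared`) and their decoded
/// `values`, given in the same order as `declared`.
fn fill_defaults(
    expected: &[&str],
    declared: &[&str],
    values: &[Value],
) -> HashMap<String, Value> {
    let present: HashMap<&str, &Value> =
        declared.iter().copied().zip(values.iter()).collect();
    expected
        .iter()
        .map(|name| {
            let value = present
                .get(name)
                .map(|v| (*v).clone())
                .unwrap_or_else(|| default_for(name));
            (name.to_string(), value)
        })
        .collect()
}
```

Calling `fill_defaults` with the five-field layout as `expected` and the three-field layout as `declared` yields a five-entry map in which async and generator carry the assumed defaults.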
The strings table lists all strings (including identifiers) in the file. Its primary role is to speed up parsing by making sure that each string only needs to be parsed/checked/atomized once during parsing. Its secondary role is compression.
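The "parse once" idea can be sketched as follows. This is not the module's implementation: `StringsTable` is a hypothetical type that validates each entry as UTF-8 a single time and then hands out shared references, so a string token in the tree resolves to an index lookup.

```rust
use std::rc::Rc;

/// Hypothetical in-memory strings table: each entry is validated as UTF-8
/// once, then shared by reference count.
struct StringsTable {
    entries: Vec<Option<Rc<str>>>,
}

impl StringsTable {
    /// `raw` holds one byte slice per entry; `None` stands for the null string.
    fn new(raw: &[Option<&[u8]>]) -> Result<Self, std::str::Utf8Error> {
        let mut entries = Vec::with_capacity(raw.len());
        for item in raw {
            let entry: Option<Rc<str>> = match item {
                Some(bytes) => Some(Rc::from(std::str::from_utf8(bytes)?)),
                None => None,
            };
            entries.push(entry);
        }
        Ok(StringsTable { entries })
    }

    /// Resolve a string token (an index into the table) without re-parsing.
    /// The outer `Option` is `None` when the index is out of range.
    fn get(&self, index: usize) -> Option<Option<Rc<str>>> {
        self.entries.get(index).cloned()
    }
}
```

Two lookups of the same index return pointers to the same allocation, which is what makes repeated identifiers cheap.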
In the current version, the tree is a sequence of tokens. These tokens are ambiguous on their own: a stream can only be tokenized by a client that knows both the grammar and the grammar table. Specific tokens (lists) contain their byte length, so as to allow skipping them for purposes of lazy parsing and/or concurrent parsing.
Format
The entire file is formatted as:
- the characters "BINJS";
- a container version number (varnum, currently 0);
- the compressed grammar table (see below);
- the compressed strings table (see below);
- the compressed tree (see below).
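A reader for the start of the file could look like the sketch below. The encoding of varnum is not described in this section, so `read_varnum` assumes a LEB128-style layout (7 payload bits per byte, least significant group first, high bit set on every byte except the last); the module's actual encoding may differ. Both function names are hypothetical.

```rust
use std::io::{self, Read};

/// Read an unsigned variable-length integer.
/// ASSUMPTION: LEB128-style encoding; the real `varnum` may differ.
fn read_varnum<R: Read>(input: &mut R) -> io::Result<u32> {
    let mut result: u32 = 0;
    let mut shift: u32 = 0;
    loop {
        let mut byte = [0u8; 1];
        input.read_exact(&mut byte)?;
        if shift >= 32 {
            return Err(io::Error::new(io::ErrorKind::InvalidData, "varnum too long"));
        }
        result |= u32::from(byte[0] & 0x7F) << shift;
        if byte[0] & 0x80 == 0 {
            return Ok(result);
        }
        shift += 7;
    }
}

/// Check the "BINJS" magic characters and return the container version.
fn read_header<R: Read>(input: &mut R) -> io::Result<u32> {
    let mut magic = [0u8; 5];
    input.read_exact(&mut magic)?;
    if &magic != b"BINJS" {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "bad magic"));
    }
    read_varnum(input)
}
```

After `read_header` returns, the reader is positioned at the first section (the grammar table).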
Grammar table
The grammar table serves to map tagged tuple indices to actual constructions in the JS grammar.
- the characters "[GRAMMAR]";
- a prefix identifying the compression format used for the grammar table (one of "identity;", "br;", "gzip;", "compress;", "deflate;");
- the number of compressed bytes (varnum);
- compressed in the format identified by prefix:
  - the number of entries (varnum);
  - for each entry,
    - byte length of entry (varnum);
    - one of
      - the invalid UTF-8 byte sequence [255, 0] (representing the null interface, only valid if byte length is 2);
      - a UTF-8 encoded string (bytelen bytes, no terminator).
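The entry layout above can be decoded from the decompressed payload roughly as follows. This is a sketch, not the module's code: `take_varnum` and `read_entries` are hypothetical names, and the varnum encoding is assumed to be LEB128-style (7 bits per byte, low group first, high bit as continuation flag), which this section does not confirm.

```rust
/// Decode one varnum from the front of `buf`, returning it and the rest.
/// ASSUMPTION: LEB128-style encoding, limited here to 4 bytes.
fn take_varnum(buf: &[u8]) -> Option<(usize, &[u8])> {
    let mut value = 0usize;
    for (i, &byte) in buf.iter().enumerate().take(4) {
        value |= usize::from(byte & 0x7F) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, &buf[i + 1..]));
        }
    }
    None
}

/// Decode the entries of a table from its decompressed payload.
/// `None` entries stand for the null interface, encoded as [255, 0].
fn read_entries(payload: &[u8]) -> Option<Vec<Option<String>>> {
    let (count, mut rest) = take_varnum(payload)?;
    let mut entries = Vec::new();
    for _ in 0..count {
        let (len, after_len) = take_varnum(rest)?;
        if after_len.len() < len {
            return None; // truncated entry
        }
        let (bytes, tail) = after_len.split_at(len);
        rest = tail;
        entries.push(if bytes == &[255u8, 0][..] {
            None
        } else {
            Some(String::from_utf8(bytes.to_vec()).ok()?)
        });
    }
    Some(entries)
}
```

Because 255 never appears in valid UTF-8, the two-byte marker cannot collide with a real entry.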
Strings table
The strings table serves to map string indices to strings.
- the characters "[STRINGS]";
- a prefix identifying the compression format used for the strings table (one of "identity;", "br;", "gzip;", "compress;", "deflate;");
- the number of compressed bytes (varnum);
- compressed in the format identified by prefix:
  - the number of entries (varnum);
  - for each entry,
    - byte length of string (varnum);
    - one of
      - the invalid UTF-8 byte sequence [255, 0] (representing the null string, only valid if byte length is 2);
      - a UTF-8 encoded string (bytelen bytes, no terminator).
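On the writing side, the uncompressed payload of the strings table could be produced along these lines. The helper names are hypothetical, and `write_varnum` assumes a LEB128-style encoding, which may not match the module's actual varnum.

```rust
/// Append `value` as a varnum.
/// ASSUMPTION: LEB128-style encoding; the real `varnum` may differ.
fn write_varnum(out: &mut Vec<u8>, mut value: usize) {
    loop {
        let byte = (value & 0x7F) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte); // last byte: continuation bit clear
            return;
        }
        out.push(byte | 0x80);
    }
}

/// Serialize the uncompressed payload of a strings table.
/// `None` is written as the two-byte invalid sequence [255, 0].
fn write_strings(strings: &[Option<&str>]) -> Vec<u8> {
    let mut out = Vec::new();
    write_varnum(&mut out, strings.len());
    for entry in strings {
        match entry {
            Some(text) => {
                write_varnum(&mut out, text.len());
                out.extend_from_slice(text.as_bytes());
            }
            None => {
                write_varnum(&mut out, 2);
                out.extend_from_slice(&[255, 0]);
            }
        }
    }
    out
}
```

The resulting bytes would then be passed through the compressor named by the section prefix before being written to the file.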
The tree
This contains the actual tree for a specific grammar. The file does not contain all the information needed to determine the nature of the next token. Rather, decoding must be driven by the grammar.
- the characters "[TREE]";
- a prefix identifying the compression format used for the tree (one of "identity;", "br;", "gzip;", "compress;", "deflate;");
- the number of compressed bytes (varnum);
- compressed in the format identified by prefix:
  - one tree token.
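All three sections share the same framing: a tag, a compression prefix ending in ';', and a byte length. The sketch below reads that framing and returns the still-compressed payload together with the remainder of the file, which is also how a reader can skip a section it does not need. `SectionHeader` and `read_section` are hypothetical names, and the varnum decoding assumes a LEB128-style encoding.

```rust
/// Hypothetical view of one section of the file.
#[derive(Debug, PartialEq)]
struct SectionHeader<'a> {
    compression: &'a str, // e.g. "identity" or "br", without the ';'
    payload: &'a [u8],    // still-compressed bytes of the section
}

/// Read one section starting at `buf`; return it and the bytes that follow.
fn read_section<'a>(buf: &'a [u8], tag: &str) -> Option<(SectionHeader<'a>, &'a [u8])> {
    let rest = buf.strip_prefix(tag.as_bytes())?;
    let semi = rest.iter().position(|&b| b == b';')?;
    let compression = std::str::from_utf8(&rest[..semi]).ok()?;
    let known = ["identity", "br", "gzip", "compress", "deflate"];
    if !known.contains(&compression) {
        return None;
    }
    // ASSUMPTION: LEB128-style varnum, limited here to 4 bytes.
    let mut len = 0usize;
    let mut pos = semi + 1;
    let mut shift: u32 = 0;
    loop {
        let byte = *rest.get(pos)?;
        pos += 1;
        len |= usize::from(byte & 0x7F) << shift;
        if byte & 0x80 == 0 {
            break;
        }
        shift += 7;
        if shift > 21 {
            return None;
        }
    }
    let end = pos.checked_add(len)?;
    let payload = rest.get(pos..end)?;
    Some((SectionHeader { compression, payload }, &rest[end..]))
}
```

To skip a section, a caller can ignore `payload` and continue from the returned remainder; no decompression is needed.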
Tree token
A tree token is defined as one of
- a number of bytes (aka Offset), represented as:
  - a varnum;
- a null float, represented as:
  - a little-endian IEEE 754 64-bit floating point value that is a signalling NaN (8 bytes);
- a non-null float, represented as:
  - a little-endian IEEE 754 64-bit floating point value that is not a signalling NaN (8 bytes);
- a null boolean, represented as:
  - a single byte with value 2 (one byte);
- a non-null boolean, represented as:
  - a single byte with value 0 (false) or 1 (true) (one byte);
- a string, represented as:
  - an entry in the table of strings (varnum);
- a list, represented as:
  - number of items (varnum);
  - for each item
    - the token;
- a tagged tuple, represented as:
  - an entry in the grammar table (varnum);
  - for each field
    - the token.
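The two fixed-size primitives, booleans and floats, can be decoded as in the sketch below. The function names are hypothetical. The float decoder assumes little-endian byte order and the common convention that a signalling NaN has all exponent bits set, the top mantissa bit (the quiet bit) clear, and a nonzero mantissa.

```rust
/// Decode a boolean token: 0 is false, 1 is true, 2 is the null boolean.
/// Any other byte is returned as an error.
fn read_bool(byte: u8) -> Result<Option<bool>, u8> {
    match byte {
        0 => Ok(Some(false)),
        1 => Ok(Some(true)),
        2 => Ok(None),
        other => Err(other),
    }
}

/// True if `bits` encodes a signalling NaN (see the assumptions above).
fn is_signalling_nan(bits: u64) -> bool {
    const EXPONENT: u64 = 0x7FF0_0000_0000_0000;
    const QUIET_BIT: u64 = 0x0008_0000_0000_0000;
    const MANTISSA: u64 = 0x000F_FFFF_FFFF_FFFF;
    (bits & EXPONENT) == EXPONENT && (bits & QUIET_BIT) == 0 && (bits & MANTISSA) != 0
}

/// Decode a float token from its 8 bytes; a signalling NaN is the null float.
/// ASSUMPTION: little-endian byte order.
fn read_float(bytes: [u8; 8]) -> Option<f64> {
    let bits = u64::from_le_bytes(bytes);
    if is_signalling_nan(bits) {
        None
    } else {
        Some(f64::from_bits(bits))
    }
}
```

Note that a quiet NaN and the infinities are still ordinary, non-null floats under this scheme; only the signalling NaN bit patterns are reserved.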
Structs
FormatProvider: Command-line management.
Statistics
Targets
TreeTokenReader
TreeTokenWriter