Epsilon Grammar Studio


The creation of a capable parser for a given language is time consuming, error prone task. For more complex languages its very hard to keep track of the determinism level at the grammar development stage. Epsilon Grammar Studio (EGS) is a solution to this problems. EGS generates object oriented parsers from ABNF syntax grammars and checks the language determinism at compile time. EGS is a parser generator (compiler compiler).

The input of the generated parsing machines (PM) is in bytes, that can be decoded by many different optionally available decoders (ASCII, ISO 8859-1 (Latin1), WIN 1252, UTF8, UTF16 LE/BE and UTF32 LE/BE). The decoded input sequence forms Unicode char array that may pass a lexical analyses where one or more characters can be grouped to phrases (an extension to the ABNF syntax). The phrases are recognized by the parser grammar as a single token. This method of two phases parsing effectively may parse some context-free languages deterministically.

During runtime, the PM are emitting events per input syntax error discovered, that optionally contain byte offset of the error, Unicode code point offset or textual line and line character offsets. Additionally the syntax error message may contain information of the current not recognized token as well as a list of all possible expected tokens at the error location.

The result of a successful parsing by a PM is an explicit concrete syntax tree that can be iterated as much as need before its destructed. The parsing process uses dynamic memory for in depth recursion, and only few function calls depth are made using the dedicated thread stack (DTS). As a consequence, the DTS may be significantly reduced, especially important in server applications.

The PM may be run in a single thread (executed in steps or till completion: error/success) or in up to three threads for multi-threaded (MT) parallel parsing, where each thread operates on specific part of the parsing pipeline (The PM does not spawn threads to do parsing for different ambiguous cases, but splits the parsing in sequential tasks that are run by different threads, effectively creating a pipeline), that may bring noticeable speed up for the PM especially for longer inputs.

Basics


This section converse basic theory related to grammars and parsing. More theory, examples, use cases and product specifications can be found in the documentation page.

Parsing


Strictly the parsing is a process of understanding the exact meaning of a string of symbols (called `input` of `tokens` later on) conforming a formal grammar. For direct and short example the string of three symbols '1+2' and a given formal grammar describing mathematical expressions, the parsing is the process of understanding that this is a summation of two numbers by one digit each, the first is one and the second is two. After this knowledge for the string is collected a mathematical processor could realize the summation and output a solution '3'. Its important to note that this processor MAY not know how the parsing process was done, only that the input it have to operate on (a ST in its case) is correct.

Syntax Tree


The result of a parsing process is often a Syntax Tree (ST). This is a tree, that represents the elements found in the process and their relations. For the example up, the ST could be with a root the sign '+' and two children: '1' and '2'. Given this structure, the mathematical processor could operate on the ST without prior knowledge of how it was constructed - result from a parsing process, generated by an algorithm or loaded from a stream. This devision between a processor and an input generator brings a lot of flexibility to the developers (clear separation of tasks, easy to exchange parts from the whole and so on).

From the representation perspective:

  • Concrete ST (CST) - directly constructed to match fully the grammar structure. Grammar Studio generates this type of trees.
  • Abstract ST (AST) - containing some, not all, predefined level of detail.

From the availability perspective:

  • Explicit - the tree is fully constructed and available to be traversed multiple times. This gives ability to perform more complex analysis, without the usage of additional structures. Grammar Studio generates this type of trees.
  • Implicit - the tree is never constructed, but only the information for its construction is available, usually in steps. This limits the tree usage to only one time.

Additionally the access to a ST may be:

  • Functional Paradigm - the access to ST elements is by calling global functions that receive and return basic primitive types of the used programing language.
  • Object Oriented Paradigm (OOP) - the tree elements are represented by objects that contain the data relevant to them including references to another objects. Grammar Studio generates this type of trees.

The construction of a syntax tree can be in derived in two general ways:

  • Left most derivation - the tree is constructed as following the more intuitive way, first the parent is created then the left children of it will be constructed then the right. Grammar Studio generates this type of trees.
  • Right most derivation - first the right children are constructed then the left, then the parent will be created and will add the children.

As a summary Epsilon Grammar Studio can produce parsers that process the input from Left to right and generate Left most derivation (LL) syntax trees that are explicit, concrete and object oriented. The determinism can be extended arbitrary resulting in LL(*) parsers with possible backtracking for LL(k>1).

Context-Free Grammars


These are grammars that their rules can be applied regardless of the previous parsed tokens and location in the input. The context-free grammars are of a great practical interest and many programming languages and Internet protocols are defined by such grammars. Epsilon Grammar Studio produces parsers from context-free grammars.

Augmented Backus-Naur Form


This notation technique for context-free grammars is defined by RFC 5234 (Internet Standard 68) and updated by RFC 7405 for string sensitivity. Both standards are operational in Epsilon Grammar Studio for defining the Lexer and Parser grammars with special cases, extensions and exceptions handled as define in the product documentation. As a short description they are:

  • Visual character range - the ABNF grammar is extended to support visual definition of ranges. The extended syntax element is two single characters inside single quotes each and separated by a dash. As example 'A'-'Z' donates range of chars 'A' to 'Z' inclusive. Reasoning: the ANBF standard requires ranges to be defined in binary, decimal or hexadecimal notation only (so %x41-5A donates 'A'-'Z' too), that is hard to read at first sight. It is allowed to omit the second character and define only 'A' that will match only the character itself. The definition is case sensitive. More details about this feature can be found in the product documentation.
  • Phrases - the ABNF syntax is extended to support matching for tokens that are not only characters, but list of characters or system tokens like EndOfFile (EOF) token. If a lexer grammar is given the generated PM will pre process the input with it and group the characters in lists by the maching rules. Then in the parser grammar each this new token can be matches with a syntax: {lexer_rule_name, content} where if no content is specified any content of the chosen rule will match. The exact syntax of the phrases grammar element is in the product documentation. Example: {keyword, "int"} matches lexer token 'keyword' with content "int" (case insensitive). There is a system token that indicates the end of the input that can be matched with {$EOF} syntax. This token is emitted only ones. More details about this features can be found in the product documentation.
  • Epsilon prediction - if one rule or a group have more then one epsilon prediction, only one of this predictions (determined at compile time, to have smallest processing and memory footprint) will be used. As a consequence ambiguous grammars with a respect to epsilon predictions can be parsed in linear time. Grammar studio issues warning on such places at compile time. More details about this feature can be found in the product documentation.
  • Epsilon repetitions - when a rule have an epsilon prediction and there is a reference that to that rule that have a ranged repetition, the parsing if this reference will be deterministic i.e. one may try to parse 2*5rule (rule have one or more epsilon predictions), and if only 2 times the rule can be matched, the result will be 2 times of the rule followed by 3 epsilon predictions. The other combinations are eliminated at compile time (as 3 epsilons and 2 rules or 1 rule 3 epsilons and again 1 rule). The effect is that the parsing of this kind of ambiguous grammars is linear in Grammar Studio. The drawback is that the compiled PM code is somehow bigger in size. More details about this feature can be found in the product documentation.
  • Special cases - the ABNF syntax permits strings with length of zero and zero maximum element repetitions that are not permitted in the Grammar Studio.

String Analizing Pipeline


The string recognition process in Grammar Studio is divided in three main parts: Lexing, Parsing and Syntax Tree Construction (called for short Building). A PM is created by 2 individual grammars one for the Lexer (that may by empty) and one for the Parser. Organizing the process as a pipeline, gives a great possibility of multi-threaded support already implemented in the product. The String Analyzing Pipeline (SAP) elements are:

  • Source - this module supplies the raw bytes of the input. It can be any kind of stream: file, memory, network socket, be generated in real-time or other. The Grammar Studio is generating a pure class that can be extended to any type of source, and implements memory source (reading from a byte array) by default.
  • Decoder - There is default decoder available in each PM for ASCII character encoding and optionally available decoders for Win-1252, ISO-8859-1 (Latin1), UTF-8, UTF-16 LE/BE and UTF-32 LE/BE. This element is organizing (decoding) the raw bytes into symbols.
  • Lexer - each received token from the previous decoder pipeline element (PE) is used as a symbol. The Lexer grammar is used to recognize phrases from the list of symbols. If a phrase is recognized, it is transmitted to the parser as a single phrase token. If no phrase is recognized a single symbol is transmitted as a symbol token and the remaining symbols are again tested with the grammar rules. The lexer grammars are not allowed to match any extensions (as {$EOF} system token and phrases because before the lexer there is no element that produces them), no recursion is allowed between the rules (i.e. the rule calls must form a tree) and the repetitions per element are only *1 (same as 0*1), * (same as 0*), 1*1 (used by default if omitted) and 1*. This grammar SHOULD recognize words and simple phrases only, and SHOULD be kept simple and short. Examples and more can be found in the product documentation.
  • Parser - the input of the SAP Parser (SAPP) element is a token from the Lexer. The full ABNF syntax is supported as defined in this page. For maximum performance many possible states of the parser are explicitly created in the PM source code, which MAY result in significant target code size, for this reason it is RECOMMENDED to keep the grammar short and clear of collisions. Grammar Studio is implementing many analysis algorithms with the goal to catch the runtime collisions at compile time, so the grammar developer can design more faster running Parser grammars. If the supplied grammar is LL(k>1) then the parser MAY backtrack when need. If the grammar is LL(1) then the Parser PE is expected to run in linear to the tokens speed - O(n). For each operation the parser is performing commands for building the syntax tree are send to the next PE - the Builder.
  • Optimizer - this element is buffering the commands sent from the Parser PE. In case that a backtrack is required if the Optimizer have buffered commands, this commands will be erased and never transmitted to the Builder. This MAY provide significant speed up in the parsing process especially (but not only) when backtrack is realized. The buffering options can be changed in the project options.
  • Builder - the received commands from the Parser (thru the Optimizer) are used to construct the final ST, that matches the Parser grammar as it is. In case of a backtrack, the SAP Builder (SAPB) element adapts the tree to match the current state of the Parser.%
  • Glue code - after the ST is complete the client code can use it and free it when no more needed. Glue code is OPTIONAL, but RECOMMENDED. This SHOULD be a code, written by the client, that connects the automatically generated syntax tree source code structures with the actual client program.

Parsing Machine


The architecture of the generated PM is as a separate module inside the client program, that requires only standard libraries of the target language to compile i.e. no Dynamic Link Libraries or third party software is required. The design of the PM is:

  • Online - the generated PM processes as much as input is available. When no more input is available the machine pauses. When more input becomes available the client code MUST signal this to the machine so it can continue its processes.
  • Memory Model - the PM uses fixed memory blocks (FMB) for its internal in depth parsing state. The current implementation is as a memory pool.
  • Run time - The machine may be run till completion: more input is need, till and error is discovered in the input or till successful recognition. Additionally PM can run on small steps as possible - this allows many single thread (ST) PM to execute in one dedicated thread in turns.
  • Code diversity - each time the PM is generated from Grammar Studio the parser SAP element code is randomized. This effectively makes the reverse engineering more difficult of the parser grammar after its compiled into an executable.
  • Dynamic stack - the generated Parser SAP element does not use the dedicated thread stack (DTS) for in deep grammar execution, but creates its own dynamic stack. This ensures that at runtime the client program is not going to collapse from missing stack space, because of grammar recursions. Some function calls deep in the PM code MAY occur, but the depth is independent from the grammar recursions.
  • Threading - one PM may be compiled to run as single or multi threaded (MT). Currently for MT three threads are started per PM instance. Because of the dynamic stack each PM maintains, each thread that runs the PM can be started with very little DTS (this is valid for multi-threaded and for single threaded client thread). Because the PM instance is called from the client code, and the PM does some system calls (as at least to allocate memory), its not possible to calculate in advance the maximum DTS size needed. However, because no in deep recursion is going to happen on DTS, a client program MAY shrink the parsing threads stack significantly.