Tunnel Grammar Studio examples
Content
Use cases
The following grammars are defining different use cases are ordered from the more simple to to the more complex.
Text Splitter
To recognize text expressions from the input one may write a lexer as:
word = 1* ('A'-'Z' / 'a'-'z') number = 1* '0'-'9'
The lexer grammar reads as: a 'word' is one or more case insensitive 'a'-'z' letters and a number is one or more digits. The parser grammar may be as:
text = 1*expression expression = ({word} / {number}) *(0*1 "," " " ({word} / {number})) "."
The parser grammar reads as: a text is one or more expressions that each is 'one or more words that in between have an OPTIONAL comma and a REQUIRED space' and MUST finish with a full stop. A valid input for a PM generated from this two grammars will be: Use case 1: Hello World, from ExperaSoft. Additionally one may use only a parser grammar to achieve the same:
text = 1*expression expression = (word / number) *(0*1 "," " " (word / number)) "." word = 1* ('A'-'Z' / 'a'-'z') number = 1* '0'-'9'
However the two approaches will have different syntax trees. The first way with the 2 grammars, is expected to run faster in the general case, because the Lexer operates faster then the Parser, because no context information is recorded, as a consequence each token 'word' and 'number' will be represented by a single string type of object. In the second approach with single grammar, the context information is recorded, that will take more memory and time to be recognized and constructed. As a general rule, everything that represents single tokens SHOULD be moved into the Lexer grammar.
Mathematical Expression
In many programming languages mathematical expressions are build in to match the most popular infix notation. Such recognition can be made with this grammars. The lexer grammar is:
varname = 1* ('A'-'Z' / 'a'-'z') integer = 1* '0'-'9' float = 1* '0'-'9' "." * '0'-'9'
And the parser grammar:
expression = addition addition = multiplication *(("+" / "-") multiplication) multiplication = number *(("*" / "/") number) number = {integer} / {float} / {varname} / "(" addition ")"
As defined these grammars will generate PM which will be LL(1) and accept the following expressions:
2*2 or 5*(10+60*2) or 2*(21-(x/2)*2) and so on...
If one wants to introduce space in between the digits and signs, one may change the parser grammar as in this example:
expression = *WSP addition addition = multiplication *(("+" / "-") *WSP multiplication) multiplication = number *(("*" / "/") *WSP number) number = ({integer} / {float} / {varname} / "(" *WSP addition ")") *WSP WSP = %x20 / %x9
The new introduced WSP rule is short for white space (a space or a horizontal tab) and will be OPTIONAL at the beginning of the math expression and after each symbol and a phrase token - all currently in the number rule. The priority of the multiplication is higher and comes naturally from the rules relation. These grammars form a LL(1) parser that can parse linearly input as:
24 / 4.5 + 11 * (offset - 2) or 5644*(b + 1233.21 / 899.004) and so on...
Log File Reader
Many computer programs are saving some kind of log files for the started or completed operations, for the errors and warnings occurred during the program runtime. This logs are often specifically formated for the company needs. One may define grammars that read this logs without the need to create a specific parser by hand, if one have a need to make a statistics on the log or another automatic analysis. Additionally when the log structure changes, one have to just update the grammars defined and recompile the PM. That is expected to save a lot of time from hand coding the parser at every change.
string = '"' *(%x20-%x21 / %x23-%x5B / %x5D-10FFFF / '\' ('\' / '"')) '"' word = ('A'-'Z' / 'a'-'z' ) *('A'-'Z' / 'a'-'z' / '0'-'9' / '-' / '_') number = 1* '0'-'9' ["." * '0'-'9']
log = *line line = item *("," item) CRLF item = {string} / {word} / {number} CRLF = %xD %xA
In words, the example lexer grammar is recognizing simple tokens as:
- string: starting and ending with double quotes with a possible escaping '\' char followed by '\' or double quote
- number: an integer or a float
- words: starting with a letter and optionally continuing with a letter, digit, dash or underscore
Simple Markup Language
Many companies have option files that have to be human and machine readable. To define simple markup language for this purpose one may use as a base the Extended Markup Language (XML):
comment-start = "<!--" comment-final = "-->" tag = "<" 0*1 "/" word = 1* ('A'-'Z' / 'a'-'z')
document = *WSP tag *WSP tag = tag-start content tag-final tag-start = {tag, "<"} *WSP {word} *WSP '>' tag-final = {tag, "</"} *WSP {word} *WSP '>' content = *(tag / %x21-3B / %x3D-10FFFF / {word} / WSP) WSP = %x9 / ; tab %x13 %x10 / ; new line %x20 / ; space {comment-start} *(%x0-10FFFF / {word} / {comment-start} / {tag}) {comment-final}
<options> <!-- this is an option file with one string --> <string>hello word!</string> </options>
Simple C/C++
If one wants to create a script language, one may base it on the C/C++ programming language. On the first phase of the parsing the syntax is checked with the PM, on a second phase the semantical meaning of the parsed code could be checked and finally an interpreter or a generator to another language (possibly Assembler) could execute the syntactically and semantically valid parsed input. Those are grammars that produce PM for C/C++ similar language.
comment = "/*" / "*/" / "//" CRLF = %x13 %x10 keyword = %s"const" / %s"if"
document = *WSP *function function = %s"function" 1*WSP name *WSP func-args scope func-args = '(' *WSP [argument-list] ')' func-call = '(' *WSP [operation-list] ')' argument-list = argument 1*WSP *(',' 1*WSP argument 1*WSP) argument = [{keyword, "const"} 1*WSP] type-and-name type-and-name = name 1*WSP name operation-list = operation *(',' 1*WSP operation) scope = '{' WSP *declaration "}" 1*WSP declaration = decl-var / decl-if decl-var = type-and-name *WSP ['=' *WSP operation] ';' *WSP decl-if = {keyword, "if"} *WSP '(' *WSP operation ')' *WSP scope [{keyword, "else"} *WSP scope] operation = addition addition = multiplication *(("+" / "-") *WSP multiplication) multiplication = value *WSP *(("*" / "/") *WSP value *WSP) value = name 0*1 func-call / number / "(" *WSP operation ")" name = ('a'-'z' / 'A'-'Z') *('A'-'Z' / 'a'-'z' / '0'-'9' / '_') number = 1* '0'-'9' ['.' * '0'-'9'] WSP = %x9 / %x20 / {CRLF} / line-comment / text-comment line-comment = {comment, "//"} *(%x0-10FFFF / {comment}) {CRLF} text-comment = {comment, "/*"} *( %x0-10FFFF / {CRLF} / {comment, "/*"} / {comment, "//"} ) {comment, "*/"}
The semantic analysis is not defined into the grammar, as for example is it an error if one calls a function that is not defined, or calls a function that is defined later in the source code. Because the parsing phase is completed fully before the semantic analysis is started, one may simply permit calling of functions that are defined later in the code, that is one of the advantages of having the parsing phrase separate of any other phase.
Using the parsing machine
This section contains C++ examples of how to use the generated parsing machine.
Beginner
This is 'just give me the Syntax Tree' example. It parses the grammar listed bellow. This example demonstrates parsing machine prepare, run, check result and a syntax tree retrieval.
list = integer *("," integer) integer = 1* '0'-'9'
Advanced
To report properly syntax errors found in the input functions in the PM events class must be overwritten by extending the class. This example adds CustomEvents class in the Basic example and uses it instead of default event handler events to receive and print the runtime parsing events. The grammar used is the same as in the beginner example.
Expert
After a syntax tree element is available, it can be iterated as much as need before its released. The following glue code can be added to the advanced example to view the resulted syntax tree. The used grammar is the same as in the beginner example.