

Ok so you’ve seen that which is my quest, now for a little more slightly boring, yet more useful, background information on how I’m tackling this problem.
First of all, I’ll point out that currently the state of the ’syntax aware editor’ industry is pretty sad. (This is not the case in the world of C# and Visual Basic, and perhaps some of the languages that work in full force in visual studio and or mono/sharp develop. The examples I cited, of course, are out there now, using dynamic compilation in combination with reflection to achieve their magical goals. I’m talking about doing this purely from the position of using a grammar as input and a 100% compiler-free parser generated entirely from the grammar. I’ll cede the point that dynamic compilation is really cool, and can do a lot more than what I’m trying to do in the way of telling you what to do next in your coding, but this project is meant for the casual domain specific language writer.) Most of the tools out there right now are ‘lexer’ driven only, and they almost always rely on some kind of regular expression framework, or worse, find/replace methodology. That might be ok for some smaller projects, but it doesn’t check syntax correctness and it doesn’t scale very well. Two examples right now are Compona SyntaxBox (now a part of the puzzle framework and recently integrated into the Fireball framework), and Scintilla. These tools are great for what they do, but they don’t validate syntax… (SyntaxBox has some rudimentary support for ‘intelligent sensing auto-completion,’ which I so term for not wanting to step on anyone in Redmond’s trademarks.)
Why a Parser?
So, if the state of the art in the non-commercial software arena of this field is to just use text (lexical) matching alone, why do we need a parser? I have already alluded to the answer… Syntax Checking. A parser can tell you if what you’ve got is valid according to the language rules… A tokenizer, (also often termed a lexical analyzer, a scanner, or a ‘lexer’, for short,) the current type of commonly employed tool, can just tell you if what you’ve got is a keyword, constant, whitespace, comment or whatever other kind of language token you might have. Knowing what kind of token you have is ok, if all you’re doing is making pretty colors, but if you type some ridiculous string of keywords, having them all bolded and colored doesn’t amount to anything useful if it’s improperly sequenced. For example, I’ll use javascript:
if (x > 12) x -= 5; // valid
if ++x {} else {} for; //still produces colors but makes no sense..
Both of these two lines will be happily highlighted by a lexer driven system, but only a parser will be able to tell you that the second line is meaningless gook.
How I do Parsing
In an acronym, LALR(1). What that stands for is Look Ahead Left to Right, and the 1 means I have one symbol of lookahead. The LR part just means I read the input text from Left to Right, and the Lookahead means that if two or more types of symbols start out the same, I can look at the next symbol in order to differentiate them. I will leave the exact details of what Lookahead means to the reader (we have wikipedia and google, so I’m not going to pretend to be a lesson in parsing 101.) One thing I will cover in a little more detail is the ‘Shift-Reduce-Accept’ cycle works in LALR. When a parser gets some input text, it’s fed to a Lexical Analyzer (lexer) and compared against a table of known text patterns (called the parsing tables) character by character. Each new character of input is compared against the starting points in the lexer section of the parsing tables to see which are the matches. Then, the next character in the input sequence is checked against the list of all possible valid ‘next characters’ based on the set of matches generated by the first character. As more characters are added to the stream of the lexer’s input, it is able to eliminate more and more of the possible alternatives, until ultimately, it will have a single exact match. A lexical match generates what is known as a ‘Token’ which is fed back to the parser, and stored on a stack or a list in the parser’s working memory. After returning (Fetching) a token, the lexer is technically done with it’s duty, and the parser resumes control. Upon fetching, the latest few tokens are compared against a ‘rule table’ which is a special table in the parsing tables that specifies sequences of whole tokens, not just characters. A match in this rule table will trigger all of the component tokens to be pulled off the list or stack and they will be replaced by a reference to the rule which they match. This ‘rule matching’ is known as a ‘reduction’ action. If no rule is yet matched after a token, the parser will continue invoking the lexer to grab new tokens and stash the results… this grabbing and queuing of tokens is known as ’shifting.’ So, upon acquiring a new token, the parser will try to perform a reduction, if it cannot make a valid reduction from the list of recently read tokens, it will continue invoking the lexer for tokens, and shifting the results until it can. When there are no more tokens in the input stream the parser has a decision to make. If all of the tokens can be reduced, then it will return an Accept message (meaning, yup, ok we read the input and it’s all within the language, following the language definitions to the letter,) or it will return a parsing error message, usually supplying some indication of what the last tokens were, and what tokens it was expecting but didn’t find. There is one more layer of complexity at work here… The parser will do this shifting and reducing as I’ve described, but it has to have a defined starting point, which also happens to be it’s ending point. There has to be some rule which contains some combination of the other valid rules that can serve as a unit of validity for the entire input text. This is known as the Start Rule (or Start Symbol.) The parser cannot just willy nilly match rules without first matching the start rule. If you’re parsing a program, for example, you’d have to define somehow what constitutes a program… You might say a program is a collection of statements, and a statement any one of a few predefined statements, a block, or a lonely expression. A comment would not be considered a statement by that definition, and if you say you’re looking for a collection of statements (one or more statements), and your entire source text is just a bunch of comments, then the parser will not produce an Accept message on the source, because the start rule wasn’t matched.
Gold Specific:
Gold is a fantastic tool, known rather inaptly* as a compiler-compiler, that will physically produce these aforementioned parsing tables. But it still needs some specification of your tokens and rules (and start symbol) in order to do so. This specification is called a grammar, which serves as a way to specify the set of all input that is a member your defined language. (In language parlance we say that a grammar ‘generates’ a language, meaning that it specifies the entire set of valid input considered to be in the language, even though nothing is actually physically ‘generated’.) If you grab a copy of Gold Parser Builder, you can crack open the help files and see a few example grammars. On the website, you can also download a number of them, one of which is called ‘Simple.’ Simple is a grammar which specifies a BASIC-like language for use in examples and testing of the builder tool itself. Since I’ll be referring to it from time to time as my example language, you can find it below. This grammar was written by the author of Gold Parser Builder, and the master and commander of the entire Gold world, Devin Cook.
"Name" = 'Simple'
"Author" = 'Devin Cook'
"Version" = '2.0'
"About" = 'This is a very simple grammar designed for use in examples'
"Case Sensitive" = False
"Start Symbol" = <Statements>
{String Ch 1} = {Printable} - ['']
{String Ch 2} = {Printable} - ["]
Id = {Letter}{AlphaNumeric}*
! String allows either single or double quotes
StringLiteral = '' {String Ch 1}* ''
| '"' {String Ch 2}* '"'
NumberLiteral = {Digit}+('.'{Digit}+)?
<Statements> ::= <Statement> <Statements>
| <Statement>
<Statement> ::= display <Expression>
| display <Expression> read ID
| assign ID '=' <Expression>
| while <Expression> do <Statements> end
| if <Expression> then <Statements> end
| if <Expression> then <Statements> else <Statements> end
<Expression> ::= <Expression> '>' <Add Exp>
| <Expression> '<' <Add Exp>
| <Expression> '<=' <Add Exp>
| <Expression> '>=' <Add Exp>
| <Expression> '==' <Add Exp>
| <Expression> '<>' <Add Exp>
| <Add Exp>
<Add Exp> ::= <Add Exp> '+' <Mult Exp>
| <Add Exp> '-' <Mult Exp>
| <Add Exp> '&' <Mult Exp>
| <Mult Exp>
<Mult Exp> ::= <Mult Exp> '*' <Negate Exp>
| <Mult Exp> '/' <Negate Exp>
| <Negate Exp>
<Negate Exp> ::= '-' <Value>
| <Value>
<Value> ::= ID
| StringLiteral
| NumberLiteral
| '(' <Expression> ')'
I’ll come back to this later in more detail, but for now, notice the Start Symbol is defined at the top as <Statements>, and <Statements> is defined as a list of one or more <Statement>s, and of course that comments are not valid statements. So hopefully this helps to illustrate what I was saying about how LALR works.
To be continued… again.
*FOOTNOTE: I say that it’s inaptly named a ‘compiler-compiler’ because a lexer and parser are only the first two components of what is known as the ‘front end’ of a compiler. You cannot produce an entire compiler from the front end, and you cannot even produce the entire front end in most cases using a so-called ‘compiler-compiler.’ There are other very famous tools also known as compiler-compilers, not the least of which being Lex/Yacc (unix) or Flex/Bison (Gnu versions of Lex and Yacc.)




More Options ...

Categories
Tag Cloud
Blog RSS
Comments RSS

Void (Default)
Life
Earth
Wind
Water
Fire
Lightweight
12:02 pm - December 16th, 2007
Interesting read. I evaluated Gold for use in my project and I opted to not use it. I\’m working in plain C and the \’engine templates\’ would have me writing a whole lot of code I don\’t want to. In other words, I don\’t want to write the actual parsing code, I just want to add code to the grammar file for semantic actions. I imagine the situation is different in C# as I\’m sure someone has implemented a proper framework or is that what you\’re doing? Have you looked into Antlr? It supports C# as a target and provides a lot more functionality (tree building) than Gold. Regards.
4:18 pm - December 16th, 2007
Yeah I’ve looked at antlr. The problem is that when you specify a grammar in antlr it’s not just a language independent grammar. You actually embed stuff from the code in it. Add the semantic actions on top of that, and well, it’s still not very readable to anyone but you. I do think it’s a great idea though… The good news is we’ve already built a tree builder and actually have the thing to generate the strongly typed tree already assembled. It’s just awaiting a few of my lazy moments to add the rest of the things before it goes out to the world. I like the idea of talking about a language ONLY as a grammar… and in Antlr, you can’t really do that (Antlr is a great tool, just I’m a little partial to gold.)