Building a Syntax Highlighter: Part II

November 26, 2007 on 1:24 am | In .NET Coding | 2 Comments

Ok so you’ve seen that which is my quest, now for a little more slightly boring, yet more useful, background information on how I’m tackling this problem.

First of all, I’ll point out that currently the state of the ’syntax aware editor’ industry is pretty sad. (This is not the case in the world of C# and Visual Basic, and perhaps some of the languages that work in full force in visual studio and or mono/sharp develop. The examples I cited, of course, are out there now, using dynamic compilation in combination with reflection to achieve their magical goals. I’m talking about doing this purely from the position of using a grammar as input and a 100% compiler-free parser generated entirely from the grammar. I’ll cede the point that dynamic compilation is really cool, and can do a lot more than what I’m trying to do in the way of telling you what to do next in your coding, but this project is meant for the casual domain specific language writer.) Most of the tools out there right now are ‘lexer’ driven only, and they almost always rely on some kind of regular expression framework, or worse, find/replace methodology. That might be ok for some smaller projects, but it doesn’t check syntax correctness and it doesn’t scale very well. Two examples right now are Compona SyntaxBox (now a part of the puzzle framework and recently integrated into the Fireball framework), and Scintilla. These tools are great for what they do, but they don’t validate syntax… (SyntaxBox has some rudimentary support for ‘intelligent sensing auto-completion,’ which I so term for not wanting to step on anyone in Redmond’s trademarks.)

Why a Parser?
So, if the state of the art in the non-commercial software arena of this field is to just use text (lexical) matching alone, why do we need a parser? I have already alluded to the answer… Syntax Checking. A parser can tell you if what you’ve got is valid according to the language rules… A tokenizer, (also often termed a lexical analyzer, a scanner, or a ‘lexer’, for short,) the current type of commonly employed tool, can just tell you if what you’ve got is a keyword, constant, whitespace, comment or whatever other kind of language token you might have. Knowing what kind of token you have is ok, if all you’re doing is making pretty colors, but if you type some ridiculous string of keywords, having them all bolded and colored doesn’t amount to anything useful if it’s improperly sequenced. For example, I’ll use javascript:


    if (x > 12) x -= 5; // valid
 
    if ++x {} else {} for; //still produces colors but makes no sense..

Both of these two lines will be happily highlighted by a lexer driven system, but only a parser will be able to tell you that the second line is meaningless gook.

How I do Parsing
In an acronym, LALR(1). What that stands for is Look Ahead Left to Right, and the 1 means I have one symbol of lookahead. The LR part just means I read the input text from Left to Right, and the Lookahead means that if two or more types of symbols start out the same, I can look at the next symbol in order to differentiate them. I will leave the exact details of what Lookahead means to the reader (we have wikipedia and google, so I’m not going to pretend to be a lesson in parsing 101.) One thing I will cover in a little more detail is the ‘Shift-Reduce-Accept’ cycle works in LALR. When a parser gets some input text, it’s fed to a Lexical Analyzer (lexer) and compared against a table of known text patterns (called the parsing tables) character by character. Each new character of input is compared against the starting points in the lexer section of the parsing tables to see which are the matches. Then, the next character in the input sequence is checked against the list of all possible valid ‘next characters’ based on the set of matches generated by the first character. As more characters are added to the stream of the lexer’s input, it is able to eliminate more and more of the possible alternatives, until ultimately, it will have a single exact match. A lexical match generates what is known as a ‘Token’ which is fed back to the parser, and stored on a stack or a list in the parser’s working memory. After returning (Fetching) a token, the lexer is technically done with it’s duty, and the parser resumes control. Upon fetching, the latest few tokens are compared against a ‘rule table’ which is a special table in the parsing tables that specifies sequences of whole tokens, not just characters. A match in this rule table will trigger all of the component tokens to be pulled off the list or stack and they will be replaced by a reference to the rule which they match. This ‘rule matching’ is known as a ‘reduction’ action. If no rule is yet matched after a token, the parser will continue invoking the lexer to grab new tokens and stash the results… this grabbing and queuing of tokens is known as ’shifting.’ So, upon acquiring a new token, the parser will try to perform a reduction, if it cannot make a valid reduction from the list of recently read tokens, it will continue invoking the lexer for tokens, and shifting the results until it can. When there are no more tokens in the input stream the parser has a decision to make. If all of the tokens can be reduced, then it will return an Accept message (meaning, yup, ok we read the input and it’s all within the language, following the language definitions to the letter,) or it will return a parsing error message, usually supplying some indication of what the last tokens were, and what tokens it was expecting but didn’t find. There is one more layer of complexity at work here… The parser will do this shifting and reducing as I’ve described, but it has to have a defined starting point, which also happens to be it’s ending point. There has to be some rule which contains some combination of the other valid rules that can serve as a unit of validity for the entire input text. This is known as the Start Rule (or Start Symbol.) The parser cannot just willy nilly match rules without first matching the start rule. If you’re parsing a program, for example, you’d have to define somehow what constitutes a program… You might say a program is a collection of statements, and a statement any one of a few predefined statements, a block, or a lonely expression. A comment would not be considered a statement by that definition, and if you say you’re looking for a collection of statements (one or more statements), and your entire source text is just a bunch of comments, then the parser will not produce an Accept message on the source, because the start rule wasn’t matched.

Gold Specific:
Gold is a fantastic tool, known rather inaptly* as a compiler-compiler, that will physically produce these aforementioned parsing tables. But it still needs some specification of your tokens and rules (and start symbol) in order to do so. This specification is called a grammar, which serves as a way to specify the set of all input that is a member your defined language. (In language parlance we say that a grammar ‘generates’ a language, meaning that it specifies the entire set of valid input considered to be in the language, even though nothing is actually physically ‘generated’.) If you grab a copy of Gold Parser Builder, you can crack open the help files and see a few example grammars. On the website, you can also download a number of them, one of which is called ‘Simple.’ Simple is a grammar which specifies a BASIC-like language for use in examples and testing of the builder tool itself. Since I’ll be referring to it from time to time as my example language, you can find it below. This grammar was written by the author of Gold Parser Builder, and the master and commander of the entire Gold world, Devin Cook.


"Name"    = 'Simple'
"Author"  = 'Devin Cook'
"Version" = '2.0'
"About"   = 'This is a very simple grammar designed for use in examples'
 
"Case Sensitive" = False
"Start Symbol"   = <Statements>
 
{String Ch 1} = {Printable} - ['']
{String Ch 2} = {Printable} - ["]
 
Id            = {Letter}{AlphaNumeric}*
 
! String allows either single or double quotes
 
StringLiteral = ''   {String Ch 1}* ''
              | '"'  {String Ch 2}* '"'
 
NumberLiteral = {Digit}+('.'{Digit}+)?
 
<Statements> ::= <Statement> <Statements>
               | <Statement>
 
<Statement>  ::= display <Expression>
               | display <Expression> read ID
               | assign ID '=' <Expression>
               | while <Expression> do <Statements> end
               | if <Expression> then <Statements> end
               | if <Expression> then <Statements> else <Statements> end
              
<Expression>   ::= <Expression> '>'  <Add Exp>
                 | <Expression> '<'  <Add Exp>
                 | <Expression> '<=' <Add Exp>
                 | <Expression> '>=' <Add Exp>
                 | <Expression> '==' <Add Exp>
                 | <Expression> '<>' <Add Exp>
                 | <Add Exp>
 
<Add Exp>      ::= <Add Exp> '+' <Mult Exp>
                 | <Add Exp> '-' <Mult Exp>
                 | <Add Exp> '&' <Mult Exp>
                 | <Mult Exp>
 
<Mult Exp>    ::= <Mult Exp> '*' <Negate Exp>
                | <Mult Exp> '/' <Negate Exp>
                | <Negate Exp>
 
<Negate Exp>  ::= '-' <Value>
                | <Value>
 
<Value>       ::= ID          
                | StringLiteral
                | NumberLiteral
                | '(' <Expression> ')'
 

I’ll come back to this later in more detail, but for now, notice the Start Symbol is defined at the top as <Statements>, and <Statements> is defined as a list of one or more <Statement>s, and of course that comments are not valid statements. So hopefully this helps to illustrate what I was saying about how LALR works.

To be continued… again.

*FOOTNOTE: I say that it’s inaptly named a ‘compiler-compiler’ because a lexer and parser are only the first two components of what is known as the ‘front end’ of a compiler. You cannot produce an entire compiler from the front end, and you cannot even produce the entire front end in most cases using a so-called ‘compiler-compiler.’ There are other very famous tools also known as compiler-compilers, not the least of which being Lex/Yacc (unix) or Flex/Bison (Gnu versions of Lex and Yacc.)

Annoying Me Softly, the RichTextBox

November 24, 2007 on 4:05 pm | In .NET Coding | No Comments

This is a side note to my parser project. I’m using a rich text box to display colorized syntax, and it’s really annoying me now that when I set the .Rtf property with a trailing newline (\par) the .Text property pretends I didn’t put it there. In fact, the .Text property seems to like trimming my text when I set it via the .Rtf proprety, but if I set it directly via the .Text, I don’t get said trimming. Can someone please tell me why this has to be this way? I’ve got the whole syntax editor thing worked out and it highlights according to a mapping I’ve defined for the tokens, and the parser DFA generated by Gold, but when I end the input with a newline character, it truncates it and shovels me back up to the previous line. How is it possible that I’m the first person to run into this? Here’s what happens:


MyRTFBox.Rtf = "someRTFgoodies...LastWord\r\n}\r\n\par";
 
string foo = MyRTFBox.Text;

when I put a debug watch on foo… it comes back fully formed having removed the trailing whitespace. “someRTFgoodies…LastWord”;

What?!? How can I work with that?

I’d like to thank Microsoft one more time for assuming they know what I’m trying to do and just doing it for me. You guys really know how to make my life easier.

EDIT: I was looking at the source to the RichTextBox in the Reflector, and I notice that this behavior is caused by the way they’re copying the text as a byte array (to account for various possible methods of character encoding.) and in the Encoding process one of the ‘helpful’ side effects happens to be the trimming of trailing whitespace. The only way I can get around this is to write my own RTF box? Does anyone out there know another way here?

Edit Edit Edit: It has finally occurred to me what to do about this. Since this problem goes away if there is text after the enter key, I just don’t trigger the parsing when enter is pressed. Duh. Thanks for being so patient with me… but STILL it’s silly of the richtextbox to prune my text without asking.

Building a Syntax Highlighter using Gold Parser Builder: Part I

November 23, 2007 on 12:24 am | In .NET Coding | 1 Comment

Here we are. Finally something useful again on my blog, instead of some drivel about the crap I’ve been working on with no code samples. This won’t be a HOWTO from start to end but rather a narration of my efforts on the subject. I intend to have it working at the end, so by the time I post the final installation, it will indeed be a HOWTO.

I’m working on a syntax-checking text highlighter (colorizer), and I’ve discovered that it’s much more difficult than I’d thought it would be. I have written parsers as front ends for a compiler before. I’ve written recognizers that tell you if you’ve made a syntax error… I assumed it might be a trivial exercise then to make myself a syntax highlighting component that can plug into an editor, but I forgot some terribly obvious points… I’ll start by listing these:

White Space Matters
That’s right. Whitespace matters even if the language I’m parsing doesn’t care about white space because coders will write code with all kinds of white space to facilitate readability. This is my number one issue. It’s the one I haven’t entirely decided how to surmount.

Even bad Text has to Stay in the editor
If there happens to be some code in the editor that doesn’t parse, I cannot ignore it, and I have to save it in the text stream because otherwise it would be destructive to the whole idea of ‘unfinished code.’

You must parse all of the input text at some logical moment
If your users are editing some code, you won’t want it to miss something, so it’s got to be parsing pretty often. How often is too often? Well, I don’t know. I’ll say for now that I’m going to do it at every keystroke, and I’ll parse all of the input at once. This second bit about parsing all of it at once will eventually have to change for two reasons. I have to continue parsing after an error, because it’s possible that the rest of the document is valid, and should be highlighted, leaving just the lonely error bit to be complained about. The other reason is that documents can be huge. Immense. Large. It doesn’t make sense to parse the entire input file on every key stroke unless the input is below some reasonable threshold. (Like for example, the stuff that’s visible on the screen.)

Mainly these are the only three big issues, but they are indeed big. There are products out there that help and that are meant to do just syntax highlighting, or those that do so via regular expressions or find/replace etc… but these are computationally intense, and they usually don’t offer the ability to point out syntax errors… For example, Scintilla.. it will highlight text all purty-like, but it doesn’t tell me that I need a semi colon at the end of the line when one is missing. The same goes for the tools like SyntaxBox etc. They’re great tools, but they have limited functionality. Nope, I can’t cop out and cheat on this one. I have to implement something real, and that doesn’t require visual studio integration. (Which is why I can’t use a Babel language service.)

Gold Parser Builder is my parser generation tool, and since I work in .NET (or rather since I am on this project) I’ll be using .NET engines for the task. The two engines with which I am familiar are the Calitha engine, created by Robert van Loenhout, and the Morozov engine, created by Vladimir Morozov. The parsing technique exposed by Gold is LALR(1).

The first task is to generate a parser that’s capable of passing over the source text and actually parsing it. With gold parser that’s incredibly simple.. there is a grammar designer in which you specify the language to be parsed in a series of EBNF rules. After the builder checks the grammar, it will generate a deterministic finite automaton (DFA) which can recognize the specified language of the grammar. There is even a great testing facility built into the builder which allows me to test my grammar for correctness. I can throw sample input text at the generated DFA and it will tell me if the text parses or not, and throw a detailed trace and error log if for some reason that it will not.

I’ve been active on the Gold Parser Builder mailing list and the google group for the project, and I’ve raised the issue of building a tool such as a syntax checking text colorizer and asked for help from the community. I received a response from a certain Kelly Parker, with whom I’ve begun working on the task. I’m also working independently on a commercial version of this project for use with a certain not-yet-disclosed entity, but concurrently, I’m producing a more generic open source version that decouples the language itself from the components of the parser from the highlighting/checking utility for use in an editor. Kelly and I are working on this open source version. (The two implementations are completely different, because they have different goals. Whenever I refer to ‘the project’ from now on, I’ll be talking about the open source version.)

To be continued…

Early Christmas this year!

November 19, 2007 on 10:02 am | In .NET Coding | No Comments

Visual Studio 2008 drops this week as an RTM. I’m going to be checking the website about every 3 hours now. Whee!

Powered by WordPress with Pool theme design by Borja Fernandez.
Entries and comments feeds. Valid XHTML and CSS. ^Top^