Building a Syntax Highlighter using Gold Parser Builder: Part I

November 23, 2007 at 12:24 am | In .NET Coding

Here we are. Finally something useful again on my blog, instead of some drivel about the crap I’ve been working on with no code samples. This won’t be a HOWTO from start to end but rather a narration of my efforts on the subject. I intend to have it working at the end, so by the time I post the final installment, it will indeed be a HOWTO.

I’m working on a syntax-checking text highlighter (colorizer), and I’ve discovered that it’s much more difficult than I’d thought it would be. I have written parsers as front ends for a compiler before. I’ve written recognizers that tell you if you’ve made a syntax error… I assumed it might be a trivial exercise then to make myself a syntax highlighting component that can plug into an editor, but I forgot some terribly obvious points… I’ll start by listing these:

White Space Matters
That’s right. Whitespace matters even if the language I’m parsing doesn’t care about it, because coders will write code with all kinds of whitespace to facilitate readability, and the highlighter has to map every token back to its exact position in the editor. This is my number one issue. It’s the one I haven’t entirely decided how to surmount.

Even Bad Text Has to Stay in the Editor
If there happens to be some code in the editor that doesn’t parse, I can’t discard it. It has to stay in the text stream, because throwing it away would be destructive to the whole idea of ‘unfinished code.’
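
The two points above can be sketched together as a lossless tokenizer: whitespace becomes a token of its own, and anything unrecognized becomes an error token instead of being dropped. This is only a minimal illustration in Python, not the project’s code. The toy token patterns and the names `Token` and `tokenize` are assumptions made up for the example, and the regular expression merely stands in for whatever tokenizer the real highlighter ends up using.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    kind: str   # "whitespace", "identifier", "number", "symbol" or "error"
    text: str   # the exact characters from the source
    start: int  # character offset of the token in the source text


# Hypothetical patterns for a toy language. The final alternative matches any
# single character, so every character of the input ends up in some token.
_TOKEN_RE = re.compile(
    r"""
    (?P<whitespace>\s+)
  | (?P<identifier>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<number>\d+)
  | (?P<symbol>[=+\-*/();])
  | (?P<error>.)
    """,
    re.VERBOSE | re.DOTALL,
)


def tokenize(text: str) -> List[Token]:
    """Split text into tokens without losing a single character."""
    return [
        Token(m.lastgroup, m.group(), m.start())
        for m in _TOKEN_RE.finditer(text)
    ]
```

Joining the `text` of every token gives back the original input exactly, which is the property an editor needs: the colorizer can mark the error tokens without ever altering what the user typed.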

You Must Parse All of the Input Text at Some Logical Moment
If your users are editing some code, you won’t want the highlighter to miss something, so it has to parse pretty often. How often is too often? Well, I don’t know. For now, I’m going to parse at every keystroke, and I’ll parse all of the input at once. That second bit, parsing all of it at once, will eventually have to change for two reasons. First, I have to continue parsing after an error, because the rest of the document may well be valid and should be highlighted, leaving just the lonely error bit to be complained about. Second, documents can be huge. Immense. Large. It doesn’t make sense to parse the entire input file on every keystroke unless the input is below some reasonable threshold (for example, the stuff that’s visible on the screen).
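
One way to read the threshold idea is sketched below in Python: reparse the whole document while it is small, and fall back to the visible lines once it grows past a limit. The names `Viewport` and `region_to_parse` and the default limit of 20,000 characters are assumptions invented for this example, not anything decided for the project.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Viewport:
    first_line: int  # zero-based index of the first visible line
    last_line: int   # zero-based index of the last visible line, inclusive


def region_to_parse(text: str, viewport: Viewport,
                    max_chars: int = 20_000) -> Tuple[int, int]:
    """Return (start, end) character offsets to reparse after a keystroke."""
    # Small documents are cheap enough to reparse in full.
    if len(text) <= max_chars:
        return 0, len(text)

    # Otherwise restrict the work to the visible lines, clamped to the text.
    lines = text.splitlines(keepends=True)
    first = max(0, min(viewport.first_line, len(lines) - 1))
    last = max(first, min(viewport.last_line, len(lines) - 1))
    start = sum(len(line) for line in lines[:first])
    end = start + sum(len(line) for line in lines[first:last + 1])
    return start, end
```

The sketch is deliberately naive. A visible line can begin in the middle of a statement, a string, or a comment, so a real implementation would need some way to find a safe point at which to start parsing.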

Those are the three big issues, and they are indeed big. There are products out there that do just syntax highlighting, some of them via regular expressions or find/replace, but these are computationally intense, and they usually can’t point out syntax errors. Scintilla, for example, will highlight text all purty-like, but it doesn’t tell me that I need a semicolon at the end of the line when one is missing. The same goes for tools like SyntaxBox. They’re great tools, but they have limited functionality. Nope, I can’t cop out and cheat on this one. I have to implement something real, and something that doesn’t require Visual Studio integration (which is why I can’t use a Babel language service).

Gold Parser Builder is my parser generation tool, and since I work in .NET (or rather since I am on this project) I’ll be using .NET engines for the task. The two engines with which I am familiar are the Calitha engine, created by Robert van Loenhout, and the Morozov engine, created by Vladimir Morozov. The parsing technique exposed by Gold is LALR(1).

The first task is to generate a parser that’s capable of passing over the source text and actually parsing it. With GOLD Parser Builder that’s incredibly simple. There is a grammar designer in which you specify the language to be parsed: terminals as regular-expression-like definitions and the syntax as a series of BNF rules. After the builder checks the grammar, it generates a deterministic finite automaton (DFA) for the tokenizer and LALR(1) tables for the parser, which together recognize the language the grammar specifies. There is even a great testing facility built into the builder which lets me test my grammar for correctness. I can throw sample input text at the generated tables and it will tell me whether the text parses, and give me a detailed trace and error log if it does not.
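
To make the DFA part concrete, here is a minimal sketch in Python of a table-driven DFA that recognizes identifiers and integers using longest match. Everything in it is hand-written for illustration: the states, the character classes, and the names `TRANSITIONS`, `ACCEPTING`, and `longest_match` are assumptions. The builder derives its tables from the grammar, and its actual table format is not shown here.

```python
from typing import Dict, Optional, Tuple


def char_class(ch: str) -> str:
    """Map a character to the coarse class used as the DFA's input symbol."""
    if ch.isalpha() or ch == "_":
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"


# TRANSITIONS[state][character class] -> next state. A missing entry means
# the DFA has no move and the scan stops.
TRANSITIONS: Dict[int, Dict[str, int]] = {
    0: {"letter": 1, "digit": 2},  # start state
    1: {"letter": 1, "digit": 1},  # inside an identifier
    2: {"digit": 2},               # inside an integer
}

# Accepting states and the token kind each one recognizes.
ACCEPTING: Dict[int, str] = {1: "identifier", 2: "integer"}


def longest_match(text: str, pos: int) -> Optional[Tuple[str, int]]:
    """Run the DFA from pos and return (kind, end offset) of the longest
    token found, or None if no token starts at pos."""
    state = 0
    best: Optional[Tuple[str, int]] = None
    i = pos
    while i < len(text):
        nxt = TRANSITIONS[state].get(char_class(text[i]))
        if nxt is None:
            break
        state = nxt
        i += 1
        if state in ACCEPTING:
            best = (ACCEPTING[state], i)  # remember the last accepting point
    return best
```

A scanner built on this would call `longest_match` repeatedly, advancing to the returned end offset each time, and would treat a `None` result as the start of an error token.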

I’ve been active on the Gold Parser Builder mailing list and the Google group for the project, and I’ve raised the issue of building a tool such as a syntax-checking text colorizer and asked for help from the community. I received a response from a certain Kelly Parker, with whom I’ve begun working on the task. I’m also working independently on a commercial version of this project for use with a certain not-yet-disclosed entity, but concurrently, I’m producing a more generic open source version that decouples the language itself, the parser components, and the highlighting/checking utility for use in an editor from one another. Kelly and I are working on this open source version. (The two implementations are completely different, because they have different goals. Whenever I refer to ‘the project’ from now on, I’ll be talking about the open source version.)

To be continued…

1 Comment »

  1. Amazing!, I’m working now with the Kalimstra engine, but a highlighter will be a nice-to-have feature :)

    Comment by Diego Jancic — February 28, 2008 #
