Rich Programmer Food: Highly recommended reading
January 18, 2008 on 3:37 pm | In General Programming | No Commentshttp://steve-yegge.blogspot.com/2007/06/rich-programmer-food.html
I have been preaching things like this for a long time, but Steve here is a bit more fluent about explaining it. I’ll not try to turn you off to the ideas presented here by summarizing it before you read it. Honestly, it’s the perfect expression of a relatively amorphous emotion I’ve had about compilers and their place in the world of CS for a long time, (but obviously I can’t take credit for the quality of Steve’s writing, of course.)
Make yourself read it though, please, if you never do another thing for me as long as you know me. (Halo_Four, if you’re reading this, this means you too.)
Colorizer Part II.V
December 19, 2007 on 12:55 am | In .NET Coding | No CommentsA friend of mine pointed out to me one possible way of finding syntax errors, and resuming parsing after one successfully. The key is to split the input text on some known divider. It’s possible that your grammar won’t have one, but in the case of, say… C#, you can split all the statements on ‘}’ and ‘;’ . Then you can store each as an independently parsed string (even store it’s parse tree in a linked list or array or something) and then flag each with an error or not. This is a great trick, provided that you have some identifiable block/line terminator to split on. For line based grammars this is exceptionally easy.. just split on newlines… for not line based grammars, you might split on block terminators only, like ‘end’ in ‘Simple.’ At least then you have some smaller unit of ‘parse’ to work with and you can track state of parsing across a document with some degree of efficiency. That also means you can trigger the ‘re-parsing’ of small segments when new text is added or pasted. Very interesting. I’m right now trying to think of a case that there wouldn’t be a way to do this… perhaps some functional languages would be eternally nesting… but as far as I’m concerned that’s a corner case ![]()
SuperDeluxe.Com - Stop the MAdnesS
December 1, 2007 on 2:27 am | In Uncategorized | No CommentsOk I’ve had it with SuperDeluxe It’s too much funny, smartness in one place. I can’t take it. It’s sucking my humorous response enegry reserves to the dregs. If they put out one more episode of [(Layers)] or the Maria Bamford Show, I may never write another line of productive code again.
I actually don’t go there or any video sites, or even watch anything on TV that often, but when I do go there, it’s like I get stuck. I can’t hit the stop button.
Building a Syntax Highlighter: Part II
November 26, 2007 on 1:24 am | In .NET Coding | 2 CommentsOk so you’ve seen that which is my quest, now for a little more slightly boring, yet more useful, background information on how I’m tackling this problem.
First of all, I’ll point out that currently the state of the ’syntax aware editor’ industry is pretty sad. (This is not the case in the world of C# and Visual Basic, and perhaps some of the languages that work in full force in visual studio and or mono/sharp develop. The examples I cited, of course, are out there now, using dynamic compilation in combination with reflection to achieve their magical goals. I’m talking about doing this purely from the position of using a grammar as input and a 100% compiler-free parser generated entirely from the grammar. I’ll cede the point that dynamic compilation is really cool, and can do a lot more than what I’m trying to do in the way of telling you what to do next in your coding, but this project is meant for the casual domain specific language writer.) Most of the tools out there right now are ‘lexer’ driven only, and they almost always rely on some kind of regular expression framework, or worse, find/replace methodology. That might be ok for some smaller projects, but it doesn’t check syntax correctness and it doesn’t scale very well. Two examples right now are Compona SyntaxBox (now a part of the puzzle framework and recently integrated into the Fireball framework), and Scintilla. These tools are great for what they do, but they don’t validate syntax… (SyntaxBox has some rudimentary support for ‘intelligent sensing auto-completion,’ which I so term for not wanting to step on anyone in Redmond’s trademarks.)
Why a Parser?
So, if the state of the art in the non-commercial software arena of this field is to just use text (lexical) matching alone, why do we need a parser? I have already alluded to the answer… Syntax Checking. A parser can tell you if what you’ve got is valid according to the language rules… A tokenizer, (also often termed a lexical analyzer, a scanner, or a ‘lexer’, for short,) the current type of commonly employed tool, can just tell you if what you’ve got is a keyword, constant, whitespace, comment or whatever other kind of language token you might have. Knowing what kind of token you have is ok, if all you’re doing is making pretty colors, but if you type some ridiculous string of keywords, having them all bolded and colored doesn’t amount to anything useful if it’s improperly sequenced. For example, I’ll use javascript:
if (x > 12) x -= 5; // valid
if ++x {} else {} for; //still produces colors but makes no sense..
Both of these two lines will be happily highlighted by a lexer driven system, but only a parser will be able to tell you that the second line is meaningless gook.
How I do Parsing
In an acronym, LALR(1). What that stands for is Look Ahead Left to Right, and the 1 means I have one symbol of lookahead. The LR part just means I read the input text from Left to Right, and the Lookahead means that if two or more types of symbols start out the same, I can look at the next symbol in order to differentiate them. I will leave the exact details of what Lookahead means to the reader (we have wikipedia and google, so I’m not going to pretend to be a lesson in parsing 101.) One thing I will cover in a little more detail is the ‘Shift-Reduce-Accept’ cycle works in LALR. When a parser gets some input text, it’s fed to a Lexical Analyzer (lexer) and compared against a table of known text patterns (called the parsing tables) character by character. Each new character of input is compared against the starting points in the lexer section of the parsing tables to see which are the matches. Then, the next character in the input sequence is checked against the list of all possible valid ‘next characters’ based on the set of matches generated by the first character. As more characters are added to the stream of the lexer’s input, it is able to eliminate more and more of the possible alternatives, until ultimately, it will have a single exact match. A lexical match generates what is known as a ‘Token’ which is fed back to the parser, and stored on a stack or a list in the parser’s working memory. After returning (Fetching) a token, the lexer is technically done with it’s duty, and the parser resumes control. Upon fetching, the latest few tokens are compared against a ‘rule table’ which is a special table in the parsing tables that specifies sequences of whole tokens, not just characters. A match in this rule table will trigger all of the component tokens to be pulled off the list or stack and they will be replaced by a reference to the rule which they match. This ‘rule matching’ is known as a ‘reduction’ action. If no rule is yet matched after a token, the parser will continue invoking the lexer to grab new tokens and stash the results… this grabbing and queuing of tokens is known as ’shifting.’ So, upon acquiring a new token, the parser will try to perform a reduction, if it cannot make a valid reduction from the list of recently read tokens, it will continue invoking the lexer for tokens, and shifting the results until it can. When there are no more tokens in the input stream the parser has a decision to make. If all of the tokens can be reduced, then it will return an Accept message (meaning, yup, ok we read the input and it’s all within the language, following the language definitions to the letter,) or it will return a parsing error message, usually supplying some indication of what the last tokens were, and what tokens it was expecting but didn’t find. There is one more layer of complexity at work here… The parser will do this shifting and reducing as I’ve described, but it has to have a defined starting point, which also happens to be it’s ending point. There has to be some rule which contains some combination of the other valid rules that can serve as a unit of validity for the entire input text. This is known as the Start Rule (or Start Symbol.) The parser cannot just willy nilly match rules without first matching the start rule. If you’re parsing a program, for example, you’d have to define somehow what constitutes a program… You might say a program is a collection of statements, and a statement any one of a few predefined statements, a block, or a lonely expression. A comment would not be considered a statement by that definition, and if you say you’re looking for a collection of statements (one or more statements), and your entire source text is just a bunch of comments, then the parser will not produce an Accept message on the source, because the start rule wasn’t matched.
Gold Specific:
Gold is a fantastic tool, known rather inaptly* as a compiler-compiler, that will physically produce these aforementioned parsing tables. But it still needs some specification of your tokens and rules (and start symbol) in order to do so. This specification is called a grammar, which serves as a way to specify the set of all input that is a member your defined language. (In language parlance we say that a grammar ‘generates’ a language, meaning that it specifies the entire set of valid input considered to be in the language, even though nothing is actually physically ‘generated’.) If you grab a copy of Gold Parser Builder, you can crack open the help files and see a few example grammars. On the website, you can also download a number of them, one of which is called ‘Simple.’ Simple is a grammar which specifies a BASIC-like language for use in examples and testing of the builder tool itself. Since I’ll be referring to it from time to time as my example language, you can find it below. This grammar was written by the author of Gold Parser Builder, and the master and commander of the entire Gold world, Devin Cook.
"Name" = 'Simple'
"Author" = 'Devin Cook'
"Version" = '2.0'
"About" = 'This is a very simple grammar designed for use in examples'
"Case Sensitive" = False
"Start Symbol" = <Statements>
{String Ch 1} = {Printable} - ['']
{String Ch 2} = {Printable} - ["]
Id = {Letter}{AlphaNumeric}*
! String allows either single or double quotes
StringLiteral = '' {String Ch 1}* ''
| '"' {String Ch 2}* '"'
NumberLiteral = {Digit}+('.'{Digit}+)?
<Statements> ::= <Statement> <Statements>
| <Statement>
<Statement> ::= display <Expression>
| display <Expression> read ID
| assign ID '=' <Expression>
| while <Expression> do <Statements> end
| if <Expression> then <Statements> end
| if <Expression> then <Statements> else <Statements> end
<Expression> ::= <Expression> '>' <Add Exp>
| <Expression> '<' <Add Exp>
| <Expression> '<=' <Add Exp>
| <Expression> '>=' <Add Exp>
| <Expression> '==' <Add Exp>
| <Expression> '<>' <Add Exp>
| <Add Exp>
<Add Exp> ::= <Add Exp> '+' <Mult Exp>
| <Add Exp> '-' <Mult Exp>
| <Add Exp> '&' <Mult Exp>
| <Mult Exp>
<Mult Exp> ::= <Mult Exp> '*' <Negate Exp>
| <Mult Exp> '/' <Negate Exp>
| <Negate Exp>
<Negate Exp> ::= '-' <Value>
| <Value>
<Value> ::= ID
| StringLiteral
| NumberLiteral
| '(' <Expression> ')'
I’ll come back to this later in more detail, but for now, notice the Start Symbol is defined at the top as <Statements>, and <Statements> is defined as a list of one or more <Statement>s, and of course that comments are not valid statements. So hopefully this helps to illustrate what I was saying about how LALR works.
To be continued… again.
*FOOTNOTE: I say that it’s inaptly named a ‘compiler-compiler’ because a lexer and parser are only the first two components of what is known as the ‘front end’ of a compiler. You cannot produce an entire compiler from the front end, and you cannot even produce the entire front end in most cases using a so-called ‘compiler-compiler.’ There are other very famous tools also known as compiler-compilers, not the least of which being Lex/Yacc (unix) or Flex/Bison (Gnu versions of Lex and Yacc.)
Annoying Me Softly, the RichTextBox
November 24, 2007 on 4:05 pm | In .NET Coding | No CommentsThis is a side note to my parser project. I’m using a rich text box to display colorized syntax, and it’s really annoying me now that when I set the .Rtf property with a trailing newline (\par) the .Text property pretends I didn’t put it there. In fact, the .Text property seems to like trimming my text when I set it via the .Rtf proprety, but if I set it directly via the .Text, I don’t get said trimming. Can someone please tell me why this has to be this way? I’ve got the whole syntax editor thing worked out and it highlights according to a mapping I’ve defined for the tokens, and the parser DFA generated by Gold, but when I end the input with a newline character, it truncates it and shovels me back up to the previous line. How is it possible that I’m the first person to run into this? Here’s what happens:
MyRTFBox.Rtf = "someRTFgoodies...LastWord\r\n}\r\n\par";
string foo = MyRTFBox.Text;
when I put a debug watch on foo… it comes back fully formed having removed the trailing whitespace. “someRTFgoodies…LastWord”;
What?!? How can I work with that?
I’d like to thank Microsoft one more time for assuming they know what I’m trying to do and just doing it for me. You guys really know how to make my life easier.
EDIT: I was looking at the source to the RichTextBox in the Reflector, and I notice that this behavior is caused by the way they’re copying the text as a byte array (to account for various possible methods of character encoding.) and in the Encoding process one of the ‘helpful’ side effects happens to be the trimming of trailing whitespace. The only way I can get around this is to write my own RTF box? Does anyone out there know another way here?
Edit Edit Edit: It has finally occurred to me what to do about this. Since this problem goes away if there is text after the enter key, I just don’t trigger the parsing when enter is pressed. Duh. Thanks for being so patient with me… but STILL it’s silly of the richtextbox to prune my text without asking.
Building a Syntax Highlighter using Gold Parser Builder: Part I
November 23, 2007 on 12:24 am | In .NET Coding | 1 CommentHere we are. Finally something useful again on my blog, instead of some drivel about the crap I’ve been working on with no code samples. This won’t be a HOWTO from start to end but rather a narration of my efforts on the subject. I intend to have it working at the end, so by the time I post the final installation, it will indeed be a HOWTO.
I’m working on a syntax-checking text highlighter (colorizer), and I’ve discovered that it’s much more difficult than I’d thought it would be. I have written parsers as front ends for a compiler before. I’ve written recognizers that tell you if you’ve made a syntax error… I assumed it might be a trivial exercise then to make myself a syntax highlighting component that can plug into an editor, but I forgot some terribly obvious points… I’ll start by listing these:
White Space Matters
That’s right. Whitespace matters even if the language I’m parsing doesn’t care about white space because coders will write code with all kinds of white space to facilitate readability. This is my number one issue. It’s the one I haven’t entirely decided how to surmount.
Even bad Text has to Stay in the editor
If there happens to be some code in the editor that doesn’t parse, I cannot ignore it, and I have to save it in the text stream because otherwise it would be destructive to the whole idea of ‘unfinished code.’
You must parse all of the input text at some logical moment
If your users are editing some code, you won’t want it to miss something, so it’s got to be parsing pretty often. How often is too often? Well, I don’t know. I’ll say for now that I’m going to do it at every keystroke, and I’ll parse all of the input at once. This second bit about parsing all of it at once will eventually have to change for two reasons. I have to continue parsing after an error, because it’s possible that the rest of the document is valid, and should be highlighted, leaving just the lonely error bit to be complained about. The other reason is that documents can be huge. Immense. Large. It doesn’t make sense to parse the entire input file on every key stroke unless the input is below some reasonable threshold. (Like for example, the stuff that’s visible on the screen.)
Mainly these are the only three big issues, but they are indeed big. There are products out there that help and that are meant to do just syntax highlighting, or those that do so via regular expressions or find/replace etc… but these are computationally intense, and they usually don’t offer the ability to point out syntax errors… For example, Scintilla.. it will highlight text all purty-like, but it doesn’t tell me that I need a semi colon at the end of the line when one is missing. The same goes for the tools like SyntaxBox etc. They’re great tools, but they have limited functionality. Nope, I can’t cop out and cheat on this one. I have to implement something real, and that doesn’t require visual studio integration. (Which is why I can’t use a Babel language service.)
Gold Parser Builder is my parser generation tool, and since I work in .NET (or rather since I am on this project) I’ll be using .NET engines for the task. The two engines with which I am familiar are the Calitha engine, created by Robert van Loenhout, and the Morozov engine, created by Vladimir Morozov. The parsing technique exposed by Gold is LALR(1).
The first task is to generate a parser that’s capable of passing over the source text and actually parsing it. With gold parser that’s incredibly simple.. there is a grammar designer in which you specify the language to be parsed in a series of EBNF rules. After the builder checks the grammar, it will generate a deterministic finite automaton (DFA) which can recognize the specified language of the grammar. There is even a great testing facility built into the builder which allows me to test my grammar for correctness. I can throw sample input text at the generated DFA and it will tell me if the text parses or not, and throw a detailed trace and error log if for some reason that it will not.
I’ve been active on the Gold Parser Builder mailing list and the google group for the project, and I’ve raised the issue of building a tool such as a syntax checking text colorizer and asked for help from the community. I received a response from a certain Kelly Parker, with whom I’ve begun working on the task. I’m also working independently on a commercial version of this project for use with a certain not-yet-disclosed entity, but concurrently, I’m producing a more generic open source version that decouples the language itself from the components of the parser from the highlighting/checking utility for use in an editor. Kelly and I are working on this open source version. (The two implementations are completely different, because they have different goals. Whenever I refer to ‘the project’ from now on, I’ll be talking about the open source version.)
To be continued…
Early Christmas this year!
November 19, 2007 on 10:02 am | In .NET Coding | No CommentsVisual Studio 2008 drops this week as an RTM. I’m going to be checking the website about every 3 hours now. Whee!
What’s with the widgets?
October 22, 2007 on 10:39 pm | In Uncategorized | No CommentsOk, I’m a sort of almost a blogger, and I write things that sometimes roughly resemble blog postings. I admit it takes a certain level of vanity to do that. I like to think I’m not being too ridiculous, and I admit that I only have a very narrow crowd of people who even care what I have to say… Most of them are people I collaborate on projects with, or people who look at it to be nice to me… but certain websites in the endless pursuit of, I’m not sure what, are getting ridiculous today. It’s vain enough to put a ‘Slashdot this!’ or ‘Digg It!’ widget on your page, but if you’re ’sort of a big deal’ then I guess that makes sense. HOWEVER, things can quickly get out of hand, and I kid you not, I actually saw this on a real live site…

Great Idea! That’s right, for your convenience, you can digg it, facebook it, del.icio.us it, newsvine it, stumbleupon it, reddit, yahoo it, fark it, technorati it, furl it, ma.gnolia it, embed it in your site, or just spam the hell out of your closest friends. All with the click of a button. Brilliant! But still a little too much work for me.
I’m going to invent my own button. I’ll put it at the bottom of my posts and sign up for adsense… I’ll call it ‘Give Me Ad Revenue!’, and it will do all of those things with one click. It’s honest, straight to the point, and best of all, it saves you the trouble of clicking on all of the buttons individually. How Convenient!

C# Template for GoldParser Builder and the Morozov Engine
October 15, 2007 on 12:58 am | In .NET Coding | 1 CommentI decided to give a shot to creating a template for Gold Parser Builder that implements a C# instance of a parser making use of the Morozov Engine, which is known in the Gold Parser world as the fastest of the C# engines.
Thanks to Devin Cook for making this all possible by creating Gold Parser Builder, and also to Vladimir Morozov for his contribution of the great Gold engine for us C# geeks. (a few of the structure-like features of this template were.. uh.. borrowed from the Calitha template, but nothing really ‘codey’, just ’structurey’)
You can get the template here, in case you’re interested: http://davedolan.com/downloads/C-SharpMorozovNet2.pgt
Updated I forgot to add the definition of SymbolException and RuleException to the original download (it was correct in the version that went to the list, I just got a little happy cleaning up the script and … chopped two things too many. It is FIXED now.
I’ve also added a .Net 1.x compatible version of this template: http://davedolan.com/downloads/C-SharpMorozovNet1.pgt
Don’t even ask about .Net 3, and if you were thinking ‘what? why not?’ then perhaps you should not be making parsers, even with templates
Lock Free Coding Addendum
October 3, 2007 on 10:38 pm | In .NET Coding | 3 CommentsOk so something strikes me… the only thing that makes sense in multi client lock free coding is to have a shared, un-orderd data structure between all process elements. This strikes me as a little odd because you have to ensure the thread safety of this shared structure… I’m going to have to read more on this, because I’m now throughly confused… Maybe lock free isn’t the best methodology for client-server paradigms (that don’t broadcast data to each client simultaneously, and must maintain open ‘indexed’ tcp connections) Sure you can safeguard a collection, like the hash structure of tcp handles… but… don’t you have to put them in critical regions first? This is what i get for learning this stuff in my “spare time”… you can see how often I get that.
Comment added After the Fact: I apologize for sounding like such an idiot on this subject, I’m just trying to get my head into the multi-core ready ideas, and I’m having a bit of a time at it… All of your emails and comments are very helpful. So Thanks, and I’m trying, I’m really not as dumb as my (lack of) conceptualization of Lock-Freeness would suggest.
Powered by WordPress with Pool theme design by Borja Fernandez.
Entries and comments feeds.
Valid XHTML and CSS. ^Top^
