Recursive descent parsing with BNF grammars<br/>
Posted by Clifton Royston on TechHui, 2007-12-20<br/>
(This is a modified repost of a post I recently put up on our company's wiki,<br />
explaining a very old CS concept/approach which seems to be rather<br />
neglected lately in production use.)<br/><h2>Introduction</h2>
<br/>If you're implementing any sort of miniature language as part of some software (a query language, a scripting capability) or even just attempting to parse structured input of some kind, using the full-boogie compiler-builder tools like YACC, bison, and friends can seem like overkill. If you instead write a completely ad-hoc parser off the cuff, you'll find it laborious and painful, and the result will likely be full of bugs and nasty corner cases. Fortunately, there's a happy medium: recursive descent parsing. I'm going to give you a quick hand-wavy explanation of how to use it.<br/><h2>Overview of Recursive Descent Parsing</h2>
<br/>Recursive Descent Parsing (RDP) is a powerful technique for implementing languages or parsing inputs conforming to some syntax, as long as the syntax can be expressed in BNF (Backus-Naur Form) or EBNF (Extended Backus-Naur Form) and put into a shape suited to top-down parsing, as described below. The grammars that a recursive descent parser can handle without backtracking are known as <strong>LL(</strong>k<strong>)</strong> grammars, where k is the number of tokens of look-ahead needed, usually 1. <br/> <br/>Until
the advent of "compiler-compiler" tools like YACC, many compilers were written by hand using recursive descent parsers, and it's still not a terribly onerous task, because the translation from the grammar to a recursive parser is extremely straightforward, almost mechanical. Given a grammar in BNF or EBNF, you can translate it to a slightly simpler BNF form which obeys certain constraints. Once you have that BNF, you write a single routine for each non-terminal, i.e. each symbol that appears on the left-hand side (LHS) of a rule or "production" in the grammar; that routine ends up calling the routines corresponding to the non-terminals on its right-hand side (RHS).<br/><br/><h3>Example of BNF conversion to RDP</h3>
Let's assume we're trying to parse a simple "calculator" input, with integer expressions combined with +,-,*,/ and possibly grouped with parentheses. This is simple enough to explain in a brief example.<br/> <a name="EBNF_grammar_for_simple_integer_expressions.3a" id="EBNF_grammar_for_simple_integer_expressions.3a"></a><h4>EBNF grammar for simple integer expressions:</h4>
<pre>Expr := Term | Term '+' Term | Term '-' Term<br/>Term := Factor | Factor '*' Factor | Factor '/' Factor<br/>Factor := RealNum | '-' RealNum | '(' Expr ')'<br/>RealNum := Digit +</pre>
Note: In reality, for practicality's sake we'd probably match RealNum as a token with a pattern-matching rule rather than in EBNF. (Despite its name, RealNum as defined here matches only a string of digits, i.e. a non-negative integer.)<br/><br/><h4>BNF left-factored simplification of EBNF grammar:</h4>
To convert the grammar to a parser, begin by converting it to the simpler BNF form, which eliminates any use of optional elements ([ ] or ?), any use of repetition operators (*, + or { }) and any use of parentheses for grouping. This can always be done at the expense of a slightly longer simple BNF form.<br/> <br/>We also must transform the grammar (which again may introduce some new non-terminal symbols) in two ways. First, eliminate left recursion, so that no symbol can derive itself as the first element of one of its own right-hand sides, whether directly or through other rules; otherwise the routine for that symbol would call itself without consuming any input and never return. Second, left-factor the grammar, i.e. modify it so that whenever the same symbol appears as the first element in two or more possible derivations of a given symbol, it gets "factored out" for common processing by splitting the rule into two sub-rules. This gives us:<pre>Expr := Term ExprTail<br/>ExprTail := nil | '+' Term | '-' Term<br/>Term := Factor TermTail <br/>TermTail := nil | '*' Factor | '/' Factor<br/>Factor := RealNum | '-' RealNum | '(' Expr ')' <br/>RealNum := Digit RestOfNum<br/>RestOfNum := nil | Digit RestOfNum<br/></pre>
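The two constraints above can be checked mechanically. What follows is a minimal sketch in Python (not from the original post) that encodes the simplified grammar as a dictionary and checks that no rule is left-recursive and that the alternatives of each rule begin with different tokens. The names GRAMMAR, first_of_sequence and check_alternatives are hypothetical, Digit is treated as a terminal, and this is only the left-factoring check, not a complete LL(1) test (which would also need FOLLOW sets).

```python
# Hypothetical encoding of the simplified BNF: each non-terminal maps to a
# list of alternatives, and an empty list stands for nil.
GRAMMAR = {
    "Expr":      [["Term", "ExprTail"]],
    "ExprTail":  [[], ["+", "Term"], ["-", "Term"]],
    "Term":      [["Factor", "TermTail"]],
    "TermTail":  [[], ["*", "Factor"], ["/", "Factor"]],
    "Factor":    [["RealNum"], ["-", "RealNum"], ["(", "Expr", ")"]],
    "RealNum":   [["Digit", "RestOfNum"]],
    "RestOfNum": [[], ["Digit", "RestOfNum"]],
}
NIL = "nil"


def first_of_sequence(symbols, visiting=()):
    """Return the set of terminals (or NIL) that can begin this sequence."""
    result = set()
    for sym in symbols:
        if sym not in GRAMMAR:      # terminal: it begins the sequence
            result.add(sym)
            return result
        if sym in visiting:         # back at sym without consuming any input
            raise ValueError(f"left recursion through {sym}")
        sym_first = set()
        for alt in GRAMMAR[sym]:
            sym_first |= first_of_sequence(alt, visiting + (sym,))
        result |= sym_first - {NIL}
        if NIL not in sym_first:    # sym cannot be empty, so stop here
            return result
    result.add(NIL)                 # the whole sequence can be empty
    return result


def check_alternatives():
    """Map each non-terminal to True if its alternatives start differently."""
    report = {}
    for name, alternatives in GRAMMAR.items():
        seen, ok = set(), True
        for alt in alternatives:
            firsts = first_of_sequence(alt, (name,))
            if seen & firsts:       # two alternatives share a starting token
                ok = False
            seen |= firsts
        report[name] = ok
    return report


print(check_alternatives())         # every entry is True for this grammar
```

Replacing the Expr entry with the original, unfactored alternatives (Term alone, and Term '+' Term) makes the check report False for Expr, which is exactly the situation left-factoring removes.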
<h4>Simplified parser code corresponding to BNF grammar:</h4>
Translating this quite mechanically into very simplified code, assuming we have a<br />
TokenSource type with "peekToken" and "nextToken" methods, gives us<br />
code something like the following: <br/><pre>void ParseExpr( TokenSource t ) {<br/>    ParseTerm( t );<br/>    ParseExprTail( t );<br/>}<br/>void ParseExprTail( TokenSource t ) {<br/>    string s = t.peekToken();  // look ahead without consuming<br/>    if ( s == "+" ) {<br/>        t.nextToken();  // consume the '+'<br/>        ParseTerm( t );<br/>    }<br/>    else if ( s == "-" ) {<br/>        t.nextToken();  // consume the '-'<br/>        ParseTerm( t );<br/>    }<br/>    else {<br/>        // nil: consume nothing<br/>    }<br/>}<br/>void ParseTerm( TokenSource t ) {<br/>    ParseFactor( t );<br/>    ParseTermTail( t );<br/>}<br/>void ParseTermTail( TokenSource t ) {<br/>    string s = t.peekToken();  // look ahead without consuming<br/>    if ( s == "*" ) {<br/>        t.nextToken();  // consume the '*'<br/>        ParseFactor( t );<br/>    }<br/>    else if ( s == "/" ) {<br/>        t.nextToken();  // consume the '/'<br/>        ParseFactor( t );<br/>    }<br/>    else {<br/>        // nil: consume nothing<br/>    }<br/>}<br/></pre>
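The routines above assume a TokenSource with peekToken and nextToken methods, which the post does not define. One way such a type might look is sketched here in Python; the regular expression, the choice of None as the end-of-input marker, and the use of SyntaxError for unknown characters are all assumptions made for illustration.

```python
import re


class TokenSource:
    """Hypothetical token source offering peekToken and nextToken."""

    # A number is a run of digits; operators and parentheses are single
    # characters; any other non-space character is reported as an error.
    _TOKEN = re.compile(r"(?P<num>\d+)|(?P<op>[-+*/()])|(?P<bad>\S)")

    def __init__(self, text):
        self._tokens = []
        for match in self._TOKEN.finditer(text):
            if match.lastgroup == "bad":
                raise SyntaxError(
                    f"unexpected character {match.group()!r} "
                    f"at offset {match.start()}"
                )
            self._tokens.append(match.group())
        self._pos = 0

    def peekToken(self):
        """Return the next token without consuming it (None at the end)."""
        if self._pos < len(self._tokens):
            return self._tokens[self._pos]
        return None

    def nextToken(self):
        """Consume and return the next token (None at the end)."""
        token = self.peekToken()
        if token is not None:
            self._pos += 1
        return token


t = TokenSource("12 + (3*4)")
print(t.peekToken())   # 12
print(t.nextToken())   # 12
print(t.nextToken())   # +
```

Scanning the whole input up front keeps peekToken trivial; a streaming scanner would work equally well as long as it can look one token ahead.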
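<br/>ParseFactor and the remaining routines follow the same pattern. As you can see, it's a very mechanical process. It almost takes less time to write the code than it does to format this blog post. I've slightly glossed over the need for the parser to do something and return its results; in a real parser, each routine would return a pointer to the expression tree it had built thus far, or perhaps evaluate the expression on the fly and return a number. One caveat: as written, each tail rule matches at most one operator, so an input such as 1+2+3 has to be written with parentheses. Making the tail rules right-recursive (ExprTail := nil | '+' Term ExprTail | '-' Term ExprTail, and likewise for TermTail), with ParseExprTail and ParseTermTail calling themselves after parsing the operand, lifts that restriction. <br/><h2>Web links:</h2>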
<h3>General information on BNF and EBNF</h3>
<a class="external" rel="external nofollow" target="blank" href="http://www.garshol.priv.no/download/text/bnf.html" title="http://www.garshol.priv.no/download/text/bnf.html"><strong>BNF</strong> and <strong>EBNF</strong>: What are they and how do they work?</a><br/><br/>Wikipedia on <a class="external" rel="external nofollow" target="blank" href="http://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form" title="http://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form">Backus-Naur form</a><br/><h3>Recursive Descent Parsers for .Net programmers</h3>
I found an interesting blog entry from a teacher at the IT University of Copenhagen, Denmark with a <u><a class="external" rel="external nofollow" target="blank" href="http://www.itu.dk/people/kfl/parsernotes.pdf" title="http://www.itu.dk/people/kfl/parsernotes.pdf">note about how to write scanners and parsers in C#</a></u>. If you're interested in learning more, it has a good introductory explanation of the relationship between grammars (productions) and parsing, as well as more detailed examples of parsing code<br />
in C#. You should find it easy to apply this approach in any language.<br/>
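As one illustration of carrying the approach over to another language, here is a minimal, self-contained sketch in Python (it is not from the original post or from the linked notes). It follows the simplified BNF above with one routine per non-terminal, and it evaluates the expression on the fly rather than building a tree. Assumptions: the class name Parser and its method names are hypothetical; numbers are matched as whole tokens by a regular expression; the tail rules are made right-recursive (e.g. ExprTail := nil | '+' Term ExprTail) so that chains such as 1+2+3 parse; each tail routine receives the value parsed so far so that operators associate to the left; and '/' is Python's true division, so results may be floats.

```python
import re


class Parser:
    """Evaluates +, -, *, / and parentheses over integers while parsing."""

    def __init__(self, text):
        # Numbers become whole tokens; every other non-space character
        # becomes a one-character token.
        self.tokens = re.findall(r"\d+|\S", text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        token = self.peek()
        self.pos += 1
        return token

    def parse(self):
        value = self.expr()
        if self.peek() is not None:
            raise SyntaxError(f"unexpected token {self.peek()!r}")
        return value

    def expr(self):               # Expr := Term ExprTail
        return self.expr_tail(self.term())

    def expr_tail(self, left):    # ExprTail := nil | '+' Term ExprTail | '-' Term ExprTail
        if self.peek() == "+":
            self.next()
            return self.expr_tail(left + self.term())
        if self.peek() == "-":
            self.next()
            return self.expr_tail(left - self.term())
        return left               # nil: nothing more at this level

    def term(self):               # Term := Factor TermTail
        return self.term_tail(self.factor())

    def term_tail(self, left):    # TermTail := nil | '*' Factor TermTail | '/' Factor TermTail
        if self.peek() == "*":
            self.next()
            return self.term_tail(left * self.factor())
        if self.peek() == "/":
            self.next()
            return self.term_tail(left / self.factor())
        return left               # nil

    def factor(self):             # Factor := RealNum | '-' RealNum | '(' Expr ')'
        token = self.next()
        if token == "(":
            value = self.expr()
            closing = self.next()
            if closing != ")":
                raise SyntaxError(f"expected ')', got {closing!r}")
            return value
        if token == "-":
            return -self.number(self.next())
        return self.number(token)

    @staticmethod
    def number(token):
        if token is None or not token.isdecimal():
            raise SyntaxError(f"expected a number, got {token!r}")
        return int(token)


print(Parser("2 + 3 * 4").parse())      # 14
print(Parser("(1 + 2) * -3").parse())   # -9
```

Passing the left operand into the tail routine is one common way to keep subtraction and division left-associative with right-recursive rules; building a tree and evaluating it afterwards would be an equally reasonable design.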