Building Parsers With Java¿

Category: Hardware & DIY
Author: Steven John Metsker
All Stack Overflow 7
This Year Stack Overflow 3
This Month Stack Overflow 3


by anonymous   2017-08-20

Developer of ParseKit here.

Unfortunately, there is no further (good) documentation on ParseKit's grammar syntax. Currently the best resources are:

  1. Steven Metsker's Book Building Parsers in Java. The good news: This will teach you about the design/innards of ParseKit. The bad news: the "Grammar syntax" feature of ParseKit is an additional feature layered on top of ParseKit which I designed and added myself. So it is not described in Metsker's book as his Java library does not have this feature.

  2. The .grammar files in the Test target of the ParseKit Xcode project. There's lots of real-world example grammars here. You can learn a lot by example.

  3. The ParseKit tag here on StackOverflow. I've answered a lot of questions which may be helpful to you.

As for your specific example, here's how I'd probably define it in ParseKit syntax.

@symbolState = '\n'; // Tokenizer Directive
                     // tells tokenizer to treat new line chars as 
                     // individual Symbol tokens rather than whitespace
@start = headerLine*;
headerLine = logFormat logId comma category eol;
logFormat = ('Type' 'A' 'Logfile') | ('Logfile' 'II') | ('Some' 'Other' 'Format');
logId = hash Number;
category = Any+;

comma = ',';
hash = '#';
eol = '\n';

One important thing to keep in mind is that parsing in ParseKit is a two Phase process:

  1. Tokenizing (done by PKTokenizer and altered by Tokenizer Directives in your grammar)
  2. Parsing (done by the parser constructed by the Declarations in your grammar)

So the Parser created by your grammar works on Tokens which have already been tokenized by the Tokenizer. It does not work on either individual chars or long strings composed of multiple tokens.

by anonymous   2017-08-20

Developer of ParseKit here.

First, the best way to understand this stuff is Steven Metsker's book, upon which ParseKit is based.

Second, checkout my answer to another question about PKAssembly's stack and target.

Third, here's my answer to another PaseKit question about unexpected callbacks.

Fourth, checkout the TDArithmeticParser.m file in the the ParseKit Tests Target (included in the ParseKit Xcode project. This class has callbacks which implement the same kind of arithmetic logic you seem to be looking for.

Also checkout the arithmetic.grammar file (also in the ParseKit Tests Target). This is an example of how to design an arithmetic grammar in ParseKit syntax.

Finally, here's some thoughts more specific to your example above.

Let's clarify your grammar a bit, as the question you're asking is pretty basic, and I don't think it requires a very complex grammar to tackle. Here's a basic arithmetic grammar which gives multiplication and division operators precedence over addition and subtraction:

@start         = expr;
expr           = term (plusTerm | minusTerm)*;
term           = factor (timesFactor | divFactor)*;
plusTerm       = '+'! term;
minusTerm      = '-'! term;
timesFactor    = '*'! factor;
divFactor      = '/'! factor;
factor         = Number;

That ! after the '+' tells ParseKit to discard this token automatically. This makes things slightly more convenient for you when writing your callbacks.

Note that if you want your grammar to have only left-to-right operator precedence (like a calculator), this grammar will not work. If you need that, please ask a separate question tagged #ParseKit here on StackOverflow, and I will answer it promptly.

I would define these callbacks:

- (void)parser:(PKParser *)p didMatchExpr:(PKAssembly *)a {
    NSLog(@"%s %@", __PRETTY_FUNCTION__, a);

    NSNumber *n = [a pop];

    // the expr is complete, and its value is on the stack.
    // important! wrap things up by 
    // storing your work in ``. not in an ivar. = n;

- (void)parser:(PKParser *)p didMatchFactor:(PKAssembly *)a {
    NSLog(@"%s %@", __PRETTY_FUNCTION__, a);

    // a number token was found. store its number value on the stack
    PKToken *tok = [a pop];
    [a push:[NSNumber numberWithDouble:tok.floatValue]];

- (void)parser:(PKParser *)p didMatchPlusTerm:(PKAssembly *)a {
    NSLog(@"%s %@", __PRETTY_FUNCTION__, a);

    // a '+' expr was found. pop off the two operands and add them
    // store the result on the stack temporarily
    NSNumber *n2 = [a pop];
    NSNumber *n1 = [a pop];
    [a push:[NSNumber numberWithDouble:[n1 doubleValue] + [n2 doubleValue]]];

- (void)parser:(PKParser *)p didMatchMinusTerm:(PKAssembly *)a {
    NSLog(@"%s %@", __PRETTY_FUNCTION__, a);

    // a '-' expr was found. pop off the two operands and subtract them
    // store the result on the stack temporarily
    NSNumber *n2 = [a pop];
    NSNumber *n1 = [a pop];
    [a push:[NSNumber numberWithDouble:[n1 doubleValue] - [n2 doubleValue]]];

- (void)parser:(PKParser *)p didMatchTimesFactor:(PKAssembly *)a {
    NSLog(@"%s %@", __PRETTY_FUNCTION__, a);

    // a '*' expr was found. pop off the two operands and multiply them
    // store the result on the stack temporarily
    NSNumber *n2 = [a pop];
    NSNumber *n1 = [a pop];
    [a push:[NSNumber numberWithDouble:[n1 doubleValue] * [n2 doubleValue]]];

- (void)parser:(PKParser *)p didMatchDivideFactor:(PKAssembly *)a {
    NSLog(@"%s %@", __PRETTY_FUNCTION__, a);

    // a '/' expr was found. pop off the two operands and divide them
    // store the result on the stack temporarily
    NSNumber *n2 = [a pop];
    NSNumber *n1 = [a pop];
    [a push:[NSNumber numberWithDouble:[n1 doubleValue] / [n2 doubleValue]]];

Two important points are

  1. Don't worry about how many times these callbacks are called. They may be called more times than you expect, or in strange-seeming order.
  2. Don't store the results of work done in these callbacks in an ivar. Always store your work on either the a argument's target or stack. I usually store temporary values on the stack, and the ultimate result on the target, as I find that most convenient. But you have flexibility there.

I would write this driver code:

NSString *g = .. // fetch grammar above
PKParser *p = [[PKParserFactory factory] parserFromGrammar:g assembler:self];
NSString *s = @"3*4+4*8";

PKAssembly *res = [p parse:s];
NSLog(@"res %@", res);

I see this log output:

-[DebugAppDelegate parser:didMatchFactor:] [3]3^*/4/+/4/*/8
-[DebugAppDelegate parser:didMatchFactor:] [3, 4]3/*/4^+/4/*/8
-[DebugAppDelegate parser:didMatchTimesFactor:] [3, 4]3/*/4^+/4/*/8
-[DebugAppDelegate parser:didMatchFactor:] [12, 4]3/*/4/+/4^*/8
-[DebugAppDelegate parser:didMatchFactor:] [12, 4, 8]3/*/4/+/4/*/8^
-[DebugAppDelegate parser:didMatchTimesFactor:] [12, 4, 8]3/*/4/+/4/*/8^
-[DebugAppDelegate parser:didMatchPlusTerm:] [12, 4]3/*/4/+/4^*/8
-[DebugAppDelegate parser:didMatchPlusTerm:] [12, 32]3/*/4/+/4/*/8^
-[DebugAppDelegate parser:didMatchExpr:] [3]3^*/4/+/4/*/8
-[DebugAppDelegate parser:didMatchExpr:] [12]3/*/4^+/4/*/8
-[DebugAppDelegate parser:didMatchExpr:] [16]3/*/4/+/4^*/8
-[DebugAppDelegate parser:didMatchExpr:] [44]3/*/4/+/4/*/8^
res 44

Hope this helps.

by anonymous   2017-08-20

Developer of ParseKit here.

There are two features in ParseKit which can be used to help provide user-readable hints describing parse errors encountered in input.

  1. -[PKParser bestMatchFor:]
  2. The PKTrack class

It sounds like you're aware of the -bestMatchFor: method even if it's not doing what you expect in this case.

I think the PKTrack class will be more helpful here. As described in Metsker's book, PKTrack is exactly like PKSequence except that its subparsers are required, and an error is thrown (with a helpful error message) when all of its subparsers are not matched.

So here's a grammar for your example input:

@start         = '(' expr ')' | expr;
expr           = ('+' | '-') term term;
term           = '(' expr ')' | Word;

Any productions listed contiguously are a Sequence -- but could instead be a Track.

The benefit of changing these Sequences to be Tracks is that an NSException will be thrown with a human-readable parse error message if the input doesn't match. The downside is that you must now wrap all usages of your factory-generated parser in a try/catch block to catch these Track exceptions.

The problem currently (or before now, at least) is that the PKParserFactory never produced a parser using Tracks. Instead, it would always use Sequences.

So I've just added a new option in head of trunk at Google Code (you'll need to udpate).

#define USE_TRACK 0



It's 0 by default. If you change this define to 1, Tracks will be used instead of Sequences. So given the grammar above and invalid input like this:

(+ a - b c))

and this client code:

NSString *g = // fetch grammar above
PKParser *p = [[PKParserFactory factory] parserFromGrammar:g assembler:self];
NSString *s = @"(+ a - b c))";

@try {
    PKAssembly *res = [p parse:s];
    NSLog(@"res %@", res);
@catch (NSException *exception) {
    NSLog(@"Parse Error:%@", exception);

you will get a nice-ish human-readable error:

Parse Error:

After : ( + a
Expected : Alternation (term)
Found : -

Hope that helps.

by anonymous   2017-08-20

Developer of ParseKit here.

First, see my previous answers on debugging ParseKit grammars and battling infinite recursion in ParseKit grammars.

I think there might be an issue in the very first line (but I'm not a SQL expert, so I'm not sure). Shouldn't that be:

@start = (statement ';')+;

I would strongly recommend using camel case instead of underscores, as underscores will make your Objective-C callbacks very awkward and ugly. That is why camel case is the convention in ParseKit grammars.

However, I see the main issue. Your grammar includes Left Recursion which is not allowed in ParseKit grammars. Particularly in your expr prodcution (I haven't looked closely to see if it is elsewhere too).

ParseKit is fine with recursion, but not left recusion. The best explanation of this is in Steven Metsker's "Building Parsers with Java". Or search the web.

But basically left recursion is when a production immediately references itself (on the left side of an expression):

e = e '-' Number;


e = Number | e '-' Number;

Instead, you must design your grammars to remove left recursion like:

e = Number ('-' Number)*;
by anonymous   2017-08-20

Developer of ParseKit here.

(One thing to keep in mind with my answer: although I am the developer of ParseKit, I did not really design the framework or its API. It is based mostly on specific designs found in Steven Metsker's book Building Parsers With Java. I merely ported them to ObjC/Cocoa.)

ParseKit is composed of three parts:

  1. A highly-flexible, high-performance Objective-C Tokenizer (PKTokenizer, PKToken classes)
  2. A thoroughly dynamic Objective-C Parser toolkit for building backtracking, recursive-decent parsers with infinite lookahead (the PKParser class and sublcasses). Due to its dynamism, performance of this parser toolkit is poor for large input.
  3. Objective-C Parser Generation via Grammars - Generate an Objective-C parser for your custom language using a BNF-style grammar syntax (similar to yacc or ANTLR). While parsing, the parser will provide callbacks to your Objective-C code. Due to #2's dynamism, writing grammars is relatively easy, and there are relatively few restrictions on what you can do in the grammars.

Each component above builds upon the prior components. So #3 - the Grammar toolkit - uses both #1 the tokenizer, and #2 the parser toolkit.

If you are doing any serious parsing tasks, I would always recommend checking out #1 - the tokenizer - PKTokenizer. It is incredibly flexible and powerful and performance is very good. If it's easier for you to work on tokens rather than an input string (and it usually is), you'll probably want to check this out.

As for #2 (ObjC Parser toolkit), you'll usually want to just skip it and move to #3, as building parsers via grammar is much nicer than building them via ObjC code.

For #3 (ObjC Parser toolkit via BNF Grammars), the most important consideration is performance. ParseKit's parser toolkit is suitable for parsing relatively small input strings. Some examples might be:

  1. XPath-style query languages
  2. SQL
  3. Relatively consise DSL or command languages
  4. Regular Expressions
  5. Menus (or things than can be broken up into a flat array of relatively small sentences)

ParseKit's parser toolkit is generally not suitable for parsing larger input strings due to performance concerns. Some examples might be:

  1. XML documents
  2. JSON documents

ParseKit can certainly (and does) parse these types of input, but again, performance is poor compared to a dedicated XML or JSON parser due to ParseKit's dynamism (backtracking, infinte lookeahead).

For a "wine menu", I would say, yes - ParseKit is probably a good (possibly great) solution. Especially if you can break the individual lines of input into a array of strings and parse them one by one. Performance should be quite good, and once you get over the learning curve, ParseKit is incredibly powerful/convenient for these types of jobs.

In fact, IIRC, Metsker's original book even uses something like this as an example of a good use of his toolkit.

Hope this helps.