This book will tell you exactly how to do it:
http://www.amazon.com/Compilers-Principles-Techniques-Alfred-Aho/dp/0201100886
Here's a simple Java implementation that might get you on track: http://matt.might.net/articles/implementation-of-nfas-and-regular-expressions-in-java/
Basically, the industrial-strength implementation is a state machine. You deconstruct the regular expression (the string with the '*' in it) and build a graph for it. Then you search the graph, for example breadth-first, to see whether the input can drive it to an accepting state.
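To make that concrete, here is a minimal sketch in C, with the graph for the pattern ab*c hard-coded rather than built from a string (a real engine constructs it from the expression, as the linked articles describe). The "search" just tracks the set of states that are currently active:

#include <stdio.h>
#include <string.h>

/* States for a hand-built machine matching "ab*c":
   0 --a--> 1,   1 --b--> 1,   1 --c--> 2 (accept). */
#define NSTATES 3

static int match(const char *in)
{
    int cur[NSTATES] = { 1, 0, 0 };      /* start with only state 0 active */

    for (; *in; in++) {
        int next[NSTATES] = { 0, 0, 0 };
        if (cur[0] && *in == 'a') next[1] = 1;
        if (cur[1] && *in == 'b') next[1] = 1;
        if (cur[1] && *in == 'c') next[2] = 1;
        memcpy(cur, next, sizeof cur);   /* advance the whole state set at once */
    }
    return cur[2];                       /* matched if the accept state is active */
}

int main(void)
{
    printf("%d %d %d\n", match("abbbc"), match("ac"), match("abx"));  /* 1 1 0 */
    return 0;
}

Because the whole state set advances together, there is no backtracking, which is the point of the state-machine approach.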
Here's some discussion of different ways to do it, that will help illustrate the approach: http://swtch.com/~rsc/regexp/regexp1.html
I am not a compiler expert, but I do know that this book is actually regarded as necessary reading for anyone who wants to write a compiler. Yes, the cover is antiquated, but from what I have read it still has many good patterns relevant to code compilation:
Read the Dragon Book:
http://www.amazon.com/Compilers-Principles-Techniques-Alfred-Aho/dp/0201100886
It covers lexical and syntactical analysis in depth (among other topics). You can use this to help you understand the "language" you are trying to parse to determine the best way to go about it.
@Yfour You're welcome, but it's not original to me. I read it in [Aho's book](https://www.amazon.com/Compilers-Principles-Techniques-Alfred-Aho/dp/0201100886) - I think.
What programming languages are you familiar with? My impression from your question is C.
It looks like the tokens of your configuration language are regular expressions:
Listen
127.0.0.1:1000
1000
;
{
}
etc.
Almost all modern programming languages have some form of support for those.
If the implementation is C, I'd probably use flex. It generates a function which will apply a set of regular expressions, put the matched text into a C string, and return the type of that regular expression (just an int, which you choose). The function is a 'lexical analyser' or 'tokeniser'. It chops up streams of characters into handy units that match your needs, one regular expression at a time.
Flex is pretty easy to use. It has several advantages over lex. One is that you can have multiple lexical analyser functions, so if you need to do something odd for an include file, then you could have a second lexical analyser function for that job.
Your language looks simple. Bison/Yacc are very powerful tools, and "with great power comes great responsibility" :-)
I think it is sufficiently simple that I might just write a parser by hand. It might only take a few functions to handle its structure. A very straightforward technique is called recursive descent parsing. Have you got a CS degree, or understand this stuff?
Lots of people will (at this stage) tell you to get the 'Dragon Book' or one of its newer versions, often because that is what they had at college. The Dragon book is great, but it is like telling someone to read all of Wikipedia to find out about whales. Great if you have the time, and you'll learn a lot.
A reasonable start is the Wikipedia Recursive Descent parser article. Recursive descent is very popular because it is relatively straightforward to understand. The thing that makes it straightforward is to have a proper grammar which is cast into a form which is easy for recursive descent to parse. Then you literally write a function for every rule, with a simple error handling mechanism (that's why I asked about this). There are probably tools to generate them, but you might find it quicker to just write it. A first cut might take a day, then you'd be in a good position to decide.
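To make "a function for every rule" concrete, here is a minimal sketch in C. The grammar is guessed from your examples (Listen statements and named { ... } blocks, an assumption on my part), and the canned token stream stands in for a real lexer such as the flex one below:

#include <stdio.h>
#include <stdlib.h>

/* Guessed grammar:
     config    := statement*
     statement := LISTEN IP_PORT ';'  |  IDENTIFIER '{' statement* '}'   */

enum { TOK_EOF, TOK_KEYWORD_LISTEN, TOK_IP_PORT, TOK_IDENTIFIER,
       TOK_OPEN_BRACE, TOK_CLOSE_BRACE, TOK_SEMICOLON };

/* Canned token stream standing in for a real lexer, e.g. a yylex() wrapper:
   Listen 127.0.0.1:1000 ; server { Listen 127.0.0.1:2000 ; }            */
static int stream[] = { TOK_KEYWORD_LISTEN, TOK_IP_PORT, TOK_SEMICOLON,
                        TOK_IDENTIFIER, TOK_OPEN_BRACE,
                        TOK_KEYWORD_LISTEN, TOK_IP_PORT, TOK_SEMICOLON,
                        TOK_CLOSE_BRACE, TOK_EOF };
static int pos;
static int tok;                              /* one token of lookahead */

static void advance(void) { tok = stream[pos++]; }

static void expect(int t, const char *what)  /* simple error handling */
{
    if (tok != t) { fprintf(stderr, "expected %s\n", what); exit(1); }
    advance();
}

static void statement(void);                 /* one function per grammar rule */

static void block(void)                      /* '{' statement* '}' */
{
    expect(TOK_OPEN_BRACE, "'{'");
    while (tok != TOK_CLOSE_BRACE)
        statement();
    expect(TOK_CLOSE_BRACE, "'}'");
}

static void statement(void)
{
    if (tok == TOK_KEYWORD_LISTEN) {
        advance();
        expect(TOK_IP_PORT, "ip:port");
        expect(TOK_SEMICOLON, "';'");
    } else if (tok == TOK_IDENTIFIER) {
        advance();                           /* section name */
        block();
    } else {
        fprintf(stderr, "unexpected token %d\n", tok);
        exit(1);
    }
}

int main(void)
{
    advance();                               /* prime the lookahead */
    while (tok != TOK_EOF)
        statement();
    puts("parsed OK");
    return 0;
}

Notice how each function's body reads almost exactly like its grammar rule; that is what makes recursive descent easy to write and debug by hand.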
A very nifty lex/flex feature is that any characters which are not matched are just echoed to standard output. So you can see what your regular expressions are matching, and can add them incrementally. When the output 'dries up', everything is being matched.
Pontification alert: IMHO, more C programmers should learn to use flex. It is relatively easy to use, and very powerful for text handling. IMHO many are put off because they are also told to use yacc/bison, which are much more powerful, subtle, and complex tools.
end Pontification.
If you need a bit of help with the grammar, please ask. If there is a nice grammar (might not be the case, but so far your examples look okay) then implementation is straightforward.
I found two links to stackoverflow answers which look helpful:
Recursive descent parser implementation
Looking for a tutorial on Recursive Descent Parsing
Here is an example of using flex.
Flex takes a 'script', and generates a C function called yylex(). This is the input script.
Remember that all of the regular expressions are being matched within that yylex function, so though the script looks weird, it is really an ordinary C function. To tell the caller, which will be your recursive descent parser, what type of regular expression is matched, it returns an integer value that you choose, just like any ordinary C function.
If there is nothing to tell the parser about (like white space, and probably some form of comment), it doesn't return; it 'silently' consumes those characters. If the syntax needs to use newline, then that would be recognised as a token, and a suitable token value returned to the parser. It is sometimes easier to let it be more free form, so this example consumes and ignores all white space.
Effectively the yylex function is everything from the first %% to the second %%. It behaves like a big switch() statement.
The regular expressions are like (very exotic) case: labels.
The code inside the { ... } is ordinary C. It can contain any C statements, and must be properly nested within the { ... }.
The stuff before the first %% is the place to put flex definitions, and a few 'instructions' to flex.
The stuff inside %{ ... %} is ordinary C, and can include any headers needed by the C in the file, or even define global variables.
The stuff after the second %% is ordinary C, with no need for extra syntax, so no %{ ... %}.
/* scanner for a configuration file */
%{
/* Put C headers in here */
#include <stdio.h>    /* fprintf */
#include <stdlib.h>   /* exit */
#include "config.h"   /* the TOK_* values - see the enum below */
%}

%%

[0-9]+                                      { return TOK_NUMBER; }
[0-9]+"."[0-9]+"."[0-9]+"."[0-9]+":"[0-9]+  { return TOK_IP_PORT; }
[0-9]+"."[0-9]+"."[0-9]+"."[0-9]+"/"[0-9]+  { return TOK_IP_RANGE; }
"Listen"                                    { return TOK_KEYWORD_LISTEN; }
[A-Za-z][A-Za-z0-9_]*                       { return TOK_IDENTIFIER; }
"{"                                         { return TOK_OPEN_BRACE; }
"}"                                         { return TOK_CLOSE_BRACE; }
";"                                         { return TOK_SEMICOLON; }
[ \t\n]+                                    /* eat up whitespace, do nothing */
.                                           { fprintf(stderr, "Unrecognized character: %s\n", yytext);
                                              exit(1); }

%%

/* -------- A simple test ----------- */
int main(int argc, char *argv[])
{
    int tok;

    yyin = stdin;
    while ((tok = yylex()) != 0) {
        fprintf(stderr, "%d %s\n", tok, yytext);
    }
    return 0;
}
That has a minimal, dummy main, which calls the yylex() function to get the next token (enum) value. yytext is the string matched by the regular expression, so main just prints it.
WARNING, this is barely tested, little more than:
flex config.l
gcc lex.yy.c -ll
./a.out <tinytest
The values are just integers, so an enum in a header:
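Something like this would do (my sketch; the names match the flex rules above, and starting at 258 just keeps the codes clear of literal character values, since yylex() returns 0 at end of input):

/* config.h - token codes shared by the lexer and the parser */
#ifndef CONFIG_H
#define CONFIG_H

enum {
    TOK_NUMBER = 258,      /* 0 is yylex()'s end-of-input, so start higher */
    TOK_IP_PORT,
    TOK_IP_RANGE,
    TOK_KEYWORD_LISTEN,
    TOK_IDENTIFIER,
    TOK_OPEN_BRACE,
    TOK_CLOSE_BRACE,
    TOK_SEMICOLON
};

#endif /* CONFIG_H */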
In your parser, call yylex when you need the next value. You'll probably wrap yylex in something which copies yytext before handing the token type value back to the parser.
You will need to be comfortable handling memory. If this were a large file, maybe use malloc to allocate space. But for small files, and to make it easy to get started and debug, it might make sense to write your own 'dumb' allocator. A 'dumb' memory management system can make debugging much easier. Initially just have a big char array statically allocated and a mymalloc() handing out pieces. I can imagine the configuration data never gets free()'d. Everything can be held in C strings initially, so it is straightforward to debug because the exact sequence of input is in the char array. An improved version might 'stat' a file, and allocate a piece big enough.
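A sketch of that idea (ARENA_SIZE and mymalloc are made-up names; save_text is the sort of yytext-copying wrapper mentioned above):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ARENA_SIZE (64 * 1024)    /* plenty for a small config file */

static char arena[ARENA_SIZE];    /* the whole 'heap', easy to inspect in a debugger */
static size_t used;

void *mymalloc(size_t n)
{
    void *p;

    n = (n + 7) & ~(size_t)7;     /* keep everything 8-byte aligned */
    if (used + n > ARENA_SIZE) {
        fprintf(stderr, "config arena exhausted\n");
        exit(1);
    }
    p = arena + used;
    used += n;
    return p;                     /* never free()'d - fine for config data */
}

/* The yylex() wrapper might use it to save yytext before parsing moves on: */
char *save_text(const char *s)
{
    char *copy = mymalloc(strlen(s) + 1);
    strcpy(copy, s);
    return copy;
}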
How you deal with the actual configuration values is a bit beyond what I can describe. Text strings might be all that is needed, or maybe there is already a mechanism for that. Often there is no need to store the text value of 'Keywords', because the parser has recognised what it means, and the program might convert other values, e.g. IP addresses, into some internal representation.
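For example, converting the matched text of a TOK_IP_PORT into binary form might look roughly like this (a sketch using POSIX inet_pton; struct listen_addr is a made-up name):

#include <arpa/inet.h>   /* inet_pton */
#include <sys/socket.h>  /* AF_INET */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct listen_addr {
    struct in_addr ip;   /* IPv4 address in network byte order */
    uint16_t port;
};

/* Convert "127.0.0.1:1000" into binary; returns 0 on success, -1 on error. */
int parse_ip_port(const char *text, struct listen_addr *out)
{
    char ip[16];                       /* "255.255.255.255" + NUL */
    unsigned port;
    const char *colon = strchr(text, ':');

    if (colon == NULL || (size_t)(colon - text) >= sizeof ip)
        return -1;
    memcpy(ip, text, colon - text);
    ip[colon - text] = '\0';
    if (inet_pton(AF_INET, ip, &out->ip) != 1)
        return -1;
    if (sscanf(colon + 1, "%u", &port) != 1 || port > 65535)
        return -1;
    out->port = (uint16_t)port;
    return 0;
}

int main(void)
{
    struct listen_addr a;
    printf("%d\n", parse_ip_port("127.0.0.1:1000", &a));   /* 0 */
    printf("port %u\n", (unsigned)a.port);                 /* port 1000 */
    return 0;
}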
This book teaches many things, including parsing. It's considered a classic on compilers.
http://www.amazon.com/Compilers-Principles-Techniques-Alfred-Aho/dp/0201100886
Aho and Ullman have an excellent (and very deep) book on compilers. http://www.amazon.com/Compilers-Principles-Techniques-Alfred-Aho/dp/0201100886
However, if you want a quick recipe for writing a simple compiler, it might be too deep. Still, it may be good to have it for reference.
If you want to know how one goes from source code to something that actually runs on a target machine, you should get a copy of the famous Red Dragon Book. I've used it for building parsers and lexical analyzers. While it dates back to 1986, and I'm sure there's been progress in the interim, as far as I can tell, it hasn't been surpassed as a text.
It appears that Addison-Wesley has done a reprint of its predecessor, the Green Dragon Book, and is passing it off as something recent, so be careful to get the genuine article.
Short version of the general approach:
There's an algorithm called the Thompson-McNaughton-Yamada construction, or sometimes just "Thompson construction." One builds intermediate NFAs, filling in the pieces along the way, while respecting operator precedence: first parentheses, then the Kleene star (e.g., a*), then concatenation (e.g., ab), followed by alternation (e.g., a|b).
Here's an in-depth walkthrough of building the NFA for (b|a)*b(a|b).
Building the top level
Handle parentheses. Note: in an actual implementation, it can make sense to handle parentheses via a recursive call on their contents. For the sake of clarity, I'll defer evaluation of anything inside of parens.
Kleene Stars: only one * there, so we build a placeholder Kleene Star machine called P (which will later contain b|a). Intermediate result:
[diagram omitted: the P* machine]
Concatenation: Attach P to b, and attach b to a placeholder machine called Q (which will contain a|b). Intermediate result:
[diagram omitted: the P*bQ machine]
There's no alternation outside of parentheses, so we skip it.
Now we're sitting on a P*bQ machine. (Note that our placeholders P and Q are just concatenation machines.) We replace the P edge with the NFA for b|a, and replace the Q edge with the NFA for a|b via recursive application of the above steps.
Building P
Skip. No parens.
Skip. No Kleene stars.
Skip. No concatenation.
Build the alternation machine for b|a. Intermediate result:
[diagram omitted: the b|a alternation machine]
Integrating P
Next, we go back to that P*bQ machine and we tear out the P edge. We have the source of the P edge serve as the starting state for the P machine, and the destination of the P edge serve as the destination state for the P machine. We also make that state reject (take away its property of being an accept state). The result looks like this:
[diagram omitted: the b|a machine spliced in place of the P edge]
Building Q
Skip. No parens.
Skip. No Kleene stars.
Skip. No concatenation.
Build the alternation machine for a|b. Incidentally, alternation is commutative, so a|b is logically equivalent to b|a. (Read: skipping this minor footnote diagram out of laziness.)
Integrating Q
We do what we did with P above, except replacing the Q edge with the a|b machine we just constructed. This is the result:
[diagram omitted: the completed NFA for (b|a)*b(a|b)]
Tada! Er, I mean, QED.
Want to know more?
All the images above were generated using an online tool for automatically converting regular expressions to non-deterministic finite automata. You can find its source code for the Thompson-McNaughton-Yamada Construction algorithm online.
The algorithm is also addressed in Aho's Compilers: Principles, Techniques, and Tools, though its explanation is sparse on implementation details. You can also learn from an implementation of the Thompson Construction in C by the excellent Russ Cox, who described it in some detail in a popular article about regular expression matching.
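Here is a compact sketch of those steps in C, modeled on the fragment approach in Cox's article (simplified, and only handling the operators used above; State, Frag, and patch follow his naming, the rest is my paraphrase). Each operator combines smaller NFA fragments; a fragment is a start state plus a list of dangling arrows waiting to be patched into whatever comes next:

#include <stdio.h>
#include <stdlib.h>

enum { SPLIT = 256, MATCH = 257 };   /* special state kinds */

typedef struct State State;
struct State {
    int c;                  /* character to match, or SPLIT / MATCH */
    State *out, *out1;      /* outgoing arrows (out1 used only by SPLIT) */
};

static State *state(int c, State *out, State *out1)
{
    State *s = malloc(sizeof *s);    /* no error checking - it's a sketch */
    s->c = c; s->out = out; s->out1 = out1;
    return s;
}

/* A list of dangling State* arrows, stored inside the arrow slots themselves. */
typedef union Ptrlist Ptrlist;
union Ptrlist { Ptrlist *next; State *s; };

static Ptrlist *list1(State **outp)          /* single-entry list */
{
    Ptrlist *l = (Ptrlist *)outp;
    l->next = NULL;
    return l;
}

static void patch(Ptrlist *l, State *s)      /* point every dangling arrow at s */
{
    Ptrlist *next;
    for (; l; l = next) { next = l->next; l->s = s; }
}

static Ptrlist *append(Ptrlist *l1, Ptrlist *l2)
{
    Ptrlist *old = l1;
    while (l1->next) l1 = l1->next;
    l1->next = l2;
    return old;
}

/* An NFA fragment: a start state plus its dangling arrows. */
typedef struct { State *start; Ptrlist *out; } Frag;

static Frag literal(int c)                   /* a single character */
{
    State *s = state(c, NULL, NULL);
    Frag f = { s, list1(&s->out) };
    return f;
}

static Frag cat(Frag e1, Frag e2)            /* concatenation: e1 e2 */
{
    Frag f = { e1.start, e2.out };
    patch(e1.out, e2.start);                 /* e1's loose ends flow into e2 */
    return f;
}

static Frag alt(Frag e1, Frag e2)            /* alternation: e1 | e2 */
{
    State *s = state(SPLIT, e1.start, e2.start);
    Frag f = { s, append(e1.out, e2.out) };
    return f;
}

static Frag star(Frag e)                     /* Kleene star: e* */
{
    State *s = state(SPLIT, e.start, NULL);
    Frag f = { s, list1(&s->out1) };
    patch(e.out, s);                         /* loop e's loose ends back to the split */
    return f;
}

int main(void)
{
    /* Build the NFA for (b|a)*b(a|b), the machine walked through above. */
    Frag e = cat(cat(star(alt(literal('b'), literal('a'))), literal('b')),
                 alt(literal('a'), literal('b')));
    patch(e.out, state(MATCH, NULL, NULL));  /* hook the last loose ends to accept */
    printf("start state kind: %d (SPLIT, the P* loop)\n", e.start->c);
    return 0;
}

Tracing main() reproduces the walkthrough: the outer concatenations attach the starred b|a loop to b and then to the trailing a|b, and the final patch plays the role of marking the accept state. A real implementation adds a parser (or a postfix conversion, as in Cox's article) in front of these combinators.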