.vscode | ||
.gitignore | ||
assignment.jj | ||
README.md | ||
run.sh |
Well-parenthesized language parser
This project is a JavaCC parser configuration which generates a parser for a language with well-balanced parentheses. The features that this language has is:
- Each string accepted by the language consists of multiple lines (but at least 1 line).
- Lines must not be empty.
- Lines must contain well-balanced parentheses, e.g.
()
,()()()
,(()())
,(((())))
. - Only parentheses are allowed, so any other character makes the input invalid.
- (As an exception to the first rule) there must be a blank final line at the end of the input.
E.g. the following is a good input to the parser:
()(())
(())(((()())))
((((((((((()))))))))))
(())()()()()()
((()()())()()(()(())))()
<blank line here>
How to run
To build the Java code and run:
./run.sh
. This is an alias for:
javacc assignment.jj && javac *.java && java Assignment < test.txt > output.txt 2> err.txt
Output
The standard output stream will contain the parser's decision as to whether the string is valid — the output will either be PASS
or FAIL
.
If it is FAIL
, the standard error will also contain the line number of the first error that caused the invalidity.
Commentary on the code
The syntax of JavaCC is at times elusive, so I thought to document what I have learnt about it from this project:
Main parser declaration
The part of code delimited by PARSER_BEGIN(Assignment)
and PARSER_END(Assignment)
denotes the entry point to the parser generated, i.e. the class that
represents the parser at the highest level. You can create an instance of this parser class and feed it some input in order to see its effects.
That's exactly what this code does:
Assignment parser = new Assignment(System.in);
parser.Start();
...which takes our parser (whose name is Assignment
, because that's what was specified inside PARSER_BEGIN
), constructs an instance of it, giving it
the input stream System.in
(standard input), and then calls the entry point to the parser (its start symbol or sentence symbol).
In this parser, I have called this Start
.
Tokens
Now, to register to the tokenizer which tokens we would like to allow in our language (that is, which, conceptually, UTF-8 codepoints we would like to consider valid), we use TOKEN
declarations to specify them. Here, we only actually want 3 valid types of token: left parentheses, right parentheses, and line breaks. Those are the characters we want to consider in our valid inputs. Due to a technicality that will be addressed later, I have also made a fourth
token, which captures any character altogether, called ANY
; this makes our tokenizer total, i.e. it accepts any sequence of characters at all.
This won't actually change the language of strings we can accept, because the parser can simply reject anything that doesn't match the desired form, so
any input that does actually generate ANY
tokens will fail anyway.
The syntax for generating a named token is:
TOKEN : { <SYMBOLIC_NAME_OF_THE_TOKEN: regular expression that we want to match>}
This allows us to refer to the token using its symbolic name later (or else we would not be able to use it in the parser), and specify what string should produce that token. This can be used to e.g. define numeric constants, identifiers, or similar costructs seen in programming languages, in a succinct way.
Note, however, that literal character parts of the regular expression should be enclosed in strings, e.g. as in the following regex for a number:
TOKEN: { <NUMBER: (["0"-"9"])+> }
Additionally, it seems that the +
and *
operators expect their argument to be a parenthesized expression.
Thus, in order to specify the three tokens we wish to consider, we use these lines:
TOKEN : { <LEFT_PAREN: "(">}
TOKEN : { <RIGHT_PAREN: ")">}
TOKEN : { <NEWLINE: "\n">}
...the line-break token production in particular being specified using the usual escape sequence.
The regular expressions for these tokens are very simply the string representing the single characters for each token type.
Finally, to recognize any other character, the following incantation is used:
TOKEN : { <ANY: ~[]>}
One may interpret the expression as being a negation of the empty character set ([]
), i.e.: []
is a construct of regular expression syntax that effectively stands for alternation: the overall expression matches any of the characters inside the square brackets, e.g. [abc]
matches a
, b
, or c
; in JavaCC, the negation of this is denoted using ~[]
, so representing the alternation of all characters not inside the square brackets, in this case being the complement of []
, namely all possible characters.
Grammar
The grammar I am using for generating all sets of balances parentheses is:
S ::= (P)P
P ::= (P)P | ε
Notably, since JavaCC generates an LL(k) parser, which cannot parse any left-recursion-containing grammar, our grammar rules mustn't contain
any rules like A -> Ax
, where a non-terminal appears on the left of its own production. Fortunately, in this grammar, that is respected, so all is well.
In a previous iteration, I had tried a production for P
that looked like:
P ::= P()
| ()P
| (P)
| ε
...which, due to the first rule, was regrettably indeed left-recursive. That rule should be able to be fixed using the following adjustment, though:
P() → P{()}+
(where {}
denotes a grouping, and ()
still just denote parentheses. In PCRE regex syntax, this would be (\(\))+
).
Note that S
is just the grammar for one line of parentheses, without the line-break character of end-of-file part of the specification taken into account.
The full grammar is:
Start ::= {P<NEWLINE>}+<EOF>
, i.e. specifying that each string of balanced parentheses is followed by a new line, one may have many such lines, and at the end,
the final line is blank, followed directly by an EOF.
Productions
Now, onto encoding the production rules of the grammar, where we specify the functions that the parser will use for its work; immediately, we shall cover the starting rule once again. It expects the form of input I expressed above: a sequence of "valid lines followed by line breaks" (with at least one such repetition), followed by the end of the input. Therefore, the grammar rule needed in JavaCC syntax is (ValidLine()<NEWLINE>)+<EOF>
: this consists of a repeated unit ((ValidLine()<NEWLINE>)+
) of calls to the production ValidLine
followed by a NEWLINE
token; followed in turn by the end of the program.
As one can see, the syntax for a named token is <TOKEN_NAME>
, and the syntax for another production in the grammar is Production()
, treating it like a method.
Therefore, productions in the JavaCC syntax are defined like Java methods, with the void ProductionName
syntax; however, one may note the difference to the usual method syntax, the : {}
featured throughout my file, which specifies variable declarations for variables that will be used inside the production.
Principally, JavaCC seems to be an event-driven parser, so you can execute Java code within the body of the production, including storing results of previous
calls to other productions as variables; therefore, not only can you give productions a type (void
suffices for this program, but it can be int
, String
, and probably anything else), but you must also tell the parser what variables will be used in the body of the production.
Since no variables are needed here, we will just use empty braces to denote none.
Now, to translate our rules for the grammar:
Start ::= {P<NEWLINE>}+<EOF>
becomes:
void Start() : {}
{
(ValidLine()<NEWLINE>)+<EOF>
}
S ::= (P)P
becomes:
void ValidLine() : {}
{
<LEFT_PAREN>Parens()<RIGHT_PAREN>Parens()
}
P ::= (P)P | ε
becomes:
void Parens() : {}
{
<LEFT_PAREN>Parens()<RIGHT_PAREN>Parens() | Epsilon()
}
What about this call to Epsilon
? Well, it seems JavaCC doesn't have a way to represent epsilon by default, so I created an empty production to match it:
void Epsilon() : {}
{
{}
}
And, with that, the main part of the grammar is done!
Driver code
Here is the entry point to our parser again, the part within the PARSER_BEGIN directive:
class Assignment {
public static void main(String[] args) throws ParseException, TokenMgrError
{
try {
Assignment parser = new Assignment(System.in);
parser.Start();
System.out.println("PASS");
} catch (ParseException e) {
System.out.println("FAIL");
System.err.println(e.currentToken.beginLine);
}
}
}
In order to implement the desired I/O action of printing pass/fail to the standard output, and printing the error line if occurred to standard error,
we use exception handling to handle the possible exceptions generated by JavaCC's output tokenizer and parser. Note that, because of the ANY
token
I provisioned before, we won't ever have a TokenMgrError
, so only the parser should be able to error; this is helpful, because the TokenMgrError
class
generated doesn't actually contain a member field that you can read the line number of the error; instead, it packages that information as an error message,
which then needs to be read, blah blah, in order to extract the meaningful error line; and, fortunately, the ParseException
class does have such a line number, which comes from the currentToken
field's beginLine
field. With all this exception handling in place, the main method is constructed as a wishful exection of the parser on standard input, followed by printing PASS
; or, printing FAIL
plus the error line if an exception was raised.