Commit 49be8565 authored by Dr. Michael Petter

initial fork

Changes from 1.7 to -with-syntaxtree branch:
New feature, 'syntaxtree generation', makes it possible to automatically
generate a semantics class that creates a syntax tree compatible with the
java-cup 0.11b parser generator. Its format is a generic DOM format that
can be queried and processed with XML/XSL as well as custom tree
transformers in Java.
Changes from distribution 1.6.1 to 1.7:
New feature, 'deferred actions' makes it possible to construct messages
that can be revoked in case of backtracking.
The feature uses lambda expressions of Java, which means it requires
Java version 1.8 (JSE 8).
The manual is revised, with added references to new literature about
converting EBNF to PEG.
Some code clean-up and restructuring has been done, with class Phrase
separated from ParserBase.
Changes from distribution 1.6 to 1.6.1:
Classes for Mouse tools: Generate, TestParser, TestPEG, TryParser,
and MakeRuntime are made public.
Option --mode=755 is added to the 'tar' command used to create
Mouse-1.6.1.tar. This should set the correct permission bits
when the archive is unpacked under Unix. (The original files
and directories are under Windows, and have different permissions,
not copied by 'tar'.)
Changes from distribution 1.5.1 to 1.6:
'TestParser' can optionally write its output in the form of
comma-separated values (CSV) that can be imported into a spreadsheet
application. This is controlled by option '-C'.
'TryParser' and 'TestParser' may optionally write timing information.
This is controlled by option '-t'.
The output format of both 'TryParser' and 'TestParser' is slightly
changed: file name precedes output from the parser under test.
Three bugs fixed:
- A semicolon generated after 'while' caused infinite
loop in an internal parsing procedure with a body
consisting of a '*+' expression only.
- Omitted '-S' option to Generate ignored the package specified
by '-r' when defining SemanticBase.
- 'rhsText' did not return empty string for empty range.
Changes from distribution 1.5 to 1.5.1:
Bug fix and clean-up due to Stephen P. Owens (
Fixed bug: after parser failure, printed a blank line if the error message
was intercepted by semantics and error info cleared using errClear().
Clean-up: removed unused variables and imports from six source files.
Changes from distribution 1.4 to 1.5:
Identity of rule that created a Phrase made available
to semantic procedures via helper methods 'rule' and 'isTerm'.
Bug fixed in the generated parser:
incorrect handling of ^[s] when s has length 1.
Build file provided for compiling the source and making the jars:
'build.xml' for Apache Ant. Invocation is described in a comment.
Changes from distribution 1.3.1 to 1.4:
Shorthand forms introduced for three frequently used
constructions. They are easier to read and have
a more efficient implementation.
- e1*+ e2. A shorthand for (!e2 e1)*e2:
iterate e1 until e2.
- e1++ e2. A shorthand for (!e2 e1)+e2:
iterate e1 at least once until e2.
- ^[s]. A shorthand for ![s]_:
consume character NOT appearing in string s.
Position of a Phrase in source text made available
to semantic procedures via helper method 'where'.
(Feature request 3298666.)
A slightly better formulation for messages
about failing predicate !e.
Changes from distribution 1.3a to 1.3.1:
Two bugs fixed in the generated parser:
- the error message for failing predicate &e was
'expected not e' instead of 'expected e'.
- an attempt to print error position at the end of file
resulted in an infinite loop.
One bug fixed in Generate:
- printed extra message 'expected : or end of input'
for syntax error after last semicolon.
Two errors corrected in the sample grammar for Java 1.6.
Sample grammar for Java 1.7 added.
Change from distribution 1.3 to 1.3a:
Revised version of user's manual.
Changes from distribution 1.2 to 1.3:
In the generated parser, do not check for success
of expressions that never fail.
Default character encoding is assumed for the input file
(version 1.2 assumed ISO 8859_1).
Can be changed by modifying 'runtime.SourceFile'.
The generated Java code uses ASCII encoding.
Helper method 'errMsg' returns a 'readable' string:
all characters outside the range 32-255 are represented
by Java escapes.
Improved diagnostics from TestPEG.
Minor bug fixes.
Sample grammars for Java and C are included in the package.
<?xml version="1.0" encoding="ISO-8859-1"?>
<!--
 * Stand in the distribution directory Mouse-1.7 and invoke Apache Ant:
 *   ant -Ddest=<destdir> <targets>
 * where <destdir> is name of the directory to contain the result.
 * This directory will be deleted if it exists and then re-created.
 * If you omit -D option, destdir is the subdirectory 'build' of Mouse-1.7.
 * <targets> are one or more of:
 *   full-jar    - build complete Mouse-1.7.jar
 *   runtime-jar - build Mouse-1.7.runtime.jar
 *   compile     - compile all source to subdirectory 'mouse'
 *                 of destdir. It is included in the preceding two.
 * Default is all of above.
 *
 * 2011-11-08 Created.
 * 2012-01-06 Updated for version 1.5.1.
 * 2013-04-15 Updated for version 1.6.
 * 2014-04-13 Updated for version 1.6.1.
 * 2015-07-25 Updated for version 1.7.
 * =========================================================================
-->
<project name="MakeMouse" basedir="." default="all" >
  <property name="dest" value="${basedir}/build" />
  <target name="all" depends="full-jar, runtime-jar" />
  <target name="full-jar" depends="compile, rtsource">
    <jar basedir="${dest}" destfile="${dest}/Mouse-1.7.jar" />
    <delete dir="${dest}/rtsource" />
  </target>
  <target name="runtime-jar" depends="compile">
    <jar basedir="${dest}/mouse/runtime"
         destfile="${dest}/Mouse-1.7.runtime.jar" />
  </target>
  <target name="compile" depends="init">
    <javac srcdir="${basedir}/src" destdir="${dest}" includeAntRuntime="no" />
  </target>
  <target name="rtsource" depends="rtdir" >
    <copy todir="${dest}/rtsource">
      <fileset dir="${basedir}/src/mouse/runtime" />
    </copy>
  </target>
  <target name="rtdir" depends="init">
    <mkdir dir="${dest}/rtsource"/>
  </target>
  <target name="init" depends="clean">
    <mkdir dir="${dest}"/>
  </target>
  <target name="clean">
    <delete dir="${dest}"/>
  </target>
</project>
\section{Appendix: The grammar of \Mouse\ PEG}
Grammar = Space (Rule/Skip)*+ EOT ;
Rule = Name EQUAL RuleRhs DiagName? SEMI ;
Skip = SEMI
/ _++ (SEMI/EOT) ;
RuleRhs = Sequence Actions (SLASH Sequence Actions)* ;
Choice = Sequence (SLASH Sequence)* ;
Sequence = Prefixed+ ;
Prefixed = PREFIX? Suffixed ;
Suffixed = Primary (UNTIL Primary / SUFFIX)? ;
Primary = Name
/ StringLit
/ Range
/ CharClass ;
Actions = OnSucc OnFail ;
OnSucc = (LWING AND? Name? RWING)? ;
OnFail = (TILDA LWING Name? RWING)? ;
Name = Letter (Letter / Digit)* Space ;
DiagName = "<" Char++ ">" Space ;
StringLit = ["] Char++ ["] Space ;
CharClass = ("[" / "^[") Char++ "]" Space ;
Range = "[" Char "-" Char "]" Space ;
Char = Escape
/ ^[\r\n\\] ;
Escape = "\\u" HexDigit HexDigit HexDigit HexDigit
/ "\\t"
/ "\\n"
/ "\\r"
/ !"\\u""\\"_ ;
Letter = [a-z] / [A-Z] ;
Digit = [0-9] ;
HexDigit = [0-9] / [a-f] / [A-F] ;
PREFIX = [&!] Space ;
SUFFIX = [?*+] Space ;
UNTIL = ("*+" / "++") Space ;
EQUAL = "=" Space ;
SEMI = ";" Space ;
SLASH = "/" Space ;
AND = "&" Space ;
LPAREN = "(" Space ;
RPAREN = ")" Space ;
LWING = "{" Space ;
RWING = "}" Space ;
TILDA = "~" Space ;
ANY = "_" Space ;
Space = ([ \r\n\t] / Comment)* ;
Comment = "//" _*+ EOL ;
EOL = [\r]? [\n] / !_ ;
EOT = !_ ;
\section{Appendix: Helper methods\label{helper}}
These methods provide access to the environment seen by a semantic action.
The following four methods
are inherited from \tx{mouse.runtime.Semantics}.\newline
They call back the parser to access the relevant \Phrase\ objects.
\item[\tx{Phrase }\textbf{lhs}\tx{()}]\upsp\newline
Returns the left-hand side object.
\item[\tx{int }\textbf{rhsSize}\tx{()}]\upsp\newline
Returns the number of right-hand side objects.
\item[\tx{Phrase }\textbf{rhs}\tx{(int i)}]\upsp\newline
Returns the $i$-th element on the right-hand side, $0\le i <\,$\tx{rhsSize()}.
\item[\tx{String }\textbf{rhsText}\tx{(int i,int j)}]\upsp\newline
Returns as one \tx{String} the text
represented by the right-hand side objects $i$ through $j-1$,\newline
where $0\leq i < j \le\,$\tx{rhsSize()}.
The following fifteen methods can be applied to a \Phrase\ object:
\item[\tx{void }\textbf{put}\tx{(Object v)}]\upsp \newline
Inserts $v$ as the semantic value of this \Phrase.
\item[\tx{Object }\textbf{get}\tx{()}]\upsp \newline
Returns the semantic value of this \Phrase.
\item[\tx{String }\textbf{text}\tx{()}]\upsp \newline
Returns the text represented by this \tx{Phrase}.
\item[\tx{char }\textbf{charAt}\tx{(int i)}]\upsp \newline
Returns the $i$-th character of the text represented by this \tx{Phrase}.
\item[\tx{String }\textbf{rule}\tx{()}]\upsp \newline
Returns name of the rule that created this \tx{Phrase}.
\item[\tx{boolean }\textbf{isA}\tx{(String name)}]\upsp \newline
Returns \tx{true} if this \tx{Phrase} was created by rule \textit{name}.
\item[\tx{boolean }\textbf{isTerm}\tx{()}]\upsp \newline
Returns \tx{true} if this \tx{Phrase} was created by a terminal.
\item[\tx{boolean }\textbf{isEmpty}\tx{()}]\upsp \newline
Returns \tx{true} if the text represented by this \tx{Phrase} is empty.
\item[\tx{String }\textbf{where}\tx{(int i)}]\upsp \newline
Returns a printable string describing where in the text being parsed\newline
to find the $i$-th character
of the text represented by this \tx{Phrase}.
\item[\tx{String }\textbf{errMsg}\tx{()}]\upsp \newline
Returns the error message contained in this \tx{Phrase}.
\item[\tx{void }\textbf{errClear}\tx{()}]\upsp \newline
Erases the error message contained in this \tx{Phrase}.
\item[\tx{void }\textbf{errAdd}\tx{(String expr,int i)}]\upsp \newline
Updates error message contained in this \tx{Phrase} with information\newline
that expression \textit{expr} failed at the i-th character
of text represented by this \tx{Phrase}.
\item[\tx{void }\textbf{actAdd}\tx{(Deferred act)}]\upsp \newline
Adds to this \Phrase\ the deferred action \tx{act}: a lambda expression implementing
the interface \tx{Deferred}.
\item[\tx{void }\textbf{actClear}\tx{()}]\upsp \newline
Removes all deferred actions saved in this \Phrase.
\item[\tx{void }\textbf{actExec}\tx{()}]\upsp \newline
Executes and removes all deferred actions saved in this \Phrase.
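The interplay of \tx{actAdd}, \tx{actClear} and \tx{actExec} can be illustrated with a minimal, self-contained sketch. It does not use the Mouse runtime: the nested \tx{Deferred} interface (with an assumed single method \tx{exec}) and the class \tx{Actions} are hypothetical stand-ins that only mimic the behavior described above.

```java
import java.util.ArrayList;
import java.util.List;

public class DeferredSketch {
    // Stand-in for the Deferred interface; the method name 'exec' is an assumption.
    interface Deferred { void exec(); }

    // Hypothetical stand-in for the part of a Phrase that stores deferred actions.
    static class Actions {
        private final List<Deferred> saved = new ArrayList<>();

        // Save an action without running it.
        void actAdd(Deferred act) { saved.add(act); }

        // Drop all saved actions, e.g. when the parser backtracks.
        void actClear() { saved.clear(); }

        // Run all saved actions in order, then forget them.
        void actExec() {
            List<Deferred> toRun = new ArrayList<>(saved);
            saved.clear();
            for (Deferred d : toRun) d.exec();
        }
    }

    // Returns the messages that were actually produced.
    public static List<String> demo() {
        List<String> log = new ArrayList<>();

        // An alternative that is later abandoned: its message is revoked.
        Actions abandoned = new Actions();
        abandoned.actAdd(() -> log.add("warning from abandoned alternative"));
        abandoned.actClear();
        abandoned.actExec();   // nothing left to run

        // An alternative that is accepted: its message is issued once.
        Actions accepted = new Actions();
        accepted.actAdd(() -> log.add("message from accepted alternative"));
        accepted.actExec();
        accepted.actExec();    // already executed and removed

        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Only the message from the accepted alternative ends up in the log, which is the point of deferring: a side effect is postponed until it is known that the phrase will not be undone by backtracking.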
\section{Appendix: Your parser class\label{DocPars}}
These are the methods you can apply to your generated parser.\newline
"\tx{Parser}" and
"\tx{Semantics}" are the names of your parser and semantics classes, respectively.
\item[\textbf{Parser}\texttt{()}]\upsp \newline
Parser constructor. Instantiates your parser and semantics,
connects semantics object to the parser,
and returns the resulting parser object.\dnsp
\item[\texttt{boolean }\textbf{parse}\texttt{(Source src)}]\upsp \newline
Parses input wrapped into a \tx{Source} object \textit{src}.\newline
Returns \tx{true} if the parse was successful, or \tx{false} otherwise.\dnsp
\item[\texttt{Semantics }\textbf{semantics}\texttt{()}]\upsp \newline
Returns the semantics object associated with the parser.\dnsp
\item[\texttt{void }\textbf{setTrace}\texttt{(String s)}]\upsp \newline
Assigns $s$ to the \tx{trace} field in semantics object.\dnsp
\item[\texttt{void }\textbf{setMemo}\texttt{(int n)}]\upsp \newline
Sets the amount of memoization to $n$, $0 \le n \le 9$.\newline
Can only be applied to a parser generated with option \tx{-M} or \tx{-T}
(see \tx{mouse.Generate} tool).
\section{Backtracking again\label{back2}}
The grammar just constructed has another non-$LL(1)$ choice
in addition to the choice between two kinds of number.
It is the choice between \tx{Store} and \tx{Print} in \Input:
they can both start with \tx{Name}.
If you enter something like "\tx{lambda * 7}",
the parser tries \tx{Store} first and processes "\tx{lambda }",
expecting to find an equal sign next.
Finding "\tx{*}" instead, the parser backtracks and tries \tx{Print}.
It eventually comes to process "\tx{lambda }" via \tx{Sum}, \tx{Product},
and \tx{Factor}.
To watch this activity, you may generate a test version of the parser using option \tx{-T},
as described in Section~\ref{back}, and try it.
A possible result is shown below.
java mouse.TestParser -PmyParser -d
> lambda = 12
62 calls: 31 ok, 30 failed, 1 backtracked.
10 rescanned.
backtrack length: max 2, average 2.0.
Backtracking, rescan, reuse:
procedure ok fail back resc reuse totbk maxbk at
------------- ----- ----- ----- ----- ----- ----- ----- --
Digits 2 2 0 2 0 0 0
Space 5 0 0 1 0 0 0
Factor_0 0 1 1 0 0 2 2 After 'lambda = '
[0-9] 4 4 0 5 0 0 0
" " 2 5 0 1 0 0 0
"." 0 2 0 1 0 0 0
> lambda * 7
87 calls: 41 ok, 44 failed, 2 backtracked.
22 rescanned.
backtrack length: max 7, average 4.0.
Backtracking, rescan, reuse:
procedure ok fail back resc reuse totbk maxbk at
------------- ----- ----- ----- ----- ----- ----- ----- --
Store 0 0 1 0 0 7 7 At start
Digits 2 4 0 3 0 0 0
Name 2 1 0 1 0 0 0
Space 6 0 0 2 0 0 0
Factor_0 0 2 1 0 0 1 1 After 'lambda * '
[0-9] 2 6 0 5 0 0 0
" " 3 6 0 3 0 0 0
[a-z] 12 3 0 7 0 0 0
"." 0 3 0 1 0 0 0
This is quite a lot of rescanning; you may try to see the effects of memoization,
which can reduce the number of rescans to 2 for each input.
\section{What about backtracking?\label{back}}
As you can easily see, \tx{Number} in the last grammar
does not satisfy the classical $LL(1)$ condition:
seeing a digit as the first character,
you do not know which alternative it belongs to.
Presented with "\tx{123}", the parser will start with
the first alternative of \tx{Number} ("fraction")
and process "\tx{123}" as its \Digits.
Not finding the decimal point, the parser will backtrack and try
the other alternative ("integer"),
again processing the same input as \Digits.
This re-processing of input is the price for not bothering
to make the grammar $LL(1)$.
The loss of performance caused by reprocessing is uninteresting
in this particular application.
However, circumventing $LL(1)$ in this way may cause a more serious problem.
The reason is the limited backtracking of PEG.
As indicated in Section~\ref{PEG},
if the parser successfully accepts the first alternative
of $e_1 / e_2$ and fails later on,
it will never return to try $e_2$.
As a result, some part of the language of $e_2$ may never be recognized.
Just imagine for a while that the two alternatives in \tx{Number}
are reversed.
Presented with "\tx{123.45}", \tx{Number} consumes "\tx{123}" as an "integer"
and returns to \tx{Sum}, reporting success.
The \tx{Sum} finds a point instead of \tx{AddOp},
and terminates prematurely.
The other alternative of \tx{Number} is never tried.
All fractional numbers starting with \Digits\ are hidden by greedy "integer".
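The effect of the order of alternatives can be tried out with a small hand-written recognizer. This is only a sketch under stated assumptions, not code generated by \Mouse: it assumes \tx{Sum = Number (AddOp Number)*} followed by end of input, \tx{AddOp} being "\tx{+}" or "\tx{-}" followed by \tx{Space}, and \tx{Space} being a run of blanks. The class name and the flag \tx{integerFirst} are hypothetical.

```java
public class OrderedChoiceSketch {
    private final String in;
    private final boolean integerFirst; // true = alternatives of Number reversed
    private int pos = 0;

    OrderedChoiceSketch(String in, boolean integerFirst) {
        this.in = in;
        this.integerFirst = integerFirst;
    }

    // Digits = [0-9]+
    private boolean digits() {
        int start = pos;
        while (pos < in.length() && in.charAt(pos) >= '0' && in.charAt(pos) <= '9') pos++;
        return pos > start;
    }

    // Space = " "*  (never fails)
    private void space() {
        while (pos < in.length() && in.charAt(pos) == ' ') pos++;
    }

    // "fraction": Digits? "." Digits Space
    private boolean fraction() {
        int start = pos;
        digits(); // optional part
        if (pos < in.length() && in.charAt(pos) == '.') {
            pos++;
            if (digits()) { space(); return true; }
        }
        pos = start; // backtrack
        return false;
    }

    // "integer": Digits Space
    private boolean integer() {
        int start = pos;
        if (digits()) { space(); return true; }
        pos = start;
        return false;
    }

    // Ordered choice: once the first alternative succeeds, the second is never tried.
    private boolean number() {
        return integerFirst ? (integer() || fraction()) : (fraction() || integer());
    }

    // Sum = Number (AddOp Number)*, then end of input (assumed).
    private boolean sum() {
        if (!number()) return false;
        while (true) {
            int start = pos;
            if (pos < in.length() && (in.charAt(pos) == '+' || in.charAt(pos) == '-')) {
                pos++;
                space();
                if (number()) continue;
            }
            pos = start;
            break;
        }
        return pos == in.length();
    }

    public static boolean accepts(String text, boolean integerFirst) {
        return new OrderedChoiceSketch(text, integerFirst).sum();
    }

    public static void main(String[] args) {
        System.out.println(accepts("123.45", false)); // fraction tried first: true
        System.out.println(accepts("123.45", true));  // integer tried first: false
    }
}
```

With the alternatives reversed, \tx{integer()} succeeds on "\tx{123}", so \tx{fraction()} is never called at that position, and the remaining "\tx{.45}" makes the whole parse fail.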
Your rule for \Number\ is almost identical to the following production
in Extended Backus-Naur Form (EBNF):
Number ::= Digits? "." Digits Space | Digits Space
and you certainly expect that the language accepted by your PEG rule
is exactly the one defined by this production.
This is indeed true in this case.
A sufficient condition
for PEG rule $A = e_1 / e_2$ defining the same language
as EBNF production $A ::= e_1 | e_2$ is:
\sL(e_1) \cap \Pref(\sL(e_2)\Tail(A)) = \emptyset\,, \tag{*}
where $\sL(e)$ is the language defined by $e$ (in EBNF),
$\Pref(\sL)$ is the set of all prefixes of strings in $\sL$,
and $\Tail(A)$ is the set of all strings that can follow $A$ in correct input\footnote{
This condition was correctly guessed by Schmitz in \cite{Schmitz:2006}.
It is formally proved in \cite{Redz:2013:FI}
using an approach invented by Medeiros \cite{Medeiros:PhD,Medeiros:2014}.}.
You can see that in this case $\Tail(A)$ is
$\sL($\stx{(AddOp Number)*}$)$, so the condition becomes:
\sL($\stx{Digits?\;\,"." Digits Space}$)\; \cap \;\Pref(\sL($\stx{Digits Space (AddOp Number)*}$)) = \emptyset\,.
It is obviously satisfied as every string to the left of the intersection operator $\,\cap\,$
contains a decimal point
and none of them to the right of it does.
If you reverse the alternatives in \Number,
the condition becomes:
\sL($\stx{Digits Space}$)\; \cap\; \Pref(\sL($\stx{Digits?\;\,"." Digits Space (AddOp Number)*}$)) = \emptyset\,,
which obviously does not hold.
Automatic checking of (*) is complex,
and \Mouse\ does not attempt it.
However, you can often verify it by inspection, as was done above.
You can also find some hints at verification methods in \cite{Redz:2013:FI,Redz:2014:FI},
and you should note that (*) always holds if $A$ satisfies $LL(1)$.
To watch the backtracking activity of your parser,
you may generate an instrumented version of it.
To do this, type these commands:
java mouse.Generate -G myGrammar.txt -P myParser -S mySemantics -T
The option \tx{-T} instructs the Generator to construct a "test version".
(You may choose another name for this version if you wish;
the semantics class remains the same.)
Invoking the test version with \tx{mouse.TryParser}
produces the same results as before.
In order to exploit the instrumentation, you have to use \tx{mouse.TestParser};
the session may look like this:
java mouse.TestParser -P myParser
> 123 + 4567
51 calls: 34 ok, 15 failed, 2 backtracked.
11 rescanned.
backtrack length: max 4, average 3.5.
This output tells you that to process your input "\tx{123 + 4567}",
the parser executed 51 calls to parsing procedures,
of which 34 succeeded, 15 failed, and two backtracked.
(We treat here the services that implement terminals as parsing procedures.)
As expected, the parser backtracked 3 characters on the first \tx{Number}
and 4 on the second, so the maximum backtrack length was 4
and the average backtrack length was 3.5.
You can also see that 11 of the procedure calls were "re-scans":
the same procedure called again at the same input position.
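The backtrack figures follow directly from the individual backtrack lengths. The sketch below recomputes them for the two lengths mentioned above (3 and 4); the class and method names are hypothetical and are not part of \Mouse.

```java
import java.util.Arrays;

public class BacktrackStats {
    // Sum of all backtrack lengths.
    public static int total(int[] lengths) {
        return Arrays.stream(lengths).sum();
    }

    // Length of the longest backtrack, 0 if there was none.
    public static int max(int[] lengths) {
        return Arrays.stream(lengths).max().orElse(0);
    }

    // Average backtrack length, 0.0 if there was none.
    public static double average(int[] lengths) {
        return lengths.length == 0 ? 0.0 : (double) total(lengths) / lengths.length;
    }

    public static void main(String[] args) {
        int[] lengths = {3, 4}; // first and second Number in "123 + 4567"
        System.out.println("total " + total(lengths)
                + ", max " + max(lengths)
                + ", average " + average(lengths));
        // prints "total 7, max 4, average 3.5"
    }
}
```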
You can get more detail by specifying the option~\tx{-d}:
java mouse.TestParser -P myParser -d
> 123 + 4567
51 calls: 34 ok, 15 failed, 2 backtracked.
11 rescanned.
backtrack length: max 4, average 3.5.
Backtracking, rescan, reuse:
procedure ok fail back resc reuse totbk maxbk at
------------- ----- ----- ----- ----- ----- ----- ----- --
Digits 4 0 0 2 0 0 0
Number_0 0 0 2 0 0 7 4 After '123 + '
[0-9] 14 4 0 9 0 0 0
You see here statistics for individual procedures that were involved
in backtracking and rescanning.
\verb#Number_0# is the internal procedure for the first alternative of \Number.
As you can guess, "\tx{totbk}" stands for total backtrack length
and "\tx{maxbk}" for length of the longest backtrack;
"\tx{at}" tells where this longest backtrack occurred.
The meaning of "\tx{reuse}" will be clear in a short while.
@String{LNCS = {Lecture Notes in Comp. Sci.}}
@String{Springer = {Springer-Verlag}}
% -----------------------------------
author = {Alfred V. Aho and Jeffrey D. Ullman},
title = {The Theory of Parsing, Translation and Compiling,
Vol.~I, Parsing},
publisher = {Prentice Hall},
year = {1972}}
% -----------------------------------
author = {Alfred V. Aho and
Ravi Sethi and
Jeffrey D. Ullman},
title = {Compilers: Principles, Techniques, and Tools},
publisher = {Addison-Wesley},
year = {1987}}
% -----------------------------------
author = {Alexander Birman},
title = {The TMG Recognition Schema},
school = {Princeton University},
month = {February},
year = {1970}}
% -----------------------------------
author = {Alexander Birman and Jeffrey D. Ullman},
title = {Parsing Algorithms with Backtrack},
journal = {Information and Control},
year = {1973},
volume = {23},
pages = {1--34}}
% -----------------------------------
author = {P.A. Brooker and D. Morris},
title = {Some Proposals for the Realization of a Certain Assembly Program},
journal = {The Computer Journal},
year = {1961},
volume = {3},
number = {4},
pages = {220--231}}
% -----------------------------------
author = {P.A. Brooker and D. Morris},
title = {A General Translation Program
for Phrase Structure Languages},
journal = JACM,
year = {1962},
volume = {9},
number = {1},
pages = {1--10}}
% -----------------------------------
author = {Bryan Ford},
title = {Packrat Parsing:
a Practical Linear-Time Algorithm with Backtracking},
school = {Massachusetts Institute of Technology},
month = {September},
year = {2002},
note = {\newline\url{}}}
% -----------------------------------
author = {Ford, Bryan},
title = {Packrat Parsing: Simple, Powerful, Lazy, Linear Time},
booktitle = {Proceedings of the Seventh ACM SIGPLAN International Conference on Functional Programming},
year = {2002},
address = {Pittsburgh, PA, USA},
publisher = {ACM},
pages = {36--47},
note = {\newline\url{}}}
% -----------------------------------
author = {Bryan Ford},
title = {Parsing Expression Grammars:
A Recognition-Based Syntactic Foundation},
booktitle = {Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Principles
of Programming Languages},
editor = {Neil D. Jones and Xavier Leroy},
address = {Venice, Italy},
month = {14--16~January},
year = {2004},
publisher = {ACM},
pages = {111--122},
note = {\newline\url{}}}
% -----------------------------------
author = {Bryan Ford},
title = {The Packrat Parsing and {P}arsing {E}xpression {G}rammars Page},
howpublished = {\newline\url{}},
note = {Accessed 2015-07-15}}
% -----------------------------------
author = {Robert Grimm},
title = {Practical Packrat Parsing},
institution = {Dept. of Computer Science, New York University},
number = {TR2004-854},
month = {March},
year = {2004},
library = {yes}}
%url = {\tts{}}}