Objective Modula-2 wrote:
Frank Heckenbach wrote:
So why do you claim you need a debugger for understanding?
There is a difference between understanding how to feed input into a tool and understanding what the output coming out of the tool is doing.
There are a number of reasons why it is generally a good idea for contributors to be able to figure out what the code is doing, not only knowing how it is used. One example where this is important in a compiler is implementing error handling and recovery, so that meaningful error messages can be generated, and so that phantom errors are avoided, i.e. errors that aren't really there but only appear as a result of a previous error that threw the parser off.
Still, a debugger would be too low-level. A debugger is useful when you want to examine what happens on the assembler level, and (already in a more limited way) on the next higher level, here C or C++ (and potentially later, generated Pascal).
But that's not the level where you understand what's actually happening in the grammar, so unless you want to debug the actual parsing algorithm (which is "fixed", and I've never had the need to examine myself), the debugger is not well suited.
To understand grammar issues that you need to "debug" at runtime (i.e., in contrast to conflicts etc., which appear at build time and can (and should) be statically analyzed), it's more useful to watch which reductions the parser makes etc., e.g. by using YYDEBUG.
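To illustrate the YYDEBUG point, here is a minimal sketch (the file layout is my illustration, not taken from GPC's sources) of how one typically enables the reduction trace in a Bison parser:

```yacc
/* In the grammar file: ask Bison to generate trace code. */
%debug

%%
/* ... grammar rules ... */
%%

int main (void)
{
  yydebug = 1;  /* make yyparse() print shifts, reductions and states to stderr */
  return yyparse ();
}
```

With this, every shift and reduction the parser performs is logged, which for grammar issues is usually far more informative than single-stepping through the generated tables in a debugger.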
I can understand that if you're only used to RD parsers you might consider the position in the parser code as *the* information to track parser issues, and I suppose that's why you're used to using debuggers in such cases. With table-driven parsers, it's just different. The code position is mostly irrelevant, but the tokens on the stack and the reductions performed are the important things, so different methods are suitable to track them (e.g., YYDEBUG output rather than debuggers). One can argue which is "better", and if you want to engage in this discussion, I'd point out that not having to use a debugger is an indication to me of being higher-level.
The reality is that you have used this tool for 10 years or longer and the base from which to possibly recruit one or more new maintainers is unlikely to have this experience.
Except, as I wrote initially, the grammar is exactly one of those areas where we wouldn't need to spend much work, because it exists and works already. Even if the semantic actions have to be rewritten, and even if the Bison parser code has to be translated to Pascal at some point, the grammar rules and their logic are not affected.
You will find that I was pretty much the only one in this entire discussion who made a plea to not change anything but find new maintainers to keep GNU Pascal going as a GCC front end, then let the new maintainers decide if they want to make any changes to the implementation or not.
However, these are two different issues, the parser (which is like the front of the frontend) and the backend. I, as stated, prefer to keep the parser and replace the backend.
The two things that came up the most were 1) can the compiler be written in Pascal and 2) can the compiler target LLVM without having to link directly to any API that when changed may break it.
Both are not incompatible with keeping Bison (except that for 1) the parsing algorithm in Bison would need to be ported to Pascal).
I do understand, however, that your comments are geared towards a different scenario that doesn't necessarily involve new recruits, as you seem to be mostly interested in simply adding a C++ target to the existing compiler and leaving the GCC back end alone until perhaps some day a new maintainer shows up who might want to update it.
Then you got me wrong. I suggested rewriting all TREE_NODE related parts of the current compiler, which would completely break compatibility with the GCC backend.
Moreover, in recent years the trend has been to move away from yacc/bison and move towards RD and PEG. There is an entire generation of newer tools that build RD parsers, both conventional and memoising, such as ANTLR for example.
How powerful are they? GPC with its mix of dialects needs even more than LALR(1), i.e. Bison's GLR parser.
ANTLR does LL(*) which means infinite (as in arbitrary) lookahead.
No, it doesn't. It means DFA-based lookahead (see below for an actual case where this is important).
Note that Clang uses a handwritten RD parser even for the C++ implementation which I believe is based on LL(*) too. The often recited statement that LL is not powerful enough to implement "real languages out there" is just a myth.
It seems hard to find concrete information on this topic.
http://www.codecommit.com/blog/scala/unveiling-the-mysteries-of-gll-part-1 says: "It's difficult to formally analyze the non-commutative LL(*) techniques, and so theorists tend to be a little unclear as to exactly how powerful the techniques are with respect to more well defined classes like LL and LR. However, it is generally assumed that non-commutative LL(*) is strictly more powerful than LL(k) but likely less powerful than LR(k)"
http://webcache.googleusercontent.com/search?q=cache:DxL1IM-pb_0J:www.antlr.... (by Terence Parr, your witness) says: "LR(k) even with k=1 is generally more powerful than LL(*) or at least more efficient for same grammar, but there is no strict ordering".
And GPC's grammar isn't even LALR(1) (we need GLR for a few problematic cases), and therefore likely also not LR(1). (I'm not saying that GPC is the archetypical "real language out there". It clearly isn't, because it mixes wildly differing dialects. But it's the language that this discussion is ultimately about.)
An actual example from GPC's grammar:
Dispose (<expression>, <discriminant-list>)                        (EP)
vs.
Dispose (<expression>, <destructor-name> <destructor-parameters>)  (BP)
Since expressions are, of course, not a regular language (i.e. not recognizable by a DFA), LL(*) can't parse this, at least not without refactoring the grammar (which destroys the "more natural grammar" argument usually given in support of LL -- which I never actually believed since I saw the necessary workarounds to avoid left-recursion for such easy cases as left-associative operators like "-" and "/").
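To make the left-recursion point concrete, here is a minimal sketch (my own toy code, not GPC's) of the usual RD workaround for a left-associative operator: the natural left-recursive rule `expr: expr '-' NUM | NUM` cannot be coded directly as a recursive procedure (it would recurse forever), so it has to be rewritten as a loop:

```python
# Minimal recursive-descent sketch (illustrative toy code, not GPC's parser):
# the grammar rule  expr: expr '-' NUM | NUM  is left-recursive and cannot
# be transcribed directly into a recursive procedure; the standard
# workaround is to turn the recursion into iteration.

def tokenize(s):
    # crude tokenizer: numbers and '-' only
    return s.replace('-', ' - ').split()

def parse_expr(tokens):
    pos = 0
    value = int(tokens[pos])                  # first operand
    pos += 1
    while pos < len(tokens) and tokens[pos] == '-':
        pos += 1
        value = value - int(tokens[pos])      # fold left-associatively
        pos += 1
    return value

print(parse_expr(tokenize("8-3-2")))  # (8-3)-2 = 3, not 8-(3-2) = 7
```

The loop is exactly the kind of transformation that makes the hand-written parser diverge from the "natural" grammar, whereas the left-recursive rule can be fed to an LR generator as-is.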
The Bison specification in this case, however, is quite natural (which doesn't prove that it's more natural in all cases, but disproves that LL grammars are generally more natural):
builtin_procedure_statement:
    [...]
  | p_Dispose '(' expression ',' actual_parameter_list ')' %dprec 1
  | p_Dispose '(' expression ',' new_identifier builtin_actual_parameter_list ')' %dprec 2
In fact, this is one of the few cases that's even ambiguous, since <discriminant-list> and <destructor-name> can derive an identifier and <destructor-parameters> is nullable. That's why we need "%dprec" here (GLR parser) and have to disambiguate in the action.
These are among the most difficult problems (in fact, I chose one of those for the example so I could simply search for "%dprec" in the source(1)). I suppose there are other cases that are not ambiguous and don't need GLR, but still are not [naturally] LL(*), but I don't know offhand how to quickly identify them. Note that the ambiguity is just an added complication -- even without it, i.e. if actual_parameter_list didn't have the identifier token in its FIRST set, the above case wouldn't be LL(*) in the natural way.
(1) As a side note, this part of the grammar was written several years ago -- I don't even remember whether by Waldek or by me, and I wouldn't have recalled it offhand as a problematic case. Yet when I found it searching for "%dprec", I had no problem understanding the rules or seeing why there was an ambiguity, just looking at them briefly. Another piece of evidence against the "LR grammars are hard to read" argument.
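The FIRST-set problem can be shown mechanically. Here is a sketch with an assumed toy grammar mirroring the Dispose case (the rule names are mine, not GPC's actual rules): an identifier can start both alternatives, so one token of lookahead cannot choose between them.

```python
# Toy grammar sketch (assumed, not GPC's real rules): after "Dispose(e,",
# the parser must choose between an actual-parameter list and a
# destructor name -- but an identifier can begin either one.

GRAMMAR = {
    'alt1': [['actual_parameter_list']],            # EP form
    'alt2': [['IDENT', 'builtin_params']],          # BP form, destructor name
    'actual_parameter_list': [['IDENT'], ['NUM']],  # a parameter may be an identifier
    'builtin_params': [[]],                         # nullable
}

def first(symbol, grammar):
    """FIRST set of a symbol; None marks derivability of the empty string."""
    if symbol not in grammar:          # terminal
        return {symbol}
    result = set()
    for production in grammar[symbol]:
        for sym in production:
            s = first(sym, grammar)
            result |= s - {None}
            if None not in s:
                break
        else:
            result.add(None)           # whole production can derive empty
    return result

overlap = first('alt1', GRAMMAR) & first('alt2', GRAMMAR)
print(overlap)  # {'IDENT'} -- both alternatives can start with an identifier
```

The non-empty intersection is precisely the LL conflict; an LR parser, by contrast, can postpone the decision until after it has seen more of the input.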
The benefits of RD parsers are real. People smarter than you and me will confirm that. Niklaus Wirth is a strong proponent of hand-coded RD parsers. Moessenboeck changed his seminal work (COCO) from LALR to LL. Tom Pittman has also been on record in favour of RD.
And there are strong proponents amongst today's generation of highly acclaimed scholars, for example Terence Parr and Chris Lattner. Chris Lattner has just received an ACM award for his work on Clang and LLVM. So you can make fun of me for my preference of hand-written RD all day long, but I doubt you have enough clout to make fun of Chris Lattner, who is also an outspoken proponent of hand-written RD parsers.
How many people do you want me to quote in favour of LR? Or should we stop name dropping and discuss actual merits?
I did some googling and found surprisingly little on the subject -- perhaps I used the wrong search terms, in which case you might give me some pointers (after all, it's your argument, so it's up to you to back it up). I won't make fun of something or someone I know nothing about -- I make fun of Borland Pascal because I actually know quite a bit about it.
What I found, apart from the above two references is http://bit.ly/9KILMe (the link points to "Compiler Construction - The Art of Niklaus Wirth" by Mössenböck, page 60, at books.google.com; I think the link will expire soon, but you can search by the title):
"Wirth always designed his languages so that recursive descent parsing could be applied" (which, according to the previous paragraph, he takes to mean LL(1), which is actually more restrictive than what you talk about). I absolutely agree with this when designing a language from scratch. But that's not the situation we're in with GPC; we have an existing language (mix) that, if anything, will only be extended, so "it would be nice if it were LL(1)" doesn't help, since it isn't so.
"If the designers of C were forced to make their language LL(1) it would have probably become more readable." The problem here is the "forced" bit. Writing a manual parser without formal verification just doesn't do this. Tellingly, the very next sentence reads: "Unfortunately even Wirth's languages contain a few constructs which are not LL(1)." So, apparently he wasn't quite forced to make it LL(1), he just tried to (and, AFAIK, succeeded mostly, but not entirely).
Another reason given for RD parsing is that it's (supposed to be) faster. I'm not convinced that it is, and even if so, it may be negligible. For comparison, on the previous page in the same book it says: "... since the scanner is one of the most time consuming parts of a compiler".
That might have been true in the 1970s, but not today. If anything, the lexer is one of the *least* time-consuming parts of GPC. Yes, it has to touch every character of the program (whereas the parser already works on tokens), but it does so in guaranteed linear time, and even with the overhead of storing locations etc., it simply doesn't take long. (Anybody who's worked with text files recently can confirm that the time it takes to process files of the typical length of source files is negligible, if the processing itself is not complicated, which it isn't in the scanner.) That's even more true in Pascal, compared to C, C++ and other languages which, due to includes, have to scan and parse the same declarations again and again (unless using precompiled headers).
Of course, the most time-consuming part of GPC is the optimizer. Wirth's opinions on optimizers, stated a few pages before, are also outdated IMHO. Modern processor architectures allow many optimization possibilities that are too complex for humans to do (to a non-trivial extent). Different architectures have different characteristics, and programmers should not be burdened with having to know about and care for them. "Generate good code from the beginning, don't generate bad code and optimize later" is in direct contrast to SSA, which seems to have a wide consensus among compiler designers today, including GCC and LLVM. With all due respect, I think this is a case of straying too far outside of his area of expertise. He was a brilliant language designer, but not necessarily so good at code generation. (I'm neither, but I don't claim to be, nor do I try to prescribe to optimizer writers how to do their jobs.)
People who write RD parsers by hand generally calculate FIRST and FOLLOW sets and prove that their grammar contains no ambiguities. It seems you have run into somebody who didn't do that, but that doesn't mean that this is how it's done.
Borland apparently didn't prove their grammars (because their grammar did contain ambiguities).
Well, shame on them.
A year ago we had an Indian or Bangladeshi neighbour who always parked his bicycle in my spot which got me into trouble with the landlord because I had to bring my bicycle into the building. It doesn't mean all Indians/Bangladeshis do this. It was just this one guy and he was wrong.
So far it's 1-3 WRT Pascalish/Wirthian languages (Borland, FPC, and the quote about Wirth above vs. OM2 if I assume it does it properly). Other compiler authors (Scott?) may chime in, but up to now I take your "this is how it's done" to mean "this is how it should be done ideally".
I still don't quite see the point. With RD, you specify the grammar in a formal way and write the parser manually, and it's verified automatically. With Bison you specify the grammar formally, and the parser is generated and verified automatically.
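The kind of automatic verification meant here can be sketched in a few lines. This is my simplified illustration (toy grammar, no nullable handling): the core LL(1) condition is that for every pair of alternatives A -> alpha | beta, FIRST(alpha) and FIRST(beta) must be disjoint.

```python
# Simplified LL(1) conflict check (illustrative sketch; a real checker
# would also handle nullable symbols and FOLLOW sets).

GRAMMAR = {
    'stmt': [['IF', 'expr', 'THEN', 'stmt'], ['IDENT', 'ASSIGN', 'expr']],
    'expr': [['IDENT'], ['NUM']],
}

def first_of_seq(seq, grammar):
    """FIRST set of a symbol sequence (ignores nullability for brevity)."""
    if not seq:
        return set()
    sym = seq[0]
    if sym not in grammar:        # terminal
        return {sym}
    out = set()
    for prod in grammar[sym]:
        out |= first_of_seq(prod, grammar)
    return out

def ll1_conflicts(grammar):
    """Report every pair of alternatives whose FIRST sets overlap."""
    conflicts = []
    for nt, prods in grammar.items():
        for i in range(len(prods)):
            for j in range(i + 1, len(prods)):
                clash = first_of_seq(prods[i], grammar) & first_of_seq(prods[j], grammar)
                if clash:
                    conflicts.append((nt, sorted(clash)))
    return conflicts

print(ll1_conflicts(GRAMMAR))  # [] -- this toy grammar passes the check
```

Whether this check is performed by a separate tool before hand-coding the parser, or implicitly by a generator like Bison rejecting a conflicting grammar, the grammar has to be written down formally either way.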
If we had wanted to use a generator, we'd have either used COCO/R or ANTLR, both of which generate human readable RD parsers in several output languages. COCO can generate output in Pascal, Modula-2 and Oberon and others. ANTLR can generate output in Java, C, C++, Python, Oberon and others.
However, our compiler is meant to be an entirely self-contained bootstrap kit. We wanted to make it as easy as possible for anyone to bootstrap the compiler anywhere without having to worry what libraries or tools might be there and whether or not they are up to date. Our bootstrap compiler has no dependencies other than a C compiler and stdio/stddef.
Well, this is also possible with a generator, especially if the generated code is not target-dependent (as is the case with Bison and Flex). Look at the GPC source distribution. Bison is only required if you change the grammar (where in your case you also need the verification tool IIUYC).
The former seems more work to me.
Appearances can be deceiving. The time spent coding the parser is only a fraction of the work you have to do on things a generator won't do for you anyway.
So what are those things that a generator won't do for you, but a RD verifier (or what?) will? It's got to be something major, so that two steps (writing the parser and specifying the grammar for the verifier) become easier than one (writing the grammar for the parser generator), so what is it?
Frank