Objective Modula-2 wrote:
Frank Heckenbach wrote:
So why do you claim you need a debugger for understanding?
There is a difference between understanding how to feed input into a tool and understanding what the output coming out of the tool is doing.
There are a number of reasons why it is generally a good idea for contributors to be able to figure out what the code is doing, not only knowing how it is used. One example where this is important in a compiler is implementing error handling and recovery, so that meaningful error messages can be generated, and so that phantom errors are avoided, i.e. errors that aren't really there but only appear as a result of a previous error that threw the parser off.
Still, a debugger would be too low-level. A debugger is useful when you want to examine what happens on the assembler level, and (already in a more limited way) on the next higher level, here C or C++ (and potentially later, generated Pascal).
But that's not the level where you understand what's actually happening in the grammar, so unless you want to debug the actual parsing algorithm (which is "fixed", and I've never had the need to examine myself), the debugger is not well suited.
To understand grammar issues that you need to "debug" at runtime (i.e., in contrast to conflicts etc., which appear at build time and can (and should) be statically analyzed), it's more useful to watch which reductions the parser makes etc., e.g. by using YYDEBUG.
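To illustrate the YYDEBUG point, here is a minimal sketch (the file layout is my illustration, not taken from GPC's sources) of how one typically enables the reduction trace in a Bison parser:

```yacc
/* In the grammar file: ask Bison to generate trace code. */
%debug

%%
/* ... grammar rules ... */
%%

int main (void)
{
  yydebug = 1;  /* make yyparse() print shifts, reductions and states to stderr */
  return yyparse ();
}
```

With this, every shift and reduction the parser performs is logged, which for grammar issues is usually far more informative than single-stepping through the generated tables in a debugger.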
I can understand that if you're only used to RD parsers you might consider the position in the parser code as *the* information to track parser issues, and I suppose that's why you're used to using debuggers in such cases. With table-driven parsers, it's just different. The code position is mostly irrelevant, but the tokens on the stack and the reductions performed are the important things, so different methods are suitable to track them (e.g., YYDEBUG output rather than debuggers). One can argue which is "better", and if you want to engage in this discussion, I'd point out that not having to use a debugger is an indication to me of being higher-level.
The reality is that you have used this tool for 10 years or longer and the base from which to possibly recruit one or more new maintainers is unlikely to have this experience.
Except, as I wrote initially, the grammar is exactly one of those areas where we wouldn't need to spend much work, because it exists and works already. Even if the semantic actions have to be rewritten, and even if the Bison parser code has to be translated to Pascal at some point, the grammar rules and their logic are not affected.
You will find that I was pretty much the only one in this entire discussion who made a plea to not change anything but find new maintainers to keep GNU Pascal going as a GCC front end, then let the new maintainers decide if they want to make any changes to the implementation or not.
However, these are two different issues, the parser (which is like the front of the frontend) and the backend. I, as stated, prefer to keep the parser and replace the backend.
The two things that came up the most were 1) can the compiler be written in Pascal and 2) can the compiler target LLVM without having to link directly to any API that when changed may break it.
Both are not incompatible with keeping Bison (except that for 1) the parsing algorithm in Bison would need to be ported to Pascal).
I do understand, however, that your comments are geared towards a different scenario that doesn't necessarily involve new recruits, as you seem to be mostly interested in simply adding a C++ target to the existing compiler and leaving the GCC back end alone until perhaps some day a new maintainer shows up who might want to update it.
Then you got me wrong. I suggested rewriting all TREE_NODE related parts of the current compiler, which would completely break compatibility with the GCC backend.
Moreover, in recent years the trend has been to move away from yacc/bison and move towards RD and PEG. There is an entire generation of newer tools that build RD parsers, both conventional and memoising, such as ANTLR for example.
How powerful are they? GPC with its mix of dialects needs even more than LALR(1), i.e. Bison's GLR parser.
ANTLR does LL(*) which means infinite (as in arbitrary) lookahead.
No, it doesn't. It means DFA-based lookahead (see below for an actual case where this is important).
Note that Clang uses a handwritten RD parser even for the C++ implementation which I believe is based on LL(*) too. The often recited statement that LL is not powerful enough to implement "real languages out there" is just a myth.
It seems hard to find concrete information on this topic.
http://www.codecommit.com/blog/scala/unveiling-the-mysteries-of-gll-part-1 says: "It's difficult to formally analyze the non-commutative LL(*) techniques, and so theorists tend to be a little unclear as to exactly how powerful the techniques are with respect to more well defined classes like LL and LR. However, it is generally assumed that non-commutative LL(*) is strictly more powerful than LL(k) but likely less powerful than LR(k)"
http://webcache.googleusercontent.com/search?q=cache:DxL1IM-pb_0J:www.antlr.... (by Terence Parr, your witness) says: "LR(k) even with k=1 is generally more powerful than LL(*) or at least more efficient for same grammar, but there is no strict ordering".
And GPC's grammar isn't even LALR(1) (we need GLR for a few problematic cases), and therefore likely also not LR(1). (I'm not saying that GPC is the archetypical "real language out there". It clearly isn't, because it mixes wildly differing dialects. But it's the language that this discussion is ultimately about.)
An actual example from GPC's grammar:
Dispose (<expression>, <discriminant-list>)                        (EP)
vs.
Dispose (<expression>, <destructor-name> <destructor-parameters>)  (BP)
Since expressions are, of course, not a regular language (i.e. not recognizable by a DFA), LL(*) can't parse this, at least not without refactoring the grammar (which destroys the "more natural grammar" argument usually given in support of LL -- which I never actually believed since I saw the necessary workarounds to avoid left-recursion for such easy cases as left-associative operators like "-" and "/").
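To make the left-recursion point concrete, here is a minimal sketch (my own toy code, not GPC's) of the usual RD workaround for a left-associative operator: the natural left-recursive rule `expr: expr '-' NUM | NUM` cannot be coded directly as a recursive procedure (it would recurse forever), so it has to be rewritten as a loop:

```python
# Minimal recursive-descent sketch (illustrative toy code, not GPC's parser):
# the grammar rule  expr: expr '-' NUM | NUM  is left-recursive and cannot
# be transcribed directly into a recursive procedure; the standard
# workaround is to turn the recursion into iteration.

def tokenize(s):
    # crude tokenizer: numbers and '-' only
    return s.replace('-', ' - ').split()

def parse_expr(tokens):
    pos = 0
    value = int(tokens[pos])                  # first operand
    pos += 1
    while pos < len(tokens) and tokens[pos] == '-':
        pos += 1
        value = value - int(tokens[pos])      # fold left-associatively
        pos += 1
    return value

print(parse_expr(tokenize("8-3-2")))  # (8-3)-2 = 3, not 8-(3-2) = 7
```

The loop is exactly the kind of transformation that makes the hand-written parser diverge from the "natural" grammar, whereas the left-recursive rule can be fed to an LR generator as-is.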
The Bison specification in this case, however, is quite natural (which doesn't prove that it's more natural in all cases, but disproves that LL grammars are generally more natural):
builtin_procedure_statement:
    [...]
  | p_Dispose '(' expression ',' actual_parameter_list ')' %dprec 1
  | p_Dispose '(' expression ',' new_identifier builtin_actual_parameter_list ')' %dprec 2
In fact, this is one of the few cases that's even ambiguous, since <discriminant-list> and <destructor-name> can derive an identifier and <destructor-parameters> is nullable. That's why we need "%dprec" here (GLR parser) and have to disambiguate in the action.
These are among the most difficult problems (in fact, I chose one of those for the example so I could simply search for "%dprec" in the source(1)). I suppose there are other cases that are not ambiguous and don't need GLR, but still are not [naturally] LL(*), but I don't know offhand how to quickly identify them. Note that the ambiguity is just an added complication -- even without it, i.e. if actual_parameter_list didn't have the identifier token in its FIRST set, the above case wouldn't be LL(*) in the natural way.
(1) As a side note, this part of the grammar was written several years ago -- I don't even remember whether by Waldek or by me, and I wouldn't have recalled it offhand as a problematic case. Yet when I found it searching for "%dprec", I had no problem understanding the rules or seeing why there was an ambiguity, just looking at them briefly. Another piece of evidence against the "LR grammars are hard to read" argument.
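The FIRST-set problem can be shown mechanically. Here is a sketch with an assumed toy grammar mirroring the Dispose case (the rule names are mine, not GPC's actual rules): an identifier can start both alternatives, so one token of lookahead cannot choose between them.

```python
# Toy grammar sketch (assumed, not GPC's real rules): after "Dispose(e,",
# the parser must choose between an actual-parameter list and a
# destructor name -- but an identifier can begin either one.

GRAMMAR = {
    'alt1': [['actual_parameter_list']],            # EP form
    'alt2': [['IDENT', 'builtin_params']],          # BP form, destructor name
    'actual_parameter_list': [['IDENT'], ['NUM']],  # a parameter may be an identifier
    'builtin_params': [[]],                         # nullable
}

def first(symbol, grammar):
    """FIRST set of a symbol; None marks derivability of the empty string."""
    if symbol not in grammar:          # terminal
        return {symbol}
    result = set()
    for production in grammar[symbol]:
        for sym in production:
            s = first(sym, grammar)
            result |= s - {None}
            if None not in s:
                break
        else:
            result.add(None)           # whole production can derive empty
    return result

overlap = first('alt1', GRAMMAR) & first('alt2', GRAMMAR)
print(overlap)  # {'IDENT'} -- both alternatives can start with an identifier
```

The non-empty intersection is precisely the LL conflict; an LR parser, by contrast, can postpone the decision until after it has seen more of the input.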
The benefits of RD parsers are real. People smarter than you and me will confirm that. Niklaus Wirth is a strong proponent of hand-coded RD parsers. Moessenboeck changed his seminal work (COCO) from LALR to LL. Tom Pittman has also been on record in favour of RD.
And there are strong proponents amongst today's generation of highly acclaimed scholars, for example Terence Parr and Chris Lattner. Chris Lattner has just received an ACM award for his work on Clang and LLVM. So you can make fun of me for my preference of hand-written RD all day long, but I doubt you have enough clout to make fun of Chris Lattner, who is also an outspoken proponent of hand-written RD parsers.
How many people do you want me to quote in favour of LR? Or should we stop name dropping and discuss actual merits?
I did some googling and found surprisingly little on the subject -- perhaps I used the wrong search terms, in which case you might give me some pointers (after all, it's your argument, so it's up to you to back it up). I won't make fun of something or someone I know nothing about -- I make fun of Borland Pascal because I actually know quite a bit about it.
What I found, apart from the above two references is http://bit.ly/9KILMe (the link points to "Compiler Construction - The Art of Niklaus Wirth" by Mössenböck, page 60, at books.google.com; I think the link will expire soon, but you can search by the title):
"Wirth always designed his languages so that recursive descent parsing could be applied" (which, according to the previous paragraph, he takes to mean LL(1), which is actually more restrictive than what you talk about). I absolutely agree with this when designing a language from scratch. But that's not the situation we're in with GPC; we have an existing language (mix) that, if anything, will only be extended, so "it would be nice if it were LL(1)" doesn't help, since it isn't so.
"If the designers of C were forced to make their language LL(1) it would have probably become more readable." The problem here is the "forced" bit. Writing a manual parser without formal verification just doesn't do this. Tellingly, the very next sentence reads: "Unfortunately even Wirth's languages contain a few constructs which are not LL(1)." So, apparently he wasn't quite forced to make it LL(1), he just tried to (and, AFAIK, succeeded mostly, but not entirely).
Another reason given for RD parsing is that it's (supposed to be) faster. I'm not convinced that it is, and even if so, it may be negligible. For comparison, on the previous page in the same book it says: "... since the scanner is one of the most time consuming parts of a compiler".
That might have been true in the 1970s, but not today. If anything, the lexer is one of the *least* time-consuming parts of GPC. Yes, it has to touch every character of the program (whereas the parser already works on tokens), but it does so in guaranteed linear time, and even with the overhead of storing locations etc., it simply doesn't take long. (Anybody who's worked with text files recently can confirm that the time it takes to process files of the typical length of source files is negligible, if the processing itself is not complicated, which it isn't in the scanner.) That's even more true in Pascal, compared to C, C++ and other languages which, due to includes, have to scan and parse the same declarations again and again (unless using precompiled headers).
Of course, the most time-consuming part of GPC is the optimizer. Wirth's opinions on optimizers, stated a few pages before, are also outdated IMHO. Modern processor architectures allow many optimization possibilities that are too complex for humans to do (to a non-trivial extent). Different architectures have different characteristics, and programmers should not be burdened with having to know about and care for them. "Generate good code from the beginning, don't generate bad code and optimize later" is in direct contrast to SSA, which seems to have a wide consensus among compiler designers today, including GCC and LLVM. With all due respect, I think this is a case of straying too far outside of his area of expertise. He was a brilliant language designer, but not necessarily so good at code generation. (I'm neither, but I don't claim to be, nor do I try to prescribe to optimizer writers how to do their jobs.)
People who write RD parsers by hand generally calculate FIRST and FOLLOW sets and prove that their grammar contains no ambiguities. It seems you have run into somebody who didn't do that, but that doesn't mean that this is how it's done.
Borland apparently didn't prove their grammars (because their grammar did contain ambiguities).
Well, shame on them.
A year ago we had an Indian or Bangladeshi neighbour who always parked his bicycle in my spot which got me into trouble with the landlord because I had to bring my bicycle into the building. It doesn't mean all Indians/Bangladeshis do this. It was just this one guy and he was wrong.
So far it's 1-3 WRT Pascalish/Wirthian languages (Borland, FPC, and the quote about Wirth above vs. OM2 if I assume it does it properly). Other compiler authors (Scott?) may chime in, but up to now I take your "this is how it's done" to mean "this is how it should be done ideally".
I still don't quite see the point. With RD, you specify the grammar in a formal way and write the parser manually, and it's verified automatically. With Bison you specify the grammar formally, and the parser is generated and verified automatically.
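The kind of automatic verification meant here can be sketched in a few lines. This is my simplified illustration (toy grammar, no nullable handling): the core LL(1) condition is that for every pair of alternatives A -> alpha | beta, FIRST(alpha) and FIRST(beta) must be disjoint.

```python
# Simplified LL(1) conflict check (illustrative sketch; a real checker
# would also handle nullable symbols and FOLLOW sets).

GRAMMAR = {
    'stmt': [['IF', 'expr', 'THEN', 'stmt'], ['IDENT', 'ASSIGN', 'expr']],
    'expr': [['IDENT'], ['NUM']],
}

def first_of_seq(seq, grammar):
    """FIRST set of a symbol sequence (ignores nullability for brevity)."""
    if not seq:
        return set()
    sym = seq[0]
    if sym not in grammar:        # terminal
        return {sym}
    out = set()
    for prod in grammar[sym]:
        out |= first_of_seq(prod, grammar)
    return out

def ll1_conflicts(grammar):
    """Report every pair of alternatives whose FIRST sets overlap."""
    conflicts = []
    for nt, prods in grammar.items():
        for i in range(len(prods)):
            for j in range(i + 1, len(prods)):
                clash = first_of_seq(prods[i], grammar) & first_of_seq(prods[j], grammar)
                if clash:
                    conflicts.append((nt, sorted(clash)))
    return conflicts

print(ll1_conflicts(GRAMMAR))  # [] -- this toy grammar passes the check
```

Whether this check is performed by a separate tool before hand-coding the parser, or implicitly by a generator like Bison rejecting a conflicting grammar, the grammar has to be written down formally either way.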
If we had wanted to use a generator, we'd have either used COCO/R or ANTLR, both of which generate human readable RD parsers in several output languages. COCO can generate output in Pascal, Modula-2 and Oberon and others. ANTLR can generate output in Java, C, C++, Python, Oberon and others.
However, our compiler is meant to be an entirely self-contained bootstrap kit. We wanted to make it as easy as possible for anyone to bootstrap the compiler anywhere without having to worry what libraries or tools might be there and whether or not they are up to date. Our bootstrap compiler has no dependencies other than a C compiler and stdio/stddef.
Well, this is also possible with a generator, especially if the generated code is not target-dependent (as is the case with Bison and Flex). Look at the GPC source distribution. Bison is only required if you change the grammar (where in your case you also need the verification tool IIUYC).
The former seems more work to me.
Appearances can be deceiving. The time spent coding the parser is only a fraction of the work you have to do on things a generator won't do for you anyway.
So what are those things that a generator won't do for you, but a RD verifier (or what?) will? It's got to be something major, so that two steps (writing the parser and specifying the grammar for the verifier) become easier than one (writing the grammar for the parser generator), so what is it?
Frank