Frank Heckenbach wrote:
since GPC development has mostly stalled, I've thought about if and how its development could be continued. Here are my current thoughts about it. Since the text is quite long, I put it on the web: http://fjf.gnu.de/gpc-future.html
Although I no longer use Pascal, I would like to make a plea for GNU Pascal to be sustained.
Generating C (or even C++) as an alternative target is certainly a useful feature for any compiler to have, but if GPC were to lose its interface to the GCC back-end, I wonder if it would still be GNU Pascal. I believe most people associate GCC back-end support with any compiler that carries the GNU moniker.
Don't get me wrong, I am neither a GNU fanboy nor a GCC zealot. I know that developing for the GCC back-end can be painful. Nevertheless, I believe it would be a loss for the Algol/Pascal family of languages and their users to see Pascal disappear from the list of supported languages in the GCC toolbox.
It is correct that the existing code will remain available, but without maintenance GNU Pascal would eventually cease to work with the latest GCC distribution.
GNU Modula-2 is now very close to its 1.0 release at which point it is expected to become part of the main GCC distribution. It would be sad to see Pascal leave the GCC party just as Modula-2 joins.
It is not uncommon that a project maintainer burns out after a decade. Nobody can blame you for that. Perhaps somebody else can be found to take over the project and continue it.
As for retargeting GNU Pascal, I personally would consider that a new project. How much sense it would make to reuse the GNU Pascal front end for a new compiler project to target a different back-end I am unable to say. However, I can comment on targeting the LLVM compiler infrastructure.
First, some clarification. Some posters appear to be confused about what LLVM is. For example, I saw one commenter writing "I am not interested in programming in ... LLVM". The LLVM compiler infrastructure is not a programming language you would write programs in. Instead, it is a compiler back-end library that compiler developers can plug their compilers into, thereby saving themselves the effort of writing their own target-specific back-ends and optimisers.
Compared to using the GCC back-end, LLVM is far more straightforward. There are two ways to interface with it. You can either use the C or C++ bindings and generate your LLVM AST directly from within your own parser, or you can write a back-end that emits LLVM pseudo-assembly code, which is actually far cleaner than C or C++. It is a typed virtual assembly language that has atomic intrinsics, trampolines and a lot of other useful things.
I am currently involved in a Modula-2 compiler project that targets C and LLVM (via the LLVM pseudo-assembly language). We are using a template approach for code generation: for each AST node, a corresponding C or LLVM template is populated with the details from the AST node. Code generation is still at an experimental stage, but I can already tell you from this experience that targeting LLVM is certainly far more straightforward than targeting the GCC back-end.
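To give a rough idea of what I mean, here is a toy sketch in Pascal (the placeholder syntax and all names are invented for this illustration; our real templates are more elaborate): an addition node is rendered by substituting its fields into an LLVM pseudo-assembly template.

    program CodeGenSketch;
    type
      NodeKind = (nkAdd, nkSub);
      AstNode = record
        kind: NodeKind;
        dest, left, right: string   { virtual register names }
      end;

    { Replace every occurrence of Marker in Template by Value.
      (Assumes Value never contains Marker.) }
    function Fill(Template, Marker, Value: string): string;
    var
      p: integer;
    begin
      p := Pos(Marker, Template);
      while p <> 0 do
      begin
        Delete(Template, p, Length(Marker));
        Insert(Value, Template, p);
        p := Pos(Marker, Template)
      end;
      Fill := Template
    end;

    { Pick the template for the node kind and populate it. }
    function EmitNode(const n: AstNode): string;
    var
      t: string;
    begin
      case n.kind of
        nkAdd: t := '$d = add i32 $l, $r';
        nkSub: t := '$d = sub i32 $l, $r'
      end;
      EmitNode := Fill(Fill(Fill(t, '$d', n.dest), '$l', n.left), '$r', n.right)
    end;

    var
      n: AstNode;
    begin
      n.kind := nkAdd;
      n.dest := '%t1'; n.left := '%a'; n.right := '%b';
      writeln(EmitNode(n))   { prints: %t1 = add i32 %a, %b }
    end.

The printed line is already valid LLVM pseudo-assembly.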
It would certainly be nice to see a Pascal compiler targeting LLVM but as I said, this should probably be considered a new project and no longer GNU Pascal. Even with an LLVM Pascal compiler available, I still believe it would be beneficial to sustain GNU Pascal (as a Pascal compiler targeting the GCC back-end).
Anyway, good luck to you and the GNU Pascal community.
--
Projects mentioned in this post:
GNU Modula-2 (http://www.nongnu.org/gm2)
LLVM Compiler Infrastructure (http://llvm.org)
Objective Modula-2 (http://objective.modula2.net)
Modula-2 R10 (http://www.ohloh.net/p/m2r10)
Objective Modula-2 wrote:
Compared to using the GCC back-end, LLVM is far more straightforward.
If we used the LLVM back end, would our static and dynamic libraries have a format compatible with GCC static and dynamic libraries? At one time there were three standards for function calls in Windows libraries. In other words: will we still be able to link to GCC code with the same ease that we can now?
Yours, Kevan
Kevan Hashemi wrote:
Objective Modula-2 wrote:
Compared to using the GCC back-end, LLVM is far more straightforward.
If we used the LLVM back end, would our static and dynamic libraries have a format compatible with GCC static and dynamic libraries?
Yes.
In other words: will we still be able to link to GCC code with the same ease that we can now?
Yes. As I understand it, interoperability and drop-in replaceability are amongst the primary objectives of the LLVM project.
BTW, if you want to quiz the developers you can always do so in their IRC channel at irc.oftc.net #llvm. They are always very helpful, and they don't bite ;-)
On Sat, 31 Jul 2010, Objective Modula-2 wrote:
Frank Heckenbach wrote:
since GPC development has mostly stalled, I've thought about if and how its development could be continued. Here are my current thoughts about it. Since the text is quite long, I put it on the web: http://fjf.gnu.de/gpc-future.html
Although I no longer use Pascal, I would like to make a plea for GNU Pascal to be sustained.
Generating C (or even C++) as an alternative target is certainly a useful feature for any compiler to have, but if GPC were to lose its interface to the GCC back-end, I wonder if it would still be GNU Pascal. I believe most people associate GCC back-end support with any compiler that carries the GNU moniker.
Then we need to figure out how to make the syncing of GPC with the GCC back end take much less of Frank's time (or anyone else who might serve as lead developer). Any ideas? Ideally, we'd want the syncing with new versions of GCC to be close to automatic, but it sounds like that's not feasible in the short term.
--------------------------
John L. Ries
Salford Systems
Phone: (619)543-8880 x107
or (435)867-8885
--------------------------
John L. Ries wrote:
Then we need to figure out how to make the syncing of GPC with the GCC back end take much less of Frank's time (or anyone else who might serve as lead developer). Any ideas?
Ideally you will find a new maintainer or, better still, more than one. The question is: where to look for potential recruits?
At our Modula-2 project we have received lots of interest from students at universities where Modula-2 is still being taught. Some are new graduates who have been involved in compiler implementations of Modula-2 or Oberon subsets either as part of their course-work or in their graduation projects.
It seems GPC has a good following in academia and there should still be universities that teach Pascal. So my recommendation would be to actively seek out those universities, talk to the professors who run the courses and try to win them over as "recruiters". They could suggest GPC-related course-work. They could tell their compiler construction classes that a new generation of maintainers is needed for GPC. To me it seems that this is the most likely way to find new maintainers for your project.
I also would like to comment on the notion that the GPC front-end is stuck with C as an implementation language simply because GCC is C based.
The GNU Modula-2 GCC front end is written in Modula-2. There is still some C involved at the bootstrap stage (via p2c) but by and large GM2 is a Modula-2 code base. Check out the GM2 repository and take a look. The code is well documented, and if you run into difficulties doing the same for GPC I am sure Gaius Mulley (the GM2 developer) will be happy to give you some guidance.
Ideally, we'd want the syncing with new versions of GCC to be close to automatic, but it sounds like that's not feasible in the short term.
Never say never, but I seriously doubt that this will become feasible in the foreseeable future. There is a noticeable influence of design-by-license in the GCC architecture. Making it easy to plug a front-end into GCC is not only not a top priority for the GCC developers; an outspoken group amongst them actually considers it undesirable, because they see it as making it easy to circumvent the GPL.
There was a long discussion about the pros and cons of plug-ins some time last year (or even earlier) on the GCC developer mailing list. A great number of commenters actually suggested that the API should regularly be changed at random to break anyone's code on purpose. Some of these suggestions were to randomly change the order of function parameters. There is of course no technical merit whatsoever in doing that. To those who argue this way, technical merit takes a back seat. Not all the GCC developers think this way, but this kind of thinking has nevertheless had an impact on the GCC architecture over the years.
It can reasonably be expected that competition with LLVM will cause this influence to weaken and the influence of technical merit to strengthen, but I seriously doubt that design-by-license will ever completely disappear from the GCC project.
Objective Modula-2 wrote:
So my recommendation would be to actively seek out those universities...
That might work. But it sounds like a lot of hours on the phone, and we engineers and programmers are not the greatest salesmen, in my experience. So your solution may not be practical.
There was a long discussion about the pros and cons of plug-ins some time last year (or even earlier) on the GCC developer mailing list. A great number of commenters actually suggested that the API should regularly be changed at random to break anyone's code on purpose.
Am I understanding you correctly: people writing GPL code are actively trying to make it harder for other people to use their work? If that is true, they are betraying the spirit upon which GPL was founded in the first place: to share one's work and save everyone else a lot of trouble.
Your accusation against the GCC team is a serious one. Can you point me to the GCC forum discussions in question with a web link? Because if what you say is true, then I'm going to have to get away from GCC.
Adriaan van Os wrote:
Doing that is a very difficult and never-ending task, because the back-end will keep changing. It is more time-efficient to go another way, e.g. to rewrite the compiler to produce intermediate C++ or (what I would prefer) intermediate LLVM assembly code.
If what OM2 says about GCC is true, then I think it's clear that the GCC team will do their best to make it hard for us to keep up with their changes, and therefore you are right: it's not worth bothering with.
So, this LLVM route is looking attractive. I'll skip looking into GCC for now, and spend a few hours reading here instead:
Yours, Kevan
On Mon, 2 Aug 2010 01:35:33 am Kevan Hashemi wrote:
Objective Modula-2 wrote:
There was a long discussion about the pros and cons of plug-ins some time last year (or even earlier) on the GCC developer mailing list. A great number of commenters actually suggested that the API should regularly be changed at random to break anyone's code on purpose.
Am I understanding you correctly: people writing GPL code are actively trying to make it harder for other people to use their work? If that is true, they are betraying the spirit upon which GPL was founded in the first place: to share one's work and save everyone else a lot of trouble.
I don't want to get into a long and tedious debate about the GPL, but let me just say this: the primary purpose of the GPL is not merely to make it easy for others to use software you have written. If it were, they would use a BSD-style licence. The primary purpose of the GPL is equally to ensure that when others use your software, their improvements remain under the GPL and therefore available to others still.
In context, which you deleted, the (alleged) fear of some GCC developers is that support for a plugin-style front end to the compiler allegedly makes it easy for people to bypass the GPL. So they're not making it difficult for people to use GCC on a whim; they're doing it to support the primary purpose of the GPL: to ensure that software remains accessible to all and not locked up in proprietary walled gardens.
Your accusation against the GCC team is a serious one. Can you point me to the GCC forum discussions in question with a web link? Because if what you say is true, then I'm going to have to get away from GCC.
Surely it depends on who is saying this? If they are lead developers in the GCC community, or Richard Stallman, then your fears are justified. If they're random Internet fly-by commentators on a public mailing list shooting off at the mouth, with no influence and no likelihood of becoming lead developers, then let them rant and froth at the mouth.
Objective Modula-2 wrote:
At our Modula-2 project we have received lots of interest from students at universities where Modula-2 is still being taught. Some are new graduates who have been involved in compiler implementations of Modula-2 or Oberon subsets either as part of their course-work or in their graduation projects.
It seems GPC has a good following in academia and there should still be universities that teach Pascal.
From what I see and read recently (more like, in the last 10 years or longer), apparently not so many.
So my recommendation would be to actively seek out those universities, talk to the professors who run the courses and try to win them over as "recruiters". They could suggest GPC-related course-work. They could tell their compiler construction classes that a new generation of maintainers is needed for GPC. To me it seems that this is the most likely way to find new maintainers for your project.
To be honest, I think this might have been a good idea some 10-15 years ago. Now it may be too late.
I also would like to comment on the notion that the GPC front-end is stuck with C as an implementation language simply because GCC is C based.
The GNU Modula-2 GCC front end is written in Modula-2. There is still some C involved at the bootstrap stage (via p2c) but by and large GM2 is a Modula-2 code base. Check out the GM2 repository and take a look. The code is well documented, and if you run into difficulties doing the same for GPC I am sure Gaius Mulley (the GM2 developer) will be happy to give you some guidance.
One advantage GM2 has had (and a disadvantage GPC has had) concerns support of different major backend versions and their large changes.
You might say we should have dumped the old versions and gone all in with the new ones as soon as possible, but remember we didn't write GPC in an empty space with unlimited resources for philanthropic reasons. We had actual programs that we needed to compile and maintain (since BP wasn't up to the task anymore), with gcc 2.7.x or 2.8.x current at that time. When gcc-3 came out, it took some time before GPC support with it was really stable (not just somehow usable), and during this time we still needed support for gcc-2. (Actually it was two major steps, since gcc-2.95 and egcs already introduced big changes from 2.8.x, and gcc-3 made more.) Then, in the transition to gcc-4, we cannot (note the present tense) dump gcc-3, which for me currently is the only way to compile most of my programs. If we had started with gcc-4 as GM2 did (AFAICT), we wouldn't have had many of these problems (but we also wouldn't have had a working compiler for many years).
It will be interesting to see how GM2 does when gcc-7.x comes out. This isn't meant sarcastically -- I wish them well; perhaps disruptive changes in the backend will have settled somewhat by then, or perhaps they have developers who can afford to break the compiler during the ports without the constant need for a current (frontend) version available.
Kevan Hashemi wrote:
Adriaan van Os wrote:
Doing that is a very difficult and never-ending task, because the back-end will keep changing. It is more time-ineffcient to go another way, e.g. to rewrite the compiler to produce intermediate C++ or (what I would prefer) intermediate LLVM assembly code.
If what OM2 says about GCC is true, then I think it's clear that the GCC team will do their best to make it hard for us to keep up with their changes, and therefore you are right: it's not worth bothering with.
They don't even need to make it hard for us on purpose. It's hard enough already (probably too hard, given available developer time).
Grant Jacobs wrote:
- A work-around solution for *my* personal use would be to have a development line against one or two "frozen" OS(es) that can be virtualised easily on the current platforms (e.g. selected versions of Linux + gcc + the other tools needed).
If it's not clear, what I am thinking of is running (say) a version of Linux as a virtual machine on (in my case) Mac OS X and working with GPC from within that. (Ideally the GPC project would host the core VM images.)
As I said, on Linux (specifically Debian, but I suppose similar with other distributions), the situation is currently as follows:
- gcc-3.x is not part of current distributions anymore.
- gcc-3.x binaries (including GPC) from older distributions can still be installed without dependency problems.
- When this fails someday, it can probably still be compiled from older sources.
- When this fails, it's possible to run several OS versions side by side. (FTM, I currently run several different distributions, including, as I mentioned in another mail, even a libc5 based system from 1996, in chroot environments, which means, basically, the same kernel, and even the same X11 desktop, but otherwise completely separate programs, libraries etc. This works surprisingly well, and I suppose it will in the future.)
- When this fails someday, virtual machines or emulators are probably the last resort.
So it's actually quite some time until the existing GPC will really cease to be usable in one of these ways. (The Y2038 problem might hit earlier. ;-) But of course, it will get less and less comfortable, especially if you intend to distribute programs.
Objective Modula-2 wrote:
It is correct that the existing code will remain available, but without maintenance GNU Pascal would eventually cease to work with the latest GCC distribution.
Actually, it doesn't even work yet with the latest GCC. It works with a somewhat older GCC (4.1), but buggy, as I described. It works well with gcc-3.x, which is so old it's been removed from major Linux distributions (though it can still be installed manually, see above), doesn't work well on Mac OS X (AFAIK), and may have other issues.
As for retargeting GNU Pascal, I personally would consider that a new project.
Waldek suggested the same in private correspondence, which already made me skeptical of the approach before I sent my first mail. The situation doesn't seem right for starting new projects.
First, some clarification. Some posters appear to be confused about what LLVM is. For example, I saw one commenter writing "I am not interested in programming in ... LLVM". The LLVM compiler infrastructure is not a programming language you would write programs in.
Yes, and since similar confusion arose about my C++ converter proposal, let me restate this: Any possible converter (whether to C++, Ada or LLVM, which seem to be the major 3 candidates now) would be part of the build process, so one would program in Pascal. (As I said, a half-automatic converter for a permanent transition would be a much simpler task, which I might do if GPC dies.)
Frank
Steven D'Aprano wrote:
The primary purpose of the GPL is equally to ensure that when others use your software, their improvements remain under the GPL and therefore available to others still.
That's right, that's what the license is for. So why are they building the same protection into their code as well?
Frank Heckenbach wrote:
It seems GPC has a good following in academia and there should still be universities that teach Pascal.
From what I see and read recently (more like, in the last 10 years or longer), apparently not so many.
It depends on where you are looking. It is true that at universities in most of Western Europe and the United States the use of Wirthian languages has been in decline, but there are geographies where this isn't so.
For example, Pascal is still very prominent in the Baltic states and some former Soviet republics. Modula-2 is taught in many Spanish universities, in former Yugoslavia and in Russia. Oberon is taught in some universities in Belgium and in the UK. Oxford University even maintains its own Oberon compiler, which I believe is predominantly for teaching.
So my recommendation would be to actively seek out those universities, talk to the professors who run the courses and try to win them over as "recruiters".
To be honest, I think this might have been a good idea some 10-15 years ago. Now it may be too late.
I agree that it would have been easier 10-15 years ago than it is today, but if I had any stake in this, I personally wouldn't give up that easily.
Not that far back, our Modula-2 project looked like a Don-Quixotian effort, you know, fighting windmills while everybody watching thinks you're nuts. Yet, with hard work and perseverance, the project has gained momentum and more and more people are flocking in. If we had listened to the sceptics, none of this would have happened.
My advice would be to get 2 or 3 folks together to form a kind of steering/action committee, draft some document about what the future direction should be, use peer review to refine that, turn it into a goals and requirements catalog, give it a name (say "GNU Pascal NG", for "next generation"), put it up on a website with all the necessary collaboration tools, then go fishing and see what happens. You never know how far you will get unless you go for it.
The GNU Modula-2 GCC front end is written in Modula-2.
One advantage GM2 has had (and a disadvantage GPC has had) concerns support of different major backend versions and their large changes.
You might say we should have dumped the old versions and gone all in with the new ones as soon as possible, but remember we didn't write GPC in an empty space with unlimited resources for philanthropic reasons.
Fair enough. I am not saying it is going to be a walk in the park, but I am saying it ain't impossible to have a GCC front end that is not written in C or C++.
We had actual programs that we needed to compile and maintain (since BP wasn't up to the task anymore), with gcc 2.7.x or 2.8.x current at that time. When gcc-3 came out, it took some time before GPC support with it was really stable (not just somehow usable), and during this time we still needed support for gcc-2. (Actually it was two major steps, since gcc-2.95 and egcs already introduced big changes from 2.8.x, and gcc-3 made more.) Then, in the transition to gcc-4, we cannot (note the present tense) dump gcc-3, which for me currently is the only way to compile most of my programs. If we had started with gcc-4 as GM2 did (AFAICT),
You are mistaken. GM2 started originally as M2F, a Linux-only Modula-2 compiler with an x86 back-end, and about 10 years ago the author decided to retarget it to GCC and thereby turn it into GNU Modula-2. At the time when the first work on that was done, the current GCC version was 2.95 or thereabouts.
Of course, the fact that it has taken 10 years for GNU Modula-2 to come close to a 1.0 release is an indication that this has been a major effort. Like GPC, GM2 is multi-dialect (including ISO M2). Like GPC it has only one developer (other contributors are mostly platform testers). Like GPC it has struggled to keep up with GCC updates.
In any event, my point was that if you guys really want GPC to be written in Pascal, it's not impossible; it'll be an effort, yes, but it can be achieved.
Kevan Hashemi wrote:
If what OM2 says about GCC is true, then I think it's clear that the GCC team will do their best to make it hard for us to keep up with their changes, and therefore you are right: it's not worth bothering with.
Please don't take this out of context. I also wrote that not all GCC developers think this way. There are two lines of thought. One group is so concerned about possible circumvention of the GPL that they are willing to let this influence design choices. The other line is more concerned with technical merit; they think protecting the license is a job for lawyers, not for engineers.
I didn't mean to suggest that GCC is entirely shaped by those who are willing to let license considerations influence the design. I said that there is an influence from *both* these lines of thought, and that even if the influence of the former can be expected to weaken as a result of competition with LLVM, I doubt it will ever entirely disappear.
Also note that I do not pass any judgement about the license or which approach is "right" and which approach is "wrong". I am only observing.
I personally tend to believe that in the early days of open source it was necessary to have a viral license to protect and spread the open source model, but that meanwhile open source has become so well established that the model is no longer threatened by more liberal license regimes. However, like anybody else, I don't have a crystal ball, so I cannot be 100% sure that my assessment is correct; there is a chance I might be wrong about this.
For this reason, even if I disagree, I cannot entirely dismiss the concerns of those GCC developers who think their project is threatened by possible GPL circumvention.
This is one of the reasons why I made a plea for GNU Pascal to be sustained as a GCC front end even though I am not exactly a big fan of GCC. As nature teaches us, the best way for a species to survive is genetic diversity. I believe the same holds true for software projects. If Pascal is represented in all the mainstream technology approaches and all the mainstream licensing models, it has a better chance of survival than if all eggs are in one nest. As a result it follows that there ought to be a GNU Pascal that targets GCC even if there is also an LLVM Pascal around (and a JVM and .NET/mono targeting Pascal).
The same applies on the language level. I personally prefer Modula-2, but I'd be rather concerned if Modula-2 was the only surviving member of the Algol line. I think diversity is important on that level, too.
If one member of the extended family gets into trouble, that should be a concern for all members of the extended family. The fewer of us remain standing, the more likely we'll end up in a world where the Algol line becomes extinct.
As for retargeting GNU Pascal, I personally would consider that a new project.
Waldek suggested the same in private correspondence, which already made me skeptical of the approach before I sent my first mail. The situation doesn't seem right for starting new projects.
Please, don't let my comment discourage you from doing what you would like to do. My comment was more about marketing than technical issues.
Also, when I said "new project" I meant something akin to a fork, not necessarily an effort starting from scratch.
But even starting from scratch isn't as bad as it may seem. In fact, starting from scratch can be liberating, motivating and energising.
Our Modula-2 project started from scratch. We even spent an entire year on revising and modernising Wirth's last Modula-2 specification. During that time we only did some experimental coding to test ideas. It almost felt like we were moving further away from working on implementation.
Yet, once we had a viable specification and started our public repository, we quickly gained momentum and I have not seen as much enthusiasm and motivation in any project I have been involved in for many years. A lot of that, if not all, is a result of a fresh start.
Whichever road you decide to take, don't let yourself be discouraged.
Objective Modula-2 wrote:
The same applies on the language level. I personally prefer Modula-2, but I'd be rather concerned if Modula-2 was the only surviving member of the Algol line. I think diversity is important on that level, too.
If one member of the extended family gets into trouble, that should be a concern for all members of the extended family. The fewer of us remain standing, the more likely we'll end up in a world where the Algol line becomes extinct.
Would it be possible to look at GNU Pascal and Objective Modula-2 as several dialects of the same language? And if so, why not merge both projects?
Regards,
Adriaan van Os
Adriaan van Os wrote:
Would it be possible to look at GNU Pascal and Objective Modula-2 as several dialects of the same language?
The boundary between dialect and different language is not well defined. If you take a very broad view, you might consider all of Pascal's, Modula's and Oberon's dialects to be dialects of a single Wirthian language. OTOH, if you take a very narrow view, then you might consider even the dialects of Pascal to be separate languages.
It is very much in the eye of the beholder.
And if so, why not merge both projects?
I don't think that would be very practical for a number of reasons. Our project is targeting LLVM, GNU Pascal targets GCC. We build our own AST, the GPC AST is probably GCC's. So not only would there be different front-ends but different middle and back-ends, too.
It would be far more practical for GPC to join forces or share code with another GNU compiler project. For example GNU Modula-2 or perhaps GNAT.
Objective Modula-2 wrote:
Kevan Hashemi wrote:
If what OM2 says about GCC is true, then I think it's clear that the GCC team will do their best to make it hard for us to keep up with their changes, and therefore you are right: it's not worth bothering with.
Please don't take this out of context. I also wrote that not all GCC developers think this way. There are two lines of thought. One group is so concerned about possible circumvention of the GPL that they are willing to let this influence design choices. The other line is more concerned with technical merit; they think protecting the license is a job for lawyers, not for engineers.
I don't want to take it out of context; that's why I'd like to read the forum e-mails myself, to see who is pushing for self-inflicted mutilation of their own code.
So, I've been looking through the LLVM pages. I'm new to this compiler-creation business, but it looks to me like the following steps are required to produce a self-compiling GPC on LLVM.
We first have to write a partial-syntax Pascal lexer and parser to create the abstract syntax tree (AST) that is required by the LLVM libraries. We can't use Bison to create the lexer and parser because Bison's output is a C program. We have to write the lexer and parser from scratch in Pascal.
We compile the partial-syntax lexer and parser with the existing GCC-based GPC. We link our AST structure to the LLVM libraries. The result is a program that takes text input and produces LLVM Intermediate Representation (IR) code. The LLVM system appears to handle optimization and platform-dependent assembly from there.
Having tested the minimal Pascal compiler, we enhance its lexer and parser until it supports all the syntax that is used in its own definition. We are still compiling our Pascal with the GCC version of GPC.
Now we compile the lexer and parser with itself. Our lexer-parser is entirely written in Pascal, but there are C-like parts where we link to the LLVM libraries.
From here, we continue to enhance the compiler in steps. Each step requires defining a new feature without the use of the new feature. Frank says he does not like that, but I'm fine with it.
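To convince myself that the second step is at least plausible, I sketched a toy emitter in Pascal (entirely my own invention, with a hard-coded AST in place of a real parser) that walks an expression tree and prints what I believe is a complete LLVM IR module:

    program MiniEmit;
    type
      ExprKind = (ekConst, ekAdd, ekMul);
      ExprPtr = ^ExprNode;
      ExprNode = record
        kind: ExprKind;
        value: integer;           { used when kind = ekConst }
        left, right: ExprPtr      { used otherwise }
      end;

    var
      regCount: integer;

    function IntStr(i: integer): string;
    var
      s: string;
    begin
      Str(i, s);
      IntStr := s
    end;

    { Allocate a fresh virtual register name. }
    function NewReg: string;
    begin
      Inc(regCount);
      NewReg := '%t' + IntStr(regCount)
    end;

    function Leaf(v: integer): ExprPtr;
    var
      e: ExprPtr;
    begin
      New(e); e^.kind := ekConst; e^.value := v; Leaf := e
    end;

    function Node(k: ExprKind; a, b: ExprPtr): ExprPtr;
    var
      e: ExprPtr;
    begin
      New(e); e^.kind := k; e^.left := a; e^.right := b; Node := e
    end;

    { Emit IR for e; the result is the literal or register holding its value. }
    function Emit(e: ExprPtr): string;
    var
      l, r, d: string;
    begin
      if e^.kind = ekConst then
        Emit := IntStr(e^.value)
      else
      begin
        l := Emit(e^.left);
        r := Emit(e^.right);
        d := NewReg;
        if e^.kind = ekAdd then
          writeln('  ', d, ' = add i32 ', l, ', ', r)
        else
          writeln('  ', d, ' = mul i32 ', l, ', ', r);
        Emit := d
      end
    end;

    var
      res: string;
    begin
      regCount := 0;
      writeln('define i32 @main() {');
      { (2 + 3) * 4 }
      res := Emit(Node(ekMul, Node(ekAdd, Leaf(2), Leaf(3)), Leaf(4)));
      writeln('  ret i32 ', res);
      writeln('}')
    end.

If I have read the LLVM documentation correctly, output of this shape can be fed straight to the LLVM tools (lli to execute it, llc to produce native assembly), so the optimisation and platform-dependent work really would be out of our hands.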
I will continue looking into the matter, in particular I'd like to learn more about the state of LLVM's optimisers, and hear from some existing users about how well it all works on different platforms. Also, I'd like to know how settled the library interface is, because it's here that we had problems with GCC, if I understand correctly.
So, please correct my mistakes. I may have misunderstood the process entirely.
Yours, Kevan
Kevan Hashemi hashemi@brandeis.edu wrote:
I don't want to take it out of context; that's why I'd like to read the forum e-mails myself, to see who is pushing for self-inflicted mutilation of their own code.
Well, the gcc mailing list has a huge amount of traffic and I do not remember what the subject of this thread was nor even when it was. All I can tell you is that it was a very large discussion with strong opinions on both sides. The only thing I could find is this one:
http://gcc.gnu.org/ml/gcc/2008-01/msg00349.html
but that is only a one-line paragraph on status. You'd have to search further back in time to find the actual thread.
As you can see from this status message, Richard Stallman blocked the introduction of plugins at the time because of concerns that it would allow the GPL to be circumvented.
So, I've been looking through the LLVM pages. I'm new to this compiler-creation business, but it looks to me like the following steps are required to produce a self-compiling GPC on LLVM.
We first have to write a partial-syntax Pascal lexer and parser to create the abstract syntax tree (AST) that is required by the LLVM libraries. We can't use Bison to create the lexer and parser because Bison's output is a C program. We have to write the lexer and parser from scratch in Pascal.
We compile the partial-syntax lexer and parser with the existing GCC-based GPC. We link our AST structure to the LLVM libraries. The result is a program that takes text input and produces LLVM Intermediate Representation (IR) code. The LLVM system appears to handle optimization and platform-dependent assembly from there.
Having tested the minimal Pascal compiler, we enhance its lexer and parser until it supports all the syntax that is used in its own definition. We are still compiling our Pascal with the GCC version of GPC.
Now we compile the lexer and parser with itself. Our lexer-parser is entirely written in Pascal, but there are C-like parts where we link to the LLVM libraries.
From here, we continue to enhance the compiler in steps. Each step requires defining a new feature without the use of the new feature. Frank says he does not like that, but I'm fine with it.
I will continue looking into the matter, in particular I'd like to learn more about the state of LLVM's optimisers, and hear from some existing users about how well it all works on different platforms. Also, I'd like to know how settled the library interface is, because it's here that we had problems with GCC, if I understand correctly.
So, please correct my mistakes. I may have misunderstood the process entirely.
A few things ..
There is no reason why you couldn't use a parser generator as a base for a compiler that targets LLVM.
As I had mentioned earlier, there are two ways to target LLVM. One is a C++ API for which C bindings exist as well. The other is to generate LLVM pseudo-assembly (text files).
However, if you want your compiler front end to be written in Pascal, I would recommend against bison. The most natural way to write a Pascal compiler would be to write an RD parser by hand. RD parsers have the advantage that it is possible to understand what they do by reading the source code. For each rule in the grammar there is a corresponding Pascal function/procedure. Table-driven parsers have their logic in data tables, and therefore it is very difficult to figure out what is happening by reading the code; you really need a debugger to get an understanding.
A project looking for newcomers to compiler construction to join the team is probably better off with an RD parser for the reasons mentioned above; RD parsers are also generally more efficient and produce better error messages.
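To show how direct the correspondence between grammar and code is, here is a minimal hand-written RD parser for a toy expression grammar (invented for illustration, and evaluating on the fly instead of building an AST):

    program RDSketch;

    { Toy grammar, one routine per rule:
        expr   = term { "+" term } .
        term   = factor { "*" factor } .
        factor = digit | "(" expr ")" .   }

    var
      line: string;
      cursor: integer;

    function Peek: char;
    begin
      if cursor <= Length(line) then
        Peek := line[cursor]
      else
        Peek := #0
    end;

    procedure Expect(c: char);
    begin
      if Peek = c then
        Inc(cursor)
      else
      begin
        writeln('expected ''', c, ''' at column ', cursor);
        Halt(1)
      end
    end;

    function ParseExpr: integer; forward;

    function ParseFactor: integer;
    begin
      if Peek = '(' then
      begin
        Expect('(');
        ParseFactor := ParseExpr;
        Expect(')')
      end
      else if Peek in ['0'..'9'] then
      begin
        ParseFactor := Ord(Peek) - Ord('0');
        Inc(cursor)
      end
      else
      begin
        writeln('digit or ''('' expected at column ', cursor);
        Halt(1)
      end
    end;

    function ParseTerm: integer;
    var
      v: integer;
    begin
      v := ParseFactor;
      while Peek = '*' do
      begin
        Inc(cursor);
        v := v * ParseFactor
      end;
      ParseTerm := v
    end;

    function ParseExpr: integer;
    var
      v: integer;
    begin
      v := ParseTerm;
      while Peek = '+' do
      begin
        Inc(cursor);
        v := v + ParseTerm
      end;
      ParseExpr := v
    end;

    begin
      line := '(1+2)*3';
      cursor := 1;
      writeln(line, ' = ', ParseExpr)   { prints: (1+2)*3 = 9 }
    end.

The control flow mirrors the grammar, which is exactly why newcomers can follow it by reading alone.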
I would recommend a book by Per Brinch Hansen titled "Brinch Hansen on Pascal Compilers". It handles RD parsing of Pascal and includes source code and commentary for a Pascal-subset compiler.
Now, if you do want to develop a Pascal compiler targeting LLVM, I would suggest finding a new name for the product. Not only would it be confusing to most people if a GNU Pascal targeted LLVM instead of GCC, but a different name would also likely allow you to take advantage of LLVM's popularity when recruiting new developers to the project.
Amongst LLVM developers/users, some have on occasion talked about the possibility of an Algol-family counterpart to Clang. Although thus far nobody has taken this idea anywhere, the nicknames people have given it half-seriously and half-jokingly (Alang, with A for Algol or Ada, or Wlang with W for Wirth and others) would seem to indicate that there is enough love for such an idea to take advantage of.
If I were to start a Pascal compiler project targeting LLVM, I would have "Plang" high on my list of desirable names for the project, as it is likely this would help find new friends within the LLVM crowd ;-)
GNU Pascal, on the other hand, might scare potential contributors away, as they would probably think it's a GCC front end. You'd always have to explain "Yes, we're GNU but we don't do GCC, we do LLVM". That sort of thing is usually unhelpful.
On Sun, 1 Aug 2010, Steven D'Aprano wrote:
On Mon, 2 Aug 2010 01:35:33 am Kevan Hashemi wrote:
Objective Modula-2 wrote:
I don't want to get into a long and tedious debate about the GPL, but let me just say this: the primary purpose of the GPL is not merely to make it easy for others to use software you have written. If it were, they would use a BSD-style licence. The primary purpose of the GPL is equally to ensure that when others use your software, their improvements remain under the GPL and therefore available to others still.
In context, which you deleted, the (alleged) fear of some GCC developers is that support for a plugin-style front end to the compiler allegedly makes it easy for people to bypass the GPL. So they're not making it difficult for people to use GCC on a whim; they're doing it to support the primary purpose of the GPL: to ensure that software remains accessible to all and not locked up in proprietary walled gardens.
It seems to me that making random changes in the GCC back end to "protect" the GPL would be counterproductive in any case. After all, it's easy enough and completely legal for a proprietary developer to generate C++ code and then call g++ as a separate program to compile it (just as Frank proposes to do), while programs that link with the GCC back end directly can only be distributed under the GPL (or so I believe). So, unless somebody with the knowledge and authority to speak for the GCC developers can enlighten us, I think it best to assume that the changes to the back end interface are made for technical reasons, though maybe backwards compatibility needs to be made a higher priority than it has been.
--------------------------
John L. Ries
Salford Systems
Phone: (619)543-8880 x107
or (435)867-8885
--------------------------
Your accusation against the GCC team is a serious one. Can you point me to the GCC forum discussions in question with a web link? Because if what you say is true, then I'm going to have to get away from GCC.
Surely it depends on who is saying this? If they are lead developers in the GCC community, or Richard Stallman, then your fears are justified. If they're random Internet fly-by commentators on a public mailing list shooting off at the mouth, with no influence and no likelihood of becoming lead developers, then let them rant and froth at the mouth.
-- Steven D'Aprano
Objective Modula-2 wrote:
Kevan Hashemi hashemi@brandeis.edu wrote:
From here, we continue to enhance the compiler in steps. Each step requires defining a new feature without the use of the new feature. Frank says he does not like that, but I'm fine with it.
I like it in theory; I just don't think it will work in practice, given the available manpower, in a reasonable amount of time.
However, if you want your compiler front end to be written in Pascal, I would recommend against bison. The most natural way to write a Pascal compiler would be to write an RD parser by hand. RD parsers have the advantage that it is possible to understand what they do by reading the source code. For each rule in the grammar there is a corresponding Pascal function/procedure. Table-driven parsers have their logic in data tables, and therefore it is very difficult to figure out what is happening by reading the code; you really need a debugger to get an understanding.
Maybe you misunderstand how parser generators are used. You don't read and debug the generated tables. Instead you read and write the grammar rules whose syntax is similar to BNF, i.e. what you find in standard documents and such. To debug problems in your grammar, you look e.g. at Bison's verbose output. To debug problems when Bison mis-translates your rules -- well, I've yet to encounter that ...
I've given reasons and examples why I'm convinced a manually written parser is not a good idea. You ignored them. Now, I don't want to sound smart-assed, but I'm actually talking from a few years of experience. Of course, you're allowed to repeat others' mistakes if you prefer.
I would recommend a book by Per Brinch Hansen titled "Brinch Hansen on Pascal Compilers". It handles RD parsing of Pascal and includes source code and commentary for a Pascal-subset compiler.
Oh look, another subset compiler ...
The problem with subset implementations is that they usually leave out the tricky parts. Sure, for a new project you might want to go for a subset. It's just not a project I'm interested in. Any project useful to me would have to be backward-compatible with the current GPC, which does not handle a subset, but actually a large superset of various Pascal dialects, many of them less clean in themselves than the original language. Good luck parsing that manually.
samiam@moorecad.com wrote:
P5 is a much better basis for a compiler front end, being 40 years old, very well documented, completely ISO 7185 compliant (not a subset), and completely free of copyright or license encumbrances.
Where is the copyright/license/public domain statement from its authors? I didn't find them in p5.zip from your web site. Note that "no copyright notice" doesn't mean "no copyright"; in fact the opposite, "all rights reserved", is the default. Like it or not, that's the legal situation.
Marrying the GCC front end to the backend as a single program was a classically bad design decision that I have always suspected was made for the goals that others here have described.
I don't think so. In the 1980s, when it started, writing and reading an extra file would have caused notable performance loss (and extra coding effort, of course -- I suppose they didn't have unlimited resources then either). If you're brave, look at the cpp (preprocessor) sources of old gcc versions sometime to see what horrible hacks they did just to squeeze out some cycles.
Frank
Dear Frank,
Maybe you misunderstand how parser generators are used.
I guarantee you that's the case for me. I read through Bison and I liked it very much, and I am thoroughly convinced of its merit by your ^C example. But my first look through the Bison manual page suggested that the output was a C program that we would then compile and link to, which means that at least part of our Pascal compiler would be written in C.
Given the extent of my inexperience, but at the same time my eagerness to help, perhaps you could explain to me how we could use Bison on the way to a self-compiling Pascal, or how the C-code output of Bison (if that's indeed what it puts out) could be part of the project in a way that left us independent of the GCC back-end problems.
I'm not saying that I think the Pascal to C++ compiler is a bad idea; it's just that I don't see how I can be effective in helping you with that solution, so I'm keen to understand the other possibilities.
Yours, Kevan
Frank Heckenbach wrote:
However, if you want your compiler front end to be written in Pascal, I would recommend against bison. The most natural way to write a Pascal compiler would be to write an RD parser by hand. RD parsers have the advantage that it is possible to understand what they do by reading the source code. For each rule in the grammar there is a corresponding Pascal function/procedure. Table-driven parsers have their logic in data tables, and therefore it is very difficult to figure out what is happening by reading the code; you really need a debugger to get an understanding.
Maybe you misunderstand how parser generators are used.
Most certainly not. I have used and tried just about every free tool there is.
You don't read and debug the generated tables. Instead you read and write the grammar rules whose syntax is similar to BNF, i.e. what you find in standard documents and such. To debug problems in your grammar, you look e.g. at Bison's verbose output. To debug problems when Bison mis-translates your rules -- well, I've yet to encounter that ...
The reality is that you have used this tool for 10 years or longer and the base from which to possibly recruit one or more new maintainers is unlikely to have this experience. The learning curve is far steeper with table-driven parser generators. Just hang out for a few days in the IRC support channels #parsers and #compilers and you will see how confused both newbies and not-so-newbie-anymores are when they use yacc/bison.
Moreover, in recent years the trend has been to move away from yacc/bison and move towards RD and PEG. There is an entire generation of newer tools that build RD parsers, both conventional and memoising, such as ANTLR for example.
This all means it will be more difficult to find new recruits if you use bison. It's as simple as that.
I've given reasons and examples why I'm convinced a manually written parser is not a good idea. You ignored them. Now, I don't want to sound smart-assed, but I'm actually talking from a few years of experience.
So do I.
Of course, you're allowed to repeat others' mistakes if you prefer.
You have the wrong idea about the process of writing an RD parser.
People who write RD parsers by hand generally calculate FIRST and FOLLOW sets and prove that their grammar contains no ambiguities. It seems you have run into somebody who didn't do that, but that doesn't mean that this is how it's done.
Some do their proofs by hand, some write a tool to do the proof using the grammar as input and some use an existing tool. Just because you write your parser by hand doesn't mean you have to do the proof by hand.
Whichever way you do it, your grammars are verified and unambiguous when you start coding your parser.
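As a toy illustration of the principle (the rules and the FIRST sets below are invented): Pascal's set type makes the disjointness check at the heart of the LL(1) condition a one-liner.

    program FirstSetCheck;
    type
      TermSet = set of char;

    { Report whether two alternatives of a rule have overlapping FIRST sets. }
    procedure CheckRule(name: string; alt1, alt2: TermSet);
    begin
      if alt1 * alt2 = [] then
        writeln(name, ': alternatives disjoint, LL(1) condition holds')
      else
        writeln(name, ': FIRST/FIRST conflict, rewrite or add lookahead')
    end;

    begin
      { factor = number | "(" expression ")"  -- disjoint }
      CheckRule('factor', ['0'..'9'], ['(']);
      { statement = assignment | procedure-call -- both alternatives
        start with an identifier, hence a conflict }
      CheckRule('statement', ['a'..'z'], ['a'..'z'])
    end.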
And the available (free) tools for designing and verifying grammars for RD and PEG parsers are far more suitable for newbies. Take the IDE for ANTLR as an example. It's called ANTLRworks; it generates syntax diagrams on the fly as you type in your grammar rules, and when you tell it to verify the grammar it shows any conflicts and ambiguities graphically within the generated diagrams. This is an excellent teaching tool for a newbie that reduces the learning curve by an order of magnitude.
Also, the new generation of tools such as ANTLR are increasingly used in university course work. Again, a recruiting advantage for RD/PEG.
Also, there are better parser generators than bison even without the graphical editing. COCO/R for example. Originally developed by Moessenboeck (co-author of Oberon-2) in Modula-2, it has since been ported to Pascal by Pat Terry and it generates RD parsers written in Pascal. It's open source, too.
Oh look, another subset compiler ...
Yes, indeed a subset compiler for the purpose of illustrating the text in the book. It is well written and it explains most if not all of the important concepts and techniques you would encounter when writing a Pascal compiler.
Or would you rather recommend the Dragon book to a potential new contributor who says that compiler construction is new to them? I doubt that will increase your recruiting chances.
I am beginning to sense that you are right about your prospects of finding new recruits amongst the generation of software engineers who are now at university or have just graduated. It would indeed seem more difficult with the kinds of choices you seem to want to impose.
Anyway, good luck.
Kevan Hashemi wrote:
Maybe you misunderstand how parser generators are used.
I guarantee you that's the case for me. I read through Bison and I liked it very much, and I am thoroughly convinced of its merit by your ^C example. But my first look through the Bison manual page suggested that the output was a C program that we would then compile and link to, which means that at least part of our Pascal compiler would be written in C.
Given the extent of my inexperience, but at the same time my eagerness to help, perhaps you could explain to me how we could use Bison on the way to a self-compiling Pascal, or how the C-code output of Bison (if that's indeed what it puts out) could be part of the project in a way that left us independent of the GCC back-end problems.
The language of intermediate auto-generated code doesn't matter, as long as there is a stable compiler for it. For C that's certainly the case. So, we don't need to care about the language of Bison output. It is just the handwritten code we prefer to be Pascal.
Regards,
Adriaan van Os
Objective Modula-2 wrote:
Frank Heckenbach wrote:
However, if you want your compiler front end to be written in Pascal, I would recommend against bison. The most natural way to write a Pascal compiler would be to write an RD parser by hand. RD parsers have the advantage that it is possible to understand what they do by reading the source code. For each rule in the grammar there is a corresponding Pascal function/procedure. Table-driven parsers have their logic in data tables, and therefore it is very difficult to figure out what is happening by reading the code; you really need a debugger to get an understanding.
Maybe you misunderstand how parser generators are used.
Most certainly not. I have used and tried just about every free tool there is.
So why do you claim you need a debugger for understanding? In all my Bison use, I've never needed one, and it would be among the last tools I'd think of. You don't use a debugger when you're confused about your class hierarchy in OOP either, do you? (In both cases, it's just several levels too low.)
You don't read and debug the generated tables. Instead you read and write the grammar rules whose syntax is similar to BNF, i.e. what you find in standard documents and such. To debug problems in your grammar, you look e.g. at Bison's verbose output. To debug problems when Bison mis-translates your rules -- well, I've yet to encounter that ...
The reality is that you have used this tool for 10 years or longer and the base from which to possibly recruit one or more new maintainers is unlikely to have this experience.
Except, as I wrote initially, the grammar is exactly one of those areas where we wouldn't need to spend much work, because it exists and works already. Even if the semantic actions have to be rewritten, and even if the Bison parser code has to be translated to Pascal at some point, the grammar rules and their logic are not affected.
The learning curve is far steeper with table-driven parser generators. Just hang out for a few days in the IRC support channels #parsers and #compilers and you will see how confused both newbies and not-so-newbie-anymores are when they use yacc/bison.
Sorry, I don't have a few days to spend on this argument.
Moreover, in recent years the trend has been to move away from yacc/bison and move towards RD and PEG. There is an entire generation of newer tools that build RD parsers, both conventional and memoising, such as ANTLR for example.
How powerful are they? GPC with its mix of dialects needs even more than LALR(1), i.e. Bison's GLR parser.
People who write RD parsers by hand generally calculate FIRST and FOLLOW sets and prove that their grammar contains no ambiguities. It seems you have run into somebody who didn't do that, but that doesn't mean that this is how it's done.
Borland apparently didn't prove their grammars (because their grammar did contain ambiguities).
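(In case anyone wonders what such an ambiguity looks like: the classic textbook case, which goes back to original Pascal and is resolved only by a disambiguation rule outside the grammar proper, is the dangling else. The toy program below is just an illustration, not necessarily the ambiguity Borland stumbled over.)

    program DanglingElse;
    var
      a, b: integer;
    begin
      a := 1;
      b := -1;
      { The rule  statement = "if" expr "then" statement [ "else" statement ]
        is ambiguous: the "else" below could complete either "if".
        Compilers resolve it towards the nearest "if", so despite the
        indentation this prints 'a is not positive' although a > 0. }
      if a > 0 then
        if b > 0 then
          writeln('both positive')
      else
        writeln('a is not positive')
    end.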
To the FPC people who are reading this, how do you verify your grammar?
Some do their proofs by hand, some write a tool to do the proof using the grammar as input and some use an existing tool. Just because you write your parser by hand doesn't mean you have to do the proof by hand.
I still don't quite see the point. With RD, you specify the grammar in a formal way and write the parser manually, and it's verified automatically. With Bison you specify the grammar formally, and the parser is generated and verified automatically. The former seems more work to me.
And the available (free) tools for designing and verifying grammars for RD and PEG parsers are far more suitable for newbies. [...] Or would you rather recommend the Dragon book to a potential new contributor who says that compiler construction is new to them? I doubt that will increase your recruiting chances.
Sorry, the GPC grammar is *not* something for newbies. Even if I sound condescending: the GPC grammar is about the last thing I'd recommend to someone new to compiler construction.
Also, the new generation of tools such as ANTLR are increasingly used in university course work. Again, a recruiting advantage for RD/PEG.
I've only read about PEGs, but it seems to me they "resolve" ambiguities (and non-ambiguities) simply by shifting this burden onto the grammar developer, who has to specify the correct order of alternatives.
Indeed, the Wikipedia page says:
: In order to express a grammar as a PEG, the grammar author must : convert all instances of nondeterministic choice into prioritized : choice. Unfortunately, this process is error prone, and often : results in grammars which mis-parse certain inputs.
The nice thing about Bison is that I can just throw some rules at it (e.g. mixed from different dialects), and it tells me whether there are any conflicts. With PEG, I would have to choose an ordering myself; there wouldn't be conflicts (by definition), but I could never be sure whether the parser actually parsed all I intended.
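To illustrate with a deliberately silly example (my own, not taken from any real PEG grammar): prioritized choice behaves like an if/else chain in Pascal, so an unlucky ordering silently loses inputs instead of reporting a conflict.

    program PegOrder;
    var
      input: string;

    { Succeeds if input starts with pat -- a simplified PEG literal match. }
    function Match(pat: string): boolean;
    begin
      Match := Copy(input, 1, Length(pat)) = pat
    end;

    begin
      input := '<=';
      { The PEG rule  relop <- "<" / "<="  corresponds to this chain:
        "<" is tried first and succeeds, so the "<=" alternative is
        unreachable -- and by definition no conflict is reported. }
      if Match('<') then
        writeln('matched "<", the "<=" alternative is dead')
      else if Match('<=') then
        writeln('matched "<="')
    end.

A conflict-checking tool would flag the overlap; a PEG by definition cannot.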
I am beginning to sense that you are right about your prospects of finding new recruits amongst the generation of software engineers who are now at university or have just graduated. It would indeed seem more difficult with the kinds of choices you seem to want to impose.
Yes, my preference is not rewriting what's not broken (e.g., the grammar) for the sake of it, nor implementing a compiler for a subset language (even if it's a very clean subset) that may or may not grow in 10 years into something more or less similar to GPC, but something that can actually compile my existing (not hypothetical) programs, and thus would need to be (mostly) backward-compatible with the existing GPC.
Frank
On 03 Aug 2010, at 11:40, Frank Heckenbach wrote:
To the FPC people who are reading this, how do you verify your grammar?
We don't. And yes, we have had to break backwards compatibility a few times in order to fix ambiguities that had crept in (e.g. http://wiki.freepascal.org/User_Changes_2.4.0#Order_of_field_and_method.2Fpr... ).
Jonas
Frank Heckenbach wrote:
So why do you claim you need a debugger for understanding?
There is a difference between understanding how to feed input into a tool and understanding what the output coming out of the tool is doing.
There are a number of reasons why it is generally a good idea for contributors to be able to figure out what the code is doing, not only knowing how it is used. One example where this is important in a compiler is implementing error handling and recovery, so that meaningful error messages can be generated and phantom errors are avoided, i.e. errors that aren't really there but only appear as a result of a previous error that threw the parser off.
The reality is that you have used this tool for 10 years or longer and the base from which to possibly recruit one or more new maintainers is unlikely to have this experience.
Except, as I wrote initially, the grammar is exactly one of those areas where we wouldn't need to spend much work, because it exists and works already. Even if the semantic actions have to be rewritten, and even if the Bison parser code has to be translated to Pascal at some point, the grammar rules and their logic are not affected.
You will find that I was pretty much the only one in this entire discussion who made a plea to not change anything but find new maintainers to keep GNU Pascal going as a GCC front end, then let the new maintainers decide if they want to make any changes to the implementation or not.
In fact, considering the FSF's guidelines on the use of the GNU moniker, it is quite possible that the FSF might withdraw their authorisation to call the project GNU Pascal if it abandoned GCC in favour of LLVM.
Nevertheless, despite my plea for keeping GNU Pascal the way it is so that there continues to be a Pascal front end for GCC, some people here did want to explore other routes, so I responded to their questions.
The two things that came up the most were 1) can the compiler be written in Pascal and 2) can the compiler target LLVM without having to link directly to any API that may break it when changed.
My comments were specifically targeted at these questions, and I was also taking into account that any such undertaking would require the recruitment of new contributors, probably from a pool of people who don't have much, if any, experience with compiler construction.
I think you will find that GIVEN THE AFOREMENTIONED CONSTRAINTS, my recommendations are measured and appropriate.
I do understand, however, that your comments are geared towards a different scenario that doesn't necessarily involve new recruits, as you seem to be mostly interested in simply adding a C++ target to the existing compiler and leaving the GCC back end alone until perhaps some day a new maintainer shows up who might want to update it. Since you are already familiar with your own code, if you are yourself doing the work of adding a C++ back end, there is of course no imminent need for rewrites.
Yet, my comments were not targeted at that scenario. Instead I was responding to Kevan's LLVM scenario questions.
Moreover, in recent years the trend has been to move away from yacc/bison and towards RD and PEG. There is an entire generation of newer tools that build RD parsers, both conventional and memoising, such as ANTLR for example.
How powerful are they? GPC with its mix of dialects needs even more than LALR(1), i.e. Bison's GLR parser.
ANTLR does LL(*), which means infinite (as in arbitrary) lookahead. I'd be surprised if GPC's grammar could not be expressed in LL(*).
Note that Clang uses a handwritten RD parser even for the C++ implementation which I believe is based on LL(*) too. The often recited statement that LL is not powerful enough to implement "real languages out there" is just a myth.
The benefits of RD parsers are there. People smarter than you and me will confirm that. Niklaus Wirth is a strong proponent of hand-coded RD parsers. Moessenboeck changed his seminal work (COCO) from LALR to LL. Tom Pittman has also been on record in favour of RD.
And there are strong proponents amongst today's generation of highly acclaimed scholars, for example Terence Parr and Chris Lattner. Chris Lattner has just received an ACM award for his work on Clang and LLVM. So you can make fun of me for my preference for hand-written RD all day long, but I doubt you have enough clout to make fun of Chris Lattner, who is also an outspoken proponent of hand-written RD parsers.
People who write RD parsers by hand generally calculate FIRST and FOLLOW sets and prove that their grammar contains no ambiguities. It seems you have run into somebody who didn't do that, but that doesn't mean that's how it's done.
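(To illustrate the procedure with a minimal sketch: for a hypothetical rule S -> 'a' A | 'b', one computes the FIRST sets of the alternatives, checks that they are disjoint, and lets them select the branch; lookahead, match and syntax_error are assumed helpers:)

  /* Hypothetical rule:   S -> 'a' A | 'b'
     FIRST('a' A) = { a },  FIRST('b') = { b }.
     The alternatives' FIRST sets are disjoint (and no alternative is
     nullable, so FOLLOW(S) doesn't come into play), hence the rule is
     LL(1) and one token of lookahead selects the branch: */
  extern int lookahead;
  void match(int tok);
  void A(void);
  void syntax_error(const char *msg);

  void S(void)
  {
      switch (lookahead) {
      case 'a': match('a'); A(); break;  /* 'a' is in FIRST('a' A) */
      case 'b': match('b'); break;       /* 'b' is in FIRST('b')   */
      default:  syntax_error("'a' or 'b' expected");
      }
  }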
Borland apparently didn't prove their grammars (because their grammar did contain ambiguities).
Well, shame on them.
A year ago we had an Indian or Bangladeshi neighbour who always parked his bicycle in my spot which got me into trouble with the landlord because I had to bring my bicycle into the building. It doesn't mean all Indians/Bangladeshis do this. It was just this one guy and he was wrong.
I still don't quite see the point. With RD, you specify the grammar in a formal way and write the parser manually, and it's verified automatically. With Bison you specify the grammar formally, and the parser is generated and verified automatically.
If we had wanted to use a generator, we'd have either used COCO/R or ANTLR, both of which generate human readable RD parsers in several output languages. COCO can generate output in Pascal, Modula-2 and Oberon and others. ANTLR can generate output in Java, C, C++, Python, Oberon and others.
However, our compiler is meant to be an entirely self-contained bootstrap kit. We wanted to make it as easy as possible for anyone to bootstrap the compiler anywhere, without having to worry about what libraries or tools might be there and whether or not they are up to date. Our bootstrap compiler has no dependencies other than a C compiler and stdio/stddef.
The former seems more work to me.
Appearances can be deceiving. The time spent coding the parser is only a fraction of the work you have to do on things a generator won't do for you anyway.
In any event, the summary of my recommendations was and still is this:
1) Continue GPC as a GCC front end, don't change it, try to find a new generation of maintainers willing to update to newer versions of GCC as GCC progresses.
2) If you really must start a new Pascal project targeting LLVM, use an RD parser written in Pascal and generate LLVM pseudo-assembly. Don't call it GPC then; use a name that avoids confusion and is likely to give you an advantage in finding new contributors (riding atop the LLVM buzz).
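To illustrate why 2) sidesteps the API-breakage question: LLVM pseudo-assembly is plain text, so a back end can emit it with nothing but text output. A minimal hypothetical sketch (the function name and the Pascal declaration it stands for are made up):

  #include <stdio.h>

  /* What a Pascal "function Add (a, b: Integer): Integer" might lower
     to, emitted as plain text with nothing but stdio: */
  void emit_add_function(FILE *out)
  {
      fputs("define i32 @Add(i32 %a, i32 %b) {\n"
            "entry:\n"
            "  %sum = add i32 %a, %b\n"
            "  ret i32 %sum\n"
            "}\n", out);
  }

The resulting file is then handed to the LLVM tools (opt, llc) for optimisation and native code generation; a change in the LLVM C++ API cannot break such a compiler, only a change in the textual IR could.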
Jonas Maebe wrote:
On 03 Aug 2010, at 11:40, Frank Heckenbach wrote:
To the FPC people who are reading this, how do you verify your grammar?
We don't. And yes, we have had to break backwards compatibility a few times in order to fix ambiguities that had crept in (e.g. http://wiki.freepascal.org/User_Changes_2.4.0#Order_of_field_and_method.2Fpr... ).
Indeed, a typical example. We've had some similar cases (including some we had to support for compatibility with this or that dialect).
Without trying it out, I suppose this case would have produced an S/R conflict in Bison, though it would perhaps have been possible to parse it with some tricks, or otherwise with GLR. (Not saying that you should have -- near-ambiguous grammar is generally not nice, and if there were no major backward-compatibility problems, it was probably better to get rid of it.)
Frank
Objective Modula-2 wrote:
Frank Heckenbach wrote:
So why do you claim you need a debugger for understanding?
There is a difference between understanding how to feed input into a tool and understanding what the output coming out of the tool is doing.
There are a number of reasons why it is generally a good idea for contributors to be able to figure out what the code is doing, not only knowing how it is used. One example where this is important in a compiler is implementing error handling and recovery, so that meaningful error messages can be generated and phantom errors are avoided, i.e. errors that aren't really there but only appear as a result of a previous error that threw the parser off.
Still, a debugger would be too low-level. A debugger is useful when you want to examine what happens on the assembler level, and (already in a more limited way) on the next higher level, here C or C++ (and potentially later, generated Pascal).
But that's not the level where you understand what's actually happening in the grammar, so unless you want to debug the actual parsing algorithm (which is "fixed", and which I've never had the need to examine myself), the debugger is not well suited.
To understand grammar issues that you need to "debug" at runtime (i.e., in contrast to conflicts etc., which appear at build time and can (and should) be statically analyzed), it's more useful to watch which reductions the parser makes etc., e.g. by using YYDEBUG.
I can understand that if you're only used to RD parsers you might consider the position in the parser code as *the* information to track parser issues, and I suppose that's why you're used to using debuggers in such cases. With table-driven parsers, it's just different. The code position is mostly irrelevant, but the tokens on the stack and the reductions performed are the important things, so different methods are suitable to track them (e.g., YYDEBUG output rather than debuggers). One can argue which is "better", and if you want to engage in this discussion, I'd point out that not having to use a debugger is an indication to me of being higher-level.
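(For reference, enabling the trace is a two-liner; a minimal sketch:)

  /* In the .y file: compile in trace support. */
  %debug

  /* In the driver, before calling yyparse(): */
  extern int yydebug;
  yydebug = 1;  /* every shift, reduction and lookahead token is now
                   reported on stderr */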
The reality is that you have used this tool for 10 years or longer and the base from which to possibly recruit one or more new maintainers is unlikely to have this experience.
Except, as I wrote initially, the grammar is exactly one of those areas where we wouldn't need to spend much work, because it exists and works already. Even if the semantic actions have to be rewritten, and even if the Bison parser code has to be translated to Pascal at some point, the grammar rules and their logic are not affected.
You will find that I was pretty much the only one in this entire discussion who made a plea to not change anything but find new maintainers to keep GNU Pascal going as a GCC front end, then let the new maintainers decide if they want to make any changes to the implementation or not.
However, these are two different issues, the parser (which is like the front of the frontend) and the backend. I, as stated, prefer to keep the parser and replace the backend.
The two things that came up the most were 1) can the compiler be written in Pascal and 2) can the compiler target LLVM without having to link directly to any API that may break it when changed.
Neither is incompatible with keeping Bison (except that for 1) the parsing algorithm in Bison would need to be ported to Pascal).
I do understand, however, that your comments are geared towards a different scenario that doesn't necessarily involve new recruits, as you seem to be mostly interested in simply adding a C++ target to the existing compiler and leaving the GCC back end alone until perhaps some day a new maintainer shows up who might want to update it.
Then you got me wrong. I suggested rewriting all TREE_NODE related parts of the current compiler, which would completely break compatibility with the GCC backend.
Moreover, in recent years the trend has been to move away from yacc/bison and towards RD and PEG. There is an entire generation of newer tools that build RD parsers, both conventional and memoising, such as ANTLR for example.
How powerful are they? GPC with its mix of dialects needs even more than LALR(1), i.e. Bison's GLR parser.
ANTLR does LL(*), which means infinite (as in arbitrary) lookahead.
No, it doesn't. It means DFA-based lookahead (see below for an actual case where this is important).
Note that Clang uses a handwritten RD parser even for the C++ implementation which I believe is based on LL(*) too. The often recited statement that LL is not powerful enough to implement "real languages out there" is just a myth.
It seems hard to find concrete information on this topic.
http://www.codecommit.com/blog/scala/unveiling-the-mysteries-of-gll-part-1 says: "It's difficult to formally analyze the non-commutative LL(*) techniques, and so theorists tend to be a little unclear as to exactly how powerful the techniques are with respect to more well defined classes like LL and LR. However, it is generally assumed that non-commutative LL(*) is strictly more powerful than LL(k) but likely less powerful than LR(k)"
http://webcache.googleusercontent.com/search?q=cache:DxL1IM-pb_0J:www.antlr.... (by Terence Parr, your witness) says: "LR(k) even with k=1 is generally more powerful than LL(*) or at least more efficient for same grammar, but there is no strict ordering".
And GPC's grammar isn't even LALR(1) (we need GLR for a few problematic cases), and therefore likely also not LR(1). (I'm not saying that GPC is the archetypical "real language out there". It clearly isn't, because it mixes wildly differing dialects. But it's the language that this discussion is ultimately about.)
An actual example from GPC's grammar:
  Dispose (<expression>, <discriminant-list>)                          (EP)
vs.
  Dispose (<expression>, <destructor-name> <destructor-parameters>)   (BP)
Since expressions are, of course, not a regular language (not recognisable by a DFA), LL(*) can't parse this, at least not without refactoring the grammar (which destroys the "more natural grammar" argument usually given in support of LL -- which I never actually believed since I saw the necessary workarounds to avoid left-recursion for such easy cases as left-associative operators like "-" and "/").
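(To illustrate the left-recursion point with a minimal sketch; Node, term, mk_binop and the scanner helpers are hypothetical:)

  /* The natural, left-recursive rule (fine for Bison/LR):

       expr : expr '-' term | term ;

     The LL/RD version must eliminate the left recursion, typically by
     iteration, and rebuild the left-associative tree by hand: */
  typedef struct Node Node;
  extern int lookahead;
  void  match(int tok);
  Node *term(void);
  Node *mk_binop(int op, Node *left, Node *right);

  Node *expr(void)
  {
      Node *left = term();
      while (lookahead == '-') {
          match('-');
          left = mk_binop('-', left, term());  /* associate to the left */
      }
      return left;
  }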
The Bison specification in this case, however, is quite natural (which doesn't prove that it's more natural in all cases, but disproves that LL grammars are generally more natural):
  builtin_procedure_statement:
      [...]
    | p_Dispose '(' expression ',' actual_parameter_list ')' %dprec 1
    | p_Dispose '(' expression ',' new_identifier
          builtin_actual_parameter_list ')' %dprec 2
In fact, this is one of the few cases that's even ambiguous, since <discriminant-list> and <destructor-name> can derive an identifier and <destructor-parameters> is nullable. That's why we need "%dprec" here (GLR parser) and have to disambiguate in the action.
These are among the most difficult problems (in fact I chose one of them for the example so I could simply search for "%dprec" in the source(1)). I suppose there are other cases that are not ambiguous and don't need GLR, but still are not [naturally] LL(*), but I don't know offhand how to quickly identify them. Note that the ambiguity is just an added complication -- even without it, i.e. if actual_parameter_list didn't have the identifier token in its FIRST set, the above case wouldn't be LL(*) in the natural way.
(1) As a side note, this part of the grammar was written several years ago -- I don't even remember whether by Waldek or by me, and I wouldn't have recalled it offhand as a problematic case. Yet when I found it searching for "%dprec", I had no problem understanding the rules or seeing why there was an ambiguity, just looking at them briefly. Another piece of evidence against the "LR grammars are hard to read" argument.
The benefits of RD parsers are there. People smarter than you and me will confirm that. Niklaus Wirth is a strong proponent of hand-coded RD parsers. Moessenboeck changed his seminal work (COCO) from LALR to LL. Tom Pittman has also been on record in favour of RD.
And there are strong proponents amongst today's generation of highly acclaimed scholars, for example Terence Parr and Chris Lattner. Chris Lattner has just received an ACM award for his work on Clang and LLVM. So you can make fun of me for my preference for hand-written RD all day long, but I doubt you have enough clout to make fun of Chris Lattner, who is also an outspoken proponent of hand-written RD parsers.
How many people do you want me to quote in favour of LR? Or should we stop name dropping and discuss actual merits?
I did some googling and found surprisingly little on the subject -- perhaps I used the wrong search terms, in which case you might give me some pointers (after all, it's your argument, so it's up to you to back it up). I won't make fun of something or someone I know nothing about -- I make fun of Borland Pascal because I actually know quite a bit about it.
What I found, apart from the above two references is http://bit.ly/9KILMe (the link points to "Compiler Construction - The Art of Niklaus Wirth" by Mössenböck, page 60, at books.google.com; I think the link will expire soon, but you can search by the title):
"Wirth always designed his languages so that recursive descent parsing could be applied" (which, according to the previous paragraph, he takes to mean LL(1), which is actually more restrictive that what you talk about). I absolutely agree with this when designing a language from scratch. But that's not the situation we're in with GPC; we have an existing language (mix) that, if anything, will only be extended, so "it would be nice it if were LL(1)" doesn't help, since it isn't so.
"If the designers of C were forced to make their language LL(1) it would have probably become more readable." The problem here is the "forced" bit. Writing a manual parser without formal verification just doesn't do this. Tellingly, the very next sentence reads: "Unfortunately even Wirth's languages contain a few constructs which are not LL(1)." So, apparently he wasn't quite forced to make it LL(1), he just tried to (and, AFAIK, succeeded mostly, but not entirely).
Another reason given for RD parsing is that it's (supposed to be) faster. I'm not convinced that it is, and even if so, it may be negligible. For comparison, on the previous page in the same book it says: "... since the scanner is one of the most time consuming parts of a compiler".
That might have been true in the 1970s, but not today. If anything, the lexer is one of the *least* time consuming parts of GPC. Yes, it has to touch every character of the program (whereas the parser already works on tokens), but it does so in guaranteed linear time, and even with the overhead of storing locations etc., it simply doesn't take long. (Anybody who's worked with text files recently can confirm that the time it takes to process files of the typical length of source files is negligible, if the processing itself is not complicated, which it isn't in the scanner.) That's even more true in Pascal, compared to C, C++ and other languages which, due to includes, have to scan and parse the same declarations again and again (unless using precompiled headers).
Of course, the most time-consuming part of GPC is the optimizer. Wirth's opinions on optimizers, stated a few pages before, are also outdated IMHO. Modern processor architecture allows many optimization possibilities that are too complex for humans to do (to a non-trivial extent). Different architectures have different characteristics, and programmers should not be burdened with having to know about and care for them. "Generate good code from the beginning, don't generate bad code and optimize later" is in direct contrast to SSA, which seems to have a wide consensus among compiler designers today, including GCC and LLVM. With all due respect, I think this is a case of straying too far outside of his area of expertise. He was a brilliant language designer, but not necessarily so good at code generation. (I'm neither, but I don't claim to be, nor do I try to prescribe to optimizer writers how to do their jobs.)
People who write RD parsers by hand generally calculate FIRST and FOLLOW sets and prove that their grammar contains no ambiguities. It seems you have run into somebody who didn't do that, but that doesn't mean that's how it's done.
Borland apparently didn't prove their grammars (because their grammar did contain ambiguities).
Well, shame on them.
A year ago we had an Indian or Bangladeshi neighbour who always parked his bicycle in my spot which got me into trouble with the landlord because I had to bring my bicycle into the building. It doesn't mean all Indians/Bangladeshis do this. It was just this one guy and he was wrong.
So far it's 1-3 WRT Pascalish/Wirthian languages (Borland, FPC, and the quote about Wirth above vs. OM2, if I assume it does it properly). Other compiler authors (Scott?) may chime in, but up to now I take your "this is how it's done" to mean "this is how it should be done ideally".
I still don't quite see the point. With RD, you specify the grammar in a formal way and write the parser manually, and it's verified automatically. With Bison you specify the grammar formally, and the parser is generated and verified automatically.
If we had wanted to use a generator, we'd have either used COCO/R or ANTLR, both of which generate human readable RD parsers in several output languages. COCO can generate output in Pascal, Modula-2 and Oberon and others. ANTLR can generate output in Java, C, C++, Python, Oberon and others.
However, our compiler is meant to be an entirely self-contained bootstrap kit. We wanted to make it as easy as possible for anyone to bootstrap the compiler anywhere, without having to worry about what libraries or tools might be there and whether or not they are up to date. Our bootstrap compiler has no dependencies other than a C compiler and stdio/stddef.
Well, this is also possible with a generator, especially if the generated code is not target-dependent (as is the case with Bison and Flex). Look at the GPC source distribution. Bison is only required if you change the grammar (where in your case you also need the verification tool, IIUYC).
The former seems more work to me.
Appearances can be deceiving. The time spent coding the parser is only a fraction of the work you have to do on things a generator won't do for you anyway.
So what are those things that a generator won't do for you, but an RD verifier (or what?) will? It's got to be something major, so that two steps (writing the parser and specifying the grammar for the verifier) become easier than one (writing the grammar for the parser generator), so what is it?
Frank
On Tue, Aug 3, 2010 at 4:49 AM, Kevan Hashemi hashemi@brandeis.edu wrote:
I guarantee you that's the case for me. I read through Bison and I liked it very much, and I am thoroughly convinced of its merit by your ^C example. But my first look through the Bison manual page suggested that the output was a C program that we would then compile and link to, which means that at least part of our Pascal compiler would be written in C.
There is software very similar to Bison, written in Pascal for Pascal applications. It is called plex / pyacc and is included in Free Pascal. Here is one project which uses it:
http://svn.freepascal.org/cgi-bin/viewvc.cgi/trunk/utils/h2pas/
And here I started documenting it in our wiki:
http://wiki.lazarus.freepascal.org/Plex_and_Pyacc
Unfortunately I lost my previous example app which used plex and pyacc in a HD crash. I tried to reconstitute it in the wiki, but it doesn't work yet; still, this simple example should already help someone using plex and pyacc a lot more than no small example at all.
A team could use that and also improve plex / pyacc to build a compiler.
Felipe Monteiro de Carvalho wrote:
On Tue, Aug 3, 2010 at 4:49 AM, Kevan Hashemi hashemi@brandeis.edu wrote:
I guarantee you that's the case for me. I read through Bison and I liked it very much, and I am thoroughly convinced of its merit by your ^C example. But my first look through the Bison manual page suggested that the output was a C program that we would then compile and link to, which means that at least part of our Pascal compiler would be written in C.
There is software very similar to Bison, written in Pascal for Pascal applications. It is called plex / pyacc and is included in Free Pascal. Here is one project which uses it:
http://svn.freepascal.org/cgi-bin/viewvc.cgi/trunk/utils/h2pas/
And here I started documenting it in our wiki:
I've only briefly looked at it so far, but one important question I have is: does it support GLR or some other larger class of languages than the traditional LALR(1)? Yacc doesn't, Bison does.
As I said, GPC requires more than LALR(1) already (and with the addition of new features, it can only get worse).
Of course, it being free software, if it doesn't, it could be extended -- which would raise the question whether it's easier to do this or to port the (existing and working) Bison GLR implementation to Pascal. (Given that the project pages talk about Turbo Pascal a lot, I have my doubts. Accuse me of Borland bashing, but from experience, when given random samples of C vs. TP code, I'd usually rather work on the former.)
Frank