10206 lexical ambiguity

List overview All Threads
Download

newer

older

Two Dimensional Arrays

function-pointer

Paul Isaacs

19 Jan 2017 19 Jan '17

7:48 p.m.

Hello,

Am I correct that, without semantic type tie-ins, that the production

new-ordinal-type = enumerated-type | subrange-type ;

is ambiguous because (a) could be both an enumerated type and the expression for the lower bound of a subrange?

Regards,

Paul Isaacs

Show replies by thread

Bastiaan Veelo

19 Jan 19 Jan

7:57 p.m.

...

On 19 Jan 2017, at 19:48, Paul Isaacs paul@redpineinstruments.org wrote:

Hello,

Am I correct that, without semantic type tie-ins, that the production

new-ordinal-type = enumerated-type | subrange-type ;

is ambiguous because (a) could be both an enumerated type and the expression for the lower bound of a subrange?

Where do you see the ambiguity? enumerated-type is lexically different from subrange-type.

Bastiaan.

Paul Isaacs

9:54 p.m.

On 19/01/17 01:57 PM, Bastiaan Veelo wrote:

...

...
On 19 Jan 2017, at 19:48, Paul Isaacs paul@redpineinstruments.org wrote:

Hello,

Am I correct that, without semantic type tie-ins, that the production

new-ordinal-type = enumerated-type | subrange-type ;

is ambiguous because (a) could be both an enumerated type and the expression for the lower bound of a subrange?

Where do you see the ambiguity? enumerated-type is lexically different from subrange-type.

Bastiaan.

Hello Bastiaan,

x = (a); enumerated-type y = (a)..5 subrange-type where (a) is a subrange-bound, the subrange-bound production = expression and (a) is a lexically valid expression.

This would require a lookahead of 3 tokens to resolve. I believe the case is unique because "( a," can not be an expression so must be an enumerated. This still requires a lookahead of 2 tokens to resolve.

The 10206 spec does not contain either lookahead or lalr so maybe requiring more than 1 token of lookahead is allowed.

Regards,

Paul

Peter

20 Jan 20 Jan

midnight

On 19/01/17 20:54, Paul Isaacs wrote:

...

On 19/01/17 01:57 PM, Bastiaan Veelo wrote:

...
...
On 19 Jan 2017, at 19:48, Paul Isaacs paul@redpineinstruments.org wrote:

Hello,

Am I correct that, without semantic type tie-ins, that the production

new-ordinal-type = enumerated-type | subrange-type ;

is ambiguous because (a) could be both an enumerated type and the expression for the lower bound of a subrange?

Where do you see the ambiguity? enumerated-type is lexically different from subrange-type.

Bastiaan.

Hello Bastiaan,

x = (a); enumerated-type y = (a)..5 subrange-type where (a) is a subrange-bound, the subrange-bound production = expression and (a) is a lexically valid expression.
                This would require a lookahead of 3 tokens to resolve.
                I believe the case is unique because "( a," can not
be an expression so must be an enumerated. This still requires a lookahead of 2 tokens to resolve.

The 10206 spec does not contain either lookahead or lalr so maybe requiring more than 1 token of lookahead is allowed.

Regards,

Paul

Gpc mailing list Gpc@gnu.de https://www.g-n-u.de/mailman/listinfo/gpc

Hi Paul,

As the grammer for new_ordinal_type is

new_ordinal_type: enumerated_type | subrange_type ;

(a) would still be parsed as "new_ordinal_type".

from that point, only one token, either ; or .. is needed to resolve which "new_ordinal_type" we have. Don't see the problem.

Regards, Peter

Paul Isaacs

2:22 a.m.

On 19/01/17 06:00 PM, Peter wrote:

...

As the grammer for new_ordinal_type is new_ordinal_type: enumerated_type | subrange_type ; (a) would still be parsed as "new_ordinal_type". from that point, only one token, either ; or .. is needed to resolve which "new_ordinal_type" we have. Don't see the problem. Regards, Peter

Hello Peter,

Ahhh. Some light. If the parse is shift-reduce a-la bison, yes.

If the parse is recursive descent, it may be different story. The enumerated/subrange choice can not be delayed.

I am too green to know but I find it hard to believe that the type classification of a grammar ( LALR( 1 ), etc. ) depends on the parsing method.

I have chosen recursive descent. It is easier to understand and the error messages are more meaningful because the parser always knows where it is in the productions. The probability is higher that others ( who are not bison gurus and who want to stick with pascal source ) may take an interest in the compiler.

I have a circular buffers at both the lexical and token levels whose size can be set so the lookahead is not a problem.

Up until this point, I have not had a lookahead problem ( at least that I am aware of ) and it has been my intention to generate a phase 1 parser for an eventual compiler that used strictly lexical information.

However, I am now leaning towards the idea that using the semantic hints as provided by the multiple identifier types in the productions may be wiser. Type mismatches would be detected and reported much earlier at the expense of maintaining a symbol table stack and the parser would more closely follow the 20106 standard. Is this: true? wise?

Regards and thanks,

Paul

Waldek Hebisch

8:25 a.m.

Paul Isaacs wrote:

...

On 19/01/17 06:00 PM, Peter wrote:

...
As the grammer for new_ordinal_type is new_ordinal_type: enumerated_type | subrange_type ; (a) would still be parsed as "new_ordinal_type". from that point, only one token, either ; or .. is needed to resolve which "new_ordinal_type" we have. Don't see the problem. Regards, Peter

Hello Peter,

Ahhh. Some light. If the parse is shift-reduce a-la bison, yes.

If the parse is recursive descent, it may be different story. The enumerated/subrange choice can not be delayed.

I am too green to know but I find it hard to believe that the type classification of a grammar ( LALR( 1 ), etc. ) depends on the parsing method.

I have chosen recursive descent. It is easier to understand and the error messages are more meaningful because the parser always knows where it is in the productions. The probability is higher that others ( who are not bison gurus and who want to stick with pascal source ) may take an interest in the compiler.

I have a circular buffers at both the lexical and token levels whose size can be set so the lookahead is not a problem.

Up until this point, I have not had a lookahead problem ( at least that I am aware of ) and it has been my intention to generate a phase 1 parser for an eventual compiler that used strictly lexical information.

However, I am now leaning towards the idea that using the semantic hints as provided by the multiple identifier types in the productions may be wiser. Type mismatches would be detected and reported much earlier at the expense of maintaining a symbol table stack and the parser would more closely follow the 20106 standard. Is this: true? wise?

1) lexical level deals with tokens. Correct question is if there is syntactic ambiguity in ISO grammar. And the answer is that your example is _not_ an ambiguity. 2) Of course ability of parser to deal with tricky construct depends on parsing method. Basicaly simple recursive descent parser (LL(1)) must recognize a construct seeing first token of that construct (or a following token if the construct is empty). LR(1) parser must recognize a construct seeing first token _after_ the construct. In other words, LL(1) parser makes decisions quite early and may have trouble with constructs like "new_ordinal_type". LR(1) has no trouble with this. 3) LL(1) parser can be modified to work more like LR(1), namely you handle common prefix separately and once you can decide which branch to take you follow that branch. 4) IME using semantic information during parse is very problematic. Namely, semantic information for lookahead tokens is not available: you can put semantic information into symbol table only _after_ you recognized declaration. In many cases this does not cause problems, but you may get subtle errors. In particular problem becomes much worse if you decide to use more than one token of lookahead. 5) My preffered way to handle ambiguites which may be resoved using semantic information is to postpone resoving them to semantic stage. For example one could allow enumaration as lower bound of a subrange during parsing and report error only during semantic checking: arguably putting enumeration there is a semantic error not different in spirit from using non-constant expression as lower bound or from type error in lower bound expression. 6) GPC uses GLR parser (which essentially automatically uses whatever lookahead is needed to resolve parse) and subranges are main reason for needing GLR. However, ISO subranges are tame, the wild beasts are Borland Pascal extensions.

-- Waldek Hebisch

Paul Isaacs

5:14 p.m.

Waldek,Thanks for the outstanding explanation - particularly the significance of the distinction between LL and LR parsing. I have never seen such a clear explanation before. After about 3 reads I understood the whole thing.

I arbitrarily distinguish lexical tokens as a separate entity because tokens like := need character lookahead at what I term the lexical level. At the "tokens" level, tokens lookahead is needed.

The new-ordinal-type production can be resolved if the LL parser has access to an identifiers symbol table. The first identifier of the enumeration will always be undefined. Any subrange identifier must have been defined previously.

I have not noticed any direct left recursion in the 10206 grammar. I will hope that there are indirect LL recursion traps awaiting.

I will proceed with the assumptions that identifier type information will be similarly useful in other cases and that determining the "type" of an identifier will never be ambiguous in an LL sense.

Thank you for taking the time to write such a comprehensive reply.

Regards,

Paul

Peter

23 Jan 23 Jan

9:51 p.m.

On 20/01/17 16:14, Paul Isaacs wrote:

...

... The first identifier of the enumeration will always be undefined. Any subrange identifier must have been defined previously. ...

Only for correct programs! A missing, or extra definition is a much more likely mistake, than getting the grammar for sub-ranges or enumerations wrong. the error messages produced this way might be a bit confusing ;-)

Regards, Peter

Thorsten Palm

24 Jan 24 Jan

3:52 p.m.

On Mon, 23 Jan 2017, Peter wrote:

...

On 20/01/17 16:14, Paul Isaacs wrote:

...
... The first identifier of the enumeration will always be undefined. Any subrange identifier must have been defined previously. ...

Only for correct programs!

Actually not true at all. The constraint on an identifier in an enumeration (as on any other identifier being defined) is that it not (or rather that no other identifier with the same spelling) have been defined _in the same block_. For example, the fragment

program p; const a=42; procedure q; var x:(a,b,c); {...}

is correct, despite "a" being defined when it occurs in the enumeration.

According to my old Turbo Pascal (version 6.0) reference manual that variant of the language did not allow for the first expression in a subrange specification to start with "(".

...

...

Best regards,

Thorsten

--- Diese E-Mail wurde von Avast Antivirus-Software auf Viren geprüft. https://www.avast.com/antivirus

Paul Isaacs

5:12 p.m.

On 24/01/17 09:52 AM, Thorsten Palm wrote:

...

Actually not true at all. The constraint on an identifier in an enumeration (as on any other identifier being defined) is that it not (or rather that no other identifier with the same spelling) have been defined _in the same block_. For example, the fragment

program p; const a=42; procedure q; var x:(a,b,c); {...}

is correct, despite "a" being defined when it occurs in the enumeration.

Hello Thorsten,

Thanks for the reply.

It would appear that searching for an identifier that has been defined is not the logical inverse of searching for and identifier that has not been defined.

I have my symbol tables set up as a push down stack. Therefore, I will restrict searching for undefined symbols to the top of the stack while searching for a defined symbol will use the entire stack.

Thanks for pointing this out.

Regards,

Paul Isaacs

3097

Age (days ago)

3102

Last active (days ago)

gpc@gnu.de

9 comments

5 participants

tags (0)

participants (5)

Bastiaan Veelo
Paul Isaacs
Peter
Thorsten Palm
Waldek Hebisch