Re: national character sets


Frank Heckenbach

27 Apr 2001

6:19 p.m.

Hi Chief,

...

...
...
I can send you a sample zip file if you want.

Yes, please.

Attached. It was produced by a Danish customer, and inside the zip file is a bitmap that shows exactly what the other filenames should look like (so that the issue is not confused any further). Windows handles them correctly if I call OEMToChar (formerly OEMToAnsi) before displaying the filenames or before calling Rewrite.

I'm using:

UnZip 5.32 of 3 November 1997, by Info-ZIP. Maintained by Greg Roelofs. Send bug reports to the authors at Zip-Bugs@lists.wku.edu; see README for details.

and I cannot exactly say that it "fails most miserably".

E.g., the small ligature ae is coded in the zip file as #$a1, and unzip converts it to #$e6 which is the correct latin1 code for this character.

Both the zip listing (unzip -l) and the directory listing after unpacking look just like the included graphics, both in xterm (XFree86 3.3.2(68)) and on the console after setting the default font (setfont default8x16 ; loadunimap) -- I usually use a different font which does not contain all of the Danish characters, but those that it does are still shown correctly.

So I conclude that the charset conversion code in unzip is indeed quite right.

Frank

-- Frank Heckenbach, frank@g-n-u.de http://fjf.gnu.de/ PGP and GPG keys: http://fjf.gnu.de/plan


Prof Abimbola Olowofoyeku

27 Apr 2001

10:52 p.m.

New subject: national character sets

On 27 Apr 01, at 18:19, Frank Heckenbach wrote:

...

Hi Chief,

...
...
...
I can send you a sample zip file if you want.

Yes, please.

Attached. It was produced by a Danish customer, and inside the zip file is a bitmap that shows exactly what the other filenames should look like (so that the issue is not confused any further). Windows handles them correctly if I call OEMToChar (formerly OEMToAnsi) before displaying the filenames or before calling Rewrite.

I'm using:

UnZip 5.32 of 3 November 1997, by Info-ZIP. Maintained by Greg Roelofs. Send bug reports to the authors at Zip-Bugs@lists.wku.edu; see README for details.

and I cannot exactly say that it "fails most miserably".

YMHV (your mileage has varied ;-))

...

E.g., the small ligature ae is coded in the zip file as #$a1, and unzip converts it to #$e6 which is the correct latin1 code for this character.

Both the zip listing (unzip -l) and the directory listing after unpacking look just like the included graphics, both in xterm (XFree86 3.3.2(68)) and on the console after setting the default font (setfont default8x16 ; loadunimap) -- I usually use a different font which does not contain all of the Danish characters, but those that it does are still shown correctly.

So I conclude that the charset conversion code in unzip is indeed quite right.

Ok. Perhaps it's a font problem with Mandrake 7.1. I am downloading the sources now, to see exactly what portable continuation they have used ...

Best regards, The Chief -------- Prof. Abimbola A. Olowofoyeku (The African Chief) Author of: Chief's Installer Pro for Win32 http://www.bigfoot.com/~African_Chief/chief32.htm Email: African_Chief@bigfoot.com

Frank Heckenbach

11:51 p.m.

Prof Abimbola Olowofoyeku wrote:

...

...
E.g., the small ligature ae is coded in the zip file as #$a1, and unzip converts it to #$e6 which is the correct latin1 code for this character.

Both the zip listing (unzip -l) and the directory listing after unpacking look just like the included graphics, both in xterm (XFree86 3.3.2(68)) and on the console after setting the default font (setfont default8x16 ; loadunimap) -- I usually use a different font which does not contain all of the Danish characters, but those that it does are still shown correctly.

So I conclude that the charset conversion code in unzip is indeed quite right.

Ok. Perhaps it's a font problem with Mandrake 7.1.

Maybe. In this case, it might help to distinguish between a charset (basically, a mapping between characters and numbers) and a font (particular shapes of letters etc.).

You can get a description of the ISO-8859-1 (AKA latin1) charset under Linux with `man iso_8859_1' (if you don't have it on your system, I can send you the file, it's rather small), and even without any suitable font, you could check (with a hex viewer or something) that the characters generated by unzip match this encoding, or use it to do any conversion.
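For readers without a hex viewer at hand, the check suggested above could be sketched in C roughly as follows. The function name `hex_bytes` is made up for this illustration and does not come from unzip or any library; it only prints the raw byte values of a filename so they can be compared against the ISO-8859-1 table.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the bytes of `name` into `out` as
   space-separated two-digit hex values, e.g. "e6 62 6c 65".
   Returns 0 on success, -1 if `out` is too small. */
int hex_bytes(const char *name, char *out, size_t out_size)
{
    assert(name != NULL && out != NULL);

    size_t len = strlen(name);
    /* Three characters per byte ("xx "); the last space slot holds the NUL. */
    size_t needed = (len == 0) ? 1 : len * 3;
    if (out_size < needed)
        return -1;

    out[0] = '\0';
    for (size_t i = 0; i < len; i++) {
        /* Go through unsigned char so bytes >= 0x80 are not sign-extended. */
        sprintf(out + i * 3, (i + 1 < len) ? "%02x " : "%02x",
                (unsigned char)name[i]);
    }
    return 0;
}
```

If a filename that should start with the small ae ligature shows `e6` as its first byte, the name is in latin1 and any remaining display problem is more likely a font issue than a charset issue.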

(I don't know the encoding used in zip files, but apparently, it must be contained in the unzip sources.)

Frank

-- Frank Heckenbach, frank@g-n-u.de, http://fjf.gnu.de/ GPC To-Do list, latest features, fixed bugs: http://agnes.dida.physik.uni-essen.de/~gnu-pascal/todo.html

Prof Abimbola Olowofoyeku

28 Apr 2001

1:12 a.m.

On 27 Apr 01, at 23:51, Frank Heckenbach wrote:

[...]

...

...
...
So I conclude that the charset conversion code in unzip is indeed quite right.

Ok. Perhaps it's a font problem with Mandrake 7.1.

Maybe. In this case, it might help to distinguish between a charset (basically, a mapping between characters and numbers) and a font (particular shapes of letters etc.).

I have just had a cursory look at the Info-Zip sources. There is no portable continuation that I can see, but rather a whole load of IFDEFs, and, surprise surprise, the win32 conversions use the OEMToxxx API calls.

My tentative conclusion is that there is no portable way to do this. I guess I have to use IFDEFs and platform-specific calls after all. I thought there might be a portable GNU library for this, but there obviously isn't one.

Frank Heckenbach

4:31 a.m.

Prof Abimbola Olowofoyeku wrote:

...

On 27 Apr 01, at 23:51, Frank Heckenbach wrote:

[...]

...
...
...
So I conclude that the charset conversion code in unzip is indeed quite right.

Ok. Perhaps it's a font problem with Mandrake 7.1.

Maybe. In this case, it might help to distinguish between a charset (basically, a mapping between characters and numbers) and a font (particular shapes of letters etc.).

I have just had a cursory look at the Info-Zip sources. There is no portable continuation that I can see, but rather a whole load of IFDEFs, and, surprise surprise, the win32 conversions use the OEMToxxx API calls.

My tentative conclusion is that there is no portable way to do this. I guess I have to use IFDEFs and platform-specific calls after all. I thought there might be a portable GNU library for this, but there obviously isn't one.

Did you expect the GNU libraries to provide support for IBM charsets? ;-)

I took a very brief look at the Info-Zip source. ISTM that the main thing are the iso2oem and oem2iso tables in ebcdic.h (and that "OEM" indeed refers to the original Dos charset, as I had supposed initially).

So the "portable" answer could be that the Dos charset is quite an exotic thing outside of Dos systems, and it's of interest only when file formats (like zip) based on this charset need to be addressed -- that's another reason why such conversions are no likely candidates for general-purpose libraries; they're just not relevant to so many programs.

Those tables can be taken (probably as-is if the license permits -- I didn't check this) to convert between this charset and latin1 (in a portable way, since it's only two character tables). latin1 in itself is reasonably portable and can trivially be converted to Unicode (the first 256 characters of Unicode are exactly those of latin1), so that's probably as good as one can get.
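To illustrate why the latin1-to-Unicode step is called trivial here, this is a minimal C sketch that treats each latin1 byte as its Unicode code point and encodes it as UTF-8. The function name `latin1_to_utf8` and the buffer convention are assumptions made for this example, not part of any library mentioned in the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: convert `len` latin1 bytes to UTF-8.
   `out` must have room for 2 * len + 1 bytes (worst case plus NUL).
   Returns the number of bytes written, not counting the NUL. */
size_t latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    assert(in != NULL && out != NULL);

    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned cp = in[i];  /* latin1 byte value == Unicode code point */
        if (cp < 0x80) {
            out[n++] = (unsigned char)cp;                    /* plain ASCII */
        } else {
            out[n++] = (unsigned char)(0xC0 | (cp >> 6));    /* lead byte */
            out[n++] = (unsigned char)(0x80 | (cp & 0x3F));  /* continuation */
        }
    }
    out[n] = '\0';
    return n;
}
```

No lookup table is needed for this step, because the numbering is identical; only the byte encoding changes.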

Frank

-- Frank Heckenbach, frank@g-n-u.de, http://fjf.gnu.de/ GPC To-Do list, latest features, fixed bugs: http://agnes.dida.physik.uni-essen.de/~gnu-pascal/todo.html

Prof Abimbola Olowofoyeku

11:35 p.m.

On 28 Apr 01, at 4:31, Frank Heckenbach wrote:

[...]

...

...
My tentative conclusion is that there is no portable way to do this. I guess I have to use IFDEFs and platform-specific calls after all. I thought there might be a portable GNU library for this, but there obviously isn't one.

Did you expect the GNU libraries to provide support for IBM charsets? ;-)

I can't see the objection to doing so!

...

I took a very brief look at the Info-Zip source. ISTM that the main thing are the iso2oem and oem2iso tables in ebcdic.h (and that "OEM" indeed refers to the original Dos charset, as I had supposed initially).

So the "portable" answer could be that the Dos charset is quite an exotic thing outside of Dos systems, and it's of interest only when file formats (like zip) based on this charset need to be addressed -- that's another reason why such conversions are no likely candidates for general-purpose libraries; they're just not relevant to so many programs.

Well, I am not sure about this. AFAICS all programs need the facility if they are to display or manipulate text written in one language on systems (such as Windows) using another language.

...

Those tables can be taken (probably as-is if the license permits -- I didn't check this) to convert between this charset and latin1 (in a portable way, since it's only two character tables). latin1 in itself is reasonably portable and can trivially be converted to Unicode (the first 256 characters of Unicode are exactly those of latin1), so that's probably as good as one can get.

Ok - how do you convert these macros into Pascal?

#define ASCII2ISO(c) (((c) & 0x80) ? oem2iso[(c) & 0x7f] : (c))
#define ASCII2OEM(c) (((c) & 0x80) ? iso2oem[(c) & 0x7f] : (c))

Thanks.

Frank Heckenbach

29 Apr 2001

1:17 a.m.

Prof Abimbola Olowofoyeku wrote:

...

On 28 Apr 01, at 4:31, Frank Heckenbach wrote:

[...]

...
...
My tentative conclusion is that there is no portable way to do this. I guess I have to use IFDEFs and platform-specific calls after all. I thought there might be a portable GNU library for this, but there obviously isn't one.

Did you expect the GNU libraries to provide support for IBM charsets? ;-)

I can't see the objection to doing so!

It's simply not relevant.

...

...
I took a very brief look at the Info-Zip source. ISTM that the main thing are the iso2oem and oem2iso tables in ebcdic.h (and that "OEM" indeed refers to the original Dos charset, as I had supposed initially).

So the "portable" answer could be that the Dos charset is quite an exotic thing outside of Dos systems, and it's of interest only when file formats (like zip) based on this charset need to be addressed -- that's another reason why such conversions are no likely candidates for general-purpose libraries; they're just not relevant to so many programs.

Well, I am not sure about this. AFAICS all programs need the facility if they are to display or manipulate text written in one language on systems (such as Windows) using another language.

It's not really about languages, but about charsets. Both ISO-8859-1 and this "OEM" charset mostly cover West-European languages (i.e., accented characters etc., rather than Cyrillic, Greek or other letters). The "OEM" charset seems to be a relic from the 80s which has survived through legacy Dos files (and formats such as zip), so outside of Dos/Windoze and programs specifically written to convert old Dos files or access old Dos formats, this charset is not relevant. (Otherwise, how should a program know when it should convert charsets? I.e., if I use some program and give it some input consisting of normal ASCII characters as well as characters >= #$80, I would expect it to interpret the characters according to my default charset and not convert them, unless it has a special reason to, like working with zip files. Plain text files have no indication about the charset used, so if I want to process a file in a foreign charset, I would normally just convert it (`recode ibmpc:lat1' or something).)

What's more relevant on modern systems is converting between the different ISO-8859-n charsets and Unicode, and AFAIK, the GNU library has support for that.
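One way such library support is exposed is the iconv interface (`iconv_open`, `iconv`, `iconv_close`), which glibc provides. The following C sketch converts a string from a named charset to UTF-8. The wrapper name `to_utf8` is hypothetical, and which charset names are accepted depends on the installed iconv implementation.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: convert the NUL-terminated string `in` from the
   charset named `from` to UTF-8, writing a NUL-terminated result to `out`.
   Returns 0 on success, -1 if the charset is unknown, the input cannot be
   converted, or `out` is too small. */
int to_utf8(const char *from, const char *in, char *out, size_t out_size)
{
    assert(from != NULL && in != NULL && out != NULL && out_size > 0);

    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        return -1;

    char *src = (char *)in;           /* iconv() wants char **; input bytes are only read */
    size_t src_left = strlen(in);
    char *dst = out;
    size_t dst_left = out_size - 1;   /* keep one byte for the NUL */

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    *dst = '\0';
    return 0;
}
```

A call such as `to_utf8("ISO-8859-1", name, buf, sizeof buf)` leaves the choice of source charset to the caller, which fits the point made above that a plain text file carries no indication of its own charset.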

...

...
Those tables can be taken (probably as-is if the license permits -- I didn't check this) to convert between this charset and latin1 (in a portable way, since it's only two character tables). latin1 in itself is reasonably portable and can trivially be converted to Unicode (the first 256 characters of Unicode are exactly those of latin1), so that's probably as good as one can get.

Ok - how do you convert these macros into Pascal?

#define ASCII2ISO(c) (((c) & 0x80) ? oem2iso[(c) & 0x7f] : (c))
#define ASCII2OEM(c) (((c) & 0x80) ? iso2oem[(c) & 0x7f] : (c))

if c >= #$80 then Result := OEM2ISO [c] else Result := c

and declare OEM2ISO as an array indexed by #$80 .. #$ff (rather than #0 .. #$7f), so that no masking of the index is needed.
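For comparison, here is the same table lookup as a self-contained C sketch in function form. The table `oem2iso_sketch` is NOT the Info-ZIP table: it holds only a handful of entries, filled in on the assumption that the "OEM" charset is IBM code page 437/850, and both names are made up for this example. Unfilled entries are reported as '?' instead of being guessed.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative partial table for OEM bytes 0x80..0xFF, indexed by (c & 0x7f)
   like the C macro. Entries assume code page 437/850; 0 means "not filled in
   for this sketch". This is not the table from the Info-ZIP sources. */
static const unsigned char oem2iso_sketch[128] = {
    [0x81 & 0x7f] = 0xFC,  /* u with diaeresis */
    [0x82 & 0x7f] = 0xE9,  /* e with acute */
    [0x84 & 0x7f] = 0xE4,  /* a with diaeresis */
    [0x86 & 0x7f] = 0xE5,  /* a with ring above */
    [0x8E & 0x7f] = 0xC4,  /* A with diaeresis */
    [0x8F & 0x7f] = 0xC5,  /* A with ring above */
    [0x94 & 0x7f] = 0xF6,  /* o with diaeresis */
    [0x99 & 0x7f] = 0xD6,  /* O with diaeresis */
    [0x9A & 0x7f] = 0xDC   /* U with diaeresis */
};

/* Same logic as the macro: bytes below 0x80 pass through unchanged,
   bytes from 0x80 upwards go through the table. */
unsigned char oem_to_iso(unsigned char c)
{
    if (c & 0x80) {
        unsigned char mapped = oem2iso_sketch[c & 0x7f];
        return mapped ? mapped : '?';  /* no entry in this partial table */
    }
    return c;
}

/* Convert a buffer of `len` bytes in place. */
void oem_to_iso_buf(unsigned char *s, size_t len)
{
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        s[i] = oem_to_iso(s[i]);
}
```

The C version masks the index with `0x7f` because its table starts at 0; the Pascal version can skip the mask by declaring the array over the upper half of the character range directly.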

Frank

-- Frank Heckenbach, frank@g-n-u.de, http://fjf.gnu.de/ GPC To-Do list, latest features, fixed bugs: http://agnes.dida.physik.uni-essen.de/~gnu-pascal/todo.html
