[clean-list] isLower isUpper

Wed, 8 May 2002 10:47:57 +0200

From: "Richard A. O'Keefe" <ok@cs.otago.ac.nz>
> "Marco Kesseler" <m.kesseler@aia-itp.com> wrote:
> > The question is of course, what codepage Char is in.
> > There is no portable way to classify =E4, =E9,=EE, =E7 and =F1
> > without knowing what codepage one is dealing with.
>
> If the Clean system has any C code in it, there _is_ a portable way.
>
> setlocale(LC_CTYPE, "");
>
> is supposed to tell the functions in <ctype.h> to use the "native" locale,
> however that is defined on the host machine.
>
> Of course this assumes an environment where all the files are in the same
> character set, which, though not always true, is at least better than
> assuming that everything is in ASCII.

This would mean that the same application behaves differently on different
systems. Not very portable. I think that it is better to _define_ that the
Char type is plain ASCII (or at least as far as classifications are
concerned). Assuming a certain character set will always lead to trouble
sooner or later, whether you assume it is ASCII, the 'local' codepage, or
something else.
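
To make the "define Char as plain ASCII" idea concrete, here is a minimal
sketch in Python (not Clean). The function names are hypothetical and only
illustrate classification by fixed byte ranges, which gives the same answer
on every system regardless of locale:

```python
def is_lower_ascii(byte: int) -> bool:
    """True only for the ASCII letters a-z (byte values 0x61..0x7A)."""
    return 0x61 <= byte <= 0x7A


def is_upper_ascii(byte: int) -> bool:
    """True only for the ASCII letters A-Z (byte values 0x41..0x5A)."""
    return 0x41 <= byte <= 0x5A


# Byte 0xE4 may be a lowercase letter in some codepage, but under this
# definition it is simply "not an ASCII letter", on every platform.
for b in (ord("a"), ord("Z"), 0xE4):
    print(hex(b), is_lower_ascii(b), is_upper_ascii(b))
```

The price is that every byte above 0x7F is classified as neither upper nor
lower case, which is a deliberate refusal to guess.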

(Originally, ä, é, î, ç and ñ were present in the e-mail message, now they
have become =E4, =E9,=EE, =E7 and =F1, which is exactly the kind of trouble
you will get).
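
As a rough illustration of that trouble, the following Python sketch undoes
the quoted-printable escaping and then reads the resulting bytes under two
codepages. The choice of ISO-8859-1 and CP437 is my own assumption for the
sake of the example; the message itself does not say which codepages were
involved:

```python
import quopri

# The escaped text exactly as it appears in the message above.
escaped = b"=E4, =E9,=EE, =E7 and =F1"

# Quoted-printable decoding only restores the raw bytes; it says nothing
# about which character set those bytes belong to.
raw = quopri.decodestring(escaped)

# The same bytes interpreted under two different single-byte codepages.
as_latin1 = raw.decode("latin-1")
as_cp437 = raw.decode("cp437")

# ascii() keeps the output printable on terminals without Unicode support.
print(ascii(as_latin1))
print(ascii(as_cp437))

# Byte 0xE4 is a lowercase letter in one reading, uppercase in the other.
print(as_latin1[0].islower(), as_cp437[0].isupper())
```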

> > Hence the need for Unicode.
>
> Unfortunately, that does not solve the problem.  The problem is existing
> data without any ISO 2022 "announcer" sequence to tell you which registered
> character set is used.  Without that information, you can't translate the
> data to Unicode, so Unicode doesn't help.

Existing single-byte data without any codepage information will always be a
problem in this respect. Support for Unicode does not solve that problem
(nothing can), but it does allow you to start using non-ASCII characters and
classify them in a portable way without worry.

> Thanks to all the free tables at www.unicode.org, it's quite easy to
> classify characters portably, once you know which character set to use.

Once you know, yes. But that is exactly the point. A plain 8-bit Char type
does not - and cannot - carry this information, and that is what makes it
non-portable.

What one could do is define a datatype that does specify the codepage (for
lists of single-byte characters, or even for individual single-byte
characters), and write classification/conversion routines based on the
tables you mentioned.
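
A minimal sketch of such a datatype, in Python rather than Clean. The name
TaggedChar and its methods are hypothetical; the conversion relies on
Python's built-in codecs, and the classification on the Unicode general
categories "Ll" and "Lu" from the standard unicodedata module, which is
derived from the tables at www.unicode.org:

```python
from dataclasses import dataclass
import unicodedata


@dataclass(frozen=True)
class TaggedChar:
    """A single byte together with the codepage it must be read in."""
    byte: int
    codepage: str  # a Python codec name, e.g. "latin-1" or "cp437"

    def to_unicode(self) -> str:
        # Raises UnicodeDecodeError if the byte is undefined in the codepage.
        return bytes([self.byte]).decode(self.codepage)

    def is_lower(self) -> bool:
        # "Ll" is the Unicode general category for lowercase letters.
        return unicodedata.category(self.to_unicode()) == "Ll"

    def is_upper(self) -> bool:
        # "Lu" is the Unicode general category for uppercase letters.
        return unicodedata.category(self.to_unicode()) == "Lu"


# The same byte value, classified according to the codepage it carries.
latin = TaggedChar(0xE4, "latin-1")
dos = TaggedChar(0xE4, "cp437")
print(latin.is_lower(), dos.is_upper())
```

Because the codepage travels with the data, the classification no longer
depends on the host system, only on the tag.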

Or, one could assume a codepage and do the same, but I do not think it would
be wise to turn that into standard functionality.

regards,
Marco

----------------------------------------------------------------------
Aia Software B.V.                     Phone :  +31 24 371 02 30
PO Box 38025                          Fax   :  +31 24 371 02 31
6503 AA Nijmegen                      URL   :  http://www.aia-itp.com
The Netherlands
----------------------------------------------------------------------