[clean-list] isLower isUpper

Richard A. O'Keefe ok@cs.otago.ac.nz
Thu, 9 May 2002 12:21:30 +1200 (NZST)


I pointed out that using setlocale(LC_CTYPE, "") tells <ctype.h> functions
to use the "native" character set.
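For instance (a sketch, assuming an 8-bit user locale such as
en_US.ISO8859-1 is installed; in the default "C" locale only ASCII is
classified):

    #include <clocale>
    #include <cctype>
    #include <cstdio>

    int main()
    {
        unsigned char e_acute = 0xE9;    // 'e' with acute accent in Latin-1
        std::printf("\"C\" locale:    isalpha = %d\n", std::isalpha(e_acute));
        std::setlocale(LC_CTYPE, "");    // adopt the user's native locale
        std::printf("native locale: isalpha = %d\n", std::isalpha(e_acute));
    }

Under a Latin-1 locale the second call classifies 0xE9 as alphabetic;
the first never does.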

"Marco Kesseler" <m.kesseler@aia-itp.com> wrote:
	This would mean that the same application behaves differently on different
	systems. Not very portable.

I think there is a misunderstanding here.
"Native" means the USER'S "native" character set,
"however it is defined on the host system" means "however you discover what
character set the user wants".

	I think that it is better to _define_ that the Char type is
	plain ASCII (or at least as far as classifications are
	concerned).  Assuming a certain character set will always lead
	to trouble sooner or later.

But the whole point of setlocale(LC_CTYPE, "") is *not* to assume,
but to allow yourself to be *told*.

Note that even in Unicode, things like case-mapping are locale-dependent.
If you are in an English locale, the upper case version of "i" is "I".
In a Turkish locale, it's "capital I with dot above", and the lower case
version of "I" is not "i" but "dotless i".

Bad assumption 1:
    All data is ASCII.

    This means that the program is not going to correctly classify characters
    that the user uses all the time.

Bad assumption 2:
    All data is in a character set determined by the operating system.

    No: UNIX, MacOS, and even Windows let a user say what character set
    to use, and have done so for many years.

Poor assumption 3:
    All data is in the character set that the user has selected as the
    "native" character set.

    This will work very well for most people most of the time.
    It is demonstrably easy to use; that's how C has worked since 1989.

Poor assumption 4:
    Each data source is in a single character set; if the character set
    is not defined in some "meta-data" (<?xml encoding=...>, MIME header,
    whatever) then it is the user's selected native character set.

    Unfortunately, each data source may need a different means of determining
    the character set.  Some operating systems (B6700 MCP) made "what is my
    character set" a file property you could inspect; Windows NT file
    attributes make something similar possible; MacOS has means to
    associate "scripts" with files and strings, but these means are not
    always used.

    The thing that makes this a poor assumption is that a single data source
    may include data in several different encodings.  Embedded meta-data
    (ISO 2022 announcers) or external meta-data may tell you which; see the
    sketch after this list.

Correct assumption 5:
    Chaos reigns.
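To make assumption 4 concrete: a sketch of reading one kind of embedded
meta-data, the encoding pseudo-attribute of an XML declaration
(xml_encoding is a hypothetical helper; real detection would also have
to cope with MIME headers, ISO 2022 announcers, byte-order marks, and
so on):

    #include <cstring>
    #include <cstdio>

    // Return a pointer to the declared encoding name, or NULL if none.
    static const char *xml_encoding(const char *buf)
    {
        if (std::strncmp(buf, "<?xml", 5) != 0) return NULL;
        const char *p = std::strstr(buf, "encoding=\"");
        return p ? p + 10 : NULL;
    }

    int main()
    {
        const char *doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>";
        const char *enc = xml_encoding(doc);
        if (enc)    // print up to the closing quote
            std::printf("declared encoding: %.*s\n",
                        (int)std::strcspn(enc, "\""), enc);
    }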

	Existing single-byte data without any codepage information will always be a
	problem in this respect.

If you don't know which coded character set is in use, yes, it's a problem.
This is the situation for most of the data most people have.
(And if you use a Windows box, you have another problem:  data with WRONG
coded character set identification.  Too many Windows programs say
"Latin 1" when they mean "Windows CP 1252" or whatever it is.)

	Support for Unicode does not solve that problem
	(nothing can), but it does allow you to start using non-ascii
	characters and classify them in a portable way without worry.
	
But that doesn't solve the actual problem.  The problem is, how does
someone using existing data they generated in UNIX, Windows, or MacOS
use that data and classify characters correctly, *given* that there is
no meta-data explicitly indicating the coded character set, when practically
all their data is in the *same* coded character set, which the operating
system knows, and can inform the Clean runtime system about?

In order to convert such data to Unicode for classifying,
Clean would have to know the *same* information that would make it
unnecessary to convert.

	Once you know, yes. But that is exactly the point. A plain 8-bit Char type
	does not - and cannot - carry this information, and that is what makes it
	non-portable.
	
But the *environment* _can_ provide this information.
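On a POSIX system, for example, one call retrieves it (a sketch;
<langinfo.h> and nl_langinfo() are POSIX, not ISO C or C++):

    #include <clocale>
    #include <cstdio>
    #include <langinfo.h>    // POSIX

    int main()
    {
        std::setlocale(LC_CTYPE, "");    // adopt the user's settings first
        std::printf("codeset: %s\n", nl_langinfo(CODESET));
        // prints e.g. "ISO-8859-1" or "UTF-8", whatever the user chose
    }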

	Or, one could assume a codepage and do the same, but I do not think it would
	be wise to turn that into standard functionality,
	
The suggestion was NOT to assume any particular coded character set
in Clean, but to have it ask the host system "what is the user's chosen
default coded character set?" and use that.  Ultimately, there needs to be
something similar to the locale support in C++, where you can change the
locale of any stream at any time; but even the simpler suggestion would be
a good step forward from assuming always and only ASCII.
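
For comparison, the C++ stream-locale model looks roughly like this (a
sketch; the locale name "tr_TR.UTF-8" is only an example of what might
be installed, and the std::locale constructor throws if it is not):

    #include <iostream>
    #include <locale>

    int main()
    {
        std::locale native("");    // the user's default locale
        std::cin.imbue(native);
        std::cout.imbue(native);
        std::cout << "stream locale: "
                  << std::cout.getloc().name() << '\n';
        std::cout.imbue(std::locale("tr_TR.UTF-8"));  // retarget one stream later
    }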