
The Logic Of Looneycode (6 May 2001)



The Logic Of Looneycode


Illustrating The Unity And Utility Of "Unicode"

6 May 2001

_corpus delicti_

That is enough to be getting on with. Let us consider the "Unicode" document that consists of those four lines, being the second row of the keyboard (lower-case) alternating with spaces and concluding with two newlines in (a) US English, (b) Arabic, (c) Russian, and (d) Greek.

Having no clue what any other program in the world will consider "Unicode" to be, we naturally save this treasure in all available ways, producing the following subdirectory:

Directory of C:\Program Files\SC UniPad\moi
ASCIIUCN UTX           602  05-06-01  7:54a asciiucn.utx
SCHEME   UTX           398  05-06-01  7:54a scheme.utx
UTF16    UTX           754  05-06-01  7:52a utf16.utx
UTF16BIG UTX           754  05-06-01  7:52a utf16big.utx
UTF32    UTX         1,508  05-06-01  7:53a utf32.utx
UTF32B   UTX         1,508  05-06-01  7:53a utf32b.utx
UTF7     UTX           531  05-06-01  7:53a utf7.utx
UTF8     UTX           438  05-06-01  7:52a utf8.utx

Something on that menu for every taste, I daresay, mostly depending on how much cold water you like in your soup.
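For anyone wishing to check the soup-to-water ratio at home, here is a minimal sketch in Python 3. The sample string and the function name `encoded_sizes` are our own inventions, not anything UniPad does, and the byte counts apply to this short sample only, not to the files listed above.

```python
import codecs

# Hypothetical sample: the English row, two U+2028 separators,
# and the first three Arabic letters from the hex dumps.
SAMPLE = "q w e r t y u i o p\u2028\u2028\u0636 \u0635 \u062b"

def encoded_sizes(text):
    """Return a dict mapping an encoding label to its size in bytes."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-7": len(text.encode("utf-7")),
        # Explicit byte-order mark plus explicit byte order, so the
        # result does not depend on the machine running the sketch.
        "utf-16": len(codecs.BOM_UTF16_LE + text.encode("utf-16-le")),
        "utf-32": len(codecs.BOM_UTF32_LE + text.encode("utf-32-le")),
    }

sizes = encoded_sizes(SAMPLE)
for label, size in sizes.items():
    print(f"{label:8} {size:5} bytes")
```

The 16-bit and 32-bit forms cost a fixed two or four bytes per character plus the leading mark, which agrees with UTF32.UTX being exactly twice the size of UTF16.UTX.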

The default encoding offered was UTF-16 (Little-endian), UTF16.UTX, which is probably our own intuitive idea of what a "Unicode" file should be like. It goes like this:

0000  FF FE 71 00 20 00 77 00-20 00 65 00 20 00 72 00  .~q. .w. .e. .r.
0010  20 00 74 00 20 00 79 00-20 00 75 00 20 00 69 00   .t. .y. .u. .i.
0020  20 00 6F 00 20 00 70 00-28 20 28 20 36 06 20 00   .o. .p.( ( 6. .
0030  35 06 20 00 2B 06 20 00-42 06 20 00 41 06 20 00  5. .+. .B. .A. .
0040  3A 06 20 00 39 06 20 00-47 06 20 00 2E 06 20 00  :. .9. .G. ... .
0050  2D 06 28 20 28 20 39 04-20 00 46 04 20 00 43 04  -.( ( 9. .F. .C.
0060  20 00 3A 04 20 00 35 04-20 00 3D 04 20 00 33 04   .:. .5. .=. .3.
0070  20 00 48 04 20 00 49 04-20 00 37 04 28 20 28 20   .H. .I. .7.( (
0080  3B 00 20 00 C2 03 20 00-B5 03 20 00 C1 03 20 00  ;. .B. .5. .A. .
0090  C4 03 20 00 C5 03 20 00-B8 03 20 00 B9 03 20 00  D. .E. .8. .9. .
00A0  BF 03 20 00 C0 03 28 20-28 20 3D 00 3D 00 3D 00  ?. .@.( ( =.=.=.
00B0  3D 00 3D 00 3D 00 3D 00-3D 00 3D 00 3D 00 28 20  =.=.=.=.=.=.=.(
00C0  28 20 54 00 68 00 61 00-74 00 20 00 69 00 73 00  ( T.h.a.t. .i.s.
00D0  20 00 65 00 6E 00 6F 00-75 00 67 00 68 00 20 00   .e.n.o.u.g.h. .

The bytewise idea here is

{... Code, Codepage, Code, Codepage, ... }.

"Big-endian" only means you are virtually a Motorola rather than an Intel customer, _ergo_,

{ ... Codepage, Code, Codepage, Code, .... }

and thus in UTF16BIG.UTX we find

0000  FE FF 00 71 00 20 00 77-00 20 00 65 00 20 00 72  ~..q. .w. .e. .r
0010  00 20 00 74 00 20 00 79-00 20 00 75 00 20 00 69  . .t. .y. .u. .i
0020  00 20 00 6F 00 20 00 70-20 28 20 28 06 36 00 20  . .o. .p ( (.6.
0030  06 35 00 20 06 2B 00 20-06 42 00 20 06 41 00 20  .5. .+. .B. .A.

and so on.
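The two byte orders can be reproduced with a short Python 3 sketch. The helper names `utf16_with_bom` and `hexdump` are ours, chosen for illustration; the codecs and BOM constants are standard-library ones.

```python
import codecs

def utf16_with_bom(text, big_endian=False):
    """Encode text as UTF-16 with an explicit byte-order mark."""
    bom = codecs.BOM_UTF16_BE if big_endian else codecs.BOM_UTF16_LE
    codec = "utf-16-be" if big_endian else "utf-16-le"
    return bom + text.encode(codec)

def hexdump(data):
    """Render bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

print(hexdump(utf16_with_bom("q w")))                   # FF FE 71 00 20 00 77 00
print(hexdump(utf16_with_bom("q w", big_endian=True)))  # FE FF 00 71 00 20 00 77
```

Both outputs match the opening bytes of UTF16.UTX and UTF16BIG.UTX respectively.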

Now it turns out that our upscale acquaintance Winnie Nono (a.k.a. "Microsoft Word For Windows 2000", a.k.a. "WinWord 9.0.3821 SR-1") is almost able to read those two files correctly. She even recognized them as "Unicode" and "Unicode (Big-Endian)" respectively.

nono1.jpg

In her rendition, the English, Arabic, Cyrillic and Greek glyphs are all there and there are even spaces between them. But Signora Nono doesn't understand { 0x28, 0x20 } at all. Neither, for that matter, do we ourselves altogether comprehend it, but being human and the author of the text, we are pretty confident it somehow has to signify "\r\n", as we used to say back when Hector was a pup.
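What { 0x28, 0x20 } amounts to can be looked up mechanically. The following Python 3 sketch uses only the standard `unicodedata` module; the variable names are ours.

```python
import unicodedata

# The mystery pair as it appears in the little-endian dump.
pair = bytes([0x28, 0x20])
ch = pair.decode("utf-16-le")

print(f"U+{ord(ch):04X}", unicodedata.name(ch))  # U+2028 LINE SEPARATOR

# Python's own line splitting treats U+2028 as a line boundary.
print("a\u2028b".splitlines())                    # ['a', 'b']

# For comparison, the old-fashioned "\r\n" in the same encoding.
crlf = "\r\n".encode("utf-16-le")
print(" ".join(f"{b:02X}" for b in crlf))         # 0D 00 0A 00
```

So the pair is the single character U+2028 LINE SEPARATOR rather than the two-character "\r\n", which would have come out as 0D 00 0A 00.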

The 32-bit variations presumably run

{ ... Code, Codepage, Codechapter, Codebook, Code, Codepage, ... }

at Intel and

{ ... Codebook, Codechapter, Codepage, Code, Codebook, ... }

at Motorola. And so it appears:


0100  FF FE 00 00 71 00 00 00-20 00 00 00 77 00 00 00  .~..q... ...w...
0110  20 00 00 00 65 00 00 00-20 00 00 00 72 00 00 00   ...e... ...r...
0120  20 00 00 00 74 00 00 00-20 00 00 00 79 00 00 00   ...t... ...y...
0130  20 00 00 00 75 00 00 00-20 00 00 00 69 00 00 00   ...u... ...i...
0140  20 00 00 00 6F 00 00 00-20 00 00 00 70 00 00 00   ...o... ...p...
0150  28 20 00 00 28 20 00 00-36 06 00 00 20 00 00 00  ( ..( ..6... ...
0160  35 06 00 00 20 00 00 00-2B 06 00 00 20 00 00 00  5... ...+... ...
0170  42 06 00 00 20 00 00 00-41 06 00 00 20 00 00 00  B... ...A... ...
0180  3A 06 00 00 20 00 00 00-39 06 00 00 20 00 00 00  :... ...9... ...
0190  47 06 00 00 20 00 00 00-2E 06 00 00 20 00 00 00  G... ....... ...

OR

0100  00 00 FE FF 00 00 00 71-00 00 00 20 00 00 00 77  ..~....q... ...w
0110  00 00 00 20 00 00 00 65-00 00 00 20 00 00 00 72  ... ...e... ...r
0120  00 00 00 20 00 00 00 74-00 00 00 20 00 00 00 79  ... ...t... ...y
0130  00 00 00 20 00 00 00 75-00 00 00 20 00 00 00 69  ... ...u... ...i
0140  00 00 00 20 00 00 00 6F-00 00 00 20 00 00 00 70  ... ...o... ...p
0150  00 00 20 28 00 00 20 28-00 00 06 36 00 00 00 20  .. (.. (...6...
0160  00 00 06 35 00 00 00 20-00 00 06 2B 00 00 00 20  ...5... ...+...
0170  00 00 06 42 00 00 00 20-00 00 06 41 00 00 00 20  ...B... ...A...
0180  00 00 06 3A 00 00 00 20-00 00 06 39 00 00 00 20  ...:... ...9...
0190  00 00 06 47 00 00 00 20-00 00 06 2E 00 00 00 20  ...G... .......
01A0  00 00 06 2D 00 00 20 28-00 00 20 28 00 00 04 39  ...-.. (.. (...9

Alas, poor Winnie! She'd be very insulted on Mr. Gates's behalf to hear that she isn't up to playing with a full thirty-two-bit pack, but the brute fact of the matter is, she ain't. She thinks all those zeroes are not cold water but real text.
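The thirty-two-bit layout can be checked the same way as the sixteen-bit one. This is a Python 3 sketch with helper names of our own (`utf32_with_bom`, `hexdump`); it makes no claim about how UniPad itself writes the files.

```python
import codecs

def utf32_with_bom(text, big_endian=False):
    """Encode text as UTF-32 with an explicit byte-order mark."""
    bom = codecs.BOM_UTF32_BE if big_endian else codecs.BOM_UTF32_LE
    codec = "utf-32-be" if big_endian else "utf-32-le"
    return bom + text.encode(codec)

def hexdump(data):
    """Render bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

little = utf32_with_bom("q\u2028")
big = utf32_with_bom("q\u2028", big_endian=True)
print(hexdump(little))  # FF FE 00 00 71 00 00 00 28 20 00 00
print(hexdump(big))     # 00 00 FE FF 00 00 00 71 00 00 20 28
```

Every character here spends two of its four bytes on zeroes, which is all the cold water Winnie is choking on.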

nono2.jpg

Not a pretty sight. Still, Signora Nono makes a distinction. The Intel way she takes it to be "Unicode" text, so we meet Cyril, Methodios, Abdul & Co. skulking amidst all the little square boxes. The Motorola way, she obviously can't make head nor tail of it and doesn't even ask -- she just dumps it all at you in Codepage 1252.

nono3.jpg

I.e., Winnie figures any text whose first two bytes are zeroes probably isn't text at all. Hard to blame her. We'd figure that way ourselves.
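One plausible explanation for the distinction, and it is only our guess at what WinWord does, is that the little-endian 32-bit mark FF FE 00 00 begins with the 16-bit mark FF FE, while the big-endian one begins with zeroes. A sniffer that tests the longer marks first avoids the confusion. The function `sniff` and the Codepage 1252 fallback below are illustrative assumptions, sketched in Python 3.

```python
import codecs

# Longest marks first: the UTF-32 LE mark starts with the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff(data, default="cp1252"):
    """Return (codec name, payload without the mark) for a byte string."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return default, data

intel = bytes.fromhex("FFFE0000 71000000")
motorola = bytes.fromhex("0000FEFF 00000071")

print(sniff(intel)[0])     # utf-32-le
print(sniff(motorola)[0])  # utf-32-be

# A reader that only knows 16-bit "Unicode" sees 'q' flanked by NULs.
print(repr(intel.decode("utf-16")))  # '\x00q\x00'
```

Note that FF FE 00 00 could also be a 16-bit file whose first character is NUL, so even the longest-first rule is a heuristic rather than a proof.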

It is questionable whether UTF-7 and UTF-8 are "Unicode," rather than a compression and encipherment of it, but at least they are known formats. Winnie gets them right, in the sense that she puts up the same display as with her "Unicode" proper, the sixteen-bit Intellian kind. Of course she doesn't understand { Code 0x28, Codepage 0x20 } any better just because the rest of the document it occurs in has been encrypted.
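Whatever one chooses to call them, both formats give back the same characters as the sixteen-bit kind. A minimal round-trip sketch in Python 3, using a short sample of our own devising rather than the actual files:

```python
# Hypothetical sample: ASCII, the line separator, and one letter
# each of Arabic, Cyrillic and Greek.
text = "q w\u2028\u0636 \u0439 \u03c2"

for codec in ("utf-8", "utf-7", "utf-16-le", "utf-32-be"):
    data = text.encode(codec)
    # Every one of these decodes back to the identical string.
    assert data.decode(codec) == text
    print(f"{codec:10} {len(data):3} bytes")

# U+2028 takes three bytes in UTF-8.
sep_utf8 = "\u2028".encode("utf-8")
print(" ".join(f"{b:02X}" for b in sep_utf8))  # E2 80 A8

# UTF-7 output stays within seven-bit ASCII.
utf7_is_ascii = all(b < 0x80 for b in text.encode("utf-7"))
print(utf7_is_ascii)                            # True
```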

With ASCIIUCN.UTX, Signora Nono also doesn't ask any questions, but this time she isn't just shrugging her shoulders but knows exactly what's what and doesn't need to ask. That file reads in full:


\ufeffq w e r t y u i o p\u2028\u2028\u0636 \u0635 \u062b \u0642 \u0641 \u063a \u0639

\u0647 \u062e \u062d\u2028\u2028\u0439 \u0446 \u0443 \u043a \u0435 \u043d \u0433 \u0448

\u0449 \u0437\u2028\u2028; \u03c2 \u03b5 \u03c1 \u03c4 \u03c5 \u03b8 \u03b9 \u03bf

 \u03c0\u2028\u2028==========\u2028\u2028That is enough to be getting on with.  Let us

consider the "Unicode" document\u2028that consists of those four lines, being the second

row of the keyboard (lower-case)\u2028alternating with spaces and concluding with two

newlines in (a) US English,\u2028(b) Arabic, (c) Russian, and (d) Greek.\u2028\u2028
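The escaped form can be turned back into characters, or any text turned wholly into escapes, with a few lines of Python 3. The function names `decode_ucn` and `encode_ucn_all` are ours. The sketch assumes the input contains no backslash sequences other than `\uXXXX` (Python's `unicode_escape` codec would interpret those too) and no characters beyond U+FFFF.

```python
def decode_ucn(ascii_text):
    """Turn \\uXXXX escapes in an ASCII string into the characters they name."""
    return ascii_text.encode("ascii").decode("unicode_escape")

def encode_ucn_all(text):
    """Escape every character, ASCII included, as \\uXXXX (BMP only)."""
    return "".join(f"\\u{ord(c):04x}" for c in text)

row = decode_ucn(r"\ufeffq w\u2028\u0636 \u0635")
print([f"U+{ord(c):04X}" for c in row])

print(encode_ucn_all("cat"))  # \u0063\u0061\u0074
```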

We don't quite see, however, why this admirable system shouldn't be carried through to its logical conclusion so that instead of a lowly "cat" we'd have a lofty "\u0063\u0061\u0074" and so on. I believe somebody in the early twentieth century seriously (?) proposed that the Chinese give up those cumbersome ideograms altogether and simply learn to directly read and write the 10,000 four-digit numbers that had been assigned to them by telegraph operators.

Can 'Unicode' Do This Trick?

00A7 136E 0208 211F 3B15 003F! (If you don't think that joke is very funny, maybe we didn't tell it quite right....)

Even this way, "\u2028" isn't as perspicuous as one might ideally wish. You'd as likely take it for a space as a paragraph boundary. Still, the above version invites some simple but illuminating reformatting, as for example,


\ufeff
   Q      W     E       R      T      Y      U      I      O      P
\u0636 \u0635 \u062b \u0642 \u0641 \u063a \u0639 \u0647 \u062e \u062d
\u0439 \u0446 \u0443 \u043a \u0435 \u043d \u0433 \u0448 \u0449 \u0437
   ;   \u03c2 \u03b5 \u03c1 \u03c4 \u03c5 \u03b8 \u03b9 \u03bf \u03c0

It's definitely the best yet, but of course "Unicode" it egregiously is not.

Another sect of Unicodians gave us that file we called SCHEME.UTX. This one also is off Winnie's scope. She takes it to begin with a {...Column Break...}, which is certainly not what anybody ever intended, and after that there is only plain 1252 with miscellaneous Eurogibberish _chez_ WinWord.

nono4.jpg

What SCHEME.UTX _really_ is is

100  0E FE FF 71 20 77 20 65-20 72 20 74 20 79 20 75  .~.q w e r t y u
110  20 69 20 6F 20 70 05 28-05 28 13 B6 20 B5 20 AB   i o p.(.(.6 5 +
120  20 C2 20 C1 20 BA 20 B9-20 C7 20 AE 20 AD 05 28   B A : 9 G . -.(
130  05 28 12 B9 20 C6 20 C3-20 BA 20 B5 20 BD 20 B3  .(.9 F C : 5 = 3
140  20 C8 20 C9 20 B7 05 28-05 28 3B 20 1C FB D2 20   H I 7.(.(; .{R
150  C5 20 D1 20 D4 20 D5 20-C8 20 C9 20 CF 20 D0 05  E Q T U H I O P.
160  28 05 28 3D 3D 3D 3D 3D-3D 3D 3D 3D 3D 05 28 05  (.(==========.(.
170  28 54 68 61 74 20 69 73-20 65 6E 6F 75 67 68 20  (That is enough
180  74 6F 20 62 65 20 67 65-74 74 69 6E 67 20 6F 6E  to be getting on
190  20 77 69 74 68 2E 20 20-4C 65 74 20 75 73 20 63   with.  Let us c


Evidently 05 shifts to Page 20,       ((Page 20?))
          13                06,
          12                04,
          1C                03.

However, 05 moves to Page 20 for one character only, then relapses back to Ansistan, whereas the others stay in effect until further notice. V. cute. It certainly manages to be shorter than the competition, but unfortunately whatever the true point of "Unicode" may be, making files shorter definitely is not it. _Au contraire_.
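The bytes above look consistent with the Standard Compression Scheme for Unicode (SCSU, Unicode Technical Standard #6), though that identification is our inference from the dump and the file name. On that reading 05 is a one-shot "quote from static window 4" (base 0x2000), 13 and 12 switch to dynamic windows whose default bases are 0x0600 and 0x0400, and 1C FB redefines window 4 to base 0x0370. The Python 3 sketch below decodes only the tags seen in this dump. The offset tables are our transcription of UTS #6 and should be checked against it, and the name `decode_scsu_subset` is hypothetical.

```python
# Window bases as we read them from UTS #6 (assumption: verify before relying on them).
STATIC = [0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000]
DYNAMIC_DEFAULT = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]
SPECIAL = {0xF9: 0x00C0, 0xFA: 0x0250, 0xFB: 0x0370, 0xFC: 0x0530,
           0xFD: 0x3040, 0xFE: 0x30A0, 0xFF: 0xFF60}

def window_base(b):
    """Map an SDn argument byte to a window base."""
    if 0x01 <= b <= 0x67:
        return b * 0x80
    if 0x68 <= b <= 0xA7:
        return b * 0x80 + 0xAC00
    if b in SPECIAL:
        return SPECIAL[b]
    raise ValueError(f"reserved window offset byte {b:#04x}")

def decode_scsu_subset(data):
    """Decode single-byte-mode SCSU; SQn, SQU, SCn and SDn tags only."""
    dynamic = list(DYNAMIC_DEFAULT)
    active = 0                      # dynamic window 0 is active at start
    out, i = [], 0
    while i < len(data):
        b = data[i]; i += 1
        if 0x01 <= b <= 0x08:       # SQn: quote one character, no mode change
            n, c = b - 0x01, data[i]; i += 1
            out.append(chr(STATIC[n] + c if c < 0x80 else dynamic[n] + c - 0x80))
        elif b == 0x0E:             # SQU: one big-endian 16-bit unit follows
            out.append(chr((data[i] << 8) | data[i + 1])); i += 2
        elif 0x10 <= b <= 0x17:     # SCn: switch active dynamic window
            active = b - 0x10
        elif 0x18 <= b <= 0x1F:     # SDn: define window n and switch to it
            active = b - 0x18
            dynamic[active] = window_base(data[i]); i += 1
        elif b >= 0x80:             # upper half: offset into active window
            out.append(chr(dynamic[active] + b - 0x80))
        elif b in (0x00, 0x09, 0x0A, 0x0D) or b >= 0x20:
            out.append(chr(b))      # plain ASCII passes through
        else:
            raise ValueError(f"tag {b:#04x} not handled in this sketch")
    return "".join(out)

# First two lines of the SCHEME.UTX dump.
dump = bytes.fromhex(
    "0E FE FF 71 20 77 20 65 20 72 20 74 20 79 20 75"
    "20 69 20 6F 20 70 05 28 05 28 13 B6 20 B5 20 AB"
)
decoded = decode_scsu_subset(dump)
print([f"U+{ord(c):04X}" for c in decoded])
```

This agrees with the observation that 05 affects one character only while the other tags stay in force: the quote tags leave the active window untouched.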



Time to call it a page. Our next thrilling webisode is titled

Looneycode East Of Suez

and illustrates the effulgent unity and runcible utility of Arabic-script "Unicode" in particular.
