Hello, Lycos!
That is enough to be getting on with. Let us consider the "Unicode" document that consists of those four lines, being the second row of the keyboard (lower-case) alternating with spaces and concluding with two newlines in (a) US English, (b) Arabic, (c) Russian, and (d) Greek.
Having no clue what any other program in the world will consider "Unicode" to be, we naturally save this treasure in all available ways, producing the following subdirectory:
    Directory of C:\Program Files\SC UniPad\moi

    ASCIIUCN UTX       602  05-06-01   7:54a asciiucn.utx
    SCHEME   UTX       398  05-06-01   7:54a scheme.utx
    UTF16    UTX       754  05-06-01   7:52a utf16.utx
    UTF16BIG UTX       754  05-06-01   7:52a utf16big.utx
    UTF32    UTX     1,508  05-06-01   7:53a utf32.utx
    UTF32B   UTX     1,508  05-06-01   7:53a utf32b.utx
    UTF7     UTX       531  05-06-01   7:53a utf7.utx
    UTF8     UTX       438  05-06-01   7:52a utf8.utx
Something on that menu for every taste, I daresay, mostly depending on how much cold water you like in your soup.
The default encoding offered was UTF-16 (Little-endian), UTF16.UTX, which is probably our own intuitive idea of what a "Unicode" file should be like. It goes like this:
    0000 FF FE 71 00 20 00 77 00-20 00 65 00 20 00 72 00  .~q. .w. .e. .r.
    0010 20 00 74 00 20 00 79 00-20 00 75 00 20 00 69 00   .t. .y. .u. .i.
    0020 20 00 6F 00 20 00 70 00-28 20 28 20 36 06 20 00   .o. .p.( ( 6. .
    0030 35 06 20 00 2B 06 20 00-42 06 20 00 41 06 20 00  5. .+. .B. .A. .
    0040 3A 06 20 00 39 06 20 00-47 06 20 00 2E 06 20 00  :. .9. .G. ... .
    0050 2D 06 28 20 28 20 39 04-20 00 46 04 20 00 43 04  -.( ( 9. .F. .C.
    0060 20 00 3A 04 20 00 35 04-20 00 3D 04 20 00 33 04   .:. .5. .=. .3.
    0070 20 00 48 04 20 00 49 04-20 00 37 04 28 20 28 20   .H. .I. .7.( (
    0080 3B 00 20 00 C2 03 20 00-B5 03 20 00 C1 03 20 00  ;. .B. .5. .A. .
    0090 C4 03 20 00 C5 03 20 00-B8 03 20 00 B9 03 20 00  D. .E. .8. .9. .
    00A0 BF 03 20 00 C0 03 28 20-28 20 3D 00 3D 00 3D 00  ?. .@.( ( =.=.=.
    00B0 3D 00 3D 00 3D 00 3D 00-3D 00 3D 00 3D 00 28 20  =.=.=.=.=.=.=.(
    00C0 28 20 54 00 68 00 61 00-74 00 20 00 69 00 73 00  ( T.h.a.t. .i.s.
    00D0 20 00 65 00 6E 00 6F 00-75 00 67 00 68 00 20 00   .e.n.o.u.g.h. .
The bytewise idea here is
{... Code, Codepage, Code, Codepage, ... }.
"Big-endian" only means you are virtually a Motorola rather than an Intel customer, _ergo_,
{ ... Codepage, Code, Codepage, Code, .... }
and thus in UTF16BIG.UTX we find
    0000 FE FF 00 71 00 20 00 77-00 20 00 65 00 20 00 72  ~..q. .w. .e. .r
    0010 00 20 00 74 00 20 00 79-00 20 00 75 00 20 00 69  . .t. .y. .u. .i
    0020 00 20 00 6F 00 20 00 70-20 28 20 28 06 36 00 20  . .o. .p ( (.6.
    0030 06 35 00 20 06 2B 00 20-06 42 00 20 06 41 00 20  .5. .+. .B. .A.
and so on.
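The two byte orders are easy to reproduce. What follows is only a sketch in Python (not anything UniPad itself runs); the names `text`, `little`, `big` and `hexrow` are made up for the illustration. It encodes the first keyboard row both ways and prints the opening bytes for comparison with the dumps above.

```python
# The first keyboard row from the document, followed by the two U+2028
# characters that come after it in the dumps.
text = "q w e r t y u i o p\u2028\u2028"

# Python's "utf-16-le" and "utf-16-be" codecs emit no byte-order mark,
# so U+FEFF is prepended by hand to mimic the FF FE / FE FF openers.
little = ("\ufeff" + text).encode("utf-16-le")
big = ("\ufeff" + text).encode("utf-16-be")


def hexrow(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


print(hexrow(little[:8]))  # FF FE 71 00 20 00 77 00
print(hexrow(big[:8]))     # FE FF 00 71 00 20 00 77
```

Swapping each pair of bytes in one form gives the other, which is all that "endianness" amounts to here.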
Now it turns out that our upscale acquaintance Winnie Nono (a.k.a. "Microsoft Word For Windows 2000", a.k.a. "WinWord 9.0.3821 SR-1") is almost able to read those two files correctly. She even recognized them as "Unicode" and "Unicode (Big-Endian)" respectively. In her rendition, the English, Arabic, Cyrillic and Greek glyphs are all there and there are even spaces between them. But Signora Nono doesn't understand { 0x28, 0x20 } at all. Neither, for that matter, do we ourselves altogether comprehend it, but being human and the author of the text, we are pretty confident it somehow has to signify "\r\n", as we used to say back when Hector was a pup.
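For what it is worth, the pair can be looked up mechanically. A small Python sketch (the variable names `pair` and `ch` are ours) reads { 0x28, 0x20 } as one little-endian 16-bit unit and asks the standard library's Unicode database what it is called:

```python
import unicodedata

pair = bytes([0x28, 0x20])      # the mystery bytes from the little-endian dump
ch = pair.decode("utf-16-le")   # one 16-bit code unit, low byte first

print(f"U+{ord(ch):04X}", unicodedata.name(ch))   # U+2028 LINE SEPARATOR

# Python's str.splitlines() treats U+2028 as a line boundary.
print("q w\u2028\u2028e r".splitlines())          # ['q w', '', 'e r']
```

So the guess of "some kind of newline" holds up, at least as far as Python's tables are concerned.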
The 32-bit variations presumably run
{ ... Code, Codepage, Codechapter, Codebook, Code, Codepage, ... }
at Intel and
{ ... Codebook, Codechapter, Codepage, Code, Codebook, ... }
at Motorola. And so it appears:
    0100 FF FE 00 00 71 00 00 00-20 00 00 00 77 00 00 00  .~..q... ...w...
    0110 20 00 00 00 65 00 00 00-20 00 00 00 72 00 00 00   ...e... ...r...
    0120 20 00 00 00 74 00 00 00-20 00 00 00 79 00 00 00   ...t... ...y...
    0130 20 00 00 00 75 00 00 00-20 00 00 00 69 00 00 00   ...u... ...i...
    0140 20 00 00 00 6F 00 00 00-20 00 00 00 70 00 00 00   ...o... ...p...
    0150 28 20 00 00 28 20 00 00-36 06 00 00 20 00 00 00  ( ..( ..6... ...
    0160 35 06 00 00 20 00 00 00-2B 06 00 00 20 00 00 00  5... ...+... ...
    0170 42 06 00 00 20 00 00 00-41 06 00 00 20 00 00 00  B... ...A... ...
    0180 3A 06 00 00 20 00 00 00-39 06 00 00 20 00 00 00  :... ...9... ...
    0190 47 06 00 00 20 00 00 00-2E 06 00 00 20 00 00 00  G... ....... ...

OR

    0100 00 00 FE FF 00 00 00 71-00 00 00 20 00 00 00 77  ..~....q... ...w
    0110 00 00 00 20 00 00 00 65-00 00 00 20 00 00 00 72  ... ...e... ...r
    0120 00 00 00 20 00 00 00 74-00 00 00 20 00 00 00 79  ... ...t... ...y
    0130 00 00 00 20 00 00 00 75-00 00 00 20 00 00 00 69  ... ...u... ...i
    0140 00 00 00 20 00 00 00 6F-00 00 00 20 00 00 00 70  ... ...o... ...p
    0150 00 00 20 28 00 00 20 28-00 00 06 36 00 00 00 20  .. (.. (...6...
    0160 00 00 06 35 00 00 00 20-00 00 06 2B 00 00 00 20  ...5... ...+...
    0170 00 00 06 42 00 00 00 20-00 00 06 41 00 00 00 20  ...B... ...A...
    0180 00 00 06 3A 00 00 00 20-00 00 06 39 00 00 00 20  ...:... ...9...
    0190 00 00 06 47 00 00 00 20-00 00 06 2E 00 00 00 20  ...G... .......
    01A0 00 00 06 2D 00 00 20 28-00 00 20 28 00 00 04 39  ...-.. (.. (...9
Alas, poor Winnie! She'd be very insulted on Mr. Gates's behalf to hear that she isn't up to playing with a full thirty-two-bit pack, but the brute fact of the matter is, she ain't. She thinks all those zeroes are not cold water but real text.
Not a pretty sight. Still, Signora Nono makes a distinction. The Intel way, she takes it to be "Unicode" text, so we meet Cyril, Methodios, Abdul & Co. skulking amidst all the little square boxes. The Motorola way, she obviously can't make head nor tail of it and doesn't even ask -- she just dumps it all at you in Codepage 1252. I.e., Winnie figures any text whose first two bytes are zeroes probably isn't text at all. Hard to blame her. We'd figure that way ourselves.
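The trap is that the 32-bit Intel opener, FF FE 00 00, begins with the 16-bit Intel opener, FF FE. Here is a hedged Python sketch of a sniffer that checks the longer signatures first; `SIGNATURES` and `sniff_bom` are hypothetical names of ours, and nothing here claims to describe what WinWord does internally. The last two lines show what a reader gets if it stops at FF FE and treats the rest as 16-bit text.

```python
import codecs

# Longest signatures first: the UTF-32 little-endian mark (FF FE 00 00)
# starts with the UTF-16 little-endian one (FF FE).
SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (codec name, decoded text), or (None, None) if no mark is found."""
    for bom, name in SIGNATURES:
        if data.startswith(bom):
            return name, data[len(bom):].decode(name)
    return None, None


le32 = bytes.fromhex("FFFE0000 71000000 20000000 77000000")
be32 = bytes.fromhex("0000FEFF 00000071 00000020 00000077")

print(sniff_bom(le32))   # ('utf-32-le', 'q w')
print(sniff_bom(be32))   # ('utf-32-be', 'q w')

# Reading the Intel file as 16-bit text instead turns the padding into NULs.
misread = le32[2:].decode("utf-16-le")
print(repr(misread))     # '\x00q\x00 \x00w\x00'
```

A file that really is 16-bit and happens to start with U+0000 after its mark would fool this check, so the ordering is a heuristic and not a proof.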
It is questionable whether UTF-7 and UTF-8 are "Unicode," rather than a compression and encipherment of it, but at least they are known formats. Winnie gets them right, in the sense that she puts up the same display as with her "Unicode" proper, the sixteen-bit Intellian kind. Of course she doesn't understand { Code 0x28, Codepage 0x20 } any better just because the rest of the document it occurs in has been encrypted.
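Whatever one calls them, both formats are reversible recodings of the same code points. A minimal Python sketch, using the standard `utf-8` and `utf-7` codecs and a made-up `sample` string with one character from each script plus the troublesome U+2028:

```python
# One letter from each keyboard row, then the line separator.
sample = "q \u0636 \u0439 \u03c2\u2028"

utf8 = sample.encode("utf-8")
utf7 = sample.encode("utf-7")

print(" ".join(f"{b:02X}" for b in "\u2028".encode("utf-8")))   # E2 80 A8
print(len(utf8), len(utf7))   # byte counts of the two encodings

# Decoding either one gives back exactly the original code points,
# U+2028 included, so nothing is gained or lost on that front.
print(utf8.decode("utf-8") == sample)   # True
print(utf7.decode("utf-7") == sample)   # True
```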
With ASCIIUCN.UTX, Signora Nono also doesn't ask any questions, but this time she isn't just shrugging her shoulders but knows exactly what's what and doesn't need to ask. That file reads in full:
\ufeffq w e r t y u i o p\u2028\u2028\u0636 \u0635 \u062b \u0642 \u0641 \u063a \u0639 \u0647 \u062e \u062d\u2028\u2028\u0439 \u0446 \u0443 \u043a \u0435 \u043d \u0433 \u0448 \u0449 \u0437\u2028\u2028; \u03c2 \u03b5 \u03c1 \u03c4 \u03c5 \u03b8 \u03b9 \u03bf \u03c0\u2028\u2028==========\u2028\u2028That is enough to be getting on with. Let us consider the "Unicode" document\u2028that consists of those four lines, being the second row of the keyboard (lower-case)\u2028alternating with spaces and concluding with two newlines in (a) US English,\u2028(b) Arabic, (c) Russian, and (d) Greek.\u2028\u2028
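Going back and forth between this notation and real code points takes very little. The sketch below leans on Python's own `unicode_escape` codec and `backslashreplace` error handler, which happen to use the same \uXXXX spelling; that they match UniPad's format in every detail is an assumption, and `ucn` is just the start of the file quoted above.

```python
# The opening of ASCIIUCN.UTX, as a raw string so the backslashes stay literal.
ucn = r"\ufeffq w e r t y u i o p\u2028\u2028\u0636 \u0635 \u062b"

# From escapes to code points (input must be plain ASCII for this to be safe).
text = ucn.encode("ascii").decode("unicode_escape")
print([f"U+{ord(c):04X}" for c in text[-5:]])
# ['U+0636', 'U+0020', 'U+0635', 'U+0020', 'U+062B']

# And back again: anything outside ASCII is respelled as \uXXXX.
again = text.encode("ascii", "backslashreplace").decode("ascii")
print(again == ucn)   # True
```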
We don't quite see, however, why this admirable system shouldn't be carried through to its logical conclusion so that instead of a lowly "cat" we'd have a lofty "\u0063\u0061\u0074" and so on. I believe somebody in the early twentieth century seriously (?) proposed that the Chinese give up those cumbersome ideograms altogether and simply learn to directly read and write the 10,000 four-digit numbers that had been assigned to them by telegraph operators.
00A7 136E 0208 211F 3B15 003F! (If you don't think that joke is very funny, maybe we didn't tell it quite right....)
Even this way, "\u2028" isn't as perspicuous as one might ideally wish. You'd as likely take it for a space as a line boundary. Still, the above version invites some simple but illuminating reformatting, as for example,
    \ufeff Q W E R T Y U I O P
    \u0636 \u0635 \u062b \u0642 \u0641 \u063a \u0639 \u0647 \u062e \u062d
    \u0439 \u0446 \u0443 \u043a \u0435 \u043d \u0433 \u0448 \u0449 \u0437
    ; \u03c2 \u03b5 \u03c1 \u03c4 \u03c5 \u03b8 \u03b9 \u03bf \u03c0
It's definitely the best yet, but of course "Unicode" it egregiously is not.
Another sect of Unicodians gave us that file we called SCHEME.UTX. This one also is off Winnie's scope. She takes it to begin with a {...Column Break...}, which is certainly not what anybody ever intended, and after that there is only plain 1252 with miscellaneous Eurogibberish _chez_ WinWord. What SCHEME.UTX _really_ is is

    100 0E FE FF 71 20 77 20 65-20 72 20 74 20 79 20 75  .~.q w e r t y u
    110 20 69 20 6F 20 70 05 28-05 28 13 B6 20 B5 20 AB   i o p.(.(.6 5 +
    120 20 C2 20 C1 20 BA 20 B9-20 C7 20 AE 20 AD 05 28   B A : 9 G . -.(
    130 05 28 12 B9 20 C6 20 C3-20 BA 20 B5 20 BD 20 B3  .(.9 F C : 5 = 3
    140 20 C8 20 C9 20 B7 05 28-05 28 3B 20 1C FB D2 20   H I 7.(.(; .{R
    150 C5 20 D1 20 D4 20 D5 20-C8 20 C9 20 CF 20 D0 05  E Q T U H I O P.
    160 28 05 28 3D 3D 3D 3D 3D-3D 3D 3D 3D 3D 05 28 05  (.(==========.(.
    170 28 54 68 61 74 20 69 73-20 65 6E 6F 75 67 68 20  (That is enough
    180 74 6F 20 62 65 20 67 65-74 74 69 6E 67 20 6F 6E  to be getting on
    190 20 77 69 74 68 2E 20 20-4C 65 74 20 75 73 20 63   with.  Let us c

Evidently 05 shifts to Page 20 ((Page 20?)), 13 to Page 06, 12 to Page 04, and 1C to Page 03.
However, 05 moves to Page 20 for one character only, then relapses back to Ansistan, whereas the others stay in effect until further notice. V. cute. It certainly manages to be shorter than the competition, but unfortunately whatever the true point of "Unicode" may be, making files shorter definitely is not it. _Au contraire_.
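The byte patterns in that dump look consistent with the Standard Compression Scheme for Unicode (SCSU, Unicode Technical Standard #6), though that identification is our assumption and nothing in the file listing says so. Under that assumption, here is a rough Python sketch of a decoder for only the handful of tags the dump uses. The window tables are the SCSU defaults as we understand them; `STATIC`, `DYNAMIC`, `window_offset` and `decode_subset` are names invented for the illustration, and anything outside this subset simply raises an error.

```python
# Default window start offsets, per the SCSU specification (assumed scheme).
STATIC = [0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000]
DYNAMIC = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]
SPECIAL = {0xF9: 0x00C0, 0xFA: 0x0250, 0xFB: 0x0370, 0xFC: 0x0530,
           0xFD: 0x3040, 0xFE: 0x30A0, 0xFF: 0xFF60}


def window_offset(index: int) -> int:
    """Map the byte that follows a 'define window' tag to a window start."""
    if 0x01 <= index <= 0x67:
        return index * 0x80
    if 0x68 <= index <= 0xA7:
        return index * 0x80 + 0xAC00
    return SPECIAL[index]   # KeyError for reserved values


def decode_subset(data: bytes) -> str:
    """Decode single-byte-mode SCSU using only the tags seen in the dump."""
    dynamic = list(DYNAMIC)   # per-stream copy; 'define' tags modify it
    active = 0                # index of the current dynamic window
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if b >= 0x80:                            # through the active window
            out.append(chr(dynamic[active] + b - 0x80))
        elif b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D):
            out.append(chr(b))                   # plain ASCII passes through
        elif 0x01 <= b <= 0x08:                  # quote ONE char from window n
            n = b - 0x01
            c = data[i]
            i += 1
            if c < 0x80:
                out.append(chr(STATIC[n] + c))
            else:
                out.append(chr(dynamic[n] + c - 0x80))
        elif b == 0x0E:                          # quote one 16-bit unit as is
            out.append(chr(int.from_bytes(data[i:i + 2], "big")))
            i += 2
        elif 0x10 <= b <= 0x17:                  # switch window until further notice
            active = b - 0x10
        elif 0x18 <= b <= 0x1F:                  # define a window and switch to it
            active = b - 0x18
            dynamic[active] = window_offset(data[i])
            i += 1
        else:
            raise ValueError(f"tag 0x{b:02X} is outside this sketch")
    return "".join(out)


# Bytes lifted from the dump: the mark, "p", two separators, three Arabic
# letters, then the Greek window definition and its first letter.
sample = bytes.fromhex("0E FE FF 70 05 28 05 28 13 B6 20 B5 20 AB 1C FB D2")
print([f"U+{ord(c):04X}" for c in decode_subset(sample)])
```

Read this way, 05 28 is a one-shot quote out of the window that starts at 0x2000, 13 and 12 switch to the windows starting at 0x0600 and 0x0400, and 1C FB defines a fresh window starting at 0x0370, which is where the Greek letters live.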
Time to call it a page. Our next thrilling webisode is titled
and illustrates the effulgent unity and runcible utility of Arabic-script "Unicode" in particular.