چرا يونيكُد؟ ("Why Unicode?")



UDL

("The 'Unicode' Digital Library")

La "Unicode" Mactabum Digitalikon

Degli Dschehaanliteratur

Take nothing for granted.


If ignorance be bliss at all, it is only because it leads to things that are so much fun to learn. (A sound sententious maxim to begin with, no? Somebody should write a poem about it. Preferably in Persian.)

Concretely to apply the apophthegm, no sooner did we sternly conclude that there is nothing out there in "Unicode" worth mentioning than we started finding mention-demanding stuff.

The very least of it is the likes of

يونيكُد چيست؟

or

ما هي الشفرة الموحّدة إيونِكودإ ؟

or

Ut-hwey Izwey Unikodyey?


These three pages are undoubtedly encoded in UTF-8, which proves beyond all possible disputing that that trick really can be done and that IE5 can perform from such tricky scripts. On the other hand, consider the URLs. There is a decidedly in-house -- not to say, hot-house -- flavor to it all. And the list of languages you can read this bit of self-promotion in is of interest in its own right. In addition to the ones you have seen, there are Czech, Danish, French, Georgian, German, Greek, Hindi, Interlingua, Italian, Japanese, Korean, Lithuanian, Polish, Portuguese, Russian, S-Chinese, Spanish, Swedish, T-Chinese, and Welsh versions.

We read that inventory off the directory listing as it stood on 10 May 2001, all these files being dated 22 February 2001. There is also, of course, the original. The webpage the original is on is a really tricky script, because the Unicoders have links to all the translated versions running down the left sidebar in the appropriate scripts. Unfortunately, IE5 seems to flub Georgian and Hindi, at least chez nous. We are sufficiently out of sympathy with the "Unicode" mindset not to feel very urgently compelled to seek out the appropriate fonts.

Needless to say, one is struck above all by the inclusion of Interlingua in such company. It and "Unicode" would seem to be a match made in Heaven. Especially because you could so obviously write Interlingua in plain ASCII if you ever wanted to write it at all. The Unicoders seem to have drawn the line at Quenya and Sindarin, but the principle of their doing so does not altogether appear. The elves need more help than the Esperantistas, and vastly more help than the Interlinguals.

This polyglot blurb happens incidentally to contain something that comes very close to being a lie as far as Arabic-script languages (and also presumably Syriac and Thana) are concerned, the assertion that "Unicode assigna un numero unic pro cata character, independentemente de platteforma, independentemente de programma, independentemente de lingua" (in the English original, "Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language"). It ain't necessarily so. Indeed, it is demonstrably not so, as regards unic. But that's another story, one we shall tell elsewhere.


Very, Very Thick Horizontal Rule


No, the promoters' own pamphlets are of slight account in the "Unicode" Digital Library even in these very early days. Fortunately the UDL already contains the complete works of one who is widely accounted (and even by Goethe) a major Weltschaair. We refer to

لسان ِ غيب

(( That picture doesn't do the bard justice, I suspect. There is some slight chance that it isn't Himself at all, but only Dr. Iraj Bashiri. Or maybe Raskolnikov or Bakunin. Let's see what else is available . . . Good grief! maybe the creature actually looked like that! ))

ترجمان ِ اسرار

Let Mr. Edward Granville Browne of the University of Cambridge expound to us why this gentleman's verses are to be accounted on a par with the prose of the "Unicode" consortium:

... the most eminent and famous ... of all the poets of Persia, the immortal and incomparable حافظ of شيراز, entitled by his admirers لسان الغيب ("the Tongue of the Unseen") and ترجمان الأسرار ("the Interpreter of Mysteries").

Nowadays, what with Progress and "Unicode" and all, one can put together one's own Digital Library, not out of single books scattered here and there, but out of whole pre-collected Digital Libraries. Such is the case with the "Unicode" Digital Library, which virtually incorporates the

Persian Digital Library

The PDL is the proud possessor of two (2) books. One of them is by أحمد شاملو and the other one isn't. The other one is, as it happens, by the above-depicted personage. We might as well leap straight to the obvious poem, the one every schoolgirl knows by heart, at least in Macaulistan:


اگر آن ترك شيرازى به دست آرد دل ما را
به خال هندويش بخشم سمرقند و بخارا را

بده ساقى مى باقى كه در جنت نخواهى يافت
كنار آب ركن آباد و گلگشت مصلا را

فغان كاين لوليان شوخ شيرين كار شهرآشوب
چنان بردند صبر از دل كه تركان خوان يغما را

ز عشق ناتمام ما جمال يار مستغنى است
به آب و رنگ و خال و خط چه حاجت روى زيبا را

من از آن حسن روزافزون كه يوسف داشت دانستم
كه عشق از پرده عصمت برون آرد زليخا را

اگر دشنام فرمايى و گر نفرين دعا گويم
جواب تلخ مى-زيبد لب لعل شكرخا را

نصيحت گوش كن جانا كه از جان دوست‌تر دارند
جوانان سعادتمند پند پير دانا را

حديث از مطرب و مى گو و راز دهر كمتر جو
كه كس نگشود و نگشايد به حكمت اين معما را

غزل گفتى و در سفتى بيا و خوش بخوان حافظ
كه بر نظم تو افشاند فلك عقد ثريا را

If you are so gentrified as to have forgotten your Persian, a trot or two or three is available readily enough. If you just need to be reminded how the vowels go, the third translation comes with a transliteration as well.

To be sure, what we just presented is not, in fact, "Unicode" at all, only Massachusetts Windows Arabic. That is to say, the page you read it on is encoded in Codepage 1256 (plus unnumerated µsoft stealth Persian), not in UTF-8 or UTF-anything-at-all.

We'd be glad to give you a URL for the UTF, but there are certain technical difficulties. The Persian Digital Library has chopped the divan up into five subdirectories, and then named the files in them only with numbers. We ourselves retrieved the above chestnut only by downloading their

complete ZIP file,

concatenating everything, and then searching for loc. cit., which means that we don't know what number (or file) it corresponds to. But we might try a binary search, assuming the "sonnets" are in order by the rhyme at the appropriate page of the PDL site. Let's see ... 495 "sonnets", call it 512.

Probe at #256 and the rhyme is ر.

Probe at #128, find د.

At #64, ت.

At #32, بست

At #16, انداخت

At #8, ايام را

At #4, تو دادهاى ما را 

And then (cheating a little) #3 is

the thing itself.
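For the record, the probing above can be tidied into the textbook version. What follows is only a sketch under loud assumptions: the poems are numbered 1 to n in rhyme order, and `rhyme_rank` is a hypothetical callable that looks at poem k and returns the alphabetical rank of its rhyme letter. The toy ranks are made up for the demonstration; they are not the PDL's data.

```python
from typing import Callable


def first_with_rank(rhyme_rank: Callable[[int], int], target: int, n: int) -> int:
    """Return the smallest poem number in 1..n whose rank is >= target.

    Returns n + 1 when no poem qualifies. Assumes ranks never decrease
    as the poem number grows.
    """
    lo, hi = 1, n + 1
    while lo < hi:
        mid = (lo + hi) // 2
        if rhyme_rank(mid) < target:
            lo = mid + 1   # everything up to mid rhymes too early
        else:
            hi = mid       # mid itself may be the first match
    return lo


# Toy stand-in for the divan (hypothetical): twelve poems, ranks in order.
TOY_RANKS = [1, 1, 1, 1, 2, 2, 4, 4, 4, 7, 7, 9]


def toy_rank(k: int) -> int:
    return TOY_RANKS[k - 1]


start = first_with_rank(toy_rank, 4, len(TOY_RANKS))  # first poem of rank 4
end = first_with_rank(toy_rank, 5, len(TOY_RANKS))    # first poem past rank 4
print(start, end - 1)
```

Note what the search does and does not deliver: it finds the block of poems sharing a rhyme letter, and inside that block one still has to read, which is where the cheating a little comes in.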

Isn't mathematics wonderful? Definitely not the example of binary search you'd want to put in a textbook! One reason we engage in this solemn funning (apart from wanting to find the reference) is that the parent institution of the Persian Digital Library happens to be the

Sharif University of Technology,

located in the capital city of a certain unmentionable Islamic Republic.

The thing itself undoubtedly has the proper credentials. It belongs in the "Unicode" Digital Library because the HTML of it clearly says

  {meta http-equiv="Content-Type" content="text/html; charset=utf-8"}
at the outset. As no doubt belong the other 494 "sonnets," the 2 "couplets," the 34 Ghat'es, the 43 "quatrains," and the three "elegies." They're only non-Anglophone nerds, remember; they say "couplets" when they mean "narrative poem written in rhymed couplets." If you're happier with Arabic than Persian, we'll share our private guess that "ghat'e" is probably قطعة.

But stay, possibly a "couplet" is not a مثنوى but a قصيدة؟ EGB inventories the divan as "... 693 separate poems; to wit, 573 odes (ghazaliyyat), 42 fragments (muqatta‘at), 69 quatrains (ruba‘iyyat), 6 mathnawis; 2 qasidas, and one "five-some" or mukhammas." No. Ocular inspection of the first "couplet" makes quite clear that it can only be a masnavi.

No doubt you will have guessed that their by-the-numbers presentation comes with a by-the-random-numbers approach as well. It does. EGB: "... something must be said of the practice of taking an augury (تفاءل) ...." So we learn that WWWonderland possesses at least two websites where you may play that parlor game, the other one being

The Oracle of Shiraz.

Before we abandon the lofty Iranian Plateau of Culture and lapse into the shallow bog of hexadecimal for the rest of this web page, permit us to point out that there is a curious occult connection between this poet-terrorist of the East and our own

Ulysses S. Grant

We allude not to the General's beard, but to the jokes associated with him. We have already recycled the one about the two tunes, one of which was "Yankee Doodle." It only remains to ask,

(gazebo)

Who Lies Buried In This Tomb?


And now back to the fun stuff, like what Winnie Nono makes of files from the "Unicode" Digital Library. Actually Her Worship has no trouble with incoming UTF-8, but when we ask her to save it in a less exotic format, specifically Codepage 1256, discrepancies happen. Two of them, at least.

(1) The alif maqsura (ى) is coded in some strange way. It looks OK inside WinWord 9, but it does not save OK as 0xEC. As witness

maqsura on lines 5 and 7

Saving only the word رندى, we get

20E2:0100  D1 E4 CF 3F 20 D5 E1 C7-CD 0D 0A 45 6C 20 43 68  QdO? UaGM..El Ch

Which was dumb, because it just shows that Winnie replaced it with a question mark. Shoulda saved it in "Unicode" ...

0100  FEFF 0631 0646 062F 06CC 0020 0635 0644
0110  0627 062D 000D 000A 0045 006C 0020 0043

Aha!  What is the story with U+06CC?

Evidently not the Unicoders' fault. If that little number can't come in as U+06CC and go out as 0xEC of Codepage 1256, no questions asked, the fault is almost certainly located at Redmond WA. Quelle surprise!
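The same experiment can be rerun without Winnie. A minimal sketch in Python, assuming its bundled `cp1256` codec follows the published Windows-1256 table (WinWord may well consult some other one). The code points are copied from the dump above; the helper `to_cp1256` and its swap of U+06CC for U+0649 are our own illustrative workaround, not anything prescribed at Redmond or by the Unicoders.

```python
# Code points copied from the "Unicode" dump above, minus the FEFF mark.
CODE_POINTS = [0x0631, 0x0646, 0x062F, 0x06CC, 0x0020, 0x0635, 0x0644,
               0x0627, 0x062D, 0x000D, 0x000A, 0x0045, 0x006C, 0x0020, 0x0043]
text = "".join(chr(cp) for cp in CODE_POINTS)

# errors="replace" puts "?" (0x3F) wherever the code page has no slot.
lossy = text.encode("cp1256", errors="replace")


def to_cp1256(s: str) -> bytes:
    """Hypothetical helper: swap U+06CC for U+0649 (0xEC) before encoding."""
    return s.replace("\u06cc", "\u0649").encode("cp1256")


patched = to_cp1256(text)

print(lossy.hex(" ").upper())
print(patched.hex(" ").upper())
```

If the assumption holds, the first line printed should match the first fifteen bytes of the Codepage 1256 dump, question mark and all, and the second should differ only in carrying EC where the 3F was. Whether folding U+06CC into U+0649 is acceptable for a given text is a judgment call we leave to the reader.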

(2) The second discompatibility is much more interesting. The "Unicode" crew have decreed an invisible glyph called U+200C which is, in effect, a zero-width space. (After we describe it in lay terms, we'll look it up in Holy Writ as well and give the Unicoders' own account of it.) Such a thing would be quite useless in most languages, and even in Arabic, but with Persian, and especially with Osmaniyyeh, it would be a handy thing to have, because it breaks the connection of the letters. It occurs in the original UDL version of a hemistich above,

جواب تلخ مى-زيبد لب لعل شكرخا را

We have replaced it with a hyphen here and set the letters on either side in bold. What is wanted is a final e followed immediately by an initial z. No space interjected, but also no connectivity.

That came out wrong, but the error is instructive. The HTML, display of which we decline to patch up with the {SPAN DIR=} rigmarole, is

{P ALIGN=CENTER}
جواب تلخ م{B}ى-ز{/B}يبد لب لعل شكرخا را
{/P}

We accidentally forced the miym not to link forwards, thus revealing that U+200C can be exactly emulated on webpages with a sequence like {B}{/B} which should certainly not print anything at all, but does force a break. Let us check it out:

جواب تلخ مىزيبد لب لعل شكرخا را

Just so. Very nice to learn this trick, since now we can fix up the

Massachusetts Windows Turkish

webpage so it looks more exactly like what Mr. E.J.W. Gibb does in his footnotes to the History of Ottoman Poetry, instead of double- and triple-spacing between the real words and single-spacing where we want the U+200C effect.

A happy note to end a webpage on, although the problem of doing this trick when the target is plain Codepage 1256 text rather than HTML is admittedly not solved.
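For what it is worth, the Windows-1256 table as Python ships it does appear to reserve a byte for U+200C. A small sketch, assuming the `cp1256` codec bundled with Python matches what a given Windows program actually writes, which it may not:

```python
# The word from the hemistich, with U+200C between the final ya and the za.
word = "\u0645\u0649\u200c\u0632\u064a\u0628\u062f"

encoded = word.encode("cp1256")        # raises UnicodeEncodeError if unmappable
zwnj_byte = "\u200c".encode("cp1256")  # the single byte standing in for U+200C

print(encoded.hex(" ").upper())
print(zwnj_byte.hex().upper())

# The round trip through the code page should lose nothing.
assert encoded.decode("cp1256") == word
```

Whether any particular editor or browser of the period honors that byte when it draws the letters is a separate question, and one this sketch does not answer.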

link to Arabic "Unicode" Tables page

No, we can't quit now. There remains some unfinished business. What is the "Unicode" crew's account of U+200C?
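One quick way to hear it from the horse's mouth is to ask the character database that ships with Python. A minimal sketch; it reports whatever version of the Unicode data the local interpreter carries, and it gives only the name and general category, not the prose of the standard.

```python
import unicodedata

# Compare the character under discussion with the genuine zero-width space.
for ch in ("\u200c", "\u200b"):
    name = unicodedata.name(ch)          # official character name
    category = unicodedata.category(ch)  # general category code
    print(f"U+{ord(ch):04X} {name} ({category})")
```

So the official label for U+200C is ZERO WIDTH NON-JOINER; the zero-width space proper lives next door at U+200B.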


(egress)