Arabic For Computers(?)

7 Muḥarram 1422

This dubious title of mine almost certainly doesn't mean whatever it makes you think of. Reading today's al-Ahrām with Netscape Navigator or using AmiPro to help you write your verse composition exercises for next Tuesday afternoon are all very well in their way, but other people have that sort of thing tolerably well in hand by now. The cutting edge, if any, is elsewhere.

What I am interested in here might be called "teaching the computer Arabic" in the sense of setting up a certain sort of computer model of the whole language. Or more likely, of certain selected parts of it. I have in mind some very large parts of it, however, above all the whole verbal system: the specifically "Semitic" part of Arabic, the part that seems like some mad proto-Zamenhof professor made it all up about the time of Hagar and Ismāʿīl and somehow persuaded the Bedouin to give up their original really human and natural language for his strange arbitrary contraption.

Of course once it was out of his hands (or those of the flying saucer people?), Semitic at once began "decaying" and becoming more and more like a natural language of human beings itself. But as everybody knows, although she probably never yet phrased it exactly like I do now, Classical Arabic is extraordinarily undecayed, considering how far back we must date its invention on the basis of the degeneration noted in related languages. Consider the usual historico-sociolinguistic observations about Icelandic nomads, the authenticity of Jāhiliyya poetry, the grammarians of Baṣra & Kūfa &c. to have been uttered at this time, and let us get back to the fun technical part.

How far could we get, for instance, towards an account of it all before we have to break down and actually represent a word as a word, as a less artificial language would force us to do at the very threshold? An amazing distance, I think. Let me be quite specific.
Since we now have the Lisānu l-ʿArabi in computer-readable form, and you can have it too by downloading from << http://mail.almuhaddith.org/ >>, what can we do with this material -- itself in some sense a tolerably complete representation of the language in question, as its very title suggests -- purely and computationally by the numbers before we worry about what any of it means?

Well, right off we can and do find that LA contains approximately 6200 articles on triliteral roots, 2800 articles on quadriliteral roots, and perhaps 400 articles that are either really fouled up on the data entry side or which our preliminary fadged-up programs couldn't parse correctly.

We further know a priori that there are 28 * 28 * 28 possible triliteral roots. Well, maybe not quite a priori. I have a sneaking suspicion that the flying saucer people and/or Prof. Dr. Zed started with 32 radical consonants, as any rational computer naturally would. Presumably four pairs of them had coalesced by phonetic decay before the Bedouin tongue ever came to be written down. Be that as it may, there are undoubtedly a potential 21,952 (0x55C0) triliteral roots in the actually attested system. And 614,656 (0x96100) quadriliteral ones. And thus we may say on the basis of LA that about 28% of the triliteral potentialities (6200 out of 21,952) and about 0.45% of the quadriliteral potentialities are actualized. These numbers will err on the high side, of course, because if LA isn't well provided with "words" the Baṣrans and Kūfans made up to amuse themselves and puzzle one another, I am Marie of Roumelia.

How to represent this information? It seems clear that we want a bitmap for the triliterals (2,744 bytes -- or still only 3,136 if we use a 32x28x28 array that can be processed without shiftiness). A list, by contrast, would require at least 18,600 bytes if done simplistically, and at least 3,528 bytes even using a 4.5-bit representation of the radicals. Conversely, we want a list, not an array, for the quadriliterals.
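The triliteral bitmap can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program actually run against LA: the numbering of the radicals as 0..27 and the names `RootBitmap`, `add` and `padded_bytes` are all hypothetical.

```python
ALPHABET = 28  # radical consonants in the attested system


class RootBitmap:
    """One bit per possible triliteral root (hypothetical sketch).

    Radicals are assumed to be numbered 0..27; the mapping from
    letters to numbers is left to the caller.
    """

    def __init__(self, alphabet=ALPHABET):
        self.alphabet = alphabet
        # 28 * 28 * 28 bits, rounded up to whole bytes
        self.bits = bytearray((alphabet ** 3 + 7) // 8)

    def _index(self, r1, r2, r3):
        for r in (r1, r2, r3):
            if not 0 <= r < self.alphabet:
                raise ValueError(f"radical out of range: {r}")
        return (r1 * self.alphabet + r2) * self.alphabet + r3

    def add(self, r1, r2, r3):
        """Mark the root r1-r2-r3 as attested."""
        i = self._index(r1, r2, r3)
        self.bits[i >> 3] |= 1 << (i & 7)

    def __contains__(self, root):
        i = self._index(*root)
        return bool(self.bits[i >> 3] & (1 << (i & 7)))


def padded_bytes(first=32, rest=28):
    """Size of the 32x28x28 variant: a 32-bit word per (r2, r3) pair."""
    return first * rest * rest // 8


bitmap = RootBitmap()
bitmap.add(21, 1, 0)            # an arbitrary root, purely for illustration
print(len(bitmap.bits))         # size of the tight bitmap in bytes
print(padded_bytes())           # size of the padded variant in bytes
print((21, 1, 0) in bitmap, (0, 0, 0) in bitmap)
```

The tight layout needs a multiply and a shift for every lookup. The padded layout spends 392 more bytes so that all 28 (or 32) possible first radicals for a given pair of later radicals sit in one machine word.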
A quadriliteral array would be very sparse and occupy at least 76,832 bytes, as against a simple list with about 11,200 bytes, plus some very small percentage of overhead for pre-indexing it.

And so in a very dehydrated sense, we can have "the whole Classical Arabic language" at our disposal in considerably less than 16K. Although the sense is very dehydrated, it nevertheless remains a sensible sense. Could the like be done for English (or even Esperanto) in ten times the space?

El Chipo de Silicio

Towards Radical Semitics
John H. McCloskey
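By way of postscript, the quadriliteral list and the 16K budget can be sketched the same way. Everything named here is an assumption for illustration: the 0..27 numbering of the radicals, the class `QuadList`, and the choice of a first-radical table as the "pre-index". Python tuples are not four bytes apiece, so `packed_size` reports what a packed four-bytes-per-root file would occupy, not what the interpreter actually uses.

```python
from bisect import bisect_left

ALPHABET = 28


class QuadList:
    """Sorted list of quadriliteral roots with a first-radical index (sketch)."""

    def __init__(self, roots):
        # roots: iterable of 4-tuples of radical numbers 0..27 (hypothetical)
        self.data = sorted({tuple(r) for r in roots})
        firsts = [r[0] for r in self.data]
        # starts[k] = position of the first root whose first radical is >= k
        self.starts = [bisect_left(firsts, k) for k in range(ALPHABET + 1)]

    def __contains__(self, root):
        root = tuple(root)
        if len(root) != 4 or not all(0 <= r < ALPHABET for r in root):
            return False
        lo, hi = self.starts[root[0]], self.starts[root[0] + 1]
        i = bisect_left(self.data, root, lo, hi)
        return i < hi and self.data[i] == root

    def packed_size(self):
        """Bytes needed at one byte per radical, excluding the index."""
        return 4 * len(self.data)


# The budget, using the approximate article counts quoted from LA.
bitmap_bytes = ALPHABET ** 3 // 8        # triliteral bitmap
list_bytes = 4 * 2800                    # quadriliteral list
index_bytes = 2 * (ALPHABET + 1)         # 29 two-byte offsets
total_bytes = bitmap_bytes + list_bytes + index_bytes

quads = QuadList([(0, 1, 2, 3), (5, 5, 5, 5), (27, 0, 0, 1)])
print((5, 5, 5, 5) in quads, (5, 5, 5, 4) in quads)
print(total_bytes, total_bytes < 16 * 1024)
```

The index costs 58 bytes on top of 11,200, which is the "very small percentage of overhead" in question. It narrows each binary search to the roots sharing a first radical.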