
5. The Database: Source Encoding Aspects

The encoding issues are discussed separately for each language. The approach differs slightly from language to language, as there are numerous problems unique to single languages. With the exception of Chinese, all languages are stored in a romanized form, which can be regarded as the information layer. The presentation layer is then machine-generated from the information layer, that is, from the romanized material. For some languages, storing the information layer is sufficient; for others, additional information has to be stored and presented. This is discussed under each individual language. In the future, all information will also be stored in Unicode encoding, which partially eliminates the need to differentiate between the information and presentation layers of a given writing.

5.1 Manju

Manju is entered and stored in a romanized form. The romanization approach uses Hauer's system, with minor deviations. The romanization is the basis for automatic rendering of the Manju writing presentation forms. Thus, the four lemmata of the first text page (page 2 of the Beijing 1957 edition) are given as:

  1. abka
  2. dergi abka
  3. niohon abka
  4. genggiyen abka

As seen in the example of the book title at the beginning of this document, the romanization is sufficient for storing the complete necessary information on Manju. The Manju writing shown in this document was generated with the Manju facilities of the MonTeX system.

5.2 Tibetan

Similar to Manju, the Tibetan writing does not pose any particular challenge in terms of distinction between information layers and presentation layers. The romanization shown here is both human-readable and machine-readable. While it serves as input for the cTibTeX system which then generates the presentation layer out of the romanized text, it represents only the second stage in data processing; the initial layer of Tibetan is a romanization carefully crafted to reflect the two-dimensional behaviour (e.g., stacks) of the Tibetan writing system. An example of how Tibetan text is entered follows here:

  1. gnam|
  2. s,teN-gnam|
  3. dguN-s,dom
  4. gnam-dWaNs

In this example, a comma between two consonants indicates that these two consonants form a stack; a capital N stands for the nasal which is usually represented as ng. Many more rules were defined which are not shown here.

After conversion into a one-dimensional string, the resulting romanization looks much more familiar. cTibTeX knows how to deal with most common consonant stacks and does not need information beyond the (strictly speaking) ambiguous and defective positioning information that can be conveyed by a one-dimensional system.

  1. gnam.
  2. steng gnam.
  3. dgung sdom
  4. gnam dwangs
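
The step from the two-dimensional input romanization to the one-dimensional form can be illustrated with a minimal Python sketch. It covers only the rules that can be read off the four sample lemmata (stack comma dropped, N to ng, W to w, hyphen to space, | to full stop). The function name flatten_tibetan and the reading of W as a subjoined w are assumptions made for illustration; the many further rules mentioned above are not modelled.

```python
# Hypothetical sketch: flatten the two-dimensional Tibetan romanization
# into the one-dimensional form. Only rules inferred from the four
# sample lemmata are covered; this is not the project's converter.
RULES = [
    (",", ""),    # stack marker between two consonants is dropped
    ("N", "ng"),  # capital N is the nasal usually written ng
    ("W", "w"),   # assumption: W marks a subjoined w
    ("-", " "),   # syllable separator becomes a space
    ("|", "."),   # closing stroke becomes a full stop
]


def flatten_tibetan(entry: str) -> str:
    """Apply the replacement rules in order and return the flat string."""
    for old, new in RULES:
        entry = entry.replace(old, new)
    return entry


for sample in ["gnam|", "s,teN-gnam|", "dguN-s,dom", "gnam-dWaNs"]:
    print(sample, "->", flatten_tibetan(sample))
```

A plain ordered replacement table is enough here because none of the five rules produces a character that a later rule consumes.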

A second example shows the Tibetan text of page 104, Beijing edition:

  1. da lo.
  2. lo de nyid.
  3. sang phod.
  4. phyi lo.

Tibetan in Manju Transliteration

Including the special Manju transliteration symbols used for representing Tibetan letters, this script can be safely stored and rendered in its romanized form. ManjuTeX is a proven tool for generating the presentation layer. The following example shows the transliteration of the four items on page 104, Beijing edition:

  1. da lo
  2. lo de niyad
  3. sang pod
  4. pyi lo

Unfortunately, this example does not show any of the Manju characters specially created for rendering Tibetan orthography.

Tibetan in Manju Transcription

Similar to the Manju transliteration of Tibetan, the transcription text can be stored and rendered in romanized form. Unlike the transliteration, the transcription limits itself to the basic character set of the Manju alphabet, but conveys an understanding of word boundaries of spoken Tibetan. See the transcription example of page 104, Beijing edition:

  1. dalo
  2. lo denid
  3. sangpod
  4. cilo

5.3 Mongolian

Mongolian, with its inherent ambiguities, requires a sophisticated approach to deal properly with the idiosyncrasies demonstrated in the Pentaglot. All ambiguous letter shapes, though not distinguishable in the presentation layer, are stored in their dictionary form: while ordu and urtu yield the same presentation form, the romanization of the information layer distinguishes between these two lexically different words. The orthography is dealt with in a similar manner. The copyists frequently omitted the dots of g and n, creating word forms looking like *xurban for Mongolian gurban (English: three). In order to make the lexical distinction between otherwise potentially ambiguous words, these cases are resolved using a modified version of the Free Variation Selector symbols of the ISO/Unicode encoding for Mongolian: g'urban indicates that gurban is the lexical form, with a presentation form looking like xurban. The first text page of the Pentaglot is shown here (delimiters omitted):

  1. tngri
  2. dägädü tngri
  3. kökä tngri
  4. gägägän tngri
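
The apostrophe convention described before this list can be sketched in Python. This is an illustration only: the function split_layers and the one-entry table VARIANTS are assumptions, and the table covers just the g' case named in the text (lexical g, written in a shape that looks like x).

```python
# Hypothetical sketch: derive the lexical form and the presentation form
# from a stored information-layer string that uses the apostrophe marker.
# Maps a marked sequence to (lexical letter, letter shape seen on the page).
VARIANTS = {"g'": ("g", "x")}


def split_layers(stored: str) -> tuple[str, str]:
    """Return (lexical form, presentation form) for a stored word."""
    lexical, presentation = stored, stored
    for marked, (lex, shape) in VARIANTS.items():
        lexical = lexical.replace(marked, lex)
        presentation = presentation.replace(marked, shape)
    return lexical, presentation


print(split_layers("g'urban"))
```

Words without a marker pass through unchanged, so the same function can be applied to every Mongolian entry.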

An example of page 104 of the Beijing edition follows (please note the irregular usage of delimiters):

  1. änä zil,
  2. mön kü zil,
  3. xoyitu zil
  4. iräkü zil, basa oruxu zil kämämüi,

The Mongolian script example of the title page given above was typeset using the MonTeX system.

The presence of synonyms in the Mongolian lemma set requires adaptive measures in the database structure; besides Tibetan, the Mongolian entries are the only ones with a 1:n-relation between page/column and lemma, as can be seen from lemma 104.4: iräkü zil, basa oruxu zil kämämüi. In the Mongolian portion of the Pentaglot database, this should translate into the distinct entries iräkü zil and oruxu zil.
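
One way to picture this 1:n relation is a mapping from a page/item key to a list of lemmata, sketched here in Python. The names lemmata and add_lemma are assumptions, as is the reading of "104.4" as page 104, fourth item; the actual database schema is not described in this document.

```python
from collections import defaultdict

# Hypothetical layout: a (page, item) key maps to one or more lemmata.
lemmata = defaultdict(list)


def add_lemma(page: int, item: int, lemma: str) -> None:
    """Append a lemma to the list stored under its page/item key."""
    lemmata[(page, item)].append(lemma)


add_lemma(104, 1, "änä zil")
add_lemma(104, 2, "mön kü zil")
add_lemma(104, 3, "xoyitu zil")
# Lemma 104.4 contains two synonyms, stored as distinct entries.
add_lemma(104, 4, "iräkü zil")
add_lemma(104, 4, "oruxu zil")

print(lemmata[(104, 4)])
```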

5.4 Uighur

Among the languages of the Pentaglot, Uighur poses the greatest encoding challenge as it is recorded without most of the distinguishing vowels used for modern Uighur. Traditional transliterations render vowel harmony but the abridged writing does not take any of these subtleties into account.

For Uighur, it thus becomes necessary to define a function which yields a rather simple surface (or presentation) form out of the complex information layer. The presentation of the title page example was prepared using ArabTeX.

  1. bu yil
  2. shu bu yil
  3. kiraduga yil
  4. kilur yil

There are substantial differences between the information and presentation layers of Uighur writing as can be seen from the traditional (lexical) rendering of the title in contrast to the presentation form (most of the diacritics omitted!):

han-ning pütügän bäsh qysmi qoshqan hät mancu söz ning ayri majin häti

The presentation layer characters entered in Uighur mode of ArabTeX are in fact:

xan nyng fotokn b^s qysmi qo^sqan xt manjo soz nyng ayri mjyn xti

The Uighur writing was entered manually; a transition function generating the presentation layer out of the information (lexical) layer has yet to be implemented.

Uighur in Manju Transcription

Unlike Uighur proper, a substantial amount of Manju-transcribed Uighur material has been entered successfully. In the given example (p. 104 of the Beijing 1957 edition), the Uighur original and the Manju transcription are fairly close:

  1. bu il
  2. shu bu il
  3. kiraduga il
  4. kiyalur il

5.5 Chinese

Chinese characters were entered using a B5-based editor; later this material was converted to UTF-8-encoded Unicode CJK characters. The pinyin equivalents were machine-generated for the majority of the material. For our example page (p. 104, Beijing edition) only the pinyin of the characters is shown here; the ideographs of the example above were typeset using the CJK system.

  1. jin1 nian2
  2. ben3 nian2
  3. ming2 nian2
  4. lai2 nian2

Since the pinyin and tonal equivalents of the Chinese characters were completely machine-generated for the first run, certain systematic errors were hard to avoid. One dedicated proofreading run thus focussed completely on pinyin and tone issues.
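
A mechanical check that such a proofreading run could start from is sketched here in Python. It only tests the surface shape of each syllable (lowercase letters followed by exactly one tone digit) and cannot tell whether a tone is the right one for a given character. The pattern, the tone range 1 to 5, and the function name malformed_syllables are assumptions, not the procedure actually used.

```python
import re

# Assumed shape: lowercase letters (including ü) plus one tone digit 1-5.
SYLLABLE = re.compile(r"[a-zü]+[1-5]")


def malformed_syllables(entry: str) -> list[str]:
    """Return the syllables of an entry that do not match the assumed shape."""
    return [s for s in entry.split() if not SYLLABLE.fullmatch(s)]


for entry in ["jin1 nian2", "ben3 nian2", "ming2 nian2", "lai2 nian2"]:
    print(entry, malformed_syllables(entry))
```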

