
4. Status of Individual Languages

As mentioned in the mailings posted to the former altainet, the author organized a team of several contributors to enter the Manju, Tibetan, Mongolian, Uighur and Chinese text into the computer so as to create a retrievable and fully indexed version of the Pentaglot. Other participants in this project contribute their valuable time to proofreading individual languages.

By order of language, the contributors are:

The following list gives an overview of the status of each language at the time of the latest revision of this document.

4.1 Manju

Entry of the complete set of Manju data was finished in early 2002. The data were romanized using Hauer's system. A handful of items (fewer than 200) need further investigation of orthography, graphical representation, etc., and are presently being worked on.

4.2 Tibetan

Tibetan Original

Approximately 20% of the entries (not in consecutive order) have been entered so far. Besides certain graphical ambiguities present in the 1957 reprint (nga vs. da or n+tshag vs. m, etc.), certain consonant clusters considered unconventional by modern usage make the subsequent proofreading phase all the more important.

Tibetan Transliteration

The transliteration can almost certainly be generated automatically from the Tibetan original. Proofreading will nonetheless be necessary in order to check the representation of unconventional consonant clusters.

Tibetan Transcription

It is doubtful whether the Tibetan transcription can be generated in a completely automated fashion, since the authors chose to follow the word boundaries of the spoken language rather than the principles of Tibetan writing. Even if a machine-generated skeletal version can be produced from certain assumptions, transformation rules and look-up tables, the bulk of the work will remain in the identification of each entry.

4.3 Mongolian

By fall 2001, the complete text had been made available in electronic form. The task of entering the words was greatly facilitated by software heuristics providing a generalized set of rules for legal word forms.
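As a rough illustration of what such a heuristic might look like, the sketch below checks one well-known constraint on Mongolian word forms, vowel harmony, on romanized input. The rule chosen, the romanization (with `ö` and `ü` for the front rounded vowels) and the name `is_plausible_word` are assumptions made for this example; the rules actually used in the project are not documented here.

```python
# Hypothetical sketch, not the project's actual rule set:
# flag romanized Mongolian words that mix back and front vowels.
BACK_VOWELS = set("aou")
FRONT_VOWELS = set("eöü")  # "i" is treated as neutral and ignored


def is_plausible_word(word: str) -> bool:
    """Return False if the word contains both back and front vowels."""
    letters = set(word.lower())
    has_back = bool(letters & BACK_VOWELS)
    has_front = bool(letters & FRONT_VOWELS)
    return not (has_back and has_front)


# "moden" is an invented, deliberately ill-formed spelling.
for candidate in ["morin", "ger", "modun", "moden"]:
    status = "ok" if is_plausible_word(candidate) else "check"
    print(candidate, status)
```

A check of this kind can only flag suspicious input for a human to review; loanwords and compounds may legitimately break the rule.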

One problem in entering the words is the great number of ambiguities resulting from a less than rigorous treatment of diacritics and glyph shapes.

4.4 Uighur

Uighur Original

By spring 2002, about one third of the total Pentaglot material had been entered into the computer. A first survey of samples of the Uighur text suggests that a simplified version (no special vowels, etc.) was used.

Uighur Transliteration

The work of entering the Uighur transliteration in Manju characters started in February 2001 and was finished during the same year. While most of the Manju transliterations can be read directly, a minority of them require an understanding of the Uighur original. The verification and correction procedure will start in early summer 2001.

4.5 Chinese

Chinese is a special case within the framework of the Pentaglot project. While the original text shows the characters, a modern rendition cannot do without a phonetic rendering; the Hanyu Pinyin system provides better access to the corpus than an exclusively character-based approach.

Chinese Original

This is one of the completed sections. The data were entered using the B5 (Big5) character set and are now being converted into Unicode, so that the few characters in the book which cannot be found in the B5 character set can also be entered. Proofreading of the whole Chinese text started in February 2001, and the first phases have been finished.
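The conversion step can be pictured with the following minimal sketch. It assumes that "B5" refers to Big5 and relies on Python's built-in `big5` codec; the function names and the sample strings are made up for this example and are not taken from the project's tools.

```python
# Sketch under assumptions: "B5" is taken to mean Big5, handled by
# Python's standard "big5" codec. Function names are hypothetical.

def b5_to_unicode(raw: bytes) -> str:
    """Decode B5-encoded bytes into a Unicode string."""
    return raw.decode("big5")


def fits_in_b5(text: str) -> bool:
    """Report whether every character of text can be encoded in B5."""
    try:
        text.encode("big5")
        return True
    except UnicodeEncodeError:
        return False


sample = "天文".encode("big5")        # stand-in for legacy B5 data
print(b5_to_unicode(sample))
print(fits_in_b5("天"))               # a common character
print(fits_in_b5("\U00020000"))       # a CJK Extension B character
```

Once the data are held as Unicode, characters for which `fits_in_b5` is false can be stored alongside the rest without special treatment.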

Chinese Romanization, or Hanyu Pinyin

As a first step, the pinyin romanization was generated automatically from the character data. Owing to the limitations of the machine-based approach, there are numerous mechanical errors, which can be identified and corrected easily. These errors can be grouped into the following categories:

Original spelling and tonal errors

Problem: The reference table containing the pinyin equivalents contains a number of misspelled entries, such as shen where xin is expected, or fa4 instead of fa3.

Measure: After correction of the table entry, the process will yield correct results throughout the whole data set.
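The effect of a table correction can be sketched as follows. The table contents, the sample entries and the choice of 法 as the character with the wrong tone are illustrative assumptions; the project's actual reference table and generation program are not reproduced here.

```python
# Hypothetical reference table: character -> numbered pinyin.
# The entry for 法 is deliberately wrong (fa4) to mimic a table error.
PINYIN_TABLE = {"法": "fa4", "天": "tian1", "文": "wen2"}


def romanize(entry: str, table: dict) -> str:
    """Look up each character; unknown characters are marked with '?'."""
    return " ".join(table.get(ch, "?") for ch in entry)


entries = ["天文", "文法", "法"]

before = [romanize(e, PINYIN_TABLE) for e in entries]

# A single correction in the table ...
PINYIN_TABLE["法"] = "fa3"

# ... followed by a rerun fixes every entry that uses the character.
after = [romanize(e, PINYIN_TABLE) for e in entries]

print(before)
print(after)
```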

Context-dependent spelling and tonal errors

Problem: Strictly speaking, these are not errors but misrepresentations. Take as an example the character with the meanings `heavy' and `double, repeated', pronounced zhong4 and chong2 respectively: the choice between the two pronunciations depends entirely on the local context of the lemma in question.

Measure: Misrepresentations of this type have to be corrected individually in the final data set.
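One way to record such individual corrections is a per-lemma override that takes precedence over the automatically generated reading. This is only a sketch: the tables, the sample lemmas and the names `DEFAULT_READINGS` and `OVERRIDES` are assumptions made for this example, not the project's data format.

```python
# Default readings from a hypothetical reference table;
# 重 defaults to zhong4 (`heavy').
DEFAULT_READINGS = {"重": "zhong4", "量": "liang4", "複": "fu4"}

# Hypothetical hand-made corrections: lemma -> complete reading.
OVERRIDES = {"重複": "chong2 fu4"}


def romanize(lemma: str) -> str:
    """Prefer a manual override; otherwise romanize character by character."""
    if lemma in OVERRIDES:
        return OVERRIDES[lemma]
    return " ".join(DEFAULT_READINGS.get(ch, "?") for ch in lemma)


print(romanize("重量"))  # default reading of 重 applies
print(romanize("重複"))  # manual override applies
```

Keeping the overrides apart from the reference table means that regenerating the romanization after a table fix does not undo corrections made by hand.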

