As mentioned in the mailings posted to the former altainet,
the author organized a team of several contributors to enter the Manju,
Tibetan, Mongolian, Uighur and Chinese material into the computer so as to create a
retrievable and fully indexed version of the Pentaglot. Other
participants in this project contribute their valuable time to
proofreading individual languages.
By order of language, the contributors are:
The following list gives an overview of the status of each language at the time of the latest revision of this document.
Entry of the complete set of Manju data was finished in early 2002. The data were romanized using Hauer's system. A handful of items (fewer than 200) need further investigation of orthography, graphical representation, etc. and are presently being worked on.
Approximately 20% of the entries (not in consecutive order) have been entered so far. Besides certain graphical ambiguities present in the 1957 reprint (nga vs. da, n+tshag vs. m, etc.), certain consonant clusters that are unconventional by modern usage make the subsequent proofreading phase all the more important.
This one can most certainly be generated from the Tibetan original. Proofreading will nonetheless be necessary in order to check the rendering of unconventional consonant clusters.
It is doubtful whether the Tibetan transcription can be generated in a completely automated fashion, since the authors chose to follow the word boundaries of the spoken language rather than the principles of Tibetan writing. Even if a machine-generated skeletal version based on certain assumptions, transformation rules and look-up tables can be produced, the bulk of the work will remain in the identification of each entry.
By fall 2001, the complete text was made available in electronic form. The task of entering the words was greatly facilitated by software heuristics providing a generalized set of rules of legal word forms.
One problem in entering the words is the great number of ambiguities resulting from the rather informal treatment of diacritics and glyph shapes.
By spring 2002, about one third of the total Pentaglot material had been entered into the computer. A first survey of samples of the Uighur text suggests that a simplified orthography (no special vowels, etc.) was applied.
The work of entering the Uighur transliteration in Manju characters started in February 2001 and was finished during the same year. While most of the Manju transliterations can be read directly, a minority of them require an understanding of the Uighur original. The verification and correction procedure will start in early summer 2001.
Chinese is a special case within the framework of the Pentaglot project. While the original text shows the characters, a modern rendition cannot do without a phonetic rendering; the Hanyu Pinyin system provides better access to the corpus than an exclusively character-based approach.
This is one of the completed sections. Data were entered using the B5 character set and are now being converted into Unicode so that the few characters in the book which cannot be found in the B5 character set can be entered as well. Proofreading of the whole Chinese text started in February 2001, and the first phases have been finished.
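The conversion step can be illustrated with a minimal Python sketch. It assumes that "B5" refers to the Big5 encoding and relies on Python's built-in `big5` codec; the function names are hypothetical and are not taken from the project's own tooling.

```python
def decode_big5(raw):
    """Decode Big5-encoded bytes into a Unicode string."""
    return raw.decode("big5")


def missing_from_big5(text):
    """Return the distinct characters of `text` that Big5 cannot encode.

    Such characters can only be stored once the data is held in Unicode.
    """
    missing = []
    for ch in text:
        try:
            ch.encode("big5")
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


if __name__ == "__main__":
    raw = "重".encode("big5")           # stands in for legacy Big5 data
    print(decode_big5(raw))
    # U+20000 lies outside the Big5 repertoire and is reported as missing.
    print(missing_from_big5("重\U00020000"))
```

A check of this kind would only list candidates; which characters of the book actually fall outside the character set has to be established from the text itself.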
In the first step, pinyin romanization was generated automatically from the character data; due to the limitations of the machine-based approach, there are numerous mechanical errors, which can be identified and corrected easily. These errors can be grouped into the following categories:
Problem: The reference table containing the pinyin equivalents contains a number of misspelled entries, such as shen where xin is expected, or fa4 instead of fa3.
Measure: After correction of the table entry, the process will yield correct results throughout the whole data set.
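Why a single table correction repairs the whole data set can be seen in a small Python sketch of table-driven romanization. The table layout, the function name and the choice of the characters 心 and 法 as carriers of the misspellings are assumptions made for illustration only; the source names the faulty readings but not the characters concerned.

```python
def to_pinyin(word, table):
    """Romanize `word` character by character using a lookup table.

    Characters missing from the table are rendered as '?'.
    """
    return " ".join(table.get(ch, "?") for ch in word)


if __name__ == "__main__":
    # Hypothetical table with two misspelled entries.
    table = {"心": "shen1", "法": "fa4"}
    print(to_pinyin("心法", table))   # reproduces the table errors

    # Correcting the table entries once ...
    table["心"] = "xin1"
    table["法"] = "fa3"
    # ... fixes every lemma on the next run of the generator.
    print(to_pinyin("心法", table))
```

Because the romanization is regenerated from the table, no individual lemma has to be touched for this category of error.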
Problem: Strictly speaking, these are not errors but misrepresentations. Take as an example the character with the meanings `heavy' and `double, repeated', pronounced zhong4 and chong2 respectively: the choice of pronunciation depends entirely on the local context of the lemma in question.
Measure: Misrepresentations of this type have to be corrected individually in the final data set.
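Although the choice of reading has to be made by hand, the lemmas that need such a decision can be located mechanically. The following Python sketch assumes a table listing every known reading per character, with the first reading serving as the machine default; the names `READINGS`, `ambiguous_positions` and `render` are hypothetical, and only the readings of 重 are taken from the text above.

```python
# Known readings per character; the first one is the machine default.
READINGS = {
    "重": ("zhong4", "chong2"),
    "心": ("xin1",),
}


def ambiguous_positions(lemma, readings):
    """List (index, character, readings) for characters with several readings."""
    return [
        (i, ch, readings[ch])
        for i, ch in enumerate(lemma)
        if len(readings.get(ch, ())) > 1
    ]


def render(lemma, readings, overrides=None):
    """Romanize `lemma`, applying manual overrides given as {index: reading}."""
    overrides = overrides or {}
    parts = []
    for i, ch in enumerate(lemma):
        default = readings.get(ch, ("?",))[0]
        parts.append(overrides.get(i, default))
    return " ".join(parts)


if __name__ == "__main__":
    print(ambiguous_positions("心重", READINGS))  # flags position 1 for review
    print(render("重", READINGS))                 # machine default
    print(render("重", READINGS, {0: "chong2"}))  # after manual correction
```

Keeping the manual decisions as per-lemma overrides, rather than editing the table, leaves the default reading intact for all other lemmas.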