====== The TreeTagger ======

The TreeTagger is a set of programs developed by computational linguist Helmut Schmid and collaborators that adds part-of-speech information to texts in many different languages. You can download it from the [[https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|author's website]] and install it on your own computer or, even better, run it from your ZEDAT account.

===== Set up (ZEDAT account) =====

To set up the TreeTagger in your ZEDAT account, sign in and run the following command (type it at the prompt and hit RETURN):

  cove.sh

Then sign out and sign in again.

(Note for advanced Linux users: This command changes your ''.bash_profile'' file. If you don't want this, look at the script itself to see what it adds to your ''$PATH'' variable and add it manually. If you don't know what this means, you are not an advanced Linux user, so don't worry and just read on.)

===== Annotating text with the tree tagger =====

To annotate a text, it must be stored in a [[resources:plain-text-files|plain text file]], which must be [[resources:transferring-files-zedat-account|uploaded to your ZEDAT account]]. Then you simply read this file using the ''cat'' command, use the pipe ''|'' to send it to the tree-tagger script for the language the text is in, and save the output to a new file.

For example, if you have a file called ''mansfield.txt'' in a directory called ''files'', which in turn is in a directory called ''cove'' (which you do, because the ''cove.sh'' script installed it there), the following command will produce a part-of-speech tagged version of that file:

  cat cove/files/mansfield.txt | tree-tagger-english > cove/files/mansfield-tagged.txt

If you open the tagged text, for example with the text editor [[resources:nano|nano]], you will see that it looks like this:

  He	PP	he
  was	VBD	be
  in	IN	in
  love	NN	love
  ,	,	,
  very	RB	very
  much	RB	much
  in	IN	in
  love	NN	love
  ;	:	;
  and	CC	and
  it	PP	it
  was	VBD	be
  a	DT	a
  love	NN	love

The first column contains the word form or punctuation mark from the original text, the second column contains a part-of-speech tag (see [[corpora:tagset-treetagger|here]] for the complete list of tags used by the TreeTagger), and the third column contains the lemma of the word form in the first column. This is the format used to represent text and word-level annotation in the [[cqp:corpus-structure#representing_information_in_corporathe_vrt_format|.vrt format]].

By the way, the TreeTagger ignores all lines starting with a ''<'' and prints them directly to the output. This means that if you have added meta-information about your text in the form of XML tags, they will appear in your tagged file in exactly the same form.
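
Because the tagged file is just three columns of text, you can inspect it with standard command-line tools. The following is a minimal sketch, not part of the TreeTagger itself: it assumes the columns are separated by tabs, and the names ''count_tags'' and ''lemmas_for_tag'' as well as the small sample file are made up for illustration. It counts how often each tag occurs and lists the lemmas for one tag, skipping lines that start with ''<''.

```shell
#!/bin/sh
# Use byte-wise sorting so the order of tags does not depend on the locale.
LC_ALL=C
export LC_ALL

# Hypothetical sample: a few lines in the three-column format shown above
# (assumed to be tab-separated), wrapped in two XML lines of the kind that
# are passed through unchanged.
sample=$(mktemp)
printf '<text id="demo">\nHe\tPP\the\nwas\tVBD\tbe\nin\tIN\tin\nlove\tNN\tlove\n,\t,\t,\nin\tIN\tin\nlove\tNN\tlove\n</text>\n' > "$sample"

# count_tags FILE: print "count tag" pairs, most frequent tag first.
count_tags() {
    grep -v '^<' "$1" |          # drop XML lines
        cut -f 2 |               # keep only the tag column
        sort | uniq -c |         # count identical tags
        sort -k1,1nr -k2,2 |     # highest count first, then by tag
        awk '{print $1, $2}'     # normalise the spacing added by uniq -c
}

# lemmas_for_tag FILE TAG: print the lemma (column 3) of every token
# whose tag (column 2) equals TAG.
lemmas_for_tag() {
    awk -F '\t' -v tag="$2" '!/^</ && $2 == tag {print $3}' "$1"
}

count_tags "$sample"
lemmas_for_tag "$sample" NN
```

To try this on your own data, replace ''"$sample"'' with the path to a tagged file, for example ''cove/files/mansfield-tagged.txt''.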