====== The TreeTagger ======

The TreeTagger is a set of programs developed by computational linguist Helmut Schmid and collaborators that allows you to add part-of-speech information to texts in many different languages. You can download it from the [[https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/|author's website]] and install it on your computer, or, even better, you can run it from your ZEDAT account.

===== Set up (ZEDAT account) =====

To set up the TreeTagger in your ZEDAT account, sign in and run the following command (type it at the prompt and hit RETURN):

	cove.sh

Then sign out and sign in again. (Note for advanced Linux users: This command will make changes to your .bash_profile file, so if you don't want this, look at the script itself to see what information it adds to your $PATH variable and add it manually. If you don't know what this means, you are not an advanced Linux user, so don't worry and just read on.)
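If you want to check whether the set-up worked after signing in again, you can ask the shell whether it finds the tagging script. The following is only a small sketch: ''check_cmd'' is a helper function made up for this example, and it assumes that ''cove.sh'' makes the ''tree-tagger-english'' script used below available through your $PATH.

```shell
# check_cmd is a hypothetical helper: it reports whether the shell
# can find a command by name via $PATH.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found"
        return 1
    fi
}

# Assumption: the set-up makes tree-tagger-english available on $PATH.
check_cmd tree-tagger-english || echo "Try signing out and in again."
```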

===== Annotating text with the tree tagger =====

To annotate a text, you must store it in a [[resources:plain-text-files|plain text file]] and [[resources:transferring-files-zedat-account|upload it to your ZEDAT account]].

Then, you simply have to read this file using the ''cat'' command, use the pipe ''|'' to send it to the tree-tagger script for whatever language the text is in, and save the output to a new file. For example, if you have a file called ''mansfield.txt'' in a directory called ''files'', which, in turn, is in a directory called ''cove'' (which you do, because the ''cove.sh'' script installed it there), the following will produce a part-of-speech tagged version of that file:

	cat cove/files/mansfield.txt | tree-tagger-english > cove/files/mansfield-tagged.txt
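If you have several text files, you do not have to type this command once per file. The following is a minimal sketch of a loop that does the same thing for every ''.txt'' file in a directory. The function name ''tag_all'' and the variable ''TAGGER'' are made up for this example; only ''tree-tagger-english'' comes from the command above.

```shell
# tag_all DIR: tag every .txt file in DIR and write the result to
# NAME-tagged.txt next to it. TAGGER is a hypothetical variable that
# lets you pick another language script; the default is the English one.
tag_all() {
    dir=$1
    tagger=${TAGGER:-tree-tagger-english}
    for f in "$dir"/*.txt; do
        [ -e "$f" ] || continue                     # no .txt files found
        case $f in *-tagged.txt) continue ;; esac   # skip earlier output
        # Same effect as: cat FILE | tree-tagger-english > FILE-tagged.txt
        "$tagger" < "$f" > "${f%.txt}-tagged.txt"
    done
}
```

With this function defined, ''tag_all cove/files'' would create ''mansfield-tagged.txt'' along with a tagged version of every other ''.txt'' file in that directory.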

If you open the tagged text, for example with the text editor [[resources:nano|nano]], you will see that it looks like this:

	He	PP	he
	was	VBD	be
	in	IN	in
	love	NN	love
	,	,	,
	very	RB	very
	much	RB	much
	in	IN	in
	love	NN	love
	;	:	;
	and	CC	and
	it	PP	it
	was	VBD	be
	a	DT	a
	love	NN	love

The first column contains the word form or punctuation mark from the original text, the second column contains a part-of-speech tag (see [[corpora:tagset-treetagger|here]] for the complete list of tags used by the TreeTagger), and the third column contains the lemma of the word form in the first column. This is how text and word-level annotation are represented in the [[cqp:corpus-structure#representing_information_in_corporathe_vrt_format|.vrt format]].
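Because the three columns are separated by tabs, standard command-line tools can work with the tagged file directly. The sketch below builds a small sample from the lines shown above and then counts the tags and lists the lemmas for one tag. The function names ''tag_counts'' and ''lemmas_for_tag'' are made up for this example, and the temporary file stands in for your own tagged file.

```shell
# Build a small tab-separated sample (word form, tag, lemma) in a temporary file.
tagged=$(mktemp)
printf '%s\t%s\t%s\n' \
    He PP he  was VBD be  in IN in  love NN love \
    , , ,  very RB very  much RB much  in IN in  love NN love > "$tagged"

# tag_counts FILE: how often does each tag (column 2) occur, most frequent first?
tag_counts() {
    awk -F'\t' 'NF == 3 { n[$2]++ } END { for (t in n) print n[t] "\t" t }' "$1" |
        sort -k1,1nr -k2,2
}

# lemmas_for_tag TAG FILE: the distinct lemmas (column 3) of all tokens with TAG.
lemmas_for_tag() {
    awk -F'\t' -v tag="$1" '$2 == tag { print $3 }' "$2" | sort -u
}

tag_counts "$tagged"
lemmas_for_tag NN "$tagged"
```

To use this with a real file, replace ''"$tagged"'' with the path to your tagged text, for example ''cove/files/mansfield-tagged.txt''.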

By the way, the TreeTagger ignores all lines starting with a ''<'' and prints them directly to the output. This means that if you have added meta-information about your text in the form of XML tags, they will appear in your tagged file in exactly the same form.
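As a consequence, a tagged file can contain two kinds of lines: markup lines and token lines. The sketch below shows one way to tell them apart with ''grep''. The sample file, the ''<text id="mansfield">'' tag, and the function names are made up for this example, and it assumes that no token line in your text begins with ''<''.

```shell
# Build a small sample of tagged output with XML tags around the tokens.
# The <text id="mansfield"> tag is a hypothetical piece of meta-information.
vrt=$(mktemp)
printf '%s\n' '<text id="mansfield">' > "$vrt"
printf '%s\t%s\t%s\n' He PP he  was VBD be >> "$vrt"
printf '%s\n' '</text>' >> "$vrt"

# Lines starting with "<" are treated as markup, all other lines as tokens.
markup_lines() { grep '^<' "$1"; }
token_lines()  { grep -v '^<' "$1"; }

markup_lines "$vrt"          # the two XML tags, unchanged
token_lines "$vrt" | wc -l   # number of token lines
```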