The TreeTagger

The TreeTagger is a set of programs developed by computational linguist Helmut Schmid and collaborators that allows you to add part-of-speech information to texts in many different languages. You can download it from the author's website and install it on your computer, or, even better, you can run it from your ZEDAT account.

Set up (ZEDAT account)

To set up the TreeTagger in your ZEDAT account, sign in and run the following command (type it at the prompt and hit RETURN):

cove.sh

Then sign out and sign in again. (Note for advanced Linux users: This command will make changes to your .bash_profile file, so if you don't want this, look at the script itself to see what information it adds to your $PATH variable and add it manually. If you don't know what this means, you are not an advanced Linux user, so don't worry and just read on.)
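
After signing in again, you may want to check that the set-up worked. The following is a minimal sketch, assuming that cove.sh has added the directory containing the tree-tagger-english script to your $PATH; the helper name check_tool is made up for this illustration and is not part of the TreeTagger.

```shell
# Report whether a command can be found via $PATH.
# check_tool is a hypothetical helper name used only in this example.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found in PATH"
    fi
}

check_tool tree-tagger-english
```

If the script is reported as not found, signing out and in again (or re-running cove.sh) is the first thing to try.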

Annotating text with the TreeTagger

Before you can annotate a text, it must be stored as a plain text file and uploaded to your ZEDAT account.

Then you simply read this file with the cat command, use the pipe | to send it to the tree-tagger script for the language the text is in, and save the output to a new file. For example, if you have a file called mansfield.txt in a directory called files, which in turn is in a directory called cove (which you do, because the cove.sh script installed it there), the following command will produce a part-of-speech tagged version of that file:

cat cove/files/mansfield.txt | tree-tagger-english > cove/files/mansfield-tagged.txt
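
If you have several files, the same step can be repeated in a loop. The following is a sketch only: the function name tag_all and the -tagged.txt naming scheme are choices made for this example, and it assumes that the tree-tagger script reads the text from standard input, as it does in the command above.

```shell
# Tag every .txt file in a directory, writing NAME-tagged.txt next to it.
# tag_all is a hypothetical helper; the second argument selects the tagger
# script and defaults to tree-tagger-english.
tag_all() {
    dir=$1
    tagger=${2:-tree-tagger-english}
    for f in "$dir"/*.txt; do
        [ -e "$f" ] || continue                       # no .txt files found
        case "$f" in *-tagged.txt) continue ;; esac   # skip earlier output
        "$tagger" < "$f" > "${f%.txt}-tagged.txt"
    done
}

tag_all cove/files
```

For a text in another language, pass the corresponding script as the second argument. Files that already end in -tagged.txt are skipped, so the loop can be re-run without tagging its own output.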

If you open mansfield-tagged.txt, for example with the text editor nano, you will see that it looks like this:

He	PP	he
was	VBD	be
in	IN	in
love	NN	love
,	,	,
very	RB	very
much	RB	much
in	IN	in
love	NN	love
;	:	;
and	CC	and
it	PP	it
was	VBD	be
a	DT	a
love	NN	love

The first column contains the word form or punctuation mark from the original text, the second column contains a part-of-speech tag (see here for the complete list of tags used by the TreeTagger), and the third column contains the lemma of the word form in the first column. The columns are separated by tabs. This is the format used to represent text and word-level annotation in the .vrt format.
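
Because the columns are separated by tabs, standard command-line tools can work with them directly. The sketch below builds a small sample file from the lines shown above and then summarises it; the function names tag_counts and noun_lemmas are invented for this example.

```shell
# Build a small sample in the three-column format
# (word form, tag, lemma, separated by tabs).
sample=$(mktemp)
printf 'He\tPP\the\nwas\tVBD\tbe\nin\tIN\tin\nlove\tNN\tlove\n,\t,\t,\nvery\tRB\tvery\nmuch\tRB\tmuch\nin\tIN\tin\nlove\tNN\tlove\n' > "$sample"

# Count how often each part-of-speech tag (column 2) occurs,
# most frequent first.
tag_counts() {
    cut -f2 "$1" | sort | uniq -c | sort -rn
}

# List the distinct lemmas (column 3) of all tokens tagged NN.
noun_lemmas() {
    awk -F'\t' '$2 == "NN" { print $3 }' "$1" | sort -u
}

tag_counts "$sample"
noun_lemmas "$sample"
```

To run the same summaries on your own data, replace "$sample" with the path to a tagged file, for example cove/files/mansfield-tagged.txt.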

By the way, the TreeTagger does not tag lines starting with a < but prints them to the output unchanged. This means that, if you have added meta-information about your text in the form of XML tags, they will appear in your tagged file in exactly the same form.
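
Since the markup lines are passed through unchanged, a tagged file can be split into annotation and meta-information again. The following sketch uses a made-up sample (the element names and the id attribute are assumptions for illustration, as are the function names tokens_only and markup_only).

```shell
# Hypothetical tagged file containing XML meta-information lines.
vrt=$(mktemp)
printf '<text id="mansfield">\n<s>\nHe\tPP\the\nwas\tVBD\tbe\n</s>\n</text>\n' > "$vrt"

# Keep only the token lines (drop every line that starts with "<").
tokens_only() {
    grep -v '^<' "$1"
}

# Keep only the markup lines.
markup_only() {
    grep '^<' "$1"
}

# Number of tokens in the sample, not counting markup lines.
tokens_only "$vrt" | wc -l
```

Note that this simple filter would also drop a token line whose word form itself begins with <, so it is best suited to texts in which that character occurs only in markup.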

resources/tree-tagger.txt · Last modified: 2021/02/10 11:55 by astefanowitsch