An Interinstitutional Center for Research and Development in Computational Linguistics
NILC's Taggers
Related Project: Lácio-Web
Starting Time: 1998 (as Rachel Aires’ MSc project)
Goal
The main purpose of this project is to study and evaluate several state-of-the-art taggers available on the Web, with the aim of choosing the best one (or the best combination of them) for tagging corpora of Brazilian Portuguese texts with the NILC tagset.
A further goal is to implement Part-of-Speech (POS) taggers for Brazilian Portuguese that use empirical and symbolic methods and whose performance is comparable to the state of the art in this area.
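As an illustration of what "the best combination" of taggers could look like, here is a minimal sketch of per-token majority voting over the outputs of several taggers. This page does not say which combination method the project used, so the voting scheme, the function name `combine_by_vote` and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def combine_by_vote(tag_sequences, fallback=0):
    """Combine the tags proposed by several taggers by majority vote.

    tag_sequences: list of equal-length tag lists, one list per tagger,
                   all aligned to the same tokens.
    fallback:      index of the tagger that is trusted when the vote is tied
                   (hypothetical rule, chosen only for this sketch).
    """
    if len({len(seq) for seq in tag_sequences}) != 1:
        raise ValueError("all taggers must tag the same tokens")

    combined = []
    for votes in zip(*tag_sequences):
        counts = Counter(votes)
        best = max(counts.values())
        # Tags that share the highest number of votes, in tagger order.
        winners = [tag for tag in votes if counts[tag] == best]
        preferred = votes[fallback]
        combined.append(preferred if preferred in winners else winners[0])
    return combined


# Hypothetical outputs of three taggers for the same three tokens.
tagger_a = ["ART", "N", "V"]
tagger_b = ["ART", "ADJ", "V"]
tagger_c = ["PREP", "N", "N"]
print(combine_by_vote([tagger_a, tagger_b, tagger_c]))
```

With three taggers a tie can only happen when all three disagree, in which case the sketch simply keeps the tag of the fallback tagger; a real combination would more plausibly weight each tagger by its measured accuracy.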
Current Status
A 104,966-word corpus has been tagged by the current versions of the taggers and manually corrected, in order to incrementally produce a larger training corpus and improved taggers.
Download: journalistic texts, literary texts, didactic texts
Results
- Three different POS taggers available on the Web (TreeTagger (Schmid, 1995), MXPOST (Ratnaparkhi, 1996), and the TBL Tagger (Brill, 1995)) have been trained on a 104,966-word corpus of Brazilian Portuguese texts, and a symbolic rule-based tagger (PoSiTagger), derived from ReGra's lexicon and disambiguation rules, has been developed at NILC.
- The NILC corpus has been tagged with both the full and the partial NILC tagset.
- Since November 2000, Rachel Aires has made available a Portuguese model for MXPOST that was trained with a much simpler tagset than the one used in her MSc project. Only 27 tags plus punctuation tags were considered, and the model achieved 97% accuracy. Although a 10-fold cross-validation strategy was used, this accuracy should not be generalized to arbitrary texts: the training corpus is small (about 100,000 words) and is therefore not a representative sample of the Portuguese language in general. The MSc project showed that precision differs across the three genres studied and that the journalistic genre is the least ambiguous and the easiest to tag. You can download the trained tagger, the tagset and the evaluation results per tag.
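The evaluation described above (10-fold cross-validation with results reported per tag) can be sketched as follows. This is only an outline under stated assumptions: the corpus is assumed to be a list of sentences, each a list of `(word, tag)` pairs, and a simple most-frequent-tag baseline stands in for MXPOST, which is not called here. The names `train_baseline` and `cross_validate` are hypothetical.

```python
from collections import Counter, defaultdict


def train_baseline(sentences):
    """Learn the most frequent tag of each word (stand-in for a real tagger)."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sentence in sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
            all_tags[tag] += 1
    default = all_tags.most_common(1)[0][0]  # tag used for unknown words
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lambda word: lexicon.get(word.lower(), default)


def cross_validate(sentences, k=10):
    """Return (overall accuracy, accuracy per gold tag) over k folds."""
    correct, total = Counter(), Counter()
    for i in range(k):
        # Fold i is held out for testing; the other k-1 folds are used for training.
        test = sentences[i::k]
        train = [s for j, s in enumerate(sentences) if j % k != i]
        tag_word = train_baseline(train)
        for sentence in test:
            for word, gold in sentence:
                total[gold] += 1
                if tag_word(word) == gold:
                    correct[gold] += 1
    per_tag = {tag: correct[tag] / total[tag] for tag in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_tag


# Toy corpus, invented for the example: the same sentence repeated 20 times.
corpus = [[("o", "ART"), ("gato", "N"), ("dorme", "V")]] * 20
overall, per_tag = cross_validate(corpus, k=10)
print(overall, per_tag)
```

Running the same procedure separately on the journalistic, literary and didactic portions of a corpus would give per-genre figures of the kind the MSc project compared.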
You can also download three trained taggers resulting from the Lácio-Web project:
Trained MACMORPHO files for MXPOST
Trained MACMORPHO files for TreeTagger
Trained MACMORPHO files for Brill Tagger (TBL)
Team:
Rachel Virgínia Xavier Aires - MSc Student
Marcio Luis Barse Andreeta - A student who worked on the tagging of the training-test corpus and on the coding of several tools for combining and evaluating the taggers, [1998-2000]
Ronaldo Teixeira Martins - The linguist who wrote the version of the NILC tagset used on this project
Denise Khun - The linguist who wrote most of the PoSiTagger rules, [June 2000]
Ana Raquel Marchi - The linguist who worked on the correction of the training-test corpus, [June 2000]
Financial Support
Itautec-Philco S.A.
Intelligenesis/Webmind: 1999-2000
Finep (PADCT-CE, Proc. 88-98-059100-02-01): 1999-2000
PADCT/Finep - Itautec-Philco (2000-2001)
Contact
Related Publications
Aires, R. V. X. (2000). Implementação, Adaptação, Combinação e Avaliação de Etiquetadores para o Português do Brasil. MSc thesis, October 2000. (PostScript file available for download.)