Núcleo Interinstitucional de Lingüística Computacional

 

 

An Interinstitutional Center for Research and Development in Computational Linguistics

NILC's Taggers

 

 

Related Project: Lacio-WEB

Starting Time: 1998 (as Rachel Aires’ MSc project)

Goal

The main purpose of this project is to study and evaluate several state-of-the-art taggers available on the Web, with the aim of choosing the best one (or the best combination of them) for tagging corpora of Brazilian Portuguese texts with the NILC tagset.

A second goal is to implement Part-of-Speech (POS) taggers for Brazilian Portuguese that use empirical and symbolic methods and whose performance is comparable to the state of the art in this area.
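To make the idea of a "combination" of taggers concrete, the sketch below shows one common scheme, simple majority voting over the tags proposed by several taggers for each token. This is an illustration only and is not necessarily the combination method used in this project; the function name `majority_vote`, the `tie_break` parameter, and the toy tags are assumptions, not part of the NILC tools or tagset.

```python
from collections import Counter


def majority_vote(tag_sequences, tie_break=0):
    """Combine the outputs of several taggers by per-token majority vote.

    tag_sequences: list of tag lists, one per tagger, all of equal length.
    tie_break: index of the tagger whose tag wins when the vote is tied
               (hypothetical policy; e.g. the most accurate single tagger).
    """
    if len({len(seq) for seq in tag_sequences}) != 1:
        raise ValueError("need at least one tagger and equal-length outputs")

    combined = []
    for votes in zip(*tag_sequences):
        ranked = Counter(votes).most_common()
        best_tag, best_count = ranked[0]
        # Collect every tag that shares the highest vote count.
        tied = [tag for tag, count in ranked if count == best_count]
        combined.append(best_tag if len(tied) == 1 else votes[tie_break])
    return combined


# Toy outputs of three hypothetical taggers for a three-token sentence.
tagger_a = ["ART", "N", "V"]
tagger_b = ["ART", "V", "V"]
tagger_c = ["PREP", "N", "ADJ"]
combined_tags = majority_vote([tagger_a, tagger_b, tagger_c])
```

With three taggers, a tag needs two votes to win outright; when all three disagree, the sketch falls back to the tagger chosen by `tie_break`.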

Current Status

A 104,966-word corpus has been tagged by the current versions of the taggers and manually corrected, in order to incrementally produce larger training corpora and improved taggers.

download the full corpus

download the journalistic texts     download the literary texts     download the didactic texts

download the tagset

Results

- Three POS taggers available on the Web (TreeTagger (Schmid, 1995), MXPOST (Ratnaparkhi, 1996), and TBL Tagger (Brill, 1995)) have been trained with a 104,966-word corpus of Brazilian Portuguese texts, and a symbolic rule-based tagger derived from ReGra's lexicon and disambiguation rules has been developed at NILC (PoSiTagger).

- The NILC corpus has been tagged with both the full and the partial NILC tagset.

- Since November 2000, Rachel Aires has made available a Portuguese model for MXPOST that was trained using a much simpler tagset than the one used in her MSc project. Only 27 tags plus punctuation tags were considered, achieving 97% accuracy. Although a 10-fold cross-validation strategy was used, this accuracy should not be generalized to arbitrary texts: the training corpus is small (about 100,000 words) and is therefore not representative of the Portuguese language in general. The MSc project showed that precision differs across the three genres studied and that the journalistic genre is the least ambiguous and the easiest to tag. You can download the trained tagger, the tagset and the evaluation results per tag.

You can also download three trained taggers resulting from the Lácio-Web project:
    Trained MACMORPHO files for MXPOST
    Trained MACMORPHO files for TreeTagger
    Trained MACMORPHO files for Brill Tagger (TBL)
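For readers unfamiliar with the evaluation strategy mentioned in the results, the following is a minimal sketch of 10-fold cross-validation reporting overall and per-tag accuracy. It is not the project's evaluation code: the stand-in tagger (most frequent tag per word), the function names, and the toy tags are all assumptions introduced for illustration; in practice a real tagger such as MXPOST would be trained on each training split.

```python
from collections import Counter, defaultdict


def k_fold_splits(sentences, k=10):
    """Yield (train, test) pairs; fold i tests on every k-th sentence from i."""
    for i in range(k):
        test = sentences[i::k]
        train = [s for j, s in enumerate(sentences) if j % k != i]
        yield train, test


def train_baseline(train):
    """Hypothetical stand-in tagger: most frequent tag seen for each word."""
    word_tags = defaultdict(Counter)
    all_tags = Counter()
    for sentence in train:
        for word, tag in sentence:
            word_tags[word.lower()][tag] += 1
            all_tags[tag] += 1
    default_tag = all_tags.most_common(1)[0][0]  # used for unknown words
    lexicon = {w: c.most_common(1)[0][0] for w, c in word_tags.items()}
    return lambda word: lexicon.get(word.lower(), default_tag)


def cross_validate(sentences, k=10):
    """Return (overall accuracy, {gold tag: accuracy}) pooled over k folds."""
    correct = total = 0
    per_tag = defaultdict(lambda: [0, 0])  # gold tag -> [correct, total]
    for train, test in k_fold_splits(sentences, k):
        tagger = train_baseline(train)
        for sentence in test:
            for word, gold in sentence:
                hit = tagger(word) == gold
                per_tag[gold][0] += hit
                per_tag[gold][1] += 1
                correct += hit
                total += 1
    return correct / total, {t: c / n for t, (c, n) in per_tag.items()}


# Toy corpus of (word, tag) sentences; the tags are illustrative only.
toy_corpus = [[("o", "ART"), ("gato", "N"), ("dorme", "V")]] * 20
accuracy, accuracy_per_tag = cross_validate(toy_corpus, k=10)
```

Because every sentence is tested exactly once by a model that never saw it during training, the pooled accuracy is a fair estimate for text resembling the corpus; as noted above, it says little about genres the corpus does not cover.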

Team:

Rachel Virgínia Xavier Aires - MSc Student

Sandra Maria Aluísio - Supervisor

Marcio Luis Barse Andreeta - a student who worked on the tagging of the training-test corpus and on the implementation of several tools for combining and evaluating the taggers, [1998-2000]

Ronaldo Teixeira Martins - The linguist who wrote the version of the NILC tagset used on this project

Denise Kuhn - The linguist who wrote most of the PoSiTagger rules, [June 2000]

Ana Raquel Marchi - The linguist who worked on the correction of the training-test corpus, [June 2000]


Financial Support
Itautec-Philco S.A.

CNPq

Intelligenesis/Webmind: 1999-2000

Finep (PADCT-CE, Proc. 88-98-059100-02-01): 1999-2000

PADCT/Finep - Itautec-Philco (2000-2001)

FAPESP (2001-2003)


Contact
Sandra Maria Aluísio: sandra@icmc.usp.br


Related Publications

Aires, R. V. X. (2000). Implementação, Adaptação, Combinação e Avaliação de Etiquetadores para o Português do Brasil. MSc Thesis. October 2000. download ps file

Aires, R. V. X.; Aluísio, S. M.; Kuhn, D. C. S.; Andreeta, M. L. B.; Oliveira Jr., O. N. (2000). Combining Multiple Classifiers to Improve Part of Speech Tagging: A Case Study for Brazilian Portuguese. (SBIA'2000) Atibaia, SP, November 20-22. download ps file

Aires, R. V. X.; Aluísio, S. M. (2000). Implementação, Adaptação e Avaliação de Etiquetadores para o Português do Brasil. In V Workshop de Teses e Dissertações em Andamento do ICMC/USP. São Carlos, 2000. p. 109-110.

Aires, R. V. X.; Aluísio, S. M. (1999). Um Etiquetador para o Português do Brasil. In IV Workshop de Teses e Dissertações em Andamento do ICMC/USP. São Carlos, 1999. p. 57-58.

Aires, R. V. X.; Aluísio, S. M. Criação de um corpus com 1.000.000 de palavras etiquetado morfossintaticamente. Série de Relatórios do NILC. NILC-TR-01-8, Outubro 2001, 14p. download zip file

Aluísio, S. M.; Aires, R. V. Etiquetação de um Corpus e Construção de um Etiquetador de Português. Relatórios Técnicos do ICMC-USP, 107 (NILC-TR-00-2). Março 2000, 18p. download zip file

Aires, R. V. X.; Aluísio, S. M. (2001). Implementação, Adaptação, Combinação e Avaliação de Etiquetadores para o Português do Brasil. In VI Workshop de Teses e Dissertações defendidas do ICMC/USP. 2001.