Mac-Morpho

Mac-Morpho is a corpus of Brazilian Portuguese texts annotated with part-of-speech tags. Its first version was released in 2003 [1], and two revisions have since been made to improve the quality of the resource [2, 3].

The corpus is available for download split into train, development and test sections. These account for 76%, 4% and 20% of the corpus, respectively. The unusual numbers arise because the corpus was first split 80%/20% into train/test, and then 5% of the train section (4% of the whole corpus) was set aside for development. This split was used in [3], and new POS tagging research with Mac-Morpho is encouraged to follow it so that results can be compared consistently.
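The arithmetic behind the 76%/4%/20% figures can be checked with a minimal Python sketch. The function name `split_corpus` and the choice to cut the list in order (rather than shuffling) are assumptions made for illustration; the source does not say how sentences were assigned to each section, so the distributed files should be used when reproducing published results.

```python
def split_corpus(sentences, test_frac=0.20, dev_frac_of_train=0.05):
    """Split a list of sentences into (train, dev, test).

    Hypothetical helper: first reserves `test_frac` of the corpus for test,
    then sets aside `dev_frac_of_train` of the remaining train portion for
    development. Sentences are taken in order, which is an assumption.
    """
    n_total = len(sentences)
    n_test = round(n_total * test_frac)
    n_train_full = n_total - n_test            # the initial 80% train portion
    n_dev = round(n_train_full * dev_frac_of_train)
    n_train = n_train_full - n_dev             # what remains for training

    train = sentences[:n_train]
    dev = sentences[n_train:n_train_full]
    test = sentences[n_train_full:]
    return train, dev, test


if __name__ == "__main__":
    train, dev, test = split_corpus(list(range(1000)))
    print(len(train), len(dev), len(test))  # 760 40 200
```

With 1000 items, the 80% train portion holds 800 of them, 5% of which (40) go to development, leaving 760 for training and 200 for test.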

Disclaimer: Mac-Morpho versions 1, 2 and 3 are licensed under a Creative Commons Attribution 4.0 International License. This means you can distribute, remix, tweak, and build upon any Mac-Morpho version, even commercially, as long as you give us credit for the original creation. Mac-Morpho License.

Revisions

The first main revision is described in [2] and consisted of two parts: cleaning noise in the data and changing the tagset to include preposition contractions. The noise in the original corpus consisted mainly of sentences with missing words and of repeated sentences. Sentences with missing words were detected using a set of heuristics that look for impossible POS tag sequences, and were discarded from the corpus. Repeated sentences were also identified and removed. Including preposition contractions in the tagset avoids the need for a preprocessing step that splits such tokens.
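The two cleaning steps described above could be sketched roughly as follows in Python. This is only an illustration: the function `clean_corpus`, the representation of a sentence as a list of `(word, tag)` pairs, and the example tag pair `("ART", "V")` are all assumptions, and the actual heuristics used in [2] are not given here.

```python
def clean_corpus(sentences, impossible_bigrams):
    """Drop sentences with an 'impossible' tag pair, then drop repeats.

    sentences: list of sentences, each a list of (word, tag) tuples.
    impossible_bigrams: set of (tag, next_tag) pairs treated as a sign
        that a word is missing. Hypothetical input; the real heuristics
        from [2] are not reproduced here.
    """
    seen = set()
    cleaned = []
    for sentence in sentences:
        tags = [tag for _, tag in sentence]
        # Discard the sentence if any adjacent tag pair is flagged.
        if any(pair in impossible_bigrams for pair in zip(tags, tags[1:])):
            continue
        # Discard exact repeats of a sentence that was already kept.
        key = tuple(sentence)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(sentence)
    return cleaned


if __name__ == "__main__":
    corpus = [
        [("O", "ART"), ("menino", "N"), ("corre", "V")],
        [("O", "ART"), ("menino", "N"), ("corre", "V")],  # repeated sentence
        [("O", "ART"), ("corre", "V")],                   # noun missing
    ]
    # Illustrative rule only: an article directly followed by a verb.
    kept = clean_corpus(corpus, impossible_bigrams={("ART", "V")})
    print(len(kept))  # 1
```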

A second revision followed, described in [3]. It corrected problematic sentences (which had tags assigned to empty strings) and made a further change to the tagset. This change removed tags that require knowledge above the morpho-syntactic level to be assigned correctly: auxiliary verbs (merged with main verbs), relative connective pronouns (merged with connective pronouns) and relative connective adverbs (merged with connective adverbs).
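The tag merging in this revision amounts to a simple mapping over tags, as in the Python sketch below. The tag names (`VAUX`, `V`, `PRO-KS-REL`, `PRO-KS`, `ADV-KS-REL`, `ADV-KS`, `PCP`) and the `word_TAG` token format are assumptions about how the corpus files are encoded and should be checked against the corpus documentation; the helper names are hypothetical.

```python
# Assumed tag names for the three merges described in the text.
TAG_MERGES = {
    "VAUX": "V",             # auxiliary verbs -> main verbs
    "PRO-KS-REL": "PRO-KS",  # relative connective pronouns -> connective pronouns
    "ADV-KS-REL": "ADV-KS",  # relative connective adverbs -> connective adverbs
}


def parse_sentence(line):
    """Parse a line of whitespace-separated `word_TAG` tokens (assumed format)."""
    tokens = []
    for item in line.split():
        # Split on the last underscore so words containing "_" stay intact.
        word, _, tag = item.rpartition("_")
        tokens.append((word, tag))
    return tokens


def merge_tags(tokens):
    """Replace each merged tag with its target; leave other tags unchanged."""
    return [(word, TAG_MERGES.get(tag, tag)) for word, tag in tokens]


if __name__ == "__main__":
    line = "O_ART livro_N que_PRO-KS-REL foi_VAUX lido_PCP"
    print(merge_tags(parse_sentence(line)))
```

In this example, `que` would end up tagged `PRO-KS` and `foi` tagged `V`, while the remaining tokens keep their original tags.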

Previous Versions

The two previous versions of Mac-Morpho can be found below. They are also split into train, development and test sections, in the same way as in [3].

References

  1. Aluísio, S., Pelizzoni, J., Marchi, A.R., de Oliveira, L., Manenti, R., Marquiafável, V. 2003. An account of the challenge of tagging a reference corpus for Brazilian Portuguese. In: Proceedings of the 6th International Conference on Computational Processing of the Portuguese Language (PROPOR 2003).
  2. Fonseca, E.R., Rosa, J.L.G. 2013. Mac-Morpho revisited: Towards robust part-of-speech tagging. In: Proceedings of the 9th Brazilian Symposium in Information and Human Language Technology (STIL).
  3. Fonseca, E.R., Aluísio, S.M., Rosa, J.L.G. 2015. Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese. Journal of the Brazilian Computer Society.