Mac-Morpho is a corpus of Brazilian Portuguese texts annotated with part-of-speech tags. Its first version was released in 2003, and two revisions have since been made to improve the quality of the resource [2, 3].
The corpus is available for download split into train, development and test sections, which make up 76%, 4% and 20% of the corpus, respectively. The unusual proportions arise because the corpus was first split 80%/20% into train and test, and then 5% of the train section was set aside for development. This split was used in , and new POS tagging research with Mac-Morpho is encouraged to follow it so that results remain comparable.
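The arithmetic behind the 76%/4%/20% figures can be checked with a short sketch. The function name and its parameters are illustrative only and are not part of any Mac-Morpho tooling.

```python
def split_fractions(test_frac=0.20, dev_frac_of_train=0.05):
    """Return (train, dev, test) as fractions of the whole corpus.

    The corpus is first split into train/test, then a share of the
    train section is set aside for development.
    """
    train_and_dev = 1.0 - test_frac          # 80% of the corpus
    dev = train_and_dev * dev_frac_of_train  # 5% of 80% = 4%
    train = train_and_dev - dev              # 80% - 4% = 76%
    return train, dev, test_frac


train, dev, test = split_fractions()
print(f"train={train:.0%} dev={dev:.0%} test={test:.0%}")
```

The development section is 5% of the train section, not 5% of the whole corpus, which is why it amounts to 4% of the total.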
- Download Mac-Morpho
- Download annotation manual (in Portuguese)
NOTE: the manual was written for its original annotation, i.e., before the changes in the tagset were introduced. Therefore, it does not reflect the current state of the corpus.
Disclaimer: Mac-Morpho versions 1, 2 and 3 are licensed under a Creative Commons Attribution 4.0 International License. This means you can distribute, remix, tweak, and build upon any Mac-Morpho version, even commercially, as long as you credit us for the original creation. Mac-Morpho License.
The first main revision is described in , and consisted of two main parts: cleaning noise in the data and changing the tagset to include preposition contractions. The noise in the original corpus consisted mainly of sentences with missing words and of repeated sentences. Sentences with missing words were detected with a set of heuristics that flag impossible POS tag sequences, and were discarded from the corpus. Repeated sentences were also identified and removed. Adding tags for preposition contractions avoids the need for a preprocessing step to split such tokens.
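As a rough illustration of this kind of cleaning, the sketch below drops repeated sentences and sentences containing a forbidden pair of adjacent tags. The actual heuristics used in the revision are not listed here, so the `IMPOSSIBLE_BIGRAMS` set, the function names and the tag names are all hypothetical placeholders.

```python
# Hypothetical examples of tag pairs treated as impossible; the real
# heuristics of the revision are not reproduced here.
IMPOSSIBLE_BIGRAMS = {("ART", "V"), ("PREP", "PREP")}


def has_impossible_sequence(tags, impossible=IMPOSSIBLE_BIGRAMS):
    """True if any two adjacent tags form a forbidden pair."""
    return any(pair in impossible for pair in zip(tags, tags[1:]))


def clean(sentences):
    """Drop repeated sentences and sentences with impossible tag pairs.

    `sentences` is a list of sentences, each a list of (word, tag) tuples.
    """
    seen = set()
    kept = []
    for sent in sentences:
        key = tuple(sent)
        if key in seen:
            continue  # exact repetition of an earlier sentence
        seen.add(key)
        if has_impossible_sequence([tag for _, tag in sent]):
            continue  # probably a sentence with a missing word
        kept.append(sent)
    return kept


sentences = [
    [("O", "ART"), ("gato", "N"), ("dorme", "V")],
    [("O", "ART"), ("gato", "N"), ("dorme", "V")],  # repeated
    [("O", "ART"), ("dorme", "V")],                 # noun missing
]
cleaned = clean(sentences)
```

With this toy input only the first sentence survives: the second is a duplicate and the third contains an article followed directly by a verb.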
A second revision followed, described in . It included corrections of problematic sentences (which had tags assigned to empty strings) and a further change to the tagset. This change aimed at removing tags whose correct detection requires knowledge above the morpho-syntactic level: auxiliary verbs (merged with main verbs), relative connective pronouns (merged with connective pronouns) and relative connective adverbs (merged with connective adverbs).
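The tag merges described above could be applied to older annotations with a mapping like the one sketched below. It assumes that tokens are written as `word_TAG` separated by spaces, and that the affected tags are named `VAUX`, `PRO-KS-REL` and `ADV-KS-REL`. Both the format and the tag names are assumptions that should be checked against the downloaded files.

```python
# Assumed tag names: the old tag on the left is folded into the tag on the right.
TAG_MERGES = {
    "VAUX": "V",             # auxiliary verbs -> main verbs
    "PRO-KS-REL": "PRO-KS",  # relative connective pronouns -> connective pronouns
    "ADV-KS-REL": "ADV-KS",  # relative connective adverbs -> connective adverbs
}


def merge_tags(line, sep="_"):
    """Rewrite the tags in one `word_TAG word_TAG ...` line."""
    out = []
    for token in line.split():
        if sep not in token:
            out.append(token)  # leave malformed tokens untouched
            continue
        # Split on the last separator so words containing it stay intact.
        word, _, tag = token.rpartition(sep)
        out.append(f"{word}{sep}{TAG_MERGES.get(tag, tag)}")
    return " ".join(out)


example = "Ele_PROPESS tinha_VAUX dito_PCP que_PRO-KS-REL"
merged = merge_tags(example)
print(merged)
```

Tags that are not in the mapping pass through unchanged, so the function can be run over a whole file without affecting the rest of the tagset.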