Semi-Automatic Construction of a Textual Entailment Dataset: Selecting Candidates with Vector Space Models

This page presents the code and data used to bootstrap an RTE corpus, as described in the STIL 2015 paper by Erick Fonseca and Sandra Aluísio.

Code

The code used for bootstrapping the data can be found on GitHub. The project contains a README file with a detailed description of its inner workings. Note that the code has evolved since the original experiments and now includes some new capabilities.

Raw Data

The data used in the experiments consists of two corpora:

A very large news corpus from the G1 website, with around 100 million tokens. It was used to generate the different VSMs. Any sufficiently large corpus should yield similar results.
A corpus of clustered news articles collected from Google News. This corpus is much smaller, with 2.4 million tokens. The RTE candidate pairs were extracted from its clusters. Download

Extracted Pairs

You can also download the pairs generated with each VSM. The number of pairs differs from file to file because each VSM yielded a different number of pairs with a similarity score within the desired range. In each file, the first 100 pairs were annotated manually and the rest have the "UKNOWN" class.

Download
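
Example

As a rough illustration of the selection step described above (keeping only sentence pairs from the same news cluster whose similarity score falls within a desired range), here is a minimal Python sketch. It is not the code from the GitHub project: the function names (vectorize, cosine, candidate_pairs) are hypothetical, the low and high thresholds are placeholders rather than the values used in the experiments, and a plain bag-of-words count vector stands in for the VSMs built from the G1 corpus.

```python
import math
from collections import Counter
from itertools import combinations


def vectorize(sentence):
    """Toy stand-in for a VSM: lowercase bag-of-words counts (assumption)."""
    return Counter(sentence.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def candidate_pairs(cluster, low, high):
    """Return (sentence_a, sentence_b, score) for every pair of sentences
    in one cluster whose similarity lies in the closed range [low, high].

    low and high are hypothetical thresholds, not the paper's values.
    """
    vectors = [vectorize(s) for s in cluster]
    pairs = []
    for i, j in combinations(range(len(cluster)), 2):
        score = cosine(vectors[i], vectors[j])
        if low <= score <= high:
            pairs.append((cluster[i], cluster[j], score))
    return pairs
```

Calling candidate_pairs(cluster, low, high) once per news cluster and concatenating the results would give one list of candidates per VSM. An upper bound is included here on the assumption that near-identical sentences make uninteresting candidates, and a lower bound on the assumption that unrelated sentences do too; the README in the GitHub project documents the actual procedure.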