Introduction

This document covers the basics of installing and using nlpnet.

Installation

nlpnet can be downloaded from the Python package index at https://pypi.python.org/pypi/nlpnet/ or installed with

pip install nlpnet

See the Dependencies section below for additional installation requirements.

If you want to get the latest development version before it is uploaded to the package index, you can clone the repository from GitHub. After downloading the code, run the following command in the code directory:

python setup.py install

And it’s done.

Important: in order to use the trained models for Portuguese NLP, you will need to download the data from Trained Models.

Dependencies

nlpnet requires NLTK and numpy. Additionally, it needs to download some data from NLTK. After installing NLTK, call

>>> import nltk
>>> nltk.download()

then go to the Models tab and select the Punkt tokenizer. It is used to split the text into sentences.
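
If you prefer to skip the interactive downloader, the same data can be fetched directly; this uses NLTK's standard download call, nothing nlpnet-specific:

>>> nltk.download('punkt')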

Cython is used to generate C extensions that run faster than pure Python. You probably won't need Cython itself, since the generated .c file is already provided with nlpnet, but you will need a C compiler. On Linux and Mac systems this shouldn't be a problem, but it may be on Windows, because setuptools requires the Microsoft C compiler by default. If you don't have it already, it is usually easier to install MinGW instead and follow the instructions here.

Brief explanation

Here is a brief explanation of how things work inside nlpnet (you don't need to know it to use this library). For a more detailed view, refer to the articles listed on the index page or to those about the SENNA system.

Two types of neural networks are available: the common MLP (multilayer perceptron) and the convolutional one. The former was used to train a POS model, and the latter an SRL model. Basically, the common MLP examines word windows, outputs a score for assigning each tag to each word, and then determines the tags using the Viterbi algorithm (which essentially picks the best combination of network scores and tag transition scores).
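
For intuition, here is a minimal sketch of that kind of Viterbi decoding over the network's output. The function and variable names are illustrative assumptions, not part of the nlpnet API:

import numpy as np

def viterbi(scores, transitions):
    # scores: (num_tokens, num_tags) network scores, one row per word
    # transitions: (num_tags, num_tags); transitions[i, j] is the score
    # for moving from tag i to tag j
    num_tokens, num_tags = scores.shape
    best = np.empty((num_tokens, num_tags))
    backpointers = np.empty((num_tokens, num_tags), dtype=int)
    best[0] = scores[0]
    for t in range(1, num_tokens):
        # best path so far + transition score + current network score
        candidates = best[t - 1][:, None] + transitions + scores[t]
        backpointers[t] = candidates.argmax(axis=0)
        best[t] = candidates.max(axis=0)
    # walk the backpointers from the best final tag to recover the sequence
    tags = [int(best[-1].argmax())]
    for t in range(num_tokens - 1, 0, -1):
        tags.append(int(backpointers[t][tags[-1]]))
    return list(reversed(tags))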

During training, adjustments are made to the network connections, word representations and the tag transition scores. Their learning rates may be set separately, although the best results seem to arise when all three have the same value.

The convolutional network is a little more complicated. In order to output a score for each word, it examines the whole sentence. It does so by picking one word window at a time and forwarding it to a convolution layer, where each neuron stores the biggest value found so far. After all words have been examined, the convolution layer forwards its output like a usual MLP network. From then on, it works like the previous model: the network outputs scores for each word/tag combination, and a Viterbi search is performed.

In the convolution layer, the values found by each neuron may come from different words, i.e., each neuron stores its maximum independently from the others. This is particularly complex during training, because neurons must backpropagate their error only to the word window that yielded their stored value.
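
The max-over-time step and the bookkeeping it requires can be pictured in a few lines of numpy. Again, all names here are illustrative, not nlpnet's actual internals (those live in the Networks module):

import numpy as np

def convolve_and_pool(windows, weights):
    # windows: (num_windows, window_size), one row of concatenated word
    # vectors per word window in the sentence
    # weights: (window_size, num_neurons) convolution weights
    activations = windows.dot(weights)
    # each neuron keeps the biggest value it found across all windows...
    pooled = activations.max(axis=0)
    # ...and remembers which window produced it, so that during training
    # its error is backpropagated only to that winning window
    winners = activations.argmax(axis=0)
    return pooled, winners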

All the details concerning the neural networks are hidden from the user when calling the tagger methods or the nlpnet-tag standalone script. However, they are available to play with in the Networks module.

Basic usage

nlpnet can be used both as a Python library and through its standalone scripts. The basic library API is explained below. See also Standalone Scripts.

Library usage

You can use nlpnet as a library in Python code as follows:

>>> import nlpnet
>>> nlpnet.set_data_dir('/path/to/nlpnet-data/')
>>> tagger = nlpnet.POSTagger()
>>> tagger.tag('O rato roeu a roupa do rei de Roma.')
[[(u'O', u'ART'), (u'rato', u'N'), (u'roeu', u'V'), (u'a', u'ART'), (u'roupa', u'N'), (u'do', u'PREP+ART'), (u'rei', u'N'), (u'de', u'PREP'), (u'Roma', u'NPROP'), (u'.', u'PU')]]

In the example above, the call to set_data_dir indicates where the data directory is located. This location must be set whenever nlpnet is imported, before any tagger is created.

Calling a tagger is pretty straightforward. The two provided taggers are POSTagger and SRLTagger, both of which have a tag method that receives a string with the text to be tagged. The tagger splits the text into sentences and then tokenizes each one (hence the return value of the POSTagger is a list of lists).
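
For instance, assuming the Punkt tokenizer splits this hypothetical input into two sentences, the result contains one inner list per sentence:

>>> len(tagger.tag(u'Choveu ontem. Hoje faz sol.'))
2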

The output of the SRLTagger is slightly more complicated:

>>> tagger = nlpnet.SRLTagger()
>>> tagger.tag(u'O rato roeu a roupa do rei de Roma.')
[<nlpnet.taggers.SRLAnnotatedSentence at 0x84020f0>]

Instead of a list of tuples, sentences are represented by instances of SRLAnnotatedSentence. This class serves basically as a data holder, and has two attributes:

>>> sent = tagger.tag(u'O rato roeu a roupa do rei de Roma.')[0]
>>> sent.tokens
[u'O', u'rato', u'roeu', u'a', u'roupa', u'do', u'rei', u'de', u'Roma', u'.']
>>> sent.arg_structures
[(u'roeu',
  {u'A0': [u'O', u'rato'],
   u'A1': [u'a', u'roupa', u'do', u'rei', u'de', u'Roma'],
   u'V': [u'roeu']})]

The arg_structures attribute is a list containing all predicate-argument structures in the sentence. The only one in this example is for the verb roeu. Each structure is represented by a tuple with the predicate and a dictionary mapping semantic role labels to the tokens that constitute the argument.

Note that the verb appears both as the first member of the tuple and as the content of the label 'V' (which stands for verb). This is because some predicates are multiwords. In these cases, the "main" predicate word (usually the verb itself) appears as the first member of the tuple, while all the predicate words appear under the key 'V'.
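
For instance, a small loop over these structures, using only the attributes shown above, prints each predicate with its labeled arguments (the label order depends on the dictionary):

>>> for predicate, arguments in sent.arg_structures:
...     print('Predicate: %s' % predicate)
...     for label, tokens in arguments.items():
...         print('  %s: %s' % (label, ' '.join(tokens)))
Predicate: roeu
  A0: O rato
  A1: a roupa do rei de Roma
  V: roeu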
