Developer guide

This guide is intended for developers adapting POLKE for other natural languages.

Project setup

Software

Pipeline

Overview

POLKE is an Apache UIMA-based Java application. It is deployed on a Jetty server in a Docker container. Deployment and server setup are handled by a collection of bash scripts.

  • env-setup.sh sets environment variables on first launch.
  • serve.sh handles compiling and packaging the code, building the Docker image, and deploying the Docker container.
  • stop.sh safely shuts down the container.

The basic overview of the POLKE pipeline is as follows:

  1. The ExtractorServlet receives the input text to be annotated from a POST request.
  2. The text is passed on to the LinguisticConstructAnnotator, a UIMA aggregate AE which performs two functions: NLP pre-processing and applying UIMA Ruta rules. Here, a new JCas object is created with the input text and language. A JCas is the Java interface of the CAS (Common Analysis System) that UIMA uses for handling data processed by the annotator components.
  3. The JCas object is processed with basic NLP annotations, which is handled by the NlpAE class. The annotation types are defined in the type system NLP_TypeSystem.
  4. Ruta rules are applied on the NLP annotations to further annotate the text with the linguistic constructs from the English Grammar Profile.
  5. The servlet extracts the EGP annotations from the JCas and returns them in the JSON format.
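Condensed into code, steps 2–4 amount to creating a JCas and running the aggregate AE over it. The sketch below assumes uimaFIT is on the classpath; the descriptor path is illustrative and should be replaced with the path to POLKE's actual LinguisticConstructAnnotator descriptor.

```java
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.pipeline.SimplePipeline;
import org.apache.uima.jcas.JCas;

public class PipelineSketch {
    public static void main(String[] args) throws Exception {
        // Create a new JCas with the input text and language (step 2).
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentText("I cannot come to see you.");
        jcas.setDocumentLanguage("en");

        // Load the aggregate AE from its XML descriptor (path is illustrative).
        AnalysisEngineDescription ae =
            AnalysisEngineFactory.createEngineDescriptionFromPath(
                "desc/LinguisticConstructAnnotator.xml");

        // Run NLP pre-processing and the Ruta rules over the JCas (steps 3-4).
        SimplePipeline.runPipeline(jcas, ae);

        // The servlet would now read the EGP annotations out of the JCas
        // and serialize them to JSON (step 5).
    }
}
```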

Typesystem

The UIMA type system is an object schema for the CAS, describing what kinds of data are available to the annotators. The type system consists of types and features: types are the objects you can manipulate in a CAS, and features are the optional subfields within a type.

Only the types defined in the type system can be added to the JCas and manipulated by UIMA Ruta.

POLKE contains three kinds of types:

  • The NLP annotations from the pre-processing stage (tokens, part-of-speech tags, dependencies, chunks, key-value pairs etc.)
  • The NLP annotations from the UIMA Ruta pre-processing stage (e.g., grouping part-of-speech tags into broader categories such as to-infinitives or all finite base verbs)
  • The EGP construct annotations annotated on the basis of the other two types

Each type can be imported as a Java class and has automatically generated setters and getters for the begin and end span indices and for each feature defined in the type system for that type.
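For illustration, a type system entry for the type.nlp.Noun type used as an example later in this guide might look as follows in the descriptor XML (the description text is hypothetical):

```xml
<typeDescription>
  <name>type.nlp.Noun</name>
  <description>A noun token produced by an external tagger.</description>
  <supertypeName>uima.tcas.Annotation</supertypeName>
  <features>
    <featureDescription>
      <name>pos</name>
      <rangeTypeName>uima.cas.String</rangeTypeName>
    </featureDescription>
    <featureDescription>
      <name>lemma</name>
      <rangeTypeName>uima.cas.String</rangeTypeName>
    </featureDescription>
  </features>
</typeDescription>
```

Inheriting from uima.tcas.Annotation gives the type its begin and end span offsets; JCasGen then produces the getters and setters for pos and lemma.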

In Eclipse, the type system file can be opened and edited with the Component Descriptor Editor. After adding new types and/or features, click ‘JCasGen’ to generate the class files.

Note: if you are using DKPro Core, delete the de/tudarmstadt folder that gets generated to avoid conflicts with classes imported via Maven.

Adding NLP components

The easiest way to add NLP components is to use the DKPro Core components for Apache UIMA. It is also possible to integrate input from other tools; however, this requires a couple of extra steps.

Using DKPro Core

Simply add the DKPro Core component by creating a new engine description in the pipeline. Check the DKPro Core Component Reference for the list of currently available components and models.
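For instance, a segmenter and a POS tagger could be added as engine descriptions as sketched below, assuming the corresponding DKPro Core Maven artifacts are on the classpath. OpenNlpSegmenter and OpenNlpPosTagger are used here only as examples; any component from the reference works the same way.

```java
import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpPosTagger;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;

public class DkproExample {
    // Build the engine descriptions to be added to the pipeline.
    public static AnalysisEngineDescription[] buildNlpEngines() throws Exception {
        return new AnalysisEngineDescription[] {
            createEngineDescription(OpenNlpSegmenter.class),
            // PARAM_LANGUAGE pins the model language explicitly;
            // without it the document language of the JCas is used.
            createEngineDescription(OpenNlpPosTagger.class,
                OpenNlpPosTagger.PARAM_LANGUAGE, "en")
        };
    }
}
```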

In cases where the models are not redistributable, they should be downloaded directly into the project (see External models).

Using other NLP tools

Output from external NLP tools needs to be manually added to the JCas. The way to do this is illustrated below with an example.

Let's say we have the type type.nlp.Noun defined in the type system with the features pos and lemma. We have imported it as a Java class and want to use it to put all of the nouns in the text into the JCas.

For each noun in the string, we create a new instance of the class using the begin and end indices of the span, set the feature values, and add the instance to the JCas:

Noun noun = new Noun(aJCas, begin, end); // annotation spanning [begin, end)
noun.setPos(pos);                        // set the features defined in the type system
noun.setLemma(lemma);
noun.addToIndexes();                     // register the annotation in the CAS indexes

These instances will now be visible as UIMA annotations and accessible to the Ruta rules.

External models

External model files are copied into the resources folder inside the Docker container when the application is served. Make a folder containing the new files under jetty_base/lib and add a docker command in the serve.sh script to copy the files:

docker cp jetty_base/lib/your_new_folder $JETTY_CONTAINER_NAME:/var/lib

EGP annotation with UIMA Ruta

POLKE uses UIMA Ruta for annotating EGP constructs. UIMA Ruta is a rule language which defines and executes a pattern of annotations on the data in a CAS. The UIMA Ruta documentation provides a comprehensive description of the Ruta syntax; here we give only an overview on how the Ruta rules in POLKE are set up.

Ruta rules

Each EGP construct has one or more unique Ruta rules handling its annotation. These rules are based on the NLP tags (tokens, parts of speech, lemmas, dependencies, etc.). Each rule annotates the minimum possible span containing the construct based on the EGP can-do statement; for example, an annotation for a construct focused on the main verb does not include the subject noun phrase. Additionally, context is generally not taken into account unless the can-do statement requires it: even if the broader context is ungrammatical overall, the construct will still be annotated as long as it is used correctly in isolation.

In cases where the can-do statement is very broad, the intended target structure was determined from the provided learner examples.

For example, construct 52 in the mapping table with the supercategory 'modality' is described as "Can use the negative forms."

There are two learner examples:

  • I cannot come to see you. (A1 BREAKTHROUGH, 2009, Bengali)
  • I'm very nervous and I can't say anything. (A1 BREAKTHROUGH, Polish)

Using the supercategory and the similar structures in the two sentences, we determine that the construct targets the modal verb can followed by a negation and an infinitive verb.

The Ruta rule to annotate this construct would therefore be:


(Token{REGEXP("(?i)can")} & POS.PosValue == "VM0")
 KeyValuePair.key == "NEGATION_MARKER" ADV* KeyValuePair.key == "INFINITIVE"
{-> CREATE(EGPConstruct,1,4,"constructID"=52)};

We use a regular expression to look for a token "can" with the part of speech VM0, followed by a negation marker, an optional adverb, and an infinitive verb. From this sequence we create a new EGPConstruct annotation and assign it the construct ID 52.

When the can-do statement specifies a lexical range (e.g. limited range of...), we use the CEFR-annotated vocabulary lists Oxford 3000 and 5000 from the Oxford Learner's Dictionaries. In cases where the can-do statement requires a specific word type (e.g. gradable adjectives), the word list for that type is a subset of the Oxford 3000 and 5000.
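Deriving such a word-type-specific list can be sketched as a simple filter over the CEFR-annotated entries. The Entry record and the sample data below are hypothetical; the real Oxford 3000/5000 lists have their own distribution format, which would need to be parsed first.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class WordListFilter {
    // Hypothetical representation of one CEFR-annotated vocabulary entry.
    record Entry(String word, String pos, String cefr) {}

    // Keep only adjectives at or below a given CEFR level.
    static Set<String> adjectivesUpTo(List<Entry> entries, String maxLevel) {
        List<String> order = List.of("A1", "A2", "B1", "B2", "C1", "C2");
        int max = order.indexOf(maxLevel);
        return entries.stream()
            .filter(e -> e.pos().equals("adjective"))
            .filter(e -> order.indexOf(e.cefr()) <= max)
            .map(Entry::word)
            .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        List<Entry> sample = List.of(
            new Entry("big", "adjective", "A1"),
            new Entry("substantial", "adjective", "B2"),
            new Entry("come", "verb", "A1"));
        System.out.println(adjectivesUpTo(sample, "A2")); // prints [big]
    }
}
```

The resulting set can then be matched against token lemmas in a Ruta rule or a word-list resource.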