Developer guide
This guide is intended for developers adapting POLKE for other natural languages.
Project setup
Software
- Eclipse 2022-06
- JDK 15
- UIMA Runtime and UIMA Tools 3.3.0 (Eclipse update site)
- UIMA Ruta 3.2.0 (Eclipse update site)
- Docker 20.10.24
Pipeline
Overview
POLKE is an Apache UIMA-based Java application. It is deployed on a Jetty server in a Docker container. Deployment and server setup are handled by a collection of bash scripts.
- `env-setup.sh` sets environment variables on first launch.
- `serve.sh` handles compiling and packaging the code, building the Docker image, and deploying the Docker container.
- `stop.sh` safely shuts down the container.
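For orientation, a minimal `env-setup.sh` skeleton might look like the following. The variable values and `JETTY_PORT` are assumptions for illustration; only `JETTY_CONTAINER_NAME` is a name the deployment scripts actually rely on (it is used by the `docker cp` step in `serve.sh`).

```shell
#!/bin/bash
# Illustrative sketch only -- values are assumptions, not POLKE's actual settings.

# Container name referenced by serve.sh and stop.sh.
export JETTY_CONTAINER_NAME="polke-jetty"

# Port the Jetty server is exposed on (hypothetical default).
export JETTY_PORT=8080
```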
The basic overview of the POLKE pipeline is as follows:
- The `ExtractorServlet` receives the input text to be annotated from a POST request.
- The text is passed on to the `LinguisticConstructAnnotator`, a UIMA aggregate analysis engine (AE) which performs two functions: NLP pre-processing and applying UIMA Ruta rules. Here, a new JCas object is created with the input text and language. A JCas is the Java interface of the CAS (Common Analysis System) that UIMA uses for handling data processed by the annotator components.
- The JCas object is processed with basic NLP annotations, which is handled by the `NlpAE` class. The annotation types are defined in the type system `NLP_TypeSystem`.
- Ruta rules are applied on the NLP annotations to further annotate the text with the linguistic constructs from the English Grammar Profile.
- The servlet extracts the EGP annotations from the JCas and returns them in JSON format.
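As a rough sketch of the final step, the servlet's extraction could serialize each EGP annotation's span offsets and construct ID into a JSON array. The class and field names below are assumptions for illustration, not POLKE's actual response schema:

```java
import java.util.List;
import java.util.stream.Collectors;

public class EgpJsonSketch {
    // Hypothetical stand-in for an extracted EGP annotation:
    // span offsets plus the construct ID (field names are assumptions).
    static class Span {
        final int begin, end, constructId;
        Span(int begin, int end, int constructId) {
            this.begin = begin;
            this.end = end;
            this.constructId = constructId;
        }
    }

    // Serialize the spans into a JSON array of the kind the servlet could return.
    static String toJson(List<Span> spans) {
        return spans.stream()
                .map(s -> String.format("{\"begin\":%d,\"end\":%d,\"constructID\":%d}",
                        s.begin, s.end, s.constructId))
                .collect(Collectors.joining(",", "[", "]"));
    }

    public static void main(String[] args) {
        System.out.println(toJson(List.of(new Span(2, 12, 52))));
        // -> [{"begin":2,"end":12,"constructID":52}]
    }
}
```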
Typesystem
The UIMA type system is an object schema for the CAS, describing what kinds of data are available to the annotators. The type system consists of types and features. Types are the objects you can manipulate in a CAS, and features are the optional subfields within a type.
Only the types defined in the type system can be added to the JCas and manipulated by UIMA Ruta.
POLKE contains three kinds of types:
- The NLP annotations from the pre-processing stage (tokens, part-of-speech tags, dependencies, chunks, key-value pairs etc.)
- The NLP annotations from the UIMA Ruta pre-processing stage (e.g., grouping part-of-speech tags into broader categories such as to-infinitives or all finite base verbs)
- The EGP construct annotations annotated on the basis of the other two types
Each type can be imported as a Java class and has automatically generated setters and getters for the begin and end span indices and for each feature defined in the type system for that type.
In Eclipse, the type system file can be opened and edited with the Component Descriptor Editor. After adding new types and/or features, click ‘JCasGen’ to generate the class files.
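As a sketch, a type with two string features could be declared in the type system descriptor like this. The type and feature names mirror the `type.nlp.Noun` example further below and are purely illustrative:

```xml
<typeDescription>
  <name>type.nlp.Noun</name>
  <description>A noun token (illustrative example).</description>
  <supertypeName>uima.tcas.Annotation</supertypeName>
  <features>
    <featureDescription>
      <name>pos</name>
      <description>Part-of-speech tag.</description>
      <rangeTypeName>uima.cas.String</rangeTypeName>
    </featureDescription>
    <featureDescription>
      <name>lemma</name>
      <description>Lemma of the token.</description>
      <rangeTypeName>uima.cas.String</rangeTypeName>
    </featureDescription>
  </features>
</typeDescription>
```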
Note: if you are using DKPro Core, delete the `de/tudarmstadt` folder that gets generated to avoid conflicts with classes imported via Maven.
Adding NLP components
The easiest way to add NLP components is to use the DKPro Core components for Apache UIMA. It is also possible to integrate input from other tools; however, this requires a couple of extra steps.
Using DKPro Core
Simply add the DKPro Core component by creating a new engine description in the pipeline. Check the DKPro Core Component Reference for the list of currently available components and models.
In cases where the models are not distributable, they must be downloaded directly into the project (see External models).
Using other NLP tools
Output from external NLP tools needs to be manually added to the JCas. The way to do this is illustrated below with an example.
Let's say we have the type `type.nlp.Noun` defined in the type system with the features `pos` and `lemma`. We have imported it as a Java class and want to use it to put all of the nouns in the text into the JCas.
For each noun in the string, we create a new instance of the class using the begin and end indices of the span, set the feature values, and add the instance to the JCas:
```java
Noun noun = new Noun(aJCas, begin, end);
noun.setPos(pos);
noun.setLemma(lemma);
noun.addToIndexes();
```
Instances of this class will now be visible as UIMA annotations and accessible to the Ruta rules.
External models
External model files are copied into the resources folder inside the Docker container when the application is served. Make a folder containing the new files under `jetty_base/lib` and add a docker command in the `serve.sh` script to copy the files:

```shell
docker cp jetty_base/lib/your_new_folder $JETTY_CONTAINER_NAME:/var/lib
```
EGP annotation with UIMA Ruta
POLKE uses UIMA Ruta for annotating EGP constructs. UIMA Ruta is a rule language which defines and executes a pattern of annotations on the data in a CAS. The UIMA Ruta documentation provides a comprehensive description of the Ruta syntax; here we give only an overview on how the Ruta rules in POLKE are set up.
Ruta rules
Each EGP construct has a unique Ruta rule or set of unique Ruta rules handling its annotation. These rules are based on the NLP tags (tokens, parts of speech, lemmas, dependencies, etc.). Each rule annotates the minimum possible span containing the construct based on the EGP can-do statement; for example, an annotation for a construct focused on the main verb does not contain the subject noun phrase. Additionally, context is generally not taken into account unless the can-do statement requires it: even if the broader context is ungrammatical overall, a construct that is used correctly in isolation will still be annotated.
In cases where the can-do statement is very broad, the intended target structure was determined from the provided learner examples.
For example, construct 52 in the mapping table with the supercategory 'modality' is described as "Can use the negative forms."
There are two learner examples:
- I cannot come to see you. (A1 BREAKTHROUGH, 2009, Bengali)
- I'm very nervous and I can't say anything. (A1 BREAKTHROUGH, Polish)
The Ruta rule to annotate this construct would therefore be:
```
(Token{REGEXP("(?i)can")} & POS.PosValue == "VM0")
KeyValuePair.key == "NEGATION_MARKER" ADV* KeyValuePair.key == "INFINITIVE"
{-> CREATE(EGPConstruct,1,4,"constructID"=52)};
```
We use a regular expression to look for a token "can" with the part of speech VM0, followed by a negation marker, an optional adverb, and an infinitive verb. From this sequence we create a new EGPConstruct annotation and assign it the construct ID 52.
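Ruta's REGEXP condition uses Java regular expressions, so the `(?i)` flag makes the token match case-insensitive. The same matching logic can be sketched in plain Java (the helper below is illustrative, not part of POLKE):

```java
import java.util.regex.Pattern;

public class RegexpConditionSketch {
    // Same pattern as the Ruta rule's REGEXP condition:
    // "(?i)" turns on case-insensitive matching.
    static final Pattern CAN = Pattern.compile("(?i)can");

    // True if the whole token matches "can" regardless of case.
    static boolean isCanToken(String token) {
        return CAN.matcher(token).matches();
    }

    public static void main(String[] args) {
        System.out.println(isCanToken("Can"));    // true
        System.out.println(isCanToken("CAN"));    // true
        System.out.println(isCanToken("cannot")); // false (full match required)
    }
}
```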
When the can-do statement specifies a lexical range (e.g. limited range of...), we use the CEFR-annotated vocabulary lists Oxford 3000 and 5000 from the Oxford Learner's Dictionaries. In cases where the can-do statement requires a specific word type (e.g. gradable adjectives), the word list for that type was created using a subset of the Oxford 3000 and 5000.
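Deriving such a type-specific subset can be sketched as filtering a CEFR-annotated vocabulary by word type and maximum level. The entries and helper below are hypothetical; the real data comes from the Oxford 3000 and 5000 lists:

```java
import java.util.List;
import java.util.stream.Collectors;

public class WordListSketch {
    // Hypothetical vocabulary entry (word, part of speech, CEFR level).
    static class Entry {
        final String word, pos, level;
        Entry(String word, String pos, String level) {
            this.word = word;
            this.pos = pos;
            this.level = level;
        }
    }

    // Build a word list for one word type at or below a target CEFR level.
    // Lexicographic comparison works because A1 < A2 < B1 < B2 < C1 < C2.
    static List<String> subset(List<Entry> entries, String pos, String maxLevel) {
        return entries.stream()
                .filter(e -> e.pos.equals(pos) && e.level.compareTo(maxLevel) <= 0)
                .map(e -> e.word)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Entry> sample = List.of(
                new Entry("big", "adjective", "A1"),
                new Entry("happy", "adjective", "A1"),
                new Entry("come", "verb", "A1"),
                new Entry("significant", "adjective", "B1"));
        System.out.println(subset(sample, "adjective", "A2")); // [big, happy]
    }
}
```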