## IMGazetteer PR
#### Initialization
The Inexact Gazetteer plugin provides a GATE processing resource (PR) named IMGazetteer that, like the GATE Default Gazetteer, annotates document text based on a gazetteer list. The difference is that IMGazetteer accepts approximate string matches when annotating text chunks. IMGazetteer uses string transformation, a trie index, Edit Distance search and string similarity metrics to match text chunks against entries from the gazetteer list and create annotations.

The configuration file used to load IMGazetteer is divided into three sections: 1) configuration parameters, 2) general features and 3) entries with individual features. The sections are delimited by begin and end tags that cannot be changed.
#### Configuration parameters (from “|CONFIG|” until “|/CONFIG|”)

These parameters allow the user to tune the PR. Each parameter consists of a name and a value: the name is fixed and must not be modified, while the value is the part that the user can change. Some parameters can be changed through the GATE UI after the PR has been loaded, but others must be changed in the config file, and the PR must then be reloaded for the change to take effect and to avoid execution errors. Each parameter is described below:
* _annotationType

Annotation type assigned to the annotations that the PR creates.

Ex: _annotationType=Inexact_Lookup
* _caseSensitive

Defines whether character case is considered when matching. The value must be YES or NO.

Ex: _caseSensitive=NO
* _maxEditDistance

Maximum Edit Distance accepted when searching the trie of entries.

Ex: _maxEditDistance=1
* _numberBetterSimilarity

Number of best similarity values to keep (top-k similarities). Annotations with the same similarity count as one. If the user sets this parameter to 2, the PR may still return 3 or more annotations for the same chunk; this happens when more than one entry obtains the same similarity.

Ex: _numberBetterSimilarity=1
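The tie behaviour of _numberBetterSimilarity can be illustrated with a minimal Java sketch. It is not the plugin's code: the `TopKSimilarity` class, the `Candidate` type and the `topK` method are hypothetical names introduced only for this example, and the sketch assumes that "top-k" means the k highest distinct similarity values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class TopKSimilarity {
    /** Hypothetical candidate: a gazetteer entry and its similarity to a text chunk. */
    static final class Candidate {
        final String entry;
        final double similarity;

        Candidate(String entry, double similarity) {
            this.entry = entry;
            this.similarity = similarity;
        }
    }

    /** Keeps every candidate whose similarity is one of the k highest distinct values. */
    static List<Candidate> topK(List<Candidate> candidates, int k) {
        List<Candidate> kept = new ArrayList<>();
        if (k <= 0 || candidates.isEmpty()) {
            return kept;
        }
        // Collect the distinct similarity values, so that ties count only once.
        TreeSet<Double> distinct = new TreeSet<>();
        for (Candidate c : candidates) {
            distinct.add(c.similarity);
        }
        // Walk down from the best value until the k-th distinct similarity is reached.
        double cutoff = distinct.last();
        int seen = 0;
        for (double value : distinct.descendingSet()) {
            cutoff = value;
            if (++seen == k) {
                break;
            }
        }
        for (Candidate c : candidates) {
            if (c.similarity >= cutoff) {
                kept.add(c);
            }
        }
        return kept;
    }
}
```

With similarities 0.95, 0.95, 0.90 and 0.80, `topK(candidates, 2)` keeps three candidates, because the two entries tied at 0.95 share the first rank.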
* _minAcceptedSimilarity

Minimum value of the similarity metric required to create an annotation. This value must be between 0 (zero) and 1 (one), where 1 means identical and 0 means totally different. For this reason, the result of the similarity metric must be normalized to the range 0 to 1.

Ex: _minAcceptedSimilarity=0.9
* _editDistanceFeatureName

Name of the annotation feature that stores the Edit Distance value obtained in the search.

Ex: _editDistanceFeatureName=ED
* _similarityClass

Fully qualified name of the class containing the method that implements the similarity algorithm. To use a class here, add its .jar file to the “lib” directory inside the plugin directory, and then add a reference to that .jar file in the “creole.xml” file, which is also in the plugin directory. It is not necessary to recompile the plugin.

Ex: _similarityClass=org.apache.lucene.search.spell.JaroWinklerDistance
* _similarityMethod

Name of the method, in the class given by the “_similarityClass” parameter, that implements the similarity algorithm. This method must receive two strings as parameters and return a Float between 0 and 1 that represents the similarity.

Ex: _similarityMethod=getDistance
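A custom similarity class that satisfies the contract above (two strings in, a value between 0 and 1 out) could look like the following sketch. The class name `NormalizedLevenshtein` and the method name `similarity` are hypothetical. The sketch also assumes that the plugin creates the object through a no-argument constructor and calls an instance method on it; the text above does not state how the method is invoked.

```java
// Hypothetical similarity class: Edit Distance normalized to the range 0..1.
// Assumption: the plugin instantiates it with a no-argument constructor.
// Declare the class public, in its own file, before packaging it in a .jar.
class NormalizedLevenshtein {

    /** Returns 1.0 for identical strings and 0.0 for totally different ones. */
    public float similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0f; // two empty strings are identical
        }
        return 1.0f - (float) distance(a, b) / longest;
    }

    /** Classic Levenshtein distance computed with two rolling rows. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }
}
```

If such a class were packaged under a hypothetical package `com.example`, the corresponding parameters would read _similarityClass=com.example.NormalizedLevenshtein and _similarityMethod=similarity.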
* _transformationClass

Fully qualified name of the class containing the method that performs the string transformation. To disable transformation, remove or comment out this parameter's line, and do the same for the “_transformationMethod” line.

Ex: _transformationClass=org.apache.commons.codec.language.DoubleMetaphone
* _transformationMethod

Name of the method used to perform the string transformation. The method must receive a string and return another string. To disable transformation, remove or comment out this parameter's line, and do the same for the “_transformationClass” line.

Ex: _transformationMethod=doubleMetaphone
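One plausible way for a class name and a method name taken from a config file to be turned into a call is Java reflection. The sketch below only illustrates that idea and is not taken from the plugin: `TransformationLoader`, `transform` and `LowercaseTransform` are hypothetical names, and the sketch assumes an instance method on a class with a no-argument constructor.

```java
import java.lang.reflect.Method;
import java.util.Locale;

class TransformationLoader {

    /** Hypothetical transformation class, used only to demonstrate the call. */
    public static class LowercaseTransform {
        public String apply(String token) {
            return token.toLowerCase(Locale.ROOT);
        }
    }

    /**
     * Applies the configured String-to-String method to a token.
     * Returns the token unchanged when no transformation is configured.
     */
    static String transform(String className, String methodName, String token) {
        if (className == null || methodName == null) {
            return token; // both parameter lines removed: no transformation
        }
        try {
            Class<?> cls = Class.forName(className);
            Object instance = cls.getDeclaredConstructor().newInstance();
            Method method = cls.getMethod(methodName, String.class);
            return (String) method.invoke(instance, token);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                "Cannot call " + className + "." + methodName + "(String)", e);
        }
    }
}
```

Failing fast with a clear message when the class or method cannot be resolved is a design choice of this sketch; how the plugin itself reports such errors is not described here.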
* _similarityFeatureName

Name of the annotation feature that stores the value of the similarity metric.

Ex: _similarityFeatureName=Similarity
* _featureSeparator

Character used in entry lines to separate features from each other.

Ex: _featureSeparator=;
* _featureNameValueSeparator

Character used in entry lines to separate the name from the value of each feature.

Ex: _featureNameValueSeparator=:
* _entrieDelimiter

Character used in entry lines to mark the end of the entry text.

Ex: _entrieDelimiter=#
#### General features (from “|Features4ALL|” until “|/Features4ALL|”)
Features in this section are applied to all annotations, regardless of which entry matched. Each feature consists of a name and a value, just like features in the standard GATE Gazetteer. Both name and value are treated as simple strings, and each line must contain only one feature (name + value).

Ex:

    |Features4ALL|
    Type=city
    |/Features4ALL|
#### Entries (from “|ENTRIES|” until the end of the document)
This section represents the gazetteer list. Each line contains exactly one entry and its features; an entry can have zero or more features. The separators between the entry and its features, between a feature's name and its value, and between features are the characters defined in the configuration parameters. Each entry can contain one or more words, numbers and special characters; all of it is treated as simple text.

Ex:

    |ENTRIES|
    Sheffield#Continent:Europe
    Curitiba#Country:Brazil
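The structure of an entry line can be made concrete with a small parser. This is a sketch under assumptions, not the plugin's parser: `EntryLineParser`, `Entry` and `parse` are hypothetical names, the three separators are passed in as they would be configured by _entrieDelimiter, _featureSeparator and _featureNameValueSeparator, and an entry without features is assumed to be written without the delimiter.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class EntryLineParser {

    /** Hypothetical holder for one gazetteer entry and its features. */
    static final class Entry {
        final String text;
        final Map<String, String> features;

        Entry(String text, Map<String, String> features) {
            this.text = text;
            this.features = features;
        }
    }

    /** Splits "text#name:value;name:value" using the configured separators. */
    static Entry parse(String line, String entryDelimiter,
                       String featureSeparator, String nameValueSeparator) {
        int end = line.indexOf(entryDelimiter);
        String text = end < 0 ? line : line.substring(0, end);
        Map<String, String> features = new LinkedHashMap<>();
        if (end >= 0) {
            String rest = line.substring(end + entryDelimiter.length());
            // Quote the separator so that characters such as '|' are taken literally.
            for (String feature : rest.split(Pattern.quote(featureSeparator))) {
                int sep = feature.indexOf(nameValueSeparator);
                if (sep < 0) {
                    continue; // skip empty or malformed feature fragments
                }
                String name = feature.substring(0, sep).trim();
                String value = feature.substring(sep + nameValueSeparator.length()).trim();
                features.put(name, value);
            }
        }
        return new Entry(text.trim(), features);
    }
}
```

Calling `EntryLineParser.parse("Sheffield#Continent:Europe", "#", ";", ":")` yields the entry text `Sheffield` with a single feature `Continent` set to `Europe`.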
#### Execution
Each entry from the config file is split into tokens, transformed (if a transformation is configured) using the transformation algorithm specified in the configuration, joined together again, then indexed in the trie and linked to its features and to the original (untransformed) entry string. To run IMGazetteer on a document, a tokenizer must be executed first to define every token in the document; these tokens are also transformed and then searched for in the trie of entries.

When a chunk of text matches an entry, both are passed to a string similarity metric, which is computed on the original strings of the entry and the text chunk. If the resulting similarity meets the requirements, an annotation is created for that chunk using that entry and its features.
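The trie lookup with an Edit Distance bound described in the execution steps above can be sketched as follows. This is a common textbook technique (a Levenshtein row carried down the trie, with pruning), shown here only as an illustration; the plugin's actual data structures are not documented in this text, and `FuzzyTrie`, `insert` and `search` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class FuzzyTrie {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String entry; // non-null when a stored key ends at this node
    }

    private final Node root = new Node();

    void insert(String key) {
        Node node = root;
        for (char c : key.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.entry = key;
    }

    /** Returns every stored key within maxDistance edits of query, with its distance. */
    Map<String, Integer> search(String query, int maxDistance) {
        Map<String, Integer> hits = new TreeMap<>();
        int[] firstRow = new int[query.length() + 1];
        for (int j = 0; j < firstRow.length; j++) {
            firstRow[j] = j; // distance from the empty prefix
        }
        if (root.entry != null && query.length() <= maxDistance) {
            hits.put(root.entry, query.length());
        }
        for (Map.Entry<Character, Node> child : root.children.entrySet()) {
            walk(child.getValue(), child.getKey(), query, firstRow, maxDistance, hits);
        }
        return hits;
    }

    private void walk(Node node, char c, String query, int[] prev,
                      int maxDistance, Map<String, Integer> hits) {
        // Compute the Levenshtein row for the trie prefix that ends at this node.
        int[] row = new int[prev.length];
        row[0] = prev[0] + 1;
        int best = row[0];
        for (int j = 1; j < row.length; j++) {
            int cost = query.charAt(j - 1) == c ? 0 : 1;
            row[j] = Math.min(Math.min(row[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            best = Math.min(best, row[j]);
        }
        int distance = row[row.length - 1];
        if (node.entry != null && distance <= maxDistance) {
            hits.put(node.entry, distance);
        }
        // Prune: if every cell exceeds the bound, no longer key on this branch can match.
        if (best <= maxDistance) {
            for (Map.Entry<Character, Node> child : node.children.entrySet()) {
                walk(child.getValue(), child.getKey(), query, row, maxDistance, hits);
            }
        }
    }
}
```

In this sketch the distance returned for each hit is the kind of value that would be stored under the feature named by _editDistanceFeatureName, and the bound plays the role of _maxEditDistance.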