Named-Entity Recognition (NER) is one of the most popular NLP tasks. It's popular because it produces annotations that can be used directly (e.g. extracting people's names from text) or indirectly (e.g. extracting features for a classification task).
One of the easiest ways to do it is to download the latest Stanford CoreNLP suite from https://stanfordnlp.github.io/CoreNLP/ and use it to train a NER model on our own dataset. Here are the steps.
First, prepare the necessary files:
A tab-separated dataset in which each token is labeled with its entity class (e.g. PERS for Person entity). Example: jane-austen-emma-ch1
If you have datasets in ENAMEX or OpenNLP format, you can use these simple Python scripts, enamex2stanfordner.py or tag2stanfordner.py, to convert them.
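The conversion scripts themselves are not reproduced here. As a rough sketch of what an ENAMEX-to-Stanford conversion involves, the Python snippet below turns inline `<ENAMEX TYPE="...">` tags into one token per line followed by a tab and a label, using `O` for tokens outside any entity. The function name `enamex_to_tsv`, the `label_map` parameter, and the very simple regex tokenizer are assumptions made for illustration; they are not taken from enamex2stanfordner.py.

```python
import re

# An ENAMEX-tagged span, e.g. <ENAMEX TYPE="PERSON">Emma Woodhouse</ENAMEX>
ENAMEX_RE = re.compile(r'<ENAMEX\s+TYPE="([^"]+)">(.*?)</ENAMEX>', re.DOTALL)
# Naive tokenizer (an assumption): runs of word characters, or single punctuation marks
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def enamex_to_tsv(text, label_map=None, background="O"):
    """Convert ENAMEX-tagged text to 'token<TAB>label' lines.

    label_map optionally renames ENAMEX types (e.g. PERSON -> PERS);
    types not in the map are kept unchanged.
    """
    label_map = label_map or {}
    rows = []
    pos = 0
    for match in ENAMEX_RE.finditer(text):
        # Untagged text before this entity gets the background label.
        for token in TOKEN_RE.findall(text[pos:match.start()]):
            rows.append((token, background))
        label = label_map.get(match.group(1), match.group(1))
        # Every token inside the entity span gets the entity label.
        for token in TOKEN_RE.findall(match.group(2)):
            rows.append((token, label))
        pos = match.end()
    # Untagged text after the last entity.
    for token in TOKEN_RE.findall(text[pos:]):
        rows.append((token, background))
    return "".join(f"{token}\t{label}\n" for token, label in rows)


sample = '<ENAMEX TYPE="PERSON">Emma Woodhouse</ENAMEX> was happy.'
print(enamex_to_tsv(sample, label_map={"PERSON": "PERS"}))
```

A real dataset would probably need a better tokenizer (this one splits "Emma's" into three tokens), so treat the output as a starting point to inspect by hand.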
Enter the unzipped stanfordnlp directory and run this command to train the model:
java -cp "*" edu.stanford.nlp.ie.crf.CRFClassifier -prop jane-austen.prop
The output will show the name of the serialized model file, ner-model.ser.gz:
....
[main] INFO edu.stanford.nlp.ie.crf.CRFClassifier - CRFClassifier training ... done [2.1 sec].
[main] INFO edu.stanford.nlp.ie.crf.CRFClassifier - Serializing classifier to ner-model.ser.gz...
[main] INFO edu.stanford.nlp.ie.crf.CRFClassifier - done.
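The training command reads all of its settings from jane-austen.prop, whose contents are not shown in this post. As a hedged sketch, the Python helper below builds a properties file modeled on the example in the Stanford NER CRF FAQ. The helper name `build_ner_properties` is hypothetical, the training file name `jane-austen-emma-ch1.tsv` is an assumption (the post gives the name without an extension), and the feature flags are the FAQ's example settings, not necessarily the ones used for the results here.

```python
def build_ner_properties(train_file, model_file):
    """Return the text of a CRFClassifier properties file.

    The feature flags mirror the example properties file in the
    Stanford NER CRF FAQ; tune them for your own data.
    """
    lines = [
        # Where the training data is, and where to save the model
        f"trainFile = {train_file}",
        f"serializeTo = {model_file}",
        # Column 0 holds the token, column 1 holds the label
        "map = word=0,answer=1",
        # Feature settings (taken from the FAQ example)
        "useClassFeature = true",
        "useWord = true",
        "useNGrams = true",
        "noMidNGrams = true",
        "maxNGramLeng = 6",
        "usePrev = true",
        "useNext = true",
        "useSequences = true",
        "usePrevSequences = true",
        "maxLeft = 1",
        "useTypeSeqs = true",
        "useTypeSeqs2 = true",
        "useTypeySequences = true",
        "wordShape = chris2useLC",
    ]
    return "\n".join(lines) + "\n"


# Assumed file names; adjust to match your own dataset.
print(build_ner_properties("jane-austen-emma-ch1.tsv", "ner-model.ser.gz"))
```

Saving the printed text as jane-austen.prop next to the training file should be enough for the `-prop` option to pick it up.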
To test the model against the test file, run this command:
java -cp "*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -testFile jane-austen-emma-ch2.tsv
The result shows the performance of the model. In this case it achieves about 82% precision, 73% recall and 77% F-measure:
...
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - CRFClassifier tagged 1999 words in 1 documents at 6305.99 words per second.
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Entity P R F1 TP FP FN
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - PERS 0.8205 0.7273 0.7711 32 7 12
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Totals 0.8205 0.7273 0.7711 32 7 12
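The P, R and F1 columns follow from the TP, FP and FN counts in the same row. As a quick sanity check, this small Python sketch recomputes them with the standard definitions; the function name `prf1` is just an illustrative choice.

```python
def prf1(tp, fp, fn):
    """Compute precision, recall and F1 from entity-level counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from the PERS row above: TP=32, FP=7, FN=12
p, r, f = prf1(32, 7, 12)
print(f"P={p:.4f} R={r:.4f} F1={f:.4f}")  # P=0.8205 R=0.7273 F1=0.7711
```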
You can also train the classifier for languages other than English, as long as you provide proper training and testing datasets.
Cheers!
References: https://nlp.stanford.edu/software/crf-faq.html
Thanks for post. Can't find Input files.
My bad. I've just updated the links. Thanks for letting me know!