Introduction
In this blog, we will discuss one of the major tasks of Natural Language Processing (NLP): Named Entity Recognition.
As the name suggests, it recognizes entity types in text, e.g., detecting whether an organization is mentioned and, if so, what its name is.
Generally, we deal with five to seven basic entity types such as organization, person, time, date, number, and money. For example, in the sentence "Google hired Jane in 2019", "Google" is an organization, "Jane" is a person, and "2019" is a date.
NER (Named Entity Recognition)
To build a NER model for basic or custom entities, you will need a large labeled dataset.
Different tools use different labeling schemes: for example, Stanford NER uses IOB encoding, while spaCy uses a start-index/end-index format.
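To make the difference concrete, here is a small illustrative annotation of the same sentence in both schemes (the sentence and labels are made up for this example):
## IOB encoding: one tag per token (B- begins an entity, I- continues it, O = outside)
iob_example = [("John", "B-PER"), ("works", "O"), ("at", "O"), ("Google", "B-ORG"), (".", "O")]
## spaCy-style format: character-level start/end offsets for each entity
spacy_example = ("John works at Google.", {"entities": [(0, 4, "PER"), (14, 20, "ORG")]})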
A number of pre-built NER models are readily available, such as Stanford CoreNLP, spaCy, and AllenNLP.
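To see how little code a pre-built model needs, here is a minimal sketch with spaCy (assuming spaCy is installed and its en_core_web_sm model has been downloaded):
import spacy
## Load a pre-trained English pipeline and run its NER on one sentence
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple ORG, U.K. GPE, $1 billion MONEY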
Today we will see how to train our own custom model, to get an idea of how these pre-built NER models are built.
To train a custom model, i.e., one that covers new entity types, you first need to annotate the data.
Implementation in Python
The dataset used here is medical data. You can find it here:
https://www.dropbox.com/s/ef5g11fdq7igi74/hackathon_disease_extraction.zip?dl=0
Required libraries:
1. pandas
2. numpy
3. keras
4. tensorflow
5. unicodedata (Python standard library)
6. keras-contrib (for the CRF layer; see the install note below)
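A note on installation: keras-contrib is not published on PyPI; at the time of writing it is installed directly from GitHub, e.g. with pip install git+https://www.github.com/keras-team/keras-contrib.git. The code below also assumes the standalone Keras 2.x API (not tf.keras), which is what keras-contrib targets.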
You can choose your own architecture, for example by adding more layers, and you can also apply hyperparameter tuning.
## Importing all required packages
import pandas as pd
import numpy as np
import unicodedata
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Model, Input
from keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout
from keras_contrib.layers import CRF
## Loading data
train_data = pd.read_csv("./data/input/train.csv")
test_data = pd.read_csv("./data/input/test.csv")
##-------------------------------------- Data Analysis ----------------------------------- ##
print("Training data summary:\n", train_data.nunique())
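If you want to peek at the raw data first, a quick sanity check (the columns used throughout this post are Word, tag, and Sent_ID) looks like:
## Quick look at the raw data and its columns
print(train_data.head())
print(train_data.columns.tolist())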
## Getting the list of words
words = list(set(train_data["Word"].append(test_data["Word"]).values))
## Creating the vocabulary: converting each word to its ASCII form
words = [unicodedata.normalize('NFKD', str(w)).encode('ascii', 'ignore') for w in words]
n_words = len(words)
## Creating the list of tags
tags = list(set(train_data["tag"].values))
n_tags = len(tags)
## Mapping each word and tag to an index so they can be looked up later.
word_idx = {word:index for index, word in enumerate(words)}
tag_idx = {tag:index for index, tag in enumerate(tags)}
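It is also handy to keep the reverse tag mapping around, so that predicted indices can be turned back into tag names later (it is used in the decoding sketch at the end of this post):
## Reverse mapping: index -> tag, used to decode predictions back into labels
idx2tag = {index: tag for tag, index in tag_idx.items()}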
##------------------------------------ Preparing the dataset --------------------------------------------##
word_tag_func = lambda s: [(word,tag) for word, tag in zip(s["Word"].values, s["tag"].values)]
grouped_word_tag = train_data.groupby("Sent_ID").apply(word_tag_func)
sentences = [s for s in grouped_word_tag]
word_func = lambda s: [word for word in s["Word"].values]
grouped_word = test_data.groupby("Sent_ID").apply(word_func)
test_sentences = [s for s in grouped_word]
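Each element of sentences is now one sentence represented as a list of (word, tag) pairs; printing one element makes the structure clear (the exact tags you see depend on the dataset):
## One training sentence: a list of (word, tag) tuples
print(sentences[0][:5])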
##------------------------------- Preparing data for modelling ----------------------------##
X_train = [[word_idx[unicodedata.normalize('NFKD', str(w[0])).encode('ascii', 'ignore')]
            for w in s] for s in sentences]
## Preparing input training data
X_train = pad_sequences(sequences= X_train, maxlen=180, padding='post')
print(len(X_train))
X_test = [[word_idx[unicodedata.normalize('NFKD', str(w)).encode('ascii', 'ignore')]
           for w in s] for s in test_sentences]
## Preparing input test data
X_test = pad_sequences(sequences= X_test, maxlen= 180, padding='post')
## Preparing output training data
y_train = [[tag_idx[w[1]] for w in s] for s in sentences]
y_train = pad_sequences(sequences=y_train, maxlen=180, padding= 'post', value= tag_idx["O"])
y_train = [to_categorical(i, num_classes=n_tags) for i in y_train]
y_train = np.array(y_train)
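Before building the model, it is worth checking that the arrays have the expected shapes: (number of sentences, 180) for the inputs and (number of sentences, 180, n_tags) for the one-hot labels:
## Sanity check on array shapes
print(X_train.shape, y_train.shape, X_test.shape)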
##------------------------------------------------ Model Creation -----------------------------------------------##
input = Input(shape=(180,))
model = Embedding(input_dim= n_words, output_dim=180, input_length=180)(input)
model = Dropout(0.1)(model)
model = LSTM(units=150, return_sequences=True, recurrent_dropout=0.1)(model)
model = TimeDistributed(Dense(n_tags, activation="relu"))(model)
crf_model = CRF(n_tags)  # number of CRF states must match the label dimension n_tags
output = crf_model(model) # output
model = Model(input, output)
model.compile(optimizer='adam', loss=crf_model.loss_function, metrics=[crf_model.accuracy])
model.summary()
fitted_model = model.fit(X_train, y_train, batch_size=48, epochs=5, verbose=1)
print(fitted_model.history)  # per-epoch loss and accuracy
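Once training finishes, you will usually want the predictions on the test set decoded back into tag names. A minimal sketch, using the idx2tag mapping built earlier:
## Predict tag scores for the test set and decode them back into tag names
predictions = model.predict(X_test)  # shape: (num_sentences, 180, n_tags)
predicted_tags = [[idx2tag[i] for i in np.argmax(p, axis=-1)] for p in predictions]
print(predicted_tags[0][:10])  # first ten predicted tags of the first test sentence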
Conclusion
We have seen how to create our own NER model for some basic entities, and you can add more entities to build a custom NER model.
The overall pipeline stays much the same; the model architecture can be modified according to your needs.