Author: Vysakh Nair | Compiler: VK | Source: Towards Data Science

Contents

  1. Understanding the problem

  2. Prerequisites

  3. The data

  4. Obtaining structured data

  5. Preparing the text data – natural language processing

  6. Obtaining image features – transfer learning

  7. Input pipeline – data generators

  8. Encoder-decoder model – training, greedy search, beam search, BLEU

  9. Attention mechanism – training, greedy search, beam search, BLEU

  10. Summary

  11. Future work

  12. References

1. Understanding the problem

Image captioning is a challenging artificial intelligence problem. It refers to the process of generating a textual description of an image based on its content. For example, look at the following figure:

A common answer is “a woman who plays guitar.” As humans, we can look at a picture and describe everything in it, using appropriate language. That’s easy. Let me show you one more:

Okay, how would you describe this?

For all of us “non-radiologists,” a common answer is “chest X-ray.”

Radiologists write text reports describing the findings for the various body parts examined in an imaging study, in particular whether each part is normal, abnormal, or potentially abnormal. They can extract valuable information from a single one of these images and write a medical report.

Writing medical imaging reports can be difficult for inexperienced radiologists and pathologists, especially those working in rural areas where the quality of care is relatively low, while writing imaging reports can be tedious and time consuming for experienced radiologists and pathologists.

So, to solve all these problems, wouldn’t it be great if a computer could use a chest X-ray like the one above as input and output the results as text, just like a radiologist does?

2. Prerequisites

This article assumes some familiarity with topics such as neural networks, CNNs, RNNs, transfer learning, Python programming, and the Keras library. The two models mentioned below will be used for our problem and will be explained briefly later in this blog:

  1. Encoder-decoder model

  2. Attention mechanism

Knowing enough about them will help you understand the model better.

3. The data

You can get the data for this question from the following link:

  • Images – contains all the chest X-rays: academictorrents.com/details/5a3…
  • Reports – contains the corresponding reports for the images: academictorrents.com/details/664…

The image dataset contains multiple chest X-rays per person, for example a lateral view, several frontal views, etc.

Just as radiologists use all of these images to write reports, models will use all of these images together to produce results. There are 3955 reports in the dataset, and each report has one or more images associated with it.

3.1 Extract the required data from the XML file

The reports in the dataset are XML files, where each file corresponds to an individual person. These files contain the IDs of the images associated with that person and the corresponding findings. The following is an example:

The highlighted information is what you need to extract from these files. This can be done with the help of Python’s XML library.

Note: Findings will also be referred to as reports. They will be used interchangeably in other parts of the blog.

import os
import xml.etree.ElementTree as ET
from tqdm import tqdm

img = []
img_impression = []
img_finding = []

# 'directory' contains the report (XML) files
for filename in tqdm(os.listdir(directory)):
    if filename.endswith(".xml"):
        f = directory + '/' + filename
        tree = ET.parse(f)
        root = tree.getroot()
        for child in root:
            if child.tag == 'MedlineCitation':
                for attr in child:
                    if attr.tag == 'Article':
                        for i in attr:
                            if i.tag == 'Abstract':
                                for name in i:
                                    if name.get('Label') == 'FINDINGS':
                                        finding = name.text
        for p_image in root.findall('parentImage'):
            img.append(p_image.get('id'))
            img_finding.append(finding)

4. Obtaining structured data

Once the required data is extracted from the XML file, the data is converted into a structured format that is easy to understand and access.

As mentioned earlier, there are multiple images associated with a single report, so our model also needs to see these images when generating the report. Some reports have only one image associated with them, others have two, and at most four.

So the question arises: how many images should we feed into the model at a time to generate a report? To keep the model inputs consistent, a pair of images (that is, two images) is selected as the input each time. If a report has only one image, that image is duplicated as the second input.

Now we have structured data that is appropriate and easy to understand. Images are saved by their absolute paths, which will make loading the data easier.
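Here is a minimal sketch of how such a structured table could be assembled with pandas. The column names Image1/Image2, the image directory and file extension, the person-ID parsing, and the choice of the first two images per person are my illustrative assumptions; the Person_id and Report columns match the ones used in the code later on.

import pandas as pd

IMAGE_DIR = '/path/to/images'   # illustrative location of the extracted image files

df = pd.DataFrame({'Image': img, 'Finding': img_finding})
# Image IDs such as 'CXR3_IM-1384-1001' share a common prefix per person (assumption)
df['Person_id'] = df['Image'].str.split('_').str[0]

rows = []
for pid, grp in df.groupby('Person_id'):
    images = list(grp['Image'])
    # Map IDs to absolute paths; the '.png' extension is an assumption
    img1 = IMAGE_DIR + '/' + images[0] + '.png'
    # Duplicate the single image as the second input when only one exists
    img2 = IMAGE_DIR + '/' + (images[1] if len(images) > 1 else images[0]) + '.png'
    rows.append((pid, img1, img2, grp['Finding'].iloc[0]))

dataset = pd.DataFrame(rows, columns=['Person_id', 'Image1', 'Image2', 'Report'])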

5. Prepare text data

Once we get the results from the XML files, we should clean them up and prepare them properly before we enter them into the model. The images below show a few examples of what the findings look like before cleaning.

We will clean up the text as follows (a short sketch implementing these steps is given after the list):

  1. Convert all characters to lowercase.

  2. Perform basic decontraction, i.e. expand contractions such as won’t and can’t to will not and can not, respectively.

  3. Remove punctuation from the text. Note that periods are not removed, because the findings contain multiple sentences and we need the model to generate reports in a similar way by recognizing sentences.

  4. Remove all numbers from the text.

  5. Remove all words of length 2 or less. For example, “is” and “to” are removed; these words don’t provide much information. But the word “no” is not removed, because it adds semantic information: adding “no” to a sentence completely changes its meaning. So we have to be careful when performing these cleanup steps; you need to decide which words to keep and which to drop.

  6. Some texts were also found to contain multiple periods or spaces, or “x” repeated several times. Such characters are removed as well.
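The following is a minimal sketch of a cleaning function implementing these steps; the exact regular expressions and the decontraction list are illustrative, not the author's original code:

import re

def clean_text(text):
    """Apply the cleaning steps listed above to a single findings string."""
    text = text.lower()
    # Basic decontraction (illustrative subset)
    text = text.replace("won't", "will not").replace("can't", "can not")
    text = re.sub(r"n't\b", " not", text)
    # Remove punctuation but keep periods, since they mark sentence boundaries
    text = re.sub(r"[^a-z0-9. ]", " ", text)
    # Remove numbers
    text = re.sub(r"\d+", "", text)
    # Remove runs of repeated 'x' characters such as 'xxxx'
    text = re.sub(r"\bx+\b", " ", text)
    # Drop words of length 2 or less, but keep the semantically important word 'no'
    words = [w for w in text.split() if len(w) > 2 or w.strip('.') == 'no']
    text = ' '.join(words)
    # Collapse repeated periods and extra spaces
    text = re.sub(r"\.(\s*\.)+", ".", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text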

The model we will develop generates a report from a pair of images, one word at a time. The sequence of previously generated words is supplied as input.

Therefore, we need a “first word” to start the generation process and a “last word” to indicate the end of the report. To do this, we will use the strings “startseq” and “endseq”. These strings are added to our data. This is important now, because when we encode the text, we need to encode these strings correctly.

The main step in encoding text is to create a consistent mapping from words to unique integer values, known as tokenization. For our computer to understand any text, we need to break words or sentences down in a way that a machine can understand. Text data cannot be processed without tokenization.

Tokenization is a way of splitting a piece of text into smaller units called tokens. A token can be a word or a character, but in our case it will be a word. Keras provides a built-in class for this purpose.

from tensorflow.keras.preprocessing.text import Tokenizer

# The period is left out of the filter list so that sentence boundaries survive tokenization
tokenizer = Tokenizer(filters='!"#$%&()*+,-/:;<=>?@[\\]^_`{|}~\t\n')
tokenizer.fit_on_texts(reports)

We have now cleaned and tokenized the text appropriately for future use. The full code for all of this can be found in my GitHub account, which is linked at the end of this article.

6. Obtain image features

Images and partial reports are inputs to our model. We need to convert each image into a fixed-size vector, which we then pass into the model as input. To do this, we will use transfer learning.

“In transfer learning, we first train the base network on the base data set and task, and then we reassign the features we learn or transfer them to a second target network for training on the target data set and task. This process tends to be efficient if the characteristics are generic, that is, appropriate to both the primary and the target tasks, rather than specific to the primary task.”

VGG16, VGG19, and InceptionV3 are common CNNs for transfer learning. These are all trained on datasets like ImageNet, whose images are completely different from chest X-rays, so logically they don’t seem like good choices for our task. So what kind of network should we use to solve our problem?

If you’re not familiar with it, let me introduce you to CheXNet. CheXNet is a 121-layer convolutional neural network trained on ChestX-ray14, currently the largest publicly available chest X-ray dataset, containing more than 100,000 frontal-view X-ray images of 14 diseases. However, our goal here is not to classify the images but to capture the features of each image, so the final classification layer of the network is not needed.

You can download the trained CheXNet weights here: drive.google.com/file/d/19Bl…

from tensorflow.keras.applications import densenet
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model

chex = densenet.DenseNet121(include_top=False, weights=None,
                            input_shape=(224, 224, 3), pooling="avg")

X = chex.output
X = Dense(14, activation="sigmoid", name="predictions")(X)

model = Model(inputs=chex.input, outputs=X)

model.load_weights('load_the_downloaded_weights.h5')

# Drop the classification head; keep the pooled 1024-dimensional features
chexnet = Model(inputs=model.input, outputs=model.layers[-2].output)

In case you forgot, we have two images as input to our model. Here’s how to get features:

Each image is resized to (224, 224, 3) and passed through CheXNet to obtain a 1024-dimensional feature vector. The two feature vectors are then concatenated to obtain a 2048-dimensional feature vector.

If you noticed, we added an average pooling layer as the last layer. There’s a reason for that. Because we are concatenating two images, the model might learn some concatenation order, for example that image1 always comes after image2 or vice versa, but that is not the case here: we don’t maintain any order when concatenating them. Pooling helps avoid this problem.

The code is as follows:

import numpy as np
from PIL import Image
from skimage.transform import resize   # resize here is assumed to be skimage.transform.resize
from tqdm import tqdm

def load_image(img_name):
    """Load an image and prepare it as CheXNet input."""
    image = Image.open(img_name)
    image_array = np.asarray(image.convert("RGB"))
    image_array = image_array / 255.
    image_array = resize(image_array, (224, 224))
    X = np.expand_dims(image_array, axis=0)
    X = np.asarray(X)
    return X

Xnet_features = {}
for key, img1, img2, finding in tqdm(dataset.values):
    i1 = load_image(img1)
    img1_features = chexnet.predict(i1)
    i2 = load_image(img2)
    img2_features = chexnet.predict(i2)
    # Concatenate the two 1024-d vectors into a single 2048-d vector
    input_ = np.concatenate((img1_features, img2_features), axis=1)
    Xnet_features[key] = input_

These features are stored in a dictionary and saved in pickle format for future use.
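A minimal sketch of how the dictionary might be saved and loaded back with pickle; the file name chexnet_features.pickle is illustrative:

import pickle

# Save the CheXNet feature dictionary to disk
with open('chexnet_features.pickle', 'wb') as f:
    pickle.dump(Xnet_features, f)

# Reload it later, before building the input pipeline
with open('chexnet_features.pickle', 'rb') as f:
    Xnet_features = pickle.load(f)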

7. Input pipeline

Consider a scenario where you have so much data that you can’t keep it all in RAM at once. Buying more memory is obviously not an option for everyone.

The solution is to feed small batches of data into the model dynamically. That’s exactly what data generators do: they generate model inputs on the fly, forming a pipeline that loads data from storage into RAM as needed.

Another advantage of this pipeline is that transformations can easily be applied to these small batches as they are prepared for the model.

We’ll use tf.data for our problem.

We first split the dataset into two parts, a training dataset and a validation dataset. When partitioning, make sure you have enough data points for training and a sufficient amount of data for validation. The ratio I chose allowed me 2560 data points in the training set and 1147 data points in the validation set.

Now it’s time to create a generator for our dataset.

import tensorflow as tf
from sklearn.model_selection import train_test_split

X_train_img, X_cv_img, y_train_rep, y_cv_rep = train_test_split(dataset['Person_id'], dataset['Report'],
                                                                 test_size=split_size, random_state=97)

def load_image(id_, report):
    """Load the precomputed image features for the corresponding ID."""
    img_feature = Xnet_features[id_.decode('utf-8')][0]
    return img_feature, report

def create_dataset(img_name_train, report_train):
    dataset = tf.data.Dataset.from_tensor_slices((img_name_train, report_train))
    # Use map to load the numpy features in parallel
    dataset = dataset.map(lambda item1, item2: tf.numpy_function(load_image, [item1, item2],
                          [tf.float32, tf.string]),
                          num_parallel_calls=tf.data.experimental.AUTOTUNE)
    # Shuffle, batch and prefetch
    dataset = dataset.shuffle(500).batch(BATCH_SIZE).prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
    return dataset

train_dataset = create_dataset(X_train_img, y_train_rep)
cv_dataset = create_dataset(X_cv_img, y_cv_rep)

Here, we create two data generators: train_dataset for training and cv_dataset for validation. The create_dataset function takes the IDs (which are the keys of the feature dictionary created earlier) and the preprocessed reports, and creates the generator. The generator yields a batch-size number of data points at a time.

As mentioned earlier, the model we will create is a word-by-word model. It takes the image features and a partial sequence as inputs and generates the next word in the sequence.

For example, suppose the report corresponding to the image features is “startseq the cardiac silhouette and mediastinum size are within normal limits endseq”.

Then the input sequence is divided into 11 input-output pairs to train the model:

Note that we are not creating these input/output pairs through the generator. The generator only gives us a batch-size number of image features and their corresponding full reports at a time. The input/output pairs are created later, during training, and will be explained below.
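To make the word-by-word setup concrete, here is a small illustrative sketch (not part of the training pipeline) that prints the partial-sequence/next-word pairs for the example report above:

report = 'startseq the cardiac silhouette and mediastinum size are within normal limits endseq'
tokens = report.split()

# Each training pair is: (image features, words so far) -> next word
for i in range(1, len(tokens)):
    partial_sequence = ' '.join(tokens[:i])
    next_word = tokens[i]
    print(partial_sequence, '-->', next_word)   # 11 pairs for this 12-token report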

8. Encoder-decoder model

A sequence-to-sequence model is a deep learning model that takes one sequence (in our case, the features of the images) and outputs another sequence (the report).

The encoder processes each item in the input sequence, and it compiles the captured information into a vector called context. After processing the entire input sequence, the encoder sends the context to the decoder, which begins generating the output sequence item by item.

The encoder in this example is a CNN that generates context vectors by capturing image features. The decoder is a recurrent neural network.

In his paper Where to Put the Image in an Image Caption Generator, Marc Tanti introduced architectures such as init-inject, par-inject, pre-inject, and merge, which specify where the image should be injected when building an image caption generator. We will use the merge architecture specified in his paper for our problem.

In the “merge” architecture, the RNN is not exposed to the image vector (or a vector derived from it) at any point. Instead, the image is introduced into the language model after the whole prefix has been encoded by the RNN. This is a late-binding architecture, and it does not modify the image representation at each time step.

Some of the important conclusions from his paper were applied to the architecture we implemented. They are:

  1. The RNN output needs to be regularized with dropout.

  2. The image vector should not have a non-linear activation function, nor be regularized with dropout.

  3. The image feature vectors extracted from CheXNet must be normalized before being fed into the neural network.

Embedding layer:

Word embeddings are a class of approaches that represent words and documents using dense vector representations. Keras provides an embedding layer that can be used with neural networks on text data. It can also use word embeddings learned elsewhere. Learning and saving word embeddings is very common in the field of natural language processing.

In our model, the embedding layer maps each word to a 300-dimensional representation using a pre-trained GloVe model. When using pre-trained embeddings, remember that the layer weights should be frozen by setting the parameter trainable=False so that they are not updated during training.
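Below is a minimal sketch of how such an embedding matrix could be built from GloVe vectors; the file name glove.6B.300d.txt and the variable names are illustrative assumptions, not the author's original code:

import numpy as np

# Load the 300-dimensional GloVe vectors into a dictionary: word -> vector
glove_vectors = {}
with open('glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.split()
        glove_vectors[parts[0]] = np.asarray(parts[1:], dtype='float32')

# One row per word in the tokenizer's vocabulary; words missing from GloVe stay as zero vectors
vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, 300))
for word, idx in tokenizer.word_index.items():
    vector = glove_vectors.get(word)
    if vector is not None:
        embedding_matrix[idx] = vector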

Model code:

from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout
from tensorflow.keras.models import Model

input1 = Input(shape=(2048,), name='Image_1')
dense1 = Dense(256, kernel_initializer=tf.keras.initializers.glorot_uniform(seed=56),
               name='dense_encoder')(input1)

input2 = Input(shape=(155,), name='Text_Input')
emb_layer = Embedding(input_dim=vocab_size, output_dim=300, input_length=155, mask_zero=True,
                      trainable=False, weights=[embedding_matrix], name="Embedding_layer")
emb = emb_layer(input2)

LSTM2 = LSTM(units=256, activation='tanh', recurrent_activation='sigmoid', use_bias=True,
             kernel_initializer=tf.keras.initializers.glorot_uniform(seed=23),
             recurrent_initializer=tf.keras.initializers.orthogonal(seed=7),
             bias_initializer=tf.keras.initializers.zeros(), name="LSTM2")
LSTM2_output = LSTM2(emb)
dropout1 = Dropout(0.5, name='dropout1')(LSTM2_output)

# Merge architecture: image features are added to the encoded text after the LSTM
dec = tf.keras.layers.Add()([dense1, dropout1])

fc1 = Dense(256, activation='relu', kernel_initializer=tf.keras.initializers.he_normal(seed=63),
            name='fc1')
fc1_output = fc1(dec)
output_layer = Dense(vocab_size, activation='softmax', name='Output_layer')
output = output_layer(fc1_output)

encoder_decoder = Model(inputs=[input1, input2], outputs=output)

Model summary:

8.1 Training

Loss function:

A masked loss function is defined for this problem. For example:

Suppose we have a sequence of tokens [3, 10, 7, 0, 0, 0, 0, 0, 0, 0].

We only have three words in this sequence; the zeros are padding and are not actually part of the report. But the model would assume that the zeros are also part of the sequence and start learning them.

If the model starts predicting zeros correctly, the loss will decrease, because from the model’s point of view it is learning correctly. But for us, the loss should decrease only when the model correctly predicts the actual (non-zero) words.

Therefore, we should mask the zeros in the sequence so that the model does not focus on them and only learns the words needed in the report.

loss_function = tf.keras.losses.CategoricalCrossentropy(from_logits=False, reduction='auto')

def maskedLoss(y_true, y_pred):
    # Build the mask: True wherever the target is a real word (non-zero)
    mask = tf.math.logical_not(tf.math.equal(y_true, 0))

    # Compute the loss
    loss_ = loss_function(y_true, y_pred)

    # Convert the mask to the loss dtype
    mask = tf.cast(mask, dtype=loss_.dtype)

    # Apply the mask to the loss
    loss_ = loss_ * mask

    # Take the mean
    loss_ = tf.reduce_mean(loss_)
    return loss_

The output words are one-hot encoded, so categorical cross-entropy will be our loss function.

optimizer = tf.keras.optimizers.Adam(0.001)
encoder_decoder.compile(optimizer, loss = maskedLoss)

Remember our data generator? Now it’s time to use them.

Here, the batches provided by the generator are not the actual batches we train on. Remember, they are not word-by-word input/output pairs; they contain only the images and their corresponding full reports.

We will retrieve each batch from the generator and manually create the input and output sequences from it, i.e. we create our own custom batch data for training. So here, the batch size is effectively the number of image pairs the model sees per batch, and we can change it according to the capabilities of our system. I found this approach to be much faster than the traditional custom generators mentioned in other blogs.

Since we are creating our own batch data for training, we will use “train_on_batch” to train our model.

epoch_train_loss = []
epoch_val_loss = []

for epoch in range(EPOCH):
    print('EPOCH : ',epoch+1)
    start = time.time()
    batch_loss_tr = 0
    batch_loss_vl = 0
    
    for img, report in train_dataset:
       
        r1 = bytes_to_string(report.numpy())
        img_input, rep_input, output_word = convert(img.numpy(), r1)
        rep_input = pad_sequences(rep_input, maxlen=MAX_INPUT_LEN, padding='post')
        results = encoder_decoder.train_on_batch([img_input, rep_input], output_word)
        
        batch_loss_tr += results
    train_loss = batch_loss_tr/(X_train_img.shape[0]//BATCH_SIZE)
    with train_summary_writer.as_default():
        tf.summary.scalar('loss', train_loss, step = epoch)
    
    for img, report in cv_dataset:
        
        r1 = bytes_to_string(report.numpy())
        img_input, rep_input, output_word = convert(img.numpy(), r1)
        rep_input = pad_sequences(rep_input, maxlen=MAX_INPUT_LEN, padding='post')
        results = encoder_decoder.test_on_batch([img_input, rep_input], output_word)
        batch_loss_vl += results
    
    val_loss = batch_loss_vl/(X_cv_img.shape[0]//BATCH_SIZE)
    with val_summary_writer.as_default():
        tf.summary.scalar('loss', val_loss, step = epoch)

    epoch_train_loss.append(train_loss)
    epoch_val_loss.append(val_loss)
    
    print('Training Loss: {}, Val Loss: {}'.format(train_loss, val_loss))
    print('Time Taken for this Epoch : {} sec'.format(time.time()-start))   
    encoder_decoder.save_weights('Weights/BM7_new_model1_epoch_'+ str(epoch+1) + '.h5')

The convert function mentioned in the code converts the data from the generator into a word-by-word input-output pair representation. The input reports are then padded to the maximum report length.

The Convert function:

def convert(images, reports):
    """Take a batch of (image features, full report) pairs and expand it into word-by-word training data."""
    imgs = []
    in_reports = []
    out_reports = []
    for i in range(len(images)):
        sequence = [tokenizer.word_index[e] for e in reports[i].split() if e in tokenizer.word_index.keys()]
        for j in range(1, len(sequence)):

            in_seq = sequence[:j]
            out_seq = sequence[j]
            out_seq = tf.keras.utils.to_categorical(out_seq, num_classes=vocab_size)

            imgs.append(images[i])
            in_reports.append(in_seq)
            out_reports.append(out_seq)
    return np.array(imgs), np.array(in_reports), np.array(out_reports)

The Adam optimizer was used with a learning rate of 0.001. The model was trained for 40 epochs, but the best results were obtained at the 35th epoch. Because of randomness, you might get different results.

Note: The above training is implemented in Tensorflow 2.1.

8.2 Inference

Now that we have trained our model, it is time to prepare our model for predictive reporting.

To do this, we have to make some adjustments to our model. This will save some time during testing.

First, we will separate the encoder and decoder parts from the model. The features predicted by the encoder will be used as input to our decoder.

# encoder
encoder_input = encoder_decoder.input[0]
encoder_output = encoder_decoder.get_layer('dense_encoder').output
encoder_model = Model(encoder_input, encoder_output)


# decoder
text_input = encoder_decoder.input[1]
enc_output = Input(shape=(256,), name='Enc_Output')
text_output = encoder_decoder.get_layer('LSTM2').output
add1 = tf.keras.layers.Add()([text_output, enc_output])
fc_1 = fc1(add1)
decoder_output = output_layer(fc_1)
decoder_model = Model(inputs = [text_input, enc_output], outputs = decoder_output)

By doing this, we only need to compute the encoder features once, and we reuse them for both the greedy search and beam search algorithms.

We’ll implement these two algorithms for generating text and see which one works best.

8.3 Greedy search algorithm

Greedy search is an algorithmic paradigm that builds solutions piece by piece, always choosing the best each time.

Greedy search steps:

  1. The encoder outputs the image features. That’s where the encoder’s job ends: once we have the features we need, we no longer need the encoder.

  2. This feature vector, together with the start token “startseq” (our initial input sequence), is used as the first input to the decoder.

  3. The decoder predicts a probability distribution for the entire vocabulary, and the word with the highest probability is chosen as the next word.

  4. The predicted word and the previous input sequence will be our next input sequence and passed to the decoder.

  5. Continue with Steps 3-4 until the end flag, endseq, is encountered.

from tensorflow.keras.preprocessing.sequence import pad_sequences

def greedysearch(img):
    image = Xnet_features[img]  # precomputed CheXNet features for the image pair
    input_ = 'startseq'  # start token of the report
    image_features = encoder_model.predict(image)  # encoder output

    result = []
    for i in range(MAX_REP_LEN):
        input_tok = [tokenizer.word_index[w] for w in input_.split()]
        input_padded = pad_sequences([input_tok], 155, padding='post')
        predictions = decoder_model.predict([input_padded, image_features])
        arg = np.argmax(predictions)
        if arg != tokenizer.word_index['endseq']:   # stop at the 'endseq' token
            result.append(tokenizer.index_word[arg])
            input_ = input_ + ' ' + tokenizer.index_word[arg]
        else:
            break
    rep = ' '.join(e for e in result)
    return rep

Let’s examine how our model performs when using greedy search to generate reports.

BLEU Score – Greedy Search:

The Bilingual Evaluation Understudy score, or BLEU, is a metric for comparing a generated sentence to a reference sentence.

A perfect match results in a score of 1.0, while a complete mismatch results in a score of 0.0. The method counts matching n-grams between the candidate text and the reference text, where a unigram comparison is over individual tokens and a bigram comparison is over word pairs.

In practice it is impossible to get a perfect score, because a translation would have to match the references exactly. This is not even possible for human translators.

To learn more about BLEU, click here: machinelearningmastery.com/calculate-b…
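As an illustration, here is a minimal sketch of how the greedy-search BLEU scores could be computed over the validation set with NLTK; the use of corpus_bleu and the smoothing function are my own choices, not necessarily the author's exact evaluation code:

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = []
candidates = []
for img_id, true_report in zip(X_cv_img.values, y_cv_rep.values):
    predicted = greedysearch(img_id)
    # Strip the start/end tokens from the reference before comparing
    reference = [w for w in true_report.split() if w not in ('startseq', 'endseq')]
    references.append([reference])        # corpus_bleu expects a list of references per sample
    candidates.append(predicted.split())

smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n)) + (0.0,) * (4 - n)
    score = corpus_bleu(references, candidates, weights=weights, smoothing_function=smooth)
    print('BLEU-{}: {:.4f}'.format(n, score))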

8.4 Beam search

Beam search is an algorithm that expands on greedy search and returns a list of the most likely output sequences. Each sequence has a score associated with it, and the sequence with the highest score is the final result.

Instead of greedily choosing the most likely next step while building the sequence, beam search expands all possible next steps and keeps the k most likely candidates, where k (the beam width) is a user-specified parameter that controls the number of beams, i.e. parallel searches, through the sequence of probabilities.

A beam search with a beam width of 1 is greedy search. Common beam width values range from 5 to 10, but values as high as 1000 or more have been used in research to squeeze the best possible performance out of a model. To learn more about beam search, click here.

But keep in mind that as the beam width increases, so does the time complexity. As a result, beam search is much slower than greedy search.

from math import log

def beamsearch(image, beam_width):

    start = [tokenizer.word_index['startseq']]

    sequences = [[start, 0]]

    img_features = Xnet_features[image]
    img_features = encoder_model.predict(img_features)
    finished_seq = []

    for i in range(max_rep_length):
        all_candidates = []
        new_seq = []
        for s in sequences:

            text_input = pad_sequences([s[0]], 155, padding='post')
            # inputs ordered as [text, image features] to match decoder_model
            predictions = decoder_model.predict([text_input, img_features])
            top_words = np.argsort(predictions[0])[-beam_width:]
            seq, score = s

            for t in top_words:
                candidates = [seq + [t], score - log(predictions[0][t])]
                all_candidates.append(candidates)

        sequences = sorted(all_candidates, key=lambda l: l[1])[:beam_width]
        # check for 'endseq' in each sequence of the beam
        count = 0
        for seq, score in sequences:
            if seq[len(seq) - 1] == tokenizer.word_index['endseq']:
                score = score / len(seq)   # length normalization
                finished_seq.append([seq, score])
                count += 1
            else:
                new_seq.append([seq, score])
        beam_width -= count
        sequences = new_seq

        # stop early if all sequences end before 155 time steps
        if not sequences:
            break

    sequences = finished_seq[-1]
    rep = sequences[0]
    score = sequences[1]
    temp = []
    rep.pop(0)
    for word in rep:
        if word != tokenizer.word_index['endseq']:
            temp.append(tokenizer.index_word[word])
        else:
            break
    rep = ' '.join(e for e in temp)

    return rep, score

Beam search doesn’t always guarantee better results, but in most cases it does.

You can check the BLEU scores for beam search using the function given above. But keep in mind that it takes a while (a few hours) to evaluate them.

8.5 Examples

Now let’s look at the chest X-ray prediction report:

Original report for image 1: “Heart normal size. Mediastinum is not evident. The lungs are clean.”

Predicted report for image 1: “Heart normal size. Mediastinum is not evident. The lungs are clean.”

For this example, the model predicted exactly the same report.

Original report for image 2: “Heart size and pulmonary blood vessels within normal range. No focal infiltrating pneumothorax pleural effusion was observed.”

Predicted report for image 2: “Heart size and pulmonary blood vessels appear within normal ranges. The lung is a free focal spatial lesion. No pleural effusion pneumothorax observed.”

Although not identical, the prediction is quite similar to the original report.

Original report for image 3: “Lung hyperinflated but clear. There was no focal infiltrating exudation. The heart and mediastinum outline are within normal range. Calcified mediastinum was found.”

Predicted report for image 3: “Heart size normal. The mediastinal profile is within normal range. There’s no invasive lesions in the lungs. There is no nodular mass. No obvious pneumothorax. No pleural fluid was visible. This is perfectly normal. There is no visible free intraperitoneal air below the diaphragm.”

You didn’t expect this model to work perfectly, did you? No model is perfect, and this one is no exception. Although some details are correctly identified from image 3, many of the additional details generated may or may not be correct.

The model we created wasn’t a perfect one, but it did produce decent reports for our images.

Now let’s look at an advanced model and see if it improves the current performance!!

9. Attention mechanism

The attention mechanism is an improvement on the encoder-decoder model. It turns out that the context vector is a bottleneck in these types of models, which makes it difficult for them to handle long sentences. Solutions were proposed by Bahdanau et al., 2014 and Luong et al., 2015.

These papers introduced and refined a technique called the “attention mechanism”, which greatly improved the quality of machine translation systems. Attention allows the model to focus on the relevant parts of the input sequence as needed. Later, this idea was applied to image captioning.

So how do we build attention mechanisms for images?

For text, we have a representation for each position in the input sequence. For images, we typically use the representation from a fully connected layer of the network, but this representation does not contain any positional information (think about it: the layer is fully connected).

We need to look at specific parts (locations) of the image to describe what is in it. For example, to describe the size of a person’s heart from an X-ray, we only need to look at the heart region, not the arms or any other part of the body. So what should the input to the attention mechanism be?

We use the output of the convolution layer (transfer learning) rather than the fully connected representation because the output of the convolution layer has spatial information.

For example, let the output of the last convolutional layer be a feature map of size (7×14×1024). Here, “7×14” corresponds to actual locations in the image, and 1024 is the number of channels. What we care about is not the channels but the locations, of which there are 7×14 = 98. So we can think of this as 98 locations, each with a 1024-dimensional representation.

We now have 98 time steps, each with a 1024-dimensional representation, and we need to decide how the model should focus on these 98 time steps or locations.

A simple approach is to assign a weight to each location and take the weighted sum over all 98 locations. If a particular time step is important for predicting an output, that time step will get a higher weight. Let’s denote these weights by alpha.

We now know that alpha determines the importance of a particular location: the higher the alpha, the higher the importance. But how do we find the values of alpha? No one is going to give us these values; the model itself should learn them from the data. To do this, we define a score e_jt = f_att(h_j, s_(t-1)).

This quantity represents the importance of the j-th input for decoding the t-th output, where h_j is the representation of the j-th location and s_(t-1) is the decoder state up to that point. We need these two quantities to determine e_jt. f_att is just a function, which we will define shortly.

We want these quantities (the e_jt) to sum to 1 over all inputs, like a probability distribution over the importance of the inputs, so the e_jt are converted into a probability distribution with a softmax: alpha_jt = exp(e_jt) / Σ_k exp(e_kt).

Now we have the alphas! The alphas are our weights: alpha_jt represents the probability of focusing on the j-th input to produce the t-th output.

Now it is time to define our function f_att. Here is one of many possible options: e_jt = V^T · tanh(U·h_j + W·s_(t-1)).

V, U, and W are parameters learned during training to determine the value of e_jt.

We have the alphas and we have the inputs; now we just need to take the weighted sum to produce the new context vector, which is fed into the decoder. In practice, these models work better than plain encoder-decoder models.

Model implementation:

Like the encoder-decoder model above, this model will consist of two parts, an encoder and a decoder, but this time the decoder has an additional attention component. To understand this better, let’s write it out in code:

# calculate e_jt
score = self.Vattn(tf.nn.tanh(self.Uattn(features) + self.Wattn(hidden_with_time_axis)))

# convert the scores into a probability distribution using softmax
attention_weights = tf.nn.softmax(score, axis=1)

# compute the context vector (weighted sum)
context_vector = attention_weights * features

We don’t have to write these lines of code from scratch when building the model. The Keras library already has a built-in attention layer for this purpose: the AdditiveAttention layer, also known as Bahdanau attention. You can learn more about this layer from the documentation itself. Link: www.tensorflow.org/api_docs/py…

The text input for this model remains the same, but for the image features, this time we take the features from the last convolutional layer of the CheXNet network.

The combined output shape for the two images is (None, 7, 14, 1024), so the input to the encoder after reshaping will be (None, 98, 1024). Why reshape the image features? This was explained in the introduction to attention above, so be sure to re-read that explanation if you have any doubts.
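Here is a minimal sketch of how such (98, 1024) inputs could be produced, assuming a second CheXNet feature extractor built without the global pooling layer; the layer index, the concatenation axis, and the function name are illustrative assumptions, not the author's original code:

# Feature extractor that keeps spatial information: roughly (7, 7, 1024) per 224x224 image
# model.layers[-3] is assumed to be the last activation before the global pooling layer
chexnet_conv = Model(inputs=model.input, outputs=model.layers[-3].output)

def conv_features(img1_path, img2_path):
    f1 = chexnet_conv.predict(load_image(img1_path))   # (1, 7, 7, 1024)
    f2 = chexnet_conv.predict(load_image(img2_path))   # (1, 7, 7, 1024)
    combined = np.concatenate((f1, f2), axis=2)        # (1, 7, 14, 1024)
    return combined.reshape(1, 98, 1024)               # 7 x 14 = 98 locations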

Model:

from tensorflow.keras.layers import Concatenate

input1 = Input(shape=(98, 1024), name='Image_1')
maxpool1 = tf.keras.layers.MaxPool1D()(input1)
dense1 = Dense(256, kernel_initializer=tf.keras.initializers.glorot_uniform(seed=56), name='dense_encoder')(maxpool1)

input2 = Input(shape=(155,), name='Text_Input')
emb_layer = Embedding(input_dim=vocab_size, output_dim=300, input_length=155, mask_zero=True, trainable=False,
                      weights=[embedding_matrix], name="Embedding_layer")
emb = emb_layer(input2)

LSTM1 = LSTM(units=256, activation='tanh', recurrent_activation='sigmoid', use_bias=True,
             kernel_initializer=tf.keras.initializers.glorot_uniform(seed=23),
             recurrent_initializer=tf.keras.initializers.orthogonal(seed=7),
             bias_initializer=tf.keras.initializers.zeros(), return_sequences=True, return_state=True, name="LSTM1")
lstm_output, h_state, c_state = LSTM1(emb)

LSTM2 = LSTM(units=256, activation='tanh', recurrent_activation='sigmoid', use_bias=True,
             kernel_initializer=tf.keras.initializers.glorot_uniform(seed=23),
             recurrent_initializer=tf.keras.initializers.orthogonal(seed=7),
             bias_initializer=tf.keras.initializers.zeros(), return_sequences=True, return_state=True, name="LSTM2")

lstm_output, h_state, c_state = LSTM2(lstm_output)

dropout1 = Dropout(0.5)(lstm_output)

attention_layer = tf.keras.layers.AdditiveAttention(name='Attention')
attention_output = attention_layer([dense1, dropout1], training=True)

dense_glob = tf.keras.layers.GlobalAveragePooling1D()(dense1)
att_glob = tf.keras.layers.GlobalAveragePooling1D()(attention_output)

concat = Concatenate()([dense_glob, att_glob])
dropout2 = Dropout(0.5)(concat)
FC1 = Dense(256, activation='relu', kernel_initializer=tf.keras.initializers.he_normal(seed=56), name='fc1')
fc1 = FC1(dropout2)
OUTPUT_LAYER = Dense(vocab_size, activation='softmax', name='Output_Layer')
output = OUTPUT_LAYER(fc1)

attention_model = Model(inputs=[input1, input2], outputs=output)

The model is similar to the encoder-decoder model we saw earlier, but with an attention component and some minor updates. You can try your own changes if you like; they may yield better results.

Model architecture:

Model summary:

9.1 Training

The training steps are exactly the same as for our encoder-decoder model. We use the same convert function to expand each batch into word-by-word input-output sequences and train with train_on_batch.

The attention model requires more memory and computing power than the encoder-decoder model, so you may need to reduce the batch size. Refer to the training section of the encoder-decoder model for the full procedure.

For the attention model, the Adam optimizer was used with a learning rate of 0.0001, and the model was trained for 20 epochs. Because of randomness, you might get different results.

All code is accessible from my GitHub. A link to it is provided at the end of this blog.

9.2 Inference

As before, we will separate the encoder and decoder parts from the model.

# encoder
encoder_input = attention_model.input[0]
encoder_output = attention_model.get_layer('dense_encoder').output
encoder_model = Model(encoder_input, encoder_output)


# decoder with attention mechanism
text_input = attention_model.input[1]
cnn_input = Input(shape=(49, 256))
lstm, h_s, c_s = attention_model.get_layer('LSTM2').output
att = attention_layer([cnn_input, lstm])
d_g = tf.keras.layers.GlobalAveragePooling1D()(cnn_input)
a_g = tf.keras.layers.GlobalAveragePooling1D()(att)
con = Concatenate()([d_g, a_g])
fc_1 = FC1(con)
out = OUTPUT_LAYER(fc_1)
decoder_model = Model([cnn_input, text_input], out)

This saved us some testing time.

9.3 Greedy Search

Now that we have built the model, let’s check to see if the BLEU score obtained is indeed an improvement over the previous model:

We can see that it performs better than the encoder-decoder model with greedy search, so it is definitely an improvement over the previous model.

9.4 Beam search

Now let’s look at some of the beam search scores:

The BLEU scores are lower than for the greedy algorithm, but not by much. It’s worth noting, though, that the scores actually increase as the beam width increases, so there may be some beam width at which they cross those of the greedy algorithm.

9.5 Examples

Here are some reports generated by the model using greedy search:

Original report for image 1: “Heart size and pulmonary blood vessels within normal range. No focal infiltrating pneumothorax pleural effusion was observed.”

Predicted report for image 1: “Heart size and mediastinal contour within normal range. The lungs are clean. No pneumothorax effusion. There were no acute bony findings.”

The prediction is quite similar to the original report.

Original report for image 2: “Heart size and pulmonary blood vessels appear within normal ranges. The lung is a free focal spatial lesion. No pleural effusion pneumothorax observed.”

Predicted report for image 2: “Heart size and pulmonary blood vessels appear within normal ranges. The lung is a free focal spatial lesion. No pleural effusion pneumothorax observed.”

The predicted report is exactly the same!

Original report for image 3: “Heart normal size. Mediastinum is not evident. The lungs are clean.”

Predicted report for image 3: “Heart normal size. Mediastinum is not evident. The lungs are clean.”

In this case, the model did well.

Original report for image 4: “Both lungs clear. Pneumothorax pleural effusion without focal consolidation was confirmed. The cardiopulmonary mediastinum is not well defined. There is no acute abnormality of bone structure in the chest.”

Predicted report for image 4: “Heart size and mediastinal contour within normal range. The lungs are clean. No pneumothorax effusion.”

You can see that this prediction is not really convincing.

However, beam search predicted exactly the same report for this example, even though its BLEU score computed over the entire test data is lower!

So, which one should you choose? Well, that’s up to us; you just need to pick a method that generalizes well.

Even our attention model cannot accurately predict every image. If we look at the words in the original reports, a bit of EDA reveals some complex words that don’t occur very often, and this may be one of the reasons why some predictions are not as good.

Remember, we trained this model on only 2560 data points. To learn more complex features, the model needs more data.

10. Summary

Now that we’re done with the project, let’s summarize what we did:

  • We’ve just seen how image captioning can be used in medicine. We understand the problem and we understand the need for this kind of application.

  • We saw how to use a data generator for an input pipeline.

  • Created an encoder-decoder model that gave us decent results.

  • Improved on that baseline by building an attention model.

11. Future work

As we mentioned, we don’t have a large dataset for this task. A larger dataset would yield better results.

No hyperparameter tuning was performed on either model, so better hyperparameter tuning might yield better results.

Using more advanced techniques, such as Transformers or BERT, might yield better results.

12. References

  1. www.appliedaicourse.com/
  2. Arxiv.org/abs/1502.03…
  3. www.aclweb.org/anthology/P…
  4. Arxiv.org/abs/1703.09…
  5. Arxiv.org/abs/1409.04…
  6. Machinelearningmastery.com/develop-a-d…

The entire code for this project can be accessed from my GitHub :github.com/vysakh10/Im…

The original link: towardsdatascience.com/image-capti…
