Building an Image Caption Generator with CNN and LSTM

Date: 2020-04-02 17:46:10


With thanks to the original article: http://bjbsair.com/2020-04-01/tech-info/18508.html

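When you look at an image, your brain can easily tell what it shows, but can a computer do the same? Computer vision researchers have worked on this problem for a long time, and until recently it was considered out of reach. With advances in deep learning, the availability of large datasets, and more computing power, we can now build models that generate captions for images.

That is what we will implement in this project, using two deep learning techniques together: a convolutional neural network (CNN) and a type of recurrent neural network, the LSTM.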

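What is an image caption generator?

Image caption generation is a task that draws on both computer vision and natural language processing: the model has to recognize the content of an image and describe it in natural language.

The goal of this project is to learn the concepts behind CNN and LSTM models and to build a working image caption generator by combining a CNN with an LSTM.

In this project we implement the caption generator with a CNN (convolutional neural network) and an LSTM (long short-term memory). Image features are extracted with Xception, a CNN model trained on the ImageNet dataset. The features are then fed into the LSTM model, which is responsible for generating the caption.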

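The dataset

For the image caption generator we will use the Flickr_8K dataset. Larger datasets such as Flickr_30K and MSCOCO exist, but training a network on them can take weeks, so we use the small Flickr8k dataset. The advantage of a larger dataset is that it lets us build a better model.

Prerequisites

We will need the following libraries: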

tensorflow
keras
pillow
numpy
tqdm
jupyterlab

1. First, import all the required libraries

import os
import string
from pickle import dump, load

import numpy as np
from PIL import Image
from keras.applications.xception import Xception, preprocess_input
from keras.preprocessing.image import load_img, img_to_array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import Input, Dense, LSTM, Embedding, Dropout, add
from keras.models import Model, load_model
# small library for showing the progress of loops
from tqdm import tqdm


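2. Load the data and perform data cleaning

Our file holds image names and captions, one pair per line, with lines separated by newlines ("\n").

Each image has 5 captions, and each caption is tagged with a number from #0 to #4.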
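
As a rough illustration of that layout, the sketch below parses two lines of the form image#n, a tab, then the caption. The image name and captions are made-up sample data, and parse_caption_line is a hypothetical helper, not part of the project code.

```python
# Two lines in the assumed "<image>#<n><TAB><caption>" layout (illustrative data).
sample = (
    "example_image.jpg#0\tA child in a pink dress is climbing stairs .\n"
    "example_image.jpg#1\tA girl going into a wooden building .\n"
)

def parse_caption_line(line):
    """Split one line into (image name, caption index, caption text)."""
    image_tag, caption = line.split("\t", 1)
    image_name, index = image_tag.rsplit("#", 1)
    return image_name, int(index), caption

parsed = [parse_caption_line(line) for line in sample.splitlines()]
for image_name, index, caption in parsed:
    print(image_name, index, caption)
```

Splitting on the last "#" rather than slicing off two characters keeps the image name intact even if an index ever had more than one digit.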

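We will define 5 functions:

load_doc(filename) – loads the document file and reads its contents into a string.
all_img_captions(filename) – builds a descriptions dictionary that maps each image to a list of its 5 captions.
cleaning_text(descriptions) – takes all the descriptions and cleans them. This is an important step when working with text data, and the kind of cleaning depends on the goal. In our case we remove punctuation, convert all text to lowercase, and drop words that contain numbers.
text_vocabulary(descriptions) – a simple function that collects all unique words across the descriptions to build the vocabulary.
save_descriptions(descriptions, filename) – builds a list of all the preprocessed descriptions and stores them in a file. We will create a descriptions.txt file to hold all the captions.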

# Load a text file into memory
def load_doc(filename):
    # open the file read-only and return its contents as one string
    with open(filename, 'r') as file:
        text = file.read()
    return text

# Get all images with their captions
def all_img_captions(filename):
    file = load_doc(filename)
    captions = file.split('\n')
    descriptions = {}
    for caption in captions[:-1]:
        img, caption = caption.split('\t')
        # img looks like "name.jpg#0"; img[:-2] strips the "#n" suffix
        if img[:-2] not in descriptions:
            descriptions[img[:-2]] = [caption]
        else:
            descriptions[img[:-2]].append(caption)
    return descriptions

# Data cleaning: lowercasing, removing punctuation and words containing numbers
def cleaning_text(captions):
    table = str.maketrans('', '', string.punctuation)
    for img, caps in captions.items():
        for i, img_caption in enumerate(caps):
            # str.replace returns a new string, so the result must be assigned
            img_caption = img_caption.replace("-", " ")
            desc = img_caption.split()
            # convert to lowercase
            desc = [word.lower() for word in desc]
            # remove punctuation from each token
            desc = [word.translate(table) for word in desc]
            # remove hanging 's and a
            desc = [word for word in desc if len(word) > 1]
            # remove tokens with numbers in them
            desc = [word for word in desc if word.isalpha()]
            # convert back to string
            img_caption = ' '.join(desc)
            captions[img][i] = img_caption
    return captions

def text_vocabulary(descriptions):
    # build vocabulary of all unique words
    vocab = set()
    for key in descriptions.keys():
        for d in descriptions[key]:
            vocab.update(d.split())
    return vocab

# Write all descriptions to one file
def save_descriptions(descriptions, filename):
    lines = list()
    for key, desc_list in descriptions.items():
        for desc in desc_list:
            lines.append(key + '\t' + desc)
    data = "\n".join(lines)
    with open(filename, "w") as file:
        file.write(data)

# Set these paths according to the project folder on your system
# (raw strings keep the Windows backslashes literal)
dataset_text = r"D:\dataflair projects\Project - Image Caption Generator\Flickr_8k_text"
dataset_images = r"D:\dataflair projects\Project - Image Caption Generator\Flicker8k_Dataset"

# prepare our text data
filename = dataset_text + "/" + "Flickr8k.token.txt"
# load the file that contains all data and
# map it into a descriptions dictionary: image -> 5 captions
descriptions = all_img_captions(filename)
print("Length of descriptions =", len(descriptions))
# clean the descriptions
clean_descriptions = cleaning_text(descriptions)
# build the vocabulary
vocabulary = text_vocabulary(clean_descriptions)
print("Length of vocabulary = ", len(vocabulary))
# save each description to file
save_descriptions(clean_descriptions, "descriptions.txt")
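
To see what the cleaning step does to one caption, here is a minimal standalone sketch that applies the same rules (hyphens to spaces, lowercase, strip punctuation, drop one-character and non-alphabetic tokens) to a single string. The helper name clean_caption and the sample sentence are made up for this illustration.

```python
import string

def clean_caption(caption):
    """Apply the cleaning rules described above to a single caption string."""
    table = str.maketrans("", "", string.punctuation)
    words = caption.replace("-", " ").split()
    words = [w.lower().translate(table) for w in words]
    # drop one-character leftovers (such as "a") and tokens that contain digits
    words = [w for w in words if len(w) > 1 and w.isalpha()]
    return " ".join(words)

cleaned = clean_caption("A black-and-white dog's 2 puppies play.")
print(cleaned)  # black and white dogs puppies play
```

Note that the apostrophe is removed as punctuation, so "dog's" becomes "dogs" rather than being dropped.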


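3. Extract a feature vector from every image

This technique is known as transfer learning: instead of training everything ourselves, we take a model that has already been trained on a large dataset and use the features it extracts for our own task. We use the Xception model, which was trained on the ImageNet dataset with its 1000 classes and can be imported directly from keras.applications. Because Xception was originally built for ImageNet classification, we make a few small changes to integrate it into our model. One thing to note is that Xception takes images of size 299 x 299 x 3 as input. We remove the final classification layer and obtain a 2048-dimensional feature vector.

model = Xception(include_top=False, pooling='avg')

The function extract_features() extracts the features of all images and maps each image name to its feature array. We then dump the features dictionary into the "features.p" pickle file.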

def extract_features(directory):
    model = Xception(include_top=False, pooling='avg')
    features = {}
    for img in tqdm(os.listdir(directory)):
        filename = directory + "/" + img
        image = Image.open(filename)
        image = image.resize((299, 299))
        # add a batch dimension: (299, 299, 3) -> (1, 299, 299, 3)
        image = np.expand_dims(image, axis=0)
        # scale pixels to [-1, 1]; equivalent to preprocess_input(image) for Xception
        image = image / 127.5
        image = image - 1.0
        feature = model.predict(image)
        features[img] = feature
    return features

# 2048-dimensional feature vector for every image
features = extract_features(dataset_images)
with open("features.p", "wb") as f:
    dump(features, f)
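
The two arithmetic lines in extract_features() rescale pixel values from [0, 255] to [-1, 1]. A tiny sketch of that mapping on plain numbers (scale_pixel is a hypothetical helper used only here):

```python
def scale_pixel(value):
    """Map a pixel value in [0, 255] to the range [-1.0, 1.0]."""
    return value / 127.5 - 1.0

scaled = [scale_pixel(v) for v in (0, 127.5, 255)]
print(scaled)  # [-1.0, 0.0, 1.0]
```

Black maps to -1, mid-gray to 0, and white to 1. The same scaling has to be applied again at test time so that the features stay comparable.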


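Depending on your system, this process can take a long time. Once the features have been saved, they can be loaded back from the pickle file: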

features = load(open("features.p","rb"))
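
For readers new to pickle, the round trip looks roughly like this. It is a sketch with a made-up three-number stand-in for a real 2048-dimensional vector, written to a temporary directory so it does not touch the project's features.p.

```python
import os
import pickle
import tempfile

# made-up stand-in for {image name: feature vector}
features_demo = {"example_image.jpg": [0.12, 0.5, 0.0]}

path = os.path.join(tempfile.mkdtemp(), "features.p")
with open(path, "wb") as f:
    pickle.dump(features_demo, f)   # serialize the dictionary to disk
with open(path, "rb") as f:
    restored = pickle.load(f)       # read it back unchanged

print(restored == features_demo)
```

Using with blocks closes the file handles promptly, which matters when the pickle is large.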

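4. Load the dataset for training the model

In the Flickr_8k_text folder we have the Flickr_8k.trainImages.txt file, which lists the names of the 6000 images used for training.

To load the training dataset we need a few more functions:

load_photos(filename) – loads the text file as a string and returns the list of image names.
load_clean_descriptions(filename, photos) – builds a dictionary holding the captions for each photo in the photos list. We also wrap every caption in <start> and <end> markers, so that the LSTM model can recognize where a caption begins and ends.
load_features(photos) – returns a dictionary mapping image names to the feature vectors we extracted earlier with the Xception model.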

# load the data
def load_photos(filename):
    file = load_doc(filename)
    photos = file.split("\n")[:-1]
    return photos

def load_clean_descriptions(filename, photos):
    # load the cleaned descriptions
    file = load_doc(filename)
    photos = set(photos)  # set membership tests are faster than list lookups
    descriptions = {}
    for line in file.split("\n"):
        words = line.split()
        if len(words) < 1:
            continue
        image, image_caption = words[0], words[1:]
        if image in photos:
            if image not in descriptions:
                descriptions[image] = []
            # wrap the caption in start and end markers
            desc = '<start> ' + " ".join(image_caption) + ' <end>'
            descriptions[image].append(desc)
    return descriptions

def load_features(photos):
    # load all features
    with open("features.p", "rb") as f:
        all_features = load(f)
    # select only the needed features
    features = {k: all_features[k] for k in photos}
    return features

filename = dataset_text + "/" + "Flickr_8k.trainImages.txt"
train_imgs = load_photos(filename)
train_descriptions = load_clean_descriptions("descriptions.txt", train_imgs)
train_features = load_features(train_imgs)
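
One detail worth knowing about the <start> and <end> markers: the Keras Tokenizer, with its default filters, strips characters such as < and > before splitting, so the markers end up in the vocabulary as start and end. That is presumably why the test script later seeds generation with 'start' and stops on 'end'. The sketch below imitates that filtering in plain Python. The filter string is the Keras default as far as I know, and to_word_sequence is a hypothetical stand-in, not the Keras function itself.

```python
# Default `filters` value of the Keras Tokenizer (an assumption taken from the Keras docs).
KERAS_DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def to_word_sequence(text, filters=KERAS_DEFAULT_FILTERS):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in filters})
    return text.lower().translate(table).split()

tokens = to_word_sequence("<start> two girls are playing in the grass <end>")
print(tokens)
```

The angle brackets disappear while the words themselves survive, so the markers still delimit each caption.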


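5. Tokenize the vocabulary

We map each word in the vocabulary to a unique index value. The Keras library provides the Tokenizer class, which we use to create tokens from our vocabulary and save them to the "tokenizer.p" pickle file.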

# Helper definitions below are reconstructed from how the rest of the code
# uses dict_to_list, tokenizer and vocab_size.

# convert the descriptions dictionary into a flat list of captions
def dict_to_list(descriptions):
    all_desc = []
    for key in descriptions.keys():
        all_desc.extend(descriptions[key])
    return all_desc

# fit a Keras Tokenizer on the captions; every word gets an integer index
def create_tokenizer(descriptions):
    desc_list = dict_to_list(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(desc_list)
    return tokenizer

# give each word an index and store the tokenizer in the tokenizer.p pickle file
tokenizer = create_tokenizer(train_descriptions)
with open('tokenizer.p', 'wb') as f:
    dump(tokenizer, f)
vocab_size = len(tokenizer.word_index) + 1
vocab_size

Our vocabulary contains 7577 words.

We also compute the maximum length of the descriptions, which is needed to fix the model's structural parameters. The maximum description length is 32.

# calculate the maximum length of the descriptions
# (the function is named calc_max_length so it is not shadowed by the max_length variable)
def calc_max_length(descriptions):
    desc_list = dict_to_list(descriptions)
    return max(len(d.split()) for d in desc_list)

max_length = calc_max_length(descriptions)
max_length
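
The word index and the maximum length described in this step can be illustrated on toy data without Keras. This is only a sketch: the captions are invented, and ties between equally frequent words are broken alphabetically here, which may differ from what the Keras Tokenizer does.

```python
from collections import Counter

toy_captions = [
    "start dog runs on grass end",
    "start two dogs play end",
]

# Count word frequencies, then give the most frequent words the smallest indices.
counts = Counter(word for caption in toy_captions for word in caption.split())
ordered = sorted(counts, key=lambda w: (-counts[w], w))
word_index = {word: i for i, word in enumerate(ordered, start=1)}  # index 0 is kept for padding

toy_vocab_size = len(word_index) + 1  # +1 accounts for the padding index 0
longest = max(len(caption.split()) for caption in toy_captions)
print(word_index, toy_vocab_size, longest)
```

The +1 on the vocabulary size mirrors the real code: index 0 never denotes a word, but the output layer still needs a slot for it.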

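6. Create the data generator

First, let us look at what the model's input and output look like. To turn this into a supervised learning task, we have to give the model inputs and outputs to train on. We train on 6000 images, each represented by a feature vector of length 2048, with the captions also represented as numbers. The data for these 6000 images cannot all be held in memory at once, so we use a generator that produces it in batches.

The generator yields the input and output sequences.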

# create input-output sequence pairs from the image descriptions
# data generator, used by model.fit_generator()
def data_generator(descriptions, features, tokenizer, max_length):
    while True:
        for key, description_list in descriptions.items():
            # retrieve photo features; predict() returned shape (1, 2048), so take row 0
            feature = features[key][0]
            input_image, input_sequence, output_word = create_sequences(tokenizer, max_length, description_list, feature)
            yield [[input_image, input_sequence], output_word]

def create_sequences(tokenizer, max_length, desc_list, feature):
    X1, X2, y = list(), list(), list()
    # walk through each description for the image
    for desc in desc_list:
        # encode the sequence
        seq = tokenizer.texts_to_sequences([desc])[0]
        # split one sequence into multiple X, y pairs
        for i in range(1, len(seq)):
            # split into input and output pair
            in_seq, out_seq = seq[:i], seq[i]
            # pad input sequence
            in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
            # one-hot encode the output word (vocab_size is a module-level variable)
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
            # store
            X1.append(feature)
            X2.append(in_seq)
            y.append(out_seq)
    return np.array(X1), np.array(X2), np.array(y)

# You can check the shape of the input and output for your model
[a, b], c = next(data_generator(train_descriptions, features, tokenizer, max_length))
a.shape, b.shape, c.shape
# ((47, 2048), (47, 32), (47, 7577))
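
To make the splitting in create_sequences() concrete, the sketch below expands one encoded caption into (padded prefix, next word) pairs using plain lists. The word ids are hypothetical, and pad_left imitates the default left-padding of pad_sequences. On this reading, the 47 rows in the shapes above would be the total number of such pairs across one image's five captions.

```python
def pad_left(seq, maxlen):
    """Left-pad with zeros, as pad_sequences does by default (padding='pre')."""
    return [0] * (maxlen - len(seq)) + seq

def expand_caption(seq, maxlen):
    """Turn one encoded caption into (padded prefix, next word) training pairs."""
    return [(pad_left(seq[:i], maxlen), seq[i]) for i in range(1, len(seq))]

encoded = [2, 7, 5, 1]  # hypothetical ids for "start dog runs end"
pairs = expand_caption(encoded, maxlen=6)
for prefix, target in pairs:
    print(prefix, "->", target)
```

A caption of n tokens yields n - 1 pairs, so longer captions contribute more training rows for the same image feature.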


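7. Define the CNN-RNN model

To define the structure of the model we use the Keras Model class from the functional API. It consists of three main parts:

Feature Extractor – the features extracted from the image have size 2048; a dense layer reduces them to 256 nodes.
Sequence Processor – an embedding layer handles the text input, followed by an LSTM layer.
Decoder – the outputs of the two parts above are merged and passed through a dense layer to make the final prediction. The final layer has as many nodes as our vocabulary size.

A visual representation of the final model is shown below:

[Figure: architecture of the final captioning model]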

from keras.utils import plot_model

# define the captioning model
def define_model(vocab_size, max_length):
    # features from the CNN model squeezed from 2048 to 256 nodes
    inputs1 = Input(shape=(2048,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # LSTM sequence model
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # merge both branches
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    # tie it together: [image, seq] -> [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    # summarize model (summary() prints by itself)
    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)
    return model
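
As a sanity check on the architecture, the trainable parameter count can be worked out by hand. The sketch below assumes the sizes used in this post (2048-dimensional features, 256 units, a vocabulary of 7577) and the standard parameter formulas for dense, embedding, and LSTM layers. Dropout and add layers contribute no parameters.

```python
feature_dim, embed_dim, units, vocab = 2048, 256, 256, 7577

dense_fe = feature_dim * units + units          # Dense(256) on the 2048-d image feature
embedding = vocab * embed_dim                   # Embedding(vocab_size, 256)
lstm = 4 * (units * (embed_dim + units) + units)  # 4 gates: input weights, recurrent weights, bias
dense_dec = units * units + units               # Dense(256) after the merge
dense_out = units * vocab + vocab               # softmax layer over the vocabulary

total = dense_fe + embedding + lstm + dense_dec + dense_out
print(total)  # 5002649
```

Most of the parameters sit in the embedding and the output layer, both of which scale with the vocabulary size.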


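8. Train the model

To train the model we use the 6000 training images, generating the input and output sequences in batches and fitting them to the model with the model.fit_generator() method. (In newer Keras versions fit_generator() is deprecated and model.fit() accepts a generator directly.) We also save the model to our models folder.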

# train our model
print('Dataset: ', len(train_imgs))
print('Descriptions: train=', len(train_descriptions))
print('Photos: train=', len(train_features))
print('Vocabulary Size:', vocab_size)
print('Description Length: ', max_length)
model = define_model(vocab_size, max_length)
epochs = 10
# one step per training image: the generator yields one image's sequences at a time
steps = len(train_descriptions)
# create a "models" directory for the saved models (no error if it already exists)
os.makedirs("models", exist_ok=True)
for i in range(epochs):
    generator = data_generator(train_descriptions, train_features, tokenizer, max_length)
    model.fit_generator(generator, epochs=1, steps_per_epoch=steps, verbose=1)
    model.save("models/model_" + str(i) + ".h5")


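9. Test the model

The model has now been trained, so we write a separate file, testing_caption_generator.py, which loads the model and generates predictions. The predictions are sequences of index values, up to the maximum length, so we use the same tokenizer.p pickle file to map the index values back to words.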

import sys
import argparse
from pickle import load

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from keras.applications.xception import Xception
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

ap = argparse.ArgumentParser()
ap.add_argument('-i', '--image', required=True, help="Image Path")
args = vars(ap.parse_args())
img_path = args['image']

def extract_features(filename, model):
    try:
        image = Image.open(filename)
    except OSError:
        print("ERROR: Couldn't open image! Make sure the image path and extension are correct")
        sys.exit(1)
    image = image.resize((299, 299))
    image = np.array(image)
    # for images that have 4 channels, keep only the first 3
    if image.shape[2] == 4:
        image = image[..., :3]
    image = np.expand_dims(image, axis=0)
    # scale pixels to [-1, 1], as during feature extraction for training
    image = image / 127.5
    image = image - 1.0
    feature = model.predict(image)
    return feature

def word_for_id(integer, tokenizer):
    # reverse lookup: find the word that belongs to an index
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

def generate_desc(model, tokenizer, photo, max_length):
    in_text = 'start'
    for i in range(max_length):
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        sequence = pad_sequences([sequence], maxlen=max_length)
        pred = model.predict([photo, sequence], verbose=0)
        # greedy choice: take the most probable next word
        pred = np.argmax(pred)
        word = word_for_id(pred, tokenizer)
        if word is None:
            break
        in_text += ' ' + word
        if word == 'end':
            break
    return in_text

#path = 'Flicker8k_Dataset/111537222_07e56d5a30.jpg'
max_length = 32
with open("tokenizer.p", "rb") as f:
    tokenizer = load(f)
model = load_model('models/model_9.h5')
xception_model = Xception(include_top=False, pooling="avg")
photo = extract_features(img_path, xception_model)
img = Image.open(img_path)
description = generate_desc(model, tokenizer, photo, max_length)
print("\n\n")
print(description)
plt.imshow(img)
plt.show()


two girls are playing in the grass
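
The loop in generate_desc() is a greedy decoder: at each step it feeds the words so far to the model and appends the single most probable next word until it sees end. The sketch below reproduces that control flow with a toy stand-in for the model so it runs without Keras. The vocabulary, the NEXT table, and toy_predict are all invented for this illustration.

```python
# Toy vocabulary; the ids are arbitrary and only serve this illustration.
word_index = {"start": 1, "two": 2, "girls": 3, "play": 4, "end": 5}
index_word = {i: w for w, i in word_index.items()}

# Stand-in for model.predict: the next word depends only on the last word id.
NEXT = {1: 2, 2: 3, 3: 4, 4: 5}

def toy_predict(sequence):
    """Return a score per vocabulary index, with all mass on one next word."""
    scores = [0.0] * (len(word_index) + 1)
    scores[NEXT[sequence[-1]]] = 1.0
    return scores

def greedy_caption(max_length):
    words = ["start"]
    for _ in range(max_length):
        sequence = [word_index[w] for w in words]
        scores = toy_predict(sequence)
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax
        word = index_word.get(best)
        if word is None:
            break
        words.append(word)
        if word == "end":
            break
    # drop the start/end markers for display
    return " ".join(w for w in words if w not in ("start", "end"))

caption = greedy_caption(max_length=32)
print(caption)  # two girls play
```

Building the inverse index_word dictionary once avoids the linear scan that word_for_id() performs on every step, and stripping the markers at the end gives a caption ready for display.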

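Conclusion

In this project we implemented a CNN-RNN model by building an image caption generator. A few key points: the model depends on its data, so it cannot predict words outside its vocabulary. We used a small dataset of 8000 images. A production-level model would need to be trained on a dataset of more than 100,000 images to reach better accuracy.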


Original article: https://blog.51cto.com/14744108/2484182
