Hello, this is Miyazaki from the R&D team. This is the day-7 entry of the Unifa Advent Calendar. Thanks for reading.

Introduction

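With the progress of deep learning, image classification and object detection can now be performed with considerable accuracy. As a result, recent work has tried to extract higher-level information from images, with methods such as Image Captioning, which generates a summary of an image, and Visual Relationship Detection, which recognizes the relationships between detected objects.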

In this post I try out BAR-CNN (Box Attention Relational CNN) [1], one of the methods for Visual Relationship Detection, and share what I found.

Image Captioning is covered in the post below, so please have a look at that as well.

tech.unifa-e.com

About Visual Relationship Detection

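Visual Relationship Detection is the task of describing detected objects as a <subject - predicate - object> triplet. The subject and the object are detected objects, and the predicate is a word describing the relationship between the two. For example, given an image of a person riding a motorcycle like the one below, the goal is to output the labels subject: person, predicate: on, object: motorcycle, together with the coordinates of the objects.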

f:id:unifa_tech:20191125135407p:plain - Example of a visual relationship (source: [2])
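
As a rough illustration of what one such output could look like in code, here is a minimal sketch. The `Relationship` class, its field names, the `(x_min, y_min, x_max, y_max)` box convention, and the coordinates are all assumptions made for this example. They are not taken from the paper or from any dataset format.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed box convention for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Relationship:
    """One <subject - predicate - object> triplet with the two object boxes."""
    subject_label: str
    subject_box: Box
    predicate: str
    object_label: str
    object_box: Box

    def as_triplet(self) -> Tuple[str, str, str]:
        """Return only the three labels, without coordinates."""
        return (self.subject_label, self.predicate, self.object_label)


# Made-up coordinates, for illustration only.
example = Relationship(
    subject_label="person", subject_box=(120.0, 40.0, 260.0, 300.0),
    predicate="on",
    object_label="motorcycle", object_box=(80.0, 180.0, 330.0, 380.0))
print(example.as_triplet())
```

Keeping the boxes next to the labels matters because the same pair of labels can appear several times in one image, and the boxes are what tells those instances apart.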

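This makes it possible to recognize not only the locations and classes of objects in an image but also the relationships between them. Incidentally, the Visual Relationship Detection task has also been the subject of a Kaggle competition.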

www.kaggle.com

BAR-CNN

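This time I tested BAR-CNN (Box Attention Relational CNN) [1], one of the Visual Relationship Detection methods. Many Visual Relationship Detection methods are two-step: they first detect objects with a general object detector and then predict the relationship between the detected objects, that is, the predicate label. In contrast, BAR-CNN is characterized by doing everything from object detection to predicate-label prediction in one go.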

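The figure below shows the concept of BAR-CNN. BAR-CNN is based on RetinaNet [3], an object detection method. An ordinary object detector takes only the target image as input, whereas BAR-CNN also takes a mask image representing the location of the subject as an Attention Map. It then predicts and outputs, as the object, the location and class of things related to the masked region, together with the predicate label. In the example below, the third mask image fills in the location of the catcher. The third output is then the predicate label Hold and a box around the catcher's mitt. This yields the relationship subject: catcher, predicate: Hold, object: mitt.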

f:id:unifa_tech:20191125142752p:plain:w400 - Conceptual diagram of BAR-CNN (source: [1])
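
To make the idea of a subject mask concrete, the following is a minimal NumPy sketch that turns a bounding box into a binary map. The function name `make_attention_map`, the single-channel layout, the `(x_min, y_min, x_max, y_max)` box convention, and the use of an all-zero map for "no subject given" are assumptions for illustration. The channel layout used in the actual implementation is not shown in this post.

```python
import numpy as np


def make_attention_map(image_height, image_width, box):
    """Return a float32 (H, W, 1) mask: 1.0 inside the subject box, 0.0 elsewhere.

    `box` is (x_min, y_min, x_max, y_max) in pixels (assumed convention).
    """
    x_min, y_min, x_max, y_max = box
    # Clip the box to the image bounds and convert to integer pixel indices.
    x0 = int(np.clip(np.floor(x_min), 0, image_width))
    y0 = int(np.clip(np.floor(y_min), 0, image_height))
    x1 = int(np.clip(np.ceil(x_max), 0, image_width))
    y1 = int(np.clip(np.ceil(y_max), 0, image_height))

    attention_map = np.zeros((image_height, image_width, 1), dtype=np.float32)
    attention_map[y0:y1, x0:x1, 0] = 1.0
    return attention_map


# Assumption: an all-zero map stands for "no subject specified".
empty_map = np.zeros((8, 10, 1), dtype=np.float32)

# A 4 x 4 pixel box inside an 8 x 10 image.
subject_map = make_attention_map(8, 10, box=(2, 1, 6, 5))
print(subject_map[:, :, 0])
```

The map has the same height and width as the input image, so it can be passed alongside the image and resized to each feature-map resolution inside the network.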

Implementation

Unfortunately, no code has been released for BAR-CNN, so you have to implement it yourself. I based my implementation on the RetinaNet [4] in TensorFlow/models.

First, BAR-CNN needs the Attention Map, the mask image of the subject, as an input. In each Bottleneck Unit of ResNet50 [5], the backbone of RetinaNet, the Attention Map is added as shown below.

f:id:unifa_tech:20191125150221p:plain:w300 - Feeding in the Attention Map (source: [1])

I add an add_maps method to the Resnet class in RetinaNet and modify its bottleneck_block method.

# This code includes the work that is distributed in the Apache License 2.0.
import tensorflow as tf


class Resnet(object):
    """Class to build ResNet family model."""

    # (omitted)

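    def add_maps(self, inputs, attention_maps):
        """Add the Attention Map to a feature map."""
        # Read the spatial size and channel count (k) of the feature map.
        if self._data_format == 'channels_last':
            _, height, width, k = inputs.get_shape().as_list()
        else:
            _, k, height, width = inputs.get_shape().as_list()

        # Resize the mask to the feature-map resolution. Nearest-neighbor
        # keeps it binary. Note: tf.image.resize expects channels-last input.
        attention_maps = tf.image.resize(
            attention_maps, size=[height, width],
            method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)

        # Project the mask to k channels. The kernel is zero-initialized, so
        # the projection outputs zeros until training updates it.
        attention_maps = tf.keras.layers.Conv2D(
            filters=k,
            kernel_size=3,
            strides=1,
            padding='SAME',
            use_bias=False,
            kernel_initializer=tf.initializers.Zeros(),
            data_format=self._data_format)(
            inputs=attention_maps)

        return inputs + attention_maps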

    def bottleneck_block(self,
                         inputs,
                         attention_maps,  # added
                         filters,
                         strides,
                         use_projection=False,
                         is_training=None):
        """Original Bottleneck Unit, extended to add the Attention Map."""
        shortcut = inputs
        if use_projection:
            # Projection shortcut only in first block within a group. Bottleneck
            # blocks end with 4 times the number of filters.
            filters_out = 4 * filters
            shortcut = self.conv2d_fixed_padding(
                inputs=inputs, filters=filters_out, kernel_size=1,
                strides=strides)
            shortcut = self._batch_norm_relu(relu=False)(
                shortcut, is_training=is_training)
        shortcut = self.dropblock(shortcut, is_training=is_training)

        inputs = self.conv2d_fixed_padding(
            inputs=inputs, filters=filters, kernel_size=1, strides=1)
        inputs = self._batch_norm_relu()(inputs, is_training=is_training)
        inputs = self.dropblock(inputs, is_training=is_training)

        inputs = self.conv2d_fixed_padding(
            inputs=inputs, filters=filters, kernel_size=3, strides=strides)
        inputs = self._batch_norm_relu()(inputs, is_training=is_training)
        inputs = self.dropblock(inputs, is_training=is_training)

        # Added: inject the Attention Map after the 3x3 convolution.
        inputs = self.add_maps(inputs, attention_maps)

        inputs = self.conv2d_fixed_padding(
            inputs=inputs, filters=4 * filters, kernel_size=1, strides=1)
        inputs = self._batch_norm_relu(
            relu=False, init_zero=True)(
            inputs, is_training=is_training)
        inputs = self.dropblock(inputs, is_training=is_training)

        return tf.nn.relu(inputs + shortcut)

    # (omitted)
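
One consequence of `kernel_initializer=tf.initializers.Zeros()` in add_maps is that the projected Attention Map is all zeros at initialization, so right after the weights are loaded the block computes the same values as the original Bottleneck Unit. Presumably this helps when starting from pre-trained backbone weights, although that motivation is my reading rather than something stated in the code. The NumPy sketch below only illustrates the arithmetic. `conv2d_same` is a hypothetical, deliberately naive helper written for this example, not part of TensorFlow or of the implementation above.

```python
import numpy as np


def conv2d_same(x, kernel):
    """Naive stride-1 'SAME' convolution without bias.

    x: (H, W, C_in), kernel: (kh, kw, C_in, C_out) with odd kh and kw.
    """
    kh, kw, _, c_out = kernel.shape
    h, w, _ = x.shape
    # Zero-pad so that the output keeps the spatial size of the input.
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2), (0, 0)))
    out = np.zeros((h, w, c_out), dtype=x.dtype)
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + kh, j:j + kw, :]            # (kh, kw, C_in)
            out[i, j] = np.tensordot(patch, kernel, axes=3)  # (C_out,)
    return out


rng = np.random.default_rng(0)
features = rng.normal(size=(4, 4, 8)).astype(np.float32)  # stand-in feature map
attention = np.zeros((4, 4, 1), dtype=np.float32)
attention[1:3, 1:3, 0] = 1.0                               # subject region

# Zero-initialized kernel: the addition leaves the features untouched.
zero_kernel = np.zeros((3, 3, 1, 8), dtype=np.float32)
fused = features + conv2d_same(attention, zero_kernel)
print(np.array_equal(fused, features))

# Once the kernel is non-zero, features near the subject region change.
nonzero_kernel = np.full((3, 3, 1, 8), 0.1, dtype=np.float32)
fused_after = features + conv2d_same(attention, nonzero_kernel)
print(np.array_equal(fused_after, features))
```

The first comparison holds because a convolution with an all-zero kernel and no bias outputs zeros everywhere. The second shows that the mask starts to influence the features as soon as the kernel moves away from zero.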

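Next, because a standard RetinaNet outputs only the class and location of the object, I add a new subnet to predict the predicate label. The RetinanetHead class is changed to add a subnet for predicate labels in the same way as the existing subnet for object classes.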

# This code includes the work that is distributed in the Apache License 2.0.
class RetinanetHead(object):
    # (omitted)

    def __call__(self, fpn_features, is_training=None):
        """Adds predicate-label inference (predicate_class_outputs)."""
        # Added: per-level predicate-label outputs.
        predicate_class_outputs = {}

        class_outputs = {}
        box_outputs = {}

        with backend.get_graph().as_default(), tf.name_scope('retinanet'):
            for level in range(self._min_level, self._max_level + 1):
                features = fpn_features[level]

                # Added
                predicate_class_outputs[level] = self.predicate_class_net(
                    features, level, is_training=is_training)

                class_outputs[level] = self.class_net(
                    features, level, is_training=is_training)
                box_outputs[level] = self.box_net(
                    features, level, is_training=is_training)

        # Changed: also return the predicate-label outputs.
        return predicate_class_outputs, class_outputs, box_outputs

    def predicate_class_net(self, features, level, is_training):
        """Subnet for predicate labels, built like the class_net method."""
        # (omitted)

    # (omitted)

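I have skipped over quite a lot, but this gives us a BAR-CNN that takes the subject's Attention Map as input and outputs the predicate label along with the class and location of the object.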

Experiment

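Now I train the implemented BAR-CNN. I first pre-trained RetinaNet on the COCO dataset [6], then loaded its backbone weights into BAR-CNN and trained on the Visual Relationships dataset [7]. The Visual Relationships dataset has 4,000 training images, 1,000 test images, 70 predicate categories, and 100 object categories.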

First, I feed in the baseball image below without an Attention Map.

f:id:unifa_tech:20191125152049j:plain:w400 - Input image (source: [7])

Three people were then detected as subject candidates.

f:id:unifa_tech:20191125152338j:plain:w400 - Output (subjects)

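Next, I generate an Attention Map from the detected region of the umpire and feed it in as the subject mask image together with the same input image. Blue is the background and pink is the attention region.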

f:id:unifa_tech:20191125152556j:plain:w400 - Attention Map of the umpire region

It is a little hard to see, but as predicates and objects the model detected that the umpire is wearing a shirt (wear - shirt) and is holding the catcher in front (hold - person)!

f:id:unifa_tech:20191125152703j:plain:w400 - Output (predicates and objects)
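
The two passes used in this experiment can be sketched as a small loop: one pass without an Attention Map to get subject candidates, then one pass per subject to get its predicates and objects. Everything here is hypothetical scaffolding for illustration. `detect_relationships`, the `model(image, subject_box)` call signature, the dictionary keys, and `fake_model` with its hard-coded outputs are assumptions, not the interface of the implementation above.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]
Detection = Dict[str, Any]  # assumed keys: "label", "box", "predicate"


def detect_relationships(
    model: Callable[[Any, Optional[Box]], List[Detection]],
    image: Any,
) -> List[Tuple[str, str, str]]:
    """Run the model once without a subject, then once per detected subject."""
    triplets = []
    # Pass 1: no Attention Map, so the detections are subject candidates.
    subjects = model(image, None)
    for subject in subjects:
        # Pass 2: condition on this subject's box to get predicates and objects.
        for obj in model(image, subject["box"]):
            triplets.append((subject["label"], obj["predicate"], obj["label"]))
    return triplets


def fake_model(image, subject_box):
    """Stand-in for a trained network; returns hard-coded detections."""
    if subject_box is None:
        return [{"label": "person", "box": (10.0, 20.0, 50.0, 120.0),
                 "predicate": None}]
    return [
        {"label": "shirt", "box": (15.0, 40.0, 45.0, 80.0), "predicate": "wear"},
        {"label": "person", "box": (60.0, 30.0, 100.0, 120.0), "predicate": "hold"},
    ]


print(detect_relationships(fake_model, image=None))
```

With this structure the number of forward passes is one plus the number of subject candidates, so filtering low-confidence subjects before the second pass would directly reduce inference cost.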

Conclusion

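The dataset used for training here contains only generic relationships such as wear - shirt, which is why the results look like this. If the model could learn relationships such as playing with building blocks or reading a picture book, I hope it could also be used in childcare settings. We will keep working on recognizing more from photos and on connecting that to supporting children's growth and development.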