This is Part II of the MAG (Multi-Model Attribute Generator) paper I am working on. You can see Part I here: Multi-Model Attribute Generator - ユニファ開発者ブログ
This post will focus on determining what clothing the lower half of a person is wearing. It will not look at color for now, as that will follow in the next few posts. This model will allow for the classification of three different clothing types: skirts/dresses, shorts, and pants.
So, from my previous post, I was only getting around 56% accuracy, which is OK but not good enough. I altered my code to use a fine-tuned Xception model trained on the Market-1501 dataset and used it as the base feature extractor, which gave very good results in Keras. My experiments showed that ResNet50 did not produce results as good as the Xception pre-trained model on this dataset. The data augmentation consists of rotation, cropping, vertical and horizontal shifts, and horizontal flipping. I also added a preprocessing step to the data augmentation that runs each image through the Xception preprocessing function, which greatly helps the accuracy of the model. It also keeps the handling of input consistent between this model and the base model, since the Xception preprocessing was performed there as well, and limits errors that could occur from inconsistent preprocessing.
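As a rough sketch of what that looks like (the augmentation ranges below are illustrative placeholders, not the exact values used for the final model), the Xception preprocessing can be passed to the Keras generator as the preprocessing_function:

from keras.preprocessing.image import ImageDataGenerator
from keras.applications.xception import preprocess_input

# Augmentations plus the same Xception preprocessing used when the base model was trained.
# The numeric ranges here are placeholders for illustration.
train_datagen = ImageDataGenerator(rotation_range=15,
                                   width_shift_range=0.1,
                                   height_shift_range=0.1,
                                   horizontal_flip=True,
                                   preprocessing_function=preprocess_input)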
Clothing Makes the Man:
The first step is to separate the images into their classes for Keras to use in the data generator. This will consist of three classes: skirts/dresses, shorts, and pants.
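A minimal sketch of that separation, assuming a lookup from image filename to label already exists (the get_label helper and directory names below are hypothetical):

import os
import shutil

CLASSES = ['skirts_dresses', 'shorts', 'pants']
SRC_DIR = 'market1501_images'   # hypothetical source directory
DST_DIR = 'lower_clothing'      # one sub-directory per class for flow_from_directory

for cls in CLASSES:
    os.makedirs(os.path.join(DST_DIR, cls), exist_ok=True)

for fname in os.listdir(SRC_DIR):
    label = get_label(fname)  # hypothetical lookup into the attribute annotations
    shutil.copy(os.path.join(SRC_DIR, fname), os.path.join(DST_DIR, label, fname))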
The next step is to create the base feature extractor by importing the pre-trained model and creating a new base model with the Keras Functional API.
from keras.models import Model, load_model

# Load the fine-tuned Xception model and take the global average pooling output as features
base_model = load_model('pre_trained_model.ckpt')
base_extractor = Model(inputs=base_model.input, outputs=base_model.get_layer('glb_avg_pool').output)
for layer in base_extractor.layers:
    layer.trainable = True
Note that the output should be the last layer before the softmax fully-connected layer. This gives you a feature vector rather than a prediction.
This primes the base model to be used as the feature extractor. Remember to set each layer to trainable so the base model can be retrained for the specific task. The next step is to actually build out the new model for classification.
from keras.layers import Input, Dense, Dropout, BatchNormalization

def build_model():
    img = Input(shape=(224, 224, 3))
    # Extract features from the image with the fine-tuned base model
    x = base_extractor(img)
    x = Dense(2048, activation='relu')(x)
    x = Dropout(0.2)(x)
    x = Dense(2048, activation='relu')(x)
    x = Dropout(0.2)(x)
    x = BatchNormalization()(x)
    # One sigmoid output per clothing class (multi-label style)
    pred = Dense(3, activation='sigmoid')(x)
    return Model(inputs=img, outputs=pred)
Since we are only looking at the lower portion of the body, we will need to crop the image using a custom image generator.
import numpy as np

# Custom cropping method for preprocessing
def crop_lower_part(img):
    # Xception preprocessing should be applied in the datagen, not here in the crop function
    y, x, _ = img.shape
    startx = 0
    starty = y // 2
    return img[starty:y, startx:startx + x]

# Generator wrapper that crops each batch to the lower half of the body
def crop_generator(batches):
    while True:
        batch_x, batch_y = next(batches)
        batch_crops = np.zeros((batch_x.shape[0], 112, 224, 3))
        for i in range(batch_x.shape[0]):
            batch_crops[i] = crop_lower_part(batch_x[i])
        yield (batch_crops, batch_y)
This basically cuts each image in half (keeping the lower portion only), creates the new images, and sends the batch to the model when needed.
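As a usage sketch (the directory name and generator settings are assumptions for illustration), the cropping wrapper sits on top of the standard Keras directory generator:

# Hypothetical wiring of the crop wrapper onto a directory generator
base_batches = train_datagen.flow_from_directory('lower_clothing',
                                                 target_size=(224, 224),
                                                 batch_size=32,
                                                 class_mode='categorical')
train_batches = crop_generator(base_batches)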
Binary over Categorical loss:
We will be using binary cross-entropy rather than the categorical version for multi-label classification. This can be confusing, as almost every other model out there uses the categorical version. However, this works by treating each output label as an independent Bernoulli distribution (Hazewinkel, 2001), which gives greater accuracy than the traditional approach. Each output node is penalized individually for a wrong answer, which in return should give better, more accurate results overall.
While categorical_crossentropy reached a decent 75%, switching to binary_crossentropy increased the accuracy by 5%.
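A minimal sketch of that switch (the optimizer and metrics here are assumptions, not necessarily what the original training used):

model = build_model()
# Binary cross-entropy treats each of the three outputs as its own Bernoulli label
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])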
We are still overfitting after the fifth epoch, but this might be managed by cleaning the dataset, adding more samples, and making a clearer definition between shorts and pants, as some samples of shorts look very close to pants. For example, some men's shorts are very long and some women's pants sit higher, which will look the same to the model. So this might be a battle between high-water pants and long shorts.
Keras score and accuracy for the model look pretty good.
So for target T and network output O, the binary_crossentropy is:
f(T, O) = -(T*log(O) + (1-T)*log(1-O))
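As a small worked example (the target and output values below are made up), the loss for one sample is the mean of this expression over the three output nodes:

import numpy as np

T = np.array([1.0, 0.0, 0.0])   # made-up target: skirts/dresses
O = np.array([0.8, 0.1, 0.3])   # made-up network outputs
loss = -np.mean(T * np.log(O) + (1 - T) * np.log(1 - O))
print(loss)  # ~0.23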
And this model's score and accuracy are:
The score is the evaluation of the loss for a given input, and the accuracy is how accurate the model is for that input. The lower the score the better, and the higher the accuracy the better.
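In Keras these two numbers come back together from evaluation, roughly like this (the generator name and step count are placeholders):

# Returns [loss, accuracy] for the compiled metrics
score, acc = model.evaluate_generator(validation_batches, steps=50)
print('Score: %.4f  Accuracy: %.4f' % (score, acc))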
The final evaluation of the results came back pretty decent. The accuracy of the model is about 86% for the evaluation dataset.
We saw some significant improvement in accuracy by using a pre-trained, fine-tuned model that works well with Keras. The model itself does not need to be that complex to gain a good deal of accuracy. The most interesting change was using binary cross-entropy over a categorical loss. This gave a little more than a 10% increase in accuracy over the more traditional approach for multi-label classification.
As you can see from the confusion matrix, the result looks pretty decent. Possible next moves are to test out different network architectures, look at different labels, add more examples from other datasets, and look at using different losses and optimizers to aid in the training. The overall accuracy on the testing data was 80%, so there is some room to improve, but the results are much better than in the previous post.
The confusion matrix does confirm the issue with pants and shorts, as there are 93 misclassifications between them. I feel that the majority of the classification errors come mainly from the data, as it is subjective what counts as shorts and what counts as short pants. To overcome this issue, a stricter data separation technique should be used to define what should and should not be classified as pants versus shorts.
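For reference, a minimal sketch of how such a confusion matrix can be produced with scikit-learn (test_batches is an assumed non-shuffled directory generator over the test set):

from sklearn.metrics import confusion_matrix
import numpy as np

# test_batches: assumed flow_from_directory generator with shuffle=False
y_prob = model.predict_generator(test_batches)
y_pred = np.argmax(y_prob, axis=1)  # predicted class = highest sigmoid output
print(confusion_matrix(test_batches.classes, y_pred))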
Hazewinkel, Michiel, ed. (2001) , "Binomial distribution", Encyclopedia of Mathematics, Springer Science+Business Media B.V. / Kluwer Academic Publishers, ISBN 978-1-55608-010-4
This blog is to show the development process of a new research paper that I am working on.
The goal of this string of blog posts is to slowly but surely develop a product that can aid in data attribute labeling for humans and even other types of image data.
This can be used in several products, from people identification and tracking to statistical data analysis.
Are you ready? Try to keep up!
What is Attribute Recognition? It is the process of identifying what properties are present in an image. This is normally done on humans but can be done on pretty much anything, from cities and cars to airplanes. The ability to predict the presence or absence of an item can be very beneficial: tracking people, a safety check of a vehicle (like a bus or a plane) before departure, visual inspection of an assembled computer, even uses in nuclear power plants. A simple scan of an image can yield very important warnings that could be detected before a disaster occurs.
The data-set that I will be using is the Market-1501 data-set (Zheng et al., 2015), which is commonly used for re-identification problems. Why use this data-set? I am using it because of the size and variety of people in the images. The image quality is akin to that of a standard security camera. There are varied backgrounds for each image, which will only make the program stronger at generalization by avoiding the use of a cleaned, noise-free data-set. This data-set will give us many attributes to extract over the next few weeks.
Step 1 Battle of the Sexes:
The first and possibly easiest attribute to check is the gender of a person. This will be easy as it can be framed as a binary classification problem, so it is not that big of a deal. If you're reading this then more likely than not you have read a dog-and-cat classification post somewhere when you started out learning CNNs. The model that we will build will be similar, so I will not go into great detail about the model itself.
The first step we need to take is the pre-processing of the images. First, we need to separate the images into the two classes (male, female). These will be our classes for training. Then we need to split the data-set into training and testing sets.
I will use Keras’s image generator to do this as it will not only save time, but I can do all the other pre-processing steps at the same time. This is a list of all possible random image augmentations that will be performed on each image along with some pre-processing steps that will always be performed.
Here is the code for the generator for both the training and validation data-sets. By defining the image generators like this, you save the time of splitting up the data-set yourself or having to load it into memory directly and use another Python library to do the splitting.
from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale=1./255,
                                   validation_split=0.2)  # set validation split

train_generator = train_datagen.flow_from_directory(
    DATA_PATH, target_size=(128, 64), class_mode='binary',
    subset='training')  # set as training data

validation_generator = train_datagen.flow_from_directory(
    DATA_PATH,  # same directory as training data
    target_size=(128, 64), class_mode='binary',
    subset='validation')  # set as validation data
Now with that defined, we can use it to train the model. The model will be a simple binary classification model. There is no real need to make it too complex, as this is just one of many models that will be used in the product.
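A rough sketch of what such a simple model and its training loop could look like (the layer sizes, optimizer, and epoch count are assumptions; the checkpoint naming mirrors the log output below):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from keras.callbacks import ModelCheckpoint

# Illustrative architecture only; the real layer sizes may differ
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(128, 64, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid'),  # single output for male/female
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

# Checkpoint naming follows the log output below; an assumption about the original setup
checkpoint = ModelCheckpoint('GenderID-{epoch:02d}-{val_acc:.4f}.ckpt',
                             monitor='val_acc', save_best_only=True)
model.fit_generator(train_generator,
                    validation_data=validation_generator,
                    epochs=10,
                    callbacks=[checkpoint])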
As you can see, the model started to produce pretty good results (peaking at about 78.5% validation accuracy) after training.
322/322 [==============================] - 139s 432ms/step - loss: 0.6444 - acc: 0.6356 - val_loss: 0.6126 - val_acc: 0.7211
Epoch 00001: val_acc improved from -inf to 0.72109, saving model to GenderID-01-0.7211.ckpt
322/322 [==============================] - 128s 398ms/step - loss: 0.5833 - acc: 0.6987 - val_loss: 0.5848 - val_acc: 0.7490
Epoch 00002: val_acc improved from 0.72109 to 0.74902, saving model to GenderID-02-0.7490.ckpt
322/322 [==============================] - 128s 399ms/step - loss: 0.5459 - acc: 0.7334 - val_loss: 0.5795 - val_acc: 0.7565
Epoch 00003: val_acc improved from 0.74902 to 0.75647, saving model to GenderID-03-0.7565.ckpt
322/322 [==============================] - 125s 388ms/step - loss: 0.5208 - acc: 0.7462 - val_loss: 0.5736 - val_acc: 0.7137
Epoch 00004: val_acc did not improve from 0.75647
322/322 [==============================] - 125s 390ms/step - loss: 0.4986 - acc: 0.7637 - val_loss: 0.5472 - val_acc: 0.7212
Epoch 00005: val_acc did not improve from 0.75647
322/322 [==============================] - 124s 384ms/step - loss: 0.4912 - acc: 0.7667 - val_loss: 0.5136 - val_acc: 0.7851
Epoch 00006: val_acc improved from 0.75647 to 0.78510, saving model to GenderID-06-0.7851.ckpt
322/322 [==============================] - 124s 384ms/step - loss: 0.4674 - acc: 0.7799 - val_loss: 0.5209 - val_acc: 0.7745
Epoch 00007: val_acc did not improve from 0.78510
322/322 [==============================] - 124s 385ms/step - loss: 0.4485 - acc: 0.7925 - val_loss: 0.4978 - val_acc: 0.7643
Epoch 00008: val_acc did not improve from 0.78510
322/322 [==============================] - 123s 381ms/step - loss: 0.4323 - acc: 0.8022 - val_loss: 0.5000 - val_acc: 0.7737
Epoch 00009: val_acc did not improve from 0.78510
322/322 [==============================] - 124s 386ms/step - loss: 0.4277 - acc: 0.8037 - val_loss: 0.5061 - val_acc: 0.7565
Epoch 00010: val_acc did not improve from 0.78510
Testing on some images of both men and women, the model did OK, as expected.
For men, the accuracy was 65.17% correct.
And for women, the accuracy was 48.36% correct.
So the model is a little more accurate at detecting men than women in the end.
The total accuracy was 58.36%, which is OK, a little better than guessing randomly, so I will take that as a win.
Now we can see the model is reasonably accurate for this complex problem. But how can we improve it? Some improvements can be made by using a pre-trained model to aid in the feature extraction along with better data augmentation techniques.
The model can successfully predict whether a person in an image is a man or a woman without the use of faces, which is a very difficult task. Why is this important? It allows telling someone's sex from a distance even if their face is obscured by clothing or a jacket. So you can use lower-resolution security cameras and still, with a certain accuracy, tell whether a person is a man or a woman.
From here I will add layer initializers, deepen the network, add a pre-trained fine-tuned model, and improve the data augmentation. This should give somewhat better results and possibly reach my goal of 65%, which would be a very good model for this particular task.
L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang and Q. Tian, "Scalable Person Re-identification: A Benchmark," 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, 2015, pp. 1116-1124.