Chapter 8: Advanced Convolutional Neural Networks
In this chapter, you’ll learn about advanced model architectures used for image classification. You’ll learn how to do transfer learning with Keras, and how advanced techniques such as dropout and regularization can improve your model’s performance.
Outline
SqueezeNet
Turi Create actually gives you a choice between different convnets:
SqueezeNet v1.1
ResNet50
With over 25 million parameters, it’s on the big side for use on mobile devices
Vision FeaturePrint.Screen
This model is built into iOS itself and so we don’t know what it actually looks like
The Keras functional API
To code SqueezeNet’s branching structures with Keras, you need to specify your model in a slightly different way.
The original approach
Creating a Sequential object and then calling model.add(layer)
This is limited to linear pipelines that consist of layers in a row
The more flexible approach
Each layer object is immediately applied to the output of the previous layer
Here, x is not a layer object but a tensor object
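A minimal sketch of the two styles, using a toy model with made-up layer sizes:

```python
from keras.layers import Input, Dense
from keras.models import Model, Sequential

# Sequential style: limited to a linear stack of layers
seq_model = Sequential()
seq_model.add(Dense(128, activation="relu", input_shape=(784,)))
seq_model.add(Dense(10, activation="softmax"))

# Functional style: layer objects are called on tensors, so branches are possible
inputs = Input(shape=(784,))                  # inputs is a tensor, not a layer
x = Dense(128, activation="relu")(inputs)     # the layer is applied to that tensor
outputs = Dense(10, activation="softmax")(x)
func_model = Model(inputs=inputs, outputs=outputs)
```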
Using the Keras functional API to write SqueezeNet’s fire_module structure
This has four tensors:
x, which holds the input data
sq, with the output of the squeeze layer
left, for the left branch
right, for the right branch
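A sketch of what such a fire module can look like with the functional API; the filter counts are illustrative defaults, not necessarily SqueezeNet’s actual values:

```python
from keras.layers import Conv2D, concatenate

def fire_module(x, squeeze=16, expand=64, name="fire"):
    # squeeze: a 1x1 convolution that reduces the number of channels
    sq = Conv2D(squeeze, (1, 1), activation="relu", padding="same",
                name=name + "_squeeze")(x)
    # left branch: 1x1 convolutions on the squeezed tensor
    left = Conv2D(expand, (1, 1), activation="relu", padding="same",
                  name=name + "_expand_1x1")(sq)
    # right branch: 3x3 convolutions on the same squeezed tensor
    right = Conv2D(expand, (3, 3), activation="relu", padding="same",
                   name=name + "_expand_3x3")(sq)
    # merge the two branches along the channel axis
    return concatenate([left, right], name=name + "_concat")
```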
MobileNet and data augmentation
Just like SqueezeNet, MobileNet is an architecture that is optimized for use on mobile devices
MobileNet has more learned parameters than SqueezeNet, so it’s slightly bigger but it’s also more capable
[MobileNet architecture (depthwise separable convolutions)](https://medium.com/@chih.sheng.huang821/%E6%B7%B1%E5%BA%A6%E5%AD%B8%E7%BF%92-mobilenet-depthwise-separable-convolution-f1ed016b3467)
A depthwise convolution
A depthwise separable convolution
The combination of a 3×3 DepthwiseConv2D followed by a 1×1 Conv2D (see the sketch after the batch normalization notes below)
The batch normalization layer
Without batch normalization, the data in the tensors would eventually disappear in deep networks because the numbers become too small, a problem known as vanishing gradients
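A sketch of MobileNet’s basic building block, combining the depthwise separable convolution with batch normalization. The parameters are illustrative; note that depending on your Keras version, DepthwiseConv2D may live in keras.applications.mobilenet instead of keras.layers, and MobileNet itself uses relu6 rather than plain relu:

```python
from keras.layers import DepthwiseConv2D, Conv2D, BatchNormalization, Activation

def depthwise_separable_block(x, filters, strides=(1, 1)):
    # depthwise: one 3x3 filter per input channel
    x = DepthwiseConv2D((3, 3), strides=strides, padding="same", use_bias=False)(x)
    x = BatchNormalization()(x)   # keeps the numbers in a healthy range
    x = Activation("relu")(x)
    # pointwise: a 1x1 convolution that mixes the channels together
    x = Conv2D(filters, (1, 1), padding="same", use_bias=False)(x)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)
    return x
```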
Keras’s MobileNet has been trained on the famous ImageNet dataset.
The final layer in this model outputs a tensor of size (7, 7, 1024).
The Vision FeaturePrint model that is built into iOS 12 is even more powerful than MobileNet and doesn’t take up any space in your app bundle, but, again, it is slower. And you can’t use it on iOS 11 or other platforms
Adding the classifier
Create a second model for the classifier that goes on top of the base model
It’s a logistic regression: just like before, a Dense layer followed by a softmax activation at the end.
You’re not going to train the MobileNet feature extractor; it has already been trained on the large ImageNet dataset
The GlobalAveragePooling2D layer shrinks the 7×7×1024 output tensor from MobileNet to a vector of 1024 elements, by taking the average of each individual 7×7 feature map.
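A sketch of this setup; num_classes is hypothetical, so use your own number of categories:

```python
from keras.applications.mobilenet import MobileNet
from keras.layers import GlobalAveragePooling2D, Dense
from keras.models import Model

num_classes = 20  # hypothetical number of categories

# the feature extractor: MobileNet without its ImageNet classifier, frozen
base_model = MobileNet(input_shape=(224, 224, 3), include_top=False,
                       weights="imagenet")
for layer in base_model.layers:
    layer.trainable = False      # don't train the pretrained layers

# the classifier on top: average pooling, then logistic regression
x = GlobalAveragePooling2D()(base_model.output)   # (7, 7, 1024) -> (1024,)
outputs = Dense(num_classes, activation="softmax")(x)
model = Model(inputs=base_model.input, outputs=outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```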
Data augmentation
You augment the training data by applying small random transformations.
ImageDataGenerator: configure it to rotate the images, flip them horizontally, shift them up/down/sideways, zoom in/out, shear, and change the color channels by random amounts
Use the preprocess_input function from the Keras MobileNet module, because it knows exactly how MobileNet expects the input data.
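A sketch of such a generator; the transformation ranges are illustrative and snacks/train is a hypothetical directory:

```python
from keras.preprocessing.image import ImageDataGenerator
from keras.applications.mobilenet import preprocess_input

train_datagen = ImageDataGenerator(
    rotation_range=40,            # random rotations
    width_shift_range=0.2,        # shift sideways
    height_shift_range=0.2,       # shift up/down
    shear_range=0.2,              # shear transform
    zoom_range=0.2,               # zoom in/out
    channel_shift_range=0.2,      # random shifts of the color channels
    horizontal_flip=True,         # random horizontal flips
    preprocessing_function=preprocess_input)  # MobileNet's own input scaling

train_generator = train_datagen.flow_from_directory(
    "snacks/train", target_size=(224, 224), batch_size=64)
```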
Training the classifier layer
A very handy Keras feature: callbacks
EarlyStopping callback: halts the training once the "val_acc" metric, the validation accuracy, stops improving.
ModelCheckpoint callback: it’s smart to save a model checkpoint every so often, i.e., a copy of the weights the model has learned up to that point
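A sketch of both callbacks; the checkpoint path and patience are illustrative, and newer Keras versions call the metric "val_accuracy" instead of "val_acc":

```python
from keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    # stop once validation accuracy hasn't improved for 10 epochs
    EarlyStopping(monitor="val_acc", patience=10),
    # save the weights every time validation accuracy improves
    ModelCheckpoint("checkpoints/best_model.hdf5",
                    monitor="val_acc", save_best_only=True, verbose=1),
]
```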
Having feature extraction as a separate step only makes sense if you plan to reuse the same images in every epoch.
But with data augmentation — where images are rotated, flipped and distorted in many other ways — no two images are ever the same. And so all the feature vectors will be different for every epoch.
In every epoch, Keras needs to compute all the feature vectors again because all the training images are now slightly different from last time.
It’s a bit slower, but that’s a small price to pay for having a much larger training set with very little effort.
Fine-tuning the feature extractor
The pretrained feature extractor also contains a lot of irrelevant knowledge about animals, vehicles and all kinds of other things that are not snacks.
With fine-tuning, you can adjust the knowledge inside the feature extractor to make it more relevant to your own data
After about 10 epochs, the validation loss and accuracy no longer appear to improve.
When that happens, it’s useful to reduce the learning rate. Here, you make it three times smaller
The LearningRateScheduler callback can automatically reduce the learning rate on a schedule you define, which is especially useful for training sessions with hundreds of epochs that you don’t want to babysit.
The ReduceLROnPlateau callback will automatically lower the learning rate when the validation accuracy or loss has stopped improving. Very handy
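A sketch of that callback; factor=1/3 matches the “three times smaller” rule of thumb above, and the patience value is illustrative:

```python
from keras.callbacks import ReduceLROnPlateau

# make the learning rate 3x smaller once val_acc stalls for 5 epochs
reduce_lr = ReduceLROnPlateau(monitor="val_acc", factor=1 / 3.0,
                              patience=5, verbose=1)
```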
The final accuracy on the test set is 82%. That’s a lot better than the SqueezeNet model from Turi Create.
There are two reasons for this:
1) MobileNet is more powerful than SqueezeNet;
2) Turi Create does not use data augmentation.
Granted, 82% is still not as good as the model from Create ML, which had 91% accuracy
Regularization and dropout
It would be better if the validation curves were closer to the training curves. You can do this by adding regularization to the model, which makes it harder for the model to get too attached to the training images.
Regularization is very useful, but keep in mind that it isn’t some magic trick that makes your validation score suddenly a lot better — it actually does the opposite and makes the training score a bit worse.
There are different methods for regularization, but what they all have in common is that they make learning more difficult.
Batch normalization
Dropout
L2 penalty
Dropout
Randomly removes elements from the tensor by setting them to zero.
This stops the neural network from relying too much on remembering specific training examples.
The dropout rate is a hyperparameter, so you get to decide how high or low it should be; 0.5 is a good default choice. To disable dropout, simply set the rate to zero (see the sketch after the L2 penalty notes below).
L2 penalty
It adds the square of the weights to the loss term, so large weights result in a large loss value.
This prevents situations where some features get really large weights, making them seem more important than features with very small weights
The value 0.001 is a hyperparameter called weight decay. This lets you tweak how important the L2 penalty is in the loss function
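A sketch of a classifier head that uses both techniques, reusing base_model and num_classes from the transfer-learning sketch above:

```python
from keras.layers import GlobalAveragePooling2D, Dropout, Dense
from keras.models import Model
from keras.regularizers import l2

x = GlobalAveragePooling2D()(base_model.output)
x = Dropout(0.5)(x)   # randomly zeroes 50% of the elements during training
outputs = Dense(num_classes, activation="softmax",
                kernel_regularizer=l2(0.001))(x)  # 0.001 is the weight decay
model = Model(inputs=base_model.input, outputs=outputs)
```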
Tune those hyperparameters
By using a grid search, which tries every combination from a predefined set of hyperparameter values, or a random search, which tries randomly sampled combinations
It’s very important that you use the validation set for tuning the hyperparameters, not the training set or the test set.
How good is the model really?
This loads the model from a checkpoint file that was saved by the ModelCheckpoint callback. (Replace the filename with your own best checkpoint.)
This HDF5 file contains not only the model’s learned parameters but also the architecture definition.
Because the relu6 activation is not a standard part of Keras, you have to provide it in the custom_objects dictionary; otherwise, Keras won’t be able to load the model.
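A sketch of loading the checkpoint. Where relu6 lives depends on your Keras version; in older releases it’s in keras.applications.mobilenet, and some versions also need DepthwiseConv2D in custom_objects:

```python
from keras.models import load_model
from keras.applications import mobilenet

best_model = load_model("checkpoints/best_model.hdf5",  # your best checkpoint
                        custom_objects={"relu6": mobilenet.relu6})
```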
On the diagonal — the bright squares — are the images that were correctly matched.
Everything else is an incorrect match
TTA, or Test Time Augmentation
For example, instead of making only one prediction for each test image, you could do it once for the normal image and once for the image flipped. Then the final score is the average of these two predictions.
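A minimal sketch of this flavor of TTA, assuming x is a batch of already-preprocessed images of shape (batch, 224, 224, 3):

```python
import numpy as np

pred_normal = best_model.predict(x)                  # the normal images
pred_flipped = best_model.predict(x[:, :, ::-1, :])  # flipped: reverse the width axis
pred = (pred_normal + pred_flipped) / 2.0            # average the two predictions
```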

Precision, recall, F1-score
Precision means: how many of the images that were classified as being X really are X?
The more false positives there are, i.e., images the model thinks belong to class X but that don’t, the lower the precision.
Recall means: how many of the images of class X did the model find?
Recall for banana is high, so the images that contained bananas were often correctly found by the model
The more false negatives there are, i.e., things that are wrongly predicted to not be class X, the lower the recall for X.
F1-score: the harmonic mean of precision and recall, useful if you want a single number that combines the two.
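A report like this can be produced with scikit-learn; test_generator is assumed to be a non-shuffling generator over the test set, so its classes attribute lines up with the predictions:

```python
import numpy as np
from sklearn.metrics import classification_report

probs = best_model.predict_generator(test_generator)  # per-class probabilities
y_pred = np.argmax(probs, axis=-1)                    # best class per image
y_true = test_generator.classes                       # ground-truth label indices
class_labels = list(test_generator.class_indices.keys())
print(classification_report(y_true, y_pred, target_names=class_labels))
```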
The class with the lowest F1-score, 0.73, is cake. If you wanted to improve this classifier, the first thing you might want to do is find more and better training images for the cake category.
What are the worst predictions?
Use the following code to find the images that the model was most wrong about
Inspect the photos the model misclassified to understand why it got them wrong
It’s possible that some of the test set labels themselves are wrong
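The book’s exact code isn’t reproduced in these notes; a sketch of the idea, reusing probs, y_pred, y_true and class_labels from the report above:

```python
# probability the model assigned to each image's true class; low = badly wrong
true_class_probs = probs[np.arange(len(y_true)), y_true]
worst = np.argsort(true_class_probs)[:10]   # indices of the ten worst predictions
for i in worst:
    print(test_generator.filenames[i],
          "true:", class_labels[y_true[i]],
          "predicted:", class_labels[y_pred[i]])
```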
A note on imbalanced classes
If the disease happens to only 1% of the patients, the classifier could simply always predict “disease not present” and it would be correct 99% of the time
There are various techniques you can use to deal with class imbalance
Oversampling where you use the images from the smaller categories more often
Undersampling where you use fewer images from the larger categories, or setting weights on the classes so that the bigger category has a smaller effect on the loss.
Turi Create and Create ML currently have no options for this, so if you need to build a classifier for an imbalanced dataset, Keras is a better choice.
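For example, Keras lets you pass per-class weights into training; the weights and the rare class index below are made up:

```python
# give the hypothetical rare class (index 3) 10x more weight in the loss
class_weight = {i: 1.0 for i in range(num_classes)}
class_weight[3] = 10.0

model.fit_generator(train_generator, validation_data=val_generator,
                    epochs=100, class_weight=class_weight)
```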
Converting to Core ML
The convert function takes quite a few arguments:
Keras model object
Here you’re using the best_model object that you loaded in the previous section.
input_names
tells the converter what the inputs should be named in the .mlmodel file. Since this is an image classifier, it makes sense to use the name "image". This is also the name that’s used by Xcode when it automatically generates the Swift code for your Core ML model.
image_input_names
tells the converter that the input called "image" should be treated as an image. This is what lets you pass a CVPixelBuffer object to the Core ML model. If you leave out this option, the input is expected to be an MLMultiArray object, which is not as easy to work with.
output_names and predicted_feature_name
The first one is "labelProbability" and contains a dictionary that maps the predicted probabilities to the names of the classes.
The second one is "label" and is a string that contains the class label of the best prediction. These are also the names that Turi Create used.
red_bias, green_bias, blue_bias, and image_scale
used to normalize the image.
The chosen values are equivalent to the normalization function you’ve used before: image / 127.5 - 1
class_labels
contains the list of label names you defined earlier
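Putting the arguments together, a sketch of the conversion call; class_labels is the list of label names defined earlier and the output filename is hypothetical:

```python
import coremltools

coreml_model = coremltools.converters.keras.convert(
    best_model,
    input_names="image",
    image_input_names="image",
    output_names="labelProbability",
    predicted_feature_name="label",
    red_bias=-1.0, green_bias=-1.0, blue_bias=-1.0,
    image_scale=2 / 255.0,   # pixel * 2/255 - 1 is the same as image / 127.5 - 1
    class_labels=class_labels)
coreml_model.save("Snacks.mlmodel")
```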

Key points
MobileNet uses depthwise separable convolutions because they’re less expensive than regular convolutions, which makes them ideal for running models on mobile devices. Instead of pooling layers, MobileNet uses convolutions with a stride of 2.
Training a large neural network on a small dataset is almost impossible. It’s smarter to do transfer learning with a pre-trained model, but even then you want to use data augmentation to artificially enlarge your training set. It’s also a good idea to adapt the feature extractor to your own data by fine-tuning it.
Regularization helps to build stable, reliable models. Besides increasing the amount of training data, you can use batch normalization, dropout and an L2 penalty to stop the model from memorizing specific training examples. The larger the number of learnable parameters in the model, the more important regularization becomes.
Try your model on the test set to see how good it really is. Use a confusion matrix and a precision-recall report to see where the model makes mistakes. Look at the images that it gets most wrong to see if they are really mistakes, or if your dataset needs improvement.
Use coremltools to convert your Keras model to Core ML.