Chapter 9: Beyond Classification
In this chapter, you’ll learn how to identify an object’s location in an image by building a simple localization model that predicts a single bounding box.
Outline
Where is it?
Object detection means finding all the objects inside an image.
It does this by predicting one or more bounding boxes, which are simply rectangular regions in the image.
Each bounding box also has a class — the type of the object inside the box — and a probability that tells you how confident the model is.


The ground-truth will set you free
Goal: just predict one bounding box
First step: revisit the dataset
The workflow for training the model doesn’t change:
provide a dataset that consists of the images and the targets.
provide a suitable loss function that calculates how wrong the model’s predictions are by comparing them to the targets.
use a Stochastic Gradient Descent optimizer, such as Adam, to find the values for the model’s learnable parameters.
Changes to the training data
Previously, the targets were just the class names for the images, but now they must also include the so-called ground-truth bounding boxes.

The first three rows in the dataframe all belong to the same image, a picture of a cake, but apparently, there’s also some ice cream in that image (see rows 1 and 2).
normalized coordinates
between 0 and 1
It’s convenient to use normalized coordinates because it makes them independent of the actual size of the image.
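Converting pixel coordinates to normalized ones is just a division by the image dimensions. A minimal sketch — the `(x_min, y_min, x_max, y_max)` coordinate order here is an assumption for illustration:

```python
# Sketch: convert a bounding box from pixel coordinates to normalized
# coordinates in the 0-1 range. The (x_min, y_min, x_max, y_max) order
# is an assumption, not necessarily the dataset's column order.
def normalize_bbox(bbox, image_width, image_height):
    x_min, y_min, x_max, y_max = bbox
    return (x_min / image_width, y_min / image_height,
            x_max / image_width, y_max / image_height)

# The same normalized box describes the object at any resolution.
print(normalize_bbox((64, 48, 320, 240), 640, 480))  # (0.1, 0.1, 0.5, 0.5)
```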
Show me the data!
Garbage in equals garbage out. If you’re training your model on data that doesn’t make sense, then neither will the model’s predictions and you just wasted a lot of time and electricity. Don’t be that person!
Not all objects in all images have annotations, and some have duplicates, so this dataset isn’t ideal.
When you start building your own models, you’ll find that you’ll be spending a lot of time cleaning up your training data, filling in missing values, and so on.
What about images without annotations?
There are some tools that can help:
RectLabel, available on the Mac App Store. This is a powerful tool with many options, but it expects the annotations to be provided as a separate XML file for each image.
Labelbox is an online tool for labeling training data for many different tasks, including object detection. This is a paid service but there is a free tier.
Simple Image Annotator is a Python program that runs as a local web service. As its name implies, it’s pretty simple to use and offers only basic editing features. The output is a CSV file but it’s not 100% compatible with the CSV format we’re using.
Sloth, available at sloth.readthedocs.io, is an advanced labeling tool. Requires Linux.
CVAT, or Computer Vision Annotation Tool
Your own generator
Previously, you used ImageDataGenerator and flow_from_directory() to automatically load the images and put them into batches for training.
You’ll need a way to read the rows from this dataframe into a batch. Fortunately, Keras lets you write your own custom generator.
X.shape = (32, 224, 224, 3)
thirty-two 224×224 color images
y_class.shape = (32,)
it has thirty-two class labels
y_bbox.shape = (32, 4)
it has thirty-two bounding boxes — one per image — and each box is made up of four coordinates.
BoundingBoxGenerator is a subclass of the Keras Sequence object that overrides a couple of methods
The __len__() method determines how many batches this generator can produce.
The generator produces exactly 220 batches because 7,040 rows / 32 rows per batch = 220 batches.
Usually, the size of the training set doesn’t divide so neatly by batch_size, in which case the last, incomplete batch is ignored or is padded with zeros to make it a full batch.
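The batch arithmetic is easy to check directly. A small sketch — `drop_last` mirrors the “ignore the incomplete batch” option mentioned above:

```python
# How many batches a generator can produce from a given number of rows.
def num_batches(num_rows, batch_size, drop_last=True):
    if drop_last:
        return num_rows // batch_size   # incomplete final batch is ignored
    return -(-num_rows // batch_size)   # ceiling division: keep a padded last batch

print(num_batches(7040, 32))  # 220 -- 7,040 divides evenly by 32
print(num_batches(7050, 32))  # 220 -- the 10 leftover rows are dropped
```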
__getitem__(). This method is called when you do next() on the generator or when you write train_generator[some_index].
Create new NumPy arrays to hold the images X, and the targets y_class and y_bbox for one batch. These arrays are initially empty.
Get the indices of the rows to include in this batch. It looks these up in self.rows.
For every row index, grab the corresponding row from the DataFrame.
Return X, as well as y_class and y_bbox, to the caller.
BoundingBoxGenerator is currently not doing any data augmentation. If you’re up for a challenge, try adding data augmentation code to the generator — but don’t forget that the bounding boxes should be transformed too along with the images!
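The generator described above might be sketched like this. To keep the sketch self-contained it uses a plain list of rows instead of the annotations DataFrame and omits the keras.utils.Sequence base class, but the `__len__`/`__getitem__` structure is the same:

```python
import numpy as np

# Minimal sketch of a BoundingBoxGenerator-style class. A real version
# would subclass keras.utils.Sequence and read rows from the DataFrame.
class BoundingBoxGenerator:
    def __init__(self, rows, batch_size=32, image_size=224):
        self.rows = rows              # each row: (image, class_label, bbox)
        self.batch_size = batch_size
        self.image_size = image_size

    def __len__(self):
        # How many complete batches this generator can produce.
        return len(self.rows) // self.batch_size

    def __getitem__(self, index):
        b, s = self.batch_size, self.image_size
        # 1) Create empty arrays for one batch of images and targets.
        X = np.zeros((b, s, s, 3), dtype=np.float32)
        y_class = np.zeros((b,), dtype=np.int64)
        y_bbox = np.zeros((b, 4), dtype=np.float32)
        # 2) Get the rows that belong to this batch.
        batch_rows = self.rows[index * b:(index + 1) * b]
        # 3) Copy every row into the batch arrays.
        for i, (image, label, bbox) in enumerate(batch_rows):
            X[i] = image
            y_class[i] = label
            y_bbox[i] = bbox
        # 4) Return the images and both sets of targets.
        return X, (y_class, y_bbox)
```

With 64 fake rows and a batch size of 32, such a generator reports two batches, and each batch has the shapes described above.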
A simple localization model
Reuse the previously trained model, but now it must output two kinds of predictions.
has two outputs: one for the classification results, and one for the bounding box predictions

How to build a branching architecture with the functional API
Create a layer object, such as GlobalAveragePooling2D()
Call this layer object on a tensor, such as base_model.outputs[0], which is the output from the MobileNet feature extractor
The layer_dict lets you look up layers in the Keras model by name. That’s why you named the new layers when you created them.
layers[-2]. In Python notation, a negative index means that you’re indexing the array from the back, so layers[-1] is the last layer and layers[-2] the one before it.
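The create-a-layer-then-call-it-on-a-tensor pattern can be sketched as follows. The tiny Conv2D trunk stands in for the MobileNet feature extractor, and the layer names are illustrative; the point is one shared trunk feeding two named output heads:

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(224, 224, 3))
# Stand-in for the MobileNet feature extractor (base_model.outputs[0]).
features = layers.Conv2D(8, 3, activation="relu")(inputs)

# Create a layer object, then call it on a tensor to connect it.
pooled = layers.GlobalAveragePooling2D(name="pool")(features)

# Branch 1: classification head with 20 classes.
class_out = layers.Dense(20, activation="softmax",
                         name="class_prediction")(pooled)
# Branch 2: bounding box head -- four normalized coordinates.
bbox_out = layers.Dense(4, name="bbox_prediction")(pooled)

model = keras.Model(inputs=inputs, outputs=[class_out, bbox_out])

# Naming the new layers lets you look them up by name later.
layer_dict = {layer.name: layer for layer in model.layers}
```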

The new loss function
Two loss functions: sparse_categorical_crossentropy and mse.
The model has two outputs and each predicts a different thing, so you want to use a different loss function for each output.
"mse" or mean squared error.
model.compile() now also has a loss_weights argument. Because there are two outputs, the loss computed during training is a weighted sum of the two individual losses.
Because this model has already been trained on the classification task but hasn’t learned anything about the bounding box prediction task yet, we’ve decided that the MSE loss for the bounding boxes should count more heavily. That’s why it has a weight of 10.0 versus a weight of 1.0 for the cross-entropy loss.
This will encourage the model to pay more attention to errors from the bounding box output.
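A two-output compile call might look like this. The tiny model here is only a stand-in so the snippet is self-contained; the loss, loss_weights, and metrics dictionaries are the part that matters:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Stand-in two-output model (the real one is the MobileNet-based model).
inputs = keras.Input(shape=(8,))
class_out = layers.Dense(20, activation="softmax",
                         name="class_prediction")(inputs)
bbox_out = layers.Dense(4, name="bbox_prediction")(inputs)
model = keras.Model(inputs, [class_out, bbox_out])

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    # One loss function per output.
    loss={"class_prediction": "sparse_categorical_crossentropy",
          "bbox_prediction": "mse"},
    # Total loss = 1.0 * crossentropy + 10.0 * MSE.
    loss_weights={"class_prediction": 1.0, "bbox_prediction": 10.0},
    # Accuracy only makes sense for the classification output.
    metrics={"class_prediction": "accuracy"},
)
```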
Sanity checks
See what happens when you load an image and make a prediction.
The preds variable is a list containing two NumPy arrays:
The first array, preds[0], is the 20-element probability distribution from the classifier output.
The second array, preds[1], has the four numbers for the bounding box.
Use the generator to make predictions
This will create predictions for all the rows in the train_annotations dataframe,
an array of size (7040, 20) for the classification output, and an array of size (7040, 4) for the bounding box output.
Train it!
The bounding box loss is much smaller than the class loss, 0.1187 versus 0.4749. You can’t really compare these values because they were computed using completely different formulas.
there is no such metric for the bounding box predictions. That’s because you told model.compile() that you only wanted metrics={ "class_prediction": "accuracy" }.
IOU
For the bounding box predictions, there is also a metric that gives us some intuition about the quality of the model: IOU.
Intersection-over-Union
A number between 0 and 1.
The more similar the two boxes are, the higher the number. A perfect match is 1, while 0 means the boxes don’t overlap at all.
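A sketch of how such an iou() function might look, using pure Python; the `(x_min, y_min, x_max, y_max)` normalized coordinate order is an assumption:

```python
# Intersection-over-Union for two axis-aligned boxes given as
# (x_min, y_min, x_max, y_max) in normalized coordinates.
def iou(box_a, box_b):
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0.0, 0.0, 0.5, 0.5), (0.0, 0.0, 0.5, 0.5)))  # 1.0  perfect match
print(iou((0.0, 0.0, 0.5, 0.5), (0.5, 0.5, 1.0, 1.0)))  # 0.0  no overlap
```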

To use this metric, you need to compile the model again:
The MeanIOU object is a simple wrapper class that lets Keras and TensorFlow use the iou() function.
Trying out the localization model
Write a function that makes a prediction on an image and plots both the ground-truth bounding box and the predicted one:
The current model can only draw a single bounding box, so it still can’t correctly handle photos that contain multiple objects; we need a model that can draw multiple bounding boxes.
Conclusion: not bad, could be better
On the validation set, it had an average IOU of a little over 30%.
In general, we only consider a bounding box prediction correct when its IOU is over 0.5 or 50%
how to create a model that can predict more than one bounding box.
Key points
Object detection models are more powerful than classifiers: They can find many different objects in an image. It’s easy to make a simple localization model that predicts a single bounding box, but more tricky to make a full object detector.
To train an object detector, you need a dataset that has bounding box annotations. There are various tools that let you create these annotations. You may need to write your own generator to use the annotations in Keras. Data wrangling is a big part of machine learning.
A model can perform more than one task. To predict a bounding box in addition to classification probabilities, simply add a second output to the model. This output needs to have its own targets in the training data and its own loss function.
The loss function to use for linear regression tasks, such as predicting bounding boxes, is MSE or Mean Squared Error. An interpretable metric for the accuracy of the bounding box predictions is IOU or Intersection-over-Union. An IOU of 0.5 or greater is considered a good prediction.
When working with images, make plenty of plots to see if your data is correct. Don’t just look at the loss and other metrics, also look at the actual predictions to check how well the model is doing.