Chapter 2: Getting Started with Image Classification

In this chapter, you’ll build your first iOS app by adding a Core ML model that detects whether a snack is healthy or unhealthy. You’ll focus on how machine learning can be used to solve classification problems, such as identifying what kind of object appears in an image.

Outline

Is that snack healthy?

  • A classifier: a machine learning model that takes an input of some kind, in this case an image, and determines what sort of “thing” that input represents.

    • An image classifier tells you which category, or class, the image belongs to.

  • Binary means that the classifier is able to distinguish between two classes of objects.

    • For example, you can have a classifier that will answer either “cat” or “dog” for a given input image, just in case you have trouble telling the two apart.

  • In this chapter you’ll learn how to build an image classifier that can tell the difference between healthy and unhealthy snacks.

    • using a ready-made model that has already been trained

  • The image classifier

    • Input: an image

    • Output: a probability distribution, i.e. a list of numbers between 0 and 1, one per class, that add up to 1
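
A hypothetical example of what that output looks like for the binary healthy/unhealthy classifier (the numbers are made up):

```swift
// Hypothetical output of the binary classifier: one probability per class.
// The probabilities always add up to 1; higher means more confident.
let probabilities: [String: Double] = ["healthy": 0.85, "unhealthy": 0.15]
```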

Core ML

  • Core ML models are packaged up in a .mlmodel file.

    • This file contains both the structural definition of the model and the learned parameters (or the “weights”).

  • The HealthySnacks model type is Neural Network Classifier

    • This model was made using a tool called Turi Create and it uses SqueezeNet v1.1, a popular deep learning architecture for mobile apps.

    • The main benefit of SqueezeNet is that it’s small, about 5 MB.

  • There is only one input, a color image that must be 227 pixels wide and 227 pixels tall.

    • You cannot use images with other dimensions, because the SqueezeNet architecture expects an image of exactly this size.

  • When you add an .mlmodel file to a project, Xcode does something smart behind the scenes:

    • It creates a Swift class with all the source code needed to use the model in your app.

  • Most importantly, the images need to be scaled to 227×227 pixels and placed into a CVPixelBuffer object before you can call the prediction() method, as in the sketch below.
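
A minimal sketch of calling the generated class directly, without Vision. The input and output names used here (image, label, labelProbability) are assumptions; the actual names are shown in Xcode’s model viewer:

```swift
import CoreML
import CoreVideo

// Hedged sketch: input/output names ("image", "label", "labelProbability")
// are assumptions; check the generated HealthySnacks class in Xcode.
func classifyDirectly(pixelBuffer: CVPixelBuffer) {
    let model = HealthySnacks()
    // pixelBuffer must already contain the image scaled to 227×227 pixels.
    if let output = try? model.prediction(image: pixelBuffer) {
        print(output.label)            // e.g. "healthy"
        print(output.labelProbability) // e.g. ["healthy": 0.95, "unhealthy": 0.05]
    }
}
```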

Vision

  • Vision helps with computer vision tasks.

    • For example, it can detect rectangular shapes and text in images, detect faces and even track moving objects

  • Vision makes it easy to run Core ML models that take images as input.

    • Vision will automatically resize and crop the image.

  • Vision also performs a few other tricks, such as rotating the image so that it’s always right-side up, and matching the image’s colors to the device’s color space.

  • The way Vision works is that you create a VNRequest object, which describes the task you want to perform, and then you use a VNImageRequestHandler to execute the request. Since you’ll use Vision to run a Core ML model, the request is a subclass named VNCoreMLRequest.

Creating the VNCoreML request

  • To create the VNCoreMLRequest, you first need to import two important frameworks: CoreML and Vision.

  • Don’t create a new request object every time you want to classify an image — that’s wasteful.

    • Use a lazy property so the request object is created once and then reused, instead of building a new one every time (see the sketch below).
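
A minimal sketch of that lazy property, which lives inside the view controller. It assumes the generated model class is HealthySnacks and that results are handled by a processObservations(for:error:) method (sketched under “Showing the results” below):

```swift
import CoreML
import Vision

lazy var classificationRequest: VNCoreMLRequest = {
    do {
        let healthySnacks = HealthySnacks()
        let visionModel = try VNCoreMLModel(for: healthySnacks.model)
        let request = VNCoreMLRequest(model: visionModel) { [weak self] request, error in
            self?.processObservations(for: request, error: error)
        }
        // Match the crop/scale method used during training
        // (see "Crop and scale options" below).
        request.imageCropAndScaleOption = .centerCrop
        return request
    } catch {
        fatalError("Failed to create VNCoreMLModel: \(error)")
    }
}()
```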

Crop and scale options

  • Vision will automatically scale the image to the correct size.

  • For the best results you should set the request’s imageCropAndScaleOption property so that it uses the same method that was used during training.

  • centerCrop:

    • The .centerCrop option first resizes the image so that the smallest side is 227 pixels, and then it crops out the center square.

  • scaleFill:

    • With .scaleFill, the image gets resized to 227×227 without removing anything from the sides, so it keeps all the information from the original image; but if the original wasn’t square, the image gets squashed.

  • scaleFit:

    • .scaleFit keeps the aspect ratio intact but compensates by filling the rest of the 227×227 square with black pixels.

Performing the request
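
A minimal sketch of performing the request, assuming classificationRequest is the lazy property from the previous section; the CGImagePropertyOrientation(_:) conversion is sketched under “Image orientation” below:

```swift
import UIKit
import Vision

func classify(image: UIImage) {
    guard let ciImage = CIImage(image: image) else {
        print("Unable to create CIImage")
        return
    }
    // Tell Vision how the image is oriented (see the next section).
    let orientation = CGImagePropertyOrientation(image.imageOrientation)
    // Run the request on a background queue so the UI stays responsive.
    DispatchQueue.global(qos: .userInitiated).async {
        let handler = VNImageRequestHandler(ciImage: ciImage, orientation: orientation)
        do {
            try handler.perform([self.classificationRequest])
        } catch {
            print("Failed to perform classification: \(error)")
        }
    }
}
```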

Image orientation

  • iOS keeps track of the true orientation of the image with the imageOrientation property.

  • If you’re holding the phone in portrait mode and snap a picture, its imageOrientation will be .right to indicate the camera has been rotated 90 degrees clockwise; an orientation of 0 degrees (.up) means the phone was in landscape with the Home button on the right.

  • The Core ML model does not take “image orientation” as an input, so it will see only the “raw” pixels in the image buffer without knowing which side is up.

    • Image classifiers are typically trained to account for images being horizontally flipped, so that they can recognize objects facing left as well as facing right, but they’re usually not trained to deal with images that are rotated by 90, 180 or 270 degrees.

    • This is why you need to tell Vision about the image’s orientation so that it can properly rotate the image’s pixels before they get passed to Core ML.
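
Vision expects the orientation as a CGImagePropertyOrientation, whose raw values differ from those of UIImage.Orientation, so an explicit conversion is needed. A minimal sketch of such a helper (this extension is not part of UIKit; it is assumed here):

```swift
import UIKit
import ImageIO

extension CGImagePropertyOrientation {
    // Map UIImage.Orientation cases onto the EXIF-style orientation values.
    init(_ orientation: UIImage.Orientation) {
        switch orientation {
        case .up: self = .up
        case .upMirrored: self = .upMirrored
        case .down: self = .down
        case .downMirrored: self = .downMirrored
        case .left: self = .left
        case .leftMirrored: self = .leftMirrored
        case .right: self = .right
        case .rightMirrored: self = .rightMirrored
        @unknown default: self = .up
        }
    }
}
```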

Trying it out

Showing the results

  • The request parameter is of type VNRequest, the base class of VNCoreMLRequest. If everything went well, the request’s results array contains one or more VNClassificationObservation objects.

  • Vision automatically sorts the results by confidence, so results[0] contains the class with the highest confidence — the winning class.
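
A minimal sketch of the completion handler, assuming the view controller has a resultsLabel UILabel; the 0.8 “not sure” threshold is an arbitrary example value and also covers the low-confidence cases discussed below:

```swift
import UIKit
import Vision

func processObservations(for request: VNRequest, error: Error?) {
    // Vision calls the completion handler on a background queue,
    // so hop back to the main queue before touching the UI.
    DispatchQueue.main.async {
        if let results = request.results as? [VNClassificationObservation] {
            if results.isEmpty {
                self.resultsLabel.text = "Nothing found"
            } else if results[0].confidence < 0.8 {
                // Low confidence often means the model never saw this kind of object.
                self.resultsLabel.text = "Not sure"
            } else {
                self.resultsLabel.text = String(
                    format: "%@ %.1f%%",
                    results[0].identifier,
                    results[0].confidence * 100)
            }
        } else if let error = error {
            self.resultsLabel.text = "Error: \(error.localizedDescription)"
        }
    }
}
```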

What if the image doesn’t have a snack?

  • What happens when you show it a kind of snack that it has never seen before, or maybe even a totally different kind of object — maybe something that isn’t even edible?

  • In that case, the difference between results[0] and results[1] in the [VNClassificationObservation] array won’t be very large.

  • The way to handle this case is to treat any prediction whose confidence is below a certain threshold as “not sure”, as in the sketch above.

What if there’s more than one object in the image?

  • Image classification always looks at the entire image and tries to find out what the most prominent object in the image is.

  • In this case too, the difference between results[0] and results[1] in the [VNClassificationObservation] array won’t be very large, so it’s handled the same way:

    • treat any prediction whose confidence is below a certain threshold as “not sure”

How does it work?

  • Convolutional neural network

  • Core ML treats the model as a black box, where input goes into one end and the output comes out the other. Inside this black box it actually looks like a pipeline with multiple stages

Into the next dimension

  • The input image is 227×227 pixels and is a color image, so you need 227 × 227 × 3 = 154,587 numbers to describe an input image

  • Each image is therefore a feature vector of 154,587 numbers; through the transformations it learns, the CNN finds a decision boundary in this high-dimensional space that separates the images into classes.

A concrete example

  • To classify a new image, the neural network will apply all the transformations it has learned during training, and then it looks at which side of the line the transformed image falls. And that’s the secret sauce of neural network classification!

Multi-class classification

  • a multi-class classifier that was trained on the exact same data as the binary healthy/unhealthy classifier but that can detect the individual snacks.

  • Swap in the new multi-class model.

  • For multi-class classification, show the top-3 predictions (see the sketch below).
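
A minimal sketch of formatting the top three predictions, assuming results is the confidence-sorted [VNClassificationObservation] array from the completion handler:

```swift
import Vision

func top3Description(for results: [VNClassificationObservation]) -> String {
    // Vision sorts observations by confidence, so prefix(3) gives the top three.
    let top3 = results.prefix(3).map { observation in
        String(format: "%@ %.1f%%", observation.identifier, observation.confidence * 100)
    }
    return top3.joined(separator: "\n")
}
```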

Key points

  • Obtain a trained .mlmodel file from somewhere. You can sometimes find pre-trained models on the web (Apple has a few on its website) but usually you’ll have to build your own. You’ll learn how to do this in the next chapter.

  • Add the .mlmodel file to your Xcode project.

  • Create the VNCoreMLRequest object (just once) and give it a completion handler that looks at the VNClassificationObservation objects describing the results.

  • For every image that you want to classify, create a new VNImageRequestHandler object and tell it to perform the VNCoreMLRequest.
