Using custom containers on gcloud ai-platform to train a dog breed classifier model
This project demonstrates the use of custom containers on the gcloud ai-platform to train a deep learning model.
The hallmark of successful people is that they are always stretching themselves to learn new things - Carol S. Dweck
Cloud platforms offer great tools to manage end-to-end machine learning projects. This project uses the GoogleCloud ai-platform. In particular, it demonstrates how to automate a training job using custom Docker containers. To make the project complete, the deployment of the trained model on the GoogleCloud app engine to serve online predictions is also included.
There are two parts to this project. The first is deploying a training job on the GoogleCloud ai-platform. The second is deploying the trained model on the GoogleCloud app engine to serve predictions.
All the instructions here are for Linux systems. On your computer, first navigate to the directory where you want to download the repository. Then run the following command:
git clone --depth 1 https://github.com/brianpinto91/dog-breed-prediction-gcloud.git
There are different ways to deploy a training job on the GoogleCloud ai-platform. If you are using TensorFlow, scikit-learn, or XGBoost, there are pre-configured runtime versions with the required dependencies that can be used directly. However, if you are using PyTorch, as in this case, Docker containers can be used to define the dependencies and deploy the training job.
The dataset for this project is obtained from the Stanford Dogs Dataset. It contains 20,580 images of 120 dog breeds. Since the dataset is large, it is not included in the GitHub repository.
You can manually download the images from here. The downloaded file will be in tar format. Once downloaded, extract it and save the extracted Images directory in the training/data directory of the cloned git repository on your computer.
Open a command line and then navigate to the root of the cloned git repository on your computer. Then run the following command:
bash ./training/download_data.sh
All the required data preprocessing can be done using the training/data_preperation.ipynb Jupyter notebook.
To deploy a training job on GoogleCloud, you need a GoogleCloud account and a project with billing enabled. You can follow the instructions in the section Set up your GCP project in this link.
After completing the above steps, you can use the Cloud SDK to deploy your model training on GoogleCloud.
First, log in using the command:
gcloud auth login
Create a new project or select the existing project that you will use to deploy the training job. Activate the project on the command line using:
gcloud config set project <your_gcloud_project_id>
When training on the cloud, it is good practice to name the resources and directories in a structured way. For this purpose, I have defined all the names in the bash script export_project_variables.sh. You can modify the assigned values of the variables as required, but all the variables are needed to follow the rest of this guide. Importantly, use the same region as your project's region. Also change the hostname part of the $IMAGE_URI accordingly; for example, eu.gcr.io is the hostname for the Europe region. Hostnames for other regions can be obtained from here.
Export all these variables using the command:
source ./training/export_project_variables.sh
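For reference, the exported variables might look roughly like the following sketch. The values shown here are placeholders, not the actual contents of the script, so adapt them to your own project:
export REGION=europe-west1
export BUCKET_NAME=<your_gcloud_project_id>-dog-breed-training
export DATA_DIR=gs://${BUCKET_NAME}/data
export IMAGE_URI=eu.gcr.io/<your_gcloud_project_id>/dog-breed-trainer:latest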
First, install Docker by following this guide.
Depending on whether you train on CPU or GPU, I have created the Docker files Dockerfile_trainer_cpu and Dockerfile_trainer_gpu respectively. The Python packages required for model training are specified in the training/requirements.txt file and are installed when the Docker images are built. I do not recommend using the CPU to train this model as it takes very long. The CPU Docker file is included only to show that a CPU can also be used, for example when training a small neural network model.
Use the command below to build the Docker image for GPU training.
docker build -f Dockerfile_trainer_gpu -t ${IMAGE_URI} .
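If you want to build the CPU image instead, for example to test the container locally, the analogous command would be (you may want to use a different image tag so that the two images do not overwrite each other):
docker build -f Dockerfile_trainer_cpu -t ${IMAGE_URI} .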
First, you need to create a cloud storage bucket to which you will upload the training data. Assuming you are using the same terminal where the project variables were exported earlier, run the following commands:
gsutil mb -l ${REGION} gs://${BUCKET_NAME}
gsutil -m cp training/data/* ${DATA_DIR}
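Optionally, you can verify that the upload succeeded by listing the contents of the data directory in the bucket:
gsutil ls ${DATA_DIR}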
Then push the built Docker image to the GoogleCloud container registry using:
docker push ${IMAGE_URI}
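If the push fails with an authentication error, Docker may not yet be authorized to access the GoogleCloud container registry. In that case, run the following once and then retry the push:
gcloud auth configure-docker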
Next, export the job variables, which are used to identify the results of your training job through timestamp-based naming of the directories for logs and models:
source ./training/export_job_variables.sh
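As a rough sketch, the job variables could be derived from a timestamp along the following lines; the exact naming scheme is whatever the script in the repository defines:
TIME_STAMP=$(date +%Y%m%d_%H%M%S)
export JOB_NAME=dog_breed_training_${TIME_STAMP}
export MODELS_DIR=gs://${BUCKET_NAME}/models/${JOB_NAME}
export LOGS_DIR=gs://${BUCKET_NAME}/logs/${JOB_NAME}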
Finally submit the training job using the command:
gcloud ai-platform jobs submit training ${JOB_NAME} \
--region ${REGION} \
--scale-tier BASIC_GPU \
--master-image-uri ${IMAGE_URI} \
-- \
--model_dir=${MODELS_DIR} \
--epochs=30 \
--use_cuda \
--data_dir=${DATA_DIR} \
--batch_size=120 \
--test_batch_size=400 \
--log_dir=${LOGS_DIR}
Note: The line with -- \ separates the gcloud arguments from the command line arguments that are passed to the Python program running in the Docker container.
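After submission, you can check the status of the job and stream its logs using the standard gcloud commands:
gcloud ai-platform jobs describe ${JOB_NAME}
gcloud ai-platform jobs stream-logs ${JOB_NAME}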
The GoogleCloud App engine is used for deploying the trained model. The app uses Flask and is served using the WSGI server Gunicorn. The Python package dependencies are specified in the app/requirements.txt file. The Python runtime version, the entrypoint, and the hardware instance for the GoogleCloud app engine are defined in the app.yaml file.
First, create an App Engine app (ideally in the same GoogleCloud project) using the command:
gcloud app create --region=<your-region>
Then navigate to the app directory of the cloned GitHub repository on your computer. Create a directory called models. Download your trained model from GoogleCloud storage and copy it into this directory. Rename the model file to torch_model. Alternatively, you can choose your own directory and model name and define it in the app/utils.py file using the variable MODEL_PATH.
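For example, assuming the trained model was written to the ${MODELS_DIR} path exported earlier, the download could look like the following, where the model file name is only a placeholder:
gsutil cp ${MODELS_DIR}/<your_model_file> app/models/torch_model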
Run the following command on the command line from the app directory:
gcloud app deploy
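Once the deployment finishes, you can open the deployed app in your browser with:
gcloud app browse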
Screenshots of the deployed app: the home page, and the prediction results page.
Copyright 2020 Brian Pinto