
How to use data from Google Drive in Colab and what can possibly go wrong

Sep 05, 2020 · 7 mins read

Google Colab is extremely useful if you want to train your machine learning models in the cloud for free. It gives you a free GPU and a decent amount of disk space for your training runs. Moreover, you can use your own Google Drive to store data for Colab notebooks. I will show you how to make Drive and Colab work together, and also what can go wrong between those two.

Mount Google Drive in Colab notebooks

Connecting (mounting) your Google Drive is fairly simple. You need to add and execute a code cell with the code you can see below. After running the code, you will be asked to finish the authentication step (a link will pop up in cell output). Then choose your Google account and confirm that you want to connect Colab with Drive. You will receive a secret token that has to be inserted into an input field in the cell you executed before. Well, that all sounds more complicated than it is.
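The mounting cell itself is just two lines (the same snippet shows up again in the final recipe at the end of this post):

from google.colab import drive

# Running this cell starts the authentication flow described above
drive.mount('/content/drive')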


Figure 1. Code cell asking you to enter authorization code to mount Drive.

After you complete that step, you should be able to access your Google Drive folders from Colab. In my case, all Drive files are available under the /content/drive/My Drive path, and I believe it’s the same for everyone.
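A quick way to confirm the mount worked is to list that directory from a code cell:

# List the top-level contents of your Drive
!ls "/content/drive/My Drive"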

From now on, treat Google Drive as if it were your local directory. You can use it for all kinds of stuff: datasets, pre-trained models or configuration files. I uploaded a small dataset to use in my Colab notebook for the purpose of the previous post, and here’s how I use it:

import os
import tensorflow as tf

# Path to the dataset folder on the mounted Drive
base_dir = '/content/drive/My Drive/Datasets/split-garbage-dataset/'

# dataset_config (e.g. image_size, batch_size) is defined earlier in the notebook
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    os.path.join(base_dir, 'train'),
    **dataset_config
)

There are other ways to mount Google Drive, using the PyDrive library or the native REST API, but I’m not going to describe them (take a look here for more information). Instead, I will tell you what can go wrong when using data straight from Google Drive. But first, I have to mention another way of storing files in Colab notebooks.

Uploading data directly to the Colab notebook - an alternative?

It is possible to upload your local files to Google Colab directly. Wouldn’t that be a better option?

Unfortunately, storing files in the Colab filesystem is not an option if you work with large files (or folders) and want to share your work with somebody else. The reason is that the virtual environment (files, custom libraries) does not get shared along with the notebook. In other words, when sharing Colab code you should also include code cells that install custom libraries (e.g. !pip install blahblahblah) and upload or load input files.

It’s even worse. When your session ends (and on free Colab it can last only up to 12 hours), all of the local data you uploaded gets removed. Imagine re-uploading a huge dataset every time you start working on your code. It may still be a good option if you work with a single CSV file instead of millions of images… but usually that’s not the case if you decide to use Colab 😉

So Google Drive it is. You may stop reading here; you already know how to use it. Or stay for one more minute and learn about one mistake that can make you waste a lot of time and will bring you back to this article eventually.

What could possibly go wrong…


Figure 2. I bet you have already guessed what it is...

Let’s cut to the chase: you’ve probably already guessed where the problem is. It turns out that although Google Drive is a better place to store your files than the Colab local filesystem, reading data straight from Drive can be dramatically slower. I was surprised when I bumped into this issue, because I expected Google to handle it better. But how big is the difference? And where does the performance suffer the most?

I used my previous Garbage (image) Classification notebook and measured how much time it takes to complete different steps of the pipeline, e.g. loading data, displaying sample images, computing class weights and, at the end, training. Take a look at the bar plots below.

There are two methods I compare in these plots:

  • Copying means that I use data straight from Google Drive (right after mounting it),
  • Copied means that I first copied the dataset from Drive, extracted it and used it locally from the Colab filesystem.
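For context, the timings themselves are simple to collect. Here is a minimal sketch, not the exact code from my notebook; load_data and compute_weights below are just placeholder names for the notebook’s own steps:

import time

def timed(label, fn):
    # Run a single pipeline step and report how long it took
    start = time.perf_counter()
    result = fn()
    print(f'{label}: {time.perf_counter() - start:.1f} s')
    return result

# Placeholder usage - wrap each pipeline step the same way:
# train_ds = timed('loading data', load_data)
# weights = timed('class weights', lambda: compute_weights(train_ds))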


Figure 3. Loading image data to Tensorflow Dataset (five trials)

When loading data with Keras’ image_dataset_from_directory, using local Colab data was 2 times faster. That’s not much, as it boils down to a few seconds of difference. I believe it’s because of how tf.data.Dataset works - it doesn’t read all images at once. So the real difference must be hiding somewhere else…
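You can see that laziness for yourself: creating the dataset mostly lists file paths, and images are only decoded once you actually pull a batch. A tiny illustration, reusing the train_ds from the snippet above:

# Building the dataset only lists files; the first real read happens here
images, labels = next(iter(train_ds))
print(images.shape, labels.shape)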


Figure 4. Displaying sample images for 1st dataset batch (five trials)


Figure 5. Concatenating labels and computing their weights (five trials)

And there it is. In the next two steps I displayed sample images and then computed the weights of all image labels (to use weighted classes during training). You may have trouble even noticing the red bars in Figures 4 and 5, but believe me, they’re there.
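The class-weight step is a good example of why Drive hurts so much here: it iterates over every element of the dataset, which means every single image file gets read. A rough sketch of that step (not necessarily the exact code from my notebook - I use scikit-learn’s compute_class_weight for illustration):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Concatenating the labels forces a full pass over the dataset,
# so every image gets read from wherever it is stored
all_labels = np.concatenate([labels.numpy() for _, labels in train_ds])

weights = compute_class_weight(
    class_weight='balanced',
    classes=np.unique(all_labels),
    y=all_labels
)
class_weights = dict(enumerate(weights))  # pass to model.fit(..., class_weight=class_weights)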

Displaying images and computing weights took around 3-4 seconds on average when using data that had been copied to Colab. When using images directly from Google Drive, it was around 455 seconds to show the samples and 250 seconds to compute the class weights (that’s an average over 5 experiments, see Figures 4 and 5).

Using files straight from the Google Drive directory was, in my case, up to 100-120x (not %) slower!


Figure 6. You can also see the difference in training time...but it's smaller than 100x

How to do it the right way

The solution is simple, and I already told you what to do: uploading datasets, model weights or whatnot to the Colab filesystem is not an option, because you would need to do that again and again every time you start a session. Instead, mount Google Drive (where you store the data) and add a code cell that copies the files to the Colab environment.

from google.colab import drive
drive.mount('/content/drive')
# It's better to copy data from Drive as a single archive
# (one big file transfers much faster than many small ones)
zip_path = '/content/drive/My Drive/Data/split-garbage-dataset.zip'

!cp "{zip_path}" .

!unzip -q split-garbage-dataset.zip

# Remove .zip file after you unzip it
!rm split-garbage-dataset.zip

# Make sure it's there
!ls

Every time you start a new session you need to run these cells again, but 1) you don’t need to store data locally and 2) I believe it’s way faster to transfer data between Drive and Colab than to upload it from your local computer. Here you go!

I’m curious if you had that problem with your notebooks before, let me know and thanks for reading 👋