Squashing large Docker images

Matt discusses how to keep Docker image sizes down when modifying the same file regularly.

In an earlier post, I discussed keeping a data set in a Docker image. I showed how it's possible to script a process that pulls the image from a registry, migrates the contained database to the latest version, and pushes the new image back up. This makes it possible to keep the latest tag up to date.

We noticed that each time we used this approach, the resulting Docker image was a few gigabytes larger than the previous one 😱 Before long we'd be asking any new developer to download a Docker image that was tens of gigabytes, and possibly more than 100 in future! Needless to say, this didn't seem like a viable long-term approach, so we investigated what was going on. In this post, I'll dig into why our images were getting larger, and discuss a solution.

A problem with many layers

As described in the Docker overview, Docker images are made up of layers. An image is reconstructed by stacking all of its layers one on top of the other.

When making a new image, you'll usually start with another image and make some changes, resulting in one or more layers being added on top of the existing layers. Often these changes are declared in a Dockerfile, but it's also possible to make them manually against a container running the image and commit the diff into a new layer. This is what the docker commit command was used for in the previous post.

The docker history command allows you to see some basic information about the layers that make up an image. When looking at the layers making up our database image, we can see that the most recent layers are all large and all created by the sqlservr process.

$ docker history bigdb:latest
IMAGE          CREATED        CREATED BY                                      SIZE      COMMENT
dd2cb8f47e32   44 hours ago   /opt/mssql/bin/sqlservr                         3.79GB    
cd1b049c9bcc   2 weeks ago    /opt/mssql/bin/sqlservr                         3.79GB    
e7f4dd7601a1   2 weeks ago    /opt/mssql/bin/sqlservr                         3.87GB    
b52639c6c532   7 weeks ago    /opt/mssql/bin/sqlservr                         4.68GB    
<missing>      5 months ago   /bin/sh -c #(nop)  CMD ["/opt/mssql/bin/sqls…   0B        
<missing>      5 months ago   /bin/sh -c #(nop)  ENTRYPOINT ["/opt/mssql/b…   0B        
<missing>      5 months ago   /bin/sh -c #(nop)  USER mssql                   0B        
<missing>      5 months ago   /bin/sh -c /tmp/install.sh                      75.2MB    
<missing>      5 months ago   /bin/sh -c #(nop) COPY dir:16849aa04138bf48d…   1.24GB    
<missing>      5 months ago   /bin/sh -c #(nop)  EXPOSE 1433                  0B        
<missing>      5 months ago   /bin/sh -c #(nop)  LABEL vendor=Microsoft co…   0B        
<missing>      5 months ago   /bin/sh -c #(nop)  MAINTAINER dpgswdist@micr…   0B        
<missing>      5 months ago   /bin/sh -c /tmp/apt-get.sh                      92.1MB    
<missing>      5 months ago   /bin/sh -c #(nop) COPY file:b2f70b16162a0a54…   367B      
<missing>      6 months ago   /bin/sh -c #(nop)  CMD ["bash"]                 0B        
<missing>      6 months ago   /bin/sh -c #(nop) ADD file:122ad323412c2e70b…   72.8MB    

This makes sense:

  • A SQL Server database has all of its data stored in an .mdf file. (There is also an .ldf file for transaction logs, but this should be small with the database configuration we've set up.)
  • Each database migration and docker commit we run creates a new layer with the .mdf file changed.
  • This file is large because our data set is large.
  • Because a large file is changed between each commit, each layer is large.
  • Because each image has one more (large) layer than the previous one, each successive image is significantly larger than the previous version.
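To put numbers on this, here is a minimal sketch that adds up the layer sizes from the docker history listing above. The sizes are copied by hand from that listing (in MB, ignoring the 367B and 0B layers), and sum_mb is a hypothetical helper written for this illustration, not part of Docker.

```shell
#!/bin/sh
# Layer sizes in MB, copied by hand from the `docker history bigdb:latest` listing.
# The 367B layer and the 0B metadata layers are ignored as negligible.
base_layers="72.8 92.1 1240 75.2"      # layers inherited from the SQL Server base image
commit_layers="4680 3870 3790 3790"    # one layer per committed migration, oldest first

# sum_mb LIST: add up a space-separated list of sizes (hypothetical helper).
sum_mb() {
  echo "$1" | awk '{ for (i = 1; i <= NF; i++) total += $i } END { printf "%.1f\n", total }'
}

echo "base layers:      $(sum_mb "$base_layers") MB"
echo "committed layers: $(sum_mb "$commit_layers") MB"
echo "whole image:      $(sum_mb "$base_layers $commit_layers") MB"
```

The committed layers dominate the total, and every further migration adds another layer of roughly the same size as the .mdf file.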

What can be done?

Cutting out the middle, man

Docker has an experimental --squash option for the build command. The docs say:

Squashing layers can be beneficial if your Dockerfile produces multiple layers modifying the same files

It's worth bearing in mind that there are drawbacks too:

  • When squashing layers, the resulting image cannot take advantage of layer sharing with other images, and may use significantly more space. Sharing the base image is still supported.
  • When using this option you may see significantly more space used due to storing two copies of the image, one for the build cache with all the cache layers intact, and one for the squashed version.
  • While squashing layers may produce smaller images, it may have a negative impact on performance, as a single layer takes longer to extract, and downloading a single layer cannot be parallelized.

In our case, there's approximately 1.4GB of layers that won't be reusable between successive versions of our database image. The payoff is that a pull is always roughly 4GB on top of that, irrespective of which other images (and hence layers) are already present locally. For example, a developer who has the previous version of the image and pulls the latest one needs to download a little over 5GB, about 1.4GB more than the roughly 4GB they'd download if we weren't squashing images. However, a developer who is two or three versions behind still only downloads about 5GB, rather than 8GB or 12GB. This is a big win.

Even better, the image size is no longer growing. Without squashing, the image grows by about 4GB with each migration that's applied. With squashing, the image is always roughly 5GB. This means much less space is required to use the latest images.
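The trade-off can be sketched as a small calculation. The figures below are the rounded ones used in the text (1.4GB of base layers, 4GB per data layer), not measurements, and the helper functions are hypothetical names chosen for this example. The unsquashed pull assumes the developer already has the base layers locally.

```shell
#!/bin/sh
# Rounded figures from the text, in MB (approximations, not measurements).
BASE_MB=1400   # layers inherited from the base image
DATA_MB=4000   # one layer containing the full .mdf file

# Download size for a developer who is $1 versions behind (hypothetical helpers).
# Unsquashed: one new data layer per missed version.
pull_unsquashed_mb() { echo $(( $1 * DATA_MB )); }
# Squashed: the image is a single layer, so the pull never depends on $1.
pull_squashed_mb() { echo $(( BASE_MB + DATA_MB )); }

# Total image size after $1 migrations have been committed.
size_unsquashed_mb() { echo $(( BASE_MB + $1 * DATA_MB )); }
size_squashed_mb() { echo $(( BASE_MB + DATA_MB )); }

for n in 1 2 3; do
  echo "$n behind: unsquashed pull $(pull_unsquashed_mb "$n") MB, squashed pull $(pull_squashed_mb) MB"
  echo "$n migrations: unsquashed image $(size_unsquashed_mb "$n") MB, squashed image $(size_squashed_mb) MB"
done
```

Squashing only loses out when a developer is exactly one version behind. From two versions behind onwards, the squashed pull is the smaller one.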

As mentioned in a GitHub issue requesting the --squash option for docker commit, it's possible to squash committed images by making a Dockerfile that refers to them. The Dockerfile's contents just need to be FROM bigdb. Then, after enabling experimental Docker Engine features, running docker build --squash -t smalldb . from the directory containing the Dockerfile creates a squashed image.
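Scripted, those steps might look something like the sketch below. The write_squash_dockerfile helper and the RUN_DOCKER_BUILD variable are hypothetical names introduced for this example. The build is guarded because it needs a Docker daemon with experimental features enabled and a local bigdb image.

```shell
#!/bin/sh
# write_squash_dockerfile DIR IMAGE: write a one-line Dockerfile that refers to
# the committed image (hypothetical helper for this sketch).
write_squash_dockerfile() {
  printf 'FROM %s\n' "$2" > "$1/Dockerfile"
}

# Use a throwaway directory as the build context.
context_dir=$(mktemp -d)
write_squash_dockerfile "$context_dir" bigdb
cat "$context_dir/Dockerfile"

# RUN_DOCKER_BUILD is an assumed opt-in switch: the build only runs when it is
# set to 1, since it requires experimental features and the bigdb image.
if [ "${RUN_DOCKER_BUILD:-0}" = "1" ]; then
  docker build --squash -t smalldb "$context_dir"
fi
```

Running it with RUN_DOCKER_BUILD=1 set in the environment performs the build. Without it, the script only writes and prints the Dockerfile.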

As expected, the image is a little over 5GB:

$ docker history smalldb:latest
IMAGE          CREATED        CREATED BY                                      SIZE      COMMENT
453bd1fba16f   4 days ago                                                     5.26GB    create new from sha256:1e7db80083bcfbdbf4cfc57b2b2639f231d1be6f02675019274bdc17f2a1ae44
<missing>      4 days ago     /opt/mssql/bin/sqlservr                         0B        
<missing>      2 weeks ago    /opt/mssql/bin/sqlservr                         0B        
<missing>      2 weeks ago    /opt/mssql/bin/sqlservr                         0B        
<missing>      7 weeks ago    /opt/mssql/bin/sqlservr                         0B        
<missing>      5 months ago   /bin/sh -c #(nop)  CMD ["/opt/mssql/bin/sqls…   0B        
<missing>      5 months ago   /bin/sh -c #(nop)  ENTRYPOINT ["/opt/mssql/b…   0B        
<missing>      5 months ago   /bin/sh -c #(nop)  USER mssql                   0B        
<missing>      5 months ago   /bin/sh -c /tmp/install.sh                      0B        
<missing>      5 months ago   /bin/sh -c #(nop) COPY dir:16849aa04138bf48d…   0B        
<missing>      5 months ago   /bin/sh -c #(nop)  EXPOSE 1433                  0B        
<missing>      5 months ago   /bin/sh -c #(nop)  LABEL vendor=Microsoft co…   0B        
<missing>      5 months ago   /bin/sh -c #(nop)  MAINTAINER dpgswdist@micr…   0B        
<missing>      5 months ago   /bin/sh -c /tmp/apt-get.sh                      0B        
<missing>      5 months ago   /bin/sh -c #(nop) COPY file:b2f70b16162a0a54…   0B        
<missing>      6 months ago   /bin/sh -c #(nop)  CMD ["bash"]                 0B        
<missing>      6 months ago   /bin/sh -c #(nop) ADD file:122ad323412c2e70b…   0B        

Summary

If you make an image from multiple layers which modify the same files, and those files are large in each layer, squashing can reduce the total size of your image. Beware that the squashed image will consist of a single layer, so won't be able to share layers with other images. In short, there are tradeoffs with this approach, but it can sometimes be beneficial and is a good tool to have in your belt.