In an earlier post, I discussed keeping a data set in a Docker image. I showed how it's possible to script a process that pulls the image from a registry, migrates the contained database to the latest version, and pushes the new image back up. This makes it possible to keep the `latest` tag up to date.
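That process can be sketched as a short script. This is an illustrative sketch only: the registry URL, image name, and migration step are placeholders, not the actual pipeline from the earlier post.

```shell
# Sketch of the update pipeline; names and registry are placeholders.
docker pull registry.example.com/bigdb:latest
docker run --detach --name bigdb registry.example.com/bigdb:latest

# ... run the database migrations against the running container ...

# Capture the changed files (including the database's .mdf) as a new layer.
docker image commit bigdb registry.example.com/bigdb:latest
docker push registry.example.com/bigdb:latest
```

Each `docker image commit` adds one layer on top of the pulled image, which is where the growth described below comes from.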
We noticed that each time we used this approach, the resulting Docker image was a few gigabytes larger than the previous one 😱 Before long we'd be asking any new developer to download a Docker image that was tens of gigabytes in size, and possibly more than 100GB in the future! Needless to say, this didn't seem like a viable long-term approach, so we investigated what was going on. In this post, I'll dig into why our images were getting larger, and discuss a solution.
A problem with many layers
As described in the Docker overview, Docker images are made up of layers. An image can then be reconstructed by adding all of the layers one on top of the other.
When making a new image, you'll usually start with another image and make some changes, resulting in one or more layers being added on top of the existing layers. Often these changes are declared in a Dockerfile, but it's also possible to make them manually against a container running the image and commit the diff as a new layer. This is what the `docker image commit` command was used for in the previous post.
The `docker history` command shows basic information about the layers that make up an image. Looking at the layers making up our database image, we can see that the most recent layers are all large, and all created by the `sqlservr` process:
$ docker history bigdb:latest
IMAGE CREATED CREATED BY SIZE COMMENT
dd2cb8f47e32 44 hours ago /opt/mssql/bin/sqlservr 3.79GB
cd1b049c9bcc 2 weeks ago /opt/mssql/bin/sqlservr 3.79GB
e7f4dd7601a1 2 weeks ago /opt/mssql/bin/sqlservr 3.87GB
b52639c6c532 7 weeks ago /opt/mssql/bin/sqlservr 4.68GB
<missing> 5 months ago /bin/sh -c #(nop) CMD ["/opt/mssql/bin/sqls… 0B
<missing> 5 months ago /bin/sh -c #(nop) ENTRYPOINT ["/opt/mssql/b… 0B
<missing> 5 months ago /bin/sh -c #(nop) USER mssql 0B
<missing> 5 months ago /bin/sh -c /tmp/install.sh 75.2MB
<missing> 5 months ago /bin/sh -c #(nop) COPY dir:16849aa04138bf48d… 1.24GB
<missing> 5 months ago /bin/sh -c #(nop) EXPOSE 1433 0B
<missing> 5 months ago /bin/sh -c #(nop) LABEL vendor=Microsoft co… 0B
<missing> 5 months ago /bin/sh -c #(nop) MAINTAINER dpgswdist@micr… 0B
<missing> 5 months ago /bin/sh -c /tmp/apt-get.sh 92.1MB
<missing> 5 months ago /bin/sh -c #(nop) COPY file:b2f70b16162a0a54… 367B
<missing> 6 months ago /bin/sh -c #(nop) CMD ["bash"] 0B
<missing> 6 months ago /bin/sh -c #(nop) ADD file:122ad323412c2e70b… 72.8MB
This makes sense:
- A SQL Server database stores all of its data in an .mdf file. (There is also an .ldf file for transaction logs, but this should stay small with the database configuration we've set up.)
- Each database migration and `docker image commit` we run creates a new layer containing the changed .mdf file.
- This file is large because our data set is large.
- Because a large file changes between each commit, each layer is large.
- Because each image has one more (large) layer than the previous one, each successive image is significantly larger than the previous version.
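The arithmetic behind that last point can be made concrete with a tiny model. The figures here are rounded from the `docker history` output above and are assumptions, not exact measurements:

```python
# Illustrative model: how the unsquashed image grows with each migration.
# Sizes in GB, rounded from the `docker history` output above.
BASE_LAYERS_GB = 1.4   # OS + SQL Server base layers, shared between versions
MDF_LAYER_GB = 4.0     # each commit stores another copy of the large .mdf

def image_size_gb(n_commits: int) -> float:
    """Approximate total image size after n migrate-and-commit cycles."""
    return BASE_LAYERS_GB + n_commits * MDF_LAYER_GB

print([round(image_size_gb(n), 1) for n in (1, 2, 3)])  # [5.4, 9.4, 13.4]
```

The image grows by roughly 4GB per migration, without bound, because nothing ever removes the stale copies of the .mdf buried in earlier layers.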
What can be done?
Cutting out the middle, man
Docker has an experimental `--squash` option for the `build` command. The docs say:
Squashing layers can be beneficial if your Dockerfile produces multiple layers modifying the same files
It's worth bearing in mind that there are drawbacks too:
- When squashing layers, the resulting image cannot take advantage of layer sharing with other images, and may use significantly more space. Sharing the base image is still supported.
- When using this option you may see significantly more space used due to storing two copies of the image, one for the build cache with all the cache layers intact, and one for the squashed version.
- While squashing layers may produce smaller images, it may have a negative impact on performance, as a single layer takes longer to extract, and downloading a single layer cannot be parallelized.
In our case, there's approximately 1.4GB of layers that won't be reusable between successive versions of our database image. But the payoff is that we only have to download roughly 4GB on top of that, irrespective of which other images (and hence layers) we have locally. So, for example, if a developer had the previous version of the image on their machine and pulled the latest version down, they'd need to pull about 5GB, which is 1.4GB more than the roughly 4GB they would have to pull if we weren't squashing. However, they'd still only need to pull about 5GB if they were two or three versions behind, rather than 8GB or 12GB. This is a big win.
Even better, the image size stops growing. Without squashing, the image gains roughly 4GB for each migration that's been applied. With squashing, the image stays at roughly 5GB. This means much less space is required to use the latest images.
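To make the comparison concrete, here's a small illustrative calculation, using the same rounded figures as above (assumptions, not exact numbers from our registry):

```python
# Illustrative comparison: GB a developer pulls when k versions behind.
# Rounded figures: ~4GB per migration layer, ~5.26GB squashed image.
UNSQUASHED_LAYER_GB = 4.0
SQUASHED_IMAGE_GB = 5.26

def pull_cost_gb(versions_behind: int, squashed: bool) -> float:
    if squashed:
        # The whole image is one layer, so the pull cost is flat.
        return SQUASHED_IMAGE_GB
    # Without squashing: one new ~4GB layer per missed version.
    return versions_behind * UNSQUASHED_LAYER_GB

for k in (1, 2, 3):
    print(k, pull_cost_gb(k, squashed=False), pull_cost_gb(k, squashed=True))
# Squashing costs ~1.3GB extra when one version behind, but saves
# ~2.7GB at two versions behind and ~6.7GB at three.
```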
As mentioned in a GitHub issue requesting a `--squash` option for `docker commit`, it's possible to squash committed images by making a Dockerfile that refers to them. The Dockerfile's contents just need to be `FROM bigdb`, and then, after enabling experimental Docker Engine features, it's possible to run `docker build --squash -t smalldb .` from the directory containing the Dockerfile to create a squashed image.
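Concretely, the whole Dockerfile is a single line:

```dockerfile
# Dockerfile: the one instruction needed for a squash build of bigdb.
FROM bigdb
```

Experimental daemon features can be enabled by adding `"experimental": true` to the Docker daemon's `daemon.json` and restarting the daemon (the file's location varies by platform; `/etc/docker/daemon.json` on most Linux systems).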
As expected, the image is a little over 5GB large:
$ docker history smalldb:latest
IMAGE CREATED CREATED BY SIZE COMMENT
453bd1fba16f 4 days ago 5.26GB create new from sha256:1e7db80083bcfbdbf4cfc57b2b2639f231d1be6f02675019274bdc17f2a1ae44
<missing> 4 days ago /opt/mssql/bin/sqlservr 0B
<missing> 2 weeks ago /opt/mssql/bin/sqlservr 0B
<missing> 2 weeks ago /opt/mssql/bin/sqlservr 0B
<missing> 7 weeks ago /opt/mssql/bin/sqlservr 0B
<missing> 5 months ago /bin/sh -c #(nop) CMD ["/opt/mssql/bin/sqls… 0B
<missing> 5 months ago /bin/sh -c #(nop) ENTRYPOINT ["/opt/mssql/b… 0B
<missing> 5 months ago /bin/sh -c #(nop) USER mssql 0B
<missing> 5 months ago /bin/sh -c /tmp/install.sh 0B
<missing> 5 months ago /bin/sh -c #(nop) COPY dir:16849aa04138bf48d… 0B
<missing> 5 months ago /bin/sh -c #(nop) EXPOSE 1433 0B
<missing> 5 months ago /bin/sh -c #(nop) LABEL vendor=Microsoft co… 0B
<missing> 5 months ago /bin/sh -c #(nop) MAINTAINER dpgswdist@micr… 0B
<missing> 5 months ago /bin/sh -c /tmp/apt-get.sh 0B
<missing> 5 months ago /bin/sh -c #(nop) COPY file:b2f70b16162a0a54… 0B
<missing> 6 months ago /bin/sh -c #(nop) CMD ["bash"] 0B
<missing> 6 months ago /bin/sh -c #(nop) ADD file:122ad323412c2e70b… 0B
Summary
If you make an image from multiple layers that modify the same files, and those files are large in each layer, squashing can reduce the total size of your image. Beware that the squashed image consists of a single layer, so it won't be able to share layers with other images. In short, there are tradeoffs with this approach, but it can sometimes be beneficial and is a good tool to have in your belt.