An explanation of Docker's architecture and of how to create a running container for your app and your PostgreSQL database.

Mental picture: with Docker you put your application in an environment that you can control extensively. To create this environment, you either create an image (write a Dockerfile, then build it with docker build) or you use an existing image. 'Environment' in this context means the filesystem, which is the directory structure plus all its contents, as seen from inside the container.

Docker architecture

Everything you need for Docker can be pulled in by installing Docker Desktop. It includes the Docker CLI, so you can run commands from a shell, and also the so-called daemon, or dockerd, that manages the containers. These two communicate through a REST API, by default over a UNIX socket, but you can configure them to communicate over a network interface (less safe but suitable if client and daemon live on different machines).
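
To make this concrete: by default the CLI talks to the daemon over the socket /var/run/docker.sock, and the -H flag lets you point it elsewhere. A small sketch (remote-host is a placeholder):

docker -H unix:///var/run/docker.sock version   # explicit form of the default

docker -H tcp://remote-host:2375 version   # daemon on another machine, assuming it was started with dockerd -H tcp://0.0.0.0:2375 (unencrypted, trusted networks only)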

Another important element in Docker is the registry, where images are centrally stored. The default registry is Docker Hub but you can create your own, which you might typically do in an organization.

Furthermore there are Buildx and BuildKit, which are responsible for building images. Buildx is the client and is invoked when you use the docker build command on the CLI; BuildKit is the backend that receives instructions from Buildx and does the actual building.

The other relevant elements of Docker are Docker Compose, a tool to define and run multi-container applications, and Docker Swarm, an orchestration service. The latter is an alternative to Kubernetes.
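
As a small taste of Compose, here is a hedged sketch of a compose.yaml for the kind of app-plus-PostgreSQL setup this post works towards; the service names and the myapp image are assumptions:

services:
  app:
    image: myapp:latest   # assumption: an image you built yourself
    ports:
      - "8080:8080"
    depends_on:
      - db
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: testuser
      POSTGRES_PASSWORD: testpass
      POSTGRES_DB: testdb
    volumes:
      - pgdata:/var/lib/postgresql/data
volumes:
  pgdata:

Everything then starts with a single docker compose up -d.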

Underlying Linux and other software

Docker relies on three Linux elements that predated Docker, namely chroot, cgroups and namespaces. The existence of these kernel features shows that the creation of isolated environments, where applications cannot see the underlying OS filesystem or its environment variables, was recognized early on as something important.

Apart from these Linux kernel features, Docker also relies on containerd as the basis for individual containers. Containerd is an open source tool that provides a layer of abstraction on top of the basic Linux kernel features and specializes in running containers. It gets its instructions from dockerd. Between the kernel features chroot, cgroups and namespaces and containerd lives another creature named runc, the low-level runtime that actually creates and starts containers. It is not part of Linux and is written in Go.

Both containerd and runc play a role in the cloud and in Kubernetes, and are highly relevant outside of the Docker context.

About images

A container is a process, an image is a file

While the term container suggests something similar to a file, or at least something that will still exist when the machine is switched off, it isn't. A container is a process based on an image, which actually is a (large) file. Once you have created a container on a machine you can start and stop it without rebuilding or re-running it every time, but migrating it means issuing another 'docker run' command on the other system.
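
A quick sketch to see that stopping and starting does not rebuild anything (the name web is arbitrary):

docker run -d --name web nginx   # creates and starts a container from the nginx image
docker stop web                  # the process ends, but the container still exists
docker start web                 # same container, same filesystem, no rebuild
docker rm -f web                 # only now is the container gone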

Image as a layered construct

An image is built following a recipe found in a Dockerfile that you, or someone else, composed. A Dockerfile contains a list of instructions, whereby every instruction affects the resulting filesystem and its contents (the 'environment'). While as a user you are only interested in the final filesystem, the image actually stores multiple filesystem snapshots: one for every instruction that changes files (such as RUN, COPY and ADD). Each is called a layer.

The rationale for layering is that layers can be reused, which saves time. Say two images start from the same base and run the same install instruction, resulting in all sorts of new files and folders added to the environment; then only the first install has to take place. The second image can reuse the layer that is the result of the first installation.

I asked ChatGPT what this means if you have instructions that delete content. Say I add an instruction that copies 100 MB of files to the environment/filesystem, and a next instruction deletes them. The final filesystem is the same as if neither instruction had taken place, but the image will still contain a layer with all the copied files. The image will have those 100 MB stored as part of the first layer.

One consequence of this is that if you want to limit the size of the final image and thus want to clean up during the build process, you cannot delete the mess you made in one instruction ("copy all sorts of temporary installation files to my environment") in a later instruction ("now delete the garbage"). The workaround is the RUN instruction, which allows you to chain commands with &&. Because both the mess making and the cleaning happen within the same instruction, and thus the same layer, no residue persists in the final image.
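
A minimal sketch of the difference, with a made-up archive URL and paths as placeholders (and assuming curl and tar are present in the base image):

# Wasteful: the archive stays stored in the first RUN's layer,
# even though the last RUN deletes it
RUN curl -fsSL -o /tmp/big.tar.gz https://example.com/big.tar.gz
RUN tar -xzf /tmp/big.tar.gz -C /opt
RUN rm /tmp/big.tar.gz

# Better: mess making and cleaning chained in one instruction, one layer
RUN curl -fsSL -o /tmp/big.tar.gz https://example.com/big.tar.gz \
    && tar -xzf /tmp/big.tar.gz -C /opt \
    && rm /tmp/big.tar.gz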

Creating an image with a Dockerfile

To create an image you first write a Dockerfile. Name it Dockerfile, without an extension. That way you can run the docker build command in the same folder and it will pick up the instructions in this file.

A Dockerfile is an ordered list of instructions. Every instruction adds something (or possibly removes something) to the image and thus to the filesystem in the container. There are 18 different instruction types; the Dockerfile reference in the Docker docs lists and explains them all, and the same website has a more practically oriented building guide.

Of the 18 instructions, only FROM <image> is strictly required, but in practice you also want either CMD <command> or ENTRYPOINT <command> so the container knows what to run. Below is a full listing. The order is alphabetical, slightly adjusted so that similar instructions sit together:

ADD

Adds local or remote files and directories to the container filesystem. The form is one of these two, the latter to be used when there is whitespace in the paths/filenames:

ADD [OPTIONS] <src> <dest>

ADD [OPTIONS] ["<src>", "<dest>"]

The choice between these two forms exists for other instruction types as well. Instead of copying one source to one destination you can also copy multiple sources to the same destination in one line. The last argument must always be the destination.

If the destination is meant to be a directory, the path must end with a slash (/); otherwise Docker treats it as a file. This rule does not apply to the source argument, where trailing slashes are disregarded.

The last argument must be a directory when you have multiple source arguments (which can be a mix of files and directories). The following lines are correct:

ADD ["file1.txt", "file2.txt", "/usr/src/datastuff/"]

ADD file.txt /home/geert/copytocontainer/ /usr/src/datastuff/

It is also possible to add a source from a remote location. The documentation shows these valid examples:

ADD https://example.com/archive.zip /usr/src/things/
ADD git@github.com:user/repo.git /usr/src/things/

If the source is a local tar archive (plain or compressed with gzip, bzip2 or xz), it is decompressed and extracted to the specified location. There are some more ins and outs, and you can use some options; see the ADD section of the Dockerfile reference for the details.
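
For example, assuming a gzipped tar archive app.tar.gz in the build context:

ADD app.tar.gz /opt/app/   # /opt/app/ receives the extracted contents, not the archive itself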

COPY

COPY is very similar to ADD. General rule: for local files/folders use COPY, for remote content use ADD. Docker discusses the differences in its documentation; the COPY section of the Dockerfile reference has the details.
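
A typical hedged example, with the paths as assumptions (say a Java project where the build put a jar in target/):

COPY target/myapp.jar /app/myapp.jar   # a single local file into the image
COPY config/ /app/config/              # a whole local directory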

ARG

This instruction defines a variable that users can pass at build time, using the --build-arg <varname>=<value> flag. You can provide a default value so a user can omit the argument, and one ARG instruction can define multiple arguments, somewhat like the parameters of a main method in Java. The form is this:

ARG <name>[=<default value>] [<name>[=<default value>]...]

Note that because build arguments are passed by name with --build-arg, each argument can independently have a default; arguments with defaults do not have to come last.

To keep it simple, you can write a separate ARG instruction for every distinct argument. The values of the arguments are available to subsequent instructions; you can, for example, use them in a RUN instruction.

Arguments passed by ARG are ephemeral: unlike ENV variables, they are not stored in the image and thus not available after the build. More details in the reference.
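
A hedged sketch of ARG in action; the variable name APP_VERSION and the version numbers are made up:

FROM ubuntu
ARG APP_VERSION=1.0.0
RUN echo "building version $APP_VERSION" > /version.txt

# build with the default:      docker build .
# or override the default:     docker build --build-arg APP_VERSION=2.0.0 .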

ENV

In the Dockerfile you can set environment variables that are then available in the running container. The values can be hardcoded in the Dockerfile, but you can also write it so that they can be passed as arguments at build time, using ARG. Example:

FROM ubuntu
ARG CONT_IMG_VER
ENV CONT_IMG_VER=${CONT_IMG_VER:-v1.0.0}
RUN echo $CONT_IMG_VER

The snippet above lets the user set CONT_IMG_VER at build time. If no argument is passed, the ${...:-default} syntax makes the environment variable fall back to the default (v1.0.0); passing a build argument overrides it:

docker build --build-arg CONT_IMG_VER=v2.0.0 .

Be aware that setting environment variables in the build process is different from setting them at run time. An example below shows how the official PostgreSQL image requires the user who runs (not builds) the container to set values for specific environment variables using the -e flag. More on ENV in the Dockerfile reference.

Basic commands

Docker can be fairly well understood when you know the following commands:

  • docker run
  • docker build
  • docker pull

docker run

The run command presupposes the existence of an image. Any run command requires an image as argument, specifically as the last argument, unless you append a command and arguments to run inside the container. This image can be an image from Docker Hub or one that you got from a colleague. The typical form is:

docker run [OPTIONS] IMAGE [COMMAND] [ARG...] 

Take this example, in which a self-built image is used ('myapp'):

docker run -d -p 8080:8080 -v /home/user/data:/app/data myapp:latest

Or this one, which starts up a container with a PostgreSQL database:

docker run --name my-postgres \
  -e POSTGRES_USER=testuser \
  -e POSTGRES_PASSWORD=testpass \
  -e POSTGRES_DB=testdb \
  -v pgdata:/var/lib/postgresql/data \
  -p 5432:5432 \
  -d postgres:16

Note that both these run commands end with the name of an image, not followed by a command or arguments. Most often this is sufficient, but there are occasions where you want to provide a command with arguments and where the image accommodates this:

docker run --rm python:3.11 python -c "print(3 * 7)"   # runs Python and prints 21

docker run --rm postgres:17 postgres --version   # prints the postgres version

docker run --rm alpine:3.19 ls -l /bin   # prints a directory listing of /bin inside the container

Relevant option flags

A complete overview of flags can be found in the docker run reference; these are the most relevant ones:

flag     meaning
--name   Give the container a name
-d       Run the container in the background (detached); takes no value
-p       Publish a container's port to the host; the first number is the host port, the second the port inside the container
-v       Bind mount a volume; for persistence, connects a path (or named volume) on the host to a path within the container
-e       Set an environment variable
--rm     Automatically remove the container and its associated anonymous volumes when it exits

Data persistence

You want the -v flag when the data must be safely stored. You cannot safely store it inside the container: a container's writable filesystem survives stop and start, but disappears when the container is removed. Binding a directory inside the container to a directory on the host means the inner directory is a direct reference to the external directory; they are the same. Alternatively you can use a named volume managed by Docker, as the pgdata example above does.
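
Both flavors of -v in one sketch (the host path is illustrative; POSTGRES_PASSWORD is set because the image insists on it, see below):

# Bind mount: host directory and container directory are the same directory
docker run -d -e POSTGRES_PASSWORD=testpass \
  -v /home/geert/pgdata:/var/lib/postgresql/data postgres:16

# Named volume: Docker manages the storage itself (the pgdata example above);
# inspect where it lives with:
docker volume inspect pgdata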

Use of environment variables

Environment variables can be set at run time, which is useful in combination with a Spring application.properties file that refers to specific environment variables for database credentials. In the case of the official PostgreSQL image, the maintainers of this 'Docker Official Image' have explicitly included a script that uses the POSTGRES_USER, POSTGRES_DB and POSTGRES_PASSWORD environment variables. You must set them with exactly those names (at minimum POSTGRES_PASSWORD, or the container refuses to start), otherwise the image won't work. It is written in their documentation; ChatGPT just advised me to read it.
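
On the application side, Spring Boot's relaxed binding maps environment variables such as SPRING_DATASOURCE_URL onto properties such as spring.datasource.url. A sketch, with the myapp image assumed and assuming both containers share a user-defined network so the hostname my-postgres resolves:

docker run -d --name myapp \
  -e SPRING_DATASOURCE_URL=jdbc:postgresql://my-postgres:5432/testdb \
  -e SPRING_DATASOURCE_USERNAME=testuser \
  -e SPRING_DATASOURCE_PASSWORD=testpass \
  -p 8080:8080 myapp:latest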

docker build

Once you have a Dockerfile you can run a build command to create the corresponding image. This is the basic command:

docker build -t myname:sometag .

The dot at the end sets the build context to the current directory, which is also where Docker expects to find the Dockerfile by default. The -t flag lets you set a name and a tag. See the docker build reference for extra build options.
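
Two hedged variations (the names are illustrative): the Dockerfile does not have to live in the context's root, the -f flag can point to it explicitly:

docker build -t myapp:1.0 .                        # Dockerfile in the current directory
docker build -t myapp:1.0 -f docker/Dockerfile .   # Dockerfile elsewhere, context still .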

