Npm Broke Arm64 Pi Garage Builds

I know, this sounds crazy, right? Let's rewind a bit so we can try to understand why.

In a previous post, Material 3 + Flutter 3.16.0 Broke Pi Garage Theme, I stated that you should pin core dependencies. Apart from the Flutter version there was another place where I did not pin dependencies specifically enough: the Dockerfile. Here I had specified the Node version as below.

FROM node:18 

As you can see, although this limits the Node version to 18, it allows anything from 18.0.0 through to the latest 18.x release. Something very interesting happened with version 18.19.0. In the Release Notes one of the first things mentioned is that npm 10 has been backported and included. This, like the Flutter + Material 3 issue linked above, caught me by surprise.

Pi Garage Service GitHub Action

I’m going to “deep dive” into the CI platform and how Pi Garage’s backend service is built and published to be multi-architecture. By multi-architecture I mean the Docker Image supports both 32-bit ARM (arm/v7) and 64-bit ARM (arm64).

The build pipeline for the backend service looks like the below, and then we will walk through it.

docker:
  runs-on: ubuntu-latest
  steps:
    - name: Checkout
      uses: actions/checkout@v4

    - name: Set up QEMU
      uses: docker/setup-qemu-action@v3

    - name: Set up Docker Buildx
      uses: docker/setup-buildx-action@v3

    - name: Extract Docker Metadata
      id: meta
      uses: docker/metadata-action@v5
      with:
        images: michaelgambold/pi-garage

    - name: Login to DockerHub
      uses: docker/login-action@v3
      with:
        username: #####
        password: #####

    - name: Build and push
      uses: docker/build-push-action@v5
      with:
        context: ./service
        platforms: linux/arm64,linux/arm/v7
        push: true
        tags: ${{ steps.meta.outputs.tags }}
        labels: ${{ steps.meta.outputs.labels }}
        cache-from: type=gha
        cache-to: type=gha,mode=max

Let’s start from the end and work backwards. The magic for the multi-architecture builds is in the final “build-push-action” step. In it you define the platforms, and these are then passed to buildx, which is set up higher in the stack.

Buildx and QEMU allow you to emulate different architectures when building the Docker Images. This is important as the GitHub-hosted build machines are x64 architecture. As of writing there are no ARM-based GitHub-hosted runners.

So, recapping: you have an x64 machine that runs an emulated environment via QEMU/buildx to then build an arm64 architecture Docker Image.
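If you want to reproduce this setup locally, the GitHub Action steps roughly correspond to the buildx/QEMU commands below. This is just a sketch; the builder name and image tag are made up for illustration.

# Register QEMU emulators for foreign architectures (what setup-qemu-action does)
docker run --privileged --rm tonistiigi/binfmt --install all

# Create and select a buildx builder that can target multiple platforms
docker buildx create --name multiarch --use

# Build both ARM variants on an x64 machine (add --push to publish the manifest)
docker buildx build \
  --platform linux/arm64,linux/arm/v7 \
  -t michaelgambold/pi-garage:local-test \
  ./service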

So What Happened

The builds suddenly stopped working for arm64. Unfortunately this prevented the release of the v2.2.0 backend service, which actually had a useful bug fix.

Looking at the output from the GitHub Action, there is just a timeout when the command “npm ci” is run inside the Dockerfile. This did not happen locally during testing/debugging as I am using a native ARM machine (M1 Max MacBook Pro).

At first, based on the error message, I looked into the timeouts, thinking that the problem was building both architectures at the same time. The Dockerfiles are multi-stage builds, which means I don’t have to prune development dependencies from the production build; I simply have two stages for dependencies (one running “npm ci” and one running “npm ci --omit=dev”). Although this is probably not the most efficient approach, it guarantees only production dependencies end up in the final image, and it obeys the KISS principle (Keep It Simple, Stupid).
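For context, the shape of that multi-stage Dockerfile is roughly as follows. This is only a sketch of the pattern; the stage names, paths, build script and start command are assumptions, not the actual Pi Garage Dockerfile.

# Stage 1: full dependencies (including dev) so the application build can run
FROM node:18 AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 2: production-only dependencies
FROM node:18 AS prod-deps
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev

# Final stage: copy in only what is needed at runtime
FROM node:18
WORKDIR /app
COPY --from=prod-deps /app/node_modules ./node_modules
COPY --from=build /app/dist ./dist
CMD ["node", "dist/main.js"]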

I initially started off only building one architecture (the 32-bit arm/v7 architecture) and this worked. I thought this validated my hypothesis that I was making too many requests to npm in a given period, as the architectures/stages would run at the same time (i.e. four “npm ci” commands running concurrently).

However, when I duplicated the step in the GitHub Action to build the 64-bit arm64 architecture Docker Image on its own, it would still fail. This had me completely stumped as I thought I was on the right track.
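The duplicated (hacky) steps looked something along these lines. Again, a sketch only; I have trimmed the metadata, label and cache options, and the step names are made up.

    - name: Build and push (arm/v7)
      uses: docker/build-push-action@v5
      with:
        context: ./service
        platforms: linux/arm/v7
        push: true
        tags: ${{ steps.meta.outputs.tags }}

    - name: Build and push (arm64)
      uses: docker/build-push-action@v5
      with:
        context: ./service
        platforms: linux/arm64
        push: true
        tags: ${{ steps.meta.outputs.tags }}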

I then started to look more deeply into the last successful run, as well as comparing version numbers against what was actually running (Node, Docker, etc.).

When looking at the output as the Node image is pulled from Docker Hub, it doesn’t really give any information on what version of Node has been pulled, just that it was the Image that matched the tag (in this case “node:18”).
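One way to see exactly what a floating tag has resolved to is to ask the image itself, for example:

docker run --rm node:18 node --version
docker run --rm node:18 npm --version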

I checked the versions of Node and found that 18.19.0 had been released, and checked that against my local development environment, which had Node 18.17.x. To be sure, I updated my local Node version to 18.19.0 and it built fine on my machine (not emulated, as it is already arm64).

I did however notice that Node 18.19.0 was released between my last successful build and my first failing build. In my Dockerfile I specified a major.minor version for Node to see if this helped.

FROM node:18.18

This, in theory, would lock Node to the previous minor version, which uses npm 9 instead of npm 10.

Success 🎉, this worked. Now all I had to do was understand why and fix up my horrible hacks to the build pipeline.

After some “Googling”, it turned out other people were having the same issue: https://github.com/nodejs/docker-node/issues/1335.

I haven’t released a new production build yet, but the staging builds that are published on every merge into GitHub (under the “main” tag) have worked.
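If I wanted to lock things down even further, I could pin the exact patch release in the Dockerfile. The tag below is only an example of the idea; check Docker Hub for the current 18.18.x tag.

FROM node:18.18.2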

Fool Me Once, Shame on You; Fool Me Twice, Shame on Me!

This is the second time I have been bitten (in a week) by not limiting my dependencies specifically enough. Although statistically unlikely, in software engineering these sorts of things happen (especially at release time). This is often (and was true in this case) due to long gaps between my own software releases while the software I depend on changes at such a rapid pace.

In software engineering I always try to take a positive path from any issue. The systems in place should not allow such failures to occur (especially in an enterprise environment). If they have occurred, we need to put systems in place to prevent such things from happening again, or failing that, have a way for them to be discovered before making it to production.

As stated earlier, there is a build that is published to Docker Hub under the “main” tag, which is the latest code (working or not) on the main branch of the repository. Although I don’t have the hardware, this would allow me to run a second Pi Garage that would be used for testing purposes before releasing.

For the mobile apps I intend to set up TestFlight for iOS and an equivalent for Android so that I can test pre-release/release versions before they are published to the wider community.

