Tuesday, 2019 July 09

The last time I updated the underlying OS and rebooted all of the nodes, I had some problems with my recipe program coming back up. The log revealed it was unable to talk to Postgres and pointed at DNS problems. Not wanting to spend much time on it, I just restarted the nodes a few more times and eventually it started working. I shrugged and moved on.

A few days later I wrote the blog post about Delaware. It never got auto-published to my beta website; something was definitely wrong. I verified that the Docker image had been built on GitLab, so next I checked on Flux. I immediately noticed DNS-related errors in the log files. It was time to sort this out.

After a few hours of fumbling investigation (notably going through the Debugging DNS Resolution troubleshooting guide), I had a good idea of what was wrong:

  1. The coredns pods and service looked good.
  2. If I specified the internal IP address of either of the coredns pods, resolution worked just fine.
  3. But any call that went through the service failed.
  4. When I went a little deeper and investigated the coredns endpoint, I discovered that both of the pods were listed under NotReadyAddresses.
  5. A little more research revealed that there are two levels of readiness associated with a pod: that of the container, and that of the pod as a whole.
  6. My coredns pods were never going ready at the pod level (despite being ready at the container level).
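The checks in steps 4 through 6 can be sketched in Python. This is only a rough sketch, assuming the Endpoints and Pod objects have been dumped as JSON (for example with `kubectl get ... -o json`) and parsed into dicts. The sample objects and IP addresses below are made up for illustration and are not taken from my cluster; the field names (`subsets`, `addresses`, `notReadyAddresses`, `containerStatuses`, `conditions`) are the ones the Kubernetes API uses.

```python
def split_endpoint_addresses(endpoints):
    """Return (ready_ips, not_ready_ips) for an Endpoints object given as a dict."""
    ready, not_ready = [], []
    for subset in endpoints.get("subsets") or []:
        ready += [a["ip"] for a in subset.get("addresses") or []]
        not_ready += [a["ip"] for a in subset.get("notReadyAddresses") or []]
    return ready, not_ready


def readiness_levels(pod):
    """Return (all_containers_ready, pod_ready) for a Pod object given as a dict."""
    status = pod.get("status") or {}
    containers = status.get("containerStatuses") or []
    # Container-level readiness: every container reports ready.
    all_containers_ready = bool(containers) and all(c.get("ready") for c in containers)
    # Pod-level readiness: the pod's own "Ready" condition is "True".
    conditions = {c["type"]: c["status"] for c in status.get("conditions") or []}
    pod_ready = conditions.get("Ready") == "True"
    return all_containers_ready, pod_ready


# Hypothetical sample data mimicking the situation described in the list.
sample_endpoints = {
    "subsets": [
        {"notReadyAddresses": [{"ip": "10.42.0.5"}, {"ip": "10.42.1.7"}]}
    ]
}
sample_pod = {
    "status": {
        "containerStatuses": [{"name": "coredns", "ready": True}],
        "conditions": [
            {"type": "ContainersReady", "status": "True"},
            {"type": "Ready", "status": "False"},
        ],
    }
}

if __name__ == "__main__":
    print(split_endpoint_addresses(sample_endpoints))
    print(readiness_levels(sample_pod))
```

A service only routes traffic to the addresses in the first list, which would explain why going straight to a pod IP worked while going through the service did not.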

At this point I was pretty stumped as I found the documentation sorely lacking. As a final act of desperation I restarted my coredns pods - and all of a sudden DNS was working. The endpoint had two healthy hosts, Flux picked up the new blog post, etc. I learned a lot from the investigation, but not what was actually wrong.

But this is pretty typical (in my experience) of modern tech stacks. There are so many layers of complexity that some weird interaction is a) virtually guaranteed to happen and b) virtually impossible to fully understand.

And that is why I hate computers.
