I recently started working as a DevOps engineer for a really cool company that is building a next-generation machine learning platform.
One of our many responsibilities as a DevOps team is the CI environment, which is based on Jenkins Groovy pipelines. We have many types of jobs, ranging from testing, provisioning, and deployment to utility jobs for various automation tasks. We use Jenkins shared libraries, which are great for keeping pipeline code well organized and modular.
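To give a feel for it, here is a minimal sketch of a shared-library step. The layout (a global step under vars/) is the standard shared-library convention, but the step name, labels, and wrapper scripts are hypothetical rather than our actual code:

```groovy
// vars/deployQuickstart.groovy -- a hypothetical global step in a Jenkins shared library.
// Any Jenkinsfile that loads the library can run the whole flow with a single call.
def call(Map args = [:]) {
    node(args.label ?: 'linux') {
        stage('Provision') {
            // Spin up an EC2 instance from the quickstart AMI (wrapper script is illustrative)
            sh "./scripts/provision.sh ${args.ami}"
        }
        stage('Deploy') {
            // Run the Ansible-based product installation
            sh './scripts/deploy.sh'
        }
        stage('Test') {
            sh "./scripts/run_tests.sh ${args.suite ?: 'nightly'}"
        }
    }
}
```

A consuming Jenkinsfile then only needs `@Library('our-shared-lib') _` and a call like `deployQuickstart(ami: 'ami-12345678', suite: 'nightly')`, which keeps the per-job pipelines short and uniform.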
One of our nightly test jobs takes a pre-configured Amazon Machine Image (what we call a "quickstart" AMI), spins up a new EC2 instance from it, deploys our product, and runs a set of tests on it.
This job ran pretty stably until recently, when we started noticing random failures in the deployment stage. We use Ansible to install our product, which usually takes 5-15 minutes to complete, depending on the properties of the environment and the installation options. The Ansible task that failed was a simple unarchive task that extracts a tarball from /var/tmp into /usr/local/. The error in the playbook was the infamous "file not found" message. This, of course, made no sense: the allegedly missing file should have been in the source directory, because we bake it into the quickstart image. What baffled us even more was that the failure was completely random: some nights the job completed successfully, and other nights it failed.
The only thing that changed between runs of the same job was the duration. Ansible would sometimes install the product in 8 minutes; other times it took 10, 15, or 5… This variation depends on several factors, such as EBS warm-up issues, S3 download bandwidth, etc.
After some digging, we discovered that systemd has a cleanup task that runs on /var/tmp and deletes anything older than 30 days. Because we don't update our quickstart images very often, systemd removed everything under this path. We now had someone to blame.
So how come this doesn't happen every time the job is executed? How is it possible that there are successful runs of the same job that complete the deployment? The answer is that the cleanup task has a timer that gets triggered 15 minutes after systemd first starts (or the machine reboots). This perfectly explains the flaky nature of the issue: when Ansible runs longer than usual, systemd fires the cleanup, deletes our build artifacts from /var/tmp, and the deployment fails. When the deployment is shorter, the timer fires only after Ansible has already extracted the tarball. Mystery solved 🙂
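Both halves of this are easy to confirm. Below is a hypothetical debug stage in the spirit of our scripted pipeline that checks the policy and the timer on the test instance over SSH; env.INSTANCE_IP, the SSH user, and the config path are assumptions, while the 30-day age and the 15-minute boot delay are simply what the stock units declare on typical distros:

```groovy
// Hypothetical debug stage -- env.INSTANCE_IP and the SSH user are assumed to be
// provided by an earlier provisioning step; exact paths/values are distro-dependent.
stage('Inspect systemd cleanup policy') {
    // The age policy: the stock tmpfiles.d config usually contains a line like
    //   q /var/tmp 1777 root root 30d
    // i.e. anything under /var/tmp untouched for 30 days is eligible for removal.
    sh "ssh ec2-user@${env.INSTANCE_IP} 'cat /usr/lib/tmpfiles.d/tmp.conf'"
    // The schedule: cleanup is driven by systemd-tmpfiles-clean.timer, which declares
    //   OnBootSec=15min
    // -- the 15-minute fuse that made our failures look random.
    sh "ssh ec2-user@${env.INSTANCE_IP} 'systemctl cat systemd-tmpfiles-clean.timer'"
}
```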
Our solution to this problem was pretty straightforward. We didn't want to change our deployment process by no longer using /var/tmp, because that would mean changing the product installer, which requires manual QA verification before it reaches end customers and has the potential to introduce regressions. We could have simply disabled the systemd cleanup timer, but that would mean altering the testing environment in a way that differs from what our customers are using, which is obviously against best practices. So instead we added a step to our CI pipeline that updates the modification timestamps of the files in /var/tmp using a simple touch command. When the systemd timer kicks into action, it skips the freshly touched files and voilà! No more deleted artifacts.
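The extra step is essentially a one-liner. Here is a hypothetical sketch of it as a pipeline stage, again with env.INSTANCE_IP, the SSH user, and the stage name as placeholder assumptions rather than our real pipeline code:

```groovy
// Hypothetical sketch of the workaround stage; the only real requirement is that it
// runs before the cleanup timer gets a chance to fire.
stage('Refresh /var/tmp timestamps') {
    // touch updates the files' atime/mtime, so systemd-tmpfiles no longer sees the
    // baked-in artifacts as "older than 30 days" and leaves them alone.
    sh "ssh ec2-user@${env.INSTANCE_IP} 'sudo find /var/tmp -mindepth 1 -exec touch {} +'"
}
```

A few takeaways from this whole adventure: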
1. I'm not a member of the whole "systemd sucks" fan club; for the most part, I think that once you get used to it, systemd does its job pretty well. On the other hand, I can understand the critics who say it tries to do more than it should and that this kind of monolithic approach contradicts the traditional philosophy of the *nix community.
2. /var/tmp is a volatile place. If you need files to stick around for the long term, put them somewhere else.
3. If you're trying to debug an inconsistent/random failure in an automated process, especially if it's CI-related, try to identify the variables that change significantly between iterations. They will give you important clues about where to focus when trying to solve the problem.
Until next time, ciao