Systemd, continuous integration and the mysterious file not found error

I recently started working as a DevOps engineer at a really cool company that is building a next-generation machine learning platform.

One of our many responsibilities as a DevOps team is the CI environment, which is based on Jenkins Groovy pipelines. We have many types of jobs, ranging from testing, provisioning and deployment to various utility jobs for automation tasks. We use Jenkins shared libraries, which are awesome for keeping your pipeline code well organized and modular.

One of our nightly test jobs takes a pre-configured Amazon Machine Image – what we call a “quickstart” AMI – spins up a new EC2 instance from it, deploys our product and runs different sets of tests on it.

This job was running pretty stably until recently, when we started noticing random failures in the deployment stage. We use Ansible to install our product, which usually takes 5-15 minutes to complete, depending on the properties of the environment and the installation options. The Ansible task that failed was a simple unarchive task that extracts a tarball from /var/tmp into /usr/local/. The error in the playbook was the infamous “file not found” message. This of course made no sense, since the allegedly missing file existed in the source directory and we bake it into the quickstart image. What was even more baffling was that the failure was completely random – there were nights when the job completed successfully and other nights when it failed.

The only thing that changed between runs of the same job was the duration. Ansible would sometimes install the product in 8 minutes, other times in 10, 15 or 5… This variability depends on several factors such as EBS warm-up issues, S3 download bandwidth, etc.

After some digging we discovered that systemd has a cleanup task that runs on /var/tmp and deletes anything older than 30 days. Because we don’t update our quickstart images very often, systemd removed everything in this path. We now had someone to blame.

So how come this doesn’t happen every time the job is executed? How is it possible that there are successful runs of the same job that complete the deployment? The answer is that the cleanup task has a timer that gets triggered 15 minutes after systemd first starts (or the machine reboots). This explains the flaky nature of the issue perfectly: when Ansible runs longer than usual, systemd fires the cleanup, deletes our build artifacts from /var/tmp and the deployment fails. When the deployment is shorter, the timer is triggered only after Ansible has already extracted the tarball. Mystery solved 🙂
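
If you want to see this for yourself, both the schedule and the age policy are easy to inspect on a systemd machine. The unit and file names below are the stock ones; the exact paths and retention values may differ between distributions:

# Show the timer that schedules the cleanup (look for OnBootSec=15min,
# the 15-minute post-boot trigger described above)
systemctl cat systemd-tmpfiles-clean.timer

# Show the age policy (typically 10d for /tmp and 30d for /var/tmp)
cat /usr/lib/tmpfiles.d/tmp.conf

# List all timers and when they will fire next
systemctl list-timers --all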

Our solution to this problem was pretty straightforward. We didn’t want to change our deployment process by moving away from /var/tmp, because that would mean changing the product installer, which requires manual QA verification before it reaches end customers and has the potential to introduce regressions. We could have simply disabled the systemd cleanup timer, but that would alter the testing environment in a way that differs from what our customers are using, which is obviously against best practices. So instead we added a step to our CI pipeline that changes the modification timestamp of the files in /var/tmp using a simple touch command. When the systemd timer kicks into action, it skips the freshly touched files and voilà! No more deleting stuff.
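
The pipeline step itself boils down to a one-liner along these lines (a minimal sketch; the path is ours, adjust it to wherever you stage your artifacts):

# Refresh the timestamps on everything staged under /var/tmp so the
# systemd-tmpfiles 30-day age check never considers our artifacts stale
sudo find /var/tmp -exec touch {} +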

Lessons learned:

1. I’m not a member of the whole “systemd sucks” fan club, and for the most part I think that once you get used to it, systemd does its job pretty well. On the other hand, I can understand the critics who say it tries to do more than it should and that this kind of monolithic approach contradicts the traditional philosophy of the *nix community.

2. /var/tmp is a volatile place. If you need files to stick around long term, put them somewhere else.

3. If you’re trying to debug an inconsistent, seemingly random failure in an automated process, especially a CI-related one, try to identify the variables that change significantly between iterations. They will give you important clues about where to focus when trying to solve your problem.

Until next time, ciao


Websphere Application Server V8.5.5.9 Network Deployment 2-Node Cluster on Docker

Inspired by Dockerfiles for WebSphere Application Server traditional

I always hated installing WebSphere software. The installation is so boring and time-consuming that I always thought there must be a better way of doing it. I also love Docker, so I figured: why not make some Docker images and a docker-compose file that will spin up a complete WebSphere Network Deployment cell in no time? THERE IS NO REASON WHY NOT 🙂

First you’ll have to build the two images yourself (dmgr + custom WAS profile) using the WASDev instructions at the link above. Once you have them available in an online or local repository (I use a private Docker Hub repo), you’ll need to update the image names in the compose YAML file. After you execute docker-compose up you will have a fully functioning 2-node WebSphere cell ready for clustering. I’ve added some screenshots to explain how to actually create the cluster using the Integrated Solutions Console.
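
If it helps, the overall flow looks roughly like this (the image names, build-context paths and Docker Hub repository are placeholders for whatever you used when building the images):

# Build the two images from the WASDev Dockerfiles (dmgr + custom node profile)
docker build -t myhubuser/was-dmgr:8.5.5.9 ./dmgr
docker build -t myhubuser/was-custom-node:8.5.5.9 ./custom-node

# Push them to your online or local repository
docker push myhubuser/was-dmgr:8.5.5.9
docker push myhubuser/was-custom-node:8.5.5.9

# With the compose file pointing at those image names, bring the cell up
docker-compose up -d
docker-compose ps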

It’s great for testing, learning and production. It’s even better at demonstrating the awesome power of Docker in the legacy environments where WebSphere is usually deployed.

Check it out in my public GitHub account.

Using Ant + Maven to upload artifacts to Nexus OSS

Everyone uses Maven. But sometimes you don’t have the privilege of starting a new Maven project with a fancy POM file – maybe it’s because you’re exporting from Eclipse, maybe you have an existing project with different build tools, etc…

For example, I have some Liferay Plugins SDK projects that already use Ant. If you don’t know what Liferay is, you should check it out; it’s a nice open-source, full-fledged Java content management system for enterprise websites.

So my developers are using Ant to build their plugin artifacts, and I need to store these artifacts (mainly JARs and WARs) in a Nexus OSS repository. That’s how I came across the Maven “deploy” plugin and its “deploy-file” goal (deploy:deploy-file). It allows you to quickly upload any WAR/JAR file to a Maven repository without having to build a complete POM file with millions of dependencies. You can include a pom.xml file if you like and override its properties from the command line.

Make sure you configure your Maven settings.xml file with the Nexus OSS repository credentials like this:

<?xml version="1.0"?>
<settings>
  <servers>
    <server>
      <id>liferay-releases</id>
      <username>admin</username>
      <password>*****</password>
    </server>
    <server>
      <id>liferay-snapshots</id>
      <username>admin</username>
      <password>*****</password>
    </server>
  </servers>
</settings>

When you execute Maven from Jenkins, you need to pass some parameters (‘properties’):

[Screenshot: the Jenkins job configuration showing the Maven goal and properties]

Notice that I’ve used the $JOB_NAME parameter to pass the Jenkins job name as the artifact ID, and $BUILD_NUMBER as the Maven version sent to Nexus OSS.
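
Strung together, the Jenkins Maven step ends up invoking something like this (the property names are the standard ones for the deploy-file goal, while the repository URL, group ID and file path are placeholders for our internal values):

mvn deploy:deploy-file \
  -Durl=http://nexus.example.com/content/repositories/liferay-releases/ \
  -DrepositoryId=liferay-releases \
  -Dfile=dist/${JOB_NAME}.war \
  -DgroupId=com.example.liferay \
  -DartifactId=${JOB_NAME} \
  -Dversion=${BUILD_NUMBER} \
  -Dpackaging=war \
  -DgeneratePom=true

The repositoryId must match the <id> configured in settings.xml so Maven knows which credentials to use.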

Finally, you get this nice view in the Nexus OSS UI after each build:

[Screenshot: the Components view in the Nexus Repository Manager UI, showing an artifact per build]

Moving away from CVS to Git


We are currently at the very start of our DevOps journey, gradually transitioning from legacy development tools to more modern, platform-independent ones. One of my clients is using the ancient source control system CVS, and because I want to implement a CI pipeline, we had to convert all of the CVS repositories into Git repos.

It’s not easy to convince dev people to throw away what they know and start using new tools, no matter how well you explain the benefits. The old argument “Why replace something that isn’t broken?” will almost always win. Eventually, though, I was able to persuade the right people.

So if you have a bunch of CVS repositories you want to get rid of, and you like GitLab, check out this bash script I wrote to convert each CVS directory into a bare Git repo and then push that repo into GitLab.

What’s neat about this script is that it uses the GitLab API to programmatically create each repo with custom visibility settings (‘Internal Project’) so that developers can get on board quickly. Consider running the ‘Add all users to all projects’ rake task to add all GitLab users to the newly created repos (careful with this, it does not distinguish between imported repos and pre-existing ones):

$ gitlab-rake gitlab:import:all_users_to_all_projects

Here are the instructions:

  1. Install cvs2svn & cvs (you can find links below)
  2. cd into the directory holding all your CVS repositories (where CVSROOT is located)
  3. Create the folders “output/git_tmp” in the current folder
  4. Create a “repos.txt” file in the current folder listing the CVS modules you want to convert, one per line
  5. Change the parameters in cvs2git.sh to match your environment
  6. chmod +x cvs2git.sh
  7. Execute ./cvs2git.sh

http://cvs2svn.tigris.org/

 

And here is the code:

#!/bin/bash
# cvs2git.sh
# Converts each CVS module listed in repos.txt into a bare Git repo
# and pushes it to GitLab, creating the GitLab project via the API.

REPOS='repos.txt'                 # one CVS module name per line
OUTPUT_ROOT_DIR='output'
TMP='git_tmp'
GITLAB_HOSTNAME='gitlab.example.com'
GITLAB_API_KEY='*********'

while read LINE; do
  LINE_LOWER=${LINE,,}            # GitLab project path uses the lowercase module name
  echo "doing $LINE ..."

  # Export the CVS history into git fast-import blob/dump files
  cvs2git --blobfile="$OUTPUT_ROOT_DIR"/"$TMP"/git-blob_"$LINE".dat --dumpfile="$OUTPUT_ROOT_DIR"/"$TMP"/git-dump_"$LINE".dat --username=cvs2git "$LINE"

  # Create a bare repo and import the CVS history into it
  mkdir "$OUTPUT_ROOT_DIR"/"$LINE".git
  cd "$OUTPUT_ROOT_DIR"/"$LINE".git
  git init --bare
  cat ../"$TMP"/git-blob_"$LINE".dat ../"$TMP"/git-dump_"$LINE".dat | git fast-import
  git gc --prune=now

  # Create the project in GitLab ('Internal' visibility) and push all branches
  git remote add origin git@"$GITLAB_HOSTNAME":root/"$LINE_LOWER".git
  curl -H "Content-Type:application/json" "http://$GITLAB_HOSTNAME/api/v3/projects?private_token=$GITLAB_API_KEY" -d "{ \"name\": \"$LINE\", \"visibility_level\": 10 }"
  git push --all

  cd ../../
  echo "done with ${LINE}."

done < "$REPOS"


Use a Bare Git Repo to Bypass a Corporate Firewall

We started using Git to track changes and manage our Puppet code. At first we had only one Puppet master and all was nice and simple.

Once we gained trust and everyone was happy, we added more Puppet masters in different isolated networks, and for one of them we would have had to punch a hole through the company firewall. Sometimes it’s just not worth fighting through bureaucracy and security regulations, so we had to come up with a workaround to keep the Puppet code repository synced with all the other repos.

We couldn’t get direct network access from the isolated Puppet master to the Git server, but we had two other servers, one on each side, that were open for communication on port 22. We figured we could use these two servers as some sort of Git bridge to keep all Puppet masters in sync with the latest code.

We did this using bare Git repos that act as a bridge between the Puppet master and the Git server. A bare Git repo is a repo that does not have a working tree attached to it. You can initialize one like this:

$ git init --bare

We had to do this twice: once on a server located outside of the sealed-off network, and once on a server inside the secured network that can talk to the Puppet master.

We then created a post-receive Git hook to push the data automagically whenever new code is pushed to the Git server (you have to set up SSH keys for this to work). A post-receive hook is a simple script you drop into the hooks directory of any Git repo, containing commands you want Git to execute whenever code is pushed to that repo.
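
For reference, the hook on the Git server looked more or less like this (the bridge hostname and repo path are placeholders; the same one-liner, pointed at bare repo B, goes into the hooks directory of bridge repo A):

#!/bin/bash
# hooks/post-receive on the Git server's Puppet repo
# Mirror every ref that was just pushed onward to bare bridge repo A over SSH (port 22)
git push --mirror git@bridge-a.example.com:/srv/git/puppet-code.git

Don’t forget to make the hook executable (chmod +x hooks/post-receive).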

The end result was this:

1. You push from your workstation to the Git server.

2. A Git hook pushes from the Git server to bare repo A.

3. Bare repo A uses another Git hook to push to bare repo B, which is located in the other, more secured network.

4. The Puppet master can now pull the updated code from bare repo B.

Hooray!

What’s next? Create more Git hooks to automatically push/pull code whenever a developer commits, so that we won’t have to manually update the Puppet masters with new code. This is a bit dangerous, though, since we are working with only one Git branch (master) and you don’t want untested code getting into production. So we’ll have to create dev/staging/prod branches before we jump into more automation tasks.

[Slide: Git + GitLab introduction]

Building a Docker Image for IBM WebSphere Portal Enable (Web Content Manager) V8.5

I’ve only recently started experimenting with Docker and already I can do something useful with it! The idea of firing up a disposable, lightweight container for testing and development purposes is super cool. It gets even better when you have limited access to VM infrastructure: for example, when I’m at a customer site working on a project with a limited budget, I usually get access to only 1-2 VMs. With Docker I can create and destroy instances of my application without relying on IT / system personnel.

There are a ton of guides on how to build Docker images and run containers, but I couldn’t find anything for IBM WebSphere Portal Server. So I wrote some Dockerfiles to create an image that lets me start an instance of IBM WebSphere Portal Server 8.5 WCM (Web Content Manager), with the latest cumulative fix, with minimum effort.

Manual installation of WPS can take a whole day, and it’s just not worth it if you only need to test your theme or demonstrate some capability of the product.

What’s even more amazing is that, thanks to Docker’s storage layering system, the final image size is only 4GB when pulled from Docker Hub. A regular installation of WebSphere Application Server (Network Deployment) with Portal WCM can take as much as 30-40GB of storage.

You can find all the instructions on how to create the image in my GitHub repo.

What I did after creating the image was push it to Docker Hub so I can pull it anywhere. Make sure you use a private repository or you’ll probably violate IBM’s license agreement; luckily, Docker Hub offers one free private repository, so I’m using that.
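
Pushing and pulling the image is the usual Docker routine (the repository name below is a placeholder for my private Docker Hub repo, and the port mapping assumes Portal’s default 10039):

# Tag the locally built image with the private Docker Hub repository name and push it
docker tag websphere-portal:8.5 myhubuser/websphere-portal:8.5
docker login
docker push myhubuser/websphere-portal:8.5

# On any other machine with Docker installed, pull and run it
docker pull myhubuser/websphere-portal:8.5
docker run -d -p 10039:10039 myhubuser/websphere-portal:8.5
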
Note:
Even though the image is small compared to a non-Docker deployment, pulling it still struggles when using Docker Tools for Windows/Mac. This may be because Docker Tools uses the boot2docker VM; either way, I recommend installing Docker on a dedicated RHEL7 VM and pulling the image from there.

Using Foreman & Puppet to monitor CIS Benchmarks


I’ve recently started a very interesting project with Puppet and Foreman: implementing the CIS Benchmark document for hardening server infrastructure. Some of the items in the document can seriously cripple your machines (Disable X11 Forwarding… rpm -v *…), so instead of enforcing everything or writing complicated “if-then” code, we applied a mechanism in the Foreman UI to alert when a node is not in compliance with CIS.

At first, we used “notify” resources in Puppet for every alert that needs to be raised when a machine is not in compliance. This way, the Puppet reports show CIS warnings in Foreman with colorful messages. The problem with the notify resource is that Puppet treats these events as an active change to the machine. This defeats the purpose of a monitoring system, because every time the Puppet agent runs it sends its report to Foreman, and Foreman shows all nodes that are not in compliance as “Active”. This is misleading, because nothing really changed; it’s just the notify message causing Puppet to think something changed on the machine.

To remedy this issue, I tried making some modifications to the Foreman source code so that Foreman would ignore the notify events, but that didn’t turn out so well, because I had to enable the “noop” attribute for every single notify piece of code (200+ CIS items = 200+ notify events in “noop”, which generates even more noise in the Foreman dashboard).

 

Fortunately, someone on Stack Overflow was kind enough to point out that I should use a custom resource type available on Puppet Forge called “Echo”. It does exactly the same thing as the notify resource, but without Puppet reporting to Foreman that the node has changed. Problem solved!

We now have a fairly good indication, using Foreman and Puppet, of when servers in production are not in compliance with the CIS benchmarks.