5AM, the cellphone next to my bed rings continuously. On the phone, the desperate CTO of a SAAS company. Their infrastructure is a ruin, the admin team has left the building with all the credentials in an encrypted file. They’ll run out of business if I don’t accept to join them ASAP he cries, threatening to hang himself with a WIFI cable.
Well, maybe it’s a bit exaggerated. But not that much.
I’ve given my Devops Commando talk about taking over a non documented, non managed, crumbling infrastructure a couple of time lately. My experience on avoiding disasters and building awesome devops teams rose many questions that led me into writing down everything before I forget about them.
In the past 10 years, I’ve been working for a couple of SAAS companies and done a few consulting for some other. As a SAAS company, you need a solid infrastructure as much as solid sales team. The first one is often considered as a cost for the company, while the second is supposed to make it live. Both should actually considered assets, and many of my clients only realised it too late.
Taking over an existing infrastructure when everyone has left the building to make it something viable is a tough, long term job that won’t go without the management full support. Before accepting the job, ensure a few prerequisites are filled or you’ll face a certain failure.
Control over the budget
Having a tight control of the budget is the mosts important part of getting over an infrastructure that requires a full replatforming. Since you have no idea of the amount of things you’ll have to add or change, it’s a tricky exercise that needs either some experience or a budget at least twice as much as the previous year. You’re not forced to spend all the money you’re allowed, but at least you’ll be able to achieve your mission during the first year.
Control over your team hires (and fires)
Whether you’re taking over an existing team or building one form scratch, be sure you have the final word on hiring (or firing). If the management can’t understand that people who use to “do the job” at a certain time of the company’s life don’t fit anymore, you’ll rush into big trouble. Things get worse when you get people who’s been slacking or under performing for years. After all, if you’re jumping in, that’s because some things are really fishy aren’t they?
Freedom of technical choices
Even though you’ll have to deal with an existing infrastructure, be sure you’ll be given free hands on the new technical choices when they happen. Being stuck with a manager who’ll block every new technology he doesn’t know about, or being forced to pick up all the newest fancy, not production ready things they’ve read on Hackers News makes one ops life a nightmare. From my experience, keeping the technos that work, even though they’re outdated or you don’t like them can save you lots of problems, starting with managing other people’s ego.
Freedom of tools
Managing an infrastructure requires a few tools, and better pick up the ones you’re familiar with. If you’re refused to switch form Trac to Jira, or refused a PagerDuty account for any reason, be sure you’ll get in trouble very soon for anything else you’ll have to change. Currently, my favorite, can’t live without tools are Jira for project management, PagerDuty for incident alerting, Zabbix for monitoring and ELK for log management.
Being implied early into the product roadmap
As an ops manager, it’s critical to be aware of what’s going on on the product level. You’ll have to deploy development, staging (we call it theory because “it works in theory”) and production infrastructure and the sooner you know, the better you’ll be able to work. Being implied in the product roadmap also means that you’ll be able to help the backend developers in terms of architecture before they deliver something you won’t be able to manage.
Get an initial glance of the infrastructure:
It’s not really a prerequisite, but it’s always good to know where you’re going. Having a glance of the infrastructure (and even better at the incident logs) allows you to setup your priorities before you actually start the job.
Your priorities, according to the rest of the company
Priority is a word that should not have a plural
For some reasons, most departments in a company have a defined, single priority. Sales priority is to bring back new clients, marketing to build new leads, devs to create a great product without bugs. When it comes to the devops team, every department has a different view on what you should do first.
The sales, consulting and marketing expect stability first to get new clients and keep the existing ones. A SAAS company with an unstable infrastructure can’t get new clients, keep the existing ones, and get bad press outside. Remember Twitter Fail Whale era? Twitter was most famous for being down that everything else.
The product team expect you to help deliver new feature first, and they’re not the only ones. New feature are important to stay up to date in front of your competitors. The market expects them, the analysts expect them, and you must deliver some if you want to look alive.
The development teams expect on demand environment. All of them. I’ve never seen a company where the development team was not asking for a virtual machine they could run the product on. And they consider it critical to be able to work.
The company exec, legal team, your management expect documentation, conformity to the standards, certifications, and they expect you to deliver fast. It’s hard to answer a RFP without a strong documentation showing you’ve a state of the art infrastructure and top notch IT processes.
As a devops manager, your only priority is to bring back confidence in the infrastructure, which implies reaching the whole company’s expectation.
The only way to reach that goal is to provide a clear, public roadmap of what you’re going to do, and why. All these points are critical, they all need to be addressed, not at the same time, but always with an ETA.
Our work organisation
I’m a fan of the Scrum agile methodology. Unfortunately, 2–3 weeks sprints and immutability do not fit a fast changing, unreliable environment. Kanban is great at managing ongoing events and issues but makes giving visibility on the projects harder. That’s why we’re using a mix of Scrum and Kanban.
We run 1 week sprints, with 50% of our time dedicated to the projects, and 50% dedicated to managing ongoing events. Ongoing events are both your daily system administration and requests from the developers that can’t wait for the following sprint.
Our work day officially starts at 10AM for the daily standup around the coffee machine. Having a coffee powered standup turns what can be seen as a meeting into a nice, devops-friendly moment where we share what’s we’ve done the day before, what we’re going to do, and which problems we have. If anyone’s stuck, we discuss the various solutions and plan a pair-working moment if it takes more than a minute to solve.
Sprint planning is done every Friday afternoon so everybody knows what they’ll do Monday morning. That’s a critical part of the week. We all gather around a coffee and start reviewing the backlog issues. Tasks we were not able to achieve during the week are added on the top of the backlog, then we move the developers requests we promised to take care of, then the new projects. People pick up the projects they want to work on, with myself saying yes or no or assigning projects I consider we must do first in last resort. We take care of having everyone working on all the technologies we have so there’s no point of failure in the team and everybody can learn and improve.
Each devops work alone on their projects. To avoid mistakes and share knowledge, nothing ships to production without a comprehensive code review so at least 2 people in the team are aware of what’s been done. That way, when someone is absent, the other person can take over the project and finish it. In addition to the code reviews, we take care about documentation, the minimum being operation instruction being added in every Zabbix alert.
Managing the ongoing events Managing the ongoings is a tricky part because they often overlap with the planned projects and you can’t always postpone them. You’ll most probably take a few months before you’re able to do everything you planned to within a week.
During the day, incident management is the duty of the whole team, not only the oncall person. Oncall devops also have their projects to do so they can’t be assigned all the incidents. Moreover, some of us are more at ease with some technologies or part of the infrastructure and are more efficient when managing an incident. (Backend) developers are involved in the incidents management when possible. When pushing to production, they provide us with a HOWTO to fix most of the issues we’ll meet so we can add them to Zabbix alert messages.
We try to define a weekly contact who’ll manage the relationships with the development team so we’re not disturbed 10 times a day and won’t move without a Jira ticket number. Then, the task is prioritised in the current sprint or another one, depending on the emergency. When managing relationships with the development teams, it’s important to remember that “no” is an acceptable answer if you explain why. The BOFH style is the best way to be hated by the whole company, and that’s not something you want, do you?
In any case, we always provide the demander with an ETA so they know when they can start working. If the project is delayed, we communicate about the delay as well.
When you have no one left
Building a new team from scratch because everyone has left the building before you join is a rewarding and exciting task, except you can’t stop the company’s infrastructure while you’re recruiting.
During the hiring process, which can take up to 6 months, I rely on external contractors. I hire experienced freelancers to help me fix and build the most important things. Over the year, I’ve built a good address book of freelancers skilled with specific technologies such as FreeBSD, database management or packaging so I always work with people I’ve worked with, or people who’ve worked with people I trust.
I also rely on vendors consulting and support to help on technologies I don’t know. They teach me a lot and help fixing the most important issues. When I had to take over a massive Galera cluster, I relied on Percona support during the 6 first months so we’re now able to operate it fully.
Finally, we work a lot with the developers who wrote the code we operate. That’s an important part since they know most of the internals and traps of the existing machines. It also allows to create a deep link with the team we’re going to work the most with.
Recruiting the perfect team
Recruiting the perfect devops team is a long and difficult process, even more when you have to build a small team. When looking for people, I look for a few things:
Complementary and supplementary skills
A small team can’t afford having single points of failure, so we need at least 2 people knowing the same technology, at least a bit when possible. We also look for people knowing other technologies, whether or not we’ll deploy them someday. Having worked on various technologies give you a great insight on the problem you’ll encounter when working on similar ones.
Autonomy and curiosity
Our way of working requires people to be autonomous and not wait until we help them when they’re blocked. I refuse to micro manage people and ask them what they’re doing every hour. They need to be honest enough to say “I don’t know” or “I’m stuck” before the projects delays go out of control.
Knowledge of the technologies in place and fast learners
Building a team from scratch on an existing platform requires to learn fast how to operate and troubleshoot it. Having an experience in some if the technologies in place is incredibly precious and limits either the number of incidents or their length. Since hiring people who know all the technologies in place is not possible, having fast learners is mandatory so they can operate them quickly. Being able to read the code is a plus I love.
Indeed, every medal has two sides, and these people are both expensive and hard to find. It means you need to feed them enough to keep them on the long term.
Knowing who are your clients
The first thing before starting to communicate is to understand who your client, the one you can get satisfaction metrics from, is. Working in a B2B company, I don’t have a direct contact with our clients. It means my clients are the support people, sales persons, project managers, and the development team. If they’re satisfied, then you’ve done your job right.
This relationship is not immutable and you might reconsider it after a while. Sometimes, acting like a service provider for the development team does not work and you’ll have to create a deeper integration. Or, on the contrary, take your distance if they prevent you from doing your job correctly, but that’s something you need time to know.
Communication and reporting
Communication within the company is critical, even more when the infrastructure is considered a source of problems.
Unify the communication around one person, even more when managing incidents
We use a dedicated Slack channel to communicate about incidents, and only the infrastructure manager, or the person oncall during the night and weekend communicates there. That way, we avoin conflicting messages with someone saying the incident is over while it’s not totally over. This also requires a good communication within the team.
Don’t send alarming messages.
Never. But be totally transparent with your management so they can work on a communication when the shit really hits the fan, which might happen. This might mean they’ll kick you in the butt if you’ve screwed up, but at least they’re prepared.
Finally, we always give an ETA when communicating about an incident, along with a precise functional perimeter. “A database server has crashed” has no meaning if you’re not in the technical field, “People can’t login anymore” does. And remember that “I don’t have an ETA yet” is something people can hear.
We do a 3 slides weekly reporting with the most important elements:
- KPIs: budget, uptime, number of incidents, evolution of the number of oncall interventions.
- Components still at risk (a lot in the beginning).
- Main projects status and ETA.
Discovering the platform
So you’re in and it’s time for the things to get real. Here are a few things I use to discover the platform I’ll have to work on.
Monitoring brings the most useful thing to know about the servers and services you operate. It also provides a useful incident log so you know what breaks the most. Unfortunately, I’ve realised that the monitoring is not always as complete as it should and you might get some surprises.
When running on the cloud or a virtualised infrastructure, the hypervisor is the best place to discover the infrastructure even though it won’t tell you which services are running, and which machines are actually used. On AWS, the security groups provide useful informations about the open ports, when it’s not 1–65534 TCP.
nmap + ssh + facter in a CSV
Running nmap with OS and service discovery on your whole subnet(s) is the most efficient discovery way I know. It might provide some surprises as well: I once had a machine with 50 internal IP addresses to run a proxy for 50 geo located addresses! Be careful too, facter does not return the same information on Linux and FreeBSD.
tcpdump on the most central nodes
Running tcpdump and / or iftop on the most central nodes allows a better comprehension of the networking flows and service communication within your infrastructure. If you run internal and external load balancers, they’re the perfect place to snif the traffic. Having a glance at their configuration also provides helpful information.
Puppet / Ansible
When they exist, automation infrastructure provide a great insight on the infrastructure. However, from experience, they’re often incomplete, messy as hell and outdated. I remember seeing the production infrastructure running on the CTO personal Puppet environment. Don’t ask why.
The great old ones
People working in the tech team for a while often have a deep knowledge of the infrastructure. More than how it works, they provide useful information on why things have been done this way and why it will be a nightmare to change.
The handover with the existing team
If you’re lucky, you’ll be able to work with the team you’re replacing during 1 or 2 days. Focus on the infrastructure overview, data workflows, technologies you don’t know about and most common incidents. In the worst case, they’ll answer “I don’t remember” to every question you ask.
In the beginning
In the beginning, there was Jack, and Jack had a groove.
And from this groove came the groove of all grooves.
And while one day viciously throwing down on his box, Jack boldy declared,
“Let there be HOUSE!”, and house music was born.
“I am, you see,
I am the creator, and this is my house!
And, in my house there is ONLY house music.
So, you’re in and you need to start somewhere, so here are a few tricks to make the first months easier.
Let the teams who used to do it manage their part of the infrastructure. It might not be state of the art system administration, but if it works, it’s OK and it lets you focus on what doesn’t work.
Create an inventory as soon as you can. Rationalise the naming of your machines so you can group them into clusters and later on, automate everything.
Restart every service one by one, under control to make sure they come back. Once, I had postfix configuration into a sendmail.cf file, and the service had not be restarted for months. Another time, a cluster did not want to restart after a crash because the configuration files were referring servers that had be removed 1 year sooner.
Focus on what you don’t know but works, then look at what you know but needs fixes. The first time I took over a sysadmin-less infrastructure, I left the load balancers away because they were running smoothly, focusing on the always crashing PHP batches. A few weeks later, when both load balancers crashed at the same time, it took me 2 hours to understand how everything was working.
Automate on day one In the beginning, you’ll have to do lots of repetitive tasks, so better start automating early.
If I have to do the same task 3 times, I’ve already lost my time twice.
The most repetitive thing you’ll have to do is deployment, so better start with them. We’re using an Ansible playbook triggered by a Jenkins build so the developers can deploy when they need without us. If I didn’t want, I could ignore how many deployments to production are done every day.
Speaking of involving the developers, ask the backend developers to provide the Ansible material they need to deploy what they ask you to operate. It’s useful both for them to ensure dev, production and theory are the same, and to know things will be deployed the way they want with the right library.
Giving some power to the development team does not mean leaving them playing in the wild. Grant them only the tools they need, for example using Jenkins builds or users with limited privileges through automated deployment.
Resist in hostile environments
Hostile environment: anything you don’t have at least an acceptable control on.
Developers designed servers are a nightmare to operate so better talk about them first. A developer designed server is a machine providing a full feature without service segregation. The processes, database, cache stack… runs on the same machine making them hard to debug and impossible to scale horizontally. And they take a long time to split. They need to be split into logical smaller (virtual) machines you can expand horizontally. It provides reliability, scalability but has an important impact on your network in general and on your IP addressing in particular.
Private clouds operated by a tier are another nightmare since you don’t control resources allocation. I once had a MySQL server that crashed repeatedly and couldn’t understand why. After weeks of searches and debugging, we realised the hosting company was doing memory ballooning since they considered we used too much memory. Ballooning is a technic that fills part of the virtual machine memory so it won’t try to use everything it’s supposed to have. When MySQL started to use more than half of the RAM it was supposed to have, it crashed because it didn’t have enough despite the operating system saying the contrary.
AWS is another hostile environment. Machines and storage might disappear anytime, and you can’t debug their NATted network. So you need to build your infrastructure for failure.
Write documentation early
Finally, let’s talk about documentation. Infrastructure documentation is often considered a burden, and with the infra as code fashion, automation scripts are supposed to be the real documentation. Or people tell “I’ll write documentation when I’ll have time, for now everything is on fire”.
Nothing’s more wrong (except running Docker in production). But yes, writing documentation takes time and organisation, so you need to iterate on it.
The tool is an critical part if you want your team to write documentation. I like using Git powered wiki using flat Markdown files like the ones on Github or Gitlab, but it does not fit everyone, so I often fallback to Confluence. Whatever the tool, ensure the documentation is not hosted on your infrastructure!
I usually start small, writing down operation instructions in the monitoring alert messages. It’s a good start and allows you to solve an incident without digging into the documentation looking for what you need. Then, we build infrastructure diagrams using tools like Omnigraffle on Mac, or Lucidchart in the browser. Then we write comprehensive documentation around those 2.
Well, that’s all folks, or most probably only the beginning of a long journey in turning a ruin into a resilient, blazing fast, scalable infrastructure around an awesome devops team. Don’t hesitate to share and comment if you liked this post.