★ Cross code reviews

"The process of rationalization gives the expert his powers, but the results of rationalization limit them. As soon as a field has been seriously analyzed and understood, as soon as the first intuitions and innovations have been translated into rules and programs, the expert's power tends to disappear. In fact, experts only hold real social power on the frontier of progress, which means that this power is shifting and fragile…", and their power would become ever more fragile "insofar as the methods and programs that science and technology produce can be used and directed by people who are no longer experts…"

Jacques Ellul in L'illusion politique, quoting Crozier's Le phénomène bureaucratique

I asked Anthony to review a piece of JavaScript I considered to be of poor quality. The week before, Stéphane had asked David and me to do a higher-level review of the MultiBàO code. These cross code reviews improve the quality of a codebase impressively. Practicing code review as part of the technical collaboration within a single team can become somewhat monotonous, but simply asking someone external lets you step back and reconsider the relevance of what you have built, how understandable it is to a newcomer, and the way you test it. I see it as technical cross-pollination, a way to help your peers grow in skill and in curiosity.

The review itself helped me identify certain patterns in the way I code and made me more careful about them for the future. That is something a team can already spot internally, but it tends to be raised less and less over time, out of habit and lack of attention. Another problem we have in-house: we are too few to require several reviews per pull request, which hurts quality; ideally I think two or three would be needed. Finally, a fresh pair of eyes makes you rethink how you document the why, which is hard when the whole team already knows what it is all about.

These experiences make me want to take these cross reviews further. Should we really try to scale this up? Do we need a charter (kindness, respect, etc.)? Is a virtual currency necessary for this bartering? Do we need a service to centralize the index? (A DAO (cache)? :-p) Would other people be interested? Would other projects have the same need? I still have a lot of questions on the subject, but I know I can already commit personally to 2 or 3 Python/ES6 reviews over the coming weeks if you are interested. A good way to test the formula. The only constraint I am setting myself for this first iteration is that the code has to be open source. Send me your PRs/MRs/whatever! :-)

But Mr. Crozier seems to me to wrongly conflate the expert and the technician. No doubt the expert is called upon, incidentally, to give his opinion in a situation of uncertainty. But the role of the technician, who may also be called upon as an expert, is not limited to that. And it is not because the situation ceases to be uncertain that the technician's influence diminishes. And we are very far from an easy and simple diffusion of techniques!

Ibid.

In a meta-review, Anthony points me to what Tarek had done on this subject (cache).

★ JavaScript, promises and generators

Conceptually that’s a much cleaner and easier to understand process than what happens when promises are chained multiple times. If you’ve ever tried to understand just how a promise chain is wired up, or explain it to someone else, you’ll know it’s not so simple. I won’t belabor those details here, but I’ve previously explained it in more detail if you care to dig into it.

The sync-async pattern above wipes away any of that unnecessary complexity. It also conveniently hides the then(..) calling altogether, because truthfully, that part is a wart. The then(..) on a promise is an unfortunate but necessary component of the Promise API design that was settled on for JS.

Promises: All The Wrong Ways (cache)

After reading this article, and since I use fetch with its polyfill on the Etalab projects, I decided to experiment with what generators can do to avoid chaining promises. This technical post about JavaScript (again!) is not an introduction to the Promise object but one possible use of it; if you have never worked with promises, it may be a bit hard to follow. Being familiar with Python generators can help :-).

Let's take a classic example where you want to fetch data from an HTTP API:

const url = 'http://httpbin.org/get'
fetch(url)

So far so good, but things get more complicated on the very next line, because you have to handle error codes by hand:

function checkStatus (response) {
  if (response.status >= 200 && response.status < 300) {
    return response
  } else {
    const error = new Error(response.statusText)
    error.response = response
    throw error
  }
}

Here we only return the response if it is a 2XX; otherwise we throw our own error. We can now chain this function onto the result of our fetch:

.then(checkStatus)

Then get the response as JSON:

.then((response) => response.json())

And finally do something useful with it:

.then((jsonResponse) => {
  // Do something with the jsonResponse.
})

If you stop there, you have no way of knowing what went wrong when an error occurs; never forget to add a catch to your promise chains:

.catch(console.error.bind(console))

Here we do the bare minimum by logging a message to the console. Investigating asynchronous bugs can quickly become a headache if you do not take these precautions. Likewise, never mix different Promise implementations, or you will end up having to re-normalize all your promises via Promise.resolve(promesse).
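
As a minimal sketch of that normalization (the hand-rolled thenable below just stands in for a promise coming from another implementation):

// A hand-rolled "thenable" standing in for a promise from another library.
const foreignPromise = {
  then (onFulfilled) {
    onFulfilled('some value')
  }
}

// Promise.resolve() absorbs it into a real native promise,
// so the rest of the chain behaves consistently.
Promise.resolve(foreignPromise)
  .then((value) => console.log(value)) // "some value"
  .catch(console.error.bind(console))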

So in the end our complete promise chain looks like this:

fetch(url)
  .then(checkStatus)
  .then((response) => response.json())
  .then((jsonResponse) => {
    // Do something with the jsonResponse.
  })
  .catch(console.error.bind(console))

It is still readable but far from ideal; I will let you read the article quoted in the introduction (cache) for the various reasons why.

Using a generator, I ended up with this:

run(fetchJSON, url)
  .then((jsonResponse) => {
    // Do something with the jsonResponse.
  }, console.error.bind(console))

Which is, in the end, the only business logic I care about. A single .then() handles fulfilled and rejected directly through its two parameters, without going through .catch(). Our fetchJSON is then a generator (hence the *):

function * fetchJSON (url) {
  const rawResponse = yield fetch(url)
  const validResponse = yield checkStatus(rawResponse)
  return validResponse.json()
}

Close to the Python syntax, we find the yield keyword, used to retrieve the successive values of the generated iterator. The processing here is exactly the same as before, in a slightly more explicit way to my taste. We still have to define our runner:

function run (generator, ...args) {
  const iterator = generator(...args)
  return Promise.resolve()
    .then(function handleNext (value) {
      const next = iterator.next(value)
      return (function handleResult (next) {
        if (next.done) {
          return next.value
        } else {
          return Promise.resolve(next.value)
            .then(
              handleNext,
              (error) => {
                return Promise.resolve(iterator.throw(error))
                  .then(handleResult)
              }
            )
        }
      })(next)
    })
}

Here it is the error handling that makes the script a bit hairy, but roughly it walks the iterator created from the generator recursively, returning promises (of promises (of promises (etc.))). This is where the complexity of doing sequential asynchronous work is now hidden. The advantage is that it is localized and can then be reused for all your sequences. While we wait for async/await (cache). Or not (cache).
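
For comparison, here is a minimal sketch of what the same sequence could look like with async/await once it is usable, reusing the checkStatus helper from above:

async function fetchJSON (url) {
  const rawResponse = await fetch(url)
  const validResponse = checkStatus(rawResponse) // throws on non-2XX responses
  return validResponse.json()
}

fetchJSON('http://httpbin.org/get')
  .then((jsonResponse) => {
    // Do something with the jsonResponse.
  }, console.error.bind(console))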

The big downside in all of this shows up when you try to transpile to ES5 with Babel. You have to use babel-polyfill, which immediately adds 300KB to your nice code… acceptable on the server side or for a full application, but prohibitive for a simple library. Even though we will not end up using it for that very reason, it was an interesting foray into the asynchronous side of JavaScript while we wait for the ES6/ES7 features to be more widely implemented natively.

I am a JavaScript beginner, so do not hesitate to send me your feedback if I have written anything silly!

★ Scopyleft snapshot

The future of companies is to be networked, to offer peer-to-peer services, to be distributed and self-managed platforms. Nature changes more slowly than culture, governance or technology. Connected companies are bound to be agile, "glocal", human-centered… The institutions of tomorrow must become living systems. They must think of themselves as operating systems that people can contribute to… They must encode their values into their very mode of organization.

How remains to be seen. In this presentation, which rolled out every expected buzzword… everyone could mostly pick out what they wanted to hear, without necessarily sharing the same values.

A la croisée des économies collaboratives (cache)

I get a lot of questions about scopyleft at events and meetups. Here are a few answers that are valid at this moment in time, and I insist on the snapshot aspect because this cooperative is a living working framework.

What follows is not a guide but an example of what can be done. I do not know whether it is generalizable or even reproducible, let alone whether it is desirable.

Employment

There is often confusion about what scopyleft is to me, and people ask whether I went through the SCOP to work with Mozilla and then Etalab, or whether I did so independently and directly. Scopyleft is my only company and I am happy to take part in this project. It is neither a grouping of freelancers nor an artificial structure to look bigger in calls for tenders.

It is our cooperative, which we have been cultivating for more than three years; it reflects our perception of work.

Daily life

Vincent has been an itinerant worker for more than two years, and Stéphane took the plunge this year. Suffice to say we do not run into each other every day :-). We take advantage of conferences like Mix-IT or, recently, SudWeb to meet in person and spend time together. It is an opportunity to sync up on projects and align on certain decisions. The rest of the time we use other means of communication (instant messaging and video calls).

Our daily life is not exclusively solitary for all that, given that we collaborate quite a bit with people outside the SCOP: Etalab, UT7 or Claude Aubry, for instance.

Money

This is often the question that makes people's heads spin, so I will try to go into detail. Everything we invoice lands in a common pot whose purpose is to sustain the cooperative and its members. This incoming money, sometimes earned individually (see the next paragraph), is independent of salaries. Salaries are agreed upon by the three employees according to their perceived needs. For a long time we had equal salaries, but that is no longer the case at all; pay can evolve over time, and dividends (when there are any) try to even things out. We do not exhaust ourselves accumulating more.

A healthy relationship with money can only exist within trust and kindness.

Collaboration

I am currently working with Vincent on some projects, I have given talks with Stéphane, I maintain another project on my own, and we each experiment in our respective, not necessarily lucrative, fields (coaching, teaching, well-being, etc.).

We no longer try to force collaboration on ourselves; we encourage it depending on the motivation and energy of the moment. That is quite far from what we tried to do in the past by regularly working all together.

Working hours

We do not track working hours or vacation days. It may look like a violent breach of labor law, but we prefer to rely on how well we feel. If one of us needs to take a week or a month off, he takes it. If one of us needs to spend a year working half-time to get to know his son, he does it. If one of us wants to explore a country without being sure of finding an Internet connection there, we deal with it.

All of this is independent of remuneration, which we consider to be a need that is not indexed on labor power.

Inspiration

Listening to them, you can see that these organizational forms are finding their way as they go, that they experiment, test, try. This is certainly related to the fact that these "new methods" are poorly documented, need to be adapted, and are not so simple to set up and to sustain within organizations that are constantly evolving, especially when they are small and want to remain agile.

Travailler de manière collaborative, oui ! Mais comment s’organiser ? (cache)

People also regularly ask me whether this is not close to X (X being Holacracy, lifelong salary, or something else). Maybe. The sources of inspiration are numerous, but we do not try to apply any method to the letter. In the same way that we do not apply Scrum but experiment within an agile culture. We iterate and recreate the framework when needed. Being different pushes you to look for your own solutions.

These are some of the reasons why I find it hard to imagine leaving scopyleft, even to join a new culture. I feel I have gone much too far in my relationship to work to be able to go back without sinking into depression! It is also why I feel very distant from certain work-related discussions and relationships that I can (un)fortunately no longer experience. There are no scapegoats left other than our own contradictions, work having become a material in its own right that we explore together.

I will have to tell you about education one of these days; I have the feeling of being in the same kind of parallel dimension…

★ SudWeb 2016

Many useful inventions did not spring from a problem. Thinking "positive" may be just that: letting go of the problems and dreaming a little…

4 changements qui émergent dans les projets (cache)

The editions of this conference follow one another without ever resembling each other, except in their pursuit of singularity. Each talk makes you want to go and interact with the speaker rather than open your laptop. Behind these non-technical topics lie deeper reflections that no longer question the how but the why, and increasingly the why not?

The topics of the élaboratoires, run in an Open Space format, say a lot about a community that is becoming more mature (which is the polite way of saying older :-p).

  • I had the pleasure of getting started with handlettering with Hellgy, possibly to have some fun with it as a duo later.
  • I initiated a discussion on teaching and the web which confirmed the diversity of experiences and methods in use. Much of the discussion ended up being about the theory/practice ratio and the relevance of the content being taught. Although the questions were raised, we did not try to define what the Web is, nor whether a culture can be taught. Plenty to feed my thinking :-).
  • I took part in a discussion on progressing in your craft which surfaced interesting practices. My takeaway is that a field in constant evolution requires you to progress even when you simply want to stay at the same level. Nobody in the room mentioned traditional training courses as a way to progress, which was quite surprising.
  • Finally, I attended Vincent's feedback on his itinerant work, or neo-nomadism. I should write a post about how scopyleft currently works, as I got a lot of questions about it over the two days. The snapshot has since been published.

I will end with the testimony of Roxanne, who is wondering about her entry into the professional world and about how her future paid job will relate to her current volunteer work. This combination is entirely possible and echoes many of the reflections we had while creating and evolving scopyleft. That said, it is clearly the north face, because it requires the right energies of the right people to align at the right moment. Or not; you have to experiment to find out ;-).

The obvious source of failure no one talks about

despair

Every time I see a startup dying, I can't help but try to understand what went wrong. Unfortunately, you can't turn back time or get a time lapse of a multi-year history.

Unlike success, a startup failure might be hard to understand. Obvious reasons exist: lack of product / market fit, co-founder issues, poor execution, lack of culture, failure to build a great team, but they don't explain everything.

Before 2011, Twitter was well known for its "fail whale" and inability to scale. Early Google was relying on (poor) competitors' data before they had a big enough index. And Docker was once a Platform as a Service (PaaS) before pivoting to focus on Docker itself. All of that before they became the success stories we hear about all over the Internet.

Semi-failure is even harder to analyse. How can you know what made a promising company barely survive after 5 or 10 years without diving into an insane amount of data? Besides the company internals, such as product evolution, business model pivots, exec turnover — a sign something's fishy, not necessarily the reason — and poor selling strategies, you need to analyse the evolution of the market, their clients and of course their competitors.

There’s something else no one talks about when analysing failure. Something so obvious it sounds ridiculous until you face it.

Yesterday I wanted to see a friend whose startup only has 1 or 2 months of cash left. Yesterday was also an optional bank holiday in France, but I didn't expect their office door to be closed.

I was shocked. If my company was about to die, I would spend the 30, 45 remaining days trying to save it by all means. I'd run everywhere trying to find some more cash. I'd have the product team release that deal-breaking, differentiating feature. I'd try to find a quick-win pivot. I'd even try to sell to my competitors in order to save some jobs. But I'd certainly not take a bank holiday.

Then I remembered every time I went there during the past 2 years, sometimes dropping by for a coffee, sometimes using their Internet connection when I was working remotely and did not want to stay at home. There was no one before 10:00AM, and there was no one after 7:00PM. There were always people playing foosball / the playstation / watching a game on TV. Not like they were thousands of people, more like a dozen. I remember late lunches and early parties.

Despite a fair product and promising business plan, they missed something critical. “Work hard, play harder” reads from left to right, not the other way around. In the startup world, the obvious source of failure no one talks about is the lack of work.

The myth of the "always at 200%" team

From Paris Marathon 2008

In the past decade, I’ve met many entrepreneurs asking their team to be as dedicated to their job as they are.

When I hire someone, I want them to be at 200%, 24/7, 365 days a year. If I send them an email at 2:00 AM, I expect an answer within 10 minutes. That’s the way you build a great business.

They all failed.

I experienced that state of mind, and it didn't turn out well. Employees of the company would stay awake late to be sure they would not miss an email from the boss. They wanted to be the first ones to answer, to show how reactive and motivated they were. The ugly truth was, none of us was working efficiently at building a great company. We were slacking on the Web late at night, checking our email from time to time, just in case something happened. After one year, we all stopped pretending, and an incredible percentage of the team got divorced.

Building a great team is hard. Keeping it is harder, and high turnover is already a failure.

Quoting Richard Dalton on Twitter,

Teams are immutable. Every time someone leaves, or joins, you have a new team, not a changed team.

This is even more true for small teams, where losing someone with specific skills can put the whole team at risk.

When one of my guys had to take a 3-month sick leave, I had 2 options. The first one was to spread his projects across the whole team, putting more pressure on them, asking them to do extra hours and to work during the weekend so we could finish all the projects on time as expected. The second one was rescheduling the less critical projects and explaining to my management why we would postpone some of them, and why it was OK to do so.

Had I been in a huge corporation where the Monday morning reporting meeting is a sacred, well-oiled ceremony, with all your peers looking at each other, expecting someone to fall from their pedestal as the leader gets mad at them for letting their projects slip, I might have picked the first option.

Forcing the whole team to work harder because one of them is missing would have been a huge mistake with terrible results. You can't expect a team already working under pressure to work even more to catch up with their absent colleague's projects without getting poor results and resentment towards the missing guy. Worse, it would have led to someone else going on sick leave after a few weeks, or leaving altogether, destroying the team and all the effort put into building it over more than a year.

As a manager, your first duty is to make sure the whole team succeeds, not to cover your ass from your management's ire. For that reason, I decided to reschedule our projects, because the "always at 200%" team is, in the long run, a myth. And a planned failure.

Building an awesome Devops team on the ashes of an existing infrastructure

Devops Commando

5AM, the cellphone next to my bed rings continuously. On the phone, the desperate CTO of a SaaS company. Their infrastructure is a ruin, the admin team has left the building with all the credentials in an encrypted file. They'll go out of business if I don't agree to join them ASAP, he cries, threatening to hang himself with a WiFi cable.

Well, maybe it’s a bit exaggerated. But not that much.

I've given my Devops Commando talk about taking over an undocumented, unmanaged, crumbling infrastructure a couple of times lately. My experience in avoiding disasters and building awesome devops teams raised many questions, which led me to write everything down before I forget about it.

For the past 10 years, I've been working for a couple of SaaS companies and done some consulting for a few others. As a SaaS company, you need a solid infrastructure as much as a solid sales team. The first is often considered a cost for the company, while the second is supposed to keep it alive. Both should actually be considered assets, and many of my clients only realised it too late.

Prerequisites

Taking over an existing infrastructure after everyone has left the building, and turning it into something viable, is a tough, long-term job that won't happen without management's full support. Before accepting the job, make sure a few prerequisites are met or you'll face certain failure.

Control over the budget

Having tight control over the budget is the most important part of taking over an infrastructure that requires full replatforming. Since you have no idea how much you'll have to add or change, it's a tricky exercise that needs either some experience or a budget at least twice as big as the previous year's. You're not forced to spend all the money you're allowed, but at least you'll be able to accomplish your mission during the first year.

Control over your team hires (and fires)

Whether you're taking over an existing team or building one from scratch, be sure you have the final word on hiring (and firing). If the management can't understand that people who used to "do the job" at a certain stage of the company's life don't fit anymore, you're heading for big trouble. Things get worse when you inherit people who've been slacking or underperforming for years. After all, if you're jumping in, it's because some things are really fishy, aren't they?

Freedom of technical choices

Even though you'll have to deal with an existing infrastructure, be sure you'll be given a free hand on new technical choices when they come up. Being stuck with a manager who blocks every new technology he doesn't know about, or being forced to pick up all the newest fancy, not-production-ready things they've read about on Hacker News, makes an ops person's life a nightmare. In my experience, keeping the technologies that work, even though they're outdated or you don't like them, can save you lots of problems, starting with managing other people's egos.

Freedom of tools

Managing an infrastructure requires a few tools, and you'd better pick the ones you're familiar with. If you're not allowed to switch from Trac to Jira, or are refused a PagerDuty account for whatever reason, be sure you'll soon get in trouble for anything else you'll have to change. Currently, my favorite, can't-live-without tools are Jira for project management, PagerDuty for incident alerting, Zabbix for monitoring and ELK for log management.

Being involved early in the product roadmap

As an ops manager, it's critical to be aware of what's going on at the product level. You'll have to deploy development, staging (we call it theory because "it works in theory") and production infrastructure, and the sooner you know, the better you'll be able to work. Being involved in the product roadmap also means that you'll be able to help the backend developers with architecture before they deliver something you won't be able to manage.

Get an initial glance at the infrastructure

It's not really a prerequisite, but it's always good to know where you're going. Having a glance at the infrastructure (and even better, at the incident logs) allows you to set your priorities before you actually start the job.

Your priorities, according to the rest of the company

Priority is a word that should not have a plural

For some reason, most departments in a company have a single, defined priority. Sales' priority is to bring in new clients, marketing's to generate new leads, the developers' to create a great product without bugs. When it comes to the devops team, every department has a different view of what you should do first.

The sales, consulting and marketing teams expect stability first, to win new clients and keep the existing ones. A SaaS company with an unstable infrastructure can't win new clients or keep the existing ones, and gets bad press outside. Remember Twitter's Fail Whale era? Twitter was more famous for being down than for anything else.

The product team expects you to help deliver new features first, and they're not the only ones. New features are important to stay ahead of your competitors. The market expects them, the analysts expect them, and you must deliver some if you want to look alive.

The development teams expect on-demand environments. All of them. I've never seen a company where the development team was not asking for a virtual machine they could run the product on. And they consider it critical to being able to work.

The company execs, the legal team and your management expect documentation, conformance to standards, certifications, and they expect you to deliver fast. It's hard to answer an RFP without strong documentation showing you have a state-of-the-art infrastructure and top-notch IT processes.

As a devops manager, your only priority is to bring back confidence in the infrastructure, which implies meeting the whole company's expectations.

The only way to reach that goal is to provide a clear, public roadmap of what you're going to do, and why. All these points are critical; they all need to be addressed, not at the same time, but always with an ETA.

Our work organisation

I'm a fan of the Scrum agile methodology. Unfortunately, 2–3 week sprints and immutability do not fit a fast-changing, unreliable environment. Kanban is great at managing ongoing events and issues, but makes it harder to give visibility on the projects. That's why we're using a mix of Scrum and Kanban.

We run 1-week sprints, with 50% of our time dedicated to projects and 50% dedicated to managing ongoing events. Ongoing events are both your daily system administration and requests from the developers that can't wait for the following sprint.

Our work day officially starts at 10AM with the daily standup around the coffee machine. Having a coffee-powered standup turns what can be seen as a meeting into a nice, devops-friendly moment where we share what we've done the day before, what we're going to do, and which problems we have. If anyone's stuck, we discuss the various solutions and plan a pair-working session if it takes more than a minute to solve.

Sprint planning is done every Friday afternoon so everybody knows what they'll do on Monday morning. That's a critical part of the week. We all gather around a coffee and start reviewing the backlog issues. Tasks we were not able to complete during the week are added to the top of the backlog, then we add the developers' requests we promised to take care of, then the new projects. People pick the projects they want to work on, with me saying yes or no, or, as a last resort, assigning the projects I consider we must do first. We make sure everyone works on all the technologies we have, so there's no single point of failure in the team and everybody can learn and improve.

Each devops works alone on their projects. To avoid mistakes and share knowledge, nothing ships to production without a comprehensive code review, so at least 2 people in the team are aware of what's been done. That way, when someone is absent, the other person can take over the project and finish it. In addition to code reviews, we take care of documentation, the minimum being operation instructions added to every Zabbix alert.

Managing the ongoing events

Managing ongoing events is tricky because they often overlap with the planned projects and you can't always postpone them. It will most probably take a few months before you're able to do everything you planned within a week.

During the day, incident management is the duty of the whole team, not only of the on-call person. On-call devops also have their own projects to do, so they can't be assigned all the incidents. Moreover, some of us are more at ease with certain technologies or parts of the infrastructure and are more efficient when managing an incident. (Backend) developers are involved in incident management when possible. When pushing to production, they provide us with a HOWTO to fix most of the issues we'll run into, so we can add them to the Zabbix alert messages.

We try to define a weekly contact who manages the relationship with the development team, so we're not disturbed 10 times a day and won't move without a Jira ticket number. The task is then prioritised into the current sprint or a later one, depending on the urgency. When managing relationships with the development teams, it's important to remember that "no" is an acceptable answer if you explain why. The BOFH style is the best way to be hated by the whole company, and that's not something you want, is it?

In any case, we always provide the requester with an ETA so they know when they can start working. If the project is delayed, we communicate about the delay as well.

When you have no one left

Building a new team from scratch because everyone left the building before you joined is a rewarding and exciting task, except that you can't stop the company's infrastructure while you're recruiting.

During the hiring process, which can take up to 6 months, I rely on external contractors. I hire experienced freelancers to help me fix and build the most important things. Over the years, I've built a good address book of freelancers skilled in specific technologies such as FreeBSD, database management or packaging, so I always work with people I've worked with before, or people who've worked with people I trust.

I also rely on vendor consulting and support for the technologies I don't know. They teach me a lot and help fix the most important issues. When I had to take over a massive Galera cluster, I relied on Percona support for the first 6 months, and we're now able to operate it fully.

Finally, we work a lot with the developers who wrote the code we operate. That's an important part, since they know most of the internals and traps of the existing machines. It also allows us to create a deep link with the team we're going to work with the most.

Recruiting the perfect team

Recruiting the perfect devops team is a long and difficult process, even more so when you have to build a small team. When looking for people, I look for a few things:

Complementary and supplementary skills

A small team can't afford single points of failure, so we need at least 2 people who know the same technology, at least a little when possible. We also look for people who know other technologies, whether or not we'll deploy them someday. Having worked on various technologies gives you great insight into the problems you'll encounter when working on similar ones.

Autonomy and curiosity

Our way of working requires people to be autonomous and not to wait for help when they're blocked. I refuse to micromanage people and ask them what they're doing every hour. They need to be honest enough to say "I don't know" or "I'm stuck" before the project delays go out of control.

Knowledge of the technologies in place and fast learners

Building a team from scratch on an existing platform requires learning fast how to operate and troubleshoot it. Having experience with some of the technologies in place is incredibly precious and limits either the number of incidents or their duration. Since hiring people who know all the technologies in place is not possible, having fast learners is mandatory so they can operate them quickly. Being able to read the code is a plus I love.

Of course, every medal has two sides, and these people are both expensive and hard to find. It means you need to feed them enough to keep them over the long term.

Knowing who your clients are

The first thing to do before starting to communicate is to understand who your client, the one you can get satisfaction metrics from, actually is. Working in a B2B company, I don't have direct contact with our end clients. It means my clients are the support people, salespeople, project managers, and the development team. If they're satisfied, then you've done your job right.

This relationship is not immutable and you might reconsider it after a while. Sometimes, acting like a service provider for the development team does not work and you'll have to create a deeper integration. Or, on the contrary, keep your distance if they prevent you from doing your job correctly, but that's something it takes time to figure out.

Communication and reporting

Communication within the company is critical, even more when the infrastructure is considered a source of problems.

Unify the communication around one person, even more so when managing incidents

We use a dedicated Slack channel to communicate about incidents, and only the infrastructure manager, or the person on call during nights and weekends, communicates there. That way, we avoid conflicting messages, with someone saying the incident is over while it's not totally over. This also requires good communication within the team.

Don’t send alarming messages.

Never. But be totally transparent with your management so they can work on a communication when the shit really hits the fan, which might happen. This might mean they’ll kick you in the butt if you’ve screwed up, but at least they’re prepared.

Finally, we always give an ETA when communicating about an incident, along with a precise functional scope. "A database server has crashed" has no meaning if you're not in the technical field; "People can't log in anymore" does. And remember that "I don't have an ETA yet" is something people can hear.

We do a weekly 3-slide report with the most important elements:

  • KPIs: budget, uptime, number of incidents, evolution of the number of oncall interventions.
  • Components still at risk (a lot in the beginning).
  • Main projects status and ETA.

Discovering the platform

So you're in and it's time for things to get real. Here are a few things I use to discover the platform I'll have to work on.

The monitoring

Monitoring tells you the most useful things about the servers and services you operate. It also provides a useful incident log, so you know what breaks the most. Unfortunately, I've realised that the monitoring is not always as complete as it should be, and you might get some surprises.

The hypervisor

When running on the cloud or a virtualised infrastructure, the hypervisor is the best place to discover the infrastructure, even though it won't tell you which services are running, nor which machines are actually used. On AWS, the security groups provide useful information about the open ports, when it's not 1–65534 TCP.
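
If the platform lives on AWS, a quick pass with the CLI already gives you a first inventory of the security groups (assuming credentials are configured; the query only pulls a few identifying fields):

# One row per security group; dig into IpPermissions afterwards for the open ports.
aws ec2 describe-security-groups \
  --query 'SecurityGroups[].[GroupId,GroupName,Description]' \
  --output table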

nmap + ssh + facter in a CSV

Running nmap with OS and service discovery on your whole subnet(s) is the most efficient discovery method I know. It might bring some surprises as well: I once found a machine with 50 internal IP addresses running a proxy for 50 geo-located addresses! Be careful, too: facter does not return the same information on Linux and FreeBSD.
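
A minimal sketch of that discovery pass (the subnet, SSH user and file names are placeholders):

# Service and OS discovery over the whole subnet, grepable output for later parsing.
# OS detection (-O) needs root.
nmap -sV -O -oG nmap-inventory.txt 10.0.0.0/24

# Collect a few facts per reachable host and append them to a CSV.
for host in $(awk '/Up$/{print $2}' nmap-inventory.txt); do
  ssh "admin@${host}" "facter fqdn operatingsystem operatingsystemrelease" \
    | paste -sd, - >> inventory.csv
done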

tcpdump on the most central nodes

Running tcpdump and / or iftop on the most central nodes gives you a better understanding of the network flows and service communication within your infrastructure. If you run internal and external load balancers, they're the perfect place to sniff the traffic. Having a glance at their configuration also provides helpful information.
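
Something along these lines, run on a load balancer, already says a lot (the interface name is a placeholder; SSH traffic is excluded to keep the output readable):

# Who talks to whom, without resolving names and without our own SSH session.
tcpdump -i eth0 -nn 'not port 22'

# Live per-flow bandwidth, handy to spot the chattiest backends.
iftop -i eth0 -nNP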

Puppet / Ansible

When they exist, automation tools provide great insight into the infrastructure. However, from experience, they're often incomplete, messy as hell and outdated. I remember seeing the production infrastructure running on the CTO's personal Puppet environment. Don't ask why.

The great old ones

People who have been working in the tech team for a while often have deep knowledge of the infrastructure. Beyond how it works, they provide useful information on why things have been done this way and why it will be a nightmare to change.

The handover with the existing team

If you're lucky, you'll be able to work with the team you're replacing for 1 or 2 days. Focus on the infrastructure overview, data workflows, technologies you don't know about and the most common incidents. In the worst case, they'll answer "I don't remember" to every question you ask.

In the beginning

In the beginning, there was Jack, and Jack had a groove. And from this groove came the groove of all grooves. And while one day viciously throwing down on his box, Jack boldly declared, "Let there be HOUSE!", and house music was born. "I am, you see, I am the creator, and this is my house! And, in my house there is ONLY house music."

So, you’re in and you need to start somewhere, so here are a few tricks to make the first months easier.

Let the teams who used to do it manage their part of the infrastructure. It might not be state of the art system administration, but if it works, it’s OK and it lets you focus on what doesn’t work.

Create an inventory as soon as you can. Rationalise the naming of your machines so you can group them into clusters and later on, automate everything.

Restart every service, one by one and under control, to make sure they come back. Once, I found a Postfix configuration inside a sendmail.cf file, and the service had not been restarted for months. Another time, a cluster refused to restart after a crash because the configuration files were referring to servers that had been removed a year earlier.

Focus on what you don't know but works, then look at what you know but needs fixes. The first time I took over a sysadmin-less infrastructure, I left the load balancers aside because they were running smoothly, and focused on the constantly crashing PHP batch jobs. A few weeks later, when both load balancers crashed at the same time, it took me 2 hours to understand how everything was working.

Automate on day one

In the beginning, you'll have to do lots of repetitive tasks, so you'd better start automating early.

If I have to do the same task 3 times, I've already wasted my time twice.

The most repetitive thing you'll have to do is deployments, so you'd better start with them. We're using an Ansible playbook triggered by a Jenkins build, so the developers can deploy whenever they need to, without us. If I wanted to, I could even ignore how many deployments to production are done every day.

Speaking of involving the developers, ask the backend developers to provide the Ansible material needed to deploy what they ask you to operate. It's useful both for them, to ensure dev, theory and production are the same, and for you, to know things will be deployed the way they want, with the right libraries.

Giving some power to the development team does not mean letting them play in the wild. Grant them only the tools they need, for example Jenkins builds or users with limited privileges for automated deployment.
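
As an illustration only, here is a minimal sketch of the kind of playbook such a Jenkins build could trigger; the hosts group, paths, service name and the BUILD_TAG variable are made up for the example:

# deploy.yml: triggered by a Jenkins build, run as a low-privilege deploy user.
- hosts: webservers
  vars:
    app_version: "{{ lookup('env', 'BUILD_TAG') | default('latest', true) }}"
  tasks:
    - name: Fetch the release archive built by Jenkins
      get_url:
        url: "https://jenkins.example.com/releases/app-{{ app_version }}.tar.gz"
        dest: "/home/deploy/releases/app-{{ app_version }}.tar.gz"

    - name: Unpack the release into the application directory
      unarchive:
        src: "/home/deploy/releases/app-{{ app_version }}.tar.gz"
        dest: /home/deploy/app
        remote_src: yes

    - name: Restart the application (the deploy user only has sudo rights for this)
      service:
        name: app
        state: restarted
      become: yes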

Resist in hostile environments

Hostile environment: anything over which you don't have at least acceptable control.

Developer-designed servers are a nightmare to operate, so let's talk about them first. A developer-designed server is a machine providing a full feature without service segregation. The processes, database, cache stack… all run on the same machine, making it hard to debug and impossible to scale horizontally. And such machines take a long time to split. They need to be split into smaller logical (virtual) machines you can expand horizontally. That brings reliability and scalability, but has a significant impact on your network in general and on your IP addressing in particular.

Private clouds operated by a third party are another nightmare, since you don't control resource allocation. I once had a MySQL server that crashed repeatedly and couldn't understand why. After weeks of searching and debugging, we realised the hosting company was doing memory ballooning because they considered we used too much memory. Ballooning is a technique that fills part of the virtual machine's memory so it won't try to use everything it's supposed to have. When MySQL started to use more than half of the RAM it was supposed to have, it crashed because there wasn't enough, despite the operating system saying the contrary.

AWS is another hostile environment. Machines and storage might disappear at any time, and you can't debug their NATted network. So you need to build your infrastructure for failure.

Write documentation early

Finally, let's talk about documentation. Infrastructure documentation is often considered a burden, and with the infrastructure-as-code fashion, automation scripts are supposed to be the real documentation. Or people say "I'll write documentation when I have time, for now everything is on fire".

Nothing could be more wrong (except running Docker in production). But yes, writing documentation takes time and organisation, so you need to iterate on it.

The tool is a critical part if you want your team to write documentation. I like using a Git-powered wiki with flat Markdown files, like the ones on GitHub or GitLab, but it does not suit everyone, so I often fall back to Confluence. Whatever the tool, make sure the documentation is not hosted on your own infrastructure!

I usually start small, writing down operation instructions in the monitoring alert messages. It's a good start and allows you to solve an incident without digging through the documentation looking for what you need. Then we build infrastructure diagrams using tools like OmniGraffle on the Mac, or Lucidchart in the browser. Then we write comprehensive documentation around those two.

Conclusion

Well, that’s all folks, or most probably only the beginning of a long journey in turning a ruin into a resilient, blazing fast, scalable infrastructure around an awesome devops team. Don’t hesitate to share and comment if you liked this post.

The downside of loving your job too much

Bored monkey

A few weeks ago, we were discussing what we'd do after leaving our company, before starting another job. Most of us wanted to take a 1-month break during the summer vacation so they could enjoy time with their family. Having a tough job including 24/7 on-call duty at least once a month, I was more radical:

“When I leave Synthesio after cashing in my stock for a few million (I hope), I'll take a 6-month vacation.” My wife smiled at me and said:
“I don't believe you. After one month you'll get so bored that you will start another company and we won't see you for another 5 years.”

After thinking about it for a second, I told her she was wrong. I'd probably get bored to tears after 2 weeks.

Migrating a non-Wordpress blog to Medium for the nerdiest

Caves Pommery à Reims

I've just finished migrating 10 years of blogging from Publify to Medium. In case you're wondering, I'll keep this blog online and updated, but I wanted to benefit from the community and the awesome UI Medium brings, since I've never been able to do a proper design myself.

In this post, I'll explain how you can migrate any blog to Medium, since only Wordpress imports are supported so far.

To migrate your blog to Medium, you need:

  • Some dev and UNIX skills.
  • A blog.
  • A Medium account and a publication.
  • A Medium integration API key.
  • To ask Medium to remove the publishing API rate limit on your publication.

Installing medium-cli

There are many Medium SDKs around, but most of them are incomplete and won't let you publish to publications; there's a workaround, though. I've chosen to rely on medium-cli, an NPM command-line interface that does the trick.

$ (sudo) npm install (-g) medium-cli

medium-cli does not allow pushing to a publication, so we'll have to patch it a bit to make it work. Edit the lib/medium.js file and replace line 38 with:

.post(uri + 'publications/' + userId + '/posts')

Since medium-cli also cleans unwanted arguments, we’ll have to add 2 lines at the end of the clean() function, lines 61–62.

publishedAt: post.publishedAt,
notifyFollowers: post.notifyFollowers

These are 2 important options:

  • publishedAt is an undocumented API feature that allows you to backdate your old posts.
  • notifyFollowers prevents Medium from spamming your followers with every one of your newly published posts.
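
For reference, this is the kind of front matter each exported article ends up with once those two fields are added (the values are just an example; the full list of fields appears in the rake task below):

---
title: "An old post"
tags: "startups, ux"
canonicalUrl: https://t37.net/an-old-post.html
publishStatus: public
license: cc-40-by-nc-sa
publishedAt: '2012-06-01T10:00:00+02:00'
notifyFollowers: false
---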

Setting up medium-cli to post to your publication

To post to Medium with medium-cli, you need:

  • Your Medium user id.
  • Your Medium publication id.

We'll get them using the API with some curl foo.

First, get your own information:

$ curl -H "Authorization: Bearer yourmediumapikeyyoujustgenerated" https://api.medium.com/v1/me
{"data":{"id":"1234567890","username":"fdevillamil","name":"Fred de Villamil ✌︎","url":"https://medium.com/@fdevillamil","imageUrl":"https://cdn-images-2.medium.com/fit/c/200/200/0*IKwA8UN-sM_AoqVj.jpg"}}

Now, get your publication id:

$ curl -H "Authorization: Bearer yourmediumapikeyyoujustgenerated" https://api.medium.com/v1/users/1234567890/publications
{"data":[{"id":"987654321","name":"Fred Thoughts","description":"Fred Thoughts on Startups, UX and Co","url":"https://medium.com/fred-thougths","imageUrl":"https://cdn-images-2.medium.com/fit/c/200/200/1*EqoJ-xFhWa4dE-2i1jGnkg.png"}]}

You’re almost done. Now, create a medium-cli blog and configure it with your API key.

$ medium create myblog
$ medium login
$ rm -rf myblog/articles/*

Edit your ~/.netrc file as follows:

machine api.medium.com
 token yourmediumapikeyyoujustgenerated
 url https://medium.com/your-publication
 userId 987654321

Here we are (born to be king): we can now export the content of your blog.

Export your blog content

There are many ways to export your blog content. If you don't have database access, you can crawl it with any script using the readitlater library.

For my Publify blog, I’ve written a rake task.

desc "Migrate to medium"
task :mediumise => :environment do
  require 'article'

  dump = "/path-to/newblog/"

  Article.published.where("id > 7079").order(:published_at).each do |article|
    if File.exists? "#{dump}/#{article.id}-#{article.permalink}"
      next
    end
    Dir.mkdir "#{dump}/#{article.id}-#{article.permalink}"
    open("#{dump}/#{article.id}-#{article.permalink}/index.md", 'w') do |f|
      f.puts "---"
      f.puts "title: \"#{article.title}\""
      f.puts "tags: \"#{article.tags[0,2].map { |tag| tag.display_name }.join(", ")}\""
      f.puts "canonicalUrl: https://t37.net/#{article.permalink}.html"
      f.puts "publishStatus: public"
      f.puts "license: cc-40-by-nc-sa"
      f.puts "publishedAt: '#{article.published_at.to_time.iso8601}'"
      f.puts "notifyFollowers: false"
      f.puts "---"
      f.puts ""
      f.puts article.html(:body)
      f.puts ""
      f.puts article.html(:extended)
      f.puts ""
      f.puts "Original article published on <a href='https://t37.net/#{article.permalink}.html'>#{article.title}</a>"
    end
  end
end

Nothing really complicated, as you can see. Whatever your solution, export into your myblog directory. Then:

$ mkdir archives
$ mkdir failed
$ cd myblog
for i in $(ls | egrep -v ^articles | sort -n | head -n 10); do
  mv $i articles
  foo=$(medium publish | grep Done)
  if [ -z "$foo" ]; then
    echo failed: $i
    mv articles/$i ../failed
  else
    echo OK: $i
    mv articles/$i ../archives/
  fi
  foo=""
done

That’s all folks! Hope it will help you while you’re waiting for Medium to provide an easier way to do it.

ElasticSearch cluster rolling restart at the speed of light with rack awareness

At the speed of light

Woot, first post in more than one year, that’s quite a thing!

ElasticSearch is an awesome piece of software, but some management operations can be quite a pain in the administrator's ass. Performing a rolling restart of your cluster without downtime is one of them. On a 30-something node cluster running shards of up to 900GB, it would take up to 3 days. Thankfully, we're now able to do it in less than 30 minutes on 70 nodes holding more than 100TB of data.

If you search for "ElasticSearch rolling restart", here's what you'll find:

  1. Disable the shard allocation on your cluster.
  2. Restart a node.
  3. Enable the shard allocation.
  4. Wait for the cluster to be green again.
  5. Goto 2 until you’re done.

Of course, step 4 is the longest, and you don't have hours, let alone days. Thankfully, there's a solution to fix that: rack awareness.

About Rack Awareness

ElasticSearch shard allocation awareness is a rather overlooked feature. It allows you to assign your ElasticSearch nodes to virtual racks so that primary and replica shards are not allocated in the same rack. That's extremely handy to ensure failover when you spread your cluster across multiple data centers.

Rack awareness requires 2 configuration parameters: one at the node level (restart mandatory) and one at the cluster level (so you can set it at runtime).

Let's say we have a 4-node cluster and want to activate rack awareness. We'll split the cluster into 2 racks:

  • goldorack
  • alcorack

On the first two nodes, add the following configuration options:

node.rack_id: "goldorack"

And on the 2 other nodes:

node.rack_id: "alcorack"

Restart those nodes, and you're ready to enable rack awareness.

curl -XPUT localhost:9200/_cluster/settings -d '{
    "persistent" : {
        "cluster.routing.allocation.awareness.attributes" : "rack_id"
    }
}'

Here we are, born to be king. Your ElasticSearch cluster starts reallocating shards. Wait until the rebalancing is complete; it can take some time. You'll soon be able to perform a blazing-fast rolling restart.

First, disable shard allocation globally:

curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
    "transient" : {
        "cluster.routing.allocation.enable": "none"
    }
}'

Restart the ElasticSearch process on all the nodes in the goldorack rack. Once your cluster is complete again, re-enable shard allocation.

curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
    "transient" : {
        "cluster.routing.allocation.enable": "all"
    }
}'

A few minutes later, your cluster is all green. Woot!

What happened there?

Using rack awareness, you ensure that the nodes located in alcorack hold a full copy of your data. When you restart all the goldorack nodes, the cluster promotes the alcorack replicas to primary shards, and your cluster keeps running smoothly since you did not break the quorum. When the goldorack nodes come back, they catch up with the newly indexed data and you're green in a minute. Now do exactly the same thing with the other half of the cluster and you're done.

For the laziest (like me)

Since we're all lazy and don't want to ssh into 70 nodes to perform the restart, here's the Ansible way to do it:

In your inventory:

[escluster]
node01 rack_id=goldorack
node02 rack_id=goldorack
node03 rack_id=alcorack
node04 rack_id=alcorack

And the restart task:

- name: Perform a restart on a machine located in the rack_id passed as a parameter
  service: name=elasticsearch state=restarted
  when: rack is defined and rack == rack_id
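
To run it rack by rack, something along these lines should work (the playbook name is made up; the rack extra variable matches the when condition of the task above):

# Restart every node declared in goldorack, wait for the cluster to be green again,
# then do the same for alcorack.
ansible-playbook -i inventory restart-elasticsearch.yml --limit escluster -e rack=goldorack
ansible-playbook -i inventory restart-elasticsearch.yml --limit escluster -e rack=alcorack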

That’s all folks, see you soon (I hope)!