
FangDragon
9 people marked this as a favorite.

I can only guess what the last week has been like for your IT department; hopefully things are stabilizing now and everyone made it through OK. Anyway, once the dust has settled, I’d like to encourage you to (at least internally) write a postmortem. This isn’t about apportioning blame; rather, it’s an exercise to learn what went wrong, what went well, and to ensure measures are in place to prevent future outages.
Something in your process needs to change; multi-day outages really should not be happening in 2018. My day job is in software engineering, and where I work we’ve (mostly) eliminated such outages. The process we use is kind of heavyweight in terms of manpower, but it works:
* Big changes must pass a formal design review before implementation starts.
* All new features are developed behind an experiment flag whose config can be updated as a data push to the servers (i.e. misbehaving features can typically be turned off in moments; see the sketch after this list). To spell this out: during development the server (including the backing DB) and the client code will support both old-style and new-style $THING. This is key to stability.
* All changes are code reviewed, and every code change must include tests.
* All patches are submitted via a pre-submit bot and all tests must pass; direct commits are for emergencies only.
* (For simple projects) every evening, the last known good build (which passes all tests) is promoted to canary for internal testing. In addition, a fixed percentage of live traffic (typically 1%) is routed to the canary.
* The release engineer pushes internal candidates to live several times a week (never on Friday), provided signals from the canary look good.
* During development, new features are whitelisted to internal users, typically on the canary fork of the website, which is backed by a copy of user data. In addition, non-sensitive features may be visible on the beta channel of the website. Feature launches are gated behind a formal process (involving sign-off from security, legal, testing, and leadership).
* During a feature rollout we ramp the launch experiment from (say) 1% of traffic slowly up to 100% (the sketch below shows this kind of percentage ramp). Some rollouts take weeks, others a few hours, depending on what the launch is and how much data we need before we’re comfortable with it.
* After launch, the experiment code is ripped out. Again, all patches are code reviewed and must be submitted to the commit queue (CQ) and pass tests.
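To make the experiment-flag idea concrete, here’s a minimal sketch of what a flag check with a percentage ramp might look like. This is my own illustration, not Paizo’s code or any particular vendor’s API; the names (load_pushed_config, new_checkout_flow, etc.) are made up.

# Hypothetical sketch (Python) of an experiment-flag check with a percentage ramp.
# All names are illustrative, not a real API.
import hashlib

def load_pushed_config() -> dict:
    """Stand-in for the config that arrives as a data push to the servers.

    Editing this data (not the binary) is what lets a misbehaving feature
    be turned off in moments, or ramped from 1% to 100%.
    """
    return {
        "new_checkout_flow": {"enabled": True, "rollout_percent": 1},
    }

def bucket_for(user_id: str, experiment: str) -> int:
    """Deterministic 0-99 bucket per (experiment, user), so a given user
    stays in or out of the experiment as the ramp percentage grows."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def experiment_active(config: dict, experiment: str, user_id: str) -> bool:
    exp = config.get(experiment, {})
    if not exp.get("enabled", False):
        return False  # kill switch: flip "enabled" off to disable the feature
    return bucket_for(user_id, experiment) < exp.get("rollout_percent", 0)

# The server keeps both code paths until the experiment is ripped out after launch.
config = load_pushed_config()
if experiment_active(config, "new_checkout_flow", user_id="user-42"):
    print("serving new-style $THING")
else:
    print("serving old-style $THING")

The point is that ramping up a launch, or killing a bad feature outright, is just a data change to the pushed config; no new binary has to go out.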
Of course mistakes happen…
* We monitor all sorts of signals from our servers and our web apps, and we have alerting which, based on various conditions, pages whoever is currently on call if it looks like something bad is happening (there’s a rough sketch of this loop after the list).
* They would look up the alert playbook, which in the case of a major outage would instruct them to trigger an automated rollback. They’d watch the monitoring console (and test the website) to confirm the situation has stabilized, pulling in additional people as needed. The design of the system must always allow new features to be rolled back.
* The on-call dev would file bugs as needed to inform the dev team of the problem, which they then fix during normal office hours.
* For customer-visible outages we write a postmortem.
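For the monitoring and rollback loop above, a toy version might look something like this. Again, this is a hypothetical sketch: the metric source, the paging call, and the rollback step are placeholders for whatever monitoring, paging, and deployment tooling is actually in use, and the thresholds are made-up numbers.

# Hypothetical sketch (Python) of an alerting check that pages the on-call
# and triggers a rollback when the error rate stays above a threshold.
import time

ERROR_RATE_THRESHOLD = 0.05   # e.g. alert if more than 5% of requests fail
CONSECUTIVE_BAD_CHECKS = 3    # require a sustained breach to avoid noisy pages

def current_error_rate() -> float:
    """Placeholder: would query the monitoring system for the live error rate."""
    return 0.0

def page_oncall(message: str) -> None:
    """Placeholder: would go through the paging / alerting service."""
    print(f"PAGE on-call: {message}")

def rollback_to_last_known_good() -> None:
    """Placeholder: would redeploy the last build that passed all tests."""
    print("Rolling back to last known good build...")

def watch() -> None:
    bad_checks = 0
    while True:
        rate = current_error_rate()
        bad_checks = bad_checks + 1 if rate > ERROR_RATE_THRESHOLD else 0
        if bad_checks >= CONSECUTIVE_BAD_CHECKS:
            # The playbook step: page a human, roll back, and have the on-call
            # confirm on the monitoring console that things have stabilized.
            page_oncall(f"Error rate {rate:.1%} for {bad_checks} checks in a row")
            rollback_to_last_known_good()
            bad_checks = 0
        time.sleep(60)

None of this helps unless the constraint in the list holds: the system has to be designed so that new features can always be rolled back.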

Bobson
6 people marked this as a favorite.

Some from column A, some from column B.
I'm also in software, and while FangDragon's process is a really good and safe one, it's also overkill for a company with a team the size of Paizo's. However, a pared-down version scaled for a team of Paizo's size would be appropriate to have.
I'm not going to go into as much detail, but for a team of 2-6 people, I would expect the following as a minimum:
Given how important the website is to Paizo, I expect them to be well beyond these minimums in many ways, and I really can't see how _anything_ should be able to take out a modern website for almost a week, short of a cascade of failures.
I do hope we get a post-mortem, at least at a high level. If it really was a major disaster on that scale, I'd love to be able to learn from it.

UnArcaneElection

Apparently this is the dominant thread for this topic. But before I go, I have to say I prefer the title of this one here . . . .

FangDragon

"FangDragon - how many people are involved in that process? I suspect at least 3x more than what Paizo has for IT... Perhaps a more feasible way forward would be to outsource the technology/hosting to a dedicated website company that can afford to have the mature processes like yours."
The size of Paizo's IT department was at the back of my mind when I wrote that. My immediate team is just three people, and we did all of the above for one product, but yes, the wider engineering department is rather larger, which helps sustain the development culture. The thing is, the kind of process I describe isn't a secret; it just requires manpower (which is expensive) and management buy-in. Ultimately it's a business decision.