For a few hours in February 2012, I was sure my career was over.
I was responsible for the overall technical operation, development and maintenance of our largest client’s software, a highly customized adaptation of a very early version of our company’s flagship product.
That morning, though, our years of success deploying that solution across the country triggered a massive failure. For technical reasons, the field we used for IDs across the whole system couldn’t hold a number bigger than around 2.7 billion.
And we had just processed our 2.7 billionth lab result.
The system suddenly stopped accepting new records. The customer support team sent me an automated error alert they had never seen before.
My first call was to the support agent to explain the problem and ask them to rally our on-call troops for a Severity 1 incident.
My second call was to my contact at our largest client to let them know what was happening and what we were doing. Well over ten thousand physician offices across the United States stopped receiving their lab results electronically into their practice management system. Instead, our client had to send them as paper faxes.
It felt like a disaster.
The problem seemed insurmountable at first. I had led my current team for less than a year, because the previous engineers had been promoted to lead other teams. The original architects of the early product line hadn’t even thought about this system in five years or more.
A war room was set up, while I participated via phone from another time zone.
My company’s database administrators gathered for a creative brainstorm. How could they avoid the estimated three or four weeks of downtime it would take to migrate all 2.7 billion rows of data to the new record format? And without that downtime, how could they update the records without running out of disk space, which had already been a serious concern?
The original architects returned to help both my current and previous engineers trace everywhere that needed to be updated for the larger ID numbers.
For just over two days, engineers, architects, and administrators worked around the clock in shifts. Data migration and system rollout strategies were proposed, reviewed, challenged, and finalized. Practically every component of every service of every product in this massively integrated system was rebuilt and tested for functional correctness and performance both independently and collectively.
Fifty-four hours after that initial error alert, lab results once again flowed electronically to physician offices.
And do you know what never happened?
Nobody demanded that the original architects defend why they used such a “small” number for the original ID field. It was the logical choice for the system as originally designed.
We certainly set up new monitors to watch other ID fields for a similar problem, but nobody blamed anyone for the fact that those monitors didn’t already exist.
Nobody really cared whose fault it was.
All of the attention was on fixing the problem and delivering lab results so physicians could give patients the care they needed.
And the team did the impossible in 54 hours.
It’s been said that we may do an immense deal of good, if we do not care who gets the credit. It’s also true that we may solve immense problems if we do not care who takes the blame.
That catastrophe was not the end of my career. But with what I learned about collaboration, transparency, creativity, and care for patients that day, it may have been my career’s turning point.