Takeaways From the CrowdStrike Glitch
Rob McNelly on the ever-presence of Microsoft, and the promise and challenges that arise from an ever more tangled web of connectivity
I seldom mention the Microsoft Windows operating system or ecosystem in this space. Of course, I do use it. While, as an AIX evangelist, I don't consider myself a Windows expert, I have been there for the evolution from DOS to OS/2 to the current day. And over the years, working as a user if not an administrator, I've become relatively proficient with it.
I use much of the ecosystem: email, word processing, spreadsheets, Authenticator for account logins, etc. Sure, UNIX admins may be happier in the UNIX-based macOS world, or using Linux variants as their desktop, but as I interact with customer sites, I typically encounter Windows virtual machines, attachments, documentation, etc.
This is true for most of us. IBM Power Systems hardware does not run in a vacuum. Oftentimes our servers run side by side with the Microsoft ecosystem. Those Windows machines may be virtual machines interacting with Power Systems from the same data center, or they may be the desktops through which end users access data and applications. It is a Windows-centric world, after all. It's tough for any of us to get away from that.
Automating the management of these many operating systems, whether on Power Systems with Ansible or Puppet or by rolling out automation tools on Windows machines, can be a powerful way to handle the ever-growing numbers of physical and virtual machines we manage in our data centers. To be sure, automation offers numerous benefits, including standardization, reliability and repeatability. But these days, when things go wrong, they can go wrong instantly, and across a much larger group of machines. That's why I argue, as I always do, for the adoption and use of test/dev environments, along with stringent testing before implementing changes to production.
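To make the test-first argument concrete, here is a minimal Python sketch of a staged rollout: the change goes to test, then dev, then production, and stops at the first host that fails a health check. Everything named here is hypothetical and for illustration only: the `staged_rollout` function, the `apply_change` and `is_healthy` callbacks, and the host names. It is not how Ansible, Puppet or any particular vendor does it.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def staged_rollout(
    stages: Iterable[List[str]],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Apply a change stage by stage and stop at the first unhealthy host.

    Returns (hosts updated successfully, host that failed or None).
    All names are hypothetical; the callbacks stand in for whatever
    automation and monitoring tooling a site actually uses.
    """
    updated: List[str] = []
    for hosts in stages:
        for host in hosts:
            apply_change(host)
            if not is_healthy(host):
                # Halt here so the change never reaches later stages.
                return updated, host
            updated.append(host)
    return updated, None


# Example: test first, then dev, then production (made-up host names).
stages = [["test01"], ["dev01", "dev02"], ["prod01", "prod02", "prod03"]]
touched: List[str] = []
updated, failed = staged_rollout(stages, touched.append, lambda h: h != "dev02")
print("updated:", updated)
print("failed:", failed)
print("hosts touched:", touched)
```

In this example the simulated failure on `dev02` halts the rollout, so none of the production hosts is ever touched. That is the whole point of testing outside production first.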
Details on the CrowdStrike Glitch
So with that, let's talk about CrowdStrike:
“Insurers have begun calculating the financial damage caused by last week’s devastating CrowdStrike software glitch that crashed computers, canceled flights and disrupted hospitals all around the globe—and the picture isn’t pretty.
“What’s been described as the largest IT outage in history will cost Fortune 500 companies alone more than $5 billion in direct losses, according to one insurer’s analysis of the incident published Wednesday.
“The new figures put into stark relief how a single automated software update brought much of the global economy to a sudden halt—revealing the world’s overwhelming dependence on a key cybersecurity company—and what it will take to recover.”
This relatively short video (less than 15 minutes) is a great primer. Host Dave Plummer, a retired Microsoft software engineer, explains what happened and why it was necessary to have the driver in the kernel in the first place.
This article details one organization's interesting idea that ultimately resolved its issues:
“Not long after Windows PCs and servers at the Australian limb of audit and tax advisory Grant Thornton started BSODing last Friday, senior systems engineer Rob Woltz remembered a small but important fact: When PCs boot, they consider barcode scanners no differently to keyboards.
“That knowledge nugget became important as the firm tried to figure out how to respond to the mess CrowdStrike created, which at Grant Thornton Australia threw hundreds of PCs and no fewer than 100 servers into the doomloop that CrowdStrike's shoddy testing software made possible.
“All of Grant Thornton's machines were encrypted with Microsoft's BitLocker tool, which meant that recovery upon restart required CrowdStrike's multi-step fix and entry of a 48-character BitLocker key.
“The firm prioritized recovery for its servers, and tackled that task manually. But infrastructure manager Ben Watson and Woltz felt the sheer number of PCs at the firm meant an automated response would be required.
“That response could not, however, involve distributing BitLocker keys–doing so was just too risky to contemplate.
“So was reading keys to workers over the phone or in person. 'It felt like a bad idea to read a 48-character key to people who were already stressed out,' Woltz told The Register.
“Which was when his memory about barcode scanners came into play. The firm had the BitLocker keys for all its PCs, so Woltz and colleagues wrote a script that turned them into barcodes that were displayed on a locked-down management server's desktop. The script would be given a hostname and generate the necessary barcode and LAPS password to restore the machine.
“Woltz went to an office supplies store and acquired an off-the-shelf barcode scanner for AU$55 ($36).
“At the point when rebooting PCs asked for a BitLocker key, pointing the scanner at the barcode on the server's screen made the machines treat the input exactly as if the key was being typed. That's a lot easier than typing it out every time, and the server's desktop could be accessed via a laptop for convenience.”
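The article does not publish Grant Thornton's script, so the following is only a rough Python sketch of the first half of the idea: look up a key by hostname and prepare the text a barcode would need to carry. The function names, the dictionary of keys and the sample key are all made up, and I'm assuming the key is stored as 48 digits, possibly with hyphens or spaces between groups. Rendering the barcode itself would be handed off to a barcode library, which is not shown here.

```python
import re
from typing import Dict


def barcode_payload(recovery_key: str) -> str:
    """Return the digits a scanner should 'type' for a recovery key.

    Assumption: the key is 48 digits, optionally separated by hyphens
    or whitespace. Anything else is rejected rather than guessed at.
    """
    digits = re.sub(r"[\s-]", "", recovery_key)
    if not re.fullmatch(r"[0-9]{48}", digits):
        raise ValueError("expected a 48-digit recovery key")
    return digits


def payload_for_host(hostname: str, keys: Dict[str, str]) -> str:
    """Look up a host's key in a (hypothetical) key store and prepare it."""
    return barcode_payload(keys[hostname])


# Made-up hostname and key, for illustration only.
keys = {"PC-0042": "111111-222222-333333-444444-555555-666666-777777-888888"}
payload = payload_for_host("PC-0042", keys)
print(len(payload), "digits ready to encode")
```

Because a scanner presents itself to the PC as a keyboard, whatever text the barcode carries is simply typed at the prompt. Validating the key before encoding it keeps a bad record from turning into a failed unlock on an already stressful day.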
The entire article is worth your time, so click the link when you have a moment.
What We Do Matters
Why am I talking about this at all? Because our world is reliant on technology, and what we do matters. If you manage workloads on Power Systems, they're likely enterprise-level systems running critical applications for large organizations in industries such as banking and manufacturing, where outages can have lasting repercussions. Certainly we're seeing this with CrowdStrike, as reverberations of the outages continue to ripple through the airline and healthcare industries.
It is easy to get caught up in technology for technology's sake, but don’t forget that while companies purchase IBM Power Systems to host applications that run their businesses, the people who access these applications may not know or care about the hardware or operating system running underneath it all.
As I said at the time: “End users are our customers. If they weren’t using the data we store and process, there would be no need for us.”
CrowdStrike and other recent news stories serve as cautionary tales for us all. For as much good as we can do for our clients, it's worth remembering that we're also capable of negatively impacting lives all around us if we're not careful.