A medical drama unfolded at a US hospital, and IT staff were the heroes


At 12.09am on Friday, July 19, a software update pushed out by cybersecurity firm CrowdStrike shut down 8.5 million Windows machines. This update led to the “blue screen of death,” stalling flights, health care operations and much more.

At Duke Health, the day unfolded like an episode of a medical drama, but with hundreds of IT staff being the heroes restoring hospital operations.

“We had 60,000 machines total running the software, 40,000 got the bad patch, and 22,000 of them had blue screens,” said Dr Jeffrey Ferranti, chief digital officer at Duke Health. The 18,000 machines with the bad update but without the blue screens would have gone blue if rebooted, Ferranti said.

Soon after noticing the outage, Ferranti said, they began a HICS activation - the “hospital incident command system.” He said the HICS activation usually occurs in the event of disasters and emergencies. “It’s the first time we ever activated that for an IT crisis.”

Sergio Chavez, IT analyst at Duke Health Technology Services, has been working at Duke for eight years. He said that in the past, “we’ve had little hiccups, but nothing like this. This was 22,000 devices that were affected.” He hopes it is a “once in a lifetime event.”

First thing, Ferranti said, he and his team tried “to make sure that we don’t have to close any clinics, that our ERs can see patients, (and) that our urgent cares can see patients.”

In this, they succeeded. According to Ferranti, it was only during the first few hours that Duke had to reschedule about 20% of the day’s surgeries to the following week.

How did they fix the blue screens?

Within two hours of the outage, CrowdStrike determined that blue screens could be resolved by deleting the file that contained the software update. For many businesses, this involves putting the computer into safe mode, opening the command line, opening the folder where the file is stored, and deleting the file.

In hospitals, this gets a bit more complicated.

Files are encrypted for additional security, so “these machines had to physically be touched by an IT professional, who could decrypt them, go into the directory, remove the file they had to remove, and then reboot the machine,” Ferranti explained.
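For readers curious about the mechanics of that per-machine fix, the sketch below illustrates the deletion step in Python rather than the Windows command prompt technicians actually used. It assumes the machine has already been started in safe mode and, for an encrypted hospital computer, unlocked with its drive recovery key first. The folder path and the C-00000291*.sys filename pattern come from CrowdStrike’s publicly posted remediation guidance rather than from Duke; the function name is illustrative, and this is a sketch of the idea, not a tool Duke’s teams ran.

from pathlib import Path

# Folder where the Falcon sensor keeps its "channel files", per CrowdStrike's
# public remediation guidance for the July 19 incident.
CROWDSTRIKE_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")

def remove_faulty_channel_files() -> int:
    """Delete the defective channel file(s) so Windows can boot normally.

    Assumes the machine is in safe mode and its drive is already unlocked.
    Returns the number of files removed.
    """
    removed = 0
    # The bad update shipped as files matching C-00000291*.sys.
    for channel_file in CROWDSTRIKE_DIR.glob("C-00000291*.sys"):
        channel_file.unlink()  # delete the file
        removed += 1
    return removed

if __name__ == "__main__":
    count = remove_faulty_channel_files()
    print(f"Removed {count} channel file(s); reboot the machine normally.")

The decryption step Ferranti mentions generally means typing in a long recovery key by hand before the drive can be read at all, which is a big part of why every machine needed an in-person visit.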

Once it was clear that each of the 40,000 computers had to be fixed one-by-one, Chavez said there were “sign-up sheets for anyone that was willing to come in to help.”

At around 6 o’clock that Friday morning, “Everyone was willing to throw their hat in and just kind of help,” Ferranti said.

“It was a bit of a low-tech approach, but we had them go around and put yellow sticky notes on all the machines that were critical for operations,” he said. “Then we had well over 100 IT people going around and fixing those machines so that the critical ones were taken care of first, and every unit had 50% of their machines up.”

He said each machine took five to eight minutes to fix.

In addition to fixing the thousands of machines, the IT team “was working on the back end on servers to make sure that the electronic health record was working, that the imaging system was up and running,” Ferranti said. This way, by the time the blue screens were fixed on individual computers, the systems and applications running on those computers would also be working.

Surviving the outage required an army and a peek into the past

By noon, the latest Duke Alert read, “Most critical systems have been brought back online, and nearly two-thirds of computers/laptops have been restored.”

Over the course of the weekend, more than 300 staff across the whole IT department were involved in this operation. Since “an army of people (were) there 24 hours a day all weekend,” Ferranti noted, “within 72 hours, we had basically all our machines up.”

Ferranti said when an outage of this scale occurs, “you have to resort to older ways of doing things.” He noted that clinicians could use iPads and iPhones to access the electronic health record. It worked, he said, even if “the ways they got that information might be different than they’re used to in their day-to-day job.”

The hospital also experiences downtime during routine updates, and employees are prepared to use pen and paper, if needed, to take care of patients. Ferranti said that’s helpful, but this outage was different. “It’s just the size, the scale, the scope - the length of this event was larger than anything we’d seen before.”

He credited the clinical teams for their resilience during the outage. “While we can deliver safe care, that care takes longer, the work is harder. It was a lot of extra work for the clinical teams, and they really rose up to the challenge.”

UNC Health said its IT teams also “worked quickly to update any affected computers and our operations remained fully functional. Those IT teams continued to work through the weekend to troubleshoot any remaining issues.”

WakeMed said it does not use CrowdStrike and was not affected by the outage.

The aftermath

Chavez noted that while most machines across Duke Health were fixed over the weekend, some facilities were closed during that time, so their computers went untouched. In those cases, employees encountering the blue screen of death come Monday morning would “call it in, and it would immediately get taken care of.”

Duke Health is studying the event, focusing on how it can further prepare for outages like this one in the future. But Ferranti does wonder, “How can something like this have happened from a vendor that brought down multiple industries, not just health care?”

He said a “root cause analysis” from CrowdStrike about what exactly happened and how it happened is necessary for them to “think creatively about how we prevent this moving forward.”

For its part, CrowdStrike said on its website that it is strengthening various forms of testing and “implementing staggered roll-outs of content updates, while closely monitoring performance.” Staggered roll-outs could help reduce the broad impact of faulty updates.
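As a rough illustration of what a staggered roll-out means in practice (and not a description of CrowdStrike’s actual deployment pipeline), the sketch below pushes an update to progressively larger slices of a fleet and halts if the crash rate observed in the earlier, smaller slice crosses a threshold. The wave sizes, threshold and telemetry function are all made-up placeholders.

import random

# Hypothetical wave sizes: the fraction of the fleet updated at each stage.
WAVES = [0.01, 0.05, 0.25, 1.00]
MAX_CRASH_RATE = 0.001  # halt if more than 0.1% of updated machines crash

def observed_crash_rate(fleet_fraction: float) -> float:
    """Placeholder for real telemetry; here it just simulates a healthy update."""
    return random.uniform(0.0, 0.0005)

def staggered_rollout() -> None:
    for wave in WAVES:
        rate = observed_crash_rate(wave)
        print(f"Wave covering {wave:.0%} of the fleet: crash rate {rate:.4%}")
        if rate > MAX_CRASH_RATE:
            print("Crash rate above threshold; halting roll-out and rolling back.")
            return
    print("Update reached the full fleet without tripping the safety check.")

if __name__ == "__main__":
    staggered_rollout()

The point is simply that a bad update caught at the 1% wave never reaches the other 99% of machines.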

The US House Committee on Homeland Security sent a letter to CrowdStrike CEO George Kurtz on July 22, calling him to testify before its subcommittee on cybersecurity and infrastructure protection regarding the outage. And shareholders filed a class action lawsuit against the company in Austin, Texas, on July 30, claiming that CrowdStrike said its technology was “validated, tested and certified” but did not properly test updates before rolling them out.

The outage is estimated to have cost Fortune 500 companies US$5.41bil (RM24bil). Health care topped the list of industries affected. – The News & Observer (Raleigh)/Tribune News Service
