Monday, September 16, 2024

Why the CrowdStrike crash hit Delta harder

When a software update by cybersecurity provider CrowdStrike crashed Microsoft Windows operating systems around the world on July 19, each of the major U.S. airlines was forced to halt operations.

But after that, the experiences of American, Delta and United as well as the customers of those airlines could not have been more different.

American had largely recovered its operation by that evening and had only 51 mainline flight cancellations the following day, FlightAware data shows.

Delta, meanwhile, captured unwanted national headlines as network restoration dragged on over five miserable days in which the carrier canceled approximately 7,000 mainline and regional flights. It said those cancellations disrupted the travel of 1.3 million customers and cost the carrier approximately $500 million.

In between those two responses were the results of United, which took three days to get back on track and canceled more than 1,400 flights.

Why Delta performed so poorly has become a source of conflict between the airline, CrowdStrike and Microsoft — a clash that appears to be spiraling toward litigation. Why American had so much less trouble, meanwhile, has been explained in only the most general terms by the airline, which declined to comment for this story.

“One of the things we’ve learned is that, in terms of any disruption, you’ve got to keep track of your aircraft, certainly, but also your crews, in terms of where they are. And you probably ought to take action as quickly as possible to make sure you don’t lose the ability for the purpose of recovery,” American CEO Robert Isom said during an earnings call last month. “We’ve built technology, and we’ve done the right thing to make sure that we take early caution, early steps.”

Those comments hinted at what Delta has said was the biggest sticking point in its protracted recovery from the outage. The airline said 60% of its mission-critical applications, including redundant backup systems, rely on Windows, and that during its recovery it had to physically reset 40,000 servers, a bigger lift than any other airline faced.

But during its long recovery process, CEO Ed Bastian also explained that what slowed Delta the most was the loss of a key crew-tracking tool, which left the airline without visibility on the whereabouts of its flight crews. Absent that knowledge, Delta was unable to reset its operation.

As part of an acrimonious public exchange involving Delta, CrowdStrike and Microsoft, Microsoft accused Delta of deflecting its own responsibility for the long-lasting operational collapse.

“Our preliminary review suggests that Delta, unlike its competitors, apparently has not modernized its IT infrastructure,” Mark Cheffo, an attorney representing Microsoft, wrote in an Aug. 6 letter to Delta.

Experts in airline operations and IT said a variety of factors could have contributed to Delta's poor recovery from the CrowdStrike outage, even as other affected airlines fared much better.

Delta’s heavy reliance on Windows applications might well have played a role, they said. But even random factors, such as who happened to be working that night, could have affected how each airline responded.

“Airlines might have had more staff, or have had the right experience on the shift,” said Daniel Stecher, vice president of business development for IBS Software, which provides cloud-based solutions for airlines. “That person will manage the disruption completely different than someone who is a newbie.”

Matt Cincera, Delta’s senior vice president of software engineering, told the trade publication CIO in March 2023 that the airline’s crew-tracking system is serviced by Kyndryl, not by Microsoft. Amid its hostile exchange with CrowdStrike, Delta still said that the cybersecurity firm was responsible for the crew-tracking failure since the operational disruption led to a massive amount of incomplete data being delivered to that system.

The dispute speaks to a complexity broadly found in airline IT setups, which run on a combination of decades-old mainframe systems and modern, cloud-based applications.

“At least in my experience, the flight operating system, including crew-tracking and maintenance-tracking systems, are some of the oldest and most limiting legacy airlines systems,” said consultant Bob Mann of RW Mann and Co., whose previous experience includes sitting on a key IATA IT committee.

Stecher said the mix of legacy and cloud-based systems common in airline operations centers creates IT silos, which impede both decision-making and response efficiency during disruptions. It’s an issue, he said, that generally leads to suboptimal industrywide reliability.

“You have a lot of redundancy in aircraft. That’s why they’re safe. This redundancy costs money. But on the ground, this redundancy is not in place,” said Stecher, whose employer, IBS, sells a cloud-based integrated operational platform.

Airlines, he said, generally aren’t willing to invest as much as they should in IT.

Mann, though, said that it’s no simple matter for an airline to fully transition away from its legacy operational systems while it is operating thousands of daily flights. Such changes have to be made over a long period of time, he added.

“At the scale these airlines operate, you really can’t afford to make a bet on something that might work,” Mann said.
