The CrowdStrike event: Addressing the mismatch between automated processes gone wrong and manual recovery

Naresh Rajendra Shah

Manga Sridhar Akella

July 21, 2024 • 4 min read

Given the global impact of the CrowdStrike event, the principles of good digital platform engineering must be emphasized more than ever. Today’s digital network ecosystems have highly automated processes mixing machine intelligence with manual activities. When building and enhancing these mission-critical systems, we are captives of our bounded rationality. In other words, we only know what we know, and it’s what we don’t know that can lead to a black swan event. This blog post outlines six recommendations for strengthening existing software development and security practices.

Incident overview

The recent CrowdStrike event, which disrupted IT services worldwide, is a critical reminder of the vulnerabilities inherent in our increasingly interconnected digital infrastructure. A faulty update to CrowdStrike’s Falcon sensor, which operates at the Windows kernel level, caused Windows machines around the world to crash. The outage affected critical infrastructure across sectors including airlines, hospitals, law enforcement, and power grids.

Immediate response

The fix for this issue is relatively simple: boot into safe mode, delete a specific system file (C-00000291*.sys), and reboot. The scale of the problem, however, calls for a broader discussion on preventing such incidents in the future. Addressing issues in cloud environments presents unique challenges compared to on-premises systems. Cloud platforms generally do not support booting into safe mode. Instead, administrators must shut down each virtual server, attach and mount its disk on another server, manually remove the offending files, and then reattach the disk to the original server. This complex process underscores the need for a fundamental shift in modern cloud operating systems, which were derived from personal systems and still carry the assumption that a human is nearby to reboot the system in safe mode.
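The file-removal step of the cloud workaround can be scripted once the affected disk is mounted on a healthy server. The following is a minimal sketch, not an official remediation tool: the function name, the dry-run flag, the driver directory, and the file pattern are assumptions based on publicly reported guidance and should be checked against the vendor's own instructions.

```python
from pathlib import Path

# Assumed defaults taken from publicly reported remediation guidance;
# verify them against the vendor's official instructions before use.
DRIVER_SUBDIR = Path("Windows") / "System32" / "drivers" / "CrowdStrike"
FILE_PATTERN = "C-00000291*.sys"


def remove_offending_files(mounted_root, pattern=FILE_PATTERN, dry_run=True):
    """Find (and optionally delete) files matching `pattern` in the driver
    directory of a Windows volume mounted at `mounted_root`.

    Returns the list of matching paths. Nothing is deleted while
    `dry_run` is True.
    """
    driver_dir = Path(mounted_root) / DRIVER_SUBDIR
    if not driver_dir.is_dir():
        return []
    matches = sorted(p for p in driver_dir.glob(pattern) if p.is_file())
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Running it first with dry_run=True lists what would be removed, which gives an operator a chance to confirm the match before anything is deleted. Shutting down the server and detaching and reattaching the disk still have to be driven through each cloud provider's own tooling.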

Systemic inadequacy

The CrowdStrike incident exposed a critical systemic inadequacy in our approach to digital infrastructure. The process that created the challenge was highly automated, rapid, and impacted systems at a huge non-linear scale. In stark contrast, the correction and recovery process was grossly inadequate in terms of speed and heavily dependent on human intervention. This fundamental mismatch between the automated nature of problem creation and the manual nature of problem-solving represents an intrinsic flaw in the current system design that went unaudited and ultimately put us all in this precarious situation.
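A rough calculation makes the mismatch concrete. The sketch below estimates how long a hands-on recovery takes; the function name and the fleet figures in the usage note are hypothetical illustrations, not data from the incident.

```python
def manual_recovery_hours(machines, minutes_per_machine, technicians):
    """Wall-clock hours needed to repair every machine by hand.

    Assumes technicians work in parallel and every repair takes the
    same amount of time.
    """
    return machines * minutes_per_machine / technicians / 60
```

For a hypothetical fleet of 10,000 machines, 15 minutes per repair, and 50 technicians, manual_recovery_hours(10_000, 15, 50) returns 50.0, so more than two full days of nonstop work to undo a failure that spread through an automated channel at machine speed.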

Recommendations

There is much to learn from the CrowdStrike event. The global engineering community needs to work diligently to prevent it, or anything like it, from happening again. The following are our recommendations to strengthen existing software development and security practices:

Implement rigorous testing and phased releases: Supplement automated QA processes with comprehensive testing protocols for business-critical software. Stagger releases so that errors are contained before they escalate.

Embrace open source: Advocate for verified open-source software. Recent events, including the discovery of the xz backdoor before it spread widely and the potential mitigation of CrowdStrike’s Falcon sensor issues on Linux, showcase the effectiveness of community-audited code in identifying and neutralizing threats.

Limit kernel-level access: With great power comes great responsibility, and in circumstances where kernel-level access is needed, it is imperative to weigh the benefits against the risks. When software has elevated access, especially kernel-level permissions, special care must be taken to review the software and its release cycles. The best practices of other operating systems should also be considered.

Adopt memory-safe languages: Transition to memory-safe languages like Rust or Ada, or to languages with stronger built-in safety checks such as Zig, for critical systems. These languages catch many memory errors at compile time or through runtime checks, improve safety without compromising performance, and are increasingly adopted by tech leaders. Rewriting kernels in safe languages is a prudent, if time-intensive, step forward, and such efforts are already underway.

Promote transparency: Minimize secrecy in software processes. Maintain clear, complete documentation to mitigate risks associated with opaque processes in our increasingly software-dependent ecosystem.

Automate recovery and continuity: Develop built-in recovery processes and Disaster Recovery/Business Continuity Plans (DR/BCP) that match potential failure speeds and scales. Incorporate rigorous auditing and regular testing.
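The first and last recommendations work together: a staggered release limits how far a bad update can spread, and an automated rollback lets recovery move at the same speed as deployment. The following is a minimal sketch under stated assumptions. The function staged_rollout, its ring percentages, and the deploy, is_healthy, and rollback callbacks are all hypothetical names; a real pipeline would supply its own.

```python
def staged_rollout(hosts, deploy, is_healthy, rollback,
                   ring_percents=(1, 10, 100), max_failure_rate=0.01):
    """Deploy to `hosts` in progressively larger rings.

    `ring_percents` are cumulative percentages of the fleet. After each
    ring, the health of that ring is checked; if too many hosts fail,
    every host updated so far is rolled back and the rollout stops.

    Returns (completed, updated_hosts).
    """
    updated = []
    start = 0
    for percent in ring_percents:
        # Ceiling division, so that small fleets still get a first ring.
        end = -(-len(hosts) * percent // 100)
        ring = hosts[start:end]
        if not ring:
            continue
        for host in ring:
            deploy(host)
            updated.append(host)
        failures = [host for host in ring if not is_healthy(host)]
        if len(failures) / len(ring) > max_failure_rate:
            # Automated recovery: undo the update in reverse order.
            for host in reversed(updated):
                rollback(host)
            return False, updated
        start = end
    return True, updated
```

With a fleet of 100 hosts and an update that breaks every machine it touches, this sketch stops after the first ring of one host rather than reaching all 100. It assumes rollback can still reach a failed host, for example through an out-of-band management channel; a host that cannot boot would need recovery built in at a lower level.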

The CrowdStrike event is a slap-in-the-face reminder of the unknown risks and vulnerabilities of our digital network ecosystems. By implementing rigorous testing, embracing open-source software, limiting kernel-level access, adopting memory-safe languages, promoting transparency, and automating recovery and continuity, we can reduce reliance on single points of failure, reinforce our software development and security practices, and safeguard against future disruptions. It is essential that we maintain the integrity and reliability of our critical systems, and that starts with prioritizing the principles of good digital platform engineering.

Conclusion

It is crucial that we address the systemic imbalance between automated problem creation and manual problem-solving. To create this balance, we must ensure that the speed and scale of recovery mechanisms match those of potential failure modes. This belongs in our ‘Disaster Recovery (DR)’ and ‘Business Continuity Plan (BCP)’, which in turn form part of our overall ‘System Resiliency Framework.’ Implementing these principles demands expertise in secure open-source implementation, fault-tolerant architecture, memory-safe languages, and digital transformation. Organizations that prioritize these aspects in their enterprise digital platform engineering playbook will be better equipped to safeguard their critical systems and navigate the complexities of today’s heterogeneous digital landscape.

As the CrowdStrike incident clearly demonstrates, the cost of inaction is high. By fostering collaboration, sharing knowledge, and engaging in meaningful dialogue across the tech and business communities, we can take unambiguous steps to strengthen our defenses and collectively build a more resilient and secure digital ecosystem.

Grid Dynamics is a digital native platform engineering services company. We work with Fortune 1000 customers on business-critical platforms and have vast experience in releasing complex software built by globally distributed teams at enterprise scale. Reach out to us and let’s have a conversation.
