Original article: ISPreview UK
The American Content Delivery Network (CDN) and IT services company Cloudflare has committed to making several key changes in order to avoid breaking a significant chunk of the internet again, much as they did on two occasions: in November and, to a lesser extent, in early December 2025.
The bigger of the two events occurred on 18th November, when a huge chunk of the internet suddenly became sporadically inaccessible for several hours after Cloudflare pushed out a “wrong configuration” (i.e. a bug in the generation logic for their Bot Management feature file) that “took down our network in seconds”.
Part of the problem stems from how Cloudflare handles different types of updates. For example, when the company releases software version updates, they do so in a controlled and monitored fashion. For each new binary release, the deployment must successfully pass multiple gates before it can serve worldwide traffic (e.g. deploying to staff traffic first and then a phased roll-out).
“If we detect an anomaly at any stage, we can revert the release without any human intervention,” said the company’s Chief Technical Officer, Dane Knecht, in a new blog post. But Cloudflare doesn’t apply the same methodology to configuration changes, which are deployed instantly. “We give this power to our customers too: If you make a change to a setting in Cloudflare, it will propagate globally in seconds,” added Knecht.
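To illustrate the gated release process described above, here is a minimal Python sketch of a staged roll-out that reverts automatically when a health check fails. It is not Cloudflare's implementation: the `Stage` type, the `staged_rollout` function, the stage names and the traffic shares are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Stage:
    """One gate in the roll-out; names and shares are illustrative only."""
    name: str
    traffic_share: float  # fraction of traffic served by the new release


def staged_rollout(
    stages: List[Stage],
    deploy: Callable[[Stage], None],
    is_healthy: Callable[[Stage], bool],
    revert: Callable[[], None],
) -> Tuple[bool, List[str]]:
    """Deploy stage by stage and revert on the first anomaly.

    Returns (succeeded, names of the stages that were deployed).
    """
    deployed: List[str] = []
    for stage in stages:
        deploy(stage)
        deployed.append(stage.name)
        if not is_healthy(stage):
            # The failed gate triggers the revert; no human is involved.
            revert()
            return False, deployed
    return True, deployed


if __name__ == "__main__":
    # Hypothetical gates: staff traffic first, then a phased roll-out.
    stages = [Stage("staff", 0.01), Stage("canary", 0.05), Stage("global", 1.0)]
    ok, deployed = staged_rollout(
        stages,
        deploy=lambda s: print(f"deploying to {s.name}"),
        is_healthy=lambda s: s.name != "canary",  # simulate an anomaly at one gate
        revert=lambda: print("reverting release"),
    )
    print(ok, deployed)
```

In this simulated run the anomaly appears at the second gate, so the release is reverted before it ever reaches the "global" stage. A configuration change that propagates everywhere in seconds skips these gates entirely.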
Cloudflare now acknowledges that the past two incidents have demonstrated that they “need to treat any change that is applied to how we serve traffic in our network with the same level of tested caution that we apply to changes to the software itself”. As a result, the provider has proposed to gradually make a series of changes to address this and to generally improve resilience, so that if an outage does occur again then its impact should be much less significant. All of this will fall under a new plan called: Code Orange: Fail Small.
Key Plans for Code Orange: Fail Small
➤ Require controlled rollouts for any configuration change that is propagated to the network, just like we do today for software binary releases.
➤ Review, improve, and test failure modes of all systems handling network traffic to ensure they exhibit well-defined behaviour under all conditions, including unexpected error states.
➤ Change our internal “break glass” procedures, and remove any circular dependencies so that we, and our customers, can act fast and access all systems without issue during an incident.
These projects aim to deliver iterative improvements as they proceed, rather than one “big bang” change at their conclusion. By the end of Q1 2026, Cloudflare expects to be in a position to ensure that all production systems are covered by Health Mediated Deployments (HMD) for configuration management (i.e. releasing config updates in the same way as software updates).
The company will also have updated its systems, by the same target date, to adhere to proper failure modes as appropriate for each product set and to ensure they have the processes in place, so the right people have the right access to provide proper remediation during an emergency.
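One common way to give a system a well-defined failure mode for bad configuration is to validate each update and keep serving the last known-good version if validation fails. The Python sketch below shows that general pattern only. The `ConfigHolder` class, the `features` key and the `MAX_FEATURES` limit are assumptions made for illustration and do not describe Cloudflare's actual systems.

```python
import json
from typing import Any, Dict

MAX_FEATURES = 200  # hypothetical size limit, for illustration only


class ConfigHolder:
    """Keeps the last known-good configuration when an update is invalid."""

    def __init__(self, initial: Dict[str, Any]) -> None:
        self.active = initial
        self.rejected = 0  # count of updates refused, useful for alerting

    def apply(self, raw: str) -> bool:
        """Try to activate a new config; return False and keep the old one on error."""
        try:
            candidate = json.loads(raw)
            features = candidate["features"]
            if not isinstance(features, list) or len(features) > MAX_FEATURES:
                raise ValueError("feature list missing or too large")
        except (ValueError, KeyError, TypeError):
            # Fail small: reject the update instead of crashing the service.
            self.rejected += 1
            return False
        self.active = candidate
        return True


if __name__ == "__main__":
    holder = ConfigHolder({"features": ["a"]})
    print(holder.apply('{"features": ["a", "b"]}'))  # valid update
    print(holder.apply("not valid json"))            # malformed update
    print(holder.active, holder.rejected)
```

The design choice here is that a malformed or oversized update is counted and refused, while traffic continues to be served with the previous configuration.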
“We understand that these incidents are painful for our customers and the Internet as a whole. We’re deeply embarrassed by them, which is why this work is the first priority for everyone here at Cloudflare,” said Dane Knecht.