Striking the balance: Cloudflare's innovation and the quest for reliability

If you’re familiar with Cloudflare, you likely know them as one of the world’s preeminent CDNs. To quantify their reach: Cloudflare operates 310 Points of Presence worldwide, hosts close to 8 million websites, and boasts a workforce of over 3000 employees.

Their stature rests not just on size but also on innovation: they are trailblazers known for groundbreaking features, bold choices, and a fast release cadence. There’s a wealth of testimonials from developers and organizations that attest to the significant value Cloudflare provides.

However, if you’ve been keeping tabs on Cloudflare recently, you may also recognize them from two major service outages: one on October 30th and another that began on November 2nd and, at the time of writing, has been ongoing for over two days.

As someone who has extensively used their services and values their proactive and transparent response to such incidents, I wanted to offer my insights and contextual analysis of the current situation.

Cloudflare’s remarkable evolution

Before we examine the recent incidents, it’s important to review Cloudflare’s evolution over the past decade. From their origins as a Content Delivery Network (CDN), they’ve transformed into a multifaceted online platform that offers website optimization, hosting services, and a plethora of other features. This background sets the stage for understanding the context of the issues they’ve encountered.

Cloudflare’s rapid pace of innovation is impressive. Beyond their annual ‘Cloudflare Week’, they regularly roll out new features and have consistently been at the vanguard of solving complex issues with straightforward, effective solutions.

Take, for example, their response to a breach at their authentication provider, Okta. Cloudflare quickly released a “HAR sanitizer” tool within two days of the breach, which removed sensitive information from HAR files — a simple but powerful solution. Considering Cloudflare’s size, with a workforce exceeding 3000 and a vast customer base, their ability to deploy new features swiftly is a testament to their agility.
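To make the idea concrete, here is a minimal sketch of what sanitizing a HAR file can look like. It is not Cloudflare’s tool and makes no claim about how theirs works; it only assumes the standard HAR layout (a log object with entries, each holding a request and a response with headers and cookies). The list of sensitive header names and the “[redacted]” placeholder are my own choices for illustration.

```python
import copy

# Assumption: the header names treated as sensitive in this sketch.
SENSITIVE_HEADERS = {"authorization", "cookie", "set-cookie"}
# Assumption: placeholder written in place of a removed value.
REDACTED = "[redacted]"


def _scrub_message(message):
    """Redact sensitive headers and every cookie value in one request or response."""
    for header in message.get("headers", []):
        if header.get("name", "").lower() in SENSITIVE_HEADERS:
            header["value"] = REDACTED
    for cookie in message.get("cookies", []):
        cookie["value"] = REDACTED


def sanitize_har(har):
    """Return a deep copy of a parsed HAR document with credentials redacted."""
    clean = copy.deepcopy(har)  # leave the caller's data untouched
    for entry in clean.get("log", {}).get("entries", []):
        _scrub_message(entry.get("request", {}))
        _scrub_message(entry.get("response", {}))
    return clean
```

A real sanitizer would also need to look at query strings, request bodies, and tokens embedded in URLs, which this sketch deliberately leaves out.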

Cloudflare’s commitment to product development and customer service is evident through consistent feature releases that enhance the developer and customer experience. 

Here are some of the notable features and services Cloudflare has introduced:

  • Workers, launched in 2017, enable running code close to users worldwide, with near-instant, global deployments.
  • Building on Workers, they’ve added Workers KV (a global key-value store), R2 for large object storage (akin to S3), Pages for hosting static websites (similar to GitHub Pages), as well as Queues and global relational databases.
  • And, just like any other tech company, they recently introduced AI with Workers AI.

For specific use cases, Cloudflare provides tailored features such as:

  • Cloudflare Images for hosting and on-the-fly image manipulation.
  • Cloudflare Stream for an integrated video product API.
  • Turnstile as a free reCAPTCHA alternative.

Complementing their security-first ethos, Cloudflare also offers specialized services:

  • Cloudflare Access for setting up website access policies.
  • Web Application Firewall for threat mitigation.
  • Remote Browser Isolation for secure, virtual browsing within a browser.

Notably, Cloudflare also caters to consumers with products like the 1.1.1.1 DNS resolver, Cloudflare WARP, and their partnership with Apple for iCloud Private Relay.

The breadth of Cloudflare’s offerings is continually expanding, and this article only scratches the surface.

As Cloudflare moves further away from their CDN-only roots, they continue to innovate, simplifying the work of developers and businesses. However, with the expansion of their service suite comes increased complexity in their platform and infrastructure.

The complexity of scaling innovation

Every new feature enhances user experience but also adds complexity to the system architecture. This complexity, although unavoidable, must be managed thoughtfully to avoid undermining the system’s long-term stability. Merely adding features without considering the implications on stability can be shortsighted.

My extensive use of Cloudflare’s services has given me firsthand appreciation of their benefits. However, it’s clear that the complexity behind their offerings can lead to technical issues.

Cloudflare practices “dogfooding,” meaning they use their own products. This approach can inadvertently increase system complexity and reduce redundancy. For instance, during the incident on October 30th, a failure in a Workers KV deployment caused widespread outages across many Cloudflare services that depended on the KV system for storing and synchronizing configuration data.

This issue triggered a domino effect, with additional services failing in succession. Notably, when Cloudflare Access encountered problems, it exacerbated the situation because Cloudflare employees rely on Access to secure their internal tools, preventing them from resolving the issues internally.
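The domino effect can be illustrated with a small dependency map. The sketch below is purely illustrative: the service names and edges are hypothetical, loosely inspired by the incident description, and do not reflect Cloudflare’s real architecture. Given a map of which service needs which, it computes everything that is directly or transitively affected when one service fails.

```python
from collections import deque

# Hypothetical dependency map: each service lists what it needs to function.
DEPENDS_ON = {
    "workers-kv": [],
    "access": ["workers-kv"],
    "pages": ["workers-kv"],
    "internal-tools": ["access"],
    "cdn-cache": [],
}


def blast_radius(depends_on, failed):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk outward from the failed service.
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


print(sorted(blast_radius(DEPENDS_ON, "workers-kv")))
```

In this toy map, a failure in the key-value store reaches the internal tools through Access, even though the tools never reference the store directly. That indirect path is what makes such failures hard to anticipate.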

Cloudflare reported that they remedied the outage using emergency “break-glass” procedures. While this speaks to the team’s competency and ability to respond swiftly, it also highlights the inherent risks in their system’s complexity and interdependencies, which can lead to service disruptions.

It’s not that Cloudflare’s team is lacking in skill or knowledge; rather, the concern is that the number of incidents is growing, which is hard to accept for a platform this critical to the online world.

I admire Cloudflare’s commitment to transparency and proactive incident management. Their status page is promptly updated during incidents, and they provide thorough post-mortem reports that include apologies, insights, and planned corrective actions—which I believe they implement effectively.

Nonetheless, there is an emerging pattern where the platform’s stability appears to be compromised by the accelerated introduction of new features and improvements. This rapid development pace may be contributing to the recurrent issues, signaling a need for a more cautious approach to platform evolution.

Cloudflare’s path forward: Balancing innovation with stability

It might be prudent for Cloudflare to slow down the introduction of new features temporarily and acknowledge the need to enhance the platform’s stability. This involves a thorough review of how all their services interconnect and depend on one another.
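One concrete form such a review could take is checking a service map for circular dependencies, where the tools needed to fix a service depend on that same service. The sketch below is a generic depth-first search, not anything Cloudflare has described; the service names in the example map are hypothetical.

```python
def find_cycle(depends_on):
    """Return one dependency cycle as a list of names, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    state = {}
    path = []

    def visit(node):
        state[node] = GREY
        path.append(node)
        for dep in depends_on.get(node, []):
            if state.get(dep, WHITE) == GREY:
                # `dep` is already on the current path, so we closed a loop.
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = BLACK
        return None

    for node in depends_on:
        if state.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical map: the deploy tool sits behind the access layer it deploys.
EXAMPLE = {
    "access": ["config-store"],
    "config-store": ["deploy-tool"],
    "deploy-tool": ["access"],
}
print(find_cycle(EXAMPLE))
```

Not every cycle is a bug, but each one found this way is a place where a break-glass path that bypasses the loop is worth having.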

By doing so, Cloudflare can ensure that their significant value proposition is matched by a stable and reliable user experience.

I think Cloudflare’s track record speaks for itself: they are a dynamic and innovative force, continuously expanding their horizons and bringing tangible benefits to the digital landscape. Their responsiveness to issues and commitment to transparency are not just commendable but set a standard in the industry.

Yet, with rapid growth and expansion come challenges that need addressing. The recent service disruptions are reminders that there’s a crucial balance to be struck between innovation and stability. As Cloudflare advances, the task at hand is clear—improving the resilience and reliability of their services.

While perhaps less glamorous than launching new features, fortifying the foundation is essential. Cloudflare is indeed doing great work; however, there’s also significant work to be done in enhancing the overall stability of their platform. Stability may not be the most eye-catching headline, but for the users who rely on Cloudflare daily, it is undoubtedly the most important.

About the 48-hour incident at Cloudflare

The catalyst for this article was Cloudflare’s incident on October 30th, triggered by an issue with Workers KV, which led to disruptions across several of their global services. While drafting this piece, Cloudflare experienced yet another disruption beginning on November 2nd and extending to November 4th.

In their post-mortem analysis of the latter incident, Cloudflare outlined how certain systems failed because they had been wrongly assumed to be redundant. They also identified previously unknown or misunderstood service dependencies, which contributed to critical oversights.

This post-mortem reinforces the central argument of my article: the complexity added to platforms through interdependent services can render them difficult to manage. Cloudflare has pledged to implement specific measures to avert such issues in the future. However, I believe these events should prompt a fundamental shift in the company’s priorities, favoring stability over the relentless pursuit of rapid innovation.