The Software Architects' Newsletter
November 2025

Welcome to the InfoQ Software Architects' Newsletter! We bring you essential news and experience on emerging patterns and technologies from industry peers each month.

This month, we focus on "Platform Engineering and Architecture". Technologies, patterns, and practices from this topic span the entire "diffusion of innovation" graph in our recently published InfoQ Cloud and DevOps Trends Report – 2025 (and accompanying podcast).

This newsletter will focus on architectural patterns for building internal developer platforms that prioritize speed, safety, and efficiency. Whether you're tackling platform sprawl, orchestrating multiple platform teams, or introducing AI-driven workflows into your platforms, understanding the impact of architecture is key to building systems that scale sustainably.

News

Enhancing Reliability Using Service-Level Prioritized Load Shedding: Netflix at QCon SF 2025

At the recent QCon San Francisco, Netflix Staff Software Engineers Anirudh Mendiratta and Benjamin Fedorka shared insights into the company's reliability strategy, detailing how its load-shedding techniques evolved into a more sophisticated approach: service-level prioritized load shedding.

This approach was designed to maintain a seamless viewing experience for millions of users, particularly during unpredictable traffic spikes that overwhelm reactive autoscaling and capacity buffers.
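
As a rough illustration of the general idea, and not Netflix's actual implementation, priority-based load shedding can be sketched as an admission check that drops lower-priority traffic first as utilization climbs. The priority tiers, the `Request` type, the function names, and the 0.6 and 0.9 thresholds below are all hypothetical values chosen for this example.

```python
import math
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    """Hypothetical priority tiers; lower value means more important."""
    CRITICAL = 0     # e.g. requests needed to start playback
    DEGRADED = 1     # losing these degrades the experience
    BEST_EFFORT = 2  # nice-to-have features
    BULK = 3         # background or prefetch traffic


@dataclass(frozen=True)
class Request:
    name: str
    priority: Priority


def lowest_admitted_priority(utilization: float,
                             start: float = 0.6,
                             full: float = 0.9) -> Priority:
    """Return the least important tier still admitted at this utilization.

    Below `start` everything is admitted. Between `start` and `full`,
    tiers are shed one at a time, least important first. At or above
    `full`, only CRITICAL traffic is admitted. Thresholds are assumptions.
    """
    if utilization <= start:
        return max(Priority)
    if utilization >= full:
        return Priority.CRITICAL
    fraction = (utilization - start) / (full - start)
    sheddable = len(Priority) - 1  # CRITICAL is never shed
    levels_shed = math.ceil(fraction * sheddable)
    return Priority(max(Priority) - levels_shed)


def should_admit(request: Request, utilization: float) -> bool:
    """Admit the request only if its tier is at or above the cutoff."""
    return request.priority <= lowest_admitted_priority(utilization)


if __name__ == "__main__":
    for load in (0.50, 0.65, 0.75, 0.85, 0.99):
        cutoff = lowest_admitted_priority(load)
        print(f"utilization={load:.2f} -> admit up to {cutoff.name}")
```

In a sketch like this, the utilization signal (CPU, concurrency, or queue depth) and the mapping of requests to tiers are the main design decisions; a real system would tune both against observed traffic.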

Cloudflare Global Outage Traced to Internal Database Change

Cloudflare recently experienced a global outage caused by a database permission update, triggering widespread 5xx server errors across its CDN and security services.

The disruption started around 11:20 UTC on the 18th of November, blocking access to customer sites and even locking Cloudflare’s own team out of their internal dashboard. According to a post-mortem released by CEO Matthew Prince, the root cause was a subtle regression introduced during a routine improvement to their ClickHouse database cluster.

Race Condition in DynamoDB DNS System: Analyzing the AWS US-EAST-1 Outage

On October 19th and 20th, AWS experienced an extended outage triggered by a failure in Amazon DynamoDB that affected most services in its most popular region, Northern Virginia. The cloud provider released an analysis of the incident, sparking discussions in the community about redundancy on AWS, moving out of public cloud, and multi-region approaches.

According to the post-mortem, which provides details on the DynamoDB DNS management architecture, the incident was triggered by a latent defect in the service's automated DNS management system, leading to endpoint resolution failures for DynamoDB.

Crossplane Reaches Production Maturity by Graduating CNCF

The Cloud Native Computing Foundation (CNCF) has graduated Crossplane, marking a major milestone for the open-source project that turns Kubernetes into a universal control plane for cloud infrastructure.

Crossplane now counts more than 3,000 contributors across 450 companies and has passed a security audit under a vendor-neutral governance model. The ecosystem has expanded from a handful of cloud providers into a marketplace of official and community-maintained packages covering major hyperscalers, as well as services such as Helm, Vault, and Kubernetes add-ons.

Platform Engineering Patterns for Scalable Software Delivery

Building a successful Internal Developer Platform (IDP) requires balancing standardization with developer autonomy. The panelists in this recent InfoQ webinar discuss the core components of modern platforms, the role of platform teams, and strategies for driving adoption across diverse teams. They also share key patterns, anti-patterns, and lessons learned.

Sponsored

Designing Real-Time Payment Architectures for Low Latency and Reliability - Sponsored by Hazelcast

The shift to real-time payments demands systems that can process transactions instantly and reliably—without sacrificing scale or resilience. This 90-minute workshop explores modern architectural patterns for low-latency payment pipelines that unify data, compute, and streaming. Learn how in-memory processing simplifies fraud detection, liquidity management, and instant settlement while maintaining observability and uptime. You’ll also discover how Hazelcast’s unified in-memory architecture powers mission-critical payment systems in production today.

Join the live workshop “Designing Real-Time Payment Architectures for Low Latency and Reliability”, sponsored by Hazelcast.

Case Study

Building Resilient Platforms: Insights from over Twenty Years in Mission-Critical Infrastructure

Building resilient platforms requires understanding both the art and science of creating infrastructure that others depend on for critical applications. Drawing on over twenty years of experience building platforms that support critical applications, the perspectives shared here apply to anyone who builds software consumed at scale, whether the platform serves infrastructure, software development, messaging, or banking.

Great platforms deliver an intuitive experience by hiding complexity and appearing magical; they operate so seamlessly that users take them for granted and never need to think about the underlying infrastructure.

Financial services platforms support mission-critical operations such as trading systems and credit card processing. These systems have zero tolerance for downtime, security breaches, or scaling failures. The “three Ss” of stability, security, and scalability represent non-negotiable requirements. Unlike real estate, where you might optimize for two out of three (e.g., location, price, and size), platforms must deliver all three without compromise.

Stability means consistent, reliable operation at all times. However, achieving stability through stagnation creates security vulnerabilities from unpatched systems. Patching improves security but introduces changes that can affect stability. Scalability requires building for 10x growth: successful platforms attract users like an unstoppable force, and many fail because they cannot scale with customer demand.

Balancing these three requirements demands continuous attention and investment. While cost can fluctuate based on business needs and priorities, these three fundamentals establish an inviolable foundation. Sometimes scaling takes precedence, requiring temporary adjustments to patching cycles. The key lies in maintaining minimum acceptable levels across all three dimensions while optimizing based on immediate needs.

Platform building remains a job for unsung heroes. No one calls to celebrate when platforms work perfectly; they only call when things break. The highest compliment is silence. Success means remaining transparent, unknown, and unsung. When users start calling you directly, something has gone wrong.

The principles presented in the complete version of this article provide a framework for building platforms that others can depend upon. These platforms hide complexity while delivering value; they are platforms built to last. Whether building infrastructure, creating internal tools, or developing any software others will consume, these principles guide the path toward truly resilient platforms at scale.

The journey requires patience, discipline, and commitment to excellence, but the result enables others to build amazing things they could never have created alone.

This content is an excerpt from a recent InfoQ article by Matthew Liste, "Building Resilient Platforms: Insights from over Twenty Years in Mission-Critical Infrastructure".

To get notifications when InfoQ publishes content on these topics, follow "platform engineering", "DevOps", and "cloud computing" on InfoQ.

Missed a newsletter? You can find all of the previous issues on InfoQ.

Sponsored

Adopting Agentic AI: A Playbook for Engineers & Architects - Sponsored by Boomi

Agentic AI is reshaping how modern systems are designed, automated, and scaled—but many teams aren’t architecturally prepared. Boomi’s Agentic Transformation Playbook gives engineers and architects a clear framework for why AI agents matter, where they deliver real impact, and how to overcome the technical and organizational challenges of adopting them. Get practical guidance to prepare your architecture—and your engineering teams—for the next wave of intelligent automation.

Download the Playbook “Thriving in the Age of Agentic AI”, sponsored by Boomi

About InfoQ

Senior software developers rely on the InfoQ community to keep ahead of the adoption curve. One of the main reasons software architects and engineers tell us they keep coming back to InfoQ is that they trust the information provided and selected by their peers.

We’ve been helping software development teams adopt new technologies and practices for over 19 years through InfoQ articles, news items, podcasts, tech talks, trends reports, and QCon software development conferences.

We hope you find this newsletter useful. If not, you can unsubscribe using the link below.

Forwarded email? Subscribe and get your own copy.

You have received this email because you subscribed to "The Architects' Newsletter". To stop receiving the Architects' Newsletter, please click the following link: Unsubscribe

- - -

C4Media Inc. (InfoQ.com), 705-2267 Lake Shore Blvd. West,
Toronto, Ontario, Canada, M8V 3X2