Our thirty-seventh issue of the Architects' Newsletter again focuses on the topic of "streaming data pipelines, modern ETL, and data mesh". We believe that these topics span several phases of the diffusion of innovation curve. For example, in our latest Architecture and Design InfoQ Trends Report (April 2020), we placed data mesh in the innovator phase, and event-driven architecture and streaming in the early majority phase.
We believe that understanding all the emerging patterns, antipatterns, and technologies related to these topics is essential for a software architect.
News
Data Leadership Book Review and Interview
The Data Leadership book, authored by Anthony Algmin and published by DATAVERSITY Press, introduces the topic of "data leadership" and discusses how data leaders should manage and govern the data management programs in their organizations. In this recent article, InfoQ editor Srini Penchikala reviews the book and interviews the author to dive deeper into his motivations for researching this topic.
Key takeaways from the book include: data leadership is focused on how organizations choose to apply their resources toward creating data capabilities that influence their business; nothing built with data matters if it is not improving the business in terms of revenue, cost, or risk management; and the "simple virtuous cycle" process aims to help an organization get started with data governance by measuring, identifying challenges, and implementing improvements.
What Is a Data Mesh - and How Not to Mesh It Up
In this recent Medium article, Barr Moses and Lior Gavish from Monte Carlo build on the work of Zhamak Dehghani, and outline the benefits of a "data mesh". They argue that in the age of self-service business intelligence, nearly every company considers itself a data-first company, but not every company is treating its data architecture with the level of democratization and scalability it deserves.
In order to promote ownership of data by business and domain teams, organizations must be able to answer the following questions using their data management systems: Is my data fresh? Is my data broken? How do I track schema changes? What are the upstream and downstream dependencies of my pipelines? The architectures, practices, and processes outlined in the original Data Mesh article provide guidance on how to build systems to answer these questions.
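As a concrete illustration of the first of these questions, the snippet below sketches a minimal data freshness check in Python. It is not taken from the article: the table and column names, the one-hour SLA, and the use of SQLite as a stand-in for a real data warehouse connection are all assumptions made purely for illustration.

```python
# Minimal, illustrative sketch of a data freshness check.
# Assumes timestamps are stored as naive UTC ISO-8601 strings; sqlite3 is a
# stand-in for a real data warehouse connection. Table and column names are
# trusted constants here, not user input.
import sqlite3
from datetime import datetime, timedelta

FRESHNESS_SLA = timedelta(hours=1)  # assumed SLA, tune per dataset

def is_fresh(conn: sqlite3.Connection, table: str, ts_column: str) -> bool:
    """Return True if the newest row in `table` arrived within the SLA."""
    row = conn.execute(f"SELECT MAX({ts_column}) FROM {table}").fetchone()
    if row is None or row[0] is None:
        return False  # an empty table counts as stale
    latest = datetime.fromisoformat(row[0])
    return datetime.utcnow() - latest <= FRESHNESS_SLA
```

In a data mesh, each domain team would typically expose checks like this for its own data products, alongside schema-change detection and lineage tracking.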
Infinite Storage and Retention for Apache Kafka in Confluent Cloud
Confluent Cloud is an event streaming platform that is powered by Apache Kafka. The team behind this product recently announced the Infinite Storage option for its standard and dedicated clusters. This offering is a part of the Project Metamorphosis initiative, which is focused on imbuing Kafka with modern cloud properties.
Tiered Storage was released earlier this year as a preview in the Confluent Platform, and offers functionality that is referred to as the "nuts and bolts of Infinite Storage". Under the hood, this functionality is responsible for dynamically moving Kafka data between different storage implementations, each with varying performance and cost characteristics.
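From a user's point of view, the main knob that infinite retention turns is the standard Kafka topic configuration retention.ms, where a value of -1 disables time-based deletion. The sketch below shows how that setting could be applied with the confluent-kafka Python admin client; the broker address and topic name are placeholders, and this is not Confluent's documented onboarding procedure for the Infinite Storage offering.

```python
# Hedged sketch: set a topic's retention.ms to -1 (no time-based deletion)
# using the confluent-kafka admin client. The broker address and topic name
# are placeholders, not values from the announcement.
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

resource = ConfigResource(
    ConfigResource.Type.TOPIC,
    "orders",  # placeholder topic name
    set_config={"retention.ms": "-1"},  # -1 means retain records indefinitely
)

# alter_configs is non-incremental, so other dynamically set topic configs
# would be reset; result() blocks until the broker confirms the change.
for res, future in admin.alter_configs([resource]).items():
    future.result()
    print(f"updated {res}")
```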
Combining DataOps and DevOps: Scale at Speed
In a recent InfoQ article, Sam Bocetta, a security analyst, states that DataOps is focused on streamlining how big data is processed, analyzed, and turned into value. He argues that development teams need to learn how to look past the data delivery mechanics and instead concentrate on the policies and limitations that control data in their organization.
Two of the most pioneering organizations in the DataOps space have been Uber and Netflix. Uber, for instance, uses a machine learning (ML) platform known as Michelangelo to process the huge amounts of data the firm collects and to share it across the organization. The core of the Netflix user experience is its recommendation engine, which currently runs on Apache Spark.
Case Study
The Challenges of Building a Reliable Real-Time Event-Driven Ecosystem
Globally, there is an increasing appetite for data delivered in real time or near real time. In this recent InfoQ article, Matthew O'Riordan, co-founder of Ably, argues that as both producers and consumers become increasingly interested in faster experiences and instantaneous data transactions, we are witnessing the emergence of the real-time API and the supporting backend architectures and systems.
When it comes to implementing event-driven APIs, engineers can choose between multiple protocols. Options include the simple webhook, the newer WebSub, popular open protocols such as WebSockets, MQTT, and SSE, and even streaming protocols such as Kafka.
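To give a flavour of what one of these options looks like in practice, the snippet below is a minimal sketch of an SSE (Server-Sent Events) consumer written in Python with the requests library; the endpoint URL is a placeholder, and the example ignores reconnection and the Last-Event-ID header that a production client would need to handle.

```python
# Minimal sketch of a Server-Sent Events (SSE) consumer using only `requests`.
# The endpoint URL is a placeholder; real clients also handle reconnection
# and resume from the last received event ID.
import requests

def consume_sse(url: str) -> None:
    with requests.get(url, stream=True,
                      headers={"Accept": "text/event-stream"}) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            # SSE frames payload lines with a "data:" prefix
            if line and line.startswith("data:"):
                print("event payload:", line[len("data:"):].strip())

if __name__ == "__main__":
    consume_sse("https://example.com/stream")  # placeholder endpoint
```

A WebSocket or MQTT client would follow the same basic pattern of holding a long-lived connection open and reacting to messages as they arrive.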
In addition to choosing a protocol, engineers also have to think about subscription models: server-initiated (push-based) or client-initiated (pull-based).
Client-initiated models are the best choice for the "last mile" delivery of data to end-user devices. These devices only need access to data when they are online (connected) and don't care what happens while they are disconnected. This reduces complexity on the producer side, as the server does not need to be stateful.
In the case of streaming data at scale, engineers should adopt a server-initiated model. The responsibility of sharding data across multiple connections and managing those connections rests with the producer, and other than the potential use of a client-side load balancer, things are kept rather simple on the consumer side.
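To make the push-based end of this spectrum concrete, here is a minimal sketch of a server-initiated delivery target: a webhook-style HTTP receiver built on Python's standard library. The port and behaviour are illustrative assumptions only; a real receiver would verify request signatures and hand events off to a queue rather than processing them inline.

```python
# Minimal sketch of the server-initiated (push) pattern: the producer POSTs
# each event to an HTTP endpoint exposed by the consumer (a webhook).
# Port and handling are placeholders for illustration only.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length) or b"{}")
        print("received event:", event)
        self.send_response(204)  # acknowledge receipt with no body
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), WebhookHandler).serve_forever()
```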
To truly benefit from the power of real-time data, the entire tech stack needs to be event-driven. O'Riordan concludes the article by stating that perhaps we should start talking more about event-driven architectures than about event-driven APIs.
The complete article can be found on InfoQ: "The Challenges of Building a Reliable Real-Time Event-Driven Ecosystem".
To get notifications when InfoQ publishes content on these topics, follow "architecture", "streaming", "ETL", and "data mesh" on InfoQ.
Missed a newsletter? You can find all of the previous issues on InfoQ.
Event
On Sept 23rd, InfoQ Live returns with a new virtual interactive event
Join us and deep-dive into cloud-native, secure software, serverless ML & performance, and more. Dr. Holly Cummins, Laura Bell, and Dana Engebretson are just a few of the early confirmed speakers. Registration will open soon, so sign up to be among the first to hear the latest news about the InfoQ Live event on Sept 23rd.
This edition of The Software Architects' Newsletter is brought to you by:
API Traffic Management 101
The very nature of API traffic has been changing in the past few years. As more companies adopt the pattern of smaller, lightweight services composed into agile, resilient solutions, the amount of interservice traffic (typically called “East–West” traffic) is growing. Organizations that have spent time and resources building up a strong practice in managing traffic from behind the firewall to the outside world (called “North–South” traffic) might find that their tool selection and platform choices are not properly suited for the increased traffic between services behind the firewall.
InfoQ strives to facilitate the spread of knowledge and innovation within this space, and in this newsletter we aim to curate and summarise key learnings from news items, articles and presentations created by industry peers, both on InfoQ and across the web. We aim to keep readers informed and educated about emerging trends, peer-validated early adoption of technologies, and architectural best practices, and are always keen to receive feedback from our readers. We hope you find it useful, but if not you can unsubscribe using the link below.
Forwarded email? Subscribe and get your own copy.