Our thirty-sixth issue of the Architects' Newsletter focuses on the topic of "streaming data pipelines, modern ETL, and data mesh". We believe that these topics span several phases of the diffusion of innovation curve. For example, in our latest Architecture and Design InfoQ Trends Report (April 2020), we placed data mesh in the innovator phase, and event-driven architecture and streaming in the early majority phase.

Given the potential disruption to the way data is handled and managed (e.g. product teams owning and exposing data via self-service mechanisms), alongside evolving stream-based data-wrangling approaches, we believe that understanding the emerging patterns, antipatterns, and technologies related to these topics is essential for a software architect.

News

LaunchDarkly's Evolution from Polling to Streaming

In a recent article, Dawn Parzych, developer advocate at LaunchDarkly, discussed how the company has evolved its customer-facing feature flag evaluation system from a polling-based architecture to one driven by streaming.

The benefits to their end users included faster configuration changes: "customers today see rapid updates when flags change with significantly less bandwidth consumption". There were other advantages too: "the batteries on the mobile devices are extended as they no longer have to spend battery power on polling".
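The bandwidth argument can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not LaunchDarkly's implementation: FlagStore, run_demo, and the flag name "new-checkout" are all hypothetical, and a real streaming client would hold a long-lived network connection rather than an in-process callback.

```python
from typing import Callable, Dict, List, Tuple


class FlagStore:
    """Hypothetical in-memory flag service; not LaunchDarkly's API."""

    def __init__(self) -> None:
        self._flags: Dict[str, bool] = {}
        self._subscribers: List[Callable[[str, bool], None]] = []
        self.requests_served = 0  # counts polling round trips

    def get_all(self) -> Dict[str, bool]:
        # Polling path: every call costs a round trip, changed or not.
        self.requests_served += 1
        return dict(self._flags)

    def subscribe(self, callback: Callable[[str, bool], None]) -> None:
        # Streaming path: the client registers once and waits for pushes.
        self._subscribers.append(callback)

    def set_flag(self, name: str, value: bool) -> None:
        self._flags[name] = value
        for notify in self._subscribers:
            notify(name, value)


def run_demo(ticks: int = 10, change_at: int = 3) -> Tuple[int, int, dict, dict]:
    store = FlagStore()
    polled: Dict[str, bool] = {}
    streamed: Dict[str, bool] = {}
    pushes: List[Tuple[str, bool]] = []

    def on_change(name: str, value: bool) -> None:
        streamed[name] = value
        pushes.append((name, value))

    store.subscribe(on_change)

    for tick in range(ticks):
        if tick == change_at:
            store.set_flag("new-checkout", True)
        polled = store.get_all()  # the polling client asks on every tick

    return store.requests_served, len(pushes), polled, streamed
```

With the defaults, both clients end up with the same flag state, but the polling client makes one request per tick while the streaming client receives a single push when the flag actually changes.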

Everything You Wanted to Know about Apache Kafka but Were Too Afraid to Ask!

In this presentation from Big Data Conference Vilnius, Ricardo Ferreira explained what Apache Kafka is, and how this fits into the concepts of distributed systems and stream processing. He also discussed several use cases and design patterns around the use of Kafka.

A key takeaway from the talk was that Apache Kafka can be seen as a distributed streaming platform. Ferreira stated that engineers will never understand Kafka if they think of it as "just messaging"; in his framing, messaging has no persistence and, like databases, "has scalability limits".
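One way to read the "not just messaging" point is that records stay in a log after they are read, and each consumer group tracks its own position. The toy class below is a teaching sketch under that assumption; Log, poll, and seek are hypothetical names and do not mirror Kafka's client API.

```python
from typing import Dict, List


class Log:
    """Toy append-only log; a teaching aid, not Kafka's API."""

    def __init__(self) -> None:
        self._records: List[str] = []
        # Next offset to read, tracked separately for each consumer group.
        self._offsets: Dict[str, int] = {}

    def append(self, record: str) -> int:
        # Records are only ever added; reading never removes them.
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, group: str, max_records: int = 10) -> List[str]:
        start = self._offsets.get(group, 0)
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch

    def seek(self, group: str, offset: int) -> None:
        # Rewinding lets a group replay history it has already consumed.
        self._offsets[group] = offset


log = Log()
for record in ["order-1", "order-2", "order-3"]:
    log.append(record)

first_read = log.poll("billing")     # billing consumes everything so far
late_joiner = log.poll("analytics")  # a later group still sees all records
log.seek("billing", 0)
replayed = log.poll("billing")       # billing replays from the start
```

In a queue that deletes messages on delivery, the late-joining group and the replay would both come back empty.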

Amazon EventBridge Schema Registry Now Generally Available on AWS

Amazon recently announced the general availability of the Schema Registry capability in the Amazon EventBridge service. With Amazon EventBridge Schema Registry, developers can store the event structure (or schema) in a shared central location and map those schemas to code for Java, Python, and TypeScript, meaning that they can use events as objects in their code.
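To give a feel for what "events as objects" means, here is a hand-written stand-in for a generated binding. It is only a sketch: the OrderCreated event, its orderId and totalCents fields, and the parse_event helper are hypothetical, and the code that the registry actually generates will look different. The only parts taken from EventBridge itself are the "detail-type" and "detail" fields of the event envelope.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderCreated:
    """Hand-written stand-in for a generated binding (hypothetical schema)."""

    order_id: str
    total_cents: int

    @classmethod
    def from_detail(cls, detail: dict) -> "OrderCreated":
        # A missing key raises KeyError, so malformed payloads fail loudly.
        return cls(
            order_id=str(detail["orderId"]),
            total_cents=int(detail["totalCents"]),
        )


def parse_event(raw: str) -> OrderCreated:
    """Turn a raw JSON event envelope into a typed object."""
    envelope = json.loads(raw)
    if envelope.get("detail-type") != "OrderCreated":
        raise ValueError("unexpected detail-type: %r" % envelope.get("detail-type"))
    return OrderCreated.from_detail(envelope["detail"])


raw_event = json.dumps({
    "source": "example.orders",  # hypothetical event source
    "detail-type": "OrderCreated",
    "detail": {"orderId": "o-1", "totalCents": 1999},
})
event = parse_event(raw_event)
```

Downstream code can then use event.order_id and event.total_cents instead of indexing into nested dictionaries, which is the convenience the schema-to-code mapping is aiming for.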

With this new feature, Amazon EventBridge is now competitive with similar services from other cloud vendors. Microsoft offers Event Grid, which has been GA since the beginning of 2018 and has received several updates, including advanced filtering, retry policies, and support for CloudEvents. However, the service lacks a schema registry capability. The same applies to TriggerMesh's EveryBridge. This event bus can consume events from various sources, which developers can use to start serverless functions running on any of the major cloud providers as well as on-premises.

Boosting Apache Spark with GPUs and the RAPIDS Library

At the 2019 Spark AI Summit Europe conference, NVIDIA software engineers Thomas Graves and Miguel Martinez hosted a session on Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS Library. InfoQ recently talked with Jim Scott, head of developer relations at NVIDIA, to learn more about accelerating Apache Spark with GPUs and the RAPIDS library.

RAPIDS is a suite of open-source software libraries and APIs for executing end-to-end data science and analytics pipelines entirely on GPUs, allowing for a substantial speedup, particularly on large data sets. Built on top of NVIDIA CUDA, RAPIDS exposes GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces and a DataFrame API that integrates with a variety of machine-learning algorithms for end-to-end pipeline acceleration.

Case Study

Zhamak Dehghani on Data Mesh, Domain-Oriented Data, and Building Data Platforms

In a popular InfoQ podcast, the topic of "data mesh" was introduced by Zhamak Dehghani, principal consultant, member of the technical advisory board, and portfolio director at ThoughtWorks. Topics discussed included: the motivations for becoming a data-driven organization; the challenges of adapting legacy data platforms and ETL jobs; and how to design and build the next generation of data platforms using ideas from domain-driven design and product thinking, and modern platform principles such as self-service workflows.

Dehghani began by stating that becoming a data-driven organization remains one of the top strategic goals of many organizations. Being able to rapidly run experiments and efficiently analyze the resulting data can provide a competitive advantage.

Dehghani identified several "architecture failure modes" within existing enterprise data platforms. They are centralized and monolithic. The composition of data pipelines is often highly coupled, meaning that a change to the data format will require a cascade of changes throughout the pipeline. And finally, the ownership of data platforms is often siloed and hyper-specialized.

The next generation of enterprise data platform architecture requires a paradigm shift towards ubiquitous data with a distributed data mesh, Dehghani believes. Instead of flowing the data from domains into a centrally owned data lake or platform, domains need to host and serve their domain datasets in an easily consumable "self-service" way.

Domain data teams must apply product thinking to the datasets that they provide, treating their data assets as their products and the rest of the organization's data scientists, ML engineers, and data engineers as their customers.

The complete podcast and show notes can be found on InfoQ: "Zhamak Dehghani on Data Mesh, Domain-Oriented Data, and Building Data Platforms".

To get notifications when InfoQ publishes content on these topics, follow "architecture", "streaming", "ETL", and "data mesh" on InfoQ.

Missed a newsletter? You can find all of the previous issues on InfoQ.

Event

Introducing InfoQ Live: A Microservices Virtual Event on Tuesday, August 25th

Designing a distributed system is very challenging, and complexity can easily be pushed from the code to the infrastructure.

At InfoQ Live our speakers will help you ask and answer the right questions related to this topic: Should you move from a monolith to microservices? What are the new skills your team will need to learn? How do you manage the transition, both from the technical and organizational perspectives?

You’ll learn from world-class practitioners who have experience working with microservices at scale. Early confirmed speakers include Adrian Cockcroft, VP Cloud Architecture Strategy at AWS and microservices pioneer; Guy Podjarny, co-founder and president at Snyk Security; and Nicky Wrightson, principal engineer at Skyscanner.

Attend InfoQ Live, an event designed for you, the modern software practitioner. Register for only $49.

This edition of The Software Architects' Newsletter is brought to you by:

Pods in a Kubernetes Cluster

Pods represent the atomic unit of work in a Kubernetes cluster. A Pod consists of one or more containers working together symbiotically. To create a Pod, you write a Pod manifest and submit it to the Kubernetes API server by using the command-line tool or (less frequently) by making HTTP and JSON calls to the server directly.

Once you’ve submitted the manifest to the API server, the Kubernetes scheduler finds a machine where the Pod can fit and schedules the Pod to that machine. Once scheduled, the kubelet daemon on that machine is responsible for creating the containers that correspond to the Pod, as well as performing any health checks defined in the Pod manifest.
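As a rough illustration of what such a manifest contains, the sketch below builds one as a Python dictionary and prints it as JSON. The pod_manifest helper, the Pod name "web", the nginx:1.19 image, and the probe timings are all assumptions chosen for the example; the field names (apiVersion, kind, metadata, spec.containers, livenessProbe) come from the core Kubernetes Pod API.

```python
import json


def pod_manifest(name: str, image: str, port: int) -> dict:
    """Build a single-container Pod manifest (illustrative values only)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "ports": [{"containerPort": port}],
                    # Health check that the kubelet runs once the Pod is scheduled.
                    "livenessProbe": {
                        "httpGet": {"path": "/", "port": port},
                        "initialDelaySeconds": 5,
                        "periodSeconds": 10,
                    },
                }
            ]
        },
    }


manifest = pod_manifest("web", "nginx:1.19", 80)

if __name__ == "__main__":
    print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file such as pod.json and running kubectl apply -f pod.json would submit it to the API server, since kubectl accepts JSON as well as YAML.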

InfoQ strives to facilitate the spread of knowledge and innovation within this space, and in this newsletter we aim to curate and summarise key learnings from news items, articles and presentations created by industry peers, both on InfoQ and across the web. We aim to keep readers informed and educated about emerging trends, peer-validated early adoption of technologies, and architectural best practices, and are always keen to receive feedback from our readers. We hope you find it useful, but if not you can unsubscribe using the link below.

Unsubscribe


The Software Architects' Newsletter, July 2020