In our twenty-first issue of the Architects' Newsletter we focus on data engineering and streaming architecture. This topic has largely "crossed the chasm" in terms of the diffusion of innovation: in our latest InfoQ Architecture and Design Trends report we placed event-driven architecture in the early majority phase, and other related patterns and approaches in the early adopter phase. Understanding the emerging patterns, antipatterns, and technologies is therefore essential for a software architect.

News

MicroProfile Releases Reactive Streams Operators 1.0

The Java-based MicroProfile community have released the Reactive Streams Operators 1.0 API, a specification that defines a set of operators for Reactive Streams. Inspired by the java.util.stream package, first introduced in JDK 8, and based on the Reactive Streams initiative, the Reactive Streams Operators API allows developers to create Reactive Streams, process the data transiting in the streams, and accumulate results.

The Reactive Streams initiative, which builds on the ideas of the Reactive Manifesto, was started in 2013 as a collaboration among Lightbend, Netflix and Pivotal. The initial specification, released in 2015, was ultimately adopted with the release of JDK 9 as the java.util.concurrent.Flow API.

Squbs: Akka Streams & Akka HTTP for Large-Scale Production Deployment

Squbs is an open-source project that enables the standardisation and operationalisation of Akka applications at large scale. It adheres to reactive principles, allowing applications to process billions of transactions per day with minimal resource consumption while being more resilient to failure. Its asynchronous programming model uses streams as the core of the application, with all input and output treated as streams of events.

Squbs (rhymes with "cubes") is a good fit for use cases that involve collecting and processing large amounts of data in near real time, as well as for a number of heavy-duty data processing and orchestration applications. InfoQ editor Abhishek Kaushik recently interviewed Akara Sucharitakul, Principal Member of Technical Staff at PayPal and the Squbs project founder, about the problems Squbs solves.

Complex Event Flows in Distributed Systems

In this QCon London 2019 talk recording, Bernd Ruecker demonstrated how the new generation of lightweight and highly scalable state machines eases the implementation of long-running services. Building on his earlier InfoQ content, he shared how to handle complex logic and flows that require proper reactions to failures, timeouts and compensating actions, and provided guidance backed by code examples to illustrate alternative approaches.

The talk video includes the full transcript, and InfoQ editor Jan Stenberg has also written a summary news post, "A Critical Look at Event-Driven Systems: Bernd Rücker at QCon London".

Building Data-Intensive Applications with Spring Cloud Stream and Kafka

In this SpringOne Platform talk recording, "Building Cloud-Native Data-Intensive Applications with Spring", Sabby Anandan and Soby Chacko discussed how Spring Cloud Stream and Kafka Streams can support Event Sourcing and CQRS patterns.

Using Uber as a case study, they presented a series of architecture patterns that engineers can apply, regardless of the size of their organisation. Four key topics within implementing data-intensive architectures are also discussed: reliability, scalability, maintainability, and portability. A series of live coding demonstrations are included within the talk.

Why a Data Scientist is Not a Data Engineer

In an O'Reilly blog post, Jesse Anderson discussed "why a data scientist is not a data engineer" (or, why science and engineering are still different disciplines). He began by arguing that, from the management perspective, data engineering is not in the limelight: "it isn't getting all of the media buzz. Conferences aren't telling CxOs about the virtues of data engineering". This causes the role to be ignored at this level of the organisation. He continued by stating that, anecdotally, he has found that most data scientists overestimate their own data engineering abilities, which can lead them to undervalue the contribution of a good data engineering specialist.

In order to address the confusion around the roles, Anderson argued that "first and foremost, we have to understand what data scientists and data engineers do. We have to realize this isn't a focus on titles and limiting people based on that". He continued by suggesting that leadership/management in combination with developing teams with solid engineering skills has a big part to play in building effective data processing systems.

Case Study

Creating Events from Databases Using Change Data Capture

In a presentation at the MicroXchg Berlin conference, Gunnar Morling described Debezium, an implementation of a change data capture (CDC) tool that uses Apache Kafka to publish changes as event streams. InfoQ editor Jan Stenberg summarised the talk in a recent news post, "Creating Events from Databases Using Change Data Capture: Gunnar Morling at MicroXchg Berlin".

To set the stage, Morling, a software engineer at Red Hat, argued that when you store data in a database, you often also want to put the same data in a cache and a search engine. The challenge then is how to keep all data in sync when you want to stay away from distributed transactions and dual writes. One solution is to use a CDC tool that captures and publishes changes in a database.

Morling describes Debezium as an open source CDC tool built on top of Kafka that reads the transaction logs in a database and creates streams of events. Other applications can then asynchronously consume these events in the correct order and update their own data storage according to their needs.

Transaction log files are append-only log files, used for rollback of transactions and replication, and for Morling they are an ideal source for capturing changes made in a database, since they contain all changes made and in the correct order. All databases have their own APIs for reading the log files, so Debezium comes with connectors for several databases. On the output side Debezium produces one generic and abstract event representation for Kafka.
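As a rough illustration of that last step, the sketch below turns entries from an append-only log into one generic event shape while preserving log order. It is not Debezium's connector code or its actual event format: LogEntry, ChangeEvent, to_change_events and the "db.<table>" topic naming are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class LogEntry:
    """One record in a hypothetical append-only transaction log."""
    position: int            # monotonically increasing log position
    table: str
    operation: str           # "INSERT", "UPDATE" or "DELETE"
    before: Optional[dict]   # row state before the change, if any
    after: Optional[dict]    # row state after the change, if any


@dataclass(frozen=True)
class ChangeEvent:
    """Database-agnostic event shape (an assumption, not Debezium's schema)."""
    topic: str
    key: object
    op: str
    before: Optional[dict]
    after: Optional[dict]
    position: int


def to_change_events(log: Iterable[LogEntry],
                     key_column: str = "id") -> Iterator[ChangeEvent]:
    """Translate log entries into generic events, preserving log order."""
    for entry in log:
        # A delete has no "after" state, so fall back to "before" for the key.
        row = entry.after if entry.after is not None else entry.before
        yield ChangeEvent(
            topic=f"db.{entry.table}",
            key=row[key_column],
            op=entry.operation.lower(),
            before=entry.before,
            after=entry.after,
            position=entry.position,
        )


log = [
    LogEntry(1, "customers", "INSERT", None, {"id": 7, "name": "Ann"}),
    LogEntry(2, "customers", "UPDATE",
             {"id": 7, "name": "Ann"}, {"id": 7, "name": "Anna"}),
    LogEntry(3, "customers", "DELETE", {"id": 7, "name": "Anna"}, None),
]
events = list(to_change_events(log))
```

Because the events are produced in a single pass over the log, consumers see the changes in the same order in which the database committed them.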

Besides using CDC for updating caches, services and search engines, Morling notes some other use cases, including:

Data replication. Commonly used for replication of data into another type of database or data warehouse.
Auditing. By keeping the history of events, possibly after enriching each event with metadata, you will have an audit of all changes made. One example of metadata that may be interesting is the current user.
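The auditing use case can be sketched in a few lines, assuming change events arrive as plain dictionaries and that the current user can be looked up by transaction id. The field names (tx_id, key, op, after) and the functions enrich_with_user and history_of are hypothetical, not part of any real CDC tool.

```python
from typing import Dict, Iterable, List


def enrich_with_user(events: Iterable[dict],
                     tx_users: Dict[int, str]) -> List[dict]:
    """Return copies of the events with the responsible user attached."""
    audit_log = []
    for event in events:
        enriched = dict(event)  # copy, so the original event is left untouched
        enriched["user"] = tx_users.get(event["tx_id"], "unknown")
        audit_log.append(enriched)
    return audit_log


def history_of(audit_log: Iterable[dict], key: int) -> List[dict]:
    """All audited changes for one row, in the order they were captured."""
    return [e for e in audit_log if e["key"] == key]


events = [
    {"tx_id": 100, "key": 7, "op": "insert", "after": {"name": "Ann"}},
    {"tx_id": 101, "key": 8, "op": "insert", "after": {"name": "Bob"}},
    {"tx_id": 102, "key": 7, "op": "update", "after": {"name": "Anna"}},
]
tx_users = {100: "alice", 102: "carol"}  # hypothetical transaction-to-user map

audit_log = enrich_with_user(events, tx_users)
```

Calling history_of(audit_log, 7) then yields the insert and the update for row 7, each annotated with the user who made the change.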

In a microservices-based system, services commonly need data from other services, but Morling points out that we should stay away from shared databases. Using REST calls will increase the coupling between services; instead, we can use CDC for such scenarios. By creating streams with changes, other services can subscribe to these changes and update their local databases. This pipeline of events is asynchronous, which means services can fall behind in case of, for instance, network problems, but they will catch up eventually: the system is eventually consistent.
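The following minimal sketch shows a subscribing service that falls behind and later catches up. It uses an in-memory list in place of Kafka, and ChangeStream, LocalReplica and the event fields are hypothetical names for illustration only.

```python
from typing import Dict, List, Optional


class ChangeStream:
    """In-memory stand-in for an ordered, replayable stream of change events."""

    def __init__(self) -> None:
        self._events: List[dict] = []

    def publish(self, event: dict) -> None:
        self._events.append(event)

    def read_from(self, offset: int) -> List[dict]:
        return self._events[offset:]


class LocalReplica:
    """A service's own copy of another service's data, built from events."""

    def __init__(self, stream: ChangeStream) -> None:
        self._stream = stream
        self.offset = 0                      # index of the next event to apply
        self.rows: Dict[int, dict] = {}

    def poll(self, max_events: Optional[int] = None) -> int:
        """Apply pending events in order; return how many were applied."""
        pending = self._stream.read_from(self.offset)
        if max_events is not None:
            pending = pending[:max_events]
        for event in pending:
            if event["op"] == "delete":
                self.rows.pop(event["key"], None)
            else:                            # insert or update
                self.rows[event["key"]] = event["after"]
            self.offset += 1
        return len(pending)


stream = ChangeStream()
replica = LocalReplica(stream)
stream.publish({"op": "insert", "key": 1, "after": {"status": "NEW"}})
stream.publish({"op": "update", "key": 1, "after": {"status": "PAID"}})
stream.publish({"op": "insert", "key": 2, "after": {"status": "NEW"}})

replica.poll(max_events=1)      # the replica lags: only the first event is applied
lagging = dict(replica.rows)
replica.poll()                  # later it applies the rest and catches up
caught_up = dict(replica.rows)
```

The replica tracks its own offset, so a slow or temporarily disconnected service simply resumes from where it stopped; the source service is never blocked by it.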

This is an excerpt of the full article, which can be read on InfoQ.

To get notifications when InfoQ publishes content on these topics follow "Streaming" and "AI, ML and Data Engineering" on InfoQ.

Missed a newsletter? You can find all of the previous issues on InfoQ.

Want to learn techniques and use cases with PyTorch, Keras, and TensorFlow?

Discover this and more in the “Machine Learning for Developers” track at QConNYC 2019 (June 24-26). Save $100 using the code INFOQNY19!

InfoQ strives to facilitate the spread of knowledge and innovation within this space, and in this newsletter we aim to curate and summarise key learnings from news items, articles and presentations created by industry peers, both on InfoQ and across the web. We aim to keep readers informed and educated about emerging trends, peer-validated early adoption of technologies, and architectural best practices, and are always keen to receive feedback from our readers.

The Software Architects' Newsletter, April 2019