InfoQ

The Software Architects' Newsletter
May 2019

In our twenty-second issue of the Architects’ Newsletter we are focusing on the topic of data engineering and streaming architecture. We believe this topic has largely “crossed the chasm” in regard to the diffusion of innovation, and in our latest InfoQ Architecture and Design Trends report we have placed event-driven architecture in the early majority phase, and other related patterns and approaches in the early adopter phase. Understanding all the emerging patterns, antipatterns, and technologies is therefore essential for a software architect.

News

Databricks Releases Delta Lake as Open Source and MLflow Integration as GA

Databricks recently announced the open sourcing of “Delta Lake”, its proprietary storage layer that adds ACID transaction capability to Apache Spark and big data workloads. Databricks is the company founded by the original creators of Apache Spark, and Alex Giamas reported on InfoQ that Delta Lake is already being used at several companies, including McGraw Hill, McAfee, Upwork, and Booz Allen Hamilton.

Databricks also made MLflow integration with Databricks notebooks generally available for its data engineering and higher-level subscription tiers. This integration combines the features of the MLflow platform for managing the ML project lifecycle with those of Databricks notebooks and jobs. Databricks originally authored MLflow as an open-source project in June 2018, and MLflow has always been usable as a standalone command-line tool.

Google Releases Google-Landmarks-V2, a Large-Scale Dataset for Landmark Recognition and Retrieval

As reported by Alexis Perrier on InfoQ, Google has recently released Google-Landmarks-v2, an improved dataset for “Landmark Recognition and Retrieval.” The Google-Landmarks-v2 release is the second iteration of the Google-Landmarks dataset, previously released in March 2018. This new version contains 5 million images of more than 200,000 different landmarks. The images were collected from photographers around the world who labeled their own photos, and were supplemented with historical and lesser-known images from Wikimedia Commons.

Real-Time Data Processing Using Redis Streams and Apache Spark Structured Streaming

A recent full-length InfoQ article by Roshan Kumar explored real-time data processing using Redis Streams and Apache Spark Structured Streaming. Apache Spark's Structured Streaming brings SQL-like querying capabilities to data streams, allowing engineers to perform scalable, real-time data processing. Redis Streams, the new data structure introduced in Redis 5.0, enables the collection, persistence, and distribution of data at high speed with sub-millisecond latency. The open-source Spark-Redis library offers resilient distributed dataset (RDD) and DataFrame APIs for Redis data structures, and allows you to use Redis Streams as a data source for Structured Streaming.

Kumar argues that integrating Redis Streams and Structured Streaming simplifies the implementation and scaling of stream-processing applications: “Redis Streams simplifies the task of collecting and distributing data at a high speed. By combining it with Structured Streaming in Apache Spark, you can power all kinds of solutions that require real-time computations for scenarios ranging from IoT, fraud detection, AI and machine learning, real-time analytics, and so on.”
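To make the idea concrete, here is a toy sketch in plain Python. It does not use the Redis or Spark APIs: an in-memory append-only log stands in for a stream, and a running group-by count stands in for a streaming SQL aggregation. The names `ToyStream`, `add`, `read_after`, `consume`, and the `asset` field are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

Entry = Tuple[int, Dict[str, str]]


class ToyStream:
    """In-memory append-only log; a stand-in for a stream, NOT the Redis API."""

    def __init__(self) -> None:
        self._entries: List[Entry] = []

    def add(self, fields: Dict[str, str]) -> int:
        """Append an entry and return its monotonically increasing ID (1, 2, 3, ...)."""
        entry_id = len(self._entries) + 1
        self._entries.append((entry_id, dict(fields)))
        return entry_id

    def read_after(self, last_id: int) -> List[Entry]:
        """Return every entry whose ID is greater than last_id."""
        # IDs are 1-based and dense, so the entry with ID n sits at index n - 1.
        return self._entries[last_id:]


def consume(stream: ToyStream, counts: Counter, cursor: int) -> int:
    """Fold all entries newer than cursor into per-asset counts; return the new cursor."""
    for entry_id, fields in stream.read_after(cursor):
        counts[fields["asset"]] += 1
        cursor = entry_id
    return cursor


# Producer appends events; the consumer picks up only what it has not yet seen.
stream = ToyStream()
counts: Counter = Counter()
cursor = 0

stream.add({"asset": "banner"})
stream.add({"asset": "video"})
cursor = consume(stream, counts, cursor)

stream.add({"asset": "banner"})
cursor = consume(stream, counts, cursor)
```

The consumer keeps a cursor (the last ID it processed), so each call handles only new entries and the aggregate is updated incrementally instead of being recomputed from scratch. A production system would delegate storage, delivery, and scaling to the actual streaming and processing platforms.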

Google Scales Weak Supervision to Overcome Labeled Dataset Problem

On InfoQ, Aslan Brooke recently reported that Google has recognized that the need for labeled data in machine learning (ML) is a significant bottleneck for ML projects, and accordingly has adapted the open-source Snorkel framework to overcome the problem at scale. Google collaborated with Stanford and Brown University in this research, and documented the results on its AI blog and in a scientific research paper titled "Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale."
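As a rough illustration of the weak-supervision idea, the sketch below combines several noisy heuristic labelers by majority vote to produce training labels without hand annotation. It is plain Python and does not use the Snorkel API. The labeling functions, the keywords, the complaint-detection task, and the majority-vote combiner are all assumptions made for illustration; Snorkel itself fits a statistical model over the labelers' outputs rather than taking a simple vote.

```python
from collections import Counter
from typing import Callable, List, Optional

ABSTAIN = None  # a labeler returns this when its heuristic does not apply
COMPLAINT, NOT_COMPLAINT = 1, 0

LabelingFunction = Callable[[str], Optional[int]]


# Hypothetical heuristics: each one is cheap to write and individually unreliable.
def lf_mentions_refund(text: str) -> Optional[int]:
    return COMPLAINT if "refund" in text.lower() else ABSTAIN


def lf_mentions_thanks(text: str) -> Optional[int]:
    return NOT_COMPLAINT if "thank" in text.lower() else ABSTAIN


def lf_many_exclamations(text: str) -> Optional[int]:
    return COMPLAINT if text.count("!") >= 2 else ABSTAIN


def weak_label(text: str, lfs: List[LabelingFunction]) -> Optional[int]:
    """Majority vote over non-abstaining labelers; abstain on no votes or a tie."""
    votes = [v for v in (lf(text) for lf in lfs) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # the labelers disagree evenly, so no label is emitted
    return ranked[0][0]


LFS = [lf_mentions_refund, lf_mentions_thanks, lf_many_exclamations]
labels = [weak_label(t, LFS) for t in (
    "I want a refund now!!",
    "Thanks for the quick help",
    "Thanks, but I want a refund",
    "Hello",
)]
```

Examples on which every labeler abstains, or on which the votes tie, receive no label and would simply be left out of the training set in this simplified scheme.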

Event Streams and Workflow Engines: Apache Kafka and Zeebe

Apache Kafka is a highly scalable, distributed streaming platform often used to distribute messages or events within a microservices-based system. These events are often part of a business process, with tasks spread over several microservices. To handle complex business processes a workflow engine can be used, but to work alongside Kafka it must offer comparable scalability.

Zeebe is a workflow engine currently being developed and designed to meet these scalability requirements. In a joint meeting in Amsterdam, Kai Waehner described features of Kafka and how it fits in an Event-Driven Architecture (EDA), and Bernd Rücker described workflow engines, Zeebe, and how it can be used with Kafka.

 

Case Study

The Data Science Mindset: Six Principles to Build a Healthy Data-Driven Organization

In the last few years, data from a myriad of different sources has become more available and consumable, and organizations have started looking for ways to use the latest data analytics techniques to address their business needs and pursue new opportunities. Not only has data become more available and accessible, but there has also been an explosion of tools and applications that enable teams to build sophisticated data-analytics solutions. For all these reasons, organizations are increasingly forming teams around the function of Data Science.

But what does it really mean to be a data-driven organization? In a recent full-length InfoQ article, Francesca Lazzeri presented six principles to build a healthy data-driven organization, along with a framework both to assess whether an organization is data-driven and to benchmark its data science maturity. This framework is based on her experience as a data scientist, working on end-to-end data science and machine-learning solutions with external customers from a wide range of industries including energy, oil and gas, retail, aerospace, healthcare, and professional services.

The “Healthy Data Science Organization Framework” is a portfolio of methodologies, technologies, and resources that will assist an organization in becoming more data-driven, and provide a lifecycle to structure the development of data science projects. The principles presented include:

  1. Understand the Business and Decision-Making Process
  2. Establish Performance Metrics
  3. Architect the End-to-End Solution
  4. Build Your Toolbox of “Data Science Tricks”
  5. Unify Your Organization’s Data Science Vision
  6. Keep Humans in the Loop

Lazzeri believes that by applying these six principles from the Healthy Data Science Organization Framework, organizations can make better decisions for their business, and their choices will be backed by data that has been robustly collected and analyzed. She argues that with practice, data science processes will get faster and more accurate – meaning organizations will make better, more informed decisions to run operations most effectively.

This is an excerpt of an InfoQ article that can be read in full: “The Data Science Mindset: Six Principles to Build Healthy Data-Driven Organizations.”

To get notifications when InfoQ publishes content on these topics follow "Streaming" and "AI, ML and Data Engineering" on InfoQ.

Missed a newsletter? You can find all of the previous issues on InfoQ.

This edition of The Software Architects' Newsletter is brought to you by:

NGINX

What Is a Service Mesh?

Service meshes provide policy-based networking for microservices, describing the desired behavior of the network in the face of constantly changing conditions and network topology. At their core, service meshes provide a developer-driven, services-first network: a network that is primarily concerned with relieving application developers of the need to build network concerns (e.g., resiliency) into their application code, and a network that empowers operators to declaratively define network behavior, node identity, and traffic flow through policy.

 

InfoQ strives to facilitate the spread of knowledge and innovation within this space, and in this newsletter we aim to curate and summarise key learnings from news items, articles and presentations created by industry peers, both on InfoQ and across the web. We aim to keep readers informed and educated about emerging trends, peer-validated early adoption of technologies, and architectural best practices, and are always keen to receive feedback from our readers. We hope you find it useful, but if not you can unsubscribe using the link below.

Unsubscribe
