InfoQ

The Software Architects' Newsletter
January 2023

Welcome to the InfoQ Software Architects' Newsletter! Each month, we bring you essential news and experience from industry peers on emerging patterns and technologies.

This month, we focus on "Modern Data Processing: Data Pipelines, Streaming, and Data Mesh". These core topics currently span the entire "diffusion of innovation" graph in our 2022 AI, ML, and Data Engineering InfoQ Trends Report. We see increasing adoption of stream processing, distributed computation, and "data lake-as-a-service".

Key challenges remain in this space, including making data pipeline architecture decisions deliberately at scale (and at the required speed) and bringing social and ethical elements into the sociotechnical systems in which we all now work.

News

AWS Announces the General Availability of Amazon Omics

At re:Invent, AWS announced the general availability of Amazon Omics, a managed service for storing, analyzing, and processing genomic, transcriptomic, and other omics data. The service is designed to help healthcare and life science organizations enhance patient care and advance scientific research.

Microsoft Open-Sources Agricultural AI Toolkit FarmVibes.AI

Microsoft Research recently open-sourced FarmVibes.AI, a suite of ML models and tools for sustainable agriculture. FarmVibes.AI includes data processing workflows for fusing multiple sets of spatiotemporal and geospatial data, such as weather data and satellite and drone imagery.

The release was announced on the Microsoft Research blog. FarmVibes.AI is part of Microsoft's Project FarmVibes, an effort to develop technologies for sustainable agriculture. The key idea is a fusion of multiple data sources to improve the performance of AI models.

The toolkit contains utilities for downloading and preprocessing public datasets of satellite imagery, weather, and terrain elevation. It also includes models for removing cloud cover from satellite images and generating micro-climate forecasts.

OpenAI Releases Conversational AI Model ChatGPT

OpenAI released ChatGPT, a conversational AI model based on their GPT-3.5 language model (LM). ChatGPT is fine-tuned using Reinforcement Learning from Human Feedback (RLHF) and includes a moderation filter to block inappropriate interactions.

How Twitter Automated Its Data Quality Check Process

Twitter engineering recently shared a blog post on how they architected and developed a data quality automation platform using Google Cloud Platform (GCP) and open-source software. Twitter digests and creates thousands of data sets for different data products and applications, and it recently moved to GCP and BigQuery as part of its data lake solution. The next natural step in designing and ingesting these vast amounts of data is to ensure the data’s quality. These data will fuel Twitter's core ads, product analytics, and ML products, to name a few.
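As a rough illustration of what an automated quality check can look like, here is a minimal Python sketch. It is not Twitter's platform: the names `QualityReport` and `check_quality`, the thresholds, and the sample rows are all hypothetical, and the check runs on in-memory rows rather than on BigQuery tables.

```python
from dataclasses import dataclass


@dataclass
class QualityReport:
    row_count: int
    null_rates: dict
    failures: list

    @property
    def passed(self) -> bool:
        # The data set passes only if no rule reported a failure.
        return not self.failures


def check_quality(rows, required_columns, min_rows=1, max_null_rate=0.1):
    """Hypothetical check: enough rows, and few missing values per column."""
    failures = []
    if len(rows) < min_rows:
        failures.append(f"row_count {len(rows)} < {min_rows}")
    null_rates = {}
    for column in required_columns:
        missing = sum(1 for row in rows if row.get(column) is None)
        # An empty data set is treated as fully missing.
        rate = missing / len(rows) if rows else 1.0
        null_rates[column] = rate
        if rate > max_null_rate:
            failures.append(f"{column} null rate {rate:.2f} > {max_null_rate}")
    return QualityReport(len(rows), null_rates, failures)


# Made-up sample data, for illustration only.
sample_rows = [
    {"user_id": 1, "country": "US"},
    {"user_id": 2, "country": None},
    {"user_id": 3, "country": "BR"},
    {"user_id": 4, "country": "JP"},
]
report = check_quality(sample_rows, ["user_id", "country"], max_null_rate=0.2)
print(report.passed, report.failures)
```

In this sample, one of four `country` values is missing, so the check fails against the assumed 20% threshold. A production platform would run rules like these on a schedule and alert data set owners on failure.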

Google Publishes Technique for AI Language Model Self-Improvement

Researchers at Google and the University of Illinois at Urbana-Champaign (UIUC) have published a technique called Language Model Self-Improved (LMSI), which fine-tunes a large language model (LLM) on a dataset generated by that same model. Using LMSI, the researchers improved the performance of the LLM on six benchmarks and set new state-of-the-art accuracy records on four of them.
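The core idea, a model producing its own fine-tuning data, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_model` is a hypothetical stand-in for an LLM, and keeping the majority answer across several samples is an assumed filtering step that the summary above does not describe. The fine-tuning step itself is omitted.

```python
import random
from collections import Counter


def toy_model(question, rng):
    """Hypothetical stand-in for an LLM: answers a sum, sometimes wrongly."""
    a, b = question
    noise = rng.choice([0, 0, 0, 1])  # roughly one in four samples is off by one
    return a + b + noise


def build_self_training_set(questions, samples_per_question=9,
                            min_agreement=0.6, seed=0):
    """Sample several answers per question and keep only confident majorities."""
    rng = random.Random(seed)
    dataset = []
    for question in questions:
        answers = [toy_model(question, rng) for _ in range(samples_per_question)]
        answer, votes = Counter(answers).most_common(1)[0]
        # Assumed filter: keep the example only if enough samples agree.
        if votes / samples_per_question >= min_agreement:
            dataset.append((question, answer))
    return dataset


questions = [(1, 2), (10, 5), (7, 7)]
training_set = build_self_training_set(questions)
# A real pipeline would now fine-tune the same model on training_set.
print(training_set)
```

The agreement threshold trades data volume against label noise: a higher `min_agreement` keeps fewer, more consistent examples.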

Case Study

AI, ML, and Data Engineering InfoQ Trends Report—August 2022

In this annual report, the InfoQ editors discuss the current state of AI, ML, and data engineering and what emerging trends you, as a software engineer, architect, or data scientist, should watch. We curate our discussions into a technology adoption curve with supporting commentary to help you understand how things evolve.

In this year's AI, ML, and Data Engineering InfoQ Trends podcast, the InfoQ editorial team was joined by an external panelist, Dr. Einat Orr, co-creator of the open-source project LakeFS and co-founder and CEO of Treeverse, who also spoke at the recent QCon London conference.

The following sections in the article summarize some of these trends and where different technologies fall in the technology adoption curve.

Streaming Data Analytics: IoT and Real-Time Data Ingestion

Streaming-first architectures and streaming data analytics have seen increasing adoption across companies, especially in IoT and other real-time data ingestion and processing applications.

Sid Anand's presentation on building & operating high-fidelity data streams and Ricardo Ferreira's talk on building value from data in motion by transitioning from batch data processing to stream-based data processing are excellent examples of how stream-based data processing is a must-have in strategic data architectures. Also, Chris Riccomini, in his article, The Future of Data Engineering, discussed the critical role stream processing plays in overall data engineering programs.

Chip Huyen spoke at last year's QCon Plus online conference on Streaming-First Infrastructure for Real-Time ML. She highlighted the advantages of a streaming-first infrastructure for real-time and continual machine learning, the benefits of real-time ML, and the challenges of implementing real-time ML.

As a reflection of this trend, streaming data analytics and technologies such as Spark Streaming have been moved to the late majority. The same applies to Data Lake as a Service, which gained further adoption last year with products like Snowflake.

AI/ML Infrastructure: Building for Scale

Highly scalable, resilient, distributed, secure, and performant infrastructure can make or break the AI/ML strategy in an organization. Without a good infrastructure as the foundation, no AI/ML program can be successful in the long term.

At this year's GTC conference, NVIDIA announced their next-generation processors for AI computing, the H100 GPU and the Grace CPU Superchip.

Resource negotiators like YARN and container orchestration technologies like Kubernetes are now in the late majority category. Kubernetes has become the de facto standard for cloud platforms, and multi-cloud computing is gaining attention for deploying applications to the cloud. Technologies like Kubernetes can be the enablers for automating the complete lifecycle of AI/ML data pipelines, including the models' production deployments and post-production support.

MLOps: Combining ML and DevOps Practices

MLOps has been getting a lot of attention from companies that want to bring to machine learning the same discipline and best practices that DevOps offers in the software development space.

Francesca Lazzeri spoke about MLOps as the most important piece in the enterprise AI puzzle at the QCon Plus Conference. She discussed how MLOps empowers data scientists and app developers to help bring machine learning models to production. MLOps enables you to track, version, audit, certify, and reuse every asset in your machine learning lifecycle, and it provides orchestration services to streamline managing this lifecycle.

This content is an excerpt from a recent InfoQ article by Srini Penchikala, "AI, ML, and Data Engineering InfoQ Trends Report—August 2022".

To get notifications when InfoQ publishes content on these topics, follow "AI, ML, and Data Engineering", "Streaming", and "Data mesh" on InfoQ.

Missed a newsletter? You can find all of the previous issues on InfoQ.

Sponsored

Cockroach Labs

Most architects know that if something can go wrong, eventually it probably will, which makes designing resilient infrastructure critical to ensuring high uptime and availability.

Start building unkillable apps today with your free copy of O'Reilly's CockroachDB: The Definitive Guide, sponsored by Cockroach Labs.

Upcoming events

QCon: For practitioners, by practitioners


Solve your challenges with valuable insights from senior software developers applying the latest trends and practices. Join your peers in person, or get on-demand access and select live sessions with our new online ticket.

QCon London - March 27-29, 2023: Book before February 6th and save with limited early bird tickets.

QCon New York - June 13-15, 2023: Book before January 30th to save with our best prices.

Senior software developers rely on the InfoQ community to keep ahead of the adoption curve. One of the main reasons software architects and engineers tell us they keep coming back to InfoQ is that they trust the information provided and selected by their peers.

We’ve been helping software development teams adopt new technologies and practices for over 16 years through InfoQ articles, news items, podcasts, tech talks, trends reports, and QCon software development conferences.

We hope you find this newsletter useful.