Friday, August 15, 2025

New IEEE Studies Offer Faster, Smarter Approaches to Real-Time Data and Federated SQL

Recent research outlines practical methods to speed up machine learning pipelines and improve query performance across large, diverse datasets.

Phoenix, Arizona, United States of America - Speed is no longer a luxury in data analytics; it is a necessity. A financial service flagging a suspicious payment in time to stop it, or an e-commerce site changing a promotion while a shopper is still browsing, can make all the difference. Yet as data pours in from more sources and is stored in more places, turning it into useful insight without delay has become a bigger challenge than ever.

Two studies by data engineering specialist Vamsi Krishna Eruvaram, newly published in the IEEE Xplore Digital Library, address this challenge directly. His work focuses on two pressing areas for today’s data teams: keeping real-time data pipelines fast enough to feed advanced machine learning, and speeding up federated SQL queries that pull information from massive, scattered datasets.

In the first paper, Optimizing Real Time Data Pipelines for Machine Learning: A Comparative Study of Stream Processing Architectures, Eruvaram looks closely at three open-source frameworks: Apache Kafka Streams, Apache Flink and Apache Pulsar. Rather than testing them under perfect lab conditions, he recreated the kinds of demands that real businesses face. One scenario simulated fraud detection on live financial transactions, another handled fast-changing recommendations for online shoppers, and a third processed streams of sensor readings from industrial equipment.

For each scenario he measured latency, throughput, scalability and resource use. Apache Flink consistently delivered the lowest latency for complex, time-sensitive processing. Kafka Streams was simpler to run and well suited to teams already working in the Kafka ecosystem. Pulsar showed strong multi-tenancy and replication capabilities, which matter for global operations subject to varying data rules. Instead of naming a single winner, Eruvaram advises matching the choice of tool to the specific workload and operational context.

His second paper, Enhancing Federated SQL Query Performance in Large Scale Data Lakes: A Trino Based Approach, explores how to speed up queries that must pull data from different systems without first moving it into one place. In many organizations, the data sits in a mix of on-premises clusters, cloud storage and specialized platforms, making transfers costly or restricted by regulation.

Eruvaram’s research examines methods to make these federated queries faster and more reliable. He focuses on sending filters to the source data (predicate pushdown), scanning only the relevant partitions (partition pruning), choosing execution plans based on estimated cost, and managing workloads to prevent resource bottlenecks. The paper also details ways to keep long-running queries from failing when part of the system goes down.
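
To illustrate why the first two techniques matter, here is a minimal Python sketch. It is not taken from the paper and does not use Trino's API. The PartitionedSource class, the make_source helper and the sample rows are all hypothetical stand-ins for a remote table split into daily partitions. The sketch counts how many rows the source has to scan and how many it has to send back to the query engine, with and without the two optimizations.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]


@dataclass
class PartitionedSource:
    """Hypothetical remote table, stored as one list of rows per day."""
    partitions: Dict[str, List[Row]]
    rows_scanned: int = 0       # rows the source had to read
    rows_transferred: int = 0   # rows sent back to the query engine

    def scan(
        self,
        partition_keys: Optional[List[str]] = None,
        predicate: Optional[Callable[[Row], bool]] = None,
    ) -> List[Row]:
        # Partition pruning: read only the partitions that were asked for.
        keys = list(self.partitions) if partition_keys is None else partition_keys
        result = []
        for key in keys:
            for row in self.partitions.get(key, []):
                self.rows_scanned += 1
                # Predicate pushdown: filter at the source, before transfer.
                if predicate is None or predicate(row):
                    self.rows_transferred += 1
                    result.append(row)
        return result


def make_source() -> PartitionedSource:
    # Made-up sample data: three daily partitions, two payments each.
    return PartitionedSource({
        "2025-08-13": [{"day": "2025-08-13", "amount": 50},
                       {"day": "2025-08-13", "amount": 900}],
        "2025-08-14": [{"day": "2025-08-14", "amount": 20},
                       {"day": "2025-08-14", "amount": 1500}],
        "2025-08-15": [{"day": "2025-08-15", "amount": 75},
                       {"day": "2025-08-15", "amount": 2000}],
    })


def naive_query(source: PartitionedSource) -> List[Row]:
    # Pull every row to the engine, then filter there.
    rows = source.scan()
    return [r for r in rows if r["day"] == "2025-08-15" and r["amount"] > 1000]


def optimized_query(source: PartitionedSource) -> List[Row]:
    # Ask the source for one partition and let it apply the amount filter.
    return source.scan(
        partition_keys=["2025-08-15"],
        predicate=lambda r: r["amount"] > 1000,
    )


if __name__ == "__main__":
    for query in (naive_query, optimized_query):
        src = make_source()
        rows = query(src)
        print(query.__name__, len(rows), src.rows_scanned, src.rows_transferred)
```

Both queries return the same single payment, but in this toy setup the naive version scans and transfers all six rows, while the optimized version scans two and transfers one. Real engines make these decisions inside the query planner and connectors; the sketch only shows the effect on data movement.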

Tests on multi-terabyte datasets showed query time reductions of up to 62 percent for complex joins, all achieved without new hardware. For organizations, that means faster results without higher infrastructure costs.
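
To put the headline figure in concrete terms, the short Python sketch below converts a percentage reduction in query time into a new running time and an equivalent speedup factor. The 100-second baseline is a hypothetical number chosen for easy arithmetic, not a measurement from the paper, and the function names are illustrative.

```python
def time_after_reduction(baseline_seconds: float, reduction_percent: float) -> float:
    """Running time left after cutting the baseline by reduction_percent."""
    if not 0 <= reduction_percent < 100:
        raise ValueError("reduction_percent must be in the range [0, 100)")
    return baseline_seconds * (1 - reduction_percent / 100)


def speedup_factor(reduction_percent: float) -> float:
    """How many times faster a query runs after the given reduction."""
    if not 0 <= reduction_percent < 100:
        raise ValueError("reduction_percent must be in the range [0, 100)")
    return 1 / (1 - reduction_percent / 100)


if __name__ == "__main__":
    # Hypothetical 100-second query with the reported best-case 62 percent cut.
    print(round(time_after_reduction(100, 62), 1), "seconds")
    print(round(speedup_factor(62), 2), "times faster")
```

Under that assumption, a 62 percent reduction leaves roughly 38 seconds of running time, which is a speedup of about 2.6 times. Because the reported figure is an upper bound ("up to"), typical queries may see smaller gains.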

Taken together, Eruvaram’s studies offer a clear roadmap for handling the speed of incoming data and the complexity of distributed storage. They provide solutions that are both technically sound and grounded in the realities of running large-scale data systems.

Both papers by Vamsi Krishna Eruvaram are available in the IEEE Xplore Digital Library: Optimizing Real Time Data Pipelines for Machine Learning: A Comparative Study of Stream Processing Architectures, at https://ieeexplore.ieee.org/document/11088800, and Enhancing Federated SQL Query Performance in Large Scale Data Lakes: A Trino Based Approach, at https://ieeexplore.ieee.org/document/11089251.

Media Contact
Company Name: CB Herald
Contact Person: Ray
City: Phoenix
State: Arizona
Country: United States
Website: Cbherald.com