AWS Glue upgrades Spark engines, backs Ray framework

Serverless data integration service in the Amazon cloud also adds support for built-in Pandas APIs and the Apache Hudi, Apache Iceberg, and Delta Lake formats.

By Paul Krill, Editor at Large, InfoWorld


AWS Glue, the serverless data integration service from Amazon Web Services, has upgraded its Python and Apache Spark engines in a version 4.0 release introduced this week.

The upgrade adds engines for Python 3.10 and Apache Spark 3.3.0. Both engines include performance enhancements and bug fixes, with Spark offering capabilities such as row-level runtime filtering and improved error messages.
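To make row-level runtime filtering concrete, here is a minimal conceptual sketch in plain Python. It is not Spark's implementation: as far as I know, Spark 3.3 builds a Bloom filter or semi-join from the smaller side of a join, while this sketch uses an exact set of keys. The function names and sample data are hypothetical.

```python
# Conceptual sketch only: plain Python, not Spark's actual implementation.
# Idea: collect join keys from the small side at run time, then use them to
# prune rows on the large side before the join does any expensive work.

def build_runtime_filter(small_side, key):
    """Collect the join keys present on the small side of the join."""
    return {row[key] for row in small_side}


def join_with_runtime_filter(large_side, small_side, key):
    """Prune the large side with the runtime filter, then hash-join."""
    runtime_filter = build_runtime_filter(small_side, key)
    # Row-level pruning: drop large-side rows that cannot possibly match.
    pruned = [row for row in large_side if row[key] in runtime_filter]
    # Simple hash join on the surviving rows (assumes unique small-side keys).
    lookup = {row[key]: row for row in small_side}
    joined = [{**row, **lookup[row[key]]} for row in pruned]
    return joined, len(pruned)


# Hypothetical sample data: 1,000 orders spread over 100 customers.
orders = [{"customer_id": i % 100, "amount": i} for i in range(1000)]
vip_customers = [
    {"customer_id": 3, "tier": "gold"},
    {"customer_id": 7, "tier": "gold"},
]

joined, rows_after_pruning = join_with_runtime_filter(
    orders, vip_customers, "customer_id"
)
print(f"{rows_after_pruning} of {len(orders)} rows survived the runtime filter")
```

The benefit grows with the size gap between the two sides: the fewer keys on the small side, the more rows on the large side can be skipped before the join.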

New engine plugins in Glue 4.0 support the Ray compute framework, the Cloud Shuffle Service for Spark, and Adaptive Query Execution. Glue 4.0 also supports Pandas, the data analysis and manipulation library built on top of Python. Newly supported data formats are Apache Hudi, Apache Iceberg, and Delta Lake. The release also includes the Parquet vectorized reader, with support for additional encodings and data types.
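As a rough illustration of how a job might opt in to Glue 4.0 and one of the new data lake formats, the sketch below builds a job definition as a plain Python dictionary. The field names follow the shape of the Glue CreateJob request as I understand it, and `--datalake-formats` is, to my knowledge, the job parameter that enables Hudi, Iceberg, or Delta Lake. Both should be checked against the AWS documentation. The helper function, job name, role ARN, S3 path, worker type, and worker count are all hypothetical, and nothing here calls AWS.

```python
import json

# Formats named in the article; the lowercase values are an assumption about
# what the --datalake-formats job parameter accepts.
SUPPORTED_FORMATS = {"hudi", "iceberg", "delta"}


def glue4_job_definition(name, role_arn, script_location, datalake_format=None):
    """Build a Glue 4.0 Spark job definition as a plain dict (hypothetical helper)."""
    if datalake_format is not None and datalake_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported data lake format: {datalake_format}")

    job = {
        "Name": name,
        "Role": role_arn,
        "GlueVersion": "4.0",  # selects the upgraded Spark and Python engines
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "WorkerType": "G.1X",  # illustrative choice
        "NumberOfWorkers": 2,  # illustrative choice
        "DefaultArguments": {},
    }
    if datalake_format is not None:
        job["DefaultArguments"]["--datalake-formats"] = datalake_format
    return job


# Hypothetical names, ARN, and script path, for illustration only.
job = glue4_job_definition(
    name="orders-iceberg-etl",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    script_location="s3://example-bucket/scripts/orders_etl.py",
    datalake_format="iceberg",
)
print(json.dumps(job, indent=2))
```

A dictionary like this could then be passed to whatever tool creates the job (an SDK call, a template, or a CLI), which keeps the version and format choices in one reviewable place.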

AWS Glue provides data discovery, data preparation, data transformation, and data integration capabilities, with autoscaling based on workload size. AWS said Glue also now offers visual transforms that let customers reuse and share business-specific ETL logic among teams.

AWS also announced a preview of AWS Glue for Ray as a new engine option. Data engineers can use AWS Glue for Ray to process large data sets with Python and popular Python libraries, with the processing of Python code distributed across multi-node clusters.

Glue 4.0 is available now in several AWS regions of the US including Ohio, Northern Virginia, and Northern California.


Paul Krill is an editor at large at InfoWorld, whose coverage focuses on application development.