Skip to content

Latest commit

 

History

History
 
 

XML-ingest-demo

image

XML Processing with Spark on Databricks -

Endangered Species Demo

This Databricks notebook demonstrates how to process XML data using Apache Spark on Databricks. It covers various aspects of XML processing, including reading XML files, validating against XSD schemas, handling schema evolution, and using SQL functions for XML manipulation.

Introduction

XML (eXtensible Markup Language) is a popular format for storing and exchanging structured data. Apache Spark provides built-in support for processing XML data, enabling users to read, validate, and manipulate XML files efficiently. This notebook explores various techniques for working with XML data in Spark using Databricks.

Native XML is available now with Databricks runtime 14.3 and above. Native XML file format support enables ingestion, querying, and parsing of XML data for batch processing or streaming. This won't replace the OSS version of spark-xml. That library will continue to be supported but doesn't have the same feature set for things like schema evolution, rescue data columns. Customers using DBR don't need to add the package anymore and will get additional optimizations and features.

Setup

The notebook starts by setting up the necessary files and directories for the XML processing demo. It defines paths to XML files, XSD schemas, and a schema location for Auto Loader.

Reading XML Data

Spark provides the spark.read API to read XML data into a DataFrame. The format("xml") option specifies the XML format, and the option("rowTag", "...") option indicates the XML element to consider as a row in the DataFrame. The notebook demonstrates reading XML data and flattening attributes into columns.

XML Validation with XSD

Spark allows validating XML data against an XSD (XML Schema Definition) schema using the option("rowValidationXSDPath", "...") option. The notebook shows how to enforce an XSD schema during XML parsing and handle invalid or corrupt records.

Schema Evolution with Auto Loader

Auto Loader is a feature in Spark that enables streaming XML data and supports schema inference and evolution. The notebook demonstrates how to use Auto Loader with XML data, specifying options like "cloudFiles.format", "rowTag", "rowValidationXSDPath", and "cloudFiles.schemaLocation". It also shows how to handle schema evolution by setting the "cloudFiles.schemaEvolutionMode" option to "addNewColumns".

Using SQL with XML

Spark SQL provides native support for creating tables from XML files using the CREATE TABLE ... USING XML syntax. The notebook illustrates how to create a table from an XML file and query it using standard SQL commands. It also demonstrates the read_files function in Spark SQL for reading XML data and specifying options like format, schemaEvolutionMode, and schemaLocation.

SQL XML Functions

Spark SQL offers built-in functions for working with XML data. The notebook showcases two commonly used functions: schema_of_xml() and to_xml(). schema_of_xml() infers the schema of an XML string, while to_xml() converts a struct to an XML string representation.

Conclusion

This notebook provides a comprehensive overview of processing XML data using Apache Spark on Databricks. It covers various techniques, including reading XML files, validating against XSD schemas, handling schema evolution with Auto Loader, using SQL with XML, and leveraging SQL XML functions. By following the examples and explanations provided in the notebook, users can effectively work with XML data in their Spark applications on Databricks.

Feel free to explore the notebook and adapt the code examples to your specific XML processing requirements.