plugins.h_spark.SparkInputValidator¶
- class hamilton.plugins.h_spark.SparkInputValidator¶
This is a graph hook adapter that allows you to get past a spark < 4.0.0 limitation. Spark gives you the option of choosing between Spark Connect and classic Spark, which largely share the same API. That said, the corresponding classes do not have the proper subclass relationships, which makes Hamilton fail on input type checking.
See the following for more information as to why this is necessary:
- https://community.databricks.com/t5/data-engineering/pyspark-sql-connect-dataframe-dataframe-vs-pyspark-sql-dataframe/td-p/71055
- https://issues.apache.org/jira/browse/SPARK-47909
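To see the limitation directly, here is a minimal sketch, assuming pyspark < 4.0.0 with the connect extra installed (e.g. pip install "pyspark[connect]<4.0.0"):

from pyspark.sql import DataFrame as ClassicDataFrame
from pyspark.sql.connect.dataframe import DataFrame as ConnectDataFrame

# The two DataFrame classes share an API but no common base class on
# spark < 4.0.0, so a plain issubclass/isinstance check cannot treat
# them as interchangeable.
print(issubclass(ConnectDataFrame, ClassicDataFrame))  # False on spark < 4.0.0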
You can access an instance of this through the convenience variable SPARK_INPUT_CHECK, which lets you bypass that type check. It has to be used with the driver Builder pattern, which looks as follows:
from hamilton import driver
from hamilton.plugins import h_spark

dr = (
    driver.Builder()
    .with_modules(...)
    .with_adapters(h_spark.SPARK_INPUT_CHECK)
    .build()
)
Then run it as you would normally. Note that in spark==4.0.0, you will only need the spark session check, not the dataframe check.
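For example, here is a hedged end-to-end sketch. my_module, my_output, and the remote URL are assumptions for illustration; with the adapter installed, a Spark Connect session passes validation for a node typed as a classic SparkSession:

from pyspark.sql import SparkSession

from hamilton import driver
from hamilton.plugins import h_spark
import my_module  # hypothetical module whose nodes take a SparkSession input

# builder.remote(...) returns a Spark Connect session, not a classic one.
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

dr = (
    driver.Builder()
    .with_modules(my_module)
    .with_adapters(h_spark.SPARK_INPUT_CHECK)
    .build()
)
# Without the adapter, passing a connect session where a classic
# SparkSession is expected would fail Hamilton's input type check.
results = dr.execute(["my_output"], inputs={"spark": spark})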
- do_validate_input(*, node_type: type, input_value: Any) → bool¶
Validates the input. Treats connect/classic sessions/dataframes as interchangeable.
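A minimal sketch of the idea behind this check, not the actual h_spark implementation: if the normal isinstance check fails, fall back to matching qualified class names, so that connect and classic variants of the same type (both named "DataFrame" or "SparkSession", just in different modules) are treated as equivalent:

from typing import Any

def validate_input_sketch(node_type: type, input_value: Any) -> bool:
    # Hypothetical helper illustrating the technique; names are assumptions.
    if isinstance(input_value, node_type):
        return True  # the normal subclass-based check already passes
    # Classic and connect classes share their class name but live in
    # different modules, so matching on the name treats them as
    # interchangeable.
    expected = node_type.__name__
    return any(cls.__name__ == expected for cls in type(input_value).__mro__)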