Data versioning
This module contains hashing functions for Python objects. It uses functools.singledispatch to allow specialized implementations based on type. Singledispatch automatically applies the most specific implementation registered for the type of the argument.
This module houses implementations for the Python standard library. Supporting all types is a considerable endeavor, so we’ll add support as types are requested by users.
Third-party types can be supported via the h_databackends module. This registers abstract types that can be checked without having to import the third-party library. For instance, there are implementations for pandas.DataFrame and polars.DataFrame despite these libraries not being imported here.
IMPORTANT: all container types that make a recursive call to hash_value or a specific implementation should pass the depth parameter to prevent RecursionError.
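A minimal sketch of why the depth parameter matters, assuming a hypothetical limit (MAX_DEPTH) and a hypothetical sentinel string; the actual limit and sentinel used by this module may differ.

```python
import hashlib

MAX_DEPTH = 4  # hypothetical limit, chosen for illustration


def _digest(text: str) -> str:
    """Hypothetical helper: hex SHA-256 digest of a string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def fingerprint(obj, depth: int = 0) -> str:
    if depth > MAX_DEPTH:
        # Stop descending instead of recursing until RecursionError.
        return "<max_depth_reached>"
    if isinstance(obj, (list, tuple)):
        # Container: pass depth + 1 on every recursive call.
        parts = [fingerprint(item, depth=depth + 1) for item in obj]
        return _digest("seq:" + ",".join(parts))
    return _digest(f"{type(obj).__name__}:{obj!r}")


# A list that contains itself would recurse forever without the depth guard.
cyclic = []
cyclic.append(cyclic)
result = fingerprint(cyclic)
```

A container implementation that forgot to forward depth would restart the count at 0 on every call, and the guard would never trigger.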
- hamilton.caching.fingerprinting.hash_bytes(obj, *args, **kwargs) -> str
Hash a bytes object.
The hash is prefixed with a bytes type tag so that b"1" and the string "1" (handled by hash_primitive()) do not collide.
Primitive type returns a hash and doesn’t have to handle depth.
- hamilton.caching.fingerprinting.hash_mapping(obj, *, ignore_order: bool = True, depth: int = 0, **kwargs) -> str
Hash each key then its value.
The mapping is always sorted first because order shouldn’t matter in a mapping.
NOTE: Since Python 3.7, dictionaries preserve insertion order. However, this function assumes that the key order doesn’t matter to uniquely identify the dictionary.
    foo = {"key": 3, "key2": 13}
    bar = {"key2": 13, "key": 3}
    hash_mapping(foo) == hash_mapping(bar)
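The order-insensitive behavior can be sketched as follows. This is an illustration under assumptions, not the module's implementation: hash_mapping_sketch and _digest are hypothetical names, and repr() stands in for the recursive hashing of keys and values.

```python
import hashlib


def _digest(text: str) -> str:
    """Hypothetical helper: hex SHA-256 digest of a string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def hash_mapping_sketch(obj: dict) -> str:
    # Hash each key and its value, then sort the pairs so that
    # insertion order has no influence on the final digest.
    pairs = sorted(
        (_digest(repr(key)), _digest(repr(value))) for key, value in obj.items()
    )
    return _digest("|".join(f"{k}={v}" for k, v in pairs))


foo = {"key": 3, "key2": 13}
bar = {"key2": 13, "key": 3}
same = hash_mapping_sketch(foo) == hash_mapping_sketch(bar)
```

Sorting the hashed pairs rather than the raw keys avoids comparing keys of different, possibly unorderable, types.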
- hamilton.caching.fingerprinting.hash_none(obj, *args, **kwargs) -> str
Hash for None is <none>
Primitive type returns a hash and doesn’t have to handle depth.
- hamilton.caching.fingerprinting.hash_numpy_array(obj, *args, depth: int = 0, **kwargs) -> str
Hash a numpy array including shape and dtype metadata.
Without metadata, arrays with the same raw bytes but different shapes or dtypes (e.g., shape=(6,) vs shape=(2,3), or float32 vs int32 with identical bit patterns) would produce identical hashes.
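The collision described above can be reproduced without numpy. The sketch below uses struct to build two buffers with identical bytes, then folds shape and dtype into the digest; hash_buffer_sketch and the metadata encoding are assumptions made for illustration.

```python
import hashlib
import struct


def hash_buffer_sketch(raw: bytes, shape: tuple, dtype: str) -> str:
    hasher = hashlib.sha256()
    # Fold the metadata in first so identical bytes with different
    # interpretations produce different digests.
    hasher.update(f"shape={shape};dtype={dtype};".encode("utf-8"))
    hasher.update(raw)
    return hasher.hexdigest()


# The float32 value 1.0 and the int32 value 1065353216 share one bit pattern.
as_float32 = struct.pack("<f", 1.0)
as_int32 = struct.pack("<i", 1065353216)

float_hash = hash_buffer_sketch(as_float32, (1,), "float32")
int_hash = hash_buffer_sketch(as_int32, (1,), "int32")
```

Here as_float32 == as_int32 at the byte level, yet float_hash and int_hash differ because the dtype is part of the hashed input.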
- hamilton.caching.fingerprinting.hash_pandas_obj(obj, *args, depth: int = 0, **kwargs) -> str
Hash a pandas DataFrame, Series, or Index via vectorized row hashing.
pandas.util.hash_pandas_object computes a uint64 hash per row in a single vectorized pass; we hash that buffer in one shot rather than iterating over rows in Python. Column names and dtypes (the schema) are folded in so that frames carrying identical cell values under different schemas do not collide.
The hash is order-sensitive: reordering rows changes the per-row hash buffer and therefore the fingerprint.
- hamilton.caching.fingerprinting.hash_polars_column(obj, *args, depth: int = 0, **kwargs) -> str
Promote the single Series to a DataFrame and hash it.
- hamilton.caching.fingerprinting.hash_polars_dataframe(obj, *args, depth: int = 0, **kwargs) -> str
Hash a polars DataFrame via vectorized row hashing.
DataFrame.hash_rows computes a per-row hash in a single vectorized pass; we hash that buffer (to_numpy().tobytes()) in one shot rather than iterating element-by-element in Python. Column names and dtypes (the schema) are folded in so frames carrying identical cell values under different schemas do not collide.
- hamilton.caching.fingerprinting.hash_primitive(obj, *args, **kwargs) -> str
Convert the primitive to a string and hash it.
The hash is prefixed with the type name so that values sharing the same string form but differing in type (e.g. 1 vs "1" vs 1.0) do not collide.
Primitive type returns a hash and doesn’t have to handle depth.
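A minimal sketch of the type-name prefix, assuming SHA-256 and a "typename:value" encoding; hash_primitive_sketch is a hypothetical name and the real tag format may differ.

```python
import hashlib


def hash_primitive_sketch(obj) -> str:
    # Prefix the string form with the type name so that values such as
    # the int 1 and the str "1" are hashed from different inputs.
    tagged = f"{type(obj).__name__}:{obj}"
    return hashlib.sha256(tagged.encode("utf-8")).hexdigest()


values = [1, "1", 1.0, "1.0", True]
hashes = [hash_primitive_sketch(v) for v in values]
```

Without the prefix, 1 and "1" would both be hashed from the string "1", and 1.0 and "1.0" from "1.0".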
- hamilton.caching.fingerprinting.hash_repr(obj, *args, **kwargs) -> str
Use the built-in repr() to get a string representation of the object and hash it.
While .__repr__() might not be implemented for all classes, the function repr() will handle it, along with exceptions, to always return a value.
Primitive type returns a hash and doesn’t have to handle depth.
- hamilton.caching.fingerprinting.hash_sequence(obj, *args, depth: int = 0, **kwargs) -> str
Hash each object of the sequence.
Order matters for the hash since order matters in a sequence.
- hamilton.caching.fingerprinting.hash_set(obj, *args, depth: int = 0, **kwargs) -> str
Hash each element of the set, then sort hashes, and create a hash of hashes.
For the same objects in the set, the hashes will be the same.
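The hash-then-sort approach for sets can be sketched like this. hash_set_sketch and _digest are hypothetical names, and repr() stands in for the recursive element hashing.

```python
import hashlib


def _digest(text: str) -> str:
    """Hypothetical helper: hex SHA-256 digest of a string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def hash_set_sketch(obj: set) -> str:
    # Hash every element, sort the hashes (sets have no stable iteration
    # order), then hash the concatenation of the sorted hashes.
    element_hashes = sorted(_digest(repr(element)) for element in obj)
    return _digest(",".join(element_hashes))


first = hash_set_sketch({"a", "b", "c"})
second = hash_set_sketch({"c", "a", "b"})
```

Sorting the hashes instead of the elements works even when the elements themselves cannot be compared with each other.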
- hamilton.caching.fingerprinting.hash_unordered_mapping(obj, *args, depth: int = 0, **kwargs) -> str
When hashing an unordered mapping, the following two dicts have the same hash.
    foo = {"key": 3, "key2": 13}
    bar = {"key2": 13, "key": 3}
    hash_mapping(foo) == hash_mapping(bar)
- hamilton.caching.fingerprinting.hash_value(obj, *args, depth=0, **kwargs) -> str
- hamilton.caching.fingerprinting.hash_value(obj: None, *args, **kwargs) -> str
- hamilton.caching.fingerprinting.hash_value(obj: bool, *args, **kwargs) -> str
- hamilton.caching.fingerprinting.hash_value(obj: float, *args, **kwargs) -> str
- hamilton.caching.fingerprinting.hash_value(obj: int, *args, **kwargs) -> str
- hamilton.caching.fingerprinting.hash_value(obj: str, *args, **kwargs) -> str
- hamilton.caching.fingerprinting.hash_value(obj: bytes, *args, **kwargs) -> str
- hamilton.caching.fingerprinting.hash_value(obj: Sequence, *args, depth: int = 0, **kwargs) -> str
- hamilton.caching.fingerprinting.hash_value(obj: Mapping, *, ignore_order: bool = True, depth: int = 0, **kwargs) -> str
- hamilton.caching.fingerprinting.hash_value(obj: Set, *args, depth: int = 0, **kwargs) -> str
- hamilton.caching.fingerprinting.hash_value(obj: AbstractPandasColumn, *args, depth: int = 0, **kwargs) -> str
- hamilton.caching.fingerprinting.hash_value(obj: AbstractPandasDataFrame, *args, depth: int = 0, **kwargs) -> str
- hamilton.caching.fingerprinting.hash_value(obj: AbstractPolarsDataFrame, *args, depth: int = 0, **kwargs) -> str
- hamilton.caching.fingerprinting.hash_value(obj: AbstractPolarsColumn, *args, depth: int = 0, **kwargs) -> str
- hamilton.caching.fingerprinting.hash_value(obj: AbstractNumpyArray, *args, depth: int = 0, **kwargs) -> str
Fingerprinting strategy that computes a hash of the full Python object.
The default case hashes the __dict__ attribute of the object (recursive).
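The default case can be sketched as a recursive walk over an object's attributes. Everything here is an assumption made for illustration: the names fingerprint, _digest, and Point, the depth limit of 4, and the repr() fallback for objects without __dict__.

```python
import hashlib


def _digest(text: str) -> str:
    """Hypothetical helper: hex SHA-256 digest of a string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def fingerprint(obj, depth: int = 0) -> str:
    if depth > 4:  # hypothetical depth limit
        return "<max_depth_reached>"
    if isinstance(obj, (bool, int, float, str, type(None))):
        return _digest(f"{type(obj).__name__}:{obj}")
    if isinstance(obj, dict):
        # Order-insensitive: sort the hashed (key, value) pairs.
        pairs = sorted(
            (fingerprint(k, depth + 1), fingerprint(v, depth + 1))
            for k, v in obj.items()
        )
        return _digest(repr(pairs))
    if hasattr(obj, "__dict__"):
        # Default case: hash the instance attributes recursively.
        return fingerprint(vars(obj), depth + 1)
    return _digest(repr(obj))


class Point:
    def __init__(self, x: int, y: int) -> None:
        self.x = x
        self.y = y


equal_state = fingerprint(Point(1, 2)) == fingerprint(Point(1, 2))
```

Two distinct instances with the same attribute values produce the same fingerprint, while changing any attribute changes it.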