Great Expectations
This guide helps you set up and configure DataHubValidationAction in Great Expectations to send assertions (expectations) and their results to DataHub using DataHub's Python REST emitter.
Capabilities
DataHubValidationAction pushes assertion metadata to DataHub. This includes:
- Assertion Details: Details of the assertions (i.e. expectations) set on a Dataset (Table).
- Assertion Results: Evaluation results for an assertion, tracked over time.
This integration supports v3 API datasources using SqlAlchemyExecutionEngine.
Limitations
This integration does not support:
- v2 Datasources such as SqlAlchemyDataset
- v3 Datasources using an execution engine other than SqlAlchemyExecutionEngine (Spark, Pandas)
- Cross-dataset expectations (those involving more than one table)
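As a quick pre-check against the limitations above, the following minimal sketch flags datasources that this integration would not cover. It assumes the datasource configuration has already been loaded into a plain dictionary shaped like the datasources section of a v3 great_expectations.yml, with an execution_engine.class_name entry per datasource. The helper name unsupported_datasources and the sample datasource names are hypothetical and are not part of Great Expectations or the DataHub plugin.

```python
from typing import Any, Dict, List

# The only execution engine this integration supports, per the text above.
SUPPORTED_ENGINE = "SqlAlchemyExecutionEngine"


def unsupported_datasources(datasources: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return names of datasources whose execution engine is not supported.

    Hypothetical helper: `datasources` is assumed to mirror the `datasources`
    section of a v3 great_expectations.yml loaded as a dict.
    """
    flagged = []
    for name, config in datasources.items():
        # Datasources without an execution_engine entry (e.g. v2 style) yield None.
        engine = config.get("execution_engine", {}).get("class_name")
        if engine != SUPPORTED_ENGINE:
            flagged.append(name)
    return flagged


# Sample configuration with made-up datasource names.
example = {
    "warehouse": {"execution_engine": {"class_name": "SqlAlchemyExecutionEngine"}},
    "local_files": {"execution_engine": {"class_name": "PandasExecutionEngine"}},
}
print(unsupported_datasources(example))
```

Datasources that lack an execution_engine entry are flagged as well, which matches the statement that v2 datasources are not supported.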
Setting up
- Install the required dependency in your Great Expectations environment:
    pip install 'acryl-datahub-gx-plugin'
- To add DataHubValidationAction to a Great Expectations Checkpoint, add the following configuration to the action_list of your Checkpoint. For more details on setting action_list, see Checkpoints and Actions.
    action_list:
      - name: datahub_action
        action:
          module_name: datahub_gx_plugin.action
          class_name: DataHubValidationAction
          server_url: http://localhost:8080 # DataHub server URL
Configuration options:
- server_url (required): URL of the DataHub GMS endpoint.
- env (optional, defaults to "PROD"): Environment to use in the namespace when constructing dataset URNs.
- exclude_dbname (optional): Exclude the dbname / catalog when constructing dataset URNs. (Highly applicable to Trino / Presto, where we want to omit the catalog, e.g. hive.)
- platform_alias (optional): Platform alias to use when constructing dataset URNs, e.g. the main data platform is presto-on-hive but trino is used to run the test.
- platform_instance_map (optional): Platform instance mapping to use when constructing dataset URNs. Maps the GX 'data source' name to a platform instance on DataHub, e.g. platform_instance_map: { "datasource_name": "warehouse" }
- graceful_exceptions (defaults to true): If set to true, most runtime errors in the lineage backend will be suppressed and will not cause the overall checkpoint to fail. Note that configuration issues will still throw exceptions.
- token (optional): Bearer token used for authentication.
- timeout_sec (optional): Per-HTTP request timeout.
- retry_status_codes (optional): Also retry HTTP requests on these status codes.
- retry_max_times (optional): Maximum number of times to retry if an HTTP request fails. The delay between retries increases exponentially.
- extra_headers (optional): Extra headers to add to the DataHub request.
- parse_table_names_from_sql (defaults to false): The integration can use an SQL parser to try to parse the datasets being asserted. This parsing is disabled by default, but can be enabled by setting parse_table_names_from_sql: True. The parser is based on the sqllineage package.
- convert_urns_to_lowercase (optional): Whether to convert dataset URNs to lowercase.
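To make the URN-related options more concrete, here is a simplified illustration of how env, exclude_dbname, platform_alias, platform_instance_map and convert_urns_to_lowercase could each shape a dataset URN. It is not the plugin's code: the function sketch_dataset_urn, its parameters platform, qualified_name and datasource_name, and the order in which the options are applied are all assumptions made for illustration. Only the general DataHub dataset URN layout, urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>), is taken as given.

```python
from typing import Dict, Optional


def sketch_dataset_urn(
    platform: str,
    qualified_name: str,
    datasource_name: str,
    env: str = "PROD",
    exclude_dbname: bool = False,
    platform_alias: Optional[str] = None,
    platform_instance_map: Optional[Dict[str, str]] = None,
    convert_urns_to_lowercase: bool = False,
) -> str:
    """Illustrative only: approximates how the options affect a dataset URN."""
    name = qualified_name
    if exclude_dbname and name.count(".") >= 2:
        # Drop the leading dbname / catalog segment: "hive.sales.orders" -> "sales.orders".
        name = name.split(".", 1)[1]
    if convert_urns_to_lowercase:
        name = name.lower()
    # platform_alias, when given, replaces the platform used to run the test.
    effective_platform = platform_alias or platform
    # platform_instance_map is keyed by the GX datasource name.
    instance = (platform_instance_map or {}).get(datasource_name)
    if instance:
        name = f"{instance}.{name}"
    return f"urn:li:dataset:(urn:li:dataPlatform:{effective_platform},{name},{env})"


# Defaults: nothing is rewritten.
print(sketch_dataset_urn("trino", "hive.Sales.Orders", "my_ds"))

# Several options combined (sample values are made up).
print(
    sketch_dataset_urn(
        "trino",
        "hive.Sales.Orders",
        "my_ds",
        env="DEV",
        exclude_dbname=True,
        platform_alias="presto-on-hive",
        platform_instance_map={"my_ds": "warehouse"},
        convert_urns_to_lowercase=True,
    )
)
```

The point of the sketch is that the URN the action emits must match the URN produced when the dataset itself was ingested; otherwise the assertions attach to a different dataset entity. Comparing the sketch's output with an existing dataset URN in DataHub can help decide which options to set.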
 
Debugging
Set the environment variable DATAHUB_DEBUG (default false) to true to enable debug logging for DataHubValidationAction.
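When checkpoints are launched from a Python script, the variable can be set for the child process only, leaving the surrounding shell untouched. The sketch below is one way to do that; the helper name debug_env is hypothetical, and the commented-out command and checkpoint name are placeholders to adapt to however you normally run checkpoints.

```python
import os
from typing import Dict, Optional


def debug_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with DATAHUB_DEBUG switched on.

    Hypothetical helper: copies os.environ (or `base`) so the current
    process environment is left unchanged.
    """
    env = dict(os.environ if base is None else base)
    env["DATAHUB_DEBUG"] = "true"
    return env


# Example use (placeholder command, not executed here):
# import subprocess
# subprocess.run(["great_expectations", "checkpoint", "run", "my_checkpoint"], env=debug_env())
```

Passing the environment explicitly keeps debug logging scoped to a single run.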
Learn more
To see the Great Expectations integration in action, check out this demo from the Feb 2022 townhall.
