Introduction to Metadata Ingestion
Please see our Integrations page to browse our ingestion sources and filter on their features.
Integration Methods
DataHub offers three methods for metadata ingestion:
- UI Ingestion: Configure and execute a metadata ingestion pipeline through the UI.
- CLI Ingestion: Configure the ingestion pipeline in YAML and execute it through the CLI.
- SDK-based Ingestion: Use the Python emitter or the Java emitter to control ingestion pipelines programmatically.
Types of Integration
Integrations fall into two types, depending on how metadata reaches DataHub:
Push-based Integration
Push-based integrations allow you to emit metadata directly from your data systems when metadata changes. Examples of push-based integrations include Airflow, Spark, Great Expectations and Protobuf Schemas. This allows you to get low-latency metadata integration from the "active" agents in your data ecosystem.
Pull-based Integration
Pull-based integrations allow you to "crawl" or "ingest" metadata from the data systems by connecting to them and extracting metadata in a batch or incremental-batch manner. Examples of pull-based integrations include BigQuery, Snowflake, Looker, Tableau and many others.
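The difference between the two types can be illustrated with a small, self-contained sketch. It does not use the DataHub SDK: `MetadataStore`, `DataSystem`, and `pull_all` are hypothetical names invented for this example, and the "orders" table is made-up data. The point is only to show when metadata arrives in each model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class MetadataStore:
    """Hypothetical stand-in for a metadata catalog; not a DataHub class."""
    records: Dict[str, dict] = field(default_factory=dict)

    def upsert(self, name: str, metadata: dict) -> None:
        self.records[name] = metadata


@dataclass
class DataSystem:
    """Hypothetical data system that holds tables and their metadata."""
    tables: Dict[str, dict] = field(default_factory=dict)
    listeners: List[Callable[[str, dict], None]] = field(default_factory=list)

    def change_table(self, name: str, metadata: dict) -> None:
        self.tables[name] = metadata
        # Push-based: the system itself emits metadata at the moment of change.
        for listener in self.listeners:
            listener(name, metadata)


def pull_all(system: DataSystem, store: MetadataStore) -> int:
    """Pull-based: connect to the system and extract its metadata in one batch."""
    for name, metadata in system.tables.items():
        store.upsert(name, metadata)
    return len(system.tables)


push_store = MetadataStore()
pull_store = MetadataStore()
system = DataSystem()
system.listeners.append(push_store.upsert)  # register the push integration

system.change_table("orders", {"columns": ["id", "total"]})

# The push store already has the change; the pull store has not crawled yet.
pull_store_before_crawl = dict(pull_store.records)

# Only after a batch run does the pull store catch up.
pulled = pull_all(system, pull_store)
```

In this toy model the push store is updated as soon as the table changes, while the pull store stays empty until the next crawl. That gap is the latency trade-off between the two integration types.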
Core Concepts
The following are the core concepts related to ingestion:
- Sources: Data systems from which metadata is extracted (e.g. BigQuery, MySQL)
- Sinks: Destinations for metadata (e.g. File, DataHub)
- Recipe: The main configuration for ingestion, in the form of a .yaml file
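To make the relationship between these concepts concrete, here is a minimal sketch of a recipe expressed as a Python dictionary: one source and one sink, each with a type and its configuration. The specific type names, config keys, and addresses are illustrative assumptions rather than values taken from this page, and `validate_recipe` is a hypothetical helper written for this example, not part of DataHub.

```python
import json

# Illustrative recipe: the type names, config keys, host, and server URL are
# assumptions for this sketch; check each connector's documentation for the
# options it actually accepts.
recipe = {
    "source": {
        "type": "mysql",
        "config": {"host_port": "localhost:3306", "database": "example_db"},
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},
    },
}


def validate_recipe(recipe: dict) -> list:
    """Hypothetical helper: return a list of structural problems (empty if none)."""
    problems = []
    for section in ("source", "sink"):
        block = recipe.get(section)
        if not isinstance(block, dict):
            problems.append(f"missing '{section}' section")
        elif "type" not in block:
            problems.append(f"'{section}' has no 'type'")
    return problems


problems = validate_recipe(recipe)

# Print the recipe so its nesting is visible; a .yaml recipe file would carry
# the same source/sink structure.
print(json.dumps(recipe, indent=2))
```

The check here only covers the overall shape (a source and a sink, each naming a type). Connector-specific options are validated by the ingestion framework itself when the pipeline runs.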
For more advanced guides, please refer to the following:
