Data Discovery & Attribute-Based Access
Make data findable with descriptions and tags, then control access with attribute-based policies (ABAC) — a modern approach to governance at scale.
Why data discovery matters
Imagine a library with no card catalogue.
You have 10,000 books (tables) across 50 shelves (schemas). Without a catalogue telling you what each book contains, you’d spend hours wandering the shelves. That’s what a lakehouse without descriptions feels like.
Data discovery means making every table and column self-describing — so analysts can find what they need without asking the data engineer. Tags take this further by categorising data (e.g., “contains PII,” “finance team only”) so security policies can apply automatically.
Descriptions for data discovery
Table and column descriptions
-- Add a table description
COMMENT ON TABLE prod_sales.curated.daily_revenue
IS 'Daily revenue aggregated by region and product category. Source: POS system. Refreshed nightly at 3 AM NZST. Owner: data-engineering team.';
-- Add column descriptions
COMMENT ON COLUMN prod_sales.curated.daily_revenue.region
IS 'Business region code: APAC, EMEA, Americas. Maps to dim_region.region_code.';
COMMENT ON COLUMN prod_sales.curated.daily_revenue.revenue
IS 'Total revenue in USD. Excludes taxes and returns. Decimal(12,2).';
Good descriptions answer: What is this data? Where does it come from? How often is it refreshed? Who owns it?
Ravi documents every table at DataPulse Analytics so new team members can self-serve. When they search in Catalog Explorer, descriptions appear in search results.
Best practices for descriptions
| Element | What to Include | Example |
|---|---|---|
| Table description | Purpose, source system, refresh schedule, owner | "Customer dim from CRM, daily refresh, owned by data-eng" |
| Column description | Business meaning, allowed values, unit, foreign key | "ISO currency code (USD, EUR, GBP). FK to dim_currency." |
| Schema description | What data lives here, who uses it | "Gold layer: business aggregates for BI team consumption" |
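The practices in the table can be applied and audited with plain SQL. The following is a minimal sketch that assumes the `prod_sales` catalog from the earlier examples; the schema comment text is illustrative, and the audit query relies on the `comment` column exposed by the catalog's `information_schema.tables` view.

```sql
-- Describe the schema itself (comment wording is illustrative)
COMMENT ON SCHEMA prod_sales.curated
IS 'Curated business aggregates for BI consumption. Owner: data-engineering team.';

-- Audit: list tables in the catalog that still have no description
SELECT table_schema, table_name
FROM prod_sales.information_schema.tables
WHERE table_schema <> 'information_schema'
  AND (comment IS NULL OR trim(comment) = '')
ORDER BY table_schema, table_name;
```

Running the audit query on a schedule gives a simple to-do list of undocumented tables.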
Tags in Unity Catalog
Tags are key-value pairs attached to Unity Catalog objects. They enable classification and policy enforcement:
-- Tag a table as containing PII
ALTER TABLE prod_sales.curated.customers
SET TAGS ('data_classification' = 'pii', 'retention_years' = '7');
-- Tag a column as containing sensitive data
ALTER TABLE prod_sales.curated.customers
ALTER COLUMN email SET TAGS ('sensitivity' = 'high');
-- Tag a schema
ALTER SCHEMA prod_sales.raw
SET TAGS ('environment' = 'production', 'team' = 'data-engineering');
-- View tags
SELECT * FROM system.information_schema.table_tags
WHERE catalog_name = 'prod_sales' AND schema_name = 'curated';
Mei Lin uses tags at Freshmart to classify every table by sensitivity level and regulatory domain.
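Column-level tags can be inspected in the same way as table-level tags, and tags that no longer apply can be removed. This sketch assumes the `prod_sales` catalog and the tag keys used in the examples above.

```sql
-- Columns tagged as highly sensitive anywhere in the prod_sales catalog
SELECT schema_name, table_name, column_name, tag_value
FROM system.information_schema.column_tags
WHERE catalog_name = 'prod_sales'
  AND tag_name = 'sensitivity'
  AND tag_value = 'high';

-- Remove a table tag that no longer applies
ALTER TABLE prod_sales.curated.customers
UNSET TAGS ('retention_years');
```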
Attribute-Based Access Control (ABAC)
ABAC decides access from attributes (tags) attached to the data rather than from a list of per-object grants. Instead of granting access table by table, you define policies based on tags:
| Traditional (Per-Object) | ABAC (Tag-Based) |
|---|---|
| GRANT SELECT ON table_a TO analysts | IF tag data_classification = 'public' THEN GRANT SELECT TO analysts |
| GRANT SELECT ON table_b TO analysts | New public tables automatically accessible |
| Must update every new table | Policy applies to ALL tagged tables |
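For comparison, the per-object column of the table looks roughly like this when written as Unity Catalog GRANT statements. This is a sketch only: the `analysts` group is taken from the table, and the table names reuse objects from the earlier examples.

```sql
-- The group needs to reach the catalog and schema before it can read any table
GRANT USE CATALOG ON CATALOG prod_sales TO `analysts`;
GRANT USE SCHEMA ON SCHEMA prod_sales.curated TO `analysts`;

-- One GRANT per table; each new table needs another statement
GRANT SELECT ON TABLE prod_sales.curated.daily_revenue TO `analysts`;
GRANT SELECT ON TABLE prod_sales.curated.customers TO `analysts`;
```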
How ABAC works in Unity Catalog
- Tag your objects — classify tables, columns, schemas with meaningful tags
- Create tag-based policies — define rules that reference tags
- Automatic enforcement — any object matching the tag gets the policy applied
-- Example: tag-based policy concept
-- "Any table tagged data_classification=pii requires the pii-readers group"
-- "Any column tagged sensitivity=high must be masked for non-admin users"
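The second concept above (mask columns tagged `sensitivity=high` for non-admin users) might be expressed roughly as follows. Treat this as an assumption-heavy sketch rather than confirmed syntax: ABAC policies in Unity Catalog have been a preview feature, so the `CREATE POLICY` clauses and the `hasTagValue` function shown here should be checked against the current Databricks reference. The function name `mask_high_sensitivity`, the policy name, and the `admins` group are hypothetical.

```sql
-- Hypothetical masking function: replaces any string value with a fixed placeholder
CREATE OR REPLACE FUNCTION prod_sales.curated.mask_high_sensitivity(val STRING)
RETURNS STRING
RETURN '***REDACTED***';

-- Assumed preview syntax: mask every column in the schema tagged sensitivity=high
-- for all account users except members of the (hypothetical) admins group
CREATE POLICY mask_high_sensitivity_columns
ON SCHEMA prod_sales.curated
COMMENT 'Mask columns tagged sensitivity=high for non-admin users'
COLUMN MASK prod_sales.curated.mask_high_sensitivity
TO `account users`
EXCEPT `admins`
FOR TABLES
MATCH COLUMNS hasTagValue('sensitivity', 'high') AS sensitive_col
ON COLUMN sensitive_col;
```

Because the policy is attached to the schema and matches on the tag, a newly created table whose columns carry the same tag would be covered without a new statement, which is the scaling property the comparison table describes.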
Exam tip: ABAC vs. traditional GRANT
The exam may present scenarios where you choose between:
- Traditional GRANTs — best for small-scale, specific table permissions
- ABAC with tags — best for large-scale governance where you have hundreds of tables and need consistent policy enforcement
If the question mentions “at scale,” “automatically apply to new tables,” or “policy-driven governance” — ABAC is the answer.
Knowledge check
Mei Lin manages 500+ tables across Freshmart's lakehouse. She needs to ensure that any table containing customer PII is automatically restricted to the 'pii-readers' group — including new tables added in the future. Which approach scales best?
Ravi joins DataPulse Analytics and needs to find tables related to customer segmentation. He searches Catalog Explorer but finds no useful results — table names like 'tbl_cs_v3' and 'staging_202603' are meaningless. What should the data engineering team do?
Next up: Row Filters, Column Masks & Retention — dynamic data masking and data retention policies.