What types of reference data have you used, and how did you incorporate them into a star schema? How did you handle deduplication and curating customer data?

Question

Assisting AI · Accepted Answer

I have worked with three broad types of reference data: master data (customer, product, and location masters), lookup tables (industry codes and risk categories), and static lists (country codes and currency symbols). In a star schema, these reference tables become dimension tables that fact tables reference through surrogate keys. The Customer dimension, for example, holds a unique surrogate key, the natural key from the source system, and descriptive attributes pulled from the customer master. Where reporting performance required it, the ETL process flattened the data further by joining dimension attributes onto the fact rows to produce a single wide reporting table.
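As a rough illustration of the dimension and fact relationship described above, here is a minimal sketch using Python's built-in `sqlite3` module. The table names (`dim_customer`, `fact_sales`), the columns, and the helper functions are hypothetical and chosen only for this example; they are not taken from any specific project.

```python
import sqlite3


def build_star_schema(conn):
    """Create one dimension and one fact table (hypothetical names)."""
    conn.executescript("""
        CREATE TABLE dim_customer (
            customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
            customer_id  TEXT NOT NULL UNIQUE,               -- natural key from the customer master
            name         TEXT,
            country_code TEXT                                -- attribute from a static reference list
        );
        CREATE TABLE fact_sales (
            sale_id      INTEGER PRIMARY KEY,
            customer_key INTEGER NOT NULL REFERENCES dim_customer(customer_key),
            amount       REAL NOT NULL
        );
    """)


def load_customer(conn, customer_id, name, country_code):
    """Insert the customer if unseen, then return its surrogate key."""
    conn.execute(
        "INSERT OR IGNORE INTO dim_customer (customer_id, name, country_code) "
        "VALUES (?, ?, ?)",
        (customer_id, name, country_code),
    )
    row = conn.execute(
        "SELECT customer_key FROM dim_customer WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return row[0]


def load_sale(conn, customer_id, amount):
    """Resolve the natural key to a surrogate key before writing the fact row."""
    row = conn.execute(
        "SELECT customer_key FROM dim_customer WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    if row is None:
        raise KeyError(f"unknown customer: {customer_id}")
    conn.execute(
        "INSERT INTO fact_sales (customer_key, amount) VALUES (?, ?)",
        (row[0], amount),
    )


def sales_by_country(conn):
    """Typical star-schema query: join the fact to the dimension and aggregate."""
    return conn.execute("""
        SELECT d.country_code, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_customer AS d ON d.customer_key = f.customer_key
        GROUP BY d.country_code
        ORDER BY d.country_code
    """).fetchall()
```

The fact table stores only the surrogate key, so a change to the natural key or to descriptive attributes in the source system does not require rewriting fact rows.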

Deduplication was handled in two stages. First, during extraction, I computed a hash over the key attributes of each record and treated records with matching hashes as duplicates. Second, in the staging area, I applied a deduplication window: records were grouped by natural key, and only the most recent one, ordered by timestamp or version number, was retained. For curating customer data, I implemented a data quality framework that flagged missing or inconsistent values, applied business rules to correct them, and logged every change for audit. Together, these steps kept the reference data clean, consistent, and aligned with the star schema design.
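The two deduplication stages can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the field names (`customer_id`, `name`, `email`, `updated_at`), the normalization (trim and lowercase), and the function names are hypothetical, and timestamps are assumed to be ISO 8601 strings.

```python
import hashlib
from datetime import datetime


def attribute_hash(record, fields):
    """Hash normalized attribute values so trivially different copies collide."""
    normalized = "|".join(str(record.get(f, "")).strip().lower() for f in fields)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_repeats(records, fields):
    """Stage 1: keep the first record seen for each attribute hash."""
    seen = set()
    unique = []
    for record in records:
        digest = attribute_hash(record, fields)
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique


def keep_latest(records, natural_key, timestamp_field):
    """Stage 2: per natural key, retain only the most recent record."""
    latest = {}
    for record in records:
        key = record[natural_key]
        ts = datetime.fromisoformat(record[timestamp_field])
        current = latest.get(key)
        if current is None or ts > datetime.fromisoformat(current[timestamp_field]):
            latest[key] = record
    return list(latest.values())
```

In this sketch, stage 1 removes records whose compared attributes are identical after normalization, while stage 2 resolves the remaining versions of the same customer by keeping the newest one. In a SQL staging area, stage 2 would more commonly be written with a `ROW_NUMBER()` window function partitioned by the natural key.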
