Employee ‘Active’ Status Handling in the Data Pipeline

Data warehouse options when source systems do not contain complete user information.

Background and Context

The organization’s data warehouse (DW) uses Azure Data Factory (ADF) pipelines and T-SQL stored procedures for ETL. A complication has arisen because certain employees (leadership, senior leadership, and operations staff) are not present in the main transactional system (to save software licensing costs). Instead, a separate PowerApp system contains the master list of all employees and their full employment history (start/end dates, rehires, etc.). This situation means the DW’s source data for employees is incomplete, and active status (whether an employee is currently employed) isn’t straightforward. Handling this correctly is critical for:

  • Accurate notifications – ensuring automated notifications or workflows do not contact former employees.
  • Complete reporting – including leadership/operations staff in headcounts or org charts even though they aren’t in the transactional system.
  • User trust – avoiding confusion if staff see data that incorrectly omits them or marks them inactive when they are active.

Proposed Solution: The user suggests integrating the PowerApp’s employee data early in the ETL process to maintain an “Active” flag (or similar indicator) for each employee as part of the DW data model. In other words, determine and set employee activeness in the staging/dimension loading phase. This would unify data from both sources so the DW has a complete list of employees with correct active/inactive status. The idea is to centralize this logic in the ETL pipeline (with daily batch updates, since real-time updates aren’t required).

Alternative Views: Other team members argue for minimal or delayed handling of this logic: for example, omitting those non-transactional employees entirely (or leaving their records “disabled” in the DW), or deferring the active/inactive determination until later – either in a final ETL step, or even within the BI/report layer (e.g. applying filters in Power BI or report queries rather than in the DW tables).

The task is to evaluate these approaches and recommend a solution that is maintainable, scalable, and transparent to end-users. Below, we examine each approach, discuss pros and cons (especially in the context of Azure Data Factory + T-SQL workflows), consider hybrid patterns used elsewhere, and then provide a recommendation.

Importance of a Unified Employee Status Dimension

Before comparing approaches, it’s important to recognize why handling “active” status in a unified way is considered a best practice. In modern data architectures, master data (like the list of employees) is often consolidated into a single source of truth to avoid silos and inconsistencies. If multiple systems each have partial lists of employees, merging them ensures no gaps. For example, an enterprise Master Data Management approach might create a “superset” of all employees from all systems, then feed that into the data warehouse. This ensures the DW isn’t missing records simply because one source excluded them. Conversely, failing to unify such data can lead to confusion, low data quality, and eroded user trust, as users encounter inconsistent lists or have to reconcile multiple sources. In short, having a complete Employee dimension with accurate status is crucial for consistency.

Another principle in BI/DW design is to centralize business logic in the data platform rather than scattering it across reports. Moving key logic “downward” into the data warehouse (instead of handling it in each report) makes data more consistent and accessible to all users. If many reports or processes need to exclude former employees, implementing that rule once in the DW yields uniform results everywhere. It also avoids a situation where each report writer must remember to apply the rule, which is error-prone. As one expert puts it, storing business rules in the warehouse means if a rule (e.g. what counts as an active employee) changes, you update it in one place; otherwise you’d have to fix it in “tens of reports” or various tools, and risk inconsistencies. Pre-calculating such status flags in the DW can also improve performance and simplify report creation. These considerations set the stage for evaluating the specific options.

Option 1: Maintain Active Status Early in ETL (User’s Approach)

Description: Integrate the PowerApp (HR data) feed at the beginning of the ETL pipeline to create or update a comprehensive Employee dimension table (or similar structure) that covers all employees. During this stage, determine each employee’s active status. This could involve joining the transactional system’s employee records with the PowerApp’s employment history data. The logic would likely mark an employee as “Active” if they are currently employed (e.g. today’s date is within their employment period, or no end-date exists) and “Inactive” if they have a termination date in the past. This dimension would also include those leaders/ops staff who aren’t in the transactional source at all (since the PowerApp provides their records). If using a slowly changing dimension (SCD) approach, the ETL can maintain history of status changes (with effective start/end dates for each employee’s stint) and a simple current flag – a common technique where each dimension row has a flag indicating if it’s the current active record. In practice, the ADF pipeline might load the latest PowerApp data (e.g. via a Dataverse connector or exported file) and the transactional user data, then a stored procedure (or ADF data flow) merges them. Each run would update statuses (and add new hires or mark recent leavers as inactive).
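
To make the derivation concrete, the core of the status logic is a simple date comparison that can run once both sources are staged. A minimal T-SQL sketch, assuming hypothetical staging names (stg.Employee_Combined, StartDate, EndDate) rather than the actual schema:

    -- Sketch: derive the Active flag from employment dates after both sources
    -- have been merged into staging. All names are illustrative assumptions.
    UPDATE stg.Employee_Combined
    SET IsActive = CASE
                       WHEN StartDate <= CAST(GETDATE() AS date)
                            AND (EndDate IS NULL OR EndDate >= CAST(GETDATE() AS date))
                       THEN 'Y'
                       ELSE 'N'
                   END;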

Pros:

  • Complete, single source of truth: The DW will have one Employee table representing all staff with their correct active/inactive designation. This avoids missing people. Leadership and other excluded groups will appear alongside everyone else for reporting, which eliminates blind spots. According to enterprise data management practices, combining multiple source systems’ records into one master set ensures no gaps or duplicates – exactly what this approach achieves.
  • Consistent business rule enforcement: By calculating “active” centrally, every report or downstream process uses the same definition. This consistency is a hallmark of good DW design – it prevents different interpretations in different reports. It also means no report developer has to reinvent the logic or remember to filter out former employees manually. The rule is effectively baked into the data model. This reduces the risk of error (e.g., accidentally emailing a departed employee) and boosts confidence that “active” means the same thing everywhere. In the words of one data architect, enforcing such rules in the ETL is beneficial because it guarantees a single process and avoids performance hits from doing it on-the-fly in queries.
  • Simplified reporting and user transparency: Reports can directly use the Employee dimension’s Active flag (or filter to Active = Yes) without extra data blending. For end-users, it’s clear who is active – the warehouse data itself carries that info. There’s less chance of confusion like “Why is my name not in this list?” or “Why is this person showing up if they left?” since the data has already been curated. This aligns with the idea of moving logic into the data platform so reports are simpler and consistently reflect the underlying data. Users see a clean dataset rather than needing hidden filters.
  • Avoiding unwanted notifications: Any automated process (for example, if the DW drives email alerts or reports that trigger notifications) can safely use the central Active flag to exclude former employees from outreach. By updating this early each day, the risk of notifying someone who left yesterday is very low.
  • Historical tracking: Because the PowerApp provides full employment history, the ETL can preserve that in the DW. This allows analyses of turnover or tenure if needed. Even if only a current flag is stored, one can retain termination dates, etc. The key is that inactive employees aren’t deleted; they’re just marked inactive. This is a best practice in SCD Type-2 dimensions: keep past records but mark them as not current. It means the warehouse can answer questions like “how many employees did we have last month” accurately, and also ensures re-hired individuals can be handled (their record would be reactivated or a new record added with active status).
  • Performance is manageable: With ~1000 employees and only 5–10 changes per week, maintaining this dimension is lightweight. A daily batch MERGE or update is trivial in cost. There’s no need for real-time streaming given the low churn. Azure Data Factory and SQL can easily handle a small dimension table update in seconds. Even if the organization grows, this approach scales well because it’s just one dimension table (often dimension tables are small and fast to update).
  • Maintainability: The logic for activeness resides in one place (likely one stored procedure or data flow). This is easier to test and adjust if needed. If the definition of “active” changes (say, including a grace period after end date, etc.), you update the ETL logic once. All downstream outputs then reflect the new rule on the next run. This centralized maintainability is far preferable to modifying multiple reports or queries. It’s noted in expert discussions that central DW logic means a rule change only needs one update, versus “replicating all the logic” across many tools if it were done downstream. In addition, because ETL is code (SQL), it can be version-controlled and documented, further improving maintainability of the business rule.

Cons:

  • Initial development effort: This approach requires setting up an extra data integration – pulling data from the PowerApp (if not already done) and writing the merge logic. There’s some upfront work to create the consolidated Employee dimension. However, this is a one-time cost and is relatively straightforward given the small data size. Azure Data Factory supports a variety of connectors (including Dataverse/OData for Power Apps) to facilitate this extraction, and joining data in a pipeline or stored proc is standard.
  • ETL complexity and potential data dependencies: Merging two sources means the pipeline must orchestrate them properly. For example, the PowerApp data should be available and loaded before updating the dimension. If one source is delayed or fails, the process needs error handling (e.g., if the PowerApp feed fails one day, the pipeline should not wrongly mark everyone inactive; see the guard sketch after this list). These are manageable with ADF (using scheduling and dependency checks), but it’s an added consideration.
  • Slight latency for updates: Since the active status is updated daily, there is a theoretical window where an employee who leaves at noon won’t be marked inactive in the DW until the next day’s ETL. In that interim, a notification could slip through. However, the user has indicated daily updates are acceptable, and churn is very low. In practice, a one-day lag is standard in warehouses. If same-day accuracy became crucial, one could always run an extra pipeline or handle exceptions (this doesn’t outweigh the benefits of central logic).
  • Changing logic requires re-running ETL for historical correction: If the definition of “active” were to be modified retroactively, the DW might need to recompute statuses on historical records. For instance, if they decided that employees on certain leave should be considered inactive for notifications, one would update the logic and potentially adjust existing records. This is not difficult (one can update the flag based on stored dates), but it’s something to plan for. By contrast, if the raw data with dates is stored, a recalculation could be done on the fly. That said, because the warehouse will still have the raw fields (start/end dates) in this dimension, it’s feasible to recalc within SQL as well – so this is a minor concern.
  • Inclusion of non-transactional staff in the DW: While it’s a pro for completeness, one might consider if there’s any downside. These leadership/ops employees might not have transactions or fact data in the warehouse. Their inclusion in the dimension is generally harmless (they simply won’t join to fact records, or will show zero activity). One must ensure any fact table foreign keys can handle a reference to a dimension entry with no corresponding facts – this is normally fine. It could slightly increase dimension size, but with such a small number that’s negligible. Some ETL designers prefer not to mix “active system users” with “non-users,” but as long as an attribute flags which source they came from or their role, it’s clear. In fact, having them in the dimension with a marker (e.g. IsInMainSystem = No for those only in PowerApp) could further help transparency.
  • Security/licensing considerations: Since these individuals are excluded to save license costs, we must ensure pulling their data from PowerApp does not violate anything. Typically, reading their basic info for reporting is fine, but it’s worth noting if any PII or privacy concerns need to be addressed when consolidating HR data.
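
Regarding the error-handling point above: one inexpensive safeguard is to validate the HR extract before applying any status changes, so that a failed or empty PowerApp feed cannot flip the whole dimension to inactive. A minimal sketch for the top of the load procedure, with an illustrative staging table name and threshold:

    -- Guard sketch: skip the status update if the HR staging extract looks
    -- empty or implausibly small. Table name and threshold are assumptions.
    DECLARE @HrRowCount int = (SELECT COUNT(*) FROM stg.HR_Employees);

    IF @HrRowCount < 500  -- well under the expected ~1000 employees
    BEGIN
        RAISERROR('HR staging count %d below threshold; aborting status update.', 16, 1, @HrRowCount);
        RETURN;  -- assumes this runs inside the load stored procedure
    END;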

On balance, the cons are relatively minor and mostly about implementation detail. The approach aligns with known best practices for data warehousing: unifying master data and pre-calculating important flags for consistent use. Many organizations follow this pattern, maintaining a central Employee dimension with an “Active” flag (or effective dates) as part of their warehouse model.

Option 2: Omit or Disable Certain Employees in the DW

Description: This approach essentially means minimal handling in ETL – the DW would mirror the transactional system’s data as-is and do nothing special for excluded groups. If leadership and operations staff aren’t in the source system, they simply won’t exist in the warehouse’s tables. If an employee is removed or marked inactive in the source, the warehouse would reflect that (e.g. the person might disappear from the source extract or be flagged as disabled, depending on how the source handles deletions). In short, the DW would contain only the subset of employees that the transactional system knows about, and their status would be whatever the source provides (which might be “active/disabled” fields, or just their presence/absence). The PowerApp data might be ignored in the ETL, or possibly loaded to a separate table but not integrated. The suggestion to “leave them disabled” implies perhaps still loading those records but marking them as inactive or not using them in reporting. But since these users never had accounts in the source, “disabled” really translates to “not present at all.” So effectively, this approach ignores the external employee list and trusts the main system’s user list.

Pros:

  • Simplicity of ETL: The pipeline is very straightforward – it just copies the transactional user table into the DW (or uses whatever mechanism was already in place). There’s no need to integrate multiple sources or perform additional transformations. Fewer moving parts means less potential for ETL failure. For a small team or an urgent timeline, keeping ETL simple can be attractive.
  • Matches source system exactly: Some purists prefer the DW to reflect the operational system truth without alteration, on the principle that DW shouldn’t guess or add to source data. In this view, if those leaders aren’t in the source, the DW not showing them is “accurate” with respect to that source. Also, if an employee is deactivated in the source, the DW would show them as such (or drop them if the source did). This one-to-one replication can make reconciliation easier (no extra records that weren’t in the source).
  • No duplicate records to manage: By not pulling in external data, you avoid any risk of conflicting info (say the two systems have a person with slightly different names or IDs – you bypass having to resolve that).
  • Security and policy adherence: If for some reason the organization didn’t want certain personal data in the DW, not integrating the HR data would keep it out. (However, in this case, the user explicitly wants it in for reporting, so this is likely not a benefit they care about.)

Cons:

  • Incomplete employee picture: The most glaring drawback is that the DW will not include many active employees (those excluded from the source). This could make various reports inaccurate or misleading. For example, a headcount report from the DW would undercount staff because leadership/ops aren’t included at all. Any reports or metrics that should encompass the whole organization would have to go outside the DW or be manually adjusted. This undermines the DW’s role as a one-stop repository for analysis. Essentially, you’d be perpetuating a silo: the transactional system’s view of employees remains separate from the HR view, and the DW would only cover the former. Industry guidance warns that without unifying such master data, organizations face “data silos between departments and lose the opportunity to have a shared big-picture view”. In our case, leaving out whole categories of employees creates exactly such a silo.
  • Confusion and user concern: If leadership or ops staff use the warehouse or reports (even indirectly via dashboards), they may not find themselves or their teams in the data. This could cause confusion or erode trust. For instance, a manager from Operations might run a report of all employees in their division and see zero, because none are in the source system feeding the DW – they’d rightly question the data’s accuracy. Similarly, if an executive’s data isn’t present for a KPI, it could raise concerns about data completeness. The user specifically noted the desire to prevent confusion or concern from staff not marked as active in the warehouse, which this approach would fail to do.
  • Risk of incorrect notifications or processing: If the DW (or processes using it) doesn’t know someone has left because they were never marked inactive (just deleted from source), it might inadvertently still include them somewhere. Consider a scenario: an employee who left is not in the current source extract, so the DW might remove them – but what if some fact records or other systems still reference that person’s ID? If the DW simply drops the dimension record, any fact data on that person could become orphaned or fall into an “unknown” category. Or, if the DW kept historical employees but doesn’t flag them clearly, a report might list them as if they were active. Essentially, without an explicit flag or presence of HR data, it may be ambiguous who is a current employee. Carrying a “disabled” status over from the source might serve as a flag, but for records that were never in the source to begin with, there’s nothing at all to label. This ambiguity can lead to mistakes (like sending a report meant for all active managers, but your list of managers is incomplete or outdated).
  • Inconsistent logic later: Even though this approach avoids complexity in ETL, the need for excluding former employees doesn’t go away – it just shifts to the reporting layer implicitly. For instance, if a Power BI report is built on this data, the developer might manually import the PowerApp list anyway to fill the gap, or at minimum they need to be aware that “DW data lacks these people.” Every report that needs full staff info or needs to ensure no former employees are included will have to handle it individually (likely by cross-referencing another source or adding filters like “where termination_date is null” if such a field exists only in the HR system). This is essentially pushing the logic to the end-users or report developers, which is exactly what the user’s proposal wanted to avoid. It’s well-understood that having to replicate business rules in each report leads to discrepancies and maintenance headaches.
  • Lost opportunity for enrichment: The PowerApp data likely contains useful attributes (hire date, termination date, maybe reason for leaving or position history) that could enrich the DW’s Employee dimension. By not integrating it, the DW remains less informative. This could limit future analytics (e.g., analyzing attrition or tenure becomes impossible within the warehouse because those fields aren’t there). Given that adding this data is feasible and low-cost (only ~1000 rows), omitting it is a loss in analytical capability.
  • No central flag for activeness: Since this approach doesn’t explicitly set an “Active” flag early, at best you rely on whatever the source had. If the transactional system simply doesn’t list ex-employees, then there’s no flag at all – absence implies inactive. That’s not very transparent. If the source does mark inactive users, the DW can carry that, but remember, that source doesn’t even have the leadership folks to mark in the first place. So you end up with an “Active” field that says “No” for people who left and were in the system, but no entries at all for people outside the system. This patchwork could confuse users and requires them to understand the data’s origin (“If you’re not in this table, it means you’re a staff member not licensed in the system”). That’s not obvious or user-friendly.

In summary, this minimal approach might be easier technically, but it sacrifices data completeness and consistency. It effectively would make the DW blind to a subset of the workforce and rely on external processes to handle what is essentially a core piece of business logic (who is an active employee). Given the user’s goals of transparency and accurate reporting, this option is generally not recommended. It fails the “user-transparent” test because users would need tribal knowledge to interpret the data (e.g., knowing that certain departments’ staff are entirely missing from the DW). It also punts the problem to a later stage, which typically increases overall effort and risk.

Option 3: Defer Activeness Logic to Downstream Reports or Later ETL

Description: This approach is a middle-ground where the data from the PowerApp (or at least the employment dates) might be brought into the DW, but you don’t determine the “active” flag in the early ETL. Instead, you load the raw facts (like an employee table that includes start date and end date for each person, possibly combining both systems’ data in some fashion) and then decide activeness either in a later stage of ETL or even in the semantic/reporting layer. There are a few variants of this:

  • Late ETL calculation: For example, one might stage all employees (including those not in the transactional source) in the warehouse, but not mark active vs inactive in that staging table. Then, perhaps as part of building a data mart or a specific report extract, a query would compute who is active (say, CASE WHEN EndDate IS NULL OR EndDate > CAST(GETDATE() AS date) THEN 'Active' ELSE 'Inactive' END). This is still in SQL, but it’s not persisted as a column in the main dimension until needed. Another form would be using a view in the warehouse that wraps the employee table and adds an [IsActive] column based on the current date. Reports would use that view to filter active employees. Essentially, logic is implemented in a database view or a last-step transformation rather than stored as a physical flag updated by ETL.
  • Report-level or semantic layer logic: In this variant, the DW might have all employees with their start/end dates (or perhaps two tables: one from the transactional system, one from HR), and the BI tool (Power BI, etc.) is responsible for merging or filtering. For instance, a Power BI data model could bring in the DW’s employee table plus perhaps a table of “currently active employees” from the HR feed, then use a measure or relationship to filter out inactive ones. Or a calculated measure could count employees where EndDate is blank or after today. The key is the determination of who is active happens at query time when generating the report, not pre-baked in the DW tables.

This approach acknowledges the need to use the PowerApp data, but delays the actual decision on activeness instead of doing it upfront.

Pros:

  • Flexibility in logic application: If different use-cases have slightly different definitions of “active” (though usually they would not), deferring the logic allows each to apply its own filter. For example, one report might consider someone active through their last working day inclusive, another might have already excluded them on their termination date. If the logic isn’t hard-coded in ETL, the report authors have the flexibility to implement what they need. However, this flexibility is double-edged – it often leads to inconsistency. In most cases, a single definition is desired, so this “benefit” is limited.
  • Simpler initial ETL development: Similar to Option 2, the early ETL doesn’t have to perform the computation or maintain an extra flag. It may still need to combine sources (if we truly want the data available for later use), but it could, for instance, load the PowerApp data into a separate table “HR_EmployeeHistory” without integrating it with the transactional user dimension. This keeps the early pipeline focused on just loading data, deferring any transformations or merges until a later step or leaving it to analysts.
  • Logic can be updated in one place (if using a view or central measure): For instance, using a database view to calculate IsActive on the fly means that if the criteria for activeness change, you just alter the view’s formula (or the DAX measure in the BI model) and it immediately applies to all uses of that view. There’s no need to re-run an ETL to recompute columns. One of the concerns a developer raised in a discussion was that doing certain calculations in ETL could be “brittle” if business rules change, whereas a view or function can encapsulate the logic for easier updates. In our scenario, though, the rule for active is straightforward (based on dates) and unlikely to change frequently. Still, it’s true that a view could be adjusted quickly if needed.
  • Avoid storing redundant data: Some might argue that an “Active flag” is a derivation of the end-date, so storing it could be seen as denormalization. By not storing it, you ensure that you always use the raw data (start/end dates) for truth and derive as needed. If, for example, someone fixes an incorrect termination date in the source, a view-based approach would automatically reflect that the person is still active (because the end date is now null or in the future), whereas a precomputed flag might require the ETL to run to flip from inactive to active. However, since we are doing daily batches, this is a minor timing issue – at most a one-day lag. And maintaining data derivations in the DW is very common for performance and ease, so avoiding it is usually not a strong advantage unless the derivation is very complex or variable.

Cons:

  • Inconsistency and Error Risk: Deferring the logic opens the door to inconsistent implementation. If done via reports, each report writer might handle it differently or even forget to handle it. For example, one Power BI report might properly filter out inactive employees, but another report might accidentally include some because the developer wasn’t aware they needed to join to the HR data. This fragmentation is what the user is trying to avoid. Even if you use a central view or measure, it requires all report creators to use that view or measure. It only takes one team member bypassing it to produce a flawed output. In contrast, if the DW dimension itself only contains active employees or flags them clearly, it’s much harder to accidentally do the wrong thing. This approach thus relies heavily on discipline and documentation, whereas an ETL solution enforces the rule by design.
  • Complexity shifted to reporting or analysis: While the initial ETL is simpler, the burden is now on the BI layer. Analysts or report developers have to perform the join between the transactional data and the PowerApp data to figure out who is active. This could mean more complex queries or DAX expressions. It also means the BI model might need additional data (like a table of active employees) loaded. For instance, a Power BI model might have to import two tables (employee and HR history) and create a calculated column such as IsActive = IF(ISBLANK(Employee[EndDate]) || TODAY() <= Employee[EndDate], "Y", "N"). That’s not terribly hard, but multiply this by many reports and it’s duplicated effort. Additionally, not all consumers use the data via a single BI tool – if one day another system queries the DW (perhaps a custom app or another analysis tool), you’d have to replicate the logic there too. Essentially, the “single source of truth” for active status is lost because it’s not stored with the data – it becomes an external logic that must be applied in each context.
  • Performance considerations: In our scale (1000 rows), performance is a non-issue. But generally, filtering or joining on the fly can be less efficient than having a precomputed flag. If you had to join the full employee history each time to filter active employees in a complex query, it’s unnecessary overhead. Since our scenario is small, this is minor – but it’s worth noting that one reason warehouses often store flags or pre-join data is to optimize query performance for consumers. A view that calculates IsActive for each row for every query might ever so slightly slow things versus a stored value, but likely negligible here. More important is the human performance – it’s faster to create reports when the data is already in the form you need (i.e., a ready flag) than to craft new logic each time.
  • Still requires integrating the data somewhere: We should note that deferring logic doesn’t entirely eliminate the need to have the PowerApp data accessible. If done in a view on the DW, that view needs to join the DW’s employee table with the HR data table. That means you did have to load the HR data into some table in the DW in the first place (unless the view calls an external source, which is not typical or advisable for performance). So you still need an ETL step to bring in the PowerApp data daily. In essence, you end up doing almost as much work as Option 1 (loading both sources), but you simply choose not to merge or flag until later. This somewhat splits the logic into two places – data loading in ETL and logic in view/report – which can be harder to maintain than doing it all in one coherent ETL flow.
  • Lack of visibility in raw data: If a user were to query the DW tables directly (say a power user writing SQL or an ad-hoc analysis in Excel against the database), they wouldn’t immediately know who is active unless they apply the logic themselves. In contrast, with an IsActive column, it’s immediately visible. So deferring reduces “user transparency” unless every access goes through a prepared semantic layer. In many organizations, multiple people and tools might touch the DW data; having the logic embedded in the data (or in a clearly defined dimension) provides a single version of the truth that all can utilize easily.

In practice, this approach tends to be a stop-gap or used when teams lack the ability to change the ETL easily. It’s more of a workaround: you load all raw data to the warehouse and assume the BI team will sort it out. While it achieves the goal of not sending notifications to ex-employees (if the BI layer is done right), it does so in a less robust way. The maintenance of the rule is just pushed downstream.

A hybrid variant might be to use a database view in the DW to present a unified, cleaned-up employee list. This view could, for example, SELECT ... FROM EmployeeDW E LEFT JOIN HR_EmploymentHistory H ON E.EmployeeID = H.EmployeeID and include a case expression for active. If well-documented, all reports could use that view as their employee table. This ensures consistency across reports (since the view is one place to update logic) and keeps logic out of the report definitions. It’s closer to doing it in ETL, except it’s not physically materialized. The downside is performance (a view calculates on the fly) and still needing the join each time. In Azure SQL, 1000 rows is trivial, so performance would be fine – the main issues would be ensuring everyone uses the view and that the view logic is correct. This is a reasonable compromise if, say, one cannot modify the ETL easily. But given we control the ETL, a materialized approach (Option 1) is cleaner. Notably, one DW expert argued that while views can centralize logic, “views will not perform as well [as ETL-prepped data]. You can also do the ETL inside the database as ELT” – meaning you might as well compute it once during loading rather than every time at query time.
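
To make the view variant concrete, a sketch along the lines of that fragment (all object and column names are assumptions about the schema):

    -- View sketch: unified employee list with IsActive computed at query time.
    -- EmployeeDW and HR_EmploymentHistory are the illustrative names used above.
    CREATE VIEW dbo.vw_Employee AS
    SELECT E.EmployeeID,
           E.EmployeeName,
           H.StartDate,
           H.EndDate,
           CASE WHEN H.EndDate IS NULL
                  OR H.EndDate >= CAST(GETDATE() AS date)
                THEN 'Y' ELSE 'N'
           END AS IsActive
    FROM dbo.EmployeeDW AS E
         LEFT JOIN dbo.HR_EmploymentHistory AS H
             ON E.EmployeeID = H.EmployeeID;

Note that a left join from the transactional table would still omit PowerApp-only staff; a full outer join (or seeding the base table from both sources) would be needed for completeness – another reason the materialized approach of Option 1 is cleaner.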

Patterns from Other Organizations

Organizations that face similar challenges (multiple systems for employee data, needing a unified active roster) have historically leaned toward centralizing this reference data. A common pattern is to create an Employee master dimension in the warehouse that pulls from the primary HR system and any other relevant source. This dimension often implements SCD Type 2 to handle changes like terminations and rehires, with fields such as EffectiveStartDate, EffectiveEndDate, and an ActiveFlag (Y/N for current record). The ActiveFlag allows easy filtering of current employees in queries. Kimball’s data warehouse methodology supports this approach – adding an “is current” indicator to dimensions to simplify filtering for current entities. It’s considered a best practice for dimensions that change over time (customers, employees, etc.).
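
A sketch of what such a dimension might look like, with illustrative column names following that SCD Type-2 pattern:

    -- SCD Type-2 employee dimension sketch (all names are illustrative).
    CREATE TABLE dbo.DimEmployee (
        EmployeeKey        int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key
        EmployeeID         nvarchar(50)  NOT NULL,                  -- business key
        EmployeeName       nvarchar(200) NOT NULL,
        IsInMainSystem     char(1)       NOT NULL DEFAULT 'Y',      -- 'N' = PowerApp-only staff
        StartDate          date          NOT NULL,                  -- employment start (from HR)
        EndDate            date          NULL,                      -- employment end; NULL if none
        EffectiveStartDate date          NOT NULL DEFAULT (CAST(GETDATE() AS date)),  -- SCD2 row validity
        EffectiveEndDate   date          NULL,
        ActiveFlag         char(1)       NOT NULL                   -- 'Y' = current record
    );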

In scenarios where not all employees are in one source, companies often designate one system as the system of record for employees (usually HR). In our case, the PowerApp employment list essentially acts as that system of record. Many Azure-centric organizations use ADF or Azure Logic Apps to bring HR data into the warehouse daily, ensuring that even users who don’t appear in operational systems are present for analysis. For instance, some might use Azure Data Factory’s Mapping Data Flow to join multiple sources within a pipeline and produce a single dimension output. Others might use stored procedure-driven ELT (extract-load-then-transform in SQL) – for example, load both datasets into staging tables, then execute a T-SQL MERGE to upsert into the Employee dimension, setting the active flag based on the current date and end dates. This MERGE pattern is quite straightforward and is recommended in ETL scenarios to handle updates/inserts efficiently.
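
A hedged sketch of that MERGE, reusing the illustrative dimension above and a hypothetical staging table (shown as a Type 1 update for brevity; an SCD2 variant would expire the old row and insert a new one instead of updating in place):

    -- MERGE sketch: upsert staged employees and set the active flag from the
    -- employment dates. Names are illustrative assumptions.
    MERGE dbo.DimEmployee AS tgt
    USING stg.Employee_Combined AS src
        ON tgt.EmployeeID = src.EmployeeID
    WHEN MATCHED THEN
        UPDATE SET tgt.EndDate    = src.EndDate,
                   tgt.ActiveFlag = CASE WHEN src.EndDate IS NULL
                                           OR src.EndDate >= CAST(GETDATE() AS date)
                                         THEN 'Y' ELSE 'N' END
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (EmployeeID, EmployeeName, StartDate, EndDate, ActiveFlag)
        VALUES (src.EmployeeID, src.EmployeeName, src.StartDate, src.EndDate,
                CASE WHEN src.EndDate IS NULL
                       OR src.EndDate >= CAST(GETDATE() AS date)
                     THEN 'Y' ELSE 'N' END);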

An innovative solution some organizations adopt is to implement a Master Data Management (MDM) layer or use cloud services (like Azure Purview or a metadata-driven approach) to maintain conformed dimensions. While a full MDM tool might be overkill for 1000 employees, the concept is the same: unify data from multiple systems into one clean “golden” dataset. In one scenario (described as “Scenario 1 – no single source system” in a SoftwareOne case study), all employee records from various systems are combined into a master repository, duplicates resolved, and then that is fed to the DW. In our context, we can achieve the spirit of that by treating the PowerApp as the primary source and enriching it with any additional info from the transactional system (like maybe usernames or system-specific IDs for those who are in the system).

Some organizations with low-change, non-real-time requirements even implement hybrid batch solutions such as: Nightly batch sync + on-demand refresh. For example, a nightly ETL sets the active statuses, but in addition, if there’s an urgent need (like HR offboards someone at noon and wants to ensure no notifications go out), a lightweight process (maybe a Power Automate flow or ADF pipeline triggered on demand) can update that one record’s status in the DW. However, given our low churn and daily cycle, this is likely unnecessary.

Another pattern seen is using factless fact tables or snapshots for HR reporting. For instance, some might create a factless fact that records, for each day, which employees are active (a snapshot of headcount). That allows historical headcount trending easily. This is beyond the immediate ask, but if the data is available, it could be a future extension. With such a design, the ETL each day would mark who’s active that day and record it. This is complementary to an active flag in the dimension – it serves analytical needs (like “how many active employees did we have on the first of each month”). If such analytics are in scope later, having the base dimension set up is the first step.
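
If that extension were pursued, the nightly load would simply append one row per active employee to a snapshot table. A minimal sketch with illustrative names, plus an example headcount query:

    -- Daily snapshot sketch: one row per active employee per load date.
    INSERT INTO fact.EmployeeDailySnapshot (SnapshotDate, EmployeeKey)
    SELECT CAST(GETDATE() AS date), EmployeeKey
    FROM dbo.DimEmployee
    WHERE ActiveFlag = 'Y';

    -- Example: active headcount on the first of each month.
    SELECT SnapshotDate, COUNT(*) AS Headcount
    FROM fact.EmployeeDailySnapshot
    WHERE DAY(SnapshotDate) = 1
    GROUP BY SnapshotDate
    ORDER BY SnapshotDate;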

In summary, the prevalent strategy in similar cases is to integrate early and centrally. Whether through formal MDM or just ETL, companies strive to have one employee list in the warehouse that is enriched with all necessary info (active status, etc.), rather than scatter that logic. Deferring to reports is generally only seen in ad-hoc or very small-scale environments, and even then it often leads to the realization that centralizing would be better as the number of reports grows.

Recommendation and Conclusion

Considering the options and the organization’s needs, handling the employee active status early in the ETL pipeline (Option 1) is the recommended approach. This method best aligns with data warehousing best practices and the goals of maintainability, scalability, and user transparency:

  • It ensures the data warehouse contains a complete and accurate picture of the workforce at any given time, by unifying data from the transactional system and the PowerApp. This eliminates confusion caused by missing records and provides a single source of truth for all reporting and analytics related to employees.
  • By centralizing the “who is active” logic in the ETL, you enforce consistency across all downstream uses. You won’t have to worry about individual reports each implementing the rule (or potentially failing to). As noted, moving such business logic into the data layer makes reports simpler and more consistent. Everyone from BI developers to end-users will work from the same consistent dataset.
  • The approach is maintainable: if business rules adjust (say, if the definition of active should ever exclude a certain category of employees or include contractors), you can change the ETL in one place. All reports will reflect the change on the next load. There’s no scattered logic to update. The daily batch nature of your process is well-suited to this; a nightly job can handle the updates easily within your existing ADF schedule. The low volume of changes means this solution will scale for the foreseeable future with negligible impact on performance or costs.
  • It is also user-transparent and user-friendly. Users querying the warehouse or building reports will see an “IsActive” flag (or similarly named field) and can immediately use it to filter or group data, without needing to know the intricacies of HR data. Leadership and operations staff will appear in reports appropriately, but if they are not active, they’ll be clearly marked as such (or can be excluded by filtering). This prevents the scenario of someone being alarmed that they or their colleague are “missing” from a report – if they’re active, they’ll be there; if they left, they’ll show as inactive which is expected. By handling this upstream, you minimize the chances of error like accidental notifications to former employees, since those records can be filtered out universally using the flag.

To implement the recommended approach:

  • Use ADF to ingest the PowerApp employee data daily (if it’s in Dataverse or SQL, use the appropriate connector; if not, perhaps export it to a CSV and load from Blob storage). Ingest the transactional system’s employee/user list as well.
  • In a staging area, prepare the data. Ensure there is a common key to join on (e.g., an employee ID or email). If the PowerApp data is authoritative, use it as the base and overlay any additional info from the transactional source (for those who exist in both). For those in the PowerApp but not in transactional, include them anyway (perhaps flagging them as SystemUser = No).
  • Implement a T-SQL stored procedure or mapping data flow to upsert into the Employee Dimension table. This proc will:
    • Insert new employees (e.g., new hires not seen before) with Active = true (assuming their start date is today or earlier and (if present) end date is in the future or null).
    • Update existing employee records: if someone now has an end date (termination) set, update their status to inactive (and possibly their end date field). If someone was inactive but has returned (rehire), either update the record or insert a new record if maintaining history (the latter is an SCD2 approach).
    • Maintain historical fields as needed. Given the small churn, even an SCD2 approach is easy. If opting for simplicity, you might do a Type 1 (single record per employee, just update the fields) with an Active flag that flips to N when they leave. Since you have full history in the source, you can reconstruct historical states if ever needed, but Type 1 will lose point-in-time status in the DW. Type 2 would keep multiple records. Choose based on reporting needs – for now, a Type 1 with effective dates stored might suffice.
  • Validate the process thoroughly: e.g., when a known person leaves, check that the next day’s data marks them inactive; when a new person is added in HR, check they appear in DW; when someone not in the main system is in HR, confirm they show up. A few sample reconciliation queries are sketched after this list. This will build confidence in the DW data.
  • Use the Active flag in all relevant queries and ensure any existing reports are refactored to utilize it. For example, if previously a report was (unreliably) filtering by checking if an employee exists in some list, switch it to use the dimension’s Active = Y field. You might also choose to exclude inactive employees by default in certain aggregated reports (e.g., a headcount KPI might by default count only active ones). Because the data is now trustworthy, such defaults can be set confidently.
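
For the validation step referenced in the list above, a few reconciliation queries can be run (or scheduled) after each load. A sketch with illustrative names:

    -- Check 1: anyone in the HR (PowerApp) staging extract missing from the
    -- dimension? Should return zero rows after a successful load.
    SELECT h.EmployeeID
    FROM stg.HR_Employees AS h
         LEFT JOIN dbo.DimEmployee AS d
             ON d.EmployeeID = h.EmployeeID
    WHERE d.EmployeeID IS NULL;

    -- Check 2: anyone flagged active despite a past end date? Should also
    -- return zero rows.
    SELECT EmployeeID, EndDate
    FROM dbo.DimEmployee
    WHERE ActiveFlag = 'Y'
      AND EndDate < CAST(GETDATE() AS date);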

As a supporting measure, you can create a simple documentation or data dictionary entry for the Employee dimension and the Active flag, explaining that Active = “currently employed (not terminated as of the latest data load)”. This helps user transparency – anyone curious can see the definition. But likely, it’s self-explanatory enough.

Finally, the recommended solution is scalable and future-proof. Should the organization add another source of employee data (say a contractor management system, or if the PowerApp is replaced by a new HR system), the same pattern applies: you’d merge that data into the dimension with minimal changes. The approach also supports new use cases – e.g., if you want to build an attrition dashboard, you already have all necessary data (start dates, end dates, and the ability to identify active vs not on any given date with a bit of SQL). In contrast, the other approaches would make such developments harder.

In conclusion, implementing the active/inactive determination early in the ETL pipeline is the best practice for this scenario. It provides a maintainable single point of control for a critical piece of business logic, ensures all downstream reporting is consistent and correct, and leverages Azure Data Factory and SQL’s strengths in data integration. By adopting this method, the organization will avoid reporting mishaps (like emailing former employees or baffling leaders with missing data) and will set a strong foundation for reliable HR analytics in the data warehouse.

Sources:

  • SoftwareOne – Master data management in a data warehouse context (combining multiple systems’ employee records into one master set).
  • ClearPeaks – Explanation of SCD Type-2 with an Active flag for current record identification.
  • Reddit (r/dataengineering) – Advice referencing Kimball’s approach to add an “is active” flag in dimension tables for the current version of a record.
  • Stack Overflow Q&A – Discussion on doing calculations in ETL vs in the database; recommendation to perform transformations (like computing elapsed times or flags) as part of the ETL/ELT using SQL (for consistency and performance).
  • DBA Stack Exchange – Argument in favor of storing business logic in the DW to avoid duplicating it in many reports and to ensure one consistent definition (example of lead qualification logic, analogous to our active employee rule).
  • LinkedIn Article – “Never Use Reports to Manage Business Logic,” stressing that the data platform should handle logic, which makes reports simpler and data more consistent.