Background and Objectives
The data warehouse is an Azure SQL instance nearing its 4 TB capacity (currently ~3 TB used). To avoid expensive scale-up and reduce storage costs, we plan to archive 250–500 GB of historical data (older than 5 years) out of the database and into Azure Blob Storage. The archived data must remain accessible for queries or retrieval within hours or days (not weeks) for compliance or analysis needs. The strategy should maximize storage cost savings without incurring disproportionate compute costs. Below, we evaluate storage tier options, methods to query archived data, cost comparisons, and tools for automating the archive pipeline.
Azure Blob Storage Tiering Options for Archive
Azure Blob Storage offers multiple access tiers that trade storage cost against access cost and latency for archived data. The four tiers are Hot, Cool, Cold (a newer addition), and Archive, each suited to a different access pattern:
- Hot Tier (Online) – Optimized for frequent access. Highest storage cost but lowest per-read cost and immediate availability. Suitable for active or recently archived data that might be queried often (e.g. ~$0.018 per GB-month in East US 2 for the first 50 TB). No minimum retention period.
- Cool Tier (Online) – Lower storage cost (~$0.01 per GB-month) for infrequently accessed data, with higher read/access charges. Data is still available immediately, but the tier is intended for data accessed only rarely (e.g. a few times per month). Requires a 30-day minimum retention to avoid early deletion fees. Good for warm archives that need quick retrieval on occasion.
- Cold Tier (Online) – A recently introduced tier for rarely accessed data that still needs fast retrieval. Storage cost is lower than Cool (with a 90-day minimum retention) and access costs are higher. Cold provides immediate read access like Hot/Cool, but with a lower availability SLA (99% vs. 99.9%) on the assumption of infrequent use. It can serve as an intermediate staging tier before Archive for data that might be needed within a quarter.
- Archive Tier (Offline) – Lowest-cost storage for data that is rarely accessed, accepting higher latency for retrieval. Storage is extremely cheap (as low as ~$0.00099–$0.002 per GB-month for LRS, i.e. under $2 per TB-month), but the data is offline: it cannot be read or queried directly without first rehydrating it to an online tier. Archive has a 180-day minimum retention and significantly higher data retrieval costs (e.g. ~$5 per 10,000 read operations) than Hot/Cool. Retrieval times are on the order of hours: standard rehydration can take up to ~15 hours, or as little as ~1 hour with high-priority rehydration (at higher cost) for small blobs. Archive tier is ideal for long-term retention of compliance data that is very rarely needed, where a half-day delay to restore it is tolerable.
Choosing a Tier: For our 5+ year old archive data, which will be seldom accessed, the Cool or Archive tiers are most cost-effective. If we want the archived data to be instantly queryable, keeping it in an online tier (Cool or Cold) is advisable so it can be read immediately by analytic tools. Cool storage at ~$10/TB-month with occasional retrieval fees is still dramatically cheaper than keeping data in the SQL database. On the other hand, if we anticipate extremely infrequent access (e.g. once a year or less), the Archive tier maximizes cost savings (nearly 80–90% cheaper than Cool per GB). Archive would require planning for rehydration lead time (hours) before the data can be queried, which still meets the “within days, not weeks” requirement. A hybrid approach can be used: e.g. keep recently archived data (the last 1–2 years of the archive set) in Cool for on-demand querying, and shift older archives to the Archive tier to minimize cost. Azure Blob lifecycle management policies can automate moving blobs from Cool to Archive after a certain time.
Summary of Blob Tier Options:

| Tier | Storage Cost | Min. Retention | Availability | Retrieval Latency | Use Case |
|------|--------------|----------------|--------------|-------------------|----------|
| Hot | Highest (e.g. ~$0.018/GB-month) | None | 99.9% (LRS) | Immediate (online) | Active data, frequent queries. |
| Cool | ~50% lower than Hot (~$0.01/GB-month) | 30 days | 99% (LRS) | Immediate (online) | Infrequently accessed data (e.g. older than 1–2 years) that may still need quick reads. |
| Cold | Lower than Cool (~$0.005–0.007/GB-month, est.) | 90 days | 99% (LRS) | Immediate (online) | Rarely accessed data that still requires fast retrieval (new tier bridging Cool and Archive). |
| Archive | Lowest (~$0.00099–$0.002/GB-month) | 180 days | N/A (offline) | Requires rehydration (up to ~15 hrs) | Very cold data (5+ years) with minimal access; can wait hours to retrieve. |
Note: Prices are indicative for LRS in a typical region (e.g. East US); actual rates vary by region and capacity. Archive and Cool tiers also incur data retrieval charges (per GB and per 10k reads) that the Hot tier does not. In our scenario, the volume (250–500 GB) and expected access frequency (very low) mean the storage cost savings of the Cool/Archive tiers will far outweigh any occasional retrieval costs.
Making Archived Data Queryable
A key requirement is that archived data remains queryable or retrievable within a reasonable time. We consider several approaches to enable querying data that has been offloaded to Blob storage:
- Azure Synapse Analytics Serverless SQL – An on-demand query engine that runs T-SQL directly on files in Azure Data Lake Storage or Blob, without loading them into a database. We can store the archived data as files (e.g. Parquet or CSV) in Blob, then use Synapse serverless to query it when needed. This approach is highly cost-effective: there is no always-on compute cost; you pay per query, at ~$5 per TB of data scanned. For example, scanning 100 GB of archive data costs only about $0.50 in query fees. Using Parquet (a columnar, compressed format) reduces the data scanned and improves query performance. Synapse serverless can return results quickly (seconds to minutes for large files) and integrates with tools like Power BI or Synapse notebooks for analysis. This option keeps archived data readily accessible via standard SQL while incurring negligible cost when archives are not queried, which aligns exactly with the requirement to avoid new persistent compute costs (a minimal query sketch follows this list).
- PolyBase or External Tables (Azure Synapse Dedicated SQL Pool or SQL Server/Azure SQL Managed Instance) – PolyBase lets a SQL engine define tables that point externally to files in Azure Blob storage. In Azure Synapse dedicated SQL pools (formerly SQL DW) or on SQL Server/Azure SQL Managed Instance (MI), you can create an external data source and external tables over the archived files. This makes the archive appear as a table in the SQL environment, so analysts can run familiar T-SQL against it (joins with local tables, etc.). For Azure SQL Managed Instance, the data virtualization feature supports querying external data in ADLS/Blob directly via OPENROWSET or external tables. The benefit is seamless integration – archived data can be queried from the main SQL instance (if MI) or from a Synapse pool using the same SQL interface. However, there are cost considerations: a dedicated Synapse pool must be kept online (incurring compute cost in DWUs) unless paused when not in use. Similarly, using PolyBase on an on-prem or VM SQL Server would require that server to process the data. If our primary warehouse is an Azure SQL Database (not MI), note that Azure SQL DB does not support external tables for ad-hoc querying of blob data (it can only bulk-load from external data sources, not query in place). In that case, Synapse serverless or another engine would be required for querying. If we do have Azure SQL MI, external tables there are an option, but we must ensure the MI has enough resources to handle the occasional large query over external files. Overall, PolyBase/external tables provide the convenience of T-SQL access, but we should avoid keeping a heavy dedicated SQL compute running solely for archive queries that happen rarely.
- Azure Data Lake Analytics (ADLA) – A YARN-based, job-oriented analytics service (using U-SQL) that can run jobs directly against data in Azure Data Lake/Blob storage. ADLA is pay-per-job, similar in spirit to Synapse serverless (though with a different query language and execution model). One could export the 5+ year data to Azure Data Lake Storage and use ADLA to run on-demand analytical queries over it; the data remains in cheap storage until a U-SQL job is triggered to scan it. ADLA was often used to analyze archived large datasets without an active database. However, Microsoft now steers customers towards Synapse Analytics as the unified solution, and ADLA (U-SQL) is no longer receiving new feature development. It could still fulfill the requirement (query within hours) by running a job on demand, but the learning curve and maintenance burden of U-SQL jobs are higher than using T-SQL in Synapse. If starting fresh, Synapse serverless is the simpler choice for querying files in place.
- Retrieval and Reload (on demand) – Another way to access archived data is to rehydrate and reload it into a database only when needed. For instance, if we keep the archive in the Archive tier and data needs to be queried, we could: (1) rehydrate the blob from Archive to an online tier (Hot or Cool, taking a few hours), and (2) use a tool (Azure Data Factory, BCP, etc.) to load the data back into a temporary table in Azure SQL for analysis, or attach an external table after rehydration. This process could be automated so that a user’s request triggers the restore pipeline. The advantage is that we pay nothing for compute most of the time, incurring cost and effort only when a restore is necessary. The downside is the delay (potentially several hours total) and the operational overhead of moving data back and forth. Given the requirement of “within hours or days”, this is acceptable as a worst case, but ideally we prefer direct query via Synapse or external tables to avoid the extra steps. In fact, using Synapse serverless to query the file directly (if it is stored in an online tier) is essentially instant and avoids reloading data entirely.
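As a concrete illustration of the serverless option above, an ad-hoc query against archived Parquet files could look like the following minimal sketch. The storage account, container, folder layout, and column names are hypothetical, and the example assumes that access to the storage account is already configured (e.g. via Azure AD pass-through or a database scoped credential):

```sql
-- Ad-hoc query over archived Parquet files from a Synapse serverless SQL pool.
-- The URL, folder structure, and column names below are placeholders for illustration.
SELECT OrderId, CustomerId, OrderDate, TotalAmount
FROM OPENROWSET(
    BULK 'https://archivestore.blob.core.windows.net/warehouse-archive/Sales/year=2017/*.parquet',
    FORMAT = 'PARQUET'
) AS archived_sales
WHERE OrderDate >= '2017-01-01' AND OrderDate < '2018-01-01';
```

Because the files are organized by year and Parquet is columnar, serverless SQL scans only the targeted folder and the requested columns, which keeps the per-query cost low.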
Recommended Query Approach: Use Synapse Analytics Serverless SQL as the primary means to query archived data stored in Blob. It offers a straightforward T-SQL interface and on-demand pricing, ensuring we do not keep any compute resources running idle (no ongoing cost). We will store the archived datasets in Parquet format in Blob storage; Parquet’s columnar compression will minimize the data scanned and accelerate serverless queries. For end users, we can abstract the external data source by creating Synapse views or external tables in the serverless pool that point to the Parquet files. This way, users or applications can query the archive via a familiar SQL endpoint (Synapse) as if it were a regular table. Cost is only incurred when a query is actually run (e.g. scanning the entire 500 GB archive would cost on the order of ~$2.50, which is trivial for occasional use). If the main Azure SQL is a Managed Instance, we additionally have the flexibility to create external tables on the MI itself pointing to the same Parquet files, enabling queries from the MI if needed. We will, however, be careful not to overload the MI with large external queries that might impact the production workload – heavy archive analysis can be offloaded to Synapse.
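To give users that stable SQL endpoint, a view in a serverless database can wrap the OPENROWSET call. This is a sketch only; the database, view, and storage names are hypothetical and assume the same access setup as above:

```sql
-- Run in a user database on the Synapse serverless SQL endpoint (not in master).
CREATE VIEW dbo.SalesArchive
AS
SELECT *
FROM OPENROWSET(
    BULK 'https://archivestore.blob.core.windows.net/warehouse-archive/Sales/year=*/*.parquet',
    FORMAT = 'PARQUET'
) AS s;
GO

-- Consumers then query the archive like an ordinary table:
SELECT COUNT(*) AS ArchivedRows
FROM dbo.SalesArchive
WHERE OrderDate < '2017-01-01';
```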
As a backup, data in Archive tier can be restored if absolutely necessary (with some lead time). But by keeping the archive data in Cool storage initially (and possibly using lifecycle rules to transition it to Archive after a certain period of no access), we ensure that most queries can be served on-demand via serverless SQL without upfront delays.
Cost Comparison: Azure SQL vs. Blob Archive
Archiving ~500 GB of old data out of the Azure SQL instance yields significant cost savings on storage, and we need to ensure any new costs (like compute for queries) don’t offset those savings. Below is a comparison of costs:
Storage Costs: Azure SQL Database storage is relatively expensive because it runs on high-performance tiers. For example, Azure SQL Hyperscale charges around $0.10 per GB-month for storage used (plus additional costs for backups and replicas), which would be ~$50 per month for 500 GB. In contrast, storing 500 GB in the Azure Blob Cool tier costs roughly $5/month, and in the Archive tier as little as ~$1–2/month. That is an immediate 10x–50x reduction in storage cost. Over a year, 500 GB costs ~$600 in Azure SQL vs. ~$60 in Cool Blob or ~$12–$24 in Archive Blob – a massive saving. This is before accounting for backup storage: keeping data in the SQL DB means it is also captured in nightly backups (which can double the storage overhead, further increasing cost). Archived blob data carries no such overhead, and one could even store a second copy in GRS (geo-redundant) storage and still be cheaper than SQL storage.
Compute and Query Costs: By removing 500 GB from the data warehouse, we potentially can downsize the Azure SQL tier or at least avoid higher scaling as data grows. The performance of the main warehouse may improve (less data to scan on large queries, smaller indexes), meaning we don’t need to pay for extra compute to handle bloat. The archival solution’s compute costs come into play only when querying the archived data:
- If we left the 500 GB in Azure SQL: Queries on that older data would run on the Azure SQL instance. Given the large size and age of the data, such queries might be slow and resource-intensive, possibly requiring a high compute tier or causing performance issues. So effectively we’d be paying for expensive Azure SQL compute capacity even for data that is almost never used.
- With archived blob + Synapse serverless: We pay $5 per TB scanned for queries. If, for example, an analyst runs a few archive queries that scan ~50 GB total in a month, that is only about $0.25 in Synapse query charges – negligible. Even a heavy one-time full scan of all 500 GB is ~$2.50. These costs are fractional compared to running the same query on a high-tier Azure SQL DW (which could be hundreds of dollars per month in provisioned capacity). Importantly, if the archived data is not queried at all in a given month, we pay $0 in query costs. There is no extra server to keep running; this pay-per-use model keeps compute cost proportional to actual usage of the archive.
- PolyBase/Synapse dedicated: Had we instead kept a dedicated Synapse SQL pool for the archive, we would incur a steady hourly cost (even the smallest Synapse pool costs >$1.50/hour, or ~$1,100/month). That would clearly erase any savings – hence we avoid that approach. Synapse serverless has no fixed cost, and Azure Data Factory pipelines have minimal runtime costs (just the data movement, which is minor for a one-time 500 GB transfer, and orchestrating small ongoing monthly increments is also minor).
- Data retrieval from the Archive tier: If we move blobs to the Archive tier, bringing them back has a cost. For example, reading 500 GB from Archive might cost on the order of $0.02 per GB = $10 (plus ~$5 per 10k file operations). A rehydration might thus cost around ten dollars and take up to a day. However, this happens only if we need data that has been pushed to deep archive. Even this occasional cost is far lower than keeping that 500 GB live in the SQL DB for months or years “just in case.” The strategy is to accept a small on-demand cost when very old data is needed, rather than paying large sums every month.
In summary, archiving to Blob yields clear cost benefits: Storage is an order of magnitude cheaper, and compute expenses for querying archives become event-driven (nearly zero when not in use, and modest when used). By contrast, keeping data in the warehouse means continuously paying for premium storage and possibly higher compute tiers even when that data is idle. The proposed solution optimizes cost by shifting to a model of cheap storage + on-demand compute. This ensures we don’t trade off storage savings for a big new compute bill – indeed, we expect overall cost (Azure SQL + archive solution) to be significantly lower than before, since the Azure SQL instance could potentially run at a lower size once old data is removed.
Automation Tools and Archiving Pipeline
To implement this archiving strategy reliably, we will design an automated pipeline that periodically offloads old data from the SQL warehouse to Blob storage. Several Azure services can be used for this ETL/ELT process:
- Azure Data Factory (ADF) / Synapse Pipelines: ADF is a fully managed data integration service ideal for orchestrating data movement. We can use ADF to extract 5+ year-old records from the SQL DB and copy them to Azure Blob Storage on a schedule. For example, a pipeline can run monthly to move any newly aged data (data that has just passed the 5-year mark) to the archive. The copy activity can use SQL queries or stored procedures to pull data (ideally in chunks/partitions to minimize impact on the source) and write to Blob. ADF supports writing compressed formats such as Parquet or gzipped CSV – we would choose Parquet for the reasons discussed (efficient querying and storage). After a successful copy, a subsequent activity can delete the archived records from the SQL database (keeping the warehouse lean). The access tier can be set on upload (or inherited from the storage account’s default tier), or we can configure lifecycle policies on the storage account. For instance, files could initially land in a Cool-tier container for a year, with an Azure Storage lifecycle rule moving them to the Archive tier if they have not been touched in over a year.
- Azure Data Factory integration with Azure SQL: We can use ADF’s built-in incremental copy pattern or a Lookup + ForEach construct to pull partitions of data systematically. If the warehouse tables are partitioned by date, we can leverage that (e.g. switch out an old partition, or simply query by date range). If not, we can run a query such as SELECT * FROM Table WHERE Date < '2018-01-01' (for 5+ year-old rows) and export the result set. The pipeline can handle multiple tables by iterating through a list of archive-eligible tables. We will implement logging and verification – e.g. comparing record counts before and after – to ensure data integrity.
- Azure Synapse Pipeline: If we already have a Synapse workspace, its pipeline feature is essentially the same as ADF (since Synapse Pipelines is ADF under the hood). We could use that in place of a separate Data Factory, possibly running on the same integration runtime if convenient.
- Automation and Monitoring: The entire process can run on a schedule (e.g. an ADF schedule trigger) or be triggered on demand. After each data movement, we will monitor the pipeline run for failures and set up Azure Monitor alerts on pipeline failures. The archive files themselves will be organized in a logical folder structure in Blob (e.g. by table and year) to simplify ad-hoc querying and to enable partition pruning (serverless SQL can target specific folders when filtering by date, reducing the scan scope).
- Alternative Tools: One could use Azure Database Migration Service or Change Data Capture to continuously offload old data, but given the one-time (and then periodic) nature of our archival, ADF is simpler. A SQL Server BACPAC export is another method (export entire tables to a .bacpac or .bak file in Blob), but it is better suited to full database backup or offline archiving (not queryable without a restore); we prefer row-level archiving to keep the data queryable. An ADF Data Flow could transform data (e.g. anonymize or filter) during archival, but if that is not needed, a straight copy is sufficient. Finally, if using Azure SQL MI, we could use the Managed Instance’s ability to export to external tables: MI supports CREATE EXTERNAL TABLE AS SELECT (CETAS), which exports a SQL query directly to Parquet files in Blob and registers an external table in one step. This is a powerful feature – for example, we could run a CETAS in MI to export all 2015-and-older data from a table to Blob (a sketch follows this list). It could replace ADF for a one-time push, but orchestrating multiple tables and an ongoing schedule is typically easier in ADF.
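The CETAS pattern mentioned in the last bullet might look roughly like the following on an Azure SQL Managed Instance. This is a sketch under stated assumptions: the authentication setup (managed identity or database scoped credential) is assumed to exist already, and the data source, file format, table, and path names are hypothetical; the exact LOCATION prefix and prerequisites should be confirmed against the data virtualization documentation for the target engine.

```sql
-- One-time setup: a Parquet file format and an external data source pointing at the archive container.
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);

CREATE EXTERNAL DATA SOURCE ArchiveBlob
WITH (LOCATION = 'abs://warehouse-archive@archivestore.blob.core.windows.net');
-- (Authentication via a managed identity or database scoped credential is assumed to be configured.)

-- Export all pre-2018 rows to Parquet in Blob and register an external table in one step.
CREATE EXTERNAL TABLE dbo.Sales_Pre2018
WITH (
    LOCATION = 'Sales/pre2018/',
    DATA_SOURCE = ArchiveBlob,
    FILE_FORMAT = ParquetFormat
)
AS
SELECT *
FROM dbo.Sales
WHERE OrderDate < '2018-01-01';
```

After verifying that the external table’s row count matches the exported query, the archived rows can be removed from the warehouse in small batches to keep transaction log growth under control (again using hypothetical names):

```sql
-- Delete exported rows in batches; the loop exits once no matching rows remain.
DECLARE @deleted INT = 1;
WHILE @deleted > 0
BEGIN
    DELETE TOP (50000) FROM dbo.Sales
    WHERE OrderDate < '2018-01-01';
    SET @deleted = @@ROWCOUNT;
END;
```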
Retention and Tiering Policies: Once data is in Blob, we will set appropriate access tiers. If the Cool tier is used for the initial archive, we ensure the blobs are written as Cool on upload. Azure Storage lifecycle management then lets us configure a rule such as “if a blob was last modified more than 1 year ago, move it to the Archive tier,” so archived files that age further without changes are auto-archived (minimizing cost). Lifecycle rules can also key off the last access time (if access tracking is enabled on the storage account), which lets us delay the move to Archive for files that are still being read in Cool, keeping frequently accessed archives readily available. This tiering automation complements the cost optimization by not keeping data in a higher tier longer than necessary.
Data Catalog and Access: We should track what data has been archived and where. Using Azure Data Catalog (or simply a documentation spreadsheet), we will record, for each table, the date up to which data has been archived. That way users know that records beyond a certain date range live in the archive rather than the primary DB. With our Synapse external tables or views, we can even union the active and archived data to provide a seamless view: for example, a view that selects from the live table (last 5 years) UNION ALL an external table over the archived files (on MI, or in Synapse). However, caution is needed: combining live and archive data in one query could inadvertently scan a lot of cold data (and incur cost). A better default is to let users query the archive intentionally when needed (perhaps through a Synapse workspace or a separate “ArchivedData” database).
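If the seamless-view option is wanted on a Managed Instance, a union view over the live table and the external archive table could be as simple as the sketch below (hypothetical object names, reusing the external table from the CETAS example above). As noted, such a view should be used deliberately, since any query that touches the archive branch will scan Parquet files in Blob:

```sql
-- Combined history view: live rows (last 5 years) plus the archived external table.
CREATE VIEW dbo.Sales_AllHistory
AS
SELECT OrderId, CustomerId, OrderDate, TotalAmount
FROM dbo.Sales               -- live data in the warehouse
UNION ALL
SELECT OrderId, CustomerId, OrderDate, TotalAmount
FROM dbo.Sales_Pre2018;      -- external table over archived Parquet files in Blob
GO
```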
Security and Governance: We will secure the Blob storage container (using SAS tokens or managed identities for the Synapse workspace to read the files). Data is encrypted at rest by default in Azure Storage. If using Parquet, any sensitive fields could optionally be masked or encrypted before archival if compliance requires it. Because the data remains accessible, we will also enforce proper access control – e.g. only designated analysts or services can run queries against the archived data, since not every app or user of the live warehouse should automatically have access to the entire history.
Recommended Strategy and Conclusion
Proposed Archiving Solution: Offload data older than 5 years from the Azure SQL data warehouse to Azure Blob Storage in a cost-effective tier (Cool for nearline archives, with lifecycle rules moving data to Archive for deep storage). Use Azure Data Factory to regularly transfer out old partitions of data, store them as Parquet files in Blob, then remove them from the SQL database. Maintain the ability to query this archived information using Azure Synapse serverless SQL or external tables, which keeps the data accessible within minutes for ad-hoc queries. This approach yields substantial storage savings (on the order of 10–20× cheaper storage) while avoiding continuous compute costs – Synapse serverless ensures we pay only when we actually run a query (e.g. $5 per TB scanned). The main Azure SQL instance can thus operate with less bloat, potentially at a lower service tier (saving costs and improving performance on current data).
By leveraging Blob storage tiering (Hot/Cool/Archive), we can further optimize costs: data can start in Cool (instant access if needed) and later move to Archive for pennies, still retrievable within a day if required. The use of ADF/Synapse pipelines, and possibly MI’s CETAS feature, provides a robust automation framework to perform the archiving consistently with minimal manual effort. In the event data needs to be restored to the warehouse, the pipeline can reverse-copy from Blob to SQL (after moving the blob back to the Hot tier), ensuring business continuity.
This strategy meets the objectives by freeing up 250–500 GB on the Azure SQL instance (delaying the need to purchase more capacity), cutting storage costs dramatically, and still allowing the business to retrieve or query archived records on demand within hours. All of this is achieved without spinning up long-running compute clusters or doubling infrastructure – we use serverless and existing PaaS capabilities to keep costs proportional to use. The table below summarizes the key benefits vs. the current state:
| Aspect | Current (Azure SQL only) | With Archival Solution |
|--------|--------------------------|------------------------|
| Storage Used | ~3 TB (approaching the 4 TB limit). | ~2.5 TB in Azure SQL; 0.25–0.5 TB in Blob storage. Azure SQL size reduced ~10–15%. |
| Storage Cost | High (premium storage ~$0.10/GB-month, included in DB cost) for all 3 TB. | Azure SQL: pay for ~2.5 TB instead of 3 TB. Blob: 0.5 TB at ~$0.002–$0.01/GB-month (negligible by comparison). Overall much lower cost. |
| Compute/DB Tier | Must handle the full 3 TB workload; indexes and stats on cold data consume resources. Possibly higher DTUs/vCores needed. | Smaller active dataset, so potentially a lower tier. Archive queries run in Synapse serverless ($5/TB scanned), adding only on-demand cost. No permanent extra compute. |
| Data Accessibility | All data is immediately queryable in SQL (fast for recent data, but 5+ year data is rarely queried and slows down maintenance). | Last 5 years in SQL (fast). Older data queryable via Synapse (some latency on first reads, but acceptable). Archive-tier data needs ~hours to rehydrate before querying; used only for very infrequent requests. |
| Management | SQL DB continues to grow; backups and maintenance windows increase. Potential performance issues on huge tables. | Old data is carved out neatly. Warehouse maintenance is easier (smaller backups, stats, etc.). Archive process is automated (ADF). Need to monitor the pipeline and occasionally update external table schemas if the base schema changes. |
| Risks | Hitting size/performance limits or incurring very high costs as data grows. All eggs in one basket (if SQL goes down, all data, including old, is unavailable). | Slight setup complexity, mitigated by automation. Archived data is not instantly in SQL; users must query via new methods (training needed). However, data is preserved safely in cheap storage, and even if the warehouse is unavailable, archive files remain in Blob (added resilience). |
In conclusion, this tiered archiving strategy using Azure Blob Storage and Synapse not only reduces costs significantly but also ensures that the organization’s historical data remains accessible within hours for analytics or compliance. We avoid unnecessary expenditure on expensive database storage for data that’s rarely used, and we do so without incurring heavy new infrastructure costs. This solution is scalable (we can continue to offload data as it ages, and Blob storage can scale to petabytes), and balances performance with cost by leveraging Azure’s tiered storage and serverless querying capabilities. It provides a future-proof foundation where the data warehouse can focus on recent, high-value data, while the data lake archive can handle the long tail of historical data at minimal cost, yet still be queryable when the need arises.