Azure Synapse Architecture

Azure Synapse is a limitless analytics service that brings together enterprise data warehousing and Big Data analytics. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources—at scale.
SQL Analytics uses a node-based architecture. Applications connect and issue T-SQL commands to a Control node, which is the single point of entry for SQL Analytics. The Control node runs the MPP engine, which optimizes queries for parallel processing, and then passes operations to Compute nodes to do their work in parallel.

With decoupled storage and compute, a Synapse SQL pool lets you:
  • Size compute independently of your storage needs.
  • Grow or shrink compute power within a SQL pool (data warehouse) without moving data.
  • Pause compute capacity while leaving data intact, so you pay only for storage.
  • Resume compute capacity during operational hours.
Data Loading in Azure Synapse:
The following options are available to load data into Azure Synapse:
  • PolyBase
  • COPY
  • BCP/SqlBulkCopy API

PolyBase is the most commonly used option for data loading. It is a technology that accesses and combines both non-relational and relational data, all from within SQL Server.
It lets you run queries on external data in Hadoop or Azure Blob storage, and those queries are optimized to push computation down to Hadoop.
To load data with PolyBase, any of these loading options can be used:
  • PolyBase with T-SQL - when data is in Azure Blob storage or ADLS.
  • PolyBase with SSIS - when source data is in SQL Server, either SQL Server on-premises or in the cloud.
  • PolyBase and COPY statement with Azure Data Factory (ADF)
  • PolyBase with Azure Databricks
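The T-SQL loading path above can be sketched end to end as follows. This is a minimal illustration, not a production script: all object names, the storage location, and the column list are hypothetical placeholders.

```sql
-- Sketch of a PolyBase load with T-SQL (names and paths are placeholders).

-- 1. External data source pointing at Azure Blob storage.
CREATE EXTERNAL DATA SOURCE AzureBlobStore
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://mycontainer@mystorageaccount.blob.core.windows.net'
);

-- 2. File format describing the delimited source files.
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2)
);

-- 3. External table over the files; no data is moved yet.
CREATE EXTERNAL TABLE ext.Sales (
    SaleId   INT,
    SaleDate DATE,
    Amount   DECIMAL(18, 2)
)
WITH (
    LOCATION = '/sales/',
    DATA_SOURCE = AzureBlobStore,
    FILE_FORMAT = CsvFormat
);

-- 4. CTAS reads the external files in parallel and lands the data
--    in a distributed internal table in one pass.
CREATE TABLE dbo.Sales
WITH (
    DISTRIBUTION = HASH(SaleId),
    CLUSTERED COLUMNSTORE INDEX
)
AS SELECT * FROM ext.Sales;
```

The final CTAS step is what makes this path fast: each Compute node reads and writes its share of the data in parallel, with no single-node bottleneck.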

Azure Storage & Distributions
In Synapse, data is stored and managed by Azure Storage, and there is a separate charge for storage consumption. The data is sharded into distributions for performance optimization.
The following sharding patterns can be used to distribute the data when a table is defined:
  • Hash
  • Round Robin
  • Replicate
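Each pattern is selected in the WITH clause of CREATE TABLE. The tables below are hypothetical examples chosen to match the typical use case of each pattern:

```sql
-- Hash: large fact table, distributed on a common join/aggregation key.
CREATE TABLE dbo.FactOrders (
    OrderId    BIGINT,
    CustomerId INT,
    Amount     DECIMAL(18, 2)
)
WITH (DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX);

-- Round robin: staging table with no obvious distribution key.
CREATE TABLE stg.Orders (
    OrderId    BIGINT,
    CustomerId INT,
    Amount     DECIMAL(18, 2)
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);

-- Replicate: small dimension table, cached in full on every Compute node.
CREATE TABLE dbo.DimRegion (
    RegionId   INT,
    RegionName NVARCHAR(50)
)
WITH (DISTRIBUTION = REPLICATE);
```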

Hash Distributed Tables:
A hash-distributed table can deliver the highest query performance for joins and aggregations on large tables.
  • Each row belongs to exactly one distribution.
  • A deterministic hash function on the distribution column assigns each row to a distribution.
  • The number of table rows per distribution can vary, so distributions may differ in size.
Round Robin Distributed Tables:
A round-robin distributed table is a table whose data is distributed evenly (or as evenly as possible) among all the distributions, without the use of a hash function.
Replicated Tables:
  • A replicated table provides the fastest query performance for small tables.
  • A table that is replicated caches a full copy of the table on each compute node. Consequently, replicating a table removes the need to transfer data among compute nodes before a join or aggregation.
  • Replicated tables are best utilized with small tables.
  • Extra storage is required, and additional overhead is incurred when writing data, which makes replication impractical for large tables.
What is a DWU?
  • The three analytic resources (CPU, memory, and IO) are bundled into units of compute scale called Data Warehouse Units (DWUs).
  • For higher performance you can increase the number of data warehouse units, and vice versa.
  • Storage and compute costs are billed separately, so changing data warehouse units does not affect storage costs.
Refer to the Azure Synapse documentation for guidance on choosing a DWU level and its pricing tiers.
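Scaling up or down is a single T-SQL statement that changes the pool's service objective. The database and DWU level below are placeholder values; the statement is run against the logical server's master database:

```sql
-- Change the DWU level (service objective) of a dedicated SQL pool.
-- 'mySampleDataWarehouse' and 'DW300c' are illustrative values.
ALTER DATABASE mySampleDataWarehouse
MODIFY (SERVICE_OBJECTIVE = 'DW300c');
```

The scale operation briefly takes the pool offline while compute is re-provisioned; data in storage is unaffected.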

Resource Classes
Resource classes are pre-determined resource limits that govern how much memory is allocated to each query and, in turn, how many queries can run concurrently.
Smaller resource classes reduce the maximum memory per query, but increase concurrency.
Larger resource classes increase the maximum memory per query, but reduce concurrency.
Types:
Static → Used for increased concurrency on a data volume that is fixed.
Dynamic → Used for data sets that grow in size and require increased performance as the service level is scaled up.
  Static RC                                    | Dynamic RC
  Allocates the same amount of memory          | Allocates a variable amount of memory
  Ideal when data volume is known and constant | Ideal for growing or variable data volumes
  Ex: system tables                            | Ex: application tables

Types of Dynamic RCs:
  • smallrc (default)
  • mediumrc
  • largerc
  • xlargerc

Changing a user's resource class:
Resource classes are implemented by assigning users to database roles. For example, to move 'loaduser' into the largerc class:
EXEC sp_addrolemember 'largerc', 'loaduser';

To decrease the resource class, use:
EXEC sp_droprolemember 'largerc', 'loaduser';

If 'loaduser' is not a member of any other resource class, they fall back to the default smallrc.
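To verify which resource-class role a user currently holds, the standard database catalog views can be joined. This is an illustrative query; the user name is a placeholder:

```sql
-- List resource-class role memberships for a given user.
-- Resource-class role names end in 'rc' (e.g. largerc, xlargerc).
SELECT r.name AS resource_class,
       m.name AS member_name
FROM sys.database_role_members rm
JOIN sys.database_principals r ON rm.role_principal_id  = r.principal_id
JOIN sys.database_principals m ON rm.member_principal_id = m.principal_id
WHERE r.name LIKE '%rc'
  AND m.name = 'loaduser';
```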

Concurrency Slots:
  • Concurrency slots are used to track the resources available for query execution.
  • Before a query starts executing, it must be able to reserve enough concurrency slots.
  • When a query completes, it releases its concurrency slots.
  • A query running with 10 concurrency slots can access 5 times more compute resources than a query running with 2 concurrency slots.
  • If each query requires 10 concurrency slots and there are 40 concurrency slots, then only 4 queries can run concurrently.
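Slot usage and queuing can be observed through the dedicated SQL pool's dynamic management views. A simple sketch (queued queries show up with a Suspended status):

```sql
-- Inspect running and queued requests and their assigned resource class.
SELECT request_id,
       status,
       resource_class,
       submit_time,
       start_time
FROM sys.dm_pdw_exec_requests
WHERE status IN ('Running', 'Suspended')
ORDER BY submit_time;
```

A steady stream of Suspended requests is a sign that queries are waiting on concurrency slots, which can be addressed by scaling up DWUs or assigning smaller resource classes.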

Best Practices
  • Reduce cost by pausing and scaling the Azure Synapse data warehouse.
  • Use the hash distribution method when creating large tables.
  • Choose a larger dynamic resource class to improve query performance.
  • Choose a smaller dynamic resource class to increase concurrency.
  • Enable AUTO_CREATE_STATISTICS so the optimizer automatically creates column statistics and builds better query plans.
  • Use the smallest possible column size.
  • Use PolyBase to load data quickly; CTAS is the fastest way to load data.
  • Order the clustered columnstore index (CCI) columns on large tables for faster query performance.

Future Releases (Synapse Workspace, currently in preview)
