In this post I want to cover some file tests that I performed for Azure Synapse Link for SQL Server 2022.
A while back, I spotted something interesting whilst doing some initial tests for Azure Synapse Link for SQL Server 2022. When you add new data after the initial load, a new folder called ‘ChangeData’ appears in the storage account container. I noticed that the new file containing the insert was a comma-separated values (csv) file, whereas the table used for the initial load was stored as a parquet file.
It got me wondering whether Azure Synapse Link for SQL Server 2022 only creates csv files, or whether there is a tipping point at which it starts to create parquet files. With this in mind, I thought I would experiment and find out.
By the end of this post, you will know whether there is a tipping point once a certain number of rows have been inserted.
Azure Synapse Link for SQL Server 2022 recently came out of preview and is now generally available.
Before the file tests for Azure Synapse Link
I wanted to use the same Azure Synapse workspace and SQL Server that I had configured for a previous post, where I covered some initial tests for Azure Synapse Link for SQL Server 2022.
You can refer to that post for a recap of Azure Synapse Link for SQL Server 2022. Anyway, before I could use that setup for the file tests, I had to prepare some things.
First of all, I had to check that the WideWorldImporters database was set up correctly.
Secondly, I had to create a new dedicated SQL pool within Azure Synapse Analytics. I also had to add a master key to its database and add it as the target pool in Azure Synapse Link.
After doing this I got an error about a token, because I had not used the link for a while.
To resolve this, I had to go to the Integrate hub. From there I selected the link, and then in the Properties section on the right-hand side I selected Rotate token, as below.
Note that you might not see this option immediately in your web browser. You might have to zoom out to see the option on the bottom right-hand side.
I then started the link again and checked that it was working by going into the Monitor hub and selecting Link connections.
Once I confirmed that the link was working, I started my tests.
Finding a table in the storage account container
Before I go any further, here is a tip for anybody looking to explore the storage account container that the created files are in. To find the right subfolder name for a table, go into the Link connections section in Synapse Studio and click on the link name.
Amongst the list of tables, you can find the right folder name in the Link table ID column.
To make your life even easier you can copy this Link table ID value within Synapse Studio by clicking on it.
You can then browse the container using Storage browser and paste in the copied value to quickly find the right folder.
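As a rough sketch of that lookup, here is a hypothetical Python helper that filters a list of blob paths down to the folders matching a copied Link table ID. The paths and the ID below are made up for illustration; in practice you would simply paste the real value into the Storage browser search box.

```python
def find_table_folders(blob_paths, link_table_id):
    # Hypothetical helper: keep each path up to and including the
    # folder that matches the copied Link table ID.
    matches = set()
    for path in blob_paths:
        parts = path.strip("/").split("/")
        if link_table_id in parts:
            matches.add("/".join(parts[: parts.index(link_table_id) + 1]))
    return sorted(matches)

# Made-up blob paths and Link table ID for illustration.
paths = [
    "ChangeData/9c8f-example-id/1.csv",
    "Snapshot/9c8f-example-id/part-0001.parquet",
    "ChangeData/another-table-id/1.csv",
]
print(find_table_folders(paths, "9c8f-example-id"))
```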
File tests for Azure Synapse Link for SQL Server 2022
For these tests I decided not to use the stored procedure that I mentioned in my other post. Instead, I decided just to insert data directly into the Sales.OrderLines table.
I did my first set of tests using Continuous mode, which continuously scans for updates. You can set this in the properties for the link.
I first tested adding one new row, to check that a new csv file would get created again. After inserting the new row, I checked the folder and a new csv file had been created.
Afterwards I wanted to see what happened if I inserted one hundred thousand new rows.
Another csv file was created, and it appeared appropriate in size as well.
With this in mind, I thought I would raise the stakes and inserted ten million new rows into the table.
I checked again and saw that a larger csv file had been created, which would suggest that there is no tipping point at which Azure Synapse Link starts creating parquet files instead of csv files.
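For a bulk insert of that size, batching the statements helps. Below is a minimal sketch, assuming a simplified column list for Sales.OrderLines (not the real table definition), of how generated INSERT statements could be chunked rather than sending ten million single-row inserts. SQL Server's table value constructor allows at most 1,000 rows per VALUES clause, hence the default batch size.

```python
def insert_batches(n_rows, batch_size=1000):
    # Yield multi-row INSERT statements so a big load is sent in
    # chunks; SQL Server allows at most 1,000 rows per VALUES clause.
    # The column list is illustrative, not the full table definition.
    done = 0
    while done < n_rows:
        count = min(batch_size, n_rows - done)
        values = ",".join(
            f"({done + i}, 'Test item {done + i}', 1)" for i in range(count)
        )
        yield (
            "INSERT INTO Sales.OrderLines (OrderLineID, Description, Quantity) "
            f"VALUES {values};"
        )
        done += count

# 2,500 rows come out as three statements: 1,000 + 1,000 + 500 rows.
statements = list(insert_batches(2500))
print(len(statements))  # 3
```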
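The check I was doing by eye can be sketched as a quick tally of the change files by extension; if a tipping point existed, a `.parquet` entry would show up in the counts. The file names below are made up for illustration.

```python
from collections import Counter
from pathlib import PurePosixPath

def tally_extensions(blob_names):
    # Count change files by extension to spot any csv-to-parquet
    # tipping point.
    return Counter(PurePosixPath(name).suffix for name in blob_names)

# Made-up names standing in for the files created by each test.
change_files = [
    "ChangeData/1.csv",  # one new row
    "ChangeData/2.csv",  # one hundred thousand rows
    "ChangeData/3.csv",  # ten million rows
]
print(tally_extensions(change_files))  # Counter({'.csv': 3})
```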
Batch mode file tests for Azure Synapse Link for SQL Server 2022
I stopped the link and changed it over to batch mode, to see if batch mode handled creating the files any differently.
My main gripe about switching the link over to batch mode is that I could only set the interval for processing the data to between twenty and sixty minutes. However, I am sure that latency is fine for those who want to use this as part of a data warehousing scenario.
Anyway, I encountered issues when I tried to switch my existing link over to batch mode. You have to expect issues like this when using a feature that was, until recently, in preview.
So, I created a brand-new link in batch mode against a restored copy of the database instead.
To save myself a bit of time, I selected only the Sales.OrderLines table for the new link, since I only wanted to test whether there is a tipping point in batch mode at which parquet files are created instead of csv files.
It made no difference to the types of files created, as you can see below.
Because some would say that ten million rows is not a large number in the land of big data, I did one final test, this time with one hundred million rows.
However, as you can see below, it does look to confirm that the files that contain changed data are always csv files.
I must admit this one took a while to do. Plus, it took a while for the data to be fully loaded into the new file within the container. In fact, it took around eleven minutes for all the data to upload into the file and for the file to become available in Storage browser.
How Batch mode handles files
On a side note, I was able to confirm during these tests that the data is collected into files as soon as possible in batch mode, just like in Continuous mode.
This confirms that the interval value you enter is the latency at which data is uploaded from the files in the container to the dedicated SQL pool, just as the Azure Synapse Link for SQL Server 2022 documentation suggests.
I confirmed this by checking several times during the first twenty minutes that the new data had not appeared in the dedicated SQL pool. It did appear after the twenty-minute window.
Doing it this way means that there is less risk of bottlenecks appearing in the SQL Server database, which is one of the main purposes of Azure Synapse Link.
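That timing behaviour can be expressed as a tiny illustrative helper: given when a change file lands in the container and the configured batch interval, the rows should be visible in the dedicated SQL pool at worst one full interval later. This is just my reading of the observed behaviour, not an official formula from the documentation.

```python
from datetime import datetime, timedelta

def expected_visible_by(file_written_at, batch_interval_minutes=20):
    # Worst case: rows appear in the dedicated SQL pool one full
    # batch interval after the change file lands in the container.
    return file_written_at + timedelta(minutes=batch_interval_minutes)

written = datetime(2023, 1, 1, 9, 0)
print(expected_visible_by(written))  # 2023-01-01 09:20:00
```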
Conclusion of file tests
Based on the above results, it looks like when Azure Synapse Link is first started, the initial load of the tables creates files in parquet format. Any new or changed data afterwards appears to be written to csv files.
At this moment in time, there does not appear to be a tipping point at which Azure Synapse Link creates parquet files instead of csv files for larger amounts of inserted data, which I find interesting for a couple of reasons.
Firstly, because the initial snapshot creates parquet files. Secondly, because I suspect that parquet files would perform better for larger amounts of changes.
I discovered a few other things whilst testing.
Firstly, detaching the SQL Server database will definitely stop the link working.
Secondly, it looks like scaling dedicated SQL pools can be an issue. I encountered a problem with one of the tables that I was replicating whilst scaling out a dedicated SQL pool while Azure Synapse Link for SQL Server 2022 was running. Maybe it would have reconnected eventually, but the error looked fairly severe.
Thirdly, it can take a while for the link to stop gracefully. However, during testing I discovered it is better to stop it gracefully to avoid potential issues.
I hope my file tests for Azure Synapse Link for SQL Server 2022 are of interest to some of you.
Like I said earlier, my main goal was to check if there was a tipping point for the files created by Azure Synapse Link when new data was inserted. However, it looks like, at this moment in time, all the files created for inserts are csv files.
Anyway, if you have any comments or queries about this post feel free to reach out to me.