Azure Storage Account
Azure Data Box Disk – Order, Usage, and Performance
Reading Time: 7 minutes
Data Box Disk Overview
I have written in the past on the considerations of using Data Box for offline data transfers into Azure or using online methods, which was primarily focused on Data Box Heavy. Here I am going to walk through the process of obtaining a Data Box, specifically a Data Box Disk (see the Data Box Family of offerings here). The ordering process for all Data Box devices is largely the same, and this can be used as a reference for any of them. However, the primary focus of this post will be on the setup and usage of Data Box Disk.
If you’ve read my previous post, in which I argued the merit of using an online method of transfer in many cases, you may find it odd that I am now promoting the Data Box Disk, which is only suitable for transferring a few TB of data. I maintain my position that in most cases online transfer is optimal, especially for the type of data that would be in scope for a Data Box Disk. However, as I have noted, there are some cases where offline data transfer is needed.
Order and Setup
Ordering a Data Box is straightforward through the Azure Portal.
After you’ve selected the initial configuration items, you will choose the device type.
You will name the order and select the destination storage in Azure.
After confirming whether you’re using a Microsoft-Managed Key or Customer-Managed Key (in this case I’m using a Microsoft-Managed Key) you will enter shipping information and the order will be submitted. In each step of the process, you will receive an email with the status. For example, here is the notification that my order was created and then again when it was delivered.
When you create the job in Azure, it creates a Data Box resource, which has all of the information about the device and order including a timeline showing where the device is in the process.
The Disk arrived with the SATA to USB cable, and I hooked it up to my Intel NUC (excuse the dust!).
Note in the image above that both the USB adapter and the ports on my device are marked with “SS”, meaning they’re USB 3.0. This is important because the Data Box Disk is an SSD, which is very performant. Note also that the email stating the device was delivered gives me a certain period of time to ship it back before I start incurring additional cost.
Most enterprise servers only have USB ports to support peripherals, and thus do not invest in USB 3.0 or 3.1, leaving you with the 2.0 standard. The maximum theoretical throughput of USB 2.0 is 480 Mbps, or 60 MBps. The maximum theoretical throughput of USB 3.0, however, is 5 Gbps, or 625 MBps. This matters: if the servers only have USB 2.0 ports, it may even be faster to attach the disk to a laptop that has Gigabit network connectivity to wherever the source data is held.
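To make that arithmetic concrete, here is a minimal Python sketch of the conversion and of best-case copy times. The 4 TB payload and the Gigabit Ethernet comparison row are assumptions chosen for illustration, and these are theoretical maximums, so real copies will be slower.

```python
def mbps_to_MBps(megabits_per_second: float) -> float:
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return megabits_per_second / 8


def hours_to_copy(data_tb: float, throughput_MBps: float) -> float:
    """Best-case hours to move data_tb terabytes (decimal: 1 TB = 1,000,000 MB)."""
    seconds = data_tb * 1_000_000 / throughput_MBps
    return seconds / 3600


usb2 = mbps_to_MBps(480)      # 60.0 MBps
usb3 = mbps_to_MBps(5000)     # 625.0 MBps
gigabit = mbps_to_MBps(1000)  # 125.0 MBps

# 4 TB is a hypothetical payload size, not a figure from the tests in this post.
for name, rate in [("USB 2.0", usb2), ("Gigabit Ethernet", gigabit), ("USB 3.0", usb3)]:
    print(f"{name}: {rate:.0f} MBps, about {hours_to_copy(4, rate):.1f} hours for 4 TB")
```

Even at theoretical speeds, a Gigabit network link outpaces a USB 2.0 port, which is why the laptop route can win.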
*Note:* I am doing this in Windows, but you can do all of the following in Linux as well.
If I look in Windows Explorer when I attach the drive I can see a volume, but it is encrypted and locked. That is intentional and a part of the security process with Azure Data Box.
The process for allowing access to each device in the Data Box family is different, but with Data Box Disk there is a utility to unlock the device, which in combination with the passkey available under the Data Box resource in Azure, will unlock the device.
At the root of the filesystem, you will see a folder for all the storage types, Table, Queue, File, Blob, and Managed Disk; what you copy here will get copied to the respective storage type at the destination.
If you have a lot of small files, one thing to note is the impact of antivirus. Especially if you’re pulling TBs worth of small files across the network to a laptop where the drive is attached, since it’s writing those files locally your antivirus will likely do in-line scanning. Depending on the data and whether your policies allow, adding an exception on your antivirus for the folder where you’re copying the data e.g. “F:\BlockBlob” may speed up your copy performance.
To test performance, I devised two tests, one with large files and one with small files. For the large files, I copied a bit over 50 GB of .iso files of various Linux distributions. The copy below is simply Ctrl+C, Ctrl+V of that folder from my machine’s SSD to the Data Box Disk using Windows Explorer. In addition to the copy operation, I took a screenshot of the disk throughput and activity in Task Manager (which shows, by way of disk queuing metrics, how much of the available performance is being used).
You can see with a single copy job I’m getting over 300 MBps for those large files. I then also wanted to try small files, which is a much more likely use case for Data Box Disk. For this I used a PowerShell script to create 10,000 x 1 MB files (it is part of another project I’m working on, which will be posted soon on my GitHub). I again first copied them using Windows Explorer.
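The PowerShell script itself is not shown here. As a stand-in, the following Python sketch generates a comparable set of small files; the function name, file naming pattern, and defaults are my own assumptions, not the original script.

```python
import os
import tempfile
from pathlib import Path


def create_test_files(target_dir: str, count: int = 10_000, size_bytes: int = 1024 * 1024) -> int:
    """Create `count` files of `size_bytes` random bytes each; return total bytes written."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    total = 0
    for i in range(count):
        # Random content keeps compression or deduplication from skewing a copy test.
        (target / f"testfile_{i:05d}.bin").write_bytes(os.urandom(size_bytes))
        total += size_bytes
    return total


# Small demo run in a throwaway directory; the defaults would write roughly 10 GB.
with tempfile.TemporaryDirectory() as demo_dir:
    written = create_test_files(demo_dir, count=5, size_bytes=1024)
    print(f"Wrote {written} bytes to {demo_dir}")
```

Calling `create_test_files` with its defaults against a real folder produces the 10,000 x 1 MB layout described above.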
I was able to get just over 50MBps in write speeds, which is good considering the file sizes, but given there were no constraints on my source disk, destination disk, or CPU, this led me to believe that the bottleneck was with the copy operation itself. Next, I wanted to run a test with a multi-threaded copy operation, so I first set a baseline with a single-threaded robocopy job.
You can see this took about 3 and a half minutes and copied at roughly the same speed as Windows Explorer. Now that I have my baseline, here’s the real performance test using the multi-threading flag on robocopy.
With that flag I was able to push over 3x the performance, increasing from ~50 MBps to ~190 MBps and reducing the copy time from 3 minutes and 33 seconds to just 58 seconds, at which point my hardware was fully utilized.
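If you want to see why more threads help with small files, here is a minimal Python sketch of a multi-threaded copy. It is not how robocopy works internally, only an illustration of the same idea: several files are in flight at once, so per-file overhead overlaps. The function name and the default of 8 threads are arbitrary choices of mine.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_tree_threaded(src: str, dst: str, threads: int = 8) -> int:
    """Copy every file under src to dst using a thread pool; return files copied."""
    src_root, dst_root = Path(src), Path(dst)
    files = [p for p in src_root.rglob("*") if p.is_file()]

    def copy_one(path: Path) -> None:
        target = dst_root / path.relative_to(src_root)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)  # copies the data plus timestamps

    with ThreadPoolExecutor(max_workers=threads) as pool:
        # list() waits for every copy and re-raises the first error, if any.
        list(pool.map(copy_one, files))
    return len(files)
```

Setting `threads=1` gives a single-threaded baseline to compare against, much like the baseline robocopy run above.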
I also went back and tried the same multi-threaded copy operation with my large files and was able to increase the throughput from 334MBps to 522MBps which fully utilized my hardware as well.
I finished loading my data onto the disk and utilized the data validation utility, which comes in the same download as the tool that unlocks and decrypts the drive, to generate checksums of my data on the device which I can use later to validate data integrity when it is copied into the Storage Account. After that I unmounted the device, packaged it back up and dropped it off at my local UPS store – the box already had a return label on it.
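The validation utility handles checksum generation for you. To illustrate the underlying idea only, here is a Python sketch that builds a checksum manifest for a folder and compares two manifests. SHA-256 and the function names are my assumptions; this is not the format or algorithm the Data Box tooling uses.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: str) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    root_path = Path(root)
    return {
        p.relative_to(root_path).as_posix(): sha256_of_file(p)
        for p in sorted(root_path.rglob("*"))
        if p.is_file()
    }


def differences(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Paths that are missing on either side or whose checksums differ."""
    return sorted(k for k in before.keys() | after.keys() if before.get(k) != after.get(k))
```

Building one manifest from the source folder and one from the copy, then passing both to `differences`, returns an empty list when everything matches.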
Similar to when the device was being shipped to me, I got email notifications for each step of the way including when the data copy started, and when it finished. The process is then marked as complete and all of the details are available in the portal.
You can see the data is now loaded into the Storage Account, and you will see a “databoxcopylog” folder as well. You can use it to validate the copy jobs, as it includes the final checksums of the files.
Lastly, you will see a one-time charge for the device on your invoice. You can see here the $90 fee for the Data Box Disk in Azure Cost Management.
*Note*: You will still be charged for any transactions that take place when loading the data into your storage account.
The data is now all loaded, and I get a confirmation via email (which is also shown in the portal screenshot above) that the device has been erased in accordance with NIST 800-88r1 standards. As I noted above, the process for ordering the device is largely similar for the Data Box or the Data Box Heavy.
If you have any questions, comments, or suggestions for future blog posts please feel free to comment below, or reach out on LinkedIn or Twitter. I hope I’ve made your day a little bit easier!
Shared Storage Options in Azure: Part 3 – Azure Storage Services
Reading Time: 8 minutes
Welcome to Part 3 of this 5-part Series on Shared Storage Options in Azure. In this post I’ll be covering Azure Storage Services. You may be thinking to yourself, wait a minute, what have we been talking about this whole time then? Azure Storage Services is most easily thought of as the term for the services offered under an Azure Storage Account (Blob, File, Queue, Table). Given the context of this series, I’ll be discussing Azure Blob Storage and Azure File storage in this post. I do want to add a disclaimer, though, that technically Queue and Table Storage can be “shared” as well, since multiple apps can call the same Queue or Table using the APIs. Since the focus here is more on the system level, I’m not going to cover those two, but I’ll add some links to documentation where you can read more.
- Part 1: Azure Shared Disks
- Part 2: IaaS Storage Server
- Part 3: Azure Storage Services
- Part 4: Azure NetApp Files
- Part 5: Conclusion
Azure Blob Storage:
In the majority of cases, when people discuss “cloud storage” they’re talking about Blob – binary large object. What this service allows us to do is store massive amounts of unstructured “objects” in Azure. There are a couple of ways we can use Blob storage as shared storage at the system level.
Shared Blob Storage:
As I mentioned in the introduction, all Azure Storage Services can be accessed over HTTP/S via API or using any of the client libraries. This means that they can all technically be “shared” storage, but what about system-level access? While I find most applications and solutions can be adapted using a client library, there is a project called “Blobfuse” which can be used for more traditional applications.
Blobfuse is an open-source project on GitHub which uses the libfuse library to pair together the Linux FUSE kernel module and the Azure Blob REST APIs to create a virtual filesystem. The result of this configuration is a mount point on a Linux machine directly to a Blob Storage Account. There can be certain challenges in using Blobfuse though. For example, the result is NOT a POSIX-compliant filesystem, and if you mount the same Blob Storage from multiple machines you should keep those limitations in mind.
The default configuration for the setup of Blobfuse is to have your Storage Account name and Access Key in a plain-text configuration file sitting on your server, which is not ideal from a security perspective and should be noted. However, it is possible to use a Managed Service Identity with Blobfuse, which significantly improves the security posture of the deployment (if you use a System Assigned Managed Identity) and is something I would recommend over the default configuration. Lastly, Blobfuse is not available on Windows – Linux only.
As of the time of writing this blog post (January, 2021) NFS is not yet Generally Available (GA) on Blob Storage, but NFS 3.0 has been in preview since July 2020. Once this goes GA I will update this post with that information, but won’t quote this as an option until that point.
Lastly, from a backup and disaster recovery perspective, Azure Blob Storage supports snapshots as well as Point-in-Time restore for block blobs.
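To make the HTTP/S access mentioned at the start of this section a bit more concrete, here is a small Python sketch that builds the HTTPS address of a blob in the Azure public cloud. The account, container, and blob names are hypothetical, and actually reading the blob would still require authorization (or a container that allows anonymous access).

```python
from urllib.parse import quote


def blob_url(account: str, container: str, blob_name: str) -> str:
    """Build the public-cloud HTTPS endpoint for a blob."""
    # quote() percent-encodes special characters but keeps "/" so virtual folders survive.
    return f"https://{account}.blob.core.windows.net/{container}/{quote(blob_name)}"


# All three names below are made up for illustration.
print(blob_url("mystorageacct", "backups", "2021/january/report 1.csv"))
```

Every blob being addressable this way is what lets the client libraries, and tools like Blobfuse, sit on top of the same REST API.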
Typical Use Cases:
The majority of the use cases I’ve seen that use Blob as shared storage at the system level want consumption-based cloud storage without the overhead or limitations of a managed disk. Blobfuse comes into play specifically for applications that don’t support SMB natively and require a local mount point. I have seen this with a lot of legacy applications that are migrated into the cloud and want lower cost and higher capacity than managed disks offer. I’ve also seen this configuration with many HPC applications since NFS as an access protocol is not yet GA for Blob storage.
Cost, Performance, Availability and Limitations:
The cost of using Blob storage is always the same regardless of the access protocol, since as of now it all ends up going through the Azure Storage API anyways.
Blob storage is incredibly performant. There are two tiers of Blob storage, Standard and Premium. In most cases, Standard will be the appropriate tier. Premium is for storage that needs single-digit transaction times and is better suited for larger block sizes (256KiB+). Though do keep in mind, that similar to my comparison of Managed Disk Types and the cost calculation of capacity and transaction costs, in some scenarios Premium Block Blob Storage may be cheaper.
If you’re using a standard Blob storage account (not configured with a Hierarchical Namespace) which is most common, you’ll enjoy the following performance (as of January, 2021).
Image Reference: https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/azure-subscription-service-limits#storage-limits
When configuring containers in your Blob Storage Account you’ll notice an access tier setting with the options “Hot, Cool, Archive”. I covered Archive Storage a few years back but it’s not really relevant to this topic. What is relevant though is the difference between Hot and Cool storage. There still seems to be a lot of confusion around the difference between the two, but at its core the main difference is transaction cost.
Similar to the difference between Premium and Standard SSD Managed disks, Hot Blob storage has a higher capacity cost but a lower transaction cost while Cool Blob Storage has a lower capacity cost and a higher transaction cost. If you’re storing data that is infrequently accessed but still needs to be constantly available, Cool Blob Storage is the way to go. If you’re storing data that has a lot of transactions then Hot Blob storage is your best bet. Don’t get caught up in the “per GB” sticker on each tier – this can be misleading to the resulting cost depending on your workload characteristics.
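A quick way to see this trade-off is to model it. The Python sketch below uses placeholder prices that I made up purely to show the shape of the calculation; they are not Azure’s actual rates, and the model ignores other charges such as data retrieval and early deletion.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    price_per_gb_month: float        # capacity price (hypothetical)
    price_per_10k_operations: float  # transaction price (hypothetical)

    def monthly_cost(self, stored_gb: float, operations: int) -> float:
        """Capacity cost plus transaction cost for one month."""
        return stored_gb * self.price_per_gb_month + operations / 10_000 * self.price_per_10k_operations


# Placeholder prices only; check the Azure pricing page for real numbers.
hot = Tier("Hot", price_per_gb_month=0.02, price_per_10k_operations=0.05)
cool = Tier("Cool", price_per_gb_month=0.01, price_per_10k_operations=0.10)


def cheaper_tier(stored_gb: float, operations: int) -> str:
    """Name of the tier with the lower modeled monthly cost."""
    return min((hot, cool), key=lambda t: t.monthly_cost(stored_gb, operations)).name


for ops in (100_000, 10_000_000):
    print(f"1,000 GB with {ops:,} operations per month: {cheaper_tier(1000, ops)}")
```

With these made-up numbers, the same 1,000 GB is cheaper in Cool at low transaction volume and cheaper in Hot at high volume, which is exactly why the “per GB” price alone can mislead.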
As far as durability and availability go, Storage Accounts have a few different options depending on the storage service being used: LRS, ZRS, GRS, GZRS, and RA-GRS. There is a lot of information on these different redundancy levels, so take a look at the durability and availability table below and if you want to read more, click the link below.
Image Reference: https://docs.microsoft.com/en-us/azure/storage/common/storage-redundancy
Additional Reading on Understanding Azure Storage Redundancy Offerings: https://techcommunity.microsoft.com/t5/azure-storage/understanding-azure-storage-redundancy-offerings/ba-p/1431700
Lastly, I recently re-created an outdated version of an infographic on capacity limits for Azure Storage Accounts and I thought I would share that here.
Feel free to reference this image using the following: https://urls.hansencloud.com/azure-storage-limits
Azure Files:
Azure Files is another storage service under the Azure Storage Account. It has similar shared features, along with some that are distinct to it. The primary purpose of Azure Files is to provide file-level storage services like you would get from a Network Attached Storage appliance or a Server providing those access protocols to a filesystem share. Azure Files provides SMB access (as well as HTTP/S API) to provisioned shares.
Like I mentioned, the primary access methods for Azure Files are through the API or by using SMB (NFS v4.1 is currently in preview so I won’t be considering it an option in this post as of right now, but will update it when it goes GA). Even though SMB is most typically used with Windows machines, shares on Azure Files can be used by Windows, Linux, or even macOS.
Something really interesting about Azure Files though is its Azure File Sync capability. It allows for a centralized file share in Azure Files, with agents deployed on Windows Servers that act as a cache for the Azure Files data. This is particularly interesting because it allows the Server itself to present whichever access method it would like to the client, while being backed by a centralized Azure Files Share.
The way Azure File Sync works at a high-level is a File Share is created, then linked to what is called a “sync group”, which facilitates the registrations from any agents deployed on Windows Servers (in Azure or on-prem).
Azure Files also allows, in conjunction with typical access key authentication, Active Directory-based authentication options. The ability to use this type of AuthN directly on the Azure Files PaaS endpoint is really interesting and makes it a great choice for a solution where you want to leverage the identity systems you already have in place. It’s also worth noting that if you’re using Azure File Sync, the deployed agent is the only one communicating with the File Share directly and the access to the data locally can be controlled through whichever method you prefer (SMB ACLs with ADDS, for example).
Lastly, from a backup and disaster recovery perspective, Azure Files supports snapshots in addition to native integration with Azure Backup.
Typical Use Cases:
I see a mix of uses with Azure Files, ranging from a file-based backend for various applications and services to environments where the data is accessed directly by users. A scenario I’ve run into more frequently though is when companies want to replace traditional on-prem File Servers and even things like DFS. Anywhere you want to leverage SMB in a fully managed PaaS way, Azure Files is for you.
Cost, Performance, Availability and Limitations:
Similar to Blob Storage, Azure Files has multiple Tiers to help optimize for performance and Cost.
Image Reference: https://azure.microsoft.com/en-us/pricing/details/storage/files/
Again, these tiers are priced based on capacity (provisioned or consumed) in combination with transactions and any snapshots or backups.
Performance for Azure Files depends on whether you use a standard storage account or the specific “Azure Files” storage account SKU, which enables “Premium” File Shares. The performance specifications for a standard storage account (e.g. General Purpose v2) are the same as the limits posted for blob storage earlier. If you’re using Premium Files though, here are the performance targets.
Image Reference: https://docs.microsoft.com/en-us/azure/storage/files/storage-files-scale-targets
Keep in mind that the 100TB limit is per share, and you can create multiple (just like you would with traditional file shares) up to the limit of the Storage Account (5 PB by default, as stated earlier in this post).
Lastly, availability for Azure Files is no different than Blob since they’re both contained by a storage account and will be subject to the availability and durability of the storage account data redundancy setting.
Pros and Cons:
Okay, here we go with the Pros and Cons of using Azure Storage Services (Blob & File) for your shared storage configuration on Azure.
Pros:
- Both are PaaS and fully managed which greatly reduces operational overhead.
- Significantly higher capacity limits as compared to IaaS.
- Ability to migrate workloads as-is and use existing storage configurations when using SMB or Blobfuse.
- Ability to use Active Directory Authentication in Azure Files.
- Both integrate with native backup solutions.
- Both integrate with Azure Defender for Storage.
Cons:
- Blobfuse stores connection information in plain-text, by default.
- Does not support older access protocols like iSCSI.
Alright, that’s it for Part 3 of this blog series – Shared Storage on Azure Storage Services. Please reach out to me in the comments, on LinkedIn, or Twitter with any questions or comments about this post, the series, or anything else!