AVD Accelerator: Lessons learned

Introduction

I am a huge fan of Azure Virtual Desktop. This service has matured over the years (I guess no one remembers the “classic” version anymore), especially in the management and monitoring area.

At the same time, I am a strong advocate of the Azure Landing Zones reference architecture and the related landing zone accelerators. I have personally helped several customers with ALZ design and implementation in their organisations.

💡
This post was originally published on my WordPress blog at the following address: AVD Accelerator: Lessons learned – David Pazdera (pazdedav.blog)

Currently, I am part of a project delivery team where, together with several other colleagues, we are helping our customer with a datacenter migration to Azure. As part of the scope, we were asked to deploy AVD in a corporate environment in an 'everything as code' way.

We assessed several implementation options. In the end, we decided to try AVD Accelerator, an open-source automation from Microsoft that helps with AVD deployments using best practices.

The following diagram shows what gets deployed as part of the Accelerator:

AVD Accelerator reference architecture. Source: GitHub repo, available in png and vsdx formats

The solution assumes that an appropriate platform foundation (with all the policies and governance controls) is already set up, which may or may not be the official ALZ platform foundation. This was perfect, because our customer indeed had a fully working platform foundation based on the Contoso reference implementation with Azure Virtual WAN and all the 'bells and whistles'.

We were expecting a smooth deployment with minor customisations and a truly short implementation time. Spoiler alert: we succeeded in the end, but it was a bumpy ride. I will describe most of the pitfalls we faced in this post.

TL;DR: We hit several obstacles with Azure policies that came as part of the ALZ reference implementation. In addition, there were a couple of customer-specific issues.

Our setup

Our customer has built their own subscription vending machine, fully integrated with ServiceNow and GitHub, so getting a couple of new subscriptions for the AVD workload was quick and easy. We knew that 'AVD Baseline' requires line of sight to domain controllers and corporate connectivity, so we initially requested two 'Corporate landing zones'. I say initially because that's not the setup we ended up with.

Our landing zones come with GitHub repositories with preconfigured workload identities and 'starter' CI/CD workflows, which simplified the bootstrapping process: picking the relevant folders from the AVD Accelerator, committing, and pushing them to our landing zone repos.

Since we are in a brownfield scenario with centrally managed networks, DNS, firewall, and AD, our subscriptions came preconfigured with a spoke VNet peered to a vHub in the target region, routing already configured, and custom DNS settings pointing to replica domain controllers.

The Accelerator allows for the ‘bring your own VNet’ option. In addition, both solution modules (deploy-baseline.bicep and deploy-custom-image.bicep) expose many parameters that help with customizing various properties of underlying cloud resources, including resource names.

We divided our solution into two logical parts, each having its own GitHub repo and a landing zone:

  • AVD Baseline – designed and configured for specific business apps and users.

  • AVD Custom Image – an automation component used to create custom VM images, with Azure Image Builder at its heart.

Baseline goes first

All we needed was to copy the code from the Accelerator and update our main.bicep template file to:

  • create a subnet in the existing VNet for session hosts

  • call the deploy-baseline.bicep solution module and provide all input parameters (a minimal invocation sketch follows this list). All sensitive inputs, including credentials for domain join, were made available as GitHub secrets, of course.
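Before wiring everything into the GitHub workflow, it can be handy to validate the wrapper template locally with a subscription-scope what-if deployment. A minimal sketch, assuming illustrative parameter names and values (the full parameter set of deploy-baseline.bicep is much larger):

# Validate main.bicep with a subscription-scope what-if deployment (illustrative parameters only)
$params = @{
    avdIdentityDomainName    = 'contoso.com'
    avdDomainJoinAccountName = $env:AVD_DOMAIN_JOIN_USER   # supplied as a GitHub secret in the real workflow
}
New-AzSubscriptionDeployment -Name 'avd-baseline-whatif' -Location 'westeurope' `
    -TemplateFile './main.bicep' -TemplateParameterObject $params -WhatIf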

When I pushed the code to the upstream repo, it automatically triggered a workflow that validated the Bicep code and tried to deploy it to Azure. As you might expect, it did not succeed (and it wouldn't for the next couple of days).

Deployment scripts

The first problem we encountered was with deployment scripts. In several places in the solution (like the storage configuration for FSLogix profiles), they run a simple PowerShell script that 'sleeps' for a couple of minutes, to ensure that the previous sub-deployment succeeded and the ARM engine had enough time to sync/replicate (meta)data about resources or Azure AD objects.

This on its own wouldn't be a problem, but it does not work with the ALZ policies assigned to the 'Corp' management group, where our subscription resides, namely 'Public network access should be disabled for PaaS services'. Why?

If you are familiar with how deployment scripts work, you know they provision a temporary storage account and a container instance where the script runs. Those resources are not declared directly; they are provisioned 'behind the scenes'. There is a way to use an existing storage account, but since you don't control how the ACI instance gets provisioned (you cannot deploy it to an existing VNet, where it could communicate with the storage account over its private endpoint), this would not help. Also, the current version of the 'deploymentScripts' resource module in the Accelerator doesn't support this type of customisation (bring your own storage). Since the policy requires that each PaaS service deployed in this type of landing zone has a private endpoint configured and public access disabled, the deployment failed here!

Our workaround: we asked for a policy exemption for this policy assignment. It wouldn't be the last one.
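For reference, this is roughly what such an exemption looks like when created with Az PowerShell. A sketch with illustrative names and scopes; in practice, the exemption request went through the platform team:

# Find the assignment carrying the 'Public network access...' policy at the Corp management group
# (on newer Az.Resources versions the property is $_.DisplayName rather than $_.Properties.DisplayName)
$assignment = Get-AzPolicyAssignment -Scope '/providers/Microsoft.Management/managementGroups/corp' |
    Where-Object { $_.Properties.DisplayName -like '*Public network access*' } |
    Select-Object -First 1
# Exempt only the AVD landing zone subscription, with an expiry date so it does not linger forever
New-AzPolicyExemption -Name 'avd-deployment-scripts-exemption' `
    -PolicyAssignment $assignment `
    -Scope "/subscriptions/$avdSubscriptionId" `
    -ExemptionCategory 'Waiver' `
    -ExpiresOn (Get-Date).AddDays(30) `
    -Description 'Deployment scripts need a transient storage account and ACI with public access'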

VM names matter

Our next “roadblock” was also related to Azure Policies, but this time, it was customer-specific, and not part of ALZ reference implementation.

The thing is, in Corp landing zones, you can expect that many of the workloads deployed there depend on traditional Active Directory and a domain join process.

Interesting fact: ALZ reference architecture doesn’t address this topic directly, so I have seen different approaches in projects.

One of the many practical problems you need to solve is how to ensure VM name uniqueness across on-premises and Azure, especially since platform teams delegate access to landing zones to application teams and don't have control over naming conventions for VMs (and other resources) deployed there. You need to ensure that someone doesn't (accidentally or on purpose) "take over" an existing computer object in AD with the same name.

There are several ways to solve this, but our customer chose the following model:

  • Permission to join a computer to the domain is not delegated directly to users.

  • Instead, they can opt-in to have their Azure VMs joined to AD automatically by simply adding a resource tag with a specific name and a value.

  • Behind the scenes, a DINE (DeployIfNotExists) policy kicks in; it has access to AD credentials and deploys a VM extension that does exactly that.

  • All Azure subscriptions follow a strict naming convention that contains a unique four-digit number.

  • To ensure that VM names won’t collide, there is a Deny policy in place that ensures that VM names in a subscription (e.g., az-corp-1234-workloadname) follow a particular naming pattern like az1234vmXXXX. In this way, potential conflicts are limited to a single subscription that is controlled by one team (owner), so it is their responsibility to ensure name consistency.

Once we realised such a policy existed, we tweaked the code and used custom resource names. If you hit the same issue, keep in mind that there are two "types" of VMs you need to handle this way: the AVD session hosts and a 'Management VM', a temporary resource used to perform the last-mile configuration of the FSLogix storage.
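If you want to fail fast instead of waiting for the Deny policy to reject a deployment, a trivial pre-flight check of the planned names against the expected pattern (which is, of course, customer-specific) can be added to the pipeline:

# Pre-flight check: do the planned VM names match the customer's naming pattern?
$pattern = 'az1234vm*'   # illustrative; derived from the subscription's unique four-digit number
'az1234vmavd01', 'az1234vmmgmt1', 'avd-sh-01' | ForEach-Object {
    [pscustomobject]@{ Name = $_; MatchesPolicy = ($_ -like $pattern) }
}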

Host Pool registration info

This deployment error appeared in ‘HostPool-AppGroups’ nested deployment:

HostPool-AppGroups: "DeploymentOutputEvaluationFailed",
"Unable to evaluate template outputs: 'hostPoolRegistrationInfo',
"The template output 'hostPoolRegistrationInfo' is not valid: The language expression property 'token' can't be evaluated."

After searching for some clues in the Accelerator repository, I found one closed issue from January that described this very problem. The workaround that was proposed there was to ‘downgrade’ the API version for host pools.

I tried this workaround (I had to tweak a piece of code buried deep in the CARML modules that are part of the Accelerator), and it resolved the issue. I am not sure why we hit this problem when it wasn't being reported by anyone else.
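If you would rather not patch CARML modules, a different workaround (not the one we used) is to generate the registration token outside the template with the Az.DesktopVirtualization module once the host pool exists, for example:

# Issue a fresh registration token for an existing host pool (resource names are illustrative)
New-AzWvdRegistrationInfo -ResourceGroupName 'rg-avd-weu-pool-compute' `
    -HostPoolName 'vdpool-avd-weu' `
    -ExpirationTime (Get-Date).ToUniversalTime().AddHours(8)
# Read the token back when the session host join logic needs it
(Get-AzWvdRegistrationInfo -ResourceGroupName 'rg-avd-weu-pool-compute' -HostPoolName 'vdpool-avd-weu').Token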

If you think the story ends happily here, think again: we still had a few bumps to cross…

Domain Join again

The next blocker that turned red in our pipeline affected two sub-deployments simultaneously: Session-Hosts and Storage-Azure-Files. In both cases, it was related to the 'JsonADDomainExtension' VM extension.

Here is the output:

{
"code": "ComponentStatus/JoinDomainException for Option 3 meaning 'User Specified'/failed/1",
"level": "Error",
"displayStatus": "Provisioning failed",
"message": "ERROR - Failed to join domain='contoso.com', ou='OU=corp,OU=azure,DC=contoso,DC=com', user='domain-join', option='NetSetupJoinDomain, NetSetupAcctCreate' (#3 meaning 'User Specified'). Error code 1323"
}

First, we suspected that the credentials we were fetching from GitHub secrets were mistyped. Once those were validated as correct, we searched for clues on Stack Overflow and the Tech Community forums. Finally, we found the problem. It turned out that the avdDomainJoinAccountName parameter in deploy-baseline.bicep is expected to be in DOMAINNAME\\username format (or, alternatively, provided as a UPN).

It is a bit odd because there is a separate parameter for the domain name – avdIdentityDomainName – so it would be reasonable to assume that these two parameters are concatenated for that extension deployment. Nevertheless, when we provided the username in that format, the deployment succeeded!
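In hindsight, a small sanity check from any machine with line of sight to the domain would have saved us some searching. A sketch using the .NET AccountManagement API (account and domain names are illustrative):

# Verify the domain-join service account credentials before they ever reach the pipeline
Add-Type -AssemblyName System.DirectoryServices.AccountManagement
$ctx = [System.DirectoryServices.AccountManagement.PrincipalContext]::new(
    [System.DirectoryServices.AccountManagement.ContextType]::Domain, 'contoso.com')
# ValidateCredentials takes the bare sAMAccountName, while avdDomainJoinAccountName
# expects the DOMAINNAME\username form (or a UPN)
$ctx.ValidateCredentials('domain-join', $env:AVD_DOMAIN_JOIN_PASSWORD)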

Missing firewall openings

Something would be very wrong if we didn’t at least once hit some missing firewall rule, right? 😉

By checking the structure of the AVD Baseline deployment code, I realised we were now at the last stage of the deployment: 'Storage-Azure-Files', where a storage account dedicated to FSLogix profiles needs to be joined to the domain. This operation is performed by a custom script extension on the Management VM that applies a PowerShell DSC configuration.

I checked the list of prerequisites first to make sure we didn't miss anything, but all those URLs had already been added to the Azure Firewall policies enabled in the vHubs.

After a quick search, we found out that this issue was reported on the AVD Accelerator repository – Azure Files does not get AD integrated after successful deployment of AVD Baseline via Azure portal. · Issue #310 · Azure/avdaccelerator (github.com). There was a suggestion that enforcing a ‘custom OU path’ could help and prevent this issue from happening. We tried that, and guess what: it didn’t cut it for us!

Luckily, the VM extension deployment gives more verbose information about what went wrong. This time, the issue seemed to be related to Script-DomainJoinStorage.ps1, which installs the NuGet package provider:

Cannot download link 'https://go.microsoft.com/fwlink/?LinkID=627338&clcid=0x409'
"message": "Install-PackageProvider : No match was found for the specified search criteria for the provider 'NuGet'. The package provider requires 'PackageManagement' and 'Provider' tags. Please check if the specified package has the tags. C:\\Packages\\Plugins\\Microsoft.Compute.CustomScriptExtension\\1.10.15\\Downloads\\0\\Manual-DSC-Storage-Scripts.ps1:80 Install-PackageProvider -Name NuGet -MinimumVersion 2.8.5.201 -Force

What was left was to connect to the Management VM and manually replicate the steps this script performs. Yes, you guessed it: our central Azure Firewall was blocking the endpoints needed to download the NuGet package provider and several modules from the PowerShell Gallery. These URLs should be added to the list of prerequisites in case someone tries to deploy the Accelerator in a restricted environment.
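If you need to reproduce this, here is a sketch of the interactive repro we ran on the Management VM (the module to install is illustrative; check Script-DomainJoinStorage.ps1 in your copy of the Accelerator for the real list):

# Re-run the provider/module installation interactively to see where it stalls
Install-PackageProvider -Name NuGet -MinimumVersion 2.8.5.201 -Force
Install-Module -Name PowerShellGet -Force   # illustrative; the script installs several modules
# Quick connectivity checks for the usual suspects behind those installs
Test-NetConnection -ComputerName 'www.powershellgallery.com' -Port 443
Test-NetConnection -ComputerName 'go.microsoft.com' -Port 443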

Least Privilege Access pickle

My story could have ended here with an 'all green' deployment output in the Azure portal. There was one issue left, and I can't say whether it will impact many users; since it hasn't been reported yet, it might not be a high-impact issue.

Many organisations are adopting the ‘least privilege access’ security principle. My customer applies this rule in many areas like using Privileged Identity Management for granting RBAC roles or by not using a Domain Admin account for the ‘domain join’ automation, but only granting a very narrow set of permissions to it.

As a result, when the DSC configuration script tries to use this account to perform actions like changing the Windows Firewall policy or installing DSC packages, it fails with a "The account doesn't have the Administrator rights" error. If that account were a member of the Domain Admins group, it would automatically have those rights on the Management VM, but since it only has limited AD permissions, it doesn't.

As a workaround, we decided to perform that last step – the storage account domain join – manually, rather than adding that AD account to the Domain Admins group. It is not ideal (we have a manual step in this deployment), but we were pressed for time and didn't want to sacrifice the 'least privilege' principle while looking for a better solution.
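For the manual step itself, we used Microsoft's AzFilesHybrid module under the delegated domain-join account. A sketch assuming the module has already been downloaded and that the resource names are illustrative:

# Manually domain-join the FSLogix storage account with AzFilesHybrid
Import-Module .\AzFilesHybrid.psd1
Connect-AzAccount
Select-AzSubscription -SubscriptionId $avdSubscriptionId
Join-AzStorageAccount -ResourceGroupName 'rg-avd-weu-storage' `
    -StorageAccountName 'stavdfslogixweu' `
    -DomainAccountType 'ComputerAccount' `
    -OrganizationalUnitDistinguishedName 'OU=corp,OU=azure,DC=contoso,DC=com'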

Custom Image goes next

The deployment of an optional component – that can generate custom images with Azure Image Builder and publish them to a Compute Gallery – was less complex, but even here, we had to make some changes.

Since this deployment also uses deployment scripts, we hit the same policy requiring private endpoints for PaaS services. We got approval for adding an exemption and moved on.

Workload identity permissions

This next problem showed that the Accelerator is primarily meant to be executed by admins from their consoles rather than by an automated CI/CD pipeline (e.g., GitHub Actions workflow).

The failure this time was in the 'Automation-Account' sub-deployment, with the following error:

message": "The client 'xxxx-xxxx-xxxx-xxxx' with object id 'yyyy-yyyy-yyyy-yyyy' has permission to perform action 'Microsoft.Insights/diagnosticSettings/write' on scope '/subscriptions/ID1/resourcegroups/rg-avd-weu-shared-services/providers/Microsoft.Automation/automationAccounts/aa-avd-weu/providers/Microsoft.Insights/diagnosticSettings/aa-avd-weu-diagnosticSettings'; however, it does not have permission to perform action 'Microsoft.OperationalInsights/workspaces/sharedKeys/action' on the linked scope(s) '/subscriptions/ID2/resourcegroups/eacp-mgmt/providers/microsoft.operationalinsights/workspaces/eacp-law' or the linked scope(s) are invalid."

Translation: the Accelerator expects that the security principal (a user, or the service principal used by our GitHub Actions workflow) has permission to retrieve a shared key for the centralised Log Analytics workspace (in the ALZ context, deployed in the Management subscription). That, however, is not the case. Each SPN created for a given landing zone and enabled on the corresponding GitHub repository has the Owner role assigned to that one subscription only (to limit the blast radius).

After analysing that part of the Bicep code, I realised it is only needed to enable diagnostic settings on the Automation account, which happens anyway through ALZ policies (every PaaS resource, wherever it is deployed, is automatically enabled for centralised log collection).

The quickest fix was to comment out that specific nested deployment instead of adding more permissions to the workflow SPN.
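If you would rather keep that nested deployment, the alternative (which we decided against, to keep the blast radius small) is a narrowly scoped role assignment for the workflow SPN on the central workspace. A sketch reusing the IDs from the error above:

# Grant the workflow SPN just enough to read the workspace shared keys;
# 'Log Analytics Contributor' covers Microsoft.OperationalInsights/*, which includes the sharedKeys action
New-AzRoleAssignment -ObjectId 'yyyy-yyyy-yyyy-yyyy' `
    -RoleDefinitionName 'Log Analytics Contributor' `
    -Scope '/subscriptions/ID2/resourcegroups/eacp-mgmt/providers/microsoft.operationalinsights/workspaces/eacp-law'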

When this was fixed, we finally got an ‘all green’ deployment. Time for a cake?

Not quite: the core of this Custom Image solution is based on an Automation runbook that, by default, runs on a schedule and generates new images. Azure Image Builder is leveraging HashiCorp Packer to do the heavy-lifting: create a VM, connect to it, apply required configuration, install tools and apps, generalize the VM, and create a new custom image.

“There must be something in this process that will not work,” you might say. And you are right 🙂 I checked the last 'image-build' runbook job, and it had indeed failed due to… wait for it… an Azure policy violation.

VM naming convention strikes again

Do you remember what I wrote about that custom policy that only allows a specific naming pattern for Azure VMs in the Corp landing zone (to ensure name uniqueness in AD)?

This is what the runbook error log gave us:

Image Template build failed. Exception: Microsoft.Azure.PowerShell.Cmdlets.ImageBuilder.Runtime.UndeclaredResponseException: Validation failed: resources.DeploymentsClient#Validate: Failure sending request: StatusCode=400 -- 
Original Error: Code="InvalidTemplateDeployment" Message="The template deployment failed because of policy violation. {"expression":"name","expressionKind":"Field","expressionValue":"aibproxy5atw8","operator":"Match","path":"name","result":"False","targetValue":"az0080vm####"}],"reason":"Corporate VMs must have Contoso's unique names.","Check that the VM names are compliant with the naming convention for automated services."

The reason for this error is that Azure Image Builder (AIB) needs to provision a temporary VM to create a custom image. This VM gets a random name that does not follow the naming pattern required by that policy. And since the runbook uses AIB cmdlets (Start-AzImageBuilderTemplate), where it isn't possible to customise the temporary VM's name, we needed yet another policy exemption.
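If the runbook job output is too terse, the image template resource itself records the last run status, which is where this policy violation surfaces. A version-agnostic way to read it (resource names are illustrative):

# Inspect the image template's last run status without depending on a specific Az.ImageBuilder version
$template = Get-AzResource -ResourceGroupName 'rg-avd-weu-shared-services' `
    -ResourceType 'Microsoft.VirtualMachineImages/imageTemplates' `
    -Name 'avdImageTemplate' -ExpandProperties
$template.Properties.lastRunStatus | Format-List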

Digging into Packer logs

I will keep the rest of this story short: there were two additional issues we had to resolve before we could finally open the champagne (listed after the following sketch). Since these errors don't happen during resource deployment, and the Automation jobs don't give all the details (even when verbose logging is enabled), it is useful to know that Packer generates a very detailed log, stored in a storage account that gets created automatically.
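A sketch of how to grab that log, assuming the usual AIB staging layout (a resource group named IT_*, one storage account, and a packerlogs container holding customization.log):

# Download the Packer customization log from the AIB staging resource group
$stagingRg = (Get-AzResourceGroup -Name 'IT_*' | Select-Object -First 1).ResourceGroupName
$storage   = Get-AzStorageAccount -ResourceGroupName $stagingRg | Select-Object -First 1
New-Item -ItemType Directory -Path .\packer-logs -Force | Out-Null
Get-AzStorageBlob -Container 'packerlogs' -Context $storage.Context |
    Get-AzStorageBlobContent -Destination .\packer-logs\ -Force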

  • Packer creates a Key Vault for storing secrets, but this deployment was failing due to ‘Public network access should be disabled for PaaS services‘ policy. Remember, we are in a separate subscription but still in the CORP landing zone. New policy exemption: check 🙂

  • Packer was failing to establish a WinRM connection to the temporary VM, because the VM only had a private IP address – another ALZ policy disallows public IPs on VMs. New policy exemption: check?!

In one standup, I called this a 'Policy Hell' (if you are old enough to remember what DLL Hell was, you get the reference). Adding more and more exemptions was becoming problematic, so I came up with the idea of moving this 'Custom Image' solution from the Corp landing zone to an Online one. This component does not require corporate connectivity, line of sight to domain controllers, or access to resources hosted on-premises, so there was no benefit in keeping it in the more restricted Corp "world."

Moving to the Online LZ

My story is coming to its "happy" ending. We requested a new Online subscription through the vending machine, copied the code to a new repo, configured the pipeline, and had an "almost perfect" deployment.

One "gotcha" remained: when we were cleaning up the previous subscription, we forgot (or didn't realise they were there) to remove two custom role definitions, so our deployment failed at the 'Role-Definition' step with:

"code": "RoleDefinitionWithSameNameExists","message": "A custom role with the same name already exists in this directory. Use a different name."

The custom roles that the Accelerator wanted to deploy were Image Template Contributor and Image Template Build Automation. There was one interesting side effect caused by the 'dependsOn' declarations among some Azure resources: the Automation account and all its child resources weren't created at all.

I had to delete two role assignments and then those custom role definitions. After that, we could finally celebrate: both the deployment and the runbook job succeeded.
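For the record, this is roughly the clean-up we ran; the role names come from the Accelerator, and any assignments referencing a definition have to be removed first:

# Remove the stale custom roles left behind by the previous subscription's deployment
$staleRoles = Get-AzRoleDefinition -Custom |
    Where-Object { $_.Name -in 'Image Template Contributor', 'Image Template Build Automation' }
foreach ($role in $staleRoles) {
    Get-AzRoleAssignment -RoleDefinitionName $role.Name | Remove-AzRoleAssignment
    Remove-AzRoleDefinition -Id $role.Id -Force
}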

Closing remarks

Despite all those problems we faced, I appreciate the amount of effort that was put into the AVD Accelerator. Together with the architecture guidance published on Microsoft Learn, they can truly accelerate the adoption of this complex but extremely useful cloud service.

I will provide a condensed version of this feedback directly on GitHub, so the solution can be better aligned with ALZ policies and adjusted to avoid the kinds of issues we ran into during our project.