One-Month of Microsoft DKIM Failure and Thoughts on Technical Excellence

Last month I published an article about getting email settings right. I had recently configured DKIM on our company domain and was happy to share the experience. However, the next day I started receiving error DMARC reports saying that our Office365-backed emails are failing DKIM which can mean they are going in spam.

That was quite surprising as I have followed every step in the Microsoft guide. Let me first share a bit about that as well, as it will contribute to my final point.

In order to enable DKIM, on a normal email provider, you just have to get the DKIM selector value from some settings screen and copy them in your DNS admin dashboard (in a new ._domainkey. record). Microsoft, instead, make you do the following:

  • Connect to Exchange Powershell
  • Oh, you have 2FA enabled? Sorry, that doesn’t work – follow this tutorial instead
  • Oh, that fails when downloading their custom application? An obscure stackoverflow answer tells you that you should not download it with Firefox or Chrome – it fails to run then. You should download it with Internet Explorer (or Edge) and then it works. Just Microsoft.
  • Now that you are connected, follow the DKIM guide
  • In addition to powershell, you should also go in the Exchange admin in your Office365 admin, and enable DKIM

Yes, you can imagine you can waste some time here. But it finally works, and you set the CNAME records and assume that what they wrote – namely, that the TXT records to which the CNAME records point, are generated on Microsoft’s end (for onmicrosoft.com), and so anyone trying to validate a signature will pick the right public key from there and everything will be perfect. Well, no. Here’s an excerpt from a test mail I sent to my Gmail in December:

DKIM: ‘FAIL’ with domain .com
ARC-Authentication-Results: i=2; mx.google.com;
dkim=temperror (no key for signature) header.i=@.com header.s=selector1

I immediately reported the issue in a way that I think is quite clear, although succint:

Hello, I configured DKIM following your tutorial (https://docs.microsoft.com/en-us/microsoft-365/security/office-365-security/use-dkim-to-validate-outbound-email), however the TXT records at your end that are supposed to be generated are not generated – so me setting a CNAME records points to a missing TXT record. I tried disabling and reenabling, and rotating the keys, but the TXT record is still missing, which means my emails get in spam folders, because the DMARC policy epxects a DKIM record. Can you please fix that and create the required TXT records?

What followed was 35 days of long phone calls and repetitive emails with me trying to explain what the issue is and Microsoft support not getting it. As a sidenote, Office365 support center doesn’t support tickets with so much communication – it doesn’t have pagination so I can’t see the original note online (only in my inbox).

On the 5th minute of my first-of-several 20 minute calls my wife, who was around, already knew exactly what the issue is (she’s quite technical, yes), but somehow Microsoft support didn’t get it. We went through multiple requests for me providing DMARC reports (and just the gmail report wasn’t enough, it had to be from 2 others as well), screenshots of my DNS administration screen, screenshots to prove the record is missing (but mxtoolbox is some third party tool they can’t trust), even me sharing my computer for a remote desktop session (no way, company policy doesn’t allow random semi-competent people to touch my machine). I ended up providing an nslookup (a Microsoft tool) screenshot, to show that the record is missing.

In the meantime I tried key rotation again form the exchange admin UI. That got stuck for a week, so I raised a separate ticket. While we were communication on that ticket as well, rotation finished, but the keys didn’t change. What changed was (that I later figured) the active selector. We’ll get to that later.

After the initial few days I executed Get-DkimSigningConfig | fl to get the DKIM record values and just created TXT records with them (instead of CNAME), so that our mails don’t get in spam. Of course, having DKIM fail doesn’t automatically mean you’ll get in spam, even though our DMARC policy says so, but it’s a risk one shouldn’t take. With the TXT records set everything was working okay, so I didn’t have an urgent problem anymore. That allowed me to continue slowly trying to resolve the issue with Microsoft support, for another month.

Two weeks ago they acknowledged something is missing on their end. Miracle! So the last few days of the horror were about me having to set the CNAME records back (and get rid of my TXT records) so that they can fix the missing records on their end. I know, that sounds ridiculous – you don’t need a CNAME to point to your TXT in order to get that TXT to appear, but who knows what obscure and shitty programs and procedures they have, so I complied. What followed was a request for more screenshots and more DKIM reports.

No. You have nslookup. No screenshot of mine is better than an nslookup. The TTL for my entries is 300s, so there should be no issue with caching (and screenshots won’t help with that either).

Then finally, after having spoken to or emailed probably 5-6 different people, there was an email that made sense:

Good Afternoon,

As per the public documentation you followed (https://docs.microsoft.com/en-us/microsoft-365/security/office-365-security/use-dkim-to-validate-outbound-email) it is mentioned to have TXT records created in Microsoft DNS server for both Selector CNAME Records. we agree with that.

Our product engineer team has confirmed that there is a design level change was rolled out and it is still not published in public documentation. To be precise, we don’t require TXT record for both CNAME Records published. The TXT record will be published only for the Active Selector.

LastChecked 2020-01-01 07:40:58Z
KeyCreationTime 2019-12-28 18:37:38Z
RotateOnDate 2020-01-01 18:37:38Z
SelectorBeforeRotateOnDate selector1
SelectorAfterRotateOnDate selector2

Above info confirms that the active selector is “Selector2” and we have the TXT record published. So we don’t have to wait for the TXT record for Selector1 in our scenario.

Also we require the Selector1 CNAME record in your DNS, whenever next rotate happen. Microsoft will publish the other TXT record. This shouldn’t affect your environment at any chance.

Alright. So, when I triggered key rotation, I fixed the results of the original bug that lead to no records present. The rotation made the active selector “selector2” and generated its TXT record. Selector1 is currently missing, but it’s inactive, so that is not an issue. What “active selector” means is an interesting question. It’s nowhere in the documentation, but there’s this article from 2016 that sheds some light. Microsoft has to swap the selectors on each rotation. According to the article rotation happens every week, but that’s no longer the case – since my manual rotation there hasn’t been any automatic one.

So to summarize – there was an initial bug that happened who knows why, then my stuck-for-a-week manually triggered rotation fixed it, but because they haven’t documented the actual process, there was no way for me to know that the missing TXT record is fine and will (hopefully) be generated on the next rotation (which is my current pending question for them – when will that happen so that I check if it worked).

To highlight the issues here. First, a bug. Second, missing/outdated documentation. Third, incompetent (though friendly) support. Fourth, a convoluted procedure to activate a basic feature. And that’s for not getting emails in spam in an email product.

I can expand those issues to any Microsoft cloud offering. We’ve been doing some testing on Azure and it’s a mess. At first an account didn’t even work. No reason, just fails. I had to use another account. The UI is buggy. Provisioning resources takes ages with no estimate. Documentation is far from perfect. Tools are meh.

And yet, Microsoft is growing its cloud business. Allegedly, they are the top cloud provider in the world. They aren’t, obviously, but through bundling Office365 and Azure in the same reporting, they appear bigger than AWS. Azure is most likely smaller than AWS. But still, Satya Nadella has brought Microsoft to market success again. Because technical excellence doesn’t matter in enterprise sales. You have your sales reps, nice conferences, probably some kickbacks, (alleged) easy integration with legacy Windows stuff that matters for most enterprise customers. And for the decision maker that chooses Microsoft over Amazon or Google the argument of occasional bugs, poor documentation or bad support is irrelevant.

But it is relevant for the end results. For the engineers and the products they build. Unfortunately, tech companies have prioritized alleged business value over technical excellence so much that they can no longer deliver the business value. If the outgoing email of an entire organization goes in spam because Microsoft has a bug in Office365 DKIM, where’s the business value of having a managed email provider? If the people-hours saved form going from on-prem to Azure are wasted because of poor technology, where’s the business value? Probably decision makers don’t factor that in, but they should.

I feel technical excellence has been in decline even in big tech companies, let alone smaller ones. And while that has lead to projects gone out of budget, poor quality and poor security, the fact that the systems aren’t that critical has made it possible for the tech sector to get away with that. And even grow.

But let me tell you of a big player in a different industry that stopped caring about technical excellence. Boeing. This article tells the story of shifting Boeing from an engineering organization to one that cuts on quality expenses and outsources core capabilities. That lead to the death of people. Their plane crashed because of that shift in priorities. That lead to a steep decline in sales, and a moderate decline in stock and a bailout.

If you de-prioritize technical excellence, and you are in a critical sector, you lose. If you are an IT vendor, you can get away with it, for a while. Maybe forever? I hope not, otherwise everything in tech will be broken for a long time. And digital transformation will be a painful process around bugs and poor tools.

2 thoughts on “One-Month of Microsoft DKIM Failure and Thoughts on Technical Excellence”

  1. Conclusion is worth reading!
    Quality is indeed an issue in a lot of software products or services.
    It’s easy to focus on features, and forget about the real risks with some practices. Especially in long-term projects or services like plan construction or saas provider.

    Regarding you Azure experience, I don’t feel the same. Overall, I’m professionally pretty happy about Azure. Getting started is pretty easy an comprehensive. You don’t need to read tons of documentation, like AWS terminology, or set up a lot of stuff on your machine. VNET, Storage, VMs services are pretty straightforward which is pleasant. They work really well for basic workflows.
    Documentation is missing advanced samples in some topics, but overall its more and more complete, and updated frequently. You can even immediately create a Github issue from the page itself.
    You can use the web UI, which in some rare cases is indeed buggy (mostly for services in preview or brand new). The refresh is perfectible in some sections indeed, but it works really nice in command line (powershell or azure cli). I’m used to web ui blades now, and it’s actually pretty neat experience to browse through your resources, but it’s definitely a personal taste.
    Provisioning resources (and also deleting them) indeed takes a various amount of time, depending on what you create, when you create it and in which datacenter. Which is understandable according to the demand. But most of the times you get what you need in a couple of minutes so it’s not really an issue, compared to asking your IT hardware department ^^ you launch your script/pipeline to create your vms or app, you go drink a coffee, it’s ok when you came back…
    Over the years, we’ve had some troubles with brand new stuff (ex: booting a custom RedHat X image) or specific scenario, like storage throttling or database connectivity. But support is working, and is actually great once you get passed the first levels ^^ Occasionally we have to create a ticket, but most of the time it works great, we’re happy with the (very large) ecosystem available.
    Note: I’m not working for microsoft but wanted to share my experience :-p
    Hope you find a cloud provider that suits you, find up to date documentation, and that nobody need to contact support !

  2. Thank you. I had the same question about DKIM records, this post saved me the pain of calling Microsoft support.

Leave a Reply

Your email address will not be published. Required fields are marked *