A Case For Native Smart Card Support in Browsers

February 22, 2017

A smart card is a device that holds a private key securely without letting it out of its storage. The chip on your credit card is a “smart card” (yup, terminology is ambiguous – the card and the chip are interchangeably called “smart card”). There are smaller USB-pluggable hardware readers that only hold the chip (without an actual card – e.g. this one).

But what’s the use? This W3C workshop from several years ago outlines some of the use cases: multi-factor authentication, state-accepted electronic identification, digital signatures. All these are part of a bigger picture – using the internet is now the main means of communication. We are moving most of our real-world activities online, so having a way to identify who we are online (e.g. to a government or a bank), or being able to sign documents online (with legal value), is crucial.

That’s why the EU introduced the eIDAS regulation, which defines (among other things) electronic identification and digital signatures. The framework laid out there is aimed at having legally binding electronic communication, which is important in so many cases. Have you ever done the print-sign-scan exercise? Has your e-banking been accessed by an unauthorized person? Well, the regulation is supposed to fix these and many more issues.

Two-factor authentication is another, broader concept, which has tons of sub-optimal solutions: OTP tokens, Google Authenticator, SMS code confirmation. All of these have issues (e.g. clock syncing, SMS interception, cost). There are hardware tokens like YubiKey, but they offer only a subset of the features a smart card does.

But it’s not just about legally recognized actions online and two-factor authentication. It opens up other possibilities, like more secure online credit card payments – e.g. you put your card in a reader and type your PIN, rather than entering the card number, CVC, date, names, 3-D Secure password and whatnot.

With this long introduction I got to the problem: browsers don’t support smart cards. In the EU, where electronic signatures are legally recognized, there is a constant struggle to make them work with browsers. The solution so far: Java applets. A Java applet can interact with the smart card through the Java crypto APIs and thus provide signing features. However, with the deprecation of Java applets this era of constant struggle will end soon (and it is a struggle – having to click at least 2 confirmations and keep your Java up to date, which is a hassle even for developers). There used to be a way to do it a few years ago in Firefox and IE, using the window.crypto and CAPICOM APIs, but these got deprecated.

Recently the trend has been to use a “cloud-based” approach, where the keys reside on an HSM. That’s of course useful, but the problem with identification remains – getting access to your keys on the HSM requires, again, two-factor authentication. Having the hardware token “in your hands” is what adds the security.

Smart people in Estonia (which has the most digital government in the world) had a better solution than Java or an HSM – browser plugins that allow interaction with their ID card (which is/has a smart card). The solutions are here and here. This has worked pretty well – you install the plugins once (with an all-in-one installer) and you can sign documents with JavaScript. You also get the proper PKCS libraries installed, and the root certificates needed to allow TLS 1.2 authentication with the hardware token (identification and authentication vs. signing). The small downside of this approach is that it is somewhat fragile and dependent on browser whims – the plugins have to be upgraded constantly and are at risk of being completely broken if some browser decides to deprecate some plugin APIs.

Another approach is the “local service” approach, which has two flavours. One is: you install a local application that exposes an HTTP interface, and using JavaScript and a proper same-origin configuration you send the files that need signing to the service, then get the result as an HTTP response, which you can, again using JavaScript, append to the page that requested the signing. The downside here is getting a service installed that listens on a given port without administrator rights. The other flavour is an application hooked to a custom protocol (e.g. signature://). Whenever the page wants the user to sign something, it opens signature://path-to-document-to-sign, which is intercepted by the locally installed application, the digital signing is performed, and the result is pushed to a (one-time) URL specified in the metadata of the document to sign. Something like that is implemented by 4identity.eu and it actually works.
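To make that first flavour more concrete, here is a minimal, hypothetical sketch of such a localhost signing service in Java – the port, the /sign path and the fake “signing” step are all assumptions, and a real implementation would talk to the smart card (e.g. through a PKCS#11 provider) and restrict the allowed origins:

    import com.sun.net.httpserver.HttpExchange;
    import com.sun.net.httpserver.HttpServer;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;

    // Sketch of the "local service" flavour: a tiny HTTP server on localhost
    // that a page can POST a document to and get a (here fake) signature back.
    public class LocalSigningService {

        public static void main(String[] args) throws IOException {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 53952), 0);
            server.createContext("/sign", LocalSigningService::handleSign);
            server.start();
        }

        private static void handleSign(HttpExchange exchange) throws IOException {
            // a real service would whitelist the origins allowed to call it
            exchange.getResponseHeaders().add("Access-Control-Allow-Origin", "*");
            byte[] document;
            try (InputStream in = exchange.getRequestBody()) {
                document = in.readAllBytes();
            }
            // placeholder: this is where the actual smart card signing would happen
            byte[] signature = ("signed " + document.length + " bytes").getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, signature.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(signature);
            }
        }
    }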

Now, signing is one thing; identification (TLS client auth) is another. Allegedly, things should work there – PKCS#11 is a standard that should allow TLS client auth to happen with a smart card. The reality is that it doesn’t. You often need a vendor-specific PKCS#11 library. OpenSC, a cool tool that works with many smart cards, only works with Firefox and Safari. Charismatics commercial is a piece of software that is supposed to work with all smart cards out there – well, it doesn’t always.

And the problem here is the smart card vendors. The need for OpenSC and Charismatics arises because even though there are a few PKCS standards, smart cards are a complete mess. Not only is it a mess, it’s a closed, secretive mess. APDUs (the commands you send to the smart card in order to communicate with it) are in most cases secret. You don’t get to know them even if you purchase tens of thousands of cards – you only get custom vendor software that knows them. Then you have to reverse-engineer them to learn how to actually talk to the cards. And they differ not only across vendors, but across card models of the same vendor. For that reason the Estonian approach was a bit simpler to implement – they had only one type of smart card, given to all citizens, and they were mostly in control. In other countries it’s a … mess. At least a dozen different types of cards have to be supported.

So my first request is to smart card vendors (of which there are not that many) – please, please fix your mess. Get rid of that extra bit of “security through obscurity” to allow browsers to communicate with your cards without extra shenanigans.

My second request is to browser vendors – please do support smart card crypto natively. Unfortunately, due to the smart card mess above (among other things), hardware crypto has explicitly been excluded from the Web Crypto API. As a follow-up to that, there’s the Hardware Security working group, but as far as I know it’s still “work in progress”, and my feeling is there isn’t that much progress yet. In the W3C it’s important that browser vendors agree to implement something before it becomes a standard, and I’ve heard that some are opposing the smart card integration. Due to the aforementioned mess, I guess.

You may say – standardization will fix this. Well, it hasn’t so far. The EU officials are aware of the problem, and that the eIDAS regulation may be thwarted by these technical issues, but they are powerless, as the EU is not a standardization body.

So it all comes down to a joint effort between browser and smart card vendors to fix this thing once and for all. Please do that, in order to enable a more secure and legally compliant web.


Computer Science Concepts That Non-Technical People Should Know

February 12, 2017

Sometimes it happens that people speak different languages, even when speaking the same language. People have their own professional specifics. A biologist may see the world through the way a cell works; a cosmologist may see relationships between people as attraction between planets. And as with languages, different professional experiences give you a useful way of conceptualizing the world. I think everyone would find some CS concepts useful, so I’ll try to list a few that I’ve found many people don’t treat as “native”. And yes, they are not strictly “computer science” – they fall in various related fields like information science, systems design, etc.

  • Primary keys – the fact that every “entity” should have a unique identifier, so that you can refer to it unambiguously. Whether it’s a UUID, an auto-increment, or a number/string derived from a special set of rules doesn’t matter. You may say this is obvious, but it isn’t – I’ve seen tons of spreadsheets and registers where entities don’t have a unique identifier. Unique identifiers are useful for retrieval – if I have a driving license number, when I fill in an insurance form, should I fill in all the details from the driving license (name, address, age), or just a single number, with the insurer getting the rest from a driving license database?
  • Foreign keys (+ integrity violations and cascades) – the idea for this post came to me after a discussion with a bank clerk who insisted that the fact that my bank account was being deleted didn’t mean that my virtual PoS terminal would also be deleted. To her they seemed unlinked, although the terminal is linked (via a foreign key) to the bank account. One should be able to quickly imagine links between data points and how dependent data can either block the removal of the parent, or be deleted with it.
  • “Fingerprint” (hash) – the fact that every entity can be transformed into a short, practically unique representation. This has some niche applications outside of the technical realm, but one should be familiar with the concept of transforming a large set of data points into a single representation for the sake of comparison.
  • Derived data – the fact that you don’t have to store data that can be derived from a primary dataset with well-defined transformations, e.g. sums. I had to explain this over and over again when I was talking about open data – and ultimately, the sums were included anyway. The key here is that derived data is less important than the algorithm used to derive it.
  • Single point of failure – this becomes obvious the second you spell it out, but people tend to ignore it when designing real-world processes. How many times have you been stuck in a queue thinking “this is obviously a single point of failure, why didn’t anyone think of preventing these delays”? It’s not always simple (even in software, not to mention the real world), of course, but the concept should be acknowledged.
  • Protocols and interfaces – i.e. the definition of how two systems communicate without knowing each other’s internal workings. “Protocol” in the non-technical world means “the way things are usually done”, and an interface is usually thought of as a GUI. But the concept of defining how two systems communicate seems kind of alien, where “systems” might be organizations, a customer and a sales rep, two departments, etc. Defining the point of their interaction, I think, helps to conceptualize these relationships better.
  • Version control – the fact that content is a continuous stream of changes on top of an original entity. Each entity is the product of its original “version” plus all the changes applied to it. This doesn’t just hold for text, it holds for everything out there. And if we imagine it that way, it helps us… “debug” things, even humans, I guess.

The list is by no means exhaustive. I’d welcome any additions to it.

And now I know non-technical readers will say “But I understand and know these things”. I know you do. My point is that they are not an intrinsic part of your day-to-day problem-solving apparatus. I may be wrong, but my observations hint in that direction.


Why I Chose to Be a Government Advisor

January 22, 2017

A year and a half ago I agreed to become an advisor in the cabinet of the deputy prime minister of my country (Bulgaria). It might have looked like a bizarre career move, given that at the time I was a well-positioned and well-paid contractor (software engineer), working with modern technologies (Scala, Riak, AWS) at scale (millions of users). I continued on that project part-time for a little while, then switched to another one (again part-time), but most of my attention and time were dedicated to the advisory role.

Since mid-December I no longer hold the advisory position (the prime minister resigned), but I wanted to look back, reflect and explain (mainly to myself) why it was a good idea and how it worked out.

First, I deliberately continued as a part-time software engineer, to avoid the risk of forgetting my (up to that point) most marketable skill – building software. But not only that – sometimes you become tired of political and administrative bullshit and just want to sit down and write some code. The rest of my time, though, including my “hobby”/spare time, was occupied by meetings, research, thinking and document drafting aimed at improving electronic governance in Bulgaria.

I’ve already shared what my agenda was and what I was doing, and even gave a talk about our progress – opening as much data as possible, making sure we have high requirements for government software (we prepared a technical specification template for government procurement so that each administration relies on that, rather than on contractors with questionable interests), introducing electronic identification (by preparing both legislation and technical specification), and making myself generally available to anyone who wants advice on IT stuff.

The big news that we produced was the introduction of a legal requirement for making all custom-built government software open-source. It echoed in the tech community, including some big outlets like TechCrunch, ZDnet and Motherboard.

I was basically a co-CTO (together with a colleague) of the government, which is pretty amazing – having the bigger picture AND the intricate details in our heads, all across government. We were chaotic, distracted and overwhelmed, but still managed to get both the legislation and the important projects written and approved. I acquired some communication and time-management skills, researched diverse technological topics (from biometric identification to e-voting), I learnt that writing laws is very similar to programming, and I had the pleasure of working in a wonderful team.

There were downsides, of course. The financial one aside, the most significant problem was the inevitable “deskilling”. I had no time to try cool new stuff from the technology world – machine learning, deep learning, neural networks, the blockchain, Kafka, Docker, Kubernetes, etc. I’m not saying these are crucial technologies (I have tried Docker a couple of times and I wouldn’t say I love it), but one has to stay up to date – this is the best part (or the worst part?) of our profession. I missed (and still do) spending a few evenings in a row digging into some new and/or obscure technology, or even contributing to it. My GitHub contributions were limited to small government-oriented tools.

Another downside is that I’m pretty sure some people hate me now. For raising the bar for government software so high and for my sometimes overly zealous digitizing efforts. I guess that’s life, though.

Having stated the pros and cons, let me get back to the original question – why did I choose the position in the first place? For a long time I’ve been interested in what happens to programmers when they get “bored” – when the technical challenges alone are no longer sufficient to fulfill one’s potential, or when the potential for impact has been fully reached. A few years ago I didn’t think of the “government” option – I thought one could either continue solving problems for an employer, or become an employer themselves. And I still think the latter is a very good option, if you have the right idea.

But I didn’t have an idea that I really liked and wanted to go through with. At the same time I had ideas of how the government could run better. Which would, in my case, impact around 8 million people. It would also mean I could get things done in a complicated context. I started with three objectives – an electronic identity card, open source and open data (the latter two we had pushed for with a group of volunteer developers). All of them are now a fact. Well, mostly because of the deputy prime minister, and thanks to the rest of the team, but still. That job also meant I had to get out of my comfort zone of an all-day-programming, agile-following, software-delivering routine.

And that’s why I did it. With all the risks of deskilling, of being disliked, or even being a pawn in a political game. Selfish as it may sound, I think having impact, getting things done and doing something different are pretty good motivators. And are worth the paycheck cut.

I still remain a software engineer and will continue to create software. But thanks to this experience, I think I’ll be able to see a broader landscape around software.


Forget ISO-8859-1

January 16, 2017

UTF-8 was first presented in 1993. One would assume that 24 years is enough time for it to become ubiquitous, especially given that the Internet is global. ASCII doesn’t even cover French letters, not to mention Cyrillic or Devanagari (the Hindi script). That’s why ASCII was replaced by ISO-8859-1, which kind of covers most western languages’ orthographies.

88.3% of websites use UTF-8. That’s not enough, but let’s assume the remaining 11.7% do not accept any input and are just English-language static websites. The problem with the still-pending adoption of UTF-8 is how entrenched ASCII/ISO-8859-1 are. I’ll try to give a few examples:

  • UTF-8 isn’t the default encoding in many core Java classes – FileReader, for example. It’s similar for other languages and runtimes. The default encoding of these Java classes is the JVM default, which is most often ISO-8859-1. It is allegedly taken from the OS, but I don’t remember configuring any encoding on my OS – just a locale, which is substantially different. (A sketch of how to sidestep the default follows after this list.)
  • Many frameworks, tools and containers don’t use UTF-8 by default (and don’t try to remedy the JVM not using UTF-8 by default). Tomcat’s default URL encoding, I think, is still ISO-8859-1. Eclipse doesn’t make files UTF-8 by default (on my machine it’s sometimes even windows-1251 (Cyrillic), which is horrendous). And so on. I’ve asked for UTF-8 as the default in the past, and I repeat my call.
  • Regex examples and tutorials always give you the [a-zA-Z0-9]+ regex to “validate alphanumeric input”. It is built into many validation frameworks. And it is so utterly wrong. This is a regex that must never appear anywhere in your code, unless you have a pretty good explanation. Yet the example is ubiquitous. Instead, the right regex is [\p{L}0-9]+. Using the wrong regex means you won’t be able to accept any non-ASCII letters (accented, Cyrillic, etc.), which is something you practically never want. Unless, probably, due to the next problem.
  • Browsers have issues with UTF-8 URLs. Why? It’s complicated. And it almost works when it’s not part of the domain name. Almost, because when you copy the URL, it gets screwed (pardon me – encoded).
  • Microsoft Excel doesn’t work properly with UTF-8 in CSV files. I was baffled to realize that UTF-8 CSVs become garbage. Well, not if you have a BOM (byte order mark), but come on, it’s [the current year].
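To illustrate the first and third points, here is a minimal Java sketch (the input.txt file name is just a placeholder): it reads a file with an explicitly specified UTF-8 charset instead of relying on the platform default, and validates input against the Unicode-aware [\p{L}0-9]+ pattern rather than [a-zA-Z0-9]+:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.regex.Pattern;

    public class Utf8Examples {

        // Unicode-aware "alphanumeric" check: \p{L} matches letters in any script
        private static final Pattern ALPHANUMERIC = Pattern.compile("[\\p{L}0-9]+");

        public static void main(String[] args) throws IOException {
            // pass the charset explicitly instead of relying on the JVM default
            // (which is what a plain new FileReader(..) would do)
            try (BufferedReader reader = Files.newBufferedReader(
                    Paths.get("input.txt"), StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line + " -> " + ALPHANUMERIC.matcher(line).matches());
                }
            }
        }
    }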

As Jon Skeet rightly points out – we have issues with the most basic data types: strings, numbers and dates. This is partly because the real world is complex, and partly because we software engineers tend to oversimplify it. This is what we’ve done with ASCII and other Latin-only encodings. But let’s forget ASCII and ISO-8859-1. It’s not even okay to call them “legacy” after 24 years of UTF-8. After 24 years they should have died.

Let’s not give regex examples that don’t work with UTF-8, let’s not assume that any default other than UTF-8 is a good idea, and let’s sort out the URL mess.

Maybe I sound dogmatic. Maybe I exaggerate because my native script is non-Latin. But if we want our software to be global (and we do, in order to have a bigger market), then we have to sort out our basic encoding issues. Having UTF-8 as a standard is not enough. Let’s forget ISO-8859-1.


Anemic Objects Are OK

December 25, 2016

I thought for a while that object-oriented purism had died off. But it hasn’t – every now and then there’s an article that tries to tell us how evil setters and getters are, how bad (Java) annotations are, how horrible and anti-object-oriented the anemic data model is (where functionality-only services act upon data-only objects) and, eventually, how dependency injection is ruining software.

Several years ago I tried to counter these arguments, saying that setters and getters are not evil per se, and that the anemic data model is mostly fine, but I believe I was worse at writing then, so maybe I didn’t get to the core of the problem.

This summer I had a short Twitter discussion with Yegor Bugayenko and Vlad Mihalcea on the matter and a few arguments surfaced. I’ll try to summarize them:

  • Our metaphors are often wrong. An actual book doesn’t know how to print itself. Its full contents are given to a printer, which knows how to print a book. Therefore it doesn’t make sense to put logic for printing (to JSON/XML), or for persisting to a database, in the Book class. It belongs elsewhere.
  • The advice to use (embedded) printers instead of getters is impractical even if a Book should know how to print itself – how do you transform your objects to other formats (JSON, XML, database rows, etc.)? With Jackson/JAXB/an ORM/etc. you simply add a few annotations, if any at all, and it works. With “printers” you have to manually implement the serialization logic. Even with Xembly you still have to write a tedious, potentially huge method full of add()’s and up()’s. And when you add or remove a field, change a field definition, or add a new serialization format, it gets way more tedious to support. Another approach mentioned in the Twitter thread is having separate subclasses for each format/database, and an example can be seen here. I really don’t find that easy to read or support. And even if that were adopted in a project I’m working on, I’d be the first to replace the manual adding with reflection, however impure that may be. (Even Uncle Bob’s FitNesse project has getters, or even public fields, where that makes sense in terms of the state space.)
  • Having too much logic/behaviour in objects may be seen as breaking the single responsibility principle. In fact, this article argues that the anemic approach is actually SOLID, unlike the rich business object approach. The SRP may actually be understood in multiple ways, but I’ll get to that below.
  • Dependency injection containers are fine. The blunt example of how the code looks without them is here. No amount of theoretical object-oriented programming talk can make me write that piece of code. I guess one can get used to it, but (excuse my appeal-to-emotion fallacy here) it feels bad. And when you consider the case of dependency injection containers – whether you invoke a constructor from a main method, or your main starts a context that performs automatic constructor (or setter) injection, makes no real difference: your objects are still composed of their dependencies, and their dependencies are set externally. Except the former is more practical, and after a few weeks of nested instantiation you’ll feel inclined to write your own semi-automated mechanism to do it.

But these are all arguments derived from a common root – encapsulation. Your side in the above arguments depends on how you view and understand encapsulation. I see the purpose of encapsulation as protecting the state space of a class – an object of a given class is only valid if it satisfies certain conditions. If you expose the data via getters and setters, then the state space constraints can be violated – everyone can invalidate your object. For example, if you were able to set the size of an ArrayList without adding the corresponding element to the backing array, you’d break the behaviour of the ArrayList object – it would report its size inconsistently and code that depends on the List contract would not always work.

But in practical terms encapsulation still allows for the distinction between “data objects” and “business objects”. The data object has no constraints on its state – any combination of the values of its fields is permitted. Or in some cases it does have constraints, but they are enforced outside the currently running program (e.g. via database constraints when persisting an object with an ORM). In these cases, where the state space is not constrained, encapsulation is useless, and forcing it upon your software blindly results in software that, I believe, is harder to maintain and extend. Or at least you gain nothing – testing isn’t easier that way (you can have a perfectly well tested anemic piece of software), deployment is not impacted, and tracing problems doesn’t seem to make much of a difference.
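A minimal sketch of that distinction (the class names and the invariant are made up for illustration): the data object has no invariant worth protecting, so getters and setters cost nothing, while the business object has a real state-space constraint that only its methods should be able to touch:

    // A plain data object: any combination of field values is valid,
    // so exposing the state via getters/setters breaks nothing
    class AddressData {
        private String street;
        private String city;

        public String getStreet() { return street; }
        public void setStreet(String street) { this.street = street; }
        public String getCity() { return city; }
        public void setCity(String city) { this.city = city; }
    }

    // A business object with a real invariant: the balance must never go negative.
    // Here encapsulation earns its keep - the state can only change through
    // methods that preserve the constraint, so there is no setter for the balance.
    class Account {
        private long balanceCents;

        public void deposit(long amountCents) {
            if (amountCents <= 0) {
                throw new IllegalArgumentException("deposit must be positive");
            }
            balanceCents += amountCents;
        }

        public void withdraw(long amountCents) {
            if (amountCents <= 0 || amountCents > balanceCents) {
                throw new IllegalArgumentException("invalid withdrawal amount");
            }
            balanceCents -= amountCents;
        }

        public long getBalanceCents() { return balanceCents; }
    }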

I’m even perfectly fine with getting rid of the getters and exposing the state directly (as with the aforementioned LogData class from FitNesse).

And most often, in business applications, websites, and the general type of software out there, most objects don’t need to enforce any constraints on their state, because their state is just data, used somewhere else, in whatever way the business needs it to be used. To get back to the single responsibility principle – these data objects have only one reason to change: their … data has changed. The data itself is irrelevant. It will become relevant at a later stage – when it’s fetched via a web service (after it’s serialized to JSON), or after it’s fetched from the database by another part of the application or a completely different system. And that’s important – encapsulation cannot be enforced across several systems that all work with a piece of data.

In the whole debate I haven’t seen a practical argument against the getter/setter/anemic style. The only thing I see is “it’s not OOP” and “it breaks encapsulation”. Well, I think it should be settled now that encapsulation should not always be there – it should be there only when you need it, to protect the state of your object from interference.

So, don’t feel bad to continue with your ORMs, DI frameworks and automatic JSON and XML serializers.


Amend Your Contract To Allow For Side Projects

December 14, 2016

The other day Joel Spolsky blogged a wonderful overview of the copyright issues software companies have with respect to their employees. The bottom line is: most companies have an explicit clause in their contracts which states that all intellectual property created by a developer is owned by the employer. This is needed because the default (in many countries, including mine) is that the creator owns the copyright, regardless of whether they were hired to do the work or not.

That in turn means that any side project, or in fact any intellectual property that you create while being employed as a developer, is automatically owned by your employer. This isn’t necessarily too bad, as most employers wouldn’t enforce that right, but it has bugged me ever since I started working for software companies. Even before I knew the legal framework of copyright, the ownership clause in my contracts was always something that felt wrong. Even though Joel’s explanation makes perfect sense – companies need to protect their products from a random developer suddenly deciding they own the rights to parts of them – I’ve always thought there’s a middle ground.

(Note: there is a difference between copyright, patents and trademarks, and the umbrella term “intellectual property” is kind of ambiguous. I may end up using it sloppily, so for a clarification, read here.)

California apparently tried to fix this by passing the following law:

Anything you do on your own time, with your own equipment, that is not related to your employer’s line of work is yours, even if the contract you signed says otherwise.

But this doesn’t work, as basically everything computer-related is “in your employer’s line of work”, or at least can be, depending on the judge.

So let’s start with the premise that each developer should have the right to create side projects and profit from them, potentially even pursue them as their business. Let’s also keep in mind that a developer is not a person whose only ideas and intellectual products are “source code” or “software design(s)”. On the other hand, the employer must be 100% sure that no developer can claim any rights to parts of the employer’s product. There should be a way to phrase a contract so that it reflects these realities.

And I have always done that – whenever offered a contract, I’ve stated that:

  1. I have side projects, some of which are commercial, and it would make no sense for the employer to own them
  2. I usually create non-computer intellectual property – I write poetry, short stories, and linguistics problems

And I’ve demanded that the contract be reworded. It’s a bargain, after all, not one side imposing its “standard contract” on the other. So far no company has objected too much (there was some back-and-forth with the lawyers, but that’s it – companies decided it was better for them to hire a person they had assessed positively than to stick to an absolute contract clause).

That way we’ve ended up with the following wording, which I think is better than the California law, protects the employer, and also accounts for side projects and poetry/stories/etc.:

Products resulting from the Employee’s duties according to the terms of employment – individual or joint, including ideas, software development, inventions, improvements, formulas, designs, modifications, trademarks and any other type of intellectual property are the exclusive property of Employer, no matter if patentable.

I’m not sure it is properly translated, but the first part is key – if the idea/code/invention/whatever is the result of an assignment that I got, or of a product I am working on for the employer, then it is within the terms of employment. And it is way less ambiguous than “the employer’s line of work”. Anything outside of that is, of course, mine.

I can’t vouch for the legal rigor of the above, as I’m not a legal professional, but I strongly suggest negotiating such a clause into your contracts. It could be reworded in other ways, e.g. “work within the terms of employment”, but the overall idea is clear. And if you end up in court (which would probably almost never happen, but contracts are there to arrange edge cases), then even if the clause is not perfect, the judge/jury will be able to see its intent clearly.

And here I won’t agree with Joel – if you want to do something independent, you don’t have to be working for yourself. Side projects, of which I’ve always been a proponent (and other intellectual products), are not about risk-taking entrepreneurship – they are about intellectual development. They are a way to improve and expand your abilities. And it is a matter of principle that you own them. In rare cases they may be the basis of your actual entrepreneurial attempt.

Amending a standard contract is best done before you sign it. If you’ve already signed it, it’s still possible to add an annex, but less likely to happen. So my suggestion is: before you start a job, use your “bargaining power” to secure your intellectual property rights.


Progress in Electronic Governance [talk]

December 2, 2016

I’ve been an advisor to the deputy prime minister of Bulgaria for the past year and a half, and at this year’s OpenFest conference I tried to report on what we’ve achieved. It is not that much and there are no visible results, which is a bit disappointing, but we (a small, motivated team) believe we have laid the groundwork for a more open and properly built ecosystem for government IT systems.

Just to list a few things – we passed a law that requires open-sourcing custom-built government software, we opened a lot of data (1500 datasets) on the national open data portal, and we drew a roadmap of how existing state registers and databases should be upgraded in order to meet modern software engineering best practices and be ready to handle a high load of requests. We also seriously considered the privacy and auditability of the whole ecosystem. We prepared the electronic identification project (each citizen having the option to identify online with a secure token), an e-voting pilot and so on.

The video of the talk is available here:

And here are the slides:

Now that our term is at an end (due to the resignation of the government), we hope that openness-by-default will persist as a policy and that the new government agency we constituted will be able to push the agenda that has been laid out. Whether that will be the case in a complex political situation is hard to tell, but hopefully the “technical” and the “political” aspects won’t be entwined in a negative way. And our team will continue to support with advice (even though from “the outside”) whoever wishes to build a proper and open e-government ecosystem.


Domain Fallback Mechanism In Apps

November 19, 2016

As a consequence of the Dyn attack many major websites were down, including Twitter – browsers could not resolve an IP address for the servers because the authoritative name server (Dyn) was down. Whether that could be addressed globally, I don’t know – there was an interesting discussion on reddit about my proposal to increase TTLs – how resolution policies and algorithms can be improved, why a lower TTL is not always applicable, etc.

But while twitter.com was down, the mobile app was also not working. And while we have no control over the browser, we certainly do have control over the mobile app (the same goes for desktop applications, of course, but I’ll be talking mainly about mobile apps as the more dominant case). The reason the app was also down is that it most likely uses the twitter.com domain as well (e.g. api.twitter.com). And that’s the right way to do it, except in these rare situations when the DNS fails.

In these cases you can hardcode a list of server IP addresses in your app and fall back to them if the domain-name-based requests fail (e.g. after 3 attempts). If you change your server IP addresses, you just update the app. It doesn’t matter that the update will have an unpredictable delay (until everyone updates) and that some clients won’t have a proper IP – it is an edge-case fallback mechanism.

It’s not of course my idea – it’s been used by distributed systems like Bitcoin and BitTorrent – for example, when trying to join the network, the Bitcoin or a BitTorrent DHT client tries to connect to a bootstrap node in order to get a list of peers. There is a list of domain names in the client applications that resolve using a round-robin DNS to one of many known bootstrap nodes. However, if DNS resolution fails, the clients also have a small set of hardcoded IP addresses.

In addition to hardcoding IPs, you can regularly resolve the domain to an IP from within the app, and keep the “last known working IP” in a local cache. That adds a bit of complexity, and will not work for fresh or cleaned-up installations, but is a good measure nonetheless.

As this post points out, you can have multiple fallback strategies. Instead of, or better, in addition to hardcoding the IP, you can have a fallback domain name: foo.com and foo-fallback.com, managed by different DNS providers. That way your app will have a 4-step fallback mechanism (a sketch follows after the note below):

  1. try primary domain
  2. try fallback domain
  3. try cached last known working IP(s)
  4. try hardcoded fallback IPs

(Note: if you connect to an IP rather than a domain, you should do the verification of the server certificate manually (assuming you want to use a TLS connection, which you should). In Android that’s done by using a custom HostnameVerifier.)
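Here is a minimal sketch of that 4-step loop in Java – the endpoint URLs and IP addresses are placeholders, and a real client would also refresh the cached IP regularly and do the manual certificate verification mentioned in the note above:

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Arrays;
    import java.util.List;

    // Sketch of the 4-step fallback: try each endpoint in order and
    // move on to the next one on DNS failure or timeout
    public class FallbackClient {

        private static final List<String> ENDPOINTS = Arrays.asList(
                "https://api.example.com/status",           // 1. primary domain
                "https://api-fallback.example.com/status",  // 2. fallback domain (different DNS provider)
                "https://203.0.113.10/status",              // 3. cached last known working IP (placeholder)
                "https://203.0.113.11/status"               // 4. hardcoded fallback IP (placeholder)
        );

        public static String fetchWorkingEndpoint() throws IOException {
            IOException lastError = null;
            for (String endpoint : ENDPOINTS) {
                try {
                    HttpURLConnection connection =
                            (HttpURLConnection) new URL(endpoint).openConnection();
                    connection.setConnectTimeout(3000);
                    connection.setReadTimeout(3000);
                    if (connection.getResponseCode() == 200) {
                        return endpoint; // this endpoint works, use it for subsequent requests
                    }
                } catch (IOException e) {
                    lastError = e; // DNS resolution failure or timeout - try the next option
                }
            }
            throw lastError != null ? lastError : new IOException("all endpoints failed");
        }
    }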

Events like the Dyn attack are (let’s hope) rare, but can be costly to businesses. Adding the fallback mechanisms to at least some of the client software is quick and easy and may reduce the damage.


Using Named Database Locks

November 8, 2016

In a beginner’s guide to concurrency, I mentioned advisory locks. These are not the usual table locks – they are a table-agnostic, database-specific way to obtain a named lock from your application. Basically, you use your database instance for centralized application-level locking.

What could it be used for? If you want to have serial operations, this is a rather simple way to get them – no need for message queues or distributed locking libraries in your application layer. Just have your application request the lock from the database, and no other request (regardless of the application node, in case there are multiple) can obtain the same lock.

There are multiple functions that you can use to obtain such a lock – in PostgreSQL, in MySQL. The implementations differ slightly – in MySQL you need to explicitly release the lock, while in PostgreSQL a lock can be released automatically at the end of the current transaction.

How do you use it in a Java application, for example with Spring? You can provide a locking aspect and a custom annotation to trigger the locking. Let’s say we want to have sequential updates for a given entity. In the general use-case that would be odd, but sometimes we may want to perform some application-specific logic that relies on sequential updates.

    @Before("execution(* *.*(..)) && @annotation(updateLock)")
	public void applyUpdateLocking(JoinPoint joinPoint, UpdateLock updateLock) {
		int entityTypeId = entityTypeIds.get(updateLock.entity());
		// note: letting the long id overflow when fitting into an int, because the postgres lock function takes only ints
		// lock collisions are pretty unlikely and their effect will be unnoticeable
		int entityId = (int) getEntityId(joinPoint.getStaticPart().getSignature(), joinPoint.getArgs(),
				updateLock.idParameter());
		
		if (entityId != 0) {
			logger.debug("Locking on " + updateLock.entity() + " with id " + entityId);
			// using transaction-level lock, which is released automatically at the end of the transaction
			final String query = "SELECT pg_advisory_xact_lock(" + entityTypeId + "," + entityId + ")";
			em.unwrap(Session.class).doWork(new Work() {
				@Override
				public void execute(Connection connection) throws SQLException {
					connection.createStatement().executeQuery(query);
				}
			});
		}
     }

What does it do:

  • It looks for methods annotated with @UpdateLock and applies the aspect
  • the UpdateLock annotation has two attributes – the entity type and the name of the method parameter that holds the ID we want to lock updates on (a possible definition is sketched after this list)
  • entityTypeIds holds a mapping between the String name of the entity and an arbitrary number (because the postgres function requires numbers rather than strings)
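The annotation itself isn’t shown above; a plausible sketch of it (the attribute names match the ones used in the aspect, the default value is an assumption) could look like this:

    import java.lang.annotation.ElementType;
    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;
    import java.lang.annotation.Target;

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    public @interface UpdateLock {
        // logical entity type name, mapped to an int via entityTypeIds
        String entity();
        // name of the method parameter that holds the entity ID to lock on
        String idParameter() default "id";
    }

A method would then be annotated with something like @UpdateLock(entity = "invoice", idParameter = "invoiceId"), and the aspect would pick up the invoiceId argument to form the lock key.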

That doesn’t sound very useful in the general use-case, but if for any reason you need to make sure a piece of functionality is executed sequentially in an otherwise concurrent, multi-threaded application, this is a good way.

Use this database-specific way to obtain application-level locks sparingly, though. If you need to do it often, you probably have a bigger problem – locking is generally not advisable. In the above case it locks simply on a single entity ID, which means there will rarely be more than a couple of requests waiting at the lock (or failing to obtain it). The good thing is, it won’t get more complicated with sharding – if you lock on a specific ID and it resides on a single shard, then even though you may have multiple database instances (which do not share the lock), you won’t have to obtain the lock from a different shard.

Overall, it’s a useful tool to have in mind when faced with a concurrency problem. But consider whether you don’t have a bigger problem before resorting to locks.


Short DNS Record TTL And Centralization Are Serious Risks For The Internet

October 22, 2016

Yesterday Dyn, a DNS provider, went down after a massive DDoS attack. That led to many popular websites being inaccessible, including Twitter, LinkedIn, eBay and others. The internet seemed to be “crawling on its knees”.

We’ll probably read an interesting post-mortem from Dyn, but why did it happen? First, DDoS capacity is increasing, using insecure and infected IoT devices with access to the internet. Huge volumes of fake requests are poured onto a given server or set of servers and they become inaccessible, either because they are unable to cope with the requests, or simply because the network to the server doesn’t have enough throughput to accommodate all of them.

But why did “the internet” stop because a single DNS provider was under attack? First, because of centralization. The internet is supposed to be decentralized (although I’ve argued that, exactly because of DNS, it is pseudo-decentralized). But services like Dyn, UltraDNS, Amazon Route 53, and also Akamai and CloudFlare, centralize DNS. I can’t give exact figures, but out of the top 500 websites according to Moz.com, 181 use one of the above 5 services as their DNS provider. Add 25 Google services that use their own, and you get nearly 200 out of 500 concentrated in just 6 entities.

But centralization of the authoritative nameservers alone would not have led to yesterday’s problem. A big part of the problem, I think, is the TTL (time to live) of the DNS records – the records which contain the mapping between domain name and IP address(es). The idea is that you should not always hit the authoritative nameserver (Dyn’s server(s) in this case) – you should hit it only if there is no cached entry anywhere along the way of your request. Your operating system may have a cache, but more importantly, your ISP has a cache. So when subscribers of one ISP all make requests to Twitter, the requests should not go to the nameserver, but should instead be resolved by looking them up in the cache of the ISP.

If that was the case, regardless of whether Dyn was down, most users would be able to access all services, because they would have their IPs cached and resolved. And that’s the proper distributed mode that the internet should function in.

However, it has become common practice to set very short TTLs on DNS records – just a few minutes. So after those few minutes expire, your browser has to ask the nameserver “what IP should I connect to in order to access twitter.com?”. That’s why the attack was so successful – because no information was cached and everyone repeatedly turned to Dyn to get the IP corresponding to the requested domain.

That practice is highly questionable, to say the least. This article explains in detail the issues of short TTLs, but let me quote some important bits:

The lower the TTL the more frequently the DNS is accessed. If not careful DNS reliability may become more important than the reliability of, say, the corporate web server.

The increasing use of very low TTLs (sub one minute) is extremely misguided if not fundamentally flawed. The most charitable explanation for the trend to lower TTL value may be to try and create a dynamic load-balancer or a fast fail-over strategy. More likely the effect will be to break the nameserver through increased load.

So we knew the risks, and it was inevitable that this problematic practice would be abused. I decided to analyze how big the problem actually is. So I took the aforementioned top 500 websites as a representative sample, fetched their A, AAAA (IPv6), CNAME and NS records, and put them into a table. You can find the code in this gist (it uses the dnsjava library).
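For illustration, here is a minimal sketch of what such a lookup looks like with dnsjava (the actual collection code is in the gist; the domain here is just an example):

    import org.xbill.DNS.Lookup;
    import org.xbill.DNS.Record;
    import org.xbill.DNS.Type;

    // Fetch the A records of a domain and print their TTLs
    public class TtlCheck {
        public static void main(String[] args) throws Exception {
            String domain = "twitter.com";
            Record[] records = new Lookup(domain, Type.A).run();
            if (records != null) {
                for (Record record : records) {
                    System.out.println(domain + " -> " + record.rdataToString()
                            + " (TTL: " + record.getTTL() + "s)");
                }
            }
        }
    }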

The resulting CSV can be seen here. And if you want to play with it in Excel, here is the excel file.

Some other things I collected: how many websites have AAAA (IPv6) records (only 79 out of 500), whether the TTLs differ between IPv4 and IPv6 (they do for 4), the DNS provider (taken from the NS records, which is how I got the figures mentioned above), and how many use CNAME instead of A records (just a few). I also collected the number of A/AAAA records, in order to see how many (potentially) utilize round-robin DNS (187) (worth mentioning: the A records served to me may differ from those served to other users, which is also a way to do load balancing).

The results are a bit scary. The average TTL is only around 7600 seconds (2 hours and 6 minutes). But it gets worse when you look at the lower half (sort the values by TTL and take the lowest 250) – the average there is just 215 seconds. This means the DNS servers are hit constantly, which turns them into a real single point of failure, and “the internet goes down” after just a few minutes of DDoS.

Just a few websites have a high TTL, as can be seen from this simple chart (all 500 sites are on the X axis, TTL on the Y axis):


What are the benefits of a short TTL? Not many, actually. You have the flexibility to change your IP address, but you don’t do that very often, and besides, it doesn’t automatically mean all users will be pointed to the new IP, as some ISPs, routers and operating systems may ignore the TTL value and keep the cache alive for longer periods. You could do round-robin DNS, which is basically using the DNS provider as a load balancer, which sounds wrong in most cases. It can be used for geolocation routing – serving a different IP depending on the geographical area of the request – but that doesn’t necessarily require a low TTL: if caching happens closer to the user than to the authoritative DNS server, then the user will be pointed to the nearest IP anyway, regardless of whether that value gets refreshed often or not.

A short TTL is very useful for internal infrastructure – when pointing to your internal components (e.g. a message queue, or a particular service if using microservices), using low TTLs may be better. But that’s not about your main domain being accessed from the internet.

Overlay networks like BitTorrent and Bitcoin use DNS round-robin for seeding new clients with a list of peers that they can connect to (the first use of a torrent client connects you to one of several domains that each point to a number of nodes that are supposed to be always on). But that’s again a rare use case.

Overall, I think most services should go for higher TTLs. 24 hours is not too much, and you would need to keep your old IP serving requests for 24 hours anyway, because of caches that ignore the TTL value. That way services won’t care whether the authoritative nameserver is down or not. And that would in turn mean that DNS providers would be a less interesting target for attacks.

And I understand the flexibility that Dyn and Route53 give us. But maybe we should think of a more distributed way to gain that flexibility. Because yesterday’s attack may be just the beginning.
