Identity in the Digital World

May 21, 2016

“Identity” is a set of features that allow unique identification of a person and distinguishing them from others. That sounds simple enough, but it turns out to have a lot of implications in the modern, connected, global world.

Identity today is government managed. You are nobody if a government hasn’t confirmed that you are indeed somebody. The procedures in the countries vary, but after you are born, you get issued a birth certificate and your name (and possibly number) are entered into a database (either centralized or decentralized). From then on you have an “identity”, which you can later prove using some sort of a document (ID card, passport, driving license, social security number, etc.)

It is not that the government owns your identity, because you are far more than your ID card, but certain attributes of your identity are recorded by the government, and then it certifies (via a document and the relevant database) that this is indeed you. These attributes include your names, which have been used to identify people since forever, your address, your photo, height, eye color. Possibly your fingerprints and your iris. But we’ll get to these biometric attributes later.

Why is all this important? Except for cases of people living in small isolated tribes, where they probably don’t even need names for identifying others, the so called “civilized world” needs to be able to differentiate one person from another for all sorts of reasons. Is the driver capable of driving, is the pilot capable of flying a plane — they may show a certificate, but is it really them that were certified (“Catch me if you can” shows how serious this can be)? Who owns a given property? Is it this one, claiming to be John Smith, or that one, also claiming to be John Smith? The ownership certificate may be lost, but there is a record somewhere that holds the information. We just have to identify the real John Smith.

Traveling is another case — although rather suboptimal, the current world has countries and borders, and various traveling restrictions. You have to prove that you are you, and that you have the right to travel. You have to prove you are American, or that you have a visa, if you want to enter the United States.

There are many other cases — crime-fighting, getting a bank loan, getting employed, etc.

You may argue that you should be able to be totally anonymous and still do all of the above, but unfortunately, in a global society, fraud is too likely to allow us to deal with anonymous people. By that I’m not saying we should be identified for everything we do — not at all, it should be limited to where it makes practical sense. But there is a sufficient number of these use-cases.

Offline identity is one thing, but there’s also the notion of “online identity”: a way to prove who you are on the internet. Most often (and rightly so) that is an anonymous registration process; occasionally it relies on an identity provider like Facebook or Twitter (where, again, you don’t have to disclose your true identity). But when you perform legally significant actions, or communicate with governments in order to obtain data or certificates about yourself, the service provider has to be able to prove it is really you. This is where the “electronic identification” process comes in. It was recently defined in an EU regulation, and in most cases it means you have a government-issued hardware token that only you own and know how to unlock.

But since identity exists, it can be stolen or forged. There is the so-called “identity theft”, and it’s used in multiple ways that are out of the scope of this post. But people do steal others’ identities, both online and offline.

One instance of identity theft is using another person’s identity document. Similarly, one can forge an identity document to say whatever they want it to say. And this may lead to dire consequences for unsuspecting citizens. So government and experts are trying to fight this problem. Let’s take a look at the two distinct use-cases.

Document forgery is addressed by making ever more complicated documents, with all sorts of security features, invisible components, laser-engraved elements, specific laser angles, and so on. This, of course, is imperfect, not only because it is “security through obscurity” (who guarantees that your government won’t leak the “secret sauce” for making its documents, or worse, supply the forgers with the raw materials needed to make a document?), but also because a forged document can still pass inspection, as humans are not perfect when inspecting documents. To put it another way: if the one inspecting the document knows what to look for, surely the forger also knows that.

Document theft (including document copying) is addressed by comparing the picture. And that’s about it. If you look similar to someone else and you get his identity document, you can safely pretend to be him for a long time.

None of the solutions seem good enough. So to the rescue come electronic documents. Passports are a somewhat universal identity document, and most passports are now eMRTD (Electronic machine readable travel document). Issues with them aside, the basic idea is that they have some information stored that a) guarantees the document is issued by a trusted authority and b) it belongs to the person holding it.

The first part is guaranteed via a public key infrastructure — the contents of the document are digitally signed by the issuing authority. So nobody can create his own passport or ID card, because he doesn’t have the private key of the issuing authority (and the private key cannot be extracted, because an HSM, where it is stored, doesn’t allow that).

The second part is trickier. It is currently addressed by storing your facial image and fingerprints on the chip and then comparing the image and fingerprints of the holder to the stored ones (remember that the content is certified by a digital signature, which is practically bulletproof for the time being). The facial image part is flawed, and at the moment barely anyone checks the fingerprint part, but this option exists and it is getting more and more traction “with all that terrorism”.

So starting from the somewhat intuitive concept of identity, we’ve come to the point where governments make databases of fingerprints. And then iris data, and DNA (as in Kuwait, for example).

Although everything above sounds logical, the end result is somewhat scary. People’s biometric information being stored in databases, potentially at risk of breaches, potentially misused by governments, sounds dystopian. We are no longer the owners of our identity: someone else has collected our attributes (attributes that do not change throughout our entire life) and stores them for future use. For whatever use. That someone doesn’t have to store them for the sake of identification, as there are technologies that allow storing the data on a card that does the comparison internally, without revealing the stored data. But that option seems to be ignored, strengthening the dystopian feeling.

Recently I’ve been thinking about how to address all of these: how to make sure identity still does its job without compromising privacy. Two hours after I’d had some ideas, I spoke with someone with far more experience in identity technology than me, and it turned out he had had quite similar ideas.

And here technology comes into play. We are a combination of our unchangeable traits — fingerprints, iris, DNA. You can differentiate even identical twins based on these attributes. You also have other, more volatile attributes — height, weight, names, address, favorite color even.

All of these represent your identity. And it can be managed by turning the essential, unchangeable parts of it into a key: an anonymous key derived using a one-way function, a so-called “hash”. After you hash your fingerprints, iris and DNA, you’ll get a long value, e.g. 2fd4e1c67b2d28fced849ee1bb76e7391b93eb12, that represents you (read here about fingerprint hashing).

This will be you and you will be able to prove it, as every time someone needs you to prove your identity, you will get your fingerprints, iris and DNA scanned, and the result of applying the one-way function will be again 2fd4e1c67b2d28fced849ee1bb76e7391b93eb12.

Additionally, you can probably add some “secret” word to that identity. So that your identity is not only what you are (and cannot change), but also what you know. That would mean that nobody can come up with your identity unless you tell them your secret (sounds a little like “A Wizard of Earthsea”).
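
To make the idea concrete, here is a minimal sketch of such a derivation. It is only an illustration under strong assumptions: the class and method names are hypothetical, SHA-256 is my choice of one-way function, and the biometric inputs are assumed to be already normalized into templates that come out byte-for-byte identical on every scan (which raw sensor readings do not do).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class IdentityKey {

    // Hypothetical helper: derives an identity string from stable biometric
    // templates plus a secret word ("what you are" plus "what you know").
    static String derive(byte[] fingerprints, byte[] iris, byte[] dna, String secret) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[][] parts = {fingerprints, iris, dna, secret.getBytes(StandardCharsets.UTF_8)};
            for (byte[] part : parts) {
                // Length-prefix each part so that shifting bytes between
                // neighbouring parts cannot produce the same input stream.
                sha256.update(ByteBuffer.allocate(4).putInt(part.length).array());
                sha256.update(part);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : sha256.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory for every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }
}
```

The same inputs always give the same 64-character string, and changing the secret word gives a completely different one. A real design would probably run the secret through a dedicated key-derivation function rather than a bare hash.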

Of course, full identification will rarely be required. If you want to buy alcohol, only your age matters; if you want to get a contract for cable internet, only your name and address matter, and so on. For that, sub-identities can exist — they belong to a “parent” identity, but the verifier doesn’t need such a high level of assurance that it is indeed you. The sub-identity can be “just the fingerprints”, or even…a good old identity document. Each sub-identity can prove a set of attributes, certified by an authority — not necessarily a government authority.

Your sub-identity, a set of attributes, can be written on a document — something you carry around that certifies, with a significant level of certainty, that this is indeed you. It will hold your “hash”, so that anyone who wants to do a full check, can do so. The other option is the implant. Scary and dystopian, I know. It seems just a little different than an ID card — it is something you carry with you, and you have to carry with you. Provided that you control whether someone is allowed to read your implant, it becomes a slightly advanced identity card or a driver’s license.

Even when we have an identity string, the related data — owned properties, driving capabilities, travel visas, employment, bank loans — will be stored in databases, where the identity string is the lookup key. These databases are now government owned, but can very well be distributed, e.g. using a blockchain. Nobody can claim he’s you, as he cannot produce the same identity string based on his biometrics. The nodes on the blockchain network can be the implants, which hold encrypted information about you, and only you can decide when to decrypt it. That would make for a distributed human database where one is in full control of his data.

But is this feasible? The complexity of the system, and especially of managing one’s identity, may be too high. We can create a big, complex system, involving implants and biometrics, for solving a problem that is actually a tiny one. This is the first question we should ask before proceeding to such a thing. Not whether governments should manage identities, not whether we should be identifiable, but whether we need a dramatic shift in the current system. Or does an electronic ID card with match-on-card (not centrally stored) fingerprints and electronically signed contents solve 99% of the issues?

Although I’m finding it fascinating to envision a technological utopia, with cryptography heavily involved, and privacy guaranteed by technological means, I’m not sure we need that.


Cleanup Temp Files

May 17, 2016

I’ve been spring-cleaning some devices and the obviousness of the advice in the title seems questionable. I found tons of unused temp files which applications (android apps, desktop applications and even server application deployments) haven’t cleaned. This is taking up space and it means more manual maintenance.

For server-side applications the impact is probably smaller, as the environment is entirely under your control and you can regularly clean up the data, or not care at all, as you regularly re-create the machine (e.g. in an AWS deployment where each upgrade to the system means new machines get spawned and the old ones – deleted). But anyway, if you use temp files (and Java), use File.createTempFile(..) and don’t forget to call file.deleteOnExit().

For client-side applications (smartphone apps or desktop software) the carelessness of not deleting temp files leads to the users’ disappointment at some point in time, when they realize their storage is filled with your useless files. The delete-on-exit approach works again, but maybe you need the files to survive more than one run. So simply have a job, or a startup check, that checks whether temp files are older than a certain period, and if they are – deletes them.
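
Such a startup check can be sketched in a few lines. This is only an illustration: the class name, the file-name prefix convention and the choice of last-modified time as the age criterion are my assumptions, not a prescribed approach.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

final class TempFileCleaner {

    // Deletes regular files directly inside dir whose name starts with prefix
    // and whose last-modified time is older than maxAge.
    // Returns the number of deleted files. The prefix is used inside a glob,
    // so it should not contain glob characters such as '*' or '['.
    static int deleteOlderThan(Path dir, String prefix, Duration maxAge) throws IOException {
        Instant cutoff = Instant.now().minus(maxAge);
        int deleted = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, prefix + "*")) {
            for (Path file : files) {
                if (Files.isRegularFile(file)
                        && Files.getLastModifiedTime(file).toInstant().isBefore(cutoff)) {
                    Files.delete(file);
                    deleted++;
                }
            }
        }
        return deleted;
    }
}
```

Calling something like TempFileCleaner.deleteOlderThan(tempDir, "myapp-", Duration.ofDays(7)) on startup would keep recent files around for the next run and drop the rest.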

The effect of this little thing being omitted by developers is that users have to analyze their storage with special tools (sometimes – paid) in order to find the “offenders”. And the offenders are not always obvious, and besides – users are not necessarily familiar with the concept of a temp file. Even I’m sometimes not sure whether a given file that looks like a temp one isn’t actually necessary for proper functioning.

Storage is cheap, but good practices should not be abandoned because of that.


Dirty Hacks Are OK

May 12, 2016

In practically every project you’ve used a “dirty hack”. setAccessible(true), sun.misc.Unsafe, changing a final value with reflection, copy-pasting a class from a library to change just one line of wrong code. Even if you haven’t directly, a library that you are using most certainly contains some of these.

Whenever we do something like that, we are reminded (by stackoverflow answers and colleagues alike) that this is a hack and it’s not desirable. And that’s ok – the first thing we should think about when using such a hack, is whether there isn’t a better way. A more object-oriented way, a more functional way. A way that the language allows for, but might require a bit more effort. But too often there is no such way, or at least not one that isn’t a compromise with other aspects (code readability, reuse, encapsulation, etc.). And especially in cases where 3rd party libraries are being used and “hacked”.

Vendors are also trying to make us avoid them – changing the access to a field via reflection might not work in some environments (some JavaEE cases included), due to a security manager. And one of the most “arcane” hacks – sun.misc.Unsafe is even going to be deprecated by Oracle.

But since these “hacks” are everywhere, including the Unsafe magic, deprecating or blocking any of them will just make the applications stop working. As you can see in the article linked above, practically every project depends on sun.misc.Unsafe. It wouldn’t be an exaggeration to say that such “dirty hacks” are the reason major frameworks and libraries in the Java ecosystem exist at all – Hibernate, Spring and Guava are among the ones that use them heavily.

So deprecating them is not a good idea, but my point here is different. These hacks get things done. They work. With some caveats and risks, they do the task. If the alternative is to fork a 3rd party library and support the fork, or to suggest a patch that doesn’t get accepted for a while when your deadline is soon, these tricks are actually working solutions. They are not “beautiful”, but they’re OK.

Too often 3rd party libraries don’t offer exactly what you need. Either there’s a bug, or some method doesn’t behave according to your expectations. If using setAccessible in order to change a field or invoke a private method works – it’s a better approach than forking (submit an improvement request, of course). But sometimes you have to change a method body – for these use cases I created my quickfix tool a few years ago. It’s dirty, but does the job, and together with the rest of these hacks, lets you move forward to delivering actual value, rather than wondering “should I use a visitor pattern here”, or “should we fork this library and support it in our repository and maven repository manager until they accept our pull request and release a new version”, or “should I write this with JNI”, or even “should we do this at all, it’s not possible without a hack”.
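
For readers who haven’t used it, this is roughly what the setAccessible hack looks like. The LibraryClient class below is a made-up stand-in for a 3rd party class you can’t edit, and the helper name is hypothetical too; treat it as a sketch rather than a recommendation.

```java
import java.lang.reflect.Field;

final class ReflectionHack {

    // Hypothetical stand-in for a library class with an inconvenient private field.
    static final class LibraryClient {
        private int timeoutSeconds = 5;

        int effectiveTimeout() {
            return timeoutSeconds;
        }
    }

    // Overwrites a private int field on an existing instance.
    static void setPrivateInt(Object target, String fieldName, int value)
            throws ReflectiveOperationException {
        Field field = target.getClass().getDeclaredField(fieldName);
        // This is the "dirty" part. It can be refused at runtime, for example
        // by a security manager or across module boundaries on newer JVMs.
        field.setAccessible(true);
        field.setInt(target, value);
    }
}
```

After ReflectionHack.setPrivateInt(client, "timeoutSeconds", 30), the client’s effectiveTimeout() returns 30, even though the library never offered a setter. The price is that the field name is now a string the compiler can’t check, so a library upgrade can break it silently.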

I know this is not the best advice I’ve given, and it’s certainly a slippery slope – too much of the “get it done quick and dirty, I don’t care” mentality is surely a disaster. But poison can be a cure in small doses, if applied with full understanding of the issue.


A Beginner’s Guide to Addressing Concurrency Issues

April 20, 2016

Inserts, updates and deletes. Every framework tutorial starts with these and they are seen as the most basic functionality that just works.

But what if two concurrent requests try to modify the same data? Or try to insert the same data that should be unique? Or what if the inserts and updates have side-effects that have to be stored in other tables (e.g. an audit log)?

“Transactions” you may say. Well, yes, and no. A transaction allows a group of queries to be executed together – either pass together or fail together. What happens with concurrent transactions depends on a specific property of transactions – their isolation level. And you can read here a very detailed explanation of how all of that works.

If you select the safest isolation level – serializable (and repeatable read), your system may become too slow. And depending on the database, transactions that happen at the same time may have to be retried by specific application code. And that’s messy. With other isolation levels you can have lost updates, phantom reads, etc.

Even if you get your isolation right, and you properly handle failed transactions, isolation doesn’t solve all concurrency problems. It doesn’t solve the problem of having an application-imposed data constraint (e.g. complex uniqueness logic that can’t be expressed as a database unique constraint), it doesn’t solve the problem of inserting exact duplicates, it doesn’t solve other application-level concurrency issues, and it doesn’t perfectly solve the data modification issues. You may have to get into database locking, and locking is tedious. What is a write lock, a read lock, what is an exclusive lock, and how not to end up in a deadlock (or a livelock)? I’m sure that even developers with a lot of experience are not fluent in database locks, because you either don’t need them, or you have a bigger problem that you should solve first.

The duplicate submission problem is a bit off-topic, but it illustrates that not all concurrent request problems can be solved by the database alone. As many people suggest, it is solved by a token that gets generated for each request and stored in the database using a unique constraint. That way two identical inserts (a result of a double-submission) cannot both go in the database. This gets a little more complicated with APIs, because you have to rely on the user of the API to provide the proper token (and not generate it on the fly in their back-end). As for uniqueness – every article that I’ve read on the matter concludes that the only proper way to guarantee uniqueness is at the database level, using a unique constraint. But when there are complicated rules for that constraint, you are inclined to check in the application. And in this case concurrent requests will eventually allow for two records with the same values to be inserted.

Most of the problems are easy if the application runs on a single machine. You can utilize your language’s concurrency features (e.g. Java locks, concurrent collections) to make sure everything is properly serialized, that duplicates do not happen, etc. However, when you deploy to more than one machine (which you should), that becomes a much harder problem.
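
As an illustration of the single-machine case, here is a small sketch of per-entity locking built on Java’s concurrent collections. The EntityLocks class and its key format are hypothetical, and it only protects code running inside one JVM, which is exactly the limitation described above.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

final class EntityLocks {

    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    // Runs the action while holding the lock for (entityType, entityId), so
    // that two threads never work on the same entity at the same time.
    // Locks are never evicted in this sketch, so the map grows with the
    // number of distinct entities seen.
    <T> T withLock(String entityType, long entityId, Supplier<T> action) {
        ReentrantLock lock =
                locks.computeIfAbsent(entityType + ":" + entityId, key -> new ReentrantLock());
        lock.lock();
        try {
            return action.get();
        } finally {
            lock.unlock();
        }
    }
}
```

Requests touching different entities don’t block each other, while requests for the same entity are serialized. The moment a second application node appears, each node has its own map and the guarantee is gone.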

So what are the approaches to address concurrency issues, apart from transactions? There are many, and here are a few of them (in no meaningful order).

  • There is Hazelcast, which lets you use distributed locks – the whole cluster follows the Lock semantics as if it was a single machine. That is language-specific, and setting up a Hazelcast cluster for just a few use cases (because not all of your requests will need that) may be too much.
  • You can use a message queue – push all requests to a message queue that is processed by a single (async) worker. That may be useful in some cases, and impractical in others (if you have to return some immediate response to the user, for example)
  • You can use Akka and its clustering capabilities – it guarantees that an actor (think “service”) is processing only one message at a time. But using akka for everything may not be a good idea, because it completely changes the paradigm, it is harder to read and trace, harder to debug, and is platform-specific (only JVM languages can make use of it).
  • You can use database-specific application level locks. That’s something quite useful, even though it is entirely RDBMS-dependent. PostgreSQL has advisory locks, MySQL has get_lock, others probably have something similar. The idea here is that you use the database as your distributed lock mechanism. The locks are managed by the application, and don’t even need to have anything to do with your tables – you just ask for a lock for, say (entityType, entityId), and then no other application thread can enter a given piece of code, unless it successfully obtains that database lock. It is kind of like the Hazelcast approach, but you get it “for free” with the database. Then you can have, for example, a @Before (Spring) aspect that attaches to service methods and does the locking appropriate for the current application use-case, without using table locks.
  • You can use a CRDT. It’s a data structure designed so that, no matter in what order the operations are applied (or whether some of them are applied twice), it ends up in the same state. It’s explained in more detail in this presentation. How a CRDT maps to a relational database is an interesting question I don’t have an answer to, but the point is that if your operations are idempotent and order-independent, you will probably have fewer issues.
  • Using the “insert-only” model. Databases like Datomic are using it internally, but you can use it with any database. You have no deletes, no updates – just inserts. Updating a record is inserting a new record with the “version” increased. That again relies on database features to make sure you don’t end up with two records with the same version, but you never lose data (concurrent updates will make it so that one is “lost”, because it’s not the latest version, but it’s stored and can be reverted to). And you get an audit log for free.
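
To give a feel for the CRDT option, here is a sketch of one of the simplest ones, a grow-only counter (usually called a G-Counter). The class is written from scratch for illustration and is not taken from any library; node ids are arbitrary strings.

```java
import java.util.HashMap;
import java.util.Map;

final class GCounter {

    // One monotonically growing count per node that has ever incremented.
    private final Map<String, Long> counts = new HashMap<>();

    // Records n increments that happened on the given node.
    void increment(String nodeId, long n) {
        counts.merge(nodeId, n, Long::sum);
    }

    // Merges another replica's state by keeping the per-node maximum.
    // Taking a maximum is commutative, associative and idempotent, so the
    // order of merges and repeated merges do not matter.
    void merge(GCounter other) {
        other.counts.forEach((node, count) -> counts.merge(node, count, Long::max));
    }

    // The counter value is the sum over all nodes.
    long value() {
        long total = 0;
        for (long count : counts.values()) {
            total += count;
        }
        return total;
    }
}
```

If node A counts 3 and node B counts 2, merging A then B, B then A, or A twice and then B all give 5. No locks are needed because there is no order to protect.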

The overall problem is how to serialize requests without losing performance. And all the various lock mechanisms and queues, including non-blocking IO, address that. But what makes the task easier is having a data model that does not care about concurrency. If the latter is applicable, always go for it.

Whole books have been written on concurrency, and I realize such a blog post is rather shallow by definition, but I hope I’ve at least given a few pointers.


How To Read Your Passport With Android

April 5, 2016

As I’ve been researching machine readable travel documents, I decided to do a little proof-of-concept on reading ePassports using an NFC-enabled smartphone (Android).

The result is on GitHub, and is based on the jMRTD library, which provides all the necessary low-level details.

As I pointed out in my previous article, the standards for the ePassports have evolved a lot throughout the years – from no protection, to BAC, to EACv1, EACv2 and SAC (which replaces BAC). Security is still doubtful, as most of the passports and inspection systems require backward compatibility to BAC. That’s slowly going away, but even when BAC goes away, it will be sufficient to enter the CAN (Card Authentication Number) for the PACE protocol, so the app will still work with minor modifications.

What the app does is:

  1. Establishes NFC communication
  2. Authenticates to the passport using the pre-entered passport number, date of birth and expiry date (hardcoded in the app at the moment). Note that the low security of the protocol is due to the low entropy of this combination, and brute force is an option, as passports cannot be locked after successive failures.
  3. Reads mandatory data groups – all the personal information present in the passport, including the photo. In the example code only the first data group (DG1) is read, and the personal identifier is shown on the screen. The way to read data groups is as follows:
    InputStream is = ps.getInputStream(PassportService.EF_DG1);
    DG1File dg1 = (DG1File) LDSFileUtil.getLDSFile(PassportService.EF_DG1, is);
  4. Performs chip authentication – the first step of EAC, which makes sure that the chip is not cloned – it requires proof of ownership of a private key, which is stored in the protected area of the chip.
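
To show where the low entropy in step 2 comes from, here is a sketch of how the BAC key seed is derived from the three values, as I understand ICAO Doc 9303. jMRTD does this internally; the class below is only an illustration written from scratch, with hypothetical names, and it expects the document number already padded with '<' to 9 characters and dates in YYMMDD form.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

final class BacKeySeed {

    // MRZ check digit: weights 7, 3, 1 repeating; digits keep their value,
    // letters A-Z map to 10-35, the filler '<' counts as 0; result is mod 10.
    static int checkDigit(String field) {
        int[] weights = {7, 3, 1};
        int sum = 0;
        for (int i = 0; i < field.length(); i++) {
            char c = field.charAt(i);
            int value;
            if (c >= '0' && c <= '9') {
                value = c - '0';
            } else if (c >= 'A' && c <= 'Z') {
                value = c - 'A' + 10;
            } else if (c == '<') {
                value = 0;
            } else {
                throw new IllegalArgumentException("Not an MRZ character: " + c);
            }
            sum += value * weights[i % 3];
        }
        return sum % 10;
    }

    // Concatenates each field with its check digit.
    static String mrzInformation(String documentNumber, String dateOfBirth, String dateOfExpiry) {
        return documentNumber + checkDigit(documentNumber)
                + dateOfBirth + checkDigit(dateOfBirth)
                + dateOfExpiry + checkDigit(dateOfExpiry);
    }

    // The key seed is the first 16 bytes of SHA-1 over the MRZ information.
    static byte[] keySeed(String mrzInformation) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-1")
                    .digest(mrzInformation.getBytes(StandardCharsets.US_ASCII));
            return Arrays.copyOf(hash, 16);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Everything that goes into the seed is printed on the data page of the passport, and the check digits add nothing an attacker can’t compute, so the secret is only as strong as the passport number and the two dates.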

The code has some questionable coding practices – e.g. the InputStream handling (the IDE didn’t initially allow me to use Java 7, and I didn’t try much harder), but I hope they’ll be fixed if used in real projects.

One caveat – for Android there’s a need for SpongyCastle (which is a port of the BouncyCastle security provider). However it is not enough, so both have to be present for certain algorithms to be supported. Unfortunately, jMRTD has a hardcoded reference to BouncyCastle in one method, which leads to the copy-pasted method for chip authentication.

There is one more step of EAC – the terminal authentication, which would allow the app to read the fingerprints (yup, sadly there are fingerprints there). However, EAC makes it harder to do that. I couldn’t actually test it properly, because the chip rejects verifying even valid certificates, but anyway, let me explain. EAC relies on a big infrastructure (PKI) where each participating country has a Document Verifier CA, whose root certificate is signed by all other participating countries (as shown here). Then each country issues short-lived (1 day) certificates signed by the DVCA, which are used in the inspection systems (border police and automatic gates). The certificate chain now contains all countries’ root certificates, followed by the DVCA certificate, followed by the inspection system certificate. The chip has to verify that this chain is valid (by verifying that each signature on a certificate is indeed performed by the private key of the issuer). The chip itself has the root certificate of its own country, so it has the root of the chain and can validate it (which is actually the first step). Finally, in order to make sure that the inspection system certificate is really owned by the party currently performing the protocol, the chip sends a challenge to be signed by the terminal.
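
The final challenge-response step can be sketched with plain JDK cryptography. This is a heavily simplified illustration and not the actual terminal authentication protocol: the real one uses card-verifiable certificates and signs more than the bare challenge, and the algorithm, key size, challenge length and all names below are my assumptions.

```java
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.SecureRandom;
import java.security.Signature;

final class ChallengeResponse {

    // "Chip" side: a fresh random challenge for every run of the protocol.
    static byte[] newChallenge() {
        byte[] challenge = new byte[8];
        new SecureRandom().nextBytes(challenge);
        return challenge;
    }

    // "Terminal" side: proves possession of the private key by signing the challenge.
    static byte[] sign(PrivateKey terminalKey, byte[] challenge) throws GeneralSecurityException {
        Signature signer = Signature.getInstance("SHA256withECDSA");
        signer.initSign(terminalKey);
        signer.update(challenge);
        return signer.sign();
    }

    // "Chip" side: checks the signature against the public key taken from the
    // (already validated) inspection system certificate.
    static boolean verify(PublicKey terminalPublicKey, byte[] challenge, byte[] signature)
            throws GeneralSecurityException {
        Signature verifier = Signature.getInstance("SHA256withECDSA");
        verifier.initVerify(terminalPublicKey);
        verifier.update(challenge);
        return verifier.verify(signature);
    }

    // Generates a throwaway key pair standing in for the inspection system's keys.
    static KeyPair newTerminalKeyPair() throws GeneralSecurityException {
        KeyPairGenerator generator = KeyPairGenerator.getInstance("EC");
        generator.initialize(256);
        return generator.generateKeyPair();
    }
}
```

Because the challenge is random each time, a recorded signature from an earlier session is useless; only someone holding the private key behind the certificate can answer a new challenge.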

So, unless a collision is found and a fake certificate is attached to the chain, you can’t easily perform “terminal authentication”. Well, unless a key pair leaks from some inspection system somewhere in the world. Then, because the chip does not have a clock, even though the certificates are short-lived, they would still allow reading the fingerprints, because the chip can’t know they are expired (it syncs the time with each successful certificate validation, but that only happens when going through border control at airports). Actually, you could also try to spam the chip with a huge chain, and it will at some point “crash”, and maybe it will do something that it wouldn’t normally do, like release the fingerprints. I didn’t do that for obvious reasons.

But the point of the app is not to abuse the passports – there may be legitimate use-cases to allow reading the data from them, and I hope my sample code is useful for that purpose.


Software Can’t Live On Its Own

March 30, 2016

We’re building software in the hope that some day we’ll leave it and it will live on its own. Or with minor supervision. But the other day when my father asked me to dig up an old website, I did some thinking and realized auto-pilot software is almost never the case.

Software is either being supported, or is abandonware, or is too simple. We constantly have to “fix” something, on each piece of software. Basically, picking up an old project and running it is rather hard – it would most probably require upgrading a ton of components. For example:

  • Fixing edge cases, bugs, security issues. The software environment is dynamic, and no complex software is without uncovered edge cases. Security issues arise constantly and have to be patched. Unless we find a way to write bugless software with perfect security, we have to support all these.
  • Breaking upgrades:
    • browsers are being upgraded constantly, and old websites probably won’t work. Protocols remain backward compatible for a while, but then support is discontinued and one has to upgrade. Operating systems introduce breaking changes to software running on them – one clear example is Android, where with each major version something doesn’t work anymore (because it was deprecated 2 versions prior) or has to be done in a different way. We have to be there and tweak our code to accommodate these upgrades.
    • Frameworks and languages get upgrades as well – and sometimes we can’t even build our legacy software anymore. Even if we can, the target environment may not support our old versions. The aforementioned site was written in PHP 4. Shared hosting providers no longer offer PHP 4, so will that site work? Possibly, but it will need tweaks.
    • Changes in 3rd party APIs. If you rely on something like a facebook API, or a Google API, chances are your 3-year-old project will no longer work.
  • New use-cases – the real world is dynamic, and software that supports some real-world activity has to change with it. Some features become obsolete, new features are needed. Vendors like to advertise “draw-it-yourself” tools that create new forms and business processes without any technical expertise, but that rarely works properly.
  • Visual design becomes outdated. Remember Web King? Maybe that was the design of 1995, but not anymore. We’ve gone through waves and waves of new design trends, and often it’s not okay to look outdated.

A piece of software is not like a building – you can’t build it once and have it live for decades with just occasional repairs. It is not like a kitchen appliance either – you can’t just replace it with a newer version.

It isn’t like a building, because it’s too complex (not diminishing the role of real architects, but they have a limited set of use cases). And it isn’t like kitchen appliances, because kitchen appliances don’t have data.

And actually, data migration is one of the reasons legacy software exists – migrating it to something new is hard – one should fit it into a new structure, and into a new database. And even simple migration from an older database version to a newer database version is hard. Migrating structures and even usecases is horrible. I won’t even mention triggers, stored procedures to be migrated across vendors and so on.

So yes, keeping an old piece of software running requires a lot of effort; migrating it to a newer and better piece of software is often a doomed project and you’re stuck with your existing system forever.

That means there is a whole big branch of the IT market that focuses on that – providing software to clients and then keeping them bound to that software forever. With regular updates and support. There is another type of companies, where things are more straightforward – the “single product as a service” companies. The cool web 2.0 startups are mostly single-product-as-a-service companies and if the company dies, the product dies with it. If the company manages to make some money, you don’t care…until it dies, and then your migration to a new piece of software is the promised nightmare.

Leaving simple software aside (my computoser has been running unsupervised for 2 years already; not that it’s simple, but its complexity is confined to the algorithm; I heard that the software I wrote for the trash cleaning company when I was 16 is still in use in my hometown), everything else needs constant care. And given that more and more software is being built, this leads us to the sad realization that we’ll have to support a lot of software. More and more of programmers’ work will be caring for what’s already been built, rather than building something new. And on one hand that’s sad. That means software for many is not “craftsmanship”, not “science”, not “making cool things”. It is mundane support and gradual extension of old, clunky bulks of code.

Unless we learn to build self-supporting software. Software that automatically overcomes OS upgrades, framework and protocol upgrades. Software that allows extending without writing code (which many systems claim to do even now, but very few actually do). Until that time, I’m afraid we’re stuck with supporting our current projects, and in the best case – extending them to fit new needs and customers.


Take a Step Back

March 19, 2016

A software project can become “legacy” just three months after its inception. I’ve recently seen many projects that look OK on the surface, but are in fact so “broken” that they have to be rewritten. Well, they work, but continuing their support is a pain. And everyone has probably seen at least one such project, where you want to just throw it away and start from scratch. The problem is – starting from scratch won’t guarantee a good outcome either. Why do things go that way?

I think it’s because developers tend to solve problems one at a time, achieving a “local maximum”. The fixes don’t need to be “quick dirty fixes” – even a reasonable-sounding fix for the particular problem at hand may yield de-facto legacy code. And if we view software development as constant problem fixing (as even introducing new features consists of fixing problems along the way), we have a problem.

My approach to that process is to take a step back and ask the question “is this really the problem I’m solving, or is there a bigger underlying problem”. And sometimes there is. There are some clear indications that there is such a problem. “Code smell” is a popular term for that, but I’d like to extend it – sometimes it’s not the thing that you do that makes things smell, but rather something done before makes you take stupid decisions later. Sometimes these decisions don’t even look wrong in the context that you’ve created with your previous decisions, but they are certainly wrong. And you can use them as indicators. Some examples:

If you have to copy-paste some piece of code to another part of the project, and that’s your best option, something’s wrong with the code. You should take a step back and refactor, rather than copy-pasting yet another piece.

If you have to rely on a full manual test to figure out that your application is not broken, the quick and easy “fix” for the problem is to just get a QA to manually test it. If you do that, instead of adding tests, quality degrades over time.

If you have to use business logic to overcome data model or infrastructure deficiencies, the easy fix is to just add a couple of if’s here and there. Then six months later you have unreadable code, full of bits irrelevant to the actual business logic. Instead, fix the data model or your infrastructure (in a wider sense, e.g. framework configuration).

If, given a bug report, tracing the program flow requires knowing where things are, rather than finding them, it means the project is not well structured. Yes, you have probably been working on it for a year now and you know where things are, but finding stuff using search (or call and class hierarchies) is the proper way to go – even for people experienced with the project (not to mention newcomers).

If the addition of a data field or a component requires changes in the whole project, rather than just an isolated part of the project, then each new addition creates more complexity and more potential failures. E.g. pseudo-plugin systems that require changing the core with each plugin.

The list can go on, but the point is clear – if faced with an option to do something the wrong way, take a step back and rethink whether the problem should exist in the first place. Otherwise each fix becomes technical debt. And in three months you have a “legacy” project.
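To make the last indicator (the pseudo-plugin system) a bit more concrete, here is a minimal sketch of a core that does not need to change when a plugin is added. All names in it (ExportPlugin, Exporter, CsvPlugin) are hypothetical and only serve the illustration; a real project might use java.util.ServiceLoader or a DI container for the registration step.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical extension point: the core only knows this interface. */
interface ExportPlugin {
    String format();                  // e.g. "csv"
    String export(List<String> row);  // turns one row into text
}

/** The "core": it iterates over whatever was registered and never names a concrete plugin. */
final class Exporter {
    private final List<ExportPlugin> plugins = new ArrayList<>();

    void register(ExportPlugin plugin) {
        plugins.add(plugin);
    }

    String export(String format, List<String> row) {
        for (ExportPlugin plugin : plugins) {
            if (plugin.format().equals(format)) {
                return plugin.export(row);
            }
        }
        throw new IllegalArgumentException("No plugin for format: " + format);
    }
}

/** Adding a format means adding a class like this one; Exporter stays untouched. */
final class CsvPlugin implements ExportPlugin {
    public String format() { return "csv"; }
    public String export(List<String> row) { return String.join(",", row); }
}
```

If supporting a new format required editing a switch statement inside Exporter, that would be exactly the kind of change that spreads complexity with every addition.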


Pretty Print JSON Per Request With Spring MVC

February 22, 2016

You will find a lot of posts and stackoverflow answers telling you how to pretty-print JSON responses. But sometimes you may need to tune the “prettiness” per request.

The use case for this is when you are using tools like curl or RESTClient to interact with the system and you want human-readable output. Of course, if you need human-readable output only for debug purposes, you should really consider whether you need JSON at all, or you should use some binary format. But let’s assume you need JSON. And that you’d rather get it pretty-printed, rather than use an external tool to prettify it afterwards.

The basic idea is to enable pretty-printing with either a GET parameter, or preferably with an Accept header like application/json+pretty. With Spring MVC that is not supported out of the box. You’d need to create a class like this:

/**
 * A subclass of the MappingJackson2HttpMessageConverter that accepts the application/json+pretty content type
 * in order to enable per-request prettified JSON responses
 *
 * @author bozho
 */
public class PrettyMappingJackson2HttpMessageConverter extends MappingJackson2HttpMessageConverter {

  /**
   * Construct a new {@link MappingJackson2HttpMessageConverter} using default configuration
   * provided by {@link Jackson2ObjectMapperBuilder}
   */
  public PrettyMappingJackson2HttpMessageConverter() {
    setSupportedMediaTypes(Lists.newArrayList(new MediaType("application", "json+pretty", DEFAULT_CHARSET)));
    setPrettyPrint(true); // without this the output would be as compact as the default converter's
  }
}

Then in your spring-mvc xml configuration (or java config counterpart) you should register this as a message converter, inside <mvc:annotation-driven>:

  <mvc:message-converters>
    <bean class="org.springframework.http.converter.json.MappingJackson2HttpMessageConverter" />
    <!-- Handling Accept: application/json+pretty -->
    <bean class="com.yourproject.util.PrettyMappingJackson2HttpMessageConverter" />
  </mvc:message-converters>

If you have a separately defined ObjectMapper and want to pass it to the pretty converter, you should override the other constructor (accepting an object mapper), and use the .copy() method before enabling the INDENT_OUTPUT.

And then you’re done. You can switch between regular (non-indented) and pretty output by setting the Accept header to application/json+pretty.
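To illustrate what “per request” means here, the sketch below shows the decision that is effectively being made from the Accept header. It is plain Java and deliberately simplified: Spring’s real content negotiation also sorts media types by quality and specificity. The class and method names are made up for the example.

```java
import java.util.Locale;

final class AcceptHeaderSketch {
    static final String PRETTY_TYPE = "application/json+pretty";

    /** Returns true if any media type listed in the Accept header is the pretty JSON type. */
    static boolean wantsPretty(String acceptHeader) {
        if (acceptHeader == null) {
            return false;
        }
        for (String part : acceptHeader.split(",")) {
            // Drop parameters such as ";q=0.8" and normalise case before comparing.
            String mediaType = part.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
            if (mediaType.equals(PRETTY_TYPE)) {
                return true;
            }
        }
        return false;
    }
}
```

From the command line that would correspond to something like curl -H "Accept: application/json+pretty" against one of your endpoints, while clients that send plain application/json keep getting the compact output.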


Setting Up Distributed Infinispan Cache with Hibernate and Spring

February 17, 2016

A pretty typical setup – a spring/hibernate application that requires a distributed cache. But it turns out it is not so trivial to set up.

You obviously need a cache. There are options to do that with EhCache, Hazelcast, Infinispan, memcached, Redis, AWS’s elasticache and some others. However, EhCache supports only replicated and not distributed cache, and Hazelcast does not yet work with the latest version of Hibernate. Infinispan and Hazelcast support consistent hashing, so the entries live only on specific instance(s), rather than having a full copy of all the cache on the heap of each instance. Elasticache is AWS-specific, so Infinispan seems the most balanced option with the spring/hibernate setup.
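For readers who haven’t met consistent hashing, here is a minimal sketch of the idea, so that “entries live only on specific instances” is less abstract. It is not Infinispan’s or Hazelcast’s actual implementation (those use segments, multiple owners and rebalancing); the class name, the node names and the number of virtual nodes are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

final class ConsistentHashRingSketch {
    private static final int VIRTUAL_NODES = 100; // points per node on the ring, arbitrary choice
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addNode(String node) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    void removeNode(String node) {
        ring.values().removeIf(node::equals);
    }

    /** Returns the node that owns the key: the first ring point at or after the key's hash. */
    String ownerOf(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("No nodes in the ring");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    /** Uses the first 8 bytes of an MD5 digest as a position on the ring. */
    private static long hash(String value) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(value.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The useful property is that when a node leaves, only the keys it owned move to another node; everything else stays where it was, which is what makes this suitable for a distributed (as opposed to replicated) cache.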

So, let’s first set up the hibernate 2nd level cache. The official documentation for infinispan is not the top google result – that is usually either very old documentation, or documentation that is just 2 versions old. You’d better open the latest one from the homepage.

Some of the options below are rather “hidden”, and I couldn’t find them easily in the documentation or in existing “how-to”s.

First, add the relevant dependencies to your dependency manager configuration. You’d need infinispan-core, infinispan-spring and hibernate-infinispan. Then in your configuration file (whichever it is – in my case it is jpa.xml, a spring file that defines the JPA properties) configure the following:

<prop key="hibernate.cache.use_second_level_cache">true</prop>
<prop key="hibernate.cache.use_query_cache">true</prop>
<prop key="hibernate.cache.region.factory_class">org.hibernate.cache.infinispan.InfinispanRegionFactory</prop>
<prop key="hibernate.cache.infinispan.statistics">true</prop>
<prop key="hibernate.cache.infinispan.cfg">infinispan.xml</prop>
<prop key="hibernate.cache.infinispan.query.cfg">distributed-query</prop>

These settings enable 2nd level cache and query cache, using the default region factory (we’ll see why that may need to be changed to a custom one later), enable statistics, point to an infinispan.xml configuration file and change the default name for the query cache in order to be able to use a distributed one (by default it’s “local-cache”). Of course, you can externalize all these to a .properties file.

Then, at the root of your classpath (src/main/resources) create infinispan.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!-- add the xmlns and xsi:schemaLocation attributes that match your Infinispan version -->
<infinispan xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <jgroups>
        <stack-file name="external-file" path="${jgroups.config.path:jgroups-defaults.xml}" />
    </jgroups>
    <cache-container default-cache="default" statistics="true">
        <transport stack="external-file" />
        <distributed-cache-configuration name="entity" statistics="true" />
        <distributed-cache-configuration name="distributed-query" statistics="true" />
    </cache-container>
</infinispan>

This expects -Djgroups.config.path to be passed to the JVM to point to a jgroups configuration. Depending on whether you use your own setup or AWS, there are multiple options. Here you can find config files for EC2, Google cloud, and basic UDP and TCP mechanisms. These should be placed outside the project itself, because locally you most likely don’t want to use S3_PING (S3 based mechanism for node detection), and values may vary between environments.

If you need statistics (and it’s good to have them) you have to enable them both at cache-container level and at cache-level. I actually have no idea what the statistics option in the hibernate properties is doing – it didn’t change anything for me.

Then you define each of your caches. Your entities should be annotated with something like

@Cache(usage = CacheConcurrencyStrategy.READ_WRITE, region = "user")
public class User { .. }

And then Infinispan creates caches automatically. They can all share some default settings, and these defaults are defined for the cache named “entity”. Took me a while to find that out, and finally got an answer on stackoverflow. The last thing is the query cache (using the name we defined in the hibernate properties). Note the “distributed-cache-configuration” elements – that way you explicitly say “this (or all) cache(s) must be distributed” (they will use the transport mechanism specified in the jgroups file). You can configure defaults in a jgroups-defaults.xml and point to it as shown in the above example, if you don’t want to force developers to specify the jvm arguments.

You can define entity-specific properties using <distributed-cache-configuration name="user" />, for example (check the autocomplete from the XSD to see what configuration options you have; and XML is a pretty convenient config DSL, isn’t it?).

So far, so good. Now our cache will work both locally and on AWS (EC2, S3), provided we configure the right access keys. Technically, it may be a good idea to have different infinispan.xml files for local and production, and to define by default <local-cache>, rather than a distributed one, because with the TCP or UDP settings, you may end up in a cluster with other teammates in the same network (though I’m not sure about that, it may present some unexpected issues).

Now, spring. If you were to only set up spring, you’d create a bean with a SpringEmbeddedCacheManagerFactoryBean, pass classpath:infinispan.xml as resource location, and it would work. And you can still do that, if you want completely separated cache managers. But cache managers are tricky. I’ve given an outline of the problems with EhCache, and here we have to do some workarounds in order to have a cache manager shared between hibernate and spring. Whether that’s a good idea – it depends. But even if you need separate cache managers, you may need a reference to the hibernate underlying cache manager, so part of the steps below are still needed. A problem with using separate caches is the JMX name they get registered under, but that I guess can be configured as well.

So, if we want a shared cache manager, we have to create subclasses of the two factory classes:

/**
 * A region factory that exposes the created cache manager as a static variable, so that
 * it can be reused in other places (e.g. as spring cache)
 *
 * @author bozho
 */
public class SharedInfinispanRegionFactory extends InfinispanRegionFactory {

	private static final long serialVersionUID = 1126940233087656551L;

	private static EmbeddedCacheManager cacheManager;

	public static EmbeddedCacheManager getSharedCacheManager() {
		return cacheManager;
	}

	@Override
	protected EmbeddedCacheManager createCacheManager(ConfigurationBuilderHolder holder) {
		EmbeddedCacheManager manager = super.createCacheManager(holder);
		cacheManager = manager;
		return manager;
	}

	@Override
	protected EmbeddedCacheManager createCacheManager(Properties properties, ServiceRegistry serviceRegistry)
			throws CacheException {
		EmbeddedCacheManager manager = super.createCacheManager(properties, serviceRegistry);
		cacheManager = manager;
		return manager;
	}
}

Yup, a static variable. Tricky, I know, so be careful.
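If the plain static field makes you nervous (for example because more than one session factory may be created in the same JVM, or because another thread may read the field), one possible variation is a small holder where the first published instance wins. This is only a sketch of an alternative, not what the region factory above does; SharedInstanceHolder is a hypothetical name and the type parameter stands in for EmbeddedCacheManager.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicReference;

final class SharedInstanceHolder<T> {
    private final AtomicReference<T> ref = new AtomicReference<>();

    /** Publishes the instance; returns false (and keeps the old one) if one was already set. */
    boolean publish(T instance) {
        Objects.requireNonNull(instance, "instance");
        return ref.compareAndSet(null, instance);
    }

    /** Returns the published instance, or null if nothing has been published yet. */
    T get() {
        return ref.get();
    }
}
```

The region factory could then hold a static SharedInstanceHolder and log a warning when publish returns false, instead of silently overwriting the previous manager. Whether “first one wins” is the right policy depends on your setup.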

Then we reuse that for spring:

/**
 * A spring cache factory bean that reuses a previously instantiated infinispan embedded cache manager
 *
 * @author bozho
 */
public class SharedInfinispanCacheManagerFactoryBean extends SpringEmbeddedCacheManagerFactoryBean {

	private static final Logger logger = ...;

	@Override
	protected EmbeddedCacheManager createBackingEmbeddedCacheManager() throws IOException {
		EmbeddedCacheManager sharedManager = SharedInfinispanRegionFactory.getSharedCacheManager();
		if (sharedManager == null) {
			logger.warn("No shared EmbeddedCacheManager found. Make sure the hibernate 2nd level "
					+ "cache provider is configured and instantiated.");
			return super.createBackingEmbeddedCacheManager();
		}
		return sharedManager;
	}
}

Then we change the hibernate.cache.region.factory_class property in the hibernate configuration to our new custom class, and in our spring configuration file we do:

<bean id="cacheManager" class="com.yourcompany.util.SharedInfinispanCacheManagerFactoryBean" />
<cache:annotation-driven />

The spring cache is used with a method-level @Cacheable annotation that allows us to cache method calls, and we can also access the CacheManager via simple injection.

Then the “last” part is to check if it works. Even if your application starts ok and looks to be working fine, you should run your integration or selenium test suite and check the statistics via JMX. You may even have tests that use the MBeans to fetch certain stats data about the caches to make sure they are being used. And/or you can write an integration test that injects the CacheManager and uses the StandardCacheEntryImpl it returns to compare the version property after subsequent operations, to see if the cache is properly updated.
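As a rough sketch of the “use the MBeans in a test” idea, the helper below sums a numeric attribute over all MBeans in the current JVM that match a name pattern. It uses only the standard JMX API; the class and method names are made up, and it assumes the caches are registered in the platform MBean server.

```java
import java.lang.management.ManagementFactory;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

final class JmxStatsSketch {
    /** Sums a numeric attribute over all MBeans in this JVM whose name matches the pattern. */
    static long sumAttribute(String objectNamePattern, String attribute) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        long total = 0;
        try {
            for (ObjectName name : server.queryNames(new ObjectName(objectNamePattern), null)) {
                Object value = server.getAttribute(name, attribute);
                if (value instanceof Number) {
                    total += ((Number) value).longValue();
                }
            }
        } catch (JMException e) {
            // Malformed pattern, missing attribute, etc. - surface it as a test failure.
            throw new IllegalStateException("Could not read " + attribute, e);
        }
        return total;
    }
}
```

In an integration test you could call it with a pattern along the lines of "org.infinispan:type=Cache,component=Statistics,*" and an attribute such as "hits", and assert that the value grows after a second read of the same entity. Both the pattern and the attribute name are assumptions here; check the exact names your Infinispan version registers with JConsole first.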

Overall, it shouldn’t take much time to set the whole thing up, and then later even to replace with another implementation if necessary.


Issues With Electronic Machine Readable Travel Documents

February 3, 2016

Most of us have passports, and most of these passports are by now equipped with chips that store some data, including fingerprints. But six months ago I had no idea how that operates. Now that my country is planning to roll out new identity documents, I had to research the matter.

The chip (which is a smartcard) in the passports has a contactless interface. That means RFID, 13.56 MHz (like NFC). Most typical uses of smartcards require PIN entry from the owner. But the point with eMRTD (electronic machine readable travel documents) is different – they have to be read by border control officials and they have to allow quickly going through Automatic Border Control gates/terminals. Typing a PIN will allegedly slow the process, and besides, not everyone will remember their PIN. So the ICAO had to invent some standard and secure way to allow gates to read the data, but at the same time prevent unauthorized access (e.g. someone “sniffing” around with some device).

And they thought they did. A couple of times. First the mechanism was BAC (Basic Access Control). When you open your passport on the photo page and place it in the e-gate, it reads the machine-readable zone (MRZ) with OCR and gets the passport number, birth date and expiry date from there. The combination of those is a key that is used to authenticate to the chip in order to read the data. The security issues with that are obvious, but I will leave the details to be explained by this paper.
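To show how little goes into that key, here is a sketch of the first step of BAC key derivation as I understand it from ICAO Doc 9303: the document number, date of birth and date of expiry from the MRZ, each followed by its check digit, are concatenated and hashed with SHA-1, and the first 16 bytes become the key seed. The class and method names are mine, and the further derivation of the actual encryption and MAC keys from the seed is left out.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

final class BacKeySeedSketch {
    /** MRZ check digit: weights 7, 3, 1 repeating; digits as-is, A-Z = 10-35, '<' = 0. */
    static int checkDigit(String field) {
        int[] weights = {7, 3, 1};
        int sum = 0;
        for (int i = 0; i < field.length(); i++) {
            char c = field.charAt(i);
            int value;
            if (c >= '0' && c <= '9') value = c - '0';
            else if (c >= 'A' && c <= 'Z') value = c - 'A' + 10;
            else if (c == '<') value = 0;
            else throw new IllegalArgumentException("Invalid MRZ character: " + c);
            sum += value * weights[i % 3];
        }
        return sum % 10;
    }

    /** Document number (padded to 9 chars with '<'), birth date and expiry date (YYMMDD), each with check digit. */
    static String mrzInformation(String documentNumber, String birthDate, String expiryDate) {
        return documentNumber + checkDigit(documentNumber)
                + birthDate + checkDigit(birthDate)
                + expiryDate + checkDigit(expiryDate);
    }

    /** Key seed: the first 16 bytes of SHA-1 over the MRZ information. */
    static byte[] keySeed(String mrzInformation) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1")
                    .digest(mrzInformation.getBytes(StandardCharsets.US_ASCII));
            return Arrays.copyOf(digest, 16);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Everything that feeds keySeed is printed on the document and partly guessable (birth dates, validity periods, often sequential document numbers), which is where the “obvious” security issues come from.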

Then, they figured, they could improve the previously insecure e-passports, and they introduced EAC (Extended Access Control). That includes short-lived certificates on the gates, and the chip inside the passport verifies those certificates (card-verifiable certificates). Only then can the gate read the data. You can imagine that requires a big infrastructure – every issuing country has to support a PKI, countries should cross-sign their “document verifier certificates”, and all of those should be in a central repository, where gates pull the certificates from. Additionally, these certificates should be very short-lived in order to reduce the risk of leaking a certificate. Such complexity, of course, asks for trouble. The first version of EAC was susceptible to a number of attacks, so they introduced EACv2, which mostly covers the attacks on v1, except a few small details: chips must be backward-compatible with BAC (because some gates may not support EAC). Another thing is that since the passport chip has no real clock, it updates the time after successful validation with a gate. But if a passport is not used for some period of time, expired (and possibly leaked) certificates can be used to get the data from the chip anyway. All of the details and issues of EACv1 and EACv2 are explained in this paper.

Since BAC is broken due to the low entropy, SAC (Supplemental Access Control) was created, using the PACE (v2) protocol. It is a password-authenticated key agreement protocol – roughly Diffie-Hellman + mutual authentication. The point is to generate a secret with high entropy based on a small password. The password is either a PIN, or a CAN (Card Access Number) printed on the document. (I think this protocol can be used to secure a regular communication with a contactless reader, if used with a PIN). The algorithm has two implementations, GM (Generic Mapping) and IM (Integrated Mapping). The latter, however, uses a patented Map2Point algorithm, and if it becomes widely adopted, is a bomb waiting to explode.
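The “low entropy” of BAC can be put into numbers with a back-of-the-envelope calculation. The sketch below computes an upper bound, assuming (unrealistically) that every value of the three MRZ fields is equally likely; the parameters for the span of birth years and the validity period are my assumptions, not figures from the standard.

```java
final class BacEntropySketch {
    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    /** Upper bound in bits, assuming every field value is equally likely (it is not in practice). */
    static double upperBoundBits(int docNumberLength, int alphabetSize, int birthYears, int validityYears) {
        double documentNumber = docNumberLength * log2(alphabetSize);
        double birthDate = log2(365.25 * birthYears);
        double expiryDate = log2(365.25 * validityYears);
        return documentNumber + birthDate + expiryDate;
    }
}
```

With a 9-character alphanumeric document number, 100 years of possible birth dates and a 10-year validity, upperBoundBits(9, 36, 100, 10) comes out at roughly 73.5 bits; with a purely numeric document number it drops to about 57 bits. And that is the ceiling: sequentially issued numbers and a known or guessable age of the holder reduce it much further.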

The whole story above is explained in this document. In addition, there is the BioPACE algorithm which includes biometric validation on the terminal (i.e. putting your finger for unlocking the chip), but (fortunately) that is not adopted anywhere (apart from Spain, afaik).

Overall, after many years and many attempts, the ICAO protocols seem to still have doubtful security. Although much improvement has been made, the original idea of allowing a terminal to read data without requiring action and knowledge from the holder, necessarily leads to security issues. Questions arise about brute-forcing as well – either an attacker can jam the chip with requests, or he can lock it after several unsuccessful attempts.

And if you think passports have issues, let me mention ID cards. Some countries make their ID cards ICAO-compliant in order to allow citizens to use them instead of passports (in the EU, for example, the ID card is a valid travel document). Leaving the question “why would a Schengen citizen even need to go through border control in Europe” aside, there are some more issues: the rare usage of the cards brings the EACv2 vulnerability. The MRZ is visible without the owner having to open it on the photo page – this means anyone who gets a glimpse of the ID card knows your CAN, and can then authenticate as if they were a terminal. And while passports are carried around only when you travel abroad, ID cards are carried at all times, increasing the risks for personal and biometric data leakage many times. Possibly these issues are the reason that by 2014 only Germany and Spain had e-gates that support ID cards as eMRTD. Currently there is the ABC4EU project that is aimed at defining common standards and harmonizing the e-gates infrastructure, so in 5-6 years there may be more e-gates supporting ID cards, and therefore more ID cards conforming to ICAO.

Lukas Grunwald has called all of the above “Security by politics” in his talk at DEF CON last year. He reveals practical issues with the eMRTD, including attacks not only on the chips, but on the infrastructure as well.

Leaking data, including biometric data, to strangers on the metro who happen to have a “listening” device is a huge issue. Stainless steel wallets shielding from radio signals will probably become more common, at least with more technical people. Others may try to microwave their ID cards, like some Germans have done.

It’s not about the automatic control, some say, it’s about the security of the document itself, and by that – the security of everyone. If your fingerprints are signed by your country, surely nobody can create a fake document. First, even when there are checks on the biometrics (photo, iris, fingerprints), they are far from perfect. Also, in order to identify a fake passport, you have to check the fingerprints of everyone. Which they do in the US, but they don’t rely on the ones on the passport – they specifically take your fingerprints when issuing a visa. And reasonably so – in the ICAO system, if the root certificate of any country gets compromised, it can be used to sign fake passports (rogue states aside, are we certain that all countries have proper security around their CA? I’m not). And besides – are fake passports really the threat? Even if passports are ultra-secure (which they aren’t), attackers don’t attack the strongest part of a security system – they attack the weakest part. For example unguarded borders. Arriving by car or bus (where comparing fingerprints is rather impractical). Or, actually, working with people that already have valid passports, like most of the terrorists in recent attacks.

But apparently the “political will” is aimed at ensuring the false sense of security, and at convenience at the airport, allowing for fewer queues and fewer human border control officers, while getting all possible data about the citizen. Currently all of that appears to be at the expense of information security, but can it be different? Having an RFID chip in your document is always a risk (banks allow contactless payments up to a given limit, and they accept the risk themselves). But if we eliminate all the data from the passport/ID card, and leave simply a “passport number” to be read, it may be useless to attackers (currently the eMRTD have names, address, birth date, photo, fingerprints).

There is a huge infrastructure already in place, and it operates in batch mode – i.e. rotating certificates on regular intervals. But the current state of technology allows for near-real time querying – e.g. you go to the gate, put your eMRTD, it reads your passport number and sends a query to the passport database of the issuing country, which returns the required data as a response. If that is at all needed – the country where you enter can simply store the passport numbers that entered, together with the picture of the citizen, and later obtain the required data in batches. If batches suffice, data on the chip may still be present, but encrypted with the issuer’s public key and sent for decryption. This “issuer database” approach has its own implications – if every visit to a foreign country triggers a check in their national database, that may be used to easily trace citizens’ movements. While national passport databases exist, forming a huge global database is too scary. (Not) logging validation attempts in national databases may be regulated and audited, but that increases the complexity of the whole system. But I think this is the direction this should move to – having only a “key” in the passport, and data in central, (allegedly) protected databases. Note that e-gates normally do picture verification, so that might have to be stored on the passport. (Note: I discovered this proposal for an online verification protocol after writing this post)

Technical issues aside, when getting our passports, and more importantly – our ID cards, we must be allowed to make an informed choice – do we want to bear the security risks for the sake of the convenience of not waiting in queues (although queues form on e-gates as well), or do we not care about automatic border control and would rather keep our personal and biometric data outside the RFID chip? For EU ID cards I would even say the default option must be the latter.

And while I’m not immediately concerned about an Orwellian (super)state tracking all your movements through a mandatory RFID document (or even – implant), not addressing these issues may lead to one some day (or has already led to one in less democratic countries that have RFID ID cards), and at the very least – to a lot of fraud. For that reason “security by politics” must be avoided. I just don’t know how. Probably on an EU level?