Spring Boot, @EnableWebMvc And Common Use-Cases

April 21, 2017

It turns out that Spring Boot doesn’t mix well with the standard Spring MVC @EnableWebMvc annotation. When you add it, Spring Boot’s web MVC autoconfiguration is disabled.

The bad part (which cost me a few hours) is that no guide states this explicitly. This guide says that Spring Boot adds the equivalent configuration automatically, but doesn’t say what happens if you follow your previous experience and add the annotation yourself.

In fact, people who hit issues stemming from this silently disabled autoconfiguration try to address it in various ways. Most often – by keeping @EnableWebMvc, but also extending Spring Boot’s WebMvcAutoConfiguration. Like here, here and somewhat here. I found those after I got the same idea and implemented it that way, and then realized doing it is redundant, after going through Spring Boot’s code and seeing that an inner class in the autoconfiguration class has a single-line javadoc stating:

Configuration equivalent to {@code @EnableWebMvc}.

That answered my question of whether Spring Boot’s autoconfiguration misses some of the @EnableWebMvc “features”. And it’s good that they extended the class that provides @EnableWebMvc, rather than mirroring its functionality (which is the obvious choice, I guess).

What should you do when you want to customize your beans? As usual, extend WebMvcConfigurerAdapter (annotate the new class with @Component) and do your customizations.
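For illustration, here’s what such a customization could look like – a minimal sketch assuming Spring Boot 1.x with WebMvcConfigurerAdapter on the classpath; the class name and the particular customizations (CORS and static resources) are made up for the example:

import org.springframework.stereotype.Component;
import org.springframework.web.servlet.config.annotation.CorsRegistry;
import org.springframework.web.servlet.config.annotation.ResourceHandlerRegistry;
import org.springframework.web.servlet.config.annotation.WebMvcConfigurerAdapter;

// Picked up by Spring Boot's autoconfiguration, so the Boot defaults stay in place
@Component
public class WebConfig extends WebMvcConfigurerAdapter {

    @Override
    public void addCorsMappings(CorsRegistry registry) {
        // example customization: allow cross-origin calls to the API
        registry.addMapping("/api/**").allowedOrigins("http://example.com");
    }

    @Override
    public void addResourceHandlers(ResourceHandlerRegistry registry) {
        // example customization: serve static files from an extra classpath location
        registry.addResourceHandler("/files/**").addResourceLocations("classpath:/files/");
    }
}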

So, the bottom line of this particular problem: don’t use @EnableWebMvc in Spring Boot – just include the spring-boot-starter-web Maven/Gradle dependency and web MVC will be autoconfigured.

The bigger picture here is that I ended up adding a comment in the main configuration class detailing why @EnableWebMvc should not be put there. So the autoconfiguration magic saved me from doing a lot of stuff, but I still had to add a line explaining why something isn’t there.

And that’s because of the common use-cases – people are used to using @EnableWebMvc. So the most natural and common thing to do is to add it, especially if you don’t know how Spring Boot autoconfiguration works in detail. And they will keep doing it, wasting a few hours before realizing they should remove it (or before extending a bunch of Boot’s classes in order to achieve the same effect).

My suggestion in situations like this is: log a warning. And require explicitly disabling autoconfiguration in order to get rid of the warning. I had to turn on debug to see what gets autoconfigured, and then explore a bunch of classes to check the necessary conditions in order to figure out the situation.

And one of Spring Boot’s main use-cases is jar-packaged web applications. That’s what most tutorials are for, and that’s what it’s mostly used for, I guess. So there should be special treatment for this common use case – with additional logging and logged information helping people get through the maze of autoconfiguration.

I don’t want to be seen as “lecturing” the Spring team, who have done an amazing job in having good documentation and straightforward behaviour. But in this case, where multiple sub-projects “collide”, it seems it could be improved.


Distributed Cache – Overview

April 8, 2017

What’s a distributed cache? A solution that is “deployed” in an application (typically a web application) and that makes sure data is loaded from memory, rather than from disk (which is much slower), in order to improve performance and response time.

That looks easy if the cache is to be used on a single machine – you just load your most active data from the database in memory (e.g. a Guava Cache instance), and serve it from there. It becomes a bit more complicated when this has to work in a cluster – e.g. 5 application nodes serving requests to users in a round-robin fashion.

You have to update the in-memory cache on all machines each time a piece of data is updated by a request to one of the machines. If you just load all the data in memory and don’t invalidate it, the cache won’t be “coherent” – it will have stale values and requests to different application nodes will have different results, which you most certainly want to avoid. Or you can have a single big cache server with tons of memory, but it can die – and that may disrupt the smooth operation, so you’d want to have at least 2 machines in a cluster.

You can get a distributed cache in different ways. To list a few: Infinispan (which I’ve covered previously), Terracotta/Ehcache, Hazelcast, Memcached, Redis, Cassandra, ElastiCache (by Amazon). The first three are Java-specific (and JCache-compliant), but the rest can be used in any setup. Cassandra wasn’t initially meant to be a cache solution, but it can easily be used as such.

All of these have different configurations and different options, even different architectures. For example, you can run Ehcache with a central Terracotta server, or with a peer-to-peer gossip protocol. The in-process approach is also applicable for Infinispan and Hazelcast. Alternatively, you can rely on a cloud-provided service, like ElastiCache, or you can set up your own cluster of Memcached/Redis/Cassandra servers. Having the cache on the application nodes themselves is slightly faster than having a dedicated memory server (cluster), because of the network overhead.

The cache is structured around keys and values – there’s a cached entry for each key. When you want to load something from the database, you first check whether the cache has an entry for that key (based on the ID of your database record, for example). If the key exists in the cache, you don’t do a database query.

But how is the data “distributed”? It wouldn’t make sense to load all the data on all nodes at the same time, as the duplication is a waste of space and keeping the cache coherent would again be a problem. Most of the solutions rely on so-called “consistent hashing”. When you look for a particular key, its hash is calculated and, depending on the number of machines in the cache cluster, the cache solution knows exactly on which machine the corresponding value is located. You can read more details in the wikipedia article, but the approach works even when you add or remove nodes from the cache cluster, i.e. the cluster of machines that hold the cached data.
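To make that more concrete, here’s a minimal, self-contained sketch of a consistent-hash ring (not the implementation of any of the products above – each of them has its own, more sophisticated one):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Nodes and keys are hashed onto the same "ring"; a key belongs to the first node
// clockwise from its hash. Virtual nodes smooth out the distribution.
public class ConsistentHashRing {

    private final TreeMap<Long, String> ring = new TreeMap<>();
    private static final int VIRTUAL_NODES = 100;

    public void addNode(String node) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    public void removeNode(String node) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.remove(hash(node + "#" + i));
        }
    }

    // which cache node holds the value for this key
    public String nodeFor(String key) {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private long hash(String value) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) { // first 8 bytes as the position on the ring
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}

Adding or removing a node only remaps the keys that land on that node’s part of the ring, which is why the approach survives cluster membership changes without a full reshuffle.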

Then, when you do an update, you also update the cache entry, which the caching solution propagates in order to make the cache coherent. Note that if you do manual updates directly in the database, the cache will have stale data.

On the application level, you’d normally want to abstract the interactions with the cache. You’d end up with something like a generic “get” method that checks the cache and only then goes to the database. This is true for get-by-id queries, but caches can also be applied for other SELECT queries – you just use the query as a key, and the query result as value. For updates it would similarly update the database and the cache.
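As a sketch of that abstraction (the cache here is a plain in-memory map just to keep the example runnable; in a real setup it would be a distributed cache client):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Cache-aside sketch: check the cache first, fall back to the (slow) database loader on a miss.
public class CachingLoader<K, V> {

    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> databaseLoader; // stands in for the actual DB query

    public CachingLoader(Function<K, V> databaseLoader) {
        this.databaseLoader = databaseLoader;
    }

    public V get(K key) {
        // only call the database loader if the key is not already cached
        return cache.computeIfAbsent(key, databaseLoader);
    }

    public void update(K key, V value) {
        // write path: update the source of truth first (outside this sketch),
        // then the cache, to keep it coherent
        cache.put(key, value);
    }
}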

Some frameworks offer these abstractions out of the box – ORMs like Hibernate have the so-called cache providers, so you just add a configuration option saying “I want to use (2nd level) cache” and your ORM operations automatically consult the configured cache provider before hitting the database. Spring has the @Cacheable annotation which can cache method invocations, using their parameters as keys. Other frameworks and languages usually have something like that as well.
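A hedged sketch of the Spring flavour – the service and cache name below are made up, and @EnableCaching plus a configured cache provider are assumed to exist elsewhere:

import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

// The method parameters become the cache key and the result is cached,
// so the body (the "database call") only runs on a cache miss.
@Service
public class ProductNameService {

    @Cacheable("productNames")
    public String getProductName(long productId) {
        return loadNameFromDatabase(productId); // executed only on a cache miss
    }

    private String loadNameFromDatabase(long productId) {
        return "product-" + productId; // placeholder for the real query
    }
}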

An important side-note – you may need to pre-load the cache. On a fresh startup of a cluster (e.g. after a new deployment in a blue-green deployment scheme/crash/network failure/etc.) the system may be slow until the cache fills up. So you may have a batch job run on startup to fetch some pieces of data from the database and put them in the cache.

It all sounds easy and that’s good – the basic setup covers most of the cases. In reality it’s a bit more complicated to tweak the cache configuration. Even when you manage to set up a cluster (which can be challenging in, say, AWS), your cache strategies may not be straightforward. You have to answer a couple of questions that may be important – how long do you want your cache entries to live? How big do you need your cache to be? How should elements expire (cache eviction strategy) – least recently used, least frequently used, first-in-first-out?

These options often sound arbitrary. You normally go with the defaults and don’t bother. But you have to constantly keep an eye on the cache statistics, do performance tests, measure and tweak. Sometimes a single configuration option can have a big impact.

One important note here on the measuring side – you don’t necessarily want to cache everything. You can start that way, but it may turn out that for some types of entries you are actually better off without a cache. Normally these are the ones that are updated too often and read not that often, so constantly updating the cache with them is an overhead that doesn’t pay off. So – measure, tweak, repeat.

Distributed caches are an almost mandatory component of web applications. Yet, I’ve had numerous discussions about how a given system is very big and needs a lot of hardware, when in fact all of its data would fit in my laptop’s memory. So, much to my surprise, it turns out it’s not yet a universally known concept. I hope I’ve given a short but sufficient overview of the concept and the various options.


Distributing Election Volunteers In Polling Stations

March 20, 2017

There’s an upcoming election in my country, and I’m a member of the governing body of one of the new parties. As we have a lot of focus on technology (and e-governance), our internal operations are also benefiting from some IT skills. The particular task at hand these days was to distribute a number of election day volunteers (that help observe the fair election process) to polling stations. And I think it’s an interesting technical task, so I’ll try to explain the process.

First – data sources. We have an online form for gathering volunteer requests. And second, we have local coordinators that collect volunteer declarations and send them centrally. Collecting all the data is problematic (to this moment), because filling the online form doesn’t make you eligible – you also have to mail a paper declaration to the central office (horrible bureaucracy).

Then there are the volunteer preferences – in the form they’ve indicated whether they are willing to travel, or whether they prefer their closest polling station. And then there are the “priority” polling stations, which are considered to be more risky and therefore we need volunteers there.

I decided to do the following:

  • Create a database table “volunteers” that holds all the data about all prospective volunteers
  • Import all data – using the Apache Commons CSV parser, parse the CSV files (converted from Google sheets) with (1) the online form data and (2) the data from the received paper declarations
  • Match the entries from the two sources by full name (as the declarations cannot contain an email, which would otherwise be the primary key)
  • Geocode the addresses of people
  • Import all polling stations and their addresses (public data by the central election commission)
  • Geocode the addresses of the polling stations
  • Find the closest polling station address for each volunteer

All of the steps are somewhat trivial, except the last part, but I’ll still explain them in short. The CSV parsing and importing is straightforward. The only thing one has to be careful about is the ability to insert additional records at a later date, because declarations are still being received as I’m writing.
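For illustration, the import step could look roughly like this with Apache Commons CSV – the file name and column names below are made up, as the real sheets had different headers:

import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

public class VolunteerImport {
    public static void main(String[] args) throws Exception {
        try (Reader reader = Files.newBufferedReader(Paths.get("volunteers.csv"), StandardCharsets.UTF_8)) {
            Iterable<CSVRecord> records = CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(reader);
            for (CSVRecord record : records) {
                String fullName = record.get("name");
                String address = record.get("address");
                // insert-or-update by full name, so newly received declarations
                // can be imported on a later run without creating duplicates
                System.out.println(fullName + " -> " + address);
            }
        }
    }
}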

Geocoding is a bit trickier. I used OpenStreetMap initially, but it managed to find only a fraction of the addresses (which are not normalized – volunteers and officials are sometimes careless about the structure of the addresses). The OpenStreetMap API can be found here. It’s basically calling http://nominatim.openstreetmap.org/search.php?q=address&format=json with the address. I tried cleaning up some of the addresses automatically, which led to a couple more successful geocodings, but not much.
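The Nominatim call itself is trivial to script – a rough Java sketch (it just fetches the raw JSON; parsing out lat/lon is left to whatever JSON library you use, and Nominatim’s usage policy applies):

import java.io.InputStream;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class NominatimGeocoder {
    public static String geocode(String address) throws Exception {
        String url = "http://nominatim.openstreetmap.org/search.php?format=json&q="
                + URLEncoder.encode(address, "UTF-8");
        try (InputStream in = new URL(url).openStream();
             Scanner scanner = new Scanner(in, StandardCharsets.UTF_8.name()).useDelimiter("\\A")) {
            // returns the raw JSON response, or an empty array if there was no body
            return scanner.hasNext() ? scanner.next() : "[]";
        }
    }
}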

The rest of the coordinates I obtained through Google maps. I extract all the non-geocoded addresses and their corresponding primary keys (for volunteers – the full name; for polling stations – the hash of a semi-normalized address), parse them with javascript, which then invokes the Google Maps API. Something like this:

<!-- jQuery and the Google Maps JS API (with your own key) are also needed;
     jquery.csv provides the CSV parsing used below -->
<script type="text/javascript" src="jquery.min.js"></script>
<script type="text/javascript" src="jquery.csv.min.js"></script>
<script type="text/javascript" src="https://maps.googleapis.com/maps/api/js?key=YOUR_KEY&callback=initMap" async defer></script>
<script type="text/javascript">
    var idx = 1; // skip the CSV header row
    function initMap() {
        var map = new google.maps.Map(document.getElementById('map'), {
            zoom: 8,
            center: {lat: 42.7339, lng: 25.4858} // roughly the center of Bulgaria
        });
        var geocoder = new google.maps.Geocoder();

        $.get("geocode.csv", function(csv) {
            var stations = $.csv.toArrays(csv);
            // space the geocoding calls 2 seconds apart to stay under the API rate limit
            for (var i = 1; i < stations.length; i++) {
                setTimeout(function() {
                    geocodeAddress(geocoder, map, stations[idx][1], stations[idx][0]);
                    idx++;
                }, i * 2000);
            }
        });
    }

    function geocodeAddress(geocoder, resultsMap, address, label) {
        geocoder.geocode({'address': address}, function(results, status) {
            if (status === 'OK') {
                // append a CSV line: lat,lon,"label" (the label is the primary key)
                $("#out").append(results[0].geometry.location.lat() + ","
                    + results[0].geometry.location.lng() + ",\""
                    + label.replace(/"/g, '""').trim() + "\"<br />");
            } else {
                console.log('Geocode was not successful for the following reason: ' + status);
            }
        });
    }
</script>

This spits out CSV on the screen, which I then took and transformed with a regex replace (in Notepad++) into update queries:

Find: (\d+\.\d+),(\d+\.\d+),(".+")
Replace: UPDATE addresses SET lat=$1, lon=$2 WHERE hash=$3

Now that I had most of the addresses geocoded, the distance searching had to begin. I used the query from this SO question to come up with this (My)SQL query:

SELECT MIN(distance), email, names, stationCode, calc.address FROM
  (SELECT email, codePrefix, addresses.address, names,
          (3959 * ACOS(COS(RADIANS(volunteers.lat)) * COS(RADIANS(addresses.lat))
           * COS(RADIANS(addresses.lon) - RADIANS(volunteers.lon))
           + SIN(RADIANS(volunteers.lat)) * SIN(RADIANS(addresses.lat)))) AS distance
   FROM (SELECT address, hash, stationCode, city, lat, lon
         FROM addresses JOIN stations ON addresses.hash = stations.addressHash
         GROUP BY hash) AS addresses
   JOIN volunteers
   WHERE addresses.lat IS NOT NULL AND volunteers.lat IS NOT NULL
   ORDER BY distance ASC) AS calc
GROUP BY names;

This spits out the closest polling station to each of the volunteers. It is easily turned into an update query to set the polling station code to each of the volunteers in a designated field.

Then there are some manual amendments to be made, based on traveling preferences – if the person is willing to travel, we pick one of the “priority” stations and assign it to them. Since these are a small number, it’s not worth automating it.

Of course, in reality, due to data collection flaws, the above idealized example was accompanied by a lot of manual labour of checking paper declarations, annoying people on the phone multiple times and cleaning up the data, but in the end a sizable portion of the volunteers were distributed with the above mechanism.

Apart from being an interesting task, I think it shows that programming skills are useful for practically every task nowadays. If we had to do this manually (even with multiple people with good Excel skills), it would be a long and tedious process. So I’m quite in favour of everyone being taught to write code. They don’t have to end up being developers, but the way programming helps with non-trivial tasks is enormously beneficial.


“Infinity” is a Bad Default Timeout

March 17, 2017

Many libraries wrap some external communication. Be it a REST-like API, a message queue, a database, a mail server or something else. And therefore you have to have some timeout – for connecting, for reading, writing or idling. And sadly, many libraries have their default timeouts set to “0” or “-1” which means “infinity”.

And that is a very useless and even harmful default. There isn’t a practical use case where you’d want to hang on forever waiting for a resource. And there are tons of situations where this can happen, e.g. the other end gets stuck. In the past 3 months I had 2 libraries with a default timeout of “infinity” that eventually led to production problems, because we’d forgotten to configure them properly. Sometimes you don’t even see the problem until a thread pool gets exhausted.

So, I have a request to API/library designers (as I’ve done before – against property maps and encodings other than UTF-8). Never have “infinity” as a default timeout – otherwise your library will cause lots of production issues. Also note that it’s sometimes an underlying HTTP client (or socket) that doesn’t have a reasonable default – it’s still your job to fix that when wrapping it.
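For example, with Java’s plain HttpURLConnection the fix is two lines – the URL and the values below are illustrative:

import java.net.HttpURLConnection;
import java.net.URL;

// Whatever the library defaults are, the underlying connection should always
// get explicit, finite timeouts.
public class TimeoutExample {
    public static void main(String[] args) throws Exception {
        HttpURLConnection connection =
                (HttpURLConnection) new URL("https://example.com/api").openConnection();
        connection.setConnectTimeout(5000); // fail if the connection can't be established in 5s
        connection.setReadTimeout(5000);    // fail if the server stops sending data for 5s
        System.out.println("HTTP status: " + connection.getResponseCode());
    }
}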

What default should you provide? Reasonable. 5 seconds maybe? You may (rightly) say you don’t want to impose an arbitrary timeout on your users. In that case I have a better proposal:

Explicitly require a timeout for building your “client” (because these libraries are most often clients for some external system). E.g. Client.create(url, credentials, timeout). And fail if no timeout is provided. That makes the users of the client actively consider what a good timeout for their use case is – without imposing anything, and most importantly – without risking stuck connections in production. Additionally, you can still present them with a “default” option, but make them explicitly choose it. For example:

Client client = ClientBuilder.create(url)
                   .withCredentials(credentials)
                   .withTimeouts(Timeouts.connect(1000).read(1000))
                   .build();
// OR
Client client = ClientBuilder.create(url)
                   .withCredentials(credentials)
                   .withDefaultTimeouts()
                   .build();

The builder above should require “timeouts” to be set, and should fail if neither of the two methods was invoked. Even if you don’t provide these options, at least have a good way of specifying timeouts – some libraries require reflection to set the timeout of their underlying client.

I believe this is one of those issues that look tiny, but cause a lot of problems in the real world. And it can (and should) be solved by the library/client designers.

But since it isn’t always the case, we must make sure that timeouts are configured every time we use a 3rd party library.


Protecting Sensitive Data

March 12, 2017

If you are building a service that stores sensitive data, your number one concern should be how to protect it. What IS sensitive data? There are some obvious examples, like medical data or bank account data. But would you consider a dating site database sensitive data? Based on a recent leak at a big dating site I’d say yes. Is a cloud turn-by-turn navigation database sensitive? Most likely, as users’ journeys are stored there. Facebook messages, emails, etc. – all of that can and should be considered sensitive. And therefore must be highly protected. If you’re not sure whether the data you store is sensitive, assume it is, just in case. Otherwise a subsequent breach can easily bring your business down.

Now, protecting data is no trivial feat. And certainly cannot be covered in a single blog post. I’ll start with outlining a few good practices:

  • Don’t dump your production data anywhere else. If you want a “replica” for testing purposes, obfuscate the data – replace the real values with fake ones.
  • Make sure access to your servers is properly restricted. This includes using a “bastion” host, proper access control settings for your administrators, key-based SSH access.
  • Encrypt your backups – if your system is “perfectly” secured, but your backups lie around unencrypted, they would be the weak spot. The decryption key should be as protected as possible (will discuss it below)
  • Encrypt your storage – especially if using a cloud provider, assume you can’t trust it. AWS, for example, offers EBS encryption, which is quite good. There are other approaches as well, e.g. using LUKS with keys stored within your organization’s infrastructure. This and the previous point are about “Data at rest” encryption.
  • Monitoring all access and auditing operations – there shouldn’t be an unaudited command issued on production.
  • In some cases, you may even want to use split keys for logging into a production machine – meaning two administrators have to come together in order to gain access.
  • Always be up-to-date with software packages and libraries (well, maybe wait a few days/weeks to make sure no new obvious vulnerability has been introduced)
  • Encrypt internal communication between servers – the fact that your data is encrypted “at rest”, may not matter, if it’s in plain text “in transit”.
  • In rare cases, when only the user has to be able to see their data and it’s very confidential, you may encrypt it with a key based (in part) on their password. The password alone does not make a good encryption key, but there are key-derivation functions (e.g. PBKDF2) that are designed to turn low-entropy passwords into fair keys (see the sketch after this list). The key can be combined with another part stored on the server side. Thus only the user can decrypt their content, as their password is not stored anywhere in plain text and can’t be accessed even in case of a breach.
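As a sketch of that last bullet, this is roughly what deriving a key from a password looks like with PBKDF2 in Java – the iteration count, salt size and key length are illustrative, pick values matching your threat model:

import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import java.security.SecureRandom;

public class PasswordKeyDerivation {

    // turns a low-entropy password + random salt into key material for encryption
    public static byte[] deriveKey(char[] password, byte[] salt) throws Exception {
        PBEKeySpec spec = new PBEKeySpec(password, salt, 100_000, 256);
        SecretKeyFactory factory = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256");
        return factory.generateSecret(spec).getEncoded();
    }

    public static byte[] newSalt() {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt); // stored alongside the ciphertext, not secret
        return salt;
    }
}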

You see there’s a lot of encryption happening, but with encryption there’s one key question – who holds the decryption key. If the key is stored in a configuration file on one of your servers, the attacker that has gained access to your infrastructure will find that key as well, get the whole database, and then happily wait for it to be fully decrypted on his own machines.

To store a key securely, it has to be on tamper-proof storage – for example, an HSM (Hardware Security Module). If you don’t have HSMs, Amazon offers them as part of AWS. It also offers key management as a service, but the particular provider is not important – the concept is. You need a securely stored key on a device that doesn’t let the key out under any circumstances, even in case of a breach (HSM vendors claim that’s the case).

Now, how to use these keys depends on the particular case. Normally, you wouldn’t use the HSM itself to decrypt data, but rather to decrypt the decryption key, which in turn is used to decrypt the data. If all the sensitive data in your database is encrypted, even if the attacker gains SSH access and thus gains access to the database (because your application needs unencrypted data to work with; homomorphic encryption is not yet here), he’ll have to get hold of the in-memory decryption key. And if you’re using envelope encryption, it will be even harder for an attacker to just dump your data and walk away.
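A rough sketch of the envelope encryption idea in Java – the data is encrypted locally with a freshly generated data key, and only that (much smaller) data key is sent to the KMS/HSM to be wrapped by the master key, which never leaves the device. The KMS call itself is out of scope here:

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

public class EnvelopeEncryption {
    public static void main(String[] args) throws Exception {
        // 1. generate a fresh AES data key for this record/batch
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(256);
        SecretKey dataKey = keyGen.generateKey();

        // 2. encrypt the sensitive data locally with the data key
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, dataKey, new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal("sensitive personal data".getBytes(StandardCharsets.UTF_8));

        // 3. send dataKey.getEncoded() to the HSM/KMS to be encrypted with the master key,
        //    then store (wrapped data key, iv, ciphertext) together in the database
        System.out.println("ciphertext bytes: " + ciphertext.length);
    }
}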

Note that the encryption and decryption here happen at the application level – so the encryption can be not simply “per storage” or “per database”, but also per column – usernames don’t have to be kept so secret, but the associated personal data (in the next 3 database columns) should. So you can plug the encryption mechanism into your pre-persist (and decryption – into post-load) hooks. If speed is an issue, i.e. you don’t want to do the decryption in real-time, you may have a (distributed) cache of decrypted data that you can refresh with a background job.
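One common way to implement such hooks is a JPA 2.1 AttributeConverter applied only to the sensitive columns – a hedged sketch, where FieldEncryptor is a hypothetical wrapper around your actual encryption logic (e.g. the envelope encryption above):

import javax.persistence.AttributeConverter;
import javax.persistence.Converter;

@Converter
public class EncryptedStringConverter implements AttributeConverter<String, String> {

    private final FieldEncryptor encryptor = FieldEncryptor.instance(); // hypothetical

    @Override
    public String convertToDatabaseColumn(String plainText) {
        return encryptor.encrypt(plainText);   // runs before the value is persisted
    }

    @Override
    public String convertToEntityAttribute(String dbValue) {
        return encryptor.decrypt(dbValue);     // runs after the value is loaded
    }
}

// Applied only to the sensitive columns:
// @Convert(converter = EncryptedStringConverter.class)
// private String personalData;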

But if your application has to know the data, an attacker that gains full control for an unlimited amount of time, will also have the full data eventually. No amount of enveloping and layers of encryption can stop that, it can only make it harder and slower to obtain the dump (even if the master key, stored on HSM, is not extracted, the attacker will have an interface to that key to use it for decrypting the data). That’s why intrusion detection is key. All of the above steps combined with an early notification of intrusion can mean your data is protected.

As we are all well aware, there is never a 100% secure system. Our job is to make it nearly impossible for bulk data extraction. And that includes proper key management, proper system-level and application-level handling of encryption and proper monitoring and intrusion detection.


A Case For Native Smart Card Support in Browsers

February 22, 2017

A smart card is a device that holds a private key securely without letting it out of its storage. The chip on your credit card is a “smart card” (yup, terminology is ambiguous – the card and the chip are interchangeably called “smart card”). There are smaller USB-pluggable hardware readers that only hold the chip (without an actual card – e.g. this one).

But what’s the use? This w3c workshop from several years ago outlines some of them: multi-factor authentication, state-accepted electronic identification, digital signatures. All these are part of a bigger picture – that using the internet is now the main means of communication. We are moving most of our real-world activities online, so having a way to identify who we are online (e.g. to a government, to a bank), or being able to sign documents online (with legal value) is crucial.

That’s why the EU introduced the eIDAS regulation, which defines (among other things) electronic identification and digital signatures. The framework laid out there is aimed at having legally binding electronic communication, which is important in so many cases. Have you ever done the print-sign-scan exercise? Has your e-banking been accessed by an unauthorized person? Well, the regulation is supposed to fix these and many more issues.

Two-factor authentication is another, broader concept, which has tons of sub-optimal solutions: OTP tokens, Google Authenticator, SMS code confirmation. All these have issues (e.g. clock syncing, SMS interception, cost). There are hardware tokens like YubiKey, but they offer only a subset of the features a smart card does.

But it’s not just about legally-recognized actions online and two-factor authentication. It opens up other possibilities, like a more secure online credit card payment – e.g. you put your card in a reader and type your PIN, rather than entering the card number, CVC, date, names, 3d password and whatnot.

With this long introduction I got to the problem: browsers don’t support smart cards. In the EU, where electronic signatures are legally recognized, there is always the struggle of making them work with browsers. The solution so far: Java applets. A Java applet can interact with the smart card through the java crypto APIs, and thus provide signing features. However, with the deprecation of Java applets this era of constant struggle will end soon (and it is a struggle – having to click at least 2 confirmations and keep your java up to date, which even for developers is a hassle). There used to be a way to do it a few years ago in Firefox and IE, using window.crypto and CAPICOM APIs, but these got deprecated.
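For a feel of what that interaction looks like on the Java side (applet or not), here’s a rough sketch assuming Java 8 and an installed PKCS#11 module (e.g. OpenSC); the config path and PIN are placeholders:

import java.security.KeyStore;
import java.security.Provider;
import java.security.Security;
import java.util.Collections;

public class SmartCardKeys {
    public static void main(String[] args) throws Exception {
        // pkcs11.cfg contains e.g.:  name=SmartCard
        //                            library=/usr/lib/opensc-pkcs11.so
        Provider provider = new sun.security.pkcs11.SunPKCS11("/path/to/pkcs11.cfg");
        Security.addProvider(provider);

        // the card's keystore is unlocked with the PIN; private keys never leave the card
        KeyStore keyStore = KeyStore.getInstance("PKCS11", provider);
        keyStore.load(null, "1234".toCharArray());

        for (String alias : Collections.list(keyStore.aliases())) {
            System.out.println("On-card entry: " + alias);
        }
    }
}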

Recently the trend has been to use a “cloud-based” approach, where the keys reside on an HSM. That’s of course useful, but the problem with identification remains – getting access to your keys on the HSM requires, again, two factor authentication. Having the hardware token “in your hands” is what adds the security.

Smart people in Estonia (which has the most digital government in the world) had a better solution than Java or an HSM – browser plugins that allow interaction with their ID card (which is/has a smart card). The solution is here and here. This has worked pretty well – you install the plugins once (with an all-in-one installer) and you can sign documents with javascript. You also get the proper PKCS libraries installed, and the root certificates needed to allow TLS 1.2 authentication with the hardware token (identification and authentication vs signing). The small downside of this approach is that it is somewhat fragile and dependent on browser whims – the plugins have to be upgraded constantly and are at risk of being completely broken if some browser decides to deprecate some plugin APIs.

Another approach is the “local service” approach, which has two flavours. One is – you install a local application that exposes an HTTP interface, and using javascript and a proper same-origin configuration you send the files needed for signing to the service, and then get the result as an HTTP response, which you can then, again using javascript, append to the page that requested the signing. The downside here – getting a service installed to listen on a given port without administrator rights. The other approach is having an application hooked to a custom protocol (e.g. signature://). So whenever the page wants the user to sign something, it opens signature://path-to-document-to-sign, which is intercepted by the locally installed application, digital signing is performed, and the result is pushed to a (one-time) URL specified in the metadata of the document to sign. Something like that is implemented by 4identity.eu and it actually works.

Now, signature is one thing, identification (TLS client auth) is another. Allegedly, things should work there – PKCS#11 is a standard that should allow TLS client auth to happen with a smart card. Reality is – it doesn’t. You often need a vendor-specific PKCS#11 library. OpenSC, which is a cool tool that works with many smart cards, only works with Firefox and Safari. Charismatics commercial is a piece of software that is supposed to work with all smart cards out there – well, it doesn’t always.

And the problem here is the smart card vendors. The need for OpenSC and Charismatics arises because even though there are a few PKCS standards, smart cards are a complete mess. Not only is it a mess, it’s a closed, secretive mess. APDUs (the commands you send to the smart card in order to communicate with it) are in most cases secret. You don’t get to know them even if you purchase tens of thousands of cards – you only get custom vendor software that knows them. Then you have to reverse-engineer them to know how to actually talk to the cards. And they differ not only across vendors, but across card models of the same vendor. For that reason the Estonian approach was a bit simpler to implement – they had only one type of smart card, given to all citizens, and they were mostly in control. In other countries it’s a … mess. At least a dozen different types of cards have to be supported.

So my first request is to smart card vendors (which are not that many) – please, please fix your mess. Get rid of that extra bit of “security through obscurity” to allow browsers to communicate with you without extra shenanigans.

My second request is to browser vendors – please do support smart card crypto natively. Unfortunately, due to the smart card mess above (among other things), hardware crypto has explicitly been excluded from the Web Crypto API. As a follow-up to that, there’s the Hardware Security working group, but as far as I know it’s still “work in progress”, and my feeling is there isn’t much there yet. In the W3C it’s important that browser vendors agree to implement something before it becomes a standard, and I’ve heard that some are opposing the smart card integration. Due to the aforementioned mess, I guess.

You may say – standardization will fix this. Well, it hasn’t so far. The EU officials are aware of the problem, and that the eIDAS regulation may be thwarted by these technical issues, but they are powerless, as the EU is not a standardization body.

So it all comes down to having a joint effort between browser and smart card vendors to fix this thing once and for all. So, please do that in order to enable a more secure and legally-compliant web.


Computer Science Concepts That Non-Technical People Should Know

February 12, 2017

Sometimes it happens that people speak different languages, even when speaking the same language. People have their own professional specifics. A biologist may see the world through the way a cell works, a cosmologist may see relationships between people as attraction between planets. And as with languages, different professional experiences give you a useful way of conceptualizing the world. And I think everyone would find some CS concepts useful. I’ll try to list some of these concepts that I’ve found many people don’t find “native”. And yes, they are not strictly “computer science” – they fall in various related fields like information science, systems design, etc.

  • Primary keys – the fact that every “entity” should have a unique identifier, so that you can refer to it unambiguously. Whether it’s a UUID or an auto-increment, or a number/string derived from a special set of rules, doesn’t matter. And you may say it’s obvious, but it isn’t – I’ve seen tons of spreadsheets and registers where entities don’t have a unique identifier. Unique identifiers are useful for retrieval – if I have a driving license number, when I fill in an insurance form, should I fill in all the details from the driving license (name, address, age), or just a single number, and the insurer should then get the rest from a driving license database?
  • Foreign keys (+ integrity violations and cascades) – the idea for this post came to me after I had a discussion with a bank clerk who insisted that the fact that my bank account is being deleted doesn’t mean that my virtual PoS terminal will also be deleted. To her they seemed unlinked, although the terminal is linked (via a foreign key) to the bank account. One should be able to quickly imagine links between data points and how dependent data can either block the removal of the parent, or be deleted with it.
  • “Fingerprint” (hash) – the fact that every entity can be transformed into a short, unique identity representation. This has some niche applications outside of the technical realm, but one should be familiar with the concept of transforming a large set of data points to a single representation for the sake of comparison.
  • Derived data – the fact that you don’t have to store data that can be derived from a primary dataset with well defined transformations. E.g. sums. I had to explain this over and over again when I was talking about open data – and ultimately, sums were there anyway. The key here is that derived data is less important than the algorithm used to derive it.
  • Single point of failure – this becomes obvious the second you spell it out, but people tend to ignore it when designing real-world processes. How many times have you been stuck in a queue thinking “this is obviously a single point of failure, why didn’t anyone think of preventing these delays”? It’s not always simple (even in software, not to mention the real world), of course, but the concept should be acknowledged.
  • Protocols and interfaces – i.e. the definition of how two systems communicate without knowing each others’ internal functioning. “Protocol” in the non-technical world means “the way things are usually done”. An interface is usually thought of as a GUI. But the concept of the definition of how two systems communicate seems kind of alien. Where “systems” might be “organizations”, a customer and a sales rep, two departments, etc. The definition of the point of their interaction I think helps to conceptualize these relations better.
  • Version control – the fact that content is a continuous stream of changes on top of an original entity. Each entity is the product of its original “version” + all the changes applied to it. This doesn’t just hold for text, it holds for everything out there. And if we imagine it so, it helps us… “debug” things, even humans I guess.

The list is by no means exhaustive. I’d welcome any additions to it.

And now I know non-technical readers will say “But I understand and know these things”. I know you do. But my point is that they are not an intrinsic part of your day-to-day problem solving apparatus. I may be wrong, but my observations hint in that direction.


Why I Chose to Be a Government Advisor

January 22, 2017

A year and a half ago I agreed to become an advisor in the cabinet of the deputy prime minister of my country (Bulgaria). It might have looked like a bizarre career move, given that at the time I was a well positioned and well paid contractor (software engineer), working with modern technologies (Scala, Riak, AWS) at scale (millions of users). I continued on that project part time for a little while, then switched to another one (again part time), but most of my attention and time were dedicated to the advisory role.

Since mid-December I’m no longer holding the advisory position (the prime minister resigned), but I wanted to look back, reflect and explain (to myself mainly) why that was a good idea and how it worked out.

First, I deliberately continued as a part-time software engineer, to avoid the risk of forgetting my (to that point) most marketable skills – building software. But not only that – sometimes you become tired of political and administrative bullshit and just want to sit down and write some code. But the rest of my time, including my “hobby”/spare time, was occupied by meetings, research, thoughts and document drafting that aimed at improving the electronic governance in Bulgaria.

I’ve already shared what my agenda was and what I was doing, and even gave a talk about our progress – opening as much data as possible, making sure we have high requirements for government software (we prepared a technical specification template for government procurement so that each administration relies on that, rather than on contractors with questionable interests), introducing electronic identification (by preparing both legislation and technical specification), and making myself generally available to anyone who wants advice on IT stuff.

The big news that we produced was the introduction of a legal requirement for making all custom-built government software open-source. It echoed in the tech community, including some big outlets like TechCrunch, ZDnet and Motherboard.

I was basically a co-CTO (together with a colleague) of the government, which is pretty amazing – having the bigger picture AND the intricate details in our heads, all across government. We were chaotic, distracted and overwhelmed, but still managed to get both the legislation and the important projects written and approved. I acquired some communication and time-management skills, researched diverse technological topics (from biometric identification to e-voting), I learnt that writing laws is very similar to programming, and I had the pleasure of working in a wonderful team.

There were downsides, of course. The financial one aside, the most significant problem was the inevitable “deskilling”. I had no time to try cool new stuff from the technology world – machine learning, deep learning, neural networks, the blockchain. Kafka, Docker, Kubernetes, etc. I’m not saying these are crucial technologies (I have tried Docker a couple of times and I wouldn’t say I love it), but one has to stay up to date – this is the best part (or the worst part?) of our profession. I missed (and still do) spending a few evenings in a row digging into some new and/or obscure technology, or even contributing to it. My GitHub contributions were limited to small government-oriented tools.

Another downside is that I’m pretty sure some people hate me now. For raising the bar for government software so high and for my sometimes overly zealous digitizing efforts. I guess that’s life, though.

Having stated the pros and cons, let me get back to the original question – why did I choose the position in the first place. For a long time I’ve been interested in what happens to programmers when they get “bored” – when the technical challenges alone are not sufficient for realizing one’s potential, or when the potential for impact has been fully reached. A few years ago I didn’t think of the “government” option – I thought one could either continue solving problems for an employer, or become an employer themselves. And I still think the latter is a very good option, if you have the right idea.

But I didn’t have an idea that I really liked and wanted to go through with. At the same time I had ideas of how the government could run better. Which would, in my case, impact around 8 million people. It would also mean I could get things done in a complicated context. I started with three objectives – electronic identity cards, open source and open data (the latter two we had pushed for with a group of volunteer developers). All of them are now a fact. Well, mostly because of the deputy prime minister, and thanks to the rest of the team, but still. That job also meant I had to get out of my comfort zone of an all-day-programming, agile-following, software-delivering routine.

And that’s why I did it. With all the risks of deskilling, of being disliked, or even being a pawn in a political game. Selfish as it may sound, I think having impact, getting things done and doing something different are pretty good motivators. And are worth the paycheck cut.

I still remain a software engineer and will continue to create software. But thanks to this experience, I think I’ll be able to see a broader landscape around software.


Forget ISO-8859-1

January 16, 2017

UTF-8 was first presented in 1993. One would assume that 24 years is enough time for it to become ubiquitous, especially given that the Internet is global. ASCII doesn’t even cover French letters, not to mention Cyrillic or Devanagari (the Hindi script). That’s why ASCII was replaced by ISO-8859-1, which kind of covers most western languages’ orthographies.

88.3% of the websites use UTF-8. That’s not enough, but let’s assume these 11.7% do not accept any input and are just English-language static websites. The problem of the still-pending adoption of UTF-8 is how entrenched ASCII/ISO-8859-1 is. I’ll try to give a few examples:

  • UTF-8 isn’t the default encoding in many core Java classes – FileReader, for example. It’s similar for other languages and runtimes. The default encoding of these Java classes is the JVM default, which is most often ISO-8859-1. It is allegedly taken from the OS, but I don’t remember configuring any encoding on my OS – just a locale, which is substantially different. (See the sketch after this list.)
  • Many frameworks, tools and containers don’t use UTF-8 by default (and don’t try to remedy the JVM not using UTF-8 by default). Tomcat’s default URL encoding I think is still ISO-8859-1. Eclipse doesn’t make files UTF-8 by default (on my machine it’s sometimes even windows-1251 (Cyrillic), which is horrendous). And so on. I’ve asked for having UTF-8 as default in the past, and I repeat my call
  • Regex examples and tutorials always give you the [a-zA-Z0-9]+ regex to “validate alphanumeric input”. It is built into many validation frameworks. And it is so utterly wrong. This is a regex that must never appear anywhere in your code, unless you have a pretty good explanation. Yet, the example is ubiquitous. Instead, the right regex is [\p{Alpha}0-9]+. Using the wrong regex means you won’t be able to accept any special character. Which is something you practically never want. Unless, probably, due to the next problem.
  • Browsers have issues with UTF-8 URLs. Why? It’s complicated. And it almost works when it’s not part of the domain name. Almost, because when you copy the URL, it gets screwed (pardon me – encoded).
  • Microsoft Excel doesn’t work properly with UTF-8 in CSV. I was baffled to realize the UTF-8 CSVs become garbage. Well, not if you have a BOM (byte order mark), but come on, it’s [the current year].
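To make the first and third bullets concrete, here’s a small Java sketch (the file name is just an example; note that in Java \p{Alpha} additionally needs the (?U) flag to cover non-latin scripts):

import java.io.FileInputStream;
import java.io.FileReader;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class Utf8Pitfalls {
    public static void main(String[] args) throws Exception {
        // Pitfall: FileReader silently uses the JVM default encoding
        Reader wrong = new FileReader("input.txt");
        wrong.close();

        // Fix: always state the charset explicitly
        Reader right = new InputStreamReader(new FileInputStream("input.txt"), StandardCharsets.UTF_8);
        right.close();

        // Pitfall: the "alphanumeric" regex rejects anything beyond plain latin letters
        System.out.println("München".matches("[a-zA-Z0-9]+"));         // false
        // The \p{Alpha} version works, but needs (?U) (UNICODE_CHARACTER_CLASS) in Java
        System.out.println("München".matches("(?U)[\\p{Alpha}0-9]+")); // true
    }
}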

As Jon Skeet rightly points out – we have issues with the most basic data types – strings, numbers and dates. This is partly because the real world is complex. And partly because we software engineers tend to oversimplify it. This is what we’ve done with ASCII and other latin-only encodings. But let’s forget ASCII and ISO-8859-1. It’s not even okay to call them “legacy” after 24 years of UTF-8. After 24 years they should’ve died.

Let’s not give regex examples that don’t work with UTF-8, let’s not assume any default different than UTF-8 is a good idea and let’s sort the URL mess.

Maybe I sound dogmatic. Maybe I exaggerate because my native script is non-latin. But if we want our software to be global (and we want that, in order to have a bigger market), then we have to sort our basic encoding issues. Having UTF-8 as a standard is not enough. Let’s forget ISO-8859-1.


Anemic Objects Are OK

December 25, 2016

I thought for a while that object-oriented purism has died off. But it hasn’t – every now and then there’s an article that tries to tell us how evil setters and getters are, how bad (Java) annotations are, and how horrible and anti-object-oriented the anemic data model is (when functionality-only services act upon data-only objects) and eventually how dependency injection is ruining software.

Several years ago I tried to counter these arguments, saying that setters and getters are not evil per se, and that the anemic data model is mostly fine, but I believe I was worse at writing then, so maybe I didn’t get to the core of the problem.

This summer we had a short twitter discussion with Yegor Bugayenko and Vlad Mihalcea on the matter and a few arguments surfaced. I’ll try to summarize them:

  • Our metaphors are often wrong. An actual book doesn’t know how to print itself. Its full contents are given to a printer, which knows how to print a book. Therefore it doesn’t make sense to put logic for printing (to JSON/XML), or persisting to a database in the Book class. It belongs elsewhere.
  • The advice to use (embedded) printers instead of getters is impractical even if a Book should know how to print itself – how do you transform your objects to other formats (JSON, XML, database rows, etc.)? With Jackson/JAXB/an ORM you simply add a few annotations, if any at all, and it works (a minimal sketch of the annotation-based approach follows after this list). With “printers” you have to manually implement the serialization logic. Even with Xembly you still have to write a tedious, potentially huge method with add()’s and up()’s. And when you add or remove a field, change a field definition, or add a new serialization format, it gets way more tedious to support. Another approach mentioned in the twitter thread is having separate subclasses for each format/database, and an example can be seen here. I really don’t find that easy to read or support. And even if that’s adopted in a project I’m working on, I’d be the first to replace that manual adding with reflection, however impure that may be. (Even Uncle Bob’s Fitnesse project has getters or even public fields where that makes sense in terms of the state space.)
  • Having too much logic/behaviour in an object may be seen as breaking the single responsibility principle. In fact, this article argues that the anemic approach is actually SOLID, unlike the rich business object approach. The SRP may actually be understood in multiple ways, but I’ll get to that below.
  • Dependency injection containers are fine. The blunt example of how the code looks without them is here. No amount of theoretical object-oriented programming talk can make me write that piece of code. I guess one can get used to it, but (excuse my appeal-to-emotion fallacy here) – it feels bad. And when you consider the case of dependency injection containers – whether you invoke a constructor from a main method, or your main invokes an automatic constructor (or setter) injection context, makes no real difference – your objects are still composed of their dependencies, and their dependencies are set externally. Except the former is more practical, and after a few weeks of nested instantiation you’ll feel inclined to write your own semi-automated mechanism to do that.
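Here’s the minimal sketch promised above – an “anemic” Book that knows nothing about JSON, with the library doing the printing (Jackson in this case; the fields are made up):

import com.fasterxml.jackson.databind.ObjectMapper;

public class Book {
    private String title;
    private String author;

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }
    public String getAuthor() { return author; }
    public void setAuthor(String author) { this.author = author; }

    public static void main(String[] args) throws Exception {
        Book book = new Book();
        book.setTitle("Anna Karenina");
        book.setAuthor("Leo Tolstoy");
        // serialization is handled entirely by the library - no "printer" logic in Book
        String json = new ObjectMapper().writeValueAsString(book);
        System.out.println(json); // {"title":"Anna Karenina","author":"Leo Tolstoy"}
    }
}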

But these are all arguments derived from a common root – encapsulation. Your side in the above arguments depends on how you view and understand encapsulation. I see the purpose of encapsulation as a way to protect the state space of a class – an object of a given class is only valid if it satisfies certain conditions. If you expose the data via getters and setters, then the state space constraints are violated – everyone can invalidate your object. For example, if you were able to set the size of an ArrayList without adding the corresponding element to the backing array, you’d break the behaviour of an ArrayList object – it will report its size inconsistently and the code that depends on the List contract would not always work.

But in practical terms encapsulation still allows for the distinction between “data objects” and “business objects”. The data object has no constraints on its state – any combination of the values of its fields is permitted. Or in some cases it isn’t, but it is enforced outside the current running program (e.g. via database constraints when persisting an object with an ORM). In these cases, where the state space is not constrained, encapsulation is useless. And forcing it upon your software blindly results in software that, I believe, is harder to maintain and extend. Or at least you gain nothing – testing isn’t easier that way (you can have a perfectly well tested anemic piece of software), deployment is not impacted, and tracing problems doesn’t seem much different.

I’m even perfectly fine with getting rid of the getters and exposing the state directly (as the aforementioned LogData class from Fitnesse).

And most often, in business applications, websites, and the general type of software out there, most objects don’t need to enforce any constraints on their state. Because their state is just data, used somewhere else, in whatever ways the business needs it to be used. To get back to the single responsibility principle – these data objects have only one reason to change – their … data has changed. The data itself is irrelevant to the object. It will become relevant at a later stage – when it’s fetched via a web service (after it’s serialized to JSON), or after it’s fetched from the database by another part of the application or a completely different system. And that’s important – encapsulation cannot be enforced across several systems that all work with a piece of data.

In the whole debate I haven’t seen a practical argument against the getter/setter/anemic style. The only thing I see is “it’s not OOP” and “it breaks encapsulation”. Well, I think it should be settled now that encapsulation should not always be there. It should be there only when you need it – to protect the state of your object from interference.

So, don’t feel bad to continue with your ORMs, DI frameworks and automatic JSON and XML serializers.
