Developers and Ethics

June 27, 2017

“What are some areas you are particularly interested in?” – recruiters (head-hunters) tend to ask that question a lot. I don’t have a good answer for it – I’ll know it when I see it. But I do have a list of areas I wouldn’t like to work in. And one of them is gambling.

Several years ago I got a very lucrative offer from a gambling company – both well paid and technically challenging. But I rejected it, because I didn’t want to contribute to abusing people’s weaknesses for the sake of taking their money. And no, I’m not a raging Marxist, but gambling is bad. You may argue that it’s a necessary vice and that people need it to suppress other internal struggles, but I’m not buying that as a justification.

I felt it was unethical to write code that does that. Just as I feel it’s unethical to profile users’ behaviour and “read” their emails in order to target ads, or to write bots that disseminate fake news.

A few months ago I was part of the campaign HQ of a party in a parliamentary election. Cambridge Analytica had already become popular after “delivering Brexit and Trump’s victory”, so using voters’ data in order to target messages at them sounded like the new cool thing. As head of IT & data, I rejected this approach, because it would have been unethical to bait unsuspecting users into taking dumb tests in order to provide us with Facebook tokens. Yes, we didn’t have any money to hire Cambridge Analytica-like companies, but even if we had, would “outsourcing” the dubious practice have changed anything? If you pay someone to trick users into unknowingly giving up their personal data, it’s as if you did it yourself.

This could be a very long post about technology and ethics. But it won’t be, as this is a technical blog, not a philosophical one. It won’t be about philosophy – for interesting takes on the matter you can listen to Damon Horowitz’s TED talk or even go through all of Michael Sandel’s Justice lectures at Harvard. It won’t be about how companies should be ethical (e.g. following the ethical design manifesto).

Instead, it will be a short post focusing on developers and their ethical choices.

I think we have the freedom to be ethical – there’s so much demand on the job market that rejecting an offer, refusing to do something, or leaving a company for ethical reasons is a luxury we can afford without compromising our well-being. When asked to do something unethical, we can refuse (several years ago I was asked to take part in some shady interactions related to a potential future government contract, which I refused to do). When offered a slightly better paid job that would have us build abusive technology, we can turn the offer down. When a new feature requires us to breach people’s privacy, we can argue against it, and ultimately not do it.

But in order to start making these ethical choices, we have to start thinking about ethics – to put ourselves in context. We, developers, are building the world of tomorrow (it sounds grandiose, but we know it’s way more mundane than that). We are the “tools” with which future products will be shaped. And yes, that’s true even for the average back-office system of an insurance company (which allows raising the premiums for pre-existing conditions), and for boring banking software (which allows mortgages way beyond the actual coverage the bank has), and so on.

Are these decisions ours to make? Isn’t it legislators who should define what’s allowed and what isn’t? Aren’t we just building whatever they tell us to build? Forgive me the far-fetched analogy, but Nazi Germany was an anti-humanity machine built on people who “just followed orders”. Yes, if we refuse, someone else will come along and do it, but collective ethics gets built over time.

As Hannah Arendt put it: “The sad truth is that most evil is done by people who never make up their minds to be good or evil.” We may think that as developers we don’t have a say. But without us, no software can be built. So through our individual ethical stance, a certain piece of unethical software may not get built, or may not become successful – and that’s a stance worth considering, especially when it costs us next to nothing.


Electronic Signature Using The WebCrypto API

June 11, 2017

Sometimes we need to let users sign something electronically. People often understand that as placing their handwritten signature on the screen somehow. Depending on the jurisdiction, that may be fine, or it may not be sufficient to just store the image. In Europe, for example, there’s Regulation 910/2014 which defines what electronic signatures are. As can be expected from a legal text, the definition is rather vague:

‘electronic signature’ means data in electronic form which is attached to or logically associated with other data in electronic form and which is used by the signatory to sign;

Yes, read it a few more times, say “wat” a few more times, and let’s discuss what it means. It can mean basically anything. It is technically acceptable to just attach an image of the drawn signature (e.g. captured with an HTML canvas) to the data, and that may still count.

But when we get to the more specific types of electronic signature – the advanced and qualified electronic signatures, things get a little better:

An advanced electronic signature shall meet the following requirements:
(a) it is uniquely linked to the signatory;
(b) it is capable of identifying the signatory;
(c) it is created using electronic signature creation data that the signatory can, with a high level of confidence, use under his sole control; and
(d) it is linked to the data signed therewith in such a way that any subsequent change in the data is detectable.

That looks like a proper “digital signature” in the technical sense – e.g. using a private key to sign and a public key to verify the signature. The “qualified” signatures need to be issued by a qualified provider, which is basically a trusted Certificate Authority. The keys for placing qualified signatures have to be issued on secure devices (smart cards and HSMs) so that nobody but the owner can have access to the private key.

But the legal distinction between advanced and qualified signatures isn’t entirely clear – the Regulation explicitly states that non-qualified signatures also have legal value. And working with qualified signatures (with smartcards) in browsers is a horrifying user experience – in most cases it goes through a Java applet, which nowadays works basically only on Internet Explorer and a special build of Firefox. Alternatives include desktop software and local-service JWS applications that handle the signing, but smartcards are a big issue in themselves and off-topic at the moment.

So, how do we allow users to “place” an electronic signature? I had an idea that this could be done entirely using the WebCrypto API that’s more or less supported in browsers these days. The idea is as follows:

  • Let the user type in a password for the purpose of signing
  • Derive a key from the password (e.g. using PBKDF2)
  • Sign the contents of the form that the user is submitting with the derived key
  • Store the signature alongside the rest of the form data
  • Optionally, store the derived key for verification purposes

Here’s a javascript gist with an implementation of that flow.

Many of the pieces are taken from the very helpful webcrypto examples repo. The hex2buf, buf2hex and str2ab functions are utilities (that sadly are not standard in js).

What the code does is straightforward, even though it’s a bit verbose. All the operations are chained using promises and “then”, which, to be honest, is a bit tedious to write and read (but inevitable, I guess):

  • The password is loaded as a raw key (after transforming to an array buffer)
  • A secret key is derived using PBKDF2 (with 100 iterations)
  • The secret key is used to do an HMAC “signature” on the content filled in by the user
  • The signature and the key are stored (in the UI in this example)
  • Then the signature can be verified using: the data, the signature and the key

You can test it here:

Having the signature stored should be enough to fulfill the definition of “electronic signature”. The fact that it’s based on a secret password known only to the user may even mean this is an “advanced electronic signature”. Storing the derived secret key is questionable – if you store it, it means you can “forge” signatures on behalf of the user. But not storing it means you can’t verify the signature – only the user can. Depending on the use case, you can choose one or the other.
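
If you do store the derived key, verification doesn’t have to happen in the browser – the server can recompute the HMAC as well. Here’s a minimal Java sketch of that check, assuming HMAC-SHA256 was used and that the key and signature have already been hex-decoded into byte arrays (the class and method names are mine, not part of the gist):

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class HmacSignatureVerifier {

    // Recomputes the HMAC over the submitted content with the stored derived key
    // and compares it to the signature produced in the browser.
    public static boolean verify(byte[] derivedKey, String signedContent, byte[] signature) throws Exception {
        Mac hmac = Mac.getInstance("HmacSHA256");
        hmac.init(new SecretKeySpec(derivedKey, "HmacSHA256"));
        byte[] expected = hmac.doFinal(signedContent.getBytes(StandardCharsets.UTF_8));
        // constant-time comparison, to avoid leaking information via timing
        return MessageDigest.isEqual(expected, signature);
    }
}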

Now, I have to admit I tried deriving an asymmetric keypair from the password (both RSA and ECDSA). The WebCrypto API doesn’t allow that out of the box. So I tried “generating” the keys using deriveBits(), e.g. setting the “n” and “d” values for RSA, and the x, y and d values for ECDSA (which can be found here, after a bit of searching). But I failed – you can’t specify just any values as importKey parameters, and the constraints are not documented anywhere, except in the low-level algorithm details, and that was a bit out of the scope of my experiment.

The goal was that if we only derive the private key from the password, we can easily derive the public key from the private key (but not vice-versa) – then we store the public key for verification, and the private key remains really private, so that we can’t forge signatures.

I have to add a disclaimer here that I realize this isn’t very secure. To begin with, deriving a key from a password is questionable in many contexts. However, in this context (placing a signature), it’s fine.

As a side note – working with the WebCrypto API is tedious. Maybe because nobody has actually used it yet, so googling for errors basically gives you the source code of Chromium and nothing else. It feels like uncharted territory (although the documentation and examples are good enough to get you started).

Whether it will be useful to do electronic signatures this way, I don’t know. I implemented it for a use case where it actually made sense (signing a party membership declaration). Is it better than a hand-drawn signature on a canvas? I think it is (unless you derive a key from the image, in which case the handwritten one is better due to its higher entropy).


“Architect” Should Be a Role, Not a Position

May 31, 2017

What happens when a senior developer becomes… more senior? It often happens that they get promoted to “architect”. Sometimes an architect doesn’t even have to have been a developer, as long as they see “the bigger picture”. Either way, there’s often a person that holds the position of “architect” – a person who makes decisions about the architecture of the system or systems being developed. In bigger companies there are “architect councils”, where the designated architects of each team gather and decide wise things…

But I think it’s a bad idea to have a position of “architect”. Architect is a position in construction – and it makes sense there, as you can’t change and tweak the architecture mid-project. But software architecture is flexible and should not be defined strictly upfront. And development and architecture are so intertwined, it doesn’t make much sense to have someone who “says what’s to be done” and others who “do it”. It creates all sorts of problems, mainly coming from the fact that the architect often doesn’t fully imagine how the implementation will play out. If the architect hasn’t written code for a long time, they tend to disregard “implementation details” and go for just the abstraction. However, abstractions leak all the time, and it’s rarely a workable solution to just think of the abstraction without the particular implementation.

That’s my first claim – you cannot be a good architect without knowing exactly how to write the whole code underneath. And no, too often it’s not “simple coding”. And if you have been an architect for years, and so you haven’t written code in years, you are almost certainly not a good architect.

Yes, okay, YOU in particular may be a good architect. Maybe in your particular company it makes sense to have someone sitting in an ivory tower of abstractions and mandate how the peons have to integrate this and implement that. But even in that case, there’s a better approach.

The architect should be a role. Every senior team member can and should take the role of an architect. It doesn’t have to be one person per team; in fact, it’s better to have more. Architecture should be discussed in meetings similar to the feature design meetings, with the clear idea that it will be you who is going to implement the whole thing. Any overdesign (of which architects are often guilty) will have to be justified in front of yourself – “do I want to write all this boilerplate stuff, or is there a simpler and more elegant way?”.

The position is “software engineer”. The role can be “scrum master”, “architect”, “continuous integration officer”, etc. If the company needs an “architects’ council” to decide on “bigger picture” integrations between systems, the developers can nominate someone to attend these meetings – possibly the most knowledgeable one. (As many commenters put it, the architect is often also a business analyst, a manager or a tech lead – basically “a senior developer with soft skills”. And that’s true – being “architect” is, again, just one of their roles.)

I know what the architects are thinking now – that there are high-level concerns that developers either don’t understand or shouldn’t be bothered with. Wrong. If your developers don’t understand the higher level architectural concerns, you have a problem that would manifest itself sooner or later. And yes, they should be bothered with the bigger picture, because in order to fit the code that you are writing into the bigger picture, you should be pretty familiar with it.

There’s another aspect, and that’s team member attitudes and interaction dynamics. If someone gets “promoted” to “architect” without being a particularly good or respected developer, that may damage team morale. On the other hand, someone promoted to “architect” can become too self-confident and push design decisions through just because they say so, despite good arguments against them.

So, ideally (and that’s my second claim) – get rid of the architect position. Make sure your senior team members are engaged in architectural discussions and decision making – they will be more motivated that way, and they will have a clearer picture of what their development efforts should achieve. And most importantly – the architectural decisions won’t be detached from the “real world” of day-to-day development, nor will they be unnecessarily complicated.


Overview of Message Queues [slides]

May 23, 2017

Yesterday I gave a talk that went through all the aspects of using message queues. I’ve previously written that “you probably don’t need a message queue” – now the conclusion is a bit more nuanced, but I still stand by the simplicity argument.

The talk goes through the various benefits and use cases of message queues, and discusses alternatives to the typical “message queue broker” architecture. The slides are available here.

One maybe strange approach that I propose is to use distributed locks (e.g. with Hazelcast) for distributed batch processing – you lock on a particular id (possibly an organization id / client id, rather than an individual record id), thus allowing multiple processors to run in parallel without stepping on each other’s toes (one node picks the first entry, another node tries the first one, fails to get the lock, and picks the second one).
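
To make the idea more concrete, here is a rough Java sketch using the Hazelcast 3.x lock API (in later Hazelcast versions distributed locks moved to the CP subsystem); the class and lock names are made up for the example:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.ILock;

public class ClientBatchProcessor {

    private final HazelcastInstance hazelcast = Hazelcast.newHazelcastInstance();

    // Each node calls this for every client that has pending records. Whoever
    // grabs the lock processes that client's batch; the other nodes fail
    // tryLock() immediately and simply move on to the next client.
    public void processPendingRecords(long clientId) {
        ILock lock = hazelcast.getLock("batch-client-" + clientId);
        if (lock.tryLock()) {
            try {
                // process all pending records of this client
            } finally {
                lock.unlock();
            }
        }
    }
}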

Something I missed as a benefit of brokers like RabbitMQ is the available tooling – you can monitor and debug your queues pretty easily.

I wasn’t able to focus in much detail on any of the concepts – e.g. how brokerless messaging works, how to deploy a broker (e.g. RabbitMQ) across multiple data centers (availability zones), or how akka (and akka cluster) fits in the “message queue” landscape. But I hope it’s a good overview that lets everyone get a clear picture of the options, which they can then analyze and assess in more detail.

And I’ll finish with the Dijkstra quote from the slides:

Simplicity is prerequisite for reliability


Event Logs

May 12, 2017

Most systems have some sort of event log – i.e. a record of what has happened in the system and who did it. And sometimes it has a dual existence – once as an “audit log”, and once as an event log, which is used to replay what has happened.

These are actually two separate concepts:

  • the audit log is the trace that every action leaves in the system so that the system can later be audited. It’s preferable that this log is somehow secured (I’ll discuss that another time)
  • the event log is a crucial part of the event-sourcing model, where the database only stores modifications rather than the current state. The current state is obtained by applying all the stored modifications up to the present moment. This allows seeing the state of the data at any moment in the past.

There are a bunch of ways to get that functionality. For audit logs there is Hibernate Envers, which stores all modifications in a separate table. You can also have a custom solution using Spring aspects or JPA listeners (which store the modifications in an audit log table whenever a change happens). You can store the changes in multiple ways – as key/value rows (one row per modified field), or as objects serialized to JSON.
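
A minimal sketch of the JPA listener variant (how the audit row actually gets persisted is left out; the listener and method names are made up):

import javax.persistence.PostPersist;
import javax.persistence.PostRemove;
import javax.persistence.PostUpdate;

// Registered on an entity with @EntityListeners(AuditLogListener.class)
public class AuditLogListener {

    @PostPersist
    public void created(Object entity) {
        record("CREATE", entity);
    }

    @PostUpdate
    public void updated(Object entity) {
        record("UPDATE", entity);
    }

    @PostRemove
    public void deleted(Object entity) {
        record("DELETE", entity);
    }

    private void record(String action, Object entity) {
        // insert a row in the audit log table: who, when, the action,
        // the entity type/id, and optionally the changed fields as JSON
    }
}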

Event sourcing can be achieved by always inserting a new record instead of updating or deleting (and incrementing a version and/or setting a “current” flag). There are also some event-sourcing-native databases – Datomic and Event Store. (Note: “event sourcing” isn’t equal to “insert-only”, but the approaches are very similar.)
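
As an illustration of the insert-only approach, here is a sketch of what a versioned record could look like (the entity and its fields are invented for the example):

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

@Entity
public class OrderRevision {

    @Id
    @GeneratedValue
    private Long id;          // surrogate key of this particular revision

    private Long orderId;     // logical id shared by all revisions of the order
    private int version;      // incremented on every change
    private boolean current;  // true only for the latest revision

    private String status;    // ...the actual order fields go here

    // getters and setters omitted
}

Every change inserts a new OrderRevision with version + 1 and flips the current flag of the previous one (which, as noted below, technically means an update); the state at any past moment is reconstructed by taking the highest version created before that moment.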

They still seem pretty similar – both track what has happened on the system in a detailed way. So can an audit log be used as an event log, and can event logs be used as audit logs? And is there a difference?

Audit logs have specific features that you don’t need for event sourcing – storing business actions and being secure. When a user logs in, or views a given item, that wouldn’t issue an update (well, maybe last_login_time or last_seen can be updated, but those are side effects). But you may still want to see details about the event in your audit log – who logged in, when, from what IP, after how many unsuccessful attempts. You may also want to know who read a given piece of information – if it’s sensitive personal data, it’s important to know who has seen it and whether the access wasn’t some form of “stalking”. In the audit log you can also have “business events” rather than individual database operations. For example, “checking out a basket” involves updating the basket status, decreasing the availability of the items and updating the purchase history. And you may want to just see “user X checked out basket Y with items Z at time W”.

Another very important feature of audit logs is their integrity. You should somehow protect them from modification (e.g. by digitally signing them). Why? Because at some point they may be evidence in court (e.g. after an employee misuses your system), so the more secure and protected they are, the better evidence they make. This isn’t needed for event sourcing, obviously.

Event sourcing has other requirements – every modification of every field must be stored. You should also be able to take “snapshots” so that the “current state” isn’t recalculated every time. The event log isn’t just a listener or an aspect around your persistence layer – it’s at the core of your data model, so in that sense it has more impact on the whole system design. Another thing event logs are useful for is consuming events from the system. For example, if you have an indexer module and you want to index changes as they come, you can “subscribe” the indexer to the events, and each insert will trigger an index operation. That is not limited to indexing, and can be used for synchronizing (parts of) the state with external systems (of which a search engine is just one example).

But given the many similarities – can’t we have a common solution that does both? Not exactly: an audit log is a poor event-sourcing tool per se, and an event-sourcing model by itself is not sufficient as an audit log.

Since event sourcing is a design choice, I think it should take the leading role – if you use event sourcing, you can add some additional inserts for non-data-modifying business-level operations (login, view) and have your high-level actions (e.g. checkout) represented as events. You can also have a scheduled job sign the entries (which may mean doing updates in an insert-only model, which would be weird, but still). And you will get a full-featured audit log with little additional effort on top of the event-sourcing mechanism.

The other way around is a bit tricky – you can still use the audit log to see the state of a piece of data in the past, but that doesn’t mean your data model will be insert-only. If you rely too much on the audit log as if it was event sourcing, maybe you need event sourcing in the first place. If it’s just occasional interest “what happened here two days ago”, then it’s fine having just the audit log.

The two concepts are overlapping so much, that it’s tempting to have both. But in conclusion, I’d say that:

  • you always need an audit log, while you don’t necessarily need event sourcing
  • if you have decided on using event sourcing, walk the extra mile to get a proper audit log

But of course, as usual, it all depends on the particular system and its use-cases.


Spring Boot, @EnableWebMvc And Common Use-Cases

April 21, 2017

It turns out that Spring Boot doesn’t mix well with the standard Spring MVC @EnableWebMvc annotation. What happens when you add it is that Spring Boot’s MVC autoconfiguration gets disabled.

The bad part (which wasted me a few hours) is that you can’t find that explicitly stated in any guide. This guide says that Spring Boot adds the annotation automatically, but doesn’t say what happens if you follow your previous experience and just add it yourself.

In fact, people having issues stemming from this automatically disabled autoconfiguration try to address them in various ways. Most often by keeping @EnableWebMvc, but also extending Spring Boot’s WebMvcAutoConfiguration. Like here, here and, somewhat, here. I found those after I got the same idea and implemented it that way. Then I realized doing that is redundant, after going through Spring Boot’s code and seeing that an inner class in the autoconfiguration class has a single-line javadoc stating:

Configuration equivalent to {@code @EnableWebMvc}.

That answered my question of whether Spring Boot’s autoconfiguration misses some of the @EnableWebMvc “features”. And it’s good that they extended the class that provides @EnableWebMvc, rather than mirroring the functionality (which is the obvious thing to do, I guess).

What should you do when you want to customize your beans? As usual, extend WebMvcConfigurerAdapter (annotate the new class with @Component) and do your customizations.
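
For example, something along these lines (a trivial CORS customization, just to show the shape):

import org.springframework.stereotype.Component;
import org.springframework.web.servlet.config.annotation.CorsRegistry;
import org.springframework.web.servlet.config.annotation.WebMvcConfigurerAdapter;

@Component
public class WebMvcCustomization extends WebMvcConfigurerAdapter {

    // override only what you need; Spring Boot's autoconfiguration stays intact
    @Override
    public void addCorsMappings(CorsRegistry registry) {
        registry.addMapping("/api/**");
    }
}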

So, bottom line of the particular problem: don’t use @EnableWebMvc in spring boot, just include spring-web as a maven/gradle dependency and it will be autoconfigured.

The bigger picture here is that I ended up adding a comment in the main configuration class detailing why @EnableWebMvc should not be put there. So the autoconfiguration magic saved me from doing a lot of stuff, but I still added a line explaining why something isn’t there.

And that’s because of the common use cases – people are used to using @EnableWebMvc. So the most natural thing to do is to add it, especially if you don’t know how Spring Boot autoconfiguration works in detail. And they will keep doing it, and wasting a few hours before realizing they should remove it (or before extending a bunch of Boot’s classes in order to achieve the same effect).

My suggestion in situations like this is: log a warning, and require explicitly disabling autoconfiguration in order to get rid of that warning. I had to turn on debug logging to see what gets autoconfigured, and then explore a bunch of classes to check their autoconfiguration conditions, in order to figure out what was going on.

And one of Spring Boot’s main use cases is jar-packaged web applications. That’s what most tutorials are about, and that’s what it’s mostly used for, I guess. So there should be special treatment for this common use case – with additional logging that helps people get through the maze of autoconfiguration.

I don’t want to be seen as “lecturing” the Spring team, who have done an amazing job of providing good documentation and straightforward behaviour. But in this case, where multiple sub-projects “collide”, it seems things could be improved.


Distributed Cache – Overview

April 8, 2017

What’s a distributed cache? A solution that is “deployed” in an application (typically a web application) and that makes sure data is loaded from memory, rather than from disk (which is much slower), in order to improve performance and response time.

That looks easy if the cache is to be used on a single machine – you just load your most active data from the database in memory (e.g. a Guava Cache instance), and serve it from there. It becomes a bit more complicated when this has to work in a cluster – e.g. 5 application nodes serving requests to users in a round-robin fashion.

You have to update the in-memory cache on all machines each time a piece of data is updated by a request to one of the machines. If you just load all the data in memory and don’t invalidate it, the cache won’t be “coherent” – it will have stale values and requests to different application nodes will have different results, which you most certainly want to avoid. Or you can have a single big cache server with tons of memory, but it can die – and that may disrupt the smooth operation, so you’d want to have at least 2 machines in a cluster.

You can get a distributed cache in different ways. To list a few: Infinispan (which I’ve covered previously), Terracotta/Ehcache, Hazelcast, Memcached, Redis, Cassandra, Elasticache (by Amazon). The first three are Java-specific (and JCache compliant), but the rest can be used in any setup. Cassandra wasn’t initially meant to be a cache solution, but it can easily be used as such.

All of these have different configurations and different options, even different architectures. For example, you can run Ehcache with a central Terracotta server, or with a peer-to-peer gossip protocol. The in-process approach is also applicable for Infinispan and Hazelcast. Alternatively, you can rely on a cloud-provided service, like Elasticache, or you can set up your own cluster of Memcached/Redis/Cassandra servers. Having the cache on the application nodes themselves is slightly faster than having a dedicated memory server (cluster), because of the network overhead.

The cache is structured around keys and values – there’s a cached entry for each key. When you want to load something from the database, you first check whether the cache has an entry with that key (based on the ID of your database record, for example). If the key exists in the cache, you skip the database query.

But how is the data “distributed”? It wouldn’t make sense to load all the data on all nodes at the same time, as the duplication is a waste of space and keeping the cache coherent would again be a problem. Most of the solutions rely on so-called “consistent hashing”. When you look for a particular key, its hash is calculated and (depending on the number of machines in the cache cluster) the cache solution knows exactly on which machine the corresponding value is located. You can read more details in the wikipedia article, but the approach works even when you add or remove nodes from the cache cluster, i.e. the cluster of machines that hold the cached data.

Then, when you do an update, you also update the cache entry, which the caching solution propagates in order to make the cache coherent. Note that if you do manual updates directly in the database, the cache will have stale data.

On the application level, you’d normally want to abstract the interactions with the cache. You’d end up with something like a generic “get” method that checks the cache and only then goes to the database. This is true for get-by-id queries, but caches can also be applied to other SELECT queries – you just use the query as the key and the query result as the value. For updates, you similarly update both the database and the cache.
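
In code, that abstraction usually boils down to a “cache-aside” helper, roughly like the sketch below (using Hazelcast’s IMap just for concreteness; the class name and the 30-minute TTL are arbitrary choices for the example):

import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Function;

public class CacheAsideRepository<K, V> {

    private final IMap<K, V> cache;

    public CacheAsideRepository(HazelcastInstance hazelcast, String mapName) {
        this.cache = hazelcast.getMap(mapName);
    }

    // read path: consult the cache first, fall back to the database, then populate the cache
    public V get(K key, Function<K, V> dbLoader) {
        V cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        V fromDb = dbLoader.apply(key);
        if (fromDb != null) {
            cache.put(key, fromDb, 30, TimeUnit.MINUTES);
        }
        return fromDb;
    }

    // write path: update the database first, then refresh the cache entry
    public void update(K key, V value, Consumer<V> dbWriter) {
        dbWriter.accept(value);
        cache.put(key, value, 30, TimeUnit.MINUTES);
    }
}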

Some frameworks offer these abstractions out of the box – ORMs like Hibernate have the so-called cache providers, so you just add a configuration option saying “I want to use (2nd level) cache” and your ORM operations automatically consult the configured cache provider before hitting the database. Spring has the @Cacheable annotation which can cache method invocations, using their parameters as keys. Other frameworks and languages usually have something like that as well.
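
For instance, with Spring’s caching abstraction (a sketch assuming caching is enabled with @EnableCaching and a cache named “prices” is configured):

import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

import java.math.BigDecimal;

@Service
public class PriceService {

    // The result is cached under the given productId; subsequent calls with
    // the same id are served from the cache without hitting the database.
    @Cacheable("prices")
    public BigDecimal getPrice(long productId) {
        return loadPriceFromDatabase(productId); // slow path, executed only on cache misses
    }

    private BigDecimal loadPriceFromDatabase(long productId) {
        // placeholder for the actual database query
        return BigDecimal.TEN;
    }
}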

An important side-note – you may need to pre-load the cache. On a fresh startup of a cluster (e.g. after a new deployment in a blue-green deployment scheme/crash/network failure/etc.) the system may be slow to fill the cache. So you may have a batch job run on startup to fetch some pieces of data from the database and put them in the cache.

It all sounds easy, and that’s good – the basic setup covers most of the cases. In reality, it’s a bit more complicated to tweak the cache configuration. Even when you manage to set up a cluster (which can be challenging in, say, AWS), your cache strategies may not be straightforward. You have to answer a couple of important questions – how long do you want your cache entries to live? How big does your cache need to be? How should elements expire (i.e. what’s the cache eviction strategy) – least recently used, least frequently used, first-in-first-out?

These options often sound arbitrary. You normally go with the defaults and don’t bother. But you have to constantly keep an eye on the cache statistics, do performance tests, measure and tweak. Sometimes a single configuration option can have a big impact.

One important note here on the measuring side – you don’t necessarily want to cache everything. You can start that way, but it may turn out that for some types of entries you are actually better off without a cache. Normally these are the ones that are updated too often and read not that often, so constantly updating the cache with them is an overhead that doesn’t pay off. So – measure, tweak, repeat.

Distributed caches are an almost mandatory component of web applications. Yet, I’ve had numerous discussions about how a given system is very big and needs a lot of hardware, when all of its data would fit in my laptop’s memory. So, much to my surprise, it turns out this is not yet a universally known concept. I hope I’ve given a short but sufficient overview of the concept and the various options.


Distributing Election Volunteers In Polling Stations

March 20, 2017

There’s an upcoming election in my country, and I’m a member of the governing body of one of the new parties. As we have a lot of focus on technology (and e-governance), our internal operations are also benefiting from some IT skills. The particular task at hand these days was to distribute a number of election day volunteers (that help observe the fair election process) to polling stations. And I think it’s an interesting technical task, so I’ll try to explain the process.

First – data sources. We have an online form for gathering volunteer requests. And second, we have local coordinators that collect volunteer declarations and send them centrally. Collecting all the data is problematic (to this moment), because filling the online form doesn’t make you eligible – you also have to mail a paper declaration to the central office (horrible bureaucracy).

Then there are the volunteer preferences – in the form they have indicated whether they are willing to travel, or whether they prefer their closest polling station. And then there are the “priority” polling stations, which are considered more risky and where we therefore need volunteers.

I decided to do the following:

  • Create a database table “volunteers” that holds all the data about all prospective volunteers
  • Import all data – using the Apache CSV parser, parse the CSV files (exported from Google Sheets) with 1. the online form entries and 2. the data from the received paper declarations
  • Match the entries from the two sources by full name (as the declarations cannot contain an email, which would otherwise be the primary key)
  • Geocode the addresses of people
  • Import all polling stations and their addresses (public data by the central election commission)
  • Geocode the addresses of the polling stations
  • Find the closest polling station address for each volunteer

All of the steps are somewhat trivial, except the last one, but I’ll still explain each in short. The CSV parsing and importing are straightforward. The only thing one has to be careful about is the ability to insert additional records at a later date, because declarations are still being received as I’m writing.

Geocoding is a bit trickier. I used OpenStreetMap initially, but it managed to find only a fraction of the addresses (which are not normalized – volunteers and officials are sometimes careless about the structure of the addresses). The OpenStreetMap API can be found here. It’s basically calling http://nominatim.openstreetmap.org/search.php?q=address&format=json with the address. I tried cleaning up some of the addresses automatically, which led to a couple more successful geocodings, but not many.

The rest of the coordinates I obtained through Google Maps. I extract all the non-geocoded addresses and their corresponding primary keys (for volunteers – the full name; for polling stations – the hash of a semi-normalized address) and parse them with javascript, which then invokes the Google Maps API. Something like this:

<script type="text/javascript" src="jquery.csv.min.js"></script>
<script type="text/javascript">
    // shared counter so that each delayed invocation picks the next CSV row
    var idx = 1;

    function initMap() {
        var map = new google.maps.Map(document.getElementById('map'), {
            zoom: 8,
            center: {lat: 42.7339, lng: 25.4858} // roughly the center of the country
        });
        var geocoder = new google.maps.Geocoder();

        // geocode.csv holds "key,address" rows (the first row is a header);
        // requests are spaced 2 seconds apart to stay within the geocoding rate limits
        $.get("geocode.csv", function(csv) {
            var stations = $.csv.toArrays(csv);
            for (var i = 1; i < stations.length; i++) {
                setTimeout(function() {
                    geocodeAddress(geocoder, map, stations[idx][1], stations[idx][0]);
                    idx++;
                }, i * 2000);
            }
        });
    }

    function geocodeAddress(geocoder, resultsMap, address, label) {
        geocoder.geocode({'address': address}, function(results, status) {
            if (status === 'OK') {
                // print a "lat,lon,key" CSV line, escaping double quotes in the key
                $("#out").append(results[0].geometry.location.lat() + "," + results[0].geometry.location.lng() + ",\"" + label.replace(/"/g, '""').trim() + "\"<br />");
            } else {
                console.log('Geocode was not successful for the following reason: ' + status);
            }
        });
    }
</script>

This spits out CSV on the screen, which I then took and transformed with a regex replace (in Notepad++) into update queries:

Find: (\d+\.\d+),(\d+\.\d+),(".+")
Replace: UPDATE addresses SET lat=$1, lon=$2 WHERE hash=$3

Now that I had most of the addresses geocoded, the distance searching could begin. I used the query from this SO question to come up with this (My)SQL query:

SELECT MIN(distance), email, names, stationCode, calc.address FROM
(SELECT email, codePrefix, addresses.stationCode, addresses.address, names, ( 3959 * acos( cos( radians(volunteers.lat) ) * cos( radians( addresses.lat ) )
   * cos( radians(addresses.lon) - radians(volunteers.lon)) + sin(radians(volunteers.lat))
   * sin( radians(addresses.lat)))) AS distance
 FROM (SELECT address, hash, stationCode, city, lat, lon FROM addresses JOIN stations ON addresses.hash = stations.addressHash GROUP BY hash) AS addresses
 JOIN volunteers WHERE addresses.lat IS NOT NULL AND volunteers.lat IS NOT NULL ORDER BY distance ASC) AS calc
GROUP BY names;

This spits out the closest polling station to each of the volunteers. It is easily turned into an update query that sets the polling station code for each volunteer in a designated field.

Then there are some manual amendments to be made, based on traveling preferences – if a person is willing to travel, we pick one of the “priority stations” and assign it to them. Since these are a small number, it’s not worth automating.

Of course, in reality, due to data collection flaws, the above idealized example was accompanied by a lot of manual labour of checking paper declarations, annoying people on the phone multiple times and cleaning up the data, but in the end a sizable portion of the volunteers were distributed with the above mechanism.

Apart from being an interesting task, I think it shows that programming skills are useful for practically every task nowadays. If we had to do this manually (even with multiple people with good Excel skills), it would have been a long and tedious process. So I’m quite in favour of everyone being taught to write code. They don’t have to end up being developers, but the way programming helps with non-trivial tasks is enormously beneficial.


“Infinity” is a Bad Default Timeout

March 17, 2017

Many libraries wrap some external communication – be it a REST-like API, a message queue, a database, a mail server or something else. And therefore you have to have some timeouts – for connecting, reading, writing or idling. Sadly, many libraries have their default timeouts set to “0” or “-1”, which means “infinity”.

And that is a useless and even harmful default. There isn’t a practical use case where you’d want to hang forever waiting for a resource, and there are tons of situations where this can happen, e.g. when the other end gets stuck. In the past 3 months I’ve had 2 libraries with a default timeout of “infinity”, which eventually led to production problems because we’d forgotten to configure them properly. Sometimes you don’t even see the problem until a thread pool gets exhausted.

So, I have a request to API/library designers (as I’ve made before – against property maps and encodings other than UTF-8): never have “infinity” as a default timeout. Otherwise your library will cause lots of production issues.
Also note that it’s sometimes an underlying HTTP client (or socket) that doesn’t have a reasonable default – it’s still your job to fix that when wrapping it.

What default should you provide? A reasonable one – 5 seconds, maybe? You may (rightly) say you don’t want to impose an arbitrary timeout on your users. In that case I have a better proposal:

Explicitly require a timeout for building your “client” (because these libraries are most often clients for some external system), e.g. Client.create(url, credentials, timeout), and fail if no timeout is provided. That makes the users of the client actively consider what a good timeout for their use case is – without imposing anything, and most importantly – without risking stuck connections in production. Additionally, you can still present them with a “default” option, but make them explicitly choose it. For example:

Client client = ClientBuilder.create(url)
                   .withCredentials(credentials)
                   .withTimeouts(Timeouts.connect(1000).read(1000))
                   .build();
// OR
Client client = ClientBuilder.create(url)
                   .withCredentials(credentials)
                   .withDefaultTimeouts()
                   .build();

The builder above should require “timeouts” to be set, and should fail if neither of the two methods was invoked. Even if you don’t provide these options, at least have a good way of specifying timeouts – some libraries require reflection to set the timeout of their underlying client.

I believe this is one of those issues that look tiny but cause a lot of problems in the real world. And it can (and should) be solved by the library/client designers.

But since it isn’t always the case, we must make sure that timeouts are configured every time we use a 3rd party library.


Protecting Sensitive Data

March 12, 2017

If you are building a service that stores sensitive data, your number one concern should be how to protect it. What IS sensitive data? There are some obvious examples, like medical data or bank account data. But would you consider a dating site database sensitive data? Based on the recent leak of a big dating site, I’d say yes. Is a cloud turn-by-turn navigation database sensitive? Most likely, as users’ journeys are stored there. Facebook messages, emails, etc. – all of that can and should be considered sensitive, and therefore must be highly protected. If you’re not sure whether the data you store is sensitive, assume it is, just in case. Otherwise a subsequent breach can easily bring your business down.

Now, protecting data is no trivial feat. And certainly cannot be covered in a single blog post. I’ll start with outlining a few good practices:

  • Don’t dump your production data anywhere else. If you want a “replica” for testing purposes, obfuscate the data – replace the real values with fake ones.
  • Make sure access to your servers is properly restricted. This includes using a “bastion” host, proper access control settings for your administrators, key-based SSH access.
  • Encrypt your backups – if your system is “perfectly” secured, but your backups lie around unencrypted, they would be the weak spot. The decryption key should be as protected as possible (will discuss it below)
  • Encrypt your storage – especially if using a cloud provider, assume you can’t trust it. AWS, for example, offers EBS encryption, which is quite good. There are other approaches as well, e.g. using LUKS with keys stored within your organization’s infrastructure. This and the previous point are about “Data at rest” encryption.
  • Monitor all access and audit all operations – there shouldn’t be an unaudited command issued on production.
  • In some cases, you may even want to use split keys for logging into a production machine – meaning two administrators have to come together in order to gain access.
  • Always be up-to-date with software packages and libraries (well, maybe wait a few days/weeks to make sure no new obvious vulnerability has been introduced)
  • Encrypt internal communication between servers – the fact that your data is encrypted “at rest” may not matter if it’s in plain text “in transit”.
  • In rare cases, when only the user has to be able to see their data and it’s very confidential, you may encrypt it with a key based (in part) on their password. The password alone does not make a good encryption key, but there are key-derivation functions (e.g. PBKDF2) created precisely to turn low-entropy passwords into fair keys (see the sketch after this list). The key can be combined with another part, stored on the server side. Thus only the user can decrypt their content, as their password is not stored anywhere in plain text and can’t be accessed even in case of a breach.
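
A minimal sketch of such a derivation in Java (the iteration count, key size and salt handling are illustrative; the derived key could then be combined with a server-side secret as described in the last item above):

import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.SecretKeySpec;

import java.security.SecureRandom;

public class PasswordDerivedKeys {

    // Derives a 256-bit AES key from the user's password and a per-user random salt.
    // The salt is stored server-side; the password itself never is.
    public static SecretKeySpec deriveKey(char[] password, byte[] salt) throws Exception {
        PBEKeySpec spec = new PBEKeySpec(password, salt, 100_000, 256);
        SecretKeyFactory factory = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256");
        byte[] keyBytes = factory.generateSecret(spec).getEncoded();
        return new SecretKeySpec(keyBytes, "AES");
    }

    public static byte[] newSalt() {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);
        return salt;
    }
}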

You see there’s a lot of encryption happening, and with encryption there’s one key question – who holds the decryption key. If the key is stored in a configuration file on one of your servers, an attacker that has gained access to your infrastructure will find that key as well, take it, dump the whole database, and then happily wait for it to be fully decrypted on their own machines.

To store a key securely, it has to be kept in tamper-proof storage – for example, an HSM (Hardware Security Module). If you don’t have HSMs, Amazon offers them as part of AWS. It also offers key management as a service, but the particular provider is not important – the concept is. You need the key securely stored on a device that doesn’t let it out under any circumstances, even in case of a breach (HSM vendors claim that’s the case).

Now, how to use these keys depends on the particular case. Normally, you wouldn’t use the HSM itself to decrypt the data, but rather to decrypt the decryption key, which in turn is used to decrypt the data. If all the sensitive data in your database is encrypted, then even an attacker who gains SSH access, and thus access to the database (because your application needs unencrypted data to work with; homomorphic encryption is not here yet), still has to get hold of the in-memory decryption key. And if you’re using envelope encryption, it will be even harder for an attacker to just dump your data and walk away.

Note that the encryption and decryption here happen at the application level – so the encryption can be applied not simply “per storage” or “per database”, but also per column – usernames don’t have to be kept that secret, but the associated personal data (in the next 3 database columns) should be. So you can plug the encryption mechanism into your pre-persist (and the decryption into your post-load) hooks. If speed is an issue, i.e. you don’t want to do the decryption in real time, you may have a (distributed) cache of decrypted data that you can refresh with a background job.
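
With JPA, for example, the per-column hook can be an attribute converter. A rough sketch (AES-GCM is my choice for the example, and obtaining the data key is exactly the key-management problem discussed above):

import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import javax.persistence.AttributeConverter;
import javax.persistence.Converter;

import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Base64;

// Applied per column with @Convert(converter = EncryptedStringConverter.class)
@Converter
public class EncryptedStringConverter implements AttributeConverter<String, String> {

    private final SecretKeySpec dataKey = obtainDataKey();

    @Override
    public String convertToDatabaseColumn(String plain) {
        if (plain == null) {
            return null;
        }
        try {
            byte[] iv = new byte[12];
            new SecureRandom().nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, dataKey, new GCMParameterSpec(128, iv));
            byte[] encrypted = cipher.doFinal(plain.getBytes(StandardCharsets.UTF_8));
            // store the IV together with the ciphertext
            byte[] stored = new byte[iv.length + encrypted.length];
            System.arraycopy(iv, 0, stored, 0, iv.length);
            System.arraycopy(encrypted, 0, stored, iv.length, encrypted.length);
            return Base64.getEncoder().encodeToString(stored);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    @Override
    public String convertToEntityAttribute(String stored) {
        if (stored == null) {
            return null;
        }
        try {
            byte[] all = Base64.getDecoder().decode(stored);
            byte[] iv = Arrays.copyOfRange(all, 0, 12);
            byte[] encrypted = Arrays.copyOfRange(all, 12, all.length);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, dataKey, new GCMParameterSpec(128, iv));
            return new String(cipher.doFinal(encrypted), StandardCharsets.UTF_8);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    private static SecretKeySpec obtainDataKey() {
        // placeholder: in reality the data key is fetched/decrypted via the HSM or KMS
        return new SecretKeySpec(new byte[32], "AES");
    }
}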

But if your application has to know the data, an attacker that gains full control for an unlimited amount of time will also get the full data eventually. No amount of enveloping and layering of encryption can stop that; it can only make it harder and slower to obtain the dump (even if the master key, stored on the HSM, is not extracted, the attacker will have an interface to that key and can use it to decrypt the data). That’s why intrusion detection is key. All of the above steps, combined with an early notification of intrusion, can mean your data is protected.

As we are all well aware, there is never a 100% secure system. Our job is to make it nearly impossible for bulk data extraction. And that includes proper key management, proper system-level and application-level handling of encryption and proper monitoring and intrusion detection.
