Distributing Election Volunteers In Polling Stations

March 20, 2017

There’s an upcoming election in my country, and I’m a member of the governing body of one of the new parties. As we have a strong focus on technology (and e-governance), our internal operations also benefit from some IT skills. The particular task at hand these days was to distribute a number of election-day volunteers (who help observe that the election process is fair) to polling stations. I think it’s an interesting technical task, so I’ll try to explain the process.

First – data sources. We have an online form for gathering volunteer requests. And second, we have local coordinators that collect volunteer declarations and send them centrally. Collecting all the data is problematic (to this day), because filling in the online form doesn’t make you eligible – you also have to mail a paper declaration to the central office (horrible bureaucracy).

Then there are the volunteer preferences – in the form they have indicated whether they are willing to travel or prefer their closest polling station. And then there are the “priority” polling stations, which are considered more risky and therefore need volunteers the most.

I decided to do the following:

  • Create a database table “volunteers” that holds all the data about all prospective volunteers
  • Import all the data – using the Apache Commons CSV parser, parse the CSV files (exported from Google Sheets) containing 1. the online form responses and 2. the data from the received paper declarations
  • Match the entries from the two sources by full name (as the declarations cannot contain an email, which would otherwise be the primary key)
  • Geocode the addresses of people
  • Import all polling stations and their addresses (public data by the central election commission)
  • Geocode the addresses of the polling stations
  • Find the closest polling station address for each volunteer

All of the steps are somewhat trivial, except the last one, but I’ll still explain them briefly. The CSV parsing and importing are straightforward (a minimal sketch follows below). The only thing to be careful about is being able to insert additional records at a later date, because declarations are still being received as I write this.
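For illustration, the import step could look roughly like the following sketch – assuming the Apache Commons CSV parser and a plain JDBC insert; the column names and table layout here are hypothetical:

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class VolunteerImport {
    public static void importCsv(String csvPath, String jdbcUrl, String user, String password) throws Exception {
        try (Reader in = Files.newBufferedReader(Paths.get(csvPath), StandardCharsets.UTF_8);
             Connection connection = DriverManager.getConnection(jdbcUrl, user, password)) {

            PreparedStatement insert = connection.prepareStatement(
                    "INSERT INTO volunteers (names, email, address, willing_to_travel) VALUES (?, ?, ?, ?)");

            // the first record is the header row, so parse with it as the column names
            for (CSVRecord record : CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(in)) {
                insert.setString(1, record.get("Names").trim());
                insert.setString(2, record.get("Email").trim().toLowerCase());
                insert.setString(3, record.get("Address").trim());
                insert.setBoolean(4, "yes".equalsIgnoreCase(record.get("WillingToTravel")));
                insert.addBatch();
            }
            insert.executeBatch();
        }
    }
}

Re-running it when new declarations arrive then just means skipping the names that are already in the table.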

Geocoding is a bit trickier. I used OpenStreetMap initially, but it managed to find only a fraction of the addresses (which are not normalized – volunteers and officials are sometimes careless about the structure of the addresses). The OpenStreetMap API can be found here. It’s basically a matter of calling http://nominatim.openstreetmap.org/search.php?q=address&format=json with the address. I tried cleaning up some of the addresses automatically, which led to a couple more successful geocodings, but not much.
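For reference, the Nominatim call itself can be as simple as the following sketch – plain HttpURLConnection, raw JSON returned as a string, no error handling or response parsing; the User-Agent value is just an example, since Nominatim’s usage policy asks for an identifying one:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

// A minimal sketch of the Nominatim lookup described above. The JSON response is
// returned as a raw string; in the real import it would be parsed and the lat/lon
// stored next to the volunteer record.
public class NominatimGeocoder {
    public static String geocode(String address) throws Exception {
        String url = "http://nominatim.openstreetmap.org/search.php?format=json&q="
                + URLEncoder.encode(address, "UTF-8");
        HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
        // Nominatim's usage policy asks for an identifying User-Agent and limits request rates
        connection.setRequestProperty("User-Agent", "volunteer-distribution-script");
        connection.setConnectTimeout(5000);
        connection.setReadTimeout(5000);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            return reader.lines().collect(Collectors.joining());
        }
    }
}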

The rest of the coordinates I obtained through Google Maps. I extract all the non-geocoded addresses and their corresponding primary keys (for volunteers – the full name; for polling stations – the hash of a semi-normalized address), and pass them to a small JavaScript page, which then invokes the Google Maps API. Something like this:

<script type="text/javascript" src="jquery.csv.min.js"></script>
<script type="text/javascript">
	var idx = 1;
	function initMap() {
        var map = new google.maps.Map(document.getElementById('map'), {
          zoom: 8,
          center: {lat: -42.7339, lng: 25.4858}
        });
        var geocoder = new google.maps.Geocoder();

		$.get("geocode.csv", function(csv) {
			var stations = $.csv.toArrays(csv);
			for (var i = 1; i < stations.length; i ++) {
				setTimeout(function() {
					geocodeAddress(geocoder, map, stations[idx][1], stations[idx][0]);
					idx++;
				}, i * 2000);
			}
		});
      }

      function geocodeAddress(geocoder, resultsMap, address, label) {
        geocoder.geocode({'address': address}, function(results, status) {
          if (status === 'OK') {
            $("#out").append(results[0].geometry.location.lat() + "," + results[0].geometry.location.lng() + ",\"" + label.replace('"', '""').trim() + "\"<br />");
          } else {
            console.log('Geocode was not successful for the following reason: ' + status);
          }
        });
      }
</script>

This spits out CSV on the screen, which I then took and transformed with a regex replace (in Notepad++) into update queries:

Find: (\d+\.\d+),(\d+\.\d+),(".+")
Replace: UPDATE addresses SET lat=$1, lon=$2 WHERE hash=$3

Now that I had most of the addresses geocoded, the distance searching had to begin. I used the query from this SO question to come up with this (My)SQL query:

SELECT MIN(distance), email, names, stationCode, calc.address
FROM (
  SELECT email, codePrefix, addresses.address, names,
         ( 3959 * acos( cos(radians(volunteers.lat)) * cos(radians(addresses.lat))
           * cos(radians(addresses.lon) - radians(volunteers.lon))
           + sin(radians(volunteers.lat)) * sin(radians(addresses.lat)) ) ) AS distance
  FROM (
    SELECT address, hash, stationCode, city, lat, lon
    FROM addresses JOIN stations ON addresses.hash = stations.addressHash
    GROUP BY hash
  ) AS addresses
  JOIN volunteers
  WHERE addresses.lat IS NOT NULL AND volunteers.lat IS NOT NULL
  ORDER BY distance ASC
) AS calc
GROUP BY names;

This spits out the closest polling station for each of the volunteers. It is easily turned into an update query that sets the polling station code for each volunteer in a designated field.

Then there are some manual amendments to be made, based on travelling preferences – if a person is willing to travel, we pick one of the “priority” stations and assign it to them. Since these are a small number, it’s not worth automating.

Of course, in reality, due to data collection flaws, the above idealized example was accompanied by a lot of manual labour of checking paper declarations, annoying people on the phone multiple times and cleaning up the data, but in the end a sizable portion of the volunteers were distributed with the above mechanism.

Apart from being an interesting task, I think it shows that programming skills are useful for practically every task nowadays. If we had to do this manually (even with multiple people with good Excel skills), it would be a long and tedious process. So I’m quite in favour of everyone being taught to write code. They don’t have to end up being a developer, but the way programming helps with non-trivial tasks is enormously beneficial.


“Infinity” is a Bad Default Timeout

March 17, 2017

Many libraries wrap some external communication. Be it a REST-like API, a message queue, a database, a mail server or something else. And therefore you have to have some timeout – for connecting, for reading, writing or idling. And sadly, many libraries have their default timeouts set to “0” or “-1” which means “infinity”.

And that is a very useless and even harmful default. There isn’t a practical use case where you’d want to hang forever waiting for a resource. And there are tons of situations where this can happen, e.g. the other end gets stuck. In the past 3 months I had 2 libraries with a default timeout of “infinity”, and that eventually led to production problems because we had forgotten to configure them properly. Sometimes you don’t even see the problem until a thread pool gets exhausted.

So, I have a request to API/library designers (as I’ve made before – against property maps and encodings other than UTF-8): never have “infinity” as a default timeout. Such a default will cause lots of production issues for your library’s users. Also note that it’s sometimes an underlying HTTP client (or Socket) that doesn’t have a reasonable default – it’s still your job to fix that when wrapping it.
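To illustrate, here is a minimal sketch of explicitly configuring the JDK primitives that many client libraries wrap – the 5-second values are arbitrary examples, not a recommendation for every use case:

import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.URL;

// A minimal sketch of setting explicit timeouts on the JDK primitives that
// client libraries typically wrap.
public class TimeoutDefaults {
    public static HttpURLConnection open(String url) throws Exception {
        HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
        connection.setConnectTimeout(5000); // the default of 0 means "wait forever" to connect
        connection.setReadTimeout(5000);    // the default of 0 means "wait forever" for a response
        return connection;
    }

    public static Socket connect(String host, int port) throws Exception {
        Socket socket = new Socket();
        socket.connect(new InetSocketAddress(host, port), 5000); // connect timeout
        socket.setSoTimeout(5000); // read timeout; the default of 0 blocks indefinitely
        return socket;
    }
}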

What default should you provide? A reasonable one. 5 seconds, maybe? You may (rightly) say you don’t want to impose an arbitrary timeout on your users. In that case I have a better proposal:

Explicitly require a timeout for building your “client” (because these libraries are most often clients for some external system), e.g. Client.create(url, credentials, timeout), and fail if no timeout is provided. That makes the users of the client actively consider what a good timeout for their use case is – without imposing anything, and most importantly – without risking stuck connections in production. Additionally, you can still present them with a “default” option, while still making them choose it explicitly. For example:

Client client = ClientBuilder.create(url)
                   .withCredentials(credentials)
                   .withTimeouts(Timeouts.connect(1000).read(1000))
                   .build();
// OR
Client client = ClientBuilder.create(url)
                   .withCredentials(credentials)
                   .withDefaultTimeouts()
                   .build();

The builder above should require “timeouts” to be set, and should fail if neither of the two methods was invoked. Even if you don’t provide these options, at least have a good way of specifying timeouts – some libraries require reflection to set the timeout of their underlying client.
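A minimal sketch of how such a builder could enforce that rule, reusing the hypothetical Client, Credentials and Timeouts types from the example above (left out for brevity):

// A builder that refuses to build a client without timeouts. All names are hypothetical.
public class ClientBuilder {
    private final String url;
    private Credentials credentials;
    private Timeouts timeouts;

    private ClientBuilder(String url) {
        this.url = url;
    }

    public static ClientBuilder create(String url) {
        return new ClientBuilder(url);
    }

    public ClientBuilder withCredentials(Credentials credentials) {
        this.credentials = credentials;
        return this;
    }

    public ClientBuilder withTimeouts(Timeouts timeouts) {
        this.timeouts = timeouts;
        return this;
    }

    public ClientBuilder withDefaultTimeouts() {
        // an explicit opt-in to the library's defaults, rather than a silent fallback
        this.timeouts = Timeouts.connect(5000).read(5000);
        return this;
    }

    public Client build() {
        if (timeouts == null) {
            throw new IllegalStateException(
                "Timeouts are required: call withTimeouts(..) or withDefaultTimeouts()");
        }
        return new Client(url, credentials, timeouts);
    }
}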

I believe this is one of those issues that look tiny but cause a lot of problems in the real world. And it can (and should) be solved by the library/client designers.

But since it isn’t always the case, we must make sure that timeouts are configured every time we use a 3rd party library.


Protecting Sensitive Data

March 12, 2017

If you are building a service that stores sensitive data, your number one concern should be how to protect it. What IS sensitive data? There are some obvious examples, like medical data or bank account data. But would you consider a dating site’s database sensitive data? Based on the recent leak of a big dating site, I’d say yes. Is a cloud turn-by-turn navigation database sensitive? Most likely, as users’ journeys are stored there. Facebook messages, emails, etc. – all of that can and should be considered sensitive, and therefore must be highly protected. If you’re not sure whether the data you store is sensitive, assume it is, just in case – a subsequent breach can easily bring your business down.

Now, protecting data is no trivial feat. And certainly cannot be covered in a single blog post. I’ll start with outlining a few good practices:

  • Don’t dump your production data anywhere else. If you want a “replica” for testing purposes, obfuscate the data – replace the real values with fake ones.
  • Make sure access to your servers is properly restricted. This includes using a “bastion” host, proper access control settings for your administrators, key-based SSH access.
  • Encrypt your backups – if your system is “perfectly” secured but your backups lie around unencrypted, they will be the weak spot. The decryption key should be as protected as possible (I’ll discuss that below)
  • Encrypt your storage – especially if using a cloud provider, assume you can’t trust it. AWS, for example, offers EBS encryption, which is quite good. There are other approaches as well, e.g. using LUKS with keys stored within your organization’s infrastructure. This and the previous point are about “Data at rest” encryption.
  • Monitor all access and audit all operations – there shouldn’t be a single unaudited command issued on production.
  • In some cases, you may even want to use split keys for logging into a production machine – meaning two administrators have to come together in order to gain access.
  • Always be up-to-date with software packages and libraries (well, maybe wait a few days/weeks to make sure no new obvious vulnerability has been introduced)
  • Encrypt internal communication between servers – the fact that your data is encrypted “at rest”, may not matter, if it’s in plain text “in transit”.
  • In rare cases, when only the user has to be able to see their data and it’s very confidential, you may encrypt it with a key based (in part) on their password. The password alone does not make a good encryption key, but there are key-derivation functions (e.g. PBKDF2) created to turn low-entropy passwords into decent keys (a sketch follows right after this list). The key can be combined with another part, stored on the server side. Thus only the user can decrypt their content, as their password is not stored anywhere in plain text and can’t be accessed even in case of a breach.
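To make that last point concrete, here is a minimal sketch of deriving an AES key from a password with PBKDF2 – the iteration count is an arbitrary example, and combining the derived key with a server-side component is left out:

import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.SecretKeySpec;
import java.security.SecureRandom;

// A minimal sketch of turning a user's password into an AES key via PBKDF2.
public class PasswordKeyDerivation {
    public static SecretKeySpec deriveKey(char[] password, byte[] salt) throws Exception {
        PBEKeySpec spec = new PBEKeySpec(password, salt, 100_000, 256);
        SecretKeyFactory factory = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256");
        byte[] keyBytes = factory.generateSecret(spec).getEncoded();
        return new SecretKeySpec(keyBytes, "AES");
    }

    public static byte[] newSalt() {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt); // the salt is stored alongside the ciphertext
        return salt;
    }
}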

You see there’s a lot of encryption happening, but with encryption there’s one key question – who holds the decryption key. If the key is stored in a configuration file on one of your servers, an attacker that has gained access to your infrastructure will find that key as well, take it along with the whole database, and then happily wait for it to be fully decrypted on his own machines.

To store a key securely, it has to be on tamper-proof storage – for example, an HSM (Hardware Security Module). If you don’t have your own HSMs, Amazon offers them as part of AWS. It also offers key management as a service, but the particular provider is not important – the concept is: you need the key stored on a device that doesn’t let it out under any circumstances, even in case of a breach (HSM vendors claim that’s the case).

Now, how to use these keys depends on the particular case. Normally, you wouldn’t use the HSM itself to decrypt data, but rather to decrypt the decryption key, which in turn is used to decrypt the data. If all the sensitive data in your database is encrypted, even if the attacker gains SSH access and thus gains access to the database (because your application needs unencrypted data to work with; homomorphic encryption is not yet here), he’ll have to get hold of the in-memory decryption key. And if you’re using envelope encryption, it will be even harder for an attacker to just dump your data and walk away.
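A minimal sketch of the envelope pattern, assuming the master key sits behind some HSM/KMS interface – the MasterKeyClient interface below is hypothetical and stands in for whatever key-management API is actually used:

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.security.SecureRandom;

// Envelope encryption sketch: the data is encrypted with a random data key, and only
// that (small) data key is sent to the HSM/KMS to be wrapped by the master key.
public class EnvelopeEncryption {

    public interface MasterKeyClient {
        byte[] wrap(byte[] dataKey);    // performed inside the HSM/KMS
        byte[] unwrap(byte[] wrappedKey);
    }

    public static class Envelope {
        public final byte[] wrappedKey;
        public final byte[] iv;
        public final byte[] ciphertext;
        public Envelope(byte[] wrappedKey, byte[] iv, byte[] ciphertext) {
            this.wrappedKey = wrappedKey;
            this.iv = iv;
            this.ciphertext = ciphertext;
        }
    }

    public static Envelope encrypt(byte[] plaintext, MasterKeyClient master) throws Exception {
        KeyGenerator generator = KeyGenerator.getInstance("AES");
        generator.init(256);
        SecretKey dataKey = generator.generateKey();

        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, dataKey, new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal(plaintext);

        // only the data key travels to the HSM; the bulk data never does
        return new Envelope(master.wrap(dataKey.getEncoded()), iv, ciphertext);
    }
}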

Note that the encryption and decryption here happen at the application level – so the encryption can be applied not simply “per storage” or “per database”, but also per column – usernames don’t have to be kept so secret, but the associated personal data (in the next 3 database columns) should be. So you can plug the encryption mechanism into your pre-persist (and decryption – into your post-load) hooks. If speed is an issue, i.e. you don’t want to do the decryption in real time, you may have a (distributed) cache of decrypted data that you refresh with a background job.
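With JPA, for example, the pre-persist/post-load hook can be as simple as an attribute converter – a minimal sketch, where ColumnCrypto is a hypothetical helper wrapping the envelope encryption above:

import javax.persistence.AttributeConverter;
import javax.persistence.Converter;

// Per-column application-level encryption via a JPA converter, which runs
// transparently on persist (encrypt) and load (decrypt).
// ColumnCrypto is a hypothetical helper around the envelope encryption sketch above.
@Converter
public class EncryptedStringConverter implements AttributeConverter<String, String> {

    @Override
    public String convertToDatabaseColumn(String plaintext) {
        return plaintext == null ? null : ColumnCrypto.encryptToBase64(plaintext);
    }

    @Override
    public String convertToEntityAttribute(String stored) {
        return stored == null ? null : ColumnCrypto.decryptFromBase64(stored);
    }
}

It is then applied with @Convert(converter = EncryptedStringConverter.class) on exactly the fields that need protection, and nothing else in the application has to change.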

But if your application has to know the data, an attacker that gains full control for an unlimited amount of time will also have the full data eventually. No amount of enveloping and layers of encryption can stop that; it can only make it harder and slower to obtain the dump (even if the master key, stored on the HSM, is never extracted, the attacker will have an interface to that key and can use it to decrypt the data). That’s why intrusion detection is key. All of the above steps, combined with early notification of an intrusion, can mean your data stays protected.

As we are all well aware, there is never a 100% secure system. Our job is to make it nearly impossible for bulk data extraction. And that includes proper key management, proper system-level and application-level handling of encryption and proper monitoring and intrusion detection.


A Case For Native Smart Card Support in Browsers

February 22, 2017

A smart card is a device that holds a private key securely without letting it out of its storage. The chip on your credit card is a “smart card” (yup, terminology is ambiguous – the card and the chip are interchangeably called “smart card”). There are smaller USB-pluggable hardware readers that only hold the chip (without an actual card – e.g. this one).

But what’s the use? This w3c workshop from several years ago outlines some of them: multi-factor authentication, state-accepted electronic identification, digital signatures. All these are part of a bigger picture – that using the internet is now the main means of communication. We are moving most of our real-world activities online, so having a way to identify who we are online (e.g. to a government, to a bank), or being able to sign documents online (with legal value) is crucial.

That’s why the EU introduced the eIDAS regulation, which defines (among other things) electronic identification and digital signatures. The framework laid out there is aimed at having legally binding electronic communication, which is important in so many cases. Have you ever done the print-sign-scan exercise? Has your e-banking been accessed by an unauthorized person? Well, the regulation is supposed to fix these and many more issues.

Two-factor authentication is another, broader concept, which has tons of sub-optimal solutions: OTP tokens, Google Authenticator, SMS code confirmation. All of these have issues (e.g. clock syncing, SMS interception, cost). There are hardware tokens like YubiKey, but they offer only a subset of the features a smart card does.

But it’s not just about legally recognized actions online and two-factor authentication. It opens up other possibilities, like a more secure online credit card payment – e.g. you put your card in a reader and type your PIN, rather than entering the card number, CVC, date, names, 3-D Secure password and whatnot.

With this long introduction I got to the problem: browsers don’t support smart cards. In the EU, where electronic signatures are legally recognized, there is always the struggle of making them work with browsers. The solution so far: Java applets. A Java applet can interact with the smart card through the java crypto APIs, and thus provide signing features. However, with the deprecation of Java applets this era of constant struggle will end soon (and it is a struggle – having to click at least 2 confirmations and keep your java up to date, which even for developers is a hassle). There used to be a way to do it a few years ago in Firefox and IE, using window.crypto and CAPICOM APIs, but these got deprecated.

Recently the trend has been to use a “cloud-based” approach, where the keys reside on an HSM. That’s of course useful, but the problem with identification remains – getting access to your keys on the HSM requires, again, two factor authentication. Having the hardware token “in your hands” is what adds the security.

Smart people in Estonia (which has the most digital government in the world) had a better solution than Java or an HSM – browser plugins that allow interaction with their ID card (which is/has a smart card). The solution is here and here. This has worked pretty well – you install the plugins once (with an all-in-one installer) and you can sign documents with javascript. You also get the proper PKCS libraries installed, and the root certificates needed to allow TLS 1.2 authentication with the hardware token (identification and authentication vs signing). The small downside of this approach is that it is somewhat fragile and dependent on browser whims – the plugins have to be upgraded constantly and are at risk of being completely broken if some browser decides to deprecate some plugin APIs.

Another approach is the “local service” approach, which has two flavours. One is: you install a local application that exposes an HTTP interface, and using javascript and proper same-origin configuration you send the files needed for signing to the service, then get the result as an HTTP response, which you can, again using javascript, append to the page that requested the signing. The downside here – getting a service installed to listen on a given port without administrator rights. The other approach is having an application hooked to a custom protocol (e.g. signature://). Whenever the page wants the user to sign something, it opens signature://path-to-document-to-sign, which is intercepted by the locally installed application, digital signing is performed, and the result is pushed to a (one-time) URL specified in the metadata of the document to sign. Something like that is implemented by 4identity.eu and it actually works.

Now, signing is one thing, identification (TLS client auth) is another. Allegedly, things should work there – PKCS#11 is a standard that should allow TLS client auth to happen with a smart card. The reality is – it doesn’t. You often need a vendor-specific PKCS#11 library. OpenSC, a cool tool that works with many smart cards, only works with Firefox and Safari. Charismatics commercial is a piece of software that is supposed to work with all smart cards out there – well, it doesn’t always.

And the problem here is the smart card vendors. The need for OpenSC and Charismatics arises because, even though there are a few PKCS standards, smart cards are a complete mess. Not only is it a mess, it’s a closed, secretive mess. APDUs (the commands you send to the smart card in order to communicate with it) are in most cases secret. You don’t get to know them even if you purchase tens of thousands of cards – you only get custom vendor software that knows them. Then you have to reverse-engineer them to know how to actually talk to the cards. And they differ not only across vendors, but also across card models of the same vendor. For that reason the Estonian approach was a bit simpler to implement – they had only one type of smart card, given to all citizens, and they were mostly in control. In other countries it’s a … mess. At least a dozen different types of cards have to be supported.

So my first request is to smart card vendors (which are not that many) – please, please fix your mess. Get rid of that extra bit of “security through obscurity” to allow browsers to communicate with you without extra shenanigans.

My second request is to browser vendors – please do support smart card crypto natively. Unfortunately, due to the smart card mess above (among other things), hardware crypto has explicitly been excluded from the Web Crypto API. As a follow-up to that, there’s the Hardware Security working group, but afaik it’s still “work in progress”, and my feeling is there isn’t much there yet. In the W3C it’s important that browser vendors agree to implement something before it becomes a standard, and I’ve heard that some are opposing the smart card integration. Due to the aforementioned mess, I guess.

You may say – standardization will fix this. Well, it hasn’t so far. The EU officials are aware of the problem, and that the eIDAS regulation may be thwarted by these technical issues, but they are powerless, as the EU is not a standardization body.

So it all comes down to having a joint effort between browser and smart card vendors to fix this thing once and for all. So, please do that in order to enable a more secure and legally-compliant web.


Computer Science Concepts That Non-Technical People Should Know

February 12, 2017

Sometimes it happens that people speak different languages, even when speaking the same language. People have their own professional specifics: a biologist may see the world through the way a cell works, a cosmologist may see relationships between people as attraction between planets. And as with languages, different professional experiences give you a useful way of conceptualizing the world. I think everyone would find some CS concepts useful, so I’ll try to list some of the concepts that I’ve found many people don’t consider “native”. And yes, they are not strictly “computer science” – they fall into various related fields like information science, systems design, etc.

  • Primary keys – the fact that every “entity” should have a unique identifier, so that you can refer to it unambiguously. Whether it’s a UUID or an auto-increment, or a number/string derived from a special set of rules, doesn’t matter. And you may say it’s obvious, but it isn’t – I’ve seen tons of spreadsheets and registers where entities don’t have a unique identifier. Unique identifiers are useful for retrieval – if I have a driving license number, when I fill in an insurance form, should I copy all the details from the driving license (name, address, age), or just that single number, with the insurer getting the rest from a driving license database?
  • Foreign keys (+ integrity violations and cascades) – the idea for this post came to me after I had a discussion with a bank clerk who insisted that the fact that my bank account is being deleted doesn’t mean that my virtual PoS terminal will also be deleted. To her they seemed unlinked, although the terminal is linked (via a foreign key) to the bank account. One should be able to quickly imagine links between data points and how dependent data can either block the removal of the parent, or be deleted with it.
  • “Fingerprint” (hash) – the fact that every entity can be transformed into a short, unique identity representation. This has some niche applications outside of the technical realm, but one should be familiar with the concept of transforming a large set of data points to a single representation for the sake of comparison.
  • Derived data – the fact that you don’t have to store data that can be derived from a primary dataset with well defined transformations. E.g. sums. I had to explain this over and over again when I was talking about open data – and ultimately, sums were there anyway. The key here is that derived data is less important than the algorithm used to derive it.
  • Single point of failure – this becomes obvious the second you spell it out, but people tend to ignore it when designing real-world processes. How many times have you been stuck in a queue thinking “this is obviously a single point of failure, why didn’t anyone think of preventing these delays”? It’s not always simple (even in software, not to mention the real world), of course, but the concept should be acknowledged.
  • Protocols and interfaces – i.e. the definition of how two systems communicate without knowing each others’ internal functioning. “Protocol” in the non-technical world means “the way things are usually done”. An interface is usually thought of as a GUI. But the concept of the definition of how two systems communicate seems kind of alien. Where “systems” might be “organizations”, a customer and a sales rep, two departments, etc. The definition of the point of their interaction I think helps to conceptualize these relations better.
  • Version control – the fact that content is a continuous stream of changes on top of an original entity. Each entity is the product of its original “version” plus all the changes applied to it. This doesn’t just hold for text, it holds for everything out there. And if we imagine it so, it helps us… “debug” things, even humans, I guess.

The list is by no means exhaustive. I’d welcome any additions to it.

And now I know non-technical readers will say “But I understand and know these things”. I know you do. But my point is that they are not an intrinsic part of your day-to-day problem-solving apparatus. I may be wrong, but my observations hint in that direction.


Why I Chose to Be a Government Advisor

January 22, 2017

A year and a half ago I agreed to become an advisor in the cabinet of the deputy prime minister of my country (Bulgaria). It might have looked like a bizarre career move, given that at the time I was a well positioned and well paid contractor (software engineer), working with modern technologies (Scala, Riak, AWS) at scale (millions of users). I continued on that project part-time for a little while, then switched to another one (again part-time), but most of my attention and time were dedicated to the advisory role.

Since mid-December I’m no longer holding the advisory position (the prime minister resigned), but I wanted to look back, reflect and explain (to myself mainly) why that was a good idea and how it worked out.

First, I deliberately continued as a part-time software engineer, to avoid the risk of forgetting my (to that point) most marketable skills – building software. But not only that – sometimes you become tired of political and administrative bullshit and just want to sit down and write some code. But the rest of my time, including my “hobby”/spare time, was occupied by meetings, research, thoughts and document drafting that aimed at improving the electronic governance in Bulgaria.

I’ve already shared what my agenda was and what I was doing, and even gave a talk about our progress – opening as much data as possible, making sure we have high requirements for government software (we prepared a technical specification template for government procurement so that each administration relies on that, rather than on contractors with questionable interests), introducing electronic identification (by preparing both legislation and technical specification), and making myself generally available to anyone who wants advice on IT stuff.

The big news that we produced was the introduction of a legal requirement for making all custom-built government software open-source. It echoed in the tech community, including some big outlets like TechCrunch, ZDnet and Motherboard.

I was basically a co-CTO (together with a colleague) of the government, which is pretty amazing – having the bigger picture AND the intricate details in our heads, all across government. We were chaotic, distracted and overwhelmed, but still managed to get both the legislation and the important projects written and approved. I acquired some communication and time-management skills, researched diverse technological topics (from biometric identification to e-voting), I learnt that writing laws is very similar to programming, and I had the pleasure of working in a wonderful team.

There were downsides, of course. The financial one aside, the most significant problem was the inevitable “deskilling”. I had no time to try cool new stuff from the technology world – machine learning, deep learning, neural networks, the blockchain, Kafka, Docker, Kubernetes, etc. I’m not saying these are crucial technologies (I have tried Docker a couple of times and I wouldn’t say I love it), but one has to stay up to date – this is the best part (or the worst part?) of our profession. I missed (and still do) spending a few evenings in a row digging into some new and/or obscure technology, or even contributing to it. My GitHub contributions were limited to small government-oriented tools.

Another downside is that I’m pretty sure some people hate me now. For raising the bar for government software so high and for my sometimes overly zealous digitizing efforts. I guess that’s life, though.

Having stated the pros and cons, let me get back to the original question – why did I choose the position in the first place. For a long time I’ve been interested in what happens to programmers when they get “bored” – when the technical challenges alone are no longer enough to fulfil one’s potential, or one’s potential for impact. A few years ago I didn’t think of the “government” option – I thought one could either continue solving problems for an employer, or become an employer themselves. And I still think the latter is a very good option, if you have the right idea.

But I didn’t have that idea that I really liked and wanted to go through with. At the same time I had ideas of how the government could run better. Which would, in my case, impact around 8 million people. It would also mean I can get things done in a complicated context. I started with three objectives – electronic identity card, open source and open data (the latter two we have pushed for with a group of volunteer developers). All of them are now fact. Well, mostly because of the deputy prime minister, and thanks to the rest of the team, but still. That job would also mean I had to get out of my comfort zone of all-day-programming agile-following software-delivering routine.

And that’s why I did it. With all the risks of deskilling, of being disliked, or even being a pawn in a political game. Selfish as it may sound, I think having impact, getting things done and doing something different are pretty good motivators. And are worth the paycheck cut.

I still remain a software engineer and will continue to create software. But thanks to this experience, I think I’ll be able to see a broader landscape around software.


Forget ISO-8859-1

January 16, 2017

UTF-8 was first presented in 1993. One would assume that 24 years is enough time for it to become ubiquitous, especially given that the Internet is global. ASCII doesn’t even cover French letters, not to mention Cyrillic or Devanagari (the Hindi script). That’s why ASCII was replaced by ISO-8859-1, which kind of covers most western languages’ orthographies.

88.3% of the websites use UTF-8. That’s not enough, but let’s assume these 11.7% do not accept any input and are just English-language static websites. The problem of the still-pending adoption of UTF-8 is how entrenched ASCII/ISO-8859-1 is. I’ll try to give a few examples:

  • UTF-8 isn’t the default encoding in many core Java classes. FileReader, for example. It’s similar for other languages and runtimes. The default encoding of these Java classes is the JVM default, which is most often ISO-8859-1. It is allegedly taken from the OS, but I don’t remember configuring any encoding on my OS. Just locale, which is substantially different.
  • Many frameworks, tools and containers don’t use UTF-8 by default (and don’t try to remedy the JVM not using UTF-8 by default). Tomcat’s default URL encoding, I think, is still ISO-8859-1. Eclipse doesn’t make files UTF-8 by default (on my machine it’s sometimes even windows-1251 (Cyrillic), which is horrendous). And so on. I’ve asked for UTF-8 as the default in the past, and I repeat my call.
  • Regex examples and tutorials always give you the [a-zA-Z0-9]+ regex to “validate alphanumeric input”. It is built into many validation frameworks. And it is so utterly wrong. This is a regex that must never appear anywhere in your code, unless you have a pretty good explanation. Yet, the example is ubiquitous. Instead, the right regex is [\p{Alpha}0-9]+ (in Java that requires the UNICODE_CHARACTER_CLASS flag, so that \p{Alpha} covers non-Latin letters; see the sketch after this list). Using the wrong regex means you won’t be able to accept any non-ASCII letters. Which is something you practically never want. Unless, probably, due to the next problem.
  • Browsers have issues with UTF-8 URLs. Why? It’s complicated. And it almost works when it’s not part of the domain name. Almost, because when you copy the URL, it gets screwed (pardon me – encoded).
  • Microsoft Excel doesn’t work properly with UTF-8 in CSV. I was baffled to realize that UTF-8 CSVs become garbage. Well, not if you have a BOM (byte order mark), but come on, it’s [the current year].
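To make the first and third points above concrete, here is a minimal Java sketch – reading a file with an explicit UTF-8 charset instead of relying on the JVM default, and a Unicode-aware version of the “alphanumeric” check:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

public class Utf8Defaults {

    // UNICODE_CHARACTER_CLASS makes \p{Alpha} cover non-Latin letters as well
    private static final Pattern ALPHANUMERIC =
            Pattern.compile("[\\p{Alpha}0-9]+", Pattern.UNICODE_CHARACTER_CLASS);

    public static boolean isAlphanumeric(String input) {
        return ALPHANUMERIC.matcher(input).matches();
    }

    public static BufferedReader openUtf8(String path) throws Exception {
        // new FileReader(path) would silently use the platform default encoding
        return new BufferedReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8));
    }
}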

As Jon Skeet rightly points out – we have issues with the most basic data types – strings, numbers and dates. This is partly because the real world is complex. And partly because we software engineers tend to oversimplify it. This is what we’ve done with ASCII and other latin-only encodings. But let’s forget ASCII and ISO-8859-1. It’s not even okay to call them “legacy” after 24 years of UTF-8. After 24 years they should’ve died.

Let’s not give regex examples that don’t work with UTF-8, let’s not assume any default different than UTF-8 is a good idea and let’s sort the URL mess.

Maybe I sound dogmatic. Maybe I exaggerate because my native script is non-latin. But if we want our software to be global (and we want that, in order to have a bigger market), then we have to sort our basic encoding issues. Having UTF-8 as a standard is not enough. Let’s forget ISO-8859-1.


Anemic Objects Are OK

December 25, 2016

I thought for a while that object-oriented purism has died off. But it hasn’t – every now and then there’s an article that tries to tell us how evil setters and getters are, how bad (Java) annotations are, and how horrible and anti-object-oriented the anemic data model is (when functionality-only services act upon data-only objects) and eventually how dependency injection is ruining software.

Several years ago I tried to counter these arguments, saying that setters and getters are not evil per se, and that the anemic data model is mostly fine, but I believe I was worse at writing then, so maybe I didn’t get to the core of the problem.

This summer we had a short twitter discussion with Yegor Bugayenko and Vlad Mihalcea on the matter and a few arguments surfaced. I’ll try to summarize them:

  • Our metaphors are often wrong. An actual book doesn’t know how to print itself. Its full contents are given to a printer, which knows how to print a book. Therefore it doesn’t make sense to put logic for printing (to JSON/XML), or persisting to a database in the Book class. It belongs elsewhere.
  • The advice to use (embedded) printers instead of getters is impractical even if a Book should know how to print itself – how do you transform your objects to other formats (JSON, XML, database rows, etc.)? With Jackson/JAXB/an ORM you simply add a few annotations, if any at all, and it works. With “printers” you have to manually implement the serialization logic. Even with Xembly you still have to write a tedious, potentially huge method with add()’s and up()’s. And when you add or remove a field, change a field definition, or add a new serialization format, it gets way more tedious to support. Another approach mentioned in the twitter thread is having separate subclasses for each format/database. And an example can be seen here. I really don’t find that easy to read or support. And even if that’s adopted in a project I’m working on, I’d be the first to replace that manual adding with reflection, however impure that may be. (Even Uncle Bob’s Fitnesse project has getters or even public fields where that makes sense in terms of the state space)
  • Having too much logic/behaviour in an objects may be seen as breaking the Single responsibility principle. In fact, this article argues that the anemic approach is actually SOLID, unlike the rich business object approach. The SRP may actually be understood in multiple ways, but I’ll get to that below.
  • Dependency injection containers are fine. The blunt example of how the code looks without them is here. No amount of theoretical object-oriented programming talk can make me write that piece of code. I guess one can get used to it, but (excuse my appeal-to-emotion fallacy here) – it feels bad. And when you consider the case of dependency injection containers – whether you invoke a constructor from a main method, or your main initializes a context that performs automatic constructor (or setter) injection, makes no real difference – your objects are still composed of their dependencies, and their dependencies are set externally. Except the latter is more practical, and after a few weeks of nested instantiation you’ll feel inclined to write your own semi-automated mechanism to do it.

But these are all arguments derived from a common root – encapsulation. Your side in the above arguments depends on how you view and understand encapsulation. I see the purpose of encapsulation as a way to protect the state space of a class – an object of a given class is only valid if it satisfies certain conditions. If you expose the data via getters and setters, then the state space constraints are violated – everyone can invalidate your object. For example, if you were able to set the size of an ArrayList without adding the corresponding element to the backing array, you’d break the behaviour of an ArrayList object – it will report its size inconsistently and the code that depends on the List contract would not always work.

But in practical terms encapsulation still allows for the distinction between “data objects” and “business objects”. The data object has no constraints on its state – any combination of the values of its fields is permitted. Or in some cases it does, but they are enforced outside the currently running program (e.g. via database constraints when persisting an object with an ORM). In these cases, where the state space is not constrained, encapsulation is useless. And forcing it upon your software blindly results in software that, I believe, is harder to maintain and extend. Or at least you gain nothing – testing isn’t easier that way (you can have a perfectly well tested anemic piece of software), deployment is not impacted, and tracing problems doesn’t seem much different.
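To illustrate the distinction with a trivial (and entirely hypothetical) example – the first class has a state-space constraint worth protecting with encapsulation, the second one doesn’t:

// A "business object": its state space is constrained (the balance must never go
// negative), so encapsulation is what keeps every instance valid.
public class Account {
    private long balance;

    public void withdraw(long amount) {
        if (amount <= 0 || amount > balance) {
            throw new IllegalArgumentException("Invalid withdrawal: " + amount);
        }
        balance -= amount;
    }

    public long getBalance() {
        return balance;
    }
}

// A "data object": any combination of field values is permitted (or validated
// elsewhere, e.g. by database constraints), so getters and setters give up nothing.
class CustomerData {
    private String name;
    private String email;

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public String getEmail() { return email; }
    public void setEmail(String email) { this.email = email; }
}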

I’m even perfectly fine with getting rid of the getters and exposing the state directly (as the aforementioned LogData class from Fitnesse).

And most often, in business applications, websites, and the general type of software out there, most objects don’t need to enforce any constraints on their state. Because their state is just data, used somewhere else, in whatever ways the business needs it to be used. The data itself is irrelevant at that point. It will become relevant at a later stage – when it’s fetched via a web service (after it’s serialized to JSON), or after it’s fetched from the database by another part of the application or a completely different system. And that’s important – encapsulation cannot be enforced across several systems that all work with a piece of data.

In the whole debate I haven’t seen a practical argument against the getter/setter/anemic style. The only thing I see is “it’s not OOP” and “it breaks encapsulation”. Well, I think it should be settled by now that encapsulation should not always be there. It should be there only when you need it – to protect the state of your object from interference.

So, don’t feel bad to continue with your ORMs, DI frameworks and automatic JSON and XML serializers.


Amend Your Contract To Allow For Side Projects

December 14, 2016

The other day Joel Spolsky blogged a wonderful overview of the copyright issues between software companies and their employees. The bottom line is: most companies have an explicit clause in their contracts which states that all intellectual property created by a developer is owned by the employer. This is needed, because the default (in many countries, including mine) is that the creator owns the copyright, regardless of whether they were hired to do it or not.

That in turn means that any side project, or in fact any intellectual property that you create while being employed as a developer, is automatically owned by your employer. This isn’t necessarily too bad, as most employers wouldn’t enforce their right, but this has bugged me ever since I started working for software companies. Even though I didn’t know the legal framework of copyright, the ownership clause in my contracts was always something that I felt was wrong. Even though Joel’s explanation makes perfect sense – companies need to protect their products from a random developer suddenly deciding they own the rights to parts of it – I’ve always thought there’s a middle ground.

(Note: there is a difference between copyright, patents and trademarks, and the umbrella term “intellectual property” is kind of ambiguous. I may end up using it sloppily, so for a clarification, read here.)

California apparently tried to fix this by passing the following law:

Anything you do on your own time, with your own equipment, that is not related to your employer’s line of work is yours, even if the contract you signed says otherwise.

But this doesn’t work, as basically everything computer-related is “in your employer’s line of work”, or at least can be, depending on the judge.

So let’s start with the premise that each developer should have the right to create side projects and profit from them, potentially even pursue them as their business. Let’s also have in mind that a developer is not a person, whose only ideas and intellectual products are “source code” or “software design(s)”. On the other hand the employer must be 100% sure that no developer can claim any rights on parts of the employer’s product. There should be a way to phrase a contract in a way that it reflects these realities.

And I have always done that – whenever offered a contract, I’ve stated that:

  1. I have side-projects, some of which are commercial, and it will make no sense for the employer to own them
  2. I usually create non-computer intellectual property – I write poetry, short stories, and linguistics problems

And I’ve demanded that the contract be reworded. It’s a bargain, after all, not one side imposing its “standard contract” on the other. So far no company has objected too much (there was some back-and-forth with the lawyers, but that’s it – companies decided it was better for them to hire a person they’d assessed positively than to stick to an absolute contract clause).

That way we’ve ended up with the following wording, which I think is better than the California law, protects the employer, and also accounts for side projects and poetry/stories/etc.

Products resulting from the Employee’s duties according to the terms of employment – individual or joint, including ideas, software development, inventions, improvements, formulas, designs, modifications, trademarks and any other type of intellectual property are the exclusive property of Employer, no matter if patentable.

Not sure if it is properly translated, but the first part is key – if the idea/code/invention/whatever is a result of an assignment that I got, or a product I am working on for the employer, then it is within the terms of employment. And it is way less ambiguous than “the employer’s line of work”. Anything outside of that, is of course, mine.

I can’t vouch for the legal soundness of the above, as I’m not a legal professional, but I strongly suggest negotiating such a clause in your contracts. It could be reworded in other ways, e.g. “work within the terms of employment”, but the overall idea is obvious. And if you end up in court (which would probably almost never happen, but contracts are there to arrange edge cases), then even if the clause is not perfect, the judge/jury will be able to see its intent clearly.

And here I won’t agree with Joel – if you want to do something independent, you don’t have to be working for yourself. Side projects, of which I’ve always been a proponent, (and other intellectual products) are not about the risk-taking entrepreneurship – they are about intellectual development. They are a way to improve and expand your abilities. And it is a matter of principle that you own them. In rare cases they may be the basis of your actual entrepreneurial attempt.

Amending a standard contract is best done before you sign it. If you’ve already signed it, it’s still possible to add an annex, but less likely. So my suggestion is, before you start a job, use your “bargaining power” to secure your intellectual property rights.


Progress in Electronic Governance [talk]

December 2, 2016

I’ve been an advisor to the deputy prime minister of Bulgaria for the past year and a half, and at this year’s OpenFest conference I tried to report on what we’ve achieved. It is not that much and there are no visible results yet, which is a bit disappointing, but we (a small motivated team) believe we have laid the groundwork for a more open and properly built ecosystem for government IT systems.

Just to list a few things – we passed a law that requires open-sourcing custom-built government software, we opened a lot of data (1500 datasets) on the national open data portal, and we drew a roadmap of how existing state registers and databases should be upgraded in order to meet modern software engineering best practices and be ready to handle a high load of requests. We also seriously considered the privacy and auditability of the whole ecosystem. We prepared the electronic identification project (each citizen having the option to identify online with a secure token), an e-voting pilot and so on.

The video of the talk is available here:

And here are the slides:

Now that our term is at an end (due to the resignation of the government) we hope the openness-by-default will persist as a policy and the new government agency that we constituted would be able to push the agenda that has been laid out. Whether that will be the case in a complex political situation is hard to tell, but hopefully the “technical” and the “political” aspects won’t be entwined in a negative way. And our team will continue to support with advice (even though from “the outside”) whoever wishes to build a proper and open e-government ecosystem.
