Short DNS Record TTL And Centralization Are Serious Risks For The Internet

October 22, 2016

Yesterday Dyn, a DNS provider, went down after a massive DDoS attack. That left many popular websites inaccessible, including Twitter, LinkedIn, eBay and others. The internet seemed to be “crawling on its knees”.

We’ll probably read an interesting post-mortem from Dyn, but why did this happen? First, DDoS capacity is increasing, fueled by insecure, infected IoT devices connected to the internet. Huge volumes of fake requests are poured onto a given server or set of servers until they become inaccessible, either because they cannot cope with the requests, or simply because the network path to the server doesn’t have enough throughput to accommodate them all.

But why did “the internet” stop because a single DNS provider was under attack? First, because of centralization. The internet is supposed to be decentralized (although I’ve argued that, precisely because of DNS, it is pseudo-decentralized). But services like Dyn, UltraDNS, Amazon Route 53, and also Akamai and CloudFlare, centralize DNS. I can’t tell exactly to what extent, but out of the top 500 websites (according to the ranking I used), 181 use one of the above five services as their DNS provider. Add 25 Google services that use their own, and you get nearly 200 out of 500 concentrated in just six entities.

But centralization of the authoritative nameservers alone would not have led to yesterday’s problem. A big part of the problem, I think, is the TTL (time to live) of the DNS records – the records that contain the mapping between a domain name and its IP address(es). The idea is that you should not always hit the authoritative nameserver (Dyn’s servers in this case) – you should hit it only if there is no cached entry anywhere along the path of your request. Your operating system may have a cache, but more importantly, your ISP has a cache. So when all the subscribers of one ISP make requests to Twitter, those requests should not go to the nameserver, but should instead be resolved by looking them up in the ISP’s cache.

If that was the case, regardless of whether Dyn was down, most users would be able to access all services, because they would have their IPs cached and resolved. And that’s the proper distributed mode that the internet should function in.

However, it has become common practice to set a very short TTL on DNS records – just a few minutes. So after those few minutes expire, your browser has to ask the nameserver “what IP should I connect to in order to access this domain?”. That’s why the attack was so successful – because nothing was cached and everyone repeatedly turned to Dyn to get the IP corresponding to the requested domain.

That practice is highly questionable, to say the least. This article explains the issues of short TTLs in detail, but let me quote some important bits:

The lower the TTL the more frequently the DNS is accessed. If not careful DNS reliability may become more important than the reliability of, say, the corporate web server.

The increasing use of very low TTLs (sub one minute) is extremely misguided if not fundamentally flawed. The most charitable explanation for the trend to lower TTL value may be to try and create a dynamic load-balancer or a fast fail-over strategy. More likely the effect will be to break the nameserver through increased load.

So we knew the risks. And it was inevitable that this problematic practice would be abused. I decided to analyze how big the problem actually is. So I took the aforementioned top 500 websites as representative, fetched their A, AAAA (IPv6), CNAME and NS records, and put them into a table. You can find the code in this gist (it uses the dnsjava library).
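
For illustration, here is a minimal sketch of how such a lookup could be done with dnsjava (this is not the exact code from the gist; the domain name is just an example):

import org.xbill.DNS.Lookup;
import org.xbill.DNS.Record;
import org.xbill.DNS.Type;

public class TtlFetcher {
    public static void main(String[] args) throws Exception {
        // resolve the A records of a domain and print the TTL of each record
        Record[] records = new Lookup("example.com", Type.A).run();
        if (records != null) {
            for (Record record : records) {
                System.out.println(record.getName() + " -> TTL " + record.getTTL());
            }
        }
    }
}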

The resulting CSV can be seen here. And if you want to play with it in Excel, here is the Excel file.

Some other things that I collected: how many websites have AAAA (IPv6) records (only 79 out of 500), whether the TTLs between IPv4 and IPv6 differ (they do for 4), who the DNS provider is (which is how I got the figures mentioned above), taken from the NS records, and how many use CNAME instead of A records (just a few). I also collected the number of A/AAAA records, in order to see how many (potentially) utilize round-robin DNS (187) (worth mentioning: the A records served to me may differ from those served to other users, which is also a way to do load balancing).

The results are a bit scary. The average TTL is only around 7600 seconds (2 hours and 6 minutes). But it gets worse when you look at the lower half (sort the values by TTL and take the lowest 250): the average there is just 215 seconds. This means the DNS servers are hit constantly, which turns them into a real single point of failure, and “the internet goes down” after just a few minutes of DDoS.

Just a few websites have a high TTL, as can be seen from this simple chart (all 500 sites are on the X axis, the TTL is on the Y axis):


What are the benefits of a short TTL? Not many, actually. You have the flexibility to change your IP address, but you don’t do that very often, and besides – it doesn’t automatically mean all users will be pointed to the new IP, as some ISPs, routers and operating systems may ignore the TTL value and keep the cache alive for longer periods. You could do round-robin DNS, which is basically using the DNS provider as a load balancer, which sounds wrong in most cases. It can be used for geolocation routing – serving a different IP depending on the geographical area of the request – but that doesn’t necessarily require a low TTL: if caching happens closer to the user than to the authoritative DNS server, then they will be pointed to the nearest IP anyway, regardless of whether that value gets refreshed often or not.

A short TTL is very useful for internal infrastructure – when pointing to your internal components (e.g. a message queue, or a particular service if you are using microservices), low TTLs may be the better choice. But that’s not the case for your main domain being accessed from the internet.

Overlay networks like BitTorrent and Bitcoin use DNS round-robin for seeding new clients with a list of peers they can connect to (your first use of a torrent client connects you to one of several domains that each point to a number of nodes that are supposed to be always on). But that’s again a rare use case.

Overall, I think most services should go for higher TTLs. 24 hours is not too much, and you will need to keep your old IP serving requests for 24 hours after a change anyway, because of caches that ignore the TTL value. That way your service won’t suffer if the authoritative nameserver is down. And that would in turn mean that DNS providers would be a less interesting target for attacks.

And I understand the flexibility that Dyn and Route53 give us. But maybe we should think of a more distributed way to gain that flexibility. Because yesterday’s attack may be just the beginning.


The Broken Scientific Publishing Model and My Attempt to Improve It

October 12, 2016

I’ll begin this post with a rant about the state of scientific publishing, then review the technology “disruption” landscape and offer a partial improvement that I developed (source).

Scientific publishing is quite important – all of science is based on previously confirmed “science”, so knowing what the rest of the scientific community has done or is doing is essential to research. It allows scientists to “stand on the shoulders of giants”.

The web was basically invented to improve the sharing of scientific information – it was created at CERN and allowed linking from one (research) document to others.

However, scientific publishing at the moment is one of the few industries that haven’t benefited from the web. Well, the industry has – the community hasn’t, at least not as much as one would like.

Elsevier, Thomson-Reuters (which recently sold its intellectual property business), Springer and other publishers make huge profits (e.g. a 39% margin on 2 billion in revenue) for doing something that should basically be free in this century – spreading the knowledge that scientists have created. You can see here some facts about their operation, the most striking being that each university has to pay more than a million dollars to get the literature it needs.

That’s because they rely on a centuries-old process: submission to journals, acceptance of the submission, then printing and distribution to university libraries. Recently publishers have put publications online, but these are behind paywalls or accessible only after huge subscription fees have been paid.

I’m not a “raging socialist” but sadly, publishers don’t provide (sufficient) value. They simply gather the work of scientists that is already funded by public money, sometimes get the copyright on that, and disseminate it in a pre-Internet way.

They also do not pay for peer review of the submitted publications, they simply “organize it” – which often means “a friend of the editor is a professor and he made his postdocs write peer reviews”. Peer review is thus itself broken, as it is non-transparent and often of questionable quality. The funny side of the peer review process is captured at “shitsmyreviewerssay”.

Oh, and of course authors themselves have to write their publication in a journal-preferred template (and each journal has its own preferences). So the only actual work the journals do is typesetting and editorial filtering.

So, we have expensive scientific literature with no added value and a broken peer review system.

And at that point you may argue that if they do not add value, they can be easily replaced. Well, no. Because of the Impact Factor – the metric for determining the most cited journals, and by extension – the reputation of the authors who manage to get published in these journals. The impact factor is calculated based on a big database (Web of Science) and assigns a number to each journal. The higher the impact factor of a journal, the better the career opportunities for a scientist who manages to get accepted for publication in it.

You may think that the impact factor is objective – well, it isn’t. It is based on data that only publishers (Thomson-Reuters in particular) have and when others tried to reproduce the impact factor, it was nearly 40% off (citation needed, but I lost the link). Not only that, but it’s an impact factor of the journal, not the scientists themselves.

So the fact that publishers are the judge, jury and executioner means they can make huge profits without adding much value (and yes, they allow searching through the entire collection they have, but full-text search over a corpus of text isn’t exactly rocket science these days). That means scientists don’t have access to everything they may need, and that poorer universities won’t be able to keep up. Not to mention individual researchers, who are simply left out. In general, science suffers from the inefficient sharing and assessment of research.

The situation is even worse, actually – due to the lack of incentive for publishers to change their process (among other things), as a popular journal editor once said – “much of the scientific literature, perhaps half, may simply be untrue”. So the fact that you are published in a somewhat impactful journal doesn’t mean your publication has been thoroughly reviewed, nor that the reviewers bear any responsibility for their oversights.

Many discussions have been held about why disruption hasn’t yet happened in this apparently broken field. And it’s most likely because of the “chicken and egg problem” – scientists have an incentive to publish in journals because of the impact factor, and in that way the impact factor is reinforced as a reputation metric.

Then comes open access – a movement that requires scientific publications to be publicly accessible. There are multiple organizations/initiatives that support and promote open access, including EU’s OpenAIRE. Open access comes in two forms:

  • “green open access”, or “preprints” (yup, “print” is still an important word) – you just push your work to an online repository – it’s not checked by editors or reviewers, it just stays there.
  • “gold open access” – the author/library/institution pays a processing fee to have the publication published, and then it becomes public. Important journals that use this model include PLOS, F1000 and others.

“Gold open access” solves almost nothing, as it just shifts the fees (maybe it reduces them, but again – a processing fee to get something published online, really?). “Green open access” doesn’t give you the reputation benefits – preprint repos don’t have an impact factor. Despite that, it’s still good to have the copies available, which some projects (like OABOT and ArchiveLab) try to do.

Then there’s Google Scholar, which has agreements with publishers to aggregate their content and provide search results (not the full publications). It also provides some metrics on top of that, regarding citations. It forms a researcher profile based on that, which can actually be used as a replacement for the impact factor.

Because of that, many attempts have been made to either “revolutionize” scientific publishing, or augment it with additional services that have the potential to one day become prevalent and take over the process. I’ll try to summarize the various players:

  • preprint repositories – this is where scientists publish their works before submitting them to a journal. The major player is arXiv, but there are others as well (list, map)
  • scientific “social networks” – Academia.edu and ResearchGate offer a way to connect with fellow researchers and share your publications, thus having a public researcher profile. Scientists get analytics about the number of reads their publications get and notifications about new research they might be interested in. They are similar to preprint repos, as they try to get hold of a lot of publications.
  • services that try to completely replace the process of scientific publishing – they try to be THE service where you publish, get reviewed and get a “score”. These include SJS and The Winnower; possibly Academia.edu and ResearchGate also fit in this category, as they offer some way of getting feedback (and plan to have, or already have, peer review) and/or some score (the RG score).
  • tools to support researchers – Mendeley (a personal collection of publications), Authorea (a tool for collaboratively editing publications), Figshare (a place for sharing auxiliary materials like figures, datasets, source code, etc.), Zenodo (data repository), Publons (a system to collect everyone’s peer reviews), and Open Science Framework (sets of tools for researchers), Altmetric (tool to track the activity around research), ScholarPedia and OpenWetWare (wikis)
  • impact calculation services – in addition to the RG score, there’s ImpactFactory
  • scientist identity – each of the social networks tries to be “the profile page” of a scientist. Additionally, there are identifiers such as ORCID, ResearcherID, and a few others by individual publishers. Maybe fortunately, all are converging towards ORCID at the moment.
  • search engines – Google Scholar, Microsoft Academic, Science Direct (by Elsevier), Papers, PubPeer, CrossRef, PubMed, Base Search, CLOCKSS, an AI-based service for analyzing scientific texts, and of course Sci-Hub – most of which rely on contracts with publishers (with the exception of Sci-Hub)
  • journals with a more modern, web-based workflow – F1000Research, Cureus, Frontiers, PLoS

Most of these services are great and created with a real desire to improve the situation. But unfortunately, many have problems. ResearchGate has been accused of too much spamming, and its RG score is questionable; Academia.edu is accused of hosting too many fake accounts for the sake of making investors happy; Publons is meant to be a place where peer review is something you brag about, yet very few reviews are made public by the reviewers (which signifies a cultural problem). SJS and The Winnower have too few users, and the search engines are dependent on the publishers. Mendeley and others were acquired by the publishers, so they no longer pose a threat to the existing broken model.

Special attention has to be paid to Sci-Hub – the “illegal” place where you can get the knowledge you are looking for. Alexandra Elbakyan created Sci-Hub, which automatically collects publications through library and university networks, using credentials donated by researchers. That way all of the content is public and searchable by DOI (the digital identifier of an article, which by the way is also a broken concept, because in order to give your article an identifier, you need to pay for a “range”). So Sci-Hub seems like a good solution, but it doesn’t actually fix the underlying workflow. It has been sued and its original domain(s) taken down, so it’s something like The Pirate Bay for science – it takes effort and idealistic devotion to stay afloat.

The lawsuits against Sci-Hub, by the way, are an interesting thing – publishers want to sue someone for giving access to content that they themselves have taken for free from scientists. Sounds fair, and the publishers are totally not “evil”, right?

I have had discussions with many people, and read a lot of articles discussing the disruption of the publishing market (here, here, here, here, here, here, here). And even though some of the articles are from several years ago, the change isn’t yet here.

Approaches that are often discussed include the following, and I think neither of them is working:

  • have a single service that is a “mega-journal” – you submit, get reviewed, get searched, get listed in news sections about your area and/or sub-journals. “One service to rule them all”, i.e. a monopoly, is also not good in the long term, even if the intentions of its founders are good (initially)
  • have tools that augment the publishing process in the hope of getting more traction and thus gradually getting scientists to change their behaviour – I think the “augmenting” services start with the premise that the current system cannot be easily disrupted, so they should at least provide some improvement on it and ease of use for scientists.

On the plus side, it seems that some areas of research almost exclusively rely on preprints (green open access) now, so publishers have a diminishing influence. And occasionally someone boycotts them. But that process is very slow. That’s why I wanted to do something to help make it faster and better.

So I created a WordPress plugin (source). Yes, it’s that trivial. I started with a bigger project in mind and even worked on it for a while, but it was about to end up in the first category above – a “mega-journal” – and that has already been tried, hasn’t been particularly successful, and is risky in the long term (in terms of centralizing power).

Of course a WordPress plugin isn’t a new idea either. But all the attempts I’ve seen either haven’t been published, or provide just extras and tools, like reference management. My plugin has three important aspects:

  • JSON-LD – it provides semantic annotations for the scientific content, making it more easily discoverable and parseable
  • peer review – it provides a simple, post-publication peer review workflow (which is an overstatement for “comments with extra parameters”)
  • it can be deployed by anyone – both as the personal website of a scientist and as library/university-provided infrastructure for scientists. Basically, you can have a WordPress installation + the plugin, and get green open access + basic peer review for your institution. For free.

What is the benefit of the semantic part? I myself have argued that the semantic web won’t succeed anytime soon because of a chicken-and-egg problem – there is no incentive to “semanticize” your page, as there is no service to make use of it; and there are no services, because there are no semantic pages. And also, there’s a lot of complexity in making something “semantic” (RDF and related standards are anything but webmaster-friendly). There are niche cases, however. The Open Graph protocol, for example, makes a web page “shareable on Facebook”, so webmasters have an incentive to add these tags.

I will soon contact Google Scholar, Microsoft Academic and other search engines to convince them to index semantically-enabled, web-published research. The point is to have an incentive, just like with the Facebook example, to use the semantic options. I’ll also get in contact with ResearchGate/Academia/arXiv/etc. to suggest the inclusion of semantic annotations and/or JSON-LD.

The general idea is to have green open access with online post-publication peer review, which in turn lets services build profile pages and calculate (partial) impact scores, without reliance on the publishers. It has to be easy, and it has to include libraries as the main contributors – they have the “power” to change the status quo. And supporting a WordPress installation is quite easy – a library, for example, can set up one for all the researchers in the institution and let them publish there.

A few specifics of the plugin:

  • the name “scienation” comes from “science” and either “nation” or the “-ation” suffix.
  • it uses URLs as article identifiers (which is compatible with DOIs that can also be turned into URLs). There is an alternative identifier, which is the hash of the article (text-only) content – that way the identifier is permanent and doesn’t rely on one holding a given domain.
  • it uses ORCID as an identity provider (well, not fully, as the OAuth flow is not yet implemented – it requires a special registration which won’t be feasible). One has to enter his ORCID in a field and the system will assume it’s really him. This may be tricky and there may be attempts to publish a bad peer review on behalf of someone else.
  • the hierarchy of science branches is obtained from Wikipedia, combined with other small sources.
  • the JSON-LD properties in use are debatable (sample output). I’ve started a discussion on having additional, more appropriate properties in schema.org’s ScholarlyArticle. I’m aware of ScholarlyHTML (here, here and here – it’s a bit confusing which one is “correct”), the codemeta definitions and the scholarly article ontology. They are very good, but their purpose is different – to represent the internal details of a scientific work in a structured way. There is probably no need for that if the purpose is to make the content searchable and to annotate it with metadata like authors, id, peer reviews and citations. Still, I reuse the ScholarlyArticle standard definition and will gladly accept anything else that is suitable for the use case.
  • I got the domain (nothing to be seen there currently) and one can choose to add his website to a catalog that may be used in the future for easier discovery and indexing of semantically-enabled websites.

The plugin is open source, licensed under GPL (as is required by WordPress), and contributions, discussions and suggestions are more than welcome.

I’m well aware that a simple WordPress plugin won’t fix the debacle I’ve described in the first part of this article. But I think the right approach is to follow the principle of decentralization and to rely on libraries and individual researchers, rather than on (centralized) companies. The latter has so far proved inefficient and actually slows science down.


I Stopped Contributing To Stackoverflow, But It’s Not Declining

September 26, 2016

“The decline of Stackoverflow” is now trending on reddit, and I started this post as a comment in the thread, but it got too long.

I’m in the top 0.01% (which means rank #34), but I have contributed almost nothing in the past 4 years. Why I stopped is maybe part of the explanation of why “the decline of stackoverflow” isn’t actually happening.

The article in question describes the experience of a new user as horrible – you can’t easily ask a question without having it downvoted, marked as duplicate, or commented on in a negative way. The overall opinion (of the article and the reddit thread) seems to be that SO’s “elite” (the moderators) has become too self-important and is acting on a whim in the name of an alleged “purity” of the site.

But that’s not how I see it, even though I haven’t been active since “the good old days”. This Hacker News comment puts it very well:

StackOverflow is a machine designed to do one thing: make it so that, for any given programming question, you will get a search engine hit on their site and find a good answer quickly. And see some ads.
That’s really it. Everything it does is geared toward that, and it does it quite well.
I have lots of SO points. A lot of them have come from answering common, basic questions. If you think points exist to prove merit, that’s bad. But if you think points exist to show “this person makes the kind of content that brings programmers to our site and makes them happy”, it’s good. The latter is their intent.

So why did I stop contributing? There were too many repetitive questions/themes, too many poorly worded and “homework” questions, and too few meaningful, thought-provoking ones. I’ve always said that I answer stackoverflow questions not because I know all the answers, but because I know a little more about the subject than the person asking. And those questions seemed to give way (in terms of percentage) to “null pointer exception”, “how to fix this [40 lines pasted] code” and “is it better to have X than Y [in a context that only I know and I’m not telling you]”. (And here’s why I don’t agree that “it’s too hard to provide an answer on time”: if it’s not one of the “obvious” questions, you have plenty of time to provide an answer.)

And if we get back to the HN quote – the purpose of the site is to provide answers to questions. If a question has already been answered (and practically all of the basic ones have been), you should have found the answer rather than asking it again. Because of that, maybe sometimes non-trivial questions get mistaken for “oh, not another null pointer exception”, in which case I’ve been actively pointing that out and voting to reopen. But that’s rare. All the examples in the “decline of stackoverflow” article and in the reddit thread are, I believe, edge cases (and one is a possible “homework question”). Maybe these “edge cases” are now more prevalent than when I was active, but I think the majority of new questions still come from people too lazy to google one or two different wordings of their problem. Which is why I even summarized the basic steps of researching a problem before asking on SO.

So I wouldn’t say the moderators are self-made tyrants that are hostile to anyone new. They just have a knee-jerk reaction when they see yet-another-duplicate-or-homework-or-subjective question.

And that’s not simply for the sake of purity – the purpose of the site is to provide answers. If the same question exists in 15 variations, you may not find the best answer (it has happened to me – I found three questions that for some reason weren’t marked as duplicates; one contained just a few bad answers, and another had the solution. If Google happens to place the former on top, one may think it’s actually a hard question).

There are always “the trolls”, of course – I have been serially downvoted (so all questions about serial downvoting are duplicates), and I have even had personal trolls who write comments on all my recent answers. But… that’s the internet. And those get filtered quickly – no need to get offended or to think that “the community is too hostile”.

In the past week I’ve been writing a WordPress plugin as a side project. I haven’t programmed in PHP in 4 years and I’ve never written a WordPress plugin before. I had a lot of questions, but guess what – all of them were already answered, either on stackoverflow, or in the documentation, or in some blog post. We shouldn’t assume our question is unique and rush to ask it.

On the other hand, even the simplest questions are not closed just because they are simple. One of my favourite examples is the question of whether you need a null check before calling instanceof. My answer is number 2, with a sarcastic comment that this could be tested in an IDE in a minute. And a very good comment points out that it takes less than that to get the answer on Stackoverflow.

It may seem that most of the questions are already answered now. And that’s probably true for the general questions, for the popular technologies. Fortunately our industry is not static and there are new things all the time, so stackoverflow is going to serve those.

It’s probably a good idea to have different rules/thresholds for popular tags (technologies) and less popular ones. If there’s a way to differentiate trivial from non-trivial questions, answers to the non-trivial ones could be rewarded with more reputation. But I don’t think radical changes are needed. It is inevitable that after a certain “saturation point” there will be fewer contributors and more readers.

Bottom line:

  • I stopped contributing because it wasn’t that challenging anymore and there are too many similar, easy questions.
  • Stackoverflow is not declining, it is serving its purpose quite well.
  • Mods are not evil jerks that just hate you for not knowing something.
  • Stackoverflow is a little more boring for contributors now than it was before (which is why I gradually stopped answering), simply because most of the general questions have already been answered. The niche ones and the ones about new technologies remain, though.

Traditional Web Apps And RESTful APIs

September 23, 2016

When we are building web applications these days, it is considered a best practice to expose all our functionality as a RESTful API and then consume it ourselves. This usually goes with a rich front-end using heavy javascript, e.g. Angular/Ember/Backbone/React.

But a heavy front-end doesn’t seem like a good default – applications that require the overhead of a conceptually heavy javascript framework are actually not in the majority. The web, although much more complicated, is still not just about single-page applications. Not to mention that if you are writing a statically-typed backend, you would either need a dedicated javascript team (not necessarily a good idea, especially in small companies/startups), or you have to write in that… not-so-pleasant language. And honestly, my browser is hurting from all that unnecessary javascript everywhere, but that’s a separate story.

The other option for consuming your own RESTful API is to have a “web” module that calls your “backend” module. That may be a good idea, especially if you have different teams with different specialties, but the introduction of so much communication overhead for the sake of the separation is at least something one should think twice about before doing. Not to mention that in reality release cycles are usually tied together, as you need extra effort to keep the “web” and “backend” in proper sync (the “web” not requesting services that the “backend” doesn’t have yet, or the “backend” not providing a modified response model that the “web” doesn’t expect).

As in my defence of monoliths, I’m obviously leaning towards a monolithic application. I won’t repeat the other post, but the idea is that an application can be modular even if it’s run in a single runtime (e.g. a JVM). Have your “web” package, have your “services” package, and these can be developed independently, even as separate (sub-) projects that compile into a single deployable artifact.

So if you want to have a traditional web application – request/response, a little bit of ajax, but no heavy javascript fanciness and no architectural overhead, and you still want to expose your service as a RESTful API, what can you do?

Your web layer – the controllers, working with request parameters coming from form submissions and rendering a response using a template engine – normally communicates with your service layer. So for your web layer, the service layer is just an API, consumed via method calls inside the JVM. But that’s not the only way the service layer can be used. Frameworks like Spring MVC, Jersey, etc. allow annotating any method and exposing it as a RESTful service. Normally it is accepted that a service layer is not exposed as a web component, but it can be. So – you consume the service layer API via method calls, and everyone else consumes it via HTTP. The same definitions, the same output, the same security. And you don’t need a separate pass-through layer in order to have a RESTful API.
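
As an illustration, here is a minimal sketch of that setup with Spring MVC (the class names, the /api/orders path and the OrderDto type are made up for the example):

import org.springframework.stereotype.Controller;
import org.springframework.ui.Model;
import org.springframework.web.bind.annotation.*;

// the service layer: a regular bean, but its methods are also exposed over HTTP
@RestController
@RequestMapping("/api/orders")
public class OrderService {

    @GetMapping("/{id}")
    public OrderDto getOrder(@PathVariable("id") long id) {
        // business logic (e.g. loading from a repository) would go here
        return new OrderDto(id);
    }
}

// --- in a separate file: the "web" layer consumes the same API via plain method calls ---
@Controller
public class OrderPageController {

    private final OrderService orderService;

    public OrderPageController(OrderService orderService) {
        this.orderService = orderService;
    }

    @GetMapping("/orders/{id}")
    public String orderPage(@PathVariable("id") long id, Model model) {
        model.addAttribute("order", orderService.getOrder(id));
        return "order"; // rendered by the template engine
    }
}

// --- in a separate file: a trivial DTO, just to make the example self-contained ---
public class OrderDto {
    private final long id;
    public OrderDto(long id) { this.id = id; }
    public long getId() { return id; }
}

Whether the REST annotations belong on the service layer is a design choice – you trade a bit of layering purity for not having to maintain a separate pass-through API module.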

In theory that sounds good. In practice, the annotations that turn the method into an endpoint may introduce problems – is serialization/deserialization working properly, are the headers handled properly, is authentication correct? And you won’t notice that these aren’t working if you are using the methods only inside a single JVM. Yes, you will know they work correctly in terms of business logic, but the RESTful-enabling part may behave differently.

That’s why you need full coverage with acceptance tests – something like Cucumber/JBehave to test all your exposed endpoints. That way you’ll be sure that both the RESTful aspects and the business logic work properly. It’s something that should be there anyway, so it’s not extra overhead.
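
For example, a minimal acceptance test could look like this (using REST-assured rather than Cucumber/JBehave, and assuming the hypothetical /api/orders endpoint sketched above, running locally):

import static io.restassured.RestAssured.given;
import static org.hamcrest.Matchers.equalTo;

import org.junit.Test;

public class OrderApiAcceptanceTest {

    @Test
    public void getOrderIsProperlyExposedOverHttp() {
        // goes through the full HTTP stack, so serialization, headers
        // and security are exercised, not just the business logic
        given()
            .baseUri("http://localhost:8080")
            .accept("application/json")
        .when()
            .get("/api/orders/{id}", 1)
        .then()
            .statusCode(200)
            .body("id", equalTo(1));
    }
}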

Another issue is that you may want to deploy your API separately from your main application – for example, just the API running in one cluster and the full application running in another. And that’s no problem – you can simply disable the “web” part of your application with a configuration switch and deploy the very same artifact multiple times.
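
A sketch of such a switch (assuming Spring Boot; the property name and package are made up):

import org.springframework.boot.autoconfigure.condition.ConditionalOnProperty;
import org.springframework.context.annotation.ComponentScan;
import org.springframework.context.annotation.Configuration;

// the "web" package (controllers, templates) is only loaded when web.enabled=true,
// so the same artifact can be deployed as "API only" or as the full application
@Configuration
@ConditionalOnProperty(name = "web.enabled", havingValue = "true", matchIfMissing = true)
@ComponentScan("com.example.myapp.web")
public class WebModuleConfiguration {
}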

I have to admit I haven’t tried that approach yet, but it looks like a simple way that would still cover all the use cases properly.


The Right To Be Forgotten In Your Application

September 13, 2016

You’ve probably heard about “the right to be forgotten” according to which Google has to delete search results about you, if you ask them to.

According to a new General Data Protection Regulation of the EU, the right to be forgotten means that a data subject (user) can request the deletion of his data from any data controller (which includes web sites), and the data controller must delete the data without delay. Whether it’s a social network profile, items sold in online shops/auctions, location data, properties being offered for sale, even forum comments.

Of course, it is not that straightforward – it depends on the contractual obligations of the user (even with implicit contracts), and the Regulation lists a number of cases, some of them very broad, in which the right to be forgotten applies – but the bottom line is that such functionality has to be supported.

I’ll quote the relevant bits from the Regulation:

Article 17
1. The data subject shall have the right to obtain from the controller the erasure of personal data concerning him or her without undue delay and the controller shall have the obligation to erase personal data without undue delay where one of the following grounds applies:
(a) the personal data are no longer necessary in relation to the purposes for which they were collected or otherwise processed;
(b) the data subject withdraws consent on which the processing is based according to point (a) of Article 6(1), or point (a) of Article 9(2), and where there is no other legal ground for the processing;
2. Where the controller has made the personal data public and is obliged pursuant to paragraph 1 to erase the personal data, the controller, taking account of available technology and the cost of implementation, shall take reasonable steps, including technical measures, to inform controllers which are processing the personal data that the data subject has requested the erasure by such controllers of any links to, or copy or replication of, those personal data.

This “legalese” basically means that your application should have a “forget me” functionality that deletes the user data entirely. No “deleted” or “hidden” flags, no “but our business is based on your data”, no “this will break our application”.

Note also that if your application pushes data to a 3rd party service (uploads content to YouTube, pictures to Imgur, syncs data with Salesforce, etc.), you have to send deletion requests to these services as well.

The Regulation comes into force in 2018, but it’s probably a good idea to have it in mind earlier. Not just because there’s a Directive already in force and courts already have decisions in that direction, but also because when building your system, you have to keep that functionality working. And since in most cases all data is linked to a user in your database, it is most likely considered “personal data” under the broad definition in the Regulation.

Technically, this can be achieved with ON CASCADE DELETE in relational databases, with cascade=ALL in ORMs, or by manual application-layer deletion. The manual deletion needs support and extension whenever new entities/tables are added, but it is safer than a generic cascading delete. And as mentioned above, that may not be enough – your third-party integrations should also feature deletion. Most 3rd party APIs have that functionality, so your “/forget-me” endpoint handler would probably look something like this:

public void forgetUser(UUID userId) {
    User user = userDao.find(userId);
    // if cascading is considered unsafe, delete entity by entity
    userDao.delete(user);
    // also send deletion requests to 3rd party services (thirdPartyClient is just illustrative)
    thirdPartyClient.requestDeletion(user);
    // if some components of your deployment rely on processing events
    eventQueue.publishEvent(new UserDeletionEvent(user));
}

That code may be executed asynchronously. It can also run as part of a “forgetting” scheduled job – users submit their deletion requests and the job picks them up. The Regulation is not strict about real-time deletion, but it should happen “without undue delay”, so “once a month” is not an option.
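
A sketch of the scheduled-job variant (assuming Spring’s @Scheduled; DeletionRequestDao and UserService are hypothetical names):

import java.util.UUID;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class ForgettingJob {

    private final DeletionRequestDao deletionRequestDao; // stores the submitted "forget me" requests
    private final UserService userService;               // exposes the forgetUser(..) method from above

    public ForgettingJob(DeletionRequestDao deletionRequestDao, UserService userService) {
        this.deletionRequestDao = deletionRequestDao;
        this.userService = userService;
    }

    // running every hour is frequent enough to count as "without undue delay"
    @Scheduled(fixedRate = 60 * 60 * 1000)
    public void processPendingDeletionRequests() {
        for (UUID userId : deletionRequestDao.findPendingUserIds()) {
            userService.forgetUser(userId);
            deletionRequestDao.markProcessed(userId);
        }
    }
}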

My point is – you should think of that feature when designing your application, so that it doesn’t turn out that it’s impossible to delete data without breaking everything.

And you should think about that not (just) because some EU regulation says so, but because users’ data is not your property. Yes, the user has decided to push stuff to your database, but you don’t own it. So if a user decides he no longer wants you to hold any data about him – you are ethically (and now legally) bound to comply. You can bury the feature ten screens and two password forms deep, but it had better be there.

Is that practical? I think it is. It comes in handy for acceptance tests, for example, which you can run against production (without relying on hardcoded user profiles). It is also not that hard to support a deletion feature, and it allows you to have a flexible data model.

Whether the Regulation will be properly applied depends on many factors, but the technical one may be significant with legacy systems. And as every system becomes “legacy” after six months, we should be talking about it.


Why I Introduced Scala In Our Project

August 29, 2016

I don’t like Scala. And I think it has some bad and very ugly aspects that make it a poor choice for mainstream development.

But I recently introduced it in our project nonetheless. Not only that, but the team has no prior experience with Scala. And I’ll try to explain why that is not a bad idea.

  • First and most important – I followed my own advice and introduced it only in a small side module. We didn’t have acceptance tests and we desperately needed some, so the JBehave test module was a good candidate for a Scala project.
  • Test code is somewhat different from “regular” code – it is okay for it to be less consistent, more “sketchy” and to contain hacks. On the other hand, it could benefit from the expressiveness and lower verbosity of Scala, as tests are generally tough to read, especially in their setup phase. So Scala seems like a good choice – you can quickly write concise test code without worrying too much that you are shooting yourself in the foot in the long term, or that multiple team members do things in different ways (as Scala generously allows). Also, in tests you don’t have to face a whole stack of frameworks and therefore all the language concepts at once (like implicits, partial functions, case objects, currying, etc.)
  • I didn’t choose a scripting language (or Groovy in particular), because I have a strong preference for statically typed languages (yeah, I know Groovy has had that option for a couple of years now). And it had to be a JVM language, because we want to reuse some common libraries of ours.
  • It is good for people to learn new programming languages every now and then, as it widens their perspective. Even if that doesn’t change their language of choice.
  • Learning Scala, I think, can lead to a better understanding and use of Java 8 lambdas, as in Scala you can experience them “in their natural habitat”.
  • I have previous experience with Scala, so there is someone on the team who can help.

(As a side note – IntelliJ’s Scala support is now better than the last time I used it, unlike the Eclipse-based Scala IDE, which is still broken.)

If writing acceptance test code in Scala turns out to be as easy as I imagine, it would mean we’ll have more and better acceptance tests, which is the actual goal. And it would mean we are using the right tool for the job, rather than a hammer.


Biometric Identification [presentation]

August 21, 2016

Biometric identification is getting more common – borders, phones, doors. But I argue that it is not, by itself, a good approach. I tried to explain this in a short talk, and here are the slides.

Biometric features can’t be changed, can’t be revoked – they are there forever. If someone gets hold of them (and that happens sooner or later), we are screwed. And now that we use our fingerprints to unlock our phones, for example, and at the same time we use our phone as the universal “2nd factor” for most online services, including e-banking in some cases, fraud is waiting to happen (or already happening).

As Bruce Schneier has said after an experiment that uses gummi bears to fool fingerprint scanners:

The results are enough to scrap the systems completely, and to send the various fingerprint biometric companies packing

On the other hand, it is not that useful or pleasant to use biometric features for identification – typing a PIN is just as good (and we can change the PIN).

I’ve previously discussed the risks related to electronic passports, which contain fingerprint images in clear form and are read without a PIN, through a complex certificate management scheme. The bottom line is, they can leak from your passport without you noticing (if the central databases don’t leak before that). Fortunately, there are alternatives that would still guarantee that the owner of the passport is indeed the one it was issued to, and that it’s not fake.

But anyway, I think the biometric data can have some future applications. Near the end of the presentation I try to imagine how it can be used for a global, distributed anonymous electronic identification scheme. But the devil is always in the details. And so far we have failed with the details.


Writing Laws Is Quite Like Programming

August 7, 2016

In the past year I’ve taken the position of an adviser in the cabinet of a deputy prime minister, and as a result I had the opportunity to draft legislation. I’ve been doing that with a colleague, both of us with strong technical backgrounds, and it turned out we are not bad at it. Most of “our” laws passed, including the “open source law”, the electronic identification act, and the e-voting amendments to the election code (we were, of course, helped by legal professionals in the process, much like a junior dev is helped by a senior one).

And law drafting turned out to have much in common with programming – as a result, “our” laws were succinct, well-structured and “to the point”, covering all the use cases. At first it may sound strange that people not trained in the legal profession would be able to do it at all, but writing laws is actually “legal programming”. Here’s what the two processes have in common:

  • Both rely on a formalized language. Programming languages are stricter, but “legalese” is also quite formalized and certain things are normally worded in a predefined way – in a sense, there are “keywords”.
  • There is a specification for how to use the formalized language and how it should behave. The “Law for normative acts” is the JLS (Java Language Specification) of law drafting – it defines what is allowed, how laws should be structured and how they should refer to each other. It also defines the process of law-making.
  • Laws have a predefined structure, just as a class file, for example. There are sections, articles, references and modification clauses for other laws (much like invoking a state-changing function on another object).
  • Minimizing duplication is a strong theme in both law-drafting and programming. Instead of copy-pasting shared code / sections, you simply refer to it by its unique identifier. You do that in a single law as well as across multiple laws, thus reusing definitions and statements.
  • Both define use cases – a law tries to cover all the edge cases for a set of use cases related to a given matter, much like a program. Laws, of course, also define principles, which is arguably their more important feature, but the definition of use cases is pretty ubiquitous.
  • Both have if-clauses and loops. You literally say “in case of X, do Y”. And you can say “for all X, do Y”. Which is of course logical, as these programming constructs come from the real world.
  • There are versions and diffs. After a law appears for the first time (“is pushed to the legal world”), every change is in the form of an amendment to the original text, in a quite formalized “diff” structure: adding or removing articles, replacing words, sentences or whole sections. You can then replay all the amendments on top of the original document to find the currently active law. Sounds a lot like Git.
  • There are “code reviews” – you send your draft to all the other institutions and their experts give you feedback, which you can accept or reject. Then the “pull request” is merged into master by the parliament.
  • There is a lot of “legacy code”. There are laws from 50 years ago that have rarely been amended and you have to cope with them.

And you end up with a piece of “code” that either works, or doesn’t solve the real-world problems and has to be fixed/amended. With programming it’s the CPU (and possibly a virtual machine) that carries out the instructions; with laws it’s the executive branch (and in some cases – the judicial).

It may seem like the whole legal framework can be written in a rules engine or in Prolog. Well, it can’t, because of the principles it defines and the interpretation (moral and ethical) that judges have to do. But that doesn’t negate the similarities in the process.

There is one significant difference though. In programming we have a lot of tools to make our lives easier: build tools, IDEs, (D)VCS, issue tracking systems, code review systems. Legal experts have practically none. In most cases they use Microsoft Word, sometimes even without “Track changes”. They get the current version of the text from legal information systems or, in many cases, even from printed versions of the law. Collaboration is a nightmare, as Word documents fly around via email. The more tech-savvy may opt for a shared document in Google Docs or Office 365, but that’s rare. People have to manually write the “diff” based on track changes, and then manually apply the diff to get the final consolidated version. The process of consultation (“code review”) is based on sending paper mail and getting paper responses. Not to mention that once the draft gets to parliament, there are working groups and committees that make the process even more tedious.

Most of that can be optimized and automated. The UK, for example, has taken some steps forward with a system where each legal text is stored using LegalXML (afaik), so at least references and versioning can be handled easily. But the legal experts who draft legislation would love to have the tools that we, programmers, have. They just don’t know they exist. The whole process – from idea, through working groups, through consultation, to multiple readings in parliament – can be electronic. A GitHub for laws, if you wish, with good client-side tools to collaborate on the texts, to autocomplete references and to give you fine-tuned search. We have actually defined such a “thing” to be built in two years, and it will have to be open source, so even though the practices and rules vary from country to country, I hope it will be possible to reuse it.

As a conclusion, I think programming (or software engineering, actually), with its well defined structures and processes, can not only help in many diverse environments, but can also give you ideas on how to optimize them.


Custom Audit Log With Spring And Hibernate

July 18, 2016

If you need automatic auditing of all database operations and you are using Hibernate… you should use Envers or Spring Data JPA auditing. But if for some reason you can’t use Envers, you can achieve something similar with Hibernate event listeners and Spring transaction synchronization.

First, start with the event listener. You should capture all insert, update and delete operations. But there’s a tricky bit – if you need to flush the session for any reason, you can’t directly execute that logic with the session that is passed to the event listener. In my case I had to fetch some data, and Hibernate started throwing exceptions at me (“id is null”). Multiple sources confirmed that you should not interact with the database in the event listeners. So instead, you should store the events for later processing. And you can register the listener as a Spring bean as shown here.

public class AuditLogEventListener
        implements PostUpdateEventListener, PostInsertEventListener, PostDeleteEventListener {

    @Override
    public void onPostDelete(PostDeleteEvent event) {
        AuditedEntity audited = event.getEntity().getClass().getAnnotation(AuditedEntity.class);
        if (audited != null) {
            // store the event for later processing (see AuditLogServiceData below)
            AuditLogServiceData.getHibernateEvents().add(event);
        }
    }

    @Override
    public void onPostInsert(PostInsertEvent event) {
        AuditedEntity audited = event.getEntity().getClass().getAnnotation(AuditedEntity.class);
        if (audited != null) {
            AuditLogServiceData.getHibernateEvents().add(event);
        }
    }

    @Override
    public void onPostUpdate(PostUpdateEvent event) {
        AuditedEntity audited = event.getEntity().getClass().getAnnotation(AuditedEntity.class);
        if (audited != null) {
            AuditLogServiceData.getHibernateEvents().add(event);
        }
    }

    @Override
    public boolean requiresPostCommitHanding(EntityPersister persister) {
        return true; // Envers sets this to true only if the entity is versioned. So figure out for yourself if that's needed
    }
}

Notice the AuditedEntity – it is a custom marker annotation (retention=runtime, target=type) that you can put on top of your entities.
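
A sketch of such an annotation, following the retention and target mentioned above:

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
public @interface AuditedEntity {
}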

To be honest, I didn’t fully follow how Envers does the persisting, but as I also have Spring at my disposal, in my AuditLogServiceData class I decided to make use of it:

/**
 * Stores audit log information for the current transaction. It records all
 * changes to the entities in Spring transaction synchronization resources,
 * which are in turn stored as {@link ThreadLocal} variables for each thread.
 * Each thread/transaction uses its own copy of this data.
 */
public class AuditLogServiceData {
    private static final String HIBERNATE_EVENTS = "hibernateEvents";
    private static final String AUDIT_LOG_ACTOR = "auditLogActor";

    @SuppressWarnings("unchecked")
    public static List<Object> getHibernateEvents() {
        if (!TransactionSynchronizationManager.hasResource(HIBERNATE_EVENTS)) {
            TransactionSynchronizationManager.bindResource(HIBERNATE_EVENTS, new ArrayList<>());
        }
        return (List<Object>) TransactionSynchronizationManager.getResource(HIBERNATE_EVENTS);
    }

    public static Long getActorId() {
        return (Long) TransactionSynchronizationManager.getResource(AUDIT_LOG_ACTOR);
    }

    public static void setActor(Long value) {
        if (value != null) {
            TransactionSynchronizationManager.bindResource(AUDIT_LOG_ACTOR, value);
        }
    }

    public static void clear() {
        // unbind all resources bound by this class
        TransactionSynchronizationManager.unbindResourceIfPossible(HIBERNATE_EVENTS);
        TransactionSynchronizationManager.unbindResourceIfPossible(AUDIT_LOG_ACTOR);
    }
}

In addition to storing the events, we also need to store the user that is performing the action. In order to get it, we provide a method-parameter-level annotation to designate the relevant parameter. The annotation in my case is called AuditLogActor (retention=runtime, target=parameter).
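
Again, a sketch of what that annotation could look like:

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.PARAMETER)
public @interface AuditLogActor {
}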

Now what’s left is the code that will process the events. We want to do this prior to committing the current transaction. If the transaction fails upon commit, the audit entry insertion will also fail. We do that with a bit of AOP:

@Aspect
@Component
class AuditLogStoringAspect extends TransactionSynchronizationAdapter {

    @Autowired
    private ApplicationContext ctx;

    @Before("execution(* *.*(..)) && @annotation(transactional)")
    public void registerTransactionSynchronization(JoinPoint jp, Transactional transactional) {
        Logger.log(this).debug("Registering audit log tx callback");
        // register this aspect as a callback for the current transaction
        TransactionSynchronizationManager.registerSynchronization(this);
        MethodSignature signature = (MethodSignature) jp.getSignature();
        int paramIdx = 0;
        for (Parameter param : signature.getMethod().getParameters()) {
            if (param.isAnnotationPresent(AuditLogActor.class)) {
                AuditLogServiceData.setActor((Long) jp.getArgs()[paramIdx]);
            }
            paramIdx++;
        }
    }

    @Override
    public void beforeCommit(boolean readOnly) {
        Logger.log(this).debug("tx callback invoked. Readonly= " + readOnly);
        if (readOnly) {
            return;
        }
        for (Object event : AuditLogServiceData.getHibernateEvents()) {
            // handle events, possibly using instanceof (PostInsertEvent, PostUpdateEvent, PostDeleteEvent)
            // and insert the corresponding audit log entries
        }
    }

    @Override
    public void afterCompletion(int status) {
        // we have to unbind all resources as spring does not do that automatically
        AuditLogServiceData.clear();
    }
}

In my case I had to inject additional services, and Spring complained about mutually dependent beans, so I instead used applicationContext.getBean(FooBean.class). Note: make sure your aspect is picked up by Spring – either by auto-scanning, or by explicitly registering it with XML/Java config.

So, a call that is audited would look like this:

public void saveFoo(FooRequest request, @AuditLogActor Long actorId) { .. }

To summarize: the Hibernate event listener stores all insert, update and delete events as Spring transaction synchronization resources. An aspect registers a transaction “callback” with Spring, which is invoked right before each transaction is committed. There, all events are processed and the respective audit log entries are inserted.

This is a very basic audit log – it may have issues with collection handling, and it certainly does not cover all use cases. But it is way better than manual audit log handling, and in many systems an audit log is mandatory functionality.


Spring-Managed Hibernate Event Listeners

July 15, 2016

Hibernate offers event listeners as part of its SPI. You can hook your listeners to a number of events, including pre-insert, post-insert, pre-delete, flush, etc.

But sometimes in these listeners you want to use Spring dependencies. I’ve written previously about how to do that, but Hibernate has since been upgraded and now there’s a better way (and the old way doesn’t work in the latest versions because of missing classes).

This time it’s simpler. You just need a bean that looks like this:

public class HibernateListenerConfigurer {
    @PersistenceUnit
    private EntityManagerFactory emf;
    @Autowired
    private YourEventListener listener;
    @PostConstruct
    protected void init() {
        SessionFactoryImpl sessionFactory = emf.unwrap(SessionFactoryImpl.class);
        EventListenerRegistry registry = sessionFactory.getServiceRegistry().getService(EventListenerRegistry.class);
        registry.appendListeners(EventType.POST_INSERT, listener); // likewise for POST_UPDATE, POST_DELETE, etc.
    }
}

It is similar to this stackoverflow answer, which however won’t work because it also relies on deprecated classes.

You can also inject a List<..> of listeners (they don’t share a common interface, but you can define your own).

As pointed out in the SO answer, you can’t store new entities in the listener, though, so it’s no use injecting a DAO, for example. But it may come in handy for processing information that does not rely on the current session.