Forget ISO-8859-1

January 16, 2017

UTF-8 was first presented in 1993. One would assume that 24 years is enough time for it to become ubiquitous, especially given that the Internet is global. ASCII doesn’t even cover French letters, not to mention Cyrillic or Devanagari (the script used for Hindi). That’s why ASCII was replaced by ISO-8859-1, which kind of covers most Western languages’ orthographies.

88.3% of websites use UTF-8. That’s not enough, but let’s assume the remaining 11.7% do not accept any input and are just English-language static websites. The problem with the still-pending adoption of UTF-8 is how entrenched ASCII/ISO-8859-1 are. I’ll try to give a few examples:

  • UTF-8 isn’t the default encoding in many core Java classes – FileReader, for example. It’s similar for other languages and runtimes. The default encoding of these Java classes is the JVM default, which is most often ISO-8859-1. It is allegedly taken from the OS, but I don’t remember configuring any encoding on my OS – just a locale, which is substantially different.
  • Many frameworks, tools and containers don’t use UTF-8 by default (and don’t try to remedy the JVM not using UTF-8 by default). Tomcat’s default URL encoding is, I think, still ISO-8859-1. Eclipse doesn’t make files UTF-8 by default (on my machine it’s sometimes even windows-1251 (Cyrillic), which is horrendous). And so on. I’ve asked for UTF-8 as the default in the past, and I repeat that call.
  • Regex examples and tutorials always give you the [a-zA-Z0-9]+ regex to “validate alphanumeric input”. It is built into many validation frameworks. And it is utterly wrong. This is a regex that must never appear anywhere in your code, unless you have a pretty good explanation. Yet, the example is ubiquitous. Instead, the right regex is [\p{Alpha}0-9]+. Using the wrong regex means you won’t be able to accept any non-ASCII letter, which is something you practically never want. Unless, perhaps, due to the next problem.
  • Browsers have issues with UTF-8 URLs. Why? It’s complicated. And it almost works when it’s not part of the domain name. Almost, because when you copy the URL, it gets screwed (pardon me – encoded).
  • Microsoft Excel doesn’t work properly with UTF-8 in CSV files. I was baffled to realize that UTF-8 CSVs become garbage. Well, not if you have a BOM (byte order mark), but come on, it’s [the current year].
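To illustrate the FileReader point above – the fix is to always pass the charset explicitly rather than rely on the JVM default. A minimal sketch (the method and class names are mine, not from any particular library):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class Utf8Read {
    // FileReader uses the JVM default charset; wrapping a FileInputStream
    // in an InputStreamReader lets us pin the encoding to UTF-8 explicitly
    public static String readUtf8(File file) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new FileInputStream(file), StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }
}
```

(Since Java 7 you can get the same effect with Files.readAllLines(path, StandardCharsets.UTF_8), which at least forces you to name a charset.)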
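The regex point above is easy to demonstrate with Java’s Pattern: \p{Alpha} only becomes Unicode-aware when the UNICODE_CHARACTER_CLASS flag is set (without it, Java treats the POSIX classes as US-ASCII):

```java
import java.util.regex.Pattern;

public class AlphanumericValidation {
    // the ubiquitous, wrong pattern: ASCII letters only
    static final Pattern ASCII_ONLY = Pattern.compile("[a-zA-Z0-9]+");

    // Unicode-aware: with UNICODE_CHARACTER_CLASS, \p{Alpha} matches
    // letters from any script, not just a-z/A-Z
    static final Pattern UNICODE_AWARE =
            Pattern.compile("[\\p{Alpha}0-9]+", Pattern.UNICODE_CHARACTER_CLASS);

    public static void main(String[] args) {
        System.out.println(ASCII_ONLY.matcher("Мария").matches());    // false
        System.out.println(UNICODE_AWARE.matcher("Мария").matches()); // true
    }
}
```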
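And for the Excel problem – prepending the byte order mark makes Excel detect the UTF-8 encoding. A sketch of writing such a CSV (the content here is just an example):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class ExcelFriendlyCsv {
    public static void writeCsv(Path path, String content) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                Files.newOutputStream(path), StandardCharsets.UTF_8)) {
            // U+FEFF encodes to the bytes EF BB BF, which Excel uses
            // as a hint that the file is UTF-8
            writer.write('\uFEFF');
            writer.write(content);
        }
    }
}
```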

As Jon Skeet rightly points out – we have issues with the most basic data types: strings, numbers and dates. This is partly because the real world is complex. And partly because we software engineers tend to oversimplify it. This is what we’ve done with ASCII and other Latin-only encodings. But let’s forget ASCII and ISO-8859-1. It’s not even okay to call them “legacy” after 24 years of UTF-8. After 24 years they should’ve died.

Let’s not give regex examples that don’t work with UTF-8, let’s not assume any default different than UTF-8 is a good idea and let’s sort the URL mess.

Maybe I sound dogmatic. Maybe I exaggerate because my native script is non-Latin. But if we want our software to be global (and we want that, in order to have a bigger market), then we have to sort out our basic encoding issues. Having UTF-8 as a standard is not enough. Let’s forget ISO-8859-1.


Anemic Objects Are OK

December 25, 2016

I thought for a while that object-oriented purism has died off. But it hasn’t – every now and then there’s an article that tries to tell us how evil setters and getters are, how bad (Java) annotations are, and how horrible and anti-object-oriented the anemic data model is (when functionality-only services act upon data-only objects) and eventually how dependency injection is ruining software.

Several years ago I tried to counter these arguments, saying that setters and getters are not evil per se, and that the anemic data model is mostly fine, but I believe I was worse at writing then, so maybe I didn’t get to the core of the problem.

This summer I had a short twitter discussion with Yegor Bugayenko and Vlad Mihalcea on the matter, and a few arguments surfaced. I’ll try to summarize them:

  • Our metaphors are often wrong. An actual book doesn’t know how to print itself. Its full contents are given to a printer, which knows how to print a book. Therefore it doesn’t make sense to put logic for printing (to JSON/XML), or persisting to a database in the Book class. It belongs elsewhere.
  • The advice to use (embedded) printers instead of getters is impractical even if a Book should know how to print itself – how do you transform your objects to other formats (JSON, XML, database rows, etc.)? With Jackson/JAXB/an ORM you simply add a few annotations, if any at all, and it works. With “printers” you have to implement the serialization logic manually. Even with Xembly you still have to write a tedious, potentially huge method full of add()’s and up()’s. And when you add or remove a field, change a field definition, or add a new serialization format, it gets way more tedious to support. Another approach mentioned in the twitter thread is having separate subclasses for each format/database, and an example can be seen here. I really don’t find that easy to read or support. And even if that were adopted in a project I’m working on, I’d be the first to replace the manual adding with reflection, however impure that may be. (Even Uncle Bob’s Fitnesse project has getters, or even public fields, where that makes sense in terms of the state space.)
  • Having too much logic/behaviour in an object may be seen as breaking the single responsibility principle. In fact, this article argues that the anemic approach is actually SOLID, unlike the rich business object approach. The SRP may actually be understood in multiple ways, but I’ll get to that below.
  • Dependency injection containers are fine. The blunt example of how the code looks without them is here. No amount of theoretical object-oriented programming talk can make me write that piece of code. I guess one can get used to it, but (excuse my appeal-to-emotion fallacy here) – it feels bad. And when you consider the case of dependency injection containers – whether you invoke a constructor from a main method, or your main method starts a context that performs automatic constructor (or setter) injection, makes no real difference: your objects are still composed of their dependencies, and their dependencies are set externally. Except the former is more practical, and after a few weeks of nested instantiation you’ll feel inclined to write your own semi-automated mechanism to do it.

But these are all arguments derived from a common root – encapsulation. Your side in the above arguments depends on how you view and understand encapsulation. I see the purpose of encapsulation as a way to protect the state space of a class – an object of a given class is only valid if it satisfies certain conditions. If you expose the data via getters and setters, then the state space constraints are violated – everyone can invalidate your object. For example, if you were able to set the size of an ArrayList without adding the corresponding element to the backing array, you’d break the behaviour of an ArrayList object – it will report its size inconsistently and the code that depends on the List contract would not always work.
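The ArrayList example can be boiled down to a toy class – a sketch, not the real implementation – where the invariant is that size always matches the number of stored elements:

```java
import java.util.Arrays;

// a miniature list: "size" and the backing array must stay consistent,
// which is exactly the kind of invariant encapsulation protects
class MiniList {
    private Object[] elements = new Object[10];
    private int size = 0;

    public void add(Object e) {
        if (size == elements.length) {
            elements = Arrays.copyOf(elements, size * 2);
        }
        elements[size++] = e; // size and contents change together
    }

    public int size() {
        return size;
    }

    // a public setSize(int) here would let any caller break the invariant:
    // the list could claim elements that were never added
}
```

A data object, by contrast, has no such invariant to break, which is why exposing its fields is harmless.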

But in practical terms encapsulation still allows for the distinction between “data objects” and “business objects”. A data object has no constraints on its state – any combination of the values of its fields is permitted. Or in some cases it does, but they are enforced outside the current running program (e.g. via database constraints when persisting an object with an ORM). In these cases, where the state space is not constrained, encapsulation is useless. And forcing it upon your software blindly results in software that, I believe, is harder to maintain and extend. Or at least you gain nothing – testing isn’t easier that way (you can have a perfectly well tested anemic piece of software), deployment is not impacted, and tracing problems isn’t much different.

I’m even perfectly fine with getting rid of the getters and exposing the state directly (as with the aforementioned LogData class from Fitnesse).

And most often, in business applications, websites, and the general type of software out there, most objects don’t need to enforce any constraints on their state. Because their state is just data, used somewhere else, in whatever ways the business needs it to be used. To get back to the single responsibility principle – these data objects have only one reason to change: their … data has changed. The data itself is irrelevant to the program; it becomes relevant at a later stage – when it’s fetched via a web service (after being serialized to JSON), or after it’s fetched from the database by another part of the application or a completely different system. And that’s important – encapsulation cannot be enforced across several systems that all work with a piece of data.

In the whole debate I haven’t seen a practical argument against the getter/setter/anemic style. The only thing I see is “it’s not OOP” and “it breaks encapsulation”. Well, I think it should be settled by now that encapsulation should not always be there. It should be there only when you need it – to protect the state of your object from interference.

So, don’t feel bad about continuing with your ORMs, DI frameworks and automatic JSON and XML serializers.


Amend Your Contract To Allow For Side Projects

December 14, 2016

The other day Joel Spolsky blogged a wonderful overview of the copyright issues software companies have with respect to their employees. The bottom line is: most companies have an explicit clause in their contracts which states that all intellectual property created by a developer is owned by the employer. This is needed because the default (in many countries, including mine) is that the creator owns the copyright, regardless of whether they were hired to do the work or not.

That in turn means that any side project, or in fact any intellectual property that you create while being employed as a developer, is automatically owned by your employer. This isn’t necessarily too bad, as most employers wouldn’t enforce their right, but this has bugged me ever since I started working for software companies. Even though I didn’t know the legal framework of copyright, the ownership clause in my contracts was always something that I felt was wrong. Even though Joel’s explanation makes perfect sense – companies need to protect their products from a random developer suddenly deciding they own the rights to parts of it – I’ve always thought there’s a middle ground.

(Note: there is a difference between copyright, patents and trademarks, and the umbrella term “intellectual property” is kind of ambiguous. I may end up using it sloppily, so for a clarification, read here.)

California apparently tried to fix this by passing the following law:

Anything you do on your own time, with your own equipment, that is not related to your employer’s line of work is yours, even if the contract you signed says otherwise.

But this doesn’t work, as basically everything computer-related is “in your employer’s line of work”, or at least can be, depending on the judge.

So let’s start with the premise that each developer should have the right to create side projects and profit from them, potentially even pursue them as their own business. Let’s also keep in mind that a developer is not a person whose only ideas and intellectual products are “source code” or “software designs”. On the other hand, the employer must be 100% sure that no developer can claim any rights over parts of the employer’s product. There should be a way to phrase a contract so that it reflects these realities.

And I have always done that – whenever offered a contract, I’ve stated that:

  1. I have side-projects, some of which are commercial, and it will make no sense for the employer to own them
  2. I usually create non-computer intellectual property – I write poetry, short stories, and linguistics problems

And I’ve demanded that the contract be reworded. It’s a bargain, after all, not one side imposing its “standard contract” on the other. So far no company has objected too much (there was some back-and-forth with the lawyers, but that’s it – companies decided it was better for them to hire a person they’ve assessed positively than to stick to an absolute contract clause).

That way we’ve ended up with the following wording, which I think is better than the California law, protects the employer, and also accounts for side-projects and poetry/stories/etc.:

Products resulting from the Employee’s duties according to the terms of employment – individual or joint, including ideas, software development, inventions, improvements, formulas, designs, modifications, trademarks and any other type of intellectual property are the exclusive property of Employer, no matter if patentable.

Not sure if it is properly translated, but the first part is key – if the idea/code/invention/whatever is a result of an assignment that I got, or a product I am working on for the employer, then it is within the terms of employment. And it is way less ambiguous than “the employer’s line of work”. Anything outside of that, is of course, mine.

I don’t vouch for the legal soundness of the above, as I’m not a legal professional, but I strongly suggest negotiating such a clause in your contracts. It could be reworded in other ways, e.g. “work within the terms of employment”, but the overall idea is obvious. And if you end up in court (which would probably almost never happen, but contracts are there to arrange the edge cases), then even if the clause is not perfect, the judge/jury will be able to see its intent clearly.

And here I won’t agree with Joel – if you want to do something independent, you don’t have to be working for yourself. Side projects, of which I’ve always been a proponent, (and other intellectual products) are not about the risk-taking entrepreneurship – they are about intellectual development. They are a way to improve and expand your abilities. And it is a matter of principle that you own them. In rare cases they may be the basis of your actual entrepreneurial attempt.

Amending a standard contract is best done before you sign it. If you’ve already signed it, it’s still possible to add an annex, but that’s less likely to succeed. So my suggestion is: before you start a job, use your “bargaining power” to secure your intellectual property rights.


Progress in Electronic Governance [talk]

December 2, 2016

I’ve been an advisor to the deputy prime minister of Bulgaria for the past year and a half. And at this year’s OpenFest conference I tried to report on what we’ve achieved. It is not that much, and there are no visible results yet, which is a bit disappointing, but we (a small motivated team) believe we have laid the groundwork for a more open, and properly built, ecosystem for government IT systems.

Just to list a few things – we passed a law that requires open-sourcing custom-built government software, we opened a lot of data (1500 datasets) on the national open data portal, and we drew a roadmap for how existing state registers and databases should be upgraded in order to meet modern software engineering best practices and be ready to handle a high load of requests. We also seriously considered the privacy and auditability of the whole ecosystem. We prepared the electronic identification project (each citizen having the option to identify online with a secure token), an e-voting pilot and so on.

The video of the talk is available here:

And here are the slides:

Now that our term is at an end (due to the resignation of the government) we hope the openness-by-default will persist as a policy and the new government agency that we constituted would be able to push the agenda that has been laid out. Whether that will be the case in a complex political situation is hard to tell, but hopefully the “technical” and the “political” aspects won’t be entwined in a negative way. And our team will continue to support with advice (even though from “the outside”) whoever wishes to build a proper and open e-government ecosystem.


Domain Fallback Mechanism In Apps

November 19, 2016

As a consequence of the Dyn attack many major websites were down, including twitter – the browsers could not resolve an IP address of the servers because the authoritative name server (Dyn) was down. Whether that could be addressed globally, I don’t know – there was an interesting discussion on reddit about my proposal to increase TTL – how the resolution policy and algorithms can be improved, why a lower TTL is not always applicable, etc.

But while twitter was down, its mobile app was also not working. And while we have no control over the browser, we certainly do have control over the mobile app (the same goes for desktop applications, of course, but I’ll be talking mainly about mobile apps as the more dominant case). The reason the app was also down is that it most likely uses the same domain name for its API calls. And that’s the right way to do it, except in these rare situations when the DNS fails.

In these cases you can hardcode a list of server IP addresses in your app and fallback to them if the domain-name based requests fail (e.g. after 3 attempts). If you change your server IP addresses, you just update the app. It doesn’t matter that it will have an unpredictable delay (until everyone updates) and that some clients won’t have a proper IP – it is an edge-case fallback mechanism.

It’s not my idea, of course – it’s been used by distributed systems like Bitcoin and BitTorrent. For example, when trying to join the network, a Bitcoin or BitTorrent DHT client tries to connect to a bootstrap node in order to get a list of peers. There is a list of domain names in the client applications that resolve, using round-robin DNS, to one of many known bootstrap nodes. However, if DNS resolution fails, the clients also have a small set of hardcoded IP addresses.

In addition to hardcoding IPs, you can regularly resolve the domain to an IP from within the app, and keep the “last known working IP” in a local cache. That adds a bit of complexity, and will not work for fresh or cleaned-up installations, but is a good measure nonetheless.

As this post points out, you can have multiple fallback strategies. Instead of – or better, in addition to – hardcoding the IPs, you can have a fallback domain name, managed by a different DNS provider. That way your app will have a 4-step fallback mechanism:

  1. try primary domain
  2. try fallback domain
  3. try cached last known working IP(s)
  4. try hardcoded fallback IPs
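The four steps above can be sketched as a simple chain that tries each endpoint in order (the domains and IPs below are placeholders, and the actual HTTP call is abstracted behind a function):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class EndpointFallback {
    private final List<String> endpoints = new ArrayList<>();

    public EndpointFallback(String primaryDomain, String fallbackDomain,
                            List<String> cachedIps, List<String> hardcodedIps) {
        // order matters: domains first, then last-known-working IPs, then hardcoded ones
        endpoints.add(primaryDomain);
        endpoints.add(fallbackDomain);
        endpoints.addAll(cachedIps);
        endpoints.addAll(hardcodedIps);
    }

    // tries each endpoint in turn; "request" stands in for the real HTTP call,
    // which throws when DNS resolution or the connection fails
    public String connect(Function<String, String> request) {
        for (String endpoint : endpoints) {
            try {
                return request.apply(endpoint);
            } catch (RuntimeException e) {
                // resolution/connection failed – fall through to the next option
            }
        }
        throw new IllegalStateException("all endpoints failed");
    }
}
```

In a real app the retry policy would be per-endpoint (e.g. 3 attempts with timeouts), but the ordering is the essential part.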

(Note: if you connect to an IP rather than a domain, you have to do the verification of the server certificate manually (assuming you want to use a TLS connection, which you should). In Android that’s done by using a custom HostnameVerifier.)

Events like the Dyn attack are (let’s hope) rare, but can be costly to businesses. Adding the fallback mechanisms to at least some of the client software is quick and easy and may reduce the damage.


Using Named Database Locks

November 8, 2016

In a beginner’s guide to concurrency, I mentioned advisory locks. These are not the usual table locks – they are a table-agnostic, database-specific way to obtain a named lock from your application. Basically, you use your database instance for centralized application-level locking.

What could it be used for? If you want to have serial operations, this is a rather simple way – no need for message queues, or distributed locking libraries in your application layer. Just have your application request the lock from the database, and no other request (regardless of the application node, in case there are multiple) can obtain the same lock.

There are multiple functions that you can use to obtain such a lock – the pg_advisory_lock family in PostgreSQL, GET_LOCK in MySQL. The implementations differ slightly – in MySQL you need to release the lock explicitly, while in PostgreSQL a lock can be released automatically at the end of the current transaction.

How can it be used in a Java application, for example with Spring? You can provide a locking aspect and a custom annotation to trigger the locking. Let’s say we want to have sequential updates for a given entity. In the general use-case that would be odd, but sometimes we may want to perform some application-specific logic that relies on sequential updates.

    @Before("execution(* *.*(..)) && @annotation(updateLock)")
    public void applyUpdateLocking(JoinPoint joinPoint, UpdateLock updateLock) {
        int entityTypeId = entityTypeIds.get(updateLock.entity());
        // note: letting the long id overflow when fitting it into an int, because the
        // postgres lock function takes only ints; lock collisions are pretty unlikely
        // and their effect would be unnoticeable
        int entityId = (int) getEntityId(joinPoint.getStaticPart().getSignature(),
                joinPoint.getArgs(), updateLock.idParameter());
        if (entityId != 0) {
            logger.debug("Locking on " + updateLock.entity() + " with id " + entityId);
            // using a transaction-level lock, which is released automatically at the end of the transaction
            final String query = "SELECT pg_advisory_xact_lock(" + entityTypeId + "," + entityId + ")";
            em.unwrap(Session.class).doWork(new Work() {
                @Override
                public void execute(Connection connection) throws SQLException {
                    try (PreparedStatement statement = connection.prepareStatement(query)) {
                        statement.executeQuery();
                    }
                }
            });
        }
    }
What does it do:

  • It looks for methods annotated with @UpdateLock and applies the aspect
  • the UpdateLock annotation has two attributes – the entity type and the name of the method parameter that holds the ID on which we want to lock updates
  • the entityTypeIds map holds a mapping between the String name of an entity and an arbitrary number (because the postgres function requires numbers rather than strings)
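A sketch of what the annotation and its lookup might look like (the attribute names here are my assumptions, not necessarily the original code):

```java
import java.lang.annotation.*;
import java.lang.reflect.Method;

public class UpdateLockSketch {
    // the two attributes described above: the entity type and the name
    // of the method parameter that holds the ID to lock on
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    public @interface UpdateLock {
        String entity();
        String idParameter();
    }

    static class InvoiceService {
        @UpdateLock(entity = "invoice", idParameter = "invoiceId")
        public void updateInvoice(long invoiceId) {
            // sequential update logic goes here
        }
    }

    // what the aspect does first: read the lock configuration off the intercepted method
    public static UpdateLock lockConfigOf(Class<?> type, String methodName)
            throws NoSuchMethodException {
        for (Method m : type.getDeclaredMethods()) {
            if (m.getName().equals(methodName) && m.isAnnotationPresent(UpdateLock.class)) {
                return m.getAnnotation(UpdateLock.class);
            }
        }
        throw new NoSuchMethodException(methodName);
    }
}
```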

That doesn’t sound very useful in the general use-case, but if for any reason you need to make sure a piece of functionality is executed sequentially in an otherwise concurrent, multi-threaded application, this is a good way.

Use this database-specific way to obtain application-level locks sparingly, though. If you need to do it often, you probably have a bigger problem – locking is generally not advisable. In the above case it locks on a single entity ID, which means rarely more than a couple of requests will be waiting at the lock (or failing to obtain it). The good thing is, it won’t get more complicated with sharding – if you lock on a specific ID and that ID lives on a single shard, then even though you may have multiple database instances (which do not share the lock), you won’t have to obtain the lock from a different shard.

Overall, it’s a useful tool to have in mind when faced with a concurrency problem. But consider whether you don’t have a bigger problem before resorting to locks.


Short DNS Record TTL And Centralization Are Serious Risks For The Internet

October 22, 2016

Yesterday Dyn, a DNS-provider, went down after a massive DDoS. That led to many popular websites being inaccessible, including twitter, LinkedIn, eBay and others. The internet seemed to be “crawling on its knees”.

We’ll probably read an interesting post-mortem from Dyn, but why did that happen? First, DDoS capacity is increasing, using insecure and infected IoT devices with access to the internet. Huge volumes of fake requests are poured onto a given server or set of servers, and they become inaccessible – either unable to cope with the requests, or simply because the network to the server doesn’t have enough throughput to accommodate them all.

But why did “the internet” stop because a single DNS provider was under attack? First, because of centralization. The internet is supposed to be decentralized (although I’ve argued that exactly because of DNS, it is pseudo-decentralized). But services like Dyn, UltraDNS, Amazon Route53 and also Akamai and CloudFlare centralize DNS. I can’t tell how exactly, but out of the top 500 websites, 181 use one of the above 5 services as their DNS provider. Add 25 google services that use their own, and you get nearly 200 out of 500 concentrated in just 6 entities.

But centralization of the authoritative nameservers alone would not have led to yesterday’s problem. A big part of the problem, I think, is the TTL (time to live) of the DNS records – that is, the records which contain the mapping between a domain name and IP address(es). The idea is that you should not always hit the authoritative nameserver (Dyn’s server(s) in this case) – you should hit it only if there is no cached entry anywhere along the way of your request. Your operating system may have a cache, but more importantly – your ISP has a cache. So when subscribers of one ISP all make requests to twitter, the requests should not go to the nameserver, but should instead be resolved by looking them up in the cache of the ISP.

If that was the case, regardless of whether Dyn was down, most users would be able to access all services, because they would have their IPs cached and resolved. And that’s the proper distributed mode that the internet should function in.

However, it has become a common practice to set a very short TTL on DNS records – just a few minutes. So after those few minutes expire, your browser has to ask the nameserver again which IP corresponds to the domain you want to access. That’s why the attack was so successful – because almost no information was cached, and everyone repeatedly turned to Dyn to get the IP corresponding to the requested domain.

That practice is highly questionable, to say the least. This article explains in details the issues of short TTLs, but let me quote some important bits:

The lower the TTL the more frequently the DNS is accessed. If not careful DNS reliability may become more important than the reliability of, say, the corporate web server.

The increasing use of very low TTLs (sub one minute) is extremely misguided if not fundamentally flawed. The most charitable explanation for the trend to lower TTL value may be to try and create a dynamic load-balancer or a fast fail-over strategy. More likely the effect will be to break the nameserver through increased load.

So we knew the risks, and it was inevitable that this problematic practice would be abused. I decided to analyze how big the problem actually is. So I took the aforementioned top 500 websites as representative, fetched their A, AAAA (IPv6), CNAME and NS records, and put them into a table. You can find the code in this gist (it uses the dnsjava library).

The resulting CSV can be seen here. And if you want to play with it in Excel, here is the excel file.

Some other things that I collected: how many websites have AAAA (IPv6) records (only 79 out of 500), whether the TTLs differ between IPv4 and IPv6 (they do for 4), the DNS provider (which is how I got the figures mentioned above), taken from the NS records, and how many use CNAME instead of A records (just a few). I also collected the number of A/AAAA records, in order to see how many (potentially) utilize round-robin DNS (187). Worth mentioning: the A records served to me may differ from those served to other users, which is also a way to do load balancing.

The results are a bit scary. The average TTL is only around 7600 seconds (2 hours and 6 minutes). But it gets worse when you look at the 50th percentile (sort the values by ttl and get the lowest 250). The average there is just 215 seconds. This means the DNS servers are hit constantly, which turns them into a real single point of failure and “the internet goes down” just after a few minutes of DDoS.
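The percentile computation is straightforward – a sketch with made-up values, since the full dataset is in the linked CSV:

```java
import java.util.Arrays;

public class TtlStats {
    public static double average(long[] ttls) {
        return Arrays.stream(ttls).average().orElse(0);
    }

    // sort by TTL and average the lowest half – the "50th percentile" group above
    public static double bottomHalfAverage(long[] ttls) {
        long[] sorted = ttls.clone();
        Arrays.sort(sorted);
        return Arrays.stream(sorted, 0, sorted.length / 2).average().orElse(0);
    }
}
```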

Just a few websites have a high TTL, as can be seen from this simple chart (all 500 sites are on the X axis, the TTL is on the Y axis):


What are the benefits of a short TTL? Not many, actually. You have the flexibility to change your IP address, but you don’t do that very often, and besides – it doesn’t automatically mean all users will be pointed to the new IP, as some ISPs, routers and operating systems may ignore the TTL value and keep the cache alive for longer periods. You could do round-robin DNS, which is basically using the DNS provider as a load balancer, which sounds wrong in most cases. It can be used for geolocation routing – serving a different IP depending on the geographical area of the request – but that doesn’t necessarily require a low TTL: if caching happens closer to the user than to the authoritative DNS server, then the user will be pointed to the nearest IP anyway, regardless of whether that value gets refreshed often or not.

Short TTL is very useful with internal infrastructure – when pointing to your internal components (e.g. a message queue, or to a particular service if using microservices), then using low TTLs may be better. But that’s not about your main domain being accessed from the internet.

Overlay networks like BitTorrent and Bitcoin use DNS round-robin for seeding new clients with a list of peers that they can connect to (your first use of a torrent client connects you to one of several domains that each point to a number of nodes that are supposed to be always on). But that’s again a rare use case.

Overall, I think most services should go for higher TTLs. 24 hours is not too much, and you would need to keep your old IP serving requests for 24 hours anyway, because of caches that ignore the TTL value. That way services won’t care whether the authoritative nameserver is down or not. And that would in turn mean that DNS providers would be a less interesting target for attacks.

And I understand the flexibility that Dyn and Route53 give us. But maybe we should think of a more distributed way to gain that flexibility. Because yesterday’s attack may be just the beginning.


The Broken Scientific Publishing Model and My Attempt to Improve It

October 12, 2016

I’ll begin this post with a rant about the state of scientific publishing, then review the technology “disruption” landscape and offer a partial improvement that I developed (source).

Scientific publishing is quite important – all of science is based on previously confirmed “science”, so knowing what the rest of the scientific community has done or is doing is essential to research. And allows scientists to “stand on the shoulders of giants”.

The web was basically invented to improve the sharing of scientific information – it was created at CERN and allowed linking from one (research) document to others.

However, scientific publishing at the moment is one of the few industries that haven’t benefited from the web. Well, the industry has – the community hasn’t, at least not as much as one would like.

Elsevier, Thomson-Reuters (which recently sold its intellectual property business), Springer and other publishers make huge profits (e.g. a 39% margin on 2 billion in revenue) for doing something that should basically be free in this century – spreading the knowledge that scientists have created. You can see here some facts about their operation, the most striking being that each university has to pay more than a million dollars to get the literature it needs.

It’s because they rely on a centuries old process of submission to journals, accepting the submission, then printing and distributing to university libraries. Recently publishers have put publications online, but they are behind paywalls or accessible only after huge subscription fees have been paid.

I’m not a “raging socialist” but sadly, publishers don’t provide (sufficient) value. They simply gather the work of scientists that is already funded by public money, sometimes get the copyright on that, and disseminate it in a pre-Internet way.

They also do not pay for peer review of the submitted publications, they simply “organize it” – which often means “a friend of the editor is a professor and he made his postdocs write peer reviews”. Peer review is thus itself broken, as it is non-transparent and often of questionable quality. The funny side of the peer review process is caught at “shitsmyreviewerssay”.

Oh, and of course authors should themselves write their publication in a journal-preferred template (and each journal has its own preferences). So the only actual work that the journals do is typesetting and editorial filtering.

So, we have expensive scientific literature with no added value and a broken peer review system.

And at that point you may argue that if they do not add value, they can be easily replaced. Well, no. Because of the Impact Factor – the metric for determining the most cited journals, and by extension – the reputation of the authors that manage to get published in these journals. The impact factor is calculated based on a big database (Web of Science) and assigns a number to each journal. The higher the impact factor a journal has, the better the career opportunities a scientist has if they manage to get accepted for publication in that journal.

You may think that the impact factor is objective – well, it isn’t. It is based on data that only publishers (Thomson-Reuters in particular) have and when others tried to reproduce the impact factor, it was nearly 40% off (citation needed, but I lost the link). Not only that, but it’s an impact factor of the journal, not the scientists themselves.
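For reference, the commonly cited two-year impact factor boils down to a simple ratio: citations received in a given year by items a journal published in the two preceding years, divided by the number of citable items in those two years. A minimal sketch (the numbers are made up for illustration):

```java
public class ImpactFactor {
    // Two-year journal impact factor: citations received this year to items
    // published in the two preceding years, divided by the number of
    // citable items published in those two years.
    public static double twoYearImpactFactor(int citationsToPreviousTwoYears,
                                             int citableItemsInPreviousTwoYears) {
        return (double) citationsToPreviousTwoYears / citableItemsInPreviousTwoYears;
    }

    public static void main(String[] args) {
        // e.g. 600 citations in 2016 to articles from 2014-2015,
        // with 200 such citable articles, gives an impact factor of 3.0
        System.out.println(twoYearImpactFactor(600, 200));
    }
}
```

The hard part, of course, is not the arithmetic but the citation database behind it – which is exactly what the publishers control.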

So the fact that publishers are the judge, jury and executioner, means they can make huge profits without adding much value (and yes, they allow searching through the entire collection they have, but full-text search on a corpus of text isn’t exactly rocket science these days). That means scientists don’t have access to everything they may need, and that poor universities won’t be able to keep up. Not to mention individual researchers who are just left out. In general, science suffers from the inefficient sharing and assessment of research.

The situation is even worse, actually – due to the lack of incentive for publishers to change their process (among other things), as a popular journal editor once said – “much of the scientific literature, perhaps half, may simply be untrue”. So the fact that you are published in a somewhat impactful journal doesn’t mean your publication has been thoroughly reviewed, nor that the reviewers bear any responsibility for their oversights.

Many discussions have been held about why disruption hasn’t yet happened in this apparently broken field. And it’s most likely because of the “chicken and egg problem” – scientists have an incentive to publish to journals because of the impact factor, and that way the impact factor is reinforced as a reputation metric.

Then comes open access – a movement that requires scientific publications to be publicly accessible. There are multiple organizations/initiatives that support and promote open access, including EU’s OpenAIRE. Open access comes in two forms:

  • “green open access”, or “preprints” (yup, “print” is still an important word) – you just push your work to an online repository – it’s not checked by editors or reviewers, it just stays there.
  • “gold open access” – the author/library/institution pays a processing fee to publish the publication and then it becomes public. Important journals that use this include PLOS, F1000 and others

The “gold open access” solves almost nothing, as it just shifts the fees (maybe it reduces them, but again – a processing fee to get something published online, really?). The “green open access” doesn’t give you the reputation benefits – preprint repos don’t have an impact factor. Despite that, it’s still good to have the copies available, which some projects (like OABOT and ArchiveLab) try to do.

Then there’s Google Scholar, which has agreements with publishers to aggregate their content and provide search results (not the full publications). It also provides some citation metrics on top of that. It forms a researcher profile based on that, which can actually be used as a replacement for the impact factor.

Because of that, many attempts have been made to either “revolutionize” scientific publishing, or augment it with additional services that have the potential to one day become prevalent and take over the process. I’ll try to summarize the various players:

  • preprint repositories – this is where scientists publish their works before submitting them to a journal. The major player is arXiv, but there are others as well (list, map)
  • scientific “social networks” – Academia.edu and ResearchGate offer a way to connect with fellow researchers and share your publications, thus having a public researcher profile. Scientists get analytics about the number of reads their publications get and notifications about new research they might be interested in. They are similar to preprint repos, as they try to get hold of a lot of publications.
  • services which try to completely replace the process of scientific publishing – they try to be THE service where you publish, get reviewed and get a “score”. These include SJS and The Winnower; Academia.edu and ResearchGate can possibly fit in this category as well, as they offer some way of giving feedback (and plan, or already have, peer review) and/or some score (RG score).
  • tools to support researchers – Mendeley (a personal collection of publications), Authorea (a tool for collaboratively editing publications), Figshare (a place for sharing auxiliary materials like figures, datasets, source code, etc.), Zenodo (a data repository), Publons (a system to collect everyone’s peer reviews), Open Science Framework (sets of tools for researchers), Altmetric (a tool to track the activity around research), Scholarpedia and OpenWetWare (wikis)
  • impact calculation services – in addition to the RG score, there’s ImpactFactory
  • scientist identity – each of the social networks try to be “the profile page” of a scientist. Additionally, there are the identifiers such as ORCID, researcherId, and a few others by individual publishers. Maybe fortunately, all are converging towards ORCID at the moment.
  • search engines – Google Scholar, Microsoft Academic, Science Direct (by Elsevier), Papers, PubPeer, CrossRef, PubMed, Base Search, CLOCKSS, (AI for analyzing scientific texts) and of course Sci-Hub – which mostly rely on contracts with publishers (with the exception of Sci-Hub)
  • journals with a more modern, web-based workflow – F1000Research, Cureus, Frontiers, PLoS

Most of these services are great and created with the real desire to improve the situation. But unfortunately, many have problems. ResearchGate has been accused of too much spamming and its RG score is questionable; Academia.edu is accused of having too many fake accounts for the sake of making investors happy; Publons is a place where peer review should be something you brag about, yet very few reviews are made public by the reviewers (which signifies a cultural problem). SJS and The Winnower have too few users, and the search engines are dependent on the publishers. Mendeley and others were acquired by the publishers, so they no longer pose a threat to the existing broken model.

Special attention has to be paid to Sci-Hub. The “illegal” place where you can get the knowledge you want to find. Alexandra Elbakyan created Sci-Hub, which automatically collects publications through library and university networks, using credentials donated by researchers. That way all of the content is public and searchable by DOI (the digital identifier of an article, which, by the way, is also a broken concept, because in order to give your article an identifier, you need to pay for a “range”). So Sci-Hub seems like a good solution, but doesn’t actually fix the underlying workflow. It has been sued and its original domain(s) taken, so it’s something like The Pirate Bay for science – it takes effort and idealistic devotion to stay afloat.

The lawsuits against Sci-Hub, by the way, are an interesting thing – publishers want to sue someone for giving access to content that they themselves have taken for free from scientists. Sounds fair, and the publishers are totally not “evil”, right?

I have had discussions with many people, and read a lot of articles discussing the disruption of the publishing market (here, here, here, here, here, here, here). And even though some of the articles are from several years ago, the change isn’t yet here.

Approaches that are often discussed are the following, and I think neither of them is working:

  • have a single service that is a “mega-journal” – you submit, get reviewed, get searched, get listed in news sections about your area and/or sub-journals. “One service to rule them all”, i.e. a monopoly, is also not good in the long term, even if the intentions of its founders are good (initially)
  • have tools that augment the publishing process in the hope of getting more traction and thus gradually getting scientists to change their behaviour – I think the “augmenting” services begin with the premise that the current system cannot be easily disrupted, so they should at least provide some improvement on it and ease of use for the scientists.

On the plus side, it seems that some areas of research almost exclusively rely on preprints (green open access) now, so publishers have a diminishing influence. And occasionally someone boycotts them. But that process is very slow. That’s why I wanted to do something to help make it faster and better.

So I created a WordPress plugin (source). Yes, it’s that trivial. I started with a bigger project in mind and even worked on it for a while, but it was about to end up in the first category above, of “mega-journal”, and that seems to have been tried already, hasn’t been particularly successful, and is risky long term (in terms of centralizing power).

Of course a WordPress plugin isn’t a new idea either. But all attempts that I’ve seen either haven’t been published, or provide just extras and tools, like reference management. My plugin has three important aspects:

  • JSON-LD – it provides semantic annotations for the scientific content, making it more easily discoverable and parseable
  • peer review – it provides a simple, post-publication peer review workflow (which is an overstatement for “comments with extra parameters”)
  • it can be deployed by anyone – both as a personal website of a scientist and as library/university-provided infrastructure for scientists. Basically, you can have a WordPress installation + the plugin, and get green open access + basic peer review for your institution. For free.
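To illustrate the JSON-LD aspect, here is a rough sketch of the kind of schema.org ScholarlyArticle annotation such a plugin could emit into a page. The exact properties the plugin uses may differ; the class and method names here are mine, not the plugin’s:

```java
public class JsonLdSketch {
    // Builds a minimal schema.org ScholarlyArticle JSON-LD block,
    // embeddable in a page's <head>, linking the author to their ORCID.
    public static String scholarlyArticleJsonLd(String title, String authorName, String orcid) {
        return "<script type=\"application/ld+json\">\n"
            + "{\n"
            + "  \"@context\": \"http://schema.org\",\n"
            + "  \"@type\": \"ScholarlyArticle\",\n"
            + "  \"name\": \"" + title + "\",\n"
            + "  \"author\": {\n"
            + "    \"@type\": \"Person\",\n"
            + "    \"name\": \"" + authorName + "\",\n"
            + "    \"sameAs\": \"https://orcid.org/" + orcid + "\"\n"
            + "  }\n"
            + "}\n"
            + "</script>";
    }

    public static void main(String[] args) {
        System.out.println(scholarlyArticleJsonLd(
            "A Sample Paper", "Jane Doe", "0000-0002-1825-0097"));
    }
}
```

A search engine (or any crawler) can parse such a block without having to scrape the article HTML itself.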

What is the benefit of the semantic part? I myself have argued that the semantic web won’t succeed anytime soon because of a chicken-and-egg problem – there is no incentive to “semanticize” your page, as there is no service to make use of it; and there are no services, because there are no semantic pages. There is also a lot of complexity in making something “semantic” (RDF and related standards are anything but webmaster-friendly). There are niche cases, however. The Open Graph protocol, for example, makes a web page “shareable on facebook”, so webmasters have the incentive to add these tags.

I will soon contact Google Scholar, Microsoft Academic and other search engines to convince them to index semantically-enabled web-published research. The point is to have an incentive, just like with the facebook example, to use the semantic options. I’ll also get in contact with ResearchGate/Academia/Arxiv/etc. to suggest the inclusion of semantic annotations and/or JSON-LD.

The general idea is to have green open access with online post-publication peer review, which in turn lets services make profile pages and calculate (partial) impact scores, without reliance on the publishers. It has to be easy, and it has to include libraries as the main contributor – they have the “power” to change the status quo. And supporting a WordPress installation is quite easy – a library, for example, can set up one for all of the researchers in the institution and let them publish there.

A few specifics of the plugin:

  • the name “scienation” comes from “science” and either “nation” or the “-ation” suffix.
  • it uses URLs as article identifiers (which is compatible with DOIs that can also be turned into URLs). There is an alternative identifier, which is the hash of the article (text-only) content – that way the identifier is permanent and doesn’t rely on one holding a given domain.
  • it uses ORCID as an identity provider (well, not fully, as the OAuth flow is not yet implemented – it requires a special registration, which won’t be feasible for now). One has to enter their ORCID in a field and the system will assume it’s really them. This may be tricky and there may be attempts to publish a bad peer review on behalf of someone else.
  • the hierarchy of science branches is obtained from Wikipedia, combined with other small sources.
  • the JSON-LD properties in use are debatable (sample output). I’ve started a discussion on having additional, more appropriate properties in schema.org’s ScholarlyArticle. I’m aware of ScholarlyHTML (here, here and here – a bit confusing which is “correct”), the codemeta definitions and the scholarly article ontology. They are very good, but their purpose is different – to represent the internal details of a scientific work in a structured way. There is probably no need for that if the purpose is to make the content searchable and to annotate it with metadata like authors, id, peer reviews and citations. Still, I reuse the standard ScholarlyArticle definition and will gladly accept anything else that is suitable for the use case.
  • I got the domain (nothing to be seen there currently) and one can choose to add their website to a catalog that may be used in the future for easier discovery and indexing of semantically-enabled websites.
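The content-hash identifier mentioned above can be sketched as follows – hash the text-only content with SHA-256, so the identifier stays stable even if the article moves to another domain. The class and method names are hypothetical, not the plugin’s actual code:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ArticleIdentifier {
    // Returns a permanent, domain-independent identifier for an article:
    // the hex-encoded SHA-256 hash of its text-only content.
    public static String contentHash(String articleText) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(articleText.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is guaranteed to be available in every JVM
            throw new IllegalStateException(e);
        }
    }
}
```

Two copies of the same text produce the same identifier regardless of where they are hosted, which is exactly the property a URL-based identifier lacks.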

The plugin is open source, licensed under GPL (as is required by WordPress), and contributions, discussions and suggestions are more than welcome.

I’m well aware that a simple WordPress plugin won’t fix the debacle that I’ve described in the first part of this article. But I think the right approach is to follow the principle of decentralization and reliance on libraries and individual researchers, rather than on (centralized) companies. The latter has so far proved inefficient and actually slows science down.


I Stopped Contributing To Stackoverflow, But It’s Not Declining

September 26, 2016

“The decline of Stackoverflow” is now trending on reddit, and I started this post as a comment in the thread, but it got too long.

I’m in the 0.01% (which means rank #34) but I haven’t contributed almost anything in the past 4 years. Why I stopped is maybe part of the explanation of why “the decline of stackoverflow” isn’t actually happening.

The mentioned article describes the experience of a new user as horrible – you can’t easily ask a question without having it downvoted, marked as duplicate, or commented on in a negative way. The overall opinion (of the article and the reddit thread) seems to be that SO’s “elite” (the moderators) has become too self-important and is acting on a whim for an alleged “purity” of the site.

But that’s not how I see it, even though I haven’t been active since “the good old days”. This Hacker news comment has put it very well:

StackOverflow is a machine designed to do one thing: make it so that, for any given programming question, you will get a search engine hit on their site and find a good answer quickly. And see some ads.
That’s really it. Everything it does is geared toward that, and it does it quite well.
I have lots of SO points. A lot of them have come from answering common, basic questions. If you think points exist to prove merit, that’s bad. But if you think points exist to show “this person makes the kind of content that brings programmers to our site and makes them happy”, it’s good. The latter is their intent.

So why did I stop contributing? There were too many repeating questions/themes, poorly worded, too many “homework” questions, and too few meaningful, thought-provoking questions. I’ve always said that I answer stackoverflow questions not because I know all the answers, but because I know a little more about the subject than the person asking. And those seemed to give way (in terms of percentage) to “null pointer exception”, “how to fix this [40 lines pasted] code” and “Is it better to have X than Y [in a context that only I know and I’m not telling you]”. (And here’s why I don’t agree that “it’s too hard to provide an answer on time” – if it’s not one of the “obvious” questions, you have plenty of time to provide an answer.)

And if we get back to the HN quote – the purpose of the site is to provide answers to questions. If the questions are already answered (and practically all of the basic ones are), you should have found the answer, rather than asking it again. Because of that, maybe sometimes non-trivial questions get mistaken for “oh, not another null pointer exception”, in which case I’ve been actively pointing out that this is the case and voting to reopen. But that’s rare. All the examples in the “the decline of stackoverflow” article and in the reddit thread are, I believe, edge cases (and one is a possible “homework question”). Maybe these “edge cases” are now more prevalent than when I was active, but I think the majority of the new questions are still coming from people too lazy to google one or two different wordings of their problem. Which is why I even summarized the basic steps of investigating a problem before asking on SO.

So I wouldn’t say the moderators are self-made tyrants that are hostile to anyone new. They just have a knee-jerk reaction when they see yet-another-duplicate-or-homework-or-subjective question.

And that’s not simply for the sake of purity – the purpose of the site is to provide answers. If the same question exists in 15 variations, you may not find the best answer (it has happened to me – I find three questions that for some reason aren’t marked as duplicates – one contains just a few bad answers, and another one has the solution. If Google happens to place the former on top, one may think it’s actually a hard question).

There are always “the trolls”, of course – I have been serially downvoted (so all questions about serial downvoting are duplicates), and I have even had personal trolls who write comments on all my recent answers. But…that’s the internet. And those get filtered quickly, no need to get offended or think that “the community is too hostile”.

In the past week I’ve been doing a WordPress plugin as a side project. I haven’t programmed in PHP in 4 years and I’ve never written a WordPress plugin. I had a lot of questions, but guess what – all of them were already answered, either on stackoverflow, or in the documentation, or in some blog post. We shouldn’t assume our question is unique and rush to asking it.

On the other hand, even the simplest questions are not closed just because they are simple. One of my favourite examples is the question whether you need a null check before an instanceof check. My answer is number 2, with a sarcastic comment that this could be tested in an IDE in a minute. And a very good comment points out that it takes less than that to get the answer on Stackoverflow.
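For completeness, the answer to that question is that no explicit null check is needed – in Java, instanceof simply evaluates to false when the reference is null:

```java
public class InstanceofNullCheck {
    public static void main(String[] args) {
        Object maybe = null;
        // instanceof evaluates to false for null references,
        // so no null check is needed before it:
        System.out.println(maybe instanceof String); // false

        maybe = "hello";
        System.out.println(maybe instanceof String); // true
    }
}
```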

It may seem that most of the questions are already answered now. And that’s probably true for the general questions, for the popular technologies. Fortunately our industry is not static and there are new things all the time, so stackoverflow is going to serve those.

It’s probably a good idea to have different rules/thresholds for popular tags (technologies) and less popular ones. If there’s a way to differentiate trivial from non-trivial questions, answers to the non-trivial ones could be rewarded with more reputation. But I don’t think radical changes are needed. It is inevitable that after a certain “saturation point” there will be fewer contributors and more readers.

Bottom line:

  • I stopped contributing because it wasn’t that challenging anymore and there are too many similar, easy questions.
  • Stackoverflow is not declining, it is serving its purpose quite well.
  • Mods are not evil jerks that just hate you for not knowing something.
  • Stackoverflow is a little more boring for contributors now than it was before (which is why I gradually stopped answering), simply because most of the general questions have already been answered. The niche ones and the ones about new technologies remain, though.

Traditional Web Apps And RESTful APIs

September 23, 2016

When we are building web applications these days, it is considered a best practice to expose all our functionality as a RESTful API and then consume it ourselves. This usually goes with a rich front-end using heavy javascript, e.g. Angular/Ember/Backbone/React.

But a heavy front-end doesn’t seem like a good default – applications that require the overhead of a conceptually heavy javascript framework are actually not in the majority. The web, although much more complicated, is still not just about single-page applications. Not to mention that if you are writing a statically-typed backend, you would either need a dedicated javascript team (not necessarily a good idea, especially in small companies/startups), or you have to write in that … not-so-pleasant language. And honestly, my browsers are hurting with all that unnecessary javascript everywhere, but that’s a separate story.

The other option for having yourself consume your own RESTful API is to have a “web” module that calls your “backend” module. That may be a good idea, especially if you have different teams with different specialties, but introducing so much communication overhead for the sake of separation is something one should at least think twice about. Not to mention that in reality release cycles are usually tied, as you need extra effort to keep the “web” and “backend” in proper sync (the “web” not requesting services that the “backend” doesn’t have yet, or the “backend” not providing a modified response model that the “web” doesn’t expect).

As in my defence of monoliths, I’m obviously leaning towards a monolithic application. I won’t repeat the other post, but the idea is that an application can be modular even if it’s run in a single runtime (e.g. a JVM). Have your “web” package, have your “services” package, and these can be developed independently, even as separate (sub-) projects that compile into a single deployable artifact.

So if you want to have a traditional web application – request/response, a little bit of ajax, but no heavy javascript fanciness and no architectural overhead, and you still want to expose your service as a RESTful API, what can you do?

Your web layer – the controllers, working with request parameters coming from form submissions and rendering a response using a template engine – normally communicates with your service layer. So for your web layer, the service layer is just an API. It consumes it via method calls inside a JVM. But that’s not the only way that service layer can be used. Frameworks like Spring-MVC, Jersey, etc., allow annotating any method and exposing it as a RESTful service. Normally it is accepted that a service layer is not exposed as a web component, but it can be. So – you consume the service layer API via method calls, and everyone else consumes it via HTTP. The same definitions, the same output, the same security. And you won’t need a separate pass-through layer in order to have a RESTful API.
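A minimal sketch of that idea, using only the JDK’s built-in HTTP server rather than Spring-MVC or Jersey (a real setup would use the framework’s annotations; the class, method and path names here are illustrative): the same service method is consumed via a direct call inside the JVM and also exposed over HTTP:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class DualUseService {
    // The "service layer": plain business logic, called directly by the
    // web layer inside the JVM.
    public static String greet(String name) {
        return "Hello, " + name;
    }

    public static void main(String[] args) throws Exception {
        // In-JVM consumer (the traditional web layer) just calls the method:
        System.out.println(greet("web layer"));

        // External consumers get the same logic over HTTP. A framework
        // would generate this wiring from annotations; this is the
        // hand-rolled equivalent. Port 0 picks any free port.
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/greet", exchange -> {
            String name = exchange.getRequestURI().getQuery(); // e.g. ?John
            byte[] body = greet(name).getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        System.out.println("REST endpoint on port " + server.getAddress().getPort());
        server.stop(0);
    }
}
```

Both consumers go through the exact same `greet` method, so the business logic (and any security applied to it) cannot drift between the two.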

In theory that sounds good. In practice, the annotations that turn the method into an endpoint may introduce problems – is serialization/deserialization working properly, are the headers properly handled, is authentication correct. And you won’t know that these aren’t working if you are using the methods only inside a single JVM. Yes, you will know they work correctly in terms of business logic, but the RESTful-enabling part may differ.

That’s why you need full coverage with acceptance tests. Something like cucumber/JBehave to test all your exposed endpoints. That way you’ll be sure that both the RESTful aspects, and the business logic work properly. It’s actually something that should be there anyway, so it’s not an overhead.

Another issue is that you may want to deploy your API separately from your main application. You may want to have just the API running in one cluster, and your application running in another. And that’s no problem – you can simply disable the “web” part of your application with a configuration switch and deploy the very same artifact multiple times.

I have to admit I haven’t tried that approach, but it looks like a simple way that would still cover all the use-cases properly.