Comments on The Twelve-Factor App

August 22, 2015

The Twelve-Factor App is a recent methodology (and/or a manifesto) for writing web applications that is hopefully getting quite popular. Although I don’t agree 100% with the recommendations, I’ll quickly go through all 12 factors and discuss them in the light of the Java ecosystem, mentioning the absolute “musts” and the points where I disagree. For more, visit the site.

  1. Codebase – one codebase, multiple deploys. This means you must not have various codebases for various versions. Branches are okay, different repos are not. I’d even go further and not recommend Subversion. Not because it’s not fine, but because git and mercurial do the same and much more. You can use git/mercurial the way you use SVN, but not the other way around. And tools for DVCS (e.g. SourceTree) are already quite good.
  2. Dependencies – obviously, you must put as many dependencies in your manifests (e.g. pom.xml) as possible. The manifesto advises against relying on pre-installed software, for example ImageMagick or Pandoc, but I wouldn’t be that strict. If your deployments are automated and you guarantee the presence of a given tool, you shouldn’t spend days trying to wrap it in a library of your working language. If it’s as easy as putting an executable script in a jar file and then extracting it, that’s fine. But if it requires installation, and you really need it (ImageMagick is a good example indeed), I don’t think it’s wrong to expect it to be installed. Just check on startup if it’s present and fail fast if it’s not.
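    Such a fail-fast startup check can be as simple as trying to run the tool. A minimal sketch (ImageMagick’s convert binary is just an example; any required executable would do):

```java
import java.io.IOException;

public class ToolCheck {

    // Tries to launch the given executable; returns false if it's not on the PATH
    static boolean isAvailable(String executable) {
        try {
            Process process = new ProcessBuilder(executable, "-version").start();
            process.waitFor();
            return true;
        } catch (IOException e) {
            return false; // binary not found (or not executable)
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // fail fast on startup if a required tool is missing
        if (!isAvailable("convert")) {
            System.err.println("ImageMagick is required but was not found");
            // in a real application: throw new IllegalStateException(...)
        }
    }
}
```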
  3. Config – the most important rule here is – never commit your environment-specific configuration (most importantly: passwords) in the source code repo. Otherwise your production system may be vulnerable, as are probably at least a third of these wordpress deployments (and yes, mysql probably won’t allow external connections, but I bet nobody has verified that).

    But from there on my opinion differs from that of the 12-factor app. No, you shouldn’t use environment variables for your configuration. Because when you have 15 variables, managing them becomes way easier if they are in a single file. You can have some shell script that sets them all, but that goes against the OS independence. Having a key-value .properties file (for which Java has native support), and only passing the absolute path to that file as an environment variable (or JVM param), is a better approach, I think. I’ve discussed it previously. E.g. CONFIG_PATH=/var/conf/, which you load on startup.
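    A sketch of that approach (assuming CONFIG_PATH points at the properties file itself, with a config.path JVM param as a fallback – the names are just examples):

```java
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Properties;

public class Config {

    // Loads all environment-specific key-value pairs from the file whose
    // location is given via the CONFIG_PATH environment variable
    // (or, as a fallback, the -Dconfig.path JVM parameter)
    static Properties load() throws IOException {
        String path = System.getenv("CONFIG_PATH");
        if (path == null) {
            path = System.getProperty("config.path");
        }
        if (path == null) {
            throw new IllegalStateException("No configuration location specified");
        }
        Properties properties = new Properties();
        try (Reader reader = new FileReader(path)) {
            properties.load(reader);
        }
        return properties;
    }
}
```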

    And in your application you can keep a blank properties file which contains a list of all properties to be configured – database credentials, keys and secrets for external systems, etc. (without any values). That way you have all the properties in one place and it’s very easy to discover what you may need to add/reconfigure in a given scenario. If you use environment variables, you’d have to have a list of them in a txt file in order to make them “discoverable”, or alternatively, let the developers dig into the code to find out which properties are available.
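    Such a blank file might look like this (the property names are, of course, hypothetical):

```properties
# all properties to be configured per environment; values left blank on purpose
db.url=
db.username=
db.password=
payment.api.key=
payment.api.secret=
smtp.host=
```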

    And last, but not least – when I said that you shouldn’t commit properties files to source control, there is one very specific exception. You can choose to version your environment configurations. It must be a private repo, with limited access and all that, but the (Dev)Ops can have a place where they keep the properties and other specifics for each environment, versioned. It’s easier to have that with a properties file (not impossible with env variables, but then again you need a shell script).

    The 12-factor app authors warn about explosion of environments. If you have a properties file for each environment, these may grow. But they don’t have to. You can change the values in a properties file exactly the way you would manage the environment variables.

  4. Backing Services – it’s about treating the external services that your application depends on uniformly, regardless of whether you manage them, or whether another party manages them. From the application’s perspective that should not matter. What I can add here is that you should try to minimize this. If an in-memory queue would do, don’t deploy a separate MQ. If an in-memory cache would do, don’t deploy a redis instance. If an embedded database would do, don’t manage a DB installation (e.g. neo4j offers an embedded variant). And so on. But if you do need the full-featured external service, make the path/credentials to it configurable as if it’s external (rather than, for example, pointing to localhost by default).
  5. Build, release, run – it is well described on the page. It is great to have such a lifecycle. But it takes time and resources to set it up. Depending on your constraints, you may not have the full pipeline, and some stages may be more manual and fluid than ideal. Sometimes, for example in the early stages of a startup, it may be beneficial to be able to swap class files or web pages on a running production server, rather than going through a full release process (which you haven’t had the time to fully automate). I know this sounds like heresy, and one should strive for a fully automated and separated process, but before getting there, don’t entirely throw away the option for manually dropping a fixed file in production. As long as you don’t do it all the time and you don’t end up with a production environment for which you have no idea what version of the codebase is running.
  6. Processes – this is about being stateless, and also about not relying on any state being present in memory or on the file system. And indeed, state does not belong in the code.

    However, there’s something I don’t agree with. The 12-factor preferred way of packaging your assets is during build time (merging all css files into one, for example). That has several drawbacks – you can’t combine assets dynamically, e.g. if you have 6 scripts, and on one page you need 4, on another page you need 2 of the ones used on the first page, and another 2, then you have to build all these permutations beforehand. Which is fine and works, but why is it needed? There is no apparent benefit. And depending on the tools you use, it may be easier to work with a CDN if you are dynamically generating the bundles.

    Another thing where further Java-related details can be given is “sticky sessions”. It’s not a good idea to have them, but note that you can still use your session to store data about the user in memory. You just have to configure your servlet container (or application server) to share that state. Basically, under the hood it still uses a distributed cache like memcached or ehcache (I guess you could also use a redis implementation of the session clustering). It’s just transparent to the developer, who can still use the session store.
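    With Tomcat, for example, that means marking the application as distributable and enabling clustering. A minimal sketch (the real cluster setup has many more options to tune):

```xml
<!-- web.xml: allow the container to replicate the session state -->
<distributable/>

<!-- server.xml, inside the <Engine> or <Host> element -->
<Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster"/>
```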

  7. Port Binding – this is about having your application standalone, instead of relying on a running instance of an application server, where you deploy. While that seems easier to manage, it isn’t. Starting a servlet container and pushing a deployment is just as easy. But in order to have your application bind to a port, you need to have the tooling for that. They mention jetty, and there is also an embedded version of tomcat, and spring-boot (which wraps both). And while I’m not against the port binding, I’d say it’s equally good to have it the other way around. Container configuration is done equally easily, regardless of whether you drop an environment-specific xml file, or do it programmatically and load the properties from the file mentioned in point 3. The point is – it doesn’t matter – do whichever is easier for you. Not to mention that you may need some apache/nginx functionality.
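    Just to illustrate the principle (a deliberately simplified sketch using the JDK’s built-in HttpServer rather than a real servlet container), binding to a port yourself takes a few lines:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;

public class StandaloneApp {

    // Starts an HTTP server bound to the given port (0 picks a free port)
    static HttpServer start(int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/", exchange -> {
            byte[] body = "OK".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

With embedded jetty or tomcat (or spring-boot) the structure is similar – instantiate the server, configure a port and a context, start, await.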
  8. Concurrency – it’s about using native processes. This, I think, isn’t so relevant to a Java runtime, which uses threads under the hood and hides away the unix process. By the way, another explicit reference to unix (rather than staying OS-independent).
  9. Disposability – that’s about embracing failure. Your system must work fine even though one or more of application instances die. And that’s bound to happen, especially “in the cloud”. They mention SIGTERM, which is a *nix-specific signal, whereas the general idea of the 12-factor app is to be OS-independent. There is an apparent leaning towards Linux, which is fine though.
  10. Dev/prod parity – your development environment should be almost identical to the production one (for example, to avoid some “works on my machine” issues). That doesn’t mean your OS has to be the OS running in production, though. You can run Windows, for example, and have your database, MQ, etc. running on a local virtual machine (like my setup). This also underlines the OS-independence of your application. Just have in mind to keep the versions the same.
  11. Logs – the 12-factor app recommends writing all logging information to the system out. A Java developer will rightly disagree. With tools like logback/slf4j you can manage the logging aspects within the application, rather than relying on 3rd party tools to do that – e.g. log rotation and cleanup, or sending to a centralized logging facility. It’s much easier to configure a graylog or splunk adapter than having another process gather that from system out and push it. There can be environment-specific log configurations, which is again just one file bundled together with the application. If that seems complicated, consider the complications of setting up whatever is going to capture the output.
  12. Admin processes – generally agreed, but in addition I’d say it’s preferable to execute migrations on deployment or startup, rather than manually, and that manually changing “stuff” on production should preferably be done through something like capistrano in order to make sure it’s identical on all instances.

Overall, it’s a good set of advice and an approach to building apps that I’d recommend, with the above comments in mind.


A Software Engineer As a High-Level Government Adviser

August 13, 2015

Two months ago I took the job of adviser to the cabinet of the deputy prime minister of my country (the Republic of Bulgaria, an EU member). And I’d like to share my perspective as a technical person, as well as some of my day-to-day activities which might be of interest.

How does a software engineer get to such a position in the first place? Some people from an NGO that I used to be part of (including myself) communicated our open source campaign to the interim government and also built the OpenData portal of Bulgaria (based on CKAN). We continued the communication with the newly elected government and helped with opendata-related stuff, so several months later we got a proposal for a part-time advisory position. And I took it, reducing my software engineering job to four hours. I don’t have to mention that hiring a 27-year-old software engineer isn’t something a typical government would do, so that’s progress already.

What do I see? Slow waterfall processes, low-quality software, abandonware. Millions spent on hardware and software licenses which are then underutilized (to say the least). I knew that before, hence the push for open source and more agile processes. Basically, the common perception of the regular citizen is that millions have been spent on the so-called “e-government” and there is nothing coming out of it. And that’s mostly correct.

To be honest, I cannot yet tell why exactly. Is it the processes, is it the lack of technical expertise on the side of the civil service, is it the businesses’ inability to provide quality software, or is it corruption? Maybe a little bit of each. But as you can imagine, a part-time adviser cannot fix any of these things at scale. So what do I do?

Currently the most important task is finalizing two laws – the changes to the law for e-governance and the law for electronic identification. The former introduces an “e-governance” agency, which will oversee all software projects in the country, and the latter is about a scheme to allow citizens to be identified online (a step in the direction of my campaign for electronic identification throughout the EU). I’m not a lawyer, so the technical aspects that are put in the laws get phrased by lawyers.

I have to say that I’m not the main driver of all this – there are other advisers that do a lot of the work, one of whom is way more technically experienced than me (though not particularly in writing software).

The agency that is being introduced is supposed to act as something like a CIO and we are defining what it can and must do. Among more strategic things we also plan to task it with the development of open data, including providing help to all administrations (which is currently something we do, see below) as well as standardizing an open-source development workflow for bespoke software. As I’ve written in a previous post, we already have EU-approved requirements for software – it should be built in the open from day one. And the point of that is long-term stability of the software projects. Whether it’s going to be directly pushed to GitHub, or replicated there from an on-premise deployment (or vice-versa) is a matter of discussion.

The electronic identity is about giving each citizen the means to identify online in order to get services from the government. This includes the right of every citizen to access all data that the government has about them, and request correction or even possibly deletion. I am not a Big Brother fan and I’m trying to push things into a direction where convenience doesn’t mean breach of privacy.

I try to map existing infrastructure to an idea of an architecture and act whenever there’s a discrepancy or a missing bit. For example an important “quest” of mine is to allow each administration to access the data it needs for each citizen online. That may sound like the opposite direction of the last sentence in the previous paragraph, but it isn’t. The government already has that data. And with due procedures each civil servant can access it. What I’m trying to do is automate that access, again preserving all the due legal requirements (civil servants can only access data that they need by law, in order to fulfil a given service), and also keeping a log entry for each access. Then this access will be visible to citizens when they identify with their e-id card. And whenever someone is looking for data about you, you will be notified.

The security aspect is the most serious one and the most overlooked one, so I’m putting a lot of thought into that. Nobody should be able to just get a pair of credentials and read their neighbour’s medical record.

In order to get to such a time-saving, semi-automated solution, I speak to companies that develop the software that’s part of the existing infrastructure and advise for some tweaks. Things are a bit fuzzy, because very minor things (like not using digital signatures to sign information requests) can break the whole idea. And that’s why, I think, a technical person is needed on such a high level, so that we don’t get another abandonware just because a hundred lines of code are missing.

Other things that I do:

  • open data – whenever an administration needs technical help with exporting, I should help. For example, I’ve written a php converter for Excel documents to proper CSV, because Excel’s “save as .csv” functionality is broken – it saves files in non-UTF-8 encodings and uses semicolons instead of commas (depending on regional settings). And since much of the data is currently in Excel files, exporting to a machine-readable csv should go through some “correction” script. Another thing is helping with “big, fat” SQL queries to extract relevant data from ages-old databases. So actual programming stuff.
  • case study for introducing electronic document process in the administration of the Council of Ministers. That is more on the business analysis side, but still needs technical “eyes”
  • ongoing projects – as mentioned above, I speak to companies that are creating software for the government and I give feedback. This is rather “rudimentary” as I don’t have an official say of what should and what should not be done, but I hope fellow software engineers see it as a good input, rather than an attempt for interference
  • some low-hanging fruit. For example I wrote an automated test of a list of 600 government websites and it turned out that 10% do not work with or without “www” in the URL. Two are already fixed, and we are proceeding to instruct the institutions to fix the rest.
  • I try to generate new project ideas that can help. One of which is the development portal. Currently companies communicate in an ad-hoc way, which means that if you need a library for accessing a given service, you call the other company and they send you a jar via email. Or if you have a question, only you get to know the answer, and other companies must ask for themselves. The dev portal is meant to be a place for providing SDKs for inter-system communication and also serve as a Q&A site, where answers are accessible to all the companies that work on e-government projects.
  • Various uncategorizable activities, like investigating current EU projects, discussing budgeting of software projects, writing an egov roadmap, and general “common sense” stuff

I use a private Trello board to organize the tasks, because they are really diverse and I could easily forget one of the 6 ongoing tasks. And that’s, by the way, one of the challenges – things happen slowly, so my Trello column “Waiting for” is as full as the “In Progress” one. And that’s expected – I can’t just add two points to a law project and forget about it – it has to follow the due process.

So it may seem that so far I haven’t achieved anything. But “wheels are in motion”, if I may quote my ever more favourite “Yes, Minister” series. And my short term goal is to deploy usable systems which the administration can really use in order to make both their own lives and the lives of citizens easier, by not asking them to fill dozens of documents with data that is available on a server two streets away. And if it happens that I have to write a piece of code in order to achieve that, rather than go through a 9-month official “upgrade” procedure, I’m willing to do that. And fortunately I feel I have the freedom to.

In software development, starting from a grand plan usually doesn’t get you anywhere. You should start small and expand. But at the same time you should have a grand plan in mind, so that along the way you don’t make stupid decisions. Usable, workable, pieces of the whole puzzle can be deployed and that’s what I’m “advising” (and acting) for. And it’s surprisingly interesting so far. Maybe because I still have my software development job and I don’t get to miss writing code.


Events Don’t Eliminate Dependencies

August 2, 2015

Event (or message) driven systems (in their two flavors) have some benefits. I’ve already discussed why I think they are overused. But that’s not what I’m going to write about now.

I’m going to write (very briefly) about “dependencies” and “coupling”. It may seem that when we eliminate the compile-time dependencies, we eliminate coupling between components. For example:

class CustomerActions {
  void purchaseItem(int itemId) {
    purchaseService.makePurchase(itemId, userId);
  }
}


class CustomerActions {
  void purchaseItem(int itemId) {
    queue.sendMessage(new PurchaseItemMessage(itemId, userId));
  }
}
It looks as though your CustomerActions class no longer depends on a PurchaseService. It doesn’t care who will process the PurchaseItem message. There will certainly be some PurchaseService out there that will handle the message, but the former class is not tied to it at compile time. Which may look like a good example of “loose coupling”. But it isn’t.

First, the two classes may be loosely coupled in the first place. The fact that one interacts with the other doesn’t mean they are coupled – they are free to change independently, as long as the PurchaseService maintains the contract of its makePurchase method.

Second, having eliminated the compile-time dependencies doesn’t mean we have eliminated logical dependencies. The event is sent, and we need something to receive and process it. In many cases that’s a single target class, within the same VM/deployment. And the Wikipedia article on coupling defines a way to measure it in terms of the data. Is it different in the two approaches above? No – in the first instance we will have to change the method definition, and in the second instance – the event class definition. And we will still have a processing class whose logic we may also have to change after changing the contract. In a way, the former class still depends logically on the latter class, even though that’s not explicitly realized at compile time.

The point is, the logical coupling remains. And simply moving it into an event doesn’t give the “promised” benefits. In fact, it makes the code harder to read and trace. While in the former case you’d simply ask your IDE for a call hierarchy, it may be harder to trace who produces and who consumes the given message. The event approach has some pluses – events may be pushed to a queue, but so can direct invocations (through a proxy, for example, as spring does with just a single @Async annotation).

Of course that’s a simplified use-case. More complicated ones would benefit from an event-driven approach, but in my view these use-cases rarely cover the whole application architecture; they are most often better suited for specific problems, e.g. the NIO library. And I’ll continue to perpetuate this common sense mantra – don’t do something unless you know what exactly are the benefits it gives you.


Government Abandonware

July 25, 2015

Governments order software for their allegedly very specific needs. And the EU government (The European Commission) orders “visionary” software that will supposedly one day be used by many member states that will be able to communicate with each other.

Unfortunately, a lot of that software is abandonware. It gets built, the general public doesn’t see the results, and it dies. A while ago I listed a couple of abandoned projects of the EC. Some of them, I guess, are really research projects and their results may be important for further development. But I couldn’t find the results. All there is are expired domains and possibly lengthy documents. But 100-page documents about software are not software. I haven’t seen any piece of code whatsoever. The only repo that I found is that of OntoGov, but all it contains is zip files with “bin” in their names.

Even though unused, the software maybe isn’t lost (although the agency behind some of the projects doesn’t exist anymore), and may be utilized by new projects or even businesses. But it’s just not accessible to the general public. It’s probably hidden in a desk, in a dark basement, with a leopard warning on the door (as in the Hitchhiker’s Guide to the Galaxy).

But the problem is not only on EU level. It is on national level as well. Since June I’ve been an adviser to the deputy prime minister of Bulgaria (which I’ll write about another time), and from the inside it’s apparent that software has been built over the years and then forgotten. I’m pretty sure this is true in most other countries as well.

Why is this bad? Not only because of the suboptimal public spending, but also because a lot of effort sinks into a bottomless pit. Software, even though not useful at the given moment, may be used as a base or a building block for future projects that will really be used. Now everything has to start from scratch.

A solution I’ve been advocating for a while is open source. And, getting back to the EU, there’s an “Open Source observatory” to which I’m subscribed, and I get news of all sorts of state-level and EU-level open-source initiatives. And the other day I saw one pretty good example of the state of affairs.

I read that an open-source tool for collaboratively editing laws is available. Which sounded great, because in the past weeks I’ve been participating in law-making and the process of sending Word document with “Track changes” via email among a dozen people is not the most optimal editing process a developer like me can imagine. So I eagerly opened the link, and…

It got to the front page of /r/opensource, and eventually, after my request to the admins, the page became accessible. And guess what? There is no link to the code or to the product, and the contact person in the link hasn’t answered my email. Yes, it’s holiday season, but that’s not the point. The point is this is not real open source. All dead software from my tweet above is gone and useless, even though, maybe, you can contact someone somewhere to get it. Update: a while later the source finally appeared, but as a zip file, rather than an open-source repo.

Not only that, but currently important EU projects like the ones under e-SENS (an umbrella for several cross-border access projects, e.g. e-procurement, e-justice, e-authentication (Stork)) are practically closed to the general public and even governments – it took me 10 days to get the reference implementation (which, according to the implementors, will be made public once the instructions for its use are ready, in a few months).

I hope I’m wrong, but my prediction is that these new projects will follow the fate of the countless other abandonware. Unless we change something. The solution is not just to write “open source” in the title. It should be really open source.

The good news, for Bulgaria at least, is that all new publicly funded projects will have to be open source from day one. Thus not only is the work less likely to die, but the process will also be transparent: what software gets built, with what level of quality and for how much money.

We are currently in the process of selecting the best solution to host all the repositories – either on-premise or SaaS, and we are also including requirements for supporting this solution in proposed amendments to the e-governance law.

I hope that this way we’ll get way more reusable code and way less abandonware. And I hope to extend this policy to an EU level – not just claiming it’s open in the title, but being really open. Not for the sake of being open, but for the sake of higher quality. It won’t necessarily prevent abandonware altogether (because useless software gets created all the time, and useful software gets forgotten), but it will reduce the wasted effort.


Tomcat’s Default Connector(s)

July 15, 2015

Tomcat has a couple of connectors to choose from. I’ll leave aside the APR connector, and focus on the BIO and NIO.

The BIO connector (blocking I/O) is blocking – it uses a thread pool where each thread receives a request, handles it, responds, and is returned to the pool. During blocking operations (e.g. reading from database or calling an external API) the thread is blocked.

The NIO connector (non-blocking I/O) is a bit more complicated. It uses the java NIO library and multiplexes between requests. It has two thread pools – one holds the poller threads, which handle all incoming requests and push these requests to be handled by worker threads, held in another pool. Both pool sizes are configurable.

When to prefer NIO vs BIO depends on the use case. If you mostly have regular request-response usage, then it doesn’t matter, and even BIO might be a better choice (as seen in my previous benchmarks). If you have long-living connections, then NIO is the better choice, because it can serve more concurrent users without the need to dedicate a blocked thread to each. The poller threads handle the sending of data back to the client, while the worker threads handle new requests. In other words, neither poller, nor worker threads are blocked and reserved by a single user.

With the introduction of async servlet processing it became easier to have the latter scenario from the previous paragraph. And maybe that was one of the reasons to switch the default connector from BIO to NIO in Tomcat 8. It’s an important thing to have in mind, especially because they didn’t exactly change the “default value”.

The default value is always “HTTP/1.1”, but in Tomcat 7 that “uses an auto-switching mechanism to select either a blocking Java based connector or an APR/native based connector”, while in Tomcat 8 it “uses an auto-switching mechanism to select either a non blocking Java NIO based connector or an APR/native based connector”. And to make things even harder, they introduced a NIO2 connector. And to be honest, I don’t know which one of the two NIO connectors is used by default.
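If you don’t want to rely on the auto-switching at all, you can pin the connector explicitly in server.xml by specifying the protocol class instead of “HTTP/1.1”:

```xml
<!-- explicitly select the NIO connector (Http11Nio2Protocol for NIO2,
     Http11Protocol for the old blocking one) -->
<Connector port="8080"
           protocol="org.apache.coyote.http11.Http11NioProtocol"
           connectionTimeout="20000"
           redirectPort="8443" />
```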

So even if you are experienced with tomcat configuration, have this change of defaults in mind. (And generally I’d recommend reading the documentation for all the properties and playing with them on your servers.)


Blue-Green Deployment With a Single Database

June 23, 2015

A blue-green deployment is a way to have incremental updates to your production stack without downtime and without any complexity for properly handling rolling updates (including the rollback functionality).

I don’t need to repeat this wonderful explanation or Martin Fowler’s original piece. But I’ll extend on them.

A blue-green deployment is one where there is an “active” and a “spare” set of servers. The active set runs the current version, and the spare is ready to run any newly deployed version. The “active”/“spare” naming is slightly different from “blue”/“green”, because one set is always “blue” and one is always “green”, while the “active” and “spare” labels change.

On AWS, for example, you can script the deployment by having two child stacks of your main stack – active and spare (indicated by a stack label), each having one (or more) auto-scaling group for your application layer, and a script that does the following (applicable to non-AWS setups as well):

  • push build to an accessible location (e.g. s3)
  • set the spare auto-scaling group size to the desired value (the spare stays at 0 when not used)
  • make it fetch the pushed build on startup
  • wait for it to start
  • run sanity tests
  • switch DNS to point to an ELB in front of the spare ASG
  • switch the labels to make the spare one active and vice versa
  • set the previously active ASG size to 0

The application layer is stateless, so it’s easy to do hot-replaces like that.

But (as Fowler indicated) the database is the most tricky component. If you have 2 databases, where the spare one is a slave replica of the active one (and that changes every time you switch), the setup becomes more complicated. And you’ll still have to do schema changes. So using a single database, if possible, is the easier approach, regardless of whether you have a “regular” database or a schemaless one.

In fact, it boils down to having your application modify the database on startup, in a way that works with both versions. This includes schema changes – table (or the relevant term in the schemaless db) creation, field addition/removal and inserting new data (e.g. enumerations). And it can go wrong in many ways, depending on the data and datatypes. Some nulls, some datatype change that makes a few values unparseable, etc.

Of course, it’s harder to do it with a regular SQL database. As suggested in the post I linked earlier, you can use stored procedures (which I don’t like), or you can use a database migration tool. For a schemaless database you must do stuff manually, but fewer actions are normally needed – you don’t have to alter tables or explicitly create new ones, as everything is handled automatically. And the most important thing is to not break the running version.

But how to make sure everything works?

  • test on staging – preferably with a replica of the production database
  • (automatically) run your behaviour/acceptance/sanity test suites against the not-yet-active new deployment before switching the DNS to point to it. Stop the process if they fail.

Only after these checks pass, switch the DNS and point your domain to the previously spare group, thus promoting it to “active”. Switching can be done manually, or automatically with the deployment script. The “switch” can be other than a DNS one (as you need a low TTL for that). It can be a load-balancer or a subnet configuration, for example – the best option depends on your setup. And while it is good to automate everything, having a few manual steps isn’t necessarily a bad thing.

Overall, I’d recommend the blue-green deployment approach in order to achieve zero downtime upgrades. But always make sure your database is properly upgraded, so that it works with both the old and the new version.


Optional Dependencies

June 6, 2015

Sometimes a library you are writing may have optional dependencies. E.g. “if apache http client is on the classpath, use it; otherwise – fallback to HttpURLConnection”.

Why would you do that? For various reasons – for example, when distributing a library, you may not want to force a big dependency footprint. On the other hand, a more advanced library may have performance benefits, so whoever needs them may include it. Or you may want to allow easily pluggable implementations of some functionality – e.g. json serialization. Your library doesn’t care whether it’s Jackson, gson or native android json serialization – so you may provide implementations using all of these, and pick the one whose dependency is found.

One way to achieve this is to explicitly specify/pass the library to use. When the user of your library/framework instantiates its main class, they can pass a boolean useApacheClient=true, or an enum value JsonSerializer.JACKSON. That is not a bad option, as it forces the user to be aware of what dependency they are using (and is a de-facto dependency injection).

Another option, used by spring among others, is to dynamically check whether the dependency is available on the classpath. E.g.

private static final boolean apacheClientPresent = isApacheHttpClientPresent();

private static boolean isApacheHttpClientPresent() {
  try {
    Class.forName("org.apache.http.client.HttpClient");
    logger.info("Apache HTTP client detected, using it for HTTP communication.");
    return true;
  } catch (ClassNotFoundException ex) {
    logger.info("Apache HTTP client not found, using HttpURLConnection.");
    return false;
  }
}
and then, whenever you need to make HTTP requests (where ApacheHttpClient and HttpURLConnectionClient are your custom implementations of your own HttpClient interface):

HttpClient client = null;
if (apacheClientPresent) {
   client = new ApacheHttpClient();
} else {
   client = new HttpURLConnectionClient();
}
Note that it’s important to guard any code that may try to load classes from the dependency with the “isXPresent” boolean. Otherwise class loading exceptions may fly. E.g. in spring, the Jackson-dependent code is wrapped in a MappingJackson2HttpMessageConverter, whose instantiation is guarded:

if (jackson2Present) {
    this.messageConverters.add(new MappingJackson2HttpMessageConverter());
}

That way, if Jackson is not present, the class is not instantiated and loading of Jackson classes is not attempted at all.

Whether to prefer the automatic detection, or to require explicit configuration of the underlying dependency to use, is a hard question. Automatic detection may leave the user of your library unaware of the mechanism, and when they add a dependency for a different purpose, it may get picked up by your library and behaviour may change (though it shouldn’t – tiny differences are always there). You should document that, of course, and even log messages (as above), but that may not be enough to avoid (un)pleasant surprises. So I can’t say when to use which – it should be decided case-by-case.

This approach is also applicable to internal dependencies – your core module may look for a more specific module to be present in order to use it, and otherwise fall back to a default. E.g. you provide a default implementation of “elapsed time” using System.nanoTime(), but on Android you’d better rely on SystemClock for that – so you may want to detect whether your Android elapsed-time implementation is present. This looks like logical coupling, though, so in this scenario it may be wiser to prefer the explicit approach.

Overall, this is a nice technique for using optional dependencies with a basic fallback, or for picking one of many possible implementations without a fallback. It’s good to know that you can do it and have it in your “toolkit” of possible solutions to a problem. But you shouldn’t always use it over the explicit (dependency injection) option.


AWS “Noisiness”

May 14, 2015

You may be familiar with the “noisy neighbour” problem with virtualization – someone else’s instances on the same physical machine “steal” CPU from your instance. I won’t be giving clues on how to solve the issue (quick & dirty – terminate your instance and let it be spawned on another physical machine), and instead I’ll share my observations “at scale”.

I didn’t actually experience a typical “noisy neighbour” on AWS – i.e. one instance being significantly “clogged”. But as I noted in an earlier post about performance benchmarks, the overall AWS performance depends on many factors.

The time of day is the obvious one – as I’m working from UTC+2, my early morning is practically the time when Europe has not yet woken up, and the US has already gone to sleep. So the load on AWS is expected to be lower. When I experiment with CloudFormation stacks in the morning and in the afternoon, the difference is quite noticeable (though I haven’t measured it) – “morning” stacks are up and running much faster than “afternoon” ones. It takes less time for instances, ELBs, and the whole stack to be created.

But last week we observed something rather curious. Our regular load test ran on Thursday, and from then until the end of the week the performance was horrible – we couldn’t even get a healthy run, as many, many requests were failing due to timeouts from internal ELBs. On top of that, spot instances (instances for which you bid a certain price and someone else can “steal” them from you at any time) were rather hard to keep – there was a huge demand for them and our spot instances were constantly claimed by someone else. But the AWS region was reported to be in an “ok” state, no errors.

What was happening last Thursday? The UK elections. I can’t prove they had such an effect on the whole of AWS, and I initially offered that as a joke explanation, but an EU AWS region during the UK elections is likely to experience high load. Noticeably high load, as it seems, so that the whole infrastructure was under pressure for everyone else. (It might have been a coincidence, of course.) And it wasn’t a typical “noisy neighbour” – it was the ELBs that were not performing. And then, this week, things were back to normal.

The AWS infrastructure is complex, it has way more than just “instances”, so even if you have enough CPU to handle noisy neighbours, any other component can suffer from increased load on the whole infrastructure. E.g. ELBs, RDS, SQS, S3, even your VPC subnets. When AWS is under pressure, you’ll feel it, one way or another.

The moral? Embrace failure, of course. Have monitoring that would notify you of these events of a less stable infrastructure, and have a fault-tolerant setup with proper retries and fallbacks.


Log Collection With Graylog on AWS

May 8, 2015

Log collection is essential for properly analyzing issues in production. An interface to search the logs and be notified about exceptions on all your servers is a must. Well, if you have one server, you can easily ssh to it and check the logs, of course, but for larger deployments, collecting logs centrally is far preferable to logging in to 10 machines in order to find “what happened”.

There are many options to do that, roughly separated in two groups – 3rd party services and software to be installed by you.

3rd party (or “cloud-based”, if you want) log collection services include Splunk, Loggly, Papertrail and Sumologic. They are very easy to set up and you pay for what you use. Basically, you send each message (e.g. via a custom logback appender) to a provider’s endpoint, and then use the dashboard to analyze the data. In many cases that would be the preferred way to go.

In other cases, however, company policy may frown upon using 3rd party services to store company-specific data, or additional costs may be undesired. In these cases extra effort needs to be put into installing and managing an internal log collection software. They work in a similar way, but implementation details may differ (e.g. instead of sending messages with an appender to a target endpoint, the software, using some sort of an agent, collects local logs and aggregates them). Open-source options include Graylog, FluentD, Flume, Logstash.

After some very quick research, I found that graylog fits our needs best, so below is a description of the installation procedure on AWS (though the first part applies regardless of the infrastructure).

The first thing to look at is the set of ready-to-use images provided by graylog, including docker, openstack, vagrant and AWS. Unfortunately, the AWS version has two drawbacks. The first is that it uses Ubuntu rather than the Amazon AMI – not a huge issue, although some generic scripts you use in your stack may have to be rewritten. The other was the dealbreaker – when you start it, it doesn’t run a web interface, although it claims it should. Only mongodb, elasticsearch and graylog-server are started. Having two instances – one for the web interface and one for the rest – would complicate things, so I opted for manual installation.

Graylog has two components – the server, which handles the input, indexing and searching, and the web interface, which is a nice UI that communicates with the server. The web interface uses mongodb for metadata, and the server uses elasticsearch to store the incoming logs. Below is a bash script (CentOS) that handles the installation. Note that there is no “sudo”, because initialization scripts are executed as root on AWS.


# install pwgen for password-generation
yum -y upgrade ca-certificates --enablerepo=epel
yum --enablerepo=epel -y install pwgen

# mongodb
cat >/etc/yum.repos.d/mongodb-org.repo <<'EOT'
name=MongoDB Repository
EOT

yum -y install mongodb-org
chkconfig mongod on
service mongod start

# elasticsearch
rpm --import

cat >/etc/yum.repos.d/elasticsearch.repo <<'EOT'
name=Elasticsearch repository for 1.4.x packages
EOT

yum -y install elasticsearch
chkconfig --add elasticsearch

# configure elasticsearch 
sed -i -- 's/ elasticsearch/ graylog2/g' /etc/elasticsearch/elasticsearch.yml 
sed -i -- 's/#network.bind_host: localhost/network.bind_host: localhost/g' /etc/elasticsearch/elasticsearch.yml

service elasticsearch stop
service elasticsearch start

# java
yum -y update
yum -y install java-1.7.0-openjdk
update-alternatives --set java /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/bin/java

# graylog
tar xvzf graylog-1.0.1.tgz -C /opt/
mv /opt/graylog-1.0.1/ /opt/graylog/
# make the jar and log file paths absolute, so graylogctl can be run from anywhere
sed -i -e 's/GRAYLOG2_SERVER_JAR=\${GRAYLOG2_SERVER_JAR:=graylog.jar}/GRAYLOG2_SERVER_JAR=\${GRAYLOG2_SERVER_JAR:=\/opt\/graylog\/graylog.jar}/' /opt/graylog/bin/graylogctl
sed -i -e 's/LOG_FILE=\${LOG_FILE:=log\/graylog-server.log}/LOG_FILE=\${LOG_FILE:=\/var\/log\/graylog-server.log}/' /opt/graylog/bin/graylogctl

cat >/etc/init.d/graylog <<'EOT'
#!/bin/bash
# chkconfig: 345 90 60
# description: graylog control
sh /opt/graylog/bin/graylogctl $1
EOT

chkconfig --add graylog
chkconfig graylog on
chmod +x /etc/init.d/graylog

# graylog web
tar xvzf graylog-web-interface-1.0.1.tgz -C /opt/
mv /opt/graylog-web-interface-1.0.1/ /opt/graylog-web/

cat >/etc/init.d/graylog-web <<'EOT'
#!/bin/bash
# chkconfig: 345 91 61
# description: graylog web interface
sh /opt/graylog-web/bin/graylog-web-interface > /dev/null 2>&1 &
EOT

chkconfig --add graylog-web
chkconfig graylog-web on
chmod +x /etc/init.d/graylog-web

mkdir --parents /etc/graylog/server/
cp /opt/graylog/graylog.conf.example /etc/graylog/server/server.conf
sed -i -e 's/password_secret =.*/password_secret = '$(pwgen -s 96 1)'/' /etc/graylog/server/server.conf

sed -i -e 's/root_password_sha2 =.*/root_password_sha2 = '$(echo -n password | shasum -a 256 | awk '{print $1}')'/' /etc/graylog/server/server.conf

sed -i -e 's/application.secret=""/application.secret="'$(pwgen -s 96 1)'"/g' /opt/graylog-web/conf/graylog-web-interface.conf
sed -i -e 's/graylog2-server.uris=""/graylog2-server.uris="http:\/\/\/"/g' /opt/graylog-web/conf/graylog-web-interface.conf

service graylog start
sleep 30
service graylog-web start

You may also want to set a TTL (auto-expiration) for messages, so that you don’t store old logs forever. Here’s how:

# wait for the index to be created
INDEXES=$(curl --silent "http://localhost:9200/_cat/indices")
until [[ "$INDEXES" =~ "graylog2_0" ]]; do
	sleep 5
	echo "Index not yet created. Indexes: $INDEXES"
	INDEXES=$(curl --silent "http://localhost:9200/_cat/indices")
done

# set each indexed message auto-expiration (ttl)
curl -XPUT "http://localhost:9200/graylog2_0/message/_mapping" -d'{"message": {"_ttl" : { "enabled" : true, "default" : "15d" }}}'

Now you have everything running on the instance. Then you have to do some AWS-specific things (if using CloudFormation, that would include a pile of JSON). Here’s the list:

  • you can either have an auto-scaling group with one instance, or a single instance. I prefer the ASG, though the other one is a bit simpler. The ASG gives you auto-respawn if the instance dies.
  • set the above script to be invoked in the UserData of the launch configuration of the instance/asg (e.g. by getting it from s3 first)
  • allow UDP port 12201 (the default logging port). That should happen for the instance/asg security group (inbound), for the application nodes security group (outbound), and also as a network ACL of your VPC. Test the UDP connection to make sure it really goes through. Keep the access restricted for all sources, except for your instances.
  • you need to pass the private IP address of your graylog server instance to all the application nodes. That’s tricky on AWS, as private IP addresses change. That’s why you need something stable. You can’t use an ELB (load balancer), because it doesn’t support UDP. There are two options:
    • Associate an Elastic IP with the node on startup and pass that IP to the application nodes. But there’s a catch – if they connect to the elastic IP, the traffic would go via NAT (if you have one), and you may have to open your instance “to the world”. So you must turn the elastic IP into its corresponding public DNS name, which will then be resolved to the private IP. You can do that manually, in a hacky way:
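For example, the conversion can be done with plain string manipulation (a sketch – the IP and region are made-up assumptions; note that some regions, like us-east-1, use a slightly different DNS pattern):

```shell
# Turn an Elastic IP into the corresponding EC2 public DNS name, which
# resolves to the private IP when queried from inside AWS.
ELASTIC_IP="54.154.23.5"   # made-up elastic IP
REGION="eu-west-1"         # must match your region
PUBLIC_DNS="ec2-${ELASTIC_IP//./-}.${REGION}.compute.amazonaws.com"
echo "$PUBLIC_DNS"   # → ec2-54-154-23-5.eu-west-1.compute.amazonaws.com
```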

      or you can use the AWS EC2 CLI to obtain the instance details of the instance that the elastic IP is associated with, and then with another call obtain its Public DNS.

    • Instead of using an Elastic IP, which limits you to a single instance, you can use Route53 (the AWS DNS manager). That way, when a graylog server instance starts, it can append itself to a Route53 record, allowing for round-robin DNS over multiple graylog instances in a cluster. Manipulating the Route53 records is again done via the AWS CLI. Then you just pass the domain name to the application nodes, so that they can send messages.
  • alternatively, you can install graylog-server on all the nodes (as an agent), and point them to an elasticsearch cluster. But that’s more complicated and probably not the intended way to do it
  • configure your logging framework to send messages to graylog. There are standard GELF (the Graylog format) appenders, e.g. this one, and the only thing you have to do is use the public DNS environment variable in logback.xml (which supports environment variable resolution).
  • You should make the web interface accessible outside the network, so you can use an ELB for that, or the round-robin DNS mentioned above. Just make sure the security rules are tight and not allowing external tampering with your log data.
  • If you are not running a graylog cluster (which I won’t cover), then the single instance can potentially fail. That isn’t a great loss, as log messages can be obtained from the instances, and they are short-lived anyway. But the metadata of the web interface is important – dashboards, alerts, etc. So it’s good to do regular backups (e.g. with mongodump). Using an EBS volume is also an option.
  • Even though you send your log messages to the centralized log collector, it’s a good idea to also keep local logs, with the proper log rotation and cleanup.

It’s not a trivial process, but it’s essential to have log collection, so I hope the guide has been helpful.


Should We Look For a PRISM Alternative?

May 3, 2015

I just watched Citizen Four, and that made me think again about mass surveillance. And it’s complicated.

I would like to leave aside the US foreign policy (where I agree with Chomsky’s criticism), and whether “terrorist attacks” would have been an issue if the US government didn’t do all the bullshit it does across the world. Let’s assume there is always someone out there trying, for no rational reason, to blow up a bus or a train. In the US, Europe, or anywhere. And with the internet, it becomes easier for that person to find both motivation and means to do so.

From that point of view it seems entirely justified to look for those people day and night in an attempt to prevent them from killing innocent people. Whether the effort to prevent a thousand deaths by terrorists is comparable to the efforts to prevent the deaths of millions due to car crashes, obesity-related diseases, malpractice, police brutality, poor living conditions and more, is beyond the scope of this discussion. And regardless of whether PRISM has helped so far in preventing attacks, it may do so in the future.

Privacy, on the other hand, is fundamental and we must not be monitored by a “benign government”, regardless of the professed cause. And I genuinely believe that none of the officials involved in PRISM, envision or aim at any Orwellian dystopia, but that doesn’t mean their actions can’t lead to one. Creating the means to implement a surveillance state is just one step away from having one, regardless of the intentions that the means were created with. In a not-so-impossible scenario, the thin oversight of PRISM could be wiped and the “proper people” given access to the data. I live in a former communist state, so believe me, that’s real. And that’s not the only danger – self-censorship is another one, which can really skew the course of society.

So can’t we have both privacy and security? Shall we sacrifice liberties in order to feel less threatened (and in the end get neither, as Franklin said)? Of course not. But I think the implementation details are, again, the key to the optimal solution. Can there be a solution that doesn’t give the government all the data about all the citizens, and yet serves as a means of protection against the irrational person who plans to kill people?

The NSA used private companies as a source of the data (even though the companies deny that) – google searches, facebook messages, emails, text messages, etc. All of that was poured into a huge database and searched and analyzed. For good reasons, allegedly, but we don’t trust the government, do we? And yet, we trust the companies with our data, or we don’t care, and we hope that they will protect our privacy. They use the data to target ads at us, which we accept. But handing that data to an almighty government crosses the line. And even though the companies deny any large-scale data transfer, the veil of secrecy over PRISM hints otherwise. Receiving all the data related to a given search term is a rather large-scale data transfer.

My thoughts in this respect lead me to think of alternatives that would still be able to prevent detectable attacks, but would not hand the government the tools to become a super-powerful state. And maybe naively, I think that’s achievable to an extent. No, you can’t prevent a terrorist from using Tor, HTTPS, PGP, Bitcoin, a possible Silk Road successor, etc. But you can’t prevent a terrorist from meeting an illegal salesman in a parking garage either. And besides, that’s not what mass surveillance solves anyway – if it solves anything, it’s the low-hanging fruit: the inept wrongdoers.

But what if there were a piece of software, trained (using machine learning) to detect suspicious profiles? What if that software were open-source and distributed to major tech companies (like the ones participating in PRISM – google, facebook, etc.)? That software would work as follows: it receives as input anonymized profiles of the company’s users, analyzes them (locally) and flags any suspicious ones. Then the anonymized suspicious profiles are sent to the NSA. The link between the data and the real profile (names, IPs, locations) is encrypted with the public key of a randomly selected court (specialized or not), so that only the court can de-anonymize it. If the NSA considers a flagged profile a real danger, it can request the court to de-anonymize the data. How is that different from the search-term-based solution? It’s open, more sophisticated than a keyword match, and way less data is sent to the NSA.

That way the government doesn’t get huge amounts of data – only a tiny fraction, flagged as “suspicious”. Can the companies cheat that, if paid by the NSA – well, they can – they have the data. But preventing that is in their interest as well as that of the public, given that there is a legal way to help the government prevent crimes.

Should we be algorithmically flagged as “suspicious” based on something we wrote online, and isn’t that again an invasion of privacy? That question doesn’t make my middle-ground-finding attempt easier. It is, yes, but it doesn’t make an Orwellian super-state possible; it doesn’t give the government immense power that can be abused. And, provided there’s trust in the court, it shouldn’t lead to self-censorship (e.g. refraining from terrorist jokes, due to fear of being “flagged”).

Can the government make that software flag not only terrorists, but also everyone who is critical of the government, as an “enemy of the state”? It can, but if the software is open, and companies are not forced to use it unconditionally, then that won’t happen (or will it?).

The Internet has made the world wonderful, and at the same time more complicated. Offline analogies don’t work well (e.g. postman reading your letters, or constantly having an NSA agent near you), because of the scale and anonymity. I think, given that no government can abuse the information, and that no human reads your communication, we can have a reasonable middle ground, where privacy is preserved, and security is increased. But as I pointed out earlier, that may be naive. And in case it is naive, we should drop the alleged increase in security, and aim to achieve it in a different way.