RabbitMQ in Multiple AWS Availability Zones

July 17, 2014

When working with AWS, in order to have a highly-available setup, one must have instances in more than one availability zone (AZ ≈ data center). If one AZ dies (which may happen), your application should continue serving requests.

It’s simple to set up your application nodes in multiple AZs (if they are properly written to be stateless), but it’s trickier for databases, message queues and everything else that has state. So let’s see how to configure RabbitMQ. The first steps are relevant not only to RabbitMQ, but to any persistent data solution.

First (no matter whether using CloudFormation or manual setup), you must:

  • Have a VPC. It might be possible without a VPC, but I can’t guarantee that, especially for the DNS hostnames discussed below
  • Declare private subnets (for each AZ)
  • Declare the RabbitMQ autoscaling group (recommended to have one) to span multiple AZs, using:
            "AvailabilityZones" : { 
              "Fn::GetAZs" : {
                "Ref": "AWS::Region"
              }
            }
            
  • Declare the RabbitMQ autoscaling group to span multiple subnets using the VPCZoneIdentifier property
  • Declare the LoadBalancer in front of your RabbitMQ nodes (that is the easiest way to ensure even distribution of load to your Rabbit cluster) to span all the subnets
  • Declare LoadBalancer to be "CrossZone": true

Then comes the specific RabbitMQ configuration. Generally, you have two options: clustering and federation.

Clustering is not recommended over a WAN, but the connection between availability zones can be viewed (maybe a bit optimistically) as a LAN. (This detailed post assumes otherwise, but this thread hints that using a cluster over multiple AZs is fine.)

With federation, you declare your exchanges to send all messages they receive to another node’s exchange. This is pretty useful in a WAN, where network disconnects are common and speed is not so important. But it may still be applicable in a multi-AZ scenario, so it’s worth investigating. Here is an example, with exact commands to execute, of how to achieve that, using the federation plugin. The tricky part with federation is auto-scaling – whenever you need to add a new node, you should modify the configuration of (some of) your existing nodes in order to set the new node as their upstream. You may also need to allow other machines to connect as guest to RabbitMQ ([{rabbit, [{loopback_users, []}]}] in your rabbitmq conf file), or find a way to configure a custom username/password pair for federation to work.

With clustering, it’s a bit different, and in fact simpler to set up. All you have to do is write a script to automatically join a cluster on startup. This might be a shell script or a Python script using the AWS SDK. The main steps in such a script (which, yeah, frankly, isn’t that simple) are:

  • Find all running instances in the RabbitMQ autoscaling group (using the AWS API filtering options)
  • If this is the first node (the order is random and doesn’t matter), assume it’s the “seed” node for the cluster and all other nodes will connect to it
  • If this is not the first node, connect to the first node using rabbitmqctl join_cluster rabbit@{node}, where {node} is the instance private DNS name (available through the SDK)
  • Stop the RabbitMQ application (rabbitmqctl stop_app) while doing all the configuration, and start it again (rabbitmqctl start_app) after you are done
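
The decision logic in the steps above can be sketched roughly as follows. This is only a sketch under assumptions: the class and method names are hypothetical, the list of node names is assumed to have been fetched already through the AWS SDK (that call is not shown), and sorting the names is an addition made here so that every node picks the same seed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: decides what a freshly started RabbitMQ node should do.
// The node names are assumed to come from the AWS SDK (running instances of
// the RabbitMQ autoscaling group) and to be in the form RabbitMQ expects.
class ClusterJoinPlanner {

    /** Picks the seed node. Sorting is an assumption so all nodes agree on the same seed. */
    static String seedNode(List<String> nodeNames) {
        if (nodeNames.isEmpty()) {
            throw new IllegalArgumentException("at least one node is expected");
        }
        List<String> sorted = new ArrayList<>(nodeNames);
        Collections.sort(sorted);
        return sorted.get(0);
    }

    /** Returns the rabbitmqctl commands to run on this node; empty for the seed node. */
    static List<String> joinCommands(String ownName, List<String> nodeNames) {
        String seed = seedNode(nodeNames);
        List<String> commands = new ArrayList<>();
        if (seed.equals(ownName)) {
            return commands; // the seed node has nothing to join
        }
        commands.add("rabbitmqctl stop_app");
        commands.add("rabbitmqctl join_cluster rabbit@" + seed);
        commands.add("rabbitmqctl start_app");
        return commands;
    }
}
```

The returned commands would then be executed by whatever startup mechanism the instance uses; that part depends on your setup and is left out.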

In all cases (clustering or federation), RabbitMQ relies on domain names. The easiest way to make it work is to enable DNS hostnames in your VPC: "EnableDnsHostnames": true. There’s a little hack here, when it comes to joining a cluster – the AWS API may return the fully qualified domain name, which includes something like “.eu-west-1.compute.internal” in addition to the ip-xxx-xxx-xxx-xxx part. So when joining the RabbitMQ cluster, you should strip this suffix, otherwise it doesn’t work.
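
Stripping the suffix is a one-liner. A minimal sketch, assuming the name returned by the API looks like the example above (the class and method names are hypothetical):

```java
// Hypothetical helper: keeps only the part of a private DNS name before the first dot.
class HostNames {

    static String shortHostName(String privateDnsName) {
        int firstDot = privateDnsName.indexOf('.');
        // Names without a dot are already short and are returned unchanged.
        return firstDot < 0 ? privateDnsName : privateDnsName.substring(0, firstDot);
    }
}
```

The result is what would go after rabbit@ in the join_cluster command.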

The end result should be a cluster that keeps functioning properly when a node dies and another one is spawned (by the auto-scaling group).

Comparing the two approaches with PerfTest yields better throughput for the clustering option – about 1/3 fewer messages were processed with federation, and latency was also a bit higher. The tests should be executed from an application node, towards the RabbitMQ ELB (otherwise you are testing just one node). You can get PerfTest and execute it with something like this (where the amqp address is the DNS name of the RabbitMQ load balancer):

wget http://www.rabbitmq.com/releases/rabbitmq-java-client/v3.3.4/rabbitmq-java-client-bin-3.3.4.tar.gz
tar -xvf rabbitmq-java-client-bin-3.3.4.tar.gz
cd rabbitmq-java-client-bin-3.3.4
sudo sh runjava.sh com.rabbitmq.examples.PerfTest -x 10 -y 10 -z 10 -h amqp://internal-foo-RabbitMQEl-1GM6IW33O-1097824.eu-west-1.elb.amazonaws.com:5672

Which of the two approaches you pick depends on your particular case, but I would generally recommend the clustering option. It is a bit more performant and a bit easier to set up and support in a cloud environment, where nodes spawn and die often.


The Cloud Beyond the Buzzword [presentation]

July 14, 2014

The other day I gave a presentation about “The Cloud”. I talked about buzzwords, incompetence, classification, and most importantly – embracing failure.

Here are the slides (the talk was not in English). I didn’t have time to go into too much detail, but I hope it’s a nice overview.


You Probably Don’t Need a Message Queue

July 3, 2014

I’m a minimalist, and I don’t like to complicate software too early and unnecessarily. And adding components to a software system is one of the things that adds a significant amount of complexity. So let’s talk about message queues.

Message queues are systems that let you have a fault-tolerant, distributed, decoupled, etc., etc. architecture. That sounds good on paper.

Message queues may fit several use-cases in your application. You can check this nice article about the benefits of MQs to see what some of those use-cases might be. But don’t be hasty in picking an MQ because “decoupling is good”, for example. Let’s use an example – you want your email sending to be decoupled from your order processing. So you post a message to a message queue, then the email processing system picks it up and sends the emails. How would you do that in a monolithic, single classpath application? Just make your order processing service depend on an email service, and call sendEmail(..) rather than sendToMQ(emailMessage). If you use an MQ, you define a message format to be recognized by the two systems; if you don’t use an MQ you define a method signature. What is the practical difference? Not much, if any.

But then you probably want to be able to add another consumer that does an additional thing with a given message? And that might happen indeed, it’s just not for the regular project out there. And even if it is, it’s not worth it, compared to adding just another method call. Coupled – yes. But not inconveniently coupled.

What if you want to handle spikes? Message queues give you the ability to put requests in a persistent queue and process all of them. And that is a very useful feature, but again it’s limited based on several factors – are your requests processed in the UI background, or do they require an immediate response? The servlet container thread pool can be used as a sort-of queue – the response will be served eventually, but the user will have to wait (if the thread acquisition timeout is too small, requests will be dropped, though). Or you can use an in-memory queue for the heavier requests (that are handled in the UI background). And note that by default your MQ might not be highly available. E.g. if an MQ node dies, you lose messages. So that’s not a benefit over an in-memory queue in your application node.

Which leads us to asynchronous processing – this is indeed a useful feature. You don’t want to do some heavy computation while the user is waiting. But you can use an in-memory queue, or simply start a new thread (à la Spring’s @Async annotation). Here comes another aspect – does it matter if a message is lost? If your application node, processing the request, dies, can you recover? You’ll be surprised how often it doesn’t actually matter, and you can function properly without guaranteeing all messages are processed. So, just asynchronously handling heavier invocations might work well.
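
A minimal sketch of such in-memory asynchronous handling, using only the JDK executor framework. The class name, the pool size of 4 and the handler callback are assumptions for illustration; anything still queued is lost if the node dies, which is exactly the trade-off discussed above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical in-memory processor: requests wait in the executor's internal
// queue and are handled on background threads, not on the caller's thread.
class InMemoryAsyncProcessor {
    private final ExecutorService executor = Executors.newFixedThreadPool(4); // pool size is an assumption
    private final Consumer<String> handler;

    InMemoryAsyncProcessor(Consumer<String> handler) {
        this.handler = handler;
    }

    /** Returns immediately; the heavy work happens later on a pool thread. */
    void submit(String request) {
        executor.submit(() -> handler.accept(request));
    }

    /** Stops accepting work and waits for queued requests; returns false on timeout or interrupt. */
    boolean shutdownAndWait(long timeoutSeconds) {
        executor.shutdown();
        try {
            return executor.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```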

Even if you can’t afford to lose messages, for the use-case where a message is put into a queue in order for another component to process it, there’s still a simple solution – the database. You put a row with a processed=false flag in the database. A scheduled job runs, picks all unprocessed ones and processes them asynchronously. Then, when processing is finished, it sets the flag to true. I’ve used this approach a number of times, including in large production systems, and it works pretty well.
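
The flag-based flow can be sketched like this. To stay self-contained, an in-memory list stands in for the database table (an assumption made purely for illustration; a real version would issue the INSERT/SELECT/UPDATE statements noted in the comments), and all names are hypothetical. Processing is done inline here rather than asynchronously, to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of "the database as a queue".
class DatabaseBackedQueue {

    static class Row {
        final String payload;
        boolean processed = false;
        Row(String payload) { this.payload = payload; }
    }

    // Stand-in for the table; in reality this is a database table with a processed flag.
    private final List<Row> table = new ArrayList<>();

    /** Roughly: INSERT INTO requests (payload, processed) VALUES (?, false) */
    synchronized void enqueue(String payload) {
        table.add(new Row(payload));
    }

    /** One run of the scheduled job; returns how many rows were handled. */
    synchronized int processPending(Consumer<String> handler) {
        int handled = 0;
        for (Row row : table) {
            if (!row.processed) {              // SELECT ... WHERE processed = false
                handler.accept(row.payload);   // if this throws, the flag stays false
                row.processed = true;          // UPDATE ... SET processed = true
                handled++;
            }
        }
        return handled;
    }
}
```

A scheduler, for example ScheduledExecutorService.scheduleWithFixedDelay, could call processPending periodically. Since the flag is only set after the handler returns, a row whose handler fails is picked up again on the next run.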

And you can still scale your application nodes endlessly, as long as you don’t have any persistent state in them. Regardless of whether you are using an MQ or not. (Temporary in-memory processing queues are not persistent state).

Why am I trying to give alternatives to common usages of message queues? Because if chosen for the wrong reason, an MQ can be a burden. MQs are not as easy to use as they sound. First, there’s a learning curve. Generally, the more separate integrated components you have, the more problems may arise. Then there’s setup and configuration. E.g. when the MQ has to run in a cluster, in multiple data centers (for HA), that becomes complex. High availability itself is not trivial – it’s not normally turned on by default. And how does your application node connect to the MQ? Via a refreshing connection pool, using a short-lived DNS record, via a load balancer? Then your queues have tons of configurations – what’s their size, what’s their behaviour (should consumers explicitly acknowledge receipt, should they explicitly acknowledge failure to process messages, should multiple consumers get the same message or not, should messages have TTL, etc.). Then there’s the network and message transfer overhead – especially given that people often choose JSON or XML for transferring messages. If you overuse your MQ, then it adds latency to your system. And last, but not least – it’s harder to track the program flow when analyzing problems. You can’t just see the “call hierarchy” in your IDE, because once you send a message to the MQ, you need to go and find where it is handled. And that’s not always as trivial as it sounds. You see, it adds a lot of complexity and things to take care of.

Certainly MQs are very useful in some contexts. I’ve been using them in projects where they were really a good fit – e.g. we couldn’t afford to lose messages and we needed fast processing (so pinging the database wasn’t an option). I’ve also seen them used in non-trivial scenarios, where we were using them for consuming messages on a single application node, regardless of which node posts the message (pub/sub). And you can also check this stackoverflow question. And maybe you really need to have multiple languages communicate (but don’t want an ESB), or maybe your flow is getting so complex that adding a new method call instead of a new message consumer is overkill.

So all I’m trying to say here is the trite truism “you should use the right tool for the job”. Don’t pick a message queue if you haven’t identified a real use for it that can’t be easily handled in a different, easier to setup and maintain manner. And don’t start with an MQ “just in case” – add it whenever you realize the actual need for it. Because probably, in the regular project out there, a message queue is not needed.


How to Handle Incompetence?

June 25, 2014

We’ve all had incompetent colleagues. People that tend to write bad code, make bad decisions or just can’t understand some of the concepts in the project(s). And it’s never trivial to handle this scenario.

Obviously, the easiest solution is to ignore it. And if you are not a team lead (or something similar), you can probably pretend that the problem doesn’t exist (and occasionally curse and refactor some crappy code).

There are two types of incompetent people: those who know they are not that good, and those who are clueless about their incompetence.

The former are usually junior and mid-level developers, and they are expected to be less experienced. With enough coaching and kindly pointing out their mistakes, they will learn. This is what all of us have gone through.

The latter is the harder breed. They are the “senior” developers that have become senior only due to the number of years they’ve spent in the industry, regardless of their actual skills or contribution. They tend to produce crappy code and misunderstand assignments, but on the other hand reject (kindly or more aggressively) any attempt to be educated. Because they’re “senior”, and who are you to argue with them? In extreme cases this may be accompanied by an inferiority complex, which in turn may result in clumsy attempts to prove they are actually worthy. In other cases it may involve pointless discussions on topics they do not want to admit they are wrong about, just because admitting that would mean they are inferior. They will often use truisms and general statements instead of real arguments, in order to show they actually understand the matter and it’s you that’s wrong. E.g. “we must do things the right way”, “we must follow best practices”, “we must do more research before making this decision”, and so on. In a way, it’s not exactly their incompetence that is the problem, it’s their attitude and their skewed self-image. But enough layman psychology. What can be done in such cases?

A solution (depending on the labour laws) is to just lay them off. But in a tight market, approaching deadlines, a company hierarchy and rules, probably that’s not easy. And such people can still be useful. It’s just that “utilizing” them is tricky.

The key is – minimizing the damage they do without wasting the time of other team members. Note that “incompetent” doesn’t mean “can’t do anything at all”. It’s just not up to the desired quality. Here’s an incomplete list of suggestions:

  • code reviews – you should absolutely have these, even if you don’t have incompetent people. If a piece of code is crappy, you can say that in a review.
  • code style rules – you should have something like checkstyle or PMD rule set (or whatever is relevant to your language). And it won’t be offensive when you point out warnings from style checks.
  • pair programming – often simple code-style checks can’t detect bad code, and especially a bad approach to a problem. And it may be “too late” to indicate that in a code review (there is never a “too late” time for fixing technical debt, of course). So do pair programming. If the incompetent person is not the one writing the code, his pair of eyes may be useful to spot mistakes. If writing the code, then the other team member might catch a wrong approach early and discuss that.
  • don’t let them take important decisions or work on important tasks alone; in fact – this should be true even for the best developer out there – having more people involved in a discussion is often productive

Did I just make some obvious engineering process suggestions? Yes. And they would work in most cases, resolving the problem smoothly. Just don’t make a drama out of it and don’t point fingers…

…unless it’s too blatant. If the guy is both incompetent and with an intolerable attitude, and the team agrees on that, inform management. You have a people-problem then, and you can’t solve it using a good process.

Note that the team should agree. But what to do if you are alone in a team of incompetent people, or the competent people are too unmotivated to take care of the incompetent ones? Leave. That’s not a place for you.

I probably didn’t say anything useful. But the “moral” is – don’t point fingers; enforce good engineering practices instead.


Make Tests Fail

June 12, 2014

This is about a simple testing technique that is probably obvious, but I’ll share it anyway.

In case you are not following TDD, when writing tests, make them fail, in order to be sure you are testing the right thing. You can make them fail either by changing some preconditions (the “given” or “when” parts, if you like), or by changing something small in the code. After you make them fail, you revert the failing change and don’t commit it.

Let me try to give an example of why this matters.

Suppose you want to test that a service triggers some calculation only in case a set of rules are in place.
(using Mockito to mock dependencies and verify if they are invoked)

import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;

import org.junit.Test;

@Test
public void testTriggeringFoo() {
   Foo foo = mock(Foo.class);
   StubConfiguration config = new StubConfiguration();
   config.enableFoo();
   Service service = new Service(foo, config);
   service.processOptionallyTriggeringFoo();
   verify(foo).calculate(); // verify the foo calculation is invoked
}

That test passes, and you are happy. But it must fail if you do not call enableFoo(). Comment that out and run it again – if it passes again, there’s something wrong and you should investigate.

The obvious question here is – shouldn’t you have a negative test case instead (i.e. test that’s testing the opposite behaviour – i.e. that if you don’t enable foo, calculate() is not called)? Sometimes, yes. But sometimes it’s not worth having the negative test case. And sometimes it’s not about the functionality that you are testing.

Even if your code is working, your mocks and stubs might not be implemented correctly, and you may think you are testing something that you aren’t actually testing. That’s why making a test fail while writing it is not about the code you are testing, it’s about your test code. In the above example, if StubConfiguration is ignoring enableFoo(), but has it set to true by default, then the test won’t fail. But in this case the test is not useful at all – it always passes. And when you refactor your code later, and the condition is no longer met, your test won’t indicate that.
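
To make the failure mode concrete, here is a self-contained sketch without Mockito. All classes are hypothetical stand-ins for the ones in the test above (the isFooEnabled() method in particular is an assumption), with one broken stub and one working stub side by side.

```java
// Hypothetical stand-ins for the classes used in the test above.
class Foo {
    int calls = 0;
    void calculate() { calls++; }
}

interface Configuration {
    boolean isFooEnabled();
}

// Broken stub: enableFoo() is ignored and the flag is effectively always true.
class BrokenStubConfiguration implements Configuration {
    void enableFoo() { /* bug: does nothing */ }
    public boolean isFooEnabled() { return true; }
}

// Working stub: the flag is false until enableFoo() is called.
class WorkingStubConfiguration implements Configuration {
    private boolean fooEnabled = false;
    void enableFoo() { fooEnabled = true; }
    public boolean isFooEnabled() { return fooEnabled; }
}

class Service {
    private final Foo foo;
    private final Configuration config;

    Service(Foo foo, Configuration config) {
        this.foo = foo;
        this.config = config;
    }

    void processOptionallyTriggeringFoo() {
        if (config.isFooEnabled()) {
            foo.calculate();
        }
    }
}
```

With BrokenStubConfiguration, calculate() is invoked even when enableFoo() is never called, so commenting out that line cannot make the test fail. With WorkingStubConfiguration it can.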

So, make sure your test and test infrastructure are actually testing the code the way you intend them to, by making the test fail.


An Architecture for E-Voting

May 27, 2014

E-voting is a hot topic in my country, and has been discussed a lot everywhere. Since we are already using the internet and touch-screen technologies in our everyday lives, why not apply that to voting? And not for the sake of technology itself, but in order to prevent technical mistakes and election fraud, and make it easier for citizens to cast their vote and make the elections generally cheaper.

There are many concerns, some of which are relevant, including security, single points of failure, privacy, etc. Some experts claim it is impossible to make it secure enough, and that paper ballots must be used forever. On the other hand, there are several companies producing voting machines, and multiple attempts have been made to introduce e-voting, very few of which were successful. A recent audit of the Estonian e-voting system also showed some drawbacks, although the system has been in use for a while without major issues.

I’ve been thinking about and discussing the details of how a system for electronic voting can be implemented, with the following main requirements:

  • the results cannot be tampered with – neither by an attacker, nor by the election authorities
  • open source – relying on closed source and private audits is “security through obscurity”
  • everyone can vote – there should be no technical limitation to voting – people without internet and without profound technology skills should be able to cast a vote
  • guaranteed anonymity – nobody should be able to see how a person voted
  • only one vote per person – the system must be able to ensure that a person hasn’t voted more than once
  • people should be able to vote without going to a particular location
  • nobody should be able to replace a person’s vote
  • no special skills for the voting staff – ideally, voting machines should be started with one click and handle everything by themselves
  • guaranteed to work with power or internet outages

The requirements are more or less clear, but implementing them is tough.

In order to guarantee that nobody can change the results, the only solution that is secure enough would be a distributed one. No single database is so secure that it can withstand every malicious attempt. That’s why a distributed vote database has to be used. Without being an expert in the field, I think the bitcoin blockchain gives us what we need – all nodes participating in the elections will have enough data about the results, so that even if half of them are compromised or taken out, the rest can reconstruct the exact results. It might not be the exact same implementation, but we can view each vote as a separate transaction. Communication between devices is secured by the appropriate protocols, of course.
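
To illustrate just one ingredient of that idea, here is a minimal sketch of a hash-chained vote log, where each entry’s hash covers the previous hash. This is not the bitcoin implementation and it leaves out distribution and consensus entirely; the class and its methods are hypothetical. It only shows why changing an already recorded vote is detectable by anyone holding the original hashes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical append-only vote log: hash[i] = SHA-256(hash[i-1] + vote[i]).
class VoteChain {
    private final List<String> votes = new ArrayList<>();
    private final List<String> hashes = new ArrayList<>();

    void append(String vote) {
        String previous = hashes.isEmpty() ? "" : hashes.get(hashes.size() - 1);
        votes.add(vote);
        hashes.add(sha256(previous + vote));
    }

    /** Recomputes the chain and checks it against the stored hashes. */
    boolean isValid() {
        String previous = "";
        for (int i = 0; i < votes.size(); i++) {
            String expected = sha256(previous + votes.get(i));
            if (!expected.equals(hashes.get(i))) {
                return false;
            }
            previous = expected;
        }
        return true;
    }

    /** Simulates an attacker changing a recorded vote without updating the hashes. */
    void tamper(int index, String vote) {
        votes.set(index, vote);
    }

    static String sha256(String input) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }
}
```

An attacker who controls the only copy could recompute every hash after the change, which is why the copies have to live on many independent nodes.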

Open source is a requirement, so that everyone can be sure that there is no sneaky code of the form “if (party == ‘foo’) then votes += 2”, with a checksum of the currently deployed build on each device, for example. It is true that only software engineers will be able to understand how the process works, while now everyone knows how the paper ballot is cast, but currently even fewer people know how the paper is collected and counted and how the results are calculated – there’s enough “magic” happening already, from the point of view of the average voter.

Everyone should be able to vote if a simple tablet/tablet-like device is placed in the voting booth. A friend of mine, who is a field linguist, once told me that the indigenous people he’s working with love using his tablet, so anyone can use a clean touch-screen interface with clear indication of the choices. Usability is a major concern of course, and lots of usability and A/B testing has to be done, but that is doable.

Guaranteeing anonymity is one of the toughest problems. In my proposal for unified electronic identification I pointed out that there is a solution to that problem, and it’s called “anonymous credentials”. Here is an introduction to the technology. I understand how it works, but not well enough to explain it. But in short, the owner of the credentials generates a token that is used to represent him to the election authorities. The token cannot be linked to the owner, but contains enough information for the election authorities to verify that the person has the right to vote, and that he hasn’t voted already (here, the “election authority” is an automated system). The introductory article describes pretty well all aspects needed, including the “one-time spending” (4.1). What I can add is that the system can obtain some metadata about the voter – age group, gender, city, for statistical purposes (though sometimes in small towns people can be traced based on a few details).

A good implementation of anonymous credentials handles both the “anonymity” and “one-time” voting, provided each citizen has only one “digital credential”. This is guaranteed in an offline process – if all citizens have a mandatory ID card that contains their digital credentials, then the identity of the person is verified once by the issuing authority, and can later be used in elections (and many other government services). And before the fear of the big brother gets you, re-read the previous paragraph as to why the government can’t track you even if you have an ID card with a digital element in it.

Having the digital credentials, the voter is no longer tied to a particular voting location – people on business trips, temporarily living abroad, handicapped, or in any other way unable or unwilling to be present at the voting station on election day/week, will still be able to vote on the internet, provided they have a reader for their card.

Having said that, client-side security must be taken into account as well – the blockchain guarantees the data is secure once transmitted and results can’t be changed, but (as shown in a recent audit of the Estonian e-voting) there may be client-side attacks. What happens if the computer of the voter (or worse – the tablet at the voting station) is infected by malicious software? This is the case where a real security expert should step in, and many cases should be considered, because I can only suggest general principles. Of course, the identification card is protected by a PIN, and the reader can have a simple external keyboard to prevent a trojan horse from casting a vote on behalf of the voter. And having a secure smart-card (or smart-card-like) device makes sure that when you cast a vote nobody can intercept and replace it. But whether malicious code can interfere with the communication of the device and prevent the vote from being cast needs further research. I think that it is possible to be secure enough to prevent fraud on a large scale.

The staff that facilitates the voting process would need to switch the terminals (tablets) on, and that’s all. Since voting is activated by a card, they don’t need to manually activate it. All they have to do is make sure nobody steals a device, but that’s simple – a sound can be played if the device is disconnected, or moved, for example (a technique used in many shops nowadays). The start and the end of the election day can probably be given by all members of the section commission putting their digital cards in the reader.

And the final point is edge cases. What happens if the power is down? Well, batteries should last sufficiently long. And portable battery chargers can be distributed as well. What happens if the internet is down? And what about voting stations that don’t have access to the internet? If the internet goes down, results can be cached locally until the internet is back. “Paper trail” is something that can be used as a backup – each vote is printed and stored (automatically) in a box, and in case there are problems with the technology, we revert to the old-school way. And even if there is no cable/ADSL internet, or it goes down, 3G/GPRS is normally available (a contract with the mobile carriers has to be signed for the elections, of course, but bureaucracy is offtopic).

So, the solution outlined above depends on having a card, on complex software, on further client security investigation and also needs a lot of logistics considerations – for delivering and connecting the devices, contracts, etc. Regardless of all these ifs, it seems like technology is giving us a way to do elections digitally, and we should put some effort in that direction. Companies providing e-voting solutions can do that, but they should not rely on closed-source software, and would better rely on commodity hardware, making their business model a bit different.

And last, but not least – a lot of government and societal effort will be needed as well, even after the technology is in place.


Integration Tests for External Services

May 22, 2014

Our systems often depend on 3rd party services (they may even be services internal to the company that we have no control over). Such services include social networks exposing APIs, SaaS with APIs like Salesforce, authentication providers, or any system that our system communicates with, but is outside our product lifecycle.

In regular integration tests, we would have an integration deployment of all sub-systems, in order to test how they work together. In case of external services, however, we can only work with the real deployment (given some API credentials). What options do we have to write integration tests, i.e. check if our system properly integrates with the external system?

If the service provides a sandbox, that’s the way to go – you have a target environment where you can do anything and it will be short-lived and not visible to any end-users. This is, however, rare, as most external services do not provide such sandboxes.

Another option is to have an integration test account – e.g. you register an application at twitter, called “yourproduct-test”, create a test twitter account, and provide these credentials to the integration test. That works well if you don’t have complex scenarios involving multi-step interactions and a lot of preconditions. For example, if your application is analyzing tweets over a period of time, you can’t post tweets with the test account in the past.

The third option is mocks. Normally, mocks and integration tests are mutually exclusive, but not in this case. You don’t want to test whether the external service conforms to its specification (or API documentation) – you want to test whether your application invokes it in a proper way, and properly processes its responses. Therefore it should be OK to run a mock of the external system that returns predefined results under a predefined set of criteria. These results and criteria should correspond directly to the specifications.

This can be easily achieved by running an embedded mock server. There are multiple tools that can be used to do that – here’s a list of some of them for Java – WireMock, MockServer, MockWebServer, Apache Wink. The first three are specifically created for the above use case, while Apache Wink has a simple mock server class as part of a larger project.

So, if you want to test whether your application properly posts tweets after each successful purchase, you can (using WireMock, for example) do it as follows:

@Rule
public WireMockRule wireMockRule = new WireMockRule(8089);

@Test
public void purchaseTweetTest() {
    stubFor(post(urlEqualTo("/statuses/update.json"))
            .willReturn(aResponse()
                .withStatus(200)
                .withHeader("Content-Type", "application/json")
                .withBody(getMockJsonResponse()));

    // ...
    purchaseService.completePurchase(purchase);

    verify(postRequestedFor(urlMatching("/statuses/update.json"))
            .withRequestBody(
               matching(".*purchaseId: " + purchaseId + ".*")));
}

That way you will verify whether your communication with the external service is handled properly in your application, i.e. whether you integrate properly, but you won’t test with an actual system.

That, of course, has a drawback – the rules that you put in your mocks may not be the same as in the external system. You may have misinterpreted the specification/documentation, or it may not cover all corner cases. But for the sake of automated tests, I think this is preferable to supporting test accounts that you can’t properly clean up or populate with test data.

These automated integration tests can be accompanied by manual testing on a staging environment, to make sure that the integration is really working even with the actual external system.


Algorithmic Music Influenced by Tweets

May 19, 2014

Every now and then I make some addition to my algorithmic music composer. And now I decided to allow users to hear music that is based on their tweets.

It is nothing extraordinary, but I wanted to get some experience with basic natural language processing, so I decided to analyze the latest 200 tweets of the user, and perform the following analysis, which in turn influences the music that the user gets:

  • Sentiment analysis – I pass each tweet to the Stanford deep learning sentiment analysis (part of CoreNLP), which influences the musical scale – if the average sentiment of all tweets is: positive => major scale, negative => minor scale. Neutral leads to either lydian or dorian scale. Let me elaborate on “average sentiment” – the sentiment of each tweet is a number between 0 (very negative) and 4 (very positive). The average then, obviously, is the sum of the sentiments / number of tweets.
  • Tweet length – if your tweets are on average shorter than 40 characters, the scale is pentatonic. Otherwise it’s heptatonic
  • Tweet frequency – the average interval between your tweets determines the tempo of the music. The more tweets you post (and the lower the time between them is), the faster the music. Your tweet tempo is your tweets’ music tempo.
  • Variation – now that’s the most fuzzy metric, and the algorithm that I use is the most naive one. I extract all words from all tweets, apply a stemmer (that is, transform them to their base form, e.g. eats -> eat, thieves -> thief, etc., again part of CoreNLP), remove the stop words (and, or, then, etc. linking words). Then I calculate a topic threshold – the number of times a keywords must be present in order to be considered a topic of your tweets. Then I count the topics and the more topics you have, the more variation in the melody there is. The “variation in the melody” is not a standard metric of music, as far as I know, and I define it as the average distance between notes in the main part. So the more topics you have, the more up-and-down your music should be.
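To make the mappings above a bit more concrete, here is a minimal Java sketch of the four metrics. It is not the composer's actual code: the class and method names, the cut-offs for "negative" and "positive" averages (1.5 and 2.5), and the boolean used to choose between lydian and dorian are all assumptions made for illustration. Only the 0 to 4 sentiment range, the 40-character limit and the definition of variation come from the description above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TweetMusicMapper {

    enum Scale { MAJOR, MINOR, LYDIAN, DORIAN }

    // Assumed cut-offs on the 0..4 sentiment range; the real values are not stated.
    static final double NEGATIVE_BELOW = 1.5;
    static final double POSITIVE_ABOVE = 2.5;
    // From the text: tweets shorter than 40 characters on average give a pentatonic scale.
    static final int SHORT_TWEET_LENGTH = 40;

    /** Sum of the per-tweet sentiments (0..4) divided by the number of tweets. */
    static double averageSentiment(List<Integer> sentiments) {
        // Falls back to neutral (2.0) when there are no tweets; that default is an assumption.
        return sentiments.stream().mapToInt(Integer::intValue).average().orElse(2.0);
    }

    /** Positive => major, negative => minor, neutral => lydian or dorian. */
    static Scale scaleFor(double averageSentiment, boolean preferLydian) {
        if (averageSentiment > POSITIVE_ABOVE) {
            return Scale.MAJOR;
        }
        if (averageSentiment < NEGATIVE_BELOW) {
            return Scale.MINOR;
        }
        // How the neutral case picks between the two is not specified; a flag stands in here.
        return preferLydian ? Scale.LYDIAN : Scale.DORIAN;
    }

    /** True when the average tweet length is below 40 characters. */
    static boolean isPentatonic(List<String> tweets) {
        double averageLength = tweets.stream().mapToInt(String::length).average().orElse(0);
        return averageLength < SHORT_TWEET_LENGTH;
    }

    /** Counts words (already stemmed, stop words removed) that occur at least topicThreshold times. */
    static int countTopics(List<String> stemmedWords, int topicThreshold) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : stemmedWords) {
            counts.merge(word, 1, Integer::sum);
        }
        return (int) counts.values().stream().filter(c -> c >= topicThreshold).count();
    }

    /** Average absolute distance between consecutive notes, given as MIDI pitch numbers. */
    static double melodicVariation(int[] pitches) {
        if (pitches.length < 2) {
            return 0;
        }
        int total = 0;
        for (int i = 1; i < pitches.length; i++) {
            total += Math.abs(pitches[i] - pitches[i - 1]);
        }
        return (double) total / (pitches.length - 1);
    }
}
```

For example, `scaleFor(averageSentiment(List.of(4, 3, 3, 2)), true)` would pick a major scale under these assumed cut-offs, because the average is 3.0.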

Now onto the technical challenges. Sentiment analysis is a heavy process: CoreNLP’s pipeline class is not thread-safe, so a new instance would have to load the model each time. In addition to that, fetching 200 tweets is not that fast either. So I couldn’t make getting the corresponding music real-time. I had to use some sort of a queue. I decided to keep it simple and use my database (MySQL) as a queue – whenever a request for tweets-to-music comes, I insert the user id in a “twitter_music_requests” table. Then every X seconds (a fixed delay, meaning two runs cannot overlap) a scheduled job runs, picks the latest request with a flag “processed=false”, and processes it. That way only one thread processes requests at a time, meaning I can reuse the pipeline object and not load the CoreNLP model every time. The method that does it is synchronized, to enforce that constraint. When done with the whole process, the scheduled job marks the request as processed and sends an email to the user with a link to their musical piece.
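The database-as-a-queue idea can be sketched roughly like this. To keep the example self-contained, an in-memory list stands in for the “twitter_music_requests” table, and the actual work (fetching tweets, running the pipeline, sending the email) is reduced to recording the user id. The class, field and method names are made up for the sketch and are not the real ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class TwitterMusicQueue {

    /** One row of the (hypothetical) requests table. */
    static final class Request {
        final long userId;
        boolean processed;

        Request(long userId) {
            this.userId = userId;
        }
    }

    // Stand-in for the database table; a real version would use SQL inserts and selects.
    private final List<Request> table = new ArrayList<>();
    // Records which users were handled, in order, so the behaviour can be inspected.
    private final List<Long> handled = new ArrayList<>();

    /** Called when a tweets-to-music request comes in. */
    synchronized void enqueue(long userId) {
        table.add(new Request(userId));
    }

    /**
     * Picks the latest unprocessed request and handles it. Synchronized so that only
     * one thread at a time can use the shared, non-thread-safe pipeline object.
     * Returns false when there was nothing to do.
     */
    synchronized boolean processNext() {
        for (int i = table.size() - 1; i >= 0; i--) {
            Request request = table.get(i);
            if (!request.processed) {
                // Tweet fetching, sentiment analysis and the email would happen here.
                handled.add(request.userId);
                request.processed = true;
                return true;
            }
        }
        return false;
    }

    synchronized List<Long> handledUserIds() {
        return new ArrayList<>(handled);
    }

    /** Runs processNext with a fixed delay between the end of one run and the start of the next. */
    ScheduledExecutorService start(long delaySeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(this::processNext, 0, delaySeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

Because `scheduleWithFixedDelay` measures the delay from the end of the previous run, runs cannot overlap even when one request takes longer than the delay itself. Note that picking the latest request first, as described, means older requests wait whenever new ones keep arriving.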

As the music generation process is rather CPU-intensive (the conversion from midi to mp3, actually; the rest is pretty fast), I’ve decided to never generate pieces on the fly, but instead pre-generate them and play the latest generated ones. The twitter music is no exception, and even though it runs in the background, I still use that approach to speed up the process – the piece is not actually newly generated, but is looked up in the existing database, strictly based on the criteria listed above.

The end result can be found here, and I hope it is interesting. It was an interesting experiment for me, and although it seems rather pointless, it’s… cool.

The Low Quality of Scientific Code

May 11, 2014

Recently I’ve been trying to get a bit into music theory, machine learning and computational linguistics, so I ended up looking at libraries and tools written by the scientific community – examples include the Stanford CoreNLP library, GATE, Weka, jMusic, and several more.

The general feeling is that scientific libraries have mostly bad code. I will not point fingers, but there are too many freshman mistakes – not considering thread-safety, cryptic, ugly and/or stringly-typed APIs, lack of type-safety, poorly named variables and methods, choosing bad/slow serialization formats, writing debug messages to System.err (or out), lack of documentation, lack of tests.

Thus using these libraries becomes time-consuming and error-prone. Every 10 minutes you see some horribly written code that you don’t have the time to fix. And it’s not just one or two things that you would report in a normal open-source project – it’s an overall low level of quality. On the other hand, these libraries have a lot of value, because the low-level algorithms would take even more time, and especially know-how, to implement, so just reusing them is obviously the right approach. Some libraries are even original research, so you just can’t write them yourself without spending 3 years on a PhD thesis.

I cannot but mention Heartbleed here – OpenSSL is written by scientific people, and much has been written on the topic that even OpenSSL does not meet modern software engineering standards.

But that’s only the surface. Scientists in general can’t write good code. They write code simply to achieve their immediate goal, and then either throw it away, or keep using it for themselves. They are not software engineers, and they don’t seem to be concerned with code quality, code coverage, API design. Not to mention scientific infrastructure, deployment on multiple servers, managing environments. These things are rarely done properly in the scientific community.

And that’s not only in computer science and related fields like computational linguistics – it’s everywhere, because every science now requires at least computer simulations. Biology, bioinformatics, astronomy, physics, chemistry, medicine, etc. – almost every scientist has to write code. And they aren’t good at it.

And that’s OK – we are software engineers and we dedicate our time and effort to these things; they are scientists, and they have vast knowledge in their domain. Scientists use programming the way software engineers use public transport – just as a means to get to what they have to do. And scientists should not be distracted from their domain by becoming software engineers.

But the problem is still there. Not only are there bad libraries, but the code scientists write may yield wrong results, work slowly, or regularly crash, which directly slows down or even invisibly hampers their work.

For the libraries, we, software engineers, can contribute, or companies using them can dedicate an engineer to improving the library. Refactor, clean up, document, test. The authors of the libraries will be more than glad to have someone prettify their hairy code.

The other problem is tougher – science needs funding for dedicated software engineers, and research teams prefer to use that funding for actual scientists. And maybe that’s a better investment, maybe not. I can say for myself that I’ll be glad to join a research team and help with the software part, while at the same time gaining knowledge in the field. And that would be fascinating, and way more exciting than writing boring business software. Unfortunately that doesn’t happen too often now (I tried once, a couple of years ago, and got rejected, because I lacked formal education in biology).

Maybe software engineers can help in the world of science. But money is a factor.

Development “Methodologies”

April 25, 2014

Below are several development “methodologies” that are popular and even industry-standard:

Hype-Driven Development – you are either a startup, or you are given the freedom to choose whatever technology you like for your new cool-cutting-edge-disrupting-innovative-did-I-say-cool project. What technologies to use? The recently overhyped ones, of course. Let’s do it in Node.js, and you have to make it reactive, and do DevOps, even though your Linux experience is limited to running Ubuntu on your desktop, and store the data in MongoDB (it’s web-scale!). Speaking of web-scale, you will obviously need to go to the cloud. Then all of a sudden you have an unintelligible codebase, broken servers, pages take a second to load, you lose data every now and then, and you are unable to actually scale. But hey, you used cutting-edge, web-scale reactive technologies that you didn’t understand when you started, and that were not applicable to your domain. But now you write a blog post describing how to solve a problem with them that has been solved for decades with other technologies. And you post it to Hacker News, contributing to the hype.

Demo-Driven Development – you work in a team, part of a big organization that has adopted Agile/SCRUM. Or you are a “lean startup” and define the projects on the fly. In both cases the project does not have a clearly defined goal, but somehow money is being poured into it, so it has to keep going. The end result doesn’t seem to matter, but the process has to be followed and you need to be able to demo stuff to the stakeholders. So you write your code disregarding the fact that it should be used in production, and write it only to make it demoable. In small doses this is good, because being able to show something is indeed helpful to the project, although not intrinsically valuable to the product. But you can easily go overboard and end up working for a year and demoing stuff that is completely unusable.

Copy-Paste Driven Development – applicable in multiple scenarios. If the team consists of many junior programmers and only one senior, a “kindergarten expert”, or if engineers in the project are constantly coming and going, without having time to really understand it, or if the company has many very similar projects, but doesn’t have the resources to build a common, reusable toolkit, the usual development practice is to copy code from existing projects or features and paste it in the new one, changing the names of variables and methods. Another flavor is copy-pasting snippets from Stack Overflow or mailing lists. That might go well for a while, and it’s generally good to have uniformity in the code. But often you’ll end up with code that’s there for no obvious reason, and nobody knows how and, more importantly, why it works. Or doesn’t work. (This methodology is especially applicable to test code.)

Denial Driven Development – when the head(s) of engineering or architects, appointed as such solely because of their age and inability or unwillingness to find another job, shun all frameworks and libraries, and insist that everything should be written internally. (This is the opposite of “Hype-Driven Development”.) You get to work with a lot of low-level and complex stuff. The sense of achievement is really high, and you go on reddit and flame everyone that uses puny frameworks. The only downside is that everything breaks, it takes twice the time to develop features (if you actually reach the feature development phase) and each new team member has to first go through a 3-month introductory course. But hey, in the end you at least have a solid framework that nobody uses, that doesn’t handle real-world cases, and that you probably can’t open-source, because it’s proprietary.

If you are using these methodologies, you are probably in trouble, but if you haven’t realized that yet, this post probably won’t be useful anyway. And I’m not even telling you how to fix them. Well, I can tell you how to fix them – be sensible. But that’s too hard.
