AWS “Noisiness”

May 14, 2015

You may be familiar with the “noisy neighbour” problem with virtualization – someone else’s instances on the same physical machine “steal” CPU from your instance. I won’t be giving clues on how to solve the issue (the quick & dirty way is to terminate your instance and let it be spawned on another physical machine); instead, I’ll share my observations “at scale”.

I didn’t actually experience a typical “noisy neighbour” on AWS – i.e. one instance being significantly “clogged”. But as I noted in an earlier post about performance benchmarks, the overall AWS performance depends on many factors.

The time of day is the obvious one – as I’m working from UTC+2, my early morning is the time when Europe has not yet woken up and the US has already gone to sleep, so the load on AWS should be lower. When I experiment with CloudFormation stacks in the morning versus the afternoon, the difference is quite noticeable (though I haven’t measured it) – “morning” stacks are up and running much faster than “afternoon” ones. It takes less time for the instances, the ELBs, and the whole stack to be created.

But last week we observed something rather curious. Our regular load test had to be run on Thursday, but from then until the end of the week the performance was horrible – we couldn’t even get a healthy run: many, many requests were failing due to timeouts from internal ELBs. On top of that, spot instances (instances for which you bid a certain price, and which someone else can “steal” from you at any time) were rather hard to keep – there was huge demand for them, and ours were constantly claimed by someone else. Yet the AWS region was reported to be in an “ok” state, with no errors.

What was happening last Thursday? The UK elections. I can’t prove they had such an effect on the whole of AWS, and I initially offered it as a joke explanation, but an EU AWS region during the UK elections is likely to experience high load. Noticeably high load, as it seems, so that the whole infrastructure for everyone else was under pressure. (It might have been a coincidence, of course.) And it wasn’t a typical “noisy neighbour” – it was the ELBs that were not performing. And then, this week, things were back to normal.

The AWS infrastructure is complex, it has way more than just “instances”, so even if you have enough CPU to handle noisy neighbours, any other component can suffer from increased load on the whole infrastructure. E.g. ELBs, RDS, SQS, S3, even your VPC subnets. When AWS is under pressure, you’ll feel it, one way or another.

The moral? Embrace failure, of course. Have monitoring that notifies you of such periods of infrastructure instability, and have a fault-tolerant setup with proper retries and fallbacks.
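As a minimal sketch of the “proper retries” part (everything here – the command, the attempt count, the URL – is illustrative), a small shell helper that retries a flaky command with exponential backoff:

```shell
# retry COMMAND... - run a command, retrying with exponential backoff
# (an illustrative sketch; tune the attempt count and base sleep for your setup)
retry() {
    local max_attempts=5
    local sleep_s=1
    local attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${sleep_s}s: $*" >&2
        sleep "$sleep_s"
        sleep_s=$((sleep_s * 2))
        attempt=$((attempt + 1))
    done
}

# example: a health check against a (hypothetical) internal ELB
# retry curl --silent --fail --max-time 5 http://internal-elb.example.com/health
```

The fallback part is application-specific – e.g. serving stale cached data when a dependency keeps timing out.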


Log Collection With Graylog on AWS

May 8, 2015

Log collection is essential for properly analyzing issues in production. An interface to search and be notified about exceptions on all your servers is a must. Well, if you have one server, you can easily ssh to it and check the logs, of course, but for larger deployments, collecting logs centrally is far preferable to logging in to 10 machines to find out “what happened”.

There are many options to do that, roughly separated into two groups – 3rd party services and software you install yourself.

3rd party (or “cloud-based”, if you want) log collection services include Splunk, Loggly, Papertrail, Sumologic. They are very easy to set up, and you pay for what you use. Basically, you send each message (e.g. via a custom logback appender) to the provider’s endpoint and then use their dashboard to analyze the data. In many cases that would be the preferred way to go.

In other cases, however, company policy may frown upon using 3rd party services to store company-specific data, or additional costs may be undesired. In these cases extra effort needs to be put into installing and managing an internal log collection software. They work in a similar way, but implementation details may differ (e.g. instead of sending messages with an appender to a target endpoint, the software, using some sort of an agent, collects local logs and aggregates them). Open-source options include Graylog, FluentD, Flume, Logstash.

After some very quick research, I considered Graylog to fit our needs best, so below is a description of the installation procedure on AWS (though the first part applies regardless of the infrastructure).

The first thing to look at are the ready-to-use images provided by graylog – docker, openstack, vagrant and AWS. Unfortunately, the AWS version has two drawbacks. It uses Ubuntu rather than the Amazon AMI – not a huge issue, although some generic scripts you use in your stack may have to be rewritten. The other one was the dealbreaker: when you start the image, it doesn’t run the web interface, although it claims it should – only mongodb, elasticsearch and graylog-server are started. Having two instances – one for the web interface and one for everything else – would complicate things, so I opted for a manual installation.

Graylog has two components – the server, which handles the input, indexing and searching, and the web interface, which is a nice UI that communicates with the server. The web interface uses mongodb for metadata, and the server uses elasticsearch to store the incoming logs. Below is a bash script (CentOS) that handles the installation. Note that there is no “sudo”, because initialization scripts are executed as root on AWS.


# install pwgen for password-generation
yum -y upgrade ca-certificates --enablerepo=epel
yum --enablerepo=epel -y install pwgen

# mongodb
# 2015-era repo for the mongodb-org packages (adjust the URL for newer versions)
cat >/etc/yum.repos.d/mongodb-org.repo <<'EOT'
[mongodb-org]
name=MongoDB Repository
baseurl=http://downloads-distro.mongodb.org/repo/redhat-os/x86_64/
gpgcheck=0
enabled=1
EOT

yum -y install mongodb-org
chkconfig mongod on
service mongod start

# elasticsearch
rpm --import https://packages.elasticsearch.org/GPG-KEY-elasticsearch

cat >/etc/yum.repos.d/elasticsearch.repo <<'EOT'
[elasticsearch-1.4]
name=Elasticsearch repository for 1.4.x packages
baseurl=http://packages.elasticsearch.org/elasticsearch/1.4/centos
gpgcheck=1
gpgkey=https://packages.elasticsearch.org/GPG-KEY-elasticsearch
enabled=1
EOT

yum -y install elasticsearch
chkconfig --add elasticsearch

# configure elasticsearch 
sed -i -- 's/ elasticsearch/ graylog2/g' /etc/elasticsearch/elasticsearch.yml 
sed -i -- 's/#network.bind_host: localhost/network.bind_host: localhost/g' /etc/elasticsearch/elasticsearch.yml

service elasticsearch stop
service elasticsearch start

# java
yum -y update
yum -y install java-1.7.0-openjdk
update-alternatives --set java /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/bin/java

# graylog
# (assumes graylog-1.0.1.tgz has already been downloaded to the working directory)
tar xvzf graylog-1.0.1.tgz -C /opt/
mv /opt/graylog-1.0.1/ /opt/graylog/
# point graylogctl to the installed jar and a proper log file location
# (the init.d wrapper created below delegates to this script)
sed -i -e 's/GRAYLOG2_SERVER_JAR=\${GRAYLOG2_SERVER_JAR:=graylog.jar}/GRAYLOG2_SERVER_JAR=\${GRAYLOG2_SERVER_JAR:=\/opt\/graylog\/graylog.jar}/' /opt/graylog/bin/graylogctl
sed -i -e 's/LOG_FILE=\${LOG_FILE:=log\/graylog-server.log}/LOG_FILE=\${LOG_FILE:=\/var\/log\/graylog-server.log}/' /opt/graylog/bin/graylogctl

cat >/etc/init.d/graylog <<'EOT'
#!/bin/bash
# chkconfig: 345 90 60
# description: graylog control
sh /opt/graylog/bin/graylogctl $1
EOT

chkconfig --add graylog
chkconfig graylog on
chmod +x /etc/init.d/graylog

# graylog web
# (assumes graylog-web-interface-1.0.1.tgz has already been downloaded to the working directory)
tar xvzf graylog-web-interface-1.0.1.tgz -C /opt/
mv /opt/graylog-web-interface-1.0.1/ /opt/graylog-web/

cat >/etc/init.d/graylog-web <<'EOT'
#!/bin/bash
# chkconfig: 345 91 61
# description: graylog web interface
sh /opt/graylog-web/bin/graylog-web-interface > /dev/null 2>&1 &
EOT

chkconfig --add graylog-web
chkconfig graylog-web on
chmod +x /etc/init.d/graylog-web

mkdir --parents /etc/graylog/server/
cp /opt/graylog/graylog.conf.example /etc/graylog/server/server.conf
sed -i -e 's/password_secret =.*/password_secret = '$(pwgen -s 96 1)'/' /etc/graylog/server/server.conf

# "password" below is the example admin password - replace it with your own
sed -i -e 's/root_password_sha2 =.*/root_password_sha2 = '$(echo -n password | sha256sum | awk '{print $1}')'/' /etc/graylog/server/server.conf

sed -i -e 's/application.secret=""/application.secret="'$(pwgen -s 96 1)'"/g' /opt/graylog-web/conf/graylog-web-interface.conf
# 12900 is the default port of the graylog REST API
sed -i -e 's/graylog2-server.uris=""/graylog2-server.uris="http:\/\/localhost:12900\/"/g' /opt/graylog-web/conf/graylog-web-interface.conf

service graylog start
sleep 30
service graylog-web start

You may also want to set a TTL (auto-expiration) for messages, so that you don’t store old logs forever. Here’s how:

# wait for the index to be created
INDEXES=$(curl --silent "http://localhost:9200/_cat/indices")
until [[ "$INDEXES" =~ "graylog2_0" ]]; do
	sleep 5
	echo "Index not yet created. Indexes: $INDEXES"
	INDEXES=$(curl --silent "http://localhost:9200/_cat/indices")
done

# set each indexed message auto-expiration (ttl)
curl -XPUT "http://localhost:9200/graylog2_0/message/_mapping" -d'{"message": {"_ttl" : { "enabled" : true, "default" : "15d" }}}'

Now you have everything running on the instance. Then you have to do some AWS-specific things (if using CloudFormation, that would include a pile of JSON). Here’s the list:

  • you can either have an auto-scaling group with one instance, or a single instance. I prefer the ASG, though the other one is a bit simpler. The ASG gives you auto-respawn if the instance dies.
  • set the above script to be invoked in the UserData of the launch configuration of the instance/asg (e.g. by getting it from s3 first)
  • allow UDP port 12201 (the default logging port). That should happen for the instance/asg security group (inbound), for the application nodes security group (outbound), and also as a network ACL of your VPC. Test the UDP connection to make sure it really goes through. Keep the access restricted for all sources, except for your instances.
  • you need to pass the private IP address of your graylog server instance to all the application nodes. That’s tricky on AWS, as private IP addresses change. That’s why you need something stable. You can’t use an ELB (load balancer), because it doesn’t support UDP. There are two options:
    • Associate an Elastic IP with the node on startup and pass that IP to the application nodes. But there’s a catch – if they connect to the elastic IP, the traffic goes through NAT (if you have one), and you may have to open your instance “to the world”. So you must turn the elastic IP into its corresponding public DNS name, which, within the VPC, resolves to the private IP. You can do that manually, in a hacky way, by constructing the name from the IP itself.
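A minimal sketch of that hack (the IP and region below are placeholders; note that us-east-1 uses a slightly different pattern, ec2-x-x-x-x.compute-1.amazonaws.com):

```shell
# construct the public DNS name of an elastic IP from the IP itself;
# within the VPC this name resolves to the instance's private IP
ELASTIC_IP="203.0.113.25"    # placeholder elastic IP
REGION="eu-west-1"           # placeholder region
PUBLIC_DNS="ec2-$(echo "$ELASTIC_IP" | tr '.' '-').${REGION}.compute.amazonaws.com"
echo "$PUBLIC_DNS"
```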

      Alternatively, you can use the AWS EC2 CLI to obtain the details of the instance that the elastic IP is associated with, and then, with another call, obtain its public DNS name.

    • Instead of using an Elastic IP, which limits you to a single instance, you can use Route53 (the AWS DNS service). When a graylog server instance starts, it appends itself to a Route53 record, which allows round-robin DNS over multiple graylog instances in a cluster. Manipulating the Route53 records is again done via the AWS CLI. Then you just pass the domain name to the application nodes, so that they can send messages.
  • alternatively, you can install graylog-server on all the nodes (as an agent), and point them to an elasticsearch cluster. But that’s more complicated and probably not the intended way to do it
  • configure your logging framework to send messages to graylog. There are standard GELF (the Graylog log format) appenders, e.g. this one, and the only thing you have to do is use the public DNS environment variable in the logback.xml (which supports environment variable resolution).
  • You should make the web interface accessible outside the network, so you can use an ELB for that, or the round-robin DNS mentioned above. Just make sure the security rules are tight and not allowing external tampering with your log data.
  • If you are not running a graylog cluster (which I won’t cover), then the single instance can potentially fail. That isn’t a great loss, as log messages can be obtained from the instances, and they are short-lived anyway. But the metadata of the web interface is important – dashboards, alerts, etc. So it’s good to do regular backups (e.g. with mongodump). Using an EBS volume is also an option.
  • Even though you send your log messages to the centralized log collector, it’s a good idea to also keep local logs, with the proper log rotation and cleanup.
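The Route53 self-registration step above can be sketched like this (the hosted zone id and record name are placeholders, and the actual aws call is left commented out, since it needs real credentials and a real hosted zone; a real multi-instance setup would also merge the new IP into the existing record set rather than replace it):

```shell
# build a Route53 change batch that registers this instance's private IP
HOSTED_ZONE_ID="Z0000000000000"              # placeholder hosted zone id
RECORD_NAME="graylog.internal.example.com"   # placeholder record name

# on EC2 the private IP comes from the instance metadata service;
# fall back to a dummy value when running outside EC2
PRIVATE_IP=$(curl --silent --max-time 2 http://169.254.169.254/latest/meta-data/local-ipv4 2>/dev/null || true)
PRIVATE_IP=${PRIVATE_IP:-10.0.0.1}

CHANGE_BATCH=$(cat <<EOT
{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "${RECORD_NAME}",
      "Type": "A",
      "TTL": 60,
      "ResourceRecords": [{"Value": "${PRIVATE_IP}"}]
    }
  }]
}
EOT
)

# aws route53 change-resource-record-sets \
#     --hosted-zone-id "$HOSTED_ZONE_ID" --change-batch "$CHANGE_BATCH"
echo "$CHANGE_BATCH"
```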

It’s not a trivial process, but it’s essential to have log collection, so I hope the guide has been helpful.


Should We Look For a PRISM Alternative?

May 3, 2015

I just watched Citizenfour, and it made me think again about mass surveillance. And it’s complicated.

I would like to leave aside the US foreign policy (where I agree with Chomsky’s criticism), and whether “terrorist attacks” would have been an issue if the US government didn’t do all the bullshit it does across the world. Let’s assume there is always someone out there trying, for no rational reason, to blow up a bus or a train. In the US, Europe, or anywhere. And with the internet, it becomes easier for that person to find both motivation and means to do so.

From that point of view it seems entirely justified to look for those people day and night, in an attempt to prevent them from killing innocent people. Whether the effort to prevent a thousand deaths by terrorists is comparable to the effort needed to prevent millions of deaths due to car crashes, obesity-related diseases, malpractice, police brutality, poor living conditions and more, is beyond the scope of this discussion. And regardless of whether PRISM has helped prevent attacks so far, it may do so in the future.

Privacy, on the other hand, is fundamental, and we must not be monitored by a “benign government”, regardless of the professed cause. I genuinely believe that none of the officials involved in PRISM envision or aim at any Orwellian dystopia, but that doesn’t mean their actions can’t lead to one. Creating the means to implement a surveillance state is just one step away from having one, regardless of the intentions those means were created with. In a not-so-impossible scenario, the thin oversight of PRISM could be wiped away and the “proper people” given access to the data. I live in a former communist state, so believe me, that danger is real. And it’s not the only one – self-censorship is another, and it can really skew the course of a society.

So can’t we have both privacy and security? Shall we sacrifice liberties in order to feel less threatened (and in the end get neither, as Franklin said)? Of course not. But I think the implementation details are, again, the key to the optimal solution. Can there be some sort of solution that doesn’t give the government all the data about all the citizens, yet still serves as a means of protection against the irrational person who plans to kill people?

The NSA used private companies as a source of data (even though the companies deny that) – google searches, facebook messages, emails, text messages, etc. All of that was poured into a huge database, then searched and analyzed. For good reasons, allegedly, but we don’t trust the government, do we? And yet we trust the companies with our data – or we don’t care – and we hope that they will protect our privacy. They use the data to target ads at us, which we accept. But handing that data to an almighty government crosses the line. And even though the companies deny any large-scale data transfer, the veil of secrecy over PRISM hints otherwise. Receiving all the data related to a given search term is a rather large-scale data transfer.

My thoughts in this respect lead me to think of alternatives that would still be able to prevent detectable attacks, but would not hand the government the tools to become a superpowerful state. And, maybe naively, I think that’s achievable to an extent. No, you can’t prevent a terrorist from using Tor, HTTPS, PGP, Bitcoin, a possible Silk Road successor, etc. But you can’t prevent a terrorist from meeting an illegal salesman in a parking garage either. And besides, that’s not what mass surveillance solves anyway – if it solves anything, it’s the low-hanging fruit: the inept wrongdoers.

But what if there were a piece of software trained (using machine learning) to detect suspicious profiles? What if that software were open-source and distributed to major tech companies (like the ones participating in PRISM – google, facebook, etc.)? It would work as follows: it receives as input anonymized profiles of the company’s users, analyzes them (locally) and flags the suspicious ones. The anonymized suspicious profiles are then sent to the NSA. The link between the data and the real profile (names, IPs, locations) is encrypted with the public key of a randomly selected court (specialized or not), so that only the court can de-anonymize it. If the NSA considers a flagged profile a real danger, it can request that the court de-anonymize the data. How is that different from the search-term-based solution? It’s open, more sophisticated than a keyword match, and way less data is sent to the NSA.

That way the government doesn’t get huge amounts of data – only a tiny fraction, flagged as “suspicious”. Can the companies cheat that, if paid by the NSA – well, they can – they have the data. But preventing that is in their interest as well as that of the public, given that there is a legal way to help the government prevent crimes.

Should we be algorithmically flagged as “suspicious” based on something we wrote online, and isn’t that again an invasion of privacy? That question doesn’t make my middle-ground-finding attempt easier. It is, yes, but it doesn’t make an Orwellian super-state possible; it doesn’t give the government immense power that can be abused. And, provided there’s trust in the court, it shouldn’t lead to self-censorship (e.g. refraining from terrorist jokes, due to fear of being “flagged”).

Can the government make that software flag not only terrorists, but also everyone who is critical of the government, as an “enemy of the state”? It can. But if the software is open, and companies are not forced to use it unconditionally, then that won’t happen (or will it?).

The Internet has made the world wonderful, and at the same time more complicated. Offline analogies don’t work well (e.g. postman reading your letters, or constantly having an NSA agent near you), because of the scale and anonymity. I think, given that no government can abuse the information, and that no human reads your communication, we can have a reasonable middle ground, where privacy is preserved, and security is increased. But as I pointed out earlier, that may be naive. And in case it is naive, we should drop the alleged increase in security, and aim to achieve it in a different way.


The Precious Feature Design Meetings

April 29, 2015

As we know, meetings are where work goes to die. The debate about the point of meetings aside, there is one type of meeting that I love. It has many names, depending on who you ask – design review, design overview, feature design. And I see it as the most important meeting in software engineering.

What is it? Let’s start with a little background. You are probably “doing agile”, so you have user stories. They are already groomed and sized (based partly on some known details and partly on intuition), and you have to start working on one of them. (Below I will use the auxiliary verb “should”, because this is part description, part recommendation.)

But before diving into coding, you, together with another team member, should take a step back and discuss what exactly the implementation would look like – what programming and design approaches should be taken in order to complete the story. These include the API you’ll expose, the interactions between components, or even a deployment procedure. It may involve UML on a whiteboard. Don’t waste too much time on the details or on discussing which way is slightly better than the other, as you can refactor that later.

After that you hold a meeting where you present the result of your analysis to the team. The team may have questions, or suggestions that you didn’t think about, so the meeting itself is both informative and productive.

This is not applicable to all stories – some are absolutely obvious, so having a meeting would be a waste of time. But whenever there’s some specific design decision to be made, or even a broader question “how do we do that?” to be answered, these short meetings are golden. Not only because you take better decisions, but also because everyone on the team is informed about the decision and more importantly – about the reasons for that decision. So when another team member has to work on the same part of the code base in three weeks, he’ll have at least part of the picture still in his head.

I would recommend having such meetings (limiting them to 20 minutes), for the sake of having better design and a more informed team. And they are not actually “full-featured” meetings – they don’t need to be on the calendar a few days prior, and depending on the size of the team, they can be done even while sitting at your desks (or via skype), almost as a casual conversation.


KISS With Essential Complexity

April 14, 2015

Accidental complexity, in a broader sense, is the complexity that developers add to their code and that is not necessary for the code to work. That may include overengineering, overuse of design patterns, poor choice of tools, frameworks and paradigms, or writing snippets of code in a hard-to-read way. For example, if you can do a project with a simple layered architecture, doing it with microservices (and having to decide on granularity, coordinate them, etc.) or with a message-driven architecture (setting up the broker and all sorts of queue configurations) increases the complexity of the software unnecessarily, and is therefore accidental complexity. If you want to parse XML and convert it to objects, using SAX adds a lot of accidental complexity compared to an XML-to-object mapper (e.g. JAXB), where you (hopefully) just add a few annotations. If your logic can be expressed with a few lambda expressions, but instead you write several nested for-loops with if-clauses inside, that’s accidental complexity too. For me, at least, accidental complexity is about making code hard to read, maintain and deploy for no good reason, apart from not knowing better.

Essential complexity on the other hand comes from the world you are trying to model with your software. It’s about the inevitable edge cases you have to handle if you want your software to be fit for actual use. Essential complexity can and does make your code harder to read and maintain. It makes it look like “legacy” code, but as Spolsky points out, that’s the way of things, and that’s the way it should be. Unexpected API calls, classes that exist for some bizarre edge-case that you discovered after half a year of actual use, ifs and fors that you think you can just remove – these are the marks of real software.

(I’m aware of another view of accidental complexity – that it still adds value, but is not the problem you are solving. That’s a long discussion, but I think anything that is inherently complex and needs to be done (e.g. rolling updates) is essential complexity, i.e. you can’t do without it.)

If the business process you are modeling has a lot of branches and even loops, and it can’t be optimized, then the code that handles that business process has to be “hairy”. When your software has to run on a device that can lose connectivity, have poor connectivity, or be restarted at any moment, then the code for retrying, for re-applying offline steps, and the like, is necessary, even if it’s huge and hard to follow.

But is that it? Can we really do nothing about our essential complexity, other than leave the ugly bits of code in place, shrugging and saying “well, I know it’s bad, man, but what can you do – essential complexity”?

Well, we can’t get rid of it. But we can make it slightly friendlier. I have two specific approaches.

Document the scenarios that require the complexity. Either directly in the code, or linked in the code. Most of the code that looks “WTF” can look completely logical if you know why it’s there. If we make sure all the bizarre code makes sense to everyone, by explaining the business reason behind it, then we have solved part of the problem.

But that’s just on the surface. Can we actually follow the “Keep it simple, stupid” (KISS) principle when it comes to essential complexity? Yes, to an extent. You can’t make complexity simple, but you can present it in a simpler way. What we want to achieve is reduce the perceived complexity, to make it easier to follow and reason about.

The first thing to look for is any accidental complexity you have introduced around the essential one. It often happens that essential complexity makes accidental complexity more likely to appear – probably because all of the developer’s focus goes into grasping every aspect of the scenario at hand, and good practices get forgotten. But eliminating that is not enough either.

Ironically, this is where common design patterns and specific frameworks come in handy. Do you need to represent a complex sequence of states in your application? Use a finite-state-machine implementation, rather than bits and pieces here and there. Do you need to represent a complex business process? Use a business process management framework, rather than bare flow-control structures. Do you have a lot of dependencies in your classes (even though your classes are designed and packaged well)? Use a dependency injection framework. Or, in many cases, just refactor. I know that’s the most obvious advice, but we’ve all seen complex methods that just do a lot of stuff and don’t follow it – because the code grew over time and nobody realized how big it had become.
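As a toy illustration of the state-machine idea (in shell, to match the other snippets on this blog; the order lifecycle is entirely made up): instead of scattering if-checks across the code, the legal transitions live in one table and every state change goes through a single function:

```shell
# a tiny table-driven state machine for an imaginary order lifecycle;
# the point: every legal transition lives in one place, and every state
# change goes through a single function instead of scattered if-checks
TRANSITIONS="
new:paid
paid:shipped
paid:refunded
shipped:delivered
"

STATE="new"

transition() {
    local target="$1"
    if echo "$TRANSITIONS" | grep -q "^${STATE}:${target}$"; then
        STATE="$target"
    else
        echo "illegal transition: $STATE -> $target" >&2
        return 1
    fi
}

transition paid && transition shipped
echo "$STATE"                               # shipped
transition refunded || echo "refused, as expected"
```

Adding a new state or transition means touching the table, not hunting down every conditional – which is exactly the reduction of perceived complexity we’re after.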

But apart from a couple of examples, I cannot give a general rule. Reducing the perceived complexity is (obviously) highly dependent on the perception of the one reducing it. But as a one-line piece of advice: always think about how you can rearrange the code around the inherent, essential complexity of your application so that it looks less complex.


Getting Notified About RabbitMQ Cluster Partitioning

April 6, 2015

If you are running RabbitMQ in a cluster, it is not unlikely that the cluster gets partitioned (part of the cluster losing connection to the rest). The basic commands to show the status and configure the behaviour are explained in the linked page above. When partitioning happens, you first want to be notified about it, and second – to resolve it.

RabbitMQ can actually handle the second part automatically, via the cluster_partition_handling configuration. It has three values: ignore, pause_minority and autoheal. The partitions guide linked above explains them as well (“Which mode should I pick?”). Note that whichever you choose, you still have a problem and have to restore connectivity. For example, in the multi-availability-zone setup I explained a while ago, it’s probably better to use pause_minority and then reconnect manually.

Fortunately, it’s rather simple to detect partitioning. The cluster_status command shows an empty “partitions” element if there is no partitioning; if there are partitions, that element is either non-empty or missing altogether. So this line does the detection:

clusterOK=$(sudo rabbitmqctl cluster_status | grep "{partitions,\[\]}" | wc -l)

You would want to schedule that script to run every minute, for example. What to do with the result depends on the tool you use (Nagios, CloudWatch, etc.). For Nagios there is actually a ready-to-use plugin. And if it’s AWS CloudWatch, you can do the following:

if [ "$clusterOK" -eq "0" ]; then
	echo "RabbitMQ cluster is partitioned"
	aws cloudwatch put-metric-data --metric-name $METRIC_NAME --namespace $NAMESPACE --value 1 --dimensions Stack=$STACKNAME --region $REGION
else
	aws cloudwatch put-metric-data --metric-name $METRIC_NAME --namespace $NAMESPACE --value 0 --dimensions Stack=$STACKNAME --region $REGION
fi

When partitioning happens, the important thing is getting notified about it. What you do next depends on the particular application, the problem, and the configuration of the queues (durable, mirrored, etc.).
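To actually get notified, you can put a CloudWatch alarm on that metric. A sketch (the alarm name, namespace, metric name and SNS topic ARN are all placeholders, and the actual call is commented out since it needs real credentials and a real topic):

```shell
# placeholders mirroring the variables used in the metric snippet above
NAMESPACE="Custom/RabbitMQ"
METRIC_NAME="RabbitMQPartitions"
STACKNAME="my-stack"
REGION="eu-west-1"
SNS_TOPIC="arn:aws:sns:eu-west-1:123456789012:ops-alerts"   # placeholder ARN

# alarm when the metric reports 1 for two consecutive minutes
ALARM_CMD="aws cloudwatch put-metric-alarm \
  --alarm-name rabbitmq-partitioned \
  --namespace $NAMESPACE --metric-name $METRIC_NAME \
  --dimensions Name=Stack,Value=$STACKNAME \
  --statistic Maximum --period 60 --evaluation-periods 2 \
  --threshold 1 --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions $SNS_TOPIC --region $REGION"

# eval "$ALARM_CMD"   # uncomment with real credentials and a real SNS topic
echo "$ALARM_CMD"
```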


A Non-Blocking Benchmark

March 23, 2015

A couple of weeks ago I asked the question “Why non-blocking?”. I didn’t reach a definitive answer, although it seemed that writing non-blocking code is not clearly the better option – it isn’t necessarily faster, nor does it necessarily give higher throughput, even though conventional wisdom says it should.

So, leaving the theoretical questions behind, I decided to do a benchmark. The code is quite simple – it reads a 46KB file into memory and then writes it to the response. That’s the simplest scenario that’s still close to the regular use case of a web application – reading stuff from the database, performing some logic on it, and then writing a view to the client (it’s disk I/O vs network I/O in case the database is on another server, but let’s disregard that for now).

There are 5 distinct scenarios: a servlet using the BIO connector, a servlet using the NIO connector, Node.js, Node.js using synchronous file reading, and Spray (a Scala non-blocking web framework). Gatling was used to perform the tests and was run on a t2.small AWS instance; the application code ran on a separate m3.large instance.

The code used in the benchmark as well as the full results are available on GitHub. (Note: please let me know if you spot something really wrong with the benchmark that skews the results)

What do the results tell us? That it doesn’t matter whether it’s blocking or non-blocking. Differences in response time and requests/sec (as well as the other factors) are negligible.

Spray appears to be slightly better when the load is not so high, whereas BIO happens to have more errors under really high load (while being fastest at the same time), and Node.js is surprisingly fast for a JavaScript runtime (kudos to Google for V8).

The differences in the different runs are way more likely to be due to the host VM current CPU and disk utilization or the network latency, rather than the programming model or the framework used.

After reaching this conclusion, the fact that Spray was seemingly faster bugged me (especially given that I executed the Spray tests half an hour after the rest), so I wanted to rerun the tests this morning. And my assumption about the role of infrastructure factors could not have been proven more right. I ran the 60-thousand-request test, and the mean time was 3 seconds (for both Spray and the servlet), with a couple of hundred failures and only 650 requests/sec. This aligns with my observation that AWS works a lot faster when I start and delete CloudFormation stacks early in the morning (UTC+2, when Europe is still asleep and the US is already in bed).

The benchmark is still valid, as I executed all of it within one hour on a Sunday afternoon. But the whole experiment convinced me even more of what I concluded in my previous post – non-blocking doesn’t have visible benefits, and one should not force oneself to use the possibly unfriendly callback programming model for the sake of imaginary performance gains. Niche cases aside, for the general scenario you should pick the framework, language and programming model that the people on your team are most comfortable with.


How to Land a Software Engineering Job?

March 13, 2015

The other day I read this piece by David Byttow on “How to land an engineering job”. And I don’t fully agree with his assertions.

I do agree, of course, that one must always be writing code. Not writing code is the worst that can happen to a software engineer.

But some details are where our opinions diverge. I don’t agree that you should know the complexities of famous algorithms and data structures by heart, and I don’t agree that you should be able to implement them from scratch. He gives no justification for this advice – he just says “do so”. And don’t get me wrong – you should know what computational complexity is, and what algorithms exist for traversing graphs and trees. But implementing them yourself? What for? I have implemented sorting algorithms, tree structures and the like a couple of times, just for the sake of it. Two years later I can’t do it again without checking an example or a description. Why? Because you never need those things in your day-to-day programming. And why would you memorize the complexity of a graph search algorithm if you can look it up in 30 seconds?

The other thing I don’t agree with is solving TopCoder-like problems. Yes, they probably help you improve your algorithm-writing skills, but spending time on that, rather than writing actual code (e.g. as side projects), is to me a bit of a waste. Not something you should avoid, but something that you don’t have to do. If you like solving those types of problems – by all means, do it. But don’t insist that “real programmers solve non-real-world puzzles”. Especially when the question is how to get a software engineering job.

Because software engineering, as I again agree with David Byttow, is a lot more than writing code. It’s contemplating all aspects of a software system, using many technologies and many levels of abstraction. But he insists that you must focus on the lower levels (e.g. data structures) and be an expert there. I think you are free to choose the levels of abstraction you are an expert in, as long as you have a good overview of those below/above.

And let’s face it – getting an engineering job is easy. The demand for engineers is way higher than the supply, so you have to be really incompetent not to be able to get any job. How to get an interesting and highly-paid job is a different thing, but I can assure you that there’s enough of those as well, and not all of them require you to solve freshman year style problems on interviews. And I see that there is this trend, especially in Silicon Valley, to demand knowing the computer science components of software engineering by heart. And I don’t particularly like it, but probably if you want a job at Google or Facebook, then you do have to know the complexities of popular algorithms, and be able to implement a red-black tree on a whiteboard. But that doesn’t mean every interesting company out there requires those things, and does not mean that you are not a worthy engineer.

One final disagreement – not knowing the exact details about the company you are applying to (or that is recruiting you) is fine. Maybe companies are obsessed with themselves, but when you go to a small-to-medium-sized company that does not have worldwide fame, not knowing the competition in their niche is mostly fine. (And it makes a difference whether you applied, or they headhunted you.)

But won’t my counter-advice land you a mediocre job? No. There are companies doing “cool stuff” that don’t care whether you know Dijkstra’s algorithm by heart. As long as you demonstrate the ability to solve problems, broad expertise, and passion for programming, you are in. That includes (among others) TomTom, eBay, Rakuten, Ericsson (companies I’ve interviewed with or worked at). It may not land you a job at Google, but should we focus on being good engineers, or on fulfilling Silicon Valley’s artificial interview criteria?

So far I’ve mostly disagreed, but I didn’t actually give a bullet-point how-to. So in addition to the things I agree with in David’s article, here are some more:

  • know a technology well – if you’ve worked with a given technology for the past year, you have to know it in depth; otherwise you seem like that guy who doesn’t actually know what he’s doing, but still gets assigned some of the boilerplate/easy tasks.
  • show that software engineering is not a 9-to-5 thing for you. Staying up-to-date with the latest trends, having a blog, GitHub contributions, your own side projects, talks, meetups – all of these count.
  • have broad expertise – being just a “very good Spring/Rails/Akka/…” developer doesn’t cut it. You have to know how software is designed, deployed, managed. You don’t need to have written millions of lines of CloudFormation, or supported a Puppet installation by yourself, but at least you have to know what infrastructure and deployment automation is. (Whew, I managed to avoid the “full-stack” buzzword)
  • know the basics – as pointed out above, you don’t have to know complexities and implementations by heart. But not knowing what a hashtable or a linked list is (in terms of usage patterns, at least) hurts your chances significantly. Knowing that something exists when you need it is the practical compromise between knowing how to write it and not having the faintest idea about it.
  • be able to solve problems – interviewers will often ask a hypothetical question (in fact, one that they recently faced) and see how you attack the problem. Don’t say you don’t have enough information or you don’t know – just try to solve it. It may not be correct, but a well-thought-out attempt still counts.
  • be respectful. That doesn’t mean overly-professional or shy, but assume that the people interviewing you are just like you – good developers that love creating software.
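On the “know the basics” point, the usage-pattern level of knowledge is something like the following trivial Java sketch (my own illustration – the class and method names are made up):

```java
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;

public class UsagePatterns {

    // hashtable usage pattern: key-based lookup in near-constant time
    static Map<String, Integer> countWords(String... words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("to", "be", "or", "not", "to", "be");
        System.out.println(counts.get("to")); // prints 2

        // linked list usage pattern: cheap insertion/removal at the ends,
        // e.g. as a FIFO queue of pending tasks
        Queue<String> tasks = new LinkedList<>();
        tasks.add("build");
        tasks.add("test");
        System.out.println(tasks.poll()); // prints "build"
    }
}
```

Knowing when to reach for each structure is what interviews should (and mostly do) check – not whether you can re-implement `HashMap` on a whiteboard.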

That won’t guarantee you a job, of course. And it won’t get you a job at Google. But you can land a job where you can do pretty interesting things on a large scale.


Why Non-Blocking?

March 2, 2015

I’ve been writing non-blocking, asynchronous code for the past year. Learning how it works and how to write it is not hard. Where the benefits come from is what I don’t understand. Moreover, there is so much hype surrounding some programming models that you have to be pretty good at telling marketing from rumours from facts.

So let’s first start with clarifying the terms. Non-blocking applications are written in a way that threads never block – whenever a thread would have to block on I/O (e.g. reading/writing from/to a socket), it instead gets notified when new data is available. How that is implemented is out of the scope of this post. Non-blocking applications are normally implemented with message passing (or events). “Asynchronous” is related to that (in fact, in many cases it’s a synonym for “non-blocking”), as you send your request events and then get responses to them in a different thread, at a different time – asynchronously. And then there’s the “reactive” buzzword, which I honestly can’t explain – on one hand there’s reactive functional programming, which is rather abstract; on the other there’s the reactive manifesto, which defines three requirements for practically every application out there (responsive, elastic, resilient) and one implementation detail (message-driven), which is there for no apparent reason. And how does the whole thing relate to non-blocking/asynchronous programming? Probably through the message-driven part, but the three often go together in the buzzword-driven marketing jargon.

Two examples of frameworks/tools that are used to implement non-blocking (web) applications are Akka (for Scala and Java) and Node.js. I’ve been using the former, but most of the points are relevant to Node as well.

Here’s a rather simplified description of how it works. It uses the reactor pattern (ahaa, maybe that’s where “reactive” comes from?) where one thread serves all requests by multiplexing between tasks and never blocks anywhere – whenever something is ready, it gets processed by that thread (or a couple of threads). So, if two requests are made to a web app that reads from the database and writes the response, the framework reads the input from each socket (by getting notified on incoming data, switching between the two sockets), and when it has read everything, passes a “here’s the request” message to the application code. The application code then sends a message to a database access layer, which in turn sends a message to the database (driver), and gets notified whenever reading the data from the database is complete. In the callback it in turn sends a message to the frontend/controller, which in turn writes the data as response, by sending it as message(s). Everything consists of a lot of message passing and possibly callbacks.
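The multiplexing part of that description can be sketched with plain Java NIO, which is roughly the mechanism such frameworks build on. This is my own minimal illustration, not Akka or Node internals; the class and method names are made up:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;

public class ReactorSketch {

    // One pass of the event loop: wait up to timeoutMillis for any
    // registered channel to become ready, and return how many were ready.
    static int selectOnce(int timeoutMillis) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(0)); // any free port
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        // the single event-loop thread multiplexes over all registered
        // channels here; with no clients connecting, the call times out
        int ready = selector.select(timeoutMillis);

        server.close();
        selector.close();
        return ready;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("ready channels: " + selectOnce(100)); // prints 0
    }
}
```

A real framework would loop over `select()` forever, dispatching each ready key to application code – but the key property is visible even in this sketch: one thread watches many channels, and nothing blocks on any single socket.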

One problem with that setup is that if at any point in the code the thread blocks, then the whole thing goes to hell. But let’s assume all your code and 3rd party libraries are non-blocking and/or you have some clever way to avoid blocking everything (e.g. an internal thread pool that handles the blocking part).

That brings me to another point – whether only the reading and writing of the socket is non-blocking, as opposed to the whole application being non-blocking. For example, Tomcat’s NIO connector is non-blocking, but the application code can still be executed (afaik, via a thread pool) in the “good old” synchronous way. Though I admit I don’t fully understand that part, we have to distinguish asynchronous application code from asynchronous I/O provided by the infrastructure.

And another important distinction – the fact that your server code is non-blocking/asynchronous doesn’t mean your application is asynchronous to the client. The two things are related, but not the same – if your client uses a long-living connection where it expects new data to be pushed from the server (e.g. websockets/comet), then the asynchronicity goes outside your code and becomes a feature of your application, from the perspective of the client. And that can be achieved in multiple ways, including a Java Servlet with async=true (which uses a non-blocking model so that long-living connections do not each hold a blocked thread).

Okay, now we know roughly how it works, and we can even write code in that paradigm. We can pass messages around, write callbacks, or get notified with a different message (i.e. akka’s “ask” vs “tell” pattern). But again – what’s the point?

That’s where it gets tricky. You can experiment with googling for stuff like “benefits of non-blocking/NIO”, benchmarks, “what is faster – blocking or non-blocking”, etc. People will say non-blocking is faster, or more scalable, that it requires less memory for threads, has higher throughput, or any combination of these. Are they true? Nobody knows. It indeed makes sense that by not blocking your threads, and by not having a thread-per-socket, you can have fewer threads service more requests. But is that faster or more memory-efficient? Do you reach the maximum number of threads in a big thread pool before you max out the CPU, network I/O or disk I/O? Is the bottleneck in a regular web application really the thread pool? Possibly, but I couldn’t find a definitive answer.

This benchmark shows raw servlets are faster than Node (and when spray (akka) was present in that benchmark, it was also slower). This one shows that the NIO Tomcat connector gives worse throughput. My own benchmark (which I lost) of spray vs spring-mvc showed that spray started returning 500 (Internal Server Error) responses at way fewer concurrent requests than spring-mvc. I would bet there are counter-benchmarks that “prove” otherwise.

The most comprehensive piece on the topic is the “Thousands of Threads and Blocking I/O” presentation from 2008, which says something I myself felt – that everyone “knows” non-blocking is better and faster, but nobody actually tested it, and that people sometimes confuse “fast” and “scalable”. And that blocking servers actually perform ~20% faster. That presentation, complemented by this “Avoid NIO” post, claims that the non-blocking approach is actually worse in terms of scalability and performance. And this paper (from 2003) claims that “Events Are A Bad Idea (for high-concurrency servers)”. But is all this objective? Does it hold true only for the Java NIO library or for the non-blocking approach in general; does it apply to Node.js and akka/spray; and how do applications that are asynchronous from the client’s perspective fit into the picture? I honestly don’t know.

It feels like the old, thread-pool-based, blocking approach is at least good enough, if not better. Despite the “common knowledge” that it is not.

And to complicate things even further, let’s consider use cases. Maybe you should use the blocking approach for a RESTful API with a traditional request/response paradigm, but make a high-speed trading web application non-blocking, because of its asynchronous nature. Should you have only your “connector” (in Tomcat terms) non-blocking, and the rest of your application blocking… except for the asynchronous (from the client’s perspective) part? It gets really complicated to answer.

And even “it depends” is not a good enough answer. Some people would say that you should do your own benchmark, for your use case. But for a benchmark you need an actual application, written in all possible ways. Yes, you can use some prototype with basic functionality, but choosing the programming paradigm must happen very early (and it’s hard to refactor later). So, which approach is more performant, scalable, memory-efficient? I don’t know.

What I do know, however, is which is easier to program, easier to test and easier to support. And that’s the blocking paradigm. Where you simply call methods on objects, not caring about callbacks and handling responses. Synchronous, simple, straightforward. This is actually one of the points in both the presentation and the paper I linked above – that it’s harder to write non-blocking code. And given the unclear benefits (if any), I would say that the ease of programming, testing and supporting the code is the main distinguishing feature. Whether you are going to be able to serve 10,000 or 11,000 concurrent users from a single machine doesn’t really matter. Hardware is cheap. (Unless it’s 1,000 vs 10,000, of course.)
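To make the difference concrete, here is the same hypothetical read-then-respond flow written both ways (a toy sketch of my own – `fetchUser` stands in for a real database call):

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

public class TwoStyles {

    // --- synchronous: plain method calls, trivially traceable in an IDE ---
    static String fetchUser(int id) {
        return "user-" + id; // stands in for a blocking database call
    }

    static String handleSync(int id) {
        String user = fetchUser(id); // blocks until the "database" returns
        return "Hello, " + user;
    }

    // --- callback style: the same logic, inverted into a continuation ---
    static void fetchUserAsync(int id, Consumer<String> callback) {
        // the result is delivered later, on another thread
        CompletableFuture.runAsync(() -> callback.accept("user-" + id));
    }

    static CompletableFuture<String> handleAsync(int id) {
        CompletableFuture<String> response = new CompletableFuture<>();
        fetchUserAsync(id, user -> response.complete("Hello, " + user));
        return response;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(handleSync(42));        // prints "Hello, user-42"
        System.out.println(handleAsync(42).get()); // same result, harder to trace
    }
}
```

Even in this tiny example the callback version splits one logical step across two methods and a lambda; in a real codebase that fragmentation is multiplied across every layer.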

But why is the non-blocking, asynchronous, event/message-driven programming paradigm harder? For me, at least, even after a year of writing in that paradigm, it’s still messier. First, it is way harder to trace the program flow. With synchronous code you would just tell your IDE to fetch the call hierarchy (or find the usages of a given method if your language is not IDE-friendly), and see where everything comes from and goes to. With events it’s not that trivial. Who constructs this message? Where is it sent to / who consumes it? How is the response obtained – via callback, via another message? When is the response message constructed and who actually consumes it? And no, that’s not “loose coupling”, because your code is still pretty logically (and compilation-wise) coupled; it’s just harder to read.

What about thread safety? The event passing allegedly ensures that no contention, deadlocks, or race conditions occur. Well, even that’s not necessarily true. You have to be very careful with callbacks (unless you really have one thread, like in Node) and with your “actor” state. Which piece of code is executed by which thread matters (in akka at least), and you can still have shared state even though only a few threads do the work. With the synchronous approach you just have to follow one simple rule – state does not belong in the code, period. No instance variables and you are safe, regardless of how many threads execute the same piece of code. The presentation above also mentions immutable and concurrent data structures that are inherently thread-safe and can be used in either of the paradigms. So in terms of concurrency, it’s pretty easy from the perspective of the developer.
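The “no instance variables” rule looks like this in practice (my own toy example – the numbers are arbitrary):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class StatelessHandler {

    // No instance fields: all state lives in local variables on each
    // thread's own stack, so any number of threads can call handle()
    // concurrently without locks or race conditions.
    static int handle(int input) {
        int local = input * 2; // per-thread, never shared
        return local + 1;
    }

    static int sumOfResults(int tasks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<Integer>> futures = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            final int input = i;
            futures.add(pool.submit(() -> handle(input)));
        }
        int sum = 0;
        for (Future<Integer> f : futures) {
            sum += f.get();
        }
        pool.shutdown();
        return sum;
    }

    public static void main(String[] args) throws Exception {
        // sum of (2i + 1) for i = 0..99 is 10000, regardless of interleaving
        System.out.println(sumOfResults(100)); // prints 10000
    }
}
```

The result is deterministic no matter how the eight threads interleave, precisely because `handle` touches nothing outside its own stack.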

Testing complicated message-passing flows is a nightmare, really. And whereas test code is generally less readable than the production code, test code for a non-blocking application is, in my experience, much uglier. But that’s subjective again, I agree.

I wouldn’t like to finish this long and unfocused piece with “it depends”. I really think the synchronous/blocking programming model, with a thread pool and no message passing in the business logic, is the simpler and more straightforward way to write code. And if, as pointed out by the presentation and paper linked above, it’s also faster – great. And when you really need to send responses to clients asynchronously – consider the non-blocking approach only for that part of the functionality. Ultimately, given similar performance, throughput and scalability (and ignoring the marketing buzz), I think one should choose the programming paradigm that is easier to write, read and test. Because it takes 30 minutes to start another server, but accidental complexity can burn weeks and months of programming effort. For me, the blocking/synchronous approach is the easier one to write, read and test, but that isn’t necessarily universal. I would just not base my choice of a programming paradigm on vague claims about performance and scalability.


My Development Setup

February 23, 2015

I think I may have a pretty non-standard development setup (even for a Java-and-Scala developer). I use Windows, which I guess almost no “real” developer does. I’ve tried Linux (Ubuntu) a couple of times, but it ruins my productivity (or what’s left of it after checking all social networks).

But how do I manage to get anything done? How do I write scripts that I deploy on servers, how do I run non-Windows software, how do I manage to work on a project where all other developers use either Linux or Mac?

It’s actually quite simple. It’s Windows + Cygwin + VirtualBox with a Linux distro. For most of the things that a Java developer needs, Windows is just fine. IDEs and servlet containers run well, so no issue there. Some project automation is done with shell scripts, but whenever I need to execute them, Cygwin works pretty well. Same goes for project deployment scripts and the like (and I generally prefer using a class with a main method, rather than sed, awk, curl, etc., to test stuff). As for software that doesn’t run on Windows (e.g. Riak doesn’t have a Windows distribution), that goes in VirtualBox. I always have a virtual machine running with the appropriate software installed and listening on some port, so that I can run any application locally.

No need to mention Git, as there is a git console for Windows, but there’s also SourceTree, which is a pretty neat UI for the day-to-day tasks. Newlines are automatically handled by git, and even when that’s not enough (or not working, as Cygwin needs the Linux line endings), Notepad++ has a pretty easy EOL conversion.

What about viruses? Using Firefox with NoScript, combined with good internet habits, means I haven’t had a single virus, ever. Well, maybe I’ve had some very secret one that never manifested itself, who knows.

That may sound like an unnecessary complication – so many components just to achieve what a Linux installation would give out of the box. Well, no. First, it takes 30 minutes to set up, and second, I wouldn’t go for Linux on the desktop. It’s just too unfriendly and you waste so much time fixing little things that usually go wrong. Like when intensive I/O gets your UI completely stuck, or when the wifi doesn’t work because of those-three-commands-you-have-to-execute-to-add-this-to-the-proper-config. In other words, I get the end-user robustness of Windows (and no, it doesn’t give BSODs anymore, that was true 10 years ago) combined with the tools of Linux.

With that I’m not saying that everyone should migrate to my setup tomorrow. I’m just pointing to a good alternative.