In Favour of Self-Signed Certificates

December 18, 2014

Today I watched the Google I/O presentation about HTTPS everywhere and read a couple of articles, saying that Google is going to rank sites using HTTPS higher. Apart from that, SPDY has mandatory usage of TLS, and it’s very likely the same will be true for HTTP/2. Chromium proposes marking non-HTTPS sites as non-secure.

And that’s perfect. Except, it’s not very nice for small site owners. In the presentation above, the speakers say “it’s very easy” multiple times. And it is – you just have to follow a dozen checklists with a dozen items each, run your site through a couple of tools and pay a CA 30 bucks per year. I have run a couple of personal sites over HTTPS (non-commercial, so using a free StartCom certificate), and I still shiver at the thought of setting up a certificate. You may say that’s because I’m an Ops newbie, but it’s just a tedious process.

But let’s say every site owner will have a webmaster on contract who will renew the certificate every year. What’s the point? The talk rightly points out three main aspects – data integrity, authentication and encryption. And it also rightly points out that no matter how mundane your site is, there is information that can be extracted based on your visit there (well, even with HTTPS the hostname is still visible, so it doesn’t matter what torrents you downloaded – it is obvious you were visiting thepiratebay).

But does it really matter if my blog is properly authenticating to the user? Does it matter if the website of the local hairdresser may suffer a man-in-the-middle attack with someone posing as the site? Arguably not. If there is a determined attacker who wants to observe what recipes you are cooking right now, I bet he would find it easier to just install a keylogger.

Anyway, do we have any data on how many sites are actually just static websites or blogs? How many websites don’t have anything more than a contact form (if that)? 22% of newly registered domains in the U.S. are using WordPress. That doesn’t tell us much, as you can build quite interactive sites with WordPress, but it is probably an indication. My guess is that the majority of sites are simple content sites that you do not interact with, or where interaction is limited to posting an anonymous comment. Do these sites need to go through the complexity and cost of providing an SSL certificate? Certification Authorities may be rejoicing already – forcing HTTPS means there will be an inelastic demand for certificates, and that means prices are not guaranteed to drop.

If HTTPS is forced upon every webmaster (which should be the case, and I firmly support that), we should have a free, effortless way for the majority of sites to comply. And the only option that comes to mind is self-signed certificates. They do not guarantee there is no man-in-the-middle, but they do allow encrypting the communication, making it impossible for a passive attacker to see what you are browsing or posting. Server software (apache, nginx, tomcat, etc.) can have a “use self-signed certificate” switch and automatically generate and store the key pair on the server (a single server, of course, as traffic is unlikely to be high for these kinds of sites).

Browsers must change, however. They should no longer report self-signed certificates as insecure – at least not until the user tries to POST data to the server (and especially if there is a password field on the page). Upon POSTing data, the browser should warn the user that it cannot verify the authenticity of the certificate and that he should proceed only if he thinks the data is not sensitive. Or even passing any parameters (be it GET or POST) could trigger a warning. That won’t be sufficient, as one can issue a GET request for site.com/username/password or even embed an image or use javascript. That’s why the heuristics to detect and show a warning can include submitting forms, changing src and href with javascript, etc. Can that cover every possible case, and won’t it defeat the purpose? I don’t know.

Even small, content-only, CMS-based sites have admin panels, which means the owner sends a username and password. Doesn’t this invalidate the whole point made above? It would, if there weren’t an easy fix – certificate pinning. Even now this approach is employed by mobile apps in order to skip the full certificate checks (including revocation). In short, the owner of the site can take the certificate generated by the webserver, import it into the browser (pin it), and be sure that the browser will warn him if someone tries to intercept his traffic. If he hasn’t imported the certificate, the browser will warn him upon submission of the login form, possibly with instructions on what to do.

(When speaking about trust, I must mention PGP. I am not aware whether there is a way to use web-of-trust verification of the server certificate instead of CA verification, but it’s worth mentioning as a possible alternative.)

So, are self-signed certificates for small, read-only websites secure enough? Certainly not. But relying on a CA isn’t 100% secure either (breaches are common). And the balance between security and usability is hard.

I’m not saying my proposal in favour of self-signed certificates is the right way to go. But it’s food for thought. The exploits such an approach can bring to even properly secured sites (via MITM) are to be considered with utter seriousness.

Interactive sites, especially online shops, social networks, payment providers, obviously, must use a full-featured HTTPS CA certificate, and that’s only the bare minimum. But let’s not force that upon the website of the local bakery and the millions of sites like it. And let’s not penalize them for not purchasing a certificate.

P.S. EFF’s Let’s Encrypt looks like a really promising alternative.

P.P.S. See also in-session key negotiation.


Algorithmic Music Composition [paper]

December 11, 2014

After I wrote my first post about computoser.com, many were interested in the code. Then I open-sourced it.

And now, to complete my contribution, I wrote a paper about my approach and findings. The paper is on Academia.edu and also on arxiv. I’d be happy to get honest peer reviews.

It’s not a great novelty, I agree, but I think it’s an improvement over existing attempts and possibly an approach to build upon in the future.


Open Source for the Government [presentation]

December 9, 2014

A month ago I gave a talk at OpenFest (a Bulgarian conference for open technologies). The talk was in Bulgarian, but I’ve translated the slides, so here they are.

Currently, many governments order custom software and companies implement it, but it’s usually of low quality, low applicability, or both. The software is often abandoned. And we don’t even know what bugs and holes are lurking in it (I found a security hole in the former egov.bg portal that allowed me to extract all documents in the system, containing personal data and whatnot). And it’s all because it’s a black box.

In short, I propose that all software ordered by our governments be open source. I’m obviously not the first to have that idea (there are success stories in some countries, as you’ll see in the slides), but I think the idea is worth pursuing not only in my country, but in many other countries where the government still orders companies to build closed-source software.

The process is simple – the company that builds the software (or customizes an existing open-source product) works in a public SCM repo (git or mercurial) that everyone can trace. That way we can not only monitor the development process, but also have a more transparent view of public spending on software.

Here in Bulgaria the idea has already been embraced by some government representatives and is likely to gain traction in the next few years.


Static Typing Is Not for Type Checking

December 2, 2014

In his post “Strong typing vs strong testing” Bruce Eckel described the idea that statically (or strongly) typed languages don’t give you much, because you should verify your programs with tests anyway, and those tests will check the types as well – no need for the compiler to do that (especially if it makes you less productive with the language).

While this looks like a very good point initially, I have some objections.

First, his terminology is not the popularly agreed one. This stackoverflow answer outlines the difference between statically typed (types are checked at compile time) and strongly typed (no or few implicit type conversions). And to clarify this about the language used in the article – Python – this page tells us Python is a dynamically and strongly typed language.

But let’s not nitpick about terminology. I have an objection to the claim that static typing simply gives you some additional tests that you should write anyway.

In a project written in a dynamic language, can you see the callers of a method? Who calls the speak method in his example? You’ll do a search? Well, what if you have many methods with the same name (iterator(), calculate(), handle(), execute())? You would name them differently, maybe? And be sure that you never reuse a method name in the whole project? The ability to quickly navigate through the code of a big project is one of the most important ones in terms of productivity. And it’s not that vim with nice plugins doesn’t allow you to navigate through classes and to search for methods – it’s just not possible to make that as precise in a dynamic language as it can be in a static one.

Then I want to know what I can call on a given object – to do API “discovery” while I write the code. How often, in a big project, are you absolutely sure which method you want to invoke on an object of a class you see for the first time? Go to that class? Oh, which one is it, since you only know it has the calculate() method (which is called in the current method)? Writing a test that validates whether you can invoke a given method or not is fine, but it doesn’t help you discover what your options are at that point – which method would do the job best for you. Here comes autocomplete with inline documentation. It’s not about saving keystrokes, it’s about knowing what is allowed at this point in the code. I guess constantly opening API documentation pages or other class definitions would work, but it’s tedious.

Then comes refactoring. Yes, you knew I’d bring that up. But it is the single most important thing that the tests we write enable us to do – we have all our tests so that we can guarantee that when we change something, the code will continue to work correctly. And yet, only in a statically typed language is it possible to do full-featured refactoring: adding arguments to a method, moving a method to another class, even renaming a method without collateral damage. And yes, there are multiple heuristics that can be employed to make refactoring somewhat possible in Ruby or Python (and JetBrains are trying), but by definition it cannot be as good. Does it matter? And even if the change doesn’t happen automatically, tests will catch the breakage, right? If you have 100% coverage, they will. But that doesn’t mean the change will take less time – as opposed to a couple of keystrokes in a statically typed language.

And what are those “mythical big projects” where all the features above are game-changers? Well, most projects with a lifespan of more than 6 months, in my experience.

So, no, static typing is not about the type checks. It’s about being able to comprehend a big, unfamiliar (or forgotten) codebase faster and with a higher level of certainty, to make your way through it and to change it more safely and quickly. Type checking comes as a handy bonus, though. I won’t employ the “statically typed languages have faster runtimes” argument. (And by all this I don’t mean to dismiss dynamically typed languages, even though I very much prefer static and strong typing.)

And then people may say “your fancy tools and IDEs try to compensate for language deficiencies”. Not at all – my fancy tools are built on top of the language’s efficiencies. The tools would not exist if the language didn’t make it possible for them to exist. A language that allows powerful tools to be built for it is a powerful one, and that’s the strength of statically typed languages, in my view.


Getting Started with Machine Learning

November 29, 2014

“Machine learning” is a mystical term. Most developers don’t need it at all in their daily work, and the only details we know about it are from some university course 5 years ago (which is already forgotten). I’m not a machine learning expert, but I happened to work in a company that does a bit of that, so I got to learn the basics. I never programmed actual machine learning tasks, but I got a good overview.

But what is machine learning? It’s instructing the computer to make sense of big amounts of data (#bigdata hashtag – check). In what ways?

  • classifying a new entry into existing classes – is this email spam, is this news article about sport or politics, is this symbol the letter “a”, or “b”, or “c”, is this object in front of the self-driving car a pedestrian or a road sign.
  • predicting a value of a new entry (regression problems) – how much does my car cost, how much will the stock price be tomorrow.
  • grouping entries into classes that are not known in advance (clustering) – what are your market segments, what are the communities within a given social network (and many more applications)

How? With many different algorithms and data structures, which are fortunately already written by computer scientists, so developers can just reuse them (with a fair amount of understanding, of course).

But if the algorithms are already written, then it must be easy to use machine learning? No. Ironically, the hardest part of machine learning is the part where a human tells the machine what is important about the data. This process is called feature selection: what are the features that describe the data in a way the computer can use to identify meaningful patterns? I am no machine learning expert, but the way I see it, this step is what most machine learning engineers (or data scientists) are doing on a day-to-day basis. They aren’t inventing new algorithms; they are trying to figure out which combinations of features for given data give the best results. And it’s a process with many “heuristics” that I have no experience with. (That’s an oversimplification, of course, as my colleagues were indeed doing research and proposing improvements to algorithms, but that’s the scientific aspect of things.)

I’ll now limit myself only to classification problems and leave the rest aside. And when I say “best results”, how is that measured? There are the metrics of “precision” and “recall” (they are most easily used for classification into two groups, but there are ways to apply them to multi-class or multi-label classification). If you have to classify an email as spam or not spam, your precision is the percentage of emails properly marked as spam out of all the emails marked as spam. And the recall is the percentage of emails properly marked as spam out of all the actual spam emails. So if you have 200 emails, 100 of them are spam, and your program correctly marks 80 of them as spam and incorrectly marks 20 non-spam emails as spam, you have 80% precision (80/(80+20)) and 80% recall (80/100 actual spam emails). Good results are achieved when you score high on both metrics – i.e. your spam filter is good if it correctly detects most spam emails and it also doesn’t mark non-spam emails as spam.
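
To make the arithmetic explicit, here is a tiny sketch (not from the original example) that computes the two metrics from the raw counts:

// tp = spam correctly marked as spam, fp = non-spam wrongly marked as spam,
// fn = spam that was not marked as spam
int tp = 80, fp = 20, fn = 20;
double precision = (double) tp / (tp + fp); // 80 / (80 + 20) = 0.8
double recall = (double) tp / (tp + fn);    // 80 / 100 actual spam emails = 0.8
System.out.println("precision = " + precision + ", recall = " + recall);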

The process of feeding data into the algorithm is simple. You usually have two sets of data – the training set and the evaluation set. You normally start with one set and split it in two (the training set should be the larger one). These sets contain the values for all the features that you have identified for the data in question. You first “train” your classifier (the statistical model) with the training set (if you want to know how training happens, read about the various algorithms), and then run the evaluation set through it to see how many items were correctly classified (the evaluation set has the right answer in it, so you compare that to what the classifier produced as output).
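
If you start with a single arff file, a minimal sketch of such a split with Weka could look like the following (the 80/20 ratio and the file name are my assumptions, not part of the original setup):

import java.io.File;
import java.util.Random;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

public class SplitExample {
    public static void main(String[] args) throws Exception {
        ArffLoader loader = new ArffLoader();
        loader.setFile(new File("problems.arff")); // hypothetical combined data set
        Instances all = loader.getDataSet();
        all.setClassIndex(all.numAttributes() - 1); // the class (complexity) is the last attribute
        all.randomize(new Random(42)); // shuffle before splitting

        int trainSize = (int) Math.round(all.numInstances() * 0.8);
        Instances trainingSet = new Instances(all, 0, trainSize);
        Instances evaluationSet = new Instances(all, trainSize, all.numInstances() - trainSize);
    }
}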

Let me illustrate that with my first actual machine learning code (with the big disclaimer, that the task is probably not well-suited for machine learning, as there is a very small data set). I am a member (and currently chair) of the problem committee (and jury) of the International Linguistics Olympiad. We construct linguistics problems, combine them into problem sets and assign them at the event each year. But we are still not good at assessing how hard a problem is for high-school students. Even though many of us were once competitors in such olympiads, we now know “too much” to be able to assess the difficulty. So I decided to apply machine learning to the problem.

As mentioned above, I had to start with selecting the right features. After a couple of iterations, I ended up using: the number of examples in a problem, the average length of an example, the number of assignments, the number of linguistic components to discover as part of the solution, and whether the problem data is scrambled or not. The complexity (easy, medium, hard) comes from the actual scores of competitors at the olympiad (average score: 0–8 points = hard, 8–12 = medium, >12 = easy). I am not sure whether these features are actually related to problem complexity, hence I experimented with adding and removing some. I put the feature data into a Weka arff file, which looks like this (attributes = features):

@RELATION problem-complexity

@ATTRIBUTE examples NUMERIC
@ATTRIBUTE avgExampleSize NUMERIC
@ATTRIBUTE components NUMERIC
@ATTRIBUTE assignments NUMERIC
@ATTRIBUTE scrambled {true,false}
@ATTRIBUTE complexity {easy,medium,hard}

@DATA
34,6,11,8,false,medium
12,21,7,17,false,medium
14,11,11,17,true,hard
13,16,9,14,false,hard
16,35,7,17,false,hard
20,9,7,10,false,hard
24,5,8,6,false,medium
9,14,13,4,false,easy
18,7,17,7,true,hard
18,7,12,10,false,easy
10,16,9,11,false,hard
11,3,17,13,true,easy
...

The evaluation set looks exactly like that, but smaller (in my case, only 7 entries).

Weka was recommended as a good tool (at least for starting), and it has a lot of algorithms included, which one can simply reuse.

Following the getting started guide, I produced the following simple code:

// needed imports: java.io.File, weka.core.Instances, weka.core.converters.ArffLoader,
// weka.classifiers.Classifier, weka.classifiers.Evaluation, weka.classifiers.trees.LMT
public static void main(String[] args) throws Exception {
    ArffLoader loader = new ArffLoader();
    loader.setFile(new File("problem_complexity_train_3.arff"));
    Instances trainingSet = loader.getDataSet();
    // this is the complexity attribute - here we specify the classes
    // into which we want to classify the data
    int classIdx = 5;

    ArffLoader loader2 = new ArffLoader();
    loader2.setFile(new File("problem_complexity_test_3.arff"));
    Instances testSet = loader2.getDataSet();

    trainingSet.setClassIndex(classIdx);
    testSet.setClassIndex(classIdx);

    // using the LMT classification algorithm; many more are available
    Classifier classifier = new LMT();
    classifier.buildClassifier(trainingSet);

    Evaluation eval = new Evaluation(trainingSet);
    eval.evaluateModel(classifier, testSet);

    System.out.println(eval.toSummaryString());

    // get the confusion matrix
    double[][] confusionMatrix = eval.confusionMatrix();
    // ...
}

A comment about the choice of the algorithm – having insufficient knowledge, I just tried a few and selected the one that produced the best result.

After performing the evaluation, you can get the so-called “confusion matrix” (eval.confusionMatrix(), as in the code above), which you can use to judge the quality of the result. When you are satisfied with the results, you can proceed to classify new entries whose complexity you don’t know. To do that, you have to provide a data set, and the only difference from the other two is that you put a question mark instead of the class (easy, medium, hard). E.g.:

...
@DATA
34,6,11,8,false,?
12,21,7,17,false,?

Then you can run the classifier:

ArffLoader loader = new ArffLoader();
loader.setFile(new File("unclassified.arff"));
Instances dataSet = loader.getDataSet();
// the class attribute (holding the question marks) still has to be specified
dataSet.setClassIndex(5);

DecimalFormat df = new DecimalFormat("#.##");
for (Enumeration<Instance> en = dataSet.enumerateInstances(); en.hasMoreElements();) {
    double[] results = classifier.distributionForInstance(en.nextElement());
    for (double result : results) {
        System.out.print(df.format(result) + " ");
    }
    System.out.println();
}

This will print the probabilities for each of your entries to fall into each of the classes. As we are going to use this output only as a hint towards the complexity, and won’t use it as a final decision, it is fine to yield wrong results sometimes. But in many machine learning problems there isn’t a human evaluation of the result, so getting higher accuracy is the most important task.

How does this approach scale, however? Can I reuse the code above in a high-volume production system? On the web you normally do not run machine learning tasks in real time (you run them as scheduled tasks instead), so probably the answer is “yes”.

I am still a novice in the field, but having done one actual task made me want to share my tiny bit of experience and knowledge. Meanwhile I’m following the Stanford machine learning course on Coursera, which can give you way more details.

Can we, as developers, use machine learning in our work projects? If we have large amounts of data – yes. It’s not that hard to get started, and although probably we will be making stupid mistakes, it’s an interesting thing to explore and may bring value to the product we are building.


Making Side Projects With New Technologies

November 22, 2014

(Captain Obvious mantle on)
You are a software engineer and maybe you have a side project – something that you do at home in your spare time. If you don’t, go ahead and have one – no life outside is better than a few more hours of programming. Unwitty jokes aside, having a side project is indeed a very useful practice (read on).

A side-project is sometimes thought of as “the thing that would make you rich and you won’t have to program ever again”. It very rarely is, so we’d better view it as “the thing that would sound cool when I speak about it”. But apart from the motivational/coolness aspect, side-projects have a very important practical consequence – they make you a better programmer.

Of course, every extra hour of doing something makes you better at it, but a side-project is even better, because you are the one who makes all the decisions – what to do, how to do it, when to do it, what technologies to use. I’ll focus a bit more on the last point. Not only can you choose the technologies to use, but you can choose technologies that you don’t know yet (imagine going to your manager at the beginning of a project and asking him to build it with a language or framework that nobody on the team has ever used).

And that’s what I’m doing – for most of my side-projects I choose technologies that I haven’t used before. I get to learn new frameworks, tools and languages (a.k.a. “technology”), and get relatively good with them. That’s the way I learned JSF, Android, Scala, AWS and more. Learning a technology by itself is not the most motivating endeavor, but learning it as part of a project, as part of building something meaningful, is a different thing – it comes naturally.

The obvious practical bonus of all this is that you become more “hireable”. Having a technology in your skillset makes you more eligible for certain positions than other people – knowing a bit of scala and AWS makes you way more qualified for a “scala full-stack engineer” position than someone with just Java and Linux knowledge. Another scenario: a new project starts, you get to pick the technologies, and you can now say “I have experience with JSF, let’s build the front-end with that” (and that’s exactly what has happened to me).

Now, a clarification is due about the word “new” in the title. I don’t intend it to mean “untested, overhyped crap”; I mostly mean “new to you” – something that you haven’t used. It might be an already stable technology, or something that is gaining traction but your conservative company is never going to try. Of course, trying something “fresh” is also good, as being an early adopter is sometimes rewarding.

Should you make side-projects with technologies you are familiar with? Of course, and I’ve done so as well – when the subject of the project is way more interesting than the technologies themselves (e.g. an algorithmic composer). But it is way better to use at least one new thing.

By the way, that’s not relevant only for “youngsters”. The “big, fat architect” needs a bit of the side-project experience too, otherwise he risks becoming irrelevant pretty soon.

In a way, I think side projects are the way for developers to enrich their skillset and stay up to date. Learning only the technologies you need at work can make you forget how to learn, forget what programmers’ curiosity is – and that’s just bad. Constantly exploring the programming world not only gives you particular skills with a given technology, but also broadens your general engineering mindset.


Interrupting Executor Tasks

November 19, 2014

There’s this usecase that is not quite rare: you want to cancel running executor tasks. For example, you have ongoing downloads that you want to stop, or ongoing file copying that you want to cancel. So you do:

ExecutorService executor = Executors.newSingleThreadExecutor(); 
Future<?> future = executor.submit(new Runnable() {
    @Override
    public void run() {
        // Time-consuming or possibly blocking I/O
    }
});
....
executor.shutdownNow();
// or
future.cancel(true);

Unfortunately, that doesn’t work. Calling shutdownNow() or cancel() doesn’t stop the ongoing runnable. What these methods do is simply call .interrupt() on the respective thread(s). The problem is, your runnable never gets an InterruptedException to handle – blocking I/O doesn’t throw it, and the code doesn’t check the interrupted flag. It’s a pretty common problem described in multiple books and articles, but it’s still a bit counterintuitive.

So what do you do? You need a way to stop the slow or blocking operation. Generally, the blocking happens outside of your code, so you have to instruct the underlying code to stop – usually by closing a stream or disconnecting a connection – and in order to do that, you need to do quite a few things (listed after the sketch below). If you have a long/endless loop, however, you can simply check Thread.currentThread().isInterrupted() on each iteration and stop when it returns true:
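
A minimal sketch of that simple case (not tied to any particular blocking API):

Runnable task = new Runnable() {
    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            // do one chunk of work; the loop exits after interrupt() is called on this thread
        }
        // optional cleanup after interruption
    }
};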

  • Extend Runnable with a cancel() method (a custom CancellableRunnable interface)
  • Make the “cancellable” resources (e.g. the input stream) instance fields, and
  • provide a cancel method in your extended runnable, where you get the “cancellable” resource and cancel it (e.g. call inputStream.close())
  • Implement a custom ThreadFactory that in turn creates custom Thread instances that override the interrupt() method and invoke the cancel() method on your extended Runnable
  • Instantiate the executor with the custom thread factory (the static factory methods take it as an argument)
  • Handle the abrupt closing/stopping/disconnecting of your blocking resources in the run() method

The bad news is, you need to have access to the particular cancellable runnable in your thread factory. You cannot use instanceof to check if it’s of an appropriate type, because executors wrap the runnables you submit to them in Worker instances which do not expose their underlying runnables.

For single-threaded executors that’s easy – you simply hold in your outermost class a reference to the currently submitted runnable, and access it in the interrupt method, e.g.:

private final CancellableRunnable runnable;
...

runnable = new CancellableRunnable() {
    private MutableBoolean bool = new MutableBoolean();
    @Override
    public void run() {
        bool.setValue(true);
        while (bool.booleanValue()) {
            // emulating a blocking operation with an endless loop
        }
    }
    
    @Override
    public void cancel() {
        bool.setValue(false);
        // usually here you'd have inputStream.close() or connection.disconnect()
    }
};

ExecutorService executor = Executors.newSingleThreadExecutor(new ThreadFactory() {
    @Override
    public Thread newThread(Runnable r) {
       return new Thread(r) {
           @Override
           public void interrupt() {
               super.interrupt();
               runnable.cancel();
           }
       };
    }
}); 

Future<?> future = executor.submit(runnable);
...
future.cancel(true);

(CancellableRunnable is a custom interface that simply defines the cancel() method)

But what happens if your executor has to run multiple tasks at the same time? If you want to cancel all of them, you can keep a list of the submitted CancellableRunnable instances and simply cancel all of them when a thread is interrupted (see the sketch below). Runnables will thus be cancelled multiple times, so you have to account for that.
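
A sketch of that approach (the fixed-size pool and the copy-on-write list are just assumptions for the example):

final List<CancellableRunnable> submitted = new CopyOnWriteArrayList<>();

ExecutorService executor = Executors.newFixedThreadPool(4, new ThreadFactory() {
    @Override
    public Thread newThread(Runnable r) {
        return new Thread(r) {
            @Override
            public void interrupt() {
                super.interrupt();
                // every worker thread runs this on shutdownNow(), so cancel() must tolerate repeated calls
                for (CancellableRunnable cancellable : submitted) {
                    cancellable.cancel();
                }
            }
        };
    }
});

// for each task: remember it, then submit it
submitted.add(runnable); // the CancellableRunnable from the previous example
executor.submit(runnable);

// later: interrupts all worker threads, which in turn cancel all submitted runnables
executor.shutdownNow();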

If you want fine-grained control, e.g. by cancelling particular futures, then there is no easy solution. You can’t even extend ThreadPoolExecutor because the addWorker method is private. You have to copy-paste it.

The only option is not to rely on future.cancel() or executor.shutdownNow() and instead keep your own list of CancellableRunnable instances, mapped to their corresponding futures. So whenever you want to cancel some (or all) runnables, you do it the other way around – get the desired runnable you want to cancel, call .cancel() on it (as shown above), then get its corresponding Future and cancel it as well. Something like:

Map<CancellableRunnable, Future<?>> cancellableFutures = new HashMap<>();
Future<?> future = executor.submit(runnable);
cancellableFutures.put(runnable, future);

//now you want to abruptly cancel a particular task
runnable.cancel();
cancellableFutures.get(runnable).cancel(true);

(Instead of using the runnable as key, you may use some identifier which makes sense in your usecase and store both the runnable and future as a value under that key)

That’s a neat workaround, but anyway I’ve submitted a request for enhancement of the java.util.concurrent package, so that in a future release we do have the option to manage that usecase.


Development Overhead

November 13, 2014

What does a developer spend his time on? Writing code, debugging, thinking and communicating with colleagues (that includes meetings). Anything that is beyond these activities is unnecessary overhead (some meetings are also unnecessary, but that’s a different topic).

And yet, depending on our language and tools, we have to do a lot more to support the process of writing code. These activities include, but are not limited to:

  • manually format your code – the code has to be beautifully aligned and formatted, but that’s an extra effort.
  • using search and replace instead of refactoring – few languages and tools support good refactoring, and that’s priceless in a big project
  • manually invoking compilation – compile on save gives you immediate feedback; the need to manually run a compiler is adding a whole unnecessary step to your coding process
  • slow compilation – ontop of the previous issue, if your compiler is slow, it’s just a dead time (mandatory xkcd)
  • slow time-to-deploy – if the time from writing the code to running it is more than a few seconds, you are wasting enormous amounts of time. E.g. if you need to manually make builds and copy files to a local server.
  • clunky resource navigation – if you can’t jump to a given source file in a couple of keystrokes, you waste time searching for it
  • infrastructure problems – you depend on a database, a message queue, possibly some external service. Installing and supporting these components on your development machine can be painful. Recently we spent one day trying to integrate 3 components, some of which had docker instances. And on neither Windows, nor Mac, did Docker work properly. It was a painful error-google-try-error-google process to even get things started. Avoid immature, untested tools (not bashing docker here, it’s just an example, and it might have already been improved/fixed)
  • OS issues – if your OS crashes every couple of days, your I/O is blocking your UI, you sometimes lose your ALT+TAB functionality (which are things that I’ve been experiencing when using Ubuntu), then your OS is wasting a significant amount of your time.

Most of the manual tasks above can be automated, and the others should not exist at all. If you are using Java, for example, you can have a stable IDE, with automatic formatting and refactoring, with compile-on-save, with save-and-refresh for webapps. And you can use an operating system that doesn’t make you recompile the kernel every now and then in order to keep it working (note: hyperbole here).

It’s often a tradeoff. If I have to compare Java to Groovy in terms of productivity, for example, the (perceived) verbosity of Java is a minor nuisance compared to the lack of refactoring, formatting, etc. in groovy (at least that was the case a few years ago; and it’s still the same with scala nowadays). Yes, you have to write a few lines more, but it’s a known process. If you have immature tools that are constantly breaking or just don’t work (and that is the case, unfortunately), it’s unknown how you should proceed. And you may end up wasting 10 minutes in manual “labour”, which kills the productivity the language gives you. For me Linux was also such a tradeoff – having the terminal is sometimes useful indeed, but it did not justify the effort of keeping the system working (and it completely died after a version upgrade).

Because I really feel all that overhead draining my productivity, I am very picky when it comes to the technologies I use. Being able to type faster or write fewer lines of code is fine, but you have to weigh that against the rest of the procedures you are forced to do. And that’s part of the reason why I prefer an IDE over a text editor, don’t use Emacs, don’t like Scala, and don’t use Linux.

Your experience may very well be different (and if checking facebook is already taking half of your day, then nothing above really matters). But try to measure (or at least observe) how much time you spend not doing actual programming (or thinking) because you have to do “automatable” or redundant stuff instead. And try to ignore the feeling of accomplishment when you do something that you didn’t have to do in the first place. If your preferred technologies turn out to be silently draining productivity, then consider changing them (or improving them, if you have the spare time).


On Java Generics and Erasure

November 5, 2014

“Generics are erased during compilation” is common knowledge (well, type parameters and arguments are actually the ones erased). That happens due to “type erasure”. But it’s wrong to assume, as many developers do, that everything specified inside the <..> symbols is erased. See the code below:

import java.lang.reflect.ParameterizedType;
import java.util.ArrayList;
import java.util.List;

public class ClassTest {
  public static void main(String[] args) throws Exception {
    ParameterizedType type = (ParameterizedType) 
       Bar.class.getGenericSuperclass();
    System.out.println(type.getActualTypeArguments()[0]);
    
    ParameterizedType fieldType = (ParameterizedType) 
        Foo.class.getField("children").getGenericType();
    System.out.println(fieldType.getActualTypeArguments()[0]);
    
    ParameterizedType paramType = (ParameterizedType) 
        Foo.class.getMethod("foo", List.class)
        .getGenericParameterTypes()[0];
    System.out.println(paramType.getActualTypeArguments()[0]);
    
    ParameterizedType returnType = (ParameterizedType) 
        Foo.class.getMethod("foo", List.class)
        .getGenericReturnType();
    System.out.println(returnType.getActualTypeArguments()[0]);
    
    System.out.println(Foo.class.getTypeParameters()[0]
        .getBounds()[0]);
  }
  
  class Foo<E extends CharSequence> {
    public List<Bar> children = new ArrayList<Bar>();
    public List<StringBuilder> foo(List<String> foo) { return null; }
    public void bar(List<? extends String> param) {}
  }
   
  class Bar extends Foo<String> {}
}

Do you know what that prints?

class java.lang.String
class ClassTest$Bar
class java.lang.String
class java.lang.StringBuilder
interface java.lang.CharSequence

You see that every single type argument is preserved and is accessible via reflection at runtime. But then what is “type erasure”? Something must be erased? Yes. In fact, all of them are, except the structural ones – everything above is related to the structure of the classes, rather than the program flow. In other words, the metadata about the type arguments of a class and of its fields and methods is preserved and accessible via reflection.

The rest, however, is erased. For example, the following code:

List<String> list = new ArrayList<>();
Iterator<String> it = list.iterator();
while (it.hasNext()) {
   String s = it.next();
}

will actually be transformed to this (the bytecode of the two snippets is identical):

List list = new ArrayList();
Iterator it = list.iterator();
while (it.hasNext()) {
   String s = (String) it.next();
}

So, all type arguments you have used in the bodies of your methods will be removed and casts will be added where needed. Also, if a method is defined to accept List<T>, this T will be transformed to Object (or to its bound, if one is declared). And that’s why you can’t do new T(). (By the way, there is one open question about this erasure.)
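
As an illustration (a hypothetical class, not from the code above), a bounded type parameter erases to its bound, and an unbounded one to Object:

class Box<T extends CharSequence> {
    T value;
    T get() { return value; }                  // erases to: CharSequence get()
    void set(T value) { this.value = value; }  // erases to: void set(CharSequence value)
    // T create() { return new T(); }          // impossible - after erasure there is no T to instantiate
}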

So far we covered the first two points of the type erasure definition. The third one is about bridge methods. And I’ve illustrated it with this stackoverflow question (and answer).
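
The gist of it (a minimal sketch, not the code from the linked question): when a subclass fixes the type parameter, the compiler generates a synthetic “bridge” method so that the override still works through the erased signature:

class Node<T> {
    public void set(T t) { }
}

class StringNode extends Node<String> {
    @Override
    public void set(String s) { }
    // the compiler also generates a bridge method, roughly:
    // public void set(Object o) { set((String) o); }
}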

Two “morals” from all this. First, Java generics are complicated. But you can use them without understanding all the complications.

Second, do not assume that all type information is erased – the structural type arguments are there, so make use of them if needed (but don’t be over-reliant on reflection).


The DSL Jungle

October 21, 2014

DSLs are a common thing in the programming world nowadays. Many frameworks and tools decide to build a DSL for their…specific things. Build tools are the primary candidates, but testing frameworks, web frameworks and whatnot also decide to define a DSL. With these DSLs you define build steps, web routing rules, test acceptance criteria, etc.

What is the most common thing about all these DSLs? Two things. First, they are predominantly about configuration – some specific way of configuring something specific to the tool or framework. The second thing is that you copy-paste code. Every time I’m confronted with some DSL that is meant to help with my programming task, I end up copy-pasting examples or existing code and then modifying it. Even though I’ve been working with a DSL for 8 months (from time to time), I still don’t remember its syntax.

And you may say “yeah, that’s because you use bad DSLs”. Well, then I haven’t seen a good one yet. I’m currently using sbt, spray routing, cucumber for scala, previously I’ve used groovy and grails DSLs, and a few others along the way.

But is it bad that you copy-paste existing pieces of code? Not always. You can, of course, base your configuration on existing, working pieces. But there are three issues – duplicate code, autocomplete and exploration. You know copy-pasting is wrong and leads to duplication. Not only that, but you may forget to change or remove something in the pasted code. And if you want to add some property, it would be good to be able to auto-complete it, rather than mistyping it or forgetting whether it was “filePath”, “filepath”, “file-path” or just “path”. With 2-3 DSLs in parts of a big project, you can’t remember all the property names, so the alternative is to go and read the documentation (if you don’t have a working piece with that particular property to copy-paste from). Exploration is an even bigger issue. Especially when learning, or remembering how to do certain things with a given DSL, it is crucial to be able to explore the possibilities. What properties does this item have that might be useful? What does this property do exactly, and does it have subproperties? What can I nest under this item? This is very important, regardless of your knowledge of the tool/framework.

But with most DSLs you don’t have that. They either have some bizarre syntax, or they are JSON-based, or they look like the language you are using, but not quite, and hence even an IDE finds it difficult to understand them (spray being such an example). You either look at the documentation, or you copy-paste, or both. And you are kind of lost in this DSL jungle of ever so “cooler” DSLs that do a wide variety of things.

And now I’ll drop the X-bomb. I love XML. Trusting the “XML configuration files are evil” meme has led to many incomprehensible configurations that are supposedly “short and easy to read and write”. Easy, if you remember what those double-percentage signs meant compared to the single percentage signs, and where exactly to put the parentheses.

In almost every scenario where someone decided that a DSL is a good idea, XML would have worked brilliantly. Using an XSD schema (which, I agree, is a bit tedious to write), you can turn any XML-aware tool into an IDE for configuration. Take the maven pom file, for example. Did you forget which element you can nest under “build”? Hit CTRL+space and you’ll find out. Being unified, you can read the XML configuration of any framework or tool that uses it, not just this particular one that happens to be the n-th DSL in a single project. While XML is verbose, it is straightforward and standard. (To make a distinction: your application properties file is fine with key-value pairs, YAML, or something like typesafe config, but that’s not coming from a framework, and it’s not a DSL in the narrower sense.)

So if you are writing a tool, and can’t make some configuration available via annotations or via very simple code (builders, setters, fluent interfaces), don’t go for a DSL. Don’t write DSLs where you can easily use XML. It will look good on your README.md, but your users will copy-paste all the time and may actually hate it. So please don’t contribute to the DSL jungle.

And do you know why that is? Remember the initial note that these are DSLs you use when programming. Well, DSLs are not for programmers. DSLs are for non-programmers to express business logic in (almost) prose. Or at least their usage should be limited to that, where they can really excel. If you are making a tool for business analysts, feel free to design the most awesome DSL. If you are building a tool for programmers, don’t.
