The DSL Jungle

October 21, 2014

DSLs are a common thing in the programming world nowadays. Many frameworks and tools decide to build a DSL for their…specific things. Build tools are the primary candidates, but testing frameworks, web frameworks and whatnot also decide to define a DSL. With these DSLs you define build steps, web routing rules, test acceptance criteria, etc.

What do all these DSLs have in common? Two things. First, they are predominantly about configuration – some specific way of configuring something specific to the tool or framework. The second thing is that you copy-paste code. Every time I’m confronted with some DSL that is meant to help with my programming task, I end up copy-pasting examples or existing code, and then modifying it. Even though I’ve been working with a DSL for 8 months (from time to time), I just don’t remember its syntax.

And you may say “yeah, that’s because you use bad DSLs”. Well, then I haven’t seen a good one yet. I’m currently using sbt, spray routing, cucumber for scala, previously I’ve used groovy and grails DSLs, and a few others along the way.

But is it bad that you copy-paste existing pieces of code? Not always. You can, of course, base your configuration on existing, working pieces. But there are three issues – duplicate code, autocomplete and exploration. You know copy-pasting is wrong and leads to duplication. Not only that, but you may forget to change or remove something in the pasted code. And if you want to add some property, it would be good to be able to auto-complete it, rather than mistyping it, or forgetting whether it was “filePath”, “filepath”, “file-path” or just “path”. Having 2-3 DSLs in parts of a big project, you can’t remember all property names, so the alternative is to go and see the documentation (if you don’t have a working piece with that particular property to copy-paste from). Exploration is an even bigger issue. Especially when learning, or remembering how to do certain things with a given DSL, it is crucial to be able to explore the possibilities. What properties does this have, that might be useful? What does this property do exactly and does it have subproperties? What can I nest under this item? This is very important, regardless of your knowledge of the tool/framework.

But with most DSLs you don’t have that. They either have some bizarre syntax, or they are JSON-based, or they look like the language you are using, but not quite, and hence even an IDE finds it difficult to understand them (spray being such an example). You either look at the documentation, or you copy-paste, or both. And you are kind of lost in this DSL jungle of ever so “cooler” DSLs that do a wide variety of things.

And now I’ll drop the X-bomb. I love XML. Trusting the “XML configuration files are evil” meme has led to many incomprehensible configurations that are “short and easy to read and write”. Easy, if you remembered what those double-percentage signs meant compared to the single percentage signs, and where exactly to put the parentheses.

In almost every scenario where someone decided that a DSL is a good idea, XML would have worked brilliantly. Using an XSD schema (which, I agree, is a bit tedious to write) you can turn any XML-aware editor into an IDE for your configuration. Take the maven pom file, for example. Did you forget what element you could nest under “build”? Hit CTRL+space and you’ll find out. Because the format is unified, you can read the XML configuration of any framework or tool that uses it, not just the particular one that happens to be the n-th DSL in a single project. While XML is verbose, it is straightforward and standard. (To make a distinction: your application properties file is fine with key-value pairs, YAML, or something like Typesafe config, but that’s not coming from a framework, and it’s not a DSL in the narrower sense.)

So if you are writing a tool, and can’t make some configuration available via annotations or via very simple code (builders, setters, fluent interfaces), don’t go for a DSL. Don’t write DSLs where you can easily use XML. It will look good on your README.md, but your users will copy-paste all the time and may actually hate it. So please don’t contribute to the DSL jungle.
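For illustration, here is a minimal sketch of the “very simple code” alternative – a plain fluent builder with entirely made-up names. The IDE autocompletes every method, a typo is a compile-time error, and there is no extra syntax to remember:

import java.time.Duration;

// A hypothetical configuration class (all names are made up) – plain Java instead of a DSL.
// Every option is discoverable via autocomplete and checked by the compiler.
public class ServerConfig {
    private int port = 8080;
    private String contextPath = "/";
    private Duration requestTimeout = Duration.ofSeconds(30);

    public ServerConfig port(int port) { this.port = port; return this; }
    public ServerConfig contextPath(String contextPath) { this.contextPath = contextPath; return this; }
    public ServerConfig requestTimeout(Duration requestTimeout) { this.requestTimeout = requestTimeout; return this; }

    public static void main(String[] args) {
        // fluent usage: no new syntax to remember, no documentation lookup needed
        ServerConfig config = new ServerConfig()
                .port(9090)
                .contextPath("/api")
                .requestTimeout(Duration.ofSeconds(10));
    }
}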

And do you know why that is? Remember the initial note that these are DSLs you use when programming. Well, DSLs are not for programmers. DSLs are for non-programmers to express business logic in (almost) prose. Or at least their usage should be limited to that, where they can really excel. If you are making a tool for business analysts, feel free to design the most awesome DSL. If you are building a tool for programmers, don’t.


Validate Configuration on Startup

October 15, 2014

Do you remember that time when you spent a whole day trying to fix a problem, only to realize that you had mistyped a configuration setting? Yes. And it was not just one time.

Avoiding that is not trivial, as not only you, but also the frameworks that you use have to take care of it. But let me outline my suggestion.

Always validate your configuration on startup of your application. This involves three things:

First, check if your configuration values are correct. Test database connection URLs, file paths, numbers and periods of time. If a directory is missing, a database is unreachable, or you have specified a non-numeric value where a number or period of time is expected, you should know that immediately, rather than after the application has been used for a while.
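For example, the checks can be as simple as this (a minimal sketch using java.nio.file; the property names are made up):

// Sketch: validate configuration values on startup (hypothetical property names)
Path backupDir = Paths.get(properties.getProperty("backup.directory"));
if (!Files.isDirectory(backupDir)) {
    throw new IllegalStateException("backup.directory is not an existing directory: " + backupDir);
}
int timeout;
try {
    timeout = Integer.parseInt(properties.getProperty("request.timeout"));
} catch (NumberFormatException ex) {
    throw new IllegalStateException("request.timeout must be a number", ex);
}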

Second, make sure all required parameters are set. If a property is required, fail if it has not been set, and fail with a meaningful exception, rather than an empty NullPointerException (e.g. throw new IllegalArgumentException("database.url is required")).

Third, check if only allowed values are set in the configuration file. If a property is not recognized, fail immediately and report it. This will save you from spending a whole day trying to find out why setting the “request.timeuot” property didn’t have any effect. This also covers optional properties that have default values. It comes with the extra step of adding each new property to a predefined list of allowed properties, and possibly forgetting to do that and getting an exception, but that is unlikely to waste more than a minute.

A simple implementation of the last suggestion would look like this:

Properties properties = loadProperties();
for (Object key : properties.keySet()) {
  if (!VALID_PROPERTIES.contains(key)) {
    throw new IllegalArgumentException("Property " + key +
      " is not recognized as a valid property. Maybe a typo?");
  }
}

Implementing the first one is a bit harder, as it needs some logic – in your generic properties loading mechanism you don’t know if a property is a database connection url, a folder, or a timeout. So you have to do these checks in the classes that know the purpose of each property. Your database connection handler knows how to work with a database url, your file storage handler knows what a backup directory is, and so on. This can be combined with the required property verification. Here, a library like Typesafe config may come in handy, but it won’t solve all problems.
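A minimal sketch of what that could look like in the component that owns the setting (plain JDBC; all names are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Sketch: the class that owns the setting validates it on startup
public class DatabaseConnectionHandler {
    private final String databaseUrl;

    public DatabaseConnectionHandler(String databaseUrl) {
        if (databaseUrl == null || databaseUrl.isEmpty()) {
            throw new IllegalArgumentException("database.url is required");
        }
        // fail fast if the database is unreachable or the url is mistyped
        try (Connection connection = DriverManager.getConnection(databaseUrl)) {
            // connection established successfully, nothing else to do
        } catch (SQLException ex) {
            throw new IllegalStateException("Cannot connect to database at " + databaseUrl, ex);
        }
        this.databaseUrl = databaseUrl;
    }
}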

This is not only useful for development, but also for newcomers to the project that try to configure their local server, and most importantly – production, where you can immediately find out if there has been a misconfiguration in this release.

Ultimately, the goal is to fail as early as possible if there is any problem with the supplied configuration, rather than spending hours chasing typos, missing values and services that are accidentally not running.


Scala – the Good, the Bad and the Very Ugly [presentation]

October 6, 2014

The other day I gave a talk at a tech conference about my experience with Scala. Ironically, just two weeks after I wrote that I don’t like Scala, I started working with it on a daily basis, so I now have a better overview. And it’s not all black and white, but many of my arguments still hold.

You can check the slides (the talk was not in English). And I’d like to emphasize the final conclusion point: don’t give the users of your language, API or product all the possible options – they will misuse them.


Load-Testing Guidelines

September 18, 2014

Load-testing is not trivial. It’s often not just about downloading JMeter or Gatling, recording some scenarios and then running them. Well, it might be just that, but you are lucky if it is. And even though it may sound like “Captain Obvious speaking”, it’s good to be reminded of some things that can potentially waste time.

So, when you run the tests, eventually you will hit a bottleneck, and then you’ll have to figure out where it is. It can be:

  • client bottleneck – if your load-testing tool uses HttpURLConnection, the number of requests sent by the client is quite limited. You have to start from that and make sure enough requests are leaving your load-testing machine(s)
  • network bottlenecks – check if your outbound connection allows the desired number of requests to reach the server
  • server machine bottleneck – check the number of open files that your (most probably) Linux server allows. For example, if the default is 1024, then you can have at most 1024 concurrent connections. So increase that (limits.conf)
  • application server bottleneck – if the thread pool that handles requests is too small, requests may be kept waiting. If some other tiny configuration switch (e.g. whether to use NIO, which is worth a separate article) has the wrong value, that may reduce performance. You’d have to be familiar with the performance-related configurations of your server.
  • database bottlenecks – check the CPU usage and response times of your database to see if it’s not the one slowing the requests. Misconfiguring your database, or having too small/few DB servers, can obviously be a bottleneck
  • application bottleneck – these you’d have to investigate yourself, possibly using some performance monitoring tool (but be careful when choosing one, as there are many “new and cool”, but unstable and useless ones). We can divide this type in two:
    • framework bottleneck – if a framework you are using has problems. This might be a web framework, a dependency injection framework, an actor system, an ORM, or even a JSON serialization tool
    • application code bottleneck – if you are misusing a tool/framework, have blocking code, or just wrote horrible code with unnecessarily high computational complexity

You’d have to constantly monitor the CPU, memory, network and disk I/O usage of the machines, in order to understand when you’ve hit the hardware bottleneck.

One important aspect is being able to bombard your servers with enough requests. It’s not unlikely that a single machine is insufficient, especially if you are a big company and your product is likely to attract a lot of customers at the start, and/or if making a request needs some processing power as well, e.g. for encryption. So you may need a cluster of machines to run your load tests. The tool you are using may not support that, so you may have to coordinate the cluster manually.

As a result of your load tests, you’d have to consider how long it makes sense to keep connections waiting, and when to reject them. That is controlled by the connect timeout on the client and the registration timeout (or pool borrow timeout) on the server. Also have that in mind when viewing the results – a response that is too slow and a rejected connection are practically the same thing: your server is not able to service the request.
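On the client side that typically boils down to a couple of calls (a sketch with arbitrary values):

// Sketch: client-side timeouts with HttpURLConnection (the values are arbitrary examples)
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setConnectTimeout(2000); // give up if a connection cannot be established within 2 seconds
connection.setReadTimeout(5000);    // give up if the response does not arrive within 5 seconds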

If you are on AWS, there are some specifics. Leaving auto-scaling aside (which you should probably disable for at least some of the runs), you need to have in mind that the ELB needs warming up. Run the tests a couple of times to warm up the ELB (many requests will fail until it is warmed up). Also, when using a load-balancer and long-lived connections are left open (or you use WebSocket, for example), the load balancer may leave connections from itself to the servers behind it open forever and reuse them when a new request for a long-lived connection comes.

Overall, load (performance) testing and analysis is not straightforward, and there are many possible problems, but it is something that you must do before release. Well, unless you don’t expect more than 100 users. And the next time I do that, I will use my own article for reference, to make sure I’m not missing something.


EasyCamera Now in Maven Central

September 11, 2014

Several months ago I created the EasyCamera project (GitHub). As there has been a lot of interest, I just released it to Maven Central, so that you don’t need to check out & build it yourself. The packaging has also been changed to aar now.

I would appreciate all reported issues and feedback about the API. Let me just recap the usage:

EasyCamera camera = DefaultEasyCamera.open();
CameraActions actions = camera.startPreview(surface);
PictureCallback callback = new PictureCallback() {
    public void onPictureTaken(byte[] data, CameraActions actions) {
        // store picture
    }
};
actions.takePicture(Callbacks.create().withJpegCallback(callback));

I won’t paste the code that you would normally have to write, but it takes about 10 steps, and I believe the EasyCamera API is a bit friendlier, easier to discover and harder to misuse.


Musical Scale Generator

September 7, 2014

We all know the C-major scale: do-re-mi-fa-sol-la-ti-do. But what’s behind it? And how many other scales are there? It’s complicated. Let me give a brief introduction to the theory first, without trying to be precise or complete.

More than a dozen scales are in use, and the popular ones in the western world are the major and minor (natural, harmonic) scales and the “old modes”: Dorian, Lydian, Locrian, etc. All of these are heptatonic (7-tone) scales. There are also pentatonic (5-tone) scales, as well as other scales like Turkish, Indian and Arabic ones. All of them share a common purpose: to constrain melodies in order to make them sound pleasant. The notes in each scale trigger a different level of consonance with each other, which in turn provides a different “feel”. The predominant scales all fall within the so-called chromatic scale, which consists of all 12 notes of an octave on a piano keyboard (counting both white and black keys).

How are the scales derived? There are two main aspects: the harmonic series and temperament. The harmonic series (closely related to the concept of an overtone) is derived from the physical behaviour of musical instruments, and more precisely from oscillation (e.g. of a string). The harmonic (or overtone) series produces ever-increasing pitches, which are then transposed into a single octave (the pitch space between the fundamental frequency and 2 times that frequency). This is roughly how the chromatic scale is obtained. Then there is temperament – although the entirely physical explanation sounds like a perfect way to link nature and music, in practice the frequencies obtained this way are not practical to play on musical instruments, and they also yield some dissonances. That’s why musicians tune their instruments by changing the frequencies obtained from the harmonic series. There are multiple ways to do that, one of which is the 12-tone equal temperament, where an octave is divided into 12 parts that are equal on a logarithmic scale (because pitch changes are perceived as the logarithm of their frequencies).
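For illustration, in 12-tone equal temperament each semitone multiplies the frequency by the 12th root of 2, so the whole equal-tempered chromatic scale can be computed in a couple of lines (a quick sketch, using A=440 Hz as the reference pitch):

// Sketch: 12-tone equal temperament – each of the 12 steps multiplies the frequency by 2^(1/12)
double fundamental = 440.0; // A4, just an example reference pitch
for (int step = 0; step <= 12; step++) {
    double frequency = fundamental * Math.pow(2, step / 12.0);
    System.out.printf("step %2d: %.2f Hz%n", step, frequency);
}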

But what does that have to do with programming? Computers can generate an almost infinite amount of musical scales that follow the rules of the scales already proven to be good. Why limit ourselves to 7-tone scales out of 12 tones, when we can divide the octave into 24 parts and make a scale of 15 tones? In fact, some composers and instrument makers, the most notable being Harry Partch, have experimented with such an approach, and music has been written in such “new” scales (although not everyone would call it “pleasant”). But with computers we can test new scales in seconds, and write music in them (or let the computer write it) in minutes. In fact, I see this as one way for advancing the musical landscape with the help of computers (algorithmic composition aside).

That’s why I wrote a scale generator. It takes a few input parameters – the fundamental frequency, on which you want to base the scale (by default C=262.626); the size of the scale (by default 7); the size of the ‘chromatic scale’ out of which the scale will be drawn (by default 12); and the final parameter specifies whether to use equal temperament or not.

The process, in a few sentences: it starts by calculating the overtones (harmonics), skipping the 7th (for reasons I don’t fully understand). Then it transposes all of them into the same octave: it calculates the ratio of a given harmonic to its tonic (the closest power-of-two multiple of the fundamental frequency), and then uses that ratio to calculate the frequency from the fundamental frequency itself. It does that until the “chromatic scale size” parameter value is reached. Then it finds the perfect interval (a perfect fifth in the case of a heptatonic (diatonic) scale), i.e. the one with ratio 3/2. If equal temperament is enabled, the previous chromatic scale is replaced with an equal-tempered one. Then the algorithm makes a “circle” from the tones in the chromatic scale (the circle of fifths is one example), based on the perfect interval, and, starting from the tone before the fundamental frequency, enumerates N tones, where N is the size of the scale. This is the newly formed scale. Note that starting from each note in the scale we just obtained (and continuing into the next octave when we run out of tones) would yield a completely different scale (this is the difference between C-major and A-minor – they use the same notes).
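The “transpose into a single octave” step can be illustrated with a rough sketch (this is just the idea, not the actual implementation): each harmonic is halved until it falls between the fundamental frequency and twice the fundamental frequency.

// Rough sketch of the transposition step: fold a harmonic back into the octave
// [fundamental, 2 * fundamental) by repeatedly halving it
double fundamental = 262.626;
double harmonic = fundamental * 5; // the 5th harmonic, for example
while (harmonic >= fundamental * 2) {
    harmonic /= 2;
}
// harmonic is now 5/4 of the fundamental – a just-intonation major third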

Finally, my tool plays the generated scale (using low-level sound wave generation, which I copied from somewhere and is beyond the scope of this discussion) and also, using a basic form of my music composition algorithm, composes a melody in the given scale. It sounds terrible at first, because it’s not using any instrument, but it gives a good “picture” of the result. And the default arguments result in the familiar major scale being played.

Why is this interesting? Because hopefully music will evolve, and we will be able to find richer scales pleasant to listen to, giving composers even more material to work with.


Caveats of HttpURLConnection

September 5, 2014

Does this piece of code look ok to you?

HttpURLConnection connection = null;
try {
   connection = (HttpURLConnection) url.openConnection();
   try (InputStream in = connection.getInputStream()) {
     return streamToString(in);
   }
} finally {
   if (connection != null) connection.disconnect();
}

Looks good – it opens a connection, reads from it, closes the input stream, releases the connection, and that’s it. But while running some performance tests, and trying to figure out a bottleneck issue, we found out that disconnect() is not as benign as it seems – when we stopped disconnecting our connections, there were twice as many outgoing connections. Here’s the javadoc:

Indicates that other requests to the server are unlikely in the near future. Calling disconnect() should not imply that this HttpURLConnection instance can be reused for other requests.

And on the class itself:

Calling the disconnect() method may close the underlying socket if a persistent connection is otherwise idle at that time.

This is still unclear, but gives us a hint that there’s something more. After reading a couple of stackoverflow and java.net answers (1, 2, 3, 4) and also the android documentation of the same class, which is actually different from the Oracle implementation, it turns out that .disconnect() actually closes (or may close, in the case of android) the underlying socket.

Then we can find this bit of documentation (it is linked in the javadoc, but it’s not immediately obvious that it matters when calling disconnect), which gives us the whole picture:

The http.keepAlive property (default: true) indicates that sockets can be reused by subsequent requests. That works by leaving the connection to the server (which must support keep-alive) open, so the overhead of opening a new socket is avoided. By default, up to 5 such sockets are reused (per destination). You can increase this pool size by setting the http.maxConnections property. However, after increasing that to 10, 20 and 50, there was no visible improvement in the number of outgoing requests.
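For reference, these are plain JVM networking properties, so they can be set either as -D flags on the command line or programmatically, before the first connection is opened:

// These must be set before the first HTTP connection is made
// (or passed as -Dhttp.keepAlive=true -Dhttp.maxConnections=20 on the command line)
System.setProperty("http.keepAlive", "true");     // the default
System.setProperty("http.maxConnections", "20");  // per-destination size of the reusable socket pool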

However, when we switched from HttpURLConnection to apache http client, with a pooled connection manager, we had 3 times more outgoing connections per second. And that’s without fine-tuning it.

Load testing, i.e. bombarding a target server with as many requests as possible, sounds like a niche use-case. But in fact, if your application invokes a web service, either within your stack, or an external one, as part of each request, then you have the same problem – you will be able to make fewer requests per second to the target server, and consequently, respond to fewer requests per second to your users.

The advice here is: almost always prefer apache http client – it has a way better API and it seems to have way better performance, without the need to understand how exactly it functions underneath. But be careful of the same caveats there as well – check the pool size and connection reuse. If using HttpURLConnection, do not disconnect your connections after you read their response, consider increasing the socket pool size, and be careful of related problems.
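To give an idea of what that setup looks like, here is a basic pooled configuration with apache http client 4.x (the pool sizes are just example values):

import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

// Apache HttpClient with a pooled connection manager; connections are reused automatically
PoolingHttpClientConnectionManager connectionManager = new PoolingHttpClientConnectionManager();
connectionManager.setMaxTotal(200);           // total connections across all routes
connectionManager.setDefaultMaxPerRoute(50);  // connections per target host

CloseableHttpClient httpClient = HttpClients.custom()
        .setConnectionManager(connectionManager)
        .build();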


Open-Sourcing My Music Composition Algorithm

August 19, 2014

Less than two years ago I wrote about the first version of my algorithm for music composition. Since then computoser.com got some interest and the algorithm was incrementally improved.

Now, on my birthday, I decided it’s time to make it open-source. So it’s on GitHub.

It contains both the algorithm and the supporting code to run it on a website (written with spring and hibernate). The algorithm itself is in the com.music package, everything else is in subpackages, so it’s easy to identify it.

It isn’t a perfect piece of code, but I think it’s readable, if you happen to know some music theory. I am now preparing a paper to present my research (as some research is involved in the creation) as well as how the algorithm functions. Opening the code is part of the preparation for the paper – it will be noted there as a reference implementation.

The license is AGPL – as far as I know, that should not allow closed-source use of my algorithm on the server-side.

I don’t think making it open-source is such a significant step, but I hope it will somehow help algorithmic music composition advance further than it is today.


Get Rid of the URL Pollution

August 13, 2014

You want to copy the URL of a nice article/video/picture you’ve just opened and send it to friends in skype chats, whatsapp, other messengers or social networks. And you realize the URL looks like this:

http://somesite.com/articles/title-of-the-article?utm_campaign=fsafser454fasfdsaffffas&utm_bullshit=543fasdfafd534254543&somethingelse=uselessstuffffsafafafad&utm_source=foobar

What are these parameters that pollute the URL? The above example uses some of the Google Analytics parameters (utm*), but other analytics tools use the same approach. And probably other tools as well. How are these parameters useful? They tell Google Analytics (which is run with javascript) details about the current campaign, probably where the user is coming from, and other stuff I and especially users don’t really care about.

And that’s ugly. I myself always delete the meaningless parts of the URL, so that in the end people see only “http://somesite.com/articles/title-of-the-article”. But that’s me – a software engineer, who can distinguish the useless parts of the URL. Not many people can, and even fewer are bothered to cut parts of the URL, which results in looong and ugly URLs being pasted around. Why is that bad?

  • website owners have put effort into making their URLs pretty. With “URL pollution” that effort goes to waste.
  • defeating the purpose of the parameters – when you copy-paste such a URL, all the people that open it may be counted as, for example, coming from a specific AdWords campaign. Or from a source that’s actually wrong (because they got the URL in Skype, for example, but utm_source is ‘facebook’)
  • lower likelihood of clicking on a hairy URL with meaningless stuff in it (at least I find myself more hesitant)

If you have a website, what can you do about this URL pollution, without breaking your analytics tool? You can get rid of them with javascript:

    window.history.replaceState(null, null, 
        window.location.href.replace("utm_source=....", ""));

This won’t trigger fake analytics results (for GA, at least, as it requires manual work to trigger it after pushState). Now there are three questions: how to get the exact parameters, when to run the above code, and is it worth it?

You can get all parameters (as shown here) and then either remove some blacklisted ones (utm_source, utm_campaign, etc.), or remove all of them except your whitelisted parameters. If your application isn’t using GET parameters at all, that’s easy. If it is, then keeping the whitelist in sync would be tedious, so probably go for the blacklist.

When should you do that? A little after the page loads, and the analytics tool does its job. When exactly is that – I don’t know. Maybe on window.load, maybe you have to wait for a second and then remove the parameters. You’d have to experiment.

And is it worth it? I think yes. Fewer useless parameters, less noise, nicer, friendlier URLs (that’s why you spent time prettifying them, right?), and fewer incorrect analytics results due to copy-pasted long URLs.

And I have a request to Google and all other providers of similar tools – please cleanup your “mess” after you read it, so that we don’t have to do it ourselves.


Generating equals(..), hashCode() and toString()

August 10, 2014

You most probably need to override hashCode(), equals(..) and toString() – I won’t go into details about when and why, but you need that (ok, just a reminder – always implement hashCode and equals together, and you most likely need to implement these methods if you are going to look up objects of a given class in a hashmap or an arraylist). And you have plenty of options to do it:

  • Manually implement the methods – that’s sort-of ok for toString() and quite impractical for hashCode() and equals(..). Unless you are pretty certain that you want a custom, well-considered hash function, you should rely on another, more practical mechanism
  • Use the IDE – all IDEs can generate the three methods, asking you to specify the fields you want to base them on. The hash function is usually good enough, and the rest just saves you from the headache of writing boilerplate comparisons, ifs and elses. But when you add a field, you shouldn’t forget to regenerate the methods.
  • commons-lang – there’s EqualsBuilder, HashCodeBuilder and ToStringBuilder there, which help you write the methods quickly, either with manual append(field).append(field), or with reflection, e.g. reflectionEquals(..). Adding a field again requires modifications, and it’s easy to forget that.
  • guava – very similar to commons-lang, with all the pros and cons. Guava has Objects and MoreObjects, with helper functions for equals(..) and hashCode and a builder for toString() – you still have to manually add/compare each field you want to include.
  • project lombok – it plugs into the compiler and turns some annotations into actual implementations, sparing you from writing the boilerplate code completely. For example, if you annotate the class with @EqualsAndHashCode, Lombok will generate the two methods based on all the fields in the class (you can customize that); see the sketch after this list. The other annotations are @ToString, @Value (for immutables), @Data (for value-objects). You just have to put a jar on your compile-time classpath, and it should work.
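To illustrate the last option, a Lombok-annotated class could look roughly like this (the class and field names are made up):

import lombok.EqualsAndHashCode;
import lombok.ToString;

// Sketch: Lombok generates equals(..), hashCode() and toString() at compile time
@EqualsAndHashCode(exclude = "description") // all fields except "description"
@ToString
public class Product {
    private Long id;
    private String name;
    private String description;
}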

Which of these should you use? I generally exclude the manual approach, as well as guava and commons-lang – they require too much manual work for a task that you shouldn’t need to care about in 99% of the cases. The reflection option with commons-lang sounds interesting, but it also sounds like performance overhead.

I’ve always used the IDE – the only downside of this is that you have to regenerate the methods. Sometimes you may forget, and that may yield unexpected behaviour. But apart from that, it’s a quick and robust approach.
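For reference, the generated pair for a hypothetical class with two fields typically looks something like this, give or take the exact style the IDE uses:

// Roughly what an IDE generates, using java.util.Objects (Java 7+); "Customer", "name" and "email" are made up
@Override
public boolean equals(Object o) {
    if (this == o) return true;
    if (o == null || getClass() != o.getClass()) return false;
    Customer other = (Customer) o;
    return Objects.equals(name, other.name) && Objects.equals(email, other.email);
}

@Override
public int hashCode() {
    return Objects.hash(name, email);
}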

Project lombok seems to eliminate the risk of forgetting to regenerate, but that sometimes has another side effect – you may not want to automatically include all new fields, and you can forget to exclude them. But my personal reluctance to use lombok is based on a sort of superstition – it does “black magic” by plugging into the compiler. It does work, but you don’t know how exactly it manages to handle the eclipse compiler, javac and the IntelliJ compiler; will it always work with maven, including your CI environment? Will it work through a major/minor compiler version upgrade? Obviously it does, and I have no rational argument against it. And it has some more useful features as well.

So, it’s up to you to pick either of the two approaches. But do not implement it manually, and I don’t think the helper functions/builders are that practical.
