Basic API Rate-Limiting

July 14, 2017

It is likely that you are developing some form of (web/RESTful) API, and if it is publicly facing (or even internal), you normally want to rate-limit it somehow. That is, to limit the number of requests performed over a period of time, in order to save resources and protect against abuse.

This can probably be achieved at the web-server/load-balancer level with some clever configuration, but usually you want the rate limiter to be client-specific (i.e. each client of your API should have a separate rate limit), and the way the client is identified varies. It's probably still possible to do it on the load balancer, but I think it makes sense to have it at the application level.

I’ll use spring-mvc for the example, but any web framework has a good way to plug an interceptor.

So here’s an example of a spring-mvc interceptor:

@Component
public class RateLimitingInterceptor extends HandlerInterceptorAdapter {

    private static final Logger logger = LoggerFactory.getLogger(RateLimitingInterceptor.class);

    @Value("${rate.limit.enabled}")
    private boolean enabled;

    @Value("${rate.limit.hourly.limit}")
    private int hourlyLimit;

    // one rate limiter per client id, created lazily
    private Map<String, SimpleRateLimiter> limiters = new ConcurrentHashMap<>();

    @Override
    public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object handler)
            throws Exception {
        if (!enabled) {
            return true;
        }
        String clientId = request.getHeader("Client-Id");
        // let non-API requests pass
        if (clientId == null) {
            return true;
        }
        SimpleRateLimiter rateLimiter = getRateLimiter(clientId);
        boolean allowRequest = rateLimiter.tryAcquire();

        if (!allowRequest) {
            response.setStatus(HttpStatus.TOO_MANY_REQUESTS.value());
        }
        response.addHeader("X-RateLimit-Limit", String.valueOf(hourlyLimit));
        return allowRequest;
    }

    private SimpleRateLimiter getRateLimiter(String clientId) {
        return limiters.computeIfAbsent(clientId, id -> createRateLimiter(id));
    }

    private SimpleRateLimiter createRateLimiter(String clientId) {
        logger.info("Creating rate limiter for clientId={}", clientId);
        return SimpleRateLimiter.create(hourlyLimit, TimeUnit.HOURS);
    }

    @PreDestroy
    public void destroy() {
        // loop and finalize all limiters
        limiters.values().forEach(SimpleRateLimiter::stop);
    }
}

This initializes rate limiters per client on demand. Alternatively, on startup you could loop through all registered API clients and create a rate limiter for each. If the rate limiter doesn't allow more requests (tryAcquire() returns false), return "Too many requests" and abort the execution of the request (i.e. return false from the interceptor).
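
For completeness – the interceptor also has to be registered with spring-mvc. A minimal sketch of that (assuming a standard Java-config setup; adjust to your project):

@Configuration
public class WebConfig extends WebMvcConfigurerAdapter {

    @Autowired
    private RateLimitingInterceptor rateLimitingInterceptor;

    @Override
    public void addInterceptors(InterceptorRegistry registry) {
        // apply the rate limiter to incoming requests
        registry.addInterceptor(rateLimitingInterceptor);
    }
}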

This sounds simple, but there are a few catches. You may wonder where the SimpleRateLimiter above is defined. We'll get there, but first let's see what options we have for rate limiter implementations.

The most recommended one seems to be the Guava RateLimiter. It has a straightforward factory method that gives you a rate limiter for a specified rate (permits per second). However, it doesn't accommodate web APIs very well, as you can't initialize the RateLimiter with a pre-existing number of permits. That means a period of time has to elapse before the limiter allows any requests. There's another issue – if you have fewer than one permit per second (e.g. if your desired rate limit is "200 requests per hour"), you can pass a fraction (hourlyLimit / secondsInHour), but it still won't work the way you expect it to, as internally there's a "maxPermits" field that caps the number of permits to much less than you want. Also, the rate limiter doesn't allow bursts – you have exactly X permits per second, but you cannot spread them over a longer period of time, e.g. have 5 requests in one second, and then no requests for the next few seconds. In fact, all of the above can be solved, but sadly only through hidden fields that you don't have access to. Multiple feature requests have existed for years now, but Guava just doesn't update the rate limiter, making it much less applicable to API rate-limiting.
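
For reference, the factory usage is roughly the following (just a sketch of the Guava API; the fractional rate is exactly what doesn't behave as expected for hourly limits):

// "200 requests per hour" expressed as permits per second
double permitsPerSecond = 200d / TimeUnit.HOURS.toSeconds(1);
RateLimiter limiter = RateLimiter.create(permitsPerSecond);

if (limiter.tryAcquire()) {
    // serve the request
} else {
    // reject with 429 Too Many Requests
}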

Using reflection, you can tweak the parameters and make the limiter work. However, it's ugly, and it's not guaranteed to work as expected. I have shown here how to initialize a Guava rate limiter with X permits per hour, with burstability and full initial permits. When I thought that would do, I saw that tryAcquire() has a synchronized(..) block. Does that mean all requests will wait for each other when simply checking whether they are allowed to make a request? That would be horrible.

So in fact the guava RateLimiter is not meant for (web) API rate-limiting. Maybe keeping it feature-poor is Guava’s way for discouraging people from misusing it?

That’s why I decided to implement something simple myself, based on a Java Semaphore. Here’s the naive implementation:

public class SimpleRateLimiter {
    private Semaphore semaphore;
    private int maxPermits;
    private TimeUnit timePeriod;
    private ScheduledExecutorService scheduler;

    public static SimpleRateLimiter create(int permits, TimeUnit timePeriod) {
        SimpleRateLimiter limiter = new SimpleRateLimiter(permits, timePeriod);
        limiter.schedulePermitReplenishment();
        return limiter;
    }

    private SimpleRateLimiter(int permits, TimeUnit timePeriod) {
        this.semaphore = new Semaphore(permits);
        this.maxPermits = permits;
        this.timePeriod = timePeriod;
    }

    public boolean tryAcquire() {
        return semaphore.tryAcquire();
    }

    public void stop() {
        scheduler.shutdownNow();
    }

    public void schedulePermitReplenishment() {
        scheduler = Executors.newScheduledThreadPool(1);
        // every time period, top the semaphore back up to maxPermits
        scheduler.scheduleAtFixedRate(() -> {
            semaphore.release(maxPermits - semaphore.availablePermits());
        }, 1, 1, timePeriod);
    }
}

It takes a number of permits (the allowed number of requests) and a time period. The time period is "1 X", where X can be second/minute/hour/day – depending on how you want your limit to be configured – per second, per minute, hourly, daily. Every 1 X a scheduler replenishes the acquired permits (in the example above there's one scheduler per client, which may be inefficient with a large number of clients – you can pass a shared scheduler pool instead). There is no control of bursts (a client can spend all its permits in a rapid succession of requests), there is no warm-up functionality and no gradual replenishment. Depending on what you want, this may not be ideal, but it's a basic rate limiter that is thread-safe and doesn't have any blocking. I wrote a unit test to confirm that the limiter behaves properly, and also ran performance tests against a local application to make sure the limit is obeyed. So far it seems to be working.
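
If the per-client schedulers become a problem, a shared pool variant could look roughly like this (a sketch – the extra factory parameter is not part of the class above):

// one shared pool for all limiters, sized to taste
ScheduledExecutorService sharedScheduler = Executors.newScheduledThreadPool(2);

// hypothetical overload of create(..) that reuses the shared scheduler
// instead of creating a new single-threaded one per client
SimpleRateLimiter limiter = SimpleRateLimiter.create(200, TimeUnit.HOURS, sharedScheduler);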

Are there alternatives? Well, yes – there are libraries like RateLimitJ that use Redis to implement rate-limiting. That would mean, however, that you need to set up and run Redis, which seems like overhead for "simply" having rate-limiting. (Note: it seems to also have an in-memory version.)

On the other hand, how would rate-limiting work properly in a cluster of application nodes? Do the application nodes need some database or gossip protocol to share data about the remaining per-client permits (requests)? Not necessarily. A very simple approach to this issue would be to assume that the load balancer distributes the load equally among your nodes. That way you would just have to set the limit on each node to the total limit divided by the number of nodes. It won't be exact, but you rarely need it to be – allowing 5-10 more requests won't kill your application, and allowing 5-10 fewer won't be dramatic for the users.

That, however, would mean that you have to know the number of application nodes. If you employ auto-scaling (e.g. in AWS), the number of nodes may change depending on the load. In that case, instead of configuring a hard-coded number of permits, the replenishing scheduled job can calculate the "maxPermits" on the fly, by calling an AWS (or other cloud-provider) API to obtain the number of nodes in the current auto-scaling group. That would still be simpler than supporting a Redis deployment just for that.
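
As a sketch of that calculation (the node-count lookup itself is omitted – it would be a call to the cloud provider's API):

// totalHourlyLimit is the limit for the whole cluster;
// nodeCount would be refreshed periodically, e.g. from the auto-scaling group API
int permitsPerNode = Math.max(1, totalHourlyLimit / nodeCount);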

Overall, I’m surprised there isn’t a “canonical” way to implement rate-limiting (in Java). Maybe the need for rate-limiting is not as common as it may seem. Or it’s implemented manually – by temporarily banning API clients that use “too much resources”.

Update: someone pointed out the bucket4j project, which seems nice and worth taking a look at.

Simple Spring Boot Admin Setup

July 4, 2017

Spring Boot Admin is a cool dashboard for monitoring your spring boot applications. However, setting it up is not that trivial. The documentation outlines two options:

  • Including a client library in your boot application that connects to the admin application – this requires having the admin application deployed somewhere public or at least reachable from your application, and also making your application aware that it is being monitored.
  • Using cloud discovery, which means your application is part of a service discovery infrastructure, e.g. using microservices

Neither is a very good option for simpler scenarios like a monolithic application running on some IaaS, with your admin application deployed either on a local machine or in some local company infrastructure. Cloud discovery is overkill if you don't already need it, and including a client library introduces the complexity of making the admin server reachable by your application, rather than vice-versa. And besides, this two-way dependency sounds wrong.

Fortunately, there is an undocumented, but implemented, SimpleDiscoveryClient that lets you simply run the Spring Boot Admin with some configuration on whatever machine and connect it to your spring boot application.

The first requirement is to have the spring boot actuator set up in your boot application. The actuator exposes all the endpoints needed for the admin application to work. It sounds trivial to set up – you just add a bunch of dependencies, possibly specify some config parameters, and that's it. In fact, in a real application it's not that easy – in particular regarding the basic authentication for the actuator endpoints. You need a separate spring-security configuration (in addition to your existing one) in order to apply basic auth only to the actuator endpoints. E.g.:

@Configuration
@Order(99) // the default security configuration has order 100
public class ActuatorSecurityConfiguration extends WebSecurityConfigurerAdapter {

    @Value("${security.user.name}")
    private String username;
    
    @Value("${security.user.password}")
    private String password;
    
    @Override
    protected void configure(HttpSecurity http) throws Exception {
        InMemoryUserDetailsManager manager = new InMemoryUserDetailsManager();
        manager.createUser(User.withUsername(username).password(password).roles("ACTUATOR","ADMIN").build());
        
        http.antMatcher("/manage/**").authorizeRequests().anyRequest().hasRole("ACTUATOR").and().httpBasic()
                .and().userDetailsService(manager);
    }
}

This is a bit counterintuitive, but it works. Not sure if it's idiomatic – with spring security and spring boot you never know what is idiomatic. Note – allegedly it should be possible to have the security.user.name (and password) automatically included in some manager, but I failed to find one, so I just instantiated an in-memory one. Note the /manage/** path – in order to have all the actuator endpoints under that path, you need to specify management.context-path=/manage in your application properties file.
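
So the relevant part of the monitored application's properties would look roughly like this (the values are placeholders):

# put all actuator endpoints under /manage
management.context-path=/manage
# credentials used by the actuator security configuration above
security.user.name=admin
security.user.password=<some-strong-password>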

Now that the actuator endpoints are set up, we have to attach our spring admin application. It looks like this:

@Configuration
@EnableAutoConfiguration
@PropertySource("classpath:/application.properties")
@EnableAdminServer
public class BootAdminApplication {
    public static void main(String[] args) {
        SpringApplication.run(BootAdminApplication.class, args);
    }

    @Autowired
    private ApplicationDiscoveryListener listener;
    
    @PostConstruct
    public void init() {
        // we have to fire this event in order to trigger the service registration
        InstanceRegisteredEvent<?> event = new InstanceRegisteredEvent<>("prod", null);
        // for some reason publishing doesn't work, so we invoke the listener directly
        listener.onInstanceRegistered(event);
    }
}

Normally, one should inject ApplicationEventPublisher and push the message there rather than directly invoking the listener as shown above. I didn’t manage to get it working easily, so I worked around that.

The application.properties file mentioned above should be in src/main/resources and looks like this:

spring.cloud.discovery.client.simple.instances.prod[0].uri=https://your-spring-boot-application-url.com
spring.cloud.discovery.client.simple.instances.prod[0].metadata.user.name=<basic-auth-username>
spring.cloud.discovery.client.simple.instances.prod[0].metadata.user.password=<basic-auth-password>
spring.boot.admin.discovery.converter.management-context-path=/manage
spring.boot.admin.discovery.services=*

What is that doing? It's using the SimpleDiscoveryClient that gets instantiated by the autoconfiguration. Actually, that client didn't work until the latest version – it threw a NullPointerException because the metadata (which holds the username and password) was always null. In version 1.2.2 of cloud-commons they fixed it:

<dependency>
	<groupId>org.springframework.cloud</groupId>
	<artifactId>spring-cloud-commons</artifactId>
	<version>1.2.2.RELEASE</version>
</dependency>

The simple discovery client is exactly that – you specify the URL of your boot application and it fetches the data from the actuator endpoints periodically. Why that isn't documented and why it didn't actually work until very recently – I have no idea. Also, I don't know why you have to manually send the event that triggers discovery. Maybe it's not idiomatic, but it doesn't happen automatically, and this made it work.

As usual with things that "just work" and have "simple setups" – it's never like that. If you have something slightly more complex than a hello world, you have to dig through some obscure classes and go "off-road". Luckily in this case, it actually works, rather than needing ugly workarounds.

Developers and Ethics

June 27, 2017

“What are some areas you are particularly interested in” – recruiters (head-hunters) tend to ask that question a lot. I don’t have a good answer for that – I’ll know it when I see it. But I have a list of areas that I wouldn’t like to work in. And one of them is gambling.

Several years ago I got a very lucrative offer from a gambling company, both well paid and technically challenging. But I rejected it, because I didn't want to contribute to abusing people's weaknesses for the sake of getting their money. And no, I'm not a raging Marxist, but gambling is bad. You may argue that it's a necessary vice and people need it to suppress other internal struggles, but I'm not buying that as a motivator.

I felt it’s unethical to write code that does that. Like I feel it’s unethical to profile users’ behaviours and “read” their emails in order to target ads, or to write bots to disseminate fake news.

A few months ago I was part of the campaign HQ for a party in a parliamentary election. Cambridge Analytica had already become popular after "delivering Brexit and Trump's victory", so using voters' data in order to target messages at them sounded like the new cool thing. As head of IT & data, I rejected this approach. Because it would be unethical to bait unsuspecting users into taking dumb tests in order to provide us with facebook tokens. Yes, we didn't have any money to hire Cambridge Analytica-like companies, but even if we had, does "outsourcing" the dubious practice change anything? If you pay someone to trick users into unknowingly giving up their personal data, it's as if you did it yourself.

This could be a very long post about technology and ethics. But it won't be, as this is a technical blog, not a philosophical one. It won't be about philosophy – for interesting takes on the matter you can listen to Damon Horowitz's TED talk or even go through all of Michael Sandel's Justice lectures at Harvard. It won't be about how companies should be ethical (e.g. following the ethical design manifesto).

Instead, it will be a short post focusing on developers and their ethical choices.

I think we have the freedom to be ethical – there's so much demand on the job market that rejecting an offer, refusing to do something, or leaving a company for ethical reasons is something we have the luxury to do without compromising our well-being. When asked to do something unethical, we can refuse (several years ago I was asked to take part in some shady interactions related to a potential future government contract, which I refused to do). When offered jobs that are slightly better paid but would have us build abusive technology, we can turn the offer down. When a new feature requires us to breach people's privacy, we can argue against it, and ultimately not do it.

But in order to start making these ethical choices, we have to start thinking about ethics. To put ourselves in context. We, developers, are building the world of tomorrow (it sounds grandiose, but we know it’s way more mundane than that). We are the “tools” with which future products will be shaped. And yes, that’s true even for the average back-office system of an insurance company (which allows for raising the insurance for pre-existing conditions), and true for boring banking software (which allows mortgages way beyond the actual coverage the bank has), and so on.

Are these decisions ours to make? Isn't it legislators that should define what's allowed and what isn't? We are just building whatever they tell us to build. Forgive me the far-fetched analogy, but Nazi Germany was an anti-humanity machine based on people who "just followed orders". Yes, if we refuse, someone else will come and do it, but collective ethics gets built over time.

As Hannah Arendt put it – "The sad truth is that most evil is done by people who never make up their minds to be good or evil." We may think that as developers we don't have a say. But without us, no software can be built. So with our individual ethical stance, a certain piece of unethical software may not get built or be successful, and that's a stance worth considering, especially when it costs us next to nothing.

Electronic Signature Using The WebCrypto API

June 11, 2017

Sometimes we need to let users sign something electronically. Often people understand that as placing your handwritten signature on the screen somehow. Depending on the jurisdiction, that may be fine, or it may not be sufficient to just store the image. In Europe, for example, there's Regulation 910/2014 which defines what electronic signatures are. As can be expected from a legal text, the definition is rather vague:

‘electronic signature’ means data in electronic form which is attached to or logically associated with other data in electronic form and which is used by the signatory to sign;

Yes, read it a few more times, say "wat" a few more times, and let's discuss what that means. It can mean basically anything. It is technically acceptable to just attach an image of the drawn signature (e.g. using an html canvas) to the data, and that may still count.

But when we get to the more specific types of electronic signature – the advanced and qualified electronic signatures, things get a little better:

An advanced electronic signature shall meet the following requirements:
(a) it is uniquely linked to the signatory;
(b) it is capable of identifying the signatory;
(c) it is created using electronic signature creation data that the signatory can, with a high level of confidence, use under his sole control; and
(d) it is linked to the data signed therewith in such a way that any subsequent change in the data is detectable.

That looks like a proper "digital signature" in the technical sense – e.g. using a private key to sign and a public key to verify the signature. The "qualified" signatures need to be issued by a qualified provider that is basically a trusted Certificate Authority. The keys for placing qualified signatures have to be issued on secure devices (smart cards and HSMs), so that nobody but the owner can have access to the private key.

But the legal distinction between advanced and qualified signatures isn't entirely clear – the Regulation explicitly states that non-qualified signatures also have legal value. Working with qualified signatures (with smartcards) in browsers is a horrifying user experience – in most cases it goes through a Java Applet, which nowadays works basically only on Internet Explorer and a special build of Firefox. Alternatives include desktop software and local-service JWS applications that handle the signing, but smartcards are a big issue and off-topic at the moment.

So, how do we allow users to “place” an electronic signature? I had an idea that this could be done entirely using the WebCrypto API that’s more or less supported in browsers these days. The idea is as follows:

  • Let the user type in a password for the purpose of signing
  • Derive a key from the password (e.g. using PBKDF2)
  • Sign the contents of the form that the user is submitting with the derived key
  • Store the signature alongside the rest of the form data
  • Optionally, store the derived key for verification purposes

Here’s a javascript gist with implementation of that flow.

Many of the pieces are taken from the very helpful webcrypto examples repo. The hex2buf, buf2hex and str2ab functions are utilities (that sadly are not standard in js).

What the code does is straightforward, even though it's a bit verbose. All the operations are chained using promises and "then", which, to be honest, is a bit tedious to write and read (but inevitable, I guess):

  • The password is loaded as a raw key (after transforming to an array buffer)
  • A secret key is derived using PBKDF2 (with 100 iterations)
  • The secret key is used to do an HMAC “signature” on the content filled in by the user
  • The signature and the key are stored (in the UI in this example)
  • Then the signature can be verified using: the data, the signature and the key
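
If you also wanted to verify such signatures on the server, the same derive-then-sign flow translates to Java roughly like this (a sketch using the standard JCA classes and the same 100 PBKDF2 iterations; salt handling and encoding are up to you):

public class PasswordBasedSignature {

    public static byte[] sign(char[] password, byte[] salt, String content) throws Exception {
        // derive a 256-bit secret key from the password using PBKDF2 (100 iterations, as in the example)
        SecretKeyFactory factory = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256");
        byte[] keyBytes = factory.generateSecret(new PBEKeySpec(password, salt, 100, 256)).getEncoded();

        // use the derived key to compute an HMAC "signature" over the submitted content
        Mac hmac = Mac.getInstance("HmacSHA256");
        hmac.init(new SecretKeySpec(keyBytes, "HmacSHA256"));
        return hmac.doFinal(content.getBytes(StandardCharsets.UTF_8));
    }
}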

You can test it here:

Having the signature stored should be enough to fulfill the definition of “electronic signature”. The fact that it’s a secret password known only to the user may even mean this is an “advanced electronic signature”. Storing the derived secret key is questionable – if you store it, it means you can “forge” signatures on behalf of the user. But not storing it means you can’t verify the signature – only the user can. Depending on the use-case, you can choose one or the other.

Now, I have to admit I tried deriving an asymmetric keypair from the password (both RSA and ECDSA). The WebCrypto API doesn’t allow that out of the box. So I tried “generating” the keys using deriveBits(), e.g. setting the “n” and “d” values for RSA, and the x, y and d values for ECDSA (which can be found here, after a bit of searching). But I failed – you can’t specify just any values as importKey parameters, and the constraints are not documented anywhere, except for the low-level algorithm details, and that was a bit out of the scope of my experiment.

The goal was that if we only derive the private key from the password, we can easily derive the public key from the private key (but not vice-versa) – then we store the public key for verification, and the private key remains really private, so that we can’t forge signatures.

I have to add a disclaimer here that I realize this isn’t very secure. To begin with, deriving a key from a password is questionable in many contexts. However, in this context (placing a signature), it’s fine.

As a side note – working with the WebCrypto API is tedious. Maybe because nobody has actually used it yet, so googling for errors basically gives you the source code of Chromium and nothing else. It feels like uncharted territory (although the documentation and examples are good enough to get you started).

Whether it will be useful to do electronic signatures this way, I don't know. I implemented it for a use-case where it actually made sense (signing a party membership declaration). Is it better than a hand-drawn signature on a canvas? I think it is (unless you derive the key from the image, in which case the handwritten one is better due to its higher entropy).

“Architect” Should Be a Role, Not a Position

May 31, 2017

What happens when a senior developer becomes…more senior? It often happens that they get promoted to “architect”. Sometimes an architect doesn’t have to have been a developer, if they see “the bigger picture”. In the end, there’s often a person that holds the position of “architect”; a person who makes decisions about the architecture of the system or systems being developed. In bigger companies there are “architect councils”, where the designated architects of each team gather and decide wise things…

But I think it’s a bad idea to have a position of “architect”. Architect is a position in construction – and it makes sense there, as you can’t change and tweak the architecture mid-project. But software architecture is flexible and should not be defined strictly upfront. And development and architecture are so intertwined, it doesn’t make much sense to have someone who “says what’s to be done” and others who “do it”. It creates all sorts of problems, mainly coming from the fact that the architect often doesn’t fully imagine how the implementation will play out. If the architect hasn’t written code for a long time, they tend to disregard “implementation details” and go for just the abstraction. However, abstractions leak all the time, and it’s rarely a workable solution to just think of the abstraction without the particular implementation.

That’s my first claim – you cannot be a good architect without knowing exactly how to write the whole code underneath. And no, too often it’s not “simple coding”. And if you have been an architect for years, and so you haven’t written code in years, you are almost certainly not a good architect.

Yes, okay, YOU in particular may be a good architect. Maybe in your particular company it makes sense to have someone sitting in an ivory tower of abstractions and mandate how the peons have to integrate this and implement that. But even in that case, there’s a better approach.

The architect should be a role. Every senior team member can and should take the role of an architect. It doesn’t have to be one person per team. In fact, it’s better to have more. To discuss architecture in meetings similar to the feature design meetings, with the clear idea that it will be you who is going to implement the whole thing. Any overdesign (of which architects are often guilty) will have to be justified in front of yourself – “do I want to write all this boilerplate stuff, or is there a simple and more elegant way”.

The position is "software engineer". The role can be "scrum master", "architect", "continuous integration officer", etc. If the company needs an "architects' council" to decide on "bigger picture" integrations between systems, the developers can nominate someone to attend these meetings – possibly the most knowledgeable one. (As many commenters put it – the architect is often a business analyst, a manager, or a tech lead most of the time. Basically "a senior developer with soft skills". And that's true – being an "architect" is just one of their roles, again.)

I know what the architects are thinking now – that there are high-level concerns that developers either don’t understand or shouldn’t be bothered with. Wrong. If your developers don’t understand the higher level architectural concerns, you have a problem that would manifest itself sooner or later. And yes, they should be bothered with the bigger picture, because in order to fit the code that you are writing into the bigger picture, you should be pretty familiar with it.

There’s another aspect, and that’s team member attitudes and interaction dynamics. If someone gets “promoted” to an “architect”, without being a particularly good or respected developer, that may damage the team morale. On the other hand, one promoted to “architect” can become too self-confident and push design decisions just because they think so, and despite good arguments against them.

So, ideally (and that’s my second claim) – get rid of the architect position. Make sure your senior team members are engaged in architectural discussions and decision making – they will be more motivated that way, and they will have a more clear picture of what their development efforts should achieve. And most importantly – the architectural decisions won’t be detached from the day-to-day development “real world”, nor will they be unnecessarily complicated.

Overview of Message Queues [slides]

May 23, 2017

Yesterday I gave a talk that went through all the aspects of using messages queues. I’ve previously written that “you probably don’t need a message queue” – now the conclusion is a bit more nuanced, but I still stand by the simplicity argument.

The talk goes through the various benefits and use cases of message queues, and discusses alternatives to the typical "message queue broker" architecture. The slides are available here.

One perhaps strange approach that I propose is to use distributed locks (e.g. with Hazelcast) for distributed batch processing – you lock on a particular id (possibly an organization id / client id, rather than an individual record id), thus allowing multiple processors to run in parallel without stepping on each other's toes (one node picks the first entry, the other one tries the first, fails, and picks the second one).
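
A sketch of what that might look like with Hazelcast (assuming the 3.x ILock API; the list of ids and the processing method are placeholders):

public class DistributedBatchProcessor {

    private final HazelcastInstance hazelcast = Hazelcast.newHazelcastInstance();

    public void processPending(List<Long> organizationIds) {
        for (Long organizationId : organizationIds) {
            // lock per organization id, not per record, so one node handles a whole batch
            ILock lock = hazelcast.getLock("batch-" + organizationId);
            if (lock.tryLock()) {
                try {
                    process(organizationId);
                } finally {
                    lock.unlock();
                }
            }
            // if tryLock() fails, another node is already processing this organization – skip it
        }
    }

    private void process(Long organizationId) {
        // placeholder for the actual batch-processing logic
    }
}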

Something I missed as a benefit of brokers like RabbitMQ is the available tooling – you can monitor and debug your queues pretty easily.

I wasn't able to focus in much detail on any of the concepts – e.g. how brokerless messaging works, how to deploy a broker (e.g. RabbitMQ) across multiple data centers (availability zones), or how akka (and akka cluster) fits in the "message queue" landscape. But I hope it's a good overview that gives everyone a clear picture of the options, which they can then analyze and assess in more detail.

And I’ll finish with the Dijkstra quote from the slides:

Simplicity is prerequisite for reliability

Event Logs

May 12, 2017

Most systems have some sort of event log – i.e. a record of what has happened in the system and who did it. And sometimes it has a dual existence – once as an "audit log", and once as an event log, which is used to replay what has happened.

These are actually two separate concepts:

  • the audit log is the trace that every action leaves in the system so that the system can later be audited. It’s preferable that this log is somehow secured (will discuss that another time)
  • the event log is a crucial part of the event-sourcing model, where the database only stores modifications, rather than the current state. The current state is obtained by applying all the stored modifications up to the present moment. This allows seeing the state of the data at any moment in the past.

There are a bunch of ways to get that functionality. For audit logs there is Hibernate Envers, which stores all modifications in a separate table. You can also have a custom solution using spring aspects or JPA listeners (that store modifications in an audit log table whenever a change happens). You can store the changes in multiple ways – as key/value rows (one row per modified field), or as objects serialized to JSON.
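
A minimal sketch of the JPA-listener approach (here the "audit log" is just a logger; in a real solution the callback would insert a row into an audit table):

public class AuditLogListener {

    private static final Logger auditLog = LoggerFactory.getLogger("AUDIT");

    @PostPersist
    @PostUpdate
    @PostRemove
    public void onChange(Object entity) {
        // in a real setup this would store key/value rows per modified field,
        // or the entity serialized to JSON, together with the current user and a timestamp
        auditLog.info("Entity {} changed: {}", entity.getClass().getSimpleName(), entity);
    }
}

The listener is then attached to an entity with @EntityListeners(AuditLogListener.class).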

Event-sourcing can be achieved by always inserting a new record instead of updating or deleting (and incrementing a version and/or setting a "current" flag). There are also some event-sourcing-native databases – Datomic and Event Store. (Note: "event-sourcing" isn't equal to "insert-only", but the approaches are very similar.)

They still seem pretty similar – both track what has happened on the system in a detailed way. So can an audit log be used as an event log, and can event logs be used as audit logs? And is there a difference?

Audit logs have specific features that you don't need for event sourcing – storing business actions and being secure. When a user logs in, or views a given item, that wouldn't issue an update (well, maybe last_login_time or last_seen can be updated, but those are side-effects). But you still may want to see details about the event in your audit log – who logged in, when, from what IP, after how many unsuccessful attempts. You may also want to know who read a given piece of information – if it's sensitive personal data, it's important to know who has seen it and whether the access wasn't some form of "stalking". In the audit log you can also have "business events" rather than individual database operations. For example, "checking out a basket" involves updating the basket status, decreasing the availability of the items and updating the purchase history – and you may want to just see "user X checked out basket Y with items Z at time W".

Another very important feature of audit logs is their integrity. You should somehow protect them from modification (e.g. by digitally signing them). Why? Because at some point they may become evidence in court (e.g. after some employee misuses your system), so the more secure and protected they are, the better evidence they make. This isn't needed for event-sourcing, obviously.

Event sourcing has other requirements – every modification of every field must be stored. You should also be able to take "snapshots" so that the "current state" isn't recalculated every time. The event log isn't just a listener or an aspect around your persistence layer – it's at the core of your data model, so in that sense it has more impact on the whole system design. Another thing event logs are useful for is consuming events from the system. For example, if you have an indexer module and you want to index changes as they come, you can "subscribe" the indexer to the events, and each insert would trigger an index operation. That is not limited to indexing, and can be used for synchronizing (parts of) the state with external systems (of which a search engine is just one example).

But given the many similarities – can't we have a common solution that does both? Not quite: an audit log is a poor event-sourcing tool per se, and event-sourcing models are not sufficient as audit logs.

Since event sourcing is a design choice, I think it should take the leading role – if you use event sourcing, then you can add some additional inserts for non-data-modifying business-level operations (login, view), and have your high-level actions (e.g. checkout) represented as events. You can also get a scheduled job to sign the entries (that may mean doing updates in an insert-only model, which would be weird, but still). And you will get a full-featured audit log with a little additional effort over the event sourcing mechanism.

The other way around is a bit tricky – you can still use the audit log to see the state of a piece of data in the past, but that doesn’t mean your data model will be insert-only. If you rely too much on the audit log as if it was event sourcing, maybe you need event sourcing in the first place. If it’s just occasional interest “what happened here two days ago”, then it’s fine having just the audit log.

The two concepts are overlapping so much, that it’s tempting to have both. But in conclusion, I’d say that:

  • you always need an audit log while you don’t necessarily need event sourcing.
  • if you have decided on using event-sourcing, walk the extra mile to get a proper audit log

But of course, as usual, it all depends on the particular system and its use-cases.

P.S. I’ve developed an audit trail service that guarantees the integrity of the logs – you can check it here.

Spring Boot, @EnableWebMvc And Common Use-Cases

April 21, 2017

It turns out that Spring Boot doesn’t mix well with the standard Spring MVC @EnableWebMvc. What happens when you add the annotation is that spring boot autoconfiguration is disabled.

The bad part (which cost me a few hours) is that this is not explicitly stated in any guide. This guide says that Spring Boot adds it automatically, but doesn't say what happens if you follow your previous experience and just add the annotation yourself.

In fact, people who run into issues stemming from this automatically disabled autoconfiguration try to address it in various ways. Most often – by keeping @EnableWebMvc, but also extending Spring Boot's WebMvcAutoConfiguration. Like here, here and somewhat here. I found these after I got the same idea and implemented it that way. Then I realized doing so is redundant, after going through Spring Boot's code and seeing that an inner class in the autoconfiguration class has a single-line javadoc stating:

Configuration equivalent to {@code @EnableWebMvc}.

That answered my question whether spring boot autoconfiguration misses some of the EnableWebMvc “features”. And it’s good that they extended the class that provides EnableWebMvc, rather than mirroring the functionality (which is obvious, I guess).

What should you do when you want to customize your beans? As usual, extend WebMvcConfigurerAdapter (annotate the new class with @Component) and do your customizations.
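
Something along these lines (a sketch – the resource handler is just an arbitrary example of a customization):

@Component
public class MvcCustomizations extends WebMvcConfigurerAdapter {

    @Override
    public void addResourceHandlers(ResourceHandlerRegistry registry) {
        // example customization – serve static resources from the classpath
        registry.addResourceHandler("/static/**").addResourceLocations("classpath:/static/");
    }

    @Override
    public void addInterceptors(InterceptorRegistry registry) {
        // register custom interceptors here
    }
}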

So, bottom line of the particular problem: don’t use @EnableWebMvc in spring boot, just include spring-web as a maven/gradle dependency and it will be autoconfigured.

The bigger outcome here is that I added a comment in the main configuration class detailing why @EnableWebMvc should not be put there. So the autoconfiguration magic saved me a lot of work, but I still added a line explaining why something isn't there.

And that’s because of the common use-cases – people are used to using @EnableWebMvc. So the most natural and common thing to do is to add it, especially if you don’t know how spring boot autoconfiguration works in detail. And they will keep doing it, and wasting a few hours before realizing they should remove it (or before extending a bunch of boot’s classes in order to achieve the same effect).

My suggestion in situations like this is: log a warning. And require explicitly disabling autoconfiguration in order to get rid of the warning. I had to turn on debug to see what gets autoconfigured, and then explore a bunch of classes to check the necessary conditions in order to figure out the situation.

And one of Spring Boot’s main use-cases is jar-packaged web applications. That’s what most tutorials are for, and that’s what it’s mostly used for, I guess. So there should be special treatment for this common use case – with additional logging and logged information helping people get through the maze of autoconfiguration.

I don't want to be seen as "lecturing" the Spring team, who have done an amazing job of providing good documentation and straightforward behaviour. But in this case, where multiple sub-projects "collide", it seems it could be improved.

Distributed Cache – Overview

April 8, 2017

What’s a distributed cache? A solution that is “deployed” in an application (typically a web application) and that makes sure data is loaded from memory, rather than from disk (which is much slower), in order to improve performance and response time.

That looks easy if the cache is to be used on a single machine – you just load your most active data from the database in memory (e.g. a Guava Cache instance), and serve it from there. It becomes a bit more complicated when this has to work in a cluster – e.g. 5 application nodes serving requests to users in a round-robin fashion.

You have to update the in-memory cache on all machines each time a piece of data is updated by a request to one of the machines. If you just load all the data in memory and don’t invalidate it, the cache won’t be “coherent” – it will have stale values and requests to different application nodes will have different results, which you most certainly want to avoid. Or you can have a single big cache server with tons of memory, but it can die – and that may disrupt the smooth operation, so you’d want to have at least 2 machines in a cluster.

You can get a distributed cache in different ways. To list a few: Infinispan (which I've covered previously), Terracotta/Ehcache, Hazelcast, Memcached, Redis, Cassandra, Elasticache (by Amazon). The first three are Java-specific (and JCache compliant), but the rest can be used in any setup. Cassandra wasn't initially meant to be a cache solution, but it can easily be used as such.

All of these have different configurations and different options, even different architectures. For example, you can run Ehcache with a central Terracotta server, or with peer-to-peer gossip protocol. The in-process approach is also applicable for Infinispan and Hazelcast. Alternatively, you can rely on a cloud-provided service, like Elasticache, or you can setup your own cluster of Memcached/Redis/Cassandra servers. Having the cache on the application nodes themselves is slightly faster than having a dedicated memory server (cluster), because of the network overhead.

The cache is structured around keys and values – there's a cached entry for each key. When you want to load something from the database, you first check whether the cache has an entry for that key (based on the ID of your database record, for example). If the key exists in the cache, you don't do a database query.

But how is the data “distributed”? It wouldn’t make sense to load all the data on all nodes at the same time, as the duplication is a waste of space and keeping the cache coherent would again be a problem. Most of the solutions rely on the so called “consistent hashing”. When you look for a particular key, its hash is calculated and (depending on the number of machines in the cache cluster), the cache solution knows exactly on which machine the corresponding value is located. You can read more details in the wikipedia article, but the approach works even when you add or remove nodes from the cache cluster, i.e. the cluster of machines that hold the cached data.

Then, when you do an update, you also update the cache entry, which the caching solution propagates in order to make the cache coherent. Note that if you do manual updates directly in the database, the cache will have stale data.

On the application level, you’d normally want to abstract the interactions with the cache. You’d end up with something like a generic “get” method that checks the cache and only then goes to the database. This is true for get-by-id queries, but caches can also be applied for other SELECT queries – you just use the query as a key, and the query result as value. For updates it would similarly update the database and the cache.
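
A sketch of such an abstraction, using a local Guava cache for brevity (the Product and ProductRepository types are placeholders; in a cluster the cache would be one of the distributed solutions above):

public class CachingProductDao {

    private final Cache<Long, Product> cache = CacheBuilder.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(30, TimeUnit.MINUTES)
            .build();

    private final ProductRepository repository;

    public CachingProductDao(ProductRepository repository) {
        this.repository = repository;
    }

    public Product get(Long id) throws ExecutionException {
        // check the cache first; go to the database only on a miss
        return cache.get(id, () -> repository.findById(id));
    }

    public void update(Product product) {
        // update the database and then the cache entry, to keep the cache coherent
        repository.save(product);
        cache.put(product.getId(), product);
    }
}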

Some frameworks offer these abstractions out of the box – ORMs like Hibernate have the so-called cache providers, so you just add a configuration option saying “I want to use (2nd level) cache” and your ORM operations automatically consult the configured cache provider before hitting the database. Spring has the @Cacheable annotation which can cache method invocations, using their parameters as keys. Other frameworks and languages usually have something like that as well.
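
With Spring that boils down to something like this (a sketch; the entity and repository are the same placeholders as above):

@Service
public class ProductService {

    @Autowired
    private ProductRepository repository;

    @Cacheable("products")
    public Product getProduct(Long id) {
        // executed only on a cache miss; the result is cached under the id
        return repository.findById(id);
    }

    @CachePut(value = "products", key = "#product.id")
    public Product updateProduct(Product product) {
        // updates the database and refreshes the cache entry
        return repository.save(product);
    }
}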

An important side-note – you may need to pre-load the cache. On a fresh startup of a cluster (e.g. after a new deployment in a blue-green deployment scheme/crash/network failure/etc.) the system may be slow to fill the cache. So you may have a batch job run on startup to fetch some pieces of data from the database and put them in the cache.

It all sounds easy, and that's good – the basic setup covers most cases. In reality it's a bit more complicated to tweak the cache configuration. Even when you manage to set up a cluster (which can be challenging in, say, AWS), your cache strategies may not be straightforward. You have to answer a couple of questions that may be important – how long do you want your cache entries to live? How big does your cache need to be? How should elements expire (the cache eviction strategy) – least recently used, least frequently used, first-in-first-out?

These options often sound arbitrary. You normally go with the defaults and don't bother. But you have to constantly keep an eye on the cache statistics, do performance tests, measure and tweak. Sometimes a single configuration option can have a big impact.

One important note here on the measuring side – you don’t necessarily want to cache everything. You can start that way, but it may turn out that for some types of entries you are actually better off without a cache. Normally these are the ones that are updated too often and read not that often, so constantly updating the cache with them is an overhead that doesn’t pay off. So – measure, tweak, repeat.

Distributed caches are an almost mandatory component of web applications. Yet, I've had numerous discussions about how a given system is very big and needs a lot of hardware, when all of its data would fit in my laptop's memory. So, much to my surprise, it turns out it's not yet a universally known concept. I hope I've given a short but sufficient overview of the concept and the various options.

Distributing Election Volunteers In Polling Stations

March 20, 2017

There’s an upcoming election in my country, and I’m a member of the governing body of one of the new parties. As we have a lot of focus on technology (and e-governance), our internal operations are also benefiting from some IT skills. The particular task at hand these days was to distribute a number of election day volunteers (that help observe the fair election process) to polling stations. And I think it’s an interesting technical task, so I’ll try to explain the process.

First – data sources. We have an online form for gathering volunteer requests. And second, we have local coordinators that collect volunteer declarations and send them centrally. Collecting all the data is problematic (to this moment), because filling the online form doesn’t make you eligible – you also have to mail a paper declaration to the central office (horrible bureaucracy).

Then there are the volunteer preferences – in the form they've indicated whether they are willing to travel or prefer their closest polling station. And then there are the "priority" polling stations, which are considered more risky and therefore need volunteers the most.

I decided to do the following:

  • Create a database table “volunteers” that holds all the data about all prospective volunteers
  • Import all data – using the apache CSV parser, parse the CSV files (converted from Google sheets) with 1) the online form submissions and 2) the data from the received paper declarations
  • Match the entries from the two sources by full name (as the declarations cannot contain an email, which would otherwise be the primary key)
  • Geocode the addresses of people
  • Import all polling stations and their addresses (public data by the central election commission)
  • Geocode the addresses of the polling stations
  • Find the closest polling station address for each volunteer

All of the steps are somewhat trivial, except the last part, but I'll still explain them in short. The CSV parsing and importing is straightforward. The only thing one has to be careful about is being able to insert additional records at a later date, because declarations are still being received as I'm writing.
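
The parsing itself, with Apache Commons CSV, looks roughly like this (the file name and column names are placeholders for the actual form fields):

try (Reader reader = Files.newBufferedReader(Paths.get("volunteers-form.csv"));
     CSVParser parser = CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(reader)) {

    for (CSVRecord record : parser) {
        String fullName = record.get("full_name");
        String address = record.get("address");
        boolean willingToTravel = "yes".equalsIgnoreCase(record.get("willing_to_travel"));
        // insert or update the volunteers table, keyed by full name
    }
}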

Geocoding is a bit trickier. I used OpenStreetMap initially, but it managed to find only a fraction of the addresses (which are not normalized – volunteers and officials are sometimes careless about the structure of the addresses). The OpenStreetMap API can be found here. It's basically calling http://nominatim.openstreetmap.org/search.php?q=address&format=json with the address. I tried cleaning up some of the addresses automatically, which led to a couple more successful geocodings, but not much.
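
Calling it from Java can be as simple as the following sketch (error handling and the actual JSON parsing, with whatever library you prefer, are omitted; address is the raw address string):

String url = "http://nominatim.openstreetmap.org/search.php?q="
        + URLEncoder.encode(address, "UTF-8") + "&format=json";

try (InputStream in = new URL(url).openStream()) {
    // the response is a JSON array; the first result contains "lat" and "lon" fields
    String json = new Scanner(in, "UTF-8").useDelimiter("\\A").next();
    // parse with a JSON library and store lat/lon for the volunteer or polling station
}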

The rest of the coordinates I obtained through Google maps. I extract all the non-geocoded addresses and their corresponding primary keys (for volunteers – the full name; for polling stations – the hash of a semi-normalized address), parse them with javascript, which then invokes the Google Maps API. Something like this:

<script type="text/javascript" src="jquery.csv.min.js"></script>
<script type="text/javascript">
	var idx = 1;
	function initMap() {
        var map = new google.maps.Map(document.getElementById('map'), {
          zoom: 8,
          center: {lat: -42.7339, lng: 25.4858}
        });
        var geocoder = new google.maps.Geocoder();

		$.get("geocode.csv", function(csv) {
			var stations = $.csv.toArrays(csv);
			for (var i = 1; i < stations.length; i ++) {
				setTimeout(function() {
					geocodeAddress(geocoder, map, stations[idx][1], stations[idx][0]);
					idx++;
				}, i * 2000);
			}
		});
      }

      function geocodeAddress(geocoder, resultsMap, address, label) {
        geocoder.geocode({'address': address}, function(results, status) {
          if (status === 'OK') {
            $("#out").append(results[0].geometry.location.lat() + "," + results[0].geometry.location.lng() + ",\"" + label.replace('"', '""').trim() + "\"<br />");
          } else {
            console.log('Geocode was not successful for the following reason: ' + status);
          }
        });
      }
</script>

This spits out CSV on the screen, which I then took and transformed with a regex replace (Notepad++) into update queries:

Find: (\d+\.\d+),(\d+\.\d+),(".+")
Replace: UPDATE addresses SET lat=$1, lon=$2 WHERE hash=$3

Now that I had most of the addresses geocoded, the distance searching had to begin. I used the query from this SO question to come up with this (My)SQL query:

SELECT MIN(distance), email, names, stationCode, calc.address FROM
(SELECT email, codePrefix, addresses.address, names, ( 3959 * acos( cos( radians(volunteers.lat) ) * cos( radians( addresses.lat ) )
   * cos( radians(addresses.lon) - radians(volunteers.lon)) + sin(radians(volunteers.lat))
   * sin( radians(addresses.lat)))) AS distance
 from (select address, hash, stationCode, city, lat, lon FROM addresses JOIN stations ON addresses.hash = stations.addressHash GROUP BY hash) as addresses
 JOIN volunteers WHERE addresses.lat IS NOT NULL AND volunteers.lat IS NOT NULL ORDER BY distance ASC) as calc
GROUP BY names;

This spits out the closest polling station to each of the volunteers. It is easily turned into an update query to set the polling station code to each of the volunteers in a designated field.

Then there are some manual amendments to be made, based on traveling preferences – if a person is willing to travel, we pick one of the "priority stations" and assign it to them. Since these are a small number, it's not worth automating.

Of course, in reality, due to data collection flaws, the above idealized example was accompanied by a lot of manual labour of checking paper declarations, annoying people on the phone multiple times and cleaning up the data, but in the end a sizable portion of the volunteers were distributed with the above mechanism.

Apart from being an interesting task, I think it shows that programming skills are useful for practically every task nowadays. If we had to do this manually (and even if we had multiple people with good excel skills), it would be a long and tedious process. So I’m quite in favour of everyone being taught to write code. They don’t have to end up being a developer, but the way programming helps non-trivial tasks is enormously beneficial.
