The Low Quality of Scientific Code
Recently I’ve been trying to get a bit into music theory, machine learning, and computational linguistics, so I ended up looking at libraries and tools written by the scientific community – examples include the Stanford CoreNLP library, GATE, Weka, jMusic, and several more.
The general feeling is that scientific libraries have mostly bad code. I will not point fingers, but there are too many freshman mistakes – not considering thread-safety, cryptic, ugly and/or stringly-typed APIs, lack of type-safety, poorly named variables and methods, choosing bad/slow serialization formats, writing debug messages to System.err (or out), lack of documentation, lack of tests.
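To make a couple of the mistakes above concrete, here is a minimal Java sketch. It is not taken from any of the libraries mentioned; the class `TypedApiSketch`, the `Stage` enum and both methods are hypothetical names invented for illustration. It contrasts a stringly-typed configuration call, where a typo is only noticed at run time, with an enum-based call that the compiler checks, and it sends the diagnostic through `java.util.logging` instead of a hard-coded `System.err`.

```java
import java.util.EnumSet;
import java.util.Locale;
import java.util.Set;
import java.util.logging.Logger;

public class TypedApiSketch {

    // A configurable logger instead of printing straight to System.err.
    private static final Logger LOG = Logger.getLogger(TypedApiSketch.class.getName());

    /** Hypothetical processing stages; the names are illustrative only. */
    enum Stage { TOKENIZE, SPLIT, TAG }

    /** Stringly-typed style: any string is accepted, typos surface only at run time. */
    static Set<Stage> parseStages(String spec) {
        Set<Stage> stages = EnumSet.noneOf(Stage.class);
        for (String name : spec.split(",")) {
            String trimmed = name.trim();
            try {
                stages.add(Stage.valueOf(trimmed.toUpperCase(Locale.ROOT)));
            } catch (IllegalArgumentException e) {
                // The misspelled stage is dropped; the caller may never notice.
                LOG.warning("Ignoring unknown stage: " + trimmed);
            }
        }
        return stages;
    }

    /** Type-safe style: a misspelled stage is a compile error, not a log line. */
    static Set<Stage> stages(Stage first, Stage... rest) {
        return EnumSet.of(first, rest);
    }

    public static void main(String[] args) {
        // "tagg" is a typo, so only TOKENIZE survives.
        System.out.println(parseStages("tokenize, tagg"));
        // Stage.TAGG would not compile here.
        System.out.println(stages(Stage.TOKENIZE, Stage.TAG));
    }
}
```

The enum version costs a few extra lines up front, but it moves a whole class of errors from the user’s run time to the author’s compile time.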
Thus using these libraries becomes time-consuming and error-prone. Every 10 minutes you see some horribly written code that you don’t have the time to fix. And it’s not just one or two things that you would report in a normal open-source project – it’s an overall low level of quality. On the other hand, these libraries have a lot of value, because the low-level algorithms would take even more time, and especially know-how, to implement, so just reusing them is obviously the right approach. Some libraries are even original research, so you just can’t write them yourself without spending 3 years on a PhD thesis.
I cannot but mention Heartbleed here – OpenSSL is written by scientific people, and much has been written on the topic that even OpenSSL does not meet modern software engineering standards.
But that’s only the surface. Scientists in general can’t write good code. They write code simply to achieve their immediate goal, and then either throw it away, or keep using it for themselves. They are not software engineers, and they don’t seem to be concerned with code quality, code coverage, or API design. Not to mention scientific infrastructure, deployment on multiple servers, and managing environments. These things are rarely done properly in the scientific community.
And that’s not only in computer science and related fields like computational linguistics – it’s everywhere, because every science now requires at least computer simulations. Biology, bioinformatics, astronomy, physics, chemistry, medicine, etc. – almost every scientist has to write code. And they aren’t good at it.
And that’s OK – we are software engineers and we dedicate our time and effort to these things; they are scientists, and they have vast knowledge in their domain. Scientists use programming the way software engineers use public transport – just as a means to get to what they have to do. And scientists should not be distracted from their domain by becoming software engineers.
But the problem is still there. Not only are there bad libraries, but the code scientists write may yield wrong results, run slowly, or crash regularly, which directly slows down or even invisibly hampers their work.
For the libraries, we software engineers can contribute, or companies using them can dedicate an engineer to improving the library. Refactor, cleanup, document, test. The authors of the libraries will be more than glad to have someone prettify their hairy code.
The other problem is tougher – science needs funding for dedicated software engineers, and research groups prefer to use that funding for actual scientists. And maybe that’s a better investment, maybe not. I can say for myself that I’d be glad to join a research team and help with the software part, while at the same time gaining knowledge in the field. And that would be fascinating, and way more exciting than writing boring business software. Unfortunately that doesn’t happen too often now (I tried once, a couple of years ago, and got rejected because I lacked formal education in biology).
Maybe software engineers can help in the world of science. But money is a factor.
Hi,
I wrote a post about this topic here:
http://forthescience.org/blog/2011/08/11/computational-chemistry-development-in-research/
It compares chemistry in the lab with chemistry on a computer. The main problem is that when a project starts, the point is to solve an immediate problem. Then the project grows beyond the “controlled environment” and escapes into the wild, where it is subjected to harassment from many sources. Patch after patch by students and Ph.D.s with no formal programming training brings it to the situation you describe. There’s no safe cure for this. Code sanity is not an evaluation criterion, and it becomes a little more of a concern when (if) the code goes commercial. Even then, most of the time it is included as is and used as a black box.
In informatics, what we do are demonstrators and prototypes, and certainly not code that is intended to go in production.
Software and libs exist so other people (researchers, engineers, and students) can study them and use them for their own works.
So we write code that is easy to read (so others can study it) and to modify (because of new ideas, improvements, etc.). Unless we are working in some specific domains, performance characteristics (memory, speed, reliability, etc.) are not an issue, or even a concern.
But if a professional developer is fool enough to use our code for a work in production, he can only blame himself: we are doing science, not commercial engineering.
Yes, money is an important factor, but it’s not all. Even if you were to invest effort into software engineering, you wouldn’t be thanked for it. What counts, at the end of the day, is publications, and it is exceedingly difficult to get publications for implementation projects unless you can give it a scientific twist.
I’d like to mention two additional hurdles. The first one is structural: scientific studies typically operate at the boundary of the unknown and try out novel ideas. Typically, that requires a kind of rapid prototyping approach that is quite incompatible with solid, specification-driven software engineering, and the resulting code is essentially a proof of concept, nothing more.
The second one is licensing. If you use Machine Learning for your work, you will require a training corpus, and almost all of them are under some form of restrictive license. If you are lucky, the license permits academic research use, but comparatively few licenses permit commercial use. I don’t want to go into the pros and cons here, but the consequence is clear: there is little transfer from academia into industry also due to the legal hurdles.
I was going to comment, but Sebastian above pretty much covered what I was going to say. I’ll just add one point. I’m a professional programmer, but I still do the bad things mentioned in the article, not because I don’t know how to write good code, but because first, as was said, there’s no payoff, and second, there’s never *time* to do things right. There’s always work waiting, so you write what will do the job, in the moment, then rush on to the next thing the lab scientists are demanding. It’s not pretty, but it’s what’s needed. Alas.
This is why as many software engineering principles as possible should be moved into the language. People select dependencies based upon an initial impression of what they do. Then the dependency must survive maintenance.
Sometimes I despair and imagine that people are confusing bad code with high-performance code. But people write really bad code when it’s actually easier to grok bad code than it is to write good code, which can be true when it’s small. As an example, in C, it is mostly a matter of wrong defaults (i.e., unsafe by default, intly typed by default, nullable by default, etc.).
The compiler’s job should be to purge the code of logical inconsistency (usually through a type system that functions as documentation), and to make it easier to write malleable code.
It should also be easier for a third party to refine the types of some black-box code to document the assumptions being made when using it (i.e., dependent typing). Most APIs have no well-defined *protocol* that spells out exactly which API usages are supported. Given such a protocol, it would be the very definition of what the API “means”.
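One rough way to picture such a protocol in a mainstream language: Java has no dependent types, so the sketch below only approximates the idea by giving each protocol state its own type. All names here (`ProtocolSketch`, `UntrainedMeanModel`, `TrainedMeanModel`) are hypothetical and the "model" is deliberately trivial. The point is only that the unsupported usage, predicting before training, does not type-check at all.

```java
import java.util.Arrays;
import java.util.List;

public class ProtocolSketch {

    /** State 1: no data seen yet. There is deliberately no predict() here. */
    static final class UntrainedMeanModel {
        /** The only legal step from this state: train, which yields the next state. */
        TrainedMeanModel train(List<Double> samples) {
            if (samples.isEmpty()) {
                throw new IllegalArgumentException("need at least one sample");
            }
            double sum = 0.0;
            for (double s : samples) {
                sum += s;
            }
            return new TrainedMeanModel(sum / samples.size());
        }
    }

    /** State 2: can only be obtained through train(), so predict() is always valid. */
    static final class TrainedMeanModel {
        private final double mean;

        private TrainedMeanModel(double mean) {
            this.mean = mean;
        }

        /** Toy prediction: always returns the mean of the training samples. */
        double predict() {
            return mean;
        }
    }

    public static void main(String[] args) {
        TrainedMeanModel model = new UntrainedMeanModel().train(Arrays.asList(1.0, 2.0, 3.0));
        System.out.println(model.predict());
        // new UntrainedMeanModel().predict(); // would not compile: no such method
    }
}
```

This encodes only the order of calls. Richer assumptions, such as "the input matrix is square", are where the dependent typing mentioned above would be needed.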
Gord and Sebastian covered a lot of good points, but you have missed the boat. This is not a problem of scientific code, but rather of research code in general. Classic CS research code is known for being horribly written for a reason: it is written with one goal, and as long as it achieves that, you’re golden.
Bioinformatics is an example of where there is a concentrated effort to make the code reusable. Thus, I have found it is among the better research-oriented software.
To address your paragraph on funding: this will never happen. Staff software engineers tend to try to be the enforcers in labs and add little real value to research. We no longer need one-dimensional researchers. In my opinion, people who want to focus only on one discipline (either a science or CS) are hindering progress because of their inability to communicate outside their niche.
As the lead developer on Axiom, a large computer algebra system, I completely agree with the author.
We are trying to raise the quality of published computational mathematics by shifting the emphasis from raw code to Knuth’s Literate Programming idea. We expect a professional programmer to communicate ideas to people through writing as well as communicate the actions to the machine.
We have also tried to promote journals that accept literate programs. This is especially important given the growing movement toward reproducible research. You wouldn’t publish a new theorem without the proof. How can you publish a new scientific fact backed by a program if you don’t publish the program? And, if you’re going to publish, communicate.
(BTW, if you think that LP is a fringe movement I will point out that a literate program won an Academy Award. http://lambda-the-ultimate.org/node/4876)
Probably the best solution would be to force scientists to read “Code Complete”.
Great read and wonderful discussion, many valid points. I’d like to add my 2c as a computational biology Ph.D.
1) Some of this is lack of training. Despite having a CS B.A., I never worked in a production environment and had no idea how to do things properly. A lot of the problem stems from lack of proper software engineering culture, though: labs, and principal investigators (PIs) in particular, neither know nor encourage correct practices. We had professional programmers come in and write bad code because the environment promoted it.
I needed two days writing code in a start-up to figure out a dozen things I’d been doing wrong. No code repository. No proper documentation. No communication between different programmers working on the same project. It was a mess.
2) No incentive to write good code. Scientists are judged by papers. When you submit the paper, you’re required to submit the code, but there is no quality control – other than other scientists coming in and trying to use it. When things break down they rarely make a noise. There are exceptions, of course; check out Lior Pachter’s blog:
http://liorpachter.wordpress.com/2014/02/12/why-i-read-the-network-nonsense-papers/
But that’s the exception rather than the rule.
Lack of lab culture + lack of outside QA = awful code.
I’m an undergraduate astrophysicist, but I originally wanted to be a computer scientist and as such have a modest background in computer science.
Recently, I’ve been dealing a lot with messy scientific code for my research project. In my opinion, this issue stems from the fact that curricula for science degrees have not evolved with the technological advances of scientific research. Most major universities don’t require any programming or CS classes to graduate with a science degree, and most that do require it only require an introductory class (e.g. intro to Python). As a result, the only coding experience that most science students get in undergrad is maybe fixing up or slightly modifying some code that was left behind by someone else.
I have worked with graduate students who were months away from getting their PhDs and needed me to explain to them how objects worked. When my group recently adopted and started modifying a piece of software written in C (~15k lines), it took months to get everyone up to speed with the code because most of the grad students on our project didn’t know what a pointer was. The fact is that it is very possible (and extremely common) to get an advanced degree in a scientific field with a coding ability limited to that of a freshman CS major.
As more and more scientific fields begin to rely more and more on computer modeling, this lack of preparation is a major hindrance for scientific progress. Most projects that I’ve seen (not just in my group, but in others at my university as well) spend more time trying to figure out basic CS concepts than actually doing science.
Somehow, we need to get the word out to colleges and universities to start requiring more CS classes as prerequisites to graduating with science degrees. It will make scientific code actually useful to non-scientists, and it will help scientists do their jobs more effectively.
I disagree that scientists do not know how to write high quality code. Science has very different goals to making a reliable software product to sell. Scientists are optimizing for their particular situation, which is different to yours. I have done both science and software engineering, and I know exactly why they do not do good software engineering. It is not important to their goals. Science is about publishing new ideas and not writing software. However, this is slowly changing because so much of science relies on data analysis and other computational methods. I like writing software, so I went back to industry.
There are however plenty of things done properly in Science. Take a look at all the cool stuff coming out of LANL.
What I was trying to say is that many scientists know how to write good code but they choose not to.
> Refactor, cleanup, document, test. The authors of the
> libraries will be more than glad to have someone prettify
> their hairy code.
Sorry, no. We’re not computer janitors.
It is no panacea out in software engineering land either, http://stilldrinking.org/.
Also, if you look at bugtraq from the 1990s it is full of buffer overflow problems in Solaris.
It is a well-known issue of communication between scientists and software engineers, since scientists stay in their domain and use their own terminology. Moreover, they generally do not care too much about the target platform, possible optimizations of their know-how code, or writing clear documentation. On the other hand, software engineers do not care about low-level scientific code (as the author said, otherwise they would probably need to spend 3 years on a PhD). Since this is a problem of efficient communication, I personally think that scientists and software engineers should clearly understand each other’s needs at an early stage of the development process.
It is unfair to complain about the quality of scientific software without also noticing that software engineering is a complex undertaking. Poor quality is not limited to scientific computing, and money alone won’t solve the problems: witness the Obamacare rollout debacle. In fact, scientific computing, like other areas of computing, has its own range of quality, from the very well-written (like the software used to find the Higgs boson) to the “back-of-the-envelope” type. The important thing to realize is that all such software did what it was supposed to do, rather than compete in an idealized “beauty contest” of software engineering. On what basis can we expect every piece of software to be a beauty?
Human knowledge has grown exponentially since the digital revolution, and scientists and software engineers have to apportion their own time to do what they do best. If Bozho wants to learn something to advance a new project goal, Bozho has to live with the status quo rather than complaining about it. We all do.
I recommend reading The Daily WTF for a look at commercial code quality.
It’s not really better.
They just don’t share it.
A LOT of crappy code is out there, because too few people contribute improvements and cleanups. Everybody wants to do something new, not clean up someone else’s mess. Unless forced to do so by external pressure – i.e. money.
Instead of supporting this with having paid positions just to assist software development in research, we do the opposite: we require researchers to constantly write grant requests to obtain future funding. So they have even less time to focus on what they are good at: research.
No wonder they don’t have time to clean up their code either. They have to write grants to survive.
First, I agree with the general sentiment of your post, that the body of software generated by those in the scientific community is generally of poor quality when held up to reasonable standards. But I would also posit that this is true in every field. What I take exception to is your statement that scientists *can’t* write good code, and your assumption that scientists should have to write code that holds up to the standards of industrial software engineering.
Until very recently, programming has been formally taught only to those going into the fields of software engineering and (in some cases) computer science. The balance of people who program (and who vastly outnumber professional software engineers and computer scientists) have had to learn on their own. Just as I can produce a competent meal but in no way belong in the kitchen of a restaurant, most programmers can do what they need to do to satisfy their goals (which are almost never to produce high-quality, usable software) but wouldn’t last a day in a professional software development shop. And that’s fine, because in most cases a program will only be used by its creator.
The problem comes when we (scientists) make our programs publicly available. This happens for at least one of two reasons: 1) we have to (due to a requirement by a journal or granting institution) and/or 2) we want to, either out of altruism or because we hope that the community will contribute to improving our software. It’s important that both creators and users of scientific software know what they’re getting into when they try to use academic software. You should generally not even try to use software that was published only by demand, as it will most likely be difficult or impossible to use without expert knowledge. The programmer had no incentive to make the program useable, and may resent that he had to publish it in the first place. On the other hand, I would argue that someone making their software public by choice should have a much higher burden in terms of writing well documented, easy to use, efficient software, either by becoming formally trained or by hiring a professional software engineer. Someone who produces crappy software and encourages others to use it is just making more work for everyone. I would rather use the second-best algorithm available if it came packaged as good software rather than waste days trying to figure out how to use the best algorithm only to find out that the programmer hard-coded a limit on the number of input records just because his computer only had 4GB of memory.
Fortunately, things are improving. First of all, there are people like me and my colleagues, who were formally trained in computer science (and maybe even worked as industrial software engineers, as I did) before getting their PhD in a scientific discipline, such as bioinformatics. At least in genetics (my field), many of the widely used programs are developed by people like me, and there are grants available to fund their long-term maintenance and improvement. If you want an example of an excellent scientist who is also an excellent programmer, check out Heng Li (http://lh3lh3.users.sourceforge.net/).
Second, educators are starting to recognize the importance of teaching programming to people of all disciplines. All students entering the PhD program from which I graduated were strongly encouraged to take at least one programming class, and some university departments are beginning to require that all incoming undergraduates learn to program. While the quality of programming education undoubtedly varies, I’m hopeful that the next step will be for educators to recognize the importance of teaching software engineering principles as well.
Third, funding agencies are coming around to the idea that software quality is an important consideration, and are willing to fund grants that include a software engineer as a line-item in the budget. Unfortunately, in the current climate (at least in the US) funding levels are stagnant or decreasing and payouts are being reduced across the board, but I think it’s important that the problem has moved from being an institutional one to a political one.
Anyway, if you want things to get better, 1) vote for people that will increase funding for science, 2) help teach current and future scientists better programming practices, and/or 3) consider getting your PhD in a scientific discipline and try to change things from the inside!
I’m lucky enough to work in a scientific research group that has 4 software engineers out of a group of 20. They’re very hard to get because funders don’t want to pay for software engineers – they say, “why don’t you hire a researcher to do that?” So we have to fund them ourselves out of the money left over from previous projects, income from running training courses, or commercial consulting projects we do.
I was an advisor to a multi-million pound research project between three universities. They wanted to do something that relied on creating mobile phone apps. But while they had something like 12 full-time researchers, they only had one software engineer. So there was no way the software engineer could keep up, and the researchers had to do other things with their time.