Writing Big JSON Files With Jackson

Sometimes you need to export a lot of data to a JSON file. Maybe it’s an “export all data to JSON” feature, or the GDPR “right to data portability”, which effectively requires the same thing.

And as with any big dataset, you can’t just load it all into memory and write it out. The export takes a while, it reads a lot of entries from the database, and you need to be careful that it neither overloads the entire system nor runs out of memory.

Luckily, it’s fairly straightforward to do that with the help of Jackson’s SequenceWriter and, optionally, piped streams. Here’s what it looks like:

    private ObjectMapper jsonMapper = new ObjectMapper();
    private ExecutorService executorService = Executors.newFixedThreadPool(5);

    @Async
    public ListenableFuture<Boolean> export(UUID customerId) {
        try (PipedInputStream in = new PipedInputStream();
                PipedOutputStream pipedOut = new PipedOutputStream(in);
                GZIPOutputStream out = new GZIPOutputStream(pipedOut)) {
        
            Stopwatch stopwatch = Stopwatch.createStarted();

            ObjectWriter writer = jsonMapper.writer().withDefaultPrettyPrinter();

            try (SequenceWriter sequenceWriter = writer.writeValues(out)) {
                sequenceWriter.init(true);
            
                Future<?> storageFuture = executorService.submit(() ->
                       storageProvider.storeFile(getFilePath(customerId), in));

                int batchCounter = 0;
                while (true) {
                    List<Record> batch = readDatabaseBatch(batchCounter++);
                    if (batch.isEmpty()) {
                        // if there are no more batches, stop.
                        break;
                    }
                    for (Record record : batch) {
                        sequenceWriter.write(record);
                    }
                }

                // wait for storing to complete
                storageFuture.get();

                // send the customer a notification and a download link
                notifyCustomer(customerId);
            }  

            logger.info("Exporting took {} seconds", stopwatch.stop().elapsed(TimeUnit.SECONDS));

            return AsyncResult.forValue(true);
        } catch (Exception ex) {
            logger.error("Failed to export data", ex);
            return AsyncResult.forValue(false);
        }
    }

The code does a few things:

  • Uses a SequenceWriter to continuously write records. It is initialized with an OutputStream, to which everything is written. This could be a simple FileOutputStream (see the first sketch after this list), or a piped stream as discussed below. Note that the naming here is a bit misleading – writeValues(out) sounds like you are instructing the writer to write something now; instead it configures it to use the particular stream later.
  • The SequenceWriter is initialized with true, which means “wrap in array”. You are writing many records of the same type, so they should form an array in the final JSON.
  • Uses PipedOutputStream and PipedInputStream to link the SequenceWriter to an InputStream which is then passed to a storage service. If we were explicitly working with files, there would be no need for that – simply passing a FileOutputStream would do. However, you may want to store the file elsewhere, e.g. in Amazon S3, where the putObject call requires an InputStream from which to read data and store it in S3 (a possible implementation is sketched after this list). So, in effect, you are writing to an OutputStream, the data becomes available through the connected InputStream, and whoever reads that InputStream streams it onwards to its own destination.
  • Storing the file is invoked in a separate thread, so that writing to the file does not block the current thread, whose purpose is to read from the database. Again, this would not be needed if a simple FileOutputStream was used.
  • The whole method is marked as @Async (Spring), so that it doesn’t block execution – it gets invoked and finishes when ready, using an internal Spring executor service with a limited thread pool (an example configuration is sketched after this list).
  • The database batch reading code is not shown here, as it varies depending on the database. The point is, you should fetch your data in batches rather than SELECT * FROM X (one possible approach is sketched after this list).
  • The OutputStream is wrapped in a GZIPOutputStream, as text files like JSON with repetitive elements benefit significantly from compression.
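
For comparison, here is roughly what the simple-file variant mentioned above might look like – no piped streams and no extra thread, just a SequenceWriter over a GZIPOutputStream wrapping a FileOutputStream (the Record type and readDatabaseBatch(..) are the same placeholders as in the main example):

    ObjectWriter writer = jsonMapper.writer().withDefaultPrettyPrinter();

    try (OutputStream out = new GZIPOutputStream(new FileOutputStream("export.json.gz"));
            SequenceWriter sequenceWriter = writer.writeValues(out)) {
        // "true" wraps all written values in a single JSON array
        sequenceWriter.init(true);

        int batchCounter = 0;
        while (true) {
            List<Record> batch = readDatabaseBatch(batchCounter++);
            if (batch.isEmpty()) {
                break;
            }
            for (Record record : batch) {
                sequenceWriter.write(record);
            }
        }
    }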
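
For the S3 case, a storageProvider.storeFile(..) implementation might look roughly like this, assuming the AWS SDK (v1) and a hypothetical bucket name. Note that putObject without a known content length may buffer the stream internally, so for very large exports a multipart upload could be preferable:

    public void storeFile(String path, InputStream in) {
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentType("application/gzip");
        // keeps reading from the piped InputStream until the writing side closes it
        amazonS3.putObject("exports-bucket", path, in, metadata);
    }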
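
The @Async setup is also not shown above; a minimal sketch of what such a configuration might look like (pool sizes and names are made up, tune them to how many concurrent exports you can afford):

    @Configuration
    @EnableAsync
    public class AsyncConfig {

        @Bean
        public TaskExecutor exportTaskExecutor() {
            ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
            executor.setCorePoolSize(2);
            executor.setMaxPoolSize(2);
            executor.setQueueCapacity(100);
            executor.setThreadNamePrefix("export-");
            return executor;
        }
    }

With a dedicated executor like this you can annotate the export method as @Async("exportTaskExecutor"), so that long-running exports don’t starve other async tasks.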
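
Finally, the batch reading itself could be as simple as offset-based pagination. A sketch with Spring’s JdbcTemplate (the table, columns and Record constructor are assumptions; for very deep offsets, keyset pagination with WHERE id > lastSeenId scales better):

    private static final int BATCH_SIZE = 1000;

    private List<Record> readDatabaseBatch(int batchNumber) {
        // fetch one page at a time instead of SELECT * FROM records
        return jdbcTemplate.query(
                "SELECT id, payload FROM records ORDER BY id LIMIT ? OFFSET ?",
                (rs, rowNum) -> new Record(rs.getLong("id"), rs.getString("payload")),
                BATCH_SIZE, (long) batchNumber * BATCH_SIZE);
    }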

The main work is done by Jackson’s SequenceWriter, and the (kind of obvious) point to take home is – don’t assume your data will fit in memory. It almost never does, so do everything in batches and incremental writes.

3 thoughts on “Writing Big JSON Files With Jackson”

  1. @Async means the caller does not wait for this worker thread. If this is an HTTP request, the client cannot get the return value, and the user gets no response if the worker thread fails. Can you suggest some way to resolve this problem? (sorry for my bad English)

  2. Yes, I added one line to the code – notifyCustomer(..).
    The HTTP request is not expected to wait for the result, as the export may take minutes to generate. So, when it is ready, send the customer a notification that the file is available, along with a download link.
