Improving Zip unpacking performance

Spoilers: You Probably Don’t Want To Use URLSession.bytes(from:delegate:)

Back when I started writing my on-the-fly zip unpacker, the app was already nicely asynchronous and I wanted the zip extractor to be nice and async too. The data comes from a URLSession download, and the only option for getting at it in an async/await context is URLSession.bytes(from:delegate:), so that's what I went with.

I knew it would be slow because it switches async contexts for every single byte (though the discussion on the Swift forums about it possibly being slow suggested nobody was too worried about it being much slower), so I wrote a fairly naive unpacker on top of it. On my 2.6GB test zip it took about 700 seconds with the CPU pegged at 100% the whole time. Yeah, it's slow, but it worked, so I moved on to other things.
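For a sense of what that looked like, the consuming loop was roughly this shape (a sketch, not my actual code; unpacker.consume(_:) stands in for the per-byte state machine):

let (bytes, _) = try await URLSession.shared.bytes(from: url)
for try await byte in bytes {
    // Every iteration goes through the AsyncSequence machinery
    // just to deliver a single UInt8.
    unpacker.consume(byte)
}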

Over the last week I’ve wanted to make it faster. I profiled it in Instruments and a lot of the time was spent searching for header information in the unpacked data.

Quick aside about the zip format: the way it works is that there's a header to say what type the next chunk of data is, and if it's a file there's a second header giving the name, the compression method and the data length, followed by the file data itself. The thing is, sometimes that data length is 0, and in that case an extra chunk AFTER the file data (the data descriptor) records how long the data actually was, and at the end of the file there's a central directory with the offsets and sizes for every entry throughout the file. I guess it's a way to do things, but it's not very useful if the goal is to unpack the file as it's being downloaded.
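For reference, the interesting local file header fields live at fixed offsets after its 4-byte signature. A minimal sketch of pulling them out of a buffer (the readLE helpers and the header array are mine; the offsets come from the zip spec):

func readLE16(_ bytes: [UInt8], at offset: Int) -> UInt16 {
    UInt16(bytes[offset]) | UInt16(bytes[offset + 1]) << 8
}

func readLE32(_ bytes: [UInt8], at offset: Int) -> UInt32 {
    UInt32(readLE16(bytes, at: offset)) | UInt32(readLE16(bytes, at: offset + 2)) << 16
}

// Local file header (signature 0x04034b50 at offset 0):
let flags          = readLE16(header, at: 6)   // bit 3 set means the sizes are deferred
let method         = readLE16(header, at: 8)   // 0 = stored, 8 = deflate
let compressedSize = readLE32(header, at: 18)  // 0 when bit 3 of flags is set
let nameLength     = readLE16(header, at: 26)
let extraLength    = readLE16(header, at: 28)
// The file name, extra field and then the data itself follow at offset 30.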

With this in mind, my first attempt ignored the data length and scanned the file data for the 4-byte signature that marks the start of the data descriptor, using a nice ring buffer and stuff. It was pretty neat, but slow.
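The scan itself boils down to rolling the last four bytes through a UInt32 and comparing against the data descriptor signature, 0x08074b50 when read little-endian. Something like:

var window: UInt32 = 0
var position = 0

for byte in fileBytes {  // fileBytes: the entry's data, name hypothetical
    // The newest byte shifts into the top; once four bytes have passed through,
    // window holds the little-endian value of the last four bytes seen.
    window = (window >> 8) | (UInt32(byte) << 24)
    if position >= 3 && window == 0x08074b50 {
        // The descriptor starts at position - 3, so the file data
        // ended at position - 4.
        break
    }
    position += 1
}

One caveat with this approach: those four bytes can also occur by chance inside the compressed data, so a real implementation needs some validation around a match.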

So, to avoid all that searching, I implemented a fast path for when the zip file did contain the data length in the file header before the data. That tells me exactly how many bytes to copy out, and it was all good. Timing it again came out at ~300 seconds. Twice as fast, nice.
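The fast path then reduces to simple byte accounting; a sketch, with reader and inflater standing in for my real types:

var remaining = Int(compressedSize)
while remaining > 0 {
    // Hand exactly compressedSize bytes to the decompressor, no scanning.
    let chunk = try await reader.read(upTo: remaining)  // hypothetical API
    inflater.feed(chunk)
    remaining -= chunk.count
}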

But there were still big chunks of time that Instruments couldn't really explain, and I got the feeling they might be coming from using URLSession.AsyncBytes as my AsyncSequence.

I wrote a small wrapper that turns the URLSession.dataTask(with:) delegate callbacks into an AsyncThrowingStream<Data, Error> that just yields the buffers of data as they are downloaded.

import Foundation

public final class AsyncDownloader: NSObject {
    typealias ThrowingContinuation = AsyncThrowingStream<Data, any Error>.Continuation

    private lazy var session: URLSession = {
        let configuration = URLSessionConfiguration.default
        configuration.waitsForConnectivity = true
        return URLSession(configuration: configuration,
                          delegate: self,
                          delegateQueue: nil)
    }()

    // Guards taskToContinuation, which is touched from both the caller's
    // context and the session's delegate queue.
    private let lock = NSLock()
    private var taskToContinuation: [URLSessionDataTask: ThrowingContinuation] = [:]

    public func buffer(from url: URL) -> AsyncThrowingStream<Data, Error> {
        AsyncThrowingStream<Data, Error> { continuation in
            let dataTask = session.dataTask(with: url)

            lock.lock()
            taskToContinuation[dataTask] = continuation
            lock.unlock()

            dataTask.resume()
        }
    }
}

extension AsyncDownloader: URLSessionDataDelegate {
    // Yield each buffer into the stream as it arrives off the network.
    public func urlSession(_ session: URLSession,
                           dataTask: URLSessionDataTask,
                           didReceive data: Data) {
        lock.lock()
        let continuation = taskToContinuation[dataTask]
        lock.unlock()

        continuation?.yield(data)
    }

    public func urlSession(_ session: URLSession,
                           task: URLSessionTask,
                           didCompleteWithError error: Error?) {
        guard let dataTask = task as? URLSessionDataTask else {
            fatalError("Unknown task in session")
        }

        // Finish the stream and drop the continuation so it isn't
        // retained after the download completes.
        lock.lock()
        let continuation = taskToContinuation.removeValue(forKey: dataTask)
        lock.unlock()

        if let error {
            continuation?.finish(throwing: error)
        } else {
            continuation?.finish()
        }
    }
}
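Consuming it looks something like this (unpack(_:) stands in for whatever processes each buffer):

let downloader = AsyncDownloader()
for try await buffer in downloader.buffer(from: zipURL) {
    try unpack(buffer)
}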

This meant the unpacker had to be rewritten to process buffers instead of individual bytes, which, honestly, simplified the code enormously. I only rewrote the fast path, because it turns out that all of the zip files I care about contain the data length in the initial file header anyway, but for completeness I'll probably port the slow path too at some point.

Ok, so, how much faster was the URLSession.dataTask(with:) wrapper than URLSession.bytes(from:delegate:)?

It unpacked the whole 2.6GB file and wrote it out to disk in 1004ms. That's roughly 700 times faster.

URLSession.bytes(from:delegate:) is not just slow. It is incredibly slow. You probably shouldn't use it (or the related byte-at-a-time sequences on URL and FileHandle, either).