Article summary
I’ve been working on a Java project that extracts files from Zip archives in several places. We’re making heavy use of the Java Stream API in this codebase, and I wanted to deal with the contents of the Zip archives in the same way. (That is, mapping the entries, filtering the entries, processing entries in parallel, etc.)
The code I’m working with is handed an InputStream that supplies the bytes of the Zip archive. That means I don’t know if the contents are coming from a local file or from somewhere else (S3 bucket, database, etc.). I’m also only concerned with pulling out files and can ignore entries that represent a directory.
ExtractedFile record
To provide full separation from all the Java Zip APIs, I want the resulting Stream to yield records similar to the ZipEntry class. However, they should only contain the filename and the contents of the file (the bytes).
public record ExtractedFile(String filename, byte[] bytes) {}
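One thing to keep in mind with a record that holds a byte[]: the generated equals() compares the array by reference, not by content. The sketch below illustrates this; the sameContent helper and the ExtractedFileDemo class are hypothetical additions for illustration and are not part of the record used in the rest of this post.

```java
import java.util.Arrays;

record ExtractedFile(String filename, byte[] bytes) {
    // Hypothetical helper (not in the original record): compares the bytes by content
    boolean sameContent(ExtractedFile other) {
        return filename.equals(other.filename) && Arrays.equals(bytes, other.bytes);
    }
}

class ExtractedFileDemo {
    public static void main(String[] args) {
        var a = new ExtractedFile("a.txt", new byte[] {1, 2, 3});
        var b = new ExtractedFile("a.txt", new byte[] {1, 2, 3});
        // The generated equals() compares the two arrays by reference
        System.out.println(a.equals(b));      // false
        // The helper compares the array contents instead
        System.out.println(a.sameContent(b)); // true
    }
}
```

This only matters if extracted files end up being compared or used as map keys; for simply processing each entry, the plain record is enough.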
Using ZipFile
One solution would be to save the contents of the InputStream to a temp file, use the built-in ZipFile class to read/stream the entries, and then delete the temp file when we’re done. The post “Listing a ZIP file contents with Stream API in Java 8” shows a simple example of how the .stream() method on ZipFile can be used.
Building on that, here’s a full implementation that filters out directories, reads each entry into a byte array, and yields an ExtractedFile record for each entry:
public static Stream<ExtractedFile> extractFiles(InputStream archiveStream) throws IOException {
    // Spool the archive to a temp file, because ZipFile can only read from a file on disk
    Path tempFile = Files.createTempFile("temp", ".zip");
    try (archiveStream) {
        Files.copy(archiveStream, tempFile, java.nio.file.StandardCopyOption.REPLACE_EXISTING);
    }
    var zipFile = new ZipFile(tempFile.toFile());
    return zipFile
        .stream()
        .filter(entry -> !entry.isDirectory())
        .map(entry -> {
            // Read the whole entry into memory
            try (var zipEntryInputStream = zipFile.getInputStream(entry)) {
                return new ExtractedFile(
                    entry.getName(),
                    zipEntryInputStream.readAllBytes());
            } catch (IOException e) {
                throw new RuntimeException("Failed to extract entry", e);
            }
        })
        // Runs only when the caller closes the returned stream
        .onClose(() -> {
            try {
                zipFile.close();
                Files.delete(tempFile);
            } catch (IOException e) {
                logger.warn("Error closing zip archive", e);
            }
        });
}
And an example usage might look like:
// Use a try-with-resources to ensure the stream gets closed
try (var entryStream = ZipUtils.extractFiles(archiveStream)) {
    entryStream
        .filter(entry -> entry.filename().endsWith(".txt"))
        .parallel()
        .forEach(entry -> {
            // do something with the text file
        });
}
Things to note about this implementation:
- The contents of the zip archive are first written to a temp file, which ZipFile then reads back from disk. The temp file is deleted when the stream is closed.
- It’s parallelizable: a .parallel() can be tacked onto the resulting stream without causing problems.
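Because the cleanup is tied to closing the stream, it helps to see when an onClose handler actually fires. The following is a minimal sketch using a plain Stream.of source rather than a zip archive; the OnCloseDemo class and its run method are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

class OnCloseDemo {
    // Records the order of events so the timing of the close handler is visible
    static List<String> run() {
        var events = new ArrayList<String>();
        Stream<String> stream = Stream.of("a", "b").onClose(() -> events.add("closed"));
        try (stream) {
            stream.forEach(events::add);
            // The terminal operation has finished, but the handler has not run yet
            events.add("terminal op done");
        }
        // Leaving the try-with-resources block called close(), which ran the handler
        return events;
    }

    public static void main(String[] args) {
        System.out.println(run()); // [a, b, terminal op done, closed]
    }
}
```

A terminal operation such as forEach does not close the stream on its own, which is why the usage example wraps the stream in a try-with-resources. Without it, the temp file would be left behind.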
Using ZipInputStream
The ZipFile-based implementation above works, but it adds the overhead of writing the contents of the input stream to a temp file, reading it back in, and then deleting it. In most situations, this is probably acceptable. But I wanted to see what it would look like without the temp file.
This next implementation relies on the ZipInputStream class, which can read the contents of a zip archive from an InputStream without needing to write it to a file first. ZipInputStream is a bit more low-level than ZipFile, so it takes a bit more work to get it to do what we want.
I also don’t want to read the entire contents of the ZIP archive into memory right away. Rather, I want to read in each entry lazily as the next one is requested from the stream. One way to do this is by using a Spliterator implementation to read the entries in one at a time.
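To make the laziness concrete before getting to zip archives, here is a minimal sketch of the same Spliterator pattern over an ordinary Iterator. The LazySpliteratorDemo class, its run method, and the log messages are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class LazySpliteratorDemo {
    // Logs each read and each use to show that elements are pulled on demand
    static List<String> run() {
        var log = new ArrayList<String>();
        Iterator<String> source = List.of("a", "b", "c").iterator();
        Stream<String> stream = StreamSupport.stream(
            new Spliterators.AbstractSpliterator<String>(Long.MAX_VALUE, Spliterator.ORDERED) {
                @Override
                public boolean tryAdvance(Consumer<? super String> action) {
                    if (!source.hasNext()) {
                        return false; // nothing left to read
                    }
                    String next = source.next();
                    log.add("read " + next);
                    action.accept(next);
                    return true;
                }
            },
            false);
        // Only two elements are requested, so "c" is never read from the source
        stream.limit(2).forEach(s -> log.add("used " + s));
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run()); // [read a, used a, read b, used b]
    }
}
```

Each element is read only when the stream asks for it. The ZipInputStream implementation follows the same pattern, with getNextEntry() playing the role of the iterator: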
public static Stream<ExtractedFile> extractFiles(InputStream archiveStream) throws IOException {
    var archive = new ZipInputStream(archiveStream);
    var resultStream = StreamSupport.<ExtractedFile>stream(
        new Spliterators.AbstractSpliterator<>(Long.MAX_VALUE, 0) {
            @Override
            public boolean tryAdvance(Consumer<? super ExtractedFile> action) {
                // Find the next non-directory entry
                ZipEntry entry;
                try {
                    entry = archive.getNextEntry();
                    while (entry != null && entry.isDirectory()) {
                        entry = archive.getNextEntry();
                    }
                } catch (IOException e) {
                    // tryAdvance can't throw checked exceptions, so wrap them
                    throw new RuntimeException("Failed to read next entry", e);
                }
                // We've reached the end of the archive
                if (entry == null) {
                    return false;
                }
                try {
                    // readAllBytes() stops at the end of the current entry
                    action.accept(new ExtractedFile(
                        entry.getName(),
                        archive.readAllBytes()));
                    return true;
                } catch (IOException e) {
                    throw new RuntimeException("Failed to extract entry", e);
                } finally {
                    try {
                        archive.closeEntry();
                    } catch (IOException e) {
                        logger.warn("Error closing zip entry", e);
                    }
                }
            }
        },
        // false = start as a sequential stream; the archive must be read one entry at a time
        false);
    return resultStream.onClose(() -> {
        try {
            archive.close();
        } catch (IOException e) {
            logger.error("Error closing zip archive", e);
        }
    });
}
There’s no difference in the usage of this implementation. The same example code from above should have the same results.
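To try this out without a real archive, the sketch below builds a small zip in memory with ZipOutputStream and runs a condensed variant of the ZipInputStream approach over it. Several things here are assumptions made for the example and not part of the implementation above: the ZipStreamDemo class, the sampleArchive and textFiles helpers, the use of UncheckedIOException in place of logging, and the ORDERED | NONNULL characteristics.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

record ExtractedFile(String filename, byte[] bytes) {}

class ZipStreamDemo {
    // Condensed variant of the ZipInputStream approach; IOExceptions are rethrown unchecked
    static Stream<ExtractedFile> extractFiles(InputStream archiveStream) {
        var archive = new ZipInputStream(archiveStream);
        var spliterator = new Spliterators.AbstractSpliterator<ExtractedFile>(
                Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL) {
            @Override
            public boolean tryAdvance(Consumer<? super ExtractedFile> action) {
                try {
                    // Skip directory entries
                    ZipEntry entry = archive.getNextEntry();
                    while (entry != null && entry.isDirectory()) {
                        entry = archive.getNextEntry();
                    }
                    if (entry == null) {
                        return false; // end of archive
                    }
                    action.accept(new ExtractedFile(entry.getName(), archive.readAllBytes()));
                    return true;
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
        };
        return StreamSupport.stream(spliterator, false).onClose(() -> {
            try {
                archive.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }

    // Hypothetical fixture: an in-memory archive with one directory and two files
    static InputStream sampleArchive() {
        var buffer = new ByteArrayOutputStream();
        try (var zip = new ZipOutputStream(buffer)) {
            zip.putNextEntry(new ZipEntry("docs/"));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("docs/readme.txt"));
            zip.write("hello".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("logo.png"));
            zip.write(new byte[] {1, 2, 3});
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new ByteArrayInputStream(buffer.toByteArray());
    }

    // Returns "name=content" for each .txt file in the sample archive
    static List<String> textFiles() {
        try (var files = extractFiles(sampleArchive())) {
            return files
                .filter(f -> f.filename().endsWith(".txt"))
                .map(f -> f.filename() + "=" + new String(f.bytes(), StandardCharsets.UTF_8))
                .toList();
        }
    }

    public static void main(String[] args) {
        System.out.println(textFiles()); // [docs/readme.txt=hello]
    }
}
```

The directory entry and the PNG are filtered out, leaving only the text file. The same fixture could be fed to the ZipFile-based version to check that both produce the same entries.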
Things to note about this implementation:
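- As the comment indicates, pass false as the second argument to StreamSupport.stream so that the resulting stream starts out sequential. ZipInputStream doesn’t support being read from several threads at once, so entries have to be pulled from the archive one at a time. The flag only sets the stream’s initial mode, though: a caller can still tack on a .parallel(), in which case AbstractSpliterator permits limited parallelism by reading entries sequentially into batches and handing those batches to other threads for processing.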
Conclusion
Both of these implementations work and are functionally equivalent. The ZipFile implementation is a bit simpler but has the runtime downside of writing the contents of the zip archive to a temp file first. The ZipInputStream implementation is more code, and a little more complex, but can stream the contents of the zip archive directly.