Apple Archive Utility (and ditto) and very large ZIP archives
Dragan, posted Dec 10th 2009 at 6:25PM
This post aims to draw your attention to some strange behaviour you may encounter if you use Apple's built-in archiving tools, Archive Utility (GUI application) and ditto (command line tool), to create “large” ZIP archives. By “large” I mean archives that exceed the regular ZIP limitations, so that the Zip64 format extension has to be used. In short, DO NOT USE these tools if you plan to share your archives with people using other operating systems or other archiving tools on the Mac; they may not be able to open the archives at all. For those who are more interested in the subject, here comes the long story…
First, let’s say something about the structure of a typical ZIP archive, then define the limitations of the standard ZIP specification and how they are overcome with Zip64 extensions. A very detailed specification of the ZIP format can be found here. I won’t describe all its details, as that would take too much time and space; I’ll concentrate only on the things important to realise why Apple's archiving utilities fail.
The global structure of a ZIP archive is rather simple. First come the files contained within the archive, stacked one after another. Every file consists of its local file header, immediately followed by the file data (usually compressed using one of the common compression methods: Deflate, Deflate64, BZIP2…). The local file header contains quite a few pieces of data describing the file, but here we’re interested in two particular fields, compressed size and uncompressed size. Both of these fields are 4 bytes (32 bits) long, which means a regular ZIP archive can store files no bigger than 4GB (4,294,967,295 bytes, or 2^32 – 1 bytes). So, we already know the first limitation of the standard ZIP format.

After all the files come some other archive-related data, not interesting for this story, and then, at the end, the central directory and the end of central directory record. The central directory again contains a file header for each file in the archive, but these headers are “extended” and called central directory file headers. Each central directory file header contains more information than its corresponding local file header. For example, the file comment and file attributes fields can be found in the central directory file header, but not in the local file header. The central directory doesn’t contain only extended file headers; it also contains some other archive-related data, again not interesting for this story.

Finally, there is the end of central directory record at the very end of the ZIP archive file. There are more fields in this last part, but the one of interest to us is total number of entries in the central directory. This field is 2 bytes (16 bits) long, implying a regular ZIP archive can hold only 65,535 (2^16 – 1) files. This is the second limitation we care about in this particular story.
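To make the layout above concrete, here is a minimal Python sketch that locates the end of central directory record and reads the 16-bit total number of entries field, along with the central directory size and offset. The field layout comes from the ZIP specification; the function name and the error handling are my own, and the sketch ignores edge cases such as a signature appearing inside the archive comment:

```python
import struct

EOCD_SIG = b"PK\x05\x06"  # end of central directory record signature

def read_eocd(path):
    """Return (total_entries, cd_size, cd_offset) from the EOCD record."""
    with open(path, "rb") as f:
        f.seek(0, 2)
        size = f.tell()
        # The EOCD is at most 22 bytes + a 65,535-byte comment from the end.
        f.seek(max(0, size - 65557))
        tail = f.read()
    pos = tail.rfind(EOCD_SIG)
    if pos < 0:
        raise ValueError("no end of central directory record found")
    # After the signature: disk number, CD start disk, entries on this disk,
    # total entries (the 16-bit field!), CD size, CD offset, comment length.
    (_, _, _, total, cd_size, cd_off, _) = struct.unpack_from("<4H2IH", tail, pos + 4)
    return total, cd_size, cd_off
```

Running it on any well-formed archive shows where the 65,535-file ceiling physically lives: the `total` value simply cannot exceed what two bytes can hold.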
What happens if some of the limitations of the standard ZIP are exceeded? Then the Zip64 extension should be used. It effectively adds two chunks of data to the archive, called the zip64 end of central directory record and the zip64 end of central directory locator. The presence of this additional information determines whether a ZIP archive is a standard one or one with Zip64 extensions. What is important is that if an archive is in Zip64 format, the compressed size and uncompressed size fields in both the local file header and the central directory file header (remember, they are only 4 bytes long) should contain the value 0xFFFFFFFF, and the real sizes are written into the so-called extra field in both headers as 8-byte (64-bit) values. (Actually, this isn’t completely true, since things with the extra field are more complicated. The extra field is of variable size and a lot of information can be put into it. One part of that information is the zip64 extended information extra field, and parts of that piece in turn are original uncompressed file size and size of compressed data, which are 8 bytes long. Of course, the zip64 extended information extra field in the file headers is present only if the above-mentioned zip64 end of central directory record and zip64 end of central directory locator are present in the archive.) So now we know an archive in the Zip64 format can contain files of up to 2^64 – 1 bytes. Also, just as the normal end of central directory record has its 2-byte total number of entries in the central directory field (allowing only 65,535 files per archive), the zip64 end of central directory record has its own field with the same name, which is 8 bytes long, thus allowing up to 2^64 – 1 files in a ZIP archive in the Zip64 format.
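The sentinel-plus-extra-field mechanism can be sketched as a small resolver. Per the specification, the zip64 extended information extra field carries ID 0x0001 and stores only the values that actually overflowed, in a fixed order (uncompressed size first, then compressed size); the function name and simplifications here are my own:

```python
import struct

ZIP64_EXTRA_ID = 0x0001  # header ID of the zip64 extended information extra field

def real_sizes(usize32, csize32, extra):
    """Resolve the true sizes of an entry: if a 32-bit size field holds the
    0xFFFFFFFF sentinel, the 64-bit value lives in the zip64 extra field."""
    pos = 0
    while pos + 4 <= len(extra):
        tag, size = struct.unpack_from("<2H", extra, pos)
        pos += 4
        if tag == ZIP64_EXTRA_ID:
            data, off = extra[pos:pos + size], 0
            if usize32 == 0xFFFFFFFF:        # uncompressed size overflowed
                usize32, = struct.unpack_from("<Q", data, off)
                off += 8
            if csize32 == 0xFFFFFFFF:        # compressed size overflowed
                csize32, = struct.unpack_from("<Q", data, off)
            break
        pos += size  # skip an unrelated extra-field block
    return usize32, csize32
```

A reader that skips this step and trusts the 32-bit fields literally will see 0xFFFFFFFF (4GB) instead of the real sizes, which is exactly the kind of mismatch discussed below.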
To make this whole story more complete (and probably clearer), here is a comparison of the limitations of standard and Zip64 archives (the table includes some other limitations not mentioned above):
|Attribute|Standard Format|Zip64 Format|
|---|---|---|
|Number of Files Inside an Archive|65,535|2^64 – 1|
|Size of a File Inside an Archive (bytes)|4,294,967,295|2^64 – 1|
|Size of an Archive (bytes)|4,294,967,295|2^64 – 1|
|Number of Segments in a Segmented Archive|999 (spanning), 65,535 (splitting)|4,294,967,295|
|Central Directory Size (bytes)|4,294,967,295|2^64 – 1|
Now, what do the Apple tools do? As long as none of the limitations of the standard ZIP format is exceeded, everything is fine. But once at least one of those limitations is exceeded, a proper tool should automatically switch to the Zip64 format: add the zip64 end of central directory record and zip64 end of central directory locator (with correct information) to the archive, start putting information about file sizes in extra field > zip64 extended information extra field > original uncompressed file size and extra field > zip64 extended information extra field > size of compressed data, and fill the compressed size and uncompressed size fields of the file headers with 0xFFFFFFFF. Instead of that, Apple's archiving tools continue populating the archive with new files as if it were a standard ZIP archive, stacking new files one after another. This means no Zip64 information is present at all, and the information about file sizes and the number of files in the archive is incorrect. We may rightly say the archive is corrupted (although all the important data of all the archived files is still present).
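Because the broken archives lack the zip64 records entirely, one quick way to check whether a large archive actually carries them is to look for the two record signatures near the end of the file. This is a rough heuristic sketch (the helper name is mine, and a signature byte pattern could in principle also appear inside file data near the end):

```python
ZIP64_EOCD_SIG = b"PK\x06\x06"          # zip64 end of central directory record
ZIP64_EOCD_LOCATOR_SIG = b"PK\x06\x07"  # zip64 end of central directory locator

def has_zip64_records(path):
    """Heuristic: a proper Zip64 archive carries both zip64 records just
    before the ordinary end of central directory record."""
    with open(path, "rb") as f:
        f.seek(0, 2)
        size = f.tell()
        # EOCD (22 + max comment) + locator (20) + minimal zip64 EOCD (56).
        f.seek(max(0, size - (65557 + 20 + 56)))
        tail = f.read()
    return ZIP64_EOCD_LOCATOR_SIG in tail and ZIP64_EOCD_SIG in tail
```

On an archive produced by Archive Utility with more than 65,535 files, a check like this would come back false even though Zip64 is required.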
What happens when you encounter such a ZIP archive depends on the tool you use to process it. We can identify three basic cases here:
1. The number of files in the archive is greater than 65,535, but all files are still smaller than 4GB and the total archive size is less than 4GB.
If the tool doesn’t use the information about the archive stored in the central directory, but just enumerates all the files (which I believe is the wrong thing to do), then you’ll be able to open the archive and extract all files from it. Apple's archiving utilities behave this way, as do many archiving tools for the Mac that run the 7-zip command line tool in the background (since 7-zip behaves the same way, it just enumerates the files).
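The "enumerate the local headers" strategy can be sketched in a few lines of Python. The field layout is from the ZIP specification; this simplification assumes the sizes in each local file header are valid, so it breaks on entries written with a data descriptor (where the local sizes are zero), and the name is my own:

```python
import struct

LOCAL_SIG = b"PK\x03\x04"  # local file header signature

def walk_local_headers(path):
    """List entry names by walking local file headers one after another,
    ignoring the central directory entirely."""
    names = []
    with open(path, "rb") as f:
        while True:
            sig = f.read(4)
            if sig != LOCAL_SIG:
                break  # reached the central directory (or end of file)
            # version needed, flags, method, mod time, mod date,
            # crc-32, compressed size, uncompressed size, name len, extra len
            (_, _, _, _, _, _, csize, _, nlen, elen) = struct.unpack(
                "<5H3I2H", f.read(26))
            names.append(f.read(nlen).decode("utf-8", "replace"))
            f.seek(elen + csize, 1)  # skip the extra field and the file data
    return names
```

A walker like this never looks at the 16-bit entry count, which is exactly why enumeration-based tools happily read past the 65,535th file.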
If the tool uses the information stored in the central directory (which I believe is the right thing to do; that’s the sole purpose of the central directory's existence), then you won’t be able to see all the files in the archive. Most likely the tool will see the real value modulo 65,536. For example, if there are 70,000 files in the archive, the tool will see only the first 70,000 – 65,536 = 4,464 files and extract them. All other files are unreachable by the tool (although they are in the archive). If you have access to a Windows box with WinZip installed, you can confirm this very easily: create such an archive (70,000 small files) with Apple Archive Utility, open it with WinZip and see how many files are reported.
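The wrap-around is just 16-bit truncation: packing the real count into the EOCD's 2-byte field and reading it back reproduces the number such a tool would report. The helper below is hypothetical, purely to illustrate the arithmetic:

```python
import struct

def eocd_reported_count(real_count):
    """Model the 16-bit 'total number of entries' field: the real count is
    silently truncated modulo 2**16 when it is written into 2 bytes."""
    packed = struct.pack("<H", real_count & 0xFFFF)
    return struct.unpack("<H", packed)[0]
```

For 70,000 files this yields 4,464, matching the WinZip observation above; any count at or below 65,535 survives intact.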
2. The total archive size is greater than 4GB, but each individual file is smaller than 4GB.
If the tool doesn’t use the information stored in the central directory, but just enumerates all the files, I assume all files will be reachable and possible to extract. I didn’t try this myself, but I don’t see a reason why such tools would behave differently here.
If the tool uses the information stored in the central directory, I don’t know exactly what would happen, since I didn’t try it myself. It may be that the tool would open and extract the archive without any problems, but it may very well be that the tool would report the archive as corrupted and wouldn’t open it at all.
3. At least one file in the archive is greater than 4GB.
I didn’t try this either, but I assume a tool that doesn’t use the information stored in the central directory, but just enumerates all the files, would be able to open the archive, but extraction would most probably fail, since the file size and its offset in the archive are recorded incorrectly, so the tool may look for the file in the wrong place and expect to extract/decompress the wrong number of bytes.
If the tool uses the information stored in the central directory, it won’t open the archive and will report it as corrupted, since the recorded file sizes and file offsets don’t match.
Of course, you can make various combinations of the above common cases, with combined resulting behaviour.
As a conclusion to this long and hopefully informative story, I’ll repeat once more: if you plan to share your big archives (more than 65,535 files, or an archive bigger than 4GB, or a file in the archive bigger than 4GB) with people using other operating systems or other archiving tools on the Mac, DO NOT USE the Apple archiving tools built into Mac OS X. I’d also like to point out that Springy uses the central directory to open and gather information about a ZIP archive and the files inside it, but with built-in tricks and workarounds, just to be able to process the faulty archives made by Apple's tools. Even more, if you open such an archive and modify it, Springy will automatically “fix” it into a proper Zip64 archive!
Apple is aware of all these bugs; I reported them quite some time ago. Unfortunately, nothing has happened since, and the status of the bugs is still “open”.