Searching Large GZIP Files: gztool and SQLite

Sep 20, 2022


At work, I often need to extract information from large (10+ GB) JSON files that are compressed with gzip. These files are produced daily and represent a snapshot of app metadata across mobile app stores. Each row contains the information for a particular app, and the files grow as more apps are appended. Sometimes I need to traverse all of the rows of a file. But in many cases, I only need to look at a subset of rows; perhaps I would like to examine a handful of apps per store to see what fields are available on a particular day.

There are two main problems. First, there is no way to quickly select a row or a subset of rows: I would need to zcat the file and discard the output until I reached the row of interest. Second, I have no way of knowing which rows contain apps from a particular store. Once again, I would need to zcat the file from the start and use grep plus a regex to extract the rows that satisfy my criteria.

As I was investigating this issue, I came across two useful tools. The first is gztool, which creates a small index (~40 MB for a 10 GB compressed file) on a gzip file and provides near-random access to the rows of interest. The second is zindex, which indexes the gzip file on one or more fields (e.g. a column for CSV files, or a key for JSON files via jq). In fact, it seemed like zindex alone would handle all my needs, since it also permits line number-based queries. However, there were a few issues.

  1. The zindex format is really a SQLite database under the hood, and it takes up a lot of space.
  2. zq, the binary used to query the database, is slow, even when querying by line number. In contrast, gztool gives nearly instant random access.

I liked the zindex strategy of creating a SQLite table to map the fields of interest to a line number. However, it was useless for querying in my case, so the extra metadata in the SQLite table only took up space. I decided the best strategy would be to combine the fast random access of gztool with the row-based indexing of SQLite, without actually adopting the zindex schema or query tool.


Indexing with gztool

Creating an index with gztool is easy:

gztool -i file-YYYY-MM-DD.gz

This creates a file called file-YYYY-MM-DD.gzi that contains access points at evenly spaced offsets (by default, every 10 MB), allowing data extraction to begin from the middle of the file. When we call gztool on the .gz file again, it automatically detects the index and uses it to speed up row access.


The zindex schema

Let’s say I want to create an index on the ID of the app (product_id) and the ID of the store it belongs to (store). Both of these fields are integers. Under the hood, the zindex database schema looks like this:

CREATE TABLE AccessPoints(
    uncompressedOffset INTEGER PRIMARY KEY,
    uncompressedEndOffset INTEGER,
    compressedOffset INTEGER,
    bitOffset INTEGER,
    window BLOB
);
CREATE TABLE Metadata(
    key TEXT PRIMARY KEY,
    value TEXT
);
CREATE TABLE LineOffsets(
    line INTEGER PRIMARY KEY,
    offset INTEGER,
    length INTEGER
);
CREATE TABLE Indexes(
    name TEXT PRIMARY KEY,
    creationString TEXT,
    isNumeric INTEGER
);
CREATE TABLE index_store(
    key INTEGER,
    line INTEGER,
    offset INTEGER
);
CREATE INDEX index_store_key_index ON index_store(key);
CREATE TABLE index_product_id(
    key INTEGER,
    line INTEGER,
    offset INTEGER
);
CREATE INDEX index_product_id_key_index ON index_product_id(key);

All that I really need is the mapping from the keys in index_product_id and index_store to their corresponding line numbers. If I have the line number, I can pass it to gztool and get the metadata I need.


A slimmer SQLite index

Instead, I create a new database with a single table:

CREATE TABLE "manifest" (
  "line" INTEGER,
  "product_id" INTEGER,
  "store" INTEGER
);
CREATE UNIQUE INDEX index_product_id ON manifest(product_id);
CREATE INDEX index_store ON manifest(store);

This is roughly half the size of the index created by zindex, without the unnecessary metadata.
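As a quick sanity check that queries on this schema will actually hit the indices, we can ask SQLite for its query plan (a toy in-memory example using Python's built-in sqlite3 module):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "manifest" (
  "line" INTEGER,
  "product_id" INTEGER,
  "store" INTEGER
);
CREATE UNIQUE INDEX index_product_id ON manifest(product_id);
CREATE INDEX index_store ON manifest(store);
""")
# The plan should say "SEARCH ... USING INDEX index_store", not a full table scan
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT line FROM manifest WHERE store = 1"
).fetchall()
print(plan)
```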

To populate this table, I first extract all of the product IDs and stores from the gzip file via jq:

zcat file-YYYY-MM-DD.gz | jq -r "[.product_id, .store] | @csv" > rows.csv
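If jq isn't available (or ever becomes the bottleneck), the same extraction is easy to sketch in pure Python with the standard library; the function name here is my own:

```python
import gzip
import json

def extract_fields(gz_path, fields=("product_id", "store")):
    """Yield the requested fields from each line of a gzipped JSON-lines file."""
    with gzip.open(gz_path, "rt") as f:
        for line in f:
            row = json.loads(line)
            yield tuple(row[k] for k in fields)
```

Since this yields rows in order, line numbers could also be attached on the fly with enumerate(..., start=1) instead of adding them later.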

Then I load it with pandas and dump it into a SQLite database connection using to_sql.

import contextlib
import sqlite3

import pandas as pd

conn = sqlite3.connect('manifest.db')  # database filename is an example
df = pd.read_csv('rows.csv', header=None, names=['product_id', 'store'])
with contextlib.closing(conn.cursor()) as curs:
    # Add line number column (gztool line numbers are 1-based)
    df['line'] = range(1, len(df) + 1)
    # Ensure all columns are integers
    df = df.astype(int)
    # Dump
    df.to_sql('manifest', conn, index=False)
    # Create indices after we've already inserted data
    curs.execute('CREATE UNIQUE INDEX index_product_id ON manifest(product_id);')
    curs.execute('CREATE INDEX index_store ON manifest(store);')
conn.commit()
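Loading the whole CSV through pandas holds everything in memory at once. If that ever becomes a problem on these large files, the same table can be built by streaming the CSV with the standard library alone; a sketch (function and file names are mine):

```python
import csv
import sqlite3

def build_manifest(csv_path, db_path):
    """Stream (product_id, store) pairs from a CSV into an indexed manifest table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE manifest (line INTEGER, product_id INTEGER, store INTEGER)"
    )
    with open(csv_path, newline="") as f:
        conn.executemany(
            "INSERT INTO manifest VALUES (?, ?, ?)",
            ((i, int(pid), int(store))
             for i, (pid, store) in enumerate(csv.reader(f), start=1)),
        )
    # As with the pandas version, create indices after the bulk insert
    conn.execute("CREATE UNIQUE INDEX index_product_id ON manifest(product_id)")
    conn.execute("CREATE INDEX index_store ON manifest(store)")
    conn.commit()
    conn.close()
```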

Querying with gztool

Now that we have our SQLite database, we can obtain our fields of interest in two steps:

  1. Extract the line numbers corresponding to the fields we want from the database
  2. Pass the line numbers to gztool

I wrote a Python script that prints each line number returned from the database in a loop. As an example, let’s say I want to look at the first 10 products that belong to store ID 1.

index = 'store'
value = 1
limit = 10
with contextlib.closing(conn.cursor()) as curs:
    query = f"SELECT line FROM manifest WHERE `{index}`=?"
    if limit is not None:
        query += f" LIMIT {limit}"
    curs.execute(query, (value,))
    output = curs.fetchall()
for entry in output:
    print(entry[0])

I can then run a shell script that passes the line numbers returned by the Python script to gztool:

LINES=($(python query.py -i store -v 1 -R 10))  # "query.py" stands in for the script above
for line in "${LINES[@]}"; do
    gztool -v 0 file-YYYY-MM-DD.gz -L "$line" -R 1
done

And that’s it! Using this method, I’m able to quickly and easily retrieve row subsets from my large .gz files.
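For convenience, the two steps can also live in a single Python helper that calls gztool via subprocess; a minimal sketch (the function name and database path are assumptions):

```python
import contextlib
import sqlite3
import subprocess

def fetch_rows(db_path, gz_path, index, value, limit=10):
    """Look up line numbers in the manifest, then hand each one to gztool."""
    if index not in ("product_id", "store"):  # whitelist the indexed columns
        raise ValueError(f"unknown index: {index}")
    with contextlib.closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            f"SELECT line FROM manifest WHERE `{index}`=? LIMIT ?",
            (value, limit),
        ).fetchall()
    for (line,) in rows:
        # -L: first line to extract, -R: number of lines, -v 0: quiet
        subprocess.run(
            ["gztool", "-v", "0", gz_path, "-L", str(line), "-R", "1"],
            check=True,
        )
```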