From 4.5GB to 221MB: A Story about Parquet
Every so often I need to download our data lake export files from S3 for ad-hoc analysis or debugging production issues. Each daily JSONL file is around 4.5GB with about 1.5 million records.
Downloading takes forever – 10+ minutes on a good day, longer if you’re on slow wifi. It got me thinking: what if we used a more efficient storage format?
So I tried converting one to Parquet. Same 1.5 million records, but now it’s 221MB. That’s a 95% reduction. Downloads that took 10 minutes would take 30 seconds.
Why JSONL Files Are So Bloated
Here’s what a typical record looks like (anonymized):
{"customer_id": "C-12345678", "status": "ACTIVE", "contract_type": "STANDARD", "region": "NORTH", "is_premium": "TRUE", "last_updated": "2025-08-26", "balance": -569.04, ...}Now multiply that structure by 1.5 million. The field names alone (customer_id, status, contract_type, etc.) are repeated for every single record. With 106 fields taking up 1.3KB per record, that’s 1.3KB × 1.5M records = 1.9GB just for the keys!
But it gets worse. Take the status field – it only has 6 possible values, but “ACTIVE” appears in 99.8% of records. We’re storing that 6-character string 1.5 million times. Or contract_type with just 3 values (split ~37%/33%/28%) – that could be represented with 2 bits instead of string literals.
The Parquet Difference
I wrote a quick Python script using Polars. The key part is schema inference:
# First 1000 rows to figure out the data typesschema_df = pl.read_ndjson(input_file, n_rows=1000)
# Now Polars knows: customer_id is string, balance is float,# is_premium is boolean (not the string "TRUE"!), etc.This is huge. JSONL has no schema, so parsers treat everything as text until proven otherwise. Parquet knows upfront that balance is a float, is_premium is a boolean, and region is a string that probably repeats a lot.
Where the Magic Happens
Instead of row-by-row storage like JSONL, Parquet groups all values of the same column together. Think of it like transposing a spreadsheet – all customer IDs in one place, all statuses in another. This enables some clever optimizations:
Dictionary Encoding for Low-Cardinality Fields
Fields with few unique values compress incredibly well:
status(6 unique values) → stores each value once + tiny indexescontract_type(3 values: STANDARD/PREMIUM/BASIC) → 2 bits per recordregion(maybe 10-20 values) → small dictionary + references
This is where we see massive wins. That status field that was taking ~9MB (6 bytes × 1.5M) now takes about 570KB.
Boolean Fields Are Just Bits
We have 32 boolean flags (is_premium, has_discount, auto_renew, etc.):
- JSONL: “TRUE”/“FALSE” = 4-5 bytes each
- Parquet: 1 bit each
32 boolean fields × 1.5M records × ~5 byte savings = 230MB saved right there.
Smart Null Handling
About 21% of all field values are null. In JSONL that’s the string “null” repeated millions of times. In Parquet it’s a simple bitmap – one bit per row saying “has value” or “doesn’t have value”.
Numeric Types Stay Numeric
Customer IDs like “C-12345678” take 10 bytes as text. Store just the numeric part as an integer: 4 bytes.
Timestamps stored as ISO strings (“2025-08-26T14:30:00Z”) are 20+ bytes each. As proper timestamps: 8 bytes.
The Actual Numbers
Original: data_export_20250826.jsonl - 4.5GBConverted: data_export_20250826.parquet - 221MBReduction: 95.1%Where did 4.28GB of savings come from? I dug deeper and found something surprising - the JSONL file is 66% overhead! Only 33% is actual data values, the rest is JSON structure.
- Field names stored once: 106 columns, 1.3KB of names per row × 1.5M rows = 1.9GB saved
- Dictionary encoding: 20+ categorical fields (status, payment_type, meter locations, etc.) = ~800MB saved
- Zstandard compression: Final compression on the columnar data = ~340MB saved
- Boolean compression: 32 boolean fields from “TRUE”/“FALSE” strings to bits = 230MB saved
- Null optimization: 21% of values are null, eliminating “null” strings = 130MB saved
- Other optimizations: Numeric type efficiency, run-length encoding, etc. = ~880MB saved
The shocking realization: In JSONL, we’re literally spending twice as much space on JSON syntax as on actual data!
Trade-offs
You lose some things:
- Can’t just
tail -forgrepthe file - Need proper tools (though
duckdb -c "SELECT * FROM 'file.parquet'"works great) - Schema evolution requires more thought