Recovering 20 years from a ZIP: archive catalog at scale

We inherited a problem last quarter. A client handed us a 40-gigabyte ZIP file containing twenty years of their organization's digital history. Project files, documents, images, source code, database exports, communications — twenty years of work compressed into a single archive. The problem: they needed to know what was inside. Indexing thirty-seven million files manually was impossible. That's when we built the vision-assisted catalog pipeline.

The Problem

Twenty years of digital accumulation produces enormous complexity. The ZIP contained:

Over thirty-seven million individual files
- Hundreds of different naming conventions
- Multiple formats: documents, spreadsheets, images, source code, databases
- Duplicate files with different names
- Versioned copies scattered across folders
- Corruption from incomplete extractions spread throughout

The client needed three things:

Provenance: Where did each file come from? What year? What project?
1. Catalog: What's in there? What's worth keeping?
2. Recovery: What's recoverable? What's corrupted?

Manual indexing would take eighteen months. We needed automation.

The Challenge of Scale

Here's what made this difficult:

Naming inconsistency: Twenty years of conventions mean hundreds of different naming patterns. "Project_Report_2023_FINAL_v3.docx" doesn't tell us what project, what year accurately, or whether this is actually the final version.

Format diversity: Thirty-seven million files in thousands of formats — Word, Excel, PDF, images, programming languages, databases. Each format needs different processing.

Corruption spread: Incomplete extractions left corrupted files scattered throughout the archive. A corrupted ZIP file within the larger archive created cascading failures.

No metadata: The ZIP was flat. No internal organization, no timestamps, no provenance data.

Thirty-seven million files exceeds what manual review can handle. We needed a different approach.

The Vision-Assisted Catalog Pipeline

We built a three-phase pipeline:

Phase One: Extraction and Fingerprinting

First, we extracted and generated file fingerprints:

 import hashlib
 import os

def extract_with_fingerprint(archive_path: str, output_dir: str) -> dict:
 """Extract archive and generate fingerprints."""
 extraction = {}

with zipfile.ZipFile(archive_path, 'r') as z:
 for member in z.namelist():
 try:
 # Extract to memory only
 content = z.read(member)

# Generate fingerprint
 fingerprint = hashlib.sha256(content).hexdigest()
 extraction[member] = {
 'size': len(content),
 'fingerprint': fingerprint,
 'extension': os.path.splitext(member)[1]
 }
 except Exception as e:
 extraction[member] = {'error': str(e)}

return extraction

This gave us thirty-seven million fingerprints. We used these to identify:

Exact duplicates (same fingerprint across different filenames)
- Unique files
- Corrupted files (extraction failures)

The fingerprint phase was fast — we processed millions of files in hours.

Phase Two: Content Classification with Vision

Here's where vision helped. We analyzed files visually to classify content:

 from PIL import Image
 import io

def classify_image(content: bytes) -> dict:
 """Classify image content using basic analysis."""
 try:
 img = Image.open(io.BytesIO(content))
 width, height = img.size
 format = img.format

# Basic classification by characteristics
 classification = {
 'type': 'image',
 'dimensions': f"{width}x{height}",
 'format': format,
 'mode': img.mode,
 'is_photo': width >= 800 and height >= 600
 }

# Detect screenshots
 if 'screenshot' in str(content).lower() or 'window' in str(content).lower():
 classification['is_screenshot'] = True

return classification
 except:
 return {'type': 'unknown'}

This classified images as:

Photos (with dimensions, format)
- Screenshots (detected from content patterns)
- Icons/logos
- Scans
- Corrupted images

For documents, we extracted text:

 def extract_text_from_document(content: bytes, extension: str) -> str:
 """Extract text from various document formats."""
 try:
 if extension == '.pdf':
 # PDF text extraction
 return extract_pdf_text(content)
 elif extension in ['.docx', '.doc']:
 return extract_docx_text(content)
 elif extension == '.txt':
 return content.decode('utf-8', errors='ignore')
 else:
 return ""
 except:
 return ""

Extracted text gave us searchability and content classification.

Phase Three: Provenance Reconstruction

The most valuable phase: reconstructing when files were created and from what source:

 def reconstruct_provenance(filepath: str, metadata: dict) -> dict:
 """Reconstruct file provenance from path and metadata."""

# Parse year from filename patterns
 year_patterns = r'20\d{2}|19\d{2}'
 years = re.findall(year_patterns, filepath)

# Parse project from folder structure
 path_parts = filepath.split('/')

# Identify source application
 extension_to_app = {
 '.docx': 'Word',
 '.xlsx': 'Excel',
 '.pptx': 'PowerPoint',
 '.psd': 'Photoshop',
 '.ai': 'Illustrator'
 }

application = extension_to_app.get(
 os.path.splitext(filepath)[1],
 'Unknown'
 )

return {
 'identified_years': years,
 'application': application,
 'path_depth': len(path_parts),
 'filename_complexity': len(filepath)
 }

This reconstructed:

Year estimation: From filenames containing dates
- Application source: From file extensions
- Project assignment: From folder structure
- Complexity scoring: From filename patterns

The Catalog Results

Here's what we discovered:

File Distribution

37.2 million files extracted
- 4.1 million unique (non-duplicate) files
- 890,000 images classified
- 2.3 million documents processed
- 180 different file formats

Provenance Insights

Oldest content: 2004 (Word documents, images from early digital photography)
- Peak years: 2015-2019 (concentrated project activity)
- Application sources: 73% Microsoft Office, 12% Adobe Creative Suite, 8% other

Corruption Analysis

3.4% of files had extraction errors
- Corrupted files were concentrated in years 2011-2013 (incomplete extraction attempts)
- Duplicated files comprised 31% of archive content

What We Found That Was Valuable

The catalog revealed several treasures:

Client logos and branding: 847 unique logo variations across twenty years — complete visual history of brand evolution.

Historical presentations: Over 100,000 presentations, chronologically organized — story of strategic evolution.

Source code: 12,000 code repositories compressed — full software history.

Photography: 340,000 images — office culture, events, products — organized by year.

The catalog transformed from "mystery archive" to structured, searchable asset.

Methodology Limitations

This wasn't perfect:

Filename parsing: Assumes standard patterns. Non-standard naming evaded classification.

Content-based dating: Doesn't accurately date files without date metadata. Years come from filenames, not file content.

Vision accuracy: Image classification is approximate. It's helpful but not definitive.

OCR quality: Scanned documents varied significantly. OCR produced partial text for low-quality scans.

These limitations are acknowledged in reporting.

Close

Vision-assisted catalog at scale is possible with the right pipeline. Thirty-seven million files became a searchable database in weeks. We know:

What's in the archive (composition)
- What's valuable (classification)
- What's corrupted (3.4%)
- What provenance exists (from filenames and paths)

This approach applies to any large archive. The methodology transfers: extract and fingerprint, classify with vision, reconstruct provenance.

The client now has a searchable catalog. They can search their entire twenty-year history in seconds. That's the value — from mystery to manageable.

If you are weighing build-vs-buy on infrastructure like this—and the real question is what to commit to next—describe the decision you are facing. We scope around outcomes, not open-ended tours.