
Study #2

Normalization Under Structural Entropy

A real-world operational benchmark across 247,130 fragmented files, 8,861 folders, deep directory trees, duplicate propagation, NoGPS-heavy workloads, and multi-day execution.

247,130 files · 8,861 folders · 1.88 TB processed · ~4.33 sec/file

Executive summary

Messy archives are where normalization matters most

Study #2 isolates the most operationally difficult part of the benchmark: unmanaged filesystem folders. Unlike managed libraries, folder-based archives do not preserve structural coherence naturally. They accumulate operational disorder through migrations, backups, exports, recovered media, duplicate propagation, and years of uncontrolled filesystem growth.

The strongest result was not raw throughput. It was deterministic convergence under operational stress: even when archive structure became fragmented, heterogeneous, and difficult, normalization behavior remained measurable, explainable, and globally predictable.

247,130 files processed
8,861 folders traversed
85,630 duplicates handled
123,169 NoGPS files

From managed libraries to unmanaged archives

Study #1 showed how structure improves normalization behavior. Study #2 moves into the opposite environment: unmanaged archives accumulated directly at filesystem level.

This distinction matters because libraries and folders do not behave the same way structurally. Managed libraries preserve locality, metadata continuity, and predictable behavior. Folder archives accumulate entropy progressively through years of ordinary use.

Managed libraries

Structure preserved

  • High metadata consistency
  • Strong geolocation continuity
  • Natural chronological locality
  • Logical duplicate handling
  • Cache-friendly workload behavior

Structural entropy at scale

Large folder-based archives do not become complex all at once. They accumulate entropy gradually through migrations, backups, exports, synchronization workflows, recovery processes, device replacement cycles, and years of normal storage growth.

The filesystem becomes a visible record of operational history: duplicated folders, nested exports, recovered media, temporary copies, fragmented chronology, inconsistent naming, and metadata degradation.

25 years of fragmented archive accumulation
Historical accumulation of heterogeneous media across 25 years of unmanaged archive growth.

Key findings

What fragmented archives revealed

Finding 1

Structural entropy dominates cost

Processing behavior varied more with workload composition than with dataset size itself. NoGPS density, duplicate movement, directory locality, file size, and video concentration became dominant variables.

Finding 2

Normalization remained stable

Despite heterogeneous workload conditions and multi-day execution, global operating cost converged toward approximately 4.33 seconds per file.
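The headline figures imply the aggregate runtime directly. A quick sanity check of the arithmetic, using only the numbers published in the study (decimal terabytes assumed; variable names are illustrative):

```python
FILES = 247_130          # files processed (from the study)
SEC_PER_FILE = 4.33      # converged per-file cost (from the study)
BYTES = 1.88e12          # 1.88 TB processed (decimal TB assumed)

total_seconds = FILES * SEC_PER_FILE
total_days = total_seconds / 86_400      # seconds per day
avg_mb = BYTES / FILES / 1e6             # mean file size in MB

print(f"total runtime ~ {total_days:.1f} days")   # ~12.4 days
print(f"average file size ~ {avg_mb:.1f} MB")     # ~7.6 MB
```

The ~12.4-day total is consistent with the "multi-day execution" described above.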

Finding 3

Duplicates became physical I/O

Folder normalization physically handled 85,630 duplicated files, transforming duplicate management into a sustained storage and write-amplification workload.

Finding 4

Metadata availability reshaped throughput

NoGPS-heavy segments often reduced unit cost by minimizing geolocation-resolution work, showing that location resolution was not the dominant bottleneck under fragmented workloads.
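One plausible reading of this finding, sketched below under assumed behavior (the study does not expose the app's pipeline): files without GPS tags short-circuit past the reverse-geocoding step entirely, so metadata-light segments pay only extraction cost. The function name, cache granularity, and placeholder lookup are all hypothetical:

```python
from typing import Optional

def resolve_location(gps: Optional[tuple[float, float]],
                     geocode_cache: dict) -> str:
    """Route a file by GPS availability.

    NoGPS files short-circuit to a fixed bucket: no lookup, no cache
    miss, no external dependency. This is why NoGPS-heavy segments
    can run cheaper than well-geotagged ones.
    """
    if gps is None:
        return "NoGPS"                               # zero resolution cost
    key = (round(gps[0], 2), round(gps[1], 2))       # ~1 km cache granularity
    if key not in geocode_cache:
        # Placeholder for a real reverse-geocoding call (the expensive path).
        geocode_cache[key] = f"cell:{key}"
    return geocode_cache[key]
```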

Finding 5

Execution state mattered

Long-running I/O-bound execution proved sensitive to operating system interaction state, including idle and screensaver conditions.

Operational conditions

This was not a synthetic benchmark. The workload ran under real-world operational conditions using consumer hardware, encrypted storage, deep directory traversal, random HDD reads, SSD writes, metadata extraction, duplicate movement, and large video handling.

Machine

MacBook Pro M2 Pro

32 GB unified memory, fully local execution, no cloud upload, no distributed compute.

Source

Encrypted HDD

WD My Passport 4 TB, exposing filesystem locality and seek-heavy access patterns.

Destination

Encrypted SSD

Samsung Portable SSD T7 Shield 4 TB, receiving organized and duplicated outputs.

Operational regimes

The folder archive did not behave like a uniform dataset. It behaved like a sequence of regimes shaped by metadata density, duplicate concentration, filesystem locality, media composition, file size, and execution state.

Processing cost versus workload composition
Relationship between workload composition and observed operational cost across heterogeneous processing segments.

Regime

Cache-dominant

Strong location continuity and accumulated reuse reduced external dependency and stabilized cost.

Regime

NoGPS-heavy

Metadata-light segments exposed a lower cost boundary by avoiding repeated location-resolution work.

Regime

Video-heavy

Large video directories amplified sustained I/O cost through prolonged read/write activity.

Regime

Duplicate-heavy

Duplicate handling became physical movement, creating sustained write amplification.

Regime

Seek-bound

Fragmented traversal reduced locality and forced unstable HDD access patterns.
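The regimes above can be made concrete as a simple classifier over per-segment statistics. The threshold values below are invented for illustration; the study does not define numeric boundaries:

```python
def classify_regime(seg: dict) -> str:
    """Label a processing segment by its dominant composition signal.

    seg carries per-segment ratios in [0, 1]; thresholds are
    illustrative, not taken from the benchmark.
    """
    if seg["duplicate_ratio"] > 0.5:
        return "duplicate-heavy"
    if seg["video_bytes_ratio"] > 0.5:
        return "video-heavy"
    if seg["nogps_ratio"] > 0.7:
        return "NoGPS-heavy"
    if seg["cache_hit_ratio"] > 0.8:
        return "cache-dominant"
    return "seek-bound"        # fragmented traversal is the fallback regime

print(classify_regime({"duplicate_ratio": 0.1, "video_bytes_ratio": 0.2,
                       "nogps_ratio": 0.9, "cache_hit_ratio": 0.3}))
# NoGPS-heavy
```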

Unexpected finding

Execution state became an operational variable

One of the most unexpected findings was that throughput depended not only on archive composition, but also on the interaction state of the operating system itself.

During long-running I/O-bound execution, segments processed under prolonged screensaver and idle-state conditions showed measurable throughput degradation despite stable workload composition and controlled power settings.

The workload and archive composition remained stable; only the execution state changed. The same directories, storage devices, and execution pipeline produced materially different throughput depending solely on whether the system remained actively interactive or entered prolonged idle-state conditions.

What initially appeared to be random operational instability gradually revealed a repeatable execution-state regime associated with screensaver and idle-state transitions during sustained I/O-bound processing.

Segment-level processing cost before and after idle mitigation
Segment-level operational clustering before and after idle-state mitigation.

Local variability. Global predictability.

Local segments varied continuously. Some were metadata-light, some duplicate-heavy, some video-heavy, and some seek-bound. Yet cumulative operating behavior remained structurally explainable and globally stable over time.

That is the defining result of Study #2: deterministic normalization remained viable even when archive structure deteriorated.

Global processing cost convergence
Global processing cost convergence across 247,130 heterogeneous files processed under shifting operational regimes.

What this changes

Study #2 changes how large-scale normalization should be understood. At scale, normalization behaves less like a simple metadata utility and more like a long-running operational system interacting with storage, filesystem structure, workload composition, and execution conditions.

  • Archive quality exists on a spectrum. Managed libraries and fragmented folders are different points along the same continuum.
  • Workload composition matters more than raw volume. Cost follows metadata density, locality, duplicates, video concentration, and execution state.
  • Determinism survives structural entropy. Even fragmented workloads remained measurable, explainable, and predictable.
  • Normalization becomes more valuable as archives degrade. Messy archives are where structural visibility matters most.

Normalization did not eliminate entropy. It made entropy operationally intelligible.

Continue reading

The benchmark series

Previous

Study #1 — Managed Libraries

Structured libraries reveal how metadata continuity and accumulated knowledge improve normalization efficiency.

Read Study #1 →

Model

Study #3 — Operational Normalization

A consolidated model for long-lived media archives.

Read Study #3 →

Implementation

File-level normalization is a method, not a dependency on a specific application. Applying it under fragmented archive conditions requires deterministic rules, duplicate isolation, explicit handling of incomplete metadata, and operational resilience across heterogeneous workloads.
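A minimal sketch of what "deterministic rules" and "explicit handling of incomplete metadata" can look like in practice. The path scheme is an assumption for illustration, not MediaOrganizer's actual layout:

```python
from datetime import datetime
from pathlib import Path
from typing import Optional

def destination_path(root: Path, taken: Optional[datetime],
                     place: Optional[str], filename: str) -> Path:
    """Same inputs -> same output, every run (deterministic).

    Missing metadata is handled explicitly rather than guessed:
    no timestamp -> an Undated bucket, no location -> NoGPS.
    """
    date_part = taken.strftime("%Y/%m") if taken else "Undated"
    place_part = place if place else "NoGPS"
    return root / date_part / place_part / filename

p = destination_path(Path("/archive"), datetime(2014, 7, 3), None, "IMG_0042.jpg")
print(p)   # /archive/2014/07/NoGPS/IMG_0042.jpg
```

Because the rule is a pure function of the file's metadata, re-running it over a partially processed archive converges to the same layout, which is what makes behavior predictable under fragmented conditions.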

See how MediaOrganizer applies file-level media normalization →