What Is Data Gravity?
Data gravity is the observation that as a dataset grows in size, it increasingly attracts services, applications, and additional data to its location. Like a celestial body whose gravitational pull strengthens as its mass increases, large datasets become progressively harder and more expensive to move, causing compute and tooling to migrate toward the data rather than the other way around.
The Origin of the Concept
The term was coined by Dave McCrory in 2010 to describe a pattern he observed in enterprise IT. When organizations accumulated terabytes of data in a particular cloud provider or data center, they found it impractical to move that data elsewhere. Instead, they built more services on top of it in the same location, deepening their dependency.
The analogy to physics is apt: the larger the dataset, the stronger the pull. A 10 GB database can be migrated in minutes. A 10 PB training corpus effectively cannot be moved at all within reasonable time and cost constraints.
Why Data Gravity Matters for AI
AI amplifies data gravity in three ways:
- Training data volumes are massive: Foundation models are trained on datasets measured in petabytes. Once a training corpus exists in a particular location, the GPU clusters needed to train on it must be co-located. Moving the data to a different provider could take weeks even on dedicated high-bandwidth links.
- Data begets more data: Model outputs, inference logs, fine-tuning datasets, and evaluation results accumulate alongside the original training data. Each derivative dataset adds mass to the gravitational field.
- Multi-modal convergence: Modern AI systems combine text, images, video, audio, and sensor data. As these modalities are consolidated in one location for joint training, the total dataset size, and with it the gravitational pull, compounds.
The Cost of Moving Data
Data gravity has a direct financial dimension. Cloud providers charge egress fees that make moving data out of their infrastructure expensive:
| Dataset Size | Egress Cost (typical) | Transfer Time (1 Gbps) |
|---|---|---|
| 100 GB | ~$8 | ~13 minutes |
| 10 TB | ~$800 | ~22 hours |
| 1 PB | ~$50,000+ | ~3 months |
| 10 PB | ~$500,000+ | ~2.5 years |
These numbers explain why organizations rarely move large datasets. The combination of cost and time creates a lock-in effect that is difficult to escape.
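The table's figures can be reproduced with simple arithmetic. The sketch below assumes a flat $0.08/GB egress rate and a fully saturated link with no protocol overhead; real cloud pricing is tiered and drops at volume, which is why the petabyte rows show lower effective rates.

```python
# Back-of-the-envelope egress cost and transfer time for a dataset.
# Assumes a flat $0.08/GB rate and an ideal, fully utilized link.

def egress_cost_usd(size_gb: float, rate_per_gb: float = 0.08) -> float:
    """Egress fee at a flat per-GB rate."""
    return size_gb * rate_per_gb

def transfer_time_seconds(size_gb: float, link_gbps: float = 1.0) -> float:
    """Ideal transfer time: dataset size in gigabits / link speed."""
    return (size_gb * 8) / link_gbps  # 1 byte = 8 bits

for label, gb in [("100 GB", 100), ("10 TB", 10_000), ("1 PB", 1_000_000)]:
    cost = egress_cost_usd(gb)
    hours = transfer_time_seconds(gb) / 3_600
    print(f"{label}: ~${cost:,.0f} egress, ~{hours:,.1f} hours at 1 Gbps")
```

Running this recovers the table's order of magnitude: roughly 13 minutes for 100 GB, 22 hours for 10 TB, and over 90 days for 1 PB on a 1 Gbps link.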
Strategies for Managing Data Gravity
- Move compute to data: Instead of moving the dataset, deploy processing power at its location. This is the principle behind edge AI and orbital data centers.
- Federated architectures: Train models across distributed datasets without centralizing the data. Federated learning reduces the need to move raw data at all.
- Efficient transfer protocols: When data must move, the transfer protocol matters enormously. Protocols that maximize throughput over high-latency links reduce transfer time from months to days.
- Eliminate per-GB costs: Per-GB pricing makes large transfers prohibitively expensive. Flat-rate or free transfer eliminates the cost component of data gravity.
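The "efficient transfer protocols" point can be made concrete: a single TCP stream's throughput is capped at its window size divided by the round-trip time, no matter how fast the underlying link is. The window sizes and RTT below are illustrative assumptions, not measurements:

```python
# Why protocol choice matters on high-latency links: one TCP stream
# cannot exceed window_size / RTT, regardless of link capacity.

def max_throughput_gbps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on a single stream's throughput, in Gbps."""
    return (window_bytes * 8) / rtt_seconds / 1e9

# Default 64 KiB window vs a tuned 64 MiB window on a 100 ms WAN path
default = max_throughput_gbps(64 * 1024, 0.100)        # ~0.005 Gbps
tuned = max_throughput_gbps(64 * 1024 * 1024, 0.100)   # ~5.4 Gbps
print(f"default: {default:.3f} Gbps, tuned: {tuned:.1f} Gbps")
```

This is why bulk-transfer tools tune window sizes or open many parallel streams: on a 100 ms path, the untuned stream wastes over 99% of a 1 Gbps link, while the tuned one can saturate it.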
How Handrive Addresses Data Gravity
Handrive's architecture directly combats data gravity. With no per-GB fees, the financial barrier to moving data disappears. Its peer-to-peer protocol avoids routing data through centralized cloud infrastructure, eliminating egress fees entirely. For AI teams moving training data, model weights, or inference results between locations, Handrive makes the transfer itself free regardless of volume. Explore how this applies to AI infrastructure on the AI Data Centers hub page.
Understand the broader shift in file transfer for AI:
File Transfer for AI Training Data: Moving Terabytes Between Edge, Cloud, and Beyond →