The Data Warehousing Handbook

Category: Computer Science
Author: Rob Mattison


by epicmuffin   2018-09-23
How do you know the postgres implementation is naive? I've worked on several analytics platforms, including offshoots of Google Analytics within Google itself, and this problem domain is ridiculously easy to shard on natural partitions. After sharding, you can start doing roll-ups, which Google Analytics does internally.
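
To make "shard on natural partitions, then roll up" concrete, here is a minimal sketch. It assumes the customer ID is the natural partition key and that a daily page-view count is the roll-up; the shard count, the event layout, and the names `shard_for` and `roll_up` are all hypothetical, not anything Google Analytics is known to use.

```python
import zlib
from collections import Counter, defaultdict

NUM_SHARDS = 4  # hypothetical shard count


def shard_for(customer_id, num_shards=NUM_SHARDS):
    """Map a customer to a shard with a stable hash (crc32, not Python's salted hash())."""
    return zlib.crc32(customer_id.encode("utf-8")) % num_shards


def roll_up(events):
    """Aggregate raw events into per-shard daily counts.

    events: iterable of (customer_id, iso_timestamp, page) tuples.
    Returns {shard: Counter({(customer_id, day, page): hits})}.
    """
    shards = defaultdict(Counter)
    for customer_id, timestamp, page in events:
        day = timestamp[:10]  # "YYYY-MM-DD" prefix of an ISO 8601 timestamp
        shards[shard_for(customer_id)][(customer_id, day, page)] += 1
    return shards


events = [
    ("acme", "2014-03-01T10:00:00", "/home"),
    ("acme", "2014-03-01T10:05:00", "/home"),
    ("acme", "2014-03-02T09:00:00", "/pricing"),
    ("globex", "2014-03-01T11:00:00", "/home"),
]
rollups = roll_up(events)
```

Because every event for a customer lands on the same shard, a per-customer report only ever reads one shard, and it reads the small roll-up rows rather than the raw events.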

By 2014, when I left, we had a few petabytes of analytics data for a very small but high-traffic set of customers. Could we query all of that at once within a reasonable online SLA? No. We partitioned and sharded the data easily and only queried the partitions we needed.
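
A rough illustration of "only query the partitions you need", using one table per customer per month in an in-memory sqlite database. The table naming scheme and the helpers `partition_name`, `insert_hit`, and `count_hits` are assumptions made for this sketch, not a description of the system above.

```python
import sqlite3
from datetime import date


def partition_name(customer_id, day):
    """One table per (customer, month); customer_id is an int, so the name is safe to interpolate."""
    return f"hits_c{int(customer_id)}_{day:%Y%m}"


def months_between(start, end):
    """Yield the first day of every month that overlaps [start, end]."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield date(year, month, 1)
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


def insert_hit(conn, customer_id, day, path):
    table = partition_name(customer_id, day)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (day TEXT, path TEXT)")
    conn.execute(f"INSERT INTO {table} VALUES (?, ?)", (day.isoformat(), path))


def count_hits(conn, customer_id, start, end):
    """Count hits in [start, end], touching only the partitions that can contain them."""
    existing = {row[0] for row in
                conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")}
    total, touched = 0, []
    for month in months_between(start, end):
        table = partition_name(customer_id, month)
        if table not in existing:
            continue  # no data for that month, nothing to scan
        (n,) = conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE day BETWEEN ? AND ?",
            (start.isoformat(), end.isoformat()),
        ).fetchone()
        total += n
        touched.append(table)
    return total, touched


conn = sqlite3.connect(":memory:")
insert_hit(conn, 1, date(2014, 1, 15), "/a")
insert_hit(conn, 1, date(2014, 2, 3), "/b")
insert_hit(conn, 1, date(2014, 3, 9), "/c")
insert_hit(conn, 2, date(2014, 2, 3), "/x")
total, touched = count_hits(conn, 1, date(2014, 2, 1), date(2014, 3, 31))
```

The January partition and the other customer's data are never read. On postgres, declarative partitioning with a date-range predicate would get you similar pruning without hand-built table names.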

If I were doing this now and didn't need near real-time (what is real-time?), I'd use sqlite. Otherwise I'd use trickle-n-flip on postgres or mysql. There are literally 10+ year-old books[1] on this wrt RDBMS.

And yes, even with 2000 clients reaching billions of requests per day, only the top few stressed the system. The rest is long tail.