The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling

Author: Ralph Kimball, Margy Ross

by 3pt14159   2018-08-20
Spark, etc., are great, but honestly, if you're just getting started, I would forget all about existing tooling geared towards people working at 300-person companies and read The Data Warehouse ETL Toolkit by Kimball:

I learned from the second edition, but I've heard even better things about the third. As you're working through it, create a project with real data and re-implement a data warehouse from scratch as you go. It doesn't really matter what you tackle, but I personally like ETLing either data gathered by crawling a single site[0] or a weekly Wikipedia dump. You'll learn many of the foundational reasons for the tools the industry uses, which will make it much easier to get up to speed on them and to make the right choices about when to introduce them. I personally tend to favour tools that have an API or CLI so I can coordinate tasks without needing to click around, but many others like a giant GUI so they can see data flows graphically. Most good tools have at least some measure of both.

[0] Use something like Scrapy for Python (or Mechanize for Ruby) with CSS selectors, and use the extension Inspector Gadget to quickly generate the CSS selectors.
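
To make the from-scratch exercise above a bit more concrete, here is a minimal sketch of one ETL pass into a tiny star schema, using only Python's built-in sqlite3 module. Everything in it is an assumption for illustration: the sample records, the table names (dim_date, dim_page, fact_page_views) and the helper functions are hypothetical, and a real project would replace the hard-coded list with rows extracted from a crawl or a dump.

```python
import sqlite3

# Hypothetical raw records, standing in for rows extracted from a crawl or dump.
RAW_PAGE_VIEWS = [
    {"date": "2018-08-20", "page": "Data_warehouse", "views": 120},
    {"date": "2018-08-20", "page": "Star_schema", "views": 45},
    {"date": "2018-08-21", "page": "Data_warehouse", "views": 98},
]


def create_schema(conn):
    """Create two dimension tables and one fact table (names are illustrative)."""
    conn.executescript("""
        CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT UNIQUE);
        CREATE TABLE dim_page (page_key INTEGER PRIMARY KEY, title TEXT UNIQUE);
        CREATE TABLE fact_page_views (
            date_key INTEGER REFERENCES dim_date(date_key),
            page_key INTEGER REFERENCES dim_page(page_key),
            views INTEGER
        );
    """)


def lookup_or_insert(conn, table, key_col, value_col, value):
    """Return the surrogate key for a value, adding a dimension row if it is new."""
    row = conn.execute(
        f"SELECT {key_col} FROM {table} WHERE {value_col} = ?", (value,)
    ).fetchone()
    if row:
        return row[0]
    cur = conn.execute(f"INSERT INTO {table} ({value_col}) VALUES (?)", (value,))
    return cur.lastrowid


def load(conn, records):
    """Transform raw records into surrogate keys and load them as fact rows."""
    for rec in records:
        date_key = lookup_or_insert(conn, "dim_date", "date_key", "full_date", rec["date"])
        page_key = lookup_or_insert(conn, "dim_page", "page_key", "title", rec["page"])
        conn.execute(
            "INSERT INTO fact_page_views VALUES (?, ?, ?)",
            (date_key, page_key, rec["views"]),
        )
    conn.commit()


def views_by_page(conn):
    """Example report: total views per page, joining the fact table to a dimension."""
    return conn.execute("""
        SELECT p.title, SUM(f.views)
        FROM fact_page_views f
        JOIN dim_page p ON p.page_key = f.page_key
        GROUP BY p.title
        ORDER BY p.title
    """).fetchall()


conn = sqlite3.connect(":memory:")
create_schema(conn)
load(conn, RAW_PAGE_VIEWS)
print(views_by_page(conn))
```

The point of the sketch is only the shape: dimensions get surrogate keys, facts reference those keys, and reports are joins plus aggregates. Doing this by hand first is one way to see what the larger tools automate.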

by anonymous   2017-11-06
This guy actually wrote several; the one I was referring to is this, but there's also a Data Warehousing collection. They are a great read if you're interested in large-scale data warehousing.