The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling

Author: Ralph Kimball, Margy Ross
All Hacker News 7
This Month Stack Overflow 15


by anonymous   2019-07-21

This is a Data Warehouse. Small, but a data warehouse.

You have a Star Schema.

You have Facts:

  • response values are the measures

You have Dimensions:

  • time period. This has many attributes (year, quarter, month, day, week, etc.) This dimension allows you to accumulate unlimited responses to your survey.

  • question. This has some attributes. Typically your questions belong to categories or product lines or focus or anything else. You can have lots question "category" columns in this dimension.

  • participant. Each participant has unique attributes and reference to a Demographic category. Your demographic category can -- very simply -- enumerate your demographic combinations. This dimension allows you to follow respondents or their demographic categories through time.

But Ralph Kimball's The Data Warehouse Toolkit and follow those design patterns.
Buy the book. It's absolutely essential that you fully understand it all before you start down a wrong path.

Also, since you're doing Data Warehousing. Look at all the [Data Warehousing] questions on Stack Overflow. Read every Data Warehousing BLOG you can find.

There's only one relevant design pattern -- the Star Schema. If you understand that, you understand everything.

by anonymous   2019-07-21

I'll try to answer in general terms that are not tied to a specific database technology (I'm a MS SQL Server DWH person).

To address your specific questions ...

"1.My data can change like a appointment can be marked as cancelled later, i read that changing data in star schema is not a good idea. If not what is a better approach."

There are two main table types in DWHes Fact tables and Dimension tables.

Changing fact or dimensional data in a star schema is perfectly valid. It is not considered good practise to delete dimension records from a DWH.

You need to make a choice of type 1 (overwite history) or type 2 (keep history) changes to data (Slowly Changing Dimension).

I'm not sure if you are suggesting deleting fact records here, but a better approach would be to have a flag on each fact record to indicate the status of the appointment (booked/used/cancelled/etc) and if a patient cancels their appointment then change the fact record from status=booked to status=cancelled; not actually deleting the fact record. This way also you can track the number of cancelled appointments.

To add a complication you could have your fact table keeping history as well, so that you can show an "as at" position, i.e. be able to show the state of the database as at a particular point in time. This increases the size of your database quite a bit, depending on the number of changes that occurr.

"2.The data to my fact tables will inserted by a background task when the data is added to my main database. Is constant insertion of data to fact table a issue as reposts are viwed in the application almost anytime."

You need to have a discussion around the frequency of updates / the importance of having up to date data. Generally (as I'm sure you are aare) DWHes are not OLTP systems and so aren't meant to be constantly being updated with new data and able to be reported on the most up to date data. If you want that really you want an OLTP system.

That said, we have implemented a DWH that did updates every 20 minutes. This had a SQL DWH with an OLAP cube sitting on top. Im not sure if mysql has OLAP technology, but I'd feel sure there is some opensource version available. There are several flavours of OLAP (MOLAP/ROLAP/HOLAP) which give different focus to performance/data currency.

You would normally want to separate the DWH itself from the reporting DB layer, especially if there are many users.

"3.I am planning to implement it in mysql, and if someone can point me to some post releated to performance of mysql with this kind of structure it would be great. Also which is a better engine to implement this schema Innodb or Myisam"

I'll have to pass on this question. I used to know a bit about innoDB adn MyISAM, but its been about 8 years since i played with the technology.

A very good book on Star Schema DWH design is by Ralph Kimball on DWH Design Book

by anonymous   2019-07-21

my thoughts …

  • Don’t overestimate the size (in terms of time) of a new DWH project.
  • Start with something not very complex and well understood in terms of business rules.

The biggest problem we have had with new DWH projects/pilots (we are a DWH consultancy so have a number of clients) is getting management support of it. Often the DWH will be sponsored by someone in IT and there is no real board level support, so the project takes a long time to progress and it is difficult to get resources for it.

The best projects we have found are ones that have levels of support in three areas: Management (board level), IT and Business (preferably someone with good understanding of the business rules involved).

Have a look at Ralph Kimball’s Data Warehouse Toolkit which goes through different styles of DWH design for different industries. It is very good!

The tools I expect you would use (I’ve added a couple of technologies here)

  • SSIS (ETL tool) is used to Extract (from source systems) Transform (data into appropriate form the load) and Load (into Dim and Fact tables)
  • SSAS (OLAP tool) is used to create/process an OLAP cube. Warning: there is quite a large learning curve on this tool!!
  • SSRS (reporting tool) is used to create static and dynamic reports/dashboards.
  • MS Excel. There are free data mining models that can be added in and when connected to an OLAP cube which will allow very interesting DM to be performed.
  • Windows Sharepoint Services (WSS) (comes free with a Windows Server operating systems) to deploy your SSRS reports onto.

This is a good prototype scope in terms of technologies (if you are from the MS background), but the spread of technologies is very large and for one person coming in cold to them, this is perhaps unrealistic.

The most critical thing is to get the DWH star schema design correct (and there will be a number of solutions that are “correct” for any set of data), otherwise your OLAP cube design will be flawed and very difficult to build.

I would get a local DWH consultant to validate your design before you have built it. Make sure you give them a very tight scope of not changing much otherwise most consultants will “tinker” with things to make them “better”.

Good Luck!

by anonymous   2019-07-21

Look at Data Warehouse Toolkit from Ralph Kimball. Just a plain star-schema may be able to speed up reporting. And here is an example of how the star schema simplifies reporting.

by anonymous   2019-07-21

I am going to assume this is an analytic (historical) database with no current data. If not, you should consider separating your dbs.

You are going to want a few features to help speed up analysis:

  • Materialized views. This is essentially pre-calculating values, and then storing the results for later analysis. MySQL and Postgres (coming soon in Postgres 9.3) do not support this, but you can mimic with triggers.

  • easy OLAP analysis. You could use Mondrian OLAP server (java), but then Excel doesn't talk to it easily, but JasperSoft and Pentaho do.

  • you might want to change the schema for easier OLAP analysis, ie the star schema. Good book:

If you want open source, I'd go Postgres (doesn't choke on big queries like mysql can), plus Mondrian, plus Pentaho.

If not open source, then best bang for buck is likely Microsoft SQL Server with Analysis Services.

by baakss   2017-08-20
Yes, especially in business intelligence / data warehousing. Here is an excellent resource if you're interested in this sort of thing:
by meritt   2017-08-20
I would suggest reading some books on the topic of Dimensional Modeling [1] such as "The Data Warehouse Toolkit" [2]. The critical thing you need to expose to your users is the ability to ask for things which make sense in their world that are actually really difficult for even an engineer to code. Things like: "Show me average 9am-12pm sales on Mondays, Wednesday and Fridays for 1st quarter, 2012"

by eneveu   2017-08-20
If you are interested in Data Warehousing, you should read Ralph Kimball's "The Data Warehouse Toolkit":

When I started learning about BI (Business Intelligence), a few members of the Pentaho community advised me to read this book. I'm glad I did. Kimball is one of the "fathers" of data warehousing, and his book had a lot of great insights for dimensional modeling. It helped me avoid many design mistakes while building my DWH, and gave me insight I might have taken years to discover.

It's a "theoretical" book, in the sense that it does not focus on any specific technology; it's also a "practical book", because he uses real-world scenarios (inventory management, e-commerce, CRM...) to demonstrate the various dimensional modeling techniques. I also liked the part about BI project management and encouraging BI in a company (= how to engage users and how to "sell" a BI project to management).

He also has a newsletter with many DWH design tips (archives here: ).

by anonymous   2017-08-20

Your dimension table 'product' should look like this:

surrogate_key | natural_key (productID) | Color | Material | Size | ...
    1              5                      red     wood       20     ...
    2              6                      red     ...         

If you have to many properties, try to group them in another dimension. For example Color and Material can be attributes of another dimension if you can have the same product with same id and same price in another color or material. Your fact table can identify product with two keys: product_id and colormaterial_id...

Reading recommendation: The Data Warehouse Toolkit, Ralph Kimball

by kyle   2017-08-20

@Mark Ransom is definitely onto something with the notion of Excel keeping the data in memory, making it faster computationally. It's also possible that Excel pre-indexes datasets in such a way that makes it more responsive than your database.

There's one significant, non-algorithmic possibility for why it's faster: Excel, in Pivot Table usage, has no concept of a join. When you're fetching the data ad hoc from your database, any joins or correlations between tables will result in further lookups, scans, index loads, etc. Since Excel has all the data in a single location (RAM or no), it can perform lookups without having to pre-form datasets. If you were to load your database data into a temp table, it would be interesting to see how ad hoc queries against that table stacked up, performance-wise, against Excel.

One thing's certain, though: although databases are excellent tools for producing accurate reports, a traditionally-normalized database will be far less than optimal for ad hoc queries. Because normalized data structures focus on integrity above all else (if I may take that liberty), they sacrifice ad hoc optimization at the expense of keeping all the data sensible. Although this is a poor example, consider this normalized schema:

+--------+     +---------+
|tblUsers|     |luGenders|
+--------+     +---------+
|userID  |     |genderID |
|genderID||gender   |
+--------+     +---------+

SELECT * FROM luGenders;
> 1 Female
> 2 Male

If, in this example, we wished to know the number of female/male users in our system, the database would need to process the join and behave accordingly (again, this is a bad example due to the low number of joins and low number of possible values, which generally should bring about some database-engine optimisation). However, if you were to dump this data to Excel, you'd still incur some database penalty to pull the data, but actually pivoting the data in Excel would be fairly speedy. It could be that this notion of up-front, fixed-cost penalty is being missed by your idea of Excel being quicker than straight ad hoc queries, but I don't have the data to comment.

The most tangential point, though, is that while general databases are good for accuracy, they often suck at ad hoc reports. To produce ad hoc reports, it's often necessary to de-normalize ("warehouse") the data in a more queryable structure. Looking up info on data warehousing will provide a lot of good results on the subject.

Moral of the story: having a fully algorithmic, fast ad hoc query system is an awesome ideal, but is less than practical given space and time constraints (memory and people-hours). To effectively generate an ad hoc system, you really need to understand the use cases of your data, and then denormalize it effectively.

I'd highly recommend The Data Warehouse Toolkit. For the record, I'm no DBA, I'm just a lowly analyst who spends 80 hours per week munging Excel and Oracle. I know your pain.

by anonymous   2017-08-20

I would stress you to read this book; might seem kind of outdated but the same theory still applies today. It is probably the best starter for general BI.

The Data Warehouse Toolkit - Ralph Kimball

Regarding Microsoft's BI it is a medium-sized tool that can do the job in your first steps (I have more experience with Cognos though). Haven't used MS tools since 2005 so I can't tell much about it.

In case you happen to be interested in Cognos, I have a few videos which can be of help: Cognos Tutorials

Good luck with your project.