Star Schema The Complete Reference

Author: Christopher Adamson
4.4
This Month Stack Overflow 1

Comments

by anonymous   2019-01-13

I can comment on Power BI part.

The key issue here is that Power BI requires Dimensional Model, not relational one. There is a huge difference.

As described, the model from the book is not suitable for BI tools, it must be redesigned. For example, table "Weather" in the book is presented as a "dimension", while in reality it must be a fact table (similar to table "Flights"). As a result, "Flights" and "Weather" should never have direct connections - they must share common dimensions, such as:

  • Airport
  • Airline
  • Plane
  • Date
  • Time

Similarly, multiple keys and multiple connections between tables are very rare exceptions and are frowned upon (usually, they are indications of design mistakes). In a properly designed model, you should never see them.

If you want to understand the issue more, read this book: Star Schema Complete Reference

To answer your Q3 specifically, in dimensional modeling "Flight" (I assume it's flight number) is called a "degenerate dimension". Normally, it would have been a key to a dimension table, but if it's absent, it stays in a fact table as an orphan key. Such situation is common for Order numbers, Invoice numbers, etc.

Degenerate dimensions

Overall, you are on the right track - if you figure out how to transform the model from the book into a proper star schema, and then use it in R and PowerBI, you will be impressed with the new capabilities - it's worth it.

by anonymous   2019-01-13

The way it usually works:

  1. In your dimensions, you will have "Natural Keys" (aka "Business Keys") - keys that come from external systems. For example, Contract Number. Then you create synthetic (surrogat) keys for the table.
  2. In your fact table, all keys initially must also be "Natural Keys". For example, Contract Number. Such keys must exist for each dimension that you want to connect to the fact table. Sometimes, a dimension might need several natural keys (collectively, they represent dimension table "Granularity" level). For example, Location might need State and City keys if modeled on State-City level.
  3. Join your dim table to the fact table on natural keys, and from the result omit natural key from fact and select surrogat key from dim. I usually do a left join (fact left join dim), to control records that don't match. I also join dims one by one (to better control what's happening).

Basic example (using T-SQL). Let's say you have the following 2 tables:

Table OLTP.Sales
(   Contract_BK, 
    Amount, 
    Quanity)

Table Dim.Contract
(   Contract_SK,
    Contract_BK,
    Contract Type)

To Swap keys:

SELECT
     c.Contract_SK
    ,s.Amount
    ,s.Quantity
INTO
    Fact.Sales
FROM
    OLTP.Sales s LEFT JOIN Dim.Contract c ON s.Contract_BK = c.Contract_BK

-- Test for missing keys
SELECT 
    * 
FROM 
    Fact.Sale 
WHERE 
    Contract_SK IS NULL

On a side note, I believe you have some mistakes in your design.

  • Report Period should be a separate dimension. Usually it's a calendar table with all date/period related attributes.
  • You certainly should not add ContractNbr to other dimensions. You already have this data in Contract dimension. That's how star schema works - contract attributes are always available to you via fact table. No need to replicate them.
  • I can't say for sure (not enough information) but suspect that dim Accounting and dim Claim might be incorrectly designed. If you intend to list your individual transaction descriptions and individual claim attributes, it's a mistake. It will result in dimensions that are as large as fact table. In a good design, fact table are "tall and skinny", while dimesions are "short and fat". I.e, in a fact table you should have few fields and lots of records, while in dims lots of fields and few records. Typically, if your dim's number of records is more than 10-20% of the fact table records, it's an indication of incorrect design. Correct way of handling this problem is to decompose claims into multiple dimensions, and leave claim number (order number, invoice number, transaction number, etc) as a "degenerate dimension" in your fact table. It's a bit of an advanced topic but you clearly need it for your case. Reason why it's important: if your dimensions are as tall as you fact table, you will have increasingly poor performance. If number of trasactions or claims is in the millions of records, it might be so slow that it will kill your design.

If you need more information on this, I recommend this book:

Star Schema The Complete Reference

[Edit to answer a follow-up question]:

I did not mean to remove ClaimNbr field from Claim dimension. I suggested that you don't need such dimension at all.

This might be a bit hard to digest, but consider the following. "Claim" is essentially a container for information (same as "Invoice", "Order", etc). If you move all usefull pieces of data to their relevant dimensions, there should be nothing left but an empty container.

For example, let's assume that your OLTP claim table contains the following fields: Claim Number, Report Period, Claim Description, Claim Name, Contract Number, Claim Amount. You can model them as follows:

  • Report Period: becomes business key for "Date" dimension
  • Contract Number: becomes business key for "Contract" dimension
  • Claim Amount: stays in fact table as numeric (fully-additive) fact

That leaves 3 fields: Claim Number, Claim Name and Claim Description. At this point, some designers create dimension "Claim" and park these fields there. As I mentioned before, this is a mistake, because you will then have as many records in your dimension as in your fact table, leading to serious problems.

A better design is to leave these fields in the fact table. Claim Number becomes a "Degenerate dimension" - a business key to "empty" (non-existent) dimension. Essentially, it's just an ID for an information container, like invoice number, order number, etc.

Claim Name and Claim Description also should stay in the fact table and become "non-numeric" (non-additive) facts. If you need to display them in a report, it's easy to do, and you can count them, do conditional logic on them, measure their length, etc.

Another way of looking at this: dimensions are usually used to "slice" (disect) facts BY some attribute/field. For example, "Sale Amount by Country", "Product Costs by Plant Location", etc. But you can't slice by descriptions, notes, or other free text - it makes no sense.

What if your descriptions or other claim attributes are structured? For example, if they are used to categorize/classify your claims? In that case, they are not a free text, they are an attribute that belongs to a dimension. For example, you can design dimension "Claim Type". Or "Claim Status". Etc. If there are too many of these little attribute fileds, you can combine them into what's called a "junk" dimension (aka "Profile" dimension), i.e., dimension "Claim Profile". Such designs are clean and efficient.

Read more on junk dimensions here