The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling
When I started learning about BI (Business Intelligence), a few members of the Pentaho community advised me to read this book. I'm glad I did. Kimball is one of the "fathers" of data warehousing, and his book had a lot of great insights for dimensional modeling. It helped me avoid many design mistakes while building my DWH, and gave me insight I might have taken years to discover.
It's a "theoretical" book, in the sense that it does not focus on any specific technology; it's also a "practical book", because he uses real-world scenarios (inventory management, e-commerce, CRM...) to demonstrate the various dimensional modeling techniques. I also liked the part about BI project management and encouraging BI in a company (= how to engage users and how to "sell" a BI project to management).
He also has a newsletter with many DWH design tips (archives here: http://www.kimballgroup.com/html/07dt.html ).
Your dimension table 'product' should look like this:
surrogate_key | natural_key (productID) | Color | Material | Size | ...
1 5 red wood 20 ...
2 6 red ...
If you have to many properties, try to group them in another dimension. For example Color and Material can be attributes of another dimension if you can have the same product with same id and same price in another color or material. Your fact table can identify product with two keys: product_id and colormaterial_id...
The Data Warehouse Toolkit, Ralph Kimball
@Mark Ransom is definitely onto something with the notion of Excel keeping the data in memory, making it faster computationally. It's also possible that Excel pre-indexes datasets in such a way that makes it more responsive than your database.
There's one significant, non-algorithmic possibility for why it's faster: Excel, in Pivot Table usage, has no concept of a join. When you're fetching the data ad hoc from your database, any joins or correlations between tables will result in further lookups, scans, index loads, etc. Since Excel has all the data in a single location (RAM or no), it can perform lookups without having to pre-form datasets. If you were to load your database data into a temp table, it would be interesting to see how ad hoc queries against that table stacked up, performance-wise, against Excel.
One thing's certain, though: although databases are excellent tools for producing accurate reports, a traditionally-normalized database will be far less than optimal for ad hoc queries. Because normalized data structures focus on integrity above all else (if I may take that liberty), they sacrifice ad hoc optimization at the expense of keeping all the data sensible. Although this is a poor example, consider this normalized schema:
|userID | |genderID |
SELECT * FROM luGenders;
> 1 Female
> 2 Male
If, in this example, we wished to know the number of female/male users in our system, the database would need to process the join and behave accordingly (again, this is a bad example due to the low number of joins and low number of possible values, which generally should bring about some database-engine optimisation). However, if you were to dump this data to Excel, you'd still incur some database penalty to pull the data, but actually pivoting the data in Excel would be fairly speedy. It could be that this notion of up-front, fixed-cost penalty is being missed by your idea of Excel being quicker than straight ad hoc queries, but I don't have the data to comment.
The most tangential point, though, is that while general databases are good for accuracy, they often suck at ad hoc reports. To produce ad hoc reports, it's often necessary to de-normalize ("warehouse") the data in a more queryable structure. Looking up info on data warehousing will provide a lot of good results on the subject.
Moral of the story: having a fully algorithmic, fast ad hoc query system is an awesome ideal, but is less than practical given space and time constraints (memory and people-hours). To effectively generate an ad hoc system, you really need to understand the use cases of your data, and then denormalize it effectively.
I'd highly recommend The Data Warehouse Toolkit. For the record, I'm no DBA, I'm just a lowly analyst who spends 80 hours per week munging Excel and Oracle. I know your pain.
I would stress you to read this book; might seem kind of outdated but the same theory still applies today. It is probably the best starter for general BI.
The Data Warehouse Toolkit - Ralph Kimball
Regarding Microsoft's BI it is a medium-sized tool that can do the job in your first steps (I have more experience with Cognos though). Haven't used MS tools since 2005 so I can't tell much about it.
In case you happen to be interested in Cognos, I have a few videos which can be of help: Cognos Tutorials
Good luck with your project.