Found in 6 comments on Hacker News
brid · 2017-06-09 · Original thread
Check out the Data Warehouse Toolkit. You might want to look into natural vs surrogate keys.

baakss · 2016-08-01 · Original thread
Yes, especially in business intelligence / data warehousing. Here is an excellent resource if you're interested in this sort of thing:
meritt · 2013-02-25 · Original thread
Dimensional modeling (I'm a fan of Kimball's approach) mitigates these problems quite well while still offering very flexible ad-hoc reporting. Works great on a row-based RDBMS, even better on columnar.

Redshift is indeed a solid product but all these comparisons against Hive are surprising, as that's not the right tool in the first place. Infobright, greenplum, aster, vertica, etc are the products which Redshift seeks to disrupt.

eneveu · 2011-07-26 · Original thread
If you are interested in Data Warehousing, you should read Ralph Kimball's "The Data Warehouse Toolkit":

When I started learning about BI (Business Intelligence), a few members of the Pentaho community advised me to read this book. I'm glad I did. Kimball is one of the "fathers" of data warehousing, and his book had a lot of great insights for dimensional modeling. It helped me avoid many design mistakes while building my DWH, and gave me insight I might have taken years to discover.

It's a "theoretical" book, in the sense that it does not focus on any specific technology; it's also a "practical book", because he uses real-world scenarios (inventory management, e-commerce, CRM...) to demonstrate the various dimensional modeling techniques. I also liked the part about BI project management and encouraging BI in a company (= how to engage users and how to "sell" a BI project to management).

He also has a newsletter with many DWH design tips (archives here: ).

I am currently doing reports/datawarehouses/dashboards. When something more complicated that simple questions is needed (see Data Warehousing for Cavemen), ad-hoc queries are quite often not the answer anymore, either with NoSQL or with SQL.

I don't want my clients to be dependent on me (or someone else) to build complicated SQL queries when they have questions, so I focus on getting an easy to maintain facts/dimensions model (as advocated by Ralph Kimball which can evolve if needed.

The nice point about MongoDB when doing this is that it makes it a lot easier to add attributes to dimensions, or load the data, or evolve the reporting system in general (and I like that).

You can apply the same principles to build dimensions/facts based data structure and answer questions that SQL alone wouldn't be able to answer easily.

Example of such question: how many calls did we receive during french legal week #9 that were handled by team X outside the normal working hours or while we were in vacations ? In those calls, how many were issued by a woman (as it has a financial impact in this case) ?

AndrewO · 2009-04-30 · Original thread
This method has been the bane of my existence for going-on 5 years now. YMMV, of course, but here's my experience:

Back in 2004, we were decided to denormalize a set of records containing year, quarter, and 15 or so data points going back to 1996. Using this cross-tab method, we started out with 8 (years) * 4 (quarters/year) * 15 (data points) = 480 columns (a few more for non-denormalized data). 5 years later, we're up to the high 700s. Needless to say, this pretty much blows out your standard MySQL admin tools.

Another pitfall is, you really have to make sure you data is pristine (of course you should be doing that anyway, but reality is usually far from ideal, especially if you're me in 2004), otherwise you end up with doubling of values or values included when they shouldn't be. The generated queries are very very difficult to debug, again blowing out many admin tools and also text editors (unless you add pretty formatting in your query generator).

So what would I recommend instead? There are multi-dimensional databases or ETL tools that will properly put your transactional data in an analytical style data warehouse. Read Kimball's Data Warehouse Toolkit [1] (ignore the kinda ratty paper quality—the content is good). The key thing is to denormalize down to a star schema, but keep your facts and dimensions separate (things you'll learn in DWT).

There's an Java based OLAP server called Mondrian [2] that also may be worth your time. I ended up going with ActiveWarehouse and ActiveWarehouse-ETL [3], but I've felt some things a little more complicated than they needed to be (but that's another discussion—and I haven't kept up with their development lately, so it may have improved).

On a side note, recently I've started wondering if map-reduce style tools would make sense in this context, but I haven't explored using them that way yet, so I can't actually make a recommendation there.




Fresh book recommendations delivered straight to your inbox every Thursday.