Optimizing Joins running on HDInsight Hive on Azure at GFS

Denny Lee

Introduction

To analyze hardware utilization within its data centers, Microsoft’s Online Services Division Global Foundation Services (GFS) is working with Hadoop/Hive via HDInsight on Azure. A common scenario is joining the various tables of data. This quick blog post provides some context on how we took a query from more than 2 hours to under 10 minutes, and the thinking behind it.

Note that this post was published on April 26th, 2013. This information may not be current, but I have kept this post for posterity.

“…to look at the stars always makes me dream, as simply as I dream over the black dots of a map representing towns and villages…”

— Vincent Van Gogh

Image Source: Vincent Van Gogh Painting Tilt Shifted: http://coolvibe.com/2011/16-van-gogh-paintings-tilt-shifted/tilt-shift-van-gogh-15/

Background

This scenario is a three-column join between a large fact table (~1.2 B rows/day) and a smaller dimension table (~300K rows). …

