Optimizing Joins running on HDInsight Hive on Azure at GFS

Denny Lee

.“…to look at the stars always makes me dream, as simply as I dream over the black dots of a map representing towns and villages…”
— Vincent Van Gogh

Image Source: Vincent Van Gogh Painting Tilt Shifted: http://coolvibe.com/2011/16-van-gogh-paintings-tilt-shifted/tilt-shift-van-gogh-15/



To analyze hardware utilization within their data centers, Microsoft’s Online Services Division – Global Foundation Services (GFS) is working with Hadoop / Hive via HDInsight on Azure.  A common scenarios is to perform joins between the various tables of data.  This quick blog post provides a little context on how we managed take a query from >2h to <10min and the thinking behind it.


The join is a three-column join between a large fact table (~1.2B rows/day) and a smaller dimension table (~300K rows).  The size of a single day of compressed source files is ~4.2GB; decompressed is ~120GB.  When performing a regular join (in Hive parlance “common…

View original post 1,003 more words

Healthcare Compliance with Big Data and BI

Healthcare Compliance with Big Data and BI

Over the past few years Denny Lee  (Technical Principal Program Manager within Microsoft’s SQL Business Intelligence Group) and I are always working on a very exciting SQL Server projects, earlier this month we presented “Big Data, BI, and Compliance in Healthcare” at PASS BA Conference in Chicago, IL.

Few years ago, we implemented “Centralized Audit Framework” to manage and view the audits of entire SQL Server environment that will parse, load, and report all of audit logs.

Expanding on the “Reaching Compliance: SQL Server 2008 Compliance Guide” to more easily handle larger volumes of structured and unstructured data and to gain richer and deeper insight using the latest analytics. To achieve this, we are building a Big Data-to-BI project involving HDInsight (Hadoop on Windows or Azure), SQL Server 2012, SQL Server Analysis Service 2012 Tabular instance, Integration Services, PowerPivot, and Power View.

The purpose of this SDK is to provide a set of tools and jumpstart with scripts to implement the Auditing project involving HDInsight, SQL Server 2012, PowerPivot and Power View.

Implementation Overview

The basic implementation of the Auditing and Reporting solution is shown in the figure below.


Figure 2

The general flow of data in this solution is that Audits are created on any number of SQL Servers (2008 and 2012) in the environment and are set to log to the file system. The audit logs will be stored directly to a central network file share. A scheduled SQL Server Agent Job runs an SSIS package that reads the audit log files, combines them into large file sizes (250MB to 1GB file size) and uploads them to HDInsight Blob Storage which is the storage source for HDInsight on Azure or Windows.

Once audit logs stored in HDInsight Blob storage, we use Hive which is a data warehouse framework for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets.

Create a BI Semantic Data Model tabular to bring relevant Hadoop data into SQL Server Analysis Services Tabular by using HiveQL via SQL Server Linked Server connection to Hadoop Hive. Analysis Services can then serve up the data for ad-hoc analysis and reporting.

Reports are created with Excel 2013 using Power View that interacts with views of data from data models based on SSAS tabular model, or using Data Explorer to import audit data from Hive external table in HDInsight to allow compliance auditors and server administrators to assess the server compliance and trends in server compliance.

This information would then be fed back to the appropriate security, administrator and application development teams to enact policies to approve levels of compliance.

As the system evolves, teams may load additional application audit logs into Hadoop which could help tie these SQL Server specific activities to application and business activities.

The SDK will be available soon at GitHub to download the entire project. Stay tuned!