Big Data and Legos


Ayad Shammout:

Great Analogy!

Originally posted on Denny Lee:

I was recently asked how to explain Big Data to an 8-year-old. After realizing that the four Vs of Big Data barely make sense to non-marketing folks (i.e., most of us), let alone to kids, I realized that the best construct would be to use Legos.


When I was her age, Lego blocks came only in squares and rectangles. I could build a lot of buildings and boxes, which was great at the time (in data speak, relational databases).

Big Data, by contrast, is a massive amount (the volume) of Lego blocks in different shapes and sizes (the variety). With all that variety, and with how quickly I can get more blocks, I can imagine new and different ways to build toys and objects (in data speak, perform novel analytics).


From Closer Look: United States of Lego.

With lego blocks in all shapes and sizes…


SQL 2012 Failover Cluster Instance SP1 installation failed with error

Installing SQL Server 2012 SP1 on a SQL Server 2012 Failover Cluster Instance (FCI) failed with the following error:

There was a failure to calculate the default value of setting PatchResult.

Error code 0x85640001

The issue was that when you launch SQLServer2012SP1-KB2674319-x64-ENU.exe, it creates a temporary folder with the setup files (something like g:\27eb5c29eceb5ae55b8b531263faacc) on a cluster disk, which in this case happened to be the disk storing the SQL data files.

During SP1 installation, the SQL Server service is restarted, which takes offline the cluster disk hosting both the SQL data files and the SP1 temp folder. The service pack then tries to read from the unavailable disk, causing the installation to fail.

The workaround is to copy the SP1 temp folder to the local C:\ drive and run Setup from there; the SP1 installation then completes with no problems.

I know it sounds weird, but SP1 extracts its files to whichever drive has the most free space, which in this case was a cluster disk.

DAX RANKX function – Part II

In a previous post, I explained how to use the RANKX function to calculate a revenue ranking for each provider in each department. In this post, I'm going to explain how to rank at different levels.

I'm using my previous example: a list of providers in different hospital clinical departments. I want to rank departments, and rank providers within each department, by revenue (two ranking levels). Here is my sample table; I started by creating a table in Excel ("Table1") and adding it to the Power Pivot model.

clip_image001

I created following measures to rank provider by revenue:

TotalRevenue:=SUM(Table1[Revenue])

Rank:=RANKX(All(Table1[Provider]),[TotalRevenue],,,Dense)

The Rank measure ranks the [TotalRevenue] for the current row against the [TotalRevenue] for all other rows in the Revenue column. ALL(Table1[Provider]) is the list of values that the rank function compares the current row against. Since I used the ALL function to ignore the Provider filter context, I get the following:

image

Now let's add the department to the pivot table and see how the Rank measure works:

image

As you can see, the rank by provider still works within each department, but the rank by department does not; it takes the department total and compares it against the list of provider values. I want to rank each department against all departments, and rank providers within each department. To do that in Power Pivot, I'll create an IsDepartmentLevel measure that returns TRUE if the current selection in the PivotTable has a valid corresponding department definition, and FALSE otherwise. If it is TRUE, rank departments; otherwise, rank providers. Let's see how to write the IsDepartmentLevel measure:

IsDepartmentLevel:=
    COUNTROWS(Table1)
        = CALCULATE(
              COUNTROWS(Table1),
              ALL(Table1),
              VALUES(Table1[Department])
          )

 

The IsDepartmentLevel measure checks that the number of rows active in Table1 is equal to the number of rows that would be active after removing every column except Department from the filter context.

As you can see in the following PivotTable, to which I added the IsDepartmentLevel measure, the measure is FALSE at every provider level but TRUE at the department level.

image

Now we are able to identify the department level, which we need in order to evaluate the Rank measure and rank departments vs. providers.

Let's rewrite the Rank measure and see how to use the IsDepartmentLevel measure and the RANKX function to rank departments and providers by revenue.

Rank:=
IF(
    [IsDepartmentLevel],
    RANKX(FILTER(ALL(Table1[Department]), [TotalRevenue]), [TotalRevenue], , , Dense),
    RANKX(FILTER(ALL(Table1[Provider]), [TotalRevenue]), [TotalRevenue], , , Dense)
)

I used the IF function to evaluate the IsDepartmentLevel measure: if it is TRUE, the measure calculates the revenue ranking for each department; otherwise, it calculates the revenue ranking for each provider.

The following Pivot table shows the ranking for each department.

image

When I expand the departments, the PivotTable shows the ranking for each provider within each department while maintaining the ranking for each department.

image

In this post, you have seen how to use the RANKX function to calculate rankings at different levels (department and provider) using a single measure.

SQL 2012 SP1 CU6 issue with SSIS DB in Availability Group

Recently we were installing SQL Server 2012 SP1 CU6 to fix the slow performance you see when running the SSIS Server Maintenance Job to remove old data in SQL Server 2012 (you can read about the issue here). We ran into a serious issue where the installation failed and SQL Server services couldn't start.

Let me first describe the server: it is a named instance of SQL Server 2012 SP1 (Database Engine and SSIS) in an AlwaysOn Availability Group, with the primary replica in the main data center and a secondary replica in the DR site. The AG comprises all databases, including SSISDB.

SQL 2012 SP1 CU6 installation failed with the following error:

Script level upgrade for database ‘master’ failed because upgrade step ‘SSIS_hotfix_install.sql’ encountered error 945, state 2, severity 25. This is a serious error condition which might interfere with regular operation and the database will be taken offline. If the error happened during upgrade of the ‘master’ database, it will prevent the entire SQL Server instance from starting. Examine the previous errorlog entries for errors, take the appropriate corrective actions and re-start the database so that the script upgrade steps run to completion.

The issue was related to configuring the SSIS database "SSISDB" in an AlwaysOn Availability Group. SQL Server service packs run their upgrade scripts in single-user mode, while an availability database must be a multi-user database. Therefore, during the CU6 installation, all availability databases, including SSISDB, were taken OFFLINE and became inaccessible, which caused the patch to fail.

To work around this issue, here are the steps:

  1. Start the SQL Server service with trace flag 902: 
    • >Net Start MSSQL$InstanceName /T902
  2. Open SQL Server Management Studio, go to the Availability Group, and remove SSISDB from the availability databases
  3. Open a new query and execute the SSIS_hotfix_install.sql script, which can be found in the Install folder under \Program Files\Microsoft SQL Server\MSSQL11.<InstanceName>\MSSQL
  4. Stop the SQL Server service: 
    • >Net Stop MSSQL$InstanceName
  5. Start the SQL Server service from SQL Server Configuration Manager
  6. Add SSISDB back to the Availability Group
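For steps 2 and 6, the same change can be made with T-SQL instead of Management Studio. This is a sketch; the availability group name [AG1] is a placeholder for your own:

```sql
-- Step 2: on the primary replica, remove SSISDB from the availability group
-- (the AG name [AG1] is a placeholder).
ALTER AVAILABILITY GROUP [AG1] REMOVE DATABASE [SSISDB];

-- Step 6: after the patch completes, add SSISDB back.
ALTER AVAILABILITY GROUP [AG1] ADD DATABASE [SSISDB];
```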

That's all, and we are back in business. Click here for more information on how you can make use of SSIS and AlwaysOn Availability Groups.

Quick Tip for extracting SQL Server data to Hive

Originally posted on Denny Lee:

While I have documented various techniques to transfer data from Hadoop to SQL Server / Analysis Services (e.g. How Klout changed the landscape of social media with Hadoop and BI Slides Updated, SQL Server Analysis Services to Hive, etc.), this post calls out the reverse – how to quickly extract SQL Server data to Hadoop / Hive.   This is a common scenario where SQL Server is being used as your transactional store and you want to push data to some other repository for analysis where you are mashing together semi-structured and structured data.

How to minimize impact on SQL Server

Analogous to the above linked documents, you could use a JDBC driver to connect and extract this data from SQL Server (other techniques may involve SQOOP, HiveODBC, etc.).  But if your security policies allow for this, a fast way to extract data from your SQL Server with…


SQL Server Auditing

SQL Server 2008 and higher include a powerful yet lightweight method to audit a SQL Server instance. But to manage and view the audits of your entire SQL Server environment, we have created the Centralized Auditing Framework, which will parse, load, and report on all of your audit logs.

From a high-level perspective, the architecture to audit sensitive operations for your SQL Server environment will be to:

  1. Turn on auditing on your various database and servers; you can find more information at http://msdn.microsoft.com/en-us/library/cc280386.aspx
  2. Have these audit logs go to one centralized location instead of a local folder
  3. Included in the Centralized Audit Framework is an SSIS package that will parse all of these audit log files.
  4. Populate a centralized database (also included) with audit dimension data (class, type, server names, etc.) and audit fact data (audit events separated by category of server, database, DDL, and DML actions).
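As a sketch of steps 1 and 2, the following T-SQL creates a server audit that writes its log files to a centralized file share and enables a couple of audit action groups. The audit name, share path, and chosen action groups here are illustrative assumptions, not part of the framework itself:

```sql
-- Create a server audit whose log files land on a centralized share
-- (names and path are illustrative).
CREATE SERVER AUDIT [CentralAudit]
TO FILE ( FILEPATH = N'\\auditserver\sqlaudits\' )
WITH ( ON_FAILURE = CONTINUE );

-- Capture failed logins and server object changes instance-wide.
CREATE SERVER AUDIT SPECIFICATION [CentralAuditSpec]
FOR SERVER AUDIT [CentralAudit]
    ADD ( FAILED_LOGIN_GROUP ),
    ADD ( SERVER_OBJECT_CHANGE_GROUP )
WITH ( STATE = ON );

-- Turn the audit on.
ALTER SERVER AUDIT [CentralAudit] WITH ( STATE = ON );
```

Pointing every instance's FILEPATH at the same share is what lets the SSIS package find all the audit files in one place.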

Once the data is loaded, a nightly job processes it to produce report tables. These report tables can be viewed through SSRS reports (included).

The Centralized Auditing Framework is available on GitHub, where you can download the entire project.

T-SQL Linear Regression Function

Predictive analytics is the wave of the future and has become vital in today's data evolution.

It incorporates a variety of techniques from statistics, modeling, machine learning, and data mining that analyze current and historical facts to make predictions about the future.

I'll describe the linear regression approach and how to write T-SQL to calculate the regression and produce the Intercept, Slope, and R2, which are used in a regression equation to predict a value. In simple linear regression, the topic of this post, the predictions of Y plotted as a function of X form a straight line.

The following table lists sample data: car drivers' driving experience (in years) and their monthly auto insurance premiums.

 

Driving Experience (X value)    Insurance Premium (Y value)
7                               75
10                              68
12                              65
18                              60
20                              57
25                              50

Table1

Here is the T-SQL code that takes the X and Y values from Table1 and calculates the regression coefficients:

declare @n DECIMAL(38, 10),
        @Intercept DECIMAL(38, 10),
        @Slope DECIMAL(38, 10),
        @R2 DECIMAL(38, 10)

-- @n is declared as DECIMAL (and Y is multiplied by 1.0 in the averages)
-- so that none of the divisions below are truncated by integer arithmetic
select @n = count(*) from Table1

select
    @Slope = ((@n * sum(x * y)) - (sum(x) * sum(y)))
             / ((@n * sum(power(x, 2))) - power(sum(x), 2)),
    @Intercept = avg(y * 1.0)
             - ((@n * sum(x * y)) - (sum(x) * sum(y)))
               / ((@n * sum(power(x, 2))) - power(sum(x), 2)) * avg(x * 1.0)
from Table1

select @R2 = (@Intercept * sum(y) + @Slope * sum(x * y) - sum(y) * sum(y) / @n)
             / (sum(y * y) - sum(y) * sum(y) / @n)
from Table1

select @Slope as Slope, @Intercept as Intercept, @R2 as R2

 

So, to predict the insurance premium for a driver with 15 years of experience:

Y = Intercept + 15 * Slope (where Y is the insurance premium)
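Continuing from the script above, the prediction itself is one more SELECT. The coefficient values in the comment are roughly what the sample data in Table1 produces:

```sql
-- With the sample data, @Slope comes out to about -1.2752 and @Intercept
-- to about 82.05, so 15 years of experience predicts a premium near 62.9.
SELECT @Intercept + 15 * @Slope AS PredictedPremium;
```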

For an in-depth article about linear regression, please refer to the Wikipedia entry.

SQL Server Reporting Services Disaster Recovery Case Study

Authors: Ayad Shammout & Denny Lee

Technical Reviewers: Lukasz Pawlowski, Sanjay Mishra, Prem Mehra, Michael Thomassy, Lindsey Allen, Burzin Patel, Andy Wu, Thomas Kejser, Carl Rabeler, Quoc Bui

Applies to: SQL Server 2005, SQL Server 2008

Introduction

There are many ways to perform disaster recovery with Microsoft SQL Server Reporting Services (SSRS). Based upon customer experience and internal testing, this technical note provides guidance around best practices to design and manage robust end-to-end disaster recovery (DR). This DR method will involve both automatic and manual failover in the form of content switches, SQL Server failover clustering, and database mirroring. This case study focuses on the lessons learned from CareGroup Healthcare System.

Environment Setup

Figure 1 shows a basic disaster recovery scenario. The left side (in green) represents the primary data center, while the right side (in orange) represents the disaster recovery (DR) site. In general the two sites should be identical or at least very similar to each other so that the DR site will be able to handle the load of the primary data center; all of this should be transparent to the users.

image

Figure 1: Primary data center and the disaster recovery site

Primary Data Center

The primary data center contains the Reporting Services Web service servers (there is only one for the purpose of this diagram); the report catalogs are placed on a separate server (RSDB).  As a general rule, you should set up the report catalogs (RSDB) in a failover cluster, because the report catalogs are central to the Reporting Services environment.

Disaster Recovery Site

The disaster recovery site should closely duplicate the primary data center Reporting Services environment. It should be located in another geographic location such that if the primary data center experiences a resource disruption (for example, in connectivity or power), all traffic can be redirected to the alternate site with minimal disruption to service. In noncritical scenarios, you can utilize fewer resources (for example, less powerful servers, fewer servers) but for mission-critical systems, we highly recommend a server-for-server duplication of resources.

Network Setup

To ensure connectivity from the clients to the primary data center and the disaster recovery site, a common technique is to use a content switch to load-balance traffic within the individual sites as well as between the global sites. In the case of CareGroup Healthcare, a Cisco GSS is used as the content switch. As well, there is direct fiber network connectivity between the primary data center and the disaster recovery site to ensure minimal latencies for any communication between the two centers. If the primary site goes down for any reason, the content switch transparently redirects all client traffic to the disaster recovery set of Reporting Services servers. If the content switch is unavailable, the IP address can be changed at the DNS level. This latter change is a manual switch with a slightly longer network outage, which is due to the DNS cache clearing the old IP address and pointing to the new one.

Database Setup

For the CareGroup Healthcare environment, the primary RSDB instance within the primary data center is named Boston\sql4 and is a SQL Server cluster using SQL Server 2008 Enterprise, with the content switch pointing to the alias sql4. An active/passive failover cluster is used; this allows other active database instances to be located on the server in the passive node. This technique works well if the passive node is not overutilized such that it can handle the Reporting Services workload if the Reporting Services active node goes offline. It may make sense to go to an active/active cluster if there is enough traffic, concurrent users, and/or resources, though the discussion of the tradeoffs between idle machines and high availability scenarios is outside the scope of this technical note. Each of these clustering techniques has unique advantages; ultimately, it is important to ensure that the report catalogs are consistently running with minimal downtime. For more information about how to best configure a SQL Server clustering environment, see SQL Server 2008 Failover Clustering.

To ensure that the RSDB within the disaster recovery site is up-to-date, a common technique is to use asynchronous database mirroring. For the CareGroup Healthcare environment, the mirrored RSDB instance is named Montréal\sql4 and is a SQL Server cluster using SQL Server 2008 Enterprise. Note that while the domain names are different (Montréal vs. Boston in this case), the SQL Server clusters for the primary data center and the disaster recovery site have the same name (more on this later). The key database to keep in sync is the report server (RSDB) database, because it stores all of the report metadata, including report definitions, report history and snapshots, and scheduling information. All Reporting Services operations must connect to this database to access its metadata to perform their functions. The reason asynchronous database mirroring is commonly chosen is that asynchronous mirroring has minimal or no impact on response time performance. In addition, asynchronous mirroring works well because RSDB metadata is not frequently updated.

The following table describes an example endpoint setup for these database mirrors.

image

Initializing Database Mirror

A relatively easy way to initialize a database mirroring setup is to:

1) Make full and transaction log backups of the Reporting Services databases on the principal server.

2) Copy the backups over to the disaster recovery site, restoring each Reporting Services database in no-recovery mode.

3) Set up the failover partner on the mirror (that is, the DR site) before you set up the failover partner on the principal server.
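The three steps above can be sketched in T-SQL as follows. The share path and endpoint addresses are placeholders for your own environment; the server names follow this case study's naming. Repeat for each Reporting Services database you mirror:

```sql
-- 1) On the principal (Boston\sql4): full and transaction log backups.
BACKUP DATABASE ReportingServer TO DISK = N'\\share\ReportingServer.bak';
BACKUP LOG ReportingServer TO DISK = N'\\share\ReportingServer.trn';

-- 2) On the mirror (Montreal\sql4): restore both in no-recovery mode.
RESTORE DATABASE ReportingServer
    FROM DISK = N'\\share\ReportingServer.bak' WITH NORECOVERY;
RESTORE LOG ReportingServer
    FROM DISK = N'\\share\ReportingServer.trn' WITH NORECOVERY;

-- 3) Set the partner on the mirror first, then on the principal
--    (endpoint addresses are placeholders).
-- On the mirror:
ALTER DATABASE ReportingServer SET PARTNER = N'TCP://boston-sql4.contoso.com:5022';
-- Then on the principal:
ALTER DATABASE ReportingServer SET PARTNER = N'TCP://montreal-sql4.contoso.com:5022';
```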

Failover Scenarios

There are three main disaster recovery scenarios that can occur within this Reporting Services disaster recovery environment:

· Reporting Services servers within the primary data center go offline.

· The active server for the RSDB in the primary data center failover cluster goes offline.

· The entire cluster for the RSDB in the primary data center goes offline.

Please note that this note does not explicitly cover the scenarios concerning the Reporting Source databases. While most of the information here is applicable to these databases (such as data marts, operational data stores, transactional systems, and data warehouses), they have their own disaster recovery requirements and procedures that should be separate from the reporting infrastructure.

Reporting Services Servers within the Primary Data Center Go Offline

As noted in Figure 2, both Reporting Services servers point to the same cluster of databases. In this case, they both point to the active/passive database cluster Boston\sql4.

image

Figure 2: Reporting Services servers within the Primary Data Center go offline

In this scenario where the Reporting Services servers within the primary data center go offline, the hardware content switch will detect a network connectivity issue and automatically fail over to the disaster recovery site. By default, the Reporting Services farm connects to the report catalog in the primary data center because it has the most up-to-date information (provided that the network is fast and responsive). Therefore in most cases, the client connectivity to the Reporting Services environment in the DR site is minimally impacted (clients might need to be closed and restarted to establish a new network connection, or they could receive a broken link/page in the browser and need to refresh or restart their report execution).

Active RSDB Server in the Primary Data Center Failover Cluster Goes Offline

If the active server in the primary data center failover cluster goes offline, the active/passive setup of the SQL Server failover cluster will kick in. The active server will automatically fail over to the passive instance of the node with little to no service interruption. The report server may report that it cannot connect to the report server database, but after the passive node comes online, it will automatically re-establish the connection. All report server connectivity will continue as previously performed except that now the passive instance has become the active instance.

image

Figure 3: Active cluster for RSDB in the primary data center goes offline

Entire RSDB Cluster in the Primary Data Center Goes Offline

If the primary data center RSDB SQL Server cluster goes offline, the disaster recovery solution is to manually fail over to the RSDB on the disaster recovery site to continue serving up reports. But note that this is a manual failover process, which requires manual configuration of the Reporting Services servers in the Reporting Services farm to point to the RSDB instance within the disaster recovery site. It is important to be aware that, with asynchronous mirroring, manual failover by definition forces the service to allow data loss. With Reporting Services, however, the data loss should be minimal because report metadata is not frequently updated and snapshots can be re-created. By configuring the primary data center and disaster recovery site RSDBs with identical instance names (recall that both instances are named sql4 but that they are in different domains), the manual failover is simplified because the Reporting Services servers will connect to an identically named instance of the RSDB (different domain, but same name).

image

Figure 4: Entire cluster for RSDB in the primary data center goes offline

Primary Data Center Outage Scenarios

If the entire primary data center goes offline for either a planned or unplanned outage, the content switch will automatically fail over to the Reporting Services farm in the disaster recovery site, but you will need to perform a manual failover to ensure that the Reporting Services servers can connect to the copy of the RSDB in the disaster recovery site.

Planned Outages

For planned outages:

1) The content switch will perform the task of suspending the primary IP addresses and activating the DR site IP addresses so that all connections will be redirected to the disaster recovery site.

2) You will need to execute a script on the primary database to manually switch to the mirrored database. After the partner database becomes the primary, the same script can be re-executed to switch back to the original primary database. Here is an example:

-- Script to manually fail over to the mirror database
-- Example: switch from Boston\sql4 to Montreal\sql4
USE [master]
GO
-- Database ReportingServer
ALTER DATABASE ReportingServer SET PARTNER FAILOVER
GO

Unplanned Outages

For unplanned outages:

1) The content switch will perform the task of suspending the primary IP addresses and activating the DR site IP addresses so that all connections will be redirected to the disaster recovery site.

2) You will need to run a manual failover script to force the service to switch with possible data loss. The script needs to run on the mirrored server in the case of loss of the principal server. Here is an example:

-- Script to force database failover of Boston\sql4 to Montreal\sql4
-- in case of SQL Server instance failure
USE [master]
GO
-- Database ReportingServer
ALTER DATABASE ReportingServer
SET PARTNER FORCE_SERVICE_ALLOW_DATA_LOSS
GO

After this script is run, the mirror server is the principal server and the mirror itself is suspended.

For more information about Reporting Services configuration for the database connection, see Configuring a Report Server Database Connection.

Break/Rebuild the Mirror

After the system goes down, it is undesirable to have the active system push data to the passive system (it cannot anyway). So it is important to break the mirror with the primary data center server and then re-establish it after the primary data center (in this example, Boston) server is back online. Note that the disaster recovery (in this example, Montréal) server is now the active one, so you will need to manually transfer the data from the DR site server back to the primary data center server (in this example, Montréal to Boston). To fully break the mirror, execute the following on the primary server:

ALTER DATABASE ReportingServer SET PARTNER OFF

RESTORE DATABASE ReportingServer WITH RECOVERY

To rebuild the mirror, follow the original steps in the Initializing Database Mirror section above.

Detail Logistics

It is important to understand that the Reporting Services report jobs, schedules, and subscriptions will not be mirrored between the RSDB from the primary data center and the disaster recovery site. This occurs because SQL Server Agent jobs are stored in the msdb database, which is not mirrored. A common method to resolve this issue is to script all of these Reporting Services jobs in the primary site and then use these scripts to re-create them on the disaster recovery site (with the jobs disabled); activate these jobs only if the system fails over. In the case of Reporting Services database migration, it is not necessary to perform this operation, as noted in How to: Migrate a Reporting Services Installation.
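Activating those pre-created (disabled) jobs after a failover is a one-liner per job; the job name below is hypothetical:

```sql
-- Enable a re-created subscription job on the DR site after failover
-- (the job name is hypothetical).
EXEC msdb.dbo.sp_update_job
    @job_name = N'ReportSubscription_Sales',
    @enabled = 1;
```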

Note that logins created on the primary site may not be replicated to the disaster recovery site. To work around this problem, we recommend storing all of these logins into their own database and mirroring this database to the DR site as well. Then create a script to generate the logins within the DR site to re-establish the logins. For more information, see How to transfer logins and passwords between instances of SQL Server.

Conclusion

With these guidelines, you can prepare your Reporting Services environment for disaster recovery scenarios. The key here is that you will need to use a combination of hardware components, network connectivity, SQL Server database mirroring, and architecture to create a disaster recovery infrastructure to support an enterprise SQL Server Reporting Services environment. As noted, this case study has used specific techniques such as active/passive clustering and asynchronous database mirroring to support their disaster recovery requirements. Your situation may be different, but this technical note should provide you with the basic framework for Reporting Services disaster recovery. The hope is that you will never need to use it, but at least now you are ready in case you do.

Quick Tips and Q&A for SQL Server Analysis Services to Hive

Originally posted on Denny Lee:

Over the last few weeks I’ve fielded some questions concerning the paper that Dave Mariani (@dmariani) and I had contributed to the whitepaper / case study SQL Server Analysis Services to Hive; below is an aggregate of those tips – hope this helps!

Q: I’m running into the HiveODBC Error message “..expected data length is 334…”

A: Check out the post for details on how to potentially resolve this: HiveODBC error message “..expected data length is 334…”


Q: Can I connect Analysis Services Tabular to Hive instead of Multidimensional?

A: Yes! Ayad Shammout (@aashammout) has a couple of great blog posts that dig into exactly how to do this; it is similar to how PowerPivot can connect to Hive.

Import Hadoop Data into Analysis Services Tabular

SSAS Tabular using HiveODBC connection


Q: The original  SQL Server Analysis Services to Hive whitepaper had Analysis…


SQL Server Enterprise Management Data Warehouse (eMDW)


A few years ago, I was managing hundreds of SQL Servers with hundreds to thousands of databases in a very busy environment. My challenge was how to manage all the SQL Server instances and databases and keep a complete inventory of my environment. SQL Server Management Studio only lets me view data per instance, so I had to connect to each instance and browse its properties individually, with no central view. I needed a tool that showed the total number of servers, with summaries I could drill into by environment and business unit down to the object level. I wanted to look up a user's permissions and level of access across SQL Servers; to know how many databases, objects, and users exist per environment (production, test, dev, and QA); and to see how much total storage my SQL Servers use, with the ability to drill down, break it down by file type, and display growth trends.

Long story short, I needed a centralized management view of my SQL Server environments: a complete inventory that could answer the questions above.

I began building a centralized reporting system by creating a master repository database on a SQL Server instance. The repository would store data about every SQL Server instance, database, object, and user in an enterprise database environment.

I developed a custom SSIS package that collects a plethora of SQL data, with details on SQL Server instances and databases, and stores it in the central database.

After the data is in the repository, you can use Reporting Services to view it. I developed dashboards and reports in SQL Server Reporting Services covering instance, database, object, and user data. This solution helps me monitor the status and inventory of hundreds to thousands of databases, assess how big my environment is, and access enterprise data from centralized reports.
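As a flavor of the kind of collection query the SSIS package runs on each instance, the following sketch pulls database names, state, and file sizes by type from the system catalog (the repository tables it would feed are up to you; this shows only the collection side):

```sql
-- Per-instance inventory sketch: one row per database and file type,
-- with total size in MB (sys.master_files stores size in 8-KB pages).
SELECT d.name        AS database_name,
       d.state_desc  AS database_state,
       mf.type_desc  AS file_type,
       SUM(mf.size) * 8 / 1024 AS size_mb
FROM sys.databases AS d
JOIN sys.master_files AS mf
    ON mf.database_id = d.database_id
GROUP BY d.name, d.state_desc, mf.type_desc
ORDER BY d.name;
```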

You can download the entire solution at: GitHub