Data Mining with MS SQL Server 2005

Background

With the accelerated automation and information explosion, deriving ready-to-use information from ever-growing data becomes more and more important. This is process is called data mining.

The prerequisites for conducting effective scalable data mining are as following:

  1. Understanding the meaning of the data.
  2. Strong analytical skills and mathematical foundation, especially in statistics.
  3. Quality data organized in an efficient database.
  4. Tools for data analysis.

The first two are about personnel quality. The last two are about resource.

As for the resource, there are many database engines on the market - IBM, Oracle, MySQL (free open source), etc. Microsoft, a compnay founded by programmers, has been very friendly to programmers (n terms of products features and pricing strategy) and very successful in retaining them. The author, out of fear of monopoly, has been trying to support alternatives to Microsoft products, but has failed again and again for very simple reason - Microsoft products often offer signficantly better productivity over other competing products with bearable pricing. However, Microsoft has undoubtedly shown the sign of aging big corporation suffering from bureaucracy, disconnection. For now, the author stays with MS SQL server.

This article assumes the readers understand the basic concepts of database.

MS SQL Server Analysis Services (SSAS) should be treated as a separate engine in parallel with it database engine that every SQL server user is familiar with. Table 1 shows the comparison between these two engines:

Table 1. Comparison between MS SQL server database engine and analysis services engine

 Database EngineAnalysis Services (SSAS)
LanguageTransact-SQLMDX (Multidimensional Expressions), DMX (Data Mining Extensions)
ConnectionConnection stringsData Sources
Essential Data ComponentsTable, ViewData Source View
Most Effective Programming APIADO.NetADOMD.Net, AMO (Analysis Management Objects)
ToolsSQL Management Studio, Visual StudioSQL Management Studio, Business Intelligence Studio
   
   
   
   

Programming SSAS

There are three ways to use SSAS

  1. Use Business Intelligence Studio, which is essentially an extension of Visual Studio, to build data mining models and generate predictions. This is by far the easiest and fastest way to use SSAS for some ad hoc, non-iterative data analyses with little code writing. This can be done without any knowledge of MDX or DMX.
  2. Write MDX or DMX scripts in SQL Management Studio. This approach is requires some proficiency in DMX or DMX. In theory, it can do whatever that can be done with SSAS. It is not practical to use this approach to do complex iterative data analysis and even more difficult to combine data analysis using SSAS with other complex data processing involving SQL server. It is impossible to use this approach to deliver a data analysis products.
  3. Programming with any language with .Net support. This is the most powerful, flexible approach that combines ADOMD.Net's ability to program SQL server, ADOMD.Net and AMO's ability to program SSAS and all other capabilities brought by the language to do complicated data analysis and deliver an application with friendly user interface.

This article focuses on the third approach - using C#, the language with best integration with .Net, to do all the work. The steps to analyze data are detailed as following

    1. Connect to a server. The easiest way to create the connection string is to use UDL file to connect to the server and test it, then copy its connection string. Suppose the server is named RD and resides on a local machine, the code would be like following:

string sConnect = "Provider=MSOLAP.3;Cache Authentication=False;Integrated Security=SSPI;Persist Security Info=False;Initial Catalog=Stock;Data Source=(local)\\RD;Impersonation Level=Impersonate;Mode=ReadWrite;Auto Synch Period=20000;Default Isolation Mode=0;Default MDX Visual Mode=0;MDX Compatibility=0;MDX Unique Name Style=0;Non Empty Threshold=0;SQLQueryMode=Calculated;Safety Options=1;Secured Cell Value=0;SQL Compatibility=0;Compression Level=0;Real Time Olap=False;Packet Size=4096";
Server svr = new Server();
svr.Connect(sConnect);

    1. Create a or simply connect to a database. Suppose the data base name is TimeSeries. The code to create the database would be like the following:

      Database db = null;
      string sDBName = "TimeSeries";
      db = svr.Databases.FindByName(sDBName);
      if (db != null)
      {
      &nbsp&nbsp&nbsp&nbsp&nbsp&nbsp db.Drop();
      }
      db = svr.Databases.Add(sDBName);
      db.Update();

    2. Create or simply find the DataSource. Suppose the DataSource name is "ds", then the code would be like the following:

string dsName = "ds";
DataSource ds = db.DataSources.FindByName(dsName);
if (ds != null)
{
&nbsp&nbsp&nbsp&nbsp ds.Drop();
}
// Create the data source
ds = db.DataSources.Add(dsName, dsName);
//connect to databse Finance on SQL server SQL1
ds.ConnectionString = "Provider=SQLOLEDB.1;Data Source=SQL1;Integrated Security=SSPI;Initial Catalog=Finance";
ds.ImpersonationInfo = new ImpersonationInfo("UserID", "Password");
// Send the data source definition to the server.
ds.Update();

    1. Create a Data Source View. There are different types of Data Source Views. It can a simple table from the data source. The following example is a Data Source View of name query.

      string sDataSourceViewName = "TimeSeriesPriceChangeR1";
      DataSourceView dsv = db.DataSourceViews.FindByName(sDataSourceViewName);
      if (dsv != null)
      {
      &nbsp&nbsp&nbsp&nbsp dsv.Drop();
      }
      dsv = db.DataSourceViews.Add(sDataSourceViewName);
      dsv.DataSourceID =
      dsName;
      DataSet dset = new DataSet();
      string sSelect = "SELECT ReportDate, Ticker, PriceChangeR1 FROM StockState WHERE DATEDIFF(day, ReportDate, GETDATE()) < 150";
      OleDbDataAdapter oldbda = new OleDbDataAdapter(sSelect, oldbCnInfo);
      oldbda.SelectCommand.CommandTimeout = 100;
      oldbda.FillSchema(dset, SchemaType.Mapped, sDataSourceViewName);
      //add extended properties to the DataSet indicating it is a names query
      dset.Tables[sDataSourceViewName].ExtendedProperties.Add("TableType", "View"); //this is one of the key statements to make this a named query
      dset.Tables[sDataSourceViewName].ExtendedProperties.Add("QueryDefinition", sSelect); //this is one of the key statements to make this a named query
      dset.Tables[sDataSourceViewName].ExtendedProperties.Add("IsLogical", "True");
      dset.Tables[sDataSourceViewName].ExtendedProperties.Add("DbSchemaName", "dbo");
      dset.Tables[sDataSourceViewName].ExtendedProperties.Add("DbTableName", sDataSourceViewName);
      dset.Tables[sDataSourceViewName].ExtendedProperties.Add("FriendlyName", sDataSourceViewName);
      dsv.Schema = dset;
      dsv.Schema.Locale = CultureInfo.CurrentCulture;
      // Send the data source view definition to the server
      dsv.Update();
    2. Create a Mining Structure. Different mining models have different structures. The following is an example for MS Time Series:

      //create corresponding mining structure
      string sMS = "MSTimeSeries";
      Microsoft.AnalysisServices.MiningStructure ms = db.MiningStructures.FindByName(sMS);
      if (ms != null)
      {
      &nbsp&nbsp&nbsp&nbsp dms.Drop();
      }
      ms = db.MiningStructures.Add(sMS, sMS);
      ms.Source = new DataSourceViewBinding(sDataSourceViewName);

      //add the key columns
      ScalarMiningStructureColumn mscReportDate = ms.Columns.Add("ReportDate", "ReportDate");
      mscReportDate.IsKey = true;
      mscReportDate.Type = MiningStructureColumnTypes.Date;
      mscReportDate.Content = MiningStructureColumnContents.KeyTime;
      mscReportDate.KeyColumns.Add(sDataSourceViewName, "ReportDate", OleDbType.Date);

      ScalarMiningStructureColumn mscTicker = ms.Columns.Add("Ticker", "Ticker");
      mscTicker.IsKey = true;
      mscTicker.Type = MiningStructureColumnTypes.Text;
      mscTicker.Content = MiningStructureColumnContents.Key;
      mscTicker.KeyColumns.Add(sDataSourceViewName, "Ticker", OleDbType.Date);

      ScalarMiningStructureColumn mscObjective = ms.Columns.Add("PriceChangeR1", "PriceChangeR1");
      mscObjective.Type = MiningStructureColumnTypes.Double;
      mscObjective.Content = MiningStructureColumnContents.Continuous;
      mscObjective.KeyColumns.Add(sDataSourceViewName, "PriceChangeR1", OleDbType.Single);

ms.Update();

    1. Create a Mining Model.

if (ms.MiningModels.ContainsName("TimeSeries"))
{
&nbsp&nbsp&nbsp&nbsp ms.MiningModels["TimeSeries"].Drop();
}
Microsoft.AnalysisServices.MiningModel mm = ms.CreateMiningModel(true, "TimeSeries");
mm.Algorithm = MiningModelAlgorithms.MicrosoftTimeSeries;
mm.AlgorithmParameters.Add("MISSING_VALUE_SUBSTITUTION", "Previous");
mm.AlgorithmParameters.Add("HISTORIC_MODEL_GAP", 1);
mm.AlgorithmParameters.Add("HISTORIC_MODEL_COUNT", 100);

Microsoft.AnalysisServices.MiningModelColumn mmcObjective = mm.Columns.Find("PriceChangeR1");
mmcObjective.SourceColumnID = "PriceChangeR1";
mmcObjective.Usage = MiningModelColumnUsages.PredictOnly;

Microsoft.AnalysisServices.MiningModelColumn mmcReportDate = mm.Columns.Find("ReportDate");
mmcReportDate.SourceColumnID = "ReportDate";
mmcReportDate.Usage = MiningModelColumnUsages.Key;

Microsoft.AnalysisServices.MiningModelColumn mmcTicker = mm.Columns.Find("Ticker");
mmcTicker.SourceColumnID = "Ticker";
mmcTicker.Usage = MiningModelColumnUsages.Key;

mm.Update();

    1. Process the model..

mm.Process();

 

(originally written in 2011)