
Data Infrastructure

Data is critically important to Digital Decisioning. It is the basis for predictive analytic models that are embedded in Digital Decisioning, as well as for the analysis of decision effectiveness. Digital Decisioning must be integrated with the data infrastructure so that the data needed for a decision can be effectively passed to the system and recommended actions and associated information returned. Information about how decisions were made must be stored so that it can be accessed and analyzed later.

Five particular aspects of data infrastructure are important for Digital Decisioning.

Operational databases must be integrated with Digital Decisioning. Data warehouses contain information on performance and often represent the only cross-silo information store available for analysis. Analytic data marts are increasingly common for solution-centric analysis. Finally, in-database analytics and Big Data platforms are playing a growing role in many organizations.

Operational Databases

Operational databases contain both the raw operational data often used for building predictive analytic models and the transactional information for use at decision time. Generally, operational databases will support the transactional element of their role effectively. Digital Decisioning needs much the same access to transactional data as other systems: it must be able to quickly access live data for the transaction being processed, and it must be able to write data back—both action data and supporting information, such as decision execution logs and reason codes. Neither of these is out of the ordinary, and most organizations’ operational databases will support them well.
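
The write-back side of that integration can be very simple in practice. The sketch below is a minimal illustration, not a prescribed design: the database file, table, and column names are hypothetical, and SQLite stands in for whatever operational database is actually in use.

```python
import json
import sqlite3
from datetime import datetime, timezone

# Hypothetical operational database; SQLite used purely for illustration.
conn = sqlite3.connect("operational.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS decision_log (
        transaction_id TEXT,
        decided_at     TEXT,
        action         TEXT,
        reason_codes   TEXT   -- JSON list of reason codes
    )
""")

def log_decision(transaction_id, action, reason_codes):
    """Write the recommended action and supporting reason codes back."""
    conn.execute(
        "INSERT INTO decision_log VALUES (?, ?, ?, ?)",
        (
            transaction_id,
            datetime.now(timezone.utc).isoformat(),
            action,
            json.dumps(reason_codes),
        ),
    )
    conn.commit()

# Example write-back for one decision on one transaction.
log_decision("TX-1001", "refer_to_underwriter", ["R012", "R047"])
```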

The use of operational databases to support the creation of predictive analytic models is more problematic. Building predictive analytic models requires history and transactional detail over time. For instance, building a model to predict customer retention requires access to information about when customers cancel or renew their subscriptions, as well as information about their behavior in the period leading up to a cancellation or renewal. The detailed history of this customer behavior is not always available in a data warehouse or analytical data mart, so the models will be built with information extracted directly from operational databases.

Operational databases are often not designed for this kind of analytic work for a number of reasons:

  • Data is overwritten: When new data is written back to the database, it simply overwrites existing data. This can create what is known as a “leak from the future,” because it is no longer possible to recreate the data as it was at a particular point in time—for instance, a field that records the total value of orders placed to date by a customer is updated after each order. As a result, it cannot be used when building a predictive analytic model, as its value at any given point in time cannot be determined (see the sketch after this list).
  • Not all records are available: If records are removed from the database under certain circumstances, then the data may suffer from what is known as “survivor bias.” If all customers who fail to renew within 12 months of canceling are removed from the database, then it may look like everyone who canceled renewed in the subsequent 12 months. Everyone who did not renew was removed, so the surviving records are no longer representative.
  • Outliers are eliminated: Many organizations have programs to clean and manage data to prevent bad data from getting into systems. These programs can be very effective, but they can also cause problems if they are too aggressive about removing outliers. This is particularly true when predicting fraud. It is the outlier values that are often most important, yet these can all too easily look like “bad data.”
  • Change logs are not kept: Few operational databases keep an easily accessible log of changes made to the database by systems. The number and types of changes made to data can be very important in predictive analytic models. For instance, in warranty claims, one of the best predictors of fraud is the number of attempts it took to get the address right for a service call. Most operational databases will only record the address once it is correct, however, and this means crucial data is lost.
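
To make the first pitfall concrete, here is a minimal sketch (hypothetical tables, pandas used for illustration) that rebuilds “total value of orders placed to date” as of a chosen observation date from the order history, rather than reading an overwritten running total. If only the overwritten field existed, the training data would leak information from after the observation date.

```python
import pandas as pd

# Hypothetical order history: one row per order, never overwritten.
orders = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2", "C1", "C2"],
    "order_date":  pd.to_datetime(
        ["2023-01-05", "2023-03-12", "2023-02-20", "2023-09-01", "2023-10-15"]),
    "order_value": [120.0, 80.0, 200.0, 55.0, 40.0],
})

def orders_to_date(history: pd.DataFrame, as_of: str) -> pd.DataFrame:
    """Total order value per customer as it looked on the as_of date."""
    cutoff = pd.Timestamp(as_of)
    snapshot = history[history["order_date"] <= cutoff]
    return (snapshot.groupby("customer_id")["order_value"]
                    .sum()
                    .rename(f"total_orders_to_{as_of}")
                    .reset_index())

# The feature as it would have looked when the model's outcome window began.
print(orders_to_date(orders, "2023-06-30"))
```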

Data Warehouses

Most organizations have either an enterprise data warehouse or a set of more localized data warehouses to complement their operational databases. Data warehouses typically contain data from multiple organizational silos. This data is often more integrated, better understood, and cleansed more thoroughly.

Data warehouses are often not used to develop predictive analytic models, despite the fact that these models often require data from multiple operational databases. Partly this is because data warehouses can suffer from the same problems noted above: cleansing the data for the data warehouse can eliminate critical outlier data, and the data warehouse may not store enough history to avoid leaks from the future. Despite this, data warehouses can be very useful for the construction of predictive analytic models, if built correctly:

  • Data warehouses are often not as space-constrained as operational databases and are more likely to be used for historical analysis. As such, there is less pressure to delete unused records and less likelihood of survivor bias problems in the data.
  • Similarly, the ability to store more data makes it more practical to store a new version of a record every time it is updated in one of the source systems (see the sketch after this list). If this is done, the data warehouse will not suffer from leaks from the future.
  • Many data warehouses are built to produce reports and analysis at a summary level. If only summary data is stored, they will be of little value for building predictive analytic models. The warehouse will be useful if transactional data is stored as well as the roll-up and summary data—and if cleansing and integration of data is handled at a transactional level.
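
As a sketch of what record versioning buys you, assume a hypothetical customer table in which every update adds a new row with valid_from/valid_to dates instead of overwriting the old one. The state of any record at any historical date can then be recovered:

```python
import pandas as pd

# Hypothetical versioned customer records: every update adds a row.
customers = pd.DataFrame({
    "customer_id": ["C1", "C1", "C1"],
    "segment":     ["bronze", "silver", "gold"],
    "valid_from":  pd.to_datetime(["2022-01-01", "2023-02-01", "2023-08-15"]),
    "valid_to":    pd.to_datetime(["2023-02-01", "2023-08-15", "2099-12-31"]),
})

def as_of(versions: pd.DataFrame, when: str) -> pd.DataFrame:
    """Return each record as it existed on the given date."""
    ts = pd.Timestamp(when)
    mask = (versions["valid_from"] <= ts) & (ts < versions["valid_to"])
    return versions[mask]

print(as_of(customers, "2023-05-01"))   # C1 was still "silver" on this date
```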

If a data warehouse is built correctly and with sufficient detail, it is a great source for building predictive analytic models, not least because it is the source for the ongoing reporting and analysis that business users see and use every day.

Best practices in decision-centric data
One of the best practices for successful Digital Decisioning relates to the management of data across your data infrastructure. Experience suggests that widespread adoption of Digital Decisioning and predictive analytic models requires that the data your business users see in their reporting and analysis tools, the data in your predictive analytic models, and the data in your operational systems be kept highly synchronized. This synchronization is required at two levels.
First, the data elements used must be available in all three environments. If a piece of information is being used in a predictive analytic model (“Number of times a customer has been late making a payment in the last 12 months”), then this should also be available to business users in their reporting and analysis environment, as well as available in the operational environment so that it can be used in a Decision Service.
Second, the data business users see when they run reports should be the same as the data they see when evaluating models. If a predictive analytic model divides customers into various buckets, then the total number of customers across those buckets should match the totals shown in the reporting environment.
The first form of synchronization is important for deployment and adoption of predictive analytic models. The second is essential for building trust and understanding between analytic and business teams.
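
As a rough illustration of the second kind of check, the sketch below (hypothetical data and bucket names) compares the customer counts behind a model’s score buckets against the totals a business user would see in the reporting environment. Mismatches like the one flagged here are exactly the kind of discrepancy that erodes trust between analytic and business teams.

```python
import pandas as pd

# Hypothetical: scored customers from the analytic environment...
scored = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3", "C4"],
    "risk_bucket": ["low", "low", "medium", "high"],
})
# ...and the customer counts a business user sees in reporting.
reporting_counts = pd.Series({"low": 2, "medium": 1, "high": 2}, name="reported")

model_counts = scored["risk_bucket"].value_counts().rename("scored")
reconciliation = pd.concat([model_counts, reporting_counts], axis=1).fillna(0)
reconciliation["difference"] = reconciliation["scored"] - reconciliation["reported"]
print(reconciliation)   # the "high" bucket is off by one customer
```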

Analytic Data Marts

Some organizations have developed analytic data marts. Data is extracted from operational databases or from an enterprise data warehouse and organized to focus on a particular solution area. Owned by a single business unit or business area, analytic data marts allow a particular group of users more control and flexibility when it comes to the data they need.

Reporting, spreadsheet access, and Online Analytical Processing (OLAP) needs often drive the creation of analytic data marts. They can and should also be used to support predictive analytic model creation. Applying the same criteria as noted for operational databases and data warehouses can ensure that the data has the characteristics needed to build a predictive analytic model. Because an analytic data mart is focused on a particular business problem, it is probable that all the data needed to build a predictive analytic model to support decisions in that business domain will be present in the mart. Similarly, all the data needed for performance analysis for a Digital Decisioning system may well be included in an analytic data mart for the business area.

In-Database Analytics

A number of database vendors, including IBM, Netezza (now IBM), Microsoft, Sybase, and Oracle, provide predictive analytic modeling that is built directly into the database. The benefit of in-database analytics is that models can be built and scored without the need to extract any data from the database. This has a positive impact on performance by limiting the amount of data movement and allowing the execution to take place on the often larger database hardware rather than on the analytic server hardware. Analytics within the database can be highly scalable if they take advantage of the parallelism of the database infrastructure and analyze the data in place inside the database. Just because an analytic runs within a database doesn’t mean it will perform better than one running outside the database: whether it does depends on whether the algorithm’s implementation takes advantage of the parallel/distributed infrastructure of the database. In-database analytics can involve in-database creation of predictive analytic models, in-database execution of predictive analytic models, or both.

In a standard analytic environment, an analyst extracts an analytic dataset from the database or data warehouse, integrating multiple sources, flattening the structure, and cleaning things up—before creating potentially large numbers of variables from the data. These variables often collapse behavior over time into a single value: taking payment records and creating a variable such as “Number of times payment was late in the last 12 months,” for instance. This enhanced analytic dataset (original data plus calculated variables) is then run through multiple modeling algorithms, and either the best algorithm is selected or an ensemble of the models is built.
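
For instance, such a variable might be derived along these lines. The payment table and column names are hypothetical, and pandas is used purely for illustration of the extract-then-transform style:

```python
import pandas as pd

# Hypothetical payment records with due dates and actual payment dates.
payments = pd.DataFrame({
    "customer_id": ["C1", "C1", "C1", "C2", "C2"],
    "due_date":    pd.to_datetime(
        ["2023-01-15", "2023-05-15", "2023-09-15", "2023-03-01", "2023-07-01"]),
    "paid_date":   pd.to_datetime(
        ["2023-01-20", "2023-05-14", "2023-10-02", "2023-03-01", "2023-07-09"]),
})

observation_date = pd.Timestamp("2023-12-31")
window_start = observation_date - pd.DateOffset(months=12)

in_window = payments[(payments["due_date"] >= window_start) &
                     (payments["due_date"] <= observation_date)]
late = in_window[in_window["paid_date"] > in_window["due_date"]]

# "Number of times payment was late in the last 12 months", per customer.
late_count_12m = late.groupby("customer_id").size().rename("late_payments_12m")
print(late_count_12m)
```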

In-database predictive analytic model creation involves running some part of this process on the database server—either the data integration and variable creation process alone or the whole process. Not only does this use the server’s processing capacity, it also eliminates the need to extract and move the data. By accessing the data directly on the same server, performance can be significantly improved, especially in the integration and variable creation pieces (less so in the modeling piece). Integration, cleansing, and variable creation can also be easily shared across multiple models, reducing the management and calculation overhead for multiple models. When predictive analytic model building is embedded within a relational database, it is usually implemented as a set of stored procedures. If these routines do not take advantage of the parallel/distributed architecture of the database, they offer little advantage over running analytics outside of the database environment.
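
To contrast with the extract-and-transform sketch above, the same late-payment variable could be created where the data lives by pushing the aggregation into the database as SQL. This is a conceptual sketch only: SQLite stands in for a real database platform, and the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments (customer_id TEXT, due_date TEXT, paid_date TEXT);
    INSERT INTO payments VALUES
        ('C1', '2023-01-15', '2023-01-20'),
        ('C1', '2023-05-15', '2023-05-14'),
        ('C2', '2023-07-01', '2023-07-09');
""")

# The variable is created inside the database engine; only the small
# per-customer result set ever leaves it.
rows = conn.execute("""
    SELECT customer_id,
           SUM(CASE WHEN paid_date > due_date THEN 1 ELSE 0 END) AS late_payments_12m
    FROM payments
    WHERE due_date >= DATE('2023-12-31', '-12 months')
    GROUP BY customer_id
""").fetchall()
print(rows)   # [('C1', 1), ('C2', 1)]
```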

In-database analytics can also take a predictive analytic model (whether or not the model was produced in-database) and execute it in the database. This pushes the execution onto the database server and lets you score the records in the database or warehouse by running them through the algorithm and storing the result. More interestingly, it can also mean being able to treat the model as though it were an attribute and calculate it live when the attribute is requested. So, for instance, the segment to which a customer belongs can be treated by calling applications as though it were a stored data element, even though it is calculated in real time from a model when it is needed. In this approach, the data is not persisted, and the in-database engine acts like a layer on top of the stored data, enhancing it with the model results.

When predictive analytic model scoring is embedded within a relational database, it is usually exposed as an extension to the SQL language. Because business analysts do not usually want to spend their day writing SQL, some workbenches support the use of the database algorithms; in this case, the workbench becomes the interface through which the analyst generates the SQL automatically, without having to write it by hand.

The other in-database technique that is gaining acceptance is the creation of SQL User-Defined Functions (UDFs), which allow vendors to create code that can be executed by the database. Today, this is being used to let predictive analytic models built in the vendor’s workbench be scored directly within the database. These functions can be called by an application as part of its transaction, providing one way to integrate a Decision Service. These scoring functions can also be called during batch decision processes to efficiently process a large number of decisions at once. A well-designed system can process millions to billions of decisions within a small batch window.
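
As a conceptual sketch of the idea—with SQLite and a hand-written scoring function standing in for a real database platform and a real predictive analytic model—a scoring routine can be registered as a UDF and exposed through a view, so callers see the score as if it were just another column, computed live whenever it is requested:

```python
import sqlite3

def churn_score(tenure_months, late_payments):
    """Toy stand-in for a real predictive model's scoring function."""
    return round(max(0.0, min(1.0, 0.1 + 0.05 * late_payments - 0.002 * tenure_months)), 3)

conn = sqlite3.connect(":memory:")
conn.create_function("churn_score", 2, churn_score)   # register the scoring UDF

conn.executescript("""
    CREATE TABLE customers (customer_id TEXT, tenure_months INTEGER, late_payments INTEGER);
    INSERT INTO customers VALUES ('C1', 48, 0), ('C2', 6, 3);

    -- The model looks like an attribute: no score is persisted, it is
    -- calculated by the UDF each time the view is queried.
    CREATE VIEW customer_scores AS
        SELECT customer_id,
               churn_score(tenure_months, late_payments) AS churn_score
        FROM customers;
""")

print(conn.execute("SELECT * FROM customer_scores").fetchall())
```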

When deploying predictive analytic models to a database for in-database scoring, some systems will require the model to be recoded or a database administrator to deploy the model. This limits how quickly a Decision Service can pick up new versions of a predictive analytic model. In some scenarios, such as fraud detection and online product up-sell, the delay between discovering new patterns during model building and deploying those models can result in significant revenue or opportunity losses.

If you have in-database analytics, it is worthwhile to rethink the balance between pushing models into the database for execution and exporting them to a BRMS using PMML. The former works well for something like a regression model or neural network, and for models that are “all or nothing,” where the business users are comfortable with the idea of a model as a “black box.” When you need to interact with the model (to see which association rules to use, for instance), or when the visibility of the model is critical to the business, a BRMS has proven to be a very effective way to deploy models. Exposing the models as readable, manageable business rules makes it easier to gain acceptance and to integrate the model results with the rest of the business rules involved in a business decision. Regardless, building the model in-database is still worthwhile.

Big Data platforms

The topic of Big Data and Big Data Analytics is hot in the industry because new hardware systems and software are making it affordable to collect and analyze much larger volumes of data than previously possible. Big Data is typically measured in many terabytes or even petabytes of data.

Often this data is not as highly structured as data that is typically stored in relational databases, so the term “semi-structured” data is often used. Semi-structured data typically consists of log data from applications, data from sensors or devices, or data from network traffic or operations. The data has discrete data values, but the format may vary across files or even within records in the same file, and the format may evolve over time. Big Data also includes the ever-increasing volumes of unstructured data, such as text documents, audio, and video.

Organizations are being driven to retain this type of information for long periods of time, which is driving a need for new storage approaches. They are also finding value in analyzing this wealth of data to make better decisions, and each type of data requires different analysis techniques for deriving value.

A Big Data solution starts with a system that can efficiently handle the volume and type of data to be stored and analyzed. Apache Hadoop is one of the most prominent platforms for Big Data. It can affordably scale to many petabytes of storage using the Hadoop Distributed File System (HDFS), and it provides an infrastructure for analyzing this data efficiently, called Hadoop Map/Reduce. There is an ecosystem of projects around Hadoop that adds capabilities to query, restructure, and analyze data using the infrastructure of HDFS and Map/Reduce.

To efficiently process such large quantities of data, Big Data systems combine data storage and processing on the same piece of hardware, which reduces the amount of data that needs to be moved in order to be analyzed. The data in Big Data systems is divided and distributed to the various machines in the cluster so that analysis can be performed in parallel on the different divisions of the data. In order to run in this type of distributed system, traditional analytics need to be redesigned to run in parallel and within the Big Data infrastructure.
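
As a rough illustration of what “redesigned to run in parallel” means in practice, here is a minimal Hadoop Streaming-style sketch. It assumes, purely for illustration, that the raw events are tab-separated log lines with a customer ID in the first field; the mapper and reducer each read from standard input so that the framework can run many copies of them in parallel on different splits of the data.

```python
#!/usr/bin/env python3
"""Hadoop Streaming-style job: count events per customer.

Run as the mapper with `python3 count_events.py map` and as the reducer
with `python3 count_events.py reduce`. A streaming framework would launch
many parallel copies of each, feeding them splits of the data on stdin.
(Script name, field layout, and job are illustrative assumptions.)
"""
import sys

def mapper(lines):
    # Emit one (customer_id, 1) pair per event; assumes the customer ID
    # is the first tab-separated field of each log line.
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            print(f"{fields[0]}\t1")

def reducer(lines):
    # The framework sorts mapper output by key, so all counts for one
    # customer arrive together; sum them and emit a single total.
    current_key, total = None, 0
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{total}")
            current_key, total = key, 0
        total += int(value or 0)
    if current_key is not None:
        print(f"{current_key}\t{total}")

if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    (mapper if mode == "map" else reducer)(sys.stdin)
```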

Many Digital Decisioning systems make decisions about customers, and it is often helpful if these decisions are made based on a true 360-degree view of a customer. A Big Data platform can ensure that website logs, call detail records, social media, call center and customer service emails, and more all feed into these decisions. The non-traditional data managed by the Big Data platform can be analyzed, perhaps to derive customer sentiment, and this analysis can be fed into the existing data infrastructure for use in a predictive analytic model. The flexibility of Big Data platforms also has a role to play here. It is often not clear if a particular data source will add enough value to a decision to justify the cost of integrating or purchasing it. Using a Big Data platform initially can make it easier to use a new data source to improve a predictive analytic model, and, for example, to see if the improvement justifies the cost.

More data is usually good—but not always
Most data miners and builders of predictive analytic models will say that more data is (almost always) better. While this is generally true, there are circumstances where more data is not helpful. Too much data can cause analysis paralysis and over-engineered rules that consider too many outliers rather than focusing on the core issues that drive success. It can also cause performance problems if lots of data elements must be made available at decision time. Finally, it can become an end in itself rather than a means for better decisions.

Additionally, given the volumes and velocity of Big Data, it is unlikely that people can be plugged into the solution once it is up and running. Their role will be to do the analysis, make the judgments, and set up a system to handle the transactions as they flow through. When you are talking about decisions that involve real-time, streaming data in huge volumes, you are talking about building systems to handle those decisions: not visualizations or dashboards, but systems that handle things like multi-channel customer treatment decisions, detecting life-threatening situations in time to intervene, managing risk in real time, and more. Digital Decisioning thus represents a powerful tool for making the most of Big Data platforms.

Analytic capabilities for these platforms are not mature
The analytic capabilities in these new Big Data systems are evolving quickly, but users will find that tools in the Big Data space have not yet reached the ease of use and deep capabilities of traditional analytic tools.