
Machine Learning Workbench

A Predictive Analytics Workbench, or data mining workbench, is a set of software components designed to enable the analysis of a set of data sources, to determine the mathematical relationships within that data, and to produce a predictive analytic model that embodies those relationships.

A Predictive Analytics Workbench is most commonly targeted at an analyst—someone with a working knowledge of predictive analytics, a reasonable understanding of the data that is available within the organization to be analyzed, and a good understanding of the business needs. Analysts are usually found within a centralized analytics team or within a line of business, working as a marketing analyst or fraud analyst, for example.

Predictive Analytics Workbenches provide a set of capabilities that enable the user to perform the following tasks:

  • Connect to data
  • Prepare the data for modeling
  • Visualize the data
  • Build predictive and statistical models
  • Test predictive analytic models against holdout data
  • Assess business impact of models
  • Deploy models into production
  • Manage deployed models
The iterative process of building / testing a model
The process of building a predictive analytic model is usually iterative. The analyst will hold back a random sample of the data on which to test the performance of the model and will continue to rebuild models until the model performance is deemed to be acceptable. Performance is typically measured in terms of the accuracy of the predictions being made. The speed with which the model will run against large datasets may also be a factor. The iterative process is likely to involve obtaining new sources of data, or a larger sample of data, if the model performance does not meet the business goals.
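
To make that loop concrete, here is a minimal sketch using scikit-learn as a stand-in for a workbench's modeling engine. The synthetic dataset, the 30% holdout fraction, and the accuracy target are illustrative assumptions rather than values from any particular tool.

```python
# Sketch of the iterative build/test loop: hold back data, build a model,
# test it, and rebuild with different settings until performance is acceptable.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a real customer dataset.
X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

# Hold back a random sample of the data on which to test model performance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

TARGET_ACCURACY = 0.85          # assumed business goal, not a universal threshold
best_model, best_score = None, 0.0

# Rebuild models with different settings until the holdout accuracy is acceptable.
for max_depth in (2, 4, 6, 8, 10):
    model = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    model.fit(X_train, y_train)
    score = accuracy_score(y_test, model.predict(X_test))
    if score > best_score:
        best_model, best_score = model, score
    if score >= TARGET_ACCURACY:
        break                   # good enough -- stop iterating

print(f"Best holdout accuracy: {best_score:.3f}")
```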

A Predictive Analytics Workbench has many different modeling techniques, the most common of which are as follows:

  • Rule induction
  • Decision trees
  • Linear regression
  • Logistic regression
  • Clustering
  • K-means
  • Affinity analysis
  • Nearest neighbor
  • Neural networks
  • Genetic algorithms

These techniques can be used to build four main classes of models:

  • Predictive analytic models: These look for patterns or trends in the data and provide a predicted outcome. The output could be binary (Will Churn / Won’t Churn), a numeric value (Churn likelihood is 80%), or one of multiple results (Of the 20 campaigns we have running today, this customer is likely to respond to the Churn Campaign).
  • Clustering Models: These group similar sets of data together and provide ways of looking at the profiles of each of the clusters. Cluster models are often used to gain a broad understanding of a customer base; for example, “I have five different groups of customer types and the most profitable is made up of women between the ages of 18 and 25 who have been customers for over six months.” Cluster models can be used to segment the data prior to building predictive analytic models on each of the individual segments for finer-grained targeting.
  • Association Models: These look for situations in the data where one or more events have already occurred and there is a strong possibility of another event occurring. For example: “If a customer purchases a razor and after-shave, then that customer will purchase shaving cream with 80% confidence.” This is commonly used for analyzing customer basket data and when delivering recommendation engines behind online shopping sites.
  • Statistical Models: In the context of a workbench, statistical models are often used to validate hypotheses. For example, “I think that young men who have been a customer for over 24 months are a high churn risk, so what is the probability that this is a reliable finding, or just due to random variation?”
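The association example above can be reduced to a small worked calculation. The sketch below uses pandas with a handful of invented baskets to show how the support and confidence figures behind a rule such as “razor and after-shave → shaving cream” are computed; the data and item names are purely illustrative.

```python
# Worked example of association-rule support and confidence on tiny basket data.
import pandas as pd

# One row per basket; 1 means the item was purchased.
baskets = pd.DataFrame({
    "razor":         [1, 1, 1, 1, 0],
    "after_shave":   [1, 1, 1, 0, 1],
    "shaving_cream": [1, 1, 0, 1, 1],
})

antecedent = (baskets["razor"] == 1) & (baskets["after_shave"] == 1)
rule_holds = antecedent & (baskets["shaving_cream"] == 1)

support = rule_holds.mean()                        # fraction of all baskets containing all three items
confidence = rule_holds.sum() / antecedent.sum()   # P(shaving cream | razor and after-shave)

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```
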
Choosing which modeling technique to use
For each of the categories of model types within a workbench, there
could be over 20 different algorithms available to the analyst to use. Picking the right algorithm can be tricky—there is no guarantee that any one algorithm will work best in any particular situation. Often an analyst will start with a brute-force approach, building a number of different models using different algorithms. As soon as a good candidate is found, the model can be further refined by tweaking the model-building parameters.
Automation can help reduce the difficulty of model building, especially for new analysts. Automated modeling often builds a whole range of different models and keeps the top three to five models that performed the best. During scoring, the output is determined by using a combination of the top models or selecting the model result that has the highest accuracy, profit, or ROI.
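As a rough illustration of that automated approach, the sketch below builds several candidate models with scikit-learn, keeps the three with the best holdout accuracy, and scores records by majority vote. The specific algorithms and the “keep top three” rule are assumptions made for the example; real workbenches automate this selection in their own ways.

```python
# Sketch of automated model building: train several candidates, keep the top
# performers on holdout data, and combine their predictions when scoring.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5_000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=1),
    "nearest_neighbor": KNeighborsClassifier(),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(random_state=1),
}

# Rank every candidate by holdout accuracy and keep the best three.
scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in candidates.items()}
top3 = sorted(scores, key=scores.get, reverse=True)[:3]
print("Top models:", {name: round(scores[name], 3) for name in top3})

# Score records with a simple majority vote across the retained models.
votes = np.array([candidates[name].predict(X_test) for name in top3])
combined = (votes.mean(axis=0) >= 0.5).astype(int)
print("First 10 combined predictions:", combined[:10])
```
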
Another factor that can be important in deciding which types of models are used is the understandability of the model output. A number of predictive analytic modeling algorithms generate output that is human readable in the form of rules or decision trees. For some applications, it is essential that the models being used can be understood (for example, to maintain compliance in retail banking). Other algorithms produce output that is more mathematical or statistical in nature and less easily understood and explained.
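For instance, a fitted decision tree can be rendered as nested, human-readable rules, whereas the weights of a neural network cannot be read off in the same way. A minimal sketch using scikit-learn follows; the churn-style feature names are invented for illustration.

```python
# Render a decision tree as human-readable rules that a reviewer can inspect.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["tenure_months", "monthly_spend", "support_calls", "age"]

# Synthetic stand-in for a churn dataset with four features.
X, y = make_classification(n_samples=1_000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=2)

tree = DecisionTreeClassifier(max_depth=3, random_state=2).fit(X, y)

# Print the model as nested if/else rules.
print(export_text(tree, feature_names=feature_names))
```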

A Predictive Analytics Workbench allows a user to create, validate, manage, and deploy predictive analytic models. A predictive analytic workbench consists of the components shown in the Figure below:

  • A model repository: This is a place where models and the specification of the tasks required to produce them can be stored, revised, and managed. Not all Predictive Analytic Workbenches have such a repository, and some still store models as script files.
  • Data management tools: Building predictive analytic models requires access to multiple data sources of various formats. A Predictive Analytics Workbench must be able to connect to and use this data.
Data clean up is important
A significant part of the time spent on building predictive analytic models is actually spent on data management. Cleaning up data, removing records with irregularities that would skew results, imputing values to fill in missing variables, and much more all use up valuable modeling time (a brief sketch of this kind of preparation appears after the component list below). Good data capture, effective IT/analytics collaboration, and careful experimental design can all help reduce these problems.
  • Design tools for a modeler: Modelers need to be able to define how data will be integrated, cleaned, and enhanced, as well as the way in which it will be fed through modeling algorithms and the results analyzed and used.
  • Modeling algorithms: Predictive Analytic Workbenches have a wide array of modeling algorithms that can be applied to data to produce models.
  • Data visualization and analysis tools: Modelers must be able to understand the data available, analyzing distribution and other characteristics. They must also be able to analyze the results of a set of models in terms of their predictive power and validity.
  • Deployment tools: Models are not valuable unless they can be deployed in some way, and Predictive Analytic Workbenches need to be able to deploy models as code, as SQL, as business rules, or to a database using an in-database analytics engine.
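As referenced in the data clean-up sidebar above, the sketch below shows a minimal pandas version of that preparation work: dropping duplicate records, removing rows with impossible values, and filling in missing values. The column names, the example records, and the thresholds are assumptions made for the illustration.

```python
# Minimal data clean-up sketch: de-duplicate, drop irregular rows, impute missing values.
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "customer_id":   [1, 1, 2, 3, 4, 5],
    "age":           [23, 23, 41, -7, 35, np.nan],      # -7 is an obvious data error
    "monthly_spend": [50.0, 50.0, np.nan, 80.0, 65.0, 120.0],
})

# Remove duplicate customer records.
clean = raw.drop_duplicates(subset="customer_id")

# Drop rows with impossible ages, but keep missing ages for imputation below.
clean = clean[clean["age"].isna() | clean["age"].between(18, 100)].copy()

# Fill in (impute) missing numeric values with the column median.
for col in ("age", "monthly_spend"):
    clean[col] = clean[col].fillna(clean[col].median())

print(clean)
```
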
Elements of a Machine Learning Workbench