
Synthesizing Multi-Table Databases: Model Evaluation & Vendor Comparison


Synthesizing multi-table tabular data presents challenges beyond those of the single-table case. When the database contains date columns such as transaction or admission dates, a frequent occurrence in real-world datasets, both generating high-quality synthetizations and evaluating the models become even more complicated. In this article, we focus on this type of problem, comparing generated observations produced by three vendors and an open-source library. We look at preservation of data integrity across the multiple tables, run time, correct replication of the joint multivariate distribution present in the real data in parent and child tables, and non-violation of business rules based on combinations of features, including categorical features.
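
As an illustration of what "data integrity across the multiple tables" can mean in practice, here is a minimal sketch in plain Python. It is not the evaluation code used for this comparison. The table layout, the column names (`account_id`, `trans_id`, `amount`) and the helper functions are hypothetical, and the rows are made-up toy values. The idea is simply that every foreign key in a synthetic child table should point to an existing row in the synthetic parent table, and that primary keys should stay unique.

```python
def orphan_rate(parent_rows, child_rows, parent_key, foreign_key):
    """Fraction of child rows whose foreign key has no matching parent row."""
    parent_ids = {row[parent_key] for row in parent_rows}
    if not child_rows:
        return 0.0
    orphans = sum(1 for row in child_rows if row[foreign_key] not in parent_ids)
    return orphans / len(child_rows)


def has_duplicate_keys(rows, key):
    """True if the primary key column contains repeated values."""
    values = [row[key] for row in rows]
    return len(values) != len(set(values))


# Toy synthetic tables (hypothetical values, not from the article's datasets).
accounts = [{"account_id": 1}, {"account_id": 2}, {"account_id": 3}]
transactions = [
    {"trans_id": 10, "account_id": 1, "amount": 120.0},
    {"trans_id": 11, "account_id": 2, "amount": 35.5},
    {"trans_id": 12, "account_id": 9, "amount": 60.0},  # orphan: no account 9
    {"trans_id": 13, "account_id": 3, "amount": 15.0},
]

rate = orphan_rate(accounts, transactions, "account_id", "account_id")
print(f"orphan rate: {rate:.2f}")
print("duplicate primary keys:", has_duplicate_keys(accounts, "account_id"))
```

A non-zero orphan rate or duplicated primary keys in the synthetic output would indicate that the relationships between tables were not preserved.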

Overview

Despite using the default settings, it took several hours to understand and work with some of the libraries. In several instances, the synthetization failed after multiple trials and fixes, especially with SDV but also with Gretel and Mostly.ai. Both the date columns and metadata contributed to the challenges that we faced with SDV.

Figure 1: Credit card database

For the credit card database in Figure 1, all vendors can produce synthetizations using the default settings. For the more complex AdventureWorks database, we experienced critical issues with all of them except YData. The findings below highlight some of the problems faced on the credit card dataset:

  • Mostly.ai has a multi-year gap with very few transactions between February 1993 and May 1995.
  • For each numerical feature including time, Gretel artificially generates observations that exactly match the observed minima and maxima, presumably to pass the range test, causing other issues in the process.
  • The new balance (an extra feature that we created, measuring the remaining balance after a credit card transaction) should be positive most of the time. Only YData correctly reproduces this pattern.
  • Credit card transactions have monthly and semester periodicity, properly captured only by YData. See Figure 2, showing the two features (amount and time) in a scatterplot.
  • Gretel's transaction amounts and times exhibit exaggerated diffusion. YData has the opposite problem with the amounts, though it is easy to fix with randomization.
  • Mostly.ai creates multi-categories (spanning across multiple features) almost randomly, generating many multi-categories not found in the real dataset. As a result, average amount per category is wrong. The problem is much less pronounced with Gretel, and absent with YData.

Figure 2: Date and transaction amount, credit card DB
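
Two of the issues listed above, long periods with very few transactions and synthetic values that exactly match the observed minima and maxima, lend themselves to simple automated checks. The sketch below is one possible way to flag them and is not the procedure used in this study. The function names, the `ratio` threshold and the toy data are assumptions made for illustration only.

```python
from collections import Counter
from datetime import date
from statistics import median


def month_range(start, end):
    """Yield (year, month) pairs from start to end, inclusive."""
    year, month = start
    while (year, month) <= end:
        yield (year, month)
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


def sparse_months(dates, ratio=0.1):
    """Months whose transaction count is below ratio * median monthly count."""
    counts = Counter((d.year, d.month) for d in dates)
    months = list(month_range(min(counts), max(counts)))
    per_month = [counts.get(m, 0) for m in months]
    threshold = ratio * median(per_month)
    return [m for m, c in zip(months, per_month) if c < threshold]


def boundary_hits(real_values, synth_values):
    """Count synthetic values exactly equal to the real minimum or maximum."""
    lo, hi = min(real_values), max(real_values)
    return sum(1 for v in synth_values if v == lo or v == hi)


# Toy synthetic dates (hypothetical): 20 transactions per month,
# except March (1 transaction) and April (none).
monthly_counts = {1: 20, 2: 20, 3: 1, 4: 0, 5: 20, 6: 20}
synthetic_dates = [
    date(1993, month, day + 1)
    for month, n in monthly_counts.items()
    for day in range(n)
]
gaps = sparse_months(synthetic_dates)
print("sparse months:", gaps)

# Toy amounts (hypothetical): two synthetic values sit exactly on the real bounds.
real_amounts = [5.0, 12.5, 40.0, 99.9]
synth_amounts = [5.0, 99.9, 20.0, 33.3]
hits = boundary_hits(real_amounts, synth_amounts)
print("values on real min/max:", hits)
```

For continuous features such as amounts or timestamps, exact matches with the real extremes are unlikely to occur by chance, so a non-zero count is worth a closer look. Note that the median-based threshold stops working if the gap covers most of the time range, since the median monthly count then drops to zero.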

Conclusions

Compared to our analysis of single-table data from a year ago, Gretel has significantly improved and now outperforms Mostly.ai. YData has solidified its first-place position, emerging as the undisputed winner. SDV remains the laggard. Note that the free version of Gretel is limited to rather small datasets.

Figure 3: Vendor comparison

In Figure 3, “faithfulness” is the ability to reproduce the correct multivariate distribution attached to the real data, in each database table. The “success rate” is low for synthesizers that fail on several datasets when using the default parameters. A low rating for “business rules” means several violations of implicit rules, such as the new balance being negative after a credit card transaction. Finally, “category integrity” is low when the synthetic data contains a large chunk of records attached to category combinations not found in the real data, while “date columns” is the ability to generate timestamps with properties mimicking those found in the real data.
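
To make the “business rules” and “category integrity” criteria more concrete, here is a hedged sketch of how such quantities could be measured. It does not reproduce the scoring behind Figure 3. The column names (`type`, `operation`, `new_balance`), the rule and the toy rows are hypothetical. Since the new balance is expected to be positive most of the time rather than always, the sketch compares the violation rate in the synthetic data with the rate in the real data instead of requiring zero violations.

```python
def violation_rate(rows, rule):
    """Share of rows for which the business rule (a predicate) is False."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if not rule(row)) / len(rows)


def unseen_combo_rate(real_rows, synth_rows, columns):
    """Share of synthetic rows whose category combination never occurs in the real data."""
    real_combos = {tuple(row[c] for c in columns) for row in real_rows}
    if not synth_rows:
        return 0.0
    unseen = sum(
        1 for row in synth_rows
        if tuple(row[c] for c in columns) not in real_combos
    )
    return unseen / len(synth_rows)


def non_negative_balance(row):
    """Hypothetical rule: the balance after a transaction should not be negative."""
    return row["new_balance"] >= 0


# Toy rows (hypothetical values, not from the article's datasets).
real = [
    {"type": "credit", "operation": "transfer", "new_balance": 500.0},
    {"type": "debit", "operation": "withdrawal", "new_balance": 120.0},
    {"type": "debit", "operation": "card", "new_balance": -15.0},
    {"type": "credit", "operation": "deposit", "new_balance": 900.0},
]
synth = [
    {"type": "credit", "operation": "transfer", "new_balance": 450.0},
    {"type": "credit", "operation": "withdrawal", "new_balance": -300.0},  # unseen combination
    {"type": "debit", "operation": "card", "new_balance": -40.0},
    {"type": "debit", "operation": "deposit", "new_balance": 60.0},  # unseen combination
]

real_rate = violation_rate(real, non_negative_balance)
synth_rate = violation_rate(synth, non_negative_balance)
unseen_rate = unseen_combo_rate(real, synth, ["type", "operation"])

print(f"negative balance rate: real={real_rate:.2f}, synthetic={synth_rate:.2f}")
print(f"unseen category combinations: {unseen_rate:.2f}")
```

A synthetic violation rate well above the real one, or a large share of unseen category combinations, would correspond to a low rating on the respective criterion.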

Download the full report

The 10-page technical document, with details about the methodology, high-resolution diagrams and illustrations, case studies, error messages encountered on some platforms, and links to source code & datasets, is available in PDF format on GitHub, here.

About the authors

Rajiv Iyer is the Lead Developer at GenAItechLab.com, and formerly Principal Member of Technical Staff at Oracle India. Vincent Granville is Chief AI Scientist at GenAItechLab.com, author (Elsevier, Wiley), formerly consultant with Wells Fargo, Visa, Microsoft, eBay, and NBC. Currently based in Seattle, Vincent did his postdoc at University of Cambridge.

The post Synthesizing Multi-Table Databases: Model Evaluation & Vendor Comparison first appeared on Machine Learning Techniques.

