Synthesizing Multi-Table Databases: Model Evaluation & Vendor Comparison

Synthesizing multi-table tabular data presents challenges of its own compared to the single-table case. When the database contains date columns such as transaction or admission dates, a frequent occurrence in real-world datasets, both generating high-quality synthetizations and evaluating the models become even more complicated. In this article, we focus on this type of problem, comparing the observations generated by three vendors and open-source software. We look at the preservation of data integrity across tables, run time, faithful replication of the joint multivariate distribution of the real data in parent and child tables, and compliance with business rules based on combinations of features, including categorical ones.
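To make the first criterion concrete, here is a minimal sketch of a cross-table integrity check: it measures the share of child rows whose foreign key has no match in the parent table. The table layout and the names `accounts`, `transactions`, and `account_id` are hypothetical, chosen for illustration only; they are not taken from the datasets used in this comparison.

```python
def orphan_rate(parent_rows, child_rows, key):
    """Share of child rows whose foreign key has no match in the parent table."""
    parent_keys = {row[key] for row in parent_rows}
    if not child_rows:
        return 0.0
    orphans = sum(1 for row in child_rows if row[key] not in parent_keys)
    return orphans / len(child_rows)


# Toy synthetic tables (hypothetical names and values).
accounts = [{"account_id": 1}, {"account_id": 2}]
transactions = [
    {"account_id": 1, "amount": 120.0},
    {"account_id": 2, "amount": 35.5},
    {"account_id": 3, "amount": 10.0},  # no matching account: orphan row
    {"account_id": 1, "amount": 60.0},
]

rate = orphan_rate(accounts, transactions, "account_id")
print(f"orphan rate: {rate:.2f}")  # orphan rate: 0.25
```

A rate of zero means every synthetic child row points to an existing parent row. Any other value signals broken referential integrity.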

Overview

Despite using the default settings, it took several hours to understand and work with some of the libraries. In several instances, the synthetization failed after multiple trials and fixes, especially with SDV but also with Gretel and Mostly.ai. Both the date columns and metadata contributed to the challenges that we faced with SDV.

For the credit card database in Figure 1, all vendors can produce synthetizations using the default settings. For the more complex AdventureWorks database, we experienced critical issues with every vendor except YData. The findings below highlight some of the problems encountered on the credit card dataset:

The Mostly.ai synthetization has a multi-year gap, with very few transactions between February 1993 and May 1995.
For each numerical feature including time, Gretel artificially generates observations that exactly match the observed minima and maxima, presumably to pass the range test, causing other issues in the process.
The new balance (an extra feature that we created, measuring the remaining balance after a credit card transaction) should be positive most of the time. Only YData correctly reproduces this pattern.
Credit card transactions have monthly and semester periodicity, properly captured only by YData. See Figure 2, a scatterplot of the two features (amount and time).
Gretel transaction amounts and time exhibit exaggerated diffusion. YData has the opposite problem with the amounts, though it is easy to fix with randomization.
Mostly.ai creates multi-categories (combinations of categories spanning multiple features) almost randomly, generating many multi-categories not found in the real dataset. As a result, the average amount per category is wrong. The problem is much less pronounced with Gretel, and absent with YData.
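One simple way to spot date-related issues such as the multi-year gap listed above is to count transactions per calendar month and flag the months that fall below a threshold. The sketch below is only an illustration: the function name `sparse_months`, the threshold, and the toy dates are assumptions, not the procedure used in the report.

```python
from collections import Counter
from datetime import date


def sparse_months(dates, threshold):
    """Return (year, month) pairs between the first and last date whose
    transaction count is below `threshold`, including months with none."""
    counts = Counter((d.year, d.month) for d in dates)
    if not counts:
        return []
    first, last = min(counts), max(counts)
    sparse = []
    year, month = first
    while (year, month) <= last:
        if counts[(year, month)] < threshold:
            sparse.append((year, month))
        # Advance to the next calendar month.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return sparse


# Toy synthetic dates (hypothetical): dense in January and April 1993, thin in between.
synthetic_dates = (
    [date(1993, 1, day) for day in range(1, 11)]
    + [date(1993, 2, 15)]
    + [date(1993, 4, day) for day in range(1, 11)]
)

gaps = sparse_months(synthetic_dates, threshold=5)
print(gaps)  # [(1993, 2), (1993, 3)]
```

Running the same function on the real and synthetic transaction dates, and comparing the two lists, shows whether a gap is inherited from the real data or introduced by the synthesizer.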

Figure 2: Date and transaction amount, credit card DB

Conclusions

Compared to our analysis of single tables from a year ago, Gretel has improved significantly and now outperforms Mostly.ai. YData.ai has also solidified its first-place position, emerging as the undisputed winner. SDV remains the laggard. Note that the free version of Gretel is limited to rather small datasets.

In Figure 3, “faithfulness” is the ability to reproduce the correct multivariate distribution of the real data in each database table. The “success rate” is low for synthesizers that fail on several datasets when using the default parameters. A low rating for “business rules” means several violations of implicit rules, such as the new balance being negative after a credit card transaction. Finally, “category integrity” is low when the synthetic data contains a large share of records attached to category combinations not found in the real data, while “date columns” is the ability to generate timestamps with properties mimicking those found in the real data.
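Two of these criteria lend themselves to simple numeric proxies. The sketch below computes the share of synthetic records whose category combination never occurs in the real data, and the share of records violating a business rule. It is an illustration under stated assumptions, not the scoring used for Figure 3: the function names, the column names `card_type`, `channel`, and `new_balance`, and the toy records are all hypothetical.

```python
def unseen_combination_rate(real_rows, synth_rows, cat_cols):
    """Share of synthetic rows whose category combination never occurs in the real data."""
    seen = {tuple(row[c] for c in cat_cols) for row in real_rows}
    if not synth_rows:
        return 0.0
    unseen = sum(
        1 for row in synth_rows if tuple(row[c] for c in cat_cols) not in seen
    )
    return unseen / len(synth_rows)


def violation_rate(rows, rule):
    """Share of rows for which the business rule `rule(row)` is False."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if not rule(row)) / len(rows)


# Toy records (hypothetical columns and values).
real = [
    {"card_type": "gold", "channel": "online", "new_balance": 500.0},
    {"card_type": "classic", "channel": "store", "new_balance": 80.0},
]
synthetic = [
    {"card_type": "gold", "channel": "online", "new_balance": 420.0},
    {"card_type": "gold", "channel": "store", "new_balance": -35.0},    # unseen combination, negative balance
    {"card_type": "classic", "channel": "store", "new_balance": 10.0},
    {"card_type": "classic", "channel": "online", "new_balance": 5.0},  # unseen combination
]

unseen = unseen_combination_rate(real, synthetic, ["card_type", "channel"])
negative = violation_rate(synthetic, lambda row: row["new_balance"] >= 0)
print(unseen, negative)  # 0.5 0.25
```

Since a rule such as a positive new balance holds most of the time rather than always, the violation rate on the synthetic data is best compared with the same rate computed on the real data, not with zero.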

Download the full report

The 10-page technical document, with details about the methodology, high-resolution diagrams and illustrations, case studies, error messages encountered on some platforms, and links to source code & datasets, is available in PDF format on GitHub, here.

About the authors

Rajiv Iyer is the Lead Developer at GenAItechLab.com, and formerly Principal Member of Technical Staff at Oracle India. Vincent Granville is Chief AI Scientist at GenAItechLab.com, author (Elsevier, Wiley), formerly consultant with Wells Fargo, Visa, Microsoft, eBay, and NBC. Currently based in Seattle, Vincent did his postdoc at University of Cambridge.

The post Synthesizing Multi-Table Databases: Model Evaluation & Vendor Comparison first appeared on Machine Learning Techniques.
