Data Synthetization: enhanced GANs vs Copulas

Using case studies, I compare generative adversarial networks (GANs) with copulas to synthesize tabular data. I discuss back-end and front-end improvements to help GANs better replicate the correlation structure present in the real data. Likewise, I discuss methods to further improve copulas, including transforms, the use of separate copulas for each population segment, and parametric model-driven copulas compared to a data-driven parameter-free approach. I apply the techniques to real-life datasets, with full Python implementation. In the end, blending both methods leads to better results. Both methods eventually need an iterative gradient-descent technique to find an optimum in the parameter space. For GANs, I provide a detailed discussion of hyperparameters and fine-tuning options.

I show examples where GANs are superior to copulas, and the other way around. My GAN implementation also leads to fully replicable results — a feature usually absent in other GAN systems. This is particularly important given the high dependency on the initial configuration determined by a seed parameter: it also allows you to find the best synthetic data using multiple runs of GAN in a replicable setting. In the process, I introduce a new matrix correlation distance to evaluate the quality of the synthetic data, taking values between 0 and 1 where 0 is best, and leverage the TableEvaluator library. I also discuss feature clustering to improve the technique, to detect groups of features independent from each other, and apply a different model to each of them. In a medical data example to predict the risk of cancer, I use random forests to classify the real data, and compare the performance with results obtained on the synthetic data.

Ability of copulas to replicate the correlation structure in the real data

In one artificial example with very strong patterns, the copula method fails at detecting the non-linear feature interactions, while GANs do a pretty good job. In another example mostly with linear feature interactions, the opposite is true. The document also has many references to technical papers available online for free, as well as a discussion of open source implementations. In particular, I feature an illustration of the CopulaGAN module — blending copulas and GANs — from the synthetic data vault library (SDV), applied to tabular data.

This applied paper is part of the newly added chapter in my book on synthetic data and explainable AI. The 289-page book now in version 4.0 and accepted by Elsevier, is currently available here. In the PDF version (the only one currently available, viewable in any browser), all back-links, external and internal links, in particular to or from other chapters, glossary, references and the index, are working. The datasets, Python code, and illustrations, are also on my GitHub repository. See the table of contents and access sample chapters, from here. You can download the free article “Data Synthetization: Enhanced GANs vs Copulas ” from here.

To not miss future articles, sign-up to our newsletter, here.

About the Author

Vincent Granville is a pioneering data scientist and machine learning expert, co-founder of Data Science Central (acquired by TechTarget in 2020), founder of MLTechniques.com, former VC-funded executive, author and patent owner. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Vincent is also a former post-doc at Cambridge University, and the National Institute of Statistical Sciences (NISS).

Vincent published in Journal of Number Theory, Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is also the author of multiple books, including “Intuitive Machine Learning and Explainable AI”, available here. He lives in Washington state, and enjoys doing research on spatial stochastic processes, chaotic dynamical systems, experimental math and probabilistic number theory.

The post Data Synthetization: enhanced GANs vs Copulas first appeared on Machine Learning Techniques.

Data Synthetization: enhanced GANs vs Copulas

Table of Contents

About the Author

Trending Articles

Practice Sheet of Right form of verbs for HSC Students

Download: FK ft Shenky – Nakuyewa ”Prod by: Shenky”

How to win at Markstrat (Markstrat Tips and Tricks) – Vodites

Ominde Commission Report and Recommendations – Ominde Report of 1964

Bureau of Internal Revenue: Regional Offices (Directory)

GO 53 on Enhancement of Ex-gratia upto 5 Lakhs Toddy Tappers in Telangana

Cakewalk CA-2A Leveling Amplifier v2.0.1.97 WiN, v2.0.1.96 OSX Incl Keygen

Mp3 Download: Mdu - Kunjenjenjena

How the kill the job , when DTP request running for long hours.

Microsoft Intune から展開しているアプリのアップデートについて

18-year-old girl was beaten for half an hour by two Northampton men in 'an...

Car crash in Dunton Bassett leaves driver in critical condition

Macky 2, Two Others In Road Accident

Application log 00000000000000089514: Could not convert queue DLVST90CLNT

Detroit mafia: D’Anna Brothers agree to plea deal

Delivery block field greyed out using VA02

Muloraki Au

【個人撮影】スマホのプライベート映像♪「中に出さないで///」カラオケ屋での生ハメ撮りが流出ｗ【リベンジポルノ】＠PornHub

BREAKING NEWS: Diamond Platnumz Is Reported Dead After Ghastly Car Accident

FIAT 500 B0111 B0112