Practical Synthetic Data Generation

Balancing Privacy and the Broad Availability of Data
 
 
O'Reilly (publisher)
  • published May 19, 2020
  • 166 pages
 
E-book | ePUB with Adobe DRM | System requirements
978-1-4920-7269-0 (ISBN)
 
Building and testing machine learning models requires access to large and diverse data. But where can you find usable datasets without running into privacy issues? This practical book introduces techniques for generating synthetic data (fake data generated from real data) so you can perform secondary analysis to do research, understand customer behaviors, develop new products, or generate new revenue.

Data scientists will learn how synthetic data generation provides a way to make such data broadly available for secondary purposes while addressing many privacy concerns. Analysts will learn the principles and steps for generating synthetic data from real datasets. And business leaders will see how synthetic data can help accelerate time to a product or solution.

This book describes:
  • Steps for generating synthetic data using multivariate normal distributions
  • Methods for distribution fitting covering different goodness-of-fit metrics
  • How to replicate the simple structure of original data
  • An approach for modeling data structure to consider complex relationships
  • Multiple approaches and metrics you can use to assess data utility
  • How analysis performed on real data can be replicated with synthetic data
  • Privacy implications of synthetic data and methods to assess identity disclosure

  • English
  • Sebastopol, USA
  • 8.47 MB
978-1-4920-7269-0 (9781492072690)
  • Intro
  • Preface
  • Conventions Used in This Book
  • O'Reilly Online Learning
  • How to Contact Us
  • Acknowledgments
  • 1. Introducing Synthetic Data Generation
  • Defining Synthetic Data
  • Synthesis from Real Data
  • Synthesis Without Real Data
  • Synthesis and Utility
  • The Benefits of Synthetic Data
  • Efficient Access to Data
  • Enabling Better Analytics
  • Synthetic Data as a Proxy
  • Learning to Trust Synthetic Data
  • Synthetic Data Case Studies
  • Manufacturing and Distribution
  • Healthcare
  • Data for cancer research
  • Evaluating innovative digital health technologies
  • Financial Services
  • Synthetic data benchmarks
  • Software testing
  • Transportation
  • Microsimulation models
  • Data synthesis for autonomous vehicles
  • Summary
  • 2. Implementing Data Synthesis
  • When to Synthesize
  • Identifiability Spectrum
  • Trade-Offs in Selecting PETs to Enable Data Access
  • Decision Criteria
  • PETs Considered
  • Decision Framework
  • Examples of Applying the Decision Framework
  • Data Synthesis Projects
  • Data Synthesis Steps
  • Data Preparation
  • The Data Synthesis Pipeline
  • Synthesis Program Management
  • Summary
  • 3. Getting Started: Distribution Fitting
  • Framing Data
  • How Data Is Distributed
  • Fitting Distributions to Real Data
  • Generating Synthetic Data from a Distribution
  • Measuring How Well Synthetic Data Fits a Distribution
  • The Overfitting Dilemma
  • A Little Light Weeding
  • Summary
  • 4. Evaluating Synthetic Data Utility
  • Synthetic Data Utility Framework: Replication of Analysis
  • Synthetic Data Utility Framework: Utility Metrics
  • Comparing Univariate Distributions
  • Comparing Bivariate Statistics
  • Comparing Multivariate Prediction Models
  • Distinguishability
  • Summary
  • 5. Methods for Synthesizing Data
  • Generating Synthetic Data from Theory
  • Sampling from a Multivariate Normal Distribution
  • Inducing Correlations with Specified Marginal Distributions
  • Copulas with Known Marginal Distributions
  • Generating Realistic Synthetic Data
  • Fitting Real Data to Known Distributions
  • Using Machine Learning to Fit the Distributions
  • Hybrid Synthetic Data
  • Machine Learning Methods
  • Deep Learning Methods
  • Synthesizing Sequences
  • Summary
  • 6. Identity Disclosure in Synthetic Data
  • Types of Disclosure
  • Identity Disclosure
  • Learning Something New
  • Attribute Disclosure
  • Inferential Disclosure
  • Meaningful Identity Disclosure
  • Defining Information Gain
  • Bringing It All Together
  • Unique Matches
  • How Privacy Law Impacts the Creation and Use of Synthetic Data
  • Issues Under the GDPR
  • Is the use of the original (real) dataset to generate and/or evaluate a synthetic dataset restricted or regulated under the GDPR?
  • Is sharing the original dataset with a third-party service provider to generate the synthetic dataset restricted or regulated under the GDPR?
  • Does the GDPR regulate or otherwise affect (if at all) the resulting synthetic dataset?
  • Issues Under the CCPA
  • Is the use of the original (real) dataset to generate and/or evaluate a synthetic dataset restricted or regulated under the CCPA?
  • Is sharing the original dataset with a third-party service provider to generate the synthetic dataset restricted or regulated under the CCPA?
  • Does the CCPA regulate or otherwise affect (if at all) the resulting synthetic dataset?
  • Issues Under HIPAA
  • Is the use of the original (real) dataset to generate and/or evaluate a synthetic dataset restricted or regulated under HIPAA?
  • Is sharing the original dataset with a third-party service provider to generate the synthetic dataset restricted or regulated under HIPAA?
  • Does HIPAA regulate or otherwise affect (if at all) the resulting synthetic dataset?
  • Article 29 Working Party Opinion
  • Singling out
  • Linkability
  • Inference
  • Closing comments on the Article 29 opinion
  • Summary
  • 7. Practical Data Synthesis
  • Managing Data Complexity
  • For Every Pre-Processing Step There Is a Post-Processing Step
  • Field Types
  • The Need for Rules
  • Not All Fields Have to Be Synthesized
  • Synthesizing Dates
  • Synthesizing Geography
  • Lookup Fields and Tables
  • Missing Data and Other Data Characteristics
  • Partial Synthesis
  • Organizing Data Synthesis
  • Computing Capacity
  • A Toolbox of Techniques
  • Synthesizing Cohorts Versus Full Datasets
  • Continuous Data Feeds
  • Privacy Assurance as Certification
  • Performing Validation Studies to Get Buy-In
  • Motivated Intruder Tests
  • Who Owns Synthetic Data?
  • Conclusions
  • Index

File format: ePUB
Copy protection: Adobe DRM (Digital Rights Management)

System requirements:

Computer (Windows; MacOS X; Linux): Before downloading, install the free Adobe Digital Editions software (see E-book help).

Tablet/smartphone (Android; iOS): Before downloading, install the free Adobe Digital Editions app (see E-book help).

E-book readers: Bookeen, Kobo, Pocketbook, Sony, Tolino, and many others (not Kindle)

The ePUB file format is well suited to novels and nonfiction, that is, to "flowing" text without complex layout. On e-readers and smartphones, line and page breaks adapt automatically to the small displays. Adobe DRM is a "hard" form of copy protection. If the necessary requirements are not met, you will unfortunately be unable to open the e-book, so you need to prepare your reading hardware before downloading.

When using the Adobe Digital Editions reading software, we strongly recommend authorizing it with your personal Adobe ID after installation.

For more information, see our E-book help.


Download (available immediately)

€47.99
incl. 5% VAT
Download / single license