Repurposing Legacy Data

Innovative Case Studies
Elsevier (publisher)
  • 1st edition
  • Published March 13, 2015
  • 176 pages
E-book | EPUB with Adobe DRM | System requirements
E-book | PDF with Adobe DRM | System requirements
978-0-12-802915-2 (ISBN)

Repurposing Legacy Data: Innovative Case Studies looks at how data scientists have repurposed legacy data, whether their own or legacy data that has been donated to the public domain.

Most of the data stored worldwide is legacy data: data created some time in the past, for a particular purpose, and left in obsolete formats. As with keepsakes in an attic, we retain this information thinking it may have value in the future, though we have no current use for it.

The case studies in this book, from such diverse fields as cosmology, quantum physics, high-energy physics, microbiology, psychiatry, medicine, and hospital administration, all serve to demonstrate how innovative people draw value from legacy data. By following the case examples, readers will learn how legacy data is restored, merged, and analyzed for purposes that were never imagined by the original data creators.

  • Discusses how combining existing data with other data sets of the same kind can produce an aggregate data set that serves to answer questions that could not be answered with any of the original data
  • Presents a method for re-analyzing original data sets using alternate or improved methods that can provide outcomes more precise and reliable than those produced in the original analysis
  • Explains how to integrate heterogeneous data sets for the purpose of answering questions or developing concepts that span several different scientific fields

Jules Berman holds two bachelor of science degrees from MIT (Mathematics, and Earth and Planetary Sciences), a PhD from Temple University, and an MD from the University of Miami. He was a graduate researcher in the Fels Cancer Research Institute, at Temple University, and at the American Health Foundation in Valhalla, New York. His post-doctoral studies were completed at the U.S. National Institutes of Health, and his residency was completed at the George Washington University Medical Center in Washington, D.C. Dr. Berman served as Chief of Anatomic Pathology, Surgical Pathology and Cytopathology at the Veterans Administration Medical Center in Baltimore, Maryland, where he held joint appointments at the University of Maryland Medical Center and at the Johns Hopkins Medical Institutions. In 1998, he transferred to the U.S. National Institutes of Health, as a Medical Officer, and as the Program Director for Pathology Informatics in the Cancer Diagnosis Program at the National Cancer Institute. Dr. Berman is a past President of the Association for Pathology Informatics, and the 2011 recipient of the association's Lifetime Achievement Award. He is a listed author on over 200 scientific publications and has written more than a dozen books in his three areas of expertise: informatics, computer programming, and cancer biology. Dr. Berman is currently a freelance writer.
  • English
  • Saint Louis, USA
  • 4.69 MB
978-0-12-802915-2 (9780128029152)
0-12-802915-3 (0128029153)
Front Cover
Repurposing Legacy Data
Copyright Page
Contents
Author Biography
1 Introduction
  1.1 Why Bother?
  1.2 What Is Data Repurposing?
  1.3 Data Worth Preserving
  1.4 Basic Data Repurposing Tools
    1.4.1 A Simple Text Editor
    1.4.2 Simple Programming Skills
    1.4.3 Data Visualization Utilities
  1.5 Personal Attributes of Data Repurposers
    1.5.1 Data Organization Methods
    1.5.2 Ability to Develop a Clear Understanding of the Goals of a Project
  References
2 Learning from the Masters
  2.1 New Physics from Old Data
  2.2 Repurposing the Physical and Abstract Property of Uniqueness
  2.3 Repurposing a 2,000-Year-Old Classification
  2.4 Decoding the Past
  2.5 What Makes Data Useful for Repurposing Projects?
  References
3 Dealing with Text
  3.1 Thus It Is Written
  3.2 Search and Retrieval
  3.3 Indexing Text
  3.4 Coding Text
  References
4 New Life for Old Data
  4.1 New Algorithms
  4.2 Taking Closer Looks
  4.3 Crossing Data Domains
  References
5 The Purpose of Data Analysis Is to Enable Data Reanalysis
  5.1 Every Initial Data Analysis on Complex Datasets Is Flawed
  5.2 Unrepeatability of Complex Analyses
  5.3 Obligation to Verify and Validate
  5.4 Asking What the Data Really Means
  References
6 Dark Legacy: Making Sense of Someone Else's Data
  6.1 Excavating Treasures from Lost and Abandoned Data Mines
  6.2 Nonstandard Standards
  6.3 Specifications, Not Standards
  6.4 Classifications and Ontologies
  6.5 Identity and Uniqueness
  6.6 When to Terminate (or Reconsider) a Data Repurposing Project
  References
7 Social and Economic Issues
  7.1 Data Sharing and Reproducible Research
  7.2 Acquiring and Storing Data
  7.3 Keeping Your Data Forever
  7.4 Data Immutability
  7.5 Privacy and Confidentiality
  7.6 The Economics of Data Repurposing
  References
Appendix A: Index of Case Studies
Appendix B: Glossary
  References
Chapter 2

Learning from the Masters

Data repurposing has made it possible for scientists to lay the foundations of quantum mechanics, evolution, and modern astronomy. It has enabled us to understand past lives (e.g., Mesoamerican history and culture) and has given us a way to identify every organism on earth (e.g., biometrics). This chapter explains the pivotal role played by data repurposing in these important intellectual achievements.


Keywords: CODIS; identifiers; biometrics; fingerprints; quantum physics; heliocentric system; Mayan glyphs

2.1 New Physics from Old Data

All science is description and not explanation.

Karl Pearson, The Grammar of Science, Preface to 2nd edition, 1899

Case Study 2.1

Sky Charts

For most of us, the positions of the planets and of the stars do not provide us with any useful information. This was not always so. For a large part of the history of mankind, individuals determined their locations, the date, and the time, from careful observations of the night sky. On a cloudless night, a competent navigator, on the sea or in the air, could plot a true course.

Repurposed data from old star charts was used to settle and unsettle one of our greatest mysteries: earth's place in the universe. Seven key scientists, working in succession over more than two millennia, used night sky data to reach profound and shocking conclusions: Aristarchus of Samos (circa 310-230 BCE), Nicolaus Copernicus (1473-1543), Tycho Brahe (1546-1601), Johannes Kepler (1571-1630), Galileo Galilei (1564-1642), Isaac Newton (1643-1727), and Albert Einstein (1879-1955).

Back in the third century BCE, Aristarchus of Samos studied the night sky and reasoned that the earth and planets orbited the sun. In addition, Aristarchus correctly assigned the relative positions of the known planets to their heliocentric orbits. About 1,800 years later, Copernicus reanalyzed Aristarchus' assertions to confirm the heliocentric orbits of the planets, which he modeled as circles. Soon thereafter, Tycho Brahe produced improved star charts, bequeathing this data to his assistant, Johannes Kepler. Kepler used the charts to derive three general laws describing the movement of planets, including their elliptical orbits. In 1687, Newton published his Principia, wherein Kepler's empiric laws, based on observational data, were redeveloped from physical principles: Newton's laws of motion and universal gravitation. Newton's contribution was a remarkable example of data modeling, wherein an equation was created to describe a set of data pertaining to physical objects (see Glossary item, Modeling).
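As a minimal sketch of what such data modeling looks like in practice, the following Python fragment fits a power law to planetary data. Kepler's third law states that the square of a planet's orbital period is proportional to the cube of its semi-major axis, so a fit of period against distance should recover an exponent close to 1.5. The figures are rounded modern values supplied here for illustration, not Tycho Brahe's measurements, and the function name is hypothetical.

```python
import math

# Approximate modern values: (semi-major axis in AU, orbital period in years).
# Illustrative assumptions only; these are not Tycho Brahe's original data.
PLANETS = {
    "Mercury": (0.387, 0.241),
    "Venus": (0.723, 0.615),
    "Earth": (1.000, 1.000),
    "Mars": (1.524, 1.881),
    "Jupiter": (5.203, 11.862),
    "Saturn": (9.537, 29.457),
}


def fit_power_law(pairs):
    """Least-squares fit of log(T) = slope * log(a) + intercept."""
    xs = [math.log(a) for a, _ in pairs]
    ys = [math.log(t) for _, t in pairs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_power_law(list(PLANETS.values()))
# Kepler's third law (T^2 proportional to a^3) corresponds to an exponent of 1.5.
print(f"fitted exponent: {slope:.3f}")
```

The fit works in logarithms because a power law becomes a straight line on log-log axes, which reduces the problem to ordinary linear regression.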

As is almost always the case, this multigenerational repurposing project led to a conceptual simplification of the original data. After the switch was made from a geocentric to a heliocentric system, operating under a simple set of equations, it became far easier to calculate the relative motion of objects (e.g., planetary orbits) and to predict the position of celestial bodies.

From Newton's work, based on Kepler's elliptical orbits, based in turn on Tycho Brahe's data, came the calculus and Newton's theory of relativity (that is, classical relativity). Newton, as well as his predecessor Galileo, assumed the existence of an absolute space, within which the laws of motion hold true. The planets, and all physical bodies, were thought to move relative to one another in their own frames of reference, within an absolute space, all sharing an absolute time. Einstein revisited Newton's theories of relativity and concluded that time, like motion, is relative and not absolute.

The discovery of heliocentric planetary motion and the broader issues of relative frames of observation in space were developed over more than 2,000 years of observation, analysis, and reanalysis of old data. Each successive scientist used a prior set of observations to answer a new question. In so doing, star data, originally intended for navigational purposes, was repurposed to produce a new model of our universe.

Case Study 2.2

From Hydrogen Spectrum Data to Quantum Mechanics

In about 1880, Vogel and Huggins published the emission frequencies of hydrogen (i.e., the hydrogen spectroscopic emission lines) [1,2]. In 1885, Johann Balmer, studying the emission frequencies of the hydrogen spectral lines, developed a formula that precisely expressed frequency in terms of the numeric order of its emission line (i.e., n=1, 2, 3, 4, and so on). Balmer's attempt at data modeling produced one of the strangest equations in the history of science. There was simply no precedent for expressing the frequency of an electromagnetic wave in terms of its spectral emission rank. The formula was introduced to the world without the benefit of any theoretical explanation. Balmer himself indicated that he was just playing around with numbers. Nonetheless, he had hit upon a formula that precisely described multiple emission lines, in terms of ascending integers.
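Balmer's formula is usually written for wavelength rather than frequency: λ = B · m² / (m² − 4), where m is an integer greater than 2. The sketch below evaluates it for the first four visible hydrogen lines. It assumes a value of about 364.56 nm for the constant B and maps the line rank n = 1, 2, 3, 4 to m = n + 2; both the constant and the function name are assumptions made for illustration and do not come from the text.

```python
# Balmer's constant in nanometres (approximate; an assumption, not from the text).
BALMER_CONSTANT_NM = 364.56


def balmer_wavelength_nm(rank):
    """Wavelength of the visible hydrogen line with the given rank (1, 2, 3, ...).

    Rank 1 is the first (red) line. The integer in Balmer's formula is
    m = rank + 2, so ranks 1..4 correspond to m = 3..6.
    """
    if rank < 1:
        raise ValueError("rank must be a positive integer")
    m = rank + 2
    return BALMER_CONSTANT_NM * m**2 / (m**2 - 4)


# Print the rank and predicted wavelength of the first four lines.
for rank in range(1, 5):
    print(rank, round(balmer_wavelength_nm(rank), 2))
```

The predicted wavelengths decrease as the rank rises and converge toward B itself, which is the sense in which a single expression in ascending integers describes the whole series.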

Twenty-eight years later, in 1913, Niels Bohr chanced upon Balmer's formula and used it to explain spectral lines in terms of energy emissions resulting from transitions between discrete electron orbits. Balmer's amateurish venture into data repurposing led, somewhat inadvertently, to the birth of modern quantum physics.

2.2 Repurposing the Physical and Abstract Property of Uniqueness

L'art c'est moi, la science c'est nous.

Claude Bernard

An object is unique if it can be distinguished from every other object. The quality of object uniqueness permits data scientists to associate nonunique data values with unique data objects; hence, identifying the data. As an example, let us examine the utility of natural uniqueness for the forensic scientist.

Case Study 2.3

Fingerprints: from Personal Identifier to Data-Driven Forensics

Fingerprints have been used, since antiquity, as a method for establishing the identity of individuals. Fingerprints were pressed onto clay tablets, seals, and even pottery left by ancient civilizations, including the Minoan, Greek, Japanese, and Chinese. As early as the second millennium BCE, fingerprints were used as a type of signature in Babylon, and ancient Babylonian policemen recorded the fingerprints of criminals, much as modern policemen do today (Figure 2.1).

Figure 2.1 U.S. Federal Bureau of Investigation Fingerprint Division, World War II. FBI, public domain (see Glossary item, Public domain).

Towards the close of the nineteenth century, Francis Galton repurposed fingerprint data to pursue his own particular interests. Galton was primarily interested in the heritability and racial characteristics of fingerprints, a field of study that can best be described as a scientific dead end. Nonetheless, in pursuit of his interests, he devised a way of classifying fingerprints by patterns (e.g., plain arch, tented arch, simple loop, central pocket loop, double loop, lateral pocket loop, and plain whorl). This classification launched the new science of fingerprint identification, an area of research that has been actively pursued and improved over the past 120 years (see Glossary item, Classification).

In addition to Galton's use of classification methods, two closely related, simple technological enhancements vastly increased the importance of fingerprints. The first was the incredibly simple procedure of recording sets of fingerprints, on paper, with indelible ink. With the simple fingerprint card, the quality of fingerprints improved, and the process of sharing and comparing recorded fingerprints became more practical. The second enhancement was the decision to collect fingerprint cards in permanent population databases. Fingerprint databases enabled forensic scientists to match fingerprints found at the scene of a crime with fingerprints stored in the database. The task of fingerprint matching was greatly simplified by confining comparisons to prints that shared the same class-based profiles, as described by Galton.
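The benefit of confining comparisons to a single class can be sketched in a few lines of Python. In this hedged illustration, the records, the identifiers, and the short "feature codes" are all hypothetical stand-ins; real fingerprint matching compares ridge details rather than exact strings. The point is only that an index keyed by pattern class lets a search skip every print in the other classes.

```python
from collections import defaultdict

# Hypothetical records: (person_id, pattern_class, feature_code).
# The feature codes are placeholders for a real ridge-detail comparison.
RECORDS = [
    ("P001", "plain arch", "A17"),
    ("P002", "plain whorl", "W42"),
    ("P003", "simple loop", "L08"),
    ("P004", "plain whorl", "W99"),
]


def build_class_index(records):
    """Group stored prints by their pattern class."""
    index = defaultdict(list)
    for person_id, pattern_class, feature_code in records:
        index[pattern_class].append((person_id, feature_code))
    return index


def find_matches(index, pattern_class, feature_code):
    """Compare a query print only against stored prints of the same class."""
    candidates = index.get(pattern_class, [])
    return [pid for pid, code in candidates if code == feature_code]


index = build_class_index(RECORDS)
matches = find_matches(index, "plain whorl", "W42")
print(matches)
```

With this layout, a query for a plain whorl is compared against two stored prints instead of four; in a large database, the savings grow with the number of classes.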

Repurposing efforts have expanded the use of fingerprints to include authentication (i.e., proving you are who you claim to be), keying (e.g., opening locked devices based on an authenticated fingerprint or some other identifying biometric), tracking (e.g., establishing the path and whereabouts of an individual by following a trail of fingerprints or other identifiers), and body part identification (i.e., identifying the remains of individuals recovered from mass graves or from the sites of catastrophic events). In the past decade, flaws in the vaunted process of fingerprint identification have been documented, and the improvement of the science of identification is an active area of investigation [3].

Today, most of what we think of as the forensic sciences is based on object identification (e.g., biometrics, pollen identification, trace chemical investigation, tire mark investigation, and so on). When a data object is uniquely identified, the additional data associated with it can be collected, aggregated, and retrieved, as needed.
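A minimal sketch of this idea, assuming hypothetical identifiers and attribute names: observations arrive as separate records, each tagged with the unique identifier of the object it describes, and are gathered into one profile per object. Note that the values themselves need not be unique (two objects share the same eye color below); only the identifier must be.

```python
from collections import defaultdict

# Hypothetical observations: (unique object identifier, attribute, value).
OBSERVATIONS = [
    ("obj-001", "height_cm", 172),
    ("obj-002", "height_cm", 165),
    ("obj-001", "eye_color", "brown"),
    ("obj-002", "eye_color", "brown"),
]


def aggregate_by_identifier(observations):
    """Collect every attribute recorded for each uniquely identified object."""
    objects = defaultdict(dict)
    for identifier, attribute, value in observations:
        objects[identifier][attribute] = value
    return dict(objects)


objects = aggregate_by_identifier(OBSERVATIONS)
# Retrieve everything known about one object by its identifier.
print(objects["obj-001"])
```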

2.3 Repurposing a 2,000-Year-Old Classification

Our similarities are different.

Yogi Berra

Classifications drive down the complexity of knowledge domains and lay bare the relationships among different objects. Observations that hold for a data object may also hold for the other objects of the same class and for their class descendants (see Glossary item, Class). The data analyst can...

File format: EPUB
Copy protection: Adobe DRM (Digital Rights Management)

Computer (Windows; MacOS X; Linux): Install the free software Adobe Digital Editions before downloading (see E-Book Help).

Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions before downloading (see E-Book Help).

E-book readers: Bookeen, Kobo, Pocketbook, Sony, Tolino, and many others (not Kindle)

The EPUB file format is well suited to novels and nonfiction, that is, to "flowing" text without complex layout. On e-readers and smartphones, line and page breaks adapt automatically to the small display. Adobe DRM is a "hard" form of copy protection. If the necessary requirements are not met, you will not be able to open the e-book, so prepare your reading hardware before downloading.

For more information, see our E-Book Help.

File format: PDF
Copy protection: Adobe DRM (Digital Rights Management)

Computer, tablet/smartphone, and e-book reader requirements: the same as for EPUB above.

The PDF file format displays a book page identically on every device. A PDF is therefore also suitable for complex layouts of the kind used in textbooks and reference books (images, tables, columns, footnotes). On the small displays of e-readers and smartphones, PDFs tend to be tedious because they require too much scrolling. Adobe DRM is a "hard" form of copy protection. If the necessary requirements are not met, you will not be able to open the e-book, so prepare your reading hardware before downloading.
