Repurposing Legacy Data: Innovative Case Studies examines how data scientists have repurposed legacy data, whether their own or data that has been donated to the public domain.
Most of the data stored worldwide is legacy data: data created some time in the past, for a particular purpose, and left in obsolete formats. As with keepsakes in an attic, we retain this information thinking it may have value in the future, though we have no current use for it.
The case studies in this book, from such diverse fields as cosmology, quantum physics, high-energy physics, microbiology, psychiatry, medicine, and hospital administration, all serve to demonstrate how innovative people draw value from legacy data. By following the case examples, readers will learn how legacy data is restored, merged, and analyzed for purposes that were never imagined by the original data creators.
- Discusses how combining existing data with other data sets of the same kind can produce an aggregate data set that serves to answer questions that could not be answered with any of the original data
- Presents a method for re-analyzing original data sets using alternate or improved methods that can provide outcomes more precise and reliable than those produced in the original analysis
- Explains how to integrate heterogeneous data sets for the purpose of answering questions or developing concepts that span several different scientific fields
Jules Berman holds two bachelor of science degrees from MIT (Mathematics, and Earth and Planetary Sciences), a PhD from Temple University, and an MD from the University of Miami. He was a graduate researcher at the Fels Cancer Research Institute at Temple University and at the American Health Foundation in Valhalla, New York. His postdoctoral studies were completed at the U.S. National Institutes of Health, and his residency was completed at the George Washington University Medical Center in Washington, D.C. Dr. Berman served as Chief of Anatomic Pathology, Surgical Pathology, and Cytopathology at the Veterans Administration Medical Center in Baltimore, Maryland, where he held joint appointments at the University of Maryland Medical Center and at the Johns Hopkins Medical Institutions. In 1998, he transferred to the U.S. National Institutes of Health as a Medical Officer and as the Program Director for Pathology Informatics in the Cancer Diagnosis Program at the National Cancer Institute. Dr. Berman is a past president of the Association for Pathology Informatics and the 2011 recipient of the association's Lifetime Achievement Award. He is a listed author on over 200 scientific publications and has written more than a dozen books in his three areas of expertise: informatics, computer programming, and cancer biology. Dr. Berman is currently a freelance writer.
Learning from the Masters
Data repurposing has made it possible for scientists to lay the foundations of quantum mechanics, evolution, and modern astronomy. It has enabled us to understand past lives (e.g., Mesoamerican history and culture) and has given us a way to identify every organism on earth (e.g., biometrics). This chapter explains the pivotal role played by data repurposing in these important intellectual achievements.
CODIS; identifiers; biometrics; fingerprints; quantum physics; heliocentric system; Mayan glyphs
2.1 New Physics from Old Data
All science is description and not explanation.
Karl Pearson, The Grammar of Science, Preface to 2nd edition, 1899
Case Study 2.1
For most of us, the positions of the planets and of the stars do not provide us with any useful information. This was not always so. For a large part of the history of mankind, individuals determined their locations, the date, and the time, from careful observations of the night sky. On a cloudless night, a competent navigator, on the sea or in the air, could plot a true course.
Repurposed data from old star charts was used to settle, and unsettle, one of our greatest mysteries: Earth's place in the universe. Seven key scientists, working in succession over more than two millennia, used night sky data to reach profound and shocking conclusions: Aristarchus of Samos (circa 310-230 BCE), Nicolaus Copernicus (1473-1543), Tycho Brahe (1546-1601), Johannes Kepler (1571-1630), Galileo Galilei (1564-1642), Isaac Newton (1643-1727), and Albert Einstein (1879-1955).
Back in the third century BCE, Aristarchus of Samos studied the night sky and reasoned that the earth and planets orbited the sun. In addition, Aristarchus correctly assigned the relative positions of the known planets to their heliocentric orbits. About 1,800 years later, Copernicus reanalyzed Aristarchus's assertions and confirmed the heliocentric arrangement of the planets. Soon thereafter, Tycho Brahe produced improved star charts, bequeathing this data to his student, Johannes Kepler. Kepler used the charts to derive three general laws describing the motion of the planets, including the elliptical shape of their orbits. In 1687, Newton published his Principia, wherein Kepler's empiric laws, based on observational data, were rederived from physical principles: Newton's laws of motion. Newton's contribution was a remarkable example of data modeling, wherein an equation was created to describe a set of data pertaining to physical objects (see Glossary item, Modeling).
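The flavor of this kind of data modeling is easy to reproduce. Kepler's third law predicts that a planet's orbital period squared equals its semi-major axis cubed, when periods are measured in years and distances in astronomical units. The sketch below uses modern textbook values as a stand-in for Brahe-era observations:

```python
# Kepler's third law as a data-modeling exercise: with orbits measured in
# astronomical units (a) and years (T), the law predicts T^2 = a^3.
# The values below are standard modern figures, used only for illustration.

PLANETS = {
    "Mercury": (0.387, 0.241),
    "Venus":   (0.723, 0.615),
    "Earth":   (1.000, 1.000),
    "Mars":    (1.524, 1.881),
    "Jupiter": (5.203, 11.862),
    "Saturn":  (9.537, 29.457),
}

def third_law_ratio(a_au: float, period_years: float) -> float:
    """Return T^2 / a^3, which Kepler's third law predicts to be ~1."""
    return period_years**2 / a_au**3

for name, (a, t) in PLANETS.items():
    print(f"{name:8s} T^2/a^3 = {third_law_ratio(a, t):.3f}")
```

Every ratio comes out within about one percent of 1, which is the kind of regularity that invites a deeper physical explanation, exactly what Newton supplied.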
As is almost always the case, this multigenerational repurposing project led to a conceptual simplification of the original data. After the switch was made from a geocentric to a heliocentric system, operating under a simple set of equations, it became far easier to calculate the relative motion of objects (e.g., planetary orbits) and to predict the position of celestial bodies.
From Newton's work, based on Kepler's elliptical orbits, based in turn on Tycho Brahe's data, came the calculus and the Newtonian principle of relativity. Newton, like his predecessor Galileo, assumed the existence of an absolute space within which the laws of motion hold true. The planets, and all physical bodies, were thought to move relative to one another in their own frames of reference, within an absolute space, all sharing an absolute time. Einstein revisited these Newtonian assumptions and concluded that time, like motion, is relative, not absolute.
The discovery of heliocentric planetary motion and the broader issues of relative frames of observation in space were developed over more than 2,000 years of observation, analysis, and reanalysis of old data. Each successive scientist used a prior set of observations to answer a new question. In so doing, star data, originally intended for navigational purposes, was repurposed to produce a new model of our universe.
Case Study 2.2
From Hydrogen Spectrum Data to Quantum Mechanics
In about 1880, Vogel and Huggins published the emission frequencies of hydrogen (i.e., the hydrogen spectroscopic emission lines) [1,2]. In 1885, Johann Balmer, studying the emission frequencies of the hydrogen spectral lines, developed a formula that precisely expressed each line's frequency in terms of its numeric order (i.e., n=1, 2, 3, 4, and so on). Balmer's attempt at data modeling produced one of the strangest equations in the history of science. There was simply no precedent for expressing the frequency of an electromagnetic wave in terms of its spectral emission rank. The formula was introduced to the world without the benefit of any theoretical explanation. Balmer himself indicated that he was just playing around with numbers. Nonetheless, he had hit upon a formula that precisely described multiple emission lines in terms of ascending integers.
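Balmer's formula can be reproduced in a few lines. The sketch below uses its wavelength form (frequency follows from ν = c/λ); the line rank n in the text corresponds to the integer m = n + 2 in the formula:

```python
# Balmer's 1885 formula, in its wavelength form: for the k-th visible
# hydrogen line (k = 1, 2, 3, ...), set m = k + 2 and compute
#     wavelength = B * m**2 / (m**2 - 4)
# where B = 364.50682 nm is Balmer's empirical constant.

B_NM = 364.50682  # Balmer's constant, in nanometers

def balmer_wavelength_nm(k: int) -> float:
    """Wavelength of the k-th Balmer line (k=1 is H-alpha)."""
    m = k + 2
    return B_NM * m**2 / (m**2 - 4)

for k in range(1, 5):
    print(f"line {k}: {balmer_wavelength_nm(k):.1f} nm")
# The first four lines come out near 656.1, 486.0, 433.9, and 410.1 nm,
# close to the measured H-alpha, H-beta, H-gamma, and H-delta lines.
```

That a formula this simple reproduces the measured lines so precisely is what made it impossible to dismiss as numerology, despite its lack of theory.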
Twenty-eight years later, in 1913, Niels Bohr chanced upon Balmer's formula and used it to explain the spectral lines in terms of energy emitted during transitions between discrete electron orbits. Balmer's amateurish venture into data repurposing led, somewhat inadvertently, to the birth of modern quantum physics.
2.2 Repurposing the Physical and Abstract Property of Uniqueness
Art is I; science is we. (L'art c'est moi, la science c'est nous.)
Claude Bernard
An object is unique if it can be distinguished from every other object. The quality of object uniqueness permits data scientists to associate nonunique data values with unique data objects, thereby identifying the data. As an example, let us examine the utility of natural uniqueness for the forensic scientist.
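The idea can be sketched in a few lines: attach a unique identifier to each data object, then associate any number of nonunique values with that identifier. The records and field names below are hypothetical placeholders.

```python
import uuid

# Associate nonunique data values with unique data objects.
# Two people may share a name and an eye color (nonunique values),
# but each record is keyed by an identifier that is unique.
records = {}

def register(name: str, eye_color: str) -> str:
    """Create a uniquely identified record holding nonunique values."""
    object_id = str(uuid.uuid4())  # effectively unique identifier
    records[object_id] = {"name": name, "eye_color": eye_color}
    return object_id

id_a = register("John Smith", "brown")
id_b = register("John Smith", "brown")  # same values, distinct object
assert id_a != id_b                     # the objects remain distinguishable
assert records[id_a] == records[id_b]   # even though their values match
```

A fingerprint serves the same role as the generated identifier here: it is the naturally occurring unique key to which all other (nonunique) data about a person can be attached.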
Case Study 2.3
Fingerprints: From Personal Identifier to Data-Driven Forensics
Fingerprints have been used since antiquity as a method for establishing the identity of individuals. Fingerprints were pressed onto clay tablets, seals, and even pottery left by ancient civilizations, including the Minoans, Greeks, Japanese, and Chinese. As early as the second millennium BCE, fingerprints were used as a type of signature in Babylon, and ancient Babylonian policemen recorded the fingerprints of criminals, much as modern policemen do today (Figure 2.1).
Figure 2.1 U.S. Federal Bureau of Investigation Fingerprint Division, World War II. FBI, public domain (see Glossary item, Public domain).
Towards the close of the nineteenth century, Francis Galton repurposed fingerprint data to pursue his own particular interests. Galton was primarily interested in the heritability and racial characteristics of fingerprints, a field of study that can best be described as a scientific dead end. Nonetheless, in pursuit of his interests, he devised a way of classifying fingerprints by patterns (e.g., plain arch, tented arch, simple loop, central pocket loop, double loop, lateral pocket loop, and plain whorl). This classification launched the new science of fingerprint identification, an area of research that has been actively pursued and improved over the past 120 years (see Glossary item, Classification).
In addition to Galton's classification methods, two simple, closely related technological enhancements vastly increased the importance of fingerprints. The first was the incredibly simple procedure of recording sets of fingerprints on paper, with indelible ink. With the simple fingerprint card, the quality of fingerprints improved, and the process of sharing and comparing recorded fingerprints became more practical. The second enhancement was the decision to collect fingerprint cards in permanent population databases. Fingerprint databases enabled forensic scientists to match fingerprints found at the scene of a crime with fingerprints stored in the database. The task of fingerprint matching was greatly simplified by confining comparisons to prints that shared the same class-based profiles, as described by Galton.
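The computational payoff of Galton's classes can be sketched as a simple blocking scheme: index stored prints by class, so a crime-scene print is compared only against prints of the same class rather than against the whole database. The record structure below is a hypothetical placeholder.

```python
from collections import defaultdict

# Hypothetical database: each print is (print_id, galton_class, features).
database = [
    ("p1", "plain arch",  {"ridges": 12}),
    ("p2", "tented arch", {"ridges": 15}),
    ("p3", "plain whorl", {"ridges": 20}),
    ("p4", "plain arch",  {"ridges": 13}),
]

# Index the prints by Galton class once, up front.
by_class = defaultdict(list)
for print_id, galton_class, features in database:
    by_class[galton_class].append((print_id, features))

def candidate_matches(galton_class: str):
    """Restrict detailed comparison to prints sharing the query's class."""
    return by_class[galton_class]

# A crime-scene print classified as a plain arch is now compared against
# 2 stored prints instead of all 4.
print([pid for pid, _ in candidate_matches("plain arch")])  # ['p1', 'p4']
```

With millions of cards on file, this class-based narrowing is what made manual fingerprint matching feasible at all.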
Repurposing efforts have expanded the use of fingerprints to include authentication (i.e., proving you are who you claim to be), keying (e.g., opening locked devices based on an authenticated fingerprint or some other identifying biometric), tracking (e.g., establishing the path and whereabouts of an individual by following a trail of fingerprints or other identifiers), and body part identification (i.e., identifying the remains of individuals recovered from mass graves or from the sites of catastrophic events). In the past decade, flaws in the vaunted process of fingerprint identification have been documented, and the improvement of the science of identification is an active area of investigation.
Today, most of what we think of as the forensic sciences is based on object identification (e.g., biometrics, pollen identification, trace chemical investigation, tire mark investigation, and so on). Once a data object is uniquely identified, additional data associated with the object can be collected, aggregated, and retrieved as needed.
2.3 Repurposing a 2,000-Year-Old Classification
Our similarities are different.
Classifications drive down the complexity of knowledge domains and lay bare the relationships among different objects. Observations that hold for a data object may also hold for the other objects of the same class and for their class descendants (see Glossary item, Class). The data analyst can...