Preface
A critical element of a robust undergraduate education in science, technology, engineering, and mathematics (STEM) disciplines is an understanding of sets of numbers and how to process, plot, and compare them. This concept is usually first taught in introductory laboratory courses and reinforced in the advanced lab classes. These labs, however, are usually focused on the scientific concept being explored by the experiment as well as the methodologies of setting up the equipment and making the measurements. These classes often give only a brief glimpse into the many techniques for analyzing the obtained number sets, and they often devote even less attention to how one should visualize and compare those number sets.
STEM students need to gain an appreciation of the uncertainties surrounding observations and model results. This uncertainty strongly governs the interpretation of the values and especially the comparison of several values. A related concept is uncertainty propagation: keeping track of the uncertainties as the values are processed (e.g., used as a value in an equation to yield a new number). Without a grasp of the uncertainty of a given number, its comparison with other numbers is meaningless.
Some STEM departments require their students to take a statistics course as part of the undergraduate degree program. While such a course is highly worthwhile and useful for these students, it typically does not cover the full range of issues concerning data analysis, visualization, and comparative metrics techniques that a practicing scientist or engineer should know. Applied statistics courses for STEM majors often cover the basic statistical processing for a single data set (calculating mean and standard deviation, for instance) and for two data sets (calculating a linear regression fit and correlation coefficient, for instance). They usually cover the basics of comparing those two data sets, including t-tests, chi-squared tests, and F-statistic tests. They usually do not cover very much in the way of visualization of the data and almost never cover comparison metrics beyond the correlation coefficient.
Using numerous examples from the Earth, ocean, atmospheric, space, and planetary sciences, this volume presents a comprehensive introduction to data analysis, visualization, and data-model comparisons and metrics, within the framework of the uncertainty around the values. Currently, I teach an upper-level undergraduate course, "Data Analysis and Visualization for Geoscientists," at the University of Michigan, for which this textbook is written. The volume can be used as the text for a data analysis course, as a supplement for an advanced laboratory series, or as a reference resource, by everyone from upper-level undergraduate students to experienced researchers in STEM fields.
While data-model comparisons have always been an essential component of scientific research, they are often not rigorously introduced at the undergraduate level. This is no longer acceptable, especially with the advent of machine learning as a fast-growing field of analysis. A fundamental trait of machine learning is the optimization of the computer-developed model, fitting its result to the training data set. Undergraduate science students gain experience as data analysts, but for some reason, data-model comparisons are barely mentioned in most undergraduate curricula. An ever-increasing share of these students go straight into the industrial and commercial sector, often as data analytics experts. To be effective data scientists and users of advanced statistical applications including machine learning, these students should have an understanding and appreciation of data-model comparison techniques.
How to Use This Book
This is a data analysis textbook for upper-level undergraduate STEM students. It is designed to be their statistics course in the degree program, offering them a learning experience based on real geophysical observations. Data from geoscience examples are used throughout the book to actively engage the reader in the concept of uncertainty as a leading factor in the interpretation and usage of a set of numbers. Note that this book intentionally avoids many of the derivations of the formulas presented. Some are given, but only for context to understand the assumptions built into the formula so that students learn the limitations of that particular formula. No derivations are assigned in the homework problems at the end of the chapters. Instructors who want to add derivations to the assignments should feel free to do so, but I have omitted them because I want the focus to be on the application of the formulas for scientific investigations.
This book should be useful to students across all science disciplines and is meant to serve as an initial course in scientific data analysis and hypothesis testing. It provides the precursor knowledge for understanding machine learning techniques. While it does not explicitly cover machine learning, it provides a critical toolkit for students to fully understand, appreciate, and optimally use the latest machine learning advancements in data science. A key topic in geosciences that it does not cover is periodicity analysis. In my department at the University of Michigan, we have a separate course that deals explicitly with this subject, exploring Fourier transforms and other periodicity methodologies, and then teaching students how to interpret the resulting power spectral density graphs in a scientific context. This topic could be included in a future edition, yielding one merged book for both of these undergraduate geophysical data analysis courses.
While this book could be assigned as a reference text in conjunction with an advanced laboratory course, it is perhaps most effectively used as a separate course taken before, after, or in parallel with the lab class. The lab course is focused on the methods of data collection while this text is focused on the methods of data processing. These two topics are intimately related, but the methods are completely different and each deserves its own focused learning experience.
The topic of data analysis and model metrics requires a computational approach. When I teach the course on which this book is based, half of the class sessions are held in a computer lab with experiential learning examples, walking the students through the usage of the concepts and equations presented in class. Specifically, these interactive analysis sessions are taught using the Python programming language via Jupyter notebooks, which include interleaved blocks of code and explanatory text. At least one version of these code files is already available online, as a supplement to the book content. Instructors should feel free to use this coding material in designing their own version of this type of course. The exercises at the end of the chapters assume some programming proficiency and most of them require some coding for completion.
I have not included any programming-language-specific content in this book; any necessary coding instruction should be provided in addition to the content of this book. In my class, I give the students lots of code; this is not a class about the special tricks of opening a data set in a particular format but rather about using the analysis techniques to robustly assess one or more data sets. Note that some of the "Exercises and geosciences" problem sets at the ends of the chapters are quite long; instructors might want to think about whether to assign everything or only a subset. For some of the chapters, I assign only half, alternating halves from year to year.
Prerequisites for Using This Book
This is meant as an introductory statistics textbook for upper-level undergraduates. It does not require any prior statistics coursework. If students have already taken a statistics course, then some of the first half of the book (especially Chapters 4, 6, and 7) will be somewhat of a review. The content in these chapters, however, is different from a typical statistics approach to the material, so students should find it to be a new perspective on these concepts. There is a small bit of probability in the course, but again no prior knowledge of this topic is needed.
The math in Chapter 3 on uncertainty propagation includes differential calculus. It is assumed that students have this "Calculus 1" knowledge base, so proficiency at this level of math is essential for that part of the course. Chapter 3 also includes partial derivatives, so higher-level calculus is preferred, but this concept could be briefly introduced as part of this course (as I do, when I teach it).
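To give a feel for why partial derivatives appear in uncertainty propagation, the following is a minimal Python sketch. It is not taken from the book or its supplementary notebooks. It assumes the standard first-order propagation rule for independent uncertainties (the output variance is the sum of squared partial derivatives times input uncertainties), and it estimates the partial derivatives numerically. The function name `propagate_uncertainty`, the density example, and all numerical values are hypothetical and chosen only for illustration.

```python
import math


def propagate_uncertainty(f, values, sigmas, rel_step=1e-6):
    """Estimate the uncertainty of f(*values) to first order.

    Assumes the input uncertainties are independent (an assumption of
    this sketch). Each partial derivative is approximated with a
    central finite difference.
    """
    variance = 0.0
    for i, (x, sigma) in enumerate(zip(values, sigmas)):
        h = rel_step * max(abs(x), 1.0)  # step size scaled to the input
        up = list(values)
        down = list(values)
        up[i] = x + h
        down[i] = x - h
        dfdx = (f(*up) - f(*down)) / (2.0 * h)  # partial derivative wrt input i
        variance += (dfdx * sigma) ** 2
    return math.sqrt(variance)


# Hypothetical example: density computed from a measured mass and volume.
def density(mass, volume):
    return mass / volume


mass, sigma_mass = 12.0, 0.3      # illustrative values, not real data
volume, sigma_volume = 4.0, 0.2   # illustrative values, not real data

rho = density(mass, volume)
sigma_rho = propagate_uncertainty(
    density, [mass, volume], [sigma_mass, sigma_volume]
)
print(f"density = {rho:.2f} +/- {sigma_rho:.2f}")
```

For a simple quotient like this one, the numerical result can be checked against the familiar rule that relative uncertainties add in quadrature, which is one way for students to confirm that their code matches the calculus.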
Some programming experience is required to successfully navigate the homework problems in this book. I teach it in Python, via Jupyter notebooks, going through coding technique and example code in class. I do not, however, teach the fundamentals of scientific programming, but rather go through a style guide of code format and documentation expectations. I allow students to work together on coding assignments, as I do not want the programming aspect to be a hurdle to understanding the statistical and data-model comparison concepts.
The book contains numerous examples in Earth, atmospheric, space, and planetary sciences. No prior knowledge of these topics is necessary to fully appreciate these examples; background context is provided and they are separated from...