Preface
pandas is a library for creating and manipulating structured data with Python. What do I mean by structured? I mean tabular data in rows and columns like what you would find in a spreadsheet or database. Data scientists, analysts, programmers, engineers, and others are leveraging it to mold their data.
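As a minimal sketch of what "tabular data in rows and columns" looks like in pandas, the following builds a small table. The column names and values are made up purely for illustration.

```python
import pandas as pd

# Hypothetical data: each dictionary key becomes a column,
# and each position in the lists becomes a row.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [34, 28, 45],
})

print(df.shape)  # (number of rows, number of columns)
print(df)
```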
pandas is limited to "small data" (data that can fit in memory on a single machine). However, the syntax and operations have been adopted by or inspired other projects: PySpark, Dask, and cuDF, among others. These projects have different goals, but some of them will scale out to big data. So, there is value in understanding how pandas works as the features are becoming the de facto API for interacting with structured data.
I, Will Ayd, have been a core maintainer of the pandas library since 2018. During that time, I have had the pleasure of contributing to and collaborating on a host of other open source projects in the same ecosystem, including but not limited to Arrow, NumPy and Cython.
I also consult for a living, utilizing the same ecosystem that I contribute to. Using the best open source tooling, I help clients develop data strategies, implement processes and patterns, and train associates to stay ahead of the ever-changing analytics curve. I strongly believe in the freedom that open source tooling provides, and have proven that value to many companies.
If your company is interested in optimizing your data strategy, feel free to reach out (will_ayd@innobi.io).
Who this book is for
This book contains a huge number of recipes, ranging from very simple to advanced. All recipes strive to be written in clear, concise, and modern idiomatic pandas code. The How it works sections contain extremely detailed descriptions of the intricacies of each step of the recipe. Often, in the There's more... section, you will get what may seem like an entirely new recipe. This book is densely packed with an extraordinary amount of pandas code.
While not strictly required, users are advised to read the book chronologically. The recipes are structured in such a way that they first introduce concepts and features using very small, directed examples, but continuously build from there into more complex applications.
Due to the wide range of complexity, this book can be useful to novice and everyday users alike. It has been my experience that even those who use pandas regularly will not master it without being exposed to idiomatic pandas code. This is somewhat fostered by the breadth that pandas offers. There are almost always multiple ways of completing the same operation, so users can arrive at the result they want but in a very inefficient manner. It is not uncommon to see a performance difference of an order of magnitude or more between two pandas solutions to the same problem.
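To illustrate the point about multiple ways of completing the same operation, here is a small sketch that totals a column twice: once with an explicit Python loop and once with a single built-in method call. The data is invented for illustration, and the actual speed difference depends on your data size and environment, so no timings are claimed here.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({"value": range(1_000)})

# Works, but visits the rows one at a time in Python
loop_total = 0
for _, row in df.iterrows():
    loop_total += row["value"]

# Idiomatic: let pandas perform the reduction in one call
vectorized_total = df["value"].sum()

# Both approaches produce the same answer
assert loop_total == vectorized_total
```

Both versions return the same total; the difference lies in how much work is done in Python versus inside pandas. You can time each version yourself in a notebook to see how they compare on your machine.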
The only real prerequisite for this book is a fundamental knowledge of Python. It is assumed that the reader is familiar with all the common built-in data containers in Python, such as lists, sets, dictionaries, and tuples.
What this book covers
Chapter 1, pandas Foundations, introduces the main pandas objects, namely, Series, DataFrames, and Index.
Chapter 2, Selection and Assignment, shows you how to sift through the data that you have loaded into any of the pandas data structures.
Chapter 3, Data Types, explores the type system underlying pandas. This is an area that has evolved rapidly and will continue to do so, so knowing the types and what distinguishes them is invaluable information.
Chapter 4, The pandas I/O System, shows why pandas has long been a popular tool to read from and write to a variety of storage formats.
Chapter 5, Algorithms and How to Apply Them, introduces you to the foundation of performing calculations with the pandas data structures.
Chapter 6, Visualization, shows you how pandas can be used directly for plotting, alongside the seaborn library which integrates well with pandas.
Chapter 7, Reshaping DataFrames, discusses the many ways in which data can be transformed and summarized robustly via the pandas pd.DataFrame.
Chapter 8, Group By, showcases how to segment and summarize subsets of your data contained within a pd.DataFrame.
Chapter 9, Temporal Data Types and Algorithms, introduces users to the date/time types which underlie time-series-based analyses that pandas is famous for and highlights usage against real data.
Chapter 10, General Usage/Performance Tips, goes over common pitfalls users run into when using pandas, and showcases the idiomatic solutions.
Chapter 11, The pandas Ecosystem, discusses other open source libraries that integrate, extend, and/or complement pandas.
To get the most out of this book
There are a couple of things you can do to get the most out of this book. First, and most importantly, you should download all the code, which is stored in Jupyter notebooks. While reading through each recipe, run each step of code in the notebook. Make sure you explore on your own as you run through the code. Second, have the official pandas documentation open (http://pandas.pydata.org/pandas-docs/stable/) in one of your browser tabs. The pandas documentation is an excellent resource containing over 1,000 pages of material. There are examples for most of the pandas operations in the documentation, and they will often be directly linked from the See also section. While it covers the basics of most operations, it does so with trivial examples and fake data that don't reflect situations that you are likely to encounter when analyzing datasets from the real world.
What you need for this book
pandas is a third-party package for the Python programming language and, as of the printing of this book, is transitioning from the 2.x to the 3.x series. The examples in this book should work with a minimum pandas version of 2.0 along with Python versions 3.9 and above.
The code in this book will make use of the pandas, NumPy, and PyArrow libraries. Jupyter Notebook files are also a popular way to visualize and inspect code. All of these libraries should be installable via pip or the package manager of your choice. For pip users, you can run:
python -m pip install pandas numpy pyarrow notebook
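Once the installation finishes, a quick check along the following lines can confirm that your environment meets the minimums stated above (Python 3.9 and pandas 2.0). This is only a sketch: the helper name installed_versions is hypothetical, and the check looks at the major version number alone.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(["pandas", "numpy", "pyarrow", "notebook"])
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")

# The book targets Python 3.9+ and pandas 2.0+
python_ok = sys.version_info >= (3, 9)
pandas_ver = versions["pandas"]
pandas_ok = pandas_ver is not None and int(pandas_ver.split(".")[0]) >= 2
print(f"Python OK: {python_ok}, pandas OK: {pandas_ok}")
```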
Download the example code files
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support/errata and register to have the files emailed directly to you.
You can download the code files by following these steps:
- Log in or register at www.packt.com.
- Select the Support tab.
- Click on Code Downloads.
- Enter the name of the book in the Search box and follow the on-screen instructions.
The code bundle for the book is also hosted on GitHub at https://github.com/WillAyd/Pandas-Cookbook-Third-Edition. In case there is an update to the code, it will be updated in the existing GitHub repository.
Running a Jupyter notebook
The suggested method to work through the content of this book is to have a Jupyter notebook up and running so that you can run the code while reading through the recipes. Following along on your computer allows you to go off exploring on your own and gain a deeper understanding than by just reading the book alone.
After installing Jupyter Notebook, open a Command Prompt (type cmd at the search bar on Windows, or open Terminal on Mac or Linux) and type:
jupyter notebook
It is not necessary to run this command from your home directory. You can run it from any location, and the contents in the browser will reflect that location. Although we have now started the Jupyter Notebook program, we haven't actually launched a single individual notebook where we can start developing in Python. To do so, you can click on the New button on the right-hand side of the page, which will drop down a list of all the possible kernels available for you to use. If you are working from a fresh installation, then you will only have a single kernel available to you (Python 3). After selecting the Python 3 kernel, a new tab will open in the browser, where you can start writing Python code.
You can, of course, open previously created notebooks instead of beginning a new one. To do so, navigate through the filesystem provided in the Jupyter Notebook browser home page and select the notebook you want to open. All Jupyter Notebook files end in .ipynb.
Alternatively, you may use cloud providers for a notebook environment. Both Google and Microsoft provide free notebook environments that come preloaded with pandas.
Download the color images
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://packt.link/gbp/9781836205876.
Conventions
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in...