Pandas Cookbook

Name: Pandas Cookbook | Practical recipes for scientific computing, time series, and exploratory data analysis using Python
Brand: Packt Publishing
Price: 29.99 EUR
Availability: OnlineOnly

Practical recipes for scientific computing, time series, and exploratory data analysis using Python

William Ayd, Matthew Harrison (Authors)

Packt Publishing

1st edition

Published on 31 October 2024

404 pages

E-Book

ePUB with Adobe DRM

978-1-83620-586-9 (ISBN)

29.99 EUR incl. 7% VAT

E-book single license

Available for download

Contents

Cover
Title Page
Copyright Page
Foreword
Contributors
Table of Contents
Preface
Making the Most Out of This Book - Get to Know Your Free Benefits
Chapter 1: pandas Foundations
Importing pandas
Series
DataFrame
Index
Series attributes
DataFrame attributes
Chapter 2: Selection and Assignment
Basic selection from a Series
Basic selection from a DataFrame
Position-based selection of a Series
Position-based selection of a DataFrame
Label-based selection from a Series
Label-based selection from a DataFrame
Mixing position-based and label-based selection
DataFrame.filter
Selection by data type
Selection/filtering via Boolean arrays
Selection with a MultiIndex - A single level
Selection with a MultiIndex - Multiple levels
Selection with a MultiIndex - a DataFrame
Item assignment with .loc and .iloc
DataFrame column assignment
Chapter 3: Data Types
Integral types
Floating point types
Boolean types
String types
Missing value handling
Categorical types
Temporal types - datetime
Temporal types - timedelta
Temporal PyArrow types
PyArrow List types
PyArrow decimal types
NumPy type system, the object type, and pitfalls
Chapter 4: The pandas I/O System
CSV - basic reading/writing
CSV - strategies for reading large files
Microsoft Excel - basic reading/writing
Microsoft Excel - finding tables in non-default locations
Microsoft Excel - hierarchical data
SQL using SQLAlchemy
SQL using ADBC
Apache Parquet
JSON
HTML
Pickle
Third-party I/O libraries
Chapter 5: Algorithms and How to Apply Them
Basic pd.Series arithmetic
Basic pd.DataFrame arithmetic
Aggregations
Transformations
Map
Apply
Summary statistics
Binning algorithms
One-hot encoding with pd.get_dummies
Chaining with .pipe
Selecting the lowest-budget movies from the top 100
Calculating a trailing stop order price
Finding the baseball players best at...
Understanding which position scores the most per team
Chapter 6: Visualization
Creating charts from aggregated data
Plotting distributions of non-aggregated data
Further plot customization with Matplotlib
Exploring scatter plots
Exploring categorical data
Exploring continuous data
Using seaborn for advanced plots
Chapter 7: Reshaping DataFrames
Concatenating pd.DataFrame objects
Merging DataFrames with pd.merge
Joining DataFrames with pd.DataFrame.join
Reshaping with pd.DataFrame.stack and pd.DataFrame.unstack
Reshaping with pd.DataFrame.melt
Reshaping with pd.wide_to_long
Reshaping with pd.DataFrame.pivot and pd.pivot_table
Reshaping with pd.DataFrame.explode
Transposing with pd.DataFrame.T
Join our community on Discord
Chapter 8: Group By
Group by basics
Grouping and calculating multiple columns
Group by apply
Window operations
Selecting the highest rated movies by year
Comparing the best hitter in baseball across years
Chapter 9: Temporal Data Types and Algorithms
Timezone handling
DateOffsets
Datetime selection
Resampling
Aggregating weekly crime and traffic accidents
Calculating year-over-year changes in crime by category
Accurately measuring sensor-collected events with missing values
Chapter 10: General Usage and Performance Tips
Avoid dtype=object
Be cognizant of data sizes
Use vectorized functions instead of loops
Avoid mutating data
Dictionary-encode low cardinality data
Test-driven development features
Chapter 11: The pandas Ecosystem
Foundational libraries
NumPy
PyArrow
Exploratory data analysis
YData Profiling
Data validation
Great Expectations
Visualization
Plotly
PyGWalker
Data science
scikit-learn
XGBoost
Databases
DuckDB
Other DataFrame libraries
Ibis
Dask
Polars
cuDF
Packt Page
Other Books You May Enjoy
Index

Preface

pandas is a library for creating and manipulating structured data with Python. What do I mean by structured? I mean tabular data in rows and columns like what you would find in a spreadsheet or database. Data scientists, analysts, programmers, engineers, and others are leveraging it to mold their data.

pandas is limited to "small data" (data that can fit in memory on a single machine). However, the syntax and operations have been adopted by or inspired other projects: PySpark, Dask, and cuDF, among others. These projects have different goals, but some of them will scale out to big data. So, there is value in understanding how pandas works as the features are becoming the de facto API for interacting with structured data.
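
To make "tabular data that fits in memory" concrete, here is a minimal sketch. The column names and values are invented for illustration and do not come from the book; the point is only that a pd.DataFrame holds rows and columns, and that its approximate memory footprint can be inspected.

```python
import pandas as pd

# Hypothetical toy data: each dict key becomes a column, each list position a row.
df = pd.DataFrame(
    {
        "city": ["Berlin", "Paris", "Madrid"],
        "population_millions": [3.7, 2.1, 3.3],
    }
)

n_rows, n_cols = df.shape

# Approximate in-memory footprint in bytes; deep=True also counts string contents.
total_bytes = int(df.memory_usage(deep=True).sum())

print(df)
print(f"{n_rows} rows x {n_cols} columns, roughly {total_bytes} bytes in memory")
```

If the reported footprint of a real dataset approaches the RAM available on your machine, that is a hint that one of the scale-out projects mentioned above may be a better fit.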

I, Will Ayd, have been a core maintainer of the pandas library since 2018. During that time, I have had the pleasure of contributing to and collaborating on a host of other open source projects in the same ecosystem, including but not limited to Arrow, NumPy and Cython.

I also consult for a living, utilizing the same ecosystem that I contribute to. Using the best open source tooling, I help clients develop data strategies, implement processes and patterns, and train associates to stay ahead of the ever-changing analytics curve. I strongly believe in the freedom that open source tooling provides, and have proven that value to many companies.

If your company is interested in optimizing your data strategy, feel free to reach out (will_ayd@innobi.io).

Who this book is for

This book contains a huge number of recipes, ranging from very simple to advanced. All recipes strive to be written in clear, concise, and modern idiomatic pandas code. The How it works sections contain extremely detailed descriptions of the intricacies of each step of the recipe. Often, in the There's more section, you will get what may seem like an entirely new recipe. This book is densely packed with an extraordinary amount of pandas code.

While not strictly required, readers are advised to work through the book in order. The recipes are structured so that they first introduce concepts and features using very small, directed examples, and then build continuously from there into more complex applications.

Due to the wide range of complexity, this book can be useful to novice and everyday users alike. It has been my experience that even those who use pandas regularly will not master it without being exposed to idiomatic pandas code. This is partly a consequence of the breadth that pandas offers. There are almost always multiple ways of completing the same operation, so users can get the result they want but in a very inefficient manner. It is not uncommon to see a performance difference of an order of magnitude or more between two pandas solutions to the same problem.
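
As a rough illustration of "multiple ways of completing the same operation", the sketch below computes the same result twice: once with a row-by-row Python loop and once with a single vectorized expression. The data and variable names are hypothetical, and no timings are claimed here; measuring both on your own data is the reliable way to see the gap.

```python
import pandas as pd

# Hypothetical prices and quantities, invented for this sketch.
df = pd.DataFrame({"price": [2.5, 4.0, 1.25], "quantity": [4, 1, 8]})

# Non-idiomatic: iterate row by row in Python.
totals_loop = []
for _, row in df.iterrows():
    totals_loop.append(row["price"] * row["quantity"])
totals_loop = pd.Series(totals_loop, index=df.index)

# Idiomatic: one vectorized operation over whole columns.
totals_vectorized = df["price"] * df["quantity"]

# Both approaches hold the same values.
print(totals_loop.tolist() == totals_vectorized.tolist())
```

The vectorized form is shorter and lets pandas do the work in optimized code rather than in the Python interpreter, which is the theme Chapter 10 returns to.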

The only real prerequisite for this book is a fundamental knowledge of Python. It is assumed that the reader is familiar with all the common built-in data containers in Python, such as lists, sets, dictionaries, and tuples.

What this book covers

Chapter 1, pandas Foundations, introduces the main pandas objects, namely Series, DataFrame, and Index.

Chapter 2, Selection and Assignment, shows you how to sift through the data that you have loaded into any of the pandas data structures.

Chapter 3, Data Types, explores the type system underlying pandas. This is an area that has evolved rapidly and will continue to do so, so knowing the types and what distinguishes them is invaluable information.

Chapter 4, The pandas I/O System, shows why pandas has long been a popular tool to read from and write to a variety of storage formats.

Chapter 5, Algorithms and How to Apply Them, introduces you to the foundation of performing calculations with the pandas data structures.

Chapter 6, Visualization, shows you how pandas can be used directly for plotting, alongside the seaborn library which integrates well with pandas.

Chapter 7, Reshaping DataFrames, discusses the many ways in which data can be transformed and summarized robustly via the pandas pd.DataFrame.

Chapter 8, Group By, showcases how to segment and summarize subsets of your data contained within a pd.DataFrame.

Chapter 9, Temporal Data Types and Algorithms, introduces the date/time types that underlie the time-series analyses pandas is famous for, and demonstrates their use on real data.

Chapter 10, General Usage and Performance Tips, goes over common pitfalls users run into when using pandas, and showcases the idiomatic solutions.

Chapter 11, The pandas Ecosystem, discusses other open source libraries that integrate, extend, and/or complement pandas.
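
As a small taste of the operations a few of these chapters deal with, the sketch below touches on the core objects (Chapter 1), selection (Chapter 2), data types (Chapter 3), and group by (Chapter 8). It is not taken from the book; the data, column names, and index labels are invented for illustration.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame(
    {
        "team": ["A", "A", "B", "B"],
        "points": [10, 14, 7, 9],
    },
    index=pd.Index(["w", "x", "y", "z"], name="player"),
)

# Chapter 1: a DataFrame column is a Series; both share the same Index.
points = df["points"]
same_index = points.index.equals(df.index)

# Chapter 2: label-based vs. position-based selection of the same value.
by_label = df.loc["x", "points"]
by_position = df.iloc[1, 1]

# Chapter 3: every column carries a data type.
points_dtype = points.dtype

# Chapter 8: split rows by team, then summarize each group.
points_per_team = df.groupby("team")["points"].sum()

print(same_index, by_label, by_position, points_dtype)
print(points_per_team)
```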

To get the most out of this book

There are a couple of things you can do to get the most out of this book. First, and most importantly, download all the code, which is provided as Jupyter notebooks. While reading through each recipe, run each step of code in the notebook, and explore on your own as you go. Second, keep the official pandas documentation (http://pandas.pydata.org/pandas-docs/stable/) open in one of your browser tabs. The pandas documentation is an excellent resource containing over 1,000 pages of material. It has examples for most pandas operations, and these are often linked directly from the See also section. While the documentation covers the basics of most operations, it does so with trivial examples and fake data that don't reflect situations you are likely to encounter when analyzing real-world datasets.

What you need for this book

pandas is a third-party package for the Python programming language and, as of the printing of this book, is transitioning from the 2.x to the 3.x series. The examples in this book should work with a minimum pandas version of 2.0 along with Python versions 3.9 and above.

The code in this book will make use of the pandas, NumPy, and PyArrow libraries. Jupyter Notebook files are also a popular way to visualize and inspect code. All of these libraries should be installable via pip or the package manager of your choice. For pip users, you can run:

python -m pip install pandas numpy pyarrow notebook
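
After installing, a quick check like the following can confirm that the environment meets the minimums stated above (pandas 2.0 and Python 3.9). This is only a sketch: the helper name installed_version is made up for this example, and it uses the standard library's importlib.metadata so that a missing package is reported rather than raising an import error.

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# The four packages named in the pip command above.
versions = {
    name: installed_version(name)
    for name in ("pandas", "numpy", "pyarrow", "notebook")
}

# Minimums taken from the text above: Python 3.9 and pandas 2.0.
python_ok = sys.version_info >= (3, 9)
pandas_ok = (
    versions["pandas"] is not None
    and int(versions["pandas"].split(".")[0]) >= 2
)

print("Python", sys.version.split()[0], "ok" if python_ok else "too old")
for name, version in versions.items():
    print(name, version if version is not None else "not installed")
print("pandas meets the 2.0 minimum:", pandas_ok)
```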

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packtpub.com/support/errata and register to have the files emailed directly to you.

You can download the code files by following these steps:

1. Log in or register at www.packt.com.
2. Select the Support tab.
3. Click on Code Downloads.
4. Enter the name of the book in the Search box and follow the on-screen instructions.

The code bundle for the book is also hosted on GitHub at https://github.com/WillAyd/Pandas-Cookbook-Third-Edition. Any updates to the code will be made in that repository.

Running a Jupyter notebook

The suggested method to work through the content of this book is to have a Jupyter notebook up and running so that you can run the code while reading through the recipes. Following along on your computer allows you to go off exploring on your own and gain a deeper understanding than by just reading the book alone.

After installing Jupyter notebook, open a Command Prompt (type cmd at the search bar on Windows, or open Terminal on Mac or Linux) and type:

jupyter notebook

It is not necessary to run this command from your home directory. You can run it from any location, and the contents in the browser will reflect that location. Although this starts the Jupyter Notebook program, it does not yet open an individual notebook in which you can write Python. To do so, click the New button on the right-hand side of the page, which drops down a list of the kernels available to you. If you are working from a fresh installation, you will have only a single kernel available (Python 3). After you select the Python 3 kernel, a new tab will open in the browser, where you can start writing Python code.

You can, of course, open previously created notebooks instead of beginning a new one. To do so, navigate through the filesystem provided in the Jupyter Notebook browser home page and select the notebook you want to open. All Jupyter Notebook files end in .ipynb.

Alternatively, you may use cloud providers for a notebook environment. Both Google and Microsoft provide free notebook environments that come preloaded with pandas.

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://packt.link/gbp/9781836205876.

Conventions

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in...
