Understand and implement big data analysis solutions in pandas with an emphasis on performance. This book strengthens your intuition for working with pandas, the Python data analysis library, by exploring its underlying implementation and data structures.
Thinking in Pandas introduces the topic of big data and demonstrates concepts by looking at exciting and impactful problems that pandas has helped to solve. From there, you will learn to assess your own projects by size and type to see if pandas is the appropriate library for your needs. Author Hannah Stepanek explains how to load and normalize data in pandas efficiently, and reviews some of the most commonly used loaders and several of their most powerful options. You will then learn how to access and transform data efficiently, which methods to avoid, and when to employ more advanced performance techniques. You will also go over basic data access and munging in pandas and the intuitive dictionary syntax. The book also covers choosing the right DataFrame format, working with multi-level DataFrames, and how pandas might be improved in the future.
By the end of the book, you will have a solid understanding of how the pandas library works under the hood. Get ready to make confident decisions in your own projects by using pandas the right way.
What You Will Learn
- Understand the underlying data structure of pandas and why it performs the way it does under certain circumstances
- Discover how to use pandas to extract, transform, and load data correctly with an emphasis on performance
- Choose the right DataFrame format so that the data analysis is simple and efficient
- Improve performance of pandas operations with other Python libraries
Who This Book Is For
Software engineers with basic Python programming skills who are keen on using pandas for a big data analysis project, and Python software developers interested in big data.
Hannah Stepanek is a software developer with a passion for performance and is an open source advocate. She has over seven years of industry experience programming in Python and spent one and a half of those years implementing a data analysis project using pandas. She currently works at a small remote company called Hypothesis, building an annotation tool for annotating the web.
Hannah was born and raised in Corvallis, OR, and graduated from Oregon State University with a major in electrical and computer engineering. She enjoys engaging with the software community, often giving talks at local meetups as well as larger conferences. In early 2019, she spoke at PyCon US about the pandas library and at OpenCon Cascadia about the benefits of open source software. In her spare time, she enjoys riding her horse Sophie and playing board games.
Section 1: Explain the underlying implementation of pandas and establish a foundation of understanding that later sections build on.
Chapter 1: An Introduction to Big Data & Pandas
Chapter Goal: Introduce the reader to big data and some exciting big data problems that pandas has helped solve.
No. of pages: 7
Sub-Topics:
- A brief introduction to big data
- Some examples of impactful problems that pandas has helped solve
- The limitations of pandas (that is, when to use pandas and when to use a different library)
Chapter 2: How Pandas Works Under the Hood
Chapter Goal: Help the reader understand the data structures that pandas is built on.
No. of pages: 20
Sub-Topics:
- A brief review of C vs. Python performance
- A brief review of NumPy (the library that pandas is built on)
- What a single-indexed DataFrame looks like underneath
- What a multi-indexed DataFrame looks like underneath
- What a multi-indexed, multi-level-column DataFrame looks like underneath
- How to choose the right DataFrame
Section 2: Help the reader understand how to load and normalize data in pandas efficiently.
Chapter 3: Loading and Normalizing Data in Pandas
Chapter Goal: Help the reader understand how to load and normalize data in pandas in a performant manner.
No. of pages: 20
Sub-Topics:
- A review of some of the pandas loaders (CSV, JSON, SQL, etc.) and some of their most useful options
- Normalizing while loading data
- How to get the best performance when loading data
- Some gotchas to watch out for when loading data
Section 3: Help the reader understand how to analyze and manipulate data in pandas efficiently.
Chapter 4: Basic Data Access and Munging in Pandas
Chapter Goal: An introduction to accessing data in pandas for beginners.
No. of pages: 5
Sub-Topics:
- Using dictionary syntax, iloc, and loc to access data
- Using merge, join, and concatenate to combine data
Chapter 5: Reshaping Data
Chapter Goal: Help the reader understand when to employ certain data reshaping techniques.
No. of pages: 20
Sub-Topics:
- Review of pivot and pivot table
- Review of transpose
- Review of stack and unstack
- Review of melt
Chapter 6: Apply: When to Use It, When Not to Use It, and How to Get the Best Performance
Chapter Goal: Aid the reader in understanding when to use Apply and how they can improve its performance.
No. of pages: 10
Sub-Topics:
- Review some examples of when not to use Apply
- Review some examples that warrant the use of Apply
- The performance implications of using Apply
- How to implement a performant Apply
Chapter 7: Groupby
Chapter Goal: Help the reader understand how to use Groupby and what alternative approaches exist.
No. of pages: 7
Sub-Topics:
- How Groupby works underneath
- How to get the best performance when using Groupby
- Groupby alternatives
Section 4: Help the reader understand advanced techniques for improving performance and how pandas might be improved upon in the future.
Chapter 8: NumExpr: Performance Improvements Beyond Pandas
Chapter Goal: Help the reader understand how installing NumExpr can improve pandas performance.
No. of pages: 10
Sub-Topics:
- A brief review of computer architecture with an emphasis on memory caching
- How NumExpr improves performance of pandas operations
- Why eval and query are faster when NumExpr is installed
Chapter 9: The Future of Pandas
Chapter Goal: Help the reader understand where the pandas library is headed in terms of implementation and how it could be improved.
No. of pages: 7
Sub-Topics:
- Where is pandas headed?
- What are the problem areas and areas for improvement?
- Conclusion