This text introduces the fundamentals of data science using two main open-source programming languages: R and Python. Each comes with its own application context, formed by tools that support writing scripts, i.e., logical sequences of instructions aimed at producing certain results or functionalities. These tools may be of the command line interface (CLI) type, consoles operated through textual commands, or of the integrated development environment (IDE) type, interactive applications that support working with the language. Other elements of the application context are the supplementary libraries, which provide functions beyond the basic ones shipped with the language; package managers, which automate the download and installation of new libraries; online documentation; cheat sheets; tutorials; and online discussion and help forums for users. This context, formed by a language, tools, additional features, discussions between users, and online documentation produced by developers, is what we mean when we say "R" and "Python," not the bare programming language, which by itself would amount to very little. It would be like talking only about the engine when what you want to explain is how to drive a car on busy roads.
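To make this application context concrete, here is a minimal sketch of how these pieces fit together in a Python session; pandas is used merely as an example of a supplementary library, and the commands assume a standard Python installation with the pip package manager available.

# From the command line (CLI), the package manager downloads and
# installs a supplementary library in one step:
#
#     pip install pandas
#
# In a script or an interactive session (for instance inside an IDE),
# the library is then imported to make its functions available:
import pandas as pd

# A data frame, the tabular structure at the center of this book,
# built here from toy values used purely for illustration:
df = pd.DataFrame({"language": ["R", "Python"], "first_release": [1993, 1991]})
print(df)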
R and Python, together and with the meaning just described, represent the knowledge needed to start approaching data science: carry out the first simple steps, complete the educational examples, get acquainted with real data, consider more advanced features, familiarize yourself with other real data, experiment with particular cases, analyze the logic behind mechanisms, gain experience with more complex real data, study online discussions of exceptional cases, look for data sources in the world of open data, think about the results to be obtained, combine still more data sources, get used to different data formats, to large datasets, to datasets that will drive you crazy before yielding a workable version, and finally be ready to move on to other technologies, other applications, uses, types of results, and projects of ever-increasing complexity. This is the journey that starts here, and as discussed in the preface, it is within the reach of anyone who puts some effort and time into it. A single book, of course, cannot contain everything, but it can help you start, proceed in the right direction, and accompany you for a while.
With this text, we will start from the elementary steps and gain speed quickly. We will use simplified teaching examples, but we will also immediately familiarize ourselves with the kind of data that exists in reality, rather than in the unreality of teaching examples. We will finish by tackling some elaborate examples, in which even the inconsistencies and errors that are part of daily reality will emerge, requiring us to find solutions.
It often happens that students dealing with this material, especially the younger ones, initially find it difficult to figure out the right way to approach their studies in order to learn effectively. One of the main causes of this difficulty is that many are accustomed to the idea that the goal of learning is to never make mistakes. This is not surprising, since it is the criterion adopted by many exams: the more mistakes, the lower the grade. This is not the place to discuss the effectiveness of exam methodologies or teaching philosophies; we are pragmatists, and the goal is to learn R and Python, computational logic, and everything that revolves around them. But it is precisely from a wholly pragmatic perspective that the approach of minimizing errors proves inadequate, for at least two good reasons. The first is that the goal of never making mistakes inevitably leads to mnemonic study: sequences of steps, names, formulas, sentences, and specific cases are memorized, and the variability of the examples considered is reduced, tending toward schematism. The second reason is simply that trying to never fail is exactly the opposite of what it takes to effectively learn R and Python, or any digital technology.
Learning computational skills for data science necessarily requires a hands-on approach. This means carrying out many practical exercises: meticulously redoing those proposed by the text, but also varying them, introducing modifications, and replicating them with different data. All the didactic examples can obviously be modified, and all those based on open data can easily be varied: different information could be used, a slightly different result could be sought, or other data made available by the same source could be tried. Proceeding methodically (being methodical, meticulous, and patient are fundamental traits for effective learning) is the way to go. Returning to the methodological doubts that often afflict students when they start, the following golden rule applies, and it must be emphasized because it is of fundamental importance: exercises are for making mistakes; an exercise without errors is useless.
The use of open data, right from the first examples and to a much greater extent than examples with simplified educational datasets, is one of the characteristics of this text, perhaps the main one. The 26 datasets taken from open data are sourced from the United States and other countries, large international organizations (the World Bank and the United Nations), charities and independent research institutes, gender discrimination observatories, and government agencies for air traffic control, energy production and consumption, pollutant emissions, and other environmental information. They also include data made available by cities such as Milan, Berlin, and New York City. This selection is just a drop in the sea of available open data, which is constantly growing in both quantity and quality.
Using open data to the extent this text does is a deliberate choice that certainly imposes an additional effort on those who undertake the learning path. It is a choice based both on personal experience in teaching the fundamentals of data science to students of social and political sciences (every year I have moved the use of open data earlier and earlier), and on the fundamental drawback of carrying out examples and exercises mainly with didactic cases, which are inevitably unreal and unrealistic. Of course, the didactic cases, also present in this text, are perfectly fit for showing a specific functionality, an effect, or a behavior of the computational tool. As mentioned before, though, the issue at stake is learning to drive in urban traffic, not just understanding some engine mechanics, and in the end the only way to do that is by driving in traffic; there is no alternative. For us it is the same: anyone who works with data knows that one of the fundamental skills is preparing the data for analysis (preceded, in fact, by finding the data), and that this task can easily be the most time- and effort-demanding part of the whole job. Studying mainly with simplified teaching examples erases this fundamental part of knowledge and experience; that is why such examples are always unreal and unrealistic, however you try to fix them. There is no alternative to getting your hands on, and banging your head against, real data: handling datasets of hundreds of thousands or even millions of rows (the largest one we use in this text, the data of all US domestic flights of January 2022, has more than 500,000 rows), with their errors, explanations that must be read and are sometimes misinterpreted, and even cases where data was recorded inconsistently (we will see one quite amusing case of this kind). Familiarity with real data should be achieved as soon as possible, to understand its typical characteristics and the fact that behind the data there are organizations made up of people, and it is thanks to them that we can extract new information and knowledge. You need to arm yourself with patience and untangle each knot, one step at a time. This is part of the fundamentals to learn.
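To give a flavor of what preparing real data typically involves, here is a minimal sketch in Python with pandas; the file name flights.csv and the column names are hypothetical stand-ins for illustration, not the actual dataset used in the book.

import pandas as pd

# Hypothetical file and column names, for illustration only.
flights = pd.read_csv("flights.csv")

# Drop rows where key fields were never recorded.
flights = flights.dropna(subset=["origin", "dest"])

# Normalize inconsistently recorded values (stray spaces, mixed case).
flights["origin"] = flights["origin"].str.strip().str.upper()

# Coerce a numeric column that arrived as text; unparseable entries
# become NaN instead of breaking the conversion.
flights["dep_delay"] = pd.to_numeric(flights["dep_delay"], errors="coerce")

Steps like these, repeated and adapted for each source, are what turn a raw download into a workable dataset.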
One book alone cannot cover everything; we have already said it, and it is obvious. The point to decide, however, is what to leave out. One possibility is that the author tries to discuss as many different topics as he or she can think of. This is the encyclopedic model, popular but not very compatible with a reasonably limited number of pages; it is no coincidence that the most famous encyclopedias run to dozens of ponderous volumes. The short version of the encyclopedic model is a "synthesis," i.e., a reasonably short overview that is necessarily not very thorough and has to simplify complex topics. Many educational books choose this form, which has the advantage of breadth of topics, combined with a fair amount of simplification.
This book has a hybrid form in this respect. It is broader than the standard because it covers two languages instead of one, but it does not take the form of a synthesis because it focuses on one specific type of data and functionality: data frames, with the final addition of lists/dictionaries, transformation and pivoting operations, group indexing, aggregation, advanced transformations, and data frame joins, and on these topics it goes into detail. In essence, it offers the essential toolbox for data science.
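As a quick preview of that toolbox, the following sketch, in Python with pandas, runs through the operations just listed on a tiny invented data frame; the data and column names are made up purely for illustration.

import pandas as pd

# A tiny invented data frame, just to name the core operations.
df = pd.DataFrame({
    "year":  [2021, 2021, 2022, 2022],
    "city":  ["Milan", "Berlin", "Milan", "Berlin"],
    "value": [10, 20, 30, 40],
})

# Transformation: derive a new column from existing ones.
df["value_doubled"] = df["value"] * 2

# Group indexing and aggregation: one summary row per city.
totals = df.groupby("city", as_index=False)["value"].sum()

# Pivoting: reshape from long to wide format.
wide = df.pivot(index="city", columns="year", values="value")

# Join: combine two data frames on a common key.
joined = df.merge(totals, on="city", suffixes=("", "_total"))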
What's left out? Very much, indeed. The techniques and tools for data visualization, descriptive and predictive...