
Data-Driven Security

Persons
Jay Jacobs is the coauthor of Verizon Data Breach Investigation Reports and the cofounder of the Society of Information Risk Analysts, where he currently sits on the board of directors.
Bob Rudis is the Director of Enterprise Information Security & IT Risk Management at Liberty Mutual Insurance and was named one of the Top 25 Influencers in Information Security by Tripwire.
Content
Introduction
Chapter 1 The Journey to Data-Driven Security
A Brief History of Learning from Data
Nineteenth Century Data Analysis
Twentieth Century Data Analysis
Twenty-First Century Data Analysis
Gathering Data Analysis Skills
Domain Expertise
Programming Skills
Data Management
Statistics
Visualization (aka Communication)
Combining the Skills
Centering on a Question
Creating a Good Research Question
Exploratory Data Analysis
Summary
Recommended Reading
Chapter 2 Building Your Analytics Toolbox: A Primer on Using R and Python for Security Analysis
Why Python? Why R? And Why Both?
Why Python?
Why R?
Why Both?
Jumpstarting Your Python Analytics with Canopy
Understanding the Python Data Analysis and Visualization Ecosystem
Setting Up Your R Environment
Introducing Data Frames
Organizing Analyses
Summary
Recommended Reading
Chapter 3 Learning the "Hello World" of Security Data Analysis
Solving a Problem
Getting Data
Reading In Data
Exploring Data
Homing In on a Question
Summary
Recommended Reading
Chapter 4 Performing Exploratory Security Data Analysis
Dissecting the IP Address
Representing IP Addresses
Segmenting and Grouping IP Addresses
Locating IP Addresses
Augmenting IP Address Data
Association/Correlation, Causation, and Security Operations Center Analysts Gone Rogue
Mapping Outside the Continents
Visualizing the ZeuS Botnet
Visualizing Your Firewall Data
Summary
Recommended Reading
Chapter 5 From Maps to Regression
Simplifying Maps
How Many ZeroAccess Infections per Country?
Changing the Scope of Your Data
The Potwin Effect
Is This Weird?
Counting in Counties
Moving Down to Counties
Introducing Linear Regression
Understanding Common Pitfalls in Regression Analysis
Regression on ZeroAccess Infections
Summary
Recommended Reading
Chapter 6 Visualizing Security Data
Why Visualize?
Unraveling Visual Perception
Understanding the Components of Visual Communications
Avoiding the Third Dimension
Using Color
Putting It All Together
Communicating Distributions
Visualizing Time Series
Experiment on Your Own
Turning Your Data into a Movie Star
Summary
Recommended Reading
Chapter 7 Learning from Security Breaches
Setting Up the Research
Considerations in a Data Collection Framework
Aiming for Objective Answers
Limiting Possible Answers
Allowing "Other," and "Unknown" Options
Avoiding Conflation and Merging the Minutiae
An Introduction to VERIS
Incident Tracking
Threat Actor
Threat Actions
Information Assets
Attributes
Discovery/Response
Impact
Victim
Indicators
Extending VERIS with Plus
Seeing VERIS in Action
Working with VCDB Data
Getting the Most Out of VERIS Data
Summary
Recommended Reading
Chapter 8 Breaking Up with Your Relational Database
Realizing the Container Has Constraints
Constrained by Schema
Constrained by Storage
Constrained by RAM
Constrained by Data
Exploring Alternative Data Stores
BerkeleyDB
Redis
Hive
MongoDB
Special Purpose Databases
Summary
Recommended Reading
Chapter 9 Demystifying Machine Learning
Detecting Malware
Developing a Machine Learning Algorithm
Validating the Algorithm
Implementing the Algorithm
Benefiting from Machine Learning
Answering Questions with Machine Learning
Measuring Good Performance
Selecting Features
Validating Your Model
Specific Learning Methods
Supervised
Unsupervised
Hands On: Clustering Breach Data
Multidimensional Scaling on Victim Industries
Hierarchical Clustering on Victim Industries
Summary
Recommended Reading
Chapter 10 Designing Effective Security Dashboards
What Is a Dashboard, Anyway?
A Dashboard Is Not an Automobile
A Dashboard Is Not a Report
A Dashboard Is Not a Moving Van
A Dashboard Is Not an Art Show
Communicating and Managing "Security" through Dashboards
Lending a Hand to Handlers
Raising Dashboard Awareness
The Devil (and Incident Response Delays) Is in the Details
Projecting "Security"
Summary
Recommended Reading
Chapter 11 Building Interactive Security Visualizations
Moving from Static to Interactive
Interaction for Augmentation
Interaction for Exploration
Interaction for Illumination
Developing Interactive Visualizations
Building Interactive Dashboards with Tableau
Building Browser-Based Visualizations with D3
Summary
Recommended Reading
Chapter 12 Moving Toward Data-Driven Security
Moving Yourself toward Data-Driven Security
The Hacker
The Statistician
The Security Domain Expert
The Danger Zone
Moving Your Organization toward Data-Driven Security
Ask Questions That Have Objective Answers
Find and Collect Relevant Data
Learn through Iteration
Find Statistics
Summary
Recommended Reading
Appendix A Resources and Tools
Appendix B References
Index
Chapter 1
The Journey to Data-Driven Security
“It ain't so much the things we don't know that get us into trouble. It's the things we know that just ain't so.”
Josh Billings, Humorist
This book isn't really about data analysis and visualization.
Yes, almost every section is focused on those topics, but being able to perform good data analysis and produce informative visualizations is just a means to an end. You never (okay, rarely) analyze data for the sheer joy of analyzing data. You analyze data and create visualizations to gain new perspectives, to find relationships you didn't know existed, or to simply discover new information. In short, you do data analysis and visualizations to learn, and that is what this book is about. You want to learn how your information systems are functioning, or more importantly how they are failing and what you can do to fix them.
The cyber world is just too large, has too many components, and has grown far too complex to simply rely on intuition. Only by augmenting and supporting your natural intuition with the science of data analysis will you be able to maintain and protect an ever-growing and increasingly complex infrastructure. We are not advocating replacing people with algorithms; we are advocating arming people with algorithms so that they can learn more and do a better job. The data contains information, and you can learn better with the information in the data than without it.
This book focuses on using real data, the types of data you have probably come across in your work. Rather than chase huge discoveries in the data, it emphasizes the process over the result. Consequently, the use cases are intended to be exemplary and introductory rather than knock-your-socks-off cool. The goal here is to teach you new ways of looking at and learning from data. Therefore, the analysis is intended to break new ground in terms of technique, not necessarily in conclusion.
A Brief History of Learning from Data
One of the best ways of appreciating the power of statistical data analysis and visualization is to look back in history to a time when these methods were first put to use. The following cases provide a vivid picture of “before” versus “after,” demonstrating the dramatic benefits of the then-new methods.
Nineteenth Century Data Analysis
Prior to the twentieth century, the use of data and statistics was still relatively undeveloped. Although great strides were made in the eighteenth century, much of the scientific research of the day used basic descriptive statistics as evidence for the validity of a hypothesis. The inability to draw clear conclusions from noisy data (and almost all real data is more or less noisy) made many scientific debates more about opinions of the data than the data itself. One such fierce debate in the nineteenth century took place between two medical professionals who argued (both with data) over the cause of cholera, a bacterial infection that was often fatal.
The cholera outbreak in London in 1849 was especially brutal, claiming more than 14,000 lives in a single year. The cause of the illness was unknown at that time and two competing theories from two researchers emerged. Dr. William Farr, a well-respected and established epidemiologist, argued that cholera was caused by air pollution created by decomposing and unsanitary matter (officially called the miasma theory). Dr. John Snow, also a successful epidemiologist who was not as widely known as Farr, put forth the theory that cholera was spread by consuming water that was contaminated by a “special animal poison” (this was prior to the discovery of bacteria and germs). The two debated for years.
Farr published the “Report on the Mortality of Cholera in England 1848–49” in 1852, in which he included a table of data with eight possible explanatory variables collected from the 38 registration districts of London. In the paper, Farr presented some relatively simple (by today's standards) statistics and established a relationship between the average elevation of the district and cholera deaths (lower areas had more deaths). Although there was also a relationship between cholera deaths and the source of drinking water (another one of the eight variables he gathered), he concluded that it was not nearly as significant as the elevation. Farr's theory had data and logic and was accepted by his peers. It was adopted as fact of the day.
Dr. John Snow was passionate and vocal about his disbelief in Farr's theory and relentless in proving his own. It's said he even collected data by going door to door during the 1854 cholera outbreak in the Soho district. It was from that outbreak and his collected data that he made his now-famous map, shown in Figure 1-1. The hand-drawn map of the Soho district included little tick marks at the addresses where cholera deaths were reported. Overlaying the location of the water pumps where residents got their drinking water showed a rather obvious clustering around the water pump on Broad Street. Swayed by his map and his passionate pleas, the city allowed the pump handle to be removed, and the epidemic in that region subsided. However, this wasn't enough to convince his critics. The cause of cholera was heavily debated even beyond John Snow's death in 1858.
Figure 1-1 Hand-drawn map of the areas affected by cholera
The cholera debate included data and visualization techniques (long before computers), yet neither had been able to convince the opposition. The debate between Snow and Farr was re-examined in 2003 when statisticians in the UK evaluated the data Farr published in 1852 with modern methods. They found that the data Farr pointed to as proof of an airborne cause actually supported Snow's position. They concluded that if modern statistical methods had been available to Farr, the data he collected would have changed his conclusion. The good news, of course, is that these statistical methods are available to you today.
Twentieth Century Data Analysis
A few years before Farr and Snow debated cholera, an agricultural research station north of London at Rothamsted began conducting experiments on the effects of fertilizer on crop yield. They spent decades conducting experiments and collecting data on various aspects such as crop yield, soil measurements, and weather variables. Following a modern-day logging approach, they gathered the data and diligently stored it, but they were unable to extract the full value from it. In 1919 they hired a brilliant young statistician named Ronald Aylmer Fisher to pore through more than 70 years of data and help them understand it. Fisher quickly ran into a challenge with the data being confounded, and he found it difficult to isolate the effect of the fertilizer from other effects, such as weather or soil quality. This challenge would lead Fisher toward discoveries that would forever change not just the world of statistics, but almost every scientific field in the twentieth century.
What Fisher discovered (among many revolutionary contributions to statistics) is that if an experiment was designed correctly, the influence of various effects could not just be separated, but also could be measured and their influence calculated. With a properly designed experiment, he was able to isolate the effects of weather, soil quality, and other factors so he could compare the effects of various fertilizer mixtures. And this work was not limited to agriculture; the same techniques Fisher developed at Rothamsted are still used widely today in everything from medical trials to archaeology dig sites. Fisher's work, and the work of his peers, helped revolutionize science in the twentieth century. No longer could scientists simply collect and present their data as evidence of their claim as they had in the eighteenth century. They now had the tools to design robust experiments and the techniques to model how the variables affected their experiment and observations.
At this point, the world of science included statistical models. Much of the statistical and science education focused on developing and testing these models and the assumptions behind them. Nearly every statistical problem started with the question—“What's the model?”—and ended with the model populated to allow description and even prediction using the model. This represented a huge leap forward and enabled research never before possible. If it weren't for computers, the world would probably still consider these techniques to be modern. But computers are ubiquitous and they have enabled a whole new approach to data analysis that was both impossible and unfathomable prior to their development.
Twenty-First Century Data Analysis
It's difficult to pull out any single person or event that captures where data analysis is today the way Farr and Fisher captured the previous stages of data analysis. The first glimpse at what was on the horizon came from John Tukey, who wrote in 1962 that data analysis should be thought of as different from statistics (although analysis leveraged statistics). He stated that data analysis must draw from science more than mathematics (can you see the term “data science” in there?). Tukey was not only an accomplished statistician, having contributed numerous procedures and techniques to the field, but he was also an early proponent of visualization techniques for the purpose of describing and...
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The file format ePub works well for novels and non-fiction books, i.e., “flowing” text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a “hard” copy protection. If the necessary requirements are not met, unfortunately you will not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise using your personal Adobe ID after installation of any reading software.
For more information, see our ebook Help page.