The high penetration of data-centric services has markedly increased the risk of exposing sensitive customer and corporate data. In particular, critical infrastructure, information technology (IT), and mobile computing are constantly targeted by sophisticated adversaries, insiders, or bribed workers who launch attacks using advanced malware and hacking tools. The main purpose of these attackers is to gain long-term access to a system and steal critical sensitive data from the enterprise. The resulting data breach or data leakage, also known as data exfiltration, inflicts huge losses every year on a wide range of industries, including technology giants such as Google, Facebook, and Tesla. Google alone stores an enormous volume of sensitive data derived from sources worldwide. In 2020, despite the Covid-19 pandemic, 3950 of the 32,002 reported incidents involved data breaches, of which 86% were financially motivated. Recent outbreaks of ransomware are good examples of new data exfiltration-based attacks carried out for financial gain. Not only are attack methods becoming increasingly sophisticated, but most of the advanced hacking is conducted by state-sponsored hackers. Moreover, data leakage can pose a critical threat to the security and privacy of users, particularly those who work for the government or the military. For instance, the recent leakage of the subcontractor database of the Australian Defence Force had severe implications for national security, as the design of military combat aircraft was compromised.
Intruders exploit previously unknown vulnerabilities in security systems in order to penetrate an organization. The Stuxnet worm is one of the most well-known attack tools; it is widely attributed to the intelligence agencies of the United States and Israel and was intended to thwart the Iranian nuclear development program. Indeed, this worm could be classified as a military-grade weapon considering its origin and sophistication. Stuxnet exploited several undiscovered bugs in the Windows operating system to spread itself and install a rootkit on Siemens programmable logic controllers (PLCs), which were then manipulated to destroy delicate equipment at the targeted nuclear facility. Another example of a widespread vulnerability in security systems is the OpenSSL Heartbleed bug. There is no clear evidence indicating whether the bug was discovered and used by hackers before its public disclosure in 2014, or for how long. Although the underlying cryptographic design is mathematically robust, an implementation bug was accidentally introduced into the widely used OpenSSL cryptography library and remained undetected for roughly two years. Since OpenSSL is one of the most widely deployed implementations of Transport Layer Security (TLS) encryption, vital services such as secure web pages and email were affected. Unfortunately, the security patching process took several months or more to fix the vulnerability in several million devices worldwide. This resulted in several data breach incidents, such as the breach involving Community Health Systems (CHS), a giant private hospital chain in the United States, and the Canada Revenue Agency (CRA) incident that impacted millions of Canadian taxpayers.
Since attack prevention measures might not be adequate, the detection-based approach plays a crucial role in minimizing the damage resulting from data breaches. Generally, a malicious program (known as malware) is the primary tool employed by adversaries to gain access to a system or even to exfiltrate sensitive information automatically. Although numerous malware detection approaches have been proposed in the literature, few works focus on data exfiltration. Moreover, the studies that do propose solutions for data breach detection each focus on only a single data leakage channel, such as network monitoring, unsafe data exportation, or user authorization. Such atomistic solutions give attackers opportunities to exfiltrate sensitive data via alternative channels. This suggests the need for a holistic data exfiltration detection approach. Hence, this research investigates several methods for detecting data exfiltration, which is crucial for various corporate sectors as the modern world moves toward the adoption of data-centric services.
Generally, off-the-shelf antivirus scanners primarily use known signatures to detect malicious programs. The signature-based solution has a very low false alarm rate, as it matches a hash of the program, or signature strings contained in the malware binary, against a database of virus signatures. However, advanced hackers could develop new, unseen malware for a well-protected target or even use a benign program to steal the data. Furthermore, some sophisticated malware does not need to be installed on disk storage at all (e.g. the Code Red worm). Instead, this malware can run entirely in memory while it performs its malicious activity. This is where real-time behavior detection plays an important role. However, inferring a nefarious purpose from a series of a program's actions is challenging, as there are many ways in which a sequence of actions can amount to a malicious transaction, especially in a real-time context. Hence, behavior-based data exfiltration detection is one issue that needs careful and systematic investigation.
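The signature matching described above can be illustrated with a minimal sketch. It assumes a toy signature database; the names KNOWN_BAD_HASHES, KNOWN_BAD_STRINGS, and matches_signature, as well as the sample payloads, are hypothetical and do not reflect any particular antivirus product.

```python
import hashlib

# Hypothetical signature database (illustrative values only):
# SHA-256 digests of known-malicious binaries, plus byte strings
# assumed to appear inside known malware binaries.
KNOWN_BAD_HASHES = {
    hashlib.sha256(b"example malicious payload").hexdigest(),
}
KNOWN_BAD_STRINGS = [b"EVIL_MARKER_V1"]


def matches_signature(binary: bytes) -> bool:
    """Return True if the binary matches a known hash or contains a known string."""
    if hashlib.sha256(binary).hexdigest() in KNOWN_BAD_HASHES:
        return True
    return any(signature in binary for signature in KNOWN_BAD_STRINGS)


# A known sample is detected, but changing a single byte defeats the hash match.
print(matches_signature(b"example malicious payload"))
print(matches_signature(b"example malicious payload!"))
```

The second call shows the limitation noted above: a trivially modified or previously unseen sample no longer matches any stored signature, and a benign program misused to steal data never would.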
Apart from using malware to steal sensitive data, some attackers might simply exploit a benign program to exfiltrate the sensitive information. In addition, the monitoring system may fail to detect previously unseen malware. Hence, the malicious behavior-monitoring approach alone might not be sufficient. Taking a different perspective, we may instead obtain a list of the processes that access sensitive data and could therefore cause data leakage. To do so, one needs to search for sensitive data in the memory space of those processes. This allows the reading of sensitive files, as well as other inputs such as keystrokes containing sensitive keywords, to be detected. Ultimately, all data in a computer system needs to be loaded into main memory before it can be processed; hence, the adversary cannot avoid having the sensitive data in main memory. Moreover, the data can remain in memory even after the process has terminated. Hence, if those processes can be listed, it will be easier to narrow down and identify the root cause of a data breach regardless of whether the program is classified as malicious or benign. However, to the best of the authors' knowledge, this idea has not been closely examined by previous studies, and therefore this book elucidates new ways of addressing such issues.
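As a rough illustration of this memory-based perspective, the sketch below searches per-process memory snapshots for sensitive terms. It assumes the snapshots have already been acquired as byte strings; how process memory is captured is operating-system specific and outside the scope of this example. The function name, process names, and keywords are hypothetical.

```python
from typing import Dict, Iterable, List


def processes_holding_sensitive_data(
    memory_snapshots: Dict[str, bytes],
    sensitive_terms: Iterable[bytes],
) -> Dict[str, List[bytes]]:
    """Map each process name to the sensitive terms found in its memory snapshot."""
    terms = list(sensitive_terms)
    hits: Dict[str, List[bytes]] = {}
    for process, memory in memory_snapshots.items():
        # Plain substring search; a real scanner would also need to handle
        # different encodings and partially overwritten data.
        found = [term for term in terms if term in memory]
        if found:
            hits[process] = found
    return hits


# Hypothetical snapshots: the scan does not care whether a process is
# classified as malicious or benign, only whether it holds sensitive data.
snapshots = {
    "editor.exe": b"...PROJECT-FALCON design notes...",
    "calc.exe": b"2+2=4",
}
print(processes_holding_sensitive_data(snapshots, [b"PROJECT-FALCON", b"salary"]))
```

The output is a shortlist of processes that held sensitive data, which is the starting point for narrowing down the root cause of a breach.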
A sophisticated adversary could obfuscate data exfiltration even further by using a temporal attack to evade the detection system. Here, the data-stealing activity is delayed so as to trick the monitoring system into "thinking" that any alert is just a false alarm (false positive). In other words, the hacker can minimize the chance of being discovered by stealing a minuscule amount of sensitive data at a time. Over a period of time, the attacker can reassemble those small amounts to form the original sensitive file. Even though these time-delay attacks have been reported for over a decade, few researchers have attempted to address the issue (e.g. [11, 12]). If this method is used to penetrate the critical systems of government departments or the military, the consequences will be devastating. Hence, this book approaches the data exfiltration detection issue holistically from several perspectives, namely by: (i) examining the program's behavior, (ii) monitoring the sensitive data that a program is accessing, and (iii) monitoring the collective activities of a process related to fractions of sensitive information being collected over a period of time.
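The third perspective can be sketched as a cumulative tracker that flags a process once the small fractions it has collected add up, even though no single event looks suspicious. This is an illustrative simplification and not the method developed in this book: the class name, thresholds, and time window are hypothetical, and it counts bytes rather than identifying distinct fragments of a file.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class CumulativeLeakTracker:
    """Track per-process access to one sensitive file over a sliding time window."""

    def __init__(self, file_size: int, per_event_limit: int,
                 cumulative_fraction: float, window_seconds: float) -> None:
        self.file_size = file_size
        self.per_event_limit = per_event_limit
        self.cumulative_fraction = cumulative_fraction
        self.window_seconds = window_seconds
        self._events: Dict[str, List[Tuple[float, int]]] = defaultdict(list)

    def record(self, process: str, timestamp: float, num_bytes: int) -> Tuple[bool, bool]:
        """Return (single_event_alert, cumulative_alert) for this access."""
        events = self._events[process]
        events.append((timestamp, num_bytes))
        # Keep only the events that fall inside the sliding window.
        cutoff = timestamp - self.window_seconds
        events[:] = [(t, n) for t, n in events if t >= cutoff]
        total = sum(n for _, n in events)
        single_alert = num_bytes > self.per_event_limit
        cumulative_alert = total >= self.cumulative_fraction * self.file_size
        return single_alert, cumulative_alert


# Five 100-byte reads of a 1000-byte file, ten minutes apart: each read stays
# under the per-event limit, but together they reach half of the file.
tracker = CumulativeLeakTracker(file_size=1000, per_event_limit=200,
                                cumulative_fraction=0.5, window_seconds=3600.0)
for t in (0, 600, 1200, 1800, 2400):
    print(t, tracker.record("sync.exe", t, 100))
```

A detector that only inspects events one at a time would stay silent throughout this sequence, which is the gap a temporal attack exploits; choosing the window length is a trade-off, since an attacker can also spread reads beyond any fixed window.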
1.1 Data Exfiltration Methods
The prevention of data exfiltration is a broad and complex issue. To examine this issue more closely, current work can be categorized into four areas: (i) state-of-the-art surveys, (ii) behavior-based data exfiltration solutions, (iii) memory-based data exfiltration solutions, and (iv) temporal-based data exfiltration prevention. The shortcomings of each of these areas are discussed here.
1.1.1 State-of-the-Art Surveys
To begin with, this book surveys technologies that have a high potential to be used as fundamental building blocks for data leakage prevention (DLP) methods. Data exfiltration prevention solutions share core technologies with intrusion detection systems (IDSs). While an IDS is a standard measure to protect computer systems from outsider and insider attacks, DLP is a more specialized and advanced security solution that can provide better protection against security breaches. DLP aims to detect abnormal access to sensitive data, using either machine learning (ML) or temporal reasoning algorithms.