Introduction
This book investigates the trade-off between computation and storage in the cloud. This is a new and significant issue for deploying applications under the pay-as-you-go model in the cloud, especially computation- and data-intensive scientific applications. The research reported in this book helps both cloud service providers and users reduce the cost of storing large generated application data sets in the cloud. A suite consisting of a novel cost model, benchmarking approaches and storage strategies is designed and developed with the support of new concepts, solid theorems and innovative algorithms. Experimental evaluation and a case study demonstrate that our work dramatically reduces the cost of running computation- and data-intensive scientific applications in the cloud.
This chapter introduces the background and key issues of this research. It is organised as follows. Section 1.1 gives a brief introduction to running scientific applications in the cloud. Section 1.2 outlines the key issues of this research. Finally, Section 1.3 presents an overview of the remainder of this book.
1.1 Running Scientific Applications in the Cloud
Running scientific applications usually requires not only high-performance computing (HPC) resources but also massive storage [34]. In many scientific research fields, such as astronomy [33], high-energy physics [61] and bioinformatics [65], scientists need to analyse large amounts of data, either from existing data resources or collected from physical devices. During these processes, large amounts of new data might also be generated as intermediate or final products [34]. Scientific applications are usually data intensive [36,61], and the generated data sets are often terabytes or even petabytes in size. As reported by Szalay et al. [74], science is in an exponential world, and the amount of scientific data will double every year over the next decade and beyond. Producing scientific data sets involves a large number of computation-intensive tasks, e.g., within scientific workflows [35], and hence takes a long time to execute. These generated data sets contain important intermediate or final results of the computation and need to be stored as valuable resources. This is because (i) data can be reused: scientists may need to re-analyse the results or apply new analyses to the existing data sets [16]; and (ii) data can be shared: the computation results may be shared for collaboration, so the data sets are used by scientists from different institutions [19]. Storing valuable generated application data sets saves their regeneration cost when they are reused, not to mention the waiting time caused by regeneration. However, the large size of these scientific data sets presents a serious storage challenge. Hence, popular scientific applications are often deployed in grid or HPC systems [61], which have HPC resources and/or massive storage. However, building and maintaining a grid or HPC system is extremely expensive, and neither can easily be made available for scientists all over the world to utilise.
In recent years, cloud computing has emerged as the latest distributed computing paradigm, providing redundant, inexpensive and scalable resources on demand [42]. Since late 2007, when the concept of cloud computing was proposed [83], it has been utilised in many areas with a certain degree of success [17,21,45,62]. Cloud computing adopts a pay-as-you-go model in which users are charged according to their usage of cloud services such as computation, storage and network services, in the same manner as for conventional utilities in everyday life (e.g., water, electricity, gas and telephone) [22]. Cloud computing systems offer a new way to deploy computation- and data-intensive applications. As Infrastructure as a Service (IaaS) is a very popular way to deliver computing resources in the cloud [1], the heterogeneity of one service provider's computing systems [92] can be well shielded by virtualisation technology. Hence, users can deploy their applications on unified resources without any infrastructure investment, obtaining processing power and storage on demand from commercial cloud service providers. Furthermore, cloud computing systems offer a new paradigm in which scientists from all over the world can collaborate and conduct their research jointly. As cloud computing systems are usually based on the Internet, scientists can upload their data and launch their applications in the cloud from anywhere in the world. Moreover, as all the data are managed in the cloud, it is easy to share data among scientists.
However, new challenges arise when we deploy a scientific application in the cloud. With the pay-as-you-go model, users pay for the resources they consume; hence the total application cost for generated data sets in the cloud depends heavily on the strategy used to store them. For example, storing all the generated application data sets may result in a high storage cost, since some data sets may be seldom used yet large in size. Conversely, if we delete all the generated data sets and regenerate them every time they are needed, the computation cost may also be very high. Hence there is a trade-off between computation and storage for deploying applications, and this is an important and challenging issue in the cloud; a rough comparison is sketched below. By investigating this issue, this research proposes a new cost model, novel benchmarking approaches and innovative storage strategies, which help both cloud service providers and users to reduce application costs in the cloud.
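As a back-of-the-envelope illustration of this trade-off, the following minimal Python sketch compares the monthly cost of keeping a single generated data set in storage against deleting it and regenerating it on demand. All prices, sizes and usage frequencies here are hypothetical, chosen only for illustration; they are not the figures of any real cloud provider.

```python
# A minimal sketch of the store-vs-regenerate trade-off for one data set.
# All prices and usage figures are hypothetical, for illustration only.

def monthly_storage_cost(size_gb: float, price_per_gb_month: float) -> float:
    """Cost of keeping the data set in cloud storage for one month."""
    return size_gb * price_per_gb_month

def monthly_regeneration_cost(cpu_hours: float, price_per_cpu_hour: float,
                              uses_per_month: float) -> float:
    """Expected cost of deleting the data set and recomputing it on demand."""
    return cpu_hours * price_per_cpu_hour * uses_per_month

size_gb = 500.0        # generated data set of 500 GB (hypothetical)
storage_price = 0.02   # $/GB/month (hypothetical)
cpu_hours = 40.0       # CPU-hours needed to regenerate it (hypothetical)
compute_price = 0.10   # $/CPU-hour (hypothetical)

for uses in (0.1, 1.0, 5.0):  # expected accesses per month
    keep = monthly_storage_cost(size_gb, storage_price)
    regen = monthly_regeneration_cost(cpu_hours, compute_price, uses)
    better = "store" if keep < regen else "regenerate"
    print(f"{uses:>4} uses/month: store=${keep:.2f}, regenerate=${regen:.2f} -> {better}")
```

Under these assumed figures, regeneration is cheaper for rarely used data sets, while storage wins once the data set is accessed often enough; the break-even point shifts with prices, data size and regeneration time.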
1.2 Key Issues of This Research
In the cloud, due to the pay-as-you-go model, the application cost depends heavily on the strategy used to store the large generated data sets. A good strategy finds a balance: selectively store some popular data sets and regenerate the rest whenever they are needed, i.e., find a trade-off between computation and storage. However, the generated application data sets in the cloud often have dependencies; that is, a computation task can operate on one or more data sets and generate new ones. The decision about whether to store or delete an application data set therefore impacts not only the cost of that data set itself but also that of other data sets in the cloud. To achieve the best trade-off and utilise it to reduce the application cost, we need to investigate the following issues:
1. Cost model. Users need a new cost model that represents the amount they actually spend on their applications in the cloud. Theoretically, users can obtain unlimited resources from commercial cloud service providers for both computation and storage. Hence, for the large generated application data sets, users can flexibly choose how many to store and how many to regenerate. Different storage strategies consume different amounts of computation and storage resources and ultimately lead to different total application costs. The new cost model should be able to represent the cost of the applications in the cloud as the trade-off between computation and storage; a simplified sketch of such a model follows this list.
2. Minimum cost benchmarking approaches. Based on the new cost model, we need to find the best trade-off between computation and storage, which yields the theoretical minimum application cost in the cloud. This minimum cost serves as an important benchmark for evaluating the cost-effectiveness of storage strategies. Cloud service providers should be able to offer benchmarking services tailored to different applications and users. Hence, benchmarking algorithms need to be investigated so that different benchmarking approaches can be developed to meet the requirements of different situations in the cloud.
3. Cost-effective data set storage strategies. By investigating the trade-off between computation and storage, we determine that cost-effective storage strategies are needed for users to apply to their applications at run-time in the cloud. Unlike benchmarking, in practice the minimum-cost storage strategy may not be the best strategy for an application. First, storage strategies must be efficient enough to be applied at run-time in the cloud. Furthermore, users may have certain preferences concerning the storage of particular data sets (e.g., tolerance of access delay). Hence we need to design cost-effective storage strategies that accommodate different requirements.
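To make issues 1 and 2 concrete, the following is a minimal Python sketch of a cost model over a small data-set dependency chain, together with a brute-force minimum-cost benchmark. Everything here, including the DataSet class, the recursive regeneration cost and all prices and usage figures, is a hypothetical illustration under simplifying assumptions; the cost model and benchmarking algorithms actually developed in this book are considerably more refined and efficient.

```python
# A simplified sketch of a cost model over a data-set dependency graph,
# plus a brute-force search for the minimum-cost storage decision.
# All names, prices and figures are hypothetical illustrations.
from dataclasses import dataclass, field
from itertools import product

@dataclass
class DataSet:
    name: str
    size_gb: float            # size occupied in cloud storage
    gen_cost: float           # $ to compute it from its direct predecessors
    uses_per_month: float     # expected access frequency
    stored: bool = True       # current store/delete decision
    parents: list = field(default_factory=list)  # data sets it is derived from

def regeneration_cost(ds: DataSet) -> float:
    """Cost to obtain ds on demand: recompute it, plus recursively
    regenerate any deleted predecessors it depends on."""
    if ds.stored:
        return 0.0
    return ds.gen_cost + sum(regeneration_cost(p) for p in ds.parents)

def monthly_cost_rate(datasets, storage_price=0.02):
    """Total cost rate of one storage decision: storage fees for stored
    data sets plus expected regeneration fees for deleted ones."""
    total = 0.0
    for ds in datasets:
        if ds.stored:
            total += ds.size_gb * storage_price
        else:
            total += ds.uses_per_month * regeneration_cost(ds)
    return total

def minimum_cost_benchmark(datasets):
    """Enumerate all 2^n store/delete combinations and return the cheapest.
    Only feasible for tiny n; efficient benchmarking algorithms are needed
    for realistic dependency graphs."""
    best_rate, best_choice = float("inf"), None
    for choice in product([True, False], repeat=len(datasets)):
        for ds, stored in zip(datasets, choice):
            ds.stored = stored
        rate = monthly_cost_rate(datasets)
        if rate < best_rate:
            best_rate, best_choice = rate, choice
    return best_rate, best_choice

# d1 -> d2 -> d3: deleting d2 makes regenerating d3 more expensive,
# because d2 itself must be regenerated first.
d1 = DataSet("d1", size_gb=800, gen_cost=5.0, uses_per_month=0.5)
d2 = DataSet("d2", size_gb=300, gen_cost=8.0, uses_per_month=0.2, parents=[d1])
d3 = DataSet("d3", size_gb=100, gen_cost=2.0, uses_per_month=2.0, parents=[d2])
rate, choice = minimum_cost_benchmark([d1, d2, d3])
print(f"minimum cost rate: ${rate:.2f}/month, store = {choice}")
```

Even this toy example exhibits the dependency effect described above: with the figures used here, the cheapest decision is to store only d3 and regenerate d1 and d2 on demand, a choice that can only be found by considering the whole dependency graph rather than each data set in isolation. Brute-force enumeration of all 2^n combinations is of course infeasible for realistic numbers of data sets, which is precisely why efficient benchmarking algorithms matter.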
1.3 Overview of This Book
In particular, this book includes new concepts, solid theorems and innovative algorithms, which together form a suite of systematic and comprehensive solutions to the issue of the computation and storage trade-off in the cloud, bringing cost-effectiveness to applications for both users and cloud service providers. The remainder of this book is organised as follows.
In Chapter 2, we introduce the work related to this research. We start by introducing data management in some traditional scientific application systems, especially grid systems, and then move to the cloud. By introducing some typical cloud systems for scientific applications, we raise the issue of cost-effectiveness in the cloud. Next, we discuss some works that also touch upon the computation and storage trade-off and analyse how they differ from ours. Finally, we introduce some works on data provenance, which are the...