Building a Scalable Data Warehouse with Data Vault 2.0

Implementation Guide for Microsoft SQL Server 2014
 
 
Morgan Kaufmann Publishers Inc.
  • 1st edition
  • |
  • published 25 September 2015
  • |
  • 684 pages
 
E-book | ePUB with Adobe DRM | system requirements
E-book | PDF with Adobe DRM | system requirements
978-0-12-802648-9 (ISBN)
 
Building a Scalable Data Warehouse with Data Vault 2.0 covers everything users need to create a scalable data warehouse from scratch, including a presentation of the Data Vault modeling technique which provides the foundations to create a technical data warehouse layer.

In addition, the book presents tactics for creating the input layer (the stage layer) and the presentation layer (data mart) of the Data Vault 2.0 standard. Drawing upon years of practical experience and using numerous examples and an easy-to-understand framework, authors Dan Linstedt and Michael Olschimke discuss how to load each layer using SQL Server Integration Services (SSIS), including automation of the Data Vault loading processes, important data warehouse technologies and practices, and Data Quality Services (DQS) and Master Data Services (MDS) in the context of the Data Vault architecture.



- Provides a complete introduction to data warehousing, applications, and the business context so readers can get up and running fast
- Explains theoretical concepts and provides hands-on instruction on how to build and implement a data warehouse
- Demystifies data vault modeling with beginning, intermediate, and advanced techniques
- Discusses the advantages of the data vault approach over other techniques, also including the latest updates to Data Vault 2.0 and multiple improvements to Data Vault 1.0
  • English
  • San Francisco
  • |
  • USA
Elsevier Science & Technology
  • For professional and research use
  • Height: 235 mm
  • |
  • Width: 191 mm
  • 142.82 MB
978-0-12-802648-9 (9780128026489)
0128026480 (0128026480)
Further editions are being determined
Chapter 1: Introduction to Data Warehousing
Chapter 2: Scalable Data Warehouse Architecture
Chapter 3: Agile Data Warehouse Development Methodology
Chapter 4: The Data Vault 2.0 Methodology
Chapter 5: Data Vault Modeling
Chapter 6: Intermediate Data Vault Modeling
Chapter 7: Advanced Data Vault Modeling
Chapter 8: Dimensional Modeling
Chapter 9: Physical Database Design
Chapter 10: Master Data Management
Chapter 11: Metadata Management
Chapter 12: Data Extraction
Chapter 13: Loading the Data Vault
Chapter 14: Implementing Data Quality in the Data Vault
Chapter 15: Loading the Dimensional Data Mart
Chapter 16: Relational Reports (Optional)
Chapter 17: Multidimensional Database
Chapter 18: Refactoring the Data Warehouse
Chapter 19: Data Mining (Optional)
Chapter 20: Data Warehouse Administration (Optional)
Chapter 21: Best Practices (Optional)
Chapter 2

Scalable Data Warehouse Architecture


Abstract


Scalable data warehouses, as a desired solution to some of the problems introduced in the previous chapter, have specific architectural dimensions that are explained in this chapter, including workload, data complexity, query complexity, availability and data latency.

This chapter introduces the architecture of a Data Vault-based data warehouse, including the stage, data warehouse, and information mart layers. In addition, it shows how to use a business vault and other components of the proposed architecture.

Keywords


data complexity; query complexity; data latency; data warehouse; data vault; data architecture

Today's data warehouse systems make it easy for analysts to access integrated data. In order to achieve this, the data warehouse development team has to process and model the data based on the requirements of the business users. The best approach for developing a data warehouse is an iterative development process [1]. That means that the functionality of the data warehouse, as requested by the business users, is designed, developed, implemented, and deployed in iterations (sometimes called sprints or cycles). In each iteration, more functionality is added to the data warehouse. This is the opposite of a "big-bang" approach, where all functionality is developed in one large process and finally deployed as a whole. However, when executing the project, even with an iterative approach, the effort (and the costs tied to it) of adding new functionality usually increases because of existing dependencies that have to be taken care of. Figure 2.1 shows that the effort to implement the first information mart is relatively low. But when implementing the second information mart, the development team has to maintain the existing solution and take care of existing dependencies, for example data sources integrated for the first information mart or operational systems consuming information from existing tables. To make sure that this previously built functionality doesn't break when deploying the new functionality for the second information mart, the old functionality has to be retested. In many cases, the existing solution needs to be refactored to maintain the functionality of the individual information marts when new sources are added to the overall solution. All these activities increase the effort for creating the second information mart and, equally, any subsequent information mart or other new functionality.
This additional effort is depicted as the rise of the graph in Figure 2.1: once the first information mart is produced, the solution falls into a maintenance mode for all existing functionality. The next project implements another information mart. To implement the second information mart, the effort includes adding the new functionality and maintaining the existing functionality. Because of the dependencies, the existing functionality needs to be refactored and retested regularly.

Figure 2.1 The maintenance nightmare [2].

In other words, the extensibility of many data warehouse architectures, including those presented in Chapter 1, Introduction to Data Warehousing, is not optimal. Furthermore, typical data warehouse architectures often lack dimensions of scalability other than the described dimension of extensibility. We discuss these dimensions in the next section.
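Before turning to the dimensions of scalability, it may help to see the entity types behind the layered architecture the abstract mentions: hubs hold unique business keys, links hold relationships between them, and satellites hold historized descriptive attributes. The sketch below is illustrative only and not an example from the book; the table names, columns, and the use of SQLite through Python are assumptions made here (the book itself targets SQL Server 2014):

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hub: one row per unique business key (here: a customer number)
cur.execute("""
CREATE TABLE hub_customer (
    customer_hkey TEXT PRIMARY KEY,   -- hash of the business key
    customer_no   TEXT NOT NULL UNIQUE,
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
)""")

# Satellite: descriptive attributes, historized by load date
cur.execute("""
CREATE TABLE sat_customer (
    customer_hkey TEXT NOT NULL REFERENCES hub_customer,
    load_date     TEXT NOT NULL,
    name          TEXT,
    city          TEXT,
    PRIMARY KEY (customer_hkey, load_date)
)""")

def hkey(business_key: str) -> str:
    # Data Vault 2.0 replaces sequence-based surrogate keys with hash keys
    return hashlib.md5(business_key.encode()).hexdigest()

cur.execute("INSERT INTO hub_customer VALUES (?, ?, ?, ?)",
            (hkey("C-1001"), "C-1001", "2015-09-25", "CRM"))
cur.execute("INSERT INTO sat_customer VALUES (?, ?, ?, ?)",
            (hkey("C-1001"), "2015-09-25", "Jane Doe", "San Francisco"))

# Joining hub and satellite reconstructs the descriptive view
row = cur.execute("""
    SELECT h.customer_no, s.name, s.city
    FROM hub_customer h
    JOIN sat_customer s ON s.customer_hkey = h.customer_hkey
""").fetchone()
print(row)  # ('C-1001', 'Jane Doe', 'San Francisco')
```

A link table, omitted here for brevity, would connect two or more hub keys in the same fashion, with its own load date and record source columns.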

2.1. Dimensions of Scalable Data Warehouse Architectures


Business users of data warehouse systems expect to load and prepare more and more data, in terms of variety, volume, and velocity [3]. The workload placed on typical data warehouse environments also keeps increasing, especially once the initial version of the warehouse becomes a success with its first users. Therefore, scalability has multiple dimensions.

2.1.1. Workload


The enterprise data warehouse (EDW) is "by far the largest and most computationally intense business application" in a typical enterprise. EDW systems consist of huge databases, containing historical data on volumes from multiple gigabytes to terabytes of storage [4]. Successful EDW systems face two issues regarding the workload of the system: first, they experience rapidly increasing data volumes and application workloads and, second, an increasing number of concurrent users [5]. In order to meet the performance requirements, EDW systems are implemented on large-scale parallel computers, such as massively parallel processing (MPP) or symmetric multiprocessor (SMP) system environments and clusters, running parallel database software. In fact, most medium- to large-size data warehouses would not be implementable without large-scale parallel hardware and parallel database software to support them [4]. However, handling the requested workload requires more than parallel hardware or parallel database software: the logical and physical design of the databases has to be optimized for the expected data volumes [6-8].

2.1.2. Data Complexity


Another dimension of enterprise data warehouse scalability is data complexity. The following factors contribute to the growth of data complexity [9]:

- Variety of data: nowadays, enterprise organizations capture more than just traditional (e.g., relational or mainframe) master or transactional data. There is an increasing amount of semi-structured data (for example, emails, e-forms, or HTML and XML files) and unstructured data, such as document collections, social network data, images, video, and sound files. Another type is sensor- and machine-generated data, which might require specific handling. In many cases, enterprises try to derive structured information from unstructured or semi-structured data to increase the business value of the data. While such files may have a structure, their content doesn't have one. For example, it is not possible to find the face of a specific person in a video without fully processing all frames of the video and building metadata tags to indicate where faces appear in the content.
- Volume of data: the rate at which companies generate and accumulate new data is increasing. Examples include content from Web sites or social networks, document and email collections, weblog data, and machine-generated data. The increased data volume leads to much larger data sets, which can run into hundreds of terabytes, or even into petabytes of data or beyond.
- Velocity of data: not only do the variety and volume of data increase, but the rate at which the data is created also increases rapidly. One example is financial data from financial markets such as the stock exchange. Such data is generated at very high rates and immediately analyzed in order to respond to changes in the market. Other examples include credit card transaction data for fraud detection, and sensor data or data from closed-circuit television (CCTV), which is captured for automated video and image analysis in real time or near-real time.
- Veracity (trustworthiness) of data: in order to have confidence in data, it must have strong data governance, lineage traceability, and robust data integration [10].

2.1.3. Analytical Complexity


Due to the availability of large volumes of data with high velocity and variety, businesses demand different and more complex analytical tasks to produce the insight required to solve their business problems. Some of these analyses require that the data be prepared in a fashion not foreseen by the original data warehouse developers. For example, the data fed into a data mining algorithm may need different characteristics regarding the variety, volume, and velocity of data. Consider the example of retail marketing: campaign accuracy and timeliness need to improve when moving from retail stores to online channels, where more detailed customer insights are required [11]:

- In order to determine customer segmentation and purchase behavior, the business might need historical analysis and reporting of customer demographics and purchase transactions.
- Cross-sell opportunities can be identified by analyzing market baskets that show products that can be sold together.
- To understand the online behavior of their customers, click-stream analysis is required. This can help present up-sell offers to the visitors of a Web site.
- Given the high amount of social network data and user-generated content, businesses tap into the data by analyzing product reviews, ratings, likes and dislikes, comments, customer service interactions, and so on.

These examples should make it clear that, in order to solve such new and complex analytical tasks, data sources of varying complexity are required. Also, mixing structured and unstructured data is becoming more and more common [11].
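The market-basket analysis mentioned above can, at its simplest, be reduced to counting how often product pairs co-occur across purchase transactions; frequent pairs are cross-sell candidates. The baskets and product names below are invented for illustration and do not come from the book:

```python
from collections import Counter
from itertools import combinations

# Each basket is the set of products in one purchase transaction
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]

# Count co-occurring product pairs across all baskets
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most frequent pair is a cross-sell candidate
top_pair, count = pair_counts.most_common(1)[0]
print(top_pair, count)  # ('bread', 'butter') 3
```

Production-grade approaches (e.g., the Apriori algorithm) additionally prune infrequent items to keep the pair space tractable on large transaction volumes.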

2.1.4. Query Complexity


For business intelligence (BI) vendors, a relational database management system (RDBMS) is the natural choice for the storage and management of warehouse data. Relational databases provide simple data structures and high-level, set-oriented languages that make...

File format: EPUB
Copy protection: Adobe DRM (Digital Rights Management)

System requirements:

Computer (Windows; MacOS X; Linux): Install the free Adobe Digital Editions software before downloading (see e-book help).

Tablet/smartphone (Android; iOS): Install the free Adobe Digital Editions app before downloading (see e-book help).

E-book readers: Bookeen, Kobo, Pocketbook, Sony, Tolino, and many more (not Kindle)

The EPUB file format is very well suited for novels and non-fiction, i.e., for "flowing" text without a complex layout. On e-readers and smartphones, line and page breaks adapt automatically to the small displays. Adobe DRM applies "hard" copy protection here. If the necessary requirements are not met, you will unfortunately not be able to open the e-book, so you must prepare your reading hardware before downloading.

Further information can be found in our e-book help.


File format: PDF
Copy protection: Adobe DRM (Digital Rights Management)

System requirements:

Computer (Windows; MacOS X; Linux): Install the free Adobe Digital Editions software before downloading (see e-book help).

Tablet/smartphone (Android; iOS): Install the free Adobe Digital Editions app before downloading (see e-book help).

E-book readers: Bookeen, Kobo, Pocketbook, Sony, Tolino, and many more (not Kindle)

The PDF file format displays a book page identically on every device, which makes PDF suitable for the complex layouts used in textbooks and reference works (images, tables, columns, footnotes). On the small displays of e-readers and smartphones, however, PDFs are rather cumbersome because of the amount of scrolling required. Adobe DRM applies "hard" copy protection here. If the necessary requirements are not met, you will unfortunately not be able to open the e-book, so you must prepare your reading hardware before downloading.

Further information can be found in our e-book help.


Download (available immediately)

€60.63
incl. 19% VAT
Download / single license
ePUB with Adobe DRM
see system requirements
PDF with Adobe DRM
see system requirements
Note: You select your preferred file format and copy protection in the e-book vendor's system
Order e-book
