
Learning Apache Apex

Person
Ananth is a senior application architect in the Decisioning and Advanced Analytics architecture team at Commonwealth Bank of Australia (CBA). Ananth holds a Ph.D. in computer science security and is interested in all things data, including low-latency distributed processing systems, machine learning, and data engineering. He holds three patents granted by the USPTO and has one application pending. Prior to joining CBA, he was an architect at Threatmetrix and a member of the core team that scaled the Threatmetrix architecture to 100 million transactions per day at very low latencies using Cassandra, Zookeeper, and Kafka. He also migrated the Threatmetrix data warehouse to a next-generation architecture based on Hadoop and Impala. Prior to Threatmetrix, he worked for the IBM software labs and IBM CIO labs, enabling some of the first IBM CIO projects to onboard the HBase, Hadoop, and Mahout stack. Ananth is a committer for Apache Apex and is currently working on next-generation architectures for the CBA fraud platform and the Advanced Analytics Omnia platform at CBA.
Weise Thomas:
Thomas Weise is the Apache Apex PMC Chair and cofounder at Atrato. Earlier, he worked at a number of other technology companies in the San Francisco Bay Area, including DataTorrent, where he was a cofounder of the Apex project. Thomas is also a committer to Apache Beam and has contributed to several other ecosystem projects. He has been working on distributed systems for 20 years and has been a speaker at international big data conferences. Thomas received the degree of Diplom-Informatiker (MSc in computer science) from TU Dresden, Germany. He can be reached on Twitter at @thweise.
V. Ramanath Munagala:
Dr. Munagala V. Ramanath received his PhD in Computer Science from the University of Wisconsin, USA, and an MSc in Mathematics from Carleton University, Ottawa, Canada. After that, he taught Computer Science courses as Assistant/Associate Professor at the University of Western Ontario in Canada for a few years before moving to industry. Since then, he has worked as a senior software engineer at a number of technology companies in California, including SeeBeyond, EMC, Sun Microsystems, DataTorrent, and Cloudera. He has published papers in peer-reviewed journals in several areas, including code optimization, graph theory, and image processing.
Yan David:
David Yan is based in Silicon Valley, California. He is a senior software engineer at Google. Prior to Google, he worked at DataTorrent, Yahoo!, and the Jet Propulsion Laboratory. David holds a Master of Science in Computer Science from Stanford University and a Bachelor of Science in Electrical Engineering and Computer Science from the University of California at Berkeley.
Knowles Kenneth:
Kenneth Knowles is a founding PMC member of Apache Beam. Kenn has been working on Google Cloud Dataflow (Google's Beam backend) since 2014. Prior to that, he built backends for startups such as Cityspan, Inkling, and Dimagi. Kenn holds a PhD in Programming Language Theory from the University of California, Santa Cruz.
Content
- Cover
- Title Page
- Copyright
- Credits
- About the Authors
- About the Reviewer
- www.PacktPub.com
- Customer Feedback
- Table of Contents
- Preface
- Chapter 1: Introduction to Apex
- Unbounded data and continuous processing
- Stream processing
- Stream processing systems
- What is Apex and why is it important?
- Use cases and case studies
- Real-time insights for Advertising Tech (PubMatic)
- Industrial IoT applications (GE)
- Real-time threat detection (Capital One)
- Silver Spring Networks (SSN)
- Application Model and API
- Directed Acyclic Graph (DAG)
- Apex DAG Java API
- High-level Stream Java API
- SQL
- JSON
- Windowing and time
- Value proposition of Apex
- Low latency and stateful processing
- Native streaming versus micro-batch
- Performance
- Where Apex excels
- Where Apex is not suitable
- Summary
- Chapter 2: Getting Started with Application Development
- Development process and methodology
- Setting up the development environment
- Creating a new Maven project
- Application specifications
- Custom operator development
- The Apex operator model
- CheckpointListener/CheckpointNotificationListener
- ActivationListener
- IdleTimeHandler
- Application configuration
- Testing in the IDE
- Writing the integration test
- Running the application on YARN
- Execution layer components
- Installing Apex Docker sandbox
- Running the application
- Working on the cluster
- YARN web UI
- Apex CLI
- Logging
- Dynamically adjusting logging levels
- Summary
- Chapter 3: The Apex Library
- An overview of the library
- Integrations
- Apache Kafka
- Kafka input
- Kafka output
- Other streaming integrations
- JMS (ActiveMQ, SQS, and so on)
- Kinesis streams
- Files
- File input
- File splitter and block reader
- File writer
- Databases
- JDBC input
- JDBC output
- Other databases
- Transformations
- Parser
- Filter
- Enrichment
- Map transform
- Custom functions
- Windowed transformations
- Windowing
- Global Window
- Time Windows
- Sliding Time Windows
- Session Windows
- Window propagation
- State
- Accumulation
- Accumulation Mode
- State storage
- Watermarks
- Allowed lateness
- Triggering
- Merging of streams
- The windowing example
- Dedup
- Join
- State Management
- Summary
- Chapter 4: Scalability, Low Latency, and Performance
- Partitioning and how it works
- Elasticity
- Partitioning toolkit
- Configuring and triggering partitioning
- StreamCodec
- Unifier
- Custom dynamic partitioning
- Performance optimizations
- Affinity and anti-affinity
- Low-latency versus throughput
- Sample application for dynamic partitioning
- Performance - other aspects for custom operators
- Summary
- Chapter 5: Fault Tolerance and Reliability
- Distributed systems need to be resilient
- Fault-tolerance components and mechanism in Apex
- Checkpointing
- When to checkpoint
- How to checkpoint
- What to checkpoint
- Incremental state saving
- Incremental recovery
- Processing guarantees
- Example - exactly-once counting
- The exactly-once output to JDBC
- Summary
- Chapter 6: Example Project - Real-Time Aggregation and Visualization
- Streaming ETL and beyond
- The application pattern in a real-world use case
- Analyzing Twitter feed
- Top Hashtags
- TweetStats
- Running the application
- Configuring Twitter API access
- Enabling WebSocket output
- The Pub/Sub server
- Grafana visualization
- Installing Grafana
- Installing Grafana Simple JSON Datasource
- The Grafana Pub/Sub adapter server
- Setting up the dashboard
- Summary
- Chapter 7: Example Project - Real-Time Ride Service Data Processing
- The goal
- Datasource
- The pipeline
- Simulation of a real-time feed using historical data
- Parsing the data
- Looking up of the zip code and preparing for the windowing operation
- Windowed operator configuration
- Serving the data with WebSocket
- Running the application
- Running the application on GCP Dataproc
- Summary
- Chapter 8: Example Project - ETL Using SQL
- The application pipeline
- Building and running the application
- Application configuration
- The application code
- Partitioning
- Application testing
- Understanding application logs
- Calcite integration
- Summary
- Chapter 9: Introduction to Apache Beam
- Introduction to Apache Beam
- Beam concepts
- Pipelines, PTransforms, and PCollections
- ParDo - elementwise computation
- GroupByKey/CombinePerKey - aggregation across elements
- Windowing, watermarks, and triggering in Beam
- Windowing in Beam
- Watermarks in Beam
- Triggering in Beam
- Advanced topic - stateful ParDo
- WordCount in Apache Beam
- Setting up your pipeline
- Reading the works of Shakespeare in parallel
- Splitting each line on spaces
- Eliminating empty strings
- Counting the occurrences of each word
- Format your results
- Writing to a sharded text file in parallel
- Testing the pipeline at small scale with DirectRunner
- Running Apache Beam WordCount on Apache Apex
- Summary
- Chapter 10: The Future of Stream Processing
- Lower barrier for building streaming pipelines
- Visual development tools
- Streaming SQL
- Better programming API
- Bridging the gap between data science and engineering
- Machine learning integration
- State management
- State query and data consistency
- Containerized infrastructure
- Management tools
- Summary
- Index
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The ePub file format works well for novels and non-fiction books, i.e., "flowing" text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a "hard" copy protection. If the necessary requirements are not met, you will not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise any reading software with your personal Adobe ID after installing it.
For more information, see our ebook Help page.