Foreword
Preface
Part I. Architectural Considerations for Hadoop Applications
1. Data Modeling in Hadoop
    Data Storage Options
        Standard File Formats
        Hadoop File Types
        Serialization Formats
        Columnar Formats
        Compression
    HDFS Schema Design
        Location of HDFS Files
        Advanced HDFS Schema Design
        HDFS Schema Design Summary
    HBase Schema Design
        Row Key
        Timestamp
        Hops
        Tables and Regions
        Using Columns
        Using Column Families
        Time-to-Live
    Managing Metadata
        What Is Metadata?
        Why Care About Metadata?
        Where to Store Metadata?
        Examples of Managing Metadata
        Limitations of the Hive Metastore and HCatalog
        Other Ways of Storing Metadata
    Conclusion
2. Data Movement
    Data Ingestion Considerations
        Timeliness of Data Ingestion
        Incremental Updates
        Access Patterns
        Original Source System and Data Structure
        Transformations
        Network Bottlenecks
        Network Security
        Push or Pull
        Failure Handling
        Level of Complexity
    Data Ingestion Options
        File Transfers
        Considerations for File Transfers versus Other Ingest Methods
        Sqoop: Batch Transfer Between Hadoop and Relational Databases
        Flume: Event-Based Data Collection and Processing
        Kafka
    Data Extraction
    Conclusion
3. Processing Data in Hadoop
    MapReduce
        MapReduce Overview
        Example for MapReduce
        When to Use MapReduce
    Spark
        Spark Overview
        Overview of Spark Components
        Basic Spark Concepts
        Benefits of Using Spark
        Spark Example
        When to Use Spark
    Abstractions
        Pig
            Pig Example
            When to Use Pig
        Crunch
            Crunch Example
            When to Use Crunch
        Cascading
            Cascading Example
            When to Use Cascading
    Hive
        Hive Overview
        Example of Hive Code
        When to Use Hive
    Impala
        Impala Overview
        Speed-Oriented Design
        Impala Example
        When to Use Impala
    Conclusion
4. Common Hadoop Processing Patterns
    Pattern: Removing Duplicate Records by Primary Key
        Data Generation for Deduplication Example
        Code Example: Spark Deduplication in Scala
        Code Example: Deduplication in SQL
    Pattern: Windowing Analysis
        Data Generation for Windowing Analysis Example
        Code Example: Peaks and Valleys in Spark
        Code Example: Peaks and Valleys in SQL
    Pattern: Time Series Modifications
        Use HBase and Versioning
        Use HBase with a RowKey of RecordKey and StartTime
        Use HDFS and Rewrite the Whole Table
        Use Partitions on HDFS for Current and Historical Records
        Data Generation for Time Series Example
        Code Example: Time Series in Spark
        Code Example: Time Series in SQL
    Conclusion
5. Graph Processing on Hadoop
    What Is a Graph?
    What Is Graph Processing?
    How Do You Process a Graph in a Distributed System?
        The Bulk Synchronous Parallel Model
        BSP by Example
    Giraph
        Read and Partition the Data
        Batch Process the Graph with BSP
        Write the Graph Back to Disk
        Putting It All Together
        When Should You Use Giraph?
    GraphX
        Just Another RDD
        GraphX Pregel Interface
            vprog()
            sendMessage()
            mergeMessage()
    Which Tool to Use?
    Conclusion
6. Orchestration
    Why We Need Workflow Orchestration
    The Limits of Scripting
    The Enterprise Job Scheduler and Hadoop
    Orchestration Frameworks in the Hadoop Ecosystem
    Oozie Terminology
    Oozie Overview
    Oozie Workflow
    Workflow Patterns
        Point-to-Point Workflow
        Fan-Out Workflow
        Capture-and-Decide Workflow
    Parameterizing Workflows
    Classpath Definition
    Scheduling Patterns
        Frequency Scheduling
        Time and Data Triggers
    Executing Workflows
    Conclusion
7. Near-Real-Time Processing with Hadoop
    Stream Processing
    Apache Storm
        Storm High-Level Architecture
        Storm Topologies
        Tuples and Streams
        Spouts and Bolts
        Stream Groupings
        Reliability of Storm Applications
        Exactly-Once Processing
        Fault Tolerance
        Integrating Storm with HDFS
        Integrating Storm with HBase
        Storm Example: Simple Moving Average
        Evaluating Storm
    Trident
        Trident Example: Simple Moving Average
        Evaluating Trident
    Spark Streaming
        Overview of Spark Streaming
        Spark Streaming Example: Simple Count
        Spark Streaming Example: Multiple Inputs
        Spark Streaming Example: Maintaining State
        Spark Streaming Example: Windowing
        Spark Streaming Example: Streaming versus ETL Code
        Evaluating Spark Streaming
    Flume Interceptors
    Which Tool to Use?
        Low-Latency Enrichment, Validation, Alerting, and Ingestion
        NRT Counting, Rolling Averages, and Iterative Processing
        Complex Data Pipelines
    Conclusion
Part II. Case Studies
8. Clickstream Analysis
    Defining the Use Case
    Using Hadoop for Clickstream Analysis
    Design Overview
    Storage
    Ingestion
        The Client Tier
        The Collector Tier
    Processing
        Data Deduplication
        Sessionization
    Analyzing
    Orchestration
    Conclusion
9. Fraud Detection
    Continuous Improvement
    Taking Action
    Architectural Requirements of Fraud Detection Systems
    Introducing Our Use Case
    High-Level Design
    Client Architecture
    Profile Storage and Retrieval
        Caching
        HBase Data Definition
        Delivering Transaction Status: Approved or Denied?
    Ingest
        Path Between the Client and Flume
    Near-Real-Time and Exploratory Analytics
    Near-Real-Time Processing
    Exploratory Analytics
    What About Other Architectures?
        Flume Interceptors
        Kafka to Storm or Spark Streaming
        External Business Rules Engine
    Conclusion
10. Data Warehouse
    Using Hadoop for Data Warehousing
    Defining the Use Case
    OLTP Schema
    Data Warehouse: Introduction and Terminology
    Data Warehousing with Hadoop
    High-Level Design
        Data Modeling and Storage
        Ingestion
        Data Processing and Access
        Aggregations
        Data Export
        Orchestration
    Conclusion
A. Joins in Impala
Index