Business Intelligence Course Outline

Introduction to Data Storage and Processing

Installing the Hadoop Distributed File System (HDFS)

  • Defining key design assumptions and architecture
  • Configuring and setting up the file system
  • Issuing commands from the console
  • Reading and writing files
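The console workflow above can be sketched with the standard `hdfs dfs` commands; the user name, directory, and file name below are illustrative only.

```shell
# Create a directory, write a file into HDFS, list it, and read it back.
hdfs dfs -mkdir -p /user/student/data        # create a target directory
hdfs dfs -put sales.csv /user/student/data   # write: copy a local file into HDFS
hdfs dfs -ls /user/student/data              # issue a listing from the console
hdfs dfs -cat /user/student/data/sales.csv   # read the file's contents back
```

These commands require a running HDFS cluster and assume the `hdfs` client is on the PATH.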

Setting the stage for MapReduce

  • Reviewing the MapReduce approach
  • Introducing the computing daemons
  • Dissecting a MapReduce job
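The three phases of the job dissected above — map, shuffle, reduce — can be mimicked with an ordinary Unix pipeline. This is an analogy, not an actual Hadoop job: `tr` plays the mapper, `sort` the shuffle, and `uniq -c` the reducer.

```shell
# Word count as a pipeline, mirroring the MapReduce phases:
#   map:     tr emits one (word) record per line
#   shuffle: sort groups identical keys together
#   reduce:  uniq -c aggregates each group into a count
printf 'the cat saw the dog\n' | tr ' ' '\n' | sort | uniq -c
# "the" appears with a count of 2; the other words with a count of 1.
```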

Defining Hadoop Cluster Requirements

Planning the architecture

  • Selecting appropriate hardware
  • Designing a scalable cluster

Building the cluster

  • Installing Hadoop daemons
  • Optimizing the network architecture

Configuring a Cluster

Preparing HDFS

  • Setting basic configuration parameters
  • Configuring block allocation, redundancy and replication
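Block size and replication are set in `hdfs-site.xml`. The values below are common defaults shown for illustration, not recommendations for any particular cluster.

```xml
<!-- hdfs-site.xml fragment: illustrative values only -->
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>   <!-- 128 MB block size -->
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>           <!-- default number of replicas per block -->
</property>
```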

Deploying MapReduce

  • Installing and setting up the MapReduce environment
  • Delivering redundant load balancing via Rack Awareness
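Rack awareness is driven by a topology script that Hadoop invokes (via the `net.topology.script.file.name` property) with node addresses, expecting one rack path per argument on stdout. A minimal sketch follows; the subnet-to-rack mapping is invented for illustration.

```shell
#!/bin/sh
# Hypothetical rack-awareness topology script. Hadoop passes one or more
# IP addresses or hostnames as arguments; we print one rack path each.
resolve_rack() {
  case "$1" in
    10.1.1.*) echo "/dc1/rack1" ;;      # first subnet -> rack 1
    10.1.2.*) echo "/dc1/rack2" ;;      # second subnet -> rack 2
    *)        echo "/default-rack" ;;   # anything unrecognized
  esac
}

for node in "$@"; do
  resolve_rack "$node"
done
```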

Maximizing HDFS Robustness

Creating a fault-tolerant file system

  • Isolating single points of failure
  • Maintaining High Availability
  • Triggering manual failover
  • Automating failover with Zookeeper
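A manual failover in an HA pair uses `hdfs haadmin`; the logical NameNode IDs `nn1` and `nn2` are defined in `hdfs-site.xml` and are illustrative here.

```shell
# Hand the active role from nn1 to nn2 by hand (no ZooKeeper involved).
hdfs haadmin -getServiceState nn1   # confirm which NameNode is currently active
hdfs haadmin -failover nn1 nn2      # transition nn2 to active, nn1 to standby
```

With automatic failover enabled, the ZKFailoverController performs this transition itself when ZooKeeper detects a failed NameNode.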

Leveraging NameNode Federation

  • Extending HDFS resources
  • Managing the namespace volumes

Introducing YARN

  • Critiquing the YARN architecture
  • Identifying the new daemons

Managing Resources and Cluster Health

Allocating resources

  • Setting quotas to constrain HDFS utilization
  • Prioritizing access to MapReduce using schedulers
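HDFS supports both space quotas and name quotas per directory. The directory and limits below are illustrative.

```shell
# Cap a project directory's raw storage and its object count.
hdfs dfsadmin -setSpaceQuota 10t /projects/bi   # space quota (counts all replicas)
hdfs dfsadmin -setQuota 1000000 /projects/bi    # name quota (files + directories)
hdfs dfs -count -q /projects/bi                 # inspect quota and remaining headroom
```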

Maintaining HDFS

  • Starting and stopping Hadoop daemons
  • Monitoring HDFS status
  • Adding and removing data nodes
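Removing a DataNode safely means decommissioning it rather than simply stopping it. A sketch, assuming the exclude file path configured under `dfs.hosts.exclude` (the path and hostname here are placeholders):

```shell
# Decommission a DataNode: list it in the exclude file, then have the
# NameNode re-read its host lists and drain the node's replicas.
echo "worker042.example.com" >> /etc/hadoop/conf/dfs.exclude
hdfs dfsadmin -refreshNodes   # begins re-replicating the node's blocks elsewhere
hdfs dfsadmin -report         # watch for the node to reach "Decommissioned"
```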

Administering MapReduce

  • Managing MapReduce jobs
  • Tracking progress with monitoring tools
  • Commissioning and decommissioning compute nodes
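Day-to-day job management is done from the command line; the job ID below is a placeholder.

```shell
# List running jobs, inspect one, and kill it if it is misbehaving.
mapred job -list                            # job IDs, state, user, queue
mapred job -status job_1400000000000_0001   # counters and completion percentage
mapred job -kill   job_1400000000000_0001   # terminate the job
```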

Maintaining a Cluster

Employing the standard built-in tools

  • Managing and debugging processes using JVM metrics
  • Performing Hadoop status checks
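Routine status checks combine Hadoop's own reporting commands with standard JDK tools; the process ID passed to `jstat` is a placeholder.

```shell
# Cluster-level and process-level health checks.
hdfs dfsadmin -report        # capacity, live/dead DataNodes, remaining space
hdfs fsck /                  # namespace walk: missing, corrupt, under-replicated blocks
jps                          # list the Hadoop JVMs running on this host
jstat -gcutil 12345 5000     # sample one daemon's GC metrics every 5 seconds
```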

Tuning with supplementary tools

  • Assessing performance with Ganglia
  • Benchmarking to ensure continued performance

Extending Hadoop

Simplifying information access

  • Enabling SQL-like querying with Hive
  • Installing Pig to create MapReduce jobs
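Once Hive is installed, ad hoc queries can be run straight from the shell; the `sales` table and its columns are hypothetical.

```shell
# Run a HiveQL aggregation without entering the interactive shell.
hive -e "SELECT region, SUM(amount) FROM sales GROUP BY region;"
```

Hive compiles the statement into MapReduce jobs behind the scenes, which is precisely the simplification this section is about.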

Integrating additional elements of the ecosystem

  • Imposing a tabular view on HDFS with HBase
  • Configuring Oozie to schedule workflows

Implementing Data Ingress and Egress

Facilitating generic input/output

  • Moving bulk data into and out of Hadoop
  • Transmitting HDFS data over HTTP with WebHDFS
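WebHDFS exposes HDFS over a REST API. A sketch with `curl`, assuming the Hadoop 2 default NameNode web port (50070); the host and path are illustrative.

```shell
# List a directory over HTTP.
curl -i "http://namenode.example.com:50070/webhdfs/v1/user/student/data?op=LISTSTATUS"

# Read a file; -L follows the NameNode's redirect to the DataNode
# that actually serves the block data.
curl -i -L "http://namenode.example.com:50070/webhdfs/v1/user/student/data/sales.csv?op=OPEN"
```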

Acquiring application-specific data

  • Collecting multi–sourced log files with Flume
  • Importing and exporting relational information with Sqoop
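A typical Sqoop import pulls one relational table into HDFS; the JDBC URL, credentials, and table name below are placeholders.

```shell
# Import the "orders" table from MySQL into an HDFS directory.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username analyst -P \
  --table orders \
  --target-dir /user/student/orders
```

Sqoop generates a MapReduce job that reads the table in parallel splits; `sqoop export` reverses the direction.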

Planning for Backup, Recovery and Security

  • Coping with inevitable hardware failures
  • Securing your Hadoop cluster
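One common backup strategy is replicating directories to a second cluster with DistCp; the cluster addresses below are illustrative.

```shell
# Copy a directory tree, in parallel, from the production cluster
# to a disaster-recovery cluster.
hadoop distcp hdfs://active-nn:8020/data hdfs://dr-nn:8020/backups/data
```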