Datawarehouse Bigdata Integration - Proof of Concept

The objective of this proof of concept (POC) project is to evaluate the feasibility of converting a traditional ETL architecture for data warehouse loads into a hybrid approach with bigdata integration.
Refer to the following post for architectural details.
  • Proof of Concept - Project Plan

The POC project has a timeline of 4 weeks.
The following activities are planned during this period.
  1. Define Business goals and corresponding use cases.
  2. Setup and Configuration.
  3. Architecture and Design.
  4. Development.
  5. Evaluation and recommendations.
  • System Hardware Architecture

Minimal hardware investment is planned for the POC project. The cluster is configured with 1 master node and 3 slave nodes.


  • System Software Architecture

  • Application Use Cases

Application use cases include the following.
  1. Ingest data from OLTP into HDFS using Sqoop.
  2. Load Facts using MapReduce jobs.
  3. Create Aggregates using HiveQL queries.
  4. Export processed Aggregates from HDFS to the DW using Sqoop.
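
The four use cases above form a single pipeline, which can be sketched as an ordered list of command lines. This is a minimal illustration, not the POC's actual job definitions: every host, database, table, path, jar, and class name below is a hypothetical placeholder, and credentials are left out. The script only builds and prints the commands; it does not run them.

```python
from typing import List

# All hosts, databases, tables, paths, and jar/class names below are
# hypothetical placeholders, not values taken from the POC.
OLTP_JDBC = "jdbc:mysql://oltp-host/sales"
DW_JDBC = "jdbc:mysql://dw-host/warehouse"


def sqoop_import(table: str, target_dir: str, mappers: int = 4) -> List[str]:
    """Use case 1: copy one OLTP table into an HDFS directory."""
    return [
        "sqoop", "import",
        "--connect", OLTP_JDBC,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(mappers),
    ]


def fact_load(jar: str, main_class: str, input_dir: str, output_dir: str) -> List[str]:
    """Use case 2: run a MapReduce job (packaged as a jar) that loads facts."""
    return ["hadoop", "jar", jar, main_class, input_dir, output_dir]


def hive_aggregate(script: str) -> List[str]:
    """Use case 3: run a HiveQL script that creates the aggregates."""
    return ["hive", "-f", script]


def sqoop_export(table: str, export_dir: str) -> List[str]:
    """Use case 4: push aggregate files from HDFS into a DW table."""
    return [
        "sqoop", "export",
        "--connect", DW_JDBC,
        "--table", table,
        "--export-dir", export_dir,
    ]


def build_pipeline() -> List[List[str]]:
    """Return the four commands in execution order."""
    return [
        sqoop_import("orders", "/poc/stage/orders"),
        fact_load("fact-load.jar", "poc.FactLoad",
                  "/poc/stage/orders", "/poc/facts/orders"),
        hive_aggregate("aggregates.hql"),
        sqoop_export("agg_orders_daily", "/poc/agg/orders_daily"),
    ]


if __name__ == "__main__":
    for cmd in build_pipeline():
        print(" ".join(cmd))
```

Keeping each step as a plain argument list makes it easy to hand the commands to a scheduler or to `subprocess.run` later, once real connection strings and credentials are supplied.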

  • Performance Use Cases



