20+ Petabyte Analytical Big Data Lake on IZAC

Serving 300+ million subscribers, handling 15+ billion records a day.

Case: The customer had a data warehouse with close to 10 petabytes of data. In recent times the warehouse had been filling up steadily and had already reached 80% capacity utilization, making it increasingly important to address the situation by either refreshing the entire architecture or augmenting it with an offloading mechanism to handle the workload in parallel in a cost-effective manner.

Main objectives of the existing warehouse and ETL landscape were:

  • Managing application/business reporting
  • Managing ad hoc data exploration
  • Performing ETL operations on production data sets extracted from application databases
  • Providing insights on archival data by pulling it back into the warehouse whenever required

Challenge: Enterprise data warehouses like theirs are under considerable strain from increasing data volumes, as well as from real-time data ingestion and change data capture workloads they were not built to accommodate. These challenges were forcing us, and organizations like VI, to rethink their data infrastructure in order to architect an enterprise data hub that can ingest, store and process large volumes of structured, semi-structured and unstructured data to deliver richer business insights, while at the same time building a parallel data exchange architecture, the result being a move to an enterprise-ready, open architecture.

This enables our business layer to do what it does best: business-critical reporting that supports high concurrency with low latency, rather than spending CPU cycles on transformation. Open-source big data systems are increasingly becoming the central data repository, supporting batch, interactive and real-time use cases.

The IZAC framework, along with the Hadoop platform, fulfilled the need for open-source, enterprise-ready software and databases, demonstrating that VI could meet all of the above objectives at a substantially lower cost.

Action: DWH Optimization – Stage 1

This warehouse optimization initiative provides the following critical capabilities that VIL required:

  • A data management platform that stores large volumes of data on cheaper servers and storage, at a lower cost than the alternatives.
  • Improved responsiveness of the data warehouse by offloading ETL transformations to a separate real-time platform (see the sketch after this list).
  • The ability to store, process and analyze new types of data, such as weblogs, security logs and data from upcoming web/mobile applications.
  • The ability to reclaim data warehouse CPU and storage capacity and eventually sunset it.
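A minimal sketch of what such an offloaded ETL job could look like on the data lake side, assuming raw weblog exports land on HDFS as line-delimited JSON; the paths, column names and table name are illustrative, not taken from the actual VIL implementation:

```python
# Minimal PySpark sketch of an offloaded ETL job: raw weblog exports landed on
# HDFS are cleansed and written back as a partitioned Hive table that analysts
# can query ad hoc, instead of running the transformation inside the warehouse.
# Paths, column names and the table name are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("weblog-etl-offload")
    .enableHiveSupport()          # write results as Hive tables on the data lake
    .getOrCreate()
)

# Raw weblog export, assumed to be line-delimited JSON dropped into HDFS
raw = spark.read.json("hdfs:///landing/weblogs/2024-01-01/")

cleaned = (
    raw
    .withColumn("event_ts", F.to_timestamp("event_time"))   # normalize timestamps
    .withColumn("event_date", F.to_date("event_ts"))        # partition column
    .filter(F.col("subscriber_id").isNotNull())              # drop unusable rows
)

# Persist as a partitioned, Parquet-backed Hive table for ad hoc exploration
(
    cleaned.write
    .mode("append")
    .partitionBy("event_date")
    .format("parquet")
    .saveAsTable("analytics.weblogs_cleaned")
)
```

Running the transformation here, rather than in the warehouse, is what frees the warehouse's CPU cycles for reporting.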

VIL aspired to have the flexibility to use this new architecture to reduce overall system cost by performing transformations on the new open platform and freeing up previously used storage and compute capacity.

In addition, they could feed more types and sources of data into this real-time "Single Source of Truth" architecture for more granular and richer analytics across the combined solution.

Our Solution:

Our approach drew on data exchange techniques and the emerging architectures that leading telecoms are pursuing or already using, and led us to choose open-source technologies to build a complete data architecture for DWH and real-time analytics use cases on modern frameworks. Three of the major components of this solution were:

  • Apache Hadoop – Hortonworks Data Platform – creating the enterprise data lake architecture, offering database technologies such as Hive and HBase to give analysts ad hoc data exploration capability.
  • Apache Kafka + Spark Streaming – the custom IZAC framework handling the real-time ETL and data exchange platform (a sketch follows this list).
  • Apache Druid for real-time data warehousing – offloading workload from the existing DB2/IIAS solution to keep it optimized and reduce TCO.
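To make the Kafka + Spark Streaming leg concrete, below is a minimal Spark Structured Streaming sketch of a real-time ETL flow of this shape. The topic, broker addresses, event schema and output paths are illustrative assumptions, not the actual IZAC implementation:

```python
# Minimal Spark Structured Streaming sketch of the real-time ETL leg:
# consume subscriber events from Kafka, parse and cleanse them, and land the
# result on the data lake. Topic, brokers, schema and paths are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("izac-style-streaming-etl").getOrCreate()

# Assumed event schema, for illustration only
event_schema = StructType([
    StructField("subscriber_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", LongType()),      # epoch milliseconds
])

# Read the raw event stream from Kafka
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "subscriber-events")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka values arrive as bytes; decode them and apply the schema
events = (
    raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    .withColumn("event_ts", (F.col("event_time") / 1000).cast("timestamp"))
    .filter(F.col("subscriber_id").isNotNull())
)

# Land the cleansed stream on HDFS as Parquet. Druid can consume the same Kafka
# topic in parallel through its Kafka indexing service for real-time queries.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///lake/events/cleansed/")
    .option("checkpointLocation", "hdfs:///lake/checkpoints/events/")
    .trigger(processingTime="1 minute")
    .start()
)

query.awaitTermination()
```

The same Kafka topics act as the data exchange layer: the streaming job feeds the data lake for batch and ad hoc analysis, while Druid ingests from Kafka directly for the low-latency warehousing workload offloaded from DB2/IIAS.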