Cloudera is built upon the Apache Hadoop framework and employs the largest group of Hadoop committers under one roof. Cloudera enables organizations to capture, store, analyze and act on any data at massive speed and scale in a single data solution using Hadoop platforms.
Cloudera is hardware-agnostic, and our solutions can be optimized for both cloud and on-premises environments. As a result, Cloudera has a vast partner ecosystem, and we pride ourselves on our solutions being highly compatible with our customers’ existing environments and service providers. This allows our solution to be tailored to each environment, rather than wasting time and resources on solutions that are incompatible with the hardware, environment or service providers already in place and that deplete the budget before the proposed solution is even installed.
Your goal to modernize legacy systems and better harness your data is a mission we at Cloudera share. We strive to bring a comprehensive set of data analytics solutions to data anywhere the enterprise needs to work, from the Edge to AI.
By implementing an open source data platform supported by Cloudera on your own infrastructure, in the cloud or a hybrid of both, we expect you can achieve the following core benefits as we enable your Data Lake:
- New Efficiencies for data architecture through a significantly lower-cost storage platform, leveraging the industry’s only secure, enterprise-ready open source Hadoop distribution. A modern data architecture will allow you to integrate, store and process all enterprise data regardless of source, format, and type at a fraction of the cost of proprietary solutions.
- Capture Data in Motion in a secure, traceable way to tap the potential of streaming data analytics, data routing and seamless data ingestion from Dubai Municipality-owned or public data sources.
- New Opportunities, Innovation & Insights by providing data scientists, business analysts, and data developers with the ability to easily access and query all enterprise data within one environment from batch to real time using the tools they are most familiar with.
Cloudera EDH provides a unified platform to cost-effectively collect, store and manage unlimited volumes of any structured, semi-structured and unstructured data.
Cloudera’s Enterprise Data Hub (EDH) consists of:
- CDH (Cloudera’s Distribution including Hadoop)
- Cloudera’s Enterprise Management, Governance and Security layer.
CDH is 100% Apache-licensed open source and offers unified batch processing, interactive SQL, interactive search, and role-based access controls. More enterprises have downloaded CDH than all other such distributions combined.
CDH includes the core elements of Apache Hadoop plus several additional key open source projects that, when coupled with customer support, management, and governance through a Cloudera Enterprise subscription, can deliver an enterprise data hub that is:
- Flexible – Store any type of data and process it with an array of different computation frameworks including batch processing, interactive SQL, free text search, machine learning and statistical computation.
- Integrated – Get up and running quickly on a complete, packaged, Hadoop platform.
- Secure – Process and control sensitive data and facilitate multi-tenancy.
- Scalable & Extensible – Enable a broad range of applications and scale them with your business.
- Highly Available – Run mission-critical workloads with confidence.
- Compatible – Extend and leverage existing IT investments.
Cloudera’s Enterprise Management, Governance and Security layer:
- Cloudera Manager: the best-in-class holistic interface that provides end-to-end system management and key enterprise features to deliver granular visibility into and control over every part of an enterprise data hub. It is the only enterprise-grade Hadoop management application available – empowering operators to improve cluster performance, enhance quality of service, increase compliance, and reduce administrative costs.
- Cloudera Director: built for powering Hadoop across all the major cloud environments. It provides the flexibility to deploy on your environment of choice. With a single multi-cluster, multi-environment view, you can easily manage elasticity and dynamic cluster lifecycles across common workloads.
- Data Management
- Cloudera Navigator: the only native end-to-end governance solution for Apache Hadoop based systems. Through a single user interface, it provides visibility for administrators, data managers, data scientists, and analysts to secure, govern, and explore the large amounts of diverse data that land in Hadoop. Cloudera Navigator is part of Cloudera Enterprise’s comprehensive data security and governance offering and is a key part of meeting compliance and regulatory requirements.
- Cloudera Navigator Optimizer: helps you port and optimize your SQL queries on Hadoop.
- Cloudera Navigator Encrypt: the only Hadoop platform to provide out-of-the-box encryption for both “data in motion,” between processes and systems, as well as “data-at-rest” as it persists on disk or other storage mediums.
- Cloudera Navigator KeyTrustee: provides industrial-strength encryption key management.
Data can be transformed before it is ingested, or the raw data can be ingested in its full fidelity and transformed afterwards. This gives you full flexibility in where and how you apply transformations.
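The two options can be illustrated with a minimal sketch in plain Python. It is not Cloudera code: the record layout (`sensor`, `temp_c`) and the function names are hypothetical, and in-memory lists stand in for the platform's storage.

```python
import json


def parse_reading(raw_line):
    """Turn one raw JSON line into a cleaned record (hypothetical schema)."""
    rec = json.loads(raw_line)
    return {"sensor": rec["sensor"].strip().lower(), "celsius": float(rec["temp_c"])}


def ingest_transformed(raw_lines):
    """Option 1: transform on the way in, so only cleaned records are stored."""
    return [parse_reading(line) for line in raw_lines]


def ingest_raw(raw_lines):
    """Option 2: store every line exactly as received, in full fidelity."""
    return list(raw_lines)


# Hypothetical sample input.
raw_lines = [
    '{"sensor": " A1 ", "temp_c": "21.5"}',
    '{"sensor": "b2", "temp_c": "19"}',
]

curated = ingest_transformed(raw_lines)

landing_zone = ingest_raw(raw_lines)
# The same transformation can be applied later, or replaced by a new one,
# because the untouched input is still available.
curated_later = [parse_reading(line) for line in landing_zone]
```

Both paths end with the same cleaned records. The difference is that the second path also keeps the original input, so a different transformation can be applied later without re-collecting the data.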
Cloudera’s Enterprise Data Hub ships with numerous out-of-the-box options for Data Ingestion:
- Sqoop moves large datasets in bulk from a relational database to Hadoop or vice versa.
- Apache Spark and Spark Streaming allow users to define data transformations and perform them in-memory on data as it streams into the platform. Apache Spark is open source and part of CDH.
- Apache Kafka allows real-time data integration. Apache Kafka is a distributed, partitioned, real-time pub/sub messaging system designed for speed, scalability, and durability. Apache Kafka is open source and part of CDH.
With Kafka (to transport events) and Spark Streaming (to process events as they arrive), deployments can easily scale to achieve over 1 million end-to-end events per second.
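Throughput at this scale is reached by spreading a topic across partitions that are consumed in parallel. The following back-of-envelope sketch estimates how many partitions a target rate would need. The per-partition rate and the headroom factor are hypothetical placeholders, not Cloudera sizing guidance, and should be replaced with figures measured on the actual cluster.

```python
import math


def partitions_needed(target_events_per_sec, per_partition_events_per_sec, headroom=1.0):
    """Estimate the minimum partition count for a target event rate.

    headroom > 1.0 reserves spare capacity for traffic peaks.
    """
    if per_partition_events_per_sec <= 0:
        raise ValueError("per-partition rate must be positive")
    return math.ceil(target_events_per_sec * headroom / per_partition_events_per_sec)


TARGET = 1_000_000              # events per second, the figure quoted above
ASSUMED_PER_PARTITION = 50_000  # hypothetical; measure on your own hardware

baseline = partitions_needed(TARGET, ASSUMED_PER_PARTITION)
with_headroom = partitions_needed(TARGET, ASSUMED_PER_PARTITION, headroom=1.5)
print(baseline, with_headroom)
```

The same calculation applies on the processing side, since each partition is read by one consumer task at a time, so the partition count also bounds how far the stream processing can be parallelized.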
The merger of Hortonworks and Cloudera on January 3, 2019 has led to the combining of products and roadmaps. Cloudera has stated publicly that it will support both previous Hortonworks Data Platform (HDP) and CDH deployments in their latest versions until January 2022. The first release of Cloudera Data Platform (CDP) will be composed of a selection of elements from HDP version 3.x and CDH 6 and will focus on running customers’ existing workloads and data. CDP is expected to run in the cloud, both private and public, and an on-premises solution will be forthcoming.
Cloudera Enterprise Platform provides end-to-end components that cover most of the architecture under one platform. The few remaining components should be procured from Cloudera ecosystem partners that are certified and supported to work with the Cloudera platform and to integrate with Cloudera Manager.
The following diagram provides the high-level architecture of the proposed solution:
High Level Architecture
The best approach to designing a Big Data and analytics platform is to start from the required use cases, which dictate what the overall architecture should look like. Cloudera provides a general end-to-end architecture that fits most use cases, with some modifications depending on the requirements. Most Cloudera components in the platform will be used in one way or another to achieve the required functionality.