Hadoop | Cloud Academy Blog
https://cloudacademy.com/blog/category/hadoop-2/

HDInsight – Azure’s Hadoop Big Data Service
https://cloudacademy.com/blog/hdinsight-azures-hadoop-big-data-service/ (published 31 May 2016)

How can Azure HDInsight solve your big data challenges?

Big data refers to large volumes of fast-moving data in any format that haven’t yet been handled by your traditional data processing systems. In other words, it refers to data characterized by Volume, Variety, and Velocity (commonly called the three Vs in big data circles). Data can come from just about anywhere: application logs, sensor data, archived images, videos, streaming data like Twitter trends, weather forecast data, astronomical data, biological genomes, and almost anything else generated by humans or machines. Handling data on this scale is a relatively new problem. Azure’s HDInsight is, appropriately, a new tool that aims to address this problem.

New Challenges and New Solutions: Coping with Variety and Velocity

Whether human-readable or not, managing fast-moving data generated at a massive scale – while maintaining data integrity – requires a different kind of processing mechanism than we would have used for a traditional database or data-mining workload. Older solutions handled Volume well enough, but Variety and Velocity are relatively new problems. Since not everyone can afford supercomputers, brute-force approaches probably aren’t going to be practical. This challenge inspired the development of Hadoop, which made it possible to process big data on industry-standard servers, while guaranteeing reliable and scalable parallel and distributed computing on smaller budgets.
Without diving too deeply into the technologies of big data management, we should at least mention the biggest and most important players:

  • Hadoop (batch processing),
  • NoSQL (HBase, MongoDB, and Cassandra for distributed non-ACID databases),
  • Storm and Kafka (real time streaming data),
  • Spark (in-memory distributed processing), and
  • Pig scripts and Hive queries.

Besides the standard Apache Hadoop framework, there are many vendors who provide customized Hadoop distributions, like Cloudera, Hortonworks, MapR, and Greenplum. Many established cloud vendors provide some variety of Hadoop Platform as a Service (PaaS), like Amazon Elastic MapReduce (EMR) from AWS and Google BigQuery.
The purpose of this post is to introduce you to Azure HDInsight, which is based on the Hortonworks Data Platform (HDP). The flexibility of Azure cloud and the innovation of Hortonworks can make Azure HDInsight a very interesting and productive space to process your big data.

Azure’s Big Data Solutions

Azure provides various big data processing services. The most popular of them is HDInsight, which is an on-demand Hadoop platform powered by Hortonworks Data Platform (HDP). Besides HDInsight (on which we’re going to focus our attention in this post) Azure also offers:

  • Data Lake Analytics
  • Data Factory
  • SQL Data Warehouse
  • Data Catalog

Azure HDInsight: A Comprehensive Managed Apache Hadoop, Spark, R, HBase, and Storm Cloud Service

As we mentioned, Azure provides a Hortonworks distribution of Hadoop in the cloud. “Hadoop distribution” is a broad term used to describe solutions that include a MapReduce and HDFS platform, along with a fuller stack featuring Spark, NoSQL stores, Pig, Sqoop, Ambari, and Zookeeper. Azure HDInsight provides all those technologies as part of its Big Data service, and also integrates with business intelligence (BI) tools like Excel, SQL Server Analysis Services, and SQL Server Reporting Services.
As a distribution, HDInsight comes in two flavors: Ubuntu and Windows Server 2012. Users can manage Linux-based HDInsight clusters using Apache Ambari and, for Windows users, Azure provides a cluster dashboard.
Cluster configuration for HDInsight is categorized into four different offerings:
1. Hadoop Cluster for query and batch processing with MapReduce and HDFS.
HDInsight - hadoop cluster

Hadoop clusters are made up of two types of nodes: Head Nodes, which run the Name Nodes and Job Trackers in a cluster (two nodes per cluster minimum); and Worker Nodes – also called Data Nodes – which are a Hadoop cluster’s true workhorses. The minimum number of worker nodes is one, and you can scale worker nodes up or down according to need.

2. HBase for NoSQL-based workloads. NoSQL is a term describing data stores and data processing engines that do not rely on traditional ACID properties but are instead designed around the tradeoffs described by the CAP theorem.
HDInsight - hbase cluster
3. Storm for streaming data processing (remember the twitter data or sensor data use-case).
HDInsight - storm cluster
4. HDInsight Spark for in-memory parallel processing for big data analytics. Use cases for HDInsight Spark include interactive data analysis and BI, iterative machine learning, and streaming and real-time data analysis. In recent years, Apache Spark has been displacing MapReduce-based data analytics because of its ability to handle complex algorithms, its faster in-memory processing, and its support for graph computation.
Azure Spark services diagram
You can also customize the cluster with scripts employing various technologies and languages; these scripts can be used to install additional components such as Spark, Solr, R, Giraph, and Hue. HDInsight clusters also include the following Apache components:

  • Ambari for cluster provisioning, management, and monitoring.
  • Hive – an SQL-like query language that runs against data in HDFS; queries are converted into MapReduce jobs under the hood (see the short sketch after this list).
  • Pig for data processing using user-created scripts.
  • Zookeeper for cluster coordination and management in a distributed environment.
  • Oozie for workflow management.
  • Mahout for machine learning.
  • Sqoop (SQL-to-Hadoop) for importing data from and exporting data to SQL storage.
  • Tez – a successor to MapReduce that runs on YARN (Yet Another Resource Negotiator) and executes complex directed acyclic graphs (DAGs) of processing tasks.
  • Phoenix – a SQL layer over HBase for querying and analyzing data stored in HBase. Unlike Hive, Phoenix compiles queries into native NoSQL (HBase) API calls rather than MapReduce jobs.
  • HCatalog, which provides a relational view of data in HDFS. It is often used with Hive.
  • Avro – a data serialization format (HDInsight also ships an Avro library for the .NET environment).
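
To give a feel for the Hive experience on a cluster, here is a minimal sketch run from an SSH session on a Linux-based HDInsight head node; the table name, column, and data path are made-up examples, not part of any HDInsight sample:

# Define an external table over existing files and run a trivial query.
hive -e '
CREATE EXTERNAL TABLE sensor_readings (reading STRING)
LOCATION "/example/data/sensor";
SELECT COUNT(*) FROM sensor_readings;
'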

The current version of HDInsight is 3.4; the other software in the stack is at the following version numbers:

  (HDInsight software stack versions)

Conclusion

Azure HDInsight offers the big data management features the enterprise cloud needs, and it has become one of the most talked-about Hadoop distributions in use. Users can quickly scale clusters up or down according to their needs, pay only for the resources they actually use, and avoid both the capital costs required to provision complex hardware configurations and the professionals needed to maintain them. HDInsight lets you crunch all kinds of data – whether structured or not – without the burden of Hadoop cluster configuration.

If you’d like to learn more about Big Data and Hadoop, why not take our Analytics Fundamentals for AWS course? We’re currently building some great Azure content, but in the meantime this course provides a solid foundation in Big Data analytics for AWS. Check it out today!

Learn Analytics Fundamentals for AWS: EMR

 

Microsoft Azure Data Lake Store: an Introduction
https://cloudacademy.com/blog/azure-data-lake-store/ (published 16 Dec 2015)

The Azure Data Lake Store service provides a platform for organizations to park – and process and analyse – vast volumes of data in any format. Find out how.

With increasing volumes of data to manage, enterprises are looking for appropriate infrastructure models to help them apply analytics to their big data, or simply to store them for undetermined future use. In this post, we’re going to discuss Microsoft’s entry into the data lake market, Azure Data Lake, and in particular, Azure Data Lake store.

What is a data lake?

In simple terms, a data lake is a repository for large quantities and varieties of both structured and unstructured data in their native formats. The term data lake was coined by James Dixon, CTO of Pentaho, to contrast what he called “data marts”, which handled the data reporting and analysis by identifying “the most interesting attributes, and to aggregate” them. The problems with this approach are that “only a subset of the attributes is examined, so only pre-determined questions can be answered,” and that “data is aggregated, so visibility into the lowest levels is lost.”

A data lake, on the other hand, maintains data in their native formats and handles the three Vs of big data (Volume, Velocity, and Variety) while providing tools for analysis, querying, and processing. A data lake eliminates many of the restrictions of a typical data warehouse system by providing unlimited space, unrestricted file sizes, schema on read, and various ways to access data (including programming, SQL-like queries, and REST calls).

With the emergence of Hadoop (including HDFS and YARN), the benefits of a data lake – previously available only to the most resource-rich companies like Google, Yahoo, and Facebook – became a practical reality for just about anyone. Organizations that had been generating and gathering data on a large scale, but had struggled to store and process it in a meaningful way, now have more options.

Azure Data Lake

Azure Data Lake is the new kid on the data lake block from Microsoft Azure. Here is some of what it offers:

  • The ability to store and analyse data of any kind and size.
  • Multiple access methods including U-SQL, Spark, Hive, HBase, and Storm.
  • Built on YARN and HDFS.
  • Dynamic scaling to match your business priorities.
  • Enterprise-grade security with Azure Active Directory.
  • Managed and supported with an enterprise-grade SLA.

Azure Data Lake can, broadly, be divided into three parts:

  • Azure Data Lake store – The Data Lake store provides a single repository where organizations upload data of just about infinite volume. The store is designed for high-performance processing and analytics from HDFS applications and tools, including support for low latency workloads. In the store, data can be shared for collaboration with enterprise-grade security.
  • Azure Data Lake analytics – Data Lake analytics is a distributed analytics service built on Apache YARN that complements the Data Lake store. The analytics service can handle jobs of any scale instantly with on-demand processing power and a pay-as-you-go model that’s very cost effective for short-term or on-demand jobs. It includes U-SQL, a query language that unifies the benefits of SQL with the expressive power of user code and runs on a scalable, distributed runtime.
  • Azure HDInsight – Azure HDInsight is a full stack Hadoop Platform as a Service from Azure. Built on top of Hortonworks Data Platform (HDP), it provides Apache Hadoop, Spark, HBase, and Storm clusters.

We’ve already been introduced to HDInsight in this series. Now we will discuss Azure Data Lake Store…which is still in Preview Mode.

Azure Data Lake Store

According to Microsoft, Azure Data Lake store is a hyper-scale repository for big data analytics workloads and a Hadoop Distributed File System (HDFS) for the cloud. It…

  • Imposes no fixed limits on file size.
  • Imposes no fixed limits on account size.
  • Allows unstructured and structured data in their native formats.
  • Allows massive throughput to increase analytic performance.
  • Offers high durability, availability, and reliability.
  • Is integrated with Azure Active Directory access control.

Some have compared Azure Data Lake store with Amazon S3 but, beyond the fact that both provide unlimited storage space, the two really don’t have much in common. If you want to compare S3 to an Azure service, you’ll get better mileage with the Azure Storage service. Azure Data Lake store, on the other hand, comes with an integrated analytics service and places no limits on file size. Here’s a nice illustration:
Azure Data Lake store - diagram

(Image Courtesy: Microsoft)

Azure Data Lake store can handle any data in their native format, as is, without requiring prior transformations. Data Lake store does not require a schema to be defined before the data is uploaded, leaving it up to the individual analytic framework to interpret the data and define a schema at the time of the analysis. Being able to store files of arbitrary size and formats makes it possible for Data Lake store to handle structured, semi-structured, and even unstructured data.

Azure Data Lake store file system (adl://)

Azure Data Lake Store can be accessed from Hadoop (available with an HDInsight cluster) using WebHDFS-compatible REST APIs. In addition, Azure Data Lake store introduces a new file system called AzureDataLakeFilesystem (adl://), which is optimized for performance and available in HDInsight. Data in the Data Lake store is accessed using:

adl://<data_lake_store_name>.azuredatalakestore.net
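
As a rough illustration (the store name, local file, and paths below are placeholders, not values from this article), data in the store can be browsed and loaded from an HDInsight cluster with the standard Hadoop file system shell:

# List the root of the store and upload a local log file (names are examples).
hdfs dfs -ls adl://mydatalakestore.azuredatalakestore.net/
hdfs dfs -put ./clicks.log adl://mydatalakestore.azuredatalakestore.net/raw/clicks.log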

Azure Data Lake store security

Azure Data Lake store uses Azure Active Directory (AAD) for authentication and Access Control Lists (ACLs) to manage access to your data. Azure Data Lake benefits from all AAD features including Multi-Factor Authentication, conditional access, role-based access control, application usage monitoring, security monitoring and alerting. Azure Data Lake store supports the OAuth 2.0 protocol for authentication within the REST interface. Similarly, Data Lake store provides access control by supporting POSIX-style permissions exposed by the WebHDFS protocol.
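
As a hedged sketch of what that REST surface looks like (the store name, folder, and <ACCESS_TOKEN> – an AAD-issued OAuth 2.0 bearer token – are placeholders), a directory listing call resembles this:

# List a folder through the WebHDFS-compatible endpoint using an AAD bearer token.
curl -H "Authorization: Bearer <ACCESS_TOKEN>" \
  "https://mydatalakestore.azuredatalakestore.net/webhdfs/v1/raw?op=LISTSTATUS"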

Azure Data Lake store pricing

Data Lake Store is currently available in the East US 2 region and offers preview pricing rates (excluding outbound data transfer):
Azure Data Lake store - cost

Conclusion

Azure Data Lake is an important new part of Microsoft’s ambitious cloud offering. With Data Lake, Microsoft provides a service to store and analyze data of any size at an affordable cost. In related posts, we will learn more about Data Lake Store, Data Lake Analytics, and HDInsight.

Amazon EMR: Five Ways to Improve the Way You Use Hadoop
https://cloudacademy.com/blog/amazon-emr-five-ways-to-improve-the-way-you-use-hadoop/ (published 2 Nov 2015)

Amazon EMR (Elastic MapReduce) allows developers to avoid some of the burdens of setting up and administrating Hadoop tasks. Learn how to optimize it.

Apache Hadoop is an open-source framework designed to distribute the storage and processing of massive data sets across virtually limitless servers. Amazon EMR (Elastic MapReduce) is a particularly popular AWS service used by developers who want to avoid the burden of setup and administration and concentrate on working with their data.

Over the years, Amazon EMR has undergone many transformations. AWS is constantly working to improve their product, pushing new updates for EMR with every new Hadoop release. Amazon EMR now integrates with versatile Hadoop Ecosystem applications, offering an improved core architecture and an even more simplified interface.

For those who have stumbled upon this blog having little or no knowledge of Amazon EMR, but are familiar with Hadoop, here is a very quick overview:

  • EMR is Hadoop-as-a-Service from Amazon Web Services (AWS).
  • EMR supports Hadoop 2.6.0, Hive 1.0.0, Mahout 0.10.0, Pig 0.14.0, Hue 3.7.1, and Spark 1.4.1
  • The MapR distribution supported by EMR is MapR 4.0.2 (M3/M5/M7 editions, with Hadoop 2.4.0, Hive 0.13.1, and Pig 0.12.0).
  • Cost effective and integrated with other AWS Services.
  • Flexible resource utilization model.
  • No capacity planning, Hardware-on-Demand.
  • Easy to use with a flexible hourly usage model for clusters.
  • Integrated with other AWS services like S3, CloudFormation, Redshift, SQS, DynamoDB, and Cloudwatch.

The current EMR release (EMR-4.1.0) is based on the Apache Bigtop project. Describing Bigtop is well beyond the scope of this post, but you can read about it here. Instead, we are going to talk about five exciting – and unique – Amazon EMR features.

Amazon EMR Components

The current Amazon EMR release adds elements necessary to bring EMR up to date. The components are either community contributed editions or developed in-house at AWS. For example, Hadoop itself is a community edition, while the Amazon DynamoDB connector (emr-ddb-3.0.0) comes exclusively with EMR. Here is Amazon’s components guide.

You can also install recommended third-party software packages on your cluster using Bootstrap Actions. Third party libraries can be packaged directly into your Mapper or Reducer executable. Alternatively, you could upload statically compiled executables using the Hadoop distributed cache mechanism.
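
For instance, a bootstrap action can be attached when the cluster is launched; in this hedged AWS CLI sketch, the cluster name, bucket, script, and instance sizing are assumptions for illustration only:

# Run a custom script on every node as it joins the cluster (script path is an example).
aws emr create-cluster --name "BootstrapExample" --release-label emr-4.1.0 \
  --applications Name=Hadoop --instance-type m3.xlarge --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions Path=s3://mybucket/scripts/install-libs.sh,Name="Install libraries"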

EMR Hadoop Nodes

Unlike standard distributions, there are three types of EMR Hadoop Nodes.

  • Master Node: runs the NameNode and the YARN ResourceManager.
  • Slave Node – Core: runs an HDFS DataNode and a YARN NodeManager.
  • Slave Node – Task: runs a YARN NodeManager, but no HDFS DataNode.

Amazon EMR - nodes
Slave nodes in EMR are of particular interest. In EMR, Core nodes and Task nodes together constitute a cluster’s slave nodes. Core nodes include Task Trackers and Data Nodes, with the Data Nodes running the HDFS distributed file system. Since they store HDFS data, Core nodes cannot be removed from a running cluster. Task nodes, on the other hand, only act as Task Trackers and have no HDFS restrictions. Task nodes can, therefore, be scaled up and down according to the changing processing needs of a specific job. That’s how EMR supports dynamic clustering.

With HDFS out of the picture on Task nodes, node failures and the addition of new nodes are far simpler to deal with, as there is no need for HDFS rebalancing.
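
That resizing is exposed through the EMR API; as a rough sketch (the instance group ID is a placeholder you would look up with aws emr describe-cluster), the task group of a running cluster can be grown or shrunk on demand:

# Scale the task instance group of a running cluster to 10 nodes (the ID is a placeholder).
aws emr modify-instance-groups \
  --instance-groups InstanceGroupId=ig-XXXXTASKGROUP,InstanceCount=10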

EMR File System (EMRFS)

EMRFS is an implementation of the HDFS interface that allows an Amazon EMR cluster to store and access data in Amazon S3. Amazon S3 is a great place to store huge datasets because of its low cost, durability, and availability. But one potential problem with S3 is its eventual consistency model: with eventual consistency, you might not see updated objects as soon as they are added to your bucket, which can be a concern during certain multi-step ETL processing.

To address the issue, EMR provides a feature called consistent view. By creating a DynamoDB table to track the data in S3, consistent view provides read-after-write consistency and improved performance. Consistent view can be enabled when an EMR cluster is created. Be aware that there is a small cost overhead for consistent view’s DynamoDB usage.
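
As a sketch of how that looks from the AWS CLI (the cluster name, sizes, and retry settings are illustrative assumptions), the --emrfs option switches consistent view on at creation time:

# Create a cluster with EMRFS consistent view enabled.
aws emr create-cluster --name "ConsistentViewExample" --release-label emr-4.1.0 \
  --applications Name=Hadoop Name=Hive --instance-type m3.xlarge --instance-count 3 \
  --use-default-roles --emrfs Consistent=true,RetryCount=5,RetryPeriod=30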

Transient and Long Running Clusters

With EMR, you can choose between a non-committed cluster, called a transient cluster, and a long-running cluster for larger workloads. A transient cluster is automatically terminated after the processing job is done, which ensures your AWS bill reflects only your actual use. Transient clusters are particularly suitable for periodic jobs.

A long-running cluster, on the other hand, is meant for persistent job execution. Imagine that you need to upload a huge amount of data for EMR processing. It can sometimes be inefficient to load it in smaller packages. With long-running clusters, you can query the cluster continuously or even periodically as it will be running even if there are no jobs in the queue.

Making a cluster long-running is easy: as the administrator, choose NO for auto-termination under Advanced Options -> Steps. That’s it!
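
The same choice can be made from the AWS CLI; in this hedged sketch (cluster names, instance types, and the step script are placeholders), --auto-terminate gives you a transient cluster and --no-auto-terminate a long-running one:

# Transient cluster: terminates automatically once its steps have finished.
aws emr create-cluster --name "NightlyBatch" --release-label emr-4.1.0 \
  --applications Name=Hadoop Name=Hive --instance-type m3.xlarge --instance-count 3 \
  --use-default-roles --auto-terminate \
  --steps Type=HIVE,Name="Nightly report",Args=[-f,s3://mybucket/scripts/report.q]

# Long-running cluster: stays up until you terminate it explicitly.
aws emr create-cluster --name "AlwaysOn" --release-label emr-4.1.0 \
  --applications Name=Hadoop Name=Hive --instance-type m3.xlarge --instance-count 3 \
  --use-default-roles --no-auto-terminate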

Using S3DistCp to Move Data between HDFS and S3

S3DistCp is an extension of the DistCp tool that lets you move large amounts of data between HDFS and S3 in a distributed manner. S3DistCp is more scalable and efficient than DistCp for copying large numbers of objects in parallel across buckets and between AWS accounts. Like DistCp, S3DistCp copies data using distributed MapReduce jobs; its main advantage over DistCp is that each reducer runs multiple HTTP upload threads to upload files in parallel.

You can add S3DistCp as a step to an EMR job using the AWS CLI:

aws emr add-steps --cluster-id j-1234MYCLUSTERXXXXX --steps Type=CUSTOM_JAR,Name="S3DistCp step",Jar=/home/hadoop/lib/emr-s3distcp-1.0.jar,\
Args=["--s3Endpoint,s3-eu-west-1.amazonaws.com","--src,s3://mybucket/logs/j-1234MYCLUSTERXXXXX/node/","--dest,hdfs:///output","--srcPattern,.*[a-zA-Z,]+"]

Or using EMR console like this:
Amazon EMR - s3distcp
To copy files from S3 to HDFS, you can run this command in the AWS CLI:

aws emr add-steps --cluster-id j-1234MYCLUSTERXXXXX --steps Type=CUSTOM_JAR,Name="S3DistCp step",Jar=/home/hadoop/lib/emr-s3distcp-1.0.jar,\
Args=["--src,s3://mybucket/logs/j-1234MYCLUSTERXXXXX/node/","--dest,hdfs:///output","--srcPattern,.*daemons.*-hadoop-.*"]

Conclusion

Amazon EMR allows organizations to launch Hadoop clusters and jobs almost instantaneously. With the backing of AWS’s proven infrastructure and the seamless integration between EMR and services like DynamoDB, Redshift, SQS, and Kinesis, users have many opportunities to explore.

Would you like to learn more? See how AOL significantly optimized its Amazon EMR infrastructure. Also, Cloud Academy offers a hands-on lab guiding you through the process of deploying S3-based data to an Amazon EMR cluster.

Do you have your own Hadoop or EMR experiences you’d like to share? Feel free to comment below.

DynamoDB: An Inside Look Into NoSQL – Part 1
https://cloudacademy.com/blog/dynamodb-an-inside-look-into-nosql-part-1/ (published 28 May 2014)

This is a guest post from 47Line Technologies.

In our earlier posts (Big Data: Getting Started with Hadoop, Sqoop & Hive and Hadoop Learning Series – Hive, Flume, HDFS, and Retail Analysis), we introduced the Hadoop ecosystem and explained its various components using a real-world example from the retail industry. We now have a fair idea of the advantages of Big Data. NoSQL data stores are used extensively in real-time and Big Data applications, which is why a look into their internals will help us make better design decisions in our own applications.

NoSQL datastores provide a mechanism for retrieval and storage of data items that are modeled in a non-tabular manner. The simplicity of design, horizontal scalability and control over availability form the motivations for this approach. NoSQL is governed by the CAP theorem in the same way RDBMS is governed by ACID properties.

Figure 1: NoSQL Triangle

From the AWS stable, DynamoDB is the perfect choice if you are looking for a NoSQL solution. DynamoDB is a “fast, fully managed NoSQL database service that makes it simple and cost-effective to store and retrieve any amount of data, and serve any level of request traffic. Its guaranteed throughput and single-digit millisecond latency make it a great fit for gaming, ad tech, mobile, and many other applications.” Since it is a fully managed service, you don’t need to worry about provisioning and managing the underlying infrastructure; all the heavy lifting is taken care of for you.

The majority of the documentation available online consists of how-to-get-started guides with examples of DynamoDB API usage. Let’s look instead at the thought process and design strategies that went into the making of DynamoDB.

“DynamoDB uses a synthesis of well-known techniques to achieve scalability and availability: Data is partitioned and replicated using consistent hashing, and consistency is facilitated by object versioning. The consistency among replicas during updates is maintained by a quorum-like technique and a decentralized replica synchronization protocol. DynamoDB employs a gossip-based distributed failure detection and membership protocol. Dynamo is a completely decentralized system with minimal need for manual administration. Storage nodes can be added and removed from DynamoDB without requiring any manual partitioning or redistribution.” You must be wondering – “Too much lingo for one paragraph”. Fret not, why fear when I am here 🙂 Let’s take one step at a time, shall we!

Requirements and Assumptions

This class of NoSQL storage system has the following requirements –

  • Query Model: A “key” uniquely identifies a data item. Read and write operations are performed on this data item, and no operation spans multiple data items. There is no need for a relational schema, and DynamoDB works best when a single data item is smaller than 1 MB. (A minimal key-based access sketch follows this list.)
  • ACID Properties: As mentioned earlier, there is no need for a relational schema, and full ACID (Atomicity, Consistency, Isolation, Durability) guarantees are not required. Industry and academia acknowledge that ACID guarantees tend to lead to poor availability, so Dynamo targets applications that can operate with weaker consistency if doing so results in higher availability.
  • Efficiency: DynamoDB needs to run on commodity hardware infrastructure. Stringent SLAs (Service Level Agreements) ensure that latency and throughput requirements are met at the 99.9th percentile of the distribution. But everything has a catch – the tradeoffs involve performance, cost, availability, and durability guarantees.
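
To make the key-based query model concrete, here is a minimal sketch using the AWS CLI; the table name, key, and attributes are invented for illustration and must match whatever key schema your own table was created with:

# Write one item and read it back by its key (all names and values are examples).
aws dynamodb put-item --table-name GameScores \
  --item '{"PlayerId": {"S": "p-1001"}, "TopScore": {"N": "4200"}}'

aws dynamodb get-item --table-name GameScores \
  --key '{"PlayerId": {"S": "p-1001"}}'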

In subsequent articles, we will look into Design Considerations & System Architecture.

Hadoop Learning Series – Hive, Flume, HDFS, and Retail Analysis
https://cloudacademy.com/blog/hadoop-series-hive-flume-hdfs-and-retail-analysis/ (published 24 Mar 2014)

This is a guest article by 47Line Technologies.

Last week we introduced the Big Data ecosystem and showed a glimpse of the possibilities. This week we take one industry (retail) use case and illustrate how the various tools can be orchestrated to provide insights.

Background

The last couple of decades have seen a tectonic shift in the retail industry. Hawkers and mom-and-pop stores are being sidelined by heavyweight retail hypermarkets that operate in a complex landscape of franchisees, JVs, and multi-partner vendors. In this kind of environment, try visualizing the inventory, sales, and supplier information for thousands of SKUs (Stock Keeping Units) per store, multiplied by several thousand stores across cities, states, and even countries, over days, months, and years – and you will get a sense of the volume of data these retailers collect.

One such retail hypermarket – let’s call it BigRetail – had 5 years of data, a vast and mostly semi-structured dataset, which it wanted analyzed for trends and patterns.

The Problem

  • The 5-year dataset was 13TB in size.
  • Traditional business intelligence (BI) tools work best in the presence of a pre-defined schema. The BigRetail dataset was mostly logs, which didn’t conform to any specific schema.
  • BigRetail took close to half a day each week to move the data into its BI systems. They wanted to reduce this time.
  • Queries over this dataset took hours.

The Solution

This is where Hadoop shines in all its glory!
The problem is 2-fold:
Problem 1: Moving the logs into HDFS periodically
Problem 2: Performing analysis on this HDFS dataset

As we saw in the previous post, Apache Sqoop is used to move structured datasets into HDFS. Alas! How do we move semi-structured data? Fret not: Apache Flume is specially designed for collecting, aggregating, and moving large amounts of log data into HDFS. Once the dataset was inside HDFS, Hive was used to perform the analysis.

Let’s dig deep. Mind you – The devil is in the details.

Problem 1: How Flume solved the data transfer problem?
The primary use case for Flume is as a logging system that gathers a set of log files on every machine in a cluster and aggregates them to a centralized persistent HDFS store.

Flume Architecture
Flume’s typical dataflow is as follows: A Flume Agent is installed on each node of the cluster that produces log messages. These streams of log messages from every node are then sent to the Flume Collector. The collectors then aggregate the streams into larger streams which can then be efficiently written to a storage tier such as HDFS.

Problem 2: Analysis using Hive

Hive Implementation
Hive uses “schema on read,” unlike a traditional database, which uses “schema on write.” Schema on write means that a table’s schema is enforced at data load time: if the data being loaded doesn’t conform to the schema, it is rejected. This mechanism can slow the loading of the dataset. Schema on read, by contrast, doesn’t verify the data when it’s loaded, but rather when a query is issued. For this precise reason, once the dataset is in HDFS, moving it into the Hive-controlled namespace is usually instantaneous. Hive can also perform analysis on datasets in HDFS or local storage, but the preferred approach is to move the entire dataset into the Hive-controlled namespace (default location: /user/hive/warehouse in HDFS) to enable additional query optimizations.

While reading log files, the simplest recommended approach during Hive table creation is to use a RegexSerDe, which uses regular expressions (regex) to serialize and deserialize rows. It deserializes the data using a regex and extracts the capture groups as columns; it can also serialize a row object using a format string.
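
As a minimal sketch (the table, columns, and a tab-separated log layout are assumptions, and the hive-contrib jar must be on the classpath), such a table definition might look like this – note that every column is declared as STRING, in line with the first caveat below:

# Create a Hive table over raw tab-separated log lines using the contrib RegexSerDe;
# each regex capture group becomes a column.
hive -e '
CREATE EXTERNAL TABLE sales_logs (
  log_ts   STRING,
  store_id STRING,
  sku      STRING,
  amount   STRING
)
ROW FORMAT SERDE "org.apache.hadoop.hive.contrib.serde2.RegexSerDe"
WITH SERDEPROPERTIES (
  "input.regex" = "([^\\t]*)\\t([^\\t]*)\\t([^\\t]*)\\t([^\\t]*)"
)
LOCATION "/user/flume/sales_logs";
'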

Caveats:

  • With RegexSerDe all columns have to be strings. Use “CAST (a AS INT)” to convert columns to other types.
  • While moving data from HDFS to Hive, DO NOT use the keyword OVERWRITE

Solution Architecture using Flume + Hive

Flume + Hive Architecture
The merchandise details, user information, time of the transaction, area / city / state information, coupon codes (if any), customer data and other related details were collected and aggregated from various backend servers.

As mentioned earlier, the data-set to be analyzed was 13TB. Using the Hadoop default replication factor of 3, it would require 13TB * 3 = 39TB of storage capacity. After a couple of iterations on a smaller sample data set and subsequent performance tuning, we finalized the below cluster configuration and capacities –

  • 45 virtual instances, each with
    • 64-bit OS platform
    • 12 GB RAM
    • 4-6 CPU cores
    • 1 TB Storage

Flume configuration

Following Flume parameters were configured (sample) –

  • flume.event.max.size.bytes uses the default value of 32KB.
  • flume.agent.logdir was changed to point to an appropriate HDFS directory
  • flume.master.servers: 3 Flume Masters – flumeMaster1, flumeMaster2, flumeMaster3
  • flume.master.store uses the default value – zookeeper

Hive configuration

Following Hive parameters were configured (sample) –

  • javax.jdo.option.ConnectionURL
  • javax.jdo.option.ConnectionDriverName: set the value to “com.mysql.jdbc.Driver”
  • javax.jdo.option.ConnectionUserName
  • javax.jdo.option.ConnectionPassword

By default, Hive metadata is stored in an embedded Derby database, which allows only one user to issue queries at a time. This is not ideal for production purposes; hence, Hive was configured to use MySQL in this case.

Using the Hadoop system, log transfer time was reduced to ~3 hours weekly and querying time also was significantly improved.

Some of the schema tables that were present in the final design were – facts, products, customers, categories, locations and payments. Some sample Hive queries that were executed as part of the analysis are as follows –

  • Count the number of transactions
    • Select count (*) from facts;
  • Count the number of distinct users by gender
    • Select gender, count (DISTINCT customer_id) from customers group by gender;

Only equality joins, inner & outer joins, semi joins and map joins are supported in Hive. Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a MapReduce job. Also, more than two tables can be joined in Hive.

  • List the category to which the product belongs
    • Select products.product_name, products.product_id, categories.category_name from products JOIN categories ON (products.product_category_id = categories.category_id);
  • Count of the number of transactions from each location
    • Select locations.location_name, count (DISTINCT facts.payment_id) from facts JOIN locations ON (facts.location_id = locations.location_id) group by locations.location_name;

Interesting trends/analysis using Hive

Some of the interesting trends that were observed from this dataset using Hive were:

  • There was a healthy increase in YoY growth across all retail product categories
  • Health & Beauty Products saw the highest growth rate at 72%, closely followed by Food Products (65%) and Entertainment (57.8%).
  • Northern states spend more on Health & Beauty Products while the South spent more on Books and Food Products
  • 2 metros took the top spot for the purchase of Fashion & Apparels
  • A very interesting and out-of-the-ordinary observation was that men shop more than women! Though the difference isn’t much, it’s quite shocking 🙂 (Note: this could also be because when couples shop together, the man pays the bill!)

This is just one illustration of the possibilities of Big Data analysis and we will try to cover more in the coming articles. Stay tuned!

Big Data: Getting Started with Hadoop, Sqoop & Hive
https://cloudacademy.com/blog/big-data-getting-started-with-hadoop-sqoop-hive/ (published 17 Mar 2014)

(Update) We’ve recently uploaded new training material on Big Data using services on Amazon Web Services, Microsoft Azure, and Google Cloud Platform on the Cloud Academy Training Library. On top of that, we’ve been busy adding new content on the Cloud Academy blog on how to best train yourself and your team on Big Data.


This is a guest post by 47Line Technologies

We live in the Data Age! The web has been growing rapidly in both size and scale during the last 10 years and shows no signs of slowing down. Statistics show that every passing year more data is generated than in all the previous years combined. Moore’s law not only holds true for hardware but for the data being generated too! Without wasting time coining a new phrase for such vast amounts of data, the computing industry decided to just call it, plain and simple, Big Data.

Apache Hadoop is a framework that allows for the distributed processing of such large data sets across clusters of machines. At its core, it consists of 2 sub-projects – Hadoop MapReduce and Hadoop Distributed File System (HDFS). Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. HDFS is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations. The logical question arises – How do we set up a Hadoop cluster?

Figure 1: MapReduce Architecture

 

Installation of Apache Hadoop 1.x

We will proceed to install Hadoop on 3 machines. One machine, the master, is the NameNode & JobTracker and the other two, the slaves, are DataNodes & TaskTrackers.
Prerequisites

  1. Linux as the development and production platform (Note: Windows is only a development platform; it is not recommended for production use)
  2. Java 1.6 or higher, preferably from Sun, must be installed
  3. ssh must be installed and sshd must be running
  4. From a networking standpoint, all the 3 machines must be pingable from one another

Before proceeding with the installation of Hadoop, ensure that the prerequisites are in place on all 3 machines. Update /etc/hosts on all machines so that they can refer to one another as master, slave1, and slave2.
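
For example, the host entries might look like the following (the IP addresses are placeholders for your own network):

# /etc/hosts on master, slave1, and slave2 (example addresses)
192.168.1.10   master
192.168.1.11   slave1
192.168.1.12   slave2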

Download and Installation
Download Hadoop 1.2.1. Installing a Hadoop cluster typically involves unpacking the software on all the machines in the cluster.
Configuration Files
The below mentioned files need to be updated:
conf/hadoop-env.sh
On all machines, edit the file conf/hadoop-env.sh to define JAVA_HOME to be the root of your Java installation. The root of the Hadoop distribution is referred to as HADOOP_HOME. All machines in the cluster usually have the same HADOOP_HOME path.
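
For instance (the JDK path below is only an example and depends on where Java is installed on your machines):

# conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-sun
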
conf/masters
Update this file on master machine alone with the following line:
master
conf/slaves
Update this file on master machine alone with the following lines:
slave1
slave2
conf/core-site.xml
Update this file on all machines:

<property>
	<name>fs.default.name</name>
	<value>hdfs://master:54310</value>
</property>

conf/mapred-site.xml
Update this file on all machines:

<property>
	<name>mapred.job.tracker</name>
	<value>master:54311</value>
</property>

conf/hdfs-site.xml
The default value of dfs.replication is 3. Since there are only 2 DataNodes in our Hadoop cluster, we update this value to 2. Update this file on all machines –

<property>
	<name>dfs.replication</name>
	<value>2</value>
</property>

After changing these configuration parameters, we have to format HDFS via the NameNode.

bin/hadoop namenode -format

We start the Hadoop daemons to run our cluster.
On the master machine,

bin/start-dfs.sh
bin/start-mapred.sh

Your Hadoop cluster is up and running!
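
As an optional sanity check (the output will vary with your setup), you can list the running Java daemons and ask HDFS for a cluster report:

# On the master you should see NameNode and JobTracker among the daemons.
jps

# Summarize configured capacity and the live DataNodes.
bin/hadoop dfsadmin -report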

Loading Data into HDFS

Data stored in databases and data warehouses within a corporate data center has to be efficiently transferred into HDFS. Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance. The input to the import process is a database table. Sqoop will read the table row-by-row into HDFS. The output of this import process is a set of files containing a copy of the imported table.

Figure 2: Sqoop architecture

Prerequisites

  1. A working Hadoop cluster

Download and Installation
Download Sqoop 1.4.4. Installing Sqoop typically involves unpacking the software on the NameNode machine. Set SQOOP_HOME and add it to PATH.
Let’s consider that MySQL is the corporate database. In order for Sqoop to work, we need to copy mysql-connector-java-<version>.jar into SQOOP_HOME/lib directory.
Import data into HDFS
As an example, a basic import of a table named CUSTOMERS in the cust database:

sqoop import --connect jdbc:mysql://db.foo.com/cust --table CUSTOMERS

On successful completion, a set of files containing a copy of the imported table is present in HDFS.
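
You can verify the import by listing the target directory in HDFS – by default a directory named after the table under your HDFS home directory – and peeking at one of the generated files (the file names shown are the usual Sqoop defaults, but may differ in your setup):

# The map-only import typically writes part-m-* files.
bin/hadoop fs -ls CUSTOMERS
bin/hadoop fs -cat CUSTOMERS/part-m-00000 | head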

Analysis of HDFS Data

Now that data is in HDFS, it’s time to perform analysis on the data and gain valuable insights.
In the early days, end users had to write MapReduce programs even for simple tasks like getting raw counts or averages. Hadoop lacks the expressiveness of popular query languages like SQL, and as a result users ended up spending hours (if not days) writing programs for typical analyses.

Enter Hive!

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Hive was created to make it possible for analysts with strong SQL skills (but meager Java programming skills) to run queries on the huge volumes of data to extract patterns and meaningful information. It provides an SQL-like language called HiveQL while maintaining full support for MapReduce. Any HiveQL query is divided into MapReduce tasks which run on the robust Hadoop framework.

Figure 3: Hive Architecture

Prerequisites

  1. A working Hadoop cluster

Download and Installation
Download Hive 0.11. Installing Hive typically involves unpacking the software on the NameNode machine. Set HIVE_HOME and add it to PATH.
In addition, you must create /tmp and /user/hive/warehouse (a.k.a. hive.metastore.warehouse.dir) and set them chmod g+w in HDFS before you can create a table in Hive. The commands are listed below –

$HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

Another important feature of Sqoop is that it can import data directly into Hive.

sqoop import --connect jdbc:mysql://db.foo.com/cust --table CUSTOMERS --hive-import

The above command creates a new Hive table named CUSTOMERS and loads it with the data from the corporate database. It’s time to gain business insights!
select count (*) from CUSTOMERS;

Apache Hadoop, along with its ecosystem, enables us to deal with Big Data in an efficient, fault-tolerant and easy manner!

Verticals such as airlines have been collecting data (e.g., flight schedules, ticketing inventory, weather information, online booking logs) that resides in disparate machines. Many of these companies do not yet have systems in place to analyze these data points at large scale. Hadoop and its vast ecosystem of tools can be used to gain valuable insights – from understanding customer buying patterns to upselling certain privileges to increase ancillary revenue – in order to stay competitive in today’s dynamic business landscape.
References

  • http://hadoop.apache.org/
  • http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
  • http://sqoop.apache.org/
  • http://hive.apache.org/

This is the first in a series of Hadoop articles. In subsequent posts, we will dive deeper into Hadoop and its related technologies. We will talk about real-world use cases and how we used Hadoop to solve these problems! Stay tuned.
