Big SQL Analytics – Cloudera Using Impala
This is an overview of the webinar on Cloudera using Impala presented by Brett Neuman on October 22, 2013. For more information on this webinar, click here to view the video, or continue reading below.
Overview
In this webinar, Brett gives you an overview of Cloudera’s Impala platform and how you can use it in your business. He presents a customer case study, compares query results from Impala with Redshift, follows up with the limitations the team found and what appeals to us about Impala, and briefly discusses architecture implications.
For those who are unfamiliar with Hadoop, it is an open-source big data platform that processes huge amounts of data in parallel across clusters of inexpensive, standardized hardware. Hadoop uses the Hadoop Distributed File System (HDFS) to automatically distribute data across multiple nodes.
Impala is a SQL engine written by Cloudera that works within the existing Hadoop platform. Impala was designed mainly for analytical purposes and is highly scalable. It runs directly within Hadoop and does not require any specialized hardware. It was written in C++ instead of Java for the best performance, and it can read data directly from HDFS, bypassing the data nodes, for increased query performance. Because it does not use MapReduce, it can be hundreds of times faster than Hive.
Impala itself is a fully distributed system – meaning that you can start a query on any node and the executor (or coordinator node) will delegate the processing and pulling of data to all necessary nodes. You can submit queries using existing query tools via ODBC or JDBC. In this webinar, we look at querying Impala via the Hue web interface.
Impala supports text files, whether CSV or fixed width, and it can use LZO compression for better use of hard drive space. It also supports Parquet files, a new columnar storage format. Parquet was not included in this investigation because it is still in the alpha stages of development, but it promises to be a very interesting feature that we will look into in the future.
What separates Impala from Redshift is that it uses a big-data platform to store its data. When you create a table, you can point that table directly at a file stored on HDFS, and Impala will treat it like a populated table – no insert command needed. So the time it takes to populate a table is just the time it takes to upload the file. Also, since the files are stored on HDFS, they are already replicated across all available nodes, which is how query performance is greatly increased.
Hive’s metadata manager treats tables as folders: one folder equals one table. A folder can hold multiple files, and as long as they all have the same format and structure, the data in every file in the folder is automatically appended together. This makes it very easy to append data on a daily basis: simply upload the new data in a new file, and it’s ready for analysis.
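As an illustration, here is a minimal sketch of what that looks like in practice. The table, column, and folder names are hypothetical and are only meant to show the pattern:

```sql
-- Point a table at an existing HDFS folder; no INSERT is needed.
-- Impala/Hive will read whatever files are already in the folder.
CREATE EXTERNAL TABLE sales (
  sale_id   BIGINT,
  store_id  INT,
  sale_date STRING,
  amount    DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/demo/sales';

-- Appending data is just a matter of dropping another file with the same
-- structure into the folder (e.g. hdfs dfs -put sales_new.csv /user/demo/sales/).
-- A metadata refresh may be needed before Impala sees the new file:
REFRESH sales;
```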
Big Data VS Big SQL
Many companies, both small and large, are dealing with increasingly large datasets. This data can come from legacy systems or from large datasets (as we show in the webinar) that can be used to provide very effective predictive analytics. What do companies do with historical data? They may use compressed SQL database backups or tape drives; however, you can’t readily access the data from those systems.
The best course of action is to store your data where you can access it. One option is a SQL Server database with plenty of storage, but you’ll have to hire DBAs to tune and take care of the system if you ever want to use it. Big Data could be the answer, but you have to learn to use MapReduce to access your data. You could use more traditional MPP (massively parallel processing) appliances like Teradata, Greenplum, or Netezza, but these can be prohibitively expensive.
We think the answer could be to take advantage of a big-data style storage mechanism with a SQL-based tool on the front end. This is why we’ve been investigating Impala – to see if Cloudera has come up with a viable Big SQL option. Like Redshift, Impala scales linearly. As the number of nodes in the cluster grows, you can see a comparable increase in query performance. This should be expected from all Big SQL platforms.
Installing Cloudera
The largest difference between Cloudera and Redshift is that you have to install Cloudera yourself. Redshift is a cloud-based Big SQL platform that you simply sign up for, and it’s ready to go. Cloudera has made installing Hadoop almost as simple, except that you have to use your own hardware. For the webinar demonstration, we used Amazon EC2 instances to build our cluster. We did this for two reasons. First, Cloudera integrates with EC2 very effectively, meaning that we could set up a test cluster and only pay for the time the cluster was turned on instead of buying new hardware. Second, because Amazon was used to demonstrate both Redshift and Impala, we can show you a direct price comparison.
To install Cloudera, you will first need an instance for the manager. You can put the manager on a relatively cheap node like an m1.large. It does require some very simple Linux commands to download and install the manager, so you do need to know a little about Linux. Once the manager is installed, adding hosts and configuring your cluster is very easy. Cloudera provides a “Cloud Express Wizard” which will pop up as soon as you log onto the manager for the first time. The wizard will recognize that it’s been installed on an EC2 instance and give you prompts for adding nodes to your existing EC2 account. All you have to do is input the number of desired nodes and give the system your AWS credentials for access. In a few minutes, your cluster will be up and running. You may refer to the webinar for a demonstration of the Cloudera Manager interface.
POC Statistics: Insert
When we tested the insert commands, we found some very interesting behavior in Impala. Our insert statement used a cross join between three tables to create a 1.2B-record dataset. Since cross joins are not supported, Brett created a fake ID field and set its value to 1 for all rows in all three tables. Joining on that field, he was able to get the same behavior as a cross join. However, inserting the results from that query produced unexpected results. It took almost 30 minutes to insert 1.2B records into that table on a 4 node cluster. When he then performed the same query on a 2 node cluster, the timing was nearly identical. We expected the process to take an hour, but that wasn’t the case. It appears that when we insert records into a table, Hive writes the data to a single file on one HDFS data node; once the data is written, the file is replicated across the nodes. This could mean that either inserts do not benefit from the parallel architecture, or Impala is simply not optimized for inserts.
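For reference, here is a rough sketch of the cross-join workaround described above; the table and column names are hypothetical and only illustrate the pattern:

```sql
-- Cross joins were not supported, so every source table carries a constant
-- fake_id column set to 1 on every row. Joining on that column produces the
-- same row combinations a cross join would.
INSERT INTO big_table
SELECT a.col1, b.col2, c.col3
FROM   table_a a
JOIN   table_b b ON a.fake_id = b.fake_id
JOIN   table_c c ON a.fake_id = c.fake_id;
```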
| Test | Large Cluster (4 Nodes) | Small Cluster (2 Nodes) | Redshift (4 Nodes) |
| --- | --- | --- | --- |
| Insert 1.2B Records | 29.51m | 29.86m | 4.41m |
| Append 1.2B Records | 29.57m | 29.95m | 4.65m |
| Update 10% of 2.4B Records | 17.14m | 39.04m | 1.56m |
| Delete 10% of 2.4B Records | 15.17m | 32.71m | 0.46m |
We attempted to research this behavior but were unable to find any documentation from Cloudera explaining it. What we did find is that Impala can only insert data into tables that use the TEXT or Parquet formats. For other file formats, you must insert the data using Hive and use Impala to query it. It seems that Impala acts like a Big Data engine in this respect, meaning that it wants you to upload new files to the table folders in order to append data to a table. Cloudera does make it rather easy to do this, but the process may require a small bit of retooling. In future webinars, we intend to explore different methods of inserting data into Cloudera and determine the best approach, whether it is Sqoop, Pig, or MapReduce.
In addition to inserts taking a long time, appending data to a table yields similar results. The update test is also interesting. There isn’t an UPDATE command, of course; instead you must use an INSERT OVERWRITE statement. Impala seems to read the table from all available data nodes and then write to a temporary table. Because it can read from multiple nodes at once, the INSERT OVERWRITE command does take less time, but the results are still not very promising (most likely due to the slow write). Deletions are very similar to updates. In fact, Impala doesn’t actually support deleting single records from a table. Instead, you have to use the INSERT OVERWRITE command again with a clever WHERE clause to create the result set you want.
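To make the update and delete workarounds concrete, here is a hedged sketch using hypothetical table and column names; the actual statements used in the webinar may have differed:

```sql
-- There is no DELETE: "deleting" rows means rewriting the data with everything
-- you want to keep, so the WHERE clause excludes the rows being "deleted".
INSERT OVERWRITE TABLE sales_clean
SELECT *
FROM   sales
WHERE  store_id <> 42;

-- An UPDATE works the same way: rewrite every row and apply the change inline.
INSERT OVERWRITE TABLE sales_adjusted
SELECT sale_id,
       store_id,
       sale_date,
       CASE WHEN store_id = 42 THEN amount * 1.1 ELSE amount END AS amount
FROM   sales;
```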
In our opinion, this almost takes Impala out of the running as a Big SQL platform. If it weren’t for its query performance, we wouldn’t see the merits of using Impala at all.
POC Statistics: Select
Our select-statement tests are the exact same tests that we ran on the Redshift cluster, and we ran them on both a 4 node and a 2 node cluster. The results come back in seconds rather than the minutes the insert statements took, which is far more promising.
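To give a flavor of the workload, here is a representative query of the kind listed in the table below; the table and column names are hypothetical rather than the actual test schema:

```sql
-- "Sum Rows, Join to Second Table, Where Clause, Like Condition"-style query
SELECT SUM(f.amount) AS total_amount
FROM   big_fact_table f
JOIN   dim_product  p ON f.product_id = p.product_id
WHERE  p.product_name LIKE '%widget%'
  AND  f.sale_date BETWEEN '2012-01-01' AND '2012-12-31';
```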
| Test | Large Cluster (4 Nodes) | Small Cluster (2 Nodes) | Redshift (4 Nodes) |
| --- | --- | --- | --- |
| Sum all Rows | 24.32s | 42.83s | 21.87s |
| Sum Rows with Where Clause | 27.84s | 52.86s | 31.57s |
| Sum Rows with Where Clause, Between Condition | 28.27s | 54.39s | 38.53s |
| Sum Rows, Join to Second Table, Where Clause, Like Condition | 35.96s | 66.61s | 45.65s |
| Sum Rows, Join to Second Table, Where Clause, Like and Or Condition | 39.94s | 72.53s | 70.6s |
| Group by, Sum Rows, Join to Second Table, Where Clause with Advanced Like Condition | 40.86s | 72.76s | 52.71s |
There is certainly linear scalability at play. Most queries on the 4 node cluster took between 25 and 40 seconds while the same queries took 40 to 70 seconds on the 2 node cluster. Looking at the “large cluster” queries as compared to the Redshift queries, you can see that, for the most part, Impala might actually be faster. On average, Impala is about 10 seconds faster in this particular test. The results may vary, but you can get a feel for how Impala will perform in some situations.
Price Comparison: Amazon Redshift VS Cloudera in EC2
Because we were able to install Cloudera on an Amazon EC2 cluster, we can compare monthly costs between the two systems fairly easily. First, Redshift comes to about $1,000 per terabyte per year, while a single-node Cloudera cluster is slightly more than that, mainly due to the overhead of the manager node. When we looked at a 4 node cluster, we saw a little more separation in price, but only in the up-front costs of reserved instances. A 4 node, three-year reserved Redshift instance will cost you $12,000 on day one, as opposed to about $10,500 for a comparably sized Cloudera cluster.
In the end, cost won’t matter enough to help you make your decision here. Cost would only be a large factor when you’re looking at what it would cost to install Cloudera on your own hardware. You would then have to consider depreciation of assets (and all the fun accounting that goes along with that).
Architecture Implications: Redshift Comparison
In the end, we don’t think that Impala can compete head-to-head with Redshift. When we think of Big SQL, we want a tool that can completely replace a traditional relational database, not just parts of one. There’s practically no way of avoiding a bit of retooling on your team if you want to incorporate Impala. You would probably have to look to MapReduce or Pig to load data in parallel before you can perform your analytics. Due to that limitation, Impala can only complement a traditional relational database system in your architecture – not replace it.
Luckily, Hadoop is an open-source platform, and Cloudera is an excellent software company. We look forward to seeing how Impala evolves over the next few years.
Impala Limitations
There are currently numerous limitations. Insert, update, and delete are too temperamental, which is a pretty big loss. Also, table order in a FROM clause matters for optimization: big tables have to come first or the SQL engine will perform poorly. There are also no cross joins. While these are small issues on their own, they illustrate the fact that Impala is still in the early stages of development. There are many other SQL language elements that Impala does not implement yet, but we won’t go into those as they are prone to change over time. You can find a complete list on Cloudera’s website.
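As a quick illustration of the join-order sensitivity described above, here is a sketch with hypothetical table names:

```sql
-- Impala at this stage does not reorder joins, so the largest table should be
-- listed first in the FROM clause.
SELECT d.region, SUM(f.amount) AS total_amount
FROM   big_fact_table f                          -- billions of rows: list it first
JOIN   small_dim_table d ON f.store_id = d.store_id
GROUP BY d.region;
-- Listing small_dim_table first would push the engine toward a much worse plan.
```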
One more very important limitation is its lack of stability. If one of the Impala nodes goes down during a query, the entire query will fail. This is due to the way Impala delegates its queries: whichever node you connect to in order to run the query becomes the coordinator node and delegates tasks to the other nodes. If one of those nodes fails, the coordinator doesn’t have a backup ready to take over that task. Cloudera says that they traded stability for performance here. All it really means is that Impala is still very young. Cloudera has said that they are looking into adding automatic failover in future releases.
Impala’s Appealing Features
Impala shows a lot of promise. Appending data by uploading a new file to a folder is quick and easy; the data is ready almost immediately and is appended much faster than a SQL insert. Linear scalability on queries is confirmed, which is a necessity for any Big SQL platform. Impala is also inexpensive compared to more traditional MPP appliances. Moreover, Cloudera Manager provides a very easy interface for managing your cluster, making it fairly simple to add, remove, and manage nodes compared to installing Hadoop manually and administering it from the command line. Another important feature is that you can install Cloudera on any hardware, which puts more control back in your hands and can increase the cost savings depending on the hardware you have on hand.
The Final Word
If you’re going with this implementation, then you are going with Hadoop and not SQL. Impala will require a little bit of retooling to use completely. If you have not already decided to go with Hadoop, there is no compelling reason to add Impala to your architecture yet. It is best used for analytics – not predictions. If you want to stay with a SQL platform, Redshift is currently the better option.
Brett Neuman
Brett Neuman has been developing business intelligence strategies for more than eight years at several Fortune 500 companies. He has implemented data warehouse solutions and analytics strategies for many verticals including Health Care, Finance, Retail and Marketing. He has a strong background in application development, data warehouse architecture, data modeling, ETL, OLAP, reporting and dashboarding utilizing both the Business Objects and the Microsoft suite of Business Intelligence platforms. Brett joined StatSlice for the opportunity to empower organizations to make more effective business decisions through the availability of accurate, mission-critical information.