Big SQL Analytics: Redshift Webinar Q&A Summary
I recently conducted a webinar on Redshift as part of our Big SQL tools investigation series. At the end of the webinar, we had some pretty interesting questions and answers. I thought it would be a good idea to summarize the audience feedback in a blog. It would be helpful to view the webinar before reading the rest of the blog. You can access the webinar here.
Q1: How can you schedule and set up secure uploads?
Let me address the security question first. Amazon offers the ability to a create a Virtual Private Cloud (VPC) to allow you to carve off one or multiple Amazon cloud services (including Redshift) and make them part of your network. In addition, you can encrypt data transfers, data at rest inside Redshift, and data backups.
Scheduling data uploads and moving data over to Amazon Redshift can occur in one of two ways:
- Using an ETL Tool via an ODBC Connection: This method is relatively easy to setup but leads to slow performance.
- Move Flat Files via S3: The fastest way to move data is by dumping your internal data out into flat files, encrypting those files with GZIP, and moving them over to an Amazon S3 service. S3 is Amazon’s Simple Storage Server, and it’s a service that allows you to store up to 2 terabytes of data for free. Once you get the files over there, you can use the Copy command to bulk load the file directly to Redshift with GZIP option (so Redshift will un-compress them for you).
Both options above can be managed and maintained using an ETL tool.
Q2: If healthcare data is housed and breached at Amazon, what happens next? Who’s liable? How strong is security?
We are not HIPPA experts, and we wouldn’t be able to answer the liability question. However, regarding strength of security, Redshift does allow encryption of data, in terms of communication to Redshift as well as encryption of data at rest. In terms of how strong security is, that question can apply to a lot of cloud services out there. For healthcare data, that would require deeper investigation beyond the scope of what we’ve done so far.
Q3: Can you mention what the width of the multi-billion record data set is?
We used a narrow table. It has four dimension keys and one metric. The four keys are integers, and the metric is the decimal. It’s not very large, but it’s representative of what a fact table would do, and given the extreme performance, and the fact that we only did this on the two and four nodes, we could always increase the number of nodes if we are dealing with a wider table or allow for more time to process it.
Q4: What is the annual cost or using 2 terabytes per year? Can I schedule secure auto data loads into a database remotely?
Two terabytes make up one node, and if you do the three year commitment, your cost would be $2000/yr. If you do by the hour, one node would be 85 cents/hr. The $2000 includes front cost per terabyte and on-going usage cost. And yes, you can schedule secure auto data loads as explained earlier.
Q5: What are the top three benefits Redshift has over Vertica?
We haven’t done a POC on Vertica. Vertica and Redshift are both columnar databases. Some experts say Vertica has problems updating data, and it seems that Redshift has overcome those issues. Redshift is also very easy to maintain and launch because it’s cloud-based. We have several tools that we’re evaluating as part of our big SQL series; Vertica is not currently on our roadmap, but we could look into it.
Q6: Does the platform use indexes?
This is the benefit of this tool. You don’t have to build any indexes – in fact, you can’t build indexes. As part of design, you may have to set up your short keys and your join keys, but you don’t have to worry about indexes. The service does it in the background for you.
Q7: Does Redshift have any dashboard capabilities?
Redshift is not a dashboarding tool. Redshift is a database tool. You connect it to a dashboard either via native connector such as what Tableau offers or via an ODBC connection if you didn’t have a native data connector. Over ODBC connection, it’s a little slower. Keep in mind that when you’re running analytics against Redshift, you’re returning aggregated data, so most of the work is done on the server versus on your client tool.
Q8: Since SPs are not allowed, how do you defend against SQL injection attacks?
The way you’re running the scripts is via control ETL mechanism. You wouldn’t use Redshift for e-commerce sites or other public internet sites. You’re using it as an analytics system mostly internal to your organization. Maybe you can even surface it to external clients but you have to do it via a tool like Tableau or something else, and those would protect against SQL injections by default. But this system/data is not designed to support OLTP-style applications.
Contact us at email@example.com if are interested in finding out more.
Bernard Wehbe is responsible for operational strategy, account management and the management of day-to-day company activities at StatSlice Systems. He has over twelve years of consulting experience focused exclusively on data warehousing and business intelligence. His experience includes data warehousing architecture, OLAP, data modeling, ETL, reporting and dashboarding, business analytics, team leadership, and project management. His business systems expertise includes but is not limited to Consulting Services, Financial Services, Retail, Transportation, Manufacturing, Telecom, and Online Subscription. Bernard received a Bachelor of Science as well as a Master of Science in Industrial Engineering & Management from Oklahoma State University.
You can subscribe to our RSS feed.