26 Nov Connect CSV Data in S3 Bucket with Athena Database
Introduction
Athena is a serverless query service that allows you to run SQL queries on your data stored in S3. It can easily query large datasets (or Data Lakes), whether they are structured, unstructured, or semi-structured data. A database in Athena is a logical grouping for tables you create in it.
In this short guide, you will learn how to connect to CSV data in S3 bucket with Athena database. Let’s dive right in.
The AWS glue crawler will help to catalog the table and give it the right schema. The steps are highlighted below.
Step 1: Open the Athena database from your AWS console and toggle to the database you want to create the table on. Click on Create table.
Step 2: Click on “from AWS Glue Crawler”. The “from S3 bucket data” is the manual process where you will enter the data columns manually.
After you have clicked on the from AWS Glue Crawler you will be transferred to the AWS Glue platform for the crawler definition.
Step 3: Input the name of the crawler and press the Next button.
Step 4: Leave every entry at the default and enter the S3 bucket directory where the CSV file is located. Ensure you add a forward slash after the directory definition.
Step 5: Choose “No” for Add another data store.
Step 6: On the IAM role section, make sure you create a new role for the crawler and press the “Next” button.
Step 7: After you have created the crawler, locate the name of the crawler. Select the crawler and click on run Crawler. The crawler will then create a table from the CSV file in the S3 bucket.
NOTE: A comma (,) at any point in a row of a CSV in the s3 bucket will surely disrupt the Athena data table, so make sure you replace any comma symbol in all rows of the CSV files.
As a culture, I like to replace comma symbols with spaces. In this way, Athena will not return a bad table and will not split up any rows using the comma.
Below is the original data from EXCEL.
The picture below shows the CSV file after being ingested from the S3 bucket to Athena Table. As you can see the data was returned as unstructured data in the Athena database.
Resolution: Replace comma (,) symbols in all rows of the CSV files, then bring the data back into the S3 bucket. Then this will be perfectly provision by Athena Database because of the absence of comma symbols.
That’s pretty much it. Follow the step-by-step guide to connect your CSV data in S3 to Athena database. Share your thoughts in the comment box below.
No Comments