Athena is a serverless query service that allows you to run SQL queries on your data stored in S3. It can easily query large datasets (or Data Lakes), whether they are structured, unstructured, or semi-structured data. A database in Athena is a logical grouping for tables you create in it.
In this short guide, you will learn how to connect to CSV data in S3 bucket with Athena database. Let’s dive right in.
The AWS glue crawler will help to catalog the table and give it the right schema. The steps are highlighted below.
After you have clicked on the from AWS Glue Crawler you will be transferred to the AWS Glue platform for the crawler definition.
NOTE: A comma (,) at any point in a row of a CSV in the s3 bucket will surely disrupt the Athena data table, so make sure you replace any comma symbol in all rows of the CSV files.
As a culture, I like to replace comma symbols with spaces. In this way, Athena will not return a bad table and will not split up any rows using the comma.
Below is the original data from EXCEL.
The picture below shows the CSV file after being ingested from the S3 bucket to Athena Table. As you can see the data was returned as unstructured data in the Athena database.
Resolution: Replace comma (,) symbols in all rows of the CSV files, then bring the data back into the S3 bucket. Then this will be perfectly provision by Athena Database because of the absence of comma symbols.
That’s pretty much it. Follow the step-by-step guide to connect your CSV data in S3 to Athena database. Share your thoughts in the comment box below.