Data Catalogs and Distributed Querying are now available on #LetsData
Today, we are announcing the integration of the #LetsData S3 Write Connectors with AWS Glue (Crawlers, Data Catalog) and AWS Athena (Distributed Querying).
With these integrations, #LetsData will now:
Crawl Data: Automatically configure and run AWS Glue Crawlers on a dataset’s S3 files
Discover Schema: Discover the file data schema, partitioning and metadata
Manage Data Catalogs: Add the schema, partitioning and metadata to the AWS Glue Data Catalog
Data Governance: Automatically configure access and permissions for the catalog and S3 files
Data Warehousing: Enable Distributed Querying support via AWS Athena.
The details of the S3 Write Connector Data Catalog integrations are in our Data Catalog Docs.
Overview
The LetsData Write Destinations are categorized into either:
permanent, durable data stores (think S3, DynamoDB, vector indexes)
ephemeral data containers (such as streams, queues, etc.)
The permanent, durable data stores can be enabled for analytical workloads (OLAP). LetsData enables these by running AWS Glue Crawlers on the data stores to discover the data schema, partitioning and metadata. These are then added to a data catalog / metastore (the AWS Glue Data Catalog). Customers can run AWS Athena queries using the AWS Glue Data Catalog - these are distributed queries that enable on-demand, massively parallel processing of large datasets.
See the AWS Athena Docs to understand the AWS analytics services and their usage, and the Redshift Docs about Columnar Storage to understand the OLAP / OLTP differences.
Highlights
Highlights about LetsData's Data Catalog implementation:
Automatic Data Cataloging: LetsData can automatically catalog datasets that write to S3 destinations by running AWS Glue Crawlers on the S3 data files and creating databases and tables for the LetsData users.
Primitive Data Lake: Adding data to a Data Catalog, defining a permissioning model for access and enabling Distributed Queries essentially forms a Data Lake. (There are different terms one comes across when defining a data strategy for OLAP workloads - Data Warehouses, Data Lakes, Data Lakehouses etc. This external link explains the different terminology.) LetsData's data catalog is a primitive data lake (not a data warehouse).
Access and Permissioning: AWS Lake Formation is a permissioning model over the AWS Glue Data Catalog that vends automated temporary credentials to different services - Lake Formation has different user roles and allows for permission delegation and sharing as well. As of now, the LetsData permissioning model allows data catalog access to the data owner only; we do not offer permission sharing / delegation yet.
Architecture
Automated Data Catalog
Here is a high-level architecture of how the automated data catalog is implemented:
Distributed Querying
Here are the high-level steps to query a dataset's data catalog tables:
Details
Availability
Automated Data Catalog is currently available for the S3 Write Connectors only (S3, S3 Aggregate File, S3 Spark), and only for resourceLocation: LetsData. resourceLocation: Customer requires some additional architectural redesign which we've deferred for now.
Enable Config
To enable the automated Data Catalog, add the "addToCatalog": true attribute to the dataset's write connector config.
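For example, a minimal sketch of the write connector config (the existing S3 write connector attributes are elided and unchanged - addToCatalog is the only new attribute):

"writeConnector": {
    ...
    "addToCatalog": true
}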
Catalog Details
The dataset's initialization creates the catalog database and the glue crawler, and sets up the crawl configuration. The dataset's write connector node is updated with these catalog details - you'll need these in your Athena queries. Use the CLI datasets view command to view the dataset details. Here are the example details:
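The snippet below is an illustrative sketch of the writeConnector.catalog node, with values drawn from the examples later in this doc (your dataset's actual values will differ):

"writeConnector": {
    ...
    "catalog": {
        "awsGlueCrawlerName": "testcrawler5855ca1062420abf35ed85b8c6eda82a",
        "awsGlueCatalogName": "223413462631",
        "awsGlueDatabaseName": "testtenantd5feaf90-71a9-41ee-b1b9-35e4242c3155",
        "awsGlueTableNamePrefix": "b8e4b3e8...",
        "athenaQueryOutputPath": "s3://test-.../athena-queries/"
    }
}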
Here is what the different attributes in the configuration mean:
awsGlueCrawlerName: The name of the AWS Glue crawler that was created by the dataset. You can use the AWS CLI / SDK to query details about the crawler and its crawl runs. See additional details in the crawler section below.
awsGlueCatalogName: The name of the AWS Glue catalog that the crawler will store the schema in. This is currently set to the default AWS account catalog. You'll need this to query the catalog details about the tables and when running the actual Athena queries.
awsGlueDatabaseName: The name of the AWS Glue catalog database. Every LetsData user gets their own database, accessible only to them. The database is named <stack_name>tenant<tenant_id>. For example, a tenant with tenantId d5feaf90-71a9-41ee-b1b9-35e4242c3155 on the prod stack would be assigned the database prodtenantd5feaf90-71a9-41ee-b1b9-35e4242c3155.
awsGlueTableNamePrefix: The crawler sets up the crawl configuration and specifies the tableName prefix. The actual tableName(s) are obtained later by calling the AWS Glue GetTables API, using credentials from the dataset's customerAccessRoleArn and the awsGlueCatalogName / awsGlueDatabaseName as the catalog name / database name respectively.
athenaQueryOutputPath: To support querying without any additional setup requirements, we provision an S3 bucket path for each dataset configured for the catalog, for that dataset's Athena queries. You can run Athena queries with this S3 bucket path as the results output location. The contents of the path are deleted upon dataset deletion (so do save any results as needed).
Crawler Configuration
We've configured the crawler by default with settings that should work out of the box for most use cases. However, for those who need advanced options, here is how the crawler is set up:
re-crawl everything on each run
schema change policy of update in database / delete from database on updates / deletes to the schema
crawler lineage settings are disabled
tableGroupingPolicy set to combine compatible schemas (and create a single table)
If we should add classifier support / additional options out of the box, do let us know and we can enable these (support@letsdata.io). (See Crawler in the AWS Docs.)
Here is an example crawler configuration:
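(The snippet below is an illustrative sketch of the relevant settings as they would appear in the aws glue get-crawler output - the field names are AWS Glue Crawler API fields, and the crawler name / paths are the example values used elsewhere in this doc; your crawler's values will differ.)

"Crawler": {
    "Name": "testcrawler5855ca1062420abf35ed85b8c6eda82a",
    "DatabaseName": "testtenantd5feaf90-71a9-41ee-b1b9-35e4242c3155",
    "TablePrefix": "b8e4b3e8...",
    "Targets": { "S3Targets": [ { "Path": "s3://tldwc.../" } ] },
    "RecrawlPolicy": { "RecrawlBehavior": "CRAWL_EVERYTHING" },
    "SchemaChangePolicy": {
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "DELETE_FROM_DATABASE"
    },
    "LineageConfiguration": { "CrawlerLineageSettings": "DISABLE" },
    "Configuration": "{\"Version\":1.0,\"Grouping\":{\"TableGroupingPolicy\":\"CombineCompatibleSchemas\"}}"
}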
Running Athena Queries
Here are a few example commands that can be used to get details about a dataset's catalog and run Athena queries.
Access Credentials
Get dataset's access credentials:
We need the customerAccessRoleArn and the createDatetime from the dataset to be able to access the bucket files (the createDatetime is used as the role's external id):
CUSTOMER_ACCESS_ROLE_ARN=`./letsdata datasets view --datasetName <dataset_name> --prettyPrint 2> /dev/null | grep customerAccessRoleArn| sed 's/.* "//g'| sed 's/",//g'`
EXTERNAL_ID=`./letsdata datasets view --datasetName <dataset_name> --prettyPrint 2> /dev/null | grep createDatetime|sed 's/^.*: //g'|sed 's/,//g'`
Suppose that for the current dataset, the customerAccountForAccess is 308240606591, and that this account has an IAM admin user whose credentials are stored in the ~/.aws/credentials file as the profile IamAdminUser308240606591. Run the following AWS CLI command to get time-limited credentials and save them to the ~/.aws/credentials file as the stsassumerole profile:
output="/tmp/assume-role-output.json"
aws sts assume-role --role-arn $CUSTOMER_ACCESS_ROLE_ARN --external-id $EXTERNAL_ID --role-session-name 'AccessToDatasetCredentials' --profile IamAdminUser308240606591 > $output
cat $output
# output should have contents similar to the following
# {
# "Credentials": {
# "AccessKeyId": "ASIATIBD24ZTWCHOBDHH",
# "SecretAccessKey": "W09pNZg...",
# "SessionToken": "FwoG...",
# "Expiration": "2024-02-08T01:29:20+00:00"
# },
# "AssumedRoleUser": {
# "AssumedRoleId": "AROATIBD24ZTZWJR5CZEW..."
# }
# }
# copy these to the ~/.aws/credentials file (as the stsassumerole profile below)
cat ~/.aws/credentials
[IamAdminUser308240606591]
aws_region = us-east-1
aws_access_key_id = ASIATIBD24ZT3CRVCKVD
aws_secret_access_key = Bx8RVkAXfoGh5ERnL90ceftz5GJMWnxHx27uxTpS
[stsassumerole]
aws_region = us-east-1
aws_access_key_id = ASIATIBD24ZTWCHOBDHH
aws_secret_access_key = W09pNZg...
aws_session_token = FwoG...
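To sanity-check the assumed role credentials, you can (optionally) call sts get-caller-identity with the new profile - the returned Arn should be the assumed-role session for the dataset's customerAccessRoleArn:

aws sts get-caller-identity --profile stsassumerole --region us-east-1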
Crawler
Get the status of the crawler using the get-crawler CLI command, start a new crawl using the start-crawler command, and stop a running crawler using the stop-crawler command. (The crawler name is in the dataset config as the writeConnector.catalog.awsGlueCrawlerName attribute.)
aws glue get-crawler --name testcrawler5855ca1062420abf35ed85b8c6eda82a --profile stsassumerole --region us-east-1
aws glue start-crawler --name testcrawler5855ca1062420abf35ed85b8c6eda82a --profile stsassumerole --region us-east-1
aws glue stop-crawler --name testcrawler5855ca1062420abf35ed85b8c6eda82a --profile stsassumerole --region us-east-1
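The crawl runs asynchronously. A minimal way to check on progress is to filter the get-crawler output with a --query expression - the crawler State cycles through READY / RUNNING / STOPPING, and LastCrawl.Status reports whether the last crawl SUCCEEDED or FAILED (the sketch below reuses the example crawler name from above):

aws glue get-crawler --name testcrawler5855ca1062420abf35ed85b8c6eda82a --query 'Crawler.State' --output text --profile stsassumerole --region us-east-1
aws glue get-crawler --name testcrawler5855ca1062420abf35ed85b8c6eda82a --query 'Crawler.LastCrawl.Status' --output text --profile stsassumerole --region us-east-1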
Table Details
The table(s) are created when the crawler has run successfully. You can get the table details for the created tables using the get-tables CLI command. (The database name is in the dataset config as the writeConnector.catalog.awsGlueDatabaseName attribute.) Additionally, you can get the database, partition and partition index details as well.
aws glue get-tables --database-name testtenantd5feaf90-71a9-41ee-b1b9-35e4242c3155 --profile stsassumerole --region us-east-1
# This should return table details including the schema
# {
# "TableList": [
# {
# "Name": "b8e4b3e8...",
# "DatabaseName": "testtenant...",
# ...
# "StorageDescriptor": {
# "Columns": [
# {
# "Name": "language",
# "Type": "string"
# },
# ...
# ],
# "Location": "s3://tldwc.../",
# "InputFormat": "TextInputFormat",
# "OutputFormat": "HiveIgnoreKeyTextOutputFormat",
# "Compressed": true,
# "NumberOfBuckets": -1,
# "SerdeInfo": {
# "SerializationLibrary": "JsonSerDe",
# "Parameters": {
# "paths": "samples,percentile,language"
# }
# },
# "BucketColumns": [],
# "SortColumns": [],
# "Parameters": {
# "sizeKey": "9953",
# "objectCount": "2",
# ...
# },
# "StoredAsSubDirectories": false
# },
# "PartitionKeys": [
# {
# "Name": "partition_0",
# "Type": "string"
# }
# ],
# "TableType": "EXTERNAL_TABLE",
# "Parameters": {
# "sizeKey": "9953",
# "objectCount": "2",
# ...
# },
# ...
# }
# ]
# }
#
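Since the table name(s) are generated by the crawler (prefixed with the awsGlueTableNamePrefix), it can be convenient to capture the table name into a shell variable for the Athena query below - a small sketch, assuming a single table was created for the dataset:

TABLE_NAME=`aws glue get-tables --database-name testtenantd5feaf90-71a9-41ee-b1b9-35e4242c3155 --query 'TableList[0].Name' --output text --profile stsassumerole --region us-east-1`
echo $TABLE_NAME
# b8e4b3e8...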
Run Athena Query
Using the schema from get-tables, construct a query to run. You can run the query using the start-query-execution CLI command. (The database name and catalog name are in the dataset config as the writeConnector.catalog.awsGlueDatabaseName and writeConnector.catalog.awsGlueCatalogName attributes. Additionally, use the writeConnector.catalog.athenaQueryOutputPath to store your results.)
# run the query
aws athena start-query-execution --query-string \
"SELECT language, COUNT(*) as recs \
FROM b8e4b3e... \
GROUP BY language \
ORDER BY recs desc" \
--query-execution-context Database=testtenant...,Catalog=223413462631 \
--result-configuration OutputLocation=s3://test-.../athena-queries/ \
--profile stsassumerole --region us-east-1
# {
# "QueryExecutionId": "8ce557cd-39eb-414d-8cbd-104a1223b009"
# }
# use the query execution id to get the query execution details
aws athena get-query-execution --query-execution-id 8ce557cd-39eb-414d-8cbd-104a1223b009 --profile stsassumerole --region us-east-1
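# (optional) the query runs asynchronously - poll the query state (QUEUED / RUNNING / SUCCEEDED / FAILED / CANCELLED) until it is SUCCEEDED before fetching the results
aws athena get-query-execution --query-execution-id 8ce557cd-39eb-414d-8cbd-104a1223b009 --query 'QueryExecution.Status.State' --output text --profile stsassumerole --region us-east-1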
# use the query execution id to get the query results
aws athena get-query-results --query-execution-id 8ce557cd-39eb-414d-8cbd-104a1223b009 --profile stsassumerole --region us-east-1
# list the athenaQueryOutputPath and download the results as needed
aws s3 ls s3://test-.../commoncrawladdtocatalog19/ --profile stsassumerole --region us-east-1
# copy the results file from S3
aws s3 cp s3://test-.../athena-queries/8ce557cd-39eb-414d-8cbd-104a1223b009.csv . --profile stsassumerole --region us-east-1
Conclusions
With automated AWS Glue integrations and AWS Athena enablement:
we've defined primitive Data Cataloging, Data Warehousing and Data Governance capabilities on #LetsData
the AWS Glue Data Catalog enables a large number of AWS data scenarios such as Redshift data warehousing, EMR processing and QuickSight analytics, to name a few
This is a huge space and while what we've built barely scratches the surface, it is quite powerful. We'd love to hear about customer scenarios around Data Cataloging, Governance and Data Warehousing and how #LetsData may help solve some of these challenges.