Data Catalogs and Distributed Querying are now available on #LetsData
Today, we are announcing the integration of the #LetsData S3 Write Connectors with AWS Glue (Crawlers, Data Catalog) and AWS Athena (Distributed Querying).
With these integrations, #LetsData will now:
Crawl Data: Automatically configure and run AWS Glue Crawlers on a dataset’s S3 files
Discover Schema: Discover the file data schema, partitioning and metadata
Manage Data Catalogs: Add the schema, partitioning and metadata to the AWS Glue Data Catalog
Data Governance: Automatically configure access and permissions for the catalog and S3 files
Data Warehousing: Enable Distributed Querying support via AWS Athena.
The details of the S3 Write Connector Data Catalog integrations are in our Data Catalog Docs.
Overview
The LetsData Write Destinations are categorized into either:
permanent, durable data stores (think S3, DynamoDB, vector indexes)
ephemeral data containers (such as streams, queues, etc.)
The permanent, durable data stores can be enabled for analytical workloads (OLAP). LetsData enables these by running AWS Glue Crawlers on the data stores to discover the data schema, partitioning and metadata. These are then added to a data catalog / metastore (the AWS Glue Data Catalog). Customers can run AWS Athena queries using the AWS Glue Data Catalog - these are distributed queries that enable on-demand, massively parallel processing of large datasets.
See the AWS Athena Docs to understand the AWS analytics services and their usage, and the Redshift Docs about Columnar Storage to understand the OLAP / OLTP differences.
Highlights
Highlights about LetsData's Data Catalog implementation:
Automatic Data Cataloging: LetsData can automatically catalog datasets that write to S3 destinations by running AWS Glue Crawlers on the S3 data files and creating databases and tables for the LetsData users.
Primitive Data Lake: Adding data to a Data Catalog, defining a permissioning model for access and enabling Distributed Queries essentially forms a Data Lake. (There are different terms one comes across when defining a data strategy for OLAP workloads - Data Warehouses, Data Lakes, Data Lakehouses etc. This external link explains the different terminology.) LetsData's data catalog is a primitive data lake (not a data warehouse).
Access and Permissioning: AWS Lake Formation is a permissioning model over the AWS Glue Data Catalog that vends automated temporary credentials to different services - Lake Formation has different user roles and allows for permission delegation and sharing as well. As of now, the LetsData permissioning model allows data catalog access to the data owner only; we do not offer permission sharing / delegation yet.
Architecture
Automated Data Catalog
Here is a high-level architecture of how the automated data catalog is implemented:
Distributed Querying
Here are the high-level steps to query a dataset's data catalog tables:
Details
Availability
Automated Data Catalog is currently available for the S3 Write Connectors only (S3, S3 Aggregate File, S3 Spark), and only for resourceLocation: LetsData. resourceLocation: Customer requires some additional architectural redesign which we've deferred for now.
Enable Config
To enable the automated Data Catalog, add the "addToCatalog": true attribute to the dataset's write connector config.
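For example, a minimal sketch of the write connector config (the existing S3 write connector attributes are elided and unchanged - addToCatalog is the only new attribute):

"writeConnector": {
    ...
    "addToCatalog": true
}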
Catalog Details
The dataset's initialization creates the catalog database and the glue crawler, and sets up the crawl configuration. The dataset's write connector node is updated with these catalog details - you'll need these in your Athena queries. Use the CLI datasets view command to view the dataset details. Here are the example details:
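The snippet below is an illustrative sketch of the writeConnector.catalog node, with values drawn from the examples later in this doc (your dataset's actual values will differ):

"writeConnector": {
    ...
    "catalog": {
        "awsGlueCrawlerName": "testcrawler5855ca1062420abf35ed85b8c6eda82a",
        "awsGlueCatalogName": "223413462631",
        "awsGlueDatabaseName": "testtenantd5feaf90-71a9-41ee-b1b9-35e4242c3155",
        "awsGlueTableNamePrefix": "b8e4b3e8...",
        "athenaQueryOutputPath": "s3://test-.../athena-queries/"
    }
}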
Here is what the different attributes in the configuration mean:
awsGlueCrawlerName: The name of the AWS Glue crawler that was created by the dataset. You can use the AWS CLI / SDK to query details about the crawler and its crawl runs. See additional details in the crawler section below.
awsGlueCatalogName: The name of the AWS Glue catalog that the crawler will store the schema in. This is currently set to the default AWS account catalog. You'll need this to query the catalog details about the tables and when running the actual Athena queries.
awsGlueDatabaseName: The name of the AWS Glue catalog database. Every LetsData user gets their own database, accessible only to them. The database is named <stack_name>tenant<tenant_id>. For example, a tenant with tenantId d5feaf90-71a9-41ee-b1b9-35e4242c3155 on the prod stack would be assigned the database prodtenantd5feaf90-71a9-41ee-b1b9-35e4242c3155.
awsGlueTableNamePrefix: The crawler sets up the crawl configuration and specifies the tableName prefix. The actual tableName(s) are obtained later by calling the AWS Glue GetTables API, using credentials from the dataset's customerAccessRoleArn and the awsGlueCatalogName / awsGlueDatabaseName as the catalog name / database name respectively.
athenaQueryOutputPath: To support querying without any additional setup requirements, we provision an S3 bucket path for each dataset configured for the catalog, for that dataset's Athena queries. You can run Athena queries with this S3 bucket path as the results output location. The contents of the path are deleted upon dataset deletion (so do save any results as needed).
Crawler Configuration
We've configured the crawler by default with settings that should work out of the box for most use cases. However, for those who need advanced options, here is how the crawler is set up:
re-crawl everything on each run
schema change policy of update in database / delete from database on updates / deletes to the schema
crawler lineage settings are disabled
tableGroupingPolicy set to combine compatible schemas (and create a single table)
If we should add classifier support / additional options out of the box, do let us know and we can enable these (support@letsdata.io). (See Crawler in the AWS Docs.)
Here is an example crawler configuration:
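(The snippet below is an illustrative sketch of the relevant settings as they would appear in the aws glue get-crawler output - the field names are AWS Glue Crawler API fields, and the crawler name / paths are the example values used elsewhere in this doc; your crawler's values will differ.)

"Crawler": {
    "Name": "testcrawler5855ca1062420abf35ed85b8c6eda82a",
    "DatabaseName": "testtenantd5feaf90-71a9-41ee-b1b9-35e4242c3155",
    "TablePrefix": "b8e4b3e8...",
    "Targets": { "S3Targets": [ { "Path": "s3://tldwc.../" } ] },
    "RecrawlPolicy": { "RecrawlBehavior": "CRAWL_EVERYTHING" },
    "SchemaChangePolicy": {
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "DELETE_FROM_DATABASE"
    },
    "LineageConfiguration": { "CrawlerLineageSettings": "DISABLE" },
    "Configuration": "{\"Version\":1.0,\"Grouping\":{\"TableGroupingPolicy\":\"CombineCompatibleSchemas\"}}"
}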
Running Athena Queries
Here are a few example commands that can be used to get details about a dataset's catalog and run Athena queries.
Access Credentials
Get dataset's access credentials:
We need the customerAccessRoleArn and the createDatetime from the dataset to be able to access the bucket files (the createDatetime is used as the role's external id):
CUSTOMER_ACCESS_ROLE_ARN=`./letsdata datasets view --datasetName <dataset_name> --prettyPrint 2> /dev/null | grep customerAccessRoleArn| sed 's/.* "//g'| sed 's/",//g'`
EXTERNAL_ID=`./letsdata datasets view --datasetName <dataset_name> --prettyPrint 2> /dev/null | grep createDatetime|sed 's/^.*: //g'|sed 's/,//g'`
Suppose that for the current dataset, the customerAccountForAccess is 308240606591, and that this account has an IAM admin user whose credentials are stored in the ~/.aws/credentials file as the profile IamAdminUser308240606591. Run the following AWS CLI command to get time-limited credentials and save them to the ~/.aws/credentials file as the stsassumerole profile:
output="/tmp/assume-role-output.json"
aws sts assume-role --role-arn $CUSTOMER_ACCESS_ROLE_ARN --external-id $EXTERNAL_ID --role-session-name 'AccessToDatasetCredentials' --profile IamAdminUser308240606591 > $output
cat $output
# output should have contents similar to the following
# {
# "Credentials": {
# "AccessKeyId": "ASIATIBD24ZTWCHOBDHH",
# "SecretAccessKey": "W09pNZg...",
# "SessionToken": "FwoG...",
# "Expiration": "2024-02-08T01:29:20+00:00"
# },
# "AssumedRoleUser": {
# "AssumedRoleId": "AROATIBD24ZTZWJR5CZEW..."
# }
# }
# copy these to the ~/.aws/credentials file (as the stsassumerole profile below)
cat ~/.aws/credentials
[IamAdminUser308240606591]
aws_region = us-east-1
aws_access_key_id = ASIATIBD24ZT3CRVCKVD
aws_secret_access_key = Bx8RVkAXfoGh5ERnL90ceftz5GJMWnxHx27uxTpS
[stsassumerole]
aws_region = us-east-1
aws_access_key_id = ASIATIBD24ZTWCHOBDHH
aws_secret_access_key = W09pNZg...
aws_session_token = FwoG...
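To sanity-check the assumed role credentials, you can (optionally) call sts get-caller-identity with the new profile - the returned Arn should be the assumed-role session for the dataset's customerAccessRoleArn:

aws sts get-caller-identity --profile stsassumerole --region us-east-1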
Crawler
Get the status of the crawler using the get-crawler CLI command, start a new crawl using the start-crawler command, and stop a running crawler using the stop-crawler command. (The crawler name is in the dataset config as the writeConnector.catalog.awsGlueCrawlerName attribute.)
aws glue get-crawler --name testcrawler5855ca1062420abf35ed85b8c6eda82a --profile stsassumerole --region us-east-1
aws glue start-crawler --name testcrawler5855ca1062420abf35ed85b8c6eda82a --profile stsassumerole --region us-east-1
aws glue stop-crawler --name testcrawler5855ca1062420abf35ed85b8c6eda82a --profile stsassumerole --region us-east-1
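The crawl runs asynchronously. A minimal way to check on progress is to filter the get-crawler output with a --query expression - the crawler State cycles through READY / RUNNING / STOPPING, and LastCrawl.Status reports whether the last crawl SUCCEEDED or FAILED (the sketch below reuses the example crawler name from above):

aws glue get-crawler --name testcrawler5855ca1062420abf35ed85b8c6eda82a --query 'Crawler.State' --output text --profile stsassumerole --region us-east-1
aws glue get-crawler --name testcrawler5855ca1062420abf35ed85b8c6eda82a --query 'Crawler.LastCrawl.Status' --output text --profile stsassumerole --region us-east-1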
Table Details
The table(s) are created when the crawler has run successfully. You can get the table details for the created tables using the get-tables CLI command. (The database name is in the dataset config as the writeConnector.catalog.awsGlueDatabaseName attribute.) Additionally, you can get the database, partition and partition index details as well.
aws glue get-tables --database-name testtenantd5feaf90-71a9-41ee-b1b9-35e4242c3155 --profile stsassumerole --region us-east-1
# This should return table details including the schema
# {
# "TableList": [
# {
# "Name": "b8e4b3e8...",
# "DatabaseName": "testtenant...",
# ...
# "StorageDescriptor": {
# "Columns": [
# {
# "Name": "language",
# "Type": "string"
# },
# ...
# ],
# "Location": "s3://tldwc.../",
# "InputFormat": "TextInputFormat",
# "OutputFormat": "HiveIgnoreKeyTextOutputFormat",
# "Compressed": true,
# "NumberOfBuckets": -1,
# "SerdeInfo": {
# "SerializationLibrary": "JsonSerDe",
# "Parameters": {
# "paths": "samples,percentile,language"
# }
# },
# "BucketColumns": [],
# "SortColumns": [],
# "Parameters": {
# "sizeKey": "9953",
# "objectCount": "2",
# ...
# },
# "StoredAsSubDirectories": false
# },
# "PartitionKeys": [
# {
# "Name": "partition_0",
# "Type": "string"
# }
# ],
# "TableType": "EXTERNAL_TABLE",
# "Parameters": {
# "sizeKey": "9953",
# "objectCount": "2",
# ...
# },
# ...
# }
# ]
# }
#
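Since the table name(s) are generated by the crawler (prefixed with the awsGlueTableNamePrefix), it can be convenient to capture the table name into a shell variable for the Athena query below - a small sketch, assuming a single table was created for the dataset:

TABLE_NAME=`aws glue get-tables --database-name testtenantd5feaf90-71a9-41ee-b1b9-35e4242c3155 --query 'TableList[0].Name' --output text --profile stsassumerole --region us-east-1`
echo $TABLE_NAME
# b8e4b3e8...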
Run Athena Query
Using the schema from get-tables, construct a query to run. You can run the query using the start-query-execution CLI command. (The database name and catalog name are in the dataset config as the writeConnector.catalog.awsGlueDatabaseName and writeConnector.catalog.awsGlueCatalogName attributes. Additionally, use the writeConnector.catalog.athenaQueryOutputPath to store your results.)
# run the query
aws athena start-query-execution --query-string \
"SELECT language, COUNT(*) as recs \
FROM b8e4b3e... \
GROUP BY language \
ORDER BY recs desc" \
--query-execution-context Database=testtenant...,Catalog=223413462631 \
--result-configuration OutputLocation=s3://test-.../athena-queries/ \
--profile stsassumerole --region us-east-1
# {
# "QueryExecutionId": "8ce557cd-39eb-414d-8cbd-104a1223b009"
# }
# use the query execution id to get the query execution details
aws athena get-query-execution --query-execution-id 8ce557cd-39eb-414d-8cbd-104a1223b009 --profile stsassumerole --region us-east-1
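# (optional) the query runs asynchronously - poll the query state (QUEUED / RUNNING / SUCCEEDED / FAILED / CANCELLED) until it is SUCCEEDED before fetching the results
aws athena get-query-execution --query-execution-id 8ce557cd-39eb-414d-8cbd-104a1223b009 --query 'QueryExecution.Status.State' --output text --profile stsassumerole --region us-east-1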
# use the query execution id to get the query results
aws athena get-query-results --query-execution-id 8ce557cd-39eb-414d-8cbd-104a1223b009 --profile stsassumerole --region us-east-1
# list the athenaQueryOutputPath and download the results as needed
aws s3 ls s3://test-.../commoncrawladdtocatalog19/ --profile stsassumerole --region us-east-1
# copy the results file from S3
aws s3 cp s3://test-.../athena-queries/8ce557cd-39eb-414d-8cbd-104a1223b009.csv . --profile stsassumerole --region us-east-1
Conclusions
With automated AWS Glue integrations and AWS Athena enablement:
we've defined primitive Data Cataloging, Data Warehousing and Data Governance capabilities on #LetsData
the AWS Glue Data Catalog enables a large number of AWS data scenarios such as Redshift data warehousing, EMR processing and QuickSight analytics, to name a few
This is a huge space and while what we've built barely scratches the surface, it is quite powerful. We'd love to hear about customer scenarios around Data Cataloging, Governance and Data Warehousing and how #LetsData may help solve some of these challenges.