LetsData Control SDK is now available
The LetsData Control SDK is now publicly available.
The Control SDK is what the LetsData CLI and the LetsData Website call to orchestrate and manage data pipelines on LetsData. We've now made this REST API publicly available to customers. Customers can build automation and develop their apps on LetsData using familiar REST API semantics.
The documentation and a runnable Postman collection are available on Postman (LetsData Control SDK on Postman). The permalink on the LetsData docs website is https://www.letsdata.io/docs/sdk-control-api/ - however, that page is mostly a redirect to Postman.
The Control API
At a high level, the Control SDK API has the following resources; a short description of each and the supported actions follows:
Dataset: A dataset is a collection of data tasks grouped together as a logical entity - think of it as the data job that the user needs to run. A dataset has a task for each work item in the dataset. The SDK has a number of APIs to manage datasets. For example, you can create, list and view datasets, manage a dataset's lifecycle (descale / freeze / delete), manage its code artifacts and update its compute configuration etc.
Tasks: Tasks are the system's representation of a work item that is executed by the compute engine (Lambda, Sagemaker, Spark etc). When a task executes, it reads from the read destination, calls the user data handlers and then writes to the write destination. During this execution, a task emits metrics, logs and error records. Each of these is a separate resource in the SDK, and these resources tie back to the task via its taskId. As of now, the SDK defines APIs for listing and filtering a dataset's tasks, redriving error tasks and stopping tasks.
Errors: Errors during task execution (parsing records, transforming records or writing records) are archived as error documents and classified as 'Record Errors' in LetsData. The SDK supports a list API to list the errors for tasks and a view API to view each individual error record file. (Do note that LetsData infrastructure errors and unhandled exceptions are classified as 'Task Errors' and are handled differently. The Error Handling docs have additional details: https://www.letsdata.io/docs/error-handling/)
Logs: Each task writes its logs to files, which essentially become the task's execution trace and are useful for debugging and monitoring. The SDK supports a view API to view a task's log file.
Metrics: Dataset execution comes with some system defined metrics dashboards that can be used to monitor a dataset’s progress, debug performance issues and take corrective actions. The SDK supports the view API to view different metrics dashboards for a dataset.
Usage Records: Dataset execution initializes different AWS resources - the write connector (e.g. a Kinesis stream), the error connector (e.g. an S3 bucket) and different internal components such as queues, database tables, compute resources etc. We meter the usage of these resources for each dataset and create usage records, which are then used to determine costs. The SDK supports an API to list these usage records.
VPC: For datasets that read / write to destinations backed by a managed cluster of machines, LetsData manages these machines in a Virtual Private Cloud. The VPC resource supports a list API to list a dataset's VPC details. It also supports a 'vpcPeeringConnections' sub-resource with commands to accept (create) and delete VPC peering connections to the customer's VPCs.
Users: A tenant (company) in LetsData can create different user accounts (login credentials in LetsData) for different users in the company. These users can be assigned administrator / user roles and can run datasets individually. The SDK supports user management APIs to list, create, update and delete users.
Costs: Costs are a collection of the company's billing account information, the payment method on file, the pricing details, and a list of invoices with their payment status and links to pay or download each invoice as a PDF file. The SDK supports an API to list these cost details.
Setup
To start using the LetsData Control SDK, you need to:
Login Credentials: Have a valid LetsData account (username and password). You can sign up for a LetsData account at: https://www.letsdata.io/#signup
ClientId: For any serious calls, you'll need a valid ClientId to use the LetsData Control SDK. You can request one by emailing support@letsdata.io or logging an issue at https://www.letsdata.io/#support - we'll enable Control API access for your username/password. For testing and experimentation, you can use the testing and experimentation clientId 6ent0fqtc4v5ud6i8o41ado8rj. Since LetsData is a multi-tenant system, the clientId helps us differentiate API calls from different clients.
Authentication
You'll need to obtain an AccessToken and an IdToken by calling AWS Cognito. Here is a sample request and response:
Save the POST request body in a file named auth_data.json:
{
  "AuthParameters" : {
    "USERNAME" : "{{LetsData Username}}",
    "PASSWORD" : "{{LetsData Password}}"
  },
  "AuthFlow" : "USER_PASSWORD_AUTH",
  "ClientId" : "{{LetsData ClientId}}"
}
Make a POST request to AWS Cognito and save the output to a creds.json file:
curl -X POST --data @auth_data.json \
-H 'X-Amz-Target: AWSCognitoIdentityProviderService.InitiateAuth' \
-H 'Content-Type: application/x-amz-json-1.1' \
https://cognito-idp.us-east-1.amazonaws.com/ \
--output creds.json
The example response is saved in creds.json - a quick cat creds.json | jq shows the following JSON. Copy the AccessToken and IdToken; you'll need these for the API calls. Also note ExpiresIn, which is the duration (in seconds) that the tokens are valid for.
{
  "AuthenticationResult": {
    "AccessToken": "eyJraWQiOiJaQk...<redacted>",
    "ExpiresIn": 3600,
    "IdToken": "eyJraWQiOiJuSktcL1JN...<redacted>",
    "RefreshToken": "eyJjdHkiOiJKV1Qi...<redacted>",
    "TokenType": "Bearer"
  },
  "ChallengeParameters": {}
}
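When the tokens expire, you can either re-run the USER_PASSWORD_AUTH request above or refresh them with Cognito's standard REFRESH_TOKEN_AUTH flow using the RefreshToken from creds.json. Here is a minimal sketch (assuming the app client allows the refresh token flow, which Cognito enables by default; the refresh_data.json file name is just an example):

# Build the refresh request body using the RefreshToken from creds.json (requires jq)
cat > refresh_data.json <<EOF
{
  "AuthParameters" : {
    "REFRESH_TOKEN" : "$(jq -r '.AuthenticationResult.RefreshToken' creds.json)"
  },
  "AuthFlow" : "REFRESH_TOKEN_AUTH",
  "ClientId" : "{{LetsData ClientId}}"
}
EOF

# Same Cognito InitiateAuth endpoint as before, different AuthFlow. The response
# contains fresh AccessToken and IdToken values but no new RefreshToken - the one
# in refresh_data.json stays valid for subsequent refreshes.
curl -X POST --data @refresh_data.json \
  -H 'X-Amz-Target: AWSCognitoIdentityProviderService.InitiateAuth' \
  -H 'Content-Type: application/x-amz-json-1.1' \
  https://cognito-idp.us-east-1.amazonaws.com/ \
  --output creds.json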
API Calls
You can call any of the LetsData Control SDK APIs by adding the "Authorization: Bearer IdToken" and "LetsDataAuthorization: Bearer AccessToken" headers (substituting the IdToken and AccessToken values from creds.json). Here is an example API call that does a GET to retrieve a dataset's details.
curl "https://www.letsdata.io/api/dataset?tenantId={{tenantId}}&userId={{userId}}&datasetName={{datasetName}}" \
-H "Authorization: Bearer IdToken" \
-H "LetsDataAuthorization: Bearer AccessToken"
Almost every Control SDK API requires the tenantId, the userId for the authenticated user (TenantAdmins can pass a different userId to retrieve data for other users in the organization - see the user roles documentation) and the datasetName. You can find your tenantId and userId via the website or the CLI (docs), or decode the IdToken to get the tenant and user ids (jwt.io has a decoder). Here is a decoded IdToken - the sub field is the userId and the custom:tenantid field is the tenantId:
{
  "sub": "accb3567-2b6e-41ae-b00d-6ce1f9a58d94",
  "custom:companyaddress": "{\"addressLine1\":\"1234 Some Street\",\"addressLine2\":\"Apt F8\",\"city\":\"Bellevue\",\"state\":\"WA\",\"country\":\"US\",\"postalCode\":\"98006\"}",
  "cognito:groups": [
    "Tenant-d5feaf90-71a9-41ee-b1b9-35e4242c3155-Users"
  ],
  "custom:userrole": "TenantAdmin",
  "iss": "https://cognito-idp.us-east-1.amazonaws.com/us-east-1_asdjery68Ts",
  "cognito:username": "user@letsdata.io",
  "custom:companyname": "LetsData IO",
  "origin_jti": "1a0038a2-f8dd-4a71-996e-e64cde31003c",
  "custom:tenantid": "d5feaf90-71a9-41ee-b1b9-35e4242c3155",
  "aud": "11bbm85f3niuukca8su98dqc2t",
  "event_id": "978d2a81-fa94-4fdf-a5ce-52240c02aaaa",
  "token_use": "id",
  "auth_time": 1708220818,
  "exp": 1708224418,
  "iat": 1708220818,
  "jti": "194095f3-f350-425b-ad84-c5f8e1dd67fc",
  "email": "user@letsdata.io"
}
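If you'd rather decode the token locally than paste it into jwt.io, here is a minimal shell sketch (assuming creds.json from the authentication step, plus GNU base64 and jq on the path) that base64url-decodes the IdToken payload and pulls out the two ids:

# The JWT payload is the second dot-separated segment, base64url encoded;
# convert base64url characters to standard base64
PAYLOAD=$(jq -r '.AuthenticationResult.IdToken' creds.json | cut -d '.' -f 2 | tr '_-' '/+')

# Pad to a multiple of 4 characters so base64 -d accepts it
case $(( ${#PAYLOAD} % 4 )) in
  2) PAYLOAD="${PAYLOAD}==" ;;
  3) PAYLOAD="${PAYLOAD}=" ;;
esac

# userId is the 'sub' claim, tenantId is the 'custom:tenantid' claim
echo "$PAYLOAD" | base64 -d | jq '{userId: .sub, tenantId: ."custom:tenantid"}'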
Improvements
For the longest time, we had resisted publicly releasing the Control SDK API, primarily because we had full-featured CLI and Website clients that were adequately serving current needs. The CLI is automation friendly, so building automation scripts on top of the CLI has been the recommended approach. We also provided private API access where needed.
So why are we publicly releasing it now?
I was recently in a technical conversation with a respected software architect about designing an API for a use case, and we had quite a thorough discussion about the API design and implementation. His experience in API design helped me re-examine some of my beliefs about API design. This led me to review the LetsData Control API and measure it against a higher API design bar.
Here are some issues that we believe need to be discussed in the context of the Control API release, along with possible improvements for a vNext API.
HTTP Verb Abuse - the API uses a total of 2 HTTP verbs, GET and POST, to get everything done.
Updates and deletes are currently done via POST on update/delete sub-resources. These should map to the standard HTTP verbs instead - PUT/POST for creates and updates, and DELETE for deletions.
Getting a single item vs. a list of items is done via GET on the resource and a /list sub-resource. Item and list GETs should instead be disambiguated via the GET parameters.
Getting these right from the get-go would not have cost any additional time, so this is a miss (sigh, what was I thinking at the time!).
Authentication - The API uses both the IdToken and the AccessToken for authentication, which seems a little non-standard. Why are we doing it this way? I believe this is our security paranoia kicking in - on each API request, we do deeper validations on the tenant details, the user details and the clientId, essentially doing some redundant validations but making sure authenticated calls are not going to be a security issue. (We also did not find a way to add custom attributes such as tenantId to the AWS Cognito access token, hence using both tokens.) While this decision was reviewed for security, we could re-review it for simplification.
API Simplification - There is a case to be made for parameter simplification - since tenantId and userId are already present in the authentication tokens, they can be removed from the API parameters (or made optional to override the auth token values if needed). For example, api/dataset?tenantId={{tid}}&userId={{uid}}&datasetName={{dName}} could be simplified to api/dataset/{{dName}}. The latter does seem simpler, but I'm ambivalent about this as of now - maybe we'll run into some edge case that requires the tenant and user ids. Deferring to vNext for now.
API Keys (lack thereof) - since we are giving each customer a separate clientId, we believe that, as of now, the clientId is sufficient in place of an API Key. API Keys have a very nice integration in AWS API Gateway, where usage plans can be specified and enforced by the gateway even before the request hits the web service code. However, we do not have such large-scale quota or usage plan needs at this time - we can add any API Key specific logic (if needed, e.g. for DDOS or other operational issues) against the clientId and then add API Keys in the vNext of the API. As of now, there are no API usage rate limits or restrictions.
Language Client SDKs - At our earlier startup (LetsResonate), we had defined our API using the API Gateway model definitions. One nice thing about this was that the API Gateway can generate language-specific clients for different languages. However, maintaining that definition and its request and response mappings was time consuming and frustrating. This time around, we bypassed the model definitions and defined the requests and responses as pass-through strings, which are serialized/deserialized by our web service code. As a result, we don't have the rich language-specific clients that are auto-generated by the API Gateway. We will try to get our API definition into some defined format and generate language clients for customers. Until then, the REST HTTP API can be used directly.
API Regional availability - API Gateway supports regional API availability. When we went from a US-EAST-1 region service to a multi-region service with availability in 6 regions around the globe, we made a conscious decision to keep the Control SDK API available only in the US-EAST-1 region. For a truly global service with optimal latencies, the API should be deployed regionally, but this requires either:
DBs being in the same region as the regional API deployment (requires multi-master replication and conflict resolution, which is easy to get wrong)
OR
having the API servers maintain persistent connections to the DBs in the different regions (this is a challenge because our API servers are themselves serverless and are initiated on demand)
See the Edge Proxy Design for the benefits of API regional availability.
Some of these were conscious decisions - as a small startup, we prioritized a bias for action over a perfect API, and we do believe the decisions we made were correct: the API works well, and these fixes would be additional work with lesser benefit. We'll fix them when they become an issue / re-evaluate for vNext.
Conclusion
With the CLI, Website and API available, it's easy to get started and automate your data workflows on LetsData. We'd love to learn more about your data use case and hear your thoughts on how we can improve the API / developer experience.