Launch Announcement: #LetsData is now available in Python and Javascript and supports Containers
Today, we are announcing the availability of #LetsData in the Python and Javascript languages, along with support for containers.
Customers can now implement the #LetsData interfaces in Java, Python and Javascript and package their implementations as Docker containers (in addition to the existing JAR files) for a simplified development experience.
Here is a high level support matrix:
For those interested in digging into the code immediately, here are the interface and example packages and the #LetsData docs:
Python: GitHub Interface Package, GitHub Interface Implementation Examples
Javascript: GitHub Interface Package, GitHub Interface Implementation Examples
Java: GitHub Interface Package, GitHub Interface Implementation Examples
#LetsData Docs:
Read Connector Docs: https://www.letsdata.io/docs/read-connectors
SDK Interface: https://www.letsdata.io/docs/sdk-interface/
Dev Experience Overview
We’ve translated the existing Java interfaces to Python and Javascript and packaged them as buildable, ready-to-deploy projects on Docker container images. The developer workflow is as follows:
update the interface files with your implementation
build a Docker image and upload it to ECR
reference the ECR image in your dataset
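The workflow above can be sketched as shell commands. This is a hedged, illustrative sketch using the standard `aws ecr` / `docker` CLI flow - the account ID, region and repository name are placeholders, and the exact repo naming is up to you:

```shell
# Hypothetical values - substitute your own account, region and repository.
ACCOUNT_ID=123456789012
REGION=us-east-1
REPO=my-letsdata-python-impl

# 1. Update the interface files in the cloned interface project with your
#    implementation, then:

# 2. Build the Docker image and push it to ECR.
aws ecr get-login-password --region "$REGION" | \
  docker login --username AWS --password-stdin "$ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com"
docker build -t "$REPO" .
docker tag "$REPO:latest" "$ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO:latest"
docker push "$ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO:latest"

# 3. Reference the pushed ECR image URI in the dataset configuration.
```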
Architecture
Internally, we package the ECR image as a “Language Bridge” Lambda function with HTTP request-response web methods that invoke the user’s implementations. Our existing Data Task Java functions can then call these language bridges with the data that needs to be processed. This is essentially a micro-service architecture, with the Data Task micro-service calling the Language Bridge micro-service.
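As a rough mental model of the bridge, here is a minimal, hypothetical sketch - the event shape, method names and response format are illustrative, not the actual #LetsData contract:

```python
import json

def parse_document(record):
    # The user's (stateless) interface implementation.
    return {"id": record.get("id"), "text": record.get("text", "").strip()}

def handler(event, context=None):
    # The Data Task micro-service sends the interface name and payload;
    # the Language Bridge routes the call to the user's implementation
    # and returns an HTTP-style response.
    interface = event.get("interface")
    payload = event.get("payload", {})
    if interface == "parseDocument":
        return {"statusCode": 200, "body": json.dumps(parse_document(payload))}
    return {"statusCode": 400,
            "body": json.dumps({"error": "unknown interface: %s" % interface})}
```

Because each call carries everything it needs, any instance of the bridge Lambda can serve any request.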
In-Proc vs Micro-service
An interesting difference between the existing Java interfaces and the new Python / Javascript interfaces is that the Java interfaces do not use micro-services - they are essentially in-proc calls within the same JVM. This is primarily because all our existing code is implemented in Java: we take your implementation JAR, build it with our code and execute it as a single JVM process. This results in the following differences:
Stateless vs Stateful interfaces: Java interface implementations can be stateful - the code can easily maintain a thread-to-implementation mapping. When we move to Python / Javascript using micro-services, we lose stateful support - the interfaces need to be stateless. This is because subsequent calls from a thread could land at different micro-service endpoints (Lambda functions) that may not have state from the prior call. Statefulness can be implemented using persistent sessions, but that becomes overly complicated to implement and is a micro-services anti-pattern IMHO. So is the move to micro-services a step down? Not necessarily - here is a defense of micro-services:
Most stateful parsing and transformation implementations can be simplified to stateless implementations (much like SQL to NoSQL transitions)
The loose coupling of services affords greater horizontal scalability and increased performance.
A simplified overall architecture
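The stateful-to-stateless simplification can be made concrete with a hedged, illustrative example - the method and field names below are made up, not the actual #LetsData interface signatures:

```python
class StatefulParser:
    """In-proc (Java-style) parsing: state accumulates across calls."""
    def __init__(self):
        self.current_url = None

    def parse(self, record):
        # A header record sets state that later records depend on.
        if record["type"] == "header":
            self.current_url = record["url"]
            return None
        return {"url": self.current_url, "text": record["text"]}

def stateless_parse(record):
    # Micro-service style: each record carries everything the call needs,
    # so any Lambda instance can serve any call with no prior state.
    return {"url": record["url"], "text": record["text"]}
```

The stateless version pushes the context (here, the URL) into each record, which is exactly the kind of restructuring most parsing and transformation use-cases can tolerate.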
Latency: Java interface implementations are blazingly fast since they run within the same JVM, whereas the Python / Javascript micro-service implementations require network transit. These calls stay within the same network, and from experience the Python / Javascript interface implementations are adequately fast - we’ve seen parse-message call latencies of ~1-2 ms (including the network) for Python and slightly higher (~5 ms) for Javascript. With all the other responsibilities of the Data Task function (network reads, destination writes, serialization / deserialization, compute etc.), these latency differences get amortized (or show only a slight perceptible increase). We were initially skeptical about the network latency of such an architecture, but we’ve been very happy with the out-of-the-box performance we are seeing. With additional tweaking, we should be able to improve performance further if needed.
Dataset Initialization / Provisioning: Our provisioning of Java interfaces creates a new Java build that packages the customer’s JAR with the #LetsData interfaces. We chose this because we wanted to i.) make sure that the JARs play nice together and that there are no compile-time failures in the interface implementations, and ii.) leave ourselves extensibility to customize the workflow and run focused tests during the build to find possible issues at initialization time. While all this is great, the downside is that this Java build takes around 3-5 mins to complete and becomes the long pole in the dataset initialization workflow. The Python / Javascript implementations using pre-built ECR container images simplify this very nicely - we do not have to do any additional builds, and dataset initialization becomes quite short - we’ve seen datasets start processing data within 1-2 mins of creation. If we were to build similar focused initialization-time tests for containers, we might see increased runtimes. However, this is unlikely to be a problem, since the ECR container is a separate deployment unit that can be deployed and scaled independently, so any issues can be treated separately as well.
Developer Experience: The overall developer experience when developing with Java JARs and LetsData is somewhat more complex than with Python / Javascript containers and LetsData.
Python / Javascript containers simplify development quite a bit. We’ve implemented the Document interfaces by default in Python / Javascript - with support for a key-value map, these are flexible enough for most use-cases. Python / Javascript developers do not have to write their own Document classes or run additional tests around serialization / deserialization. We should probably implement the same for Java as well.
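To illustrate what a default key-value Document affords, here is a minimal, hypothetical sketch - the class and attribute names are illustrative, assuming the #LetsData-provided defaults behave roughly like this:

```python
import json

class Document:
    """A default key-value document: an id plus an arbitrary key-value map."""
    def __init__(self, document_id, key_values=None):
        self.document_id = document_id
        self.key_values = dict(key_values or {})

    def serialize(self):
        # Serialization is handled by the default class, so implementers
        # don't have to write (or test) their own Document plumbing.
        return json.dumps({"documentId": self.document_id, **self.key_values})
```

With a default like this, an implementation only needs to return a populated map; no per-project Document classes or serialization tests are required.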
The HTTP request-response semantics of the Python / Javascript containers let you test your code end to end, trusting that as long as your implementation does the right thing given the right inputs, the end-to-end flow will work quite well. Docker’s superior development infrastructure (IMHO) has a lot to do with this delight as well. With the Java implementations, as we package the Java project infrastructure today, the end-to-end testing story and how everything fits together isn’t quite as clear. We need to do more work to make the Java experience as smooth as Python / Javascript, but we are not there yet IMHO.
Security: Running users’ Lambda containers in our AWS account does increase the security risk - customer code could be malicious and could attempt all sorts of incorrect things: stealing credentials, generating malicious data etc. This concern is similar to running customer code in-proc - the in-proc code could be malicious as well. We mitigate the in-proc risk by running the code with a scoped execution role, which limits the code to only what we’ve determined should be accessible by the dataset. Similar security fencing should carry over to the language-bridge Lambda functions. While we’ve not limited that execution to the dataset’s execution role yet, we’ve granted the language-bridge function the bare minimum set of permissions needed for a Lambda function to work. However, some of those accesses are for ‘*’ resources.
My understanding of the system suggests that this could allow user code to create log groups, put logs and metric data to namespaces it doesn’t own, and do similar malicious activity with X-Ray. I haven’t tested this yet and would be surprised if malicious code in a Lambda function could actually do these things - I am sure there are internal safeguards that should disallow a Lambda function from running amok with these blanket accesses. This is okay for now, but it is on our minds as a security issue that we may need to fix.
Ideally, we’d want all languages to be i.) in-proc AND ii.) have scalability and dev experiences similar to micro-services. But duplicating all our existing Java code - the read connector code, the #LetsData writers and all the orchestration code - seems like a large effort. To have a true in-proc experience, we’d have to rewrite the stack for each language, which seems untenable as of now. The micro-services approach is a very nice implementation that should cater to approx. 95%+ of the cases (unscientific gut number).
Implementation
Config
Here is an example read connector dataset configuration using Python.
The JAR attributes have been replaced by ECR image attributes.
Since the project structure and Docker image follow a known template, the #LetsData infrastructure knows where the interface implementations are located. This allows us to simplify the configuration - we no longer specify the implementation class names.
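As a rough illustration of the shape of such a configuration, here is a hypothetical sketch - the attribute names and values below are illustrative placeholders, not the actual #LetsData configuration schema (see the read connector docs for the real one):

```json
{
  "datasetName": "MyPythonDataset",
  "readConnector": {
    "implementationLanguage": "python",
    "ecrImagePath": "<accountId>.dkr.ecr.<region>.amazonaws.com/<repo>:latest"
  }
}
```

Note that there are no implementation class name attributes - the known project template makes them unnecessary.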
Code
The interface code is simple - for example, we converted our stateful Common Crawl Java interface implementation to a stateless Python / Javascript implementation.
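As a flavor of what a stateless parse looks like, here is a hedged sketch loosely modeled on parsing a Common Crawl WARC-style record - the function name, record format and field names are illustrative, not the actual #LetsData example code:

```python
def parse_document(record):
    # Each record arrives self-contained (headers and payload together),
    # so no state from prior calls is needed - the call is stateless.
    headers, _, payload = record.partition("\r\n\r\n")
    url = None
    for line in headers.split("\r\n"):
        if line.lower().startswith("warc-target-uri:"):
            url = line.split(":", 1)[1].strip()
    return {"url": url, "content": payload}
```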
Conclusion
We’d love to learn how SaaS platform owners have dealt with the challenges of multi-language support - separate technology stacks might be the inevitable reference architecture.
Despite the cons and issues mentioned above, simplified development in popular programming languages and zero infrastructure management while processing data are compelling reasons to give #LetsData a try. We’d be happy to work with folks to onboard their data use-cases to #LetsData.