#LetsData at AWS Summit, NY, 2023
It's been a little quiet on our #LetsData page - and while no excuse for the silence is going to sound any better, the reason we've been quiet is that we've been really busy with:
Prospecting customers
Co-Founder - Sales hiring (lots of great conversations but nothing final, let's see how this goes)
Hiring Web Developer (Woot! we hired an engineer who started this week and already has some templates going!)
Building additional product features (Stay tuned, we have additional connector announcements soon)
Finding how we fit into the overall ecosystem (Product-Market Fit)
Between all of this, regular posting on LinkedIn automatically got de-prioritized. Anyway, on to the main topic for today.
We've been trying to find creative ways of prospecting customers (#1), and for any customer connects it is critical that we articulate a crisp narrative for our Product-Market Fit (#5). For these two reasons, I attended the AWS Summit in New York (www) yesterday. Here is my debrief of the event and its applicability to #LetsData.
Preparation
Leading up to the event, I updated our Marketing Solutions Brief (pdf) and our Executive Summary (pdf) to create a Customer Package for any customer connects. I printed 30-odd copies at our local FedEx office - sorted, bound, laminated and boxed, ready to be handed out to customers. I had registered as an attendee (not as a sponsor), so I had no booth assigned in the Expo / Exhibition Hall. The idea was to squat at an empty table if such an opportunity arose and chat with interested folks. This was the prep work for prospecting customers (#1).
For Product-Market Fit (#5), I went through the session catalog and identified the sessions I'd want to attend. The selection criteria were:
anything related to data / compute space in general
data / compute AWS techs that I hadn't worked with
customer stories on how data techs were actually being used in the industry
The idea here was that as long as I was absorbing knowledge about different techs and customer stories, they'd connect into a crisper strategy narrative and more clarity on Product-Market Fit.
Here are some sessions I selected (attending all of these would be a challenge; the ones I did attend are covered in the sections below):
Not Applicable, The Keynote
200 - INTERMEDIATE, Self-paced labs: Come and go as you please
200 - INTERMEDIATE, Adobe’s journey toward building an internal developer platform (IDP)
200 - INTERMEDIATE, Chart your Kubernetes course with Amazon EKS
200 - INTERMEDIATE, Fidelity’s observability platform for telemetry
200 - INTERMEDIATE, Create a CI/CD pipeline to deploy your application to AWS ECS
200 - INTERMEDIATE, Serverless SaaS: Building for multi-tenancy
200 - INTERMEDIATE, AWS networking fundamentals: Setting up your global network
200 - INTERMEDIATE, How modern data management can fuel your success on AWS
300 - ADVANCED, Faster insight with Amazon Redshift: Zero-ETL integration & sharing
300 - ADVANCED, Build a data governance framework to balance data control and access
300 - ADVANCED, Architecting for low latency and performance in financial services
300 - ADVANCED, Best practices for microservices deployed on Amazon ECS
300 - ADVANCED, Making a modern data architecture a reality
300 - ADVANCED, Managing resources with the new AWS Cloud Control Terraform provider
AWS Summit, NY, 2023
Boarded a 9 PM (Seattle) flight and arrived at Newark at 5 AM - a 30-minute Uber ride and I was at the Javits Center in West Manhattan at 6 AM. Fueled by a coffee and a cream cheese bagel, I was ready to attend some sessions and talk to some customers.
Some observations about the event, the venue and the overall ambience at the AWS Summit NYC. The venue, the Javits Center, on the banks of the Hudson and a short walk from the 9th Ave skyscrapers and attractions such as the Vessel and Madison Square Garden, was impressive. The AWS event itself was grand - a choreographed AWS show that demonstrates their success as the de facto cloud leader. The large number of attendees and the queuing / sell-outs at sessions were a gauge of the interest in, and applicability of, what they are doing. Every tiny detail had been thought about and meticulously planned - so my plan of squatting at a table in the Expo Hall was impossible; I'd have to improvise.
Create a CI/CD pipeline to deploy your application to AWS ECS
After getting my bearings around the event, I attended the lab about ECS deployments - "Create a CI/CD pipeline to deploy your application to AWS ECS". This was a relatively large hall set up as a computer lab - every attendee got a desktop with a couple of monitors, connected to the internet. The instructor led the session Peloton-style (the Peloton reference because I had seen their store in the neighborhood earlier that day) - telling the audience about the lab and getting confirmation that everyone was following along.
The software and infrastructure that AWS has built around these labs, workshops, skill building and training is pretty impressive - a simple login gets you a fully functional AWS Cloud environment to work through the use cases. For comparison, the test (sandbox) and prod environments that I use during my day-to-day development are actual AWS accounts that I own (nothing sandbox about my test account) - so having such a scoped-down environment and sandbox resources would be a huge productivity boost in my day-to-day development as well. (I know a scoped-down environment can be created via IAM roles and policies, but I don't know how to do a proper sandbox on AWS.) Overall, the ease with which attendees could get environments up and running was quite impressive, IMHO.
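For the IAM roles-and-policies route I mentioned, here's a minimal sketch of what a scoped-down "sandbox" role might look like - the account ID, resource prefixes and names below are placeholders, and this is only an illustration of the idea, not a full sandbox setup:

```python
import json
import boto3

iam = boto3.client("iam")

# Hypothetical scoped-down policy: allow access only to resources whose names
# carry a "sandbox-" prefix; everything else is implicitly denied.
sandbox_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:*"],
            "Resource": ["arn:aws:s3:::sandbox-*", "arn:aws:s3:::sandbox-*/*"],
        },
        {
            "Effect": "Allow",
            "Action": ["lambda:*"],
            "Resource": ["arn:aws:lambda:us-east-1:123456789012:function:sandbox-*"],
        },
    ],
}

# Trust policy that lets principals in the same (placeholder) account assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:root"},
            "Action": "sts:AssumeRole",
        }
    ],
}

policy = iam.create_policy(
    PolicyName="sandbox-scoped-access",
    PolicyDocument=json.dumps(sandbox_policy),
)
iam.create_role(
    RoleName="sandbox-developer",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.attach_role_policy(
    RoleName="sandbox-developer",
    PolicyArn=policy["Policy"]["Arn"],
)
```

Developers would then assume the sandbox-developer role for day-to-day experiments; it scopes the blast radius down, though it still isn't the fully isolated, throwaway environment the event labs provide.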
The ECS CI/CD lab walked us through creating an ECS deployment and staged updates. It covered CodeCommit, CodePipeline, CodeBuild and Lambda. We use these constructs quite heavily in #LetsData, so nothing new there. On the container side of things, I know what the pieces are and have done some experiments with them: ECS and Kubernetes are compute options for containerized applications that could be used instead of the default Lambda compute we currently have in #LetsData. We also walked through the container registry and deploying containerized applications. Good stuff!
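As a rough sketch of the deploy step such a pipeline automates (this is my own boto3 approximation, not the lab's code - the cluster, service, role and image names are placeholders), registering a new task definition revision and rolling the service over to it looks something like this:

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# Register a new task definition revision that points at the freshly built image.
task_def = ecs.register_task_definition(
    family="demo-web",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",
    memory="512",
    executionRoleArn="arn:aws:iam::123456789012:role/demo-ecs-execution-role",
    containerDefinitions=[
        {
            "name": "web",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/demo-web:build-42",
            "portMappings": [{"containerPort": 80, "protocol": "tcp"}],
            "essential": True,
        }
    ],
)

# Point the running service at the new revision; ECS performs a rolling update.
ecs.update_service(
    cluster="demo-cluster",
    service="demo-web-service",
    taskDefinition=task_def["taskDefinition"]["taskDefinitionArn"],
)
```

In the lab, CodePipeline / CodeBuild do this for you on every commit; the sketch just shows what the staged update boils down to.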
Faster insight with Amazon Redshift: Zero-ETL integration & sharing
Next, I attended the chalk talk "Faster insight with Amazon Redshift: Zero-ETL integration & sharing", where the presenters whiteboarded how the newer Zero-ETL features simplify data ingestion into AWS Redshift. They showed different integrations from AWS Aurora, AWS S3 and AWS Kinesis directly into AWS Redshift. The integrations seemed easy to set up, simplified the existing complex pipelines, had no cost in most cases, ran automatically and came with Amazon's operability goodies (look at batches, progress in different tables).
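As an illustration of the flavor of these integrations (my own sketch, not the chalk talk's material), the Kinesis-to-Redshift streaming ingestion path is typically wired up with a couple of SQL statements - an external schema over the stream and an auto-refreshing materialized view. The cluster, database, role ARN and stream name below are placeholders:

```python
import boto3

# Run the setup SQL via the Redshift Data API.
rsd = boto3.client("redshift-data", region_name="us-east-1")

statements = [
    # Map the Kinesis stream into Redshift as an external schema.
    """
    CREATE EXTERNAL SCHEMA kinesis_schema
    FROM KINESIS
    IAM_ROLE 'arn:aws:iam::123456789012:role/demo-redshift-kinesis-role';
    """,
    # Materialized view over the stream; AUTO REFRESH keeps it ingesting new records.
    """
    CREATE MATERIALIZED VIEW clickstream_mv AUTO REFRESH YES AS
    SELECT approximate_arrival_timestamp,
           JSON_PARSE(kinesis_data) AS payload
    FROM kinesis_schema."demo-clickstream";
    """,
]

for sql in statements:
    rsd.execute_statement(
        ClusterIdentifier="demo-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=sql,
    )
```

Once the materialized view exists, new stream records show up in Redshift without any pipeline code - which is exactly the "zero ETL" pitch.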
No Code vs Low Code vs Your Code
This talk seemed highly relevant to #LetsData and, on the surface, raises some existential questions for the company:
If there are zero-touch and no-cost ETL integrations natively available in AWS, would #LetsData be able to do better?
Or, in terms of a Product-Market Fit (#5) question: how does #LetsData fit in when zero-ETL solutions are natively available in AWS?
A few people I had talked to had asked me about No Code options in #LetsData, and I had told them it's an area we need to explore. What are our plans for No Code?
Our distinguishing feature, flagship capability and brightest feather in our cap is the managed service we've built around AWS Lambda as a compute engine - a truly serverless and infinitely, elastically scalable compute offering. Connectors are the read and write connections to and from different sources and destinations.
With the #LetsData Lambda Compute Engine, you do not have to create clusters, manage machines, or run orchestration and management software such as Kubernetes. Clusters, machine management and Kubernetes are extremely smart and feature-rich at managing a shared pool of resources (machines, Kubernetes CPU resources, etc.) - however, this means:
you are still responsible for the operability and scale
infinite elastic scaling is not an option without additional provisioning
The #LetsData infinite elastic scale was actually demonstrated in our Web Crawl Archives Big Data Case Study - with just a config option, we created 100 concurrent Lambda functions without having to provision any hardware, clusters or machines! (That's also why we saw the system process 3 billion JSON docs in 48 hours. I like to compare it with Google's 8 billion searches per day - we aren't nearly as complex, but 3 billion isn't pocket change.)
Additionally, with the #LetsData Lambda Compute Engine, we've not only packaged AWS Lambda for ease of data compute, we've also extended the service with many little things that will delight you once you start using it. For example, the Lambda function runtime today has an upper limit of 15 minutes. In case your task takes longer (and data-intensive tasks that read tens of gigabytes of data files can take longer), we've built in automated task rerun capabilities. This essentially means your task relinquishes control and reschedules itself for a rerun.
I liken this to the thread scheduling, preemption and execution that OSes do. For example, in an experiment I ran, I set the task timeout to 60 seconds (config) and saw that tasks would run for 60 seconds, yield the compute to other tasks and then get rescheduled. A very neat implementation, and neat in how it manifests in use, IMHO!
For example, when reading from a partitioned stream with a task per partition and a partition doesn't have data available yet, this ensures there isn't starvation across partitions (similar to IO schedulers in low-level OS implementations).
(I mentioned config a couple of times - this isn't traditional config, it's actually the dataset configuration JSON that you create to define your read, write, error and compute. We are not a config-heavy system, IMO; I believe our design philosophy follows this as a principle, and that leads to simplicity. I acknowledge that we are just an MVP without a full gamut of features, so becoming config-heavy may be the only option as we grow (I somehow doubt that), but as of now I do believe this offers ease of use. Also, I may have an owner's bias, and some users might find our config overwhelming.)
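To make the task rerun / rescheduling behavior described above concrete, here is a minimal, purely illustrative sketch of the pattern - this is not the actual #LetsData implementation, just the general shape of a task that checkpoints its progress and relinquishes control before the configured timeout so a later run can resume where it left off:

```python
import time

TASK_TIMEOUT_SECONDS = 60  # analogous to the task timeout set in the dataset configuration


def process(record):
    # Placeholder for the user's per-record compute (parse, transform, write).
    pass


def run_task(records, checkpoint=0):
    """Process records starting at `checkpoint`. If the timeout is reached before
    all records are done, relinquish control and report where to resume, so the
    scheduler can rerun the task later (illustrative sketch only)."""
    started = time.monotonic()
    position = checkpoint
    while position < len(records):
        if time.monotonic() - started > TASK_TIMEOUT_SECONDS:
            # Relinquish control: persist the checkpoint and request a rerun.
            return {"status": "RESCHEDULE", "checkpoint": position}
        process(records[position])
        position += 1
    return {"status": "COMPLETED", "checkpoint": position}
```

Each rerun picks up from the last checkpoint, which is what makes a 15-minute Lambda limit a non-issue for long-running, data-intensive tasks.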
I believe this distinguishing feature solidly puts us in the "Your Code" (and maybe "Low Code") category rather than "No Code". Also, if I may bring another term into the mix, we are "No Provisioning" (truly serverless). Consider "No Provisioning" as a design parameter when you design your infrastructure, and you'll realize the infrastructure and operability simplifications it offers.
However, "No Code" / Zero ETL features are highly relevant and important - when the compute is simple copy to data into destination, they work really well (and I was impressed by how AWS has provided these with such ease of use).
For example, the built-in automation: turn it on once and any new files / data keep syncing automatically, with debuggability, batching and so on. We might just use these Zero-ETL features as-is and package them with #LetsData resources such as logging, metrics and tasks to be the "No Code" pieces in our data pipelines. This would "build it in code" for better manageability, instead of rogue actions such as enabling things in a studio or on the console, which can be a source of operator errors. We can integrate with their APIs so that you don't have to!
So those are my thoughts on #LetsData's positioning with respect to the "No Code" and "Zero ETL" offerings: we can integrate with them for the basic copy tasks and offer a compute that can do much, much more!
Expo Hall
Next, I headed to the Expo Hall and started chatting with a number of different companies to understand what they do. Lots of companies in the data pipelines, observability and monitoring space. Almost all the open-source SQL and NoSQL variants had a presence. Companies that increase developer agility, decrease costs, secure software development end to end, and the like were all on display.
From the Expo Hall, I made my way to the "Architecting for low latency and performance in financial services" session at its scheduled time. There was a large queue, the hall filled up quickly and a number of attendees had to be turned away. The early birds had beaten me to the punch :) Never mind, I'll catch up on this session via the recording on YouTube when it becomes available. Back to the Expo Hall I went. There was some time before the Keynote, so I spent it talking to folks - some additional conversations around data pipelines and visiting-card swaps.
Keynote
Realizing that queuing and limited capacity might be facts of life at an AWS event, I turned on the theme-park mode from my younger years to optimize ride wait times. Just kidding - the queuing and limited capacity weren't that big of an issue; I simply gave myself ample time for the Keynote and the sessions I wanted to attend instead of showing up right at the start time.
The key takeaway from the Keynote was that AWS is making big investments in the AI and ML space - lots of new features, simplifications, cost reductions and ease-of-use improvements for AI / ML.
During the course of building the startup, a few people had asked me how we might benefit from the AI / ML wave that is sweeping the industry. I had read a few articles and architecture docs, and these two do a really good job of explaining the space:
Emerging Architectures for LLM Applications (an Andreessen Horowitz blog): https://a16z.com/2023/06/20/emerging-architectures-for-llm-applications/
How OpenAI trained ChatGPT (an excellent summary of the MS Build talk): https://blog.quastor.org/p/openai-trained-chatgpt
We definitely fit well in the "Data Preprocessing / Embedding" component of the architecture in the Andreessen Horowitz blog. In addition, from the MS Build talk summary (emphasis mine):
"The Data Mixture specifies what datasets are used in training. OpenAI didn’t reveal what they used for GPT-4, but Meta published the data mixture they used for LLaMA (a 65 billion parameter language model that you can download and run on your own machine).
...
From the image above, you can see that the majority of the data comes from Common Crawl, a web scrape of all the web pages on the internet"
Considering that we have demonstrated scale and performance in reading and processing Common Crawl docs ("Case Study: Big Data: Building a Document Index From Web Crawl Archives"), I believe we'd be highly relevant here as well.
In addition, maybe the compute can be leveraged for some training as well, but that seems highly specialized. I need to spend some time working through a few example use cases to internalize how #LetsData can be used in these scenarios; I'll say more once I've internalized it a little better. (Task for self: deep dive into OpenAI's / Meta's technology.)
Headed for lunch after the keynote - grabbed a tuna sandwich on ciabatta and a Diet Coke, and headed out to the benches to enjoy a meal in the sun. I hadn't done this in ages, for no specific reason; the warmth of the summer sun made one of my favorite rituals rather enjoyable.
Making a modern data architecture a reality
Headed over to "Making a modern data architecture a reality" and learned about lake formation over AWS S3 via AWS Glue:
Glue Crawlers infer the data schema and add it to the Glue Data Catalog
Glue Transforms do an SQL join of two different S3 data sources and write the result back to S3 via Glue Studio's drag-and-drop designer. The designer auto-generates a Python script that gets run (it looked like Spark code)
Again, a Glue Crawler to infer the schema from the joined data
Athena to query over the S3 data using the schema from the Glue Catalog
Add a few neat QuickSight visualizations (Heatmap and Treemap) to round out the exercise
Data cataloging is important for data governance, and the built-in Glue transformations and crawler infrastructure seem really interesting as well. Having a data catalog and Athena querying built in seems like a useful addition to #LetsData. We need to experiment with these a little, see what would be most beneficial for customers, and build something that customers would actually use.
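As a rough sketch of the crawl-then-query flow from the lab above (my own boto3 approximation, not the lab's code - the bucket names, database name and role ARN are placeholders):

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")
athena = boto3.client("athena", region_name="us-east-1")

# Crawl an S3 prefix and register the inferred schema in the Glue Data Catalog.
glue.create_crawler(
    Name="demo-sales-crawler",
    Role="arn:aws:iam::123456789012:role/demo-glue-crawler-role",
    DatabaseName="demo_catalog_db",
    Targets={"S3Targets": [{"Path": "s3://demo-data-lake/sales/"}]},
)
glue.start_crawler(Name="demo-sales-crawler")

# After the crawler finishes (poll get_crawler in a real setup), Athena can query
# the S3 data using the schema from the catalog.
athena.start_query_execution(
    QueryString="SELECT region, SUM(amount) AS total FROM sales GROUP BY region",
    QueryExecutionContext={"Database": "demo_catalog_db"},
    ResultConfiguration={"OutputLocation": "s3://demo-athena-results/"},
)
```

The Glue Studio drag-and-drop transform and the QuickSight visuals sit on top of the same catalog entries.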
Unlike the earlier lab, this one was BYOL (bring your own laptop) - but the software infra of a simple login with AWS cloud resources provisioned and ready was VERY impressive. (A pat on the back for the folks who made this possible.)
Expo Hall
Back in the Expo Hall one more time. A number of fun activities were also going on in the Expo Hall - professional executive headshots, games such as ping pong, and so on. The line for the professional headshots was a little thinner, so in between my conversations I got those done. Not very flattering, though - I guess 20 hours of travel and tiredness were showing through. Ah well, I like my current LinkedIn pic quite well.
AWS Startup Loft had a large presence on the Expo floor. I met with a few different Solutions Architects, discussed the startup, got some tips around execution and engagement, and did the customary exchange of visiting cards. Next stop: a few more companies, asking what possible integrations and synergies might exist with #LetsData.
At this point, I went to the large lobby area and chilled out on some benches, watching folks and the event pass by. I was a little tired and thought I'd stretch my legs and wait for my next talk.
Serverless SaaS: Building for multi-tenancy
The AWS Community section of the Expo Hall had a talk about Serverless SaaS and building for multi-tenancy. Most of it was what I already knew from the AWS Well-Architected Framework SaaS Lens (www) - I had read it cover to cover and more for #LetsData, and solved some really tricky challenges along the way, but hearing it reiterated was fun.
Almost 5 PM, and my flight's at 7:30 - no time for the after-event socials, at least not this time. Uber back to the airport, check in, wait, board, disembark, drive back home.
1 AM at home, but there are people waiting for me to come back safely.
Alhamdulilah (Thank God), Life's Good!