Bug Fixes for the day!
Woke up today to a couple of overnight alarms for #LetsData - the website's main page had thrown an error in the wee hours of the night. Also, a dataset's initialization was failing.
Initial thoughts: great that we've started running into issues like these - people are looking at the service and trying to use it!
Okay, enough gloating, it's time to see how bad the failure situation is.
Website timeout
Looked into the logs and found that the website's main page had timed out. This can happen from time to time - on initialization, we read a bunch of data from the network and populate the caches, so the initial call is heavy; after that everything's cached.
The other issue is that we've implemented the server API in Lambda, and when a Lambda function sits inactive, it is reclaimed. So, after 15-30 mins of inactivity, a request to the website would re-initialize the function and repeat the network reads.
A couple of quick fixes:
Parallelize the network reads and do only the reads that are required for the page - this took the Lambda response time from hitting the 15 sec timeout down to 3 secs
Run a cron that retrieves the index, docs and home page every 10 mins so that at least 1 Lambda function remains initialized and customers do not get the initial delay (this is a TODO as of now)
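To illustrate the first fix, here is a minimal Python sketch of reading several sources in parallel at startup. It is not our actual Lambda code: load_page_data and the reader names are hypothetical, and it assumes each read is an independent, I/O-bound call that can safely run on its own thread.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def load_page_data(readers, max_workers=8):
    """Run every reader callable concurrently and return {name: result}.

    `readers` maps a (hypothetical) cache name to a zero-argument callable
    that performs one network read. Pass only the readers the page needs.
    """
    if not readers:
        return {}
    workers = min(max_workers, len(readers))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {name: pool.submit(fn) for name, fn in readers.items()}
        # result() blocks until that read finishes and re-raises its exception.
        return {name: future.result() for name, future in futures.items()}


def fake_read(value, delay=0.2):
    """Stand-in for a network read: sleeps, then returns a value."""
    time.sleep(delay)
    return value


if __name__ == "__main__":
    caches = load_page_data({
        "index": lambda: fake_read("index data"),
        "docs": lambda: fake_read("docs data"),
        "home": lambda: fake_read("home data"),
    })
    print(sorted(caches))
```

With this shape, the total wait is roughly the slowest single read rather than the sum of all reads, which is where the bulk of the saving comes from.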
Dataset initialization
When a user creates a dataset, we do a number of initializations: queues, task databases, lambda functions etc. (see "$ > letsdata datasets view help" for details).
In this case, our continuous tests had created a dataset and the build for the dataset had failed. The initialization didn't know what to do in this case.
A couple of things:
About our continuous tests: The MVP that we've built has 2 different services that we can read from, and the reads can be in 5 different configurations. We have 6 destinations that we can write to, and 3-4 different ways we can specify what work needs to be done. So in total, we did some data generation and have 33 different combinations of read, write and work specifications. We wrote a test suite that creates a dataset from these 33 configurations every 20 mins, waits for it to complete initialization and start processing, and then deletes the dataset. This helps us catch issues before our customers run into them.
About the build failure: Looks like the build had some transient failure where one of the Maven dependencies was not found. A simple manual retry of the build fixed the issue.
This is a new issue and has happened twice in ~400 dataset runs thus far. Error signature copied to the code and a TODO added. As we see more of this, we'll either add an automated retry or fix the root cause of the failure.
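If we do go the automated retry route, it could look something like the Python sketch below. Everything here is an assumption for illustration: BuildFailed, run_with_retry and the signature string are hypothetical placeholders, not the actual error signature we recorded or the way our build is invoked.

```python
import time

# Hypothetical placeholder; the real recorded error signature would go here.
TRANSIENT_SIGNATURES = ("Could not resolve dependencies",)


class BuildFailed(Exception):
    """Raised by the (hypothetical) build step, carrying the build log excerpt."""


def run_with_retry(build, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call build(); retry with exponential backoff only on known transient errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return build()
        except BuildFailed as err:
            transient = any(sig in str(err) for sig in TRANSIENT_SIGNATURES)
            if not transient or attempt == max_attempts:
                raise  # unknown failure or out of attempts: surface it
            sleep(base_delay * 2 ** (attempt - 1))
```

Matching on a known signature keeps the retry narrow: a genuine build break still fails on the first attempt instead of being retried and hidden.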
So that is a founder's start to the day - a good coding workout before some of the other tasks!