<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[#Lets Data Newsletter]]></title><description><![CDATA[The newsletter for #LetsData - cloud infrastructure that simplifies how you process, analyze and transform data. You'll find interesting posts on technology, big data and cloud services.
]]></description><link>https://blog.letsdata.io</link><image><url>https://substackcdn.com/image/fetch/$s_!npUu!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc9f9f8d-5379-4537-bb95-7dcd2075aa02_129x129.png</url><title>#Lets Data Newsletter</title><link>https://blog.letsdata.io</link></image><generator>Substack</generator><lastBuildDate>Mon, 25 May 2026 11:17:00 GMT</lastBuildDate><atom:link href="https://blog.letsdata.io/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[The Resonance Labs LLC]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[letsdata@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[letsdata@substack.com]]></itunes:email><itunes:name><![CDATA[Usman]]></itunes:name></itunes:owner><itunes:author><![CDATA[Usman]]></itunes:author><googleplay:owner><![CDATA[letsdata@substack.com]]></googleplay:owner><googleplay:email><![CDATA[letsdata@substack.com]]></googleplay:email><googleplay:author><![CDATA[Usman]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Data Catalogs and Distributed Querying is now available on #LetsData]]></title><description><![CDATA[Today, we are announcing the integration of the #LetsData S3 Write Connectors with AWS Glue (Crawlers, Data Catalog) and AWS Athena (Distributed Querying).]]></description><link>https://blog.letsdata.io/p/data-catalogs-and-distributed-querying</link><guid isPermaLink="false">https://blog.letsdata.io/p/data-catalogs-and-distributed-querying</guid><dc:creator><![CDATA[Usman]]></dc:creator><pubDate>Thu, 11 Apr 2024 16:42:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/6d5f4c1c-e850-4111-bc0a-bc10969103d5_809x408.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today, we are announcing the integration of the #LetsData S3 Write Connectors with AWS Glue (Crawlers, Data Catalog) and AWS Athena 
(Distributed Querying).</p><p>With these integrations, #LetsData will now:</p><ul><li><p><strong>Crawl Data:</strong> Automatically configure and run AWS Glue Crawlers on a dataset&#8217;s S3 files</p></li><li><p><strong>Discover Schema:</strong> Discover the file data schema, partitioning and metadata</p></li><li><p><strong>Manage Data Catalogs:</strong> Add the schema, partitioning and metadata to the AWS Glue Data Catalog</p></li><li><p><strong>Data Governance:</strong> Automatically configure access and permissions for the catalog and S3 files</p></li><li><p><strong>Data Warehousing:</strong> Enable Distributed Querying support via AWS Athena</p></li></ul><p>The details of the S3 Write Connector Data Catalog integrations are in our <a href="https://www.letsdata.io/docs/data-catalog/">Data Catalog Docs</a>.</p><h2>Overview</h2><p>LetsData Write Destinations fall into one of two categories:</p><ul><li><p>permanent, durable data stores (such as S3, DynamoDB and Vector Indexes)</p></li><li><p>ephemeral data containers (such as streams and queues)</p></li></ul><p>The permanent, durable data stores can be enabled for analytical workloads (OLAP). LetsData enables this by running AWS Glue Crawlers on these data stores to discover the data schema, partitioning and metadata, which are then added to a data catalog / metastore (AWS Glue Data Catalog). 
Customers can run AWS Athena queries using the AWS Glue Data Catalog. These are distributed queries that enable on-demand, massively parallel processing of large datasets.</p><p>See the <a href="https://docs.aws.amazon.com/athena/latest/ug/when-should-i-use-ate.html">AWS Athena Docs</a> to understand the AWS analytics services and their usage, and the <a href="https://docs.aws.amazon.com/redshift/latest/dg/c_columnar_storage_disk_mem_mgmnt.html">Redshift Docs about Columnar Storage</a> to understand OLAP / OLTP differences.</p><h4>Highlights</h4><p>Highlights of LetsData's Data Catalog implementation:</p><ul><li><p><strong>Automatic Data Cataloging: </strong>LetsData can automatically catalog datasets that write to S3 destinations by running AWS Glue Crawlers on the S3 data files and creating databases and tables for LetsData users.</p></li><li><p><strong>Primitive Data Lake: </strong>Adding data to a Data Catalog, defining a permissioning model for access and enabling Distributed Queries essentially forms a Data Lake. (One comes across different terms when defining a data strategy for OLAP workloads: Data Warehouses, Data Lakes, Data Lakehouses etc. This <a href="https://www.proserveit.com/blog/data-warehouse-architecture-patterns">external link</a> explains the terminology.) The LetsData data catalog is a primitive data lake (not a data warehouse).</p></li><li><p><strong>Access and Permissioning: </strong>AWS Lake Formation is a permissioning model over the AWS Glue Data Catalog that vends automated temporary credentials to different services. Lake Formation has different user roles and allows for permissions delegation and sharing as well. 
As of now, the LetsData permissioning model allows data catalog access to the data owner only; we do not offer permission sharing / permissions delegation yet.</p></li></ul><h2>Architecture</h2><h4><strong>Automated Data Catalog</strong></h4><p>Here is the high-level architecture of how the automated data catalog is implemented:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GlnP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbf22e84-30ac-4b2d-a522-19bed1560bf7_1235x789.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GlnP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbf22e84-30ac-4b2d-a522-19bed1560bf7_1235x789.png 424w, https://substackcdn.com/image/fetch/$s_!GlnP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbf22e84-30ac-4b2d-a522-19bed1560bf7_1235x789.png 848w, https://substackcdn.com/image/fetch/$s_!GlnP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbf22e84-30ac-4b2d-a522-19bed1560bf7_1235x789.png 1272w, https://substackcdn.com/image/fetch/$s_!GlnP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbf22e84-30ac-4b2d-a522-19bed1560bf7_1235x789.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GlnP!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbf22e84-30ac-4b2d-a522-19bed1560bf7_1235x789.png" width="1200" height="766.6396761133603" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cbf22e84-30ac-4b2d-a522-19bed1560bf7_1235x789.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:789,&quot;width&quot;:1235,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:120309,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GlnP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbf22e84-30ac-4b2d-a522-19bed1560bf7_1235x789.png 424w, https://substackcdn.com/image/fetch/$s_!GlnP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbf22e84-30ac-4b2d-a522-19bed1560bf7_1235x789.png 848w, https://substackcdn.com/image/fetch/$s_!GlnP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbf22e84-30ac-4b2d-a522-19bed1560bf7_1235x789.png 1272w, https://substackcdn.com/image/fetch/$s_!GlnP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbf22e84-30ac-4b2d-a522-19bed1560bf7_1235x789.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Architecture Diagram: #LetsData Automated Data Catalog</figcaption></figure></div><h4><strong>Distributed Querying</strong></h4><p>Here are the high-level steps to query a dataset's data catalog tables:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!975T!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94d78937-5a1d-41d3-b5c0-cceab424982d_522x564.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!975T!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94d78937-5a1d-41d3-b5c0-cceab424982d_522x564.png 424w, 
https://substackcdn.com/image/fetch/$s_!975T!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94d78937-5a1d-41d3-b5c0-cceab424982d_522x564.png 848w, https://substackcdn.com/image/fetch/$s_!975T!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94d78937-5a1d-41d3-b5c0-cceab424982d_522x564.png 1272w, https://substackcdn.com/image/fetch/$s_!975T!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94d78937-5a1d-41d3-b5c0-cceab424982d_522x564.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!975T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94d78937-5a1d-41d3-b5c0-cceab424982d_522x564.png" width="522" height="564" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/94d78937-5a1d-41d3-b5c0-cceab424982d_522x564.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:564,&quot;width&quot;:522,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43714,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!975T!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94d78937-5a1d-41d3-b5c0-cceab424982d_522x564.png 424w, 
https://substackcdn.com/image/fetch/$s_!975T!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94d78937-5a1d-41d3-b5c0-cceab424982d_522x564.png 848w, https://substackcdn.com/image/fetch/$s_!975T!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94d78937-5a1d-41d3-b5c0-cceab424982d_522x564.png 1272w, https://substackcdn.com/image/fetch/$s_!975T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94d78937-5a1d-41d3-b5c0-cceab424982d_522x564.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">#LetsData Distributed Query Architecture</figcaption></figure></div><h2>Details</h2><h4><strong>Availability</strong></h4><p>Automated Data Catalog is currently available for S3 Write Connectors only (S3, S3 Aggregate File, S3 Spark). Automated Data Catalog is currently available for <em>resourceLocation: LetsData</em> only. <em>resourceLocation: Customer</em> requires some additional architectural redesign which we've deferred for now.</p><h4><strong>Enable Config </strong></h4><p>To enable automated Data Catalog, add the <code>"addToCatalog": true</code> attribute to the dataset's write connector config. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2qZD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe769a1a2-9ef6-4ded-81c6-344e8a0b41b3_569x314.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2qZD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe769a1a2-9ef6-4ded-81c6-344e8a0b41b3_569x314.png 424w, https://substackcdn.com/image/fetch/$s_!2qZD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe769a1a2-9ef6-4ded-81c6-344e8a0b41b3_569x314.png 848w, https://substackcdn.com/image/fetch/$s_!2qZD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe769a1a2-9ef6-4ded-81c6-344e8a0b41b3_569x314.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2qZD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe769a1a2-9ef6-4ded-81c6-344e8a0b41b3_569x314.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2qZD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe769a1a2-9ef6-4ded-81c6-344e8a0b41b3_569x314.png" width="569" height="314" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e769a1a2-9ef6-4ded-81c6-344e8a0b41b3_569x314.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:314,&quot;width&quot;:569,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:40619,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2qZD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe769a1a2-9ef6-4ded-81c6-344e8a0b41b3_569x314.png 424w, https://substackcdn.com/image/fetch/$s_!2qZD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe769a1a2-9ef6-4ded-81c6-344e8a0b41b3_569x314.png 848w, https://substackcdn.com/image/fetch/$s_!2qZD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe769a1a2-9ef6-4ded-81c6-344e8a0b41b3_569x314.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2qZD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe769a1a2-9ef6-4ded-81c6-344e8a0b41b3_569x314.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Example Add To Catalog Configuration: #LetsData Write Connector</figcaption></figure></div><h4><strong>Catalog Details</strong></h4><p>The dataset's initialization creates the catalog database and the Glue crawler, and sets up the crawl configuration. The dataset's write connector node is then updated with these catalog details. 
You'll need these in Athena queries. Use the CLI datasets view command to view the dataset details. Here are the example details:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!a8C7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb60f7f1-040c-4ff2-a460-f1ff7e8fa019_1116x568.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!a8C7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb60f7f1-040c-4ff2-a460-f1ff7e8fa019_1116x568.png 424w, https://substackcdn.com/image/fetch/$s_!a8C7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb60f7f1-040c-4ff2-a460-f1ff7e8fa019_1116x568.png 848w, https://substackcdn.com/image/fetch/$s_!a8C7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb60f7f1-040c-4ff2-a460-f1ff7e8fa019_1116x568.png 1272w, https://substackcdn.com/image/fetch/$s_!a8C7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb60f7f1-040c-4ff2-a460-f1ff7e8fa019_1116x568.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!a8C7!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb60f7f1-040c-4ff2-a460-f1ff7e8fa019_1116x568.png" width="1200" height="610.752688172043" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eb60f7f1-040c-4ff2-a460-f1ff7e8fa019_1116x568.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:568,&quot;width&quot;:1116,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:81034,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!a8C7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb60f7f1-040c-4ff2-a460-f1ff7e8fa019_1116x568.png 424w, https://substackcdn.com/image/fetch/$s_!a8C7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb60f7f1-040c-4ff2-a460-f1ff7e8fa019_1116x568.png 848w, https://substackcdn.com/image/fetch/$s_!a8C7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb60f7f1-040c-4ff2-a460-f1ff7e8fa019_1116x568.png 1272w, https://substackcdn.com/image/fetch/$s_!a8C7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb60f7f1-040c-4ff2-a460-f1ff7e8fa019_1116x568.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">#LetsData Catalog Details</figcaption></figure></div><p>Here is what the different attributes in the configuration mean:</p><ul><li><p><strong>awsGlueCrawlerName: </strong>The name of the AWS Glue crawler that was created by the dataset. You can use the AWS CLI / SDK to query details about the crawler and its crawl runs. See additional details in the crawler section below.</p></li><li><p><strong>awsGlueCatalogName: </strong>The name of the AWS Glue catalog that the crawler will store the schema in. This is currently set to the default AWS account catalog. You'll need this to query the catalog for table details and when running actual Athena queries.</p></li><li><p><strong>awsGlueDatabaseName: </strong>The name of the AWS Glue catalog database. Every LetsData user gets their own database, accessible only to them. The database is named <code>&lt;stack_name&gt;tenant&lt;tenant_id&gt;</code>. 
For example, a tenant with tenantId <em><strong>(d5feaf90-71a9-41ee-b1b9-35e4242c3155)</strong></em> on the <em><strong>prod</strong></em> stack would be assigned the database <code>prodtenantd5feaf90-71a9-41ee-b1b9-35e4242c3155</code>.</p></li><li><p><strong>awsGlueTableNamePrefix: </strong>The crawler sets up the crawl configuration and specifies the tableName prefix. The actual tableName(s) are obtained later by calling the <em>AWS Glue GetTables API</em> using credentials from the dataset's <code>customerAccessRoleArn</code>, with <code>awsGlueCatalogName / awsGlueDatabaseName</code> as the catalog name / database name respectively.</p></li><li><p><strong>athenaQueryOutputPath: </strong>So that querying works without any additional setup, we provision an S3 bucket path for each dataset, configured for that dataset's Athena queries. You can run Athena queries with this S3 bucket path as the results output path. The contents of the path are deleted upon dataset deletion (so do save any results as needed).</p></li></ul><h4><strong>Crawler Configuration</strong></h4><p>We've configured the crawler by default with settings that should work out of the box for most use cases. However, for those who need advanced options, here is how the crawler is set up:</p><ul><li><p>re-crawl everything on each run</p></li><li><p>schema change policies set to update in / delete from the database on schema updates / deletes</p></li><li><p>crawler lineage settings are disabled</p></li><li><p>tableGroupingPolicy to combine compatible schemas (and create a single table)</p></li></ul><p>If we should add some classifier support / additional options out of the box, do let us know and we can enable these <em>(<a href="mailto:support@letsdata.io">support@letsdata.io</a>)</em>. 
(&nbsp;<a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-crawling.html#aws-glue-api-crawler-crawling-Crawler">Crawler in AWS Docs</a>) </p><p>Here is an example crawler configuration:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KSNl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f2b6fa-cf93-4009-84eb-61a28cd1cdf9_817x658.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KSNl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f2b6fa-cf93-4009-84eb-61a28cd1cdf9_817x658.png 424w, https://substackcdn.com/image/fetch/$s_!KSNl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f2b6fa-cf93-4009-84eb-61a28cd1cdf9_817x658.png 848w, https://substackcdn.com/image/fetch/$s_!KSNl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f2b6fa-cf93-4009-84eb-61a28cd1cdf9_817x658.png 1272w, https://substackcdn.com/image/fetch/$s_!KSNl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f2b6fa-cf93-4009-84eb-61a28cd1cdf9_817x658.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KSNl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f2b6fa-cf93-4009-84eb-61a28cd1cdf9_817x658.png" width="817" height="658" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/74f2b6fa-cf93-4009-84eb-61a28cd1cdf9_817x658.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:658,&quot;width&quot;:817,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:85650,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KSNl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f2b6fa-cf93-4009-84eb-61a28cd1cdf9_817x658.png 424w, https://substackcdn.com/image/fetch/$s_!KSNl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f2b6fa-cf93-4009-84eb-61a28cd1cdf9_817x658.png 848w, https://substackcdn.com/image/fetch/$s_!KSNl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f2b6fa-cf93-4009-84eb-61a28cd1cdf9_817x658.png 1272w, https://substackcdn.com/image/fetch/$s_!KSNl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f2b6fa-cf93-4009-84eb-61a28cd1cdf9_817x658.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">#LetsData Glue Crawler Configuration</figcaption></figure></div><h2>Running Athena Queries</h2><p>Here are a few example commands that can be used to get details about a dataset's catalog and run Athena queries.</p><h4><strong>Access Credentials</strong></h4><p>Get the dataset's access credentials:</p><ul><li><p>We need the <code>customerAccessRoleArn</code> and <code>createDatetime</code> (used below as the external ID) from the dataset to be able to access the bucket files.</p></li></ul><pre><code><code>CUSTOMER_ACCESS_ROLE_ARN=`./letsdata datasets view --datasetName &lt;dataset_name&gt; --prettyPrint 2&gt; /dev/null | grep customerAccessRoleArn| sed 's/.* "//g'| sed 's/",//g'`

EXTERNAL_ID=`./letsdata datasets view --datasetName &lt;dataset_name&gt; --prettyPrint 2&gt; /dev/null | grep createDatetime|sed 's/^.*: //g'|sed 's/,//g'`</code></code></pre><ul><li><p>Suppose that for the current dataset, the <code>customerAccountForAccess</code> is <code>308240606591</code>, and that this account has an IAM Admin user whose credentials are stored in the <code>~/.aws/credentials</code> file as the profile <code>IamAdminUser308240606591</code>. Run the following AWS CLI command to get time-limited credentials, then save them to the <code>~/.aws/credentials</code> file under the <code>stsassumerole</code> profile.</p></li></ul><pre><code><code>output="/tmp/assume-role-output.json"

aws sts assume-role --role-arn $CUSTOMER_ACCESS_ROLE_ARN --external-id $EXTERNAL_ID --role-session-name 'AccessToDatasetCredentials' --profile IamAdminUser308240606591 &gt; $output

cat $output

    # output should have contents similar to the following
    # {
    #    "Credentials": {
    #        "AccessKeyId": "ASIATIBD24ZTWCHOBDHH",
    #        "SecretAccessKey": "W09pNZg...",
    #        "SessionToken": "FwoG...",
    #        "Expiration": "2024-02-08T01:29:20+00:00"
    #    },
    #    "AssumedRoleUser": {
    #        "AssumedRoleId": "AROATIBD24ZTZWJR5CZEW..."
    #    }
    # }
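
# One way to do the copy step below without hand-editing the file (a sketch,
# assuming jq is installed): `aws configure set` writes each value into the
# stsassumerole profile of ~/.aws/credentials
aws configure set aws_access_key_id "$(jq -r '.Credentials.AccessKeyId' "$output")" --profile stsassumerole
aws configure set aws_secret_access_key "$(jq -r '.Credentials.SecretAccessKey' "$output")" --profile stsassumerole
aws configure set aws_session_token "$(jq -r '.Credentials.SessionToken' "$output")" --profile stsassumerole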

# copy these to the ~/.aws/credentials file
cat ~/.aws/credentials

[IamAdminUser308240606591]
aws_region = us-east-1
aws_access_key_id = ASIATIBD24ZT3CRVCKVD
aws_secret_access_key = Bx8RVkAXfoGh5ERnL90ceftz5GJMWnxHx27uxTpS

[stsassumerole]
aws_region = us-east-1
aws_access_key_id = ASIATIBD24ZTWCHOBDHH
aws_secret_access_key = W09pNZg...
aws_session_token = FwoG...</code></code></pre><h4><strong>Crawler</strong></h4><p>Get the status of the crawler using the <code>get-crawler</code> CLI command, start a new crawler run using <code>start-crawler</code>, and stop a running crawler using <code>stop-crawler</code>. (The crawler name is in the dataset config as the <code>writeConnector.catalog.awsGlueCrawlerName</code> attribute.)</p><pre><code><code>  aws glue get-crawler --name testcrawler5855ca1062420abf35ed85b8c6eda82a --profile stsassumerole --region us-east-1

  aws glue start-crawler --name testcrawler5855ca1062420abf35ed85b8c6eda82a --profile stsassumerole --region us-east-1
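
  # Optionally wait for the run to finish (a sketch, not part of the original
  # walkthrough): the crawler's State returns to READY once the run completes
  while [ "$(aws glue get-crawler --name testcrawler5855ca1062420abf35ed85b8c6eda82a --profile stsassumerole --region us-east-1 --query 'Crawler.State' --output text)" != "READY" ]; do sleep 30; done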

  aws glue stop-crawler --name testcrawler5855ca1062420abf35ed85b8c6eda82a --profile stsassumerole --region us-east-1</code></code></pre><h4><strong>Table Details </strong></h4><p>The table(s) are created once the crawler has run successfully. You can get the details of the created tables using the <code>get-tables</code> CLI command. (The database name is in the dataset config as the <code>writeConnector.catalog.awsGlueDatabaseName</code> attribute.) Additionally, you can get the details of the database, partitions and partition indexes. </p><pre><code><code>  aws glue get-tables --database-name testtenantd5feaf90-71a9-41ee-b1b9-35e4242c3155 --profile stsassumerole --region us-east-1

  # This should return table details including the schema
  # {
  #     "TableList": [
  #         {
  #             "Name": "b8e4b3e8...",
  #             "DatabaseName": "testtenant...",
  #             ...
  #             "StorageDescriptor": {
  #                 "Columns": [
  #                     {
  #                         "Name": "language",
  #                         "Type": "string"
  #                     },
  #                     ...
  #                 ],
  #                 "Location": "s3://tldwc.../",
  #                 "InputFormat": "TextInputFormat",
  #                 "OutputFormat": "HiveIgnoreKeyTextOutputFormat",
  #                 "Compressed": true,
  #                 "NumberOfBuckets": -1,
  #                 "SerdeInfo": {
  #                     "SerializationLibrary": "JsonSerDe",
  #                     "Parameters": {
  #                         "paths": "samples,percentile,language"
  #                     }
  #                 },
  #                 "BucketColumns": [],
  #                 "SortColumns": [],
  #                 "Parameters": {
  #                     "sizeKey": "9953",
  #                     "objectCount": "2",
  #                     ...
  #                 },
  #                 "StoredAsSubDirectories": false
  #             },
  #             "PartitionKeys": [
  #                 {
  #                     "Name": "partition_0",
  #                     "Type": "string"
  #                 }
  #             ],
  #             "TableType": "EXTERNAL_TABLE",
  #             "Parameters": {
  #                 "sizeKey": "9953",
  #                 "objectCount": "2",
  #                 ...
  #             },
  #             ...
  #         }
  #     ]
  # }
  #
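
  # The database, partition and partition index details can be fetched the same way
  # (a sketch, assuming the access role permits these calls; &lt;table_name&gt; is a
  # placeholder for the "Name" returned by get-tables)
  aws glue get-database --name testtenantd5feaf90-71a9-41ee-b1b9-35e4242c3155 --profile stsassumerole --region us-east-1

  aws glue get-partitions --database-name testtenantd5feaf90-71a9-41ee-b1b9-35e4242c3155 --table-name &lt;table_name&gt; --profile stsassumerole --region us-east-1

  aws glue get-partition-indexes --database-name testtenantd5feaf90-71a9-41ee-b1b9-35e4242c3155 --table-name &lt;table_name&gt; --profile stsassumerole --region us-east-1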
            </code></code></pre><h4><strong>Run Athena Query </strong></h4><p>Using the schema returned by <code>get-tables</code>, construct a query to run. You can run the query using the <code>start-query-execution</code> CLI command. (The database name and catalog name are in the dataset config as the <code>writeConnector.catalog.awsGlueDatabaseName</code> and <code>writeConnector.catalog.awsGlueCatalogName</code> attributes. Additionally, use the <code>writeConnector.catalog.athenaQueryOutputPath</code> as the output location to store your results.)</p><pre><code><code>  # run the query
  aws athena start-query-execution --query-string \
 "SELECT language, COUNT(*) as recs \
 FROM b8e4b3e... \
 GROUP BY language \
 ORDER BY recs desc" \
 --query-execution-context Database=testtenant...,Catalog=223413462631 \
 --result-configuration OutputLocation=s3://test-.../athena-queries/ \
 --profile stsassumerole --region us-east-1

  # {
  #   "QueryExecutionId": "8ce557cd-39eb-414d-8cbd-104a1223b009"
  # }

  # use the query execution id to get the query execution details
  aws athena get-query-execution --query-execution-id 8ce557cd-39eb-414d-8cbd-104a1223b009 --profile stsassumerole --region us-east-1
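
  # Queries run asynchronously, so one option (a sketch, not part of the original
  # walkthrough) is to poll the state until it is SUCCEEDED, FAILED or CANCELLED
  # before fetching the results
  while true; do
    STATE=$(aws athena get-query-execution --query-execution-id 8ce557cd-39eb-414d-8cbd-104a1223b009 --profile stsassumerole --region us-east-1 --query 'QueryExecution.Status.State' --output text) || break
    case "$STATE" in SUCCEEDED|FAILED|CANCELLED) break ;; esac
    sleep 2
  done
  echo "query state: $STATE"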

  # use the query execution id to get the query results 
  aws athena get-query-results --query-execution-id 8ce557cd-39eb-414d-8cbd-104a1223b009 --profile stsassumerole --region us-east-1

  # list the athenaQueryOutputPath and download the results as needed
  aws s3 ls s3://test-.../athena-queries/ --profile stsassumerole --region us-east-1
  
  # copy the results file from S3
  aws s3 cp s3://test-.../athena-queries/8ce557cd-39eb-414d-8cbd-104a1223b009.csv . --profile stsassumerole --region us-east-1</code></code></pre><h2>Conclusions</h2><p>With automated AWS Glue integrations and AWS Athena enablement:</p><ul><li><p>we&#8217;ve defined primitive Data Cataloging, Data Warehousing and Data Governance capabilities on #LetsData</p></li><li><p> the AWS Glue Data Catalog enables a large number of AWS Data scenarios such as Redshift data warehousing, EMR processing and QuickSight analytics, to name a few</p></li></ul><p>This is a huge space and while what we&#8217;ve built barely scratches the surface, it is quite powerful. We&#8217;d love to learn about customer scenarios around Data Cataloging, Governance and Data Warehousing and how #LetsData may help solve some of these challenges. </p>]]></content:encoded></item><item><title><![CDATA[LetsData Control SDK is now available]]></title><description><![CDATA[The LetsData Control SDK is now publicly available.]]></description><link>https://blog.letsdata.io/p/letsdata-control-sdk-is-now-available</link><guid isPermaLink="false">https://blog.letsdata.io/p/letsdata-control-sdk-is-now-available</guid><dc:creator><![CDATA[Usman]]></dc:creator><pubDate>Mon, 11 Mar 2024 02:06:20 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/1fe15085-50cb-44aa-a244-a9f34a6ee302_398x398.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The <strong>LetsData Control SDK</strong> is now publicly available. </p><p>The Control SDK is what the LetsData CLI and the LetsData Website call to orchestrate and manage data pipelines on LetsData. Now, we&#8217;ve made our REST API available publicly for customers. Customers can build automation and develop their apps on Lets Data using familiar REST API semantics. 
</p><p>The documentation and a runnable Postman collection are available on Postman (<a href="https://documenter.getpostman.com/view/33034393/2sA2xh3D7y">LetsData Control SDK on Postman</a>). The permalink on the LetsData docs website is at  <a href="https://www.letsdata.io/docs/sdk-control-api/">https://www.letsdata.io/docs/sdk-control-api/</a> - however, the docs page is mostly a redirect to Postman. </p><h3>The Control API</h3><p>At a high level, the Control SDK API has the following resources. Some details about each resource and its supported actions are as follows: </p><ul><li><p><strong>Dataset: </strong>&nbsp;Datasets are a collection of data tasks grouped together as a logical entity. They can also be thought of as the data jobs that the user needs to run. A dataset will have tasks for the work items in the dataset. The SDK has a number of APIs to manage datasets. For example, you can create datasets, list datasets, view datasets, manage the dataset's lifecycle (descale / freeze / delete), manage the dataset&#8217;s code artifacts and update a dataset&#8217;s compute configuration. </p><ul><li><p><a href="https://documenter.getpostman.com/view/33034393/2sA2xh3D7y#8a1bd85c-21ed-45c2-9e8e-156d16ed9753">Datasets API on Postman</a></p></li></ul></li><li><p><strong>Tasks: </strong>Tasks are the system's representation of a work item that is executed by the compute engine (Lambda, Sagemaker, Spark etc). When a task executes, it reads from the read destination, calls the user data handlers and then writes to the write destination. During this execution, a task emits metrics, logs and error records. Each of these is a separate resource in the SDK, and these resources are tied to tasks by the taskId. 
As of now, the SDK defines APIs for listing and filtering a dataset&#8217;s tasks, redriving error tasks and stopping tasks.</p><ul><li><p><a href="https://documenter.getpostman.com/view/33034393/2sA2xh3D7y#36484879-1050-42fb-8c5f-a8afb700c3ee">Tasks API on Postman</a></p></li></ul></li><li><p><strong>Errors: </strong>Errors during the task execution in parsing records, record transformations or writing records are archived as error documents and classified as <em><strong>'Record Errors'</strong></em> in Let's Data. The SDK supports the list errors API to list the errors for tasks and a view API to view each individual error record file.&nbsp;(Do note that the Let's Data infrastructure errors and unhandled exceptions are classified as <em><strong>'Task Errors'</strong></em> and are handled differently. Error Handling docs have additional details: <a href="https://www.letsdata.io/docs/error-handling/">https://www.letsdata.io/docs/error-handling/</a>)</p><ul><li><p><a href="https://documenter.getpostman.com/view/33034393/2sA2xh3D7y#38f7886d-30f5-46d7-b3cd-9ea5d414abc7">Errors API on Postman</a></p></li></ul></li><li><p><strong>Logs:</strong> Each task writes its logs to files which essentially become a task's execution trace that is useful in debugging and monitoring. The SDK supports a view API to view a task&#8217;s log file. </p><ul><li><p><a href="https://documenter.getpostman.com/view/33034393/2sA2xh3D7y#6ce267fd-70eb-4800-8cfe-35507b8daa8f">Logs API on Postman</a></p></li></ul></li><li><p><strong>Metrics: </strong>Dataset execution comes with some system defined metrics dashboards that can be used to monitor a dataset&#8217;s progress, debug performance issues and take corrective actions. 
The SDK supports the view API to view different metrics dashboards for a dataset.</p><ul><li><p><a href="https://documenter.getpostman.com/view/33034393/2sA2xh3D7y#8b276cac-bd42-4379-bee2-aeea4e767df4">Metrics API on Postman</a></p></li></ul></li><li><p><strong>Usage Records: </strong>&nbsp;Dataset execution initializes different AWS resources - the write connector (e.g. Kinesis stream), the error connectors (e.g. S3 bucket) and different internal components such as queues, database tables, compute resources etc. We meter the usage of these resources for each dataset and create usage records. These usage records are then used in determining costs. The SDK supports an API to list these usage records.</p><ul><li><p><a href="https://documenter.getpostman.com/view/33034393/2sA2xh3D7y#bd9093ef-4eca-47f3-a68e-663b08747df3">Usage Records API on Postman</a></p></li></ul></li><li><p><strong>VPC: </strong>For datasets that read from / write to destinations backed by a managed cluster of machines, Lets Data manages these machines in a Virtual Private Cloud. The VPC resource supports a list&nbsp;API to list a dataset's VPC details. The VPC resource also supports a <em><strong>'vpcPeeringConnections'</strong></em> resource that has commands to accept (create) and delete VPC peering connections to the customer VPCs.&nbsp;</p><ul><li><p><a href="https://documenter.getpostman.com/view/33034393/2sA2xh3D7y#c5130c09-5a1e-4a5f-901b-dd5a7f48818d">VPC API on Postman</a></p></li></ul></li><li><p><strong>Users: </strong>&nbsp;A Tenant (company) in #Lets Data can create different user accounts (login credentials in #Lets Data) for different users in the company. These users can be assigned administrator / user roles and can run datasets individually. The SDK also supports user management APIs such as list, create, update and delete users. 
</p><ul><li><p><a href="https://documenter.getpostman.com/view/33034393/2sA2xh3D7y#6ec5d1ce-fb0e-4213-bd9b-4e7af3079fd7">Users API on Postman</a></p></li></ul></li><li><p><strong>Costs: </strong>Costs are a collection of the company's billing account information, the payment method on file, the pricing details, a list of invoices, their payment status and links to pay the invoices / download the invoices as PDF files. The SDK supports an API to list these cost details.</p><ul><li><p><a href="https://documenter.getpostman.com/view/33034393/2sA2xh3D7y#d2d214dd-f521-43e4-a001-89daf3db965b">Costs API on Postman</a></p></li></ul></li></ul><h3>Setup</h3><p>To start using the LetsData Control SDK, you need:</p><ul><li><p><strong>Login Credentials:</strong> A valid LetsData account (username and password). You can sign up for a LetsData account at: <a href="https://www.letsdata.io/#signup">https://www.letsdata.io/#signup</a></p></li><li><p><strong>ClientId:</strong> For anything beyond testing, you'll need a valid Client Id to use the LetsData Control SDK. You can request one by emailing <a href="mailto:support@letsdata.io">support@letsdata.io</a> or logging an issue at <a href="https://www.letsdata.io/#support">https://www.letsdata.io/#support</a> - we'll enable Control API access for the username/password. For testing and experimentation, you can use the shared clientId <code>6ent0fqtc4v5ud6i8o41ado8rj</code>. Since LetsData is a multi-tenant system, the clientId helps us differentiate API calls from different clients.</p></li></ul><h3>Authentication</h3><p>You'll need to obtain an <code>AccessToken</code> and an <code>IdToken</code> by calling AWS Cognito. </p><p>Here is a sample request and response:</p><ul><li><p>Save the POST body in a file <code>auth_data.json</code></p></li></ul><pre><code><code>{
    "AuthParameters" : {
        "USERNAME" : "{{LetsData Username}}",
        "PASSWORD" : "{{LetsData Password}}"
    },
    "AuthFlow" : "USER_PASSWORD_AUTH", 
    "ClientId" : "{{LetsData ClientId}}"
}</code></code></pre><ul><li><p>Make a POST request to AWS Cognito and save the output to the <code>creds.json</code> file</p></li></ul><pre><code><code>curl -X POST --data @auth_data.json \
    -H 'X-Amz-Target: AWSCognitoIdentityProviderService.InitiateAuth' \
    -H 'Content-Type: application/x-amz-json-1.1' \
    https://cognito-idp.us-east-1.amazonaws.com/ \
    --output creds.json</code></code></pre><ul><li><p>The response is saved in <code>creds.json</code> - a quick <code>cat creds.json | jq</code> shows the following JSON. Copy the <code>AccessToken</code> and <code>IdToken</code>; you'll need these for API calls. Also note <code>ExpiresIn</code>, the number of seconds for which the tokens are valid.</p></li></ul><pre><code><code>{
  "AuthenticationResult": {
    "AccessToken": "eyJraWQiOiJaQk...&lt;redacted&gt;",
    "ExpiresIn": 3600,
    "IdToken": "eyJraWQiOiJuSktcL1JN...&lt;redacted&gt;",
    "RefreshToken": "eyJjdHkiOiJKV1Qi...&lt;redacted&gt;",
    "TokenType": "Bearer"
  },
  "ChallengeParameters": {}
}</code></code></pre><h3>API Calls</h3><ul><li><p>You can call any of the LetsData Control SDK APIs by adding the <code>"Authorization: Bearer IdToken"</code> and <code>"LetsDataAuthorization: Bearer AccessToken"</code> headers, replacing <code>IdToken</code> and <code>AccessToken</code> with the token values from <code>creds.json</code>. Here is an example API call that does a GET to retrieve a dataset's details.</p></li></ul><pre><code><code>curl "https://www.letsdata.io/api/dataset?tenantId={{tenantId}}&amp;userId={{userId}}&amp;datasetName={{datasetName}}" \
    -H "Authorization: Bearer IdToken" \
    -H "LetsDataAuthorization: Bearer AccessToken"</code></code></pre><ul><li><p>Almost every Control SDK API requires the <code>tenantId</code>, the <code>userId</code> for the authenticated user (TenantAdmins can pass a different userId to retrieve data for other users in the organization - see <a href="https://www.letsdata.io/docs/user-management/#user-roles">user roles documentation</a>) and the <code>datasetName</code>. You can find your tenantId, userId via the website, CLI (<a href="https://www.letsdata.io/docs/access-grants/#create-access-grants-role">docs</a>) or decode the <code>IdToken</code> to get the tenant and user ids ( <a href="https://jwt.io/">jwt.io</a> has a decoder). Here is a decoded Id token - the <code>sub</code> field is the <code>userId</code> and the <code>custom:tenantid</code> is the <code>tenantId</code></p></li></ul><pre><code><code>{
  "sub": "accb3567-2b6e-41ae-b00d-6ce1f9a58d94",
  "custom:companyaddress": "{\"addressLine1\":\"1234 Some Street\",\"addressLine2\":\"Apt F8\",\"city\":\"Bellevue\",\"state\":\"WA\",\"country\":\"US\",\"postalCode\":\"98006\"}",
  "cognito:groups": [
    "Tenant-d5feaf90-71a9-41ee-b1b9-35e4242c3155-Users"
  ],
  "custom:userrole": "TenantAdmin",
  "iss": "https://cognito-idp.us-east-1.amazonaws.com/us-east-1_asdjery68Ts",
  "cognito:username": "user@letsdata.io",
  "custom:companyname": "LetsData IO",
  "origin_jti": "1a0038a2-f8dd-4a71-996e-e64cde31003c",
  "custom:tenantid": "d5feaf90-71a9-41ee-b1b9-35e4242c3155",
  "aud": "11bbm85f3niuukca8su98dqc2t",
  "event_id": "978d2a81-fa94-4fdf-a5ce-52240c02aaaa",
  "token_use": "id",
  "auth_time": 1708220818,
  "exp": 1708224418,
  "iat": 1708220818,
  "jti": "194095f3-f350-425b-ad84-c5f8e1dd67fc",
  "email": "user@letsdata.io"
}</code></code></pre><h3>Improvements</h3><p>For the longest time, we had resisted publicly releasing the Control SDK API, primarily because we had full featured CLI and Website clients that were adequately serving the current needs. The CLI is automation friendly, so building scripts on the CLI has been the recommended way to automate. We also provided private API access where needed.</p><p>So why are we publicly releasing it now?</p><p>I was recently in a technical conversation with a respected software architect about designing an API for a use case, and we had quite a thorough discussion about the API design and implementation. His experience in API design challenged some of my own beliefs about API design. This led me to review the LetsData Control API and measure it against a higher API Software Design bar. </p><p>Here are some issues that we believe need to be discussed in the context of the Control API release, along with possible improvements for a vNext API. </p><ul><li><p><strong>HTTP Verb Abuse</strong> - the API uses a total of 2 HTTP verbs, GET and POST, to get everything done. </p><ul><li><p>Updates and Deletes are done via POST on update/delete sub resources. We need to use PUT for creates, POST for updates and DELETE for deletions. </p></li><li><p>Get single item vs Get list of items is via GET on the resource and on the /list sub resource respectively. Item and List GET need to be disambiguated with the GET parameters.</p></li><li><p>Getting these right from the get go should not have cost any additional time, so this is a miss (sigh, what was I thinking at that time!) </p></li></ul></li><li><p><strong>Authentication</strong> - The API uses both the IdToken and the AccessToken for authentication, which seems a little non standard. Why are we doing it this way? 
I believe this is our security paranoia kicking in - on each API request, we do deeper validations on the tenant details, the user details and the clientId - essentially doing some redundant validations but making sure authenticated calls are not going to be a security issue (and we did not find a way to add custom attributes such as tenantId to the AWS Cognito access token, hence using both tokens). While this decision was reviewed for security, we could re-review it for simplification.</p></li><li><p><strong>API Simplification</strong> - There is a case to be made for parameter simplification - since <em>tenantId</em> and <em>userId</em> are already present in the authentication tokens, they can be removed from the API parameters (or made optional, to override the auth token values if needed). For example, <em>api/dataset?tenantId={{tid}}&amp;userId={{uid}}&amp;datasetName={{dName}}</em> could be simplified to <em>api/dataset/{{dName}}. </em>The latter does seem simpler, but I am ambivalent about this as of now - maybe we&#8217;ll run into some edge case that requires the tenant and user ids. Deferring to vNext for now.</p></li><li><p><strong>API Keys (lack thereof)</strong> - since we are allowing each customer a separate <em>clientId</em>, we believe that, as of now, the clientId is sufficient in place of an API Key. API Keys have a very nice integration in AWS API Gateway, where usage plans can be specified and the API Gateway can enforce these even before the request hits the web service code. However, we do not have such large scale quota or usage plan needs at this time - we can add any API Key specific logic (if needed) against the clientId (for a DDoS or other operational issue) and then introduce API Keys in the vNext of the API. 
As of now, there are no API usage rate limits or restrictions.</p></li><li><p><strong>Language Client SDKs</strong> - When we did our earlier startup (<a href="http://www.letsresonate.net">LetsResonate</a>), we had defined our API model using the API Gateway model. One nice thing about this definition was that the API Gateway generates language specific clients from it. However, writing that definition and its request and response mappings was time consuming and frustrating. This time around, we bypassed the model definitions and defined the request and response as pass through strings, which are serialized/deserialized by our web service code. As a result, we don&#8217;t have the rich language specific clients that are auto generated by the API Gateway. We will try to get our API definition into some defined format and generate language clients for customers. Until then, the REST HTTP API can be called directly. </p></li><li><p><strong>API Regional availability</strong> - API Gateway supports regional API availability. When we went from a US-EAST-1 region service to a multi-region service with availability in 6 regions around the globe, we made a conscious decision to keep the Control SDK API available only in the US-EAST-1 region. 
For a truly global service with optimal latencies, the API should be deployed regionally, but this requires either:</p><ul><li><p>DBs being in the same region as the regional API deployment (requires multi master replication and conflict resolution, which is easy to get wrong)</p><p>OR</p></li><li><p>having the API servers hold persistent connections to the DBs in the different regions (this is a challenge because our API servers are themselves Serverless and are initiated on demand)  </p></li><li><p>See the <a href="https://avinetworks.com/docs/latest/edge-proxy-design/">Edge Proxy Design</a> for benefits of API regional availability.</p></li></ul></li></ul><p>Some of these were conscious decisions: being a small startup, we prioritized a bias for action over a perfect API, and we do believe the decisions we made were correct - the API works well, and these fixes would be work with limited benefit. We&#8217;ll fix these when they become an issue / re-evaluate for vNext.</p><h3>Conclusion</h3><p>With the CLI, Website and API availability, it becomes very easy to get started and automate your data workflows on LetsData. 
We&#8217;d love to learn more about your data use case and your thoughts on how we can improve the API / developer experience.</p>]]></content:encoded></item><item><title><![CDATA[Spark is now available on LetsData]]></title><description><![CDATA[Today, we are announcing the availability of Apache Spark on LetsData.]]></description><link>https://blog.letsdata.io/p/spark-is-now-available-on-letsdata</link><guid isPermaLink="false">https://blog.letsdata.io/p/spark-is-now-available-on-letsdata</guid><dc:creator><![CDATA[Usman]]></dc:creator><pubDate>Fri, 16 Feb 2024 09:56:41 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f4ef9d94-c63e-4e16-ad00-f71a86fe79a1_809x408.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today, we are announcing the availability of Apache Spark on LetsData. You can now run Spark on LetsData, leveraging LetsData's Serverless task infrastructure built on AWS Lambda. You don&#8217;t need to create and manage clusters, job run schedules or any dedicated infrastructure. Your Spark code will just work out of the box - no jar issues, classpath problems or elaborate session and cluster configurations. LetsData Spark jobs create Lambda compute on demand and allow for elastic scale. </p><p>Here are the docs to get you started on LetsData Spark. 
</p><ul><li><p><a href="https://www.letsdata.io/docs/compute-engine/?tab=spark">Spark Compute Engine Docs</a>, <a href="https://www.letsdata.io/docs/read-connectors/?tab=s3-sparkreader">S3 Spark Read Connector Docs </a> and <a href="https://www.letsdata.io/docs/write-connectors/?tab=s3spark">S3 Spark Write Connector Docs</a> </p></li><li><p><strong>LetsData Spark Interfaces:</strong> <a href="https://github.com/lets-data/letsdata-data-interface/tree/main/src/main/java/com/resonance/letsdata/data/readers/interfaces/spark">Java</a>, <a href="https://github.com/lets-data/letsdata-python-interface/tree/main/letsdata_interfaces/readers/spark">Python</a> </p></li><li><p><strong>LetsData Spark Interface Implementation Examples:</strong>  <a href="https://github.com/lets-data/letsdata-common-crawl/tree/main/src/main/java/com/letsdata/commoncrawl/interfaces/implementations/spark">Java</a>, <a href="https://github.com/lets-data/letsdata-python-examples/tree/main/letsdata_interfaces/readers/spark">Python</a></p></li><li><p><a href="https://www.letsdata.io/docs/examples?tab=spark-extractandmapreduce">End to End Example - Spark Extract and Map Reduce</a> </p></li></ul><h2>How it Works?</h2><p>LetsData's Spark interfaces are inspired by the original Google Map Reduce paper. 
We have defined a <strong>MAPPER</strong> interface and a <strong>REDUCER</strong> interface.</p><p>Here is a Spark Dataset's Architecture Diagram and how the mapper and reducer tasks process a dataset using Spark.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SHfS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facdb8a90-aa90-4f1c-a02e-a2d9148a05a7_1769x292.png" data-component-name="Image2ToDOM"><div class="image2-inset image2-full-screen"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SHfS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facdb8a90-aa90-4f1c-a02e-a2d9148a05a7_1769x292.png 424w, https://substackcdn.com/image/fetch/$s_!SHfS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facdb8a90-aa90-4f1c-a02e-a2d9148a05a7_1769x292.png 848w, https://substackcdn.com/image/fetch/$s_!SHfS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facdb8a90-aa90-4f1c-a02e-a2d9148a05a7_1769x292.png 1272w, https://substackcdn.com/image/fetch/$s_!SHfS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facdb8a90-aa90-4f1c-a02e-a2d9148a05a7_1769x292.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SHfS!,w_5760,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facdb8a90-aa90-4f1c-a02e-a2d9148a05a7_1769x292.png" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/acdb8a90-aa90-4f1c-a02e-a2d9148a05a7_1769x292.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;full&quot;,&quot;height&quot;:240,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:87545,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-fullscreen" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SHfS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facdb8a90-aa90-4f1c-a02e-a2d9148a05a7_1769x292.png 424w, https://substackcdn.com/image/fetch/$s_!SHfS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facdb8a90-aa90-4f1c-a02e-a2d9148a05a7_1769x292.png 848w, https://substackcdn.com/image/fetch/$s_!SHfS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facdb8a90-aa90-4f1c-a02e-a2d9148a05a7_1769x292.png 1272w, https://substackcdn.com/image/fetch/$s_!SHfS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Facdb8a90-aa90-4f1c-a02e-a2d9148a05a7_1769x292.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption"><code>    Figure: Spark Dataset's Architecture Diagram (runSparkInterfaces: MAPPER_AND_REDUCER)</code></figcaption></figure></div><p>Recall that a dataset's amount of work is defined by a manifest file. 
For example, for S3 read destination, this is essentially a list of files that need to be processed.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UMT7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2f55c13-6423-46ba-8331-4e711539f787_1069x191.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UMT7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2f55c13-6423-46ba-8331-4e711539f787_1069x191.png 424w, https://substackcdn.com/image/fetch/$s_!UMT7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2f55c13-6423-46ba-8331-4e711539f787_1069x191.png 848w, https://substackcdn.com/image/fetch/$s_!UMT7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2f55c13-6423-46ba-8331-4e711539f787_1069x191.png 1272w, https://substackcdn.com/image/fetch/$s_!UMT7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2f55c13-6423-46ba-8331-4e711539f787_1069x191.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UMT7!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2f55c13-6423-46ba-8331-4e711539f787_1069x191.png" width="1200" height="214.40598690364826" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a2f55c13-6423-46ba-8331-4e711539f787_1069x191.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:191,&quot;width&quot;:1069,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:87585,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UMT7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2f55c13-6423-46ba-8331-4e711539f787_1069x191.png 424w, https://substackcdn.com/image/fetch/$s_!UMT7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2f55c13-6423-46ba-8331-4e711539f787_1069x191.png 848w, https://substackcdn.com/image/fetch/$s_!UMT7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2f55c13-6423-46ba-8331-4e711539f787_1069x191.png 1272w, https://substackcdn.com/image/fetch/$s_!UMT7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2f55c13-6423-46ba-8331-4e711539f787_1069x191.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption">An example manifest file</figcaption></figure></div><p>The manifest file above specifies 4 files that need to be processed.</p><h4><strong>Mapper Tasks</strong></h4><ul><li><p>Each file listed in the manifest becomes a separate Mapper Task in LetsData.</p></li><li><p>Each Mapper Task runs the mapper interface code. 
</p></li><li><p>The mapper interface implements single-partition operations (narrow transformations).</p></li><li><p>The DataFrame returned by the Mapper Task is written to S3 as an intermediate file by LetsData.</p></li><li><p>In this example case, <code>4 files in the manifest -&gt; 4 Mapper Tasks -&gt; 4 intermediate files.</code></p></li><li><p>The interfaces themselves are quite simple. Here is the mapper interface: </p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_IkN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc589612f-07cf-41b6-84c0-7dc6b930d466_1331x1083.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_IkN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc589612f-07cf-41b6-84c0-7dc6b930d466_1331x1083.png 424w, https://substackcdn.com/image/fetch/$s_!_IkN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc589612f-07cf-41b6-84c0-7dc6b930d466_1331x1083.png 848w, https://substackcdn.com/image/fetch/$s_!_IkN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc589612f-07cf-41b6-84c0-7dc6b930d466_1331x1083.png 1272w, https://substackcdn.com/image/fetch/$s_!_IkN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc589612f-07cf-41b6-84c0-7dc6b930d466_1331x1083.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!_IkN!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc589612f-07cf-41b6-84c0-7dc6b930d466_1331x1083.png" width="1200" height="976.4087152516904" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c589612f-07cf-41b6-84c0-7dc6b930d466_1331x1083.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1083,&quot;width&quot;:1331,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:285018,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_IkN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc589612f-07cf-41b6-84c0-7dc6b930d466_1331x1083.png 424w, https://substackcdn.com/image/fetch/$s_!_IkN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc589612f-07cf-41b6-84c0-7dc6b930d466_1331x1083.png 848w, https://substackcdn.com/image/fetch/$s_!_IkN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc589612f-07cf-41b6-84c0-7dc6b930d466_1331x1083.png 1272w, https://substackcdn.com/image/fetch/$s_!_IkN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc589612f-07cf-41b6-84c0-7dc6b930d466_1331x1083.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The SparkMapperInterface - Spark Read Uri, Read Options, Write Uri, Write Options and an AWS Secrets Manager ARN are passed in as function parameters</figcaption></figure></div><ul><li><p>Here is a sample implementation that reads the web crawl archive files and extracts documents, which are written to S3 as an intermediate file:</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8ICz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686d0184-7aa8-4300-95f8-9627743aebbb_1342x888.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8ICz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686d0184-7aa8-4300-95f8-9627743aebbb_1342x888.png 424w, https://substackcdn.com/image/fetch/$s_!8ICz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686d0184-7aa8-4300-95f8-9627743aebbb_1342x888.png 848w, https://substackcdn.com/image/fetch/$s_!8ICz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686d0184-7aa8-4300-95f8-9627743aebbb_1342x888.png 1272w, https://substackcdn.com/image/fetch/$s_!8ICz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686d0184-7aa8-4300-95f8-9627743aebbb_1342x888.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8ICz!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686d0184-7aa8-4300-95f8-9627743aebbb_1342x888.png" width="1200" height="794.0387481371088" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/686d0184-7aa8-4300-95f8-9627743aebbb_1342x888.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:888,&quot;width&quot;:1342,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:269766,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!8ICz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686d0184-7aa8-4300-95f8-9627743aebbb_1342x888.png 424w, https://substackcdn.com/image/fetch/$s_!8ICz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686d0184-7aa8-4300-95f8-9627743aebbb_1342x888.png 848w, https://substackcdn.com/image/fetch/$s_!8ICz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686d0184-7aa8-4300-95f8-9627743aebbb_1342x888.png 1272w, https://substackcdn.com/image/fetch/$s_!8ICz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686d0184-7aa8-4300-95f8-9627743aebbb_1342x888.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">SparkMapperInterface - example implementation that uses the default session, read and write functions. Customers write only their business logic code.</figcaption></figure></div><h4><strong>Reducer Task</strong></h4><ul><li><p>LetsData creates a reducer task for any reduce operations for the dataset.</p></li><li><p>The reducer reads the intermediate files from the mapper phase and performs any multi-partition operations (shuffles, aggregates, joins or similar wide transformations) on them.</p></li><li><p>The dataset returned by the Reducer Task is written to the dataset's write destination.</p></li><li><p>In this example case, <code>4 intermediate files from the mapper tasks -&gt; 1 output file</code> by the reducer task.</p></li><li><p>Here is the reducer interface: </p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UK9n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65f2304a-3ab4-482b-a5d4-3d9df39693ef_2138x1124.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UK9n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65f2304a-3ab4-482b-a5d4-3d9df39693ef_2138x1124.png 424w, 
https://substackcdn.com/image/fetch/$s_!UK9n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65f2304a-3ab4-482b-a5d4-3d9df39693ef_2138x1124.png 848w, https://substackcdn.com/image/fetch/$s_!UK9n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65f2304a-3ab4-482b-a5d4-3d9df39693ef_2138x1124.png 1272w, https://substackcdn.com/image/fetch/$s_!UK9n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65f2304a-3ab4-482b-a5d4-3d9df39693ef_2138x1124.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UK9n!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65f2304a-3ab4-482b-a5d4-3d9df39693ef_2138x1124.png" width="1200" height="630.4945054945055" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/65f2304a-3ab4-482b-a5d4-3d9df39693ef_2138x1124.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:765,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:324621,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UK9n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65f2304a-3ab4-482b-a5d4-3d9df39693ef_2138x1124.png 424w, 
https://substackcdn.com/image/fetch/$s_!UK9n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65f2304a-3ab4-482b-a5d4-3d9df39693ef_2138x1124.png 848w, https://substackcdn.com/image/fetch/$s_!UK9n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65f2304a-3ab4-482b-a5d4-3d9df39693ef_2138x1124.png 1272w, https://substackcdn.com/image/fetch/$s_!UK9n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65f2304a-3ab4-482b-a5d4-3d9df39693ef_2138x1124.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">SparkReducerInterface - Spark Read URIs, Write Uri and options are passed as parameters</figcaption></figure></div><ul><li><p>Here is a sample implementation that reads the intermediate files from S3, computes the reduce operations and writes the results to S3:</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xuNy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a1ea97-0dbe-4259-b399-158687ac71b7_1342x435.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xuNy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a1ea97-0dbe-4259-b399-158687ac71b7_1342x435.png 424w, https://substackcdn.com/image/fetch/$s_!xuNy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a1ea97-0dbe-4259-b399-158687ac71b7_1342x435.png 848w, https://substackcdn.com/image/fetch/$s_!xuNy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a1ea97-0dbe-4259-b399-158687ac71b7_1342x435.png 1272w, https://substackcdn.com/image/fetch/$s_!xuNy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a1ea97-0dbe-4259-b399-158687ac71b7_1342x435.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xuNy!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a1ea97-0dbe-4259-b399-158687ac71b7_1342x435.png" width="1200" 
height="388.97168405365124" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f8a1ea97-0dbe-4259-b399-158687ac71b7_1342x435.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:435,&quot;width&quot;:1342,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:104458,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xuNy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a1ea97-0dbe-4259-b399-158687ac71b7_1342x435.png 424w, https://substackcdn.com/image/fetch/$s_!xuNy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a1ea97-0dbe-4259-b399-158687ac71b7_1342x435.png 848w, https://substackcdn.com/image/fetch/$s_!xuNy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a1ea97-0dbe-4259-b399-158687ac71b7_1342x435.png 1272w, https://substackcdn.com/image/fetch/$s_!xuNy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a1ea97-0dbe-4259-b399-158687ac71b7_1342x435.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">SparkReducerInterface - example implementation that uses the default read, write and session functions. Customers write only their business logic code. </figcaption></figure></div><h3>Config</h3><p>You can configure the LetsData datasets to run <strong>MAPPER_AND_REDUCER </strong>tasks,<strong> MAPPER_ONLY </strong>tasks or <strong>REDUCER_ONLY </strong>tasks. 
Here is the config schema for Spark Compute Engine:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HKqt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313f7d53-fcfd-43a2-b35d-b62ea13fa3a3_1656x256.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HKqt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313f7d53-fcfd-43a2-b35d-b62ea13fa3a3_1656x256.png 424w, https://substackcdn.com/image/fetch/$s_!HKqt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313f7d53-fcfd-43a2-b35d-b62ea13fa3a3_1656x256.png 848w, https://substackcdn.com/image/fetch/$s_!HKqt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313f7d53-fcfd-43a2-b35d-b62ea13fa3a3_1656x256.png 1272w, https://substackcdn.com/image/fetch/$s_!HKqt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313f7d53-fcfd-43a2-b35d-b62ea13fa3a3_1656x256.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HKqt!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313f7d53-fcfd-43a2-b35d-b62ea13fa3a3_1656x256.png" width="1200" height="185.43956043956044" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/313f7d53-fcfd-43a2-b35d-b62ea13fa3a3_1656x256.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:225,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:94476,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HKqt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313f7d53-fcfd-43a2-b35d-b62ea13fa3a3_1656x256.png 424w, https://substackcdn.com/image/fetch/$s_!HKqt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313f7d53-fcfd-43a2-b35d-b62ea13fa3a3_1656x256.png 848w, https://substackcdn.com/image/fetch/$s_!HKqt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313f7d53-fcfd-43a2-b35d-b62ea13fa3a3_1656x256.png 1272w, https://substackcdn.com/image/fetch/$s_!HKqt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F313f7d53-fcfd-43a2-b35d-b62ea13fa3a3_1656x256.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Configuration for the Spark Compute Engine</figcaption></figure></div><h3>High-level Overview of How Spark works on Clusters</h3><ul><li><p>At a high level, Spark is usually run on a cluster of machines where the cluster resources are divided among the <code>spark master</code>, <code>driver</code> and <code>worker</code> processes. 
</p></li><li><p>Each of these <code>processes is allocated resources</code> (memory, CPU, etc.) according to its expected workload. </p></li><li><p>A Spark <code>cluster can run many applications</code>; the master schedules each application with a driver and some workers. </p></li><li><p>A Spark cluster can run as many applications as its driver / worker divisions allow. For example, in a 4-machine cluster where each machine's memory is divided among 4 workers, the cluster has 16 workers (assuming that drivers and masters have already been accounted for). If each application requests 8 workers, we can run 2 applications concurrently on such a cluster. <code>Additional applications will wait for resources to be available</code>.</p></li><li><p>The application's driver coordinates the application run and manages the workers. It sends the tasks that need to be done to the workers. For narrow transformations (single partition), the task is contained within a worker. For wide transformations (shuffle, repartition), the driver coordinates data from different partitions and assigns the resulting reduce tasks to workers. Essentially, the <code>driver is the brain that coordinates the data processing</code> and efficiently manages the workers.</p></li></ul><h3><strong>How does LetsData do it?</strong> </h3><ul><li><p>LetsData uses the same primitive constructs, but stitches the dataset workflow together a little differently. Here is what we do:</p><ul><li><p><strong>Scope Work:</strong> Since the files to be read are specified in the manifest, LetsData knows up front the number of readers (tasks) that it needs to initialize. We currently map each file to a single mapper task.</p></li><li><p><strong>Mapper Task:</strong> Each of these mapper tasks runs as a DataTask function on AWS Lambda. We install Spark on each Lambda function in standalone mode. 
This essentially means that each function (10 GB memory, 10 GB disk) now has its own master, driver and workers for this task (which maps to a single file). Since each task is a self-contained Spark instance that sees only one file, it can correctly perform only narrow transformations; any cross-partition action, such as a repartition, will lead to incomplete results. The user's Spark code is run to compute the intermediate result for the file / partition. This<code> intermediate result is stored in S3</code>, where the reducer task will read it.</p></li><li><p><strong>Reducer Task:</strong> We create a reducer task that runs the user's wide transformations (cross-partition code). This task is again a DataTask on AWS Lambda with Spark installed (10 GB memory, 10 GB disk), so it has its own master, driver and workers. It reads all the intermediate files and computes the final result using the user's code. This final result is written to the write destination.</p></li></ul></li><li><p>For now, we've traded the smarts that Spark has built for computing an efficient query execution plan across a whole dataset for task-level efficiencies only. Our current belief is that, with workers available elastically, parallelism can deliver adequate performance. With usage-based costing, we don&#8217;t pay for idle clusters and we don't wait on a fully saturated cluster, so on the surface the cost efficiencies seem beneficial. Theoretically this seems like a good architecture (and we've seen good results on the tests that we have done); however, we'd like to run it with larger datasets to see how the system performs and what we can improve. In the future, the standalone Spark instances could be replaced with workers, and LetsData could become the brain (driver) for the dataset and send work to the workers (a bigger project for another day). 
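</p><p>The mapper / reducer split described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the LetsData or Spark API: the in-memory <code>MANIFEST</code>, the <code>mapper</code> and <code>reducer</code> functions and the sample records are all hypothetical, and the nearest-rank percentile is just one common definition. It shows why the per-file step can run in isolation while the aggregate needs every intermediate result:</p><pre>
```python
import math
from collections import defaultdict

# Hypothetical stand-in for the four files listed in a manifest.
# Each "file" is a list of (language, contentLength) records.
MANIFEST = {
    "crawl-1": [("en", 100), ("fr", 300)],
    "crawl-2": [("en", 200), ("en", 400)],
    "crawl-3": [("fr", 500), ("de", 50)],
    "crawl-4": [("en", 300), ("de", 150)],
}


def mapper(records):
    """Narrow step: touches one file only, so it can run in isolation."""
    return [{"language": lang, "contentLength": size} for lang, size in records]


def reducer(intermediate_files):
    """Wide step: needs every intermediate file before it can aggregate."""
    lengths_by_language = defaultdict(list)
    for documents in intermediate_files:
        for doc in documents:
            lengths_by_language[doc["language"]].append(doc["contentLength"])

    result = {}
    for language, lengths in lengths_by_language.items():
        lengths.sort()
        # Nearest-rank 90th percentile (an assumption; other definitions exist).
        rank = math.ceil(0.9 * len(lengths))
        result[language] = lengths[rank - 1]
    return result


# One mapper task per file, then a single reducer task over all outputs.
intermediate = [mapper(records) for records in MANIFEST.values()]
p90_by_language = reducer(intermediate)
print(len(intermediate), p90_by_language)
```
</pre><p>Because <code>mapper</code> never looks beyond its own file, each call could run in a separate function instance; only <code>reducer</code> needs all four outputs in one place. Running a per-language percentile inside <code>mapper</code> instead would give a different answer for each file, which is the incomplete-result problem noted above.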
</p></li><li><p><strong>Defaults:</strong> We've built in default Spark sessions, default read and write code and task management that should make getting started really easy. JAR incompatibilities, CLASSPATH issues and other setup and runtime issues are handled in our standardized images, so users should not have to deal with them.</p></li></ul><ul><li><p><strong>Errors, Logs, Metrics and Checkpointing: </strong>Spark tasks do not yet support the LetsData record errors (where individual record errors are sent to the error destination as JSON error detail files) or the task checkpointing infrastructure (where tasks periodically checkpoint their progress and support resume-from-last-checkpoint semantics). So, to answer the general question about errors, logs, metrics and checkpointing in Spark tasks:</p><ul><li><p><strong>Logging:</strong> A logger is available; logs are sent to CloudWatch and made available.</p></li><li><p><strong>Metrics:</strong> There is no metrics support within the mapper interface yet. We generate some high-level metrics and hope to enable the Spark metrics soon.</p></li><li><p><strong>Errors:</strong> The LetsData dataset errors are currently not available for Spark interfaces, so you decide how you want to deal with errors (write separate error files using the same write credentials, log them to the log file, etc.). Any unhandled / terminal failures can be thrown as exceptions; the task will record these and transition to the error state.</p></li><li><p><strong>Checkpointing:</strong> The LetsData checkpointing and restart from checkpoints are not available yet for Spark interfaces. An interface either runs to completion or fails, in which case the intermediate progress isn't used for reduce. 
If rerun, the tasks will run from the beginning and overwrite any intermediate progress.</p></li></ul></li></ul><h3>Example</h3><p>A working step-by-step example for Spark on LetsData is available on the <a href="https://www.letsdata.io/docs/examples/?tab=spark-extractandmapreduce">LetsData Docs Website</a>. Here is the problem description that the example solves:</p><p><em>&#8220;we'll read files (web crawl archive files) from S3 using Spark code and extract the web crawl header and the web page content as a LetsData Document. We'll then map reduce these documents using Spark to compute the 90th percentile contentLength grouped by language and write the results as a json document to S3&#8221;</em></p><h3>Conclusions</h3><p>We are amazed by Spark and the scale, ease and facilities it provides for data processing. With the simplifications and serverless support that we&#8217;ve added, we are fairly excited about what we have built. </p><p>We&#8217;d love customers to try it out and let us know what works and what doesn&#8217;t so we can improve. We&#8217;d also like to know how folks have productized Spark for their data processing and any pros and cons of their work with Spark. </p><p>Thoughts / comments?  
Let us know.</p>]]></content:encoded></item><item><title><![CDATA[S3 Write Connector Improvements]]></title><description><![CDATA[Today, we are launching an S3 Aggregate File write connector - customers can now seamlessly create aggregate files on S3 from #LetsData processed records.]]></description><link>https://blog.letsdata.io/p/s3-write-connector-improvements</link><guid isPermaLink="false">https://blog.letsdata.io/p/s3-write-connector-improvements</guid><dc:creator><![CDATA[Usman]]></dc:creator><pubDate>Sun, 24 Dec 2023 22:45:56 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/7a4be06c-4654-4956-890f-12b3718d4e01_398x398.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today, we are launching an <strong>S3 Aggregate File</strong> write connector - customers can now seamlessly create aggregate files on S3 from #LetsData processed records. You can read about the S3 Aggregate File write connector at <a href="https://www.letsdata.io/docs/write-connectors/?tab=s3aggregatefile">#LetsData Docs</a>.</p><h3>Existing S3 Write Connector and Aggregate Files</h3><p>#LetsData already supported an S3 Write Connector where each processed record is written as an individual file on S3. 
Here is what the files from this Write Connector look like: </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qp2x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d3ad471-57c6-4d49-9e27-f5dc207f8974_1551x262.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qp2x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d3ad471-57c6-4d49-9e27-f5dc207f8974_1551x262.png 424w, https://substackcdn.com/image/fetch/$s_!qp2x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d3ad471-57c6-4d49-9e27-f5dc207f8974_1551x262.png 848w, https://substackcdn.com/image/fetch/$s_!qp2x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d3ad471-57c6-4d49-9e27-f5dc207f8974_1551x262.png 1272w, https://substackcdn.com/image/fetch/$s_!qp2x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d3ad471-57c6-4d49-9e27-f5dc207f8974_1551x262.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qp2x!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d3ad471-57c6-4d49-9e27-f5dc207f8974_1551x262.png" width="1200" height="202.74725274725276" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5d3ad471-57c6-4d49-9e27-f5dc207f8974_1551x262.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:246,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:52119,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qp2x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d3ad471-57c6-4d49-9e27-f5dc207f8974_1551x262.png 424w, https://substackcdn.com/image/fetch/$s_!qp2x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d3ad471-57c6-4d49-9e27-f5dc207f8974_1551x262.png 848w, https://substackcdn.com/image/fetch/$s_!qp2x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d3ad471-57c6-4d49-9e27-f5dc207f8974_1551x262.png 1272w, https://substackcdn.com/image/fetch/$s_!qp2x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d3ad471-57c6-4d49-9e27-f5dc207f8974_1551x262.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption">#LetsData individual record files where each record is written to its own file</figcaption></figure></div><p>While this works great, a large number of scenarios require aggregate files. It is common to have large files in S3 which have multiple records. 
For example, the individual files from S3 Write Connector above can be coalesced into an  aggregate file as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!E1qS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc5178d-3cf6-4120-b0a1-0df500f576b4_1550x168.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!E1qS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc5178d-3cf6-4120-b0a1-0df500f576b4_1550x168.png 424w, https://substackcdn.com/image/fetch/$s_!E1qS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc5178d-3cf6-4120-b0a1-0df500f576b4_1550x168.png 848w, https://substackcdn.com/image/fetch/$s_!E1qS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc5178d-3cf6-4120-b0a1-0df500f576b4_1550x168.png 1272w, https://substackcdn.com/image/fetch/$s_!E1qS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc5178d-3cf6-4120-b0a1-0df500f576b4_1550x168.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!E1qS!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc5178d-3cf6-4120-b0a1-0df500f576b4_1550x168.png" width="1200" height="130.21978021978023" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3dc5178d-3cf6-4120-b0a1-0df500f576b4_1550x168.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:158,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:42832,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!E1qS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc5178d-3cf6-4120-b0a1-0df500f576b4_1550x168.png 424w, https://substackcdn.com/image/fetch/$s_!E1qS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc5178d-3cf6-4120-b0a1-0df500f576b4_1550x168.png 848w, https://substackcdn.com/image/fetch/$s_!E1qS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc5178d-3cf6-4120-b0a1-0df500f576b4_1550x168.png 1272w, https://substackcdn.com/image/fetch/$s_!E1qS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dc5178d-3cf6-4120-b0a1-0df500f576b4_1550x168.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption">#LetsData S3 Aggregate File where multiple records are written to a single file</figcaption></figure></div><p>This is what we are launching today - you can now seamlessly create aggregate files with the #LetsData S3 Aggregate File Write Connector.</p><h3>S3 Aggregate File Write Connector</h3><p>S3 Aggregate File Write Connector is simple to 
get started with - define a write connector configuration and create a dataset - the processed records will automatically start getting aggregated into files on S3. Here is the write configuration schema:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EKr6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbda2f701-885d-41dc-993c-84093631213e_638x285.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EKr6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbda2f701-885d-41dc-993c-84093631213e_638x285.png 424w, https://substackcdn.com/image/fetch/$s_!EKr6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbda2f701-885d-41dc-993c-84093631213e_638x285.png 848w, https://substackcdn.com/image/fetch/$s_!EKr6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbda2f701-885d-41dc-993c-84093631213e_638x285.png 1272w, https://substackcdn.com/image/fetch/$s_!EKr6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbda2f701-885d-41dc-993c-84093631213e_638x285.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EKr6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbda2f701-885d-41dc-993c-84093631213e_638x285.png" width="638" height="285" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bda2f701-885d-41dc-993c-84093631213e_638x285.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:285,&quot;width&quot;:638,&quot;resizeWidth&quot;:638,&quot;bytes&quot;:52771,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EKr6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbda2f701-885d-41dc-993c-84093631213e_638x285.png 424w, https://substackcdn.com/image/fetch/$s_!EKr6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbda2f701-885d-41dc-993c-84093631213e_638x285.png 848w, https://substackcdn.com/image/fetch/$s_!EKr6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbda2f701-885d-41dc-993c-84093631213e_638x285.png 1272w, https://substackcdn.com/image/fetch/$s_!EKr6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbda2f701-885d-41dc-993c-84093631213e_638x285.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Example S3 Aggregate File Write Connector Configuration</figcaption></figure></div><p>The notable configuration elements are:</p><ul><li><p><strong>fileRecordsSeparator:</strong> String - (Required) The string delimiter to separate the records in the file. For example, "\n".</p></li><li><p><strong>aggregateFileSizeInMB:</strong> Integer - (Optional) The write connector's output file size in MB in S3. Allowed values: [10-128]. Defaults to 128 MB.</p></li></ul><h3>Implementation</h3><p>The implementation had a few different challenges, which should make for interesting reading.</p><h4>Streaming uploads on S3</h4><p>One interesting challenge was how to do streaming uploads to S3 files - here are some issues:</p><ul><li><p><strong>S3 Objects are Immutable: </strong>S3 objects are essentially immutable. Once you&#8217;ve created a file, it cannot be modified, only overwritten by a new file or deleted. 
This works great for data uploads where data files already exist, but doesn&#8217;t work as well for streaming scenarios where data is continuously being created. </p></li><li><p><strong>Streaming scenarios require storage that supports appends</strong>: In streaming scenarios, the content arrives over time and needs to be appended to some storage container (file, database table etc). With S3, you need to have the content to be able to create an object - appends don&#8217;t work as is. (For some background on distributed file systems, maybe look at the <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf">Google File System paper</a>)   </p></li><li><p> <strong>Multipart Uploads:</strong> S3 does have a very feature-rich API that allows for almost all kinds of scenarios - in this case, the <a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html">multipart uploads API</a> seems like the recommended option - relevant snippets from the docs:</p><blockquote><p><em><strong>Pause and resume object uploads</strong> &#8211; You can upload object parts over time. After you initiate a multipart upload, there is no expiry; you must explicitly complete or stop the multipart upload.</em></p><p><em><strong>Begin an upload before you know the final object size</strong> &#8211; You can upload an object as you are creating it.</em></p></blockquote></li></ul><p>In our implementation, we simply store the data on the local filesystem and initiate an upload once the file has reached the aggregation threshold. Although this seems to work well for now, we might see:</p><ul><li><p> some issues from <strong>staleness</strong> in true streaming scenarios (as opposed to batch streaming). 
This is because we aggregate records up to a certain size before making them available in the datastore (S3).</p></li><li><p><strong>delayed checkpoints</strong> - although we are writing to the local filesystem, we aren&#8217;t checkpointing until those records are in S3. This can cause delays in checkpoints (checkpoints happen after, say, 300 MB of data has been processed). This can also make the solution prone to wasted work and larger recoveries. More on this below. </p></li></ul><h4>Enhanced Checkpointing </h4><p>In our data task, we have reader threads that read the data, compute threads that perform computations and writer threads that write to the write destination. Each of these functions is decoupled from the others - readers could have read 100 records (read pointer at 100), compute could have processed 50 records (compute pointer at 50) and writers could have processed 30 records (write pointer at 30). Here is a pictorial representation:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!amgJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd849a56a-9e3e-4e46-83a2-04dd928ea040_1018x220.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!amgJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd849a56a-9e3e-4e46-83a2-04dd928ea040_1018x220.png 424w, https://substackcdn.com/image/fetch/$s_!amgJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd849a56a-9e3e-4e46-83a2-04dd928ea040_1018x220.png 848w, 
https://substackcdn.com/image/fetch/$s_!amgJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd849a56a-9e3e-4e46-83a2-04dd928ea040_1018x220.png 1272w, https://substackcdn.com/image/fetch/$s_!amgJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd849a56a-9e3e-4e46-83a2-04dd928ea040_1018x220.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!amgJ!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd849a56a-9e3e-4e46-83a2-04dd928ea040_1018x220.png" width="1200" height="259.3320235756385" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d849a56a-9e3e-4e46-83a2-04dd928ea040_1018x220.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:220,&quot;width&quot;:1018,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:17316,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!amgJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd849a56a-9e3e-4e46-83a2-04dd928ea040_1018x220.png 424w, https://substackcdn.com/image/fetch/$s_!amgJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd849a56a-9e3e-4e46-83a2-04dd928ea040_1018x220.png 848w, 
https://substackcdn.com/image/fetch/$s_!amgJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd849a56a-9e3e-4e46-83a2-04dd928ea040_1018x220.png 1272w, https://substackcdn.com/image/fetch/$s_!amgJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd849a56a-9e3e-4e46-83a2-04dd928ea040_1018x220.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The #LetsData Data Task with different threads and buffered records</figcaption></figure></div><p>In this case, checkpoint calc is simple - when the writer flushes the 30 records in its buffer, it sends an ack back to compute and readers to trim those records from their buffers and record a checkpoint at 30. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BUdW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcb0768-eb58-47c1-b469-29b9ff6bdf08_1117x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BUdW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcb0768-eb58-47c1-b469-29b9ff6bdf08_1117x200.png 424w, https://substackcdn.com/image/fetch/$s_!BUdW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcb0768-eb58-47c1-b469-29b9ff6bdf08_1117x200.png 848w, https://substackcdn.com/image/fetch/$s_!BUdW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcb0768-eb58-47c1-b469-29b9ff6bdf08_1117x200.png 1272w, 
https://substackcdn.com/image/fetch/$s_!BUdW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcb0768-eb58-47c1-b469-29b9ff6bdf08_1117x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BUdW!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcb0768-eb58-47c1-b469-29b9ff6bdf08_1117x200.png" width="1200" height="214.86123545210384" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3fcb0768-eb58-47c1-b469-29b9ff6bdf08_1117x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:200,&quot;width&quot;:1117,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:19062,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BUdW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcb0768-eb58-47c1-b469-29b9ff6bdf08_1117x200.png 424w, https://substackcdn.com/image/fetch/$s_!BUdW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcb0768-eb58-47c1-b469-29b9ff6bdf08_1117x200.png 848w, https://substackcdn.com/image/fetch/$s_!BUdW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcb0768-eb58-47c1-b469-29b9ff6bdf08_1117x200.png 1272w, 
https://substackcdn.com/image/fetch/$s_!BUdW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fcb0768-eb58-47c1-b469-29b9ff6bdf08_1117x200.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Writer Flushes the Buffered Records and sends acks for rec 30</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!c33W!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81aaa20b-2aac-448f-be15-c66a9b616495_1096x211.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!c33W!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81aaa20b-2aac-448f-be15-c66a9b616495_1096x211.png 424w, https://substackcdn.com/image/fetch/$s_!c33W!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81aaa20b-2aac-448f-be15-c66a9b616495_1096x211.png 848w, https://substackcdn.com/image/fetch/$s_!c33W!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81aaa20b-2aac-448f-be15-c66a9b616495_1096x211.png 1272w, https://substackcdn.com/image/fetch/$s_!c33W!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81aaa20b-2aac-448f-be15-c66a9b616495_1096x211.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!c33W!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81aaa20b-2aac-448f-be15-c66a9b616495_1096x211.png" width="1200" 
height="231.02189781021897" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/81aaa20b-2aac-448f-be15-c66a9b616495_1096x211.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:211,&quot;width&quot;:1096,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:18092,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!c33W!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81aaa20b-2aac-448f-be15-c66a9b616495_1096x211.png 424w, https://substackcdn.com/image/fetch/$s_!c33W!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81aaa20b-2aac-448f-be15-c66a9b616495_1096x211.png 848w, https://substackcdn.com/image/fetch/$s_!c33W!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81aaa20b-2aac-448f-be15-c66a9b616495_1096x211.png 1272w, https://substackcdn.com/image/fetch/$s_!c33W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81aaa20b-2aac-448f-be15-c66a9b616495_1096x211.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Records are trimmed and Checkpoint is created at 30</figcaption></figure></div><p>In the S3 aggregated file writer case, there are two different flushes that need to happen before a checkpoint is created. 1.) the flushing of write records to the local filesystem 2.) 
the flushing of the file to S3.</p><p>We&#8217;d be writing records to a local file, and since we are aggregating, the flush-to-S3 thresholds are larger (for example 300 MB). If we wait for the S3 flush to create the acks and checkpoints, the reader and compute buffers would each be holding 300 MB of data unnecessarily. (Contrived example, our implementation is quite efficient). Ideally, we can remove these 300 MB of records from all buffers once they have been flushed to the local file. The checkpoints can be created once the flush to S3 has completed. (So if the process crashes before the S3 flush is complete, we&#8217;ll redo the 300 MB of work, which is okay for now.)</p><p>This meant that our acks and checkpoints needed to be separated - the local filesystem writes drive the acks and the writes to S3 create the checkpoints. A bit of rework and some testing later, the enhanced checkpointing is working quite well.   </p><h3>Conclusion</h3><p>We now support writing aggregate files as well as individual files to S3 and have a variety of <a href="https://www.letsdata.io/docs/read-connectors/">stateless and stateful readers</a> to read data from S3. In addition, we are low latency and high throughput and are built around the AWS best practices for S3 - features such as backups, versioning, secure transport, audit trails, access controls etc. are available by default with the #LetsData pipelines. </p><p>We&#8217;d love to know how customers might be using S3 in their data architectures and how we can evolve our S3 connectors to better serve them.</p><p>Thoughts / Feedback? 
Let us know!</p>]]></content:encoded></item><item><title><![CDATA[DynamoDB Read Connectors now available!]]></title><description><![CDATA[Today, we are launching read connectors for processing data in DynamoDB Tables and Streams.]]></description><link>https://blog.letsdata.io/p/dynamodb-read-connectors-now-available</link><guid isPermaLink="false">https://blog.letsdata.io/p/dynamodb-read-connectors-now-available</guid><dc:creator><![CDATA[Usman]]></dc:creator><pubDate>Sat, 23 Dec 2023 03:07:50 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/12c9ce69-2310-4ad0-906f-86265fb0122d_300x151.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today, we are launching read connectors for processing data in DynamoDB Tables and Streams. At a glance, here is what we are launching:</p><ul><li><p><strong>DynamoDB Table Read Connector:</strong> Scans the table and sends the records to the compute engine / write connectors. </p><ul><li><p><a href="https://www.letsdata.io/docs/read-connectors/?tab=dynamodb-tablereader">LetsData Docs</a></p></li><li><p>Interface (<a href="https://github.com/lets-data/letsdata-data-interface/blob/main/src/main/java/com/resonance/letsdata/data/readers/interfaces/dynamodb/DynamoDBTableItemReader.java">Java</a>, <a href="https://github.com/lets-data/letsdata-python-interface/blob/main/letsdata_interfaces/readers/dynamodb/DynamoDBTableItemReader.py">Python</a>, <a href="https://github.com/lets-data/letsdata-javascript-interface/blob/main/letsdata_interfaces/readers/dynamodb/DynamoDBTableItemReader.js">Javascript</a>) and Example Implementations (<a href="https://github.com/lets-data/letsdata-common-crawl/blob/main/src/main/java/com/letsdata/commoncrawl/interfaces/implementations/dynamodb/CommonCrawlDDBItemReader.java">Java</a>, <a href="https://github.com/lets-data/letsdata-python-examples/blob/main/letsdata_interfaces/readers/dynamodb/DynamoDBTableItemReader.py">Python</a>, <a 
href="https://github.com/lets-data/letsdata-javascript-examples/blob/main/letsdata_interfaces/readers/dynamodb/DynamoDBTableItemReader.js">JavaScript</a>)</p></li></ul></li><li><p><strong>DynamoDB Streams Read Connector:</strong> Similar to the Kinesis Read Connector, this read connector reads from DynamoDB Streams and processes the records according to the customer&#8217;s dataset configuration.</p><ul><li><p><a href="https://www.letsdata.io/docs/read-connectors/?tab=dynamodb-streamsreader">LetsData Docs</a></p></li><li><p>Interface (<a href="https://github.com/lets-data/letsdata-data-interface/blob/main/src/main/java/com/resonance/letsdata/data/readers/interfaces/dynamodbstreams/DynamoDBStreamsRecordReader.java">Java</a>, <a href="https://github.com/lets-data/letsdata-python-interface/blob/main/letsdata_interfaces/readers/dynamodbstreams/DynamoDBStreamsRecordReader.py">Python</a>, <a href="https://github.com/lets-data/letsdata-javascript-interface/blob/main/letsdata_interfaces/readers/dynamodbstreams/DynamoDBStreamsRecordReader.js">JavaScript</a>) and Example Implementations (<a href="https://github.com/lets-data/letsdata-common-crawl/blob/main/src/main/java/com/letsdata/commoncrawl/interfaces/implementations/dynamodbstreams/CommonCrawlDDBStreamReader.java">Java</a>, <a href="https://github.com/lets-data/letsdata-python-examples/blob/main/letsdata_interfaces/readers/dynamodbstreams/DynamoDBStreamsRecordReader.py">Python</a>, <a href="https://github.com/lets-data/letsdata-javascript-examples/blob/main/letsdata_interfaces/readers/dynamodbstreams/DynamoDBStreamsRecordReader.js">JavaScript</a>)</p></li></ul></li></ul><p>Details for these connectors are as follows:</p><h3>DynamoDB Table Read Connector</h3><p>The DynamoDB Table Read Connector scans a DynamoDB table. You can use the #LetsData connector to scan DynamoDB tables for scenarios such as populating caches, running Gen AI pipelines or running aggregation jobs. 
</p><p>Here are some notable implementation highlights:</p><ul><li><p><strong>Configurable Scanning Speed:</strong> Customers can configure the scanning speed by specifying the number of reader tasks (parallel scan segments). In conjunction with the configured concurrency, the connector can scan the table aggressively or run as a low-priority scan.</p></li><li><p><strong>Item Filtering:</strong> We&#8217;ve enabled support for DynamoDB&#8217;s filter expressions to filter the items that are read.</p></li><li><p><strong>Scan Completion Conditions:</strong> We&#8217;ve enabled a couple of modes for scan completion:</p><ul><li><p><strong>SingleTableScan:</strong> The dataset completes once a single scan of the table is completed.</p></li><li><p><strong>Continuous:</strong> The dataset continuously scans the table until the dataset errors, is stopped or is deleted.</p></li></ul></li><li><p><strong>Simple Data Interfaces:</strong> We&#8217;ve defined a simple interface that customers can implement, and it is available in our supported languages:</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4anM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6be913c1-58f6-4ae1-8404-7155ffe0bb1a_1048x49.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4anM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6be913c1-58f6-4ae1-8404-7155ffe0bb1a_1048x49.png 424w, https://substackcdn.com/image/fetch/$s_!4anM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6be913c1-58f6-4ae1-8404-7155ffe0bb1a_1048x49.png 848w, 
https://substackcdn.com/image/fetch/$s_!4anM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6be913c1-58f6-4ae1-8404-7155ffe0bb1a_1048x49.png 1272w, https://substackcdn.com/image/fetch/$s_!4anM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6be913c1-58f6-4ae1-8404-7155ffe0bb1a_1048x49.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4anM!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6be913c1-58f6-4ae1-8404-7155ffe0bb1a_1048x49.png" width="1200" height="56.10687022900763" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6be913c1-58f6-4ae1-8404-7155ffe0bb1a_1048x49.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:49,&quot;width&quot;:1048,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:19253,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4anM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6be913c1-58f6-4ae1-8404-7155ffe0bb1a_1048x49.png 424w, https://substackcdn.com/image/fetch/$s_!4anM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6be913c1-58f6-4ae1-8404-7155ffe0bb1a_1048x49.png 848w, 
https://substackcdn.com/image/fetch/$s_!4anM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6be913c1-58f6-4ae1-8404-7155ffe0bb1a_1048x49.png 1272w, https://substackcdn.com/image/fetch/$s_!4anM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6be913c1-58f6-4ae1-8404-7155ffe0bb1a_1048x49.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6Dhp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97bd5dd3-4e17-49b4-ad17-6880d704369b_925x72.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6Dhp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97bd5dd3-4e17-49b4-ad17-6880d704369b_925x72.png 424w, https://substackcdn.com/image/fetch/$s_!6Dhp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97bd5dd3-4e17-49b4-ad17-6880d704369b_925x72.png 848w, https://substackcdn.com/image/fetch/$s_!6Dhp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97bd5dd3-4e17-49b4-ad17-6880d704369b_925x72.png 1272w, https://substackcdn.com/image/fetch/$s_!6Dhp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97bd5dd3-4e17-49b4-ad17-6880d704369b_925x72.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!6Dhp!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97bd5dd3-4e17-49b4-ad17-6880d704369b_925x72.png" width="1200" height="93.4054054054054" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/97bd5dd3-4e17-49b4-ad17-6880d704369b_925x72.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:72,&quot;width&quot;:925,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:23506,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6Dhp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97bd5dd3-4e17-49b4-ad17-6880d704369b_925x72.png 424w, https://substackcdn.com/image/fetch/$s_!6Dhp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97bd5dd3-4e17-49b4-ad17-6880d704369b_925x72.png 848w, https://substackcdn.com/image/fetch/$s_!6Dhp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97bd5dd3-4e17-49b4-ad17-6880d704369b_925x72.png 1272w, https://substackcdn.com/image/fetch/$s_!6Dhp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97bd5dd3-4e17-49b4-ad17-6880d704369b_925x72.png 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!8uIe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78ecffcf-8472-49a0-94d3-acc6a4cb9511_898x83.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8uIe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78ecffcf-8472-49a0-94d3-acc6a4cb9511_898x83.png 424w, https://substackcdn.com/image/fetch/$s_!8uIe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78ecffcf-8472-49a0-94d3-acc6a4cb9511_898x83.png 848w, https://substackcdn.com/image/fetch/$s_!8uIe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78ecffcf-8472-49a0-94d3-acc6a4cb9511_898x83.png 1272w, https://substackcdn.com/image/fetch/$s_!8uIe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78ecffcf-8472-49a0-94d3-acc6a4cb9511_898x83.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8uIe!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78ecffcf-8472-49a0-94d3-acc6a4cb9511_898x83.png" width="1200" height="110.91314031180401" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/78ecffcf-8472-49a0-94d3-acc6a4cb9511_898x83.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:83,&quot;width&quot;:898,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:20422,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8uIe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78ecffcf-8472-49a0-94d3-acc6a4cb9511_898x83.png 424w, https://substackcdn.com/image/fetch/$s_!8uIe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78ecffcf-8472-49a0-94d3-acc6a4cb9511_898x83.png 848w, https://substackcdn.com/image/fetch/$s_!8uIe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78ecffcf-8472-49a0-94d3-acc6a4cb9511_898x83.png 1272w, https://substackcdn.com/image/fetch/$s_!8uIe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78ecffcf-8472-49a0-94d3-acc6a4cb9511_898x83.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Getting started with the Dynamo DB Table Read Connector is simple. 
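</p><p>The scanning highlights above (parallel scan segments plus an optional filter expression) map onto parameters of DynamoDB&#8217;s own <code>Scan</code> operation. As a rough illustration only, and not the #LetsData implementation, here is a minimal Python sketch of how one scan request per segment could be built and paginated. The helper names <code>build_segment_scan_requests</code> and <code>scan_segment</code>, the table name <code>my-table</code> and the <code>doc_language</code> attribute are hypothetical, and <code>client</code> is assumed to behave like a boto3 DynamoDB client.</p><pre><code class="language-python">from typing import Any, Dict, Optional


def build_segment_scan_requests(
    table_name: str,
    total_segments: int,
    filter_expression: Optional[str] = None,
    expression_values: Optional[Dict[str, Any]] = None,
):
    """Build one Scan request per parallel scan segment (hypothetical helper)."""
    # DynamoDB accepts TotalSegments values from 1 to 1,000,000.
    if total_segments not in range(1, 1_000_001):
        raise ValueError("total_segments must be between 1 and 1,000,000")
    requests_per_segment = []
    for segment in range(total_segments):
        request = {
            "TableName": table_name,
            "Segment": segment,
            "TotalSegments": total_segments,
        }
        # The filter is optional; only add it when one was supplied.
        if filter_expression:
            request["FilterExpression"] = filter_expression
        if expression_values:
            request["ExpressionAttributeValues"] = expression_values
        requests_per_segment.append(request)
    return requests_per_segment


def scan_segment(client: Any, request: Dict[str, Any]):
    """Yield every item of one segment by following LastEvaluatedKey pages."""
    request = dict(request)  # copy so the caller's request is not mutated
    while True:
        page = client.scan(**request)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            return
        request["ExclusiveStartKey"] = last_key


# Example: four reader tasks, each scanning one segment and keeping only
# items whose (hypothetical) doc_language attribute equals "english".
segment_requests = build_segment_scan_requests(
    "my-table",
    total_segments=4,
    filter_expression="doc_language = :lang",
    expression_values={":lang": {"S": "english"}},
)
# With boto3 installed and AWS credentials configured, one task could run:
#   import boto3
#   for item in scan_segment(boto3.client("dynamodb"), segment_requests[0]):
#       print(item)
</code></pre><p>In this sketch, raising the segment count together with the concurrency gives a more aggressive scan, while a single segment behaves like a low-priority scan. With those mechanics in mind, the connector setup itself comes down to the following steps.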
</p><ul><li><p>Implement the interface. Here is a simple Python implementation:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CQV4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8b49c23-98a7-4971-9582-503a161d0a89_1051x589.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CQV4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8b49c23-98a7-4971-9582-503a161d0a89_1051x589.png 424w, https://substackcdn.com/image/fetch/$s_!CQV4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8b49c23-98a7-4971-9582-503a161d0a89_1051x589.png 848w, https://substackcdn.com/image/fetch/$s_!CQV4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8b49c23-98a7-4971-9582-503a161d0a89_1051x589.png 1272w, https://substackcdn.com/image/fetch/$s_!CQV4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8b49c23-98a7-4971-9582-503a161d0a89_1051x589.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CQV4!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8b49c23-98a7-4971-9582-503a161d0a89_1051x589.png" width="1200" height="672.5023786869648" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d8b49c23-98a7-4971-9582-503a161d0a89_1051x589.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:589,&quot;width&quot;:1051,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:135963,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CQV4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8b49c23-98a7-4971-9582-503a161d0a89_1051x589.png 424w, https://substackcdn.com/image/fetch/$s_!CQV4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8b49c23-98a7-4971-9582-503a161d0a89_1051x589.png 848w, https://substackcdn.com/image/fetch/$s_!CQV4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8b49c23-98a7-4971-9582-503a161d0a89_1051x589.png 1272w, https://substackcdn.com/image/fetch/$s_!CQV4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8b49c23-98a7-4971-9582-503a161d0a89_1051x589.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div></li>
<li><p>Define the <strong>read connector</strong> (tableName and artifact) and the <strong>manifest</strong> for the DynamoDB table. In this example, we select <em>English</em> language items from the DynamoDB table by specifying the <code>readerFilterExpression</code>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AUTe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f06b564-023a-477f-9e5d-9f30af816f1d_2101x1379.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AUTe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f06b564-023a-477f-9e5d-9f30af816f1d_2101x1379.png 424w, 
https://substackcdn.com/image/fetch/$s_!AUTe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f06b564-023a-477f-9e5d-9f30af816f1d_2101x1379.png 848w, https://substackcdn.com/image/fetch/$s_!AUTe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f06b564-023a-477f-9e5d-9f30af816f1d_2101x1379.png 1272w, https://substackcdn.com/image/fetch/$s_!AUTe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f06b564-023a-477f-9e5d-9f30af816f1d_2101x1379.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AUTe!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f06b564-023a-477f-9e5d-9f30af816f1d_2101x1379.png" width="1200" height="787.9120879120879" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1f06b564-023a-477f-9e5d-9f30af816f1d_2101x1379.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:956,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:738049,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AUTe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f06b564-023a-477f-9e5d-9f30af816f1d_2101x1379.png 424w, 
https://substackcdn.com/image/fetch/$s_!AUTe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f06b564-023a-477f-9e5d-9f30af816f1d_2101x1379.png 848w, https://substackcdn.com/image/fetch/$s_!AUTe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f06b564-023a-477f-9e5d-9f30af816f1d_2101x1379.png 1272w, https://substackcdn.com/image/fetch/$s_!AUTe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f06b564-023a-477f-9e5d-9f30af816f1d_2101x1379.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>Scanning a table reliably at scale is an interesting challenge, and what we&#8217;ve built is pretty compelling in our humble opinion. We have some futuristic ideas on how to further improve the DynamoDB Table read connector - we&#8217;d love to hear about users&#8217; experiences so that we can iterate and improve.</p></li></ul><h3>DynamoDB Streams Read Connector</h3><p>DynamoDB Streams are fantastic! When they were launched, I remember being amazed by the new development scenarios that streams enabled - database architectures instantly transformed from querying models to event-driven architectures. With the #LetsData DynamoDB Streams read connector, you can process these events at scale. Here are some notable callouts:</p><ul><li><p><strong>Database Changelog:</strong> DynamoDB Streams are essentially a <code>changelog</code> where every insert/update/delete is available as an ordered stream of records.</p></li><li><p><strong>Kinesis / DynamoDB Streams:</strong> The #LetsData implementation of the DynamoDB Streams Read Connector is very similar to the Kinesis Read Connector, with the notable difference being the availability of <code>before modification</code> and <code>after modification</code> item images in the stream record. This leads to a slightly more verbose #LetsData interface.</p></li><li><p><strong>Designing for Ephemeral Shards:</strong> One interesting challenge in developing for Kinesis / DynamoDB streams was that #LetsData defines a single task for each Kinesis / DynamoDB shard. However, the shards are ephemeral - a shard can split into child shards or adjacent shards can be merged into a single shard. This essentially means that existing tasks complete and new tasks need to be created for the new shards. 
On each task&#8217;s completion, we now run code that detects shards that haven&#8217;t been assigned to tasks and creates new tasks if needed.</p><ul><li><p>In all our use cases thus far, we had defined the complete work specification at dataset creation time. Ephemeral shards have made us monitor for changes in the work specification and add / remove tasks as needed.</p></li></ul></li><li><p><strong>LetsData Interfaces</strong>: The #LetsData interfaces are defined in the supported languages as follows:</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_QH_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46c99796-52ea-43cb-9a7d-0361d406cefd_1188x86.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_QH_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46c99796-52ea-43cb-9a7d-0361d406cefd_1188x86.png 424w, https://substackcdn.com/image/fetch/$s_!_QH_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46c99796-52ea-43cb-9a7d-0361d406cefd_1188x86.png 848w, https://substackcdn.com/image/fetch/$s_!_QH_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46c99796-52ea-43cb-9a7d-0361d406cefd_1188x86.png 1272w, https://substackcdn.com/image/fetch/$s_!_QH_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46c99796-52ea-43cb-9a7d-0361d406cefd_1188x86.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!_QH_!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46c99796-52ea-43cb-9a7d-0361d406cefd_1188x86.png" width="1200" height="86.86868686868686" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/46c99796-52ea-43cb-9a7d-0361d406cefd_1188x86.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:86,&quot;width&quot;:1188,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:33467,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_QH_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46c99796-52ea-43cb-9a7d-0361d406cefd_1188x86.png 424w, https://substackcdn.com/image/fetch/$s_!_QH_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46c99796-52ea-43cb-9a7d-0361d406cefd_1188x86.png 848w, https://substackcdn.com/image/fetch/$s_!_QH_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46c99796-52ea-43cb-9a7d-0361d406cefd_1188x86.png 1272w, https://substackcdn.com/image/fetch/$s_!_QH_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46c99796-52ea-43cb-9a7d-0361d406cefd_1188x86.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" 
target="_blank" href="https://substackcdn.com/image/fetch/$s_!OmgH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55a662f5-6900-4ef0-a2e9-d8728d1d5801_991x86.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OmgH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55a662f5-6900-4ef0-a2e9-d8728d1d5801_991x86.png 424w, https://substackcdn.com/image/fetch/$s_!OmgH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55a662f5-6900-4ef0-a2e9-d8728d1d5801_991x86.png 848w, https://substackcdn.com/image/fetch/$s_!OmgH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55a662f5-6900-4ef0-a2e9-d8728d1d5801_991x86.png 1272w, https://substackcdn.com/image/fetch/$s_!OmgH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55a662f5-6900-4ef0-a2e9-d8728d1d5801_991x86.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OmgH!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55a662f5-6900-4ef0-a2e9-d8728d1d5801_991x86.png" width="1200" height="104.1372351160444" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/55a662f5-6900-4ef0-a2e9-d8728d1d5801_991x86.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:86,&quot;width&quot;:991,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:33869,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OmgH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55a662f5-6900-4ef0-a2e9-d8728d1d5801_991x86.png 424w, https://substackcdn.com/image/fetch/$s_!OmgH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55a662f5-6900-4ef0-a2e9-d8728d1d5801_991x86.png 848w, https://substackcdn.com/image/fetch/$s_!OmgH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55a662f5-6900-4ef0-a2e9-d8728d1d5801_991x86.png 1272w, https://substackcdn.com/image/fetch/$s_!OmgH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55a662f5-6900-4ef0-a2e9-d8728d1d5801_991x86.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FS6l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de7579-b641-4d57-98a6-5b0879878fd4_771x59.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FS6l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de7579-b641-4d57-98a6-5b0879878fd4_771x59.png 424w, https://substackcdn.com/image/fetch/$s_!FS6l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de7579-b641-4d57-98a6-5b0879878fd4_771x59.png 848w, https://substackcdn.com/image/fetch/$s_!FS6l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de7579-b641-4d57-98a6-5b0879878fd4_771x59.png 1272w, https://substackcdn.com/image/fetch/$s_!FS6l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de7579-b641-4d57-98a6-5b0879878fd4_771x59.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FS6l!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de7579-b641-4d57-98a6-5b0879878fd4_771x59.png" width="1200" height="91.82879377431907" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/33de7579-b641-4d57-98a6-5b0879878fd4_771x59.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:59,&quot;width&quot;:771,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:22855,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!FS6l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de7579-b641-4d57-98a6-5b0879878fd4_771x59.png 424w, https://substackcdn.com/image/fetch/$s_!FS6l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de7579-b641-4d57-98a6-5b0879878fd4_771x59.png 848w, https://substackcdn.com/image/fetch/$s_!FS6l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de7579-b641-4d57-98a6-5b0879878fd4_771x59.png 1272w, https://substackcdn.com/image/fetch/$s_!FS6l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F33de7579-b641-4d57-98a6-5b0879878fd4_771x59.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><ul><li><p><strong>Configuration:</strong> The read connector configuration is simple - the <strong>read connector</strong> requires the <code>streamArn</code> and the <strong>manifest</strong> defines 1.) configuration for the start point (<code>Earliest</code> / <code>Latest</code>) 2.) completion conditions such as <code>StopWhenNoData</code> or <code>Continuous</code>. 
Here are some example configs:</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qMNG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32eb509-b481-4526-8a1a-08e5a89bc1d9_2544x512.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qMNG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32eb509-b481-4526-8a1a-08e5a89bc1d9_2544x512.png 424w, https://substackcdn.com/image/fetch/$s_!qMNG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32eb509-b481-4526-8a1a-08e5a89bc1d9_2544x512.png 848w, https://substackcdn.com/image/fetch/$s_!qMNG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32eb509-b481-4526-8a1a-08e5a89bc1d9_2544x512.png 1272w, https://substackcdn.com/image/fetch/$s_!qMNG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32eb509-b481-4526-8a1a-08e5a89bc1d9_2544x512.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qMNG!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32eb509-b481-4526-8a1a-08e5a89bc1d9_2544x512.png" width="1200" height="241.4835164835165" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f32eb509-b481-4526-8a1a-08e5a89bc1d9_2544x512.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:293,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:584345,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qMNG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32eb509-b481-4526-8a1a-08e5a89bc1d9_2544x512.png 424w, https://substackcdn.com/image/fetch/$s_!qMNG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32eb509-b481-4526-8a1a-08e5a89bc1d9_2544x512.png 848w, https://substackcdn.com/image/fetch/$s_!qMNG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32eb509-b481-4526-8a1a-08e5a89bc1d9_2544x512.png 1272w, https://substackcdn.com/image/fetch/$s_!qMNG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff32eb509-b481-4526-8a1a-08e5a89bc1d9_2544x512.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><ul><li><p><strong>Performance:</strong> We&#8217;ve seen quite nice throughput and latency numbers which are comparable to similar read connectors. 
</p></li></ul><h3>Conclusion</h3><p>With the Scan and Stream read connectors, any data in a Dynamo DB table can be processed by LetsData using two different models: data pull through queries and data push through event streams. Together with our existing Dynamo DB Write Connector and the performance numbers that we are seeing, this means you can reliably use #LetsData and Dynamo DB as the central components in your data architecture.</p><p>Let us know, we&#8217;d love to work with you and onboard you to #LetsData!</p>]]></content:encoded></item><item><title><![CDATA[Launch: Kinesis Stream Read Connector is now available ]]></title><description><![CDATA[Today, we are announcing the public availability of Kinesis Read Connector on #LetsData.]]></description><link>https://blog.letsdata.io/p/launch-kinesis-stream-read-connector</link><guid isPermaLink="false">https://blog.letsdata.io/p/launch-kinesis-stream-read-connector</guid><dc:creator><![CDATA[Usman]]></dc:creator><pubDate>Fri, 15 Dec 2023 00:58:12 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/202856c9-2228-4e87-a73c-2a2b0be3b72d_300x151.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today, we are announcing the public availability of Kinesis Read Connector on #LetsData. Customers can now create #LetsData datasets to automatically read documents from Kinesis Streams. </p><p>Customers implement the KinesisRecordReader interface to transform Kinesis records to output documents. #LetsData handles reading from the stream, managing compute and writing to your selected write destination. </p><h2>Implementation</h2><p>Before we discuss the implementation, a few interesting points are worth exploring:</p><h4>Queues vs. 
Streams</h4><p>When I first started looking at streams, I grappled with the obvious question: <em>What is the difference between a queue and a stream?</em> I had worked with SQS and was now looking into Kinesis, and on the surface both seemed to have similar sendMessage and getMessage semantics. If you don&#8217;t work with queues / streams in your everyday job, or are at the periphery of these technologies, this simple question is worth researching. The differences that have stayed with me are:</p><ul><li><p>Queue messages are ephemeral: you consume them and they are gone (messages are deleted / dequeued). Stream records are more like database records: you consume them and they are still there, available for other consumers to read as well (when records are read, your read pointer advances to the next record).</p></li><li><p>Queues are like glue that connects two systems, and are created specifically for that use case, for example, sendMessage(&#8220;a task has been created, kindly assign to a worker&#8221;). Streams are more akin to a broadcast, where the stream can be used by many different systems, for example, sendMessage(&#8220;a task was created&#8221;). Each of the stream&#8217;s readers can take a different action for the same message: a monitor process creates alarms on the task, a scheduler creates a worker and assigns it the task, and a reporting process generates a report of the number of tasks created per hour.</p></li><li><p>Ordering guarantees are generally stronger in streams than in queues. Queues are mostly unordered or at best offer FIFO ordering semantics. Streams, on the other hand, sequence their records, which allows reasoning about distributed events, such as happens-before relationships and the ordering of events.</p></li><li><p>Common stream implementations have stronger facilities for key-space control and sharding than common queue implementations, which do not offer as flexible key-space / partitioning control. 
</p></li></ul><p>This makes queues a great fit for message passing scenarios. Streams, on the other hand, are best for real-time data processing and analysis.</p><h4>Architecture</h4><p>The architecture of a Kinesis Stream Reader Task in LetsData is similar to that of an SQS task, although the implementations are quite different. Here is a high-level task architecture:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rjKQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7f0177-d033-4a9b-9615-0b8be53cf9b4_1048x468.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rjKQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7f0177-d033-4a9b-9615-0b8be53cf9b4_1048x468.png 424w, https://substackcdn.com/image/fetch/$s_!rjKQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7f0177-d033-4a9b-9615-0b8be53cf9b4_1048x468.png 848w, https://substackcdn.com/image/fetch/$s_!rjKQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7f0177-d033-4a9b-9615-0b8be53cf9b4_1048x468.png 1272w, https://substackcdn.com/image/fetch/$s_!rjKQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7f0177-d033-4a9b-9615-0b8be53cf9b4_1048x468.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rjKQ!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7f0177-d033-4a9b-9615-0b8be53cf9b4_1048x468.png" width="1200" height="535.8778625954199" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5d7f0177-d033-4a9b-9615-0b8be53cf9b4_1048x468.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:468,&quot;width&quot;:1048,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:65085,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rjKQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7f0177-d033-4a9b-9615-0b8be53cf9b4_1048x468.png 424w, https://substackcdn.com/image/fetch/$s_!rjKQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7f0177-d033-4a9b-9615-0b8be53cf9b4_1048x468.png 848w, https://substackcdn.com/image/fetch/$s_!rjKQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7f0177-d033-4a9b-9615-0b8be53cf9b4_1048x468.png 1272w, https://substackcdn.com/image/fetch/$s_!rjKQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7f0177-d033-4a9b-9615-0b8be53cf9b4_1048x468.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Architecture diagram for a Kinesis Stream Reader Task</figcaption></figure></div><p>Here are some interesting implementation details:</p><h4>Read Acknowledgements</h4><p>The SQS read connector is complicated in that it:</p><ul><li><p>reads the message from the queue, which is essentially a lease for a certain duration</p></li><li><p>maintains the lease and renews it if needed</p></li><li><p>deletes the message when processing is complete (read acknowledgement)</p></li><li><p>has no offsets to track: messages are deleted once processed, so checkpointing is mostly for progress reporting.</p></li></ul><p>The Kinesis read connector does not need any lease management or read acknowledgement. You read the record, process it and then checkpoint the offset (sequenceNumber). This simplifies the reader and also increases the read throughput (one less call to make). 
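</p><p>As a rough illustration of the read, process and checkpoint loop described above, here is a minimal Python sketch. It is not the #LetsData implementation: <code>InMemoryShard</code>, <code>run_reader</code> and the sample sequence numbers are hypothetical stand-ins, and a plain list takes the place of a real Kinesis shard. It only aims to show that the reader needs no lease and no delete call, and that the checkpoint is simply the sequence number of the last processed record, kept as a string.</p><pre><code># Illustrative sketch only: an in-memory stand-in for a stream shard.
# All names here are hypothetical and are not part of the LetsData or AWS APIs.

class InMemoryShard:
    """Holds (sequence_number, data) records; reading never deletes them."""

    def __init__(self, records):
        self.records = list(records)

    def get_records(self, after_sequence_number, limit):
        # Find the position just past the checkpointed record, if there is one.
        start = 0
        if after_sequence_number is not None:
            sequence_numbers = [seq for seq, _ in self.records]
            start = sequence_numbers.index(after_sequence_number) + 1
        return self.records[start:start + limit]


def run_reader(shard, checkpoint, batch_size, process):
    """Read a batch, process it, then checkpoint. No lease renewal, no delete call."""
    while True:
        batch = shard.get_records(checkpoint, batch_size)
        if not batch:
            return checkpoint
        for _, data in batch:
            process(data)
        # The checkpoint is the string sequence number of the last processed record.
        checkpoint = batch[-1][0]


shard = InMemoryShard([("4964735001", "a"), ("4964735002", "b"), ("4964735003", "c")])
seen = []
last = run_reader(shard, None, 2, seen.append)
</code></pre><p>Restarting the reader with a saved checkpoint, for example <code>run_reader(shard, "4964735002", 2, print)</code>, picks up at the record after that sequence number, so the checkpoint doubles as a resume point.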
</p><h4>Batched Reads, Network Calls and Concurrency</h4><p>In the Kinesis Stream Reader, we use the batched read API to read 30 records, and then for each record we call the user&#8217;s interface implementation to get the transformed doc. While testing, we ran into a couple of very interesting issues:</p><ul><li><p><em>Calling the user&#8217;s interface implementation</em> was essentially serial: we looped over the 30 records and called the implementation for each one in turn. This worked great when the user&#8217;s interface implementation was in-proc (Java). With Python / JavaScript, each call became a network call, which meant that the latency for a batch was <em>30 * latency of a single network call</em>. We changed these calls to run in parallel and saw a significant performance improvement (100 ms &#8594; 20 ms).</p></li><li><p>When we added the above parallelism, we started seeing <em>Out of Memory - Unable to create a native thread</em> errors. Lots of debugging, stack dumps and code analysis later, it turned out that we were using the <code>Executors.newFixedThreadPool(1000)</code> method to create our threadpool. This had worked great, but with the recent large increase in threads, it looked like thread reclamation was not happening fast enough in certain scenarios. We switched the threadpool to a <code>ThreadPoolExecutor</code>, specifying custom values for core pool size (100), max pool size (1000) and a relatively aggressive thread reclamation timeout (500 ms), along with a custom thread factory that assigns the threads names that are recognizable in stack traces. We also added threadpool statistics logs. We&#8217;ve not seen any issues since this change: the thread count stays at around 100 (the core pool size) and thread reclamation is working quite well.</p></li></ul><h4>Offsets</h4><p>Our S3 readers were file-based, so our offsets were integers, i.e. the number of bytes into the file. With SQS, we didn&#8217;t need offsets. 
With Kinesis, we started seeing offsets (sequenceNumbers) which were very large integers, for example, <code>49647350647089693080323675671827096928861743259506966562</code>. The engineer in me is curious as to how such numbers are generated and whether they are always increasing. However, that is not important here. What is important is that integers were probably not a great choice for offsets, especially when we are dealing with a myriad of read resources. So the offsets needed to be strings. </p><p>We reworked our stack to store offsets as strings. Each destination&#8217;s readers can make sense of the offset according to their own logic, but to the overall system these are opaque strings. This did require an update to our data interfaces as well (a breaking change). </p><p>These are some interesting implementation tidbits that I thought readers might like, and that might help them avoid similar issues and design better systems. Now let&#8217;s look at how we can actually develop on LetsData using this new read connector. </p><h2>Getting Started</h2><p>Getting started with the Kinesis Stream read connector requires implementing the interface (Code), defining the config (Config), granting #LetsData access (Access) and then running the dataset using the CLI (CLI). </p><h4>Code</h4><p>The KinesisRecordReader is a simple interface with a single parseMessage method that needs to be implemented. 
</p><p>You can look at the interface definition on GitHub (<a href="https://github.com/lets-data/letsdata-data-interface/blob/main/src/main/java/com/resonance/letsdata/data/readers/interfaces/kinesis/KinesisRecordReader.java">Java</a>, <a href="https://github.com/lets-data/letsdata-python-interface/blob/main/letsdata_interfaces/readers/kinesis/KinesisRecordReader.py">Python</a>, <a href="https://github.com/lets-data/letsdata-javascript-interface/blob/main/letsdata_interfaces/readers/kinesis/KinesisRecordReader.js">Javascript</a>) and example implementations that simply echo the incoming record on GitHub as well (<a href="https://github.com/lets-data/letsdata-common-crawl/blob/main/src/main/java/com/letsdata/commoncrawl/interfaces/implementations/kinesisstreamreader/CommonCrawlStreamReader.java">Java</a>, <a href="https://github.com/lets-data/letsdata-python-examples/blob/main/letsdata_interfaces/readers/kinesis/KinesisRecordReader.py">Python</a>, <a href="https://github.com/lets-data/letsdata-javascript-examples/blob/main/letsdata_interfaces/readers/kinesis/KinesisRecordReader.js">Javascript</a>).  
These are also available on our <a href="https://www.letsdata.io/docs/read-connectors/?tab=kinesis-streamreader#kinesisStreamReaderImplementation">read connector docs</a>.</p><p>Here is example code in each of the supported languages, showing how simple the integration is.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_yfv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79508194-03f2-4175-a6ae-af4882647d85_2559x1398.png" data-component-name="Image2ToDOM"><div class="image2-inset image2-full-screen"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_yfv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79508194-03f2-4175-a6ae-af4882647d85_2559x1398.png 424w, https://substackcdn.com/image/fetch/$s_!_yfv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79508194-03f2-4175-a6ae-af4882647d85_2559x1398.png 848w, https://substackcdn.com/image/fetch/$s_!_yfv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79508194-03f2-4175-a6ae-af4882647d85_2559x1398.png 1272w, https://substackcdn.com/image/fetch/$s_!_yfv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79508194-03f2-4175-a6ae-af4882647d85_2559x1398.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_yfv!,w_5760,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79508194-03f2-4175-a6ae-af4882647d85_2559x1398.png" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/79508194-03f2-4175-a6ae-af4882647d85_2559x1398.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;full&quot;,&quot;height&quot;:795,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1577055,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-fullscreen" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_yfv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79508194-03f2-4175-a6ae-af4882647d85_2559x1398.png 424w, https://substackcdn.com/image/fetch/$s_!_yfv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79508194-03f2-4175-a6ae-af4882647d85_2559x1398.png 848w, https://substackcdn.com/image/fetch/$s_!_yfv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79508194-03f2-4175-a6ae-af4882647d85_2559x1398.png 1272w, https://substackcdn.com/image/fetch/$s_!_yfv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79508194-03f2-4175-a6ae-af4882647d85_2559x1398.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Kinesis Stream Read Connector - Example interface implementations in Java, Python and JavaScript</figcaption></figure></div><h4>Config</h4><p>Define the read connector config, which is simply the interface implementation details and the Kinesis Stream ARN. The <a href="https://www.letsdata.io/docs/read-connectors/?tab=kinesis-streamreader#kinesisStreamReaderConfig">Read Connector Config on LetsData Docs</a> has details and examples. 
</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yLcQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6538a5cf-ee64-496d-bd42-0255548d4f12_905x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yLcQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6538a5cf-ee64-496d-bd42-0255548d4f12_905x200.png 424w, https://substackcdn.com/image/fetch/$s_!yLcQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6538a5cf-ee64-496d-bd42-0255548d4f12_905x200.png 848w, https://substackcdn.com/image/fetch/$s_!yLcQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6538a5cf-ee64-496d-bd42-0255548d4f12_905x200.png 1272w, https://substackcdn.com/image/fetch/$s_!yLcQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6538a5cf-ee64-496d-bd42-0255548d4f12_905x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yLcQ!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6538a5cf-ee64-496d-bd42-0255548d4f12_905x200.png" width="1200" height="265.1933701657459" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6538a5cf-ee64-496d-bd42-0255548d4f12_905x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:200,&quot;width&quot;:905,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:48446,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yLcQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6538a5cf-ee64-496d-bd42-0255548d4f12_905x200.png 424w, https://substackcdn.com/image/fetch/$s_!yLcQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6538a5cf-ee64-496d-bd42-0255548d4f12_905x200.png 848w, https://substackcdn.com/image/fetch/$s_!yLcQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6538a5cf-ee64-496d-bd42-0255548d4f12_905x200.png 1272w, https://substackcdn.com/image/fetch/$s_!yLcQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6538a5cf-ee64-496d-bd42-0255548d4f12_905x200.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The Kinesis Stream Read Connector configuration</figcaption></figure></div><h4>Access</h4><ul><li><p>Trust #LetsData to access the Kinesis stream AWS account by creating an IAM Role that trusts the LetsData account. 
Details at <a href="https://www.letsdata.io/docs/access-grants/?tab=kinesisreadconnector#create-access-grants-role">Access Grants Docs</a></p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!H93_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54f4da5c-55af-492f-a32e-3ec749691d8c_1108x65.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!H93_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54f4da5c-55af-492f-a32e-3ec749691d8c_1108x65.png 424w, https://substackcdn.com/image/fetch/$s_!H93_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54f4da5c-55af-492f-a32e-3ec749691d8c_1108x65.png 848w, https://substackcdn.com/image/fetch/$s_!H93_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54f4da5c-55af-492f-a32e-3ec749691d8c_1108x65.png 1272w, https://substackcdn.com/image/fetch/$s_!H93_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54f4da5c-55af-492f-a32e-3ec749691d8c_1108x65.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!H93_!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54f4da5c-55af-492f-a32e-3ec749691d8c_1108x65.png" width="1200" height="70.3971119133574" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/54f4da5c-55af-492f-a32e-3ec749691d8c_1108x65.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:65,&quot;width&quot;:1108,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:19033,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!H93_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54f4da5c-55af-492f-a32e-3ec749691d8c_1108x65.png 424w, https://substackcdn.com/image/fetch/$s_!H93_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54f4da5c-55af-492f-a32e-3ec749691d8c_1108x65.png 848w, https://substackcdn.com/image/fetch/$s_!H93_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54f4da5c-55af-492f-a32e-3ec749691d8c_1108x65.png 1272w, https://substackcdn.com/image/fetch/$s_!H93_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54f4da5c-55af-492f-a32e-3ec749691d8c_1108x65.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><ul><li><p>Create a policy to allow access to the Kinesis stream. 
<a href="https://www.letsdata.io/docs/access-grants/?tab=kinesisreadconnector#read-connectors">Access Grants Docs</a></p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UQZh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38283a83-d656-4c1c-8d9b-c6d220ada2a2_479x461.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UQZh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38283a83-d656-4c1c-8d9b-c6d220ada2a2_479x461.png 424w, https://substackcdn.com/image/fetch/$s_!UQZh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38283a83-d656-4c1c-8d9b-c6d220ada2a2_479x461.png 848w, https://substackcdn.com/image/fetch/$s_!UQZh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38283a83-d656-4c1c-8d9b-c6d220ada2a2_479x461.png 1272w, https://substackcdn.com/image/fetch/$s_!UQZh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38283a83-d656-4c1c-8d9b-c6d220ada2a2_479x461.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UQZh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38283a83-d656-4c1c-8d9b-c6d220ada2a2_479x461.png" width="479" height="461" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/38283a83-d656-4c1c-8d9b-c6d220ada2a2_479x461.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:461,&quot;width&quot;:479,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:38082,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UQZh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38283a83-d656-4c1c-8d9b-c6d220ada2a2_479x461.png 424w, https://substackcdn.com/image/fetch/$s_!UQZh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38283a83-d656-4c1c-8d9b-c6d220ada2a2_479x461.png 848w, https://substackcdn.com/image/fetch/$s_!UQZh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38283a83-d656-4c1c-8d9b-c6d220ada2a2_479x461.png 1272w, https://substackcdn.com/image/fetch/$s_!UQZh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38283a83-d656-4c1c-8d9b-c6d220ada2a2_479x461.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p>Create the policy and attach it to the role. 
<a href="https://www.letsdata.io/docs/access-grants/?tab=kinesisreadconnector#create-access-grants-role">Access Grant Docs</a></p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zAp5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb846d119-e161-4717-9330-43df134866b1_1046x60.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zAp5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb846d119-e161-4717-9330-43df134866b1_1046x60.png 424w, https://substackcdn.com/image/fetch/$s_!zAp5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb846d119-e161-4717-9330-43df134866b1_1046x60.png 848w, https://substackcdn.com/image/fetch/$s_!zAp5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb846d119-e161-4717-9330-43df134866b1_1046x60.png 1272w, https://substackcdn.com/image/fetch/$s_!zAp5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb846d119-e161-4717-9330-43df134866b1_1046x60.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zAp5!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb846d119-e161-4717-9330-43df134866b1_1046x60.png" width="1200" height="68.83365200764818" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b846d119-e161-4717-9330-43df134866b1_1046x60.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:60,&quot;width&quot;:1046,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:19474,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zAp5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb846d119-e161-4717-9330-43df134866b1_1046x60.png 424w, https://substackcdn.com/image/fetch/$s_!zAp5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb846d119-e161-4717-9330-43df134866b1_1046x60.png 848w, https://substackcdn.com/image/fetch/$s_!zAp5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb846d119-e161-4717-9330-43df134866b1_1046x60.png 1272w, https://substackcdn.com/image/fetch/$s_!zAp5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb846d119-e161-4717-9330-43df134866b1_1046x60.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DLbA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59ddc0a6-88c0-47ab-a615-4288a683f0fe_1139x58.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DLbA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59ddc0a6-88c0-47ab-a615-4288a683f0fe_1139x58.png 424w, https://substackcdn.com/image/fetch/$s_!DLbA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59ddc0a6-88c0-47ab-a615-4288a683f0fe_1139x58.png 848w, https://substackcdn.com/image/fetch/$s_!DLbA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59ddc0a6-88c0-47ab-a615-4288a683f0fe_1139x58.png 1272w, https://substackcdn.com/image/fetch/$s_!DLbA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59ddc0a6-88c0-47ab-a615-4288a683f0fe_1139x58.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DLbA!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59ddc0a6-88c0-47ab-a615-4288a683f0fe_1139x58.png" width="1200" height="61.1062335381914" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/59ddc0a6-88c0-47ab-a615-4288a683f0fe_1139x58.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:58,&quot;width&quot;:1139,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:18930,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!DLbA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59ddc0a6-88c0-47ab-a615-4288a683f0fe_1139x58.png 424w, https://substackcdn.com/image/fetch/$s_!DLbA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59ddc0a6-88c0-47ab-a615-4288a683f0fe_1139x58.png 848w, https://substackcdn.com/image/fetch/$s_!DLbA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59ddc0a6-88c0-47ab-a615-4288a683f0fe_1139x58.png 1272w, https://substackcdn.com/image/fetch/$s_!DLbA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59ddc0a6-88c0-47ab-a615-4288a683f0fe_1139x58.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>This should allow LetsData to access the Kinesis stream in the customer&#8217;s AWS account. You may also want to look at the <a href="https://www.letsdata.io/docs/examples/#uri-extractor-example-step4-granting-access">Examples</a> for a step-by-step walkthrough of the access grant process. Be sure to replace the policy with the Kinesis stream policy. </p><h4>CLI</h4><ul><li><p>Now let&#8217;s create the dataset using the LetsData CLI. <a href="https://www.letsdata.io/docs/downloads/#lets-data-cli">Download and Setup Instructions</a></p></li><li><p>Create the dataset configuration, incorporating the read connector configuration from the Config section above. The <a href="https://www.letsdata.io/docs/examples/#uri-extractor-example-step5-create-dataset-config">Examples</a> have a step-by-step rundown. 
</p></li><li><p>Create the dataset, monitor execution via CLI or via the <a href="https://www.letsdata.io/home">LetsData Website</a></p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_HRH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F112e7124-fae6-4ca3-b13c-6943d09eddf0_733x59.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_HRH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F112e7124-fae6-4ca3-b13c-6943d09eddf0_733x59.png 424w, https://substackcdn.com/image/fetch/$s_!_HRH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F112e7124-fae6-4ca3-b13c-6943d09eddf0_733x59.png 848w, https://substackcdn.com/image/fetch/$s_!_HRH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F112e7124-fae6-4ca3-b13c-6943d09eddf0_733x59.png 1272w, https://substackcdn.com/image/fetch/$s_!_HRH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F112e7124-fae6-4ca3-b13c-6943d09eddf0_733x59.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_HRH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F112e7124-fae6-4ca3-b13c-6943d09eddf0_733x59.png" width="728" height="58.59754433833561" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/112e7124-fae6-4ca3-b13c-6943d09eddf0_733x59.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:59,&quot;width&quot;:733,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:14602,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_HRH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F112e7124-fae6-4ca3-b13c-6943d09eddf0_733x59.png 424w, https://substackcdn.com/image/fetch/$s_!_HRH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F112e7124-fae6-4ca3-b13c-6943d09eddf0_733x59.png 848w, https://substackcdn.com/image/fetch/$s_!_HRH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F112e7124-fae6-4ca3-b13c-6943d09eddf0_733x59.png 1272w, https://substackcdn.com/image/fetch/$s_!_HRH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F112e7124-fae6-4ca3-b13c-6943d09eddf0_733x59.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h2>Conclusion</h2><p>Kinesis streams are a very powerful technology and can be used in a variety of use-cases. #LetsData simplifies reading from and writing to Kinesis - users only have to write their business logic while #LetsData manages the read, write and processing infrastructure.</p><p>Thoughts? Let us know. 
Happy to chat!</p>]]></content:encoded></item><item><title><![CDATA[Launch Announcement: #LetsData is now available in Python and Javascript and supports Containers]]></title><description><![CDATA[Today, we are announcing the availability of #LetsData in Python and Javascript languages and support for containers.]]></description><link>https://blog.letsdata.io/p/launch-announcement-letsdata-is-now</link><guid isPermaLink="false">https://blog.letsdata.io/p/launch-announcement-letsdata-is-now</guid><dc:creator><![CDATA[Usman]]></dc:creator><pubDate>Mon, 11 Dec 2023 17:26:19 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/965366a2-64ba-47a3-8334-69d2c451d7c2_300x151.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today, we are announcing the availability of #LetsData in Python and Javascript languages and support for containers.</p><p>Customers can now implement the LetsData interfaces in Java, Python and Javascript languages and package their implementations as docker containers (in addition to the existing JAR Files) for a simplified development experience. 
</p><p>Here is a high level support matrix:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QQdl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0679ed9d-df1e-4675-abc5-1c1dcb9a0c07_730x523.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QQdl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0679ed9d-df1e-4675-abc5-1c1dcb9a0c07_730x523.png 424w, https://substackcdn.com/image/fetch/$s_!QQdl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0679ed9d-df1e-4675-abc5-1c1dcb9a0c07_730x523.png 848w, https://substackcdn.com/image/fetch/$s_!QQdl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0679ed9d-df1e-4675-abc5-1c1dcb9a0c07_730x523.png 1272w, https://substackcdn.com/image/fetch/$s_!QQdl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0679ed9d-df1e-4675-abc5-1c1dcb9a0c07_730x523.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QQdl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0679ed9d-df1e-4675-abc5-1c1dcb9a0c07_730x523.png" width="730" height="523" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0679ed9d-df1e-4675-abc5-1c1dcb9a0c07_730x523.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:523,&quot;width&quot;:730,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:86419,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QQdl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0679ed9d-df1e-4675-abc5-1c1dcb9a0c07_730x523.png 424w, https://substackcdn.com/image/fetch/$s_!QQdl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0679ed9d-df1e-4675-abc5-1c1dcb9a0c07_730x523.png 848w, https://substackcdn.com/image/fetch/$s_!QQdl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0679ed9d-df1e-4675-abc5-1c1dcb9a0c07_730x523.png 1272w, https://substackcdn.com/image/fetch/$s_!QQdl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0679ed9d-df1e-4675-abc5-1c1dcb9a0c07_730x523.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>For those interested in digging into the code right away, here are the interface and example packages and the #LetsData docs:</p><ul><li><p><strong>Python:</strong> <a href="https://github.com/lets-data/letsdata-python-interface">GitHub Interface Package</a>, <a href="https://github.com/lets-data/letsdata-python-examples">GitHub Interface Implementation Examples</a></p></li><li><p><strong>Javascript:</strong> <a href="https://github.com/lets-data/letsdata-javascript-interface">GitHub Interface Package</a>, <a href="https://github.com/lets-data/letsdata-javascript-examples">GitHub Interface Implementation Examples</a></p></li><li><p><strong>Java:</strong> <a href="https://github.com/lets-data/letsdata-data-interface">GitHub Interface Package</a>, <a href="https://github.com/lets-data/letsdata-common-crawl">GitHub Interface Implementation Examples</a></p></li><li><p><strong>#LetsData Docs:</strong></p><ul><li><p>Read Connector Docs: <a 
href="https://www.letsdata.io/docs/read-connectors">https://www.letsdata.io/docs/read-connectors</a></p></li><li><p>SDK Interface: <a href="https://www.letsdata.io/docs/sdk-interface/">https://www.letsdata.io/docs/sdk-interface/</a></p></li></ul></li></ul><h3>Dev Experience Overview</h3><p>We&#8217;ve translated the existing Java interfaces to Python and Javascript and packaged them as buildable, ready-to-deploy projects based on docker container images. The developer workflow is as follows:</p><ul><li><p>update the interface files with the implementation</p></li><li><p>build a docker image and upload it to ECR</p></li><li><p>reference this ECR Image in their datasets</p></li></ul><h3>Architecture</h3><p>Internally, we package the ECR Image as a &#8220;Language Bridge&#8221; Lambda function with http request-response web methods that invoke the user&#8217;s implementations. Our existing Data Task Java functions can then call these language-bridges with the data that needs to be processed. 
This is essentially a micro-service architecture with the Data Task micro-service calling the Language-Bridge micro-service.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kjEU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc8be22c-e7e1-46e2-be0b-4abcbce8ea29_1790x698.png" data-component-name="Image2ToDOM"><div class="image2-inset image2-full-screen"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kjEU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc8be22c-e7e1-46e2-be0b-4abcbce8ea29_1790x698.png 424w, https://substackcdn.com/image/fetch/$s_!kjEU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc8be22c-e7e1-46e2-be0b-4abcbce8ea29_1790x698.png 848w, https://substackcdn.com/image/fetch/$s_!kjEU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc8be22c-e7e1-46e2-be0b-4abcbce8ea29_1790x698.png 1272w, https://substackcdn.com/image/fetch/$s_!kjEU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc8be22c-e7e1-46e2-be0b-4abcbce8ea29_1790x698.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kjEU!,w_5760,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc8be22c-e7e1-46e2-be0b-4abcbce8ea29_1790x698.png" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fc8be22c-e7e1-46e2-be0b-4abcbce8ea29_1790x698.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;full&quot;,&quot;height&quot;:568,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:122989,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-fullscreen" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kjEU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc8be22c-e7e1-46e2-be0b-4abcbce8ea29_1790x698.png 424w, https://substackcdn.com/image/fetch/$s_!kjEU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc8be22c-e7e1-46e2-be0b-4abcbce8ea29_1790x698.png 848w, https://substackcdn.com/image/fetch/$s_!kjEU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc8be22c-e7e1-46e2-be0b-4abcbce8ea29_1790x698.png 1272w, https://substackcdn.com/image/fetch/$s_!kjEU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc8be22c-e7e1-46e2-be0b-4abcbce8ea29_1790x698.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Architecture Diagram: Data Task and Language-Bridge</figcaption></figure></div><h3>In-Proc vs Micro-service</h3><p>An interesting difference between the existing Java language interfaces and the new Python / Javascript language interfaces is that the Java interfaces do not use micro-services - they are essentially in-proc calls in the same JVM. This is primarily because all our existing code is implemented in Java: we take your implementation JAR, build it with our code and execute it as a single JVM process. This results in the following differences:</p><ul><li><p><strong>Stateless vs Stateful interfaces:</strong> Java interface implementations can be stateful - the code can easily maintain a mapping from each thread to its stateful implementation. When we move to Python / Javascript using micro-services, we lose the stateful support - the interfaces need to be stateless. 
This is because subsequent calls from a thread could land on different micro-service endpoints (Lambda functions), which may not have state from the prior call. Statefulness can be implemented using persistent sessions, but that overly complicates the implementation and becomes a micro-services anti-pattern IMHO. So is the move to micro-services a step down? Not necessarily - here is the case in defense of micro-services:</p><ul><li><p>Most stateful parsing and transformation implementations can be simplified to stateless implementations (much like what happens in SQL to NoSQL transitions)</p></li><li><p>The loose coupling of services affords greater horizontal scalability and increased performance.</p></li><li><p>A simplified overall architecture</p></li></ul></li><li><p><strong>Latency:</strong> Java interface implementations are blazingly fast since they run within the same JVM, whereas the Python / Javascript micro-service implementations require network transit. These calls stay within the same network, and in our experience the Python / Javascript interface implementations are adequately fast - we&#8217;ve seen parse message call latencies of ~1-2 ms (including the network) for Python and slightly higher (~5 ms) for Javascript. With all the other responsibilities of the Data Task function (network reads, destination writes, serialization / deserialization, compute etc.), these latency differences seem to get amortized (or show up as only a slight perceptible increase). We were initially skeptical about the network latency for such an architecture, but we&#8217;ve been very happy with the out-of-the-box performance that we are seeing. With additional tweaking, we should be able to improve performance even more if needed. </p></li><li><p><strong>Dataset Initialization / Provisioning:</strong> Our provisioning of Java interfaces creates a new Java build that packages the customer&#8217;s JAR with the #LetsData interfaces. 
We made this decision for two reasons: i.) we wanted to make sure that the JARs play nicely together and that there are no compile-time failures in the interface implementations, and ii.) we believe that this gives us extensibility, where we can customize the workflow to run some focused tests during the build to find possible issues at initialization time. While all this is great, the downside is that this Java build takes around 3-5 mins to complete and becomes the long pole in the dataset initialization workflow. The Python / Javascript implementations using pre-built ECR container images simplify this very nicely - we do not have to do any additional builds and the dataset initialization becomes quite short - we&#8217;ve seen datasets start processing data within 1-2 mins of creation. We love this container packaging as of now. If we were to build similar focused tests to find issues at initialization time, we might see increased runtimes. However, this is unlikely, since the ECR container is a separate deployment unit that can be deployed and scaled independently, and therefore any issues can be treated separately as well. </p></li><li><p><strong>Developer Experience:</strong> The overall developer experience when developing with a Java JAR and LetsData is somewhat complex when compared to Python / Javascript containers and LetsData.</p><ul><li><p>Python / Javascript containers and LetsData simplify development quite a bit. We&#8217;ve implemented the Document interfaces by default in Python / Javascript - with support for a key-value map, these are flexible enough for most use-cases. Python / Javascript developers do not have to write their own Document classes or do additional tests around serialization / deserialization. We should probably implement the same for Java as well. 
</p></li><li><p>The Http request-response semantics of the Python / Javascript containers let you test your code end to end, trusting that as long as your implementation does the right thing given the right inputs, the end-to-end flow will work quite well. Docker&#8217;s superior development infrastructure (IMHO) has a lot to do with this delight as well. With the Java implementations, as we package the Java project infrastructure today, the end-to-end testing and how everything fits together isn&#8217;t quite clear. This means we need to do more work to make the Java experience as easy as Python / Javascript - we are not there yet IMHO.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fwrf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff58f6c0b-b9ec-4d62-8d41-40384af674e0_1410x1188.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fwrf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff58f6c0b-b9ec-4d62-8d41-40384af674e0_1410x1188.png 424w, https://substackcdn.com/image/fetch/$s_!fwrf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff58f6c0b-b9ec-4d62-8d41-40384af674e0_1410x1188.png 848w, https://substackcdn.com/image/fetch/$s_!fwrf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff58f6c0b-b9ec-4d62-8d41-40384af674e0_1410x1188.png 1272w, https://substackcdn.com/image/fetch/$s_!fwrf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff58f6c0b-b9ec-4d62-8d41-40384af674e0_1410x1188.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fwrf!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff58f6c0b-b9ec-4d62-8d41-40384af674e0_1410x1188.png" width="1200" height="1011.063829787234" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f58f6c0b-b9ec-4d62-8d41-40384af674e0_1410x1188.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1188,&quot;width&quot;:1410,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:238799,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fwrf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff58f6c0b-b9ec-4d62-8d41-40384af674e0_1410x1188.png 424w, https://substackcdn.com/image/fetch/$s_!fwrf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff58f6c0b-b9ec-4d62-8d41-40384af674e0_1410x1188.png 848w, https://substackcdn.com/image/fetch/$s_!fwrf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff58f6c0b-b9ec-4d62-8d41-40384af674e0_1410x1188.png 1272w, https://substackcdn.com/image/fetch/$s_!fwrf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff58f6c0b-b9ec-4d62-8d41-40384af674e0_1410x1188.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 
pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Http Request-Response Packaging of Lambda Functions in Containers allows for build time sanity tests</figcaption></figure></div></li></ul></li><li><p><strong>Security: </strong>Running user&#8217;s lambda containers in your AWS account does increase the security risk - customer&#8217;s code could be malicious and could attempt to do all sorts of incorrect things, stealing credentials, generating malicious data etc. This concern is similar to running customer code in-proc, the in-proc code could be malicious as well. 
We mitigate the in-proc risk by running the in-proc code with a scoped execution role, which limits the code to only what we&#8217;ve determined should be accessible by the dataset. Similar security fencing should carry over to the language-bridge Lambda functions. While we&#8217;ve not yet limited that execution to the dataset&#8217;s execution role, we&#8217;ve granted the language-bridge function the bare minimum set of permissions that a Lambda function needs to work. However, some of these permissions apply to &#8216;*&#8217; resources. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ITpr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9253144e-de77-4421-8080-3971f0514e61_475x317.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ITpr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9253144e-de77-4421-8080-3971f0514e61_475x317.png 424w, https://substackcdn.com/image/fetch/$s_!ITpr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9253144e-de77-4421-8080-3971f0514e61_475x317.png 848w, https://substackcdn.com/image/fetch/$s_!ITpr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9253144e-de77-4421-8080-3971f0514e61_475x317.png 1272w, https://substackcdn.com/image/fetch/$s_!ITpr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9253144e-de77-4421-8080-3971f0514e61_475x317.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ITpr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9253144e-de77-4421-8080-3971f0514e61_475x317.png" width="475" height="317" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9253144e-de77-4421-8080-3971f0514e61_475x317.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:317,&quot;width&quot;:475,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:48576,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ITpr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9253144e-de77-4421-8080-3971f0514e61_475x317.png 424w, https://substackcdn.com/image/fetch/$s_!ITpr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9253144e-de77-4421-8080-3971f0514e61_475x317.png 848w, https://substackcdn.com/image/fetch/$s_!ITpr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9253144e-de77-4421-8080-3971f0514e61_475x317.png 1272w, https://substackcdn.com/image/fetch/$s_!ITpr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9253144e-de77-4421-8080-3971f0514e61_475x317.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Lambda&#8217;s default permission requirements</figcaption></figure></div><p>My understanding is that these permissions could allow user code to create log groups, put logs and metric data into namespaces it doesn&#8217;t own, and do similar malicious things with X-Ray. I haven&#8217;t tested this yet and would be surprised if malicious code in a Lambda function could actually do these things. I am sure there are some internal safeguards that stop a Lambda function from running amok with these blanket accesses. This is okay for now, but it is on our minds as a security issue that we may need to fix.</p></li></ul><p>Ideally, we&#8217;ll want all languages to be i.) in-proc AND ii.) 
have the scalability and dev experiences similar to micro-services. But duplicating all our existing code that is currently in Java seems like a larger effort - the read connector code, the #LetsData writers and all the orchestration code. To have a true in-proc, we&#8217;d have to rewrite the stack for each different language, which seems untenable as of now. The micro-services approach is a very nice implementation that should cater to the approx. 95%+ of the cases (unscientific gut number).</p><h2>Implementation</h2><h3>Config</h3><p>Here is an example read connector dataset configuration using Python. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LZjY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e61377c-1cd2-46a3-a9a0-3100fca31b2a_2150x788.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LZjY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e61377c-1cd2-46a3-a9a0-3100fca31b2a_2150x788.png 424w, https://substackcdn.com/image/fetch/$s_!LZjY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e61377c-1cd2-46a3-a9a0-3100fca31b2a_2150x788.png 848w, https://substackcdn.com/image/fetch/$s_!LZjY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e61377c-1cd2-46a3-a9a0-3100fca31b2a_2150x788.png 1272w, https://substackcdn.com/image/fetch/$s_!LZjY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e61377c-1cd2-46a3-a9a0-3100fca31b2a_2150x788.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!LZjY!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e61377c-1cd2-46a3-a9a0-3100fca31b2a_2150x788.png" width="1200" height="440.1098901098901" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4e61377c-1cd2-46a3-a9a0-3100fca31b2a_2150x788.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:534,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:161233,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LZjY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e61377c-1cd2-46a3-a9a0-3100fca31b2a_2150x788.png 424w, https://substackcdn.com/image/fetch/$s_!LZjY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e61377c-1cd2-46a3-a9a0-3100fca31b2a_2150x788.png 848w, https://substackcdn.com/image/fetch/$s_!LZjY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e61377c-1cd2-46a3-a9a0-3100fca31b2a_2150x788.png 1272w, https://substackcdn.com/image/fetch/$s_!LZjY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e61377c-1cd2-46a3-a9a0-3100fca31b2a_2150x788.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">S3 Single File Reader dataset configuration when using python and containers</figcaption></figure></div><ul><li><p>The JAR attributes have been replaced by ECR Image attributes</p></li><li><p>Since the project structure and docker image is a known template, #LetsData infrastructure knows where the interface locations are. This allows us to simplify the configuration where we don&#8217;t specify the implementation class names anymore. 
</p></li></ul><h3>Code</h3><p>The interface code is simple - for example, we converted our stateful Common Crawl java interface implementation to a stateless python/javascript implementation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JqBR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94c8185d-8d02-4b93-afa1-5bdf3fcafede_1000x1287.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JqBR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94c8185d-8d02-4b93-afa1-5bdf3fcafede_1000x1287.png 424w, https://substackcdn.com/image/fetch/$s_!JqBR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94c8185d-8d02-4b93-afa1-5bdf3fcafede_1000x1287.png 848w, https://substackcdn.com/image/fetch/$s_!JqBR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94c8185d-8d02-4b93-afa1-5bdf3fcafede_1000x1287.png 1272w, https://substackcdn.com/image/fetch/$s_!JqBR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94c8185d-8d02-4b93-afa1-5bdf3fcafede_1000x1287.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JqBR!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94c8185d-8d02-4b93-afa1-5bdf3fcafede_1000x1287.png" width="1200" height="1544.4" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/94c8185d-8d02-4b93-afa1-5bdf3fcafede_1000x1287.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1287,&quot;width&quot;:1000,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:275067,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JqBR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94c8185d-8d02-4b93-afa1-5bdf3fcafede_1000x1287.png 424w, https://substackcdn.com/image/fetch/$s_!JqBR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94c8185d-8d02-4b93-afa1-5bdf3fcafede_1000x1287.png 848w, https://substackcdn.com/image/fetch/$s_!JqBR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94c8185d-8d02-4b93-afa1-5bdf3fcafede_1000x1287.png 1272w, https://substackcdn.com/image/fetch/$s_!JqBR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94c8185d-8d02-4b93-afa1-5bdf3fcafede_1000x1287.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Single File Parser interface implementation in Python</figcaption></figure></div><h2>Conclusion</h2><p>We&#8217;d love to learn how SaaS platform owners have dealt with the challenges of multi-language support - separate technology stacks might be the inevitable reference architecture.</p><p>Despite the cons and issues mentioned above, simplified development in popular programming languages and zero infrastructure management while processing data are compelling reasons to give #LetsData a try. 
We&#8217;d be happy to work with folks to onboard their data use-cases to #LetsData.</p>]]></content:encoded></item><item><title><![CDATA[#LetsData is now available in 6 AWS regions]]></title><description><![CDATA[Today, we are announcing the launch of multi-region support for #LetsData.]]></description><link>https://blog.letsdata.io/p/letsdata-is-now-available-in-6-aws</link><guid isPermaLink="false">https://blog.letsdata.io/p/letsdata-is-now-available-in-6-aws</guid><dc:creator><![CDATA[Usman]]></dc:creator><pubDate>Tue, 31 Oct 2023 05:39:30 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/b0be9eb4-99b8-484a-80a6-71f720984c2b_300x151.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today, we are announcing the launch of multi-region support for #LetsData.</p><p>#LetsData is now available in the following AWS regions:</p><ul><li><p>us-east-1 (N. Virginia)</p></li><li><p>us-west-2 (Oregon)</p></li><li><p>us-east-2 (Ohio)</p></li><li><p>eu-west-1 (Ireland)</p></li><li><p>ap-south-1 (Mumbai)</p></li><li><p>ap-northeast-1 (Tokyo)</p></li></ul><p>You can read about the feature details in our <a href="https://www.letsdata.io/docs#regions">docs</a> [1].</p><h2>Background: Data Locality</h2><p>Before we get into the details of the #LetsData multi-region feature, let&#8217;s look at why multi-region support is important in modern-day data processing.</p><h4>Data and Compute Locality</h4><p>Google&#8217;s <a href="https://storage.googleapis.com/pub-tools-public-publication-data/pdf/16cb30b4b92fd4989b8619a61752a2387c6dd474.pdf">MapReduce</a> [2] and Apache&#8217;s <a href="https://hadoop.apache.org/docs/r1.2.1/hdfs_design.pdf">Hadoop / HDFS</a> [3] are seminal technologies for large-scale distributed data processing. 
One of the key design tenets of these systems is data locality.</p><p>From the HDFS architecture:</p><blockquote><p>A computation requested by an application is much more efficient if it is executed near the data it operates on &#8230; The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running.</p></blockquote><p>From the MapReduce paper:</p><blockquote><p>The MapReduce master takes the location information of the input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data. Failing that, it attempts to schedule a map task near a replica of that task&#8217;s input data (e.g., on a worker machine that is on the same network switch as the machine containing the data). When running large MapReduce operations on a significant fraction of the workers in a cluster, most input data is read locally and consumes no network bandwidth.</p></blockquote><p>These works date from around 2004/2005, when cloud computing either didn&#8217;t exist or was in its infancy. The Google MapReduce paper cites commodity hardware, IDE disks and 100 Mbps - 1 Gbps Ethernet. </p><h4>AWS, EC2 and Network Bandwidth</h4><p>With AWS offering EC2 instances with 25 Gbps - 100 Gbps of bandwidth to services like S3 at reasonable costs, in my humble experience, code optimizations that keep data on the same machine, rack or network switch might not be necessary. 
For example, here are the m5n instance types and their available network bandwidths and hourly costs.[4] </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!q7Ed!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d356fba-a07c-4bef-843c-6dd15caec040_828x417.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!q7Ed!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d356fba-a07c-4bef-843c-6dd15caec040_828x417.png 424w, https://substackcdn.com/image/fetch/$s_!q7Ed!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d356fba-a07c-4bef-843c-6dd15caec040_828x417.png 848w, https://substackcdn.com/image/fetch/$s_!q7Ed!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d356fba-a07c-4bef-843c-6dd15caec040_828x417.png 1272w, https://substackcdn.com/image/fetch/$s_!q7Ed!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d356fba-a07c-4bef-843c-6dd15caec040_828x417.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!q7Ed!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d356fba-a07c-4bef-843c-6dd15caec040_828x417.png" width="1200" height="604.3478260869565" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d356fba-a07c-4bef-843c-6dd15caec040_828x417.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:417,&quot;width&quot;:828,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:74998,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!q7Ed!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d356fba-a07c-4bef-843c-6dd15caec040_828x417.png 424w, https://substackcdn.com/image/fetch/$s_!q7Ed!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d356fba-a07c-4bef-843c-6dd15caec040_828x417.png 848w, https://substackcdn.com/image/fetch/$s_!q7Ed!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d356fba-a07c-4bef-843c-6dd15caec040_828x417.png 1272w, https://substackcdn.com/image/fetch/$s_!q7Ed!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d356fba-a07c-4bef-843c-6dd15caec040_828x417.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The m5n EC2 instances and their network performance and costs</figcaption></figure></div><p>(The costs by themselves are affordable in my opinion. For example, the m5n.24xlarge / m5n.metal costs are around $4,100 per month. 
These can also be decreased drastically if one were to use reserved instances or spot instances).</p><p>And we see very nice performance even in higher level services such as AWS Lambda (that might be built on top of this EC2 infrastructure)  when reading / writing to AWS services such as S3 and Kinesis etc [5].</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ckRb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80f9df0f-6c78-4017-8096-ad7855311d96_1907x1036.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ckRb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80f9df0f-6c78-4017-8096-ad7855311d96_1907x1036.png 424w, https://substackcdn.com/image/fetch/$s_!ckRb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80f9df0f-6c78-4017-8096-ad7855311d96_1907x1036.png 848w, https://substackcdn.com/image/fetch/$s_!ckRb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80f9df0f-6c78-4017-8096-ad7855311d96_1907x1036.png 1272w, https://substackcdn.com/image/fetch/$s_!ckRb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80f9df0f-6c78-4017-8096-ad7855311d96_1907x1036.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ckRb!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80f9df0f-6c78-4017-8096-ad7855311d96_1907x1036.png" width="1200" height="651.9230769230769" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/80f9df0f-6c78-4017-8096-ad7855311d96_1907x1036.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:791,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:397965,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ckRb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80f9df0f-6c78-4017-8096-ad7855311d96_1907x1036.png 424w, https://substackcdn.com/image/fetch/$s_!ckRb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80f9df0f-6c78-4017-8096-ad7855311d96_1907x1036.png 848w, https://substackcdn.com/image/fetch/$s_!ckRb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80f9df0f-6c78-4017-8096-ad7855311d96_1907x1036.png 1272w, https://substackcdn.com/image/fetch/$s_!ckRb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80f9df0f-6c78-4017-8096-ad7855311d96_1907x1036.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">With 500 concurrent Lambda tasks, #LetsData peaked at reading 455 GB per minute from S3 and writing 12.36 GB per minute to AWS Kinesis, extracting 2.7 million records per minute (~45K records per second!)</figcaption></figure></div><p>So, to reiterate my humble observation from above: code optimizations that keep data on the same machine, rack or network switch might not be necessary.</p><p>What about the same AWS availability zone? The same AWS region (different availability zone)? The same geographical region (different AWS region)? A different geographical region? 
<em>(For those who need a primer on these concepts, see <a href="https://docs.aws.amazon.com/whitepapers/latest/get-started-documentdb/aws-regions-and-availability-zones.html">this page about AWS Global Infrastructure</a> [6] and the page about <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-regions-availability-zones">regions and availability zones</a> [7].)</em></p><p>Let&#8217;s look at some latencies [8]:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FpY6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd90883c4-1325-4a1d-8a80-e48989acca49_917x402.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FpY6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd90883c4-1325-4a1d-8a80-e48989acca49_917x402.png 424w, https://substackcdn.com/image/fetch/$s_!FpY6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd90883c4-1325-4a1d-8a80-e48989acca49_917x402.png 848w, https://substackcdn.com/image/fetch/$s_!FpY6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd90883c4-1325-4a1d-8a80-e48989acca49_917x402.png 1272w, https://substackcdn.com/image/fetch/$s_!FpY6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd90883c4-1325-4a1d-8a80-e48989acca49_917x402.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!FpY6!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd90883c4-1325-4a1d-8a80-e48989acca49_917x402.png" width="1200" height="526.0632497273718" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d90883c4-1325-4a1d-8a80-e48989acca49_917x402.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:402,&quot;width&quot;:917,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:116708,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!FpY6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd90883c4-1325-4a1d-8a80-e48989acca49_917x402.png 424w, https://substackcdn.com/image/fetch/$s_!FpY6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd90883c4-1325-4a1d-8a80-e48989acca49_917x402.png 848w, https://substackcdn.com/image/fetch/$s_!FpY6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd90883c4-1325-4a1d-8a80-e48989acca49_917x402.png 1272w, https://substackcdn.com/image/fetch/$s_!FpY6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd90883c4-1325-4a1d-8a80-e48989acca49_917x402.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Percentile 99 Latency - AWS Regions from www.cloudping.co</figcaption></figure></div><p>So within the same region, we can expect a p99 latency &lt; 10 milliseconds. The website doesn&#8217;t break down the numbers for within an availability zone versus across availability zones, but <a href="https://aws.amazon.com/blogs/architecture/improving-performance-and-reducing-cost-using-availability-zone-affinity/">Michael Haken&#8217;s blog post</a> [10] suggests that sub-millisecond latency can be expected within the same availability zone. 
For latency within the same datacenter, and how it compares with a disk seek, see [<a href="https://serverfault.com/questions/238417/are-networks-now-faster-than-disks">11</a>]:</p><blockquote><p>Round trip within same datacenter 500,000 ns</p><p>Disk seek 10,000,000 ns</p></blockquote><p>To reason about the latency and throughput, refer to <a href="https://bradhedlund.com/2008/12/19/how-to-calculate-tcp-throughput-for-long-distance-links/">Brad Hedlund&#8217;s post</a> [9].</p><blockquote><p>Reduce latency? How is that possible? Unless you can figure out how to overcome the speed of light there is nothing you can do to reduce the real latency between sites. One option is, again, placing a WAN accelerator at each end that locally acknowledges the TCP segments to the local server, thereby fooling the servers into seeing very low LAN like latency for the TCP data transfers.</p></blockquote><p>So, the reality here is that you cannot be faster than the speed of light (see this <a href="https://www.instagram.com/p/Bz0q8bphl_m/">Instagram animation</a>). Light completes about 7.5 trips around the Earth in 1 second, so one trip around the world takes 1000 ms / 7.5, which is about 133 milliseconds. <strong>Light needs about 133.33 milliseconds to travel around the world.</strong></p><p>With this reasoning, in my humble opinion, you&#8217;d get acceptable performance by being close to your data (in-region rather than cross-region), and you can probably get away with not being on the same machine, rack or network switch. 
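</p><p>The speed-of-light reasoning above can be sketched as a short calculation. This is a minimal sketch rather than a measurement: the Earth circumference (40,075 km), the vacuum speed of light (299,792.458 km/s), the assumption that signals in optical fiber travel at roughly two-thirds of that speed, and the helper names are all my own assumptions and are not taken from the sources cited above. The throughput helper uses the window-size-over-round-trip-time bound that Brad Hedlund&#8217;s post [9] works through.</p>

```python
# Back-of-the-envelope latency and throughput estimates.
# All constants below are assumptions made for illustration.

EARTH_CIRCUMFERENCE_KM = 40_075.0   # approximate equatorial circumference
SPEED_OF_LIGHT_KM_S = 299_792.458   # speed of light in vacuum
FIBER_SPEED_FRACTION = 2.0 / 3.0    # assumed: light in fiber is about 2/3 as fast


def around_the_world_ms(speed_fraction=1.0):
    """Milliseconds for light to travel once around the Earth."""
    speed_km_s = SPEED_OF_LIGHT_KM_S * speed_fraction
    return EARTH_CIRCUMFERENCE_KM / speed_km_s * 1000.0


def max_tcp_throughput_mbps(window_bytes, rtt_ms):
    """Upper bound for one TCP connection: window size / round-trip time."""
    window_bits = window_bytes * 8
    rtt_seconds = rtt_ms / 1000.0
    return window_bits / rtt_seconds / 1_000_000.0


if __name__ == "__main__":
    print(f"vacuum: {around_the_world_ms():.1f} ms")
    print(f"fiber:  {around_the_world_ms(FIBER_SPEED_FRACTION):.1f} ms")
    # 64 KB window with an in-region (10 ms) and a cross-region (250 ms) round trip
    for rtt_ms in (10, 250):
        mbps = max_tcp_throughput_mbps(64 * 1024, rtt_ms)
        print(f"rtt {rtt_ms} ms: {mbps:.1f} Mbit/s")
```

<p>Under these assumptions the vacuum figure comes out at roughly 134 ms, in line with the estimate above, and about 200 ms in fiber. A single TCP connection with a 64 KB window is bounded at roughly 52 Mbit/s over a 10 ms round trip but only about 2 Mbit/s over a 250 ms round trip, which illustrates why in-region versus cross-region is the distinction that matters here.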
</p><p>So, if your data is in Europe, you&#8217;d now benefit from #LetsData compute availability in eu-west-1 (Ireland)!</p><h2>#LetsData Dataset Types</h2><p>Okay, enough about the background. What can we do with #LetsData&#8217;s availability in multiple regions?</p><p>With multi-region support, datasets on #LetsData can now be either <strong>single region</strong> or <strong>cross-region</strong>.</p><ul><li><p><strong>Single Region Datasets:</strong> Single region datasets are completely located in a single region: all resources required for the dataset processing <em>(read destination, write destination, error connector, artifacts and compute engine)</em> are located in the same region.</p></li><li><p><strong>Cross-Region Datasets: </strong>Cross-region datasets are those whose resources <em>(read destination, write destination, error connector, artifacts and compute engine)</em> are located in different regions.</p></li></ul><p>The dataset configuration defines whether a dataset is single region or cross-region. It supports <em><strong>specifying the region either at the dataset level (single region)</strong></em> or <em><strong>individually at each resource level (cross-region)</strong></em>. This gives the flexibility to create configurations according to the dataset&#8217;s needs. 
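</p><p>The two ways of specifying regions described above can be sketched in a few lines of Python. This is an illustrative sketch only: the key names (region, readConnector, writeConnector, errorConnector, artifact, computeEngine) and the rule that a resource-level region falls back to the dataset-level region are hypothetical and are not the actual #LetsData schema; the region documentation [1] has the real one.</p>

```python
# Hypothetical illustration only: the key names and the fallback rule
# are assumptions, not the actual #LetsData configuration schema.

RESOURCE_KEYS = (
    "readConnector",
    "writeConnector",
    "errorConnector",
    "artifact",
    "computeEngine",
)


def resolve_regions(dataset_config):
    """Map each resource to its region, falling back to the dataset-level region."""
    default_region = dataset_config.get("region")
    resolved = {}
    for key in RESOURCE_KEYS:
        resource = dataset_config.get(key, {})
        region = resource.get("region", default_region)
        if region is None:
            raise ValueError(f"no region specified for {key}")
        resolved[key] = region
    return resolved


def dataset_type(dataset_config):
    """Classify a dataset as single-region or cross-region."""
    regions = set(resolve_regions(dataset_config).values())
    return "single-region" if len(regions) == 1 else "cross-region"


# Region given once at the dataset level.
single = {"region": "eu-west-1"}

# Region given per resource; compute is pinned to the read region.
cross = {
    "readConnector": {"region": "eu-west-1"},
    "writeConnector": {"region": "us-east-1"},
    "errorConnector": {"region": "us-east-1"},
    "artifact": {"region": "eu-west-1"},
    "computeEngine": {"region": "eu-west-1"},
}

print(dataset_type(single))
print(dataset_type(cross))
```

<p>In this sketch, the first configuration resolves every resource to eu-west-1, while the second keeps compute next to the reads and sends writes and errors to us-east-1.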
</p><p>Here is an example schema for dataset configuration with regions.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ozPV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b303ff0-ef73-4b0c-a8c0-5ec7f2fbd83a_882x550.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ozPV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b303ff0-ef73-4b0c-a8c0-5ec7f2fbd83a_882x550.png 424w, https://substackcdn.com/image/fetch/$s_!ozPV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b303ff0-ef73-4b0c-a8c0-5ec7f2fbd83a_882x550.png 848w, https://substackcdn.com/image/fetch/$s_!ozPV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b303ff0-ef73-4b0c-a8c0-5ec7f2fbd83a_882x550.png 1272w, https://substackcdn.com/image/fetch/$s_!ozPV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b303ff0-ef73-4b0c-a8c0-5ec7f2fbd83a_882x550.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ozPV!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b303ff0-ef73-4b0c-a8c0-5ec7f2fbd83a_882x550.png" width="1200" height="748.2993197278912" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6b303ff0-ef73-4b0c-a8c0-5ec7f2fbd83a_882x550.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:550,&quot;width&quot;:882,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:79890,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ozPV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b303ff0-ef73-4b0c-a8c0-5ec7f2fbd83a_882x550.png 424w, https://substackcdn.com/image/fetch/$s_!ozPV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b303ff0-ef73-4b0c-a8c0-5ec7f2fbd83a_882x550.png 848w, https://substackcdn.com/image/fetch/$s_!ozPV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b303ff0-ef73-4b0c-a8c0-5ec7f2fbd83a_882x550.png 1272w, https://substackcdn.com/image/fetch/$s_!ozPV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b303ff0-ef73-4b0c-a8c0-5ec7f2fbd83a_882x550.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Schema: specifying regions in dataset configuration</figcaption></figure></div><p>The Compute Engine region is probably the most important region for the dataset - this is the center of the dataset and where all the processing happens. Essentially, all distances (in-region, cross-region) are measured with this region as their origin. </p><h2><strong>Interesting Dataset Configurations</strong></h2><p>Ideally, all processing should be done in-region for the best performance and cost, but with multi-region support, one can now create a number of interesting configurations that prioritize reads, writes or errors. For example:</p><ul><li><p><strong>Prioritizing for In-Region Reads:</strong> By selecting the compute region to be the same as the read region, one can keep reads in-region while writes go to a different region. This is essentially <em><strong>pinning</strong></em> the compute region to the read region. 
This configuration is useful when the volume of data read is much greater than the volume written (and reads and writes need to be in different regions).</p></li><li><p><strong>Prioritizing for In-Region Writes:</strong> By selecting the compute region to be the same as the write region, one can prioritize writes over reads - essentially <em><strong>pinning</strong></em> the compute region to the write region. This configuration is useful when the writes are significant in comparison to the reads.</p></li><li><p><strong>Degenerate case:</strong> The academic degenerate case for dataset component regions is where every component is in a separate region. We've tested this scenario, and it works.</p></li></ul><h2>#LetsData Regions Implementation Design</h2><p>How have we implemented #LetsData regions? We&#8217;ve partitioned a dataset into two obvious components: a control component and a data component. </p><ul><li><p><strong>Control Component:</strong> The control component handles the dataset&#8217;s creation, initialization and management. These are low-volume, one-time (or few-time) operations where we set up resources for data processing, so we run them in the <strong>us-east-1</strong> region. Since they are infrequent, even a 250 ms cross-region call from ap-south-1 should be okay and scalable as of now.</p></li><li><p><strong>Data Component:</strong> The data component consists of the Data Task Lambda function and the SageMaker endpoints (if any), which are created in the Compute Region. All the data processing happens in the Compute Region: data access, checkpointing, task processing, etc. The only cross-region call is at the completion of tasks, when we let the control process know that the tasks have completed. 
</p></li></ul><p>Here is the dataset and task lifecycle and the split between control and data components:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b5oz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc34cdc99-08a3-4c4a-8557-c2c66a460c23_1774x719.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b5oz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc34cdc99-08a3-4c4a-8557-c2c66a460c23_1774x719.png 424w, https://substackcdn.com/image/fetch/$s_!b5oz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc34cdc99-08a3-4c4a-8557-c2c66a460c23_1774x719.png 848w, https://substackcdn.com/image/fetch/$s_!b5oz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc34cdc99-08a3-4c4a-8557-c2c66a460c23_1774x719.png 1272w, https://substackcdn.com/image/fetch/$s_!b5oz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc34cdc99-08a3-4c4a-8557-c2c66a460c23_1774x719.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b5oz!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc34cdc99-08a3-4c4a-8557-c2c66a460c23_1774x719.png" width="1200" height="486.2637362637363" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c34cdc99-08a3-4c4a-8557-c2c66a460c23_1774x719.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:590,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:136576,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!b5oz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc34cdc99-08a3-4c4a-8557-c2c66a460c23_1774x719.png 424w, https://substackcdn.com/image/fetch/$s_!b5oz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc34cdc99-08a3-4c4a-8557-c2c66a460c23_1774x719.png 848w, https://substackcdn.com/image/fetch/$s_!b5oz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc34cdc99-08a3-4c4a-8557-c2c66a460c23_1774x719.png 1272w, https://substackcdn.com/image/fetch/$s_!b5oz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc34cdc99-08a3-4c4a-8557-c2c66a460c23_1774x719.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">#LetsData Regions Implementation Design</figcaption></figure></div><h2>Learnings</h2><p>Here are some best practices learned from experience:</p><ul><li><p>Perfect a task in a single region, and then replicate it across geographies.</p></li><li><p>Question, for every task being replicated, whether it is actually needed in the new region. Managing 1 region vs. 6 regions is a very different effort - you&#8217;d have copies of credentials, storage, compute, logs, metrics and other such infrastructure. Try to see what the system performance would be if something is not regionalized. </p></li><li><p>Deep dive into components when replicating across geographies. It&#8217;s very easy to add regions with a suboptimal architecture - beware of hidden cross-region calls and of the non-regionalized component that you didn&#8217;t know was also part of the architecture. 
</p></li><li><p>Run tests in different regions, analyze usage and look for any regional / cross-region usage and data transfer. You might find a bug or two. </p></li><li><p>Instead of replicating databases across regions for low latency, we clearly separated the control databases and data databases. Control databases are only in the us-east-1 region and are accessed by control components only. Data databases are in the data regions and are accessed by data components only. This helped us get away without global tables and cross-region replication. While these features are great, as of now, not having to deal with such complexity is the better architecture IMHO. </p></li></ul><p>A few things we would like to do but have deferred:</p><ul><li><p>API availability in each region with regional API and code deployments - this adds management overhead as of now</p></li><li><p>Separate AWS accounts for each region - currently we are using a single AWS account for all regions - again, having separate accounts is probably a better scaling decision, but it&#8217;s a lot more work to manage</p></li></ul><h2>Additional Reads?</h2><p>The distributed computing field is quite large, and while I have read some papers and tried to form an opinion, I am sure many important papers since the ones I mentioned have advanced this field. If there are any must-reads in this area, do share - I&#8217;d love to learn what people have done in this space. 
</p><p>Thoughts / feedback, let me know.</p><h2>References:</h2><p>[1] Region documentation on #LetsData - <a href="https://www.letsdata.io/docs#regions">https://www.letsdata.io/docs#regions</a></p><p>[2] Google Map Reduce - <a href="https://storage.googleapis.com/pub-tools-public-publication-data/pdf/16cb30b4b92fd4989b8619a61752a2387c6dd474.pdf">https://storage.googleapis.com/pub-tools-public-publication-data/pdf/16cb30b4b92fd4989b8619a61752a2387c6dd474.pdf</a></p><p>[3] Apache Hadoop - <a href="https://hadoop.apache.org/docs/r1.2.1/hdfs_design.pdf">https://hadoop.apache.org/docs/r1.2.1/hdfs_design.pdf</a> </p><p>[4] EC2 Instance Type Pricing - <a href="https://aws.amazon.com/ec2/pricing/on-demand/">https://aws.amazon.com/ec2/pricing/on-demand/</a></p><p>[5] #LetsData Case Study: Big Data: Building a Document Index From Web Crawl Archives - <a href="https://www.letsdata.io/#casestudies">https://www.letsdata.io/#casestudies</a>  - <a href="https://d108vtfcfy7u5c.cloudfront.net/images/CaseStudy-CommonCrawl.pdf">PDF</a></p><p>[6] AWS Regions and Availability Zones - <a href="https://docs.aws.amazon.com/whitepapers/latest/get-started-documentdb/aws-regions-and-availability-zones.html">https://docs.aws.amazon.com/whitepapers/latest/get-started-documentdb/aws-regions-and-availability-zones.html</a></p><p>[7] Regions and Zones - <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-regions-availability-zones">https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-regions-availability-zones</a></p><p>[8] AWS Latency Monitoring  - <a href="https://www.cloudping.co/grid/p_99/timeframe/1M">https://www.cloudping.co/grid/p_99/timeframe/1M</a></p><p>[9] How to Calculate TCP throughput for long distance WAN links - <a 
href="https://bradhedlund.com/2008/12/19/how-to-calculate-tcp-throughput-for-long-distance-links/">https://bradhedlund.com/2008/12/19/how-to-calculate-tcp-throughput-for-long-distance-links/</a></p><p>[10] Improving Performance and Reducing Cost Using Availability Zone Affinity -<strong> </strong><a href="https://aws.amazon.com/blogs/architecture/improving-performance-and-reducing-cost-using-availability-zone-affinity/">https://aws.amazon.com/blogs/architecture/improving-performance-and-reducing-cost-using-availability-zone-affinity/</a></p><p>[11] Are networks now faster than disks - <a href="https://serverfault.com/questions/238417/are-networks-now-faster-than-disks">https://serverfault.com/questions/238417/are-networks-now-faster-than-disks</a></p>]]></content:encoded></item><item><title><![CDATA[Metrics, Data Visualizations & Dashboards]]></title><description><![CDATA[Metrics, Dashboards and Instrumentation results are essential for a number of reasons - the reason I like them is the sense of accomplishment and achievement you feel as an engineer when you see that your code has processed terabytes of data and is scaling to handle thousands of requests per second (I know - such vanity!).]]></description><link>https://blog.letsdata.io/p/metrics-data-visualizations-and-dashboards</link><guid isPermaLink="false">https://blog.letsdata.io/p/metrics-data-visualizations-and-dashboards</guid><dc:creator><![CDATA[Usman]]></dc:creator><pubDate>Sun, 08 Oct 2023 15:08:12 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ee50b607-2a14-467b-8319-57fc42915c1d_398x398.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Metrics, Dashboards and Instrumentation results are essential for a number of reasons - the reason I like them is the sense of accomplishment and achievement you feel as an engineer when you see that your code has processed terabytes of data and is scaling to handle thousands of requests per second (I know - such 
vanity!). Short of customer commendation, such positive reinforcement can be a proxy for success and a powerful motivator. But I digress. </p><h2>Metric Definition</h2><p>Metrics, dashboards and instrumentation results give a view into the running of the service: what&#8217;s going as expected, what is an anomaly and how the overall system is performing. While this would be a great explanation of a <strong>service&#8217;s operational dashboard</strong>, the science around dashboards goes much deeper. One simple classification that I liked is classifying a metric / dashboard as either exploratory or explanatory. Here is a visual from <a href="https://www.linkedin.com/posts/brentdykes_datastorytelling-datavisualization-datavisualisation-activity-7097629835941875714-iL9b">Brent&#8217;s LinkedIn post</a> that talks about this [1]:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2fZG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48c5ee1f-74b9-4781-89fc-fdd4d18e7704_800x450.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2fZG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48c5ee1f-74b9-4781-89fc-fdd4d18e7704_800x450.jpeg 424w, https://substackcdn.com/image/fetch/$s_!2fZG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48c5ee1f-74b9-4781-89fc-fdd4d18e7704_800x450.jpeg 848w, https://substackcdn.com/image/fetch/$s_!2fZG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48c5ee1f-74b9-4781-89fc-fdd4d18e7704_800x450.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!2fZG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48c5ee1f-74b9-4781-89fc-fdd4d18e7704_800x450.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2fZG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48c5ee1f-74b9-4781-89fc-fdd4d18e7704_800x450.jpeg" width="728" height="409.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/48c5ee1f-74b9-4781-89fc-fdd4d18e7704_800x450.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:450,&quot;width&quot;:800,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:36439,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2fZG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48c5ee1f-74b9-4781-89fc-fdd4d18e7704_800x450.jpeg 424w, https://substackcdn.com/image/fetch/$s_!2fZG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48c5ee1f-74b9-4781-89fc-fdd4d18e7704_800x450.jpeg 848w, https://substackcdn.com/image/fetch/$s_!2fZG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48c5ee1f-74b9-4781-89fc-fdd4d18e7704_800x450.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!2fZG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48c5ee1f-74b9-4781-89fc-fdd4d18e7704_800x450.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Metric: Exploratory vs. Explanatory</figcaption></figure></div><p>We weren&#8217;t thinking this deeply when we implemented our metrics, but in hindsight it makes sense to apply this theory to them.  
We essentially wanted to answer these primary questions:</p><ul><li><p>What do we intend to communicate with the metric / visualization? (Goal in Brent&#8217;s post)</p></li><li><p>What will the viewer view this visualization for (or the action that they would take)?  (Outcome in Brent&#8217;s post)</p></li></ul><p>Some of the other dimensions he mentions were implicitly assumed during our work:</p><ul><li><p>the audience is a LetsData user (audience)</p></li><li><p>the LetsData user is familiar with the task and data domain (data familiarity)</p></li><li><p>the LetsData user would want to know about the system&#8217;s performance (visualization focus)</p></li><li><p>the user knows the narrative, as it is specific to each metric - throughput from a read destination, scaling of a write destination etc. (narrative)</p></li></ul><h2>What Should I Instrument?</h2><p>So what do I instrument? Well, it depends on what you intend to learn. For example, at my earlier startup, we measured the customer acquisition funnel and how it was changing over time with the experiments that we were running [2]. These were essentially business KPIs that told us how healthy the business was. </p><p>LetsData is a developer-focused offering, and much of our effort goes into giving developers visibility into the different parts of the system that they have delegated to LetsData. This is essentially:</p><ul><li><p>a service&#8217;s operational dashboard </p></li><li><p>operators can use these to tune or correct the service</p></li><li><p>operators can monitor the different components of the service and reason about the overall progress</p></li></ul><p>A large part of what to instrument is intuitive and comes with experience developing, operating and monitoring services. When you are developing code, it becomes second nature to add metrics for long-running CPU work, network calls and other such tasks. 
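</p><p>As a rough illustration of that habit, here is a minimal sketch of timing a block of work and recording its latency. It is not taken from the #LetsData codebase: the <code>timed</code> helper, the in-memory <code>recorded</code> list and the metric name are all hypothetical, and a real service would hand the measurement to its metrics library instead of a list.</p>

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory sink; a real service would pass these to its metrics library.
recorded = []


@contextmanager
def timed(metric_name, sink=recorded, clock=time.perf_counter):
    """Record the elapsed milliseconds of the wrapped block, even if it raises."""
    start = clock()
    try:
        yield
    finally:
        elapsed_ms = (clock() - start) * 1000.0
        sink.append((metric_name, elapsed_ms))


# Usage: wrap the network call or long-running CPU work that you want visibility into.
with timed("ExampleNetworkCallLatency"):
    time.sleep(0.01)  # stand-in for the real work
```

<p>Recording in the <code>finally</code> clause means that failed calls are measured too, which is usually when the latency is most interesting.</p><p>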
Also, not all of this instrumentation needs to be on the final service dashboards. </p><p>What we decide to display on the Service Level Dashboard needs to answer a need that the user may have. One way to achieve this is to draw your system&#8217;s architecture diagram and reason about which components users would want to know more about. This approach is consistent with other system analysis tasks - Threat Model Diagrams in Security, Data Flow Diagrams in Software Design etc. Generally, when unsure about some aspect of a system, drawing a diagram and breaking it into components helps as a divide and conquer approach.</p><p>Let&#8217;s look at the #LetsData Task architecture diagram and see how we&#8217;ve reasoned about a handful of top-level metrics for the dashboard.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3m6B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03be31e8-3e75-4c6c-a7aa-3c9523b2fe5f_1196x563.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3m6B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03be31e8-3e75-4c6c-a7aa-3c9523b2fe5f_1196x563.png 424w, https://substackcdn.com/image/fetch/$s_!3m6B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03be31e8-3e75-4c6c-a7aa-3c9523b2fe5f_1196x563.png 848w, https://substackcdn.com/image/fetch/$s_!3m6B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03be31e8-3e75-4c6c-a7aa-3c9523b2fe5f_1196x563.png 1272w, 
https://substackcdn.com/image/fetch/$s_!3m6B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03be31e8-3e75-4c6c-a7aa-3c9523b2fe5f_1196x563.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3m6B!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03be31e8-3e75-4c6c-a7aa-3c9523b2fe5f_1196x563.png" width="1200" height="564.8829431438127" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/03be31e8-3e75-4c6c-a7aa-3c9523b2fe5f_1196x563.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:563,&quot;width&quot;:1196,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:75392,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3m6B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03be31e8-3e75-4c6c-a7aa-3c9523b2fe5f_1196x563.png 424w, https://substackcdn.com/image/fetch/$s_!3m6B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03be31e8-3e75-4c6c-a7aa-3c9523b2fe5f_1196x563.png 848w, https://substackcdn.com/image/fetch/$s_!3m6B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03be31e8-3e75-4c6c-a7aa-3c9523b2fe5f_1196x563.png 1272w, 
https://substackcdn.com/image/fetch/$s_!3m6B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03be31e8-3e75-4c6c-a7aa-3c9523b2fe5f_1196x563.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">#LetsData Task Architecture Diagram</figcaption></figure></div><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!MO4u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe98a02b7-cd32-4aea-86da-cfc903dcd31d_184x61.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MO4u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe98a02b7-cd32-4aea-86da-cfc903dcd31d_184x61.png 424w, https://substackcdn.com/image/fetch/$s_!MO4u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe98a02b7-cd32-4aea-86da-cfc903dcd31d_184x61.png 848w, https://substackcdn.com/image/fetch/$s_!MO4u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe98a02b7-cd32-4aea-86da-cfc903dcd31d_184x61.png 1272w, https://substackcdn.com/image/fetch/$s_!MO4u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe98a02b7-cd32-4aea-86da-cfc903dcd31d_184x61.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MO4u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe98a02b7-cd32-4aea-86da-cfc903dcd31d_184x61.png" width="184" height="61" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e98a02b7-cd32-4aea-86da-cfc903dcd31d_184x61.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:61,&quot;width&quot;:184,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6767,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MO4u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe98a02b7-cd32-4aea-86da-cfc903dcd31d_184x61.png 424w, https://substackcdn.com/image/fetch/$s_!MO4u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe98a02b7-cd32-4aea-86da-cfc903dcd31d_184x61.png 848w, https://substackcdn.com/image/fetch/$s_!MO4u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe98a02b7-cd32-4aea-86da-cfc903dcd31d_184x61.png 1272w, https://substackcdn.com/image/fetch/$s_!MO4u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe98a02b7-cd32-4aea-86da-cfc903dcd31d_184x61.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The first step in task lifecycle is the task start or task processing. As a #LetsData Task processes, the user would want to know how many tasks succeeded / failed (Task Success Rate). 
They would also want to know the time taken by a task (Task Latency).</p><ul><li><p><em>What we intend to communicate:</em> Overall task progress, whether the system is succeeding or erroring. </p></li><li><p><em>What will the viewer view this visualization for (or the action that they would take)?</em> Alerts the user to any latency issues or task failures so that they can investigate by looking at additional metrics, logs etc.</p><div><hr></div></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9Hd7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec4627ed-cad4-4f5f-966d-299bcdb18357_412x27.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9Hd7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec4627ed-cad4-4f5f-966d-299bcdb18357_412x27.png 424w, https://substackcdn.com/image/fetch/$s_!9Hd7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec4627ed-cad4-4f5f-966d-299bcdb18357_412x27.png 848w, https://substackcdn.com/image/fetch/$s_!9Hd7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec4627ed-cad4-4f5f-966d-299bcdb18357_412x27.png 1272w, https://substackcdn.com/image/fetch/$s_!9Hd7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec4627ed-cad4-4f5f-966d-299bcdb18357_412x27.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!9Hd7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec4627ed-cad4-4f5f-966d-299bcdb18357_412x27.png" width="412" height="27" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ec4627ed-cad4-4f5f-966d-299bcdb18357_412x27.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:27,&quot;width&quot;:412,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7795,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9Hd7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec4627ed-cad4-4f5f-966d-299bcdb18357_412x27.png 424w, https://substackcdn.com/image/fetch/$s_!9Hd7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec4627ed-cad4-4f5f-966d-299bcdb18357_412x27.png 848w, https://substackcdn.com/image/fetch/$s_!9Hd7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec4627ed-cad4-4f5f-966d-299bcdb18357_412x27.png 1272w, https://substackcdn.com/image/fetch/$s_!9Hd7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec4627ed-cad4-4f5f-966d-299bcdb18357_412x27.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The task is responsible for reading data from the read destination - metrics for read throughput (Task Bytes Read) and 
read latency (Message Read Latency) would help diagnose any integration issues with the read destination.</p><ul><li><p><em>What we intend to communicate:</em> The performance statistics of the task&#8217;s reader  </p></li><li><p><em>What will the viewer view this visualization for (or the action that they would take)?</em> Alerts the user to performance issues when reading from the read destination so that they can investigate by looking at read destination metrics, logs etc.</p></li></ul><div><hr></div><p>Describing the metrics needed for each box in the diagram would become rather tedious and boring. These two examples should give an idea of how to define the metrics needed for the service&#8217;s dashboard. Correlating with the diagram, here is a list of the metrics that are on our dashboards:</p><ul><li><p>Task Success &amp; Volume (Box A in the diagram)</p></li><li><p>Task Latency (Box A)</p></li><li><p>Task Checkpoint Success &amp; Latency (Box H)</p></li><li><p>Task Number of Records Processed / Errored / Skipped (Box F &amp; G)</p></li><li><p>Task Write Connector Success &amp; Volume (Box F)</p></li><li><p>Task Write Connector Put Retry % (Box F)</p></li><li><p>Task Record Latencies &amp; Volume (Iteration Latency - Box B to J)</p></li><li><p>Task Bytes Read &amp; Bytes Written (Box A and Box F)</p></li></ul><p>Here is an actual dashboard from the case study &#8220;Big Data: Building a Document Index From Web Crawl Archives&#8221; - the metrics are available and browsable at: <a href="https://www.letsdata.io/#casestudies">https://www.letsdata.io/#casestudies</a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Whh8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f7c787-256a-4d68-8c4b-e5c8b6b115b5_2472x914.png" data-component-name="Image2ToDOM"><div class="image2-inset image2-full-screen"><picture><source 
type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Whh8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f7c787-256a-4d68-8c4b-e5c8b6b115b5_2472x914.png 424w, https://substackcdn.com/image/fetch/$s_!Whh8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f7c787-256a-4d68-8c4b-e5c8b6b115b5_2472x914.png 848w, https://substackcdn.com/image/fetch/$s_!Whh8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f7c787-256a-4d68-8c4b-e5c8b6b115b5_2472x914.png 1272w, https://substackcdn.com/image/fetch/$s_!Whh8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f7c787-256a-4d68-8c4b-e5c8b6b115b5_2472x914.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Whh8!,w_5760,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f7c787-256a-4d68-8c4b-e5c8b6b115b5_2472x914.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e2f7c787-256a-4d68-8c4b-e5c8b6b115b5_2472x914.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;full&quot;,&quot;height&quot;:538,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:582763,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-fullscreen" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!Whh8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f7c787-256a-4d68-8c4b-e5c8b6b115b5_2472x914.png 424w, https://substackcdn.com/image/fetch/$s_!Whh8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f7c787-256a-4d68-8c4b-e5c8b6b115b5_2472x914.png 848w, https://substackcdn.com/image/fetch/$s_!Whh8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f7c787-256a-4d68-8c4b-e5c8b6b115b5_2472x914.png 1272w, https://substackcdn.com/image/fetch/$s_!Whh8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f7c787-256a-4d68-8c4b-e5c8b6b115b5_2472x914.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">#LetsData Dashboard for the case study &#8220;Big Data: Building a Document Index From Web Crawl Archives&#8221;</figcaption></figure></div><p>The metrics we&#8217;ve defined are mostly explanatory - there is very little exploration. However, they may lead to some curious exploration to understand the system. Our metrics infrastructure is not currently set up for exploration - specifically, the <em>speed and flexibility</em> needed for exploration are missing. Exploration would require enabling users to filter, slice and dice the data and query metrics.  </p><p>Some advice from the metrics dashboard development experience:</p><ul><li><p><strong>metrics deluge</strong> on dashboards can be distracting so try limiting to one or two metrics per component area</p></li><li><p>know what <strong>type of visualization / chart</strong> you need - our metrics are mostly time series, so we&#8217;ve used line charts. An excellent resource on choosing different types of visualization is at [3]</p></li><li><p>one doesn&#8217;t know what screen size the graph will render on, or how much data / what data range the chart will have. Would a larger range or a limited screen size make the chart unusable? See if a zoom can be implemented that draws the chart in a modal, maximizing the screen space (this comes from experience - one of our charts was rendering such that the data wasn&#8217;t legible, so we implemented a <strong>zoom control</strong>)</p></li><li><p>since you are emitting the metrics, you know what they mean. You know the idiosyncrasies of your data gathering library. 
<strong>Document how the metric should be interpreted</strong>. We&#8217;ve built an inline help control in our chart widget that we are extremely proud of. Help is only a click away. This IMHO should be a standard in all metrics and charting controls. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6FKw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aac3fc-c6f5-426d-84e1-91fa330d1848_1263x673.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6FKw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aac3fc-c6f5-426d-84e1-91fa330d1848_1263x673.png 424w, https://substackcdn.com/image/fetch/$s_!6FKw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aac3fc-c6f5-426d-84e1-91fa330d1848_1263x673.png 848w, https://substackcdn.com/image/fetch/$s_!6FKw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aac3fc-c6f5-426d-84e1-91fa330d1848_1263x673.png 1272w, https://substackcdn.com/image/fetch/$s_!6FKw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aac3fc-c6f5-426d-84e1-91fa330d1848_1263x673.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6FKw!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aac3fc-c6f5-426d-84e1-91fa330d1848_1263x673.png" width="1200" height="639.4299287410927" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/75aac3fc-c6f5-426d-84e1-91fa330d1848_1263x673.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:673,&quot;width&quot;:1263,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:354986,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6FKw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aac3fc-c6f5-426d-84e1-91fa330d1848_1263x673.png 424w, https://substackcdn.com/image/fetch/$s_!6FKw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aac3fc-c6f5-426d-84e1-91fa330d1848_1263x673.png 848w, https://substackcdn.com/image/fetch/$s_!6FKw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aac3fc-c6f5-426d-84e1-91fa330d1848_1263x673.png 1272w, https://substackcdn.com/image/fetch/$s_!6FKw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75aac3fc-c6f5-426d-84e1-91fa330d1848_1263x673.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">LetsData Metrics Control - Inline Expandable Documentation</figcaption></figure></div><p>I&#8217;ll also link advice from the charting authorities [4] [5] that should make charts and visualizations crisper. We need to implement some of these in our charts as well. </p></li></ul><h2>How do I Instrument?</h2><p>This section covers some technical aspects of emitting metrics with AWS CloudWatch - feel free to skip ahead if not interested. </p><p>We use AWS CloudWatch Metrics throughout #LetsData, and we love it:</p><ul><li><p><strong>Simple API:</strong> the SDK is simple to use and comes with single and batched put metric APIs [6]. At high scale, the single datapoint API can quickly get throttled (for example, 10 tasks x 30 messages per second x 30 metrics per message = 9,000 metrics per second!). 
We almost exclusively use statistic sets (pre-aggregated sample count, sum, minimum and maximum) and they work great - no throttling issues since we moved to these! </p></li><li><p><strong>Embedded Metric Format:</strong> Since we built our statistic set metrics infrastructure, CloudWatch now also supports an embedded metric format [7] where you log your metrics to CloudWatch Logs and they automatically get recorded as metrics. We are yet to try this, but it is a next-level simplification of a very scale-intensive problem. We recommend that newer CloudWatch integrations give this a try before the put-metric-data API. </p></li><li><p><strong>Data Lag &amp; Granularities:</strong> The data is available in near real time, and the per-minute granularity makes it extremely useful (1-second metrics are also supported). We emit per-minute metrics and have not had any issues during investigations. </p></li><li><p><strong>Query:</strong> Rich querying capabilities cover a large number of use cases [8]. The API allows multiple query results in a single call - we get our entire dashboard data in a single call. In addition, you can specify which statistics you need so that the server does that computation [9]. </p></li><li><p><strong>Console Metrics &amp; Dashboards: </strong>The CloudWatch Metrics console visualizations and dashboards are great - you can construct different types of visualizations and create detailed dashboards. </p></li><li><p><strong>Metric Streams: </strong>Metric streams are also available [10], which allow for an interesting range of new scenarios that can be built around metrics. For example, a write destination scaling / descaling service that listens to the LetsData PutItem Retries metric could automatically scale up the write destination if the retries are greater than a threshold. </p></li></ul><h2>What data visualizations should I create?</h2><p>Our metrics are mostly time series, so we&#8217;ve used line charts. 
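</p><p>Returning for a moment to the statistic sets mentioned under <em>Simple API</em> in the instrumentation section: the sketch below shows one way to fold many raw datapoints into a single statistic set per metric per minute before emitting. It is a minimal illustration under stated assumptions, not the #LetsData implementation. The <code>StatisticSetAggregator</code> class, the metric name and the namespace in the comments are hypothetical; the datum shape (<code>MetricName</code>, <code>Timestamp</code>, <code>Unit</code>, and <code>StatisticValues</code> with <code>SampleCount</code>, <code>Sum</code>, <code>Minimum</code>, <code>Maximum</code>) follows the CloudWatch PutMetricData request format.</p>

```python
from collections import defaultdict
from datetime import datetime, timezone


class StatisticSetAggregator:
    """Hypothetical helper: folds raw datapoints into per-minute statistic sets."""

    def __init__(self):
        # Key: (metric name, unit, start of minute). Value: running statistics.
        self._buckets = defaultdict(
            lambda: {"SampleCount": 0.0, "Sum": 0.0, "Minimum": None, "Maximum": None}
        )

    def record(self, name, value, unit="None", when=None):
        """Add one datapoint to the bucket for its metric and minute."""
        when = when or datetime.now(timezone.utc)
        minute = when.replace(second=0, microsecond=0)
        stats = self._buckets[(name, unit, minute)]
        stats["SampleCount"] += 1
        stats["Sum"] += value
        stats["Minimum"] = value if stats["Minimum"] is None else min(stats["Minimum"], value)
        stats["Maximum"] = value if stats["Maximum"] is None else max(stats["Maximum"], value)

    def flush(self):
        """Return one datum per (metric, minute), ordered by key, and reset."""
        data = [
            {"MetricName": name, "Timestamp": minute, "Unit": unit, "StatisticValues": dict(stats)}
            for (name, unit, minute), stats in sorted(self._buckets.items(), key=lambda kv: kv[0])
        ]
        self._buckets.clear()
        return data


aggregator = StatisticSetAggregator()
for latency_ms in (12.0, 48.0, 30.0):
    aggregator.record("ExampleReadLatency", latency_ms, unit="Milliseconds")

# Emitting needs boto3 and AWS credentials; the namespace is a made-up example:
#   import boto3
#   boto3.client("cloudwatch").put_metric_data(
#       Namespace="Example/Tasks", MetricData=aggregator.flush())
# Chunk the list if it exceeds the per-request limit.
```

<p>Under the example load above, and assuming the 30 metrics per message are 30 distinct metric names, each task would emit 30 statistic sets per minute in place of 54,000 individual datapoints (30 messages x 30 metrics x 60 seconds).</p><p>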
However, choosing the right data visualization for the data is critical. </p><p>[3] <em><strong>&#8220;Data Visualization Cheat Sheet&#8221;</strong></em>  and [4] <em><strong>&#8220;20 ideas for better data visualization&#8221; </strong></em> do an excellent job of explaining how to choose the right visualization and the dos and don&#8217;ts of each visualization type. [5] <em><strong>&#8220;What to consider when using text in data visualizations&#8221; </strong></em>has excellent advice on how to annotate charts. Even if you don&#8217;t have any need for charting right now, you should at least read and bookmark these three links as great <em>&#8220;Getting Started with Charting&#8221;</em> references for whenever you do. From the data visualization cheatsheet:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oEYm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2094962-17c5-4128-8cb2-37f2edf8b0a3_1647x1149.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oEYm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2094962-17c5-4128-8cb2-37f2edf8b0a3_1647x1149.png 424w, https://substackcdn.com/image/fetch/$s_!oEYm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2094962-17c5-4128-8cb2-37f2edf8b0a3_1647x1149.png 848w, https://substackcdn.com/image/fetch/$s_!oEYm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2094962-17c5-4128-8cb2-37f2edf8b0a3_1647x1149.png 1272w, 
https://substackcdn.com/image/fetch/$s_!oEYm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2094962-17c5-4128-8cb2-37f2edf8b0a3_1647x1149.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oEYm!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2094962-17c5-4128-8cb2-37f2edf8b0a3_1647x1149.png" width="1200" height="837.3626373626373" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e2094962-17c5-4128-8cb2-37f2edf8b0a3_1647x1149.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1016,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:472476,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oEYm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2094962-17c5-4128-8cb2-37f2edf8b0a3_1647x1149.png 424w, https://substackcdn.com/image/fetch/$s_!oEYm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2094962-17c5-4128-8cb2-37f2edf8b0a3_1647x1149.png 848w, https://substackcdn.com/image/fetch/$s_!oEYm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2094962-17c5-4128-8cb2-37f2edf8b0a3_1647x1149.png 1272w, 
https://substackcdn.com/image/fetch/$s_!oEYm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2094962-17c5-4128-8cb2-37f2edf8b0a3_1647x1149.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The Data Visualization Cheatsheet from DataCamp.com</figcaption></figure></div><h2>How do I create data visualizations?</h2><p>While AWS Cloudwatch Metrics gives us the raw metric datapoints, we use <a href="https://www.chartjs.org/">Chart.js (https://www.chartjs.org/)</a> on our website to create the data charts that we&#8217;ve shared 
earlier. We love Chart.js - it has a simple API and is performant and feature rich. </p><p>We did have to write a server-side data translation layer that converts AWS Cloudwatch Metrics datapoints to the Chart.js format - essentially creating all the chart data, labels etc. on the server and sending packaged data that the browser can display as charts without any additional processing. </p><p>However, being such big fans of AWS Cloudwatch Metrics, we wish we didn&#8217;t have to use Chart.js and the metrics data translation layer that we&#8217;ve written. AWS Metrics and Dashboards should natively support displaying charts and dashboards on external sites. Ideally, in our multi-tenant case, we would have a dashboard template that we would apply to each dataset to generate a dataset dashboard. This dashboard could then be displayed to the users on the LetsData website in a frictionless manner. </p><h2>Signing off</h2><p>That concludes today&#8217;s presentation. Hope you&#8217;ve enjoyed reading this. </p><p>We&#8217;d like to know:</p><ul><li><p>what stack you are using for metrics, data visualizations and dashboards </p></li><li><p>any additional items that should have been included when talking about data visualizations 
</p></li><li><p>how we can improve</p></li></ul><h2>Resources:</h2><p>[1] Brent Dykes&#8217; post on differences between exploratory and explanatory data visualizations: <a href="https://www.linkedin.com/posts/brentdykes_datastorytelling-datavisualization-datavisualisation-activity-7097629835941875714-iL9b">https://www.linkedin.com/posts/brentdykes_datastorytelling-datavisualization-datavisualisation-activity-7097629835941875714-iL9b</a></p><p>[2] LetsResonate blog post <em><strong>&#8220;Growth Hack - The Case For Device Push Notifications&#8221;</strong></em> - <a href="https://blog.letsresonate.net/post/612587620515627008/growth-hack-the-case-for-device-push">https://blog.letsresonate.net/post/612587620515627008/growth-hack-the-case-for-device-push</a></p><p>[3] Richie Cotton - <em><strong>&#8220;Data Visualization Cheat Sheet&#8221;</strong></em> <a href="https://www.datacamp.com/cheat-sheet/data-viz-cheat-sheet">https://www.datacamp.com/cheat-sheet/data-viz-cheat-sheet</a></p><p>[4] Taras Bakusevych - <em><strong>&#8220;20 ideas for better data visualization&#8221;</strong></em><strong> </strong>- <a href="https://uxdesign.cc/20-ideas-for-better-data-visualization-73f7e3c2782d">https://uxdesign.cc/20-ideas-for-better-data-visualization-73f7e3c2782d</a></p><p>[5] Lisa Charlotte Muth - <em><strong>&#8220;What to consider when using text in data visualizations&#8221; </strong>- </em><a href="https://blog.datawrapper.de/text-in-data-visualizations/">https://blog.datawrapper.de/text-in-data-visualizations/</a></p><p>[6] AWS Cloudwatch Metrics - <em><strong>&#8220;Publish custom metrics&#8221; </strong>- </em><a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html">https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html</a></p><p>[7] AWS Cloudwatch Metrics - <em><strong>&#8220;Specification: Embedded metric format&#8221;</strong></em> -<a 
href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format_Specification.html">https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format_Specification.html</a></p><p>[8] AWS Cloudwatch Metrics - <em><strong>&#8220;GetMetricData&#8221;</strong></em> - <a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricData.html">https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricData.html</a></p><p>[9] AWS Cloudwatch Metrics - <em><strong>&#8220;Get statistics for a metric&#8221;</strong></em> - <a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/getting-metric-statistics.html">https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/getting-metric-statistics.html</a></p><p>[10] AWS Cloudwatch Metrics - <em><strong>&#8220;Use metric streams&#8221;</strong></em> - <a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Metric-Streams.html">https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Metric-Streams.html</a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Designing for #LetsData Sagemaker Compute Performance]]></title><description><![CDATA[The Problem]]></description><link>https://blog.letsdata.io/p/designing-for-letsdata-sagemaker</link><guid isPermaLink="false">https://blog.letsdata.io/p/designing-for-letsdata-sagemaker</guid><dc:creator><![CDATA[Usman]]></dc:creator><pubDate>Fri, 29 Sep 2023 14:01:17 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/60ea0274-2b5f-4e45-a639-0ab767fbb29c_809x408.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>The Problem</h2><p>#LetsData datasets are Read-Compute-Write and different components often use multi threaded concurrency to achieve efficiency and performance. 
Most of the time, this internal parallelism does not need to be shared with the customers, but in some scenarios, the system&#8217;s internal details are needed so that customers can make informed scaling decisions.</p><p>Sagemaker Compute Engine endpoints (fleets) need to be scaled in accordance with the number of tasks, internal concurrency threads and additional parameters. </p><p>This post looks at how concurrency in general is architected for #LetsData and how it can be used to set a concurrency configuration that results in adequate performance. For details around the Sagemaker Compute Engine and its design, look at the <a href="https://blog.letsdata.io/p/launch-aws-sagemaker-available-as">Sagemaker Compute Engine Launch Announcement</a>[1], <a href="https://www.letsdata.io/docs#computeengine">Developer docs</a>[2] and <a href="https://www.letsdata.io/docs#examples">Step by Step Examples</a>[3]. </p><h2>Design Assumptions</h2><p>Let&#8217;s look at a few assumptions and design issues and then understand the Sagemaker Compute Engine issue. </p><h4>The Random Distribution Assumption </h4><p>The concurrent reads, computes and writes are sufficiently randomly distributed so as not to run into pathological issues, such as a large number of requests going to a single server or a subset of servers. </p><h4>The Faster Reads Assumption </h4><p>We assume that reads are generally going to be faster than writes - so we will have more data waiting to be written. The standard pattern for such a reader-writer latency mismatch is to use multiple concurrent writers (if possible) for each reader. </p><h4>Batch Writes and Concurrent Writes</h4><p>Multiple concurrent writes aren&#8217;t possible where records are ordered by some write key (these records need to be written serially) - but in most cases, the random distribution of the records being read leads to performance gains when using concurrent writes. 
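</p><p>The one-reader, many-writers pattern described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the actual #LetsData connector code: <code>write_record</code> is a hypothetical stand-in for a single write call, the pool size of 10 is only an example value, and the buffered records are assumed to have no ordering requirement.</p><pre><code>from concurrent.futures import ThreadPoolExecutor


def write_record(record):
    # Hypothetical stand-in for one write call to the destination
    # (for example a single put request); here it only echoes the record.
    return "wrote:" + record


def write_buffer_concurrently(buffer, max_writer_threads=10):
    # Fan the buffered records out over a fixed pool of writer threads.
    # Assumption: records are not ordered by a write key, so they can be
    # written in any order. Ordered records would be written serially instead.
    with ThreadPoolExecutor(max_workers=max_writer_threads) as pool:
        # pool.map returns results in input order even though the
        # individual writes may complete in a different order.
        return list(pool.map(write_record, buffer))


# One reader fills a buffer; many writer threads drain it.
buffer = ["doc-" + str(i) for i in range(25)]
results = write_buffer_concurrently(buffer)
</code></pre><p>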
</p><p>For example, our write connectors collect documents in buffers and then either:</p><ul><li><p>use a batch API (Kinesis, SQS)</p></li><li><p>or do a multi-threaded client-side batch if a batch API doesn&#8217;t exist or isn&#8217;t tenable (the DynamoDB batch write API doesn&#8217;t support conditional checks, the SQS batch call is limited to 10 messages etc.) </p></li></ul><h4>Write Performance for AWS Services</h4><p>While the writes are slower than reads, they are still fast enough not to cause any perceptible performance issues. </p><p>For example, for services such as DynamoDB, a write takes around 15-20 ms on average. Kinesis batched put latency is ~150 ms on average for ~500 records. They still benefit from multi-threading in some cases (DynamoDB and SQS for example). In these cases, we allow a maximum of 10 threads per writer to write concurrently to the write destinations. </p><p><em>(We assume the write destination is adequately scaled.)</em></p><h4>The Writer Fanout </h4><p>The #LetsData dataset&#8217;s work is divided into tasks at the time of initialization - where each task could be:</p><ul><li><p>a single file - S3 readers create 1 task for each file</p></li><li><p>a single shard - Kinesis readers create 1 task per shard</p></li><li><p>or some fixed slice of the overall work - SQS readers divide the work according to the concurrency configuration</p></li></ul><p>Assume we are processing a dataset with 10 tasks and lambda concurrency is &gt;= 10. This essentially means that we will have 10 tasks concurrently reading from the read destination and writing to the write destination. </p><p>If a write destination is DynamoDB or SQS, each of these tasks is then writing with 10 threads concurrently. 
So your write destination (SQS / DynamoDB) can expect <code>10 tasks x 10 threads = 100 concurrent write requests</code> at any instant in time.</p><p>At #LetsData, we&#8217;ve mostly taken an experience-based approach to deciding the concurrency and threading parameters, such as the number of writer threads. </p><p>We would initially set a value that we believed was reasonable and then increase / decrease it based on the results; these rudimentary tests became our informed capacity tests. We didn&#8217;t find the need to optimize these with scientifically (mathematically) designed tests, since these experience-based parameters were mostly working well and further perf optimizations could be deferred until they started becoming a perf issue. </p><p>This was great until a few weeks ago, when we integrated with Sagemaker Compute Engine and saw latencies that were starkly different from what we had seen before. </p><h2>Compute Workloads </h2><p>When we integrated with Sagemaker Compute Engine, we were seeing latencies of 2-3 secs for a batch of 5 records and single call latencies of 500 ms on average. </p><p>Compute workloads are different from the traditional read &amp; parse workloads, and this is probably why you see APIs such as InvokeEndpointAsync [4] [5] (for large inputs, with inputs and outputs stored in S3) &amp; Batch Transform APIs (asynchronously transform data stored in S3) in the Sagemaker API set. So the on-demand inference API that we are using can be expected to have higher latencies than AWS Service read / write / simple compute latencies. </p><p>The Sagemaker InvokeEndpoint (on-demand) API is not batched - so we had to use multi-threaded parallel calls to generate the vector embeddings. 
Since we configure the Sagemaker fleet size, we need to set the number of Sagemaker compute threads accordingly so that they don&#8217;t slow down the Sagemaker fleet (or conversely, we need to set the Sagemaker fleet size so that the multiple compute threads don&#8217;t cause a performance issue). </p><h4>Sagemaker Compute Fanout</h4><p>The Lambda Task Concurrency decides the number of concurrent dataset tasks. For each of these Lambda task invocations, we allocate 5 threads (fixed internally in the system) to concurrently call Sagemaker endpoints. Each one of these threads processes a single document that might have multiple elements that need vectorization. If there are two elements that need vectorization (for example, vectors for content and vectors for metadata), then we make these two calls to Sagemaker in parallel.</p><p>This becomes a fanout as follows:</p><pre><code>    # of parallel Sagemaker calls = Lambda Task Concurrency x 5 Fixed Internal Threads x min(3, # of vectors per document)   </code></pre><h4>Sagemaker Design Issues</h4><p>We need to be able to answer a few different questions: </p><ul><li><p>How do we decide the Lambda Task Concurrency and the Sagemaker Fleet Size in light of this scaling factor? </p></li><li><p>Our customers would be setting lambda concurrency and Sagemaker fleet sizes as well, so how do we provide documentation and a framework so that they can calculate and set these using our guidelines?</p></li><li><p>This does feel like a possible operational hot area - latency questions and issues will come up again and again around this space. How can we enable our customers so that they have the information that they need to investigate? 
</p></li></ul><p>We do the following: </p><ol><li><p>Run latency tests for the base unit of work to get a baseline latency profile</p></li><li><p>Revisit our Queuing System concepts to see how we can use these latency numbers to reason about what the fleet sizes should be - we turn this into workable instructions</p></li><li><p>Instrument the record for latency as it flows through the system - we capture different latencies and make them available to the customers. We also surface AWS Sagemaker metrics so that customers can self-serve investigations. </p></li></ol><h4>Sagemaker Latency Profile</h4><p>To get the baseline latency profile for our Sagemaker Compute Dataset, we run our datasets with Lambda Concurrency 1 (one task is running), Sagemaker Serverless with a concurrency of 1 and Sagemaker Provisioned with an instance count of 1. </p><h5>Sagemaker Serverless</h5><p>Our Sagemaker Compute number of threads is still set to 5 - so we&#8217;d be sending 5 requests over at any time. </p><p>Sagemaker Serverless will process them serially. AWS metrics [6][7] tell us that each request takes around 250 ms for model latency and 170 ms for model overhead. 
We process ~75 requests every minute.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ei1L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb676b893-61a9-4361-b786-60984d3bf67a_2466x828.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ei1L!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb676b893-61a9-4361-b786-60984d3bf67a_2466x828.png 424w, https://substackcdn.com/image/fetch/$s_!Ei1L!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb676b893-61a9-4361-b786-60984d3bf67a_2466x828.png 848w, https://substackcdn.com/image/fetch/$s_!Ei1L!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb676b893-61a9-4361-b786-60984d3bf67a_2466x828.png 1272w, https://substackcdn.com/image/fetch/$s_!Ei1L!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb676b893-61a9-4361-b786-60984d3bf67a_2466x828.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ei1L!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb676b893-61a9-4361-b786-60984d3bf67a_2466x828.png" width="1908" height="640.804945054945" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b676b893-61a9-4361-b786-60984d3bf67a_2466x828.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:489,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1908,&quot;bytes&quot;:122399,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ei1L!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb676b893-61a9-4361-b786-60984d3bf67a_2466x828.png 424w, https://substackcdn.com/image/fetch/$s_!Ei1L!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb676b893-61a9-4361-b786-60984d3bf67a_2466x828.png 848w, https://substackcdn.com/image/fetch/$s_!Ei1L!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb676b893-61a9-4361-b786-60984d3bf67a_2466x828.png 1272w, https://substackcdn.com/image/fetch/$s_!Ei1L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb676b893-61a9-4361-b786-60984d3bf67a_2466x828.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">Sagemaker Serverless Latency and Throughput</figcaption></figure></div><p>On the client side, we measure the latency of the batch at 2-3 secs, which is roughly what serial processing predicts: <code>(model overhead + model latency) x batch_size =&gt; (170+250) x 5 = 2.1 secs</code></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!e-eb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5735bf41-50ab-41ac-aad6-68eab95590d6_1904x1033.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e-eb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5735bf41-50ab-41ac-aad6-68eab95590d6_1904x1033.png 424w, 
https://substackcdn.com/image/fetch/$s_!e-eb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5735bf41-50ab-41ac-aad6-68eab95590d6_1904x1033.png 848w, https://substackcdn.com/image/fetch/$s_!e-eb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5735bf41-50ab-41ac-aad6-68eab95590d6_1904x1033.png 1272w, https://substackcdn.com/image/fetch/$s_!e-eb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5735bf41-50ab-41ac-aad6-68eab95590d6_1904x1033.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!e-eb!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5735bf41-50ab-41ac-aad6-68eab95590d6_1904x1033.png" width="1260" height="683.6538461538462" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5735bf41-50ab-41ac-aad6-68eab95590d6_1904x1033.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:790,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1260,&quot;bytes&quot;:44465,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!e-eb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5735bf41-50ab-41ac-aad6-68eab95590d6_1904x1033.png 424w, 
https://substackcdn.com/image/fetch/$s_!e-eb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5735bf41-50ab-41ac-aad6-68eab95590d6_1904x1033.png 848w, https://substackcdn.com/image/fetch/$s_!e-eb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5735bf41-50ab-41ac-aad6-68eab95590d6_1904x1033.png 1272w, https://substackcdn.com/image/fetch/$s_!e-eb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5735bf41-50ab-41ac-aad6-68eab95590d6_1904x1033.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">Sagemaker Latency and Throughput - as measured by the LetsData client</figcaption></figure></div><p>What we&#8217;ve seen is that the model overhead latency remains constant with Sagemaker Serverless and almost seems like a fixed cost - we tried this setup with Sagemaker Serverless concurrency of 5 and of 20, and while we do see some slight variation in model latency, the model overhead remains similar at ~ 170 ms. We don&#8217;t see this overhead with Sagemaker Provisioned, so this seems like a Sagemaker Serverless artifact. </p><h5>Sagemaker Provisioned</h5><p>Sagemaker Provisioned instance count is 1 and our Sagemaker Compute number of threads is still set to 5 - so we&#8217;d be sending 5 requests over at any time. </p><p>Sagemaker Provisioned, since it is an EC2 instance, processes these in parallel. AWS metrics tell us that each request takes around 280 ms for model latency and 5 ms for model overhead. 
We process ~800 requests every minute.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uf8t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19f2172-d9af-4843-bb94-cc2fe64833b5_2469x979.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uf8t!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19f2172-d9af-4843-bb94-cc2fe64833b5_2469x979.png 424w, https://substackcdn.com/image/fetch/$s_!uf8t!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19f2172-d9af-4843-bb94-cc2fe64833b5_2469x979.png 848w, https://substackcdn.com/image/fetch/$s_!uf8t!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19f2172-d9af-4843-bb94-cc2fe64833b5_2469x979.png 1272w, https://substackcdn.com/image/fetch/$s_!uf8t!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19f2172-d9af-4843-bb94-cc2fe64833b5_2469x979.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uf8t!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19f2172-d9af-4843-bb94-cc2fe64833b5_2469x979.png" width="1696" height="672.1098901098901" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a19f2172-d9af-4843-bb94-cc2fe64833b5_2469x979.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:577,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1696,&quot;bytes&quot;:191998,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uf8t!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19f2172-d9af-4843-bb94-cc2fe64833b5_2469x979.png 424w, https://substackcdn.com/image/fetch/$s_!uf8t!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19f2172-d9af-4843-bb94-cc2fe64833b5_2469x979.png 848w, https://substackcdn.com/image/fetch/$s_!uf8t!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19f2172-d9af-4843-bb94-cc2fe64833b5_2469x979.png 1272w, https://substackcdn.com/image/fetch/$s_!uf8t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19f2172-d9af-4843-bb94-cc2fe64833b5_2469x979.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a><figcaption class="image-caption">Sagemaker Provisioned - Latency and Throughput</figcaption></figure></div><p>On the client side, we measure the latency of the batch at ~ 400 ms, which is close to the single-request estimate because the 5 requests are processed in parallel: <code>(model overhead + model latency) =&gt; (5+280) = 285 ms</code></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NLLj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab67ab7e-615f-4f4c-bd70-b6be5ad7eec3_1901x1030.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NLLj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab67ab7e-615f-4f4c-bd70-b6be5ad7eec3_1901x1030.png 424w, 
https://substackcdn.com/image/fetch/$s_!NLLj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab67ab7e-615f-4f4c-bd70-b6be5ad7eec3_1901x1030.png 848w, https://substackcdn.com/image/fetch/$s_!NLLj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab67ab7e-615f-4f4c-bd70-b6be5ad7eec3_1901x1030.png 1272w, https://substackcdn.com/image/fetch/$s_!NLLj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab67ab7e-615f-4f4c-bd70-b6be5ad7eec3_1901x1030.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NLLj!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab67ab7e-615f-4f4c-bd70-b6be5ad7eec3_1901x1030.png" width="1644" height="890.8763736263736" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ab67ab7e-615f-4f4c-bd70-b6be5ad7eec3_1901x1030.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:789,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1644,&quot;bytes&quot;:114531,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NLLj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab67ab7e-615f-4f4c-bd70-b6be5ad7eec3_1901x1030.png 424w, 
https://substackcdn.com/image/fetch/$s_!NLLj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab67ab7e-615f-4f4c-bd70-b6be5ad7eec3_1901x1030.png 848w, https://substackcdn.com/image/fetch/$s_!NLLj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab67ab7e-615f-4f4c-bd70-b6be5ad7eec3_1901x1030.png 1272w, https://substackcdn.com/image/fetch/$s_!NLLj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab67ab7e-615f-4f4c-bd70-b6be5ad7eec3_1901x1030.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Sagemaker Provisioned Latency and Throughput - as measured by LetsData client</figcaption></figure></div><p>Here are the total times and throughputs for the baseline latency profiles. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EzQr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc9efa7-a59f-424d-ab9b-9ede3347352e_1080x162.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EzQr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc9efa7-a59f-424d-ab9b-9ede3347352e_1080x162.png 424w, https://substackcdn.com/image/fetch/$s_!EzQr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc9efa7-a59f-424d-ab9b-9ede3347352e_1080x162.png 848w, https://substackcdn.com/image/fetch/$s_!EzQr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc9efa7-a59f-424d-ab9b-9ede3347352e_1080x162.png 1272w, https://substackcdn.com/image/fetch/$s_!EzQr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc9efa7-a59f-424d-ab9b-9ede3347352e_1080x162.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EzQr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc9efa7-a59f-424d-ab9b-9ede3347352e_1080x162.png" width="1080" height="162" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4cc9efa7-a59f-424d-ab9b-9ede3347352e_1080x162.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:162,&quot;width&quot;:1080,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:30285,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EzQr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc9efa7-a59f-424d-ab9b-9ede3347352e_1080x162.png 424w, https://substackcdn.com/image/fetch/$s_!EzQr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc9efa7-a59f-424d-ab9b-9ede3347352e_1080x162.png 848w, https://substackcdn.com/image/fetch/$s_!EzQr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc9efa7-a59f-424d-ab9b-9ede3347352e_1080x162.png 1272w, https://substackcdn.com/image/fetch/$s_!EzQr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc9efa7-a59f-424d-ab9b-9ede3347352e_1080x162.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h4>Queuing Systems Theory</h4><p>The Queuing Systems chapter in the book <a href="https://www.amazon.com/dp/1439875901">Probability and Statistics for Computer Scientists (2nd Edition) by Michael Baron</a> [8] provides a good refresher on <em>continuous-time queuing processes</em>, and our current problem can essentially be modeled as one (the book is a recommended 
read for computer engineers). </p><p>Our setup is essentially an <em><strong>M/M/1 queuing process</strong></em>, where the first M is the distribution of the queue&#8217;s inter-arrival times, the second M is the distribution of the queue&#8217;s service times and 1 is the number of servers. M denotes an exponential distribution, which is <em>memoryless</em> and makes the queue a <em>Markov</em> process (i.e. the present state alone decides what happens next; the past gives us no additional information for predicting the future). </p><p>Here is an M/M/1 queuing process cheatsheet and the utilization results for the Sagemaker Serverless and Sagemaker Provisioned setups. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Yr9r!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F390806cf-6fab-4a99-8bd1-8e7bc87d58f9_1129x893.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Yr9r!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F390806cf-6fab-4a99-8bd1-8e7bc87d58f9_1129x893.png 424w, https://substackcdn.com/image/fetch/$s_!Yr9r!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F390806cf-6fab-4a99-8bd1-8e7bc87d58f9_1129x893.png 848w, https://substackcdn.com/image/fetch/$s_!Yr9r!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F390806cf-6fab-4a99-8bd1-8e7bc87d58f9_1129x893.png 1272w, https://substackcdn.com/image/fetch/$s_!Yr9r!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F390806cf-6fab-4a99-8bd1-8e7bc87d58f9_1129x893.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Yr9r!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F390806cf-6fab-4a99-8bd1-8e7bc87d58f9_1129x893.png" width="1200" height="949.1585473870682" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/390806cf-6fab-4a99-8bd1-8e7bc87d58f9_1129x893.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:893,&quot;width&quot;:1129,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:174481,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Yr9r!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F390806cf-6fab-4a99-8bd1-8e7bc87d58f9_1129x893.png 424w, https://substackcdn.com/image/fetch/$s_!Yr9r!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F390806cf-6fab-4a99-8bd1-8e7bc87d58f9_1129x893.png 848w, https://substackcdn.com/image/fetch/$s_!Yr9r!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F390806cf-6fab-4a99-8bd1-8e7bc87d58f9_1129x893.png 1272w, https://substackcdn.com/image/fetch/$s_!Yr9r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F390806cf-6fab-4a99-8bd1-8e7bc87d58f9_1129x893.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button 
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Results derivation for M/M/1 Queues - Sagemaker Serverless &amp; Provisioned</figcaption></figure></div><p>We&#8217;ve made some assumptions (model overhead is equal to wait time) and tried to apply this theory to our process.  
If we go by these numbers and aim for a target utilization of 70-80%:</p><ul><li><p>1 unit of Sagemaker Serverless concurrency can support 2 Dataset tasks (Lambda concurrency 2)</p></li><li><p>1 Sagemaker Provisioned instance can support roughly 4-5 Dataset tasks (maybe more)</p></li></ul><p>With these estimates, I&#8217;d rerun Sagemaker Serverless with 2 Dataset tasks (Lambda concurrency 2) and Sagemaker Provisioned with 5 Dataset tasks (Lambda concurrency 5), recalculate the utilizations, and adjust the Lambda Tasks Per Sagemaker Single Unit accordingly. I&#8217;d then use this Lambda Tasks Per Sagemaker Single Unit metric to decide the Sagemaker fleet size.</p><p>For example, if Lambda Tasks Per Sagemaker Single Unit for Sagemaker Serverless is 2 (i.e. 2 tasks drive one unit of Sagemaker Serverless concurrency to about 80% utilization) and my dataset needs to process 100 tasks at a Lambda concurrency of 10, then maybe a Sagemaker concurrency of 5 (10 / 2) is required. (I&#8217;d set this higher though, say to 20, to leave some headroom, given that the costs might be reasonable.)</p><h4>Latency Metrics Dashboard</h4><p>We&#8217;ve added a Task Details Dashboard that measures the latency of each record as it moves through the different subsystems and queues. It captures metrics such as read times, queue wait times, compute execution times, write times and doc sizes to help with any latency investigations that you may want to do. 
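</p><p>To make the arithmetic concrete, here is a minimal Python sketch (not part of #LetsData) of the fleet sizing rule from the Queuing Systems Theory section, together with one way stage timings like the ones on this dashboard could be used to spot the slowest stage. The function names and the sample timings are hypothetical and chosen only for illustration.</p><pre><code>import math


def utilization(arrival_rate_per_s, mean_service_time_s):
    """M/M/1 utilization: rho = lambda / mu, where mu = 1 / mean service time."""
    return arrival_rate_per_s * mean_service_time_s


def units_needed(lambda_concurrency, tasks_per_unit):
    """Sagemaker units needed for a given Lambda concurrency, rounded up."""
    return math.ceil(lambda_concurrency / tasks_per_unit)


def slowest_stage(stage_ms):
    """Name of the stage with the largest mean latency."""
    return max(stage_ms, key=stage_ms.get)


# Hypothetical per-record stage timings in milliseconds (not measured values).
stages = {"read": 12.0, "queue_wait": 5.0, "compute": 280.0, "write": 20.0}

print(slowest_stage(stages))              # compute
print(units_needed(10, 2))                # 5
print(round(utilization(2.5, 0.280), 2))  # 0.7
</code></pre><p>With the numbers from the example above (Lambda concurrency 10, 2 tasks per Sagemaker unit) the sketch gives 5 units before any headroom is added. Treat it as a starting point to re-measure against, not as a definitive sizing formula. 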
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LwsE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ced79e-7a3a-4110-9604-ce17e388c805_1313x525.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LwsE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ced79e-7a3a-4110-9604-ce17e388c805_1313x525.png 424w, https://substackcdn.com/image/fetch/$s_!LwsE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ced79e-7a3a-4110-9604-ce17e388c805_1313x525.png 848w, https://substackcdn.com/image/fetch/$s_!LwsE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ced79e-7a3a-4110-9604-ce17e388c805_1313x525.png 1272w, https://substackcdn.com/image/fetch/$s_!LwsE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ced79e-7a3a-4110-9604-ce17e388c805_1313x525.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LwsE!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ced79e-7a3a-4110-9604-ce17e388c805_1313x525.png" width="1200" height="479.8172124904798" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/06ced79e-7a3a-4110-9604-ce17e388c805_1313x525.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:525,&quot;width&quot;:1313,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:129433,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LwsE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ced79e-7a3a-4110-9604-ce17e388c805_1313x525.png 424w, https://substackcdn.com/image/fetch/$s_!LwsE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ced79e-7a3a-4110-9604-ce17e388c805_1313x525.png 848w, https://substackcdn.com/image/fetch/$s_!LwsE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ced79e-7a3a-4110-9604-ce17e388c805_1313x525.png 1272w, https://substackcdn.com/image/fetch/$s_!LwsE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ced79e-7a3a-4110-9604-ce17e388c805_1313x525.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Task Details Dashboard - Reader Metrics</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FN49!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477a89c9-b445-4655-a685-a9b19dcf8be5_2541x985.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FN49!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477a89c9-b445-4655-a685-a9b19dcf8be5_2541x985.png 424w, https://substackcdn.com/image/fetch/$s_!FN49!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477a89c9-b445-4655-a685-a9b19dcf8be5_2541x985.png 848w, 
https://substackcdn.com/image/fetch/$s_!FN49!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477a89c9-b445-4655-a685-a9b19dcf8be5_2541x985.png 1272w, https://substackcdn.com/image/fetch/$s_!FN49!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477a89c9-b445-4655-a685-a9b19dcf8be5_2541x985.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FN49!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477a89c9-b445-4655-a685-a9b19dcf8be5_2541x985.png" width="1800" height="697.2527472527472" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/477a89c9-b445-4655-a685-a9b19dcf8be5_2541x985.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:564,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1800,&quot;bytes&quot;:500266,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FN49!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477a89c9-b445-4655-a685-a9b19dcf8be5_2541x985.png 424w, https://substackcdn.com/image/fetch/$s_!FN49!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477a89c9-b445-4655-a685-a9b19dcf8be5_2541x985.png 848w, 
https://substackcdn.com/image/fetch/$s_!FN49!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477a89c9-b445-4655-a685-a9b19dcf8be5_2541x985.png 1272w, https://substackcdn.com/image/fetch/$s_!FN49!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F477a89c9-b445-4655-a685-a9b19dcf8be5_2541x985.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Task Details Dashboard - Compute Metrics</figcaption></figure></div><div class="captioned-image-container"><figure><a 
class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XNAV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f7d2a1-98ed-401e-9aa9-b18650aa16f8_2537x988.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XNAV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f7d2a1-98ed-401e-9aa9-b18650aa16f8_2537x988.png 424w, https://substackcdn.com/image/fetch/$s_!XNAV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f7d2a1-98ed-401e-9aa9-b18650aa16f8_2537x988.png 848w, https://substackcdn.com/image/fetch/$s_!XNAV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f7d2a1-98ed-401e-9aa9-b18650aa16f8_2537x988.png 1272w, https://substackcdn.com/image/fetch/$s_!XNAV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f7d2a1-98ed-401e-9aa9-b18650aa16f8_2537x988.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XNAV!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f7d2a1-98ed-401e-9aa9-b18650aa16f8_2537x988.png" width="1816" height="707.1923076923077" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56f7d2a1-98ed-401e-9aa9-b18650aa16f8_2537x988.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:567,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1816,&quot;bytes&quot;:288185,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XNAV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f7d2a1-98ed-401e-9aa9-b18650aa16f8_2537x988.png 424w, https://substackcdn.com/image/fetch/$s_!XNAV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f7d2a1-98ed-401e-9aa9-b18650aa16f8_2537x988.png 848w, https://substackcdn.com/image/fetch/$s_!XNAV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f7d2a1-98ed-401e-9aa9-b18650aa16f8_2537x988.png 1272w, https://substackcdn.com/image/fetch/$s_!XNAV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f7d2a1-98ed-401e-9aa9-b18650aa16f8_2537x988.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Task Details Dashboard - Writer Metrics</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K6pE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd867cc83-b7f5-4716-bf0f-a973b2bfd407_883x532.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K6pE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd867cc83-b7f5-4716-bf0f-a973b2bfd407_883x532.png 424w, https://substackcdn.com/image/fetch/$s_!K6pE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd867cc83-b7f5-4716-bf0f-a973b2bfd407_883x532.png 848w, 
https://substackcdn.com/image/fetch/$s_!K6pE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd867cc83-b7f5-4716-bf0f-a973b2bfd407_883x532.png 1272w, https://substackcdn.com/image/fetch/$s_!K6pE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd867cc83-b7f5-4716-bf0f-a973b2bfd407_883x532.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K6pE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd867cc83-b7f5-4716-bf0f-a973b2bfd407_883x532.png" width="883" height="532" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d867cc83-b7f5-4716-bf0f-a973b2bfd407_883x532.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:532,&quot;width&quot;:883,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:95861,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!K6pE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd867cc83-b7f5-4716-bf0f-a973b2bfd407_883x532.png 424w, https://substackcdn.com/image/fetch/$s_!K6pE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd867cc83-b7f5-4716-bf0f-a973b2bfd407_883x532.png 848w, 
https://substackcdn.com/image/fetch/$s_!K6pE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd867cc83-b7f5-4716-bf0f-a973b2bfd407_883x532.png 1272w, https://substackcdn.com/image/fetch/$s_!K6pE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd867cc83-b7f5-4716-bf0f-a973b2bfd407_883x532.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Task Details Dashboard - Overall Metrics</figcaption></figure></div><h2>Conclusions</h2><p>We were surprised by the 
latency issue when we integrated with Sagemaker, and that led us into a formal deep dive into how to reason about queuing systems and choose among the different concurrency configurations. We also instrumented our implementation to emit detailed metrics around record processing.</p><p>Approaches that we would like to explore but did not cover in this investigation:</p><ul><li><p>benchmarking our model performance to see whether the ~250-280 ms latency that we are seeing from model execution is reasonable</p></li><li><p>comparing the model execution latencies when it is executed in non-Sagemaker environments with the Sagemaker execution latencies</p></li><li><p>testing with models of different computational intensity to see whether the results degrade as expected - for example, with a model that takes 700 ms, or with larger doc sizes, would the overall system be negatively impacted?</p></li></ul><p>Thoughts / Comments?</p><h2>Resources</h2><ul><li><p>[1] Launch: AWS Sagemaker available as a Compute Engine on #LetsData </p><p><a href="https://blog.letsdata.io/p/launch-aws-sagemaker-available-as">https://blog.letsdata.io/p/launch-aws-sagemaker-available-as</a></p></li><li><p>[2] #LetsData Sagemaker Compute Engine Docs: <a href="https://www.letsdata.io/docs#computeengine">https://www.letsdata.io/docs#computeengine</a></p></li><li><p>[3] #LetsData Example - Generate Vector Embeddings Using Lambda and Sagemaker Compute Engine: <a href="https://www.letsdata.io/docs#examples">https://www.letsdata.io/docs#examples</a></p></li><li><p>[4] Sagemaker - Different Inference APIs</p><ul><li><p>Real Time Inference: <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html">https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html</a></p></li><li><p>Async Inference: <a 
href="https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html">https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html</a></p></li><li><p>Batch Transforms: <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html">https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html  </a></p></li></ul></li><li><p>[5] Sagemaker Async Inference Notebook: <a href="https://github.com/aws/amazon-sagemaker-examples/blob/main/async-inference/Async-Inference-Walkthrough.ipynb">https://github.com/aws/amazon-sagemaker-examples/blob/main/async-inference/Async-Inference-Walkthrough.ipynb</a></p></li><li><p>[6] How do I troubleshoot latency with my Amazon SageMaker endpoint <a href="https://repost.aws/knowledge-center/sagemaker-endpoint-latency">https://repost.aws/knowledge-center/sagemaker-endpoint-latency</a></p></li><li><p>[7] Sagemaker Metrics Docs: <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html">https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html </a></p></li><li><p>[8] The Queuing Systems chapter in the book <a href="https://www.amazon.com/dp/1439875901">Probability and Statistics for Computer Scientists (2nd Edition) by Michael Baron</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Launch: AWS Sagemaker available as a Compute Engine on #LetsData]]></title><description><![CDATA[Automate your inferences models and run inferences / vectors at scale]]></description><link>https://blog.letsdata.io/p/launch-aws-sagemaker-available-as</link><guid isPermaLink="false">https://blog.letsdata.io/p/launch-aws-sagemaker-available-as</guid><dc:creator><![CDATA[Usman]]></dc:creator><pubDate>Wed, 27 Sep 2023 14:12:10 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/24b53a41-8f70-4447-9c9e-d417699304b0_809x408.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today, we are announcing the public availability of 
<strong>AWS Sagemaker Compute Engine </strong>on<strong> </strong><a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a>. Customers can now create vector embeddings, automate their model inference pipelines and run inferences for their documents at scale on <a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a>.</p><h2>Architecture</h2><p>Here is an architecture diagram that shows how the Sagemaker compute engine has been integrated with #LetsData pipelines.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ggyX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb05894f0-79e8-4ae1-82ab-1ee80cf191dd_833x539.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ggyX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb05894f0-79e8-4ae1-82ab-1ee80cf191dd_833x539.png 424w, https://substackcdn.com/image/fetch/$s_!ggyX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb05894f0-79e8-4ae1-82ab-1ee80cf191dd_833x539.png 848w, https://substackcdn.com/image/fetch/$s_!ggyX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb05894f0-79e8-4ae1-82ab-1ee80cf191dd_833x539.png 1272w, https://substackcdn.com/image/fetch/$s_!ggyX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb05894f0-79e8-4ae1-82ab-1ee80cf191dd_833x539.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ggyX!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb05894f0-79e8-4ae1-82ab-1ee80cf191dd_833x539.png" width="1198" height="775.1764705882352" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b05894f0-79e8-4ae1-82ab-1ee80cf191dd_833x539.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:539,&quot;width&quot;:833,&quot;resizeWidth&quot;:1198,&quot;bytes&quot;:153632,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ggyX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb05894f0-79e8-4ae1-82ab-1ee80cf191dd_833x539.png 424w, https://substackcdn.com/image/fetch/$s_!ggyX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb05894f0-79e8-4ae1-82ab-1ee80cf191dd_833x539.png 848w, https://substackcdn.com/image/fetch/$s_!ggyX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb05894f0-79e8-4ae1-82ab-1ee80cf191dd_833x539.png 1272w, https://substackcdn.com/image/fetch/$s_!ggyX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb05894f0-79e8-4ae1-82ab-1ee80cf191dd_833x539.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">LetsData Sagemaker Architecture Diagram</figcaption></figure></div><p>The Sagemaker compute engine has two major components - a <strong>Lambda</strong> compute component and a <strong>Sagemaker</strong> compute component. Here is how the pipeline works:</p><ul><li><p><strong>Read and Parse Feature Doc:</strong> The Lambda compute component reads from the read destination, parses the data using the user&#8217;s data handler interface implementations and creates a feature document, as before (Steps 1-4).</p></li><li><p><strong>Extract Doc Elements For Vectorization:</strong> Previously, the Lambda feature document would have been written to the write destination. 
However, with the Sagemaker compute engine, the feature document is vectorized instead. Step 5 extracts the feature doc elements that require vectorization. (This is the user&#8217;s implementation of the #LetsData interface.)</p></li><li><p><strong>Generate Vector Embeddings using Sagemaker:</strong> The extracted elements are then sent to a Sagemaker Endpoint that generates vector embeddings (Step 6).</p></li><li><p><strong>Construct Output Vector Doc:</strong> The output vector doc is constructed from these vectors (Step 7).</p></li><li><p><strong>Write Vector Doc:</strong> The rest of the pipeline is similar to before - the output vector document is written to the write destination and any errors are recorded in the error destination (Steps 8-10).</p></li></ul><p>Let&#8217;s look at:</p><ul><li><p>how this architecture can be used in the emerging LLM app stacks</p></li><li><p>the new #LetsData Sagemaker Vectors Interface</p></li><li><p>the AI / ML models and how they can be used with the #LetsData Sagemaker compute engine</p></li><li><p>the details around setting up Sagemaker Endpoints that can be invoked to generate vector embeddings at scale</p></li><li><p>the overall Sagemaker configuration</p></li></ul><h2>#LetsData Sagemaker Compute Pipelines For LLM Apps</h2><p>I had <a href="https://blog.letsdata.io/p/letsdata-at-aws-summit-ny-2023">shared some thoughts in an earlier post</a> around how #LetsData might be useful in AI / ML apps and promised a deep dive around this. The LLM App Architecture from the <a href="https://a16z.com/2023/06/20/emerging-architectures-for-llm-applications/">Emerging Architectures for LLM Applications (an Andreessen Horowitz blog)</a> should help us understand how #LetsData can add value to the AI / ML ecosystem. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4IsJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17fd673e-db67-40c9-b119-be0f836d8c3a_2000x1400.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4IsJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17fd673e-db67-40c9-b119-be0f836d8c3a_2000x1400.webp 424w, https://substackcdn.com/image/fetch/$s_!4IsJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17fd673e-db67-40c9-b119-be0f836d8c3a_2000x1400.webp 848w, https://substackcdn.com/image/fetch/$s_!4IsJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17fd673e-db67-40c9-b119-be0f836d8c3a_2000x1400.webp 1272w, https://substackcdn.com/image/fetch/$s_!4IsJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17fd673e-db67-40c9-b119-be0f836d8c3a_2000x1400.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4IsJ!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17fd673e-db67-40c9-b119-be0f836d8c3a_2000x1400.webp" width="1200" height="839.8351648351648" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/17fd673e-db67-40c9-b119-be0f836d8c3a_2000x1400.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1019,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:50120,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4IsJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17fd673e-db67-40c9-b119-be0f836d8c3a_2000x1400.webp 424w, https://substackcdn.com/image/fetch/$s_!4IsJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17fd673e-db67-40c9-b119-be0f836d8c3a_2000x1400.webp 848w, https://substackcdn.com/image/fetch/$s_!4IsJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17fd673e-db67-40c9-b119-be0f836d8c3a_2000x1400.webp 1272w, https://substackcdn.com/image/fetch/$s_!4IsJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17fd673e-db67-40c9-b119-be0f836d8c3a_2000x1400.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Emerging LLM App Stack - Architecture</figcaption></figure></div><p>#LetsData Sagemaker pipelines can be used to implement the Data Pipelines, Embedding Model, Orchestration, API / App Hosting and Queries components of this stack. 
</p><h2>#LetsData&#8217;s Sagemaker Vector Interface</h2><p>We&#8217;ve defined a simple <a href="https://github.com/lets-data/letsdata-data-interface/blob/main/src/main/java/com/resonance/letsdata/data/readers/interfaces/sagemaker/SagemakerVectorsInterface.java">Sagemaker interface(GitHub)</a> that users can implement to:</p><ol><li><p>Extract Document for Vectorization from the Feature Doc</p></li><li><p>Construct a Vector Doc from the Feature Doc and the generated Vector Embeddings</p></li></ol><p>Here is the interface definition:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!foYC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6993af-96f7-48bf-9a4b-f8171eb0227f_1593x605.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!foYC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6993af-96f7-48bf-9a4b-f8171eb0227f_1593x605.png 424w, https://substackcdn.com/image/fetch/$s_!foYC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6993af-96f7-48bf-9a4b-f8171eb0227f_1593x605.png 848w, https://substackcdn.com/image/fetch/$s_!foYC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6993af-96f7-48bf-9a4b-f8171eb0227f_1593x605.png 1272w, https://substackcdn.com/image/fetch/$s_!foYC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6993af-96f7-48bf-9a4b-f8171eb0227f_1593x605.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!foYC!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6993af-96f7-48bf-9a4b-f8171eb0227f_1593x605.png" width="1200" height="455.7692307692308" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8d6993af-96f7-48bf-9a4b-f8171eb0227f_1593x605.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:553,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:108833,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!foYC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6993af-96f7-48bf-9a4b-f8171eb0227f_1593x605.png 424w, https://substackcdn.com/image/fetch/$s_!foYC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6993af-96f7-48bf-9a4b-f8171eb0227f_1593x605.png 848w, https://substackcdn.com/image/fetch/$s_!foYC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6993af-96f7-48bf-9a4b-f8171eb0227f_1593x605.png 1272w, https://substackcdn.com/image/fetch/$s_!foYC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6993af-96f7-48bf-9a4b-f8171eb0227f_1593x605.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">#LetsData&#8217;s Sagemaker Vector Interface</figcaption></figure></div><p>Here is a sample implementation for the <a href="https://github.com/lets-data/letsdata-common-crawl/blob/main/src/main/java/com/letsdata/commoncrawl/interfaces/implementations/sagemaker/CommonCrawlSagemakerReader.java">Common Crawl Web Archives (GitHub)</a> documents. </p><p>The complete example and step by step instructions are also available on our <a href="https://www.letsdata.io/docs#examples">website (Generate Vector Embeddings Using Lambda and Sagemaker Compute Engine)</a>.  
</p><h2>Using AI / ML Models with #LetsData Sagemaker</h2><p>The #LetsData Sagemaker Compute Engine is automation built around Sagemaker inference models and endpoints. This essentially means that any AI / ML model that can be used with AWS Sagemaker can be used with #LetsData: the model code is packaged as a zip file, uploaded to S3 and imported as an AWS Sagemaker model that #LetsData then uses.</p><p>#LetsData supports AI / ML models for Sagemaker in the following configurations:</p><ul><li><p><strong>Reuse Existing LetsData Model:</strong> You created a model for a dataset on LetsData and would like to reuse it. You can specify the model Arn and LetsData will use that model for Sagemaker.</p></li><li><p><strong>Create New LetsData Model:</strong> You have the model code packaged as a zip file in S3. You'll specify the S3 Arn and LetsData will create a model for Sagemaker.</p></li></ul><p>We&#8217;ve tested with Hugging Face&#8217;s Sentence Transformer models in our implementations and have detailed examples on how to get the AI models working with #LetsData and on the different customizations that are offered. Here are some quick highlights:</p><ul><li><p><strong>Model Container Images:</strong> We support the entire gamut of ECR Sagemaker model container images - HuggingFace, Inferentia, Pytorch and Scikit, to name a few. The complete support list is on our <a href="https://www.letsdata.io/docs#computeengine">website</a> and in the <a href="https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/ecr-us-east-1.html">AWS ECR Sagemaker Image List</a>.</p></li><li><p><strong>Model Environment Variables:</strong> With #LetsData Sagemaker, you can customize your model environment to specify (1) the model runtime configuration and (2) your model implementation details. 
For example, our HuggingFace SentenceTransformer model implementation uses the following environment variables, which tell the model container to use the question-answering task, that our model code is in the inference.py file, and that the model/ directory holds the custom code.</p><pre><code>    "HF_TASK": "question-answering",
    "SAGEMAKER_PROGRAM": "inference.py",
    "SAGEMAKER_SUBMIT_DIRECTORY": "model/"</code></pre></li><li><p><strong>Request and Response Customizations:</strong> While we&#8217;ve defined a fixed format for requests and responses to the Sagemaker endpoints, your model code can add customizations as needed.</p><pre><code><strong>Request
-------</strong>    
def input_fn(request_body, request_content_type):
    """
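    Deserialize the request payload before it is passed to predict_fn.

    Args:
        request_body: The raw body (bytes) of the request sent to the model.
        request_content_type: (string) The content type of the request.
    Returns:
        The request body decoded as a UTF-8 string.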
    """
    if request_content_type == 'text/plain':
        inp_var = request_body
        return inp_var.decode("utf-8")
    else:
        raise ValueError("This model only supports text/plain input")     

<strong>Response
--------
</strong>def predict_fn(data, model_and_tokenizer):
    """
    Args:
        data: The input data returned by input_fn.
        model_and_tokenizer: The (model, tokenizer) pair returned by model_fn.
    Returns:
        The predictions: the first vector embedding, as a list.
    """
    model, tokenizer = model_and_tokenizer

    ...
        
    return vector_embeddings[0].tolist()</code></pre></li></ul><p>Our <a href="https://www.letsdata.io/docs#computeengine">Compute Engine Documentation</a> and <a href="https://www.letsdata.io/docs#examples">Step By Step Example</a> have more details around integrating models with #LetsData.</p><h2>Sagemaker Endpoints to Generate Vector Embeddings</h2><p>The Sagemaker Endpoint hosts the container image and the model. It is called with the documents and returns the vector result.</p><p>Sagemaker endpoints can be tuned for concurrency, hardware, memory etc. Sagemaker supports two types of endpoints:</p><ul><li><p><strong>Serverless:</strong> Sagemaker automatically hosts the model and containers and scales them to your desired concurrency and memory</p></li><li><p><strong>Provisioned:</strong> Sagemaker provisions the requested hardware and automatically hosts the model and containers</p></li></ul><p>#LetsData supports both Serverless and Provisioned Endpoints for Sagemaker in the following configurations:</p><ul><li><p><strong>Bring Your Own Endpoint:</strong> You have an existing Sagemaker endpoint in an AWS account. You can specify the endpoint Arn and endpoint config Arn and LetsData will use that endpoint for Sagemaker.</p></li><li><p><strong>Reuse Existing LetsData Endpoint:</strong> You created an endpoint for a dataset on LetsData and would like to reuse it. You can specify the endpoint Arn and endpoint config Arn and LetsData will use that endpoint for Sagemaker.</p></li><li><p><strong>Create New LetsData Endpoint:</strong> You'd like a new Endpoint created for the dataset. You'll specify the endpoint type (Serverless/Provisioned) and its endpoint configuration and LetsData will create a Sagemaker endpoint for dataset execution.</p></li></ul><p>We&#8217;ve run our tests with Serverless and Provisioned endpoints and have seen the benefits of GPU hardware acceleration and the ml.inf.* EC2 instance types. 
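</p><p>To make the endpoint call concrete, here is a minimal Python sketch of invoking such an endpoint directly with boto3. It is an illustration, not the #LetsData implementation: the function name embed_text and the endpoint name in the usage comment are hypothetical, and it assumes that the endpoint accepts text/plain requests (as in the input_fn shown earlier) and returns the embedding as a JSON array.</p><pre><code>import json


def embed_text(runtime_client, endpoint_name, text):
    """Send one document to a Sagemaker endpoint and return its embedding."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,  # hypothetical name, supplied by the caller
        ContentType="text/plain",    # matches the input_fn shown earlier
        Body=text.encode("utf-8"),
    )
    # Assumption: the model serializes the embedding as a JSON array of floats.
    return json.loads(response["Body"].read())


# Usage (needs AWS credentials and a deployed endpoint):
#   import boto3
#   client = boto3.client("sagemaker-runtime")
#   vector = embed_text(client, "my-embeddings-endpoint", "hello world")
</code></pre><p>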
We&#8217;ve been impressed with the overall AI / ML inference infrastructure that AWS supports and how a diverse set of models and technology can all be integrated with the service.  Our integration with Sagemaker further simplifies the end to end AI / ML use-case data integration.</p><h2>Sagemaker Configuration  </h2><p>The overall sagemaker configuration / schema  is as follows: </p><h4>Schema</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5PqK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d8115fd-04ce-482c-a07c-d611f7d157ac_955x1303.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5PqK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d8115fd-04ce-482c-a07c-d611f7d157ac_955x1303.png 424w, https://substackcdn.com/image/fetch/$s_!5PqK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d8115fd-04ce-482c-a07c-d611f7d157ac_955x1303.png 848w, https://substackcdn.com/image/fetch/$s_!5PqK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d8115fd-04ce-482c-a07c-d611f7d157ac_955x1303.png 1272w, https://substackcdn.com/image/fetch/$s_!5PqK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d8115fd-04ce-482c-a07c-d611f7d157ac_955x1303.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5PqK!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d8115fd-04ce-482c-a07c-d611f7d157ac_955x1303.png" width="1200" 
height="1637.2774869109949" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6d8115fd-04ce-482c-a07c-d611f7d157ac_955x1303.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1303,&quot;width&quot;:955,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:236017,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5PqK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d8115fd-04ce-482c-a07c-d611f7d157ac_955x1303.png 424w, https://substackcdn.com/image/fetch/$s_!5PqK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d8115fd-04ce-482c-a07c-d611f7d157ac_955x1303.png 848w, https://substackcdn.com/image/fetch/$s_!5PqK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d8115fd-04ce-482c-a07c-d611f7d157ac_955x1303.png 1272w, https://substackcdn.com/image/fetch/$s_!5PqK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d8115fd-04ce-482c-a07c-d611f7d157ac_955x1303.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The complete Schema specification for Sagemaker Compute Engine</figcaption></figure></div><h4>Example - Create New Model &amp; Endpoints</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tUVs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da4a278-a6dd-4f9d-b8a9-64e07f6d950c_964x726.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tUVs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da4a278-a6dd-4f9d-b8a9-64e07f6d950c_964x726.png 424w, 
https://substackcdn.com/image/fetch/$s_!tUVs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da4a278-a6dd-4f9d-b8a9-64e07f6d950c_964x726.png 848w, https://substackcdn.com/image/fetch/$s_!tUVs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da4a278-a6dd-4f9d-b8a9-64e07f6d950c_964x726.png 1272w, https://substackcdn.com/image/fetch/$s_!tUVs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da4a278-a6dd-4f9d-b8a9-64e07f6d950c_964x726.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tUVs!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da4a278-a6dd-4f9d-b8a9-64e07f6d950c_964x726.png" width="1200" height="903.7344398340249" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3da4a278-a6dd-4f9d-b8a9-64e07f6d950c_964x726.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:726,&quot;width&quot;:964,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:110842,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tUVs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da4a278-a6dd-4f9d-b8a9-64e07f6d950c_964x726.png 424w, 
https://substackcdn.com/image/fetch/$s_!tUVs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da4a278-a6dd-4f9d-b8a9-64e07f6d950c_964x726.png 848w, https://substackcdn.com/image/fetch/$s_!tUVs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da4a278-a6dd-4f9d-b8a9-64e07f6d950c_964x726.png 1272w, https://substackcdn.com/image/fetch/$s_!tUVs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3da4a278-a6dd-4f9d-b8a9-64e07f6d950c_964x726.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Example configuration for Create New Model and Endpoints - Sagemaker Provisioned</figcaption></figure></div><h4>Example - Bring Your Own Model &amp; Endpoints</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aE7c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e94fc5c-d94a-4921-bb1f-779ee328974b_968x533.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aE7c!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e94fc5c-d94a-4921-bb1f-779ee328974b_968x533.png 424w, https://substackcdn.com/image/fetch/$s_!aE7c!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e94fc5c-d94a-4921-bb1f-779ee328974b_968x533.png 848w, https://substackcdn.com/image/fetch/$s_!aE7c!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e94fc5c-d94a-4921-bb1f-779ee328974b_968x533.png 1272w, https://substackcdn.com/image/fetch/$s_!aE7c!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e94fc5c-d94a-4921-bb1f-779ee328974b_968x533.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aE7c!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e94fc5c-d94a-4921-bb1f-779ee328974b_968x533.png" width="1200" height="660.7438016528926" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3e94fc5c-d94a-4921-bb1f-779ee328974b_968x533.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:533,&quot;width&quot;:968,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:89327,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aE7c!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e94fc5c-d94a-4921-bb1f-779ee328974b_968x533.png 424w, https://substackcdn.com/image/fetch/$s_!aE7c!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e94fc5c-d94a-4921-bb1f-779ee328974b_968x533.png 848w, https://substackcdn.com/image/fetch/$s_!aE7c!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e94fc5c-d94a-4921-bb1f-779ee328974b_968x533.png 1272w, https://substackcdn.com/image/fetch/$s_!aE7c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e94fc5c-d94a-4921-bb1f-779ee328974b_968x533.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Example configuration for Bring Your Own Model and Endpoints</figcaption></figure></div><p>Our <a href="https://www.letsdata.io/docs#computeengine">Compute Engine Documentation</a> and <a href="https://www.letsdata.io/docs#examples">Step By Step Example</a> have the complete details on configuration, with examples for different Sagemaker configurations. </p><h2>Future Work</h2><ul><li><p><strong><s>Integration with Vector Database:</s></strong><s> The write destination should be some Vector Database / Index instead of Kinesis Stream - we need to implement a Vector write destination. 
We are looking into different options and will work on having some option natively available in #LetsData.</s> This is done: we are now integrated with Momento Vector Indexes (<a href="https://www.letsdata.io/docs/write-connectors?tab=momentovectorindexes">https://www.letsdata.io/docs/write-connectors?tab=momentovectorindexes</a>)</p></li><li><p><strong><s>Validate a Customer Journey:</s></strong><s> While we&#8217;ve tested and built the ML / AI pipelines, we&#8217;ve not constructed an end user example / customer journey yet. In large part because we aren&#8217;t putting these vector embeddings in a vector database / queryable source. We should do this to further improve what we might have missed and validate the user scenario. </s> We validated this with a Web Crawl Archives vector index and web search. See the search section for results (<a href="https://www.letsdata.io/docs/write-connectors?tab=momentovectorindexes#momento-vector-index-write-connector-implementation">https://www.letsdata.io/docs/write-connectors?tab=momentovectorindexes#momento-vector-index-write-connector-implementation</a>)</p></li><li><p><strong>Enabling Learning Scenarios:</strong> Current Sagemaker support is inference only - we need to look at enabling learning / training scenarios as well. </p></li><li><p><strong>Growing our system from individual datasets to a pipeline of connected datasets:</strong> Today, our datasets read from a read destination, perform compute and then write to a write destination. If the write destination is an intermediate destination such as a Kinesis stream, we need another dataset to read from the stream and write to a durable location such as a Vector Database. We need to natively support this. Example:</p><pre><code>{
    "pipelineName": "VectorIndexPipeline",
    "artifact": {
        ...
    },
    "errorConnector": {
        ...
    },
    "datasets": [
        {
            // dataset 1 
            //  - read from s3
            //  - run sagemaker compute 
            //  - write to kinesis
        },
        {
            // dataset 2 
            //  - read the kinesis stream written by dataset 1
            //  - run lambda compute 
            //  - write to database
        }
    ]
}</code></pre></li></ul><h2>Resources</h2><ul><li><p>#LetsData Sagemaker Compute Engine Docs: <a href="https://www.letsdata.io/docs#computeengine">https://www.letsdata.io/docs#computeengine</a></p></li><li><p>#LetsData Example - Generate Vector Embeddings Using Lambda and Sagemaker Compute Engine: <a href="https://www.letsdata.io/docs#examples">https://www.letsdata.io/docs#examples</a></p></li><li><p>Here are some references on customizing models with Sagemaker.</p><ul><li><p>Hugging Face Sagemaker Docs: <a href="https://huggingface.co/docs/sagemaker/inference#user-defined-code-and-modules">User defined code and modules</a></p></li><li><p>Hugging Face Sagemaker Custom Inference Notebook: <a href="https://github.com/huggingface/notebooks/blob/main/sagemaker/17_custom_inference_script/sagemaker-notebook.ipynb">Sentence Embeddings with Hugging Face Transformers</a></p></li><li><p>Blog at medium.com: <a href="https://medium.com/picus-security-engineering/customized-model-serving-via-aws-sagemaker-serverless-inference-a72879948321">Leveraging AWS SageMaker Serverless Inference for Customized Model Serving</a></p></li></ul></li><li><p>Some foundational reading on AI / ML / LLM architectures:</p><ul><li><p>Emerging Architectures for LLM Applications (an Andreessen Horowitz blog):&nbsp;<a href="https://a16z.com/2023/06/20/emerging-architectures-for-llm-applications/">https://a16z.com/2023/06/20/emerging-architectures-for-llm-applications/</a></p></li><li><p>How OpenAI trained ChatGPT (an excellent summary of the MS Build talk): <a href="https://blog.quastor.org/p/openai-trained-chatgpt">https://blog.quastor.org/p/openai-trained-chatgpt</a></p></li></ul></li></ul>]]></content:encoded></item><item><title><![CDATA[#LetsData now has a Blog / Newsletter ]]></title><description><![CDATA[#LetsData now has a blog / newsletter at blog.letsdata.io.]]></description><link>https://blog.letsdata.io/p/letsdata-now-has-a-blog-newsletter</link><guid 
isPermaLink="false">https://blog.letsdata.io/p/letsdata-now-has-a-blog-newsletter</guid><dc:creator><![CDATA[Usman]]></dc:creator><pubDate>Tue, 26 Sep 2023 20:27:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!GAyw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F022a8a73-390d-43ee-b250-38567709d989_1488x989.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>#LetsData now has a blog / newsletter at <a href="http://blog.letsdata.io">blog.letsdata.io</a>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GAyw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F022a8a73-390d-43ee-b250-38567709d989_1488x989.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GAyw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F022a8a73-390d-43ee-b250-38567709d989_1488x989.jpeg 424w, https://substackcdn.com/image/fetch/$s_!GAyw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F022a8a73-390d-43ee-b250-38567709d989_1488x989.jpeg 848w, https://substackcdn.com/image/fetch/$s_!GAyw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F022a8a73-390d-43ee-b250-38567709d989_1488x989.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!GAyw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F022a8a73-390d-43ee-b250-38567709d989_1488x989.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!GAyw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F022a8a73-390d-43ee-b250-38567709d989_1488x989.jpeg" width="1456" height="968" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/022a8a73-390d-43ee-b250-38567709d989_1488x989.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:968,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:514155,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GAyw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F022a8a73-390d-43ee-b250-38567709d989_1488x989.jpeg 424w, https://substackcdn.com/image/fetch/$s_!GAyw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F022a8a73-390d-43ee-b250-38567709d989_1488x989.jpeg 848w, https://substackcdn.com/image/fetch/$s_!GAyw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F022a8a73-390d-43ee-b250-38567709d989_1488x989.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!GAyw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F022a8a73-390d-43ee-b250-38567709d989_1488x989.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Our &#8216;not so stock&#8217; newsletter stock image</figcaption></figure></div><p>We had been using LinkedIn articles and our LinkedIn page for our blogging needs. However, we like the conventional feel of having a permanent home on the website / linked to the website, with blogging features such as archiving and cross linking, and newsletter features such as delivery to subscribers&#8217; inboxes. We&#8217;re still huge fans of LinkedIn and our page - we&#8217;ll keep sharing our progress there - it&#8217;s just that the content will live on this newer blog. </p><p>We&#8217;ve started a newsletter / blog at Substack. For now, we&#8217;ve backfilled the blog with the content from our LinkedIn articles and page posts. 
I have been hearing great things about Substack - though building a sizable subscriber base would be challenging (hint: subscribe to the newsletter if interested).</p><p>We have some interesting posts scheduled for the next few weeks - including our <strong>much anticipated Launch announcement tomorrow</strong>. Visit our <a href="http://blog.letsdata.io">blog (blog.letsdata.io)</a> and subscribe to our newsletter so that you get the latest in your inbox as soon as we launch! </p><ul><li><p>Blog: <a href="http://blog.letsdata.io">blog.letsdata.io</a></p></li><li><p>LinkedIn Page: <a href="https://www.linkedin.com/company/letsbigdata">www.linkedin.com/company/letsbigdata</a></p></li><li><p>Website: <a href="http://www.letsdata.io">www.letsdata.io</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Bug Fixes for the day!]]></title><description><![CDATA[Bug fixes for the day]]></description><link>https://blog.letsdata.io/p/bug-fixes-for-the-day</link><guid isPermaLink="false">https://blog.letsdata.io/p/bug-fixes-for-the-day</guid><dc:creator><![CDATA[Usman]]></dc:creator><pubDate>Tue, 26 Sep 2023 04:34:06 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lVd2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa05697f6-c82c-4cfa-b898-ebeba57b60cc_3264x3264.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>Bug fixes for the day</h3><p>Woke up today to a couple of overnight alarms for <a href="https://www.linkedin.com/feed/hashtag/?keywords=letsdata&amp;highlightedUpdateUrns=urn%3Ali%3Aactivity%3A7067003184396828672">#LetsData</a> - the website's main page had thrown an error in the wee hours of the night. 
Also, a dataset's initialization was failing.<br><br>Initial thoughts: great that we've started running into issues like these - people are looking at the service and trying to use it!<br><br>Okay, enough gloating, it's time to see how bad the failure situation is.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lVd2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa05697f6-c82c-4cfa-b898-ebeba57b60cc_3264x3264.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lVd2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa05697f6-c82c-4cfa-b898-ebeba57b60cc_3264x3264.jpeg 424w, https://substackcdn.com/image/fetch/$s_!lVd2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa05697f6-c82c-4cfa-b898-ebeba57b60cc_3264x3264.jpeg 848w, https://substackcdn.com/image/fetch/$s_!lVd2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa05697f6-c82c-4cfa-b898-ebeba57b60cc_3264x3264.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!lVd2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa05697f6-c82c-4cfa-b898-ebeba57b60cc_3264x3264.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lVd2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa05697f6-c82c-4cfa-b898-ebeba57b60cc_3264x3264.jpeg" width="516" height="516" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a05697f6-c82c-4cfa-b898-ebeba57b60cc_3264x3264.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:516,&quot;bytes&quot;:591791,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lVd2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa05697f6-c82c-4cfa-b898-ebeba57b60cc_3264x3264.jpeg 424w, https://substackcdn.com/image/fetch/$s_!lVd2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa05697f6-c82c-4cfa-b898-ebeba57b60cc_3264x3264.jpeg 848w, https://substackcdn.com/image/fetch/$s_!lVd2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa05697f6-c82c-4cfa-b898-ebeba57b60cc_3264x3264.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!lVd2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa05697f6-c82c-4cfa-b898-ebeba57b60cc_3264x3264.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Physical manifestation of a bug that I also dealt with!</figcaption></figure></div><h3>Website timeout</h3><p>Looked into the logs and found that the website's main page had timed out. This can happen from time to time - on initialization, we read a bunch of data from the network and populate the caches - so the initial call is heavy; after that, everything's cached.<br><br>The other issue is that we've implemented the server API in Lambda and when Lambda is inactive, the function is reclaimed. 
So, after 15-30 mins of inactivity, a request to the website would re-initialize the function and redo the network reads.&nbsp;<br><br>A couple of quick fixes:</p><ul><li><p>Parallelize the network reads and do only the reads that are required for the page - a 15 sec timeout reduced to a 3 sec Lambda response time</p></li><li><p>Run a cron that retrieves the index, docs and home page every 10 mins so that at least 1 Lambda function remains initialized (and customers do not get the initial delay - this is a TODO as of now)</p></li></ul><h3>Dataset initialization</h3><p>When a user creates a dataset, we do a number of initializations - queues, task databases, Lambda functions etc. (<code>"$ &gt; letsdata datasets view help" </code>for details)<br><br>In this case, our continuous tests had created a dataset and the build for the dataset had failed. The initialization didn't know what to do in this case.<br><br>A couple of things:</p><ul><li><p><strong>About our continuous tests:</strong> The MVP that we've built has 2 different services that we can read from and the reads can be in 5 different configurations.&nbsp;We have 6 destinations that we can write to, and 3-4 different ways we can specify what work needs to be done.&nbsp;So in total, we did some data generation and have 33 different combinations of read, write and work specifications that can be done.&nbsp;We wrote a test suite that creates a dataset from these 33 configurations every 20 mins, waits for it to complete initialization and start processing, and then deletes the dataset. This makes sure that our customers don't run into unknown issues.</p></li><li><p><strong>About the build failure: </strong>Looks like the build had some transient failure where one of the Maven dependencies was not found. A simple manual retry of the build fixed the issue.</p><p>This is a new issue and has happened twice in ~400 dataset runs thus far. Error signature copied to the code and a TODO added. 
As we see more of this, we'll either add an automated retry or fix the root cause of the failure.&nbsp;</p></li></ul><p>So that is a founder's start to the day - a good coding workout before some of the other tasks!</p>]]></content:encoded></item><item><title><![CDATA[Launch: Kafka Write Connector on #LetsData]]></title><description><![CDATA[Today, we are announcing the public availability of Kafka Write Connector on #LetsData. Customers can now create #LetsData datasets to automatically write documents to Kafka.]]></description><link>https://blog.letsdata.io/p/launch-kafka-write-connector-on-letsdata</link><guid isPermaLink="false">https://blog.letsdata.io/p/launch-kafka-write-connector-on-letsdata</guid><dc:creator><![CDATA[Usman]]></dc:creator><pubDate>Tue, 26 Sep 2023 04:25:37 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!IksL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69295d56-4afc-4f91-83cd-1677b36838be_813x412.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today, we are announcing the public availability of <strong>Kafka Write Connector on </strong><a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a>. Customers can now create <a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a> datasets to automatically write documents to Kafka.</p><h3>Challenges and Complexities</h3><p>Developing the <a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a> Kafka write connector has been a challenging undertaking. Kafka is a mature product and&nbsp;has a rich ecosystem. 
Some challenges:</p><ul><li><p>AWS implementations come in a couple of flavors - a <strong>Serverless</strong> option and a <strong>Provisioned</strong> option.</p></li><li><p><strong>Different Kafka versions</strong> support additional features such as tiered storage.</p></li><li><p>Unlike its competitors such as Kinesis, AWS Kafka implementations require separate <strong>VPC and networking setup</strong></p></li><li><p><strong>Complexities around VPC connectivity, compounded by cross account access</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IksL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69295d56-4afc-4f91-83cd-1677b36838be_813x412.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IksL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69295d56-4afc-4f91-83cd-1677b36838be_813x412.png 424w, https://substackcdn.com/image/fetch/$s_!IksL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69295d56-4afc-4f91-83cd-1677b36838be_813x412.png 848w, https://substackcdn.com/image/fetch/$s_!IksL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69295d56-4afc-4f91-83cd-1677b36838be_813x412.png 1272w, https://substackcdn.com/image/fetch/$s_!IksL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69295d56-4afc-4f91-83cd-1677b36838be_813x412.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!IksL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69295d56-4afc-4f91-83cd-1677b36838be_813x412.png" width="813" height="412" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/69295d56-4afc-4f91-83cd-1677b36838be_813x412.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:412,&quot;width&quot;:813,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:63241,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IksL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69295d56-4afc-4f91-83cd-1677b36838be_813x412.png 424w, https://substackcdn.com/image/fetch/$s_!IksL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69295d56-4afc-4f91-83cd-1677b36838be_813x412.png 848w, https://substackcdn.com/image/fetch/$s_!IksL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69295d56-4afc-4f91-83cd-1677b36838be_813x412.png 1272w, https://substackcdn.com/image/fetch/$s_!IksL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69295d56-4afc-4f91-83cd-1677b36838be_813x412.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">10 concurrent tasks, Kafka Provisioned Cluster, 6 t3 small nodes. S3 Read throughput max ~10 GB per min, Kafka Write throughput max ~ 700 MB per min</figcaption></figure></div><p>We've resolved a bunch of these ambiguities in our implementation and automated the Kafka cluster setup and networking. With <a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a> Kafka Write Connector, you can create a working Kafka Cluster and a secure VPC within a few minutes. 
The connector comes with <a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a>'s trademark write performance, simplicity and operations.&nbsp;</p><p>You can read about the Kafka write connector in our docs at <a href="http://www.letsdata.io/docs#writeconnectors">www.letsdata.io/docs#writeconnectors</a></p><h3>VPC Networking</h3><p>To manage the Kafka Write Connector Vpc and its connectivity, we&#8217;ve built an extensive networking subsystem, and we are launching our Vpc &amp; Vpc connectivity APIs today in our CLI. At a high level, we've implemented:</p><ul><li><p><strong>IP Address Management</strong> for the multi-tenant system, with separate IP Address allocations and reclamations for each dataset. We use the Amazon VPC IP Address Manager (IPAM) feature, which works really well.</p></li><li><p>A <strong>secure, isolated Vpc</strong> for each Kafka Write Connector, set up for outbound connectivity only. Here is the architecture at a glance:</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UO-k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83d3b95a-2fe5-42b0-acea-c06498953fab_976x430.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UO-k!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83d3b95a-2fe5-42b0-acea-c06498953fab_976x430.png 424w, https://substackcdn.com/image/fetch/$s_!UO-k!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83d3b95a-2fe5-42b0-acea-c06498953fab_976x430.png 848w, 
https://substackcdn.com/image/fetch/$s_!UO-k!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83d3b95a-2fe5-42b0-acea-c06498953fab_976x430.png 1272w, https://substackcdn.com/image/fetch/$s_!UO-k!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83d3b95a-2fe5-42b0-acea-c06498953fab_976x430.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UO-k!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83d3b95a-2fe5-42b0-acea-c06498953fab_976x430.png" width="1200" height="528.688524590164" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/83d3b95a-2fe5-42b0-acea-c06498953fab_976x430.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:430,&quot;width&quot;:976,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:51614,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UO-k!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83d3b95a-2fe5-42b0-acea-c06498953fab_976x430.png 424w, https://substackcdn.com/image/fetch/$s_!UO-k!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83d3b95a-2fe5-42b0-acea-c06498953fab_976x430.png 848w, 
https://substackcdn.com/image/fetch/$s_!UO-k!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83d3b95a-2fe5-42b0-acea-c06498953fab_976x430.png 1272w, https://substackcdn.com/image/fetch/$s_!UO-k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83d3b95a-2fe5-42b0-acea-c06498953fab_976x430.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Kafka Cluster VPC Architecture</figcaption></figure></div><ul><li><p>Connectivity to the resources in the Vpc via 
Vpc Peering - we've added self service APIs to establish connectivity.</p></li></ul><div><hr></div><pre><code># run this command with customerAccountForAccess's aws credentials
 
$ &gt; aws ec2 create-vpc-peering-connection --vpc-id 'clientVpcId' --peer-vpc-id 'letsdataVpcId' --peer-owner-id 'letsdataVpcOwnerId' --peer-region 'letsdataVpcRegion' </code></pre><div><hr></div><pre><code># accept the vpc peering connection on behalf of #Let's Data by using the following #Let's Data CLI command  
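# note: 'vpcPeeringConnectionId' is the VpcPeeringConnection.VpcPeeringConnectionId value in the JSON returned by the create call above.
# a sketch for printing just that id, assuming the AWS CLI's standard --query and --output options (same placeholder values as above):
# $ &gt; aws ec2 create-vpc-peering-connection --vpc-id 'clientVpcId' --peer-vpc-id 'letsdataVpcId' --peer-owner-id 'letsdataVpcOwnerId' --peer-region 'letsdataVpcRegion' --query 'VpcPeeringConnection.VpcPeeringConnectionId' --output text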

$ &gt; letsdata vpcs vpcPeeringConnections accept --datasetName 'datasetName' --vpcPeeringConnectionId 'vpcPeeringConnectionId' --requesterVpcId 'clientVpcId' --accepterVpcId 'letsdataVpcId' --prettyPrint  </code></pre><div><hr></div><pre><code># list the vpc peering connections for a #LetsData vpc 

$ &gt; letsdata vpcs vpcPeeringConnections list --datasetName 'datasetName' --vpcId 'letsdataVpcId' [--userId 'userId'] </code></pre><div><hr></div><pre><code># delete a vpc peering connection for a #LetsData vpc 

$ &gt; letsdata vpcs vpcPeeringConnections delete --datasetName 'datasetName' --letsdataVpcId 'letsdataVpcId' --customerVpcId 'customerVpcId' --vpcPeeringConnectionId 'vpcPeeringConnectionId' [--userId 'userId']</code></pre><div><hr></div><p>
You can read the technical details around our Vpc subsystem at <a href="http://www.letsdata.io/docs#vpcs">www.letsdata.io/docs#vpcs</a></p><h3>Support Matrix</h3><p>We support Kafka Serverless and Kafka Provisioned in the <a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a> account (resourceLocation: LetsData). We also support Kafka Provisioned as BYOC (Bring Your Own Cluster) when the Kafka cluster is located in an external AWS account. (ResourceLocation: Customer).</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2u8N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feccb4c89-d3a1-46e3-b102-f4331b377865_1489x188.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2u8N!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feccb4c89-d3a1-46e3-b102-f4331b377865_1489x188.png 424w, https://substackcdn.com/image/fetch/$s_!2u8N!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feccb4c89-d3a1-46e3-b102-f4331b377865_1489x188.png 848w, https://substackcdn.com/image/fetch/$s_!2u8N!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feccb4c89-d3a1-46e3-b102-f4331b377865_1489x188.png 1272w, https://substackcdn.com/image/fetch/$s_!2u8N!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feccb4c89-d3a1-46e3-b102-f4331b377865_1489x188.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!2u8N!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feccb4c89-d3a1-46e3-b102-f4331b377865_1489x188.png" width="1200" height="151.64835164835165" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eccb4c89-d3a1-46e3-b102-f4331b377865_1489x188.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:184,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:68411,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2u8N!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feccb4c89-d3a1-46e3-b102-f4331b377865_1489x188.png 424w, https://substackcdn.com/image/fetch/$s_!2u8N!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feccb4c89-d3a1-46e3-b102-f4331b377865_1489x188.png 848w, https://substackcdn.com/image/fetch/$s_!2u8N!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feccb4c89-d3a1-46e3-b102-f4331b377865_1489x188.png 1272w, https://substackcdn.com/image/fetch/$s_!2u8N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feccb4c89-d3a1-46e3-b102-f4331b377865_1489x188.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption">Support Matrix for Kafka Provisioned Write 
Connector</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cdQn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F054b9f56-b383-4549-b0ac-373945632a9a_1490x117.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cdQn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F054b9f56-b383-4549-b0ac-373945632a9a_1490x117.png 424w, https://substackcdn.com/image/fetch/$s_!cdQn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F054b9f56-b383-4549-b0ac-373945632a9a_1490x117.png 848w, https://substackcdn.com/image/fetch/$s_!cdQn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F054b9f56-b383-4549-b0ac-373945632a9a_1490x117.png 1272w, https://substackcdn.com/image/fetch/$s_!cdQn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F054b9f56-b383-4549-b0ac-373945632a9a_1490x117.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cdQn!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F054b9f56-b383-4549-b0ac-373945632a9a_1490x117.png" width="1200" height="93.95604395604396" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/054b9f56-b383-4549-b0ac-373945632a9a_1490x117.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:114,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:37518,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cdQn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F054b9f56-b383-4549-b0ac-373945632a9a_1490x117.png 424w, https://substackcdn.com/image/fetch/$s_!cdQn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F054b9f56-b383-4549-b0ac-373945632a9a_1490x117.png 848w, https://substackcdn.com/image/fetch/$s_!cdQn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F054b9f56-b383-4549-b0ac-373945632a9a_1490x117.png 1272w, https://substackcdn.com/image/fetch/$s_!cdQn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F054b9f56-b383-4549-b0ac-373945632a9a_1490x117.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption">Support Matrix for Kafka Serverless Write Connector</figcaption></figure></div><p>That is quite a feature and test matrix for a write connector destination! What we seemed to have built is either at the bleeding edge for these technologies (or a small chance that we are doing it wrong and no one does it this way!) Why? 
Well, we ran into a couple of issues that cut this matrix short:</p><ul><li><p><strong>The AWS MSK IAM Auth externalId issue: </strong>AWS MSK IAM Auth is an authentication library that makes authentication remarkably simple (kudos AWS IAM folks!). However, it did seem like we had to build a custom version for our use case. <a href="https://github.com/aws/aws-msk-iam-auth/issues/128">GitHub Issue</a></p></li><li><p><strong>Lambda Functions connecting to a Vpc in a different account</strong> - Lambda functions work with Virtual Private Clouds by establishing elastic network interfaces in the Vpc subnets and accessing the resources in those subnets. However, this seems to be limited to Vpcs in the same account (so Lambda cannot connect to a Kafka cluster in the customer account). Some creative networking (with Vpc Peering) is probably required, but as of now, it isn't supported out of the box. This is why the Kafka Cluster in the Customer AWS Account has only one supported case - when the cluster is publicly available (and accessible to Lambda).</p></li></ul><h3>Example: Create a Dataset</h3><p>The step-by-step examples have been updated for a Kafka Serverless write connector. (<a href="http://www.letsdata.io/docs#examples">www.letsdata.io/docs#examples</a>)</p><p>Here is an example dataset configuration and the commands to create the dataset.</p><pre><code>{
&nbsp; &nbsp; "datasetName": "ExtractTargetUriDemoTest",
&nbsp; &nbsp; "accessGrantRoleArn": "arn:aws:iam::308240606591:role/Extractor",
&nbsp; &nbsp; "customerAccountForAccess": "308240606591",
&nbsp; &nbsp; "readConnector": {
&nbsp; &nbsp; &nbsp; &nbsp; "connectorDestination": "S3",
&nbsp; &nbsp; &nbsp; &nbsp; "bucketName": "commoncrawl",
&nbsp; &nbsp; &nbsp; &nbsp; "bucketResourceLocation": "Customer",
&nbsp; &nbsp; &nbsp; &nbsp; "readerType": "Single File Reader",
&nbsp; &nbsp; &nbsp; &nbsp; "singleFileParserImplementationClassName": "com.letsdata.example.TargetUriExtractor",
&nbsp; &nbsp; &nbsp; &nbsp; "artifactImplementationLanguage": "Java",
&nbsp; &nbsp; &nbsp; &nbsp; "artifactFileS3Link": "s3://targeturiextractorjar-demotest/target-uri-extractor-1.0-SNAPSHOT-jar-with-dependencies.jar",
&nbsp; &nbsp; &nbsp; &nbsp; "artifactFileS3LinkResourceLocation": "Customer"
&nbsp; &nbsp; },
&nbsp; &nbsp; "writeConnector": {
&nbsp; &nbsp; &nbsp; &nbsp; "connectorDestination": "KAFKA",
&nbsp; &nbsp; &nbsp; &nbsp; "resourceLocation": "LetsData",
&nbsp; &nbsp; &nbsp; &nbsp; "kafkaClusterType": "Serverless",
&nbsp; &nbsp; &nbsp; &nbsp; "kafkaTopicName": "commoncrawl",
&nbsp; &nbsp; &nbsp; &nbsp; "kafkaTopicPartitions": 5,
&nbsp; &nbsp; &nbsp; &nbsp; "kafkaTopicReplicationFactor": 3,
&nbsp; &nbsp; &nbsp; &nbsp; "kafkaClusterSize": "small"
&nbsp; &nbsp; },
&nbsp; &nbsp; "errorConnector": {
&nbsp; &nbsp; &nbsp; &nbsp; "connectorDestination": "S3",
&nbsp; &nbsp; &nbsp; &nbsp; "resourceLocation": "letsdata"
&nbsp; &nbsp; },
&nbsp; &nbsp; "computeEngine": {
&nbsp; &nbsp; &nbsp; &nbsp; "computeEngineType": "Lambda",
&nbsp; &nbsp; &nbsp; &nbsp; "concurrency": 15,
&nbsp; &nbsp; &nbsp; &nbsp; "memoryLimitInMegabytes": 10240,
&nbsp; &nbsp; &nbsp; &nbsp; "timeoutInSeconds": 900,
&nbsp; &nbsp; &nbsp; &nbsp; "logLevel": "DEBUG"
&nbsp; &nbsp; },
&nbsp; &nbsp; "manifestFile": {
&nbsp; &nbsp; &nbsp; &nbsp; "manifestType": "S3ReaderTextManifestFile",
&nbsp; &nbsp; &nbsp; &nbsp; "readerType": "SINGLEFILEREADER",
&nbsp; &nbsp; &nbsp; &nbsp; "fileContents": "crawl-data/CC-MAIN-2022-49/segments/1669446706285.92/warc/CC-MAIN-20221126080725-20221126110725-00000.warc.gz\r\ncrawl-data/CC-MAIN-2022-49/segments/1669446706285.92/warc/CC-MAIN-20221126080725-20221126110725-00001.warc.gz\r\ncrawl-data/CC-MAIN-2022-49/segments/1669446706285.92/warc/CC-MAIN-20221126080725-20221126110725-00002.warc.gz\r\ncrawl-data/CC-MAIN-2022-49/segments/1669446706285.92/warc/CC-MAIN-20221126080725-20221126110725-00003.warc.gz\r\ncrawl-data/CC-MAIN-2022-49/segments/1669446706285.92/warc/CC-MAIN-20221126080725-20221126110725-00004.warc.gz\r\ncrawl-data/CC-MAIN-2022-49/segments/1669446706285.92/warc/CC-MAIN-20221126080725-20221126110725-00005.warc.gz\r\ncrawl-data/CC-MAIN-2022-49/segments/1669446706285.92/warc/CC-MAIN-20221126080725-20221126110725-00006.warc.gz\r\ncrawl-data/CC-MAIN-2022-49/segments/1669446706285.92/warc/CC-MAIN-20221126080725-20221126110725-00007.warc.gz\r\ncrawl-data/CC-MAIN-2022-49/segments/1669446706285.92/warc/CC-MAIN-20221126080725-20221126110725-00008.warc.gz\r\ncrawl-data/CC-MAIN-2022-49/segments/1669446706285.92/warc/CC-MAIN-20221126080725-20221126110725-00009.warc.gz"
&nbsp; &nbsp; }
}</code></pre><div><hr></div></li></ul><pre><code># create the dataset on #Let's Data using the CLI.&nbsp;
$ &gt; ./letsdata datasets create --configFile datasetConfiguration.json --prettyPrint&nbsp;
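
# optional sanity check (a suggestion, not a #Let's Data CLI feature): if the create call rejects the configuration, confirm that the file is well-formed JSON using Python's standard json.tool module (assumes python3 is installed).
$ &gt; python3 -m json.tool datasetConfiguration.json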


# view the dataset on #Let's Data using the CLI. Once the dataset is created, it takes ~3 mins to initialize the resources (dataset is in INITIALIZING state, no tasks have been created yet).&nbsp;
$ &gt; ./letsdata datasets view --datasetName ExtractTargetUriDemoTest --prettyPrint&nbsp;


# list the dataset tasks on #Let's Data using the CLI&nbsp;
$ &gt; ./letsdata tasks list --datasetName ExtractTargetUriDemoTest --prettyPrint</code></pre><div><hr></div><h3>Example: SDK Sample</h3><p>We&#8217;ve also updated our SDK samples on how to read data from <a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a> Kafka clusters. (<a href="https://github.com/lets-data/letsdata-writeconnector-reader">https://github.com/lets-data/letsdata-writeconnector-reader</a> and <a href="http://www.letsdata.io/docs#customeraccountforaccess">www.letsdata.io/docs#customeraccountforaccess</a>)</p><p>Here is the sample CLI driver application that can be used to read from the Kafka cluster. This assumes connectivity to the VPC has been established. (Instructions on how to establish connectivity can be found at: <a href="http://www.letsdata.io/docs#vpcs">www.letsdata.io/docs#vpcs</a>)</p><pre><code># cd into the bin directory
$ &gt; cd src/bin

# awsAccessKeyId and awsSecretKey are the security credentials of an IAM User in the customer AWS account. This is the customer AWS account that was granted access. In case this is a root account, you can create an IAM user. See the "IAM User With AdministratorAccess" section above.

# Connect a Kafka Consumer to the Kafka Cluster using aws-msk-iam-auth library
$ &gt; kafka_reader --clusterArn 'clusterArn' --customerAccessRoleArn 'customerAccessRoleArn' --externalId 'externalId' --awsRegion 'awsRegion' --awsAccessKeyId 'awsAccessKeyId' --awsSecretKey 'awsSecretKey' --topicName 'topicName'
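# for reference (from the aws-msk-iam-auth README, not specific to this sample): a Kafka client that uses the library is typically configured with these properties
#   security.protocol=SASL_SSL
#   sasl.mechanism=AWS_MSK_IAM
#   sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
#   sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
# how kafka_reader applies the customerAccessRoleArn and externalId on top of these is not shown here; see the SDK sample repo for the actual code.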

&gt; Enter the kafka consumer method to invoke. ["listTopics", "listSubscriptions", "subscribeTopic", "pollTopic", "commitPolledRecords", "topicPartitionPositions","assignTopicPartitions", "listAssignments","quit"]
listTopics
{commoncrawl1}

&gt; Enter the kafka consumer method to invoke. ["listTopics", "listSubscriptions", "subscribeTopic", "pollTopic", "commitPolledRecords", "topicPartitionPositions","assignTopicPartitions", "listAssignments","quit"]
assignTopicPartitions

&gt; Enter the kafka consumer method to invoke. ["listTopics", "listSubscriptions", "subscribeTopic", "pollTopic", "commitPolledRecords", "topicPartitionPositions","assignTopicPartitions", "listAssignments","quit"]
topicPartitionPositions
{commoncrawl1={0=0, 1=0, 2=0, 3=0, 4=0}}

&gt; Enter the kafka consumer method to invoke. ["listTopics", "listSubscriptions", "subscribeTopic", "pollTopic", "commitPolledRecords", "topicPartitionPositions","assignTopicPartitions", "listAssignments","quit"]
pollTopic
...
...
...

&gt; Enter the kafka consumer method to invoke. ["listTopics", "listSubscriptions", "subscribeTopic", "pollTopic", "commitPolledRecords", "topicPartitionPositions","assignTopicPartitions", "listAssignments","quit"]
commitPolledRecords

&gt; Enter the kafka consumer method to invoke. ["listTopics", "listSubscriptions", "subscribeTopic", "pollTopic", "commitPolledRecords", "topicPartitionPositions","assignTopicPartitions", "listAssignments","quit"]
topicPartitionPositions
{commoncrawl1={0=179, 1=424, 2=249, 3=185, 4=233}}

&gt; Enter the kafka consumer method to invoke. ["listTopics", "listSubscriptions", "subscribeTopic", "pollTopic", "commitPolledRecords", "topicPartitionPositions","assignTopicPartitions", "listAssignments","quit"]
quit</code></pre><p>Our docs, CLI and web have all been updated to work with the Kafka Write Connector.</p><h3>Conclusion</h3><p>With these new Vpc &amp; Clustering subsystems, we&#8217;ll now be looking to add additional clustering data destinations to LetsData. We also have plans to expand the Read Connectors to Kinesis, DynamoDB and Kafka.</p><p>We hope you enjoy working with the new features, and we would love to hear any feedback on what works well and what doesn&#8217;t.</p>]]></content:encoded></item><item><title><![CDATA[#LetsData at AWS Summit, NY, 2023]]></title><description><![CDATA[It's been a little quiet on our #LetsData page - while no excuse for being silent is going to sound good, the reason we've been quiet is that we've been really busy with:]]></description><link>https://blog.letsdata.io/p/letsdata-at-aws-summit-ny-2023</link><guid isPermaLink="false">https://blog.letsdata.io/p/letsdata-at-aws-summit-ny-2023</guid><dc:creator><![CDATA[Usman]]></dc:creator><pubDate>Tue, 26 Sep 2023 04:01:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!TVOs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F296c8556-fb6e-4fcd-ae58-90386f0c3b04_1488x989.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It's been a little quiet on our <a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a> page - while no excuse for being silent is going to sound good, the reason we've been quiet is that we've been really busy with:&nbsp;</p><ol><li><p>Prospecting customers</p></li><li><p>Co-Founder - Sales hiring (lots of great conversations but nothing final, let's see how this goes)</p></li><li><p>Hiring a Web Developer (Woot! 
we hired an engineer who started this week and already has some templates going!)</p></li><li><p>Building additional product features (Stay tuned, we have additional connector announcements soon)</p></li><li><p>Finding how we fit into the overall ecosystem (Product-Market Fit)</p></li></ol><p>Between all this, regular posting on LinkedIn kind of automatically got de-prioritized. Anyway, on to the main topic for today.&nbsp;</p><p>We've been trying to find creative ways to prospect customers (#1), and for any customer conversation it is critical that we articulate a crisp narrative for our Product-Market Fit (#5). For these two reasons, I attended the <a href="https://aws.amazon.com/events/summits/new-york/">AWS Summit in New York (www)</a> yesterday. Here is my debrief of the event and its applicability to <a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a>.</p><h3>Preparation</h3><p>Leading up to the event, I updated our <a href="https://theresonancelabscom-my.sharepoint.com/:b:/g/personal/usmanshami_theresonancelabs_com/EV_UsdgquPhIvKC5VsdmHb4BhjoMPYr5wizJrvoyeBNboQ?e=5bq87x">Marketing Solutions Brief (pdf)</a> and our <a href="https://theresonancelabscom-my.sharepoint.com/:b:/g/personal/usmanshami_theresonancelabs_com/EfyANIql3dZMtYIKBoHPAqABmuz5Q2A1W-Tov2J8MUQ3og?e=byFJl1">Executive Summary (pdf)</a> to create a Customer Package for any customer conversations. I printed 30-odd copies at our local FedEx office: sorted, bound, laminated and boxed, ready to be handed out to customers. I had registered as an attendee (not as a sponsor),&nbsp;so I had no booth assigned in the Expo / Exhibition&nbsp;Hall. The idea here was to squat on some empty table if such an opportunity arose and chat with interested folks. 
This was the prep work for prospecting customers (#1).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TVOs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F296c8556-fb6e-4fcd-ae58-90386f0c3b04_1488x989.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TVOs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F296c8556-fb6e-4fcd-ae58-90386f0c3b04_1488x989.jpeg 424w, https://substackcdn.com/image/fetch/$s_!TVOs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F296c8556-fb6e-4fcd-ae58-90386f0c3b04_1488x989.jpeg 848w, https://substackcdn.com/image/fetch/$s_!TVOs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F296c8556-fb6e-4fcd-ae58-90386f0c3b04_1488x989.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!TVOs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F296c8556-fb6e-4fcd-ae58-90386f0c3b04_1488x989.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TVOs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F296c8556-fb6e-4fcd-ae58-90386f0c3b04_1488x989.jpeg" width="1456" height="968" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/296c8556-fb6e-4fcd-ae58-90386f0c3b04_1488x989.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:968,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:514155,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TVOs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F296c8556-fb6e-4fcd-ae58-90386f0c3b04_1488x989.jpeg 424w, https://substackcdn.com/image/fetch/$s_!TVOs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F296c8556-fb6e-4fcd-ae58-90386f0c3b04_1488x989.jpeg 848w, https://substackcdn.com/image/fetch/$s_!TVOs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F296c8556-fb6e-4fcd-ae58-90386f0c3b04_1488x989.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!TVOs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F296c8556-fb6e-4fcd-ae58-90386f0c3b04_1488x989.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Exhibit A: The #LetsData Package</figcaption></figure></div><p>For Product-Market Fit (#5), I went through the session catalog and identified the sessions that I'd want to attend. 
The selection criteria were:&nbsp;</p><ul><li><p>anything related to the data / compute space in general&nbsp;</p></li><li><p>data / compute AWS techs that I hadn't worked with&nbsp;</p></li><li><p>customer stories on how customers were actually using data techs in the industry</p></li></ul><p>The idea here was that as long as I was consuming knowledge about different techs and customer stories, they'd connect to define a crisper strategy narrative and bring clarity for Product-Market Fit.</p><p>Here are some sessions that I selected (attending all of these would be a challenge; the ones I attended are <strong>highlighted</strong>):</p><ul><li><p>Not Applicable, The Keynote</p></li><li><p>200 - INTERMEDIATE, Self-paced labs: Come and go as you please</p></li><li><p>200 - INTERMEDIATE, Adobe&#8217;s journey toward building an internal developer platform (IDP)</p></li><li><p>200 - INTERMEDIATE, Chart your Kubernetes course with Amazon EKS</p></li><li><p>200 - INTERMEDIATE, Fidelity&#8217;s observability platform for telemetry</p></li><li><p><strong>200 - INTERMEDIATE, Create a CI/CD pipeline to deploy your application to AWS ECS</strong></p></li><li><p><strong>200 - INTERMEDIATE, Serverless SaaS: Building for multi-tenancy</strong></p></li><li><p>200 - INTERMEDIATE, AWS networking fundamentals: Setting up your global network</p></li><li><p>200 - INTERMEDIATE, How modern data management can fuel your success on AWS</p></li><li><p><strong>300 - ADVANCED, Faster insight with Amazon Redshift: Zero-ETL integration &amp; sharing</strong></p></li><li><p>300 - ADVANCED, Build a data governance framework to balance data control and access</p></li><li><p>300 - ADVANCED, Architecting for low latency and performance in financial services</p></li><li><p>300 - ADVANCED, Best practices for microservices deployed on Amazon ECS</p></li><li><p><strong>300 - ADVANCED, Making a modern data architecture a reality</strong></p></li><li><p>300 - ADVANCED, Managing resources with the new AWS Cloud 
Control Terraform provider</p></li></ul><h3>AWS Summit, NY, 2023</h3><p>I boarded a 9 PM flight from Seattle and arrived at Newark at 5 AM - a 30 min Uber ride and I was at the Javits Center in West Manhattan at 6 AM. Fueled by a coffee and a cream cheese bagel, I was ready to attend some sessions and talk to some customers.&nbsp;</p><p>Some observations about the event, the venue and the overall ambience at the AWS Summit NYC: the venue, the Javits Center, on the banks of the Hudson and a short walk from the 9th Ave skyscrapers and attractions such as Vessel and Madison Square Garden, was impressive. The AWS event itself was grand, a choreographed AWS show that demonstrates their success as the de facto cloud leader. The large number of attendees and the queuing / sell-outs&nbsp;at lectures were a gauge of the interest in, and applicability of, what they are doing. Every tiny little detail had been thought about and meticulously planned - so my plan of squatting at a table in the Expo Hall was impossible; I'd have to improvise.&nbsp;</p><h3>Create a CI/CD pipeline to deploy your application to AWS ECS</h3><p>After getting my bearings around the event, I attended the lab about ECS deployments - <strong>"Create a CI/CD pipeline to deploy your application to AWS ECS"</strong>. This was a relatively large hall that was set up as a computer lab - every attendee got a desktop with a couple of monitors, connected to the internet. The instructor led the session in a Peloton style (Peloton reference since I had seen their store in the neighborhood earlier that day) - telling the audience about the lab and getting confirmation that the audience was following along.&nbsp;</p><p>The software and infrastructure that AWS has built around these Labs, Workshops, Skill Building and Training was pretty impressive - a simple login gets you a fully functional AWS Cloud environment to work through these use cases. 
For example, the test (sandbox) and prod environments that I use during my day to day development are actual AWS accounts that I have (nothing sandbox about my test account) - so having such a scoped down environment and sandbox resources would be a huge productivity boost in my day to day development as well. (I know the scoped down environment can be created via IAM roles and policies, but I don't know how to set up a sandbox on AWS.) Overall, the ease with which attendees could get environments up and running was quite impressive IMHO.&nbsp;</p><p>The ECS CI/CD lab walked us through creating an ECS deployment and staged updates. It talked about CodeCommit, CodePipeline, CodeBuild and Lambda. We use these constructs quite heavily in <a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a>, so nothing new here. As for the container side of things, I know what containers are and have done some experiments with them. ECS and Kubernetes are compute options for containerized applications that could be used instead of the default Lambda&nbsp;compute that we currently have in <a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a>. The lab also included walkthroughs of the container registry and of deploying containerized applications. Good stuff!</p><h3>Faster insight with Amazon Redshift: Zero-ETL integration &amp; sharing</h3><p>Next, I attended the chalk talk <strong>"Faster insight with Amazon Redshift: Zero-ETL integration &amp; sharing"</strong> - where the presenters whiteboarded how their newer Zero ETL features simplify data ingestion into AWS Redshift. They showed different integrations from AWS Aurora, AWS S3 and AWS Kinesis directly into AWS Redshift. 
The integrations seemed easy to set up, simplified existing complex pipelines, came at no cost in most cases, ran automatically and came with Amazon's operability goodies (visibility into batches and progress in different tables).</p><h3>No Code vs Low Code vs Your Code</h3><p>This talk seemed highly relevant to <a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a> and, on the surface, raises some existential questions for the company:</p><ul><li><p>If there are zero touch and no cost ETL integrations natively available in AWS, would <a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a> be able to do better?&nbsp;</p></li><li><p>Or, in terms of a Product-Market-Fit (#5) question, how does <a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a> fit in when there are existing zero ETL solutions natively available?&nbsp;</p></li><li><p>A few people I had talked to had asked me about <strong>No Code</strong> options in <a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a> and I had told them it's an area we need to explore. What are our plans for no code?</p></li></ul><p>Our distinguishing feature, flagship capability and the brightest feather in our cap is the managed service we've built around AWS Lambda as a Compute Engine - a true serverless and infinitely elastically scalable compute offering. Connectors are the connections for reading from and writing to different sources and destinations.&nbsp;</p><p>With the <a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a> Lambda Compute Engine, you do not have to create clusters, manage machines, or run orchestration and management software such as Kubernetes. 
Clusters, machine management and Kubernetes are extremely smart and feature rich at managing a shared pool of resources (machines, Kubernetes CPU resources etc) - however, this means:</p><ul><li><p>you are still responsible for the operability and scale</p></li><li><p>infinite elastic scaling is not an option without additional provisioning</p></li></ul><p>The <a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a> infinite elastic scale was actually demonstrated in our Web Crawl Archives Big Data Case Study - where, with just a config option, we created 100 concurrent Lambda functions without having to provision any hardware, clusters or machines! (This is the reason why we saw the system process 3 billion JSON docs in 48 hours! I like to compare it with Google's 8 billion searches per day - we aren't nearly as complex, but 3 billion isn't pocket change.)&nbsp;</p><p>Additionally, with our <a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a> Lambda Compute Engine, we've not only packaged AWS Lambda for ease of data compute, but we've also extended the service with many little things that will delight you once you start using the service. For example, the Lambda function runtime today has an upper limit of 15 mins. In case your task takes longer (and data intensive tasks that read 10s of gigabytes of data files can take longer), we've built in automated task rerun capabilities. This essentially means your task relinquishes control and is rescheduled for a rerun.&nbsp;</p><p>I liken this to the thread scheduling, preemption and execution that OSes do. For example, in an experiment that I ran, I fixed the task timeout to 60 seconds (config) and saw that the tasks were getting run for 60s, then giving the compute to other tasks and then getting rescheduled again. 
A very neat implementation, and I like how it manifests in use, IMHO!</p><p>For example, when reading from a partitioned stream with a task for each partition, and a partition doesn't have data available yet, this ensures that no partition is starved (similar to IO schedulers in low level OS implementations).</p><p>(I mentioned config a couple of times - this isn't traditional config, it's actually the dataset configuration JSON that you create when defining your read, write, error and compute - we are not a config heavy system IMO. I believe our design philosophy follows this as a principle and that leads to simplicity. However, I acknowledge that we are just an MVP that does not have a gamut of features, so becoming config heavy may be the only option as we grow (I doubt that somehow), but as of now, I do believe this offers an ease of use. Also, I may have an owner's bias and some users might find our config overwhelming.)</p><p>I believe that this distinguishing feature solidly puts us in the <strong>"Your Code"</strong> (and maybe <strong>"Low Code"</strong>) category instead of <strong>"No Code"</strong>. Also, if I may bring another term into the mix, we are <strong>"No Provisioning" </strong>(true serverless). Consider "No Provisioning" as a design parameter when you design your infrastructure and realize the infrastructure and operability simplifications that it offers.</p><p>However, "No Code" / Zero ETL features are highly relevant and important - when the compute is a simple copy of data into a destination, they work really well (and I was impressed by how AWS has provided these with such ease of use).&nbsp;</p><p>For example, the automation is built in: turn it on once and any new files / data keep syncing automatically. Debuggability, batching etc. come along as well. 
We might just use these Zero ETL features as is and package them with the <a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a> resources such as logging, metrics, tasks etc to be the "No Code" pieces in our data pipelines. This would "build this in code" for better manageability, instead of relying on ad hoc actions such as enabling features in a studio or on the console, which can be a source of operator errors. We can integrate with their API so that you don't have to!</p><p>So these are my thoughts about <a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a> positioning with respect to the "No Code" and "Zero ETL" offerings. We can integrate with them for the basic copy tasks and offer a compute that can do much much more!</p><h3>Expo Hall</h3><p>Next, I headed to the expo hall and started chatting with a number of different companies and understanding what they do. There were lots of companies in the data pipelines, observability and monitoring space. Almost all the open source SQL and NoSQL variants had a presence. Companies that increase developer agility, decrease costs, secure software development end to end and similar were all on display.&nbsp;</p><p>From the expo hall, I made my way to the <strong>"Architecting for low latency and performance in financial services" </strong>session when it was time for the session. There was a large queue and, by session time, the hall had filled up and a number of attendees had to be turned away. Those early birds had beaten me to the punch :) Never mind, I'll catch up on this session via the session recording on YouTube when it is available. Back to the Expo Hall I went. There was some time before the Keynote, so I spent some time in the Expo Hall talking to some folks. 
Some additional conversations around data pipelines and visiting card swaps.&nbsp;</p><h3>Keynote</h3><p>Realizing that queuing and limited capacity may be facts of the AWS Event, I turned on my theme park mode from my younger years to optimize ride wait times etc. Just kidding - the queuing and limited capacity weren't that big of an issue; I just gave myself ample time for the Keynote and any events I wanted to attend instead of showing up right at the event time.&nbsp;</p><p>The key takeaway from the Keynote was that AWS was making big investments in the AI and ML spaces - lots of new features, simplifications, cost reductions and ease of use for AI / ML.</p><p>During the course of my startup, a few people had asked me how we can possibly benefit from the AI / ML wave that is sweeping the industry. I had read a few articles and some architecture docs, and these two do a really good job of explaining it:</p><ul><li><p>Emerging Architectures for LLM Applications (an Andreessen Horowitz blog):&nbsp;<a href="https://a16z.com/2023/06/20/emerging-architectures-for-llm-applications/">https://a16z.com/2023/06/20/emerging-architectures-for-llm-applications/</a></p></li><li><p>How OpenAI trained ChatGPT (an excellent summary of the MS Build talk):&nbsp;<a href="https://blog.quastor.org/p/openai-trained-chatgpt">https://blog.quastor.org/p/openai-trained-chatgpt</a></p></li></ul><p>We definitely fit well in the "Data Preprocessing / Embedding" component of the architecture in the Andreessen Horowitz blog. In addition, from the MS Build talk summary&nbsp;(<strong>emphasis mine</strong>):</p><blockquote><p>"The Data Mixture specifies what datasets are used in training. 
OpenAI didn&#8217;t reveal what they used for GPT-4, but Meta published the data mixture they used for LLaMA (a 65 billion parameter language model that you can download and run on your own machine).</p><p>...</p><p>From the image above, you can see that the <strong>majority of the data comes from Common Crawl</strong>, a web scrape of all the web pages on the internet"</p></blockquote><p>Considering that we have demonstrated scale and performance in reading and processing Common Crawl docs (see <a href="https://www.letsdata.io/#casestudies">"Case Study: Big Data: Building a Document Index From Web Crawl Archives"</a>), I believe we'd be highly relevant here as well.</p><p>In addition, maybe the Compute can be leveraged for some training as well, but this seems highly specialized. I need to spend some time working through a few example use cases to see how everything works and to internalize how <a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a> can be used in these scenarios. I'll say more about it once I've internalized this a little better. (Task for self: deep dive into OpenAI's / Meta's technology.)</p><p>I headed for lunch after the keynote, grabbed a Tuna Sandwich on Ciabatta and a Diet Coke, and headed out to the benches to enjoy a meal in the sun. I hadn't done this in ages, for no specific reason. The warmth of the summer sun made one of the rituals I like rather enjoyable.&nbsp;</p><h3>Making a modern data architecture a reality</h3><p>Headed over to <strong>"Making a modern data architecture a reality"</strong> and learned about lake formation in AWS S3 via AWS Glue:</p><ul><li><p><strong>Glue Crawlers</strong> infer the data schema and add it as a schema in the <strong>Glue Data Catalog</strong></p></li><li><p><strong>Glue Transforms</strong> do a SQL join of two different S3 data sources and write the result back to S3 via the Glue Studio's drag and drop designer. 
The designer auto generated a Python script that gets run (seemed like Spark code)</p></li><li><p>Again, a <strong>Glue Crawler</strong> to infer the schema from the joined data</p></li><li><p><strong>Athena</strong> to query over the S3 data using the schema from the Glue Catalog</p></li><li><p>A few neat <strong>QuickSight visualizations</strong> (Heatmap and Treemap) to round out the exercise</p></li></ul><p><strong>Data Cataloging</strong> is important for data governance and the built in Glue transformations and crawl infrastructure seem really interesting as well. Having a data catalog and Athena querying built in seems like a useful addition to <a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a>. We need to experiment with these a little, see what would be most beneficial for the customers and build something that the customers would use.&nbsp;</p><p>This lab differed from the earlier one in that it was BYOL (bring your own laptop) - but the software infra of a simple login with AWS cloud resources provisioned and available was VERY impressive. 
(pat on the back for folks who made this possible)</p><h3>Expo Hall</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Wr-x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21ed63ba-7bd2-4460-af04-2a1c2ac5ad0d_1000x1500.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Wr-x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21ed63ba-7bd2-4460-af04-2a1c2ac5ad0d_1000x1500.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Wr-x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21ed63ba-7bd2-4460-af04-2a1c2ac5ad0d_1000x1500.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Wr-x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21ed63ba-7bd2-4460-af04-2a1c2ac5ad0d_1000x1500.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Wr-x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21ed63ba-7bd2-4460-af04-2a1c2ac5ad0d_1000x1500.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Wr-x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21ed63ba-7bd2-4460-af04-2a1c2ac5ad0d_1000x1500.jpeg" width="492" height="738" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/21ed63ba-7bd2-4460-af04-2a1c2ac5ad0d_1000x1500.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1500,&quot;width&quot;:1000,&quot;resizeWidth&quot;:492,&quot;bytes&quot;:413774,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Wr-x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21ed63ba-7bd2-4460-af04-2a1c2ac5ad0d_1000x1500.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Wr-x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21ed63ba-7bd2-4460-af04-2a1c2ac5ad0d_1000x1500.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Wr-x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21ed63ba-7bd2-4460-af04-2a1c2ac5ad0d_1000x1500.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Wr-x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21ed63ba-7bd2-4460-af04-2a1c2ac5ad0d_1000x1500.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Exhibit B: Founder and CEO, #LetsData - as I mentioned, not a very flattering picture but adding for journalistic completeness!</figcaption></figure></div><p>Back in the Expo Hall one more time. A number of fun activities were also in the Expo Hall: professional executive headshots, games such as ping pong etc. The line for the professional headshots was a little thinner, and in between my conversations, I got these done. Not very flattering though; I guess 20 hours of travel and tiredness were showing through. Ah well, I like my current LinkedIn pic quite well.&nbsp;</p><p><strong>AWS Startup Loft</strong> had a large presence on the Expo floor. I met with a few different Solution Architects, discussed the startup, got some tips around execution and engagement, and made the customary exchange of visiting cards. 
Next stop: a few additional companies, asking what possible integrations and synergies might exist with <a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a>.&nbsp;</p><p>At this time, I went to the large lobby area and chilled out on some benches, watching folks and the event pass by. I was a little tired and thought I'd stretch my legs and wait for my next talk.&nbsp;&nbsp;</p><h3>Serverless SaaS: Building for multi-tenancy</h3><p>The <strong>AWS Community</strong> section in the Expo Hall had a talk about <strong>Serverless SaaS and building for multi tenancy</strong>. Most of this was what I already knew from the <a href="https://docs.aws.amazon.com/wellarchitected/latest/saas-lens/saas-lens.html">AWS Well Architected Framework SaaS Lens (www)</a>. I had read this cover to cover and more for <a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a> and solved some really tricky challenges along the way, but hearing this reiterated was fun.&nbsp;</p><p>Almost 5PM, my flight's at 7:30, no time for the after event socials, at least not this time. 
Uber back to the airport, check in, wait, board, disembark, drive back home.</p><p>1AM at home but there are people waiting for me to come back safely.</p><p>Alhamdulilah (Thank God), Life's Good!</p>]]></content:encoded></item><item><title><![CDATA[#Let's Data is now a trusted AWS Partner ]]></title><description><![CDATA[#LetsData is now a trusted APN (AWS Partner Network) Partner and we are now listed in the AWS Partner Solution Finder listing.]]></description><link>https://blog.letsdata.io/p/lets-data-is-now-a-trusted-aws-partner</link><guid isPermaLink="false">https://blog.letsdata.io/p/lets-data-is-now-a-trusted-aws-partner</guid><dc:creator><![CDATA[Usman]]></dc:creator><pubDate>Tue, 26 Sep 2023 03:53:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!IC2O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f13b2fb-9bd5-4c30-9540-59c2c70c9704_1063x699.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><p><a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a> is now a trusted APN (AWS Partner Network) Partner and we are now listed in the <a href="https://partners.amazonaws.com/partners/0018W00002FlzWEQAZ/#Let's%20Data">AWS Partner Solution Finder listing</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IC2O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f13b2fb-9bd5-4c30-9540-59c2c70c9704_1063x699.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IC2O!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f13b2fb-9bd5-4c30-9540-59c2c70c9704_1063x699.png 424w, 
https://substackcdn.com/image/fetch/$s_!IC2O!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f13b2fb-9bd5-4c30-9540-59c2c70c9704_1063x699.png 848w, https://substackcdn.com/image/fetch/$s_!IC2O!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f13b2fb-9bd5-4c30-9540-59c2c70c9704_1063x699.png 1272w, https://substackcdn.com/image/fetch/$s_!IC2O!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f13b2fb-9bd5-4c30-9540-59c2c70c9704_1063x699.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IC2O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f13b2fb-9bd5-4c30-9540-59c2c70c9704_1063x699.png" width="1063" height="699" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0f13b2fb-9bd5-4c30-9540-59c2c70c9704_1063x699.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:699,&quot;width&quot;:1063,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:155296,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IC2O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f13b2fb-9bd5-4c30-9540-59c2c70c9704_1063x699.png 424w, 
https://substackcdn.com/image/fetch/$s_!IC2O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f13b2fb-9bd5-4c30-9540-59c2c70c9704_1063x699.png 848w, https://substackcdn.com/image/fetch/$s_!IC2O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f13b2fb-9bd5-4c30-9540-59c2c70c9704_1063x699.png 1272w, https://substackcdn.com/image/fetch/$s_!IC2O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f13b2fb-9bd5-4c30-9540-59c2c70c9704_1063x699.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">#Let's Data Search Results in AWS Partner Solution Finder Search</figcaption></figure></div><p>We've also earned this shiny new AWS Partner badge!</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!k0SZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe42c4304-6ba9-456b-aa41-fbca341f9aa0_120x120.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!k0SZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe42c4304-6ba9-456b-aa41-fbca341f9aa0_120x120.png 424w, https://substackcdn.com/image/fetch/$s_!k0SZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe42c4304-6ba9-456b-aa41-fbca341f9aa0_120x120.png 848w, https://substackcdn.com/image/fetch/$s_!k0SZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe42c4304-6ba9-456b-aa41-fbca341f9aa0_120x120.png 1272w, https://substackcdn.com/image/fetch/$s_!k0SZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe42c4304-6ba9-456b-aa41-fbca341f9aa0_120x120.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!k0SZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe42c4304-6ba9-456b-aa41-fbca341f9aa0_120x120.png" width="120" height="120" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e42c4304-6ba9-456b-aa41-fbca341f9aa0_120x120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:120,&quot;width&quot;:120,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3607,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!k0SZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe42c4304-6ba9-456b-aa41-fbca341f9aa0_120x120.png 424w, https://substackcdn.com/image/fetch/$s_!k0SZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe42c4304-6ba9-456b-aa41-fbca341f9aa0_120x120.png 848w, https://substackcdn.com/image/fetch/$s_!k0SZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe42c4304-6ba9-456b-aa41-fbca341f9aa0_120x120.png 1272w, https://substackcdn.com/image/fetch/$s_!k0SZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe42c4304-6ba9-456b-aa41-fbca341f9aa0_120x120.png 1456w" sizes="100vw"></picture><div></div></div></a><figcaption class="image-caption">The #Let's Data AWS Partner Badge</figcaption></figure></div><p>The AWS Partner Network membership has many advantages and the validation criteria are thorough. 
A company has to follow the AWS best practices and undergo a <a href="https://aws.amazon.com/partners/foundational-technical-review/">Foundational Technical Review (FTR)</a>, which is designed to ensure the company follows best practices across categories such as auditing, logging, access controls and security.</p><p>While we've built <a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a> following software best practices and have taken care to eliminate anything that customers / developers find nagging, the FTR uncovered a range of things that take our offering from a <em><strong>solid, startup service offering to an </strong></em><strong>Enterprise grade </strong><em><strong>offering</strong></em>.</p><p>The value add of the simple FTR process is immense IMHO. In addition, the process emphasizes usage of tools and services (Management, Governance, Security, Identity and Compliance) that are not commonly used when developing a product but are lifesavers when things go wrong. The FTR becomes a great ramp-up on the operational side of things.</p><p>Let's look at the tools and services that we discovered as part of the FTR process and are now using continuously to shore up our compliance and governance infrastructure.</p><div><hr></div><h3>Security Hub</h3><p>AWS Security Hub automates AWS security checks and centralizes security alerts. It's fairly easy to get started with and use. It comes with a few different default benchmarks that one can enable. It finds all sorts of violations and deviations from best practices, and fixing them makes for nice improvements to the product. An extremely useful service IMHO!</p><div><hr></div><h3>AWS Config</h3><p>AWS Config assesses, audits, and evaluates the configurations of your AWS resources. The way it works is that you enable the AWS Config recorder, which starts recording the resource configurations and any changes to these. 
For example, an S3 bucket ACL change.</p><p>Interestingly, it lets users (and services such as Security Hub) define the configuration rules that they are interested in. These rules are continuously evaluated and the results are forwarded to the default bus on Amazon EventBridge. EventBridge allows you to build custom processing logic on each rule and then forward the matching events to the whole gamut of AWS foundational services (SNS, Lambda, CloudWatch, etc.).</p><p>This is a powerful construct, because now this service becomes a foundational building block - you can detect any configuration change that you are interested in and invoke custom actions / pipelines. I was really amazed at the end-to-end experience around this. A large part of it is the richness of EventBridge, but the overall end to end is very complete. A job really well done, guys!</p><div><hr></div><h3>AWS Trusted Advisor</h3><p>AWS Trusted Advisor provides recommendations that help you follow AWS best practices, reduce costs, improve performance and improve security. This service looks at the more practical side of operations: things such as cost optimization, performance, fault tolerance, service limits and security as well.</p><p>We didn't find many issues when we ran this. Some of this could be because:</p><ul><li><p>Trusted Advisor has tiers that are tied to the AWS Account's customer support plans. 
We were on the Basic plan, so we didn't get to run the full gamut of checks.</p></li><li><p>we don't have real use cases running at this time, so cost, performance and service limit issues probably wouldn't surface yet</p></li><li><p>there are comprehensive checks for AWS services such as EC2, EBS, RDS, etc. that we are not currently using.</p></li></ul><p>But overall it looks like a great service to have continuously running, and once we have some customer traction, I might just upgrade to premium support so that I can get the full gamut of Trusted Advisor checks!</p><div><hr></div><h3>Amazon GuardDuty</h3><p>Intelligent Threat Detection to Protect Your AWS Accounts and Workloads. This service looks at a bunch of different logs to flag issues that could possibly be security threats. In our case, it flagged root credential usage, bucket policy changes and changes in the CloudTrail logging configuration. Additional controls for EKS, RDS and Malware probably didn't run (since we are not using those services yet). We're keeping this running - let's see what threats we discover over time.</p><div><hr></div><h3>AWS CloudTrail</h3><p>Track user activity and API usage. This again is a foundational service that integrates with every other service in AWS and audits whatever actions are being done and by whom. I've used this to debug an issue or two every once in a while. However, the FTR's emphasis on CloudTrail enablement, CloudTrail log security and paranoia over having these logs delivered to a different account underscores the importance of the data that this service logs. It's your complete AWS central audit system that can help trace every action that took place in the account. 
Here are some notable observations:</p><ul><li><p>the data CloudTrail captures is very important and unlocks newer use cases in compliance, security and more.</p></li><li><p>CloudTrail auto-generates insights from this data, for example that the CloudWatch Logs service is seeing a 465% increase in Create Log Stream API calls at this time.</p></li><li><p>it integrates with data lakes and event data stores, and allows SQL query semantics over this data.</p></li></ul><div><hr></div><h3>Amazon EventBridge</h3><p>Although it is not in the FTR services toolset, I did want to mention this service since I was amazed when I used it in <a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a> use cases. The concept of event buses and custom rules triggered via events on the event buses or by cron is very powerful.</p><p>The filtering mechanisms are simple to implement, and the complexity from the large number and different types of events is very beautifully abstracted in the console by having all the sample events available to test against.</p><p>And the auto delivery integrations to SNS, Lambda and CloudWatch Events complete the end to end for this really well.</p><p>And the fact that systems can publish automatically on the default event bus without you having to configure different services, grant permissions and take other similar actions makes it simple to use.</p><p>For example, to set up the "Alarm when S3 bucket becomes public" alert, all one had to do was copy the name of the Security Hub config rule for "S3 bucket becomes public" and create an event filter on the default bus that triggers when the rule matches and is in alarm. Delivering these events to CloudWatch Logs automatically and configuring a CloudWatch alarm on them completed the end to end without much fuss. 
Great vision, folks!</p><div><hr></div><h3>AWS Backup</h3><p>AWS Backup is a cost-effective, fully managed, policy-based service that simplifies data protection at scale and centralizes backup and restore. We enabled backups for all S3 buckets in the account. This included buckets that had customer data as well as <a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a> buckets. Great centralized backup service - it did require some finagling to get it working - ideally I'd want this to be as easy and facile as DynamoDB point-in-time recovery (that feature works like a charm out of the box!)</p><div><hr></div><h2>What we fixed</h2><p>Okay, now that we know what all was run, here are the issues that we found and fixed:</p><h3><strong>CloudFront and S3 Bucket interactions</strong></h3><p>CloudFront and S3 integrations are easy to configure and developers can get started in a few clicks. However, this ease also leaves room for security bugs if things are not configured properly. Following the Security Hub findings, we disabled public access on our S3 buckets (which we had behind CloudFront - this was a remnant from when we didn't have CloudFront and public objects were used), configured default objects and configured controls that disallow discovering unintended content in the S3 bucket using path-based attacks. Benign but important. Great set of checks!</p><h3>S3 Buckets</h3><p>S3, being central to almost every use case on AWS, gets a large number of security checks to make sure the bucket configurations are correct. Here is what we enforced:</p><ul><li><p>we locked our S3 buckets to use SSL transport only (to guard against man-in-the-middle attacks)</p></li><li><p>enabled block public access on all buckets so that nothing leaks, even accidentally</p></li><li><p>enabled versioning on all our buckets</p></li><li><p>we even added alerts that alarm when a bucket becomes public! 
And interestingly, the way the alert was built uncovered a very powerful foundational pattern on AWS IMHO (see EventBridge above)</p></li></ul><p>Again, great set of standardizations!</p><h3>Identity and Access Management</h3><p>Security Hub flags issues in the AWS account's identity, access and credentials usage. Things such as:</p><ul><li><p>disabling root security credentials - I had resisted this for the longest time, but if the system is built using security best practices such as IAM Roles granting temporary access credentials, disabling the root security credentials is a validation that your system is secure!</p></li><li><p>adding hardware MFA to the root account and software MFA to IAM accounts with console access, enforcing strong passwords and so on. These controls do hamper mobility, but I believe that when it comes to running secure enterprise services, such controls are a must. Also, they prevent operator errors that can occur from time to time since we are only human. Recommendations to rotate keys and credentials every 90 days, audit access frequently and disable employee access when someone leaves the company push you to make sure that you have processes around these.</p></li><li><p>standardizing cross-account access via IAM Roles (we were already doing this) and using a randomized / not guessable externalId when granting temporary credentials, as an additional security layer that addresses the confused deputy problem https://aws.amazon.com/blogs/apn/securely-accessing-customer-aws-accounts-with-cross-account-iam-roles (we added this)</p></li><li><p>secret and credential storage in Secrets Manager (we were already doing this)</p></li><li><p>we created an incident response runbook following the whitepaper: https://docs.aws.amazon.com/whitepapers/latest/aws-security-incident-response-guide/runbooks.html - we normally know what to do when an incident occurs and these processes do formalize over time IMO as the issue collateral builds up. 
However, having a process defined at this time was useful in making sure we had everything in place to respond to any incidents.</p></li></ul><p>We are using IAM according to the best practices! Great validation to have!</p><h3>Data Security</h3><p>We had already classified the different types of data in <a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a> and published the classification as part of the <a href="https://www.letsdata.io/privacypolicy">privacy policy</a>, and data at rest and data in transit were already being encrypted. We audited this again as part of the FTR and found no violations. We are encrypted at rest and in transit!</p><h3>CloudTrail</h3><p>We audited which events we had enabled for CloudTrail. Additionally, we enabled log security and had the logs delivered to a different AWS account. This gives us a complete auditing solution in case we need to trace any issues!</p><h3>Backup and Restore</h3><p>The FTR asks you to do your <em><strong>Recovery Point Objectives (RPO)</strong></em> and <em><strong>Recovery Time Objectives (RTO)</strong></em> calculations to define what the service's recovery objectives are (SLAs).</p><p>Since the durability of DynamoDB and S3 has so many 9s, we had taken it for granted that data loss would probably be only an academic concern. However, an important source of data loss is functional bugs that our code or the partner code might have.</p><p>For example, if someone pushes a code bug that updates the database rows to an incorrect value, your data durability is toast. Granted, it's a bug and the fault of the code writer, but finger pointing and blame are pointless in this scenario. As a resilient service, if this happens, what will we do about it?</p><p>This logic had us enabling <strong>AWS Backup</strong> on all the S3 buckets and <strong>DynamoDB point-in-time recovery (PITR)</strong> on all the DynamoDB tables. 
We defined backup schedules and retention durations and came up with approximate RPOs and RTOs. We also did resiliency tests (simulate data loss and then restore from backups)! And because we've built this into the service, <a href="https://www.linkedin.com/feed/hashtag/letsdata">#LetsData</a> customers get this for free and can be assured that we have their data covered according to enterprise data best practices. Thanks FTR - we had completely missed this one!</p><p>So that is what we had been up to in the last week or so! We are a <a href="https://partners.amazonaws.com/partners/0018W00002FlzWEQAZ/#Let's%20Data">trusted AWS Partner</a> and have been validated for following the AWS foundational best practices for software running on AWS! You can engage with us <a href="https://www.letsdata.io/#support">either directly</a> or through <a href="https://partners.amazonaws.com/partners/0018W00002FlzWEQAZ/#Let's%20Data">our partner page</a>, knowing that AWS vouches somewhat for what we have to offer!</p><p><em>Originally posted as a <a href="https://www.linkedin.com/pulse/lets-data-now-trusted-aws-partner-letsbigdata">LinkedIn Article</a> on the #LetsData page - moved to this substack to backfill the blog</em></p>]]></content:encoded></item></channel></rss>