Building a Data Analytics Pipeline on AWS



In this episode we talk about actionable big data analytics.

Christophe: Hi, and welcome to ScaleYourCode.com. My name is Christophe Limpalair, and in this episode I'm joined by Monisha Sule, who is the Director of Big Data at Linux Academy. She is one of my co-workers, and the reason I invited her on the show is that she has been working on one of our products that we built from the ground up using big data analytics. Analytics was one of the most important parts of that product as we rolled it out, so she is going to share her experience building it from scratch and answer questions that can help you take that information and apply it to your own applications. For example: how can you build a scalable data pipeline that collects data in real time? How can you take that information and make sense of it so stakeholders can make decisions? And how can you do all of that without spending tens of thousands of dollars, or without years and years of experience? Those are some of the questions we'll be answering in this episode. Stay tuned.

Monisha, thank you so much for being on the show. I appreciate it.

Monisha: Oh, I appreciate it too. It's an honor to be on the show.

Christophe: I know you've been really busy building big data analytics into the platform for the past few months, and I also know you've been busy on the weekends; you just told me you ran a triathlon not too far from here. So I do appreciate you taking the time to do this. To get started, I'd like the audience to get to know you a bit better. Could you tell us more about your technical background and where you come from?

Monisha: Sure. I grew up in Mumbai, India, where I did my bachelor's in computer science. I knew from eighth grade onwards that this is what I wanted to do: get into computer programming and do wonderful things with it, and thankfully I did. Along the way I had the chance to work with some wonderful people and wonderful teams, and the dream continues with being able to work at Linux Academy. I actually started in enterprise software development, more of Java/J2EE. I spent about a decade with Sterling Commerce, which was later bought by IBM, and it was at IBM that I made my switch from core software engineering to big data analytics. That was also the time when data science was becoming the buzzword, and thankfully I got the opportunity to get started in this very interesting field. My first experience was actually with Cognos. Cognos is an IBM product that is mainly focused on business intelligence and data warehousing, and at the time it was also expanding into more general-purpose analytics and collaborating with Watson Analytics. That's where I started my analytics journey. I later moved to the IBM Spark Technology Center, which is where I got my hands-on experience with Apache Spark. I performed the role of a data scientist cum Apache Spark evangelist there: we would work with various clients, understand their needs, create proofs of concept using Apache Spark, and help them understand how Spark could fit into their solutions. After that I came across this very exciting opportunity, which is Linux Academy, where I initially started as a Big Data instructor.
We have launched a few courses at Linux Academy: one on Apache Spark, a Big Data Essentials course, one on machine learning, and the latest one is the AWS Big Data Specialty course. Do check those out if you haven't had a chance.

Christophe: The Essentials one I haven't gone through, but I looked through the syllabus and it seems to cover a lot of the trends of what's happening in the industry and where companies can use these different tools to get insights into their data. And that is the reality of what's going on today: there are a lot of different tools, and you have to use the right tool for the job. Even though you have extensive experience with Spark, oddly enough you didn't end up using any Spark for this project, so we'll find out why, and all that good stuff. Now, you joined Linux Academy a few months ago, I think it was over the summer of last year, correct me if I'm wrong.

Monisha: That's correct.

Christophe: You then moved on to create courses for Linux Academy, and earlier this year, I think in January, very early this year, you got a phone call that said: we've got a very exciting opportunity, we're going to build a new product from the ground up, and that product is called Cloud Assessments. Can you tell us more about that product?

Monisha: Of course, yes. Cloud Assessments is, like Christophe said, a brand new product launched by Linux Academy, and it really serves two purposes. One is for the student: say the student wants to target a particular job or a particular skill level. We provide assessments with which students can measure where they stand against that skill level, and if they are behind, we offer labs that let them learn, or training that gets them up to speed. On the other hand, there are employers who may use this as a recruitment tool for screening applicants. But the main theme behind Cloud Assessments is the concept of lean learning, which means offering the right content to the right learner at the right time in the right way. Let me explain what that means. By the right content, I mean we do not want the student to get bogged down by too much. Statistically, there is huge popularity for massive open online courses: Coursera, edX, and so on. You would be surprised to know that the completion rate for these courses is actually less than 10%, and the main reasons cited for students dropping out are that the material is either too difficult for them, or too basic, or simply that there is too much of it and they don't have the time. What we want to do in Cloud Assessments is offer just the right content, exactly what you need, no more, no less, and in the right way. And what is the right way? We don't just have quizzes or plain question-and-answer; we actually have live labs. For example, if you want to test your skills in AWS, we have a lab that lets you configure an EC2 server or an Elastic Load Balancer, and we have a system that is able to grade each and every configuration step you take. It tells you how much you got right and how much you got wrong, and it lets learners do exactly what they want. That's the whole concept of lean learning behind Cloud Assessments.

Christophe: So you have students who go in and take assessments to test their skills, and then they can take labs. Where does analytics play a part in that entire platform?
Monisha: Yes, so let me start by explaining the different types of analytics. We have something called descriptive analytics, predictive analytics, and prescriptive analytics. Descriptive analytics allows you to understand the current state of affairs. In this case, that means you can understand how your customers are using your product and whether they like it or not. For example, for an assessment we can determine what they like about it, what they don't like, what their preferences are, and what they would like to see improved. That helps you not only in improving your product but also in marketing and sales. Another kind of descriptive analytics would be to understand which group of students is more likely to use the product, based on their skills, their interests, their job title, the industry they work in, and so on. It gives you an immediate view into how your product is being used. That's descriptive analytics. The second one is predictive analytics. This is more of a machine learning or statistical use case, where you are able to uncover hidden trends, patterns, or correlations in the usage of your product. Again, the underlying outcome is either to improve your product or to retain your students. You may be able to uncover a trend where certain students are more likely to terminate their membership, and then you can take corrective actions to prevent that. That is predictive analytics. Prescriptive analytics takes it a step further: what action should you take to achieve a certain outcome? Under this category also fall recommendation engines. When I say the right content for the right learner, that's where a recommendation engine plays a big role. Based on a student's lab history, how many labs they have taken, their assessment history, or just their user profile, we're able to make recommendations drawn from the whole collection of student usage data that we have. So there are several analytics use cases that ultimately help you improve the product, help you make strategic business decisions, or help you market and sell to the right people. Ultimately the goal is to benefit your product.

Christophe: That's what I find so fascinating about this story. When you think analytics, especially big data analytics, you often think this is for established companies that already have massive amounts of data and massive numbers of customers and can derive value from it. But in this case that's not at all what's going on. What you're saying is there are smaller use cases, for example: how can I tell what is engaging to my users and what isn't? Where are customers dropping off in the user experience? When you think about startups of all sizes in all industries, e-commerce, mobile applications, whatever it is, you can apply these kinds of use cases and this simple analytics to get results that benefit you. You don't have to be a massive corporation. In fact, I mentioned that you started working on this sometime in January, maybe even early February. What was the life cycle of that project?

Monisha: All right, so the life cycle was briefly divided into four to five phases. The first phase was the infrastructure phase. In this phase we determined what technology to use, what our data collection pipelines were going to look like, how we were going to process the data, and what our data store was going to be.
You may have heard that the term data lake is synonymous with everything big data, so the question was: do we really need a data lake, or a scaled-down version of one? All of those decisions needed to first be proven to work for us, then implemented, then tested. The second phase is reporting. Once you have your infrastructure in place, once your data pipelines are in place, your data is flowing in, and your data stores are collecting it, then comes using that data. There are two to three ways of using it. One I would call straightforward reports, where you just execute straightforward queries on your data stores. For example, you might want to know how many users have signed up but are still inactive, meaning they have not taken any assessments or labs and are not really engaged with the product. That is one example of reporting. Another may be to find out who your student leaders are: who is really doing a great job staying engaged, and who has the highest scores. It's a way of motivating students and users to make better use of the product. That's the reporting side of it. Next is analytics. Analytics, as I described, has a descriptive and a predictive part. For descriptive analytics we use dashboards with graphs and charts, which let you see not only historical trends but also real time. Real time is very important in today's industry, because your organization has to be able to react to real-time events, and so our architecture has provisions for accommodating both real-time and historical descriptive analytics. Under this category are things like dashboards, and we have used AWS QuickSight for that. Along with a dashboard you also have the capability to slice and dice, to go down to a certain granular level. For example, say you want to analyze your geography: which regions are the majority of your students coming from, are there any dependencies between particular cities or countries, and so on. Your descriptive analytics solution has to be able to correlate all these different items and let you slice and dice the way you want to. The fourth stage is predictive analytics. This is where machine learning comes into the picture, and the goal of machine learning is really to find hidden patterns or correlations that are not evident to the plain eye. One important point about machine learning is that it needs data: even if you have a very clever machine learning algorithm, an algorithm trained on a much larger data set always has a better chance of predicting accurate results. So in this phase the volume and quality of your data are very important. In our case this was a difficult situation, because we were building our analytics solution at the same time that Cloud Assessments was being built. That means we did not really have real data yet. Plus there was a lot of fluctuation: the requirements were changing, data formats were up for grabs, and our data ingestion mechanisms were not finalized, not to mention the data noise that comes in when you're actively constructing a product from scratch. There is a lot of testing going on, so we had to be sure that we were differentiating, and not polluting our pipelines with noisy data.
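The interview doesn't go into how that filtering was implemented, so the following is only a hypothetical sketch of one way test traffic might be dropped at the edge of an ingestion pipeline before it reaches the stream. The event fields (`is_test_event`, `environment`) and the stream name are assumptions for illustration, not details from the interview.

```python
import json
import boto3

# Hypothetical stream name; adjust to your own pipeline.
KINESIS_STREAM = "assessment-events"

kinesis = boto3.client("kinesis")


def is_noise(event: dict) -> bool:
    """Heuristics for test/staging traffic that should not pollute analytics.

    Both fields are illustrative assumptions; a real pipeline might key off
    internal user IDs, staging hostnames, or a dedicated test flag instead.
    """
    return event.get("is_test_event", False) or event.get("environment", "production") != "production"


def publish(event: dict) -> None:
    """Send a single event to Kinesis, dropping anything that looks like noise."""
    if is_noise(event):
        return  # silently drop; could also route to a separate "noise" stream
    kinesis.put_record(
        StreamName=KINESIS_STREAM,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event.get("user_id", "anonymous"),
    )
```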
Then the fifth phase of the project is maintenance and monitoring, and then of course you repeat. You don't just do it once and become complacent about it; there are many ways to iterate. You can revisit your infrastructure decisions, you can revisit the way your pipelines are structured and your data formats, and predictive analytics is a continuous improvement process where you are always tweaking and tuning your machine learning algorithms. So repetition in cycles is very important. On the other hand there is monitoring: since you have so many different pieces and moving parts all coming together, it is very important, and also quite complex, to monitor all of them, to make sure each one is doing its own job and that they are all coming together to deliver the big picture.

Christophe: How long did it take you to go from that first phone call saying we're shifting your focus to this, to something that could go into production?

Monisha: We had a very strict timeline, and I would say it was about six to eight weeks. Those six to eight weeks included learning the technology. As I told you before, I have a strictly Apache Spark background; AWS is something I had just started picking up while at Linux Academy. But AWS was our choice, and there were multiple reasons. The first one is that AWS takes care of all the nitty-gritty for you: things like scalability, fault tolerance, and high availability you take for granted when you're using AWS, instead of having to create your own Hadoop clusters, install Spark, and then integrate it with multiple other products to build an ecosystem. AWS makes it very easy for you to use these very performant and efficient services. We're going to talk about all the services we used later on, but I want to get this message across: AWS was the solution for getting us up to speed in such a quick and efficient way.

Christophe: So let's break this down. How are you collecting all of this data that you were talking about previously?

Monisha: For data collection we stick to the big data lambda architecture. What the lambda architecture tells you is that you have three layers: first is the real-time data collection, second is the batch processing, and third is the serving layer. Real time gives you an idea of user behavior with respect to time, so what a particular user is doing at this very second, whereas batch processing gives you the whole picture of what the user has done over a period of time. Then you combine real-time with batch in the serving layer to give you the best analytics. Our architecture was based on this lambda architecture, with three layers: batch, real-time, and serving. On top of that we decided to go the serverless route. By serverless I mean that you as a developer are free to focus on the code, on the development of what you want to do, instead of worrying about your servers: how to monitor them, maintain them, patch them. You just focus on the code. This was the first time we used AWS serverless services like Kinesis Streams, Kinesis Firehose, Kinesis Analytics, Athena, and QuickSight. Of course, there were a lot of proofs of concept that we tried before we delved into them; that's a very important phase that was part of our life cycle. Instead of diving straight into a particular service, you have to first ensure that it fits your use case.
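To make the batch side of that lambda architecture concrete: Kinesis Firehose is one of the serverless services Monisha lists, and later in the interview she mentions dumping raw events straight to S3 as a backup pipeline. Here is a minimal sketch of that idea, assuming a Firehose delivery stream that is already configured with an S3 destination; the delivery stream name and event shape are illustrative assumptions, not details from the interview.

```python
import json
import boto3

firehose = boto3.client("firehose")

# Assumed name of a Firehose delivery stream configured with an S3 destination.
DELIVERY_STREAM = "raw-assessment-events-to-s3"


def archive_raw_event(event: dict) -> None:
    """Push one raw event onto the Firehose delivery stream.

    Firehose buffers records and writes them to S3 in batches, which gives you
    the immutable "master data set" the batch layer queries later.
    """
    firehose.put_record(
        DeliveryStreamName=DELIVERY_STREAM,
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )


archive_raw_event({"event_type": "assessment_started", "user_id": "u-123"})
```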
Christophe: You mentioned that there's a batch layer, which is for historical purposes. When does it make sense to use that versus the other one, the stream or speed layer, which collects real-time events? How do you know which one makes sense?

Monisha: For batch processing the ultimate goal is to compute historical trends. For example, you might want to understand which months are busier for your product, which geographical region has the highest activity, how a particular student is faring on a particular assessment or lab, or even how crucial a lab is to a student's success. To analyze all of that, what you need is a master data set, and that's what your batch layer does for you: it collects and preserves your entire data set over time, which you can then query to produce these different results. On the real-time side, you want to know what your product usage looks like right now, at this very moment. How many students are taking your tests at this time? How many of your labs are being used right now? What is the time lag between an assessment being ready and the assessment being graded? If you want a real-time picture of how your product is used, you capture those events as and when they happen. If you want to determine historical trends or get a more complete picture of how the product is being used, you consider batch processing. And ideally you combine the two, real-time with batch, to get the big picture.

Christophe: How do you even collect real-time data, though? How do you grab that information and put it somewhere else? How does that work?

Monisha: Let's go through some of the services we used, for example Kinesis and DynamoDB. Kinesis has a provision called streams, and there are multiple ways you can read from and write to Kinesis streams. Kinesis streams can be written to by external applications, and that's what our Cloud Assessments website and our Cloud Assessments engine do: they use the AWS SDK API to write events to Kinesis streams. This is how we capture real-time data. When a student is taking an assessment or a lab, the different events, for example the start of an assessment, the completion of a lab, the completion of a test, or the grading of a test, are all sent as real-time events to the Kinesis stream. Similarly with DynamoDB: when an assessment is completely graded and finished, that's when the entire picture of the student's performance on that assessment session is stored in DynamoDB. Kinesis streams as well as DynamoDB streams can then be further processed by things like Lambda functions, or by your own custom logic; you can send the data to S3 for larger collection, and it can also be processed using Athena tables.
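As a concrete illustration of that producer side, here is a minimal sketch of writing an assessment event to a Kinesis stream with the AWS SDK for Python (boto3). The stream name, event fields, and choice of partition key are assumptions for the example; the interview only tells us that the website and grading engine write events such as the start of an assessment or the completion of a lab via the SDK.

```python
import json
from datetime import datetime, timezone

import boto3

kinesis = boto3.client("kinesis")

# Assumed stream name for illustration.
STREAM_NAME = "cloud-assessments-events"


def send_event(user_id: str, event_type: str, payload: dict) -> None:
    """Write a single real-time event to the Kinesis stream.

    Partitioning by user_id keeps one student's events on the same shard,
    which preserves per-user ordering; that choice is an assumption here.
    """
    record = {
        "user_id": user_id,
        "event_type": event_type,          # e.g. "assessment_started", "lab_completed"
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=user_id,
    )


send_event("u-123", "assessment_started", {"assessment_id": "aws-ec2-basics"})
```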
Christophe: If you're not familiar with these services, be sure to check them out; they are all AWS services, and I'll have a description of them below the video so you can look them up. But Monisha, you already mentioned a few different services, like DynamoDB, S3, Lambda, and Kinesis. Can you list all of the various services you're using in this architecture?

Monisha: Sure. I would first highlight the fact that with serverless you only pay for what you use. If you compare serverless to something like an EC2 server, with EC2 you pay by the hour, whereas with serverless you pay according to your usage. For example, in a Kinesis stream the unit of scalability is called a shard, and a shard gives you up to two megabytes per second for reading and one megabyte per second for writing. You are charged based on the data going through the Kinesis stream, rather than a constant charge just for running the service. So cost is one factor that puts serverless on top. The other thing is flexibility: you do not have to worry about setting up clusters or adjusting them to scale your application. Things like Lambda automatically scale with the amount of data being processed. To go back to your question, Christophe, the serverless services we used consist of S3, DynamoDB, Kinesis Streams, Kinesis Firehose, Kinesis Analytics, Athena, QuickSight, SNS, and, most important of them all, Lambda.

Christophe: Is there anything that you tried doing that didn't work with the serverless architecture?

Monisha: There were many situations where we had to rethink our strategies. For example, consider Athena. Athena is actually a very new service. It's an interactive query service, serverless of course, and it lets you read data that is in S3. Imagine you have a certain set of data stored in an S3 bucket. In Athena you define a schema over that S3 bucket, which lets you create a table, which is of course read-only, and then using that table you can execute queries. Now, the question we had in our minds was this: Redshift and RDS are managed AWS database services which have been proven to handle complex queries, like joins or queries using time windows, and we were not sure if Athena was going to be able to support that. We were also not sure whether we could execute Athena queries from Lambda functions. Ultimately we did find a way to do that, and the route was JDBC: you can query Athena using a JDBC adapter, which restricts you to Java for now, but if you want to use Python there are third-party libraries that convert your Python calls to JDBC calls. The point is that yes, there were situations where we were contemplating those alternative managed services, but I think AWS has a good idea of how their services are meant to be used and makes provisions so that no use case is blocked. Ultimately we did stick with Athena, using JDBC in our Lambda functions.
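The route described above is a JDBC adapter called from Java (or from Python libraries that wrap JDBC). As a sketch of the same idea, querying Athena from code running in a Lambda function, the example below instead uses the Athena client in boto3, which is a different route than the one the team used; database, table, and output-location names are assumptions for illustration.

```python
import time

import boto3

athena = boto3.client("athena")

# All names below are assumptions for illustration.
DATABASE = "cloud_assessments"
OUTPUT_LOCATION = "s3://example-athena-results/"   # Athena writes result files here
QUERY = """
SELECT user_id, COUNT(*) AS assessments_taken
FROM assessment_history
GROUP BY user_id
ORDER BY assessments_taken DESC
LIMIT 10
"""


def run_query() -> list:
    """Start an Athena query, poll until it finishes, and return the result rows."""
    execution = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
    )
    query_id = execution["QueryExecutionId"]

    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```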
Christophe: That's a question I see quite often, especially since we have serverless content at Linux Academy: this sounds really good, but what's the catch? What can't we do using these serverless services? You hit some of those on the head there, and I know you can also lose a little bit of flexibility, because these services are managed for you. You don't have to manage them, which means you often don't have to configure them as much; you still configure them to some degree, but you won't be able to go into the server and change settings at that level. So sometimes you lose a little bit of configuration there. What do you think about that?

Monisha: Yes, definitely. In fact, another example is the timeout on a Lambda function. Lambda functions are typically meant to be used in a microservices architecture, and the idea behind a microservices architecture is to build your application as many independent services rather than one large monolithic codebase. In the spirit of these many little independent services, your code is not supposed to be a long-running job, but instead something that executes and completes in a short amount of time. The timeout on a Lambda function is currently set to five minutes, and you cannot exceed those five minutes. That puts a restriction on you: what if you have a very complex process that takes longer than five minutes? What do you do? The alternative to such a restriction is putting your Lambda code in its own Docker container and running that container on EC2, which lets you create your own Lambda-like service. But it's not the same as the managed Lambda service; at that point what you lose is the monitoring and the ability to integrate with things like API Gateway. So there are obvious pros and cons, like you said, of losing flexibility versus gaining integration points.

Christophe: You mentioned a few different services; you talked about how you're using Kinesis, Lambda, all those different services. Can we break it down a step further? I know you're using Kinesis Streams and Firehose with Lambda, but how does that work? If someone has never used Kinesis or Lambda before, how do those two services work together, and why did you use them?

Monisha: Sure. The purpose of Kinesis, like I said before, is to capture real-time data. A Kinesis stream has producers and consumers: producers are the applications that write to your Kinesis stream. Kinesis retains your data for 24 hours by default, which can be increased to seven days, and a Kinesis stream can be consumed by multiple applications downstream. For example, you are getting your live data into a Kinesis stream, and the next step is to process that data. Say you have a bunch of Lambda functions that perform some kind of logic on your data and then finally persist it to S3. Now, what if you have multiple Lambda functions reading from the same Kinesis stream? In that case there is a pattern called fan-out: you connect one Kinesis stream so that it writes to multiple other Kinesis streams, and in that way you increase your read throughput. So there's a lot of flexibility: given one single source of data, you can process it in multiple ways. That was the flexibility we were looking for, real-time data plus the ability to process it in multiple ways, and the combination of Kinesis and Lambda was perfect for that.
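Here is a minimal sketch of that fan-out pattern, assuming Lambda sits between the source stream and the downstream streams: a Lambda function subscribed to the source stream re-publishes every record onto several target streams, each of which can then be read by its own consumers. The stream names are assumptions; the interview describes the pattern but not the exact wiring.

```python
import base64

import boto3

kinesis = boto3.client("kinesis")

# Assumed downstream streams, one per processing concern.
TARGET_STREAMS = ["events-for-reporting", "events-for-leaderboard", "events-raw-archive"]


def handler(event, context):
    """Lambda handler triggered by the source Kinesis stream.

    Each incoming record arrives base64-encoded in the Kinesis event source;
    we decode it once and fan it out to every target stream.
    """
    for record in event["Records"]:
        data = base64.b64decode(record["kinesis"]["data"])
        partition_key = record["kinesis"]["partitionKey"]

        for stream in TARGET_STREAMS:
            kinesis.put_record(
                StreamName=stream,
                Data=data,
                PartitionKey=partition_key,
            )
```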
Christophe: Then the next service is DynamoDB. We now understand how data moves between Kinesis and Lambda, but how does DynamoDB play a role in that?

Monisha: The advantage of DynamoDB is that it is schema-less, and that was exactly what we wanted while our product was being built. Let's say, for example, we want to store a user profile in DynamoDB. A user profile has to be schema-less, since there are multiple user attributes that are not present for every user, and you may also want to add different attributes at different times. Take social media links, for example: you may not have them today, but as your product matures you may want to add that information. That's the flexibility DynamoDB gives us: to be schema-less and to add or remove data as we grow. Plus, DynamoDB also has support for streams. Similar to Kinesis streams, you can connect your DynamoDB streams to Lambda functions, so whenever an update, addition, or deletion is made to a DynamoDB item, a Lambda function triggers and processes that change. Again, that fits into the real-time processing story we have talked about.

Christophe: And DynamoDB is a NoSQL database. Compared to something like MongoDB or other NoSQL databases, it's serverless in the sense that you don't have to manage the servers behind it. If you want to spin up a MongoDB cluster, you have to manage that cluster yourself, whereas in AWS, if you use DynamoDB, you just provision a certain amount of capacity in terms of read and write throughput and Amazon promises to handle it for you. So you don't have to manage all of that, just in case people don't know what DynamoDB is.
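To illustrate that schema-less flexibility, here is a small sketch of writing two user profiles to a DynamoDB table, where the second item carries an extra social-media attribute that the first one simply doesn't have. The table name and attribute names are assumptions for the example.

```python
import boto3

dynamodb = boto3.resource("dynamodb")

# Assumed table with "user_id" as the partition key.
profiles = dynamodb.Table("user_profiles")

# A minimal profile: only the attributes we know about today.
profiles.put_item(Item={
    "user_id": "u-123",
    "name": "Example Student",
    "skills": ["aws", "linux"],
})

# A richer profile added later: extra attributes need no schema migration.
profiles.put_item(Item={
    "user_id": "u-456",
    "name": "Another Student",
    "skills": ["python"],
    "job_title": "Systems Engineer",
    "social_media": {"twitter": "@example"},   # attribute that older items lack
})
```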
Christophe: All right, so now you've got DynamoDB, Kinesis, and Lambda. We also have a service called Amazon S3, which stands for Simple Storage Service, where you can store all kinds of different objects. How does that differ from DynamoDB, and why did you use both of those services?

Monisha: S3 has a great integration point with things like Athena, and Athena is our query service that ultimately feeds into our dashboards. S3 is also touted to be very powerful, powerful enough to replace the HDFS that is used so heavily in big data. HDFS is the core component of Hadoop, the Hadoop Distributed File System, and its goal is to store an unlimited number of objects and volume of data, to provide high bandwidth and throughput, and to provide native support for versioning. All of these things are provided by S3, and on top of that you do not need to pay extra for data replication; S3 takes care of that under the covers. S3 is also supported by things like Spark, Hive, and Presto. Even though we do not use Spark at this point, there will definitely be a need as we evolve our analytics product and add more sophisticated intelligence to it; we will be using Spark for machine learning and more advanced algorithms, and S3 is the perfect solution to act as a single source of data. S3 can act as storage for unprocessed data, and it can also be used to store processed data. What that means is that when our real-time events come in, we dump them to S3 as one straightforward pipeline, to make sure that if anything goes wrong further down the pipeline, where we are processing the data, we still have a backup of the original form of the data. S3 is also used to store processed data. In fact, we have a decoupled architecture where each unit of data processing and storage is treated as one unit, there are several of these, and the store in each of them is S3. Once you have these various forms of data stored in S3, multiple use cases can plug into those different buckets to ultimately provide things like reporting and analytics.

Christophe: You've mentioned the term data lake a couple of times so far in this episode, and I think that ties back to Amazon S3 here. Could you describe what a data lake is?

Monisha: Sure. Data lake is a term used a lot in big data, and what it represents is just a large collection of data in any format. There is no demand that the data be structured in a particular way; in fact, data lakes are schema-free, and that's in direct contrast to a relational database or a data warehouse, where data has to be structured as defined by the schema. A data lake really is just this big, massive store where you dump your data, and the reasoning behind it is that when you read from the data lake, that's when you apply your schema. For particular use cases, say we have a machine learning use case that wants to look at the correlation between a student's skill set and the student's assessment scores, at that time you apply a schema on the data lake and read only that particular information: only the assessment history and the assessment scores for the student, ignoring the rest of the data. That's how a data lake is meant to be used: it has all of the data, but you do not need to apply a schema when you write to the data lake, you apply your schema when you read from it.

Christophe: So how do you pull data from it if it doesn't have a schema? You're just dumping the data in there; how do you even know what data to pull out of it, or what to do with that data?

Monisha: We have these pipelines, consisting of Lambda functions, which pull out the data and process it into more schema-oriented data. For example, let's say we are computing a student leaderboard. To compute a student leaderboard, we want to look at all the assessments, all the scores, and the time taken by each student on each assessment. In that case we have a Lambda function that reads data from the data lake. The data lake is separated out into different S3 buckets: we have an S3 bucket for assessment history, another for lab history, another for user profiles, and this Lambda function reads from these different buckets, joins the data together, and produces exactly the top scorers based on score as well as the time taken to complete the assessment. So it's a pipeline consisting of Lambda, which again writes to S3. Really, as I said before, S3 is our store for both processed and unprocessed data.

Christophe: I know that with Amazon S3 you can store objects and they can grow a lot; they call it infinitely scaling. Whether it can actually infinitely scale or not is a question for debate, but there are other features of S3 that are very attractive for this type of data, or data in general: you can enable versioning, that's called versioning for Amazon S3, you can have data replication, it has a lot of bandwidth and throughput, and it also has something called lifecycle transitions. Can you describe whether you're using some of those various features beyond just storing objects?

Monisha: S3 is most useful when your data is being frequently accessed, and of course, like I said, there is no need to pay extra for data replication, but you can also make use of things like Glacier for archiving cold data. Since our product is relatively new right now, we do not necessarily need to archive cold data yet, but we envision a point in the future, say after a year or so, where we would archive this unprocessed raw data into Glacier while keeping it available for further needs if necessary.
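That kind of archiving is usually set up as an S3 lifecycle rule rather than a manual copy. Since the team only envisioned this for the future, the following is a hypothetical sketch of what such a rule might look like with boto3: objects under a raw-events prefix transition to Glacier after a year. The bucket and prefix names are assumptions.

```python
import boto3

s3 = boto3.client("s3")

# Assumed bucket and prefix for the raw, unprocessed event dump.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-raw-events",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-events-after-one-year",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [
                    {"Days": 365, "StorageClass": "GLACIER"},  # move cold data to Glacier
                ],
            }
        ]
    },
)
```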
Christophe: You've also mentioned Lambda a few times now. Whether people are familiar with Lambda or not, can you describe a little more how it works? I know there are functions, and you can feed Lambda functions and they execute, but how does that work?

Monisha: Sure. Lambda is really the heart of our serverless architecture, and Lambda functions also represent microservices. What that means is that you divide your functionality into separate, independent pieces of code, and the advantage is that they can be written, tested, and deployed independently. When you configure Lambda, there are a variety of supported languages: you have support for Java, for Python, and for Node.js, and I believe Ruby as well. Whatever unit of code you have, you upload that code into your Lambda function, and then you set up triggers. Say you want to trigger, or invoke, this Lambda function when a Kinesis event happens, a DynamoDB event happens, or even an S3 event. In the case of DynamoDB, let's say a user profile is updated with, maybe, a social media link; at that point a Lambda function can go in, read that social media link, and fetch some additional details. That's a great example of how your Lambda unit of code can respond to an event. Once it responds to an event, it does its unit of processing, and it can then emit the results out to various destinations: it can write to Kinesis streams, to Firehose, to Redshift, to S3, to RDS; it's a pretty large variety that you can output your results to. The important things to remember about Lambda functions are that they are event-driven and that they are stateless. You do not maintain any state; whatever storage you have is outside the Lambda, and in this case your storage can be S3, DynamoDB, or Kinesis. All you do is take the data from the storage, do your processing, and then output it to a different form of storage. Those are your Lambda functions.
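To make that event-driven, stateless model concrete, here is a minimal sketch of a Lambda handler triggered by a DynamoDB stream, in the spirit of the social-media-link example above: when a profile is inserted or modified, the handler reads the new image of the item and writes a small derived record to a second table. The table names, the enrichment logic, and the `fetch_profile_details` helper are all hypothetical.

```python
import boto3

dynamodb = boto3.resource("dynamodb")

# Assumed table that stores derived/enriched profile data.
enriched_profiles = dynamodb.Table("user_profiles_enriched")


def fetch_profile_details(twitter_handle: str) -> dict:
    """Hypothetical enrichment step; a real implementation might call an external API."""
    return {"twitter_handle": twitter_handle, "enriched": True}


def handler(event, context):
    """Triggered by a DynamoDB stream (NEW_IMAGE view) on the user profiles table."""
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue  # ignore deletions in this sketch

        # DynamoDB stream records use DynamoDB JSON: {"S": ...}, {"M": {...}}, etc.
        new_image = record["dynamodb"].get("NewImage", {})
        social = new_image.get("social_media", {}).get("M", {})
        twitter = social.get("twitter", {}).get("S")
        if not twitter:
            continue  # nothing to enrich

        details = fetch_profile_details(twitter)
        enriched_profiles.put_item(Item={
            "user_id": new_image["user_id"]["S"],
            **details,
        })
```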
Christophe: There's one other service that we haven't really touched on much yet, but you've already mentioned it a few times, and I know it's being used: Athena. What is Amazon Athena?

Monisha: Great question. Athena, like I said, is a relatively new query service, and it is meant to be an interactive service for quickly querying patterns in your data. Athena is backed by Presto; Presto is a distributed SQL processing engine, and it allows you to query vast amounts of data in a relatively short amount of time. So Athena is a very powerful serverless service that lets you query vast amounts of data. The way we are using Athena is, for one, to apply a schema on our S3 data: whatever data we have in our S3 buckets, Athena reads from those buckets and is able to create tables which can then be queried. Furthermore, these Athena tables can be plugged into QuickSight. QuickSight is AWS's visualization tool; you typically create dashboards and graphs with it. So Athena is a very important component in our descriptive analytics solution, where it acts as the query service; it has the advantage of yielding results very fast because of the Presto support, and it has the advantage of integrating with QuickSight, which lets you create dashboards very quickly.

Christophe: Does Athena actually have the visualization aspect to it, or do you have to use something else to pull those queries out and actually visualize them?

Monisha: QuickSight is what integrates with Athena: under the covers in QuickSight, whenever you create a dashboard, an Athena query gets executed and fetches the data, which is then displayed by QuickSight.

Christophe: Monisha, we're starting to run out of time here. I know there's a lot more we could talk about, but is there anything specific that you think we might have missed, or that you should mention as it relates to this?

Monisha: Along the way there were a few lessons that we learned, and one of them is: don't let perfect be the enemy of good. What that means is that you can start small. Don't start with a very grand vision that you try to tackle all at once, because that can be overwhelming and challenging; it's better to have small successes and then build toward the grander vision. So following an agile methodology, creating smaller goals, and creating proofs of concept to first ensure that you will be able to reach those goals with the technology you have chosen, that is important. Then, especially in a big data project, don't let the lack of data stall you. Don't worry that because you do not have data today your pipelines are not going to work; have your pipelines ready, so that when your data starts rolling in you are in a position to start making use of it as it arrives. The other thing is that it's better to have a diverse set of skills on your team, especially in a microservices architecture, which gives you the independence to develop and deploy independently. You may have developers who are experts in Python, some in Java, some in R, and each of these technologies has its own strengths and weaknesses; a diverse team will help you make use of all of them. Another thing, especially for a big data project, is to have leadership that is data literate. It's very important that they understand how data can be used to improve the product or to make strategic decisions, and in that respect we have been very lucky to have Anthony, who has the vision of how we can use data. And last but not least, even after you have created your product, there is evangelizing. If you have a product but nobody uses it, it's meaningless, and in this case you have to go the extra step of training your team to use the tool. Say you have dashboards created; you have to train your team on how to make the best use of those dashboards. Your evangelizing efforts need to be in place, where you are conveying to your company or your group how to make the best use of the product you created.

Christophe: Monisha, thank you so much. I appreciate your time, your experience, and your sharing of this information. If people have any questions or are interested in learning more, how do you recommend they do that?

Monisha: Sure. We at Linux Academy have a big data team that can answer any of your big data questions; we have a diverse set of experience on AWS and beyond. So feel free to reach out to us on the Linux Academy community, and also through Scale Your Code.

Christophe: Thank you so much for tuning in, everyone. If you have any follow-up questions, or if you just want to leave a comment to thank Monisha for her time, please leave a comment below this video. Otherwise, thanks so much for tuning in, and Monisha, thank you so much.

Monisha: Thank you, Christophe.
