Big Data Analytics Using Python | Python Big Data Tutorial | Python And Big Data | Simplilearn



Through this lesson you will get to know what data science is and the skills you need as a data scientist. This will help you discuss your roles and responsibilities as a data scientist clearly and look at various applications of data science. You will also learn how data science works with big data to extract useful information, explore data science as a discipline, and understand how it's shaping the world. The lesson will throw light on the importance of data science and help you learn and understand Python, a popular programming language used by data scientists, the problems it resolves, and how it's an effective and user-friendly data science tool.

What is data science? Let's start with some of the common definitions doing the rounds. Some say that data science is a powerful new approach for making discoveries from data. Others term it an automated way to analyze enormous amounts of data and extract information from it. Still others refer to it as a new discipline which combines aspects of statistics, mathematics, programming, and visualization to gain insights. Now that you have looked at some of its definitions, let's learn more about data science. When domain expertise and scientific methods are combined with technology, we get data science, which enables one to find solutions for existing problems.

Let's look at each of the components of data science separately. The first component is domain expertise and scientific methods. Data scientists should also be domain experts, as they need to have a passion for data and discover the right patterns in it. Traditionally, domain experts like scientists and statisticians collected and analyzed data in a laboratory setup or a controlled environment. The data was subjected to relevant laws or mathematical and statistical models to analyze the data set and derive relevant information from it. For instance, they used the models to calculate the mean, median, mode, standard deviation, and so on of a data set. This helped them test their hypotheses or create new ones. We will shortly see how data science and technology have made this process faster and more efficient, but before we do that, let's understand the different types of data analysis, an important aspect of data science. Data analysis can be descriptive, where one studies a data set to explain what happened; predictive, where one creates a model based on existing information to predict outcomes and behavior; or prescriptive, where one suggests the action to be taken in a given situation using the collected information.

We now have access to tools and techniques that process data and extract the information we need. For instance, there are data processing tools for data wrangling, and we have new, flexible programming languages that are more efficient and easier to use. With the creation of operating systems that support multiple platforms, it's now easier to integrate systems and process big data, and application designs and extensive software libraries help develop more robust, scalable, data-driven applications. Data scientists use these technologies to build data models and run them in an automated fashion to predict outcomes efficiently. This is called machine learning, which helps provide insights into the underlying data. They can also use data science technology to manipulate data, extract information, summarize it, and use it to build tools, applications, and services. But technological skills and domain expertise alone, without the right mathematical and statistical knowledge, might lead data scientists to find incorrect patterns and convey the wrong information.
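As a quick, hypothetical illustration of the descriptive statistics mentioned earlier in this lesson (mean, median, mode, and standard deviation), here is how they can be computed with Python's standard library; the measurements list is made up for the example.

```python
# Hypothetical data set used only to illustrate the descriptive
# statistics mentioned in the lesson.
import statistics

measurements = [4, 8, 15, 16, 23, 42, 8]

print(statistics.mean(measurements))    # arithmetic mean
print(statistics.median(measurements))  # middle value
print(statistics.mode(measurements))    # most frequent value (8)
print(statistics.stdev(measurements))   # sample standard deviation
```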
Now that you have learned what data science is, it will be easier to understand what a data scientist does. Data scientists start with a question or a business problem. Then they use data acquisition to collect data sets from the real world. The process of data wrangling is implemented with data tools and modern technologies, and it includes data cleansing, data manipulation, data discovery, and data pattern identification. The next step is to create and train models for machine learning, so they design mathematical or statistical models. After designing a data model, it's represented using data visualization techniques. The next task is to prepare a data report, and after the report is prepared, they finally create data products and services.

Let us now look at the various skills a data scientist should have. Data scientists should ask the right questions, for which they need domain expertise, the curiosity to learn and create concepts, and the ability to communicate questions effectively to domain experts. They should think analytically to understand the hidden patterns in a data structure, and they should wrangle the data by removing redundant and irrelevant data collected from various sources. Statistical thinking and the ability to apply mathematical methods are important traits for a data scientist. Data should be visualized with graphics and proper storytelling to summarize and communicate the analytical results to the audience. To get these skills, they should follow a distinct road map: it's important that they adopt the required tools and techniques like Python and its libraries, build projects using real-world data sets such as data.gov, NYC Open Data, Gapminder, and so on, and also build data-driven applications for digital services and data products.

Data scientists work with different types of data sets for various purposes. Now that big data is generated every second through different media, the role of data science has become more important, so you need to know what big data is and how you are connected to it to figure out a way to make it work for you. Every time you record your heartbeat through your phone's biometric sensors, post or tweet on a social network, create a blog or website, switch on your phone's GPS, upload or view an image, video, or audio clip, in fact every time you log into the Internet, you are generating data about yourself, your preferences, and your lifestyle. Big data is a collection of these and a lot more data that the world is constantly creating. In this age of the Internet of Things, or IoT, big data is a reality and a need.

Big data is usually described by three Vs: volume, velocity, and variety. Volume refers to the enormous amount of data generated from various sources. Big data is also characterized by velocity: huge amounts of data flow at tremendous speed from different devices, sensors, and applications, and dealing with it requires efficient and timely data processing. Variety is the third V of big data, because big data can be categorized into different formats: structured, semi-structured, and unstructured. Structured data usually refers to RDBMS data, which can be stored and retrieved easily through SQL. Semi-structured data is usually in the form of files like XML and JSON documents and NoSQL databases. Text files, images, videos, and other multimedia content are examples of unstructured data. In short, big data is a very large store of information, usually kept on distributed systems or machines popularly referred to as Hadoop clusters.
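To make the three formats a little more concrete, here is a small, hypothetical sketch of semi-structured data: a JSON record that carries its own field names, unlike a fixed RDBMS table (structured) or a raw image file (unstructured). The record itself is invented for the example.

```python
# A made-up JSON record as an example of semi-structured data:
# the schema travels with the data instead of being fixed up front.
import json

record = '{"user": "alice", "steps": 10432, "device": "fitband-2"}'
parsed = json.loads(record)

print(parsed["steps"])   # 10432
print(list(parsed))      # field names: ['user', 'steps', 'device']
```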
But to be able to use this data, we have to find a way to extract the right information and data patterns from it. That's where data science comes in: data science helps to build information-driven enterprises. Let's go on to see the applications of data science in different sectors. Platforms such as Google, Yahoo, Facebook, and so on collect a lot of data every day, which is why they have some of the most advanced data centers spread across the world. Having data centers all over the world, and not just in the US, helps these companies serve their international customers better and faster without network latency. It also helps them deal effectively with the enormous amount of data. So what do all these different sectors do with all this big data? Their teams of data scientists analyze all the raw data with the help of modern algorithms and data models to turn it into information. They then use this information to build digital services, data products, and information-driven applications.

Now let's see how these products and services work. We'll first look at LinkedIn. Let's suppose that you are a data scientist based in New York City, so it's quite likely that you would want to join a group or build connections with people related to data science in New York City. What LinkedIn does with the help of data science is look at your profile, your posts and likes, the city you are from, the people you are connected to, and the groups you belong to. It then matches all that information with its own database to provide you with information that is most relevant to you. This information could be in the form of news updates that you might be interested in, industry connections or professional groups that you might want to get in touch with, or even job postings related to your field and designation. These are all examples of data services.

Let's now look at something that we use every day: Google's search engine. Google's search engine has a unique search algorithm which allows machine learning models to provide relevant search recommendations even as the user types in his or her query. This feature is called autocomplete, and it is an excellent example of how powerful machine learning can be. There are several factors that influence this feature. The first one is query volume: Google's algorithms identify unique and verifiable users that search for any particular keyword on the web, and based on that, they build a query volume. For instance, "Republican debate 2016," "Ebola threat," "CDC" or the Centers for Disease Control, and so on are some of the most common user queries. Another important factor is geographical location. The algorithms tag a query with the locations from where it is generated, which makes the query volume location specific. It's a very important feature because it allows Google to provide relevant search recommendations to its users based on their location. And then, of course, the algorithms consider the actual keywords and phrases that the user types in: they take up those words and crawl the web looking for similar instances. The algorithms also try to filter or scrub out inappropriate content; for instance, sexual, violent, or terrorism-related content, hate speech, and legal cases are scrubbed out from the search recommendations.

But how does data science help you? Today, even the healthcare industry is beginning to tap into the various applications of data science. To understand this, let's look at wearable devices.
These devices have biometric sensors and a built-in processor to gather data from your body when you are wearing them. They transmit this data to a big data analytics platform via the IoT gateway. Ideally, the platform collects hundreds of thousands of data points, and the collected data is ingested into the system for further processing. The big data analytics platform applies data models created by data scientists and extracts the information that is relevant to you. It sends the information to an engagement dashboard where you can see how many steps you walked, what your heart rate was over a period of time, how good your sleep was, how many calories you burned, and so on. Knowing such details would help you set personal goals for a healthy lifestyle and reduce overall healthcare and insurance costs. It would also help your doctor record your vitals and diagnose any issue.

The finance sector can also use data science to function more efficiently. Suppose a person applies for a loan. The loan manager submits the application to the enterprise infrastructure for processing. The analytics platform applies data models and algorithms and creates an engagement dashboard for the loan manager. The dashboard would show the applicant's credit reports, credit history, the amount if approved, and the risks associated with him or her. The loan manager can now easily take a look at all the relevant information and decide whether the loan can be approved or not.

Governments across different countries are gradually sharing large data sets from various domains with the public. This kind of transparency makes a government seem more trustworthy. It provides the country with data that can be used to prepare for different types of issues like climate change and disease control, and it also encourages people to create their own digital products and services. The US government hosts and maintains data.gov, a website that offers information about the federal government and provides access to over 195,000 data sets across different sectors. The US government has also kicked off a number of strategic initiatives in the field of data science, including the U.S. Digital Service and open data.
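As a hedged sketch of how such open data can be pulled programmatically, the example below queries the data.gov catalog, assuming it exposes the standard CKAN search endpoint; the query term, row limit, and field handling are illustrative only.

```python
# Hypothetical query against the data.gov catalog API
# (assumed CKAN endpoint: /api/3/action/package_search).
import requests

resp = requests.get(
    "https://catalog.data.gov/api/3/action/package_search",
    params={"q": "climate", "rows": 5},
    timeout=30,
)
resp.raise_for_status()

result = resp.json()["result"]
print(result["count"])            # total matching data sets
for dataset in result["results"]:
    print(dataset["title"])       # titles of the first few hits
```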
We have seen how data science can be applied across different sectors. Let's now take a look at the various challenges that a data scientist faces in the real world while dealing with data sets. Data quality: the quality of data is mostly not up to the set standards, and you will usually come across data that is inconsistent, inaccurate, incomplete, not in the desirable format, and full of anomalies. Integration: data integration with several enterprise applications and systems is a complex and painstaking task. Unified platform: data is distributed to the Hadoop Distributed File System, or HDFS, from various sources to ingest, process, analyze, and visualize huge data sets, and the size of these Hadoop clusters can vary from ten nodes to thousands of nodes. The challenge is to perform analytics on these large data sets efficiently and effectively.

This is where Python comes into play. With its powerful set of libraries, functions, modules, packages, and extensions, Python can efficiently tackle each stage of data analytics. For data acquisition, Python libraries such as Scrapy come in handy. For data wrangling, pandas DataFrames are very efficient in handling large data sets and make data wrangling easier with their powerful functions. For exploration, matplotlib is very rich when it comes to examining data. For modeling, scikit-learn provides statistical and mathematical functions that help build models for machine learning. For visualization, modern libraries such as Bokeh create very intuitive and interactive visualizations. Python's huge set of libraries and functions makes big data analytics seem easy and hence solves a bigger problem. Python applications and programs are portable, which helps them scale out on any big data platform. Python is an open-source programming language that lets you work quickly and integrate systems more effectively.

Now that we have talked about how the Python libraries help the different stages of data analytics, let's take a closer look at these libraries and how they support different aspects of data science. NumPy, or Numerical Python, is the fundamental package for scientific computing. SciPy is the core of the scientific computing libraries and provides many user-friendly and efficiently designed numerical routines. Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Scikit-learn is built on NumPy, SciPy, and matplotlib for data mining and data analysis. Pandas is a library providing high-performance, easy-to-use data structures and data analysis tools for Python. All these libraries, modules, and packages are open source, and hence using them is convenient and easy.

Several factors position Python well and make it the tool for data science. Python is easy to learn; it's a general-purpose, functional, and object-oriented programming language. As Python is an open-source programming language, it is readily available and easy to install and get started with, and it has a large open-source community for software development and support. Python and its tools enjoy multi-platform support, and applications developed with Python integrate easily with other enterprise systems and applications. There are a lot of tools, platforms, and products in the market from different vendors, and they offer great support and services. Python and its libraries create a unique combination for data science, and because of all these benefits it's very popular among academicians, mathematicians, statisticians, and technologists.
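Here is a minimal, hypothetical sketch that strings a few of these libraries together across the stages just described; the CSV file name and its columns are invented for illustration.

```python
# Illustrative end-to-end sketch: wrangling with pandas, exploration with
# matplotlib, and a simple model with scikit-learn. "activity.csv" and its
# columns ("steps", "calories") are assumptions made for the example.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

df = pd.read_csv("activity.csv")          # data acquisition
df = df.dropna()                          # data wrangling: drop incomplete rows

df.plot.scatter(x="steps", y="calories")  # data exploration
plt.show()

model = LinearRegression()                # modeling
model.fit(df[["steps"]], df["calories"])
print(model.coef_, model.intercept_)      # inspect the fitted relationship
```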
Python is also supported by well-established big data platforms and processing frameworks that help it analyze data in a simple and efficient way. Among enterprise big data platforms, Cloudera is the pioneer in providing an enterprise-ready Hadoop big data platform and supports Python; Hortonworks is another Hadoop big data platform provider and supports Python; and MapR is also committed to Python and provides a Hadoop big data platform. Among big data processing frameworks, MapReduce, Spark, and Flink provide very robust and unique data processing frameworks and support Python; Java, Scala, and Python are the languages used for big data processing frameworks.

But to access big data, you have to use a big data platform, which is a combination of the Hadoop infrastructure, also known as the Hadoop Distributed File System or HDFS, and an analytics platform. Hadoop is a framework that allows data to be distributed across clusters of computers for faster, cheaper, and more efficient computing; it's completely developed and coded in Java. One of the most popular analytics platforms is Spark. It easily integrates with HDFS, it can also be implemented as a standalone analytics platform and integrated with multiple data sources, and it helps data scientists perform their work more efficiently. Spark is built using Scala. Since there is a disparity between the programming language that data scientists use and that of the big data platform, data access and flow are impeded. As Python is a data scientist's first language of choice, both Hadoop and Spark provide Python APIs that allow easy access to the big data platform. Consequently, a data scientist need not learn Java, Scala, or any other platform-specific language and can instead focus on performing data analytics.

There are several motivations for Python big data solutions. Big data is a continuously evolving field which involves adding new data processing frameworks that can be developed using any programming language. Moreover, new innovation and research are driving the growth of big data solutions and platform providers. It would be difficult for data scientists to focus on analytics if they had to constantly upgrade themselves on the under-the-hood architecture or implementation of the platform. Therefore, it's important to keep the entire data science platform language agnostic to simplify a data scientist's job. Consequently, almost all major vendors, solution providers, and data processing framework developers are providing Python APIs. This allows a data scientist to perform big data analytics using only Python rather than learning other languages like Java or Scala to work on the big data platform.

Let's look at an example and understand how data is stored across Hadoop distributed clusters. Big data is generated from different data sources. A large file, usually greater than 100 megabytes, gets routed from a name node to data nodes. Name nodes hold the metadata information about the files stored on data nodes: a name node stores the address and information of each block of a file and the data node associated with it. Data nodes hold the actual data blocks. The file is split into multiple smaller files, usually of 64 megabytes or 128 megabytes in size, and then copied to multiple physical servers. The smaller files are also called file blocks, and one file block gets replicated to different servers. The default replication factor is three, which means a single file block gets copied at least three times on different servers, or data nodes. There is also a secondary name node which keeps a backup of all the metadata information stored on the main or primary name node; this node can be used if and when the main name node fails.
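To make the block arithmetic concrete, here is a tiny sketch using the numbers mentioned above (128 MB blocks, default replication factor of 3); the 300 MB file size is a made-up example.

```python
# Back-of-the-envelope HDFS storage math for an assumed 300 MB file.
import math

file_size_mb = 300        # hypothetical input file
block_size_mb = 128       # block size mentioned in the lesson
replication_factor = 3    # HDFS default replication

num_blocks = math.ceil(file_size_mb / block_size_mb)
total_copies = num_blocks * replication_factor

print(num_blocks)    # 3 blocks (128 MB + 128 MB + 44 MB)
print(total_copies)  # 9 block copies spread across data nodes
```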
Now that you have understood a little about HDFS, let's look at the second core component of Hadoop: MapReduce, the primary processing framework of the HDFS architecture. Suppose a file is split into three blocks: split 0, split 1, and split 2. Whenever a request comes in to retrieve information, the mapper task is executed on each data node that contains the file blocks. The mapper generates an output, essentially in the form of key-value pairs, which is sorted, copied, and merged. Once the mapper task is complete, the reducer works on the data and stores the output on HDFS. This completes the MapReduce process.

Let's discuss the MapReduce functions, mapper and reducer, in detail. The mapper: Hadoop ensures that mappers run locally on the nodes which hold a particular portion of the data to avoid network traffic. Multiple mappers run in parallel, and each mapper processes a portion of the input data. The input and output of the mapper are in the form of key-value pairs; note that a mapper can provide zero or more key-value pairs as output. The reducer: after the map phase, all intermediate values for an intermediate key are combined into a list, which is given to a reducer, and all values associated with a particular intermediate key are directed to the same reducer. This step is known as shuffle and sort. There may be a single reducer or multiple reducers. Note that the reducer also provides output in the form of zero or more final key-value pairs, and these values are then written to HDFS. The reducer usually emits a single key-value pair for each input key.

You have seen how MapReduce is critical for HDFS to function. The good thing is that you don't have to learn Java or other Hadoop-centric languages to write a MapReduce program: you can easily run such Hadoop jobs with code completely written in Python with the help of the Hadoop Streaming API. Hadoop Streaming acts like a bridge between your Python code and the Java-based HDFS, and lets you seamlessly access Hadoop clusters and execute MapReduce tasks.

Shown here are some user-friendly Python functions written in the style of a mapper. Suppose we have a list of numbers we want to square; we define the square function as shown on the screen. We can call the map function with the list and the function which is to be executed on each item in that list, and the output of this process is as shown on the screen. The reducer can also be written in Python. Here we would like to sum the squared numbers from the previous map operation, which can be done using the sum operation as shown on the screen. We can now call the reduce function with the list of data which is to be aggregated and an aggregator function. In our case, sum is used for this purpose.
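Since the on-screen code is not reproduced in the transcript, here is one plausible version of that square-and-sum example in plain Python; note that in Python 3, reduce lives in the functools module and map returns an iterator.

```python
# Mapper-style step: square every number in the list.
from functools import reduce

numbers = [1, 2, 3, 4, 5]             # example input list

def square(x):
    return x * x

squared = list(map(square, numbers))  # [1, 4, 9, 16, 25]

# Reducer-style step: aggregate the squared values with a sum.
total = reduce(lambda a, b: a + b, squared)   # equivalent to sum(squared)
print(squared, total)                 # [1, 4, 9, 16, 25] 55
```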
Big data analysis requires a large infrastructure. Cloudera provides an enterprise-ready Hadoop big data platform which supports Python as well. To execute Hadoop jobs, you first have to install Cloudera; it's preferable to install Cloudera's virtual machine on a Unix system, as it functions best there. To set up the Cloudera Hadoop environment, visit the Cloudera link shown here, select the QuickStart download for CDH 5.5 and VMware from the drop-down lists, and click the Download Now button. Once the VM image is downloaded, use 7-Zip to extract the files; to download and install 7-Zip, visit the link shown on screen.

The Cloudera VMware image has some system prerequisites: the 64-bit virtual machine requires a 64-bit host operating system, or OS, and a virtualization product that can support a 64-bit guest OS. To use a VMware VM, you must use a player compatible with Workstation 8.x or higher, such as Player 4.x or higher or Fusion 4.x or higher. You can use older versions of Workstation to create a new VM using the same virtual disk, or VMDK, file, but some features in VMware Tools will be unavailable. The amount of RAM required will vary depending on the runtime option you choose. To launch the VMware Player, you will need either VMware Player for Windows and Linux or VMware Fusion for Mac, so please visit the VMware link shown on screen to download the relevant VMware Player. Now launch the VMware Player with the Cloudera VM; the default username and password are both cloudera. Click the terminal icon as shown here; it will launch the Unix terminal for Hadoop HDFS interaction. To verify that the Unix terminal is functioning correctly, type in pwd, which will show you the present working directory. You can also type in ls -lrt to list all the current files, folders, and directories. These are some simple Unix commands which will come in handy later while you are implementing MapReduce tasks.

You have seen how the Hadoop Distributed File System works along with MapReduce: the data is written to and read from disks. MapReduce jobs require a lot of disk read and write operations, also known as disk I/O, or input and output. Reading and writing to a disk is not just expensive; it can also be slow and impact the entire process and operation. This is specifically true for iterative processes. Hadoop is built for write-once, read-many types of jobs, which means it's best suited for jobs that don't have to be updated or accessed frequently. But in several cases, particularly in analytics and machine learning, users need to write and rewrite commands to access and compute on the same data more than once. Every time such a request is sent out, MapReduce requires that data is read from and written to disks directly. Note that the time to access or write to disks is measured in milliseconds, and when you are dealing with large file sizes, the time factor gets compounded significantly. This makes the process highly time-consuming.

In contrast, Apache Spark uses resilient distributed data sets, or RDDs, to carry out such computations. RDDs allow data to be stored in memory, which means that every time users want to access the same data, a disk I/O operation is not required; they can easily access the data stored in the cache. Accessing the cache, or RAM, is much faster than accessing disks: if disk access is measured in milliseconds, in-memory data access is measured in sub-milliseconds. This radically reduces the overall time taken for iterative operations on large data sets. In fact, programs on Spark run at least 10 to 100 times faster than on MapReduce. That's why Spark is gaining popularity among most data scientists, as it is more time efficient when it comes to running analytics and machine learning computations. One of the main differences in terms of hardware requirements for MapReduce and Spark is that while MapReduce requires a lot of servers and CPUs, Spark additionally requires large and efficient RAM.
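As a small, hypothetical PySpark illustration of that difference, caching an RDD keeps it in memory so repeated actions do not re-read the source data; the HDFS path and the log contents are assumptions made for the sketch.

```python
# Hypothetical sketch: cache an RDD so iterative computations reuse
# the in-memory copy instead of re-reading from disk each time.
from pyspark import SparkContext

sc = SparkContext("local[*]", "cache-demo")

lines = sc.textFile("hdfs:///data/large_log.txt")   # assumed input path
errors = lines.filter(lambda line: "ERROR" in line).cache()

print(errors.count())                                # first action: computes and caches
print(errors.filter(lambda l: "timeout" in l).count())  # reuses the cached RDD

sc.stop()
```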
Let's understand resilient distributed data sets in detail. As you have already seen, the main programming approach of Spark is the RDD. RDDs are fault-tolerant collections of objects spread across a cluster that you can operate on in parallel; they are called fault tolerant because they can automatically recover from machine failure. You can create an RDD either by copying the elements from an existing collection or by referencing a data set stored externally, say on HDFS.

RDDs support two types of operations: transformations and actions. Transformations use an existing data set to create a new one. For example, map creates a new RDD containing the results after passing the elements of the original data set through a function; some other examples of transformations are filter and join. Actions compute on the data set and return a value to the driver program. For example, reduce aggregates all the RDD elements using a specified function and returns this value to the driver program; some other examples of actions are count, collect, and save. It's important to note that if the available memory is insufficient, then Spark writes the data to disk.

Here are some of the advantages of using Spark. It's almost 10 to 100 times faster than Hadoop MapReduce. It has a simple data processing framework. It provides interactive APIs for Python that allow faster application development. It has multiple tools for complex analytics operations, which help data scientists perform machine learning and other analytics much more efficiently and easily than with most existing tools. And it can easily be integrated with the existing Hadoop infrastructure.

PySpark is the Python API used to access the Spark programming model and perform data analysis. Let's take a look at some transformation functions and action methods which are supported by PySpark for data analysis. Some common transformation functions: map returns an RDD formed by passing data elements from the source data set through a function; filter returns an RDD based on selected criteria; flatMap maps the items present in the data set and returns a flattened sequence; reduceByKey returns key-value pairs where the values for each key are aggregated by a given reduce function. Some common action functions: collect returns all elements of the data set as an array; count returns the number of elements present in the data set; first returns the first element in the data set; take returns the number of elements specified by the number in the parentheses. SparkContext, or sc, is the entry point to Spark for a Spark application and must be available at all times for data processing.

There are mainly four components in Spark's tool set. Spark SQL is mainly used for querying the data stored on HDFS as a resilient distributed data set, or RDD, in Spark through integrated APIs in Python, Java, and Scala. Spark Streaming is very useful for data streaming processes, where data can be read from various data sources. MLlib is mainly used for machine learning processes such as supervised and unsupervised learning. GraphX can be used to process or generate graphs with RDDs.
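The sketch below exercises the transformations and actions just listed through PySpark; it is a minimal example that builds the SparkContext explicitly rather than relying on the one the pyspark shell creates, and the input strings are made up.

```python
# Minimal PySpark example of the listed transformations and actions.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

lines = sc.parallelize(["big data", "data science", "python and spark"])

# Transformations (lazy): they define new RDDs.
words  = lines.flatMap(lambda line: line.split())   # flatMap
pairs  = words.map(lambda w: (w, 1))                # map
counts = pairs.reduceByKey(lambda a, b: a + b)      # reduceByKey
data_w = words.filter(lambda w: w == "data")        # filter

# Actions (eager): they return values to the driver program.
print(counts.collect())   # all (word, count) pairs
print(words.count())      # total number of words
print(words.first())      # first element
print(words.take(3))      # first three elements
print(data_w.count())     # occurrences of "data"

sc.stop()
```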
Let's now set up the Apache Spark environment and also learn how to integrate Spark with Jupyter Notebook. First, visit the Apache link and download Apache Spark to your system. Then use the 7-Zip software to extract the files to your system's local directory. To set up the environment variables for Spark, first set the user variables: click New, enter SPARK_HOME as the variable name, and enter the Spark installation path as the variable value. Now click on Path, click New, and enter the Spark bin path from the installation directory location. Next, set the Spark notebook-specific variables; this will integrate the Spark engine with Jupyter Notebook. Type in pyspark, and it will launch a Jupyter Notebook after a while. Create a Python notebook and type in the sc command to check the SparkContext.

Want to become an expert in big data? Then subscribe to the Simplilearn channel and click here to watch more such videos. To get trained and certified in big data, click here.

