Welcome to this introductory course on data analytics, the first in a series of courses designed to prepare you for a career as a junior data analyst. To quote a Forrester Consulting report on the power of data to transform business, businesses today recognize the untapped value in data, and data analytics as a crucial factor for business competitiveness. To drive their data and analytics initiatives, companies are hiring and upskilling people; they're expanding their teams and creating centers of excellence to set up a multi-pronged data and analytics practice in their organizations. Combined with this is the significant supply and demand mismatch in skilled data analysts, making it a highly sought-after and well-paid profession. You can choose to master data analytics as a career path, or leverage it as a stepping stone to branch out into other data professions such as data science, data engineering, business analytics, and business intelligence analytics. This course is for you if you're a fresh graduate from any stream, a working professional considering a mid-career transition, a data-driven decision maker, or in any analytics-enabled role. The course introduces you to the core concepts, processes, and tools you need to gain entry into data analytics, or even to strengthen your current role as a data-driven decision maker. It will equip you with an understanding of the data ecosystem and the fundamentals of data analysis, such as data gathering, wrangling, mining, analysis, and data visualization. You'll also get a feel for a day in the life of a data analyst. Practicing data analysts share their experience in gaining entry into this field, career options and learning paths you can consider, and what employers look for in a data analyst. They also share their knowledge and best practices about some of the aspects of the data analysis process. What lies ahead is truly exciting, both for the field and for you as a data analyst, so congratulations on choosing to be on this journey, and good luck. [Music] To quote a Forbes 2020 report on data in the coming decade, the constant increase in data processing speeds and bandwidth, the non-stop invention of new tools for creating, sharing, and consuming data, and the steady addition of new data creators and consumers around the world ensure that data growth continues unabated. Data begets more data in a constant virtuous cycle. A modern data ecosystem includes a whole network of interconnected, independent, and continually evolving entities. It includes data that has to be integrated from disparate sources, different types of analysis and skills to generate insights, active stakeholders to collaborate and act on insights generated, and tools, applications, and infrastructure to store, process, and disseminate data as required. Let's start with the data sources. Data is available in a variety of structured and unstructured data sets, residing in text, images, videos, click streams, user conversations, social media platforms, the Internet of Things or IoT devices, real-time events that stream data, legacy databases, and data sourced from professional data providers and agencies. The sources have never before been so diverse and dynamic. When you're working with so many different sources of data, the first step is to pull a copy of the data from the original sources into a data repository. At this stage, you're only looking at acquiring the data you need, working with data formats, sources, and interfaces through which this data can be pulled in. Reliability, security, and integrity of the data being
acquired are some of the challenges you work through at this stage once the raw data is in a common place it needs to get organized cleaned up and optimized for Access by end users the data will also need to conform to compliances and standards enforced in the organization for example conforming to guidelines that regulate the storage and use of personal data such as health Biometrics or household data in the case of iot devices adhering to master data tables within the organization to ensure standardization of Master data across all applications and systems of an organization is another example the key challenges at this stage could involve data management and working with data repositories that provide High availability flexibility accessibility and security finally we have our business stakeholders applications programmers analysts and data science use cases all pulling this data from the Enterprise data Repository the key challenges at this stage could include the interfaces apis and applications that can get this data to the end users in line with their specific needs for example data analysts may need the raw data to work with business stakeholders may need reports and dashboards applications may need custom apis to pull this data it's important to note the influence of some of the new and emerging technologies that are shaping today's data ecosystem and its possibilities for example cloud computing machine learning and big data to name a few thanks to Cloud Technologies every Enterprise today has access to Limitless storage high performance Computing open source Technologies machine learning Technologies and the latest tools and libraries data scientists are creating predictive models by training machine learning algorithms on past data also Big Data today we're dealing with data sets that are so massive and so varied that traditional tools and Analysis methods are no longer adequate Paving the way for new tools and techniques and also new knowledge and insights we'll learn more about big data and its influence in shaping business decisions further along in this course [Music] today organizations that are using data to uncover opportunities and are applying that knowledge to differentiate themselves are the ones leading into the future whether looking for patterns in financial transactions to detect fraud using recommendation engines to drive conversion mining social media posts for customer Voice or Brands personalizing their offers based on customer Behavior Analysis Business Leaders realize that data holds the key to competitive advantage to get value from data you need a vast number of skill sets and people playing different roles in this video we're going to look at the role data Engineers data analysts data scientists business analysts and business intelligence or bi analysts play in helping organizations tap into vast amounts of data and turn them into actionable insights it all starts with a data engineer data Engineers are people who develop and maintain data architectures and make data available for business operations and Analysis data Engineers work within the data ecosystem to extract integrate and organize data from disparate sources clean transform and prepare data design store and manage data in data repositories they enable data to be accessible in formats and systems that the various business applications as well as stakeholders like data analysts and data scientists can utilize a data engineer must have good knowledge of programming sound knowledge of Systems and 
Technology architectures and in-depth understanding of relational databases and non-relational data stores now let's look at the role of a data analyst in short a data analyst translates data and numbers into plain language so organizations can make decisions data analysts inspect and clean data for deriving insights identify correlations find patterns and apply statistical methods to analyze and mine data and visualize data to interpret and present the findings of data analysis analysts are the people who answer questions such as are the users search experiences generally good or bad with the search functionality on our site or what is the popular perception of people regarding our rebranding initiatives or is there a correlation between sales of one product and another data analysts require good knowledge of spreadsheets writing queries and using statistical tools to create charts and dashboards modern data analysts also need to have some programming skills they also need strong analytical and storytelling skills and now let's look at the role data scientists play in this ecosystem data scientists analyze data for actionable insights and build machine learning or deep learning models that train on past data to create predictive models data scientists are people who answer questions such as how many new social media followers am I likely to get next month or what percentage of my customers am I likely to lose to competition in the next quarter or is this financial transaction unusual for this customer data scientists require knowledge of mathematics statistics and a fair understanding of programming languages databases and building data models they also need to have domain knowledge then we also have business analysts and bi analysts business analysts leverage the work of data analysts and data scientists to look at possible implications for their business and the actions they need to take or recommend bi analysts do the same except their focus is on the market forces and external influences that shape their business they provide business intelligence solutions by organizing and monitoring data on different business functions and exploring that data to extract insights and actionables that improve business performance to summarize in simple terms data engineering converts raw data into usable data data analytics uses this data to generate insights data scientists use data analytics and data engineering to predict the future using data from the past business analysts and business intelligence analysts use these insights and predictions to drive decisions that benefit and grow their business interestingly it's not uncommon for data professionals to start their career in one of the data roles and transition to another role within the data ecosystem by supplementing their skills data analysis is the process of gathering cleaning analyzing and Mining data interpreting results and Reporting the findings with data analysis we find patterns within data and correlations between different data points and it is through these patterns and correlations that insights are generated and conclusions are drawn data analysis helps businesses understand their past performance and informs their decision making for future actions using data analysis businesses can validate a course of action before committing to it saving valuable time and resources and also ensuring greater success we'll explore four primary types of data analysis each with a different goal and place in the data analysis process descriptive 
analytics helps answer questions about what happened over a given period of time by summarizing past data and presenting the findings to stakeholders. It helps provide essential insights into past events, for example, tracking past performance based on the organization's key performance indicators, or cash flow analysis. Diagnostic analytics helps answer the question, why did it happen? It takes the insights from descriptive analytics to dig deeper to find the cause of the outcome, for example, a sudden change in traffic to a website without an obvious cause, or an increase in sales in a region where there has been no change in marketing. Predictive analytics helps answer the question, what will happen next? Historical data and trends are used to predict future outcomes. Some of the areas in which businesses apply predictive analysis are risk assessment and sales forecasts. It's important to note that the purpose of predictive analytics is not to say what will happen in the future; its objective is to forecast what might happen in the future. All predictions are probabilistic in nature. Prescriptive analytics helps answer the question, what should be done about it? By analyzing past decisions and events, the likelihood of different outcomes is estimated, on the basis of which a course of action is decided. Self-driving cars are a good example of prescriptive analytics: they analyze the environment to make decisions regarding speed, changing lanes, which route to take, etc. Or airlines automatically adjusting ticket prices based on customer demand, gas prices, the weather, or traffic on connecting routes. Now let's look at some of the key steps in any data analysis process. Understanding the problem and desired result: data analysis begins with understanding the problem that needs to be solved and the desired outcome that needs to be achieved. Where you are and where you want to be needs to be clearly defined before the analysis process can begin. Setting a clear metric: this stage of the process includes deciding what will be measured, for example, number of product X sold in a region, and how it will be measured, for example, in a quarter or during a festival season. Gathering data: once you know what you're going to measure and how you're going to measure it, you identify the data you require, the data sources you need to pull this data from, and the best tools for the job. Cleaning data: having gathered the data, the next step is to fix quality issues in the data that could affect the accuracy of the analysis. This is a critical step, because the accuracy of the analysis can only be ensured if the data is clean. You will clean the data of missing or incomplete values and outliers; for example, a customer demographics data set in which the age field has a value of 150 contains an outlier. You will also standardize the data coming in from multiple sources.
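To make this cleaning step concrete, here is a minimal sketch in Python using pandas. The file name, column names, and age threshold are illustrative assumptions, not part of any specific data set.

```python
# A minimal sketch of the cleaning step described above, using pandas.
# The file name and column names (customer_id, age, region) are illustrative.
import pandas as pd

df = pd.read_csv("customer_demographics.csv")

# Drop records with missing or incomplete values in key fields
df = df.dropna(subset=["customer_id", "age"])

# Treat implausible ages (such as 150) as outliers and remove them
df = df[(df["age"] >= 0) & (df["age"] <= 110)]

# Standardize a field that arrives in different formats from different sources
df["region"] = df["region"].str.strip().str.title()

print(df.describe())
```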
Analyzing and mining data: once the data is clean, you will extract and analyze the data from different perspectives. You may need to manipulate your data in several different ways to understand the trends, identify correlations, and find patterns and variations. Interpreting results: after analyzing your data, and possibly conducting further research, which can be an iterative loop, it's time to interpret your results. As you interpret your results, you need to evaluate if your analysis is defendable against objections, and if there are any limitations or circumstances under which your analysis may not hold true. Presenting your findings: ultimately, the goal of any analysis is to impact decision making. The ability to communicate and present your findings in clear and impactful ways is as important a part of the data analysis process as is the analysis itself. Reports, dashboards, charts, graphs, maps, and case studies are just some of the ways in which you can present your data. [Music] In this video, we will listen to several data professionals talk about how they define data analytics and what this term means to them. I define data analytics as the process of collecting information and then analyzing that information to confirm various hypotheses. To me, data analytics also means storytelling with data, using data to clearly and concisely convey the state of the world to the people around you. Data analysis is the use of information around you to make decisions. Just like you get up every morning, you watch the news, the weather report will tell you the temperature for the day, whether it's going to rain; that may dictate what you're going to wear or what activities you can do. So data analysis isn't an abstract concept; it's something that we do naturally, but it has a technical name, and now people are being paid to do it in a much larger or grander experience, but really it's not that complicated. The way I put it is that you've got a problem and you need to use facts to test your hypothesis; that's where data analytics comes into play. The process starts from defining the problem, and then you need to create your own hypothesis, and to test that you need to collect data, clean data, analyze data, and then present it to the key stakeholders. Data analytics is really any sense of data that you can use to review information, anything that's going to help you to understand what is going on. In my case, as a CPA, I am always looking at financial statements and I'm always analyzing data to predict where someone's been, where they are right now, and where they're headed, and so that data helps me to see further and almost predict the future of any company that I'm working with. So data analytics is the collecting, cleansing, analyzing, presenting, and ultimately sharing of data and your analysis, to be able to help communicate exactly what's going on with your business and what's going on in the data, so that you can help make better decisions. I would define data analytics as a process, or better yet a phenomenon, of taking information gathered from a relevant population, maybe your customers or your social audience, breaking that information down into subsets, and using that data to make decisions about products or services that you want to offer, or in cases of the digital environment that we're in, making decisions about certain pieces of content that you want to publish so that it appeals to your target audience. [Music] While the role of a data analyst varies depending on the type of organization and the extent to which it has adopted data-driven practices, there are some responsibilities that are typical to a data analyst role in today's organizations. These include acquiring data from primary and secondary data sources; creating queries to extract required data from databases and other data collection systems; filtering, cleaning, standardizing, and reorganizing data in preparation for data analysis; using statistical tools to interpret data sets; using statistical techniques to identify patterns and correlations in data; analyzing patterns in complex data sets and interpreting trends; preparing reports and charts that effectively communicate trends and patterns; and creating appropriate documentation to define and demonstrate the
steps of the data analysis process. Corresponding to these responsibilities, let's look at some of the skills that are valuable for a data analyst. The data analysis process requires a combination of technical, functional, and soft skills. Let's first look at some of the technical skills that you need in your role as a data analyst. These include expertise in using spreadsheets such as Microsoft Excel or Google Sheets; proficiency in statistical analysis and visualization tools and software such as IBM Cognos, IBM SPSS, Oracle Visual Analyzer, Microsoft Power BI, SAS, and Tableau; proficiency in at least one of the programming languages such as R, Python, and in some cases C++, Java, and MATLAB; good knowledge of SQL and the ability to work with data in relational and NoSQL databases; the ability to access and extract data from data repositories such as data marts, data warehouses, data lakes, and data pipelines; and familiarity with big data processing tools such as Hadoop, Hive, and Spark. We will understand more about the features and use cases of some of these programming languages, databases, data repositories, and big data processing tools further along in the course. Now let's look at some of the functional skills that you require for the role of a data analyst. These include proficiency in statistics, to help you analyze your data, validate your analysis, and identify fallacies and logical errors; analytical skills that help you research and interpret data, theorize, and make forecasts; problem-solving skills, because ultimately the end goal of all data analysis is to solve problems; probing skills that are essential for the discovery process, that is, for understanding a problem from the perspective of varied stakeholders and users, because the data analysis process really begins with a clear articulation of the problem statement and desired outcome; data visualization skills that help you decide on the techniques and tools that present your findings effectively based on your audience, type of data, context, and end goal of your analysis; and project management skills to manage the process, people, dependencies, and timelines of the initiative. That brings us to your soft skills as a data analyst. Data analysis is both a science and an art. You can ace the technical and functional expertise, but one of the key differentiators for your success is going to be soft skills. This includes your ability to work collaboratively with business and cross-functional teams, communicate effectively to report and present your findings, tell a compelling and convincing story, and gather support and buy-in for your work. Above all, being curious is at the heart of data analysis. In the course of your work, you will stumble upon patterns, phenomena, and anomalies that may show you a different path. The ability to allow new questions to surface and challenge your assumptions and hypotheses makes for a great analyst. You will also hear data analysis practitioners talk about intuition as a must-have quality. It's essential to note that intuition in this context is the ability to have a sense of the future based on pattern recognition and past experiences. [Music] In this video, we will listen to practicing data professionals talk about the qualities and skills required to become a data analyst. The qualities and skills of a data analyst: a person who's curious naturally, someone who has attention to detail and enjoys working with computers. A curious person will look to find answers, even sometimes when there isn't a question, or they don't mind researching and looking in areas
that may not have been thought up before attention to detail or looking for patterns do you walk into a room and just count naturally people how the room is set up paying attention to close details and then enjoying computers because technology is moving so fast something or skill that you learned today in two to three years may not be applicable so you need to be able to develop new skills and learn new software depending on how the market where the industry has changed definitely both a technical skills and softer skills are required technical skills include python SQL R Tableau and power bi and the soft skills or interpersonal skills mean whether you know what's the right data to utilize and what's the right tool to use and how to present the data to the key stakeholders and these skill sets require the business Acumen and presentation skills you have to be very detail-oriented you have to love numbers you have to love information and be willing to look at that information and not just look at it on the surface but dive deeper so for example in what we do I can't just take a bank statement at face value I have to actually look at it and compare it does the seal look right especially in today's world there's a lot of Fraud and miscommunication so to be and and people that are trying to take your information and fraudulently use it so a good data analyst should be able to compare last year's information to this year's information to see if it looks right you have to have that eye and that mindset and not just take things at face value there are many qualities and skills required to be a data analyst and I'd break them down into two buckets basically soft skills and technical skills I think the most important soft skills for a data analyst is to be really curious to ask a lot of good questions to be really thought full and to listen carefully and understand both the user perspective and your co-workers perspective and what they most need from the data and always be willing to learn because analytics is a fast moving field so you have to constantly be learning and reading to stay on top of it there are many technical skills required to be a data analyst the most important technical skill for any new data analyst to learn is SQL it's by far the most widely used and anytime you're extracting data from a database you're going to need to know SQL and there is nothing quite like a data analyst with really really good SQL skills I think sometimes people get ahead of themselves and try a bunch of very complicated Technologies before getting the basics of SQL down and I think that's a really big mistake I think it's always nice to know Python and R which are the two main programming languages used uh to do data analysis I think as a new data analyst you don't need to be proficient in both or really either but starting to get good at one or the other is going to be really useful for your career another important technical skill to have for a data analyst is uh to be really good at at least one data visualization tool and to understand general principles of data visualization today the end-to-end skill set of a data analyst is far more Dynamic than what it used to be so data analyst needs to know what problem they're trying to solve with the data pulls that data as they need it in the structure they needed in using SQL from the data Lake um that it's sitting in uh you know there'll be many different tables and they'll need to figure out how to join them and then pull the data clean it up uh Wrangle 
manipulate it, mine it, so that they're able to, um, kind of glean insights out of it, present those insights concisely and clearly using good visualizations and dashboards, and, in other words, be able to tell a good story with that data. [Music] A day in the life of a data analyst can include a number of possibilities, from acquiring data from varied data sources, to creating queries for pulling data from data repositories, foraging through rows of data to look for insights, creating reports and dashboards, and interacting with stakeholders for gathering information and presenting the findings. It's a spectrum, and yes, the big one: cleaning and preparing the data so that the findings have a credible basis, which, by the way, is a large part of what any data analyst may find themselves doing in their jobs. But if I had to walk you through any one type of day, I am going to pick one which has me foraging through data looking for insights. This is the part of my job that I am totally in awe of. Hi, I am Sivaram Jaladi. I work as a data analyst with Fluentgrid, a smart grid technology solutions company based in Visakhapatnam in India. Fluentgrid is an IBM partner and the recipient of IBM Beacon Awards for its solutions in the areas of smart energy and smart city industry segments. We offer integrated operations center solutions for power utilities and smart cities, leveraging our actionable intelligence platform known as Fluentgrid Actilligence. Our client, a power utility company in South India, has been noticing a spike in complaints regarding overbilling, and the frequency of these complaints seems to suggest there is something more to it than random occurrences, so I'm asked to look at the complaints and the billing data and see if I can spot something. I start by taking stock of what I have. Some of the obvious places that I know I'm going to be looking into are the complaint data, the subscriber information data, and the billing data; that's going to be my starting point. Before I dive into the specifics of the data, I'm going to make a list of questions, initial hypotheses, that I am going to start with, such as: one, the usage pattern of subscribers reporting this issue: is there a consumption range for which overbilling is occurring more than others? Two, area-wise concentration of complaints: are the complaints concentrated in specific localities within the city? Three, frequency and occurrence of complaints based on individual subscribers: are the same subscribers reporting overbilling repetitively? If yes, what is the frequency of occurrence in repeat cases? If a subscriber is overbilled once, does the overbilling occur every month from the first occurrence, or are repeat occurrences sporadic, or not at all?
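A sketch of how hypotheses like these could be checked with pandas, once the complaint, billing, and subscriber data have been pulled together. The file names and column names below are illustrative, not the actual schema.

```python
# Illustrative sketch: all table and column names here are assumptions.
import pandas as pd

complaints = pd.read_csv("overbilling_complaints.csv")   # subscriber_id, complaint_date
billing = pd.read_csv("monthly_billing.csv")             # subscriber_id, avg_monthly_bill
subscribers = pd.read_csv("subscribers.csv")             # subscriber_id, zip_code, connection_date

data = (complaints
        .merge(billing, on="subscriber_id")
        .merge(subscribers, on="subscriber_id"))

# Hypothesis 1: is there a billing range where complaints cluster?
print(pd.cut(data["avg_monthly_bill"], bins=5).value_counts())

# Hypothesis 2: are complaints concentrated in specific localities?
print(data["zip_code"].value_counts().head(10))

# Hypothesis 3: are the same subscribers complaining repeatedly?
repeats = data.groupby("subscriber_id").size()
print((repeats > 1).mean())
```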
As I get clear on my initial hypotheses and the set of questions I'm going to start with, I identify the data sets that I am going to isolate and analyze to validate or refute my hypotheses. I pull out the average annual, quarterly, and monthly billing amounts of the complainants and look for a range in which the complaints are falling more than others. I then pull up the location data of the complainants to see if there is a connection between overbilling and zip codes. Here I see what seems to be a concentration of complaints in certain areas. This looked like it could add up to something, so instead of moving to the third hypothesis, I decide to dig a little deeper into this data. Next, I pull out the date of connection data. More than 95 percent of the complainants had been our subscribers for more than seven years, though not all subscribers over the seven-year mark were facing this complaint. So now we see some area-wise concentration, and we see a significant concentration of complaints based on the date of connection. Next, I pull out the make and the serial number of the meters, and there it is: the serial numbers belonged to the same batch of meters provided by the same supplier, and the concentration of these meters, and therefore the complaints, was coming from areas in which these meters were installed. At this stage, I feel confident in presenting these findings to the stakeholders. I'm also going to share the data sources and my process of arriving at this analysis; that always goes a long way in lending credibility to the findings. This could be the end of this project, or it may very well come back, maybe the same complaints with different commonalities, or a completely different set of complaints for which we need to find answers. [Music] Data professionals talk about some of the applications of data analytics in today's world. The applications of data analytics in the world today are everywhere. Every commercial that you see, someone had to analyze and identify, either from the consumer or for the company, what information they wanted to share. So, you know, four out of 10 dentists, or you'll see information related to calorie counts, or reactions to certain things: all of that required analysis. This isn't something that should be thought of as separate and apart; it's what we do every day in our lives. Even people monitoring their sugar level with diabetes, there's always analysis going on, so the applications are universal. So the great thing with analytics in this day and age is that it's very widely applicable. Every industry, every vertical, every function within a given organization can benefit from data and analytics, whether you're doing sales pipeline analysis, or you're doing financials at the end of the month, creating predefined and standardized formatted reports, or if you're doing something like headcount planning or headcount review. All these, across every vertical, as I said, whether it's airlines, pharmaceuticals,
banking: all these, and the functions within them, can benefit from analytics. And in this climate that we're in right now, with the pandemic, there are companies who are paying close attention to their customers' buying habits. Obviously, they may have varied from what these companies expected these habits to be, and so now data analytics is more important, because they need to make sure they can pivot and keep up with the demand, and really be able to cater to what their clients and their customers want. I can talk about applications of data analytics in finance. These years, we have seen more and more applications of alternative data analytics in the finance world. For example, we can use sentiment analysis of tweets and news stories to supplement traditional financial analysis and to inform better investment decisions. Besides that, satellite imagery data can be used to track the development of industrial activities, and geolocation data can be used to track store traffic and to predict sales volume. [Music] The data analyst ecosystem includes the infrastructure, software, tools, frameworks, and processes used to gather, clean, analyze, mine, and visualize data. In this video, we will go over a quick overview of the ecosystem before going into the details of each of these topics in subsequent videos. Let's first talk about data. Based on how well-defined the structure of the data is, data can be categorized as structured, semi-structured, or unstructured. Data that follows a rigid format and can be organized neatly into rows and columns is structured data. This is the data that you see typically in databases and spreadsheets, for example. Semi-structured data is a mix of data that has consistent characteristics and data that doesn't conform to a rigid structure. For example, emails: an email has a mix of structured data, such as the name of the sender and recipient, but also has the contents of the email, which is unstructured data. And then there is unstructured data, data that is complex and mostly qualitative information that is impossible to reduce to rows and columns, for example, photos, videos, text files, PDFs, and social media content. The type of data drives the kind of data repositories that the data can be collected and stored in, and also the tools that can be used to query or process the data. Data also comes in a wide-ranging variety of file formats, being collected from a variety of data sources, ranging from relational and non-relational databases to APIs, web services, data streams, social platforms, and sensor devices. This brings us to data repositories, a term that includes databases, data warehouses, data marts, data lakes, and big data stores. The type, format, and sources of data influence the type of data repositories that you could use to collect, store, clean, analyze, and mine the data for analysis. If you're working with big data, for example, you will need big data warehouses that allow you to store and process large-volume, high-velocity data, and also frameworks that allow you to perform complex analytics in real time on big data. The ecosystem also includes languages that can be classified as query languages, programming languages, and shell and scripting languages. From querying and manipulating data with SQL, to developing data applications with Python, to writing shell scripts for repetitive operational tasks, these are important components in a data analyst's workbench. Automated tools, frameworks, and processes for all stages of the analytics process are part of the data analyst's ecosystem, from tools used for gathering, extracting, transforming, and
loading data into data repositories, to tools for data wrangling, data cleaning, analysis, data mining, and data visualization. It's a very diverse and rich ecosystem: spreadsheets, Jupyter Notebooks, and IBM Cognos are just a few examples. We will cover some of the data analytics tools in greater detail in subsequent sections of the course. [Music] Data is unorganized information that is processed to make it meaningful. Generally, data comprises facts, observations, perceptions, numbers, characters, symbols, and images that can be interpreted to derive meaning. One of the ways in which data can be categorized is by its structure. Data can be structured, semi-structured, or unstructured. Structured data has a well-defined structure or adheres to a specified data model, can be stored in well-defined schemas such as databases, and in many cases can be represented in a tabular manner with rows and columns. Structured data is objective facts and numbers that can be collected, exported, stored, and organized in typical databases. Some of the sources of structured data could include SQL databases and online transaction processing, or OLTP, systems that focus on business transactions; spreadsheets such as Excel and Google Sheets; online forms; sensors such as Global Positioning Systems, or GPS, and radio frequency identification, or RFID, tags; and network and web server logs. You can also easily examine structured data with standard data analysis tools and methods. Semi-structured data is data that has some organizational properties but lacks a fixed or rigid schema. Semi-structured data cannot be stored in the form of rows and columns as in databases. It contains tags, elements, or metadata, which are used to group data and organize it in a hierarchy. Some of the sources of semi-structured data could include emails, XML and other markup languages, binary executables, TCP/IP packets, zipped files, and integration of data from different sources. XML and JSON allow users to define tags and attributes to store data in a hierarchical form and are used widely to store and exchange semi-structured data. Unstructured data is data that does not have an easily identifiable structure and therefore cannot be organized in a mainstream relational database in the form of rows and columns. It does not have any particular format, sequence, semantics, or rules. Unstructured data can deal with the heterogeneity of sources and has a variety of business intelligence and analytics applications. Some of the sources of unstructured data could include web pages, social media feeds, images in varied file formats such as JPEG, GIF, and PNG, video and audio files, documents and PDF files, PowerPoint presentations, media logs, and surveys. Unstructured data can be stored in files and documents, such as a Word document, for manual analysis, or in NoSQL databases that have their own analysis tools for examining this type of data. To summarize: structured data is data that is well organized in formats that can be stored in databases and lends itself to standard data analysis methods and tools; semi-structured data is data that is somewhat organized and relies on meta tags for grouping and hierarchy; and unstructured data is data that is not conventionally organized in the form of rows and columns in a particular format. In the next video, we will learn about the different types of file structures. [Music] As a data professional, you will be working with a variety of data file types and formats. It is important to understand the underlying structure of file formats, along with their benefits
and limitations this understanding will support you to make the right decisions on the formats best suited for your data and performance needs some of the standard file formats that we will cover in this video include delimited text file formats Microsoft Excel open XML spreadsheet or xlsx extensible markup language or XML portable document format or PDF JavaScript object notation or Json delimited text files are text files used to store data as text in which each line or row has values separated by a delimiter where a delimiter is a sequence of one or more characters for specifying the boundary between independent entities or values any character can be used to separate the values but most common delimiters are the comma tab colon vertical bar and space comma separated values or csvs and tab separated values or tsvs are the most commonly used file types in this category in csvs the delimiter is a comma while in tsvs the delimiter is a tab when literal commas are present in Text data and therefore cannot be used as delimiters tsvs serve as an alternative to CSV format tab stops are infrequent in running text each row or horizontal line in the text file has a set of values separated by the delimiter and represents a record the first row works as a column header where each column can have a different type of data for example a column can be of date type while another can be a string or integer type data delimited files allow field values of any length and are considered a standard format for providing straightforward information schema they can be processed by almost all existing applications delimiters also represent one of various means to specify boundaries in a data stream Microsoft Excel open XML spreadsheet or xlsx is a Microsoft Excel open XML file format that falls under the spreadsheet file format it is an XML based file format created by Microsoft in an xlsx also known as a workbook there can be multiple worksheets and each worksheet is organized into rows and columns at the intersection of which is the cell each cell contains data xlsx uses the open file format which means it is generally accessible to most other applications it can use and save all functions available in Excel and is also known to be one of the more secure file formats as it cannot save malicious code extensible markup language or XML is a markup language with set rules for encoding data the XML file format is both readable by humans and machines it is a self-descriptive language designed for sending information over the internet XML is similar to HTML in some respects but also has differences for example an XML does not use predefined tags like HTML does XML is platform independent and programming language independent and therefore simplifies data sharing between various systems portable document format or PDF is a file format developed by Adobe to present documents independent of application software hardware and operating systems which means it can be viewed the same way on any device this format is frequently used in legal and financial documents and can also be used to fill in data such as forms JavaScript object notation or Json is a text-based open standard designed for transmitting structured data over the web the file format is a language independent data format that can be read in any programming language Json is easy to use is compatible with a wide range of browsers and is considered as one of the best tools for sharing data of any size and type even audio and video that is one reason many apis and 
web servers return data as JSON. [Music] As we touched upon in one of our previous videos, data sources have never been as dynamic and diverse as they are today. In this video, we will look at some common sources, such as relational databases, flat files and XML data sets, APIs and web services, web scraping, and data streams and feeds. Typically, organizations have internal applications to support them in managing their day-to-day business activities, customer transactions, human resource activities, and their workflows. These systems use relational databases such as SQL Server, Oracle, MySQL, and IBM Db2 to store data in a structured way. Data stored in databases and data warehouses can be used as a source for analysis. For example, data from a retail transaction system can be used to analyze sales in different regions, and data from a customer relationship management system can be used for making sales projections. External to the organization, there are other publicly and privately available data sets, for example, government organizations releasing demographic and economic data sets on an ongoing basis. Then there are companies that sell specific data, for example, point-of-sale data, or financial data, or weather data, which businesses can use to define strategy, predict demand, and make decisions related to distribution or marketing promotions, among other things. Such data sets are typically made available as flat files, spreadsheet files, or XML documents. Flat files store data in plain text format, with one record or row per line, and each value separated by delimiters such as commas, semicolons, or tabs. Data in a flat file maps to a single table, unlike relational databases, which contain multiple tables. One of the most common flat file formats is CSV, in which values are separated by commas. Spreadsheet files are a special type of flat file that also organize data in a tabular format of rows and columns, but a spreadsheet can contain multiple worksheets, and each worksheet can map to a different table. Although data in spreadsheets is in plain text, the files can be stored in custom formats and include additional information such as formatting, formulas, etc. Microsoft Excel, which stores data in an XLS or XLSX format, is probably the most common spreadsheet; others include Google Sheets, Apple Numbers, and LibreOffice. XML files contain data values that are identified or marked up using tags. While data in flat files is flat, or maps to a single table, XML files can support more complex data structures, such as hierarchical ones. Some common uses of XML include data from online surveys, bank statements, and other unstructured data sets. Many data providers and websites provide APIs, or application programming interfaces, and web services, which multiple users or applications can interact with to obtain data for processing or analysis. APIs and web services typically listen for incoming requests, which can be in the form of web requests from users or network requests from applications, and return data in plain text, XML, HTML, JSON, or media files. Let's look at some popular examples of APIs being used as a data source for data analytics: the use of Twitter and Facebook APIs to source data from tweets and posts for performing tasks such as opinion mining or sentiment analysis, which is to summarize the amount of appreciation and criticism on a given subject, such as the policies of a government, a product, a service, or customer satisfaction in general; stock market APIs used for pulling data such as share and commodity prices, earnings per share, and historical prices, for trading and analysis; and data lookup and validation APIs, which can be very useful for data analysts for cleaning and preparing data, as well as for correlating data, for example, to check which city or state a postal or zip code belongs to. APIs are also used for pulling data from database sources within and external to the organization.
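A minimal sketch of pulling data from a web API in Python with the requests library. The URL and parameters are placeholders; a real provider's endpoint, authentication, and response fields will differ.

```python
# Placeholder endpoint and parameters; not a real stock market API.
import requests

response = requests.get(
    "https://api.example.com/v1/stock/quote",
    params={"symbol": "ACME"},
    timeout=10,
)
response.raise_for_status()

quote = response.json()   # most APIs return JSON, as noted above
print(quote)
```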
Web scraping is used to extract relevant data from unstructured sources. Also known as screen scraping, web harvesting, and web data extraction, web scraping makes it possible to download specific data from web pages based on defined parameters. Web scrapers can, among other things, extract text, contact information, images, videos, product items, and much more from a website. Some popular uses of web scraping include collecting product details from retailers, manufacturers, and e-commerce websites to provide price comparisons; generating sales leads through public data sources; extracting data from posts and authors on various forums and communities; and collecting training and testing data sets for machine learning models. Some of the popular web scraping tools include Beautiful Soup, Scrapy, Pandas, and Selenium.
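A minimal scraping sketch using requests and Beautiful Soup. The URL, tags, and class names are placeholders; what you actually select depends entirely on the markup of the page you are scraping, and on that site's terms of use.

```python
# Placeholder page structure; adjust the tag and class names to the real page.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.example.com/products", timeout=10)
soup = BeautifulSoup(page.text, "html.parser")

# Extract, say, product names and prices based on the page's markup
for item in soup.find_all("div", class_="product"):
    name = item.find("h2").get_text(strip=True)
    price = item.find("span", class_="price").get_text(strip=True)
    print(name, price)
```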
Data streams are another widely used source for aggregating constant streams of data flowing from sources such as instruments, IoT devices and applications, GPS data from cars, computer programs, websites, and social media posts. This data is generally timestamped and also geotagged for geographical identification. Some of the data streams, and the ways in which they can be leveraged, include: stock and market tickers for financial trading; retail transaction streams for predicting demand and supply chain management; surveillance and video feeds for threat detection; social media feeds for sentiment analysis; sensor data feeds for monitoring industrial or farming machinery; web click feeds for monitoring web performance and improving design; and real-time flight events for rebooking and rescheduling. Some popular applications used to process data streams include Apache Kafka, Apache Spark Streaming, and Apache Storm. RSS, or Really Simple Syndication, feeds are another popular data source. These are typically used for capturing updated data from online forums and news sites where data is refreshed on an ongoing basis. Using a feed reader, which is an interface that converts RSS text files into a stream of updated data, updates are streamed to user devices. [Music] In this video, we will learn about some of the languages relevant to the work of data professionals. These can be categorized as query languages, programming languages, and shell scripting. Having proficiency in at least one language in each category is essential for any data professional. Simply stated, query languages are designed for accessing and manipulating data in a database, for example, SQL; programming languages are designed for developing applications and controlling application behavior, for example, Python, R, and Java; and shell and scripting languages, such as Unix/Linux shell and PowerShell, are ideal for repetitive and time-consuming operational tasks. In the remaining video, we will examine these languages in greater depth. SQL, or Structured Query Language, is a querying language designed for accessing and manipulating information from, mostly though not exclusively, relational databases. Using SQL, we can write a set of instructions to perform operations such as inserting, updating, and deleting records in a database; creating new databases, tables, and views; and writing stored procedures, which means you can write a set of instructions and call them for later use. Here are some advantages of using SQL. SQL is portable and can be used independent of the platform. It can be used for querying data in a wide variety of databases and data repositories, although each vendor may have some variations and special extensions. It has a simple syntax that is similar to the English language; its syntax allows developers to write programs with fewer lines than some of the other programming languages, using basic keywords such as SELECT, INSERT INTO, and UPDATE. It can retrieve large amounts of data quickly and efficiently. It runs on an interpreter system, which means code can be executed as soon as it is written, making prototyping quick and easy. SQL is one of the most popular querying languages; due to its large user community and the sheer volume of documentation accumulated over the years, it continues to provide a uniform platform worldwide to all its users.
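A minimal sketch of the basic SQL operations described above, run through Python's built-in sqlite3 module as a stand-in for any relational database. The table and column names are illustrative.

```python
# Illustrative table; sqlite3 ships with Python's standard library.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Asha", "Pune"))
cur.execute("UPDATE customers SET city = ? WHERE name = ?", ("Mumbai", "Asha"))

cur.execute("SELECT customer_id, name, city FROM customers")
print(cur.fetchall())

conn.close()
```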
Python is a widely used open-source, general-purpose, high-level programming language. Its syntax allows programmers to express their concepts in fewer lines of code as compared to some of the older languages. Python is perceived as one of the easiest languages to learn and has a large developer community. Because of its focus on simplicity and readability, and a low learning curve, it's an ideal tool for beginning programmers. It is great for performing high-computation tasks on vast amounts of data, which can otherwise be extremely time-consuming and cumbersome. Python provides libraries like NumPy and Pandas, which ease the task through the use of parallel processing. It has in-built functions for almost all of the frequently used concepts. Python supports multiple programming paradigms, such as object-oriented, imperative, functional, and procedural, making it suitable for a wide variety of use cases. Now let's look at some of the reasons that make Python one of the fastest-growing programming languages in the world today. It is easy to learn: with Python, you have the advantage of using fewer lines of code to accomplish tasks compared to other languages. It is open source: Python is free and uses a community-based model for development. It runs on Windows and Linux environments and can be ported to multiple platforms. It has widespread community support, with plenty of useful analytics libraries available. It has several open-source libraries for data manipulation, data visualization, statistics, and mathematics, to name just a few. Its vast array of libraries and functionalities also includes Pandas for data cleaning and analysis, NumPy and SciPy for statistical analysis, Beautiful Soup and Scrapy for web scraping, Matplotlib and Seaborn to visually represent data in the form of bar graphs, histograms, and pie charts, and OpenCV for image processing.
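A small illustration of two of the libraries just mentioned: pandas for a quick summary and Matplotlib for a bar chart. The sales figures are made up purely for the example.

```python
# Made-up regional sales figures, used only to illustrate the libraries.
import pandas as pd
import matplotlib.pyplot as plt

sales = pd.DataFrame({
    "region": ["North", "South", "East", "West"],
    "units":  [120, 95, 143, 88],
})

print(sales.describe())           # quick descriptive statistics

sales.plot(kind="bar", x="region", y="units", legend=False)
plt.ylabel("Units sold")
plt.title("Units sold by region")
plt.show()
```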
R is an open-source programming language and environment for data analysis, data visualization, machine learning, and statistics, widely used for developing statistical software and performing data analytics. It is especially known for its ability to create compelling visualizations, giving it an edge over some of the other languages in this space. Some of the key benefits of R include the following. It is an open-source, platform-independent programming language. It can be paired with many programming languages, including Python. It is highly extensible, which means developers can continue to add functionalities by defining new functions. It facilitates the handling of structured as well as unstructured data, which means it has a more comprehensive data capability. It has libraries such as ggplot2 and Plotly that offer aesthetic graphical plots to their users. You can make reports with the data and scripts embedded in them, and also interactive web apps that allow users to play with the results and the data. It is dominant among other programming languages for developing statistical tools. Java is an object-oriented, class-based, and platform-independent programming language originally developed by Sun Microsystems. It is among the top-ranked programming languages used today. Java is used in a number of processes all through data analytics, including cleaning data, importing and exporting data, statistical analysis, and data visualization. In fact, most of the popular frameworks and tools used for big data are typically written in Java, such as Hadoop, Hive, and Spark. It is perfectly suited for speed-critical projects. A Unix or Linux shell script is a computer program written for the Unix shell: it is a series of Unix commands written in a plain text file to accomplish a specific task. Writing a shell script is fast and easy. It is most useful for repetitive tasks that may be time-consuming to execute by typing one line at a time. Typical operations performed by shell scripts include file manipulation, program execution, system administration tasks such as disk backups and evaluating system logs, installation scripts for complex programs, executing routine backups, and running batches. PowerShell is a cross-platform automation tool and configuration framework by Microsoft that is optimized for working with structured data formats such as JSON, CSV, and XML, as well as REST APIs, websites, and office applications. It consists of a command-line shell and scripting language. PowerShell is object-based, which makes it possible to filter, sort, measure, group, compare, and perform many more actions on objects as they pass through a data pipeline. It is also a good tool for data mining, building GUIs, and creating charts, dashboards, and interactive reports. [Music] A data repository is a general term used to refer to data that has been collected, organized, and isolated so that it can be used for business operations or mined for reporting and data analysis. It can be a small or large database infrastructure with one or more databases that collect, manage, and store data sets. In this video, we will provide an overview of the different types of repositories your data might reside in, such as databases, data warehouses, and big data stores, and examine them in greater detail in further videos. Let's begin with databases. A database is a collection of data, or information, designed for the input, storage, search and retrieval, and modification of data. And a database management system, or DBMS, is a set of programs that creates and maintains the database. It allows you to store, modify, and extract information from the database using a function called querying. For example, if you want to find customers who have been inactive for six months or more, using the querying function, the database management system will retrieve data of all customers from the database that have been inactive for six months or more. Even though a database and a DBMS mean different things, the terms are often used interchangeably. There are different types of databases. Several factors influence the choice of database, such as the data type and structure, querying mechanisms, latency requirements, transaction speeds, and intended use of the data. It's important to mention two main types of databases here: relational and non-relational databases. Relational databases, also referred to as RDBMSes, build on the organizational principles of flat files, with data organized into a tabular format with rows and columns following a well-defined structure and schema. However, unlike flat files, RDBMSes are optimized for data operations and querying involving many tables and much larger data volumes. Structured Query Language, or SQL, is the standard querying language for relational databases. Then we have non-relational databases, also known as NoSQL, or "not only SQL". Non-relational databases emerged in response to the volume, diversity, and speed at which data is being generated today, mainly influenced by advances in cloud computing, the Internet of Things, and social media proliferation. Built for speed, flexibility, and scale, non-relational databases made it possible to store data in a schema-less or free-form fashion. NoSQL is widely used for processing big data. A data warehouse works as a central repository that merges information coming from disparate sources and consolidates it, through the extract, transform, and load process, also known as the ETL process, into one comprehensive database for analytics and business intelligence. At a very high level, the ETL process helps you to extract data from the different data sources, transform the data into a clean and usable state, and load the data into the enterprise's data repository. Related to data warehouses are the concepts of data marts and data lakes, which we will cover later. Data marts and data warehouses have historically been relational, since much of the traditional enterprise data has resided in RDBMSes. However, with the emergence of NoSQL technologies and new sources of data, non-relational data repositories are also now being used for data warehousing. Another category of data repositories are big data stores, which include distributed computational and storage infrastructure to store, scale, and process very large data sets. Overall, data repositories help to isolate data and make reporting and analytics more efficient and credible, while also serving as a data archive. [Music] A relational database is a collection of data organized into a table structure, where the tables can be linked, or related, based on data common to each. Tables are made of rows and columns, where the rows are the records and the columns the attributes. Let's take the example of a customer table that maintains data about each customer in a company. The columns, or attributes, in the customer table are the customer ID, customer name, customer address, and customer primary phone, and each row is a customer record. Now let's understand what we mean by tables being linked, or related, based on data common to each. Along with the customer table, the company also maintains transaction tables that contain data describing multiple individual transactions pertaining to each customer. The columns for the transaction table might include the transaction date, customer ID, transaction amount, and payment method. The customer table and the transaction tables can be related based on the common customer ID field. You can query the customer table to produce reports such as a customer statement that consolidates all transactions in a given period. This capability of relating tables based on common data enables you to retrieve an entirely new table from data in one or more tables with a single query. It also allows you to understand the relationships among all available data and gain new insights for making better decisions. Relational databases use Structured Query Language, or SQL, for querying data; we'll learn more about SQL later in this course.
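A sketch of relating the customer and transaction tables on the common customer ID field, as just described, using Python's sqlite3 module. The rows and column names are invented for illustration.

```python
# Illustrative customer/transaction tables joined on customer_id.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customer (customer_id INTEGER, name TEXT)")
cur.execute("CREATE TABLE transactions (customer_id INTEGER, tx_date TEXT, amount REAL)")
cur.execute("INSERT INTO customer VALUES (1, 'Asha')")
cur.executemany("INSERT INTO transactions VALUES (?, ?, ?)",
                [(1, "2021-01-05", 40.0), (1, "2021-01-19", 25.5)])

# A customer statement consolidating all transactions in a given period
cur.execute("""
    SELECT c.name, COUNT(*) AS tx_count, SUM(t.amount) AS total
    FROM customer c
    JOIN transactions t ON c.customer_id = t.customer_id
    WHERE t.tx_date BETWEEN '2021-01-01' AND '2021-01-31'
    GROUP BY c.name
""")
print(cur.fetchall())
conn.close()
```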
relational databases build on the organizational principles of flat files such as spreadsheets with data organized into rows and columns following a well-defined structure and schema but this is where the similarity ends relational databases by Design are ideal for the optimized storage retrieval and processing of data for large volumes of data unlike spreadsheets that have a limited number of rows and columns each table in a relational database has a unique set of rows and columns and relationships can be defined between tables which minimizes data redundancy moreover you can restrict database fields to specific data types and values which minimizes irregularities and leads to Greater consistency and data Integrity relational databases use SQL for querying data which gives you the advantage of processing millions of records and retrieving large amounts of data in a matter of seconds moreover the security architecture of relational databases provides controlled access to data and also ensures that the standards and policies for governing data can be enforced relational databases range from small desktop systems to massive cloud-based systems they can be either open source and internally supported open source with commercial support or commercial closed-source systems IBM db2 Microsoft SQL Server MySQL Oracle database and postgresql are some of the popular relational databases cloud-based relational databases also referred to as database as a service are gaining wide use as they have access to the Limitless compute and storage capabilities offered by the cloud some of the popular Cloud relational databases include Amazon relational database service or RDS Google Cloud SQL IBM db2 on cloud Oracle cloud and SQL Azure rdbms is a mature and well-documented Technology making it easy to learn and find qualified talent one of the most significant advantages of the relational database approach is its ability to create meaningful information by joining tables some of its other advantages include flexibility using SQL you can add new columns add new tables rename relations and make other changes while the database is running and queries are happening reduced redundancy relational databases minimize data redundancy for example the information of a customer appears in a single entry in the customer table and the transaction table pertaining to the customer stores a link to the customer table ease of backup and Disaster Recovery relational databases offer easy export and import options making backup and restore easy exports can happen while the database is running making restore on failure easy cloud-based relational databases do continuous mirroring which means the loss of data on restore can be measured in seconds or less acid compliance acid stands for atomicity consistency isolation and durability acid compliance implies that the data in the database remains accurate and consistent despite failures and database transactions are processed reliably now we'll look at some use cases for relational databases online transaction processing oltp applications are focused on transaction oriented tasks that run at high rates relational databases are well suited for oltp applications because they can accommodate a large number of users they support the ability to insert update or delete small amounts of data and they also support frequent queries and updates as well as fast response times data warehouses in a data warehousing environment relational databases can be optimized for online analytical processing or olap where historical data is analyzed for business intelligence
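The atomicity part of acid compliance mentioned above can be illustrated with a short sqlite3 sketch; the accounts table and the business rule are invented for this example, and isolation and durability are not covered here.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES (1, 500.0), (2, 200.0)")
conn.commit()

def transfer(amount):
    # Both updates succeed together or neither is applied (atomicity).
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE account_id = 1", (amount,))
            if amount > 500:
                raise ValueError("insufficient funds")  # simulated business-rule failure
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE account_id = 2", (amount,))
    except ValueError:
        pass  # the rollback has already restored the original balances

transfer(600.0)
print(conn.execute("SELECT * FROM accounts").fetchall())  # unchanged: [(1, 500.0), (2, 200.0)]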
iot Solutions internet of things or iot Solutions require speed as well as the ability to collect and process data from Edge devices which need a lightweight database solution this brings us to the limitations of rdbms rdbms does not work well with semi-structured or unstructured data and is therefore not suitable for extensive analytics on such data for migration between two rdbms's schemas and type of data need to be identical between the source and destination tables relational databases have a limit on the length of data fields which means if you try to enter more information into a field than it can accommodate the information will not be stored despite the limitations and the evolution of data in these times of Big Data cloud computing iot devices and social media rdbms continues to be the predominant technology for working with structured data [Music] no SQL which stands for not only SQL or sometimes non-sql is a non-relational database design that provides flexible schemas for the storage and retrieval of data no SQL databases have existed for many years but have only recently become more popular in the era of cloud big data and high volume web and mobile applications they are chosen today for their attributes around scale performance and ease of use it's important to emphasize that the no in no SQL is an abbreviation for not only and not the actual word no no SQL databases are built for specific data models and have flexible schemas that allow programmers to create and manage modern applications they do not use a traditional row column table database design with fixed schemas and typically do not use the structured query language or SQL to query data although some may support SQL or SQL like interfaces no SQL allows data to be stored in a schema-less or free form fashion any data be it structured semi-structured or unstructured can be stored in any record based on the model being used for storing data there are four common types of no SQL databases key Value Store document based column based and graph based key Value Store data in a key value database is stored as a collection of key value pairs the key represents an attribute of the data and is a unique identifier both keys and values can be anything from simple integers or strings to complex Json documents key value stores are great for storing user session data and user preferences making real-time recommendations and targeted advertising and in-memory data caching however if you want to be able to query the data on specific data values need relationships between data values or need to have multiple unique Keys a key value store may not be the best fit Redis Memcached and DynamoDB are some well-known examples in this category document based document databases store each record and its Associated data within a single document they enable flexible indexing powerful ad-hoc queries and analytics over collections of documents document databases are preferable for e-commerce platforms medical records storage CRM platforms and analytics platforms however if you're looking to run complex search queries and multi-operation transactions a document-based database may not be the best option for you mongodb documentdb couchdb and cloudant are some of the popular document-based databases column based column based models store data in cells grouped as Columns of data instead of rows a logical grouping of columns that is columns that are usually accessed together is called a column family for example a customer's name and
profile information will most likely be accessed together but not their purchase history so customer name and profile information data can be grouped into a column family since column databases store all cells corresponding to a column as a continuous disk entry accessing and searching the data becomes very fast column databases can be great for systems that require heavy write requests storing time series data weather data and iot data but if you need to use complex queries or change your querying patterns frequently this may not be the best option for you the most popular column databases are Cassandra and hbase graph based graph-based databases use a graphical model to represent and store data they are particularly useful for visualizing analyzing and finding connections between different pieces of data the circles are nodes and they contain the data the arrows represent relationships graph databases are an excellent choice for working with connected data which is data that contains lots of interconnected relationships graph databases are great for social networks real-time product recommendations Network diagrams fraud detection and access management but if you want to process High volumes of transactions it may not be the best choice for you because graph databases are not optimized for large volume analytics queries neo4j and Cosmos DB are some of the more popular graph databases no SQL was created in response to the limitations of traditional relational database technology the primary advantage of no SQL is its ability to handle large volumes of structured semi-structured and unstructured data some of its other advantages include the ability to run as distributed systems scaled across multiple data centers which enables them to take advantage of cloud computing infrastructure an efficient and cost effective scale out architecture that provides additional capacity and performance with the addition of new nodes and simpler design better control over availability and improved scalability that enables you to be more agile more flexible and to iterate more quickly to summarize the key differences between relational and non-relational databases rdbms schemas rigidly Define how all data inserted into the database must be typed and composed whereas no SQL databases can be schema agnostic allowing unstructured and semi-structured data to be stored and manipulated maintaining high-end commercial relational database Management Systems is expensive whereas no SQL databases are specifically designed for low-cost commodity Hardware relational databases unlike most nosql support acid compliance which ensures reliability of transactions and crash recovery rdbms is a mature and well-documented technology which means the risks are more or less perceivable as compared to no SQL which is a relatively newer technology nonetheless no SQL databases are here to stay and are increasingly being used for Mission critical applications foreign [Music] earlier in the course We examined databases data warehouses and Big Data stores now we'll go a little deeper in our exploration of data warehouses data Marts and data lakes and also learn about the ETL process and data Pipelines a data warehouse works like a multi-purpose storage for different use cases by the time data comes into the warehouse it has already been modeled and structured for a specific purpose meaning it is analysis ready as an organization you would opt for a data warehouse when you have massive amounts of data from your operational systems that 
need to be readily available for reporting and Analysis data warehouses serve as the single source of Truth storing current and historical data that has been cleansed conformed and categorized a data warehouse is a multi-purpose enabler of operational and performance Analytics a data Mart is a subsection of the data warehouse built specifically for a particular business function purpose or community of users the idea is to provide stakeholders data that is most relevant to them when they need it for example the sales or Finance teams accessing data for their quarterly reports and projections since a data Mart offers analytical capabilities for a restricted area of the data warehouse it offers isolated security and isolated performance the most important role of a data Mart is business specific reporting and Analytics a data lake is a storage repository that can store large amounts of structured semi-structured and unstructured data in their native format classified and tagged with metadata so while a data warehouse stores data processed for a specific need a data lake is a pool of raw data where each data element is given a unique identifier and is tagged with meta tags for further use you would opt for a data Lake if you generate or have access to large volumes of data on an ongoing basis but don't want to be restricted to specific or predefined use cases unlike data warehouses a data Lake would retain all Source data without any exclusions and the data could include all types of data sources and data types data Lakes are sometimes also used as a staging area of a data warehouse the most important role of a data lake is in predictive and advanced Analytics now we come to the process that is at the heart of gaining value from data the extract transform and load process or ETL ETL is how raw data is converted into analysis ready data it is an automated process in which you gather raw data from identified sources extract the information that aligns with your reporting and Analysis needs clean standardize and transform that data into a format that is usable in the context of your organization and load it into a data repository while ETL is a generic process the actual job can be very different in usage utility and complexity extract is the step where data from Source locations is collected for transformation data extraction could be through batch processing meaning Source data is moved in large chunks from the source to the Target system at scheduled intervals tools for batch processing include Stitch and Blendo stream processing which means Source data is pulled in real time from the source and transformed while it is in transit and before it is loaded into the data repository tools for stream processing include Apache Samza Apache Storm and Apache Kafka transform involves the execution of rules and functions that convert raw data into Data that can be used for analysis for example making date formats and units of measurement consistent across all Source data removing duplicate data filtering out data that you do not need enriching data for example splitting full name into first middle and last names establishing key relationships across tables applying business rules and data validations load is the step where processed data is transported to a destination system or data repository it could be initial loading that is populating all the data in the Repository incremental loading that is applying ongoing updates and modifications as needed periodically or full refresh that is erasing contents of one or more tables and reloading with fresh data load verification which includes data checks for missing or null values server performance and monitoring load failures is an important part of this process step it is vital to keep an eye on load failures and ensure the right recovery mechanisms are in place
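Here is a compact sketch of the extract, transform, and load steps just described, written with pandas and sqlite3; the file names, column names, and cleanup rules are illustrative assumptions rather than a standard recipe.

import sqlite3
import pandas as pd

# Extract: pull raw data from a source file (a hypothetical sales.csv).
raw = pd.read_csv("sales.csv")

# Transform: standardize date formats, remove duplicates, filter out rows you do not need,
# and enrich by splitting a full name into first and last name columns (assumes the
# full_name column always contains a space).
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
raw = raw.drop_duplicates()
raw = raw[raw["amount"] > 0]
raw[["first_name", "last_name"]] = raw["full_name"].str.split(" ", n=1, expand=True)

# Load: write the analysis-ready table into a data repository (SQLite here for simplicity).
with sqlite3.connect("warehouse.db") as conn:
    raw.to_sql("sales_clean", conn, if_exists="replace", index=False)

A real job would add load verification and recovery steps around the final write, as the narration notes.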
ETL has historically been used for batch workloads on a large scale however with the emergence of streaming ETL tools they are increasingly being used for real-time streaming event data as well it's common to see the terms ETL and data pipelines used interchangeably and although both move data from source to destination data pipeline is a broader term that encompasses the entire journey of moving data from one system to another of which ETL is a subset data pipelines can be architected for batch processing for streaming data and a combination of batch and streaming data in the case of streaming data data processing or transformation happens in a continuous flow this is particularly useful for data that needs constant updating such as data from a sensor monitoring traffic a data pipeline is a high performing system that supports both long-running batch queries and smaller interactive queries the destination for a data pipeline is typically a data Lake although the data may also be loaded to different Target destinations such as another application or a visualization tool there are a number of data pipeline Solutions available most popular among them being Apache Beam and Dataflow [Music] in this Digital World everyone leaves a trace from our travel habits to our workouts and entertainment the increasing number of Internet connected devices that we interact with on a daily basis record vast amounts of data about us there's even a name for it big data Ernst and Young offers the following definition Big Data refers to the dynamic large and disparate volumes of data being created by people tools and machines it requires new Innovative and scalable technology to collect host and analytically process the vast amount of data gathered in order to derive real-time business insights that relate to Consumers risk profit performance productivity management and enhanced shareholder value there is no one definition of big data but there are certain elements that are common across the different definitions such as velocity volume variety veracity and value these are the v's of Big Data velocity is the speed at which data accumulates data is being generated extremely fast in a process that never stops near or real-time streaming local and cloud-based Technologies can process information very quickly volume is the scale of the data or the increase in the amount of data stored drivers of volume are the increase in data sources higher resolution sensors and scalable infrastructure variety is the diversity of the data structured data fits neatly into rows and columns in relational databases while unstructured data is not organized in a predefined way like tweets blog posts pictures numbers and video variety also reflects that data comes from different sources machines people and processes both internal and external to organizations drivers are mobile technologies social media wearable Technologies geotechnologies video and many many more veracity is the quality and origin of data and its Conformity to facts and accuracy attributes include consistency completeness integrity and ambiguity drivers include cost and the need for traceability with the large amount of data available the debate rages on about the accuracy
of data in the digital age is the information real or is it false value is our ability and need to turn data into value value isn't just profit it may have medical or social benefits as well as customer employee or personal satisfaction the main reason that people invest time to understand big data is to derive value from it let's look at some examples of the V's in action velocity every 60 seconds hours of footage are uploaded to YouTube which is generating data think about how quickly data accumulates over hours days and years volume the world population is approximately 7 billion people and the vast majority are now using digital devices mobile phones desktop and laptop computers wearable devices and so on these devices all generate capture and store data approximately 2.5 quintillion bytes every day that's the equivalent of 10 million Blu-ray DVDs variety let's think about the different types of data text pictures film Sound Health Data from wearable devices and many different types of data from devices connected to the Internet of Things veracity eighty percent of data is considered to be unstructured and we must devise ways to produce reliable and accurate insights the data must be categorized analyzed and visualized data scientists today derive insights from Big Data and cope with the challenges that these massive data sets present the scale of the data being collected means that it's not feasible to use conventional data analysis tools however alternative tools that leverage distributed computing power can overcome this problem tools such as Apache spark Hadoop and its ecosystem provides ways to extract load analyze and process the data across distributed compute resources providing new insights and knowledge this gives organizations more ways to connect with their customers and enrich the services they offer so next time you strap on your Smartwatch unlock your smartphone or track your workout remember your data is starting a journey that might take it all the way around the world through Big Data analysis and back to you foreign the big data processing Technologies provide ways to work with large sets of structured semi-structured and unstructured data so that the value can be derived from Big Data in some of the other videos we discussed Big Data Technologies such as nosql databases and data lakes in this video we are going to talk about three open source Technologies and the role they play in big data analytics Apache Hadoop Apache Hive and Apache spark Hadoop is a collection of tools that provides distributed storage and processing of Big Data Hive is a data warehouse for data query and Analysis built on top of Hadoop spark is a distributed data analytics framework designed to perform complex data analytics in real time Hadoop a Java based open source framework allows distributed storage and processing of large data sets across clusters of computers in Hadoop distributed system a node is a single computer and a collection of nodes forms a cluster Hadoop can scale up from a single node to any number of nodes each offering local storage and computation Hadoop provides a reliable scalable and cost-effective solution for storing data with no format requirements using Hadoop you can incorporate emerging data formats such as streaming audio video social media sentiment and click stream data along with structured semi-structured and unstructured data not traditionally used in a data warehouse provide real-time self-service access for all stakeholders optimize and streamline costs in 
your Enterprise data warehouse by consolidating data across the organization and moving cold data that is data that is not in frequent use to a Hadoop based system one of the four main components of Hadoop is Hadoop distributed file system or hdfs which is a storage system for big data that runs on multiple commodity hardware servers connected through a network hdfs provides scalable and reliable big data storage by partitioning files over multiple nodes it splits large files across multiple computers allowing parallel access to them computations can therefore run in parallel on each node where data is stored it also replicates file blocks on different nodes to prevent data loss making it fault tolerant let's understand this through an example consider a file that includes phone numbers for everyone in the United States the numbers for people with last names starting with an a might be stored on server one B on server 2 and so on with Hadoop pieces of this phone book would be stored across the cluster to reconstruct the entire phone book your program would need the blocks from every server in the cluster hdfs also replicates these smaller pieces into two additional servers by default ensuring availability when a server fails in addition to higher availability this offers multiple benefits it allows the Hadoop cluster to break up work into smaller chunks and run those jobs on all servers in the cluster for better scalability finally you gain the benefit of data locality which is the process of moving the computation closer to the node on which the data resides this is critical when working with large data sets because it minimizes Network congestion and increases throughput some of the other benefits that come from using hdfs include fast recovery from Hardware failures because hdfs is built to detect faults and automatically recover access to streaming data because hdfs supports High data throughput rates accommodation of large data sets because hdfs can scale to hundreds of nodes or computers in a single cluster portability because hdfs is portable across multiple Hardware platforms and compatible with a variety of underlying operating systems Hive is an open source data warehouse software for reading writing and managing large data set files that are stored directly in either hdfs or other data storage systems such as Apache hbase Hadoop is intended for long sequential scans and because Hive is based on Hadoop queries have very high latency which means Hive is less appropriate for applications that need very fast response times Hive is not suitable for transaction processing that typically involves a high percentage of write operations Hive is better suited for data warehousing tasks such as ETL reporting and data analysis and includes tools that enable easy access to data via SQL this brings us to spark a general purpose data processing engine designed to extract and process large volumes of data for a wide range of applications including interactive analytics streams processing machine learning data integration and ETL it takes advantage of in-memory processing to significantly increase the speed of computations and spilling to disk only when memory is constrained spark has interfaces for major programming languages such as Java Scala python R and SQL it can run using its Standalone clustering technology as well as on top of other infrastructures such as Hadoop and it can access data in a large variety of data sources including hdfs And Hive making it highly versatile the ability to process streaming data fast and perform complex analytics in real time is the key use case for Apache Spark
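As a minimal illustration of the Spark interfaces mentioned above, here is a PySpark sketch; it assumes the pyspark package is installed and that a local transactions.csv file with customer_id and amount columns exists, both of which are assumptions made for the example.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session (on a cluster this would point at the cluster manager).
spark = SparkSession.builder.appName("quick-demo").getOrCreate()

# Read a CSV file into a distributed DataFrame and run a simple aggregation.
df = spark.read.csv("transactions.csv", header=True, inferSchema=True)
totals = df.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))
totals.show(5)

spark.stop()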
[Music] at this stage you have an understanding of the problem and the desired outcome you know where you are and where you want to be you also have a well-defined metric you know what will be measured and how it will be measured the next step is for you to identify the data you need for your use case the process of identifying data Begins by determining the information you want to collect in this step you make decisions regarding the specific information you need and the possible sources for this data your goals determine the answers to these questions let's take the example of a product company that wants to create targeted marketing campaigns based on the age group that buys their products the most their goal is to design reach outs that appeal most to this segment and encourage them to further influence their friends and peers into buying these products based on this use case some of the obvious information that you will identify includes the customer profile purchase history location age education profession income and marital status for example to ensure you gain even greater insights into this segment you may also decide to collect the customer complaint data for this segment to understand the kind of issues they face because this could discourage them from recommending your products to know how satisfied they were with the resolution of their issues you could collect the ratings from their customer service surveys taking this step forward you may want to understand how these customers talk about your products on social media and how many of their connections engage with them in these discussions for example the likes shares and comments their posts receive the next step in the process is to define a plan for collecting data you need to establish a time frame for collecting the data you have identified some of the data you need may be required on an ongoing basis and some over a defined period of time for collecting website visitor data for example you may need to have the numbers refreshed in real time but if you're tracking data for a specific event you have a definitive beginning and end date for collecting the data in this step you can also Define how much data would be sufficient for you to reach a credible analysis is the volume defined by the segment for example all customers within the age range of 21 to 30 years or a data set of a hundred thousand customers within the age range of 21 to 30 you can also use this step to define the dependencies risks mitigation plan and several other factors that are relevant to your initiative the purpose of the plan should be to establish the clarity you need for execution the third step in the process is for you to determine your data collection methods in this step you will identify the methods for collecting the data you need you will Define how you will collect the data from your data sources you have identified such as internal systems social media sites or third-party data providers your methods will depend on the type of data the time frame over which you need the data and the volume of data once your plan and data collection methods are finalized you can Implement your data collection strategy and start collecting data you will be making updates to your plan as you go along because conditions evolve as you implement the plan on the ground the data you identify the source of that data and the practices you employ for Gathering the
data have implications for Quality security and privacy none of these are one-time considerations but are relevant through the life cycle of the data analysis process working with data from disparate sources without considering how it measures against the quality metric can lead to failure in order to be reliable data needs to be free of Errors accurate complete relevant and accessible you need to define the quality traits the metric and the checkpoints in order to ensure that your analysis is going to be based on quality data you also need to watch out for issues pertaining to data governance such as security regulation and compliances data governance policies and procedures relate to the usability integrity and availability of data penalties for non-compliance can run into millions of dollars and can hurt The credibility of not just your findings but also your organization another important consideration is data privacy data you collect needs to check the boxes for confidentiality license for use and compliance with mandated regulations checks validations and an auditable trail need to be planned loss of trust in the data used for analysis can compromise the process result in suspect findings and invite penalties identifying the right data is a very important step of the data analysis process done right it will ensure that you are able to look at a problem from multiple perspectives and your findings are credible and reliable [Music] data sources can be internal or external to the organization and they can be primary secondary or third-party sources of data let's look at a couple of examples to understand what we mean by primary secondary and third-party sources of data the term primary data refers to information obtained directly by you from the source this could be from internal sources such as data from the organization's CRM HR or workflow applications it could also include data you gather directly through surveys interviews discussions observations and focus groups secondary data refers to information retrieved from existing sources such as external databases research articles Publications training material and internet searches or financial records available as public data this could also include data collected through externally conducted surveys interviews discussions observations and focus groups third-party data is data you purchase from aggregators who collect data from various sources and combine it into comprehensive data sets purely for the purpose of selling the data now we'll look at some of the different sources from which you could be gathering data databases can be a source of primary secondary and third-party data most organizations have internal applications for managing their processes workflows and customers external databases are available on a subscription basis or for purchase a significant number of businesses have or are currently moving to the cloud which is increasingly becoming a source for accessing real-time information and on-demand insights the web is a source of publicly available data that is available to companies and individuals for free or commercial use the web is a rich source of data available in the public domain these could include textbooks government records papers and articles that are for public consumption social media sites and interactive platforms such as Facebook Twitter Google YouTube and Instagram are increasingly being used to Source user data and opinions businesses are using these data sources for quantitative and qualitative insights
on existing and potential customers sensor data produced by wearable devices smart buildings smart cities smartphones medical devices even household appliances is a widely used source of data data exchange is a source of third-party data that involves the voluntary sharing of data between data providers and data consumers individuals organizations and governments could be both data providers and data consumers the data that is exchanged could include data coming from business applications sensor devices social media activity location data or consumer Behavior data surveys gather information through questionnaires distributed to a select group of people for example gauging the interest of existing customers and spending on an updated version of a product surveys can be web or paper-based census data is also a commonly used source for Gathering household data such as wealth and income or population data for example interviews are a source for Gathering qualitative data such as the participants opinions and experiences for example an interview conducted to understand the day-to-day challenges faced by a customer service executive interviews could be telephonic over the web or face to face observation studies include monitoring participants in a specific environment or while performing a particular task for example observing users navigate an e-commerce site to assess the ease with which they are able to find products and make a purchase data from surveys interviews and observation studies could be available as primary secondary and third-party data data sources have never been as Dynamic and diverse as they are today they're also evolving continuously supplementing your primary data with secondary and third-party data sources can help you explore problems and Solutions in new and meaningful ways foreign [Music] we will learn about the different methods and tools available for Gathering data from the data sources discussed earlier in the course such as databases the web sensor data data exchanges and several other sources leveraged for specific data needs we will also learn about importing data into different types of data repositories SQL or structured query language is a querying language used for extracting information from relational databases SQL offers simple commands to specify what is to be retrieved from the database the table from which it needs to be extracted grouping records with matching values dictating the sequence in which the query results are displayed and limiting the number of results that can be returned by the query amongst a host of other features and functionalities non-relational databases can be queried using SQL or SQL like query tools some non-relational databases come with their own querying tools such as cql for Cassandra and graphql for neo4j application programming interfaces or apis are also popularly used for extracting data from a variety of data sources apis are invoked from applications that require the data and access an endpoint containing the data endpoints can include databases web services and data marketplaces apis are also used for data validation for example a data analyst may use an API to validate postal addresses and zip codes web scraping also known as screen scraping or web harvesting is used for downloading specific data from web pages based on defined parameters among other things web scraping is used to extract data such as text contact information images videos podcasts and product items from a web property RSS feeds are another source 
typically used for capturing updated data from online forums and news sites where data is refreshed on an ongoing basis data streams are a popular source for aggregating constant streams of data flowing from sources such as instruments iot devices and applications and GPS data from Cars data streams and feeds are also used for extracting data from social media sites and interactive platforms data exchange platforms allow the exchange of data between data providers and data consumers data exchanges have a set of well-defined exchange standards protocols and formats relevant for exchanging data these platforms not only facilitate the exchange of data but they also ensure that security and governance are maintained they provide data licensing workflows de-identification and protection of personal information legal Frameworks and a quarantined analytics environment examples of popular data exchange platforms include AWS Data Exchange Crunchbase Lotame and Snowflake numerous other data sources can be tapped into for specific data needs for marketing Trends and AD spending for example research firms like Forrester and Business Insider are known to provide reliable data research and advisory firms such as Gartner and Forrester are widely trusted sources for strategic and operational guidance similarly there are many trusted names in the areas of user Behavior data mobile and web usage Market surveys and demographic studies data that has been identified and gathered from various data sources now needs to be loaded or imported into a data repository before it can be wrangled mined and analyzed the importing process involves combining data from different sources to provide a combined View and a single interface using which you can query and manipulate the data depending on the data type the volume of data and the type of destination repository you may need varying tools and methods specific data repositories are optimized for certain types of data relational databases store structured data with a well-defined schema if you're using a relational database as the destination system you will only be able to store structured data such as data from oltp systems spreadsheets online forms sensors Network and web logs structured data can also be stored in nosql semi-structured data is data that has some organizational properties but not a rigid schema such as data from emails XML zipped files binary executables and TCP/IP protocols semi-structured data can be stored in no SQL clusters XML and Json are commonly used for storing and exchanging semi-structured data Json is also the preferred data type for web services unstructured data is data that does not have a structure and cannot be organized into a schema such as data from web pages social media feeds images videos documents media logs and surveys no SQL databases and data Lakes provide a good option to store and manipulate large volumes of unstructured data data Lakes can accommodate all data types and schema ETL tools and data pipelines provide automated functions that facilitate the process of importing data tools such as Talend and Informatica and programming languages such as Python and R and their libraries are widely used for importing data [Music] data wrangling also known as data munging is an iterative process that involves data exploration transformation validation and making it available for a credible and meaningful analysis it includes a range of tasks involved in preparing raw data for a clearly defined purpose where raw data at this stage is data that has been collected through various data sources in a data repository
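The note above about Python and its libraries being widely used for importing data can be sketched briefly; the API URL, the field layout, and the output file are hypothetical placeholders, and the requests and pandas packages are assumed to be available.

import requests
import pandas as pd

# Pull records from a hypothetical REST endpoint that returns JSON.
response = requests.get("https://example.com/api/customers", params={"active": "true"}, timeout=30)
response.raise_for_status()
records = response.json()

# Flatten the JSON into a tabular structure so it can be combined with other sources.
df = pd.json_normalize(records)

# Import the combined view into a local repository file for querying later.
df.to_csv("customers_import.csv", index=False)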
data wrangling captures a range of tasks involved in preparing data for analysis typically it is a four-step process that involves Discovery transformation validation and Publishing the discovery phase also known as the exploration phase is about understanding your data better with respect to your use case the objective is to figure out specifically how best you can clean structure organize and map the data you have for your use case the next phase which is the transformation phase forms the bulk of the data wrangling process it involves the tasks you undertake to transform the data such as structuring normalizing denormalizing cleaning and enriching the data let's begin with the First Transformation task structuring this task includes actions that change the form and schema of your data the incoming data can be in varied formats you might for example have some data coming from a relational database and some data from web apis in order to merge them you will need to change the form or schema of your data this change may be as simple as changing the order of fields within a record or data set or as complex as combining Fields into complex structures joins and unions are the most common structural transformations used to combine data from one or more tables how they combine the data is different joins combine columns when two tables are joined together columns from the First Source table are combined with columns from the Second Source table in the same row so each row in the resultant table contains columns from both tables unions combine rows rows of data from the First Source table are combined with rows of data from the Second Source table into a single table each row in the resultant table is from One Source table or another transformation can also include normalization and denormalization of data normalization focuses on cleaning the database of unused data and reducing redundancy and inconsistency data coming from transactional systems for example where a number of insert update and delete operations are performed on an ongoing basis is highly normalized denormalization is used to combine data from multiple tables into a single table so that it can be queried faster for example normalized data coming from transactional systems is typically denormalized before running queries for reporting and Analysis another transformation type is cleaning cleaning tasks are actions that fix irregularities in data in order to produce a credible and accurate analysis data that is inaccurate missing or incomplete can skew the results of your analysis and needs to be considered it could also be that the data is biased or has null values in relevant fields or has outliers for example you may want to find out the demographic information on the sale of a certain product but the data you have received does not capture the gender you either need to Source this data point and merge it with your existing data set or you may need to remove and not consider the records with this missing field we will explore many more examples of data cleaning further on in this course enriching the data is the fourth type of transformation when you consider the data you have to look at additional data points that could make your analysis more meaningful you are looking at enriching your data for example in a large organization with information fragmented across systems you may need to enrich the data set provided by one system with information available in other systems or even public data sets
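The joins and unions described above, and the kind of enrichment just mentioned, map naturally onto pandas operations; this sketch uses invented tables purely to show the difference between combining columns and combining rows.

import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Asha", "Ben"]})
orders_2022 = pd.DataFrame({"customer_id": [1, 1], "amount": [20.0, 35.0]})
orders_2023 = pd.DataFrame({"customer_id": [2], "amount": [50.0]})

# A union combines rows: stack the two order tables into a single table.
orders = pd.concat([orders_2022, orders_2023], ignore_index=True)

# A join combines columns: each row in the result carries columns from both tables,
# matched on the common customer_id field, which is also how you enrich one data set
# with information held in another.
joined = orders.merge(customers, on="customer_id", how="left")
print(joined)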
consider a scenario where you sell IT peripherals to businesses and want to analyze the buying patterns of your customers over the last five years you have the customer master and transaction tables from where you've captured the customer information and purchase history supplementing your data set with the performance data of these businesses possibly available as a public data set could be valuable for you to understand factors influencing their purchase decisions inserting metadata also enriches data for example Computing a sentiment score from a customer feedback log collecting geo-based weather data from a Resort's location to analyze occupancy Trends or capturing published time and tags for a blog post after transformation the next phase in data wrangling is validation this is where you check the quality of the data post structuring normalizing cleaning and enriching validation rules refer to repetitive programming steps used to verify the consistency quality and security of the data you have this brings us to publishing the fourth phase of the data wrangling process publishing involves delivering the output of the wrangled data for Downstream project needs what is published is the transformed and validated version of the input data set along with the metadata about the data set lastly it is important to note the criticality of documenting the steps and considerations you have taken to convert the raw data to analysis ready data All Phases of data wrangling are iterative in nature in order to replicate the steps and to revisit your considerations for performing these steps it is vital that you document all considerations and actions [Music] in this video we will look at some of the popularly used data wrangling software and tools such as Excel Power Query spreadsheets OpenRefine Google Dataprep Watson Studio Refinery Trifacta Wrangler Python and R let's begin with the most basic software used for manual wrangling spreadsheets spreadsheets such as Microsoft Excel and Google Sheets have a host of features and inbuilt formulas that can help you identify issues clean and transform data add-ins are available that allow you to import data from several different types of sources and clean and transform data as needed such as Microsoft Power Query for Excel and the Query function for Google Sheets OpenRefine is an open source tool that allows you to Import and Export data in a wide variety of formats such as tsv CSV XLS XML and Json using OpenRefine you can clean data transform it from one format to another and extend data with web services and external data OpenRefine is easy to learn and easy to use it offers menu-based operations which means you don't need to memorize commands or syntax Google Dataprep is an intelligent cloud data service that allows you to visually explore clean and prepare both structured and unstructured data for analysis it is a fully managed service which means you don't need to install or manage the software or the infrastructure Dataprep is extremely easy to use with every action that you take you get suggestions on what your ideal Next Step should be Dataprep can automatically detect schemas data types and anomalies Watson Studio Refinery available via IBM Watson Studio allows you to discover cleanse and transform data with built-in operations it transforms large amounts of raw data into consumable quality information that's ready for Analytics data Refinery
offers the flexibility of exporting data residing in a spectrum of data sources it detects data types and classifications automatically and also enforces applicable data governance policies automatically Trifacta Wrangler is an interactive cloud-based service for cleaning and transforming data it takes messy real-world data and cleans and rearranges it into data tables which can then be exported to Excel Tableau and R it is known for its collaboration features allowing multiple team members to work simultaneously python has a huge library and set of packages that offer powerful data manipulation capabilities let's look at a few of these libraries and packages Jupyter Notebook is an open source web application widely used for data cleaning and transformation statistical modeling and data visualization numpy or numerical python is the most basic package that python offers it is fast versatile interoperable and easy to use it provides support for large multi-dimensional arrays and matrices and high-level mathematical functions to operate on these arrays pandas is designed for fast and easy data analysis operations it allows complex operations such as merging joining and transforming huge chunks of data performed using simple single line commands using pandas you can prevent common errors that result from misaligned data coming in from different sources R also offers a series of libraries and packages that are explicitly created for wrangling messy data such as dplyr data.table and jsonlite using these libraries you can investigate manipulate and analyze data dplyr is a powerful library for data wrangling it has a precise and straightforward syntax data.table helps you aggregate large data sets quickly jsonlite is a robust JSON parsing tool great for interacting with web apis tools for data wrangling come with varying capabilities and dimensions your decision regarding the best tool for your needs will depend on factors that are specific to your use case infrastructure and teams such as supported data size data structures cleaning and transformation capabilities infrastructure needs ease of use and learnability [Music] to quote a Gartner report on data quality poor quality data weakens an organization's competitive standing and undermines critical business objectives missing inconsistent or incorrect data can lead to false conclusions and therefore ineffective decisions and in the business world that can be costly data sets picked up from disparate sources could have a number of issues including missing values inaccuracies duplicates incorrect or missing delimiters inconsistent records and insufficient parameters in some cases data can be corrected manually or automatically with the help of data wrangling tools and scripts but if it cannot be repaired it must be removed from the data set although the terms data cleaning and data wrangling are sometimes used interchangeably it is important to keep in mind that data cleaning is only a subset of the entire data wrangling process data cleaning forms a very significant and integral part of the transformation phase in a data wrangling workflow a typical data cleaning workflow includes Inspection Cleaning and verification the first step in the data cleaning workflow is to detect the different types of issues and errors that your data set may have you can use scripts and tools that allow you to define specific rules and constraints and validate your data against these rules and constraints you can also use data profiling and data
visualization tools for inspection data profiling helps you to inspect the source data to understand the structure content and interrelationships in your data it uncovers anomalies and data quality issues for example blank or null values duplicate data or whether the value of a field Falls within the expected range visualizing the data using statistical methods can help you to spot outliers for example plotting the average income in a demographic data set can help you spot outliers that brings us to the actual cleaning of the data the techniques you apply for cleaning your data set will depend on the use case and the type of issues you encounter let's look at some of the more common data issues let's start with missing values missing values are very important to deal with as they can cause unexpected or biased results you can choose to filter out the records with missing values or find a way to source that information in case it is intrinsic to your use case for example missing age data from a demographic study a third option is a method known as imputation which calculates the missing value based on statistical values your decision on the course of action you choose needs to be anchored in what's best for your use case you may also come across duplicate data data points that are repeated in your data set these need to be removed another type of issue you may encounter is that of irrelevant data data that does not fit within the context of your use case can be considered irrelevant data for example if you are analyzing data about the General Health of a segment of the population their contact numbers may not be relevant for you cleaning can involve data type conversion as well this is needed to ensure that values in a field are stored as the data type of that field for example numbers stored as numerical data type or dates stored as a date data type you may also need to clean your data in order to standardize it for example for Strings you may want all values to be in lower case similarly date formats and units of measurement need to be standardized and there may be syntax errors for example white spaces or Extra Spaces at the beginning or end of a string is a syntax error that needs to be rectified this can also include fixing typos or format for example the state name being entered as a full form such as New York versus an abbreviated form such as NY in some records data can also have outliers or values that are vastly different from other observations in the data set outliers may or may not be incorrect for example when an age field in a voters database has the value 5 you know it is incorrect data and needs to be corrected now let's consider a group of people where the annual income is in the range of one hundred thousand to two hundred thousand dollars except for that one person who earns a million dollars a year while the data point is not incorrect it is an outlier and needs to be looked at depending on your use case you may need to decide if including this data will skew the results in a way that does not serve your use case this brings us to the next step in the data cleaning workflow verification in this step you inspect the results to establish Effectiveness and accuracy achieved as a result of the data cleaning operation you need to re-inspect the data to make sure the rules and constraints applicable on the data still hold after the corrections you made and in the end it is important to note that all changes undertaken as part of the data cleaning operation need to be 
documented not just the changes but also the reasons behind making those changes and the quality of the currently stored data reporting how healthy the data is is a very crucial step [Music] thank you in this segment data professionals share what portion of their job involves Gathering cleaning and preparing data for analysis I would say a relatively big proportion of my job involves Gathering preparing and cleaning data for analysis I work at a company with a really great data engineering team so I don't have to do this kind of work as much as some other data scientists do but still any person that is working closely with data be they a data scientist a data analyst a machine learning engineer really needs to get comfortable understanding where the data comes from and inevitably no data set is perfect there's always going to be compromises or small errors so it's really important to spend a significant portion of your time understanding the underlying data that was used to generate the data set and what some potential problems might be with that data my job as a CPA involves a lot of analysis financial statements account activity assessing processes and controls the Gathering piece can be pretty simple as long as the accounting information resides in a general ledger system or a central repository where the data is easy to gather [Music] so you need to prep the data make sure it's accurate make sure things are adding up make sure you have all all months of information so for example in the financial statement I need to make sure that people have given me 12 months worth of make statements and I'm not I'm not missing any data and that if I am that I have enough information to be able to project or to forecast or even to look back to estimate what was done in that month based on what I have and so that is that is definitely how this segment data professionals talk about the steps they take to ensure data is reliable one of the essential steps to making sure your data is reliable is to run summary statistics on individual columns in your data and make sure that they're uh consistent with reality so for example if you have a column somewhere that records visits per month to a website and you run summary statistics on that column you get the minimum the mean the median the Max and you see something funky like one month there's negative visits or something like this you know the data isn't reliable financial information in particular must be reliable it must be non-biased it must be free from error those are just a few of the many attributes that are necessary for data to be relied upon so doing what I call a a logic check before you get into the details of a transaction does it make sense at a high level if you expected top line revenue to increase but you see that it has drastically decreased then figure that part out first is my source correct am I running a query in the right period am I pulling the right general ledger account so start there make sure the basic data Integrity questions have been addressed first once we know that the data is reliable then we can start to Deep dive into the reviews and form conclusions about the financial performance based on our analysis of the data [Music] before we understand statistical analysis its relation to data analysis and specifically data mining let's first examine what statistics is statistics is a branch of mathematics dealing with the collection analysis interpretation and presentation of numerical or quantitative data it's all around us in 
our day-to-day lives whether we're talking about average income average age or highest paid professions it's all statistics today's statistics is being applied across Industries for decision making Based on data for example researchers using statistics to analyze data from the production of vaccines to ensure safety and efficacy or companies using statistics to reduce customer churn by gaining greater insight into customer requirements now let's look at what statistical analysis is statistical analysis is the application of statistical methods to a sample of data in order to develop an understanding of what that data represents it includes collecting and scrutinizing every data sample in a set of items from which samples can be drawn a sample in statistics is a representative selection drawn from a total population where population is a discrete group of people or things that can be identified by at least one common characteristic for purposes of data collection and Analysis for example in a certain use case population may be all people in a state that have a driving license and a sample of this population that is a part or subset of the population could be men drivers over the age of 50. statistical methods are mainly useful to ensure that data is interpreted correctly and apparent relationships are meaningful and Not Just Happening by chance whenever we collect data from a sample there are two different types of Statistics we can run descriptive statistics to summarize information about the sample and inferential statistics to make inferences or generalizations about the broader population descriptive statistics enables you to present data in a meaningful way allowing simpler interpretation of the data data is described using summary charts tables and graphs without any attempts to draw conclusions about the population from which the sample is taken the objective is to make it easier to understand and visualize raw data without making conclusions regarding any hypotheses that were made for example we want to describe the English test scores in a specific class of 25 students we record the test scores of all students calculate the summary statistics and produce a graph some of the common measures of descriptive statistical analysis include central tendency dispersion and skewness central tendency is locating the center of a data sample some of the common measures of central tendency include mean median and mode these measures tell you where most values in your data set fall so in the earlier example the mean score or the mathematical average of the class of 25 students would be the sum total of the scores of all 25 students divided by 25 that is the number of students if you order the above data set from the smallest score value to the highest score value of the 25 students and pick the middle value that is the value with 12 values to the left and 12 values to the right of a score value that score value would be the median for this data set if 12 students have scored less than 75 percent and 12 students have scored greater than 75 percent then the median is 75. 
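The mean, median, and mode walkthrough above can be reproduced with Python's built-in statistics module; the 25 scores below are invented so that the median works out to 75 as in the narration.

import statistics

# Hypothetical English test scores for a class of 25 students (percentages).
scores = [62, 65, 67, 68, 70, 71, 72, 72, 72, 73, 74, 74, 75,
          76, 77, 78, 79, 80, 81, 83, 85, 87, 89, 91, 94]

print(statistics.mean(scores))    # central tendency: the mathematical average
print(statistics.median(scores))  # the middle value, with 12 scores below and 12 above
print(statistics.mode(scores))    # the most frequently occurring score (72 here)
print(statistics.stdev(scores))   # dispersion: how tightly the scores cluster around the mean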
median is unique for each data set and is not affected by outliers mode is the value that occurs most frequently in a set of observations for example if the most common score in this group of 25 students is 72 percent then that is the mode for this data set so you can see how looking at your data set through these values can help you get a clear understanding of your data set dispersion is the measure of variability in a data set common measures of statistical dispersion are variance standard deviation and range variance defines how far away the data points fall from the center that is the distribution of values when a distribution has lower variability the values in a data set are more consistent however when the variability is higher the data points are more dissimilar and extreme values become more likely understanding variability can help you grasp the likelihood of an event happening standard deviation tells you how tightly your data is clustered around the mean and range gives you the distance between the smallest and largest values in your data sets skewness is the measure of whether the distribution of values is symmetrical around a central value or skewed left or right skewed data can affect which types of analyzes are valid to perform these are some of the basic and most common used descriptive statistic tools but there are other tools as well for example using correlation and Scatter Plots to assess the relationships of paired data the second type of statistical analysis is inferential statistics inferential statistics takes data from a sample to make inferences about the larger population from which the sample was drawn using methods of inferential statistics you can draw generalizations that apply the results of the sample to the population as a whole some common methodologies of inferential statistics include hypothesis testing confidence intervals and regression analysis hypothesis testing for example can be used for studying the effectiveness of a vaccine by comparing outcomes in a control group hypothesis tests can tell you whether the efficacy of a vaccine observed in a control group is likely to exist in the population as well confidence intervals incorporate the uncertainty and Sample error to create a range of values the actual population value is likely to fall Within regression analysis incorporates hypothesis tests that help determine whether the relationships observed in the sample data actually exist in the population rather than just the sample there are various software packages to perform statistical data analysis such as statistical analysis system or SAS statistical package for the social sciences or SPSS and statsoft statistics form the core of data mining by providing measures and methodologies necessary for data mining and identifying patterns that help identify differences between random noise and significant findings both data mining which we will learn more about in this course and statistics as techniques of data analysis help in better decision making [Music] thank you data mining or the process of extracting Knowledge from data is the heart of the data analysis process it is an interdisciplinary field that involves the use of pattern recognition Technologies statistical analysis and mathematical techniques its goal is to identify correlations and data find patterns and variations understand Trends and predict probabilities you'll hear about patterns and Trends frequently in the context of data analysis so let's first understand these Concepts pattern 
recognition is the discovery of regularities or commonalities in data consider the log data for logins to an application in an organization it contains information such as the username login timestamp time spent in each login session and activities performed when we analyze this data to gain insights into the habits or behaviors of users for example the time of the day when maximum users tend to log in or user roles that typically spend the maximum hours logged into the application or modules in the workflow application that are being used we're examining the data manually or through tools to uncover patterns hidden in the data a trend on the other hand is the general tendency of a set of data to change over time for example global warming in the short term like a year-on-year basis temperatures May remain the same or go up or down by a few degrees but the overall global temperatures continue to increase over time making global warming a trend data mining has applications across Industries and disciplines for example profiling customer behaviors needs and disposable income in order to offer targeted campaigns financial institutions tracking customer transactions for unusual behaviors and flagging fraudulent transactions using data mining models the use of statistical models to predict a patient's likelihood for specific health conditions and prioritizing treatment accessing performance data of students to predict achievement levels and make a focused effort to provide support where required helping investigation agencies deploy police force where the likelihood of crime is higher and aligning Supply and Logistics with demand forecasts there are several techniques you can use to detect patterns and build accurate models for Discovery be it descriptive diagnostic predictive or prescriptive modeling let's understand some of the most commonly used techniques classification is a technique that classifies attributes into Target categories for example classifying customers into low medium or high Spenders based on how much they earn clustering is similar to classification but involves grouping data into clusters so they can be treated as groups for example clustering customers based on geographic regions anomaly or outlier detection is a technique that helps find patterns and data that are not normal or unexpected for example spikes in the usage of a credit card that can flag possible misuse Association rule mining is a technique that helps establish a relationship between two data events for example the purchase of a laptop being frequently accompanied by the purchase of a cooling pad sequential patterns is the technique that traces a series of events that take place in a sequence for example tracing a customer's shopping trail from the time they log into an online store to the time they log out Affinity grouping is a technique used to discover co-occurrence in relationships this technique is widely used in online stores for cross-selling and upselling their products by recommending products to People based on the purchase history of other people who purchase the same item decision trees help build classification models in the form of a tree structure with multiple branches where each branch represents a probable occurrence this technique helps to build a clear understanding of the relationship between input and output regression is a technique that helps identify the nature of the relationship between two variables which could be causal or correlational for example based on factors such as 
location and covered area a regression model could be used to predict the value of a house data mining essentially helps separate the noise from the real information and helps businesses Focus their energies on only what is relevant [Music] in this video we will learn about some of the commonly used software and tools for data mining such as spreadsheets R language python IBM SPSS statistics IBM Watson studio and SAS spreadsheets such as Microsoft Excel and Google Sheets are commonly used for performing basic data mining tasks spreadsheets can be used to host data that has been exported from other systems in an easily accessible and easy to read format you can use pivot tables to Showcase specific aspects of your data which is vital when you have huge amounts of data to sort through and analyze they also make it relatively easier to make comparisons between different sets of data add-ins available for Excel such as the Data Mining Client for Excel XLMiner and KnowledgeMiner for Excel allow you to perform common mining tasks such as classification regression Association rules clustering and model building Google Sheets also has an array of add-ons that can be used for analysis and Mining such as text analysis text Mining and Google Analytics R is one of the most widely used languages for performing statistical modeling and computations by statisticians and data miners R is packaged with hundreds of libraries explicitly built for data mining operations such as regression classification data clustering Association rule mining text mining outlier detection and social network analysis some of the popular R packages include tm and twitteR tm a framework for text mining applications within R provides functions for text mining twitteR provides a framework for mining tweets RStudio is a popularly used open source integrated development environment or IDE for working with the R programming language python libraries like pandas and numpy are commonly used for data mining pandas is an open source module for working with data structures and Analysis it is possibly one of the most popular libraries for data analysis in Python it allows you to upload data in any format and provides a simple platform to organize sort and manipulate that data using pandas you can perform basic numerical computations such as mean median mode and range calculate statistics and answer questions regarding correlation between data and distribution of data explore data visually and quantitatively and visualize data with help from other python libraries numpy is a tool for mathematical Computing and data preparation in Python numpy offers a host of built-in functions and capabilities for data mining Jupyter Notebooks have become the tool of choice for data scientists and data analysts when working with python to perform Data Mining and statistical analysis SPSS stands for statistical package for the social sciences while the name suggests its original usage in the field of social sciences it is popularly used for advanced analytics text analytics Trend analysis validation of assumptions and translation of business problems into data science Solutions SPSS is closed source and requires a license for use SPSS has an easy to use interface that requires minimal coding for complex tasks it comprises efficient data management tools and is popular because of its in-depth analysis capabilities and accurate data results
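before moving on to Watson Studio and SAS here is a minimal sketch of what two of the mining techniques described earlier clustering and classification can look like in Python it uses scikit-learn a library that is not covered in this course and a tiny invented customer data set so treat it as an illustration only not as a recipe from the course

import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# made-up customers: annual income (thousands) and yearly spend (thousands)
X = np.array([[25, 2], [30, 3], [28, 2],
              [60, 9], [65, 11], [62, 10],
              [120, 30], [115, 28], [130, 35]])

# clustering: group the customers into 3 clusters without giving the algorithm any labels
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(clusters)

# classification: a decision tree learns the low / medium / high spender categories
labels = ["low", "low", "low", "medium", "medium", "medium", "high", "high", "high"]
tree = DecisionTreeClassifier().fit(X, labels)
print(tree.predict([[70, 12]]))  # predict the category for a new customer

IBM Watson Studio included in the IBM Cloud Pak for Data leverages a collection of open source tools such as Jupyter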
notebooks and extends them with closed Source IBM tools that make it a powerful environment for data analysis and data science it is available through a web browser on the public Cloud private cloud and as a desktop app Watson Studio enables team members to collaborate on projects that can range from simple exploratory analysis to building machine learning and AI models it also includes SPSS modeler flows that enable you to quickly develop predictive models for your business data SAS Enterprise Miner is a comprehensive graphical workbench for data mining it provides powerful capabilities for interactive data exploration which enables users to identify relationships within data SAS can manage information from various sources mine and transform data and analyze statistics it offers a graphical user interface for non-technical users with SAS you can identify patterns in the data using a range of available modeling techniques explore relationships and anomalies in data analyze Big Data validate the reliability of findings from the data analysis process SAS is very easy to use because of its syntax and is also easy to debug it has the ability to handle large databases and offers high security to its users in this video we have learned about just a few of the data mining tools available today your decision regarding the best tool for your needs will be driven by the data size and structure the tool supports the features it offers its data visualization capabilities infrastructure needs ease of use and learnability it's fairly common to use a combination of data mining tools to meet all your needs [Music] thank you the data analysis process begins with understanding the problem that needs to be solved and the desired outcome that needs to be achieved and it ends with communicating the findings in ways that impact decision making data projects are the result of a collaborative effort spread across business functions involving people with multi-disciplinary skills with the findings being incorporated into a larger business initiative the success of your communication depends on how well others can understand and trust your insights to take further action so as data analysts you need to tell the story with your data by visualizing the insights clearly and creating a structured narrative explicitly targeted at your audience before you begin to create the communication you need to reconnect with your audience Begin by asking yourself these questions who is my audience what is important to them what will help them trust me your audience is mostly going to be a diverse group in terms of the business functions they represent whether they play an operational or strategic role in the organization how impacted are they by the problem and other such factors your presentation needs to be framed around the level of information your audience already has based on your understanding of the audience you will decide what and how much information is essential to enable a better understanding of your findings it's tempting to bring out all the data that you've been working with but you have to consider what pieces are more important to your audience than others a presentation is not a data dump facts and figures alone do not influence decisions and move people to action you have to tell a compelling story include only that information as is needed to address the business problem too much information will have your audience struggling to understand the point you're making begin your presentation by demonstrating your 
understanding of the business problem to your audience it's easy to fall back on the assumption that we all know what we're here for but reflecting your understanding of the problem that needs to be solved and the out outcome that needs to be achieved is a great first step in winning their attention and starting with trust speaking in the language of your organization's business domain is another important factor in building a connection between you and your audience the next step in designing your communication is to structure and organize your presentation for Maximum Impact reference the data you have collected remember that the data the very basis of everything that you are communicating is like a black box for the audience if you're unable to establish The credibility of your data people don't know that they can trust your findings share your data sources hypotheses and validations work towards establishing credibility of your findings along the way don't gloss over any key assumptions made during the analysis organize information into logical categories based on the information you have do you have both qualitative and quantitative information for example be deliberate in taking a top-down or bottom-up approach in your narrative both can be effective depends on your audience and use case be consistent in your approach it's important to determine what communication formats will be most useful to your audience do they need to take away an executive summary a fact sheet or a report how is your audience going to use the information you have presented that should determine the formats you choose insights must be explained in a way that inspires action if your audience doesn't grasp the significance of your Insight or are unconvinced of its utility the Insight will not drive any value a thousand word essay will not have the same impact as a visual in creating a clear mental image in the minds of your audience a powerful visualization tells a story through the graphical depiction of facts and figures data visualizations graphs charts diagrams are a great way to bring data to life whether you're showing a comparison a relationship distribution or composition you have tools that can help you show patterns and conclusions about hypotheses data has value through the stories that it tells your audience must be able to trust you understand you and relate to your findings and insights by establishing credibility of your findings presenting the data within a narrative and supporting it through Visual Impressions you can help your audience Drive valuable insights foreign [Music] we will listen to data professionals talk about the role storytelling plays in the life of a data analyst the role of Storytelling in a data analyst life cannot be overstated it is super critical to get really good at storytelling with data I think humans naturally understand the world through stories so if you're trying to convince anyone to do anything with data the first thing you have to do is tell a clear concise compelling story I also think it can be really useful for the data analysts to develop a story anytime they're working with a data set to help themselves under better understand the underlying data set and what it's uh what it's doing there's always going to be a balance between telling a clear coherent Simple Story and making sure you're conveying all the complexities that you might find within the data and I think finding that balance can be really challenging but is really critical The Art of Storytelling is 
significant in the life of a data analyst it doesn't matter how much or what wonderful information you've come up with if you can't find a way to communicate that to your audience whether it's the consumer or a director level or executive level person then it's all for naught you have to find a way to communicate that and it's usually best to do it in a visual or through telling a story so that they understand how that information can be useful I have to say storytelling is an essential skill set it's like the last Mile in delivery a lot of people can handle the technical side through a short period of training however the ability to extract value from data and to communicate it is in short supply if you think about the long-term career I think it's very critical to know how to tell a compelling story with data storytelling is absolutely crucial to data analytics this is how you actually convey your message everyone can show numbers but if you don't have a story around it if you don't have a compelling reason to act then ultimately what you're presenting isn't going to resonate with your audience they did a study at Stanford where they had people present their pitches and in those pitches they had simply KPIs numbers statistics but they also told the story the audience members were then quizzed after the fact on what they remembered from each of those presentations and it was those stories that stuck with them yes there were still facts and figures contained within the story but that is the way that you drive it home having that emotional connection to the story to the understanding to the data is really how you're going to get people to take the action that you want and need them to take [Music] data visualization is the discipline of communicating information through the use of visual elements such as graphs charts and Maps its goal is to make information easy to comprehend interpret and retain imagine having to look through thousands of rows of data to draw interpretations and compare that to a visual representation of that same data summarizing the findings using data visualization you can provide a summary of the relationships Trends and patterns hidden in the data which would be very hard if not impossible to decipher from a data dump for data visualization to be of value you have to choose the visualization that most effectively delivers your findings to your audience and for that you need to begin by asking yourself some questions what is the relationship that I am trying to establish do I want to compare the relative proportion of the subparts of a whole for example the contribution of different product lines in the total revenue of the company do I want to compare multiple values such as the number of products sold and revenues generated over the last three years or do I want to analyze a single value over time which in this example could mean how the sale of one specific product has changed over the last three years do I need my audience to see the correlation between two variables the correlation between weather conditions and bookings at a ski resort for example do I want to detect anomalies in data for example finding values in data that could potentially skew the findings what is the question I'm trying to answer is not just an overarching question in the data visualization design process you need to be able to answer this question for your audience with every data set and information that you visualize you also need to consider whether the visualization needs to be static or
interactive an interactive visualization for example can allow you to change values and see the effects on a related variable in real time so think about the key takeaway for your audience anticipate their information needs and the questions they might have and then plan the visualization that delivers your message clearly and impactfully let's look at some basic examples of the types of graphs you can create for visualizing your data bar charts are great for comparing related data sets or parts of a whole for example in this bar chart you can see the population numbers of 10 different countries and how they compare to one another column charts compare values side by side you can use them quite effectively to show change over time for example showing how page views and user sessions time on your website is changing on a month-to-month basis although alike except for the orientation bar charts and column charts cannot always be used interchangeably for example a column chart may be better suited for showing negative and positive values pie charts show the breakdown of an entity into its subparts and the proportion of the subparts in relation to one another each portion of the pie chart represents a static value or category and the sum of all categories is equal to a hundred percent in this example in a marketing campaign with four marketing channels social sites native advertising paid influencers and Live Events you can see the total number of leads generated per channel line charts display Trends they're great for showing how a data value is changing in relation to a continuous variable for example how has the sale of your product or multiple products changed over time where time is the continuous variable line charts can be used for understanding Trends patterns and variations in data also for comparing different but related data sets with multiple series data visualization can also be used to build dashboards dashboards organize and display reports and visualizations coming from multiple data sources into a single graphical interface you can use dashboards to monitor Daily Progress or the overall health of a business function or even a specific process dashboards can present both operational and analytical data for example you could have a marketing dashboard from which you monitor your current marketing campaign for reach outs queries generated and sales conversions in real time as part of the same dashboard you could also be seeing how the conversion rate for this campaign compares to the conversion rate of some of the successfully run campaigns in the past dashboards are a great tool to present a bird's eye view of the complete picture while also allowing you to drill down into the next level of information for each parameter dashboards are easy to comprehend by an average user Make collaboration easy between teams and allow you to generate reports on the go using dashboards you can see the result of variations in data and metrics almost instantly and this can help you evaluate a situation from multiple perspectives on the go without having to go back to the drawing board foreign [Music] we will look at some of the most commonly used data visualization software and tools these include spreadsheets jupyter notebook and python libraries rstudio and our shiny IBM cognos analytics Tableau and Microsoft power bi some of these are end-to-end data analytics Solutions While others are specifically for data visualization ranging from free open source tools to commercially available Solutions 
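before looking at the individual tools here is a minimal matplotlib sketch with made-up numbers showing how three of the chart types described earlier a bar chart a pie chart and a line chart can be produced matplotlib is one of the Python libraries covered below

import matplotlib.pyplot as plt

countries = ["A", "B", "C", "D"]
population = [330, 210, 145, 83]                       # bar chart: compare related values
channels = ["social", "native ads", "influencers", "live events"]
leads = [450, 300, 150, 100]                           # pie chart: parts of a whole
months = ["Jan", "Feb", "Mar", "Apr", "May"]
sales = [120, 135, 128, 150, 170]                      # line chart: a trend over time

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))
ax1.bar(countries, population)
ax1.set_title("Population by country")
ax2.pie(leads, labels=channels, autopct="%1.0f%%")
ax2.set_title("Leads by channel")
ax3.plot(months, sales)
ax3.set_title("Sales over time")
plt.tight_layout()
plt.show()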
spreadsheets such as Microsoft Excel and Google Sheets are possibly the most commonly used software to make graphical representations of data sets spreadsheets are easy to learn and have a ton of documentation and video tutorials available online for ready reference Excel provides several chart types ranging from the basic bar line pi and pivot charts to the more advanced options such as scatter charts trend lines Gantt charts waterfall charts and combination charts using which you can combine more than one type of charts Excel also provides recommendations on the best visual representation for your data set to make the charts more presentable you can add a chart title change colors of the elements and add labels to data Google Sheets also offers similar chart types for visualization though Excel does have more inbuilt formula based options than Google Sheets like Excel Google Sheets can help you choose the right visualization all you have to do is highlight the data you wish to visualize and click the chart button and you get a list of suggested charts best suited for your data charts and reports automatically update in Excel as well as in Google Sheets as the underlying data is changed Google Sheets is preferred over Excel where multiple users need to collaborate Jupiter notebook is an open sourced web application that provides a great way to explore data and create visualizations you don't have to be a python expert to use jupyter notebook python provides a host of libraries that are used for data visualization let's look at a few of those libraries matplotlib is a widely used python data visualization Library it provides different kinds of 2D and 3D plots and the flexibility to create plots in several different ways using matplotlib you can create high quality interactive graphs and plots with just a few lines of code it has a large community support and cross-platform support as it is an open source tool bokeh provides interactive charts and plots and is known for delivering high performance interactivity over large or streaming data sets bokeh offers flexibility for applying interaction layouts and different styling options to visualize it can also transform visualizations written in some of the other python libraries such as matplotlib Seaborn and ggplot Dash is a python framework for creating interactive web-based visualizations using Dash you can build highly interactive web applications using python code while knowledge of HTML and JavaScript is useful it is not a requirement Dash is easily maintainable cross-platform and mobile ready using rstudio you can create basic visualizations such as histograms bar charts line charts box plots and Scatter Plots and advanced visualizations such as heat Maps Mosaic Maps 3D graphs and corelograms shiny is an r package that helps build interactive web apps that you can host as Standalone apps on a web page these web apps seamlessly display our objects such as plots and tables and can be made live to allow access to anyone you can also build dashboards using shiny the ease of working with shiny is what popularized it among data professionals IBM cognos analytics is an end-to-end analytic solution some of the visualization features provided by cognos include importing custom visualizations a forecasting feature that provides time series data modeling and forecasts Based on data presented in corresponding visualizations recommendation for visualizations based on your data conditional formatting which allows you to see the distribution of your 
data and highlight exceptional data points for example highlighting high and low sales numbers over a certain threshold cognos is known for its Superior visualizations and overlaying data on the physical world using its geospatial capabilities Tableau is a software company that produces interactive data visualization products using Tableau products you can create interactive graphs and charts in the form of dashboards and worksheets with drag and drop gestures Tableau also offers the option to publish results in the form of stories you can import R and Python scripts in Tableau and take advantage of its visualization features that are far superior to those of other languages tableau's visualization capabilities are easy and intuitive to use Tableau is compatible with Excel files text files relational databases and Cloud database sources such as Google analytics and Amazon redshift power bi is a cloud-based business and analytics service from Microsoft that enables you to create reports and dashboards it is a powerful and flexible tool known for its speed and efficiency and an easy to use drag and drop interface power bi is compatible with multiple sources including Excel SQL server and cloud-based data repositories which makes it an excellent choice for data professionals power bi provides the ability to collaborate and share customized dashboards and interactive reports securely even on mobiles power bi's dashboard consists of many visualizations on a single page that help you tell your story these visualizations called tiles are pinned to the dashboard the dashboard is interactive which means a change in one tile affects the others when deciding which tools to use you need to consider the ease of use and purpose of the visualization in terms of the tools that are available and the visualization capabilities they offer if you can visualize it you can create it [Music] data professionals talk about the visualization tools they rely on the most and why the visualization tool that I rely on most in my day-to-day life is Cognos Analytics there's a few reasons for this one it allows me to very quickly import a spreadsheet connect to a database and visualize my data whether that's me understanding what I want to look at and dragging the fields on or using our AI assistant to present the data to help me understand and explore what might be interesting in there if it's a new data set I've never worked with before now on top of that I can also go ahead and start to do some more complex things or even just some more robust analysis with our reporting tool to allow me to build out and schedule reports for delivery if I wanted my sales team to have their pipeline report or their sales opportunity report every Monday morning I can set that up once using what we call bursting and then have that sent out automatically every Sunday night so it's waiting for them in the morning on top of this I can start to combine multiple data sources and have the system help me create those joins together and then be able to visualize those all on a simple dashboard that's highly interactive allowing you to filter and sort dynamically as well as share that out with the rest of my organization so that not every user has to go through the same experience we've set up the dashboard once and everyone can then have access to it in terms of visualization tools I rely the most on Looker which is a data visualization tool that sits atop my company's internal database it's similar to Tableau which I've also used in the past
and find it pretty easy to use and the great thing about these data visualization tools like Looker and Tableau is they let everyone throughout the organization regardless of whether or not they're data professionals easily kind of see their data and do basic aggregation or sorting on it a data visualization tool I really rely on for exploratory data analysis is R I've been a big convert in recent years to the effectiveness of doing basic data analysis and data visualization in R particularly using the tidyverse which is a collection of packages that help you really easily load in your data aggregate it at different levels and also quickly and easily visualize it Tableau and power bi are no-brainers they are easy to pick up and very helpful to demonstrate data and as more and more companies and people start to utilize them there are more and more built-in templates and libraries I would say the visualization tools I use would probably be Excel and more just the Microsoft suite just looking at and using the sums and the macros to make sure that the data before I even dive in is clean and makes sense and is prepped for what we need it to be [Music] data analyst job openings exist across industry government and Academia every industry be it Banking and finance Insurance Health Care retail or information technology has space for skilled data analysts these roles are as sought after in large businesses as they are in startups and new Ventures according to Forbes the global big data analytics Market that stood at 37.34 billion US dollars in 2018 is expected to grow at a compound annual growth rate of 12.3 percent from 2019 to 2027 to reach 105.08 billion US Dollars by the year 2027. currently the demand for skilled data analysts far outweighs the supply which means companies are willing to pay a premium to hire skilled data analysts there's a wide variety of job roles available for data analysts to understand the career path that's open to you we will broadly classify the roles into data analyst specialist roles and domain specialist roles data analyst specialist roles are for data analysts who want to stay focused and grow in the technical and functional aspects of their role on this path you could be starting your career as an associate or Junior data analyst and work your way up through analyst senior analyst lead analyst and principal analyst roles the boundaries between these roles the years of experience that qualify you for the next level and the nature of experience you need to gain to move up could vary depending on the industry the size of the organization and how big your team is in smaller teams for example you could be gaining experience in all facets of data analysis from Gathering data all the way through to visualizing and presenting your findings to stakeholders and this may happen within a short span of time in larger teams and organizations roles may typically be bifurcated based on activity which means you could be gaining experience in one specific phase of the process before you move to the next this helps you hone your skills in one part of the process before you move to the next on your journey from an associate data analyst to a lead or principal data analyst you will be continually advancing your technical statistical and analytical skills from a foundational level to an expert level you will be demonstrating your ability to work with a wide-ranging set of tools and platforms different aspects of the data analysis process and a wide variety of use cases
in terms of technical skills you may start off knowing just one querying tool and programming language any one type of data repository or a limited set of visualization tools as you gather more experience you're expected to learn and demonstrate your ability to work with more and more tools languages data repositories and newer Technologies your communication skills presentation skills stakeholder management skills and project management skills all need to be honed and taken up a notch progressively as a lead or principal analyst you may also be responsible for establishing processes in your team making recommendations for software and tools the team should work on upskilling the team and expanding the team to include more profiles in some organizations these responsibilities could be aligned with a manager level person who has risen through the ranks to manage a team of data analysts domain Specialists also known as functional analysts are analysts who acquire specialization in a specific domain and are seen as an authority in their domain such as HR Healthcare sales Finance social media or digital marketing they may not be the most technically skilled people these roles carry titles such as HR analyst marketing analyst sales analyst Healthcare analyst or social media analyst and then there are the analytics enabled job roles these include roles such as project managers marketing managers and HR managers these are jobs where analytics skills lead to Greater efficiency and Effectiveness a fair amount of the data analyst job openings are analytics enabled as more and more organizations rely on data for decision making as a data analyst you also have options for exploring and learning new skills to gain entry into other data professions such as data engineering or data science for example if you're starting off as a junior data analyst and really like working with data lakes and Big Data repositories you can acquire further expertise in these Technologies and evolve your career into becoming a big data engineer if the business side of things excites you more you could similarly explore the skills required for making a lateral move into business analytics or business intelligence Analytics while the data analyst career landscape is very vast the good thing is that you have a plethora of resources available to help you grow to be successful in your journey as a data analyst all you need to do is grab the opportunities you want to pursue or the ones that present themselves to you and learn along the way [Music] in this video we will listen to data professionals talk about how they got into this profession my current role as a data professional did not exist before I took the position I realized that there was a need in our company to provide data in a faster more efficient manner than going to the IS department who would have a meeting to discuss the meeting to have requirements and then they would have an end product that people weren't satisfied with but you had to get at the end of the line and go through the whole process again to get what you were looking for so through filling a need at the company to provide reports in two weeks I put together a company database that has access to more information we have analysts that are now able to meet that unmet need in the company I got into the data professional role by chance I was actually working on my PhD in economics at the University of Illinois Urbana-Champaign when a colleague of mine suggested that a master's in statistics would also be an
excellent value add so that's how I got into the statistics program as well in Illinois but once I started that I was pretty hooked and there was no going back so to speak so in other words my original goal of becoming an economist actually evolved into a career filled with data modeling analytics insight gathering communication visualization and of course underlying all of that data-driven problem solving I got into the data analyst role in a financial data company actually by accident back then my company started to hire Equity data analysts in mainland China and I was very lucky to join the team because they were looking for someone with financial analysis skill sets which I could bring to the table and after that my team started to hire people with technical skill sets like Python R and SQL I've always had a love of numbers and one of the things that happens is when you work with numbers so much they start to tell a story and the ability to look at those numbers and tell that story is what speaks to me and so having always had that love of numbers I have just been always attracted to data analytics and whether it's Excel spreadsheets or whether it is QuickBooks or any sort of data sets that can help Drive the information that we're looking for especially in the financial industry when we're looking at profit and loss and balance sheets and what happens when one company buys another company we're always looking at that data to talk to and speak about the company's history and their future I got my current role as a data scientist straight out of my grad program which was a master's in data science and before my grad program I worked as both a data analyst and an analytics manager [Music] in this video we will listen to data professionals talk about what employers look for in a data analyst employers look for data analysts with integrity during the hiring process I will ask if you had to choose just one would you rather meet a deadline or get the right answer and I'm always looking for someone who would say I want to make sure that the information is right missing a deadline isn't as detrimental as a company making a multi-million dollar decision on wrong information or someone losing their job because it wasn't pulled or it wasn't reported correctly so it's much more important to have integrity I think the number one thing employers look for in data analysts is someone who can communicate clearly if you do the most brilliant analysis in the world but you can't communicate it to external stakeholders then it's really not worth anything so I think that that skill is really sought after I think another thing that companies obviously look for when they look for data analysts is fluency with numbers ability to understand complex analyses ability to understand A/B tests and what the results of A/B tests are saying and the implications of those results I also think increasingly employers are looking for data analysts with really strong SQL skills another thing employers are looking for in data analysts is a growth mindset and willingness to learn because the industry is changing at a really fast pace I think they are looking for the programming skills including Python R and SQL and at the same time they are looking for some personalities whether you are detail-oriented whether you like working with data and whether you are a problem solver and so on and so forth as an employer I hire people all the time what am I looking for we are looking for people who are detail-oriented and who are
somewhat overachievers they don't just want to do what's what's in front of them they want to go further and so we're looking for people who have higher aspirations and who also are able to think outside the box and aren't just going to if I say do ABC they're not just going to do that they're going to do it plus they're going to do it plus think and give me some Alternatives people who are able to troubleshoot if something goes wrong they're not just going to stop and say oh my goodness I need to go talk to my supervisor they're going to say here's a problem here's my thoughts here are two possible solutions on how you can resolve this so that the the job and the company can keep moving forward that's what you want so not just detail-oriented and not just good with numbers you also have to be someone who can think outside the box and be able to problem solve and troubleshoot those are what's going to that's what employers are going to be looking them for now more than ever they look for the ability to know data and by no data we mean several things right be comfortable with it in various formats be able to think about it and by that we mean know what kind of data you want to solve the problems that are at hand so the knowing the data skill is very important problem solving is another very key skill meaning um uh if there is a problem presented to a data analyst they should be able to know how to tackle that problem using data in whatever format it may be sitting in and being able to analyze it and present the insights that will then uh solve the problem they also need to be very um dynamic in that if there are um if they are presented with a very different kind of data set suddenly which looks nothing like it did before they need to be able to adapt to that um change so that that's why the quality of being Dynamic and adaptable is also important they also need to be able to pick up technical skills quickly and by that we mean if there is one kind of um SQL diagram being used in one setting they need to be able to uh uh you know operate under a different Paradigm if there is a place that's using our studio but they know python they need to be able to pick up uh our studio quickly and that kind of thing so being able to learn fast being Dynamic and knowing data those are the few things that employers do look for in a good data analyst [Music] thank you there are various paths you can take to gaining entry into the data analyst field while some employers may ask for an academic degree as a prerequisite even if you don't have a degree you still have several options available to you that can help you gain an entry or even make a lateral move into the field of data analysis let's start with the most obvious path an academic degree in data analytics statistics computer science management information systems or information technology management can start you off with a strong advantage you could alternately enroll in online training programs that can equip you with the required knowledge comprehensive online programs for data analysis are multi-course specializations offered by learning platforms such as Coursera edx and Udacity these courses are designed and delivered by some of the world's best domain experts since you have a fair idea by now of the technical functional and soft skills you need in order to be a data analyst choosing the right learning path should be fairly straightforward as you gather more work experience you can keep advancing your Knowledge and Skills in specific areas for 
example statistics spreadsheets SQL python data visualization problem solving storytelling or making impactful presentations these courses also give you Hands-On assignments and projects which give you a feel for the real world application of your Knowledge and Skills you can even add these projects to your portfolio so if you don't have an academic qualification these courses can help you gain opportunities at an entry level and work your way up as your experience grows now let's look at a scenario where you have a couple of years of experience in a different line of work and want to make a switch into the data analysis field there's a very good chance that you can do that successfully if you plan well since data analysis is a vast field it would be useful for you to First research the knowledge and skills you need the various job opportunities that are available and the growth opportunities available on the path you may be considering you can tap into online resources forums and your network of friends and colleagues to connect with people in this field and gain insights into real-world scenarios if you're currently working in a non-technical role you may consider exploring the domain Specialist or functional analyst path if you're in sales you could consider starting your journey by positioning and Skilling yourself for a sales analyst position you begin with the advantage of Industry experience and skill yourself in other areas such as statistics and programming for example if you're currently working in a technical role you have the ability to quickly pick up the tools and software you need for the data analyst role you're also probably stepping in with the advantage of having a good understanding of the domain or industry you're from for some of the other skills such as problem solving project management communication and storytelling you may already be using these in some capacity in your existing job you can always enhance these skills through trainings online courses communities of practice and forums data analysis is a fast moving field if you're curious open to learning new things and excited about the field you will be able to forge a path forward regardless of the formal qualifications you think you may be missing [Music] in this video we will listen to practicing data professionals talk about the various career options available in this field the whole data related profession today has also become um very uh very colorful very Dynamic evolving all the time and it's also it also presents a lot of range of options to anyone who wants to enter the field of uh you know being a data professional so it ranges from if you were to think of various circles as options starting with a data analyst right um from there you have you can upskill a lot more become a data scientist you can also become a statistician which is what I was when I first started off um you can then further specialize yourself in a specific direction of data in order to become a data engineer or you can start by being a bi analyst or a specialist and then don't go to become a data engineer so in other words either you can do a track of data analyst and data scientists or you can do a track of a bi analyst and a data engineer so those are kind of parallel tracks within the data profession um you can then also go to The Other Extreme where you can become a machine learning engineer an AI engineer and so on so there are many such roles many many such rules that um anyone interested in a in the field of data um can 
really take on a few of the most common career options available to data analysts is to get deeper into the weeds with machine learning and engineering and become a data scientist or a machine learning engineer that focus more on machine learning modeling another career option available to data analyst is to dive deeper into the business they're in and to inform top-level company strategy I think that role is really important and interesting and as it has really evolved in recent years um another path for a data analyst is to start to become a people manager and manage other data analysts and work to triage uh what gets worked on because there's always going to be more uh questions in the organization that can be answered with data than there are people to answer them so a data manager role can be really interesting and critical in terms of making sure the most important pieces of work actually do get worked on you can be a bookkeeper you can be an accountant you could be a CPA you could be this back broker or you or a financial analyst for the government or or a lot of large companies you could be a real estate broker lots of people are great data analysts but to do that you do have to really like numbers and you have to be really detail-oriented if that's not you and numbers don't jump off the page at you data analysts might not be the right thing for you thank you [Music] in this video we will listen to data professionals giving advice to aspiring data analysts one piece of advice I give to aspiring data analyst is keep learning and don't get discouraged uh there is more that's been written about analytics than you could ever learn in a lifetime so don't try to learn everything at once but take your time and make sure every week every month your every year you're constantly learning something new and I think that'll serve you well one piece of advice I've been given uh in my career that I've found to be really really helpful is to consider your career like an uppercase t and you should have broad knowledge the top of the T represents kind of that you should have broad knowledge in a number of different areas although it doesn't have to be deep you should know a little bit at least about a b testing about machine learning about data visualization about SQL about python about r and then the bottom part of the T is you should go really deep on at least one area so there should be one area among the ones I just mentioned where you have a really deep rigorous understanding of it it it is use every job that you have to your advantage meaning something can be found from everything so whether it is looking at your parents budget or asking your parents if you can see the checkbooks or if you work at a fast food restaurant um looking at the numbers how many people are coming in how many how many dollars are being turned over talk to the manager about what's next what the numbers actually mean when you're talking to potential employers have your examples ready so it doesn't have to necessarily be just work experience but your life experience how did you how have analytics how have you used analytics even in your personal life so if you can tell me and talk to me about what you've done personally or professionally and how it relates to what we're doing that will take you a very very long way piece of advice I'd give to aspiring data scientists is to build out a professional portfolio that showcases your data science or data analytics skills and you can do this by looking up fun data sets online and 
analyzing those data sets you can also do that within your job even if your current job isn't to be a data analyst look for opportunities where you can crunch numbers and then that'll just kind of naturally lead you to a nice portfolio or nice wins in terms of data analyst projects my advice to an aspiring data analyst is to follow your passion find a job that meets your needs and gives you Joy during doing it I there's nothing worse than waking up every morning and hating to go to your place of employment there are so many data analyst jobs in various Industries departments there's just so many options that there's no need to take a job just to have a job find something that really fuels your passion and gives you something to get up every morning for foreign [Music] we will listen to women share their experience of being a data professional and their advice to women aspiring to enter this field as a woman in data science I still run up against The Stereotype that this is a man's job I've walked into meetings and had people look disappointed or confused I take that as an opportunity to prove them wrong this isn't a job Just For Men it's for a person who has the Insight the ability and the drive to get the job done and as long as you can possess those skills then there's no reason why anyone can't do anything that they put their mind to whether you're a male or a female whether you're white or black you have the opportunity to prove people wrong by the work that you produce I would say it's it can be tough but you have to find your voice and don't be afraid to use it a lot of times as women we're not able to find our voice or speak up or we're afraid of how people are going to treat us if we speak up but you know it's more important that you be heard and seen not just being loud and wrong but if you have the data to back it up if you have good content and things you want to say don't be afraid to raise your hand and let people know that you are a thinker and that you can get this done because that's going to be important as you progress and the only real way to get ahead is drive and people don't know you have drive if you're too quiet and so if you're just quietly working away in the corner a lot of times people can't see it so speak up make sure your voice is being heard make sure you are being seen as a woman who knows how to how to grow and how to help in the data science field when I started it was um mostly men in my class especially back in grad school um but now I'm seeing that data teams both data science and data engineering teams are filled with a lot of women as well so I would um advise women to continue upskilling um you know so that they're able if they are fond and um if they like um you know a career filled with programming data and problem solving then they should um continue building their technical skill set so that they can represent themselves in the landscape of a data professional as strongly as as possible don't allow your gender to be a crutch still go hard put in the work and show the world your amazing talents there are no roles that are set aside for specific genders if you're fortunate enough to work in a profession that you thoroughly enjoy then go for it foreign do you want to learn how to use spreadsheets and start analyzing data using Excel this course from IBM is designed to help you work with Excel and gives you a good grounding in the cleaning and analyzing of data which are important parts of the skill set required to become a data analyst you will not 
only learn data analysis techniques using spreadsheets but also practice using multiple Hands-On Labs throughout the course in module 1 you will learn about the basics of spreadsheets including spreadsheet terminology the interface and navigating around worksheets and workbooks in module 2 you will learn about selecting data entering and editing data copying and auto filling data formatting data and using functions and formulas in module 3 you will learn about cleaning and wrangling data using a spreadsheet including the fundamentals of data quality and data privacy removing duplicated and inaccurate data removing empty rows removing data inconsistencies and white spaces and using the Flash Fill and text to columns features in module 4 you will learn about analyzing data using spreadsheets including filtering data sorting data using common data analysis functions creating and using pivot tables and creating and using slicers and timelines at the end of this course in module 5 you will complete a series of Hands-On Labs which will guide you on how to create your first deliverable as a data analyst this will involve you understanding what the business scenario is cleaning and preparing your data and analyzing your data you will follow two different business scenarios throughout the course with each using their own data set these different scenarios and data sets will be used in the lesson videos and in the Hands-On labs after completing this course you will be able to understand how spreadsheets can be used as a data analysis tool understand when to use spreadsheets as a data analysis tool and their limitations create a spreadsheet and explain its basic functionality perform data wrangling and data cleaning tasks using Excel analyze data using filter sort and pivot table features within Excel spreadsheets you will also perform some intermediate level data wrangling and data analysis tasks to address a business scenario the course team and other peers are available to help in the course discussion forums in case you require any assistance let's get started with your next video where you will get an introduction to spreadsheets [Music] thank you in this first video of the course we will list some of the common spreadsheet applications available learn about key capabilities of spreadsheets and discuss why spreadsheets might be a useful tool for a data analyst there are several spreadsheet applications available in the marketplace some of them are more widely known and used than others and some are free While others need to be paid for by far the most commonly used spreadsheet application and the most fully featured of them all is Microsoft Excel the desktop version comes in a paid form as part of the office suite and some Microsoft 365 subscriptions but there is also a web-based cut down version called Excel for the web also known as Excel online the online version is free to users with a Microsoft account but does not offer all the advanced features that the desktop version provides the next most popular is Google Sheets which offers a lot though not all of the features that Excel provides and is free with a Google account this is a web-based application and it integrates nicely with other Google Apps such as Google forms Google Analytics and Google data Studio then there is LibreOffice calc a totally free and open source desktop spreadsheet application that offers more basic functionality than Excel or Google Sheets but still has a lot of the tools you need for data analysis such as charts 
conditional formatting and pivot tables other spreadsheet apps include Zoho Sheet a fully featured web-based application that is comparable with Google Sheets OpenOffice Calc Quip for Salesforce Smartsheet which is predominantly for project management and Apple Numbers which is included with Apple devices such as Mac computers and is also available on the App Store for other Apple devices so there are many spreadsheet application options open to you from fully featured to basic from cloud-based to desktop apps from paid-for to free versions it's up to you to decide which one fits your needs and your budget spreadsheets provide several advantages over manual calculation methods for example once you have your formulas correctly written you can be assured that your calculations are accurate and that the calculations will be performed automatically for you spreadsheets also help keep your data organized and easily accessible your data can be easily formatted filtered and sorted to suit your needs if you do make mistakes in your data entry or your calculations you can easily edit them undo them or use error checking tools to help remedy those mistakes and lastly you can analyze data in spreadsheets and create charts graphs and reports to help visualize your data analysis since spreadsheet software for personal computers first appeared on the market in the 1970s with VisiCalc on the Apple II spreadsheets have come a long way in terms of the capabilities and features they now offer businesses from uncomplicated tables and relatively simple computations to powerful tools for analysis management and visualization of enormous sets of data the most common business uses for spreadsheet applications include the following data entry and storage comparing large data sets modeling and planning charting identifying Trends flow charts for business processes tracking business sales Financial forecasting statistical analysis profit and loss accounting budgeting forensic auditing payroll and tax reporting invoicing and scheduling and away from the business side of things other typical uses include personal expenses household budgeting recipe Library Fitness tracking calorie counting and weight monitoring sports leagues such as fantasy football cataloging music libraries and even contact lists shopping lists and Christmas card lists as a data analyst you can use spreadsheets as a tool for your data analysis tasks including collecting and harvesting data from one or more distributed and different sources cleaning data to remove duplicates inaccuracies errors and resolve missing values to improve the quality of the data analyzing data by filtering sorting and interpreting it to determine what useful information can be gleaned from it and visualizing data to help you tell a story about your data analysis findings to key business stakeholders and any other interested parties within your organization in this video we had an introduction to spreadsheets we learned about some common spreadsheet applications what the main capabilities of spreadsheets are and why spreadsheets might be a useful tool for a data analyst in the next video we will look at the basics of spreadsheets including common spreadsheet terminology [Music] now that we have a basic understanding of what spreadsheet software is available and why spreadsheets might be a useful tool for a data analyst let's get started on looking at some of the basics of using a spreadsheet application in these videos we will be using the full desktop version
of excel but the majority of the tasks that we will perform can also be done using Excel online and other spreadsheet applications such as Google Sheets let's first cover some basic spreadsheet terminology when you open Excel you have the option of creating a new blank workbook or opening an existing workbook we're going to choose new and then blank workbook workbooks are the highest level component in Excel and are represented as a DOT xlsx file so when you open an existing workbook or create a new workbook you are in fact working with a DOT xlsx file the workbook contains all your data calculations and functions and contains several other underlying elements that make up a workbook a workbook consists of one or more worksheets Each of which is represented by a tab in Excel each worksheet is given a name which is displayed on the corresponding tab for the worksheet by default each tab is named sheet1 then sheet 2 and so on to make these worksheet tabs more meaningful it is usual to rename them so they make more sense in relation to the worksheet's purpose for example you might call a worksheet January sales or perhaps the name of a region or store or even an office or Department to do this right-click the tab and choose rename instead of right-clicking to rename you can also just double-click the name of a worksheet tab to rename it essentially worksheet tabs can be named anything you want to fit your particular needs to make it easier to understand what that worksheet represents note that a worksheet that is highlighted as the tire sales worksheet tab is here is referred to as the active worksheet if you want to order your worksheets in a different way that is very simple to do either drag a worksheet tab to the left or right and drop it in the place you want which is represented by the Little Black Arrow or if you are not comfortable with dragging and dropping then the longer way of doing that is to right-click the worksheet tab select move or copy and then in the list titled before sheet select where you want your worksheet tab to be placed and click ok every worksheet is made up of a lot of rectangular boxes called cells these cells will contain your data which may be text numbers formulas or calculation results are organized in columns which run vertically down the screen and use a letter system this is column B for instance and rows which run horizontally across the screen and use a numeric system this is row 7 for example each cell is represented by a cell reference which is essentially just a column letter and row number for example if we click somewhere near the center of this worksheet we now have the cell M20 selected this is usually referred to as the active cell this is not only indicated by the highlighted edges of the cell but also if you look in the top left corner of the worksheet you will see its cell reference is noted in the little box here you can see it says M20 one important thing to note here is that cells are always referenced by their column letter first then their row number so column M and row 20. 
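as a quick sketch of how such a reference gets used (the values here are hypothetical and not taken from any sample workbook), typing the formula
=M20
into any other empty cell simply displays whatever value is currently held in column M row 20, and
=M20*2
shows double that value, and in both cases the reference is written column letter first then row number exactly as described above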
the last element of a workbook I want to mention is a cell range this identifies a collection of several cells selected together that could mean a few cells in the same row or the same column or it could mean several rows and columns together this can either be done using the mouse by selecting the first cell then dragging down or across to include other cells or you can use shift plus arrow keys this range of cells is often referred to as an array and it's most commonly used as a reference in calculations and formulas for example if you want to add up all the values in a column between cells D9 and d19 you would specify this cell range within a formula note that cell ranges are notated using a full colon between the cell references so in this example it would be D9 colon d19 or to specify a few cells in the same row it might be D9 colon H9 or to select several rows and columns it might be D9 colon h19 we will see this notation in use later in the course when we start looking at calculations and formulas these cell ranges could even be a reference point to cells contained on another worksheet this is usually referred to as a 3D reference we can now close this workbook and we don't need to save it in this video we learned about some of the basic terminology of spreadsheet elements in the next video we will discuss how to navigate around a spreadsheet how to use the ribbon and menus and how to select data [Music] now that we have a basic understanding of the main elements that make up a worksheet let's see how to move around a spreadsheet get familiar with the ribbon and menus and learn how to select data in a worksheet to open a sample file we click file this opens backstage view here you can create a new workbook or open save or print a workbook you can also access Excel option now we want to open our sample file so we click open and either select it from my recent list or click browse to find the data file we want the first thing we should do is get acquainted with the ribbon and menus notice that on the ribbon at the top we have several tabs some of these tabs may be familiar to you with other office products such as the home insert and view tabs While others might be new to you such as formulas data and powerpivot to make a little more workspace for ourselves we can hide this ribbon by double clicking any Tab and to unhide it we do the same the other option is to use the shortcut key Control Plus F1 the ribbon is organized into groups of buttons to make them easier to find so on the Home tab we have groups for font alignment number Styles and so on some of these groups contain all the available buttons on the ribbon when viewing in full screen such as Styles and cells but other ribbon groups have more options which we access by clicking the little arrow icon in the bottom right corner of the group as can be seen here on the number group for example the next item I want to point out is the quick access toolbar at the top of the screen above the ribbon as the name suggests this is where you can quickly access the tools you use most often you can see we already have some tools in this toolbar such as save undo redo new and open but we can add other tools to the toolbar if we wish so if we click the drop down arrow in the toolbar and then select a tool we use a lot such as sort ascending that will be added and we will also add the sort descending button now we need to be comfortable with moving around a worksheet you can simply use the arrow keys to move left right up and down one cell at a 
time but you can also use page down and page up to move around a bit faster which is especially useful if you have lots of rows of data and to move even quicker up or down a large data sheet use the vertical scroll bar and to move left or right use the horizontal scroll bar again these can be very useful when you have a large data set there are also some useful shortcuts you can use Control Plus home key for example takes you back to the start of the worksheet that is cell A1 Control Plus end takes you to the Cell at the end of your data in the worksheet Control Plus down arrow takes you to the end of the column you're in while Control Plus up Arrow takes you back to the top of that column so a quick way to find out how many rows of data you have in your worksheet is to go to the first cell in your data and press control plus down arrow to see the last row of data so here you can see we have 160 rows now how do we go back to the Top Again Control Plus home will do it so far we have seen how to navigate around our worksheet and its data now we need to look at how we select data this is very important because you often need to select data to move it copy it or select it in a formula the simplest selection is a single cell usually done with a mouse or maybe a directional Arrow key The Next Step Up is to select multiple cells together and this can be done either with a mouse by dragging from one cell to additional adjoining cells or you can use the shift key with directional arrow keys next up is selecting a single column or row which is done simply by selecting the letter at the top of the column or the number on the left of a row then we can progress to selecting multiple columns and rows by clicking the mouse button holding it down and dragging across more columns or if you are not comfortable with dragging you can also select the column first then hold shift plus arrow keys to select multiple columns the same applies to rows 2. 
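to make the selection shortcuts concrete, here is a small keyboard and mouse sketch that works on any sheet with data in its first few columns (the particular columns and rows chosen are arbitrary): clicking the letter B at the top of column B selects the whole column, holding shift and pressing the right arrow twice extends that selection to columns C and D, clicking the number 7 at the left of row 7 selects that entire row, and holding shift and pressing the down arrow twice extends the selection to rows 8 and 9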
however if you have data in non-continuous rows or columns that is not next to each other you can select the first column then use the control key to select another unconnected column such as column C and F here the largest thing you might want to select is the whole worksheet which you can do by clicking in the top left corner of the cells however this selects the entire worksheet including all the empty rows and columns so if you only want the data in your worksheet you can use the shortcut control plus a word of warning When selecting data in cells rows and columns there are three types of cross symbols that you might see when working with selected cells the first one is the large white cross that you see when you select a cell as can be seen here in cell A4 this is the select cross that we have been using already in this video to select cells the second type you might see is when you hover over the bottom edge of a cell and see a thin black cross-type symbol with arrows on each point this is the move symbol and would move the cell data to another location the last type is the small thin black cross that is seen when you hover over the bottom right corner of a cell this is the fill handle or copy symbol and it fills or copies the cell data to another location in this video we learned how to move around a spreadsheet become familiar with the ribbon and menus and learn how to select data in a worksheet in the next videos we will discuss how to enter data how to copy and paste data and how to format data in a spreadsheet [Music] in this video we will listen to several data professionals discuss the advantages and limitations of using spreadsheets as a tool for data analysis let us start with what are the benefits and advantages of using spreadsheets as a tool for data analysis my experience using spreadsheets as a tool for data analysis is somewhat mixed I think they can be really really useful in the right context but using spreadsheets definitely has its limitations so the big Pro of using spreadsheets is you can see all the data cleanly laid out in front of you in a table so I think it's very clear to anyone looking at a spreadsheet exactly what the data is what format it comes in all of that you can just easily visually inspect it as a CPA I use Microsoft Excel on a daily basis and I have done so for the duration of my career the functionalities the pivots the pivot tables the charts Etc but also being able to use formulas my personal favorite is index match for using and it's a pretty simple way to take just thousands of lines of information and sift through all of that to find specifically what you're looking for Excel is really that One-Stop shop where you can perform calculations analyze financial ratios and even export reports out of the Erp that I spoke of earlier to customize it as you need my experience is using spreadsheets is that they're great for simple analysis I will say spreadsheets over the years the process itself has just improved as systems improve as technology improves spreadsheets are the way to go spreadsheets overall when you do have probably anywhere from zero to twenty thousand lines of data it's a good way to go you can really pull out the data whether I'm trying to see how much a client's making per month but they may have you know a thousand transactions all of that's helpful I can use this spread sheet to whittle down what is actually going on per month or if I want to do a sum if or you know if this happens give me this number it's really helpful to be 
able to dig in and wrap your hands around it and take something that on the surface 20 000 lines seems almost unmanageable but if I take it and I massage it put it in a spreadsheet and then sort it filter it make it pretty put it in a pivot table I can get what I need it's just all about not looking at it as being this intimidating thing but making it more manageable and breaking it down into bite-sized chunks spreadsheets are the easiest way to analyze data and present data we don't need any fancy tools or additional software for spreadsheets it's like the commonly utilized language to communicate thank you for that Insight but let's move on to look at the other side of the coin what are the drawbacks and limitations of using spreadsheets as a tool for data analysis I think one of the big cons in terms of analyzing data within spreadsheets is it's really hard to reproduce state so in other words if you load in some data and you filter out some bad values or you impute some missing values there's no way to tell your colleagues or your future self exactly the different steps you took to create that data set or to modify that data set it's almost a dilemma because of the plethora of options available within Excel and all of the functions that are there supposedly to make your life easier but it's nearly impossible to know everything and you can find yourself in what we accountants call analysis paralysis when you're looking at something for too long or you're not well versed in a particular Excel function so you may spend a lot more time energy and effort trying to figure that one thing out and had you done it a different way or maybe a manual way you probably could have gotten to the solution a lot easier and the downside of using spreadsheets is that if you have complex formulas vlookups if statements at times they just stop working and you have to rebuild them so I have found that it's better to use Excel just for simple analysis and for a download of information I love a good spreadsheet I love using Excel and pivot tables to get to the data but I find that if I start to get over 10 or 20 000 lines of data it gets a little tricky and sometimes the spreadsheets will crash so that's when we might move to Access and some of the other tools that we use it's very difficult to handle extremely large data sets in a spreadsheet besides spreadsheets have less flexibility for complicated analysis and presentation [Music] now that you have learned basic spreadsheet terminology and how to navigate your way around worksheets and select data in Excel it's now time to start entering some data first we will look at some of the handy viewing features provided in Excel and then we'll enter some data and then we'll edit that data when you have a lot of data in your worksheet it can be useful to zoom in closer to a specific area of the data the zoom slider at the bottom right corner of the worksheet allows you to do just that you can either click on the plus and minus buttons or drag the slider to select your preferred Zoom value you also have some Zoom controls in the ribbon on The View tab Zoom lets you pick a predefined zoom level or a custom one the 100 button zooms the worksheet back to its original size and zoom to selection enables you to select an area of data and then zoom into that specific selection only if you want to see several areas of your data at the same time while zoomed in you can use the split button this splits the screen into multiple sections and you can scroll each
section separately if you only want two sections you can remove either the horizontal or the vertical split by double clicking on it if you have headings in your columns like a header row then you might want those to remain on screen while you move down the sheet to do that you need to use freeze panes you can freeze only the top row if you wish or if that doesn't suit as is the case here then you can select the row or even just a cell in the row below the row or rows you want to freeze and then select freeze panes you can do a similar thing for columns if you want to freeze them too and you can even freeze both rows and columns at the same time the trick here is to First select the cell that is both one row below where you want to freeze and one column to the right of where you want to freeze in this case that is cell C4 now we can scroll down the worksheet and across the worksheet and we can still see the header row and the manufacturer and model columns now if you have multiple workbooks open notice I said workbooks and not work sheets then you can switch between them by using view switch windows or the faster method is to use the Control Plus F6 shortcut now let's enter some data into our blank worksheet the easiest way to open a new worksheet from within Excel is to click the new button in the quick access toolbar or Control Plus n if you prefer keyboard shortcuts so let's enter some headings across the top of the worksheet this is typically referred to as a header row note that if you press enter after typing data into a cell the next active cell is the one directly below which is not what we want in this case but if we press tab after we enter data in a Cell it selects the next cell along in the row as the active cell now we'll enter some headings and press tab after each entry notice that the text is slightly longer in some of the cells and it either gets partly hidden by the next cell or overlaps it if you click and hold the divider line between two columns you can drag it left and right to resize it manually if you want to do that automatically you can double-click the divider line between two columns as these are going to be headings for our columns let's make them bold now let's add another column between the parts and accessories column simply select the right hand of those two columns then right click and choose insert to put another column to the left of the selected column let's call it servicing sales to tidy up our column width simultaneously we select all the columns from a to e then double-click any of the divider lines between columns this automatically reduces or increases each column's width to fit the data in that column okay now we have some headings let's enter some month data in column A so if we type Jan in cell A2 and press enter Then it takes us to the cell below which is what we want in this case and we can type Feb in cell A3 and so on until we get to December in a13. 
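purely to picture the result (only the servicing sales heading is quoted in the video, so treat the rest of this layout as an assumed example), the sheet now looks roughly like this: the bold headings sit in row 1 across columns A to E with servicing sales inserted between parts and accessories, and column A underneath reads A2 Jan, A3 Feb, A4 Mar and so on down to A13 Dec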
now let's suppose you need to change a couple of your headings you have several ways of editing existing data in a Cell you can either select the cell and just start typing over it or you can select the cell and press F2 on your keyboard to put the cursor at the end of the cell and make your changes or you can simply double click somewhere on the cell and put the cursor at that position in the cell and make your changes and you can even select the cell and then click in the formula bar to edit your cell data now let's do the same for the parts and accessories column headings in this video we learned about some of the viewing options in Excel and we learned how to enter and edit data in cells in the next video we will learn how to copy and fill data and how to format the cells and data in a worksheet [Music] now that we have learned about some of the handy viewing features provided in Excel and entered and edited some data let's discuss how to move copy and fill data and how to format cells and data to suit our needs the first thing we are going to discuss is how to move data so if you select a range of cells in this case the headings in A1 to E1 and then hover over the top or bottom edge of a selected cell you will see the move pointer then you can drag the selection to another place on the worksheet alternatively if you want to copy the data instead you do the same thing but this time you also hold the Ctrl key as you select and drag the selection to another location and you will see the copy pointer if you are not comfortable with dragging you can also use copy and paste menu commands or keyboard shortcuts if you select some data in column A and copy it to the clipboard then you simply select the new location and paste the copied data you can also move or copy between worksheets so let's create a new worksheet then select some data from sheet1 and this time let's use the control plus C keyboard shortcut to copy it to the clipboard then choose the other worksheet and use the control plus V shortcut to paste the data however notice that the column widths are not the same as the original Source data so let's undo that and try another paste option by default when you paste the copied data it uses the column width settings of the destination cells so to paste it and retain the column widths of the source data you choose the special option under the paste command called keep Source column widths as an alternative to having to enter data manually in a worksheet you can use an Excel feature that automatically fills cells with data when it follows a sequential series or pattern the feature is called autofill and it can be especially useful when you need to enter lots of repetitive data into Excel such as date information for example if you enter a month in a Cell even using a shortened version of the name you can use What's called the fill handle to select down to the end of the series and autofill will work out what the series is based on the selected data let's try the same thing with days of the week if you enter Mon for Monday in a Cell then drag the fill handle to use autofill it will determine that you want to enter the days of the week sequentially however if you also enter Wed for Wednesday in the next cell down and select both cells in the series that is a16 and a17 and then drag the fill handle down autofill determines that the sequence has changed to every other day and fills in the data series for you it's important to select all the cells that determine the pattern when using
autofill so that it can best determine what the pattern is in this case cells a16 and a17 a similar thing applies to numerical patterns if you enter 5 in a cell and then use the fill handle to fill the data down the column because the data is not the name of a day or month for example autofill can't determine what the pattern is yet so in this case it just copies the value 5 into every selected cell however if you enter the value 10 in B3 and then use the fill handle to fill data down the column autofill determines that the pattern is incrementing by 5 each time and it fills in the remainder of the data pattern for you now we're going to look at formatting our data and there are essentially two distinct parts to this first there's formatting of the cells themselves with a fill color and a bold border for example and bold text within it and then there's formatting the data in the cells for example making it text format number format or a specific currency or accounting format let's open the car sales worksheet we used previously then select the headings in cell A3 to P3 either using the mouse or you could use the shortcut keys Control Plus shift plus right arrow on the Home tab click the Styles drop down arrow and then select a style color for your cells then you can make the selected cells bold then you select the data in the manufacturer column either using the mouse or the shortcut keys Control Plus shift plus down arrow in the Styles drop down arrow select another style color for the selected cells again you can make the cells bold then you select the data in the model column again either using the mouse or the shortcut keys Control Plus shift plus down arrow in the Styles drop down arrow select another style color for the selected cells this time you could make the selected cells italic and you can also change the font size and style lastly you can select all the other cells in the data by using the mouse or the Control Plus shift plus right arrow then down arrow and apply borders to the data cells now it's time to format the cell data the sales figures in columns C and D can be formatted to display only one decimal place just select the data and click the decrease decimal button we also have an issue with a couple of the car models if you look in cells b129 and b130 where the model name is supposed to be displayed you can see there are actually two dates listed instead and if you look in the number format box the format type is custom this has happened because the model numbers are supposed to be the Saab 9-5 and the Saab 9-3 but when the files were imported from CSV files these two cells must have been incorrectly determined to be date values and not just numbers you can fix this by formatting these two cells as text and then enter the correct values of 9 5 and 9 3. 
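as a one line illustration of why the text format fix works (the exact date shown depends on your regional settings, so treat this as a sketch): in a cell left in the general format, typing 9-5 gets silently converted into the date 5-Sep, whereas in a cell formatted as text the entry 9-5 is kept as the literal characters 9-5, which is what we want for a model name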
the last thing we want to do is format some data as currency if you look at the heading in column f it says it is price in thousands of dollars and cell F4 is using the general format so let's change the format of this column to American currency format we select the column F in this case then select more number formats from the drop down list then we choose the currency option and the correct currency symbol and format and we're done in this video we learned how to move copy and fill data and how to format cells and sell data to suit our needs in the next video we will look at the basics of formulas learn how to perform simple calculations and learn how to select ranges and copy formulas [Music] foreign now that we have learned how to move copy and fill data and how to format cells and data next we will take a look at the basics of formulas including some basic calculations selecting ranges in formulas and how to copy formulas a typical formula is made of several key components the equal sign starts the formula off and lets Excel know you are creating a formula in this cell the next part is the function which performs the calculation for example the sum function adds up the values in referenced cells or cell ranges then comes the reference which is the cell or range of cells you want to include in your calculation and these need to be enclosed in parentheses you also have operators which specify what type of calculation to perform common arithmetic operators include addition subtraction multiplication and division and these are represented by symbols the plus symbol for addition the minus symbol for subtraction the asterisk for multiplication and the forward slash for division there are other types of operators too namely comparison text concatenation and reference you may also use constants in your formulas which as the name suggests are numbers or values which you can enter directly into a formula and which don't change this might be a whole number such as 5 it might be a percentage such as 10 percent or it might even be a date so a typical formula might be equal sign sum open parenthesis B5 asterisk 20 close parenthesis which would take the value you in cell B5 and multiply it by 20. 
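before moving on, here are a few more example formulas showing how those components combine (the cell addresses are arbitrary placeholders, they just need to contain numbers):
=B5+B6 uses two cell references and the addition operator
=B5*10% uses a reference and a percentage constant
=SUM(B5:B10) applies the sum function to a range reference
=SUM(B5:B10)/2 combines a function, a range, the division operator and the constant 2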
let's start with a few basic calculations suppose you want to add up January and February sales of accessories you would start by typing an equal sign which lets Excel know you are entering a formula then you type in the function you wish to use in this case the sum function note the description next you type an open parenthesis then you select your cell range which in this case would be E2 to E3 so you could enter that as E2 comma E3 then a closed parenthesis and press enter and if you wanted to add March sales as well then you would have to extend the cell range to include E4 so you could type E2 comma E3 comma E4 as your range and it will work remember to edit a cell you select the cell and either edit it directly in the formula bar or press f2 or double-click the cell however it's very cumbersome and not very flexible to do it this way because if you wanted to add up the entire column then you'd have to type every cell reference one after the other so thankfully there's a better way instead of typing each cell to include in the reference you just put a colon between the first and last values in our range so E2 colon E4 in this case and if you wanted the whole column then you would enter E2 colon e13 in your formula but there's another way of doing it and that's by using your mouse to select the range so you still type equal sign sum then open parenthesis but select the range with your mouse or shift plus arrow keys and just press enter Excel will add the close parenthesis for you to total these columns up and add some tags you'll add some headings first for subtotals and tax at twenty percent then your formula will need to multiply the value in subtotals by 20 percent if you wanted to add up all the column subtotals and calculate the taxes then you would repeat the previous process for each column but that's very time consuming and you don't need to because Excel has some neat tricks to do this for you just select the fill handle in the bottom right corner of the cell and drag across to the other cells to copy the formula this is called autofill notice how the formula is copied but the row references change in relation to the cell's position on the worksheet so what was E2 colon e13 has become B2 colon B13 these are known as relative references but more on that later in the course and you can do the same thing for the tax values in row 16. 
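pulling those steps together as a minimal sketch (the row numbers follow the layout used in these videos, with the subtotal row in 15 and the tax row in 16): the subtotal formula in B15 would read
=SUM(B2:B13)
the tax cell in B16 would read
=B15*20%
and when you drag the fill handle across from B15, the copy in C15 automatically becomes =SUM(C2:C13) because the references are relative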
now you need a row for showing the totals the calculation here is simple the subtotal value in cell B15 added to the tax in B16 and again you can use the fill handle to copy the formula across if you wanted to Total the sales of all products by month you'd add a column heading notice how the cell style is copied to the new heading automatically remember to widen a column either drag the divider manually or double-click the divider then enter the formula in cell F2 as you've done before however Excel has another trick up its sleeve it's called autosum and is found on the Home tab in the editing group this is a great little shortcut for some simple common functions like sum average count Max and Min but you can choose other functions too you want some for this particular calculation notice that it also has a keyboard shortcut of alt plus equals and then press enter and it's done now you can use the fill handle to copy down the remaining values but hold on there is one more Excel trick to show and it's a good one suppose your column of data was very long you might have to drag the fill handle down over several Pages which isn't easy to do and can easily lead to errors When selecting large lists of data values rather than needing to drag down to the rest of the column you can just double-click the fill handle and it will automatically copy the formula to all the remaining cells in that column this one is a real Time Saver finally let's format all these values to use the US dollar currency format in this video we learned about the basics of formulas how to perform simple calculations how to select ranges in formulas and how to copy formulas in the next video we will look at how to use some of the common functions used by data analysts and discover some more advanced functions [Music] now that you have learned about the basics of formulas learned how to perform some basic calculations and how to select ranges and copy formulas next we will have an introduction to functions including using some common statistical functions and then we will learn about some more advanced functions that a data analyst might also use first let's look at some common functions used for statistical calculations We'll add some row headings for average minimum maximum count and median then in cell B20 let's work out the average of the car sales for the year from the table above on the Home tab in the editing group we click on the autosum drop down list and choose average now because autosum tries to add up the values directly above it in the column we need to modify the cell range here to B2 to B13 then we can use the fill handle as we've seen before to copy the formula across to column e for the minimum calculation in B21 we select Min from the autosum list and again we need to modify the cell range so this calculates the lowest value in our range and fill across to column e and for the maximum calculation we select Max from the list and then modify the range and once again copy the formula across this calculates the highest value in our range in b23 we will calculate the count which basically just means the number of values that exist in the selected range so we select count numbers from the list then modify the range for the median calculation we can select more functions from the autosum list then select statistical as the category and scroll down to find the median function the median Returns the exact middle of a range of selected values note that if you're selecting an odd number of values it will return the 
figure that is the middle value in your selected range but if you have selected an even number of values in your range it will return the middle figure between the two middle values in your range once again we need to change the cell range to B2 to B13 and we can then copy this formula across to column e you've seen autosum and some of the common statistical functions in Excel but there are another 400 plus other functions available so let's explore just a few of those now on the formulas tab in the function Library group there are drop-down lists for several function categories the first is a list of recently used functions which updates automatically as you use them then you have functions related to financial calculations if you hover over the name of a function you see a short description for each one so here we have the accrued interest function and here is the interest rate function The Logical list has Boolean operator functions such as and if and or there are several functions related to text such as concat which is an updated version of a previous function called concatenate which is still supported by the way for backwards compatibility find and search there are also several functions related to dates and times such as Network days weekday and weeknum in the lookup and reference list there are functions such as areas hlookup sort by and vlookup in the math and trig list you'll find lots of useful mathematical functions such as power sumif and some product alongside many functions for trigonometric purposes such as cosine sine and tangent there is also a more functions list which provides several more function categories such as statistical engineering and information in the statistical list you'll find functions such as average count Max median and Min we saw some of these used earlier in this video if you're struggling to find the function you want in these lists you can also search for a function just click the insert function button on the formulas Tab and then either browse the category lists available or choose all and look down the alphabetical list for the function you want alternatively type the name of a function you want and click go to search for it then select the one you want from The Returned search in this video we learned about the basics of functions how to use some of the more common functions that a data analyst might employ and looked at some of the more advanced functions available in Excel in the next video we will look at referencing data in formulas specifically differentiating between relative and absolute references and error handling in formulas [Music] now that you've had an introduction to functions seeing the use of some common statistical functions and learned about some of the more advanced functions that a data analyst might use in this video we'll look at the difference between relative absolute and mixed references and formulas as well as how to use them and we'll learn about formula errors in Excel it's important to understand the difference between relative and absolute references when creating your formulas by default in Excel cell references are always relative references the term relative is the key here because it means that when you reference a cell you are in fact referencing the cell's position in relation to the cell that the formula is in that is why when we have been copying formulas from one cell to another so far in this course using either copy and paste or the fill handle we haven't needed to modify the cell references because 
Excel assumes you are using relative references when the formulas are copied the cell references are changed to match the relative positions of the cells that are being copied to so now we know that relative references are the default in Excel but how do we make it so that the cell references don't change when we copy them for that you need to use absolute references in contrast to relative references absolute references to cells stay the same when you copy a formula containing such references lastly there may also be some instances where you only want one of the cell reference identifiers to be absolute and the other one to be relative for example you might want the row identifier to be absolute but the column identifier to be relative or vice versa these are called mixed references and an example of this would be =A$1+$A3 where A$1 has a relative column and an absolute row and $A3 has an absolute column and a relative row in contrast to relative and absolute references when you copy a formula containing mixed cell references any relative cell references will change whereas any absolute cell references will stay the same in the copied formula first let's look at an example of using relative references in a formula for example if we enter the formula =A1+A3 in cell C4 note the blue and red highlighted cells in A1 and A3 these denote the cells being relatively referenced in the formula if we copy the formula to the cell directly below using the fill handle we can see that the result changes and if we look at the copied formula you can see that the blue and red cell references have changed relative to their position on the worksheet the formula has been changed to =A2+A4 in the copied formula that is each cell reference has moved one cell down and if we copy and paste the formula to C7 you can see that the result also changes and again we can see that the blue and red cell references in the copied formula have changed now let's look at an example of how to use absolute references in a formula all you need to do to make a cell reference absolute is put a dollar sign in front of the column and or row identifiers in the formula for example if we enter the formula =$A$1+$A$3 in cell E4 note the blue and red highlighted cells in A1 and A3 these denote the cells being absolutely referenced in the formula when we copy the formula using the fill handle you can see that the result stays the same this time and if we look at the copied formula you can see that the blue and red cell references haven't changed the formula is still =$A$1+$A$3 in the copied formula that is the cell references haven't changed similarly if we then copy and paste the formula to E7 you can again see that the result stays the same this time and we can see that the blue and red cell references haven't changed the formula is still =$A$1+$A$3 in the copied formula that is the cell references haven't changed lastly we'll look at an example of how to use mixed references in a formula so if we enter the formula =A$1+$A3 in cell G4 note the blue and red highlighted cells in A1 and A3 these denote the cells being referenced in the formula if we copy the formula to the cell below using the fill handle you can see that the result changes but it's a different result from the previous examples and
if we look at the copied formula you can see that the first blue cell reference has stayed the same but the second Red Cell reference has changed if we copy and paste the formula to G7 you can see that the same thing happens the result changes and again we can see that the first blue cell reference has stayed the same in the copied formula while only the Red Cell reference has changed now we'll have a quick introduction to dealing with formula errors in Excel because of the complexity of writing formulas especially the more complicated ones there are bound to be occasions when you make a mistake in the syntax or in the data selection which will lead to a formula error errors are typically denoted by displaying in the cell that is supposed to be displaying the result one of the error codes in this list when you see multiple hash symbols in a Cell It's not really an error it just means the column either isn't wide enough to display the whole word or value or it contains a negative date or time value so if we type Control Plus semicolon then space then Control Plus shift plus semicolon it enters today's date and the current time but the cell is too narrow to display it so what we see is multiple hash symbols if we adjust the column width we can now see the cell contents so as I said this really shouldn't be considered as an error however if we enter the formula seen in cell i7 when we press enter we see a hash name error this error was caused by trying to use an X as a multiplication operator when in fact it should be an asterisk note the small green triangle in the top left corner of the cell also note that when you select the cell an exclamation mark appears providing you with a hint about what caused the error in this case it says the formula contains unrecognized text when you click the drop down error next to the exclamation mark for an error you see several options the first line also gives you a clue on the nature of the error this one says invalid name error so it was probably a mistyped cell reference value or function name if you click help on this error a help pane opens with specific information related to this error if you click show calculation steps a dialog box opens displaying the current syntax with the error underlined and you can try to evaluate the error if you are certain the error is incorrect you can choose ignore error and if you want to edit the formula click edit in Formula bar and the cursor will be focused in the formula bar so that you can try and correct the formula error if you click error checking options the Excel options dialog box is opened at the section related to error checking rules and you can modify these options to suit your needs each of the errors you make which generate one of the error codes listed at the start of this video will have a different reason and a different solution for more information on each of these errors and typical Solutions visit the link provided in this video we learned about referencing data in formulas specifically differentiating between relative absolute and mixed references and how to use them and we learned about formula errors in Excel [Music] data analysis can play a pivotal role in business decisions and processes in order to use the data to make competent decisions we must have the right information for the project and the data must be free from errors in this video we will learn how to profile data to discover inconsistencies whether we are working with small sets of data or analyzing a spreadsheet with thousands 
of rows one of the most difficult parts of the data analysis is finding and keeping clean data to help with this process and qualify the data look at these five traits accuracy completeness reliability relevance and timeliness accuracy is the first and most significant aspect to data quality a data analyst must clean the data set by removing duplicates correcting formatting errors and removing blank rows another important aspect of data quality is determining if the information required to complete the data set is readily available why does this matter as a trait for Quality data let's say we are given the task to calculate the revenues of all sales per region after collecting the data we discovered that no regions were specified this data would then be considered incomplete and other sources would have to be considered to obtain the data required reliability is another vital factor in determining the quality of data for instance let's say we are given the task to determine the agent Revenue by customer when Gathering the data we find that agents keep their own records and do not always update the information in the shared company database with those factors in mind we would then determine that the data in the shared company database was unreliable and new processes would need to be established to ensure reliable data relevance is another trait of quality data when collecting information a data analyst must consider if the data being assembled is really necessary for the project for example when reviewing the data related to the sales revenue per customer information such as customer birthdays and other personal information is also included by making the determination early to exclude the personal information from the data set the analyst would save themselves from having to review unnecessary information the last factor in determining the quality of the data is timeliness this trait refers to the availability and accessibility of the selected data let's say our sales report is going to be used for weekly employee reviews but our report is only refreshed once a month this error in refreshing the data would cause a report to become outdated and would have serious consequences for employee reviews in this video we learned the important role of a data analyst in qualifying data by considering the five traits of good quality data an analyst can save time avoid serious issues and have data that is free from errors in the next video we will take the collected data and learn how to import it to our spreadsheet foreign [Music] now that you have learned about the importance of data quality in this video you will learn how to import data from a text file using the text import wizard learn how to adjust column widths and learn how to add and remove columns and rows as you know by default Excel works with DOT xlsx or dot XLS files and opens them as workbooks but Excel can also use data that is in other formats such as plain text or data that has been comma separated and tab separated sometimes these source files will be saved with a DOT txt extension and referred to as text files but others might be saved with a DOT CSV file extension and are typically referred to as CSV files here in notepad I have opened a text file that contains data about car sales and it uses comma separated values or csvs to separate each bit of data in a record notice that the Top Line holds headings such as manufacturer model engine size and so on and each one is separated by a comma we want these to become our headers when we 
import the file into Excel the line below these headings is the first line of real data and again you can see that each piece of data is also separated by a comma there are 16 headings and there are also 16 pieces of data on each of the lines below the headings if we scroll to the bottom we can see that last data record is for the Volvo S80 now to open the file in Excel we choose file open and then either select the file from the recently used list or click browse to find the file we want to import when we open the file the text import wizard launches automatically and it will start to try and determine what your file is note that it has been detected as being a delimited file that is one that has its data fields separated by a character such as a comma or a tab as we want the headings to become headers in Excel we need to ensure that we select the option my data has headers we can see a mini preview of the data in the preview box below then we click next to proceed in the wizard in Step 2 of The Wizard we need to select our delimiter that is which character is separating our pieces of data so we select comma and deselect any others note the data preview now starts to show us what the imported data will look like you can scroll down and across this preview window to ensure that the data is going to look as you want and expect it all looks okay so we'll continue with the wizard in step 3 of The Wizard we can set the data format for each column for example you might want to change a column to text or date format in this case we can just accept the default General format and finish the import wizard in Excel we can see that the headings in the text file have been imported as a header row but also notice that some of the columns are not showing all the data some of the headings are not showing in full and some of the data is not shown either all you can see are a number of hashes in the cells this is because the column widths are too narrow in some cases if you remember you can manually adjust a column's width by dragging the divider across but to change them all in one go we select all the columns first then double-click one of the selected column dividers we can also do a similar thing with rows by dragging them to make them bigger or smaller or double-clicking a row divider to Auto size it there are some columns that we have decided we don't really need namely vehicle type and latest launch so let's remove those this can either be done using the delete drop down menu in the cells group on the Home tab and select delete sheet columns or by selecting and right-clicking a column and deleting it that way to add another column you simply select the column to the right of where you want your new column to be then right click the column and choose insert give the header a name such as year to delete a row you don't need select the row right click it and choose delete and to add a row select the row below the place you want to add your new row right click the row and choose insert if you want to save the file as an Excel file you can either choose file save as or you can click save as in the yellow tooltip that appeared at the top of the worksheet when we imported the file then you would choose Excel workbook or dot xlsx in the save as type box in this video we learned how to import data using the text import wizard we learned how to adjust column widths and we learned how to add and remove columns and rows in the next video we will discuss the importance of data privacy including sensitive 
information and personally identifiable data foreign [Music] we will learn about data privacy and the regulations that govern the collected data when collecting customer data specific regulations apply to how that data can be used by understanding data privacy regulations and getting familiar with the following three fundamentals you can eliminate the risk of financial penalties and keep the trust of your customers confidentiality collection and use compliance confidentiality is an important element in data privacy and it acknowledges that the customer's personal information belongs to them the types of information that can be accessed by a data analyst can range from sales forecasts to employee information or even patient records when accessing these types of Records the analyst must be able to recognize the different types of personal data personal information or Pi is any type of information that can be traced back to a specific individual this type of information can include anything from emails to images personally identifiable information or pii is specific information that could be used to identify an individual this type of information could include a social security number or a driver's license number and lastly sensitive personal information or SPI may not necessarily identify a specific individual but contains private information that needs to be protected because if made public it could possibly be used to harm the individual this type of information can include data about race sexual orientation biometric or genetic information by understanding personal data and the associated regulations we can efficiently anonymize our data by removing unnecessary information this type of action can help build customer confidence and continue to develop the free flow of information when searching through data the analyst must know the location of the company collecting the data and the location of the respondent knowing where the data was collected is an essential element of data privacy and what regulations must be applied the general data protection regulation or gdpr is a regulation specific to the European Union and only applies to the jurisdiction of the individual a new law created in Brazil called the lgpd will take effect in August 2020. 
these new data policy regulations apply to individuals within Brazil and ignores the location of the data processor while the United States does not have one Countrywide principle law for data privacy because of this individual states began to make their own regulations for instance California created the California consumer Privacy Act or CCPA to better protect customer data there are also industry-specific regulations that govern the collection and use of sensitive and personal data for example in healthcare HIPAA or HIPAA privacy rules govern the collection and disclosure of protected health information in retail the PCI standards govern credit card data and failure to safeguard card holder information can result in Hefty fines with a basic understanding of these policies we are able to remain compliant when handling any sensitive information unfortunately breaches and customer data is an all too common occurrence and understanding how to remain compliant is essential understanding the data privacy regulations of the European Union the United States and other countries as well as Industries is key to keeping data safe companies must comply with these privacy regulations at all times and also make sure policies are readily accessible to employees for example let's say a data analyst downloads a spreadsheet of sensitive information in order to complete the report by Monday morning the analysts decided to take their work laptop home for the weekend after driving home the analyst accidentally left the laptop in their car the next morning they found their car had been stolen along with the laptop because it is the responsibility of the company to keep customer data safe this was a breach of privacy when the data left the company property this type of action could not only cost the company large amounts of money in fines and penalties but could also reduce customer confidence causing a significant impact to revenue while data privacy applies to most data that is collected there are some instances where these regulations do not apply in order for these laws and regulations not to apply the particular collection of data must be completely anonymous to make data Anonymous means to exclude all data which ties it back to a particular individual while this approach might not be practical in all circumstances collecting data with privacy in mind could remove privacy limitations and make data collections more accessible in this video we learned about the importance of data privacy and the challenges that data analysts can face when collecting and sorting through data in the videos in the next lesson we will learn about different methods for cleaning data in a spreadsheet foreign [Music] we will listen to several data professionals discuss the importance of data quality and data privacy as they relate to data analysis let us start with what is the importance of data quality as it relates to data analysis data quality is of the utmost importance in terms of data and analytics and the reason behind this is because as soon as what you're presenting does not align with what someone expects that's the first thing that they tend to go after where did you get the data what's happened to the data how has it been transformed because people like to think that they know and understand their their business and when you start to challenge that if you don't have the ground to stand on of the data that it's quality that it's clean and that it is from a trusted Source that's when you start to get into a lot of 
discussions a lot of debate and ultimately the plot of what you're trying to present gets lost the backbone of any successful data analysis project is good quality data there's a common term in computer science called garbage in garbage out which is essentially if you read in bad quality data you can expect to get bad quality results so there's really nothing more important when doing a data analysis than making sure that you're working with good quality data and it's really important to sense check the data yourself and really feel comfortable that the data you're using is of a really high quality data accuracy is above all garbage in garbage out it's a waste of time to analyze data of poor quality and it might mislead the business direction the integrity of the data that you're using or providing for someone else to use is of the utmost importance data is used to determine when or where to launch a product if a division is profitable or not and it's easy to get things confused if you're not paying attention to the details using inventory as an example if you're looking at inventory at a SKU level and you accidentally pick the wrong SKU to analyze and then you draw these conclusions that this particular item isn't profitable when in fact it is so that's a major major decision for a company to make obviously so the expectation is that there will be lots of due diligence but in the beginning if you start off with bad data and then you build on that only to later realize that it wasn't a good idea you've lost time energy effort and in some cases trust thank you for those viewpoints what about the importance of data privacy as it relates to data analysis data privacy is incredibly important especially when you're working in industries like pharmaceuticals or healthcare but that's not where it stops we have to have the ability to make sure that the users are getting the appropriate level of data based on their roles and their permissions now we can do this through a number of cuts of the data specific to each geography or each function or in some tools such as Cognos Analytics we can start to build out that as part of our model within there you can say who has access to what whether it's at a granular level of this person can see data in Canada or the US or whether it's simply this person can see this report in its entirety or not there's lots of different ways to handle this but data privacy is of the utmost importance across all industries in today's world data privacy is a huge thing on the tax side especially of our business we have what we call PII personally identifiable information we have to protect that and so we can't just send things through email we don't send tax returns or actually in our business period we don't send things through email if they have sensitive PII data in them we encrypt it we make sure the email is encrypted or we use certain software that will allow us to not show the social security numbers or the names or the dataverse and what will happen is it has a certain sequence and we share that with the client by calling them we don't put that in an email and we certainly don't put that in the same email with the encrypted information because we want to make sure that you're always safe so we have to make sure we're protecting it at all costs [Music] now that we have learned about the importance of data quality and data privacy in this video we will learn how to deal with inaccurate data how to remove empty rows and how to remove
duplicated data it's very common when collecting or importing data whether through manual or automated processes to get errors and inconsistencies in your data this can range from spelling mistakes extra white space or the wrong case used in text to empty rows or missing values in your data to inaccurate or duplicated data having these errors and inconsistencies in your data can lead to issues with formulas not working with unsuccessful sorting and filtering operations and therefore inadequately visualized and presented data findings these data errors and inconsistencies require you to carry out some form of data cleaning routine to improve the quality and usability of the data let's start off with one of the easier of those tasks which is spell checking in Excel this works in pretty much the same way as you may have already encountered in applications such as Microsoft Word or other common word processing applications I have some data here relating to the sales of toy vehicles and the first thing we need to do is select what data we wish to check for spelling in this case we will try column K which contains the product line data then we click spelling which is on the review tab well that seems to be okay so let's try the country information in column T so we do have an error here where a country name has been misspelled or more likely mistyped we just click change if we are happy with the spelling suggestion or we could choose another suggestion from the list or even ignore this error if we know the data is correct but in this case we will change it here's another typo for a country name and here's one more so that seems to be all the errors in this column let's try the final column now which is the deal size in column X here is a misspelling of the word small and another for medium and that seems to be all for this column the next inconsistency we will look for is empty rows empty rows in your data can cause lots of issues relating to moving around your data working with formulas and sorting and filtering therefore it's very important to remove them from your data if you remember from an earlier lesson when we press Control plus down arrow it should take us to the end of that column of data but notice if we do that in this data set the cursor keeps stopping when we get to an empty row meaning that the data set is essentially being split into multiple sections separated by these empty rows that's not good so let's resolve that now we have a couple of options one option is to just manually scroll down the sheet looking for empty rows and deleting each one which is easy enough and fine to do if you only have a small amount of data but imagine if you were dealing with hundreds or thousands or even tens of thousands of rows that would be a very laborious and time consuming process there is a much better way which involves selecting all our data first either using the mouse or the Control plus Shift plus End keyboard shortcut then we select the filter icon on the data tab we can now see that each column has a filter icon next to the column header if we then select the customer name column filter in column M then uncheck select all then scroll down to the bottom of the list we can check the item called blanks and then click OK this will now show only the empty rows at the top of our sheet this can be quite hard to see but if you look in the row numbers you can see that rows 28 29 65 73 74 75 and 117 are listed at the top and are highlighted in blue text we can now select these rows either
using the mouse or going to the first cell in the first data row which is A28 and then using the Control plus Shift plus End keyboard shortcut then delete the offending empty rows we then need to clear the filter and turn it off so we can view our data again now if we go back to the first row at the top of the data sheet and try the Control plus down arrow shortcut again to go to the end of the data column it will work the next inconsistency we'll look for is duplicated rows of data it's quite common for duplicate data rows to exist in your imported data caused either by human input error or an error in the import process there are two ways of removing duplicates in Excel the first way includes reviewing the data you plan to remove first before deleting it to ensure you are deleting the right data this is our preferred method as it provides an additional level of data security the second method which we will also show you is simpler as you don't review the data to be removed first but it lacks the security of the first method it's important to select a column of data that you would not expect to have duplicate values in for example if we consider the price each column which is column C we would expect lots of these values to be repeated because the unit price of some products is the same so this is a bad example of a column to use to find duplicates instead let's use the sales column in column E because it is far less likely that these values will be duplicated in the normal process of things as they are the total sales for each order so we select the column and choose conditional formatting then highlight cells rules and then duplicate values when we click OK and scroll down the sheet we can see that only a few values have been identified as being duplicates there seem to be duplicate values in rows 36 to 40 and in rows 74 to 78.
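as a side note if you ever want a formula-based version of this check rather than the built-in duplicate values rule you could use a COUNTIF formula inside a conditional formatting rule the sketch below assumes the sales data sits in column E between rows 2 and 120 which is an assumption about this particular worksheet rather than something stated in the video

=COUNTIF($E$2:$E$120,$E2)>1

applied to the sales column as a conditional formatting rule this would highlight any sales value that appears more than once which is essentially what the duplicate values rule is doing for us here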
Let's zoom out so we can see both duplicate sections together it seems like these are in fact exact duplicate entries and are likely to be an input error let's delete the second section of duplicate rows as they are out of sequence they relate to motorcycle sales but are in the ships section of the sheet so that was the first and recommended method of removing duplicate rows of data which previews the data to be removed first now let's try the second simpler but less secure method let's go back to 100 percent zoom and go back to the top of the worksheet this time we select the whole data sheet and on the data tab we use the remove duplicates button we then unselect all the columns then only select the sales column and the duplicate rows are deleted the last cleaning process we'll look at in this video is using the find and replace feature to repair some misspelt surnames in the customer contacts column find and replace tools are under find and select on the Home tab in Excel if you have used other office products such as Word it should be familiar to you already we've had an email from a Swedish customer informing us that we have her surname spelled incorrectly on her order sheets so we type the misspelled surname into the find what box and click find next then click it again to see there are multiple incorrect entries if we click find all all instances are listed and we can open the replace tab to enter a name to replace the incorrect spellings her surname should be Larsson with a double s so we'll replace all instances with that corrected spelling that looks better and we are finished in this video we learned how to deal with inaccurate data how to remove empty rows and how to remove duplicated data in the next video we will look at changing the case of text fixing date formatting errors and trimming white space from data [Music] now that we've learned how to deal with inaccurate data how to remove empty rows and how to remove duplicated rows in this video we'll look at changing the case of text fixing date formatting errors and trimming white space from data when you collect or receive data from various sources it's quite common to find that your data contains text in mixed case that is some in uppercase some in lowercase and some in capitalized proper case also known as sentence case some of this may be intentional but often it's not Excel doesn't have a change case button like there is in Microsoft Word so you need to use other methods to perform this data cleaning task those methods are functions namely the UPPER LOWER and PROPER functions you can use these functions to help you change the case of text in your data you can see that the header row here is using all uppercase characters so if you want to change that to use proper case then you need to add another row to put the function in this is referred to as a helper row the PROPER function is simple to use just type equals then PROPER then open parenthesis then the cell reference in this case A1 then close parenthesis and press enter here you can see that the result in A2 is in proper case now you can try and drag the formula right across to column X by using the fill handle on A2 but this can be tricky when you have a lot of columns so let's try another way instead of dragging you can use Shift plus right arrow to select the columns across to X first then press F2 to bring the cursor into focus in cell A2 then you hold down the control key while you press enter and it fills across for you
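written out as it appears in the helper row the formula is simply

=PROPER(A1)

entered in cell A2 and when it is filled across the row the reference updates automatically so B2 contains =PROPER(B1) C2 contains =PROPER(C1) and so on the cell references here follow the walkthrough above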
you might think that you could now remove the original row but look at what happens when you do you get a REF error because the formula is referencing an invalid reference and the header row cells now contain just the failed formula rather than the actual header text so you need to undo that and instead you copy the contents of the helper row to row one but when you paste you need to choose the paste values option now the header row cells just contain header text and you can remove the helper row in row 2. now let's use the UPPER function to change text from proper case to uppercase insert a column to the right of the column you want to change this will be a helper column then you type the formula containing the UPPER function in the first data cell in this new helper column again it's a simple formula you type equals then UPPER then open parenthesis then the cell reference in this case T2 and then close the parenthesis and press enter you can see the result is the country name in uppercase and you can then copy that formula down the rest of the column by double clicking the fill handle cross symbol as before you then copy and paste the contents of the helper column to the original column but use the paste values option now you can delete the helper column next we'll use the LOWER function to change text from proper case to lowercase as before you insert a column to the right of the column you want to change this will be another helper column then you type the formula containing the LOWER function in the first data cell in the helper column once again it's a very simple formula you type equals then LOWER then open parenthesis then the cell reference in this case K2 and then close parenthesis and press enter you can now see the result is the product line data in lower case and you can now copy the formula down to the rest of the column by double-clicking the fill handle once more as before you then copy and paste the contents of the helper column to the original column by ensuring you use the paste values option now you can delete the helper column it's quite common to receive data that has a mixture of date formats or that uses a date format that isn't suitable for your region now let's look at how to change the format of some dates you can see that this date format is currently using a two digit day a two-digit month and a four digit year value when you open the number format dialog box you can see in the Locale box that this is an English United Kingdom date format you want to use a US date format so you first change the Locale to English United States in this list you can see there are several date options to choose from let's choose one that uses the full month name then a two-digit day and a four digit year value you could then copy this format to the rest of the date cells however if you want to format these dates using your own custom format you can do that too in the number format list you select custom and then choose an existing format that is similar to what you want and simply modify it to create a new custom format here we'll have the day then three letter month then four digit year to apply that new custom date format to the rest of the column you could either use the format painter tool or you can select the rest of the column and choose the new custom format from the custom list in the number format dialog box
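pulling those helper column formulas together they look like this with the cell references taken from the walkthrough

=UPPER(T2)
=LOWER(K2)

and if you ever need a formula based alternative to the custom date format dialog the TEXT function can produce the same kind of display for example =TEXT(D2,"dd mmm yyyy") though note that this is only a sketch the D2 reference is an assumption about where the order date column sits and TEXT returns a text string rather than a true date which is why the video sticks with the number format dialog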
you might find that your data has some white space that is unwanted spaces in your data here you can see that we have some spaces at the start some spaces at the end and some unwanted double spaces in the middle of our data we'll first have a look at what you can do to clean up these unwanted spaces in your data by using the find and replace feature in Excel so you first select all the data then on the Home tab you click find and select then replace to get rid of double spaces you enter a double space in the find what box and a single space in the replace with box then you click find next and choose replace for each item you want to change you could click replace all to do all the fixes in one go but unless you are absolutely sure of the changes it's better practice to check and replace each one in sequence in case there are some valid reasons for these extra spaces if you have a very large data set you might also choose replace all to save you a lot of time so using the find and replace feature got rid of most of those unwanted white spaces but not all of them we removed double spaces using that feature but we also have some single spaces left at the start and end of some of the cells you can't use find and replace to remove single spaces otherwise you would lose all spaces in your data including standard spaces between words which you don't want to remove but there is another tool you can use to clear spaces from cells and that's the TRIM function to use the TRIM function you once again insert a helper column the TRIM function is simple to use just type equals then TRIM then open parenthesis then the cell reference in this case M2 then close parenthesis and press enter you can then double-click the fill handle symbol to copy this formula down to the remainder of the column now you need to copy the contents of the new column N to column M and remember once again to paste using the paste values option you can now see that those erroneous spaces have been removed or more accurately speaking have been trimmed and lastly you can remove the helper column in this video we learned how to change the case of text how to change date formatting and how to trim white space from data in the next video we will discuss how to use the Flash Fill and text to columns features in Excel to help clean data [Music] in the previous video we learned how to change the case of text how to change date formatting and how to trim white space from data in this video we'll discuss how to use the Flash Fill and text to columns features in Excel to help clean data we used Flash Fill briefly earlier in the course as a quick method of entering data that fits a specific pattern such as the names of months or days of the week but it can also be used as a data cleaning tool it can split a column of full names into two separate columns for the forename and surname and it can also help to modify the naming convention used in a column of names for example in the vehicle toy sales worksheet there is a column containing the last names of contacts and another containing their first names if you want to use the Flash Fill feature to combine these names into one name column you first insert a helper column let's call it contact name then in the first row in the new column you enter the full name of the first contact in the format of your choice for example you might want surname then a comma then the forename or you might want surname and just an initial and so on in this case let's just enter the name in the standard format of forename then surname with a space between them and then we press enter next you'll start typing the second contact's name and you'll see that Flash Fill displays a preview of the remaining names for you if you're happy
with what's in the preview all you have to do is press enter and it fills in the remaining names for you right down the column it even works when there are two names in one of the columns such as Wing C here and dakuna here now you can remove the original columns if you no longer need them so in the previous task we saw how to combine two columns of data into one column using Flash Fill now let's see how to use it to modify the naming convention in a column let's switch to the customer contacts worksheet then in the first data row of the next column that is B2 we type the name of the first contact using whatever naming convention we want we'll use surname then comma then space then the forename and press enter again when we start typing the second contact's name in the next row down that is B3 Flash Fill detects the pattern and fills in the remaining names in column B when we press enter you could then copy and paste the column header and delete the original column A what we couldn't do with Flash Fill was take a single column with two names in and split that into two separate columns we need to use the text to columns feature to do that so we'll close this worksheet and we won't save the changes now let's see how the text to columns feature can help with data cleaning too as the name suggests and unlike Flash Fill the text to columns feature can take a column containing multi-part text and split that text into one or more other columns this can be useful for splitting any multi-part text such as names or addresses into separate component parts let's open the customer contacts worksheet again then we'll add column headings for the next two columns and copy the cell format used in the first column header then we'll widen the columns if we then select the data in column A from A2 to A23 and on the data tab click text to columns a wizard is launched on the first page of the wizard ensure that delimited is selected on the second page ensure that only space is selected as the delimiter on the third page of the wizard click the little arrow next to destination and select B2 on the worksheet then click the little arrow again to return to the wizard we're now finished with this wizard you can now see that the full customer contact names in column A have now been successfully split into two new columns in B and C and you could now remove column A if you no longer need it we'll close this worksheet and again we won't save the changes you can also achieve the same result using functions this would be required if you were using Excel for the web the online version of Excel as this doesn't have the text to columns feature there's also a bit more flexibility with functions which can be especially useful if you have names that are complex and mixed such as having hyphenated names or some names with a middle name some with two middle initials and some with no middle initial so we open the customer contacts worksheet again then we'll add column headings for the next two columns and copy the cell format used in the first column header then we'll widen the columns next we'll enter the formula in B2 to extract the forename part of the name this formula extracts five characters from cell A2 starting from the left and including the space then in cell C2 we enter the formula to extract the surname part of the name this formula extracts seven characters from cell A2 starting from the right then we'll double-click the fill handle in cell B2 to use autofill to complete the column and we do the same to the fill handle in cell C2 to use autofill to complete that column also
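the two formulas described above are most likely the LEFT and RIGHT functions with hard coded character counts

=LEFT(A2,5)
=RIGHT(A2,7)

which only works because this particular name happens to have a five character forename including the space and a seven character surname a more general sketch not shown in the video would locate the space instead for example =LEFT(A2,FIND(" ",A2)-1) for the forename and =MID(A2,FIND(" ",A2)+1,LEN(A2)) for the surname treat these as illustrative assumptions since the video keeps things simple with fixed lengths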
in this video we learned how to use the Flash Fill and text to columns features in Excel to help clean data [Music] we will listen to several data professionals discuss issues around data quality can you tell us your experience with poor quality data and the cleaning of that data a large portion of my time is spent cleaning verifying checking data before I run an analysis working in healthcare most of the information captured is based off of what someone's put in so humans can't be calibrated two people can have a similar situation and look at things slightly differently so it's up to me to make sure that if one describes something as navy blue and the other person describes it as dark blue that I consolidate it and make it blue that's just an example we don't normally do that in healthcare but the thought is that you always have to check the integrity of the information before you do your analysis to make sure that your results are accurate no data is going to be perfect that's an unfortunate reality of the world in which we live databases and data are collected for the broadest possible purpose but oftentimes there are still things that are missing or not quite in the format that we want whether that's collecting date and time as a single field whereas when we're doing our analysis we want to be able to break it out by day month and quarter these are things that we can take into consideration there's a lot of different cleansing activities that can be done and can be undertaken to help you get something that's specific and works for you and the way you want to work I have had experiences with poor quality data where I'm reviewing financial statements and I'm looking at margins calculating ratios trying to understand is what I'm looking at number one directionally correct but two am I looking at the right thing are all of these costs current costs relevant to the period that I'm analyzing has all of the data been captured do I have all of the revenue for a given month then you have to go back and look at the sources scrub that information to validate that what you're seeing is correct and from an accounting perspective if that data is incorrect or out of period then adjustments need to be made to the general ledger which houses the data to properly reflect what's happening poor data quality can really come into play and cause discussions that don't need to be happening they can cause you to be second guessed they can cause you to not be able to be firm and present your case and your data reliably now if this is the case there's several different ways we can handle this one is to go all the way back to the source to ensure that the source data is being pulled appropriately or simply being able to outline and be very specific and direct in terms of what transformations or changes have been done to the data through tracking this in something like Watson Knowledge Catalog and being able to present that to your audience if you're filtering and sorting data and you find that it's wrong you have to go back and fix things that time could have been spent working on other deliverables and again it can call data integrity into question if you're constantly having to redo or reiterate certain parts of data and quite frankly it can be frustrating at times if you're habitually having to do that so paying attention to the details and the minutiae so that you're not wasting time backtracking on something that you could
have fixed early on are just some of the many benefits of ensuring that your data quality is good [Music] now that we have learned how to collect and clean our data it is time to decide the best method for analysis in this video we will discuss the importance of filtering sorting performing calculations and shaping our data to provide meaningful information deciding how to manipulate our data can sometimes be difficult before we make any changes or adjustments we will need to visualize the final output below are some questions to ask before beginning the task how big is the data set what type of filtering is required to find the necessary information how should the data be sorted what type of calculations are needed now that we have visualized the final output we must decide the best approach to shape our data the most basic step would be to filter and sort the data by sorting the data we are able to organize it based on conditions such as alphabetically or numerically for example if we wanted to check for duplicate order numbers we could sort the data and quickly see any duplicates after sorting and removing the duplicate row we find that the view needs to be more specific to meet our requirements we now decide that we only want to see the data for the month of November by adding a filter we can now choose to only see items with a month ID that is equal to 11. by filtering our data we are now able to only see the rows that meet the filter criteria and it allows us to better analyze our information becoming familiar with all the tools to analyze data can seem daunting but one key benefit of using a spreadsheet is the ability to use functions functions in Excel are organized by several categories including mathematical statistical logical financial and date and time based let's say we wanted to get an average of company revenue for the month of June we realize there are over a hundred items that would need to be calculated in normal circumstances to get an average we would have to create a formula to add each row and divide by the total number of rows this type of calculation would not only be very long but can expose the analyst to possibly making a mistake with the use of a function we would be able to simplify our calculation in one easy step =AVERAGE(B1:B160) while sorting and filtering data on our spreadsheet can be useful on its own converting your data to a table first has many benefits when we convert our data into a table we are able to filter and calculate the data more efficiently one example is the ability to easily calculate columns for the column MSRP we choose sum and we're able to quickly calculate the sum of the column if we then look at the data and we only want to calculate the MSRP total based on Japan we would filter the country column to only display Japan and the column would then only add the values in the rows that were associated with Japan while all data may not work in a table there are quite a few advantages to formatting your data as a table automatic calculations even when filtering column headings never disappear banded rows to make reading easier and tables will automatically expand when adding new rows
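as a sketch of what those automatic table calculations actually are when you add a total row to an Excel table and choose sum for a column such as MSRP Excel inserts a SUBTOTAL formula along these lines

=SUBTOTAL(109,[MSRP])

the 109 argument means sum while ignoring rows hidden by a filter which is why the total keeps adjusting as you filter the table the structured reference [MSRP] assumes the table column is literally named MSRP as in the walkthrough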
sometimes data needs to be more organized than what a basic tabular format can give us and creating pivot tables with charts can be a better way to analyze and display the required information in Excel we have the option of creating a pivot table to display and analyze our data and optionally an associated pivot chart for example let's say we want to know what company ordered products in the month of October from the original table of data we create a pivot table to organize and analyze the required data along with a pivot chart to display the information by then adding the month filter to the newly created pivot table we can see the results for the month of October not only in the table but the changes are automatically updated in the pivot chart when trying to single out specific information in a large data set a pivot table is a nice way to show only the information that is required this allows us to quickly and easily scan the essential information pivot charts are a nice accessory to pivot tables as they allow us to visually process data and in most cases will let the audience grasp the information quicker the advantages of selecting a pivot table and chart are the ability to manipulate data without using formulas quickly summarize large data sets and display engaging charts and graphs in this video we learned about the importance of filtering sorting performing calculations and shaping our data to provide meaningful information and we learned about some of the tools to begin analyzing our data in the next video we will learn more about filtering and sorting our data [Music] in previous videos we learned how to use the Flash Fill and text to columns features in Excel to help clean data in this video we will discuss how to filter and sort our data to enable us to control what information is displayed and how it's displayed in our worksheets filtering your data enables you to gain more control over which parts of your data are displayed at any given time in Excel this can help with the visibility of data by narrowing down the data to within specified criteria and parameters and it can also help when searching for specific pieces of data to filter your data the first thing you need to do is turn filtering on which is very simple on the data tab click filter and that's it you will now see a small filter icon next to each of the column headers as a side note if you want to only filter on one or more columns select those columns first then click filter as another side note if you format your data as a table the columns automatically have filter controls added to them so now each column has a filter that can be applied to the data in that column in the order date column you can filter on the years in product line you can filter on the different product types and in customer name you can filter on each customer by name let's first filter on the year we'll select orders from 2004 only by deselecting the other years and if we wanted to we could expand the year and filter by months also but we won't do that for now if you look at the status bar at the bottom of the worksheet you can see that there are only 50 out of 114 records now displayed if you want to clear a filter you can either click the clear filter from option or click the select all item in the filter list now let's filter on the product line column to display only the rows that hold data for sales of classic cars and again we'll clear the filter lastly we'll filter on the customer name column and only display sales to Mini Gifts Distributors Limited and then clear that filter so far we've only applied one filter at a time but suppose you want to filter down to a greater degree we can do that too by just enabling all those filters together and now we are only displaying sales of classic cars to Mini Gifts Distributors Limited in 2004.
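as a side note the formula equivalent of stacking filters like this is the family of multi-criteria functions covered later in the course for example a COUNTIFS formula such as

=COUNTIFS(K2:K115,"Classic Cars",M2:M115,"Mini Gifts Distributors Limited")

would count the rows that satisfy both criteria at once this is only a sketch the product line and customer name columns K and M follow earlier parts of the walkthrough but the row range 2 to 115 is an assumption and the filter buttons remain the simpler way to explore the data interactively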
remember if you only want to clear one filter then click its filter button in the column header and click the clear filter from option but if you want to quickly clear all filters you can use the clear button in the sort and filter group on the data tab so far we've used what are commonly referred to as auto filters but you can also use custom filters to specify other criteria to apply to a filter for text or numbers for example if you wanted to see sales orders that are over or under a certain value you can do that with custom filters for the sales column let's add a number filter that only displays sales that are over two thousand dollars if you look in the status bar you can see that we are now showing 111 out of 114 records then let's clear that filter and filter it the other way to display the sales orders that are below two thousand dollars we can see that there are only three orders that are below two thousand dollars it's important to note that the data rows that we don't see have not been removed they are still there they have just been hidden from view by the filters and this is indicated by the row numbers you see on the left in blue the row numbers start at 69 and jump in large increments indicating that there are many more rows of data in our data set than are currently being displayed let's clear those filters if we look at a column filter for a column that contains text you will see that the menu item changes to text filters instead of number filters and you can see that there are several text filter options and if you want to turn off filtering altogether for a worksheet just click the filter button on the data tab now let's take a look at the basic sorting capabilities in Excel sorting is a very important part of the role of a typical data analyst you might need to organize your text-based data alphabetically your number based data numerically or your date-based data chronologically when you sort data using these logical parameters it makes it easier for you to conceptualize and visualize your data in a more meaningful way when sorting data the first thing you need to do is select which data to sort for example if you want to sort your customers alphabetically select a cell in the customer name column first and then either sort by A to Z or by Z to A and if you want to sort your sales figures numerically select a cell in the sales column first and then either sort from smallest to largest or from largest to smallest and lastly if you want to sort your customers' order dates chronologically select a cell in the order date column first then sort from oldest to newest or from newest to oldest but you can also sort your data by more than one column at a time simply select a cell in your data then on the data tab click sort then either use the sort by column suggested or use the drop down list to select a different column in this case we'll choose the order date column as our first sorting criteria and we'll choose oldest to newest in the order drop down list to add a further sorting level you click add level then you choose another sort column in the then by drop-down list in our case we'll choose sales and for this sort level we'll choose largest to smallest in the order list if you have a header row in your data as we do here then ensure you select the my data has headers checkbox then click OK to sort so the data is now sorted to list the oldest orders by order date first then within each order date if there are multiple instances with the same order date then the next
sorting level lists data by the largest order values first down to the smallest order values in this video we learned how to use the filter and sort tools in Excel to filter and sort our data to enable us to control what information is displayed and how it is displayed in our worksheets [Music] in this video we will listen to several data professionals discuss the importance of filtering and sorting your data why is it important to filter and sort your data filtering and sorting are very important as part of your analysis and visualization experience because this allows you to create one single view of the data but then provide a function for people to be able to do their own analysis on it now just to clarify what we mean by this is sorting tends to be highest to lowest alphabetical or in some cases you may want to create some custom sorting where you put your particular product or offering at the start and then have the rest falling behind it or you may want to group a few at the start to show your direct competitors versus others I love love love the filter sort feature in Microsoft Excel what it allows me to do is get to the heart of the data I can drill down and see for example how much revenue a client had for a specific time frame or how much money they made in a specific time frame without looking through a lot of rows and a whole lot of information so filtering and sorting really allows you to narrow it down and to get very specific and get the answers that you're looking for and not just get loads of data that you don't necessarily need and when we talk about filtering we talk about this to mean that I have a particular value on which I want to see the data specified by for example we had a bar chart showing our sales over months and I want to see it in a particular geography or for a particular product line I could have that available and allow me to filter down so that my sales would be specific just to one geography or one product line thank you [Music] in the previous video we learned how to use the filter and sort tools in Excel to filter and sort our data to enable us to control what information is displayed and how it is displayed in our worksheets in this video we'll discuss how to use some of the most common functions a data analyst might use namely IF IFS COUNTIF and SUMIF first up let's look at how to use the IF function the IF function is one of the most used logical functions in Excel the IF function enables you to logically compare a value against criteria you set in the function and then return a result based on whether the result of the logical comparison is true or false and these values can be text values or numeric values an IF function essentially says if something is true then return a value or do something but if it's not true then return a different value or do something else for example in our vehicle toy sales worksheet if we wanted to have a column that recorded whether the order had been shipped or not you could add a new column to the right of the existing column let's call it shipped and then enter the formula seen in cell H2 this formula is saying if the text in G2 says shipped then return yes and if it doesn't then return no you can then use the fill handle to copy this formula down the column you can see that most of the cells do say yes but some don't as the order hasn't been shipped for one reason or another we could also use the IF function to emphasize the size of an order so if we add a new column to the right of sales and name it 3K plus or minus then enter the formula seen in cell F2 this formula is saying if the order is over three thousand then return the text over 3K but if it isn't then return the text under 3K and we can copy the formula down the column
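written out those two IF formulas would look something like the following the cell references and text values follow the walkthrough except that the sales reference E2 in the second formula is an assumption based on the sales data sitting in column E

=IF(G2="Shipped","Yes","No")
=IF(E2>3000,"Over 3K","Under 3K")

the first goes in cell H2 and returns yes or no depending on the status text in G2 and the second goes in cell F2 and labels each order as over or under three thousand dollars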
in an ideal world you would only use the IF function to apply one or two conditions but there may be scenarios where you would want to apply multiple conditions in these cases you can use the nesting capabilities of functions to bring together several IF statements in one formula these are called nested IF functions for example if we add another column here for the order size and then enter the formula seen in cell F2 you can see that this formula contains multiple IF functions one is needed for each condition one for large one for medium and one for small and it requires three sets of parentheses so it's a relatively long and complex formula but it does work again we can copy the formula down the column even though Excel technically supports the nesting of up to 64 different IF functions in a formula it is not a recommended best practice having multiple IF functions in a single formula can become extremely challenging to manage for example suppose you come across a formula like this that you haven't used for some time or even worse was created by someone else it could be quite difficult to work out how and why it is being used also if your conditions increase then you need to add more conditions to an already quite complex and long formula which will only complicate matters more to resolve this issue a new function was developed called IFS the IFS function is only supported on Excel 2019 Excel for Microsoft 365 and Excel for the web as the name suggests this function can replace multiple nested IF functions being used in a single formula to simplify matters so if we add a further column for order size but this time we'll use the IFS function instead as you can see in cell G2 this formula only has one set of parentheses instead of three and only uses one function instead of three let's copy that formula down the column too now let's have a look at another example of using the IF function but we'll combine it with conditional formatting too if we switch to the car sales worksheet and add a new column to the right of the year resale value column and call it retention percent then we enter the formula seen in cell G2 which will divide the year resale value by the original retail price we need to format this as a percentage and then we can copy it down the column next we'll add a column to highlight the retention value for each car the formula we add here in cell H2 uses the IF function to state that if the percentage in the previous column is greater than 69 percent then mark it as good but if it isn't then mark it as poor once again we copy the formula down the column we could also use conditional formatting to highlight the retention value percentages even more we select H2 and on the Home tab click conditional formatting and make a new rule the condition in our rule will only format cells that contain a specific text value and that value is the word good and if it does match that condition then format it with a dark green font and fill the cell in pale green let's copy that conditional formatting down the rest of the column you can see that the cells that contain the word good are now formatted as we defined but the cells containing the word poor are not
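as a rough sketch of how those formulas might be written the cutoff values and cell references for the order size columns are not spelled out in the narration so they are purely illustrative assumptions only the 69 percent retention test comes directly from the walkthrough

=IF(E2>7000,"Large",IF(E2>3000,"Medium","Small"))
=IFS(E2>7000,"Large",E2>3000,"Medium",TRUE,"Small")
=IF(G2>0.69,"Good","Poor")

the first line is the nested IF version of the order size column the second is the equivalent single IFS formula with TRUE acting as the catch all final condition and the third is the retention column formula from the car sales worksheet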
let's add another conditional format rule this time we'll select manage rules because we are going to add another rule to our existing rule the new rule will be the same as the previous one with the exception of looking for a match with the word poor instead and formatting those matching cells with red text and a pink background fill and once again we copy that down the column now all the cells that contain the word poor are formatted as red text with a pink cell fill let's now have a quick look at how to use the COUNTIF function COUNTIF is one of the statistical functions provided in Excel you can use it to count the number of cells that meet a certain criterion such as the number of instances where an employee's name appears in a list of sales invoices or the number of occasions a particular part number appears in a list of purchase orders let's switch to the vehicle toy sales worksheet suppose you want to find out how many of the sales orders in the list went to customers based in the United Kingdom we enter the formula you see in cell AD7 note that when we are using text as a criterion we have to enclose the text in quotation marks so there were six sales orders in the UK and if you wanted to discover the same thing for French customers then you would just edit the existing formula or copy it and then edit it you can see there were 14 orders for French customers notice that this time the text entered was in lower case and it still works so names in this function are not case sensitive and let's do the same for United States customers there are 41 orders to customers based in the states there is also a newer function called COUNTIFS which applies criteria to cells across multiple ranges to count the number of occasions where all criteria have been met this removes the need to use multiple COUNTIF functions in a long and complex single formula the COUNTIFS function is only supported on Excel 2019 Excel for Microsoft 365 and Excel for the web now let's take a look at how to use the SUMIF function which is a very commonly used mathematical function in Excel you use the SUMIF function to sum the values within a specified range that meet specified criteria for example you might want to add up only the salaries that are over a specified salary level or you might want to find the total of all sales of a particular product category we'll enter the formula seen in cell AD10 this formula will add up each of the sales orders that have a total of more than three thousand dollars again notice that because we have used an arithmetic operator that is the greater than operator we must enclose the criterion in quotes if we specify a criterion that is only a number we don't enclose it in quotes so the total sum of all orders that were over three thousand dollars is almost four hundred and seventy thousand dollars you can also use wildcards such as question mark and asterisk when searching for partial matches and you can also specify to extract values from a different column than the column where you have specified the criteria for example if we enter the formula you can see in cell AD13 it will sum all the car sales in column E for only those products in the product line column that end in cars there is also a newer function called SUMIFS that you can use to sum cells based on multiple criteria this removes the need to use multiple SUMIF functions in a long and complex single formula the SUMIFS function is only supported on Excel 2019 Excel for Microsoft 365 and Excel for the web in this video we learned how to use the IF IFS COUNTIF and SUMIF functions in the next video we'll look at how to use the VLOOKUP and HLOOKUP reference functions
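the narration doesn't spell out the ranges used in cells AD7 AD10 and AD13 so the following is only a sketch it assumes the country data is in column T the sales totals in column E and the product line in column K as in earlier videos and that the data runs from row 2 to row 115

=COUNTIF(T2:T115,"UK")
=SUMIF(E2:E115,">3000")
=SUMIF(K2:K115,"*cars",E2:E115)

the first counts the UK orders the second adds up every order over three thousand dollars and the third uses the asterisk wildcard to sum the sales in column E for any product line ending in cars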
[Music] now that we've learned how to use the IF IFS COUNTIF and SUMIF functions in this video we'll look at how to use the VLOOKUP and HLOOKUP reference functions VLOOKUP is one of the most commonly used reference type functions in Excel and it enables you to find data referenced in a lookup table it stands for vertical lookup and therefore is a useful tool when you want to find something in a table or range by row shortly we will look at HLOOKUP which stands for horizontal lookup which looks for data by column instead VLOOKUP works by using a common shared key between the source data and the lookup data in the lookup table a typical lookup formula would look like =VLOOKUP(B3, A2:B12, 2, FALSE) where B3 is the lookup value that is the value or word you are looking for A2:B12 is the lookup table or range that is the table array or range of cells that contains the lookup value in a formula Excel references this as table_array the lookup table can be on the same worksheet or in another separate worksheet 2 is the lookup column number that is the number of the column in the lookup table that contains the value you are looking for in a formula Excel references this as col_index_num FALSE is an optional parameter that determines whether the match found has to be exact denoted by FALSE or can be approximate denoted by TRUE in a formula Excel references this as [range_lookup] the square brackets round this argument in the formula signify that it is an optional argument whereas the others are required arguments of a VLOOKUP formula if you don't specify the optional FALSE or TRUE parameter in your formula it will default to TRUE that is an approximate match is allowed so it's good practice to include FALSE explicitly whenever you need an exact match you can also use the number zero instead of FALSE and the number one instead of TRUE okay now let's see the VLOOKUP function in action in the car sales worksheet suppose we wanted a quick price list of our favorite cars the first thing we need to do is put the column containing the value we want to search for in the leftmost column as VLOOKUP requires this then we can delete the original column we then enter the formula seen in cell V16 which is looking for the word Corvette in the table array from cell A2 to G156 and then looks for the value in the fifth column in this case the price column that matches the row containing Corvette and returns an exact value of forty five thousand seven hundred and five dollars note that in this example we are using a part of our existing data table as the lookup table or table array let's format that as U.S. currency then we'll format it to zero decimal places in fact rather than use the reference A25 in the formula it will be easier to use the reference to the word Corvette in the mini table in this worksheet where our list of favorite cars is so that is V5 and the formula still works now let's copy that formula up to the favorite car table above it in the worksheet but there's a problem because when we copied the formula the cell references changed this happened because as we learned earlier in this course the default state of cell references is relative and we want them to be absolute in this case so let's undo that copy operation to make the cell references absolute we need to add dollar symbols to all the cell references in the formula
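as a sketch of how the favorite cars formula ends up looking the table range the column number 5 and the exact match argument all follow the walkthrough

=VLOOKUP($V$5,$A$2:$G$156,5,FALSE)
=VLOOKUP($V5,$A$2:$G$156,5,FALSE)

the first line is the fully absolute version which is why copying it down initially repeats the same result and the second line shows the small adjustment made a little later in the walkthrough where only the row of the lookup value is left relative so the formula can be filled down the list of favorite cars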
this can either be done manually or you can put the cursor in each cell reference in turn in the formula and press F4 each time to automatically add the dollar symbols let's try and copy the formula again and this time it works if we use the fill handle on cell W5 to copy it down to the rest of the cars it doesn't work in fact we end up with the same result in every cell why because each one is referencing the same cells in the lookup value because we used an absolute reference all we need to do now is modify the formula to remove the absolute reference for just the row parameter in the lookup value part of the formula by removing the dollar symbol so in cell W5 we change $V$5 to $V5 then when we drag the fill handle down it will copy the formula correctly and all the prices will be changed to reflect their correct retail price lastly to show that the two tables are now connected by this VLOOKUP function if we change the retail price for the Chevrolet Corvette in the main data table in cell E25 the price will also change in the favorite cars price list let's now take a look at the HLOOKUP function which as we mentioned earlier does the same thing and works in virtually the same way as the VLOOKUP function but it looks for data in columns rather than rows so HLOOKUP looks for a word or value in the top row of a table and then returns a value in the same column from a row specified in the table array therefore you would use HLOOKUP if your comparison values were situated in a row along the top of a data table in contrast you would use VLOOKUP if your comparison values were located in a column to the left of the data you want to find as they were in the previous task of the two functions VLOOKUP is used far more frequently than HLOOKUP because of the nature of most data tables the syntax for HLOOKUP is identical to that of VLOOKUP except that you specify a row index number referenced in a formula by Excel as row_index_num this indicates the number of the row in the lookup table that contains the value you are looking for let's create a small lookup table on the right hand side of our main data table a few columns have been hidden in this worksheet to make viewing a little easier so now we've got low HP medium HP and high HP in the top row of our lookup table next we'll add Wingdings symbols as ratings for the three horsepower levels one sad face for the low horsepower rating two neutral faces for the medium rating and three happy faces for the high horsepower rating now let's add a new column to the right of the HP level column and call it HP rating then in cell L2 we'll enter the HLOOKUP function this function will look for the value in cell K2 which in this case is medium HP and it will look for it in the cell range from Y21 to AA22 which is our little lookup table and it will return the answer it finds in row 2 of the table under medium HP and use an exact value note that we've used some absolute references in this formula too notice that what is returned is the text KK so we need to format the cell using the Wingdings font now when we double-click the fill handle the whole column shows the HP rating symbols relevant to each row's HP level value and we're done
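written out the HP rating formula in cell L2 would be along these lines the references match the walkthrough and the absolute table range reflects the comment that some absolute references were used so the lookup table stays fixed as the formula is copied down

=HLOOKUP(K2,$Y$21:$AA$22,2,FALSE)

K2 holds the HP level text the lookup table sits in Y21 to AA22 the 2 picks up the rating symbols from the second row of that table and FALSE forces an exact match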
although VLOOKUP and HLOOKUP are regularly still used as the de facto functions for lookup references in Excel there is a newer function called XLOOKUP this version is only supported on Excel desktop versions from Excel for Microsoft 365 and on Excel for the web as well as on Excel for iPad and iPhone and Excel for Android tablets and phones XLOOKUP is an improved and combined version of VLOOKUP and HLOOKUP together it can work in any direction vertically or horizontally it also uses separate lookup array and return array values instead of a single table array and a column or row index number in this video we learned how to use the VLOOKUP and HLOOKUP functions in Excel to find and connect to data referenced in both vertical and horizontal lookup tables in the videos coming up in the next lesson we'll start to look at using pivot tables in Excel [Music] now that we've learned how to use the VLOOKUP and HLOOKUP functions in this video we'll look at how to create and use pivot tables in Excel we'll first look at how to format our data as a table then how to create pivot tables and use fields in a pivot table to analyze data and lastly we'll see how to perform calculations in a pivot table having a worksheet full of informational data is all very well but to really get some use out of it we need to analyze it from different perspectives to find answers to questions related to the data now we've already used features such as filters and formulas to draw mathematical and logical conclusions about our data but not all questions can be answered easily using filters and formulas alone in order to obtain usable and presentable insights into your data you need something else and that something else is pivot tables pivot tables provide a simple and quick way in spreadsheets to summarize and analyze data to observe trends and patterns in your data and to make comparisons of your data a pivot table is dynamic so as you change and add data to the original data set on which the pivot table is based the analysis and summary information changes too a data analyst can use pivot tables to draw useful and relevant conclusions about and create insights into an organization's data in order to present those insights to interested parties within the company before you start to create a pivot table in Excel it can be very helpful to first format your data as a table the reason for this is not only to make it more organized and defined and to add table styles to your data but primarily it makes it a lot easier when adding records to the data set in the car sales worksheet let's first select any cell within the data and then on the Home tab in the Styles group choose format as table then choose a style from the gallery note that Excel automatically knows the boundaries of our data range but we can change this if we need to and ensure you select my table has headers if indeed it does after you click OK and the data has been formatted as a table note the filter drop downs at the top of each column these are automatically added when you format as a table if we now scroll down to the bottom of the table and start adding another row of data for another vehicle when you click tab or enter note that it is automatically formatted and included as part of our table okay now let's see how to create a basic pivot table and how to use fields to arrange data in a pivot table just before we do that there are a few things you should use as a checklist to ensure your data is in a fit state to make a pivot table from and these are format your data as a table for best results ensure column headings are correct and there is only one header row as these column headings become the field names in a pivot table remove any blank rows and columns and try to eliminate blank cells also ensure value fields are formatted as numbers and not text ensure date fields are formatted as dates
and not text in the worksheet we can just select any cell in the table then on the insert tab we click pivot table note that in the selected table or range box the table name table 1 is already entered for us if we hadn't just formatted this data as a table we would specify the cell range here instead under that we need to decide whether we want to create the pivot table on a separate new blank worksheet or on this worksheet a new worksheet is the default and is the most commonly used option so a new blank worksheet opens displaying some basic pivot table instructions in the graphic on the left of the worksheet and a pivot table Fields pane on the right you can rename the worksheet for the pivot table if you wish to build the pivot table report we need to add some fields from the top of the pivot table Fields pane to one or more of the sections in the bottom part of the pane for example if we want to find out the total sales for each model of car let's drag the manufacturer field to the rows section of the report and and then we'll drag the model field there too but this isn't really the way we want it to look so we'll drag the manufacturer field to appear at the top of the rows section above the model which makes more sense with our data next we'll add the price field to the column section but again that really isn't the way we want to view the data so we'll drag price to the values section instead which makes a lot more sense and looks a lot better next we'll add the unit sales field to values too so now we can see both the individual price for each model and the number of unit sales of each model let's add the vehicle type field to columns but that doesn't seem very useful so let's remove that field which we can do in two ways either by using the drop down menu or if we undo that we can also do it by simply dragging the field out of the columns section either to the left over the worksheet or to the top over the fields list above let's now look at how to perform a simple calculation in a pivot table if we look at the sum of price column in our pivot table we can see that the figures are formatted as general so first let's change the format of these figures to U.S currency this can be done by modifying the value field settings for the field in the relevant section of the pivot table Fields pane we'll format the field as US Dollars and show no decimal places next we'll add a calculated field from the pivot table analyze tab using the fields items and sets button we want this field to calculate the total sales for each model by multiplying the price by the number of unit sales when we create and add this formula it gets added to the pivot table Fields pane as a field called total model sales and we can change the format to make it US Dollars again a new column called sum of total model sales has now appeared in the pivot table in our worksheet in row five we can see that there have been over 360 million dollars of sales of the Acura Integra model and in row 7 we can see that there has been over a billion dollars in sales of the Acura TL model in this video we learned how to format data as a table how to create a pivot table and use fields to analyze data in a pivot table and how to perform calculations using pivot table data in the next video we'll look at some other features of pivot tables foreign [Music] we will listen to several data professionals discuss their experience using pivot tables to analyze data what are your experiences using pivot tables to analyze data my experience using 
pivot tables in Excel is extensive I use them all the time the thing to keep in mind is that you can sum average and count easily you can set it to group by so people can choose what the parameters are at the top it's great if you've got a couple of thousand records all the way up to whatever Excel can handle so a pivot table is just a real simple way of manipulating data without having to do any actual querying or development language I once had a huge e-commerce sales data set and needed to analyze the kpis including gross merchandise volume and take rates however I could only generate limited insights if I stayed at a high level with pivot tables I was able to group the data in terms of countries type of stores type of products which enabled me to view the data and analyze the key kpis at different levels of granularity I use pivot tables and we use pivot tables in our firm especially during audits to assist us and help us to kind of drill down on the data because what a pivot table does is it helps you to take a large set of data and whittle it down to something that's meaningful so in the case of audits a client might have you know five hundred thousand dollars worth of Maintenance and Repair bills that are made up of 300 invoices well we don't want to see every invoice for every dollar we want to see the high dollar invoices so we're going to use that pivot table to narrow it down to the invoices that actually are going to have the highest level of impact on the financial statement I'd say Excel pivot tables are a great way to understand your data quickly and effectively being able to just open up an Excel sheet put it into a pivot table drag and drop things in to get a sense of what the numbers look like what the values are really can help you get a good sense of the data in order to then start to build out something a little bit more robust being able to understand the fields what they mean what they look like these are all things that can help you at the start of a project as you're looking to do your analysis pivot tables are incredibly useful to get a quick view of your data and to look at multiple levels of your data in a very quick and clean way it's just very very easy to create a pivot table on a set of raw data aggregate it by some level of Interest be it the country the user is from be it the year the user joined or anything else be it something related to time it's really good for quickly seeing and understanding some of the more high-level summaries that are hidden within your data [Music] now that we've learned how to create and use pivot tables in Excel in this video we'll look at some other features that we can use with pivot tables including recommended pivot tables filters slicers and timelines first let's look at recommended pivot tables which isn't exactly a feature as such it's really more of a list of suggested different combinations of data that could be used when creating a pivot table these recommendations are based on the data we select in the worksheet and they are a great way to get started creating pivot tables if you don't have much experience with them yet for example in the vehicle toy sales worksheet if we select column B which contains data about the quantity of items ordered when we choose recommended pivot tables from the insert tab then we are presented with a list of potential data combinations related to the order quantity information however if we select column F which contains order size information then the recommended pivot
table list changes to reflect that data and if we select column e which contains sales information then the pivot tables recommended are related to sales data let's select the third one down which is the sum of sales by territory because that sounds like something we could get some useful Insight from by presenting it in a pivot table note that the new worksheet is opened containing the recommended pivot table and a new pane opens on the right called pivot table Fields let's rename the worksheet to something more meaningful in the pivot table Fields pane you can see that some Fields have already been added to the rows and values areas although it's a recommended pivot table we can still make it our own by adding more fields for example so let's add the product line item to the columns area using drag and drop now we have columns for each of the product lines in our pivot table such as motorcycles ships and trains in the pivot table we can manually expand any field we want to view its contents here we can see that the order dates are located underneath the territory names in our pivot table note that this matches the order of the fields in the rows area of the pivot table Fields pane we can manually collapse each of the fields too but we also have the option of expanding all the fields at once and collapsing them all too the next feature we will delve into is pivot table filtering pivot table filters work in much the same way as the standard filters we used earlier in the course note that we already have some inbuilt filtering in this pivot table for example the rows label header is a filter and we can filter on any of the listed territories such as Japan just like standard filters it's very simple to clear a filter in a pivot table we also have a column labels filter allowing us to filter on any of the product line items in this pivot table for example we could show data only for the trains product we also have the option of adding the product line field as a standard filter instead of a column Heading by dragging it to the filters area in the pivot table Fields pane and we can then use it as a standard filter as we have done earlier in this course the filter also allows us to select multiple filter items but because it's now being used as a standard filter rather than a column header we can't see the split of the information on these two product lines we just see a combined total when we have the filter as a column header the information on each product line was presented separately in each column let's display all the field totals again and we'll drag the product line field back to the columns area where it was previously so we can see the split of our different product lines and the pivot table the next pivot table feature we will look at are slicers slicers are essentially on-screen graphical filter objects that enable you to filter your data using buttons slicers make it easy to perform quick filtering of your pivot table data and they also display the current filter State making it easier for you to know and see what data is currently being shown and which is being hidden by the filter for example if we remove the product line field from the pivot table by dragging it out of the pivot table Fields Pane and then from the pivot table analyze tab we click insert slicer and then choose the territory field as our slicer we can see that the slicer can be freely moved around anywhere on the worksheet and it contains buttons for each of the territory names such as emea North America and 
Japan we can also select the multi-select button to filter on multiple territories if we wish we can click the clear filter button to clear all the slicer filters let's add another slicer to our worksheet for the product line field however be sure to select a cell in the pivot table first because if you don't then the insert slicer button won't work note that slicers can also be added from the filters group on the insert tab as well as from the pivot table analyze tab we'll select the product line field this time for our slicer and drag it near the top of the worksheet as before we can select only one slicer item or we can turn on multi-select and choose several items to filter on in the slicer then let's clear the slicer filters and now let's filter using both slicers note that when you use multi-select filtering when you select an item you are in fact filtering it out that is you are defining which items will not be displayed in the pivot table this is the opposite Behavior to when you were selecting single items in a slicer so now we are displaying only classic cars trains and trucks and buses products for the emea and North America territories now let's clear those slicer filters and put the product line field back in the columns area of the pivot table so it's ready for the next feature we will explore and let's move these slicers out of the way further down the worksheet the last useful feature for pivot tables we are going to look at is timelines a timeline is another type of filter tool that enables you to filter specifically on date related data in your pivot table this is a much quicker and more effective way of dynamically filtering by date rather than having to create and adjust filters on your date columns we can add a timeline for our pivot table either from the pivot table analyze tab or from the insert tab again ensure you select any cell in the pivot table first we'll select the order date field as our timeline filter then we can drag it up the worksheet and enlarge it the default for this timeline is to display data by month but you can also filter by days or by quarters you can select a single quarter or you can select a range of quarters in this case we'll select 12 months between quarter 3 of 2003 and quarter two of 2004. you can use the clear filter button to clear a timeline filter you can also filter by years for example here we have selected 2003 only and you can combine slicers and timelines as filters in a pivot table for example here we can filter the slicers to display only data for trains in the emea and North America territories and only in the year 2003. 
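as an aside the same territory product line and date filtering that slicers and timelines give you in Excel can be sketched in pandas if you ever need to reproduce a pivot table in code this is only an illustrative analogue not something used in the course and the column names and values below are invented to roughly match the vehicle toy sales data described above

import pandas as pd

# hypothetical rows standing in for the vehicle toy sales worksheet
orders = pd.DataFrame({
    "territory":    ["EMEA", "EMEA", "North America", "Japan"],
    "product_line": ["Trains", "Classic Cars", "Trains", "Ships"],
    "order_date":   pd.to_datetime(["2003-08-01", "2004-02-15", "2003-11-30", "2004-05-20"]),
    "sales":        [2400.50, 5100.00, 1830.25, 990.75],
})

# timeline-style filter: keep only orders placed in 2003
in_2003 = orders[orders["order_date"].dt.year == 2003]

# slicer-style filter plus pivot: sum of sales by territory and product line
summary = in_2003[in_2003["territory"].isin(["EMEA", "North America"])].pivot_table(
    index="territory", columns="product_line", values="sales", aggfunc="sum"
)
print(summary)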
and if we filter on the year 2004 instead you'll see that there is no data being displayed meaning that there were no sales of train products in 2004 in either the emea or the North American territories timelines and slicers have their own tabs in the ribbon when you select them and their properties can be modified to change how they look and how they work for example let's change this timeline to a light green shade and let's change this slicer to a nice orange color and lastly to remove a timeline or slicer you can either select it and press the delete key or right click it and choose cut in this video we learned about some of the other features in Excel that we can use with pivot tables namely recommended pivot tables filters slicers and timelines [Music] it is often said that a picture is worth a thousand words this phrase is especially relevant when it comes to data analytics data visualization plays an essential role in the representation of both small and large-scale data this course from IBM is designed to help you tell a compelling story with your data using various visualization techniques you will work with both Excel and cognos analytics to acquire the basic skills needed to create different types of plots charts and graphs and build interactive dashboards which are important parts of the skill set required to become a data analyst you will not only learn data visualization techniques using Excel and cognos analytics but also practice using multiple Hands-On labs and assignments throughout the course in module 1 you will learn about different types of charts and the Excel functions that are used to create basic charts and pivot chart visualizations by learning how to manipulate these features and by creating visualizations you will begin to understand the important role charts play in telling a data-driven story in module 2 you will learn about creating Advanced charts and learn the basics of dashboarding and how to create simple dashboards in Excel you will also learn how dashboards can be used to provide real-time snapshots of key performance indicators in module 3 you will learn about cognos analytics including how to sign up for it how to navigate around it and how to easily create stunning dashboards you will also learn some of the more advanced dashboarding capabilities of cognos analytics and make your dashboards Interactive in the final module you will complete a two-part Hands-On final assignment lab which will guide you on how to create visualizations in Excel and how to create visualizations and dashboards in cognos analytics this will involve you understanding what the scenario requirements are and then creating visualizations and a dashboard to fulfill those requirements you will follow two different business scenarios throughout the course each using its own data set these different scenarios and data sets will be used in the lesson videos and in the Hands-On Labs after completing this course you will be able to explain the role visualizations play in conveying a story about data create basic charts pivot charts and advanced charts in Excel spreadsheets create a simple dashboard using Excel provision an instance of cognos analytics in the cloud navigate around the cognos analytics interface and leverage its Rich visualization capabilities build interactive dashboards using cognos analytics with a variety of basic and advanced visualizations you will also perform some intermediate level data visualization and dashboard creation tasks to address a business
scenario the course team and other peers are available to help in the course discussion forums in case you require any assistance let's get started with your next video where you will get an introduction to charts foreign [Music] we'll give an overview of several different types of charts and visualizations and discuss how they can be used to tell a story let's begin by looking at a line chart When comparing different but related data sets a line chart is a great way to display the information they are able to display Trends and show how a data value is changing in relation to a continuous variable for example if time is a continuous variable how has the sale of a product or multiple products changed next we have pie charts this type of chart can show the breakdown of an entity into its subparts and the proportion of the subparts in relation to one another each portion of the pie represents a static value or category and the sum of all categories is equal to a hundred percent in this example we have a marketing campaign with four distinct categories social sites native advertising paid influencers and Live Events with this type of data representation we can easily see the total number of leads generated per category we now look at one of the most commonly used charts the bar chart this type of chart is the most common as they are easy to create and are great for comparing related data sets or parts of a whole for example in this bar chart we can see the population numbers of 10 different countries and how they compare to one another we can also use stacked bars in which each bar is divided into sub-bars that are stacked end to end in this stacked bar we can see the population of each country split into four age ranges would you like the graph to appear vertical and not horizontal then column charts would be a great pick this type of chart can be used quite effectively to show change over time and to compare values side by side for example showing page views versus user session time on a website as it changes on a month-to-month basis while this type of chart looks similar to a bar chart they cannot always be used interchangeably for example a column chart may be better suited for showing negative and positive values next We have tree Maps which are useful for displaying complex hierarchies using nested rectangles in this example the tree map depicts Statewide employment rates within the population of a country over the last year the size of the rectangle represents the population and the color represents the employment rate we can click on any region to see the employment data of the sub-regions within the selected region trying to display a pipeline or different stages of a continuous process then funnel charts are the way to go in this Example The Funnel chart is showing the conversion rate at each stage of the sales process from lead generation to the final sale another exceptional chart is the scatter chart in this type of chart the circle colors represent the categories of data and the circle sizes are indicative of the volume of data for example in this scatter chart we can see each product line by the number of units sold and the revenue it brings a scatter chart can be great for revealing Trends clusters patterns and correlations between data points next we look at bubble charts this is a variant of scatter charts and they are useful for comparing a handful of categories to one another in terms of relative significance for example understanding areas of significant expenditure in 
an organization's sales budgets lastly we have sparklines sparklines do not include an axis or coordinates yet they display Trends simply and effectively these are great for showing the general trend of a variation for example stock market price fluctuations from the opening to the closing of a trading day in this video we learned about the importance of charts and how they are able to shape our data to provide meaningful information in the next video we will dive into more details about how to create and configure different types of charts in Excel [Music] in this video we will listen to several data professionals discuss the importance of using visualizations to tell a story about data can you tell us about the importance of using visualizations to tell a story about data visualizations are critical to storytelling with data I think you're familiar with the phrase a picture is worth a thousand words and that's really true here you can get a much clearer picture of what's going on with your data if you have clean and clear data visualizations I also think data visualization is super helpful for the analyst that creates them because it forces them to make choices about what's really important to show and what isn't important to show for example if you're debating whether you should look at things temporally you can debate is the overall trend the most important okay then I should do a time series data visualization do I think comparing one group versus another is more important then you're more likely to do a bar or column chart so it's really important in clarifying the data analyst's thinking and visualization is really important in telling a clear concise story to stakeholders humans are visual creatures you are more likely to tell a compelling story and get buy-in with visuals I once got a job offer with a visualized resume created with Tableau so one of the best ways to present data is visually numbers by themselves for the most part will tend to overwhelm people so if I walk up and I'm talking in a company meeting and I say well last year in 2019 we did a hundred thousand dollars or I could give you a graph and say 2018 we did 75 000 2019 we did a hundred thousand and in 2020 we're projected to do a hundred and twenty five thousand dollars if I put that in a graph and make it stand out and make it pretty people non-accountants and non-data people will kind of gravitate towards it and it'll prompt them to ask different questions and have different ideas and so by using maybe a PowerPoint or even Excel you can create graphs from the data make it pretty and make sure that it's not just pretty but that it highlights the important information of what you're trying to say it will create and drive the conversation around what needs to be done and how best to maybe run the business or make different decisions data visualization is a very important part of helping people to understand the numbers that you're trying to present the reason we want to gravitate towards visualizations is that's how the brain really works the brain is much more able to process a high bar versus a low bar as opposed to looking at 100 rows or 100 lines in a spreadsheet using the visualizations and especially using the appropriate visualization for the given task can really help make sure that the user gets the easiest way to understand this as we talked about storytelling is really an important way for us to do this and so through the visualizations
that's really how we tell a story we can augment it with text whether that's user generated or system generated to help people really drill down further into the understanding but starting with the visualization is the easiest way to help people quickly and effectively understand what's going on and then you can have the further discussions around what exactly you're doing [Music] in this video we'll look at how to create a few basic types of charts in Excel we'll first create line charts then pie charts and lastly bar charts first let's start with line charts a line chart is a type of graph used to show information as a series of data points connected by straight lines in a line chart the horizontal axis typically represents time or a similar category and the vertical axis typically represents numerical values because line charts can display continuous data over a given time period they're perfect for showing trends in data at equal time intervals such as days months quarters or years line charts are ideal for scenarios where you have data that's arranged in columns or rows or where your data contains multiple data series on the car sales worksheet of the car sales workbook let's first filter the data to display only Ford car models now let's create a line chart with this data we'll select the data from two non-adjacent columns in this case model and price then we select line chart from the 2D line category of the charts group let's change the chart title to price of Ford cars which we can do by simply double-clicking the chart title text box and editing the text we now see a floating chart area containing our line chart which displays the price trend of Ford cars across its models let's move this line chart to the left side of the worksheet below our data okay now let's move on to pie charts a pie chart is a type of circular graph used to show the relative contribution of different categories which we see as slices to make an overall total which we see as a pie data points on a pie chart that is the slices are represented as percentages of the complete pie pie charts provide a very simple visualization of differing data results which we humans find very easy to comprehend pie charts are best used when you only have one data series and when your data contains no more than maybe a dozen categories otherwise the pie chart can start to look too busy and become difficult to read for the pie chart we'll use the model names manufactured by Ford along with their unit sales to create our pie chart we'll select the data from two non-adjacent columns in this case model and unit sales then we select pie chart from the 2D pie category of the charts group the new floating chart area contains our pie chart which displays the relative contribution of unit sales from individual Ford car models which are the slices of the pie and they combine together to make an overall total of unit sales of Ford cars which is the whole pie let's change the chart Style to customize the look of the pie chart there are numerous styles to choose from in the gallery and you can even make combinations of multiple styles for example here we've chosen style 3 and style 7 which gives us the percentage values displayed in each slice and a nice dark contrasting background color let's again move that chart this time to the center of the worksheet below our data lastly let's have a look at bar charts a bar chart is a type of graph used to compare values across categories either using vertical bars or horizontal bars in the case of column
charts which are a variety of bar charts in a bar chart the categories are usually arranged on the vertical axis and the values are on the horizontal axis whereas in a column chart the categories are typically arranged on the horizontal axis and the values are displayed on the vertical axis to create our bar chart we'll select the data from two non-adjacent columns again in this case model and retention percentage then we select a style of bar chart from the 2D bar category of bar charts the new floating chart area contains our bar chart which displays comparative values for the retention percentage of the different Ford models using horizontal bars again we can change the chart color to customize the look of the bar chart if you just want to choose a color scheme based on the palette of colors rather than a style you can click the change colors button and then select a color palette from the list let's also move this chart this time to the right side of the worksheet below our data in this video we learned how to create line pi and bar charts in Excel in the next video we'll look at how to use the pivot chart feature in Excel [Music] foreign how to create a few basic types of charts in Excel in this video we'll look at how to create some other basic charts using the pivot chart feature from a pivot table in Excel we'll first create area charts and then column charts from a pivot table please note that the price and resale values in this sample data set are not real data and are merely used for explanatory and demonstration purposes a pivot chart is used to show the data series categories and chart axes the same way a basic chart is used but connecting a pivot table with it simply put a pivot chart is nothing more than a graphical representation of a pivot table in Excel it's useful when we have a pivot table containing complicated data a pivot chart can help us make sense of such data let's start with area charts an area chart is a type of graph used to show information as a series of data points connected using straight lines with a filled area below it area charts can handle both positive and negative values like line charts first let's create a copy of the pivot 1 worksheet of the car sales workbook in this copied worksheet of the car sales workbook let's first filter the data of the pivot table to display only Toyota car models if we expand the field Toyota we can see the details of different models from Toyota such as the average price of each model and the average year resale value now let's create an area chart using the pivot chart feature with this data here let's select the area chart type and choose the 3D area chart here we see a floating chart containing our area chart which displays the trend of average price as well as average year resale value of Toyota cars across its models note that we can also filter the data in the pivot chart itself rather than in the pivot table this is one of the key differences between a standard chart and a pivot chart so in our pivot chart let's filter the data to display only Chevrolet car models when we expand the field the pivot chart displays our data here we can see that it seems that the higher priced models don't retain their value after one year compared to the lower priced models we can also use the model filter drop down in our pivot chart to filter on models too now we are only displaying seven of the nine Chevrolet models in our pivot chart and its Associated pivot table so we can see that when we make a change such as adding a 
filter directly in our pivot chart those changes are immediately reflected in our pivot table data and the reverse is obviously also true if we make a change in our pivot table that change is immediately viewable in our pivot chart now let's have a look at column charts a column chart is a type of graph used to compare values across categories using vertical bars in a column chart the categories are typically arranged on the horizontal axis and the values are displayed on the vertical axis to create our column chart let's first create another copy of the pivot1 worksheet of the car sales workbook in this copied worksheet of the car sales workbook let's again filter the data of the pivot table but this time to display only BMW Cadillac and Hyundai car models now let's create a column chart using the pivot chart feature with this data here let's select the column chart type and choose the 3D clustered column chart the new floating area contains our column chart which displays comparative values for the average price as well as the average year resale value for BMW Cadillac and Hyundai cars using vertical bars from this chart data we can see that it seems that both the Hyundai and BMW ranges retain their one-year resale value better than the Cadillac models do now let's view all the BMW models in the table and chart by expanding the cell in the pivot table but note that we can also use the plus and minus buttons in the chart to expand and collapse the data view too these buttons can drill down and drill up through multiple category levels if you have multiple fields in the Axis or categories section of the pivot chart Fields pane for example if we had the models further categorized into model variants and then into engine capacities and then into colors and so on now we can see all the models for all three manufacturers displayed in our column chart note however that these buttons can only be used to expand or collapse all Fields if you want to expand or collapse just one field then you need to do it in the pivot table rather than the chart as we did in the previous step let's change the chart Style to customize the look of the column chart there are numerous styles to choose from in the gallery for example here we've chosen style 9 which gives us a nice dark contrasting background color in this video we learned how to create area and column charts using the pivot chart feature from a pivot table in Excel we also learned how to filter data using either the pivot table or the pivot chart and we learned how to expand and collapse data levels using both the pivot table and the pivot chart in the next video we'll look at some Advanced charts available in Excel thank you [Music] foreign now that we've learned how to create basic charts in this video we'll look at how to create some Advanced charts in Excel we'll first create tree Maps then scatter charts and lastly histograms please note that the price and resale values in this sample data set are not real and are merely used for explanatory and demonstration purposes let's start with tree Maps a tree map chart is used to compare values across hierarchy levels and show proportions within hierarchical levels as rectangles tree maps are a good way of displaying lots of data in one graphical asset because they use the color and closeness of proportional shapes within the chart to represent hierarchical data categories which is a difficult thing to achieve with most other types of chart in the tree map worksheet of the car sales workbook let's first 
select the data from two non-adjacent columns model and unit sales now let's create a tree map chart with this data we select tree map chart from the hierarchy category of the charts group we now see a floating chart area containing our tree map chart which displays the proportion of the unit sales of Ford cars within hierarchical levels as rectangles let's change the chart title to unit sales of Ford cars which we can do by simply double-clicking the chart title text box and editing the text let's change the chart Style to customize the look of the tree map chart there are numerous styles to choose from in the gallery for example here we've chosen Style 2. we can easily see from this tree map chart that the F series model is by far the biggest proportion of our Ford car sales followed by the Explorer and Taurus models which constitute a similar proportion of our Ford car sales while the Contour model shows the smallest proportion of Ford car sales next let's have a look at scatter charts a scatter chart is a type of graph used to compare two sets of numerical data values and show relationships between those sets of numerical values a scatter chart combines the two sets of values on the X and y-axis into single points of data and then displays them in clusters in the chart for this reason you will also sometimes see them referred to as XY charts common uses include the comparison of statistical scientific or engineering data values to create our scatter chart let's first select the data from two adjacent columns price and year resale value in the scatter worksheet of the car sales workbook now let's create a scatter chart with this data we select scatter chart from the X Y scatter category of the charts Group which compares the price of cars from all manufacturers with their year resale value let's change the chart title to comparing price with year resale value which we can do by simply double-clicking the chart titled text box and editing the text let's change the chart Style to customize the look of the scatter chart there are numerous styles to choose from in the gallery for example here we've chosen style 8. 
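if you ever want to build the same kind of scatter chart programmatically rather than through the ribbon the openpyxl library can do it this is just a rough sketch under assumed file sheet and column positions not the workflow taught in the course

from openpyxl import load_workbook
from openpyxl.chart import ScatterChart, Reference, Series

# hypothetical workbook and sheet names, with assumed column positions
wb = load_workbook("car_sales.xlsx")
ws = wb["scatter"]

chart = ScatterChart()
chart.title = "Comparing Price with Year Resale Value"
chart.x_axis.title = "Retail Price"
chart.y_axis.title = "Year Resale Value"

# assume price is in column D and year resale value in column E, data in rows 2 to 158
prices = Reference(ws, min_col=4, min_row=2, max_row=158)
resale = Reference(ws, min_col=5, min_row=2, max_row=158)
chart.series.append(Series(resale, prices, title="Price vs resale value"))

ws.add_chart(chart, "H2")  # place the chart with its top-left corner at cell H2
wb.save("car_sales_with_chart.xlsx")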
now let's add some axis titles for both the horizontal x-axis and the vertical y-axis we'll call the horizontal retail price and the vertical year resale value we can see from this scatter chart that as the retail price increases so does the differential between the retail price and the year resale value generally speaking the lower priced cars retain their resale value after one year better than the higher priced cars lastly let's have a look at histograms a histogram is a graph that shows the distribution of the data grouped into bins although a histogram may look like a column or bar chart it's totally different while a bar chart is used to compare data a histogram is used to display distribution of data to create our histogram let's first select the data from two non-adjacent columns model and price in the histogram worksheet of the car sales workbook now let's create a histogram with this data we select histogram from the statistical category of the charts group the new floating chart area contains our histogram which displays frequency distribution of the price of cars from all manufacturers note how Excel automatically puts the different price ranges into nine equally sized separate bins for you the first bin contains cars priced between nine thousand two hundred and thirty five dollars and eighteen thousand six hundred and thirty five the second bin contains cars priced between eighteen thousand six hundred and thirty five dollars and twenty eight thousand thirty five dollars and so on up to a maximum price range of 84 435 dollars to ninety three thousand eight hundred and thirty five let's change the chart title to count of car models by price range which we can do by simply double clicking the chart title text box and editing the text let's change the chart Style to customize the look of the histogram there are numerous styles again to choose from in the gallery for example here we've chosen style three this style shows the count values of the individual rectangles for each price range rather than using a vertical scale on the y-axis so from this histogram chart we can easily see that the largest proportion of car models are in the 18 635 dollar to twenty eight thousand thirty five price range with a count of 62 models in this range followed by the cheapest price range of nine thousand two hundred and thirty five to eighteen thousand six hundred and thirty five dollars which has a count of 42 models and the fewest count of models in a price range is shared by the two most expensive price ranges with only one model in each bin although Excel chooses your bin ranges automatically when you create a histogram you can change the bin sizes to suit your needs this is done by opening the formatting pane for the relevant chart element in this case the horizontal axis in the axis options section you can choose to display bins by several factors including bin width and by number of bins for example when we change the bin width value you can now see 15 bins in the chart because the price ranges are much narrower bins two and three have the two highest counts of 34 and 33 respectively and Bin 14 shows no models in this price range at all and if we change the axis options to display a set number of bins then the histogram updates again to show the price ranges split into the number of bins specified that is 10 bins and once again we can see that bin 2 has the largest proportion of models in its price range and if we choose automatic then the histogram reverts back to the format we started with 
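the bin logic Excel applies here can be mimicked with pandas if you want to check or tweak the ranges yourself a minimal sketch assuming a made-up list of prices in place of the real price column

import pandas as pd

# hypothetical prices standing in for the price column of the histogram worksheet
prices = pd.Series([9235, 12640, 18225, 21975, 26990, 31930, 42000, 85500, 93835])

# equivalent of "number of bins": let pandas split the range into 10 equal-width bins
by_count = pd.cut(prices, bins=10).value_counts().sort_index()

# equivalent of "bin width": build explicit bin edges 9400 apart and count models per bin
width = 9400
edges = list(range(9000, 94000 + width, width))
by_width = pd.cut(prices, bins=edges).value_counts().sort_index()

print(by_count)
print(by_width)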
in this video we learned how to create tree Maps scatter charts and histograms in Excel in the next video we'll look at some of the other Advanced charts available in Excel like filled map charts and sparklines [Music] in this video we'll look at some more advanced charts in Excel we'll first create a filled map chart and then add sparklines to our data and lastly we'll briefly discuss some of the other charts available in Excel let's start with filled map charts a filled map chart is a type of chart used to compare values and show categories across geographical regions this chart is suitable for data which contains geographical regions like countries states or postal codes in the map chart worksheet of the car sales workbook let's first copy data from the pivot table containing country of sale and sum of unit sales then we'll paste the copied data beside the table now let's create a filled map chart with this data after selecting the data we select filled map chart from the map category of the charts group the new floating chart area contains our filled map chart which displays the sum of unit sales of cars across different countries of sale let's change the chart title to sum of unit sales of cars by country which we can do by simply double-clicking the chart title text box and editing the text let's change the chart Style to customize the look of the filled map chart there are numerous styles to choose from in the gallery to suit your preference we can see from this filled map visualization that the darker blue color which denotes the larger number of unit sales is covering the United States while the paler blue colors which denote medium numbers of unit sales are covering areas such as Canada Western Europe and Scandinavia and the almost white color which denotes the lowest number of unit sales is predominantly covering Eastern Europe India Japan and Australia next let's have a look at sparklines sparklines are mini charts placed inside single cells to represent a selected range of data they are typically used to show data Trends such as seasonal increases or decreases economic cycles and share price or rate fluctuations they can also be used to highlight Max and Min values a sparkline provides the greatest impact when it is placed close to the data it represents to create our sparklines let's first select data from the four adjacent columns unit sales Q1 unit sales Q2 unit sales Q3 and unit sales Q4 in the sparklines worksheet of the car sales workbook now let's create sparklines with this data we'll select the line type of sparkline from the sparklines group we then need to specify where we would like our sparkline to appear on the worksheet we can do this by either typing in the cell reference in the location range box or better still just click the cell in the worksheet where you want it to appear and Excel will fill it in for you note that it uses an absolute reference by adding dollar symbols to the cell references and then we can copy that sparkline down the rest of the column we now see a column containing our sparklines which displays the trend of the unit sales of Ford cars over the four quarters of a year let's name the column header of the column with the sparklines as quarter sales Trends we'll adjust the column width and also adjust the row height to display the sparklines more clearly let's also display maximum and minimum values on the sparklines then change the chart Style to customize the look of the sparklines there are several Styles in the gallery for you to choose from
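something similar to what the max and min markers on these sparklines show the best and worst quarter per model can also be computed directly a small illustrative sketch with invented quarterly figures not data from the workbook

import pandas as pd

# invented quarterly unit sales standing in for the sparklines worksheet columns
df = pd.DataFrame({
    "model":         ["Escort", "Mustang", "Focus"],
    "unit_sales_q1": [18.2, 9.5, 32.1],
    "unit_sales_q2": [25.4, 11.0, 41.7],
    "unit_sales_q3": [27.9, 12.3, 38.2],
    "unit_sales_q4": [14.1, 15.8, 30.6],
})

quarters = ["unit_sales_q1", "unit_sales_q2", "unit_sales_q3", "unit_sales_q4"]
df["best_quarter"] = df[quarters].idxmax(axis=1)   # quarter with the peak unit sales per model
df["worst_quarter"] = df[quarters].idxmin(axis=1)  # quarter with the lowest unit sales per model
print(df[["model", "best_quarter", "worst_quarter"]])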
and finally let's adjust the weight of the lines in the spark lines to make them stand out a bit these sparklines show us that Ford Escort unit sales started low in the first quarter and then increased in quarter two and three then declined again in quarter four we can also determine that generally across the majority of the range of Ford car models Q3 is the best quarter of unit sales over the year with a couple of exceptions such as Mustangs in quarter 4 and focus models in Quarter Two lastly let's have a look at some other available charts in Excel the waterfall chart type is used to show cumulative effect of a series of positive and negative values this is suitable for data which represents inflows and outflows like financial data the funnel chart type is used to show progressively smaller stages in a process this is suitable for data which shows progressively decreasing proportions the stock chart type is used to show the trend of stocks performance over time this is best suited to data with a series of multiple stock price values like volume open high low and close the surface chart type is used to show Trends and values across two dimensions in 3D surface areas or 2D contoured charts this is most suitable when categories and data series are both numeric and lastly the radar chart type is used to show values relative to a center point and is most suitable when categories are not directly comparable in this video we looked at how to create field map charts and sparklines in Excel and we reviewed some of the other charts available [Music] thank you in this video we will have a brief introduction to dashboards including what they are what they consist of why they can be a useful component in a data analysts toolkit and an essential skill in a data analyst skill set the term dashboard comes from the automotive industry where car designers have put the most important gauges and other display information such as engine oil temperature current speed current RPM amount of fuel left and so on in a handy graphical display that is easy for the driver to view and understand originally these displays were analog but most are now digital and use varying forms of visualization including digital meters and mini graphs you can take that same idea and apply it to a dashboard in a data analysis application designers of these types of dashboard want to put key business information in one place in the form of graphical displays to make it easier for the viewer to understand them dashboards can take this a step further by also allowing the user to interact with the dashboard and modify exactly what information they see by using tools supplied on the dashboard users of a dashboard therefore not only get a Consolidated visualization of their business data and key performance indicators or kpis but also get a controllable self-service business intelligence or bi interface through the use of filters which enable them to control precisely what information they see dashboards are typically created in a data analysis application by using multiple pivot tables and charts visualizations such as map charts and sparklines and filtering tools such as slicers and timelines these pivot tables and charts could be created from a single data source or from multiple data sources when you use dashboards in your data analysis application you get the following benefits they offer insights into your key data they can alert you to patterns and Trends in your data they provide an interactive experience for the user allowing them 
to filter what data they see they are updated dynamically as the source data changes they provide a centralized and Consolidated view of business data a dashboard can be a very useful tool in areas of the business such as fiscal forecasting and Reporting project management executive reporting Human Resources customer service help desk issue tracking Healthcare monitoring call center analytics social media marketing and many more for a budding data analyst dashboards can be a vital skill to add to their Arsenal as the majority of employers see the dashboarding skill as a must have rather than a nice to have if you can show that you have the skills to create accomplished and spectacular interactive yet easy to view and use dashboards whether in a spreadsheet application such as Microsoft Excel or Google Sheets or using a more Advanced Data analysis and visualization application such as Bokeh or Dash in Python R Studio Shiny Tableau or IBM cognos analytics then that will greatly help you in your future career as a data analyst in this video we had a brief introduction to dashboards including what they are what they consist of why they can be a useful component in a data analyst's toolkit and an essential skill in a data analyst's skill set in the next videos we will learn how to create a simple dashboard using a spreadsheet application [Music] in the first part of this video we will listen to several data professionals discuss how dashboards can help when presenting data results can you tell us how dashboards can help you when presenting data results one thing I particularly love with data are dashboards they take the fluff out and they show you the most important things that you want to see often in real time and you can make them as pretty as you want I've seen some dashboards that are frankly an eyesore because of trying to cram so much data into one dashboard so it's important to be specific and be succinct in what it is that you're looking for so that you can avoid those sorts of things a lot of times dashboards are great for executives or business owners on the go who are maybe looking at dashboards from a mobile device and they don't really have the space or the capacity on a mobile device to look at so much data so dashboards can really be highly effective in a short amount of time as long as you understand the deliverables and what it is that your stakeholder wants to see and what's important to them for purposes of making decisions presenting the information in a way that is palatable to your audience is important because you have to have people getting value out of what you're doing a lot of times I think that's why data analysts in general get a bad rep because people think we're just number crunchers I think the bigger problem is we haven't done as good of a job as we could to really explain the numbers and so that's where PowerPoint presentations with graphs we love graphs and key performance indicators that maybe break out the information in a different way and highlight what's most important will help and also just making sure that you're reading the room so if you are in a meeting and everybody's eyes are glazed over because you're talking numbers maybe you need to ask them what information are you looking for what is most important to you so that when you go back and you create your next dashboard or create your next report you can highlight things that are important to your audience because we always need to be reading the room we
need to make sure that we are showing our value showing people helping people to understand educating them so that they can bring up their knowledge base and so that starts to help the fear dissipate the fear of the numbers starts to go away if we would just show them what the numbers actually mean and we can do that not just by throwing numbers at them so they'll get overwhelmed but by helping them to see it graphically and with dashboards and kpis and other ways just to bring it to them and make it real to them a dashboard in a spreadsheet is an indicator for action just like on your car dashboard if you see a low fuel light that means you need to put gas in your car so a dashboard in a spreadsheet should be just that simple it should tell the person what they need to focus on immediately or give any indication that things need to change because it's either going up in the wrong direction or it's going down but don't put information that's nice to know on a dashboard it's on a need to know basis especially if you want to get action in the second part of this video we will listen to a data professional discuss how cognos analytics can help you create outstanding visualization dashboards for presenting data results can you tell us how you use cognos analytics to create visualization dashboards for presenting data results IBM cognos analytics can really help you create better visualizations and dashboards in a number of different ways starting off we have our templates allowing you to quickly select from a template and simply drag and drop your visualizations into the slots to help you create something that's visually compelling effectively and easily we also have what we call our visualization recommender if you grab a couple of fields and drag them onto the canvas we'll recommend a visualization if you don't like the one we recommend right off the bat then you do have the ability to go in and start to select some different visualizations from our recommendations on top of this we have also started to infuse AI into the offering so from this you can start to have the system actually generate an entire dashboard for you you have a conversation with our assistant ask it questions and once you focus in on the particular area that you're interested in simply say generate dashboard at that point you'll get a beautiful dashboard created and laid out nicely that you can start to use as your starting point for further discussions and help you really understand the data that's in your system a couple of things that we wanted to highlight one is the advanced analytic capabilities whether this is through our visualizations like our key driver analysis or through our AI infused forecasting the last thing would be the ability to share your visualizations and your dashboards in just a few clicks whether that's sharing it in the system through a link whether it's pushing it to email or even pushing it into a slack channel where you can start to have a discussion there [Music] now that we've learned about the basics of dashboards and why they are an essential tool for a data analyst in this video we'll look at how to set up and configure a relatively simple dashboard in Excel which will help us to tell a story about our data before creating our first dashboard we would have first collected and organized the data then verified that the data in our worksheet was clean error-free and did not contain any blank rows or columns and then we would have formatted it as a table next we would have created
some pivot tables to help us analyze our data and we would have performed some sorting and filtering on the data in our pivot tables to highlight the key aspects of our data analysis lastly we would have created various data visualizations such as charts maps and slicers to help us tell a story about our data findings in this example Car Sales workbook we've already gone through those processes of collecting cleaning analyzing and visualizing our data and we are now at the point where we can combine that data analysis and those visualizations in a digital dashboard that will help us present our key data findings to stakeholders the first thing we need to do is create a new worksheet to host our dashboard and give it a name then drag it to the end of the tabs list in the workbook as we've already created several visualizations we can use those to populate our dashboard so if we look at some of the other worksheet tabs these contain various visualizations which we've been working on throughout this course we can just use some of them by copying a few charts and other visualizations from there to our dashboard we'll copy this pie chart from this worksheet and paste it to our dashboard then we'll grab a copy of this 3D column chart and paste that too we'll also copy the 3D area chart from the other pivot chart sheet to our dashboard we'll grab the tree map chart and copy that to our dashboard too let's copy the scatter plot chart too we'll also copy the histogram and copy that to our dashboard worksheet and let's take a copy of the map chart visualization 2. let's also copy the sparklines visualization to our new dashboard worksheet lastly let's go back to the line Pi Bar worksheet and select both the line chart and the bar charts and copy them both to the dashboard worksheet okay so now we have a lot of different visualizations on our dashboard we can make things look a little better by resizing some of the visualization objects and moving them around a bit so we've now resized some of the visualizations and moved things around and then we could zoom out to see all the visualizations on screen but if we remember what we heard from some of the subject matter experts during the expert viewpoints video sometimes less is more and one or two of the experts mentioned that if we provide and display too much information the key points can sometimes get lost so we should create a copy of this dashboard on another worksheet and then thin down the amount of visualizations and maybe make them more focused to highlight one or two of the more important views we want to convey to do that we'll first make a copy of the dashboard worksheet and move it to the end of the worksheets tabs on the copy dashboard 2 worksheet we'll now delete a few Superfluous charts let's first remove the tree map because that is giving us almost the same information as the pie chart then we can remove the 3D column chart because that has essentially the same data as the 3D area chart and let's also lose the scatter plot histogram and the sparklines because they aren't really key to our message let's copy the slicer from the line Pi Bar worksheet to our dashboard to give us some interactivity now let's Zoom back out and do some arranging and resizing to these visualizations to make the dashboard look a little sharper so now we've rearranged and resized the visualizations to make the dashboard look a little leaner and tidier we should also make some style and color changes to give our dashboard a more consistent look and feel we'll 
apply the same style to all the chart elements so we'll apply a dark gray style and monochromatic green colors to the 3D area chart and the line chart and the pie chart and the bar chart and the map chart and we can recolor the slicer too to fit in with our color scheme now things look a lot better and more professional and the last thing we should do is remove some of the Excel interface elements and other bits of unnecessary screen clutter to give us a nice clean looking dashboard we can remove screen clutters such as grid lines the formula bar and headings and we can collapse the ribbon too and that should do it now when we either present this dashboard or email it to a key stakeholder it includes some interactivity via the slicer and also via the filterable pivot chart if we use the slicer to select several Ford car models all three charts that are related to that data get updated at the same time we can modify filters in the 3D area chart to display only Ford cars instead of Chevrolets and if we switch to the pivot chart 1 worksheet we can see that the original data has also been updated with this new filter and when we make changes to the original Source data such as increasing the value of unit sales of cars in Australia the map chart visualization in this worksheet updates and the Australia turns dark blue but also the map chart visualization in the dashboard gets updated too so we can see that Australia is now dark green on this map chart all of this shows how live and interactive the data in our dashboard is creating a clean focused and interactive dashboard can be a major asset when trying to tell a story about data in this video we learned how to set up and configure a relatively simple dashboard in Excel which will help us to tell a story about our data in the next video we will get an introduction to IBM cognos analytics data analysis and visualization business intelligence platform [Music] in this video we're going to cover a couple of things the first is quick overview of what cognos analytics is and we're going to show you how to sign up for the trial cognitive analytics is a multi-faceted tool allowing you to perform both mode 1 and mode 2 type of analysis all in one product it contains a number of different tools such as the ability to model your data explore your data create compelling Advanced analytic visualizations like our key driver analysis display natural language generated based off of your data and create reports which are specific and tailored to your individual users either through filters or through the ability to create bursts we also have the ability to create fantastic dashboards which will be the main focus of this particular course now to sign up for the trial we're going to go to the ibm.biz try underscore cognos if you already have an account you can log in here and you'll simply have to fill out part of this form if you don't then we'll go ahead fill out this form quickly and the key one to take note of is to select a data center which is close to you in your particular geography now the system is being spun up for us and we can go ahead and actually launch this directly from this workflow now that you're in you can go ahead and either manage your subscription through this button or alternatively you can always re-access the system through this URL in the next video we'll take you through a little bit around how to navigate and use cognos and its dashboard capabilities [Music] in this particular video we'll cover off how we can upload spreadsheets 
General navigation and cognos analytics how to start a new dashboard using dashboard templates as well as navigating within the cognos analogux dashboard environment there are two main navigation areas for cognitive analytics along the left hand side as well as along the top these will change and update based on the area you're in the product now for today while cognos can connect to a number of databases we're going to start with simply uploading an Excel file now we can do this in one of two ways we can come to the new select upload files and navigate to find the file in question alternatively what we can do is we can drag and drop the file onto our main landing page here and have it take us directly into the experience that we're looking to do in our case here dashboard now you'll notice that this says that uploaded content will land in my content which is up there on the left hand navigation whether you do it from here or through the plus and upload it'll still land in the same place once in my content it can be taken moved around into a shared area within team content as the file uploads here you'll notice that we see something that says analyzing this allows us to integrate the data get an understanding of what's in the data in order to be able to help make some better assumptions and decisions for you as we look to build out content the first thing we'll do is we go to build a dashboard is to select a template you'll see a number here and based on what you're trying to achieve how many visualizations and what types you may want to select something specific in this case I'll choose one that has four and one larger space once we're in the dashboard we'll notice that the pane with our uploaded file column headers is displayed there's a few other things that we want to highlight here in terms of navigation to help you with the experience here in cognos Analytics our second pin down here is allows you to pin different visualizations for reuse in other dashboards across the system we'll come back to our assistant in a future video this allows you to ask questions in natural language and have the system tell you a little bit about your data as well as provide some visualizations the next system here is our view of all the different visualizations we support in the dashboard as well as the ability to upload your own custom visualization should we not have one that meets your needs the final one here is our additional widgets whether that's text images video hyperlinks or any of these shapes that are viewed here in the next video we'll dive a little bit deeper into how specifically to create dashboards thank you [Music] in this video we'll cover off creating some simple dashboards in cognos Analytics we'll cover off a variety of different methods for creating a visualization whether it's through our automatic generation manually populating slots or using our assistant we'll also cover off how you can filter within a cognos analytics dashboard as I mentioned as the data was uploaded we took a look at the data and tried to understand what types of data were included and that's denoted by the icon in front of each of these elements now in the case of order ID we actually want to switch this up and change it change the properties from a measure to an identifier with that done let's go ahead and start to create some visualizations so you can simply grab from my tree and drag it onto the canvas now in this case if I drop on top of the box what it'll do is it'll fill the space available as denoted 
by the template next look at the number of orders that we've had by counting the number of order IDs that we have now in this case you determine the best visualization for an identifier to be a list we can go ahead and take a look at what other recommended visualizations in this case there aren't any but we know we want to turn this into a summary so we can just go and manually select it next up let's look at the quantity ordered and we'll finally finish with the average sale so in this case I'll drag sales onto the canvas but what I'll do is I'll change our summarization from sum to average so just like that I now have a few kpis to be able to Monitor and track the other way we can search create visualizations is by selecting the specific visualization we want and dragging it onto the canvas in this case I'm interested in looking at our sales across the world based on countries so in this case I can find country alternatively if I have a lot of data I can simply start to type at the top and I'll see country comes to the surface as we said we wanted to take a look at sales in this case we look specifically at country but as you can see we can go further down in terms of latitude and longitude now with this I can manually drag and drop to make this as big or small as I want and you'll see that we show percentages of how much real estate it's actually taking up of course each of our visualizations has a number of properties we won't get into those in too much detail but if there's something that you're looking to do take a look at the properties and it's likely there the final way that we'll show creating visualizations is through our assistant now in this case I may have an idea of what I want to look at or I may not if I don't I can simply ask it to suggest questions this will offer up some insight that I may not have been thinking about in this case we'll keep it simple we'll just look at which product line has the top sales with a click I'm now presented with the visualization as well as Alternatives based on that particular visualization if I'm happy with a view I can simply drag and drop this onto my canvas and it now becomes a visualization alongside the others I can continue to resize this now all of our dashboard is designed and meant to be interactive so in this case if I were interested in seeing the rest of my dashboard through the specific lines of classic cars I can click on classic cars and you'll see that all my visualizations are now updating specifically with the filter of classic cars now this is one way that I can do it and I can do it multiple clicks or alternatively I can take a particular field and drop it in if I were perhaps interested in a particular status I could drive this into the all tabs or if I just wanted to be specific to this tab in here now with this I'll be able to choose one or many in this case maybe we want to take a look at how much we have in the on hold status so with that you'll see again all of these have been updated to reflect simply the on hold so it looks like from a worldwide view we only have holds in a couple of countries which is which is positive in the next video we'll delve deeper into a few of the more advanced capabilities of the dashboard [Music] in this video we're going to cover a few of the more advanced capabilities in cognos analogux dashboards cover how to create calculations how to leverage navigation paths how to do excludes from visualizations as well as how to set top bottom on a visualization much like Excel our dashboard 
can create calculations we list out a number of different options here that you can take a look at or alternatively you can simply start typing and will offer up suggestions now in this case we're interested in looking at our MSRP minus the price that we sell each unit for to give us a margin calculation now this calculation is treated exactly the same as any of our other fields that we have in here so this case let's select our margin and we'll look at product line and bring that in this case we can see that trains is not not doing very well it's actually a negative margin for us now we may want to drill into this a little bit more so we can create what we call a navigation path in this case I can choose any field from my data to be able to drill up and down on in this case we want to start with product line and perhaps we want to see customers and ultimately if there are specific orders now I can right click on trains I can start to drill down in this case I can see that we have a few offenders who are in the negatives when we look at this particular field so if we go ahead and look at something like many gifts again we can drill down a little bit further to see that well they have one which is positive their other ones are in the negatives let's take another look at something a little bit different so if we look at sales by status and by product line now in this case we're still filtered on trains so we may want to go back and navigate up in this case we're still filtered on just trained so let's unclick that to get back to our initial starting point now in this case we can see that our shift is really dramatically having an impact on this so we can go ahead and we can choose to exclude our shipped items now we get a much better review of everything else that's going on for the other statuses another thing that we could do is if we have a large number of data set data points in this case perhaps customer sales customer name and sales we may want to filter down on only our most important ones so in this case I can come to sales right click and I can ask it to show us a top X number in this case by default we have 10. so now we can see these are our top 10 customers and one last one just for fun we can actually create infographics on the Fly so in this case if I've got sales I can take any one of my shapes here perhaps a piggy bank drop it on top and just like that we've now created an infographic [Music] a type is how python represents different types of data in this video we will discuss some widely used types in Python you can have different types in Python they can be integers like 11 real numbers like 21.213 they can even be words integers real numbers and words can be expressed as different data types the following chart summarizes three data types for the last examples the First Column indicates the expression the second column indicates the data type we can see the actual data type in Python by using the type command we can have int which stands for an integer and float that stands for float essentially a real number the type string is a sequence of characters here are some integers integers can be negative or positive it should be noted that there is a finite range of integers but it is quite large floats are real numbers they include the integers but also numbers in between the integers consider the numbers between 0 and 1. 
we can select numbers in between them these numbers are floats similarly consider the numbers between 0.5 and 0.6 we can select numbers in between them these are floats as well we can continue the process zooming in for different numbers of course there is a limit but it is quite small you can change the type of the expression in Python this is called typecasting you can convert an INT to a float for example you can convert or cast the integer 2 to a float 2. nothing really changes if you cast a float to an integer you must be careful for example if you cast the float 1.1 to 1 you will lose some information if a string contains an integer value you can convert it to int if we convert a string that contains a non-integer value we get an error check out more examples in the lab you can convert an INT to a string or a float to a string Boolean is another important type in Python a Boolean can take on two values the first value is true just remember we use an uppercase t Boolean values can also be false with an uppercase F using the type command on a Boolean value we obtain the term Bool this is short for Boolean if we cast a Boolean true to an integer or float we will get a 1. if we cast a Boolean false to an integer or float we get a zero if you cast a one to a Boolean you get a true similarly if you cast a zero to a Boolean you get a false check the labs for more examples or check python.org for other kinds of types in Python [Music] in this video we'll cover expressions and variables Expressions describe a type of operation the computers perform expressions are operations the python performs for example basic arithmetic operations like adding multiple numbers the result in this case is 160. we call the numbers operands and the math symbols in this case addition are called operators perform operations such as subtraction using the subtraction sign in this case the result is a negative number we can perform multiplication operations using the asterisk the result is 25. in this case the operands are given by negative and asterisks can also perform division with the forward slash 25 divided by 5 is 5. 25 divided by 6 is approximately 4.167 in Python 3 the version we will be using in this course both will result in a float we can use the double slash for integer division where the result is rounded be aware in some cases the results are not the same as regular division python follows mathematical conventions when performing mathematical Expressions the following operations are in a different order in both cases python performs multiplication then addition to obtain the final result there are a lot more operations you can do with python check the labs for more examples we will also be covering more complex operations throughout the course the expressions in the parentheses are performed first we then multiply the result by 60. the result is 1920. now let's look at variables we can use variables to store values in this case we assign a value of 1 to the variable my underscore variable using the assignment operator I.E the equal sign we can then use the value somewhere else in The Code by typing the exact name of the variable we will use a colon to denote the value of the variable we can assign a new value to my underscore variable using the assignment operator we assign a value of 10. the variable now has a value of 10. 
the old value of the variable is not important we can store the results of expressions for example we add several values and assign the result to x x now stores the result we can also perform operations on X and save the result to a new variable y y now has a value of 2.666 we could also perform operations on X and assign the value X the variable X now has a value 2.666 as before the old value of x is not important we can use the type command in variables as well it good practice to use meaningful variable names so you don't have to keep track of what the variable is doing let's say we would like to convert the number of minutes in the highlighted examples to number of hours in the following music data set we call the variable that contains the total number of minutes total underscore Min it's common to use the underscore to represent the start of a new word you could also use a capital letter we call the variable that contains the total number of hours total underscore hour we can obtain the total number of hours by dividing total underscore Min by 60. the result is approximately 2.367 hours if we modify the value of the first variable the value of the variable will change the final result values change accordingly but we do not have to modify the rest of the code [Music] in Python A String is a sequence of characters a string is contained within two quotes you could also use single quotes a string can be spaces or digits a string can also be special characters we can bind or assign a string to another variable it is helpful to think of a string as an ordered sequence each element in the sequence can be accessed using an index represented by the array of numbers the first index can be accessed as follows we can access index 6. moreover we can access the 13th index we can also use negative indexing with strings the last element is given by the index negative one the first element can be obtained by index negative 15 and so on we can bind a string to another variable it is helpful to think of string as a list or Tuple we can treat the string as a sequence and perform sequence operations we can also input a stride value as follows the two indicates we'd select every second variable we can also incorporate slicing in this case we return every second value up to index 4. we can use the Lend command to obtain the length of the string as there are 15 elements the result is 15. 
we can concatenate or combine strings we use the addition symbols the result is a new string that is a combination of both we can replicate values of a string we simply multiply the string by the number of times we would like to replicate it in this case three the result is a new string the new string consists of three copies of the original string this means you cannot change the value of the string but you can create a new string for example you can create a new string by setting it to the original variable and concatenated with a new string the result is a new string that changes from Michael Jackson to Michael Jackson is the best strings are immutable backslashes represent the beginning of Escape sequences Escape sequences represent strings that may be difficult to input for example backslashes n represent a new line the output is given by a new line after the backslashes N is encountered similarly backslash T represents a tab the output is given by a tab where the backslash T is if you want to place a backslash in your string use a double backslash the result is a backslash after the escape sequence also place an r in front of the string now let's take a look at string Methods strings or sequences and as such have apply methods that work on lists and tuples strings also have a second set of methods that just work on strings when we apply a method to the string a we get a new string B that is different from a let's do some examples let's try with the method upper this method converts lowercase characters to uppercase characters in this example we set the variable a to the following value we apply the method upper and set it equal to B the value for B is similar to a but all the characters are uppercase the method replaces a segment of the string I.E a substring with a new string we input the part of the string we would like to change the second argument is what we would like to exchange the segment with the result is a new string with the segment changed the method find find substrings the argument is the substring you would like to find the output is the first index of the sequence we can find the substring jack if the substring is not in the string the output is negative 1. 
check the labs for more examples [Music] in this video we will cover lists and tuples these are called compound data types and are one of the key types of data structures in Python tuples tuples are an ordered sequence here is a tuple ratings tuples are expressed as comma separated elements within parentheses these are values inside the parentheses in Python there are different types strings integer float they can all be contained in a tuple but the type of the variable is Tuple each element of a tuple can be accessed via an index the following table represents the relationship between the index and the elements in the Tuple the first element can be accessed by the name of the Tuple followed by a square bracket with the index number in this case Zero we can access the second element as follows we can also access the last element in Python we can use negative index the relationship is as follows the corresponding values are shown here we can concatenate or combine tuples by adding them the result is the following with the following index if we would like multiple elements from a tuple we could also slice tuples for example if we want the first three elements we use the following command the last index is one larger than the index you want similarly if we want the last two elements we use the following command notice how the last index is one larger than the length of the Tuple we can use the Len command to obtain the length of a tuple as there are five elements the result is five tuples are immutable which means we can't change them to see why this is important let's see what happens when we set the variable ratings 1 to ratings let's use the image to provide a simplified explanation of what's going on each variable does not contain a tuple but references the same immutable Tuple object see the objects and classes module for more about objects let's say we want to change the element at index 2. because tuples are immutable we can't therefore ratings 1 will not be affected by a change in rating because the Tuple is immutable I.E we can't change it we can assign a different Tuple to the ratings variable the variable ratings now references another Tuple as a consequence of immutability if we would like to manipulate a tuple we must create a new Tuple instead for example if we would like to sort a tuple we use the function sorted the input is the original Tuple the output is a new sorted list for more on functions see our video on functions a tuple can contain other tuples as well as other complex data types this is called nesting we can access these elements using the standard indexing methods if we select an index with a tuple the same index convention applies as such we can then access values in the Tuple for example we could access the second element we can apply this indexing directly to the Tuple variable NT it is helpful to visualize this as a tree we can visualize this nesting as a tree the Tuple has the following indexes if we consider indexes with other tuples we see the Tuple at index 2 contains a tuple with two elements we can access those two indexes the same convention applies to index 3. 
we can access the elements in those tuples as well we can continue the process we can even access deeper levels of the tree by adding another square bracket we can access different characters in the string or various elements in the second Tuple contained in the first lists are also a popular data structure in Python lists are also an ordered sequence here is a list l a list is represented with square brackets in many respects lists are like tuples one key difference is they are mutable lists can contain strings floats integers we can Nest other lists we also nest tuples and other data structures the same indexing conventions apply for nesting like tuples each element of a list can be accessed via an index the following table represents the relationship between the index and the elements in the list the first element can be accessed by the name of the list followed by a square bracket with the index number in this case Zero we can access the second element as follows we can also access the last element in Python we can use a negative index the relationship is as follows the corresponding indexes are as follows we can also perform slicing in lists for example if we want the last two elements in this list we use the following command notice how the last index is one larger than the length of the list the index conventions for lists and tuples are identical check the labs for more examples we can concatenate or combine lists by adding them the result is the following the new list has the following indices lists are mutable therefore we can change them for example we apply the method extends by adding a DOT followed by the name of the method then parentheses the argument inside the parentheses is a new list that we are going to concatenate to the original list in this case instead of creating a new list L1 the original list L is modified by adding two new elements to learn more about methods check out our video on objects and classes another similar method is append if we apply a pen instead of extended we add one element to the list if we look at the index there is only one more element index 3 contains the list we appended every time we apply a method the list changes if we apply extend we add two new elements to the list the list L is modified by adding two new elements if we append the string a we further change the list adding the string a as lists are mutable we can change them for example we can change the first element as follows the list now becomes Hard Rock 10 1.2 we can delete an element of a list using the Dell command we simply indicate the list item we would like to remove as an argument for example if we would like to remove the first element the result becomes 10 1.2 we can delete the second element this operation removes the second element off the list we can convert a string to a list using Split For example the method split converts every group of characters separated by a space into an element of a list we can use the split function to separate strings on a specific character known as a delimiter we simply pass the delimiter we would like to split on as an argument in this case a comma the result is a list each element corresponds to a set of characters that have been separated by a comma when we set one variable b equal to a both A and B are referencing the same list multiple names referring to the same object is known as aliasing we know from the list slide that the first element in B is set as hard rock if we change the first element in a to Banana we get a side effect 
the value of B will change as a consequence A and B are referencing the same list therefore if we change a list B also changes if we check the first element of B after changing list a we get banana instead of Hard Rock you can clone list a by using the following syntax variable a references one list variable B references a new copy or clone of the original list now if you change a b will not change we can get more info on list tuples and many other objects in Python using the help command simply pass in the list Tuple or any other python object see the labs for more things you can do with lists thank you let's cover dictionaries in Python dictionaries are a type of collection in Python if you recall a list has integer indexes these are like addresses a list also has elements a dictionary has keys and values the key is analogous to the index they are like addresses but they don't have to be integers they are usually characters the values are similar to the element in a list and contain information to create a dictionary we use curly brackets the keys are the first elements they must be immutable and unique each key is followed by a value separated by a colon the values can be immutable mutable and duplicates each key and value pair is separated by a comma consider the following example of a dictionary the album title is the key and the value is the release data we can use yellow to highlight the keys and leave the values in white it is helpful to use the table to visualize a dictionary where the First Column represents the keys and the second column represents the values we can add a few more examples to the dictionary we can also assign the dictionary to a variable the key is used to look up the value we use square brackets the argument is the key this outputs the value using the key of Back in Black this Returns the value of 1980. 
the key the Dark Side of the Moon gives us the value of 1973 using the key The Bodyguard gives us the value 1992 and so on we can add a new entry to the dictionary as follows this will add the value 2007 with a new key called graduation we can delete an entry as follows this gets rid of the key Thriller and its value we can verify if an element is in the dictionary using the in command as follows the command checks the keys if they are in the dictionary they return a true if we try the same command with a key that is not in the dictionary we get a false in order to see all the keys in a dictionary we can use the method keys to get the keys the output is a list like object with all the keys in the same way we can obtain the values using the method values check out the labs for more examples and info on dictionaries [Music] let's cover sets they are also a type of collection sets are a type of collection this means that like lists and tuples you can input different python types unlike lists and tuples they are unordered this means sets do not record element position sets only have unique elements this means there is only one of a particular element in a set to define a set you use curly brackets you place the elements of a set within the curly brackets you notice there are duplicate items when the actual set is created duplicate items will not be present you can convert a list to a set by using the function set this is called type casting you simply use the list as the input to the function set the result will be a list converted to a set let's go over an example we start off with a list we input the list to the function set the function set returns a set notice how there are no duplicate elements let's go over set operations these can be used to change the set consider the set a let's represent this set with a circle if you are familiar with sets this can be part of a Venn diagram a Venn diagram is a tool that uses shapes usually to represent sets we can add an item to a set using the add method we just put the set name followed by a DOT then the add method the argument is the new element of the set we would like to add in this case in sync the set a now has in sync as an item if we add the same item twice nothing will happen as there can be no duplicates in a set let's say we would like to remove in sync from set a we can also remove an item from a set using the remove method we just put the set name followed by a DOT then the remove method the argument is the element of the set we would like to remove in this case in sync after the remove method is applied to the set set a does not contain the item in sync you can use this method for any item in the set we can verify if an element is in the set using the in command as follows the command checks of the item in this case AC DC is in the set if the item is in the set it returns true if we look for an item that is not in the set in this case for the item who as the item is not in the set we will get a false these are types of mathematical set operations there are other operations we can do there are lots of useful mathematical operations we can do between sets let's define the set album set one we can represent it using a red circle or Venn diagram similarly we can Define the set album set 2. 
we can also represent it using a blue circle or Venn diagram the intersection of two sets is a new set containing Elements which are in both of those sets it's helpful to use Venn diagrams the two circles that represent the sets combine the overlap represents the new set as the overlap is comprised of the red circle and blue circle we Define the intersection in terms of and in Python we use an ampersand to find the intersection of two sets if we overlay the values of the set over the circle placing the common elements in the overlapping area we see the correspondence after applying the intersection operation all the items that are not in both sets disappear in Python we simply just place the Ampersand between the two sets we see that both AC DC and Back in Black are in both sets the result is a new set album set three containing all the elements in both album set one and album set two the union of two sets is the new set of elements which contain all the items in both sets we can find the union of the sets album set one and album set 2 as follows the result is a new set that has all the elements of album set one and album set two this new set is represented in green consider the new album set album set three the set contains the elements AC DC and Back in Black we can represent this with a Venn diagram as all the elements in album set 3 are an album set one the circle representing album set 1 encapsulates the circle representing album set 3. we can check if a set is a subset using the is subset method as album set 3 is a subset of the album set one the result is true there is a lot more you can do with sets check out the lab for more examples [Music] thank you in this video you will learn about conditions and branching comparison operations compare some value or operand then based on some condition they produce a Boolean let's say we assign a value of a to 6. we can use the equality operator denoted with two equal signs to determine if two values are equal in this case if 7 is equal to six in this case as 6 is not equal to 7 the result is false if we performed an equality test for the value 6 the two values would be equal as a result we would get a true consider the following equality comparison operator if the value of the left operand in this case the variable I is greater than the value of the right operand in this case 5 the condition becomes true or else we get a false let's display some values for I on the left let's see the values greater than 5 in green and the rest in red if we set I equal to 6 we see that 6 is larger than 5 and as a result we get a true we can also apply the same operations to floats if we modify the operator as follows if the left operand I is greater than or equal to the value of the right operand in this case 5 then the condition becomes true in this case we include the value of 5 in the number line and the color changes to Green accordingly if we set the value of I equal to 5 the operand will produce a true if we set the value of I to 2 we would get a false because 2 is less than 5. we can change the inequality if the value of the left operand in this case I is less than the value of the right operand in this case 6 then condition becomes true again we can represent this with a colored number line the areas where the inequality is true are marked in green and red where the inequality is false if the value for I is set to 2 the result is a true as 2 is less than 6. 
the inequality test uses an explanation mark preceding the equal sign if two operands are not equal then the condition becomes true we can use a number line when the condition is true the corresponding numbers are marked in green and red for where the condition is false if we set I equal to 2 the operator is true as 2 is not equal to 6. We compare strings as well comparing AC DC and Michael Jackson using the equality test we get a false as the strings are not the same using the inequality test we get a true as the strings are different see the labs for more examples branching allows us to run different statements for a different input it's helpful to think of an if statement as a locked room if the statement is true you can enter the room and your program can run some predefined task if the statement is false your program will skip the task for example consider the blue rectangle representing an AC DC concert if the individual is 18 or older they can enter the AC DC concert if they are under the age of 18 they cannot enter the concert individual proceeds to the concert their age is 17 therefore they are not granted access to the concert and they must move on if the individual is 19 the condition is true they can enter the concert then they can move on this is the syntax of the if statement from our previous example we have the if statement we have the expression that can be true or false the brackets are not necessary we have a colon within an indent we have the expression that is run if the condition is true the statements after the if statement will run regardless if the condition is true or false for the case where the age is 17 we set the value of the variable age to 17. we check the if statement the statement is false therefore the program will not execute the statement to print you will enter in this case it will just print move on for the case where the age is 19 we set the value of the variable age to 19. we check the if statement the statement is true therefore the program will execute the statement to print you will enter Then it will just print move on the else statement will run a different block of code if the same condition is false let's use the AC DC concert analogy again if the user is 17 they cannot go to the AC DC concert but they can go to the meatloaf concert represented by the purple square if the individual is 19 the condition is true they can enter the AC DC concert then they can move on as before the syntax of the else statement is similar we simply append the statement else we then add the expression we would like to execute with an indent for the case where the age is 17 we set the value of the variable age to 17. we check the if statement the statement is false therefore we progress to the else statement we run the statement in the indent this corresponds to the individual attending the meatloaf concert the program will then continue running for the case where the age is 19 we set the value of the variable age to 19. 
we check the if statement the statement is true therefore the program will execute the statement to print you will enter the program skips the expressions in the else statement and continues to run the rest of the Expressions the alif statement short for else if allows us to check additional conditions if the preceding condition is false if the condition is true the alternate Expressions will be run consider the concert example if the individual is 18 they will go to the Pink Floyd concert instead of attending the AC DC or meatloaf concerts person of 18 years of age enters the area as they are not over 19 years of age they cannot see AC DC but as they are 18 years they attend Pink Floyd after seeing Pink Floyd they move on the syntax of the LF statement is similar we simply add the statement LF with the condition we then add the expression we would like to execute if the statement is true with an indent let's illustrate the code on the left an 18 year old enters they are not older than 18 years of age therefore the condition is false so the condition of the LF statement is checked the condition is true so then we would print go see Pink Floyd then we would move on as before if the variable age was 17 the statement go see meatloaf would print similarly if the age was greater than 18 the statement you can enter would print check the labs for more examples now let's take a look at logic operators logic operations take Boolean values and produce different Boolean values the first operation is the not operator if the input is true the result is a false similarly if the input is false the result is a true let A and B represent Boolean variables the or operator takes in the two values and produces a new Boolean value we can use this table to represent the different values the First Column represents the possible values of a the second column represents the possible values of B the final column represents the result of applying the or operation we see the or operator only produces a false if all the Boolean values are false the following lines of code will print out this album was made in the 70s or 90s if the variable album year does not fall in the 80s let's see what happens when we set the album year to 1990. the colored number line is green when the condition is true and red when the condition is false in this case the condition is false examining the second condition we see that 1990 is greater than 1989 so the condition is true we can verify by examining the corresponding second number line in the final number line the green region indicates where the area is true this region corresponds to where at least one statement is true we see that 1990 falls in the area therefore we execute the statement let A and B represent Boolean variables the and operator takes in the two values and produces a new Boolean value we can use this table to represent the different values the First Column represents the possible values of a the second column represents the possible values of B the final column represents the result of applying the and operation we see the and operator only produces a true if all the Boolean values are true the following lines of code will print out this album was made in the 80s if the variable album year is between 1980 and 1989. 
let's see what happens when we set the album year to 1983 as before we can use the colored number line to examine where the condition is true in this case 1983 is larger than 1980 so the condition is true examining the second condition we see that 1990 is greater than 1983. so this condition is also true we can verify by examining the corresponding second number line in the final number line the green region indicates where the area is true similarly this region corresponds to where both statements are true we see that 1983 falls in the area therefore we execute the statement branching allows us to run different statements for different inputs foreign we will cover Loops in particular for loops and while Loops we will use many visual examples in this video see the labs for examples with Data before we talk about loops let's go over the range function the range function outputs and ordered sequence as a list I if the input is a positive integer the output is a sequence the sequence contains the same number of elements as the input but starts at zero for example if the input is 3 the output is the sequence 0 1 2. if the range function has two inputs where the first input is smaller than the second input the output is a sequence that starts at the first input then the sequence iterates up to but not including the second number for the input 10 and 15 we get the following sequence see the labs for more capabilities of the range function please note if you use Python 3 the range function will not generate a list explicitly like in Python 2. in this section we will cover four Loops we will focus on lists but many of the procedures can be used on tuples Loops perform a task over and over consider the group of colored squares let's say we would like to replace each colored square with a white square let's give each square a number to make things a little easier and refer to all the group of squares as squares if we wanted to tell someone to replace Square 0 with a white square we would say equals replace Square 0 with a white square or we can say 4 squares 0 in squares Square 0 equals white square similarly for the next Square we can say 4 Square 1 in squares Square 1 equals white square for the next Square we can say 4 Square 2 in squares Square 2 equals white square we repeat the process for each Square the only thing that changes is the index of the square we are referring to if we are going to perform a similar task in Python we cannot use actual squares so let's use a list to represent the boxes each element in the list is a string representing the color we want to change the name of the color in each element to White each element in the list has the following Index this is a syntax to perform a loop in Python notice the indent the range function generates a list the code will simply repeat everything in the indent 5 times if you were to change the value to 6 it would do it six times however the value of I is incremented by one each time in this segment we change the ith element of the list to the string white the value of I is set to zero each iteration of the loop starts at the beginning of the indent we then run everything in the indent the first element in the list is set to White we then go to the start of the indent we progress down each line when we reach the line to change change the value of the list we set the value of index 1 to White the value of I increases by 1. we repeat the process for index 2. 
the process continues for the next index until we have reached the final element we can also iterate through a list or Tuple directly in Python we do not even need to use indices here is the list squares each iteration of the list we pass one element of the list squares to the variable Square let's display the value of the variable Square on this section for the first iteration the value of square is red we then start the second iteration for the second iteration the value of square is yellow we then start the third iteration for the final iteration the value of square is green a useful function for iterating data is enumerate it can be used to obtain the index and the element in the list let's use the Box analogy with the numbers representing the index of each Square this is the syntax to iterate through a list and provide the index of each element we use the list squares and use the names of the colors to represent the colored squares the argument of the function enumerate is the list in this case Square squares the variable I is the index and the variable square is the corresponding element in the list let's use the left part of the screen to display the different values of the variable square and I for the various iterations of the loop for the first iteration the value of the variable is red corresponding to the zeroth index and the value for I is zero for the second iteration the value of the variable square is yellow and the value of I corresponds to its index IE one we repeat the process for the last index while Loops are similar to for Loops but instead of executing a statement a set number of times a while loop will only run if a condition is met let's say we would like to copy all the orange squares from the list squares to the list new squares but we would like to stop if we encounter a non-orange square we don't know the value of the squares beforehand we would simply continue the process while the square is orange or C if the square equals orange if not we would stop for the first example we would check if the square was orange it satisfies the condition so we would copy the square we repeat the process for the second Square the condition is met so we copy the square in the next iteration we encounter a purple Square the condition is not met so we stop the process this is essentially what a while loop does let's use the figure on the left to represent the code we will use a list with the names of the color to represent the different squares we create an empty list of new squares in reality the list is of indeterminate size we start the index at zero the while statement will repeatedly execute the statements within the indent until the condition inside the bracket is false we append the value of the first element of the list squares to the list new squares we increase the value of I by 1. 
we append the value of the second element of the list squares to the list new squares we increment the value of I now the value in the array squares is purple therefore the condition for the while statement is false and we exit the loop check out the labs for more examples of loop many with real data [Music] in this video we will cover functions you will learn how to use sum of Python's built-in functions as well as how to build your own function functions take some input then produce some output or change the function is just a piece of code you can reuse you can Implement your own function but in many cases you use other people's functions in this case you just have to know how the function works and in some cases how to import the functions let the orange and yellow squares represent similar blocks of code we can run the code using some input and get an output if we Define a function to do the task we just have to call the function let the small squares represent the lines of code used to call the function we can replace these long lines of Code by just calling the function a few times now we can just call the function our code is much shorter the code performs the same task you can think of the process like this when we call the function F1 we pass an input to the function these values are passed to all those lines of code you wrote this returns a value you can use the value for example you can input this value to a new function F2 when we call this new function F2 the value is passed to another set of lines of code the function returns a value the process is repeated passing the values to the function you call you can save these functions and reuse them or use other people's functions python has many built-in functions you don't have to know how those functions work internally but simply what task those functions perform the function Len takes in an input of type sequence such as a string or list or type collection such as a dictionary or set and Returns the length of that sequence or collection consider the following list the Len function takes this list as an argument and we assign the result to the variable L the function determines there are eight items in the list then Returns the length of the list in this case 8. the function sum takes in an iterable like a tuple or list and Returns the total of all the elements consider the following list we pass the list into the sum function and assign the result to the variable s the function determines the total of all the elements then returns it in this case the value is 70. 
there are two ways to sort a list the first is using the function sorted we can also use the list method sort methods are similar to functions let's use this as an example to illustrate the difference the function sorted returns a new sorted list or Tuple consider the list album ratings we can apply the function sorted to the list album ratings and get a new list sorted album rating the result is a new sorted list if we look at the list album ratings nothing has changed generally functions take an input in this case a list they produce a new output in this instance a sorted list if we use the method sort the list album ratings will change and no new list will be created let's use the diagram to help illustrate the process in this case the rectangle represents the list album ratings when we apply the method sort to the list the list album rating changes unlike the previous case we see that the list album rating has changed in this case no new list is created now that we've gone over how to use functions in Python let's see how to build our own functions we will now get you started on building your own functions in Python this is an example of a function in Python that returns its input Value Plus 1. to define a function we start with the keyword def the name of the function should be descriptive of what it does we have the function formal parameter a in parentheses followed by a colon we have a code block with an indent for this case we add 1 to a and assign it to B we return or output the value for B after we Define the function we can call it the function will add 1 to 5 and return a 6. we can call the function again this time assign it to the variable C the value for C is 11. let's explore this further let's go over an example when you call a function it should be noted that this is a simplified model of python and python does not work like this under the hood we call the function giving it an input 5. it helps to think of the value of 5 as being passed to the function now the sequences of commands are run the value of a is five b would be assigned a value of 6. we then return the value of B in this case as B was assigned a value of 6 the function returns a 6. if we call the function again the process starts from scratch we pass in an 8. the subsequent operations are performed everything that happened the last call will happen again with a different value of a the function returns a value in this case 9. 
again this is just a helpful analogy let's try and make this function more complex it's customary to document the function on the first few lines this tells anyone who uses the function what it does this documentation is surrounded in triple quotes you can use the help command on the function to display the documentation as follows this will print out the function name and the documentation we will not include the documentation in the rest of the examples a function can have multiple parameters the function mult multiplies two numbers in other words it finds their product if we pass the integers 2 and 3 the result is a new integer if we pass the integer 10 and the float 3.14 the result is a float 31.4 if we pass in the integer 2 and the string Michael Jackson the string Michael Jackson is repeated two times this is because the multiplication symbol can also mean repeat a sequence if you accidentally multiply an integer with a string instead of two integers you won't get an error instead you will get a string and your program will progress potentially failing later because you have a string where you expected an integer this property will make coding simpler but you must test your code more thoroughly in many cases a function does not have a return statement in these cases python will return the special none object practically speaking if your function has no return statement you can treat it as if the function returns nothing at all the function MJ simply prints the name Michael Jackson we call the function function prints Michael Jackson let's define the function no work that performs no task python doesn't allow a function to have an empty body so we can use the keyword pass which doesn't do anything but satisfies the requirement of a non-empty body if we call the function and print it out the function returns a none in the background if the return statement is not called python will automatically return a none it is helpful to view the function no work with The Following return statement usually functions perform more than one task this function prints a statement then returns a value let's use this table to represent the different values as the function is called we call the function with an input of 2. we find the value of B the function prints the statement with the value of a and b finally the function Returns the value of B in this case 3. 
we can use Loops in functions this function prints out the values and indexes of a list or tuple we call the function with the list album ratings as an input let's display the list on the right with its corresponding index stuff is used as an input to the function enumerate this operation will pass the index to I and the value in the list to s the function begins to iterate through the loop the function will print the first index and the first value in the list we continue iterating through the loop the values of I and S are updated the print statement is reached similarly the next values of the list and index are printed the process is repeated the values of I and S are updated we continue iterating until the final values in the list are printed out variadic parameters allow us to input a variable number of elements consider the following function the function has an asterisk on the parameter names when we call the function three parameters are packed into the Tuple names we then iterate through the loop the values are printed out accordingly if we call the same function with only two parameters as inputs the variable names only contains two elements the result is only two values are printed out the scope of a variable is the part of the program where that variable is accessible variables defined outside of any function are said to be within the global scope meaning they can be accessed anywhere after they are defined here we have a function that adds the string DC to the parameter X when we reach the part where the value of x is set to AC this is within the global scope meaning X is accessible anywhere after it is defined a variable defined in the global scope is called a global variable when we call the function we enter a new scope or the scope of add DC we pass an argument to the add DC function in this case AC within the scope of the function the value of x is set to AC DC the function Returns the value and is assigned to Z within the global scope the value Z is set to AC DC after the value is returned the scope of the function is deleted local variables only exist within the scope of a function consider the function Thriller the local variable date is set to 1982. when we call the function we create a new scope within that scope of the function the value of the date is set to 1982. the value of date does not exist within the global scope variables inside the global scope can have the same name as variables in the local scope with no conflict consider the function Thriller the local variable date is set to 1982. the global variable date is set to 2017. when we call the function we create a new scope within that scope the value of the date is set to 1982. if we call the function it Returns the value of date in the local scope in this case 1982 when we print in the global scope we use the global variable value the global value of the variable is 2017. therefore the value is set to 2017. if a variable is not defined within a function python will check the global scope consider the function ACDC the function has the variable rating with no value assigned if we Define the variable rating in the global scope then call the function python will see there is no value for the variable rating as a result python will leave the scope and check if the variable rating exists in the global scope it will use the value of rating in the global scope within the scope of AC DC in the function we'll print out a 9. the value of Z in the global scope will be 10 as we added 1.
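A minimal sketch of the enumerate loop, variadic parameters, and variable scope just described; the specific values are illustrative.

album_ratings = [10.0, 8.5, 9.5]

def print_with_index(the_list):
    # enumerate passes the index to i and the value to s
    for i, s in enumerate(the_list):
        print("index:", i, "value:", s)

print_with_index(album_ratings)

def print_names(*names):
    # the asterisk packs any number of arguments into the tuple names
    for name in names:
        print(name)

print_names("Michael Jackson", "AC/DC", "Pink Floyd")   # three values printed
print_names("Michael Jackson", "AC/DC")                 # only two values printed

x = "AC"                 # a global variable

def add_dc(x):
    x = x + "DC"         # this x is local to the function
    return x

z = add_dc(x)            # z is "ACDC" in the global scope
print(z)

rating = 9               # global variable

def acdc(y):
    print(rating)        # rating is not defined locally, so Python uses the global value
    return rating + y

z = acdc(1)              # prints 9
print(z)                 # 10
print(rating)            # still 9, unchanged in the global scope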
the value of rating will be unchanged within the global scope consider the function Pink Floyd if we Define the variable claimed sales with the keyword Global the variable will be a global variable we call the function Pink Floyd the variable claimed sales is set to the string 45 million in the global scope when we print the variable we get a value of 45 million there is a lot more you can do with functions check out the lab for more examples [Music] hello and welcome to exception handling after watching this video you will be able to explain exception handling demonstrate the use of exception handling and understand the basics of exception handling have you ever mistakenly entered a number when you were supposed to enter text in an input field most of us have either in error or when testing out a program but do you know why it gave an error message instead of completing and terminating the program in order for the error message to appear an event was triggered in the background this event was activated because the program tried to perform a computation on the name entry and realized the entry contained numbers and not letters by encasing this code in an exception Handler the program knew how to deal with this type of error and was able to Output the error message and continue with the program this is one of many errors that can happen when asking for user input so let us see how exception handling works we will first explore the try except statement this type of statement will first attempt to execute the code in the try block but if an error occurs it will kick out and begin searching for the exception that matches the error once it finds the correct exception to handle the error it will then execute that line of code for example perhaps you are writing a program that will open and write a file after starting the program an error occurred as the data was not able to be read because of this error the program skipped over the code lines under the try statement and went directly to the except line since this error fell within the IOError guidelines it printed unable to open or read the data in the file to our console when writing simple programs we can sometimes get away with only one except statement but what happens if another error occurs that is not caught by the IOError exception if that happened we would need to add another except statement for this except statement you will notice that the type of error to catch is not specified while this may seem a logical step so the program will catch all errors and not terminate this is not a best practice for example perhaps our small program was just one section of a much larger program that was over a thousand lines of code our task was to debug the program as it kept throwing an error causing a disruption for our users when investigating the program you found this error kept appearing because this error had no details you ended up spending hours trying to pinpoint and fix the error so far in our program we have defined that an error message should print out if an error occurs but we do not receive any messages that the program executed properly this is where we can now add an else statement to give us that notification by adding this else statement it will provide us a notification to the console that the file was written successfully now that we have defined what will happen if our program executes properly or if an error occurs there is one last statement to add for this example since we are opening a file the last thing we need to do is
close the file by adding a finally statement it will tell the program to close the file no matter the end result and print file is now closed to our console in this video you learned how to write a try except statement why it is important to always Define errors when creating exceptions and how to add an else and finally statement in this module we are going to talk about objects and classes python has many different kinds of data types integers floats strings lists dictionaries booleans in Python each is an object every object has the following a type an internal representation and a set of functions called methods to interact with the data an object is an instance of a particular type for example we have two types type 1 and type 2. we can have several objects of type 1 as shown in yellow each object is an instance of type 1. we also have several objects of type 2 shown in green each object is an instance of type 2. let's do several less abstract examples every time we create an integer we are creating an instance of type integer or we are creating an integer object in this case we are creating five instances of type integer or five integer objects similarly every time we create a list we are creating an instance of type list or we are creating a list object in this case we are creating five instances of type list or five list objects we can find out the type of an object by using the type command in this case we have an object of type list we have an object of type integer we have an object of type string finally we have an object of type dictionary a class's or type's methods are functions that every instance of that class or type provides it's how you interact with the object we have been using methods all this time for example on lists sorting is an example of a method that interacts with the data in the object consider the list ratings the data is a series of numbers contained within the list the method sort will change the data within the object we call the method by adding a period at the end of the object's name and the method's name we would like to call with parentheses we have the ratings list represented in Orange the data contained in the list is a sequence of numbers we call the sort method this changes the data contained in the object you can say it changes the state of the object we can call the reverse method on the list changing the list again we call the method reversing the order of the sequence within the object in many cases you don't have to know the inner workings of the class and its methods you just have to know how to use them next we will cover how to construct your own classes you can create your own type or class in Python in this section you will create a class the class has data attributes the class has methods we then create instances of that class also called objects the class data attributes Define the class let's create two classes the first class will be a circle the second will be a rectangle let's think about what constitutes a circle examining this image all we need is a radius to define a circle and let's add color to make it easier to distinguish between different instances of the class later therefore our class data attributes are radius and color similarly examining the image in order to define a rectangle we need the height and width we will also add color to distinguish between instances later therefore the data attributes are color height and width to create the class Circle you will need to include the class definition this tells python you are
creating your own class the name of the class and for this course in parentheses you will always place the term object this is the parent of the class for the class rectangle we change the name of the class but the rest is kept the same classes are outlines we have to set the attributes to create objects we can create an object that is an instance of type Circle the color data attribute is red and the data attribute radius is 4. we can also create a second object that is an instance of type Circle in this case the color data attribute is green and the data attribute radius is 2. we can also create an object that is an instance of type rectangle the color data attribute is blue and the data attributes height and width are both 2. the second object is also an instance of type rectangle in this case the color data attribute is yellow and the height is 1 and the width is 3. we now have different objects of class Circle or type Circle we also have different objects of class rectangle or type rectangle let us continue building the circle class in Python we Define our class we then initialize each instance of the class with data attributes radius and color using the class Constructor the function init is the Constructor it's a special function that tells python you are making a new instance of the class there are other special functions in Python to make more complex classes the radius and color parameters are used to initialize the radius and color data attributes of the class instance the self parameter refers to the newly created instance of the class the parameters radius and color can be used in the constructor's body to access the values passed to the class Constructor when the object is constructed we set the value of the radius and color data attributes to the values passed to the Constructor method similarly we can Define the class rectangle in Python the name of the class is different this time the class data attributes are color height and width after we have created the class in order to create an object of class Circle we introduce a variable this will be the name of the object we create the object by using the object Constructor the object Constructor consists of the name of the class as well as the parameters these are the data attributes when we create a circle object we call the code like a function the arguments passed to the circle Constructor are used to initialize the data attributes of the newly created Circle instance it is helpful to think of self as a box that contains all the data attributes of the object typing the object's name followed by a DOT and the data attribute name gives us the data attribute value for example radius in this case the radius is 10.
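A minimal sketch of the Circle and Rectangle classes and constructors described above; the draw methods from the labs are left out here.

class Circle(object):
    def __init__(self, radius, color):
        # the constructor sets the data attributes on the newly created instance (self)
        self.radius = radius
        self.color = color

class Rectangle(object):
    def __init__(self, color, height, width):
        self.color = color
        self.height = height
        self.width = width

red_circle = Circle(10, "red")       # create an instance by calling the class like a function
print(red_circle.radius)             # 10
print(red_circle.color)              # red

blue_rectangle = Rectangle("blue", 2, 2)
print(blue_rectangle.height, blue_rectangle.width)   # 2 2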
we can do the same for color we can see the relationship between the self parameter and the object in Python we can also set or change the data attribute directly typing the object's name followed by a DOT and the data attribute name and set it equal to the corresponding value we can verify that the color data attribute has changed usually in order to change the data in an object we Define methods in the class let's discuss methods we have seen how data attributes consist of the data defining the objects methods are functions that interact and change the data attributes changing or using the data attributes of the object let's say we would like to change the size of a circle this involves changing the radius attribute we add a method add radius to the class Circle the method is a function that requires a self as well as other parameters in this case we are going to add a value to the radius we denote that value as R we are going to add R to the data attribute radius let's see how this part of the code works when we create an object and call the add radius method as before we create an object with the object Constructor we pass two arguments to the Constructor the radius is set to 2 and the color is set to red in the constructor's body the data attributes are set we can use the Box analogy to see the current state of the object we call the method by adding a DOT followed by the method name and parentheses in this case the argument of the function is the amount we would like to add we do not need to worry about the self parameter when calling the method just like with the Constructor python will take care of that for us in many cases there may not be any parameters other than self specified in the method's definition so we don't pass any arguments when calling the function internally the method is called with a value of 8 and the proper self object the method assigns a new value to self.radius this changes the object in particular the radius data attribute when we call the add radius method this changes the object by changing the value of the radius data attribute we can add default values to the parameters of a class's Constructor in the labs we also create the method called Draw Circle see the lab for the implementation of draw a circle in the labs we can create a new object of type circle using the Constructor the color will be red and the radius will be three we can access the data attribute radius we can access the attribute color finally we can use the method draw a circle to draw the circle similarly we can create a new object of type Circle we can access the data attribute of radius we can access the data attribute color we can use the method draw a circle to draw the circle in summary we have created an object of class circle called red circle with a radius attribute of 3 and a color attribute of red we also created an object of class circle called blue circle with a radius attribute of 10 and a color attribute of blue in the lab we have a similar class for rectangle we can create a new object of type rectangle using the Constructor we can access a data attribute of height we can also access the data attribute of width we can do the same for the data attribute of color we can use the method draw a rectangle to draw the rectangle so we have a class an object that is a realization or instantiation of that class for example we can create two objects of class Circle or two objects of class rectangle the dir function is useful for obtaining the list of data attributes and methods 
associated with a class the object you're interested in is passed as an argument the return value is a list of that object's data attributes the attributes surrounded by underscores are for internal use and you shouldn't have to worry about them the regular looking attributes are the ones you should concern yourself with these are the objects methods and data attributes there is a lot more you can do with objects in Python checkpython.org for more info [Music] in this section we will use Python's built-in open function to create a file object and obtain the data from a txt file we will use Python's open function to get a file object we can apply a method to that object to read data from the file we can open the file example 1 dot txt as follows we use the open function the first argument is the file path this is made up of the file name and the file directory the second parameter is the mode common values used include R for reading W for writing and a for appending we will use R for reading finally we have the file object we can now use the file object to obtain information about the file we can use the data attribute name to get the name of the file the result is a string that contains the name of the file we can see what mode the object is in using the data attribute mode and R is shown representing read you should always close the file object using the method close this may get tedious sometimes so let's use the with statement using a with statement to open the file is better practice because it automatically closes the file the code will run everything in the indent block then closes the file this code reads the file example1.txt we can use the file object file 1. the code will perform all operations in the indent block then close the file at the end of the indent the method read stores the values of the file in the variable file underscore stuff as a string you can print the file content you can check if the file content is closed but you cannot read from it outside the indent but you can print the file content outside the indent as well we can print the file content we will see the following when we examine the raw string we will see the slash n this is so python knows to start a new line we can output every line as an element in a list using the method read lines the first line corresponds to the first element in the list the second line corresponds to the second element in the list and so on we can use the method read line to read the first line of the file if we run this command it will store the first line in the variable file underscore stuff then print the first line we can use the method read line twice the first time it's called it will save the first line in the variable file underscore stuff and then print the first line the second time it's called it will save the second line in the variable file underscore stuff and then print the second line can use a loop to print out each line individually as follows let's represent every character in a string as a grid we can specify the number of characters we would like to read from a string as an argument to the method read lines when we use a 4 as an argument in the method read lines we print out the first four characters in the file each time we call the method we will progress through the text if we call the method with the argument 16 the first 16 characters are printed out and then the new line if we call the method a second time the next five characters are printed out finally if we call the method last time with the argument 
nine the last nine characters are printed out check out the labs for more examples of methods and other file types [Music] we can also write to files using the open function we will use Python's open function to get a file object to create a text file we can apply method write to write data to that file as a result text will be written to the file we can create the file example 2 dot txt as follows we use the open function the first argument is the file path this is made up of the file name if you have that file in your directory it will be overwritten and the file directory we set the mode parameter to W for writing finally we have the file object as before we use the with statement the code will run everything in the indent block then close the file we create the file object file 1. we use the open function this creates a file example2.txt in your directory we use the method write to write data into the file the argument is the text we would like input into the file if we use the right method successively each time it's called it will write to the file the first time it is called we will write this is line a with a slash n to represent a new line the second time we call the method it will write this is line B then it will close the file we can write each element in a list to a file as before we use a with command and the open function to create a file the list lines has three elements consisting of text we use a for Loop to read each element of the first lines and pass it to the variable line the first iteration of the loop writes the first element of the list to the file example 2. the second iteration writes the second element of the list and so on at the end of the loop the file will be closed we can set the mode to appended using a lowercase a this will not create a new file but just use the existing file if we call the method write it will just write to the existing file then add this is line C then close the file we can copy one file to a new file as follows first we read the file example 1 and interact with it via the file object read file then we create a new file example 3 and use the file object write file to interact with it the for Loop takes a line from the file object read file and stores it in the file example 3 using the file object write file the first iteration copies the first line the second iteration copies the second line till the end of the file is reached then both files are closed check out the labs for more examples [Music] dependencies or libraries are pre-written code to help solve problems in this video we will introduce pandas a popular library for data analysis we can import the library or a dependency like pandas using the following command we start with the import command followed by the name of the library we now have access to a large number of pre-built classes and functions this assumes the library is installed in our lab environment all the necessary libraries are installed let's say we would like to load a CSV file using the Panda's built-in function read CSV a CSV is a typical file type used to store data we simply type the word pandas then a DOT and the name of the function with all the inputs typing pandas all the time may get tedious we can use the as statement to shorten the name of the library in this case we use the standard abbreviation PD now we type PD and a DOT followed by the name of the function we would like to use in this case read underscore CSV we are not limited to the abbreviation PD in this case we use the term banana we will 
stick with PD for the rest of this video Let's examine this code more in depth One Way pandas allows you to work with data is with a data frame let's go over the process to go from a CSV file to a data frame this variable stores the path of the CSV it is used as an argument to the read underscore CSV function the result is stored to the variable DF this is short for data frame now that we have the data in a data frame we can work with it we can use the method head to examine the first five rows of a data frame the process for loading an Excel file is similar we use the path of the Excel file the function reads Excel the result is a data frame a data frame is comprised of rows and columns we can create a data frame out of a dictionary the keys correspond to the column labels the values are lists corresponding to the rows we then cast the dictionary to a data frame using the function data frame we can see the direct correspondence between the table the keys correspond to the table headers the values are lists corresponding to the rows we can create a new data frame consisting of one column we just put the data frame name in this case DF and the name of the column header enclosed in double brackets the result is a new data frame comprised of the original column you can do the same thing for multiple columns we just put the data frame name in this case DF and the name of the multiple column headers enclosed in double brackets the result is a new data frame comprised of the specified columns [Music] a data frame we can work with the data and save the results in other formats consider the stack of 13 blocks of different colors we can see there are three unique colors let's say you would like to find out how many unique elements are in a column of a data frame this may be much more difficult because instead of 13 elements you may have millions pandas has the method unique to determine the unique elements in a column of a data frame let's say we would like to determine the unique year of the albums in the data set we enter the name of the data frame then enter the name of the column released within brackets then we apply the method unique the result is all of the unique elements in the column released let's say we would like to create a new database consisting of songs from the 1980s and after we can look at the column released for songs made after 1979. then select the corresponding rows we can accomplish this within one line of code in pandas but let's break up the steps we can use the inequality operators for the entire data frame in pandas the result is a series of Boolean values for our case we simply specify the column released and the inequality for the albums after 1979. the result is a series of Boolean values the result is true when the condition is true and false otherwise we can select the specified columns in one line we simply use the data frame's names and in square brackets we place the previously mentioned inequality and assign it to the variable df1 we now have a new data frame where each album was released after 1979. 
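A minimal sketch of the pandas steps described above, using a small illustrative dictionary in place of the course CSV file; the file names and column labels here are assumptions.

import pandas as pd

# a data frame built from a dictionary: keys become column labels, lists become columns
songs = {"Album": ["Thriller", "Back in Black", "The Dark Side of the Moon"],
         "Released": [1982, 1980, 1973],
         "Rating": [10.0, 9.5, 9.0]}
df = pd.DataFrame(songs)

print(df.head())                        # first rows of the data frame

# loading from files looks like this (paths are illustrative)
# df = pd.read_csv("albums.csv")
# df = pd.read_excel("albums.xlsx")

ratings = df[["Rating"]]                # new data frame with a single column
subset = df[["Album", "Released"]]      # new data frame with several columns

print(df["Released"].unique())          # unique values in the Released column

df1 = df[df["Released"] > 1979]         # boolean filtering: albums released after 1979
print(df1)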
we can save the new data frame using the method to underscore CSV the argument is the name of the CSV file make sure you include a DOT CSV extension there are other functions to save the data frame in other formats [Music] in this video we will be covering numpy in 1D in particular ND arrays numpy is a library for scientific Computing it has many useful functions there are many other advantages like speed and memory numpy is also the basis for pandas so check out our pandas video in this video we will be covering the basics and array creation indexing and slicing basic operations Universal functions let's go over how to create a numpy array a python list is a container that allows you to store and access data each element is associated with an index we can access each element using a square bracket as follows a numpy array or ND array is similar to a list it's usually fixed in size and each element is of the same type in this case integers we can cast a list to a numpy array by first importing numpy we then cast The List as follows we can access the data via an index as with the lists we can access each element with an integer and a square bracket the value of a is stored as follows if we check the type of the array we get numpy dot ND array as numpy arrays contain data of the same type we can use the attribute D type to obtain the data type of the arrays elements in this case a 64-bit integer let's review some basic array attributes using the array a the attribute size is the number of elements in the array as there are five elements the result is five the next two attributes will make more sense when we get to higher Dimensions but let's review them the attribute n dim represents the number of array Dimensions or the rank of the array in this case one the attribute shape is a tuple of integers indicating the size of the array in each dimension we can create a numpy array with real numbers when we check the type of the array we get numpy dot ND array if we examine the attribute d-type we see float 64 as the elements are not integers there are many other attributes check out numpy.org let's review some indexing and slicing methods we can change the first element of the array to a hundred as follows the array's first value is now a hundred we can change the fifth element of the array as follows the fifth element is now zero like lists and tuples we can slice a numpy array the elements of the array correspond to the following index we can select the elements from one to three and assign it to a new numpy array d as follows the elements in D correspond to the index like lists we do not count the element corresponding to the last index we can assign the corresponding indices to new values as follows the array C now has new values via the labs or numpy.org for more examples of what you can do with numpy numpy makes it easier to do many operations that are commonly performed in data science these same operations are usually computationally faster and require less memory in numpy compared to regular python let's review some of these operations on one-dimensional arrays we will look at many of the operations in the context of euclidean vectors to make things more interesting vector addition is a widely used operation in data science consider the vector U with two elements the elements are distinguished by the different colors similarly consider the vector v with two components in vector addition we create a new Vector in this case z the first component of Zed is the addition of the first component 
of vectors u and v similarly the second component is the sum of the second components of u and v this new Vector Z is now a linear combination of the vector u and v representing vector addition with line segment or arrows is helpful the first Vector is represented in red the vector will point in the direction of the two components the first component of the vector is one as a result the arrow is offset one unit from the origin in the horizontal Direction the second component is zero we represent this component in the vertical direction as this component is zero the vector does not point in the vertical Direction we represent the second Vector in blue the first component is zero therefore the Arrow does not point to the horizontal Direction the second component is one as a result the vector points in the vertical Direction one unit when we add the vector u and v we get the new Vector Z we add the first component this corresponds to the horizontal Direction we also add the second component it's helpful to use the tip to tail method when adding vectors placing the tail of a vector v on the tip of vector U the new Vector Z is constructed by connecting the base of the first Vector U with the tail of the second V the following three lines of code will add the two lists and place the result in the list Z we can also perform vector addition with one line of numpy code it would require multiple lines to perform Vector subtraction on two lists as shown on the right side of the screen in addition the numpy code will run much faster this is important if you have lots of data we can also perform Vector subtraction by changing the addition sign to a subtraction sign it would require multiple lines to perform Vector subtraction on two lists as shown on the right side of the screen Vector multiplication with a scalar is another commonly performed operation consider the vector y each component is specified by a different color we simply multiply the vector by a scalar value in this case 2. 
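A minimal sketch of vector addition, subtraction, and scalar multiplication with numpy versus plain lists, using the two-component vectors from the description.

import numpy as np

u = np.array([1, 0])
v = np.array([0, 1])

z = u + v                      # vector addition in one line: array([1, 1])

# the equivalent with plain Python lists needs a loop
z_list = []
for i in range(len(u)):
    z_list.append(u[i] + v[i])

d = u - v                      # vector subtraction: array([ 1, -1])

y = np.array([1, 2])
print(2 * y)                   # scalar multiplication stretches the vector: array([2, 4])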
each component of the vector is multiplied by two in this case each component is doubled we can use the line segment or arrows to visualize what's going on the original Vector Y is in purple after multiplying it by a scalar value of 2 the vector is stretched out by two units as shown in red the new Vector is twice as long in each Direction Vector multiplication with a scalar only requires one line of code using numpy it would require multiple lines to perform the same task with python lists as shown on the right side of the screen in addition the operation would also be much slower hadamard product is another widely used operation in data science consider the following two vectors u and v the hadamard product of u and v is a new Vector Z the first component of Z is the product of the first element of u and v similarly the second component is the product of the second element of u and v the resultant Vector consists of the entry-wise product of u and v we can also perform hadamard product with one line of code in numpy it would require multiple lines to perform hadamard product on two lists as shown on the right side of the screen the dot product is another widely used operation in data science consider the vectors u and v the dot product is a single number given by the following term and represents how similar two vectors are we multiply the first component from V and U we then multiply the second component and add the results together the result is a number that represents how similar the two vectors are we can also perform the dot product using the numpy function Dot and assign it to the variable result as follows consider the array U the array contains the following elements if we add a scalar value to the array numpy will add that value to each element this property is known as Broadcasting a universal function is a function that operates on ND arrays we can apply a universal function to a numpy array consider the array a we can calculate the mean or average value of all the elements in a using the method mean this corresponds to the average of all the elements in this case the result is zero there are many other functions for example consider the numpy array b we can find the maximum value using the method max we see the largest value is 5. therefore the method max returns a 5.
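A minimal sketch of the Hadamard product, dot product, broadcasting, and universal functions just described; the array values are chosen to match the results mentioned above.

import numpy as np

u = np.array([1, 2])
v = np.array([3, 2])

z = u * v                      # Hadamard (element-wise) product: array([3, 4])
result = np.dot(u, v)          # dot product: 1*3 + 2*2 = 7

w = np.array([1, 2, 3, -1])
print(w + 1)                   # broadcasting adds 1 to each element: array([2, 3, 4, 0])

a = np.array([1, -1, 1, -1])
print(a.mean())                # 0.0

b = np.array([1, -2, 3, 4, 5])
print(b.max())                 # 5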
we can use numpy to create functions that map numpy arrays to new numpy arrays let's Implement some code on the left side of the screen and use the right side of the screen to demonstrate what's going on we can access the value of pi in numpy as follows we can create the following numpy array in radians this array corresponds to the following vector we can apply the function sine to the array X and assign the values to the array y this applies the sine function to each element in the array this corresponds to applying the sine function to each component of the vector the result is a new array Y where each value corresponds to a sine function being applied to each element in the array X a useful function for plotting mathematical functions is linspace linspace returns evenly spaced numbers over a specified interval we specify the starting point of the sequence the ending point of the sequence the parameter num indicates the number of samples to generate in this case five the space between samples is one if we change the parameter num to 9 we get 9 evenly spaced numbers over the interval from negative two to two the result is the difference between subsequent samples is 0.5 as opposed to 1 as before we can use the function linspace to generate 100 evenly spaced samples from the interval 0 to 2 pi we can use the numpy function sine to map the array X to a new array y we can import the library pyplot as plt to help us plot the function as we are using a jupyter notebook we use the command matplotlib inline to display the plot the following command plots a graph the first input corresponds to the values for the horizontal or x-axis the second input corresponds to the values for the vertical or y-axis there's a lot more you can do with numpy check out the labs and numpy.org for more thanks for watching this video [Music] we can create numpy arrays with more than one dimension this section will focus only on 2D arrays but you can use numpy to build arrays of much higher dimensions in this video we will cover the basics of array creation in 2D indexing and slicing in 2D and basic operations in 2D consider the list a the list contains three nested lists each of equal size each list is color-coded for Simplicity we can cast the list to a numpy array as follows it is helpful to visualize the numpy array as a rectangular array each nested list corresponds to a different row of the matrix we can use the attribute ndim to obtain the number of axes or Dimensions referred to as the rank the term rank does not refer to the number of linearly independent columns like a matrix it's useful to think of ndim as the number of nested lists the first list represents the First Dimension this list contains another set of lists this represents the second dimension or axis the number of lists the list contains does not have to do with the dimension but the shape of the list as with the 1D array the attribute shape returns a tuple it's helpful to use the rectangular representation as well the first element in the Tuple corresponds to the number of nested lists contained in the original list or the number of rows in the rectangular representation in this case three the second element corresponds to the size of each of the nested lists or the number of columns in the rectangular array in this case three.
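Before moving on to indexing, here is a minimal sketch of the linspace plotting example and of building a 2D array and checking its attributes; the matrix values are illustrative and a Jupyter-style environment is assumed for the plot.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, num=100)   # 100 evenly spaced samples from 0 to 2*pi
y = np.sin(x)                            # apply sine to each element
plt.plot(x, y)                           # first input: x-axis values, second input: y-axis values
plt.show()

# a 2D array cast from three nested lists of equal size
A = np.array([[11, 12, 13],
              [21, 22, 23],
              [31, 32, 33]])
print(A.ndim)                            # 2: number of dimensions (numpy's "rank")
print(A.shape)                           # (3, 3): three rows, three columns
print(A.size)                            # 9: total number of elements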
the convention is to label this axis zero and this axis 1 as follows we can also use the attribute size to get the size of the array we see there are three rows and three columns multiplying the number of columns and rows together we get the total number of elements in this case 9. check out the labs for arrays of different shapes and other attributes we can use rectangular brackets to access the different elements of the array the following image demonstrates the relationship between the indexing conventions for the list like representation the index in the first bracket corresponds to the different nested lists each a different color the second bracket corresponds to the index of a particular element within the nested list using the rectangular representation the first index corresponds to the row index the second index corresponds to the column index we could also use a single bracket to access the elements as follows consider the following syntax this index corresponds to the second row and this index the third column the value is 23. consider this example this index corresponds to the first row and the second index corresponds to the First Column and a value of 11. we can also use slicing in numpy arrays the first index corresponds to the first row the second index accesses the first two columns consider this example the first index corresponds the first two rows the second index accesses the last column we can also add arrays the process is identical to Matrix addition consider the Matrix X each element is colored differently consider the Matrix y similarly each element is colored differently we can add the matrices this corresponds to adding the elements in the same position I.E adding elements contained in the same color boxes together the result is a new Matrix that is the same size as Matrix y or X each element in this new Matrix is the sum of the corresponding elements in X and Y to add two arrays in numpy we Define the array in this case X then we Define the second array y we add the arrays the result is identical to Matrix addition multiplying a numpy array by a scalar is identical to multiplying a matrix by a scalar consider the Matrix y if we multiply the matrix by the scalar 2 we simply multiply every element in the matrix by 2. 
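A minimal sketch of the 2D indexing, slicing, addition, and scalar multiplication described above, using the same 3 by 3 values as the indexing example.

import numpy as np

A = np.array([[11, 12, 13],
              [21, 22, 23],
              [31, 32, 33]])

print(A[1][2])                 # second row, third column: 23
print(A[0, 0])                 # first row, first column: 11
print(A[0, 0:2])               # first row, first two columns: [11 12]
print(A[0:2, 2])               # first two rows, last column: [13 23]

X = np.array([[1, 0], [0, 1]])
Y = np.array([[2, 1], [1, 2]])
print(X + Y)                   # element-wise addition, like matrix addition
print(2 * Y)                   # every element multiplied by 2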
the result is a new Matrix of the same size where each element is multiplied by two consider the array y we first Define the array we multiply the array by a scalar as follows and assign it to the variable Z the result is a new array where each element is multiplied by two multiplication of two arrays corresponds to an element-wise product or hadamard product consider array X and array y hadamard product corresponds to multiplying each of the elements in the same position I.E multiplying elements contained in the same color boxes together the result is a new Matrix that is the same size as Matrix y or X each element in this new Matrix is the product of the corresponding elements in X and Y consider the arrays X and Y we can find the product of two arrays X and Y in one line and assign it to the variable Z as follows the result is identical to the hadamard product we can also perform matrix multiplication with numpy arrays matrix multiplication is a little more complex but let's provide a basic overview consider the Matrix a where each row is a different color also consider the Matrix B where each column is a different color in linear algebra before we multiply Matrix a by matrix B we must make sure that the number of columns in Matrix a in this case 3 is equal to the number of rows in Matrix B in this case three for matrix multiplication to obtain the ith row and jth column of the new Matrix we take the dot product of the ith row of a with the jth column of B for the first row and First Column we take the dot product of the first row of a with the First Column of b as follows the result is zero for the first row and the second column of the new Matrix we take the dot product of the first row of the Matrix a but this time we use the second column of Matrix B the result is two for the second row and the First Column of the new Matrix we take the dot product of the second row of the Matrix a with the First Column of Matrix B the result is zero finally for the second row and the second column of the new Matrix we take the dot product of the second row of the Matrix a with the second column of Matrix B the result is 2.
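A minimal sketch of the matrix multiplication walk-through above; the matrices A and B are chosen so the four dot products come out to 0, 2, 0 and 2 as described, though the exact values on the course slides may differ.

import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 1]])
B = np.array([[ 1, 1],
              [ 1, 1],
              [-1, 1]])

# the number of columns in A (3) must equal the number of rows in B (3)
C = np.dot(A, B)               # each entry is a dot product of a row of A with a column of B
print(C)                       # [[0 2]
                               #  [0 2]]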
in numpy we can Define the numpy arrays A and B we can perform matrix multiplication and assign it to array C the result is the array C it corresponds to the matrix multiplication of array A and B there is a lot more you can do with it in numpy check out numpy.org thanks for watching this video foreign [Music] we will discuss application program interfaces or apis for short specifically we will discuss what an API is API libraries and rest apis including request and response and an example with pi coin gecko an API lets two pieces of software talk to each other for example you have your program you have some data and you have other software components you use the API to communicate with other software via inputs and outputs just like a function you don't have to know how the API works just its inputs and outputs pandas is actually a set of software components much of which are not even written in Python you have some data you have a set of software components we use the pandas API to process the data by communicating with the other software components let us clean up the diagram when you create a dictionary then create a pandas object with the data frame Constructor in API lingo this is an instance the data in the dictionary is passed along to the pandas API you then use the data frame to communicate with the API when you call the method head the data frame communicates with the API displaying the first few rows of the data frame when you call the method mean the API will calculate the mean and return the values rest apis are another popular type of API they allow you to communicate through the internet letting you take advantage of resources like storage access more data artificial intelligence algorithms and much more the re stands for representational the s for state and T for transfer in rest apis your program is called the client the API communicates with a web service you call through the internet there is a set of rules regarding communication input or request and output or response here are some common terms you or your code can be thought of as a client the web service is referred to as a resource the client finds the service via an endpoint we will review this more in the next section the client sends requests to the resource and the response to the client HTTP methods are a way of transmitting data over the Internet we tell the rest apis what to do by sending a request the request is usually communicated via an HTTP message the HTTP message usually contains a Json file this contains instructions for what operation we would like the service to perform this operation is transmitted to the web service via the Internet and the service performs the operation in a similar manner the web service returns a response via an HTTP message where the information is usually returned via a Json file and this information is transmitted back to the client cryptocurrency data is excellent to use in an API because it is constantly updated and is vital to cryptocurrency trading we will use the pi coin gecko python client or wrapper for the coin gecko API updated every minute by coing gecko we use the wrapper or client because it is easy to use so you can focus on the task of collecting data we will also introduce Panda's time series functions for dealing with time series data using pi coin gecko to collect data is simple all we need is to install and import the library then create a client object and finally use a function to request our data in this function we are getting data on Bitcoin in US 
dollars for the past 30 days in this case our response is a Json file expressed as a python dictionary of nested lists including price market cap and total volumes which contain the Unix timestamp and the price at the time we are only interested in price so that is what we will select using the key price to make things simple we can convert our nested list to a data frame with the columns time stamp and price it is difficult to understand the column timestamp we will convert it to a more readable format using the pandas function to underscore date time using this to underscore date time function we create readable time data the input is the timestamp column unit of time is set to milliseconds we append the output to the new column date now we want to create a Candlestick plot to get the data for the daily candlesticks we will Group by the date to find the minimum maximum first and last price of each day finally we will use plotly to create the Candlestick chart and plot it now we can view the Candlestick chart by opening the HTML file and clicking trust HTML in the top left of the tab it should look something like this thank you in this video we will discuss application program interfaces that use some kind of artificial intelligence we will transcribe an audio file using the Watson speech to text API we will then translate the text to a new language using the Watson language translator API in the API call you will send a copy of the audio file to the API this is sometimes called a post request then the API will send the text transcription of what the individual is saying under the hood the API is making a get request we then send the text we would like to translate into a second language to a second API the API will translate the text and send the translation back to you in this case we translate English to Spanish we then provide an overview of API keys and endpoints Watson speech to text and Watson translate first we will review API keys and endpoints they will give you access to the API an API key is a way to access the API it's a unique set of characters that the API uses to identify you and authorize you usually your first call to the API includes the API key this will allow you access to the API in many apis you may get charged for each call so like your password you should keep your API key a secret an endpoint is simply the location of the service it's used to find the API on the internet just like a web address now we will transcribe an audio file using the Watson speech to text API before you start the lab you should sign up for an API key we will download an audio file into your directory first we import speech to text V1 from IBM Watson the service endpoint is based on the location of the service instance we store the information in the variable URL underscore s2t to find out which URL to use view the service credentials you will do the same for your API key you create a speech-to-text adapter object the parameters are the endpoint and API key you will use this object to communicate with the Watson speech to text service we have the path of the WAV file we would like to convert to text we create the file object wave with the WAV file using open we set the mode to RB which means to read the file in binary format the file object allows us access to the WAV file that contains the audio we use the method recognize from the speech-to-text adapter object this basically sends the audio file to Watson's speech to text service the parameter audio is the file object the content type is
the audio file format the service sends a response stored in the object response the attribute result contains a python dictionary the key results value has a list that contains a dictionary we are interested in the key transcript we can assign it to the variable recognized underscore text as follows recognized underscore text now contains a string with the transcribed text now let's see how to translate the text using the Watson language translator first we import language translator V3 from IBM underscore Watson we assign the service endpoint to the variable urlt you can obtain the service in the lab instructions you require an API key see the lab instructions on how to obtain the API key this API request requires the date of the version see the documentation we create a language translator object language translator we can get a list of the languages that the service can identify as follows the method Returns the language code for example English has a symbol e n to Spanish which has the symbol e s in the last section we assigned the transcribed text of the variable to recognized underscore text we can use the method translate this will translate the text the result is a detailed response object the parameter text is the text model underscore ID is the type of model we would like to use in this case we set it to en hyphen es for English to Spanish we use the method get result to get the translated text and assign it to the variable translation the result is a dictionary that includes the translation word count and character count we can obtain the translation and assign it to the variable Spanish underscore translation as follows using the variable Spanish underscore translation we can translate the text back to English as follows the result is a dictionary we can obtain the string with the text as follows we can then translate the text to French as follows thanks for watching this video [Music] foreign [Music] we will discuss the HTTP protocol specifically we will discuss uniform resource locator or URL request and response we touched on rest apis in the last section the HTTP protocol can be thought of as a general protocol of transferring information through the web this includes many types of rest apis recall that rest apis function by sending a request and the request is communicated via HTTP message the HTTP message usually contains a Json file when you the client use a web page your browser sends an HTTP request to the server where the page is hosted the server tries to find the desired resource by default index.html if your request is successful the server will send the object to the client in an HTTP response this includes information like the type of the resource the length of the research source and other information the table under the web server represents a list of resources stored in the web server in this case an HTML file PNG image and a text file when the request is made for the information the web server sends the requested information that is one of the files a uniform resource locator or URL is the most popular way to find resources on the web we can break the URL into three parts first we have the scheme this is the protocol and for this lab it will always be HTTP colon forward slash forward slash the internet address or base URL this will be used to find the location some examples include www.ibm.com and www.getlab.com and finally the route this is the location on the web server for example slash Images slash idsnlogo.png let us review the request and response 
process the following is an example of the request message for the get request method there are other HTTP methods we can use in the start line we have the get method this is an HTTP method in this case it's requesting the file index dot HTML the request header passes additional information with an HTTP request in the get method the request header is empty some requests have a body we will have an example of a request body later the following table represents the response the response start line contains the version number followed by a descriptive phrase in this case HTTP 1.0 a status code 200 meaning success and the descriptive phrase okay we have status codes later the response header contains information finally we have the response body containing the requested file in this case an HTML document let us look at other status codes some status code examples are shown in the table below the prefix indicates the class for example the 100s are informational responses 100 indicates that everything is okay so far the two hundreds are successful responses for example 200 the request has succeeded anything in the 400s is bad news 401 means the request is unauthorized 500s stand for Server errors like 501 for not implemented when an HTTP request is made an HTTP method is sent this tells the server what action to perform a list of several HTTP methods is shown here in the next video we will use Python to apply the get method that retrieves data from the server and the post method that sends data to the server [Music] in this video we will discuss the HTTP protocol using the requests Library a popular method for dealing with the HTTP protocol in Python we will review python Library requests for working with the HTTP protocols and we will provide an overview of get requests and post requests let us review the request module in Python this is one of several libraries including HTTP lib and URL lib that can work with the HTTP protocol requests is a python library that allows you to send HTTP 1.1 requests easily we can import the library as follows you can make a get request via the method get to www.ibm.com we have the response object R this has information about the request like the status of the request we can view the status code using the attribute status underscore code which is 200 for ok you can view the request headers you can view the request body in the following line as there is no body for a get request we get a none you can view the HTTP response header using the attribute headers this returns a python dictionary of HTTP response headers we can look at the dictionary values we can obtain the date the request was sent by using the key date the key content type indicates the type of data using the response object R we can also check the encoding as the content type is text or HTML we can use the attribute text to display the HTML in the body we can review the first 100 characters you can also download other content See the lab for more you can use the get method to modify the results of your query for example retrieving data from an API in the lab we will use httpbin.org a simple HTTP request and response service we send a get request to the server like before we have the base URL in the route we append slash get this indicates we would like to perform a get request this is demonstrated in the following table after get is requested we have the query string this is part of a uniform resource locator or URL and this sends other information to the web server the start of the query is a 
question mark followed by a series of parameter and value pairs as shown in the table below the first parameter name is name and the value is Joseph the second parameter name is ID and the value is one two three each pair parameter and value is separated by an equal sign the series of pairs is separated by the Ampersand let us complete an example in Python we have the base URL with get appended to the end to create a query string we use the dictionary payload the keys are the parameter names and the values are the value of the query string then we pass the dictionary payload to the params parameter of the get function we can print out the URL and see the name and values we can see the request body as the info is sent in the URL the body has a value of none we can print out the status code we can view the response as text and we can look at the key content type to look at the content type as the content content type is in the Json we format it using the method Json it returns a python dict the key args has the name and values for the query string like a get request a post request is used to send data to a server but the post request sends the data in a request body not the URL in order to send the post request in the URL we change the route to post this endpoint will expect data and it is a convenient way to configure an HTTP request to send data to a server we have the payload dictionary to make a post request we use the post function the variable payload is passed to the parameter data comparing the URL using the attribute URL from the response object of the get and post request we see the post request has no name or value pairs in its URL we can compare the post and get request body we see only the post request has a body and we can view the key form to get the payload [Music] thank you in this video we will review hypertext markup language or HTML for web scraping lots of useful data is available on web pages such as real estate prices and solutions to coding questions the website Wikipedia is a repository of the world's information if you have an understanding of HTML you can use Python to extract this information in this video you will review the HTML of a basic web page understand the composition of an HTML tag understand HTML trees and understand HTML tables let us say you had a request to find the name and salary of players in a national basketball league from the following page the web page is comprised of HTML it consists of text surrounded by a series of blue text elements enclosed in angle brackets called Tags the tags tell the browser how to display the content the data we require is in this text the first portion contains the doctype HTML which declares this document is an HTML document HTML element is the root element of an HTML page and head element contains meta information about the HTML page next we have the body this is what is displayed on the web page this is usually the data we are interested in we see the elements with an H3 this means type 3 heading which makes the text larger and bold these tags have the names of the players notice the data is enclosed in the elements it starts with an H3 in Brackets and ends in a slash H3 in Brackets there is also a different tag P this means paragraph each P tag contains a player's salary let us take a closer look at the composition of an HTML tag here is an example of an HTML anchor tag it will display IBM and when you click it it will send you to ibm.com we have the tag name in this case a this tag defines a hyperlink which is 
used to link from one page to another it is helpful to think of each tag name as a class in Python and each individual tag as an instance we have an opening or start tag and we have the end tag this has the tag name preceded by a slash these tags contain the content in this case what is displayed on the web page we have the attribute this is composed of the attribute name and attribute value in this case it is the URL to the destination web page real web pages are more complex and depending on your browser you can select the HTML element then click inspect the result will give you the ability to inspect the HTML there is also other types of content such as CSS and JavaScript that we will not go over in this course the actual element is shown here each HTML document can actually be referred to as a document tree let us go over a simple example tags may contain strings and other tags these elements are the tags children we can represent this as a family tree each nested tag is a level in the tree the tag HTML tag contains the head and body tag The Head and the body tag are the descendants of the HTML tag in particular they are the Children of the HTML tag HTML tag is their parent the head and body tag are siblings as they are on the same level title tag is the child of the head tag and its parent is the head tag the title tag is a descendant of the HTML tag but not its child The Heading and paragraph tags are the children of the body tag and as they are all children of the body tag they are siblings of each other the Bold tag is a child of the heading tag the content of the tag is also part of the tree but this can get unwieldy to draw next let us review HTML tables to Define an HTML table we have the table tag each table row is defined with a TR tag and you can also use a table header tag for the first row the table row cell contains a set of TD tags each defines a table cell for the first row first cell we have for the first row second cell we have and so on for the second row we have and for the second row first cell we have and for the second row second cell we have and so on we now have some basic knowledge of HTML now let us try and extract some data from a web page [Music] in this video we will cover web scraping after watching this video you will be able to Define web scraping understand the role of beautiful soup objects apply the find underscore all method and web scrape a website what would you do if you wanted to analyze hundreds of points of data to find the best players of a sports team would you start manually copying and pasting information from different websites into a spreadsheet spending hours trying to find the right data and eventually giving up because the task was too overwhelming that is where web scraping can help web scraping is a process that can be used to automatically extract information from a website and can easily be accomplished within a matter of minutes and not hours to get started we just need a little python code and the help of two modules named requests and beautiful soup let us say you were asked to find the name and salary of players in a national basketball league from the following web page first we import beautiful soup we can store the webpage HTML as a string in the variable HTML to parse a document pass it into the beautiful soup Constructor we get the beautiful soup object soup which represents the document as a nested data structure beautiful soup represents HTML as a set of tree-like objects with methods used to parse the HTML we will 
review the beautiful soup object using the beautiful soup object soup we created the tag object corresponds to an HTML tag in the original document for example the tag title consider the tag H3 if there is more than one tag with the same name the first element with that tag is selected in this case with LeBron James we see the name is enclosed in the Bold attribute B to extract it use the tree representation so let us use the tree representation the variable tag Dash object is located here we can access the child of the tag or navigate down the branch as follows you can navigate up the tree by using the parent attribute the variable tag child is located here and we can access the parent this is the original tag object we can also find The Sibling of tag object we simply use the next Dash sibling attribute we can find the sibling of sibling one we simply use the next sibling attribute consider the tag Dash child object you can access the attribute name and value as a key value pair in a dictionary as follows you can return the content as a navigable string this is like a python string that supports beautiful soup functionality now let us review the method find all this is a filter you can use filters to filter based on a Tag's name its attributes the text of a string or on some combination of these and consider the list of pizza places like before create a beautiful soup object but this time name it table the find underscore all method looks through a Tag's descendants and retrieves all descendants that match your filters apply it to the table with the tag TR the result is a python iterable just like a list each element is a tag object for TR this corresponds to each row in the list including the table header each element is a tag object so consider the first row for example we can extract the first table cell we can also iterate through each table cell first we iterate through the list table rows via the variable row each element corresponds to a row in the table we can apply the method find all to find all the table cells then we can iterate through the variable cells for each row for each iteration the variable cell corresponds to an element in the table for that particular row and we continue to iterate through each element and repeat the process for each row let us see how to apply beautiful soup to a web page to scrape a web page we also need the requests Library the first step is to import the modules that are needed use the get method from the requests library to download the web page the input is the URL use the text attribute to get the text and assign it to the variable page then create a beautiful soup object soup from the variable page it will allow you to parse through the HTML page and you can now scrape the page check out the labs for more [Music] hello welcome to working with different file formats after watching this video you will be able to Define different file formats such as CSV XML and Json write simple programs to read and output data and list what python libraries are needed to extract data when collecting data you will find there are many different file formats that need to be collected or read in order to complete a data-driven story or analysis when Gathering the data python can make the process simpler with its predefined libraries but before we explore python let us first check out some of the various file formats looking at a file name you will notice an extension at the end of the title these extensions let you know what type of file it is and what is needed 
to open it for instance if you see a title like file example.csv you will know this is a CSV file but this is only one example of different file types there are many more such as Json or XML when coming across these different file formats and trying to access their data we need to utilize python libraries to make this process easier the first python library to become familiar with is called pandas by importing this library in the beginning of the code we are then able to easily read the different file types since we have now imported the pandas library let us use it to read the first CSV file in this instance we have come across the file example.csv file the first step is to assign the file to a variable then create another variable to read the file with the help of the pandas library we can then call the read underscore CSV function to output the data to the screen with this example there were no headers for the data so it added the first line as the header since we do not want the first line of data as the header let us find out how to correct this issue now that we have learned how to read and output the data from a CSV file let us make it look a little more organized from the last example we were able to print out the data but because the file had no headers it printed the first line of data as the header we easily solve this by adding a data frame attribute we use the variable DF to call the file then add the columns attribute by adding this one line to our program we can then neatly organize the data output into the specified headers for each column the next file format we will explore is the Json file format in this type of file the text is written in a language independent data format and is similar to a python dictionary the first step in reading this type of file is to import Json after importing Json we can add a line to open the file call the load function of Json to begin and read the file and lastly we can then print the file the next file format type is XML also known as extensible markup language while the pandas library does not have an attribute to read this type of file let us explore how to parse this type of file the first step to read this type of file is to import XML by importing this library we can then use the ElementTree module to parse the XML file we then add the column headers and assign them to the data frame then create a loop to go through the document to collect the necessary data and append the data to a data frame in this video you learned how to recognize different file types how to use Python libraries to extract data and how to use data frames when collecting data [Music]
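As a concrete illustration of the CSV steps described above, here is a minimal sketch using pandas; the file name and column headers are placeholders invented for this example rather than the ones used in the lab.

import pandas as pd

# read a CSV file that has no header row; header=None stops pandas from
# treating the first line of data as the column names
df = pd.read_csv("file_example.csv", header=None)  # placeholder file name

# assign descriptive headers to each column (placeholder names)
df.columns = ["First Name", "Last Name", "Location"]

print(df.head())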
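The JSON and XML steps from the same video can be sketched in a similar way; again the file names, element names and column headers below are assumptions for illustration only.

import json
import xml.etree.ElementTree as ET
import pandas as pd

# JSON: open the file and load it into a python dictionary, then print it
with open("file_example.json", "r") as f:
    data = json.load(f)
print(data)

# XML: parse the document tree, loop through its child elements and
# append each record to a data frame with the chosen column headers
tree = ET.parse("file_example.xml")
root = tree.getroot()
rows = []
for person in root:
    rows.append({
        "First Name": person.find("firstname").text,
        "Last Name": person.find("lastname").text,
    })
df = pd.DataFrame(rows, columns=["First Name", "Last Name"])
print(df)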
hello and welcome to SQL for data science the demand for data scientists is high boasting a median base salary of 110,000 dollars and a job satisfaction score of 4.4 out of 5. it's no wonder that it holds the top spot on Glassdoor's best jobs in America Glassdoor analyzed data from data scientist job postings on Glassdoor and found that SQL is listed as one of the top three skills for a data scientist before you step into the field of data science it is vitally important that you set yourself apart by mastering the foundations of this field one of the foundational skills that you will require is SQL SQL is a powerful language that's used for communicating with databases every application that manipulates any kind of data needs to store that data somewhere whether it's big data or just a table with a few simple rows for government or a small startup or a big database that spans over multiple servers or a mobile phone that runs its own small database here are some of the advantages of learning SQL for someone interested in data science SQL will boost your professional profile as a data scientist as it is one of the most sought after skills by hiring employers learning SQL will give you a good understanding of relational databases tapping into all this information requires being able to communicate with the databases that store the data even if you work with reporting tools that generate SQL queries for you it may be useful to write your own SQL statements so that you need not wait for other team members to create SQL statements for you in this course you will learn the basics of both the SQL language and relational databases the course includes interesting quizzes and hands-on lab assignments where you can get experience working with databases in the first few modules you work directly with the database and develop a working knowledge of SQL then you will connect to a database and run SQL queries like a data scientist typically would where you will use Python and Jupyter notebooks to connect to relational databases to access and analyze data there is also an assignment included towards the end of the course where you will get an opportunity to apply the concepts that you learned so let's get started with SQL for data science [Music] hello and welcome to SQL for data science first we will talk a little bit about what you'll learn in this course this course teaches you the basics of the SQL language and the relational database model there will be some lab exercises and at the end of each section there are a few review questions and at the end there is a final exam by the end of this course you will be able to discuss SQL basics and explain various aspects of the relational database model in this video we will learn about SQL and relational databases by the end of this video you will be able to describe SQL data a database and a relational database and list five basic SQL commands but wait what is SQL and what is a relational database what is SQL SQL is a language used for relational databases to query or get data out of a database SQL is also pronounced sequel and is short for its original name Structured English query language so SQL is a language used for a database to query data but what is data and what is a database data is a collection of facts in the form of words numbers or even pictures data is one of the most critical assets of any business it is used and collected practically
everywhere your bank stores data about you your name address phone number account numbers Etc your credit card company and your PayPal accounts also store data about you data is important so it needs to be secure and it needs to be stored and accessed quickly the answer is a database so what is a database databases are everywhere and used every day but they are largely taken for granted a database is a repository of data it is a program that stores data a database also provides the functionality for adding modifying and querying that data there are different kinds of databases of different requirements the data can be stored in various forms when data is stored in tabular form the data is organized in tables like in a spreadsheet which is columns and rows that's a relational database The Columns contain properties about the item such as last name first name email address City a table is a collection of related things like a list of employees or a list of book authors in a relational database you can form relationships between tables so a database is a repository of data a set of software tools for the data in the database is called a database management system or dbms for short the terms database database server database system data server and database Management systems are often used interchangeably for relational databases it's called a relational database management system or rdbms rdbms is a set of software tools that controls the data such as access organization and storage an rdbms serves as the backbone of applications in many Industries including banking Transportation health and so on examples of relational database Management systems are MySQL Oracle database db2 warehouse and db2 on cloud for the majority of people using a database there are five simple commands to create a table insert data to populate the table select data from the table update data in the table delete data from the table so those are the building blocks for SQL for data science you can now describe what is SQL what is data what is a database and what is a relational database you know the rdbms stands for relational database management system and you can list five basic SQL commands to create a table insert data to populate the table select data from the table update data in the table and delete data from the table thanks for watching this video [Music] hello and welcome to retrieving data with the select statement in this video we will learn about retrieving data from a relational database table by selecting Columns of a table at the end of this lesson you will be able to retrieve data from a relational database table Define the use of a predicate identify the syntax of the select statement using the where clause and list the comparison operators supported by a relational database management system the main purpose of a database management system is not just to store the data but also facilitate retrieval of the data so after creating a relational database table and inserting data into the table we want to see the data to see the data we use the select statement the select statement is a data manipulation language statement data manipulation language statements or DML statements are used to read and modify data the select statement is called a query and the output we get from executing this query is called a result set or a result table in its simplest form a select statement is Select star from table name based on the book entity example we would create the table using the entity name book and the entity 
attributes as The Columns of the table the data would be added to the book table by adding rows to the table using the insert statement in the book entity examples select star from book gives a result set of four rows all the data rows for all columns in the table book are displayed in addition you can also retrieve all the rows for all columns by specifying the column names individually in the select statement you don't always have to retrieve all the columns in a table you can retrieve just a subset of columns if you want you can retrieve just two columns from the table book for example book underscore ID and title in this case the select statement is Select book underscore ID title from book in this case only the two columns display for each of the four rows also notice that the order of the columns displayed always matches the order in the select statement however what if we want to know the title of the book whose book underscore ID is B1 relational operation helps us in restricting the results set by allowing us to use the Clause where the where Clause always requires a predicate a predicate is a condition that evaluates to true false or unknown predicates are used in the search condition of the where clause so if we need to know the title of the book whose book underscore ID is B1 we use the where Clause with the predicate book underscore ID equals B1 select book underscore ID title from book where book underscore ID equals B1 notice that the result set is now restricted to just one row whose condition evaluates to true the previous example used the comparison operator equal to in the where clause there are other comparison operators supported by a relational database management system equal to greater than less than greater than or equal to less than or equal to and not equal to now you can retrieve data and select columns from a relational database table Define the use of a predicate identify the syntax of the select statement using the where clause and list the comparison operators supported by a relational database management system thanks for watching this video [Music] hello and welcome in this video we'll briefly present a few useful Expressions that are used with select statements the first one is count is a built-in database function that retrieves the number of rows that match the query criteria for example get the total number of rows in a given table select count star from table name let's say you create a table called Metals which has a column called country and you want to retrieve the number of rows where the metal recipient is from Canada you can issue a query like this select count country from Metals where country equals Canada the second expression is distinct distinct is used to remove duplicate values from a result set example to retrieve unique values in a column select distinct column name from table name in the metals table mentioned earlier a country may have received a gold medal multiple times example retrieve the list of unique countries that received gold medals that is removing all duplicate values of the same country select distinct country from Metals Where Metal type equals gold the third expression is limit limit is used for restricting the number of rows retrieved from the database example retrieve just the first 10 rows in a table select star from table name limit 10. 
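As a concrete illustration of count distinct and limit, here is a small self-contained sketch using Python's built-in sqlite3 module; the medals table and its rows are invented for this example, and on some databases LIMIT may instead be written as FETCH FIRST n ROWS ONLY.

import sqlite3

# build a tiny in-memory table so the queries below can actually run
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE MEDALS (COUNTRY CHAR(2), MEDALTYPE VARCHAR(10), YEAR INTEGER)")
cur.executemany("INSERT INTO MEDALS VALUES (?, ?, ?)",
                [("CA", "GOLD", 2018), ("CA", "SILVER", 2018),
                 ("US", "GOLD", 2018), ("CA", "GOLD", 2014)])

# COUNT: number of rows where the medal recipient is from Canada
print(cur.execute("SELECT COUNT(COUNTRY) FROM MEDALS WHERE COUNTRY = 'CA'").fetchone())

# DISTINCT: unique countries that received gold medals
print(cur.execute("SELECT DISTINCT COUNTRY FROM MEDALS WHERE MEDALTYPE = 'GOLD'").fetchall())

# LIMIT: just the first two rows for a particular year
print(cur.execute("SELECT * FROM MEDALS WHERE YEAR = 2018 LIMIT 2").fetchall())

conn.close()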
this can be very useful to examine the result set by looking at just a few rows instead of retrieving the entire result set which may be very large example retrieve just a few rows in the metals table for a particular year select star from Metals where year equals 2018 limit five in this video we looked at some useful Expressions that are used with select statements namely the count distinct and limit built-in functions thanks for watching this video [Music] hello and welcome to the insert statement in this video we will learn about populating a relational database table at the end of this video you will be able to identify the syntax of the insert statement and explain two methods to add rows to a table after a table is created the table needs to be populated with data to insert data into a table we use the insert statement the insert statement is used to add new rows to a table the insert statement is one of the data manipulation language statements data manipulation language statements or DML statements are used to read and modify data based on the author entity example we created the table using the entity name author and the entity attributes as The Columns of the table now we will add the data to the author table by adding rows to the table to add the data to the author table we use the insert statement the syntax of the insert statement looks like this insert into table name column name values in this statement table name identifies the table the column name list identifies each column in the table and the values Clause specifies the data values to be added to the columns in the table to add a row with the data for Raul Chong we insert a row with an author underscore ID of A1 the last name is Chong the first name as Raul the email as RFC ibm.com the city s Toronto and the country SCA for Canada the author table has six columns so the insert statement lists the six column names separated by commas followed by a value for each of the columns also separated by commas it is important that the number of values provided in the values Clause is equal to the number of column names specified in the column name list this ensures that each column has a value tables do not need to be populated one row at a time multiple rows can be inserted by specifying each row in the values clause in the values Clause each row is separated by a comma for example in this insert statement we are inserting two rows one for Raul Chong and one for RAV Ahuja now you can identify the syntax of the insert statement and explain the two methods to add rows to a table one row at a time or multiple rows thanks for watching this video [Music] hello and welcome to the update statement and the delete statement in this video we will learn about altering and deleting data in a relational database table at the end of this lesson you will be able to identify the syntax of the update statement and delete statement and explain the importance of the where clause in these statements after a table is created and populated with data the data in a table can be altered with the update statement the update statement is one of the data manipulation language or DML statements DML statements are used to read and modify data based on the author entity example we created the table using the entity name author and the entity attributes as The Columns of the table rows were added to the author table to populate the table some time later you want to alter the data in the table to alter or modify the data in the author table we use the update 
statement the syntax of the update statement looks like this update table name set column name equal to Value where condition in this statement table name identifies the table the column name identifies the column value to be changed as specified in the where condition let's look at an example in this example you want to update the first name and last name of the author with author underscore ida2 from RAV Ahuja to Lakshmi Kata in this example to see the update statement in action we start by selecting all rows from the author table to see the values to change the first name and last name to Lakshmi Kata where the author ID is equal to A2 enter the update statement as follows update author set last name equal to Kata first name equal to Lakshmi where author ID is equal to A2 now to see the result of the update select all rows again from the author table and you will see that in Row 2 the name changed from Rob Ahuja to Lakshmi Kata note that if you do not specify the where Clause all the rows in the table will be updated in this example without specifying the where Clause all rows in the table would have changed the first and last names to Lakshmi Kata sometime later there might be a need to remove one or more rows from a table the rows are removed with the delete statement the delete statement is one of the data manipulation language statements used to read and modify data the syntax of the delete statement looks like this delete from table name where condition the rows to be removed are specified in the where condition based on the author entity example we want to delete the rows for author ID A2 and A3 let's look at an example delete from author where author ID in A2 A3 note that if you do not specify the where Clause all the rows in the table will be removed now you can identify the syntax of the update statement and delete statement and explain the importance of the where clause in these statements thanks for watching this video foreign [Music] hello and welcome to database Concepts in this video we will learn about different types of models how we use models to map data to tables and Define relationships between tables at the end of this lesson you will be able to explain the advantage of the relational model explain how the entity name and attributes map to a relational database table describe the difference between an entity and an attribute Identify some commonly used data types and describe the function of primary keys the relational model is the most used data model for databases because this model allows for data Independence data is stored in a simple data structure tables this provides logical data Independence physical data Independence and physical storage Independence an entity relationship data model or ER data model is an alternative to a relational data model using a simplified Library database as an example this figure shows an entity relationship diagram or ERD that represents entities called tables and their relationships in the library example we have books a book can be written by one or many authors the library can have one or many copies of a book each copy can be Borrowed by only one borrower at a time an entity relationship model proposes thinking of a database as a collection of entities rather than being used as a model on its own the ER model is used as a tool to design relational databases in the ER model entities are objects that exist independently of any other entities in the database the building blocks of an ER diagram are entities and attributes an 
entity can be a noun person place or thing in an ER diagram an entity is drawn as a rectangle entities have attributes which are the data elements that characterize the entity attributes tell us more about the entity in an ER diagram attributes are drawn as ovals using a simplified Library as an example the book is an example of an entity attributes are certain properties or characteristics of an entity and tell us more about the entity The Entity book has attributes such as book title the edition of the book the year the book was written etc attributes are connected to exactly one entity The Entity book becomes a table in the database and the attributes become the columns in a table a table is a combination of rows and columns while mapping the entity becomes the table having said that the table has not yet taken the form of rows and columns the attributes get translated into columns in a table providing the actual table form of rows and columns later we add some data values to each of the columns which completes the table form each attribute stores data values of different formats characters numbers dates currency and many more besides in the book table example the title is made up of characters as book titles vary in length we can set the variable character data type varchar for the title column for character columns that do not vary in length we use character or char the edition and year columns would be numeric the ISBN column would be char because it contains dashes as well as numbers and so on using the book entity mapping as an example we can create the tables for the remainder of our simplified Library example using entity names like author author list borrower loan and copy The Entity attributes will be the columns of the tables each table is assigned a primary key the primary key of a relational table uniquely identifies each Tuple or row in a table preventing duplication of data and providing a way of defining relationships between tables tables can also contain foreign keys which are primary keys defined in other tables creating a link between the tables now you know that the key advantage of the relational model is logical and physical data Independence and storage Independence entities are independent objects which can have multiple characteristics called attributes when mapping to a relational database entities are represented as tables and attributes are mapped to columns common data types include characters such as char and varchar numbers such as integer and decimal and timestamps including date and time a primary key uniquely identifies a specific row in a table and prevents duplication of data [Music] hello and welcome this video will cover the key concepts around databases in the cloud in order to learn SQL you first need to have a database available to practice your SQL queries an easy way to do so is to create an instance of a database in the cloud and use it to execute your SQL queries after completing this lesson you will be able to understand basic concepts related to Cloud databases list a few Cloud databases describe database service instances as well as demonstrate how to create a service instance on IBM db2 on cloud a cloud database is a database service built and accessed through a cloud platform it serves many of the same functions as traditional databases with the added flexibility of cloud computing some advantages of using Cloud databases include ease of use users can access Cloud databases from virtually anywhere using a vendor's API or web interface or
your own applications whether on cloud or remote scalability Cloud databases can expand and Shrink their storage and compute capacities during runtime to accommodate changing needs and usage demands so organizations only pay for what they actually use Disaster Recovery in the event of a natural disaster or equipment failure or power outage data is kept secure through backups on remote servers on cloud in geographically distributed regions a few examples of relational databases on cloud include IBM db2 on cloud databases for postgresql on IBM Cloud Oracle database cloud service Microsoft Azure SQL database and Amazon relational database Services these Cloud databases can run in the cloud either as a virtual machine which you can manage or delivered as a managed service depending on the vendor the database Services can either be single or multi-tenant depending on the service plan to run a database in Cloud you must first provision an instance of the database service on the cloud platform of your choice an instance of a database as a service or dbaas provides users with access to database resources in Cloud without the need for setting up of the underlying Hardware installing the database software and administering the database the database service instance will hold your data in related tables once your data is loaded into the database instance you can connect to the database instance using a web interface or apis in your applications once connected your application can send SQL queries across to the database instance the database instance then resolves the SQL statements into operations against the data and objects in the database any data retrieved is returned to the application as a result set now let's see how a database instance is created for db2 on cloud IBM db2 on cloud is a SQL database provisioned for you in the cloud you can use db2 on cloud just as you would use any database software but without the overhead and expensive hardware setup or software installation and maintenance now let's see how we can set up a service instance of db2 navigate to IBM Cloud catalog and select the db2 service note there are several variations of the db2 service including db2 hosted and db2 warehouse for our purposes we will choose the db2 service which comes with a free light plan select the light plan if need to change the defaults you can type a service instance name choose the region to deploy to as well as an org and space for the service then click create you can view the IBM db2 service that you created by selecting services from your IBM Cloud dashboard from this dashboard you can manage your database instance for example you can click on the open console button to launch the web console for your database instance the web console allows you to create tables load data explore data in your tables and issue SQL queries in order to access your database instance from your applications you will need the service credentials for the first time around you'll need to create a set of new credentials you can also choose to create multiple sets of credentials for different applications and users once a set of service credentials is created you can view it as a Json snippet the credentials include the necessary details to establish a connection to the database and includes the following a database name and port number a host name which is the name of the server on the cloud on which your database instance resides a username which is the user ID you'll use to connect along with the password note that your 
username is also the schema name in which your tables will be created by default now that you know how to create a database instance on cloud the next step is to actually go and create one thank you for watching this video [Music] welcome to types of SQL statements at the end of this video you will be able to distinguish between data definition language statements and data manipulation language statements SQL statements are used for interacting with entities that is tables attributes that is columns and their tuples or rows with data values in relational databases SQL statements fall into two different categories data definition language statements and data manipulation language statements data definition language or ddl statements are used to define change or drop database objects such as tables common ddl statement types include create alter truncate and drop create which is used for creating tables and defining its columns alter is used for altering tables including adding and dropping columns and modifying their data types truncate is used for deleting data in a table but not the table itself drop is used for deleting tables data manipulation language or DML statements are used to read and modify data in tables these are also sometimes referred to as crud operations that is create read update and delete rows in a table common DML statement types include insert select update and delete insert is used for inserting a row or several rows of data into a table select reads or selects row or rows from a table update edits row or rows in a table and delete removes a row or rows of data from a table now you know that ddl or data definition language statements are used for defining or changing objects in a database such as tables and DML or data manipulation language statements are used for manipulating or working with data in tables thanks for watching this video [Music] hello and welcome to the create table statement at the end of this video you will be able to explain how the entity name and attributes are used to create a relational database table now let's look at the most common ddl statement create the syntax of the create table statement is shown here you start with create table followed by the name of the table you want to create then enclose the rest of the statement inside a pair of parentheses or round brackets each row inside the parentheses specifies the name of a column followed by its data type and possibly some additional optional values that we will see later each attribute or column definition is separated by a comma for example if we want to create a table for provinces in Canada you would specify create table provinces open parentheses ID char 2 primary key not null comma name varchar 24 close parentheses in this example the data types used are char which is a character string of a fixed length in this case two and varchar which is a character string of a variable length in this case the variable character field can be up to 24 characters long issuing this statement would create a table in the database with two columns the first column ID for storing the abbreviated two-letter province shortcodes such as AB BC etc and the second column called name for storing the full name of the province such as Alberta British Columbia etc
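Here is a minimal runnable sketch of that provinces example using Python's built-in sqlite3 module; the statement follows the syntax dictated in the video, with the caveat that SQLite's typing rules are looser than Db2's, and the sample rows are added purely for illustration.

import sqlite3

# create an in-memory database so the example needs no setup
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# the CREATE TABLE statement from the video: a fixed-length two-character
# primary key and a variable-length name of up to 24 characters
cur.execute("CREATE TABLE PROVINCES (ID CHAR(2) PRIMARY KEY NOT NULL, NAME VARCHAR(24))")

# populate a couple of rows and read them back
cur.executemany("INSERT INTO PROVINCES (ID, NAME) VALUES (?, ?)",
                [("AB", "Alberta"), ("BC", "British Columbia")])
for row in cur.execute("SELECT * FROM PROVINCES"):
    print(row)

conn.close()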
now let's look at a more elaborate example based on the library database this database includes several entities such as author book borrower etc let's start by creating the table for the author entity the name of the table will be author and its attributes such as author underscore ID first name last name etc will be the columns of the table in this table we will also assign the author underscore ID attribute as the primary key so that no duplicate values can exist recall the primary key of a relational table uniquely identifies each Tuple or row in a table to create the author table issue the following command create table author open parentheses author underscore ID char 2 primary key not null comma last name varchar 15 not null comma first name varchar 15 not null comma email varchar 40 comma city varchar 15 comma country char 2 close parentheses note that the author underscore ID is the primary key this constraint prevents duplicate values in the table also note that the last name and first name have the constraint not null this ensures that these fields cannot contain a null value since an author must have a name now you know that create is a ddl statement for creating entities or tables in a database and the create table statement includes definition of attributes of columns in the table including names of columns data types of columns and other optional values if required such as the primary key constraint thanks for watching this video [Music] hello and welcome to alter drop and truncate tables after watching this video you will be able to describe the alter table drop table and truncate statements explain their syntax and use the statements in queries you use the alter table statement to add or remove columns from a table to modify the data type of columns to add or remove keys and to add or remove constraints the syntax of the alter table statement is shown here you start with alter table followed by the name of the table that you want to alter differently to the create table statement though you do not use parentheses to enclose the parameters for the alter table statement each row in the alter table statement specifies one change that you want to make to the table for example to add a telephone number column to the author table in the library database to store the author's telephone number use the following statement alter table author add column telephone underscore number big int semicolon in this example the data type for the column is bigint which can hold a number up to 19 digits long you can also use the alter table statement to modify the data type of a column to do this use the alter column clause specifying the new data type for the column for example using a numeric data type for telephone number means that you cannot include parentheses plus signs or dashes as part of the number you can change the column to use the char data type to overcome this this code shows how to alter the author table alter table author alter column telephone underscore number set data type char open parentheses 20 close parentheses semicolon altering the data type of a column containing existing data can cause problems though if the existing data is not compatible with the new data type for example changing a column from the char data type to a numeric data type will not work if the column already contains non-numeric data if you try to do this you will see an error message in the notification log and the statement will not run if your spec changes and you no longer need this extra column you can again use the alter table statement this time with the drop column clause to remove the column as shown alter table author drop column telephone underscore number semicolon
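Here is a hedged sketch of those alter table operations using sqlite3; note that SQLite supports ADD COLUMN everywhere but DROP COLUMN only from version 3.35, and it does not support the Db2-style ALTER COLUMN ... SET DATA TYPE clause shown in the video, so that statement is not executed here.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE AUTHOR (AUTHOR_ID CHAR(2) PRIMARY KEY NOT NULL, LASTNAME VARCHAR(15) NOT NULL)")

# add a telephone number column to the existing table
cur.execute("ALTER TABLE AUTHOR ADD COLUMN TELEPHONE_NUMBER BIGINT")
print([col[1] for col in cur.execute("PRAGMA table_info(AUTHOR)")])  # column names after the ADD

# drop the column again if it is no longer needed (needs SQLite 3.35+)
try:
    cur.execute("ALTER TABLE AUTHOR DROP COLUMN TELEPHONE_NUMBER")
except sqlite3.OperationalError as err:
    print("DROP COLUMN not supported by this SQLite version:", err)

conn.close()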
similar to using drop column to delete a column from a table you use the drop table statement to delete a table from a database if you delete a table that contains data by default the data will be deleted alongside the table the syntax for the drop table statement is drop table table underscore name semicolon so you use this statement drop table author semicolon to remove the table from the database sometimes you might want to just delete the data in a table rather than deleting the table itself while you can use the delete statement without a where clause to do this it is generally quicker and more efficient to truncate the table instead you use the truncate table statement to delete all of the rows in a table the syntax of this statement is truncate table table underscore name immediate semicolon the immediate specifies to process the statement immediately and that it cannot be undone so to truncate the author table you use this statement truncate table author immediate semicolon in this video you learned that the alter table statement changes the structure of an existing table for example to add modify or drop columns the drop table statement deletes an existing table and the truncate table statement deletes all rows of data in a table [Music] hello and welcome to retrieving data with select statement string patterns in this video we will learn about some advanced techniques in retrieving data from a relational database table at the end of this lesson you will be able to describe how to simplify a select statement by using string patterns ranges or sets of values the main purpose of a database management system is not just to store the data but also to facilitate retrieval of the data in its simplest form a select statement is select star from table name based on our simplified Library database model and the table book select star from book gives a result set of four rows all the data rows for all columns in the table book are displayed or you can retrieve a subset of columns for example just two columns from the table book such as book underscore ID and title or you can restrict the result set by using the where clause for example you can select the title of the book whose book underscore ID is B1 but what if we don't know exactly what value to specify in the where clause the where clause always requires a predicate which is a condition that evaluates to true false or unknown but what if we don't know exactly what value the predicate is for example what if we can't remember the name of the author but we remember that their first name starts with R in a relational database we can use string patterns to search data rows that match this condition let's look at some examples of using string patterns if we can't remember the name of the author but we remember that their name starts with R we use the where clause with the like predicate the like predicate is used in a where clause to search for a pattern in a column the percent sign is used to define missing letters the percent sign can be placed before the pattern after the pattern or both before and after the pattern in this example we use the percent sign after the pattern which is the letter R the percent sign is called a wildcard character a wildcard character is used to substitute other characters so if we can't remember the name of the author but we can remember that their first name starts with the letter R we add the like predicate to the where clause for example select first name from author where first name like 'R%' this will return all rows in the author table whose author's first name starts with the letter R and here is the result set two rows are returned for authors Raul and Rav
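A minimal sqlite3 sketch of that wildcard search is shown below; the author rows are invented for illustration, and the LIKE predicate is written the same way in Db2.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE AUTHOR (FIRSTNAME VARCHAR(15), LASTNAME VARCHAR(15))")
cur.executemany("INSERT INTO AUTHOR VALUES (?, ?)",
                [("Raul", "Chong"), ("Rav", "Ahuja"), ("Ann", "Smith")])

# the percent sign is the wildcard: match any first name that starts with R
for row in cur.execute("SELECT FIRSTNAME FROM AUTHOR WHERE FIRSTNAME LIKE 'R%'"):
    print(row)   # prints ('Raul',) then ('Rav',)

conn.close()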
what if we wanted to retrieve the list of books whose number of pages is more than 290 but less than 300. we could write the select statement like this specifying the where clause as where pages is greater than or equal to 290 and pages is less than or equal to 300. but in a relational database we can use a range of numbers to specify the same condition instead of using the comparison operators greater than or equal to we use the comparison operator between and between and compares two values the values in the range are inclusive in this case we rewrite the query to specify the where clause as where pages between 290 and 300. the result set is the same but the select statement is easier and quicker to write in some cases there are data values that cannot be grouped under ranges for example if we want to know which countries the authors are from if we wanted to retrieve authors from Australia or Brazil we could write the select statement with the where clause repeating the two country values however what if we want to retrieve authors from Canada India and China the where clause would become very long repeatedly listing the required country conditions instead we can use the in operator the in operator allows us to specify a set of values in a where clause this operator takes a list of expressions to compare against in this case the countries Australia or Brazil now you can describe how to simplify a select statement by using string patterns ranges or sets of values thanks for watching this video [Music] hello and welcome to sorting select statement result sets in this video we will learn about some advanced techniques in retrieving data from a relational database table and sorting how the result set displays at the end of this lesson you will be able to describe how to sort the result set by either ascending or descending order and explain how to indicate which column to use for the sorting order the main purpose of a database management system is not just to store the data but also to facilitate retrieval of the data in its simplest form a select statement is select star from table name based on our simplified Library database model and the table book select star from book gives a result set of four rows all the data rows for all columns in the table book are displayed we can choose to list the book titles only as shown in this example select title from book however the titles do not seem to be in any particular order displaying the result set in alphabetical order would make the result set more convenient to do this we use the order by clause to display the result set in alphabetical order we add the order by clause to the select statement the order by clause is used in a query to sort the result set by a specified column in this example we have used order by on the column title to sort the result set by default the result set is sorted in ascending order in this example the result set is sorted in alphabetical order by book title to sort in descending order use the keyword DESC the result set is now sorted according to the column specified which is title and is sorted in descending order notice the order of the first three rows the first three words of the title are the same so the sorting starts from the point where the characters differ another way of specifying the sort column is to indicate the column sequence number in this example select title pages from book order by 2 indicates the column sequence number in the query for the sorting order instead of specifying
the column name Pages the number two is used in the select statement the second column specified in the column list is pages so the sort order is based on the values in the pages column in this case the pages column indicates the number of pages in the book as you can see the result set is an ascending order by number of pages now you can describe how to sort the result set by either ascending or descending order and explain how to indicate which column to use for the Sorting order thanks for watching this video [Music] hello and welcome to grouping select statement results sets in this video we will learn about some Advanced Techniques in retrieving data from a relational database table and sorting and grouping how the result set displays at the end of this lesson you will be able to explain how to eliminate duplicates from a result set and describe how to further restrict a result set at times a select statement result set can contain duplicate values based on our simplified Library database model in the author table example the country column lists the two-letter country code of the author's country if we select just the country column we get a list of all of the countries for example select country from author order by 1. the order by Clause sorts the result set this results at list the countries the authors belong to sorted alphabetically by country in this case the result set displays 20 rows one row for each of the 20 authors but some of the authors come from the same country so the result set contains duplicates however all we need is a list of countries the authors come from so in this case duplicates do not make sense to eliminate duplicates we use the keyword distinct using the keyword distinct reduces the result set to just six rows but what if we wanted to also know how many authors come from the same country so now we know that the 20 authors come from six different countries but we might want to also know how many authors come from the same country to display the result set listing the country and number of authors that come from that country we add the group by Clause to the select statement the group by Clause groups result into subsets that has matching values for one or more columns in this example countries are grouped and then counted using the count function notice the column heading for the second column in the result set the numeric value 2 displays as a column name because the column name is not directly available in the table the second column in the result set was calculated by the count function instead of using the column name 2 we can assign a column name to the result set we do this using the as keyword in this example we change the derived column name 2 to column name count using the as count keyword this helps clarify the meaning of the result set now that we have the count of authors from different countries we can further restrict the number of rows by passing some conditions for example we can check if there are more than four authors from the same country to set a condition to a group by Clause we use the keyword having the having Clause is used in combination with the group by clause it is very important to note that the where Clause is for the entire result set but the having Clause Works only with the group by clause to check if there are more than four authors from the same country we add the following to the select statement having count country greater than four only countries that have five or more authors from that country are listed in the 
result set in this example those countries are China with six authors and India also with six authors now you can explain how to eliminate duplicates from a result set and describe how to further restrict a result set thanks for watching this video [Music] hello and welcome in this video we'll go over SQL functions built into the database so let's get started while it is very much possible to First fetch data from a database and then perform operations on it from your applications and notebooks most databases come with built-in functions these functions can be included in SQL statements allowing you to perform operations on data right within the database itself using database functions can significantly reduce the amount of data that needs to be retrieved from the database that is reduces Network traffic and use of bandwidth when working with large data sets it may be faster to use built-in functions rather than first retrieving the data into your application and then executing functions on the retrieve data Note that it is also possible to create your own functions that is user-defined functions in the database but that is a more advanced topic for the examples in this lesson let's consider this Pet Rescue table in the database for a pet rescue organization it records rescue transaction details and includes the columns ID animal quantity cost and rescue date for the purposes of this lesson we have populated it with several rows of data as shown here what are aggregate or column functions an aggregate function takes a collection of like values such as all of the values in a column as input and returns a single value or null examples of aggregate functions include sum minimum maximum average Etc let's look at some examples based on the Pet Rescue table the sum function is used to add up all the values in a column to use the function you write the column name within parentheses after the function name for example to add up all of the values in the cost column select sum cost from Pet Rescue when you use an aggregate function the column in the results set by default is given a number it is possible to explicitly name the resulting column for example let's say we want to call the output column in the previous query as sum of cost select sum cost as sum of cost from Pet Rescue note the use of as in this example minimum as the name implies is used to get the lowest value similarly maximum is used to get the highest value for example to get the maximum quantity of any animal rescue in a single transaction select Max quantity from Pet Rescue aggregate functions can also be applied on a subset of data instead of an entire column for example to get the minimum quantity of ID column for dogs select Min ID from Pet Rescue where animal equals dog the average function is used to return the average or the mean value for example to specify the average value of cost as select average cost from Pet Rescue note that we can perform mathematical operations between columns and then apply aggregate functions on them for example to calculate the average cost per dog select average cost divided by quantity from Pet Rescue where animal equals dog in this case the cost is for multiple units so we first divide the cost by the quantity of the rescue now let's look at the scalar and string functions scalar functions perform operations on individual values for example to round up or down every value in the cost column to the nearest integer select round cost from Pet Rescue there is a class of scalar functions called 
string functions that can be used for operations on strings that is char and varchar values for example to retrieve the length of each value in the animal column select length animal from Pet Rescue uppercase and lowercase functions can be used to return uppercase or lowercase values of strings for example to retrieve animal values in uppercase select uppercase animal from Pet Rescue scalar functions can be used in the where clause for example to get the rows where the lowercase value of the animal column equals cat select star from Pet Rescue where lowercase animal equals cat this type of statement is useful for matching values in the where Clause if you are not sure whether the values are stored in upper lower or mixed case in the table you can also have one function operate on the output of another function for example to get the unique values of the animal column in uppercase select distinct uppercase animal from Pet Rescue in this video we looked at some built-in SQL aggregate functions such as sum minimum maximum and average we also looked at scalar and string functions such as round lowercase and uppercase thank you for watching [Music] hello and welcome in this video we'll go over date and time SQL functions built into the database so let's get started most databases contain special data types for dates and times db2 contains date time and timestamp types in db2 date has eight digits for year month and day time has six digits for hours minutes and seconds timestamp has 20 digits for year month day hour minute second and microseconds written as YYYYXXDDHHMMSSZZZZZZ where XX represents the month and the six Zs or Zeds represent the microseconds functions exist to extract the day month day of month day of week day of year week hour minute and second let us look at some examples of queries for date and time functions the day function can be used to extract the day portion from a date for example to get the day portion for each rescue date involving cats select day rescue date from Pet Rescue where animal equals cat date and time functions can be used in the where clause for example to get the number of rescues during the month of May that is for month 5.
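As a hedged illustration of these date and time functions, here is a small sqlite3 sketch; db2-specific functions such as DAY and MONTH and its date arithmetic are not available in SQLite, so strftime and a date modifier are used as stand-ins, and the petrescue rows are invented.

# sketch: date extraction and date arithmetic, with sqlite3 stand-ins for
# db2's DAY()/MONTH() functions and "+ 3 days" arithmetic
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE petrescue (animal TEXT, rescuedate TEXT)")
cur.executemany("INSERT INTO petrescue VALUES (?, ?)",
                [("cat", "2018-05-29"), ("dog", "2018-06-01"), ("cat", "2018-05-02")])

# extract the day portion of each cat rescue (db2 would use DAY(RESCUEDATE))
print(cur.execute(
    "SELECT strftime('%d', rescuedate) FROM petrescue WHERE animal = 'cat'").fetchall())

# count rescues in May, i.e. month 5 (db2 would use MONTH(RESCUEDATE) = 5)
print(cur.execute(
    "SELECT COUNT(*) FROM petrescue WHERE strftime('%m', rescuedate) = '05'").fetchone())

# processing deadline three days after each rescue (db2: RESCUEDATE + 3 DAYS)
print(cur.execute("SELECT date(rescuedate, '+3 days') FROM petrescue").fetchall())

conn.close()

in db2 the month 5 count described above is written as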
select count star from Pet Rescue where month rescue date equals zero five you can also perform date or time arithmetic for example to find out what date it is three days after each rescue date maybe you want to know this because the rescue needs to be processed within three days select rescue date plus three days from Pet Rescue special registers current time and current date are also available for example to find out how many days have passed since each rescue date till now select current date minus rescue date from Pet Rescue the result will be in years months days in this video we looked at different types of built-in SQL functions for working with dates and times thank you for watching [Music] hello and welcome in this video you'll learn how to write subqueries or nested select statements sub queries or subselects are like regular queries but placed within parentheses and nested inside another query this allows you to form more powerful queries than would have been otherwise possible an example of a nested query is shown in this example the sub query is inside the where Clause of another query consider the employees table from the previous video the first few rows of data are shown here the table contains several columns including an employee ID first name last name salary Etc we will now go over some examples involving this table let's consider a scenario which may necessitate the use of sub queries let's say we want to retrieve the list of employees who earn more than the average salary to do so we could try this code select star from employees where salary is greater than average salary however running this query will result in an error like the one shown indicating an invalid use of the aggregate function one of the limitations of built-in aggregate functions like the average function is that they cannot always be evaluated in the where clause so to evaluate a function like average in the where Clause we can make use of a sub-select expression like the one shown here select employee ID first name last name salary from employees where salary is less than open parenthesis select average salary from employees close parenthesis notice that the average function is evaluated in the first part of the sub query allowing us to circumvent the limitation of evaluating it directly in the where clause the sub select doesn't just have to go in the where Clause it can also go in other parts of the query such as in the list of columns to be selected such sub queries are called column Expressions now let's look at a scenario where we might want to use a column expression say we wanted to compare the salary of each employee with the average salary we could try a query like select employee ID salary average salary as average salary from employees running this query will result in an error indicating that no Group by Clause is specified we can circumvent this error by using the average function in a sub query placed in the list of the columns for example select employee ID salary open left parenthesis select average salary from employees close right parenthesis as average salary from employees another option is to make the subquery be part of the from clause sub queries like these are sometimes called derived tables or table Expressions because the outer query uses the results of the sub query as a data source let's look at an example to create a table expression that contains non-sensitive employee information select star from select employee ID first name last name Department ID from employees as 
employee for all the derived table in the sub query does not include sensitive fields like date of birth or salary this example is a trivial one and we could just as easily have included the columns in the outer query however such derived tables can prove to be powerful in more complex situations such as when working with multiple tables and doing joins in this video you have seen how subqueries and nested queries can be used to form richer queries and how they can overcome some of the limitations of aggregate functions you also learned to use subqueries in the where clause in the list of columns and in the from clause thanks for watching this video [Music] hello and welcome in this video you will learn how to write queries that access more than one table there are several ways to access multiple tables in the same query namely using subqueries implicit joins and join operators such as inner join and outer join in this video we'll examine the first two options the third option is covered in more detail in other videos let's consider the employees and departments tables from a previous video the employees table contains several columns such as employee ID first name last name and salary to name a few the departments table contains a department ID Department name manager ID and location ID some sample data from these tables is shown here we will utilize these tables for the examples in this video in a previous video we learned how to use sub queries now let's use subqueries to work with multiple tables if we want to retrieve only the employee records from the employees table for which a department ID exists in the departments table we can use a sub query as follows select star from employees where Department ID in select Department ID Department from departments here the outer query accesses the employees table and the sub query on the departments table is used for filtering the result set of the outer query let's say we want to retrieve only the list of employees from a specific location we do not have any location information in the employees table but the departments table has a column called location ID therefore we can use a sub query on the departments table as input to the employee table query as follows select star from employees where Department ID in select Department ID Department from departments where location ID equals l002 now let's retrieve the department ID and Department name for employees who earn more than seventy thousand dollars to do so we will need a sub query on the employees table to satisfy the salary criteria and then feed it as input to an outer query on the departments table in order to get the matching Department info select Department ID Department name from departments where Department ID Department in select Department ID from employees where salary is greater than seventy thousand we can also access multiple tables by specifying them in the from Clause of the query consider the example select star from employees departments here we specify two tables in the from clause this results in a table join but note we are not explicitly using the join operator the resulting join in this example is called a full join or Cartesian join because every row in the first table is joined with every row in the second table if you examine the result set you will see more rows than in either table individually we can use additional operands to limit the result set let's look at an example where we limit the result set to only rows with matching Department IDs
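Here is a small sqlite3 sketch of both approaches just described, the subquery and the implicit join; the employees and departments layouts and the column names emp_id, dep_id and dep_id_dep are simplified stand-ins for the course's tables, and the rows are invented.

# sketch: a subquery across two tables and an implicit join on a throwaway database
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (emp_id TEXT, f_name TEXT, salary REAL, dep_id TEXT)")
cur.execute("CREATE TABLE departments (dep_id_dep TEXT, dep_name TEXT, loc_id TEXT)")
cur.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)",
                [("E1001", "John", 100000, "2"),
                 ("E1002", "Alice", 80000, "5"),
                 ("E1003", "Steve", 50000, "9")])
cur.executemany("INSERT INTO departments VALUES (?, ?, ?)",
                [("2", "Architecture", "L0001"), ("5", "Software", "L0002")])

# subquery: keep only employees whose department exists in departments
print(cur.execute(
    "SELECT * FROM employees WHERE dep_id IN (SELECT dep_id_dep FROM departments)").fetchall())

# subquery feeding the outer query: departments of employees earning over 70000
print(cur.execute(
    "SELECT dep_id_dep, dep_name FROM departments "
    "WHERE dep_id_dep IN (SELECT dep_id FROM employees WHERE salary > 70000)").fetchall())

# implicit join: two tables in the FROM clause, matched in the WHERE clause
print(cur.execute(
    "SELECT e.emp_id, d.dep_name FROM employees e, departments d "
    "WHERE e.dep_id = d.dep_id_dep").fetchall())

conn.close()

in db2 the matching condition for that last example looks like this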
select star from employees departments where employees Department ID equals departments Department ID Department notice that in the where Clause we prefix the name of the column with the name of the table this is to fully qualify the column name since it's possible that different tables could have some column names that are exactly the same since the table names can sometimes be long we can use shorter aliases for table names as shown here select star from employees e departments D where e Department ID equals D Department ID Department here we Define the Alias e for employees table and D for departments table and then use these aliases in the where clause if we wanted to see the department name for each employee we would enter the code as follows select employee ID Department name from employees e departments D where e Department ID equals D Department ID Department similar to before the column names in the select Clause can also be prefixed by aliases as shown in the query select e employee ID D Department ID Department from employees e departments D where e Department ID equals D Department ID Department in this lesson we have shown you how to work with multiple tables using subqueries and implicit joins thanks for watching [Music] hello in this video you will learn how to access databases using python databases are powerful tools for data scientists after completing this module you will be able to explain the basic concepts related to using python to connect to databases then you will create tables load data and query data using SQL from jupyter notebooks and finally analyze the data in the lab assignments you will learn how to create an instance in the cloud connect to a database query data from the database using SQL and analyze the data using python you will be able to explain the basic concepts related to connecting a python application to a database describe SQL apis as well as list some of the proprietary apis used by popular SQL based dbms systems let's quickly review some of the benefits of using python a popular scripting language for connecting to databases the python ecosystem is very rich and provides easy to use tools for data science some of the most popular packages are numpy pandas matplotlib and SCI pi Thon is easy to learn and has a simple syntax due to its open source nature python has been ported to many platforms all your python programs can work on any of these platforms without requiring any changes at all if you are careful and avoid any system dependent features python supports relational database systems writing python code to access databases is made easier by the presence of the Python database API commonly referred to as the DB API and detailed documentation related to python is easily available notebooks are also very popular in the field of data science because they run in an environment that allows creation and sharing of documents that contain Live code equations visualizations and explanatory text a notebook interface is a virtual notebook environment used for programming examples of notebook interfaces include the mathematica notebook Maple worksheet Matlab notebook IPython Jupiter R markdown Apache Zeppelin Apache spark notebook and the databricks cloud in this module we will be using jupyter notebooks The jupyter Notebook is an open source web application that allows you to create and share documents that contain Live code equations visualizations and narrative text here are some of the advantages of using jupyter notebooks notebook support for over 
40 programming languages including python R Julia and Scala notebooks can be shared with others by email Dropbox GitHub and the jupyter notebook viewer your code can produce Rich interactive output HTML images videos latex and custom mime types you can leverage Big Data tools such as Apache spark from python R and Scala and explore that same data with pandas scikit learn ggplot2 and tensorflow this is how a typical user accesses databases using python code written on a jupyter notebook a web-based editor there is a mechanism by which the Python program communicates with the dbms the python code connects to the database using API calls we will explain the basics of SQL apis and python DB apis an application programming interface is a set of functions that you can call to get access to some type of service a SQL API consists of Library function calls as an application programming interface API for the dbms to pass SQL statements to the dbms an application program calls functions in the API and it calls other functions to retrieve query results and Status information from the dbms the basic operation of a typical SQL API is Illustrated in the figure the application program begins its database access with one or more API calls that connect the program to the dbms to send the SQL statement to the dbms the program builds the statement as a text string in a buffer and then makes an API call to pass the buffer contents to the dbms the application program makes API calls to check the status of its dbms request and to handle errors the application program ends its database access with an API call that disconnects it from the database now let's learn basic concepts about some of the proprietary apis used by popular SQL based dbms systems each database system has its own Library as you can see the table shows a list of a few applications and corresponding SQL apis my SQL C API provides low-level access to the MySQL client server protocol and enable C programs to access database contents the psycho pg2 API connects python applications in postgresql databases the IBM underscore DB API is used to connect python applications to IBM db2 databases the dblib API is used to connect to SQL Server databases odbc is used for database access for Microsoft Windows OS I is used by Oracle databases and finally jdbc is used by Java applications thanks for watching this video [Music] hello and welcome to writing code using DB API after completing this video you will be able to explain the basic concepts related to the python DB API and database cursors and also write code using DB apis as we saw in the beginning of this module the user writes python programs using a Jupiter notebook there is a mechanism by which the python code communicates with the dbms the python code connects to the database using DB API calls DB API is Python's standard API for accessing relational databases it is a standard that allows you to write a single program that works with multiple kinds of relational databases instead of writing a separate program for each one so if you learn the DB API functions then you can apply that knowledge to use any database with python here are some advantages of using the DB API it's easy to implement and understand this API has been defined to encourage similarity between the python modules that are used to access databases it achieves consistency which leads to more easily understood modules the code is generally more portable across databases and it has a broader reach of database connectivity from python as 
we know each database system has its own Library as you can see the table shows a list of a few databases and corresponding DB apis used to connect python applications the IBM underscore DB library is used to connect to an IBM db2 database the MySQL Connector/Python library is used to connect to a Compose for MySQL database the psycopg2 library is used to connect to a Compose for PostgreSQL database and finally the PyMongo library is used to connect to a Compose for MongoDB database the two main Concepts in the python DB API are connection objects and query objects you use connection objects to connect to a database and manage your transactions cursor objects are used to run queries you open a cursor object and then run queries the cursor works similarly to a cursor in a text processing system where you scroll down in your result set and get your data into the application cursors are used to scan through the results of a database the DB API includes a connect Constructor for creating a connection to the database it returns a connection object which is then used by the various connection methods these connection methods are the cursor method which returns a new cursor object using the connection the commit method which is used to commit any pending transaction to the database the rollback method which causes the database to roll back to the start of any pending transaction and the close method which is used to close a database connection cursor objects represent a database cursor which is used to manage the content of a fetch operation cursors created from the same connection are not isolated that is any changes done to the database by a cursor are immediately visible to the other cursors cursors created from different connections may or may not be isolated depending on how the transaction support is implemented a database cursor is a control structure that enables traversal over the records in a database it behaves like a file name or file handle in a programming language just as a program opens a file to access its contents it opens a cursor to gain access to the query results similarly the program closes a file to end its access and closes a cursor to end access to the query results another similarity is that just as a file handle keeps track of the program's current position within an open file a cursor keeps track of the program's current position within the query results let's walk through a python application that uses the DB API to query a database first you import your database module then by using the connect API from that module you open a connection to the database you use the connect Constructor and pass in the parameters that is the database name username and password the connect function returns a connection object after this you create a cursor object on the connection object the cursor is used to run queries and fetch results after running the queries using the cursor we also use the cursor to fetch the results of the query finally when the system is done running the queries it frees all resources by closing the connection remember that it is always important to close connections to avoid unused connections taking up resources thanks for watching this video [Music] hello and welcome to connecting to a database using the IBM underscore DB API after completing this lesson you will be able to understand the IBM underscore DB API as well as the credentials required to connect to a database using python we will also demonstrate how to connect to an IBM db2 database using python code
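Before the db2-specific lesson, here is a minimal sketch of that generic DB API pattern using the standard library's sqlite3 module, which implements the same interface; the in-memory instructor table is invented for illustration, and a DB API driver such as psycopg2 would follow the same connect, cursor, execute, fetch, close shape.

# sketch: the generic DB API flow, connect -> cursor -> execute -> fetch -> close
import sqlite3

connection = sqlite3.connect(":memory:")   # connect constructor returns a connection object
cursor = connection.cursor()               # cursor() returns a new cursor on that connection

cursor.execute("CREATE TABLE instructor (id INTEGER, name TEXT)")
cursor.execute("INSERT INTO instructor VALUES (?, ?)", (1, "Rav"))
connection.commit()                        # commit any pending transaction

cursor.execute("SELECT * FROM instructor")
print(cursor.fetchall())                   # fetch the results of the query

cursor.close()
connection.close()                         # always release the connection when done

the IBM underscore DB walk-through that follows is likewise written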
on a jupyter notebook the IBM underscore DB API provides a variety of useful python functions for accessing and manipulating data in an IBM data server database including functions for connecting to a database preparing and issuing SQL statements fetching rows from result sets calling stored procedures committing and rolling back transactions handling errors and retrieving metadata the IBM underscore DB API uses the IBM data server driver for odbc and CLI apis to connect to IBM db2 and infor mix we import the IBM underscore DB Library into our python application connecting to the db2 requires the following information a driver name a database name a host DNS name or IP address a host port a connection protocol a user ID and a user password here is an example of creating a db2 database Connection in Python we create a connection object DSN which stores the connection credentials the connect function of the IBM underscore DB API will be used to create a non-persistent connection the DSN object is passed as a parameter to the connection function if a connection has been established with the database then the code returns connected as the output otherwise the output will be unable to connect to database then we free all Resources by closing the connection remember that it is always important to close connections so that we can avoid unused connections taking up resources thank you for watching this video [Music] hello and welcome to creating tables loading data and querying data after completing this lesson you will be able to understand basic concepts related to creating tables loading data and querying data using python as well as demonstrate an example of how to perform these tasks using the IBM db2 on cloud database and jupyter notebooks for this example we will be using db2 as the database we first obtain a connection resource by connecting to the database by using the connect method of the IBM underscore DB API there are different ways of creating tables in db2 One is using the web console provided by db2 and the other option is to create tables from any SQL r or python environments let's take a look at how to create tables in db2 from our python application here is a sample table of a commercial trucks database let's see how we can create the trucks table in the db2 using python code to create a table we use the IBM underscore DB exec underscore immediate function the parameters for the function are connection which is a valid database connection resource that is returned from the IBM underscore DB connect or IBM underscore DBP connect function statement which is a string that contains the SQL statement and options which is an optional parameter that includes a dictionary that specifies the type of cursor to return for result sets here is the code to create a table called trucks in Python we use the IBM underscore DB exec underscore immediate function of the IBM underscore DB API the connection resource that was created is passed as the first parameter to this function the next parameter is the SQL statement which is the create table query used to create the trucks table the new table created will have five columns serial underscore no will be the primary key now let's take a look at loading data we use the IBM underscore DB exec underscore immediate function of the IBM underscore DB API the connection resource that was created is passed as the first parameter to this function the next parameter is the SQL statement which is the insert into query used to insert data in the trucks table 
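As a hedged sketch of the two exec_immediate calls just described, the code below creates the trucks table and inserts one row; the connection credentials, the four non-key column names, and the sample values are placeholders rather than the course's actual details, and it only runs against a reachable db2 instance with the ibm_db package installed.

# sketch: create the TRUCKS table and insert a row via ibm_db.exec_immediate
# (credentials, non-key column names, and row values below are placeholders)
import ibm_db

dsn = (
    "DRIVER={IBM DB2 ODBC DRIVER};"
    "DATABASE=BLUDB;"
    "HOSTNAME=example-host.databases.appdomain.cloud;"
    "PORT=50000;"
    "PROTOCOL=TCPIP;"
    "UID=abc12345;"
    "PWD=mypassword;"
)
conn = ibm_db.connect(dsn, "", "")

create_stmt = (
    "CREATE TABLE TRUCKS ("
    "SERIAL_NO VARCHAR(20) PRIMARY KEY NOT NULL, "
    "MODEL VARCHAR(20), "
    "MANUFACTURER VARCHAR(20), "
    "ENGINE_SIZE VARCHAR(20), "
    "TRUCK_CLASS VARCHAR(20))"
)
ibm_db.exec_immediate(conn, create_stmt)

insert_stmt = (
    "INSERT INTO TRUCKS VALUES ('A1234', 'Model A', 'Maker X', '5.0L', 'Class 8')"
)
ibm_db.exec_immediate(conn, insert_stmt)
# further inserts and select queries would follow before ibm_db.close(conn)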
a new row will be added to the trucks table similarly we add more rows to the trucks table using the IBM underscore DB exec underscore immediate function now that your python code has been connected to a database instance and the database table has been created and populated with data let's see how we can fetch data from the trucks table that we created on db2 using python code we use the IBM underscore DB exec underscore immediate function of the IBM underscore DB API the connection resource that was created is passed as the first parameter to this function the next parameter is the SQL statement which is the select from table query the python code Returns the output which shows the fields of the data in the trucks table you can check if the output returned by the select query shown is correct by referring to the db2 console let's look at how we can use pandas to retrieve data from the database tables pandas is a popular python library that contains high-level data structures and manipulation tools designed to make data analysis fast and easy in Python we load data from the trucks table into a data frame called DF a data frame represents a tabular spreadsheet like data structure containing an ordered collection of columns Each of which can be a different value type thanks for watching this video [Music] hello and welcome to analyzing data with python after completing this video you will be able to understand basic concepts related to performing exploratory analysis on data we will demonstrate an example of how to store data using the IBM db2 on cloud database and then use Python to do some basic data analysis on this data in this video we will be using the McDonald's menu nutritional facts data for popular menu items at McDonald's while using python to perform basic exploratory analysis McDonald's is an American fast food company and the world's largest restaurant chain by Revenue although McDonald's is known for fast food items such as hamburgers french fries soft drinks milkshakes and desserts the company has added to its menu salads fish smoothies and fruit McDonald's provides nutrition analysis of their menu items to help you balance your McDonald's meal with other foods you eat the data set used in this lesson has been obtained from the nutritional facts for McDonald's menu from kaggle we need to create a table on db2 to store the McDonald's menu nutrition facts data set that we will be using we will also be using the console provided by db2 for this process there are four steps involved in loading data into a table Source Target Define and finalize we first load the spreadsheet into the db2 using the console we then select the target schema and then you will be given an option to load the data into an existing table or create a new table when you choose to create a new table you have the option to specify the table name next you will see a preview of the data where you can also Define The Columns and data types review the settings and begin the load when the loading is complete you can see the statistics on the loaded data next view the table to explore further db2 allows you to analyze data using in-database analytics apis rstudio or python the data has been loaded into our relational database you can run Python scripts that retrieve data from and write data to a db2 database such scripts can be powerful tools to help you analyze your data for example you can use them to generate statistical models Based on data in your database and to plot the results of these models in this 
lesson we will be using Python scripts that will be run within a jupyter notebook now after obtaining a connection resource by connecting to the database by using the connect method of the IBM underscore DB API we use the SQL select query to verify the number of rows that have been loaded in the table created the figure shows a snapshot of the output the output obtained is 260 which is similar to the number of rows in the db2 console now let's see how we can use pandas to retrieve data from the database tables we load data from the McDonald's underscore nutrition table into the data frame DF using the read underscore SQL method the SQL select query and the connection object are passed as parameters to the read underscore SQL method we can view the first few rows of the data frame DF that we created using the head method now it's time to learn about your data Panda's methods are equipped with a set of common mathematical and statistical methods let's use the describe method to view the summary statistics of the data in the data frame then explore the output of the describe method we see that there are 260 observations or food items in our data frame we also see that there are nine unique categories of food items in our data frame we can also see summary statistics information such as frequency mean median standard deviation etc for the 260 food items across the different variables for example the maximum value for total fat is 118. let's investigate this data further let's try to understand one of the nutrients in the food items which is sodium a main source of sodium is table salt the average American eats five or more teaspoons of salt each day this is about 20 times as much as the body needs sodium is found naturally in foods but a lot of it is added during processing and preparation many foods that do not taste salty may still be high in sodium large amounts of sodium can be hidden in canned processed and convenience Foods sodium controls fluid balance in our bodies and maintains blood volume and blood pressure eating too much sodium May raise blood pressure and cause fluid retention which could lead to swelling of the legs and feet or other health issues when limiting sodium in your diet a common Target is to eat less than 2 000 milligrams of sodium per day now using the nutrition data set from McDonald's let's do some basic data analysis to answer the question which food item has the maximum sodium content we first use visualization to explore the sodium content of food items using the Swarm plot method provided by the Seabourn package we create a categorical scatter plot as shown on the right then give as the input category on the x-axis sodium on the y-axis and the data will be the data frame DF that contains the nutritional data set from McDonald's the plot shows the sodium values for the different food items by category notice a high value of around 3600 for sodium on the scatter plot we will be learning about visualizations later in this module let's further explore this high sodium value and identify which food items on the menu have this value for sodium let some basic data analysis using python to find which food items on the menu have maximum sodium content to check the values of sodium levels in the food items within the data set we use the code as shown in code one the describe method is used to understand the summary statistics associated with sodium notice that the maximum value of sodium is given as 3600 now let's further explore the row associated with the maximum 
sodium variable as shown in code two we use the idx max method to compute the index values at which the maximum value of sodium is obtained in the data frame we see that the output is 82. now let's find the item name associated with the 82nd item in our data frame as shown in code 3 we will use the dot at method to find the item Name by passing the index of 82 and the column name item to be returned for the 82nd row finally we find that the food item on the menu that has the highest sodium content is chicken McNuggets 40 pieces visualizations are very useful for initial data exploration they can help us understand relationships patterns and outliers in the data let's first create a scatter plot with protein on the x-axis and total fat on the y-axis Scatter Plots are very popular visualization tools and show the relationship between two variables with a point for each observation to do this we can use the joint plot function provided by the Seaborn package and give as input protein on the x-axis and total fat on the y-axis and the data will be the data frame DF that contains the nutritional data set from McDonald's the output scatter plot is shown on the right side the plot has an interesting shape it shows the correlation between the two variables protein and fat correlation is a measure of association between two variables and has a value of between negative one and plus one we see that the points on the scatter plot are closer to a straight line in the positive direction so we have a positive correlation between the two variables on the top right corner of the scatter plot we have the values of the Pearson correlation 0.81 and the significance of the correlation denoted as P which is a good value that shows the variables are certainly correlated the plot also shows two histograms one on the top and the other on the right side the histogram on the top is that of the variable protein and the histogram on the right side is that of the variable total fat we also notice that there is a point on the scatter plot outside the general pattern this is a possible outlier now let's see how we can visualize data using box plots box plots are charts that indicate the distribution of one or more variables the Box in a box plot captures the middle fifty percent of data lines and points indicate possible skewness and outliers let's create a box plot for sugar the function we are going to use is box plot from the Seaborn package we give the column name sugars as input to the box plot function the output is shown on the right side where we have the box plot with average values of sugar and food items around 30 grams we also notice a few outliers that indicate food items with extreme values of sugar there exists food items in the data set that have sugar content of around 128 grams candies may be among these high sugar content food items on the menu now that you know how to do basic exploratory data analysis using pandas and visualization tools proceed to the labs in this module where you can practice the concepts learned thank you for watching this video [Music] hello and welcome in this video we'll give you a few hints and tips for working with real world data sets many of the real world data sets are made available as dot CSV files these are text files which contain data values typically separated by commas in some cases a different separator such as a semicolon may be used for this video we will use an example of a file called dogs.csv although this is a fictional data set that contains names of dogs 
and their breeds we will use it to illustrate Concepts that you will then apply to real data sets sample contents of the dogs dot CSV file are shown here the first row in the table in many cases contains attribute labels which map to column names in a table in dogs.csv the first row contains the name of three attributes ID is the name of the first attribute and the subsequent rows contain ID values of 1 2 and 3. the name of the dog is the second attribute in this case the dog names Wolfie fluffy and Huggy are the values and the third attribute is called breed either the dominant breed or pure breed name it has values of German Shepherd Pomeranian and Labrador as we've just seen CSV files can have the first or a header row that contains the names of the attributes if you're loading the data into the database using the visual load tool in the database console ensure the header in first row is enabled this will map the attribute names in the first row of the CSV file into column names in the database table and the rest of the rows into the data rows in the table as shown here note that the default column names may not always be database or query friendly and if that is the case you may want to edit them before the table is created now let's talk about querying column names that are lower or mixed case that is a combination of upper and lowercase let's assume we loaded the dogs.csv file using the default column names from the CSV if we try to retrieve the contents of the ID column using the query select lowercase ID from dogs we'll get an error as shown indicating the ID is not valid this is because the database parser assumes uppercase names by default whereas when we loaded the CSV file into the database it had the ID column name in mixed case IE an uppercase i and a lowercase D in this case to select data from a column with a mixed case name we need to specify the column name in its correct case within double quotes as follows select double quotes uppercase i lowercase D and double quotes from dogs ensure you use double quotes around the column name and not single quotes next we'll cover querying column names that have spaces and other characters in a CSV file if the name of the column contains Spaces by default the database May map them to underscores for example in the name of dog column there are spaces in between the three words the database may change it to name underscore of underscore dog other special characters like parentheses or brackets may also get mapped to underscores therefore when you write a query ensure you use proper case formatting within quotes and substitute special characters to underscores as shown in this example select ID name of dog breed dominant breed if not pure breed from dogs please note the underscores separating the words within double quotes also note the double underscore between breed and dominant as shown finally it's also important to note the trailing underscore after the word breed near the end of the query this is used in place of the closing bracket when using quotes in Jupiter notebooks you may be issuing queries in a notebook by first assigning them to python variables in such cases if your query contains double quotes for example to specify a mixed case column name you could differentiate the quotes by using single quotes for the python variable to enclose the SQL query and double quotes for the column names for example select query equals open single quote select ID and double quotes from dogs close single quote now what if you need to specify 
single quotes within the query for example to specify a value in the where clause in this case you can use backslash as the Escape character as follows select query equals open single quote select star from dogs where open double quote name of dog close double quote equals backslash single quote Huggy backslash single quote close single quote if you have very long queries such as join queries or nested queries it may be useful to split the query into multiple lines for improved readability in Python notebooks you can use the backslash character to indicate continuation to the next row as shown in this example percent SQL select ID name of dog backslash from dogs backslash or name of dog equals huggy it would be helpful at this point to take a moment to review the special characters as shown please keep in mind that you might get an error if you split the query into multiple lines in a python notebook without the backslash when using SQL magic you can use the double percent SQL in the first line of the cell in Jupiter notebooks it implies that the rest of the content of the cell is to be interpreted by SQL magic for example double percent SQL new row select ID name of dog new row from Dogs new row where name of dog equals huggy again please note the special characters as shown when using double percent SQL the backslash is not needed at the end of each line at this point you might be asking how would you restrict the number of rows retrieved it's a good question because a table may contain thousands or even millions of rows and you may only want to see some sample data or look at just a few rows to see what kind of data the table contains you may be tempted to just do select star from table name to retrieve the results in a pandas data frame and do a head function on it but doing so may take a long time for a query to run instead you can restrict the result set by using the limit clause for example use the following query to retrieve just the first three rows in a table called census data select star from census underscore data limit three in this video we looked at some considerations and tips for working with real world data sets thanks for watching [Music] hello and welcome in this video we'll look at how to get information about tables and their columns in a database now how would you get a list of tables in the database sometimes your database may contain several tables and you may not remember the correct name for example you may wonder whether the table is called dog dogs or four-legged mammals database systems typically contain system or catalog tables from where you can query the list of tables and get their properties in db2 this catalog is called ciscat tables in SQL Server it's information schema tables and in Oracle it's all tables or user tables to get a list of tables in a db2 database you can run the following query select star from syscat tables this select statement will return too many tables including system tables so it's better to filter the result as shown here select tab schema tab name create underscore time from syscat tables where tab schema equals abc12345 please ensure that you replace abc12345 with your own db2 username when you do a select star from syscat tables you get all the properties of the tables sometimes we're interested in specific properties such as creation time let's say you've created several tables with similar names for example dog one dog underscore test dog test one and so on but you want to check which of these tables was the last one you 
created to do so you can issue a query like select tab schema tab name create underscore time from syscat tables where tab schema equals qcm 54853 the output will contain the schema name table name and creation time for all tables in your schema next let's talk about how to get a list of columns in a table if you can't recall the exact name of a column for example whether it had any lowercase characters or an underscore in its name in db2 you can issue a query like the one shown here select star from syscat columns where tab name equals dogs for your information in MySQL you can simply run the command show columns from dogs or you may want to know specific properties like the data type and length of the data type in db2 you can issue a statement like select distinct name coltype length from sysibm syscolumns where tbname equals dogs here we look at the results of retrieving column properties for a real table called Chicago crime data from a jupyter notebook notice in the output that certain column names show different cases for example the column titled arrest has an uppercase a and the rest of the characters are lowercase so keep in mind that when you refer to this column in your query not only must you enclose the word arrest within double quotes you must also preserve the correct case inside the quotes in this video we saw how to retrieve table and column information thanks for watching [Music] in this video we'll be talking about data analysis and the scenario in which we'll be playing the data analyst or data scientist but before we begin talking about the problem of used car prices we should first understand the importance of data analysis as you know data is collected everywhere around us whether it's collected manually by scientists or collected digitally every time you click on a website or your mobile device but data does not mean information data analysis and in essence data science helps us unlock the information and insights from raw data to answer our questions so data analysis plays an important role by helping us to discover useful information from the data answer questions and even predict the future or the unknown so let's begin with our scenario let's say we have a friend named Tom and Tom wants to sell his car but the problem is he doesn't know how much he should sell his car for Tom wants to sell his car for as much as he can but he also wants to set the price reasonably so someone would want to purchase it so the price he sets should represent the value of the car how can we help Tom determine the best price for his car let's think like data scientists and clearly define some of his problems for example is there data on the prices of other cars and their characteristics what features of cars affect their prices color brand does horsepower also affect the selling price or perhaps something else as a data analyst or data scientist these are some of the questions we can start thinking about to answer these questions we're going to need some data in the next videos we'll be going into how to understand the data how to import it into Python and how to begin looking into some basic insights from the data [Music] in this video we'll be looking at the data set on used car prices the data set used in this course is an open data set by Jeffrey C Schlimmer this data set is in CSV format which separates each of the values with commas making it very easy to import in most tools or applications each line represents a row in the data set in the Hands-On lab for this
module you'll be able to download and use the CSV file do you notice anything different about the first row sometimes the first row is a header which contains a column name for each of the 26 columns but in this example it's just another row of data so here's the documentation on what each of the 26 columns represents there are a lot of columns and I'll just go through a few of the column names but you can also check out the link at the bottom of the slide to go through the descriptions yourself the first attribute symboling corresponds to the insurance risk level of a car cars are initially assigned a risk factor symbol associated with their price then if an automobile is more risky this symbol is adjusted by moving it up the scale a value of plus three indicates that the auto is risky minus three that it's probably pretty safe the second attribute normalized losses is the relative average loss payment per insured vehicle year this value is normalized for all autos within a particular size classification two-door small station wagons sports specialty Etc and represents the average loss per car per year the values range from 65 to 256. the other attributes are easy to understand if you would like to check out more details refer to the link at the bottom of the slide okay after we understand the meaning of each feature we'll notice that the 26th attribute is price this is our Target value or label in other words this means price is the value that we want to predict from the data set and the predictors should be all the other variables listed like symboling normalized losses make and so on thus the goal of this project is to predict price in terms of other car features just a quick note this data set is actually from 1985 so the car prices for the models may seem a little low but just bear in mind that the goal of this exercise is to learn how to analyze the data [Music] in order to do data analysis in Python we should first tell you a little bit about the main packages relevant to analysis in Python a python library is a collection of functions and methods that allow you to perform lots of actions without writing any code the libraries usually contain built-in modules providing different functionalities which you can use directly and there are extensive libraries offering a broad range of facilities we have divided the python data analysis libraries into three groups the first group is called scientific Computing libraries pandas offers data structures and tools for effective data manipulation and analysis it provides fast access to structured data the primary instrument of pandas is a two-dimensional table consisting of labelled columns and rows which is called a data frame it is designed to provide easy indexing functionality the numpy library uses arrays for its inputs and outputs it can be extended to objects for matrices and with minor coding changes developers can perform fast array processing SciPy includes functions for some advanced math problems as listed on this slide as well as data visualization using data visualization methods is the best way to communicate with others showing them meaningful results of analysis these libraries enable you to create graphs charts and maps the matplotlib package is the most well-known library for data visualization it is great for making graphs and plots the graphs are also highly customizable another high-level visualization library is Seaborn it is based on matplotlib and it's very easy to generate various plots such as heat maps time series and violin plots
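As a brief, hedged illustration of the first two library groups, the sketch below runs a numpy calculation on a small pandas data frame with invented horsepower and price values and draws a matplotlib scatter plot and a seaborn violin plot; it only shows the division of labour between the packages, not the course's actual analysis.

# sketch: scientific computing and visualization libraries side by side (invented values)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({"horsepower": [111, 154, 102, 115, 110],
                   "price": [13495, 16500, 13950, 17450, 15250]})

print(np.mean(df["price"]))                  # numpy working on a pandas column

plt.scatter(df["horsepower"], df["price"])   # matplotlib: a basic, customizable plot
plt.xlabel("horsepower")
plt.ylabel("price")

plt.figure()                                 # new figure for the seaborn plot
sns.violinplot(y=df["price"])                # seaborn: higher-level statistical plots
plt.show()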
with machine learning algorithms we're able to develop a model using our data set and obtain predictions the algorithmic libraries tackle some machine learning tasks from basic to complex here we introduce two packages the scikit-learn library contains tools for statistical modeling including regression classification clustering and so on this library is built on numpy scipy and matplotlib statsmodels is also a python module that allows users to explore data estimate statistical models and perform statistical tests [Music] in this video we'll look at how to read in data using Python's pandas package once we have our data in Python then we can perform all the subsequent data analysis procedures we need data acquisition is a process of loading and reading data into our notebook from various sources to read any data using Python's pandas package there are two important factors to consider format and file path format is the way data is encoded we can usually tell different encoding schemes by looking at the ending of the file name some common encodings are CSV Json xlsx hdf and so forth the path tells us where the data is stored usually it is stored either on the computer we are using or online on the internet in our case we found a data set of used cars which was obtained from the web address shown on the slide when Jerry entered the web address in his web browser he saw something like this each row is one data point a large number of properties are associated with each data point because the properties are separated from each other by commas we can guess the data format as CSV which stands for comma separated values at this point these are just numbers and don't mean much to humans but once we read in this data we can try to make more sense out of it in pandas the read CSV method can read in files with columns separated by commas into a pandas data frame reading data in pandas can be done quickly in three lines first import pandas then define a variable with a file path and then use the read CSV method to import the data however read CSV assumes the data contains a header our data on used cars has no column headers so we need to specify read CSV to not assign headers by setting header to none after reading the data set it is a good idea to look at the data frame to get a better intuition and to ensure that everything occurred the way you expected since printing the entire data set may take up too much time and resources to save time we can just use dataframe dot head to show the first n rows of the data frame similarly dataframe dot tail shows the bottom n rows of the data frame here we printed out the first five rows of data it seems that the data set was read successfully we can see that pandas automatically set the column header as a list of integers because we set header equals none when we read the data it is difficult to work with the data frame without having meaningful column names however we can assign column names in pandas in our present case it turned out that we have the column names in a separate file online we first put the column names in a list called headers then we set df.columns equals headers to replace the default integer headers with the list if we use the head method introduced in the last slide to check the data set we see the correct headers inserted at the top of each column at some point in time after you've done operations on your data frame you may want to export your pandas data frame to a new CSV file you can do this using the to underscore CSV method
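Here is a self-contained sketch of that read, rename, and export workflow; the three-column file it writes first is a stand-in for the 26-column used-car CSV, and the file names are placeholders.

# sketch: reading a headerless CSV, naming the columns, and exporting
import pandas as pd

# write a small headerless CSV so the example is self-contained
with open("auto.csv", "w") as f:
    f.write("3,?,alfa-romero\n3,?,alfa-romero\n1,?,audi\n")

df = pd.read_csv("auto.csv", header=None)   # header=None: do not treat row one as column names
print(df.head(2))                           # first two rows, columns labelled 0, 1, 2
print(df.tail(2))                           # bottom two rows

headers = ["symboling", "normalized-losses", "make"]
df.columns = headers                        # replace the default integer headers with real names

df.to_csv("automobile.csv", index=False)    # export the updated data frame to a new CSV file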
To do this, specify the file path, which includes the file name that you want to write to. For example, if you would like to save the DataFrame df as automobile.csv to your own computer, you can use the syntax df.to_csv. For this course we will only read and save CSV files; however, pandas also supports importing and exporting most data file types. For the different data set formats, the code syntax for reading and saving is very similar to that for reading or saving a CSV file, and the table shows the corresponding method for reading and saving each format. In this video we introduce some simple pandas methods that all data scientists and analysts should know when using Python, pandas, and data. At this point we assume that the data has been loaded; it's time for us to explore the data set. Pandas has several built-in methods that can be used to understand the data type of features or to look at the distribution of data within the data set. Using these methods gives an overview of the data set and also points out potential issues, such as the wrong data type of a feature, which may need to be resolved later on. Data has a variety of types; the main types stored in pandas objects are object, float, int, and datetime. The data type names are somewhat different from those in native Python. This table shows the differences and similarities between them. Some are very similar, such as the numeric data types int and float. The object pandas type functions similarly to a string in Python, save for the change in name, while the datetime pandas type is very useful for handling time series data. There are two reasons to check data types in a data set. First, pandas automatically assigns types based on the encoding it detects from the original data table, and for a number of reasons this assignment may be incorrect. For example, it would be awkward if the car price column, which we expect to contain continuous numeric values, were assigned the data type object; it would be more natural for it to have the float type, and Jerry may need to manually change the data type to float. The second reason is that it allows an experienced data scientist to see which Python functions can be applied to a specific column. For example, some math functions can only be applied to numerical data; if these functions are applied to non-numerical data, an error may result. When dtypes is applied to the data set, the data type of each column is returned as a series. A good data scientist's intuition tells us that most of the data types make sense. The makes of cars, for example, are names, so this information should be of type object. The last one on the list could be an issue: as bore is a dimension of an engine, we should expect a numerical data type to be used, but instead the object type is used; in later sections Jerry will have to correct these type mismatches. Now we would like to check the statistical summary of each column to learn about the distribution of data in each column. The statistical metrics can tell the data scientist whether mathematical issues exist, such as extreme outliers and large deviations, which the data scientist may have to address later. To get quick statistics, we use the describe method. It returns the number of terms in the column as count, the average column value as mean, the column standard deviation as std, and the maximum and minimum values, as well as the boundary of each of the quartiles. By default, the dataframe.describe function skips rows and columns that do not contain numbers, but it is possible to make the describe method work for object-type columns as well, as shown in the sketch below.
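A hedged sketch of these exploration methods, using a small illustrative frame rather than the real 26-column data set (the values and the df variable name are assumptions for the example):

```python
import pandas as pd

# Small illustrative frame standing in for the used-car DataFrame.
df = pd.DataFrame({
    "make":  ["alfa-romero", "audi", "bmw"],
    "price": [13495.0, 13950.0, 16430.0],
    "bore":  ["3.47", "3.19", "3.31"],   # numeric values stored as strings -> dtype object
})

print(df.dtypes)                   # data type of each column: object, float64, object

print(df.describe())               # numeric columns only: count, mean, std, min, quartiles, max
print(df.describe(include="all"))  # adds unique / top / freq; NaN where a metric is undefined

df.info()                          # concise summary: non-null counts, dtypes, memory usage
```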
To enable a summary of all the columns, we could add the argument include equals "all" inside the describe function brackets. Now the outcome shows the summary of all 26 columns, including object-typed attributes. We see that for the object-type columns a different set of statistics is evaluated, like unique, top, and freq: unique is the number of distinct objects in the column, top is the most frequently occurring object, and freq is the number of times the top object appears in the column. Some values in the table are shown as NaN, which stands for not a number; this is because that particular statistical metric cannot be calculated for that specific column data type. Another method you can use to check your data set is the dataframe.info function, which prints a concise summary of the DataFrame, including the column names, their data types, and the count of non-null values in each column. Hello, in this video you will learn how to access databases using Python. Databases are powerful tools for data scientists. After completing this module, you will be able to explain the basic concepts related to using Python to connect to databases. This is how a typical user accesses databases using Python code written in a Jupyter notebook, a web-based editor. There is a mechanism by which the Python program communicates with the DBMS: the Python code connects to the database using API calls. We will explain the basics of SQL APIs and Python DB APIs. An application programming interface is a set of functions that you can call to get access to some type of service. A SQL API consists of library function calls as an application programming interface for the DBMS. To pass SQL statements to the DBMS, an application program calls functions in the API, and it calls other functions to retrieve query results and status information from the DBMS. The basic operation of a typical SQL API is illustrated in the figure. The application program begins its database access with one or more API calls that connect the program to the DBMS. To send an SQL statement to the DBMS, the program builds the statement as a text string in a buffer and then makes an API call to pass the buffer contents to the DBMS. The application program makes API calls to check the status of its DBMS request and to handle errors, and it ends its database access with an API call that disconnects it from the database. DB API is Python's standard API for accessing relational databases. It is a standard that allows you to write a single program that works with multiple kinds of relational databases instead of writing a separate program for each one, so if you learn the DB API functions you can apply that knowledge to use any database with Python. The two main concepts in the Python DB API are connection objects and cursor objects. You use connection objects to connect to a database and manage your transactions; cursor objects are used to run queries. You open a cursor object and then run queries; the cursor works like a cursor in a text processing system, where you scroll down in your result set and get your data into the application. Cursors are used to scan through the results of a database. Here are the methods used with connection objects: the cursor method returns a new cursor object using the connection, the commit method is used to commit any pending transaction to the database, the rollback method causes the database to roll back to the start of any pending transaction, and the close method is used to close a database connection. Let's walk through a Python application that
uses the DB API to query a database first you import your database module by using the connect API from that module to open a connection to the database you use the connection function and pass in the parameters that is the database name username and password the connect function returns a connection object after this you create a cursor object on the connection object the cursor is used to run queries and fetch results after running the queries using the cursor we also use the cursor to fetch the results of the query finally when the system is done running the queries it frees all Resources by closing the connection remember that it is always important to close connections to avoid unused connections taking up resources thanks for watching this video [Music] this video we'll be going through some data pre-processing techniques if you're unfamiliar with the term data pre-processing is a necessary step in data analysis it is the process of converting or mapping data from one raw form into another format to make it ready for further analysis data pre-processing is often called Data cleaning or data wrangling and there are likely other terms here are the topics that we'll be covering in this module first we'll show you how to identify and handle missing values a missing value condition occurs whenever a data entry is left empty then we'll cover data formats data from different sources may be in various formats in different units or in various conventions we will introduce some methods in Python pandas that can standardize the values into the same format or unit or convention after that we'll cover data normalization different Columns of numerical data may have very different ranges and direct comparison is often not meaningful normalization is a way to bring all data into a similar range for more useful comparison specifically we'll focus on the techniques of centering and scaling and then we'll introduce data binning binning creates bigger categories from a set of numerical values it is particularly useful for comparison between groups of data and lastly we'll talk about categorical variables and show you how to convert categorical values into numeric variables to make statistical modeling easier in Python we usually perform operations along columns each row of the column represents a sample I.E a different used car in the database you access a column by specifying the name of the column for example you can access symboling and body style each of these columns is a panda series there are many ways to manipulate data frames in Python for example you can add a value to each entry of a column to add 1 to each symboling entry use this command this changes each value of the data frame column by adding 1 to the current value [Music] in this video we will introduce the pervasive problem of missing values as well as strategies on what to do when you encounter missing values in your data when no data value is stored for feature for a particular observation we say this feature has a missing value usually missing value in data set appears as question mark and a 0 or just a blank cell in the example here the normalized losses feature has a missing value which is represented with n a n but how can you deal with missing data there are many ways to deal with missing values and this is regardless of python r or whatever tool you use of course each situation is different and should be judged differently however these are the typical options you can consider the first is to check if the person or group that 
collected the data can go back and find what the actual value should be another possibility is just to remove the data where that missing value is found when you drop data you could either drop the whole variable or just the single data entry with the missing value if you don't have a lot of observations with missing data usually dropping the particular entry is the best if you're removing data you want to look to do something that has the least amount of impact replacing data is better since no data is wasted however it is less accurate since we need to replace missing data with a guess of what the data should be one standard replacement technique is to replace missing values by the average value of the entire variable as an example suppose we have some entries that have missing values for the normalized losses column and the column average for entries with data is 4 500. while there is no way for us to get an accurate guess of what the missing values under the normalized losses column should have been you can approximate their values using the average value of the column four thousand five hundred but what if the values cannot be averaged as with categorical variables for a variable like fuel type there isn't an average fuel type since the variable values are not numbers in this case one possibility is to try using the mode the most common like gasoline finally sometimes we may find another way to guess the missing data this is usually because the data gatherer knows something additional about the missing data for example he may know that the missing values tend to be old cars and the normalized losses of old cars are significantly higher than the average vehicle and of course finally in some cases you may simply want to leave the missing data as missing data for one reason or another it may be useful to keep that observation even if some features are missing now let's go into how to drop missing values or replace missing values in Python to remove data that contains missing values Panda's library has a built-in method called drop n a essentially with the drop n a method you can choose to drop rows or columns that contain missing values like Nan so you'll need to specify access equals 0 to drop the rows or axis equals 1 to drop the columns that contain the missing values in this example there is a missing value in the price column since the price of used cars is what we're trying to predict in our upcoming analysis we'd have to remove the cars the rows that don't have a listed price it can simply be done in one line of code using dataframe dot drop n a setting the argument in place to True allows the modification to be done on the data set directly in place equals true just writes the result back into the data frame this is equivalent to this line of code don't forget that this line of code does not change the data frame but it is a good way to make sure that you are performing the correct operation to modify the data frame you have to set the parameter in place equal to true you should always check the documentation if you are not familiar with a function or method the pandas webpage has lots of useful resources to replace missing values like n a ends with actual values pandas library has a built-in method called replace which can be used to fill in the missing values with the newly calculated values as an example assume that we want to replace the missing values of the variable normalized losses by the mean value of the variable therefore the missing value should be replaced by the 
average of the entries within that column in Python first we calculate the mean of the column then we use the method for place to specify the value we would like to be replaced as the first parameter in this case n a n the second parameter is the value we would like to replace it with I.E the mean in this example this is a fairly simplified way of replacing missing values there are of course other techniques such as replacing missing values for the average of the group instead of the entire data set so we've gone through two ways in Python to deal with missing data we learn to drop problematic rows or columns containing missing values and then we learned how to replace missing values with other values but don't forget the other ways to deal with missing data you can always check for a higher quality data set or Source or in some cases you may want to leave the missing data as missing data [Music] in this video we'll look at the problem of data with different formats units and conventions and the pandas methods that help us deal with these issues data is usually collected from different places by different people which may be stored in different formats data formatting means bringing data into a common standard of expression that allows users to make meaningful comparisons as a part of data set cleaning data formatting ensures the data is consistent and easily understandable for example people may use different Expressions to represent New York City such as uppercase n uppercase y uppercase n lowercase y uppercase n uppercase Y and New York sometimes this unclean data is a good thing to see for example if you're looking at the different ways people tend to write New York then this is exactly the data that you want or if you're looking for ways to spot fraud perhaps writing n dot y dot is more likely to predict an anomaly than if someone wrote out New York in full but perhaps more often than not we just simply want to treat them all as the same entity or format to make statistical analyzes easier down the road referring to our used car data set there's a feature named City miles per gallon in the data set which refers to a car fuel consumption in miles per gallon unit however you may be someone who lives in a country that uses metric units so you would want to convert those values to liters per 100 kilometers the metric version to transform miles per gallon to liters per hundred kilometers we need to divide 235 by each value in the city miles per gallon column in Python this can easily be done in one line of code you take the column and set it to equal to 235 divided by the entire column in the second line of code rename column name from City miles per gallon to City liters per hundred kilometers using the data frame rename method for a number of reasons including when you import a data set into python the data type may be incorrectly established for example here we notice that the assigned data type to the price feature is object although the expected data type should really be an integer or float type it is important for later analysis to explore the features data type and convert them to the correct data types otherwise the develop models later on May behave strangely and totally valid data may end up being treated like missing data there are many data types in pandas objects can be letters or words into 64 are integers and floats are real numbers there are many others that we will not discuss to identify a feature's data type in Python we can use the dataframe.dypes method and check the 
data type of each variable in a data frame in the case of wrong data types the method data frame dot as type can be used to convert a data type from one format to another for example using as type int for the price column you can convert the object column into an integer type variable [Music] in this video we'll be talking about data normalization an important technique to understand in data pre-processing when we take a look at the used car data set we notice in the data that the feature length ranges from 150 to 250 while feature width and height ranges from 50 to 100 we may want to normalize these variables so that the range of the values is consistent this normalization can make some statistical analyzes easier down the road by making the ranges consistent between variables normalization enables a fairer comparison between the different features making sure they have the same impact it is also important for computational reasons here is another example that will help you understand why normalization is important consider a data set containing two features age and income where age ranges from zero to a hundred while income ranges from zero to twenty thousand and higher income is about 1 000 times larger than age and ranges from twenty thousand to five hundred thousand so these two features are in very different ranges when we do further analysis like linear regression for example the attribute income will intrinsically influence the result more due to its larger value but this doesn't necessarily mean it is more important as a predictor so the nature of the data biases the linear regression model to weigh income more heavily than age to avoid this we can normalize these two variables into values that range from 0 to 1. compare the two tables at the right after normalization both variables now have a similar influence on the models we will build later there are several ways to normalize data I will just Outline Three techniques the first method called Simple feature scaling just divides each value by the maximum value for that feature this makes the new values range between 0 and 1. the second method called min max takes each value X underscore old subtract it from the minimum value of that feature then divides by the range of that feature again the resulting new values range between 0 and 1. 
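A minimal sketch of these scaling formulas on a toy length column (the new column names and the example values are assumptions for illustration; the z-score variant is described next in the text):

```python
import pandas as pd

# Toy "length" column standing in for the used-car feature.
df = pd.DataFrame({"length": [150.0, 175.0, 200.0, 250.0]})

# Simple feature scaling: divide by the maximum, giving values in (0, 1].
df["length_scaled"] = df["length"] / df["length"].max()

# Min-max: subtract the minimum and divide by the range, giving values in [0, 1].
df["length_minmax"] = (df["length"] - df["length"].min()) / (
    df["length"].max() - df["length"].min()
)

# Z-score (described next in the text): subtract the mean and divide by the
# standard deviation; results typically fall between about -3 and +3.
df["length_z"] = (df["length"] - df["length"].mean()) / df["length"].std()

print(df)
```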
the third method is called z-score or standard score in this formula for each value you subtract the MU which is the average of the feature and then divide by the standard deviation the sigma the resulting values hover around zero and typically range between negative 3 and positive 3 but can be higher or lower following our earlier example we can apply the normalization method on the length feature first we use the simple feature scaling method where we divide it by the maximum value in the feature using the pandas method Max this can be done in just one line of code here's the min max method on the length feature we subtract each value by the minimum of that column then divide it by the range of that column the max minus the min finally we apply the z-score method on length feature to normalize the values here we apply the mean and STD method on the length feature mean method will return the average value of the feature in the data set an STD method will return the standard deviation of the features in the data set [Music] in this video we'll be talking about binning as a method of data pre-processing binning is when you group values together into bins for example you can bend age into 0 to 5 6 to 10 11 to 15 and so on sometimes binning can improve accuracy of the predictive models in addition sometimes we use data binning to group a set of numerical values into a smaller number of bins to have a better understanding of the data distribution as example price here is an attribute range from 5000 to 45 500. using binning we categorize the price into three bins low price median price and high prices in the actual card data set price is a numerical variable ranging from 5188 to forty five thousand four hundred it has 201 unique values we can categorize them into three bins low medium and high priced cars in Python we can easily implement the binning we would like three bins of equal bin width so we need four numbers as dividers that are equal distance apart first we use the numpy function linspace to return the array bins that contains four equally spaced numbers over the specified interval of the price we create a list group underscore names that contains the different bin names we use the pandas function cut to segment and sort the data values into bins you can then use histograms to visualize the distribution of the data after they've been divided into bins this is the histogram that we plotted based on the bidding that we applied in the price feature from the plot it is clear that most cars have a low price and only very few cars have high price [Music] in this video we'll discuss how to turn categorical variables into quantitative variables in Python most statistical models cannot take in objects or strings as input and for model training only take the numbers as inputs in the card data set the fuel type feature has a categorical variable has two values gas or diesel which are in string format for further analysis Jerry has to convert these variables into some form of numeric format we encode the values by adding new features corresponding to each unique element in the original feature we would like to encode in the case where the feature fuel has two unique values gas and Diesel we create two new features gas and Diesel when a value occurs in the original feature we set the corresponding value to 1 in the new feature the rest of the features are set to zero in the fuel example for car B the fuel value is diesel therefore we set the featured diesel equal to one and the gas feature to 
zero similarly for car D the fuel value is gas therefore we set the feature gas equal to 1 and the feature diesel equal to zero this technique is often called one hot encoding in pandas we can use get underscore dummies method to convert categorical variables to dummy variables in Python transforming categorical variables to dummy variables is simple following the example PD dot get underscore dummies method gets the fuel type column and creates the data frame dummy underscore variable underscore one the get underscore dummies method automatically generates a list of numbers each one corresponding to a particular category of the variable [Music] in this module we're going to cover the basics of exploratory data analysis using python exploratory data analysis or in short Eda is an approach to analyze data in order to summarize main characteristics of the data gain better understanding of the data set uncover relationships between different variables and extract important variables for the problem we're trying to solve the main question we are trying to answer in this module is what are the characteristics that have the most impact on the car price we will be going through a couple of different useful exploratory data analysis techniques in order to answer this question in this module you will learn about descriptive statistics which describe basic features of a data set and obtains a short summary about the sample and measures of the data basic of grouping data using Group by and how this can help to transform our data set the correlation between different variables and lastly Advanced correlation where we'll introduce you to various correlation statistical methods namely Pearson correlation and correlation heat Maps foreign we'll be talking about descriptive statistics when you begin to analyze data it's important to first explore your data before you spend time building complicated models one easy way to do so is to calculate some descriptive statistics for your data descriptive statistical analysis helps to describe basic features of a data set and obtains a short summary about the sample and measures of the data let's show you a couple different useful methods one way in which we can do this is by using the describe function in pandas using the describe function and applying it on your data frame the describe function automatically computes basic statistics for all numerical variables it shows the mean the total number of data points the standard deviation the quartiles and the extreme values any Nan values are automatically skipped in these statistics this function will give you a clearer idea of the distribution of your different variables you could have also categorical variables in your data set these are variables that can be divided up into different categories or groups and have discrete values for example in our data set we have the drive system as a categorical variable which consists of the categories forward wheel drive rear wheel drive and four-wheel drive one way you can summarize the categorical data is by using the function value underscore counts we can change the name of the column to make it easier to read we see that we have 118 cars in the front wheel drive category 75 cars in the rear-wheel drive category and eight cars in the four-wheel drive category box plots are a great way to visualize numeric data since you can visualize the various distributions of the data the main features that the box plot shows are the median of the data which represents where the middle 
data point is the upper quartile shows where the 75th percentile is the lower quartile shows where the 25th percentile is the data between the upper and lower quartile represents the interquartile range next you have the lower and upper extremes these are calculated as 1.5 times the interquartile range above the 75th percentile and as 1.5 times the IQR below the 25th percentile finally box plots also display outliers as individual dots that occur outside the upper and lower extremes with box plots you can easily spot outliers and also see the distribution and skewness of the data box plots make it easy to compare between groups in this example using box plot we can see the distribution of different categories of the drive Wheels feature over price feature we can see that the distribution of price between the rear wheel drive and the other categories are distinct but the price for front-wheel drive and four-wheel drive are almost indistinguishable oftentimes we tend to see continuous variables in our data these data points are numbers contained in some range for example in our data set price and engine size are continuous variables what if we want to understand the relationship between engine size and price could engine size possibly predict the price of a car one good way to visualize this is using a scatter plot each observation in the scatter plot is represented as a point this plot shows the relationship between two variables the predictor variable is the variable that you are using to predict an outcome in this case our predictor variable is the engine size Target variable is the variable that you are trying to predict in this case our Target variable is the price since this would be the outcome in a scatter plot we typically set the predictor variable on the x-axis or horizontal axis and we set the target variable on the y-axis or vertical axis in this case we will thus plot the engine size on the x-axis and the price on the y-axis we are using the matplot lib function scatter here taking in X and A Y variable something to note is that it's always important to label your axes and write a general plot title so that you know what you're looking at now how is the variable engine size related to price from the scatter plot we see that as the engine size goes up the price of the car also goes up this is giving us an initial indication that there is a positive linear relationship between these two variables [Music] in this video we'll cover the basics of grouping and how this can help to transform our data set assume you want to know is there any relationship between the different types of drive system forward rear and four-wheel drive and the price of the vehicles if so which type of drive system adds the most value to a vehicle it would be nice if we could group all the data by the different types of Drive wheels and compare the results of these different Drive Wheels against each other in pandas this can be done using the group by method the group by method is used on categorical variables groups the data into subsets according to the different categories of that variable you can Group by a single variable or you can Group by multiple variables by passing in multiple variable names as an example let's say we are interested in finding the average price of vehicles and observe how they differ between different types of body styles and drive Wheels variables to do this we first pick out the three data columns we are interested in which is done in the first line of code we then group the 
reduced data according to drive wheels and body style in the second line since we are interested in knowing how the average price differs across the board we can take the mean of each group and append it this bit at the very end of the line 2. the data is now grouped into subcategories and only the average price of each subcategory is shown we can see that according to our data rear-wheel drive convertibles and rear-wheel drive hard tops have the highest value while four-wheel drive hatchbacks have the lowest value a table of this form isn't the easiest to read and also not very easy to visualize to make it easier to understand we can transform this table to a pivot table by using the pivot method in the previous table both Drive wheels and body style were listening columns a pivot table has one variable displayed along the columns and the other variable displayed along the rows just with one line of code and by using the pandas pivot method we can pivot the body style variable so it is displayed along the columns and the drive wheels will be displayed along the rows the price data now becomes a rectangular grid which is easier to visualize this is similar to what is usually done in Excel spreadsheets another way to represent the pivot table is using a heat map plot heat map takes a rectangular grid of data and assigns a color intensity based on the data value at the grid points it is a great way to plot the target variable over multiple variables and through this get visual clues of the relationship between these variables and the target in this example we use Pi plot's P color method to plot heat map and convert the previous pivot table into a graphical form we specified the red blue color scheme in the output plot each type of body style is numbered along the x-axis and each type of Drive Wheels is numbered along the y-axis the average prices are plotted with varying colors based on their values according to the color bar we see that the top section of the heat map seems to have higher prices in the bottom section [Music] in this video we'll talk about the correlation between different variables correlation is a statistical metric for measuring to what extent different variables are interdependent in other words when we look at two variables over time if one variable changes how does this effect change in the other variable for example smoking is known to be correlated to lung cancer since you have a higher chance of getting lung cancer if you smoke in another example there is a correlation between umbrella and Rain variables where more precipitation means more people use umbrellas also if it doesn't rain people would not carry umbrellas therefore we can say that umbrellas and Rain are interdependent and by definition they are correlated it is important to know that correlation doesn't imply causation in fact we can say that umbrella and Rain are correlated but we would not have enough information to say whether the umbrella caused the rain or the rain caused the umbrella in data science we usually deal more with correlation let's look at the correlation between engine size and price this time we'll visualize these two variables using a scatter plot and an added linear line called a regression line which indicates the relationship between the two the main goal of this plot is to see whether the engine size has any impact on the price in this example you can see that the straight line through the data points is very steep which shows that there is a positive linear relationship between 
the two variables: as the values of engine size increase, the values of price go up as well, and the slope of the line is positive, so there is a positive correlation between engine size and price. We can use Seaborn's regplot to create the scatter plot. As another example, let's now look at the relationship between highway miles per gallon and its impact on the car price. As we can see in this plot, when the highway miles per gallon value goes up, the value of price goes down; therefore there is a negative linear relationship between highway miles per gallon and price. Although this relationship is negative, the slope of the line is steep, which means that highway miles per gallon is still a good predictor of price; these two variables are said to have a negative correlation. Finally, we have an example of a weak correlation: both low and high values of peak RPM are associated with both low and high prices, therefore we cannot use peak RPM to predict price. In this video we'll introduce you to various correlation statistical methods. One way to measure the strength of the correlation between continuous numerical variables is by using a method called Pearson correlation. The Pearson correlation method will give you two values: the correlation coefficient and the p-value. So how do we interpret these values? For the correlation coefficient, a value close to 1 implies a large positive correlation, a value close to negative 1 implies a large negative correlation, and a value close to zero implies no correlation between the variables. Next, the p-value tells us how certain we are about the correlation that we calculated. For the p-value, a value less than 0.001 gives us strong certainty about the correlation coefficient that we calculated, a value between 0.001 and 0.05 gives us moderate certainty, a value between 0.05 and 0.1 gives us weak certainty, and a p-value larger than 0.1 gives us no certainty of correlation at all. We can say that there is a strong correlation when the correlation coefficient is close to 1 or negative 1 and the p-value is less than 0.001. The following plot shows data with different correlation values. In this example, we want to look at the correlation between the variables horsepower and car price; see how easily you can calculate the Pearson correlation using the SciPy stats package. We can see that the correlation coefficient is approximately 0.8, and this is close to one, so there is a strong positive correlation. We can also see that the p-value is very small, much smaller than 0.001, so we can conclude that we are certain about the strong positive correlation. Taking all variables into account, we can now create a heat map that indicates the correlation of each of the variables with one another. The color scheme indicates the Pearson correlation coefficient, indicating the strength of the correlation between two variables. We can see a diagonal line with a dark red color, indicating that all the values on this diagonal are highly correlated. This makes sense because, when you look closer, the values on the diagonal are the correlations of the variables with themselves, which will always be one. This correlation heat map gives us a good overview of how the different variables are related to one another and, most importantly, how these variables are related to price; a sketch of these calculations follows below.
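A hedged sketch of the Pearson correlation and the correlation heat map, assuming toy horsepower and price values; seaborn's heatmap is used here as one common way to draw the heat map, and may differ from the plotting call used on the slides.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# Toy numeric columns standing in for horsepower and price.
df = pd.DataFrame({
    "horsepower": [48, 68, 102, 115, 160, 200],
    "price":      [5151, 6295, 13950, 17450, 25500, 36880],
})

# Pearson correlation coefficient and p-value for horsepower vs. price.
coef, p_value = stats.pearsonr(df["horsepower"], df["price"])
print(coef, p_value)

# Correlation matrix of the numeric columns; the diagonal is always 1.
corr = df.corr()
print(corr)

# One way to draw the heat map of coefficients (seaborn shown here).
sns.heatmap(corr, cmap="RdBu", annot=True)
plt.show()
```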
In this video we will learn how to find out if there is a relationship between two categorical variables. When dealing with the relationship between two categorical variables, we can't use the same correlation method as for continuous variables; we will have to employ the chi-squared test for association. The chi-square test is intended to test how likely it is that an observed distribution is due to chance. It measures how well the observed distribution of data fits the distribution that is expected if the variables are independent. Before we go into an example, let's go through some important points. The chi-square test uses a null hypothesis that the variables are independent; the test compares the observed data to the values the model expects if the data were distributed across the categories by chance. Any time the observed data doesn't fit within the model of the expected values, the probability that the variables are dependent becomes stronger, thus proving the null hypothesis incorrect. The chi-square test does not tell you the type of relationship that exists between the two variables, only that a relationship exists. We will use the car data set, assuming we want to test the relationship between fuel type and aspiration. These are categorical variables: the fuel type of the car is either gas or diesel, and the aspiration is either standard or turbo. To do this, we will find the observed counts of cars in each category, which can be done by creating a cross tab using the pandas library. A cross tab is a table showing the relationship between two or more variables; when the table only shows the relationship between two categorical variables, the cross tab is also known as a contingency table. In our case, the cross tab or contingency table shows us the counts in each category: a standard car with diesel fuel, a standard car with gas fuel, a turbo car with diesel fuel, or a turbo car with gas fuel. The formula for chi-square is given as follows: the summation, over all cells, of the observed value (i.e. the count in each group) minus the expected value, all squared, divided by the expected value. Expected values are based on the given totals, that is, what we would expect the individual cells to be if we did not know the observed values. To calculate the expected value for a standard car with diesel fuel, we take the row total, which is 20, multiply by the column total, 168, and divide by the grand total of 205; this gives 16.39. If we do the same thing for turbo cars with gas fuel, we take the row total, 185, multiply by the column total, 37, and divide by the grand total of 205 to get 33.39. If we repeat the same procedure for all of the cells, we get these values; if we take the row totals, column totals, and grand total, we get the same totals as for the observed values. Now, going back to the formula, if we take the summation of all the observed minus expected values, squared and divided by the expected values, we get a chi-square value of 29.6. On the chi-square table, we check the row for degrees of freedom equal to one and find the value closest to 29.6; we can see that 29.6 falls at a p-value less than 0.05. Since the p-value is less than 0.05, we reject the null hypothesis that the two variables are independent and conclude that there is an association between fuel type and aspiration. To do this in Python, we use the chi2_contingency function in the scipy.stats package. The function prints out the chi-square test value, 29.6; the second value is the p-value, which is very close to zero; and the degrees of freedom, 1. A minimal version of this computation is sketched below.
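This sketch builds the contingency table with pandas and runs the test with scipy; the toy fuel-type and aspiration rows are illustrative and do not reproduce the counts or the 29.6 statistic from the car data set.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy categorical data standing in for fuel-type and aspiration
# (the counts here are illustrative, not the data set's).
df = pd.DataFrame({
    "fuel-type":  ["gas", "gas", "diesel", "gas", "diesel", "gas", "gas", "diesel"],
    "aspiration": ["std", "std", "turbo", "turbo", "turbo", "std", "std", "std"],
})

# Contingency table (cross tab) of observed counts.
table = pd.crosstab(df["fuel-type"], df["aspiration"])
print(table)

# chi2_contingency returns the test statistic, the p-value, the degrees of
# freedom, and the table of expected counts.
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p, dof)
print(expected)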
if you remember the chi-square table did not give an exact p-value but a range in which it falls python will give the exact p-value we can see the same results as in our previous slides it also prints out the expected values which we also calculated by hand since the p-value is close to zero we reject the null hypothesis the two variables are independent and conclude that there is evidence of association between fuel type and aspiration [Music] in this video we will examine model development by trying to predict the price of a car using our data set in this module you will learn about simple and multiple linear regression model evaluation using visualization polynomial regression and pipelines r squared and MSE for in-sample evaluation prediction and decision making and how you can determine a fair value for a used car a model or estimator can be thought of as a mathematical equation used to predict a value given one or more other values relating one or more independent variables or features to dependent variables for example you input a car model's highway miles per gallon as the independent variable or feature the output of the model or dependent variable is the price usually the more relevant data you have the more accurate your model is for example you input multiple independent variables or features to your model Therefore your model May predict a more accurate price for the car to understand why more data is important consider the following situation you have two almost identical cars pink cars sell for significantly less you want to use your model to determine the price of two cars one pink one red if your model's independent variables or features do not include color your model will predict the same price for cars that may sell for much less in addition to getting more data you can try different types of models in this course you will learn about simple linear regression multiple linear regression and polynomial regression [Music] in this video we'll be talking about simple linear regression and multiple linear regression linear regression will refer to one independent variable to make a prediction multiple linear regression will refer to multiple independent variables to make a prediction simple linear regression or SLR is a method to help us understand the relationship between two variables the predictor independent variable X and the target dependent variable y we would like to come up with a linear relationship between the variables shown here the parameter b0 is the intercept the parameter B1 is the slope when we fit or train the model we will come up with these parameters this step requires lots of math so we will not focus on this part let's clarify the prediction step it's hard to figure out how much a car costs but the highway miles per gallon is in the owner's manual if we assume there is a linear relationship between these variables we can use this relationship to formulate a model to determine the price of the car if the highway miles per gallon is 20 we can input this value into the model to obtain a prediction of twenty two thousand dollars in order to determine the line we take data points from our data set marked in red here we then use these trading points to fit our model the results of the trading points are the parameters we usually store the data points in two data frame or numpy arrays the value we would like to predict is called the Target that we store in the array why we store the independent variable in the data frame or array X each sample corresponds to 
a different Row in each data frame or array in many cases many factors influence how much people pay for a car for example make or how old the car is in this model this uncertainty is taken into account by assuming a small random value is added to the point on the line This is called noise figure on the left shows the distribution of the noise the vertical axis shows the value added and the horizontal axis illustrates the probability that the value will be added usually a small positive value is added or a small negative value sometimes large values are added but for the most part the values added are near zero we can summarize the process like this we have a set of training points we use these trading points to fit or train the model and get parameters we then use these parameters in the model we now have a model we use the Hat on the Y to denote the model as an estimate we can use this model to predict values that we haven't seen for example we have no car with 20 highway miles per gallon we can use our model to make a prediction for the price of this car but don't forget our model is not always correct we can see this by comparing the predicted value to the actual value we have a sample for 10 highway miles per gallon but the predicted value does not match the actual value if the linear assumption is correct this error is due to the noise but there can be other reasons to fit the model in Python first we import linear model from scikit learn then create a linear regression object using the Constructor we Define the predictor variable and Target variable then use the method fit to fit the model and find the parameters b0 and B1 the input are the features and the targets we can obtain a prediction using the method predict the output is an array the array has the same number of samples as the input X The Intercept b0 is an attribute of the object LM the slope B1 is also an attribute of the object LM the relationship between price and highway miles per gallon is given by this equation in bold price equals 38 423.31 minus 821.73 times highway miles per gallon like the equation we discussed before multiple linear regression is used to explain the relationship between one continuous Target y variable and two or more predictor X variables if we have for example four predictor variables then B zero intercept x equals zero B1 the coefficient or parameter of X1 B2 the coefficient of parameter X2 and so on if there are only two variables then we can visualize the values consider the following function the variables X1 and X2 can be visualized on a 2d plane Let's do an example on the next slide the table contains different values of the predictor variables X1 and X2 the position of each point is placed on the 2D plane color coded accordingly each value of the predictor variables X1 and X2 will be mapped to a new value y y hat the new values of y y hat are mapped in the vertical Direction with height proportional to the value the y-hat takes we can fit the multiple linear regression as follows we can extract the four predictor variables and store them in the variable Z then train the model as before using the method train with the features or dependent variables and the targets colon we can also obtain a prediction using the method predict in this case the input is an array or data frame with four columns the number of rows corresponds to the number of samples the output is an array with the same number of elements as number of samples The Intercept is an attribute of the object and the coefficients 
are also attributes it is helpful to visualize the equation replacing the dependent variable names with actual names this is identical to the form we discussed earlier foreign we'll look at model evaluation using visualization regression plots are a good estimate of the relationship between two variables the strength of the correlation and the direction of the relationship positive or negative the horizontal axis is the independent variable the vertical axis is the dependent variable each point represents a different Target point the fitted line represents the predicted value there are several ways to plot a regression plot a simple way is to use reg plot from the Seabourn Library first import Seaborn then use the reg plot function the parameter X is the name of the column that contains the independent variable or feature the parameter y contains the name of the column that contains the name of the dependent variable or Target the parameter data is the name of the data frame the result is given by the plot the residual plot represents the error between the actual value examining the predicted value and actual value we see a difference we obtained that value by subtracting the predicted value and the actual Target value we then plot that value on the vertical axis with an independent variable as the horizontal axis similarly for the second sample we repeat the process subtracting the target value from the predicted value then plotting the value accordingly looking at the plot gives us some insight into our data we expect to see the results to have zero mean distributed evenly around the x-axis with similar variance there is no curvature this type of residual plot suggests a linear plot is appropriate in this residual plot there is a curvature the values of the error change with X for example in the region all the residual errors are positive in this area the residuals are negative in the final location the error is large the residuals are not randomly separated this suggests the linear assumption is incorrect this plot suggests a non-linear function we will deal with this in the next section in this plot we see that variants of the residuals increases with X therefore our model is incorrect we can use Seaborn to create a residual plot first import Seaborn we use the resid plot function first parameter is a series of dependent variable or feature the second parameter is a series of dependent variable or Target we see in this case the residuals have a curvature a distribution plot counts the predicted value versus the actual value these plots are extremely useful for visualizing models with more than one independent variable or feature let's look at a simplified example we examine the vertical axis we then count and plot the number of predicted points that are approximately equal to one we then count and plot the number of predicted points that are approximately equal to two we repeat the process for predicted points there are approximately equal to 3. then we repeat the process for the Target values in this case all the target values are approximately equal to 2. 
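Before the code for the distribution plot is described in the next passage, here is a hedged sketch of the three visual checks just discussed: the regression plot, the residual plot, and a distribution plot comparing actual and fitted values. The highway-mpg and price numbers are made up, and kdeplot is used as the current seaborn equivalent of the older distplot(..., hist=False) call mentioned below.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Toy data standing in for highway-mpg and price.
df = pd.DataFrame({
    "highway-mpg": [20, 22, 25, 28, 30, 34, 38, 41],
    "price":       [32000, 28000, 22000, 18500, 16000, 12500, 9500, 8000],
})

# Regression plot: the scatter of the data plus the fitted line.
sns.regplot(x="highway-mpg", y="price", data=df)
plt.show()

# Residual plot: the errors against the independent variable; curvature here
# suggests the linear assumption is not appropriate.
sns.residplot(x="highway-mpg", y="price", data=df)
plt.show()

# Distribution plot: compare the distribution of the actual prices with the
# distribution of the fitted values. Older seaborn releases used
# distplot(..., hist=False); kdeplot is the current equivalent.
lm = LinearRegression()
lm.fit(df[["highway-mpg"]], df["price"])
yhat = lm.predict(df[["highway-mpg"]])

ax = sns.kdeplot(df["price"], color="r", label="Actual value")
sns.kdeplot(pd.Series(yhat), color="b", label="Fitted values", ax=ax)
plt.legend()
plt.show()
```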
the values of the targets and predicted values are continuous a histogram is for discrete values therefore pandas will convert them to a distribution the vertical axis is scaled to make the area under the distribution equal to one this is an example of using a distribution plot the dependent variable or feature is price the fitted values that result from the model are in blue the actual values are red we see the predicted values for prices in the range from forty thousand to fifty thousand are inaccurate the prices in the region from 10 000 to 20 000 are much closer to the Target value in this example we use multiple features or independent variables comparing it to the plot on the last slide we see predicted values are much closer to the Target values here is the code to create a distribution plot the actual values are used as a parameter we want a distribution instead of a histogram so we want the hist parameter set to false the color is red the label is also included the predicted values are included for the second plot the rest of the parameters are set accordingly [Music] in this video we will cover polynomial regression and Pipelines what do we do when a linear model is not the best fit for our data let's look into another type of regression model the polynomial regression we transform our data into a polynomial then use linear regression to fit the parameter then we will discuss pipelines pipelines are a way to simplify your code polynomial regression is a special case of the general linear regression this method is beneficial for describing curvilinear relationships what is a curvilinear relationship it's what you get by squaring or setting higher order terms of the predictor variables in the model transforming the data the model can be quadratic which means that the predictor variable in the model is squared we use a bracket to indicate it as an exponent this is a second order polynomial regression with a figure representing the function the model can be cubic which means that the predictor variable is cubed this is a third order polynomial regression we see by examining the figure that the function has more variation there also exists higher order polynomial regressions when a good fit hasn't been achieved by second or third order we can see in figures how much the graphs change when we change the order of the polynomial regression the degree of the regression makes a big difference and can result in a better fit if you pick the right value in all cases the relationship between the variable and the parameter is always linear let's look at an example from our data where we generate a polynomial regression model in Python we do this by using the polyfit function in this example we develop a third order polynomial regression model base we can print out the model symbolic form for the model is given by the following expression Negative 1.557 X 1 cubed plus 204.8 X1 squared plus 8965 X1 plus 1.37 times 10 to the power of 5. 
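A minimal sketch of fitting a third-order polynomial with polyfit, assuming toy one-dimensional data rather than the real highway-mpg and price columns:

```python
import numpy as np

# Toy one-dimensional data standing in for highway-mpg (x) and price (y).
x = np.array([20, 22, 25, 28, 30, 34, 38, 41], dtype=float)
y = np.array([32000, 28000, 22000, 18500, 16000, 12500, 9500, 8000], dtype=float)

# Fit a third-order polynomial; polyfit returns the coefficients, highest power first.
coefs = np.polyfit(x, y, 3)

# poly1d turns the coefficients into a callable polynomial and prints a
# readable symbolic form, similar to the expression quoted above.
p = np.poly1d(coefs)
print(p)

# Evaluate the fitted polynomial at new points.
print(p([21, 33]))
```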
We can also have multi-dimensional polynomial linear regression; the expression can get complicated. Here are just some of the terms for a two-dimensional second-order polynomial. NumPy's polyfit function cannot perform this type of regression; we use the preprocessing library in scikit-learn to create a polynomial feature object. The constructor takes the degree of the polynomial as a parameter, and then we transform the features into polynomial features with the fit_transform method. Let's do a more intuitive example: consider the features shown here; applying the method, we transform the data, and we now have a new set of features that are a transformed version of our original features. As the dimension of the data gets larger, we may want to normalize multiple features in scikit-learn; instead of doing this by hand, we can use the preprocessing module to simplify many tasks. For example, we can standardize each feature simultaneously: we import StandardScaler, train the object by fitting the scale object, then transform the data into a new data frame or array, x_scale. There are more normalization methods available in the preprocessing library, as well as other transformations. We can simplify our code by using the pipeline library. There are many steps to getting a prediction, for example normalization, polynomial transform, and linear regression; we simplify the process using a pipeline. Pipelines sequentially perform a series of transformations, and the last step carries out a prediction. First we import all the modules we need, then we import the library Pipeline. We create a list of tuples: the first element in each tuple contains the name of the estimator or model, and the second element contains the model constructor. We input the list into the Pipeline constructor, and we now have a pipeline object. We can train the pipeline by applying the fit method to the pipeline object, and we can produce a prediction as well; the pipeline normalizes the data, performs a polynomial transform, then outputs a prediction. A sketch of such a pipeline, together with the evaluation measures discussed next, appears below. Now that we've seen how we can evaluate a model by using visualization, we want to numerically evaluate our models. Let's look at some of the measures that we use for in-sample evaluation. These measures are a way to numerically determine how well the model fits our data. Two important measures that we often use to determine the fit of a model are the mean squared error (MSE) and R-squared. To measure the MSE, we find the difference between the actual value y and the predicted value y-hat, then square it. In this case the actual value is 150 and the predicted value is 50; subtracting these we get 100, and we then square the number. We then take the mean, or average, of all the squared errors by adding them together and dividing by the number of samples to find the MSE. In Python we can import mean_squared_error from scikit-learn's metrics module; the mean_squared_error function takes two inputs, the actual values of the target variable and the predicted values of the target variable. R-squared is also called the coefficient of determination. It's a measure of how close the data is to the fitted regression line, so how close is our actual data to our estimated model? Think of it as comparing a regression model to a simple model, i.e. the mean of the data points: if the variable x is a good predictor, our model should perform much better than just the mean. In this example, the average of the data points, y-bar, is 6.
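A hedged sketch of a scale-polynomial-regress pipeline plus the in-sample measures just introduced; the predictor columns, their values, and the degree-2 choice are assumptions for illustration, not the data set's.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Toy predictors (Z) and target (y) standing in for the used-car features.
Z = pd.DataFrame({
    "horsepower":  [48, 68, 102, 115, 160, 200, 88, 110],
    "curb-weight": [1488, 2017, 2765, 2952, 3053, 3449, 2410, 2823],
    "engine-size": [61, 90, 130, 141, 171, 234, 110, 140],
    "highway-mpg": [53, 38, 27, 25, 24, 16, 34, 28],
})
y = pd.Series([5151, 6295, 16430, 17450, 25500, 36880, 9980, 15250], name="price")

# Each tuple pairs a name with a constructor; the pipeline runs the
# transformations in order and the estimator last.
steps = [
    ("scale", StandardScaler()),
    ("polynomial", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", LinearRegression()),
]
pipe = Pipeline(steps)

pipe.fit(Z, y)           # normalize -> polynomial transform -> fit the regression
yhat = pipe.predict(Z)   # the same chain of transformations is applied before predicting

# The in-sample measures discussed in the text.
print(mean_squared_error(y, yhat))
print(r2_score(y, yhat))
```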
coefficient of determination r squared is 1 minus the ratio of the MSE of the regression line divided by the MSE of the average of the data points for the most part it takes values between 0 and 1. let's look at a case where the line provides a relatively good fit the blue line represents the regression line the blue squares represent the MSE of the regression line the red line represents the average value of the data points the red squares represent the MSE of the red line we see the area of the blue squares is much smaller than the area of the red squares in this case because the line is a good fit the mean squared error of the regression line is small therefore the numerator is small the mean squared error of the average of the data points is relatively large as such the denominator is large a small number divided by a larger number is an even smaller number taken to an extreme this value tends to zero if we plug in this value from the previous slide for r squared we get a value near 1. this means the line is a good fit for the data here is an example of a line that does not fit the data well if we just examine the area of the red squares compared to the blue squares we see the area is almost identical the ratio of the areas is close to 1. in this case the r squared is near zero this line performs about the same as just using the average of the data points therefore this line did not perform well we find the r squared value in Python by using the score method in the linear regression object from the value that we get from this example we can say that approximately 49.695 percent of the variation of price is explained by the simple linear model your r squared value is usually between 0 and 1. if your r squared is negative it could be due to overfitting that we will discuss in the next module [Music] in this video our final topic will be on prediction and decision making how can we determine if our model is correct the first thing you should do is make sure your model results make sense you should always use visualization numerical measures for evaluation and comparison between different models let's look at an example of prediction if you recall we train the model using the fit method now we want to find out what the price would be for a car that has a highway miles per gallon of 30. plugging this value into the predict method gives us a resulting price of 13 771.30 this seems to make sense for example the value is not negative extremely high or extremely low we can look at the coefficients by examining the coef underscore attribute if you recall the expression for the simple linear model that predicts price from highway miles per gallon this value corresponds to the multiple of the highway miles per gallon feature as such for an increase of one unit in highway miles per gallon the value of the car decreases by approximately 821 dollars this value also seems reasonable sometimes your model will produce values that don't make sense for example if we plot the model out for highway miles per gallon in the ranges of zero to one hundred we get negative values for the price this could be because the values in that range are not realistic the linear assumption is incorrect or we don't have data for cars in that range in this case it is unlikely that a car will have fuel mileage in that range so our model seems valid to generate a sequence of values in a specified range import numpy then use the numpy arange function to generate the sequence the sequence starts at 1 and increments by 1 till we reach 100.
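As a rough sketch of the prediction checks and the value sequence just described, assuming lm is the fitted simple linear regression object from earlier (the variable names are illustrative):

```python
import numpy as np

print(lm.predict([[30]]))   # predicted price for a car with highway-mpg of 30
print(lm.coef_)             # slope: change in price per unit of highway-mpg
print(lm.intercept_)        # intercept of the fitted line

# a sequence of highway-mpg values from 1 to 100 to probe the model with
new_input = np.arange(1, 101, 1).reshape(-1, 1)
yhat = lm.predict(new_input)
```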
the first parameter is the starting point of the sequence the second parameter is the endpoint plus one of the sequence the final parameter is the step size between elements in the sequence in this case it's one so we increment the sequence one step at a time from one to two and so on we can use the output to predict new values the output is a numpy array many of the values are negative using a regression plot to visualize your data is the first method you should try see the labs for examples of how to plot polynomial regression for this example the effect of the independent variable is evident in this case the data trends down as the independent variable increases the plot also shows some non-linear behavior examining the residual plot we see in this case the residuals have a curvature suggesting non-linear behavior a distribution plot is a good method for multiple linear regression for example we see the predicted values for prices in the range from thirty thousand to fifty thousand are inaccurate this suggests a non-linear model may be more suitable or we need more data in this range the mean square error is perhaps the most intuitive numerical measure for determining if a model is good or not let's see how different measures of mean square error impact the model the figure shows an example of a mean square error of 3495. this example has a mean square error of 3652. the final plot has a mean square error of twelve thousand eight hundred and seventy as the square error increases the targets get further from the predicted points as we discussed r squared is another popular method to evaluate your model it tells you how well your line fits the data r-squared values range from zero to one r squared tells us what percent of the variability in the dependent variable is accounted for by the regression on the independent variable an r squared of 1 means that all movements of the dependent variable are completely explained by movements in the independent variables in this plot we see the target points in red and the predicted line in blue with an r squared of 0.9986 the model appears to be a good fit that means that more than 99 percent of the variability of the predicted variable is explained by the independent variables this model has an r squared of 0.9226 there still is a strong linear relationship and the model is still a good fit with an r squared of 0.806 we can visually see that the values are scattered around the line but they are still close to the line and we can say that 80 percent of the variability of the predicted variable is explained by the independent variables and an r squared of 0.61 means that approximately 61 percent of the observed variation can be explained by the independent variables an acceptable value for r squared depends on what field you are studying and what your use case is Falk and Miller 1992 suggest that an acceptable r squared value should be at least 0.1 does a lower mean square error imply better fit not necessarily the MSE for an MLR model will be smaller than the MSE for an SLR model since the errors of the data will decrease when more variables are included in the model polynomial regression will also have a smaller MSE than regular regression in the next section we will look at more accurate ways to evaluate the model [Music] evaluation tells us how our model performs in the real world in the previous module we talked about in-sample evaluation in-sample evaluation tells us how well our model fits the data already given to train it it does not give us an estimate of
how well the train model can predict new data the solution is to split our data up use the in-sample data or training data to train the model the rest of the data called test data is used as out-of-sample data this data is then used to approximate how the model performs in the real world separating data into training and testing sets is an important part of model evaluation we use the test data to get an idea how our model will perform in the real world when we split a data set usually the larger portion of data is used for training and a smaller part is used for testing for example we can use seventy percent of the data for training we then use 30 for testing we use training set to build a model and discover predictive relationships we then use a testing set to evaluate model performance when we have completed testing our model we should use all the data to train the model a popular function in the sci kit learn package for splitting data sets is the train test split function this function randomly splits a data set into training and testing subsets from the example code snippet this method is imported from sklearn.cross validation the input parameters y underscore data is the target variable in the car appraisal example it would be the price and X underscore data the list of predictor variables in this case it would be all the other variables in the car data set that we are using to try to predict the price the output is an array X underscore train and Y underscore train the subsets for training X underscore test and Y underscore test the subsets for testing in this case the test size is a percentage of the data for the testing set here it is 30 percent the random state is a random seed for random data set splitting generalization error is a measure of how well our data does at predicting previously unseen data the error we obtain using our testing data is an approximation of this error this figure shows the distribution of the actual values in red compared to the predicted values from a linear regression in blue we see the distributions are somewhat similar if we generate the same plot using the test data we see the distributions are relatively different the difference is due to a generalization error and represents what we see in the real world using a lot of data for training gives us an accurate means of determining how well our model will perform in the real world but the Precision of the performance will be low let's clarify this with an example the center of this Bullseye represents the correct generalization error let's say we take a random sample of the data using ninety percent of the data for training and 10 percent for testing the first time we experiment we get a good estimate of the training data if we experiment again training the model with a different combination of samples we also get a good result but the results will be different relative to the first time we run the experiment repeating the experiment again with a different combination of training and testing samples the results are relatively close to the generalization error but distinct from each other repeating the process we get good approximation of the generalization error but the Precision is poor I.E all the results are extremely different from one another if we use fewer data points to train the model and more to test the bottle the accuracy of the generalization performance will be less but the model will have good precision the figure above demonstrates this all our error estimates are relatively close 
together but they are further away from the true generalization performance to overcome this problem we use cross-validation one of the most common out-of-sample evaluation metrics is cross-validation in this method the data set is split into k equal groups each group is referred to as a fold for example four Folds some of the folds can be used as a training set which we use to train the model and the remaining parts are used as a test set which we use to test the model for example we can use three folds for training then use one fold for testing this is repeated until each partition is used for both training and testing at the end we use the average results as the estimate of out-of-sample error the evaluation metric depends on the model for example the r squared the simplest way to apply cross validation is to call the cross underscore Val underscore score function which performs multiple out-of-sample evaluations this method is imported from sklearn's model selection package we then use the function cross underscore Val underscore score the first input parameter is the type of model we are using to do the cross validation in this example we initialize the linear regression model or object LR which we passed the cross underscore Val underscore score function the other parameters are X underscore data the predictor variable data and Y underscore data the target variable data we can manage the number of partitions with the CV parameter here CV equals 3 which means the data set is split into three equal partitions the function returns an array of scores one for each partition that was chosen as the testing set we can average the result together to estimate out of sample R Squared using the mean function in numpy let's see an animation let's see the result of the score array in the last slide first we split the data into three folds we use two folds for training the remaining fold for testing the model will produce an output we will use the output to calculate a score in the case of the r squared IE coefficient of determination we will store that value in an array we will repeat the process using two folds for training and one fold for testing save the score then use a different combination for training and the remaining fold for testing we store the final result the cross valve score function returns a score value to tell us the cross validation result what if we want a little more information what if we want to know the actual predicted values supplied by our model before the r squared values are calculated to do this we use the cross underscore Val underscore predict function the input parameters are exactly the same as the cross file underscore score function but the output is a prediction let's illustrate the process first we split the data into three folds we use two folds for training the remaining fold for testing the model will produce an output and we will store it in an array we will repeat the process using two folds for training one for testing the model produces an output again finally we use the last two folds for training then we use the testing data this final testing fold produces an output these predictions are stored in an array [Music] if you recall in the last module we discussed polynomial regression in this section we will discuss how to pick the best polynomial order and problems that arise When selecting the wrong order polynomial consider the following function we assume the training points come from a polynomial function plus some noise the goal of model selection 
is to determine the order of the polynomial to provide the best estimate of the function y x if we try and fit the function with a linear function the line is not complex enough to fit the data as a result there are many errors this is called underfitting where the model is too simple to fit the data if we increase the order of the polynomial the model fits better but the model is still not flexible enough and Exhibits underfitting this is an example of the eighth order polynomial used to fit the data we see the model does well at fitting the data and estimating the function even at the inflection points increasing it to a 16th order polynomial the model does extremely well at tracking the training point but performs poorly at estimating the function this is especially apparent where there is little training data the estimated function oscillates not tracking the function this is called overfitting where the model is too flexible and fits the noise rather than the function let's look at a plot of the mean square error for the training and testing set of different order polynomials the horizontal axis represents the order of the polynomial the vertical axis is the mean square error the training error decreases with the order of the polynomial the test error is a better means of estimating the error of a polynomial the error decreases till the best order of the polynomial is determined then the error begins to increase we select the order that minimizes the test error in this case it was 8. anything on the left would be considered underfitting anything on the right is overfitting if we select the best order of the polynomial we will still have some errors if you recall the original expression for the training points we see a noise term this term is one reason for the error this is because the noise is random and we can't predict it this is sometimes referred to as an irreducible error there are other sources of errors as well for example our polynomial assumption may be wrong our sample points may have come from a different function for example in this plot the data is generated from a sine wave the polynomial function does not do a good job of fitting the sine wave for real data the model may be too difficult to fit or we may not have the correct type of data to estimate the function let's try different order polynomials on the real data using horsepower the red points represent the training data the green points represent the test data if we just use the mean of the data our model does not perform well a linear function does fit the data better second order model looks similar to the linear function a third order function also appears to increase like the previous two orders here we see a fourth order polynomial at around 200 horsepower the predicted price suddenly decreases this seems erroneous let's use r squared to see if our assumption is correct the following is a plot of the r squared value the horizontal axis represents the order of polynomial models the closer the r squared is to 1 the more accurate the model is here we see the r squared is optimal when the order of the polynomial is three the r squared drastically decreases when the order is increased to four validating our initial assumption we can calculate different r squared values as follows first we create an empty list to store the values we create a list containing different polynomial orders we then iterate through the list using a loop we create a polynomial feature object with the order of the polynomial as a parameter 
we transform the training and test data into a polynomial using the fit transform method we fit the regression model using the transformed data we then calculate the r squared using the test data and store it in the array [Music] in this video we'll discuss Ridge regression Ridge regression prevents overfitting in this video we will focus on polynomial regression for visualization but overfitting is also a big problem when you have multiple independent variables or features consider the following fourth order polynomial in orange the blue points are generated from this function we can use a 10th order polynomial to fit the data the estimated function in blue does a good job at approximating the true function in many cases real data has outliers for example this point shown here does not appear to come from the function in orange if we use a 10th order polynomial function to fit the data the estimated function in blue is incorrect and is not a good estimate of the actual function in orange if we examine the expression for the estimated function we see the estimated polynomial coefficients have a very large magnitude this is especially evident for the higher order polynomials Ridge regression controls the magnitude of these polynomial coefficients by introducing the parameter alpha alpha is a parameter we select before fitting or training the model each row in the following table represents an increasing value of alpha let's see how different values of alpha change the model this table represents the polynomial coefficients for different values of alpha the columns correspond to the different polynomial coefficients and the rows correspond to the different values of alpha as alpha increases the parameters get smaller this is most evident for the higher order polynomial features but alpha must be selected carefully if alpha is too large the coefficients will approach 0 and underfit the data if alpha is zero the overfitting is evident for alpha equal to 0.001 the overfitting begins to subside for alpha equal to 0.01 the estimated function tracks the actual function when alpha equals 1 we see the first signs of underfitting the estimated function does not have enough flexibility at alpha equal to 10 we see extreme underfitting it does not even track the two points in order to select alpha we use cross validation to make a prediction using Ridge regression import Ridge from the sklearn linear model module create a ridge object using the constructor the parameter alpha is one of the arguments of the constructor we train the model using the fit method to make a prediction we use the predict method in order to determine the parameter alpha we use some data for training we use a second set called validation data this is similar to test data but it is used to select parameters like alpha we start with a small value of alpha we train the model make a prediction using the validation data then calculate the r squared and store the values we repeat the process for a larger value of alpha we train the model again make a prediction using the validation data then calculate the r squared and store the values of r squared we repeat the process for a different alpha value training the model and making a prediction we select the value of alpha that maximizes the r squared note that we can use other metrics to select the value of alpha like mean squared error the overfitting problem is even worse if we have lots of features the following plot shows the different values of r squared on the vertical axis the horizontal axis
represents different values for Alpha we use several features from our used car data set and a second order polynomial function the training data is in red and validation data is in blue we see as the value for Alpha increases the value of r squared increases and converges at approximately 0.75 in this case we select the maximum value of alpha because running the experiment for higher values of alpha have little impact conversely as Alpha increases the r squared on the test data decreases this is because the term Alpha prevents overfitting this may improve the results in the Unseen data but the model has worse performance on the test data see the lab on how to generate this plot [Music] grid search allows us to scan through multiple free parameters with few lines of code parameters like the alpha term discussed in the previous video are not part of the fitting or training process these values are called hyper parameters scikit learn has a means of automatically iterating over these hyper parameters using cross-validation this method is called grid search grid search takes the model or objects you would like to train and different values of the hyper parameters it then calculates the mean square error or r squared for various hyper parameter values allowing you to choose the best values let the small circles represent different hyper parameters we start off with one value for hyper parameters and train the model we use different hyper parameters to train the model we continue the process until we have exhausted the different free parameter values each model produces an error we select the hyper parameter that minimizes the error to select the hyper parameter we split our data set into three parts the training set validation set and test set we train the model for different hyper parameters we use the r squared or mean square error for each model we select the hyper parameter that minimizes the mean squared error or maximizes the r squared on the validation set we finally test our model performance using the test data this is the scikit learn web page where the object Constructor parameters are given it should be noted that the attributes of an object are also called parameters we will not make the distinction even though some of the options are not hyper parameters per se in this module we will focus on the hyper parameter Alpha and the normalization parameter the value of your grid search is a python list that contains a python dictionary the key is the name of the free parameter the value of the dictionary is the different values of the free parameter this can be viewed as a table with various free parameter values we also have the object or model the grid search takes on the scoring method in this case r squared the number of folds the model or object and the free parameter values some of the outputs include the different scores for different free parameter values in this case the r squared along with the free parameter values that have the best score first we import the libraries we need including grid search CV the dictionary of parameter values we create a ridge regression object or model we then create a grid search CV object the inputs are the ridge regression object the parameter values and the number of Folds we will use r squared this is the default scoring method we fit the object we can find the best values for the free parameters using the attribute best estimator we can also get information like the mean score on the validation data using the attribute CV result what are the 
advantages of grid search is how quickly we can test multiple parameters for example Rich regression has the option to normalize the data to see how to standardize see module 4 the term Alpha is the first element in the dictionary the second element is the normalize option the key is the name of the parameter the value is the different options in this case because we can either normalize the data or not the values are true or false respectively the dictionary is a table or grid that contains two different values as before we need the ridge regression object or model the procedure is similar except that we have a table or grid of different parameter values the output is the score for all the different combinations of parameter values the code is also similar the dictionary contains the different free parameter values we can find the best value for the free parameters the resulting scores the different free parameters are stored in this dictionary grid1.cv underscore results underscore we can print out the score for the different free parameter values the parameter values are stored as shown here see the course labs for more examples [Music] hello everyone and welcome to data visualization with python I'm Alex eccleson a data scientist at IBM and I'm your instructor for this course throughout this course we're going to learn how to create meaningful effective and aesthetically pleasing data visuals and plots in Python using matplotlib and a couple of other libraries namely Seabourn and folium this course will consist of three modules in module 1 we will briefly discuss data visualization and some of the best practices to keep in mind when creating data visuals we will then learn about matplotlid its history architecture and the three layers that form its architecture we will also learn about the data set that we will use throughout the course in these lectures as well as the Hands-On sessions we will essentially be working with the data set that was curated by the United Nations on immigration from different countries to Canada from 1980 to 2013. 
then we will start learning how to use matplotlib to create plots and visuals and we will start off with line plots now we will generate the majority of our plots and visualizations in this course using data stored in pandas data frames for those of you who don't know what pandas is pandas is a python library for data manipulation and analysis so before we start building visualizations and plots we will take a brief crash course on pandas and learn how to use it to read data from CSV files like the one shown here into what's called a pandas data frame like the one shown here now if you're interested in learning more about the pandas library we actually cover it in much more detail in our next course in this specialization which is data analysis with python so make sure to complete the next course in this specialization in module 2 we will continue on with a few more basic data visualizations such as area plots histograms and bar charts and learn how to use matplotlib to create them and even create different versions of these plots we will also cover a set of specialized visualizations such as pie charts box plots scatter plots and bubble plots and we will learn how to create them still using matplotlib in module 3 we will learn about more advanced visuals such as waffle charts that provide a fine grained view of the proportions of different categories in a data set we will also learn about word clouds that depict word frequency or importance in a body of text also in this module we will explore another library Seaborn which is built on top of matplotlib to simplify the process of creating plots and visuals and we will get a taste of its effectiveness through the creation of regression plots finally in this module we will explore another library folium which was built primarily to visualize geospatial data so you will learn how to create maps of different regions of the world superimpose markers of different shapes on top of maps and learn how to create choropleth maps now before I conclude this video let me stress one thing data visualization is best learned through hands-on exercises and sessions therefore don't worry if you find some of the videos to be short the labs and the hands-on sessions are very thorough and cover a lot of the concepts that are discussed in the videos in much more detail so it's very important that you complete the labs and the hands-on sessions although they are ungraded components of the course I hope that you remember this and you keep it in mind as you progress in this course after completing this course you will be able to use different visualization libraries in Python namely matplotlib Seaborn and folium to create expressive visual representations of your data for different purposes so let's get right into it hello everyone and welcome to the first module of the data visualization with python course in this video we're going to introduce data visualization and go over an example of transforming a given visual into one which is more effective attractive and impactive so let's get started now one might ask why would I need to learn how to visualize data well data visualization is a way to show complex data in a form that is graphical and easy to understand this can be especially useful when one is trying to explore the data and getting acquainted with it also since a picture is worth a thousand words then plots and graphs can be very effective in conveying a clear description of the data especially when disclosing findings to an audience or sharing the
data with other peer data scientists also they can be very valuable when it comes to supporting any recommendations you make to clients managers or other decision makers in your field Darkhorse analytics is a company that spun out of a research lab at the University of Alberta in 2008 and has done fascinating work on data visualization Dark Horse analytics specialize in quantitative Consulting in several areas including data visualization and geospatial Analysis their approach when creating a visual revolves around three key points less is more effective it is more attractive and it is more impactive in other words any feature or design you incorporate in your plot to make it more attractive or pleasing should support the message that the plot is meant to get across and not distract from it let's take a look at an example so here is a pie chart of what looks like people's preferences when it comes to different types of pig meat the chart's message is almost half of the people surveyed preferred bacon over the other types of pig meat but I'm sure that almost all of you agree that there is a lot going on in this pie chart and we're not even sure if features such as the blue background or the 3D orientation are meant to convey anything in fact these additional unnecessary features distract from the main message and can be confusing to the audience so let's apply Darkhorse analytics approach to transform this into a visual that's more effective attractive and impactive as I mentioned earlier the message here is that people are most likely to choose bacon over other types of pig meat so let's get rid of everything that can be distracting from this core message the first thing is let's get rid of the blue background and the gray background let's also get rid of borders as they do not convey any extra information Also let's drop the Redundant Legend since the pie chart is already color coded 3D isn't adding any extra information so let's say bye to it text building is also unnecessary and let's get rid of the different colors and the wedges whoa what just happened well let's thicken the lines to make them more meaningful now this looks a little familiar yes this is a bar graph after all one with horizontal bars and finally let's emphasize bacon so that it stands Out Among the other types of pig meat now let's juxtapose the pie chart and the bar graph and compare which is better and easy to understand I hope that we unanimously agree that the bar graph is the better of the two it is simple cleaner less distracting and much easier to read in fact pie charts have recently come under Fire from data visualization experts who argue that they are relevant only in the rarest of circumstances bar graphs and charts on the other hand are argued to be far superior ways to quickly get a message across but don't worry about this for now we will come back to this point when we learn how to create pie charts and bar graphs with matplotlib for more similar and interesting examples check out Darkhorse analytics website they have a couple more examples on how to clean bar graphs and maps of geospatial data all these examples reinforce the concept of less is more effective attractive and impactive in this video we will start learning about matplotlib this video will focus on the history of matplotlib and its architecture matplotlab is one of the most widely used if not the most popular data visualization library in Python it was created by John Hunter who was a neurobiologist and was part of a research team that was 
working on analyzing electrocorticography signals ecog for short the team was using proprietary software for the analysis however they had only one license and were taking turns in using it so in order to overcome this limitation John set out to replace the proprietary software with a matlab-based version that could be utilized by him and his teammates and that could be extended by multiple investigators as a result matplotlib was originally developed as an ecog visualization tool and just like Matlab matplotlib was equipped with a scripting interface for quick and easy generation of graphics represented by pyplot we will learn more about this in a moment now matplotlib's architecture is composed of three main layers the backend layer the artist layer where much of the heavy lifting happens and is usually the appropriate programming paradigm when writing a web application server or a UI application or perhaps a script to be shared with other developers and the scripting layer which is the appropriate layer for everyday purposes and is considered a lighter scripting interface to simplify common tasks and for quick and easy generation of graphics and plots now let's go into each layer in a little more detail so the backend layer has three built-in abstract interface classes figure canvas which defines and encompasses the area on which the figure is drawn renderer an instance of the renderer class knows how to draw on the figure canvas and finally event which handles user inputs such as keyboard strokes and mouse clicks moving on to the artist layer it is composed of one main object which is the artist the artist is the object that knows how to take the renderer and use it to put ink on the canvas everything you see in a matplotlib figure is an artist instance the title the lines the tick labels the images and so on all correspond to individual artists there are two types of artist objects the first type is the primitive type such as a line a rectangle a circle or text and the second type is the composite type such as the figure or the axes the top level matplotlib object that contains and manages all of the elements in a given graphic is the figure artist and the most important composite artist is the axes because it is where most of the matplotlib API plotting methods are defined including methods to create and manipulate the ticks the axis lines the grid or the plot background now it is important to note that each composite artist may contain other composite artists as well as primitive artists so a figure artist for example would contain an axes artist as well as a rectangle or text artists now let's put the artist layer to use and see how we can use it to generate a graphic so let's try to generate a histogram of 10 000 random numbers using the artist layer first we import the figure canvas from the backend backend underscore agg and attach the figure artist to it note that agg stands for anti-grain geometry which is a high performance library that produces attractive images then we import the numpy library to generate the random numbers next we create an axes artist the axes artist is added automatically to the figure axes container fig.axes and note here that 111 is from the Matlab convention so it creates a grid with one row and one column and uses the first cell in that grid for the location of the new axes then we call the axes method hist to generate the histogram hist creates a sequence of rectangle artists for each histogram bar and adds them to the axes container here 100
means create 100 bins finally we decorate the figure with a title and we save it now this is the generated histogram and so this is how we use the artist layer to generate a graphic as for the scripting layer it was developed for scientists who are not professional programmers and I'm sure you agree with me based on the histogram that we just created that the artist layer is syntactically heavy as it is meant for developers and not for individuals whose goal is to perform quick exploratory analysis of some data matplotlips scripting layer is essentially the matplotlib dot Pi plot interface which automates the process of defining a canvas and defining a figure artist instant instance and connecting them so let's see how the same code that we used earlier using the artist layer to generate a histogram of 10 000 random numbers would now look like so first we import the pi plot interface and you can see how all the methods associated with creating the histogram and other artist objects and manipulating them whether it is the hist method or showing the figure are part of the pi plot interface if you're interested in learning more about the history of matplotlib and its architecture this link will take you to a chapter written by the creators of matplotlab themselves it is definitely a recommended read in this video we will learn how to use matplotlib to create plots and we will do so using the Jupiter notebook as our environment now matplotlib is a well-established data visualization library that is well supported in different environments such as in Python scripts in the IPython shell web application servers and graphical user interface toolkits as well as the Jupiter notebook now for those of you who don't know what the Jupiter notebook is it's an open source web application that allows you to create and share documents that contain Live code visualizations and some explanatory text as well Jupiter has some specialized support for matplotlib and so if you start a jupyter notebook all you have to do is import matplotlib and you're ready to go in this course we will be working mostly with the scripting interface in other words we will learn how to create almost all of the visualization tools using the scripting interface as we proceed in the course you will appreciate the power of this interface when you find out that you can literally create almost all of the conventional visualization tools such as histograms bar charts box plots and many others using one function only the plot function let's start with an example let's first import the scripting interface as PLT and let's plot a circular Mark at the position five five so x equals 5 and Y equals 5. 
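Here is a hedged sketch of both versions of the histogram example described above, the artist-layer code and its pyplot equivalent, followed by the circular marker call; the title text and file name are illustrative, not the exact lab code.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
from matplotlib.figure import Figure

# artist layer: attach a Figure artist to the Agg canvas and draw on it
fig = Figure()
canvas = FigureCanvas(fig)
x = np.random.randn(10000)        # 10,000 random numbers
ax = fig.add_subplot(111)         # grid of 1 row, 1 column, first (only) axes
ax.hist(x, 100)                   # histogram with 100 bins
ax.set_title(r'Normal distribution with $\mu=0, \sigma=1$')
fig.savefig('matplotlib_histogram.png')

# scripting layer: the same histogram with far less code
plt.hist(x, 100)
plt.title(r'Normal distribution with $\mu=0, \sigma=1$')
plt.show()

# and the circular marker at position (5, 5)
plt.plot(5, 5, 'o')
plt.show()
```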
notice how the plot was generated within the browser and not in a separate window for example if the plot gets generated in a new window then we can enforce generating plots within the browser using what's called a magic function a magic function starts with percent sign matplotlib and to enforce plots to be rendered within the browser you pass in inline as the backend matplotlib has a number of different backends available one limitation of this backend is that you cannot modify a figure once it's rendered so after rendering the above figure there is no way for us to add for example a figure title or label its axes you will need to generate a new plot and add a title and the axis labels before calling the show function a backend that overcomes this limitation is the notebook backend with the notebook backend in place if a PLT function is called it checks if an active figure exists and any functions you call will be applied to this active figure if a figure does not exist it renders a new figure so when we call the plt.plot function to plot a circular mark at position five five the backend checks if an active figure exists since there isn't an active figure it generates a figure and adds a circular mark to position five five and what is beautiful about this backend is that now we can easily add a title for example or labels to the axes after the plot was rendered without the need to regenerate the figure finally another thing that is great about matplotlib is that pandas also has a built-in implementation of it therefore plotting in pandas is as simple as calling the plot function on a given pandas series or data frame so say we have a data frame of the number of immigrants from India and China to Canada from 1980 to 1996. and say we're interested in generating a line plot of these data all we have to do is call the plot function on this data frame which we called India underscore China underscore DF and set the parameter kind to line and there you have it a line plot of the data in the data frame plotting a histogram of the data is not any different so say we would like to plot a histogram of the India column in our data frame all we have to do is call the plot function on that column and set the parameter kind to hist for histogram and there you have it a histogram of the number of Indian immigrants to Canada from 1980 to 1996. this concludes our video on basic plotting with matplotlib see you in the next video in this video we will learn more about the data set that we will be using throughout the course the population division of the United Nations compiled immigration data pertaining to 45 countries the data consists of the total number of immigrants from all over the world to each of the 45 countries as well as other metadata pertaining to the immigrants countries of origin in this course we will focus on immigration to Canada and we will work primarily with the data set involving immigration to the great white north here is a snapshot of the UN data on immigration to Canada in the form of an Excel file as you can see the first 20 rows contain textual data about the UN department and other irrelevant information row 21 contains the labels of the columns following that each row represents a country and contains metadata about the country such as what continent it resides in what region it belongs to and whether the region is developing or developed each row also contains the total number of immigrants from that country for the years 1980 all the way to 2013.
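Before moving on to importing the full UN data set, here is a short sketch of the pandas plotting calls described above; it assumes a DataFrame named india_china_df indexed by year with 'India' and 'China' columns.

```python
import matplotlib.pyplot as plt

# line plot of both columns against the year index
india_china_df.plot(kind='line')
plt.title('Immigrants from India and China to Canada')
plt.ylabel('Number of immigrants')
plt.show()

# histogram of the India column only
india_china_df['India'].plot(kind='hist')
plt.title('Histogram of Indian immigrants to Canada')
plt.show()
```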
throughout this course we will be using pandas for any analysis of the data before creating any visualizations so in order to start creating different types of plots of the data whether for exploratory analysis or for presentation we will need to import the data into appendix data frame to do that we will need to import the pandas Library as well as the xlrd library which is required to extract data from Excel spreadsheets files then we call the pandas function read underscore Excel to read the data into appendix data frame and let's name this data frame DF underscore can notice how we're skipping the first 20 rows to read only the data corresponding to each country if you want to confirm that you have imported your data correctly in pandas you can always use the head function to display the first five rows of the data frame so if we call this function on our data frame DF underscore can here is the output as you can see the output of the head function looks correct with the columns having the correct labels and each were representing a country and containing the total number of immigrants from that country this concludes our video on the integration to Canada data set I will see you in the next video in this video things will start getting more exciting we will generate our first visualization tool the line plot so what is a line plot as its name suggests it is applied in the form of a series of data points connected by straight line segments it is one of the most basic type of chart and is common in many fields not just data science the more important question is when to use line plots the best use case for a line plot is when you have a continuous data set and you're interested in visualizing the data over a period of time as an example say we're interested in the trend of immigrants from Haiti to Canada we can generate a line plot and the resulting figure will depict the trend of Haitian immigrants to Canada from 1980 to 2013. based on this line plot we can then research for justifications of obvious anomalies or changes so in this example we see that there is a spike of immigration from Haiti to Canada in 2010. a quick Google search for major events in Haiti in 2010 would return the tragic earthquake that took place in 2010 and therefore this influx of immigration to Canada was mainly due to that tragic earthquake okay now how can we generate this line plot before we go over the code to do that let's do a quick recap of our data set each rule represents a country and contains metadata about the country such as where it is located geographically and whether it is developing or developed each row also contains numerical figures of annual immigration from that country to Canada from 1980 to 2013. now let's process the data frame so that the country name becomes the index of each row this should make querying specific countries easier also let's add an extra column which represents the cumulative sum of annual immigration from each country from 1980 to 2013. 
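A hedged sketch of the import and preparation steps described above might look as follows; the file name, sheet name and column labels are assumptions rather than the exact lab code, and the year columns may be read in as integers or strings depending on the file.

```python
import pandas as pd

# read the UN Excel file, skipping the first 20 rows of header text
df_can = pd.read_excel('Canada.xlsx',
                       sheet_name='Canada by Citizenship',
                       skiprows=range(20))

print(df_can.head())   # confirm the first five rows look correct

# use the country name as the row index and add a cumulative Total column
df_can.set_index('Country', inplace=True)
years = list(range(1980, 2014))             # 1980 through 2013
df_can['Total'] = df_can[years].sum(axis=1)
```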
so for Afghanistan it is 58 639 total and for Albania it is 15 699 and so on and let's name our data frame DF underscore Canada so now that we know how our data is stored in the data frame DF underscore Canada let's generate the line plot corresponding to immigration from Haiti first we import matplotlib as MPL and its scripting interface as PLT then we call the plot function on the row corresponding to Haiti and we set kind equals line to generate a line plot note that we use the years which is a list containing string format of years from 1980 to 2013 in order to exclude the column of total integration that we added then to complete the figure we give it a title and we label its axes finally we call the show function to display the figure note that this is the code to generate the line plot using the magic function percent sign matplotlib with the inline backend and there you have it a line plot that depicts immigration from Haiti to Canada from 1980 to 2013. in the lab session we explore line plots in more details so make sure to complete this modules lab session this concludes our video online plots I'll see you in the next video in this video we will learn about another visualization tool the area plot which is actually an extension of the line plot that we learned about in an earlier video so what is an area plot an area plot also known as an area chart or graph is a type of plot that depicts accumulated totals using numbers or percentages over time it is based on the line plot and is commonly used when trying to compare two or more quantities so how can we generate an area plot with matplotlib before we go over the code to do that let's do a quick recap of our data set recall that each row represents a country and contains metadata about the country such as where it is located geographically and whether it is developing or developed each row also contains numerical figures of annual immigration from that country to Canada from 1980 to 2013. now let's process the data frame so that the country name becomes the index of each row this should make retrieving rows pertaining to specific countries a lot easier also let's add an extra column which represents the cumulative sum of annual immigration from each country from 1980 to 2013. so for Afghanistan it is 58 639 total and for Albania it is 15 699 and so on and let's name our data frame DF underscore Canada so now that we know how our data is stored in the data frame DF underscore Canada let's try to generate area plots for the countries with the highest number of integration to Canada we can try to find these countries by sorting our data frame in descending order of cumulative total immigration from 1980 to 2013. 
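Here is a minimal sketch of the Haiti line plot just described, before moving on to the sorting step; df_canada, the years list and the axis labels follow the description above but are not the exact lab code.

```python
import matplotlib.pyplot as plt

years = list(map(str, range(1980, 2014)))   # '1980' ... '2013', excludes Total
haiti = df_canada.loc['Haiti', years]       # the row for Haiti, years only

haiti.plot(kind='line')
plt.title('Immigration from Haiti to Canada')
plt.xlabel('Years')
plt.ylabel('Number of immigrants')
plt.show()
```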
we use the sort underscore values function to sort our data frame in descending order and here is the result so it turns out that India followed by China then the UK Philippines and Pakistan are the top five countries with the highest number of immigrants to Canada so can we now go ahead and generate the area plots using the first five rows of this data frame not quite yet first we need to create a new data frame of only these five countries and we need to exclude the total column more importantly to generate the area plots for these countries we need the years to be plotted on the horizontal axis and the annual immigration to be plotted on the vertical axis note that matplotlib plots the indices of a data frame on the horizontal axis and with the data frame as shown the countries will be plotted on the horizontal axis so to fix this we need to take the transpose of the data frame let's see how we can do this after we sort our data frame in descending order of cumulative annual immigration we create a new data frame of the top five countries and let's call it DF underscore top 5. we then select only the columns representing the years 1980 to 2013. in order to exclude the total column before applying the transpose method the resulting data frame is exactly what we want with five columns where each column represents one of the top five countries and the years being the indices now we can go ahead and call the plot function on data frame DF underscore top 5 to generate the area plots to do that first we import matplotlib as MPL and its scripting interface as PLT then we call the plot function on the data frame DF underscore top 5 and we set kind equals area to generate an area plot then to complete the figure we give it a title and we label its axes finally we call the show function to display the figure note that here we're generating the area plot using the inline backend and there you have it an area plot that depicts the immigration trend of the five countries with the highest immigration to Canada from 1980 to 2013. in the lab session we explore area plots in more detail so make sure to complete this module's lab session and with this we conclude our video on area plots I'll see you in the next video in this video we will learn about another visualization tool the histogram and we will learn how to create it using matplotlib let's start by defining what a histogram is a histogram is a way of representing the frequency distribution of a numeric data set the way it works is it partitions the spread of the numeric data into bins assigns each data point in the data set to a bin and then counts the number of data points that have been assigned to each bin so the vertical axis is actually the frequency or the number of data points in each bin for example let's say the range of the numeric values in the data set is 34,129. now the first step in creating the histogram is partitioning the horizontal axis into say 10 bins of equal width and then we construct the histogram by counting how many data points have a value that is between the limits of the first bin the second bin the third bin and so on say the number of data points having a value between 0 and 3413 is 175.
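Before continuing with how the histogram is drawn, here is a sketch of the top-five area plot walked through earlier in this passage, assuming df_canada with a 'Total' column and the years list from before.

```python
import matplotlib.pyplot as plt

# sort by the cumulative Total column, keep the top five countries,
# then keep only the year columns and transpose so years become the index
df_canada.sort_values(['Total'], ascending=False, inplace=True)
df_top5 = df_canada.head(5)[years].transpose()

df_top5.plot(kind='area')
plt.title('Immigration trend of top 5 countries')
plt.xlabel('Years')
plt.ylabel('Number of immigrants')
plt.show()
```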
then we draw a bar of that height for this bin we repeat the same thing for all the other bins and if no data points fall into a bin then that bin would have a bar of height zero so how do we create a histogram using matplotlib before we go over the code to do that let's do a quick recap of our data set recall that each row represents a country and contains metadata about the country such as where it is located geographically and whether it is developing or developed each row also contains numerical figures of annual immigration from that country to Canada from 1980 to 2013. now let's process the data frame so that the country name becomes the index of each row this should make retrieving rows pertaining to specific countries a lot easier Also let's add an extra column which represents the cumulative sum of annual integration from each country from 1980 to 2013. so for Afghanistan for example it is 58 639 total and for Albania it is 15 699 and so on and let's name our data frame DF underscore Canada so now that we know how our data is stored in the data frame DF underscore Canada say we're interested in visualizing the distribution of immigrants to Canada in the year 2013. the simplest way to do that is to generate a histogram of the data in column in 2013 and let's see how we can do that with matplotlib first we import matplotlib as MPL and its scripting interface as PLT then we call the plot function on the data in column 2013 and we set kind equals hist to generate a histogram then to complete the figure we give it a title and we label its axes finally we call the show function to display the figure and there you have it a histogram that depicts the distribution of immigration to Canada in 2013. but notice how the bins are not aligned with the tick marks on the horizontal axis this can make the histogram hard to read so let's try to fix this in order to make our histogram more effective one way to solve this issue is to borrow the histogram function from the numpy library so as usual we start by importing matplotlib and its scripting interface but this time we also import the numpy library then we call the numpy histogram function on the data in column 2013. what this function is going to do is it is going to partition the spread of the data in column 2013 into 10 bins of equal width compute the number of data points that fall in each bin and then return this frequency of each bin which we're calling count here and the bin edges which we're calling bin underscore edges we then pass these bin edges as an additional parameter in our plot function to generate the histogram and there you go a nice looking histogram whose bin edges are aligned with the tick marks on the horizontal axis in the lab session we explore histograms in more details so make sure to complete this modules lab session and with this we conclude our video on histograms I'll see you in the next video in this video we will learn about an additional visualization tool namely the bar chart and learn how to create it using matplotlib a bar chart is a very popular visualization tool like a histogram a bar chart also known as a bar graph is a type of plot where the length of each bar is proportional to the value of the item that it represents it is commonly used to compare the values of a variable at a given point in time for example say you are interested in visualizing in a discrete fashion how immigration from Iceland to Canada looked like from 1980 to 2013. 
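Before turning to the bar chart, here is a sketch of the histogram code just described; whether the 2013 column label is the integer 2013 or the string '2013' depends on how the file was read.

```python
import numpy as np
import matplotlib.pyplot as plt

# frequencies and the 10 bin edges for the 2013 column
count, bin_edges = np.histogram(df_canada[2013])

# pass the bin edges as tick positions so ticks line up with the bins
df_canada[2013].plot(kind='hist', xticks=bin_edges)
plt.title('Histogram of immigration to Canada in 2013')
plt.xlabel('Number of immigrants')
plt.ylabel('Number of countries')
plt.show()
```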
one way to do that is by building a bar chart where the height of the bar represents the total immigration from Iceland to Canada in a particular year so how do we do that with matplotlib before we go over the code to do that let's do a quick recap of our data set recall that each row represents a country and contains metadata about the country such as where it is located geographically and whether it is developing or developed each row also contains numerical figures of annual immigration from that country to Canada from 1980 to 2013 now let's process the data frame so that the country name becomes the index of each row this should make retrieving rows pertaining to specific countries a lot easier also let's add an extra column which represents the cumulative sum of annual immigration from each country from 1980 to 2013. so for Afghanistan for example it is 58 639 total and for Albania it is 15 699 and so on and let's name our data frame DF underscore Canada so now that we know how our data is stored in the data frame DF underscore Canada let's see how we can use matplotlib to generate a bar chart to visualize what immigration from Iceland to Canada looked like from 1980 to 2013. as usual we start by importing matplotlib and its scripting interface then we use the years variable to create a new data frame let's name it DF underscore Iceland which includes the data pertaining to annual immigration from Iceland to Canada and excluding the total column then we call the plot function on DF underscore Iceland and we set kind equals bar to generate a bar chart then to complete the figure we give it a title we label its axes finally we call the show function to display the figure and there you have it a bar chart that depicts the immigration from Iceland to Canada from 1980 to 2013. by examining the bar chart we notice that immigration to Canada from Iceland has seen an increasing trend since 2010. I'm sure that the curious among you are already wondering who the culprit behind this increasing trend is in the lab session we reveal the reason and we also learn how to create a bar chart with horizontal bars so make sure to complete this module's lab session and with this we conclude our video on bar charts I'll see you in the next video in this video we will learn about another visualization tool the pie chart and we will learn how to create it using matplotlib so what is a pie chart a pie chart is a circular statistical graphic divided into slices to illustrate numerical proportion for example here is a pie chart of the Canadian federal election back in 2015 where the Liberals in red won more than 50 percent of the seats in the House of Commons that is why the red color occupies more than half of the circle so how do we create a pie chart with matplotlib before we go over the code to do that let's do a quick recap of our data set recall that each row represents a country and contains metadata about the country such as where it is located geographically and whether it is developing or developed each row also contains numerical figures of annual immigration from that country to Canada from 1980 to 2013. now let's process the data frame so that the country name becomes the index of each row this should make retrieving rows pertaining to specific countries a lot easier also let's add an extra column which represents the cumulative sum of annual immigration from each country from 1980 to 2013.
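A short sketch of the Iceland bar chart described above, again assuming df_canada and the years list; the labels are illustrative.

```python
import matplotlib.pyplot as plt

df_iceland = df_canada.loc['Iceland', years]   # Iceland row, year columns only

df_iceland.plot(kind='bar')
plt.title('Icelandic immigrants to Canada from 1980 to 2013')
plt.xlabel('Year')
plt.ylabel('Number of immigrants')
plt.show()
```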
In this video we will learn about another visualization tool, the pie chart, and we will learn how to create it using Matplotlib. So what is a pie chart? A pie chart is a circular statistical graphic divided into slices to illustrate numerical proportion. For example, here is a pie chart of the Canadian federal election back in 2015, where the Liberals, in red, won more than 50 percent of the seats in the House of Commons; that is why the red color occupies more than half of the circle. So how do we create a pie chart with Matplotlib? Before we go over the code, recall our data set: each row represents a country, with metadata such as its geographic location and whether it is developing or developed, along with annual immigration figures to Canada from 1980 to 2013. As before, the country name is the index of each row, an extra column holds the cumulative sum of annual immigration from each country (58,639 for Afghanistan, 15,699 for Albania, and so on), and the data frame is named df_canada. Now that we know how our data is stored in df_canada, say we're interested in visualizing a breakdown of immigration to Canada continent-wise. The first step is to group our data by continent using the Continent column, and we use pandas for this: we call the pandas groupby function on df_canada and sum the number of immigrants from the countries that belong to the same continent. Here is the resulting data frame; let's name it df_continents. It has six rows, each representing a continent, and 35 columns representing the years from 1980 to 2013 plus the cumulative sum of immigration for each continent. Now we're ready to start creating our pie chart. We start with the usual imports: matplotlib as mpl and its scripting layer, the pyplot interface, as plt. Then we call the plot function on the Total column of df_continents and we set kind equals pie to generate a pie chart. To complete the figure we give it a title, and finally we call the show function to display it. And there you have it: a pie chart that depicts each continent's proportion of immigration to Canada from 1980 to 2013. In the lab session we will go through the process of creating a very professional-looking and aesthetically pleasing pie chart, transforming the pie chart we just created into one that looks like this, so make sure to complete this module's lab session. One last comment on pie charts: there are some very vocal opponents to the use of pie charts under any circumstances. Most argue that pie charts fail to accurately display data with any consistency, and that bar charts are much better when it comes to representing the data in a consistent way and getting the message across. If you're interested in learning more about the arguments against pie charts, here is a link to a very interesting article that discusses the flaws of pie charts very clearly; you can also find the link under the video. And with this we conclude our video on pie charts. I'll see you in the next video.
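A minimal sketch of the continent-wise pie chart described above, assuming a Continent column and a Total column like the ones in the course data frame (the toy rows are made up):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for df_canada with only the columns this example needs.
df_canada = pd.DataFrame(
    {'Continent': ['Asia', 'Asia', 'Europe', 'Africa', 'Europe'],
     'Total': [58639, 659962, 15699, 18467, 78000]},
    index=['Afghanistan', 'India', 'Albania', 'Egypt', 'France'])

# Group the countries by continent and sum their totals.
df_continents = df_canada.groupby('Continent').sum()

# autopct adds optional percentage labels to the slices.
df_continents['Total'].plot(kind='pie', autopct='%1.1f%%')
plt.title('Immigration to Canada by continent, 1980-2013')
plt.ylabel('')  # hide the default 'Total' axis label
plt.show()
```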
In this video we will learn about another visualization tool, the box plot, and how to create one using Matplotlib. So what is a box plot? A box plot is a way of statistically representing the distribution of given data through five main dimensions. The first dimension is the minimum, the smallest number in the sorted data; its value can be obtained by subtracting 1.5 times the IQR, where IQR is the interquartile range, from the first quartile. The second dimension is the first quartile, which is 25 percent of the way through the sorted data; in other words, a quarter of the data points are less than this value. The third dimension is the median of the sorted data. The fourth dimension is the third quartile, which is 75 percent of the way through the sorted data; in other words, three quarters of the data points are less than this value. And the final dimension is the maximum, the highest number in the sorted data, where the maximum equals the third quartile plus 1.5 times the IQR. Finally, box plots also display outliers as individual dots that occur outside the upper and lower extremes. Now let's see how we can create a box plot with Matplotlib. Before we go over the code, recall our data set: each row represents a country, with metadata such as its geographic location and whether it is developing or developed, along with annual immigration figures to Canada from 1980 to 2013; the country name is the index of each row, an extra column holds the cumulative sum of annual immigration from each country (58,639 for Afghanistan, 15,699 for Albania, and so on), and the data frame is named df_canada. Now that we know how our data is stored in df_canada, say we're interested in creating a box plot to visualize immigration from Japan to Canada. As with the other tools we have learned so far, we start by importing matplotlib as mpl and the pyplot interface as plt. Then we create a new data frame of the data pertaining to Japan, excluding the Total column by using the years variable, and we transpose the resulting data frame to put it in the correct format for a box plot; let's name this new data frame df_japan. Following that, we call the plot function on df_japan and we set kind equals box to generate a box plot. To complete the figure we give it a title and label the vertical axis, and finally we call the show function to display it. And there you have it: a box plot that depicts the distribution of Japanese immigration to Canada from 1980 to 2013. In the lab session we explore box plots in more detail and learn how to create multiple box plots as well as horizontal box plots, so make sure to complete this module's lab session. And with this we conclude our video on box plots. See you in the next video.
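A minimal sketch of the Japan box plot described above; the transpose step is the important detail (the toy figures are random stand-ins):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy stand-in for df_canada: one row for Japan, one column per year.
years = list(map(str, range(1980, 2014)))
df_canada = pd.DataFrame(
    [np.random.randint(500, 1500, size=34)], index=['Japan'], columns=years)

# Keep only the year columns and transpose so the years become rows,
# which is the shape pandas expects for a single box plot.
df_japan = df_canada.loc[['Japan'], years].transpose()

df_japan.plot(kind='box')
plt.title('Box plot of Japanese immigrants to Canada, 1980-2013')
plt.ylabel('Number of immigrants')
plt.show()
```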
In this video we will learn about an additional visualization tool, the scatter plot, and we will learn how to create it using Matplotlib. So what is a scatter plot? A scatter plot is a type of plot that displays the values of, typically, two variables against each other; usually a dependent variable is plotted against an independent variable in order to determine whether any correlation between the two variables exists. For example, here is a scatter plot of income versus education, and by looking at the plotted data one can conclude that an individual with more years of education is likely to earn a higher income than an individual with fewer years of education. So how can we create a scatter plot with Matplotlib? Before we go over the code, recall our data set: each row represents a country, with metadata such as its geographic location and whether it is developing or developed, along with annual immigration figures to Canada from 1980 to 2013; the country name is the index, an extra column holds the cumulative sum of annual immigration from each country (58,639 for Afghanistan, 15,699 for Albania, and so on), and the data frame is named df_canada. Now that we know how our data is stored in df_canada, say we're interested in plotting a scatter plot of the total annual immigration to Canada from 1980 to 2013. To be able to do that, we first need to create a new data frame that shows each year and the corresponding total number of immigrants from all countries worldwide, as shown here; let's name this new data frame df_total. In the lab session we will walk together through the process of creating df_total from df_canada, so make sure to complete this module's lab session. Then we proceed as usual: we import matplotlib as mpl and its scripting layer, the pyplot interface, as plt, then we call the plot function on df_total and we set kind equals scatter to generate a scatter plot. Now, unlike the other data visualization tools, where passing only the kind parameter was enough to generate the plot, with scatter plots we also need to pass the variable to be plotted on the horizontal axis as the x parameter and the variable to be plotted on the vertical axis as the y parameter. In this case we're passing the year column as the x parameter and the total column as the y parameter. To complete the figure we give it a title and label its axes, and finally we call the show function to display it. And there you have it: a scatter plot that shows total immigration to Canada from countries all over the world from 1980 to 2013. The scatter plot clearly depicts an overall rising trend of immigration with time. In the lab session we explore scatter plots in more detail and learn about a very interesting variation of the scatter plot, a plot called the bubble plot, and how to create it using Matplotlib, so make sure to complete this module's lab session. And with this we conclude our video on scatter plots. I'll see you in the next video.
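A minimal sketch of the scatter plot described above; df_total is built from df_canada in the lab, so a toy version with assumed column names 'year' and 'total' is constructed directly here:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy stand-in for df_total: one row per year with the worldwide total.
years = np.arange(1980, 2014)
totals = 100000 + 3000 * (years - 1980) + np.random.randint(-5000, 5000, years.size)
df_total = pd.DataFrame({'year': years, 'total': totals})

# For kind='scatter', the x and y columns must be passed explicitly.
df_total.plot(kind='scatter', x='year', y='total')
plt.title('Total immigration to Canada, 1980-2013')
plt.xlabel('Year')
plt.ylabel('Number of immigrants')
plt.show()
```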
In this video we will learn about what some consider an advanced visualization tool, namely the waffle chart. So what is a waffle chart? A waffle chart is a great way to visualize data in relation to a whole, or to highlight progress against a given threshold. For example, say immigration from Scandinavia to Canada is comprised only of immigration from Denmark, Norway, and Sweden, and we're interested in visualizing the contribution of each of these countries to the total Scandinavian immigration to Canada. The main idea is that, for a waffle chart of a given height and width, the contribution of each country is transformed into a number of tiles proportional to that country's share of the total; the bigger the contribution, the more the tiles, resulting in what resembles a waffle when combined, hence the name waffle chart. Unfortunately, Matplotlib does not have a built-in function to create waffle charts, so in the lab session I'll walk you through the process of creating your own Python function to create one; it's really important that you complete this module's lab session. And with this we conclude our video on waffle charts. I'll see you in the next video. In this video we will learn about another advanced visualization tool, the word cloud. So what is a word cloud? A word cloud is simply a depiction of the importance of different words in a body of text. A word cloud works in a simple way: the more a specific word appears in a source of textual data, the bigger and bolder it appears in the word cloud. So given some text data on recruitment, for example, we generate a cloud of words like this. This cloud is telling us that words such as recruitment, talent, candidates, and so on are the words that really stand out in these text documents, and assuming we didn't know anything about the content of these documents, a word cloud can be very useful for assigning a topic to some unknown textual data. Unfortunately, just like waffle charts, Matplotlib does not have a built-in function to generate word clouds. Luckily, however, a Python library for word cloud generation created by Andreas Mueller is publicly available, so in the lab session we will learn how to use Mueller's word cloud generator, and we will also create interesting word clouds superimposed on different background images; make sure to complete this module's lab session. And with this we conclude our video on word clouds. I'll see you in the next video. In this video we will learn about a new visualization library in Python, which is Seaborn. Although Seaborn is another data visualization library, it is actually based on Matplotlib. It was built primarily to provide a high-level interface for drawing attractive statistical graphics, such as regression plots, box plots, and so on. Seaborn makes creating plots very efficient; with Seaborn you can generate plots with as much as five times less code than with Matplotlib. Let's see how we can use Seaborn to create a statistical graphic, looking at regression plots. Say we have a data frame called df_total of total immigration to Canada from 1980 to 2013, with the year in one column and the corresponding total immigration in another column, and say we're interested in creating a scatter plot along with a regression line to highlight any trends in the data. With Seaborn you can do all this with literally one line of code. First we import seaborn, and let's import it as sns. Then we call the Seaborn regplot function; we basically tell it to use the data frame df_total and to plot the year column on the horizontal axis and the total column on the vertical axis. The output of this one line of code is a scatter plot with a regression line, and not just that, but also a 95 percent confidence interval. Isn't that amazing? Seaborn's regplot function also accepts additional parameters for further customization: you can change the color, for example, using the color parameter, so let's go ahead and change the color to green; you can also change the marker shape using the marker parameter, so let's change our markers to a plus marker instead of the default circular marker. In the lab session we explore regression plots with Seaborn in more detail, so make sure to complete this module's lab session. And with this we conclude our short introduction to Seaborn and regression plots. I'll see you in the next video.
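A minimal sketch of the Seaborn regression plot described above, reusing the same toy df_total idea (the column names and values are assumptions):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Toy stand-in for df_total.
years = np.arange(1980, 2014)
totals = 100000 + 3000 * (years - 1980) + np.random.randint(-5000, 5000, years.size)
df_total = pd.DataFrame({'year': years, 'total': totals})

# One call draws the scatter points, the fitted regression line and a
# 95% confidence band; color and marker are optional customizations.
sns.regplot(x='year', y='total', data=df_total, color='green', marker='+')
plt.title('Total immigration to Canada, 1980-2013')
plt.show()
```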
In this video we will learn about a very interesting data visualization library in Python, which is Folium. Folium is a powerful data visualization library in Python that was built primarily to help people visualize geospatial data. With Folium you can create a map of any location in the world as long as you know its latitude and longitude values. You can also create a map and superimpose markers, as well as clusters of markers, on top of it for cool and very interesting visualizations, and you can create maps of different styles, such as a street-level map, a Stamen Toner map, and a couple of others, which we will look into in just a moment. Creating a world map with Folium is pretty straightforward: you simply call the Map function, and that is all. What is really interesting about the maps created by Folium is that they are interactive, so you can zoom in and out after the map is rendered, which is a super useful feature. The default map style is OpenStreetMap, which shows a street view of an area when you're zoomed in and shows the borders of the world's countries when you're zoomed all the way out. Now let's create a world map centered around Canada. To do that, we pass in the latitude and longitude values of Canada using the location parameter, and with Folium you can set the initial zoom level using the zoom start parameter. I say initial because you can easily change the zoom level after the map is rendered by zooming in or out; you can play with this parameter to figure out what the initial zoom level looks like for different values. Let's set the zoom level for our map of Canada to four, and there you go: here is a world map centered around Canada. Another amazing feature of Folium is that you can create different map styles using the tiles parameter. Let's create a Stamen Toner map of Canada; this style is great for visualizing and exploring river meanders and coastal zones. Another style is Stamen Terrain; let's create a map of Canada in Stamen Terrain, a style that is great for visualizing hill shading and natural vegetation colors. And with this we conclude our introduction to Folium. I'll see you in the next video. In this video we will continue working with the Folium library and learn how to superimpose markers on top of a map for interesting visualizations. In the previous video we learned how to create a world map centered around Canada, so let's create this map again and name it canada_map this time. Ontario is a Canadian province that contains about 40 percent of the Canadian population and is considered Canada's most populous province. Let's see how we can add a circular marker to the center of Ontario. To do that, we need to create what is called a feature group, so let's go ahead and create a feature group named ontario. When a feature group is created it is empty, which means the next step is to start creating what are called children and adding them to the feature group. Let's create a child in the form of a red circular marker located at the center of the province of Ontario; we specify the location of the child by passing in its latitude and longitude values. Once we're done adding children to the feature group, we add the feature group to the map, and there you have it: a red circular marker superimposed on top of the map at the center of the province of Ontario. Now it would be nice if we could actually label this marker to let other people know what it represents. To do that, we simply use the Marker function and the popup parameter to pass in whatever text we want to attach to the marker, and there you go: our marker displays Ontario when clicked on. In the lab session we will look into a real-world example and explore crime rates in San Francisco: we will create a map of San Francisco and superimpose thousands of these markers on top of it, and not just that, I'll also show you how to create clusters of markers to make your map look less congested. This module's lab session is a very interesting one, so please make sure to complete it. And with this we conclude our video on adding markers to maps with Folium. I'll see you in the next video.
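A minimal sketch of the Folium steps described in the last two videos: a map centred on Canada with a labelled circular marker on Ontario. The coordinates are approximate, and newer Folium releases may require an explicit attribution argument, or a different tile provider, for the Stamen styles mentioned above:

```python
import folium

# World map centred on Canada; zoom_start sets the initial zoom level.
canada_map = folium.Map(location=[56.130, -106.346], zoom_start=4)

# Feature group holding a red circular marker for Ontario.
ontario = folium.map.FeatureGroup()
ontario.add_child(
    folium.features.CircleMarker(
        [51.25, -85.32], radius=5, color='red', fill_color='red'))
canada_map.add_child(ontario)

# A labelled marker: the popup text appears when the marker is clicked.
folium.Marker([51.25, -85.32], popup='Ontario').add_to(canada_map)

# In a Jupyter notebook the map renders itself; elsewhere, save to HTML.
canada_map.save('canada_map.html')
```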
In this video we will learn how to create a special type of map, called a choropleth map, with Folium. I'm sure most of you have seen maps similar to this one and this one; these are what we call choropleth maps. So what is a choropleth map? A choropleth map is a thematic map in which areas are shaded or patterned in proportion to the measurement of the statistical variable being displayed on the map, such as population density or per capita income: the higher the measurement, the darker the color. The map on the left is a choropleth map of the world showing infant mortality rate per 1,000 births; the darker the color, the higher the infant mortality rate. According to the map, African countries had very high infant mortality rates, with some of them reporting a rate higher than 160 per 1,000 births. Similarly, the map on the right is a choropleth map of the US showing population per square mile by state; again, the darker the color, the higher the population. According to the map, states in the eastern part of the US tend to be more populous than states in the western part, with California being an exception. In order to create a choropleth map of a region of interest, Folium requires a GeoJSON file that includes the geospatial data of the region. For a choropleth map of the world, we would need a GeoJSON file that lists each country along with the geospatial data that defines its borders and boundaries. Here is an example of what the GeoJSON file would include about each country; the example here pertains to the country Brunei. As you can see, the file includes the country's name, its ID, its geometry shape, and the coordinates that define the country's borders and boundaries. So let's see how we can create a choropleth map of the world, like this one, showing immigration to Canada. Before we go over the code, recall our data set: each row represents a country, with metadata such as its geographic location and whether it is developing or developed, along with annual immigration figures to Canada from 1980 to 2013. Now let's process the data and add an extra column which represents the cumulative sum of annual immigration from each country from 1980 to 2013.
For Afghanistan, for example, it is 58,639 in total, and for Albania it is 15,699, and so on; and let's name our data frame df_canada. Now that we know how our data is stored in df_canada, let's see how we can generate a choropleth map of the world showing immigration to Canada. We should be experts by now at creating world maps with Folium, so let's go ahead and create a world map, but this time let's use the Mapbox Bright tile set; the result is a nice world map displaying the name of every country. Now, to convert this map into a choropleth map, we first define a variable that points to our GeoJSON file, then we apply the choropleth function to our world map, telling it to use the Country and Total columns in our df_canada data frame and to use the country names to look up the geospatial information about each country in the GeoJSON file. And there you have it: a choropleth map of the world showing the intensity of immigration to Canada from different countries worldwide. In the lab session we explore choropleth maps in more detail, so please make sure to complete this module's lab session. And with this we conclude our video on choropleth maps.
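A minimal sketch of the choropleth steps just described. The GeoJSON path, the key_on value, and the df_canada columns are assumptions matching a typical world-countries GeoJSON file; newer Folium exposes this as the folium.Choropleth class, while older releases used a choropleth method on the map object:

```python
import folium
import pandas as pd

# Toy stand-in for the Country/Total columns of df_canada.
df_canada = pd.DataFrame({'Country': ['Afghanistan', 'Albania'],
                          'Total': [58639, 15699]})

world_geo = 'world_countries.json'  # path to a world-countries GeoJSON file

world_map = folium.Map(location=[0, 0], zoom_start=2)

folium.Choropleth(
    geo_data=world_geo,
    data=df_canada,
    columns=['Country', 'Total'],
    key_on='feature.properties.name',  # country names in the GeoJSON file
    fill_color='YlOrRd',
    legend_name='Immigration to Canada, 1980-2013',
).add_to(world_map)

world_map.save('choropleth.html')
```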
In this video we are going to see how an interactive data application can help improve business performance, and the tools available for building such an application. With real-time visuals on a dashboard, understanding a business's moving parts becomes easy. Based on the report type and data, suitable graphs and charts can be created in one central location; this provides an easy way for stakeholders to understand what is going right, what is going wrong, and what needs to be improved. Getting the big picture in one place helps businesses make informed decisions, which improves business performance. In general, the best dashboards answer critical business questions. Let us say you are assigned the task of monitoring and reporting the performance of domestic US flights, and the yearly review report items are: the top 10 airline carriers in the year 2019 in terms of number of flights, the number of flights in 2019 split by month, and the number of travelers from the state of California to other states split by distance group. Let us look at two ways of presenting the report. In the type 1 report, the information is presented through tables, with inferences from the tables documented for reference. In the type 2 report, we present the same findings in dashboard format: hovering over each chart provides details about the data points, and in the bottom sunburst chart you can click on different numbers, drill down into levels, and get detailed information about each segment. Can you observe the difference in the presentation of the findings? And what if we need the report on real-time data rather than static data? Presenting results using tables and documents is time consuming, less visually appealing, and more difficult to comprehend. A data scientist should be able to create and deliver a story around the findings in a way stakeholders can easily understand, and with that in mind, dashboards are the way to go. Let us take a look at the web-based dashboarding tool options available in Python. Dash is a Python framework for building web analytic applications; it is written on top of Flask, Plotly.js, and React.js, and is well suited for building data visualization apps with highly custom user interfaces. Panel works with visualizations from Bokeh, Matplotlib, HoloViews, and many other Python plotting libraries, making them instantly viewable, either individually or when combined with interactive widgets that control them. Panel works equally well in Jupyter notebooks for creating quick data exploration tools, and it can also be used in standalone deployed apps and dashboards, allowing you to easily switch between those contexts as needed. Voila turns Jupyter notebooks into standalone web applications; it can be used with separate layout tools like jupyter-flex or templates like voila-vuetify. Streamlit can easily turn data scripts into shareable web apps, built around three main principles: embrace Python scripting, treat widgets as variables, and reuse data and computation. There are other tools that can be used for dashboarding. Bokeh is a plotting library, a widget and app library, and it acts as a server for both plots and dashboards; Panel is one of the web-based dashboarding tools built on Bokeh. ipywidgets provides an array of Jupyter-compatible widgets and an interface supported by many Python libraries, but sharing a dashboard requires a separate deployable server like Voila. Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. Bowtie allows users to build dashboards in pure Python. Flask is a Python-backed web server that can be used to build arbitrary websites, including those with Python plots that function as Flask dashboards. Learn more about these tools from the source link. In this course we will be focusing on Dash. In this video we are going to provide an overview of the Plotly Python library. So what is Plotly? Plotly is an interactive, open source plotting library that supports over 40 unique chart types; it is available in Python, R, and JavaScript. Plotly Python is built on top of the Plotly JavaScript library and includes chart types like statistical, financial, maps, scientific, and three-dimensional data. The web-based visualizations created using Plotly Python can be displayed in a Jupyter notebook, saved to standalone HTML files, or served as part of pure-Python web applications using Dash. The focus of this lesson will be on two of the Plotly submodules: plotly.graph_objects and plotly.express. Plotly graph objects is the low-level interface to figures, traces, and layout; the plotly.graph_objects module provides an automatically generated hierarchy of classes, called graph objects, that are used for representing figures, with the top-level class plotly.graph_objects.Figure. Plotly Express is a high-level wrapper for Plotly; it is the recommended starting point for creating the most common figures provided by Plotly using a simpler syntax, and it uses graph objects internally. Let us see how to use the plotly.graph_objects submodule by creating a simple line chart. First, import the required packages; here we are importing graph objects as go. Then generate sample data using NumPy. plotly.graph_objects exposes a JSON-like object with a dictionary structure, and since we imported plotly.graph_objects as go in the previous slide, go is that object. The chart can be plotted by updating the values of the go object's keywords: we create the figure by adding a scatter-type trace, and then the layout of the figure is updated using the update layout method, where we set the x-axis, the y-axis, and the chart title. This is the plotted figure. Now we will create the same line chart using Plotly Express: in Plotly Express the entire line chart can be created using a single command, and the visualization is automatically interactive. Plotly Express makes visualizations easy to create and modify.
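A minimal sketch of the two Plotly approaches just described, using made-up sample data (the column names are assumptions):

```python
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

# Sample data generated with NumPy.
x = np.arange(12)
y = np.random.randint(50, 500, size=12)

# graph_objects: create the figure, add a scatter-type trace in line
# mode, then update the layout with a title and axis labels.
fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=y, mode='lines'))
fig.update_layout(title='Sample line chart',
                  xaxis_title='Month', yaxis_title='Value')
fig.show()

# Plotly Express: the same chart in a single call.
df = pd.DataFrame({'month': x, 'value': y})
px.line(df, x='month', y='value', title='Sample line chart').show()
```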
play with the plotly library next is going to be a lab session we will be using the airline reporting database from data asset exchange to demonstrate how to use plotly graph objects and express for creating charts here is a quick overview of the airline reporting data set the reporting carrier on-time performance data set contains information on approximately 200 million domestic U.S flights reported to the United States Bureau of Transportation statistics the data set contains basic information about each flight such as date time departure Airport arrival airport and if applicable the amount of time the flight was delayed and information about the reason for the delay now let us start the lab foreign [Music] we are going to see an overview of Dash Library Dash is an open source user interface python library for creating reactive web-based applications it is Enterprise ready and a first-class member of plotley's Open Source tools Dash applications are web servers running flask and communicating Json packets over HTTP requests Dash's front end renders components using react.js it is easy to build a graphical user interface using Dash as it abstracts all Technologies required to build the applications Dash's declarative and reactive Dash output can be rendered in web browser and can be deployed to servers Dash uses a simple reactive decorator for binding code to the UI this is inherently mobile and cross-platform ready let us say you are planning to create an application to answer a business question as a first step you need to determine the layout of the application decide which chart to use and where to place for example this is called layout part in dash the second part is to add interactivity to the application there are two components of Dash first is core components we can import core components as DCC using the import statement next is HTML components we can import HTML components as HTML using this import statement let us explore these further the dash underscore HTML underscore components library has a component for every HTML tag you can compose your layout using python structures with the dash dash html-components Library the dash underscore HTML underscore components Library also provides classes for all of the HTML tags the keyword arguments describe the HTML attributes like style class name and ID no knowledge of HTML or css is required but can help in styling the dashboards let us see an example of how to use HTML components we start by creating a dash application from here we create division in our application layout and then add components to it in the outer layer division we first provide a name for our application using the HTML heading component H1 the style parameter is used to change the font color size and border of the heading next we will add paragraph content to the page using an HTML paragraph component p division can be created inside the outer division here we are providing division content as this is a new Division and styling it using style parameter components to put all this together in the application layout create an HTML Division and add components multiple divisions can be added to the outer application layout the dash underscore core underscore components describe higher level components that are interactive and generated with JavaScript HTML and CSS through the react.js library some examples of core components are creating a slider input area check items and date picker you can explore other components using the reference link provided at the end of 
Now let us see how to add a slider and a dropdown to the application. For the dropdown we use the dcc.Dropdown component; we create the dropdown list under the options parameter as a list of dictionaries, where label holds the dropdown display name and value holds the value of that label, and we can also provide a default selection using the value parameter. For the slider we use the dcc.Slider component and provide min and max values; the marks parameter is used for adding slider markers and the value parameter for setting a default value. In this video we will see how to connect core and HTML components using callbacks. A callback is a Python function that is automatically called by Dash whenever an input component's property changes; the callback function is decorated with the @app.callback decorator. So what does this decorator tell Dash? Basically, whenever there is a change in the input component's value, the callback function wrapped by the decorator is called, followed by an update to the output component's children in the application layout. Let us look at the callback function skeleton. First, create a function that performs the operations needed to return the desired result for the output component. Then decorate the callback function with the @app.callback decorator, which takes two parameters: Output, which binds the result returned from the callback function to a component ID, and Input, which binds the input provided to the callback function to a component ID. From here we connect the input and output to the desired properties. We will see this in action with an example using the airline data. The use case here is to extract the top 10 airline carriers, by number of flights, for a provided input year; as the input year changes, the output will change. First we import the required packages: as seen before, we import pandas, Dash, and the Dash core and HTML components. The new entry here is Dash dependencies; from dash.dependencies we import Input and Output, which we will use in the callback function. We read the airline data into a pandas data frame; we load the data frame at the start of the app so it can be read inside the callback function. We then start designing the Dash application layout by adding components. First we provide the title of the Dash app using the HTML heading component H1 and style it using the style parameter. Next we add an HTML division with a text input core component. In Dash, the inputs and outputs of the application are simply the properties of particular components; in this example, our input is the value property of the component with the ID input-year, whose default value is 2010, and we will update this value in our callback function. Lastly, we add a division with a graph core component; this component has bar-plot as its ID, which we will update inside the callback function. Note the component IDs. We then add the callback decorator @app.callback: the input to the callback is the component with ID input-year and property value, and the output is the component with ID bar-plot and property figure. The component_id and component_property keywords are optional and are included here for clarity. Next we define the callback function get_graph; the entered year is the input, we use that year to extract the required information from the data, and finally the graph in the application layout is updated. Lastly, we run the application. This is the output of the code; our initial input year is 2010.
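A minimal sketch pulling the one-input callback pattern just described together; the toy DataFrame and its column names stand in for the airline data, and the component IDs mirror the ones mentioned above:

```python
import pandas as pd
import plotly.express as px
import dash
from dash import html, dcc
from dash.dependencies import Input, Output

# Toy stand-in for the airline reporting data, loaded once at app start.
data = pd.DataFrame({'year': [2010, 2010, 2019, 2019],
                     'carrier': ['AA', 'DL', 'AA', 'DL'],
                     'flights': [120, 150, 300, 280]})

app = dash.Dash(__name__)

app.layout = html.Div(children=[
    html.H1('Flights by carrier'),
    dcc.Input(id='input-year', value=2010, type='number'),
    dcc.Graph(id='bar-plot'),
])

# Whenever the value of 'input-year' changes, Dash calls get_graph and
# pushes its return value into the 'figure' property of 'bar-plot'.
@app.callback(Output(component_id='bar-plot', component_property='figure'),
              Input(component_id='input-year', component_property='value'))
def get_graph(entered_year):
    entered_year = int(entered_year or 2010)  # fall back if the field is cleared
    df = data[data['year'] == entered_year]
    return px.bar(df, x='carrier', y='flights',
                  title=f'Flights in {entered_year}')

if __name__ == '__main__':
    app.run(debug=True)  # older Dash: app.run_server(debug=True)
```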
Note that as we update the year, the graph is updated for that year. The second example is a callback with two inputs. It is similar to the one-input callback except for a few changes: we add a division with one more text input, with the component ID input-ab, and we add this new input, with its component ID input-ab, to the decorator inside a list. Next we define the callback function get_graph; this takes the entered year and the entered state as input parameters, computation is performed to extract the information, and the application layout is updated with the graph. This is the output of the code; our initial input year is 2010 and the state is AL, which is Alabama. As I update the year and state, you can observe that the graph updates accordingly. Now let us start the lab. Congratulations on making it this far. Successfully completing this capstone project will earn you the Data Analyst Professional Certificate. This capstone provides you with practical, hands-on experience to demonstrate all of the skills you have picked up so far in this professional certificate program. As part of the capstone project, you will take on the role of a data analyst with a global IT and business services firm; in this role you will be analyzing several data sets to help identify trends for emerging technologies. In module 1 you will collect data on the technology skills that are in demand from various sources, including job postings, blog posts, and surveys. In module 2 you will take the collected data and prepare it for analysis by using data wrangling techniques like finding duplicates, removing duplicates, finding missing values, and imputing missing values. You'll continue to module 3 and apply statistical techniques to analyze the data and identify insights and trends, like what are the top programming languages in demand, what are the top database skills in demand, what are the most popular IDEs, and demographic data like the gender and age distribution of developers. In module 4 you'll focus on choosing an appropriate visualization based on the data you want to present, using charts, plots, and histograms to help reveal your findings and trends. In module 5 you will employ Cognos to create interactive dashboards to help you analyze and present the data dynamically. And in the final module you will use your storytelling skills to provide a narrative and present the findings of your analysis; you will be provided with a presentation template to begin the process and to help you create a compelling story and present your findings. Each module includes a short quiz to test your knowledge; you will be evaluated based on the quizzes in each module, and the dashboard and the storytelling presentation you create will be reviewed and graded by your peers. To begin, I recommend taking a few minutes to explore the course site: review the material we'll cover each week and preview the assignments you'll need to complete to pass the course. Click Discussions to see forums where you can discuss the course material with fellow learners and the course team. If you have questions about course content, please post them in the forums to get help from others in the course community; for technical problems with the Coursera platform, visit the Learner Help Center. We are excited to have you join us and hope you enjoy the course. Good luck, and let's get started. While finding and cleaning data is an important first step in data analysis, a key insight can be lost if you are not able to organize and represent the findings
effectively to your audience. In this video you will learn how to represent your findings by focusing on the specific elements that create a successful data findings report. After the data has been collected, cleaned, and organized, the work of interpretation begins: you are now able to obtain a complete view of the data and, hopefully, answer the questions that were formed before starting the analysis. You typically then begin to compose a findings report that explains what was learned. Depending on the stakeholders and how they receive the information, the report could vary in form: it could be a paper-style report, a slideshow presentation, or maybe even both. The findings report is a crucial part of data analysis, as it conveys what was discovered. When beginning this process, the collected data and information may seem a little overwhelming, and the best way to get through this block is to begin by creating an outline; by completing an outline you can get a complete picture and begin to write in a precise but simple manner. While there are many different formats for creating a data-driven presentation, we have created a simple outline that is easy to follow yet effective. When creating your outline, always remember to structure it toward your audience and create a presentation that is appropriate for your situation. You first begin with your cover page; this beginning section will have the title of your presentation, your name, and the date. The next section in your outline will be an executive summary, and then the table of contents. The table of contents lists the sections and subsections of your report in order to give your audience an overview of the contents; this also enables readers to go directly to a specific section that may be more important to them. Continue your presentation with the introduction, methodology, results, discussion, conclusion, and finally the appendix. Note that the depth and length of each element may vary depending on the audience and format of the report. The first step in creating your report is properly creating an executive summary. This summary briefly explains the details of the project and should be considered a stand-alone document; the information is taken from the main points of your report, and while it is acceptable to repeat information, no new information is presented. The next section after the table of contents is the introduction, which explains the nature of the analysis, states the problem, and gives the questions that were to be answered by performing the analysis. The next section is methodology, which explains the data sources that were used in the analysis and outlines the plan for the collected data, for example whether a clustering or regression method was used to analyze the data. Next we have the results section. This section goes into the detail of the data collection, how it was organized, and how it was analyzed; it would also contain the charts and graphs that substantiate the results and call attention to more complex or crucial findings. By providing this interpretation of the data, you are able to give a detailed explanation to the audience and convey how it relates to the problem stated in the introduction. Next, discuss the report findings and implications; in this section you engage the audience with a discussion of the implications that were drawn from the research. For example, let's say you were conducting research on the top programming languages for college graduates: would you find they need to learn multiple languages
to remain competitive in the job market, or would one language always reign supreme? We have now reached the conclusion of the report findings. This final section should reiterate the problem given in the introduction and give an overall summary of the findings; it should also state the outcome of the analysis and whether any further steps will be taken in the future. And last we have the appendix. This section contains information that didn't really fit in the main body of the report but that you deemed important enough to include; this could include locations where the raw data was collected or other details such as resources, acknowledgments, or references. In this video we learned about the important elements in creating a successful data findings report; in the next video we will learn the best practices for presenting your findings. Okay, you've spent weeks, maybe months, studying the data, and the time has come to report your findings. The questions have been answered and you feel good about the story, so how will you speak to your audience so they leave with the intended message? In this video you will learn how to present your findings in a way that will engage and keep the attention of your audience. Delivering data-driven presentations may seem easy, but there are a few important factors to remember in accurately conveying your message: make sure charts and graphs are not too small and are clearly labeled, use the data only as supporting evidence, share only one point from each chart or graph, and eliminate data that does not support the key message. Have you ever sat through a presentation where the information being presented was difficult to read or understand? While this may seem obvious, small charts and labels can be easily overlooked; make sure to test the visualizations by sitting at different distances, like your audience will, and if the data cannot be seen clearly, then a redesign should be considered. When preparing the report, you may feel the only way to explain the findings is to pack the slides with data. While this may seem sensible to you as a data analyst, your audience will probably not appreciate the intricacies of the data and will just see a pile of numbers. To resolve this issue, begin by forming the key messages that need to be conveyed to the audience and build the story around those messages; after forming the outline, go back and insert the data to support your findings. By not relying heavily on the data and using this method to create the presentation, you will create a story that is engaging and interesting to your audience. Presenting your data using charts and graphs is the best way to get your message across; however, if you are supplying too much information, it can be confusing. For example, look at this pie chart: can you decipher what the key message is and what the presenter is trying to convey? In the example, the chart has so much information that it is hard to determine what point the presenter is trying to make and what the focus should be for the audience. By sticking with one idea and not summarizing multiple points into one visualization, you are able to accurately convey the idea to the audience and avoid any confusion. Data analysts can spend months researching data; however, some items that seem interesting to the analyst may not be relevant to the project. Trying to explain every little detail to your audience and not recognizing irrelevant data could damage the key message. By eliminating this unnecessary data and highlighting only the data points that support your key
ideas, you will keep the presentation clear and concise. In this video we learned about creating a data-driven presentation that will keep your audience engaged and how to deliver a clear and concise message.