

R and Python in Periscope Data
7 Ways to Go Deep on Your Data Using Python and R
With new support for R and Python in Periscope, data teams can perform a range of new analyses
that will translate into new value for their organizations. These advanced languages open up
new possibilities such as machine learning, natural language processing, advanced visualizations
and prompt data cleaning. With support for these languages, data teams can explore answers to
questions that couldn’t even be asked using SQL alone.
Collecting data is important for companies looking to make more informed decisions, but the
organizations that will benefit the most from the recent data boom are the ones that analyze
information at the deepest levels. Periscope Data is built to streamline that analysis process,
turning data into value as efficiently and quickly as possible. Our platform connects directly to all
of your data sources and allows you to use SQL, a basic coding language, to prepare data tables
for visualization. This workflow is a great way to analyze some data, but we wanted a way to give
data teams a more robust toolbox and allow them to analyze their data in ways that would create
even more value for their companies.
To better empower our customers’ data teams, Periscope added support for R and Python, two
advanced coding languages. Periscope Data customers can still use SQL to prepare their data,
but now they have the ability to pass those SQL tables into R or Python to perform deeper
analysis. These languages are open-ended and continually add new libraries, meaning data teams
can explore at a deeper level and generate insights that were not discoverable using SQL alone.
To help illustrate how data teams can use R and Python to conduct deeper analysis of their data,
the Periscope team has compiled seven suggestions for practical uses of these languages. This list
is by no means exhaustive, but these items are meant as a place for data scientists and analysts to
start a conversation about the insights they’re producing and the value they want to create.
Machine Learning
In the last couple of years, the field of machine learning has become increasingly popular as data
teams shift their focus toward predictive and prescriptive analytics. While the overall concept is
still relatively nebulous, there’s a growing realization of the importance of using existing data to
attempt to intelligently identify key indicators and predict future outcomes.
In order for machine learning to be successful, it’s more important than ever that the data set
used to make a prediction is clean. Bad or incomplete data can lead to bad predictions, which is
detrimental to your organization and also detrimental to establishing a culture of data-based
decisions. Luckily, the advanced languages that are now a part of the Periscope Data offering
enable teams to clean data more efficiently and lay the groundwork for intelligent predictions.
Using R and Python for machine learning
SQL-only analysis platforms do not have the capability to build machine learning models. R and
Python include many analysis capabilities that are needed to identify complex patterns in existing
datasets. Data teams are building complex machine learning models that look at past data with
a known outcome and apply algorithms that relate inputs to that known outcome. The same
algorithm can then be used with new inputs to predict unknown futures as accurately as possible.
In short, if data can be used to draw a line from the past to the present, the same line may be
applicable when extended from the present into the future.
For example, consider a company’s lead scoring process. Data analysts can look backward to build
a model that attempts to determine the value of specific lead characteristics and interactions
between a company and its prospects. That model can then be tested to see if it holds true with
new data. From there, they can adjust their strategy to focus on certain prospects and replace
low-value interactions with high-value ones. The same model could even calculate how much new
revenue those changes will generate. In this instance, machine learning can be used both as a
predictive and a prescriptive technique.
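To make the lead-scoring idea concrete, here is a minimal, self-contained sketch. The feature names and toy data below are invented for illustration, and a hand-rolled logistic regression stands in for the production-grade libraries a data team would actually use:

```python
import math

def train_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit a tiny logistic-regression lead-scoring model by gradient descent."""
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))   # predicted conversion probability
            err = p - y                       # gradient of the log loss
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def score(w, b, x):
    """Score a new lead with the trained model."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

# Historical leads: [demo_requested, emails_opened] -> converted (1) or not (0)
past_leads = [[1, 5], [1, 3], [0, 4], [0, 1], [1, 4], [0, 0]]
converted = [1, 1, 0, 0, 1, 0]

w, b = train_logistic(past_leads, converted)
print(score(w, b, [1, 4]))  # resembles past converters: high score
print(score(w, b, [0, 0]))  # resembles past non-converters: low score
```

The model is trained only on leads whose outcome is already known; the same fitted weights then score new leads whose outcome is not.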
The future of machine learning is wide open and limited only by imagination and the capabilities
of your team’s data platform.
Machine learning in Periscope Data
If you look at machine learning as combining clean data with advanced regression analysis tools,
R and Python enable Periscope Data customers to address both needs. Periscope allows users to
easily create data sets from their database that they can use to train a machine learning model
and then test against unknown data. Companies that have a data scientist who can build models
using regression analysis can make machine learning models that are accessible to their entire
team of analysts.
As the field of machine learning advances, new and existing R and Python libraries will be
updated to accommodate new capabilities. Teams that record reliable, clean data will be able
to use that information in a variety of new ways to predict new things. These teams will
increasingly add value to their organizations as machine learning becomes more concrete in
the upcoming years.
Advanced Statistical Analysis
Data analysis is not new. Companies have been storing and tracking numbers for a long time.
What started in basic spreadsheets has advanced to pivot tables, SQL queries and beyond that,
advanced coding languages. With each step of this evolution, businesses are empowered to move
away from simply describing what has happened into a more thorough picture of understanding
why things happen and better predicting what is going to happen next. As data teams get more
advanced, they move closer and closer to the executive decision makers, working together to
identify operational moves that will optimize results.
Using R and Python for advanced statistical analysis
SQL is a descriptive language; it does a great job of answering questions about what is happening.
R and Python allow teams to answer questions about why something is happening. This evolution
of analysis means executives want data-based understanding of what is actually meaningful. Your
data team needs to be able to identify variables that positively correlate with intended business
outcomes. Using SQL alone doesn’t allow for this, but a language like R can handle this type of
query with just a single line.
To illustrate how easy writing advanced analytics can be with these languages, look at the
comparison below. The first block is SQL code intended to determine a correlation between just
two variables. The second is the same type of function implemented in R, able to analyze many
relationships at once across an entire matrix of data.
In SQL:

```sql
SELECT
  user1,
  user2,
  ((psum - (sum1 * sum2 / n)) /
   sqrt((sum1sq - pow(sum1, 2.0) / n) * (sum2sq - pow(sum2, 2.0) / n))) AS r
FROM (
  SELECT
    n1.user AS user1,
    n2.user AS user2,
    COUNT(*) AS n,
    SUM(n1.rating) AS sum1,
    SUM(n2.rating) AS sum2,
    SUM(n1.rating * n1.rating) AS sum1sq,
    SUM(n2.rating * n2.rating) AS sum2sq,
    SUM(n1.rating * n2.rating) AS psum
  FROM testdata AS n1
  JOIN testdata AS n2 ON n1.movie = n2.movie
  WHERE n1.user > n2.user
  GROUP BY n1.user, n2.user
) AS step1
```

In R, the equivalent correlation across the entire matrix is a single line:

```r
cor(testdata)
```
Incorporating more advanced languages into your data approach opens the door for more
powerful executive conversations. R and Python help your team uncover patterns that were
previously not visible and help you quantify your findings to easily illustrate their importance.
Imagine that your team was tasked with finding ways to minimize customer churn. Using
Periscope’s advanced language support, you could pass a table of data from SQL into R and run
correlations across as many variables as you can imagine. Any factors that show a strong enough
correlation would be a great starting point for identifying causal relationships with churn.
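As a sketch of what that churn exploration might look like in Python, pandas can compute every pairwise correlation in a single call. All column names and values below are invented for illustration:

```python
import pandas as pd

# Hypothetical account-level table pulled from a SQL query
df = pd.DataFrame({
    "support_tickets": [0, 5, 1, 8, 2, 7, 0, 6],
    "logins_per_week": [9, 2, 8, 1, 7, 2, 10, 1],
    "seats_purchased": [3, 2, 4, 3, 2, 4, 3, 2],
    "churned":         [0, 1, 0, 1, 0, 1, 0, 1],
})

# One line: Pearson correlation across every pair of variables
corr = df.corr()

# Factors most strongly correlated with churn are the best
# starting points for investigating causal relationships
print(corr["churned"].drop("churned").sort_values(ascending=False))
```

Strong positive or negative correlations with the churn flag would then become hypotheses to test operationally, as described above.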
To establish causal relations, your team could make educated recommendations about
operational adjustments based on those variables. Those hypotheses could be tested for a period
and anything that shows a statistically significant improvement can be formally adopted into your
company’s process. It’s an easy way to cut through the noise and pinpoint exactly what is driving
your business. These new languages give analysts the tools to look for connections that they think
are relevant while still allowing data to be the ultimate decision maker.
To take statistical analysis to the next level, data teams that have skilled scientists can use the
new languages to perform detailed predictive analysis, such as logistic regressions and cluster
analysis, that can be used to create machine learning models for even more advanced
recommendations. Even better, these languages are created and improved by thousands of
experts, with new capabilities appearing frequently. Utilizing R and Python as part of your data
team’s analysis is a good way to ensure you always have access to the best tools available.
Advanced statistical analysis in Periscope Data
For Periscope users, data can easily be passed from SQL into R or Python, where it can be
analyzed and visualized before passing it back into a formal Periscope dashboard for distribution,
collaboration and presentation. The libraries available in those languages represent the combined
brilliance of thousands of the most skilled professionals and academics in the field.
The beautiful thing about Periscope Data is that our agile platform allows dashboards to refresh
instantly to show new information. There’s no need to download any new information or re-run
reports to get new visualizations. This approach allows a team’s most skilled data scientists to
avoid doing the same work twice, freeing them up to work on more complex analysis that illustrate
a deeper understanding of emerging business needs. The overall effect of this model is that every
analyst, employee and executive can view data from the perspective of the most advanced
scientist at the company.
Complex Visuals
When it comes to key findings from data analysts, insights are only as valuable as the data team’s
ability to communicate them to the rest of the company. One of the most impactful ways data
teams can communicate is by turning ideas into visuals to share with internal stakeholders. The
biggest obstacle with this approach is that the information is then limited by what a particular
chart can and cannot display. Basic charts can illustrate basic learnings, but it’s impossible to
show more complex concepts without a set of complex charting tools.
These advanced charts often explore many phenomena at once and need to be created one layer
at a time. Analysts usually build these charts in an iterative process, approaching one factor, and
one visual layer, at a time. Depending on the depth of the data, this process can take several hours
or even days to analyze the information and create the corresponding graphic.
Using R and Python for more complex visuals
Many analysts use basic charting to efficiently show basic trends: how KPIs change over time, what
the current value is or the magnitude of a metric across a category. In many cases, those charts
are sufficient, but the level of analysis that simple visuals provide may not convey enough
information for an effective business decision to be made.
The charting libraries available with R and Python allow data teams to have control over visuals far
beyond what has historically been the data analyst toolkit. These charts make deep analysis easier
by opening the door for further visualization mappings to illustrate many variables at once.
Some visuals are designed to tell multiple stories, especially the more complex ones. Consider the
chart below, which displays the mileage performance of vehicles with different engine types. At
first glance, vehicles with fewer cylinders appear to get better overall mileage while driving in
the city. But there’s more to this chart: the 8-cylinder engine has a unique shape that needs
explanation, the 4-cylinder engine has a long tail while the other two have definite limits and
there’s a peculiar bimodal distribution in all three.
This chart can be examined for more findings, but it’s clear that the complete story this data is
telling goes deeper than anything that could be derived from simple tables or bar charts.
Creating more complex visuals in Periscope Data
Complex visuals are made easy in R and Python. In R, for example, the ggplot2 package allows
very detailed control over the aesthetics of a chart. Want to map variable_a to transparency?
Set alpha=variable_a. Want to map variable_b to color? Set color=variable_b.
Python also has a lot of customization with visuals through matplotlib. Data teams may prefer to
use either language, depending on their background, to visualize data in any way that helps them
communicate with their stakeholders. Once the visuals have been created in R or Python, they can
be saved directly onto a Periscope Data dashboard and will appear beside all the other charts.
In addition to the granular visual mapping abilities, R and Python also give chart creators more
controls to better customize the final visualization that appears in their Periscope Data dashboard.
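As one example of layering several variables into a single visual, a matplotlib scatter plot can map four variables at once: two to position, one to marker size and one to color. The data and labels below are invented for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; in Periscope the chart lands on a dashboard
import matplotlib.pyplot as plt

# Hypothetical data: one point per product line (all values invented)
revenue = [120, 340, 210, 480, 90]
growth = [0.05, 0.22, 0.11, 0.31, -0.02]
headcount = [14, 40, 22, 55, 8]        # mapped to marker size
margin = [0.6, 0.4, 0.5, 0.3, 0.7]     # mapped to color

fig, ax = plt.subplots()
points = ax.scatter(revenue, growth,
                    s=[h * 6 for h in headcount],  # third variable: size
                    c=margin, cmap="viridis",      # fourth variable: color
                    alpha=0.8)
ax.set_xlabel("Revenue ($k)")
ax.set_ylabel("YoY growth")
fig.colorbar(points, label="Gross margin")
fig.savefig("product_lines.png")
```

One chart now carries revenue, growth, team size and margin simultaneously, the kind of multi-story visual described above.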
Additional Charting Libraries
At Periscope Data, the most common feature request we receive from current customers is new
chart options. Rather than handle these as a series of one-off feature builds, we decided to give
more power to our customers by enabling them to use R and Python to create their own
visualizations that could then be displayed in dashboards.
With these advanced languages, Periscope Data customers are no longer limited to a predetermined
list of visualization options. Instead, those tables generated in SQL can now be passed into R
or Python, where the information can be analyzed deeply and displayed in an unlimited variety
of ways. Data analysts are creative problem solvers and with the right tools, they can creatively
illustrate their findings in any way that helps them convey their insights effectively.
Using R and Python to supplement charting libraries
Periscope Data was designed to give analysts a toolkit of charts that are easy to read and can
effectively analyze a specific phenomenon — lines, bars, pies, tables, scatter plots, etc. These
charts are widely accessible, but a simple charting toolkit also means a simple range of insights
that can be communicated. As insights expand to increasingly complex research, more complex
charting tools will be needed to illustrate those findings. Using R and Python to expand the
charting possibilities means an expansion of the types of information that can be studied and
conveyed with those visualizations.
Although R and Python were only recently made available to Periscope Data customers, we’ve
seen data teams creating a range of new visualizations: box and whisker charts, treemaps,
formatted tables, customized number overlays, calendar-like charts, heat maps, log scales,
quadrant charts and more. These charts all come with the ease of collaboration and sharing
that customers expect from Periscope Data. There really is no limit to what data teams can create
with R and Python.
Not only do the advanced languages offer new types of charts, they can also be used to better
customize the visualizations that are made. With R and Python, chart creators have much greater
control over formatting options like color, font and overlays, so charts can be made in any style
that your business needs. These charts are not JavaScript-enabled, but they can be powerful in
creatively communicating any message.
Supplementing the Periscope Data charting libraries
Using R and Python to create new types of charts in Periscope Data is simple. Just run a standard
SQL query, and select the desired language from the dropdown below the SQL editor. From there,
you can use 25+ supported charting libraries to customize your visualization. When R and Python
evolve and add new libraries, your charting options in Periscope Data will grow even more robust.
As the visualization options expand, data teams can perform more advanced analyses, find
previously hidden insights and add even more value to their companies.
If you’re interested in exploring creative new charting options, visit the data visualization
catalogue to get inspiration.
Natural Language Processing
Most data analysis is performed on quantitative data, but companies that limit their research
to just numbers could be missing out on valuable information. To maximize their value to an
organization, data teams should start looking into ways to extract value from an additional type
of information – text. To perform this analysis, teams use Natural Language Processing (NLP) to
decode human language.
While SQL has some light NLP capabilities, Python offers huge libraries that can analyze this type
of information at incredible depth. More advanced languages, such as Python, can adapt and grow
as new discoveries are made in NLP, so as human speech processing evolves, companies will be
able to gain the right insights, including meaning and context, from their text database.
Using Natural Language Processing to generate new insights
The problem with qualitative data in general is a lack of structure. Numerical data is inherently
more sequential and can be organized and analyzed easily. Text-based data is unstructured and
often disorganized. Libraries like Natural Language Toolkit (NLTK) in Python have been built to
help create order out of human speech patterns and perform intensive analysis on what is being
communicated. NLTK does more than just convert words into searchable sets of characters,
it actually works to dissect human language construction and derive the intended meaning from
a sequence of words.
NLP is still in its infancy, but companies are already using it to generate some meaningful insights.
A basic example is sentiment analysis, where data teams can identify what is being said about
a product or a brand and analyze those statements for meaning. It’s much simpler than putting
together a formal focus group and provides a more concrete understanding of the conversation.
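NLTK’s sentiment tools do this with full lexicons and grammar-aware rules; the scorer below is a deliberately simplified, pure-Python sketch of the underlying idea, using an invented mini-lexicon rather than NLTK itself:

```python
# A toy sentiment scorer: a drastically simplified stand-in for what
# NLTK's sentiment analyzers do with full lexicons and linguistic rules.
POSITIVE = {"love", "great", "excellent", "fast", "helpful"}
NEGATIVE = {"hate", "slow", "broken", "confusing", "bug"}

def sentiment(text):
    """Return a score in [-1, 1]: +1 if all hits are positive, -1 if all negative."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    hits = [1 for w in words if w in POSITIVE] + [-1 for w in words if w in NEGATIVE]
    return sum(hits) / len(hits) if hits else 0.0

print(sentiment("I love this product, support is great and fast!"))   # 1.0
print(sentiment("The app is slow and the export feature is broken"))  # -1.0
```

A real NLP pipeline adds tokenization, negation handling and context awareness, but the shape is the same: turn unstructured text into a structured, analyzable signal.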
Any company that uses text as a major unit of data can generate value from NLP technology; they
just need to think of a way to provide structure around the outcome they intend to get. NLTK is
an advanced tool that is capable of sorting through volumes of text-based data, but just like your
quantitative data sets, the best insights only come from a well-structured, carefully designed
analysis. The possibilities for NLP-based insights are limitless and tools like NLTK will only
continue to improve. With the right vision and analytical execution, data teams that utilize the
tools correctly will give their companies a major advantage.
Natural Language Processing in Periscope Data
One Periscope Data customer that is making the most of NLP is Crisis Text Line, a free,
anonymous 24/7 text-based crisis intervention system that aims to mitigate crises by connecting
people to counselors who are trained to cool down hot moments. They use natural language
processing and machine learning to pull insights from their rich data set and identify keywords
in texts to help steer a counselor toward a safe resolution. Later, a second phase of this process
utilizes a large community of professional counselors to analyze conversations based on common
keywords and tags to help assess trends and train counselors to have high-quality conversations
with texters.
This innovative approach to predictive modeling allows Crisis Text Line to detect keywords that
identify and predict trends in real time. The Crisis Text Line data team uses Periscope Data to
conduct this complex analysis and quickly visualize the results. In the near future, the team plans
to set up a self-service data environment that will empower counselors to access information
without help from the data team. This setup would give counselors quicker access to data and
ultimately lead to better-informed conversations with texters. Often, the end users have difficulty
predicting the needs of texters ahead of time, so a data tool that relies on upfront modeling is
ineffective. An agile data environment like Periscope’s allows the team of counselors to find
answers on their own.
Data Cleaning
However your team does data analysis, there’s a universal truth — the insights you collect are only
going to be as good as the data that goes into finding them. Clean data sets are imperative in the
analytical process, so data teams spend a lot of time making sure their data is as good as possible
before running an analysis. The problem is that the data-cleaning process is long and manual,
taking 60-80% of a data scientist’s time.
With all that effort spent preparing the data for analysis, these scientists hardly have any time
left to actually perform the research and find insights in the data. Advanced coding languages,
such as R or Python, include packages that will assist with data cleanup, giving data teams more
bandwidth to perform analysis and better tools to dive deeper into the clean data sets.
Using R and Python to Clean Data Better
Scripting languages such as Python and R can assist with data cleanup, allowing data scientists
to do bulk cleanup. For example, Python’s re library makes string operations much faster and
simpler than using SQL for the same action, dramatically reducing the amount of time and effort
that goes into cleaning. Consider a data set with a lot of missing data. Built-in Pandas functions
such as fillna and dropna allow data scientists to treat all empty cells in a range the same way.
Those cells can be filled with the mean, median or specific values (fillna) or removed entirely
(dropna). Other large-scale cleanup activities like removing duplicates can also be handled with
individual lines of code rather than the time-intensive processes that must be used to complete
the same task in SQL.
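A short sketch of those cleanup steps in pandas follows; the table and its values are invented for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical raw table with gaps and a duplicate row
df = pd.DataFrame({
    "account": ["a", "b", "b", "c", "d"],
    "mrr":     [100.0, np.nan, np.nan, 250.0, 80.0],
    "region":  ["us", "eu", "eu", None, "us"],
})

df = df.drop_duplicates()                       # remove exact duplicate rows
df["mrr"] = df["mrr"].fillna(df["mrr"].mean())  # fill gaps with the column mean
df = df.dropna(subset=["region"])               # drop rows still missing a region

print(df)
```

Each of these operations is a single line; the equivalent SQL would require noticeably more code for the deduplication and conditional fills.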
Cleanup queries like this are shorter and simpler in the advanced languages than similar queries
would be in SQL, which results in fewer resources used and a lighter load on the system. If the entire
data system is running more efficiently, it frees up resources to run more queries and get to the
results faster, which means the data team has more time to search through the resulting data for
insights. All of those efficiencies add up to a lot more room for the data team to creatively analyze
the data and provide value to a company.
Cleaning Data in Periscope Data
Using R and Python to perform data cleanup in Periscope Data is simple. Just pull the data from
SQL and then pass it into one of the more advanced languages in the Periscope editor. From
there, a data team can run the efficient, scripted cleanup processes to prepare data for analysis in
a fraction of the time. With clean data, the teams can use the time they saved to explore deeper
questions about the information and build more advanced charts to illustrate their findings.
Restructuring Datasets
Over the course of the analysis process, a dataset needs to go through several different
transformations. Once the information has been collected and cleaned, the scientists and
analysts need to be able to manipulate the information in multiple ways to best represent
important insights. Often, this involves long in-depth analysis sessions and several iterations
of research to look at a question from new angles.
When the data teams have found an important insight, they still need to be able to visualize that
information clearly for relevant stakeholders. When the goal shifts from performing calculations
for generating insights to displaying information in the most easy-to-understand way, the data
will likely need to be manipulated again. In Python, we call this manipulation pivoting and melting.
R can also perform this operation, using the spread and gather functions in the tidyr library.
To illustrate these concepts, study the two images below. Pivoting (or spreading) is changing the
data from the image on the left to look like the image on the right. Melting (or gathering) is the
opposite, moving from the image on the right to the image on the left.
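The same round trip can be sketched in pandas; the metric names and values below are invented:

```python
import pandas as pd

# Long ("melted") table: one row per (month, metric) pair
long = pd.DataFrame({
    "month":  ["Jan", "Jan", "Feb", "Feb"],
    "metric": ["signups", "churns", "signups", "churns"],
    "value":  [120, 8, 150, 11],
})

# Pivot (tidyr's spread): one column per metric, easier to read as a table
wide = long.pivot(index="month", columns="metric", values="value").reset_index()

# Melt (tidyr's gather): back to long form, easier to map as chart series
back = wide.melt(id_vars="month", var_name="metric", value_name="value")

print(wide)
print(back)
```

Pivoting turns the long table into the wide one; melting reverses it, exactly the left-to-right and right-to-left moves described above.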
Using R and Python to Restructure Datasets
Unlike SQL, R and Python are languages that allow for both preparation and visualization of data.
In SQL, queries can be run to produce tables, which can then be passed into a platform like Periscope Data for charting. Since the data is being prepared in one environment and visualized in
another, some of the formatting options are lost in the process. The structure of those data tables
limits what visuals can be made with the information.
With R and Python, there are libraries that make data manipulation extremely simple, which results
in significantly improved visuals. Simply restructuring data tables can have an enormous effect on
the overall readability of the information. Making a wide table into a long table allows the series to
be mapped more easily. Doing the reverse makes the data easier to read as a table.
These languages also allow queries to be run with less stress on a system. For example, in Python,
converting object datatypes to the category type reduces memory use, enabling faster analysis
and faster plotting. However your team analyzes and presents information, the flexibility to shift
the data around is crucial to effectively translating insights into operations.
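The memory point is easy to demonstrate in pandas; the plan names below are hypothetical:

```python
import pandas as pd

# A column of repetitive strings stored two ways
plans = ["free", "pro", "team", "pro", "free"] * 20000

as_object = pd.Series(plans)                      # default object dtype: one Python string per row
as_category = pd.Series(plans, dtype="category")  # small integer codes plus one lookup table

print(as_object.memory_usage(deep=True))
print(as_category.memory_usage(deep=True))  # far smaller
```

Because the category dtype stores each distinct string once and references it with compact integer codes, repetitive text columns shrink dramatically, which speeds up both analysis and plotting.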
Restructuring Data in Periscope Data
Periscope Data makes it easy for teams to reformat data tables in advanced languages. Just run
a SQL query to process the dataset, then pass the table into R or Python and use the
reshape2/tidyr or pandas libraries to execute the transformation with a simple command. Using just SQL,
manipulations like this would take multiple lines of complex transposing code. In some cases,
queries that would take 50-100 lines to perform and run in SQL can be managed in R or Python
with just a single line.
Once the data has been transformed, R and Python offer more advanced charting libraries that
can create complex, customized visuals for data teams. Once visuals have been created to fit the
specifications and preferences of key stakeholders, data teams can pass them directly into
Periscope Data to be included in shared dashboards.
Go Deeper, See Clearer with Periscope Data
With R and Python, your team can analyze and visualize data in exciting new ways, unveiling
valuable insights that were hidden using SQL alone. These deeper analyses are the next step
in the evolution of the data industry, allowing skilled data scientists to create new insights and
explore questions that previously could not be answered.
The best part of supporting R and Python in Periscope Data is that charts are always saved, so
they can be updated with current data using a single click of the refresh button. That means an
entire team of analysts can explore data at the same depth as the most advanced data scientist
on the team.
If you want to see how your data team can use R and Python to go deeper on your data, set up a
free trial of Periscope Data. If you have any questions, you can also request a contact and one of
our experts will reach out to you soon.
About Periscope Data
Periscope Data builds software that turns data teams into superheroes. Its Unified Data Platform is the industry’s first to address the complete
analytics lifecycle, allowing data teams to ingest, store, analyze, visualize and report on data all from one connected platform. This empowers
them to collaborate and drive faster insight, while allowing businesses to foster a data-driven culture around a single source of truth. Periscope
Data serves 1000+ customers globally, including Adobe, Crunchbase, EY, Flexport, New Relic, Supercell, Tinder and ZipRecruiter.