201803
R and Python in Periscope Data
7 Ways to Go Deep on Your Data Using Python and R
With new support for R and Python in Periscope, data teams can perform a range of new analyses
that will translate into new value for their organizations. These advanced languages open up
new possibilities such as machine learning, natural language processing, advanced visualizations
and prompt data cleaning. With support for these languages, data teams can explore answers to
questions that couldn’t even be asked using SQL alone.
Collecting data is important for companies looking to make more informed decisions, but the
organizations that will benefit the most from the recent data boom are the ones that analyze
information at the deepest levels. Periscope Data is built to streamline that analysis process,
turning data into value as efficiently and quickly as possible. Our platform connects directly to all
of your data sources and allows you to use SQL, a basic coding language, to prepare data tables
for visualization. This workflow is a great way to analyze some data, but we wanted a way to give
data teams a more robust toolbox and allow them to analyze their data in ways that would create
even more value for their companies.
To better empower our customers’ data teams, Periscope added support for R and Python, two
advanced coding languages. Periscope Data customers can still use SQL to prepare their data,
but now they have the ability to pass those SQL tables into R or Python to perform deeper
analysis. These languages are open-ended and continually add new libraries, meaning data teams
can explore at a deeper level and generate insights that were not discoverable using SQL alone.
To help illustrate how data teams can use R and Python to conduct deeper analysis of their data,
the Periscope team has compiled seven suggestions for practical uses of these languages. This list
is by no means exhaustive, but these items are meant as a place for data scientists and analysts to
start a conversation about the insights they’re producing and the value they want to create.
Machine Learning
In the last couple of years, the field of machine learning has become increasingly popular as data
teams shift their focus toward predictive and prescriptive analytics. While the overall concept is
still relatively nebulous, there’s a growing realization of the importance of using existing data to
attempt to intelligently identify key indicators and predict future outcomes.
In order for machine learning to be successful, it’s more important than ever that the data set
used to make a prediction is clean. Bad or incomplete data can lead to bad predictions, which is
detrimental to your organization and also detrimental to establishing a culture of data-based
decisions. Luckily, the advanced languages that are now a part of the Periscope Data offering
enable teams to clean data more efficiently and lay the groundwork for intelligent predictions.
Periscope Data: 7 Ways to Go Deep on Your Data Using Python and R
2
Using R and Python for machine learning
SQL-only analysis platforms do not have the capability to build machine learning models. R and
Python include many analysis capabilities that are needed to identify complex patterns in existing
datasets. Data teams are building complex machine learning models that look at past data with
a known outcome and apply algorithms that relate inputs to that known outcome. The same
algorithm can then be used with new inputs to predict unknown futures as accurately as possible.
In short, if data can be used to draw a line from the past to the present, the same line may be
applicable when extended from the present into the future.
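The "draw a line, then extend it" idea can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the monthly revenue figures are invented, and NumPy's polyfit stands in for whatever model a data team would actually choose.

```python
import numpy as np

# Hypothetical history: month index and revenue (in $k), invented for illustration.
months = np.arange(1, 13)
revenue = np.array([10.2, 11.1, 11.9, 13.2, 13.8, 15.1,
                    16.0, 16.8, 18.1, 19.0, 19.9, 21.2])

# Hold out the last three months to check the model against data it has not seen.
train_x, test_x = months[:9], months[9:]
train_y, test_y = revenue[:9], revenue[9:]

# Fit a straight line (degree-1 polynomial) to the training months.
slope, intercept = np.polyfit(train_x, train_y, deg=1)

# Test: compare predictions with the held-out, known outcomes.
test_pred = slope * test_x + intercept
mean_abs_error = float(np.mean(np.abs(test_pred - test_y)))

# Extend the same line into a future month with no known outcome.
forecast_month_13 = float(slope * 13 + intercept)

print(f"held-out mean absolute error: {mean_abs_error:.2f}")
print(f"forecast for month 13: {forecast_month_13:.2f}")
```

Holding out the most recent months is the important design choice here: if the line fits the held-out data poorly, there is little reason to trust it when extended into the future.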
For example, consider a company’s lead scoring process. Data analysts can look backward to build
a model that attempts to determine the value of specific lead characteristics and interactions
between a company and its prospects. That model can then be tested to see if it holds true with
new data. From there, they can adjust their strategy to focus on certain prospects and replace
low-value interactions with high-value ones. The same model could even calculate how much new
revenue those changes will generate. In this instance, machine learning can be used both as a
predictive and a prescriptive technique.
The future of machine learning is wide open and limited only by imagination and the capabilities
of your team’s data platform.
Machine learning in Periscope Data
If you look at machine learning as combining clean data with advanced regression analysis tools,
R and Python enable Periscope Data customers to address both needs. Periscope allows users to
easily create data sets from their database that they can use to train a machine learning model
and then test against unknown data. Companies that have a data scientist who can build models
using regression analysis can make machine learning models that are accessible to their entire
team of analysts.
As the field of machine learning advances, new and existing R and Python libraries will be
updated to accommodate new capabilities. Teams that record reliable, clean data will be able
to use that information in a variety of new ways to predict new things. These teams will
increasingly add value to their organizations as machine learning becomes more concrete in
the upcoming years.
Advanced Statistical Analysis
Data analysis is not new. Companies have been storing and tracking numbers for a long time.
What started in basic spreadsheets has advanced to pivot tables, SQL queries and beyond that,
advanced coding languages. With each step of this evolution, businesses are empowered to move
away from simply describing what has happened into a more thorough picture of understanding
why things happen and better predicting what is going to happen next. As data teams get more
advanced, they move closer and closer to the executive decision makers, working together to
identify operational moves that will optimize results.
Using R and Python for advanced statistical analysis
SQL is a descriptive language; it does a great job of answering questions about what is happening.
R and Python allow teams to answer questions about why something is happening. This evolution
of analysis means executives want data-based understanding of what is actually meaningful. Your
data team needs to be able to identify variables that positively correlate with intended business
outcomes. Using SQL alone doesn’t allow for this, but a language like R can handle this type of
query with just a single line.
To illustrate how easy writing advanced analytics can be with these languages, compare the two
snippets below. The first is SQL code intended to determine a correlation between just two variables.
The second is the same type of function implemented in R, able to analyze many relationships
at once in a large matrix of data.
In SQL
-- Pearson correlation (r) between every pair of users, based on shared movie ratings
SELECT
  user1, user2,
  ((psum - (sum1 * sum2 / n)) / sqrt((sum1sq - pow(sum1, 2.0) / n) * (sum2sq - pow(sum2, 2.0) / n))) AS r,
  n
FROM
  -- step1: per-pair sums needed by the correlation formula
  (SELECT
    n1.user AS user1, n2.user AS user2,
    SUM(n1.rating) AS sum1,
    SUM(n2.rating) AS sum2,
    SUM(n1.rating * n1.rating) AS sum1sq,
    SUM(n2.rating * n2.rating) AS sum2sq,
    SUM(n1.rating * n2.rating) AS psum,
    COUNT(*) AS n
  FROM
    testdata AS n1
  LEFT JOIN
    testdata AS n2
  ON
    n1.movie = n2.movie
  WHERE
    n1.user > n2.user
  GROUP BY
    n1.user, n2.user) AS step1
ORDER BY
  r DESC,
  n DESC
In R
cor(df)
Incorporating more advanced languages into your data approach opens the door for more
powerful executive conversations. R and Python help your team uncover patterns that were
previously not visible and help you quantify your findings to easily illustrate their importance.
Imagine that your team was tasked with finding ways to minimize customer churn. Using
Periscope’s advanced language support, you could pass a table of data from SQL into R and run
correlations across as many variables as you can imagine. Any factors that show a strong enough
correlation would be a great starting point for identifying causal relationships with churn.
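As a rough sketch of that churn exploration, the pandas corr method plays the same role in Python that cor(df) plays in R. Everything in the table below is synthetic: the column names are hypothetical, and churn is constructed to depend on two of the columns so that there is something to find.

```python
import numpy as np
import pandas as pd

# Synthetic customer table; every column name here is hypothetical.
rng = np.random.default_rng(seed=0)
n = 500
support_tickets = rng.poisson(lam=2.0, size=n)
logins_per_week = rng.normal(loc=5.0, scale=2.0, size=n)
seats = rng.integers(1, 50, size=n)

# By construction, churn risk rises with tickets and falls with logins.
risk = 0.15 * support_tickets - 0.10 * logins_per_week + rng.normal(0, 0.3, size=n)
churned = (risk > np.median(risk)).astype(int)

df = pd.DataFrame({
    "support_tickets": support_tickets,
    "logins_per_week": logins_per_week,
    "seats": seats,
    "churned": churned,
})

# One call computes pairwise Pearson correlations for every numeric column.
corr_matrix = df.corr()

# Rank the other variables by the strength of their correlation with churn.
churn_corr = corr_matrix["churned"].drop("churned")
ranked = churn_corr.reindex(churn_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

The ranked output is only a list of candidates. A strong correlation says where to look, not what causes churn.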
To establish causal relations, your team could make educated recommendations about
operational adjustments based on those variables. Those hypotheses could be tested for a period
and anything that shows a statistically significant improvement can be formally adopted into your
company’s process. It’s an easy way to cut through the noise and pinpoint exactly what is driving
your business. These new languages give analysts the tools to look for connections that they think
are relevant while still allowing data to be the ultimate decision maker.
To take statistical analysis to the next level, data teams that have skilled scientists can use the
new languages to perform detailed predictive analysis, such as logistic regressions and cluster
analysis, that can be used to create machine learning models for even more advanced
recommendations. Even better, these languages are created and improved by thousands of
experts, with new capabilities appearing frequently. Utilizing R and Python as part of your data
team’s analysis is a good way to ensure you always have access to the best tools available.
Advanced statistical analysis in Periscope Data
For Periscope users, data can easily be passed from SQL into R or Python, where it can be
analyzed and visualized before passing it back into a formal Periscope dashboard for distribution,
collaboration and presentation. The libraries available in those languages represent the combined
brilliance of thousands of the most skilled professionals and academics in the field.
The beautiful thing about Periscope Data is that our agile platform allows dashboards to refresh
instantly to show new information. There’s no need to download any new information or re-run
reports to get new visualizations. This approach allows a team’s most skilled data scientists to
avoid doing the same work twice, freeing them up to work on more complex analyses that illustrate
a deeper understanding of emerging business needs. The overall effect of this model is that every
analyst, employee and executive can view data from the perspective of the most advanced
scientist at the company.
Complex Visuals
When it comes to key findings from data analysts, insights are only as valuable as the data team’s
ability to communicate them to the rest of the company. One of the most impactful ways data
teams can communicate is by turning ideas into visuals to share with internal stakeholders. The
biggest obstacle with this approach is that the information is then limited by what a particular
chart can and cannot display. Basic charts can illustrate basic learnings, but it’s impossible to
show more complex concepts without a set of complex charting tools.
These advanced charts often explore many phenomena at once and need to be created one layer
at a time. Analysts usually build these charts in an iterative process, approaching one factor, and
one visual layer, at a time. Depending on the depth of the data, this process can take several hours
or even days to analyze the information and create the corresponding graphic.
Using R and Python for more complex visuals
Many analysts use basic charting to efficiently show basic trends: how KPIs change over time, what
the current value is or the magnitude of a metric across a category. In many cases, those charts
are sufficient, but the level of analysis that simple visuals provide may not convey enough
information for an effective business decision to be made.
The charting libraries available with R and Python allow data teams to have control over visuals far
beyond what has historically been the data analyst toolkit. These charts make deep analysis easier
by opening the door for further visualization mappings to illustrate many variables at once.
Some visuals, especially the more complex ones, are designed to tell multiple stories. Consider the
chart below, which displays the mileage performance of vehicles with different engine types. At first
glance, vehicles with fewer cylinders appear to get better overall mileage while driving in the city.
But there’s more to this chart: the 8-cylinder engine has a unique shape that needs explanation, the
4-cylinder engine has a long tail while the other two have definite limits, and there’s a peculiar
bimodal distribution in all three.
This chart can be examined for more findings, but it’s clear that the complete story this data is
telling goes deeper than anything that could be derived from simple tables or bar charts.
Creating more complex visuals in Periscope Data
Complex visuals are made easy in R and Python. For example, in R, the ggplot2 package allows
very detailed control over the aesthetics of a chart. To map variable_a to transparency, set
alpha=variable_a inside aes(). To map variable_b to color, set color=variable_b.
Python also has a lot of customization with visuals through matplotlib. Data teams may prefer to
use either language, depending on their background, to visualize data in any way that helps them
communicate with their stakeholders. Once the visuals have been created in R or Python, they can
be saved directly onto a Periscope Data dashboard and will appear beside all the other charts.
In addition to the granular visual mapping abilities, R and Python also give chart creators more
controls to better customize the final visualization that appears in their Periscope Data dashboard.
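For Python users, the same kind of mapping can be sketched with matplotlib. This is one possible approach, not a prescribed one: the data is random, variable_a and variable_b are hypothetical columns, and the step of placing the image on a Periscope Data dashboard is not shown.

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen; no display is needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=1)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + rng.normal(0, 3, size=200)
variable_a = rng.uniform(0, 1, size=200)    # hypothetical, mapped to transparency
variable_b = rng.uniform(0, 100, size=200)  # hypothetical, mapped to color

# Map variable_b onto a colormap, producing one RGBA row per point.
norm = plt.Normalize(vmin=variable_b.min(), vmax=variable_b.max())
cmap = plt.get_cmap("viridis")
colors = cmap(norm(variable_b))

# Overwrite the alpha channel with variable_a, rescaled to 0.2-1.0 so no point vanishes.
colors[:, 3] = 0.2 + 0.8 * variable_a

fig, ax = plt.subplots(figsize=(6, 4))
points = ax.scatter(x, y, c=colors, s=30)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Color mapped to variable_b, transparency to variable_a")
fig.colorbar(plt.cm.ScalarMappable(norm=norm, cmap=cmap), ax=ax, label="variable_b")

# Write the PNG to an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=100)
```

Building the RGBA array by hand keeps both mappings explicit, which is the closest matplotlib analogue to listing two aesthetics side by side in ggplot2.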
Additional Charting Libraries
At Periscope Data, the most common feature request we receive from current customers is new
chart options. Rather than handle these as a series of one-off feature builds, we decided to give
more power to our customers by enabling them to use R and Python to create their own
visualizations that could then be displayed in dashboards.
With these advanced languages, Periscope Data customers are no longer limited to a predetermined
list of visualization options. Instead, those tables generated in SQL can now be passed into R
or Python, where the information can be analyzed deeply and displayed in an unlimited variety
of ways. Data analysts are creative problem solvers and with the right tools, they can creatively
illustrate their findings in any way that helps them convey their insights effectively.
Using R and Python to supplement charting libraries
Periscope Data was designed to give analysts a toolkit of charts that are easy to read and can
effectively analyze a specific phenomenon — lines, bars, pies, tables, scatter plots, etc. These
charts are widely accessible, but a simple charting toolkit also means a simple range of insights
that can be communicated. As insights expand to increasingly complex research, more complex
charting tools will be needed to illustrate those findings. Using R and Python to expand the
charting possibilities means an expansion of the types of information that can be studied and
conveyed with those visualizations.
Although R and Python were only recently made available to Periscope Data customers, we’ve
seen data teams creating a range of new visualizations: box and whisker charts, treemaps,
formatted tables, customized number overlays, calendar-like charts, heat maps, log scales,
quadrant charts and more. These charts all come with the ease of collaboration and sharing
that customers expect from Periscope Data. There really is no limit to what data teams can create
with R and Python.
Not only do the advanced languages offer new types of charts, they can also be used to better
customize the visualizations that are made. With R and Python, chart creators have much greater
control over formatting options like color, font and overlays, so charts can be made in any style
that your business needs. These charts are not JavaScript-enabled, but can be powerful in
creatively communicating any message.
Supplementing the Periscope Data charting libraries
Using R and Python to create new types of charts in Periscope Data is simple. Just run a standard
SQL query, and select the desired language from the dropdown below the SQL editor. From there,
you can use 25+ supported charting libraries to customize your visualization. When R and Python
evolve and add new libraries, your charting options in Periscope Data will grow even more robust.
As the visualization options expand, data teams can perform more advanced analyses, find
previously hidden insights and add even more value to their companies.
If you’re interested in exploring creative new charting options, visit the data visualization
catalogue to get inspiration.
Natural Language Processing
Most data analysis is performed on quantitative data, but companies that limit their research
to just numbers could be missing out on valuable information. To maximize their value to an
organization, data teams should start looking into ways to extract value from an additional type
of information – text. To perform this analysis, teams use Natural Language Processing (NLP) to
decode human language.
While SQL has some light NLP capabilities, Python offers huge libraries that can analyze this type
of information at incredible depth. More advanced languages, such as Python, can adapt and grow
as new discoveries are made in NLP, so as human speech processing evolves, companies will be
able to gain the right insights, including meaning and context, from their text database.
Using Natural Language Processing to generate new insights
The problem with qualitative data in general is a lack of structure. Numerical data is inherently
more sequential and can be organized and analyzed easily. Text-based data is unstructured and
often disorganized. Libraries like Natural Language Toolkit (NLTK) in Python have been built to
help create order out of human speech patterns and perform intensive analysis on what is being
communicated. NLTK does more than just convert words into searchable sets of characters,
it actually works to dissect human language construction and derive the intended meaning from
a sequence of words.
NLP is still in its infancy, but companies are already using it to generate some meaningful insights.
A basic example is sentiment analysis, where data teams can identify what is being said about
a product or a brand and analyze those statements for meaning. It’s much simpler than putting
together a formal focus group and provides a more concrete understanding of the conversation.
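To make the idea of sentiment analysis concrete, here is a deliberately tiny sketch. It does not use NLTK; it swaps in a hand-written word list and Python's standard library, and both the word lists and the sample reviews are invented. A real analysis would rely on a much richer lexicon or a trained model.

```python
import re
from collections import Counter

# Tiny hand-made lexicons, invented for this example.
POSITIVE = {"love", "great", "fast", "helpful", "easy"}
NEGATIVE = {"slow", "broken", "confusing", "hate", "crash"}

def sentiment_score(text: str) -> int:
    """Return (positive word count - negative word count) for one piece of text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    positive = sum(counts[word] for word in POSITIVE)
    negative = sum(counts[word] for word in NEGATIVE)
    return positive - negative

reviews = [
    "I love the new dashboard, it is fast and easy to use.",
    "The export is broken and the editor is slow.",
    "Setup was confusing, but support was helpful.",
]
scores = [sentiment_score(review) for review in reviews]
print(scores)
```

The third review scores zero because one positive and one negative word cancel out, which hints at why simple word counting misses context and why dedicated NLP libraries exist.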
Any company that uses text as a major unit of data can generate value from NLP technology; they
just need to think of a way to provide structure around the outcome they intend to get. NLTK is
an advanced tool that is capable of sorting through volumes of text-based data, but just like your
quantitative data sets, the best insights only come from a well-structured, carefully designed
analysis. The possibilities for NLP-based insights are limitless and tools like NLTK will only
continue to improve. With the right vision and analytical execution, data teams that utilize the
tools correctly will give their companies a major advantage.
Natural Language Processing in Periscope Data
One Periscope Data customer that is making the most of NLP is Crisis Text Line, a free,
anonymous 24/7 text-based crisis intervention system that aims to mitigate crises by connecting
people to counselors who are trained to cool down hot moments. They use natural language
processing and machine learning to pull insights from their rich data set and identify keywords
in texts to help steer a counselor toward a safe resolution. Later, a second phase of this process
utilizes a large community of professional counselors to analyze conversations based on common
keywords and tags to help assess trends and train counselors to have high-quality conversations
with texters.
This innovative approach to predictive modeling allows Crisis Text Line to detect keywords that
identify and predict trends in real time. The Crisis Text Line data team uses Periscope Data to
conduct this complex analysis and quickly visualize the results. In the near future, the team plans
to set up a self-service data environment that will empower counselors to access information
without help from the data team. This setup would give counselors quicker access to data and
ultimately lead to better-informed conversations with texters. Often, the end users have difficulty
predicting the needs of texters ahead of time, so a data tool that relies on upfront modeling is
ineffective. An agile data environment like Periscope’s allows the team of counselors to find
answers on their own.
Data Cleaning
However your team does data analysis, there’s a universal truth — the insights you collect are only
going to be as good as the data that goes into finding them. Clean data sets are imperative in the
analytical process, so data teams spend a lot of time making sure their data is as good as possible
before running an analysis. The problem is that the data-cleaning process is long and manual,
taking 60-80% of a data scientist’s time.
With all that effort spent preparing the data for analysis, these scientists hardly have any time
left to actually perform the research and find insights in the data. Advanced coding languages,
such as R or Python, include packages that will assist with data cleanup, giving data teams more
bandwidth to perform analysis and better tools to dive deeper into the clean data sets.
Using R and Python to Clean Data Better
Scripting languages such as Python and R can assist with data cleanup, allowing data scientists
to do bulk cleanup. For example, Python’s re library makes string operations much faster and
simpler than using SQL for the same action, dramatically reducing the amount of time and effort
that goes into cleaning. Consider a data set with a lot of missing data. Built-in pandas functions
such as fillna and dropna allow data scientists to treat all empty cells in a range the same way.
Those cells can be filled with the mean, median or specific values (fillna) or removed entirely
(dropna). Other large-scale cleanup activities like removing duplicates can also be handled with
individual lines of code rather than the time-intensive processes that must be used to complete
the same task in SQL.
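The cleanup steps named above can be sketched together on a small, made-up table. The column names and values are hypothetical; re.sub, drop_duplicates, fillna and dropna are the standard library and pandas calls.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical raw export with messy phone strings, gaps, and a duplicated row.
raw = pd.DataFrame({
    "customer": ["Ada", "Ben", "Ben", "Cy", "Di"],
    "phone": ["(415) 555-0101", "415.555.0102", "415.555.0102", None, "415 555 0104"],
    "monthly_spend": [120.0, np.nan, np.nan, 80.0, 100.0],
})

def digits_only(value):
    """Strip every non-digit character; leave missing values untouched."""
    if pd.isna(value):
        return value
    return re.sub(r"\D", "", value)

# Remove the repeated "Ben" row in one line.
clean = raw.drop_duplicates().copy()

# Normalize phone formats with a regular expression.
clean["phone"] = clean["phone"].map(digits_only)

# fillna: replace missing spend with the column mean.
clean["monthly_spend"] = clean["monthly_spend"].fillna(clean["monthly_spend"].mean())

# dropna: discard rows that still lack a phone number.
clean = clean.dropna(subset=["phone"]).reset_index(drop=True)
print(clean)
```

Whether to fill a gap or drop the row is an analytical decision, not a technical one. Filling with the mean keeps the row but flattens variation, so it is worth choosing per column.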
Cleanup queries like this are shorter and simpler in the advanced languages than similar queries
would be in SQL, which results in fewer resources used and a lighter load on the system. If the entire
data system is running more efficiently, it frees up resources to run more queries and get to the
results faster, which means the data team has more time to search through the resulting data for
insights. All of those efficiencies add up to a lot more room for the data team to creatively analyze
the data and provide value to a company.
Cleaning Data in Periscope Data
Using R and Python to perform data cleanup in Periscope Data is simple. Just pull the data from
SQL and then pass it into one of the more advanced languages in the Periscope editor. From
there, a data team can run the efficient, scripted cleanup processes to prepare data for analysis in
a fraction of the time. With clean data, the teams can use the time they saved to explore deeper
questions about the information and build more advanced charts to illustrate their findings.
Restructuring Datasets
Over the course of the analysis process, a dataset needs to go through several different
transformations. Once the information has been collected and cleaned, the scientists and
analysts need to be able to manipulate the information in multiple ways to best represent
important insights. Often, this involves long in-depth analysis sessions and several iterations
of research to look at a question from new angles.
When the data teams have found an important insight, they still need to be able to visualize that
information clearly for relevant stakeholders. When the goal shifts from performing calculations
for generating insights to displaying information in the most easy-to-understand way, the data
will likely need to be manipulated again. In Python, we call this manipulation pivoting and melting.
R can also perform this operation, using the spread and gather functions in the tidyr library.
To illustrate these concepts, study the two images below. Pivoting (or spreading) is changing the
data from the image on the left to look like the image on the right. Melting (or gathering) is the
opposite, moving from the image on the right to the image on the left.
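In pandas, the two operations might look like the sketch below. The signup table is invented for illustration; pivot and melt are the pandas methods that correspond to spreading and gathering.

```python
import pandas as pd

# Long ("melted") layout: one row per (month, channel) observation. Values are made up.
long_df = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "channel": ["web", "mobile", "web", "mobile"],
    "signups": [100, 40, 120, 55],
})

# Pivot (spread): one row per month, one column per channel.
wide_df = long_df.pivot(index="month", columns="channel", values="signups")

# Melt (gather): back to one row per observation.
melted = wide_df.reset_index().melt(
    id_vars="month", var_name="channel", value_name="signups"
)

print(wide_df)
print(melted)
```

The round trip returns the same observations, possibly in a different row order, which is a quick way to confirm that a reshape has not lost or duplicated data.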
Using R and Python to Restructure Datasets
Unlike SQL, R and Python are languages that allow for both preparation and visualization of data.
In SQL, queries can be run to produce tables, which can then be passed into a platform like Periscope Data for charting. Since the data is being prepared in one environment and visualized in
another, some of the formatting options are lost in the process. The structure of those data tables
limits what visuals can be made with the information.
With R and Python, there are libraries that make data manipulation extremely simple, which results
in significantly improved visuals. Simply restructuring data tables can have an enormous effect on
the overall readability of the information. Making a wide table into a long table allows the series to
be mapped more easily. Doing the reverse makes the data easier to read as a table.
These languages also allow queries to be run with less stress on a system. For example, in Python,
converting object datatypes to category type reduces the memory used, enabling faster analysis
and faster plotting. However your team analyzes and presents information, the flexibility to shift
the data around is crucial to effectively translating insights into operations.
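The memory effect of the category type is easy to check in pandas. The column below is hypothetical, and the savings shown depend on it having only a few distinct values relative to its length; a column of mostly unique strings would see little or no benefit.

```python
import pandas as pd

# Hypothetical column with many rows but only three distinct labels.
plans = pd.Series(["free", "team", "enterprise", "team"] * 25_000, dtype="object")

# deep=True counts the memory held by the Python string objects themselves.
as_object = plans.memory_usage(deep=True)
as_category = plans.astype("category").memory_usage(deep=True)

print(f"object:   {as_object:,} bytes")
print(f"category: {as_category:,} bytes")
```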
Restructuring Data in Periscope Data
Periscope Data makes it easy for teams to reformat data tables in advanced languages. Just run
a SQL query to process the dataset, then pass the table into R or Python and use the
reshape2/tidyr or pandas libraries to execute the transformation with a simple command. Using just SQL,
manipulations like this would take multiple lines of complex transposing code. In some cases,
queries that would take 50-100 lines to perform and run in SQL can be managed in R or Python
with just a single line.
Once the data has been transformed, R and Python offer more advanced charting libraries that
can create complex, customized visuals for data teams. Once visuals have been created to fit the
specifications and preferences of key stakeholders, data teams can pass them directly into
Periscope Data to be included in shared dashboards.
Go Deeper, See Clearer with Periscope Data
With R and Python, your team can analyze and visualize data in exciting new ways, unveiling
valuable insights that were hidden using SQL alone. These deeper analyses are the next step
in the evolution of the data industry, allowing skilled data scientists to create new insights and
explore questions that previously could not be answered.
The best part of supporting R and Python in Periscope Data is that charts are always saved, so
they can be updated with current data using a single click of the refresh button. That means an
entire team of analysts can explore data at the same depth as the most advanced data scientist
on the team.
If you want to see how your data team can use R and Python to go deeper on your data, set up a
free trial of Periscope Data. If you have any questions, you can also request a contact and one of
our experts will reach out to you soon.
About Periscope Data
Periscope Data builds software that turns data teams into superheroes. Its Unified Data Platform is the industry’s first to address the complete
analytics lifecycle, allowing data teams to ingest, store, analyze, visualize and report on data all from one connected platform. This empowers
them to collaborate and drive faster insight, while allowing businesses to foster a data-driven culture around a single source of truth. Periscope
Data serves 1000+ customers globally, including Adobe, Crunchbase, EY, Flexport, New Relic, Supercell, Tinder and ZipRecruiter.