2019-08-26T14:48:10+03:00[Europe/Moscow]
https://studylib.net
409 cards
Google Cloud Professional Data Engineer
Notes for the Google Cloud Professional Data Engineer Exam
Automates data movement from SaaS applications to Google BigQuery on a scheduled, managed basis. Your analytics team can lay the foundation for a data warehouse without writing a single line of code.
BigQuery Data Transfer Service
Supports Google Ads, Campaign Manager, Google Ad Manager, and YouTube Content and Channel Owner Reports. Through BigQuery Data Transfer Service, users also gain access to data connectors that allow you to easily transfer data from Teradata and Amazon S3 to BigQuery.
Cloud Dataflow IAM roles
You can use Cloud Dataflow IAM roles to limit access for users within a project or organization, to just Cloud Dataflow-related resources, as opposed to granting users viewer, editor, or owner access to the entire Cloud Platform project.
Dataflow Admin
Minimal role for creating and managing Dataflow jobs.
Dataflow Developer
Provides the permissions necessary to execute and manipulate Cloud Dataflow jobs.
Dataflow Viewer
Provides read-only access to all Cloud Dataflow-related resources.
Dataflow Worker
Provides the permissions necessary for a Compute Engine service account to execute work units for a Cloud Dataflow pipeline.
Example role assignment
- The developer who creates and examines jobs will need the roles/dataflow.admin role.
- For more sophisticated permissions management, the developer interacting with the Cloud Dataflow job will need the roles/dataflow.developer role.
- They will need the roles/storage.objectAdmin or a related role in order to stage the required files.
Example role assignment (continued)
- For debugging and quota checking, they will need the project roles/compute.viewer role.
- Absent other role assignments, this will allow the developer to create and cancel Cloud Dataflow jobs, but not interact with the individual VMs or access other Cloud services.
- The controller service account needs the roles/dataflow.worker role to process data for the Cloud Dataflow service. It will need other roles (such as roles/storage.objectAdmin) in order to access job data.
Assigning Cloud Dataflow roles
Cloud Dataflow roles can currently be set on organizations and projects only.
Predefined roles and permissions
When an identity calls a Google Cloud Platform API, Google BigQuery requires that the identity has the appropriate permissions to use the resource. You can grant permissions by granting roles to a user, a group, or a service account.
There are three types of roles in Cloud Identity and Access Management:
- Primitive roles include the Owner, Editor, and Viewer roles that existed prior to the introduction of Cloud Identity and Access Management.
- Predefined roles provide granular access for a specific service and are managed by GCP. Predefined roles are meant to support common use cases and access control patterns.
- Custom roles provide granular access according to a user-specified list of permissions.
To determine if one or more permissions are included in a primitive, predefined, or custom role, you can use one of the following methods:
- The gcloud iam roles describe command
- The roles.get() method in the Cloud IAM API
When you assign both predefined and primitive roles to a user...
The permissions granted are a union of each role's permissions.
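As a rough illustration of the union rule, the sketch below models each role as a set of permission strings. The two sets are abbreviated, illustrative subsets picked for the example; they are not the complete definitions of any real role.

```python
# Hypothetical, abbreviated permission sets; real roles contain many more entries.
PRIMITIVE_VIEWER = {"bigquery.datasets.get", "bigquery.tables.get"}
PREDEFINED_JOB_USER = {"bigquery.jobs.create"}


def effective_permissions(*roles):
    """Return the union of the permissions of every assigned role."""
    granted = set()
    for role in roles:
        granted |= role
    return granted


user_permissions = effective_permissions(PRIMITIVE_VIEWER, PREDEFINED_JOB_USER)
print(sorted(user_permissions))
```

Assigning a second role never removes a permission in this model; it can only add to the set.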
When you assign roles at the organization and project level
You provide permission to run BigQuery jobs or to manage all of a project's BigQuery resources.
You can also assign roles at the dataset level to provide access only to one or more datasets
In the IAM policy hierarchy, BigQuery datasets are child resources of projects. Tables and views are child resources of datasets — they inherit permissions from their parent dataset.
For any job you create
You automatically have the equivalent of the bigquery.jobs.get and bigquery.jobs.update permissions for that job.
BigQuery Admin
Provides permissions to manage all resources within the project. Can manage all data within the project, and can cancel jobs from other users running within the project.
BigQuery Data Editor
When applied to a dataset, dataEditor provides permissions to:
- Read the dataset's metadata and to list tables in the dataset.
- Create, update, get, and delete the dataset's tables.
When applied at the project or organization level, this role can also create new datasets.
BigQuery Data Owner
When applied to a dataset, dataOwner provides permissions to:
- Read, update, and delete the dataset.
- Create, update, get, and delete the dataset's tables.
When applied at the project or organization level, this role can also create new datasets.
BigQuery Data Viewer
When applied to a dataset, dataViewer provides permissions to:
- Read the dataset's metadata and to list tables in the dataset.
- Read data and metadata from the dataset's tables.
When applied at the project or organization level, this role can also enumerate all datasets in the project. Additional roles, however, are necessary to allow the running of jobs.
BigQuery Job User
Provides permissions to run jobs, including queries, within the project. The jobUser role can enumerate their own jobs and cancel their own jobs.
BigQuery Metadata Viewer
When applied at the project or organization level, metadataViewer provides permissions to:
- List all datasets and read metadata for all datasets in the project.
- List all tables and views and read metadata for all tables and views in the project.
Additional roles are necessary to allow the running of jobs.
BigQuery Read Session User
Access to create and use read sessions
BigQuery User
Provides permissions to run jobs, including queries, within the project. The user role can enumerate their own jobs, cancel their own jobs, and enumerate datasets within a project. Additionally, allows the creation of new datasets within the project; the creator is granted the bigquery.dataOwner role for these new datasets.
Introduction to partitioned tables
A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data. By dividing a large table into smaller partitions, you can improve query performance, and you can control costs by reducing the number of bytes read by a query.
There are two types of table partitioning in BigQuery:
- Tables partitioned by ingestion time: Tables partitioned based on the data's ingestion (load) date or arrival date.
- Partitioned tables: Tables that are partitioned based on a TIMESTAMP or DATE column.
When you create a table partitioned by ingestion time
BigQuery automatically loads data into daily, date-based partitions that reflect the data's ingestion or arrival date. Pseudo column and suffix identifiers allow you to restate (replace) and redirect data to partitions for a specific day.
When you create ingestion-time partitioned tables
The partitions have the same schema definition as the table. If you need to load data into a partition with a schema that is not the same as the schema of the table, you must update the schema of the table before loading the data. Alternatively, you can use schema update options to update the schema of the table in a load job or query job.
BigQuery also allows partitioned tables
Partitioned tables allow you to bind the partitioning scheme to a specific TIMESTAMP or DATE column. Data written to a partitioned table is automatically delivered to the appropriate partition based on the date value (expressed in UTC) in the partitioning column.
When you create partitioned tables, two special partitions are created:
- The __NULL__ partition — represents rows with NULL values in the partitioning column
- The __UNPARTITIONED__ partition — represents data that exists outside the allowed range of dates
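The routing described above can be sketched as a small function that maps a DATE value to a partition ID. The date bounds used here (1960-01-01 to 2159-12-31) are an assumption about the allowed range and should be checked against the current BigQuery documentation; the function itself is illustrative, not part of any API.

```python
from datetime import date

# Assumed bounds of the allowed date range (not confirmed by these notes).
MIN_DATE = date(1960, 1, 1)
MAX_DATE = date(2159, 12, 31)


def partition_for(value):
    """Return the partition ID a row would land in for a DATE partitioning column."""
    if value is None:
        return "__NULL__"
    if value < MIN_DATE or value > MAX_DATE:
        return "__UNPARTITIONED__"
    # Daily partitions are identified by the date in YYYYMMDD form.
    return value.strftime("%Y%m%d")


print(partition_for(date(2019, 8, 26)))
print(partition_for(None))
print(partition_for(date(1950, 1, 1)))
```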
As an alternative to partitioned tables
You can shard tables using a time-based naming approach such as [PREFIX]_YYYYMMDD. This is referred to as creating date-sharded tables. Using either standard SQL or legacy SQL, you can specify a query with a UNION operator to limit the tables scanned by the query.
Partitioned tables perform better than tables sharded by date
When you create date-named tables, BigQuery must maintain a copy of the schema and metadata for each date-named table. Also, when date-named tables are used, BigQuery might be required to verify permissions for each queried table. This practice also adds to query overhead and impacts query performance.
Querying multiple tables using a wildcard table
A wildcard table represents a union of all the tables that match the wildcard expression. For example, the clause FROM `bigquery-public-data.noaa_gsod.gsod*` uses the wildcard expression gsod* to match all tables in the noaa_gsod dataset that begin with the string gsod.
Wildcard table queries are subject to the following limitations.
The wildcard table functionality does not support views. If the wildcard table matches any view in the dataset, the query returns an error. This is true whether or not your query contains a WHERE clause on the _TABLE_SUFFIX pseudo column to filter out the view.
Currently, cached results are not supported for queries against multiple tables using a wildcard even if the Use Cached Results option is checked. If you run the same wildcard query multiple times, you are billed for each query.
Wildcard table queries are subject to the following limitations. (cont)
Wildcard tables support native BigQuery storage only. You cannot use wildcards when querying an external table or a view.
Queries that contain Data Manipulation Language (DML) statements cannot use a wildcard table as the target of the query. For example, a wildcard table may be used in the FROM clause of an UPDATE query, but a wildcard table cannot be used as the target of the UPDATE operation.
When to use wildcard tables
Wildcard tables are useful when a dataset contains multiple, similarly named tables that have compatible schemas.
Querying sets of tables using wildcard tables
To query a group of tables that share a common prefix, use the table wildcard symbol (*) after the table prefix in your FROM statement.
Filtering selected tables using _TABLE_SUFFIX
To restrict the query so that it scans an arbitrary set of tables, use the _TABLE_SUFFIX pseudo column in the WHERE clause. The _TABLE_SUFFIX pseudo column contains the values matched by the table wildcard.
Scanning a range of tables using _TABLE_SUFFIX
To scan a range of tables, use the _TABLE_SUFFIX pseudo column along with the BETWEEN clause.
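To make the behaviour of _TABLE_SUFFIX concrete, here is a small in-memory simulation in Python. It does not call BigQuery; the function name and the table list are made up for the example. It only mimics how a prefix wildcard matches tables and how a BETWEEN filter on the suffix narrows the set.

```python
def wildcard_scan(tables, prefix, low=None, high=None):
    """Return (table, suffix) pairs that a wildcard `prefix*` would match.

    low/high mimic `_TABLE_SUFFIX BETWEEN low AND high`, an inclusive
    string comparison on the part of the name after the prefix.
    """
    matched = []
    for name in sorted(tables):
        if not name.startswith(prefix):
            continue
        suffix = name[len(prefix):]
        if low is not None and suffix < low:
            continue
        if high is not None and suffix > high:
            continue
        matched.append((name, suffix))
    return matched


# Hypothetical table names following a gsodYYYY naming pattern.
dataset = ["gsod1929", "gsod1930", "gsod1940", "gsod1941", "stations"]
print(wildcard_scan(dataset, "gsod19", low="29", high="40"))
```

With an empty prefix, every table matches and the suffix is the full table name, which is the same effect as querying all tables in a dataset.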
Scanning a range of ingestion-time partitioned tables using _PARTITIONTIME
To scan a range of ingestion-time partitioned tables, use the _PARTITIONTIME pseudo column with the _TABLE_SUFFIX pseudo column.
Querying all tables in a dataset
To scan all tables in a dataset, you can use an empty prefix and the table wildcard, which means that the _TABLE_SUFFIX pseudo column contains full table names.
Cloud Spanner: How indexes are created for you and how you can create secondary indexes.
Cloud Spanner automatically creates an index for each table's primary key column. You can also create secondary indexes for other columns. Adding a secondary index on a column makes it more efficient to look up data in that column. For example, if you need to quickly look up a set of SingerId values for a given range of LastName values, you should create a secondary index on LastName, so Cloud Spanner does not need to scan the entire table.
Indexes
Every Cloud Firestore in Datastore mode query computes its results using one or more indexes which contain entity keys in a sequence specified by the index's properties and, optionally, the entity's ancestors. The indexes are updated to reflect any changes the application makes to its entities, so that the correct results of all queries are available with no further computation needed.
There are two types of indexes:
Built-in indexes - By default, a Datastore mode database automatically predefines an index for each property of each entity kind. These single property indexes are suitable for simple types of queries.
Composite indexes - Composite indexes index multiple property values per indexed entity. Composite indexes support complex queries and are defined in an index configuration file (index.yaml).
The index-based query mechanism supports a wide range of queries and is suitable for most applications.
However, it does not support some kinds of query common in other database technologies: in particular, joins and aggregate queries aren't supported within the Datastore mode query engine.
As you design your Cloud Bigtable schema, keep the following concepts in mind:
• Each table has only one index, the row key.
• Rows are sorted lexicographically by row key.
• Columns are grouped by column family and sorted in lexicographic order within the column family.
• All operations are atomic at the row level.
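Lexicographic ordering has a practical consequence for numeric parts of a row key, which the short sketch below illustrates. The key format is hypothetical, and Python's string sort stands in for Bigtable's byte-wise ordering, which gives the same result for ASCII keys.

```python
# Hypothetical row keys: a device prefix followed by a sequence number.
unpadded = ["device#9", "device#10", "device#2"]
padded = ["device#{:04d}".format(n) for n in (9, 10, 2)]

# Character-by-character order, not numeric order: "10" sorts before "2".
print(sorted(unpadded))

# Zero-padding makes lexicographic order agree with numeric order.
print(sorted(padded))
```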
As you design your Cloud Bigtable schema, keep the following concepts in mind: (cont)
• Ideally, both reads and writes should be distributed evenly.
• In general, keep all information for an entity in a single row.
• Related entities should be stored in adjacent rows.
• Cloud Bigtable tables are sparse.
(Choosing a row key) Start by asking how you'll use the data that you plan to store.
- User information: Do you need quick access to information about connections between users (for example, whether user A follows user B)?
- User-generated content: If you show users a sample of a large amount of user-generated content, such as status updates, how will you decide which status updates to display to a given user?
- Time series data: Will you often need to retrieve the most recent N records, or records that fall within a certain time range? If you're storing data for several kinds of events, will you need to filter based on the type of event?
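For the time series question, one possible approach (an assumption for illustration, not something these notes prescribe) is to put a reversed timestamp in the row key so that the newest records sort first. The key layout, the sensor name, and the MAX_TS bound below are all hypothetical.

```python
MAX_TS = 10**13 - 1  # assumed upper bound for millisecond timestamps in this sketch


def row_key(sensor_id, ts_millis):
    """Build a key whose lexicographic order is newest-first within one sensor."""
    reversed_ts = MAX_TS - ts_millis
    # Fixed-width zero padding keeps string order consistent with numeric order.
    return "{}#{:013d}".format(sensor_id, reversed_ts)


keys = sorted(row_key("sensor-a", ts) for ts in (1000, 3000, 2000))
print(keys[0])  # the key for the most recent reading (ts=3000)
```

Reading the first N rows for a sensor prefix then returns the most recent N records. The trade-off is that keys led by a timestamp alone can concentrate writes, which is why the sensor ID comes first here.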
One key difference between Hive and Pig:
Hive is declarative and Pig is imperative.
Hive
As a declarative language, HiveQL states what result is wanted and leaves it to the underlying engine to decide how to use the available resources to produce it.
Pig
As an imperative (procedural) language, Pig Latin spells out the sequence of steps for processing the data, and Pig turns that script into an execution plan for the underlying system. This step-by-step style is why Pig can fit better into the pipeline paradigm of processing data.
Hive provides a subset of SQL.
The way it does this is by maintaining metadata to define a schema on top of the data. This is one way to work with a large amount of distributed data in HDFS using familiar SQL syntax.
Hive is designed for batch jobs and not for transactions.
It ingests data into a data warehouse format requiring a schema. It does not support real-time queries, row-level updates, or unstructured data. Some queries may run much slower than others due to the underlying transformations Hive has to implement to simulate SQL.
Pig provides SQL primitives similar to Hive, but in a more flexible scripting language format.
Pig can also deal with semi-structured data, such as data having partial schemas, or for which the schema is not yet known. For this reason it is sometimes used for Extract Transform Load (ETL). It generates Java MapReduce jobs. Pig is not designed to deal with unstructured data.
Pub/Sub
A messaging service that decouples senders and receivers, enabling asynchronous messaging for ingesting data. This is useful for ingesting streaming data. Publishers publish messages to a topic, and subscribers subscribe to the topic and receive the messages. Messages are stored until they're delivered and acknowledged by the subscribers.
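The publish, deliver, and acknowledge cycle can be sketched with a toy in-memory class. This is not the Pub/Sub client library; the class, method names, and subscription names are invented purely to show that each subscription keeps a message until that subscription acknowledges it.

```python
class ToyTopic:
    """In-memory stand-in for a topic with named subscriptions (not the real API)."""

    def __init__(self):
        self._pending = {}  # subscription name -> {message id: payload}
        self._next_id = 0

    def subscribe(self, name):
        self._pending.setdefault(name, {})

    def publish(self, payload):
        """Store the message for every subscription and return its id."""
        self._next_id += 1
        for backlog in self._pending.values():
            backlog[self._next_id] = payload
        return self._next_id

    def pull(self, name):
        """Return unacknowledged messages; they stay stored until acked."""
        return sorted(self._pending[name].items())

    def ack(self, name, message_id):
        self._pending[name].pop(message_id, None)


topic = ToyTopic()
topic.subscribe("dashboard")
topic.subscribe("archive")
msg_id = topic.publish("temperature=21.5")
topic.ack("dashboard", msg_id)  # only the dashboard subscription acknowledges
```

The publisher never refers to a subscriber directly, which is the decoupling the card describes.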
Dataflow
A service that provides a unified method for both batch and streaming data processing. Dataflow builds on the Apache Beam SDK.
BigQuery
A data warehouse service with a highly scalable interactive data exploration interface. It can capture and analyze data in real time. So it makes a great partner for data that's been ingested through Pub/Sub and processed through Dataflow.
Consider the long term goal of a pure serverless solution...
You create a pipeline job in Python or Java using Apache Beam. Data is read in batch from Cloud Storage or as a stream through Pub/Sub, and it is processed in Dataflow. The results are stored in BigQuery for interactive analysis or in Bigtable for use by applications. There's nothing to maintain.
Cloud Storage is an object store. Objects have several properties:
Among them is "class", which declares the frequency of anticipated access of the object. This allows the Cloud Storage service to distinguish between types of objects and internally manage them to service levels.
Cloud Storage is very good at bulk and parallel operations on larger objects. So for performance, keep these in mind:
● Avoid small reads
● Use large block/file sizes where possible
● Avoid iterating over many nested directories in a single job
Latency is higher for data in Cloud Storage than HDFS on a Persistent Disk in the cluster.
Throughput for processing data in Cloud Storage is higher than throughput for HDFS on a Persistent Disk in the cluster.
Cloud Bigtable stores data in a file system called Colossus.
Colossus also contains data structures called Tablets that are used to identify and manage the data. And metadata about the Tablets is what is stored on the VMs in the Bigtable cluster itself.
Cloud Bigtable has three levels of operation.
It can manipulate the actual data. It can manipulate the Tablets that point to and describe the data. Or it can manipulate the metadata that points to the Tablets. Rebalancing tablets from one node to another is very fast, because only the pointers are updated.
Cloud Bigtable is a learning system.
It detects "hot spots" where a lot of activity is going through a single Tablet and splits the Tablet in two. It can also rebalance the processing by moving the pointer of a tablet to a different VM in the cluster. So its best use case is with big data, above 300 GB.
When a node is lost in the cluster, no data is lost.
Recovery is fast because only the metadata needs to be copied to the replacement node. Colossus provides better durability than the default 3 replicas provided by HDFS.
Cloud Bigtable has only one index.
That index is called the Row Key. There are no alternate indexes or secondary indexes. And when data is entered, it is organized lexicographically by the Row Key.
RISC
(Reduced Instruction Set Computing). Simplify the operations. And when you don't have to account for variations, you can make those that remain very fast.
The most important control over resource consumption and costs is...
Writing a query that controls the amount of data processed. In general, this is done with SELECT by choosing subsets of data at the start of a job rather than by using LIMIT which only omits data from the final results at the end of a job.
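The difference between narrowing the SELECT and adding a LIMIT can be sketched with a toy cost model of a column-oriented table. The column names and sizes are made up, and the model deliberately ignores partition pruning and other optimizations; it only shows that, in this simplified view, bytes read depend on which columns are selected and not on LIMIT.

```python
# Hypothetical table: storage size of each column in bytes.
COLUMN_BYTES = {"user_id": 8_000_000, "event": 40_000_000, "payload": 900_000_000}


def bytes_scanned(selected_columns, limit=None):
    """Estimate bytes read for a query in this toy model.

    `limit` is accepted but intentionally unused: it trims the final
    result and does not reduce how much column data is read.
    """
    return sum(COLUMN_BYTES[col] for col in selected_columns)


select_star_limit = bytes_scanned(COLUMN_BYTES, limit=10)  # like SELECT * ... LIMIT 10
two_columns = bytes_scanned(["user_id", "event"])           # like SELECT user_id, event

print(select_star_limit, two_columns)
```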
The easy path to migrate your Hadoop workload to Dataproc is...
-Step one, copy the data to Cloud Storage.
-Step two, update the file prefix in your application scripts from HDFS to GS. You just change hdfs:// to gs://.
-Step three, create a Dataproc cluster and run your job. You can also stage your scripts in Cloud Storage. In most cases, that's all you have to do to get your current job running in Dataproc.
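Step two can be sketched as a small path-rewriting helper. The function name and the bucket name are hypothetical, and dropping the namenode host from the HDFS URI is an assumption about how the paths in your scripts are written.

```python
def to_gcs_uri(path, bucket):
    """Rewrite an hdfs:// URI to gs://bucket/..., leaving other paths untouched."""
    prefix = "hdfs://"
    if not path.startswith(prefix):
        return path
    # Drop the namenode authority (host:port) that follows the scheme, if any.
    remainder = path[len(prefix):]
    _, _, object_path = remainder.partition("/")
    return "gs://{}/{}".format(bucket, object_path)


# "my-bucket" is a placeholder bucket name.
print(to_gcs_uri("hdfs://namenode:8020/data/input.csv", "my-bucket"))
print(to_gcs_uri("hdfs:///data/input.csv", "my-bucket"))
```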
Cloud Dataproc Autoscaling provides flexible capacity for more efficient utilization.
It makes scaling decisions based on Hadoop YARN Metrics. It is designed to be used only with off-cluster persistent data, not on-cluster HDFS or HBase. It works best with a cluster that processes a lot of jobs or that processes a single large job.
Efficient utilization (how to not pay for resources you don't use).
Cloud Dataproc can delete a cluster automatically on a schedule. There are three options:
1) A fixed amount of time after the cluster enters the Idle state.
2) Set a timer. (You give it a timestamp) The count starts immediately once the expiration has been set.
3) Set a duration. Time in seconds to wait before deleting the cluster. (Range is from 10 minutes minimum to 14 days maximum, with a granularity of 1 second)
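The duration limits quoted above (10 minutes to 14 days, in whole seconds) can be checked before a value is submitted. The helper below is a hypothetical client-side check, not part of any Dataproc API.

```python
MIN_SECONDS = 10 * 60            # 10 minutes
MAX_SECONDS = 14 * 24 * 60 * 60  # 14 days


def validate_delete_delay(seconds):
    """Return the delay if it is a whole number of seconds within the allowed range."""
    if not isinstance(seconds, int):
        raise TypeError("delay must be a whole number of seconds")
    if not MIN_SECONDS <= seconds <= MAX_SECONDS:
        raise ValueError("delay must be between 10 minutes and 14 days")
    return seconds


print(validate_delete_delay(3600))  # one hour is inside the range
```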
With Spark you program with requests. Spark doesn't immediately perform these actions, instead it stores them in a graph system called a directed acyclic graph, a d-a-g or DAG.
Only when a request that requires output is submitted does Spark actually process the data. The benefit of this strategy is that Spark can look at all the requests and the intermediate results and construct parallel pipelines based on the resources that are available in the cluster at that time.
To allow Spark to perform its magic, the program needs to build a chain of transformations using the dot operator.
When it is passed to Spark in this way, Spark understands the multiple steps and that the results of one transformation are to be passed to the next transformation. This allows Spark to organize the processing in any way it decides based on the resources available in the cluster.
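The idea of chaining transformations that run only when output is requested can be sketched in plain Python. The class below is a toy and not the Spark API; its names are invented. It records each step and applies them only when collect() is called.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (this is not the Spark API)."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # recorded transformations, not yet run

    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._steps + (("filter", fn),))

    def collect(self):
        """The 'action': only now are the recorded steps applied."""
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []


def double(x):
    calls.append(x)  # lets us observe when work actually happens
    return x * 2


# Transformations chained with the dot operator; nothing runs yet.
pipeline = LazyDataset(range(5)).map(double).filter(lambda x: x > 2)
work_before_action = len(calls)
result = pipeline.collect()
print(work_before_action, result)
```

Because the whole chain is known before anything executes, a real engine is free to reorder or parallelize the steps; this toy simply runs them in sequence.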
A Cloud Function is a serverless, stateless, execution environment for application code.
You deploy your code to the Cloud Functions service and set it up to be triggered by a class of events.
You can trigger periodic events using Cloud Scheduler
However, for data processing there are tools such as Cloud Dataproc Workflow Templates and Cloud Composer that are designed to manage workflows without having to code the service yourself.
Cloud Function code can be deployed to the service through Console, the gcloud command line, or from your local computer
At that time you specify the trigger that will cause the Cloud Function to run, such as the trigger bucket for Cloud Storage or the trigger topic for Cloud Pub/Sub.
Cloud Dataproc has a collection of connectors to BigQuery
MapReduce, Hadoop, Spark Scala, and Spark PySpark
If BigQuery were triggered to ingest each shard, it would cause a separate BigQuery job to run for each shard of data.
To overcome this problem, Cloud Dataproc's BigQuery Connectors use Cloud Storage. The workers are able to write their shards as objects in Cloud Storage.
When the work is complete, as part of the connector shutdown process, a single load job is issued in BigQuery.
The Cloud Dataproc Workflow Template is a YAML file that is processed through a Directed Acyclic Graph (DAG).
It can create a new cluster, select from an existing cluster, submit jobs, hold jobs for submission until dependencies can complete, and it can delete a cluster when the job is done.
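Holding jobs until their dependencies complete amounts to running them in a topological order of the DAG. The sketch below uses Python's standard-library graphlib to show the ordering; the job names are hypothetical and this is not the workflow template format itself.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each job maps to the set of jobs it must wait for.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "validate": {"ingest"},
    "report": {"clean", "validate"},
}

# static_order() yields every job after all of its prerequisites.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

The "clean" and "validate" jobs have no dependency on each other, so either may come first and a scheduler could run them in parallel.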
Cloud Composer is a workflow orchestration service based on Apache Airflow
Cloud Composer can be used to automate Cloud Dataproc jobs and to control clusters.
The thing about BigQuery is it separates out storage and compute. So you're essentially paying for storage, for the data that you put, but the cost of storage in BigQuery is about the same as the cost of storage in Google Cloud Storage.
So, you don't really need to choose. If your data is structured, tabular data, store it in BigQuery and you will get the same sort of discounts that you get in Cloud Storage as well. If a table hasn't been edited for 90 consecutive days, you automatically start getting the discounted long-term storage rate for that older data.
If you want a transactional database that gives you millisecond, microsecond responses, BigQuery is not the answer for that.
The answer for that would be something like Cloud SQL or something like Spanner. But for ad hoc analysis of very large datasets, for data warehousing, for business intelligence, those kinds of operations, BigQuery is a great choice.
BigQuery is completely no-ops.
You put your data onto BigQuery, you're not managing any clusters beyond that point. So you want to run a query on BigQuery, you just run the query on BigQuery. You don't need to create a cluster ahead of time. You only pay when you query the data.
So you have multiple owners in a project. Then in the project you basically create a dataset.
A dataset is basically a collection of tables that belong to your organization. You do access control on a dataset basis, not on a table basis, because you will typically want to join tables together within a dataset. So you don't do access control on a single table; you do access control on a dataset that consists of multiple tables.
So a dataset contains tables. It will also contain views.
A view is a live view of a table. Essentially you can think of a view as a query. That query returns a result set, and you can then write a query against that view just as you would against a table. So views are a cool way to restrict access to your dataset, because a view is basically a ready-made SELECT.
When you look at a BigQuery plan, you're looking for any stage where there's a significant difference between the average and the max time. Whenever you see this, it indicates a significant data skew.
One of the ways that you can fix this is by removing the tail, for example by filtering it out with the HAVING clause, so that you are not processing those tail values.
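The average-versus-max check can be sketched as a simple ratio test over per-stage timings. The stage names, timings, and the threshold of 3.0 are all assumptions made for the example; a real query plan exposes more detail than this.

```python
def skewed_stages(stages, ratio_threshold=3.0):
    """Return names of stages whose max time far exceeds their average time."""
    flagged = []
    for name, avg_ms, max_ms in stages:
        if avg_ms > 0 and max_ms / avg_ms >= ratio_threshold:
            flagged.append(name)
    return flagged


# Hypothetical (stage name, average ms, max ms) readings from a query plan.
plan = [("S00: Input", 120, 150), ("S01: Join", 200, 4200), ("S02: Output", 80, 90)]
print(skewed_stages(plan))
```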
So, we have a pipeline, and a pipeline is a set of steps. Each of these steps is called a transform, and for this particular transform, its source is BigQuery and its sink is Cloud Storage.
So, a typical pipeline goes from a source to a sink, involves branching, and involves a number of transforms. The input to each transform is a parallel collection, a PCollection. So, for each transform in the pipeline, its input is a PCollection.
One of the neat things that dataflow will let you do is that, you can have a running pipeline and you can change the code and you can essentially replace the running pipeline.
Well, when you replace a running pipeline, you don't lose any data. Some of the data gets processed by the old pipeline, and any data that wasn't processed by the old pipeline gets processed by the new pipeline. You can obviously see the advantage if you're processing streaming data. But in order for that replacement to work, your transforms have to have unique names.
Grep
The grep command processes text line by line, and prints any lines which match a specified pattern. Grep, which stands for "global/regular expression/print," is a powerful tool for matching a regular expression against text in a file, multiple files, or a stream of files.
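The same line-by-line matching can be sketched in a few lines of Python with the standard re module. The function is a minimal imitation of grep's core behaviour, and the log lines are invented sample data.

```python
import re


def grep(pattern, lines):
    """Yield each line that contains a match for the regular expression."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield line


# Hypothetical input lines.
log = ["INFO job started", "ERROR disk full", "INFO job finished", "ERROR timeout"]
matches = list(grep(r"^ERROR", log))
print(matches)
```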
Dataflow Templates enable a new development and execution workflow.
The templates help separate the development activities and the developers from the execution activities and the users.
What is Dataprep?
Dataprep is an interactive graphical system for preparing structured or unstructured data for use in analytics (such as BigQuery), in visualization (such as Data Studio), and in training machine learning models.
Dataprep offers a graphical user interface for interactively designing a pipeline.
The elements are divided into datasets, recipes and output. A dataset roughly translates into a Dataflow pipeline read, a recipe usually translates into multiple pipeline transformations, and an output translates into a pipeline action.
Dataprep provides a high-leverage method to quickly create Dataflow pipelines without coding.
This is especially useful for data quality tasks and for master data tasks, such as combining data from multiple sources, where programming may not be required.
What's a label?
The label is the correct output for an input (the true answer, either known or to be determined).
What's an input?
The input is the thing that you know and can provide at the time of prediction. For example, if you're working with images, the image itself is the input.
What's an example?
An example is a combination of the label and the input. An input and its corresponding label together form an example.
What's a model?
A model is a mathematical function that takes an input and creates an output that approximates a label for that input.
What's training?
Training is the process of adjusting the weights of a model so that it can make predictions given an input.
What's a prediction?
A prediction is the process of taking an input and applying the mathematical model to it to get an output that is, hopefully, the correct output for that input.
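The terms above (label, input, example, model, training, prediction) can be tied together with a tiny worked sketch. It assumes a one-weight linear model and plain gradient descent on squared error; the data, learning rate, and step count are invented for illustration.

```python
# Each example pairs an input with its label (here the label is 2 * input).
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]


def model(weight, x):
    """The model: a function of the input, parameterised by one weight."""
    return weight * x


def train(examples, steps=200, learning_rate=0.01):
    """Training: adjust the weight to shrink the squared prediction error."""
    weight = 0.0
    for _ in range(steps):
        # Gradient of the mean squared error with respect to the weight.
        grad = sum(2 * (model(weight, x) - y) * x for x, y in examples) / len(examples)
        weight -= learning_rate * grad
    return weight


weight = train(examples)
# Prediction: apply the trained model to an input that has no known label.
prediction = model(weight, 5.0)
print(round(weight, 3), round(prediction, 3))
```

After training, the weight is very close to 2, so the prediction for the input 5.0 is very close to 10.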
BigQuery Data Transfer Service
Automates data movement from SaaS applications to Google BigQuery on a scheduled, managed basis. Your analytics team can lay the foundation for a data warehouse without writing a single line of code.
Supports Google Ads, Campaign Manager, Google Ad Manager, and YouTube Content and Channel Owner Reports. Through BigQuery Data Transfer Service, you also gain access to data connectors that let you easily transfer data from Teradata and Amazon S3 to BigQuery.
Cloud Dataflow IAM roles
You can use Cloud Dataflow IAM roles to limit access for users within a project or organization, to just Cloud Dataflow-related resources, as opposed to granting users viewer, editor, or owner access to the entire Cloud Platform project.
Dataflow Admin
Minimal role for creating and managing Dataflow jobs.
Dataflow Developer
Provides the permissions necessary to execute and manipulate Cloud Dataflow jobs.
Dataflow Viewer
Provides read-only access to all Cloud Dataflow-related resources.
Dataflow Worker
Provides the permissions necessary for a Compute Engine service account to execute work units for a Cloud Dataflow pipeline.
Example role assignment
- The developer who creates and examines jobs will need the roles/dataflow.admin role.
- For more sophisticated permissions management, the developer interacting with the Cloud Dataflow job will need the roles/dataflow.developer role.
- They will need the roles/storage.objectAdmin role or a related role in order to stage the required files.
Example role assignment (continued)
- For debugging and quota checking, they will need the project roles/compute.viewer role.
- Absent other role assignments, this will allow the developer to create and cancel Cloud Dataflow jobs, but not interact with the individual VMs or access other Cloud services.
- The controller service account needs the roles/dataflow.worker role to process data for the Cloud Dataflow service. It will need other roles (such as roles/storage.objectAdmin) in order to access job data.
Assigning Cloud Dataflow roles
Cloud Dataflow roles can currently be set on organizations and projects only.
Predefined roles and permissions
When an identity calls a Google Cloud Platform API, Google BigQuery requires that the identity has the appropriate permissions to use the resource. You can grant permissions by granting roles to a user, a group, or a service account.
There are three types of roles in Cloud Identity and Access Management:
- Primitive roles include the Owner, Editor, and Viewer roles that existed prior to the introduction of Cloud Identity and Access Management.
- Predefined roles provide granular access for a specific service and are managed by GCP. Predefined roles are meant to support common use cases and access control patterns.
- Custom roles provide granular access according to a user-specified list of permissions.
To determine if one or more permissions are included in a primitive, predefined, or custom role, you can use one of the following methods:
- The gcloud iam roles describe command
- The roles.get() method in the Cloud IAM API
When you assign both predefined and primitive roles to a user...
The permissions granted are a union of each role's permissions.
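The union rule can be pictured with Python sets. The permission lists below are heavily abbreviated and chosen only for illustration; they are not the actual contents of these roles.

```python
# Hypothetical, abbreviated permission sets used only to show the union rule.
ROLE_PERMISSIONS = {
    "roles/viewer": {"bigquery.datasets.get", "bigquery.tables.list"},
    "roles/bigquery.jobUser": {"bigquery.jobs.create"},
}


def effective_permissions(assigned_roles):
    """Union the permissions of every role assigned to an identity."""
    granted = set()
    for role in assigned_roles:
        granted |= ROLE_PERMISSIONS[role]
    return granted


granted = effective_permissions(["roles/viewer", "roles/bigquery.jobUser"])
print(sorted(granted))
```

An identity holding both roles ends up with every permission from either role; nothing is subtracted.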
When you assign roles at the organization and project level
You provide permission to run BigQuery jobs or to manage all of a project's BigQuery resources.
You can also assign roles at the dataset level to provide access only to one or more datasets
In the IAM policy hierarchy, BigQuery datasets are child resources of projects. Tables and views are child resources of datasets — they inherit permissions from their parent dataset.
For any job you create
You automatically have the equivalent of the bigquery.jobs.get and bigquery.jobs.update permissions for that job.
BigQuery Admin
Provides permissions to manage all resources within the project. Can manage all data within the project, and can cancel jobs from other users running within the project.
BigQuery Data Editor
When applied to a dataset, dataEditor provides permissions to:
- Read the dataset's metadata and to list tables in the dataset.
- Create, update, get, and delete the dataset's tables.
When applied at the project or organization level, this role can also create new datasets.
BigQuery Data Owner
When applied to a dataset, dataOwner provides permissions to:
- Read, update, and delete the dataset.
- Create, update, get, and delete the dataset's tables.
When applied at the project or organization level, this role can also create new datasets.
BigQuery Data Viewer
When applied to a dataset, dataViewer provides permissions to:
- Read the dataset's metadata and to list tables in the dataset.
- Read data and metadata from the dataset's tables.
When applied at the project or organization level, this role can also enumerate all datasets in the project. Additional roles, however, are necessary to allow the running of jobs.
BigQuery Job User
Provides permissions to run jobs, including queries, within the project. The jobUser role can enumerate their own jobs and cancel their own jobs.
BigQuery Metadata Viewer
When applied at the project or organization level, metadataViewer provides permissions to:
- List all datasets and read metadata for all datasets in the project.
- List all tables and views and read metadata for all tables and views in the project.
Additional roles are necessary to allow the running of jobs.
BigQuery Read Session User
Provides access to create and use read sessions.
BigQuery User
Provides permissions to run jobs, including queries, within the project. The user role can enumerate their own jobs, cancel their own jobs, and enumerate datasets within a project. Additionally, allows the creation of new datasets within the project; the creator is granted the bigquery.dataOwner role for these new datasets.
Introduction to partitioned tables
A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data. By dividing a large table into smaller partitions, you can improve query performance, and you can control costs by reducing the number of bytes read by a query.
There are two types of table partitioning in BigQuery:
- Tables partitioned by ingestion time: Tables partitioned based on the data's ingestion (load) date or arrival date.
- Partitioned tables: Tables that are partitioned based on a TIMESTAMP or DATE column.
When you create a table partitioned by ingestion time
BigQuery automatically loads data into daily, date-based partitions that reflect the data's ingestion or arrival date. Pseudo column and suffix identifiers allow you to restate (replace) and redirect data to partitions for a specific day.
When you create ingestion-time partitioned tables
The partitions have the same schema definition as the table. If you need to load data into a partition with a schema that is not the same as the schema of the table, you must update the schema of the table before loading the data. Alternatively, you can use schema update options to update the schema of the table in a load job or query job.
BigQuery also allows partitioned tables
Partitioned tables allow you to bind the partitioning scheme to a specific TIMESTAMP or DATE column. Data written to a partitioned table is automatically delivered to the appropriate partition based on the date value (expressed in UTC) in the partitioning column.
When you create partitioned tables, two special partitions are created:
- The __NULL__ partition — represents rows with NULL values in the partitioning column
- The __UNPARTITIONED__ partition — represents data that exists outside the allowed range of dates
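A small sketch of how a row's partitioning-column value might map to a partition, including the two special partitions listed above. The function name and the `earliest`/`latest` bounds are hypothetical; the actual allowed date range is defined by BigQuery, not by this code.

```python
from datetime import date


def partition_id(value, earliest, latest):
    """Pick the partition for one row's partitioning-column value."""
    if value is None:
        return "__NULL__"
    if value < earliest or value > latest:
        # Dates outside the allowed range land in the catch-all partition.
        return "__UNPARTITIONED__"
    return value.strftime("%Y%m%d")


# Assumed bounds, for illustration only.
earliest, latest = date(2019, 1, 1), date(2019, 12, 31)
ids = [
    partition_id(v, earliest, latest)
    for v in (date(2019, 8, 26), None, date(1800, 1, 1))
]
print(ids)
```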
As an alternative to partitioned tables
You can shard tables using a time-based naming approach such as [PREFIX]_YYYYMMDD. This is referred to as creating date-sharded tables. Using either standard SQL or legacy SQL, you can specify a query with a UNION operator to limit the tables scanned by the query.
Partitioned tables perform better than tables sharded by date
When you create date-named tables, BigQuery must maintain a copy of the schema and metadata for each date-named table. Also, when date-named tables are used, BigQuery might be required to verify permissions for each queried table. This practice also adds to query overhead and impacts query performance.
Querying multiple tables using a wildcard table
A wildcard table represents a union of all the tables that match the wildcard expression. For example, a FROM clause that uses the wildcard expression gsod* matches all tables in the noaa_gsod dataset that begin with the string gsod.
Wildcard table queries are subject to the following limitations.
The wildcard table functionality does not support views. If the wildcard table matches any view in the dataset, the query returns an error. This is true whether or not your query contains a WHERE clause on the _TABLE_SUFFIX pseudo column to filter out the view.
Currently, cached results are not supported for queries against multiple tables using a wildcard even if the Use Cached Results option is checked. If you run the same wildcard query multiple times, you are billed for each query.
Wildcard table queries are subject to the following limitations. (cont)
Wildcard tables support native BigQuery storage only. You cannot use wildcards when querying an external table or a view.
Queries that contain Data Manipulation Language (DML) statements cannot use a wildcard table as the target of the query. For example, a wildcard table may be used in the FROM clause of an UPDATE query, but a wildcard table cannot be used as the target of the UPDATE operation.
When to use wildcard tables
Wildcard tables are useful when a dataset contains multiple, similarly named tables that have compatible schemas.
Querying sets of tables using wildcard tables
To query a group of tables that share a common prefix, use the table wildcard symbol (*) after the table prefix in your FROM clause.
Filtering selected tables using _TABLE_SUFFIX
To restrict the query so that it scans an arbitrary set of tables, use the _TABLE_SUFFIX pseudo column in the WHERE clause. The _TABLE_SUFFIX pseudo column contains the values matched by the table wildcard.
Scanning a range of tables using _TABLE_SUFFIX
To scan a range of tables, use the _TABLE_SUFFIX pseudo column along with the BETWEEN clause.
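A wildcard query over a range of suffixes might be assembled as in the sketch below, which only builds the SQL text and does not run it. The helper name is made up, and the table prefix and year suffixes are example values; plain string formatting like this is not safe for untrusted input.

```python
def wildcard_range_query(table_prefix, first_suffix, last_suffix, columns="*"):
    """Build a standard SQL query over tables whose suffix lies in a range."""
    return (
        f"SELECT {columns}\n"
        f"FROM `{table_prefix}*`\n"
        f"WHERE _TABLE_SUFFIX BETWEEN '{first_suffix}' AND '{last_suffix}'"
    )


# Example values: tables gsod1929 through gsod1935 in the noaa_gsod dataset.
sql = wildcard_range_query("bigquery-public-data.noaa_gsod.gsod", "1929", "1935")
print(sql)
```

The wildcard sits at the end of the table prefix, and _TABLE_SUFFIX holds whatever the wildcard matched, so the BETWEEN filter limits which tables are scanned.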
Scanning a range of ingestion-time partitioned tables using _PARTITIONTIME
To scan a range of ingestion-time partitioned tables, use the _PARTITIONTIME pseudo column with the _TABLE_SUFFIX pseudo column.
Querying all tables in a dataset
To scan all tables in a dataset, you can use an empty prefix and the table wildcard, which means that the _TABLE_SUFFIX pseudo column contains full table names.
Cloud Spanner: How indexes are created for you and how you can create secondary indexes.
Cloud Spanner automatically creates an index for each table's primary key column. You can also create secondary indexes for other columns. Adding a secondary index on a column makes it more efficient to look up data in that column. For example, if you need to quickly look up a set of SingerId values for a given range of LastName values, you should create a secondary index on LastName, so Cloud Spanner does not need to scan the entire table.
Indexes
Every Cloud Firestore in Datastore mode query computes its results using one or more indexes which contain entity keys in a sequence specified by the index's properties and, optionally, the entity's ancestors. The indexes are updated to reflect any changes the application makes to its entities, so that the correct results of all queries are available with no further computation needed.
There are two types of indexes:
Built-in indexes - By default, a Datastore mode database automatically predefines an index for each property of each entity kind. These single property indexes are suitable for simple types of queries.
Composite indexes - Composite indexes index multiple property values per indexed entity. Composite indexes support complex queries and are defined in an index configuration file (index.yaml).
The index-based query mechanism supports a wide range of queries and is suitable for most applications.
However, it does not support some kinds of query common in other database technologies: in particular, joins and aggregate queries aren't supported within the Datastore mode query engine.
As you design your Cloud Bigtable schema, keep the following concepts in mind:
• Each table has only one index, the row key.
• Rows are sorted lexicographically by row key.
• Columns are grouped by column family and sorted in lexicographic order within the column family.
• All operations are atomic at the row level.
As you design your Cloud Bigtable schema, keep the following concepts in mind: (cont)
• Ideally, both reads and writes should be distributed evenly.
• In general, keep all information for an entity in a single row.
• Related entities should be stored in adjacent rows.
• Cloud Bigtable tables are sparse.
(Choosing a row key) Start by asking how you'll use the data that you plan to store.
- User information: Do you need quick access to information about connections between users (for example, whether user A follows user B)?
- User-generated content: If you show users a sample of a large amount of user-generated content, such as status updates, how will you decide which status updates to display to a given user?
- Time series data: Will you often need to retrieve the most recent N records, or records that fall within a certain time range? If you're storing data for several kinds of events, will you need to filter based on the type of event?
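For the time series question, one common approach is a row key that combines an entity id with a reversed timestamp, so that lexicographic ordering puts the newest record first. The sketch below only demonstrates the sorting idea in plain Python; the key layout, the `#` separator, and the `MAX_TS` bound are assumptions, and timestamps are assumed to lie between 0 and `MAX_TS`.

```python
MAX_TS = 9_999_999_999  # hypothetical upper bound on timestamps, in seconds


def row_key(device_id, timestamp):
    """Compose a key that sorts a device's newest readings first."""
    reversed_ts = MAX_TS - timestamp
    # Zero-pad so that string order matches numeric order.
    return f"{device_id}#{reversed_ts:010d}"


readings = [("dev1", 1000), ("dev2", 500), ("dev1", 2000)]
# Sorting the strings mimics how rows are ordered by row key.
keys = sorted(row_key(device, ts) for device, ts in readings)
print(keys)
```

All rows for dev1 are adjacent, and within dev1 the reading taken at time 2000 sorts before the one taken at time 1000.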
One key difference between Hive and Pig:
Hive is declarative and Pig is imperative.
Hive
Specifies exactly how to perform the data analysis, which limits the flexibility of the underlying systems. They can't decide how to use the available resources to get the results.
Pig
Makes execution plans but then requests the underlying systems determine how to process the data. So Pig can fit better into the pipeline paradigm of processing data.
Hive provides a subset of SQL.
The way it does this is by maintaining metadata to define a schema on top of the data. This is one way to work with a large amount of distributed data in HDFS using familiar SQL syntax.
Hive is designed for batch jobs and not for transactions.
It ingests data into a data warehouse format requiring a schema. It does not support real-time queries, row-level updates, or unstructured data. Some queries may run much slower than others due to the underlying transformations Hive has to implement to simulate SQL.
Pig provides SQL primitives similar to Hive, but in a more flexible scripting language format.
Pig can also deal with semi-structured data, such as data having partial schemas, or for which the schema is not yet known. For this reason it is sometimes used for Extract Transform Load (ETL). It generates Java MapReduce jobs. Pig is not designed to deal with unstructured data.
Pub/Sub
A messaging service that decouples senders and receivers, enabling asynchronous messaging for ingesting data. This is useful for ingesting streaming data. Publishers publish messages to a topic, and subscribers subscribe to the topic and receive the messages. Messages are stored until they're delivered and acknowledged by the subscribers.
Dataflow
A service that provides a unified method for both batch and streaming data processing. Dataflow builds on the Apache Beam SDK.
BigQuery
A data warehouse service with a highly scalable interactive data exploration interface. It can capture and analyze data in real time. So it makes a great partner for data that's been ingested through Pub/Sub and processed through Dataflow.
Consider the long term goal of a pure serverless solution...
You create a pipeline job in Python or Java using Apache Beam. Data is read in batch from Cloud Storage or as a stream through Pub/Sub, and is processed in Dataflow. The results are stored in BigQuery for interactive analysis or in Bigtable for use by applications. There's nothing to maintain.
Cloud Storage is an object store. Objects have several properties:
Among them is "class", which declares the anticipated frequency of access to the object. This allows the Cloud Storage service to distinguish between types of objects and internally manage them to service levels.
Cloud Storage is very good at bulk and parallel operations on larger objects. So for performance, keep these in mind:
● Avoid small reads
● Use large block/file sizes where possible
● Avoid iterating over many nested directories in a single job
Latency is higher for data in Cloud Storage than HDFS on a Persistent Disk in the cluster.
Throughput for processing data in Cloud Storage is higher than throughput for HDFS on a Persistent Disk in the cluster.
Cloud Bigtable stores data in a file system called Colossus.
Colossus also contains data structures called Tablets that are used to identify and manage the data. And metadata about the Tablets is what is stored on the VMs in the Bigtable cluster itself.
Cloud Bigtable has three levels of operation.
It can manipulate the actual data. It can manipulate the Tablets that point to and describe the data. Or it can manipulate the metadata that points to the Tablets. Rebalancing tablets from one node to another is very fast, because only the pointers are updated.
Cloud Bigtable is a learning system.
It detects "hot spots" where a lot of activity is going through a single Tablet and splits the Tablet in two. It can also rebalance the processing by moving the pointer of a Tablet to a different VM in the cluster. So its best use case is with big data, above 300 GB.
When a node is lost in the cluster, no data is lost.
Recovery is fast because only the metadata needs to be copied to the replacement node. Colossus provides better durability than the default 3 replicas provided by HDFS.
Cloud Bigtable has only one index.
That index is called the Row Key. There are no alternate indexes or secondary indexes. And when data is entered, it is organized lexicographically by the Row Key.
RISC
(Reduced Instruction Set Computing). Simplify the operations. And when you don't have to account for variations, you can make those that remain very fast.
The most important control over resource consumption and costs is...
Writing a query that controls the amount of data processed. In general, this is done with SELECT by choosing subsets of data at the start of a job rather than by using LIMIT which only omits data from the final results at the end of a job.
The easy path to migrate your Hadoop workload to Dataproc is...
-Step one, copy the data to Cloud Storage.
-Step two, update the file prefix in your application scripts from HDFS to GS. You just change hdfs:// to gs://.
-Step three, create a Dataproc cluster and run your job. You can also stage your scripts in Cloud Storage. In most cases, that's all you have to do to get your current job running in Dataproc.
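Step two, the prefix change, could look like the following sketch. The helper name and the bucket name are hypothetical. Unlike a plain text substitution, this version also drops the namenode host and port if the HDFS path carries them, which is an assumption about how the original paths are written.

```python
def to_gcs_uri(path, bucket):
    """Rewrite an hdfs:// path as a gs:// path under the given bucket."""
    prefix = "hdfs://"
    if not path.startswith(prefix):
        return path  # leave non-HDFS paths untouched
    # Drop the namenode authority (host:port), if any, and keep the file path.
    _, _, file_path = path[len(prefix):].partition("/")
    return f"gs://{bucket}/{file_path}"


# "my-bucket" is a placeholder bucket name.
with_host = to_gcs_uri("hdfs://namenode:8020/data/input.csv", "my-bucket")
without_host = to_gcs_uri("hdfs:///data/input.csv", "my-bucket")
print(with_host, without_host)
```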
Cloud Dataproc Autoscaling provides flexible capacity for more efficient utilization.
It makes scaling decisions based on Hadoop YARN Metrics. It is designed to be used only with off-cluster persistent data, not on-cluster HDFS or HBase. It works best with a cluster that processes a lot of jobs or that processes a single large job.
Efficient utilization (how to not pay for resources you don't use).
With Cloud Dataproc scheduled deletion, a cluster can be deleted automatically in one of three ways:
1) After a fixed amount of time in the Idle state.
2) At a set time. (You give it a timestamp.) The count starts immediately once the expiration has been set.
3) After a set duration. This is the time in seconds to wait before deleting the cluster. (The range is from 10 minutes minimum to 14 days maximum, with a granularity of 1 second.)
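The duration range quoted above (10 minutes to 14 days, in whole seconds) can be checked before a value is submitted. This is a hypothetical client-side check, not part of any Dataproc tool.

```python
MIN_SECONDS = 10 * 60            # 10 minutes
MAX_SECONDS = 14 * 24 * 60 * 60  # 14 days


def valid_deletion_delay(seconds):
    """Check that a delay is a whole number of seconds within the range."""
    return isinstance(seconds, int) and MIN_SECONDS <= seconds <= MAX_SECONDS


print(valid_deletion_delay(3600), valid_deletion_delay(60))
```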
With Spark you program with requests.
Spark doesn't immediately perform these actions. Instead, it stores them in a graph called a directed acyclic graph, or DAG.
Only when a request is submitted that requires output does Spark actually process the data. The benefit of this strategy is that Spark can look at all the requests and the intermediate results and construct parallel pipelines based on the resources that are available in the cluster at that time.
To allow Spark to perform its magic, the program needs to build a chain of transformations using the dot operator.
When it is passed to Spark in this way, Spark understands the multiple steps and that the results of one transformation are to be passed to the next transformation. This allows Spark to organize the processing in any way it decides based on the resources available in the cluster.
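The idea of chaining transformations with the dot operator and deferring the work can be imitated in plain Python. The `LazyDataset` class below is a toy stand-in written for this example, not Spark's API: it records steps in a simple chain rather than a full DAG, and only runs them when `collect()` asks for output.

```python
class LazyDataset:
    """Toy stand-in for a Spark dataset: records steps, runs them on demand."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Record the transformation; nothing is computed yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._steps + (("filter", fn),))

    def collect(self):
        # Output is required, so the recorded steps are applied in order.
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


calls = []


def square(x):
    calls.append(x)  # track when the function actually runs
    return x * x


pipeline = LazyDataset(range(5)).map(square).filter(lambda v: v % 2 == 0)
nothing_ran_yet = calls == []
result = pipeline.collect()
print(nothing_ran_yet, result)
```

Building the chain does no work; `square` is first called inside `collect()`, which is the point at which a real engine is free to decide how to organize the processing.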
A Cloud Function is a serverless, stateless, execution environment for application code.
You deploy your code to the Cloud Functions service and set it up to be triggered by a class of events.
You can trigger periodic events using Cloud Scheduler.
However, for data processing there are tools such as Cloud Dataproc Workflow Templates and Cloud Composer that are designed to manage workflows without having to code the service yourself.
Cloud Function code can be deployed to the service through the Console, the gcloud command line, or from your local computer.
At that time you specify the trigger that will cause the Cloud Function to run, such as the trigger bucket for Cloud Storage or the trigger topic for Cloud Pub/Sub.
Cloud Dataproc has a collection of connectors to BigQuery
MapReduce, Hadoop, Spark Scala, and Spark PySpark
If BigQuery was triggered to ingest each shard, it would cause a separate BigQuery job to run for each shard of data.
To overcome this problem, Cloud Dataproc's BigQuery Connectors use Cloud Storage. The workers are able to write their shards as objects in Cloud Storage.
When the work is complete, as part of the connector shutdown process, a single load job is issued in BigQuery.
The Cloud Dataproc Workflow Template is a YAML file that is processed through a Directed Acyclic Graph (DAG).
It can create a new cluster, select from an existing cluster, submit jobs, hold jobs for submission until dependencies can complete, and it can delete a cluster when the job is done.
Cloud Composer is a workflow orchestration service based on Apache Airflow
Cloud Composer can be used to automate Cloud Dataproc jobs and to control clusters.
The thing about BigQuery is that it separates storage and compute. So you're paying for storage for the data that you store, but the cost of storage in BigQuery is about the same as the cost of storage in Google Cloud Storage.
So, you don't really need to choose. If your data is structured, tabular data, store it in BigQuery and you will get the same sort of discounts that you get in Cloud Storage as well. If you have some data that you haven't edited in 90 days, you automatically start getting the discounted rate for older data.
If you want a transactional database that gives you millisecond, microsecond responses, BigQuery is not the answer for that.
The answer for that would be something like Cloud SQL or something like Spanner. But for ad hoc analysis of very large datasets, for data warehousing, for business intelligence, those kinds of operations, BigQuery is a great choice.
BigQuery is completely no-ops.
You put your data onto BigQuery, you're not managing any clusters beyond that point. So you want to run a query on BigQuery, you just run the query on BigQuery. You don't need to create a cluster ahead of time. You only pay when you query the data.
So you have multiple owners in a project. Then in the project you basically create a dataset.
A dataset is basically a collection of tables that belong to your organization. You do access control on a dataset basis, not on a table basis, because you will typically want to join tables together within a dataset. So you don't do access control on a single table; you do access control on a dataset that consists of multiple tables.
So a dataset contains tables. It will also contain views.
A view is a live view of a table. Essentially, you can think of a view as a query: that query returns a result set, and you can then write a query against the view just as you would against a table. So you can use views as a way to restrict access to your dataset, because a view is basically a ready-made SELECT.