[Full-Version] 2022 New TopExamCollection Professional-Data-Engineer PDF Recently Updated Questions [Q71-Q93]


Professional-Data-Engineer Exam with Guarantee Updated 253 Questions


How to Prepare For Google Professional Data Engineer Exam

Preparation Guide for Google Professional Data Engineer Exam

Introduction to Google Professional Data Engineer Exam

Google has established a path for IT professionals to certify as Data Engineers on Google Cloud Platform (GCP). This certification program gives Google Cloud professionals a way to validate their skills. The evaluation relies on a rigorous exam that uses industry-standard methodology to determine whether a candidate meets Google's proficiency standards.

The Professional Data Engineer exam assesses your ability to:

  • Ensure solution quality
  • Build and operationalize data processing systems
  • Design data processing systems
  • Operationalize machine learning models

The Google Professional Data Engineer certification is evidence of your skills and expertise in the areas listed above. Candidates who want to work as a Google Professional Data Engineer and prove their knowledge can do so through this certification, which Google offers to validate skills in big data and data engineering technology.


Google Cloud Big Data & Machine Learning Fundamentals course

This course introduces you to Google Cloud's big data and machine learning capabilities. To complete the training successfully, you need about one year of experience with SQL, extract, transform, and load (ETL) activities, data modeling, machine learning, and programming in Python. The objectives of the course are the following:

  • Use Cloud SQL and Dataproc to migrate existing MySQL, Pig, Spark, or Hive workloads to Google Cloud
  • Create ML models using BigQuery ML, APIs, and AutoML
  • Recognize the purpose of the key big data and machine learning products in Google Cloud
  • Use BigQuery and Cloud SQL for interactive data analysis

 

NEW QUESTION 71
You have spent a few days loading data from comma-separated values (CSV) files into the Google BigQuery table CLICK_STREAM. The column DT stores the epoch time of click events. For convenience, you chose a simple schema where every field is treated as the STRING type. Now, you want to compute web session durations of users who visit your site, and you want to change its data type to the TIMESTAMP. You want to minimize the migration effort without making future queries computationally expensive. What should you do?

  • A. Add a column TS of the TIMESTAMP type to the table CLICK_STREAM, and populate the numeric values from the column TS for each row. Reference the column TS instead of the column DT from now on.
  • B. Delete the table CLICK_STREAM, and then re-create it such that the column DT is of the TIMESTAMP type. Reload the data.
  • C. Construct a query to return every row of the table CLICK_STREAM, while using the built-in function to cast strings from the column DT into TIMESTAMP values. Run the query into a destination table NEW_CLICK_STREAM, in which the column TS is the TIMESTAMP type. Reference the table NEW_CLICK_STREAM instead of the table CLICK_STREAM from now on. In the future, new data is loaded into the table NEW_CLICK_STREAM.
  • D. Create a view CLICK_STREAM_V, where strings from the column DT are cast into TIMESTAMP values. Reference the view CLICK_STREAM_V instead of the table CLICK_STREAM from now on.
  • E. Add two columns to the table CLICK_STREAM: TS of the TIMESTAMP type and IS_NEW of the BOOLEAN type. Reload all data in append mode. For each appended row, set the value of IS_NEW to true. For future queries, reference the column TS instead of the column DT, with the WHERE clause ensuring that the value of IS_NEW must be true.

Answer: E
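
The conversion this question revolves around, turning epoch time stored as a string into a real timestamp, can be sketched in plain Python. This is only an illustration under assumptions: it assumes DT holds epoch seconds (the question does not say whether the values are seconds or milliseconds), and both helper names are hypothetical. In BigQuery itself the cast would be written in SQL.

```python
from datetime import datetime, timezone

def epoch_string_to_timestamp(dt):
    """Convert an epoch-seconds string (assumed format of DT) to a UTC datetime."""
    return datetime.fromtimestamp(int(dt), tz=timezone.utc)

def session_duration_seconds(dt_values):
    """Seconds between the first and last click in one session."""
    stamps = sorted(epoch_string_to_timestamp(v) for v in dt_values)
    return (stamps[-1] - stamps[0]).total_seconds()

# Hypothetical DT values for one session, stored as strings.
clicks = ["1640995200", "1640995260", "1640995500"]
print(epoch_string_to_timestamp(clicks[0]).isoformat())
print(session_duration_seconds(clicks))
```

Doing this cast once and storing the result avoids repeating the string parsing in every later query, which is the cost trade-off the options are weighing.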

 

NEW QUESTION 72
You are implementing security best practices on your data pipeline. Currently, you are manually executing jobs as the Project Owner. You want to automate these jobs by taking nightly batch files containing non-public information from Google Cloud Storage, processing them with a Spark Scala job on a Google Cloud Dataproc cluster, and depositing the results into Google BigQuery.
How should you securely run this workload?

  • A. Use a service account with the ability to read the batch files and to write to BigQuery
  • B. Restrict the Google Cloud Storage bucket so only you can see the files
  • C. Use a user account with the Project Viewer role on the Cloud Dataproc cluster to read the batch files and write to BigQuery
  • D. Grant the Project Owner role to a service account, and run the job with it

Answer: A

 

NEW QUESTION 73
Which of the following is not true about Dataflow pipelines?

  • A. Pipelines can share data between instances
  • B. Pipelines are a set of operations
  • C. Pipelines represent a data processing job
  • D. Pipelines represent a directed graph of steps

Answer: A

Explanation:
The data and transforms in a pipeline are unique to, and owned by, that pipeline. While your program can create multiple pipelines, pipelines cannot share data or transforms.
Reference: https://cloud.google.com/dataflow/model/pipelines

 

NEW QUESTION 74
You are operating a Cloud Dataflow streaming pipeline. The pipeline aggregates events from a Cloud Pub/Sub subscription source, within a window, and sinks the resulting aggregation to a Cloud Storage bucket. The source has consistent throughput. You want to monitor and alert on the behavior of the pipeline with Cloud Stackdriver to ensure that it is processing data. Which Stackdriver alerts should you create?

  • A. An alert based on a decrease of instance/storage/used_bytes for the source and a rate of change increase of subscription/num_undelivered_messages for the destination
  • B. An alert based on a decrease of subscription/num_undelivered_messages for the source and a rate of change increase of instance/storage/used_bytes for the destination
  • C. An alert based on an increase of subscription/num_undelivered_messages for the source and a rate of change decrease of instance/storage/used_bytes for the destination
  • D. An alert based on an increase of instance/storage/used_bytes for the source and a rate of change decrease of subscription/num_undelivered_messages for the destination

Answer: C
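
The condition described in the selected answer, a growing backlog at the source together with a slowing rate of writes at the destination, can be sketched as a simple check over metric samples. This is a plain-Python illustration, not Stackdriver configuration: the function names, the sample values, and the `min_write_rate` threshold are all assumptions.

```python
def rate_of_change(samples):
    """Average change per interval across equally spaced metric samples."""
    return (samples[-1] - samples[0]) / (len(samples) - 1)

def pipeline_looks_stalled(undelivered_messages, used_bytes, min_write_rate):
    """Flag a stall: the backlog grows while writes to the sink slow down.

    undelivered_messages: samples of subscription/num_undelivered_messages
    used_bytes: samples of the destination's storage/used_bytes metric
    min_write_rate: hypothetical threshold, in bytes per interval
    """
    backlog_growing = rate_of_change(undelivered_messages) > 0
    writes_slowing = rate_of_change(used_bytes) < min_write_rate
    return backlog_growing and writes_slowing

# Healthy: backlog is flat and the bucket keeps growing.
print(pipeline_looks_stalled([100, 100, 100], [0, 500, 1000], 100))
# Stalled: backlog climbs while the bucket barely changes.
print(pipeline_looks_stalled([100, 400, 900], [1000, 1000, 1010], 100))
```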

 

NEW QUESTION 75
You use BigQuery as your centralized analytics platform. New data is loaded every day, and an ETL pipeline modifies the original data and prepares it for the final users. This ETL pipeline is regularly modified and can generate errors, but sometimes the errors are detected only after 2 weeks. You need to provide a method to recover from these errors, and your backups should be optimized for storage costs. How should you organize your data in BigQuery and store your backups?

  • A. Organize your data in separate tables for each month, and duplicate your data on a separate dataset in BigQuery.
  • B. Organize your data in separate tables for each month, and use snapshot decorators to restore the table to a time prior to the corruption.
  • C. Organize your data in separate tables for each month, and export, compress, and store the data in Cloud Storage.
  • D. Organize your data in a single table, export, and compress and store the BigQuery data in Cloud Storage.

Answer: C


 

NEW QUESTION 76
You are selecting services to write and transform JSON messages from Cloud Pub/Sub to BigQuery for a data pipeline on Google Cloud. You want to minimize service costs. You also want to monitor and accommodate input data volume that will vary in size with minimal manual intervention. What should you do?

  • A. Use Cloud Dataproc to run your transformations. Use the diagnose command to generate an operational output archive. Locate the bottleneck and adjust cluster resources.
  • B. Use Cloud Dataflow to run your transformations. Monitor the job system lag with Stackdriver. Use the default autoscaling setting for worker instances.
  • C. Use Cloud Dataflow to run your transformations. Monitor the total execution time for a sampling of jobs. Configure the job to use non-default Compute Engine machine types when needed.
  • D. Use Cloud Dataproc to run your transformations. Monitor CPU utilization for the cluster. Resize the number of worker nodes in your cluster via the command line.

Answer: B


 

NEW QUESTION 77
You are designing a basket abandonment system for an ecommerce company. The system will send a message to a user based on these rules:
* No interaction by the user on the site for 1 hour
* Has added more than $30 worth of products to the basket
* Has not completed a transaction
You use Google Cloud Dataflow to process the data and decide if a message should be sent. How should you design the pipeline?

  • A. Use a sliding time window with a duration of 60 minutes.
  • B. Use a session window with a gap time duration of 60 minutes.
  • C. Use a global window with a time based trigger with a delay of 60 minutes.
  • D. Use a fixed-time window with a duration of 60 minutes.

Answer: C
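
The three rules in this question can be written as a plain predicate over one user's events, independent of which windowing strategy groups those events. The sketch below is plain Python rather than Dataflow code, and the event tuple layout and the `kind` labels are hypothetical.

```python
from datetime import datetime, timedelta

INACTIVITY_GAP = timedelta(minutes=60)
MIN_BASKET_VALUE = 30.0

def should_send_message(events, now):
    """Apply the three abandonment rules to one user's events.

    events: list of (timestamp, kind, amount) tuples, where kind is a
    hypothetical label such as "view", "add_to_basket", or "purchase".
    """
    if not events:
        return False
    last_interaction = max(ts for ts, _, _ in events)
    basket_value = sum(amount for _, kind, amount in events if kind == "add_to_basket")
    purchased = any(kind == "purchase" for _, kind, _ in events)
    inactive = now - last_interaction >= INACTIVITY_GAP
    return inactive and basket_value > MIN_BASKET_VALUE and not purchased

start = datetime(2022, 1, 1, 12, 0)
events = [
    (start, "add_to_basket", 20.0),
    (start + timedelta(minutes=5), "add_to_basket", 15.0),
]
print(should_send_message(events, start + timedelta(minutes=30)))
print(should_send_message(events, start + timedelta(minutes=70)))
```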

 

NEW QUESTION 78
Your team is responsible for developing and maintaining ETLs in your company. One of your Dataflow jobs is failing because of some errors in the input data, and you need to improve reliability of the pipeline (including being able to reprocess all failing data).
What should you do?

  • A. Add a try... catch block to your DoFn that transforms the data, write erroneous rows to PubSub directly from the DoFn.
  • B. Add a try... catch block to your DoFn that transforms the data, use a sideOutput to create a PCollection that can be stored to PubSub later.
  • C. Add a try... catch block to your DoFn that transforms the data, extract erroneous rows from logs.
  • D. Add a filtering step to skip these types of errors in the future, extract erroneous rows from logs.

Answer: B

Explanation:
https://cloud.google.com/blog/products/gcp/handling-invalid-inputs-in-dataflow
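
The pattern in the selected answer, catching the failure inside the transform and routing the bad row to a separate output, can be sketched in plain Python. This is not Dataflow SDK code: the function name is hypothetical, JSON parsing stands in for the real transformation, and the second list plays the role that a side output PCollection would play in the pipeline.

```python
import json

def parse_rows(raw_rows):
    """Split input into parsed rows and a dead-letter list of failures."""
    good, dead_letter = [], []
    for raw in raw_rows:
        try:
            good.append(json.loads(raw))
        except json.JSONDecodeError as err:
            # Keep the original payload so the row can be reprocessed later.
            dead_letter.append({"raw": raw, "error": str(err)})
    return good, dead_letter

good, bad = parse_rows(['{"a": 1}', "not json"])
print(good)
print(bad[0]["raw"])
```

Keeping the raw payload next to the error message is what makes reprocessing possible once the input problem is understood.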

 

NEW QUESTION 79
You decided to use Cloud Datastore to ingest vehicle telemetry data in real time. You want to build a storage system that will account for the long-term data growth, while keeping the costs low. You also want to create snapshots of the data periodically, so that you can make a point-in-time (PIT) recovery, or clone a copy of the data for Cloud Datastore in a different environment. You want to archive these snapshots for a long time. Which two methods can accomplish this? (Choose two.)

  • A. Use managed export, and then import to Cloud Datastore in a separate project under a unique namespace reserved for that export.
  • B. Write an application that uses Cloud Datastore client libraries to read all the entities. Treat each entity as a BigQuery table row via BigQuery streaming insert. Assign an export timestamp for each export, and attach it as an extra column for each row. Make sure that the BigQuery table is partitioned using the export timestamp column.
  • C. Write an application that uses Cloud Datastore client libraries to read all the entities. Format the exported data into a JSON file. Apply compression before storing the data in Cloud Source Repositories.
  • D. Use managed export, and then import the data into a BigQuery table created just for that export, and delete temporary export files.
  • E. Use managed export, and store the data in a Cloud Storage bucket using Nearline or Coldline class.

Answer: A,E

 

NEW QUESTION 80
You are creating a new pipeline in Google Cloud to stream IoT data from Cloud Pub/Sub through Cloud Dataflow to BigQuery. While previewing the data, you notice that roughly 2% of the data appears to be corrupt. You need to modify the Cloud Dataflow pipeline to filter out this corrupt data. What should you do?

  • A. Add a GroupByKey transform in Cloud Dataflow to group all of the valid data together and discard the rest.
  • B. Add a ParDo transform in Cloud Dataflow to discard corrupt elements.
  • C. Add a Partition transform in Cloud Dataflow to separate valid data from corrupt data.
  • D. Add a SideInput that returns a Boolean if the element is corrupt.

Answer: B
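
A ParDo applies a function to each element and may emit zero, one, or many outputs, which is why it suits filtering. The plain-Python sketch below imitates that behavior with a generator. It is not Dataflow SDK code, and the validity check (a `device_id` plus a numeric `temperature`) is a hypothetical stand-in for whatever marks a reading as corrupt.

```python
def drop_corrupt(element):
    """Emit the element if it looks valid, otherwise emit nothing."""
    has_id = bool(element.get("device_id"))
    has_reading = isinstance(element.get("temperature"), (int, float))
    if has_id and has_reading:
        yield element

def apply_per_element(fn, elements):
    """Stand-in for a ParDo: run fn on each element and flatten the outputs."""
    return [out for element in elements for out in fn(element)]

readings = [
    {"device_id": "a1", "temperature": 21.5},
    {"device_id": "", "temperature": 19.0},
    {"device_id": "b2", "temperature": "??"},
]
print(apply_per_element(drop_corrupt, readings))
```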

 

NEW QUESTION 81
An organization maintains a Google BigQuery dataset that contains tables with user-level data. They want to expose aggregates of this data to other Google Cloud projects, while still controlling access to the user-level data. Additionally, they need to minimize their overall storage cost and ensure the analysis cost for other projects is assigned to those projects. What should they do?

  • A. Create dataViewer Identity and Access Management (IAM) roles on the dataset to enable sharing.
  • B. Create and share a new dataset and table that contains the aggregate results.
  • C. Create and share an authorized view that provides the aggregate results.
  • D. Create and share a new dataset and view that provides the aggregate results.

Answer: A

Explanation:
Reference: https://cloud.google.com/bigquery/docs/access-control

 

NEW QUESTION 82
You are deploying a new storage system for your mobile application, which is a media streaming service. You decide the best fit is Google Cloud Datastore. You have entities with multiple properties, some of which can take on multiple values. For example, in the entity 'Movie' the property 'actors' and the property 'tags' have multiple values but the property 'date released' does not. A typical query would ask for all movies with actor=<actorname> ordered by date_released or all movies with tag=Comedy ordered by date_released. How should you avoid a combinatorial explosion in the number of indexes?

  • A. Option C
  • B. Option A
  • C. Option B.
  • D. Option D

Answer: B

 

NEW QUESTION 83
You have an Apache Kafka cluster on-prem with topics containing web application logs. You need to replicate the data to Google Cloud for analysis in BigQuery and Cloud Storage. The preferred replication method is mirroring to avoid deployment of Kafka Connect plugins.
What should you do?

  • A. Deploy the PubSub Kafka connector to your on-prem Kafka cluster and configure PubSub as a Sink connector. Use a Dataflow job to read from PubSub and write to GCS.
  • B. Deploy a Kafka cluster on GCE VM Instances. Configure your on-prem cluster to mirror your topics to the cluster running in GCE. Use a Dataproc cluster or Dataflow job to read from Kafka and write to GCS.
  • C. Deploy a Kafka cluster on GCE VM Instances with the PubSub Kafka connector configured as a Sink connector. Use a Dataproc cluster or Dataflow job to read from Kafka and write to GCS.
  • D. Deploy the PubSub Kafka connector to your on-prem Kafka cluster and configure PubSub as a Source connector. Use a Dataflow job to read from PubSub and write to GCS.

Answer: B

 

NEW QUESTION 84
Suppose you have a dataset of images that are each labeled as to whether or not they contain a human face. To create a neural network that recognizes human faces in images using this labeled dataset, what approach would likely be the most effective?

  • A. Use K-means Clustering to detect faces in the pixels.
  • B. Use deep learning by creating a neural network with multiple hidden layers to automatically detect features of faces.
  • C. Build a neural network with an input layer of pixels, a hidden layer, and an output layer with two categories.
  • D. Use feature engineering to add features for eyes, noses, and mouths to the input data.

Answer: B

Explanation:
Traditional machine learning relies on shallow nets, composed of one input and one output layer, and at most one hidden layer in between. More than three layers (including input and output) qualifies as "deep" learning. So deep is a strictly defined, technical term that means more than one hidden layer.
In deep-learning networks, each layer of nodes trains on a distinct set of features based on the previous layer's output. The further you advance into the neural net, the more complex the features your nodes can recognize, since they aggregate and recombine features from the previous layer.
A neural network with only one hidden layer would be unable to automatically recognize high-level features of faces, such as eyes, because it wouldn't be able to "build" these features using previous hidden layers that detect low-level features, such as lines. Feature engineering is difficult to perform on raw image data.
K-means Clustering is an unsupervised learning method used to categorize unlabeled data.
Reference: https://deeplearning4j.org/neuralnet-overview
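
To make the layer counting in the explanation concrete, here is a minimal sketch of a network with two hidden layers, which is "deep" by the definition above. It is plain Python with random, untrained weights, so it only shows the shape of the computation. The layer sizes, the ReLU and sigmoid activations, and the helper names are all assumptions for illustration.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Random weights (n_out rows of n_in values) plus zero biases."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(layers, inputs):
    """Feed inputs through every layer: ReLU on hidden layers, sigmoid at the end."""
    activations = inputs
    for index, (weights, biases) in enumerate(layers):
        sums = [sum(w * a for w, a in zip(row, activations)) + b
                for row, b in zip(weights, biases)]
        is_last = index == len(layers) - 1
        activations = [1 / (1 + math.exp(-s)) if is_last else max(0.0, s) for s in sums]
    return activations

rng = random.Random(0)
sizes = [16, 8, 4, 1]  # 16 pixel inputs, two hidden layers, one face/no-face output
network = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
print(forward(network, [0.5] * 16))
```

Each hidden layer works only on the previous layer's output, which is the mechanism that lets later layers combine low-level features into higher-level ones once the network is trained.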

 

NEW QUESTION 85
You work for an economic consulting firm that helps companies identify economic trends as they happen. As part of your analysis, you use Google BigQuery to correlate customer data with the average prices of the 100 most common goods sold, including bread, gasoline, milk, and others. The average prices of these goods are updated every 30 minutes. You want to make sure this data stays up to date so you can combine it with other data in BigQuery as cheaply as possible. What should you do?

  • A. Store and update the data in a regional Google Cloud Storage bucket and create a federated data source in BigQuery
  • B. Load the data every 30 minutes into a new partitioned table in BigQuery.
  • C. Store the data in Google Cloud Datastore. Use Google Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Cloud Datastore
  • D. Store the data in a file in a regional Google Cloud Storage bucket. Use Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Google Cloud Storage.

Answer: C

 

NEW QUESTION 86
All Google Cloud Bigtable client requests go through a front-end server ______ they are sent to a Cloud Bigtable node.

  • A. after
  • B. only if
  • C. before
  • D. once

Answer: C

Explanation:
In a Cloud Bigtable architecture all client requests go through a front-end server before they are sent to a Cloud Bigtable node.
The nodes are organized into a Cloud Bigtable cluster, which belongs to a Cloud Bigtable instance, which is a container for the cluster. Each node in the cluster handles a subset of the requests to the cluster.
When additional nodes are added to a cluster, you can increase the number of simultaneous requests that the cluster can handle, as well as the maximum throughput for the entire cluster.
Reference: https://cloud.google.com/bigtable/docs/overview

 

NEW QUESTION 87
To run a TensorFlow training job on your own computer using Cloud Machine Learning Engine, what would your command start with?

  • A. gcloud ml-engine local train
  • B. gcloud ml-engine jobs submit training local
  • C. gcloud ml-engine jobs submit training
  • D. You can't run a TensorFlow program on your own computer using Cloud ML Engine.

Answer: A

Explanation:
gcloud ml-engine local train: run a Cloud ML Engine training job locally. This command runs the specified module in an environment similar to that of a live Cloud ML Engine Training Job.
This is especially useful in the case of testing distributed models, as it allows you to validate that you are properly interacting with the Cloud ML Engine cluster configuration.
Reference: https://cloud.google.com/sdk/gcloud/reference/ml-engine/local/train

 

NEW QUESTION 88
You have spent a few days loading data from comma-separated values (CSV) files into the Google BigQuery table CLICK_STREAM. The column DT stores the epoch time of click events. For convenience, you chose a simple schema where every field is treated as the STRING type. Now, you want to compute web session durations of users who visit your site, and you want to change its data type to the TIMESTAMP. You want to minimize the migration effort without making future queries computationally expensive. What should you do?

  • A. Construct a query to return every row of the table CLICK_STREAM, while using the built-in function to cast strings from the column DT into TIMESTAMP values. Run the query into a destination table NEW_CLICK_STREAM, in which the column TS is the TIMESTAMP type. Reference the table NEW_CLICK_STREAM instead of the table CLICK_STREAM from now on. In the future, new data is loaded into the table NEW_CLICK_STREAM.
  • B. Add a column TS of the TIMESTAMP type to the table CLICK_STREAM, and populate the numeric values from the column TS for each row. Reference the column TS instead of the column DT from now on.
  • C. Delete the table CLICK_STREAM, and then re-create it such that the column DT is of the TIMESTAMP type. Reload the data.
  • D. Add two columns to the table CLICK_STREAM: TS of the TIMESTAMP type and IS_NEW of the BOOLEAN type. Reload all data in append mode. For each appended row, set the value of IS_NEW to true. For future queries, reference the column TS instead of the column DT, with the WHERE clause ensuring that the value of IS_NEW must be true.
  • E. Create a view CLICK_STREAM_V, where strings from the column DT are cast into TIMESTAMP values. Reference the view CLICK_STREAM_V instead of the table CLICK_STREAM from now on.

Answer: D

 

NEW QUESTION 89
You work for an advertising company, and you've developed a Spark ML model to predict click-through rates at advertisement blocks. You've been developing everything at your on-premises data center, and now your company is migrating to Google Cloud. Your data center will be migrated to BigQuery. You periodically retrain your Spark ML models, so you need to migrate existing training pipelines to Google Cloud. What should you do?

  • A. Use Cloud ML Engine for training existing Spark ML models
  • B. Spin up a Spark cluster on Compute Engine, and train Spark ML models on the data exported from BigQuery
  • C. Rewrite your models on TensorFlow, and start using Cloud ML Engine
  • D. Use Cloud Dataproc for training existing Spark ML models, but start reading data directly from BigQuery

Answer: D

 

NEW QUESTION 90
Your financial services company is moving to cloud technology and wants to store 50 TB of financial time-series data in the cloud. This data is updated frequently and new data will be streaming in all the time. Your company also wants to move their existing Apache Hadoop jobs to the cloud to get insights into this data.
Which product should they use to store the data?

  • A. Cloud Bigtable
  • B. Google Cloud Storage
  • C. Google Cloud Datastore
  • D. Google BigQuery

Answer: A

Explanation:
https://cloud.google.com/blog/products/databases/getting-started-with-time-series-trend-predictions-using-gcp

 

NEW QUESTION 91
When running a pipeline that has a BigQuery source, on your local machine, you continue to get permission denied errors. What could be the reason for that?

  • A. You are missing gcloud on your machine
  • B. Pipelines cannot be run locally
  • C. Your gcloud does not have access to the BigQuery resources
  • D. BigQuery cannot be accessed from local machines

Answer: C

Explanation:
When reading from a Dataflow source or writing to a Dataflow sink using DirectPipelineRunner, the Cloud Platform account that you configured with the gcloud executable will need access to the corresponding source/sink.
Reference: https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/runners/DirectPipelineRunner

 

NEW QUESTION 92
You have data pipelines running on BigQuery, Cloud Dataflow, and Cloud Dataproc. You need to perform health checks and monitor their behavior, and then notify the team managing the pipelines if they fail. You also need to be able to work across multiple projects. Your preference is to use managed products or features of the platform. What should you do?

  • A. Export the logs to BigQuery, and set up App Engine to read that information and send emails if you find a failure in the logs
  • B. Develop an App Engine application to consume logs using GCP API calls, and send emails if you find a failure in the logs
  • C. Export the information to Cloud Stackdriver, and set up an Alerting policy
  • D. Run a Virtual Machine in Compute Engine with Airflow, and export the information to Stackdriver

Answer: C

Explanation:
Monitoring not only gives you access to Dataflow-related metrics, but also lets you create alerting policies and dashboards so you can chart time series of metrics and choose to be notified when these metrics reach specified values.

 

NEW QUESTION 93
......
